[PERF] backport vectorization improvements#8774
Conversation
f8fb448 to
1ea02b3
Compare
Each vectorized cell allocated a fresh `args` array via `args.map(...)`. For large vectorized formulas this produces one array allocation per output cell, hammering the GC. Allocate a single `argsBuffer` once and fill it in-place per iteration. Benchmark scenario: 1500 cells of `=SUM($C:$C="132")` against a 10k-row `C:C` column. evaluate all cells before: Mean: 2171.79 ms, StdErr: 12.84 ms, n=30 after: Mean: 1983.47 ms, StdErr: 12.82 ms, n=30 (vs prev: -9%) Task: 6222157
The inner loop re-evaluated vectorArgsType?.[k] via a switch for every arg of every vectorized cell. The arg's access pattern is fixed for the whole call, so resolve it once into a closure per arg and just invoke the closure inside the loop. Benchmark scenario: 1500 cells of `=SUM($C:$C="132")` against a 10k-row `C:C` column. evaluate all cells baseline: Mean: 2171.79 ms, StdErr: 12.84 ms, n=30 before: Mean: 1983.47 ms, StdErr: 12.82 ms, n=30 after: Mean: 1919.10 ms, StdErr: 12.66 ms, n=30 (vs prev: -3%, vs baseline: -12%) Task: 6222157
errorHandlingCompute called descr.getArgToFocus() on every arg of every vectorized cell. The arg→definition mapping is fixed for the whole call, so compute argDefinitions once in vectorizedCompute and share it with errorHandlingCompute via a closure variable. Benchmark scenario: 1500 cells of `=SUM($C:$C="132")` against a 10k-row `C:C` column. evaluate all cells baseline: Mean: 2171.79 ms, StdErr: 12.84 ms, n=30 before: Mean: 1919.10 ms, StdErr: 12.66 ms, n=30 after: Mean: 1677.95 ms, StdErr: 21.29 ms, n=30 (vs prev: -13%, vs baseline: -23%) Task: 6222157
Non-vectorized args have the same value for every cell, so they don't need to be written into argsBuffer on each iteration. Write them once during setup and only iterate over the truly vectorized slots in the hot loop. Benchmark scenario: 1500 cells of `=SUM($C:$C="132")` against a 10k-row `C:C` column. evaluate all cells baseline: Mean: 2171.79 ms, StdErr: 12.84 ms, n=30 before: Mean: 1677.95 ms, StdErr: 21.29 ms, n=30 after: Mean: 1528.36 ms, StdErr: 11.75 ms, n=30 (vs prev: -9%, vs baseline: -30%) Task: 6222157
Inlining the generateMatrix loop body directly into applyVectorization removes one callback invocation per cell and lets the engine specialize the loop. Benchmark scenario: 1500 cells of `=SUM($C:$C="132")` against a 10k-row `C:C` column. evaluate all cells baseline: Mean: 2171.79 ms, StdErr: 12.84 ms, n=30 before: Mean: 1528.36 ms, StdErr: 11.75 ms, n=30 after: Mean: 1458.47 ms, StdErr: 10.62 ms, n=30 (vs prev: -5%, vs baseline: -33%) Task: 6222157
Spreading an array is slower than a direct call with explicit arguments. argsBuffer.length is constant for the whole applyVectorization call, so dispatch once before the loops and use direct calls for 1–3 args, falling back to spread for higher arities. Benchmark scenario: 1500 cells of `=SUM($C:$C="132")` against a 10k-row `C:C` column. evaluate all cells baseline: Mean: 2171.79 ms, StdErr: 12.84 ms, n=30 before: Mean: 1458.47 ms, StdErr: 10.62 ms, n=30 after: Mean: 1378.92 ms, StdErr: 10.24 ms, n=30 (vs prev: -5%, vs baseline: -37%) Task: 6222157
1ea02b3 to
c90a043
Compare
|
robodoo rebase-ff |
|
Merge method set to rebase and fast-forward. |
hokolomopo
left a comment
There was a problem hiding this comment.
👋
My comments aren't about whether the code works or not, just readability. This part of the code is already complex, and becomes even more so. Performance is fine and all, but we should at least do the maximal effort to keep the code somewhat readable.
Additionally, these optimization are very low-level, and look very dependent of the JS engine. Sure V8 is good and all, but should our benchmark really be only "what happens on the current version of node" ? What about older/more recent V8 versions ? Firefox ? Safari ? Or even what happens on a more realistic test dataset.
| const argsBuffer: Arg[] = new Array(args.length); | ||
|
|
||
| const fillArgsBuffer = (i: number, j: number): void => { | ||
| for (let k = 0; k < args.length; k++) { |
There was a problem hiding this comment.
can you use real names for the variables? very hard to understand what this does. k is the argIndex, is i the col ? Or the colOffset ? It wasn't very clear before, but even worse now 😅
Talking about a buffer also sounds very esoteric IMO. It's just functionArgs/computeArgsAtPosition no ?
| // Shared across vectorizedCompute→errorHandlingCompute within a single (synchronous) call. | ||
| let currentArgDefinitions: ArgDefinition[] = []; |
There was a problem hiding this comment.
Seems a bit wonky to have currentArgDefinitions in this scope, and rely on the fact that we call the methods in the correct order so it is correctly defined in errorHandlingCompute. Can't you scope the array to vectorizedCompute, and pass it as argument of errorHandlingCompute ?
The variable could then also be named argDefinitions, w/o the current which I'm not sure what it refers to (current evaluation cycle ? current function call ?)
| for (let k = 0; k < nbVectorized; k++) { | ||
| argsBuffer[vectorizedIndices[k]] = argGetters[k](col, row); | ||
| } |
There was a problem hiding this comment.
nitpick: I'd inline vectorizedIndices.length instead of creating another variables which adds even more confusion to this complex code
| const argGetters: ArgGetter[] = []; | ||
| const vectorizedIndices: number[] = []; // tracks which slots need updating each iteration. |
There was a problem hiding this comment.
Can you do a single array {argGetter; argIndex}[] instead ? Maybe it'd be a bit easier to undersand
| } | ||
| for (let k = 0; k < nbVectorized; k++) { | ||
| argsBuffer[vectorizedIndices[k]] = argGetters[k](col, row); | ||
| const result: Matrix<FunctionResultObject> = new Array(countVectorizedCol); |
There was a problem hiding this comment.
If inlining a loop really has such an impact on perf, we should investigate other places too ... Maybe computeFunctionToObject for example
| } | ||
| const nbVectorized = vectorizedIndices.length; | ||
|
|
||
| // Specialize the call for common arities — argsBuffer.length is constant for the whole call, |
There was a problem hiding this comment.
Not sure if "arities" is a real word (it's not recognized by my spell checker), and if it is I'm not sure if a comment that require a google search to understand is really useful ... (or maybe it's common knowledge, and I'm just bad at computer english)
There was a problem hiding this comment.
it's a real word: https://en.wikipedia.org/wiki/Arity even though I agree it's not super common knowledge, it's exactly what it's used for here.
| type FormulaCall = () => Matrix<FunctionResultObject> | FunctionResultObject; | ||
| let callFormula: FormulaCall; | ||
| switch (argsBuffer.length) { | ||
| case 1: | ||
| callFormula = () => formula(argsBuffer[0]); | ||
| break; | ||
| case 2: | ||
| callFormula = () => formula(argsBuffer[0], argsBuffer[1]); | ||
| break; | ||
| case 3: | ||
| callFormula = () => formula(argsBuffer[0], argsBuffer[1], argsBuffer[2]); | ||
| break; | ||
| default: | ||
| callFormula = () => formula(...argsBuffer); | ||
| } |
There was a problem hiding this comment.
I mean if we stop caring that much about readability, and it's faster, why stop there at 3 ? Vlookup takes 4 arguments, Xlookup takes 6. Create a helper that returns a callback from 1 to 10 arguments, bind it, and call it a day.
|
Oh it's a backport of master commits .... well my comments still stand. @LucasLefevre |
|
@hokolomopo can be improved, but then I'd merge it as it is in 18.0, then directly refactor in 18.0 and forward-port, that would be probably easier |

Description:
description of this task, what is implemented and why it is implemented that way.
Task: TASK_ID
review checklist