No longer call wasm-opt with --opt 1 #2238

Draft
vouillon wants to merge 7 commits into master from fast-o1

vouillon commented May 11, 2026

We skip calling wasm-opt with --opt 1. Expected benefits:

  • lower compilation time
  • improved debugging experience:
    • variable names are preserved
    • events are not reordered

Instead, we perform a few optimizations to reduce the code size and avoid having too many variables per function:

  • sink local.set
  • coalesce local variables with disjoint lifetime
  • reorder locals

The wasm_of_ocaml compilation time for the test suite is reduced by about 60%. On the CI, dune build @runtest-wasm goes from 2m 19s down to 1m 40s (-28%).

Speedup on ocaml and PRT:

| binary | wall-clock time | user time |
| ------ | --------------- | --------- |
| ocamlc | -32%            | -70%      |
| PRT    | -26%            | -53%      |

vouillon added 7 commits May 11, 2026 17:30
Previously [link_and_optimize] always had [Binaryen.link] write to a
temp file, optionally had [Binaryen.dead_code_elimination] write to a
second temp file, and then had [Binaryen.optimize] (or, at --opt 1, a
file copy) produce the final output. That means at --opt 1 we were
reading the DCE'd wasm back into memory and writing it out as a copy
just to land it at the right path.

Restructure [link_and_optimize] into three helpers — [link], [dce],
[optimize] — plus a [with_step_output] combinator that yields either
the final [output_file] (when the pass is the last one in the
pipeline) or a fresh pair of temp files. The last pass in the chain
now always writes straight to [output_file] and its sourcemap to
[opt_sourcemap_file], so no copy is needed and there are no
intermediate passes through memory. The four profiles work as before:

  - dynlink + O1:      link -> output_file
  - dynlink + O2/O3:   link -> temp; optimize -> output_file
  - !dynlink + O1:     link -> temp; dce -> output_file
  - !dynlink + O2/O3:  link -> temp; dce -> temp'; optimize -> output_file
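
The four profiles above follow from one rule: only the last pass writes to [output_file]. A minimal sketch of that rule, assuming hypothetical names ([plan], [level], [dest]) that are not taken from the PR and ignoring sourcemap handling:

```ocaml
(* Illustrative only: [plan], [level] and [dest] are hypothetical names,
   not the PR's [link_and_optimize] or [with_step_output]. *)
type level = O1 | O2 | O3
type dest = Temp | Final

(* List the passes that run for a profile and where each one writes.
   Only the last pass in the chain targets the final output. *)
let plan ~dynlink ~level =
  let passes =
    [ "link" ]
    @ (if dynlink then [] else [ "dce" ])
    @ (match level with O1 -> [] | O2 | O3 -> [ "optimize" ])
  in
  let last = List.length passes - 1 in
  List.mapi (fun i pass -> (pass, if i = last then Final else Temp)) passes
```

For example, `plan ~dynlink:false ~level:O1` yields link to a temp file followed by dce to the final output, matching the third profile.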

Add a Wasm-backend linear-scan variable coalescing pass, mirroring
Js_variable_coalescing. Since --opt 1 no longer runs wasm-opt, we need
our own pass to reuse locals; at --opt 2/3 the new pass is off and
Binaryen still handles it.
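
As a rough illustration of linear-scan coalescing (not the PR's implementation), the sketch below assigns locals with disjoint live ranges to the same slot. The [interval] record and [coalesce] function are hypothetical, and all locals are assumed to share one type; a real pass must only merge locals of the same type.

```ocaml
(* Hypothetical sketch: each local is live on the closed range
   [first, last] of instruction positions. *)
type interval = { name : string; first : int; last : int }

(* Returns the number of slots used and the slot assigned to each local,
   in order of increasing [first]. *)
let coalesce intervals =
  let sorted = List.sort (fun a b -> compare a.first b.first) intervals in
  let step (slots, next, acc) iv =
    (* [slots] pairs each slot with the last position at which it is busy *)
    match List.find_opt (fun (_, busy_until) -> busy_until < iv.first) slots with
    | Some (slot, _) ->
        let others = List.filter (fun (s, _) -> s <> slot) slots in
        ((slot, iv.last) :: others, next, (iv.name, slot) :: acc)
    | None -> ((next, iv.last) :: slots, next + 1, (iv.name, next) :: acc)
  in
  let _, count, acc = List.fold_left step ([], 0, []) sorted in
  (count, List.rev acc)
```

With live ranges a = [0,3], b = [1,5], c = [4,6], d = [6,8], the sketch needs two slots: c reuses a's slot and d reuses b's.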

Add a new Wasm-backend pass [Local_sink] that rewrites

    local.set x e
    ...
    local.get x

into

    ...
    local.tee x e

when the sink is sound: we must not cross another write to [x], and
moving [e] past intermediate code must not reorder observable
effects.
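
The soundness condition can be sketched on a toy IR. Everything below is an assumption made for illustration: the [expr] and [instr] types, and the rule that only tee-free expressions are moved, are far simpler than what [Local_sink] has to handle for real Wasm code.

```ocaml
(* Toy IR, hypothetical: the only side effect is writing a local. *)
type expr =
  | Const of int
  | LocalGet of string
  | LocalTee of string * expr
  | Add of expr * expr

type instr = LocalSet of string * expr | Drop of expr

let rec reads = function
  | Const _ -> []
  | LocalGet y -> [ y ]
  | LocalTee (_, e) -> reads e
  | Add (a, b) -> reads a @ reads b

let rec writes = function
  | Const _ | LocalGet _ -> []
  | LocalTee (y, e) -> y :: writes e
  | Add (a, b) -> writes a @ writes b

(* Replace the first [LocalGet x] in evaluation order by [LocalTee (x, e)].
   Fails if there is none, or if a write to a [frozen] local comes first. *)
let rec sink_into x e frozen = function
  | LocalGet y when y = x -> Some (LocalTee (x, e))
  | Const _ | LocalGet _ -> None
  | LocalTee (y, a) ->
      Option.map (fun a' -> LocalTee (y, a')) (sink_into x e frozen a)
  | Add (a, b) -> (
      match sink_into x e frozen a with
      | Some a' -> Some (Add (a', b))
      | None ->
          if List.exists (fun y -> List.mem y frozen) (writes a) then None
          else Option.map (fun b' -> Add (a, b')) (sink_into x e frozen b))

(* Walk forward from a [LocalSet (x, e)], crossing instructions that
   neither read [x] nor write [x] or any local that [e] reads. *)
let rec try_sink x e = function
  | [] -> None
  | instr :: rest -> (
      let frozen = x :: reads e in
      let target, written =
        match instr with LocalSet (y, t) -> (t, [ y ]) | Drop t -> (t, [])
      in
      let rebuild t' =
        match instr with LocalSet (y, _) -> LocalSet (y, t') | Drop _ -> Drop t'
      in
      match sink_into x e frozen target with
      | Some t' -> Some (rebuild t' :: rest)
      | None ->
          let crossed_writes = written @ writes target in
          if List.mem x (reads target)
             || List.exists (fun y -> List.mem y frozen) crossed_writes
          then None
          else Option.map (fun rest' -> instr :: rest') (try_sink x e rest))

let rec sink_all = function
  | LocalSet (x, e) :: rest when writes e = [] -> (
      match try_sink x e rest with
      | Some rest' -> sink_all rest'
      | None -> LocalSet (x, e) :: sink_all rest)
  | i :: rest -> i :: sink_all rest
  | [] -> []
```

In this toy version a set of [x] followed by an unrelated instruction and then a read of [x] becomes a single tee at the read, while a set whose value reads [y] is left alone if [y] is overwritten before the read.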

Because [Local_sink] always introduces [local.tee] (never inlines
[e] itself), some of the tees it produces are immediately dead:
the variable is never read again. Extend [Var_coalescing] to
detect such dead tees using the liveness it already computes — a
[LocalTee x _] Def node whose successors do not list [x] in their
[live_in] is a dead store. Replace it with its inner expression,
and drop the local from the function's [locals] list when its
last reference was the tee we just erased.
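
A backward liveness walk makes the dead-tee rule concrete. The sketch below works on a hypothetical straight-line toy IR rather than the control-flow graph and [live_in] sets that [Var_coalescing] uses, and it does not prune the [locals] list.

```ocaml
module S = Set.Make (String)

(* Toy IR, hypothetical and straight-line only. *)
type expr =
  | Const of int
  | LocalGet of string
  | LocalTee of string * expr
  | Add of expr * expr

type instr = LocalSet of string * expr | Drop of expr

(* [go live e] takes the locals live after [e] and returns the rewritten
   expression together with the locals live before it. *)
let rec go live = function
  | Const n -> (Const n, live)
  | LocalGet x -> (LocalGet x, S.add x live)
  | LocalTee (x, e) ->
      let e', live_before = go (S.remove x live) e in
      (* A tee whose local is not live afterwards is a dead store:
         keep only the inner expression. *)
      ((if S.mem x live then LocalTee (x, e') else e'), live_before)
  | Add (a, b) ->
      (* [b] is evaluated after [a], so it is visited first going backward. *)
      let b', live = go live b in
      let a', live = go live a in
      (Add (a', b'), live)

let remove_dead_tees body =
  let step instr (acc, live) =
    match instr with
    | Drop e ->
        let e', live = go live e in
        (Drop e' :: acc, live)
    | LocalSet (x, e) ->
        let e', live = go (S.remove x live) e in
        (LocalSet (x, e') :: acc, live)
  in
  fst (List.fold_right step body ([], S.empty))
```

Here a tee of [x] that is never read again collapses to its operand, while a tee of [y] that is read by a later instruction is kept.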

The forward walker in [Local_sink] is O(N) per candidate, and there
can be O(N) candidates per function, making the pass O(N²) on
adversarial inputs. Each crossed sub-expression and instruction also
recomputes [effect_free], [trivially_pure], and [reads_of_expr] from
scratch, so the constant factor is not small.

Add a per-candidate walk budget ([max_walk_distance], 64) and a
[tick] helper that is called at every forward step in evaluation
order — instruction-to-instruction in [try_sink_in_list], and
sibling-to-sibling in [wrap_binary], [wrap_list_intermediate], and
the [StructSet]/[ArraySet] arms of [try_sink_in_instr]. Descending
into a unary sub-expression is free. When the budget is exhausted
the walker bails just as it does for any other cross failure.

The bound caps worst-case per-function cost at O(N · K) with K=64.
Profitable sinks are almost all within a handful of steps, so the
clip is on the long tail rather than the common case.
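
The budget mechanism can be sketched in isolation. In the snippet below only the name [max_walk_distance] and its value 64 come from the commit message; [find_within_budget], the exception, and the list-based walk are hypothetical stand-ins for the real walker.

```ocaml
exception Out_of_budget

let max_walk_distance = 64

(* Walk forward over [items] looking for the first element satisfying [p].
   Every step from one item to the next costs one tick; once the budget
   is spent the walk gives up, exactly like any other failed search. *)
let find_within_budget ?(budget = max_walk_distance) p items =
  let remaining = ref budget in
  let tick () =
    if !remaining = 0 then raise Out_of_budget;
    decr remaining
  in
  let rec walk i = function
    | [] -> None
    | x :: rest ->
        if p x then Some i
        else (
          tick ();
          walk (i + 1) rest)
  in
  try walk 0 items with Out_of_budget -> None
```

With the default budget, a target 64 steps away is still found and one 65 steps away is not, so each candidate costs at most K = 64 forward steps regardless of function size.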

Wasm encodes local indices as LEB128 u32 (1 byte for indices below
128, 2 bytes from 128 up to 16383). With [wasm-opt] skipped at --opt 1,
hot locals routinely cross the 128 boundary and pay an extra byte per
access.

Add a [Reorder_locals] pass that sorts non-parameter locals into a
numeric block followed by a reference block, with same-type runs
ordered by descending total use count and individual locals inside
a run by descending per-local count. Indices are derived from list
position, so reordering [locals] alone suffices.

Runs at the end of [post_process_function_body], after
[Initialize_locals.f]; gated on [O1] and the new [wasm-reorder-locals]
flag.
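
To make the size argument and the ordering concrete, here is a small sketch. The [ty] type, the (name, type, use count) triples, and both function names are assumptions for illustration; the actual [Reorder_locals] pass works on the compiler's own representation.

```ocaml
(* Number of bytes needed to encode [n] as an unsigned LEB128 integer:
   7 payload bits per byte. *)
let leb128_u32_size n =
  let rec go n acc = if n < 128 then acc else go (n lsr 7) (acc + 1) in
  go n 1

(* Hypothetical local description: name, type, number of uses. *)
type ty = I32 | I64 | F64 | Ref of string

let is_numeric = function I32 | I64 | F64 -> true | Ref _ -> false

let reorder (locals : (string * ty * int) list) =
  let types = List.sort_uniq compare (List.map (fun (_, t, _) -> t) locals) in
  (* One run per type, hottest local first inside each run. *)
  let by_count (_, _, a) (_, _, b) = compare b a in
  let runs =
    List.map
      (fun t ->
        List.stable_sort by_count (List.filter (fun (_, t', _) -> t' = t) locals))
      types
  in
  (* Runs with the highest total use count come first. *)
  let total run = List.fold_left (fun s (_, _, n) -> s + n) 0 run in
  let runs = List.stable_sort (fun r1 r2 -> compare (total r2) (total r1)) runs in
  (* Numeric block first, then reference block. *)
  let numeric, refs =
    List.partition
      (function (_, t, _) :: _ -> is_numeric t | [] -> true)
      runs
  in
  List.concat (numeric @ refs)
```

Keeping locals of one type contiguous also suits the binary format, which declares locals as (count, type) runs.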

When two adjacent generated columns receive identical mappings, the
encoder emits only the first. If that first segment is itself skipped
because it matches the previous mapping (or is at column 0 after a
newline), the old [if i > 0] guard still prepended a comma even though
no segment had been written on the current line. Track whether any
segment was emitted on the line ([prev >= 0]) and use that instead,
so the generated .map stays parseable by wasm-merge.
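
The difference between the two guards can be shown on a simplified model. This is a reconstruction under assumptions, not the encoder's code: each segment is modelled as an option ([None] meaning it was skipped as a duplicate), and both function names are hypothetical.

```ocaml
(* Old behaviour (reconstructed): the comma depends on the position in
   the list, so a skipped first segment leaves a leading comma. *)
let encode_line_old segments =
  let buf = Buffer.create 16 in
  List.iteri
    (fun i seg ->
      match seg with
      | None -> ()
      | Some s ->
          if i > 0 then Buffer.add_char buf ',';
          Buffer.add_string buf s)
    segments;
  Buffer.contents buf

(* Fixed behaviour: the comma depends on whether a segment has already
   been written on this line. *)
let encode_line segments =
  let buf = Buffer.create 16 in
  let emitted = ref false in
  List.iter
    (function
      | None -> ()
      | Some s ->
          if !emitted then Buffer.add_char buf ',';
          Buffer.add_string buf s;
          emitted := true)
    segments;
  Buffer.contents buf
```

On a line whose first segment is skipped, the position-based guard produces a leading comma while the emission-based guard does not.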

vouillon added the wasm label May 11, 2026