eisop · wmdietl · Jun 14, 2026 · Jun 13, 2026 · Jun 13, 2026 · Jun 13, 2026
diff --git a/.claude/skills/cf-performance/SKILL.md b/.claude/skills/cf-performance/SKILL.md
@@ -195,6 +195,30 @@ could not resolve. (`-J` forwards the flag to the forked compiler JVM — `check
 is `CheckerMain`, which forks. Use a single source set so there is one compile thread.)
 It measures exactly the checker + javac allocation, independent of GC/JIT scheduling.
 
+`ab-measure.sh` automates this for one build side — it runs each given source file REPS
+times and prints the median allocation **and** median wall (in separate runs, since JFR
+perturbs timing):
+
+```
+git checkout <baseline>  && ./gradlew assembleForJavac && \
+    .claude/skills/cf-performance/ab-measure.sh -l master Big300.java Varargs.java
+git checkout <treatment> && ./gradlew assembleForJavac && \
+    .claude/skills/cf-performance/ab-measure.sh -l branch Big300.java Varargs.java
+```
+
+Two cautions about this deterministic-allocation measurement:
+
+- **A `settings=profile` trace does NOT carry TLAB / `ObjectAllocationSample` events**, so
+  `jfr-analyze.java top`/`alloc` on it report `TLAB events: 0` and attribute nothing. Use
+  these single-`javac` traces only for *magnitude* (`alloc-total.java` reads
+  `ThreadAllocationStatistics`) and self-time; for allocation-*by-class attribution* capture
+  with `record-jfr.sh`'s JFC instead.
+- **It is blind to pure-CPU / traversal wins.** A change that does fewer scans / less work but
+  allocates the same shows up flat here (the shallow-location defaulting shortcut cut scan
+  calls 10% with *flat* allocation). Classify your change first: for an allocation change use
+  this; for a CPU/traversal change measure wall and/or on-CPU `ExecutionSample` counts, or
+  instrument an operation counter (below).
+
 ## Detecting super-linear (quadratic) costs with a size sweep
 
 The all-systems corpus is 267 **tiny** files (1–3 method bodies each): a good
@@ -221,8 +245,10 @@ done
 A curve that is super-linear on master and flattens after the change is the quadratic's
 signature. **Report both ends:** small-N is the realistic per-file cost (typically
 low-single-digit), large-N is worst-case protection (machine-generated / giant
-single-class files). Tune the generator's method body for other mechanisms (deep
-expression nesting, many fields, large `switch`).
+single-class files). The generator's `--shape` flag selects which machinery the body
+stresses (`generic` default, `vararg`, `deep-nesting`, `many-fields`) — e.g. `--shape
+vararg` stresses `AnnotatedExecutableType` copying and is what exposed PR #1798's copier
+vararg-aliasing bug; add a shape there for other mechanisms (large `switch`, etc.).
 
 ## Measuring wall-clock effects (the A/B that decides if a change is worth it)
 
@@ -280,6 +306,80 @@ but cost ≈10% wall clock (far more than its hit-rate delta implied). Cut
   1500-method file — synergy, not additivity. A "not worth it" verdict is not permanent
   across an evolving hot path; the *Tried and rejected* section is a starting point, not a
   closed book.
+- **Re-trace the current baseline before committing to a multi-PR plan — a logged hotspot may
+  already be gone.** The same traffic-changes-value effect runs in reverse: an optimization
+  named as "the recommended next direction" in `performance-notes.md` can be overtaken by
+  intervening commits. The PR #1798 immutability program was sized off a logged
+  "`AnnotatedTypeCopier` ≈ 2% self-time / ~22% of `Object[]` allocation" figure; a fresh
+  full-`checknullness` trace *at the start of the work* would have shown it was already
+  ~0.76% / ~1.5% — earlier caches (PR #1777), the thread-local copier map, and lazy
+  `visitedNodes` had harvested it — so the boundary flips had little left to remove and came in
+  at ~1%, not the projected large win. Profile **today's** code before planning, not the
+  number in the log. Corollary: **never repeat a prior-session A/B number — re-measure against
+  the current baseline.** A/B deltas do not compose; the −5.3% from one session did not
+  reproduce once an earlier flip had already shipped (the baseline moved under it).
+- **Measure the *addressable fraction* before building anything that helps only a subset.**
+  If a change only helps when some condition holds (the type is frozen, the consumer is
+  read-only, the default is a top-level location), first count how often that condition is
+  actually true on a real workload — the cheapest possible experiment. Several PR #1798
+  dead-ends died in minutes this way: caching `hashCode` on frozen types (**0%** of
+  `hashCode` calls hit frozen types — all hot hash targets are mutable copies); the
+  shallow-location defaulting shortcut (only **10%** of scans, over cheap types). The recipe:
+  a `static AtomicLong` counter + a JVM shutdown hook that prints to stderr, and a
+  `-Dflag`-gated toggle so you can A/B both variants in *one* build (no rebuild). This is the
+  no-behavior-change cousin of "instrument and simulate" above.
+- **Count ≠ cost (companion to "hit-rate ≠ win").** Reducing the *number* of operations is
+  not a win if the skipped operations are cheap. The shallow defaulting shortcut cut scan
+  *calls* 10% but allocation was flat, because the skipped scans were over cheap `Object`
+  types; the expensive scans (deep `OTHERWISE`/bound traversals over generic types) were
+  untouched. Measure the cost of the work you remove, not its count.
+- **A flat A/B can mean the workload never reached your code — confirm the path is hit.**
+  Before trusting a null result, check the change is actually exercised on that workload. Two
+  PR #1798 A/Bs read flat for exactly this reason: a non-generic-call program showed nothing
+  for the `elementTypeCache` flip because non-generic calls hit `methodAsMemberOfCache`, which
+  short-circuits *before* `elementTypeCache` is consulted; a generic-varargs program was
+  dominated by type-argument inference (which copies regardless), drowning the flip. The cheap
+  guard is a counter on the code you changed — if it never fires, "no effect" is a workload
+  bug, not a verdict on the change. Pair this with picking a workload that *maximizes* traffic
+  through the target (the size-sweep / shape generators exist for this).
+
+## Removing a defensive copy (the opposite of adding a cache)
+
+Most of this skill is about *adding* a cache to save recomputation. *Removing* a defensive
+copy (e.g. making a cache return a shared immutable value instead of a fresh `deepCopy()`)
+is a distinct kind of work with its own failure mode, learned the hard way in PR #1798.
+
+- **A boundary flip pays only if the cache's *hot* consumer is read-only — and you cannot
+  tell that from the freeze flush.** A cache that hands out `deepCopy()` on every hit does so
+  because *some* consumer mutates the result. The question is not *whether* a consumer mutates
+  but whether the **dominant** one does. If the hot consumer rewrites the whole result — the
+  tree pipeline (defaulting + flow refinement + annotators), type-argument inference, or
+  `asSuper` — the flip just **moves** the copy to that consumer and is a wash (PR #1798:
+  `elementCache`-into-the-tree-pipeline, the `methodFromUse`/inference copy-elision, and the
+  `directSupertypesCache`/`AsSuperVisitor` flip all relocated the copy). But if the hot
+  consumers are mostly read-only, the flip removes a real copy: the `elementTypeCache` and
+  `classAndMethodTreeCache` flips shipped this way — though only **~1%**, because by then the
+  copier was already cheap (see "re-trace before planning"). **The trap:** the freeze flush
+  (below) enumerates only the *mutating* consumers — it is structurally blind to the read-only
+  majority, so reasoning "look at all these mutators, the flip won't pay" is survivorship bias.
+  Reasoning from the flush is how I first wrongly called these flips a wash. **Measure the
+  read-only fraction directly** (a counter at the cache-return: how often is the result mutated
+  before the next cache call?), don't infer it from who shows up in the flush.
+- **Freeze-as-bug-finder.** To enumerate which consumers mutate a value you want to share,
+  make the value reject mutation (a `frozen` flag that throws on the mutators) and run the
+  suite — each crash is a mutating call site, with a stack. Far faster than auditing call
+  sites by hand.
+- **When freezing / an assertion flushes a *cluster* of failures, suspect ONE shared
+  copier/construction bug before concluding the aliasing is pervasive.** PR #1798's ~13-case
+  flush looked like the construction pipeline aliased cached substructure everywhere; it was
+  a single `AnnotatedTypeCopier` bug (it aliased the vararg type instead of copying it, so
+  `deepCopy` was not actually deep). Fixing that one bug made all the freezing green.
+- **A benign latent aliasing bug may have no wrong-result symptom.** That copier bug never
+  produced a wrong diagnostic on master (its only post-copy mutator, defaulting, is
+  idempotent), so no checker-test program demonstrates it without the freeze enforcement; the
+  regression test (`checker/tests/nullness/VarargCacheAliasing.java`) only fails under
+  freezing. A pure correctness fix like this needs a unit test asserting `deepCopy`
+  independence, or it ships together with the enforcement that makes it observable.
 
 ## What to look for
 

diff --git a/.claude/skills/cf-performance/ab-measure.sh b/.claude/skills/cf-performance/ab-measure.sh
@@ -0,0 +1,75 @@
+#!/usr/bin/env bash
+# A/B measurement driver for one build side: runs `checker/bin/javac -processor nullness`
+# over each given source file REPS times and reports the MEDIAN total allocation
+# (deterministic jdk.ThreadAllocationStatistics, ~0.15% reproducible) and the MEDIAN wall
+# clock. Allocation and wall are measured in SEPARATE runs, because JFR recording perturbs
+# timing (see SKILL.md "Measuring wall-clock effects").
+#
+# This measures whatever is currently built into checker/dist. To A/B two sides:
+#   git checkout <baseline>  && ./gradlew assembleForJavac && ab-measure.sh -l master Big300.java Varargs.java
+#   git checkout <treatment> && ./gradlew assembleForJavac && ab-measure.sh -l branch Big300.java Varargs.java
+# then diff the two tables. (Rebuild the dist on each side: the forked javac uses it.)
+#
+# Note: deterministic allocation is BLIND to pure-CPU/traversal wins (a change that does
+# fewer scans but allocates the same shows flat here) -- for those, compare wall and/or
+# on-CPU samples, or instrument an operation counter. See SKILL.md.
+#
+# Usage: ab-measure.sh [-n REPS] [-l LABEL] <File.java> [File2.java ...]
+
+set -euo pipefail
+
+REPS=3
+LABEL="side"
+while getopts "n:l:" opt; do
+  case "$opt" in
+    n) REPS="$OPTARG" ;;
+    l) LABEL="$OPTARG" ;;
+    *)
+      echo "usage: ab-measure.sh [-n REPS] [-l LABEL] <File.java>..." >&2
+      exit 2
+      ;;
+  esac
+done
+shift $((OPTIND - 1))
+[ "$#" -ge 1 ] || {
+  echo "usage: ab-measure.sh [-n REPS] [-l LABEL] <File.java>..." >&2
+  exit 2
+}
+
+SKILL_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+REPO="$(cd "$SKILL_DIR/../../.." && pwd)"
+JAVAC="$REPO/checker/bin/javac"
+ALLOC="$SKILL_DIR/alloc-total.java"
+TMP="$(mktemp -d)"
+trap 'rm -rf "$TMP"' EXIT
+
+[ -x "$JAVAC" ] || {
+  echo "no checker/bin/javac at $JAVAC -- run ./gradlew assembleForJavac" >&2
+  exit 1
+}
+
+median() { python3 -c "import sys; xs=sorted(float(x) for x in sys.argv[1:]); print(f'{xs[len(xs)//2]:.2f}')" "$@"; }
+
+printf '### SIDE=%s  reps=%d ###\n' "$LABEL" "$REPS"
+printf '%-28s %14s %10s\n' "program" "alloc(MB,med)" "wall(s,med)"
+for prog in "$@"; do
+  allocs=()
+  for _ in $(seq "$REPS"); do
+    "$JAVAC" -J-XX:StartFlightRecording=settings=profile,filename="$TMP/r.jfr",dumponexit=true \
+      -processor nullness -d "$TMP/out" "$prog" > /dev/null 2>&1 || true
+    mb="$(java "$ALLOC" "$TMP/r.jfr" 2> /dev/null \
+      | grep -oE '= [0-9.]+ MB' | grep -oE '[0-9.]+' || echo 0)"
+    allocs+=("$mb")
+  done
+  walls=()
+  for _ in $(seq "$REPS"); do
+    t="$(python3 -c "
+import subprocess,time
+t=time.time()
+subprocess.run(['$JAVAC','-processor','nullness','-d','$TMP/out','$prog'],
+               stdout=subprocess.DEVNULL,stderr=subprocess.DEVNULL)
+print(f'{time.time()-t:.2f}')")"
+    walls+=("$t")
+  done
+  printf '%-28s %14s %10s\n' "$(basename "$prog")" "$(median "${allocs[@]}")" "$(median "${walls[@]}")"
+done
diff --git a/.claude/skills/cf-performance/gen-sized-program.py b/.claude/skills/cf-performance/gen-sized-program.py
@@ -7,27 +7,38 @@
 each). Sweeping N and plotting allocation/wall vs N exposes the quadratic and separates
 the realistic case (small N) from the worst case (large N).
 
+Different costs need different method bodies, so the body shape is selectable with
+`--shape` (the body is what decides which machinery you stress):
+
+  generic       (default) a generic call + branch + return: CFG construction, flow
+                analysis, getPath, and type-argument inference.
+  vararg        calls to JDK vararg methods (Arrays.asList, String.format,
+                Class.getMethod): exercises AnnotatedExecutableType deep-copying and the
+                vararg-type path. Use this to stress the methodAsMemberOf cache and the
+                AnnotatedTypeCopier vararg handling (it is what exposed PR #1798's copier
+                aliasing bug). A *user* vararg method does not reproduce it; JDK ones do,
+                because their executable types become cached masters.
+  deep-nesting  one deeply nested expression per method: stresses tree-path / scan depth.
+  many-fields   N fields instead of N methods: stresses field defaulting and fromElement.
+
 Usage:
-    gen-sized-program.py 1500 > Big1500.java          # one size
-    for n in 100 300 600 1500; do                     # a sweep
+    gen-sized-program.py 1500 > Big1500.java                 # one size, default shape
+    gen-sized-program.py --shape vararg 400 > Varargs.java   # a different shape
+    for n in 100 300 600 1500; do                            # a sweep
         gen-sized-program.py $n > Big$n.java
     done
 
-Then A/B each side with the deterministic allocation reader:
+Then A/B each side with the deterministic allocation reader (or ab-measure.sh):
     checker/bin/javac -J-XX:StartFlightRecording=settings=profile,filename=r.jfr,dumponexit=true \\
         -processor nullness -d /tmp/out Big$n.java
     java .claude/skills/cf-performance/alloc-total.java r.jfr
-
-Each method has a non-trivial body (locals, a generic call, a branch, a return) so it
-triggers CFG construction, flow analysis, and getPath/type-argument inference -- the
-machinery that exercises the tree-path and dataflow hot paths. Tune `body` for other
-mechanisms (deep expression nesting, many fields, large switch, etc.).
 """
 
+import argparse
 import sys
 
 
-def generate(n: int) -> str:
+def generic(n: int) -> str:
     lines = ["class Big {", "    static <T> T id(T x) { return x; }"]
     for i in range(n):
         lines += [
@@ -42,7 +53,73 @@ def generate(n: int) -> str:
     return "\n".join(lines) + "\n"
 
 
+def vararg(n: int) -> str:
+    # Calls to JDK vararg methods: their executable types are cached and re-defaulted,
+    # and deep-copied per use, so this stresses AnnotatedExecutableType copying and the
+    # vararg-type subtree.
+    lines = [
+        "import java.util.Arrays;",
+        "import java.util.List;",
+        "class Big {",
+    ]
+    for i in range(n):
+        lines += [
+            f"    void m{i}(String s{i}) throws Exception {{",
+            f"        List<String> xs{i} = Arrays.asList(s{i}, s{i}, s{i});",
+            f"        List<String> ys{i} = Arrays.asList(s{i});",
+            f'        String f{i} = String.format("%s %s", s{i}, s{i});',
+            f"        java.lang.reflect.Method mm{i} ="
+            f' Big.class.getMethod("m{i}", String.class);',
+            "    }",
+        ]
+    lines.append("}")
+    return "\n".join(lines) + "\n"
+
+
+def deep_nesting(n: int) -> str:
+    lines = ["class Big {", "    static <T> T id(T x) { return x; }"]
+    for i in range(n):
+        # A single deeply nested generic-call expression.
+        expr = f"a{i}"
+        for _ in range(20):
+            expr = f"id({expr})"
+        lines += [
+            f"    Object m{i}(Object a{i}) {{",
+            f"        return {expr};",
+            "    }",
+        ]
+    lines.append("}")
+    return "\n".join(lines) + "\n"
+
+
+def many_fields(n: int) -> str:
+    lines = ["import java.util.List;", "class Big {"]
+    for i in range(n):
+        lines.append(f"    List<String> f{i};")
+    lines.append("}")
+    return "\n".join(lines) + "\n"
+
+
+SHAPES = {
+    "generic": generic,
+    "vararg": vararg,
+    "deep-nesting": deep_nesting,
+    "many-fields": many_fields,
+}
+
+
 if __name__ == "__main__":
-    if len(sys.argv) != 2:
-        sys.exit("usage: gen-sized-program.py <method-count>")
-    sys.stdout.write(generate(int(sys.argv[1])))
+    parser = argparse.ArgumentParser(
+        description="Generate a sized Java program for perf sweeps."
+    )
+    parser.add_argument(
+        "n", type=int, help="number of methods (or fields, for many-fields)"
+    )
+    parser.add_argument(
+        "--shape",
+        choices=sorted(SHAPES),
+        default="generic",
+        help="method-body shape to stress",
+    )
+    args = parser.parse_args()
+    sys.stdout.write(SHAPES[args.shape](args.n))