Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
104 changes: 102 additions & 2 deletions .claude/skills/cf-performance/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -195,6 +195,30 @@ could not resolve. (`-J` forwards the flag to the forked compiler JVM — `check
is `CheckerMain`, which forks. Use a single source set so there is one compile thread.)
It measures exactly the checker + javac allocation, independent of GC/JIT scheduling.

`ab-measure.sh` automates this for one build side — it runs each given source file REPS
times and prints the median allocation **and** median wall (in separate runs, since JFR
perturbs timing):

```
git checkout <baseline> && ./gradlew assembleForJavac && \
.claude/skills/cf-performance/ab-measure.sh -l master Big300.java Varargs.java
git checkout <treatment> && ./gradlew assembleForJavac && \
.claude/skills/cf-performance/ab-measure.sh -l branch Big300.java Varargs.java
```

Two cautions about this deterministic-allocation measurement:

- **A `settings=profile` trace does NOT carry TLAB / `ObjectAllocationSample` events**, so
`jfr-analyze.java top`/`alloc` on it report `TLAB events: 0` and attribute nothing. Use
these single-`javac` traces only for *magnitude* (`alloc-total.java` reads
`ThreadAllocationStatistics`) and self-time; for allocation-*by-class attribution* capture
with `record-jfr.sh`'s JFC instead.
- **It is blind to pure-CPU / traversal wins.** A change that does fewer scans / less work but
allocates the same shows up flat here (the shallow-location defaulting shortcut cut scan
calls 10% with *flat* allocation). Classify your change first: for an allocation change use
this; for a CPU/traversal change measure wall and/or on-CPU `ExecutionSample` counts, or
instrument an operation counter (below).

## Detecting super-linear (quadratic) costs with a size sweep

The all-systems corpus is 267 **tiny** files (1–3 method bodies each): a good
Expand All @@ -221,8 +245,10 @@ done
A curve that is super-linear on master and flattens after the change is the quadratic's
signature. **Report both ends:** small-N is the realistic per-file cost (typically
low-single-digit), large-N is worst-case protection (machine-generated / giant
single-class files). Tune the generator's method body for other mechanisms (deep
expression nesting, many fields, large `switch`).
single-class files). The generator's `--shape` flag selects which machinery the body
stresses (`generic` default, `vararg`, `deep-nesting`, `many-fields`) — e.g. `--shape
vararg` stresses `AnnotatedExecutableType` copying and is what exposed PR #1798's copier
vararg-aliasing bug; add a shape there for other mechanisms (large `switch`, etc.).

## Measuring wall-clock effects (the A/B that decides if a change is worth it)

Expand Down Expand Up @@ -280,6 +306,80 @@ but cost ≈10% wall clock (far more than its hit-rate delta implied). Cut
1500-method file — synergy, not additivity. A "not worth it" verdict is not permanent
across an evolving hot path; the *Tried and rejected* section is a starting point, not a
closed book.
- **Re-trace the current baseline before committing to a multi-PR plan — a logged hotspot may
already be gone.** The same traffic-changes-value effect runs in reverse: an optimization
named as "the recommended next direction" in `performance-notes.md` can be overtaken by
intervening commits. The PR #1798 immutability program was sized off a logged
"`AnnotatedTypeCopier` ≈ 2% self-time / ~22% of `Object[]` allocation" figure; a fresh
full-`checknullness` trace *at the start of the work* would have shown it was already
~0.76% / ~1.5% — earlier caches (PR #1777), the thread-local copier map, and lazy
`visitedNodes` had harvested it — so the boundary flips had little left to remove and came in
at ~1%, not the projected large win. Profile **today's** code before planning, not the
number in the log. Corollary: **never repeat a prior-session A/B number — re-measure against
the current baseline.** A/B deltas do not compose; the −5.3% from one session did not
reproduce once an earlier flip had already shipped (the baseline moved under it).
- **Measure the *addressable fraction* before building anything that helps only a subset.**
If a change only helps when some condition holds (the type is frozen, the consumer is
read-only, the default is a top-level location), first count how often that condition is
actually true on a real workload — the cheapest possible experiment. Several PR #1798
dead-ends died in minutes this way: caching `hashCode` on frozen types (**0%** of
`hashCode` calls hit frozen types — all hot hash targets are mutable copies); the
shallow-location defaulting shortcut (only **10%** of scans, over cheap types). The recipe:
a `static AtomicLong` counter + a JVM shutdown hook that prints to stderr, and a
`-Dflag`-gated toggle so you can A/B both variants in *one* build (no rebuild). This is the
no-behavior-change cousin of "instrument and simulate" above.
- **Count ≠ cost (companion to "hit-rate ≠ win").** Reducing the *number* of operations is
not a win if the skipped operations are cheap. The shallow defaulting shortcut cut scan
*calls* 10% but allocation was flat, because the skipped scans were over cheap `Object`
types; the expensive scans (deep `OTHERWISE`/bound traversals over generic types) were
untouched. Measure the cost of the work you remove, not its count.
- **A flat A/B can mean the workload never reached your code — confirm the path is hit.**
Before trusting a null result, check the change is actually exercised on that workload. Two
PR #1798 A/Bs read flat for exactly this reason: a non-generic-call program showed nothing
for the `elementTypeCache` flip because non-generic calls hit `methodAsMemberOfCache`, which
short-circuits *before* `elementTypeCache` is consulted; a generic-varargs program was
dominated by type-argument inference (which copies regardless), drowning the flip. The cheap
guard is a counter on the code you changed — if it never fires, "no effect" is a workload
bug, not a verdict on the change. Pair this with picking a workload that *maximizes* traffic
through the target (the size-sweep / shape generators exist for this).

## Removing a defensive copy (the opposite of adding a cache)

Most of this skill is about *adding* a cache to save recomputation. *Removing* a defensive
copy (e.g. making a cache return a shared immutable value instead of a fresh `deepCopy()`)
is a distinct kind of work with its own failure mode, learned the hard way in PR #1798.

- **A boundary flip pays only if the cache's *hot* consumer is read-only — and you cannot
tell that from the freeze flush.** A cache that hands out `deepCopy()` on every hit does so
because *some* consumer mutates the result. The question is not *whether* a consumer mutates
but whether the **dominant** one does. If the hot consumer rewrites the whole result — the
tree pipeline (defaulting + flow refinement + annotators), type-argument inference, or
`asSuper` — the flip just **moves** the copy to that consumer and is a wash (PR #1798:
`elementCache`-into-the-tree-pipeline, the `methodFromUse`/inference copy-elision, and the
`directSupertypesCache`/`AsSuperVisitor` flip all relocated the copy). But if the hot
consumers are mostly read-only, the flip removes a real copy: the `elementTypeCache` and
`classAndMethodTreeCache` flips shipped this way — though only **~1%**, because by then the
copier was already cheap (see "re-trace before planning"). **The trap:** the freeze flush
(below) enumerates only the *mutating* consumers — it is structurally blind to the read-only
majority, so reasoning "look at all these mutators, the flip won't pay" is survivorship bias.
Reasoning from the flush is how I first wrongly called these flips a wash. **Measure the
read-only fraction directly** (a counter at the cache-return: how often is the result mutated
before the next cache call?), don't infer it from who shows up in the flush.
- **Freeze-as-bug-finder.** To enumerate which consumers mutate a value you want to share,
make the value reject mutation (a `frozen` flag that throws on the mutators) and run the
suite — each crash is a mutating call site, with a stack. Far faster than auditing call
sites by hand.
- **When freezing / an assertion flushes a *cluster* of failures, suspect ONE shared
copier/construction bug before concluding the aliasing is pervasive.** PR #1798's ~13-case
flush looked like the construction pipeline aliased cached substructure everywhere; it was
a single `AnnotatedTypeCopier` bug (it aliased the vararg type instead of copying it, so
`deepCopy` was not actually deep). Fixing that one bug made all the freezing green.
- **A benign latent aliasing bug may have no wrong-result symptom.** That copier bug never
produced a wrong diagnostic on master (its only post-copy mutator, defaulting, is
idempotent), so no checker-test program demonstrates it without the freeze enforcement; the
regression test (`checker/tests/nullness/VarargCacheAliasing.java`) only fails under
freezing. A pure correctness fix like this needs a unit test asserting `deepCopy`
independence, or it ships together with the enforcement that makes it observable.

## What to look for

Expand Down
75 changes: 75 additions & 0 deletions .claude/skills/cf-performance/ab-measure.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,75 @@
#!/usr/bin/env bash
# A/B measurement driver for one build side: runs `checker/bin/javac -processor nullness`
# over each given source file REPS times and reports the MEDIAN total allocation
# (deterministic jdk.ThreadAllocationStatistics, ~0.15% reproducible) and the MEDIAN wall
# clock. Allocation and wall are measured in SEPARATE runs, because JFR recording perturbs
# timing (see SKILL.md "Measuring wall-clock effects").
#
# This measures whatever is currently built into checker/dist. To A/B two sides:
# git checkout <baseline> && ./gradlew assembleForJavac && ab-measure.sh -l master Big300.java Varargs.java
# git checkout <treatment> && ./gradlew assembleForJavac && ab-measure.sh -l branch Big300.java Varargs.java
# then diff the two tables. (Rebuild the dist on each side: the forked javac uses it.)
#
# Note: deterministic allocation is BLIND to pure-CPU/traversal wins (a change that does
# fewer scans but allocates the same shows flat here) -- for those, compare wall and/or
# on-CPU samples, or instrument an operation counter. See SKILL.md.
#
# Usage: ab-measure.sh [-n REPS] [-l LABEL] <File.java> [File2.java ...]

set -euo pipefail

REPS=3
LABEL="side"
while getopts "n:l:" opt; do
case "$opt" in
n) REPS="$OPTARG" ;;
l) LABEL="$OPTARG" ;;
*)
echo "usage: ab-measure.sh [-n REPS] [-l LABEL] <File.java>..." >&2
exit 2
;;
esac
done
shift $((OPTIND - 1))
[ "$#" -ge 1 ] || {
echo "usage: ab-measure.sh [-n REPS] [-l LABEL] <File.java>..." >&2
exit 2
}

SKILL_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO="$(cd "$SKILL_DIR/../../.." && pwd)"
JAVAC="$REPO/checker/bin/javac"
ALLOC="$SKILL_DIR/alloc-total.java"
TMP="$(mktemp -d)"
trap 'rm -rf "$TMP"' EXIT

[ -x "$JAVAC" ] || {
echo "no checker/bin/javac at $JAVAC -- run ./gradlew assembleForJavac" >&2
exit 1
}

median() { python3 -c "import sys; xs=sorted(float(x) for x in sys.argv[1:]); print(f'{xs[len(xs)//2]:.2f}')" "$@"; }

printf '### SIDE=%s reps=%d ###\n' "$LABEL" "$REPS"
printf '%-28s %14s %10s\n' "program" "alloc(MB,med)" "wall(s,med)"
for prog in "$@"; do
allocs=()
for _ in $(seq "$REPS"); do
"$JAVAC" -J-XX:StartFlightRecording=settings=profile,filename="$TMP/r.jfr",dumponexit=true \
-processor nullness -d "$TMP/out" "$prog" > /dev/null 2>&1 || true
mb="$(java "$ALLOC" "$TMP/r.jfr" 2> /dev/null \
| grep -oE '= [0-9.]+ MB' | grep -oE '[0-9.]+' || echo 0)"
allocs+=("$mb")
done
walls=()
for _ in $(seq "$REPS"); do
t="$(python3 -c "
import subprocess,time
t=time.time()
subprocess.run(['$JAVAC','-processor','nullness','-d','$TMP/out','$prog'],
stdout=subprocess.DEVNULL,stderr=subprocess.DEVNULL)
print(f'{time.time()-t:.2f}')")"
walls+=("$t")
done
printf '%-28s %14s %10s\n' "$(basename "$prog")" "$(median "${allocs[@]}")" "$(median "${walls[@]}")"
done
101 changes: 89 additions & 12 deletions .claude/skills/cf-performance/gen-sized-program.py
Original file line number Diff line number Diff line change
Expand Up @@ -7,27 +7,38 @@
each). Sweeping N and plotting allocation/wall vs N exposes the quadratic and separates
the realistic case (small N) from the worst case (large N).

Different costs need different method bodies, so the body shape is selectable with
`--shape` (the body is what decides which machinery you stress):

generic (default) a generic call + branch + return: CFG construction, flow
analysis, getPath, and type-argument inference.
vararg calls to JDK vararg methods (Arrays.asList, String.format,
Class.getMethod): exercises AnnotatedExecutableType deep-copying and the
vararg-type path. Use this to stress the methodAsMemberOf cache and the
AnnotatedTypeCopier vararg handling (it is what exposed PR #1798's copier
aliasing bug). A *user* vararg method does not reproduce it; JDK ones do,
because their executable types become cached masters.
deep-nesting one deeply nested expression per method: stresses tree-path / scan depth.
many-fields N fields instead of N methods: stresses field defaulting and fromElement.

Usage:
gen-sized-program.py 1500 > Big1500.java # one size
for n in 100 300 600 1500; do # a sweep
gen-sized-program.py 1500 > Big1500.java # one size, default shape
gen-sized-program.py --shape vararg 400 > Varargs.java # a different shape
for n in 100 300 600 1500; do # a sweep
gen-sized-program.py $n > Big$n.java
done

Then A/B each side with the deterministic allocation reader:
Then A/B each side with the deterministic allocation reader (or ab-measure.sh):
checker/bin/javac -J-XX:StartFlightRecording=settings=profile,filename=r.jfr,dumponexit=true \\
-processor nullness -d /tmp/out Big$n.java
java .claude/skills/cf-performance/alloc-total.java r.jfr

Each method has a non-trivial body (locals, a generic call, a branch, a return) so it
triggers CFG construction, flow analysis, and getPath/type-argument inference -- the
machinery that exercises the tree-path and dataflow hot paths. Tune `body` for other
mechanisms (deep expression nesting, many fields, large switch, etc.).
"""

import argparse
import sys


def generate(n: int) -> str:
def generic(n: int) -> str:
lines = ["class Big {", " static <T> T id(T x) { return x; }"]
for i in range(n):
lines += [
Expand All @@ -42,7 +53,73 @@ def generate(n: int) -> str:
return "\n".join(lines) + "\n"


def vararg(n: int) -> str:
# Calls to JDK vararg methods: their executable types are cached and re-defaulted,
# and deep-copied per use, so this stresses AnnotatedExecutableType copying and the
# vararg-type subtree.
lines = [
"import java.util.Arrays;",
"import java.util.List;",
"class Big {",
]
for i in range(n):
lines += [
f" void m{i}(String s{i}) throws Exception {{",
f" List<String> xs{i} = Arrays.asList(s{i}, s{i}, s{i});",
f" List<String> ys{i} = Arrays.asList(s{i});",
f' String f{i} = String.format("%s %s", s{i}, s{i});',
f" java.lang.reflect.Method mm{i} ="
f' Big.class.getMethod("m{i}", String.class);',
" }",
]
lines.append("}")
return "\n".join(lines) + "\n"


def deep_nesting(n: int) -> str:
lines = ["class Big {", " static <T> T id(T x) { return x; }"]
for i in range(n):
# A single deeply nested generic-call expression.
expr = f"a{i}"
for _ in range(20):
expr = f"id({expr})"
lines += [
f" Object m{i}(Object a{i}) {{",
f" return {expr};",
" }",
]
lines.append("}")
return "\n".join(lines) + "\n"


def many_fields(n: int) -> str:
lines = ["import java.util.List;", "class Big {"]
for i in range(n):
lines.append(f" List<String> f{i};")
lines.append("}")
return "\n".join(lines) + "\n"


SHAPES = {
"generic": generic,
"vararg": vararg,
"deep-nesting": deep_nesting,
"many-fields": many_fields,
}


if __name__ == "__main__":
if len(sys.argv) != 2:
sys.exit("usage: gen-sized-program.py <method-count>")
sys.stdout.write(generate(int(sys.argv[1])))
parser = argparse.ArgumentParser(
description="Generate a sized Java program for perf sweeps."
)
parser.add_argument(
"n", type=int, help="number of methods (or fields, for many-fields)"
)
parser.add_argument(
"--shape",
choices=sorted(SHAPES),
default="generic",
help="method-body shape to stress",
)
args = parser.parse_args()
sys.stdout.write(SHAPES[args.shape](args.n))
Loading