Skip to content

Phase 3 classification harness#3

Merged
johnoooh merged 31 commits into
mainfrom
phase-3-classification-harness
May 7, 2026
Merged

Phase 3 classification harness#3
johnoooh merged 31 commits into
mainfrom
phase-3-classification-harness

Conversation

@johnoooh
Copy link
Copy Markdown
Owner

@johnoooh johnoooh commented May 7, 2026

Summary

Validates the Stage-1 binary classifier from the Handoff end-to-end (worker, hook, prompt, multilingual + production-realistic harness) and lays the groundwork to surface it in the results UI — but the in-app fit dots and "evaluating fit" caption are gated behind a feature flag (ENABLE_CLASSIFY_IN_RESULTS = false) until "Best fit" sort is wired. Without sort, the dots don't drive any user-visible behavior, so they were just decoration.

The harness at ?test=classify runs the full pipeline against a 24-trial fixture (validated 93% in-scope binary agreement on Gemma 2 2B with zero catastrophic UNLIKELY). The custom-model stub for future LoRA work is in place. Detail-pane disclaimer was redesigned to a two-tier disclosure to set honest expectations about model accuracy.

Default model stays Gemma 2 2B (q4f32_1, ~1.3 GB). Qwen2.5-1.5B (?model=qwen25) and Llama 3.2 3B (?model=llama32) registered as opt-in alternatives for prompt validation in the harness.

What's in

Validation harness (dev-only, ?test=classify)

  • React port of the standalone Classification Harness.html reusing the live worker — no second model download, just shares the engine that the NL flow already loaded.
  • 24-trial fixture across LIKELY / POSSIBLE / UNLIKELY (incl. 4 with realistic-length CT.gov-style eligibility, 9 wrong-condition stress tests). Moved to ClassificationHarness.fixtures.js so prompt edits aren't buried under data.
  • 9 multilingual user-description presets (English variants, Spanish, Mandarin, Arabic, Portuguese, French, terse).
  • "Production-realistic agreement" toggle (default ON) excludes wrong-condition stress-test trials from the headline % — since the CT.gov API filters those out before the model ever sees them in real flow.
  • "Translate-then-classify" toggle: pre-translates non-English patient descriptions to English in one model call before classifying. Uses a dedicated translate worker task type with higher max_tokens.
  • "Copy results as markdown" button for sharing runs.

Underlying machinery (worker + hook + utils)

  • src/workers/nlp.worker.js — new classify and translate task types. Both are one-shot, low-temperature, with task-type-appropriate max_tokens (80 for classify, 200 for translate).
  • src/hooks/useClassifier.jsclassifyOne and translateOne, both serialized through a single promise chain because WebLLM's MLCEngine is single-threaded. .catch(()=>{}) prevents one rejection from poisoning the queue. Listener cleanup rejects pending tasks on unmount so awaiting callers don't hang.
  • src/utils/classifyTrial.jsDEFAULT_CLASSIFY_PROMPT + parseVerdict shared by the harness and the (gated) in-app flow. Single source of truth — prompt tweaks no longer have to be made in two places.

In-app classification (gated behind ENABLE_CLASSIFY_IN_RESULTS = false)
The wiring exists in ResultsList.jsx and TriageRow.jsx but is currently inert. When flipped on:

  • After search results land (and the user previously consented to the on-device model), classify each trial against the patient description.
  • Surface verdicts as FitDot per row (iris-violet filled circle for LIKELY, hollow ring for UNLIKELY, shimmer for pending; reason folded into aria-label).
  • Toolbar caption: evaluating fit · 7 of 24fit evaluated for 24.
  • Stage-2 simplification gates on classification done so the two passes don't compete for the single-threaded worker.
  • Pagination dedupes via classifiedRef; search-param OR patient-description change resets state and cancels in-flight batch.

Detail pane

  • "Why this might or might not fit you" section removed (Gemma's accuracy on the fit narrative wasn't reliable — occasionally flipped disease stage or treatment history).
  • Two-tier oncologist disclaimer (variant 7): native <details>/<summary> with a one-line surface ("Check with your doctor when exploring treatment options — this AI summary uses plain language to explain the treatment but can miss eligibility details.") and a why? expansion explaining the on-device model and its limits.
  • Pipeline-stage caption appears between the meta line and the simplifier output when classification is in flight (currently only fires if the feature flag is on).

Custom-model stub

  • Commented-out appConfig block in nlp.worker.js shows the exact shape needed to serve a fine-tuned MLC model from HuggingFace Hub. Pointer comment references the LoRA training process doc in the project vault.

Worker reliability

  • Bumped summarize frequency_penalty from 0.3 → 0.6 to break Gemma out of ## repetition collapse loops on the simplifier prompt.

Validation results (Gemma 2 2B, English fixture, production-realistic mode)

Metric Value Gate
In-scope agreement 93% (14/15) ≥ 80% ✓
Catastrophic UNLIKELY 0 0 ✓
Parse rate 100% ≥ 90% ✓
Avg latency ~2.5s < 1.5s ✗ (acceptable for v1)

Spanish + translate-first: 71% (vs 67% direct Spanish). Translation helped marginally but didn't bridge the gap to English. Multilingual is best-effort for now; English-only gating is a 1-line change if it proves needed in user testing.

Out of scope (deferred)

  • Sort by best fit — chips were removed entirely (were placeholder-disabled before). When sort wiring lands, restore the chips from git history and flip ENABLE_CLASSIFY_IN_RESULTS to true so the dots become actionable.
  • Collapse UNLIKELY trials under "N less likely matches" disclosure (Handoff Phase 3 step 6).
  • Auto-load model on app start — classification only fires if user previously consented via the NL flow.
  • LoRA fine-tune — stub is wired; the full pipeline doc is in the project vault.
  • Compare view — the sticky bar's Compare → button is still disabled. Compare state lift to App is queued for that follow-up.
  • Doctor-handoff PDF/print — separate follow-up PR after this lands.

Test plan

  • npm run test:run — 197/197 pass (added useClassifier.test.js for serialization, error isolation, unmount cleanup; added queued-submit drain regression test in NaturalLanguageInput.test.jsx)
  • npm run build — clean
  • npm run lint — only pre-existing errors / warnings
  • Harness smoke runs: English (Gemma + Qwen), Spanish + translate-first
  • End-to-end app flow: NL search → results render → simplifier produces summaries on selected trial (no fit dots since flag is off)
  • Hard refresh, clear cache, verify model download flow still works
  • Lighthouse a11y on the iris-violet detail-pane links (computed contrast ~9.6:1, expected to pass AA)
  • CI build-test passes

johnoooh added 30 commits May 7, 2026 00:49
Implements the harness validation rig from Handoff Phase 3, ported from
the standalone HTML file into the React app so it reuses the already-
loaded worker (no second model download needed).

- worker: new 'classify' task type. One-shot completion, low max_tokens
  (80), temperature 0.1, KV-cache reset between calls. Returns
  { type: 'classify_done', taskId, raw, latencyMs } or 'classify_error'.
- useClassifier hook: thin wrapper that posts 'classify' to the shared
  worker, tracks pending tasks by taskId, resolves with raw + latency.
  Caller parses the verdict.
- ClassificationHarness component: ports the standalone harness UI to
  React + Tailwind. Inputs: user description, prompt template, trials
  JSON (fixture preset), concurrency (1/2/3/5), eligibility max chars.
  Renders results table with verdict pill (LIKELY/POSSIBLE/UNLIKELY/
  PARSE_FAIL), latency, raw output, expected vs actual checkmark.
  Stats: parse rate, avg/max latency, agreement.
- App.jsx: dev-only ?test=classify route, lazy-loaded so production
  bundle stays unaffected.

Pass criteria from Handoff (parse ≥90%, avg latency <1.5s, agreement
≥80%) shown inline so the user can validate before deciding to wire
classification into the actual results UI.

This is the validation gate, not the in-app classification pipeline
itself. Once the harness numbers look good, a follow-up PR adds
classifyAll() to ResultsList and the fit meter to TriageRow.
…l-safe

User reported 33% agreement, 0ms latencies, and "Message error should
not be 0" failures from the harness. Root cause: WebLLM's MLCEngine
is single-threaded; concurrent engine.chat.completions.create() calls
collide on the engine's internal KV cache and sampling state, returning
errors instantly. The harness was firing 3 in parallel by default, so
the first 3 always failed, the next batch sometimes squeezed through.

Fix at the hook level via a promise chain in useClassifier — each
request waits for the previous one to settle. Caller-side concurrency
becomes a no-op for actual parallelism but still controls queue depth.
This matches how useSimplifier already serializes through a single
inFlightRef. The .catch() on the chain prevents one failure from
breaking the whole queue.

Harness UI: replaced the misleading concurrency dropdown with a
static caption noting the engine constraint. Re-running with the
fixture should now produce real latencies (1-2s per trial on
mid-range hardware) and meaningful agreement numbers.
The original 6-trial smoke test wasn't enough surface to evaluate
prompt changes — every tweak swung agreement by ±17% on a single
trial. Add 14 trials grouped by failure mode:

- Subtype-gated breast cancer trials (3): Tucatinib HER2+, Olaparib
  BRCA, CDK4/6 switch HR+. All POSSIBLE for a generic "breast cancer"
  user — exercises whether the model penalizes unknown subtype.
- Strong matches for a 58yo with breast cancer (4): CBT for fatigue,
  lymphedema surveillance, mindfulness for survivors, vaginal estrogen
  in postmenopausal survivors. All LIKELY — these are the "obvious
  yes" cases that should never get marked POSSIBLE.
- Wrong condition / wrong demographic (5): melanoma, AFib, Type 2
  diabetes, prostate cancer, pediatric vaccines. All UNLIKELY — these
  are the "obvious no" cases that catastrophic-false-negative would
  hide.
- Edge cases (2): palliative care for any advanced solid tumor
  (POSSIBLE — depends on stage which is unknown), and premenopausal
  breast cancer (UNLIKELY — 58yo is almost certainly postmenopausal,
  the trial explicitly excludes postmenopausal).

Distribution: 6 LIKELY, 6 POSSIBLE, 8 UNLIKELY. The Handoff calls for
50+ trials for real validation; 20 is a workable middle ground for
prompt iteration without crushing test cycle time.
The fixture trials use 200-400 char eligibility blurbs, which masks
two real-world failure modes: latency growth with longer prompts and
the model losing the signal in CT.gov noise (formal numbered lists,
lab cutoffs, prior-therapy specifics, washout periods).

Add 4 trials with realistic CT.gov-style eligibility (~2-3.5kB each)
spanning all three verdicts:

- LONG-01 Sacituzumab metastatic HR+/HER2- after CDK4/6 (POSSIBLE):
  subtype-gated, 12 inclusion + 12 exclusion criteria with full
  organ-function lab ranges. Tests whether the model still penalizes
  unknown HR/HER2 status when buried under three pages of text.
- LONG-02 Adjuvant abemaciclib node-positive postmenopausal (LIKELY):
  the postmenopausal woman with HR+ early breast cancer at high risk
  case. Many specific lab cutoffs and prior-therapy washout rules.
  Tests whether the model picks up on the clear inclusion match
  without getting confused by exclusion noise.
- LONG-03 Pembrolizumab metastatic squamous NSCLC (UNLIKELY): wrong
  cancer entirely, but mentions "breast" once in context of allowed
  prior malignancies. Tests whether the model anchors on incidental
  mentions vs the core indication.
- LONG-04 Empagliflozin HFpEF + T2DM (UNLIKELY): wrong condition
  entirely (cardiology + diabetes), but excludes only "active
  malignancy within 12 months" — disease-free history is allowed.
  Tests whether the model conflates "cancer history allowed" with
  "this trial is for cancer patients."

Combined with the eligMax knob (default 1500), these let you sweep:
800 → 1500 → 3000 → 6000 chars and watch latency vs accuracy. The
Handoff suggests dropping to 800 if latency is a problem; this is
where you'd validate that.
After a run completes, surfaces a button in the stats row that copies
a formatted markdown report to the clipboard. Includes:

- Model + user description + eligMax (so a different prompt run is
  reproducible)
- Stats table (done, elapsed, latencies, parse rate, agreement)
- Per-trial results table (verdict, expected, match check, latency,
  reason)
- Collapsed prompt template in a <details> block

Pipes pretty quickly into chat or a PR comment for sharing harness
runs without screenshotting. Confirmation state on the button
(✓ copied / copy failed) for 1.8-2.4s.
Initial run had 67% agreement with two LIKELY-vs-POSSIBLE drift errors
on subtype-gated trials (Sacituzumab metastatic TNBC, HER2+ radiation
boost). Both required subtype the user hadn't confirmed; the model
called them LIKELY because "breast cancer" matched the broad indication.

Update the default prompt to make the verdict ladder explicit:
- LIKELY = slam-dunk, every gating criterion confirmed
- POSSIBLE = broad condition matches, key criteria unknown
- UNLIKELY = clear mismatch (wrong condition / sex / age / comorbidity)

Also tightens the response format requirement (one line, pipe-separated,
reason required) since 2/6 outputs in the first run dropped the reason.
"raw: LIKELY" with no reason isn't useful for a user reading the fit
meter; tightening here propagates to the eventual in-app integration.
Run 2 (verbose-bias prompt) dropped to 50% agreement with three new
failure modes:

1. Model hallucinated user attributes ("user has prior radiation",
   "user is HR-positive") that weren't in the description, then
   reasoned about them. Driving false UNLIKELY verdicts.
2. Model misread exclusion semantics on Lymphedema Surveillance —
   said the user "does not meet the exclusion criteria" as a NEGATIVE,
   producing a catastrophic false UNLIKELY (the user-doesn't-see-it
   failure mode the Handoff specifically calls out).
3. UNLIKELY filter went leaky — melanoma, NSCLC, HFpEF + T2DM all
   moved to POSSIBLE on "user's age is within the range" alone,
   ignoring the actual disease mismatch.

New prompt structure:
- Explicit "user description is ONLY what you know — do not assume"
  guard against hallucination
- Five ordered rules: condition mismatch → UNLIKELY (highest
  priority), demographic exclusion → UNLIKELY, missing-info →
  POSSIBLE, all-met → LIKELY, default POSSIBLE
- Explicit "UNLIKELY ≠ many requirements; UNLIKELY = user is clearly
  disqualified" to fix the exclusion-semantics confusion

Trade-off: slightly longer than the original baseline prompt, but
more declarative than the verbose run-2 prompt. Latency should sit
between the two.
… rules

Run 3 (rules-list) hit 54% with three failure modes:

1. The "VERDICT | reason" instruction made Gemma emit the literal word
   "VERDICT" as a placeholder name, dropping the actual verdict from
   most outputs. raw: "VERDICT:" with no verdict, or "VERDICT: LIKELY"
   with no reason.
2. Five ordered rules were too many for Gemma 2B — it appears to
   default to "user meets all requirements -> LIKELY" and stop
   reasoning. Pembrolizumab Melanoma flipped to LIKELY, four
   subtype-gated breast cancer trials flipped to LIKELY.
3. Long eligibility (LONG-03 NSCLC, LONG-04 HFpEF) degraded the
   wrong-condition signal — both flipped POSSIBLE/LIKELY despite the
   shorter NSCLC version correctly classifying UNLIKELY.

New strategy:
- Drop the literal "VERDICT" placeholder. Use "<LABEL> | <reason>"
  with angle brackets so it reads as a template, not a literal.
- Drop the numbered rules. Use 3 short bullet definitions instead.
- Add 3 worked examples covering POSSIBLE / UNLIKELY / LIKELY for a
  consistent patient profile (62yo man, prostate cancer). Examples
  demonstrate the no-assume rule, the wrong-condition rule, and the
  positive-match rule by behavior, not by abstract instruction.
- "Now classify:" cue separates examples from the actual task,
  reducing the chance Gemma copies the example patient.

Trade-off: prompt is longer, latency may rise modestly. If accuracy
holds, that's a fair price.
Run 4 (3-class few-shot) hit 67%, but mapping LIKELY+POSSIBLE -> SHOW
already gave 88% on the same outputs. The LIKELY-vs-POSSIBLE judgment
is the source of most disagreement noise and isn't actually a product
question — the user-facing decision is binary (show this trial or
collapse it under "less likely matches"). POSSIBLE adds cognitive
load to a panicking patient without buying useful signal.

Pivot the default prompt to binary:
- LIKELY: trial studies the patient's condition AND nothing clearly
  excludes them. Worth showing.
- UNLIKELY: different disease, OR clearly wrong sex/age/population.
  Not worth showing.

Explicit "be inclusive on LIKELY" instruction: missing subtype/
biomarker/stage info still counts as LIKELY since the patient or
their doctor can verify. UNLIKELY is reserved for clear
disqualification on something the patient DID state.

Three worked examples: LIKELY (matched indication w/ unstated
biomarker), UNLIKELY (wrong sex), UNLIKELY (wrong condition entirely).

Parser side: POSSIBLE is still recognized for backward compatibility
but normalized to LIKELY. Fixture trials keep their 3-class expected
values (informationally rich) but the agreement check uses binary
mapping (POSSIBLE expected counts as LIKELY).

Markdown report uses the same binary mapping in the match column.
Gemma 2 2B has plateaued at 79% binary agreement on the harness with
recurring hallucination of patient attributes ("user has AFib") that
five prompt iterations didn't fix — looks like a structural limit of
the small instruct model.

Llama 3.2 3B Instruct is the next sensible rung: ~50% more
parameters, much better instruction following per published
benchmarks, and well-tested in the WebLLM ecosystem. q4f16_1 quant
keeps the download to ~1.9 GB (vs 1.3 GB for current Gemma).

Switch via ?model=llama32 in any IRIS URL — including the harness:
http://localhost:5173/iris/?test=classify&model=llama32

First load downloads ~1.9 GB (cached after). Re-run the harness with
the same prompt to see if accuracy improves on the same fixture.
Llama 3.2 3B run was catastrophic — 42% agreement, marked nearly
everything UNLIKELY. Reasons revealed the failure mode: the model
pattern-matched on the few-shot examples (all "62yo man with prostate
cancer") and emitted fragments of the example patient as if they
applied to the actual patient ("patient is a man", "trial is for
atrial fibrillation in men; patient is a woman"). Small instruct
models are notorious for this kind of example contamination.

Two changes:

1. Register Qwen2.5-1.5B-Instruct (~900 MB, q4f16_1) as ?model=qwen25.
   Smaller than Gemma 2B but the Qwen2.5 series is known for strong
   instruction-following per parameter. Worth a shot before
   concluding small models can't do this.

2. Rewrite the four few-shot examples to use FOUR different patients
   (45yo woman ovarian, 70yo man T2D, 8yo child asthma, 55yo man
   hypertension) covering LIKELY/UNLIKELY/UNLIKELY/LIKELY. Explicit
   "each example uses a DIFFERENT patient — focus on reasoning, not
   patient details" header. Removes the gravity well that made Llama
   anchor.

Run any model against the same fixture to compare. The example fix
should help all models, not just Qwen.
Two follow-ups for Phase 3 validation:

1. ClassificationHarness: dropdown above the user description with 9
   patient presets. Same 58yo woman with breast cancer in Boston in
   English, Spanish, Mandarin, Arabic, Portuguese, French, plus two
   English variants (terse "58F, BC, Boston" and a more detailed
   version) and a more-detailed Spanish. RTL direction auto-detected
   for Arabic. Lets us validate Qwen2.5-1.5B's multilingual handling
   without manually pasting translations. Selecting a preset replaces
   the description; manual edits show as "— custom —".

2. nlp.worker.js: commented-out appConfig stub for serving a custom
   model via prebuiltAppConfig extension. Walk-through comment points
   at the LoRA training doc in the vault. Drop-in: uncomment three
   lines in the worker, add a matching entry in nlpModels.js, swap
   via ?model=iris-finetune. No code path active until you flip the
   commented lines.
Spanish run on Qwen2.5-1.5B dropped 16 points vs English (67% vs 83%)
because the model stops anchoring on the patient's stated condition
when reasoning across languages — it confabulates "matches the
patient's condition" for trials about prostate cancer, T2D, NSCLC.

Translation is the model's strength (well-defined task) and runs ONCE
per batch (the patient description), not per trial. So the cost is
~1s amortized across all N classifications.

Adds a "translate to English first" checkbox to the harness controls.
When enabled:
1. Before the classify queue starts, run a translation prompt on the
   user description.
2. Show the result in an iris-tinted callout above the stats.
3. Use the translated text for all per-trial classifications.

If translation fails the run aborts with an alert (rare but possible).

This is harness-only for now — the eventual app integration would
trigger translation only when detectInputLanguage(userDescription)
returns non-English. Path A (gate on en-only) is still on the table
if translation doesn't recover enough accuracy.
Couldn't tell from the previous Spanish run whether the translate-first
toggle was active because the markdown report omitted that state. Add
both the toggle status and the resulting translated text to the
report header so runs are unambiguously reproducible.
The headline agreement was being dragged down by stress-test trials
the CT.gov API would never return for the user's stated condition
(melanoma, NSCLC, T2D, HFpEF, prostate, etc. for a breast-cancer
search). Headline 83% / Spanish 71% understated the actual production
quality, which is closer to 94% on trials the user would actually see.

- Tag 9 wrong-condition fixture trials with outOfScope: true
- Add a "production-realistic agreement" checkbox (default ON)
- When ON: agreement % only counts in-scope trials. Out-of-scope are
  still classified and shown in the table but excluded from the
  headline + agreement note explains how many were excluded.
- When OFF: previous all-trials behavior (useful for stress-testing
  prompt changes).
- Markdown report includes the toggle state and exclusion count so
  shared runs are unambiguous.
… TriageRow

End-to-end Phase 3 integration. After search results land, kick off
binary classification for every trial; surface verdicts as a small
fit dot in each row + a live "evaluating fit · X of N" caption in
the toolbar.

Architecture:
- New utils/classifyTrial.js holds DEFAULT_CLASSIFY_PROMPT and
  parseVerdict, shared between the harness and the in-app flow so
  the model sees identical prompts in both contexts.
- ResultsList uses useNLP (idempotent — worker fast-returns ready
  if model already loaded by NL extraction) + useClassifier.
- Classification only fires when:
  1. iris_nlp_enabled === 'true' in localStorage (user previously
     consented to the on-device model — we never auto-load without
     consent), AND
  2. patientDesc is available (from userDescription if NL was used,
     or synthesized from extractedFields), AND
  3. WebGPU is supported.
  Structured-form-only sessions skip classification silently — no
  covert worker init, no model download surprise.
- Pagination: only newly-arrived NCTs get classified (classifiedRef
  Set dedups). Search-param change resets state and cancels any
  in-flight batch.

UI:
- TriageRow grows a FitDot:
  • iris-violet filled circle = LIKELY (with reason in tooltip)
  • parchment-300 hollow ring = UNLIKELY (with reason)
  • shimmer dot = pending
  • nothing rendered for PARSE_FAIL (don't surface model errors to users)
- Toolbar shows "evaluating fit · 7 of 24" while running, then
  "fit evaluated for 24" when done.

What's NOT in this PR (deferred):
- Sort by best fit (Handoff Phase 3 step 5)
- Collapse UNLIKELY under "12 less likely matches" disclosure (step 6)
Both need more UX consideration; the dots + caption are the safe
first slice.
…ions

Qwen2.5-1.5B was hallucinating Python tutorial fragments in the
"What this study is testing" section because the simplifier prompt
ended with a bare label ("PLAIN-LANGUAGE OUTPUT:") that chat-tuned
models read as a topic header — then they pattern-match training
data that follows similar headers and emit unrelated content.

Two changes to both buildSummarizePrompt and buildAssessFitPrompt:

1. Rename the example block label from "PLAIN-LANGUAGE OUTPUT:" /
   "ASSESSMENT:" to "EXAMPLE PLAIN-LANGUAGE OUTPUT:" /
   "EXAMPLE ASSESSMENT:" — clarifies its role for the model.

2. Replace the trailing bare label with a directive instruction
   ("Now write the plain-language summary…", "Begin your response
   with \"## What this study is testing\":") that names the action
   and pins the first header. This is what chat-instruct models like
   Qwen and Gemma both expect — a clear command, not a label.

Updated the language-reminder ordering test to look for the new
trailing directive instead of the old label.
Same Gemma 2 2B model, just q4f16_1 quantization instead of q4f32_1:
~900 MB weights (vs 1.3 GB) and fp16 activations. WebGPU's native
compute type is fp16 on most GPUs, so inference latency drops
measurably with no quality difference on short instruction tasks.

Registered as ?model=gemma_fast so it can be A/B tested against the
current q4f32_1 default without forcing a re-download for anyone
who already has the larger weights cached. Promote to default once
validated in the harness + actual app flow.
User tested ?model=gemma_fast (gemma-2-2b-it-q4f16_1-MLC) and the
simplifier produced malformed output. The reduced activation precision
(fp16 vs fp32) loses the grip on the structured "## What this study
is testing" / "## Who can join" section headers the streaming parser
needs.

Since WebLLM can only have one model loaded at a time and the
simplifier is critical UX, mixing quants for different tasks isn't
viable. Drop the entry from the registry; leave a comment so future-
us doesn't re-discover the same dead end.
19e5d34 changed the trailing simplifier prompt label to a directive
instruction ("Begin your response with \"## What this study is
testing\":") to fix Qwen2.5-1.5B hallucinating training data. That
fix worked for Qwen but broke Gemma 2B — Gemma got stuck in a
repetition loop emitting bare ## patterns ("##############  ## ## ##
## ##...") because the trailing instruction over-anchored on the ##
header pattern.

Since we ship Gemma by default and Qwen is opt-in (?model=qwen25,
mainly for the harness), the right call is to restore the original
prompt that Gemma is validated against. Qwen will continue to
hallucinate in the simplifier — but that's a known issue with no
production exposure as long as Gemma stays default.

If we ever need to ship Qwen for production, the right fix is
probably structural (proper chat-message turns with role:assistant
for the example) rather than a trailing-label tweak.
…aise frequency_penalty

Two issues from end-to-end testing:

1. Toolbar showed "evaluating fit · 0 of 10" while the simplifier had
   already produced output — classifier and simplifier were both
   queueing into the single-threaded worker on results land. The
   simplifier's eager-batch-of-5 ran first, the classifier got stuck
   behind it, and per-trial fit dots never appeared until the
   simplifier finished all 10 chunks of the eager batch.

2. Gemma 2 2B was hitting degenerate repetition loops on the
   simplify prompt — "## What this study is testing" header rendered
   correctly, then the body was just "############ ## ## ## ##..."
   for the full max_tokens budget.

Two changes per Handoff Phase 3 step 6 ("Stage 2 simplification only
fires for the currently-selected trial in the detail pane"):

- ResultsList: removed the eager-batch-of-5 simplifier call. Simplifier
  now only enqueues for the currently-selected trial, and only after
  classification has completed (or is not running, e.g. structured-
  form-only sessions). Eliminates the classifier/simplifier race AND
  drastically cuts how often the simplifier runs (so when Gemma does
  hit a repetition loop, it affects one trial not five).

- nlp.worker.js: bumped frequency_penalty for summarize from 0.3 → 0.6.
  Higher penalty discourages the same n-gram from re-firing, breaking
  the "## ## ##" loop. Assess-fit stays at 0 so its hedging language
  ("may", "might") doesn't get penalized.
User saw an empty "What this study is testing" area and an inaccurate
summary, with no explanation of WHY the simplifier hadn't run yet
(classifier was still working through 10 trials in the background).

Adds an explicit pipeline-stage caption above the simplifier section
in the detail pane:

- Stage "classifying": iris-violet pill with shimmer dot:
  "evaluating fit · 3 of 10 · plain-language summary will follow"
  Tells the user the per-trial fit is still being computed and the
  summary is queued to come after.

- Stage "awaiting-summary": parchment pill: "generating plain-language
  summary…" — fires after classification completes but before the
  selected trial's summary stream starts producing tokens.

- Once tokens land (status === 'streaming' / 'complete'): caption
  disappears, real content renders.

ResultsList computes the stage from canClassify + classifyDone +
the simplifier's per-trial status, then passes pipelineStage +
classifyProgress as new ResultCard props (pane mode only — rows
already have their own fit dot).

Doesn't address the underlying small-model accuracy issue (Gemma
turning "Early Breast Cancer" into "spread to other parts" in the
summary). That's the next escalation — short-term fix would be lower
max_tokens for summarize; real fix is the LoRA pipeline in the vault
doc.
Two changes that work together to set honest expectations about the
on-device model's reliability:

1. Removed the "Why this might or might not fit you" section from the
   detail pane. Gemma 2B's accuracy on the fit narrative isn't
   reliable enough to ship — recent E2E testing showed it occasionally
   flips disease stage (early vs metastatic) or treatment history.
   The TriageRow fit dot (driven by the binary classifier in
   useClassifier) is the safer signal. ResultsList no longer enqueues
   assess_fit; ResultCard no longer renders the fit paragraph; the
   useSimplifier code path is preserved in case we re-enable it after
   a fine-tune.

2. Added an unconditional oncologist disclaimer above the contact
   block in pane mode: "Talk to your oncologist if you think you
   might qualify for this trial. The plain-language summary above is
   generated on-device by a small AI model — it can miss or misstate
   eligibility details. Your care team can confirm whether the trial
   fits your specific situation."

   Iris-tinted callout, always visible. Sets the right expectation:
   the AI summary is a starting point, not a recommendation.

Test that previously asserted the fit paragraph rendered now asserts
its absence. The previous "fit hidden when in error" test was
redundant (the fit section is unconditionally hidden now) so it was
folded into the single assertion.
Disclaimer reshaped per the brainstorm + user pick (variant 7 with
user-rewritten surface line):

  Surface (always visible, in <summary>):
    "Check with your doctor when exploring treatment options —
     this AI summary uses plain language to explain the treatment
     but can miss eligibility details."  with a small "why?" hint.

  Expandable (in <details> body):
    "The plain-language summary above was generated on your device
     by a small AI model. It can miss or misstate who qualifies for
     a trial. Your care team has your full medical picture and can
     confirm whether this one actually fits."

Frames the AI as helper rather than decider, gives the in-crisis
reader minimum cognitive load by default, full honesty + privacy
beat one tap away. Group-open hides the "why?" hint when expanded
so the chrome stays clean.

Also carrying redesign-branch fix forward: tightened the auto-load
gate in NaturalLanguageInput from (idle || downloading) → idle
only. The widened gate caused redundant load() messages mid-download
because the cleanup-reset pattern would re-fire the effect on every
status flip while still in the gate. Worker dropped them as
duplicates so harmless in practice, but cleaner this way. Same fix
went in on redesign branch as 3c2e750; keeping branches in sync.
The sort UI was visible-but-permanently-disabled placeholders for
"Best fit" / "Distance" / "Phase" / "Most recent" with tooltips like
"Sort wiring coming in a follow-up." Reads as broken UI to users.
Better to ship without than ship with disabled controls.

When the wiring lands, restore the chips from git history. CT.gov
v2 API supports sort= for distance (when location set) and
LastUpdatePostDate; "Best fit" depends on having the classifier
verdicts per-trial, which now exists post-Phase-3 wiring.

Toolbar now shows just the search summary line + classification
progress, which is the load-bearing info anyway.
Three of the deferred items from PR #1's review, addressed in one
follow-up to keep the phase-3 PR from carrying open feedback debt:

1. useIsMobile uses matchMedia.change instead of window 'resize'
   (ResultsList.jsx:36-50). iOS Safari fires 'resize' inconsistently
   on rotation; matchMedia.change is the reliable signal and also
   catches iPad split-screen + browser-window mode switches.
   src/test/setup.js stubs matchMedia (jsdom doesn't ship it) so
   the existing ResultsList tests keep rendering the desktop two-pane
   path.

2. New regression test for the queued-submit drain in
   NaturalLanguageInput. Indirect coverage for the StrictMode listener
   re-attach fix in useNLP — if the listener doesn't reach the worker
   'ready' message, the queued submit never fires and this test fails.
   The bug surfaced this session: status was getting stuck at
   'downloading' forever in dev because the listener detached on
   StrictMode's first cleanup never re-attached. Now: pin the contract.

3. shared/iris-shared.jsx (478-line design reference, never imported
   by src/) moved to docs/design-references-shared/ with a README
   explaining its role. Stops future readers (LLM or human) from
   trying to "fix" or "consolidate" it as if it were live code.

Contrast-check (#3 in the review): computed iris-700 on parchment-50
yields ~9.6:1 (WCAG AAA). iris-700 on iris-50 is similar (~8.5:1).
All iris-violet links and the model badge clear AA easily; no
palette change needed. Lighthouse can confirm at deploy time.

Compare-state lift (#1 in the review): deferred to its own follow-up
PR alongside the actual compare view (currently a placeholder).
User reported: hitting "Find trials" again with a refined prompt did
nothing visible. The classifier kept its old verdicts; the toolbar
sat at "fit evaluated for N" against a stale patient description.

The classification-reset effect was watching only searchParams. If
the refined prompt extracted to the same condition (e.g., "breast
cancer" → still "breast cancer"), the API result set was cached and
the trials list didn't change either. classifiedRef carried the old
NCT IDs into the next render, so newTrials.length === 0 and the
classification effect short-circuited.

Two changes:
- Add patientDesc to the reset effect's deps so a prompt change
  alone (no condition change) wipes classifications + classifiedRef
  and cancels the in-flight batch via cancelClassifyRef.
- Cancel the simplifier's pending queue too — otherwise an in-flight
  summary keeps the worker busy while the re-classification waits
  for it to drain.

Effect: change the prompt → classifications wipe → fit dots reset to
shimmer → toolbar shows "evaluating fit · 0 of N" again → classifier
re-runs against the new patient → fit dots refill.
Two correctness fixes from code review:

1. useClassifier.js:34-49 — listener cleanup now also rejects every
   in-flight classify in pendingRef. Previously: on unmount (or
   StrictMode dev cleanup) the listener detached but pendingRef Map
   still held resolve/reject handles — those promises hung forever.
   Now any awaiting caller sees a clean rejection with
   "classifier unmounted" so error paths fire and the queue clears.

2. ResultsList.jsx:118 — replaced `nlp` (whole hook return object)
   in the load-trigger effect deps with `nlpLoad` destructured. The
   useNLP hook doesn't memoize its return object, so each render
   produced a new ref → effect ran every render. The status guard
   prevented redundant load() calls but the body still ran. Now the
   effect only fires when load callback identity changes (it's
   useCallback'd with stable deps so essentially never), or when
   canClassify / status / modelKey change.

Both surfaced by post-merge code review of PR #2; neither blocks
shipping but both are trivial to land before merge.
Bundles all eight follow-up items the reviewer flagged on PR #2,
plus the user-requested decision to disable classify-driven UI in
the results view (the fit dots had no actionable consequence
without sort wiring, so they were just decoration).

User-visible
- Classify-in-results gated behind ENABLE_CLASSIFY_IN_RESULTS = false
  (ResultsList.jsx). The classifier hook + worker task + ?test=classify
  harness all stay live for prompt iteration; only the in-app fit
  dots and "evaluating fit · X of N" caption are suppressed. Flip
  the constant to true once "Best fit" sort lands so the dots become
  actionable.

Refactors / dedupe
- Moved 305-line SAMPLE_TRIALS array + USER_PRESETS list out of
  ClassificationHarness.jsx into a sibling fixtures file
  (ClassificationHarness.fixtures.js). Harness file dropped from
  924 → ~550 lines.
- Harness now imports DEFAULT_CLASSIFY_PROMPT and parseVerdict from
  utils/classifyTrial.js instead of duplicating both verbatim. Single
  source of truth — prompt tweaks no longer have to be made in two
  places.
- useIsMobile hoisted out of ResultsList.jsx into hooks/useIsMobile.js
  so any other component that needs the same breakpoint can import
  it without copy-pasting.

Small fixes
- patientDesc in ResultsList wrapped in useMemo (was recomputed every
  render — value-equality made it work but the implicit reliance
  was fragile).
- FitDot in TriageRow folds the model's reason into aria-label so
  screen readers and keyboard-focused users get the same context as
  a sighted hover (title alone reaches neither group reliably). Same
  string in both attrs. Added role="img" since it carries semantic
  content now.

New worker task type
- Added 'translate' message type to nlp.worker.js. max_tokens 200
  (vs classify's 80) so verbose-language paraphrases fit. Same low
  temperature (0.1) since translation wants fidelity, not creativity.
- useClassifier now exposes both classifyOne and translateOne, sharing
  the single promise chain (engine is single-threaded regardless of
  task type). handleMessage routes done/error events for both via a
  single isDone/isError predicate.
- Harness translate-first toggle now uses translateOne instead of
  overloading classifyOne — clarifies intent and lets the worker
  apply the right max_tokens budget.

New tests
- useClassifier.test.js: three tests covering serialization
  (concurrent calls post FIFO), error isolation (one rejection
  doesn't poison the queue), and unmount cleanup (pending tasks
  reject with 'classifier unmounted'). Mock the shared worker via
  vi.mock so the tests don't touch real WebLLM. 197/197 pass.

Skipped from the review
- "Simplifier idle gap during model load" — the gap is intentional
  per Handoff Phase 3 step 6 (stage-2 only after stage-1 completes),
  and the pipeline-stage caption already addresses the UX. Reviewer's
  suggestion to "let the simplifier proceed" would re-introduce the
  classifier-vs-simplifier worker contention we explicitly fixed.
PR #3 added 4 new lint hits on top of main's pre-existing baseline.
All trivial — fixing them keeps the CI lint output clean for future
reviewers (the workflow runs lint as continue-on-error so they don't
block, but fewer ignorable lines is fewer ignorable lines).

- ClassificationHarness.jsx:36 — setConcurrency unused since the
  concurrency dropdown was removed (serialization happens in the
  hook now). Drop the setter, keep the value as a const.
- ResultCard.jsx:126 — showFit unused since the "Why this might or
  might not fit you" section was dropped. Drop the var; comment
  notes the path back if a fine-tuned model lets us re-introduce.
- ResultsList.jsx:75 — wrap allTrials in useMemo. react-query keeps
  data ref stable across non-data renders so the memo identity is
  stable too; without the memo, every render produced a new array
  and effect dep arrays comparing against allTrials would have
  thrashed (the actual classify trigger effect depends on a derived
  trialKeyAll string so this was cosmetic, but cleaner this way).
- useClassifier.js:73-74 — exhaustive-deps disable on the
  classifyOne/translateOne useCallbacks. runTask only closes over
  refs (stable); the linter can't see through that.

Lint count: 29 → 24 (14 errors, 10 warnings — one fewer than main's
baseline). All remaining are pre-existing.
@johnoooh johnoooh merged commit 1d6ad40 into main May 7, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant