Phase 3 classification harness by johnoooh · Pull Request #3 · johnoooh/iris

johnoooh · 2026-05-07T05:08:36Z

Summary

Validates the Stage-1 binary classifier from the Handoff end-to-end (worker, hook, prompt, multilingual + production-realistic harness) and lays the groundwork to surface it in the results UI — but the in-app fit dots and "evaluating fit" caption are gated behind a feature flag (ENABLE_CLASSIFY_IN_RESULTS = false) until "Best fit" sort is wired. Without sort, the dots don't drive any user-visible behavior, so they were just decoration.

The harness at ?test=classify runs the full pipeline against a 24-trial fixture (validated 93% in-scope binary agreement on Gemma 2 2B with zero catastrophic UNLIKELY). The custom-model stub for future LoRA work is in place. Detail-pane disclaimer was redesigned to a two-tier disclosure to set honest expectations about model accuracy.

Default model stays Gemma 2 2B (q4f32_1, ~1.3 GB). Qwen2.5-1.5B (?model=qwen25) and Llama 3.2 3B (?model=llama32) registered as opt-in alternatives for prompt validation in the harness.

What's in

Validation harness (dev-only, ?test=classify)

React port of the standalone Classification Harness.html reusing the live worker — no second model download, just shares the engine that the NL flow already loaded.
24-trial fixture across LIKELY / POSSIBLE / UNLIKELY (incl. 4 with realistic-length CT.gov-style eligibility, 9 wrong-condition stress tests). Moved to ClassificationHarness.fixtures.js so prompt edits aren't buried under data.
9 multilingual user-description presets (English variants, Spanish, Mandarin, Arabic, Portuguese, French, terse).
"Production-realistic agreement" toggle (default ON) excludes wrong-condition stress-test trials from the headline % — since the CT.gov API filters those out before the model ever sees them in real flow.
"Translate-then-classify" toggle: pre-translates non-English patient descriptions to English in one model call before classifying. Uses a dedicated translate worker task type with higher max_tokens.
"Copy results as markdown" button for sharing runs.

Underlying machinery (worker + hook + utils)

src/workers/nlp.worker.js — new classify and translate task types. Both are one-shot, low-temperature, with task-type-appropriate max_tokens (80 for classify, 200 for translate).
src/hooks/useClassifier.js — classifyOne and translateOne, both serialized through a single promise chain because WebLLM's MLCEngine is single-threaded. .catch(()=>{}) prevents one rejection from poisoning the queue. Listener cleanup rejects pending tasks on unmount so awaiting callers don't hang.
src/utils/classifyTrial.js — DEFAULT_CLASSIFY_PROMPT + parseVerdict shared by the harness and the (gated) in-app flow. Single source of truth — prompt tweaks no longer have to be made in two places.

In-app classification (gated behind ENABLE_CLASSIFY_IN_RESULTS = false)
The wiring exists in ResultsList.jsx and TriageRow.jsx but is currently inert. When flipped on:

After search results land (and the user previously consented to the on-device model), classify each trial against the patient description.
Surface verdicts as FitDot per row (iris-violet filled circle for LIKELY, hollow ring for UNLIKELY, shimmer for pending; reason folded into aria-label).
Toolbar caption: evaluating fit · 7 of 24 → fit evaluated for 24.
Stage-2 simplification gates on classification done so the two passes don't compete for the single-threaded worker.
Pagination dedupes via classifiedRef; search-param OR patient-description change resets state and cancels in-flight batch.

Detail pane

"Why this might or might not fit you" section removed (Gemma's accuracy on the fit narrative wasn't reliable — occasionally flipped disease stage or treatment history).
Two-tier oncologist disclaimer (variant 7): native <details>/<summary> with a one-line surface ("Check with your doctor when exploring treatment options — this AI summary uses plain language to explain the treatment but can miss eligibility details.") and a why? expansion explaining the on-device model and its limits.
Pipeline-stage caption appears between the meta line and the simplifier output when classification is in flight (currently only fires if the feature flag is on).

Custom-model stub

Commented-out appConfig block in nlp.worker.js shows the exact shape needed to serve a fine-tuned MLC model from HuggingFace Hub. Pointer comment references the LoRA training process doc in the project vault.

Worker reliability

Bumped summarize frequency_penalty from 0.3 → 0.6 to break Gemma out of ## repetition collapse loops on the simplifier prompt.

Validation results (Gemma 2 2B, English fixture, production-realistic mode)

Metric	Value	Gate
In-scope agreement	93% (14/15)	≥ 80% ✓
Catastrophic UNLIKELY	0	0 ✓
Parse rate	100%	≥ 90% ✓
Avg latency	~2.5s	< 1.5s ✗ (acceptable for v1)

Spanish + translate-first: 71% (vs 67% direct Spanish). Translation helped marginally but didn't bridge the gap to English. Multilingual is best-effort for now; English-only gating is a 1-line change if it proves needed in user testing.

Out of scope (deferred)

Sort by best fit — chips were removed entirely (were placeholder-disabled before). When sort wiring lands, restore the chips from git history and flip ENABLE_CLASSIFY_IN_RESULTS to true so the dots become actionable.
Collapse UNLIKELY trials under "N less likely matches" disclosure (Handoff Phase 3 step 6).
Auto-load model on app start — classification only fires if user previously consented via the NL flow.
LoRA fine-tune — stub is wired; the full pipeline doc is in the project vault.
Compare view — the sticky bar's Compare → button is still disabled. Compare state lift to App is queued for that follow-up.
Doctor-handoff PDF/print — separate follow-up PR after this lands.

Test plan

npm run test:run — 197/197 pass (added useClassifier.test.js for serialization, error isolation, unmount cleanup; added queued-submit drain regression test in NaturalLanguageInput.test.jsx)
npm run build — clean
npm run lint — only pre-existing errors / warnings
Harness smoke runs: English (Gemma + Qwen), Spanish + translate-first
End-to-end app flow: NL search → results render → simplifier produces summaries on selected trial (no fit dots since flag is off)
Hard refresh, clear cache, verify model download flow still works
Lighthouse a11y on the iris-violet detail-pane links (computed contrast ~9.6:1, expected to pass AA)
CI build-test passes

Implements the harness validation rig from Handoff Phase 3, ported from the standalone HTML file into the React app so it reuses the already- loaded worker (no second model download needed). - worker: new 'classify' task type. One-shot completion, low max_tokens (80), temperature 0.1, KV-cache reset between calls. Returns { type: 'classify_done', taskId, raw, latencyMs } or 'classify_error'. - useClassifier hook: thin wrapper that posts 'classify' to the shared worker, tracks pending tasks by taskId, resolves with raw + latency. Caller parses the verdict. - ClassificationHarness component: ports the standalone harness UI to React + Tailwind. Inputs: user description, prompt template, trials JSON (fixture preset), concurrency (1/2/3/5), eligibility max chars. Renders results table with verdict pill (LIKELY/POSSIBLE/UNLIKELY/ PARSE_FAIL), latency, raw output, expected vs actual checkmark. Stats: parse rate, avg/max latency, agreement. - App.jsx: dev-only ?test=classify route, lazy-loaded so production bundle stays unaffected. Pass criteria from Handoff (parse ≥90%, avg latency <1.5s, agreement ≥80%) shown inline so the user can validate before deciding to wire classification into the actual results UI. This is the validation gate, not the in-app classification pipeline itself. Once the harness numbers look good, a follow-up PR adds classifyAll() to ResultsList and the fit meter to TriageRow.

…l-safe User reported 33% agreement, 0ms latencies, and "Message error should not be 0" failures from the harness. Root cause: WebLLM's MLCEngine is single-threaded; concurrent engine.chat.completions.create() calls collide on the engine's internal KV cache and sampling state, returning errors instantly. The harness was firing 3 in parallel by default, so the first 3 always failed, the next batch sometimes squeezed through. Fix at the hook level via a promise chain in useClassifier — each request waits for the previous one to settle. Caller-side concurrency becomes a no-op for actual parallelism but still controls queue depth. This matches how useSimplifier already serializes through a single inFlightRef. The .catch() on the chain prevents one failure from breaking the whole queue. Harness UI: replaced the misleading concurrency dropdown with a static caption noting the engine constraint. Re-running with the fixture should now produce real latencies (1-2s per trial on mid-range hardware) and meaningful agreement numbers.

The original 6-trial smoke test wasn't enough surface to evaluate prompt changes — every tweak swung agreement by ±17% on a single trial. Add 14 trials grouped by failure mode: - Subtype-gated breast cancer trials (3): Tucatinib HER2+, Olaparib BRCA, CDK4/6 switch HR+. All POSSIBLE for a generic "breast cancer" user — exercises whether the model penalizes unknown subtype. - Strong matches for a 58yo with breast cancer (4): CBT for fatigue, lymphedema surveillance, mindfulness for survivors, vaginal estrogen in postmenopausal survivors. All LIKELY — these are the "obvious yes" cases that should never get marked POSSIBLE. - Wrong condition / wrong demographic (5): melanoma, AFib, Type 2 diabetes, prostate cancer, pediatric vaccines. All UNLIKELY — these are the "obvious no" cases that catastrophic-false-negative would hide. - Edge cases (2): palliative care for any advanced solid tumor (POSSIBLE — depends on stage which is unknown), and premenopausal breast cancer (UNLIKELY — 58yo is almost certainly postmenopausal, the trial explicitly excludes postmenopausal). Distribution: 6 LIKELY, 6 POSSIBLE, 8 UNLIKELY. The Handoff calls for 50+ trials for real validation; 20 is a workable middle ground for prompt iteration without crushing test cycle time.

The fixture trials use 200-400 char eligibility blurbs, which masks two real-world failure modes: latency growth with longer prompts and the model losing the signal in CT.gov noise (formal numbered lists, lab cutoffs, prior-therapy specifics, washout periods). Add 4 trials with realistic CT.gov-style eligibility (~2-3.5kB each) spanning all three verdicts: - LONG-01 Sacituzumab metastatic HR+/HER2- after CDK4/6 (POSSIBLE): subtype-gated, 12 inclusion + 12 exclusion criteria with full organ-function lab ranges. Tests whether the model still penalizes unknown HR/HER2 status when buried under three pages of text. - LONG-02 Adjuvant abemaciclib node-positive postmenopausal (LIKELY): the postmenopausal woman with HR+ early breast cancer at high risk case. Many specific lab cutoffs and prior-therapy washout rules. Tests whether the model picks up on the clear inclusion match without getting confused by exclusion noise. - LONG-03 Pembrolizumab metastatic squamous NSCLC (UNLIKELY): wrong cancer entirely, but mentions "breast" once in context of allowed prior malignancies. Tests whether the model anchors on incidental mentions vs the core indication. - LONG-04 Empagliflozin HFpEF + T2DM (UNLIKELY): wrong condition entirely (cardiology + diabetes), but excludes only "active malignancy within 12 months" — disease-free history is allowed. Tests whether the model conflates "cancer history allowed" with "this trial is for cancer patients." Combined with the eligMax knob (default 1500), these let you sweep: 800 → 1500 → 3000 → 6000 chars and watch latency vs accuracy. The Handoff suggests dropping to 800 if latency is a problem; this is where you'd validate that.

After a run completes, surfaces a button in the stats row that copies a formatted markdown report to the clipboard. Includes: - Model + user description + eligMax (so a different prompt run is reproducible) - Stats table (done, elapsed, latencies, parse rate, agreement) - Per-trial results table (verdict, expected, match check, latency, reason) - Collapsed prompt template in a <details> block Pipes pretty quickly into chat or a PR comment for sharing harness runs without screenshotting. Confirmation state on the button (✓ copied / copy failed) for 1.8-2.4s.

Initial run had 67% agreement with two LIKELY-vs-POSSIBLE drift errors on subtype-gated trials (Sacituzumab metastatic TNBC, HER2+ radiation boost). Both required subtype the user hadn't confirmed; the model called them LIKELY because "breast cancer" matched the broad indication. Update the default prompt to make the verdict ladder explicit: - LIKELY = slam-dunk, every gating criterion confirmed - POSSIBLE = broad condition matches, key criteria unknown - UNLIKELY = clear mismatch (wrong condition / sex / age / comorbidity) Also tightens the response format requirement (one line, pipe-separated, reason required) since 2/6 outputs in the first run dropped the reason. "raw: LIKELY" with no reason isn't useful for a user reading the fit meter; tightening here propagates to the eventual in-app integration.

Run 2 (verbose-bias prompt) dropped to 50% agreement with three new failure modes: 1. Model hallucinated user attributes ("user has prior radiation", "user is HR-positive") that weren't in the description, then reasoned about them. Driving false UNLIKELY verdicts. 2. Model misread exclusion semantics on Lymphedema Surveillance — said the user "does not meet the exclusion criteria" as a NEGATIVE, producing a catastrophic false UNLIKELY (the user-doesn't-see-it failure mode the Handoff specifically calls out). 3. UNLIKELY filter went leaky — melanoma, NSCLC, HFpEF + T2DM all moved to POSSIBLE on "user's age is within the range" alone, ignoring the actual disease mismatch. New prompt structure: - Explicit "user description is ONLY what you know — do not assume" guard against hallucination - Five ordered rules: condition mismatch → UNLIKELY (highest priority), demographic exclusion → UNLIKELY, missing-info → POSSIBLE, all-met → LIKELY, default POSSIBLE - Explicit "UNLIKELY ≠ many requirements; UNLIKELY = user is clearly disqualified" to fix the exclusion-semantics confusion Trade-off: slightly longer than the original baseline prompt, but more declarative than the verbose run-2 prompt. Latency should sit between the two.

… rules Run 3 (rules-list) hit 54% with three failure modes: 1. The "VERDICT | reason" instruction made Gemma emit the literal word "VERDICT" as a placeholder name, dropping the actual verdict from most outputs. raw: "VERDICT:" with no verdict, or "VERDICT: LIKELY" with no reason. 2. Five ordered rules were too many for Gemma 2B — it appears to default to "user meets all requirements -> LIKELY" and stop reasoning. Pembrolizumab Melanoma flipped to LIKELY, four subtype-gated breast cancer trials flipped to LIKELY. 3. Long eligibility (LONG-03 NSCLC, LONG-04 HFpEF) degraded the wrong-condition signal — both flipped POSSIBLE/LIKELY despite the shorter NSCLC version correctly classifying UNLIKELY. New strategy: - Drop the literal "VERDICT" placeholder. Use "<LABEL> | <reason>" with angle brackets so it reads as a template, not a literal. - Drop the numbered rules. Use 3 short bullet definitions instead. - Add 3 worked examples covering POSSIBLE / UNLIKELY / LIKELY for a consistent patient profile (62yo man, prostate cancer). Examples demonstrate the no-assume rule, the wrong-condition rule, and the positive-match rule by behavior, not by abstract instruction. - "Now classify:" cue separates examples from the actual task, reducing the chance Gemma copies the example patient. Trade-off: prompt is longer, latency may rise modestly. If accuracy holds, that's a fair price.

Run 4 (3-class few-shot) hit 67%, but mapping LIKELY+POSSIBLE -> SHOW already gave 88% on the same outputs. The LIKELY-vs-POSSIBLE judgment is the source of most disagreement noise and isn't actually a product question — the user-facing decision is binary (show this trial or collapse it under "less likely matches"). POSSIBLE adds cognitive load to a panicking patient without buying useful signal. Pivot the default prompt to binary: - LIKELY: trial studies the patient's condition AND nothing clearly excludes them. Worth showing. - UNLIKELY: different disease, OR clearly wrong sex/age/population. Not worth showing. Explicit "be inclusive on LIKELY" instruction: missing subtype/ biomarker/stage info still counts as LIKELY since the patient or their doctor can verify. UNLIKELY is reserved for clear disqualification on something the patient DID state. Three worked examples: LIKELY (matched indication w/ unstated biomarker), UNLIKELY (wrong sex), UNLIKELY (wrong condition entirely). Parser side: POSSIBLE is still recognized for backward compatibility but normalized to LIKELY. Fixture trials keep their 3-class expected values (informationally rich) but the agreement check uses binary mapping (POSSIBLE expected counts as LIKELY). Markdown report uses the same binary mapping in the match column.

Gemma 2 2B has plateaued at 79% binary agreement on the harness with recurring hallucination of patient attributes ("user has AFib") that five prompt iterations didn't fix — looks like a structural limit of the small instruct model. Llama 3.2 3B Instruct is the next sensible rung: ~50% more parameters, much better instruction following per published benchmarks, and well-tested in the WebLLM ecosystem. q4f16_1 quant keeps the download to ~1.9 GB (vs 1.3 GB for current Gemma). Switch via ?model=llama32 in any IRIS URL — including the harness: http://localhost:5173/iris/?test=classify&model=llama32 First load downloads ~1.9 GB (cached after). Re-run the harness with the same prompt to see if accuracy improves on the same fixture.

Llama 3.2 3B run was catastrophic — 42% agreement, marked nearly everything UNLIKELY. Reasons revealed the failure mode: the model pattern-matched on the few-shot examples (all "62yo man with prostate cancer") and emitted fragments of the example patient as if they applied to the actual patient ("patient is a man", "trial is for atrial fibrillation in men; patient is a woman"). Small instruct models are notorious for this kind of example contamination. Two changes: 1. Register Qwen2.5-1.5B-Instruct (~900 MB, q4f16_1) as ?model=qwen25. Smaller than Gemma 2B but the Qwen2.5 series is known for strong instruction-following per parameter. Worth a shot before concluding small models can't do this. 2. Rewrite the four few-shot examples to use FOUR different patients (45yo woman ovarian, 70yo man T2D, 8yo child asthma, 55yo man hypertension) covering LIKELY/UNLIKELY/UNLIKELY/LIKELY. Explicit "each example uses a DIFFERENT patient — focus on reasoning, not patient details" header. Removes the gravity well that made Llama anchor. Run any model against the same fixture to compare. The example fix should help all models, not just Qwen.

Two follow-ups for Phase 3 validation: 1. ClassificationHarness: dropdown above the user description with 9 patient presets. Same 58yo woman with breast cancer in Boston in English, Spanish, Mandarin, Arabic, Portuguese, French, plus two English variants (terse "58F, BC, Boston" and a more detailed version) and a more-detailed Spanish. RTL direction auto-detected for Arabic. Lets us validate Qwen2.5-1.5B's multilingual handling without manually pasting translations. Selecting a preset replaces the description; manual edits show as "— custom —". 2. nlp.worker.js: commented-out appConfig stub for serving a custom model via prebuiltAppConfig extension. Walk-through comment points at the LoRA training doc in the vault. Drop-in: uncomment three lines in the worker, add a matching entry in nlpModels.js, swap via ?model=iris-finetune. No code path active until you flip the commented lines.

Spanish run on Qwen2.5-1.5B dropped 16 points vs English (67% vs 83%) because the model stops anchoring on the patient's stated condition when reasoning across languages — it confabulates "matches the patient's condition" for trials about prostate cancer, T2D, NSCLC. Translation is the model's strength (well-defined task) and runs ONCE per batch (the patient description), not per trial. So the cost is ~1s amortized across all N classifications. Adds a "translate to English first" checkbox to the harness controls. When enabled: 1. Before the classify queue starts, run a translation prompt on the user description. 2. Show the result in an iris-tinted callout above the stats. 3. Use the translated text for all per-trial classifications. If translation fails the run aborts with an alert (rare but possible). This is harness-only for now — the eventual app integration would trigger translation only when detectInputLanguage(userDescription) returns non-English. Path A (gate on en-only) is still on the table if translation doesn't recover enough accuracy.

Couldn't tell from the previous Spanish run whether the translate-first toggle was active because the markdown report omitted that state. Add both the toggle status and the resulting translated text to the report header so runs are unambiguously reproducible.

…lowup to a628b7a)

The headline agreement was being dragged down by stress-test trials the CT.gov API would never return for the user's stated condition (melanoma, NSCLC, T2D, HFpEF, prostate, etc. for a breast-cancer search). Headline 83% / Spanish 71% understated the actual production quality, which is closer to 94% on trials the user would actually see. - Tag 9 wrong-condition fixture trials with outOfScope: true - Add a "production-realistic agreement" checkbox (default ON) - When ON: agreement % only counts in-scope trials. Out-of-scope are still classified and shown in the table but excluded from the headline + agreement note explains how many were excluded. - When OFF: previous all-trials behavior (useful for stress-testing prompt changes). - Markdown report includes the toggle state and exclusion count so shared runs are unambiguous.

… TriageRow End-to-end Phase 3 integration. After search results land, kick off binary classification for every trial; surface verdicts as a small fit dot in each row + a live "evaluating fit · X of N" caption in the toolbar. Architecture: - New utils/classifyTrial.js holds DEFAULT_CLASSIFY_PROMPT and parseVerdict, shared between the harness and the in-app flow so the model sees identical prompts in both contexts. - ResultsList uses useNLP (idempotent — worker fast-returns ready if model already loaded by NL extraction) + useClassifier. - Classification only fires when: 1. iris_nlp_enabled === 'true' in localStorage (user previously consented to the on-device model — we never auto-load without consent), AND 2. patientDesc is available (from userDescription if NL was used, or synthesized from extractedFields), AND 3. WebGPU is supported. Structured-form-only sessions skip classification silently — no covert worker init, no model download surprise. - Pagination: only newly-arrived NCTs get classified (classifiedRef Set dedups). Search-param change resets state and cancels any in-flight batch. UI: - TriageRow grows a FitDot: • iris-violet filled circle = LIKELY (with reason in tooltip) • parchment-300 hollow ring = UNLIKELY (with reason) • shimmer dot = pending • nothing rendered for PARSE_FAIL (don't surface model errors to users) - Toolbar shows "evaluating fit · 7 of 24" while running, then "fit evaluated for 24" when done. What's NOT in this PR (deferred): - Sort by best fit (Handoff Phase 3 step 5) - Collapse UNLIKELY under "12 less likely matches" disclosure (step 6) Both need more UX consideration; the dots + caption are the safe first slice.

…ions Qwen2.5-1.5B was hallucinating Python tutorial fragments in the "What this study is testing" section because the simplifier prompt ended with a bare label ("PLAIN-LANGUAGE OUTPUT:") that chat-tuned models read as a topic header — then they pattern-match training data that follows similar headers and emit unrelated content. Two changes to both buildSummarizePrompt and buildAssessFitPrompt: 1. Rename the example block label from "PLAIN-LANGUAGE OUTPUT:" / "ASSESSMENT:" to "EXAMPLE PLAIN-LANGUAGE OUTPUT:" / "EXAMPLE ASSESSMENT:" — clarifies its role for the model. 2. Replace the trailing bare label with a directive instruction ("Now write the plain-language summary…", "Begin your response with \"## What this study is testing\":") that names the action and pins the first header. This is what chat-instruct models like Qwen and Gemma both expect — a clear command, not a label. Updated the language-reminder ordering test to look for the new trailing directive instead of the old label.

Same Gemma 2 2B model, just q4f16_1 quantization instead of q4f32_1: ~900 MB weights (vs 1.3 GB) and fp16 activations. WebGPU's native compute type is fp16 on most GPUs, so inference latency drops measurably with no quality difference on short instruction tasks. Registered as ?model=gemma_fast so it can be A/B tested against the current q4f32_1 default without forcing a re-download for anyone who already has the larger weights cached. Promote to default once validated in the harness + actual app flow.

User tested ?model=gemma_fast (gemma-2-2b-it-q4f16_1-MLC) and the simplifier produced malformed output. The reduced activation precision (fp16 vs fp32) loses the grip on the structured "## What this study is testing" / "## Who can join" section headers the streaming parser needs. Since WebLLM can only have one model loaded at a time and the simplifier is critical UX, mixing quants for different tasks isn't viable. Drop the entry from the registry; leave a comment so future- us doesn't re-discover the same dead end.

19e5d34 changed the trailing simplifier prompt label to a directive instruction ("Begin your response with \"## What this study is testing\":") to fix Qwen2.5-1.5B hallucinating training data. That fix worked for Qwen but broke Gemma 2B — Gemma got stuck in a repetition loop emitting bare ## patterns ("############## ## ## ## ## ##...") because the trailing instruction over-anchored on the ## header pattern. Since we ship Gemma by default and Qwen is opt-in (?model=qwen25, mainly for the harness), the right call is to restore the original prompt that Gemma is validated against. Qwen will continue to hallucinate in the simplifier — but that's a known issue with no production exposure as long as Gemma stays default. If we ever need to ship Qwen for production, the right fix is probably structural (proper chat-message turns with role:assistant for the example) rather than a trailing-label tweak.

…aise frequency_penalty Two issues from end-to-end testing: 1. Toolbar showed "evaluating fit · 0 of 10" while the simplifier had already produced output — classifier and simplifier were both queueing into the single-threaded worker on results land. The simplifier's eager-batch-of-5 ran first, the classifier got stuck behind it, and per-trial fit dots never appeared until the simplifier finished all 10 chunks of the eager batch. 2. Gemma 2 2B was hitting degenerate repetition loops on the simplify prompt — "## What this study is testing" header rendered correctly, then the body was just "############ ## ## ## ##..." for the full max_tokens budget. Two changes per Handoff Phase 3 step 6 ("Stage 2 simplification only fires for the currently-selected trial in the detail pane"): - ResultsList: removed the eager-batch-of-5 simplifier call. Simplifier now only enqueues for the currently-selected trial, and only after classification has completed (or is not running, e.g. structured- form-only sessions). Eliminates the classifier/simplifier race AND drastically cuts how often the simplifier runs (so when Gemma does hit a repetition loop, it affects one trial not five). - nlp.worker.js: bumped frequency_penalty for summarize from 0.3 → 0.6. Higher penalty discourages the same n-gram from re-firing, breaking the "## ## ##" loop. Assess-fit stays at 0 so its hedging language ("may", "might") doesn't get penalized.

User saw an empty "What this study is testing" area and an inaccurate summary, with no explanation of WHY the simplifier hadn't run yet (classifier was still working through 10 trials in the background). Adds an explicit pipeline-stage caption above the simplifier section in the detail pane: - Stage "classifying": iris-violet pill with shimmer dot: "evaluating fit · 3 of 10 · plain-language summary will follow" Tells the user the per-trial fit is still being computed and the summary is queued to come after. - Stage "awaiting-summary": parchment pill: "generating plain-language summary…" — fires after classification completes but before the selected trial's summary stream starts producing tokens. - Once tokens land (status === 'streaming' / 'complete'): caption disappears, real content renders. ResultsList computes the stage from canClassify + classifyDone + the simplifier's per-trial status, then passes pipelineStage + classifyProgress as new ResultCard props (pane mode only — rows already have their own fit dot). Doesn't address the underlying small-model accuracy issue (Gemma turning "Early Breast Cancer" into "spread to other parts" in the summary). That's the next escalation — short-term fix would be lower max_tokens for summarize; real fix is the LoRA pipeline in the vault doc.

Two changes that work together to set honest expectations about the on-device model's reliability: 1. Removed the "Why this might or might not fit you" section from the detail pane. Gemma 2B's accuracy on the fit narrative isn't reliable enough to ship — recent E2E testing showed it occasionally flips disease stage (early vs metastatic) or treatment history. The TriageRow fit dot (driven by the binary classifier in useClassifier) is the safer signal. ResultsList no longer enqueues assess_fit; ResultCard no longer renders the fit paragraph; the useSimplifier code path is preserved in case we re-enable it after a fine-tune. 2. Added an unconditional oncologist disclaimer above the contact block in pane mode: "Talk to your oncologist if you think you might qualify for this trial. The plain-language summary above is generated on-device by a small AI model — it can miss or misstate eligibility details. Your care team can confirm whether the trial fits your specific situation." Iris-tinted callout, always visible. Sets the right expectation: the AI summary is a starting point, not a recommendation. Test that previously asserted the fit paragraph rendered now asserts its absence. The previous "fit hidden when in error" test was redundant (the fit section is unconditionally hidden now) so it was folded into the single assertion.

Disclaimer reshaped per the brainstorm + user pick (variant 7 with user-rewritten surface line): Surface (always visible, in <summary>): "Check with your doctor when exploring treatment options — this AI summary uses plain language to explain the treatment but can miss eligibility details." with a small "why?" hint. Expandable (in <details> body): "The plain-language summary above was generated on your device by a small AI model. It can miss or misstate who qualifies for a trial. Your care team has your full medical picture and can confirm whether this one actually fits." Frames the AI as helper rather than decider, gives the in-crisis reader minimum cognitive load by default, full honesty + privacy beat one tap away. Group-open hides the "why?" hint when expanded so the chrome stays clean. Also carrying redesign-branch fix forward: tightened the auto-load gate in NaturalLanguageInput from (idle || downloading) → idle only. The widened gate caused redundant load() messages mid-download because the cleanup-reset pattern would re-fire the effect on every status flip while still in the gate. Worker dropped them as duplicates so harmless in practice, but cleaner this way. Same fix went in on redesign branch as 3c2e750; keeping branches in sync.

The sort UI was visible-but-permanently-disabled placeholders for "Best fit" / "Distance" / "Phase" / "Most recent" with tooltips like "Sort wiring coming in a follow-up." Reads as broken UI to users. Better to ship without than ship with disabled controls. When the wiring lands, restore the chips from git history. CT.gov v2 API supports sort= for distance (when location set) and LastUpdatePostDate; "Best fit" depends on having the classifier verdicts per-trial, which now exists post-Phase-3 wiring. Toolbar now shows just the search summary line + classification progress, which is the load-bearing info anyway.

Three of the deferred items from PR #1's review, addressed in one follow-up to keep the phase-3 PR from carrying open feedback debt: 1. useIsMobile uses matchMedia.change instead of window 'resize' (ResultsList.jsx:36-50). iOS Safari fires 'resize' inconsistently on rotation; matchMedia.change is the reliable signal and also catches iPad split-screen + browser-window mode switches. src/test/setup.js stubs matchMedia (jsdom doesn't ship it) so the existing ResultsList tests keep rendering the desktop two-pane path. 2. New regression test for the queued-submit drain in NaturalLanguageInput. Indirect coverage for the StrictMode listener re-attach fix in useNLP — if the listener doesn't reach the worker 'ready' message, the queued submit never fires and this test fails. The bug surfaced this session: status was getting stuck at 'downloading' forever in dev because the listener detached on StrictMode's first cleanup never re-attached. Now: pin the contract. 3. shared/iris-shared.jsx (478-line design reference, never imported by src/) moved to docs/design-references-shared/ with a README explaining its role. Stops future readers (LLM or human) from trying to "fix" or "consolidate" it as if it were live code. Contrast-check (#3 in the review): computed iris-700 on parchment-50 yields ~9.6:1 (WCAG AAA). iris-700 on iris-50 is similar (~8.5:1). All iris-violet links and the model badge clear AA easily; no palette change needed. Lighthouse can confirm at deploy time. Compare-state lift (#1 in the review): deferred to its own follow-up PR alongside the actual compare view (currently a placeholder).

User reported: hitting "Find trials" again with a refined prompt did nothing visible. The classifier kept its old verdicts; the toolbar sat at "fit evaluated for N" against a stale patient description. The classification-reset effect was watching only searchParams. If the refined prompt extracted to the same condition (e.g., "breast cancer" → still "breast cancer"), the API result set was cached and the trials list didn't change either. classifiedRef carried the old NCT IDs into the next render, so newTrials.length === 0 and the classification effect short-circuited. Two changes: - Add patientDesc to the reset effect's deps so a prompt change alone (no condition change) wipes classifications + classifiedRef and cancels the in-flight batch via cancelClassifyRef. - Cancel the simplifier's pending queue too — otherwise an in-flight summary keeps the worker busy while the re-classification waits for it to drain. Effect: change the prompt → classifications wipe → fit dots reset to shimmer → toolbar shows "evaluating fit · 0 of N" again → classifier re-runs against the new patient → fit dots refill.

Two correctness fixes from code review: 1. useClassifier.js:34-49 — listener cleanup now also rejects every in-flight classify in pendingRef. Previously: on unmount (or StrictMode dev cleanup) the listener detached but pendingRef Map still held resolve/reject handles — those promises hung forever. Now any awaiting caller sees a clean rejection with "classifier unmounted" so error paths fire and the queue clears. 2. ResultsList.jsx:118 — replaced `nlp` (whole hook return object) in the load-trigger effect deps with `nlpLoad` destructured. The useNLP hook doesn't memoize its return object, so each render produced a new ref → effect ran every render. The status guard prevented redundant load() calls but the body still ran. Now the effect only fires when load callback identity changes (it's useCallback'd with stable deps so essentially never), or when canClassify / status / modelKey change. Both surfaced by post-merge code review of PR #2; neither blocks shipping but both are trivial to land before merge.

Bundles all eight follow-up items the reviewer flagged on PR #2, plus the user-requested decision to disable classify-driven UI in the results view (the fit dots had no actionable consequence without sort wiring, so they were just decoration). User-visible - Classify-in-results gated behind ENABLE_CLASSIFY_IN_RESULTS = false (ResultsList.jsx). The classifier hook + worker task + ?test=classify harness all stay live for prompt iteration; only the in-app fit dots and "evaluating fit · X of N" caption are suppressed. Flip the constant to true once "Best fit" sort lands so the dots become actionable. Refactors / dedupe - Moved 305-line SAMPLE_TRIALS array + USER_PRESETS list out of ClassificationHarness.jsx into a sibling fixtures file (ClassificationHarness.fixtures.js). Harness file dropped from 924 → ~550 lines. - Harness now imports DEFAULT_CLASSIFY_PROMPT and parseVerdict from utils/classifyTrial.js instead of duplicating both verbatim. Single source of truth — prompt tweaks no longer have to be made in two places. - useIsMobile hoisted out of ResultsList.jsx into hooks/useIsMobile.js so any other component that needs the same breakpoint can import it without copy-pasting. Small fixes - patientDesc in ResultsList wrapped in useMemo (was recomputed every render — value-equality made it work but the implicit reliance was fragile). - FitDot in TriageRow folds the model's reason into aria-label so screen readers and keyboard-focused users get the same context as a sighted hover (title alone reaches neither group reliably). Same string in both attrs. Added role="img" since it carries semantic content now. New worker task type - Added 'translate' message type to nlp.worker.js. max_tokens 200 (vs classify's 80) so verbose-language paraphrases fit. Same low temperature (0.1) since translation wants fidelity, not creativity. - useClassifier now exposes both classifyOne and translateOne, sharing the single promise chain (engine is single-threaded regardless of task type). handleMessage routes done/error events for both via a single isDone/isError predicate. - Harness translate-first toggle now uses translateOne instead of overloading classifyOne — clarifies intent and lets the worker apply the right max_tokens budget. New tests - useClassifier.test.js: three tests covering serialization (concurrent calls post FIFO), error isolation (one rejection doesn't poison the queue), and unmount cleanup (pending tasks reject with 'classifier unmounted'). Mock the shared worker via vi.mock so the tests don't touch real WebLLM. 197/197 pass. Skipped from the review - "Simplifier idle gap during model load" — the gap is intentional per Handoff Phase 3 step 6 (stage-2 only after stage-1 completes), and the pipeline-stage caption already addresses the UX. Reviewer's suggestion to "let the simplifier proceed" would re-introduce the classifier-vs-simplifier worker contention we explicitly fixed.

PR #3 added 4 new lint hits on top of main's pre-existing baseline. All trivial — fixing them keeps the CI lint output clean for future reviewers (the workflow runs lint as continue-on-error so they don't block, but fewer ignorable lines is fewer ignorable lines). - ClassificationHarness.jsx:36 — setConcurrency unused since the concurrency dropdown was removed (serialization happens in the hook now). Drop the setter, keep the value as a const. - ResultCard.jsx:126 — showFit unused since the "Why this might or might not fit you" section was dropped. Drop the var; comment notes the path back if a fine-tuned model lets us re-introduce. - ResultsList.jsx:75 — wrap allTrials in useMemo. react-query keeps data ref stable across non-data renders so the memo identity is stable too; without the memo, every render produced a new array and effect dep arrays comparing against allTrials would have thrashed (the actual classify trigger effect depends on a derived trialKeyAll string so this was cosmetic, but cleaner this way). - useClassifier.js:73-74 — exhaustive-deps disable on the classifyOne/translateOne useCallbacks. runTask only closes over refs (stable); the linter can't see through that. Lint count: 29 → 24 (14 errors, 10 warnings — one fewer than main's baseline). All remaining are pre-existing.

johnoooh added 30 commits May 7, 2026 00:49

fix(phase-3): wire translate-first state through markdown report (fol…

c05ac74

…lowup to a628b7a)

johnoooh merged commit 1d6ad40 into main May 7, 2026
1 check passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Phase 3 classification harness#3

Phase 3 classification harness#3
johnoooh merged 31 commits into
mainfrom
phase-3-classification-harness

johnoooh commented May 7, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

johnoooh commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What's in

Validation results (Gemma 2 2B, English fixture, production-realistic mode)

Out of scope (deferred)

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

johnoooh commented May 7, 2026 •

edited

Loading