Phase 3 classification harness#3
Merged
Merged
Conversation
Implements the harness validation rig from Handoff Phase 3, ported from
the standalone HTML file into the React app so it reuses the already-
loaded worker (no second model download needed).
- worker: new 'classify' task type. One-shot completion, low max_tokens
(80), temperature 0.1, KV-cache reset between calls. Returns
{ type: 'classify_done', taskId, raw, latencyMs } or 'classify_error'.
- useClassifier hook: thin wrapper that posts 'classify' to the shared
worker, tracks pending tasks by taskId, resolves with raw + latency.
Caller parses the verdict.
- ClassificationHarness component: ports the standalone harness UI to
React + Tailwind. Inputs: user description, prompt template, trials
JSON (fixture preset), concurrency (1/2/3/5), eligibility max chars.
Renders results table with verdict pill (LIKELY/POSSIBLE/UNLIKELY/
PARSE_FAIL), latency, raw output, expected vs actual checkmark.
Stats: parse rate, avg/max latency, agreement.
- App.jsx: dev-only ?test=classify route, lazy-loaded so production
bundle stays unaffected.
Pass criteria from Handoff (parse ≥90%, avg latency <1.5s, agreement
≥80%) shown inline so the user can validate before deciding to wire
classification into the actual results UI.
This is the validation gate, not the in-app classification pipeline
itself. Once the harness numbers look good, a follow-up PR adds
classifyAll() to ResultsList and the fit meter to TriageRow.
…l-safe User reported 33% agreement, 0ms latencies, and "Message error should not be 0" failures from the harness. Root cause: WebLLM's MLCEngine is single-threaded; concurrent engine.chat.completions.create() calls collide on the engine's internal KV cache and sampling state, returning errors instantly. The harness was firing 3 in parallel by default, so the first 3 always failed, the next batch sometimes squeezed through. Fix at the hook level via a promise chain in useClassifier — each request waits for the previous one to settle. Caller-side concurrency becomes a no-op for actual parallelism but still controls queue depth. This matches how useSimplifier already serializes through a single inFlightRef. The .catch() on the chain prevents one failure from breaking the whole queue. Harness UI: replaced the misleading concurrency dropdown with a static caption noting the engine constraint. Re-running with the fixture should now produce real latencies (1-2s per trial on mid-range hardware) and meaningful agreement numbers.
The original 6-trial smoke test wasn't enough surface to evaluate prompt changes — every tweak swung agreement by ±17% on a single trial. Add 14 trials grouped by failure mode: - Subtype-gated breast cancer trials (3): Tucatinib HER2+, Olaparib BRCA, CDK4/6 switch HR+. All POSSIBLE for a generic "breast cancer" user — exercises whether the model penalizes unknown subtype. - Strong matches for a 58yo with breast cancer (4): CBT for fatigue, lymphedema surveillance, mindfulness for survivors, vaginal estrogen in postmenopausal survivors. All LIKELY — these are the "obvious yes" cases that should never get marked POSSIBLE. - Wrong condition / wrong demographic (5): melanoma, AFib, Type 2 diabetes, prostate cancer, pediatric vaccines. All UNLIKELY — these are the "obvious no" cases that catastrophic-false-negative would hide. - Edge cases (2): palliative care for any advanced solid tumor (POSSIBLE — depends on stage which is unknown), and premenopausal breast cancer (UNLIKELY — 58yo is almost certainly postmenopausal, the trial explicitly excludes postmenopausal). Distribution: 6 LIKELY, 6 POSSIBLE, 8 UNLIKELY. The Handoff calls for 50+ trials for real validation; 20 is a workable middle ground for prompt iteration without crushing test cycle time.
The fixture trials use 200-400 char eligibility blurbs, which masks two real-world failure modes: latency growth with longer prompts and the model losing the signal in CT.gov noise (formal numbered lists, lab cutoffs, prior-therapy specifics, washout periods). Add 4 trials with realistic CT.gov-style eligibility (~2-3.5kB each) spanning all three verdicts: - LONG-01 Sacituzumab metastatic HR+/HER2- after CDK4/6 (POSSIBLE): subtype-gated, 12 inclusion + 12 exclusion criteria with full organ-function lab ranges. Tests whether the model still penalizes unknown HR/HER2 status when buried under three pages of text. - LONG-02 Adjuvant abemaciclib node-positive postmenopausal (LIKELY): the postmenopausal woman with HR+ early breast cancer at high risk case. Many specific lab cutoffs and prior-therapy washout rules. Tests whether the model picks up on the clear inclusion match without getting confused by exclusion noise. - LONG-03 Pembrolizumab metastatic squamous NSCLC (UNLIKELY): wrong cancer entirely, but mentions "breast" once in context of allowed prior malignancies. Tests whether the model anchors on incidental mentions vs the core indication. - LONG-04 Empagliflozin HFpEF + T2DM (UNLIKELY): wrong condition entirely (cardiology + diabetes), but excludes only "active malignancy within 12 months" — disease-free history is allowed. Tests whether the model conflates "cancer history allowed" with "this trial is for cancer patients." Combined with the eligMax knob (default 1500), these let you sweep: 800 → 1500 → 3000 → 6000 chars and watch latency vs accuracy. The Handoff suggests dropping to 800 if latency is a problem; this is where you'd validate that.
After a run completes, surfaces a button in the stats row that copies a formatted markdown report to the clipboard. Includes: - Model + user description + eligMax (so a different prompt run is reproducible) - Stats table (done, elapsed, latencies, parse rate, agreement) - Per-trial results table (verdict, expected, match check, latency, reason) - Collapsed prompt template in a <details> block Pipes pretty quickly into chat or a PR comment for sharing harness runs without screenshotting. Confirmation state on the button (✓ copied / copy failed) for 1.8-2.4s.
Initial run had 67% agreement with two LIKELY-vs-POSSIBLE drift errors on subtype-gated trials (Sacituzumab metastatic TNBC, HER2+ radiation boost). Both required subtype the user hadn't confirmed; the model called them LIKELY because "breast cancer" matched the broad indication. Update the default prompt to make the verdict ladder explicit: - LIKELY = slam-dunk, every gating criterion confirmed - POSSIBLE = broad condition matches, key criteria unknown - UNLIKELY = clear mismatch (wrong condition / sex / age / comorbidity) Also tightens the response format requirement (one line, pipe-separated, reason required) since 2/6 outputs in the first run dropped the reason. "raw: LIKELY" with no reason isn't useful for a user reading the fit meter; tightening here propagates to the eventual in-app integration.
Run 2 (verbose-bias prompt) dropped to 50% agreement with three new
failure modes:
1. Model hallucinated user attributes ("user has prior radiation",
"user is HR-positive") that weren't in the description, then
reasoned about them. Driving false UNLIKELY verdicts.
2. Model misread exclusion semantics on Lymphedema Surveillance —
said the user "does not meet the exclusion criteria" as a NEGATIVE,
producing a catastrophic false UNLIKELY (the user-doesn't-see-it
failure mode the Handoff specifically calls out).
3. UNLIKELY filter went leaky — melanoma, NSCLC, HFpEF + T2DM all
moved to POSSIBLE on "user's age is within the range" alone,
ignoring the actual disease mismatch.
New prompt structure:
- Explicit "user description is ONLY what you know — do not assume"
guard against hallucination
- Five ordered rules: condition mismatch → UNLIKELY (highest
priority), demographic exclusion → UNLIKELY, missing-info →
POSSIBLE, all-met → LIKELY, default POSSIBLE
- Explicit "UNLIKELY ≠ many requirements; UNLIKELY = user is clearly
disqualified" to fix the exclusion-semantics confusion
Trade-off: slightly longer than the original baseline prompt, but
more declarative than the verbose run-2 prompt. Latency should sit
between the two.
… rules Run 3 (rules-list) hit 54% with three failure modes: 1. The "VERDICT | reason" instruction made Gemma emit the literal word "VERDICT" as a placeholder name, dropping the actual verdict from most outputs. raw: "VERDICT:" with no verdict, or "VERDICT: LIKELY" with no reason. 2. Five ordered rules were too many for Gemma 2B — it appears to default to "user meets all requirements -> LIKELY" and stop reasoning. Pembrolizumab Melanoma flipped to LIKELY, four subtype-gated breast cancer trials flipped to LIKELY. 3. Long eligibility (LONG-03 NSCLC, LONG-04 HFpEF) degraded the wrong-condition signal — both flipped POSSIBLE/LIKELY despite the shorter NSCLC version correctly classifying UNLIKELY. New strategy: - Drop the literal "VERDICT" placeholder. Use "<LABEL> | <reason>" with angle brackets so it reads as a template, not a literal. - Drop the numbered rules. Use 3 short bullet definitions instead. - Add 3 worked examples covering POSSIBLE / UNLIKELY / LIKELY for a consistent patient profile (62yo man, prostate cancer). Examples demonstrate the no-assume rule, the wrong-condition rule, and the positive-match rule by behavior, not by abstract instruction. - "Now classify:" cue separates examples from the actual task, reducing the chance Gemma copies the example patient. Trade-off: prompt is longer, latency may rise modestly. If accuracy holds, that's a fair price.
Run 4 (3-class few-shot) hit 67%, but mapping LIKELY+POSSIBLE -> SHOW already gave 88% on the same outputs. The LIKELY-vs-POSSIBLE judgment is the source of most disagreement noise and isn't actually a product question — the user-facing decision is binary (show this trial or collapse it under "less likely matches"). POSSIBLE adds cognitive load to a panicking patient without buying useful signal. Pivot the default prompt to binary: - LIKELY: trial studies the patient's condition AND nothing clearly excludes them. Worth showing. - UNLIKELY: different disease, OR clearly wrong sex/age/population. Not worth showing. Explicit "be inclusive on LIKELY" instruction: missing subtype/ biomarker/stage info still counts as LIKELY since the patient or their doctor can verify. UNLIKELY is reserved for clear disqualification on something the patient DID state. Three worked examples: LIKELY (matched indication w/ unstated biomarker), UNLIKELY (wrong sex), UNLIKELY (wrong condition entirely). Parser side: POSSIBLE is still recognized for backward compatibility but normalized to LIKELY. Fixture trials keep their 3-class expected values (informationally rich) but the agreement check uses binary mapping (POSSIBLE expected counts as LIKELY). Markdown report uses the same binary mapping in the match column.
Gemma 2 2B has plateaued at 79% binary agreement on the harness with
recurring hallucination of patient attributes ("user has AFib") that
five prompt iterations didn't fix — looks like a structural limit of
the small instruct model.
Llama 3.2 3B Instruct is the next sensible rung: ~50% more
parameters, much better instruction following per published
benchmarks, and well-tested in the WebLLM ecosystem. q4f16_1 quant
keeps the download to ~1.9 GB (vs 1.3 GB for current Gemma).
Switch via ?model=llama32 in any IRIS URL — including the harness:
http://localhost:5173/iris/?test=classify&model=llama32
First load downloads ~1.9 GB (cached after). Re-run the harness with
the same prompt to see if accuracy improves on the same fixture.
Llama 3.2 3B run was catastrophic — 42% agreement, marked nearly
everything UNLIKELY. Reasons revealed the failure mode: the model
pattern-matched on the few-shot examples (all "62yo man with prostate
cancer") and emitted fragments of the example patient as if they
applied to the actual patient ("patient is a man", "trial is for
atrial fibrillation in men; patient is a woman"). Small instruct
models are notorious for this kind of example contamination.
Two changes:
1. Register Qwen2.5-1.5B-Instruct (~900 MB, q4f16_1) as ?model=qwen25.
Smaller than Gemma 2B but the Qwen2.5 series is known for strong
instruction-following per parameter. Worth a shot before
concluding small models can't do this.
2. Rewrite the four few-shot examples to use FOUR different patients
(45yo woman ovarian, 70yo man T2D, 8yo child asthma, 55yo man
hypertension) covering LIKELY/UNLIKELY/UNLIKELY/LIKELY. Explicit
"each example uses a DIFFERENT patient — focus on reasoning, not
patient details" header. Removes the gravity well that made Llama
anchor.
Run any model against the same fixture to compare. The example fix
should help all models, not just Qwen.
Two follow-ups for Phase 3 validation: 1. ClassificationHarness: dropdown above the user description with 9 patient presets. Same 58yo woman with breast cancer in Boston in English, Spanish, Mandarin, Arabic, Portuguese, French, plus two English variants (terse "58F, BC, Boston" and a more detailed version) and a more-detailed Spanish. RTL direction auto-detected for Arabic. Lets us validate Qwen2.5-1.5B's multilingual handling without manually pasting translations. Selecting a preset replaces the description; manual edits show as "— custom —". 2. nlp.worker.js: commented-out appConfig stub for serving a custom model via prebuiltAppConfig extension. Walk-through comment points at the LoRA training doc in the vault. Drop-in: uncomment three lines in the worker, add a matching entry in nlpModels.js, swap via ?model=iris-finetune. No code path active until you flip the commented lines.
Spanish run on Qwen2.5-1.5B dropped 16 points vs English (67% vs 83%) because the model stops anchoring on the patient's stated condition when reasoning across languages — it confabulates "matches the patient's condition" for trials about prostate cancer, T2D, NSCLC. Translation is the model's strength (well-defined task) and runs ONCE per batch (the patient description), not per trial. So the cost is ~1s amortized across all N classifications. Adds a "translate to English first" checkbox to the harness controls. When enabled: 1. Before the classify queue starts, run a translation prompt on the user description. 2. Show the result in an iris-tinted callout above the stats. 3. Use the translated text for all per-trial classifications. If translation fails the run aborts with an alert (rare but possible). This is harness-only for now — the eventual app integration would trigger translation only when detectInputLanguage(userDescription) returns non-English. Path A (gate on en-only) is still on the table if translation doesn't recover enough accuracy.
Couldn't tell from the previous Spanish run whether the translate-first toggle was active because the markdown report omitted that state. Add both the toggle status and the resulting translated text to the report header so runs are unambiguously reproducible.
The headline agreement was being dragged down by stress-test trials the CT.gov API would never return for the user's stated condition (melanoma, NSCLC, T2D, HFpEF, prostate, etc. for a breast-cancer search). Headline 83% / Spanish 71% understated the actual production quality, which is closer to 94% on trials the user would actually see. - Tag 9 wrong-condition fixture trials with outOfScope: true - Add a "production-realistic agreement" checkbox (default ON) - When ON: agreement % only counts in-scope trials. Out-of-scope are still classified and shown in the table but excluded from the headline + agreement note explains how many were excluded. - When OFF: previous all-trials behavior (useful for stress-testing prompt changes). - Markdown report includes the toggle state and exclusion count so shared runs are unambiguous.
… TriageRow
End-to-end Phase 3 integration. After search results land, kick off
binary classification for every trial; surface verdicts as a small
fit dot in each row + a live "evaluating fit · X of N" caption in
the toolbar.
Architecture:
- New utils/classifyTrial.js holds DEFAULT_CLASSIFY_PROMPT and
parseVerdict, shared between the harness and the in-app flow so
the model sees identical prompts in both contexts.
- ResultsList uses useNLP (idempotent — worker fast-returns ready
if model already loaded by NL extraction) + useClassifier.
- Classification only fires when:
1. iris_nlp_enabled === 'true' in localStorage (user previously
consented to the on-device model — we never auto-load without
consent), AND
2. patientDesc is available (from userDescription if NL was used,
or synthesized from extractedFields), AND
3. WebGPU is supported.
Structured-form-only sessions skip classification silently — no
covert worker init, no model download surprise.
- Pagination: only newly-arrived NCTs get classified (classifiedRef
Set dedups). Search-param change resets state and cancels any
in-flight batch.
UI:
- TriageRow grows a FitDot:
• iris-violet filled circle = LIKELY (with reason in tooltip)
• parchment-300 hollow ring = UNLIKELY (with reason)
• shimmer dot = pending
• nothing rendered for PARSE_FAIL (don't surface model errors to users)
- Toolbar shows "evaluating fit · 7 of 24" while running, then
"fit evaluated for 24" when done.
What's NOT in this PR (deferred):
- Sort by best fit (Handoff Phase 3 step 5)
- Collapse UNLIKELY under "12 less likely matches" disclosure (step 6)
Both need more UX consideration; the dots + caption are the safe
first slice.
…ions
Qwen2.5-1.5B was hallucinating Python tutorial fragments in the
"What this study is testing" section because the simplifier prompt
ended with a bare label ("PLAIN-LANGUAGE OUTPUT:") that chat-tuned
models read as a topic header — then they pattern-match training
data that follows similar headers and emit unrelated content.
Two changes to both buildSummarizePrompt and buildAssessFitPrompt:
1. Rename the example block label from "PLAIN-LANGUAGE OUTPUT:" /
"ASSESSMENT:" to "EXAMPLE PLAIN-LANGUAGE OUTPUT:" /
"EXAMPLE ASSESSMENT:" — clarifies its role for the model.
2. Replace the trailing bare label with a directive instruction
("Now write the plain-language summary…", "Begin your response
with \"## What this study is testing\":") that names the action
and pins the first header. This is what chat-instruct models like
Qwen and Gemma both expect — a clear command, not a label.
Updated the language-reminder ordering test to look for the new
trailing directive instead of the old label.
Same Gemma 2 2B model, just q4f16_1 quantization instead of q4f32_1: ~900 MB weights (vs 1.3 GB) and fp16 activations. WebGPU's native compute type is fp16 on most GPUs, so inference latency drops measurably with no quality difference on short instruction tasks. Registered as ?model=gemma_fast so it can be A/B tested against the current q4f32_1 default without forcing a re-download for anyone who already has the larger weights cached. Promote to default once validated in the harness + actual app flow.
User tested ?model=gemma_fast (gemma-2-2b-it-q4f16_1-MLC) and the simplifier produced malformed output. The reduced activation precision (fp16 vs fp32) loses the grip on the structured "## What this study is testing" / "## Who can join" section headers the streaming parser needs. Since WebLLM can only have one model loaded at a time and the simplifier is critical UX, mixing quants for different tasks isn't viable. Drop the entry from the registry; leave a comment so future- us doesn't re-discover the same dead end.
19e5d34 changed the trailing simplifier prompt label to a directive instruction ("Begin your response with \"## What this study is testing\":") to fix Qwen2.5-1.5B hallucinating training data. That fix worked for Qwen but broke Gemma 2B — Gemma got stuck in a repetition loop emitting bare ## patterns ("############## ## ## ## ## ##...") because the trailing instruction over-anchored on the ## header pattern. Since we ship Gemma by default and Qwen is opt-in (?model=qwen25, mainly for the harness), the right call is to restore the original prompt that Gemma is validated against. Qwen will continue to hallucinate in the simplifier — but that's a known issue with no production exposure as long as Gemma stays default. If we ever need to ship Qwen for production, the right fix is probably structural (proper chat-message turns with role:assistant for the example) rather than a trailing-label tweak.
…aise frequency_penalty
Two issues from end-to-end testing:
1. Toolbar showed "evaluating fit · 0 of 10" while the simplifier had
already produced output — classifier and simplifier were both
queueing into the single-threaded worker on results land. The
simplifier's eager-batch-of-5 ran first, the classifier got stuck
behind it, and per-trial fit dots never appeared until the
simplifier finished all 10 chunks of the eager batch.
2. Gemma 2 2B was hitting degenerate repetition loops on the
simplify prompt — "## What this study is testing" header rendered
correctly, then the body was just "############ ## ## ## ##..."
for the full max_tokens budget.
Two changes per Handoff Phase 3 step 6 ("Stage 2 simplification only
fires for the currently-selected trial in the detail pane"):
- ResultsList: removed the eager-batch-of-5 simplifier call. Simplifier
now only enqueues for the currently-selected trial, and only after
classification has completed (or is not running, e.g. structured-
form-only sessions). Eliminates the classifier/simplifier race AND
drastically cuts how often the simplifier runs (so when Gemma does
hit a repetition loop, it affects one trial not five).
- nlp.worker.js: bumped frequency_penalty for summarize from 0.3 → 0.6.
Higher penalty discourages the same n-gram from re-firing, breaking
the "## ## ##" loop. Assess-fit stays at 0 so its hedging language
("may", "might") doesn't get penalized.
User saw an empty "What this study is testing" area and an inaccurate summary, with no explanation of WHY the simplifier hadn't run yet (classifier was still working through 10 trials in the background). Adds an explicit pipeline-stage caption above the simplifier section in the detail pane: - Stage "classifying": iris-violet pill with shimmer dot: "evaluating fit · 3 of 10 · plain-language summary will follow" Tells the user the per-trial fit is still being computed and the summary is queued to come after. - Stage "awaiting-summary": parchment pill: "generating plain-language summary…" — fires after classification completes but before the selected trial's summary stream starts producing tokens. - Once tokens land (status === 'streaming' / 'complete'): caption disappears, real content renders. ResultsList computes the stage from canClassify + classifyDone + the simplifier's per-trial status, then passes pipelineStage + classifyProgress as new ResultCard props (pane mode only — rows already have their own fit dot). Doesn't address the underlying small-model accuracy issue (Gemma turning "Early Breast Cancer" into "spread to other parts" in the summary). That's the next escalation — short-term fix would be lower max_tokens for summarize; real fix is the LoRA pipeline in the vault doc.
Two changes that work together to set honest expectations about the on-device model's reliability: 1. Removed the "Why this might or might not fit you" section from the detail pane. Gemma 2B's accuracy on the fit narrative isn't reliable enough to ship — recent E2E testing showed it occasionally flips disease stage (early vs metastatic) or treatment history. The TriageRow fit dot (driven by the binary classifier in useClassifier) is the safer signal. ResultsList no longer enqueues assess_fit; ResultCard no longer renders the fit paragraph; the useSimplifier code path is preserved in case we re-enable it after a fine-tune. 2. Added an unconditional oncologist disclaimer above the contact block in pane mode: "Talk to your oncologist if you think you might qualify for this trial. The plain-language summary above is generated on-device by a small AI model — it can miss or misstate eligibility details. Your care team can confirm whether the trial fits your specific situation." Iris-tinted callout, always visible. Sets the right expectation: the AI summary is a starting point, not a recommendation. Test that previously asserted the fit paragraph rendered now asserts its absence. The previous "fit hidden when in error" test was redundant (the fit section is unconditionally hidden now) so it was folded into the single assertion.
Disclaimer reshaped per the brainstorm + user pick (variant 7 with
user-rewritten surface line):
Surface (always visible, in <summary>):
"Check with your doctor when exploring treatment options —
this AI summary uses plain language to explain the treatment
but can miss eligibility details." with a small "why?" hint.
Expandable (in <details> body):
"The plain-language summary above was generated on your device
by a small AI model. It can miss or misstate who qualifies for
a trial. Your care team has your full medical picture and can
confirm whether this one actually fits."
Frames the AI as helper rather than decider, gives the in-crisis
reader minimum cognitive load by default, full honesty + privacy
beat one tap away. Group-open hides the "why?" hint when expanded
so the chrome stays clean.
Also carrying redesign-branch fix forward: tightened the auto-load
gate in NaturalLanguageInput from (idle || downloading) → idle
only. The widened gate caused redundant load() messages mid-download
because the cleanup-reset pattern would re-fire the effect on every
status flip while still in the gate. Worker dropped them as
duplicates so harmless in practice, but cleaner this way. Same fix
went in on redesign branch as 3c2e750; keeping branches in sync.
The sort UI was visible-but-permanently-disabled placeholders for "Best fit" / "Distance" / "Phase" / "Most recent" with tooltips like "Sort wiring coming in a follow-up." Reads as broken UI to users. Better to ship without than ship with disabled controls. When the wiring lands, restore the chips from git history. CT.gov v2 API supports sort= for distance (when location set) and LastUpdatePostDate; "Best fit" depends on having the classifier verdicts per-trial, which now exists post-Phase-3 wiring. Toolbar now shows just the search summary line + classification progress, which is the load-bearing info anyway.
Three of the deferred items from PR #1's review, addressed in one follow-up to keep the phase-3 PR from carrying open feedback debt: 1. useIsMobile uses matchMedia.change instead of window 'resize' (ResultsList.jsx:36-50). iOS Safari fires 'resize' inconsistently on rotation; matchMedia.change is the reliable signal and also catches iPad split-screen + browser-window mode switches. src/test/setup.js stubs matchMedia (jsdom doesn't ship it) so the existing ResultsList tests keep rendering the desktop two-pane path. 2. New regression test for the queued-submit drain in NaturalLanguageInput. Indirect coverage for the StrictMode listener re-attach fix in useNLP — if the listener doesn't reach the worker 'ready' message, the queued submit never fires and this test fails. The bug surfaced this session: status was getting stuck at 'downloading' forever in dev because the listener detached on StrictMode's first cleanup never re-attached. Now: pin the contract. 3. shared/iris-shared.jsx (478-line design reference, never imported by src/) moved to docs/design-references-shared/ with a README explaining its role. Stops future readers (LLM or human) from trying to "fix" or "consolidate" it as if it were live code. Contrast-check (#3 in the review): computed iris-700 on parchment-50 yields ~9.6:1 (WCAG AAA). iris-700 on iris-50 is similar (~8.5:1). All iris-violet links and the model badge clear AA easily; no palette change needed. Lighthouse can confirm at deploy time. Compare-state lift (#1 in the review): deferred to its own follow-up PR alongside the actual compare view (currently a placeholder).
User reported: hitting "Find trials" again with a refined prompt did nothing visible. The classifier kept its old verdicts; the toolbar sat at "fit evaluated for N" against a stale patient description. The classification-reset effect was watching only searchParams. If the refined prompt extracted to the same condition (e.g., "breast cancer" → still "breast cancer"), the API result set was cached and the trials list didn't change either. classifiedRef carried the old NCT IDs into the next render, so newTrials.length === 0 and the classification effect short-circuited. Two changes: - Add patientDesc to the reset effect's deps so a prompt change alone (no condition change) wipes classifications + classifiedRef and cancels the in-flight batch via cancelClassifyRef. - Cancel the simplifier's pending queue too — otherwise an in-flight summary keeps the worker busy while the re-classification waits for it to drain. Effect: change the prompt → classifications wipe → fit dots reset to shimmer → toolbar shows "evaluating fit · 0 of N" again → classifier re-runs against the new patient → fit dots refill.
Two correctness fixes from code review: 1. useClassifier.js:34-49 — listener cleanup now also rejects every in-flight classify in pendingRef. Previously: on unmount (or StrictMode dev cleanup) the listener detached but pendingRef Map still held resolve/reject handles — those promises hung forever. Now any awaiting caller sees a clean rejection with "classifier unmounted" so error paths fire and the queue clears. 2. ResultsList.jsx:118 — replaced `nlp` (whole hook return object) in the load-trigger effect deps with `nlpLoad` destructured. The useNLP hook doesn't memoize its return object, so each render produced a new ref → effect ran every render. The status guard prevented redundant load() calls but the body still ran. Now the effect only fires when load callback identity changes (it's useCallback'd with stable deps so essentially never), or when canClassify / status / modelKey change. Both surfaced by post-merge code review of PR #2; neither blocks shipping but both are trivial to land before merge.
Bundles all eight follow-up items the reviewer flagged on PR #2, plus the user-requested decision to disable classify-driven UI in the results view (the fit dots had no actionable consequence without sort wiring, so they were just decoration). User-visible - Classify-in-results gated behind ENABLE_CLASSIFY_IN_RESULTS = false (ResultsList.jsx). The classifier hook + worker task + ?test=classify harness all stay live for prompt iteration; only the in-app fit dots and "evaluating fit · X of N" caption are suppressed. Flip the constant to true once "Best fit" sort lands so the dots become actionable. Refactors / dedupe - Moved 305-line SAMPLE_TRIALS array + USER_PRESETS list out of ClassificationHarness.jsx into a sibling fixtures file (ClassificationHarness.fixtures.js). Harness file dropped from 924 → ~550 lines. - Harness now imports DEFAULT_CLASSIFY_PROMPT and parseVerdict from utils/classifyTrial.js instead of duplicating both verbatim. Single source of truth — prompt tweaks no longer have to be made in two places. - useIsMobile hoisted out of ResultsList.jsx into hooks/useIsMobile.js so any other component that needs the same breakpoint can import it without copy-pasting. Small fixes - patientDesc in ResultsList wrapped in useMemo (was recomputed every render — value-equality made it work but the implicit reliance was fragile). - FitDot in TriageRow folds the model's reason into aria-label so screen readers and keyboard-focused users get the same context as a sighted hover (title alone reaches neither group reliably). Same string in both attrs. Added role="img" since it carries semantic content now. New worker task type - Added 'translate' message type to nlp.worker.js. max_tokens 200 (vs classify's 80) so verbose-language paraphrases fit. Same low temperature (0.1) since translation wants fidelity, not creativity. - useClassifier now exposes both classifyOne and translateOne, sharing the single promise chain (engine is single-threaded regardless of task type). handleMessage routes done/error events for both via a single isDone/isError predicate. - Harness translate-first toggle now uses translateOne instead of overloading classifyOne — clarifies intent and lets the worker apply the right max_tokens budget. New tests - useClassifier.test.js: three tests covering serialization (concurrent calls post FIFO), error isolation (one rejection doesn't poison the queue), and unmount cleanup (pending tasks reject with 'classifier unmounted'). Mock the shared worker via vi.mock so the tests don't touch real WebLLM. 197/197 pass. Skipped from the review - "Simplifier idle gap during model load" — the gap is intentional per Handoff Phase 3 step 6 (stage-2 only after stage-1 completes), and the pipeline-stage caption already addresses the UX. Reviewer's suggestion to "let the simplifier proceed" would re-introduce the classifier-vs-simplifier worker contention we explicitly fixed.
PR #3 added 4 new lint hits on top of main's pre-existing baseline. All trivial — fixing them keeps the CI lint output clean for future reviewers (the workflow runs lint as continue-on-error so they don't block, but fewer ignorable lines is fewer ignorable lines). - ClassificationHarness.jsx:36 — setConcurrency unused since the concurrency dropdown was removed (serialization happens in the hook now). Drop the setter, keep the value as a const. - ResultCard.jsx:126 — showFit unused since the "Why this might or might not fit you" section was dropped. Drop the var; comment notes the path back if a fine-tuned model lets us re-introduce. - ResultsList.jsx:75 — wrap allTrials in useMemo. react-query keeps data ref stable across non-data renders so the memo identity is stable too; without the memo, every render produced a new array and effect dep arrays comparing against allTrials would have thrashed (the actual classify trigger effect depends on a derived trialKeyAll string so this was cosmetic, but cleaner this way). - useClassifier.js:73-74 — exhaustive-deps disable on the classifyOne/translateOne useCallbacks. runTask only closes over refs (stable); the linter can't see through that. Lint count: 29 → 24 (14 errors, 10 warnings — one fewer than main's baseline). All remaining are pre-existing.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Validates the Stage-1 binary classifier from the Handoff end-to-end (worker, hook, prompt, multilingual + production-realistic harness) and lays the groundwork to surface it in the results UI — but the in-app fit dots and "evaluating fit" caption are gated behind a feature flag (
ENABLE_CLASSIFY_IN_RESULTS = false) until "Best fit" sort is wired. Without sort, the dots don't drive any user-visible behavior, so they were just decoration.The harness at
?test=classifyruns the full pipeline against a 24-trial fixture (validated 93% in-scope binary agreement on Gemma 2 2B with zero catastrophic UNLIKELY). The custom-model stub for future LoRA work is in place. Detail-pane disclaimer was redesigned to a two-tier disclosure to set honest expectations about model accuracy.Default model stays Gemma 2 2B (q4f32_1, ~1.3 GB). Qwen2.5-1.5B (
?model=qwen25) and Llama 3.2 3B (?model=llama32) registered as opt-in alternatives for prompt validation in the harness.What's in
Validation harness (dev-only,
?test=classify)Classification Harness.htmlreusing the live worker — no second model download, just shares the engine that the NL flow already loaded.ClassificationHarness.fixtures.jsso prompt edits aren't buried under data.translateworker task type with higher max_tokens.Underlying machinery (worker + hook + utils)
src/workers/nlp.worker.js— newclassifyandtranslatetask types. Both are one-shot, low-temperature, with task-type-appropriatemax_tokens(80 for classify, 200 for translate).src/hooks/useClassifier.js—classifyOneandtranslateOne, both serialized through a single promise chain because WebLLM's MLCEngine is single-threaded..catch(()=>{})prevents one rejection from poisoning the queue. Listener cleanup rejects pending tasks on unmount so awaiting callers don't hang.src/utils/classifyTrial.js—DEFAULT_CLASSIFY_PROMPT+parseVerdictshared by the harness and the (gated) in-app flow. Single source of truth — prompt tweaks no longer have to be made in two places.In-app classification (gated behind
ENABLE_CLASSIFY_IN_RESULTS = false)The wiring exists in
ResultsList.jsxandTriageRow.jsxbut is currently inert. When flipped on:FitDotper row (iris-violet filled circle for LIKELY, hollow ring for UNLIKELY, shimmer for pending; reason folded intoaria-label).evaluating fit · 7 of 24→fit evaluated for 24.classifiedRef; search-param OR patient-description change resets state and cancels in-flight batch.Detail pane
<details>/<summary>with a one-line surface ("Check with your doctor when exploring treatment options — this AI summary uses plain language to explain the treatment but can miss eligibility details.") and awhy?expansion explaining the on-device model and its limits.Custom-model stub
appConfigblock innlp.worker.jsshows the exact shape needed to serve a fine-tuned MLC model from HuggingFace Hub. Pointer comment references the LoRA training process doc in the project vault.Worker reliability
frequency_penaltyfrom 0.3 → 0.6 to break Gemma out of##repetition collapse loops on the simplifier prompt.Validation results (Gemma 2 2B, English fixture, production-realistic mode)
Spanish + translate-first: 71% (vs 67% direct Spanish). Translation helped marginally but didn't bridge the gap to English. Multilingual is best-effort for now; English-only gating is a 1-line change if it proves needed in user testing.
Out of scope (deferred)
ENABLE_CLASSIFY_IN_RESULTSto true so the dots become actionable.Compare →button is still disabled. Compare state lift toAppis queued for that follow-up.Test plan
npm run test:run— 197/197 pass (addeduseClassifier.test.jsfor serialization, error isolation, unmount cleanup; added queued-submit drain regression test inNaturalLanguageInput.test.jsx)npm run build— cleannpm run lint— only pre-existing errors / warnings