diff --git a/docs/EVALUATION_PLAN.md b/docs/EVALUATION_PLAN.md
index 59fce736..5f8bb532 100644
--- a/docs/EVALUATION_PLAN.md
+++ b/docs/EVALUATION_PLAN.md
@@ -1127,10 +1127,20 @@ The evaluation's most serious flaw is not operational — it is that the bespoke
 
 # Appendix A — Per-language chapters (159)
 
-> Each chapter is a *draft specification* (per §12): repo + 5 dimension-tagged questions
-> (symmetric-authoring split), the per-type graph-stats block (all 32 edge types, zeros
-> kept), and — for the 9 LSP languages — the cross-repo + semantic/similarity deep-dive.
-> Final question symbols are confirmed grep-first against the pinned commit at execution.
+> **⚠️ STATUS — these are DRAFT specifications, not the final question bank.** The chapters below
+> were **LLM-drafted from model knowledge of each repo** to establish the format, dimension mapping,
+> and deep-dive shape. Symbols that could not be confirmed carry a `[verify]` tag. The chapters are **not yet
+> ground-truth-derived** and **must be regenerated at execution time** per §3.1 / §12:
+>
+> - **Question *types* are already grounded** in the Sillito et al. taxonomy (§3.1) — that part is final.
+> - **Question *targets* (the specific symbols/files) are NOT yet from ground truth.** At execution they
+>   are replaced by **SWE-QA** items for the major languages and, for the rest, by targets **seeded from
+>   independent LSP symbol/reference data + git history — never from the model**, then confirmed
+>   grep-first against the pinned commit (symmetric-authoring split, [CR-1]).
+>
+> In short: each chapter shows *what kind* of question each dimension asks and *how* it is evaluated
+> (output, stats block with all 32 edge types, aggregation, LSP deep-dive). It does **not** yet contain
+> the final, ground-truth-sourced question set — do not read the drafted symbol names as validated.
 
 ---