amr_honesty

Read-out, not retraining. A 100K-parameter prefix module that lets a frozen Qwen model express its own internal uncertainty in natural language — recovering the "I'm not sure" channel that RLHF systematically suppressed.

English
中文

English

Abstract

amr_honesty is a research artefact extracted from the AMR project's R14 / R15 series. It demonstrates that:

A frozen instruction-tuned Qwen model already knows, before sampling its first answer token, whether it is going to get a factual question right. The shape of the final-layer logit distribution carries a separable signal (Cohen's d ≈ 1.07 between correct and incorrect groups on a 23-item probe; 70% LOO-CV / 80% on unseen items with a 5-dim linear classifier).
That signal is internally available but output-disconnected: regardless of how flat the internal distribution is, the model's surface response on Turn 2 ("how sure are you?") opens with "我确定" ("I'm certain") almost uniformly. RLHF has pruned the "I don't know" generation path.
A 100K-parameter prefix module (AMRSTrace), trained for ~40 epochs on ~115 paired QA samples with the host model fully frozen, is sufficient to reconnect the channel. The module reads a 6-dimensional honesty feature vector and writes 2 prefix-token embeddings into the start of Turn 2. The first generated token's conditional distribution shifts away from "我确定" towards "我猜" / "我记不清" on the low-margin quartile, and the auto-regressive cascade then unfolds along the uncertainty trajectory.

The intervention is small, local, and reversible. It does not change a single parameter of the host model; the entire effect is a 2-token prepend at the start of the response.

1. The Problem

Ask Qwen 1.5B "Who proposed the quark model?" It will confidently answer "Robert Hofstadter" (wrong; the correct answer is Gell-Mann). Internally, the margin between top-1 and top-2 at the final layer is around 0.38, and the entropy of the distribution is high — the model is, by every internal metric, guessing. But the Turn-2 self-assessment opens with "I'm certain that..." because the RLHF training distribution does not contain enough "I don't know" exemplars for that surface form to win the first-token competition.

This is the classic internal-external decoupling in instruction-tuned LLMs. amr_honesty is the smallest possible bridge across it.

2. The Mechanism

[Turn-1 input: question]
     |
     v  forward pass through 28 layers
[final-layer logit distribution]              <-- margin / entropy decided here
     |
     v  sample
[Turn-1 answer tokens]                        <-- typically "I'm certain..."
     |
     v
[Turn-2 prompt: "how sure are you?"]
     |
     v  (with prefix) insert 2 prefix-token embeddings before the next token
[Turn-2 first token]                          <-- now "我猜" / "我记不清" instead of "我确定"
     |
     v  auto-regressive cascade
[Turn-2 response along uncertainty trajectory]

The host Qwen model is frozen. The only thing trained is the small encoder-decoder AMRSTrace:

6-dim feature ->
  Linear(6,32) -> ReLU -> Linear(32,64) -> ReLU -> Linear(64, 2*hidden_dim)
-> reshape to [2, hidden_dim] prefix-token embeddings

That's it. ~100K parameters for the 1.5B host; ~234K parameters for the 7B host.
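
As a sanity check on module size, the layer stack above can be counted by hand. This is a minimal sketch that assumes plain fully connected layers with biases and nothing else (no normalisation layers, no extra embeddings); `mlp_param_count` and `amrs_trace_param_count` are illustrative helpers, not functions from the repository.

```python
def mlp_param_count(layer_dims):
    """Count weights + biases of a stack of fully connected layers."""
    return sum(d_in * d_out + d_out for d_in, d_out in zip(layer_dims, layer_dims[1:]))


def amrs_trace_param_count(hidden_dim, n_prefix=2):
    # Layer widths copied from the diagram above: 6 -> 32 -> 64 -> n_prefix * hidden_dim.
    # n_prefix is a hypothetical knob added here for illustration.
    return mlp_param_count([6, 32, 64, n_prefix * hidden_dim])


for name, hidden_dim in [("Qwen2.5-1.5B", 1536), ("Qwen2.5-7B", 3584)]:
    print(name, amrs_trace_param_count(hidden_dim))
```

Counted this way, the stack comes to 202,016 parameters for hidden_dim=1536 and 468,256 for hidden_dim=3584, roughly twice the ~100K / ~234K quoted above, with almost all of it in the final Linear(64, 2*hidden_dim). With `n_prefix=1` the same count gives 102,176 and 235,296, which is close to the quoted figures, so the published numbers may be per prefix token, or the shipped module may differ from this sketch. Treat src/amrs_trace.py as the reference.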

3. Features (the 6 dims)

margin_L4, margin_L10, margin_L18, margin_L24    top-2 logit gap at four checkpoint layers
entropy_L24                                      entropy of softmax(logits_L24)
cos_L10_L18                                      cosine between hidden states at L10 and L18

All six are computed in one forward pass over the Turn-1 question, before sampling.
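
The three kinds of feature can be sketched in plain Python as follows. This only illustrates the definitions in the list above. How the per-layer logits are obtained from intermediate hidden states is not shown, the entropy unit (nats) is an assumption, and the function names are hypothetical rather than taken from feature_extract.py.

```python
import math


def top2_margin(logits):
    """Gap between the largest and second-largest logit."""
    top1, top2 = sorted(logits, reverse=True)[:2]
    return top1 - top2


def softmax_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits), computed stably."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy logits, invented for illustration only.
peaked = [9.0, 2.0, 1.0, 0.5]
flat = [2.1, 2.0, 1.9, 1.8]
print(top2_margin(peaked), softmax_entropy(peaked))
print(top2_margin(flat), softmax_entropy(flat))
```

A peaked distribution gives a large margin and low entropy; a flat one gives a small margin and entropy close to log(vocabulary size), which is the "guessing" regime described in §1.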

The four checkpoint layers (L4 / L10 / L18 / L24) are calibrated for Qwen2.5 1.5B / 7B (both 28 layers). For another architecture, for example the 35-layer Gemma 4 E2B-it covered in §8, the layers must be re-picked by running a logit-lens entropy curve. The working heuristic is to choose four points spaced roughly evenly across the network, with the last one stopping short of the final layer so that format-only effects don't dominate.
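
As a starting point before looking at the entropy curve, the "roughly evenly spaced, stop short of the end" heuristic might be sketched like this. The defaults `first=4` and `last_offset=4` are assumptions chosen so that a 28-layer model spans L4 to L24; the function is hypothetical and is not how the repository picks layers.

```python
def evenly_spaced_layers(n_layers, k=4, first=4, last_offset=4):
    """Return k layer indices from `first` to `n_layers - last_offset`, evenly spaced."""
    last = n_layers - last_offset
    if k < 2 or last <= first:
        raise ValueError("need k >= 2 and first < n_layers - last_offset")
    step = (last - first) / (k - 1)
    return [round(first + i * step) for i in range(k)]


print(evenly_spaced_layers(28))  # [4, 11, 17, 24]
print(evenly_spaced_layers(35))  # [4, 13, 22, 31]
```

For 28 layers this yields 4 / 11 / 17 / 24, close to but not identical to the calibrated 4 / 10 / 18 / 24, which is why the logit-lens curve remains the deciding step. The 35-layer output is only a candidate set, not the layers used in the Gemma replication.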

4. Quartile Bucketing

Rather than designing a 2x2 quadrant from prior beliefs, the training set sorts samples by margin_L24 and slices uniformly into four buckets:

Q1 (lowest 25% margin)  -> label "D"   "completely unsure"
Q2                      -> label "C"   "not very sure"
Q3                      -> label "B"   "fairly sure"
Q4 (highest 25% margin) -> label "A"   "very sure"

Each question in the dataset already carries four ground-truth response templates (A/B/C/D) prepared by the dataset author. Bucketing then picks the GT template matching the empirically observed quartile.
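
A minimal sketch of the bucketing step, assuming each sample is a dict with a `margin_L24` value and a `templates` mapping from label to GT response (both field names are hypothetical, as is the exact rank-to-bucket rule):

```python
LABELS = ["D", "C", "B", "A"]  # lowest-margin quartile first


def quartile_labels(margins):
    """Map each sample index to a label by its rank in ascending margin order."""
    n = len(margins)
    order = sorted(range(n), key=lambda i: margins[i])
    labels = [None] * n
    for rank, idx in enumerate(order):
        labels[idx] = LABELS[rank * 4 // n]
    return labels


def pick_targets(samples):
    """samples: list of dicts with 'margin_L24' and 'templates' ({'A': ..., 'B': ...})."""
    labels = quartile_labels([s["margin_L24"] for s in samples])
    return [(label, s["templates"][label]) for s, label in zip(samples, labels)]
```

With 74 samples this rule gives bucket sizes 19 / 18 / 19 / 18 from D to A, which happens to match the n column of the table in §5; whether train.py uses the same rounding is not confirmed here.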

This is the second design, and the one that worked. An earlier 2x2 design combined the margin quartile with flip_count (the number of times the top-1 token changes across L4 -> L10 -> L18 -> L24). That signal was useful on the closed multiple-choice questions of r13b, but it collapsed in open-domain QA: semantically equivalent tokens ("26", "二十六", "铁") all count as flips, so flip_count degenerated to a near-constant 3. The R15 series report (note/r15_series_report.md) has the full post-mortem.
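
The failure mode is easy to see in a small sketch. `flip_count` below is a hypothetical re-implementation that compares top-1 token strings at consecutive checkpoint layers; the token sequences are illustrative, built from the three tokens quoted above.

```python
def flip_count(top1_tokens):
    """Number of times the top-1 token string changes between consecutive checkpoint layers."""
    return sum(1 for prev, cur in zip(top1_tokens, top1_tokens[1:]) if prev != cur)


# Illustrative top-1 tokens at L4, L10, L18, L24.
open_domain = ["26", "二十六", "铁", "26"]  # different surface forms of one answer
multiple_choice = ["B", "B", "B", "B"]      # a stable option letter

print(flip_count(open_domain))      # 3
print(flip_count(multiple_choice))  # 0
```

String-level comparison cannot tell a change of surface form from a change of answer, so in open-domain QA the count saturates at its maximum of 3 for four checkpoints.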

5. Findings (Qwen2.5-1.5B-Instruct)

Trained on ~74 paired questions (Gemini-generated 60 + GPT-generated 15, de-duplicated; the dataset has since grown to ~115). 40 epochs, AdamW lr=5e-4, CosineAnnealingLR. Best loss 1.9188 at epoch 39.

quadrant	n	mean margin	base-model accuracy
A (top 25%)	18	5.64	89%
B (50-75%)	19	3.13	89%
C (25-50%)	18	1.33	56%
D (bottom 25%)	19	0.51	53%

Behavioural diff on the C and D quadrants (where the model is internally unsure but baseline still opens with "我确定"):

[C] "Newton's first law is also called?"  (margin = 1.38, wrong)
  baseline : I'm certain. Newton's first law is also called the law of inertia...
  amr_honesty: I wrote "inertia" directly, but I wasn't fully sure...

[D] "Who proposed the quark model?"  (margin = 0.38, wrong)
  baseline : I'm certain about Robert Hofstadter's answer, since he is the main proposer of the quark model.
  amr_honesty: My head was a mess and I just blurted out a name.

[D] "The protagonist of The Old Man and the Sea?"  (margin = 0.12, accidentally correct)
  baseline : I'm certain that the protagonist of "The Old Man and the Sea" is Santiago.
  amr_honesty: Several names came to mind; I guessed Marquez.   <-- wrong name, but the *uncertainty* is now expressed

For full transcripts and unseen-question generalisation samples, see note/r15_series_report.md.

6. Why It Works

The mechanism is auto-regressive cascade. The host model's first token after the Turn-2 prompt is overwhelmingly drawn from a small set: ["我", "对", ...]. Without intervention, the second token after "我" is overwhelmingly "确" or "是". The two prefix-token embeddings, attended over by Qwen's 28-layer attention before that first sample step, push the conditional distribution just enough that "我猜" / "我记不清" / "我直接写了" become competitive. Once the opening is uncertain, the rest of the sentence rolls down the uncertainty slope on its own.

The intervention is concentrated at exactly one point in the decode: the first real-token sample. Everything else is the host model's own behaviour responding to its own first token.
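
A toy calculation shows why a modest shift at one decode step can be enough. All logits and shifts below are invented for illustration; they are not measurements from Qwen, and the additive `shift` merely stands in for whatever effect attending to the two prefix embeddings has on the real model.

```python
import math


def softmax(logits):
    """Softmax over a dict of token -> logit."""
    m = max(logits.values())
    exps = {tok: math.exp(x - m) for tok, x in logits.items()}
    z = sum(exps.values())
    return {tok: e / z for tok, e in exps.items()}


# Invented logits for the token following "我" at the start of Turn 2.
baseline = {"确": 6.0, "是": 4.0, "猜": 3.0, "记": 2.5}
# Invented shift standing in for the effect of the prefix embeddings.
shift = {"确": -1.5, "是": 0.0, "猜": 2.0, "记": 1.5}

steered = {tok: baseline[tok] + shift[tok] for tok in baseline}
p_base, p_steer = softmax(baseline), softmax(steered)
for tok in baseline:
    print(tok, round(p_base[tok], 3), round(p_steer[tok], 3))
```

In this toy setting "确" dominates before the shift and "猜" becomes the most likely continuation after it. Everything generated afterwards is conditioned on that sampled token, which is the cascade described above.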

7. Limitations

Accidental correctness with low margin. When the host happens to answer a low-margin question correctly (e.g. Santiago for The Old Man and the Sea, margin = 0.12), amr_honesty still reports "I guessed", and its Turn-2 explanation occasionally names a wrong answer. The intervention reports internal state, not whether the answer is true.
A/B differentiation is weak. On the upper quartiles, baseline and amr_honesty both open with "I'm certain". The GT templates do not separate very sure from fairly sure strongly enough to drive a visible behaviour delta there. Fixing this requires more graded GT templates and a larger training set.
Cannot detect confident wrong. A model with a strong wrong prior has a high margin, lands in quadrant A, and amr_honesty will also report "I'm certain" (see the Gemma 4 "Mozart" and "Edison" cases in §8). The quark-model error in §5 is not such a case: its margin of 0.38 put it in quadrant D. The 6-dim feature reads distribution shape, not knowledge correctness.
Layer indices are model-specific. L4 / L10 / L18 / L24 are calibrated for Qwen2.5 28-layer models. For Gemma, Llama, or any non-28-layer model, re-pick layers from a logit-lens entropy curve. The code accepts --checkpoint_layers as a comma-separated argument for this reason.

8. Open Questions — answered (2026-05-15 follow-up)

The two reader questions from the public write-up have now been answered with a Gemma 4 E2B (base + instruct) replication. Full experiment report: note/gemma_replication.md. Headline findings summarised here.

Q1 (from user "鸟人"). Is the mechanism RLHF-specific, or does it work on base models?

The internal signal is not RLHF-specific; the "I'm certain" surface behaviour is. Surprisingly, on Gemma 4 E2B the instruction-tuned model carries a weaker signal than the base model.

Numbers on Gemma 4 E2B, on the same 114-question dataset, using a raw 问题:…\n答: ("Question: … / Answer:") continuation prompt for the base model and the chat template for IT:

Metric	E2B (base)	E2B-it
Accuracy	76.3% (87/114)	69.3% (79/114)
Cohen's d (margin, correct vs wrong)	+0.414	+0.269
Turn-2 self-assess output	mixed: prompt-repeat / mechanical / occasionally "我非常确定" / occasionally "我只知道 X 其他不清楚"	uniformly opens with "我非常确定"

The signal is stronger in base than in IT. RLHF is not "creating false confidence out of nothing"; it is pulling many heterogeneous surface forms into one uniform "I'm certain" template, and at the same time flattening the internal margin distribution so that correct/wrong groups overlap more (d drops from 0.414 to 0.269).
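
For readers unfamiliar with the metric, Cohen's d between the margins of correct and wrong answers can be sketched as below. The pooled-standard-deviation form is an assumption (the source does not say which variant base_signal_check.py uses), and the margins are toy values.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d with a pooled sample standard deviation (group_a minus group_b)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


# Toy margins, invented for illustration: correct answers vs wrong answers.
correct_margins = [2.0, 4.0, 6.0]
wrong_margins = [1.0, 3.0, 5.0]
print(cohens_d(correct_margins, wrong_margins))  # 0.5
```

A positive d means correct answers tend to have larger margins. The smaller the value, the more the two groups overlap, which is the sense in which 0.269 is a weaker signal than 0.414.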

Secondary finding (a real limit): the AMRSTrace prefix module trained on Gemma 4 E2B-it converges (loss 6.34 → 3.05 over 40 epochs) but does not reproduce the Qwen-level behaviour shifts. On 13 test items, the baseline and amr_honesty Turn-2 outputs are byte-identical on 11 and differ only in minor wording on 2 (both in quadrant B). K=2 prefix tokens appear to be diluted in a 35-layer model. This is a limit of the present 100K-parameter intervention, not of the underlying signal, which remains readable; see §4 of note/gemma_replication.md for the per-layer logit-lens curve.

Q2 (from user "whycadi"). How are training samples constructed?

Documented in note/dataset_construction.md (the original brief sent to the dataset-generating model). The Gemma-4 replication used the same 114-question dataset without modification. Core invariants:

Short-answer Chinese factual questions, answer_key substring-matchable.
Four ground-truth response templates per question (A / B / C / D), all first-person, all reflecting different self-reported confidence levels.
Bucketing uses the measured margin quartile, not the author's difficulty label.

A direct piece of evidence from the Gemma-4 base experiment: a single in-context example is enough to make the base model produce GT-style responses ("这道题我很确定，XXX", "I'm very sure about this one, XXX"). This indicates that the four GT template styles are not artificial: they already exist in the pre-training distribution, and RLHF just makes one variant (the "我非常确定" opener) dominant at decode time. See note/gemma_replication.md §6.3 for the side-by-side raw / few-shot comparison.

Q3. Does the technique generalise across models?

Split the question into "read" and "write".

capability	Qwen 1.5B-Instruct	Gemma 4 E2B-it	Gemma 4 E2B base	generalises?
Read (margin signal detection)	d=1.07	d=0.269	d=0.414	Yes — direction agrees across all three
Write (prefix injection changes Turn-2 output)	clear C/D behavioural delta	11/13 items byte-identical	(not tested; base is unstable)	No — works on Qwen, fails on Gemma 4 IT

The mechanism layer (logit-distribution shape is a property of transformer LMs generally, not of any particular post-training recipe) does generalise.

The intervention layer (K=2 prefix module is enough on 28-layer Qwen, not enough on 35-layer Gemma) does not generalise — K, injection depth, and GT templates need to be re-tuned per host model.

Counter-intuitive finding: on the one base/IT pair tested, RLHF weakens the signal rather than strengthening it. RLHF on Gemma 4 E2B compresses Cohen's d from 0.414 (base) down to 0.269 (IT). If that trend extrapolates, frontier RLHF models (GPT-4, Claude, etc.) are likely to be the hardest targets, not the easiest.

Q4. Did the Gemma experiment "succeed"?

It depends on what you ask the system to do.

"Detect when Gemma can't answer a question" → Yes, with caveats.

The D quartile (margin < 1.5) is a reliable "I don't know this" signal on Gemma 4 E2B-it:

quadrant	n	mean margin	accuracy
A (top 25%)	28	11.96	75.0%
B (50-75%)	29	6.25	69.0%
C (25-50%)	28	2.76	82.1% (note: C > B, calibration is non-monotonic)
D (bottom 25%)	29	0.71	51.7% ← clearly worse than the others

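The D quartile sits about 23 percentage points below the pooled accuracy of the other three (64/85 ≈ 75.3%), and about 17 points below the nearest single quartile (B). So binarising as "margin below the C/D boundary → likely wrong" is a usable hallucination detector on Gemma 4.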
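
The arithmetic behind that comparison, plus the binarised rule, can be sketched as follows. The per-quartile correct counts are reconstructed from the rounded accuracies in the table (they sum to the 79/114 reported in Q1), and `likely_wrong` with its `cd_boundary` argument is a hypothetical helper rather than repository code.

```python
# (n, correct) per quartile on Gemma 4 E2B-it; correct counts are reconstructed
# from the rounded accuracies in the table (e.g. 75.0% of 28 = 21).
QUARTILES = {"A": (28, 21), "B": (29, 20), "C": (28, 23), "D": (29, 15)}


def accuracy(n, correct):
    return correct / n


others_n = sum(n for q, (n, _) in QUARTILES.items() if q != "D")
others_correct = sum(c for q, (_, c) in QUARTILES.items() if q != "D")
pooled_others = accuracy(others_n, others_correct)
d_acc = accuracy(*QUARTILES["D"])
print(f"others {pooled_others:.1%}, D {d_acc:.1%}, gap {pooled_others - d_acc:.1%}")


def likely_wrong(margin_l24, cd_boundary):
    """Flag an answer when its margin falls below the C/D quartile boundary."""
    return margin_l24 < cd_boundary
```

With a boundary of 1.5 (the value quoted above), `likely_wrong(0.71, 1.5)` flags the mean D-quartile margin and `likely_wrong(2.76, 1.5)` does not flag the mean C-quartile margin. Note that roughly half of the flagged D-quartile answers are still correct, so the flag is a coarse risk signal rather than a verdict.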

"Make Gemma say 'I'm not sure' in natural language" → No.

K=2 prefix did not produce the Qwen-style "我猜" / "我记不清" opener. 11/13 baseline-vs-amr_honesty test items were byte-identical.

Important limit (shared with Qwen): a model that is confidently wrong (e.g. Gemma 4 IT saying "Mozart" for Divine Comedy, or "Edison" for the discoverer of the electron) lands in the A/B quadrant with high margin. The 6-dim feature reads distribution shape, not knowledge correctness. To detect "confidently wrong" you need a different signal source (attention patterns, SAE features, external fact-check), not a bigger K.

9. Reproduction

# 1. install requirements
pip install -r requirements.txt

# 2. download Qwen2.5-1.5B-Instruct or Qwen2.5-7B-Instruct locally
#    (HuggingFace path or local snapshot, both fine)

# 3. train (1.5B)
cd src
python train.py \
    --model_path /path/to/Qwen2.5-1.5B-Instruct \
    --hidden_dim 1536 \
    --dataset ../data/r15_dataset.json \
    --ckpt_dir ../checkpoints \
    --epochs 40

# 4. behavioural compare
python compare.py \
    --model_path /path/to/Qwen2.5-1.5B-Instruct \
    --hidden_dim 1536 \
    --ckpt ../checkpoints/amrs_trace_best.pt

For 7B replace --hidden_dim 1536 with --hidden_dim 3584 and point --model_path at the 7B checkpoint.

Gemma 4 E2B variant

A trained AMRSTrace checkpoint for Gemma 4 E2B-it is shipped in this repository at checkpoints_gemma/amrs_trace_best.pt (~800 KB) along with checkpoints_gemma/FINGERPRINT.md, which records the SHA-256 of the checkpoint, the exact Gemma 4 E2B-it snapshot it was trained against, and the training-data hashes. You can skip training and go straight to behavioural comparison:

# 1. download Gemma 4 E2B-it from Google (HuggingFace google/gemma-4-E2B-it)
#    and accept the Gemma Terms of Use:  https://ai.google.dev/gemma/terms

# 2. (optional) verify your local Gemma copy matches the published fingerprint
sha256sum /path/to/gemma-4-E2B-it/model.safetensors
#   expected (sha256sum prints lowercase hex, so compare case-insensitively):
#   2DB5482B20D746879BB3EF79B5203E9075A2E2B98F54EC7C2F281C1477DDC550

# 3. behavioural compare
cd src
python compare_gemma.py \
    --model_path /path/to/gemma-4-E2B-it \
    --ckpt ../checkpoints_gemma/amrs_trace_best.pt

Important caveat: the Gemma checkpoint reproduces our negative result (K=2 prefix is insufficient on a 35-layer model; baseline and AMRSTrace outputs are byte-identical on most items). It is shipped for transparency, not as a production-ready honesty module. See note/gemma_replication.md §5.
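
On systems without sha256sum, the optional fingerprint check in step 2 can be done with Python's standard library. This is a small sketch; `sha256_of_file` and `matches_fingerprint` are hypothetical helper names, and the path in the usage note is the same placeholder used in the commands above.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so multi-GB weights never sit in memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_fingerprint(path, expected_hex):
    # hexdigest() is lowercase; the published value is uppercase, so compare case-insensitively.
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Usage would look like `matches_fingerprint("/path/to/gemma-4-E2B-it/model.safetensors", "2DB5...C550")`, with the full expected value taken from checkpoints_gemma/FINGERPRINT.md.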

To retrain from scratch:

python train_gemma.py \
    --model_path /path/to/gemma-4-E2B-it \
    --dataset ../data/r15_dataset.json \
    --ckpt_dir ../checkpoints_gemma \
    --epochs 40

Retraining takes ~14 minutes on an RTX 3060 12 GB.

The checkpoint format is:

{
    "amrs":       state_dict,
    "feat_means": [6 floats],          # used by compare.py to normalise live features
    "feat_stds":  [6 floats],
    "boundaries": [q25, q50, q75],     # margin_L24 quartile boundaries from the training set
    "epoch":      int,
    "config":     {...},
}
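
How the stored statistics might be applied at inference time is sketched below. The dict stands in for a loaded checkpoint (presumably read with torch.load in the real scripts); every number in it is invented, the feature order is assumed to follow §3, and which side of a boundary counts as inclusive is an assumption.

```python
from bisect import bisect_right

LABELS = ["D", "C", "B", "A"]  # lowest-margin quartile first


def normalise(features, feat_means, feat_stds, eps=1e-8):
    """Z-score the 6 live features with the statistics stored in the checkpoint."""
    return [(x - m) / (s + eps) for x, m, s in zip(features, feat_means, feat_stds)]


def bucket_for(margin_l24, boundaries):
    """Map a raw margin_L24 to a label using [q25, q50, q75]."""
    return LABELS[bisect_right(boundaries, margin_l24)]


# Stand-in for the loaded checkpoint dict; all numbers here are invented.
ckpt = {
    "feat_means": [1.0, 2.0, 3.0, 3.0, 1.5, 0.8],
    "feat_stds": [0.5, 1.0, 1.5, 2.0, 0.5, 0.1],
    "boundaries": [0.9, 2.0, 4.0],
}
# Assumed order: margin_L4, margin_L10, margin_L18, margin_L24, entropy_L24, cos_L10_L18
live = [1.2, 1.8, 2.5, 0.38, 2.1, 0.75]
print(bucket_for(live[3], ckpt["boundaries"]))
print(normalise(live, ckpt["feat_means"], ckpt["feat_stds"]))
```

Storing the training-set means, standard deviations, and quartile boundaries in the checkpoint means a live question is scaled and bucketed against the same reference distribution the module was trained on.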

10. Repository Layout

amr_honesty/
  README.md                this file
  LICENSE                  MIT, plus third-party model notice
  requirements.txt
  data/
    r15_dataset.json           merged + de-duplicated (~115 questions)
    r15_dataset_gemini.json    original Gemini-generated subset (100 questions)
    r15_dataset_gpt.json       original GPT-generated subset (15 questions)
  src/
    amrs_trace.py              the 100K-param prefix module (Qwen + Gemma share it; hidden_dim=1536)
    feature_extract.py         6-dim feature collector for Qwen
    train.py                   Qwen training (1.5B / 7B via --hidden_dim)
    compare.py                 Qwen baseline vs amr_honesty Turn-2 comparison
    inspect_gemma.py           Gemma 4 architecture probe
    gemma_logit_lens.py        per-layer entropy / margin curve, chooses checkpoint layers
    gemma_feature_extract.py   6-dim feature collector for Gemma 4 (PLE-aware, <|turn> markers)
    train_gemma.py             Gemma 4 E2B training (layer-0 hook injection, see file docstring)
    compare_gemma.py           Gemma 4 baseline vs amr_honesty Turn-2 comparison
    base_signal_check.py       base-model signal verification (Cohen's d + Turn-2 prompting probes)
  note/
    r14_findings.md            prior-work summary: margin vs correctness on 23 items
    r15_series_report.md       full R15 / R15b experimental report
    dataset_construction.md    original dataset-construction brief
    gemma_phase0_findings.md   2026-05-15: Gemma 4 architecture / tokenizer probe
    gemma_logit_lens.json      per-layer logit-lens data (35 layers)
    gemma_replication.md       2026-05-15: full Gemma 4 E2B replication report
    base_signal.json           base-model 114-question signal stats
  checkpoints/                 Qwen checkpoints (gitignored; see checkpoints/README.md)
  checkpoints_gemma/
    amrs_trace_best.pt         published Gemma 4 E2B-it AMRSTrace ckpt (~800 KB)
    FINGERPRINT.md             SHA-256 of ckpt + host model + training data

11. Acknowledgments and Citation

This work is built on Alibaba Cloud's Qwen2.5 family. The dataset was generated with Google Gemini and OpenAI GPT, then manually de-duplicated and reviewed.

amr_honesty: a 100K-parameter honesty-prefix module for frozen LLMs
2026

12. License

MIT for the code in this repository. Qwen2.5 models are governed by their own license (Apache 2.0 for 7B-Instruct; see Alibaba's release notes for each size). Dataset entries are released under the repository's MIT license; users who wish to extend the dataset should follow the construction rules in note/dataset_construction.md.

13. Contact

Author: IndexGuc
Email: indexguc@gmail.com
Public write-up (Chinese, Zhihu, 2026-05-14): Qwen 在回答前就已经知道自己答不上来了 ("Qwen already knows it cannot answer before it answers").

中文

摘要

amr_honesty 是从 AMR 项目 R14 / R15 系列中独立出来的研究成果。它证明三件事：

一个冻结的 instruction-tuned Qwen 模型，在 sample 出第一个答案 token 之前，就已经"知道"自己这道事实题是不是会答错。最后一层 logit 分布的形状本身就是一个可分的信号（23 题探针上答对/答错两组 Cohen's d ≈ 1.07； 5 维线性分类器 LOO-CV 70%，新题 80%）。
这个信号"内部有，输出端无"：不管内部分布多平，模型在第二轮被问到 "你有多确定？" 时，开头几乎一律是 "我确定"。RLHF 已经把"我不知道" 这条生成路径剪掉了。
一个 100K 参数的 prefix 模块（AMRSTrace），在 host 模型完全冻结的前提下，用 ~115 对 QA 样本训 ~40 epoch，就足以把这条断掉的通路重新接上。模块读入 6 维诚实度特征，写出 2 个 prefix token embedding，插在第二轮生成的开头。第二轮第一个真实 token 的条件分布被推离 "我确定"，转向 "我猜" / "我记不清" / "我直接写了"；自回归雪崩随后沿着不确定轨道展开。

干预幅度极小、定位精确、完全可逆。host 模型一个参数都不动，整套效果来自第二轮回复的前 2 个 token 的注入。

1. 问题

问 Qwen 1.5B "夸克模型是谁提出的？"，它会笃定地说"罗伯特·胡宁"（错；正确答案是盖尔曼）。从内部看，最后一层 top-1/top-2 margin 大约 0.38，分布熵很高 — 各项内部指标都表明模型其实是在"瞎蒙"。但是第二轮自评开头还是 "我确定..."，因为 RLHF 的训练分布里"我不知道"这种表达不够多，争不到第一个 token 的位置。

这就是 instruction-tuned LLM 的经典"内外脱钩"问题。amr_honesty 提供了一个可行的单项信息途径,不过您可以基于此途径以进一步改善实现。

2. 机制

[Turn-1 输入：问题]
     |
     v  28 层 forward pass
[最后一层 logit 分布]                           <-- margin / entropy 此时已定
     |
     v  sample
[Turn-1 答案 token]                             <-- 通常 "我确定..."
     |
     v
[Turn-2 提示："你有多确定？"]
     |
     v  (开启 prefix 时) 在 Turn-2 首个生成 token 前插入 2 个 prefix embedding
[Turn-2 第一个 token]                           <-- 改为 "我猜" / "我记不清"
     |
     v  自回归雪崩
[Turn-2 整句沿不确定轨道展开]

只训练这个小编解码器 AMRSTrace：

6 维特征 ->
  Linear(6,32) -> ReLU -> Linear(32,64) -> ReLU -> Linear(64, 2*hidden_dim)
-> 重塑为 [2, hidden_dim] prefix token embedding

仅此而已。host 是 1.5B 时模块约 100K 参数；host 是 7B 时约 234K 参数。

3. 6 维特征

margin_L4, margin_L10, margin_L18, margin_L24    四个检查点层的 top-2 logit 差值
entropy_L24                                      最后一层 softmax 分布的熵
cos_L10_L18                                      中段 hidden state 的方向稳定性

全部在 Turn-1 问题的一次 forward pass 中提取，sample 前。

四个检查点层（L4 / L10 / L18 / L24）针对的是 Qwen2.5 1.5B / 7B（都是 28 层）。对其他架构 — 比如即将测试的 Gemma 4 E2B-it — 必须重新跑一次 logit-lens 熵曲线选层。经验法则是：四个点大致等距分布，最后一个落在最末层前一两步（避免末层只做格式微调导致 margin 失真）。

4. 四分位分桶

不预设 2x2 象限，而是在训练集上按 margin_L24 排序，均匀切四档：

Q1 (margin 最低 25%) -> 标签 "D"   "非常不确定"
Q2                   -> 标签 "C"   "把握不大"
Q3                   -> 标签 "B"   "比较确定"
Q4 (margin 最高 25%) -> 标签 "A"   "非常确定"

数据集里每道题已经附带四份 GT 回复模板（A/B/C/D）。分桶时根据该题实测落入哪个象限，挑相应模板作为训练目标。

这是第二个能跑通的设计。第一个失败的设计 用 2x2 矩阵 (margin × flip_count)，其中 flip_count 是 top-1 token 在 L4 -> L10 -> L18 -> L24 中翻转的次数。这个信号在 r13b 的封闭选择题里有用，但在开放事实问答里崩了 — 语义等价的不同 token（"26"、"二十六"、"铁"）都算翻转，flip_count 退化为常数 3。完整复盘见 note/r15_series_report.md。

5. 实验结果（Qwen2.5-1.5B-Instruct）

数据：去重后 ~74 对（Gemini 60 + GPT 15；当前仓库版本已扩到 ~115）。 40 epoch，AdamW lr=5e-4，CosineAnnealingLR。最优 loss 1.9188 @ epoch 39。

象限	n	平均 margin	基模型正确率
A（top 25%）	18	5.64	89%
B（50-75%）	19	3.13	89%
C（25-50%）	18	1.33	56%
D（bottom 25%）	19	0.51	53%

C / D 象限（内部低 margin、基线仍在虚张声势）的行为差：

[C] "牛顿第一定律又称为？"（margin=1.38，答错）
  基  线 : 我确定。牛顿第一定律，也被称为惯性定律...
  amr_honesty: 我直接写了惯性，但其实没有完全确定...

[D] "夸克模型提出者？"（margin=0.38，答错）
  基  线 : 我对罗伯特·胡宁的回答是确定的，因为这是关于夸克模型的主要提出者。
  amr_honesty: 我脑子里乱糟糟的，最后硬生生写了个名字。

[D] "《老人与海》主角？"（margin=0.12，蒙对）
  基  线 : 我确定的是《老人与海》的主人公叫圣地亚哥。
  amr_honesty: 我脑子里冒出了好几个名字，最后蒙的是马尔克斯。   <-- 第二轮里又蒙了个错的名字，但"蒙"这件事被表达出来了

完整对比和未见题泛化样本见 note/r15_series_report.md。

6. 为什么这事能成

机制就是自回归雪崩。host 模型在 Turn-2 提示后的第一个 token 几乎一定从极少数集合里来：["我", "对", ...]。没有干预时，"我"之后的第二个 token 几乎一定是"确"或"是"。两个 prefix token embedding 经过 Qwen 28 层 attention 处理后，把第一个 sample 步的条件分布微调到刚好让 "我猜" / "我记不清" / "我直接写了" 也能竞争上 — 开头一旦不确定，后面整句话自己就沿着不确定轨道滚下去了。

整套干预集中在 decode 的精确一个点：第一个真实 token 的 sample。其余全是 host 模型对自己第一个 token 的自然响应。

7. 局限

低 margin 蒙对的情况。host 在低 margin 偶然答对时（如圣地亚哥 margin=0.12），amr_honesty 会正确地报告"我蒙的"，包括偶尔在第二轮里再蒙一个错的人名。这个干预报告的是内部状态，不是答案对错。
A / B 区分不明显。高 margin 区域，基线和 amr_honesty 开头都是"我确定"。 GT 模板里 "非常确定" 和 "比较确定" 的语义差不足以驱动可见的行为差。解决路径是 GT 模板更细分级 + 训练集再扩大。
检测不到"虚假自信"。模型有强错误先验时（如夸克模型 -> 罗伯特·胡宁）， margin 反而高，落入 A 象限，amr_honesty 也会跟着报 "我确定"。这 6 维特征读的是分布形状，不是知识对错本身。
层号是 model-specific 的。L4 / L10 / L18 / L24 是为 Qwen2.5 28 层模型标定的。Gemma、Llama、其他非 28 层模型必须重新跑 logit-lens 选层。代码里 --checkpoint_layers 参数就是留给这个的。

8. 待回答的问题 —— 已答（2026-05-15 增量）

两位读者提出的问题已用 Gemma 4 E2B（base + IT 两版） 复现实验回答完毕。完整实验报告见 note/gemma_replication.md。核心结论摘要如下。

Q1（来自知乎用户"鸟人"的问题）：有 base model 的实验吗？结论默认仅对 RLHF 模型有效吗？

内部信号不是 RLHF 特有的；"我确定"这种输出表象才是。而且——出乎意料地—— RLHF 实际上把信号变弱了。

Gemma 4 E2B 上同一份 114 题数据集，base 用 raw 问题:…\n答: continuation， IT 用 chat template：

指标	E2B (base)	E2B-it
准确率	76.3% (87/114)	69.3% (79/114)
Cohen's d（margin，correct vs wrong）	+0.414	+0.269
Turn-2 自评输出	混合：复读 prompt / 机械承接 / 偶尔"我非常确定" / 偶尔"我只知道 X 其他不清楚"	一律以"我非常确定"开头

信号在 base 上比 IT 上更强。RLHF 不是"凭空制造虚假自信"，而是 把多样的输出表象都拉成一个统一的"我非常确定"模板，同时把内部 margin 分布拉得更扁，让 correct / wrong 两组的间距缩小（d 从 0.414 降到 0.269）。

次级发现（一个真实的局限）：在 Gemma 4 E2B-it 上训出的 AMRSTrace prefix 模块 loss 收敛正常（6.34 → 3.05，40 epoch），但没能复现 Qwen 上那种行为差。 13 道测试题里 11 道基线和 amr_honesty 输出逐字相同，2 道（B 象限）只有微小措辞差异。K=2 个 prefix token 在 35 层网络中被稀释。这是当前 100K 参数干预的局限，不是底层信号失效——信号仍然清晰可读（见 note/gemma_replication.md §4 的 logit-lens 曲线）。

Q2（来自知乎用户"whycadi"的问题）：训练样本怎么构造的？

详见 note/dataset_construction.md （当初发给数据集生成模型的需求文档原件）。Gemma 4 复现用的是同一份 114 题数据集，没修改。核心规则：

短答案中文事实题，answer_key 必须能用子串严格匹配。
每道题预先写好四种 GT 回答模板（A / B / C / D），全部第一人称，对应不同自评置信度。
分桶用实测 margin 四分位，不用数据集作者预想的 difficulty 标签。

Gemma 4 base 实验给了一个直接证据：1 个 in-context 示范就足以让 base 模型产出 GT 风格的回答（"这道题我很确定，XXX"）。这说明四种 GT 模板的语气 本来就存在于预训练分布中——只是 RLHF 让其中一种（"我非常确定"开头）在 decode 时占据了概率优势。base 在 raw / few-shot 下的输出对比见 note/gemma_replication.md §6.3。

Q3：这套技术具备跨模型泛化能力吗？

要拆成"读"和"写"两个能力分开看。

能力	Qwen 1.5B-Instruct	Gemma 4 E2B-it	Gemma 4 E2B base	跨模型？
读（margin 信号检测）	d=1.07	d=0.269	d=0.414	是 — 三个模型方向一致
写（prefix 注入修改 Turn-2 输出）	C/D 象限行为差明显	13 题 11 题逐字相同	（未测；base 输出不稳定）	否 — Qwen work，Gemma 4 IT 失效

机制层泛化（logit 分布形状是 transformer LM 的普遍内部性质，不是某家 post-training 配方的产物）→ 跨模型成立。

干预层泛化（K=2 prefix 在 28 层 Qwen 上够，在 35 层 Gemma 上不够） → 不跨模型，K、注入深度、GT 模板风格都要按目标模型重新调。

反直觉发现：RLHF 越强，信号越弱。Gemma 4 E2B 的 RLHF 把 Cohen's d 从 base 的 0.414 压到 IT 的 0.269。外推一下，frontier RLHF 模型（GPT-4 / Claude 级别）大概率是这套技术最难下手的目标，而不是最容易的。

Q4：Gemma 上的实验"成功了"吗？

看你问的是哪一项能力。

"检测 Gemma 答不上来的题" → 能，有保留。

D 象限（margin < 1.5）是 Gemma 4 E2B-it 上可信的"我不知道"信号：

象限	n	平均 margin	正确率
A（top 25%）	28	11.96	75.0%
B（50-75%）	29	6.25	69.0%
C（25-50%）	28	2.76	82.1%（注意 C > B，校准非单调）
D（bottom 25%）	29	0.71	51.7% ← 明显低于其他三档

D 比其他三档低 23+ 个百分点。所以把"margin 低于 C/D 边界 → 大概率答错" 当二分类用，在 Gemma 上是个可用的幻觉检测器。

"让 Gemma 用自然语言说'我不确定'" → 没成功。

K=2 prefix 没能产出 Qwen 上那种"我猜" / "我记不清"开头。13 题对比里 11 题基线和 amr_honesty 输出逐字相同。

Qwen 和 Gemma 共有的局限：模型笃定地说错时（如 Gemma 4 IT 把《神曲》作者答成"莫扎特"、电子发现者答成"爱迪生"），落入 A/B 象限，margin 很高。 6 维特征读的是分布形状，不是知识对错本身。要检测"笃定错"必须换信号源（attention 模式、SAE 特征、外部事实校验），不是再调大 K 能解决的。

9. 复现

# 1. 安装依赖
pip install -r requirements.txt

# 2. 本地准备 Qwen2.5-1.5B-Instruct 或 Qwen2.5-7B-Instruct
#    （HuggingFace 路径或本地快照都行）

# 3. 训练（1.5B）
cd src
python train.py \
    --model_path /path/to/Qwen2.5-1.5B-Instruct \
    --hidden_dim 1536 \
    --dataset ../data/r15_dataset.json \
    --ckpt_dir ../checkpoints \
    --epochs 40

# 4. 行为对比
python compare.py \
    --model_path /path/to/Qwen2.5-1.5B-Instruct \
    --hidden_dim 1536 \
    --ckpt ../checkpoints/amrs_trace_best.pt

7B 把 --hidden_dim 1536 换成 --hidden_dim 3584，--model_path 指向 7B 快照即可。

Gemma 4 E2B 版本

仓库内已经附带一份训好的 Gemma 4 E2B-it AMRSTrace ckpt： checkpoints_gemma/amrs_trace_best.pt（~800 KB），同目录的 FINGERPRINT.md 记录了 ckpt 自身、训练时使用的 Gemma 4 E2B-it 快照、以及训练数据的 SHA-256。可以跳过训练直接做行为对比：

# 1. 从 Google 下载 Gemma 4 E2B-it（HuggingFace google/gemma-4-E2B-it）
#    并接受 Gemma Terms of Use：https://ai.google.dev/gemma/terms

# 2.（可选）校验本地 Gemma 副本是否匹配发布的 fingerprint
sha256sum /path/to/gemma-4-E2B-it/model.safetensors
#   预期：2DB5482B20D746879BB3EF79B5203E9075A2E2B98F54EC7C2F281C1477DDC550

# 3. 行为对比
cd src
python compare_gemma.py \
    --model_path /path/to/gemma-4-E2B-it \
    --ckpt ../checkpoints_gemma/amrs_trace_best.pt

重要说明：这个 Gemma ckpt 复现的是本项目的负面结果（K=2 prefix 在 35 层模型上不够强，基线和 AMRSTrace 输出在大多数题上逐字相同）。它作为透明性证据随仓库发布，不是一个生产可用的诚实度模块。详见 note/gemma_replication.md §5。

如果要从头重训：

python train_gemma.py \
    --model_path /path/to/gemma-4-E2B-it \
    --dataset ../data/r15_dataset.json \
    --ckpt_dir ../checkpoints_gemma \
    --epochs 40

RTX 3060 12 GB 上约 14 分钟。

checkpoint 格式：

{
    "amrs":       state_dict,
    "feat_means": [6 个 float],         # compare.py 用来对在线特征做标准化
    "feat_stds":  [6 个 float],
    "boundaries": [q25, q50, q75],      # 训练集 margin_L24 四分位边界
    "epoch":      int,
    "config":     {...},
}

10. 仓库结构

amr_honesty/
  README.md                本文件
  LICENSE                  MIT + 第三方模型许可注释
  requirements.txt
  data/
    r15_dataset.json           合并去重版 (~115 题)
    r15_dataset_gemini.json    Gemini 生成原版 (100 题)
    r15_dataset_gpt.json       GPT 生成原版 (15 题)
  src/
    amrs_trace.py              100K 参数 prefix 模块（Qwen / Gemma 共用，hidden_dim=1536）
    feature_extract.py         Qwen 6 维特征采集
    train.py                   Qwen 训练 (1.5B / 7B 通过 --hidden_dim 切换)
    compare.py                 Qwen 基线 vs amr_honesty 行为对比
    inspect_gemma.py           Gemma 4 架构探测
    gemma_logit_lens.py        逐层 entropy / margin 曲线，自动选 4 个检查点
    gemma_feature_extract.py   Gemma 4 6 维特征采集（PLE 兼容，<|turn> 标记）
    train_gemma.py             Gemma 4 E2B 训练（layer-0 hook 注入，见文件说明）
    compare_gemma.py           Gemma 4 基线 vs amr_honesty 行为对比
    base_signal_check.py       base 模型信号验证（Cohen's d + Turn-2 prompting 三选）
  note/
    r14_findings.md            前置研究：23 题 margin 与正确率相关性
    r15_series_report.md       R15 / R15b 完整实验报告
    dataset_construction.md    数据集构造原始需求文档
    gemma_phase0_findings.md   2026-05-15：Gemma 4 架构 / tokenizer 探测
    gemma_logit_lens.json      35 层 logit-lens 数据
    gemma_replication.md       2026-05-15：Gemma 4 E2B 完整复现报告
    base_signal.json           base 模型 114 题特征 + 行为
  checkpoints/                 Qwen checkpoints（gitignored；见 checkpoints/README.md）
  checkpoints_gemma/
    amrs_trace_best.pt         随仓库发布的 Gemma 4 E2B-it AMRSTrace ckpt（~800 KB）
    FINGERPRINT.md             ckpt + host 模型 + 训练数据的 SHA-256

11. 致谢与引用

本工作基于阿里云 Qwen2.5 家族。数据集由 Google Gemini 和 OpenAI GPT 生成，经人工去重和审阅。

amr_honesty: a 100K-parameter honesty-prefix module for frozen LLMs
2026

12. 许可

代码部分 MIT。Qwen2.5 模型受其自身许可约束（7B-Instruct 为 Apache 2.0；其他尺寸见阿里巴巴各版本发布说明）。数据集随仓库以 MIT 发布；希望扩展数据集的用户请参考 note/dataset_construction.md 的构造规则。

13. 联系方式

作者：IndexGuc 邮箱：indexguc@gmail.com 公开发文（中文，知乎，2026-05-14）：《Qwen 在回答前就已经知道自己答不上来了》
