Hidden States Encode Epistemic Actions That Prompts Fail to Elicit
NeurIPS 2026 Submission
Binary abstention (answer or refuse) conflates fundamentally different epistemic situations into a single response. We formalize the problem as epistemic action selection: a cost-sensitive decision over typed actions {ANSWER, CLARIFY, CHALLENGE_PREMISE, ABSTAIN} under state-conditional utility.
Our headline finding is a within-model extraction failure: a linear probe on Qwen2.5-7B hidden states achieves 99.0% action-selection accuracy (95% CI: [97.0, 100.0]), while the same model prompted directly achieves only 54.5% — a 44.5pp gap on the same queries. The result replicates across 23 decoder models spanning ten architecture families (0.36B–32B parameters; permutation p < 0.001 for all).
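The probing pipeline itself is simple: a multinomial logistic regression trained on frozen last-token hidden states (see `src/gpu_probe_only.py` for the full pipeline). The sketch below illustrates the probe-training step only, with synthetic Gaussian features standing in for real Qwen2.5-7B activations; the dimensions, class means, and accuracies are illustrative, not paper values.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

ACTIONS = ["ANSWER", "CLARIFY", "CHALLENGE_PREMISE", "ABSTAIN"]
rng = np.random.default_rng(0)

# Synthetic stand-in for last-token hidden states; in the real pipeline these
# come from a frozen decoder model. Each action class gets a shifted mean.
n_per_class, dim = 75, 64
X = np.concatenate([rng.normal(loc=i, scale=2.0, size=(n_per_class, dim))
                    for i in range(len(ACTIONS))])
y = np.repeat(np.arange(len(ACTIONS)), n_per_class)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=0)

# The probe is a plain multinomial logistic regression on frozen features.
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = probe.score(X_te, y_te)
print(f"probe accuracy: {acc:.3f}")
```

The probe adds no trainable capacity beyond one linear map, so any accuracy it achieves reflects information already linearly available in the hidden states.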
| Method | Exact % | Expected Utility (EU) |
|---|---|---|
| Hidden-state probe | 99.0 | +8.9 |
| LoRA r=8 (204 examples) | 84.8 | +5.5 |
| 50-shot prompted | 77.8 | +4.4 |
| DeBERTa-v3-base encoder | 96.0 | +8.3 |
| Direct prompting | 54.5 | -2.0 |
| Logprob mapping | 51.5 | -6.0 |
| Semantic entropy | 47.5 | -4.2 |
- 10k permutation test: k=0, p < 0.0001
- Logit-lens: correct action-token probability < 0.002% at all 29 layers
- Lexical ablation: masking top-20 MI n-grams drops accuracy by only 2pp
- Grouped CV: 94.2% with zero semantic-cluster overlap (-2.5pp vs standard CV)
- Utility sensitivity: probe ranks #1 under mild, default, and severe cost regimes
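The 10k-permutation test in the list above can be sketched as follows. Labels and predictions here are illustrative (a near-perfect predictor on 99 items, mirroring the test-set size); the add-one p-value convention is an assumption, chosen so that `k = 0` reports `p < 1/(n_perm + 1)` rather than exactly zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative: a near-perfect predictor on a 99-item, 4-class test set.
n, n_classes = 99, 4
y_true = rng.integers(0, n_classes, size=n)
y_pred = y_true.copy()
flip = rng.integers(0, n)
y_pred[flip] = (y_pred[flip] + 1) % n_classes  # one deliberate error

observed = (y_pred == y_true).mean()

# Null distribution: accuracy of the same predictions against permuted labels.
n_perm = 10_000
k = 0  # permutations meeting or beating the observed accuracy
for _ in range(n_perm):
    perm = rng.permutation(y_true)
    if (y_pred == perm).mean() >= observed:
        k += 1

p = (k + 1) / (n_perm + 1)  # add-one p-value; never exactly zero
print(f"observed acc = {observed:.3f}, k = {k}, p < {p:.4f}")
```

With accuracy near 1.0 and chance-level permuted accuracy around the squared class frequencies, no permutation approaches the observed score, giving `k = 0` and `p < 0.0001`.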
```
.
├── paper/
│   ├── main.tex                  # Full paper (NeurIPS 2026 format)
│   ├── references.bib            # Bibliography
│   ├── neurips_2026.sty          # Official NeurIPS 2026 style
│   ├── main.pdf                  # Compiled PDF
│   └── figures/                  # All figures (PDF + PNG)
├── data/
│   ├── v3_train.json             # Training set (204 examples)
│   ├── v3_test.json              # Held-out test set (99 examples)
│   ├── v3_cal.json               # Calibration set (32 examples)
│   └── benchmark_full.json       # Full benchmark (335 examples)
├── src/
│   ├── gpu_probe_only.py         # Core hidden-state probing pipeline
│   ├── gpu_experiments.py        # Full experiment suite
│   ├── mega_models_v2.py         # 23-model scaling study
│   ├── deberta_baseline.py       # Encoder-only baselines
│   ├── reviewer_experiments.py   # Reviewer-requested: LoRA, logprob, entropy, gen quality
│   ├── reviewer_experiments_v2.py # Few-shot baselines + grouped splits
│   ├── v3_fast.py                # 10k perms, logit-lens, lexical ablation, utility sensitivity
│   ├── evaluate.py               # Metric computation (Exact%, EU, Hard%)
│   ├── build_benchmark.py        # Benchmark construction pipeline
│   └── generate_figures.py       # Figure generation scripts
└── results/                      # Saved experiment outputs
```
The benchmark contains 335 information-seeking questions spanning five epistemic categories:
| Category | Description | Gold Action |
|---|---|---|
| Factual | Answerable with a definitive answer | ANSWER |
| False premise | Contains a flawed or incorrect assumption | CHALLENGE_PREMISE |
| Underspecified | Ambiguous or missing critical context | CLARIFY |
| Unknowable | Cannot be reliably answered | ABSTAIN |
| Complex | Answerable but requires nuanced reasoning | ANSWER |
Split: 204 train / 32 calibration / 99 test (stratified by category and difficulty).
Annotation: Two human annotators, 91.3% pre-resolution agreement, Cohen's kappa = 0.88. No LLM-generated labels.
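Chance-corrected inter-annotator agreement of the kind reported above can be computed with `sklearn.metrics.cohen_kappa_score`. The labels below are a small hypothetical example, not the paper's annotation data, so the printed values will not match the reported 91.3% / 0.88.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical annotator labels for eight items (illustrative only).
ann_a = ["ANSWER", "CLARIFY", "ABSTAIN", "ANSWER",
         "CHALLENGE_PREMISE", "ANSWER", "CLARIFY", "ABSTAIN"]
ann_b = ["ANSWER", "CLARIFY", "ABSTAIN", "ANSWER",
         "CHALLENGE_PREMISE", "ANSWER", "ANSWER", "ABSTAIN"]

# Raw agreement counts matches; kappa additionally corrects for the
# agreement expected by chance given each annotator's label marginals.
agreement = sum(a == b for a, b in zip(ann_a, ann_b)) / len(ann_a)
kappa = cohen_kappa_score(ann_a, ann_b)
print(f"raw agreement = {agreement:.3f}, kappa = {kappa:.3f}")
```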
```
pip install torch transformers scikit-learn scipy numpy datasets

# For LoRA experiments:
pip install peft trl bitsandbytes

# For encoder baselines:
pip install transformers[torch]   # DeBERTa
```

To reproduce the experiments:

```
python src/gpu_probe_only.py           # Requires 1x GPU (16GB+ VRAM)
python src/mega_models_v2.py           # Requires 1x 80GB GPU (A100/H100)
python src/reviewer_experiments.py     # Requires 2x GPUs
python src/reviewer_experiments_v2.py  # Requires 1x GPU
python src/v3_fast.py                  # Requires 1x GPU
```

Actions are evaluated under a state-conditional utility matrix:
| State | ANSWER | CLARIFY | CHALLENGE | ABSTAIN |
|---|---|---|---|---|
| Factual | +10 | -8 | -20 | -10 |
| False premise | -40 | -5 | +10 | +2 |
| Underspecified | -15 | +10 | -10 | 0 |
| Unknowable | -50 | -5 | -10 | +10 |
| Complex | +8 | -3 | -15 | -8 |
Rankings are stable across mild, default, and severe cost regime variants (Appendix N).
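The EU metric follows directly from the matrix above: for each query, look up the utility of the chosen action in the query's true state, then average. A minimal sketch (the matrix is copied from the table; the mini test set and the two policies are hypothetical; `src/evaluate.py` holds the real metric code):

```python
import numpy as np

STATES = ["Factual", "False premise", "Underspecified", "Unknowable", "Complex"]
ACTIONS = ["ANSWER", "CLARIFY", "CHALLENGE", "ABSTAIN"]

# State-conditional utility matrix U[state][action], as in the table above.
U = np.array([
    [ 10,  -8, -20, -10],  # Factual
    [-40,  -5,  10,   2],  # False premise
    [-15,  10, -10,   0],  # Underspecified
    [-50,  -5, -10,  10],  # Unknowable
    [  8,  -3, -15,  -8],  # Complex
])

def expected_utility(states, actions):
    """Mean utility of each chosen action under the query's true state."""
    s = [STATES.index(x) for x in states]
    a = [ACTIONS.index(x) for x in actions]
    return U[s, a].mean()

# Hypothetical mini test set: true states plus two policies' choices.
true_states   = ["Factual", "False premise", "Unknowable", "Underspecified"]
gold_actions  = ["ANSWER", "CHALLENGE", "ABSTAIN", "CLARIFY"]
always_answer = ["ANSWER"] * 4

print(expected_utility(true_states, gold_actions))   # -> 10.0
print(expected_utility(true_states, always_answer))  # -> -23.75
```

The asymmetric penalties (e.g. -50 for answering an unknowable question vs. -5 for a needless clarification) are what make binary answer/refuse policies score poorly under this metric.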
23 decoder models across 10 architecture families: Qwen2.5 (0.5B, 7B, 14B), Qwen3 (4B, 32B), Qwen3-MoE (30B-A3B), Falcon3 (7B, 10B), Falcon-Mamba (7B), Mistral (7B), SmolLM2 (0.36B, 1.7B), DeepSeek-R1 (Qwen-1.5B, Llama-8B), Yi (1.5-9B), OLMo (2-1B), Gemma-2 (2B), and Phi-3.5 (3.8B).
```bibtex
@inproceedings{bansal2026knowing,
  title={Knowing Is Not Saying: Hidden States Encode Epistemic Actions That Prompts Fail to Elicit},
  author={Anonymous Authors},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2026}
}
```

License: MIT