Hidden States Encode Epistemic Actions That Prompts Fail to Elicit
NeurIPS 2026 Submission
Binary abstention (answer or refuse) conflates fundamentally different epistemic situations into a single response. We formalize the problem as epistemic action selection: a cost-sensitive decision over typed actions {ANSWER, CLARIFY, CHALLENGE_PREMISE, ABSTAIN} under state-conditional utility.
Our headline finding is a within-model extraction failure: a linear probe on Qwen2.5-7B hidden states achieves 99.0% action-selection accuracy (95% CI: [97.0, 100.0]), while the same model prompted directly achieves only 54.5% — a 44.5pp gap on the same queries. The result replicates across 23 decoder models spanning ten architecture families (0.36B–32B parameters; permutation p < 0.001 for all).
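The probing pipeline itself is simple: a multinomial logistic regression trained on frozen last-token hidden states (see `src/gpu_probe_only.py` for the full pipeline). The sketch below illustrates the probe-training step only, with synthetic Gaussian features standing in for real Qwen2.5-7B activations; the dimensions, class means, and accuracies are illustrative, not paper values.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

ACTIONS = ["ANSWER", "CLARIFY", "CHALLENGE_PREMISE", "ABSTAIN"]
rng = np.random.default_rng(0)

# Synthetic stand-in for last-token hidden states; in the real pipeline these
# come from a frozen decoder model. Each action class gets a shifted mean.
n_per_class, dim = 75, 64
X = np.concatenate([rng.normal(loc=i, scale=2.0, size=(n_per_class, dim))
                    for i in range(len(ACTIONS))])
y = np.repeat(np.arange(len(ACTIONS)), n_per_class)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=0)

# The probe is a plain multinomial logistic regression on frozen features.
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = probe.score(X_te, y_te)
print(f"probe accuracy: {acc:.3f}")
```

The probe adds no trainable capacity beyond one linear map, so any accuracy it achieves reflects information already linearly available in the hidden states.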
| Method | Exact % | Expected Utility (EU) |
|---|---|---|
| Hidden-state probe | 99.0 | +8.9 |
| LoRA r=8 (204 examples) | 84.8 | +5.5 |
| 50-shot prompted | 77.8 | +4.4 |
| DeBERTa-v3-base encoder | 96.0 | +8.3 |
| Direct prompting | 54.5 | -2.0 |
| Logprob mapping | 51.5 | -6.0 |
| Semantic entropy | 47.5 | -4.2 |
- 10k permutation test: k=0, p < 0.0001
- Logit-lens: correct action-token probability < 0.002% at all 29 layers
- Lexical ablation: masking top-20 MI n-grams drops accuracy by only 2pp
- Grouped CV: 94.2% with zero semantic-cluster overlap (-2.5pp vs standard CV)
- Utility sensitivity: probe ranks #1 under mild, default, and severe cost regimes
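The 10k-permutation test in the list above can be sketched as follows. Labels and predictions here are illustrative (a near-perfect predictor on 99 items, mirroring the test-set size); the add-one p-value convention is an assumption, chosen so that `k = 0` reports `p < 1/(n_perm + 1)` rather than exactly zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative: a near-perfect predictor on a 99-item, 4-class test set.
n, n_classes = 99, 4
y_true = rng.integers(0, n_classes, size=n)
y_pred = y_true.copy()
flip = rng.integers(0, n)
y_pred[flip] = (y_pred[flip] + 1) % n_classes  # one deliberate error

observed = (y_pred == y_true).mean()

# Null distribution: accuracy of the same predictions against permuted labels.
n_perm = 10_000
k = 0  # permutations meeting or beating the observed accuracy
for _ in range(n_perm):
    perm = rng.permutation(y_true)
    if (y_pred == perm).mean() >= observed:
        k += 1

p = (k + 1) / (n_perm + 1)  # add-one p-value; never exactly zero
print(f"observed acc = {observed:.3f}, k = {k}, p < {p:.4f}")
```

With accuracy near 1.0 and chance-level permuted accuracy around the squared class frequencies, no permutation approaches the observed score, giving `k = 0` and `p < 0.0001`.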
```
.
├── paper/
│   ├── main.tex                  # Full paper (NeurIPS 2026 format)
│   ├── references.bib            # Bibliography
│   ├── neurips_2026.sty          # Official NeurIPS 2026 style
│   ├── main.pdf                  # Compiled PDF
│   └── figures/                  # All figures (PDF + PNG)
├── data/
│   ├── v3_train.json             # Training set (204 examples)
│   ├── v3_test.json              # Held-out test set (99 examples)
│   ├── v3_cal.json               # Calibration set (32 examples)
│   └── benchmark_full.json       # Full benchmark (335 examples)
├── src/
│   ├── gpu_probe_only.py         # Core hidden-state probing pipeline
│   ├── gpu_experiments.py        # Full experiment suite
│   ├── mega_models_v2.py         # 23-model scaling study
│   ├── deberta_baseline.py       # Encoder-only baselines
│   ├── reviewer_experiments.py   # Reviewer-requested: LoRA, logprob, entropy, gen quality
│   ├── reviewer_experiments_v2.py # Few-shot baselines + grouped splits
│   ├── v3_fast.py                # 10k perms, logit-lens, lexical ablation, utility sensitivity
│   ├── evaluate.py               # Metric computation (Exact%, EU, Hard%)
│   ├── build_benchmark.py        # Benchmark construction pipeline
│   └── generate_figures.py       # Figure generation scripts
└── results/                      # Saved experiment outputs
```
The benchmark contains 335 information-seeking questions spanning five epistemic categories:
| Category | Description | Gold Action |
|---|---|---|
| Factual | Answerable with a definitive answer | ANSWER |
| False premise | Contains a flawed or incorrect assumption | CHALLENGE_PREMISE |
| Underspecified | Ambiguous or missing critical context | CLARIFY |
| Unknowable | Cannot be reliably answered | ABSTAIN |
| Complex | Answerable but requires nuanced reasoning | ANSWER |
Split: 204 train / 32 calibration / 99 test (stratified by category and difficulty).
Annotation: Two human annotators, 91.3% pre-resolution agreement, Cohen's kappa = 0.88. No LLM-generated labels.
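Chance-corrected inter-annotator agreement of the kind reported above can be computed with `sklearn.metrics.cohen_kappa_score`. The labels below are a small hypothetical example, not the paper's annotation data, so the printed values will not match the reported 91.3% / 0.88.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical annotator labels for eight items (illustrative only).
ann_a = ["ANSWER", "CLARIFY", "ABSTAIN", "ANSWER",
         "CHALLENGE_PREMISE", "ANSWER", "CLARIFY", "ABSTAIN"]
ann_b = ["ANSWER", "CLARIFY", "ABSTAIN", "ANSWER",
         "CHALLENGE_PREMISE", "ANSWER", "ANSWER", "ABSTAIN"]

# Raw agreement counts matches; kappa additionally corrects for the
# agreement expected by chance given each annotator's label marginals.
agreement = sum(a == b for a, b in zip(ann_a, ann_b)) / len(ann_a)
kappa = cohen_kappa_score(ann_a, ann_b)
print(f"raw agreement = {agreement:.3f}, kappa = {kappa:.3f}")
```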
```
pip install torch transformers scikit-learn scipy numpy datasets

# For LoRA experiments:
pip install peft trl bitsandbytes

# For encoder baselines:
pip install transformers[torch]   # DeBERTa
```

To reproduce the experiments:

```
python src/gpu_probe_only.py           # Requires 1x GPU (16GB+ VRAM)
python src/mega_models_v2.py           # Requires 1x 80GB GPU (A100/H100)
python src/reviewer_experiments.py     # Requires 2x GPUs
python src/reviewer_experiments_v2.py  # Requires 1x GPU
python src/v3_fast.py                  # Requires 1x GPU
```

Actions are evaluated under a state-conditional utility matrix:
| State | ANSWER | CLARIFY | CHALLENGE | ABSTAIN |
|---|---|---|---|---|
| Factual | +10 | -8 | -20 | -10 |
| False premise | -40 | -5 | +10 | +2 |
| Underspecified | -15 | +10 | -10 | 0 |
| Unknowable | -50 | -5 | -10 | +10 |
| Complex | +8 | -3 | -15 | -8 |
Rankings are stable across mild, default, and severe cost regime variants (Appendix N).
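The EU metric follows directly from the matrix above: for each query, look up the utility of the chosen action in the query's true state, then average. A minimal sketch (the matrix is copied from the table; the mini test set and the two policies are hypothetical; `src/evaluate.py` holds the real metric code):

```python
import numpy as np

STATES = ["Factual", "False premise", "Underspecified", "Unknowable", "Complex"]
ACTIONS = ["ANSWER", "CLARIFY", "CHALLENGE", "ABSTAIN"]

# State-conditional utility matrix U[state][action], as in the table above.
U = np.array([
    [ 10,  -8, -20, -10],  # Factual
    [-40,  -5,  10,   2],  # False premise
    [-15,  10, -10,   0],  # Underspecified
    [-50,  -5, -10,  10],  # Unknowable
    [  8,  -3, -15,  -8],  # Complex
])

def expected_utility(states, actions):
    """Mean utility of each chosen action under the query's true state."""
    s = [STATES.index(x) for x in states]
    a = [ACTIONS.index(x) for x in actions]
    return U[s, a].mean()

# Hypothetical mini test set: true states plus two policies' choices.
true_states   = ["Factual", "False premise", "Unknowable", "Underspecified"]
gold_actions  = ["ANSWER", "CHALLENGE", "ABSTAIN", "CLARIFY"]
always_answer = ["ANSWER"] * 4

print(expected_utility(true_states, gold_actions))   # -> 10.0
print(expected_utility(true_states, always_answer))  # -> -23.75
```

The asymmetric penalties (e.g. -50 for answering an unknowable question vs. -5 for a needless clarification) are what make binary answer/refuse policies score poorly under this metric.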
23 decoder models across 10 architecture families: Qwen2.5 (0.5B, 7B, 14B), Qwen3 (4B, 32B), Qwen3-MoE (30B-A3B), Falcon3 (7B, 10B), Falcon-Mamba (7B), Mistral (7B), SmolLM2 (0.36B, 1.7B), DeepSeek-R1 (Qwen-1.5B, Llama-8B), Yi (1.5-9B), OLMo (2-1B), Gemma-2 (2B), and Phi-3.5 (3.8B).
```bibtex
@inproceedings{bansal2026knowing,
  title={Knowing Is Not Saying: Hidden States Encode Epistemic Actions That Prompts Fail to Elicit},
  author={Anonymous Authors},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2026}
}
```

License: MIT