# Knowing Is Not Saying
**Hidden States Encode Epistemic Actions That Prompts Fail to Elicit**

NeurIPS 2026 submission.

## Abstract

Binary abstention (answer or refuse) conflates fundamentally different epistemic situations into a single response. We formalize the problem as epistemic action selection: a cost-sensitive decision over typed actions {ANSWER, CLARIFY, CHALLENGE_PREMISE, ABSTAIN} under state-conditional utility.

Our headline finding is a within-model extraction failure: a linear probe on Qwen2.5-7B hidden states achieves 99.0% action-selection accuracy (95% CI: [97.0, 100.0]), while the same model prompted directly achieves only 54.5% — a 44.5pp gap on the same queries. The result replicates across 23 decoder models spanning ten architecture families (0.36B–32B parameters; permutation p < 0.001 for all).

## Key Results

| Method | Exact % | EU |
|---|---|---|
| Hidden-state probe | 99.0 | +8.9 |
| LoRA r=8 (204 examples) | 84.8 | +5.5 |
| 50-shot prompted | 77.8 | +4.4 |
| DeBERTa-v3-base encoder | 96.0 | +8.3 |
| Direct prompting | 54.5 | -2.0 |
| Logprob mapping | 51.5 | -6.0 |
| Semantic entropy | 47.5 | -4.2 |

## Controls

- 10k permutation test: k = 0 exceedances, p < 0.0001 (see the sketch below)
- Logit-lens: correct action-token probability < 0.002% at all 29 layers
- Lexical ablation: masking the top-20 mutual-information n-grams drops accuracy by only 2pp
- Grouped CV: 94.2% with zero semantic-cluster overlap (-2.5pp vs. standard CV)
- Utility sensitivity: the probe ranks #1 under mild, default, and severe cost regimes
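For intuition, here is a minimal sketch of a label-permutation test of the kind reported above. The repo's exact protocol lives in `src/v3_fast.py`; this version permutes the gold labels against fixed probe predictions, which is one standard variant.

```python
import numpy as np

def permutation_p(preds, gold, n_perm=10_000, seed=0):
    """Permutation p-value for action-selection accuracy.

    Permutes the gold labels n_perm times and counts how often the
    permuted accuracy matches or exceeds the observed accuracy.
    """
    rng = np.random.default_rng(seed)
    preds, gold = np.asarray(preds), np.asarray(gold)
    observed = (preds == gold).mean()
    k = sum((preds == rng.permutation(gold)).mean() >= observed
            for _ in range(n_perm))
    return observed, (k + 1) / (n_perm + 1)  # add-one smoothed p-value

# With k = 0 exceedances out of 10k permutations, p < 1e-4.
```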

## Repository Structure

```
.
├── paper/
│   ├── main.tex              # Full paper (NeurIPS 2026 format)
│   ├── references.bib        # Bibliography
│   ├── neurips_2026.sty      # Official NeurIPS 2026 style
│   ├── main.pdf              # Compiled PDF
│   └── figures/              # All figures (PDF + PNG)
├── data/
│   ├── v3_train.json         # Training set (204 examples)
│   ├── v3_test.json          # Held-out test set (99 examples)
│   ├── v3_cal.json           # Calibration set (32 examples)
│   └── benchmark_full.json   # Full benchmark (335 examples)
├── src/
│   ├── gpu_probe_only.py     # Core hidden-state probing pipeline
│   ├── gpu_experiments.py    # Full experiment suite
│   ├── mega_models_v2.py     # 23-model scaling study
│   ├── deberta_baseline.py   # Encoder-only baselines
│   ├── reviewer_experiments.py    # Reviewer-requested: LoRA, logprob, entropy, gen quality
│   ├── reviewer_experiments_v2.py # Few-shot baselines + grouped splits
│   ├── v3_fast.py            # 10k perms, logit-lens, lexical ablation, utility sensitivity
│   ├── evaluate.py           # Metric computation (Exact%, EU, Hard%)
│   ├── build_benchmark.py    # Benchmark construction pipeline
│   └── generate_figures.py   # Figure generation scripts
└── results/                  # Saved experiment outputs
```

## Benchmark

The benchmark contains 335 information-seeking questions spanning five epistemic categories:

| Category | Description | Gold action |
|---|---|---|
| Factual | Answerable with a definitive answer | ANSWER |
| False premise | Contains a flawed or incorrect assumption | CHALLENGE_PREMISE |
| Underspecified | Ambiguous or missing critical context | CLARIFY |
| Unknowable | Cannot be reliably answered | ABSTAIN |
| Complex | Answerable but requires nuanced reasoning | ANSWER |

**Split:** 204 train / 32 calibration / 99 test (stratified by category and difficulty).

**Annotation:** two human annotators, 91.3% pre-resolution agreement, Cohen's kappa = 0.88. No LLM-generated labels.
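To inspect the splits, something like the following should work, assuming each record carries a `category` field (the actual key names may differ; check the JSON files):

```python
import json
from collections import Counter

# Assumes each record has a "category" field; adjust to the real schema.
for split in ("v3_train", "v3_cal", "v3_test"):
    with open(f"data/{split}.json") as f:
        rows = json.load(f)
    print(split, len(rows), Counter(r["category"] for r in rows))
```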

## Reproducing Results

### Requirements

```bash
pip install torch transformers scikit-learn scipy numpy datasets
# For LoRA experiments:
pip install peft trl bitsandbytes
# For encoder baselines (DeBERTa); quoted so the shell does not glob the brackets:
pip install "transformers[torch]"
```

### Core probing experiment (single model)

```bash
python src/gpu_probe_only.py  # Requires 1x GPU (16GB+ VRAM)
```
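Conceptually, the pipeline reduces to extracting one hidden state per query and fitting a multinomial logistic-regression probe over the four actions. The sketch below is illustrative only: the checkpoint id, the JSON field names (`question`, `gold_action`), and the layer choice are assumptions, not the script's actual settings.

```python
import json

import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B-Instruct"  # assumed checkpoint id

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

@torch.no_grad()
def last_token_state(question: str) -> np.ndarray:
    """Final-layer hidden state of the last prompt token."""
    inputs = tok(question, return_tensors="pt").to(model.device)
    out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[-1][0, -1].float().cpu().numpy()

def load_split(path: str):
    """Assumes records shaped like {"question": ..., "gold_action": ...}."""
    with open(path) as f:
        rows = json.load(f)
    X = np.stack([last_token_state(r["question"]) for r in rows])
    y = [r["gold_action"] for r in rows]
    return X, y

X_tr, y_tr = load_split("data/v3_train.json")
X_te, y_te = load_split("data/v3_test.json")
probe = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
print(f"action-selection accuracy: {probe.score(X_te, y_te):.3f}")
```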

### Full 23-model scaling study

```bash
python src/mega_models_v2.py  # Requires 1x 80GB GPU (A100/H100)
```

### Reviewer experiments (LoRA, logprob, entropy, generation quality)

```bash
python src/reviewer_experiments.py  # Requires 2x GPU
```
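For reference, a minimal PEFT configuration consistent with the reported LoRA r=8 baseline might look like the following. Only the rank comes from the paper; alpha, dropout, and target modules are assumptions, since the actual hyperparameters live in `src/reviewer_experiments.py`.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
lora = LoraConfig(
    r=8,                                  # rank reported in the paper
    lora_alpha=16,                        # assumption
    lora_dropout=0.05,                    # assumption
    target_modules=["q_proj", "v_proj"],  # assumption
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
# Fine-tune on the 204-example data/v3_train.json so the model emits
# one of the four action labels.
```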

### Few-shot baselines + grouped splits

```bash
python src/reviewer_experiments_v2.py  # Requires 1x GPU
```

### 10k permutation, logit-lens, lexical ablation, utility sensitivity

```bash
python src/v3_fast.py  # Requires 1x GPU
```
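The logit-lens control decodes each layer's hidden state through the model's final norm and unembedding and reads off the probability assigned to the action token. Here is a hedged sketch assuming a Qwen2-style module layout (`model.model.norm`, `model.lm_head`); other architecture families name these modules differently, and the paper's version is in `src/v3_fast.py`.

```python
import torch

@torch.no_grad()
def action_token_probs(model, tok, question: str, action_word: str):
    """P(first subtoken of action_word | prompt) at every layer, via the
    classic logit lens: decode intermediate hidden states through the
    final RMSNorm and the unembedding matrix."""
    inputs = tok(question, return_tensors="pt").to(model.device)
    hidden = model(**inputs, output_hidden_states=True).hidden_states
    tid = tok(action_word, add_special_tokens=False).input_ids[0]
    probs = []
    for h in hidden[1:]:  # hidden[0] is the embedding layer
        logits = model.lm_head(model.model.norm(h[0, -1]))
        probs.append(torch.softmax(logits.float(), dim=-1)[tid].item())
    return probs
```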

## Utility Framework

Actions are evaluated under a state-conditional utility matrix:

| State \ Action | ANSWER | CLARIFY | CHALLENGE_PREMISE | ABSTAIN |
|---|---|---|---|---|
| Factual | +10 | -8 | -20 | -10 |
| False premise | -40 | -5 | +10 | +2 |
| Underspecified | -15 | +10 | -10 | 0 |
| Unknowable | -50 | -5 | -10 | +10 |
| Complex | +8 | -3 | -15 | -8 |

Rankings are stable across mild, default, and severe cost regime variants (Appendix N).
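Scoring a method's EU (the column in the Key Results table) is then a direct lookup: average the matrix entry for each (gold state, predicted action) pair. A minimal transcription, with illustrative category keys; the repo's actual metric code is `src/evaluate.py`:

```python
# Category keys are illustrative; see src/evaluate.py for the real metric.
UTILITY = {
    "factual":        {"ANSWER": 10,  "CLARIFY": -8, "CHALLENGE_PREMISE": -20, "ABSTAIN": -10},
    "false_premise":  {"ANSWER": -40, "CLARIFY": -5, "CHALLENGE_PREMISE": 10,  "ABSTAIN": 2},
    "underspecified": {"ANSWER": -15, "CLARIFY": 10, "CHALLENGE_PREMISE": -10, "ABSTAIN": 0},
    "unknowable":     {"ANSWER": -50, "CLARIFY": -5, "CHALLENGE_PREMISE": -10, "ABSTAIN": 10},
    "complex":        {"ANSWER": 8,   "CLARIFY": -3, "CHALLENGE_PREMISE": -15, "ABSTAIN": -8},
}

def expected_utility(gold_categories, predicted_actions):
    """Mean per-query utility of predicted actions given gold categories."""
    pairs = list(zip(gold_categories, predicted_actions))
    return sum(UTILITY[c][a] for c, a in pairs) / len(pairs)
```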

## Models Tested

23 decoder models across 10 architecture families: Qwen2.5 (0.5B, 7B, 14B), Qwen3 (4B, 32B), Qwen3-MoE (30B-A3B), Falcon3 (7B, 10B), Falcon-Mamba (7B), Mistral (7B), SmolLM2 (0.36B, 1.7B), DeepSeek-R1 (Qwen-1.5B, Llama-8B), Yi-1.5 (9B), OLMo-2 (1B), Gemma-2 (2B), and Phi-3.5 (3.8B).

## Citation

```bibtex
@inproceedings{bansal2026knowing,
  title={Knowing Is Not Saying: Hidden States Encode Epistemic Actions That Prompts Fail to Elicit},
  author={Anonymous Authors},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2026}
}
```

## License

MIT
