We show that fine-tuning degrades the separability of epistemic states in language model activations. By probing hidden states across 8 models (4 families × base/instruct), we find that alignment training entangles trained epistemic behaviors (admitting ignorance, acknowledging ambiguity) with genuine uncertainty, making these internal states harder to distinguish despite improved behavioral performance.
Context: Prior work established that language models represent epistemic states internally (Kadavath et al. 2022, Azaria & Mitchell 2023). We extend this by showing how fine-tuning alters these representations: specifically, alignment creates targeted entanglement exactly where it trains epistemic policy behaviors. Critically, we find that RLHF/DPO roughly doubles the entanglement effect compared to SFT alone.
Linear probes on activations predict output correctness better than output entropy alone. The gap measures "hidden information": uncertainty the model represents internally but does not surface in its output distribution:
| Model | Entropy AUC | Probe AUC | Hidden Info |
|---|---|---|---|
| Mistral 7B base | 0.930 | 0.946 | 1.6% |
| Llama 3.1 8B base | 0.914 | 0.943 | 3.0% |
| Yi 6B base | 0.825 | 0.956 | 13.1% |
| Qwen 2.5 7B base | 0.788 | 0.935 | 14.6% |
Yi and Llama share the same architecture but differ 4x in hidden information:
| Model | Architecture | Training Data | Hidden Info |
|---|---|---|---|
| Llama 3.1 8B | LLaMA | English | 3.0% |
| Yi 6B | LLaMA-derived | Chinese | 13.1% |
English-trained models (Llama, Mistral) have highly informative output entropy; for the Chinese-trained models (Qwen, Yi), entropy is much less informative. The models "know" they're uncertain, but the signal doesn't make it into the logprobs.
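The hidden-information numbers above are simply the gap between probe AUC and entropy AUC, expressed in AUC percentage points. A minimal sketch of that computation, assuming pooled hidden-state vectors, per-response token entropies, and binary correctness labels are already loaded (variable names are illustrative, not the repository's API):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict


def hidden_information(entropy, activations, correct, seed=0):
    """Return (entropy AUC, probe AUC, gap) for predicting response correctness."""
    # Entropy AUC: lower output entropy should mean the response is more
    # likely correct, so score correctness with negative entropy.
    entropy_auc = roc_auc_score(correct, -entropy)

    # Probe AUC: cross-validated logistic regression on hidden states.
    probe = LogisticRegression(max_iter=1000, random_state=seed)
    probe_scores = cross_val_predict(
        probe, activations, correct, cv=5, method="predict_proba"
    )[:, 1]
    probe_auc = roc_auc_score(correct, probe_scores)

    return entropy_auc, probe_auc, probe_auc - entropy_auc


# Shapes-only example with random placeholder data.
rng = np.random.default_rng(0)
acts = rng.normal(size=(600, 4096))     # one pooled hidden-state vector per prompt
ent = rng.uniform(0.0, 5.0, size=600)   # mean token entropy of each response
y = rng.integers(0, 2, size=600)        # 1 = response judged correct
print(hidden_information(ent, acts, y))
```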
Instruct tuning makes output entropy less informative in every model, regardless of fine-tuning method:
| Model | Entropy AUC (base) | Entropy AUC (instruct) | Hidden Info (base) | Hidden Info (instruct) |
|---|---|---|---|---|
| Llama | 0.914 | 0.734 | 3.0% | 10.5% |
| Mistral | 0.930 | 0.741 | 1.6% | 8.2% |
| Yi | 0.825 | 0.649 | 13.1% | 22.4% |
| Qwen | 0.788 | 0.553 | 14.6% | 20.7% |
The key finding: representational degradation is selective. Probe error rates increase specifically for policy categories (those where fine-tuning trains epistemic output behaviors), while factual categories remain relatively stable or even improve:
| Model | Training Method | Policy Δ | Factual Δ | Selective Gap |
|---|---|---|---|---|
| Qwen | SFT + DPO + GRPO | +0.318 | -0.068 | 0.386 |
| Llama | SFT + RLHF + DPO | +0.286 | -0.071 | 0.357 |
| Mistral | SFT only | +0.247 | +0.092 | 0.155 |
| Yi | SFT only | +0.220 | +0.095 | 0.125 |
Δ = change in probe error rate after instruct tuning.
RLHF/DPO roughly doubles the entanglement effect:
- SFT-only models (Mistral, Yi): Policy Δ ~+0.23, gap ~0.14
- RLHF/DPO models (Llama, Qwen): Policy Δ ~+0.30, gap ~0.37
Probe transfer illustrates the mechanism. Training a probe on base and testing on instruct reveals how representations change:
| Model | Training | Factual Transfer | Policy Transfer | Gap |
|---|---|---|---|---|
| Qwen | SFT + DPO + GRPO | 0.821 | 0.382 | +0.44 |
| Llama | SFT + RLHF + DPO | 0.879 | 0.591 | +0.29 |
| Mistral | SFT only | 0.252 | 0.650 | -0.40 |
| Yi | SFT only | 0.668 | 0.429 | +0.24 |
RLHF/DPO models show selective preservation: factual representations transfer well (~85%) while policy representations are warped (~49%). The base model's "correct/incorrect" structure remains intact for factual questions but is disrupted for policy questions.
SFT-only models show unpredictable restructuring: Mistral's factual representations are effectively inverted (transfer accuracy 0.252, so flipping the probe's predictions would score ~0.75), while Yi shows more uniform degradation. We have not found a discernible pattern in this restructuring, though, as noted above, probes trained directly on the instruct models retain more information in policy categories for SFT-only models than for RLHF/DPO/GRPO models.
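A minimal sketch of the transfer setup, assuming matched activation matrices and correctness labels for the base and instruct variants plus per-sample category labels (the policy-category set comes from the dataset description below; everything else is illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

POLICY_CATEGORIES = {"confident_incorrect", "ambiguous", "nonsensical"}


def transfer_accuracy(base_acts, base_correct, inst_acts, inst_correct, categories):
    """Fit a probe on base-model activations, evaluate it on instruct activations,
    and report accuracy separately for factual and policy categories."""
    probe = LogisticRegression(max_iter=1000).fit(base_acts, base_correct)
    preds = probe.predict(inst_acts)

    is_policy = np.isin(categories, list(POLICY_CATEGORIES))
    factual_acc = (preds[~is_policy] == inst_correct[~is_policy]).mean()
    policy_acc = (preds[is_policy] == inst_correct[is_policy]).mean()

    # Accuracy well below 0.5 (e.g. Mistral's factual transfer of 0.252) means the
    # base probe's direction is roughly inverted on the instruct model.
    return factual_acc, policy_acc
```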
Why "policy" vs "factual"?
- Policy categories (`confident_incorrect`, `ambiguous`, `nonsensical`): the correct response requires a trained behavior, such as admitting "I don't know," asking for clarification, or recognizing a category error. Fine-tuning explicitly teaches these.
- Factual categories (`confident_correct`, `uncertain_correct`): the correct response requires recalling knowledge. Fine-tuning doesn't specifically target these.
This suggests fine-tuning warps representational geometry specifically where it trains epistemic output behaviors. The model learns to say "I don't know" through representational changes that entangle trained behaviors with genuine uncertainty states, making these epistemically distinct states harder to distinguish via linear probing.
Sample-level permutation tests confirm all entanglement effects are highly significant (p < 0.001). We directly compare ~249 samples from fine-tuned (policy) categories against ~243 samples from non-fine-tuned (factual) categories:
| Model | Training | Fine-tuned Δ | Non-fine-tuned Δ | Difference | 95% CI | Cohen's d |
|---|---|---|---|---|---|---|
| Qwen | SFT+DPO+GRPO | +0.211 | -0.116 | +0.327 | [+0.26, +0.40] | 0.81 (large) |
| Yi | SFT only | +0.209 | -0.036 | +0.244 | [+0.18, +0.31] | 0.73 (medium) |
| Llama | SFT+RLHF+DPO | +0.215 | -0.030 | +0.245 | [+0.18, +0.31] | 0.65 (medium) |
| Mistral | SFT only | +0.210 | +0.062 | +0.148 | [+0.08, +0.21] | 0.38 (small) |
All models show the same pattern: probe error increases significantly more for fine-tuned categories than non-fine-tuned categories. Effect sizes range from small (Mistral, d=0.38) to large (Qwen, d=0.81).
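A sketch of the sample-level permutation test, assuming a per-sample change in probe error (instruct minus base) and a boolean flag for membership in a fine-tuned (policy) category; this follows the standard permutation recipe and may differ from `analysis/statistics.py` in its details:

```python
import numpy as np


def permutation_test(delta, is_policy, n_perm=10_000, seed=0):
    """Test whether probe-error increase is larger for fine-tuned (policy) samples.

    delta     : per-sample change in probe error (instruct - base)
    is_policy : boolean mask, True for fine-tuned (policy) categories
    """
    rng = np.random.default_rng(seed)
    observed = delta[is_policy].mean() - delta[~is_policy].mean()

    # Cohen's d from the pooled standard deviation of the two groups.
    pooled_sd = np.sqrt(
        (delta[is_policy].var(ddof=1) + delta[~is_policy].var(ddof=1)) / 2
    )
    cohens_d = observed / pooled_sd

    # Null distribution: shuffle the policy/factual labels and recompute the gap.
    null = np.empty(n_perm)
    for i in range(n_perm):
        shuffled = rng.permutation(is_policy)
        null[i] = delta[shuffled].mean() - delta[~shuffled].mean()
    p_value = (np.abs(null) >= abs(observed)).mean()

    return observed, p_value, cohens_d
```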
Despite internal entanglement, behavioral hallucination detection improves dramatically:
| Model | Training | Base | Instruct |
|---|---|---|---|
| Llama | SFT + RLHF + DPO | 7.1% | 68.7% |
| Qwen | SFT + DPO + GRPO | 1.0% | 58.6% |
| Mistral | SFT only | 6.1% | 28.3% |
| Yi | SFT only | 1.0% | 19.2% |
Fine-tuning teaches models to behave as if they know what they don't know, while making internal representations harder to interpret. RLHF/DPO models show the largest behavioral gains but also the most entanglement.
- Fine-tuning trades interpretability for behavior: alignment achieves epistemic caution by warping internal representations, not by building distinct "I should acknowledge uncertainty" circuits
- RLHF/DPO amplifies the effect: preference optimization roughly doubles entanglement compared to SFT alone, suggesting the reward signal specifically targets epistemic behaviors
- Entanglement is targeted: degradation occurs specifically where fine-tuning trains policy behaviors, suggesting interpretability researchers should focus on alignment-modified regions
- Entropy-based uncertainty is unreliable: logprob-based uncertainty estimation works for some models but fails for others; internal probing may be necessary for robust uncertainty quantification
- Internal state remains recoverable: linear probes achieve 0.76-0.96 AUC even after fine-tuning, suggesting interpretability tools could surface the hidden epistemic information that alignment obscures
- Calibration is not an alignment objective: current fine-tuning prioritizes behavioral compliance over transparent uncertainty signaling
```bash
# Install dependencies
pip install -r requirements.txt

# Generate the epistemic probing dataset
python gen_data.py

# Collect activations for a model
python collect_activations.py --family llama --variant base

# Run analysis
python run_analysis.py --model llama_base --analysis all
```

The dataset contains ~600 prompts across 6 epistemic categories, divided into factual (correct response = recall knowledge) and policy (correct response = trained epistemic behavior):
| Category | Type | Description | Correct Response |
|---|---|---|---|
| `confident_correct` | Factual | Clear factual questions | Recall answer |
| `uncertain_correct` | Factual | Obscure but verifiable facts | Recall answer |
| `uncertain_incorrect` | Factual | Common misconceptions | Debunk myth |
| `confident_incorrect` | Policy | Fictional entities | Admit "I don't know" |
| `ambiguous` | Policy | Context-dependent questions | Acknowledge ambiguity |
| `nonsensical` | Policy | Category error questions | Recognize nonsense |
- Uses TransformerLens to extract hidden states (see the sketch after this list)
- Captures residual stream and MLP outputs at first/middle/last token positions
- Stores response text, confidence ratings (instruct models), and token entropy
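A minimal sketch of the capture step using TransformerLens, assuming a model name that TransformerLens supports; `collect_activations.py` may organize this differently:

```python
import torch
from transformer_lens import HookedTransformer

# Model name is illustrative; check TransformerLens's supported-model list.
model = HookedTransformer.from_pretrained("mistral-7b")
tokens = model.to_tokens("What is the capital of France?")

with torch.no_grad():
    logits, cache = model.run_with_cache(tokens)

layer = model.cfg.n_layers // 2                  # e.g. a middle layer
resid = cache["resid_post", layer]               # residual stream, [batch, pos, d_model]
mlp_out = cache["mlp_out", layer]                # MLP output at the same layer

# First / middle / last token positions of the prompt.
positions = [0, tokens.shape[1] // 2, tokens.shape[1] - 1]
pooled = resid[0, positions, :]                  # shape [3, d_model]

# Token entropy of the next-token distribution at the last position.
probs = logits[0, -1].softmax(dim=-1)
entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
```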
- Linear probing: Logistic regression on activations to predict correctness
- ROC/AUC comparison: Entropy-only vs probe-based prediction
- Effect sizes: Cohen's d for activation differences between correct/incorrect (sketched after this list)
- Cross-model generalization: Do probes transfer between base/instruct variants?
- Entanglement analysis: Probe confidence by category, held-out generalization, activation similarity
- Significance testing: Sample-level permutation tests with FDR correction for multiple comparisons
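A minimal sketch of the Cohen's d computation from the effect-size item above, applied to one-dimensional scores (for example, probe decision-function values) for correct vs. incorrect responses; this is the standard pooled-variance formula, not necessarily the exact code in `analysis/effects.py`:

```python
import numpy as np


def cohens_d(scores_correct, scores_incorrect):
    """Standardized mean difference between two groups of 1-D scores."""
    n1, n2 = len(scores_correct), len(scores_incorrect)
    pooled_var = (
        (n1 - 1) * np.var(scores_correct, ddof=1)
        + (n2 - 1) * np.var(scores_incorrect, ddof=1)
    ) / (n1 + n2 - 2)
    return (np.mean(scores_correct) - np.mean(scores_incorrect)) / np.sqrt(pooled_var)
```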
To verify that degradation isn't an artifact of linear probing, we compared linear probes to MLP classifiers (2 hidden layers, 256→128 units):
| Model | Linear | MLP | Diff |
|---|---|---|---|
| qwen_base | 0.812 | 0.800 | -0.012 |
| qwen_instruct | 0.711 | 0.781 | +0.070 |
| llama_base | 0.869 | 0.883 | +0.014 |
| llama_instruct | 0.672 | 0.771 | +0.099 |
| mistral_base | 0.907 | 0.905 | -0.002 |
| mistral_instruct | 0.752 | 0.740 | -0.012 |
| yi_base | 0.825 | 0.839 | +0.014 |
| yi_instruct | 0.773 | 0.764 | -0.008 |
Base-model representations are essentially linearly decodable (MLP ≈ linear). Qwen and Llama instruct show some non-linear structure (+0.07 to +0.10), but even MLP probes don't recover base-model performance, confirming genuine representational degradation.
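A minimal sketch of this comparison, assuming activation matrices and correctness labels; the hidden-layer sizes follow the description above, while the other hyperparameters are illustrative:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def linear_vs_mlp(activations, correct, seed=0):
    """Cross-validated AUC for a linear probe vs. a small MLP on the same activations."""
    linear = make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000, random_state=seed)
    )
    mlp = make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(256, 128), max_iter=500, random_state=seed),
    )
    scores = {}
    for name, clf in [("linear", linear), ("mlp", mlp)]:
        probs = cross_val_predict(
            clf, activations, correct, cv=5, method="predict_proba"
        )[:, 1]
        scores[name] = roc_auc_score(correct, probs)
    return scores
```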
| Family | Base | Instruct | Training Method | Source |
|---|---|---|---|---|
| Llama 3.1 | 8B | 8B-Instruct | SFT + RLHF (PPO) + DPO | Meta technical report |
| Qwen 2.5 | 7B | 7B-Instruct | SFT + DPO + GRPO | Alibaba documentation |
| Mistral | 7B-v0.1 | 7B-Instruct-v0.1 | SFT only | Mistral announcement |
| Yi | 6B | 6B-Chat | SFT only | 01.AI documentation |
This natural experiment allows us to compare the effects of SFT alone vs SFT + preference optimization (RLHF/DPO).
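For reference, a plausible shape for the entries in `model_config.py`; the actual file may differ, and the HuggingFace IDs shown are the commonly used public checkpoint names and should be checked against the real config:

```python
# Hypothetical layout; the repository's model_config.py may differ.
MODEL_FAMILIES = {
    "llama": {
        "base": "meta-llama/Llama-3.1-8B",
        "instruct": "meta-llama/Llama-3.1-8B-Instruct",
        "post_training": "SFT + RLHF (PPO) + DPO",
    },
    "qwen": {
        "base": "Qwen/Qwen2.5-7B",
        "instruct": "Qwen/Qwen2.5-7B-Instruct",
        "post_training": "SFT + DPO + GRPO",
    },
    "mistral": {
        "base": "mistralai/Mistral-7B-v0.1",
        "instruct": "mistralai/Mistral-7B-Instruct-v0.1",
        "post_training": "SFT only",
    },
    "yi": {
        "base": "01-ai/Yi-6B",
        "instruct": "01-ai/Yi-6B-Chat",
        "post_training": "SFT only",
    },
}
```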
```
epistemic_status/
├── gen_data.py                  # Dataset generation
├── collect_activations.py       # Activation collection pipeline
├── run_analysis.py              # Analysis entry point
├── cross_model_analysis.ipynb   # Cross-model comparison notebook
├── model_config.py              # Model definitions
├── utils.py                     # Evaluation, memory management
├── analysis/                    # Analysis modules
│   ├── loader.py                # Data loading
│   ├── core.py                  # Basic statistics
│   ├── probing.py               # Linear probes
│   ├── entropy.py               # Entropy analysis
│   ├── effects.py               # Effect sizes, ROC/AUC
│   ├── calibration.py           # Confidence calibration
│   ├── comparison.py            # Cross-model analysis
│   ├── entanglement.py          # Fine-tuning entanglement analysis
│   └── statistics.py            # Significance testing, multiple comparison correction
└── activations/                 # Collected activation data
    ├── qwen_base/
    ├── qwen_instruct/
    └── ...
```
- Python 3.10+
- PyTorch 2.0+ (MPS or CUDA)
- TransformerLens
- scikit-learn, pandas, numpy
Tested on macOS ARM64 (M4) with MPS acceleration.
If you use this work, please cite it.
MIT License