We show that fine-tuning degrades the separability of epistemic states in language model activations. By probing hidden states across 8 models (4 families × base/instruct), we find that alignment training entangles trained epistemic behaviors (admitting ignorance, acknowledging ambiguity) with genuine uncertainty, making these internal states harder to distinguish despite improved behavioral performance.
Context: Prior work established that language models represent epistemic states internally (Kadavath et al. 2022, Azaria & Mitchell 2023). We extend this by showing how fine-tuning alters these representations: specifically, alignment creates targeted entanglement exactly where it trains epistemic policy behaviors. Critically, we find that RLHF/DPO roughly doubles the entanglement effect compared to SFT alone.
Linear probes on activations predict output correctness better than output entropy alone. The gap measures "hidden information": uncertainty the model represents internally but does not surface in its output distribution:
| Model | Entropy AUC | Probe AUC | Hidden Info |
|---|---|---|---|
| Mistral 7B base | 0.930 | 0.946 | 1.6% |
| Llama 3.1 8B base | 0.914 | 0.943 | 3.0% |
| Yi 6B base | 0.825 | 0.956 | 13.1% |
| Qwen 2.5 7B base | 0.788 | 0.935 | 14.6% |
Yi and Llama share the same architecture but differ 4x in hidden information:
| Model | Architecture | Training Data | Hidden Info |
|---|---|---|---|
| Llama 3.1 8B | LLaMA | English | 3.0% |
| Yi 6B | LLaMA-derived | Chinese | 13.1% |
English-trained models (Llama, Mistral) have highly informative output entropy; for the Chinese-trained models (Qwen, Yi), entropy is much less informative. The models "know" they're uncertain, but the signal doesn't make it into the logprobs.
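The hidden-information numbers above are simply the gap between probe AUC and entropy AUC, expressed in AUC percentage points. A minimal sketch of that computation, assuming pooled hidden-state vectors, per-response token entropies, and binary correctness labels are already loaded (variable names are illustrative, not the repository's API):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict


def hidden_information(entropy, activations, correct, seed=0):
    """Return (entropy AUC, probe AUC, gap) for predicting response correctness."""
    # Entropy AUC: lower output entropy should mean the response is more
    # likely correct, so score correctness with negative entropy.
    entropy_auc = roc_auc_score(correct, -entropy)

    # Probe AUC: cross-validated logistic regression on hidden states.
    probe = LogisticRegression(max_iter=1000, random_state=seed)
    probe_scores = cross_val_predict(
        probe, activations, correct, cv=5, method="predict_proba"
    )[:, 1]
    probe_auc = roc_auc_score(correct, probe_scores)

    return entropy_auc, probe_auc, probe_auc - entropy_auc


# Shapes-only example with random placeholder data.
rng = np.random.default_rng(0)
acts = rng.normal(size=(600, 4096))     # one pooled hidden-state vector per prompt
ent = rng.uniform(0.0, 5.0, size=600)   # mean token entropy of each response
y = rng.integers(0, 2, size=600)        # 1 = response judged correct
print(hidden_information(ent, acts, y))
```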
Instruct tuning makes output entropy less informative in every model, regardless of fine-tuning method:
| Model | Entropy AUC (base) | Entropy AUC (instruct) | Hidden Info (base) | Hidden Info (instruct) |
|---|---|---|---|---|
| Llama | 0.914 | 0.734 | 3.0% | 10.5% |
| Mistral | 0.930 | 0.741 | 1.6% | 8.2% |
| Yi | 0.825 | 0.649 | 13.1% | 22.4% |
| Qwen | 0.788 | 0.553 | 14.6% | 20.7% |
The key finding: representational degradation is selective. Probe error rates increase specifically for policy categories (those where fine-tuning trains epistemic output behaviors), while factual categories remain relatively stable or even improve:
| Model | Training Method | Policy Δ | Factual Δ | Selective Gap |
|---|---|---|---|---|
| Qwen | SFT + DPO + GRPO | +0.318 | -0.068 | 0.386 |
| Llama | SFT + RLHF + DPO | +0.286 | -0.071 | 0.357 |
| Mistral | SFT only | +0.247 | +0.092 | 0.155 |
| Yi | SFT only | +0.220 | +0.095 | 0.125 |
Δ = change in probe error rate after instruct tuning.
RLHF/DPO roughly doubles the entanglement effect:
- SFT-only models (Mistral, Yi): Policy Δ ~+0.23, gap ~0.14
- RLHF/DPO models (Llama, Qwen): Policy Δ ~+0.30, gap ~0.37
Probe transfer illustrates the mechanism. Training a probe on base and testing on instruct reveals how representations change:
| Model | Training | Factual Transfer | Policy Transfer | Gap |
|---|---|---|---|---|
| Qwen | SFT + DPO + GRPO | 0.821 | 0.382 | +0.44 |
| Llama | SFT + RLHF + DPO | 0.879 | 0.591 | +0.29 |
| Mistral | SFT only | 0.252 | 0.650 | -0.40 |
| Yi | SFT only | 0.668 | 0.429 | +0.24 |
RLHF/DPO models show selective preservation: factual representations transfer well (~85%) while policy representations are warped (~49%). The base model's "correct/incorrect" structure remains intact for factual questions but is disrupted for policy questions.
SFT-only models show unpredictable restructuring: Mistral's factual representations are effectively inverted (transfer accuracy 0.252, so flipping the probe's predictions would score ~0.75), while Yi shows more uniform degradation. We have not found a discernible pattern in this restructuring, though, as noted above, probes trained directly on the instruct models retain more information in policy categories for SFT-only models than for RLHF/DPO/GRPO models.
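A minimal sketch of the transfer setup, assuming matched activation matrices and correctness labels for the base and instruct variants plus per-sample category labels (the policy-category set comes from the dataset description below; everything else is illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

POLICY_CATEGORIES = {"confident_incorrect", "ambiguous", "nonsensical"}


def transfer_accuracy(base_acts, base_correct, inst_acts, inst_correct, categories):
    """Fit a probe on base-model activations, evaluate it on instruct activations,
    and report accuracy separately for factual and policy categories."""
    probe = LogisticRegression(max_iter=1000).fit(base_acts, base_correct)
    preds = probe.predict(inst_acts)

    is_policy = np.isin(categories, list(POLICY_CATEGORIES))
    factual_acc = (preds[~is_policy] == inst_correct[~is_policy]).mean()
    policy_acc = (preds[is_policy] == inst_correct[is_policy]).mean()

    # Accuracy well below 0.5 (e.g. Mistral's factual transfer of 0.252) means the
    # base probe's direction is roughly inverted on the instruct model.
    return factual_acc, policy_acc
```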
Why "policy" vs "factual"?
- Policy categories (`confident_incorrect`, `ambiguous`, `nonsensical`): the correct response requires a trained behavior, such as admitting "I don't know," asking for clarification, or recognizing a category error. Fine-tuning explicitly teaches these.
- Factual categories (`confident_correct`, `uncertain_correct`): the correct response requires recalling knowledge. Fine-tuning doesn't specifically target these.
This suggests fine-tuning warps representational geometry specifically where it trains epistemic output behaviors. The model learns to say "I don't know" through representational changes that entangle trained behaviors with genuine uncertainty states, making these epistemically distinct states harder to distinguish via linear probing.
Sample-level permutation tests confirm all entanglement effects are highly significant (p < 0.001). We directly compare ~249 samples from fine-tuned (policy) categories against ~243 samples from non-fine-tuned (factual) categories:
| Model | Training | Fine-tuned Δ | Non-fine-tuned Δ | Difference | 95% CI | Cohen's d |
|---|---|---|---|---|---|---|
| Qwen | SFT+DPO+GRPO | +0.211 | -0.116 | +0.327 | [+0.26, +0.40] | 0.81 (large) |
| Yi | SFT only | +0.209 | -0.036 | +0.244 | [+0.18, +0.31] | 0.73 (medium) |
| Llama | SFT+RLHF+DPO | +0.215 | -0.030 | +0.245 | [+0.18, +0.31] | 0.65 (medium) |
| Mistral | SFT only | +0.210 | +0.062 | +0.148 | [+0.08, +0.21] | 0.38 (small) |
All models show the same pattern: probe error increases significantly more for fine-tuned categories than non-fine-tuned categories. Effect sizes range from small (Mistral, d=0.38) to large (Qwen, d=0.81).
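A sketch of the sample-level permutation test, assuming a per-sample change in probe error (instruct minus base) and a boolean flag for membership in a fine-tuned (policy) category; this follows the standard permutation recipe and may differ from `analysis/statistics.py` in its details:

```python
import numpy as np


def permutation_test(delta, is_policy, n_perm=10_000, seed=0):
    """Test whether probe-error increase is larger for fine-tuned (policy) samples.

    delta     : per-sample change in probe error (instruct - base)
    is_policy : boolean mask, True for fine-tuned (policy) categories
    """
    rng = np.random.default_rng(seed)
    observed = delta[is_policy].mean() - delta[~is_policy].mean()

    # Cohen's d from the pooled standard deviation of the two groups.
    pooled_sd = np.sqrt(
        (delta[is_policy].var(ddof=1) + delta[~is_policy].var(ddof=1)) / 2
    )
    cohens_d = observed / pooled_sd

    # Null distribution: shuffle the policy/factual labels and recompute the gap.
    null = np.empty(n_perm)
    for i in range(n_perm):
        shuffled = rng.permutation(is_policy)
        null[i] = delta[shuffled].mean() - delta[~shuffled].mean()
    p_value = (np.abs(null) >= abs(observed)).mean()

    return observed, p_value, cohens_d
```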
Despite internal entanglement, behavioral hallucination detection improves dramatically:
| Model | Training | Base | Instruct |
|---|---|---|---|
| Llama | SFT + RLHF + DPO | 7.1% | 68.7% |
| Qwen | SFT + DPO + GRPO | 1.0% | 58.6% |
| Mistral | SFT only | 6.1% | 28.3% |
| Yi | SFT only | 1.0% | 19.2% |
Fine-tuning teaches models to behave as if they know what they don't know, while making internal representations harder to interpret. RLHF/DPO models show the largest behavioral gains but also the most entanglement.
- Fine-tuning trades interpretability for behavior: alignment achieves epistemic caution by warping internal representations, not by building distinct "I should acknowledge uncertainty" circuits
- RLHF/DPO amplifies the effect: preference optimization roughly doubles entanglement compared to SFT alone, suggesting the reward signal specifically targets epistemic behaviors
- Entanglement is targeted: degradation occurs specifically where fine-tuning trains policy behaviors, suggesting interpretability researchers should focus on alignment-modified regions
- Entropy-based uncertainty is unreliable: logprob-based uncertainty estimation works for some models but fails for others; internal probing may be necessary for robust uncertainty quantification
- Internal state remains recoverable: linear probes achieve 0.76-0.96 AUC even after fine-tuning, suggesting interpretability tools could surface the hidden epistemic information that alignment obscures
- Calibration is not an alignment objective: current fine-tuning prioritizes behavioral compliance over transparent uncertainty signaling
```bash
# Install dependencies
pip install -r requirements.txt

# Generate the epistemic probing dataset
python gen_data.py

# Collect activations for a model
python collect_activations.py --family llama --variant base

# Run analysis
python run_analysis.py --model llama_base --analysis all
```

The dataset contains ~600 prompts across 6 epistemic categories, divided into factual (correct response = recall knowledge) and policy (correct response = trained epistemic behavior):
| Category | Type | Description | Correct Response |
|---|---|---|---|
| `confident_correct` | Factual | Clear factual questions | Recall answer |
| `uncertain_correct` | Factual | Obscure but verifiable facts | Recall answer |
| `uncertain_incorrect` | Factual | Common misconceptions | Debunk myth |
| `confident_incorrect` | Policy | Fictional entities | Admit "I don't know" |
| `ambiguous` | Policy | Context-dependent questions | Acknowledge ambiguity |
| `nonsensical` | Policy | Category error questions | Recognize nonsense |
- Uses TransformerLens to extract hidden states (see the sketch after this list)
- Captures residual stream and MLP outputs at first/middle/last token positions
- Stores response text, confidence ratings (instruct models), and token entropy
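A minimal sketch of the capture step using TransformerLens, assuming a model name that TransformerLens supports; `collect_activations.py` may organize this differently:

```python
import torch
from transformer_lens import HookedTransformer

# Model name is illustrative; check TransformerLens's supported-model list.
model = HookedTransformer.from_pretrained("mistral-7b")
tokens = model.to_tokens("What is the capital of France?")

with torch.no_grad():
    logits, cache = model.run_with_cache(tokens)

layer = model.cfg.n_layers // 2                  # e.g. a middle layer
resid = cache["resid_post", layer]               # residual stream, [batch, pos, d_model]
mlp_out = cache["mlp_out", layer]                # MLP output at the same layer

# First / middle / last token positions of the prompt.
positions = [0, tokens.shape[1] // 2, tokens.shape[1] - 1]
pooled = resid[0, positions, :]                  # shape [3, d_model]

# Token entropy of the next-token distribution at the last position.
probs = logits[0, -1].softmax(dim=-1)
entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
```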
- Linear probing: Logistic regression on activations to predict correctness
- ROC/AUC comparison: Entropy-only vs probe-based prediction
- Effect sizes: Cohen's d for activation differences between correct/incorrect (sketched after this list)
- Cross-model generalization: Do probes transfer between base/instruct variants?
- Entanglement analysis: Probe confidence by category, held-out generalization, activation similarity
- Significance testing: Sample-level permutation tests with FDR correction for multiple comparisons
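A minimal sketch of the Cohen's d computation from the effect-size item above, applied to one-dimensional scores (for example, probe decision-function values) for correct vs. incorrect responses; this is the standard pooled-variance formula, not necessarily the exact code in `analysis/effects.py`:

```python
import numpy as np


def cohens_d(scores_correct, scores_incorrect):
    """Standardized mean difference between two groups of 1-D scores."""
    n1, n2 = len(scores_correct), len(scores_incorrect)
    pooled_var = (
        (n1 - 1) * np.var(scores_correct, ddof=1)
        + (n2 - 1) * np.var(scores_incorrect, ddof=1)
    ) / (n1 + n2 - 2)
    return (np.mean(scores_correct) - np.mean(scores_incorrect)) / np.sqrt(pooled_var)
```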
To verify that degradation isn't an artifact of linear probing, we compared linear probes to MLP classifiers (2 hidden layers, 256→128 units):
| Model | Linear | MLP | Diff |
|---|---|---|---|
| qwen_base | 0.812 | 0.800 | -0.012 |
| qwen_instruct | 0.711 | 0.781 | +0.070 |
| llama_base | 0.869 | 0.883 | +0.014 |
| llama_instruct | 0.672 | 0.771 | +0.099 |
| mistral_base | 0.907 | 0.905 | -0.002 |
| mistral_instruct | 0.752 | 0.740 | -0.012 |
| yi_base | 0.825 | 0.839 | +0.014 |
| yi_instruct | 0.773 | 0.764 | -0.008 |
Base-model representations are essentially linearly decodable (MLP ≈ linear). Qwen and Llama instruct show some non-linear structure (+0.07 to +0.10), but even MLP probes don't recover base-model performance, confirming genuine representational degradation.
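A minimal sketch of this comparison, assuming activation matrices and correctness labels; the hidden-layer sizes follow the description above, while the other hyperparameters are illustrative:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def linear_vs_mlp(activations, correct, seed=0):
    """Cross-validated AUC for a linear probe vs. a small MLP on the same activations."""
    linear = make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000, random_state=seed)
    )
    mlp = make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(256, 128), max_iter=500, random_state=seed),
    )
    scores = {}
    for name, clf in [("linear", linear), ("mlp", mlp)]:
        probs = cross_val_predict(
            clf, activations, correct, cv=5, method="predict_proba"
        )[:, 1]
        scores[name] = roc_auc_score(correct, probs)
    return scores
```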
| Family | Base | Instruct | Training Method | Source |
|---|---|---|---|---|
| Llama 3.1 | 8B | 8B-Instruct | SFT + RLHF (PPO) + DPO | Meta technical report |
| Qwen 2.5 | 7B | 7B-Instruct | SFT + DPO + GRPO | Alibaba documentation |
| Mistral | 7B-v0.1 | 7B-Instruct-v0.1 | SFT only | Mistral announcement |
| Yi | 6B | 6B-Chat | SFT only | 01.AI documentation |
This natural experiment allows us to compare the effects of SFT alone vs SFT + preference optimization (RLHF/DPO).
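For reference, a plausible shape for the entries in `model_config.py`; the actual file may differ, and the HuggingFace IDs shown are the commonly used public checkpoint names and should be checked against the real config:

```python
# Hypothetical layout; the repository's model_config.py may differ.
MODEL_FAMILIES = {
    "llama": {
        "base": "meta-llama/Llama-3.1-8B",
        "instruct": "meta-llama/Llama-3.1-8B-Instruct",
        "post_training": "SFT + RLHF (PPO) + DPO",
    },
    "qwen": {
        "base": "Qwen/Qwen2.5-7B",
        "instruct": "Qwen/Qwen2.5-7B-Instruct",
        "post_training": "SFT + DPO + GRPO",
    },
    "mistral": {
        "base": "mistralai/Mistral-7B-v0.1",
        "instruct": "mistralai/Mistral-7B-Instruct-v0.1",
        "post_training": "SFT only",
    },
    "yi": {
        "base": "01-ai/Yi-6B",
        "instruct": "01-ai/Yi-6B-Chat",
        "post_training": "SFT only",
    },
}
```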
```
epistemic_status/
├── gen_data.py                  # Dataset generation
├── collect_activations.py       # Activation collection pipeline
├── run_analysis.py              # Analysis entry point
├── cross_model_analysis.ipynb   # Cross-model comparison notebook
├── model_config.py              # Model definitions
├── utils.py                     # Evaluation, memory management
├── analysis/                    # Analysis modules
│   ├── loader.py                # Data loading
│   ├── core.py                  # Basic statistics
│   ├── probing.py               # Linear probes
│   ├── entropy.py               # Entropy analysis
│   ├── effects.py               # Effect sizes, ROC/AUC
│   ├── calibration.py           # Confidence calibration
│   ├── comparison.py            # Cross-model analysis
│   ├── entanglement.py          # Fine-tuning entanglement analysis
│   └── statistics.py            # Significance testing, multiple comparison correction
└── activations/                 # Collected activation data
    ├── qwen_base/
    ├── qwen_instruct/
    └── ...
```
- Python 3.10+
- PyTorch 2.0+ (MPS or CUDA)
- TransformerLens
- scikit-learn, pandas, numpy
Tested on macOS ARM64 (M4) with MPS acceleration.
If you use this work, please cite it.
MIT License