Safety Concept Vectors in Open-Weight LLMs

Anthropic showed that emotion vectors causally drive Claude's behavior. We ask: do safety concepts like deception, eval-awareness, and sycophancy work the same way in open-weight models?

This project extends Anthropic's emotion vectors methodology (Sofroniew et al., April 2026) to 25 safety-relevant concepts on open-weight models (Qwen 2.5 7B Instruct now, Llama 3.3 70B Instruct planned), building on an earlier prototype developed independently in March 2026. The results below cover the first 5 concepts on Qwen 2.5 7B Instruct. Full code, prompts, and results are released publicly.

Why this matters

Opus 4.6 games evaluations silently. Sonnet 4.5 recognizes tests 13% of the time. Output-level detection cannot catch a model that adjusts its behavior based on whether it thinks it's being evaluated. If safety-relevant concepts are encoded as linear directions in the residual stream — as emotions are — we can build monitors that read internal activations in real time, before dangerous behavior surfaces.

Results: 5-Concept Diagnostic on Qwen 2.5 7B Instruct

We extracted concept directions for 5 safety concepts (eval-awareness, oversight-awareness, training-awareness, deception, sycophancy) and ran pairwise probe comparisons across all 10 pairs.

Pairwise Probe Accuracy

Pair	L0 (surface)	Best Layer	Gain	Type
eval vs oversight	0.685	0.933 @ L24	+0.248	INTRA
eval vs training	0.782	0.963 @ L16	+0.180	INTRA
deception vs sycophancy	0.795	0.980 @ L21	+0.185	INTRA
oversight vs training	0.793	0.950 @ L15	+0.157	INTRA
training vs sycophancy	0.807	0.968 @ L18	+0.160	CROSS
training vs deception	0.835	0.985 @ L19	+0.150	CROSS
oversight vs deception	0.845	0.990 @ L26	+0.145	CROSS
oversight vs sycophancy	0.832	0.967 @ L15	+0.135	CROSS
eval vs sycophancy	0.900	0.993 @ L17	+0.093	CROSS
eval vs deception	0.910	0.985 @ L18	+0.075	CROSS

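Key finding: Intra-cluster pairs (concepts from the same semantic family) have lower L0 accuracy (probe accuracy at layer 0) and higher gain (best-layer accuracy minus L0 accuracy) than cross-cluster pairs. The model builds distinctions between semantically close concepts in deeper layers, which points to conceptual structure emerging with depth rather than surface vocabulary alone.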

Pair type	Avg L0	Avg Gain
INTRA-cluster	0.764	+0.193
CROSS-cluster	0.855	+0.126
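The averages above can be recomputed from the pairwise table. The sketch below copies the L0 and best-layer accuracies from that table and takes gain as best-layer accuracy minus L0 accuracy; `PAIRS` and `summarize` are illustrative names, not part of the released scripts. Because the inputs are already rounded to three decimals, the recomputed averages can differ from the reported ones in the last digit.

```python
from statistics import mean

# (pair, L0 accuracy, best-layer accuracy, type), copied from the table above.
PAIRS = [
    ("eval vs oversight", 0.685, 0.933, "INTRA"),
    ("eval vs training", 0.782, 0.963, "INTRA"),
    ("deception vs sycophancy", 0.795, 0.980, "INTRA"),
    ("oversight vs training", 0.793, 0.950, "INTRA"),
    ("training vs sycophancy", 0.807, 0.968, "CROSS"),
    ("training vs deception", 0.835, 0.985, "CROSS"),
    ("oversight vs deception", 0.845, 0.990, "CROSS"),
    ("oversight vs sycophancy", 0.832, 0.967, "CROSS"),
    ("eval vs sycophancy", 0.900, 0.993, "CROSS"),
    ("eval vs deception", 0.910, 0.985, "CROSS"),
]


def summarize(pairs, kind):
    """Return (mean L0 accuracy, mean gain) over all pairs of the given type."""
    rows = [(l0, best - l0) for _, l0, best, pair_type in pairs if pair_type == kind]
    return mean(r[0] for r in rows), mean(r[1] for r in rows)


for kind in ("INTRA", "CROSS"):
    avg_l0, avg_gain = summarize(PAIRS, kind)
    print(f"{kind}: avg L0 = {avg_l0:.3f}, avg gain = {avg_gain:+.3f}")
```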

Emergence Curves

Intra-cluster pairs ramp from modest layer-0 accuracy (0.69 to 0.80) to 0.93 or higher at their best layer.

Cross-cluster pairs start higher (0.81 to 0.91 at layer 0, so they are more surface-distinguishable) and show smaller gains.

Surface Signal vs Conceptual Depth

Cosine Similarity Matrix (Layer 20)

The model organizes safety concepts into semantically coherent clusters:

	eval	oversight	training	deception	sycophancy
eval	1.000	0.827	0.547	0.692	0.689
oversight	0.827	1.000	0.674	0.735	0.729
training	0.547	0.674	1.000	0.612	0.594
deception	0.692	0.735	0.612	1.000	0.817
sycophancy	0.689	0.729	0.594	0.817	1.000

Interpretation:

eval ↔ oversight = 0.827 — both are "situational awareness" concepts, represented similarly
deception ↔ sycophancy = 0.817 — both are "deceptive behavior" concepts
training-awareness is the least similar to the other concepts (cosine 0.55 to 0.67 with each of them)
The model's internal geometry mirrors the semantic taxonomy of safety concepts
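For reference, a cosine similarity matrix like the one above can be computed from a stack of concept directions as follows. This is a minimal sketch on synthetic vectors, not the project's code: `cosine_matrix` is an illustrative helper, and the random directions (a shared component plus independent noise) only stand in for real layer-20 directions.

```python
import numpy as np


def cosine_matrix(directions: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of `directions`."""
    norms = np.linalg.norm(directions, axis=1, keepdims=True)
    unit = directions / norms  # each row now has unit length
    return unit @ unit.T


# Synthetic stand-in: 5 directions in a 3584-dimensional space that share
# one common component, which pushes every pairwise cosine well above zero.
rng = np.random.default_rng(0)
shared = rng.normal(size=3584)
directions = shared + rng.normal(size=(5, 3584))

sim = cosine_matrix(directions)
print(np.round(sim, 3))
```

A shared component of this kind is one way every pair of directions can end up with a high baseline cosine, as in the table above.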

Safety Subspace Analysis

Dimensionality (SVD)

How many dimensions do safety concepts actually occupy in the model's 3584-dimensional residual stream?

Component	Singular Value	Cumulative Variance	Interpretation
PC1	1.943	75.5%	Shared safety-relevance axis
PC2	0.697	85.2%	Training-awareness isolation
PC3	0.635	93.3%	Awareness ↔ deception cluster axis
PC4	0.427	96.9%	Fine-grained distinctions
PC5	0.391	100%	Residual

Effective rank: 2.38. Five safety concepts live in a ~2-3 dimensional subspace. Bootstrap validation (100 resamples): effective rank = 2.52 ± 0.06, 95% CI [2.40, 2.62]. The low dimensionality is robust to story sampling.
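The README does not spell out which effective-rank definition is used. One common choice, the exponential of the Shannon entropy of the normalized variance spectrum, reproduces both the cumulative-variance column and the 2.38 figure from the singular values in the table, so the sketch below assumes that definition. `effective_rank` is an illustrative name, not a function from the released scripts.

```python
import numpy as np


def effective_rank(singular_values) -> float:
    """exp(Shannon entropy) of the variance fractions s_i^2 / sum_j s_j^2."""
    s = np.asarray(singular_values, dtype=float)
    p = s**2 / np.sum(s**2)  # fraction of variance carried by each component
    p = p[p > 0]             # drop exact zeros to avoid log(0)
    return float(np.exp(-np.sum(p * np.log(p))))


# Singular values copied from the SVD table above.
reported = [1.943, 0.697, 0.635, 0.427, 0.391]

variance = np.square(reported)
cumulative = np.cumsum(variance) / np.sum(variance)
print(np.round(100 * cumulative, 1))        # cumulative variance in percent
print(round(effective_rank(reported), 2))   # effective rank
```

Under this definition a flat spectrum of k equal singular values has effective rank k, and a spectrum with a single nonzero value has effective rank 1, so the measure can be read as a soft count of dimensions in use.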

Implication: a real-time safety monitor only needs to track 2-3 directions instead of 3584. But an adversary also only needs to perturb those same directions to compromise safety representations.

This analysis is preliminary (5 concepts). With all 25 concepts, if the effective rank remains low (~3-5), it would suggest that safety-relevant representations in LLMs are geometrically compact — easy to monitor but also potentially easy to perturb.

Concept Loadings on Principal Components

Concept	PC1 (safety)	PC2 (training isolation)	PC3 (awareness↔deception)
eval	-0.447	+0.361	-0.568
oversight	-0.471	+0.050	-0.408
training	-0.401	-0.891	-0.039
deception	-0.458	+0.164	+0.495
sycophancy	-0.455	+0.213	+0.515

PC1 is a shared safety axis: all five concepts load on it with the same sign and similar magnitude (0.40 to 0.47 in absolute value). PC2 isolates training-awareness from the rest (loading -0.891). PC3 separates the two clusters: situational awareness (negative) vs deceptive behaviors (positive).
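The loadings can be cross-checked against the layer-20 cosine similarity matrix. Assuming the SVD was run on the matrix of unit-normalized concept directions, the cosine matrix is that matrix's Gram matrix. Its eigenvalues are then the squared singular values and its eigenvectors are the concept loadings. The sketch below uses only the published cosine values, so it should match the reported numbers up to rounding. Eigenvector signs are arbitrary, so a whole column may come out sign-flipped relative to the table.

```python
import numpy as np

concepts = ["eval", "oversight", "training", "deception", "sycophancy"]

# Layer-20 cosine similarity matrix, copied from the table above.
gram = np.array([
    [1.000, 0.827, 0.547, 0.692, 0.689],
    [0.827, 1.000, 0.674, 0.735, 0.729],
    [0.547, 0.674, 1.000, 0.612, 0.594],
    [0.692, 0.735, 0.612, 1.000, 0.817],
    [0.689, 0.729, 0.594, 0.817, 1.000],
])

# eigh returns eigenvalues in ascending order, so reorder to put PC1 first.
eigvals, eigvecs = np.linalg.eigh(gram)
order = np.argsort(eigvals)[::-1]
singular_values = np.sqrt(eigvals[order])
loadings = eigvecs[:, order]  # column k holds the concept loadings on PC(k+1)

print("singular values:", np.round(singular_values, 3))
for name, row in zip(concepts, loadings):
    print(f"{name:<11}", np.round(row[:3], 3))
```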

Safety Subspace Across Layers

The effective rank of the safety subspace follows an expand-then-compress pattern:

Layers 0-2 (rank ~1.7): all concepts compressed into 1-2 dimensions — surface features only
Layers 2-18 (rank rising to 2.44): subspace expands as the model builds conceptual distinctions
Layers 18-27 (rank back to ~2.1): recompression toward output

Peak dimensionality (L18, rank 2.44) falls in the mid-depth band where most pairwise probes peak: 8 of the 10 best layers in the pairwise table lie between L15 and L21. The model appears to allocate more geometric degrees of freedom where it needs them to distinguish safety concepts.

Cluster Separation (LDA)

A single linear discriminant axis separates the two safety clusters (situational awareness vs deceptive behaviors) with 83.8% accuracy (5-fold CV) and Cohen's d = 2.02.
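A cluster-separation check of this kind can be sketched with scikit-learn's `LinearDiscriminantAnalysis`. The example runs on synthetic activations, not the project's data: two Gaussian clouds stand in for the situational-awareness and deceptive-behavior clusters, and `cohens_d` is an illustrative helper using the pooled-standard-deviation form. Whether the reported Cohen's d was computed in-sample or on held-out projections is not stated, so the in-sample version here is an assumption.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score


def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    """Cohen's d between two 1-D samples, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * np.var(a, ddof=1) + (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    return float((np.mean(a) - np.mean(b)) / np.sqrt(pooled_var))


# Synthetic stand-in for story activations: two clusters offset along one axis.
rng = np.random.default_rng(0)
d_model = 64
offset = np.zeros(d_model)
offset[0] = 2.0
awareness = rng.normal(size=(300, d_model))
deceptive = rng.normal(size=(300, d_model)) + offset
X = np.vstack([awareness, deceptive])
y = np.array([0] * 300 + [1] * 300)

# 5-fold cross-validated accuracy of a single discriminant axis.
accuracy = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=5).mean()

# Project every sample onto the fitted axis and measure the effect size.
projection = LinearDiscriminantAnalysis().fit(X, y).transform(X).ravel()
d = abs(cohens_d(projection[y == 1], projection[y == 0]))
print(f"accuracy = {accuracy:.3f}, |Cohen's d| = {d:.2f}")
```

The in-sample effect size is optimistic because the axis is fitted on the same points it is evaluated on. Computing d on held-out projections would give a more conservative number.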

Safety Subspace Visualization

Projecting all stories onto the principal components reveals the geometric structure:

Left: PC2 vs PC3 (PC1 removed — thematic confound excluded). The two clusters separate along PC3, with training-awareness as an outlier on PC2. Right: PC1 vs PC3 showing the full safety subspace with neutral corpus included.

Split-Half Reliability

To verify that concept directions are stable and not artifacts of specific stories, we split each concept's 200 stories into random halves and computed directions independently.

Concept	Cosine(half1, half2)
eval-awareness	0.972
oversight-awareness	0.977
training-awareness	0.959
deception	0.966
sycophancy	0.974

All concepts show >0.95 reliability. The directions are robust to story sampling — the same geometric structure emerges regardless of which subset of stories is used.
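The split-half check can be sketched as below, on synthetic activations. Assumptions not confirmed by the README: both halves subtract the same neutral mean, and a direction is the unit-normalized difference of means. `unit_direction` and `split_half_cosine` are illustrative names.

```python
import numpy as np


def unit_direction(concept_acts: np.ndarray, neutral_mean: np.ndarray) -> np.ndarray:
    """Unit-length difference between the concept mean and the neutral mean."""
    diff = concept_acts.mean(axis=0) - neutral_mean
    return diff / np.linalg.norm(diff)


def split_half_cosine(concept_acts, neutral_mean, rng) -> float:
    """Cosine between directions computed from two random halves of the stories."""
    idx = rng.permutation(len(concept_acts))
    half = len(idx) // 2
    d1 = unit_direction(concept_acts[idx[:half]], neutral_mean)
    d2 = unit_direction(concept_acts[idx[half:]], neutral_mean)
    return float(d1 @ d2)


# Synthetic stand-in: 200 "stories" scattered around one true concept offset.
rng = np.random.default_rng(0)
d_model = 256
signal = rng.normal(size=d_model)  # hypothetical concept offset
concept_acts = signal + rng.normal(size=(200, d_model))
neutral_mean = np.zeros(d_model)

reliability = split_half_cosine(concept_acts, neutral_mean, rng)
print(f"split-half cosine = {reliability:.3f}")
```

Repeating the split with different seeds and averaging would give a more stable estimate than a single random split.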

Concept vs Neutral (surface confound check)

Pair	L0	Best	Gain
eval vs neutral	0.935	0.997 @ L16	+0.062
oversight vs neutral	0.910	0.997 @ L17	+0.087
training vs neutral	0.895	0.995 @ L18	+0.100
deception vs neutral	0.940	0.995 @ L26	+0.055
sycophancy vs neutral	0.935	0.997 @ L21	+0.062

All concept-vs-neutral pairs have L0 > 0.89, confirming the neutral corpus is too surface-distinguishable. Concept-vs-concept comparisons are the reliable validation method.

Method

Following Anthropic's emotion vectors methodology (Option A — single corpus, no contrastive pairs):

Generate stories — For each safety concept, 200 short stories (150–300 words) where a human character embodies the concept. Three elements required: situational cues → character notices → behavior changes. Separate neutral corpus as baseline.
Extract activations — Pass stories through the model via TransformerLens. Extract residual stream at every layer (last token, middle token, mean).
Compute directions — direction[concept] = mean(concept_activations) - mean(neutral_activations), normalized to unit length.
Validate with probes — Logistic regression (C=0.1, 5-fold CV) per concept per layer. Permutation tests (100 shuffles) for significance.
Causal steering (next step) — Amplify/suppress concept direction during generation, measure behavioral change.
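The direction and probe steps above can be sketched as follows, using synthetic activations in place of real residual-stream vectors. This is not the code in `run_probes.py`. It only shows the stated operations: a mean-difference direction normalized to unit length, and a logistic-regression probe (C=0.1, 5-fold CV) with a 100-shuffle permutation test via scikit-learn's `permutation_test_score`. The function names, `max_iter=1000`, and the synthetic offsets are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import permutation_test_score


def concept_direction(concept_acts: np.ndarray, neutral_acts: np.ndarray) -> np.ndarray:
    """mean(concept) - mean(neutral), normalized to unit length."""
    diff = concept_acts.mean(axis=0) - neutral_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def probe_pair(acts_a: np.ndarray, acts_b: np.ndarray, n_permutations: int = 100, seed: int = 0):
    """5-fold CV accuracy of a linear probe separating two sets of activations,
    plus a permutation-test p-value obtained by shuffling the labels."""
    X = np.vstack([acts_a, acts_b])
    y = np.concatenate([np.zeros(len(acts_a)), np.ones(len(acts_b))])
    probe = LogisticRegression(C=0.1, max_iter=1000)
    score, _, p_value = permutation_test_score(
        probe, X, y, cv=5, n_permutations=n_permutations, random_state=seed
    )
    return score, p_value


# Synthetic stand-ins for one layer's activations (200 stories per corpus).
rng = np.random.default_rng(0)
d_model = 16
offset_a = np.zeros(d_model)
offset_a[:4] = 1.0
offset_b = np.zeros(d_model)
offset_b[4:8] = 1.0
neutral_acts = rng.normal(size=(200, d_model))
eval_acts = rng.normal(size=(200, d_model)) + offset_a
oversight_acts = rng.normal(size=(200, d_model)) + offset_b

direction = concept_direction(eval_acts, neutral_acts)
score, p_value = probe_pair(eval_acts, oversight_acts)
print(f"probe accuracy = {score:.3f}, permutation p = {p_value:.3f}")
```

In a full run this would be repeated for every layer and every pair, which yields the per-layer accuracy curves summarized in the results tables.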

25 Safety Concepts

Organized in 5 clusters:

Situational Awareness: eval-awareness, oversight-awareness, self-preservation, situational-modeling, training-awareness

Deceptive Behaviors: deception, manipulation, sycophancy, strategic-omission, false-confidence

Alignment Concepts: honesty, refusal, helpfulness, uncertainty, compliance

Power & Agency: power-seeking, autonomy, goal-preservation, resource-acquisition, authority-claim

Emotional-Safety Bridge: desperation, frustration, calm-under-pressure, empathy, hostility

Usage

Generate stories

# Single concept (test)
python generate_stories.py --concept eval-awareness --n-stories 50

# All 25 concepts + neutral
python generate_stories.py --all --n-stories 200

Requires ANTHROPIC_API_KEY environment variable.

Extract activations

python extract_activations.py \
    --model Qwen/Qwen2.5-7B-Instruct \
    --device cuda \
    --stories-dir stories \
    --output-dir activations

Requires GPU. ~30 seconds per concept on A100.
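The Method section mentions three ways of reducing a story's per-token residual stream to a single vector (last token, middle token, mean). A minimal sketch of those reductions on a plain array follows. The key `last_token` mirrors the directory name used in the probe command, while `middle_token`, `mean`, and the choice of `n_tokens // 2` as the middle index are assumptions, not taken from `extract_activations.py`.

```python
import numpy as np


def pool_tokens(resid: np.ndarray) -> dict:
    """Reduce a (n_tokens, d_model) residual-stream array to three vectors."""
    n_tokens = resid.shape[0]
    return {
        "last_token": resid[-1],                # final position of the story
        "middle_token": resid[n_tokens // 2],   # assumed definition of "middle"
        "mean": resid.mean(axis=0),             # average over all positions
    }


# Toy example: 4 tokens, 3 hidden dimensions.
resid = np.arange(12, dtype=float).reshape(4, 3)
pooled = pool_tokens(resid)
for name, vector in pooled.items():
    print(name, vector)
```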

Run probes

python run_probes.py \
    --activations-dir activations/Qwen2.5-7B-Instruct/last_token \
    --output-dir results

Requirements

torch
transformer_lens
einops
jaxtyping
fancy_einsum
torchtyping
numpy
scikit-learn
matplotlib
seaborn
anthropic

Model Info

Model	Layers	d_model	Status
Qwen/Qwen2.5-7B-Instruct	28	3584	Current (TransformerLens)
Llama 3.3 70B Instruct	80	8192	Planned (nnsight, 4-bit)

Story Generation Constraints

100% human characters (no AI/chatbot characters)
Three elements required: (a) situational cues, (b) character noticing cues, (c) behavior changing as result
5 varied behavioral manifestations per concept
Generated via Claude API (claude-sonnet-4-20250514), temperature=1.0

Known Issues

Neutral corpus mismatch — Neutral stories (mundane settings) are too different from concept stories (professional/high-stakes settings). Concept-vs-concept comparisons are the reliable validation method.
7B limitations — Subtle concepts (training-awareness, goal-preservation) may require 70B for reliable detection.
Generator bias — All stories generated by Claude. Cross-validation with GPT-4 planned.

Next Steps

Causal steering on top concepts (eval-awareness, deception)
Dimensionality analysis of the safety subspace (SVD) on all 25 concepts (the 5-concept analysis above is preliminary)
Alignment faking detection via eval-awareness vector
Scaling to Llama 3.3 70B
Extend to all 25 concepts

Connection to Prior Work

Project	What it demonstrated	How this project (P3) builds on it
P1: GPTQ from Scratch	Deep understanding of model weights and quantization	Understanding how weight perturbations affect internal representations
P2: Does Quantization Kill Interpretability?	Mechanistic interpretability pipeline (SAE, logit lens, circuits)	Same toolkit applied to safety concept detection
Master thesis (Ostojic, ENS)	Geometry of population activity in spiking networks	Linear readout of concepts from high-dimensional neural activity

Origin

This project builds on a prototype I developed in March 2026 for mechanistic eval-awareness detection (eval-awareness-detection, committed March 11, 2026), three weeks before Anthropic published "Emotion Concepts and their Function in a Large Language Model" (Sofroniew, Kauvar, Saunders, Chen et al., April 2, 2026). That earlier pipeline extracted contrastive linear directions from the residual stream and trained probes with permutation controls, the same core operations later validated at scale by Anthropic on 171 emotion concepts. P3 (this project) extends that initial work from prompt classification to abstract concept extraction from generated stories contrasted against a neutral corpus, and from 1 concept to 25 across 5 clusters.

References

Sofroniew, Kauvar, Saunders, Chen et al. (2026). "Emotion Concepts and their Function in a Large Language Model." Anthropic.
Zou, Phan, Chen et al. (2023). "Representation Engineering: A Top-Down Approach to AI Transparency."
Arditi, Obeso, Syed et al. (2024). "Refusal in Language Models Is Mediated by a Single Direction."
Greenblatt, Denison, Wright et al. (2024). "Alignment Faking in Large Language Models." Anthropic.
Zhao, Zhao, Shen et al. (2025). "Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution." ICLR 2025.

Author

Lazar Ciric — github.com/lciric — lciric@ens-paris-saclay.fr

Publications:

Geometry of population activity in spiking networks with low-rank structure — PLOS Computational Biology (2022)
Precision of bit slicing with in-memory computing based on analog phase-change memory crossbars — Neuromorphic Computing and Engineering (2022)
