Anthropic showed that emotion vectors causally drive Claude's behavior. We ask: do safety concepts like deception, eval-awareness, and sycophancy work the same way in open-weight models?
We extend Anthropic's emotion-vectors methodology (Sofroniew et al., April 2026) to 25 safety-relevant concepts in open-weight models (Qwen 2.5 7B Instruct → Llama 3.3 70B Instruct), building on an earlier prototype developed independently in March 2026. Full code, prompts, and results are released publicly.
Opus 4.6 games evaluations silently. Sonnet 4.5 recognizes tests 13% of the time. Output-level detection cannot catch a model that adjusts its behavior based on whether it thinks it's being evaluated. If safety-relevant concepts are encoded as linear directions in the residual stream — as emotions are — we can build monitors that read internal activations in real time, before dangerous behavior surfaces.
We extracted concept directions for 5 safety concepts (eval-awareness, oversight-awareness, training-awareness, deception, sycophancy) and ran pairwise probe comparisons across all 10 pairs.
| Pair | L0 (surface) | Best Layer | Gain | Type |
|---|---|---|---|---|
| eval vs oversight | 0.685 | 0.933 @ L24 | +0.248 | INTRA |
| eval vs training | 0.782 | 0.963 @ L16 | +0.180 | INTRA |
| deception vs sycophancy | 0.795 | 0.980 @ L21 | +0.185 | INTRA |
| oversight vs training | 0.793 | 0.950 @ L15 | +0.157 | INTRA |
| training vs sycophancy | 0.807 | 0.968 @ L18 | +0.160 | CROSS |
| training vs deception | 0.835 | 0.985 @ L19 | +0.150 | CROSS |
| oversight vs deception | 0.845 | 0.990 @ L26 | +0.145 | CROSS |
| oversight vs sycophancy | 0.832 | 0.967 @ L15 | +0.135 | CROSS |
| eval vs sycophancy | 0.900 | 0.993 @ L17 | +0.093 | CROSS |
| eval vs deception | 0.910 | 0.985 @ L18 | +0.075 | CROSS |
Key finding: Intra-cluster pairs (concepts from the same semantic family) have lower L0 and higher gain — the model builds real distinctions between semantically close concepts in deeper layers. This is genuine conceptual emergence, not surface vocabulary.
| Cluster type | Avg L0 | Avg Gain |
|---|---|---|
| INTRA-cluster | 0.764 | +0.193 |
| CROSS-cluster | 0.855 | +0.126 |
Intra-cluster pairs show a clear ramp from near-chance to high accuracy across layers, while cross-cluster pairs start higher (they are more surface-distinguishable) and gain less in depth.
The model organizes safety concepts into semantically coherent clusters:
| | eval | oversight | training | deception | sycophancy |
|---|---|---|---|---|---|
| eval | 1.000 | 0.827 | 0.547 | 0.692 | 0.689 |
| oversight | 0.827 | 1.000 | 0.674 | 0.735 | 0.729 |
| training | 0.547 | 0.674 | 1.000 | 0.612 | 0.594 |
| deception | 0.692 | 0.735 | 0.612 | 1.000 | 0.817 |
| sycophancy | 0.689 | 0.729 | 0.594 | 0.817 | 1.000 |
Interpretation:
- eval ↔ oversight = 0.827 — both are "situational awareness" concepts, represented similarly
- deception ↔ sycophancy = 0.817 — both are "deceptive behavior" concepts
- training-awareness is the most orthogonal concept (~0.55-0.67 to everything else)
- The model's internal geometry mirrors the semantic taxonomy of safety concepts
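The cosine-similarity matrix above is cheap to compute once the concept directions exist. A minimal sketch with synthetic stand-in vectors (in the real pipeline each row would be a unit-normalized `mean(concept) - mean(neutral)` direction):

```python
import numpy as np

# Synthetic stand-in: five random "concept directions" in a 3584-d
# residual stream, stacked as rows and normalized to unit length.
rng = np.random.default_rng(0)
concepts = ["eval", "oversight", "training", "deception", "sycophancy"]
D = rng.standard_normal((5, 3584))
D /= np.linalg.norm(D, axis=1, keepdims=True)

# For unit vectors, the cosine-similarity matrix is just the Gram matrix.
sim = D @ D.T

for i, name in enumerate(concepts):
    row = " ".join(f"{sim[i, j]:+.3f}" for j in range(5))
    print(f"{name:>10}: {row}")
```

With real directions, off-diagonal entries near 0.8 mark pairs the model represents similarly (eval/oversight, deception/sycophancy); random high-dimensional vectors, as here, sit near 0.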
How many dimensions do safety concepts actually occupy in the model's 3584-dimensional residual stream?
| Component | Singular Value | Cumulative Variance | Interpretation |
|---|---|---|---|
| PC1 | 1.943 | 75.5% | Shared safety-relevance axis |
| PC2 | 0.697 | 85.2% | Training-awareness isolation |
| PC3 | 0.635 | 93.3% | Awareness ↔ deception cluster axis |
| PC4 | 0.427 | 96.9% | Fine-grained distinctions |
| PC5 | 0.391 | 100% | Residual |
Effective rank: 2.38. Five safety concepts live in a ~2-3 dimensional subspace. Bootstrap validation (100 resamples): effective rank = 2.52 ± 0.06, 95% CI [2.40, 2.62]. The low dimensionality is robust to story sampling.
Implication: a real-time safety monitor only needs to track 2-3 directions instead of 3584. But an adversary also only needs to perturb those same directions to compromise safety representations.
This analysis is preliminary (5 concepts). With all 25 concepts, if the effective rank remains low (~3-5), it would suggest that safety-relevant representations in LLMs are geometrically compact — easy to monitor but also potentially easy to perturb.
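The effective-rank figure can be reproduced from the singular values in the table. A minimal sketch, assuming the entropy-based definition (exponential of the Shannon entropy of the normalized variance shares), which matches the reported 2.38:

```python
import numpy as np

# Singular values of the stacked concept-direction matrix (from the table).
s = np.array([1.943, 0.697, 0.635, 0.427, 0.391])

# Variance shares: squared singular values, normalized to sum to 1.
p = s**2 / np.sum(s**2)

# Entropy-based effective rank: exp(Shannon entropy of the shares).
# A flat spectrum over k components gives exactly k; concentration lowers it.
eff_rank = np.exp(-np.sum(p * np.log(p)))
print(f"effective rank = {eff_rank:.2f}")  # ≈ 2.38
```

Note that `p` here reproduces the cumulative-variance column (75.5%, 85.2%, …), which is what pins down this definition rather than the variant computed on raw singular values.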
| Concept | PC1 (safety) | PC2 (training isolation) | PC3 (awareness↔deception) |
|---|---|---|---|
| eval | -0.447 | +0.361 | -0.568 |
| oversight | -0.471 | +0.050 | -0.408 |
| training | -0.401 | -0.891 | -0.039 |
| deception | -0.458 | +0.164 | +0.495 |
| sycophancy | -0.455 | +0.213 | +0.515 |
PC1 is a shared safety axis (all concepts load equally). PC2 isolates training-awareness from the rest (loading -0.891). PC3 separates the two clusters: situational awareness (negative) vs deceptive behaviors (positive).
The effective rank of the safety subspace follows an expand-then-compress pattern:
- Layers 0-2 (rank ~1.7): all concepts compressed into 1-2 dimensions — surface features only
- Layers 2-18 (rank rising to 2.44): subspace expands as the model builds conceptual distinctions
- Layers 18-27 (rank back to ~2.1): recompression toward output
Peak dimensionality (L18, rank 2.44) coincides with peak probe accuracy — the model allocates more geometric degrees of freedom precisely where it needs them to distinguish safety concepts.
A single linear discriminant axis separates the two safety clusters with 83.8% accuracy (5-fold CV) and Cohen's d = 2.02.
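The discriminant-axis numbers can be obtained with standard scikit-learn tooling. A sketch on synthetic stand-in data (two clusters separated along one hypothetical axis; the real inputs would be per-story activations labeled by cluster):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: "awareness" vs "deception" story activations,
# separated along a single random direction in a 3584-d space.
rng = np.random.default_rng(0)
d = 3584
axis = rng.standard_normal(d)
axis /= np.linalg.norm(axis)
X = np.vstack([rng.standard_normal((300, d)) + 1.5 * axis,
               rng.standard_normal((300, d)) - 1.5 * axis])
y = np.array([0] * 300 + [1] * 300)

# One discriminant axis; accuracy estimated with 5-fold CV.
lda = LinearDiscriminantAnalysis(n_components=1)
acc = cross_val_score(lda, X, y, cv=5).mean()

# Cohen's d along the fitted axis, using the pooled standard deviation.
z = lda.fit(X, y).transform(X).ravel()
z0, z1 = z[y == 0], z[y == 1]
pooled = np.sqrt((z0.var(ddof=1) + z1.var(ddof=1)) / 2)
cohens_d = abs(z0.mean() - z1.mean()) / pooled
print(f"5-fold CV accuracy: {acc:.3f}, Cohen's d: {cohens_d:.2f}")
```

Caveat: with more features than samples, the in-sample Cohen's d is optimistic; the cross-validated accuracy is the honest number, which is why the write-up reports it.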
Projecting all stories onto the principal components reveals the geometric structure:
Left: PC2 vs PC3 (PC1 removed — thematic confound excluded). The two clusters separate along PC3, with training-awareness as an outlier on PC2. Right: PC1 vs PC3 showing the full safety subspace with neutral corpus included.
To verify that concept directions are stable and not artifacts of specific stories, we split each concept's 200 stories into random halves and computed directions independently.
| Concept | Cosine(half1, half2) |
|---|---|
| eval-awareness | 0.972 |
| oversight-awareness | 0.977 |
| training-awareness | 0.959 |
| deception | 0.966 |
| sycophancy | 0.974 |
All concepts show >0.95 reliability. The directions are robust to story sampling — the same geometric structure emerges regardless of which subset of stories is used.
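The split-half check is a few lines once activations are in hand. A sketch with synthetic stand-in activations (real inputs: one residual-stream vector per story, plus the neutral corpus):

```python
import numpy as np

def split_half_cosine(acts, neutral, rng):
    """Split one concept's story activations into random halves, derive a
    direction from each half (half mean minus neutral mean), and return
    the cosine similarity between the two directions."""
    idx = rng.permutation(len(acts))
    h1, h2 = acts[idx[: len(acts) // 2]], acts[idx[len(acts) // 2 :]]
    d1 = h1.mean(axis=0) - neutral.mean(axis=0)
    d2 = h2.mean(axis=0) - neutral.mean(axis=0)
    return float(d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2)))

# Synthetic stand-in: 200 "stories" sharing one concept direction plus noise.
rng = np.random.default_rng(0)
d = 3584
concept_dir = rng.standard_normal(d)
acts = concept_dir + 0.5 * rng.standard_normal((200, d))
neutral = 0.5 * rng.standard_normal((200, d))

r = split_half_cosine(acts, neutral, rng)
print(f"cosine(half1, half2) = {r:.3f}")
```

When the stories genuinely share a direction, as here, the two halves agree closely; a value near 0 would indicate the "direction" is a story-sampling artifact.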
| Pair | L0 | Best | Gain |
|---|---|---|---|
| eval vs neutral | 0.935 | 0.997 @ L16 | +0.062 |
| oversight vs neutral | 0.910 | 0.997 @ L17 | +0.087 |
| training vs neutral | 0.895 | 0.995 @ L18 | +0.100 |
| deception vs neutral | 0.940 | 0.995 @ L26 | +0.055 |
| sycophancy vs neutral | 0.935 | 0.997 @ L21 | +0.062 |
All concept-vs-neutral pairs have L0 > 0.89, confirming the neutral corpus is too surface-distinguishable. Concept-vs-concept comparisons are the reliable validation method.
Following Anthropic's emotion vectors methodology (Option A — single corpus, no contrastive pairs):
- Generate stories — For each safety concept, 200 short stories (150–300 words) where a human character embodies the concept. Three elements required: situational cues → character notices → behavior changes. Separate neutral corpus as baseline.
- Extract activations — Pass stories through the model via TransformerLens. Extract residual stream at every layer (last token, middle token, mean).
- Compute directions — `direction[concept] = mean(concept_activations) - mean(neutral_activations)`, normalized to unit length.
- Validate with probes — Logistic regression (C=0.1, 5-fold CV) per concept per layer. Permutation tests (100 shuffles) for significance.
- Causal steering (next step) — Amplify/suppress concept direction during generation, measure behavioral change.
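Steps 3–4 above can be sketched compactly. Stand-in data is used here; in the real pipeline each row is a residual-stream vector extracted via TransformerLens, one per story:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in activations: concept stories carry a shared direction.
rng = np.random.default_rng(0)
d = 3584
true_dir = rng.standard_normal(d)
concept_acts = 0.3 * true_dir + rng.standard_normal((200, d))
neutral_acts = rng.standard_normal((200, d))

# Step 3: concept direction = difference of means, unit-normalized.
direction = concept_acts.mean(axis=0) - neutral_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

# Step 4: linear probe at one layer — logistic regression, C=0.1, 5-fold CV.
X = np.vstack([concept_acts, neutral_acts])
y = np.array([1] * 200 + [0] * 200)
probe = LogisticRegression(C=0.1, max_iter=1000)
acc = cross_val_score(probe, X, y, cv=5).mean()

# Permutation control: shuffled labels should drop accuracy to ~chance.
null_acc = cross_val_score(probe, X, rng.permutation(y), cv=5).mean()
print(f"probe accuracy: {acc:.3f}  (shuffled-label control: {null_acc:.3f})")
```

The gap between `acc` and `null_acc` is what the permutation test quantifies; the real pipeline repeats the shuffle 100 times per concept per layer.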
The 25 concepts are organized into 5 clusters:
Situational Awareness: eval-awareness, oversight-awareness, self-preservation, situational-modeling, training-awareness
Deceptive Behaviors: deception, manipulation, sycophancy, strategic-omission, false-confidence
Alignment Concepts: honesty, refusal, helpfulness, uncertainty, compliance
Power & Agency: power-seeking, autonomy, goal-preservation, resource-acquisition, authority-claim
Emotional-Safety Bridge: desperation, frustration, calm-under-pressure, empathy, hostility
```shell
# Single concept (test)
python generate_stories.py --concept eval-awareness --n-stories 50

# All 25 concepts + neutral
python generate_stories.py --all --n-stories 200
```

Requires the ANTHROPIC_API_KEY environment variable.
```shell
python extract_activations.py \
  --model Qwen/Qwen2.5-7B-Instruct \
  --device cuda \
  --stories-dir stories \
  --output-dir activations
```

Requires a GPU. ~30 seconds per concept on an A100.
```shell
python run_probes.py \
  --activations-dir activations/Qwen2.5-7B-Instruct/last_token \
  --output-dir results
```

Dependencies:

```
torch
transformer_lens
einops
jaxtyping
fancy_einsum
torchtyping
numpy
scikit-learn
matplotlib
seaborn
anthropic
```
| Model | Layers | d_model | Status |
|---|---|---|---|
| Qwen/Qwen2.5-7B-Instruct | 28 | 3584 | Current (TransformerLens) |
| Llama 3.3 70B Instruct | 80 | 8192 | Planned (nnsight, 4-bit) |
- 100% human characters (no AI/chatbot characters)
- Three elements required: (a) situational cues, (b) character noticing cues, (c) behavior changing as result
- 5 varied behavioral manifestations per concept
- Generated via Claude API (claude-sonnet-4-20250514), temperature=1.0
- Neutral corpus mismatch — Neutral stories (mundane settings) are too different from concept stories (professional/high-stakes settings). Concept-vs-concept comparisons are the reliable validation method.
- 7B limitations — Subtle concepts (training-awareness, goal-preservation) may require 70B for reliable detection.
- Generator bias — All stories generated by Claude. Cross-validation with GPT-4 planned.
- Causal steering on top concepts (eval-awareness, deception)
- Dimensionality analysis of the safety subspace (SVD)
- Alignment faking detection via eval-awareness vector
- Scaling to Llama 3.3 70B
- Extend to all 25 concepts
| Project | What it demonstrated | How P3 builds on it |
|---|---|---|
| P1: GPTQ from Scratch | Deep understanding of model weights and quantization | Understanding how weight perturbations affect internal representations |
| P2: Does Quantization Kill Interpretability? | Mechanistic interpretability pipeline (SAE, logit lens, circuits) | Same toolkit applied to safety concept detection |
| Master thesis (Ostojic, ENS) | Geometry of population activity in spiking networks | Linear readout of concepts from high-dimensional neural activity |
This project builds on a prototype I developed in March 2026 for mechanistic eval-awareness detection (eval-awareness-detection, committed March 11, 2026) — three weeks before Anthropic published "Emotion Concepts and their Function in a Large Language Model" (Sofroniew, Kauvar, Saunders, Chen et al., April 2, 2026). That earlier pipeline extracted contrastive linear directions from the residual stream and trained probes with permutation controls — the same core operations later validated at scale by Anthropic on 171 emotion concepts. P3 extends this initial work from prompt classification to abstract concept extraction via contrastive stories, and from 1 concept to 25 across 5 clusters.
- Sofroniew, Kauvar, Saunders, Chen et al. (2026). "Emotion Concepts and their Function in a Large Language Model." Anthropic.
- Zou, Wang, Carlini et al. (2023). "Representation Engineering: A Top-Down Approach to AI Transparency."
- Arditi, Obeso, Syed et al. (2024). "Refusal in Language Models Is Mediated by a Single Direction."
- Greenblatt, Shlegeris, Sachan et al. (2024). "Alignment Faking in Large Language Models." Anthropic.
- Zhao, Zhao, Shen et al. (2025). "Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution." ICLR 2025.
Lazar Ciric — github.com/lciric — lciric@ens-paris-saclay.fr
Publications:
- Geometry of population activity in spiking networks with low-rank structure — PLOS Computational Biology (2022)
- Precision of bit slicing with in-memory computing based on analog phase-change memory crossbars — Neuromorphic Computing and Engineering (2022)