mini-llm-lab is a small controlled research project for studying when tiny causal transformers genuinely use earlier context, and when they fall into shortcut regimes.
The project is intentionally small. The goal is not to reproduce GPT-scale behavior, but to build a setting where context visibility, shortcut behavior, and architecture effects can be isolated, measured, and explained.
The central question: when does a causal language model use earlier context compositionally, rather than leaning on a single visible clue as a shortcut?
The current main line is Family B, a two-clue synthetic task family where the correct next token depends on combining two earlier clues.
Example:
```
amber  + fox  -> den
amber  + bird -> nest
silver + fox  -> burrow
silver + bird -> roost
```
One clue alone is not sufficient. The correct answer depends on the pair.
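To make the mapping concrete in code, here is an illustrative sketch of a two-clue item generator in the spirit of Family B. The names and structure are assumptions for illustration only; the actual generators live in datasets/.

```python
# Illustrative sketch only; the real Family B generators live in datasets/.
# The target depends on the *pair* of clues, so neither clue alone determines it.
import random

CLUE_A = ["amber", "silver"]
CLUE_B = ["fox", "bird"]
TARGETS = {
    ("amber", "fox"): "den",
    ("amber", "bird"): "nest",
    ("silver", "fox"): "burrow",
    ("silver", "bird"): "roost",
}

def sample_item(rng: random.Random) -> tuple[list[str], str]:
    """Return a token sequence containing both clues plus the correct next token."""
    a, b = rng.choice(CLUE_A), rng.choice(CLUE_B)
    context = [a, "...", b, "->"]  # "..." stands in for filler/distractor tokens
    return context, TARGETS[(a, b)]
```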
The current evidence suggests a visibility-gated transition:
- 0 clues visible -> local guessing
- 1 clue visible -> stable single-clue shortcut regime
- 2 clues visible -> near-compositional use of both clues
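One simple way to realize a visibility condition is to replace hidden clues with a mask token before the model sees the sequence. The sketch below is illustrative and is not the repo's actual API.

```python
# Illustrative sketch of a visibility condition, not the repo's API:
# clues beyond the first n_visible are replaced with a mask token.
MASK = "<mask>"

def apply_visibility(context: list[str], clue_positions: list[int], n_visible: int) -> list[str]:
    masked = list(context)
    for pos in clue_positions[n_visible:]:  # hide every clue after the first n_visible
        masked[pos] = MASK
    return masked

# apply_visibility(["amber", "...", "fox", "->"], clue_positions=[0, 2], n_visible=1)
# -> ["amber", "...", "<mask>", "->"]
```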
On the expanded Family B Stage 1 suite:
| Visibility condition | Held-out accuracy |
|---|---|
| 0 clues visible | 0.016 +/- 0.000 |
| 1 clue visible | 0.481 +/- 0.042 |
| 2 clues visible | 0.877 +/- 0.132 |
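Here, held-out accuracy is read as accuracy on clue combinations excluded from training, so a high two-clue score cannot come from memorizing specific pairs. That reading is an assumption about the suite; a minimal sketch of such a split would look like:

```python
# Assumption: "held-out" means clue pairs never seen during training.
import itertools
import random

def split_pairs(clues_a: list[str], clues_b: list[str], held_out_frac: float = 0.2, seed: int = 0):
    pairs = list(itertools.product(clues_a, clues_b))
    random.Random(seed).shuffle(pairs)
    cut = int(len(pairs) * held_out_frac)
    return pairs[cut:], pairs[:cut]  # (train pairs, held-out pairs)
```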
The shortcut diagnostic shows the same regime change:
| Condition | Shortcut failure rate | Correct separation rate |
|---|---|---|
| 1 clue visible | 1.000 | 0.000 |
| 2 clues visible | 0.247 | 0.753 |
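One way such a diagnostic can be scored is a counterfactual check over contexts that share the visible clue but differ in the other clue: if the model gives both the same answer, it has collapsed onto the shared clue. The sketch below is a hedged illustration; the actual logic in experiments/family_b_single_clue_shortcut_diagnostic.py may differ.

```python
# Hedged sketch of a counterfactual shortcut check; the repo's diagnostic may differ.
# ctx_a and ctx_b share the visible clue but differ in the other clue, so their
# gold targets differ. Identical predictions mean the model relied on the shared clue.
def score_pair(predict, ctx_a: list[str], ctx_b: list[str], gold_a: str, gold_b: str) -> str:
    pred_a, pred_b = predict(ctx_a), predict(ctx_b)
    if pred_a == pred_b:
        return "shortcut_failure"       # collapsed onto the shared clue
    if pred_a == gold_a and pred_b == gold_b:
        return "correct_separation"     # uses both clues to distinguish the pair
    return "other_error"
```

Under this reading, a 1.000 failure rate in the 1-clue condition is unsurprising when the differing clue is the hidden one, since the two contexts become indistinguishable to the model.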
The first bridge to a more LLM-like tiny backbone (RoPE + RMSNorm + SwiGLU) preserves the story and makes it cleaner:
| Backbone | 0 clues | 1 clue | 2 clues |
|---|---|---|---|
| classic tiny transformer | 0.016 | 0.481 | 0.877 |
| modern tiny bridge | 0.016 | 0.500 | 1.000 |
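For readers unfamiliar with the bridge components, RMSNorm and SwiGLU in their standard form look roughly like the PyTorch sketch below; the repo's models/ code may differ in details such as epsilon, bias terms, or hidden sizes, and RoPE is omitted here for brevity.

```python
# Standard-form sketches of two bridge components; not the repo's exact implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale by the root-mean-square of the features; no mean subtraction, no bias.
        return self.weight * x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)

class SwiGLU(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Gated feed-forward block: silu(gate) * up, projected back down to dim.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```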
Regenerate the figure with:
```
python3 experiments/family_b_main_figure.py
```

Repository layout:

```
datasets/     synthetic task generators
models/       tiny causal transformer implementations
experiments/  experiment runners and probes
memos/        technical writeups
results/      saved result cards, JSON outputs, and figures
docs/         public-facing project page and release notes
```
Create an environment and install the minimal dependencies:
```
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Run the base Family B visibility ladder:

```
python3 experiments/family_b_context_visibility_ladder_probe.py
```

Run the Stage 1 visibility ladder:

```
python3 experiments/family_b_context_visibility_ladder_probe.py --scale stage1
```

Run the shortcut diagnostic:

```
python3 experiments/family_b_single_clue_shortcut_diagnostic.py --scale stage1
```

Run the modern-backbone bridge:

```
python3 experiments/family_b_bridge_visibility_probe.py
python3 experiments/family_b_bridge_shortcut_probe.py
```

Or run the compact public reproduction script:
```
bash scripts/reproduce_family_b_main.sh
```

Key documents:

- Main result card: results/family_b_main_result_card.md
- Main memo: memos/family-b-visibility-and-shortcut-memo.md
- Bridge memo: memos/family-b-bridge-memo.md
- Project page draft: docs/project-page.md
The strongest current claim is:
Tiny causal transformers enter different context-use regimes depending on clue visibility. In
Family B, one visible clue induces a stable single-clue shortcut regime, while jointly visible clues enable near-compositional use of context. Architecture effects become much stronger once the task exposes enough information to support composition.
This is stronger than a generic "longer context helps" claim. The interesting part is the transition from partial-clue shortcut behavior to near-compositional behavior.
This is a research seed, not a finished paper.
Current limitations:
- The setting is tiny and synthetic. Family B Stage 1 is still small compared with paper-scale synthetic benchmarks.
- The modern-backbone bridge is a first step, not broad external validation.
- The exact paper positioning still needs sharpening.
The next high-value steps are:
- Expand Family B beyond Stage 1.
- Add another realism bridge, one dimension at a time.
- Turn Family B into a small reusable benchmark with cleaner configs and result generation.
- Write a short technical report around the visibility-gated shortcut-to-composition transition.
This repo is best understood as an early controlled research artifact: small enough to inspect, but structured enough to support feedback from researchers working on language model behavior, evaluation, and research engineering.
