# mini-llm-lab

**mini-llm-lab** is a small, controlled research project studying when tiny causal transformers genuinely use earlier context and when they fall into shortcut regimes.

The project is intentionally small. The goal is not to reproduce GPT-scale behavior, but to build a setting where context visibility, shortcut behavior, and architecture effects can be isolated, measured, and explained.

## Research Question

When does a causal language model use earlier context compositionally, rather than relying on a partial visible clue as a shortcut?

The current main line is Family B, a two-clue synthetic task family where the correct next token depends on combining two earlier clues.

Example:

```
amber  + fox  -> den
amber  + bird -> nest
silver + fox  -> burrow
silver + bird -> roost
```

One clue alone is not sufficient. The correct answer depends on the pair.
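As a concrete illustration, a two-clue task of this kind can be generated with a few lines of Python. This is a hypothetical sketch, not the repo's actual generator: the `PAIR_TO_TARGET` table, `make_example`, and the `pad*` filler tokens are all invented for illustration.

```python
import random

# Hypothetical sketch of a Family B two-clue task: the correct next token
# depends on the *pair* of earlier clues, never on one clue alone.
PAIR_TO_TARGET = {
    ("amber", "fox"): "den",
    ("amber", "bird"): "nest",
    ("silver", "fox"): "burrow",
    ("silver", "bird"): "roost",
}

def make_example(rng: random.Random, n_filler: int = 4) -> list[str]:
    """Build one training sequence: clue1, filler tokens, clue2, target."""
    (clue1, clue2), target = rng.choice(list(PAIR_TO_TARGET.items()))
    filler = [f"pad{rng.randrange(10)}" for _ in range(n_filler)]
    return [clue1, *filler, clue2, target]

rng = random.Random(0)
seq = make_example(rng)
# The final token is always determined by the (clue1, clue2) pair.
assert seq[-1] == PAIR_TO_TARGET[(seq[0], seq[-2])]
```

Because every target is shared-free across pairs, held-out accuracy directly measures whether the model combined both clues.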

## Main Result

The current evidence suggests a visibility-gated transition:

```
0 clues visible -> local guessing
1 clue visible  -> stable single-clue shortcut regime
2 clues visible -> near-compositional use of both clues
```
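The visibility conditions above can be implemented by masking clue tokens before they reach the model. This is a minimal sketch under assumed conventions (a `<mask>` token and fixed clue positions), not the repo's actual probe code:

```python
# Hypothetical sketch of clue-visibility masking: replace hidden clue
# tokens with a MASK symbol so the model sees 0, 1, or 2 clues.
MASK = "<mask>"

def apply_visibility(seq: list[str], n_visible: int) -> list[str]:
    """seq = [clue1, *filler, clue2, target]; hide all clues beyond n_visible."""
    out = list(seq)
    clue_positions = [0, len(seq) - 2]      # clue1 first, clue2 just before target
    for pos in clue_positions[n_visible:]:  # mask the clues past the visible count
        out[pos] = MASK
    return out

seq = ["amber", "pad1", "pad2", "fox", "den"]
assert apply_visibility(seq, 2) == seq        # both clues visible
assert apply_visibility(seq, 1)[3] == MASK    # second clue hidden
assert apply_visibility(seq, 0)[0] == MASK    # both clues hidden
```

Training and evaluating the same architecture under each masking level yields one rung of the visibility ladder.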

On the expanded Family B Stage 1 suite:

| Visibility condition | Held-out accuracy |
|---|---|
| 0 clues visible | 0.016 +/- 0.000 |
| 1 clue visible | 0.481 +/- 0.042 |
| 2 clues visible | 0.877 +/- 0.132 |

The shortcut diagnostic shows the same regime change:

| Condition | Shortcut failure rate | Correct separation rate |
|---|---|---|
| 1 clue visible | 1.000 | 0.000 |
| 2 clues visible | 0.247 | 0.753 |
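One way to compute such a diagnostic is to group held-out pairs by their shared visible clue and check whether the model's predictions separate within each group. This is a hypothetical sketch of that metric, with invented example predictions; it is not the repo's diagnostic script:

```python
# Hypothetical sketch of the shortcut diagnostic: group held-out pairs by
# their shared first clue; a model that predicts the same target for every
# pair in a group is relying on a single-clue shortcut.
def shortcut_failure_rate(predictions: dict[tuple[str, str], str]) -> float:
    groups: dict[str, list[str]] = {}
    for (clue1, _clue2), pred in predictions.items():
        groups.setdefault(clue1, []).append(pred)
    failures = sum(1 for preds in groups.values() if len(set(preds)) == 1)
    return failures / len(groups)

# A model that collapses onto clue1 fails in every group ...
collapsed = {("amber", "fox"): "den", ("amber", "bird"): "den",
             ("silver", "fox"): "burrow", ("silver", "bird"): "burrow"}
assert shortcut_failure_rate(collapsed) == 1.0
# ... while a compositional model separates every pair.
compositional = {("amber", "fox"): "den", ("amber", "bird"): "nest",
                 ("silver", "fox"): "burrow", ("silver", "bird"): "roost"}
assert shortcut_failure_rate(compositional) == 0.0
```

The correct separation rate is then one minus the failure rate when groups are binary, as in the table above.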

The first bridge to a more LLM-like tiny backbone (RoPE + RMSNorm + SwiGLU) preserves the story and makes it cleaner:

| Backbone | 0 clues | 1 clue | 2 clues |
|---|---|---|---|
| classic tiny transformer | 0.016 | 0.481 | 0.877 |
| modern tiny bridge | 0.016 | 0.500 | 1.000 |
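For readers unfamiliar with the "modern" components, two of the three (RMSNorm and SwiGLU) are compact enough to sketch in a few lines of numpy. This is a generic illustration with assumed shapes, not the repo's implementation, and RoPE is omitted for brevity:

```python
import numpy as np

# Generic numpy sketch of two "modern bridge" components.
def rms_norm(x: np.ndarray, gain: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    # Scale by the root-mean-square of the features; no mean subtraction,
    # unlike LayerNorm.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return gain * x / rms

def swiglu(x, w_gate, w_up, w_down):
    # SwiGLU feedforward: silu(x @ w_gate) elementwise-gates (x @ w_up).
    gate = x @ w_gate
    silu = gate / (1.0 + np.exp(-gate))
    return (silu * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))
y = rms_norm(x, np.ones(8))
# After RMSNorm with unit gain, each row has RMS close to 1.
assert np.allclose(np.sqrt(np.mean(y * y, axis=-1)), 1.0, atol=1e-3)
out = swiglu(y, rng.standard_normal((8, 16)), rng.standard_normal((8, 16)),
             rng.standard_normal((16, 8)))
assert out.shape == (2, 8)
```

Swapping these pieces in one at a time is what makes the bridge a controlled architecture comparison rather than a wholesale backbone change.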

## Main Figure

*Figure: Family B main result.*

Regenerate the figure with:

```bash
python3 experiments/family_b_main_figure.py
```

## Repository Map

```
datasets/     synthetic task generators
models/       tiny causal transformer implementations
experiments/  experiment runners and probes
memos/        technical writeups
results/      saved result cards, JSON outputs, and figures
docs/         public-facing project page and release notes
```

## Quickstart

Create an environment and install the minimal dependencies:

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Run the base Family B visibility ladder:

```bash
python3 experiments/family_b_context_visibility_ladder_probe.py
```

Run the Stage 1 visibility ladder:

```bash
python3 experiments/family_b_context_visibility_ladder_probe.py --scale stage1
```

Run the shortcut diagnostic:

```bash
python3 experiments/family_b_single_clue_shortcut_diagnostic.py --scale stage1
```

Run the modern-backbone bridge:

```bash
python3 experiments/family_b_bridge_visibility_probe.py
python3 experiments/family_b_bridge_shortcut_probe.py
```

Or run the compact public reproduction script:

```bash
bash scripts/reproduce_family_b_main.sh
```


## Current Interpretation

The strongest current claim is:

> Tiny causal transformers enter different context-use regimes depending on clue visibility. In Family B, one visible clue induces a stable single-clue shortcut regime, while jointly visible clues enable near-compositional use of context. Architecture effects become much stronger once the task exposes enough information to support composition.

This is stronger than a generic "longer context helps" claim. The interesting part is the transition from partial-clue shortcut behavior to near-compositional behavior.

## Limitations

This is a research seed, not a finished paper.

Current limitations:

- The setting is tiny and synthetic.
- Family B Stage 1 is still small compared with paper-scale synthetic benchmarks.
- The modern-backbone bridge is an initial bridge, not broad external validation.
- The exact paper positioning still needs sharpening.

## Next Experiments

The next high-value steps are:

1. Expand Family B beyond Stage 1.
2. Add another realism bridge, one dimension at a time.
3. Turn Family B into a small reusable benchmark with cleaner configs and result generation.
4. Write a short technical report around the visibility-gated shortcut-to-composition transition.

## Status

This repo is best understood as an early controlled research artifact: small enough to inspect, but structured enough to support feedback from researchers working on language model behavior, evaluation, and research engineering.
