Deep sequence models tend to memorize geometrically; it is unclear why.

We are excited to share the code and notebooks for "Deep sequence models tend to memorize geometrically; it is unclear why", accepted to ICML 2026.

This repository is meant to make the paper's experiments easier to inspect, rerun, and extend. The experiments study when sequence models memorize graph-structured atomic facts as local associations versus as a global embedding geometry. The main CLI supports the in-weights path-finding experiments, in which models first memorize graph edges and then solve path queries from knowledge stored in their weights. The notebooks in tiny_graphs_notebooks/ provide small, controlled graph experiments used to inspect embedding geometry, geometric/associative memorization curves, and spectral-bias behavior.

Paper: https://arxiv.org/abs/2510.26745

Note: this repository was recreated after the internship at Google, so it is intended as a faithful research release rather than a direct export of the internal development repository.

Repository layout

.
├── train_in_weights.py              # Main training entry point for in-weights graph experiments
├── geometric_memory.yml             # Conda environment used for the release
├── data/
│   ├── build_in_weights_datasets.py # Builds edge-memorization and path-finetuning text datasets
│   ├── dataset_naming.py            # Shared naming/split-size conventions
│   └── datasets/                    # Generated or checked-in graph datasets
├── in_weights/                      # Experiment configs, data loaders, training loops, evaluation
├── models/                          # GPT-style Transformer, Mamba, Pythia, and model utilities
├── tokenizing/                      # Numeric tokenizer for graph node tokens and special tokens
├── utils/                           # Device, logging, run-directory, and training helpers
├── tiny_graphs_notebooks/           # Tiny-graph experiments and analysis notebooks
│   ├── experiment_notebooks/        # Baseline tiny graph notebooks
│   ├── experiment_notebooks_self_edges/
│   ├── experiment_notebooks_regularizers/
│   ├── analysis/                    # Plotting, spectral, metric, and storage helpers
│   ├── notebook_utils/              # Reusable notebook training/evaluation wrappers
│   └── data/                        # Tiny-graph datasets generated by notebooks
└── geometric_memory/                # Import shim for running from the repository root

Environment

The recommended setup is the included Conda environment:

conda env create -f geometric_memory.yml
conda activate geometric_memory

Run commands from the repository root. The repository includes a compatibility package shim so imports such as geometric_memory.in_weights work when scripts are launched from this directory.

For GPU runs, the environment is configured around PyTorch with CUDA 12.1. The code also resolves CPU or MPS devices when CUDA is unavailable, but the large path-star experiments are intended for GPU execution.
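
The fallback order described above could look roughly like the following sketch. The names pick_device_name and resolve_device are hypothetical and are not taken from the repository's utils/ package, and the CUDA, then MPS, then CPU preference order is an assumption. Only torch.cuda.is_available() and torch.backends.mps.is_available() are real PyTorch calls.

```python
def pick_device_name(cuda_available: bool, mps_available: bool) -> str:
    """Return a device name, preferring CUDA, then MPS, then CPU (assumed order)."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def resolve_device():
    """Query PyTorch for available backends and build a torch.device."""
    import torch  # imported lazily so the pure helper above has no dependencies

    # Older PyTorch builds may not expose torch.backends.mps at all.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_available = mps_backend is not None and mps_backend.is_available()
    return torch.device(pick_device_name(torch.cuda.is_available(), mps_available))
```

Keeping the preference logic in a pure function makes it easy to check without a GPU; resolve_device() is the only part that needs PyTorch installed.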

Dataset generation

Datasets are plain text files with three splits:

*_pretrain.txt: edge memorization examples, formatted as u=v
*_train_*.txt: path-finetuning examples
*_test_*.txt: held-out path examples
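
To inspect an edge-memorization split outside the training code, a small parser is enough. The sketch below assumes one u=v example per line with integer node ids; the helper name parse_edge_lines is hypothetical, and the repository's own loaders in in_weights/ may handle additional tokens.

```python
from typing import Iterable, List, Tuple


def parse_edge_lines(lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Parse `u=v` edge examples into (u, v) integer pairs.

    Assumes one example per line and integer node ids; blank lines are skipped.
    """
    edges = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        source, target = line.split("=", 1)
        edges.append((int(source), int(target)))
    return edges


sample = ["3=7", "7=3", "", "12=12"]
edges = parse_edge_lines(sample)
self_edges = [(u, v) for u, v in edges if u == v]
print(edges)
print(self_edges)
```

Because the function accepts any iterable of strings, an open file handle for a *_pretrain.txt file can be passed in directly.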

Build a path-star dataset with forward and reverse edge supervision:

python data/build_in_weights_datasets.py \
  --graph_type star \
  --star_degree 5000 \
  --star_subtree_degree 1 \
  --path_length 5 \
  --add_forward_edges \
  --add_backward_edges \
  --random_seed 0

Supported graph families are star, grid, cycle, and irregular. Useful dataset flags include:

--star_degree, --star_subtree_degree, --path_length: path-star or star-tree shape.
--grid_rows, --grid_cols: grid shape.
--total_nodes: required for cycle; auto-derived for star, grid, and irregular when left as -1.
--train_split_ratio: fraction of path examples used for training.
--add_forward_edges, --add_backward_edges: edge directions included in pretraining. If neither is passed, both are used.
--add_self_edges: adds (node, node) examples to edge memorization.
--include_start_node_in_path_finetuning: uses start,leaf as the path prefix instead of only leaf.
--split_subtree_holdout: for star trees, holds out entire subtrees.
--overwrite: replace existing generated files.

The current builder writes both baseline and self-edge variants for a sampled graph, distinguished in filenames by selfedge_0 and selfedge_1.
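
As a rough illustration of how the shape and edge-direction flags interact, the sketch below builds a tiny path-star graph and formats its edges as u=v examples. It is not the builder's actual code. The node numbering (node 0 as the center), the reading of path_length as the number of nodes on each center-to-leaf path, and the choice of center-to-leaf as the "forward" direction are all assumptions made for illustration; the function names are hypothetical.

```python
from typing import List, Tuple


def path_star_edges(star_degree: int, path_length: int) -> List[Tuple[int, int]]:
    """Directed center-to-leaf edges of a path-star graph.

    Assumed conventions: node 0 is the center, and path_length counts the
    nodes on each center-to-leaf path, including the center.
    """
    edges = []
    next_node = 1
    for _ in range(star_degree):
        previous = 0
        for _ in range(path_length - 1):
            edges.append((previous, next_node))
            previous = next_node
            next_node += 1
    return edges


def edge_examples(edges, num_nodes, forward=False, backward=False, self_edges=False):
    """Format edges as `u=v` strings for the requested directions."""
    if not forward and not backward:
        forward = backward = True  # mirrors the documented default: both directions
    examples = []
    if forward:
        examples += [f"{u}={v}" for u, v in edges]
    if backward:
        examples += [f"{v}={u}" for u, v in edges]
    if self_edges:
        examples += [f"{n}={n}" for n in range(num_nodes)]
    return examples


edges = path_star_edges(star_degree=3, path_length=3)
num_nodes = 1 + 3 * (3 - 1)  # center plus two nodes per arm
print(edge_examples(edges, num_nodes, forward=True))
```

Under these assumptions, the command shown earlier (degree 5000, path length 5) would produce 20,000 directed edges per direction.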

Running in-weights experiments

The main entry point is:

python train_in_weights.py --help

Example Transformer run on a generated path-star dataset:

python train_in_weights.py \
  --training_recipe mixed_full_path \
  --model_family gpt \
  --graph_type star \
  --star_degree 5000 \
  --star_subtree_degree 1 \
  --path_length 5 \
  --add_forward_edges \
  --add_backward_edges \
  --edge_memorization_epochs 2500 \
  --path_finetuning_epochs 10000 \
  --enable_wandb \
  --wandb_mode offline

Training recipes:

staged_full_path: edge memorization first, then full-path finetuning.
mixed_full_path: mixed edge/path batches with full-path prediction.
staged_hardest_token: edge memorization first, then first-token-only path finetuning.
mixed_hardest_token: mixed edge/path batches with first-token-only prediction.
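
The four recipe names combine two independent choices: a schedule (staged or mixed) and a prediction target (full_path or hardest_token). The sketch below makes that structure explicit and shows one way a first-token-only target could be expressed as a loss mask over path tokens. Recipe, parse_recipe, and loss_mask are hypothetical names, and the masking scheme is an assumption rather than the repository's training loop.

```python
from typing import List, NamedTuple


class Recipe(NamedTuple):
    schedule: str  # "staged" or "mixed"
    target: str    # "full_path" or "hardest_token"


def parse_recipe(name: str) -> Recipe:
    """Split a recipe name such as `mixed_full_path` into its two choices."""
    schedule, _, target = name.partition("_")
    if schedule not in ("staged", "mixed") or target not in ("full_path", "hardest_token"):
        raise ValueError(f"unknown training recipe: {name}")
    return Recipe(schedule, target)


def loss_mask(path_token_count: int, target: str) -> List[int]:
    """1 marks path positions that contribute to the loss, 0 marks ignored ones.

    Assumes path_token_count >= 1.
    """
    if target == "full_path":
        return [1] * path_token_count
    # hardest_token: only the first path token is supervised
    return [1] + [0] * (path_token_count - 1)


recipe = parse_recipe("staged_hardest_token")
print(recipe, loss_mask(4, recipe.target))
```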

Common model and training flags:

--model_family {gpt,gpt2,pythia,mamba}: model backend.
--transformer_layer_count, --embedding_dimension, --attention_head_count: model size.
--use_attention / --no-use_attention: ablate Transformer attention.
--use_residual_connections, --use_layer_norm, --use_positional_encoding: architecture toggles.
--freeze_token_embeddings: freeze node embeddings, useful for associative-memory controls.
--weight_init_mode {default,non_geometric}: initialization mode.
--edge_memorization_batch_size, --path_finetuning_batch_size: stage batch sizes.
--edge_memorization_learning_rate, --path_finetuning_learning_rate: stage learning rates.
--skip_edge_memorization with --edge_memorization_checkpoint_path: resume from an edge-memorized checkpoint.
--track_embedding_evolution, --track_top_k_predictions: write analysis pickles during edge memorization.

By default, path prefixes include task tokens such as [EDGE] and [PATH], and edge-memorization prefixes drop pause tokens. Pass --exclude_task_token_in_prefix to omit the task tokens, or --edge_memorization_include_pause_token to keep pause tokens in edge-memorization prefixes.

Tiny-graph notebooks

tiny_graphs_notebooks/ contains the small graph experiments used to study the geometry directly. These notebooks train or load models on tiny path-star, grid, cycle, and fixed irregular graphs, then plot:

node embedding geometry,
node-node embedding similarity heatmaps,
associative versus geometric memorization curves,
embedding evolution over training,
graph Laplacian / spectral-bias diagnostics.
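
For orientation on the last item, the sketch below builds the unnormalized Laplacian of a cycle graph and checks that its lowest nonzero-frequency Fourier modes are eigenvectors, using only the standard library. This is a textbook fact about cycle graphs, not the repository's analysis helper; it is offered on the assumption that a spectral diagnostic compares learned embeddings against low-frequency Laplacian eigenvectors of this kind. The function names are hypothetical.

```python
import math
from typing import List


def cycle_laplacian(n: int) -> List[List[float]]:
    """Unnormalized Laplacian L = D - A of an n-node cycle graph (n >= 3)."""
    laplacian = [[0.0] * n for _ in range(n)]
    for i in range(n):
        laplacian[i][i] = 2.0  # every node has two neighbors
        laplacian[i][(i + 1) % n] = -1.0
        laplacian[i][(i - 1) % n] = -1.0
    return laplacian


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


n, k = 12, 1  # k = 1 is the lowest nonzero frequency
laplacian = cycle_laplacian(n)
cos_mode = [math.cos(2 * math.pi * k * i / n) for i in range(n)]
sin_mode = [math.sin(2 * math.pi * k * i / n) for i in range(n)]
eigenvalue = 2 - 2 * math.cos(2 * math.pi * k / n)

# Both Fourier modes satisfy L v = eigenvalue * v up to floating-point error.
residual = max(
    abs(lv - eigenvalue * v)
    for mode in (cos_mode, sin_mode)
    for lv, v in zip(matvec(laplacian, mode), mode)
)
print(f"eigenvalue={eigenvalue:.4f}, max residual={residual:.2e}")

# Using the two modes as 2-D coordinates places every node on the unit circle.
radii = [math.hypot(c, s) for c, s in zip(cos_mode, sin_mode)]
```

Plotting cos_mode against sin_mode arranges the nodes in cycle order around a circle, which is the kind of global geometry a node-embedding plot can be compared against.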

Main notebook groups:

tiny_graphs_notebooks/experiment_notebooks/: baseline tiny graph experiments for Transformer, forward-only Transformer, neural network, associative controls, associative-fast controls, and Node2Vec.
tiny_graphs_notebooks/experiment_notebooks_self_edges/: corresponding experiments with self-edge supervision.
tiny_graphs_notebooks/experiment_notebooks_regularizers/: regularizer and initialization ablations, currently focused on cycle graphs.

The notebooks are self-contained. Open one with Jupyter from the repository root, for example:

jupyter notebook tiny_graphs_notebooks/experiment_notebooks/tiny_transformer.ipynb

Each notebook section follows the same pattern: configure graph/model/training settings, build or reuse tiny datasets under tiny_graphs_notebooks/data/, train or load a checkpoint, evaluate edge/path behavior, and generate plots. Generated notebook outputs such as local saved_artifacts/ and experiment_logs/ folders are run artifacts rather than core source files.

Outputs and logging

CLI runs write under experiment_logs/ by default:

run directories and checkpoints,
analysis artifacts when tracking flags are enabled,
optional W&B logs.

W&B is disabled unless requested. To log offline:

python train_in_weights.py ... --enable_wandb --wandb_mode offline

Use --wandb_mode online after logging into W&B if you want runs synced to the hosted service.

Citation

If you find this code or the paper useful, please cite:

@article{noroozizadeh2025deep,
  title={Deep sequence models tend to memorize geometrically; it is unclear why},
  author={Noroozizadeh, Shahriar and Nagarajan, Vaishnavh and Rosenfeld, Elan and Kumar, Sanjiv},
  journal={arXiv preprint arXiv:2510.26745},
  url={https://arxiv.org/abs/2510.26745},
  year={2025}
}
