We are excited to share the code and notebooks for "Deep sequence models tend to memorize geometrically; it is unclear why", accepted to ICML 2026.
This repository is meant to make the paper's experiments easier to inspect, rerun, and extend. The experiments study when sequence models memorize graph-structured atomic facts as local associations versus as a global embedding geometry. The main CLI supports the in-weights path-finding experiments: models first memorize graph edges and then solve path queries from knowledge stored in their weights. The tiny_graphs_notebooks/ notebooks provide small, controlled graph experiments used to inspect embedding geometry, geometric/associative memorization curves, and spectral-bias behavior.
Paper: https://arxiv.org/abs/2510.26745
Note: this repository was recreated after the internship at Google, so it is intended as a faithful research release rather than a direct export of the internal development repository.
.
├── train_in_weights.py # Main training entry point for in-weights graph experiments
├── geometric_memory.yml # Conda environment used for the release
├── data/
│ ├── build_in_weights_datasets.py # Builds edge-memorization and path-finetuning text datasets
│ ├── dataset_naming.py # Shared naming/split-size conventions
│ └── datasets/ # Generated or checked-in graph datasets
├── in_weights/ # Experiment configs, data loaders, training loops, evaluation
├── models/ # GPT-style Transformer, Mamba, Pythia, and model utilities
├── tokenizing/ # Numeric tokenizer for graph node tokens and special tokens
├── utils/ # Device, logging, run-directory, and training helpers
├── tiny_graphs_notebooks/ # Tiny-graph experiments and analysis notebooks
│ ├── experiment_notebooks/ # Baseline tiny graph notebooks
│ ├── experiment_notebooks_self_edges/
│ ├── experiment_notebooks_regularizers/
│ ├── analysis/ # Plotting, spectral, metric, and storage helpers
│ ├── notebook_utils/ # Reusable notebook training/evaluation wrappers
│ └── data/ # Tiny-graph datasets generated by notebooks
└── geometric_memory/ # Import shim for running from the repository root
The recommended setup is the included Conda environment:
conda env create -f geometric_memory.yml
conda activate geometric_memory
Run commands from the repository root. The repository includes a compatibility package shim so imports such as geometric_memory.in_weights work when scripts are launched from this directory.
For GPU runs, the environment is configured around PyTorch with CUDA 12.1. The code also resolves CPU or MPS devices when CUDA is unavailable, but the large path-star experiments are intended for GPU execution.
Datasets are plain text files with three splits:
- *_pretrain.txt: edge memorization examples, formatted as u=v
- *_train_*.txt: path-finetuning examples
- *_test_*.txt: held-out path examples
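As a rough illustration of the edge-memorization split, the sketch below parses u=v lines into an adjacency map. It assumes one example per line, integer node tokens, and no extra task or pause tokens; the files written by the builder may differ in those details. parse_edge_lines is a hypothetical helper, not a function from this repository.

```python
from collections import defaultdict


def parse_edge_lines(lines):
    """Parse 'u=v' edge examples into an adjacency map {u: [v, ...]}.

    Assumption: one edge per line, integer node ids, '=' as the separator.
    """
    adjacency = defaultdict(list)
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        source, separator, target = line.partition("=")
        if not separator:
            raise ValueError(f"not an edge example: {line!r}")
        adjacency[int(source)].append(int(target))
    return dict(adjacency)


# Toy input; a real split would be read with open(path) instead.
example_lines = ["0=1", "1=0", "1=2", "2=1"]
print(parse_edge_lines(example_lines))
```

Forward and backward supervision show up here as two separate directed examples for the same undirected edge.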
Build a path-star dataset with forward and reverse edge supervision:
python data/build_in_weights_datasets.py \
--graph_type star \
--star_degree 5000 \
--star_subtree_degree 1 \
--path_length 5 \
--add_forward_edges \
--add_backward_edges \
--random_seed 0
Supported graph families are star, grid, cycle, and irregular. Useful dataset flags include:
- --star_degree, --star_subtree_degree, --path_length: path-star or star-tree shape.
- --grid_rows, --grid_cols: grid shape.
- --total_nodes: required for cycle; auto-derived for star, grid, and irregular when left as -1.
- --train_split_ratio: fraction of path examples used for training.
- --add_forward_edges, --add_backward_edges: edge directions included in pretraining. If neither is passed, both are used.
- --add_self_edges: adds (node, node) examples to edge memorization.
- --include_start_node_in_path_finetuning: uses start,leaf as the path prefix instead of only leaf.
- --split_subtree_holdout: for star trees, holds out entire subtrees.
- --overwrite: replaces existing generated files.
The current builder writes both baseline and self-edge variants for a sampled graph, distinguished in filenames by selfedge_0 and selfedge_1.
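To make the path-star shape and the edge-direction flags more concrete, here is a minimal sketch of how such an edge list could be generated. It is not the repository's builder: build_path_star_edges is a hypothetical function, node ids are assigned sequentially rather than sampled, and it assumes path_length counts the nodes on one arm including the center. Only the rule that both directions are used when neither is requested comes from the flag description above.

```python
def build_path_star_edges(degree, path_length, forward=False, backward=False):
    """Return directed edges of a path-star graph with node 0 as the center.

    Assumption: each of the `degree` arms is a chain of `path_length` nodes,
    counting the shared center node.
    """
    if not forward and not backward:
        forward = backward = True  # neither requested: use both directions
    edges = []
    next_node = 1
    for _ in range(degree):
        previous = 0  # every arm starts at the center
        for _ in range(path_length - 1):
            if forward:
                edges.append((previous, next_node))
            if backward:
                edges.append((next_node, previous))
            previous = next_node
            next_node += 1
    return edges


edges = build_path_star_edges(degree=3, path_length=4)
print(len(edges))  # 3 arms * 3 links * 2 directions = 18
```

Under these assumptions the node count is 1 + degree * (path_length - 1), which is the kind of quantity --total_nodes can derive automatically when left as -1.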
The main entry point is:
python train_in_weights.py --help
Example Transformer run on a generated path-star dataset:
python train_in_weights.py \
--training_recipe mixed_full_path \
--model_family gpt \
--graph_type star \
--star_degree 5000 \
--star_subtree_degree 1 \
--path_length 5 \
--add_forward_edges \
--add_backward_edges \
--edge_memorization_epochs 2500 \
--path_finetuning_epochs 10000 \
--enable_wandb \
--wandb_mode offline
Training recipes:
- staged_full_path: edge memorization first, then full-path finetuning.
- mixed_full_path: mixed edge/path batches with full-path prediction.
- staged_hardest_token: edge memorization first, then first-token-only path finetuning.
- mixed_hardest_token: mixed edge/path batches with first-token-only prediction.
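The difference between full-path and first-token-only (hardest-token) prediction can be pictured as a choice of loss mask over the path tokens. The sketch below is an illustration under assumptions, not the repository's training loop: target_mask and masked_mean are hypothetical helpers, and it assumes no loss is taken on the prefix tokens.

```python
def target_mask(prefix_length, path_length, hardest_token_only):
    """Return 0/1 loss weights for a sequence of prefix tokens then path tokens."""
    mask = [0] * prefix_length  # assumption: prefix tokens carry no loss
    if hardest_token_only:
        mask += [1] + [0] * (path_length - 1)  # supervise only the first path token
    else:
        mask += [1] * path_length  # supervise every path token
    return mask


def masked_mean(per_token_loss, mask):
    """Average the per-token losses over the supervised positions only."""
    total = sum(loss * weight for loss, weight in zip(per_token_loss, mask))
    return total / sum(mask)


losses = [9.0, 9.0, 3.0, 1.0, 2.0]  # toy per-token losses: 2 prefix + 3 path tokens
print(masked_mean(losses, target_mask(2, 3, hardest_token_only=True)))   # 3.0
print(masked_mean(losses, target_mask(2, 3, hardest_token_only=False)))  # 2.0
```

The staged and mixed variants then differ only in scheduling: whether edge examples are trained to completion first or interleaved with path examples in the same batches.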
Common model and training flags:
- --model_family {gpt,gpt2,pythia,mamba}: model backend.
- --transformer_layer_count, --embedding_dimension, --attention_head_count: model size.
- --use_attention / --no-use_attention: ablate Transformer attention.
- --use_residual_connections, --use_layer_norm, --use_positional_encoding: architecture toggles.
- --freeze_token_embeddings: freeze node embeddings, useful for associative-memory controls.
- --weight_init_mode {default,non_geometric}: initialization mode.
- --edge_memorization_batch_size, --path_finetuning_batch_size: stage batch sizes.
- --edge_memorization_learning_rate, --path_finetuning_learning_rate: stage learning rates.
- --skip_edge_memorization with --edge_memorization_checkpoint_path: resume from an edge-memorized checkpoint.
- --track_embedding_evolution, --track_top_k_predictions: write analysis pickles during edge memorization.
By default, path prefixes include task tokens such as [EDGE] and [PATH], and edge-memorization prefixes drop pause tokens. Use --exclude_task_token_in_prefix or --edge_memorization_include_pause_token to change this behavior.
tiny_graphs_notebooks/ contains the small graph experiments used to study the geometry directly. These notebooks train or load models on tiny path-star, grid, cycle, and fixed irregular graphs, then plot:
- node embedding geometry,
- node-node embedding similarity heatmaps,
- associative versus geometric memorization curves,
- embedding evolution over training,
- graph Laplacian / spectral-bias diagnostics.
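As one way to picture a graph Laplacian diagnostic, the sketch below measures how much of a node-embedding matrix lies in the lowest-frequency Laplacian eigenvectors of a cycle graph. It is a generic illustration using NumPy, not the helper code in tiny_graphs_notebooks/analysis/; cycle_laplacian and low_frequency_energy are hypothetical names, and the embeddings are synthetic points on a circle rather than trained weights.

```python
import numpy as np


def cycle_laplacian(n):
    """Unnormalized Laplacian L = D - A of an n-node cycle graph."""
    adjacency = np.zeros((n, n))
    for i in range(n):
        j = (i + 1) % n
        adjacency[i, j] = adjacency[j, i] = 1.0
    degree = np.diag(adjacency.sum(axis=1))
    return degree - adjacency


def low_frequency_energy(embeddings, laplacian, k):
    """Fraction of embedding energy in the k lowest-eigenvalue eigenvectors."""
    _, eigenvectors = np.linalg.eigh(laplacian)  # eigenvalues in ascending order
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    coefficients = eigenvectors.T @ centered  # graph Fourier coefficients
    energy = (coefficients ** 2).sum(axis=1)  # energy per eigenvector
    return energy[:k].sum() / energy.sum()


n = 12
angles = 2 * np.pi * np.arange(n) / n
circle = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # ideal ring geometry
print(low_frequency_energy(circle, cycle_laplacian(n), k=3))  # close to 1.0
```

A ring-shaped embedding puts essentially all of its energy in the constant eigenvector plus the first non-trivial eigenpair, so a value near 1 for small k indicates a smooth, low-frequency geometry on the graph.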
Main notebook groups:
- tiny_graphs_notebooks/experiment_notebooks/: baseline tiny graph experiments for Transformer, forward-only Transformer, neural network, associative controls, associative-fast controls, and Node2Vec.
- tiny_graphs_notebooks/experiment_notebooks_self_edges/: corresponding experiments with self-edge supervision.
- tiny_graphs_notebooks/experiment_notebooks_regularizers/: regularizer and initialization ablations, currently focused on cycle graphs.
The notebooks are self-contained. Open one with Jupyter from the repository root, for example:
jupyter notebook tiny_graphs_notebooks/experiment_notebooks/tiny_transformer.ipynb
Each notebook section follows the same pattern: configure graph/model/training settings, build or reuse tiny datasets under tiny_graphs_notebooks/data/, train or load a checkpoint, evaluate edge/path behavior, and generate plots. Generated notebook outputs such as local saved_artifacts/ and experiment_logs/ folders are run artifacts rather than core source files.
CLI runs write under experiment_logs/ by default:
- run directories and checkpoints,
- analysis artifacts when tracking flags are enabled,
- optional W&B logs.
W&B is disabled unless requested. To log offline:
python train_in_weights.py ... --enable_wandb --wandb_mode offline
Use --wandb_mode online after logging into W&B if you want runs synced to the hosted service.
If you find this code or the paper useful, please cite:
@article{noroozizadeh2025deep,
title={Deep sequence models tend to memorize geometrically; it is unclear why},
author={Noroozizadeh, Shahriar and Nagarajan, Vaishnavh and Rosenfeld, Elan and Kumar, Sanjiv},
journal={arXiv preprint arXiv:2510.26745},
url={https://arxiv.org/abs/2510.26745},
year={2025}
}