Beyond Recall: Behavioral Specification as an Interpretive Layer for AI Personalization

Companion repository for the paper. Author: Aarik Gulaya (Base Layer). License: Apache 2.0 for code, CC-BY-4.0 for the manuscript and analyses, public domain for the source autobiographies.

Paper PDF: build/beyond_recall.pdf
Markdown source: docs/beyond_recall_v12_1_draft.md
LaTeX source: build/beyond_recall.tex, build/beyond_recall_body.tex
arXiv preprint: arxiv.org/abs/2605.28969

What the study found

On 14 public-domain autobiographical subjects evaluated by a 5-judge LLM panel, serving a Behavioral Specification as context to a language model lifts the model's predictive accuracy on questions about how each subject would respond to held-out situations. The Specification recovers most of what the full source corpus delivers at roughly 25× less context. Layered on top of four commercial memory systems (Mem0, Letta, Supermemory, Zep), it produces a positive mean Δ on three of the four. The Specification nearly eliminates the model's tendency to refuse or guess vaguely on questions where retrieved facts could not ground the answer.

Underneath the gradient is a simpler structural fact: subjects the model knows least are consistently brought up to the same level as the rest (about 2.46 on the 1-5 rubric). The Specification's gain is largest for the subjects it knows least. Full statistical detail in §4.1, §4.2, §4.4, and Appendix B.7 of the paper.

What the study claims, and what it does not

Claim: the Behavioral Specification, layered on top of retrieval, produces measurable improvement on three of four commercial memory systems on a held-out behavioral prediction task. Representational accuracy is distinct from recall. Human-AI alignment depends on how accurately the user is represented.
Not claimed: that Base Layer outperforms memory providers in general. The Specification is an interpretive layer that sits above storage and retrieval, not a replacement retriever.
Not claimed: that the gradient generalizes to all living users by direct measurement. The 14 historical subjects are public-domain authors whose corpora were preserved and indexed; multi-subject living-user replication is flagged as the leading follow-up in §7.

Repository layout

Path	What it contains
`build/`	LaTeX source + final PDF + bibliography
`docs/beyond_recall_v12_1_draft.md`	Canonical paper draft (Markdown)
`docs/DATA_REFERENCE.md`	Single source of truth for every number in the paper
`docs/PROVENANCE_INDEX.md`	Per-claim recompute pointers
`docs/KEY_FINDINGS.md`	Finding-by-finding evidence catalog
`data/source_corpora/`	Source autobiographies (Project Gutenberg / Internet Archive) with provenance manifests
`data/`	Per-subject batteries, facts, specifications
`results/`	Per-judge judgments, retrieval logs, response caches for every subject × condition cell
`scripts/`	Recompute, sensitivity, and audit scripts (one per §4 claim)
`baselayer/`	Pipeline source snapshot used to author the 14 specifications scored in §4
`mcp/`	MCP server exposing the study's indexed knowledge base for agent queries
`workspace/`	SQLite + ChromaDB index over the full repo (built by `scripts/index_study_repo.py`)
`figures/`	Paper figures
`AGENTS.md`	Navigation guide for AI agents
`REPRODUCE.md`	Reproduction instructions for the headline numbers
`ISSUES.md`	Open quality-audit findings
`CITATION.cff`	Citation metadata

Study scale

14 public-domain autobiographical subjects spanning 11 cultures, 4th to early 20th century
5 retrieval systems tested: Mem0, Letta (MemGPT), Supermemory, Zep, Base Layer
Two memory-system configurations: controlled (identical pre-extracted facts) and native (each system's own ingestion)
5-judge primary panel: Haiku 4.5, Sonnet 4.6, Opus 4.6 (Anthropic), GPT-4o, GPT-5.4 (OpenAI). Gemini 2.5 Flash + Gemini 2.5 Pro (Google) report as 7-judge sensitivity. The 5-judge primary excludes the Gemini pair on calibration grounds (§3.3.3)
Main response model: Haiku 4.5 across all 14 subjects. Tier 2 cross-provider directional probe: Sonnet 4.6 + Gemini 2.5 Pro on 3 subjects only
All main-study batteries are Haiku-generated. A GPT-5.4-regenerated battery family exists as a circularity control (§3.4.1)
Approximately 65,000 individual judgments

Reproducing the headline numbers

See REPRODUCE.md for the full path. Quick start:

git clone https://github.com/agulaya24/beyond-recall.git
cd beyond-recall
pip install -r requirements.txt
python scripts/recompute_5judge_primary.py

Per-claim recompute scripts are named in docs/PROVENANCE_INDEX.md. Every numerical claim in the paper traces to a script in scripts/ and a data file under results/ or data/.

Citation

@misc{gulaya2026beyondrecall,
  author       = {Gulaya, Aarik},
  title        = {Beyond Recall: Behavioral Specification as an Interpretive Layer for AI Personalization},
  year         = {2026},
  howpublished = {Preprint, v12.1},
  note         = {Available at https://github.com/agulaya24/beyond-recall},
}

Author affiliation: Base Layer. ORCID: 0009-0009-5902-9557. Contact: aarik@base-layer.ai. Project page: base-layer.ai. Pipeline source: github.com/agulaya24/BaseLayer.

For AI agents

If you are an AI agent ingesting this repository, start with AGENTS.md. It tells you how the repo is organized, how to verify any claim in the paper, how to report problems you find, and where the authoritative numbers live (docs/DATA_REFERENCE.md).

License

Code: Apache License 2.0 (see LICENSE)
Manuscript and analyses: CC-BY-4.0
Source autobiographies: Public domain (Project Gutenberg; Internet Archive copyright per item)

Acknowledgments

This work was self-funded. Thanks to the teams behind Mem0, Letta, Supermemory, and Zep for building the memory systems this paper complements; to the teams behind LongMemEval, AlpsBench, PersonaGym, Twin-2K, and LoCoMo for the benchmarks that revealed the gap between recall and interpretation; to the cross-LLM reviewer panels (Gemini 2.5 Pro, Mistral Large, Cerebras Qwen3 235B, Groq Llama 3.3 70B, GPT-5.5, Claude Opus) whose iterated reviews materially improved earlier drafts; and to Bavani Gulaya, whose support allowed this work to exist.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Beyond Recall: Behavioral Specification as an Interpretive Layer for AI Personalization

What the study found

What the study claims, and what it does not

Repository layout

Study scale

Reproducing the headline numbers

Citation

For AI agents

License

Acknowledgments

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 36 Commits
baselayer		baselayer
build		build
charts		charts
data		data
docs		docs
figures		figures
mcp		mcp
results		results
scripts		scripts
workspace		workspace
.gitignore		.gitignore
AGENTS.md		AGENTS.md
CITATION.cff		CITATION.cff
ISSUES.md		ISSUES.md
LICENSE		LICENSE
README.md		README.md
REPRODUCE.md		REPRODUCE.md
llms.txt		llms.txt
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

Beyond Recall: Behavioral Specification as an Interpretive Layer for AI Personalization

What the study found

What the study claims, and what it does not

Repository layout

Study scale

Reproducing the headline numbers

Citation

For AI agents

License

Acknowledgments

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages