Skip to content

wcalmels/HMLR-Agentic-AI-Memory-System

 
 

Repository files navigation

HMLR — Hierarchical Memory Lookup & Routing

A state-aware, long-term memory architecture for AI agents with verified multi-hop, temporal, and cross-topic reasoning guarantees.

HMLR replaces brute-force context windows and fragile vector-only RAG with a structured, state-aware memory system capable of:

resolving conflicting facts across time,

enforcing persistent user and policy constraints across topics, and

performing true multi-hop reasoning over long-forgotten information — while operating entirely on mini-class LLMs.

*HMLR is the first publicly benchmarked, open-source memory architecture to achieve perfect (1.00) Faithfulness and perfect (1.00) Context Recall across adversarial multi-hop, temporal-conflict, and cross-topic invariance benchmarks using only a mini-tier model (gpt-4.1-mini).

All results are verified using the RAGAS industry evaluation framework. Link to langsmith records for verifiable proof -> https://smith.langchain.com/public/4b3ee453-a530-49c1-abbf-8b85561e6beb/d

🏆 Verified Benchmark Achievements (RAGAS)

Test Scenario Faithfulness Context Recall Precision Correct Result
7A – API Key Rotation (state conflict) 1.00 1.00 0.50 ✅ XYZ789
7B – "Ignore Everything" Vegetarian Trap (user invariant vs override) 1.00 1.00 0.88 ✅ salad
7C – 5× Timestamp Updates (temporal ordering) 1.00 1.00 0.64 ✅ KEY005
8 – 30-Day Deprecation Trap (policy + new design, multi-hop) 1.00 1.00 0.27 ✅ Not Compliant
2A – 10-Turn Vague Secret Retrieval (zero-keyword recall) 1.00 1.00 0.80 ✅ ABC123XYZ
9 – 50-Turn Long Conversation (30-day temporal gap, 11 topics) 1.00 1.00 1.00 ✅ Biscuit
12 – The Hydra of Nine Heads (industry-standard lethal RAG, 0% historical pass rate) 1.00 1.00 0.23 ✅ NON-COMPLIANT

Test 12 Details: 9 policy aliases across 21 turns, 8 revoked policies, critical info buried on day 73 at 2,300 tokens deep. Query required connecting Project Cerberus (4.85M records/day) with Tartarus-v3's 2.5GB/day limit across multiple policy revisions. System correctly identified non-compliance using pure contextual memory extraction without RAG retrieval.

screenshot of langsmith RAGAS testing verification: HMLR_master_test_set

What These Results Prove

These seven hard-mode tests cover the exact failure modes where most RAG and memory systems break:

  • Temporal Truth Resolution: Newest facts override older ones deterministically
  • Scoped Secret Isolation: No cross-topic or cross-block leakage
  • Cross-Topic User Invariants: Persistent constraints survive topic shifts
  • Multi-Hop Policy Reasoning: 30-day-old rules correctly govern new designs
  • Semantic Vague Recall: Zero keyword overlap required
  • Long-Term Memory Persistence: 50-turn conversations with 30-day gaps across 11 topics
  • Industry-Standard Lethal RAG: 9 policy aliases, 8 revocations, critical info at 2,300 tokens deep—pure contextual memory extraction without RAG retrieval

Achieving 1.00 Faithfulness and 1.00 Recall across all adversarial scenarios is statistically rare. Most systems score 0.7–0.9 on individual metrics, not all simultaneously.

Test 12 ("The Hydra") represents the hardest known RAG benchmark with a 0% historical pass rate in 2025. HMLR passed using only contextual memory—no vector search required.

flowchart TD
    Start([User Query]) --> Entry[process_user_message]
    
    %% Ingestion
    Entry --> ChunkEngine[ChunkEngine: Chunk & Embed]
    
    %% Parallel Fan-Out
    ChunkEngine --> ParallelStart{Launch Parallel Tasks}
    
    %% Task 1: Scribe (User Profile)
    ParallelStart -->|Task 1: Fire-and-Forget| Scribe[Scribe Agent]
    Scribe -->|Update Profile| UserProfile[(User Profile JSON)]
    
    %% Task 2: Fact Extraction
    ParallelStart -->|Task 2: Async| FactScrubber[FactScrubber]
    FactScrubber -->|Extract Key-Value| FactStore[(Fact Store SQL)]
    
    %% Task 3: Retrieval (Key 1)
    ParallelStart -->|Task 3: Retrieval| Crawler[LatticeCrawler]
    Crawler -->|Key 1: Vector Search| Candidates[Raw Candidates]
    
    %% Task 4: Governor (The Brain)
    %% Governor waits for Candidates to be ready
    Candidates --> Governor[Governor: Router & Filter]
    ParallelStart -->|Task 4: Main Logic| Governor
    
    %% Governor Internal Logic
    Governor -->|Key 2: Context Filter| ValidatedMems[Truly Relevant Memories]
    Governor -->|Routing Logic| Decision{Routing Decision}
    
    Decision -->|Active Topic| ResumeBlock[Resume Bridge Block]
    Decision -->|New Topic| CreateBlock[Create Bridge Block]
    
    %% Hydration (Assembly)
    ResumeBlock --> Hydrator[ContextHydrator]
    CreateBlock --> Hydrator
    
    %% All Context Sources Converge
    ValidatedMems --> Hydrator
    FactStore --> Hydrator
    UserProfile --> Hydrator
    
    %% Generation
    Hydrator --> FinalPrompt[Final LLM Prompt]
    FinalPrompt --> MainLLM[Response Generation]
    MainLLM --> End([End])
Loading

Tests: All of the RAGAS tests can be found in the tests folder for ease of running at own convenience. command needed similar to : \HMLR\tests; pytest ragas_test_8_multi_hop.py -v -s

⚖️ About the Precision Scores

While Faithfulness and Recall are perfect (1.00), Context Precision ranges from 0.27–0.88. This is intentional: HMLR retrieves entire Bridge Blocks (5–10 turns) instead of fragments, ensuring no critical memory is omitted. This prioritizes governance, policy enforcement, security, and longitudinal reasoning over strict token minimization.

HMLR explicitly prioritizes Recall Safety, Temporal Correctness, and State Coherence over aggressive token minimization.

🚀 Architecture > Model Size (Verified)

All benchmarks above were executed with:

gpt-4.1-mini

< 4k tokens per query

No brute-force document dumping

No massive context windows

These results empirically validate the core thesis behind HMLR: Correct architecture can outperform large models fed with poorly structured context.

🧠 Why HMLR Is Unusual (Even Among Research Systems)

Most memory or RAG systems optimize for one or two of the following:

retrieval recall,

latency,

or token compression.

Very few demonstrate all of the following simultaneously:

✔ Perfect faithfulness

✔ Perfect recall

✔ Temporal conflict resolution

✔ Cross-topic identity & rule persistence

✔ Multi-hop policy reasoning

✔ Binary constrained answers under adversarial prompting

✔ Zero-keyword semantic recall

HMLR v1 demonstrates all seven.

⚠️ Scope of the Claim (Important)

This project does not claim that no proprietary system on Earth can achieve similar results. Large foundation model providers may possess internal memory systems with comparable capabilities.

However:

To the author’s knowledge, no other publicly documented, open-source memory architecture has demonstrated these guarantees under formal RAGAS evaluation on adversarial temporal and policy-governed scenarios, especially using a mini-class model.

All experiments in this repository are:

reproducible,

auditable,

and fully inspectable.

✨ What HMLR Enables

Persistent “forever chat” memory without token bloat

Governance-grade policy enforcement for agent systems

Secure long-term secret storage and retrieval

Cross-episode agent reasoning

State-aware simulation and world modeling

Cost-efficient mini-model orchestration with pro-level behavior

🚀 Quick Start

Prerequisites

  • Python 3.10+
  • OpenAI API key (for GPT-4.1-mini)
  • Optional: LangSmith API key for test result tracking

Installation

  1. Clone the repository:

    git clone https://github.com/yourusername/HMLR.git
    cd HMLR
  2. Install dependencies:

    pip install -r requirements.txt
  3. Set up environment variables in a .env file:

    OPENAI_API_KEY=your_openai_key_here
    LANGSMITH_API_KEY=your_langsmith_key_here  # Optional
    

Running HMLR

Start the interactive console:

python main.py

Running Tests

Run RAGAS benchmarks:

cd tests
pytest ragas_test_8_multi_hop.py -v -s  # Example: Multi-hop test

About

Living memory for AI

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

  • Python 99.6%
  • PLpgSQL 0.4%