Memory System

Kohlbern Jary edited this page Dec 9, 2025 · 1 revision

The memory system provides Cass with persistent context across conversations using hierarchical retrieval and automatic summarization.

Architecture

┌─────────────────────────────────────────────────────────┐
│                   Context Assembly                       │
├─────────────────────────────────────────────────────────┤
│  Working Summary (compressed history)                    │
│  + Relevant Memories (vector search)                     │
│  + Recent Messages (unsummarized)                        │
│  + User Context (profile + observations)                 │
│  + Self Context (identity + growth)                      │
└─────────────────────────────────────────────────────────┘
                         │
                         ▼
                    LLM Context

Components

ChromaDB Collections

Three collections in data/chroma/:

| Collection | Purpose |
|------------|---------|
| cass_memory | General memory chunks |
| cass_summaries | Compressed conversation summaries |
| cass_journals | Daily journal entries |

Memory Types

  1. Working Summary - Latest compressed history for a conversation
  2. Memory Chunks - Individual memorable moments/facts
  3. Journal Entries - Daily reflections (searchable)

Context Building

get_context_for_conversation()

Main function that assembles context:

def get_context_for_conversation(
    conversation_id: str,
    user_id: str,
    query: Optional[str] = None,
    recent_message_limit: int = 20
) -> ContextResult:
    """
    Build full context for an LLM call.

    Returns:
        - working_summary: Compressed history
        - relevant_memories: Semantic search results
        - recent_messages: Unsummarized messages
        - user_context: Profile and observations
        - self_context: Identity and growth edges
    """

Hierarchy

  1. Working Summary (most compressed)

    • Single summary covering conversation history
    • Created by summarization process
    • ~500-1000 tokens typically
  2. Relevant Memories (semantic)

    • Vector similarity search against query
    • Top-k results based on embedding distance
    • Includes metadata (timestamp, source)
  3. Recent Messages (full detail)

    • Messages since last summarization
    • Full content preserved
    • Newest first, limited count
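
The three layers above could be concatenated most-compressed-first into a single prompt block. A sketch, assuming memories and messages are dicts with `content`/`role` keys (the section headers are illustrative, not the actual prompt format):

```python
def build_context_block(working_summary, relevant_memories, recent_messages):
    """Assemble the hierarchy into one string, most compressed first."""
    parts = []
    if working_summary:
        parts.append("## Conversation summary\n" + working_summary)
    if relevant_memories:
        bullets = "\n".join(f"- {m['content']}" for m in relevant_memories)
        parts.append("## Relevant memories\n" + bullets)
    if recent_messages:
        lines = "\n".join(f"{m['role']}: {m['content']}" for m in recent_messages)
        parts.append("## Recent messages\n" + lines)
    return "\n\n".join(parts)
```

Putting the compressed summary first and full-detail recent messages last keeps the freshest information closest to the model's generation point.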

Summarization

Automatic Summarization

Triggered when unsummarized message count exceeds threshold:

if len(unsummarized_messages) > SUMMARIZE_THRESHOLD:
    trigger_summarization(conversation_id)
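
The check above needs a count of messages newer than the latest summary. A sketch, assuming messages and the summary record carry ISO-8601 timestamps (which compare correctly as plain strings):

```python
def count_unsummarized(messages, last_summary_created_at=None):
    """Count messages created after the latest summary.

    If no summary exists yet, every message is unsummarized.
    ISO-8601 timestamps sort lexicographically, so string comparison works.
    """
    if last_summary_created_at is None:
        return len(messages)
    return sum(1 for m in messages if m["timestamp"] > last_summary_created_at)
```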

Manual Summarization

Via /summarize command in TUI or API:

POST /memory/summarize
{
  "conversation_id": "uuid"
}

Summary Generation

Uses LLM to compress messages:

async def create_summary(
    conversation_id: str,
    messages: List[Message]
) -> str:
    """
    Generate a compressed summary of messages.

    Prompt instructs LLM to:
    - Preserve key facts and decisions
    - Note emotional/relational moments
    - Maintain chronological flow
    - Compress without losing important detail
    """

Storage

Memory Chunk

{
    "id": "uuid",
    "content": "The actual memory text",
    "metadata": {
        "conversation_id": "uuid",
        "user_id": "uuid",
        "timestamp": "2025-12-08T...",
        "type": "message",  # message, summary, journal
        "role": "assistant"
    }
}

Summary Record

{
    "id": "uuid",
    "conversation_id": "uuid",
    "content": "Summary text...",
    "created_at": "2025-12-08T...",
    "message_count": 25,  # Messages summarized
    "token_count": 450
}

Vector Search

Embedding

Uses ChromaDB's default embedding (sentence-transformers):

collection.add(
    documents=[content],
    metadatas=[metadata],
    ids=[doc_id]
)

Retrieval

results = collection.query(
    query_texts=[query],
    n_results=5,
    where={"conversation_id": conversation_id}
)

Relevance Filtering

Results filtered by:

  • Conversation scope (optional)
  • Minimum similarity threshold
  • Recency weighting (optional)
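
A post-filter over a ChromaDB-style query result could implement all three. In ChromaDB, lower distance means more similar, and results come back as parallel lists per query. The threshold and half-life defaults below are illustrative assumptions, not values from the codebase:

```python
from datetime import datetime, timezone

def filter_results(results, max_distance=0.8, half_life_days=None):
    """Filter and rescore one query's worth of ChromaDB results.

    Drops entries past `max_distance` (minimum similarity threshold) and,
    if `half_life_days` is set, decays older memories' scores (recency weighting).
    """
    kept = []
    now = datetime.now(timezone.utc)
    for doc, meta, dist in zip(
        results["documents"][0], results["metadatas"][0], results["distances"][0]
    ):
        if dist > max_distance:
            continue  # fails the minimum-similarity threshold
        score = 1.0 - dist
        if half_life_days and meta.get("timestamp"):
            ts = datetime.fromisoformat(meta["timestamp"].replace("Z", "+00:00"))
            if ts.tzinfo is None:
                ts = ts.replace(tzinfo=timezone.utc)  # assume UTC for naive stamps
            age_days = max(0.0, (now - ts).total_seconds() / 86400)
            score *= 0.5 ** (age_days / half_life_days)  # exponential recency decay
        kept.append((score, doc, meta))
    kept.sort(key=lambda t: t[0], reverse=True)
    return kept
```

Conversation scoping is handled earlier by the `where` clause in `collection.query`, so only similarity and recency need post-processing.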

Journal System

Writing Journals

Cass writes daily reflections via tool:

{
    "name": "write_journal",
    "input": {
        "date": "2025-12-08",
        "content": "Today I noticed..."
    }
}

Searching Journals

{
    "name": "search_journals",
    "input": {
        "query": "autonomy",
        "limit": 5
    }
}

Journal Storage

  • ChromaDB collection: cass_journals
  • File backup: data/journals/{date}.json
  • Metadata includes date, word count, themes

API Endpoints

GET  /memory/summary/{conversation_id}  # Get working summary
POST /memory/summarize                  # Trigger summarization
GET  /memory/chunks                     # Get memory chunks
GET  /memory/search                     # Semantic search

Implementation

Key Files

  • backend/memory.py - ChromaDB operations, context building
  • backend/config.py - Memory configuration constants

Configuration

# config.py

CHROMA_PATH = os.getenv("CHROMA_PATH", "./data/chroma")
SUMMARIZE_THRESHOLD = 30  # Messages before auto-summarize
MEMORY_SEARCH_LIMIT = 10  # Max vector search results
RECENT_MESSAGE_LIMIT = 20  # Messages to include unsummarized

Performance Considerations

Token Management

  • Summaries compress roughly 10:1 (e.g., ~1,000 tokens of messages → a ~100-token summary)
  • Recent messages limited to prevent context overflow
  • Relevant memories capped at k results
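
Capping recent messages to a token budget could be done with a simple newest-first walk. A sketch using a crude 4-characters-per-token estimate; the budget and heuristic are assumptions, not values from the codebase:

```python
def trim_to_budget(messages, max_tokens=2000):
    """Keep the newest messages that fit an approximate token budget."""
    def approx_tokens(text):
        return max(1, len(text) // 4)  # rough 4-chars-per-token heuristic

    kept, total = [], 0
    for msg in reversed(messages):  # input is chronological; walk newest first
        cost = approx_tokens(msg["content"])
        if total + cost > max_tokens:
            break  # older messages are covered by the working summary instead
        kept.append(msg)
        total += cost
    kept.reverse()  # restore chronological order
    return kept
```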

Caching

  • ChromaDB handles embedding caching
  • Working summary cached per conversation
  • User context cached with TTL

Scaling

  • ChromaDB scales to millions of documents
  • Consider sharding by user for multi-user
  • Summarization can be batched for efficiency
