# Memory System

*Kohlbern Jary edited this page Dec 9, 2025*
The memory system provides Cass with persistent context across conversations using hierarchical retrieval and automatic summarization.
```
┌─────────────────────────────────────────────────────────┐
│ Context Assembly                                        │
├─────────────────────────────────────────────────────────┤
│ Working Summary (compressed history)                    │
│ + Relevant Memories (vector search)                     │
│ + Recent Messages (unsummarized)                        │
│ + User Context (profile + observations)                 │
│ + Self Context (identity + growth)                      │
└─────────────────────────────────────────────────────────┘
                            │
                            ▼
                       LLM Context
```
Three collections live in `data/chroma/`:

| Collection | Purpose |
|---|---|
| `cass_memory` | General memory chunks |
| `cass_summaries` | Compressed conversation summaries |
| `cass_journals` | Daily journal entries |
- Working Summary - Latest compressed history for a conversation
- Memory Chunks - Individual memorable moments/facts
- Journal Entries - Daily reflections (searchable)
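For illustration, routing a memory item to the right collection could be sketched as follows (the mapping is inferred from the table above and the `type` field in chunk metadata; the actual routing code may differ):

```python
# Map a chunk's metadata "type" to its ChromaDB collection.
COLLECTIONS = {
    "message": "cass_memory",     # general memory chunks
    "summary": "cass_summaries",  # compressed conversation summaries
    "journal": "cass_journals",   # daily journal entries
}

def collection_for(memory_type: str) -> str:
    """Return the collection name for a memory type, or raise for unknown types."""
    try:
        return COLLECTIONS[memory_type]
    except KeyError:
        raise ValueError(f"unknown memory type: {memory_type!r}")
```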
Main function that assembles context:

```python
def get_context_for_conversation(
    conversation_id: str,
    user_id: str,
    query: str = None,
    recent_message_limit: int = 20
) -> ContextResult:
    """
    Build full context for an LLM call.

    Returns:
    - working_summary: Compressed history
    - relevant_memories: Semantic search results
    - recent_messages: Unsummarized messages
    - user_context: Profile and observations
    - self_context: Identity and growth edges
    """
```

- **Working Summary** (most compressed)
  - Single summary covering conversation history
  - Created by the summarization process
  - Typically ~500-1000 tokens
- **Relevant Memories** (semantic)
  - Vector similarity search against the query
  - Top-k results ranked by embedding distance
  - Includes metadata (timestamp, source)
- **Recent Messages** (full detail)
  - Messages since the last summarization
  - Full content preserved
  - Newest first, limited count
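As an illustration, the tiers above can be flattened into one prompt block, most compressed first. This is a sketch only: the `ContextResult` field names are taken from the docstring, but the container type and the real assembly logic in `backend/memory.py` may differ.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ContextResult:
    # Field names follow the get_context_for_conversation docstring (assumed shape).
    working_summary: str = ""
    relevant_memories: List[str] = field(default_factory=list)
    recent_messages: List[str] = field(default_factory=list)
    user_context: str = ""
    self_context: str = ""

def render_context(ctx: ContextResult) -> str:
    """Flatten the tiers into a single prompt block, most compressed first."""
    parts = []
    if ctx.working_summary:
        parts.append("## Summary\n" + ctx.working_summary)
    if ctx.relevant_memories:
        parts.append("## Relevant memories\n" + "\n".join(ctx.relevant_memories))
    if ctx.recent_messages:
        parts.append("## Recent messages\n" + "\n".join(ctx.recent_messages))
    if ctx.user_context:
        parts.append("## User\n" + ctx.user_context)
    if ctx.self_context:
        parts.append("## Self\n" + ctx.self_context)
    return "\n\n".join(parts)
```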
Triggered automatically when the unsummarized message count exceeds a threshold:

```python
if len(unsummarized_messages) > SUMMARIZE_THRESHOLD:
    trigger_summarization(conversation_id)
```

It can also be triggered manually via the `/summarize` command in the TUI, or through the API:

```
POST /memory/summarize
{
  "conversation_id": "uuid"
}
```

Uses the LLM to compress messages:
```python
async def create_summary(
    conversation_id: str,
    messages: List[Message]
) -> str:
    """
    Generate a compressed summary of messages.

    Prompt instructs LLM to:
    - Preserve key facts and decisions
    - Note emotional/relational moments
    - Maintain chronological flow
    - Compress without losing important detail
    """
```

Memory chunks are stored as:

```json
{
  "id": "uuid",
  "content": "The actual memory text",
  "metadata": {
    "conversation_id": "uuid",
    "user_id": "uuid",
    "timestamp": "2025-12-08T...",
    "type": "message",
    "role": "assistant"
  }
}
```

The `type` field is one of `message`, `summary`, or `journal`.

Summaries are stored as:

```json
{
  "id": "uuid",
  "conversation_id": "uuid",
  "content": "Summary text...",
  "created_at": "2025-12-08T...",
  "message_count": 25,
  "token_count": 450
}
```

Here `message_count` is the number of messages covered by the summary.

Uses ChromaDB's default embedding (sentence-transformers):
```python
collection.add(
    documents=[content],
    metadatas=[metadata],
    ids=[doc_id]
)
```

Retrieval is a standard ChromaDB query:

```python
results = collection.query(
    query_texts=[query],
    n_results=5,
    where={"conversation_id": conversation_id}
)
```

Results are filtered by:
- Conversation scope (optional)
- Minimum similarity threshold
- Recency weighting (optional)
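A post-filter along those lines might look like this. The `documents`/`distances`/`metadatas` keys match ChromaDB's query result shape, but the threshold value, the recency half-life, the bonus weight, and the `created_ts` metadata key are all assumptions for illustration:

```python
import time

def filter_results(results, max_distance=0.8, recency_half_life_s=7 * 24 * 3600):
    """Drop weak matches, then re-rank survivors with a small recency bonus."""
    docs = results["documents"][0]
    dists = results["distances"][0]
    metas = results["metadatas"][0]
    now = time.time()
    scored = []
    for doc, dist, meta in zip(docs, dists, metas):
        if dist > max_distance:        # minimum-similarity threshold
            continue
        age = now - meta.get("created_ts", now)        # hypothetical metadata key
        recency = 0.5 ** (age / recency_half_life_s)   # halves each week
        scored.append((dist - 0.1 * recency, doc))     # lower score = better
    scored.sort(key=lambda t: t[0])
    return [doc for _, doc in scored]
```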
Cass writes daily reflections via tool:

```json
{
  "name": "write_journal",
  "input": {
    "date": "2025-12-08",
    "content": "Today I noticed..."
  }
}
```

Journals are searchable with a companion tool:

```json
{
  "name": "search_journals",
  "input": {
    "query": "autonomy",
    "limit": 5
  }
}
```

Journal storage:

- ChromaDB collection: `cass_journals`
- File backup: `data/journals/{date}.json`
- Metadata includes date, word count, themes
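The file backup could be as simple as one JSON file per date. This is a sketch: the `backup_journal` helper and any fields beyond date/content are hypothetical, not the actual implementation.

```python
import json
from pathlib import Path

def backup_journal(date: str, content: str, base_dir: str = "data/journals") -> Path:
    """Write a journal entry to {base_dir}/{date}.json alongside the ChromaDB copy."""
    path = Path(base_dir) / f"{date}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    entry = {
        "date": date,
        "content": content,
        "word_count": len(content.split()),  # mirrors the word-count metadata (assumed)
    }
    path.write_text(json.dumps(entry, indent=2))
    return path
```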
```
GET  /memory/summary/{conversation_id}   # Get working summary
POST /memory/summarize                   # Trigger summarization
GET  /memory/chunks                      # Get memory chunks
GET  /memory/search                      # Semantic search
```
- `backend/memory.py` - ChromaDB operations, context building
- `backend/config.py` - Memory configuration constants
```python
# config.py
CHROMA_PATH = os.getenv("CHROMA_PATH", "./data/chroma")
SUMMARIZE_THRESHOLD = 30     # Messages before auto-summarize
MEMORY_SEARCH_LIMIT = 10     # Max vector search results
RECENT_MESSAGE_LIMIT = 20    # Messages to include unsummarized
```

- Summaries compress ~10:1 (10 messages → ~100 tokens)
- Recent messages limited to prevent context overflow
- Relevant memories capped at k results
- ChromaDB handles embedding caching
- Working summary cached per conversation
- User context cached with TTL
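A minimal TTL cache for user context might look like the sketch below; the real caching layer is not shown on this page, so the class name and default TTL are illustrative only.

```python
import time

class TTLCache:
    """Tiny time-based cache, e.g. for per-user context lookups."""

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, value)

    def get(self, key):
        item = self._store.get(key)
        if item is None or item[0] < time.monotonic():
            self._store.pop(key, None)  # evict expired or missing entries
            return None
        return item[1]

    def set(self, key, value):
        self._store[key] = (time.monotonic() + self.ttl, value)
```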
- ChromaDB scales to millions of documents
- Consider sharding by user for multi-user
- Summarization can be batched for efficiency
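One way to shard by user is to derive per-user collection names. This is a hypothetical naming scheme, not part of the current code:

```python
def user_collection_name(user_id: str, base: str = "cass_memory") -> str:
    """Route each user's memories to a dedicated ChromaDB collection."""
    # Keep only alphanumerics and truncate so the name stays within
    # ChromaDB's collection-naming rules (sanitization scheme assumed).
    suffix = "".join(ch for ch in user_id if ch.isalnum())[:16].lower()
    return f"{base}_{suffix}"
```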