Memory sidecar for AI agents. Extracts structured memory from conversations — decisions, facts, investigation outcomes, work checkpoints — and returns compact evidence-backed cards when the agent needs context from earlier threads.
Multilingual by design: queries in one language retrieve memory stored in another. Local-first, no cloud dependencies.
Thread 1: your agent helps debug a deployment issue. After investigation, it decides to use event timestamps for ordering. Pallium extracts and stores the decision with its evidence.
Thread 2 (days later): a colleague asks "why do we use event time for ordering?" Pallium returns a compact card:
```
decision: "Use event time for reservation ordering — avoids timezone drift."
evidence: thread-A, 2024-03-15
```
The agent answers immediately with the original reasoning — no re-investigation, no guessing, no pasting from old threads.
Store a decision, then ask about it later:
```bash
# Ingest + query in one call (recommended pattern)
curl -X POST http://localhost:8000/item-and-query \
  -H 'Content-Type: application/json' -d '{
    "source_type": "chat_message",
    "source_id": "msg-042",
    "content_type": "text/plain",
    "content": "Why did we choose event time for reservation ordering?",
    "role": "user",
    "artifact_kind": "message",
    "container_ref": "channel:catalog-sync",
    "visibility": "container",
    "thread_ref": "thread-17"
  }'
```

Pallium returns a compact memory card with an injection decision:
```json
{
  "should_inject": true,
  "decision_reason": "carry_forward_available",
  "injectable_blocks": [
    {
      "block_type": "memory_hit",
      "title": "decision",
      "text": "Use item event time for reservation ordering — avoids timezone drift.",
      "memory_type": "decision"
    }
  ]
}
```

The agent injects that card directly. No reranking, no local filtering — `should_inject` and `injectable_blocks` are the contract.
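A minimal sketch of the agent side of that contract, in Python with `requests` (the endpoint and field names are the ones shown above; the prompt assembly is illustrative):

```python
import requests

# Same payload as the curl example above.
item = {
    "source_type": "chat_message",
    "source_id": "msg-042",
    "content_type": "text/plain",
    "content": "Why did we choose event time for reservation ordering?",
    "role": "user",
    "artifact_kind": "message",
    "container_ref": "channel:catalog-sync",
    "visibility": "container",
    "thread_ref": "thread-17",
}

resp = requests.post("http://localhost:8000/item-and-query", json=item).json()

# Trust the injection decision: no reranking, no local filtering.
memory_context = ""
if resp["should_inject"]:
    memory_context = "\n\n".join(
        f"[{block['title']}] {block['text']}" for block in resp["injectable_blocks"]
    )
# Prepend memory_context to the LLM prompt as-is.
```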
```bash
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -e ".[dev,vector]"
cp pallium.example.toml pallium.local.toml
cp .env.example .env.local
# Set your LLM API key in .env.local
```

Start the service and try the interactive harness:
```bash
python -m app.run --host 127.0.0.1 --port 8000 --processors 1
# In another terminal:
python -m app.agent_simulation chat-lite
```

The harness runs a thin-agent loop against the real HTTP endpoints — ask repeated questions or resume interrupted work and inspect Pallium's memory decisions.
See docs/getting-started.md for the full walkthrough.
```mermaid
flowchart LR
    A[Agent] -->|POST /item-and-query| P[Pallium]
    P -->|background| W[Extract & Embed]
    W -->|decisions, facts,\ncheckpoints| M[(Memory + Index)]
    M -->|hybrid retrieval| P
    P -->|should_inject\ninjectable_blocks| A
```
- Ingest — selected evidence goes in via `POST /items` (not everything, just high-value events); see the sketch after this list
- Process — background workers extract structured memory and concrete facts, then embed for retrieval
- Query — `POST /query` retrieves compact memory + source evidence, scoped by visibility, with an injection decision
- Combined — `POST /item-and-query` does ingest + query in one call (recommended for the common per-message pattern)
- Debug — `POST /query/debug` or `POST /item-and-query/debug` exposes the full retrieval and routing trace
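The plain ingest path looks like this, as a hedged sketch in Python (assuming `POST /items` accepts the same item fields shown in the `/item-and-query` example above; see the HTTP API docs for the exact shape):

```python
import requests

# Store the assistant's reply as evidence after the LLM call.
# Field names assume the same item shape as the /item-and-query example above.
reply_item = {
    "source_type": "chat_message",
    "source_id": "msg-043",
    "content_type": "text/plain",
    "content": "We order reservations by event time to avoid timezone drift.",
    "role": "assistant",
    "artifact_kind": "message",
    "container_ref": "channel:catalog-sync",
    "visibility": "container",
    "thread_ref": "thread-17",
}
requests.post("http://localhost:8000/items", json=reply_item).raise_for_status()
```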
From stored evidence, Pallium derives typed memory:
| Type | Example |
|---|---|
| `decision` | "Use event time for ordering — avoids timezone drift" |
| `investigation_outcome` | "Root cause: stale cache after deploy" |
| `task_checkpoint` | "Blocked on API rate limit, next: implement backoff" |
| `atomic_fact` | "Jordan completed a half-marathon in Denver in March 2024" |
| `thread_summary` | "Discussed migration strategy, agreed on staged rollout" |
| `constraint_memory` | "Must stay on Python 3.12 for compatibility" |
Every memory object stays linked to its supporting source evidence.
Retrieval combines lexical search (FTS5 + BM25), vector similarity, and hybrid RRF fusion. The query path is deterministic by default, with selective LLM-assisted disambiguation only for bounded ambiguous cases.
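As a rough illustration of the fusion step only (generic reciprocal-rank fusion over ranked lists, not Pallium's actual implementation; `k = 60` is the conventional constant):

```python
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked result lists (e.g. BM25 and vector search) with
    reciprocal-rank fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: lexical (BM25) and vector rankings over memory IDs.
lexical = ["mem-7", "mem-3", "mem-9"]
vector = ["mem-3", "mem-7", "mem-1"]
print(rrf_fuse([lexical, vector]))  # mem-3 and mem-7 rise to the top
```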
See docs/how-it-works.md for the full model.
Pallium sits between your agent and its LLM. On each user message, the agent calls Pallium once; Pallium stores the message and returns any relevant prior memory. After the LLM responds, the agent sends the reply back as evidence.
User message → Pallium (store + query) → inject memory → LLM → reply → Pallium (store)
Two endpoints cover the full loop:
- `POST /item-and-query` — store the user message, get memory back (before the LLM call)
- `POST /items` — store the reply and artifacts (after the LLM call)
Pallium decides what to extract, what to inject, and when to stay silent.
The agent trusts `should_inject` and passes `injectable_blocks` through.
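A hedged sketch of that loop in Python (`call_llm` is a placeholder for your own model call; the item fields follow the examples above):

```python
import requests

PALLIUM = "http://localhost:8000"

def make_item(content: str, role: str, source_id: str) -> dict:
    # Same item shape as the earlier examples; adjust the refs to your own scoping.
    return {
        "source_type": "chat_message",
        "source_id": source_id,
        "content_type": "text/plain",
        "content": content,
        "role": role,
        "artifact_kind": "message",
        "container_ref": "channel:catalog-sync",
        "visibility": "container",
        "thread_ref": "thread-17",
    }

def handle_user_message(text: str, source_id: str, call_llm) -> str:
    # 1. Store the user message and retrieve relevant prior memory in one call.
    memory = requests.post(
        f"{PALLIUM}/item-and-query", json=make_item(text, "user", source_id)
    ).json()

    # 2. Inject exactly what Pallium says to inject; stay silent otherwise.
    context = ""
    if memory["should_inject"]:
        context = "\n".join(b["text"] for b in memory["injectable_blocks"])

    # 3. Call the LLM with the injected context (call_llm is your own function).
    reply = call_llm(prompt=text, memory_context=context)

    # 4. Store the reply as evidence so later threads can build on it.
    requests.post(f"{PALLIUM}/items", json=make_item(reply, "assistant", source_id + "-reply"))
    return reply
```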
See agent-integration.md for the full guide and integration-example.md for a Slack agent walkthrough.
Pallium includes an MCP server for direct LLM tool access:
```bash
claude mcp add pallium -- python -m app.run mcp
```
Three tools: `pallium_query` (search memory), `pallium_query_debug` (retrieval trace), `pallium_ingest` (store evidence).
Context defaults (container, thread, actor, visibility) are set via environment variables so tool calls don't need to repeat them.
See agent-integration.md for setup details.
Pallium is designed to be multilingual. Memory is preserved in the original language and cross-language recall works natively — a query in one language can retrieve memory stored in another.
This is an intentional architectural property, not an undocumented side effect. Tokenization, lexical scoring, content-overlap gates, and embedding are all built to handle non-Latin scripts (Hebrew, Arabic, CJK, Cyrillic) as first-class content.
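For example, a question asked in Hebrew can recall the English decision stored in the earlier examples (a sketch using the same endpoint and fields as above):

```python
import requests

# The decision was stored in English; the follow-up question arrives in Hebrew.
# Cross-language retrieval is expected to bridge the two.
hebrew_item = {
    "source_type": "chat_message",
    "source_id": "msg-044",
    "content_type": "text/plain",
    "content": "למה אנחנו משתמשים בזמן אירוע למיון הזמנות?",  # "Why do we use event time for ordering reservations?"
    "role": "user",
    "artifact_kind": "message",
    "container_ref": "channel:catalog-sync",
    "visibility": "container",
    "thread_ref": "thread-18",
}
resp = requests.post("http://localhost:8000/item-and-query", json=hebrew_item).json()
# The injectable block should carry the original English decision text.
```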
Good fit:
- agent-mediated conversations and follow-up questions
- resumed investigations or implementation work
- scoped public/private memory boundaries
- inspectable retrieval when results look wrong
Not a fit:
- transcript archive or raw event storage
- broad workspace or org-wide knowledge search
- agent runtime or workflow engine
- general-purpose vector database
Pallium optimizes for work continuity — carrying forward decisions, investigations, and checkpoints across threads. These benchmarks test a broader mix including trivia-style factual recall.
Results show both retrieval rate (did Pallium deliver the right memory?) and end-to-end accuracy (did the LLM answer correctly?). Retrieval rate isolates what Pallium controls; the gap shows what the answering LLM adds or loses.
| Benchmark | Retrieval | End-to-end | Questions |
|---|---|---|---|
| LoCoMo — conversational recall (ACL 2024) | 45.5% | 61.0% | 1,540 |
| LongMemEval — multi-session memory (ICLR 2025) | 91.7% | 93.2% | 60 (mini) |
| FactConsolidation — contradiction handling (MABench, ICLR 2026) | 65% | 54.0% | 200 |
LoCoMo end-to-end exceeds retrieval because the answering LLM compensates with its own knowledge on trivia questions. FactConsolidation single-hop reached 86% after fact extraction hardening; multi-hop (22%) remains an active improvement area. Per-category breakdowns and reproduction commands are in docs/benchmarks.md.
Using Pallium:
- Getting Started — local setup to first query
- Demo Session — complete walkthrough with real requests
- HTTP API — endpoints, shapes, examples
Integrating Pallium:
- Agent Integration — wiring into a runtime, MCP tools
- Integration Example — Slack agent walkthrough
- Privacy and Visibility — scoped memory boundaries
Understanding Pallium:
- How It Works — architecture, memory model, retrieval
- Configuration — providers, packages, tuning
- Benchmarks — per-category results, reproduction commands
