Merged
20 changes: 14 additions & 6 deletions DESIGN.md
@@ -442,7 +442,10 @@ interface Page {
```

#### Book
Ordered sequence of pages with representative medoid.
Ordered sequence of pages from a **single ingest call** with a representative medoid.
One `ingestText()` call always produces exactly one Book — the entire ingested document.
A collection of Books forms a Volume; a collection of Volumes forms a Shelf.
Books are identified by `SHA-256(sorted pageIds)` so their identity is content-addressed.
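
The content-addressed identity can be sketched as follows. This is a minimal illustration, not the production ingest code: `deriveBookId` and the newline canonicalisation are hypothetical, and it assumes a Node-style `crypto` API rather than whatever hashing helper the codebase actually uses.

```typescript
import { createHash } from "node:crypto";

// Hypothetical sketch: a Book's ID is the SHA-256 of its sorted page IDs,
// so the same set of pages always yields the same Book identity,
// regardless of the order in which pages were produced.
export function deriveBookId(pageIds: string[]): string {
  const canonical = [...pageIds].sort().join("\n"); // order-independent
  return createHash("sha256").update(canonical).digest("hex");
}
```

Because the input is sorted before hashing, re-ingesting the same pages in a different order yields the same `bookId`.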

```typescript
interface Book {
@@ -630,14 +633,19 @@ Rather than returning nearest neighbors by similarity, Cortex traces a coherent
2. **Generate Embeddings** — Batch embed with selected provider
3. **Persist Vectors** — Append to OPFS vector file
4. **Persist Pages** — Write page metadata to IndexedDB; initialise `PageActivity` record
5. **Build/Attach Hierarchy** — Construct/update books, volumes, shelves; attempt hotpath admission for each level's medoid/prototype using tier quota via `SalienceEngine`
6. **Fast Semantic Neighbor Insert** — Update semantic neighbor graph incrementally; bounded degree via `HotpathPolicy`; check new page for hotpath admission
5. **Create Ingest Book** — Build exactly one Book for the entire ingest: compute the medoid page (minimum total cosine distance to all other pages in the document), derive `bookId = SHA-256(sorted pageIds)`, persist. Hotpath admission for the book runs via `SalienceEngine`. Volumes and Shelves are assembled lazily by the Daydreamer from accumulated Books.
6. **Fast Semantic Neighbor Insert** — Update semantic neighbor graph incrementally; bounded degree via `HotpathPolicy`; check new pages for hotpath admission
7. **Mark Dirty** — Flag volumes for full recalc by Daydreamer
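
The medoid computation in step 5 can be sketched as below. This is a simplified standalone version using plain number arrays; the production code operates on `Float32Array` embeddings read from the vector store.

```typescript
// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na && nb ? dot / (Math.sqrt(na) * Math.sqrt(nb)) : 0;
}

// Sketch of medoid selection: the page whose embedding minimises total
// cosine distance (1 - cosine similarity) to every other page in the ingest.
function medoidIndex(embeddings: number[][]): number {
  let best = 0;
  let bestTotal = Infinity;
  for (let i = 0; i < embeddings.length; i++) {
    let total = 0;
    for (let j = 0; j < embeddings.length; j++) {
      if (i !== j) total += 1 - cosine(embeddings[i], embeddings[j]);
    }
    if (total < bestTotal) {
      bestTotal = total;
      best = i;
    }
  }
  return best;
}
```

The medoid is preferred over a centroid here because it is always a real page, so it can serve directly as the Book's representative.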

**Incremental Strategy:**
Fast local semantic neighbor insertion keeps ingest-time latency low. At ingest time, only the initial forward and reverse edges are created — neighbors are selected by cosine similarity within Williams-cutoff **distance** (not a fixed K; the cutoff is derived from `HotpathPolicy`). On degree overflow, the lowest-cosine-similarity neighbor is evicted.
**Incremental Strategy (fast and lightweight):**
To keep ingest-time latency low, only two classes of edges are created at ingest time:
- **Document-order adjacency** — Forward and reverse `SemanticNeighbor` edges between each consecutive page pair within the book slice, inserted unconditionally (document-adjacent chunks are always related). This uses a pre-built `Map<pageId, embedding>` for O(1) lookups; no O(n²) index scans.
- **Proximity edges** — Additional `SemanticNeighbor` edges to nearby pages already in the corpus, bounded by cosine-distance cutoff and `maxDegree` eviction.
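
The `maxDegree` eviction for proximity edges could look roughly like this. It is an illustrative sketch: the edge shape mirrors the `SemanticNeighbor` interface in `core/types.ts`, but `insertNeighbor` and its signature are hypothetical, not the actual ingest API.

```typescript
interface SemanticNeighbor {
  neighborPageId: string;
  cosineSimilarity: number;
  distance: number; // 1 - cosineSimilarity
}

// Sketch: add a proximity edge to a node's neighbor list, evicting the
// weakest (lowest-similarity) edge once the degree bound is exceeded.
function insertNeighbor(
  edges: SemanticNeighbor[],
  candidate: SemanticNeighbor,
  maxDegree: number,
): SemanticNeighbor[] {
  const next = [...edges, candidate];
  if (next.length <= maxDegree) return next;
  // Find and remove the single lowest-similarity edge.
  let weakest = 0;
  for (let i = 1; i < next.length; i++) {
    if (next[i].cosineSimilarity < next[weakest].cosineSimilarity) weakest = i;
  }
  next.splice(weakest, 1);
  return next;
}
```

Note that the candidate itself can be the one evicted, so a weak new edge never displaces a stronger existing one.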

Full cross-edge reconnection is intentionally deferred: Daydreamer walks the graph during idle passes to build additional edges, strengthening or pruning connections via LTP/LTD. This avoids a full graph recalculation on every insert while still converging to a well-connected graph over time. Hotpath admission runs at ingest time for new pages and hierarchy prototypes.
Full cross-edge reconnection is intentionally deferred: Daydreamer walks the graph during idle passes to build additional edges — connections we never noticed at ingest time — and strengthens or prunes them via LTP/LTD. This keeps ingest cost sublinear while converging to a well-connected graph over time.

**IndexedDB Schema Upgrade Strategy:**
During early development (pre-v1.0) the schema upgrade path intentionally drops and recreates object stores rather than migrating data. This keeps upgrade code minimal and avoids cruft until the data model stabilises. The neighbor graph is rebuilt from scratch after any ingest replay.

## Consolidation Design

2 changes: 2 additions & 0 deletions core/types.ts
@@ -67,12 +67,14 @@ export interface Edge {
// Semantic nearest-neighbor graph
// ---------------------------------------------------------------------------

/** A single directed proximity edge in the sparse semantic neighbor graph. */
export interface SemanticNeighbor {
neighborPageId: Hash;
cosineSimilarity: number; // threshold is defined by runtime policy
distance: number; // 1 - cosineSimilarity (ready for TSP)
}

/** Induced subgraph returned by BFS expansion of the semantic neighbor graph. */
export interface SemanticNeighborSubgraph {
nodes: Hash[];
edges: { from: Hash; to: Hash; distance: number }[];
66 changes: 66 additions & 0 deletions cortex/KnowledgeGapDetector.ts
@@ -0,0 +1,66 @@
import type { Hash } from "../core/types";
import type { ModelProfile } from "../core/ModelProfile";
import { hashText } from "../core/crypto/hash";
import type { Metroid } from "./MetroidBuilder";

export interface KnowledgeGap {
queryText: string;
queryEmbedding: Float32Array;
knowledgeBoundary: Hash | null;
detectedAt: string;
}

export interface CuriosityProbe {
probeId: Hash;
queryText: string;
queryEmbedding: Float32Array;
knowledgeBoundary: Hash | null;
mimeType: string;
modelUrn: string;
createdAt: string;
}

/**
* Returns a KnowledgeGap when the metroid signals that m2 could not be found
* (i.e. the engine has no antithesis for this query). Returns null when the
* metroid is complete and no gap was detected.
*/
export async function detectKnowledgeGap(
queryText: string,
queryEmbedding: Float32Array,
metroid: Metroid,
// eslint-disable-next-line @typescript-eslint/no-unused-vars -- reserved for future model-aware gap categorisation
_modelProfile: ModelProfile,
): Promise<KnowledgeGap | null> {
if (!metroid.knowledgeGap) return null;

return {
queryText,
queryEmbedding,
knowledgeBoundary: metroid.m1 !== "" ? metroid.m1 : null,
detectedAt: new Date().toISOString(),
};
}

/**
* Builds a serialisable CuriosityProbe from a detected KnowledgeGap.
* The probeId is the SHA-256 of (queryText + detectedAt) so it is
* deterministic for the same gap inputs.
*/
export async function buildCuriosityProbe(
gap: KnowledgeGap,
modelProfile: ModelProfile,
mimeType = "text/plain",
): Promise<CuriosityProbe> {
const probeId = await hashText(gap.queryText + gap.detectedAt);

return {
probeId,
queryText: gap.queryText,
queryEmbedding: gap.queryEmbedding,
knowledgeBoundary: gap.knowledgeBoundary,
mimeType,
modelUrn: `urn:model:${modelProfile.modelId}`,
createdAt: new Date().toISOString(),
};
}
217 changes: 217 additions & 0 deletions cortex/MetroidBuilder.ts
@@ -0,0 +1,217 @@
import type { Hash, VectorStore } from "../core/types";
import type { ModelProfile } from "../core/ModelProfile";

export interface Metroid {
m1: Hash;
m2: Hash | null;
c: Float32Array | null;
knowledgeGap: boolean;
}

export interface MetroidBuilderOptions {
modelProfile: ModelProfile;
vectorStore: VectorStore;
}

/** Standard Matryoshka tier sizes in ascending order. */
const MATRYOSHKA_TIERS = [32, 64, 128, 256, 512, 768, 1024, 2048] as const;

function cosineSimilarity(a: Float32Array, b: Float32Array): number {
let dotProduct = 0;
let normA = 0;
let normB = 0;
const len = Math.min(a.length, b.length);
for (let i = 0; i < len; i++) {
dotProduct += a[i] * b[i];
normA += a[i] * a[i];
normB += b[i] * b[i];
}
if (normA === 0 || normB === 0) return 0;
return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}

function cosineDistance(a: Float32Array, b: Float32Array): number {
return 1 - cosineSimilarity(a, b);
}

/**
* Returns the index of the medoid: the element that minimises total cosine
* distance to every other element in the set.
*/
function findMedoidIndex(embeddings: Float32Array[]): number {
if (embeddings.length === 1) return 0;

let bestIdx = 0;
let bestTotal = Infinity;

for (let i = 0; i < embeddings.length; i++) {
let total = 0;
for (let j = 0; j < embeddings.length; j++) {
if (i !== j) {
total += cosineDistance(embeddings[i], embeddings[j]);
}
}
if (total < bestTotal) {
bestTotal = total;
bestIdx = i;
}
}

return bestIdx;
}

interface CandidateEntry {
pageId: Hash;
embeddingOffset: number;
embeddingDim: number;
}

interface CandidateWithEmbedding extends CandidateEntry {
embedding: Float32Array;
}

/**
* Searches for m2 among `others` (candidates excluding m1) using the free
* dimensions starting at `protectedDim`.
*
* Returns the selected medoid candidate or `null` if no valid opposite set
* can be assembled.
*/
function searchM2(
others: CandidateWithEmbedding[],
m1Embedding: Float32Array,
protectedDim: number,
): CandidateWithEmbedding | null {
if (others.length === 0) return null;

const m1Free = m1Embedding.slice(protectedDim);

const scored = others.map((c) => {
const free = c.embedding.slice(protectedDim);
return { candidate: c, score: -cosineSimilarity(free, m1Free) };
});

// Prefer candidates that are genuinely opposite (score >= 0).
let oppositeSet = scored.filter((s) => s.score >= 0);

// Fall back to the top 50% when the genuine-opposite set is too small.
if (oppositeSet.length < 2) {
const byScore = [...scored].sort((a, b) => b.score - a.score);
const topHalf = Math.max(1, Math.ceil(byScore.length / 2));
oppositeSet = byScore.slice(0, topHalf);
}

if (oppositeSet.length === 0) return null;

const medoidIdx = findMedoidIndex(oppositeSet.map((s) => s.candidate.embedding.slice(protectedDim)));
return oppositeSet[medoidIdx].candidate;
}

/**
* Builds the dialectical probe (Metroid) for a given query embedding and a
* ranked list of candidate memory nodes.
*
* Step overview
* 1. Select m1 (thesis): the candidate with highest cosine similarity to the query.
* 2. Select m2 (antithesis): the medoid of the cosine-opposite set in free dims.
* Uses Matryoshka dimensional unwinding when the initial tier yields no m2.
* 3. Compute centroid c (synthesis): protected dims copied from m1, free dims
* averaged between m1 and m2.
*/
export async function buildMetroid(
queryEmbedding: Float32Array,
candidateMedoids: Array<{ pageId: Hash; embeddingOffset: number; embeddingDim: number }>,
options: MetroidBuilderOptions,
): Promise<Metroid> {
const { modelProfile, vectorStore } = options;

if (candidateMedoids.length === 0) {
return { m1: "", m2: null, c: null, knowledgeGap: true };
}

// Load all candidate embeddings in one pass.
const candidates: CandidateWithEmbedding[] = await Promise.all(
candidateMedoids.map(async (cand) => ({
...cand,
embedding: await vectorStore.readVector(cand.embeddingOffset, cand.embeddingDim),
})),
);

// Select m1: highest cosine similarity to the query.
let m1Candidate = candidates[0];
let m1Score = cosineSimilarity(queryEmbedding, candidates[0].embedding);

for (let i = 1; i < candidates.length; i++) {
const score = cosineSimilarity(queryEmbedding, candidates[i].embedding);
if (score > m1Score) {
m1Score = score;
m1Candidate = candidates[i];
}
}

const protectedDim = modelProfile.matryoshkaProtectedDim;

if (protectedDim === undefined) {
// Non-Matryoshka model: antithesis search is impossible.
return { m1: m1Candidate.pageId, m2: null, c: null, knowledgeGap: true };
}

const others = candidates.filter((c) => c.pageId !== m1Candidate.pageId);

// --- Matryoshka dimensional unwinding ---
// Start at modelProfile.matryoshkaProtectedDim. If m2 not found, progressively
// shrink the protected boundary (expand the free-dimension search region).

const startingTierIndex = MATRYOSHKA_TIERS.indexOf(
protectedDim as (typeof MATRYOSHKA_TIERS)[number],
);

// Build the list of tier boundaries to attempt, from the configured value
// down to the smallest tier (expanding the free region at each step).
const tierBoundaries: number[] = [];
if (startingTierIndex !== -1) {
for (let i = startingTierIndex; i >= 0; i--) {
tierBoundaries.push(MATRYOSHKA_TIERS[i]);
}
} else {
// protectedDim is not a standard tier; try it as-is plus any smaller standard tiers.
tierBoundaries.push(protectedDim);
for (const t of [...MATRYOSHKA_TIERS].reverse()) {
if (t < protectedDim) tierBoundaries.push(t);
}
}

let m2Candidate: CandidateWithEmbedding | null = null;
let usedProtectedDim = protectedDim;

for (const tierBoundary of tierBoundaries) {
const found = searchM2(others, m1Candidate.embedding, tierBoundary);
if (found !== null) {
m2Candidate = found;
usedProtectedDim = tierBoundary;
break;
}
}

if (m2Candidate === null) {
return { m1: m1Candidate.pageId, m2: null, c: null, knowledgeGap: true };
}

// Compute frozen synthesis centroid c.
const fullDim = m1Candidate.embedding.length;
const c = new Float32Array(fullDim);

for (let i = 0; i < usedProtectedDim; i++) {
c[i] = m1Candidate.embedding[i];
}
for (let i = usedProtectedDim; i < fullDim; i++) {
c[i] = (m1Candidate.embedding[i] + m2Candidate.embedding[i]) / 2;
}

return {
m1: m1Candidate.pageId,
m2: m2Candidate.pageId,
c,
knowledgeGap: false,
};
}
62 changes: 62 additions & 0 deletions cortex/OpenTSPSolver.ts
@@ -0,0 +1,62 @@
import type { Hash, SemanticNeighborSubgraph } from "../core/types";

/**
* Greedy nearest-neighbor open-path TSP heuristic.
*
* Visits every node in the subgraph exactly once, starting from the
* lexicographically smallest node ID for determinism. At each step the
* algorithm advances to the unvisited node nearest to the current one
* (using edge distance). Ties are broken lexicographically. Missing edges
* are treated as having distance Infinity.
*/
export function solveOpenTSP(subgraph: SemanticNeighborSubgraph): Hash[] {
const { nodes, edges } = subgraph;
if (nodes.length === 0) return [];

// Build undirected adjacency map: node → (neighbor → distance).
const adj = new Map<Hash, Map<Hash, number>>();
for (const node of nodes) {
adj.set(node, new Map());
}
for (const edge of edges) {
const fromMap = adj.get(edge.from);
const toMap = adj.get(edge.to);
if (fromMap !== undefined) fromMap.set(edge.to, edge.distance);
if (toMap !== undefined) toMap.set(edge.from, edge.distance);
}

// Sort once so iteration order is lexicographic; combined with the explicit
// tie check below, ties always resolve to the smallest node ID.
const sorted = [...nodes].sort();

const visited = new Set<Hash>();
const path: Hash[] = [];
let current = sorted[0];

while (path.length < nodes.length) {
visited.add(current);
path.push(current);

if (path.length === nodes.length) break;

const neighbors = adj.get(current)!;
let bestNode: Hash | undefined;
let bestDist = Infinity;

for (const node of sorted) {
if (visited.has(node)) continue;
const dist = neighbors.get(node) ?? Infinity;
if (
dist < bestDist ||
(dist === bestDist && (bestNode === undefined || node < bestNode))
) {
bestDist = dist;
bestNode = node;
}
}

// bestNode is always defined here because at least one unvisited node remains.
current = bestNode!;
}

return path;
}