Merged
18 changes: 17 additions & 1 deletion DESIGN.md
@@ -283,8 +283,24 @@ This mechanism enables **distributed learning without hallucination**: the syste

### Motivation

#### The Geometric Root: Curse of Dimensionality

CORTEX operates on high-dimensional Matryoshka embeddings. In `n`-dimensional Euclidean space the volume of the unit ball is:

```
Vol(B²ᵐ) = πᵐ / m! (n = 2m, even dimension)
```

As `m` (half the embedding dimension) grows, this volume peaks at `m = 3` and then collapses toward zero faster than exponentially: each step from `m` to `m + 1` multiplies it by `π/(m + 1)`. This collapse is one geometric face of the **curse of dimensionality**: pairwise distances concentrate (everything looks equally far away), interiors vanish (rejection sampling and kernel methods fail), and costs that scale linearly or polynomially with corpus size become prohibitive. Naïve nearest-neighbor search, flat clustering, fixed-K neighbor graphs, and uniform fan-out become either useless or unboundedly expensive as the corpus scales.
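
To make the collapse concrete, the following TypeScript sketch evaluates the even-dimension formula above for a few embedding sizes. The helper name `unitBallVolumeEven` and the sample dimensions are illustrative assumptions, not part of CORTEX; the product is accumulated term by term so that `m!` never has to be computed directly.

```typescript
// Illustrative helper (hypothetical name, not part of the CORTEX codebase).
// Volume of the unit ball in even dimension n = 2m: Vol = π^m / m!.
// The product is built as (π/1)·(π/2)·…·(π/m) to avoid overflowing m!.
function unitBallVolumeEven(m: number): number {
  if (!Number.isInteger(m) || m < 0) {
    throw new RangeError("m must be a non-negative integer");
  }
  let volume = 1; // empty product for m = 0
  for (let k = 1; k <= m; k++) {
    volume *= Math.PI / k;
  }
  return volume;
}

// Sample embedding dimensions n = 2m (chosen for illustration only).
for (const n of [2, 6, 16, 64, 256]) {
  const vol = unitBallVolumeEven(n / 2);
  console.log(`n=${n}: Vol = ${vol.toExponential(3)}`);
}
```

The volume rises slightly up to `n = 6` and then falls off rapidly, because each step from `m` to `m + 1` multiplies it by `π/(m + 1)`, which is below 1 once `m ≥ 3`.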

Every structural decision in CORTEX — protected Matryoshka layers, hierarchical medoids, the Metroid antithesis hunt, dimensional unwinding, Williams-derived index sizes — is a direct geometric counter-measure to this collapse.

#### The Fix: Williams 2025 Sublinear Bound

CORTEX applies the Williams 2025 result, that a multitape Turing machine running in time t can be simulated in space S = O(√(t log t)), as a universal sublinear growth law everywhere the system trades space against time: the resident hotpath index, per-tier hierarchy quotas, per-community graph budgets, semantic neighbor degree limits, and Daydreamer maintenance batch sizing. This single principle is intended to keep the system efficient as the memory graph scales from hundreds to millions of nodes.

Concretely: where a naïve system would grow capacity linearly (O(t)) or even quadratically (O(t²) for pairwise operations), CORTEX caps every space-or-time budget at O(√(t log t)). Because √(t log t) / t tends to zero as t grows, the resident share of the graph shrinks as the corpus grows, which is what keeps the engine viable on-device at large corpus sizes.
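
As a rough illustration of how slowly such a budget grows, the sketch below evaluates the hotpath capacity formula H(t) = ⌈c·√(t·log₂(1+t))⌉ quoted in the README for several graph masses. The function name `williamsCapacity` and the constant `c = 0.5` are placeholders chosen for this example; the project's actual default for `c` is not shown in this change.

```typescript
// Illustrative helper (hypothetical name; c = 0.5 is an assumed placeholder).
// H(t) = ceil(c * sqrt(t * log2(1 + t))), where t is the graph mass.
function williamsCapacity(t: number, c: number = 0.5): number {
  if (!Number.isFinite(t) || t < 0) {
    throw new RangeError("t must be a finite, non-negative number");
  }
  return Math.ceil(c * Math.sqrt(t * Math.log2(1 + t)));
}

// Compare the sublinear budget with linear growth for sample graph masses.
for (const t of [100, 10000, 1000000]) {
  const h = williamsCapacity(t);
  const fraction = (h / t).toFixed(4);
  console.log(`t=${t}: H(t)=${h}, resident fraction=${fraction}`);
}
```

Under these assumptions the resident fraction H(t)/t falls as t grows, which is the sublinear behaviour this section describes.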

### Graph Mass Definition

```
@@ -797,7 +813,7 @@ relative to frozen c. Planned module: `cortex/MetroidBuilder.ts`.

**Hotpath**: The in-memory resident index of H(t) entries spanning all four hierarchy tiers. The hotpath is the first lookup target for every query; misses spill to WARM/COLD storage. HOT membership and salience are checkpointed to the `hotpath_index` IndexedDB store by Daydreamer each maintenance cycle, allowing the RAM index to be restored after a page reload or machine reboot without full corpus replay.

**Williams Bound**: The theoretical result S = O(√(t log t)) from Williams 2025, applied here as a universal sublinear growth law for all space-time tradeoff subsystems in CORTEX.
**Williams Bound**: The theoretical result S = O(√(t log t)) from Williams 2025, applied here as a universal sublinear growth law for all space-time tradeoff subsystems in CORTEX. The bound is the constructive answer to the curse of dimensionality: in `n`-dimensional space the unit-ball volume collapses as `πᵐ/m!` (n = 2m), making linear-scale data structures infeasible. The Williams sublinear bound keeps every budget — hotpath capacity, hierarchy fanout, neighbor degree, maintenance batch size — proportional to √(t log t) rather than t, ensuring on-device viability at any corpus scale.

**Graph mass (t)**: t = |V| + |E| = total pages plus all edges (Hebbian + semantic neighbor). The canonical input to all capacity and bound formulas.

2 changes: 1 addition & 1 deletion README.md
@@ -94,7 +94,7 @@ This is the "dreaming" phase that prevents catastrophic forgetting and forces ab
## Core Design Principles

- **Biological Scarcity** — Only a fixed number of active prototypes live in memory. Everything else is gracefully demoted to disk.
- **Sublinear Growth (Williams Bound)** — The resident hotpath index is bounded to H(t) = ⌈c·√(t·log₂(1+t))⌉ where t = total graph mass (pages + edges). Memory scales sublinearly as the graph grows, trading time for space at a mathematically principled rate. See [`DESIGN.md`](DESIGN.md) for the full theorem mapping.
- **Sublinear Growth (Williams Bound)** — In `n`-dimensional embedding space the unit-ball volume collapses as `πᵐ/m!` (n = 2m). This geometric fact — the curse of dimensionality — makes linear-scale data structures infeasible as corpora grow. CORTEX counters it with the Williams 2025 result S = O(√(t log t)), used as a universal sublinear growth law: the resident hotpath index is bounded to H(t) = ⌈c·√(t·log₂(1+t))⌉, with the same formula driving hierarchy fanout limits, semantic-neighbor degree caps, and Daydreamer maintenance batch sizes. Every space-or-time budget scales sublinearly, keeping the engine on-device at any corpus size. See [`DESIGN.md`](DESIGN.md) for the full theorem mapping.
- **Three-Zone Memory** — HOT (resident in-memory index, capacity H(t)), WARM (indexed in IndexedDB, reachable via nearest-neighbor search), COLD (metadata in IndexedDB + raw vectors in OPFS, but semantically isolated from the search path — no strong nearest neighbors in vector space at insertion time; only discoverable by a deliberate random walk). All data is retained locally forever; zones control lookup cost and discoverability, not data lifetime.
- **Hierarchical & Sparse** — Progressive dimensionality reduction + medoid clustering keeps memory efficient at any scale, with Williams-derived fanout bounds preventing any single tier from monopolising the index.
- **Hebbian & Dynamic** — Connections strengthen and weaken naturally. Node salience (σ = α·H_in + β·R + γ·Q) drives promotion into and eviction from the resident hotpath.
109 changes: 109 additions & 0 deletions core/HotpathPolicy.ts
@@ -194,3 +194,112 @@ export function deriveCommunityQuotas(
  for (let i = 0; i < n; i++) quotas[i] += floors[i];
  return quotas;
}

// ---------------------------------------------------------------------------
// Semantic neighbor degree limit — Williams-bound derived
// ---------------------------------------------------------------------------

// Bootstrap floor for Williams-bound log formulas: ensures t_eff ≥ 2 so that
// log₂(t_eff) > 0 and log₂(log₂(1+t_eff)) is defined and positive.
const MIN_GRAPH_MASS_FOR_LOGS = 2;

/**
 * Compute the Williams-bound-derived maximum degree for the semantic neighbor
 * graph given a corpus of `graphMass` total pages.
 *
 * The degree limit uses the same H(t) formula as the hotpath capacity but is
 * bounded by a hard cap to keep the graph sparse. At small corpora the
 * Williams formula naturally returns small values (e.g. 1–5 for t < 10);
 * at large corpora the `hardCap` clamps growth to prevent the graph becoming
 * too dense.
 *
 * @param graphMass Total number of pages in the corpus.
 * @param c Williams Bound scaling constant (default from policy).
 * @param hardCap Maximum degree regardless of formula result. Default: 32.
 */
export function computeNeighborMaxDegree(
  graphMass: number,
  c: number = DEFAULT_HOTPATH_POLICY.c,
  hardCap = 32,
): number {
  const derived = computeCapacity(graphMass, c);
  return Math.min(hardCap, Math.max(1, derived));
}

// ---------------------------------------------------------------------------
// Dynamic subgraph expansion bounds — Williams-bound derived
// ---------------------------------------------------------------------------

export interface SubgraphBounds {
  /** Maximum number of nodes to include in the induced subgraph. */
  maxSubgraphSize: number;
  /** Maximum BFS hops from seed nodes. */
  maxHops: number;
  /** Maximum fanout per hop (branching factor). */
  perHopBranching: number;
}

/**
 * Compute dynamic Williams-derived bounds for subgraph expansion (step 9 of
 * the Cortex query path).
 *
 * Formulas from DESIGN.md "Dynamic Subgraph Expansion Bounds":
 *
 *   t_eff = max(t, 2)
 *   maxSubgraphSize = min(30, ⌊√(t_eff · log₂(1+t_eff)) / log₂(t_eff)⌋)
 *   maxHops = max(1, ⌈log₂(log₂(1 + t_eff))⌉)
 *   perHopBranching = max(1, ⌊maxSubgraphSize ^ (1/maxHops)⌋)
 *
 * The bootstrap floor `t_eff = max(t, 2)` eliminates division-by-zero for
 * t ≤ 1 and ensures a safe minimum of `maxSubgraphSize=1, maxHops=1`.
 *
 * @param graphMass Total number of pages in the corpus.
 */
export function computeSubgraphBounds(graphMass: number): SubgraphBounds {
  const tEff = Math.max(graphMass, MIN_GRAPH_MASS_FOR_LOGS);
  const log2tEff = Math.log2(tEff);

  const maxSubgraphSize = Math.min(
    30,
    Math.floor(Math.sqrt(tEff * Math.log2(1 + tEff)) / log2tEff),
  );

  const maxHops = Math.max(1, Math.ceil(Math.log2(Math.log2(1 + tEff))));

  const perHopBranching = Math.max(
    1,
    Math.floor(Math.pow(maxSubgraphSize, 1 / maxHops)),
  );

  return {
    maxSubgraphSize: Math.max(1, maxSubgraphSize),
    maxHops,
    perHopBranching,
  };
}

// ---------------------------------------------------------------------------
// Williams-derived hierarchy fanout limit
// ---------------------------------------------------------------------------

/**
 * Compute the Williams-derived fanout limit for a hierarchy node that
 * currently has `childCount` children.
 *
 * Per DESIGN.md "Sublinear Fanout Bounds":
 *   Max children = O(√(childCount · log childCount))
 *
 * The formula is evaluated with a bootstrap floor of t_eff = max(t, 2) to
 * avoid log(0) and returns at least 1 child.
 *
 * @param childCount Current number of children for the parent node.
 * @param c Williams Bound scaling constant.
 */
export function computeFanoutLimit(
  childCount: number,
  c: number = DEFAULT_HOTPATH_POLICY.c,
): number {
  const tEff = Math.max(childCount, MIN_GRAPH_MASS_FOR_LOGS);
  const raw = c * Math.sqrt(tEff * Math.log2(1 + tEff));
  return Math.max(1, Math.ceil(raw));
}
25 changes: 20 additions & 5 deletions cortex/Query.ts
@@ -2,6 +2,7 @@ import type { ModelProfile } from "../core/ModelProfile";
import type { Hash, MetadataStore, Page, VectorStore } from "../core/types";
import type { EmbeddingRunner } from "../embeddings/EmbeddingRunner";
import { runPromotionSweep } from "../core/SalienceEngine";
import { computeSubgraphBounds } from "../core/HotpathPolicy";
import type { QueryResult } from "./QueryResult";
import { rankPages, spillToWarm } from "./Ranking";
import { buildMetroid } from "./MetroidBuilder";
@@ -14,9 +15,13 @@ export interface QueryOptions {
vectorStore: VectorStore;
metadataStore: MetadataStore;
topK?: number;
/** BFS depth for semantic neighbor subgraph expansion. 2 hops covers direct
* neighbors and their neighbors, which is the minimum needed to surface
* bridge nodes without exploding the graph size. */
/**
* Maximum BFS depth for semantic neighbor subgraph expansion.
*
* When omitted, a dynamic Williams-derived value is computed from the
* corpus size via `computeSubgraphBounds(t)`. Providing an explicit value
* overrides the dynamic bound (useful for tests and controlled experiments).
*/
maxHops?: number;
}

@@ -30,7 +35,6 @@ export async function query(
vectorStore,
metadataStore,
topK = 10,
maxHops = 2,
} = options;
const nowIso = new Date().toISOString();

@@ -116,8 +120,19 @@ export async function query(
);

// --- Subgraph expansion ---
// Use dynamic Williams-derived bounds unless the caller has pinned an
// explicit maxHops value. Only load all pages when we actually need to
// compute bounds — skip the full-page scan on the hot path when maxHops is
// already known.
const topPageIds = topPages.map((p) => p.pageId);
const subgraph = await metadataStore.getInducedNeighborSubgraph(topPageIds, maxHops);
let effectiveMaxHops: number;
if (options.maxHops !== undefined) {
effectiveMaxHops = options.maxHops;
} else {
const allPages = await metadataStore.getAllPages();
effectiveMaxHops = computeSubgraphBounds(allPages.length).maxHops;
}
const subgraph = await metadataStore.getInducedNeighborSubgraph(topPageIds, effectiveMaxHops);

// --- TSP coherence path ---
const coherencePath = solveOpenTSP(subgraph);
54 changes: 24 additions & 30 deletions daydreamer/ClusterStability.ts
@@ -1,22 +1,36 @@
// ---------------------------------------------------------------------------
// ClusterStability — Community detection via label propagation (P2-F)
// ClusterStability — Community detection via label propagation (P2-F) and
// volume split/merge for balanced cluster maintenance (P2-F3)
// ---------------------------------------------------------------------------
//
// Assigns community labels to pages by running lightweight label propagation
// on the semantic (Metroid) neighbor graph. Labels are stored in
// on the semantic neighbor graph. Labels are stored in
// PageActivity.communityId and propagate into SalienceEngine community quotas.
//
// Label propagation terminates when assignments stabilise (no label changes)
// or a maximum iteration limit is reached.
//
// The Daydreamer background worker also calls ClusterStability periodically to
// detect and fix unstable volumes:
// - HIGH-VARIANCE volumes are split into two balanced sub-volumes.
// - LOW-COUNT volumes are merged into the nearest neighbour volume.
// - Community labels are updated after structural changes.
// ---------------------------------------------------------------------------

import type { Hash, MetadataStore, PageActivity } from "../core/types";
import { hashText } from "../core/crypto/hash";
import type {
  Book,
  Hash,
  MetadataStore,
  PageActivity,
  Volume,
} from "../core/types";

// ---------------------------------------------------------------------------
// Options
// Label propagation options
// ---------------------------------------------------------------------------

export interface ClusterStabilityOptions {
export interface LabelPropagationOptions {
metadataStore: MetadataStore;
/** Maximum number of label propagation iterations. Default: 20. */
maxIterations?: number;
@@ -55,7 +69,7 @@ async function propagationPass(
const sorted = [...pageIds].sort();

for (const pageId of sorted) {
const neighbors = await metadataStore.getMetroidNeighbors(pageId);
const neighbors = await metadataStore.getSemanticNeighbors(pageId);
if (neighbors.length === 0) continue;

// Count neighbor labels
@@ -103,7 +117,7 @@ async function propagationPass(
* `MetadataStore.putPageActivity`.
*/
export async function runLabelPropagation(
options: ClusterStabilityOptions,
options: LabelPropagationOptions,
): Promise<LabelPropagationResult> {
const {
metadataStore,
@@ -200,32 +214,12 @@ export function detectEmptyCommunities(
}
}
return empty;
// ClusterStability — Volume split/merge for balanced cluster maintenance
}

// ---------------------------------------------------------------------------
//
// The Daydreamer background worker calls ClusterStability periodically to
// detect and fix unstable volumes:
//
// - HIGH-VARIANCE volumes are split into two balanced sub-volumes using
// K-means with K=2 (one pass).
// - LOW-COUNT volumes are merged into the nearest neighbour volume
// (by medoid distance).
// - Community labels on PageActivity records are updated after structural
// changes so downstream salience computation stays coherent.
//
// All operations are idempotent: re-running on a stable set of volumes is a
// no-op.
// ClusterStability class — Volume split/merge configuration
// ---------------------------------------------------------------------------

import { hashText } from "../core/crypto/hash";
import type {
Book,
Hash,
MetadataStore,
PageActivity,
Volume,
} from "../core/types";

// ---------------------------------------------------------------------------
// Configuration
// ---------------------------------------------------------------------------