Local-first research prototype for Evolutionary Attention Memory (EAM). All experiments run on synthetic data, and the app is designed for thesis-reviewer walkthroughs.
| Component | Description |
|---|---|
| Frozen NN | Nearest-centroid classifier (T=0.8), 5 fixed output classes |
| Evolutionary memory | 30 Gaussian-activation basins, evolved offline |
| Trust mechanism | Legacy C2 trust plus a comparative regional trust mode |
| Hybrid prediction | p = τ²·p_nn + (1−τ²)·p_evo, normalised per class |
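A minimal sketch of the blend, assuming per-class trust values τ in [0, 1] (the function and variable names below are illustrative, not the repo's API):

```python
import numpy as np

def hybrid_predict(p_nn: np.ndarray, p_evo: np.ndarray, tau: np.ndarray) -> np.ndarray:
    """Blend frozen-NN and evolutionary-memory probabilities per class.

    p_nn, p_evo: probability vectors over classes; tau: per-class trust.
    """
    p = tau**2 * p_nn + (1.0 - tau**2) * p_evo  # p = τ²·p_nn + (1−τ²)·p_evo
    return p / p.sum()                          # renormalise to a valid distribution

# Example: a newly emerged class 5 with zero trust in the frozen NN
tau   = np.array([0.5, 0.5, 0.5, 0.5, 0.5, 0.0])
p_nn  = np.array([0.2, 0.2, 0.2, 0.2, 0.2, 0.0])     # frozen NN gives class 5 zero mass
p_evo = np.array([0.05, 0.05, 0.05, 0.05, 0.05, 0.75])
print(hybrid_predict(p_nn, p_evo, tau))              # class 5 mass comes entirely from evo
```

With τ = 0 on the new class, the frozen NN contributes nothing there, which is exactly the Class Emergence behaviour described below.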
Three scenarios demonstrate the mechanism under domain shift:
- Class Emergence — a 6th class appears at week 8; the frozen NN assigns it zero probability
- Class Imbalance — class 0 becomes 10× more frequent at week 5
- Distributional Drift — all class centroids migrate 0.15 units/week for 20 weeks
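As a concrete illustration, the Distributional Drift generator can be sketched from the parameters in this README (latent dimension 8, 5 classes, 200 samples/week, 0.15 units/week, seed 42); the per-class drift directions and the 0.3 cluster-noise scale are assumptions, and simulation.py may differ:

```python
import numpy as np

rng = np.random.default_rng(42)
DIM, N_CLASSES, PER_WEEK, DRIFT = 8, 5, 200, 0.15

centroids  = rng.standard_normal((N_CLASSES, DIM))   # base class centres
directions = rng.standard_normal((N_CLASSES, DIM))   # fixed drift direction per class
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

def sample_drift_week(week: int) -> tuple[np.ndarray, np.ndarray]:
    """One week of Distributional Drift: each centroid has migrated
    DRIFT * week units along its own unit direction."""
    shifted = centroids + DRIFT * week * directions
    labels = rng.integers(0, N_CLASSES, PER_WEEK)
    x = shifted[labels] + 0.3 * rng.standard_normal((PER_WEEK, DIM))  # noise scale assumed
    return x, labels
```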
```
pip install streamlit numpy plotly
```

Tested with Python 3.10+, NumPy 1.26, Streamlit 1.32, Plotly 5.20.
```
streamlit run app.py
```

The app opens in your browser at http://localhost:8501.
No internet connection required after install.
```
python verify.py
```

Expected output (seed 42, C2 variant):

```
[Class Emergence]       hybrid ~86.1%   NN ~88.8%   evo ~78.8%
[Class Imbalance]       hybrid ~99.5%   NN ~99.7%   evo ~95.1%
[Distributional Drift]  hybrid ~96.9%   NN ~95.8%   evo ~93.5%
```
All results should fall within ±3 pp of these targets (see paper Table 1).
```
python compare_versions.py
```

This runs a 5-seed comparison across three variants:

- `legacy_paper` — original optimistic paper-style benchmark
- `online_c2` — fairer online split using the original C2 trust
- `online_compare` — fairer online split with comparative trust and 3-week replay evolution
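In outline, the comparison runs each variant over five seeds per scenario and reports mean ± std; `run_variant` below is a hypothetical stand-in for the repo's actual entry points, and the seed list is assumed:

```python
import numpy as np

SEEDS = [42, 43, 44, 45, 46]  # assumed: 5 seeds starting from the paper's seed 42
VARIANTS = ["legacy_paper", "online_c2", "online_compare"]

def summarise(run_variant, scenario: str) -> None:
    """Print mean ± std hybrid accuracy per variant for one scenario."""
    for variant in VARIANTS:
        accs = np.array([run_variant(variant, scenario, seed) for seed in SEEDS])
        print(f"{scenario:<22} {variant:<15} "
              f"{100 * accs.mean():.1f}% ± {100 * accs.std():.1f}")

# Smoke test with a stand-in runner that returns a fake accuracy
rng = np.random.default_rng(0)
summarise(lambda variant, scenario, seed: 0.9 + 0.02 * rng.standard_normal(),
          "class_emergence")
```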
Current 5-seed summary:
| Scenario | online_c2 hybrid | online_compare hybrid | Change |
|---|---|---|---|
| Class Emergence | 86.6% ± 2.4 | 86.6% ± 0.6 | +0.1 pp |
| Class Imbalance | 99.6% ± 0.1 | 99.6% ± 0.1 | -0.1 pp |
| Distributional Drift | 96.4% ± 0.5 | 96.6% ± 0.4 | +0.1 pp |
| Overall mean | 94.2% | 94.3% | +0.1 pp |
Additional stability gain:

- Drift evo memory improved from 80.0% ± 17.2 to 87.5% ± 8.8 under the fair online benchmark.
- Select Class Emergence in the sidebar.
- Click Feed Sample 7 times to build baseline trust.
  Observe: both predictions agree, and the trust bars start at ~0.5 and rise.
- Click Activate Shift — Class 5 is injected.
  Observe: Class 5 trust initialises at 0.0, and the blend weight shifts to evo for Class 5.
- Feed a few more weeks.
  Observe: the hybrid diverges from the frozen NN on Class 5 samples.
- Click Simulate Evolution.
  Observe: evo accuracy rises as basins cover Class 5.
- Inspect the Audit Panel to see probability vectors, trust, and the top-3 contributing basins.
- Repeat with Distributional Drift to see trust decay as the NN centroids go stale.
| File | Purpose |
|---|---|
| app.py | Streamlit UI |
| simulation.py | Synthetic data generation and frozen NN |
| basins.py | Basin representation, scoring, evolution |
| trust.py | Trust initialisation, EMA update, prediction blending |
| audit.py | Audit-record builder |
| charts.py | Plotly figure builders |
| scenarios.py | Scenario definitions |
| verify.py | Headless accuracy verification script |
| compare_versions.py | Multi-seed baseline vs improved comparison |
| requirements.txt | Python dependencies |
| Parameter | Value |
|---|---|
| Latent dimension | 8 |
| Base classes | 5 (labels 0–4) |
| Samples / week | 200 |
| Simulation duration | 20 weeks |
| Basins | 6 / class = 30 total |
| Evolution generations | 6 |
| Elite preservation | top 30% per class |
| Weight decay | 0.4% / generation |
| Temperature T | 0.8 |
| Trust λ | 0.5 (symmetric EMA) |
| Trust initial (known) | 0.5 |
| Trust initial (new) | 0.0 |
| Rolling window | 3 weeks |
| Random seed | 42 |
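Read together, the trust rows above describe a simple per-class EMA. A minimal sketch follows; the λ and initial values come straight from the table, but treating "NN correct on this class" as the EMA target is an assumption about trust.py:

```python
LAMBDA = 0.5  # symmetric EMA weight (same λ for upward and downward moves)

def init_trust(known_class: bool) -> float:
    # Known classes start at 0.5; a newly emerged class starts at 0.0.
    return 0.5 if known_class else 0.0

def update_trust(tau: float, nn_correct: bool) -> float:
    """Move trust toward 1 when the frozen NN is right on this class,
    toward 0 when it is wrong (assumed target; see trust.py)."""
    target = 1.0 if nn_correct else 0.0
    return (1.0 - LAMBDA) * tau + LAMBDA * target
```

With λ = 0.5, trust halves its distance to the target on every update.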
- All results are on synthetic Gaussian clusters. Real-data validation is future work.
- Trust region assignment uses class label; more formal analysis is listed as future work in the paper.
- Evolution runs in pure Python (O(n) basin scan); KD-tree indexing would reduce latency (see the sketch after this list).
- The original paper-aligned verification is still single-seed and optimistic by design.
- The new multi-seed benchmark is fairer, but still uses synthetic data rather than real deployment traces.
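For the KD-tree point above, SciPy's `cKDTree` (a real API) would replace the linear scan with a logarithmic query; the basin-centre array here is illustrative:

```python
import numpy as np
from scipy.spatial import cKDTree

centres = np.random.default_rng(42).standard_normal((30, 8))  # 30 basin centres, dim 8
tree = cKDTree(centres)

def top_basins(x: np.ndarray, k: int = 3):
    """Return the k nearest basins to latent point x instead of scanning all 30."""
    dist, idx = tree.query(x, k=k)
    return idx, dist
```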
The repository now includes a separate real_eval/ package for first-pass
LongBench-style testing on a real language-model path.
What it adds:
- focused LongBench subset loader for QA, summarization, and code tasks
- conservative MemoirAI prompt-side prefill compression proxy
- optional TurboQuant backend detection hook
- pluggable model adapters (`dummy` for smoke tests, `transformers` for real runs)
- JSON report generation with score, latency, compression, and KV-cache proxies
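For orientation, the adapter seam can be pictured as a small protocol; the class and method names below are hypothetical, not real_eval's actual interface:

```python
from typing import Protocol

class ModelAdapter(Protocol):
    """Anything with a generate() method can back the runner."""
    def generate(self, prompt: str, max_new_tokens: int) -> str: ...

class DummyAdapter:
    """Smoke-test backend: returns a canned answer without loading a model."""
    def generate(self, prompt: str, max_new_tokens: int) -> str:
        return "dummy answer"
```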
Quick smoke test:
```
python -m real_eval.runner --backend dummy --dataset-source local
```

Transformers-backed run after installing dependencies:
```
python -m real_eval.runner \
  --backend transformers \
  --dataset-source hf \
  --dataset-name THUDM/LongBench \
  --model-name Qwen/Qwen2.5-0.5B-Instruct \
  --max-examples-per-task 5
```

Notes:
- The current MemoirAI real-data path is a conservative prompt-compression proxy. It does not yet patch the model's internal KV cache.
- If a TurboQuant backend is installed, the harness will surface capability flags and pass backend kwargs through the adapter. Otherwise it still reports stacked-mode KV-memory proxies so the evaluation pipeline remains runnable.
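The detection side of that hook can be as small as probing for the optional module; this is a sketch, and the module name `turboquant` is an assumption:

```python
import importlib.util

def turboquant_available() -> bool:
    """Surface a capability flag without importing the backend eagerly."""
    return importlib.util.find_spec("turboquant") is not None

# Passed through to the adapter as backend kwargs (illustrative key name)
backend_kwargs = {"use_turboquant": turboquant_available()}
```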