nela-local · Amoghk04 · May 31, 2026 · May 18, 2026 · May 19, 2026 · May 19, 2026
diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md
@@ -17,13 +17,35 @@ The customization files under `.github/` are part of the repository contract.
 - Text chat supports both document-grounding paths: KB-ingested RAG retrieval and direct file-to-prompt attachments, controlled by a RAG on/off toggle (default off = direct prompting).
 - Runtime model parameters panel is hidden by default and opened explicitly by the user.
 - Disk-scanned model sync preserves user-applied runtime params (for example `ctx_size`, `max_tokens`, `flash_attn`) instead of resetting them during model-list refreshes.
-- `benchmark/` contains runtime benchmark scripts and plotting tools.
+- `benchmark/` contains runtime benchmark scripts, plotting tools, and the RAG retrieval quality benchmark CLI.
+- `benchmark/prepare_squad.py` downloads SQuAD 1.1 and produces a corpus and QA-pairs file for `rag-bench`. Each QA pair includes an `answers` array (all acceptable gold answers) for E2E evaluation.
+- `scripts/` contains end-to-end benchmark orchestration:
+  - `download_datasets.py` — download SQuAD, TriviaQA RC, BEIR subsets, and BGE embed models.
+    - Flags: `--squad-only`, `--trivia-only`, `--beir-only`, `--models-only`
+    - TriviaQA RC (Wikipedia domain) written to `benchmark/trivia_qa/` (same format as SQuAD: `corpus/*.txt` + `qa_pairs.json`)
+    - NQ is available as `BeIR/nq` but its 2.68M-doc corpus is impractical; excluded from default BEIR loop.
+  - `baseline_llamaindex.py` — LlamaIndex + BGE + llama-server baseline (EM + F1).
+  - `baseline_chromadb.py` — ChromaDB + sentence-transformers + llama-server baseline.
+  - `generate_paper_assets.py` — read all result JSONs, emit LaTeX tables + matplotlib PDFs.
+  - `run_all_benchmarks.sh` — full end-to-end runner (ingest → bench × 2 datasets → BEIR → ablations → baselines → assets). Steps [0/10]–[assets]. TriviaQA steps are soft-skipped if `benchmark/trivia_qa/` is absent.
+  - `requirements_benchmark.txt` — pinned Python deps for the above scripts.
 - `The-Bare/` contains standalone experiments/prototypes.
+- `genhat-desktop/src-tauri/src/bin/rag_bench.rs` is a standalone Rust CLI (`rag-bench`) that benchmarks the NELA RAG pipeline (recall@k, latency breakdown, IVF memory stats, RAPTOR ablation, E2E answer quality, scale degradation) without the Tauri runtime. Subcommands: `ingest`, `bench`, `run`, `scale`, `eval`, `beir-bench`, `ablate-chunking`, `ablate-rrf-k`, `ablate-quant`.
+  - `ingest --raptor [--llm-model <gguf>]` — ingest corpus; optionally build RAPTOR tree.
+  - `bench --e2e-count 500 --bootstrap-samples 1000 [--no-rag-baseline]` — retrieval + E2E with bootstrap 95% CIs.
+  - `eval --count 500 --bootstrap-samples 1000` — standalone E2E eval outputting `{ "raw": ..., "ci": ... }`.
+  - `beir-bench --beir-dir <dir>` — BEIR NDCG@10/MAP/Recall@100/MRR across bm25/vector/hybrid configs.
+  - `ablate-chunking --chunk-sizes 512,1024,1536,2048 --overlaps 64,128,256` — grid chunking ablation.
+  - `ablate-rrf-k --rrf-k-values 10,30,60,100,200` — RRF fusion constant sweep.
+  - `ablate-quant --embed-models <path1,path2>` — per-model quantisation ablation.
+  - `scale --sizes 100,500,1000,2000 [--qa-sample 500]` — scale degradation (recall/latency vs corpus size).
 
 ## Validation Commands
 
 - Frontend build and lint: `cd genhat-desktop && npm run lint && npm run build`
 - Rust compile check: `cd genhat-desktop/src-tauri && cargo check`
+- RAG benchmark binary check: `cd genhat-desktop/src-tauri && cargo check --bin rag-bench`
+- Build RAG benchmark binary: `cd genhat-desktop/src-tauri && cargo build --release --bin rag-bench`
 - Desktop dev run: `cd genhat-desktop && npx tauri dev`
 
 ## Change Hygiene

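Note on `ablate-rrf-k`: that sweep varies the constant `k` in reciprocal rank fusion. The sketch below shows the standard RRF formula, score(d) = sum over rankings of 1 / (k + rank), purely as an illustration. `rrf_fuse_sketch` and the sample chunk IDs are hypothetical; this is not taken from the Rust `rrf_fuse` implementation, which may differ in rank indexing or tie handling.

```python
from collections import defaultdict


def rrf_fuse_sketch(rankings, k=60):
    """Fuse ranked lists of chunk IDs with reciprocal rank fusion.

    Each list is ordered best-first and ranks are 1-based. Hypothetical
    helper, not the actual `rrf_fuse` from the Rust codebase.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


bm25_hits = ["c3", "c1", "c7"]    # made-up BM25 ranking
vector_hits = ["c1", "c9", "c3"]  # made-up vector ranking
fused = rrf_fuse_sketch([bm25_hits, vector_hits], k=60)
print(fused)
```

A small `k` lets top-ranked hits dominate the fused score, while a large `k` flattens the contribution of each rank, which is presumably why the sweep covers values from 10 to 200.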
diff --git a/.gitignore b/.gitignore
@@ -24,6 +24,18 @@ The-Bare/TTS-inference/--hidden-import
 
 benchmark/results/
 benchmark/__pycache__/
+benchmark/.squad_cache/
+benchmark/squad_bench/
+benchmark/squad_bench_answers/
+benchmark/squad_bench_large/
+benchmark/trivia_qa/
+benchmark/beir/
+
+# Ingest workspaces (large SQLite + index files)
+workspace/
+
+# Script results
+results.zip
 RESEARCH_PAPER_PLAN.md
 
 # Added by Spec Kitty CLI (auto-managed)

diff --git a/IEEE-conference-template-062824.log b/IEEE-conference-template-062824.log
@@ -0,0 +1,35 @@
+This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023/Debian) (preloaded format=pdflatex 2026.5.17)  20 MAY 2026 01:43
+entering extended mode
+ restricted \write18 enabled.
+ %&-line parsing enabled.
+**/home/amogh/Documents/8thsem/major-project/nela/conference_latex_template/IEE
+E-conference-template-062824.tex
+
+(/home/amogh/Documents/8thsem/major-project/nela/conference_latex_template/IEEE
+-conference-template-062824.tex
+LaTeX2e <2023-11-01> patch level 1
+L3 programming layer <2024-01-22>
+
+! LaTeX Error: File `IEEEtran.cls' not found.
+
+Type X to quit or <RETURN> to proceed,
+or enter new name. (Default extension: cls)
+
+Enter file name: 
+! Emergency stop.
+<read *> 
+
+l.2 \IEEEoverridecommandlockouts
+                                ^^M
+*** (cannot \read from terminal in nonstop modes)
+
+
+Here is how much of TeX's memory you used:
+ 23 strings out of 476106
+ 813 string characters out of 5793933
+ 1922975 words of memory out of 5000000
+ 22132 multiletter control sequences out of 15000+600000
+ 558069 words of font info for 36 fonts, out of 8000000 for 9000
+ 59 hyphenation exceptions out of 8191
+ 19i,0n,29p,177b,17s stack positions out of 10000i,1000n,20000p,200000b,200000s
+!  ==> Fatal error occurred, no output PDF file produced!
diff --git a/benchmark/BENCHMARK_COMMANDS.txt b/benchmark/BENCHMARK_COMMANDS.txt
@@ -0,0 +1,105 @@
+# NELA RAG Benchmark — Command Reference
+# All commands run from: genhat-desktop/src-tauri/
+# Binary: ./target/release/rag-bench
+# Build:  cargo build --release --bin rag-bench
+
+# ─────────────────────────────────────────────
+# 0. Build the binary
+# ─────────────────────────────────────────────
+cd genhat-desktop/src-tauri
+cargo build --release --bin rag-bench
+
+
+# ─────────────────────────────────────────────
+# 1. Ingest corpus into a fresh workspace
+# ─────────────────────────────────────────────
+./target/release/rag-bench ingest \
+  --workspace-dir /tmp/nela-bench \
+  --corpus-dir ../../benchmark/squad_bench_large/docs \
+  --embed-model ../../models/embedding/bge-base-en-v1.5-q8_0/bge-base-en-v1.5-q8_0.gguf
+
+
+# ─────────────────────────────────────────────
+# 2. Recall@k + MRR benchmark (no LLM needed)
+# ─────────────────────────────────────────────
+./target/release/rag-bench bench \
+  --workspace-dir /tmp/nela-bench \
+  --qa-file ../../benchmark/squad_bench_large/qa_pairs.json \
+  --embed-model ../../models/embedding/bge-base-en-v1.5-q8_0/bge-base-en-v1.5-q8_0.gguf
+
+
+# ─────────────────────────────────────────────
+# 3. E2E answer quality + no-RAG baseline
+#    Prints EM%, F1%, RAG gain (pp) in the summary
+# ─────────────────────────────────────────────
+./target/release/rag-bench bench \
+  --workspace-dir /tmp/nela-bench \
+  --qa-file ../../benchmark/squad_bench_large/qa_pairs.json \
+  --embed-model ../../models/embedding/bge-base-en-v1.5-q8_0/bge-base-en-v1.5-q8_0.gguf \
+  --llm-model ../../models/LLM/unsloth/Qwen3.5-0.8B-GGUF/Qwen3.5-0.8B-UD-Q4_K_XL.gguf \
+  --e2e-count 50 \
+  --no-rag-baseline
+
+
+# ─────────────────────────────────────────────
+# 4. RAPTOR ablation
+#    Requires RAPTOR trees already in the workspace.
+#    Run this only after ingesting (step 1) with the --raptor flag.
+# ─────────────────────────────────────────────
+./target/release/rag-bench bench \
+  --workspace-dir /tmp/nela-bench \
+  --qa-file ../../benchmark/squad_bench_large/qa_pairs.json \
+  --embed-model ../../models/embedding/bge-base-en-v1.5-q8_0/bge-base-en-v1.5-q8_0.gguf \
+  --raptor
+
+
+# ─────────────────────────────────────────────
+# 5. Scale degradation benchmark
+#    Uses a CLEAN workspace (/tmp/nela-scale).
+#    Each checkpoint gets its own isolated subdirectory.
+#    Previous results in /tmp/nela-bench do NOT affect this.
+# ─────────────────────────────────────────────
+./target/release/rag-bench scale \
+  --workspace-dir /tmp/nela-scale \
+  --corpus-dir ../../benchmark/squad_bench_large/docs \
+  --qa-file ../../benchmark/squad_bench_large/qa_pairs.json \
+  --embed-model ../../models/embedding/bge-base-en-v1.5-q8_0/bge-base-en-v1.5-q8_0.gguf \
+  --sizes 100,500,1000,2000 \
+  --qa-sample 200
+
+
+# ─────────────────────────────────────────────
+# 6. Standalone E2E eval (no recall benchmarking)
+# ─────────────────────────────────────────────
+./target/release/rag-bench eval \
+  --workspace-dir /tmp/nela-bench \
+  --qa-file ../../benchmark/squad_bench_large/qa_pairs.json \
+  --embed-model ../../models/embedding/bge-base-en-v1.5-q8_0/bge-base-en-v1.5-q8_0.gguf \
+  --llm-model ../../models/LLM/unsloth/Qwen3.5-0.8B-GGUF/Qwen3.5-0.8B-UD-Q4_K_XL.gguf \
+  --count 50
+
+
+# ─────────────────────────────────────────────
+# 7. Embedding model comparison (bge-small)
+#    Use a separate workspace so indexes don't collide.
+# ─────────────────────────────────────────────
+./target/release/rag-bench ingest \
+  --workspace-dir /tmp/nela-bench-small \
+  --corpus-dir ../../benchmark/squad_bench_large/docs \
+  --embed-model ../../models/embedding/bge-small-en-v1.5-q8_0/bge-small-en-v1.5-q8_0.gguf
+
+./target/release/rag-bench bench \
+  --workspace-dir /tmp/nela-bench-small \
+  --qa-file ../../benchmark/squad_bench_large/qa_pairs.json \
+  --embed-model ../../models/embedding/bge-small-en-v1.5-q8_0/bge-small-en-v1.5-q8_0.gguf
+
+
+# ─────────────────────────────────────────────
+# Notes
+# ─────────────────────────────────────────────
+# - llama-server binary is auto-detected from bin/llama-lin/llama-server
+#   relative to the workspace. Override with --llama-server <path>.
+# - Default embed port: 12345. Default LLM port: 12346.
+# - bench_results.json is written to the CWD unless --output is specified.
+# - Scale benchmark: do NOT reuse an existing /tmp/nela-bench workspace;
+#   use a dedicated directory (e.g., /tmp/nela-scale) to avoid stale data.
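Note on step 3 of the command reference: it reports EM%, F1%, and RAG gain in percentage points. The sketch below shows how such numbers are conventionally computed for SQuAD-style QA pairs with several gold `answers`. The normalisation rules follow the common SQuAD convention and are an assumption here; `rag-bench` may normalise differently, and all function names and sample data are hypothetical.

```python
import re
import string
from collections import Counter


def normalize(text):
    """SQuAD-style normalisation: lowercase, drop punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, answers):
    """1.0 if the prediction equals any gold answer after normalisation."""
    return float(any(normalize(prediction) == normalize(a) for a in answers))


def token_f1(prediction, answer):
    """Token-level F1 between one prediction and one gold answer."""
    pred, gold = normalize(prediction).split(), normalize(answer).split()
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def best_f1(prediction, answers):
    """Score against every gold answer and keep the best F1."""
    return max(token_f1(prediction, a) for a in answers)


# Made-up samples: (answer with RAG, answer without RAG, gold answers)
samples = [
    ("The Moon", "the sun", ["Moon", "the Moon"]),
    ("in 1969", "1970", ["1969"]),
]
n = len(samples)
em_rag = 100 * sum(exact_match(r, g) for r, _, g in samples) / n
em_base = 100 * sum(exact_match(b, g) for _, b, g in samples) / n
f1_rag = 100 * sum(best_f1(r, g) for r, _, g in samples) / n
gain_pp = em_rag - em_base  # RAG gain in percentage points
print(f"EM {em_rag:.1f}%  F1 {f1_rag:.1f}%  gain {gain_pp:+.1f} pp")
```

Taking the maximum over all gold answers is why each QA pair carries an `answers` array rather than a single string.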
diff --git a/benchmark/README.md b/benchmark/README.md
@@ -1,6 +1,115 @@
-# GenHat Application Benchmark Suite
+# NELA Benchmark Suite
 
-This folder contains an application-only benchmark pipeline (no app code changes required) that captures:
+This folder contains two independent benchmark tools:
+
+1. **`rag-bench`** — Rust CLI that measures RAG pipeline retrieval quality and latency
+   (recall@k, latency breakdown, IVF memory stats, RAPTOR ablation)
+2. **`run_benchmark.py`** — Python application-level benchmark (startup timing, CPU/RAM over time)
+
+---
+
+## rag-bench — RAG Retrieval Quality Benchmarks
+
+`rag-bench` is a standalone Rust binary that directly exercises the NELA RAG components
+(`RagDb`, `BM25Index`, `VectorIndex`, `rrf_fuse`) without the Tauri runtime.
+It is intended for researchers validating retrieval quality claims.
+
+### Scenarios covered
+
+| ID  | Scenario                        | Configs measured                                      |
+|-----|---------------------------------|-------------------------------------------------------|
+| A1  | Recall@k                        | BM25-only, Vector-only, Hybrid, Hybrid+expand         |
+| D1  | Query latency breakdown         | Embed / BM25 / Vector / RRF / Expand per stage        |
+| B1  | Ingestion timing                | Per-file: chunk count, embed time, total time         |
+| C2  | IVF memory efficiency           | Quantized MB vs raw f32 estimate, compression ratio   |
+| A4  | RAPTOR confidence-gate ablation | threshold=-1.5 vs +∞ vs -∞ (requires pre-built trees) |
+
+### Step 1 — Prepare a test corpus (SQuAD)
+
+```bash
+cd benchmark
+python3 prepare_squad.py \
+  --out-dir    squad_bench \
+  --max-contexts 100 \
+  --max-qa     400
+```
+
+This downloads the SQuAD 1.1 dev set (~10 MB), writes 100 `.txt` context files to
+`squad_bench/docs/`, and produces `squad_bench/qa_pairs.json` with 400 QA pairs.
+
+### Step 2 — Run the full benchmark
+
+```bash
+cd genhat-desktop
+
+cargo run --manifest-path src-tauri/Cargo.toml --release --bin rag-bench -- run \
+  --workspace-dir /tmp/nela-bench \
+  --corpus-dir    ../benchmark/squad_bench/docs \
+  --qa-file       ../benchmark/squad_bench/qa_pairs.json \
+  --embed-model   /path/to/bge-base-en-v1.5-q8_0.gguf \
+  --top-k         5,10 \
+  --output        ../benchmark/rag_results.json
+```
+
+`llama-server` is auto-detected from `src-tauri/bin/llama-lin/llama-server`.
+Override with `--llama-server <path>` if needed.
+
+### Step 2 (alternative) — Ingest once, bench many times
+
+```bash
+# Ingest corpus once
+cargo run --manifest-path src-tauri/Cargo.toml --release --bin rag-bench -- ingest \
+  --workspace-dir /tmp/nela-bench \
+  --corpus-dir    ../benchmark/squad_bench/docs \
+  --embed-model   /path/to/bge.gguf
+
+# Run benchmarks against the same workspace (fast, no re-embedding)
+cargo run --manifest-path src-tauri/Cargo.toml --release --bin rag-bench -- bench \
+  --workspace-dir /tmp/nela-bench \
+  --qa-file       ../benchmark/squad_bench/qa_pairs.json \
+  --embed-model   /path/to/bge.gguf \
+  --output        ../benchmark/rag_results.json
+```
+
+### RAPTOR ablation (A4)
+
+RAPTOR trees must be pre-built. One way is `rag-bench ingest --raptor`; another is to ingest
+the corpus via the NELA desktop app (Phase 2 background enrichment auto-builds RAPTOR), then
+point `--workspace-dir` at that existing NELA workspace:
+
+```bash
+cargo run --manifest-path src-tauri/Cargo.toml --release --bin rag-bench -- bench \
+  --workspace-dir ~/.local/share/nela/workspaces/<your-workspace> \
+  --qa-file       ../benchmark/squad_bench/qa_pairs.json \
+  --embed-model   /path/to/bge.gguf \
+  --raptor \
+  --output        ../benchmark/rag_raptor_results.json
+```
+
+### QA pairs format
+
+```json
+[
+  {
+    "question": "What causes the tides?",
+    "relevant_keywords": ["gravitational", "Moon", "tidal"],
+    "doc_title": "optional_partial_doc_title"
+  }
+]
+```
+
+The oracle marks a chunk as relevant if it contains **any** keyword (case-insensitive)
+and, when `doc_title` is set, the chunk's document title also contains that substring.
+
+### Plotting results
+
+```bash
+python3 benchmark/plot_results.py --input benchmark/rag_results.json
+```
+
+---
+
+## run_benchmark.py — Application-Level Benchmark
 
 - Startup timing (cold start)
 - Process-tree resource use over time (CPU, RSS, process count)