Merged
22 commits
b3983eb
NELA: Added Benchmarking
Amoghk04 May 18, 2026
221f579
NELA: Benching changes
Amoghk04 May 19, 2026
a3628a1
fix: surface chat_complete errors and unexpected response JSON
Amoghk04 May 19, 2026
6b909f4
fix: ChatServer parallel=1 for bench (ctx_size/parallel was giving 51…
Amoghk04 May 19, 2026
f09a3ec
fix: swap raptor_trust_all/-inf and raptor_expand_all/+inf thresholds…
Amoghk04 May 19, 2026
14a330f
fix(beir): bump embed ubatch-size 512→2048; add per-doc progress to B…
Amoghk04 May 19, 2026
4aee0f3
fix(beir): skip oversized docs on embed failure; stable BEIR workspac…
Amoghk04 May 19, 2026
5ab1af4
fix(bench): surface llama-server crash reason on embed server startup…
Amoghk04 May 19, 2026
fe0ed8b
fix(bench): drop --log-disable; add GPU diagnostic after embed server…
Amoghk04 May 19, 2026
b0cebe6
fix(bench): warn explicitly when CUDA detected but no layers offloade…
Amoghk04 May 19, 2026
2df4246
fix(bench): accept 'fitting params to device' as GPU-active signal
Amoghk04 May 19, 2026
ef10391
perf(bench): parallel BEIR ingest — embed 8 docs concurrently
Amoghk04 May 19, 2026
55bf28d
perf(bench): defer BM25 commit to end of ingest — fix O(N²) slowdown
Amoghk04 May 19, 2026
855ba20
perf(bench): BM25 periodic checkpoint every 500 docs — crash safety +…
Amoghk04 May 19, 2026
1fca864
perf: parallel embed in ingest_corpus_with_config and cmd_scale
Amoghk04 May 19, 2026
0c8c086
perf: pre-embed queries once in run_recall_bench and ablate-rrf-k
Amoghk04 May 19, 2026
d5e2438
perf: defer BM25 commits to end of ingest in corpus_with_config and s…
Amoghk04 May 19, 2026
34e28a9
feat: add --max-docs to ablate-chunking and ablate-quant for corpus s…
Amoghk04 May 19, 2026
fcd181f
feat: make ablation workspaces persistent and resumable
Amoghk04 May 19, 2026
299b941
feat: add --max-qa to ablation commands to cap QA pairs per grid point
Amoghk04 May 19, 2026
b1cdaf8
NELA: Final benchmark figures for Claude code
Amoghk04 May 20, 2026
78f57f3
NELA: Added web scraping
Amoghk04 May 31, 2026
24 changes: 23 additions & 1 deletion .github/copilot-instructions.md
@@ -17,13 +17,35 @@ The customization files under `.github/` are part of the repository contract.
- Text chat supports both document-grounding paths: KB-ingested RAG retrieval and direct file-to-prompt attachments, controlled by a RAG on/off toggle (default off = direct prompting).
- Runtime model parameters panel is hidden by default and opened explicitly by the user.
- Disk-scanned model sync preserves user-applied runtime params (for example `ctx_size`, `max_tokens`, `flash_attn`) instead of resetting them during model-list refreshes.
- `benchmark/` contains runtime benchmark scripts and plotting tools.
- `benchmark/` contains runtime benchmark scripts, plotting tools, and the RAG retrieval quality benchmark CLI.
- `benchmark/prepare_squad.py` downloads SQuAD 1.1 and produces a corpus and QA-pairs file for `rag-bench`. Each QA pair includes an `answers` array (all acceptable gold answers) for E2E evaluation.
- `scripts/` contains end-to-end benchmark orchestration:
- `download_datasets.py` — download SQuAD, TriviaQA RC, BEIR subsets, and BGE embed models.
- Flags: `--squad-only`, `--trivia-only`, `--beir-only`, `--models-only`
- TriviaQA RC (Wikipedia domain) written to `benchmark/trivia_qa/` (same format as SQuAD: `corpus/*.txt` + `qa_pairs.json`)
- NQ is available as `BeIR/nq` but its 2.68M-doc corpus is impractical; excluded from default BEIR loop.
- `baseline_llamaindex.py` — LlamaIndex + BGE + llama-server baseline (EM + F1).
- `baseline_chromadb.py` — ChromaDB + sentence-transformers + llama-server baseline.
- `generate_paper_assets.py` — read all result JSONs, emit LaTeX tables + matplotlib PDFs.
- `run_all_benchmarks.sh` — full end-to-end runner (ingest → bench × 2 datasets → BEIR → ablations → baselines → assets). Steps [0/10]–[assets]. TriviaQA steps are soft-skipped if `benchmark/trivia_qa/` is absent.
- `requirements_benchmark.txt` — pinned Python deps for the above scripts.
- `The-Bare/` contains standalone experiments/prototypes.
- `genhat-desktop/src-tauri/src/bin/rag_bench.rs` is a standalone Rust CLI (`rag-bench`) that benchmarks the NELA RAG pipeline (recall@k, latency breakdown, IVF memory stats, RAPTOR ablation, E2E answer quality, scale degradation) without the Tauri runtime. Subcommands: `ingest`, `bench`, `run`, `scale`, `eval`, `beir-bench`, `ablate-chunking`, `ablate-rrf-k`, `ablate-quant`.
- `ingest --raptor [--llm-model <gguf>]` — ingest corpus; optionally build RAPTOR tree.
- `bench --e2e-count 500 --bootstrap-samples 1000 [--no-rag-baseline]` — retrieval + E2E with bootstrap 95% CIs.
- `eval --count 500 --bootstrap-samples 1000` — standalone E2E eval outputting `{ "raw": ..., "ci": ... }`.
- `beir-bench --beir-dir <dir>` — BEIR NDCG@10/MAP/Recall@100/MRR across bm25/vector/hybrid configs.
- `ablate-chunking --chunk-sizes 512,1024,1536,2048 --overlaps 64,128,256` — grid chunking ablation.
- `ablate-rrf-k --rrf-k-values 10,30,60,100,200` — RRF fusion constant sweep.
- `ablate-quant --embed-models <path1,path2>` — per-model quantisation ablation.
- `scale --sizes 100,500,1000,2000 [--qa-sample 500]` — scale degradation (recall/latency vs corpus size).
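
The E2E metrics named above (EM, F1, bootstrap 95% CIs, and the `answers` array of acceptable gold answers) can be illustrated with a minimal Python sketch. It assumes SQuAD-style answer normalization and a percentile bootstrap over per-question scores. The function names (`normalize`, `exact_match`, `token_f1`, `bootstrap_ci`) are hypothetical, and the sketch is not taken from `rag_bench.rs`, whose exact scoring rules are not shown in this diff.

```python
import random
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """SQuAD-style normalization (assumed): lowercase, drop punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, answers: list[str]) -> float:
    """1.0 if the normalized prediction equals any normalized gold answer."""
    pred = normalize(prediction)
    return float(any(pred == normalize(a) for a in answers))


def token_f1(prediction: str, answers: list[str]) -> float:
    """Best token-level F1 of the prediction against any gold answer."""
    pred_tokens = normalize(prediction).split()
    best = 0.0
    for answer in answers:
        gold_tokens = normalize(answer).split()
        overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
        if overlap == 0:
            continue  # also guards against empty token lists
        precision = overlap / len(pred_tokens)
        recall = overlap / len(gold_tokens)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


def bootstrap_ci(scores: list[float], samples: int = 1000, seed: int = 0):
    """Percentile bootstrap 95% CI for the mean of per-question scores."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(samples))
    lower = means[int(0.025 * samples)]
    upper = means[min(samples - 1, int(0.975 * samples))]
    return lower, upper
```

Taking the maximum over all entries in `answers` is what makes the multi-answer array matter: a prediction is scored against whichever gold answer it matches best.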

## Validation Commands

- Frontend build and lint: `cd genhat-desktop && npm run lint && npm run build`
- Rust compile check: `cd genhat-desktop/src-tauri && cargo check`
- RAG benchmark binary check: `cd genhat-desktop/src-tauri && cargo check --bin rag-bench`
- Build RAG benchmark binary: `cd genhat-desktop/src-tauri && cargo build --release --bin rag-bench`
- Desktop dev run: `cd genhat-desktop && npx tauri dev`

## Change Hygiene
12 changes: 12 additions & 0 deletions .gitignore
@@ -24,6 +24,18 @@ The-Bare/TTS-inference/--hidden-import

benchmark/results/
benchmark/__pycache__/
benchmark/.squad_cache/
benchmark/squad_bench/
benchmark/squad_bench_answers/
benchmark/squad_bench_large/
benchmark/trivia_qa/
benchmark/beir/

# Ingest workspaces (large SQLite + index files)
workspace/

# Script results
results.zip
RESEARCH_PAPER_PLAN.md

# Added by Spec Kitty CLI (auto-managed)
35 changes: 35 additions & 0 deletions IEEE-conference-template-062824.log
@@ -0,0 +1,35 @@
This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023/Debian) (preloaded format=pdflatex 2026.5.17) 20 MAY 2026 01:43
entering extended mode
restricted \write18 enabled.
%&-line parsing enabled.
**/home/amogh/Documents/8thsem/major-project/nela/conference_latex_template/IEE
E-conference-template-062824.tex

(/home/amogh/Documents/8thsem/major-project/nela/conference_latex_template/IEEE
-conference-template-062824.tex
LaTeX2e <2023-11-01> patch level 1
L3 programming layer <2024-01-22>

! LaTeX Error: File `IEEEtran.cls' not found.

Type X to quit or <RETURN> to proceed,
or enter new name. (Default extension: cls)

Enter file name:
! Emergency stop.
<read *>

l.2 \IEEEoverridecommandlockouts
^^M
*** (cannot \read from terminal in nonstop modes)


Here is how much of TeX's memory you used:
23 strings out of 476106
813 string characters out of 5793933
1922975 words of memory out of 5000000
22132 multiletter control sequences out of 15000+600000
558069 words of font info for 36 fonts, out of 8000000 for 9000
59 hyphenation exceptions out of 8191
19i,0n,29p,177b,17s stack positions out of 10000i,1000n,20000p,200000b,200000s
! ==> Fatal error occurred, no output PDF file produced!
105 changes: 105 additions & 0 deletions benchmark/BENCHMARK_COMMANDS.txt
@@ -0,0 +1,105 @@
# NELA RAG Benchmark — Command Reference
# All commands run from: genhat-desktop/src-tauri/
# Binary: ./target/release/rag-bench
# Build: cargo build --release --bin rag-bench

# ─────────────────────────────────────────────
# 0. Build the binary
# ─────────────────────────────────────────────
cd genhat-desktop/src-tauri
cargo build --release --bin rag-bench


# ─────────────────────────────────────────────
# 1. Ingest corpus into a fresh workspace
# ─────────────────────────────────────────────
./target/release/rag-bench ingest \
--workspace-dir /tmp/nela-bench \
--corpus-dir ../../benchmark/squad_bench_large/docs \
--embed-model ../../models/embedding/bge-base-en-v1.5-q8_0/bge-base-en-v1.5-q8_0.gguf


# ─────────────────────────────────────────────
# 2. Recall@k + MRR benchmark (no LLM needed)
# ─────────────────────────────────────────────
./target/release/rag-bench bench \
--workspace-dir /tmp/nela-bench \
--qa-file ../../benchmark/squad_bench_large/qa_pairs.json \
--embed-model ../../models/embedding/bge-base-en-v1.5-q8_0/bge-base-en-v1.5-q8_0.gguf


# ─────────────────────────────────────────────
# 3. E2E answer quality + no-RAG baseline
# Prints EM%, F1%, RAG gain (pp) in the summary
# ─────────────────────────────────────────────
./target/release/rag-bench bench \
--workspace-dir /tmp/nela-bench \
--qa-file ../../benchmark/squad_bench_large/qa_pairs.json \
--embed-model ../../models/embedding/bge-base-en-v1.5-q8_0/bge-base-en-v1.5-q8_0.gguf \
--llm-model ../../models/LLM/unsloth/Qwen3.5-0.8B-GGUF/Qwen3.5-0.8B-UD-Q4_K_XL.gguf \
--e2e-count 50 \
--no-rag-baseline


# ─────────────────────────────────────────────
# 4. RAPTOR ablation
# Requires RAPTOR trees already in the workspace:
# run step 1 (ingest) with the --raptor flag before this step.
# ─────────────────────────────────────────────
./target/release/rag-bench bench \
--workspace-dir /tmp/nela-bench \
--qa-file ../../benchmark/squad_bench_large/qa_pairs.json \
--embed-model ../../models/embedding/bge-base-en-v1.5-q8_0/bge-base-en-v1.5-q8_0.gguf \
--raptor


# ─────────────────────────────────────────────
# 5. Scale degradation benchmark
# Uses a CLEAN workspace (/tmp/nela-scale).
# Each checkpoint gets its own isolated subdirectory.
# Previous results in /tmp/nela-bench do NOT affect this.
# ─────────────────────────────────────────────
./target/release/rag-bench scale \
--workspace-dir /tmp/nela-scale \
--corpus-dir ../../benchmark/squad_bench_large/docs \
--qa-file ../../benchmark/squad_bench_large/qa_pairs.json \
--embed-model ../../models/embedding/bge-base-en-v1.5-q8_0/bge-base-en-v1.5-q8_0.gguf \
--sizes 100,500,1000,2000 \
--qa-sample 200


# ─────────────────────────────────────────────
# 6. Standalone E2E eval (no recall benchmarking)
# ─────────────────────────────────────────────
./target/release/rag-bench eval \
--workspace-dir /tmp/nela-bench \
--qa-file ../../benchmark/squad_bench_large/qa_pairs.json \
--embed-model ../../models/embedding/bge-base-en-v1.5-q8_0/bge-base-en-v1.5-q8_0.gguf \
--llm-model ../../models/LLM/unsloth/Qwen3.5-0.8B-GGUF/Qwen3.5-0.8B-UD-Q4_K_XL.gguf \
--count 50


# ─────────────────────────────────────────────
# 7. Embedding model comparison (bge-small)
# Use a separate workspace so indexes don't collide.
# ─────────────────────────────────────────────
./target/release/rag-bench ingest \
--workspace-dir /tmp/nela-bench-small \
--corpus-dir ../../benchmark/squad_bench_large/docs \
--embed-model ../../models/embedding/bge-small-en-v1.5-q8_0/bge-small-en-v1.5-q8_0.gguf

./target/release/rag-bench bench \
--workspace-dir /tmp/nela-bench-small \
--qa-file ../../benchmark/squad_bench_large/qa_pairs.json \
--embed-model ../../models/embedding/bge-small-en-v1.5-q8_0/bge-small-en-v1.5-q8_0.gguf


# ─────────────────────────────────────────────
# Notes
# ─────────────────────────────────────────────
# - llama-server binary is auto-detected from bin/llama-lin/llama-server
# relative to the workspace. Override with --llama-server <path>.
# - Default embed port: 12345. Default LLM port: 12346.
# - bench_results.json is written to the CWD unless --output is specified.
# - Scale benchmark: do NOT reuse an existing /tmp/nela-bench workspace;
# use a dedicated directory (e.g., /tmp/nela-scale) to avoid stale data.
113 changes: 111 additions & 2 deletions benchmark/README.md
@@ -1,6 +1,115 @@
# GenHat Application Benchmark Suite
# NELA Benchmark Suite

This folder contains an application-only benchmark pipeline (no app code changes required) that captures:
This folder contains two independent benchmark tools:

1. **`rag-bench`** — Rust CLI that measures RAG pipeline retrieval quality and latency
(recall@k, latency breakdown, IVF memory stats, RAPTOR ablation)
2. **`run_benchmark.py`** — Python application-level benchmark (startup timing, CPU/RAM over time)

---

## rag-bench — RAG Retrieval Quality Benchmarks

`rag-bench` is a standalone Rust binary that directly exercises the NELA RAG components
(`RagDb`, `BM25Index`, `VectorIndex`, `rrf_fuse`) without the Tauri runtime.
It is intended for researchers validating retrieval quality claims.
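
For readers unfamiliar with the fusion step, the following is a minimal Python sketch of standard reciprocal rank fusion (RRF), where each ranked list contributes `1 / (k + rank)` to a chunk's score. It is an illustration only: `rrf_fuse_sketch` is a hypothetical name, the chunk IDs are made up, and the signature and default constant of NELA's own `rrf_fuse` are not shown here. The value `k = 60` is simply the conventional default from the RRF literature.

```python
from collections import defaultdict


def rrf_fuse_sketch(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of chunk IDs with reciprocal rank fusion.

    Each list contributes 1 / (k + rank) per chunk, with ranks starting at 1.
    Results are sorted by descending score; ties break on chunk ID.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical BM25 and vector rankings for one query.
bm25_ranking = ["c1", "c2", "c3"]
vector_ranking = ["c2", "c4", "c1"]
fused = rrf_fuse_sketch([bm25_ranking, vector_ranking])
```

A larger `k` flattens the difference between top and lower ranks, so agreement between retrievers matters more than any single first-place result.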

### Scenarios covered

| ID | Scenario | Configs measured |
|-----|--------------------------------|-----------------------------------------------------|
| A1 | Recall@k | BM25-only, Vector-only, Hybrid, Hybrid+expand |
| D1 | Query latency breakdown | Embed / BM25 / Vector / RRF / Expand per stage |
| B1 | Ingestion timing | Per-file: chunk count, embed time, total time |
| C2 | IVF memory efficiency | Quantized MB vs raw f32 estimate, compression ratio |
| A4 | RAPTOR confidence-gate ablation| threshold=-1.5 vs +∞ vs -∞ (requires pre-built trees) |
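
As a rough illustration of scenario A1, the sketch below computes recall@k and MRR from per-query relevance flags. It assumes recall@k means "at least one relevant chunk appears in the top k" (a hit rate), which fits a keyword oracle that cannot enumerate every relevant chunk. The actual definition used by `rag-bench` is not stated in this README, and both function names are hypothetical.

```python
def recall_at_k(relevance: list[list[bool]], k: int) -> float:
    """Fraction of queries with at least one relevant chunk in the top k.

    relevance[i][j] is True when the j-th ranked chunk for query i is relevant.
    """
    if not relevance:
        return 0.0
    hits = sum(1 for flags in relevance if any(flags[:k]))
    return hits / len(relevance)


def mean_reciprocal_rank(relevance: list[list[bool]]) -> float:
    """Mean of 1 / rank of the first relevant chunk (0 for queries with none)."""
    if not relevance:
        return 0.0
    total = 0.0
    for flags in relevance:
        for rank, is_relevant in enumerate(flags, start=1):
            if is_relevant:
                total += 1.0 / rank
                break
    return total / len(relevance)


# Three hypothetical queries with top-3 results each.
flags = [
    [False, True, False],   # first hit at rank 2
    [True, False, False],   # first hit at rank 1
    [False, False, False],  # no hit
]
```

Running the same two functions over the BM25-only, Vector-only, and Hybrid rankings yields directly comparable columns for the table above.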

### Step 1 — Prepare a test corpus (SQuAD)

```bash
cd benchmark
python3 prepare_squad.py \
--out-dir squad_bench \
--max-contexts 100 \
--max-qa 400
```

This downloads the SQuAD 1.1 dev set (~10 MB), writes 100 `.txt` context files to
`squad_bench/docs/`, and produces `squad_bench/qa_pairs.json` with 400 QA pairs.

### Step 2 — Run the full benchmark

```bash
cd genhat-desktop

cargo run --release --bin rag-bench -- run \
--workspace-dir /tmp/nela-bench \
--corpus-dir ../benchmark/squad_bench/docs \
--qa-file ../benchmark/squad_bench/qa_pairs.json \
--embed-model /path/to/bge-base-en-v1.5-q8_0.gguf \
--top-k 5,10 \
--output ../benchmark/rag_results.json
```

`llama-server` is auto-detected from `src-tauri/bin/llama-lin/llama-server`.
Override with `--llama-server <path>` if needed.

### Step 2 (alternative) — Ingest once, bench many times

```bash
# Ingest corpus once
cargo run --release --bin rag-bench -- ingest \
--workspace-dir /tmp/nela-bench \
--corpus-dir ../benchmark/squad_bench/docs \
--embed-model /path/to/bge.gguf

# Run benchmarks against the same workspace (fast, no re-embedding)
cargo run --release --bin rag-bench -- bench \
--workspace-dir /tmp/nela-bench \
--qa-file ../benchmark/squad_bench/qa_pairs.json \
--embed-model /path/to/bge.gguf \
--output ../benchmark/rag_results.json
```

### RAPTOR ablation (A4)

RAPTOR trees must be pre-built. The easiest way is to ingest the corpus via the NELA
desktop app (Phase 2 background enrichment auto-builds RAPTOR), then point
`--workspace-dir` at that existing NELA workspace:

```bash
cargo run --release --bin rag-bench -- bench \
--workspace-dir ~/.local/share/nela/workspaces/<your-workspace> \
--qa-file ../benchmark/squad_bench/qa_pairs.json \
--embed-model /path/to/bge.gguf \
--raptor \
--output ../benchmark/rag_raptor_results.json
```

### QA pairs format

```json
[
{
"question": "What causes the tides?",
"relevant_keywords": ["gravitational", "Moon", "tidal"],
"doc_title": "optional_partial_doc_title"
}
]
```

The oracle marks a chunk as relevant if it contains **any** of the keywords (case-insensitive).
If `doc_title` is set, the chunk's document title must also contain that substring.
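
The relevance rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Rust implementation: `is_relevant` is a hypothetical name, and the sketch assumes the `doc_title` substring match is case-insensitive like the keyword match, which the README does not specify.

```python
def is_relevant(chunk_text: str, chunk_doc_title: str, qa: dict) -> bool:
    """Keyword oracle: any keyword in the chunk, plus an optional title filter."""
    text = chunk_text.lower()
    keyword_hit = any(kw.lower() in text for kw in qa.get("relevant_keywords", []))
    title_filter = qa.get("doc_title")
    if not title_filter:
        return keyword_hit
    # Assumption: title matching ignores case, mirroring the keyword match.
    return keyword_hit and title_filter.lower() in chunk_doc_title.lower()


qa_pair = {
    "question": "What causes the tides?",
    "relevant_keywords": ["gravitational", "Moon", "tidal"],
    "doc_title": "Tide",
}
```

Because a single keyword is enough for a match, broad keywords inflate recall; choosing distinctive terms per question keeps the oracle meaningful.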

### Plotting results

```bash
python3 benchmark/plot_results.py --input benchmark/rag_results.json
```

---

## run_benchmark.py — Application-Level Benchmark

- Startup timing (cold start)
- Process-tree resource use over time (CPU, RSS, process count)