NELA: Added web scraping support by Amoghk04 · Pull Request #64 · nela-local/nela

Amoghk04 · 2026-05-31T15:34:54Z

Web Scraping support

…e across runs

- Reorder per-doc ingest: embed *before* DB/BM25 insertion so a failed embed leaves no partial state; doc is retried on the next run.
- On embed failure (e.g. input > model ctx size) print a WARN and continue instead of propagating the error and aborting the whole task.
- Track skipped_embed counter; report it in the Corpus ready summary.
- run_all_benchmarks.sh: use stable WS_BEIR=workspace/beir_ws/ dir (not the timestamped results/ dir) so caching actually persists across re-runs.
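The embed-first ordering described in this commit can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `IngestStats`, `ingest_docs`, and the `embed` / `store` closures are hypothetical stand-ins for the real embed call and the DB/BM25 insert.

```rust
/// Outcome counters for one ingest pass (hypothetical names).
#[derive(Debug, Default, PartialEq)]
pub struct IngestStats {
    pub ingested: usize,
    pub skipped_embed: usize,
}

/// Embed each doc first; the store is written only after a successful embed,
/// so a failed embed leaves nothing behind for that doc.
/// `embed` and `store` are stand-ins, not the project's real functions.
pub fn ingest_docs<E, S>(docs: &[&str], mut embed: E, mut store: S) -> IngestStats
where
    E: FnMut(&str) -> Result<Vec<f32>, String>,
    S: FnMut(&str, Vec<f32>),
{
    let mut stats = IngestStats::default();
    for &doc in docs {
        match embed(doc) {
            Ok(vector) => {
                store(doc, vector);
                stats.ingested += 1;
            }
            Err(err) => {
                // Warn and continue instead of aborting the whole task.
                eprintln!("WARN: embed failed, skipping doc: {}", err);
                stats.skipped_embed += 1;
            }
        }
    }
    stats
}
```

Because nothing is written for a skipped doc, a later run sees it as missing and tries it again.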

… failure

- Capture llama-server stderr to a temp file instead of /dev/null
- Add process.try_wait() check in health loop: detect instant crashes (missing library, unsupported flag, OOM, etc.) and bail immediately with the captured server output rather than waiting the full 90s
- Include last 40 lines of server output in both the premature-exit and 90s-timeout error messages so the failure is actionable
- Fix broken warm-up timer (was always ~0s due to Instant arithmetic)
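A health loop of the kind this commit describes might look like the sketch below. It uses only `std`; the function names, the 200 ms poll interval, and the `is_healthy` / `read_log` closures are assumptions made for illustration. Elapsed time is taken from a single `Instant` captured before the loop, which is one straightforward way to get a meaningful warm-up duration.

```rust
use std::process::Child;
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Return the last `n` lines of `text`, for inclusion in an error message.
pub fn tail_lines(text: &str, n: usize) -> String {
    let lines: Vec<&str> = text.lines().collect();
    let start = lines.len().saturating_sub(n);
    lines[start..].join("\n")
}

/// Poll until `is_healthy` returns true, the child exits, or `timeout` elapses.
/// `read_log` returns whatever the server has written to its captured stderr file.
/// Both closures are hypothetical hooks, not the project's API.
pub fn wait_until_ready(
    child: &mut Child,
    timeout: Duration,
    mut is_healthy: impl FnMut() -> bool,
    read_log: impl Fn() -> String,
) -> Result<Duration, String> {
    let started = Instant::now();
    loop {
        // An early exit (missing library, bad flag, OOM) is reported at once.
        match child.try_wait() {
            Ok(Some(status)) => {
                let tail = tail_lines(&read_log(), 40);
                return Err(format!("server exited early ({}):\n{}", status, tail));
            }
            Ok(None) => {} // still running
            Err(e) => return Err(format!("could not poll server process: {}", e)),
        }
        if is_healthy() {
            return Ok(started.elapsed());
        }
        if started.elapsed() >= timeout {
            let tail = tail_lines(&read_log(), 40);
            return Err(format!("server not ready after {:?}:\n{}", timeout, tail));
        }
        sleep(Duration::from_millis(200));
    }
}
```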

… startup

- Remove --log-disable so llama-server GPU detection messages ("found X CUDA devices", "offloaded N/N layers to GPU") are written to the captured stderr log file.
- After successful startup, scan the log for CUDA/GPU/offload lines and print them as [bench] GPU: ... lines. If none are found, print a clear warning that the server is likely running on CPU and how to verify (ldd | grep cuda).
- This makes silent CPU fallback immediately visible instead of only showing up as mysteriously low throughput.
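The log scan could be approximated as below. The keyword list follows the commit text (CUDA, GPU, offload); the function names and the exact wording of the warning are assumptions for illustration only.

```rust
/// Collect log lines that mention CUDA, GPU, or layer offload (case-insensitive).
pub fn gpu_lines(log: &str) -> Vec<&str> {
    log.lines()
        .filter(|line| {
            let lower = line.to_lowercase();
            lower.contains("cuda") || lower.contains("gpu") || lower.contains("offload")
        })
        .collect()
}

/// Build the lines to print after startup: one `[bench] GPU:` line per match,
/// or a single warning when nothing matched. The warning text is illustrative.
pub fn gpu_report(log: &str) -> Vec<String> {
    let hits = gpu_lines(log);
    if hits.is_empty() {
        vec!["[bench] WARN: no CUDA/GPU/offload lines in server log; \
              the server is likely running on CPU"
            .to_string()]
    } else {
        hits.iter()
            .map(|line| format!("[bench] GPU: {}", line.trim()))
            .collect()
    }
}
```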

Each add_chunks_batch() call commits a Tantivy segment. Tantivy's segment merge is O(total_docs) per commit, so committing once per doc batch causes quadratic slowdown on large corpora (fiqa: 170k chunks → 7k commits → 1 doc/s at the end).

Accumulate all BM25 data in bm25_pending during the ingest loop and issue a single add_chunks_batch() after all docs are written. This reduces Tantivy segment merges from O(N) to O(1) for ingest.

… O(N²) fix

Previous commit deferred BM25 entirely to end-of-ingest, which breaks crash recovery: SQLite-cached docs are skipped on re-run, so BM25 is permanently empty if a run dies before the final commit.

Fix: flush bm25_pending to Tantivy every 500 ingested docs.
- fiqa (57k docs): 114 commits instead of 7000 → 60× fewer merges
- crash window: at most 500 docs of BM25 data lost (re-run restores the SQLite/embed data that was safely committed per-doc)
- tail flush after loop handles the remainder
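The checkpointed flush can be illustrated with a small mock. `MockIndex` below is not Tantivy; it only counts commits, so the arithmetic in the commit message (57k docs at one flush per 500 docs gives 114 commits) can be checked. All names other than `bm25_pending` and `add_chunks_batch` are hypothetical.

```rust
/// Stand-in for the BM25 index; it only counts commits and stored chunks.
#[derive(Default)]
pub struct MockIndex {
    pub commits: usize,
    pub chunks: usize,
}

impl MockIndex {
    /// In this sketch, one non-empty call corresponds to one segment commit.
    pub fn add_chunks_batch(&mut self, batch: &[(u64, String)]) {
        if batch.is_empty() {
            return;
        }
        self.chunks += batch.len();
        self.commits += 1;
    }
}

/// Ingest `doc_count` docs with `chunks_per_doc` chunks each, flushing the
/// pending BM25 rows every `flush_every` docs and once more after the loop.
pub fn ingest_with_checkpoints(
    index: &mut MockIndex,
    doc_count: usize,
    chunks_per_doc: usize,
    flush_every: usize,
) {
    assert!(flush_every > 0, "flush_every must be non-zero");
    let mut bm25_pending: Vec<(u64, String)> = Vec::new();
    let mut next_chunk_id: u64 = 0;
    for doc in 0..doc_count {
        for _ in 0..chunks_per_doc {
            bm25_pending.push((next_chunk_id, format!("chunk of doc {}", doc)));
            next_chunk_id += 1;
        }
        if (doc + 1) % flush_every == 0 {
            index.add_chunks_batch(&bm25_pending);
            bm25_pending.clear();
        }
    }
    // Tail flush: whatever is left from the last partial window.
    index.add_chunks_batch(&bm25_pending);
}
```

With a crash anywhere in the loop, at most `flush_every` docs of pending rows are lost, which is the trade-off the commit describes.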

Both functions were firing one embed request per doc sequentially, leaving the GPU mostly idle while waiting for HTTP round-trips.

Fix: pre-read and chunk all files, then fire embeds in batches of 8 using futures_util::join_all — same EMBED_CONCURRENCY=8 pattern already used by cmd_beir_bench after ef10391.

Affected commands:
- ablate-chunking (calls ingest_corpus_with_config per chunk/overlap combo)
- ablate-quant (calls ingest_corpus_with_config per embed model)
- scale (inline ingest per checkpoint, now SCALE_EMBED_CONCURRENCY=8)
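The batching pattern can be sketched with the standard library alone. Note the substitution: the commit uses futures_util::join_all over async requests, whereas this example swaps in `std::thread::scope` so it needs no external crates. `embed_in_batches` and the `embed` closure are hypothetical names.

```rust
use std::thread;

const EMBED_CONCURRENCY: usize = 8;

/// Embed all texts in batches of EMBED_CONCURRENCY, running the requests of
/// each batch concurrently. Output order matches input order.
/// `embed` stands in for the real HTTP embed call.
pub fn embed_in_batches<F>(texts: &[String], embed: F) -> Vec<Vec<f32>>
where
    F: Fn(&str) -> Vec<f32> + Sync,
{
    let embed_ref = &embed;
    let mut out = Vec::with_capacity(texts.len());
    for batch in texts.chunks(EMBED_CONCURRENCY) {
        let results: Vec<Vec<f32>> = thread::scope(|s| {
            // Spawn every request in the batch before joining any of them.
            let handles: Vec<_> = batch
                .iter()
                .map(|t| s.spawn(move || embed_ref(t.as_str())))
                .collect();
            handles
                .into_iter()
                .map(|h| h.join().expect("embed worker panicked"))
                .collect()
        });
        out.extend(results);
    }
    out
}
```

Each batch finishes before the next starts, so at most eight requests are in flight at a time.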

run_recall_bench was embedding every query once per retrieval config (4x redundancy for bm25_only/vector_only/hybrid/hybrid_expand). Now pre-embeds all queries in parallel batches (EMBED_CONCURRENCY=8) before the config loop, reusing the results across all four configs.

run_recall_hybrid_rrf_k was re-embedding 500 queries per k-value. cmd_ablate_rrf_k now pre-embeds once and passes Vec<Vec<f32>> into the function (signature change: server param replaced by query_embeddings).

Net improvement per run_recall_bench call: 500 queries x 4 configs = 2000 sequential single-sentence embed calls -> ~63 parallel batches.
Net improvement for ablate-rrf-k: 500 x 5 k-values = 2500 calls -> 63.
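The saving comes from hoisting the embed step out of the config loop. The sketch below counts calls to make that concrete; the function names and closures are hypothetical, and the embedding itself is done sequentially here to keep the example short.

```rust
/// Number of batches needed to embed `n` items `batch_size` at a time
/// (ceiling division; `batch_size` must be non-zero).
pub fn batch_count(n: usize, batch_size: usize) -> usize {
    (n + batch_size - 1) / batch_size
}

/// Embed every query once, then hand the same vectors to each retrieval
/// config. Returns (embed calls made, (config, query) evaluations run).
/// `embed` and `retrieve` are stand-ins for the real functions.
pub fn run_configs_with_shared_embeddings<E, R>(
    queries: &[&str],
    configs: &[&str],
    mut embed: E,
    mut retrieve: R,
) -> (usize, usize)
where
    E: FnMut(&str) -> Vec<f32>,
    R: FnMut(&str, &[f32]),
{
    let mut embed_calls = 0;
    let query_embeddings: Vec<Vec<f32>> = queries
        .iter()
        .map(|&q| {
            embed_calls += 1;
            embed(q)
        })
        .collect();

    let mut evaluations = 0;
    for &config in configs {
        for vector in &query_embeddings {
            retrieve(config, vector.as_slice());
            evaluations += 1;
        }
    }
    (embed_calls, evaluations)
}
```

For 500 queries and four configs this makes 500 embed calls for 2000 evaluations, and `batch_count(500, 8)` gives the 63 batches quoted in the commit message.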

…cale

Per-document add_chunks_batch() flushes a Tantivy segment each call. For ablate-chunking (12 configs x ~400 SQuAD docs) this was 4800 Tantivy segment serialisations, each burning CPU while the GPU sat idle.

Fix for ingest_corpus_with_config (stages 6 and 8):
- Accumulate all (chunk_id, text, title) tuples in bm25_pending vec
- Single add_chunks_batch() call after all embed batches complete

Fix for cmd_scale inline ingest (stage 9):
- Same pattern: scale_bm25_pending + single commit per checkpoint

BEIR ingest (stage 5) already uses checkpoint-every-500-docs pattern. Stage 7 (ablate-rrf-k) has no ingest. Stage 10 is Python-only.

…ubsetting

- AblateChunkingArgs and AblateQuantArgs gain --max-docs: Option<usize>
- ingest_corpus_with_config takes max_docs: Option<usize>; truncates sorted entries after the first N docs (alphabetical order)
- Before each ablation command's inner loop, qa_pairs are filtered to those whose doc_title matches an ingested document title, so recall numbers remain valid for the subset
- run_all_benchmarks.sh gains --ablate-max-docs <N> flag that is forwarded to both ablation stages (6 and 8)

Usage example:
  bash scripts/run_all_benchmarks.sh --ablate-max-docs 100

With 100 docs instead of 400+, stage 6 (12 configs × 100 docs) uses ~3× less VRAM / RAM per workspace and runs ~4× faster.
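The subsetting and the matching QA filter can be sketched as two small functions. `QaPair`, `select_docs`, and `filter_qa` are hypothetical names; only `doc_title`, `max_docs`, and the alphabetical truncation come from the commit message.

```rust
use std::collections::HashSet;

/// A question tied to the document it was drawn from (illustrative shape).
#[derive(Debug, Clone, PartialEq)]
pub struct QaPair {
    pub doc_title: String,
    pub question: String,
}

/// Sort titles alphabetically and keep the first `max_docs`, if a cap is given.
pub fn select_docs(mut titles: Vec<String>, max_docs: Option<usize>) -> Vec<String> {
    titles.sort();
    if let Some(n) = max_docs {
        titles.truncate(n);
    }
    titles
}

/// Keep only QA pairs whose document was actually ingested, so recall is
/// measured against questions the subset can answer.
pub fn filter_qa(qa_pairs: Vec<QaPair>, ingested: &[String]) -> Vec<QaPair> {
    let titles: HashSet<&str> = ingested.iter().map(|t| t.as_str()).collect();
    qa_pairs
        .into_iter()
        .filter(|qa| titles.contains(qa.doc_title.as_str()))
        .collect()
}
```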

- Move ablation workspace paths from per-run $RESULTS/ to persistent workspace/ablate_chunking_ws and workspace/ablate_quant_ws so data survives across benchmark runs
- Sub-dir names encode --max-docs when set (e.g. cs512ov64_n100) so changing the corpus cap never reuses stale data
- Before each ingest, check if the workspace db already has documents; if so, skip ingest entirely so interrupted runs resume from the last completed grid-point rather than starting over
- Remove remove_dir_all cleanup at end of each config iteration
- Add WS_ABLATE_CHUNK / WS_ABLATE_QUANT vars in run_all_benchmarks.sh

- AblateChunkingArgs, AblateQuantArgs: new --max-qa Option<usize> field
- cmd_ablate_chunking / cmd_ablate_quant: truncate qa_pairs to max_qa after the doc-title filter so only answerable pairs survive
- dir names now encode both _d<N> (max_docs) and _q<N> (max_qa) suffixes so cached workspaces are never shared across different cap settings
- run_all_benchmarks.sh: --ablate-max-qa CLI flag; when --ablate-max-docs is set without an explicit --ablate-max-qa the script auto-defaults to 500 (pass --ablate-max-qa 0 to disable); stages 6 and 8 pass --max-qa

With --ablate-max-docs 100 --ablate-max-qa 500 stage 6 drops from ~25 min to ~2-3 min (500 QA pairs × 12 configs instead of 20456 × 12).
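The naming and defaulting rules above might be expressed as follows. This is a sketch under assumptions: the `cs<chunk>ov<overlap>` prefix is borrowed from the example in the earlier commit, the exact directory format in the real code is not shown in this PR page, and the defaulting logic actually lives in run_all_benchmarks.sh rather than in Rust. Both function names are hypothetical.

```rust
/// Build a workspace sub-directory name that encodes the grid point and both
/// caps, so workspaces cached under different caps never collide.
/// The prefix format is an assumption based on the `cs512ov64` example.
pub fn ws_dir_name(
    chunk_size: usize,
    overlap: usize,
    max_docs: Option<usize>,
    max_qa: Option<usize>,
) -> String {
    let mut name = format!("cs{}ov{}", chunk_size, overlap);
    if let Some(d) = max_docs {
        name.push_str(&format!("_d{}", d));
    }
    if let Some(q) = max_qa {
        name.push_str(&format!("_q{}", q));
    }
    name
}

/// Resolve the effective QA cap from the two CLI flags, mirroring the rule
/// described for the shell script.
pub fn effective_max_qa(max_docs: Option<usize>, max_qa_flag: Option<usize>) -> Option<usize> {
    match (max_docs, max_qa_flag) {
        (_, Some(0)) => None,         // explicit 0 disables the cap
        (_, Some(n)) => Some(n),      // an explicit value wins
        (Some(_), None) => Some(500), // docs capped: QA cap defaults to 500
        (None, None) => None,         // no caps at all
    }
}
```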

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Amoghk04 added 22 commits May 18, 2026 21:53

NELA: Added Benchmarking

b3983eb

NELA: Benching changes

221f579

fix: surface chat_complete errors and unexpected response JSON

a3628a1

fix: ChatServer parallel=1 for bench (ctx_size/parallel was giving 512 tokens per slot)

6b909f4

fix: swap raptor_trust_all/-inf and raptor_expand_all/+inf thresholds (were inverted)

f09a3ec

fix(beir): bump embed ubatch-size 512→2048; add per-doc progress to BEIR ingest

14a330f

fix(bench): warn explicitly when CUDA detected but no layers offloaded to GPU

b0cebe6

fix(bench): accept 'fitting params to device' as GPU-active signal

2df4246

perf(bench): parallel BEIR ingest — embed 8 docs concurrently

ef10391

NELA: Final benchmark figures for Claude code

b1cdaf8

NELA: Added web scraping

78f57f3

Copilot AI review requested due to automatic review settings May 31, 2026 15:34

Copilot started reviewing on behalf of Amoghk04 May 31, 2026 15:35

Amoghk04 merged commit 8f1d6c5 into dev May 31, 2026
4 of 5 checks passed

Amoghk04 deleted the feature/bench branch May 31, 2026 15:35

Copilot AI reviewed May 31, 2026


Amoghk04 merged 22 commits into dev from feature/bench

2 participants
