NELA: Added web scraping support#64

Merged
Amoghk04 merged 22 commits into dev from feature/bench
May 31, 2026

Conversation

@Amoghk04
Collaborator

Web Scraping support

Amoghk04 added 22 commits May 18, 2026 21:53
…e across runs

- Reorder per-doc ingest: embed *before* DB/BM25 insertion so a failed
  embed leaves no partial state; doc is retried on the next run.
- On embed failure (e.g. input > model ctx size) print a WARN and
  continue instead of propagating the error and aborting the whole task.
- Track skipped_embed counter; report it in the Corpus ready summary.
- run_all_benchmarks.sh: use stable WS_BEIR=workspace/beir_ws/ dir
  (not the timestamped results/ dir) so caching actually persists
  across re-runs.
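
The embed-first ordering described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `ingest_docs`, `IngestStats`, and the `embed` / `store` closures are hypothetical stand-ins for the embedding request and the DB/BM25 insertion.

```rust
/// Outcome counters for one ingest pass (names are illustrative).
#[derive(Debug, Default, PartialEq)]
pub struct IngestStats {
    pub ingested: usize,
    pub skipped_embed: usize,
}

/// Ingest `docs`, embedding each one *before* anything is persisted.
/// `embed` and `store` are hypothetical stand-ins, not functions from the repo.
pub fn ingest_docs<E, S>(docs: &[&str], mut embed: E, mut store: S) -> IngestStats
where
    E: FnMut(&str) -> Result<Vec<f32>, String>,
    S: FnMut(&str, Vec<f32>),
{
    let mut stats = IngestStats::default();
    for &doc in docs {
        match embed(doc) {
            // Only persist once the embedding exists, so a failed embed
            // leaves no partial state behind.
            Ok(vector) => {
                store(doc, vector);
                stats.ingested += 1;
            }
            // Warn and continue instead of aborting the whole task.
            Err(err) => {
                eprintln!("WARN: embed failed, skipping doc: {}", err);
                stats.skipped_embed += 1;
            }
        }
    }
    stats
}
```

Because nothing is stored for a skipped doc, a later run sees it as missing and retries it.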
… failure

- Capture llama-server stderr to a temp file instead of /dev/null
- Add process.try_wait() check in health loop: detect instant crashes
  (missing library, unsupported flag, OOM, etc.) and bail immediately
  with the captured server output rather than waiting the full 90s
- Include last 40 lines of server output in both the premature-exit
  and 90s-timeout error messages so the failure is actionable
- Fix broken warm-up timer (was always ~0s due to Instant arithmetic)
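
A health loop with an early-exit check might look roughly like the sketch below. It assumes a caller-supplied `is_healthy` probe (hypothetical, standing in for the HTTP health request). `wait_until_healthy`, `StartupError`, and `tail_lines` are illustrative names, not the ones used in this PR.

```rust
use std::process::Child;
use std::thread::sleep;
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
pub enum StartupError {
    /// The server process exited before it became healthy.
    ExitedEarly(Option<i32>),
    /// The server never became healthy within the timeout.
    TimedOut,
}

/// Poll until `is_healthy` returns true, bailing out as soon as the child exits.
/// Returns the measured warm-up time on success.
pub fn wait_until_healthy<F>(
    child: &mut Child,
    mut is_healthy: F,
    timeout: Duration,
) -> Result<Duration, StartupError>
where
    F: FnMut() -> bool,
{
    let started = Instant::now();
    while started.elapsed() < timeout {
        // try_wait() returns Ok(Some(status)) once the process has exited.
        if let Ok(Some(status)) = child.try_wait() {
            return Err(StartupError::ExitedEarly(status.code()));
        }
        if is_healthy() {
            return Ok(started.elapsed());
        }
        sleep(Duration::from_millis(50));
    }
    Err(StartupError::TimedOut)
}

/// Last `n` lines of captured server output, for inclusion in error messages.
pub fn tail_lines(log: &str, n: usize) -> String {
    let lines: Vec<&str> = log.lines().collect();
    let start = lines.len().saturating_sub(n);
    lines[start..].join("\n")
}
```

On either error, a caller would read the captured stderr file and attach `tail_lines(&log, 40)` to the message.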
… startup

- Remove --log-disable so llama-server GPU detection messages
  ("found X CUDA devices", "offloaded N/N layers to GPU") are written
  to the captured stderr log file.
- After successful startup, scan the log for CUDA/GPU/offload lines
  and print them as [bench] GPU: ... lines.  If none are found, print
  a clear warning that the server is likely running on CPU and how to
  verify (ldd | grep cuda).
- This makes silent CPU fallback immediately visible instead of only
  showing up as mysteriously low throughput.
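
The log scan could be approximated as below. This is a sketch under assumptions: the keyword list and the warning wording are guesses, and `gpu_lines` / `gpu_report` are hypothetical helper names.

```rust
/// Return the log lines that mention CUDA, GPU, or layer offloading.
/// The keyword list is an assumption; matching is case-insensitive.
pub fn gpu_lines(log: &str) -> Vec<&str> {
    log.lines()
        .filter(|line| {
            let lower = line.to_lowercase();
            ["cuda", "gpu", "offload"]
                .iter()
                .any(|kw| lower.contains(*kw))
        })
        .collect()
}

/// Format the matching lines as `[bench] GPU: ...`, or emit a single
/// warning when nothing matched (likely a silent CPU fallback).
pub fn gpu_report(log: &str) -> Vec<String> {
    let hits = gpu_lines(log);
    if hits.is_empty() {
        vec!["[bench] WARN: no GPU lines in server log; likely running on CPU".to_string()]
    } else {
        hits.iter()
            .map(|line| format!("[bench] GPU: {}", line.trim()))
            .collect()
    }
}
```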
Each add_chunks_batch() call commits a Tantivy segment. Tantivy's
segment merge is O(total_docs) per commit, so committing once per
doc batch causes quadratic slowdown on large corpora (fiqa: 170k
chunks → 7k commits → 1 doc/s at the end).

Accumulate all BM25 data in bm25_pending during the ingest loop and
issue a single add_chunks_batch() after all docs are written. This
reduces Tantivy segment merges from O(N) to O(1) for ingest.
… O(N²) fix

Previous commit deferred BM25 entirely to end-of-ingest, which breaks
crash recovery: SQLite-cached docs are skipped on re-run, so BM25 is
permanently empty if a run dies before the final commit.

Fix: flush bm25_pending to Tantivy every 500 ingested docs.
- fiqa (57k docs): 114 commits instead of 7000 → 60× fewer merges
- crash window: at most 500 docs of BM25 data lost (re-run restores the
  SQLite/embed data that was safely committed per-doc)
- tail flush after loop handles the remainder
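
The checkpointed flush can be sketched as follows. It assumes a `commit` closure standing in for the `add_chunks_batch()` call. `ingest_with_periodic_flush` and `BM25_FLUSH_EVERY` are illustrative names, and each doc is represented only by its list of chunk texts.

```rust
/// Flush interval in ingested docs (500 in the commit message above).
pub const BM25_FLUSH_EVERY: usize = 500;

/// Accumulate chunks in a pending buffer and commit every
/// `BM25_FLUSH_EVERY` docs, plus one tail flush for the remainder.
/// Returns the number of commits issued.
pub fn ingest_with_periodic_flush<C>(docs: &[Vec<String>], mut commit: C) -> usize
where
    C: FnMut(&[String]),
{
    let mut pending: Vec<String> = Vec::new();
    let mut commits = 0;
    for (i, chunks) in docs.iter().enumerate() {
        pending.extend(chunks.iter().cloned());
        if (i + 1) % BM25_FLUSH_EVERY == 0 {
            commit(pending.as_slice());
            pending.clear();
            commits += 1;
        }
    }
    // Tail flush: whatever is left after the last full checkpoint.
    if !pending.is_empty() {
        commit(pending.as_slice());
        commits += 1;
    }
    commits
}
```

With 57,000 docs this yields 57,000 / 500 = 114 commits, matching the figure quoted above.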
Both functions were firing one embed request per doc sequentially,
leaving the GPU mostly idle while waiting for HTTP round-trips.

Fix: pre-read and chunk all files, then fire embeds in batches of 8
using futures_util::future::join_all, the same EMBED_CONCURRENCY=8
pattern already used by cmd_beir_bench after ef10391.

Affected commands:
- ablate-chunking (calls ingest_corpus_with_config per chunk/overlap combo)
- ablate-quant    (calls ingest_corpus_with_config per embed model)
- scale           (inline ingest per checkpoint, now SCALE_EMBED_CONCURRENCY=8)

run_recall_bench was embedding every query once per retrieval config
(4x redundancy for bm25_only/vector_only/hybrid/hybrid_expand). Now
pre-embeds all queries in parallel batches (EMBED_CONCURRENCY=8) before
the config loop, reusing the results across all four configs.

run_recall_hybrid_rrf_k was re-embedding 500 queries per k-value.
cmd_ablate_rrf_k now pre-embeds once and passes Vec<Vec<f32>> into the
function (signature change: server param replaced by query_embeddings).

Net improvement per run_recall_bench call: 500 queries x 4 configs =
2000 sequential single-sentence embed calls -> ~63 parallel batches.
Net improvement for ablate-rrf-k: 500 x 5 k-values = 2500 calls -> 63.
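
The batch arithmetic and the pre-embed step can be illustrated with the sketch below. It is deliberately simplified: the real code issues the 8 requests of a batch concurrently, whereas here a hypothetical `embed_batch` closure stands in for one such concurrent batch and batches are processed one after another. `pre_embed` and `batch_count` are illustrative names.

```rust
pub const EMBED_CONCURRENCY: usize = 8;

/// Number of batches needed for `n_queries` (ceiling division).
pub fn batch_count(n_queries: usize, batch_size: usize) -> usize {
    (n_queries + batch_size - 1) / batch_size
}

/// Embed every query exactly once, in batches of `EMBED_CONCURRENCY`.
/// The returned vectors are in query order and can be reused by every
/// retrieval config or k-value instead of re-embedding per config.
pub fn pre_embed<E>(queries: &[String], mut embed_batch: E) -> Vec<Vec<f32>>
where
    E: FnMut(&[String]) -> Vec<Vec<f32>>,
{
    let mut embeddings = Vec::with_capacity(queries.len());
    for batch in queries.chunks(EMBED_CONCURRENCY) {
        embeddings.extend(embed_batch(batch));
    }
    embeddings
}
```

For 500 queries, `batch_count(500, 8)` is 63, which is where the "~63 parallel batches" figure comes from.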
…cale

Per-document add_chunks_batch() flushes a Tantivy segment each call.
For ablate-chunking (12 configs x ~400 SQuAD docs) this was 4800 Tantivy
segment serialisations, each burning CPU while the GPU sat idle.

Fix for ingest_corpus_with_config (stages 6 and 8):
- Accumulate all (chunk_id, text, title) tuples in bm25_pending vec
- Single add_chunks_batch() call after all embed batches complete

Fix for cmd_scale inline ingest (stage 9):
- Same pattern: scale_bm25_pending + single commit per checkpoint

BEIR ingest (stage 5) already uses checkpoint-every-500-docs pattern.
Stage 7 (ablate-rrf-k) has no ingest. Stage 10 is Python-only.
…ubsetting

- AblateChunkingArgs and AblateQuantArgs gain --max-docs: Option<usize>
- ingest_corpus_with_config takes max_docs: Option<usize>; truncates
  sorted entries after the first N docs (alphabetical order)
- Before each ablation command's inner loop, qa_pairs are filtered to
  those whose doc_title matches an ingested document title, so recall
  numbers remain valid for the subset
- run_all_benchmarks.sh gains --ablate-max-docs <N> flag that is
  forwarded to both ablation stages (6 and 8)

Usage example:
  bash scripts/run_all_benchmarks.sh --ablate-max-docs 100

With 100 docs instead of 400+, stage 6 (12 configs × 100 docs) uses
~3× less VRAM / RAM per workspace and runs ~4× faster.
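
The subsetting logic could look roughly like this. It is a sketch only: `QaPair`, `cap_docs`, and `filter_qa_by_title` are hypothetical names, and documents are reduced to their titles for brevity.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, PartialEq)]
pub struct QaPair {
    pub doc_title: String,
    pub question: String,
}

/// Sort titles alphabetically and keep only the first `max_docs`, if set.
pub fn cap_docs(mut titles: Vec<String>, max_docs: Option<usize>) -> Vec<String> {
    titles.sort();
    if let Some(n) = max_docs {
        titles.truncate(n);
    }
    titles
}

/// Keep only QA pairs whose document was actually ingested, so recall
/// is never measured against questions that cannot be answered.
pub fn filter_qa_by_title(qa_pairs: Vec<QaPair>, ingested: &[String]) -> Vec<QaPair> {
    let keep: HashSet<&str> = ingested.iter().map(String::as_str).collect();
    qa_pairs
        .into_iter()
        .filter(|qa| keep.contains(qa.doc_title.as_str()))
        .collect()
}
```
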
- Move ablation workspace paths from per-run $RESULTS/ to persistent
  workspace/ablate_chunking_ws and workspace/ablate_quant_ws so data
  survives across benchmark runs
- Sub-dir names encode --max-docs when set (e.g. cs512ov64_n100) so
  changing the corpus cap never reuses stale data
- Before each ingest, check if the workspace db already has documents;
  if so, skip ingest entirely so interrupted runs resume from the last
  completed grid-point rather than starting over
- Remove remove_dir_all cleanup at end of each config iteration
- Add WS_ABLATE_CHUNK / WS_ABLATE_QUANT vars in run_all_benchmarks.sh

- AblateChunkingArgs, AblateQuantArgs: new --max-qa Option<usize> field
- cmd_ablate_chunking / cmd_ablate_quant: truncate qa_pairs to max_qa
  after the doc-title filter so only answerable pairs survive
- dir names now encode both _d<N> (max_docs) and _q<N> (max_qa) suffixes
  so cached workspaces are never shared across different cap settings
- run_all_benchmarks.sh: --ablate-max-qa CLI flag; when --ablate-max-docs
  is set without an explicit --ablate-max-qa the script auto-defaults to 500
  (pass --ablate-max-qa 0 to disable); stages 6 and 8 pass --max-qa

With --ablate-max-docs 100 --ablate-max-qa 500 stage 6 drops from ~25 min
to ~2-3 min (500 QA pairs × 12 configs instead of 20456 × 12).
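
The cap handling and cache-key naming might be sketched as below. The default resolution actually lives in the shell script, so this Rust rendering is only an illustration. `effective_max_qa` and `workspace_dir_name` are hypothetical names, and the exact directory format (`cs<chunk>ov<overlap>_d<N>_q<N>`) is an assumption based on the suffixes mentioned above.

```rust
/// Resolve the QA cap: an explicit 0 disables it, an explicit value wins,
/// and setting max_docs alone defaults the cap to 500.
pub fn effective_max_qa(max_docs: Option<usize>, max_qa: Option<usize>) -> Option<usize> {
    match (max_docs, max_qa) {
        (_, Some(0)) => None,
        (_, Some(n)) => Some(n),
        (Some(_), None) => Some(500),
        (None, None) => None,
    }
}

/// Build a workspace sub-directory name that encodes every cap, so a
/// cached workspace is never reused under different settings.
pub fn workspace_dir_name(
    chunk_size: usize,
    overlap: usize,
    max_docs: Option<usize>,
    max_qa: Option<usize>,
) -> String {
    let mut name = format!("cs{}ov{}", chunk_size, overlap);
    if let Some(d) = max_docs {
        name.push_str(&format!("_d{}", d));
    }
    if let Some(q) = max_qa {
        name.push_str(&format!("_q{}", q));
    }
    name
}
```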
Copilot AI review requested due to automatic review settings May 31, 2026 15:34
@Amoghk04 Amoghk04 merged commit 8f1d6c5 into dev May 31, 2026
4 of 5 checks passed
@Amoghk04 Amoghk04 deleted the feature/bench branch May 31, 2026 15:35

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

2 participants