NELA: Added web scraping support #64
Merged
…2 tokens per slot)
…e across runs
- Reorder per-doc ingest: embed *before* DB/BM25 insertion so a failed embed leaves no partial state; doc is retried on the next run.
- On embed failure (e.g. input > model ctx size) print a WARN and continue instead of propagating the error and aborting the whole task.
- Track skipped_embed counter; report it in the Corpus ready summary.
- run_all_benchmarks.sh: use stable WS_BEIR=workspace/beir_ws/ dir (not the timestamped results/ dir) so caching actually persists across re-runs.
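The embed-first ordering described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `embed`, `ingest`, `IngestStats`, and the `store` vector (standing in for the DB/BM25 insertion) are hypothetical names, and the context-size check is simulated with a byte-length limit.

```rust
/// Hypothetical stand-in for the embedding call: fails when the input is
/// larger than the (simulated) model context size.
fn embed(text: &str, ctx_limit: usize) -> Result<Vec<f32>, String> {
    if text.len() > ctx_limit {
        return Err(format!(
            "input of {} bytes exceeds ctx size {}",
            text.len(),
            ctx_limit
        ));
    }
    Ok(vec![text.len() as f32])
}

#[derive(Default)]
struct IngestStats {
    ingested: usize,
    skipped_embed: usize,
}

/// Ingests `docs`; `store` is an assumed stand-in for the DB/BM25 insertion.
fn ingest(docs: &[&str], ctx_limit: usize, store: &mut Vec<(String, Vec<f32>)>) -> IngestStats {
    let mut stats = IngestStats::default();
    for doc in docs {
        // Embed first: if this fails, nothing has been written for the doc yet,
        // so no partial state is left behind and the doc can be retried later.
        let vector = match embed(doc, ctx_limit) {
            Ok(v) => v,
            Err(e) => {
                eprintln!("WARN: skipping doc, embed failed: {e}");
                stats.skipped_embed += 1;
                continue; // keep going instead of aborting the whole task
            }
        };
        // Only after a successful embed is the doc persisted.
        store.push(((*doc).to_string(), vector));
        stats.ingested += 1;
    }
    stats
}
```

A caller would report `stats.skipped_embed` in its end-of-ingest summary.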
… failure
- Capture llama-server stderr to a temp file instead of /dev/null
- Add process.try_wait() check in health loop: detect instant crashes (missing library, unsupported flag, OOM, etc.) and bail immediately with the captured server output rather than waiting the full 90s
- Include last 40 lines of server output in both the premature-exit and 90s-timeout error messages so the failure is actionable
- Fix broken warm-up timer (was always ~0s due to Instant arithmetic)
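A health loop of the kind described above might look like the sketch below. It assumes a spawned `std::process::Child`; the function names, the `is_healthy` and `read_log` closures, and the 250 ms poll interval are illustrative choices, not taken from the PR. Elapsed time is measured from a single `Instant` captured before the loop.

```rust
use std::process::Child;
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Returns the last `n` lines of `log`, joined with newlines.
fn tail_lines(log: &str, n: usize) -> String {
    let lines: Vec<&str> = log.lines().collect();
    let start = lines.len().saturating_sub(n);
    lines[start..].join("\n")
}

/// Polls until the server reports healthy, exits early, or `timeout` elapses.
/// `is_healthy` and `read_log` are assumed hooks (e.g. an HTTP health probe
/// and a read of the captured stderr file).
fn wait_until_healthy(
    child: &mut Child,
    mut is_healthy: impl FnMut() -> bool,
    read_log: impl Fn() -> String,
    timeout: Duration,
) -> Result<Duration, String> {
    let started = Instant::now();
    loop {
        // try_wait() returns Ok(Some(status)) once the process has exited,
        // so an instant crash is detected without waiting for the timeout.
        match child.try_wait() {
            Ok(Some(status)) => {
                return Err(format!(
                    "server exited early ({status}):\n{}",
                    tail_lines(&read_log(), 40)
                ));
            }
            Ok(None) => {} // still running
            Err(e) => return Err(format!("could not poll server process: {e}")),
        }
        if is_healthy() {
            return Ok(started.elapsed()); // warm-up time
        }
        if started.elapsed() >= timeout {
            return Err(format!(
                "server not healthy after {:?}:\n{}",
                timeout,
                tail_lines(&read_log(), 40)
            ));
        }
        sleep(Duration::from_millis(250));
    }
}
```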
… startup
- Remove --log-disable so llama-server GPU detection messages
("found X CUDA devices", "offloaded N/N layers to GPU") are written
to the captured stderr log file.
- After successful startup, scan the log for CUDA/GPU/offload lines
and print them as [bench] GPU: ... lines. If none are found, print
a clear warning that the server is likely running on CPU and how to
verify (ldd | grep cuda).
- This makes silent CPU fallback immediately visible instead of only
showing up as mysteriously low throughput.
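The log scan described above could be approximated as follows. The case-insensitive keyword match and the exact warning wording are assumptions; only the `[bench] GPU: ...` prefix and the CUDA/GPU/offload keywords come from the commit message.

```rust
/// Collects log lines that mention CUDA, GPU, or layer offloading
/// (case-insensitive substring match; the matching rule is an assumption).
fn gpu_lines(log: &str) -> Vec<&str> {
    log.lines()
        .filter(|line| {
            let lower = line.to_lowercase();
            lower.contains("cuda") || lower.contains("gpu") || lower.contains("offload")
        })
        .collect()
}

/// Builds the lines to print after a successful startup: one line per GPU
/// hit, or a single warning when nothing GPU-related was logged.
fn report_gpu(log: &str) -> Vec<String> {
    let hits = gpu_lines(log);
    if hits.is_empty() {
        vec![
            "[bench] WARNING: no CUDA/GPU/offload lines in server log; \
             the server is likely running on CPU"
                .to_string(),
        ]
    } else {
        hits.iter()
            .map(|l| format!("[bench] GPU: {}", l.trim()))
            .collect()
    }
}
```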
Each add_chunks_batch() call commits a Tantivy segment. Tantivy's segment merge is O(total_docs) per commit, so committing once per doc batch causes quadratic slowdown on large corpora (fiqa: 170k chunks → 7k commits → 1 doc/s at the end).
Accumulate all BM25 data in bm25_pending during the ingest loop and issue a single add_chunks_batch() after all docs are written. This reduces Tantivy segment merges from O(N) to O(1) for ingest.
… O(N²) fix
Previous commit deferred BM25 entirely to end-of-ingest, which breaks crash recovery: SQLite-cached docs are skipped on re-run, so BM25 is permanently empty if a run dies before the final commit.
Fix: flush bm25_pending to Tantivy every 500 ingested docs.
- fiqa (57k docs): 114 commits instead of 7000 → 60× fewer merges
- crash window: at most 500 docs of BM25 data lost (re-run restores the SQLite/embed data that was safely committed per-doc)
- tail flush after loop handles the remainder
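The flush-every-500-docs pattern can be sketched as below. `MockIndex` is a hypothetical stand-in that only counts commits; it is not Tantivy's API, and the function and constant names are illustrative.

```rust
const BM25_FLUSH_EVERY: usize = 500;

/// Hypothetical stand-in for the BM25 index: each batch call counts as one commit.
struct MockIndex {
    commits: usize,
    chunks: usize,
}

impl MockIndex {
    fn add_chunks_batch(&mut self, batch: &[(u64, String)]) {
        self.chunks += batch.len();
        self.commits += 1;
    }
}

/// Buffers chunk data per doc and flushes it to the index every
/// BM25_FLUSH_EVERY docs, with a tail flush for the remainder.
fn ingest_with_periodic_flush(n_docs: usize, chunks_per_doc: usize, index: &mut MockIndex) {
    let mut bm25_pending: Vec<(u64, String)> = Vec::new();
    let mut docs_since_flush = 0;
    for doc in 0..n_docs {
        for c in 0..chunks_per_doc {
            let chunk_id = (doc * chunks_per_doc + c) as u64;
            bm25_pending.push((chunk_id, format!("chunk {c} of doc {doc}")));
        }
        docs_since_flush += 1;
        if docs_since_flush == BM25_FLUSH_EVERY {
            // One commit per 500 docs bounds the data lost in a crash.
            index.add_chunks_batch(&bm25_pending);
            bm25_pending.clear();
            docs_since_flush = 0;
        }
    }
    // Tail flush: whatever is left after the last full window.
    if !bm25_pending.is_empty() {
        index.add_chunks_batch(&bm25_pending);
    }
}
```

With exactly 57,000 docs this yields 57,000 / 500 = 114 commits, matching the figure quoted above; a doc count that is not a multiple of 500 adds one tail commit.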
Both functions were firing one embed request per doc sequentially, leaving the GPU mostly idle while waiting for HTTP round-trips.
Fix: pre-read and chunk all files, then fire embeds in batches of 8 using futures_util::join_all (same EMBED_CONCURRENCY=8 pattern already used by cmd_beir_bench after ef10391).
Affected commands:
- ablate-chunking (calls ingest_corpus_with_config per chunk/overlap combo)
- ablate-quant (calls ingest_corpus_with_config per embed model)
- scale (inline ingest per checkpoint, now SCALE_EMBED_CONCURRENCY=8)
run_recall_bench was embedding every query once per retrieval config (4x redundancy for bm25_only/vector_only/hybrid/hybrid_expand). Now pre-embeds all queries in parallel batches (EMBED_CONCURRENCY=8) before the config loop, reusing the results across all four configs.
run_recall_hybrid_rrf_k was re-embedding 500 queries per k-value. cmd_ablate_rrf_k now pre-embeds once and passes Vec<Vec<f32>> into the function (signature change: server param replaced by query_embeddings).
Net improvement per run_recall_bench call: 500 queries x 4 configs = 2000 sequential single-sentence embed calls -> ~63 parallel batches.
Net improvement for ablate-rrf-k: 500 x 5 k-values = 2500 calls -> 63.
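The batch arithmetic above (500 queries in groups of 8 gives 63 batches, embedded once and reused) can be checked with a small sketch. It models only the batching and reuse, not the concurrency: batches are processed sequentially here rather than with join_all, and `embed_batch`, `pre_embed`, and `run_configs` are hypothetical names.

```rust
const EMBED_CONCURRENCY: usize = 8;

/// Hypothetical embed call for one batch; increments `batches_sent` so the
/// number of round-trips can be inspected.
fn embed_batch(queries: &[String], batches_sent: &mut usize) -> Vec<Vec<f32>> {
    *batches_sent += 1;
    queries.iter().map(|q| vec![q.len() as f32]).collect()
}

/// Embeds every query exactly once, EMBED_CONCURRENCY queries per batch.
fn pre_embed(queries: &[String], batches_sent: &mut usize) -> Vec<Vec<f32>> {
    let mut out = Vec::with_capacity(queries.len());
    for batch in queries.chunks(EMBED_CONCURRENCY) {
        out.extend(embed_batch(batch, batches_sent));
    }
    out
}

/// Each retrieval config receives the same pre-computed embeddings, so the
/// config loop issues no further embed calls. Returns (config, queries seen).
fn run_configs(configs: &[&str], query_embeddings: &[Vec<f32>]) -> Vec<(String, usize)> {
    configs
        .iter()
        .map(|c| ((*c).to_string(), query_embeddings.len()))
        .collect()
}
```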
…cale
Per-document add_chunks_batch() flushes a Tantivy segment each call. For ablate-chunking (12 configs x ~400 SQuAD docs) this was 4800 Tantivy segment serialisations, each burning CPU while the GPU sat idle.
Fix for ingest_corpus_with_config (stages 6 and 8):
- Accumulate all (chunk_id, text, title) tuples in bm25_pending vec
- Single add_chunks_batch() call after all embed batches complete
Fix for cmd_scale inline ingest (stage 9):
- Same pattern: scale_bm25_pending + single commit per checkpoint
BEIR ingest (stage 5) already uses checkpoint-every-500-docs pattern. Stage 7 (ablate-rrf-k) has no ingest. Stage 10 is Python-only.
…ubsetting
- AblateChunkingArgs and AblateQuantArgs gain --max-docs: Option<usize>
- ingest_corpus_with_config takes max_docs: Option<usize>; truncates sorted entries after the first N docs (alphabetical order)
- Before each ablation command's inner loop, qa_pairs are filtered to those whose doc_title matches an ingested document title, so recall numbers remain valid for the subset
- run_all_benchmarks.sh gains --ablate-max-docs <N> flag that is forwarded to both ablation stages (6 and 8)
Usage example:
bash scripts/run_all_benchmarks.sh --ablate-max-docs 100
With 100 docs instead of 400+, stage 6 (12 configs × 100 docs) uses ~3× less VRAM / RAM per workspace and runs ~4× faster.
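The subsetting logic above (sort, keep the first N docs, then drop QA pairs whose document was not ingested) might be sketched as follows. The `QaPair` struct and both function names are illustrative; only the `doc_title` field and the `max_docs: Option<usize>` parameter are named in the commit message.

```rust
use std::collections::HashSet;

/// Hypothetical QA record; only `doc_title` is named in the commit message.
struct QaPair {
    question: String,
    doc_title: String,
}

/// Sorts titles alphabetically and keeps the first `max_docs` when a cap is set.
fn select_docs(mut titles: Vec<String>, max_docs: Option<usize>) -> Vec<String> {
    titles.sort();
    if let Some(n) = max_docs {
        titles.truncate(n);
    }
    titles
}

/// Keeps only QA pairs whose document was actually ingested, so that recall
/// is measured against questions the subset can answer.
fn filter_qa(qa_pairs: Vec<QaPair>, ingested: &[String]) -> Vec<QaPair> {
    let keep: HashSet<&str> = ingested.iter().map(String::as_str).collect();
    qa_pairs
        .into_iter()
        .filter(|qa| keep.contains(qa.doc_title.as_str()))
        .collect()
}
```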
- Move ablation workspace paths from per-run $RESULTS/ to persistent workspace/ablate_chunking_ws and workspace/ablate_quant_ws so data survives across benchmark runs
- Sub-dir names encode --max-docs when set (e.g. cs512ov64_n100) so changing the corpus cap never reuses stale data
- Before each ingest, check if the workspace db already has documents; if so, skip ingest entirely so interrupted runs resume from the last completed grid-point rather than starting over
- Remove remove_dir_all cleanup at end of each config iteration
- Add WS_ABLATE_CHUNK / WS_ABLATE_QUANT vars in run_all_benchmarks.sh
- AblateChunkingArgs, AblateQuantArgs: new --max-qa Option<usize> field
- cmd_ablate_chunking / cmd_ablate_quant: truncate qa_pairs to max_qa after the doc-title filter so only answerable pairs survive
- dir names now encode both _d<N> (max_docs) and _q<N> (max_qa) suffixes so cached workspaces are never shared across different cap settings
- run_all_benchmarks.sh: --ablate-max-qa CLI flag; when --ablate-max-docs is set without an explicit --ablate-max-qa the script auto-defaults to 500 (pass --ablate-max-qa 0 to disable); stages 6 and 8 pass --max-qa
With --ablate-max-docs 100 --ablate-max-qa 500 stage 6 drops from ~25 min to ~2-3 min (500 QA pairs × 12 configs instead of 20456 × 12).
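The cap-aware workspace naming and the --ablate-max-qa defaulting rule could look like the sketch below. The function names, the `cs<chunk>ov<overlap>` prefix layout, and the suffix order are assumptions inferred from the examples in the commit messages; the defaulting rule lives in the shell script and is only mirrored in Rust here for illustration.

```rust
/// Builds a cache sub-directory name that encodes every cap in effect, so
/// workspaces built with different caps never collide. The exact layout is
/// an assumption based on the `_d<N>` / `_q<N>` suffixes mentioned above.
fn workspace_dir_name(
    chunk_size: usize,
    overlap: usize,
    max_docs: Option<usize>,
    max_qa: Option<usize>,
) -> String {
    let mut name = format!("cs{chunk_size}ov{overlap}");
    if let Some(d) = max_docs {
        name.push_str(&format!("_d{d}"));
    }
    if let Some(q) = max_qa {
        name.push_str(&format!("_q{q}"));
    }
    name
}

/// Mirrors the script's defaulting rule: an explicit 0 disables the QA cap,
/// an explicit N is used as-is, and a docs cap without a QA cap implies 500.
fn effective_max_qa(max_docs: Option<usize>, explicit_max_qa: Option<usize>) -> Option<usize> {
    match explicit_max_qa {
        Some(0) => None,
        Some(n) => Some(n),
        None => max_docs.map(|_| 500),
    }
}
```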
Web Scraping support