feat(scrape): month-frequency-weighted, source-balanced batch selection by shaypal5 · Pull Request #188 · DataHackIL/tfht_enforce_idx

shaypal5 · 2026-06-10T12:39:38Z

Summary

Adds --balanced-batch N to scrape_candidates for draining the 2026 backlog in fixed-size batches that respect a limited scraping budget (block/rate-limit risk, time cost).

Two balancing axes:

- Frequency-weighted across months: months with more prefilter-passing candidates get proportionally more slots (largest-remainder apportionment). Jan–Mar (the busy, uncovered months) get the bulk; quiet months still get representation.
- Source-balanced within each month: round-robin across distinct publication sources/source families (visited in descending scrape priority), so no single site monopolises a month's slots. This also spreads requests across hosts to reduce per-site block risk.
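
To make the first axis concrete, here is a minimal sketch of largest-remainder (Hamilton) apportionment. It is not the code from this PR: the function name `largest_remainder`, the integer-only arithmetic, and the tie-break rule (larger weight first, then key) are assumptions chosen for illustration, and the month counts are made up.

```python
def largest_remainder(weights: dict[str, int], total: int) -> dict[str, int]:
    """Split `total` slots across keys proportionally to non-negative integer weights."""
    weight_sum = sum(weights.values())
    if weight_sum <= 0 or total <= 0:
        return dict.fromkeys(weights, 0)

    allocation: dict[str, int] = {}
    remainders: dict[str, int] = {}
    for key, value in weights.items():
        # divmod gives the floor of the exact share and its (scaled) remainder
        # without any floating-point rounding.
        allocation[key], remainders[key] = divmod(total * value, weight_sum)

    # Slots still unassigned after flooring every share.
    leftover = total - sum(allocation.values())

    # Assumed tie-break: largest remainder, then larger weight, then key.
    ranked = sorted(weights, key=lambda k: (-remainders[k], -weights[k], k))
    for key in ranked[:leftover]:
        allocation[key] += 1
    return allocation


# Made-up candidate counts per month, for illustration only.
counts = {"2026-01": 50, "2026-02": 30, "2026-03": 15, "2026-04": 5}
print(largest_remainder(counts, 7))
# {'2026-01': 4, '2026-02': 2, '2026-03': 1, '2026-04': 0}
```

In this toy run the smallest month receives no slot, which is how plain largest-remainder behaves when the batch is small relative to the number of months. Whether the PR's implementation adds a minimum per month is not shown in the excerpts here.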

Source is derived from the domain (collapsing subdomains onto a family, e.g. sport1.maariv.co.il → maariv) because source_hints records the discovery engine (exa/brave), not the publisher.
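
One way to collapse subdomains onto a family can be sketched as follows. This is an assumption-laden illustration, not the PR's `candidate_source_key`: the function name `source_family` and the tiny hard-coded suffix set are hypothetical, and a production version would likely need a fuller public-suffix list.

```python
from __future__ import annotations

from urllib.parse import urlsplit

# Hypothetical, deliberately small set of two-label public suffixes.
_TWO_LABEL_SUFFIXES = {"co.il", "org.il", "ac.il", "gov.il", "co.uk"}


def source_family(url: str) -> str | None:
    """Return the registrable label of the URL's host, e.g. 'maariv'."""
    host = (urlsplit(url).hostname or "").lower()
    if not host:
        return None
    labels = host.split(".")
    # Decide whether the public suffix spans one label (.com) or two (.co.il).
    suffix_len = 2 if ".".join(labels[-2:]) in _TWO_LABEL_SUFFIXES else 1
    if len(labels) <= suffix_len:
        # Host is nothing but a suffix; fall back to the raw host.
        return host
    # The label immediately left of the suffix names the family.
    return labels[-(suffix_len + 1)]


print(source_family("https://sport1.maariv.co.il/article/1"))  # maariv
print(source_family("https://www.maariv.co.il/news"))          # maariv
```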

New module: `denbust/discovery/balanced_selection.py`

- candidate_source_key, candidate_month
- largest_remainder_allocation (Hamilton method)
- select_within_month (round-robin)
- plan_balanced_scrape_batch (months → apportion → within-month → top-up)
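
The within-month round-robin step can be sketched roughly as below. This is not the PR's `select_within_month`: the function name `round_robin_select`, the `(source, url)` tuple shape, and the `source_priority` mapping are hypothetical stand-ins, and sources missing from the mapping are assumed to be visited last.

```python
from collections import deque


def round_robin_select(
    candidates: list[tuple[str, str]],
    slots: int,
    source_priority: dict[str, int],
) -> list[tuple[str, str]]:
    """Pick up to `slots` (source, url) pairs, taking one per source per round."""
    # Bucket candidates per source, keeping input order within each source.
    buckets: dict[str, deque] = {}
    for candidate in candidates:
        buckets.setdefault(candidate[0], deque()).append(candidate)

    # Visit sources in descending priority; unknown sources default to 0.
    order = sorted(buckets, key=lambda s: (-source_priority.get(s, 0), s))

    selected: list[tuple[str, str]] = []
    while len(selected) < slots and any(buckets.values()):
        for source in order:
            if len(selected) >= slots:
                break
            if buckets[source]:
                selected.append(buckets[source].popleft())
    return selected


pool = [("ynet", "y1"), ("ynet", "y2"), ("ynet", "y3"), ("maariv", "m1"), ("walla", "w1")]
priority = {"ynet": 3, "maariv": 2, "walla": 1}
print(round_robin_select(pool, 4, priority))
# [('ynet', 'y1'), ('maariv', 'm1'), ('walla', 'w1'), ('ynet', 'y2')]
```

Even though one source supplies three of the five candidates, it gets only two of the four slots, because every other source is served once before any source is served twice.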

Wiring

- Config.scrape_balanced_batch_size: int | None
- CLI: denbust run … --balanced-batch N
- _run_candidate_scrape_job: balanced branch selects from the full prefilter-passing pool instead of the priority-ordered limit head

denbust run --dataset news_items --job scrape_candidates \
            --config agents/news/local_search_brave_exa.yaml \
            --balanced-batch 60

Test plan

- 10 new unit tests in test_balanced_selection.py (apportionment, round-robin spread, month weighting, cap, top-up, undated exclusion)
- 1335 unit tests pass, 0 failures
- ruff + mypy clean
- Validated against live data: 60-batch from 1,845 Stage-B passers → Jan 20 / Feb 17 / Mar 14 / Apr 3 / May 6

🤖 Generated with Claude Code

Adds --balanced-batch N to the scrape_candidates job for working through the backlog in fixed-size batches that respect a limited scraping budget:

- Frequency-weighted across months: months with more prefilter-passing candidates get proportionally more of the batch (largest-remainder / Hamilton apportionment), so busy months (Jan–Mar) are prioritised over quiet ones without starving them.
- Source-balanced within each month: a month's allocation is spread round-robin across distinct publication sources/source families, visited in descending scrape-priority order, so one prolific site cannot monopolise a month's slots (also spreads load to reduce per-site rate-limit/block risk).

Publication source is derived from the candidate domain (collapsing subdomains onto a family, e.g. sport1.maariv.co.il -> maariv), since source_hints records the discovery engine (exa/brave), not the publisher.

New module denbust/discovery/balanced_selection.py:

- candidate_source_key / candidate_month
- largest_remainder_allocation
- select_within_month (round-robin)
- plan_balanced_scrape_batch (months -> apportion -> within-month -> top-up)

Wiring:

- Config.scrape_balanced_batch_size: int | None
- CLI: denbust run ... --balanced-batch N
- _run_candidate_scrape_job: balanced branch draws the batch from the full prefilter-passing pool instead of the priority-ordered limit head

Usage:

denbust run --dataset news_items --job scrape_candidates \
    --config agents/news/local_search_brave_exa.yaml \
    --balanced-batch 60

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions · 2026-06-10T12:43:51Z

pr-agent-context report:

This run includes patch coverage gaps on PR #188.

Address the patch coverage gaps below, then push all of these changes in a single commit.

# Patch coverage

Patch test coverage is 77.23%; please raise it to 100%. These are the uncovered code lines:
- src/denbust/discovery/balanced_selection.py: 65, 66, 67, 68, 169, 186, 187, 188, 189, 190, 191, 192
- src/denbust/pipeline.py: 1438, 1439, 1440, 1441, 1442, 1447, 1448, 3348, 3349, 3350, 3351

Run metadata:

Tool ref: v4.0.19
Tool version: 4.0.19
Trigger: pull request opened
Workflow run: 27276850469 attempt 1
Comment timestamp: 2026-06-10T12:43:01.516051+00:00
PR head commit: 82688d1e43e545ea8b61b4ce4313d8241d36e294

Copilot

Pull request overview

This PR adds an optional “balanced batch” mode for the scrape_candidates job to select a fixed-size scrape batch that (a) allocates slots across months proportionally to candidate volume and (b) balances selections within each month across publication sources (domain families), helping drain backlog under a constrained scrape budget.

Changes:

Added --balanced-batch N CLI option and Config.scrape_balanced_batch_size wiring into the scrape-candidates pipeline path.
Introduced denbust.discovery.balanced_selection implementing month apportionment (largest remainder) plus within-month source round-robin selection.
Added unit tests covering allocation, round-robin balancing, top-up behavior, and undated exclusion.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Show a summary per file

| File | Description |
| --- | --- |
| `src/denbust/discovery/balanced_selection.py` | New balanced batch planning utilities (month weighting + source balancing). |
| `src/denbust/pipeline.py` | Adds balanced selection branch to candidate scraping job and plumbs config through. |
| `src/denbust/config.py` | Adds `scrape_balanced_batch_size` config field. |
| `src/denbust/cli.py` | Adds `--balanced-batch` option and passes it through to `run_job`. |
| `tests/unit/test_balanced_selection.py` | New unit tests for allocation and selection logic. |
| `tests/unit/test_cli.py` | Updates fake runner signature to accept the new config parameter. |
| `tests/unit/test_pipeline_core.py` | Updates fakes/config scaffolding for the new balanced batch size parameter. |

+            if pub_date_from is not None:
+                cutoff_month = pub_date_from.strftime("%Y-%m")
+                eligible = [
+                    candidate
+                    for candidate in eligible
+                    if (month := candidate_month(candidate)) is not None and month >= cutoff_month
+                ]
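
The snippet above compares `"%Y-%m"` strings directly. A small standalone sketch of why that works, and of one consequence worth noting: zero-padded year-month strings sort lexicographically in chronological order, and truncating the cutoff to a month keeps candidates from anywhere in that month, including days before `pub_date_from`. The helper name `filter_from_month` is hypothetical, and months are passed as plain strings rather than candidate objects.

```python
from __future__ import annotations

from datetime import date


def filter_from_month(months: list[str | None], pub_date_from: date) -> list[str]:
    """Keep month keys ('YYYY-MM') at or after the cutoff's month; drop undated entries."""
    cutoff_month = pub_date_from.strftime("%Y-%m")
    # Plain string comparison is enough because the format is fixed-width
    # and zero-padded, so "2026-03" < "2026-10" as expected.
    return [m for m in months if m is not None and m >= cutoff_month]


months = ["2025-12", "2026-01", "2026-03", None, "2026-10"]
print(filter_from_month(months, date(2026, 3, 15)))
# ['2026-03', '2026-10']
```

Here a cutoff of 15 March still admits the whole of March, since only the month part of the date is compared.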


    plan_backfill_windows,
    resolve_backfill_request_window,
 )
+from denbust.discovery.balanced_selection import candidate_month, plan_balanced_scrape_batch


+    weight_sum = sum(weights.values())
+    if weight_sum <= 0 or total <= 0:
+        return dict.fromkeys(weights, 0)
+    exact = {key: total * value / weight_sum for key, value in weights.items()}
+    allocation = {key: int(value) for key, value in exact.items()}


+from denbust.discovery.scrape_queue import (
+    _SOURCE_SCRAPE_PRIORITY,
+    _backfill_publication_datetime,
+)


Copilot AI review requested due to automatic review settings June 10, 2026 12:39

shaypal5 merged commit 95d2184 into main Jun 10, 2026

shaypal5 deleted the codex/balanced-scrape-batch branch June 10, 2026 12:39

Copilot AI reviewed Jun 10, 2026
