feat(scrape): month-frequency-weighted, source-balanced batch selection #188

Merged
shaypal5 merged 1 commit into main from codex/balanced-scrape-batch
Jun 10, 2026

Conversation

@shaypal5

Member

Summary

Adds --balanced-batch N to scrape_candidates for draining the 2026 backlog in fixed-size batches that respect a limited scraping budget (block/rate-limit risk, time cost).

Two balancing axes:

  1. Frequency-weighted across months — months with more prefilter-passing candidates get proportionally more slots (largest-remainder apportionment). Jan–Mar (the busy, uncovered months) get the bulk; quiet months still get representation.
  2. Source-balanced within each month — round-robin across distinct publication sources/source-families (visited in descending scrape priority), so no single site monopolises a month's slots. This also spreads requests across hosts to reduce per-site block risk.
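As an illustration of the first axis, here is a minimal sketch of largest-remainder (Hamilton) apportionment. It is not the PR's `largest_remainder_allocation`; the function name `largest_remainder`, the integer arithmetic, the tie-break by key, and the month counts are all assumptions chosen for the example.

```python
def largest_remainder(weights: dict[str, int], total: int) -> dict[str, int]:
    """Split `total` slots across keys in proportion to integer `weights`."""
    weight_sum = sum(weights.values())
    if weight_sum <= 0 or total <= 0:
        return dict.fromkeys(weights, 0)
    # Floor of each exact quota, computed with integers to avoid float ties.
    alloc = {key: total * w // weight_sum for key, w in weights.items()}
    remainder = {key: total * w % weight_sum for key, w in weights.items()}
    leftover = total - sum(alloc.values())
    # Hand the leftover slots to the largest remainders (ties broken by key).
    ranked = sorted(weights, key=lambda key: (-remainder[key], key))
    for key in ranked[:leftover]:
        alloc[key] += 1
    return alloc


# Hypothetical per-month candidate counts, not taken from the live data.
counts = {"2026-01": 46, "2026-02": 30, "2026-03": 15, "2026-04": 9}
print(largest_remainder(counts, 10))
# {'2026-01': 5, '2026-02': 3, '2026-03': 1, '2026-04': 1}
```

In this toy case the floors sum to 8, so the two spare slots go to the months with the largest remainders (April, then January), which is how a quiet month can still end up with a slot.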

Source is derived from the domain (collapsing subdomains onto a family, e.g. sport1.maariv.co.il → maariv) because source_hints records the discovery engine (exa/brave), not the publisher.
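A rough sketch of how a domain could be collapsed onto a source family follows. The body of `candidate_source_key` is not shown in this PR, so the function `source_family` and its suffix heuristic (a small hard-coded set of second-level labels under two-letter country codes) are assumptions for illustration only; a more robust version would likely consult the Public Suffix List.

```python
from urllib.parse import urlsplit

# Assumed second-level labels that sit directly under a country-code TLD.
_SECOND_LEVEL = {"co", "org", "ac", "gov", "net", "com"}


def source_family(url_or_host: str) -> str:
    """Return the registrable label of a host, e.g. 'maariv' for sport1.maariv.co.il."""
    if "://" in url_or_host:
        host = urlsplit(url_or_host).hostname or ""
    else:
        host = url_or_host
    labels = [label for label in host.lower().split(".") if label]
    if len(labels) >= 3 and len(labels[-1]) == 2 and labels[-2] in _SECOND_LEVEL:
        labels = labels[:-2]  # drop a two-part suffix such as "co.il"
    elif len(labels) >= 2:
        labels = labels[:-1]  # drop a one-part suffix such as "com"
    # The last remaining label is the family; subdomains before it are ignored.
    return labels[-1] if labels else ""


print(source_family("sport1.maariv.co.il"))            # maariv
print(source_family("https://www.ynet.co.il/news/1"))  # ynet
```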

New module: denbust/discovery/balanced_selection.py

  • candidate_source_key, candidate_month
  • largest_remainder_allocation (Hamilton method)
  • select_within_month (round-robin)
  • plan_balanced_scrape_batch (months → apportion → within-month → top-up)
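The round-robin step can be pictured with the sketch below. It is an assumption-laden illustration rather than the PR's `select_within_month`: the function name `select_round_robin`, the `(source, url)` tuple shape, and the `priority` mapping (higher number visited first) are all hypothetical.

```python
from collections import defaultdict, deque


def select_round_robin(
    candidates: list[tuple[str, str]],
    slots: int,
    priority: dict[str, int],
) -> list[tuple[str, str]]:
    """Pick up to `slots` candidates, taking one per source on each pass."""
    queues: dict[str, deque[tuple[str, str]]] = defaultdict(deque)
    for source, url in candidates:
        queues[source].append((source, url))
    # Visit sources in descending priority; unknown sources count as 0.
    order = sorted(queues, key=lambda s: (-priority.get(s, 0), s))
    picked: list[tuple[str, str]] = []
    while len(picked) < slots and any(queues[s] for s in order):
        for source in order:
            if len(picked) >= slots:
                break
            if queues[source]:
                picked.append(queues[source].popleft())
    return picked


# Hypothetical month with one prolific source and two smaller ones.
month = [("maariv", "m1"), ("maariv", "m2"), ("maariv", "m3"),
         ("ynet", "y1"), ("walla", "w1")]
print(select_round_robin(month, 4, {"ynet": 3, "maariv": 2, "walla": 1}))
# [('ynet', 'y1'), ('maariv', 'm1'), ('walla', 'w1'), ('maariv', 'm2')]
```

Even though the prolific source holds three of the five candidates here, it only gets a second slot after every other source has been visited once.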

Wiring

  • Config.scrape_balanced_batch_size: int | None
  • CLI: denbust run … --balanced-batch N
  • _run_candidate_scrape_job: balanced branch selects from the full prefilter-passing pool instead of the priority-ordered limit head
denbust run --dataset news_items --job scrape_candidates \
            --config agents/news/local_search_brave_exa.yaml \
            --balanced-batch 60

Test plan

  • 10 new unit tests in test_balanced_selection.py (apportionment, round-robin spread, month weighting, cap, top-up, undated exclusion)
  • 1335 unit tests pass, 0 failures
  • ruff + mypy clean
  • Validated against live data: 60-batch from 1,845 Stage-B passers → Jan 20 / Feb 17 / Mar 14 / Apr 3 / May 6

🤖 Generated with Claude Code

Adds --balanced-batch N to the scrape_candidates job for working through the
backlog in fixed-size batches that respect a limited scraping budget:

- Frequency-weighted across months: months with more prefilter-passing
  candidates get proportionally more of the batch (largest-remainder /
  Hamilton apportionment), so busy months (Jan–Mar) are prioritised over
  quiet ones without starving them.
- Source-balanced within each month: a month's allocation is spread
  round-robin across distinct publication sources/source families, visited in
  descending scrape-priority order, so one prolific site cannot monopolise a
  month's slots (also spreads load to reduce per-site rate-limit/block risk).

Publication source is derived from the candidate domain (collapsing
subdomains onto a family, e.g. sport1.maariv.co.il -> maariv), since
source_hints records the discovery engine (exa/brave), not the publisher.

New module denbust/discovery/balanced_selection.py:
  - candidate_source_key / candidate_month
  - largest_remainder_allocation
  - select_within_month (round-robin)
  - plan_balanced_scrape_batch (months -> apportion -> within-month -> top-up)

Wiring:
  - Config.scrape_balanced_batch_size: int | None
  - CLI: denbust run ... --balanced-batch N
  - _run_candidate_scrape_job: balanced branch draws the batch from the full
    prefilter-passing pool instead of the priority-ordered limit head

Usage:
  denbust run --dataset news_items --job scrape_candidates \
              --config agents/news/local_search_brave_exa.yaml \
              --balanced-batch 60

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 10, 2026 12:39
@shaypal5 shaypal5 merged commit 95d2184 into main Jun 10, 2026
@shaypal5 shaypal5 deleted the codex/balanced-scrape-batch branch June 10, 2026 12:39
@github-actions


pr-agent-context report:

This run includes patch coverage gaps on PR #188.

Address the patch coverage gaps below, then push all of these changes in a single commit.

# Patch coverage

Patch test coverage is 77.23%; please raise it to 100%. These are the uncovered code lines:
- src/denbust/discovery/balanced_selection.py: 65, 66, 67, 68, 169, 186, 187, 188, 189, 190, 191, 192
- src/denbust/pipeline.py: 1438, 1439, 1440, 1441, 1442, 1447, 1448, 3348, 3349, 3350, 3351

Run metadata:

Tool ref: v4.0.19
Tool version: 4.0.19
Trigger: pull request opened
Workflow run: 27276850469 attempt 1
Comment timestamp: 2026-06-10T12:43:01.516051+00:00
PR head commit: 82688d1e43e545ea8b61b4ce4313d8241d36e294

Copilot AI left a comment

Contributor

Pull request overview

This PR adds an optional “balanced batch” mode for the scrape_candidates job to select a fixed-size scrape batch that (a) allocates slots across months proportionally to candidate volume and (b) balances selections within each month across publication sources (domain families), helping drain backlog under a constrained scrape budget.

Changes:

  • Added --balanced-batch N CLI option and Config.scrape_balanced_batch_size wiring into the scrape-candidates pipeline path.
  • Introduced denbust.discovery.balanced_selection implementing month apportionment (largest remainder) plus within-month source round-robin selection.
  • Added unit tests covering allocation, round-robin balancing, top-up behavior, and undated exclusion.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/denbust/discovery/balanced_selection.py | New balanced batch planning utilities (month weighting + source balancing). |
| src/denbust/pipeline.py | Adds balanced selection branch to candidate scraping job and plumbs config through. |
| src/denbust/config.py | Adds scrape_balanced_batch_size config field. |
| src/denbust/cli.py | Adds --balanced-batch option and passes it through to run_job. |
| tests/unit/test_balanced_selection.py | New unit tests for allocation and selection logic. |
| tests/unit/test_cli.py | Updates fake runner signature to accept the new config parameter. |
| tests/unit/test_pipeline_core.py | Updates fakes/config scaffolding for the new balanced batch size parameter. |

Comment thread src/denbust/pipeline.py
Comment on lines +1440 to +1446
if pub_date_from is not None:
    cutoff_month = pub_date_from.strftime("%Y-%m")
    eligible = [
        candidate
        for candidate in eligible
        if (month := candidate_month(candidate)) is not None and month >= cutoff_month
    ]
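The filter above compares month strings directly. Assuming `candidate_month` returns a zero-padded `YYYY-MM` string or `None` (its body is not shown here), lexicographic order matches chronological order, as this small sketch with made-up values suggests. It also shows that the cutoff works at month granularity: a `pub_date_from` in mid-February keeps all of February.

```python
from datetime import date

# Hypothetical cutoff date; only its year and month survive the formatting.
pub_date_from = date(2026, 2, 15)
cutoff_month = pub_date_from.strftime("%Y-%m")  # "2026-02"

# Made-up month keys, including an undated candidate (None).
months = ["2025-12", "2026-01", "2026-02", "2026-10", None]

# Zero-padded "YYYY-MM" strings sort the same way as the dates they encode.
kept = [m for m in months if m is not None and m >= cutoff_month]
print(kept)  # ['2026-02', '2026-10']
```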
Comment thread src/denbust/pipeline.py
    plan_backfill_windows,
    resolve_backfill_request_window,
)
from denbust.discovery.balanced_selection import candidate_month, plan_balanced_scrape_batch
Comment on lines +89 to +93
weight_sum = sum(weights.values())
if weight_sum <= 0 or total <= 0:
    return dict.fromkeys(weights, 0)
exact = {key: total * value / weight_sum for key, value in weights.items()}
allocation = {key: int(value) for key, value in exact.items()}
Comment on lines +26 to +29
from denbust.discovery.scrape_queue import (
    _SOURCE_SCRAPE_PRIORITY,
    _backfill_publication_datetime,
)