feat(scrape): month-frequency-weighted, source-balanced batch selection #188
Merged
Adds --balanced-batch N to the scrape_candidates job for working through the
backlog in fixed-size batches that respect a limited scraping budget:
- Frequency-weighted across months: months with more prefilter-passing
candidates get proportionally more of the batch (largest-remainder /
Hamilton apportionment), so busy months (Jan–Mar) are prioritised over
quiet ones without starving them.
- Source-balanced within each month: a month's allocation is spread
round-robin across distinct publication sources/source families, visited in
descending scrape-priority order, so one prolific site cannot monopolise a
month's slots (also spreads load to reduce per-site rate-limit/block risk).
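The month apportionment described above can be sketched as follows. This is a minimal illustration of the largest-remainder (Hamilton) method, not the PR's actual code: the function name `apportion_largest_remainder`, the tie-break rule (larger fractional remainder first, then key order), and the sample month counts are all assumptions.

```python
def apportion_largest_remainder(weights: dict[str, int], total: int) -> dict[str, int]:
    """Split `total` slots across keys in proportion to their weights."""
    weight_sum = sum(weights.values())
    if weight_sum <= 0 or total <= 0:
        return dict.fromkeys(weights, 0)

    # Each key's exact (fractional) share, then its floor.
    exact = {key: total * value / weight_sum for key, value in weights.items()}
    allocation = {key: int(share) for key, share in exact.items()}

    # Hand the slots lost to flooring to the largest fractional remainders.
    # Assumed tie-break: larger remainder first, then key order.
    leftover = total - sum(allocation.values())
    by_remainder = sorted(weights, key=lambda key: (-(exact[key] - allocation[key]), key))
    for key in by_remainder[:leftover]:
        allocation[key] += 1
    return allocation


# Hypothetical per-month counts of prefilter-passing candidates.
monthly_counts = {"2026-01": 45, "2026-02": 30, "2026-03": 20, "2026-04": 5}
print(apportion_largest_remainder(monthly_counts, 12))
```

With these sample counts the exact shares are 5.4, 3.6, 2.4 and 0.6, so the two slots left after flooring go to 2026-02 and 2026-04.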
Publication source is derived from the candidate domain (collapsing
subdomains onto a family, e.g. sport1.maariv.co.il -> maariv), since
source_hints records the discovery engine (exa/brave), not the publisher.
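One way the domain-to-family collapse could work is sketched below. The helper name `source_family` and the short suffix set are assumptions made for illustration; the PR's `candidate_source_key` may handle suffixes differently, and a production version would more likely consult the full Public Suffix List.

```python
from urllib.parse import urlsplit

# Assumed, deliberately small set of two-label public suffixes.
_TWO_LABEL_SUFFIXES = {"co.il", "org.il", "ac.il", "gov.il", "net.il", "co.uk"}


def source_family(url_or_domain: str) -> str:
    """Collapse a URL or bare hostname onto its publisher family label."""
    if "://" in url_or_domain:
        host = urlsplit(url_or_domain).hostname or ""
    else:
        host = url_or_domain
    host = host.lower().strip(".")
    labels = host.split(".")

    # sport1.maariv.co.il -> labels[-2:] is "co.il", so the family is labels[-3].
    if len(labels) >= 3 and ".".join(labels[-2:]) in _TWO_LABEL_SUFFIXES:
        return labels[-3]
    # www.example.com -> single-label suffix, so the family is labels[-2].
    if len(labels) >= 2:
        return labels[-2]
    return host


print(source_family("sport1.maariv.co.il"))
print(source_family("https://www.example.com/some/article"))
```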
New module denbust/discovery/balanced_selection.py:
- candidate_source_key / candidate_month
- largest_remainder_allocation
- select_within_month (round-robin)
- plan_balanced_scrape_batch (months -> apportion -> within-month -> top-up)
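The within-month step can be illustrated with a small round-robin over per-source queues. This is a sketch under stated assumptions, not the module's `select_within_month`: the name `round_robin_by_source`, the `(source, url)` tuple shape, and the priority table are hypothetical, and the top-up stage is not covered.

```python
from collections import defaultdict, deque


def round_robin_by_source(candidates, slots, priority):
    """Pick up to `slots` items, taking one per source on each pass.

    candidates: list of (source, url) tuples, already in preferred order
    priority:   mapping of source -> scrape priority (higher is visited first)
    """
    queues = defaultdict(deque)
    for source, url in candidates:
        queues[source].append(url)

    # Visit sources in descending priority; unknown sources default to 0
    # and ties fall back to name order (both are assumptions).
    order = sorted(queues, key=lambda source: (-priority.get(source, 0), source))

    picked = []
    while len(picked) < slots and any(queues[source] for source in order):
        for source in order:
            if len(picked) >= slots:
                break
            if queues[source]:
                picked.append((source, queues[source].popleft()))
    return picked


# Hypothetical month with one prolific source and two small ones.
month_candidates = [
    ("maariv", "m1"), ("maariv", "m2"), ("maariv", "m3"),
    ("site_b", "b1"), ("site_c", "c1"),
]
print(round_robin_by_source(month_candidates, 4, {"maariv": 3, "site_b": 2}))
```

Even though the prolific source holds three of the five candidates, it receives only two of the four slots, because every other source is visited once before it is visited again.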
Wiring:
- Config.scrape_balanced_batch_size: int | None
- CLI: denbust run ... --balanced-batch N
- _run_candidate_scrape_job: balanced branch draws the batch from the full
prefilter-passing pool instead of the priority-ordered limit head
Usage:
    denbust run --dataset news_items --job scrape_candidates \
      --config agents/news/local_search_brave_exa.yaml \
      --balanced-batch 60
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
pr-agent-context report: This run includes patch coverage gaps on PR #188.
Address the patch coverage gaps below, then push all of these changes in a single commit.
# Patch coverage
Patch test coverage is 77.23%; please raise it to 100%. These are the uncovered code lines:
- src/denbust/discovery/balanced_selection.py: 65, 66, 67, 68, 169, 186, 187, 188, 189, 190, 191, 192
- src/denbust/pipeline.py: 1438, 1439, 1440, 1441, 1442, 1447, 1448, 3348, 3349, 3350, 3351
Pull request overview
This PR adds an optional “balanced batch” mode for the scrape_candidates job to select a fixed-size scrape batch that (a) allocates slots across months proportionally to candidate volume and (b) balances selections within each month across publication sources (domain families), helping drain backlog under a constrained scrape budget.
Changes:
- Added `--balanced-batch N` CLI option and `Config.scrape_balanced_batch_size` wiring into the scrape-candidates pipeline path.
- Introduced `denbust.discovery.balanced_selection`, implementing month apportionment (largest remainder) plus within-month source round-robin selection.
- Added unit tests covering allocation, round-robin balancing, top-up behavior, and undated exclusion.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `src/denbust/discovery/balanced_selection.py` | New balanced batch planning utilities (month weighting + source balancing). |
| `src/denbust/pipeline.py` | Adds balanced selection branch to candidate scraping job and plumbs config through. |
| `src/denbust/config.py` | Adds `scrape_balanced_batch_size` config field. |
| `src/denbust/cli.py` | Adds `--balanced-batch` option and passes it through to `run_job`. |
| `tests/unit/test_balanced_selection.py` | New unit tests for allocation and selection logic. |
| `tests/unit/test_cli.py` | Updates fake runner signature to accept the new config parameter. |
| `tests/unit/test_pipeline_core.py` | Updates fakes/config scaffolding for the new balanced batch size parameter. |
Comment on lines +1440 to +1446
    if pub_date_from is not None:
        cutoff_month = pub_date_from.strftime("%Y-%m")
        eligible = [
            candidate
            for candidate in eligible
            if (month := candidate_month(candidate)) is not None and month >= cutoff_month
        ]
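The cutoff filter quoted above compares month strings directly. A small sketch, assuming months are zero-padded `YYYY-MM` strings as the `strftime("%Y-%m")` call suggests, shows why plain string comparison matches chronological order and that the cutoff works at month granularity. The helper name `months_on_or_after` is hypothetical.

```python
from datetime import date


def months_on_or_after(months, pub_date_from):
    """Keep dated months that are not earlier than the cutoff's month."""
    cutoff_month = pub_date_from.strftime("%Y-%m")
    # Zero-padded "YYYY-MM" strings sort lexicographically in date order,
    # so ">=" on strings is a valid chronological comparison here.
    return [month for month in months if month is not None and month >= cutoff_month]


sample = ["2025-12", "2026-01", "2026-02", None, "2026-10"]
print(months_on_or_after(sample, date(2026, 2, 15)))
```

Because only the month is compared, a candidate published on 2026-02-01 would still pass a cutoff of 2026-02-15 in this sketch.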
        plan_backfill_windows,
        resolve_backfill_request_window,
    )
    from denbust.discovery.balanced_selection import candidate_month, plan_balanced_scrape_batch
Comment on lines +89 to +93
    weight_sum = sum(weights.values())
    if weight_sum <= 0 or total <= 0:
        return dict.fromkeys(weights, 0)
    exact = {key: total * value / weight_sum for key, value in weights.items()}
    allocation = {key: int(value) for key, value in exact.items()}
Comment on lines +26 to +29
    from denbust.discovery.scrape_queue import (
        _SOURCE_SCRAPE_PRIORITY,
        _backfill_publication_datetime,
    )
Summary
Adds `--balanced-batch N` to `scrape_candidates` for draining the 2026 backlog in fixed-size batches that respect a limited scraping budget (block/rate-limit risk, time cost).
Two balancing axes:
- Across months: frequency-weighted, so months with more prefilter-passing candidates get proportionally more of the batch (largest-remainder / Hamilton apportionment).
- Within each month: source-balanced, with the month's allocation spread round-robin across publication sources in descending scrape-priority order.
Source is derived from the domain (collapsing subdomains onto a family, e.g. sport1.maariv.co.il → maariv) because `source_hints` records the discovery engine (exa/brave), not the publisher.
New module: `denbust/discovery/balanced_selection.py`
- `candidate_source_key`, `candidate_month`
- `largest_remainder_allocation` (Hamilton method)
- `select_within_month` (round-robin)
- `plan_balanced_scrape_batch` (months → apportion → within-month → top-up)
Wiring
- `Config.scrape_balanced_batch_size: int | None`
- `denbust run … --balanced-batch N`
- `_run_candidate_scrape_job`: balanced branch selects from the full prefilter-passing pool instead of the priority-ordered `limit` head
Usage
    denbust run --dataset news_items --job scrape_candidates \
      --config agents/news/local_search_brave_exa.yaml \
      --balanced-batch 60
Test plan
- `test_balanced_selection.py` (apportionment, round-robin spread, month weighting, cap, top-up, undated exclusion)
🤖 Generated with Claude Code