
Recommendation: weekly→daily — decouple live-scrape from dump/report #17

Description

@Seungpyo1007

Context

Moving the weekly automation to a daily cadence. The three scheduled workflows differ a lot in weight and side effects, so I recommend a split rather than flipping all three to a daily cron uniformly.

Current schedules (all Mondays):

  • weekly-refresh.yml (0 6 * * 1): enrich (12 live sources) → validate → integrity gate → dump → PR to TechAPI
  • coverage-report.yml (23 6 * * 1): coverage diff → sticky issue (in-place)
  • weekly-ingest.yml (29 6 * * 1): upstream catalog scrape → draft-SKU PR for curator review

Recommended direction

1. Cron: keep the minute/hour stagger and flip only the day-of-week field from 1 to * (preserves the refresh→coverage→ingest ordering):

  • refresh 0 6 * * *, coverage 23 6 * * *, ingest 29 6 * * *
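
As a quick sanity check on that change, the sketch below rewrites only the fifth (day-of-week) field of each cron expression and confirms the minute/hour stagger is untouched. The helper name `to_daily` is hypothetical and not part of the repo; the schedules are the three listed above. Whether each workflow should actually go daily is a separate question, covered in item 2.

```python
def to_daily(cron: str) -> str:
    """Return the cron expression with its day-of-week field set to '*'."""
    fields = cron.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 cron fields, got {len(fields)}: {cron!r}")
    fields[4] = "*"  # fields are: minute, hour, day-of-month, month, day-of-week
    return " ".join(fields)


# Current Monday-only schedules, copied from the list above.
WEEKLY = {
    "weekly-refresh.yml": "0 6 * * 1",
    "coverage-report.yml": "23 6 * * 1",
    "weekly-ingest.yml": "29 6 * * 1",
}

DAILY = {name: to_daily(expr) for name, expr in WEEKLY.items()}

for name, expr in DAILY.items():
    print(f"{name}: {expr}")
```

Because only the last field changes, the 06:00 / 06:23 / 06:29 start times, and therefore the relative order of the three workflows, stay the same.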

2. Don't make all three daily as-is — they have different weight:

  • coverage-report → daily ✅ (cheap, sticky issue in-place; no downside)
  • dump regeneration → daily ✅, but only after the timestamp fix below
  • live enrich (12-source scrape) → keep weekly. Daily scraping of Wikipedia/cpubenchmark/topcpu etc. raises ToS, rate-limit, and load concerns. Decouple "scrape benchmarks" (weekly) from "regenerate dump + report from already-curated data/" (daily).
  • weekly-ingest (draft-SKU PRs) → keep weekly (or switch to a single sticky branch/PR). A new curator-review PR every day will overwhelm review.
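
The split above can be sketched as a small stage selector. This is an illustration only: in practice the split would more likely live in separate workflow files with different cron lines. The stage names and the `stages_for` function are hypothetical, and keeping the weekly stages on Monday is an assumption taken from the current schedules.

```python
import datetime as dt


def stages_for(day: dt.date) -> list[str]:
    """Return the stages to run on `day`, in refresh -> coverage -> ingest order."""
    # Python's weekday() uses 0 for Monday; cron's day-of-week uses 1 for Monday.
    is_monday = day.weekday() == 0
    stages = []
    if is_monday:
        stages.append("enrich")  # live 12-source scrape, weekly only
    stages += ["dump", "coverage-report"]  # cheap, run every day
    if is_monday:
        stages.append("ingest")  # draft-SKU PR for curator review, weekly only
    return stages


print(stages_for(dt.date(2024, 1, 1)))  # a Monday
print(stages_for(dt.date(2024, 1, 2)))  # a Tuesday
```

On non-Monday runs the dump and coverage report are regenerated from whatever is already in data/, with no upstream requests.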

3. Prerequisite for daily dump: make app.dump deterministic.
Today the dump is a stateless rebuild that stamps created_at/updated_at = build-time on every run, so a daily refresh PR would carry ~1400-file timestamp churn even with zero data changes — daily PRs become unreviewable noise. Fix before going daily: preserve created_at, set updated_at only on real change (or drop build timestamps / pin SOURCE_DATE_EPOCH). Then daily PRs contain only real data deltas.
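
A minimal sketch of the "preserve created_at, bump updated_at only on real change" option follows. It assumes the previous dump's records can be read at build time (the current dump is stateless, so that would be a change) and that each record is a flat JSON-serialisable dict. The function names and record fields are hypothetical, not the actual app.dump API.

```python
import hashlib
import json
from typing import Optional

TIMESTAMP_KEYS = ("created_at", "updated_at")


def content_hash(record: dict) -> str:
    """Hash a record's data, ignoring the timestamp fields."""
    payload = {k: v for k, v in record.items() if k not in TIMESTAMP_KEYS}
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def stamp(new: dict, previous: Optional[dict], now: str) -> dict:
    """Return `new` with timestamps that change only when the data does."""
    if previous is None:
        # First time this record is dumped: both timestamps are the build time.
        return {**new, "created_at": now, "updated_at": now}
    if content_hash(new) == content_hash(previous):
        # No data change: carry both timestamps over so the file is byte-stable.
        return {**new, "created_at": previous["created_at"],
                "updated_at": previous["updated_at"]}
    # Real change: keep the original creation time, bump only updated_at.
    return {**new, "created_at": previous["created_at"], "updated_at": now}


old = {"sku": "example-cpu", "score": 100,
       "created_at": "2024-01-01T06:00:00Z", "updated_at": "2024-01-01T06:00:00Z"}
unchanged = stamp({"sku": "example-cpu", "score": 100}, old, "2024-01-02T06:00:00Z")
changed = stamp({"sku": "example-cpu", "score": 110}, old, "2024-01-02T06:00:00Z")
print(unchanged == old)       # True
print(changed["updated_at"])  # 2024-01-02T06:00:00Z
```

With this shape, a run with zero data changes reproduces the previous files exactly, so the daily PR diff is empty rather than ~1400 timestamp-only edits.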

4. Housekeeping for daily:

  • Rename weekly-* files / concurrency.group / comments → daily-* (or neutral) to avoid misleading names.
  • Auto-merge the refresh PR (dated refresh/<date> branch) or commit directly, so daily PRs don't backlog.
  • Add concurrency: groups to coverage-report and weekly-ingest (refresh already has one) so a slow run doesn't overlap the next day's.
  • integrity_check.py --strict will be exercised daily — per-source enrich failures already ::warning::-skip, so a flaky upstream just skips that day's PR (harmless), but worth monitoring.

One-line summary

Split "scrape (weekly)" from "dump + coverage (daily)"; make the dump deterministic first so daily PRs aren't 1400-file timestamp churn; flip only the day-of-week field and keep the stagger.
