feat(ingest): weekly crawler that opens PRs to TechAPI with new SKUs #2

@Seungpyo1007

Purpose

Long-running TechEngine tracker for finding missing TechAPI records, drafting import candidates, requesting PR validation, and keeping TechAPI data PRs reviewable.

This issue is the TechEngine-side companion to GetTechAPI/TechAPI#1. It should stay open while the dataset rebuild continues. TechAPI PRs may use "Closes GetTechAPI/TechAPI#1" for Development linking, but this TechEngine issue tracks the automation and validation work behind those PRs.

Current Status

Latest Dataset Snapshot From PR #25

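Category   | Total  | Verified | Unverified | Missing verified | Verified %
-----------|--------|----------|------------|------------------|-----------
brand      | 129    | 0        | 0          | 129              | n/a
soc        | 195    | 58       | 137        | 0                | 29.7%
smartphone | 6,544  | 184      | 6,360      | 0                | 2.8%
gpu        | 2,030  | 0        | 2,030      | 0                | 0.0%
cpu        | 3,977  | 976      | 3,001      | 0                | 24.5%
all        | 12,896 | 1,218    | 11,549     | 129              | 9.5%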
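
The Verified % column appears to be the verified count divided by verified plus unverified, with records that are missing a verified field left out of the denominator. That formula is inferred from the snapshot numbers, not confirmed by TechEngine code. A minimal sketch under that assumption, with hypothetical helper names:

```python
from typing import Optional


def verified_pct(verified: int, unverified: int) -> Optional[float]:
    """Verified share, in percent, of records that carry a verified flag.

    Assumption: records missing the flag are excluded from the denominator.
    """
    flagged = verified + unverified
    if flagged == 0:
        return None  # nothing to measure, reported as n/a
    return round(100.0 * verified / flagged, 1)


def format_pct(value: Optional[float]) -> str:
    """Render a percentage the way the snapshot table does."""
    return "n/a" if value is None else f"{value:.1f}%"


# (verified, unverified) pairs copied from the snapshot table.
snapshot = {
    "brand": (0, 0),
    "soc": (58, 137),
    "smartphone": (184, 6360),
    "gpu": (0, 2030),
    "cpu": (976, 3001),
    "all": (1218, 11549),
}

report = {name: format_pct(verified_pct(v, u)) for name, (v, u) in snapshot.items()}
```

Under this assumption every row of the snapshot is reproduced, including n/a for brand and 9.5% for the overall row.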

TechEngine Responsibilities

  • Maintain coverage reports that show missing CPU, GPU, smartphone, SoC, and brand records
  • Generate or support import batches where source coverage is useful
  • Validate TechAPI PRs from reusable branches such as data/import-staging and feat/site
  • Post two useful PR comments:
    • Changed-data review: what changed, examples, source/verified counts, and heuristic warnings
    • Validation stats: totals, verified coverage, warning callouts, and key command output
  • Keep validation warnings actionable without failing expected bulk-import cases
  • Add site-build verification only when site/ files changed
  • Keep project metadata, priority, labels, milestones, assignees, and issue linkage filled in automatically where possible
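
The "only when site/ files changed" rule can be sketched as a check over the list of changed paths in a PR. This is an illustration, not the TechEngine implementation: the function name is hypothetical, and how the path list is obtained (for example from `git diff --name-only`) is left outside the sketch.

```python
from typing import Iterable

SITE_PREFIX = "site/"  # directory named in this issue; exact layout is assumed


def site_build_needed(changed_paths: Iterable[str]) -> bool:
    """Return True when at least one changed file lives under site/."""
    return any(
        path.replace("\\", "/").startswith(SITE_PREFIX)  # tolerate Windows separators
        for path in changed_paths
    )
```

A data-only PR returns False and skips the site build; a PR that touches any file under site/ returns True.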

Validation Contract

Data PR validation should include:

Check                                           | Expected behavior
------------------------------------------------|------------------
python -m app.validate                          | Hard fail on schema/API validation errors
python integrity_check.py TechAPI/data --strict | Hard fail on structural anomalies; summarize stable advisory outliers
Verified coverage warning                       | Warn when category or overall verified coverage is low; do not fail bulk import PRs
Heuristic review                                | Flag suspicious names, typo-like patterns, duplicates, or source artifacts
Site build                                      | Run only when TechAPI site/ files changed
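
One way to keep hard failures and advisory warnings separate is to collect them in different lists, so that only the two commands can fail a PR. A minimal sketch, assuming a nonzero exit code means failure (the issue does not state the exit-code convention); `ContractResult` and `evaluate_contract` are hypothetical names:

```python
import subprocess
from dataclasses import dataclass, field
from typing import Callable, List, Sequence

# Commands copied from the contract table; both are hard-fail checks.
HARD_FAIL_COMMANDS: List[List[str]] = [
    ["python", "-m", "app.validate"],
    ["python", "integrity_check.py", "TechAPI/data", "--strict"],
]


def run_command(cmd: Sequence[str]) -> int:
    """Run one command and return its exit code."""
    return subprocess.run(list(cmd), check=False).returncode


@dataclass
class ContractResult:
    failures: List[str] = field(default_factory=list)
    warnings: List[str] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        # Warnings never affect the outcome; only hard failures do.
        return not self.failures


def evaluate_contract(
    runner: Callable[[Sequence[str]], int] = run_command,
    advisory: Sequence[str] = (),
) -> ContractResult:
    """Run the hard-fail commands and attach advisory warnings unchanged."""
    result = ContractResult()
    for cmd in HARD_FAIL_COMMANDS:
        code = runner(cmd)
        if code != 0:  # assumed convention: nonzero exit means failure
            result.failures.append(f"{' '.join(cmd)} exited with {code}")
    result.warnings.extend(advisory)
    return result
```

The `runner` parameter exists so the logic can be exercised without the real commands, for example `evaluate_contract(runner=lambda cmd: 0, advisory=["gpu verified coverage is low"])`, which passes while still carrying the warning.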

Recent Linked Work

PR / Issue    | Repository | Status | Main change
--------------|------------|--------|------------
TechAPI#25    | TechAPI    | Open   | Import 5,000 PhoneDB raw smartphone variants plus 45 Mobiles 2025 records
TechAPI#24    | TechAPI    | Merged | Add smartphone and SoC records, improve PR metadata and project automation
TechAPI#23    | TechAPI    | Merged | Import a larger smartphone batch
TechAPI#22    | TechAPI    | Merged | Add smartphone and SoC records from Kaggle-derived sources
TechEngine#19 | TechEngine | Open   | Auto-generated coverage gaps report
TechEngine#18 | TechEngine | Open   | Deterministic dump timestamps for daily refresh readiness

Remaining Work

  • Keep coverage reports useful by reducing obvious table-artifact slugs and source noise
  • Expand automated candidate generation beyond CPU/GPU into smartphone and SoC sources where structured data is reliable
  • Continue improving PR comments so reviewers can see what changed without opening thousands of files
  • Make low verified coverage warnings clear and category-specific
  • Keep issue and project metadata synchronized for TechAPI and TechEngine PRs
  • Preserve weekly coverage workflows unless a separate daily/PR workflow is explicitly added
  • Avoid closing this tracker until the TechAPI dataset rebuild and supporting automation are mature
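
For the category-specific warning item above, one possible shape is a single line per category that names the category, its verified share, and the raw counts. The 50% threshold and the message format below are placeholders chosen for illustration, not project decisions:

```python
from typing import Dict, List, Tuple

LOW_COVERAGE_THRESHOLD = 50.0  # hypothetical cutoff, in percent


def coverage_warnings(
    counts: Dict[str, Tuple[int, int]],
    threshold: float = LOW_COVERAGE_THRESHOLD,
) -> List[str]:
    """Build one warning line per category whose verified share is below threshold.

    counts maps category -> (verified, unverified). Categories with no
    flagged records are skipped because no percentage can be computed.
    """
    warnings: List[str] = []
    for category, (verified, unverified) in counts.items():
        flagged = verified + unverified
        if flagged == 0:
            continue
        pct = 100.0 * verified / flagged
        if pct < threshold:
            warnings.append(
                f"[warn] {category}: {pct:.1f}% verified "
                f"({verified} of {flagged}), below {threshold:.0f}%"
            )
    return warnings
```

These lines are meant to be posted as advisory output only, so a bulk import PR with low coverage still passes validation.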

Operational Notes

  • Assignees: @Seungpyo1007 and @TechEngineBot
  • Labels: enhancement
  • Milestone: Daily automation
  • Project: TechEngine work
  • Priority: High
  • Start date: 2026-05-29
  • Target date: 2026-09-30

TechEngineBot should comment on relevant PRs and update linked tracker issues whenever validation or metadata automation runs.
