feat(verify): tiered data verification layer (Tier 0 offline scoring) #52
…ing)

Adds app/verify/, an existence/trust verification layer that sits above the structural validator (app/validate.py, untouched). It answers "does this record describe a real, existing device/part, confidently enough to set verified:true?" to lift the ~1.2% verified ratio.

- Tier 0 (offline, deterministic, all ~102k records): completeness + cross-field consistency (signals.py) + source-host trust (hosts.py) + provenance -> a green/yellow/red band. Full scores are cached to the gitignored data/_verify/state/; the tracked data/_verify/ledger.jsonl is reserved for promotion decisions.
- Tier 1 (http_check.py): source_urls HTTP liveness, urllib + ThreadPool, per-host rate limit, resumable TTL cache.
- Tier 2 (crossref.py): external cross-reference under a strict exact-heading rule (no fuzzy matching; ambiguous candidates never auto-promote).
- Tier 3 (promote.py): hybrid escalation + surgical verified:false->true write-back (only that token, atomic, LF-preserved, never clobbers curated data).

CLI: python -m app.verify score|report|check-urls|crossref|promote.

CI: non-blocking verify-offline job in validate-data.yml; scheduled/manual verify-network.yml for network tiers with a diff-scope guard.

Validates that the offline scorer reproduces the human-curated verified CPU set (40 tests pass).

Refs #1
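The Tier 3 write-back described in this commit (flip only the `verified` token, atomically, keeping LF line endings) could look roughly like the sketch below. This is not the code in promote.py: the function name `flip_verified`, the regex, and the assumption that each record is a JSON text file containing a single `"verified": false` pair are all illustrative.

```python
import os
import re
import tempfile

# Assumption: the record file is JSON text with one `"verified": false` pair.
_VERIFIED_FALSE = re.compile(rb'("verified"\s*:\s*)false\b')


def flip_verified(path: str) -> bool:
    """Rewrite only the `false` token after "verified"; return True if changed."""
    # Bytes mode avoids newline translation, so LF stays LF on every platform.
    with open(path, "rb") as fh:
        original = fh.read()

    updated, count = _VERIFIED_FALSE.subn(rb"\g<1>true", original, count=1)
    if count == 0:
        return False  # already true, or the key is absent: leave the file alone

    # Write a temp file in the same directory, then atomically swap it in.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(updated)
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
    return True
```

Because the substitution touches a single token and every other byte is copied through unchanged, inline arrays and hand-curated formatting are left as they were.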
Reworks how verification surfaces on PRs so TechEngineBot owns the analysis, instead of TechAPI running its own (failing) job:

- Remove the self-run verify-offline job from validate-data.yml. It failed because the stdlib-only CI image has no pytest, and having TechAPI score its own PRs duplicated what the bot should own. validate-data.yml is back to the pure structural gate.
- Add verify-report.yml: runs `app.verify score` (changed records + full baseline) and has TechEngineBot post the band histogram as a PR comment via ENGINE_TOKEN. Dormant if the token is unset; same-repo PRs only; never gates a merge; updates one marked comment in place.
- Add app/verify/** to request-engine-pr-validation paths so the engine's PR validation (and its TechEngineBot comment) also covers verifier changes.

Refs #1
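Updating one marked comment in place is commonly done by embedding a hidden HTML marker in the comment body and searching for it on later runs. The sketch below shows only that selection step. The marker string, the `plan_comment` helper, and its return shape are assumptions made for illustration; the real workflow uses a github-script step whose contents are not shown here. The `id` and `body` keys match the fields the GitHub REST API returns for issue comments.

```python
from typing import Iterable

# Hypothetical marker; the actual workflow may use a different string.
MARKER = "<!-- verify-report -->"


def plan_comment(existing: Iterable[dict], report: str) -> dict:
    """Decide whether to update the marked comment or create a new one."""
    body = f"{MARKER}\n{report}"
    for comment in existing:
        # A comment body can be empty, so guard against None.
        if MARKER in (comment.get("body") or ""):
            return {"action": "update", "comment_id": comment["id"], "body": body}
    return {"action": "create", "comment_id": None, "body": body}
```

The first run creates the comment; every later run finds the marker and edits the same comment, so the PR thread carries a single, current report.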
TechEngine change review: PASS
Changed data
Changed record examples
Heuristic review
TechEngine validation stats: PASS
Data summary
Warning: Tracked verified coverage is below 50% for brand 0.0% (0/189), tablet 0.0% (0/3048), watch 0.0% (0/378), pda 0.0% (0/110), gpu 0.0% (0/2030), smartphone 0.2% (184/90118), all 1.2% (1218/101954), soc 2.8% (58/2104), and 1 more.
Validation notes
Use TECHENGINEBOT_TOKEN (the bot's PAT) for the github-script step so the Tier 0 analysis comment is authored by TechEngineBot, falling back to ENGINE_TOKEN only to keep the workflow running if the bot token is absent. Refs #1
🔎 Data verification: Tier 0 (offline existence/trust)
Scored by green = authoritative source + complete + consistent · yellow = plausible, needs confirmation · red = sparse/weak source or a hard contradiction. Promotion to …
🔎 Data verification: Tier 0 (on demand)
Requested by @Seungpyo1007 via …
What
Adds `app/verify/`, an existence/trust verification layer that sits on top of the structural validator (`app/validate.py`, left untouched). Where `validate.py` only checks "is this file well-formed?", this answers "does this record describe a real, actually-existing device/part, confidently enough to set `verified:true`?" The goal is to lift the dataset's ~1.2% verified ratio.

Tiers
- Tier 0 (`offline.py` / `signals.py` / `hosts.py`): four sub-scores (completeness + cross-field consistency + source-host trust + provenance) → a green/yellow/red band. Hard contradictions (threads<cores, boost<base, future release) force red. Full scores are cached to the gitignored `data/_verify/state/`; the tracked `data/_verify/ledger.jsonl` is reserved for promotion decisions.
- Tier 1 (`http_check.py`): `source_urls` HTTP liveness (stdlib urllib + ThreadPool, per-host rate limit, resumable TTL cache).
- Tier 2 (`crossref.py`): external cross-reference under a strict exact-heading rule. No fuzzy matching (fuzzy serves the wrong SKU ~35% of the time), so ambiguous candidates never auto-promote.
- Tier 3 (`promote.py`): hybrid escalation + surgical `verified:false→true` write-back (only that token, atomic, LF-preserved; never clobbers curated records or reformats inline arrays).

CLI
`python -m app.verify score | report | check-urls | crossref | promote`

CI
- `verify-offline` job in `validate-data.yml` (continue-on-error, scores changed records + runs unit tests); never gates a merge.
- `verify-network.yml` for the network tiers, with a diff-scope guard that fails unless only `verified` flags + the ledger changed. No TechEngine/submodule ops.

Validation
40 tests pass. The golden-subset test confirms that the offline scorer, blind to the `verified` flag, reproduces the human-curated verified-CPU set (976/976 land green), which is the empirical justification for using the score to drive promotion.

Tuning note: `soc_not_after_device` is a soft signal, not a hard one. The dataset's SoC `release_date` values are largely placeholder `YYYY-01-01` dates that skew late, so a device-vs-SoC mismatch usually means the SoC date is wrong, not the device. Fixing those dates is a separate enrichment task.

Refs #1
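To make the Tier 0 banding rule concrete, here is a minimal sketch of "hard contradictions force red, otherwise band by score". It is not the logic in `offline.py` or `signals.py`. The record field names (`cores`, `threads`, `base_clock_mhz`, `boost_clock_mhz`, `release_date`), the equal-weight mean of the four sub-scores, and the 0.8 / 0.5 thresholds are all assumptions chosen for illustration.

```python
from datetime import date
from typing import Dict, List, Optional


def hard_contradictions(record: dict, today: Optional[date] = None) -> List[str]:
    """Return the hard contradictions present in a record (hypothetical fields)."""
    today = today or date.today()
    found = []
    cores, threads = record.get("cores"), record.get("threads")
    if cores is not None and threads is not None and threads < cores:
        found.append("threads<cores")
    base, boost = record.get("base_clock_mhz"), record.get("boost_clock_mhz")
    if base is not None and boost is not None and boost < base:
        found.append("boost<base")
    released = record.get("release_date")  # assumed ISO "YYYY-MM-DD" string
    if released and date.fromisoformat(released) > today:
        found.append("future release")
    return found


def band(record: dict, sub_scores: Dict[str, float], today: Optional[date] = None) -> str:
    """Map sub-scores in [0, 1] to a band; any hard contradiction forces red."""
    if hard_contradictions(record, today):
        return "red"
    score = sum(sub_scores.values()) / len(sub_scores)  # assumed equal weights
    if score >= 0.8:  # assumed threshold
        return "green"
    if score >= 0.5:  # assumed threshold
        return "yellow"
    return "red"
```

A soft signal such as `soc_not_after_device` would, under this sketch, lower one of the sub-scores instead of appearing in `hard_contradictions`, so a suspect SoC date alone cannot force a record to red.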