Skip to content

data(smartphone): import PhoneDB historic tail (1993-2009, ids 1-1587)#29

Merged
Seungpyo1007 merged 5 commits into
mainfrom
data/phonedb-tail
Jun 19, 2026
Merged

data(smartphone): import PhoneDB historic tail (1993-2009, ids 1-1587)#29
Seungpyo1007 merged 5 commits into
mainfrom
data/phonedb-tail

Conversation

@Seungpyo1007

Copy link
Copy Markdown
Member

Summary

Completes the PhoneDB import range with the historic tail — device ids 1–1587 (years 1993–2009), which neither my earlier batches (#27 / #28, top range) nor the concurrent session had reached. +868 variant-level smartphones + 75 SoC seeds (all verified: false, per-record source_urls), plus a refreshed dump.

Changes vs main

  • +868 smartphones — Nokia 109, HTC 63, Samsung 62, Motorola 45, O2 39, i-mate 37, Palm/Qtek/Sony… (classic 2000s & early-smartphone era).
  • +75 SoC seeds.
  • site/public/v1 dump refreshed → 25,523 smartphones / 1,848 SoC.
  • Distinct per-batch commit messages + Refs #1 in each commit body.

Validation (local)

  • python -m app.validate → ✅ passed
  • python TechEngine/integrity_check.py data --strict → ✅ no hard anomalies
  • Built in an isolated git worktree off origin/main (+ dump from a fresh DB) to avoid a concurrent session's writes to the shared data/. Branch data/phonedb-tail is used instead of the contested data/import-staging.

Closes #1

@Seungpyo1007

Copy link
Copy Markdown
Member Author

Validation review — PASS ✅

Commands (local): python -m app.validate ✅ · python TechEngine/integrity_check.py data --strict ✅ (no hard anomalies)
Data changes vs main: +868 smartphones, +75 SoC seeds, dump refreshed (→ 25,523 / 1,848).
Coverage: PhoneDB ids 1–1587 (years 1993–2009) — the historic tail not covered by #27/#28 or the concurrent session.
Example: data/smartphone/palm/2002/..., data/smartphone/nokia/... (early-smartphone / feature-phone era).
Heuristic warnings: none new (names normalized at scrape).
Build note: generated in an isolated git worktree off origin/main + fresh-DB dump, to avoid a concurrent session's writes to the shared data/.

@Seungpyo1007

Copy link
Copy Markdown
Member Author

Dataset stats (post-merge projection)

category total verified unverified verified %
smartphone 25,523 184 25,339 0.7%
soc 1,848 58 1,790 3.1%

Warning

Verified coverage is far below 50% — expected for bulk raw seed imports. Verification deferred to TechEngine / manual audit.

@TechEngineBot

Copy link
Copy Markdown
Member

TechEngine change review: PASS

Check Result
python -m app.validate PASS
python integrity_check.py TechAPI/data --strict PASS

Changed data

Category Added Modified Deleted Added verified Added unverified Added Kaggle-sourced
brand 0 0 0 0 0 0
soc 75 0 0 0 75 0
smartphone 868 0 0 0 868 0
gpu 0 0 0 0 0 0
cpu 0 0 0 0 0 0

Changed record examples

soc added

  • soc/arm/1996/arm-610.json - ARM 610
  • soc/arm/1997/arm-710a.json - ARM 710A
  • soc/arm/1997/dec-strongarm-sa-110.json - DEC StrongARM SA-110
  • soc/arm/1997/nec-vr4101.json - NEC VR4101
  • soc/arm/1997/philips-mp3910.json - Philips MP3910
  • soc/arm/1998/hitachi-superh-sh3.json - Hitachi SuperH SH3
  • soc/arm/1998/nec-vr4102.json - NEC VR4102
  • soc/arm/1998/nec-vr4111.json - NEC VR4111
  • soc/arm/1998/toshiba-tmpr3912u.json - Toshiba TMPR3912U
  • soc/arm/1998/toshiba-tmpr3922u.json - Toshiba TMPR3922U
  • soc/arm/1999/hitachi-superh-sh4.json - Hitachi SuperH SH4
  • soc/arm/1999/philips-pr31700.json - Philips PR31700
  • soc/arm/2000/arm-710t.json - ARM 710T
  • soc/arm/2000/nec-vr4122.json - NEC VR4122
  • soc/arm/2001/motorola-dragonball-mc68328.json - Motorola DragonBall MC68328
  • ... 60 more

smartphone added

  • smartphone/acer/2001/acer-s10.json - Acer s10
  • smartphone/acer/2002/acer-s50.json - Acer s50
  • smartphone/acer/2002/acer-s60.json - Acer s60
  • smartphone/acer/2003/acer-n10.json - Acer n10
  • smartphone/acer/2003/acer-n20.json - Acer n20
  • smartphone/acer/2003/acer-n20w.json - Acer n20w
  • smartphone/acer/2004/acer-n30.json - Acer n30
  • smartphone/acer/2004/acer-n35.json - Acer n35
  • smartphone/acer/2004/acer-n50.json - Acer n50
  • smartphone/acer/2005/acer-n300.json - Acer n300
  • smartphone/acer/2005/acer-n311.json - Acer n311
  • smartphone/acer/2006/acer-c510.json - Acer c510
  • smartphone/acer/2006/acer-c530.json - Acer c530
  • smartphone/acer/2006/acer-d155-d156.json - Acer d155 / d156
  • smartphone/allview/2014/allview-x1-xtreme.json - Allview X1 Xtreme
  • ... 853 more

Heuristic review

  • Added records by manufacturer/brand: nokia: 109, samsung: 67, htc: 63, motorola: 45, o2: 39, i-mate: 37, arm: 36, asus: 35

  • Added records by source class: other: 943

  • Heuristic warnings: 2 total; showing first 2.

    • soc: soc/arm/2008/n-a.json: placeholder-like token in name
    • smartphone: smartphone/i-mate/2005/i-mate-new-jam-jam-limited-edition-htc-magician.json: repeated adjacent word in name

@TechEngineBot

Copy link
Copy Markdown
Member

TechEngine validation stats: PASS

Data summary

Category Total Verified Unverified Missing verified Tracked Verified % of tracked
brand 189 0 60 129 60 0.0%
soc 1848 58 1790 0 1848 3.1%
smartphone 25523 184 25339 0 25523 0.7%
gpu 2030 0 2030 0 2030 0.0%
cpu 3977 976 3001 0 3977 24.5%
all 33567 1218 32220 129 33438 3.6%

Warning

Tracked verified coverage is below 50% for brand 0.0% (0/60), gpu 0.0% (0/2030), smartphone 0.7% (184/25523), soc 3.1% (58/1848), all 3.6% (1218/33438), cpu 24.5% (976/3977).
Tracked coverage excludes records missing the verified field; see the Missing verified column for those records.
This does not fail validation. Keep imported records verified: false until manual audit, but treat this as follow-up verification work before relying on the affected categories as curated data.

Validation notes

  • Full advisory outlier listings are suppressed on successful runs because they are dataset-wide and mostly stable between PRs.
  • Failure runs still include a detailed log excerpt for debugging.

Key output:

## app.validate
## integrity_check.py --strict
loaded CPU=3977 GPU=2030
✅ integrity gate: no hard anomalies.
Integrity section Flagged lines
structural 0
CPU name/tier consistency (desktop mainstream only) 0
CPU single>multi (cinebench/geekbench — should be multi>=single) 0
CPU era-vs-score outliers 8
CPU cross-source ratio outliers (possible wrong-variant) 152
GPU cross-source ratio outliers + sanity 18

@Seungpyo1007 Seungpyo1007 moved this from Todo to In Progress in TechAPI-Project Jun 19, 2026
@Seungpyo1007 Seungpyo1007 merged commit 7cca8c7 into main Jun 19, 2026
4 checks passed
@github-project-automation github-project-automation Bot moved this from In Progress to Done in TechAPI-Project Jun 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data Dataset changes enhancement New feature or request

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Massive dataset rebuild: CPU + brand + GPU + smartphone + SoC (1989-2026)

2 participants