fix(domain-rer): rebuild dataset from frwiki + OSM (#69)

Merged: Calixteair merged 1 commit into main from fix/rer-dataset-rebuild on May 19, 2026

@Calixteair
Owner

Summary

  • Rebuild domains/rer/entities.json from scratch using the frwiki category tree (Ligne <A..E> du RER d'Île-de-France) as the canonical station list, then OSM Overpass joined on wikidata=Q... to pull the operator-facing name tag.
  • Adds wiki_titles.json sidecar so fame_score_rer.py resolves frwiki titles in O(1) instead of guessing via TITLE_VARIANTS.
  • Dataset 230 → 241 entities, fame_score coverage 143 → 241 (62% → 100%).

Why

The v1 ingest (scripts/ingest/build_rer_dataset.py) filtered OSM relations on network="RER" (strict equality) and member nodes on role="stop" (strict equality). Together these filters lose every SNCF-operated portion of the network and silently drop nodes mapped with halt, station, or empty roles. Reported symptoms: missing Saint-Germain-en-Laye, Poissy, Cergy-le-Haut, Marne-la-Vallée - Chessy, Boissy-Saint-Léger, Melun, Tournan, plus a handful of OSM naming problems (e.g. duplicates such as "Val-de-Fontenay (RER A)" vs "Val de Fontenay").
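
To illustrate the failure mode described above, here is a minimal sketch of a strict versus a relaxed role filter over relation members. The member dictionaries mimic the general shape of Overpass JSON output, but the sample data, the `ACCEPTED_ROLES` set, and both function names are assumptions made for this example; this is not the code from either pipeline version.

```python
# Hypothetical illustration only: sample data and helper names are invented.
ACCEPTED_ROLES = {"stop", "halt", "station", ""}


def strict_stops(members):
    """v1-style filter: keep only nodes whose role is exactly 'stop'."""
    return [m["ref"] for m in members
            if m["type"] == "node" and m.get("role") == "stop"]


def relaxed_stops(members):
    """Relaxed filter: also keep nodes mapped as halt/station or with no role."""
    return [m["ref"] for m in members
            if m["type"] == "node" and m.get("role", "") in ACCEPTED_ROLES]


# Invented members of one route relation.
members = [
    {"type": "node", "ref": 1, "role": "stop"},
    {"type": "node", "ref": 2, "role": "halt"},
    {"type": "node", "ref": 3, "role": "station"},
    {"type": "node", "ref": 4, "role": ""},
    {"type": "way", "ref": 5, "role": ""},          # track geometry, not a station
    {"type": "node", "ref": 6, "role": "platform"},  # not treated as a station here
]

dropped = sorted(set(relaxed_stops(members)) - set(strict_stops(members)))
print("silently dropped by the strict filter:", dropped)
```

With this sample, the strict filter keeps one node out of four station-like nodes, which is the kind of silent loss the rebuild is meant to eliminate.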

WDQS was deliberately avoided: it has been rate-limited to ~1 req/min since the 2024 outage and returns Wikimedia error HTML at category scale.

Diff at a glance

| Metric | Before | After |
| --- | --- | --- |
| Entities | 230 | 241 |
| Fame coverage | 143 | 241 |
| Phantom entries | ~10 | 0 |
| Dataset version | 0.2.0 | 0.4.0 |

Notable changes:

  • +19 real additions, -8 removals (all dedup/rename via stable QID join).
  • Val de Fontenay now correctly tagged RER A + E.
  • Station names corrected from OSM ground truth: "Luxembourg" (not "Le Luxembourg"), "Invalides" (not "Les Invalides"), "Avenue Foch" (not "L'avenue Foch"), etc.
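
The dedup and rename behaviour can be sketched as a join keyed on the Wikidata QID. This is a simplified illustration under stated assumptions: the record shape (`qid`, `name`, `lines`, `source`), the function name `merge_by_qid`, and the placeholder QID are all hypothetical and are not taken from build_rer_dataset_v2.py.

```python
# Hypothetical sketch: record fields and the placeholder QID are invented.
def merge_by_qid(records):
    """Collapse records sharing a QID, union their lines, prefer the OSM name."""
    merged = {}
    for rec in records:
        entry = merged.setdefault(
            rec["qid"], {"qid": rec["qid"], "name": rec["name"], "lines": set()}
        )
        entry["lines"].update(rec["lines"])
        # Assumption: the OSM `name` tag is the operator-facing label and wins.
        if rec.get("source") == "osm":
            entry["name"] = rec["name"]
    return [{**e, "lines": sorted(e["lines"])} for e in merged.values()]


records = [
    {"qid": "Q_PLACEHOLDER", "name": "Val-de-Fontenay (RER A)",
     "lines": ["A"], "source": "frwiki"},
    {"qid": "Q_PLACEHOLDER", "name": "Val de Fontenay",
     "lines": ["E"], "source": "osm"},
]

stations = merge_by_qid(records)
print(stations)
```

Keying on the QID rather than on the display name is what lets two differently spelled records for the same station collapse into a single entity carrying both lines.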

Files

  • scripts/ingest/build_rer_dataset_v2.py (new) — main pipeline, defaults to dry-run, --apply writes.
  • scripts/ingest/fame_score_rer.py — now reads wiki_titles.json sidecar first, falls back to the previous TITLE_VARIANTS guess loop only for entities absent from the sidecar.
  • domains/rer/entities.json — rebuilt (241 entities, all fame-scored).
  • domains/rer/metadata.json — version 0.4.0, new dataset_sha256.
  • domains/rer/wiki_titles.json (new) — {entity_id: frwiki_title} sidecar, regenerated on every --apply.
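
The sidecar-first lookup with fallback might look roughly like the following. This is a sketch, not the actual fame_score_rer.py code: the function names, the entity ids, and the shape of the variants list (format templates here) are assumptions, since the PR does not show how TITLE_VARIANTS is defined.

```python
import json
from pathlib import Path


def load_sidecar(path):
    """Load the {entity_id: frwiki_title} mapping, or an empty dict if absent."""
    p = Path(path)
    return json.loads(p.read_text(encoding="utf-8")) if p.exists() else {}


def candidate_titles(entity_id, name, sidecar, title_variants):
    """Return the titles to try, sidecar first.

    A sidecar hit is a single dict lookup; only entities missing from the
    sidecar fall back to guessing from the (hypothetical) variant templates.
    """
    if entity_id in sidecar:
        return [sidecar[entity_id]]
    return [template.format(name=name) for template in title_variants]


# Hypothetical templates and entity ids, for illustration only.
variants = ["Gare de {name}", "{name} (RER)"]
sidecar = {"rer-poissy": "Gare de Poissy"}

print(candidate_titles("rer-poissy", "Poissy", sidecar, variants))
print(candidate_titles("rer-unknown", "Exemple", sidecar, variants))
```

Because the sidecar is regenerated on every --apply, the fallback path should only trigger for entities added outside the v2 pipeline.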

Test plan

  • CI green (cargo fmt + clippy + test, pnpm typecheck + lint, gitleaks, Trivy).
  • After merge: deploy via SSH ci-kalidokubao-deploy.sh kalidoku.
  • After deploy: wipe stale RER grids in prod DB (separate ops step, prepared SQL below) so the worker regenerates with the new dataset.
  • Spot-check in-game: load a fresh RER grid, confirm Saint-Germain-en-Laye / Poissy / Marne-la-Vallée-Chessy appear and accept their lines.

Follow-up (out of scope)

  • zone attribute still left at 0.0 (Wikidata P5031 not used, OSM doesn't expose fare zone). A future enrichment pass would need WDQS or a manual mapping for the ~5 zone-based predicates.

v1's pipeline used `network=RER`+`role=stop` only on OSM Overpass, which
filtered out SNCF-operated portions of every line and silently dropped
member nodes mapped with the `halt`/`station`/empty roles. Effect:
~10 missing terminus stations (Saint-Germain-en-Laye, Poissy,
Cergy-le-Haut, Marne-la-Vallée-Chessy, Boissy-Saint-Léger, Melun, Tournan, …)
plus a few OSM-name typos kept as-is.

v2 takes the canonical station list from the frwiki category tree
`Ligne <A..E> du RER d'Île-de-France`, then resolves each title via the
Wikipedia REST summary endpoint (name, coords, wikibase QID). For the
station name itself, an Overpass call joined on `wikidata=Q...` pulls
the operator-facing `name` tag (RATP/SNCF signage) so we get "Luxembourg"
not "Le Luxembourg", "Invalides" not "Les Invalides".
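
As a rough sketch of the two lookups described above, the helpers below build the request URL for the frwiki REST summary endpoint and an Overpass QL query that matches elements by their `wikidata` tag. The helper names, the timeout value, and the choice to batch QIDs into one regex are assumptions for illustration, not the v2 script's actual code. No network call is made here.

```python
from urllib.parse import quote

# Public REST summary endpoint of French Wikipedia.
SUMMARY_ENDPOINT = "https://fr.wikipedia.org/api/rest_v1/page/summary/"


def summary_url(title):
    """URL for one article summary; the JSON response is expected to carry
    the title, coordinates, and the Wikidata id (`wikibase_item`)."""
    return SUMMARY_ENDPOINT + quote(title.replace(" ", "_"), safe="")


def overpass_name_query(qids):
    """Overpass QL selecting nodes/ways/relations tagged with one of the QIDs.

    Assumption: QIDs are batched into a single anchored regex; `out tags`
    returns only tags, which is enough to read the `name` tag.
    """
    alternation = "|".join(sorted(set(qids)))
    return (
        "[out:json][timeout:60];\n"
        f'nwr["wikidata"~"^({alternation})$"];\n'
        "out tags;"
    )


print(summary_url("Gare de Poissy"))
print(overpass_name_query(["Q2", "Q1", "Q1"]))  # placeholder QIDs
```

Anchoring the regex with `^` and `$` avoids matching a QID that merely starts with one of the requested ids (e.g. Q1 versus Q10).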

WDQS is intentionally avoided — rate-limited to ~1 req/min since 2024,
returns Wikimedia error HTML at category scale.

Dataset 230 → 241 entities, +19 added, -8 removed (duplicates and renames
unified by QID). A `wiki_titles.json` sidecar maps each entity_id to its
frwiki article title; `fame_score_rer.py` reads it first and skips the
TITLE_VARIANTS guess loop. Result: fame_score coverage 143/230 (62%) →
241/241 (100%).

Metadata bumped to 0.4.0 (build 0.3.0 + fame 0.4.0).
Calixteair merged commit dca3165 into main on May 19, 2026
7 checks passed
Calixteair deleted the fix/rer-dataset-rebuild branch on May 19, 2026 00:06