fix(domain-rer): rebuild dataset from frwiki + OSM #69
Merged
v1's pipeline queried OSM Overpass with `network=RER` + `role=stop` only, which filtered out the SNCF-operated portions of every line and silently dropped member nodes mapped with the `halt`/`station`/empty roles. Effect: ~10 missing terminus stations (Saint-Germain-en-Laye, Poissy, Cergy-le-Haut, Marne-la-Vallée-Chessy, Boissy-Saint-Léger, Melun, Tournan, …) plus a few OSM-name typos kept as-is.

v2 takes the canonical station list from the frwiki category tree `Ligne <A..E> du RER d'Île-de-France`, then resolves each title via the Wikipedia REST summary endpoint (name, coords, wikibase QID). For the station name itself, an Overpass call joined on `wikidata=Q...` pulls the operator-facing `name` tag (RATP/SNCF signage), so we get "Luxembourg" not "Le Luxembourg", "Invalides" not "Les Invalides". WDQS is intentionally avoided: it has been rate-limited to ~1 req/min since 2024 and returns Wikimedia error HTML at category scale.

Dataset: 230 → 241 entities, +19 added, -8 removed (duplicates and renames unified by QID). A `wiki_titles.json` sidecar maps each entity_id to its frwiki article title; `fame_score_rer.py` reads it first and skips the TITLE_VARIANTS guess loop. Result: fame_score coverage 143/230 (62%) → 241/241 (100%). Metadata bumped to 0.4.0 (build 0.3.0 + fame 0.4.0).
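The sidecar-first title lookup described above can be sketched roughly as follows. This is a minimal illustration, not the code in `fame_score_rer.py`: the function names, the example entity id, and the contents of `TITLE_VARIANTS` are all assumptions, since the PR text does not show them.

```python
import json
from pathlib import Path

# Hypothetical fallback patterns; the real TITLE_VARIANTS list is not shown in this PR.
TITLE_VARIANTS = ("Gare de {name}", "Gare d'{name}", "{name} (RER)")


def load_sidecar(path):
    """Load the {entity_id: frwiki_title} mapping, or an empty dict if the file is absent."""
    p = Path(path)
    if not p.exists():
        return {}
    with p.open(encoding="utf-8") as fh:
        return json.load(fh)


def candidate_titles(entity_id, name, sidecar):
    """Return the frwiki titles to try for one entity.

    A sidecar hit is a single dict lookup and yields exactly one title;
    only entities missing from the sidecar fall back to the guess list.
    """
    title = sidecar.get(entity_id)
    if title is not None:
        return [title]
    return [variant.format(name=name) for variant in TITLE_VARIANTS]


# Usage with an in-memory sidecar (the entity id is made up for illustration).
sidecar = {"rer-poissy": "Gare de Poissy"}
known = candidate_titles("rer-poissy", "Poissy", sidecar)
unknown = candidate_titles("rer-melun", "Melun", sidecar)
```

With this shape, an entity present in the sidecar costs one request against frwiki, while an absent one may still need one request per variant.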
Summary
- Rebuild `domains/rer/entities.json` from scratch using the frwiki category tree (`Ligne <A..E> du RER d'Île-de-France`) as the canonical station list, then OSM Overpass joined on `wikidata=Q...` to pull the operator-facing `name` tag.
- Add a `wiki_titles.json` sidecar so `fame_score_rer.py` resolves frwiki titles in O(1) instead of guessing via `TITLE_VARIANTS`.

Why
The v1 ingest (`scripts/ingest/build_rer_dataset.py`) filtered OSM relations on `network="RER"` (strict equality) and member nodes on `role="stop"` (strict). Both filters lose every SNCF-operated portion of the network and silently drop nodes mapped with `halt`/`station`/empty roles. Visible symptoms reported: missing `Saint-Germain-en-Laye`, `Poissy`, `Cergy-le-Haut`, `Marne-la-Vallée - Chessy`, `Boissy-Saint-Léger`, `Melun`, `Tournan`, plus a handful of OSM-name typos (e.g. duplicates such as "Val-de-Fontenay (RER A)" vs "Val de Fontenay").

WDQS was deliberately avoided: it has been rate-limited to ~1 req/min since the 2024 outage and returns Wikimedia error HTML at category scale.
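To make the role-filter failure concrete, here is a small sketch of how a strict `role == "stop"` test discards relation members that a relaxed test would keep. The member dicts follow the shape of Overpass JSON output (`type`, `ref`, `role`); the sample data, the `stop_nodes` helper, and the exact relaxed role set are assumptions for illustration and are not taken from either ingest script.

```python
# v1 behaviour as described in the PR: only the "stop" role is accepted.
STRICT_ROLES = {"stop"}
# Assumed relaxed set, built from the roles the PR says were dropped.
RELAXED_ROLES = {"stop", "halt", "station", ""}


def stop_nodes(members, roles):
    """Return the node refs of relation members whose role is in `roles`.

    A missing "role" key is treated like an empty role.
    """
    return [
        m["ref"]
        for m in members
        if m.get("type") == "node" and m.get("role", "") in roles
    ]


# Made-up members of one route relation, in Overpass JSON member shape.
members = [
    {"type": "node", "ref": 101, "role": "stop"},
    {"type": "node", "ref": 102, "role": "halt"},
    {"type": "node", "ref": 103, "role": ""},
    {"type": "node", "ref": 104, "role": "station"},
    {"type": "way", "ref": 900, "role": ""},  # track geometry, never a station
]

strict = stop_nodes(members, STRICT_ROLES)
relaxed = stop_nodes(members, RELAXED_ROLES)
dropped = sorted(set(relaxed) - set(strict))
```

In this toy relation the strict filter keeps one node out of four, and nothing signals that the other three were discarded, which matches the "silently drop" behaviour reported above.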
Diff at a glance
Notable changes:
- `+19` real additions, `-8` removals (all dedup/rename via stable QID join).
- `Val de Fontenay` now correctly tagged `RER A + E`.

Files
- `scripts/ingest/build_rer_dataset_v2.py` (new): main pipeline, defaults to dry-run, `--apply` writes.
- `scripts/ingest/fame_score_rer.py`: now reads the `wiki_titles.json` sidecar first, and falls back to the previous TITLE_VARIANTS guess loop only for entities absent from the sidecar.
- `domains/rer/entities.json`: rebuilt (241 entities, all fame-scored).
- `domains/rer/metadata.json`: version `0.4.0`, new `dataset_sha256`.
- `domains/rer/wiki_titles.json` (new): `{entity_id: frwiki_title}` sidecar, regenerated on every `--apply`.

Test plan
- `ci-kalidoku` → `bao-deploy.sh kalidoku`.

Follow-up (out of scope)
- The `zone` attribute is still left at `0.0` (Wikidata P5031 is not used, and OSM doesn't expose the fare zone). A future enrichment pass would need WDQS or a manual mapping for the ~5 zone-based predicates.