feat(domain-rer): RER d'Île-de-France — first multi-domain ship #62

Merged
Calixteair merged 5 commits into main from feat/domain-rer on May 13, 2026

@Calixteair
Owner

What

First step of phase 3 from the roadmap: add the rer domain pack alongside paris-metro.

  • 230 stations covering RER A–E
  • 31 predicates (letters, length, word count, lines, geo anchors)
  • fame_score on 136 stations from French Wikipedia pageviews
  • Generator validated on 100 seeds with 0 failures
  • Worker auto-discovers the new domain — no extra wiring needed once this lands

Why

paris-metro proved the engine works. Phase 3 measures how cheap it is to clone the pattern for a new dataset. This PR is the answer: about a half-day of work, mostly ingest tooling + tuning the predicate pack until the CSP solver consistently finds grids.

How

5 commits in increasing scope:

1. 09f95c4 feat(worker): bump generator max_attempts 200 → 1000

The CSP solver in core::generator retries up to max_attempts random predicate selections before giving up. paris-metro is wide enough that 200 always succeeds; rer is denser and was failing 17% of seeds at 200. Bumping to 1000 gets us 0/100 failures and costs sub-second extra runtime per (domain, day).
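The retry behaviour described above can be sketched as follows. This is a minimal Python illustration, not the actual Rust code in core::generator: the names generate_grid, count_failures, is_valid and picks are hypothetical, and the validity check is left as a caller-supplied function.

```python
import random
from typing import Callable, Optional, Sequence

Grid = list[str]


def generate_grid(
    predicates: Sequence[str],
    is_valid: Callable[[Grid], bool],
    seed: int,
    max_attempts: int = 1000,
    picks: int = 6,  # hypothetical number of predicates per grid
) -> Optional[Grid]:
    """Try up to max_attempts random predicate selections, then give up."""
    rng = random.Random(seed)  # seeded, so a given seed always replays the same draws
    for _ in range(max_attempts):
        candidate = rng.sample(list(predicates), picks)
        if is_valid(candidate):
            return candidate
    return None  # caller treats this as a failed publish


def count_failures(
    predicates: Sequence[str],
    is_valid: Callable[[Grid], bool],
    seeds: Sequence[int],
    max_attempts: int,
) -> int:
    """Stress test: how many seeds exhaust the attempt budget."""
    return sum(
        generate_grid(predicates, is_valid, seed, max_attempts) is None
        for seed in seeds
    )
```

Because each seed replays the same sequence of draws, raising max_attempts can only turn failures into successes, never the reverse, which is why a larger budget is a safe fix for a denser pack.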

2. b0fbc3c feat(scripts): ingest tooling for the rer domain

Two stdlib-only Python scripts under scripts/ingest/:

  • build_rer_dataset.py pulls relation[network=RER] + their stop-role node members from OSM Overpass, deduplicates by canonical name (strips Voie X / Quai X platform suffixes), centroidises per group, packages as entities.json.
  • fame_score_rer.py skips Wikidata entirely (WDQS rate-limits at 1 req/min since the 2024 outage, which is too aggressive for a batch job). It calls the Wikipedia REST endpoints directly: summary to resolve <name> / <name> (RER) / Gare de <name>, filtered by topic keyword (word-boundary regex so 'métropole' / 'commune' don't smuggle in unrelated commune pages), then pageviews/per-article for the 365-day sum, which is percentile-ranked.

Coverage: 136/230 stations matched to a frwiki article. The remaining 94 are small banlieue stops without a dedicated page; they stay at fame_score: null and core::scoring treats them as neutral 50.
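As a rough illustration of the percentile-rank step and the neutral fallback, here is a small Python sketch. The exact ranking formula used by fame_score_rer.py is not stated in this PR, so the one below (share of strictly smaller values, scaled to 0..100) is an assumption, as are the function names.

```python
from bisect import bisect_left
from typing import Optional

NEUTRAL_SCORE = 50  # what core::scoring is described as using for fame_score: null


def percentile_rank(pageviews: dict[str, int]) -> dict[str, int]:
    """Map each station's pageview sum to a 0..100 rank (assumed formula)."""
    ordered = sorted(pageviews.values())
    n = len(ordered)
    if n <= 1:
        return {name: 100 for name in pageviews}
    return {
        # bisect_left counts how many values are strictly smaller than this one
        name: round(100 * bisect_left(ordered, views) / (n - 1))
        for name, views in pageviews.items()
    }


def effective_score(fame_score: Optional[int]) -> int:
    """Stations without a matched article fall back to the neutral score."""
    return NEUTRAL_SCORE if fame_score is None else fame_score
```

With this formula, rounding can put two stations on the same score, which is consistent with the ties visible in the calibration table.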

3. a490c45 feat(domain-rer): seed pack — 230 stations, 31 predicates

The artefacts the two scripts produce, committed as-is:

  • entities.json — 230 stations, sorted by id, with attributes lines, geo, in_paris, zone (placeholder 0 — Wikidata's P5031 to wire later)
  • predicates.json — mirror of paris-metro with rer tweaks. Two predicates from paris-metro were dropped because coverage was < 8 stations (contains_letter Z = 6, attr_list_size_gte n=3 = 1) — under that threshold the CSP solver thrashes.
  • metadata.json — version 0.2.0, both ingestion sources tracked in sources_versions
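The coverage rule used to drop the two predicates can be sketched like this. It is a hypothetical Python helper, not code from the repository: the predicate representation (a name mapped to a test function) and the names prune_predicates and contains_letter are assumptions; only the threshold of 8 stations comes from the description above.

```python
from typing import Callable

MIN_COVERAGE = 8  # below this, the CSP solver is reported to thrash

Entity = dict
Predicate = Callable[[Entity], bool]


def contains_letter(letter: str) -> Predicate:
    """Build a predicate matching entities whose name contains the letter."""
    return lambda entity: letter.lower() in entity["name"].lower()


def prune_predicates(
    predicates: dict[str, Predicate],
    entities: list[Entity],
    min_coverage: int = MIN_COVERAGE,
) -> tuple[dict[str, int], dict[str, int]]:
    """Split predicates into (kept, dropped), each mapped to its coverage count."""
    kept: dict[str, int] = {}
    dropped: dict[str, int] = {}
    for name, test in predicates.items():
        coverage = sum(1 for entity in entities if test(entity))
        target = kept if coverage >= min_coverage else dropped
        target[name] = coverage
    return kept, dropped
```

Running a check like this against entities.json before committing a pack would surface low-coverage predicates (such as contains_letter Z with 6 stations) without waiting for the generator stress test.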

4. d72101d feat(server,web): register rer domain + i18n labels

Worker auto-discovers any subdirectory of domains/, so once this lands the next nightly cron produces a rer daily grid automatically. No worker config change needed.

A domain picker on the home page is not part of this PR — the URL still hardcodes paris-metro on /. Multi-domain switching will land as a follow-up once we have at least two playable domains in prod (i.e. once this PR ships).

5. 97a43e5 chore: generalise tmp/ gitignore

Domain-pack scratch dirs are now domains/*/tmp/ (was specifically domains/paris-metro/tmp/). Drops an audit report that slipped in.

Calibration sanity check

Top fame_score on rer:

score station
100 Musée d'Orsay
99 Gare de Lyon
98 Paris Gare du Nord
97 Paris Austerlitz, Châtelet - Les Halles
96 Massy-Palaiseau
95 Versailles-Chantiers
94 La Défense - Grande Arche, Magenta
93 Juvisy-sur-Orge

Bottom:

score station
0 Boutigny
1 Sermaise
2 Buno - Gironville
3 Boissise-le-Roi, Villabé

Checklist

  • Conventional commit titles, no AI / Claude mention
  • cargo fmt --check passes
  • cargo clippy --all-targets -- -D warnings passes
  • cargo test --workspace passes
  • 100-seed generator run on rer: 0/100 failures @ max_attempts=1000
  • pnpm --dir web lint && typecheck && test passes
  • No new secret
  • No contract change (the new pack uses existing schemas)
  • Docs: docs/agents/agent-f-domains.md already covers generalisation (paris-metro PR), no update needed for this domain

Test plan

  • After merge + deploy, the worker's next nightly run publishes a rer daily grid
  • GET /api/grids/rer/today returns a PublicGrid shape
  • POST /api/games {domain:"rer", mode:"solo"} returns a fresh seed + grid with cells that can be solved
  • POST /api/duels {domain:"rer"} succeeds, share link plays cleanly
  • GET /api/leaderboard/rer?period=daily returns an empty list (no plays yet) without erroring
  • paris-metro flows still work — GET /api/grids/paris-metro/today unchanged

The daily-grid generator backs off after max_attempts CSP attempts. 200
was fine for the paris-metro pack (~36 predicates, 310 entities, broad
intersections). Denser packs — RER ships with 230 entities and a
narrower set of valid candidates — were occasionally failing publish on
their first attempt. 100-seed stress test:

  paris-metro @ 200:  0/100 failures
  rer         @ 200: 17/100 failures
  rer         @ 1000: 0/100 failures

Each attempt is sub-ms on the runner, so 1000 still costs under a
second per (domain, day). No need to expose the value as a config
field yet; flip the constant when a future pack needs more.

Two Python scripts (stdlib only, no external deps), mirroring the
paris-metro pipeline but adapted to the rer constraints.

scripts/ingest/build_rer_dataset.py
- One Overpass query pulls every relation[type=route][network=RER] in
  Île-de-France plus their stop-role node members. Members tagged
  'stop_entry_only' / 'platform' are intentionally filtered out via
  the role selector.
- Each station node carries name + geo. The trunk line letter (A..E)
  is harvested from the relation's ref tag.
- Names like 'Châtelet - Voie 1' / 'Brétigny - Voie 6' are stripped
  of their platform suffix so the 4-5 platform nodes of a station
  collapse into one entity.
- Centroid geo per group, in_paris=true when inside the Paris bbox.
- Wikidata's zone tarifaire (P5031) isn't exposed by OSM so 'zone' is
  left at 0; a follow-up enrich pass can wire it later.
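The suffix stripping and centroid steps listed above could look roughly like the following. This is a sketch under assumptions, not the contents of build_rer_dataset.py: the regex, the node dictionary shape (name, lat, lon, line) and the function names are all hypothetical.

```python
import re
from collections import defaultdict

# Assumed pattern: a trailing " - Voie X" or " - Quai X" platform suffix.
PLATFORM_SUFFIX = re.compile(r"\s*-\s*(?:Voie|Quai)\s+\w+$", re.IGNORECASE)


def canonical_name(raw: str) -> str:
    """Strip the platform suffix so all platforms of a station share one key."""
    return PLATFORM_SUFFIX.sub("", raw).strip()


def group_stations(nodes: list[dict]) -> list[dict]:
    """Collapse platform nodes into one entity per station, with centroid geo."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for node in nodes:
        groups[canonical_name(node["name"])].append(node)

    entities = []
    for name in sorted(groups):
        members = groups[name]
        entities.append({
            "name": name,
            # centroid: plain mean of member coordinates
            "lat": sum(m["lat"] for m in members) / len(members),
            "lon": sum(m["lon"] for m in members) / len(members),
            # union of line letters seen on any member
            "lines": sorted({m["line"] for m in members}),
        })
    return entities
```

Anchoring the pattern on Voie/Quai matters: hyphenated station names such as 'Châtelet - Les Halles' or 'Buno - Gironville' must pass through unchanged.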

scripts/ingest/fame_score_rer.py
- WDQS rate-limits at 1 req/min for repeat clients since the 2024
  outage — too aggressive for a 230-entity batch. So this version
  skips SPARQL entirely and hits the Wikipedia REST endpoints
  directly:
    1. summary/<name> | summary/<name> (RER) | summary/Gare de <name>
       — first 200-OK whose description matches a transit keyword
       wins. Word-boundary anchored regex so 'métropole' / 'commune'
       don't smuggle in unrelated commune pages (the false positives
       that were inflating Issy, Antony, etc.).
    2. /metrics/pageviews/per-article/.../monthly/ → 365-day sum.
    3. percentile-rank → fame_score 0..=100.
- 136/230 stations resolved (the rest are small banlieue stops
  without a dedicated frwiki page) — they stay at fame_score=null and
  the scoring engine treats them as neutral 50. Coverage gap is
  acceptable for v1 of the pack.
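To illustrate why the word-boundary anchoring matters in step 1, here is a small Python sketch of the title resolution. The keyword list is an assumption (the PR does not list the actual keywords), and fetch_description is a hypothetical stand-in for the summary endpoint call that returns None for anything other than a 200.

```python
import re
from typing import Callable, Optional

# Assumed keyword list; the real script's keywords are not given in this PR.
TRANSIT_KEYWORDS = ("gare", "métro", "rer")
TRANSIT_RE = re.compile(
    r"\b(?:" + "|".join(map(re.escape, TRANSIT_KEYWORDS)) + r")\b",
    re.IGNORECASE,
)


def candidate_titles(name: str) -> list[str]:
    """Titles tried in order: <name>, <name> (RER), Gare de <name>."""
    return [name, f"{name} (RER)", f"Gare de {name}"]


def resolve_article(
    name: str,
    fetch_description: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return the first candidate whose description matches a transit keyword."""
    for title in candidate_titles(name):
        description = fetch_description(title)  # None stands for a non-200 reply
        if description is not None and TRANSIT_RE.search(description):
            return title
    return None
```

A plain substring test would accept "métropole" because it contains "métro"; with \b on both sides the commune page is rejected and the resolver moves on to the next candidate title.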

Bootstrap of the rer domain pack via the two ingest scripts in
scripts/ingest/. Shapes:

domains/rer/entities.json (230 stations)
- Pulled from OSM Overpass: every relation[network=RER] in IDF, then
  stop-role nodes deduped by canonical name (platform suffixes
  stripped).
- Attributes: lines (subset of {A,B,C,D,E}), geo, in_paris (Paris
  inner-ring bbox), zone (placeholder 0 — Wikidata's P5031 to come).
- fame_score populated on 136 stations (those with an unambiguous
  frwiki article matching a transit-topic keyword). The remaining 94
  small banlieue stops sit at null → core::scoring treats them as
  neutral 50.

domains/rer/predicates.json (31 predicates)
- Mirror of the paris-metro pack with rer-specific tweaks:
    - 4 ends_with, 4 starts_with, 2 contains_letter (Y, U)
    - name_length_max=7, name_length_min=13, name_length_eq=9
    - name_word_count_eq={1, 3}, name_word_count_min=4
    - attr_list_size_eq=1 (single-line stations only)
    - on_attr_in_set on every individual line A..E + A|B
    - in_paris true/false
    - within_km anchors: Louvre, Tour Eiffel, Gare de Lyon, CDG,
      Versailles
- Two predicates that paris-metro carries were dropped on rer because
  coverage was < 8 stations (contains_letter Z = 6, attr_list_size_gte
  n=3 = 1 station) — under that threshold the CSP solver thrashes.

domains/rer/metadata.json
- version 0.2.0 (0.1.0 seeded, 0.2.0 after fame ingest)
- sources_versions records both the Overpass query date and the
  Wikipedia pageviews window for auditability.

100-seed generator stress test passes at the worker's new 1000
attempts threshold.

server:
- main.rs declares rer in active_domains so /api/domains lists it and
  the start_game / solo / duel endpoints accept domain='rer'. The
  Dockerfile already COPYs domains/* into the image; the worker
  auto-discovers any subdirectory of domains/, so once this PR ships
  the next daily run produces a rer grid automatically.
- paris-metro version field bumped 0.3.0 → 0.4.0 to match the fame
  score patch from #57, and 'duel' added to its available_modes (was
  missing since #55 made duels free).

web:
- grid_domain_rer i18n key in fr + en. The Grid component reads the
  active domain label from this key; the rest of the UI is
  domain-agnostic, so no other change is needed.

A domain picker on the home page isn't part of this PR — the URL still
hardcodes paris-metro on /. Multi-domain selection will land as a
follow-up once we have at least two playable domains in prod.

domains/paris-metro/tmp/ was the original local-only scratch dir for
ingest audit reports. The rer domain reuses the same convention so the
ignore rule should match any domain. Also drops the accidentally
committed domains/rer/tmp/fame_score_report.json (the audit output, not
a build artefact).
@Calixteair Calixteair merged commit 652dece into main May 13, 2026
7 checks passed
@Calixteair Calixteair deleted the feat/domain-rer branch May 13, 2026 21:25
Calixteair added a commit that referenced this pull request May 13, 2026
Adding a new domain in main.rs used to require running INSERT INTO
domains (...) manually on prod — the grids.domain foreign key rejects
any solo/daily/duel start_game otherwise with a 500. We hit this on
rer's first deploy (#62).

Run a one-shot upsert pass right after migrations from state.domains:
- Existing rows get version + active updated to match the code
- New rows are inserted with empty metadata, active=true, created_at=now()

Idempotent, so it is safe to re-run on every boot. A failure is logged and the
server keeps booting (falling back to the previous behaviour) so a
typo in the upsert path doesn't block a redeploy.
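
The upsert pass can be illustrated with a self-contained Python sketch using sqlite3. This is not the server's code: the production database and the real domains schema are not shown in this PR, so the table definition, column types and the sync_domains name are assumptions chosen to match the description above.

```python
import logging
import sqlite3

# Hypothetical schema; only the column names are taken from the commit message.
SCHEMA = """
CREATE TABLE IF NOT EXISTS domains (
    id         TEXT PRIMARY KEY,
    version    TEXT NOT NULL,
    active     INTEGER NOT NULL,
    metadata   TEXT NOT NULL,
    created_at TEXT NOT NULL
)
"""

UPSERT = """
INSERT INTO domains (id, version, active, metadata, created_at)
VALUES (?, ?, 1, '{}', CURRENT_TIMESTAMP)
ON CONFLICT(id) DO UPDATE SET version = excluded.version, active = 1
"""


def sync_domains(conn: sqlite3.Connection, domains: dict[str, str]) -> bool:
    """Upsert every (id, version) pair; log and carry on if it fails."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.executemany(UPSERT, domains.items())
        return True
    except sqlite3.Error:
        logging.exception("domain sync failed; continuing boot")
        return False
```

Calling sync_domains twice with the same mapping leaves the table unchanged, and created_at is only set on the first insert because the conflict branch does not touch it.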