feat(scrape): tipo de cambio + RUC online via agent-browser #8

Merged: Railly merged 1 commit into main from feat/agent-browser-tc-ruc on Apr 29, 2026

@Railly commented Apr 29, 2026

Summary

Bypasses the SUNAT WAF that blocked direct fetch in PR #3. Drives a real headless Chrome through agent-browser (the wrapper RHE/F616 already use in this repo). Two new capabilities flip from ⛔ "blocked" to ⚠️ "verified shape, untested live in CI":

  • Tipo de Cambio oficial SUNAT (USD/PEN)
  • RUC consulta puntual via portal (single lookup, no padrón sync needed)

What's in this PR

Tipo de Cambio (sunat tipo-cambio)

sunat tipo-cambio                       # today's USD/PEN
sunat tipo-cambio --fecha 2026-04-15    # historical, immutable
sunat tipo-cambio --force               # bypass cache
sunat tipo-cambio cached --fecha 2026-04-15  # cache-only, no scrape
  • JSONL cache at ~/.sunat/cache/tipo-cambio.jsonl, deduped by fecha
  • Cached forever (SUNAT TC is immutable per date — what was published once never changes)
  • Parser handles 4 layouts: aria-label "Compra X Venta Y", the same labels followed by colons, table cells "X | Y", and 4-decimal values
  • Sanity check rejects false positives (item weights, totals): both rates must fall within 1–10 PEN per USD, with abs(compra-venta) < 0.5
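
The layout handling and sanity check described above can be sketched roughly as follows. This is not the code from this PR: the `TipoCambio` shape, the `parseTcSketch` and `isPlausibleTc` names, and the two regular expressions are illustrative assumptions, and the range bounds are treated as inclusive because the description does not say either way.

```typescript
// Hypothetical result shape; the real type in tipo-cambio.ts is not shown in this PR.
interface TipoCambio {
  compra: number;
  venta: number;
}

// Plausibility gate as described above: both rates within 1-10 PEN per USD
// (inclusive bounds are an assumption) and a buy/sell spread under 0.5.
function isPlausibleTc(compra: number, venta: number): boolean {
  const inRange = (v: number): boolean => v >= 1 && v <= 10;
  return inRange(compra) && inRange(venta) && Math.abs(compra - venta) < 0.5;
}

// Illustrative patterns only: labelled values (with or without colons)
// and bare table cells "X | Y", each accepting up to 4 decimals.
const TC_PATTERNS: RegExp[] = [
  /Compra:?\s*(\d+\.\d{1,4})\s*Venta:?\s*(\d+\.\d{1,4})/i,
  /(\d+\.\d{1,4})\s*\|\s*(\d+\.\d{1,4})/,
];

// Pure function: text in, value or null out. No browser involved.
function parseTcSketch(snapshot: string): TipoCambio | null {
  for (const pattern of TC_PATTERNS) {
    const m = pattern.exec(snapshot);
    if (!m) continue;
    const compra = Number(m[1]);
    const venta = Number(m[2]);
    // A numeric pair that is not a believable exchange rate is skipped.
    if (isPlausibleTc(compra, venta)) return { compra, venta };
  }
  return null;
}
```

With this sketch, `parseTcSketch("Peso 12.5 | 48.0")` returns null: the table-cell pattern matches, but 12.5 and 48.0 fall outside the accepted range, which is the kind of false positive the sanity check is meant to reject.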

RUC online (sunat padron ruc-online <RUC>)

sunat padron ruc-online 20131312955   # ~5-10s, drives e-consultaruc.sunat.gob.pe
  • Bypasses the numRnd token + reCAPTCHA gate that broke direct POST in PR #3 (feat(rest): consulta CPE OAuth + padrón RUC local)
  • Parses 10 fields: razon social, estado, condicion, tipoContribuyente, dirección, departamento, provincia, distrito, source, fetchedAt
  • Handles SUNAT's hyphen-joined "DIRECCION DISTRITO - PROV - DEPT" format with whitespace fallback for the distrito tail
  • For batch lookups, always use sunat padron ruc/batch (offline padrón, instantaneous after sync). ruc-online is for ad-hoc single-RUC checks only.
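
As an illustration of the hyphen-joined address handling mentioned above, here is a minimal sketch. It assumes the segment order given in this description ("DIRECCION DISTRITO - PROV - DEPT") and takes "whitespace fallback" to mean that the last space-separated token before the first separator is the distrito. `splitDireccionSketch` and `DireccionParts` are hypothetical names, not the parser's real API.

```typescript
// Hypothetical shape for the address-related fields.
interface DireccionParts {
  direccion: string;
  distrito: string | null;
  provincia: string | null;
  departamento: string | null;
}

function splitDireccionSketch(raw: string): DireccionParts {
  // Split on " - " separators, tolerating extra whitespace around the hyphen.
  const segments = raw.trim().split(/\s+-\s+/);
  if (segments.length < 3) {
    // Not the expected shape: keep the whole string as the street address.
    return { direccion: raw.trim(), distrito: null, provincia: null, departamento: null };
  }
  // The last two segments are provincia and departamento, in that order.
  const departamento = segments[segments.length - 1] ?? null;
  const provincia = segments[segments.length - 2] ?? null;
  // Anything before them is "DIRECCION DISTRITO"; rejoin in case the
  // street address itself contained a " - ".
  const head = segments.slice(0, -2).join(" - ");
  // Whitespace fallback (assumption): the final token of the head is the distrito.
  const lastSpace = head.lastIndexOf(" ");
  if (lastSpace === -1) {
    return { direccion: head, distrito: null, provincia, departamento };
  }
  return {
    direccion: head.slice(0, lastSpace).trim(),
    distrito: head.slice(lastSpace + 1),
    provincia,
    departamento,
  };
}
```

A single-token fallback like this misreads multi-word districts (for example, a two-word district name would lose its first word to the street address), so a real implementation would likely need a district lookup or a smarter heuristic.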

Architecture

Pure parsers (parseTcSnapshot, parseRucSnapshot) live separately from browser orchestration (getTipoCambio, consultarRucPortal). This means:

  • Parsers are 100% unit-tested without Chrome (20 tests cover layout variations + edge cases)
  • Live scraping is verifiable manually post-merge (CI doesn't have Chrome)
  • A SUNAT layout change breaks ONLY the scraper, not the parser — parser returns null and the operator gets a clear error suggesting --debug
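
The separation described above might look roughly like the sketch below. It is a simplified, synchronous illustration under several assumptions: the agent-browser API is not shown in this PR, so the browser step is a plain injected function (`SnapshotFn`); the cache is an in-memory object keyed by fecha rather than the JSONL file; and `parseTcStub`, `getTcSketch`, and the error wording are hypothetical.

```typescript
// Hypothetical record shape, keyed by fecha (YYYY-MM-DD).
type Tc = { fecha: string; compra: number; venta: number };

// Stand-in for the browser step: given a date, return the page snapshot text.
// The real orchestration drives agent-browser and would be asynchronous.
type SnapshotFn = (fecha: string) => string;

// Pure parser: no I/O, so it can be unit-tested without Chrome.
function parseTcStub(snapshot: string, fecha: string): Tc | null {
  const m = /Compra\s+(\d+\.\d+)\s+Venta\s+(\d+\.\d+)/.exec(snapshot);
  return m ? { fecha, compra: Number(m[1]), venta: Number(m[2]) } : null;
}

// Orchestration: cache first, then snapshot, then parse, then store.
function getTcSketch(
  fecha: string,
  cache: Record<string, Tc>,
  takeSnapshot: SnapshotFn,
  force: boolean = false,
): Tc {
  const cached = cache[fecha];
  if (cached !== undefined && !force) return cached; // TC is immutable per date

  const parsed = parseTcStub(takeSnapshot(fecha), fecha);
  if (parsed === null) {
    // The parser signals failure with null; the caller turns that into a clear error.
    throw new Error(
      "Could not parse tipo de cambio for " + fecha + "; the page layout may have changed, retry with --debug",
    );
  }
  cache[fecha] = parsed; // one entry per fecha, so repeated saves dedupe naturally
  return parsed;
}
```

Because the browser is reduced to a function argument, tests can pass a fixture string in place of a live session, which is what makes the parser side testable in a CI environment without Chrome.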

Tests

283 pass / 2 skip / 0 fail in 3.0s (was 265)

tipo-cambio.test.ts (13):
- parseTcSnapshot: 7 layout cases + sanity reject (weights aren't TCs)
- saveTc/loadCachedTc: roundtrip, missing fecha, dedupe by fecha, malformed JSONL skipped

ruc-portal.test.ts (7):
- parseRucSnapshot canonical SUNAT detail page (full 10-field extract)
- RUC mismatch returns null
- Missing optional fields handled
- Razon social trimmed
- Condicion without acentos
- source + fetchedAt always populated

LIMITATIONS.md updated

  • Tipo de cambio: ⛔ → ⚠️
  • Padrón puntual portal: ⛔ → ⚠️
  • New caveat: "No automatic fallback — if SUNAT changes the table layout, parser returns null; future PR could add a third-party fallback (with explicit user opt-in via env var)"

Test plan

  • bun test green (283/2 skip/0 fail)
  • sunat tipo-cambio --help lists cached subcommand
  • sunat padron ruc-online --help shows usage
  • All parser logic unit-tested with realistic SUNAT snapshot fixtures
  • Live sunat tipo-cambio returns reasonable USD/PEN value (post-merge, Hunter)
  • Live sunat padron ruc-online 20131312955 matches local padrón data (post-merge)

Out of scope

  • 🚧 SBS scraping — also blocked by WAF, not bypassed (SUNAT's TC is the legally-valid one for tax purposes anyway)
  • 🚧 Third-party fallback (apis.net.pe / decolecta) — would need user opt-in via env var; deferred to future PR
  • 🚧 Live CI verification — no Chrome in CI runners; would need a smoke job with agent-browser + Chrome installed

Bypasses the SUNAT WAF that blocked direct fetch in PR #3 by driving a
real headless Chrome session through agent-browser (the same wrapper
already used by RHE/F616 in this repo). Two new capabilities:

1. Tipo de Cambio oficial SUNAT (USD/PEN)
   - src/sunat-rest/tipo-cambio.ts — pure parser + cache + scraper
   - sunat tipo-cambio [--fecha YYYY-MM-DD] [--force]
   - sunat tipo-cambio cached --fecha — cache-only, no scrape
   - JSONL cache at ~/.sunat/cache/tipo-cambio.jsonl, deduped by fecha
   - Cached forever (SUNAT TC is immutable per date)
   - Parser handles 4 layouts: aria-label "Compra X Venta Y", with
     colons, table cells "X | Y", 4-decimal values
   - Sanity check: TC must be 1-10 PEN per USD with abs(compra-venta)<0.5
     to reject false positives (item weights, totals, etc)

2. RUC consulta puntual via portal
   - src/sunat-rest/ruc-portal.ts — pure parser + scraper
   - sunat padron ruc-online <RUC> — drives e-consultaruc.sunat.gob.pe
   - Bypasses the numRnd token + reCAPTCHA gate that broke direct POST
   - Parses razon social, estado, condicion, tipo, dirección,
     departamento/provincia/distrito (handles SUNAT's hyphen-joined
     "DIRECCION DISTRITO - PROV - DEPT" format with whitespace fallback
     for the distrito tail)
   - For batch use, always prefer 'padron ruc/batch' (offline, instant)

Architecture: pure parsers separated from browser orchestration so they
can be 100% unit-tested without Chrome. Live scraping verifiable
manually post-merge.

Tests: 283 pass / 2 skip / 0 fail in 3.0s (was 265)

tipo-cambio.test.ts (13):
- parseTcSnapshot: 7 layout cases + sanity reject (weights aren't TCs)
- saveTc/loadCachedTc: roundtrip, missing fecha, dedupe by fecha,
  malformed JSONL skipped

ruc-portal.test.ts (7):
- parseRucSnapshot canonical SUNAT detail page (full 10-field extract)
- RUC mismatch returns null
- Missing optional fields handled
- Razon social trim
- Condicion without acentos
- source + fetchedAt always populated

LIMITATIONS.md updated:
- Tipo de cambio: ⛔ → ⚠️ (verified shape, untested live in CI)
- Padrón puntual portal: ⛔ → ⚠️ (same)
- New 'No automatic fallback' caveat: parser returns null if SUNAT
  changes layout; user can opt-in to a third-party fallback in future PR
@Railly Railly marked this pull request as ready for review April 29, 2026 06:14
@Railly Railly merged commit 84b0966 into main Apr 29, 2026
1 check passed