feat(scrape): tipo de cambio + RUC online via agent-browser #8
Merged
Bypasses the SUNAT WAF that blocked direct fetch in PR #3 by driving a real headless Chrome session through agent-browser (the same wrapper already used by RHE/F616 in this repo).

Two new capabilities:

1. Official SUNAT exchange rate (tipo de cambio, USD/PEN)
   - src/sunat-rest/tipo-cambio.ts: pure parser + cache + scraper
   - sunat tipo-cambio [--fecha YYYY-MM-DD] [--force]
   - sunat tipo-cambio cached --fecha: cache-only, no scrape
   - JSONL cache at ~/.sunat/cache/tipo-cambio.jsonl, deduped by fecha
   - Cached forever (SUNAT TC is immutable per date)
   - Parser handles 4 layouts: aria-label "Compra X Venta Y", with colons, table cells "X | Y", 4-decimal values
   - Sanity check: TC must be 1-10 PEN per USD with abs(compra-venta) < 0.5, to reject false positives (item weights, totals, etc.)

2. Single-RUC lookup via the portal
   - src/sunat-rest/ruc-portal.ts: pure parser + scraper
   - sunat padron ruc-online <RUC>: drives e-consultaruc.sunat.gob.pe
   - Bypasses the numRnd token + reCAPTCHA gate that broke direct POST
   - Parses razon social, estado, condicion, tipo, dirección, departamento/provincia/distrito (handles SUNAT's hyphen-joined "DIRECCION DISTRITO - PROV - DEPT" format with whitespace fallback for the distrito tail)
   - For batch use, always prefer 'padron ruc/batch' (offline, instant)

Architecture: pure parsers are separated from browser orchestration so they can be 100% unit-tested without Chrome. Live scraping can be verified manually post-merge.
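To make the sanity rule above concrete, here is a minimal sketch of how a pure parser for the "Compra X Venta Y" layout (with or without colons) could apply it. The names TipoCambio, isPlausibleTc, and parseCompraVenta and the regular expression are assumptions made for illustration; the actual parseTcSnapshot in src/sunat-rest/tipo-cambio.ts handles more layouts and may be structured differently.

```typescript
// Hypothetical shape and names, for illustration only.
interface TipoCambio {
  compra: number;
  venta: number;
}

// Sanity rule described in the PR: both values within 1-10 PEN per USD,
// and the buy/sell spread below 0.5.
function isPlausibleTc(compra: number, venta: number): boolean {
  const inRange = (v: number): boolean => Number.isFinite(v) && v >= 1 && v <= 10;
  return inRange(compra) && inRange(venta) && Math.abs(compra - venta) < 0.5;
}

// Assumed pattern covering only "Compra X Venta Y" and "Compra: X Venta: Y".
const COMPRA_VENTA = /Compra:?\s*(\d+\.\d{1,4})\s+Venta:?\s*(\d+\.\d{1,4})/i;

// Returns null when the text does not match or the values fail the sanity check.
function parseCompraVenta(text: string): TipoCambio | null {
  const match = COMPRA_VENTA.exec(text);
  if (!match) return null;
  const compra = Number(match[1]);
  const venta = Number(match[2]);
  return isPlausibleTc(compra, venta) ? { compra, venta } : null;
}
```

Because the function takes plain text and returns a value or null, it can be exercised in unit tests without a browser, which is the property the architecture note relies on.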
Tests: 283 pass / 2 skip / 0 fail in 3.0s (was 265)

tipo-cambio.test.ts (13):
- parseTcSnapshot: 7 layout cases + sanity reject (weights aren't TCs)
- saveTc/loadCachedTc: roundtrip, missing fecha, dedupe by fecha, malformed JSONL skipped

ruc-portal.test.ts (7):
- parseRucSnapshot canonical SUNAT detail page (full 10-field extract)
- RUC mismatch returns null
- Missing optional fields handled
- Razon social trim
- Condicion without accents
- source + fetchedAt always populated

LIMITATIONS.md updated:
- Tipo de cambio: ⛔ → ⚠️ (verified shape, untested live in CI)
- Padrón puntual portal: ⛔ → ⚠️ (same)
- New 'No automatic fallback' caveat: parser returns null if SUNAT changes layout; user can opt in to a third-party fallback in a future PR
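The cache behaviours listed in the tests (roundtrip, dedupe by fecha, malformed JSONL skipped) can be sketched as pure functions over the JSONL text. This is only an illustration: the TcEntry shape and the parseCache and upsertEntry helpers are hypothetical, file I/O is left out, and the "last write wins" choice for duplicate dates is an assumption (the PR only says entries are deduped by fecha).

```typescript
// Hypothetical entry shape; the real file is ~/.sunat/cache/tipo-cambio.jsonl.
interface TcEntry {
  fecha: string; // YYYY-MM-DD
  compra: number;
  venta: number;
}

// Runtime guard so that structurally invalid JSON objects are ignored.
function isTcEntry(value: unknown): value is TcEntry {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v["fecha"] === "string" &&
    typeof v["compra"] === "number" &&
    typeof v["venta"] === "number"
  );
}

// Parse JSONL text into a map keyed by fecha; malformed lines are skipped.
function parseCache(jsonl: string): Map<string, TcEntry> {
  const byFecha = new Map<string, TcEntry>();
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue;
    try {
      const parsed: unknown = JSON.parse(line);
      if (isTcEntry(parsed)) byFecha.set(parsed.fecha, parsed);
    } catch {
      // Not valid JSON: ignore this line and keep reading.
    }
  }
  return byFecha;
}

// Return new JSONL text containing `entry`, replacing any entry with the
// same fecha (assumed policy: last write wins).
function upsertEntry(jsonl: string, entry: TcEntry): string {
  const byFecha = parseCache(jsonl);
  byFecha.set(entry.fecha, entry);
  return Array.from(byFecha.values())
    .map((e) => JSON.stringify(e))
    .join("\n") + "\n";
}
```

Keeping these helpers free of file access means a caller would read the file, call upsertEntry, and write the result back, while the dedupe and skip-malformed logic stays testable on plain strings.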
Summary
Bypasses the SUNAT WAF that blocked direct fetch in PR #3. Drives a real headless Chrome through agent-browser (the wrapper RHE/F616 already use in this repo). Two new capabilities flip from ⛔ "blocked" to ⚠️ "verified shape, untested live in CI":
- Tipo de cambio
- Padrón puntual portal
What's in this PR
Tipo de Cambio (sunat tipo-cambio)
- JSONL cache at ~/.sunat/cache/tipo-cambio.jsonl, deduped by fecha
- Sanity check: TC must be 1-10 PEN per USD with abs(compra-venta) < 0.5
RUC online (sunat padron ruc-online <RUC>)
    sunat padron ruc-online 20131312955 # ~5-10s, drives e-consultaruc.sunat.gob.pe
- Bypasses the numRnd token + reCAPTCHA gate that broke direct POST in PR #3 (feat(rest): consulta CPE OAuth + padrón RUC local)
- Handles SUNAT's hyphen-joined "DIRECCION DISTRITO - PROV - DEPT" format with whitespace fallback for the distrito tail
- For batch use, prefer sunat padron ruc/batch (offline padrón, instantaneous after sync). ruc-online is for ad-hoc single-RUC checks only.
Architecture
Pure parsers (parseTcSnapshot, parseRucSnapshot) live separately from browser orchestration (getTipoCambio, consultarRucPortal). This means:
- Parsers can be unit-tested without Chrome
- If SUNAT changes its layout, the parser returns null and the operator gets a clear error suggesting --debug
Tests
LIMITATIONS.md updated
Test plan
- bun test green (283 pass / 2 skip / 0 fail)
- sunat tipo-cambio --help lists cached subcommand
- sunat padron ruc-online --help shows usage
- sunat tipo-cambio returns reasonable USD/PEN value (post-merge, Hunter)
- sunat padron ruc-online 20131312955 matches local padrón data (post-merge)
Out of scope
agent-browser + Chrome installed