feat(tools): langfuse_export - pull 3 months of historical rag_query traces to local JSONL/CSV #2172
Mikecranesync wants to merge 2 commits into
…cal JSONL/CSV

Three months of real production troubleshooting (~3,725 rag_query traces) lived only in Langfuse Cloud, with nothing reading it back. This tool extracts it for analysis, archival, and eval seeding.

- Langfuse public REST API via httpx (version-independent: the host's installed SDK is often too old for fetch_traces). Two sweeps (/api/public/traces + /api/public/observations) joined by trace id; per-trace obs fetch for small runs, bulk sweep for the full ~3,725.
- Outputs to git-ignored tools/langfuse-export/: full JSONL archive (trace + 4 spans), flat analysis CSV (machine/question/answer_preview/latency/fsm_state/n_chunks/top_score), resumable manifest.
- --as-evalseed: dedup real questions into a draft eval pack matching simlab/observe/evalpacks/*.yaml, all active:false, expected_asset a placeholder for human curation. PII-scrubbed via InferenceRouter.sanitize_text (historical input predates the forward-going scrub).
- argparse --dry-run/--max/--from/--to/--resume; polite paging + 429 backoff.
- Read-only against Langfuse; output is unsanitized customer data, git-ignored, never committed.
- tests/test_langfuse_export.py: 6 cases on the pure parse/flatten/seed helpers.

Verified live: dry-run reports totalItems=3725; --max 5 --as-evalseed wrote JSONL (4 spans joined), CSV, and a valid inactive eval-seed with no raw IPs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01CS9fxC3gdSUJDJqHw1uMiu
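The join of the two sweeps by trace id can be sketched as follows. This is a minimal illustration, not the code in tools/langfuse_export.py: the helper name join_by_trace_id is hypothetical, and it assumes each trace dict carries an "id" and each observation dict carries a "traceId" and an optional "startTime", as in the JSON returned by the Langfuse public API.

```python
from collections import defaultdict


def join_by_trace_id(traces, observations):
    """Attach each observation to its parent trace (hypothetical helper).

    Assumes trace dicts have an "id" key and observation dicts have a
    "traceId" key; observations whose trace is not in `traces` are dropped.
    """
    spans_by_trace = defaultdict(list)
    for obs in observations:
        trace_id = obs.get("traceId")
        if trace_id is not None:
            spans_by_trace[trace_id].append(obs)

    joined = []
    for trace in traces:
        # Order spans by start time so the archive reads chronologically;
        # a missing startTime sorts first.
        spans = sorted(
            spans_by_trace.get(trace["id"], []),
            key=lambda o: o.get("startTime") or "",
        )
        joined.append({"trace": trace, "observations": spans})
    return joined
```

Grouping observations once and then looking them up per trace is what lets a bulk sweep replace one observations request per trace.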
🤖 AI Code Review
Review by: groq (llama-3.3-70b-versatile)
Review of MIRA Project Pull Request
🔴 IMPORTANT: Security Vulnerabilities
🔴 IMPORTANT: Missing Error Handling
🟡 WARNING: Logic Bugs or Incorrect Assumptions
🟡 WARNING: Missing Input Validation
🔵 SUGGESTION: Code Quality Improvements
✅ GOOD: Noteworthy Good Practices
Generated by the MIRA automated code review pipeline (Groq → Cerebras → Gemini cascade)
MIRA staging gate: ✅ PASS
Engine + NeonDB staging branch + Groq cascade against fixed questions, graded on the 5-dimension rubric in
Rubric: |
… numeric guards

Hardening after a full-scale run against the live project (3,725 traces / 14,835 spans):

- Observations bulk sweep now pages in weekly time windows (fromStartTime/toStartTime). The list endpoint rejects deep offset pagination (HTTP 422 past ~page 42); windowing keeps each sweep shallow.
- _get_page retries 429/5xx, honoring Retry-After with exponential backoff (the free tier 429s mid-sweep; observed a 36s Retry-After).
- flatten_row coerces scores/count to numbers (a stray string score broke max()); the JSONL/CSV write loop guards each row so one bad record can't lose the whole run.
- Default --sleep raised to 0.5s.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01CS9fxC3gdSUJDJqHw1uMiu
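The windowing and backoff described in this commit could look roughly like the sketch below. Both helpers (weekly_windows, retry_delay) and their default values are hypothetical names chosen for illustration; the real _get_page may differ. The sketch only computes window bounds and wait times, it makes no HTTP calls.

```python
from datetime import datetime, timedelta, timezone


def weekly_windows(start, end, days=7):
    """Yield (window_start, window_end) pairs that tile [start, end).

    Each pair would map to fromStartTime/toStartTime on one shallow sweep,
    so no single sweep needs deep offset pagination.
    """
    step = timedelta(days=days)
    cursor = start
    while cursor < end:
        upper = min(cursor + step, end)
        yield cursor, upper
        cursor = upper


def retry_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait before retry number `attempt` (0-based).

    A numeric Retry-After header value wins; otherwise (header missing, or
    given as an HTTP date this sketch does not parse) fall back to capped
    exponential backoff: base * 2**attempt, at most `cap`.
    """
    if retry_after is not None:
        try:
            return max(float(retry_after), 0.0)
        except (TypeError, ValueError):
            pass
    return min(base * (2 ** attempt), cap)


if __name__ == "__main__":
    first = datetime(2026, 3, 23, tzinfo=timezone.utc)
    last = datetime(2026, 4, 10, tzinfo=timezone.utc)
    for lo, hi in weekly_windows(first, last):
        print(lo.isoformat(), hi.isoformat())
```

Keeping both pieces as pure functions makes them easy to unit test without touching the network, in the same spirit as the pure parse/flatten/seed helpers the PR already tests.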
🤖 AI Code Review
Review by: groq (llama-3.3-70b-versatile)
Review of PR: feat(tools): langfuse_export
🔴 IMPORTANT: Security vulnerabilities
🔴 IMPORTANT: Missing error handling on network/IO operations
🟡 WARNING: Logic bugs or incorrect assumptions
🟡 WARNING: Missing input validation at API boundaries
🔵 SUGGESTION: Code quality improvements, naming, maintainability
✅ GOOD: Noteworthy good practices found
Generated by the MIRA automated code review pipeline (Groq → Cerebras → Gemini cascade)
Why

Mike logged into the Langfuse Cloud account and found ~3,725 rag_query traces (back to 2026-03-23), three months of real production troubleshooting on the GARAGE CONVEYOR, that nothing in the repo could read back. This tool extracts them for our own use: analysis, archival, and seeding a regression test set. (The audit that started this: docs/research/2026-06-21-langfuse-integration-audit.md, PR #2157.)

What

- tools/langfuse_export.py: read-only pull via the Langfuse public REST API (httpx; version-independent, since the host's installed SDK is often too old for fetch_traces).
- Two sweeps (/api/public/traces + /api/public/observations) joined by trace id. Per-trace obs fetch for small runs; bulk sweep for the full set (~190 calls, not ~7,500).
- Outputs to git-ignored tools/langfuse-export/: full JSONL archive (trace + 4 spans), flat analysis CSV, resumable manifest.json.
- --as-evalseed: dedup real questions into a draft eval pack matching simlab/observe/evalpacks/*.yaml, all active:false, expected_asset a placeholder for human curation, PII-scrubbed via InferenceRouter.sanitize_text (historical input predates the forward-going scrub from "security(langfuse): scrub PII on trace path + fix dead Telegram tracing", #2157).
- argparse --dry-run / --max / --from / --to / --resume; polite paging + 429 backoff.

Verification (live, under doppler --config prd)

- --dry-run → totalItems=3725, sample parsed machine='GARAGE CONVEYOR' question='why did it stop?'
- --max 5 --as-evalseed → JSONL with all 4 spans joined, populated CSV, valid eval-seed YAML (all inactive, no raw IPs)
- pytest tests/test_langfuse_export.py → 6 passed; ruff clean

Boundaries

doppler run --project factorylm --config prd -- python tools/langfuse_export.py [--as-evalseed]

Note surfaced by the data

Sampled rows show n_chunks=0 and answers prefixed "Based on general industrial knowledge (not from documentation specific to this equipment)": the historical prod answers were ungrounded (no KB retrieval). This is direct evidence of the upload→retrieval / beta-gate gap, now measurable across 3 months.

🤖 Generated with Claude Code
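One way to put a number on that gap from the exported CSV is sketched below. It assumes the CSV columns n_chunks and answer_preview listed in the commit message and treats a row as ungrounded when n_chunks is zero or missing, or when the answer starts with the "general industrial knowledge" prefix. The function name ungrounded_stats is hypothetical, and the definition of "ungrounded" is an assumption made for this sketch.

```python
UNGROUNDED_PREFIX = "Based on general industrial knowledge"


def ungrounded_stats(rows):
    """Count rows that look ungrounded (hypothetical helper).

    `rows` is any iterable of dicts, e.g. from csv.DictReader. A row counts
    as ungrounded if n_chunks is 0, blank, or unparseable, or if
    answer_preview starts with the fallback prefix.
    """
    total = 0
    ungrounded = 0
    for row in rows:
        total += 1
        try:
            n_chunks = int(float(row.get("n_chunks") or 0))
        except (TypeError, ValueError, OverflowError):
            n_chunks = 0
        answer = (row.get("answer_preview") or "").lstrip()
        if n_chunks == 0 or answer.startswith(UNGROUNDED_PREFIX):
            ungrounded += 1
    share = ungrounded / total if total else 0.0
    return {"total": total, "ungrounded": ungrounded, "share": share}
```

To run it against a real export, the rows could come from csv.DictReader(open(path, newline="", encoding="utf-8")), where path points at the CSV written under tools/langfuse-export/ (the exact file name is not stated in this PR).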