feat(tools): langfuse_export — pull 3 months of historical rag_query traces to local JSONL/CSV by Mikecranesync · Pull Request #2172 · Mikecranesync/MIRA

Mikecranesync · 2026-06-21T14:22:59Z

Why

Mike logged into the Langfuse Cloud account and found ~3,725 rag_query traces (back to 2026-03-23) — three months of real production troubleshooting on the GARAGE CONVEYOR — that nothing in the repo could read back. This tool extracts it for our use: analyze / archive / seed a regression test set. (Audit that started this: docs/research/2026-06-21-langfuse-integration-audit.md, PR #2157.)

What

tools/langfuse_export.py — read-only pull via the Langfuse public REST API (httpx; version-independent — the host's installed SDK is often too old for fetch_traces).

Two sweeps (/api/public/traces + /api/public/observations) joined by trace id. Per-trace obs fetch for small runs; bulk sweep for the full set (~190 calls, not ~7,500).
Outputs to git-ignored tools/langfuse-export/: full JSONL archive (trace + 4 spans), flat analysis CSV, resumable manifest.json.
--as-evalseed: dedups real questions into a draft eval pack matching simlab/observe/evalpacks/*.yaml, with every entry active:false and expected_asset left as a placeholder for human curation. Questions are PII-scrubbed via InferenceRouter.sanitize_text, because the historical input predates the forward-going scrub added in PR #2157 ("security(langfuse): scrub PII on trace path + fix dead Telegram tracing").
--dry-run / --max / --from / --to / --resume; polite paging + 429 backoff.
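
The two-sweep join can be sketched roughly as follows. This is a minimal illustration, not the code in tools/langfuse_export.py: the helper name `join_by_trace_id` and the sample records are hypothetical, and it assumes each observation carries its parent trace id in a `traceId` field.

```python
from collections import defaultdict


def join_by_trace_id(traces, observations):
    """Attach each observation to its parent trace (hypothetical helper)."""
    spans_by_trace = defaultdict(list)
    for obs in observations:
        trace_id = obs.get("traceId")  # assumed field name on each observation
        if trace_id is not None:
            spans_by_trace[trace_id].append(obs)
    # One record per trace; a trace with no observations gets an empty list.
    return [
        {"trace": trace, "spans": spans_by_trace.get(trace["id"], [])}
        for trace in traces
    ]


# Made-up sample data, only to show the shape of the join.
traces = [{"id": "t1", "name": "rag_query"}, {"id": "t2", "name": "rag_query"}]
observations = [
    {"id": "o1", "traceId": "t1", "name": "retrieve"},
    {"id": "o2", "traceId": "t1", "name": "generate"},
    {"id": "o3", "traceId": "t-orphan", "name": "retrieve"},
]
joined = join_by_trace_id(traces, observations)
```

Grouping observations once and then looking them up per trace is what lets a bulk sweep replace one observation request per trace; observations whose trace was not fetched (like `o3` here) are simply dropped.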

Verification (live, under `doppler --config prd`)

--dry-run → totalItems=3725, sample parsed machine='GARAGE CONVEYOR' question='why did it stop?'
--max 5 --as-evalseed → JSONL with all 4 spans joined, populated CSV, valid eval-seed YAML (all inactive, no raw IPs)
pytest tests/test_langfuse_export.py → 6 passed; ruff clean
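
For readers unfamiliar with the eval-seed step, here is a rough sketch of what "dedup, scrub, all inactive" could look like. Everything in it is illustrative: `scrub_ipv4` is a hypothetical stand-in for InferenceRouter.sanitize_text (whose behavior is not shown in this PR), `build_eval_seed` and the `question` key are invented names, and only `active` and `expected_asset` come from the description above.

```python
import re

# Matches IPv4-looking tokens such as 192.168.1.50 (no range validation).
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def scrub_ipv4(text):
    """Hypothetical stand-in for the real scrubber: masks IPv4-looking tokens."""
    return IPV4.sub("[IP]", text)


def build_eval_seed(questions):
    """Dedup questions (ignoring case and extra whitespace) into draft entries."""
    seen = set()
    seed = []
    for raw in questions:
        cleaned = scrub_ipv4(" ".join(raw.split()))
        key = cleaned.lower()
        if not cleaned or key in seen:
            continue
        seen.add(key)
        seed.append({
            "question": cleaned,
            "active": False,                  # every draft entry starts inactive
            "expected_asset": "PLACEHOLDER",  # to be filled in by a human curator
        })
    return seed


seed = build_eval_seed([
    "why did it stop?",
    "Why did it stop? ",
    "VFD at 192.168.1.50 shows a fault",
])
```

Scrubbing before deduplication means two questions that differ only in an IP address collapse into one entry, which seems reasonable for a seed set but is a design choice of this sketch.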

Boundaries

Read-only against Langfuse; nothing written back.
Raw JSONL/CSV = unsanitized customer data → git-ignored, never committed. Only the eval-seed (which a human may curate into the repo) is scrubbed.
Run command: doppler run --project factorylm --config prd -- python tools/langfuse_export.py [--as-evalseed]

Note surfaced by the data

Sampled rows show n_chunks=0 and answers prefixed "Based on general industrial knowledge (not from documentation specific to this equipment)" — the historical prod answers were ungrounded (no KB retrieval). Direct evidence of the upload→retrieval / beta-gate gap, now measurable across 3 months.

🤖 Generated with Claude Code

…cal JSONL/CSV

Three months of real production troubleshooting (~3,725 rag_query traces) lived only in Langfuse Cloud with nothing reading it back. This tool extracts it for analysis, archival, and eval seeding.

- Langfuse public REST API via httpx (version-independent: the host's installed SDK is often too old for fetch_traces). Two sweeps (/api/public/traces + /api/public/observations) joined by trace id; per-trace obs fetch for small runs, bulk sweep for the full ~3,725.
- Outputs to git-ignored tools/langfuse-export/: full JSONL archive (trace + 4 spans), flat analysis CSV (machine/question/answer_preview/latency/fsm_state/n_chunks/top_score), resumable manifest.
- --as-evalseed: dedup real questions into a draft eval pack matching simlab/observe/evalpacks/*.yaml, all active:false, expected_asset a placeholder for human curation. PII-scrubbed via InferenceRouter.sanitize_text (historical input predates the forward-going scrub).
- argparse --dry-run/--max/--from/--to/--resume; polite paging + 429 backoff.
- Read-only against Langfuse; output is unsanitized customer data, git-ignored, never committed.
- tests/test_langfuse_export.py: 6 cases on the pure parse/flatten/seed helpers.

Verified live: dry-run reports totalItems=3725; --max 5 --as-evalseed wrote JSONL (4 spans joined), CSV, and a valid inactive eval-seed with no raw IPs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01CS9fxC3gdSUJDJqHw1uMiu

github-actions · 2026-06-21T14:23:59Z

🤖 AI Code Review

Review by: groq (llama-3.3-70b-versatile)

Review of MIRA Project Pull Request

🔴 IMPORTANT: Security Vulnerabilities

The code stores Langfuse public and secret keys as environment variables (LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY). While this is not inherently insecure, it's crucial to ensure these variables are not committed to the repository or exposed in any way. It's recommended to use a secure secrets management system.
In the _client function (tools/langfuse_export.py:246), the httpx.Client is created with a timeout of 60 seconds. This could potentially lead to a denial-of-service (DoS) attack if the server takes too long to respond. Consider implementing a more robust timeout and retry mechanism.

🔴 IMPORTANT: Missing Error Handling

In the _get_page function (tools/langfuse_export.py:258), the code raises an exception if the response status code is not 200. However, it does not handle other potential exceptions that may occur during the request, such as connection errors or timeouts. Add try-except blocks to handle these potential exceptions.
In the _sweep function (tools/langfuse_export.py:272), the code catches all exceptions and retries the request after a 5-second delay. While this is a good practice, it may lead to an infinite loop if the error persists. Consider implementing a limit on the number of retries.

🟡 WARNING: Logic Bugs or Incorrect Assumptions

In the parse_machine function (tools/langfuse_export.py:64), the code assumes that the machine name is always enclosed in square brackets and starts with "MACHINE:". Consider adding more robust parsing logic to handle potential variations in the input format.
In the parse_question function (tools/langfuse_export.py:83), the code assumes that the question is always the last non-empty, non-bracketed line in the input string. Consider adding more robust parsing logic to handle potential variations in the input format.
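
To make the two parsing assumptions concrete, here is a small sketch of parsers that behave the way the review describes. It is not the PR's implementation: the `_sketch` function names, the regular expression, and the sample input layout are assumptions, and "non-bracketed" is interpreted here as "not wholly enclosed in square brackets".

```python
import re

# Assumed tag shape: [MACHINE: <name>]
MACHINE_TAG = re.compile(r"\[MACHINE:\s*([^\]]+)\]")


def parse_machine_sketch(text):
    """Return the name inside a [MACHINE: ...] tag, or '' if none is present."""
    match = MACHINE_TAG.search(text or "")
    return match.group(1).strip() if match else ""


def parse_question_sketch(text):
    """Return the last non-empty line that is not a [bracketed] tag line."""
    for line in reversed((text or "").splitlines()):
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("[") and stripped.endswith("]"):
            continue
        return stripped
    return ""


# Hypothetical input layout; only the machine name and question come from the PR.
sample = "[MACHINE: GARAGE CONVEYOR]\n[STATE: FAULT]\nwhy did it stop?\n"
machine = parse_machine_sketch(sample)
question = parse_question_sketch(sample)
```

Both helpers return an empty string for missing or malformed input rather than raising, which is one way to address the variation concern without changing the happy path.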

🟡 WARNING: Missing Input Validation

In the flatten_row function (tools/langfuse_export.py:140), the code does not validate the input trace and spans dictionaries. Consider adding input validation to ensure that the required keys exist and have the expected data types.

🔵 SUGGESTION: Code Quality Improvements

The code uses a mix of snake_case and camelCase naming conventions. Consider adopting a consistent naming convention throughout the codebase.
Some functions, such as _sweep, have a large number of parameters. Consider breaking these functions down into smaller, more manageable pieces.
The code uses type hints, but some types are not explicitly defined (e.g., dict). Consider adding more explicit type definitions to improve code readability and maintainability.

✅ GOOD: Noteworthy Good Practices

The code uses a consistent coding style and indentation.
The code includes docstrings and comments to explain the purpose and behavior of each function.
The code uses a modular structure, with separate functions for different tasks.

Generated by the MIRA automated code review pipeline (Groq → Cerebras → Gemini cascade)
To trigger self-fix: run bash scripts/pr_self_fix.sh 2172 locally, or add the auto-fix label to this PR (or run /autofix-pr from a Claude Code session)

github-actions · 2026-06-21T14:24:37Z

MIRA staging gate — ✅ PASS

Engine + NeonDB staging branch + Groq cascade against fixed questions, graded on the 5-dimension rubric in docs/specs/mira-answer-quality-standard.md. Skipped questions (embed sidecar unavailable, etc.) are excluded from pass/fail math; the run fails closed if >50% are skipped.

mean of means: 4.96 (pass threshold: 3.5, scored over 15/15)
questions passed: 15 / 15
skipped (harness): 0
below mean 3.0: 0 (max allowed: 2)
hard fails: 0
full run logs

id	category	g	c	a	s	t	mean
✅ `oem-model-fault-powerflex-f004`	oem_model_fault	5	5	5	5	5	5.00
✅ `oem-only-no-fault-sew`	oem_only	5	5	5	5	5	5.00
✅ `symptom-no-oem-abbrev`	symptom_only	5	5	5	5	5	5.00
✅ `uns-gate-grinding`	uns_gate	5	5	5	5	5	5.00
✅ `safety-arc-flash`	safety	5	5	5	5	5	5.00
✅ `greeting-hygiene`	greeting	5	5	5	5	5	5.00
✅ `session-followup`	followup	5	5	5	5	5	5.00
✅ `photo-less-ocr-claim`	no_photo	5	5	5	5	5	5.00
✅ `off-topic-redirect`	off_topic	5	5	5	5	5	5.00
✅ `cmms-context-followup`	cmms_context	4	4	5	5	5	4.60
✅ `oem-fault-variant-lowercase`	oem_model_fault	5	5	5	5	5	5.00
✅ `cross-oem-confusion`	oem_model_fault	5	5	5	5	5	5.00
✅ `oem-unknown-fault-admit`	oem_unknown_fault	5	5	5	5	5	5.00
✅ `safety-loto-explicit`	safety	5	5	5	5	5	5.00
✅ `uns-gate-no-line`	uns_gate	5	4	5	5	5	4.80

Rubric: docs/specs/mira-answer-quality-standard.md · Spec: docs/specs/staging-environment-spec.md
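
The pass/fail arithmetic reported above can be reproduced with a short sketch. The function name `gate_verdict` is hypothetical, the comparison operators (`>=` for the threshold, `<=` for the below-floor allowance) are assumptions, and the "hard fails" rule is not modeled because the report does not describe it.

```python
def gate_verdict(rows, skipped=0, threshold=3.5, floor=3.0, max_below_floor=2):
    """rows: one list of rubric scores per scored question (skipped ones excluded)."""
    total = len(rows) + skipped
    # Fail closed when nothing was scored or more than half the run was skipped.
    if not rows or skipped / total > 0.5:
        return {"passed": False, "mean_of_means": None, "below_floor": None}
    means = [sum(scores) / len(scores) for scores in rows]
    mean_of_means = sum(means) / len(means)
    below_floor = sum(1 for m in means if m < floor)
    passed = mean_of_means >= threshold and below_floor <= max_below_floor
    return {
        "passed": passed,
        "mean_of_means": round(mean_of_means, 2),
        "below_floor": below_floor,
    }


# The 15 rows from the table: 13 perfect rows plus the two with a 4 in them.
rows = [[5, 5, 5, 5, 5]] * 13 + [[4, 4, 5, 5, 5], [5, 4, 5, 5, 5]]
verdict = gate_verdict(rows)
```

With the table's scores this gives a mean of means of 4.96 (74.4 / 15), matching the reported figure.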

… numeric guards

Hardening after a full-scale run against the live project (3,725 traces / 14,835 spans):

- Observations bulk sweep now pages in weekly time windows (fromStartTime/toStartTime). The list endpoint rejects deep offset pagination (HTTP 422 past ~page 42); windowing keeps each sweep shallow.
- _get_page retries 429/5xx honoring Retry-After with exponential backoff (the free tier 429s mid-sweep; observed a 36s Retry-After).
- flatten_row coerces scores/count to numbers (a stray string score broke max()); the JSONL/CSV write loop guards each row so one bad record can't lose the whole run.
- Default --sleep raised to 0.5s.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01CS9fxC3gdSUJDJqHw1uMiu
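
The weekly windowing idea can be illustrated with a small generator. This is a sketch under stated assumptions rather than the tool's code: `weekly_windows` is a hypothetical name, the end date is chosen for the example, and whether the API treats `toStartTime` as inclusive or exclusive is not established here.

```python
from datetime import datetime, timedelta, timezone


def weekly_windows(start, end):
    """Yield (lower, upper) pairs covering [start, end) in steps of at most 7 days."""
    cursor = start
    while cursor < end:
        upper = min(cursor + timedelta(days=7), end)
        yield cursor, upper
        cursor = upper


# Start date from the PR description; the end date is an assumption for the example.
start = datetime(2026, 3, 23, tzinfo=timezone.utc)
end = datetime(2026, 6, 22, tzinfo=timezone.utc)
windows = list(weekly_windows(start, end))

# Query parameters for the first page of each window (parameter names from the commit).
params = [
    {"fromStartTime": lo.isoformat(), "toStartTime": hi.isoformat()}
    for lo, hi in windows
]
```

Each window then starts again at page 1, so no single sweep has to page deep enough to hit the offset limit. If a single week ever held more rows than the limit allows, the window would need to shrink further; this sketch does not handle that case.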

github-actions · 2026-06-21T15:02:39Z

🤖 AI Code Review

Review by: groq (llama-3.3-70b-versatile)

Review of PR: feat(tools): langfuse_export

🔴 IMPORTANT: Security vulnerabilities

Hardcoded secrets: The code uses environment variables LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_HOST, and LANGFUSE_BASE_URL to store sensitive information. It is recommended to use a secure secrets management system instead of relying on environment variables.
Potential SQL injection/Path traversal: The _get_page function uses the params dictionary to construct the URL for the GET request. If the params dictionary contains user-controlled data, it may be vulnerable to SQL injection or path traversal attacks.

🔴 IMPORTANT: Missing error handling on network/IO operations

Network errors: The _get_page function only retries on 429 and 5xx status codes. It does not handle other network errors, such as connection timeouts or DNS resolution errors.
IO errors: The flatten_row function assumes that the trace and spans dictionaries are always valid. However, if the data is corrupted or incomplete, it may raise exceptions when trying to access certain keys.
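
For context on the 429/5xx retry policy, a bounded version can be sketched as a pure delay function plus a small loop. Nothing here is taken from _get_page: `retry_delay`, `fetch_with_retries`, the base and cap values, and the `(status, headers, body)` tuple shape are all hypothetical, and the fake responses exist only so the example runs without a network.

```python
import time


def retry_delay(status_code, headers, attempt, base=1.0, cap=60.0):
    """Seconds to wait before retrying, or None if the status is not retryable."""
    if status_code != 429 and not 500 <= status_code < 600:
        return None
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))  # server-specified wait wins
        except ValueError:
            pass  # the HTTP-date form of Retry-After is not handled in this sketch
    return min(cap, base * (2 ** attempt))  # exponential backoff, capped


def fetch_with_retries(fetch, max_attempts=5, sleep=time.sleep):
    """Call fetch() until it returns a non-retryable status or attempts run out."""
    for attempt in range(max_attempts):
        status, headers, body = fetch()
        delay = retry_delay(status, headers, attempt)
        if delay is None:
            return status, body
        if attempt < max_attempts - 1:
            sleep(delay)
    raise RuntimeError(f"gave up after {max_attempts} attempts (last status {status})")


# Fake responses and a recording sleep so the example runs instantly.
responses = iter([
    (429, {"Retry-After": "36"}, None),
    (503, {}, None),
    (200, {}, {"data": []}),
])
waits = []
status, body = fetch_with_retries(lambda: next(responses), sleep=waits.append)
```

The attempt cap keeps a persistent failure from looping forever. A real client would likely also catch transport-level errors (for httpx, `httpx.TransportError`) inside the loop, and note that a plain dict lookup is case-sensitive whereas httpx response headers are not.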

🟡 WARNING: Logic bugs or incorrect assumptions

Incorrect assumption about retrieved data structure: In the flatten_row function, it is assumed that the retrieved list contains dictionaries with a score key. However, if this assumption is incorrect, the scores list will be empty, and the top_score will be an empty string.
Incomplete data handling: The flatten_row function does not handle cases where the query or meta dictionaries are missing or incomplete.
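
The score-handling concern can be made concrete with a defensive sketch. It is not flatten_row itself: `to_number` and `top_score` are hypothetical helpers, and returning an empty string when no usable score exists simply mirrors the behavior the review describes.

```python
import math


def to_number(value):
    """Coerce ints, floats, and numeric strings to float; return None otherwise."""
    if isinstance(value, bool):
        return None  # avoid treating True/False as 1/0
    if isinstance(value, (int, float)):
        number = float(value)
    elif isinstance(value, str):
        try:
            number = float(value.strip())
        except ValueError:
            return None
    else:
        return None
    return number if math.isfinite(number) else None


def top_score(retrieved):
    """Highest usable score among retrieved chunks, or '' when there is none."""
    scores = []
    for item in retrieved or []:
        if isinstance(item, dict):
            number = to_number(item.get("score"))
            if number is not None:
                scores.append(number)
    return max(scores) if scores else ""


best = top_score([
    {"score": 0.4},
    {"score": "0.9"},   # stray string score, coerced
    {"score": "n/a"},   # unusable, skipped
    {"text": "no score key"},
])
```

Coercing before calling `max()` avoids the mixed string/number comparison error, and skipping unusable entries keeps one bad chunk from blanking the whole row.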

🟡 WARNING: Missing input validation at API boundaries

Invalid or missing query parameter: The parse_machine and parse_question functions do not validate the query parameter. If it is missing or invalid, the functions may return incorrect results.
Invalid or missing spans parameter: The flatten_row function does not validate the spans parameter. If it is missing or invalid, the function may raise exceptions or return incorrect results.

🔵 SUGGESTION: Code quality improvements, naming, maintainability

Consistent naming conventions: The code uses both camelCase and underscore notation for variable names. It is recommended to use a consistent naming convention throughout the codebase.
Type hints: The code does not use type hints for function parameters and return types. Adding type hints can improve code readability and maintainability.
Function length and complexity: Some functions, such as _get_page and flatten_row, are quite long and complex. It may be beneficial to break them down into smaller, more manageable functions.

✅ GOOD: Noteworthy good practices found

Error handling: The code uses try-except blocks to handle exceptions and provides informative error messages.
Logging: The code uses a logger to log important events and errors, making it easier to debug and monitor the application.
Code organization: The code is well-organized, with separate functions for different tasks and a clear structure.


Mikecranesync temporarily deployed to staging June 21, 2026 14:23 — with GitHub Actions Inactive

Mikecranesync temporarily deployed to staging June 21, 2026 15:00 — with GitHub Actions Inactive

Mikecranesync wants to merge 2 commits into main from feat/langfuse-trace-export
