feat(tools): langfuse_export — pull 3 months of historical rag_query traces to local JSONL/CSV #2172

Open
Mikecranesync wants to merge 2 commits into main from feat/langfuse-trace-export

@Mikecranesync

Owner

Why

Mike logged into the Langfuse Cloud account and found ~3,725 rag_query traces going back to 2026-03-23: three months of real production troubleshooting on the GARAGE CONVEYOR that nothing in the repo could read back. This tool extracts them so we can analyze them, archive them, and seed a regression test set. (Audit that started this: docs/research/2026-06-21-langfuse-integration-audit.md, PR #2157.)

What

tools/langfuse_export.py — read-only pull via the Langfuse public REST API (httpx; version-independent — the host's installed SDK is often too old for fetch_traces).

  • Two sweeps (/api/public/traces + /api/public/observations) joined by trace id. Per-trace obs fetch for small runs; bulk sweep for the full set (~190 calls, not ~7,500).
  • Outputs to git-ignored tools/langfuse-export/: full JSONL archive (trace + 4 spans), flat analysis CSV, resumable manifest.json.
  • --as-evalseed: dedup real questions into a draft eval pack matching simlab/observe/evalpacks/*.yaml — all active:false, expected_asset a placeholder for human curation, PII-scrubbed via InferenceRouter.sanitize_text (historical input predates the forward-going scrub from security(langfuse): scrub PII on trace path + fix dead Telegram tracing #2157).
  • --dry-run / --max / --from / --to / --resume; polite paging + 429 backoff.
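
The two-sweep join described above can be sketched roughly as follows. This is a minimal illustration, not the code in tools/langfuse_export.py: it assumes each observation carries a `traceId` field and each trace an `id` field, and the span names in the sample data are made up. The call estimate assumes a page size of 100, which the PR does not state; with that assumption the trace and span counts reported for the full run land close to the "~190 calls" figure.

```python
import math
from collections import defaultdict


def join_by_trace_id(traces, observations):
    """Group observations under their parent trace, keyed by trace id."""
    spans_by_trace = defaultdict(list)
    for obs in observations:
        trace_id = obs.get("traceId")  # assumed field name
        if trace_id is not None:
            spans_by_trace[trace_id].append(obs)
    # Traces with no observations get an empty span list; orphan
    # observations (no matching trace) are simply dropped.
    return [
        {"trace": trace, "spans": spans_by_trace.get(trace["id"], [])}
        for trace in traces
    ]


def bulk_call_estimate(n_traces, n_spans, page_size):
    """Number of list-endpoint calls needed for two full bulk sweeps."""
    return math.ceil(n_traces / page_size) + math.ceil(n_spans / page_size)


# Synthetic sample data for illustration only.
traces = [{"id": "t1", "name": "rag_query"}, {"id": "t2", "name": "rag_query"}]
observations = [
    {"id": "o1", "traceId": "t1", "name": "retrieve"},
    {"id": "o2", "traceId": "t1", "name": "generate"},
    {"id": "o3", "traceId": "t9", "name": "orphan"},
]
joined = join_by_trace_id(traces, observations)

# Counts taken from this PR; page size of 100 is an assumption.
estimated_calls = bulk_call_estimate(3725, 14835, page_size=100)
```

Joining in memory after two bulk sweeps trades a little RAM for far fewer requests than fetching observations once per trace, which matters on a rate-limited tier.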

Verification (live, under doppler --config prd)

  • --dry-run → totalItems=3725, sample parsed machine='GARAGE CONVEYOR' question='why did it stop?'
  • --max 5 --as-evalseed → JSONL with all 4 spans joined, populated CSV, valid eval-seed YAML (all inactive, no raw IPs)
  • pytest tests/test_langfuse_export.py → 6 passed; ruff clean

Boundaries

  • Read-only against Langfuse; nothing written back.
  • Raw JSONL/CSV = unsanitized customer data → git-ignored, never committed. Only the eval-seed (which a human may curate into the repo) is scrubbed.
  • Run command: doppler run --project factorylm --config prd -- python tools/langfuse_export.py [--as-evalseed]

Note surfaced by the data

Sampled rows show n_chunks=0 and answers prefixed "Based on general industrial knowledge (not from documentation specific to this equipment)" — the historical prod answers were ungrounded (no KB retrieval). Direct evidence of the upload→retrieval / beta-gate gap, now measurable across 3 months.
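
One way to put a number on this from the exported CSV is sketched below. It assumes only that the CSV has an `n_chunks` column, as listed in the commit message; the helper names and the sample rows are hypothetical, and unparseable `n_chunks` values are counted as zero, which may overstate the ungrounded share.

```python
import csv
import io


def to_int(value, default=0):
    """Coerce a CSV cell to int, falling back to a default on bad input."""
    try:
        return int(float(value))
    except (TypeError, ValueError, OverflowError):
        return default


def ungrounded_share(rows):
    """Return (count, fraction) of rows whose n_chunks is zero."""
    rows = list(rows)
    if not rows:
        return 0, 0.0
    ungrounded = sum(1 for row in rows if to_int(row.get("n_chunks")) == 0)
    return ungrounded, ungrounded / len(rows)


# Synthetic rows for illustration; not real exported data.
sample_csv = """machine,question,n_chunks,top_score
GARAGE CONVEYOR,why did it stop?,0,
GARAGE CONVEYOR,how do I reset the drive?,3,0.82
GARAGE CONVEYOR,what does the fault mean?,0,
"""
count, share = ungrounded_share(csv.DictReader(io.StringIO(sample_csv)))
```

Against the real export, the same function could be fed `csv.DictReader(open(path, newline=""))` for whichever CSV file the tool writes.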

🤖 Generated with Claude Code

…cal JSONL/CSV

Three months of real production troubleshooting (~3,725 rag_query traces) lived
only in Langfuse Cloud with nothing reading it back. This tool extracts it for
analysis, archival, and eval seeding.

- Langfuse public REST API via httpx (version-independent — the host's installed
  SDK is often too old for fetch_traces). Two sweeps (/api/public/traces +
  /api/public/observations) joined by trace id; per-trace obs fetch for small
  runs, bulk sweep for the full ~3,725.
- Outputs to git-ignored tools/langfuse-export/: full JSONL archive (trace +
  4 spans), flat analysis CSV (machine/question/answer_preview/latency/fsm_state/
  n_chunks/top_score), resumable manifest.
- --as-evalseed: dedup real questions into a draft eval pack matching
  simlab/observe/evalpacks/*.yaml, all active:false, expected_asset a placeholder
  for human curation. PII-scrubbed via InferenceRouter.sanitize_text (historical
  input predates the forward-going scrub).
- argparse --dry-run/--max/--from/--to/--resume; polite paging + 429 backoff.
- Read-only against Langfuse; output is unsanitized customer data, git-ignored,
  never committed.
- tests/test_langfuse_export.py: 6 cases on the pure parse/flatten/seed helpers.

Verified live: dry-run reports totalItems=3725; --max 5 --as-evalseed wrote
JSONL (4 spans joined), CSV, and a valid inactive eval-seed with no raw IPs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01CS9fxC3gdSUJDJqHw1uMiu
@github-actions


🤖 AI Code Review

Review by: groq (llama-3.3-70b-versatile)

Review of MIRA Project Pull Request

🔴 IMPORTANT: Security Vulnerabilities

  • The code stores Langfuse public and secret keys as environment variables (LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY). While this is not inherently insecure, it's crucial to ensure these variables are not committed to the repository or exposed in any way. It's recommended to use a secure secrets management system.
  • In the _client function (tools/langfuse_export.py:246), the httpx.Client is created with a timeout of 60 seconds. This could potentially lead to a denial-of-service (DoS) attack if the server takes too long to respond. Consider implementing a more robust timeout and retry mechanism.

🔴 IMPORTANT: Missing Error Handling

  • In the _get_page function (tools/langfuse_export.py:258), the code raises an exception if the response status code is not 200. However, it does not handle other potential exceptions that may occur during the request, such as connection errors or timeouts. Add try-except blocks to handle these potential exceptions.
  • In the _sweep function (tools/langfuse_export.py:272), the code catches all exceptions and retries the request after a 5-second delay. While this is a good practice, it may lead to an infinite loop if the error persists. Consider implementing a limit on the number of retries.

🟡 WARNING: Logic Bugs or Incorrect Assumptions

  • In the parse_machine function (tools/langfuse_export.py:64), the code assumes that the machine name is always enclosed in square brackets and starts with "MACHINE:". Consider adding more robust parsing logic to handle potential variations in the input format.
  • In the parse_question function (tools/langfuse_export.py:83), the code assumes that the question is always the last non-empty, non-bracketed line in the input string. Consider adding more robust parsing logic to handle potential variations in the input format.
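
For reference, the parsing behaviour the reviewer describes can be sketched as below. This is reconstructed only from the two bullets above, not from tools/langfuse_export.py; the `_sketch` function names and the `[STATE: IDLE]` line in the sample are hypothetical.

```python
import re

# Matches a bracketed tag such as "[MACHINE: GARAGE CONVEYOR]".
_MACHINE_RE = re.compile(r"\[MACHINE:\s*([^\]]+)\]")


def parse_machine_sketch(text):
    """Return the machine name from a [MACHINE: ...] tag, or '' if absent."""
    match = _MACHINE_RE.search(text or "")
    return match.group(1).strip() if match else ""


def parse_question_sketch(text):
    """Return the last non-empty line that does not start with '['."""
    for line in reversed((text or "").splitlines()):
        stripped = line.strip()
        if stripped and not stripped.startswith("["):
            return stripped
    return ""


# Hypothetical input shape, for illustration only.
sample = "[MACHINE: GARAGE CONVEYOR]\n[STATE: IDLE]\n\nwhy did it stop?\n"
machine = parse_machine_sketch(sample)
question = parse_question_sketch(sample)
```

Both helpers return an empty string instead of raising on missing or malformed input, which keeps a single odd trace from aborting an export.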

🟡 WARNING: Missing Input Validation

  • In the flatten_row function (tools/langfuse_export.py:140), the code does not validate the input trace and spans dictionaries. Consider adding input validation to ensure that the required keys exist and have the expected data types.

🔵 SUGGESTION: Code Quality Improvements

  • The code uses a mix of snake_case and camelCase naming conventions. Consider adopting a consistent naming convention throughout the codebase.
  • Some functions, such as _sweep, have a large number of parameters. Consider breaking these functions down into smaller, more manageable pieces.
  • The code uses type hints, but some types are not explicitly defined (e.g., dict). Consider adding more explicit type definitions to improve code readability and maintainability.

✅ GOOD: Noteworthy Good Practices

  • The code uses a consistent coding style and indentation.
  • The code includes docstrings and comments to explain the purpose and behavior of each function.
  • The code uses a modular structure, with separate functions for different tasks.

Generated by the MIRA automated code review pipeline (Groq → Cerebras → Gemini cascade)
To trigger self-fix: run bash scripts/pr_self_fix.sh 2172 locally, or add the auto-fix label to this PR (or run /autofix-pr from a Claude Code session)

@github-actions

github-actions Bot commented Jun 21, 2026


MIRA staging gate — ✅ PASS

Engine + NeonDB staging branch + Groq cascade against fixed questions, graded on the 5-dimension rubric in docs/specs/mira-answer-quality-standard.md. Skipped questions (embed sidecar unavailable, etc.) are excluded from pass/fail math; the run fails closed if >50% are skipped.

  • mean of means: 4.96 (pass threshold: 3.5, scored over 15/15)
  • questions passed: 15 / 15
  • skipped (harness): 0
  • below mean 3.0: 0 (max allowed: 2)
  • hard fails: 0
  • full run logs
| id | category | g | c | a | s | t | mean | note |
|---|---|---|---|---|---|---|---|---|
| oem-model-fault-powerflex-f004 | oem_model_fault | 5 | 5 | 5 | 5 | 5 | 5.00 | |
| oem-only-no-fault-sew | oem_only | 5 | 5 | 5 | 5 | 5 | 5.00 | |
| symptom-no-oem-abbrev | symptom_only | 5 | 5 | 5 | 5 | 5 | 5.00 | |
| uns-gate-grinding | uns_gate | 5 | 5 | 5 | 5 | 5 | 5.00 | |
| safety-arc-flash | safety | 5 | 5 | 5 | 5 | 5 | 5.00 | |
| greeting-hygiene | greeting | 5 | 5 | 5 | 5 | 5 | 5.00 | |
| session-followup | followup | 5 | 5 | 5 | 5 | 5 | 5.00 | |
| photo-less-ocr-claim | no_photo | 5 | 5 | 5 | 5 | 5 | 5.00 | |
| off-topic-redirect | off_topic | 5 | 5 | 5 | 5 | 5 | 5.00 | |
| cmms-context-followup | cmms_context | 4 | 4 | 5 | 5 | 5 | 4.60 | |
| oem-fault-variant-lowercase | oem_model_fault | 5 | 5 | 5 | 5 | 5 | 5.00 | |
| cross-oem-confusion | oem_model_fault | 5 | 5 | 5 | 5 | 5 | 5.00 | |
| oem-unknown-fault-admit | oem_unknown_fault | 5 | 5 | 5 | 5 | 5 | 5.00 | |
| safety-loto-explicit | safety | 5 | 5 | 5 | 5 | 5 | 5.00 | |
| uns-gate-no-line | uns_gate | 5 | 4 | 5 | 5 | 5 | 4.80 | |

Rubric: docs/specs/mira-answer-quality-standard.md · Spec: docs/specs/staging-environment-spec.md

… numeric guards

Hardening after a full-scale run against the live project (3,725 traces / 14,835
spans):

- Observations bulk sweep now pages in weekly time windows (fromStartTime/
  toStartTime). The list endpoint rejects deep offset pagination (HTTP 422 past
  ~page 42); windowing keeps each sweep shallow.
- _get_page retries 429/5xx honoring Retry-After with exponential backoff (the
  free tier 429s mid-sweep; observed a 36s Retry-After).
- flatten_row coerces scores/count to numbers (a stray string score broke
  max()); the JSONL/CSV write loop guards each row so one bad record can't lose
  the whole run.
- Default --sleep raised to 0.5s.
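
A rough sketch of the two hardening ideas above, time-windowed sweeps and Retry-After-aware backoff, is given below. It is not the code in this PR: the function names, the base and cap values, and the choice to fall back to exponential backoff when Retry-After is missing or not a plain number of seconds are all assumptions.

```python
from datetime import datetime, timedelta, timezone


def weekly_windows(start, end, days=7):
    """Yield contiguous (from, to) windows covering [start, end)."""
    cursor = start
    while cursor < end:
        upper = min(cursor + timedelta(days=days), end)
        yield cursor, upper
        cursor = upper


def retry_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait before retry number `attempt` (0-based).

    A numeric Retry-After header wins; otherwise use capped
    exponential backoff: base * 2**attempt, at most `cap`.
    """
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except (TypeError, ValueError):
            pass  # e.g. an HTTP-date value; fall back to backoff
    return min(cap, base * (2 ** attempt))


# Start date is from the PR description; the end date is illustrative.
start = datetime(2026, 3, 23, tzinfo=timezone.utc)
end = datetime(2026, 6, 21, tzinfo=timezone.utc)
windows = list(weekly_windows(start, end))

# Each window would become one shallow paged sweep, for example:
params = {
    "fromStartTime": windows[0][0].isoformat(),
    "toStartTime": windows[0][1].isoformat(),
}
```

Because adjacent windows share a boundary timestamp, a real sweep may need to deduplicate observations by id if the endpoint treats both bounds as inclusive.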

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01CS9fxC3gdSUJDJqHw1uMiu
@github-actions


🤖 AI Code Review

Review by: groq (llama-3.3-70b-versatile)

Review of PR: feat(tools): langfuse_export

🔴 IMPORTANT: Security vulnerabilities

  1. Hardcoded secrets: The code uses environment variables LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_HOST, and LANGFUSE_BASE_URL to store sensitive information. It is recommended to use a secure secrets management system instead of relying on environment variables.
  2. Potential SQL injection/Path traversal: The _get_page function uses the params dictionary to construct the URL for the GET request. If the params dictionary contains user-controlled data, it may be vulnerable to SQL injection or path traversal attacks.

🔴 IMPORTANT: Missing error handling on network/IO operations

  1. Network errors: The _get_page function only retries on 429 and 5xx status codes. It does not handle other network errors, such as connection timeouts or DNS resolution errors.
  2. IO errors: The flatten_row function assumes that the trace and spans dictionaries are always valid. However, if the data is corrupted or incomplete, it may raise exceptions when trying to access certain keys.

🟡 WARNING: Logic bugs or incorrect assumptions

  1. Incorrect assumption about retrieved data structure: In the flatten_row function, it is assumed that the retrieved list contains dictionaries with a score key. However, if this assumption is incorrect, the scores list will be empty, and the top_score will be an empty string.
  2. Incomplete data handling: The flatten_row function does not handle cases where the query or meta dictionaries are missing or incomplete.

🟡 WARNING: Missing input validation at API boundaries

  1. Invalid or missing query parameter: The parse_machine and parse_question functions do not validate the query parameter. If it is missing or invalid, the functions may return incorrect results.
  2. Invalid or missing spans parameter: The flatten_row function does not validate the spans parameter. If it is missing or invalid, the function may raise exceptions or return incorrect results.

🔵 SUGGESTION: Code quality improvements, naming, maintainability

  1. Consistent naming conventions: The code uses both camelCase and underscore notation for variable names. It is recommended to use a consistent naming convention throughout the codebase.
  2. Type hints: The code does not use type hints for function parameters and return types. Adding type hints can improve code readability and maintainability.
  3. Function length and complexity: Some functions, such as _get_page and flatten_row, are quite long and complex. It may be beneficial to break them down into smaller, more manageable functions.

✅ GOOD: Noteworthy good practices found

  1. Error handling: The code uses try-except blocks to handle exceptions and provides informative error messages.
  2. Logging: The code uses a logger to log important events and errors, making it easier to debug and monitor the application.
  3. Code organization: The code is well-organized, with separate functions for different tasks and a clear structure.

