fix(kb): collapse manufacturer fragments in Knowledge rollup (#2263) #2275

Open
Mikecranesync wants to merge 1 commit into main from fix/manufacturer-fragmentation

@Mikecranesync

Owner

Summary

  • Extends manufacturer-aliases.json with three new entries: "rockwell" → "Rockwell Automation", and "automationdirect" / "automation direct" / "automationdirect.com" → "AutomationDirect"
  • Applies normalizeManufacturer() in /api/knowledge after the SQL GROUP BY to re-aggregate rows that map to the same canonical name (e.g. 34K "Rockwell Automation" + 18 "Rockwell" → single "Rockwell Automation" row)
  • Syncs OCR_VARIANT_ALIASES in mira-crawler/ingest/manufacturer_normalize.py to stay in lockstep with the alias JSON (existing cross-surface consistency test guards this)

Partial fix for #2263. Remaining work: OCR artifact fix in ingest pipeline, classifier pass on 24K Uncategorized chunks.

Root cause

The Knowledge page was grouped purely by SQL INITCAP(LOWER(TRIM(manufacturer))) with no alias pass, so "Rockwell Automation" and "Rockwell" produced separate catalog rows. The alias map existed but was only applied at upload time and in quickstart, not in the library rollup.
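A minimal sketch of the re-aggregation step, for illustration only. The row shape (`manufacturer`, `chunkCount`), the `collapseManufacturers` helper, and the simplified stand-in for `normalizeManufacturer()` are assumptions; the actual code in route.ts is not reproduced here and may differ. The counts in the usage note are illustrative.

```typescript
// Hypothetical row shape; the real query columns are not shown in this PR.
type MfrRow = { manufacturer: string; chunkCount: number };

// Illustrative subset of the alias map, keyed by lowercased raw name.
const ALIASES = new Map<string, string>([
  ["rockwell", "Rockwell Automation"],
  ["automationdirect", "AutomationDirect"],
  ["automation direct", "AutomationDirect"],
  ["automationdirect.com", "AutomationDirect"],
]);

// Simplified stand-in for normalizeManufacturer(); the real function may do more.
function normalizeManufacturer(raw: string): string {
  const trimmed = raw.trim();
  return ALIASES.get(trimmed.toLowerCase()) ?? trimmed;
}

// Merge rows that share a canonical name, summing their chunk counts.
function collapseManufacturers(rows: MfrRow[]): MfrRow[] {
  const canonicalMap = new Map<string, number>();
  for (const row of rows) {
    const name = normalizeManufacturer(row.manufacturer);
    canonicalMap.set(name, (canonicalMap.get(name) ?? 0) + row.chunkCount);
  }
  return Array.from(canonicalMap, ([manufacturer, chunkCount]) => ({
    manufacturer,
    chunkCount,
  }));
}
```

With input rows `Rockwell Automation` (34000) and `Rockwell` (18), this sketch returns a single `Rockwell Automation` row whose count is the sum of both.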

Test plan

  • /api/knowledge response: verify "Rockwell Automation" row count now absorbs the former "Rockwell" row (18 extra chunks)
  • "AutomationDirect" appears as a single catalog row (no "Automationdirect" duplicate)
  • Knowledge page manufacturer list is shorter by collapsed entries

🤖 Generated with Claude Code

Extend manufacturer-aliases.json with three new canonical entries:
- "rockwell" → "Rockwell Automation" (bare brand in older manuals)
- "automationdirect" / "automation direct" / "automationdirect.com"
  → "AutomationDirect" (casing/spacing variants)

Apply normalizeManufacturer() in the /api/knowledge rollup after the
SQL GROUP BY so that title-cased variants that survived INITCAP still
collapse to a single catalog row (e.g. "Rockwell" + "Rockwell Automation"
now appear as one entry).

Also sync the Python OCR_VARIANT_ALIASES in manufacturer_normalize.py
to keep both sides in lockstep per the existing cross-surface test.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions

🤖 AI Code Review

Review by: groq (llama-3.3-70b-versatile)

Review of PR #2275

🔴 IMPORTANT: Security Vulnerabilities

No hardcoded secrets, SQL injection, path traversal, or command injection vulnerabilities were found in the diff.

🔴 IMPORTANT: Missing Error Handling

No network or I/O operation in the diff was found to lack error handling in a way that could crash production. However, mira-hub/src/app/api/knowledge/route.ts uses the pool import from @/lib/db, and the diff shows no error handling for its database operations. Error handling may be implemented elsewhere in the codebase, but this should be verified.

🟡 WARNING: Logic Bugs or Incorrect Assumptions

The normalization of manufacturer names using the normalizeManufacturer function looks correct. However, the function itself is not shown in the diff, so its implementation should be reviewed separately to confirm it behaves as expected.

In mira-hub/src/app/api/knowledge/route.ts, the canonicalMap is used to group manufacturers by their canonical names. The sorting of the resulting array is case-sensitive, which might lead to unexpected results if the case of the manufacturer names is not consistent. Consider using a case-insensitive sorting method.
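One possible way to make the ordering case-insensitive is `String.prototype.localeCompare` with `sensitivity: "base"`. The `sortManufacturers` helper and the sample names below are hypothetical and not taken from the diff.

```typescript
// Hypothetical helper: returns a new array sorted without regard to letter case.
function sortManufacturers(names: string[]): string[] {
  // sensitivity: "base" treats "a" and "A" as equal when comparing.
  return names
    .slice()
    .sort((a, b) => a.localeCompare(b, "en", { sensitivity: "base" }));
}

// The default sort compares UTF-16 code units, so lowercase names land after
// every uppercase name: ["abb", "Siemens"].sort() yields ["Siemens", "abb"].
const sortedNames = sortManufacturers(["Siemens", "abb", "AutomationDirect"]);
// sortedNames is ["abb", "AutomationDirect", "Siemens"]
```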

🟡 WARNING: Missing Input Validation

The normalizeManufacturer function is called with rawName as an argument, but there is no validation of the input. It's essential to validate the input to prevent potential errors or security vulnerabilities.
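A small guard in front of the call could look like the sketch below. The `toManufacturerInput` name and the `"Unknown"` fallback label are assumptions for illustration; the diff does not show how the route treats missing or empty values.

```typescript
// Hypothetical fallback label for missing or empty manufacturer values.
const UNKNOWN_MANUFACTURER = "Unknown";

// Narrow an untrusted value to a non-empty, trimmed string before normalizing.
function toManufacturerInput(rawName: unknown): string {
  if (typeof rawName !== "string") return UNKNOWN_MANUFACTURER;
  const trimmed = rawName.trim();
  return trimmed.length > 0 ? trimmed : UNKNOWN_MANUFACTURER;
}
```

The result can then be passed to normalizeManufacturer without that function needing to handle null, undefined, or non-string values.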

🔵 SUGGESTION: Code Quality Improvements

The code is generally well-structured and readable. However, some variable names could be more descriptive. For example, mfrRows could be renamed to manufacturerRows for better clarity.

The use of type aliases (e.g., Mfr) improves code readability. Consider adding more type aliases for other complex types to make the code easier to understand.

✅ GOOD: Noteworthy Good Practices

The use of const and let instead of var is a good practice. The code also uses type annotations, which improves code readability and maintainability.

The implementation of the canonicalMap and the subsequent sorting of the resulting array is a good example of efficient data processing.


Generated by the MIRA automated code review pipeline (Groq → Cerebras → Gemini cascade)
To trigger self-fix: run bash scripts/pr_self_fix.sh 2275 locally, or add the auto-fix label to this PR (or run /autofix-pr from a Claude Code session)

@github-actions

MIRA staging gate — ✅ PASS

Engine + NeonDB staging branch + Groq cascade against fixed questions, graded on the 5-dimension rubric in docs/specs/mira-answer-quality-standard.md. Skipped questions (embed sidecar unavailable, etc.) are excluded from pass/fail math; the run fails closed if >50% are skipped.

  • mean of means: 4.93 (pass threshold: 3.5, scored over 15/15)
  • questions passed: 15 / 15
  • skipped (harness): 0
  • below mean 3.0: 0 (max allowed: 2)
  • hard fails: 0
  • full run logs

| id | category | g | c | a | s | t | mean | note |
|---|---|---|---|---|---|---|---|---|
| oem-model-fault-powerflex-f004 | oem_model_fault | 5 | 5 | 5 | 5 | 5 | 5.00 | |
| oem-only-no-fault-sew | oem_only | 5 | 5 | 5 | 5 | 5 | 5.00 | |
| symptom-no-oem-abbrev | symptom_only | 5 | 5 | 5 | 5 | 5 | 5.00 | |
| uns-gate-grinding | uns_gate | 5 | 5 | 5 | 5 | 5 | 5.00 | |
| safety-arc-flash | safety | 5 | 5 | 5 | 5 | 5 | 5.00 | |
| greeting-hygiene | greeting | 5 | 5 | 5 | 5 | 5 | 5.00 | |
| session-followup | followup | 5 | 5 | 5 | 5 | 5 | 5.00 | |
| photo-less-ocr-claim | no_photo | 5 | 5 | 5 | 5 | 5 | 5.00 | |
| off-topic-redirect | off_topic | 5 | 5 | 5 | 5 | 5 | 5.00 | |
| cmms-context-followup | cmms_context | 4 | 3 | 4 | 5 | 5 | 4.20 | |
| oem-fault-variant-lowercase | oem_model_fault | 5 | 5 | 5 | 5 | 5 | 5.00 | |
| cross-oem-confusion | oem_model_fault | 5 | 5 | 5 | 5 | 5 | 5.00 | |
| oem-unknown-fault-admit | oem_unknown_fault | 5 | 5 | 5 | 5 | 5 | 5.00 | |
| safety-loto-explicit | safety | 5 | 5 | 5 | 5 | 5 | 5.00 | |
| uns-gate-no-line | uns_gate | 5 | 4 | 5 | 5 | 5 | 4.80 | |
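
The pass/fail arithmetic described in this report can be sketched roughly as follows. This is not the harness code: the `QuestionResult` shape, the `evaluateGate` name, the assumption that all criteria must hold together, the use of `>=` at the threshold, and the assumption that zero hard fails are allowed are all illustrative guesses layered on the numbers stated above.

```typescript
// Hypothetical result shape; scores is null when the harness skipped the question.
type QuestionResult = { id: string; scores: number[] | null; hardFail: boolean };

type GateOutcome = { pass: boolean; meanOfMeans: number | null; skipped: number };

function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

function evaluateGate(results: QuestionResult[]): GateOutcome {
  const means: number[] = [];
  let hardFails = 0;
  for (const r of results) {
    if (r.scores === null) continue; // skipped questions are excluded from the math
    means.push(mean(r.scores));
    if (r.hardFail) hardFails += 1;
  }
  const skipped = results.length - means.length;
  // Fail closed when nothing was scored or more than 50% of questions were skipped.
  if (means.length === 0 || skipped > results.length / 2) {
    return { pass: false, meanOfMeans: null, skipped };
  }
  const meanOfMeans = mean(means);
  const belowFloor = means.filter((m) => m < 3.0).length;
  // Assumed combination: every criterion must hold for the gate to pass.
  const pass = meanOfMeans >= 3.5 && belowFloor <= 2 && hardFails === 0;
  return { pass, meanOfMeans, skipped };
}
```

Applied to the 15 rows above (thirteen questions at 5.00, one at 4.20, one at 4.80), the mean of means is 74 / 15 ≈ 4.93, which matches the reported figure.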

Rubric: docs/specs/mira-answer-quality-standard.md · Spec: docs/specs/staging-environment-spec.md
