[integrations] Smart ingest edge function #100
alanshurafa wants to merge 5 commits into NateBJones-Projects:main from
Conversation
LLM-powered document extraction with semantic deduplication, fingerprint matching, and dry-run preview. Supports Anthropic, OpenAI, and OpenRouter providers with automatic fallback. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Rename "Steps" to "Step-by-step instructions" for OB1 review bot
- Replace relative links to schemas/ingestion-jobs (not in this branch) with plain text references to avoid broken link check failures
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
OB1 review bot checks for lines starting with '1.' — convert bold numbered steps to standard markdown numbered list format. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
justfinethanku
left a comment
Code Review: Smart Ingest Edge Function
Thank you for this contribution! This is a well-thought-out integration that adds valuable document ingestion capabilities to Open Brain. I've completed a thorough review against the OB1 contribution standards.
✅ What's Good
- Excellent documentation — The README is comprehensive with clear sections for prerequisites, step-by-step instructions, API reference, expected outcomes, and troubleshooting. The credential tracker is a nice touch.
- Clean code structure — The Edge Function follows best practices with proper error handling, CORS headers, and environment variable usage (no hardcoded credentials).
- Multi-provider support — Automatic fallback chain (Anthropic → OpenAI → OpenRouter) provides flexibility and resilience.
- Dry-run workflow — The preview-before-commit pattern is user-friendly and prevents accidental data writes.
- Security — No dangerous SQL operations (DROP, TRUNCATE, unqualified DELETE), no hardcoded secrets, proper authentication via the `x-brain-key` header.
- Remote MCP pattern — Correctly uses Supabase Edge Function deployment (not a local Node.js server), complying with OB1 standards.
- Metadata valid — metadata.json has all required fields with correct types and values.
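The provider fallback chain described above can be sketched as follows. The `Provider` type and the `callProvider` callback signature are illustrative assumptions, not the PR's actual code:

```typescript
// Sketch of the Anthropic → OpenAI → OpenRouter fallback chain.
// Provider order comes from the review; the function shape is assumed.
type Provider = "anthropic" | "openai" | "openrouter";

async function extractWithFallback(
  text: string,
  callProvider: (p: Provider, text: string) => Promise<string>,
): Promise<{ provider: Provider; result: string }> {
  const chain: Provider[] = ["anthropic", "openai", "openrouter"];
  let lastError: unknown;
  for (const provider of chain) {
    try {
      // First provider that succeeds wins; failures fall through to the next.
      return { provider, result: await callProvider(provider, text) };
    } catch (err) {
      lastError = err;
    }
  }
  throw new Error(`All providers failed: ${lastError}`);
}
```

A chain like this gives resilience (one provider outage does not fail the request) at the cost of latency on the failure path, since each attempt must time out or error before the next begins.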
🔴 Blocking Issue: Missing Dependency
This PR depends on PR #98 (Ingestion Jobs schema), which is still OPEN.
The contribution references:
- Tables: `ingestion_jobs`, `ingestion_items`
- RPCs: `append_thought_evidence` (in addition to core `upsert_thought` and `match_thoughts`)
These are documented in Prerequisites and Step 1, and the troubleshooting section addresses the missing schema error. However, PR #98 must be merged first before this contribution can be tested or used by the community.
Recommendation:
- Mark this PR as draft or blocked until #98 is merged
- OR merge #98 first, then review and merge this PR
📋 Minor Suggestions (Non-Blocking)
README improvements:
- Consider adding a "What You'll Learn" or "Use Cases" section to help users understand when to use smart-ingest vs. other import methods
- The API Reference section could include response schema examples
- Step 3 says "Copy the contents of `index.ts`" — consider providing a direct curl/wget command to download the file (similar to other OB1 contributions)

Code considerations:
- The `SEMANTIC_SKIP_THRESHOLD` (0.92) and `SEMANTIC_MATCH_THRESHOLD` (0.85) are hardcoded. Consider documenting these as tunable parameters in a comment or README section
- The `MAX_THOUGHTS_PER_EXTRACTION` (20) limit might be worth mentioning in the README's "What It Does" section

Metadata:
- The `services` field says "Anthropic API or OpenAI API or OpenRouter" — technically only one is required, but embeddings require OpenAI or OpenRouter (not Anthropic). This could be clearer, though the README does explain it correctly in Prerequisites.
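For context on the two thresholds mentioned here, a minimal sketch of how they might gate the dedup decision is below. The constant values come from the review; the three-way decision (skip / append evidence / insert new) is an assumed interpretation of the semantic deduplication described in the PR, not its actual logic:

```typescript
// Illustrative use of the two hardcoded similarity thresholds.
const SEMANTIC_SKIP_THRESHOLD = 0.92; // near-duplicate: skip entirely
const SEMANTIC_MATCH_THRESHOLD = 0.85; // same thought: attach evidence instead of inserting

type DedupAction = "skip" | "append_evidence" | "insert_new";

// Classify an extracted thought by its cosine similarity to the closest
// existing thought (e.g. as returned by a match_thoughts-style RPC).
function classifyCandidate(similarity: number): DedupAction {
  if (similarity >= SEMANTIC_SKIP_THRESHOLD) return "skip";
  if (similarity >= SEMANTIC_MATCH_THRESHOLD) return "append_evidence";
  return "insert_new";
}
```

Documenting the bands this way (≥ 0.92 skip, 0.85–0.92 merge, < 0.85 insert) would make the tuning trade-off visible to users adjusting the constants.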
✅ Verification Checklist
- Folder structure correct (`integrations/smart-ingest/`)
- Required files present (README.md, metadata.json, index.ts)
- metadata.json valid and complete
- No credentials or secrets
- SQL safety (no dangerous operations)
- README has Prerequisites, Step-by-step instructions, Expected Outcome, Troubleshooting
- PR title format correct: `[integrations] Smart ingest edge function`
- No binary files over 1MB
- Remote MCP pattern (Edge Function, not local server)
- All changes within contribution folder
- Dependencies available — ❌ Depends on unmerged PR #98
Verdict: Significant changes needed
The contribution quality is excellent, but it cannot be merged until the dependency (PR #98 - Ingestion Jobs schema) is merged first. Once #98 lands, this PR will be ready to merge with only minor optional improvements.
Next steps:
- Merge PR #98 first
- Address the dependency blocker (either rebase or just wait)
- Optionally consider the minor suggestions above
- Re-request review
Great work overall! The implementation is solid, the documentation is thorough, and this will be a valuable addition to the OB1 ecosystem.
Add Use Cases section, document dedup thresholds with rationale, clarify that embeddings require OpenAI/OpenRouter (not Anthropic). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CI requires tool audit guide link for integrations. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Dependencies
- `ingestion_jobs` + `ingestion_items` tables
- `upsert_thought` and `match_thoughts` RPCs (from core Open Brain setup)
- `append_thought_evidence` RPC (from Ingestion Jobs schema)

Routes
- `POST /smart-ingest` — Extract and reconcile (dry_run or immediate)
- `POST /smart-ingest/execute` — Execute a previously dry-run job

Test plan
Tested against a production instance with 75K+ thoughts.
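A dry-run invocation of the ingest route might be constructed as below. The base URL, the `text`/`dry_run` field names, and the payload shape are assumptions inferred from the review's description of the `x-brain-key` auth and the preview-before-commit workflow:

```typescript
// Build a hypothetical dry-run request for POST /smart-ingest.
// Field names and URL are illustrative, not the contribution's documented API.
function buildIngestRequest(
  baseUrl: string,
  brainKey: string,
  text: string,
  dryRun: boolean,
): Request {
  return new Request(`${baseUrl}/smart-ingest`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "x-brain-key": brainKey, // auth header noted in the review
    },
    body: JSON.stringify({ text, dry_run: dryRun }),
  });
}

const req = buildIngestRequest(
  "https://example.supabase.co/functions/v1",
  "BRAIN_KEY",
  "Some document text to extract thoughts from",
  true, // preview only: no data is written until /smart-ingest/execute
);
// Pass `req` to fetch() to preview extraction without committing anything.
```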
🤖 Generated with Claude Code