[integrations] Smart ingest edge function by alanshurafa · Pull Request #100 · NateBJones-Projects/OB1

alanshurafa · 2026-03-22T02:10:51Z

Summary

Supabase Edge Function that extracts atomic thoughts from raw text via LLM
Three-stage pipeline: extract → deduplicate (fingerprint + semantic) → reconcile
Four reconciliation actions: add, skip, append_evidence, create_revision
Dry-run mode for previewing without mutations
Multi-provider support: Anthropic, OpenAI, OpenRouter with automatic fallback
All IDs are UUID/opaque strings throughout
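The PR text does not show how fingerprints are computed, so the following is only a minimal sketch of what fingerprint-based exact-duplicate detection could look like. The helper names, the normalization rules, and the FNV-1a style hash are all assumptions chosen to keep the example dependency-free; a real deployment would more likely use a cryptographic hash.

```typescript
// Hypothetical helper: collapse case and whitespace so trivially different
// copies of the same thought normalize to the same string.
function normalizeForFingerprint(text: string): string {
  return text.toLowerCase().replace(/\s+/g, " ").trim();
}

// Hypothetical fingerprint: a 32-bit FNV-1a style hash over UTF-16 code units.
// This is a stand-in for illustration, not the hash used by the PR.
function fingerprint(text: string): string {
  const normalized = normalizeForFingerprint(text);
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < normalized.length; i++) {
    hash ^= normalized.charCodeAt(i);
    // Multiply by the FNV prime 0x01000193 using shifts to stay within 32 bits.
    hash =
      (hash + (hash << 1) + (hash << 4) + (hash << 7) + (hash << 8) + (hash << 24)) >>> 0;
  }
  // Zero-pad to a fixed width of 8 hex characters.
  return ("0000000" + hash.toString(16)).slice(-8);
}
```

Under this sketch, two extracted thoughts with equal fingerprints would be treated as exact duplicates before any embedding comparison is attempted.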

Dependencies

Ingestion Jobs schema ([schemas] Ingestion jobs schema for document ingest #98) — ingestion_jobs + ingestion_items tables
upsert_thought and match_thoughts RPCs (from core Open Brain setup)
append_thought_evidence RPC (from Ingestion Jobs schema)

Routes

POST /smart-ingest — Extract and reconcile (dry_run or immediate)
POST /smart-ingest/execute — Execute a previously dry-run job
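As a rough illustration of the two routes, the sketch below builds the requests for a dry run and for the later execute call. The `x-brain-key` header is the one named in the review; the body field names (`text`, `dry_run`, `job_id`), the helper names, and the base URL are assumptions, since the PR page does not show the request schema.

```typescript
interface PreparedRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Shared builder: joins the base URL and route, and attaches the auth header.
function prepare(
  baseUrl: string,
  path: string,
  brainKey: string,
  payload: unknown
): PreparedRequest {
  return {
    url: baseUrl.replace(/\/+$/, "") + path, // drop trailing slashes on the base
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json", "x-brain-key": brainKey },
      body: JSON.stringify(payload),
    },
  };
}

// Hypothetical payload shape for previewing without mutations.
function dryRunRequest(baseUrl: string, brainKey: string, text: string): PreparedRequest {
  return prepare(baseUrl, "/smart-ingest", brainKey, { text: text, dry_run: true });
}

// Hypothetical payload shape for committing a previously dry-run job.
function executeRequest(baseUrl: string, brainKey: string, jobId: string): PreparedRequest {
  return prepare(baseUrl, "/smart-ingest/execute", brainKey, { job_id: jobId });
}
```

Each result can be passed to `fetch(req.url, req.init)`. Keeping the preview and the commit as separate calls mirrors the preview-before-commit flow the routes describe.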

Test plan

Dry-run extracts and reconciles without writes
Execute commits a dry-run job successfully
Exact duplicate → skip (fingerprint match)
Semantic near-duplicate (>0.92) → skip
Similar but richer existing → append_evidence
Similar but new has more info → create_revision
Job/item/thought references work with UUID IDs
Provider fallback chain works (Anthropic → OpenAI → OpenRouter)
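The four reconciliation cases in the test plan can be read as a small decision function. The sketch below is one plausible reading, not the PR's actual code: the 0.92 skip threshold comes from the test plan, the 0.85 lower bound comes from the `SEMANTIC_MATCH_THRESHOLD` constant mentioned in the review, and the `Candidate` shape (including how "has more info" is judged) is hypothetical.

```typescript
type Action = "add" | "skip" | "append_evidence" | "create_revision";

const SEMANTIC_SKIP_THRESHOLD = 0.92;
const SEMANTIC_MATCH_THRESHOLD = 0.85; // assumed lower bound of the "similar" band

// Hypothetical summary of how a new thought compares to its best existing match.
interface Candidate {
  fingerprintMatch: boolean; // exact duplicate by fingerprint
  similarity: number;        // semantic similarity to the closest existing thought
  newHasMoreInfo: boolean;   // whether the new thought is richer than the match
}

function reconcile(c: Candidate): Action {
  if (c.fingerprintMatch) return "skip";                    // exact duplicate
  if (c.similarity > SEMANTIC_SKIP_THRESHOLD) return "skip"; // near-duplicate
  if (c.similarity >= SEMANTIC_MATCH_THRESHOLD) {
    // Similar but not identical: keep whichever side carries more information.
    return c.newHasMoreInfo ? "create_revision" : "append_evidence";
  }
  return "add"; // nothing similar enough exists
}
```

For example, `reconcile({ fingerprintMatch: false, similarity: 0.88, newHasMoreInfo: true })` returns `"create_revision"` under these assumptions.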

Tested against a production instance with 75K+ thoughts.

🤖 Generated with Claude Code

LLM-powered document extraction with semantic deduplication, fingerprint matching, and dry-run preview. Supports Anthropic, OpenAI, and OpenRouter providers with automatic fallback.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Rename "Steps" to "Step-by-step instructions" for OB1 review bot
- Replace relative links to schemas/ingestion-jobs (not in this branch) with plain text references to avoid broken link check failures

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

justfinethanku

Code Review: Smart Ingest Edge Function

Thank you for this contribution! This is a well-thought-out integration that adds valuable document ingestion capabilities to Open Brain. I've completed a thorough review against the OB1 contribution standards.

✅ What's Good

Excellent documentation — The README is comprehensive with clear sections for prerequisites, step-by-step instructions, API reference, expected outcomes, and troubleshooting. The credential tracker is a nice touch.
Clean code structure — Edge Function follows best practices with proper error handling, CORS headers, and environment variable usage (no hardcoded credentials).
Multi-provider support — Automatic fallback chain (Anthropic → OpenAI → OpenRouter) provides flexibility and resilience.
Dry-run workflow — The preview-before-commit pattern is user-friendly and prevents accidental data writes.
Security — No dangerous SQL operations (DROP, TRUNCATE, unqualified DELETE), no hardcoded secrets, proper authentication via x-brain-key header.
Remote MCP pattern — Correctly uses Supabase Edge Function deployment (not local Node.js server), complying with OB1 standards.
Metadata valid — metadata.json has all required fields with correct types and values.
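The fallback chain praised above can be pictured as trying each configured provider in order and returning the first success. This is a generic sketch of that pattern, assuming a hypothetical `Provider` interface whose `extract` function wraps the actual API call; it is not taken from the PR's `index.ts`.

```typescript
type ProviderName = "anthropic" | "openai" | "openrouter";

// Hypothetical abstraction over one LLM provider.
interface Provider {
  name: ProviderName;
  extract: (text: string) => Promise<string[]>;
}

interface ExtractionResult {
  provider: ProviderName;
  thoughts: string[];
}

// Try providers in the given order; collect errors so a total failure is debuggable.
async function extractWithFallback(
  providers: Provider[],
  text: string
): Promise<ExtractionResult> {
  const errors: string[] = [];
  for (const provider of providers) {
    try {
      const thoughts = await provider.extract(text);
      return { provider: provider.name, thoughts: thoughts };
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      errors.push(provider.name + ": " + message);
    }
  }
  throw new Error("All providers failed: " + errors.join("; "));
}
```

Passing the providers as an ordered array keeps the Anthropic → OpenAI → OpenRouter order a matter of configuration rather than control flow.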

🔴 Blocking Issue: Missing Dependency

This PR depends on PR #98 (Ingestion Jobs schema), which is still OPEN.

The contribution references:

Tables: ingestion_jobs, ingestion_items
RPCs: append_thought_evidence (in addition to core upsert_thought and match_thoughts)

These are documented in Prerequisites and Step 1, and the troubleshooting section addresses the missing schema error. However, PR #98 must be merged before this contribution can be tested or used by the community.

Recommendation:

Mark this PR as draft or blocked until #98 is merged
OR merge #98 first, then review and merge this PR

📋 Minor Suggestions (Non-Blocking)

README improvements:
- Consider adding a "What You'll Learn" or "Use Cases" section to help users understand when to use smart-ingest vs. other import methods
- The API Reference section could include response schema examples
- Step 3 says "Copy the contents of index.ts" — consider providing a direct curl/wget command to download the file (similar to other OB1 contributions)
Code considerations:
- The SEMANTIC_SKIP_THRESHOLD (0.92) and SEMANTIC_MATCH_THRESHOLD (0.85) are hardcoded. Consider documenting these as tunable parameters in a comment or README section
- The MAX_THOUGHTS_PER_EXTRACTION (20) limit might be worth mentioning in the README's "What It Does" section
Metadata:
- The services field says "Anthropic API or OpenAI API or OpenRouter" — technically only one is required, but embeddings require OpenAI or OpenRouter (not Anthropic). This could be clearer, though the README does explain it correctly in Prerequisites.
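One way to act on the threshold and provider suggestions is to resolve both from configuration at startup. The sketch below assumes hypothetical environment variable names that mirror the constants named in the review, plus conventional API key names; none of these names are confirmed by the PR. Defaults match the values quoted above (0.92, 0.85, 20).

```typescript
interface Tuning {
  semanticSkipThreshold: number;
  semanticMatchThreshold: number;
  maxThoughtsPerExtraction: number;
}

type Env = Record<string, string | undefined>;

// Parse a numeric setting, falling back to the default when unset or invalid.
function parseNumber(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const n = Number(raw);
  return isFinite(n) ? n : fallback;
}

// Hypothetical variable names; defaults are the values quoted in the review.
function loadTuning(env: Env): Tuning {
  const tuning: Tuning = {
    semanticSkipThreshold: parseNumber(env.SEMANTIC_SKIP_THRESHOLD, 0.92),
    semanticMatchThreshold: parseNumber(env.SEMANTIC_MATCH_THRESHOLD, 0.85),
    maxThoughtsPerExtraction: parseNumber(env.MAX_THOUGHTS_PER_EXTRACTION, 20),
  };
  const skip = tuning.semanticSkipThreshold;
  const match = tuning.semanticMatchThreshold;
  if (!(0 <= match && match <= skip && skip <= 1)) {
    throw new Error("Thresholds must satisfy 0 <= match <= skip <= 1");
  }
  const max = tuning.maxThoughtsPerExtraction;
  if (!(max >= 1 && Math.floor(max) === max)) {
    throw new Error("MAX_THOUGHTS_PER_EXTRACTION must be a positive integer");
  }
  return tuning;
}

type EmbeddingProvider = "openai" | "openrouter";

// Embeddings need OpenAI or OpenRouter; an Anthropic key alone is not enough.
function pickEmbeddingProvider(env: Env): EmbeddingProvider {
  if (env.OPENAI_API_KEY) return "openai";
  if (env.OPENROUTER_API_KEY) return "openrouter";
  throw new Error("Embeddings require an OpenAI or OpenRouter API key");
}
```

In a Supabase Edge Function these could be called as `loadTuning(Deno.env.toObject())` and `pickEmbeddingProvider(Deno.env.toObject())`, so a misconfigured deployment fails at startup rather than partway through a job.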

✅ Verification Checklist

Verdict: Significant changes needed

The contribution quality is excellent, but it cannot be merged until its dependency (PR #98 - Ingestion Jobs schema) is merged. Once #98 lands, this PR will be ready to merge with only minor optional improvements.

Next steps:

Merge PR #98 first
Address the dependency blocker (either rebase or just wait)
Optionally consider the minor suggestions above
Re-request review

Great work overall! The implementation is solid, the documentation is thorough, and this will be a valuable addition to the OB1 ecosystem.

alanshurafa and others added 3 commits March 21, 2026 22:09

Fix steps format to use markdown numbered lists

bbd37f0

OB1 review bot checks for lines starting with '1.' — convert bold numbered steps to standard markdown numbered list format. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

justfinethanku mentioned this pull request Mar 24, 2026

[recipes] Universal ingest primitives #99

Open

4 tasks

justfinethanku reviewed Mar 24, 2026

alanshurafa and others added 2 commits March 25, 2026 09:31

[integrations] Address review feedback for smart ingest

e5b1211

Add Use Cases section, document dedup thresholds with rationale, clarify that embeddings require OpenAI/OpenRouter (not Anthropic). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

[integrations] Add tool audit link for smart ingest

53cbaaf

CI requires tool audit guide link for integrations. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

alanshurafa wants to merge 5 commits into NateBJones-Projects:main from alanshurafa:contrib/alanshurafa/smart-ingest