
[integrations] Smart ingest edge function #100

Open
alanshurafa wants to merge 5 commits into NateBJones-Projects:main from alanshurafa:contrib/alanshurafa/smart-ingest

@alanshurafa
Contributor

Summary

  • Supabase Edge Function that extracts atomic thoughts from raw text via LLM
  • Three-stage pipeline: extract → deduplicate (fingerprint + semantic) → reconcile
  • Four reconciliation actions: add, skip, append_evidence, create_revision
  • Dry-run mode for previewing without mutations
  • Multi-provider support: Anthropic, OpenAI, OpenRouter with automatic fallback
  • All IDs are UUID/opaque strings throughout
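
The fingerprint half of the deduplication stage can be pictured roughly as follows. This is a minimal sketch, not the code in this PR: the normalization rules are assumptions, `fingerprintOf` and `dropExactDuplicates` are hypothetical names, and the normalized string stands in for whatever hash the real function computes.

```typescript
// Assumed normalization: lowercase, drop ASCII punctuation, collapse whitespace.
// The real fingerprint is presumably a hash; the normalized text is used here
// directly so the sketch stays dependency-free.
function fingerprintOf(text: string): string {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, "")
    .replace(/\s+/g, " ")
    .trim();
}

// Keep the first thought seen for each fingerprint and drop later exact repeats.
function dropExactDuplicates(thoughts: string[]): string[] {
  const seen = new Set<string>();
  const kept: string[] = [];
  for (const thought of thoughts) {
    const fp = fingerprintOf(thought);
    if (!seen.has(fp)) {
      seen.add(fp);
      kept.push(thought);
    }
  }
  return kept;
}
```

Thoughts that survive this exact-match pass would then go on to the semantic comparison, which catches rewordings that a fingerprint cannot.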

Dependencies

Routes

  • POST /smart-ingest — Extract and reconcile (dry_run or immediate)
  • POST /smart-ingest/execute — Execute a previously dry-run job
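
To illustrate the preview-then-commit flow across these two routes, the sketch below builds both requests without sending them. Only the route paths and the `x-brain-key` header (mentioned in the review below) come from this PR; the body fields `text`, `dry_run`, and `job_id`, the base URL, and the helper names are assumptions made for illustration.

```typescript
interface BuiltRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Shared builder: trims trailing slashes from the base URL and attaches the auth header.
function buildRequest(baseUrl: string, path: string, brainKey: string, payload: unknown): BuiltRequest {
  return {
    url: `${baseUrl.replace(/\/+$/, "")}${path}`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json", "x-brain-key": brainKey },
      body: JSON.stringify(payload),
    },
  };
}

// Step 1: preview only. `text` and `dry_run` are assumed field names.
function buildDryRun(baseUrl: string, brainKey: string, text: string): BuiltRequest {
  return buildRequest(baseUrl, "/smart-ingest", brainKey, { text, dry_run: true });
}

// Step 2: commit a previously previewed job. `job_id` is an assumed field name.
function buildExecute(baseUrl: string, brainKey: string, jobId: string): BuiltRequest {
  return buildRequest(baseUrl, "/smart-ingest/execute", brainKey, { job_id: jobId });
}
```

A caller would pass the result straight to `fetch(url, init)`, inspect the dry-run response, and only then send the execute request with the job ID it received.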

Test plan

  • Dry-run extracts and reconciles without writes
  • Execute commits a dry-run job successfully
  • Exact duplicate → skip (fingerprint match)
  • Semantic near-duplicate (>0.92) → skip
  • Similar but richer existing → append_evidence
  • Similar but new has more info → create_revision
  • Job/item/thought references work with UUID IDs
  • Provider fallback chain works (Anthropic → OpenAI → OpenRouter)

Tested against a production instance with 75K+ thoughts.
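
The reconciliation cases in the test plan amount to a small decision table, sketched below. The two threshold values are the ones quoted in the review of this PR; the comparison operators, the `Candidate` shape, and the `newIsRicher` flag are assumptions, since the thread does not say how "richer" is measured.

```typescript
type Action = "add" | "skip" | "append_evidence" | "create_revision";

// Values quoted in the review; whether the checks are strict or inclusive is assumed.
const SEMANTIC_SKIP_THRESHOLD = 0.92;
const SEMANTIC_MATCH_THRESHOLD = 0.85;

interface Candidate {
  fingerprintMatch: boolean; // an existing thought has the identical fingerprint
  bestSimilarity: number;    // highest similarity to any existing thought, 0..1
  newIsRicher: boolean;      // hypothetical signal: the new thought carries more information
}

function reconcile(c: Candidate): Action {
  if (c.fingerprintMatch) return "skip";                          // exact duplicate
  if (c.bestSimilarity > SEMANTIC_SKIP_THRESHOLD) return "skip";  // semantic near-duplicate
  if (c.bestSimilarity >= SEMANTIC_MATCH_THRESHOLD) {
    // Similar but not identical: keep whichever side has more information as the main record.
    return c.newIsRicher ? "create_revision" : "append_evidence";
  }
  return "add";                                                   // nothing similar enough
}
```

Keeping the decision in one pure function like this would let the same logic serve both dry-run previews and real writes, with only the side effects differing.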

🤖 Generated with Claude Code

alanshurafa and others added 3 commits March 21, 2026 22:09

LLM-powered document extraction with semantic deduplication, fingerprint
matching, and dry-run preview. Supports Anthropic, OpenAI, and OpenRouter
providers with automatic fallback.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Rename "Steps" to "Step-by-step instructions" for OB1 review bot
- Replace relative links to schemas/ingestion-jobs (not in this branch)
  with plain text references to avoid broken link check failures

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

OB1 review bot checks for lines starting with '1.' — convert bold
numbered steps to standard markdown numbered list format.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Collaborator

@justfinethanku left a comment

Code Review: Smart Ingest Edge Function

Thank you for this contribution! This is a well-thought-out integration that adds valuable document ingestion capabilities to Open Brain. I've completed a thorough review against the OB1 contribution standards.

✅ What's Good

  1. Excellent documentation — The README is comprehensive with clear sections for prerequisites, step-by-step instructions, API reference, expected outcomes, and troubleshooting. The credential tracker is a nice touch.

  2. Clean code structure — Edge Function follows best practices with proper error handling, CORS headers, and environment variable usage (no hardcoded credentials).

  3. Multi-provider support — Automatic fallback chain (Anthropic → OpenAI → OpenRouter) provides flexibility and resilience.

  4. Dry-run workflow — The preview-before-commit pattern is user-friendly and prevents accidental data writes.

  5. Security — No dangerous SQL operations (DROP, TRUNCATE, unqualified DELETE), no hardcoded secrets, proper authentication via x-brain-key header.

  6. Remote MCP pattern — Correctly uses Supabase Edge Function deployment (not local Node.js server), complying with OB1 standards.

  7. Metadata valid — metadata.json has all required fields with correct types and values.
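
The fallback chain praised in item 3 can be sketched generically as below. This is an illustration under stated assumptions rather than the code under review: `ProviderCall` and `callWithFallback` are hypothetical names, and each provider is reduced to a function from prompt to text so that no real SDK or HTTP call is implied.

```typescript
type Provider = "anthropic" | "openai" | "openrouter";

// A provider call takes a prompt and resolves to the model's text output.
type ProviderCall = (prompt: string) => Promise<string>;

// Try each provider in order and return the first success.
// Failures are collected so the final error explains every attempt.
async function callWithFallback(
  prompt: string,
  chain: Array<[Provider, ProviderCall]>,
): Promise<{ provider: Provider; text: string }> {
  const errors: string[] = [];
  for (const [provider, call] of chain) {
    try {
      return { provider, text: await call(prompt) };
    } catch (err) {
      errors.push(`${provider}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`All providers failed (${errors.join("; ")})`);
}
```

Returning the provider name alongside the text makes it possible to record which backend actually served a given extraction.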

🔴 Blocking Issue: Missing Dependency

This PR depends on PR #98 (Ingestion Jobs schema), which is still OPEN.

The contribution references:

  • Tables: ingestion_jobs, ingestion_items
  • RPCs: append_thought_evidence (in addition to core upsert_thought and match_thoughts)

These are documented in Prerequisites and Step 1, and the troubleshooting section addresses the missing schema error. However, PR #98 must be merged first before this contribution can be tested or used by the community.

Recommendation:

  • Mark this PR as draft or blocked until #98 is merged
  • OR merge #98 first, then review and merge this PR

📋 Minor Suggestions (Non-Blocking)

  1. README improvements:

    • Consider adding a "What You'll Learn" or "Use Cases" section to help users understand when to use smart-ingest vs. other import methods
    • The API Reference section could include response schema examples
    • Step 3 says "Copy the contents of index.ts" — consider providing a direct curl/wget command to download the file (similar to other OB1 contributions)
  2. Code considerations:

    • The SEMANTIC_SKIP_THRESHOLD (0.92) and SEMANTIC_MATCH_THRESHOLD (0.85) are hardcoded. Consider documenting these as tunable parameters in a comment or README section
    • The MAX_THOUGHTS_PER_EXTRACTION (20) limit might be worth mentioning in the README's "What It Does" section
  3. Metadata:

    • The services field says "Anthropic API or OpenAI API or OpenRouter" — technically only one is required, but embeddings require OpenAI or OpenRouter (not Anthropic). This could be clearer, though the README does explain it correctly in Prerequisites.
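
The provider requirement raised in the metadata suggestion could be made explicit with a startup check along these lines. It is a sketch only: the environment variable names are hypothetical (the thread does not list them), and `checkProviderKeys` is not a function from this PR.

```typescript
// Hypothetical environment variable names; the PR text does not list them.
interface ProviderKeys {
  ANTHROPIC_API_KEY?: string;
  OPENAI_API_KEY?: string;
  OPENROUTER_API_KEY?: string;
}

interface KeyCheck {
  canExtract: boolean; // some LLM provider is configured
  canEmbed: boolean;   // OpenAI or OpenRouter is configured (Anthropic alone is not enough)
  problems: string[];
}

function checkProviderKeys(keys: ProviderKeys): KeyCheck {
  const has = (v?: string): boolean => typeof v === "string" && v.trim().length > 0;
  const canEmbed = has(keys.OPENAI_API_KEY) || has(keys.OPENROUTER_API_KEY);
  const canExtract = canEmbed || has(keys.ANTHROPIC_API_KEY);
  const problems: string[] = [];
  if (!canExtract) problems.push("No LLM provider key is set.");
  if (!canEmbed) problems.push("Embeddings need an OpenAI or OpenRouter key.");
  return { canExtract, canEmbed, problems };
}
```

With only an Anthropic key set, this reports that extraction is possible but embeddings are not, which is the case the metadata wording currently leaves ambiguous.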

✅ Verification Checklist

  • Folder structure correct (integrations/smart-ingest/)
  • Required files present (README.md, metadata.json, index.ts)
  • metadata.json valid and complete
  • No credentials or secrets
  • SQL safety (no dangerous operations)
  • README has Prerequisites, Step-by-step instructions, Expected Outcome, Troubleshooting
  • PR title format correct: [integrations] Smart ingest edge function
  • No binary files over 1MB
  • Remote MCP pattern (Edge Function, not local server)
  • All changes within contribution folder
  • Dependencies available — ❌ Depends on unmerged PR #98

Verdict: Significant changes needed

The contribution quality is excellent, but it cannot be merged until the dependency (PR #98 - Ingestion Jobs schema) is merged first. Once #98 lands, this PR will be ready to merge with only minor optional improvements.


Next steps:

  1. Merge PR #98 first
  2. Once #98 has landed, rebase this branch onto main if it has drifted; otherwise no further action is needed on the dependency
  3. Optionally consider the minor suggestions above
  4. Re-request review

Great work overall! The implementation is solid, the documentation is thorough, and this will be a valuable addition to the OB1 ecosystem.

alanshurafa and others added 2 commits March 25, 2026 09:31

Add Use Cases section, document dedup thresholds with rationale,
clarify that embeddings require OpenAI/OpenRouter (not Anthropic).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

CI requires tool audit guide link for integrations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>