Deterministic Text Embedder Implementation#62

Closed
guiperry wants to merge 1 commit into giancarloerra:main from guiperry:feat/deterministic-text-embedder

Conversation

@guiperry guiperry commented May 23, 2026

Summary

Changes

Goal

  • Integrate the text-embedder Go binary as the primary embedding provider in SocratiCode, with all existing providers as fallbacks.
    Constraints & Preferences
  • text-embedder binary must be distributed as text-embedder.gz (gzip) for practical file size in the npm package / git repo.
  • The TypeScript provider must decompress the .gz at runtime to a temp directory and spawn it as a subprocess.
  • All existing providers (ollama, openai, google, lmstudio, litellm) must remain as fallbacks switchable via EMBEDDING_PROVIDER.
  • text-embedder produces 768-dim int32 vectors scaled to [0, 10000]; the provider converts them to float64 values in [0, 1] for Qdrant.
    Progress
    Done
  • Built the Go text-embedder binary (make deploy) and placed text-embedder.gz (5.0 MB) in the SocratiCode project root.
  • Created src/services/provider-textembedder.ts — full EmbeddingProvider implementation with binary decompression, subprocess spawning, HTTP API client, and process cleanup.
  • Updated src/services/embedding-config.ts: added "textembedder" to the EmbeddingProvider type union, PROVIDER_DEFAULTS (landmark-lattice-v1, 768 dims), MODEL_CONTEXT_LENGTHS (8192), and validation; changed the default provider from "ollama" to "textembedder".
  • Updated src/services/embedding-provider.ts — added case "textembedder" in factory switch that imports and instantiates TextEmbedderEmbeddingProvider.
  • Updated text-embedder/Makefile — added compress, deploy, deploy-linux, deploy-mac, deploy-windows targets that build, gzip, and copy to SOCRA_DIR.
  • Updated tests/unit/embedding-config.test.ts and tests/unit/embedding-provider.test.ts — changed default provider assertions from "ollama" to "textembedder".
  • TypeScript compilation: zero errors.
  • Unit tests: 823 passed across 36 test files.
    In Progress
  • (none)
    Blocked
  • (none)
    Key Decisions
  • Binary is stored as text-embedder.gz (gzip) to keep repo size small (~5 MB compressed vs ~9.2 MB raw).
  • Decompression happens once to os.tmpdir()/socraticode-textembedder/text-embedder; the extracted binary is reused across restarts until cleanup.
  • Provider supports three modes: (1) compressed binary auto-extracted, (2) external URL via TEXTEMBEDDER_URL, (3) bare binary via TEXTEMBEDDER_BIN_PATH.
  • Batch size 64; inter-batch delay 0 (local subprocess, no rate limits).
  • Process cleanup registered on exit/SIGINT/SIGTERM; temp dir removed best-effort.
    Next Steps
  • Run integration/E2E tests to verify the binary actually starts and produces valid vectors through the full Qdrant pipeline.
  • Add text-embedder.gz to the package.json files array for npm distribution.
  • Optionally add a postinstall script to extract the binary automatically.
    Critical Context
  • text-embedder Go module: github.com/guiperry/text-embedder (Go 1.22).
  • SocratiCode TypeScript: all embedding providers implement the EmbeddingProvider interface from embedding-types.ts.
  • Factory in embedding-provider.ts uses switch on EMBEDDING_PROVIDER env var; default is now "textembedder".
  • embeddings.ts uses PROVIDER_BATCH_DELAY record — textembedder not in the record, so it gets ?? 0 fallback (correct for local).
    Relevant Files
  • src/services/provider-textembedder.ts: new provider — decompresses .gz, spawns binary, HTTP client, health checks.
  • src/services/embedding-config.ts: provider type union, defaults, validation.
  • src/services/embedding-provider.ts: factory switch with case "textembedder".
  • text-embedder/Makefile: deploy target builds, gzips, copies to SocratiCode.
  • text-embedder.gz (in SocratiCode root): distributed gzip binary.
  • tests/unit/embedding-config.test.ts: updated default-provider assertions.
  • tests/unit/embedding-provider.test.ts: updated factory default + cache tests.
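
The fixed-point conversion listed under Constraints & Preferences can be sketched as below. This is a minimal sketch, not the code from provider-textembedder.ts: it assumes the binary returns plain integer arrays and that the scale factor is exactly 10000, as the description states. The review names an unscaleVector helper, but its body is not shown in this thread, so the range check and the SCALE constant here are illustrative assumptions.

```typescript
// Assumption: scale factor taken from the "[0, 10000]" range in the PR description.
const SCALE = 10000;

/** Convert a fixed-point integer vector into floats in [0, 1]. */
function unscaleVector(scaled: readonly number[]): number[] {
  return scaled.map((v, i) => {
    // Hypothetical guard: reject anything that is not an integer in [0, SCALE].
    if (!Number.isInteger(v) || v < 0 || v > SCALE) {
      throw new RangeError(`component ${i} out of range: ${v}`);
    }
    return v / SCALE;
  });
}

const example = unscaleVector([0, 2500, 10000]);
console.log(example); // [ 0, 0.25, 1 ]
```

Dividing by a constant keeps the mapping deterministic, which matters if the same text must always produce the same stored vector.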

Type of change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Test coverage improvement

Testing

  • Unit tests pass (npm run test:unit)
  • Integration tests pass (npm run test:integration), if applicable
  • TypeScript compiles cleanly (npx tsc --noEmit)
  • New tests added for new/changed functionality

Checklist

  • My code follows the existing code style and conventions
  • I have added/updated JSDoc comments where appropriate
  • I have updated documentation (README.md / DEVELOPER.md) if needed
  • I have addressed all CodeRabbit review comments (or marked as resolved with explanation)
  • I have read the Contributing Guide
  • I agree to the Contributor License Agreement

Related issues

Summary by CodeRabbit

  • New Features

    • Added Text Embedder as a new embedding provider backend, now set as the default. Supports deterministic embeddings via local binary execution or HTTP server modes without requiring Docker, GPU, or external API keys.
  • Documentation

    • Updated configuration guide with Text Embedder setup options and environment variables.

coderabbitai Bot commented May 23, 2026

Walkthrough

This PR introduces Text Embedder (textembedder) as the new default embedding provider, replacing Ollama. It adds a new provider implementation with local binary management, updates configuration and provider selection, wires integration through the factory, and adjusts tests to reflect the new defaults.

Changes

Text Embedder Integration

| Layer / File(s) | Summary |
| --- | --- |
| Configuration and provider defaults (src/services/embedding-config.ts, README.md) | EmbeddingProvider type now includes "textembedder". Default provider changes to textembedder with model landmark-lattice-v1 (768 dimensions). Environment variables TEXTEMBEDDER_URL, TEXTEMBEDDER_BIN_PATH, TEXTEMBEDDER_PORT are documented. API key logic updated to treat textembedder as not requiring external API keys. README updated with new provider feature and configuration subsection. |
| Provider factory integration (src/services/embedding-provider.ts) | getEmbeddingProvider adds a switch case to dynamically import and instantiate TextEmbedderEmbeddingProvider when provider is textembedder. Module documentation and error messages updated to include the new provider. |
| Text Embedder provider implementation (src/services/provider-textembedder.ts) | New class TextEmbedderEmbeddingProvider manages either external Text Embedder URL or local Go binary mode. Binary discovery resolves platform-specific executables, decompresses .gz files to a temp directory on first use, starts the subprocess with health polling and a readiness timeout, and handles graceful shutdown and cleanup. Embedding endpoints support batch (/embed/batch) and single (/embed) calls with int-to-float vector conversion via fixed-point scaling. healthCheck reports mode, binary availability, and reachability. |
| Test updates (tests/unit/embedding-config.test.ts, tests/unit/embedding-provider.test.ts) | Configuration tests updated to expect textembedder + landmark-lattice-v1 defaults instead of Ollama. Cache/singleton behavior tests adjusted to validate embeddingProvider caching. Factory tests verify new default provider is textembedder. Error message expectations updated to include textembedder in allowed provider list. |
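
The "decompress on first use, then reuse" step from the walkthrough can be sketched as follows. This is a synchronous, simplified sketch under stated assumptions: the helper name ensureExtracted, its parameters, and the write-then-rename step are hypothetical and are not the PR's ensureBinaryExtracted, whose body is not shown in this thread. It uses only Node's built-in fs, os, path, and zlib modules.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";
import * as zlib from "node:zlib";

/**
 * Decompress `gzPath` into `destDir/name` once and return the extracted path.
 * Later calls find the file already present and skip decompression.
 */
function ensureExtracted(gzPath: string, destDir: string, name: string): string {
  const target = path.join(destDir, name);
  if (fs.existsSync(target)) return target;

  fs.mkdirSync(destDir, { recursive: true });
  // Write under a temporary name first so an interrupted run cannot leave
  // a truncated file at `target` (a design choice of this sketch).
  const partial = `${target}.${process.pid}.partial`;
  fs.writeFileSync(partial, zlib.gunzipSync(fs.readFileSync(gzPath)), { mode: 0o755 });
  fs.renameSync(partial, target);
  return target;
}

// Demo: round-trip a small payload through a throwaway directory.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), "textembedder-demo-"));
const demoGz = path.join(demoDir, "payload.gz");
fs.writeFileSync(demoGz, zlib.gzipSync(Buffer.from("not a real binary")));
const extracted = ensureExtracted(demoGz, path.join(demoDir, "out"), "payload");
console.log(fs.readFileSync(extracted, "utf8")); // not a real binary
```

In the PR the destination would be the socraticode-textembedder directory under os.tmpdir(); the demo uses a fresh temporary directory so it does not touch that location.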

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • giancarloerra/SocratiCode#57: Both PRs extend the embedding provider factory logic in src/services/embedding-config.ts and src/services/embedding-provider.ts to add new provider branches with shared validation and API-key handling.
  • giancarloerra/SocratiCode#42: Both PRs modify the core embedding-provider wiring by extending EmbeddingProvider and updating configuration/factory code to add new provider options within the same code paths.

Poem

🐰 A textual tapestry we weave,
With embeddings deterministic and neat,
No Docker, no keys, just a binary so fleet—
The default at last that we all did believe!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 45.45% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and concisely describes the primary change: adding a deterministic text-embedder as the new embedding provider implementation. |
| Description check | ✅ Passed | The description follows the template structure with detailed Summary, comprehensive Changes section, correct Type of change marked, Testing checklist completed with all items checked, and most Checklist items addressed appropriately. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.



@giancarloerra giancarloerra self-assigned this May 23, 2026
@giancarloerra
Owner

@coderabbitai review

coderabbitai Bot commented May 23, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/services/embedding-config.ts`:
- Around line 292-304: The logged hasApiKey value is inverted for the
textembedder case; update the hasApiKey computation (referencing hasApiKey and
embeddingProvider) so textembedder yields false (no API key) and all other
providers evaluate whether their respective env/config keys exist (e.g.,
_config.ollamaApiKey, process.env.OPENAI_API_KEY, process.env.GOOGLE_API_KEY,
process.env.LMSTUDIO_API_KEY, process.env.LITELLM_API_KEY); ensure you return a
boolean (use explicit Boolean(...) or !!) and keep the same provider
switch/conditional structure in embedding-config.ts.

In `@src/services/provider-textembedder.ts`:
- Around line 436-439: In _embedBatch validate the BatchResponse contents before
sorting/mapping: check data.results exists and that its length equals the
expected input count, ensure each result has a numeric index within 0..(n-1) and
that indices are unique and cover the full range; if any
missing/duplicate/out-of-range index is detected, throw a descriptive error. Use
the index-to-result mapping to assemble the ordered embeddings (or validate then
sort) and only then call unscaleVector on each embedding; reference
data.results, BatchResponse, _embedBatch and unscaleVector when making the
changes.
- Around line 169-186: startBinary sets binaryStarting = true but never clears
it if resolveBinarySource/ensureBinaryExtracted (or any subsequent operation)
throws, causing callers to hang; update startBinary so that after setting
binaryStarting = true you wrap the subsequent work (resolveBinarySource(), the
not-found throw, ensureBinaryExtracted(), and spawn logic) in a
try/catch/finally and ensure binaryStarting is reset to false on any thrown
error (or in finally when failing to start), e.g., catch the error from
resolveBinarySource/ensureBinaryExtracted/child spawn and set binaryStarting =
false before rethrowing; reference the startBinary function and the variables
binaryStarting, resolveBinarySource, ensureBinaryExtracted, and subprocess to
locate where to add the try/catch/finally.
- Around line 220-230: The poll and API fetches in provider-textembedder.ts can
hang because they lack AbortController timeouts; add a small helper (e.g.,
fetchWithTimeout(url, options?, timeoutMs)) that creates an AbortController,
sets a setTimeout to abort after the provided timeout, and clears the timer on
completion, then replace all raw fetch calls to `${url}/health`,
`${baseUrl}/health`, `${baseUrl}/embed`, and `${baseUrl}/embed/batch` with
fetchWithTimeout (use HEALTH_POLL_MS or a sensible per-call timeout and ensure
BINARY_START_TIMEOUT_MS logic still advances if a request is aborted), keeping
existing response handling and updating references to subprocessUrl and
binaryStarting as before.
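
The batch-response validation requested for _embedBatch can be sketched as below. This is a minimal sketch, assuming each result carries a numeric `index` and an `embedding` array; the field name `embedding`, the BatchResult interface, and the helper name orderBatchResults are hypothetical, since the real BatchResponse shape is not shown in this thread.

```typescript
// Hypothetical shape of one entry in the batch response.
interface BatchResult {
  index: number;
  embedding: number[];
}

/**
 * Return embeddings ordered by `index`, or throw unless the indices are
 * exactly 0..expected-1 with no gaps and no repeats.
 */
function orderBatchResults(results: BatchResult[] | undefined, expected: number): number[][] {
  if (results === undefined || results.length !== expected) {
    const got = results === undefined ? "none" : String(results.length);
    throw new Error(`expected ${expected} results, got ${got}`);
  }
  const ordered = new Array<number[] | undefined>(expected).fill(undefined);
  for (const r of results) {
    if (!Number.isInteger(r.index) || r.index < 0 || r.index >= expected) {
      throw new Error(`result index out of range: ${r.index}`);
    }
    if (ordered[r.index] !== undefined) {
      throw new Error(`duplicate result index: ${r.index}`);
    }
    ordered[r.index] = r.embedding;
  }
  // The length matched and no index repeated, so every slot is filled.
  return ordered as number[][];
}

const orderedDemo = orderBatchResults(
  [{ index: 1, embedding: [2] }, { index: 0, embedding: [1] }],
  2,
);
console.log(orderedDemo); // [ [ 1 ], [ 2 ] ]
```

Placing results by index instead of sorting makes the missing and duplicate cases fall out of the same loop; the scaled-to-float conversion would then run on the ordered output.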
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a861762d-4a79-424a-af08-c42f2f7c3287

📥 Commits

Reviewing files that changed from the base of the PR and between bca2598 and bea12e3.

⛔ Files ignored due to path filters (3)
  • text-embedder-darwin.gz is excluded by !**/*.gz
  • text-embedder-linux.gz is excluded by !**/*.gz
  • text-embedder-win.gz is excluded by !**/*.gz
📒 Files selected for processing (6)
  • README.md
  • src/services/embedding-config.ts
  • src/services/embedding-provider.ts
  • src/services/provider-textembedder.ts
  • tests/unit/embedding-config.test.ts
  • tests/unit/embedding-provider.test.ts

Comment on lines +292 to +304
    hasApiKey: !!(embeddingProvider === "textembedder"
      ? true // binary / external URL; no key needed
      : embeddingProvider === "ollama"
        ? _config.ollamaApiKey
        : embeddingProvider === "openai"
          ? process.env.OPENAI_API_KEY
          : embeddingProvider === "google"
            ? process.env.GOOGLE_API_KEY
            : embeddingProvider === "lmstudio"
              ? process.env.LMSTUDIO_API_KEY
              : embeddingProvider === "litellm"
                ? process.env.LITELLM_API_KEY
                : undefined),

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

hasApiKey is inverted for textembedder in config logs.

This currently logs hasApiKey: true for textembedder even though the provider explicitly does not use an API key, which makes diagnostics misleading.

Suggested fix
-    hasApiKey: !!(embeddingProvider === "textembedder"
-      ? true // binary / external URL; no key needed
+    hasApiKey: !!(embeddingProvider === "textembedder"
+      ? false // binary / external URL; no key needed
       : embeddingProvider === "ollama"
         ? _config.ollamaApiKey
         : embeddingProvider === "openai"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    hasApiKey: !!(embeddingProvider === "textembedder"
-      ? true // binary / external URL; no key needed
-      : embeddingProvider === "ollama"
-        ? _config.ollamaApiKey
-        : embeddingProvider === "openai"
-          ? process.env.OPENAI_API_KEY
-          : embeddingProvider === "google"
-            ? process.env.GOOGLE_API_KEY
-            : embeddingProvider === "lmstudio"
-              ? process.env.LMSTUDIO_API_KEY
-              : embeddingProvider === "litellm"
-                ? process.env.LITELLM_API_KEY
-                : undefined),
+    hasApiKey: !!(embeddingProvider === "textembedder"
+      ? false // binary / external URL; no key needed
+      : embeddingProvider === "ollama"
+        ? _config.ollamaApiKey
+        : embeddingProvider === "openai"
+          ? process.env.OPENAI_API_KEY
+          : embeddingProvider === "google"
+            ? process.env.GOOGLE_API_KEY
+            : embeddingProvider === "lmstudio"
+              ? process.env.LMSTUDIO_API_KEY
+              : embeddingProvider === "litellm"
+                ? process.env.LITELLM_API_KEY
+                : undefined),
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/services/embedding-config.ts` around lines 292 - 304, The logged
hasApiKey value is inverted for the textembedder case; update the hasApiKey
computation (referencing hasApiKey and embeddingProvider) so textembedder yields
false (no API key) and all other providers evaluate whether their respective
env/config keys exist (e.g., _config.ollamaApiKey, process.env.OPENAI_API_KEY,
process.env.GOOGLE_API_KEY, process.env.LMSTUDIO_API_KEY,
process.env.LITELLM_API_KEY); ensure you return a boolean (use explicit
Boolean(...) or !!) and keep the same provider switch/conditional structure in
embedding-config.ts.
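
For comparison, the intended hasApiKey semantics can be written as a lookup table instead of a nested ternary. This is a sketch of an alternative shape, not the change the review asks for (which keeps the existing conditional structure). The API_KEY_ENV table and the OLLAMA_API_KEY variable name are assumptions for illustration; the PR reads the Ollama key from _config.ollamaApiKey.

```typescript
type EmbeddingProvider =
  | "textembedder"
  | "ollama"
  | "openai"
  | "google"
  | "lmstudio"
  | "litellm";

/** Environment variable holding each provider's key; null means the provider uses none. */
const API_KEY_ENV: Record<EmbeddingProvider, string | null> = {
  textembedder: null, // local binary or external URL, no key
  ollama: "OLLAMA_API_KEY", // assumption: the PR uses _config.ollamaApiKey here
  openai: "OPENAI_API_KEY",
  google: "GOOGLE_API_KEY",
  lmstudio: "LMSTUDIO_API_KEY",
  litellm: "LITELLM_API_KEY",
};

/** True only when the provider uses a key and that key is set to a non-empty value. */
function hasApiKey(
  provider: EmbeddingProvider,
  env: Record<string, string | undefined>,
): boolean {
  const name = API_KEY_ENV[provider];
  return name !== null && Boolean(env[name]);
}

console.log(hasApiKey("textembedder", { OPENAI_API_KEY: "sk-test" })); // false
console.log(hasApiKey("openai", { OPENAI_API_KEY: "sk-test" })); // true
```

Because the table is typed as Record<EmbeddingProvider, ...>, adding a provider to the union without a table entry fails to compile, which a ternary chain ending in undefined does not catch.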

Comment on lines +169 to +186
  binaryStarting = true;
  const source = await resolveBinarySource();

  if (!source) {
    binaryStarting = false;
    const pfx = platformGzName();
    throw new Error(
      `text-embedder binary not found for platform "${process.platform}". ` +
      `Run 'make deploy-all' from the text-embedder directory to generate ` +
      (pfx ? `${pfx} (expected name), ` : "") +
      "or set TEXTEMBEDDER_BIN_PATH / TEXTEMBEDDER_URL.",
    );
  }

  // Decompress if necessary, or use bare binary directly
  const binPath = source.isCompressed
    ? await ensureBinaryExtracted(source.sourcePath)
    : source.sourcePath;

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="src/services/provider-textembedder.ts"
if [ ! -f "$FILE" ]; then
  echo "Missing file: $FILE" >&2
  exit 1
fi

# Show a focused range around binaryStarting and startBinary
rg -n "binaryStarting|startBinary|resolveBinarySource|ensureBinaryExtracted" "$FILE" -S

# Print around startBinary function definition and binaryStarting usage
python3 - <<'PY'
import re, pathlib
path = pathlib.Path("src/services/provider-textembedder.ts")
txt = path.read_text()
# find startBinary line
m = re.search(r'async function startBinary\s*\(', txt)
if not m:
    print("startBinary not found")
    raise SystemExit(0)
start = txt[:m.start()].count("\n")
print("startBinary approx line:", start+1)
# print 220 lines around
lines = txt.splitlines()
a = max(0, start-30)
b = min(len(lines), start+220)
for i in range(a,b):
    print(f"{i+1:4d}: {lines[i]}")
PY

# Also check any polling/wait loop for binaryStarting
rg -n "while\s*\(.*binaryStarting|poll|setTimeout|sleep|await new Promise|binaryStarting\s*\)" "$FILE" -S || true

# Show any place that sets binaryStarting=false
rg -n "binaryStarting\s*=\s*false" "$FILE" -S

Repository: giancarloerra/SocratiCode

Length of output: 9800


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="src/services/provider-textembedder.ts"
if [ ! -f "$FILE" ]; then
  echo "Missing file: $FILE" >&2
  exit 1
fi

rg -n "binaryStarting|startBinary|resolveBinarySource|ensureBinaryExtracted" "$FILE" -S

python3 - <<'PY'
import re, pathlib
path = pathlib.Path("src/services/provider-textembedder.ts")
txt = path.read_text()
m = re.search(r'async function startBinary\s*\(', txt)
if not m:
    print("startBinary not found")
    raise SystemExit(0)
start = txt[:m.start()].count("\n")
print("startBinary approx line:", start+1)
lines = txt.splitlines()
a = max(0, start-40)
b = min(len(lines), start+260)
for i in range(a,b):
    print(f"{i+1:4d}: {lines[i]}")
PY

rg -n "while\s*\(.*binaryStarting|poll|setTimeout|sleep|await new Promise|binaryStarting\s*\)" "$FILE" -S || true

rg -n "binaryStarting\s*=\s*false" "$FILE" -S

Repository: giancarloerra/SocratiCode

Length of output: 12000


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="src/services/provider-textembedder.ts"

# Inspect resolveBinarySource and ensureBinaryExtracted bodies
python3 - <<'PY'
import itertools, pathlib
path = pathlib.Path("src/services/provider-textembedder.ts")
lines = path.read_text().splitlines()
def dump(start, end):
    for i in range(start, end+1):
        if 1 <= i <= len(lines):
            print(f"{i:4d}: {lines[i-1]}")
# Based on rg: resolveBinarySource around 97, ensureBinaryExtracted around 119
dump(70, 150)
PY

# Locate _embedBatch and inspect
rg -n "_embedBatch|BatchResponse|results|index" "$FILE" -S
python3 - <<'PY'
import re, pathlib
path = pathlib.Path("src/services/provider-textembedder.ts")
txt = path.read_text()
m = re.search(r'async function _embedBatch', txt)
if not m:
    # maybe method: this._embedBatch or class method
    # fallback: print around first occurrence of 'protected async _embedBatch' etc
    import sys
    pass
# Print around all lines containing '_embedBatch'
lines = txt.splitlines()
for i,l in enumerate(lines, start=1):
    if '_embedBatch' in l:
        start=max(1,i-40); end=min(len(lines),i+140)
        print(f"\n--- around line {i} ---")
        for j in range(start,end+1):
            print(f"{j:4d}: {lines[j-1]}")
PY

# Find fetch usage without AbortController/timeout settings (heuristic)
rg -n "fetch\\(.*\\)" "$FILE" -S
rg -n "AbortController|timeout|signal:" "$FILE" -S

Repository: giancarloerra/SocratiCode

Length of output: 11659


Fix binaryStarting stuck/hang on extraction/spawn errors in startBinary
startBinary() sets binaryStarting = true and only resets it for specific paths; if ensureBinaryExtracted() (or anything after setting binaryStarting=true) throws, binaryStarting remains true, and later callers enter the polling Promise and can wait forever (no timeout) because the interval only rejects when !subprocess && !binaryStarting.

Suggested fix
 async function startBinary(port: number): Promise<string> {
   if (subprocessUrl) return subprocessUrl;
   if (binaryStarting) {
@@
   }
 
   binaryStarting = true;
-  const source = await resolveBinarySource();
+  let source: { sourcePath: string; isCompressed: boolean } | null = null;
+  try {
+    source = await resolveBinarySource();
 
-  if (!source) {
-    binaryStarting = false;
-    const pfx = platformGzName();
-    throw new Error(
-      `text-embedder binary not found for platform "${process.platform}". ` +
-      `Run 'make deploy-all' from the text-embedder directory to generate ` +
-      (pfx ? `${pfx} (expected name), ` : "") +
-      "or set TEXTEMBEDDER_BIN_PATH / TEXTEMBEDDER_URL.",
-    );
-  }
+    if (!source) {
+      const pfx = platformGzName();
+      throw new Error(
+        `text-embedder binary not found for platform "${process.platform}". ` +
+        `Run 'make deploy-all' from the text-embedder directory to generate ` +
+        (pfx ? `${pfx} (expected name), ` : "") +
+        "or set TEXTEMBEDDER_BIN_PATH / TEXTEMBEDDER_URL.",
+      );
+    }
 
-  // Decompress if necessary, or use bare binary directly
-  const binPath = source.isCompressed
-    ? await ensureBinaryExtracted(source.sourcePath)
-    : source.sourcePath;
+    // Decompress if necessary, or use bare binary directly
+    const binPath = source.isCompressed
+      ? await ensureBinaryExtracted(source.sourcePath)
+      : source.sourcePath;
+    // ... existing spawn/readiness logic ...
+  } catch (err) {
+    binaryStarting = false;
+    throw err;
+  }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-  binaryStarting = true;
-  const source = await resolveBinarySource();
-  if (!source) {
-    binaryStarting = false;
-    const pfx = platformGzName();
-    throw new Error(
-      `text-embedder binary not found for platform "${process.platform}". ` +
-      `Run 'make deploy-all' from the text-embedder directory to generate ` +
-      (pfx ? `${pfx} (expected name), ` : "") +
-      "or set TEXTEMBEDDER_BIN_PATH / TEXTEMBEDDER_URL.",
-    );
-  }
-  // Decompress if necessary, or use bare binary directly
-  const binPath = source.isCompressed
-    ? await ensureBinaryExtracted(source.sourcePath)
-    : source.sourcePath;
+  binaryStarting = true;
+  let source: { sourcePath: string; isCompressed: boolean } | null = null;
+  try {
+    source = await resolveBinarySource();
+    if (!source) {
+      const pfx = platformGzName();
+      throw new Error(
+        `text-embedder binary not found for platform "${process.platform}". ` +
+        `Run 'make deploy-all' from the text-embedder directory to generate ` +
+        (pfx ? `${pfx} (expected name), ` : "") +
+        "or set TEXTEMBEDDER_BIN_PATH / TEXTEMBEDDER_URL.",
+      );
+    }
+    // Decompress if necessary, or use bare binary directly
+    const binPath = source.isCompressed
+      ? await ensureBinaryExtracted(source.sourcePath)
+      : source.sourcePath;
+  } catch (err) {
+    binaryStarting = false;
+    throw err;
+  }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/services/provider-textembedder.ts` around lines 169 - 186, startBinary
sets binaryStarting = true but never clears it if
resolveBinarySource/ensureBinaryExtracted (or any subsequent operation) throws,
causing callers to hang; update startBinary so that after setting binaryStarting
= true you wrap the subsequent work (resolveBinarySource(), the not-found throw,
ensureBinaryExtracted(), and spawn logic) in a try/catch/finally and ensure
binaryStarting is reset to false on any thrown error (or in finally when failing
to start), e.g., catch the error from
resolveBinarySource/ensureBinaryExtracted/child spawn and set binaryStarting =
false before rethrowing; reference the startBinary function and the variables
binaryStarting, resolveBinarySource, ensureBinaryExtracted, and subprocess to
locate where to add the try/catch/finally.
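
The failure mode and its fix can be reproduced in isolation. The sketch below is a simplified stand-in, not startBinary itself: `launch` is a hypothetical placeholder for the resolve, extract, and spawn steps, the URL is made up, and a concurrent caller is rejected instead of polling as the real function does. It shows only that a `finally` block clears the in-progress flag whether the work succeeds or throws.

```typescript
let starting = false;
let startedUrl: string | null = null;

/** Run `launch` at most once at a time; `launch` is a hypothetical stand-in that may throw. */
async function startOnce(launch: () => Promise<string>): Promise<string> {
  if (startedUrl) return startedUrl;
  if (starting) throw new Error("start already in progress");
  starting = true;
  try {
    startedUrl = await launch();
    return startedUrl;
  } finally {
    // Runs on success and on failure, so a thrown error cannot leave the flag set.
    starting = false;
  }
}

async function demo(): Promise<string> {
  try {
    await startOnce(async () => {
      throw new Error("extraction failed");
    });
  } catch {
    // First attempt failed; the flag is already clear, so a retry is possible.
  }
  // Hypothetical URL, for illustration only.
  return startOnce(async () => "http://127.0.0.1:8080");
}
```

Without the `finally`, the second call in demo would see `starting === true` forever, which is the hang described in the comment above.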

Comment on lines +220 to +230
const resp = await fetch(`${url}/health`);
if (resp.ok) {
logger.info("text-embedder binary is ready", { url });
subprocessUrl = url;
binaryStarting = false;
return url;
}
} catch {
// Not ready yet
}
await new Promise((r) => setTimeout(r, HEALTH_POLL_MS));

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="src/services/provider-textembedder.ts"

echo "== Fetch occurrences =="
rg -n "fetch\(" "$FILE" || true

echo
echo "== AbortController/timeout helpers in file =="
rg -n "AbortController|timeout|setTimeout" "$FILE" || true

echo
echo "== Snippet around 200-260 =="
nl -ba "$FILE" | sed -n '190,260p'

echo
echo "== Snippet around 300-390 =="
nl -ba "$FILE" | sed -n '300,390p'

echo
echo "== Snippet around 400-460 =="
nl -ba "$FILE" | sed -n '400,460p'

Repository: giancarloerra/SocratiCode

Length of output: 617


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE="src/services/provider-textembedder.ts"

echo "== Snippet around 210-255 =="
sed -n '210,255p' "$FILE" | cat -n

echo
echo "== Snippet around 315-380 =="
sed -n '315,380p' "$FILE" | cat -n

echo
echo "== Snippet around 395-440 =="
sed -n '395,440p' "$FILE" | cat -n

echo
echo "== Any fetch wrapper definitions in file =="
rg -n "fetchWithTimeout|AbortController|controller\.abort|signal:|timeoutMs|HTTP_TIMEOUT_MS" "$FILE" || true

echo
echo "== Any shared fetch helper elsewhere in repo (search quick) =="
rg -n "fetchWithTimeout" src || true
rg -n "AbortController" src || true

Repository: giancarloerra/SocratiCode

Length of output: 6691


🌐 Web query:

Does Node.js global fetch (undici) have a default request timeout? Is there a built-in timeout for hung sockets?

💡 Result:

Yes: Node.js' global fetch (which uses undici under the hood) has default timeouts, including timeouts that can terminate "hung" requests due to lack of activity.

1) Default overall request timeout behavior (headers/body)
  • undici's Dispatcher has defaults:
    • headersTimeout defaults to 300 seconds (5 minutes) [1][2][3]
    • bodyTimeout defaults to 300 seconds (5 minutes) [1][2][3]
  • These are "activity" timeouts: headersTimeout monitors waiting for complete HTTP headers; bodyTimeout monitors time between receiving body data. If a server accepts a connection but never completes headers (or never sends body data), the request times out.

2) Default connect/establish timeout (for "hung sockets" during connect)
  • undici also sets a default connect timeout used when establishing the connection; undici will throw UND_ERR_CONNECT_TIMEOUT if the connect attempt exceeds that timeout [4].
  • This is not the same as headersTimeout/bodyTimeout; it is specifically for connection establishment.

3) Is there a built-in timeout for "hung sockets" after connect?
  • undici provides socket-/activity-level timeouts via dispatcher/client options such as socketTimeout (timeout when there is no activity on a connected socket) [5]. In the default Dispatcher settings that global fetch uses, the request-level protections you most directly hit are headersTimeout/bodyTimeout (5 minutes each) [1][2].
  • So: there isn't a single documented "default request timeout" knob exposed as one number, but there are built-in default timeouts that will break otherwise-stuck connections/request states (connect timeout for connect hangs; headersTimeout/bodyTimeout for hangs in those phases) [4][1][2].

If you need a shorter or longer behavior, you must configure undici (via Agent/Dispatcher options) or use AbortSignal/timeouts at the fetch call site; undici documents you can pass a dispatcher/Agent to fetch to customize timeouts [6][1].


🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE="src/services/provider-textembedder.ts"

echo "== Constants in provider-textembedder.ts =="
rg -n "BINARY_START_TIMEOUT_MS|HEALTH_POLL_MS|TEXTEMBEDDER_BATCH_SIZE" "$FILE" || true

echo
echo "== Snippet around where constants are defined =="
# show first ~120 lines (often constants live near top)
sed -n '1,140p' "$FILE" | cat -n

echo
echo "== Snippet around the polling loop start =="
# show around first fetch at ~220 with a bit more context
sed -n '160,255p' "$FILE" | cat -n

Repository: giancarloerra/SocratiCode

Length of output: 9738


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Search for fetch timeout helpers across src =="
rg -n "fetchWithTimeout|fetchWithAbort|timeoutMs|AbortSignal|AbortController" src || true

echo
echo "== Search for global fetch wrapper patterns =="
rg -n "function\s+fetch|const\s+fetch\s*=|export\s+.*fetch" src || true

Repository: giancarloerra/SocratiCode

Length of output: 785


Add explicit request timeouts (AbortController) to all fetch calls in src/services/provider-textembedder.ts
Binary readiness polling awaits fetch(${url}/health) without an abort, so it can block beyond BINARY_START_TIMEOUT_MS (the loop can’t advance until the request resolves). The other /health, /embed, and /embed/batch calls have the same issue (they currently rely only on undici’s default timeouts).

Impacted fetch sites: ~220-230, ~326-327, ~364-368, ~406-407, ~423-427.

Suggested fix
+const HTTP_TIMEOUT_MS = 5000;
+
+async function fetchWithTimeout(input: string, init: RequestInit = {}, timeoutMs = HTTP_TIMEOUT_MS): Promise<Response> {
+  const controller = new AbortController();
+  const timeout = setTimeout(() => controller.abort(), timeoutMs);
+  try {
+    return await fetch(input, { ...init, signal: controller.signal });
+  } finally {
+    clearTimeout(timeout);
+  }
+}
@@
-      const resp = await fetch(`${url}/health`);
+      const resp = await fetchWithTimeout(`${url}/health`);
@@
-      const resp = await fetch(`${baseUrl}/health`);
+      const resp = await fetchWithTimeout(`${baseUrl}/health`);
@@
-    const response = await fetch(`${baseUrl}/embed`, {
+    const response = await fetchWithTimeout(`${baseUrl}/embed`, {
@@
-    const response = await fetch(`${baseUrl}/embed/batch`, {
+    const response = await fetchWithTimeout(`${baseUrl}/embed/batch`, {

Also wrap the second ${baseUrl}/health call (~406-407) with fetchWithTimeout (same change as the first /health call).

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-      const resp = await fetch(`${url}/health`);
+      const resp = await fetchWithTimeout(`${url}/health`);
       if (resp.ok) {
         logger.info("text-embedder binary is ready", { url });
         subprocessUrl = url;
         binaryStarting = false;
         return url;
       }
     } catch {
       // Not ready yet
     }
     await new Promise((r) => setTimeout(r, HEALTH_POLL_MS));
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/services/provider-textembedder.ts` around lines 220 - 230, The poll and
API fetches in provider-textembedder.ts can hang because they lack
AbortController timeouts; add a small helper (e.g., fetchWithTimeout(url,
options?, timeoutMs)) that creates an AbortController, sets a setTimeout to
abort after the provided timeout, and clears the timer on completion, then
replace all raw fetch calls to `${url}/health`, `${baseUrl}/health`,
`${baseUrl}/embed`, and `${baseUrl}/embed/batch` with fetchWithTimeout (use
HEALTH_POLL_MS or a sensible per-call timeout and ensure BINARY_START_TIMEOUT_MS
logic still advances if a request is aborted), keeping existing response
handling and updating references to subprocessUrl and binaryStarting as before.
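
One way to keep the overall startup deadline authoritative is to cap each probe by the time remaining before that deadline. The sketch below is illustrative only: `probeBudgetMs`, `waitUntilHealthy`, and the injected `probe` callback are hypothetical names, and the actual loop in provider-textembedder.ts may be structured differently.

```typescript
// Hypothetical helper: a probe may use at most perProbeMs, and never more than
// the time left before the overall deadline. Returns 0 once the deadline passed.
function probeBudgetMs(deadlineMs: number, nowMs: number, perProbeMs: number): number {
  return Math.max(0, Math.min(perProbeMs, deadlineMs - nowMs));
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Polls `probe` until it reports healthy or startTimeoutMs elapses.
// `probe` is expected to give up on its own after the timeout it is handed.
async function waitUntilHealthy(
  probe: (timeoutMs: number) => Promise<boolean>,
  startTimeoutMs: number,
  pollMs: number,
  perProbeMs: number,
): Promise<boolean> {
  const deadline = Date.now() + startTimeoutMs;
  for (;;) {
    const budget = probeBudgetMs(deadline, Date.now(), perProbeMs);
    if (budget === 0) return false; // overall deadline reached
    try {
      if (await probe(budget)) return true;
    } catch {
      // Not ready yet (connection refused, aborted probe, and so on).
    }
    // Never sleep past the deadline.
    await sleep(Math.min(pollMs, Math.max(0, deadline - Date.now())));
  }
}
```

A probe built on the suggested helper could be passed as `(ms) => fetchWithTimeout(url + "/health", {}, ms).then((r) => r.ok)`, so an aborted request simply counts as "not ready" and the loop moves on.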

Comment on lines +436 to +439
const data: BatchResponse = await response.json() as BatchResponse;
const sorted = data.results.sort((a, b) => a.index - b.index);
return sorted.map((r) => unscaleVector(r.embedding));
}

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Locate and print the relevant section
sed -n '380,520p' src/services/provider-textembedder.ts | nl -ba | sed -n '1,200p'

# Show BatchResponse / result type definitions
rg -n "type\s+BatchResponse|interface\s+BatchResponse|BatchResponse" src/services/provider-textembedder.ts
rg -n "results\s*:" src/services/provider-textembedder.ts

Repository: giancarloerra/SocratiCode

Length of output: 112


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Snippet around lines 420-470 =="
sed -n '420,470p' src/services/provider-textembedder.ts | awk '{print NR+419 ":" $0}'

echo
echo "== Type/Interface definitions for BatchResponse =="
rg -n "type\s+BatchResponse|interface\s+BatchResponse|BatchResponse" src/services/provider-textembedder.ts

echo
echo "== Any existing validation around batch results (index/length) =="
rg -n "validate|validation|index|length|results\.length|texts\.length" src/services/provider-textembedder.ts

Repository: giancarloerra/SocratiCode

Length of output: 1297


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== BatchResponse interface (around line 260-320) =="
sed -n '250,320p' src/services/provider-textembedder.ts

echo
echo "== BatchResponse result item type (search for 'embedding' near BatchResponse) =="
rg -n "interface BatchResponse|type BatchResponse|embedding|index\s*:\s*number" src/services/provider-textembedder.ts

echo
echo "== Where _embedBatch is called =="
rg -n "_embedBatch\(" src/services/provider-textembedder.ts

Repository: giancarloerra/SocratiCode

Length of output: 2522


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Show the full _embedBatch and its caller context
sed -n '300,470p' src/services/provider-textembedder.ts | cat -n

echo
echo "== Search for BatchResponse.count usage =="
rg -n "data\.count|BatchResponse\s*.*count|count" src/services/provider-textembedder.ts

Repository: giancarloerra/SocratiCode

Length of output: 6113


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Search for '/embed/batch' usage/contract =="
rg -n "/embed/batch" -S .

echo
echo "== Search for BatchItem/BatchResponse usage elsewhere =="
rg -n "BatchItem|interface BatchItem|BatchResponse|interface BatchResponse" -S .

Repository: giancarloerra/SocratiCode

Length of output: 718


Validate /embed/batch result completeness/index integrity before sorting and mapping (src/services/provider-textembedder.ts ~436-439).

_embedBatch sorts data.results by index and immediately maps embeddings; it never checks that the response has exactly one result per input or that indices are valid/unique/cover the expected range. Missing/duplicate/out-of-range indices can silently misalign vectors with input texts.

Suggested fix
   const data: BatchResponse = await response.json() as BatchResponse;
   const sorted = data.results.sort((a, b) => a.index - b.index);
+  if (sorted.length !== texts.length) {
+    throw new Error(
+      `text-embedder /embed/batch returned ${sorted.length} results for ${texts.length} inputs.`,
+    );
+  }
+  for (let i = 0; i < sorted.length; i++) {
+    if (!Number.isInteger(sorted[i].index) || sorted[i].index !== i) {
+      throw new Error(
+        `text-embedder /embed/batch returned invalid index mapping at position ${i} (index=${sorted[i].index}).`,
+      );
+    }
+  }
   return sorted.map((r) => unscaleVector(r.embedding));
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/services/provider-textembedder.ts` around lines 436 - 439, In _embedBatch
validate the BatchResponse contents before sorting/mapping: check data.results
exists and that its length equals the expected input count, ensure each result
has a numeric index within 0..(n-1) and that indices are unique and cover the
full range; if any missing/duplicate/out-of-range index is detected, throw a
descriptive error. Use the index-to-result mapping to assemble the ordered
embeddings (or validate then sort) and only then call unscaleVector on each
embedding; reference data.results, BatchResponse, _embedBatch and unscaleVector
when making the changes.
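
The index-to-result mapping mentioned in the prompt above could look roughly like the following sketch. It is an alternative to the validate-after-sort diff, not the PR's code: `orderByIndex` is a hypothetical name, and the `BatchItem` shape is assumed from the `r.index` and `r.embedding` accesses in the reviewed snippet. Unscaling is left to the caller.

```typescript
// Assumed shape of one /embed/batch result item.
interface BatchItem {
  index: number;
  embedding: number[];
}

// Places each result into the slot named by its index and fails loudly on
// missing, duplicate, or out-of-range indices. Does not mutate `results`.
function orderByIndex(results: BatchItem[], expected: number): number[][] {
  if (results.length !== expected) {
    throw new Error(`expected ${expected} results, got ${results.length}`);
  }
  const slots = new Array<number[] | undefined>(expected).fill(undefined);
  for (const r of results) {
    if (!Number.isInteger(r.index) || r.index < 0 || r.index >= expected) {
      throw new Error(`index out of range: ${r.index}`);
    }
    if (slots[r.index] !== undefined) {
      throw new Error(`duplicate index: ${r.index}`);
    }
    slots[r.index] = r.embedding;
  }
  // Length matches and every index is unique and in range, so every slot is filled.
  return slots as number[][];
}
```

In `_embedBatch` this would be used as `orderByIndex(data.results, texts.length).map(unscaleVector)`, which keeps each vector aligned with the input text at the same position.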

@giancarloerra
Owner

giancarloerra commented May 23, 2026

I'm closing this PR for the following reasons:

  1. Security. This PR adds three compiled binaries from an external repository (github.com/guiperry/text-embedder) and makes them the default embedding provider. Every SocratiCode user would automatically decompress and execute these binaries as a subprocess with full system access. Distributing compiled binaries from an unaudited source through an open-source project is a supply chain risk.

  2. Major breaking change. Changing the default provider from Ollama to textembedder would silently break every existing installation. Qdrant indexes built with nomic-embed-text vectors are incompatible with vectors from a different model. Users would get degraded or broken search results after upgrading.

  3. Unverified embedding quality. landmark-lattice-v1 has no MTEB benchmarks, no published evaluations, and no independent review. SocratiCode's value depends on search quality. Replacing an established model (nomic-embed-text) with an unverified one requires evidence that results are at least comparable.

  4. Binaries in git. Committing ~15 MB of platform binaries to the repository permanently inflates the git history for every clone. This is not a sustainable distribution method or one I want.

Even without these four major problems, there are technical ones:

  1. CI fails. Biome lint reports unused imports, non-null assertions, and unsorted imports. The linter was not run locally before submission.

  2. No tests. The new provider module is 440 lines with no unit tests. The only test changes update default-provider assertions.

  3. CodeRabbit flagged critical issues. Race condition in subprocess startup, missing batch response validation (silent vector misalignment), no fetch timeout on HTTP calls, inverted hasApiKey field, and the signal handlers (process.on("SIGINT"/"SIGTERM") calling process.exit(0)) would override the MCP server's own shutdown logic.

Whatever the intention behind this, it's not a positive or justifiable change/improvement. Thanks for your interest in SocratiCode!
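
For context on the startup race mentioned in the comment above: the racy code itself is not shown in this thread, but a common way to avoid two callers spawning the subprocess concurrently is to share one in-flight promise. A minimal sketch with a hypothetical `singleFlight` helper (not part of the PR):

```typescript
// Wraps an async start routine so concurrent callers share one attempt.
// After a failure the cached promise is cleared, so a later call can retry.
function singleFlight<T>(start: () => Promise<T>): () => Promise<T> {
  let inFlight: Promise<T> | null = null;
  return () => {
    if (inFlight === null) {
      inFlight = start().catch((err: unknown) => {
        inFlight = null; // allow a retry after a failed start
        throw err;
      });
    }
    return inFlight;
  };
}
```

A provider could wrap its binary-start routine once, for example `const ensureBinary = singleFlight(startBinary)` where `startBinary` is a stand-in name, and have every embed call await `ensureBinary()` instead of checking a boolean flag.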
