Deterministic Text Embedder Implementation by guiperry · Pull Request #62 · giancarloerra/SocratiCode

guiperry · 2026-05-23T12:58:18Z

Summary

Changes

Goal

Integrate the text-embedder Go binary as the primary embedding provider in SocratiCode, with all existing providers as fallbacks.
Constraints & Preferences
text-embedder binary must be distributed as text-embedder.gz (gzip) for practical file size in the npm package / git repo.
The TypeScript provider must decompress the .gz at runtime to a temp directory and spawn it as a subprocess.
All existing providers (ollama, openai, google, lmstudio, litellm) must remain as fallbacks switchable via EMBEDDING_PROVIDER.
text-embedder produces 768-dim int32 vectors scaled to 0, 10000; the provider converts to float64 0, 1 for Qdrant.
Progress
Done
Built the Go text-embedder binary (make deploy) and placed text-embedder.gz (5.0 MB) in the SocratiCode project root.
Created src/services/provider-textembedder.ts — full EmbeddingProvider implementation with binary decompression, subprocess spawning, HTTP API client, and process cleanup.
Updated src/services/embedding-config.ts — added "textembedder" to the EmbeddingProvider type union, PROVIDER_DEFAULTS (landmark-lattice-v1, 768 dims), MODEL_CONTEXT_LENGTHS (8192), validation, default provider from "ollama" to "textembedder".
Updated src/services/embedding-provider.ts — added case "textembedder" in factory switch that imports and instantiates TextEmbedderEmbeddingProvider.
Updated text-embedder/Makefile — added compress, deploy, deploy-linux, deploy-mac, deploy-windows targets that build, gzip, and copy to SOCRA_DIR.
Updated tests/unit/embedding-config.test.ts and tests/unit/embedding-provider.test.ts — changed default provider assertions from "ollama" to "textembedder".
TypeScript compilation: zero errors.
Unit tests: 823 passed across 36 test files.
In Progress
(none)
Blocked
(none)
Key Decisions
Binary is stored as text-embedder.gz (gzip) to keep repo size small (~5 MB compressed vs ~9.2 MB raw).
Decompression happens once to os.tmpdir()/socraticode-textembedder/text-embedder; the extracted binary is reused across restarts until cleanup.
Provider supports three modes: (1) compressed binary auto-extracted, (2) external URL via TEXTEMBEDDER_URL, (3) bare binary via TEXTEMBEDDER_BIN_PATH.
Batch size 64; inter-batch delay 0 (local subprocess, no rate limits).
Process cleanup registered on exit/SIGINT/SIGTERM; temp dir removed best-effort.
Next Steps
Run integration/E2E tests to verify the binary actually starts and produces valid vectors through the full Qdrant pipeline.
Add text-embedder.gz to the package.json files array for npm distribution.
Optionally add a postinstall script to extract the binary automatically.
Critical Context
text-embedder Go module: github.com/guiperry/text-embedder (Go 1.22).
SocratiCode TypeScript: all embedding providers implement the Embedding Provider interface from embedding-types.ts.
Factory in embedding-provider.ts uses switch on EMBEDDING_PROVIDER env var; default is now "textembedder".
embeddings.ts uses PROVIDER_BATCH_DELAY record — textembedder not in the record, so it gets ?? 0 fallback (correct for local).
Relevant Files
src/services/provider-textembedder.ts: new provider — decompresses .gz, spawns binary, HTTP client, health checks.
src/services/embedding-config.ts: provider type union, defaults, validation.
src/services/embedding-provider.ts: factory switch with case "textembedder".
text-embedder/Makefile: deploy target builds, gzips, copies to SocratiCode.
text-embedder.gz (in SocratiCode root): distributed gzip binary.
tests/unit/embedding-config.test.ts: updated default-provider assertions.
tests/unit/embedding-provider.test.ts: updated factory default + cache tests.

Type of change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Refactoring (no functional changes)
Test coverage improvement

Testing

Unit tests pass (npm run test:unit)
Integration tests pass (npm run test: integration) — if applicable
TypeScript compiles cleanly (npx tsc --noEmit)
New tests added for new/changed functionality

Checklist

My code follows the existing code style and conventions
I have added/updated JSDoc comments where appropriate
I have updated documentation (README.md / DEVELOPER.md) if needed
I have addressed all CodeRabbit review comments (or marked as resolved with explanation)
I have read the Contributing Guide
I agree to the Contributor License Agreement

Related issues

Summary by CodeRabbit

New Features
- Added Text Embedder as a new embedding provider backend, now set as the default. Supports deterministic embeddings via local binary execution or HTTP server modes without requiring Docker, GPU, or external API keys.
Documentation
- Updated configuration guide with Text Embedder setup options and environment variables.

coderabbitai · 2026-05-23T12:58:24Z

📝 Walkthrough

Walkthrough

This PR introduces Text Embedder (textembedder) as the new default embedding provider, replacing Ollama. It adds a new provider implementation with local binary management, updates configuration and provider selection, wires integration through the factory, and adjusts tests to reflect the new defaults.

Changes

Text Embedder Integration

Layer / File(s)	Summary
Configuration and provider defaults `src/services/embedding-config.ts`, `README.md`	`EmbeddingProvider` type now includes `"textembedder"`. Default provider changes to `textembedder` with model `landmark-lattice-v1` (768 dimensions). Environment variables `TEXTEMBEDDER_URL`, `TEXTEMBEDDER_BIN_PATH`, `TEXTEMBEDDER_PORT` are documented. API key logic updated to treat `textembedder` as not requiring external API keys. README updated with new provider feature and configuration subsection.
Provider factory integration `src/services/embedding-provider.ts`	`getEmbeddingProvider` adds a switch case to dynamically import and instantiate `TextEmbedderEmbeddingProvider` when provider is `textembedder`. Module documentation and error messages updated to include the new provider.
Text Embedder provider implementation `src/services/provider-textembedder.ts`	New class `TextEmbedderEmbeddingProvider` manages either external Text Embedder URL or local Go binary mode. Binary discovery resolves platform-specific executables, decompresses `.gz` files to temp directory on first use, starts subprocess with health polling and readiness timeout, handles graceful shutdown and cleanup. Embedding endpoints support batch (`/embed/batch`) and single (`/embed`) calls with int-to-float vector conversion via fixed-point scaling. `healthCheck` reports mode, binary availability, and reachability.
Test updates `tests/unit/embedding-config.test.ts`, `tests/unit/embedding-provider.test.ts`	Configuration tests updated to expect `textembedder` + `landmark-lattice-v1` defaults instead of Ollama. Cache/singleton behavior tests adjusted to validate `embeddingProvider` caching. Factory tests verify new default provider is `textembedder`. Error message expectations updated to include `textembedder` in allowed provider list.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

giancarloerra/SocratiCode#57: Both PRs extend the embedding provider factory logic in src/services/embedding-config.ts and src/services/embedding-provider.ts to add new provider branches with shared validation and API-key handling.
giancarloerra/SocratiCode#42: Both PRs modify the core embedding-provider wiring by extending EmbeddingProvider and updating configuration/factory code to add new provider options within the same code paths.

Poem

🐰 A textual tapestry we weave,
With embeddings deterministic and neat,
No Docker, no keys, just a binary so fleet—
The default at last that we all did believe! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 45.45% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely describes the primary change: adding a deterministic text-embedder as the new embedding provider implementation.
Description check	✅ Passed	The description follows the template structure with detailed Summary, comprehensive Changes section, correct Type of change marked, Testing checklist completed with all items checked, and most Checklist items addressed appropriately.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

giancarloerra · 2026-05-23T13:05:32Z

@coderabbitai review

coderabbitai · 2026-05-23T13:05:37Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/services/embedding-config.ts`:
- Around line 292-304: The logged hasApiKey value is inverted for the
textembedder case; update the hasApiKey computation (referencing hasApiKey and
embeddingProvider) so textembedder yields false (no API key) and all other
providers evaluate whether their respective env/config keys exist (e.g.,
_config.ollamaApiKey, process.env.OPENAI_API_KEY, process.env.GOOGLE_API_KEY,
process.env.LMSTUDIO_API_KEY, process.env.LITELLM_API_KEY); ensure you return a
boolean (use explicit Boolean(...) or !!) and keep the same provider
switch/conditional structure in embedding-config.ts.

In `@src/services/provider-textembedder.ts`:
- Around line 436-439: In _embedBatch validate the BatchResponse contents before
sorting/mapping: check data.results exists and that its length equals the
expected input count, ensure each result has a numeric index within 0..(n-1) and
that indices are unique and cover the full range; if any
missing/duplicate/out-of-range index is detected, throw a descriptive error. Use
the index-to-result mapping to assemble the ordered embeddings (or validate then
sort) and only then call unscaleVector on each embedding; reference
data.results, BatchResponse, _embedBatch and unscaleVector when making the
changes.
- Around line 169-186: startBinary sets binaryStarting = true but never clears
it if resolveBinarySource/ensureBinaryExtracted (or any subsequent operation)
throws, causing callers to hang; update startBinary so that after setting
binaryStarting = true you wrap the subsequent work (resolveBinarySource(), the
not-found throw, ensureBinaryExtracted(), and spawn logic) in a
try/catch/finally and ensure binaryStarting is reset to false on any thrown
error (or in finally when failing to start), e.g., catch the error from
resolveBinarySource/ensureBinaryExtracted/child spawn and set binaryStarting =
false before rethrowing; reference the startBinary function and the variables
binaryStarting, resolveBinarySource, ensureBinaryExtracted, and subprocess to
locate where to add the try/catch/finally.
- Around line 220-230: The poll and API fetches in provider-textembedder.ts can
hang because they lack AbortController timeouts; add a small helper (e.g.,
fetchWithTimeout(url, options?, timeoutMs)) that creates an AbortController,
sets a setTimeout to abort after the provided timeout, and clears the timer on
completion, then replace all raw fetch calls to `${url}/health`,
`${baseUrl}/health`, `${baseUrl}/embed`, and `${baseUrl}/embed/batch` with
fetchWithTimeout (use HEALTH_POLL_MS or a sensible per-call timeout and ensure
BINARY_START_TIMEOUT_MS logic still advances if a request is aborted), keeping
existing response handling and updating references to subprocessUrl and
binaryStarting as before.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a861762d-4a79-424a-af08-c42f2f7c3287

📥 Commits

Reviewing files that changed from the base of the PR and between bca2598 and bea12e3.

⛔ Files ignored due to path filters (3)

text-embedder-darwin.gz is excluded by !**/*.gz
text-embedder-linux.gz is excluded by !**/*.gz
text-embedder-win.gz is excluded by !**/*.gz

📒 Files selected for processing (6)

README.md
src/services/embedding-config.ts
src/services/embedding-provider.ts
src/services/provider-textembedder.ts
tests/unit/embedding-config.test.ts
tests/unit/embedding-provider.test.ts

coderabbitai · 2026-05-23T13:12:55Z

+    hasApiKey: !!(embeddingProvider === "textembedder"
+      ? true // binary / external URL; no key needed
+      : embeddingProvider === "ollama"
+        ? _config.ollamaApiKey
+        : embeddingProvider === "openai"
+          ? process.env.OPENAI_API_KEY
+          : embeddingProvider === "google"
+            ? process.env.GOOGLE_API_KEY
+            : embeddingProvider === "lmstudio"
+              ? process.env.LMSTUDIO_API_KEY
+              : embeddingProvider === "litellm"
+                ? process.env.LITELLM_API_KEY
+                : undefined),


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

hasApiKey is inverted for textembedder in config logs.

This currently logs hasApiKey: true for textembedder even though the provider explicitly does not use an API key, which makes diagnostics misleading.

Suggested fix

- hasApiKey: !!(embeddingProvider === "textembedder" - ? true // binary / external URL; no key needed + hasApiKey: !!(embeddingProvider === "textembedder" + ? false // binary / external URL; no key needed : embeddingProvider === "ollama" ? _config.ollamaApiKey : embeddingProvider === "openai"

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

hasApiKey: !!(embeddingProvider === "textembedder"

? true // binary / external URL; no key needed

: embeddingProvider === "ollama"

? _config.ollamaApiKey

: embeddingProvider === "openai"

? process.env.OPENAI_API_KEY

: embeddingProvider === "google"

? process.env.GOOGLE_API_KEY

: embeddingProvider === "lmstudio"

? process.env.LMSTUDIO_API_KEY

: embeddingProvider === "litellm"

? process.env.LITELLM_API_KEY

: undefined),

hasApiKey: !!(embeddingProvider === "textembedder"

? false // binary / external URL; no key needed

: embeddingProvider === "ollama"

? _config.ollamaApiKey

: embeddingProvider === "openai"

? process.env.OPENAI_API_KEY

: embeddingProvider === "google"

? process.env.GOOGLE_API_KEY

: embeddingProvider === "lmstudio"

? process.env.LMSTUDIO_API_KEY

: embeddingProvider === "litellm"

? process.env.LITELLM_API_KEY

: undefined),

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/services/embedding-config.ts` around lines 292 - 304, The logged hasApiKey value is inverted for the textembedder case; update the hasApiKey computation (referencing hasApiKey and embeddingProvider) so textembedder yields false (no API key) and all other providers evaluate whether their respective env/config keys exist (e.g., _config.ollamaApiKey, process.env.OPENAI_API_KEY, process.env.GOOGLE_API_KEY, process.env.LMSTUDIO_API_KEY, process.env.LITELLM_API_KEY); ensure you return a boolean (use explicit Boolean(...) or !!) and keep the same provider switch/conditional structure in embedding-config.ts.

coderabbitai · 2026-05-23T13:12:55Z

+  binaryStarting = true;
+  const source = await resolveBinarySource();
+
+  if (!source) {
+    binaryStarting = false;
+    const pfx = platformGzName();
+    throw new Error(
+      `text-embedder binary not found for platform "${process.platform}". ` +
+      `Run 'make deploy-all' from the text-embedder directory to generate ` +
+      (pfx ? `${pfx} (expected name), ` : "") +
+      "or set TEXTEMBEDDER_BIN_PATH / TEXTEMBEDDER_URL.",
+    );
+  }
+
+  // Decompress if necessary, or use bare binary directly
+  const binPath = source.isCompressed
+    ? await ensureBinaryExtracted(source.sourcePath)
+    : source.sourcePath;


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="src/services/provider-textembedder.ts" if [ ! -f "$FILE" ]; then echo "Missing file: $FILE" >&2 exit 1 fi # Show a focused range around binaryStarting and startBinary rg -n "binaryStarting|startBinary|resolveBinarySource|ensureBinaryExtracted" "$FILE" -S # Print around startBinary function definition and binaryStarting usage python3 - <<'PY' import re, pathlib path = pathlib.Path("src/services/provider-textembedder.ts") txt = path.read_text() # find startBinary line m = re.search(r'async function startBinary\s*$', txt) if not m: print("startBinary not found") raise SystemExit(0) start = txt[:m.start()].count("\n") print("startBinary approx line:", start+1) # print 220 lines around lines = txt.splitlines() a = max(0, start-30) b = min(len(lines), start+220) for i in range(a,b): print(f"{i+1:4d}: {lines[i]}") PY # Also check any polling/wait loop for binaryStarting rg -n "while\s*\(.*binaryStarting|poll|setTimeout|sleep|await new Promise|binaryStarting\s*$" "$FILE" -S || true # Show any place that sets binaryStarting=false rg -n "binaryStarting\s*=\s*false" "$FILE" -S

Repository: giancarloerra/SocratiCode

Length of output: 9800

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="src/services/provider-textembedder.ts" if [ ! -f "$FILE" ]; then echo "Missing file: $FILE" >&2 exit 1 fi rg -n "binaryStarting|startBinary|resolveBinarySource|ensureBinaryExtracted" "$FILE" -S python3 - <<'PY' import re, pathlib path = pathlib.Path("src/services/provider-textembedder.ts") txt = path.read_text() m = re.search(r'async function startBinary\s*$', txt) if not m: print("startBinary not found") raise SystemExit(0) start = txt[:m.start()].count("\n") print("startBinary approx line:", start+1) lines = txt.splitlines() a = max(0, start-40) b = min(len(lines), start+260) for i in range(a,b): print(f"{i+1:4d}: {lines[i]}") PY rg -n "while\s*\(.*binaryStarting|poll|setTimeout|sleep|await new Promise|binaryStarting\s*$" "$FILE" -S || true rg -n "binaryStarting\s*=\s*false" "$FILE" -S

Repository: giancarloerra/SocratiCode

Length of output: 12000

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="src/services/provider-textembedder.ts" # Inspect resolveBinarySource and ensureBinaryExtracted bodies python3 - <<'PY' import itertools, pathlib path = pathlib.Path("src/services/provider-textembedder.ts") lines = path.read_text().splitlines() def dump(start, end): for i in range(start, end+1): if 1 <= i <= len(lines): print(f"{i:4d}: {lines[i-1]}") # Based on rg: resolveBinarySource around 97, ensureBinaryExtracted around 119 dump(70, 150) PY # Locate _embedBatch and inspect rg -n "_embedBatch|BatchResponse|results|index" "$FILE" -S python3 - <<'PY' import re, pathlib path = pathlib.Path("src/services/provider-textembedder.ts") txt = path.read_text() m = re.search(r'async function _embedBatch', txt) if not m: # maybe method: this._embedBatch or class method # fallback: print around first occurrence of 'protected async _embedBatch' etc import sys pass # Print around all lines containing '_embedBatch' lines = txt.splitlines() for i,l in enumerate(lines, start=1): if '_embedBatch' in l: start=max(1,i-40); end=min(len(lines),i+140) print(f"\n--- around line {i} ---") for j in range(start,end+1): print(f"{j:4d}: {lines[j-1]}") PY # Find fetch usage without AbortController/timeout settings (heuristic) rg -n "fetch\$.*\$" "$FILE" -S rg -n "AbortController|timeout|signal:" "$FILE" -S

Repository: giancarloerra/SocratiCode

Length of output: 11659

Fix binaryStarting stuck/hang on extraction/spawn errors in startBinary
startBinary() sets binaryStarting = true and only resets it for specific paths; if ensureBinaryExtracted() (or anything after setting binaryStarting=true) throws, binaryStarting remains true, and later callers enter the polling Promise and can wait forever (no timeout) because the interval only rejects when !subprocess && !binaryStarting.

Suggested fix

async function startBinary(port: number): Promise<string> { if (subprocessUrl) return subprocessUrl; if (binaryStarting) { @@ } binaryStarting = true; - const source = await resolveBinarySource(); + let source: { sourcePath: string; isCompressed: boolean } | null = null; + try { + source = await resolveBinarySource(); - if (!source) { - binaryStarting = false; - const pfx = platformGzName(); - throw new Error( - `text-embedder binary not found for platform "${process.platform}". ` + - `Run 'make deploy-all' from the text-embedder directory to generate ` + - (pfx ? `${pfx} (expected name), ` : "") + - "or set TEXTEMBEDDER_BIN_PATH / TEXTEMBEDDER_URL.", - ); - } + if (!source) { + const pfx = platformGzName(); + throw new Error( + `text-embedder binary not found for platform "${process.platform}". ` + + `Run 'make deploy-all' from the text-embedder directory to generate ` + + (pfx ? `${pfx} (expected name), ` : "") + + "or set TEXTEMBEDDER_BIN_PATH / TEXTEMBEDDER_URL.", + ); + } - // Decompress if necessary, or use bare binary directly - const binPath = source.isCompressed - ? await ensureBinaryExtracted(source.sourcePath) - : source.sourcePath; + // Decompress if necessary, or use bare binary directly + const binPath = source.isCompressed + ? await ensureBinaryExtracted(source.sourcePath) + : source.sourcePath; + // ... existing spawn/readiness logic ... + } catch (err) { + binaryStarting = false; + throw err; + }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

binaryStarting = true;

const source = await resolveBinarySource();

if (!source) {

binaryStarting = false;

const pfx = platformGzName();

throw new Error(

`text-embedder binary not found for platform "${process.platform}". ` +

`Run 'make deploy-all' from the text-embedder directory to generate ` +

(pfx ? `${pfx} (expected name), ` : "") +

"or set TEXTEMBEDDER_BIN_PATH / TEXTEMBEDDER_URL.",

);

}

// Decompress if necessary, or use bare binary directly

const binPath = source.isCompressed

? await ensureBinaryExtracted(source.sourcePath)

: source.sourcePath;

binaryStarting = true;

let source: { sourcePath: string; isCompressed: boolean } | null = null;

try {

source = await resolveBinarySource();

if (!source) {

const pfx = platformGzName();

throw new Error(

`text-embedder binary not found for platform "${process.platform}". ` +

`Run 'make deploy-all' from the text-embedder directory to generate ` +

(pfx ? `${pfx} (expected name), ` : "") +

"or set TEXTEMBEDDER_BIN_PATH / TEXTEMBEDDER_URL.",

);

}

// Decompress if necessary, or use bare binary directly

const binPath = source.isCompressed

? await ensureBinaryExtracted(source.sourcePath)

: source.sourcePath;

} catch (err) {

binaryStarting = false;

throw err;

}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/services/provider-textembedder.ts` around lines 169 - 186, startBinary sets binaryStarting = true but never clears it if resolveBinarySource/ensureBinaryExtracted (or any subsequent operation) throws, causing callers to hang; update startBinary so that after setting binaryStarting = true you wrap the subsequent work (resolveBinarySource(), the not-found throw, ensureBinaryExtracted(), and spawn logic) in a try/catch/finally and ensure binaryStarting is reset to false on any thrown error (or in finally when failing to start), e.g., catch the error from resolveBinarySource/ensureBinaryExtracted/child spawn and set binaryStarting = false before rethrowing; reference the startBinary function and the variables binaryStarting, resolveBinarySource, ensureBinaryExtracted, and subprocess to locate where to add the try/catch/finally.

coderabbitai · 2026-05-23T13:12:55Z

+      const resp = await fetch(`${url}/health`);
+      if (resp.ok) {
+        logger.info("text-embedder binary is ready", { url });
+        subprocessUrl = url;
+        binaryStarting = false;
+        return url;
+      }
+    } catch {
+      // Not ready yet
+    }
+    await new Promise((r) => setTimeout(r, HEALTH_POLL_MS));


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="src/services/provider-textembedder.ts" echo "== Fetch occurrences ==" rg -n "fetch\(" "$FILE" || true echo echo "== AbortController/timeout helpers in file ==" rg -n "AbortController|timeout|setTimeout" "$FILE" || true echo echo "== Snippet around 200-260 ==" nl -ba "$FILE" | sed -n '190,260p' echo echo "== Snippet around 300-390 ==" nl -ba "$FILE" | sed -n '300,390p' echo echo "== Snippet around 400-460 ==" nl -ba "$FILE" | sed -n '400,460p'

Repository: giancarloerra/SocratiCode

Length of output: 617

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="src/services/provider-textembedder.ts" echo "== Snippet around 210-255 ==" sed -n '210,255p' "$FILE" | cat -n echo echo "== Snippet around 315-380 ==" sed -n '315,380p' "$FILE" | cat -n echo echo "== Snippet around 395-440 ==" sed -n '395,440p' "$FILE" | cat -n echo echo "== Any fetch wrapper definitions in file ==" rg -n "fetchWithTimeout|AbortController|controller\.abort|signal:|timeoutMs|HTTP_TIMEOUT_MS" "$FILE" || true echo echo "== Any shared fetch helper elsewhere in repo (search quick) ==" rg -n "fetchWithTimeout" src || true rg -n "AbortController" src || true

Repository: giancarloerra/SocratiCode

Length of output: 6691

🌐 Web query:

Does Node.js global fetch (undici) have a default request timeout? Is there a built-in timeout for hung sockets?

💡 Result:

Yes—Node.js’ global fetch (which uses undici under the hood) has default timeouts, including timeouts that can terminate “hung” requests due to lack of activity. 1) Default overall request timeout behavior (headers/body) - undici’s Dispatcher has defaults: - headersTimeout defaults to 300 seconds (5 minutes) [1][2][3] - bodyTimeout defaults to 300 seconds (5 minutes) [1][2][3] These are “activity” timeouts: headersTimeout monitors waiting for complete HTTP headers; bodyTimeout monitors time between receiving body data. If a server accepts a connection but never completes headers (or never sends body data), the request times out. 2) Default connect/establish-timeout (for “hung sockets” during connect) - undici also sets a default connect timeout used when establishing the connection; undici will throw UND_ERR_CONNECT_TIMEOUT if the connect attempt exceeds that timeout [4]. - This is not the same as headersTimeout/bodyTimeout; it’s specifically for connection establishment. 3) Is there a built-in timeout for “hung sockets” after connect? - undici provides socket-/activity-level timeouts via dispatcher/client options such as socketTimeout (timeout when there is no activity on a connected socket) [5]. In the default Dispatcher settings that global fetch uses, the request-level protections you most directly hit are headersTimeout/bodyTimeout (5 minutes each) [1][2]. - So: there isn’t a single documented “default request timeout” knob exposed as one number, but there are built-in default timeouts that will break otherwise-stuck connections/request states (connect timeout for connect hangs; headersTimeout/bodyTimeout for hangs in those phases) [4][1][2]. If you need a shorter or longer behavior, you must configure undici (via Agent/Dispatcher options) or use AbortSignal/timeouts at the fetch call site; undici documents you can pass a dispatcher/Agent to fetch to customize timeouts [6][1].

Citations:

1: https://github.com/nodejs/undici/blob/main/docs/docs/api/Dispatcher.md

2: https://undici-docs.vramana.dev/docs/api/Client

3: Default fetch timeout too short nodejs/undici#1373

4: UND_ERR_CONNECT_TIMEOUT errors thrown when there is CPU intensive code on the event loop nodejs/undici#3410

5: https://github.com/clovis-guillemot/undici/tree/2.x

6: https://github.com/nodejs/undici/tree/main/docs

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE="src/services/provider-textembedder.ts" echo "== Constants in provider-textembedder.ts ==" rg -n "BINARY_START_TIMEOUT_MS|HEALTH_POLL_MS|TEXTEMBEDDER_BATCH_SIZE" "$FILE" || true echo echo "== Snippet around where constants are defined ==" # show first ~120 lines (often constants live near top) sed -n '1,140p' "$FILE" | cat -n echo echo "== Snippet around the polling loop start ==" # show around first fetch at ~220 with a bit more context sed -n '160,255p' "$FILE" | cat -n

Repository: giancarloerra/SocratiCode

Length of output: 9738

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "== Search for fetch timeout helpers across src ==" rg -n "fetchWithTimeout|fetchWithAbort|timeoutMs|AbortSignal|AbortController" src || true echo echo "== Search for global fetch wrapper patterns ==" rg -n "function\s+fetch|const\s+fetch\s*=|export\s+.*fetch" src || true

Repository: giancarloerra/SocratiCode

Length of output: 785

Add explicit request timeouts (AbortController) to all fetch calls in src/services/provider-textembedder.ts
Binary readiness polling awaits fetch(${url}/health) without an abort, so it can block beyond BINARY_START_TIMEOUT_MS (the loop can’t advance until the request resolves). The other /health, /embed, and /embed/batch calls have the same issue (they currently rely only on undici’s default timeouts).

Impacted fetch sites: ~220-230, ~326-327, ~364-368, ~406-407, ~423-427.

Suggested fix

+const HTTP_TIMEOUT_MS = 5000; + +async function fetchWithTimeout(input: string, init: RequestInit = {}, timeoutMs = HTTP_TIMEOUT_MS): Promise<Response> { + const controller = new AbortController(); + const timeout = setTimeout(() => controller.abort(), timeoutMs); + try { + return await fetch(input, { ...init, signal: controller.signal }); + } finally { + clearTimeout(timeout); + } +} @@ - const resp = await fetch(`${url}/health`); + const resp = await fetchWithTimeout(`${url}/health`); @@ - const resp = await fetch(`${baseUrl}/health`); + const resp = await fetchWithTimeout(`${baseUrl}/health`); @@ - const response = await fetch(`${baseUrl}/embed`, { + const response = await fetchWithTimeout(`${baseUrl}/embed`, { @@ - const response = await fetch(`${baseUrl}/embed/batch`, { + const response = await fetchWithTimeout(`${baseUrl}/embed/batch`, {

Also wrap the second ${baseUrl}/health call (~406-407) with fetchWithTimeout (same change as the first /health call).

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

const resp = await fetch(`${url}/health`);

if (resp.ok) {

logger.info("text-embedder binary is ready", { url });

subprocessUrl = url;

binaryStarting = false;

return url;

}

} catch {

// Not ready yet

}

await new Promise((r) => setTimeout(r, HEALTH_POLL_MS));

const resp = await fetchWithTimeout(`${url}/health`);

if (resp.ok) {

logger.info("text-embedder binary is ready", { url });

subprocessUrl = url;

binaryStarting = false;

return url;

}

} catch {

// Not ready yet

}

await new Promise((r) => setTimeout(r, HEALTH_POLL_MS));

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/services/provider-textembedder.ts` around lines 220 - 230, The poll and API fetches in provider-textembedder.ts can hang because they lack AbortController timeouts; add a small helper (e.g., fetchWithTimeout(url, options?, timeoutMs)) that creates an AbortController, sets a setTimeout to abort after the provided timeout, and clears the timer on completion, then replace all raw fetch calls to `${url}/health`, `${baseUrl}/health`, `${baseUrl}/embed`, and `${baseUrl}/embed/batch` with fetchWithTimeout (use HEALTH_POLL_MS or a sensible per-call timeout and ensure BINARY_START_TIMEOUT_MS logic still advances if a request is aborted), keeping existing response handling and updating references to subprocessUrl and binaryStarting as before.

coderabbitai · 2026-05-23T13:12:55Z

+    const data: BatchResponse = await response.json() as BatchResponse;
+    const sorted = data.results.sort((a, b) => a.index - b.index);
+    return sorted.map((r) => unscaleVector(r.embedding));
+  }


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail # Locate and print the relevant section sed -n '380,520p' src/services/provider-textembedder.ts | nl -ba | sed -n '1,200p' # Show BatchResponse / result type definitions rg -n "type\s+BatchResponse|interface\s+BatchResponse|BatchResponse" src/services/provider-textembedder.ts rg -n "results\s*:" src/services/provider-textembedder.ts

Repository: giancarloerra/SocratiCode

Length of output: 112

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "== Snippet around lines 420-470 ==" sed -n '420,470p' src/services/provider-textembedder.ts | awk '{print NR+419 ":" $0}' echo echo "== Type/Interface definitions for BatchResponse ==" rg -n "type\s+BatchResponse|interface\s+BatchResponse|BatchResponse" src/services/provider-textembedder.ts echo echo "== Any existing validation around batch results (index/length) ==" rg -n "validate|validation|index|length|results\.length|texts\.length" src/services/provider-textembedder.ts

Repository: giancarloerra/SocratiCode

Length of output: 1297

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "== BatchResponse interface (around line 260-320) ==" sed -n '250,320p' src/services/provider-textembedder.ts echo echo "== BatchResponse result item type (search for 'embedding' near BatchResponse) ==" rg -n "interface BatchResponse|type BatchResponse|embedding|index\s*:\s*number" src/services/provider-textembedder.ts echo echo "== Where _embedBatch is called ==" rg -n "_embedBatch\(" src/services/provider-textembedder.ts

Repository: giancarloerra/SocratiCode

Length of output: 2522

🏁 Script executed:

#!/bin/bash set -euo pipefail # Show the full _embedBatch and its caller context sed -n '300,470p' src/services/provider-textembedder.ts | cat -n echo echo "== Search for BatchResponse.count usage ==" rg -n "data\.count|BatchResponse\s*.*count|count" src/services/provider-textembedder.ts

Repository: giancarloerra/SocratiCode

Length of output: 6113

🏁 Script executed:

#!/bin/bash set -euo pipefail echo "== Search for '/embed/batch' usage/contract ==" rg -n "/embed/batch" -S . echo echo "== Search for BatchItem/BatchResponse usage elsewhere ==" rg -n "BatchItem|interface BatchItem|BatchResponse|interface BatchResponse" -S .

Repository: giancarloerra/SocratiCode

Length of output: 718

Validate /embed/batch result completeness/index integrity before sorting and mapping (src/services/provider-textembedder.ts ~436-439).

_embedBatch sorts data.results by index and immediately maps embeddings; it never checks that the response has exactly one result per input or that indices are valid/unique/cover the expected range. Missing/duplicate/out-of-range indices can silently misalign vectors with input texts.

Suggested fix

const data: BatchResponse = await response.json() as BatchResponse; const sorted = data.results.sort((a, b) => a.index - b.index); + if (sorted.length !== texts.length) { + throw new Error( + `text-embedder /embed/batch returned ${sorted.length} results for ${texts.length} inputs.`, + ); + } + for (let i = 0; i < sorted.length; i++) { + if (!Number.isInteger(sorted[i].index) || sorted[i].index !== i) { + throw new Error( + `text-embedder /embed/batch returned invalid index mapping at position ${i} (index=${sorted[i].index}).`, + ); + } + } return sorted.map((r) => unscaleVector(r.embedding));

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/services/provider-textembedder.ts` around lines 436 - 439, In _embedBatch validate the BatchResponse contents before sorting/mapping: check data.results exists and that its length equals the expected input count, ensure each result has a numeric index within 0..(n-1) and that indices are unique and cover the full range; if any missing/duplicate/out-of-range index is detected, throw a descriptive error. Use the index-to-result mapping to assemble the ordered embeddings (or validate then sort) and only then call unscaleVector on each embedding; reference data.results, BatchResponse, _embedBatch and unscaleVector when making the changes.

giancarloerra · 2026-05-23T14:03:36Z

I'm closing this PR for the following reasons:

Security. This PR adds three compiled binaries from an external repository (github.com/guiperry/text-embedder) and makes them the default embedding provider. Every SocratiCode user would automatically decompress and execute these binaries as a subprocess with full system access. Distributing compiled binaries from an unaudited source through an open-source project is a supply chain risk.
Major breaking change. Changing the default provider from Ollama to textembedder would silently break every existing installation. Qdrant indexes built with nomic-embed-text vectors are incompatible with vectors from a different model. Users would get degraded or broken search results after upgrading.
Unverified embedding quality. landmark-lattice-v1 has no MTEB benchmarks, no published evaluations, and no independent review. SocratiCode's value depends on search quality. Replacing an established model (nomic-embed-text) with an unverified one requires evidence that results are at least comparable.
Binaries in git. Committing ~15 MB of platform binaries to the repository permanently inflates the git history for every clone. This is not a sustainable distribution method or one I want.

Also even without these 4 major problems, there are technical ones:

CI fails. Biome lint reports unused imports, non-null assertions, and unsorted imports. The linter was not run locally before submission.
No tests. The new provider module is 440 lines with no unit tests. The only test changes update default-provider assertions.
CodeRabbit flagged critical issues. Race condition in subprocess startup, missing batch response validation (silent vector misalignment), no fetch timeout on HTTP calls, inverted hasApiKey field, and the signal handlers (process.on("SIGINT"/"SIGTERM") calling process.exit(0)) would override the MCP server's own shutdown logic.

Whatever the intention behind this, it's not a positive or justifiable change/improvement. Thanks for your interest in SocratiCode!

Deterministic Text Embedder Implementation

bea12e3

giancarloerra self-assigned this May 23, 2026

coderabbitai Bot reviewed May 23, 2026

View reviewed changes

giancarloerra closed this May 23, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Deterministic Text Embedder Implementation#62

Deterministic Text Embedder Implementation#62
guiperry wants to merge 1 commit into
giancarloerra:mainfrom
guiperry:feat/deterministic-text-embedder

guiperry commented May 23, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 23, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

giancarloerra commented May 23, 2026

Uh oh!

coderabbitai Bot commented May 23, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 23, 2026

Uh oh!

coderabbitai Bot May 23, 2026

Uh oh!

coderabbitai Bot May 23, 2026

Uh oh!

coderabbitai Bot May 23, 2026

Uh oh!

giancarloerra commented May 23, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

guiperry commented May 23, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Type of change

Testing

Checklist

Related issues

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

giancarloerra commented May 23, 2026

Uh oh!

coderabbitai Bot commented May 23, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 23, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 23, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 23, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 23, 2026

Choose a reason for hiding this comment

Uh oh!

giancarloerra commented May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

guiperry commented May 23, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 23, 2026 •

edited

Loading

giancarloerra commented May 23, 2026 •

edited

Loading