fix: make the test suite Windows-green and restore full Windows support #228

Merged
dfrostar merged 2 commits into main from claude/awesome-wozniak-uhq13c on Jun 12, 2026

@dfrostar

Owner

Closes the four Windows failure classes from the cross-platform CI run (509 passed / 5 failed / 134 errors on windows-latest) and re-adds windows-latest to the gating matrix.

The four fixes

1. ChromaDB temp-dir teardown (~134 errors)

Chroma caches one System per storage path for the life of the process, holding the sqlite connection pool and HNSW segment files. GraphEmbedder.close() previously called delete_collection — which released nothing (and destroyed data a later open expected to find). It now stops the client's System and evicts it from SharedSystemClient._identifier_to_system, so Windows can actually delete the store afterwards. Verified against chromadb 1.5.9.

On top of that:

  • NeuralMind.close() added (delegates to the backend, safe to call twice).
  • tests/conftest.py gains an autouse fixture that stops every cached chroma System after each test, so handles are released regardless of whether the test cleaned up. The temp_project/empty_project fixtures also use ignore_cleanup_errors=True as a belt-and-suspenders measure.

2. Event-log rotation (3 failures)

The tailer holds its read handle across poll intervals, and rotation is a logrotate-style rename. POSIX allows renaming an open file; Windows' open() omits FILE_SHARE_DELETE, so the rotating process got PermissionError. The tailer now opens via CreateFileW with share-delete on Windows, recreating POSIX semantics — rotation never depends on catching the tailer between polls. POSIX path is the same open(path, "rb") as before.
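
A minimal sketch of a share-delete open, assuming a `ctypes` binding to `CreateFileW`. The function name `open_for_tailing` is hypothetical and the PR's actual helper may differ. The Win32 constants are the documented values, and the Windows branch only runs on Windows.

```python
import os
import sys


def open_for_tailing(path: str):
    """Open `path` for binary reading without blocking rename/delete on Windows."""
    if sys.platform != "win32":
        # POSIX already allows renaming or unlinking a file that is open.
        return open(path, "rb")

    import ctypes
    import msvcrt
    from ctypes import wintypes

    GENERIC_READ = 0x80000000
    FILE_SHARE_READ, FILE_SHARE_WRITE, FILE_SHARE_DELETE = 0x1, 0x2, 0x4
    OPEN_EXISTING = 3
    FILE_ATTRIBUTE_NORMAL = 0x80
    INVALID_HANDLE_VALUE = ctypes.c_void_p(-1).value

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.CreateFileW.argtypes = [
        wintypes.LPCWSTR, wintypes.DWORD, wintypes.DWORD, wintypes.LPVOID,
        wintypes.DWORD, wintypes.DWORD, wintypes.HANDLE,
    ]
    kernel32.CreateFileW.restype = wintypes.HANDLE

    handle = kernel32.CreateFileW(
        path,
        GENERIC_READ,
        FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
        None,
        OPEN_EXISTING,
        FILE_ATTRIBUTE_NORMAL,
        None,
    )
    if handle is None or handle == INVALID_HANDLE_VALUE:
        raise ctypes.WinError(ctypes.get_last_error())
    # Wrap the Win32 handle in a CRT descriptor, then in a Python file object.
    fd = msvcrt.open_osfhandle(handle, os.O_RDONLY | os.O_BINARY)
    return os.fdopen(fd, "rb")
```

With the reader opened this way, a rotating process can rename the live log while the handle is still open, and the reader keeps seeing the renamed file's contents.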

3. Concurrent recent-queries appends (1 failure)

The append relied on POSIX O_APPEND atomicity; Windows' CRT implements append as a separate seek-to-end + write, so 8 threads × 5 appends landed 37/40 lines. Appends are now a single os.write on an O_APPEND fd serialized by a process-local lock, plus a best-effort cross-process byte-range lock (msvcrt.locking, non-blocking with ~50ms retry) shared with _compact_recent_queries — so a compaction's read-truncate-rewrite can't drop a concurrent process's append either. POSIX behavior unchanged.
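
The in-process half of that scheme can be sketched as follows. `append_line` and `_append_lock` are hypothetical names, and the cross-process `msvcrt.locking` byte-range lock is deliberately left out, so this only illustrates the single `os.write` on an `O_APPEND` descriptor under a process-local lock.

```python
import os
import threading

_append_lock = threading.Lock()  # process-local: serializes threads only


def append_line(path: str, line: str) -> None:
    """Append one line with a single os.write on an O_APPEND descriptor."""
    data = (line.rstrip("\n") + "\n").encode("utf-8")
    # O_BINARY exists only on Windows; it avoids newline translation there.
    flags = os.O_WRONLY | os.O_APPEND | os.O_CREAT | getattr(os, "O_BINARY", 0)
    with _append_lock:
        fd = os.open(path, flags, 0o644)
        try:
            # One write call per record, so no other thread in this process
            # can interleave between a seek-to-end and the write.
            os.write(fd, data)
        finally:
            os.close(fd)
```

Rerunning the failing scenario (8 threads, 5 appends each) against a helper like this should land all 40 lines, since every record goes out in one locked write.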

4. Executable-bit test (1 failure)

test_cmd_init_hook_makes_executable is skipped on Windows, which has no POSIX execute bit.

Support claims restored (issue checklist)

  • Items 1–4 fixed
  • windows-latest re-added to the test matrix in ci.yml (Python 3.12) — the run on this PR is the proof
  • docs/COMPATIBILITY.md Windows row restored to ✅ Full
  • docs/index.html operatingSystem restored to "Linux, macOS, Windows"

Also folds in the post-release landing-page touch-up (same file as the operatingSystem edit): v0.24.0 is marked as the latest release in the hero badge, timeline, and JSON-LD softwareVersion, mirroring what #224 did for v0.23.0.

Verification

  • Linux: full suite 750 passed, 17 skipped; black --check and ruff check clean; mypy introduces no new errors in the touched modules.
  • Windows: validated by this PR's own windows-latest CI leg (the whole point of re-gating it).

Fixes #186

https://claude.ai/code/session_01FkHXHcjpWZL2EWn4HGi547


Generated by Claude Code

Closes the four Windows failure classes from the cross-platform CI run
(509 passed / 5 failed / 134 errors on windows-latest):

- chromadb teardown (~134 errors): GraphEmbedder.close() now stops the
  client's cached System and evicts it from chroma's per-path cache,
  releasing the sqlite/HNSW file handles Windows needs closed before a
  directory can be deleted. The previous close() deleted the collection,
  which released nothing and destroyed data. NeuralMind.close() added on
  top; conftest releases all cached Systems after every test and the
  temp-project fixtures ignore residual cleanup errors.

- event-log rotation (3 failures): the tailer's read handle is opened
  with FILE_SHARE_DELETE on Windows (CreateFileW), so a logrotate-style
  rename of the live log no longer throws PermissionError under a
  reader. POSIX path unchanged.

- concurrent appends (1 failure): recent-queries appends are a single
  O_APPEND write serialized by a process-local lock, plus a best-effort
  cross-process byte-range lock on Windows shared with compaction.
  POSIX behavior is unchanged (O_APPEND was already atomic).

- executable-bit test (1 failure): skipped on Windows, which has no
  POSIX execute bit.

windows-latest (Python 3.12) rejoins the gating matrix, COMPATIBILITY.md
restores the Windows row to Full, and the landing page's schema.org
operatingSystem claims Windows again (and marks v0.24.0 as the latest
release now that it has shipped).

Fixes #186

https://claude.ai/code/session_01FkHXHcjpWZL2EWn4HGi547
@github-actions github-actions Bot added the bug, documentation, enhancement, and question labels Jun 12, 2026
@github-actions

github-actions Bot commented Jun 12, 2026

Contributor

Backend parity gate — graphify vs built-in tree-sitter

✅ PASS — the built-in backend must stay within tolerance of graphify on the reference fixture.

| Metric | graphify | built-in |
|---|---|---|
| code nodes | 65 | 79 |
| mean reduction | 6.05× | 6.66× |
| faithfulness delta | -0.047 | +0.143 |
| fact recall | 0.527 | 0.717 |
| grounding | 0.889 | 1.000 |

Gate checks

  • reduction within tolerance of graphify — built-in 6.66× ≥ 4.54× (graphify 6.05× − 25%)
  • reduction ≥ absolute floor — built-in 6.66× ≥ floor 4.00×
  • faithfulness delta within tolerance of graphify — built-in +0.143 ≥ -0.147 (graphify -0.047 − 0.10)
  • faithfulness delta ≥ absolute floor — built-in +0.143 ≥ floor +0.000
  • fact recall within tolerance of graphify — built-in 0.717 ≥ 0.427 (graphify 0.527 − 0.10)

Tolerances: reduction within 25% (floor 4.0×), faithfulness within 0.10 (floor +0.00). Override via NEURALMIND_PARITY_* env vars.
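
The threshold arithmetic behind the gate checks can be reproduced with a short sketch. `parity_checks` and its parameter names are hypothetical and are not taken from evals/parity/run.py; the tolerances and floors are the ones quoted above.

```python
def parity_checks(ref_reduction, new_reduction, ref_delta, new_delta,
                  rel_tol=0.25, abs_tol=0.10,
                  reduction_floor=4.0, delta_floor=0.0):
    """Return pass/fail for each reduction and faithfulness gate."""
    reduction_min = ref_reduction * (1 - rel_tol)  # 6.05 * 0.75 = 4.5375
    delta_min = ref_delta - abs_tol                # -0.047 - 0.10 = -0.147
    return {
        "reduction_within_tolerance": new_reduction >= reduction_min,
        "reduction_above_floor": new_reduction >= reduction_floor,
        "delta_within_tolerance": new_delta >= delta_min,
        "delta_above_floor": new_delta >= delta_floor,
    }


# Values from the table: graphify 6.05x / -0.047, built-in 6.66x / +0.143.
checks = parity_checks(6.05, 6.66, -0.047, 0.143)
```

The relative and absolute checks are independent: a built-in reduction of 4.2× would clear the 4.00× floor but still fail the 25% tolerance against graphify.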

Automated by evals/parity/run.py — reproduce locally with python -m evals.parity.run.

Multi-language structural parity

| Language | graphify symbols | built-in covers | dangling |
|---|---|---|---|
| typescript | 54 | 54 (100%) | 0 |
| go | 45 | 45 (100%) | 0 |

  • typescript: symbol coverage ≥ floor — 54/54 graphify symbols (100%) ≥ 90%
  • typescript: no dangling edges — 0 dangling edge(s)
  • go: symbol coverage ≥ floor — 45/45 graphify symbols (100%) ≥ 90%
  • go: no dangling edges — 0 dangling edge(s)

Coverage floor: 90% of graphify's per-language symbols (no gold-fact set exists for TS/Go, so parity is structural).

Optional SCIP precision pass

  • precision: SCIP corrects the heuristic call edge — run() → A.handle under SCIP (heuristic wrongly linked B.handle)
  • precision: strict no-op when disabled — graph unchanged when NEURALMIND_PRECISION is unset

Off by default (NEURALMIND_PRECISION); proven on tests/fixtures/scip_precision to replace a heuristic call edge with the compiler-accurate one a SCIP index resolves.

@github-actions

github-actions Bot commented Jun 12, 2026

Contributor

NeuralMind self-benchmark

Status: PASS — floor , measured 6.2×.

Phase 1 — Reduction on committed fixture

  • Average reduction: 6.2×
  • Top-k retrieval hit rate: 71.7%
  • Naive baseline: 47,360 tokens across 10 queries (4,736 per query, all fixture files concatenated)
  • NeuralMind total: 7,706 tokens across 10 queries
  • Estimated monthly savings @ 100 queries/day on Claude 3.5 Sonnet: ~$35.69

| # | Query | Shape | Naive | NeuralMind | Ratio | Hit |
|---|---|---|---|---|---|---|
| 1 | auth-flow | cross-file | 4,736 | 773 | 6.1× | 33.3% |
| 2 | api-endpoints | focused | 4,736 | 758 | 6.2× | 100.0% |
| 3 | billing-flow | cross-file | 4,736 | 774 | 6.1× | 33.3% |
| 4 | user-storage | cross-file | 4,736 | 651 | 7.3× | 50.0% |
| 5 | jwt-verify | focused | 4,736 | 669 | 7.1× | 100.0% |
| 6 | stripe-webhook | focused | 4,736 | 801 | 5.9× | 100.0% |
| 7 | create-user | cross-file | 4,736 | 771 | 6.1× | 50.0% |
| 8 | refund | focused | 4,736 | 760 | 6.2× | 100.0% |
| 9 | db-choice | identity | 4,736 | 854 | 5.5× | 100.0% |
| 10 | invoice-send | cross-file | 4,736 | 895 | 5.3× | 50.0% |
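
The Phase 1 headline figures can be rechecked from the per-query numbers with a few lines of arithmetic. This is an illustrative sketch, not the benchmark's code; the 30-day month is an assumption that happens to reproduce the reported ~$35.69.

```python
# Per-query NeuralMind token counts from the table above.
neuralmind_tokens = [773, 758, 774, 651, 669, 801, 771, 760, 854, 895]
naive_per_query = 4_736

naive_total = naive_per_query * len(neuralmind_tokens)  # 47,360
neuralmind_total = sum(neuralmind_tokens)               # 7,706

# "Average reduction" read as the mean of the per-query ratios.
mean_ratio = sum(naive_per_query / t for t in neuralmind_tokens) / len(neuralmind_tokens)

# Monthly savings at 100 queries/day and $3.0 per million input tokens.
queries_per_day = 100
days_per_month = 30  # assumption, not stated in the report
price_per_mtok = 3.0

saved_per_query = (naive_total - neuralmind_total) / len(neuralmind_tokens)
monthly_tokens_saved = saved_per_query * queries_per_day * days_per_month
monthly_savings = monthly_tokens_saved * price_per_mtok / 1_000_000
```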

Phase 2 — Learning uplift

  • Memory events logged: 20
  • Learned patterns: 16
  • Reduction ratio after neuralmind learn: 6.1× (Δ -0.08× vs. cold)
  • Top-k hit rate after learning: 75.0% (Δ +3.3 points vs. cold)

Note: uplift numbers on a 500-line fixture are intentionally modest — the point is to
verify the learning mechanism persists and applies. On real production repos the lift
is larger; this test only catches regressions in persistence.

Phase 3 — Synapse recall A/B (same warm graph, recall off vs on)

  • Synapse edges after seeding co-editing sessions: 2834
  • Top-k hit rate: 71.7% off → 83.3% on (Δ +11.7 points)
  • Reduction ratio: 6.2× off → 6.2× on (Δ -0.07× — budget-neutral by design)

This isolates the Hebbian synapse layer from the learned_patterns reranker in
Phase 2. The hit-rate delta shows associative recall surfacing co-edited modules a
purely textual search ranks lower; the near-zero reduction delta confirms it does so
without spending extra tokens (recalled nodes displace the weakest hits, not add to them).

Assumptions

  • Baseline: every .py file in tests/fixtures/sample_project/ concatenated.
  • Tokenizer: tiktoken GPT-4o encoding (per-model breakdown in multi_model.json if generated).
  • Pricing: Claude 3.5 Sonnet input @ $3.0/MTok.
  • Regression floor: — well below NeuralMind's typical 40–70× on real repos.

Per-model token reduction

| Model | Tokenizer | Naive | NeuralMind | Ratio | Source |
|---|---|---|---|---|---|
| GPT-4o / GPT-4o-mini | tiktoken o200k_base | 4,739 | 779 | 6.1× | measured |
| GPT-4 / GPT-3.5-turbo | tiktoken cl100k_base | 4,710 | 770 | 6.1× | measured |
| Claude 3.5 Sonnet | estimated: GPT-4o × 1.08; install anthropic for an exact count | 5,118 | 841 | 6.1× | estimated |
| Llama 3 (70B) | estimated: GPT-4o × 1.22; Llama tokenizer requires model weights, estimate based on published vocab ratios | 5,781 | 950 | 6.1× | estimated |

Rows marked measured use the provider's real tokenizer. Rows marked
estimated apply a published vocab-size correction to the GPT-4o count —
honest approximations, not hardcoded claims.
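
As a quick check on the estimated rows, the correction is a straight multiplication of the measured GPT-4o counts. In this sketch the truncation to an integer is an assumption, chosen because it reproduces the table; the benchmark's own rounding rule is not stated.

```python
# Measured GPT-4o (o200k_base) counts from the table.
measured = {"naive": 4_739, "neuralmind": 779}

# Correction factors quoted in the estimated rows.
factors = {"Claude 3.5 Sonnet": 1.08, "Llama 3 (70B)": 1.22}

estimates = {
    model: {name: int(count * factor) for name, count in measured.items()}
    for model, factor in factors.items()
}

# Scaling both counts by the same factor leaves the ratio essentially unchanged.
ratios = {
    model: round(row["naive"] / row["neuralmind"], 1)
    for model, row in estimates.items()
}
```

Because numerator and denominator are scaled together, the estimated rows necessarily report the same 6.1× as the measured GPT-4o row.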

NeuralMind retrieval-quality eval

| Suite | Queries | MRR | Answerability | Recall@5 | Precision@5 | Gate |
|---|---|---|---|---|---|---|
| go | 10 | 0.950 | 100% | 0.833 | 0.603 | PASS |
| python | 10 | 0.950 | 100% | 0.833 | 0.678 | PASS |
| typescript | 10 | 0.900 | 100% | 0.800 | 0.562 | PASS |

go vs baseline:

  • mrr: 0.950 (= +0.000)
  • answerability: 1.000 (= +0.000)
  • recall@1: 0.617 (= -0.000)
  • recall@3: 0.833 (= +0.000)
  • recall@5: 0.833 (= +0.000)

python vs baseline:

  • mrr: 0.950 (▲ +0.050)
  • answerability: 1.000 (= +0.000)
  • recall@1: 0.617 (▲ +0.100)
  • recall@3: 0.833 (= +0.000)
  • recall@5: 0.833 (= +0.000)

typescript vs baseline:

  • mrr: 0.900 (= +0.000)
  • answerability: 1.000 (= +0.000)
  • recall@1: 0.583 (= +0.000)
  • recall@3: 0.800 (= +0.000)
  • recall@5: 0.800 (= +0.000)

Overall: PASS


Automated by .github/workflows/ci-benchmark.yml — regenerate locally with python -m tests.benchmark.run and neuralmind benchmark --quality.

os.kill(pid, 0) is the POSIX "does this process exist" idiom, but on
Windows signal.CTRL_C_EVENT == 0, so the call delivers a real Ctrl-C to
the probed pid's console process group. In the test suite the discovery
file records pytest's own pid, so the probe interrupted the whole run
with a KeyboardInterrupt; for users it could interrupt any console the
daemon shares. Probe via OpenProcess/GetExitCodeProcess instead on
Windows; POSIX path unchanged.
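
A minimal sketch of such a probe, assuming a `ctypes` binding to kernel32. The name `pid_alive` is hypothetical and the commit's actual helper may be structured differently. The Windows branch never calls `os.kill`, so no console control event can be sent.

```python
import os
import sys


def pid_alive(pid: int) -> bool:
    """Return True if a process with this pid appears to exist."""
    if pid <= 0:
        return False  # on POSIX, pids <= 0 address process groups, not a process

    if sys.platform != "win32":
        try:
            os.kill(pid, 0)  # signal 0: permission/existence check, nothing delivered
        except ProcessLookupError:
            return False
        except PermissionError:
            return True  # the process exists but belongs to another user
        return True

    import ctypes
    from ctypes import wintypes

    PROCESS_QUERY_LIMITED_INFORMATION = 0x1000
    STILL_ACTIVE = 259
    ERROR_ACCESS_DENIED = 5

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.OpenProcess.argtypes = [wintypes.DWORD, wintypes.BOOL, wintypes.DWORD]
    kernel32.OpenProcess.restype = wintypes.HANDLE
    kernel32.GetExitCodeProcess.argtypes = [wintypes.HANDLE,
                                            ctypes.POINTER(wintypes.DWORD)]
    kernel32.GetExitCodeProcess.restype = wintypes.BOOL
    kernel32.CloseHandle.argtypes = [wintypes.HANDLE]
    kernel32.CloseHandle.restype = wintypes.BOOL

    handle = kernel32.OpenProcess(PROCESS_QUERY_LIMITED_INFORMATION, False, pid)
    if not handle:
        # Access denied still means a process with that pid exists.
        return ctypes.get_last_error() == ERROR_ACCESS_DENIED
    try:
        code = wintypes.DWORD()
        if not kernel32.GetExitCodeProcess(handle, ctypes.byref(code)):
            return False
        return code.value == STILL_ACTIVE
    finally:
        kernel32.CloseHandle(handle)
```

One known caveat of the exit-code approach: a process that genuinely exits with code 259 is indistinguishable from a running one, which is usually acceptable for a liveness probe on a discovery file.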

https://claude.ai/code/session_01FkHXHcjpWZL2EWn4HGi547
@dfrostar dfrostar marked this pull request as ready for review June 12, 2026 05:23
Copilot AI review requested due to automatic review settings June 12, 2026 05:23
@dfrostar dfrostar merged commit bd3daad into main Jun 12, 2026
18 checks passed
@dfrostar dfrostar review requested due to automatic review settings June 12, 2026 05:43

Labels

bug, documentation, enhancement, question

Development

Successfully merging this pull request may close these issues.

Windows support: make the test suite CI-green

2 participants