fix: make the test suite Windows-green and restore full Windows support by dfrostar · Pull Request #228 · dfrostar/neuralmind

dfrostar · 2026-06-12T05:04:20Z

Closes the four Windows failure classes from the cross-platform CI run (509 passed / 5 failed / 134 errors on windows-latest) and re-adds windows-latest to the gating matrix.

The four fixes

1. ChromaDB temp-dir teardown (~134 errors)

Chroma caches one System per storage path for the life of the process, holding the sqlite connection pool and HNSW segment files. GraphEmbedder.close() previously called delete_collection — which released nothing (and destroyed data a later open expected to find). It now stops the client's System and evicts it from SharedSystemClient._identifier_to_system, so Windows can actually delete the store afterwards. Verified against chromadb 1.5.9.

On top of that:

NeuralMind.close() added (delegates to the backend, safe to call twice).
tests/conftest.py gains an autouse fixture that stops every cached chroma System after each test, so handles are released regardless of whether the test cleaned up. The temp_project/empty_project fixtures also use ignore_cleanup_errors=True as a belt-and-suspenders measure.
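The teardown idea can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `release_chroma_systems` is a hypothetical helper name, and the import path `chromadb.api.shared_system_client` is an assumption about where `SharedSystemClient` lives in the installed chromadb version. It touches private attributes, so it may break on other releases.

```python
def release_chroma_systems() -> int:
    """Stop every cached chroma System and evict it from the per-path cache.

    Returns the number of Systems that were evicted. Returns 0 when chromadb
    is not importable or the private cache attribute is missing.
    """
    try:
        # Assumed module location; adjust for the chromadb version in use.
        from chromadb.api.shared_system_client import SharedSystemClient
    except Exception:  # chromadb missing, or its import failed
        return 0

    cache = getattr(SharedSystemClient, "_identifier_to_system", None)
    if not cache:
        return 0

    evicted = 0
    for identifier, system in list(cache.items()):
        try:
            # Stopping the System is what releases the sqlite and HNSW handles.
            system.stop()
        except Exception:
            pass  # a System that was already stopped should not fail teardown
        cache.pop(identifier, None)
        evicted += 1
    return evicted
```

A helper like this could be called from an autouse pytest fixture after each test's `yield`, so the temporary directory can be deleted on Windows even when a test forgets to close its client.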

2. Event-log rotation (3 failures)

The tailer holds its read handle across poll intervals, and rotation is a logrotate-style rename. POSIX allows renaming an open file; on Windows, Python's open() does not request FILE_SHARE_DELETE, so the rotating process got PermissionError. The tailer now opens via CreateFileW with share-delete on Windows, recreating POSIX semantics, so rotation never depends on catching the tailer between polls. The POSIX path is the same open(path, "rb") as before.
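A share-delete open along these lines might look like the sketch below. The function name `open_shared_delete` is hypothetical, and the flag choices beyond FILE_SHARE_DELETE are assumptions rather than the PR's actual code; the Win32 constants are the documented values.

```python
import os
import sys


def open_shared_delete(path):
    """Open `path` for binary reading so another process can rename or delete it."""
    if sys.platform != "win32":
        # POSIX already allows renaming a file that is open elsewhere.
        return open(path, "rb")

    import ctypes
    import msvcrt
    from ctypes import wintypes

    GENERIC_READ = 0x80000000
    FILE_SHARE_READ, FILE_SHARE_WRITE, FILE_SHARE_DELETE = 0x1, 0x2, 0x4
    OPEN_EXISTING = 3
    FILE_ATTRIBUTE_NORMAL = 0x80
    INVALID_HANDLE_VALUE = wintypes.HANDLE(-1).value

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.CreateFileW.argtypes = [
        wintypes.LPCWSTR, wintypes.DWORD, wintypes.DWORD, wintypes.LPVOID,
        wintypes.DWORD, wintypes.DWORD, wintypes.HANDLE,
    ]
    kernel32.CreateFileW.restype = wintypes.HANDLE

    handle = kernel32.CreateFileW(
        os.fspath(path),
        GENERIC_READ,
        FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
        None,
        OPEN_EXISTING,
        FILE_ATTRIBUTE_NORMAL,
        None,
    )
    if handle == INVALID_HANDLE_VALUE:
        raise ctypes.WinError(ctypes.get_last_error())

    # Wrap the Win32 handle in a CRT file descriptor, then in a Python file object.
    fd = msvcrt.open_osfhandle(handle, os.O_RDONLY | os.O_BINARY)
    return os.fdopen(fd, "rb")
```

With a handle opened this way, the reader keeps following the renamed file until it notices the rotation and reopens the original path, which is the behaviour POSIX readers get by default.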

3. Concurrent recent-queries appends (1 failure)

The append relied on POSIX O_APPEND atomicity; Windows' CRT implements append as a separate seek-to-end + write, so 8 threads × 5 appends landed only 37 of 40 lines. Appends are now a single os.write on an O_APPEND fd serialized by a process-local lock, plus a best-effort cross-process byte-range lock (msvcrt.locking, non-blocking with ~50ms retry) shared with _compact_recent_queries, so a compaction's read-truncate-rewrite can't drop a concurrent process's append either. POSIX behavior is unchanged.
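The append path described above could be sketched as follows. All names here (`append_line`, `_lock_first_byte`) are hypothetical, and which byte range is locked is an assumption for illustration; the PR does not state which region it locks.

```python
import os
import sys
import threading
import time

_APPEND_LOCK = threading.Lock()          # serializes threads in this process
_O_BINARY = getattr(os, "O_BINARY", 0)   # avoids LF -> CRLF translation on Windows


def _lock_first_byte(fd, retries=20, delay=0.05):
    """Best-effort cross-process lock (Windows only); returns True if acquired."""
    if sys.platform != "win32":
        return False
    import msvcrt
    for _ in range(retries):
        try:
            os.lseek(fd, 0, os.SEEK_SET)            # locking starts at the current position
            msvcrt.locking(fd, msvcrt.LK_NBLCK, 1)  # non-blocking; raises OSError if held
            return True
        except OSError:
            time.sleep(delay)
    return False  # give up quietly: the lock is best-effort


def _unlock_first_byte(fd):
    import msvcrt
    os.lseek(fd, 0, os.SEEK_SET)
    msvcrt.locking(fd, msvcrt.LK_UNLCK, 1)


def append_line(path, line):
    """Append one line with a single os.write on an O_APPEND descriptor."""
    data = (line.rstrip("\n") + "\n").encode("utf-8")
    flags = os.O_WRONLY | os.O_APPEND | os.O_CREAT | _O_BINARY
    with _APPEND_LOCK:
        fd = os.open(path, flags, 0o644)
        try:
            locked = _lock_first_byte(fd)
            try:
                os.write(fd, data)  # O_APPEND sends the write to end-of-file
            finally:
                if locked:
                    _unlock_first_byte(fd)
        finally:
            os.close(fd)
```

A compaction routine would need to take the same byte-range lock before its read-truncate-rewrite for the cross-process protection to mean anything.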

4. Executable-bit test (1 failure)

test_cmd_init_hook_makes_executable is skipped on Windows, which has no POSIX execute bit.
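A platform skip of this kind is usually expressed with `pytest.mark.skipif`. The sketch below uses a hypothetical helper and test name rather than the project's real `test_cmd_init_hook_makes_executable`, whose body is not shown in this PR.

```python
import os
import stat
import sys

import pytest


def make_executable(path):
    """Add the user/group/other execute bits to an existing file (hypothetical helper)."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)


@pytest.mark.skipif(sys.platform == "win32", reason="Windows has no POSIX execute bit")
def test_hook_is_executable(tmp_path):
    hook = tmp_path / "pre-commit"
    hook.write_text("#!/bin/sh\nexit 0\n")
    make_executable(hook)
    assert os.stat(hook).st_mode & stat.S_IXUSR
```

Skipping is preferable to asserting a weaker condition on Windows, because `os.chmod` there only toggles the read-only flag and the execute bits cannot be set meaningfully.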

Support claims restored (issue checklist)

Items 1–4 fixed
windows-latest re-added to the test matrix in ci.yml (Python 3.12) — the run on this PR is the proof
docs/COMPATIBILITY.md Windows row restored to ✅ Full
docs/index.html operatingSystem restored to "Linux, macOS, Windows"

Also folds in the post-release landing-page touch-up (same file as the operatingSystem edit): v0.24.0 is marked as the latest release in the hero badge, timeline, and JSON-LD softwareVersion, mirroring what #224 did for v0.23.0.

Verification

Linux: full suite 750 passed, 17 skipped; black --check and ruff check clean; mypy introduces no new errors in the touched modules.
Windows: validated by this PR's own windows-latest CI leg (the whole point of re-gating it).

Fixes #186

https://claude.ai/code/session_01FkHXHcjpWZL2EWn4HGi547

Generated by Claude Code

Closes the four Windows failure classes from the cross-platform CI run (509 passed / 5 failed / 134 errors on windows-latest):

- chromadb teardown (~134 errors): GraphEmbedder.close() now stops the client's cached System and evicts it from chroma's per-path cache, releasing the sqlite/HNSW file handles Windows needs closed before a directory can be deleted. The previous close() deleted the collection, which released nothing and destroyed data. NeuralMind.close() added on top; conftest releases all cached Systems after every test and the temp-project fixtures ignore residual cleanup errors.
- event-log rotation (3 failures): the tailer's read handle is opened with FILE_SHARE_DELETE on Windows (CreateFileW), so a logrotate-style rename of the live log no longer throws PermissionError under a reader. POSIX path unchanged.
- concurrent appends (1 failure): recent-queries appends are a single O_APPEND write serialized by a process-local lock, plus a best-effort cross-process byte-range lock on Windows shared with compaction. POSIX behavior is unchanged (O_APPEND was already atomic).
- executable-bit test (1 failure): skipped on Windows, which has no POSIX execute bit.

windows-latest (Python 3.12) rejoins the gating matrix, COMPATIBILITY.md restores the Windows row to Full, and the landing page's schema.org operatingSystem claims Windows again (and marks v0.24.0 as the latest release now that it has shipped).

Fixes #186

https://claude.ai/code/session_01FkHXHcjpWZL2EWn4HGi547

github-actions · 2026-06-12T05:05:29Z

Backend parity gate — graphify vs built-in tree-sitter

✅ PASS — the built-in backend must stay within tolerance of graphify on the reference fixture.

Metric	graphify	built-in
code nodes	65	79
mean reduction	6.05×	6.66×
faithfulness delta	-0.047	+0.143
fact recall	0.527	0.717
grounding	0.889	1.000

Gate checks

✅ reduction within tolerance of graphify — built-in 6.66× ≥ 4.54× (graphify 6.05× − 25%)
✅ reduction ≥ absolute floor — built-in 6.66× ≥ floor 4.00×
✅ faithfulness delta within tolerance of graphify — built-in +0.143 ≥ -0.147 (graphify -0.047 − 0.10)
✅ faithfulness delta ≥ absolute floor — built-in +0.143 ≥ floor +0.000
✅ fact recall within tolerance of graphify — built-in 0.717 ≥ 0.427 (graphify 0.527 − 0.10)

Tolerances: reduction within 25% (floor 4.0×), faithfulness within 0.10 (floor +0.00). Override via NEURALMIND_PARITY_* env vars.
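The thresholds in the gate checks follow directly from the stated tolerances, which the sketch below reproduces. The function and dictionary key names are hypothetical and this is not the logic of evals/parity/run.py itself; only the numbers come from the report above.

```python
def parity_checks(graphify, builtin, rel_tol=0.25, abs_tol=0.10,
                  reduction_floor=4.0, faithfulness_floor=0.0):
    """Return a name -> bool map mirroring the five gate checks in the report."""
    return {
        "reduction within tolerance":
            builtin["reduction"] >= graphify["reduction"] * (1 - rel_tol),
        "reduction >= floor":
            builtin["reduction"] >= reduction_floor,
        "faithfulness within tolerance":
            builtin["faithfulness_delta"] >= graphify["faithfulness_delta"] - abs_tol,
        "faithfulness >= floor":
            builtin["faithfulness_delta"] >= faithfulness_floor,
        "fact recall within tolerance":
            builtin["fact_recall"] >= graphify["fact_recall"] - abs_tol,
    }


# Values copied from the metric table in this comment.
graphify = {"reduction": 6.05, "faithfulness_delta": -0.047, "fact_recall": 0.527}
builtin = {"reduction": 6.66, "faithfulness_delta": 0.143, "fact_recall": 0.717}

checks = parity_checks(graphify, builtin)
print(all(checks.values()))
```

The relative threshold works out to 6.05 × 0.75 ≈ 4.54, and the two absolute ones to -0.047 − 0.10 = -0.147 and 0.527 − 0.10 = 0.427, matching the figures printed in the gate checks.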

Automated by evals/parity/run.py — reproduce locally with python -m evals.parity.run.

Multi-language structural parity

Language	graphify symbols	built-in covers	dangling
typescript	54	54 (100%)	0
go	45	45 (100%)	0

✅ typescript: symbol coverage ≥ floor — 54/54 graphify symbols (100%) ≥ 90%
✅ typescript: no dangling edges — 0 dangling edge(s)
✅ go: symbol coverage ≥ floor — 45/45 graphify symbols (100%) ≥ 90%
✅ go: no dangling edges — 0 dangling edge(s)

Coverage floor: 90% of graphify's per-language symbols (no gold-fact set exists for TS/Go, so parity is structural).

Optional SCIP precision pass

✅ precision: SCIP corrects the heuristic call edge — run() → A.handle under SCIP (heuristic wrongly linked B.handle)
✅ precision: strict no-op when disabled — graph unchanged when NEURALMIND_PRECISION is unset

Off by default (NEURALMIND_PRECISION); proven on tests/fixtures/scip_precision to replace a heuristic call edge with the compiler-accurate one a SCIP index resolves.

github-actions · 2026-06-12T05:06:50Z

NeuralMind self-benchmark

Status: PASS — floor 4×, measured 6.2×.

Phase 1 — Reduction on committed fixture

Average reduction: 6.2×
Top-k retrieval hit rate: 71.7%
Naive baseline: 47,360 tokens (all fixture files concatenated)
NeuralMind total: 7,706 tokens across 10 queries
Estimated monthly savings @ 100 queries/day on Claude 3.5 Sonnet: ~$35.69

#	Query	Shape	Naive	NeuralMind	Ratio	Hit
1	`auth-flow`	cross-file	4,736	773	6.1×	33.3%
2	`api-endpoints`	focused	4,736	758	6.2×	100.0%
3	`billing-flow`	cross-file	4,736	774	6.1×	33.3%
4	`user-storage`	cross-file	4,736	651	7.3×	50.0%
5	`jwt-verify`	focused	4,736	669	7.1×	100.0%
6	`stripe-webhook`	focused	4,736	801	5.9×	100.0%
7	`create-user`	cross-file	4,736	771	6.1×	50.0%
8	`refund`	focused	4,736	760	6.2×	100.0%
9	`db-choice`	identity	4,736	854	5.5×	100.0%
10	`invoice-send`	cross-file	4,736	895	5.3×	50.0%
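The headline Phase 1 figures can be recomputed from the table. The sketch below assumes the average reduction is the mean of the per-query ratios and that "monthly" means 30 days; both are inferences that happen to reproduce the reported numbers, not documented behaviour of the benchmark.

```python
NAIVE_TOKENS_PER_QUERY = 4736
NEURALMIND_TOKENS = [773, 758, 774, 651, 669, 801, 771, 760, 854, 895]
HIT_RATES = [33.3, 100.0, 33.3, 50.0, 100.0, 100.0, 50.0, 100.0, 100.0, 50.0]

# Mean of the per-query ratios (not total naive / total NeuralMind).
ratios = [NAIVE_TOKENS_PER_QUERY / t for t in NEURALMIND_TOKENS]
mean_ratio = sum(ratios) / len(ratios)

total_neuralmind = sum(NEURALMIND_TOKENS)
mean_hit_rate = sum(HIT_RATES) / len(HIT_RATES)

# Savings: tokens avoided per query x 100 queries/day x 30 days x $3.0 per MTok.
saved_per_query = NAIVE_TOKENS_PER_QUERY - total_neuralmind / len(NEURALMIND_TOKENS)
monthly_savings = saved_per_query * 100 * 30 * 3.0 / 1_000_000

print(round(mean_ratio, 1), total_neuralmind, round(mean_hit_rate, 1), round(monthly_savings, 2))
```

Dividing the totals instead (47,360 / 7,706) gives roughly 6.15×, so the reported 6.2× is consistent with averaging the per-query ratios.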

Phase 2 — Learning uplift

Memory events logged: 20
Learned patterns: 16
Reduction ratio after neuralmind learn: 6.1× (Δ -0.08× vs. cold)
Top-k hit rate after learning: 75.0% (Δ +3.3 points vs. cold)

Note: uplift numbers on a 500-line fixture are intentionally modest — the point is to
verify the learning mechanism persists and applies. On real production repos the lift
is larger; this test only catches regressions in persistence.

Phase 3 — Synapse recall A/B (same warm graph, recall off vs on)

Synapse edges after seeding co-editing sessions: 2834
Top-k hit rate: 71.7% off → 83.3% on (Δ +11.7 points)
Reduction ratio: 6.2× off → 6.2× on (Δ -0.07× — budget-neutral by design)

This isolates the Hebbian synapse layer from the learned_patterns reranker in
Phase 2. The hit-rate delta shows associative recall surfacing co-edited modules a
purely textual search ranks lower; the near-zero reduction delta confirms it does so
without spending extra tokens (recalled nodes displace the weakest hits, not add to them).

Assumptions

Baseline: every .py file in tests/fixtures/sample_project/ concatenated.
Tokenizer: tiktoken GPT-4o encoding (per-model breakdown in multi_model.json if generated).
Pricing: Claude 3.5 Sonnet input @ $3.0/MTok.
Regression floor: 4× — well below NeuralMind's typical 40–70× on real repos.

Per-model token reduction

Model	Tokenizer	Naive	NeuralMind	Ratio	Source
GPT-4o / GPT-4o-mini	`tiktoken o200k_base`	4,739	779	6.1×	measured
GPT-4 / GPT-3.5-turbo	`tiktoken cl100k_base`	4,710	770	6.1×	measured
Claude 3.5 Sonnet	`estimated: GPT-4o × 1.08` (install `anthropic` for an exact count)	5,118	841	6.1×	estimated
Llama 3 (70B)	`estimated: GPT-4o × 1.22 — Llama tokenizer requires model weights; estimate based on published vocab ratios`	5,781	950	6.1×	estimated

Rows marked measured use the provider's real tokenizer. Rows marked
estimated apply a published vocab-size correction to the GPT-4o count —
honest approximations, not hardcoded claims.

NeuralMind retrieval-quality eval

Suite	Queries	MRR	Answerability	Recall@5	Precision@5	Gate
`go`	10	0.950	100%	0.833	0.603	PASS
`python`	10	0.950	100%	0.833	0.678	PASS
`typescript`	10	0.900	100%	0.800	0.562	PASS

go vs baseline:

mrr: 0.950 (= +0.000)
answerability: 1.000 (= +0.000)
recall@1: 0.617 (= -0.000)
recall@3: 0.833 (= +0.000)
recall@5: 0.833 (= +0.000)

python vs baseline:

mrr: 0.950 (▲ +0.050)
answerability: 1.000 (= +0.000)
recall@1: 0.617 (▲ +0.100)
recall@3: 0.833 (= +0.000)
recall@5: 0.833 (= +0.000)

typescript vs baseline:

mrr: 0.900 (= +0.000)
answerability: 1.000 (= +0.000)
recall@1: 0.583 (= +0.000)
recall@3: 0.800 (= +0.000)
recall@5: 0.800 (= +0.000)

Overall: PASS
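For readers unfamiliar with the columns, the sketch below shows the textbook definitions of MRR, Recall@k and Precision@k. It assumes the eval uses these standard forms, which the report does not confirm, and the two sample queries are made up for illustration.

```python
def reciprocal_rank(ranked, relevant):
    """1/rank of the first relevant item, or 0.0 if none was retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def precision_at_k(ranked, relevant, k):
    """Fraction of the top k results that are relevant."""
    return len(set(ranked[:k]) & set(relevant)) / k


# Hypothetical queries: (ranked results, set of relevant items).
queries = [
    (["auth.py", "jwt.py", "db.py"], {"auth.py", "jwt.py"}),
    (["db.py", "billing.py", "auth.py"], {"billing.py"}),
]

mrr = sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)
print(mrr)  # (1/1 + 1/2) / 2 = 0.75
```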

Automated by .github/workflows/ci-benchmark.yml — regenerate locally with python -m tests.benchmark.run and neuralmind benchmark --quality.

os.kill(pid, 0) is the POSIX "does this process exist" idiom, but on Windows signal.CTRL_C_EVENT == 0, so the call delivers a real Ctrl-C to the probed pid's console process group. In the test suite the discovery file records pytest's own pid, so the probe interrupted the whole run with a KeyboardInterrupt; for users it could interrupt any console the daemon shares. Probe via OpenProcess/GetExitCodeProcess instead on Windows; POSIX path unchanged. https://claude.ai/code/session_01FkHXHcjpWZL2EWn4HGi547
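A liveness probe along the lines this commit describes might look like the sketch below. `pid_alive` is a hypothetical name, and the access right requested and the handling of access-denied errors are assumptions, not details taken from the commit.

```python
import os
import sys


def pid_alive(pid: int) -> bool:
    """Return True if a process with this pid appears to exist."""
    if sys.platform != "win32":
        try:
            os.kill(pid, 0)  # signal 0: existence check only, nothing is delivered
        except ProcessLookupError:
            return False
        except PermissionError:
            return True  # the process exists but belongs to another user
        return True

    import ctypes
    from ctypes import wintypes

    PROCESS_QUERY_LIMITED_INFORMATION = 0x1000
    STILL_ACTIVE = 259
    ERROR_ACCESS_DENIED = 5

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.OpenProcess.argtypes = [wintypes.DWORD, wintypes.BOOL, wintypes.DWORD]
    kernel32.OpenProcess.restype = wintypes.HANDLE
    kernel32.GetExitCodeProcess.argtypes = [wintypes.HANDLE, ctypes.POINTER(wintypes.DWORD)]
    kernel32.GetExitCodeProcess.restype = wintypes.BOOL
    kernel32.CloseHandle.argtypes = [wintypes.HANDLE]
    kernel32.CloseHandle.restype = wintypes.BOOL

    handle = kernel32.OpenProcess(PROCESS_QUERY_LIMITED_INFORMATION, False, pid)
    if not handle:
        # Access denied means the process exists but cannot be queried.
        return ctypes.get_last_error() == ERROR_ACCESS_DENIED
    try:
        exit_code = wintypes.DWORD()
        if not kernel32.GetExitCodeProcess(handle, ctypes.byref(exit_code)):
            return False
        return exit_code.value == STILL_ACTIVE
    finally:
        kernel32.CloseHandle(handle)
```

One caveat of the exit-code approach is that a process which genuinely exited with code 259 is indistinguishable from a running one; for a daemon-discovery probe that ambiguity is usually acceptable.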

github-actions Bot added the bug, documentation, enhancement, and question labels Jun 12, 2026

dfrostar marked this pull request as ready for review June 12, 2026 05:23

Copilot AI review requested due to automatic review settings June 12, 2026 05:23

dfrostar merged commit bd3daad into main Jun 12, 2026
18 checks passed

Copilot started reviewing on behalf of dfrostar June 12, 2026 05:23

github-actions Bot mentioned this pull request Jun 12, 2026

chore(main): release 0.25.0 #229

Merged

dfrostar review requested due to automatic review settings June 12, 2026 05:43

dfrostar merged 2 commits into main from claude/awesome-wozniak-uhq13c
