34 changes: 34 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,40 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0



+## [1.0.160] - 2026-06-02
+
+### Fixed
+
+- **`evaluate_csharp_rebuild` no longer holds `config.write()` during git/fs I/O** —
+the bootstrap timestamp computation (git subprocess + ≤10 000-entry filesystem
+walk) previously ran while holding the `config` write-lock, blocking every
+concurrent `config.read()` caller for the full scan duration. The lock is now
+acquired only for the brief config update; the slow work runs with no lock held.
+- **`evaluate_csharp_rebuild` offloaded to `spawn_blocking` in phase 2** — even
+after the lock fix, the function ran synchronously on a Tokio worker thread.
+Wrapped in `spawn_blocking` at the call site in `run_phase_2_csharp_scip` so
+the async runtime stays responsive while processing all C# candidates.
+- **`build_index()` in warmup and add-repo background task now use `spawn_blocking`** —
+two sites called the CPU-heavy HNSW `build_index()` directly on async threads.
+Both now follow the established pattern (`spawn_blocking` + `blocking_write()`).
+- **`reload_if_changed` uses `safe_canonicalize`** — replaces the raw
+`std::fs::canonicalize` that could leave Windows `\\?\` UNC prefixes on the
+config path, causing path comparisons to silently fail.
+- **Accurate doc-comments on `relocate_missing` / `prune_stale`** — both
+methods perform disk I/O (filesystem traversal, git subprocess) and should
+be called via `spawn_blocking` in async contexts; the comments now say so.
+- **`ensure_hnsw_index_if_needed` extracted and tested** — the safety-net HNSW
+rebuild logic (detects and repairs a DB with chunks but no index, e.g. after
+cancellation) is now a named `pub(crate)` function with 3 unit tests
+(unindexed-with-chunks rebuilds, already-indexed is idempotent, empty DB skips).
+- **`metadata.json` schema consistency** — the normal index path now writes
+`"partial": false` so readers always find the field regardless of whether
+indexing completed or was cancelled.
+- **Cancellation finalisation is best-effort** — metadata write, FileMetaStore
+save, and stats read in the cancel path now log-and-continue on failure
+instead of propagating `Err`, so the partial chunks remain searchable even
+if any recovery step fails.
+
## [1.0.156] - 2026-06-02

### Fixed
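
The first changelog entry above describes narrowing the `config` write-lock so that the slow scan runs with no lock held. A minimal sketch of that ordering follows. It makes several assumptions: it uses `std::sync::RwLock` as a stand-in for whatever lock type the project actually uses, and `Config`, `compute_bootstrap_timestamp`, and `evaluate_rebuild` are hypothetical names invented for illustration, not the project's real API.

```rust
use std::sync::RwLock;
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for the real config type.
#[derive(Debug, Default)]
pub struct Config {
    pub bootstrap_timestamp: Option<u64>,
}

/// Stand-in for the slow git subprocess + filesystem walk.
/// It takes no lock, so readers of the config are never blocked by it.
pub fn compute_bootstrap_timestamp() -> u64 {
    thread::sleep(Duration::from_millis(10)); // simulated I/O latency
    1_750_000_000 // placeholder value for the sketch
}

/// Do the slow work first with no lock held, then take the write-lock
/// only for the short in-memory update.
pub fn evaluate_rebuild(config: &RwLock<Config>) -> u64 {
    let ts = compute_bootstrap_timestamp(); // no lock held here

    // The guard lives only for the assignment below and is dropped
    // when the function returns.
    let mut guard = config.write().expect("config lock poisoned");
    guard.bootstrap_timestamp = Some(ts);
    ts
}
```

The point of the ordering is that concurrent `read()` callers wait only for the single assignment, not for the whole scan.
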
2 changes: 1 addition & 1 deletion Cargo.lock

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "codesearch"
-version = "1.0.156"
+version = "1.0.160"
edition = "2021"
authors = ["codesearch contributors"]
license = "Apache-2.0"
21 changes: 16 additions & 5 deletions src/db_discovery/repos.rs
@@ -401,10 +401,16 @@ impl ReposConfig {
///
/// For each missing path a best-effort git-identity relocation is attempted
/// ([`Self::try_relocate`]); successful matches rewrite the in-memory
-/// `repos` map. This is pure (no disk I/O, no logging) so callers can decide
-/// how to report and persist. Returns `(relocated, unresolved)` where
-/// `relocated` is the list of `(alias, new_path)` rewrites and `unresolved`
-/// is the list of aliases whose path is still missing.
+/// `repos` map.
+///
+/// **Note:** this method performs disk I/O (filesystem traversal, git
+/// subprocess) and should not be called while holding an async lock or from
+/// an async task without `spawn_blocking`. No logging is emitted — callers
+/// are responsible for reporting results.
+///
+/// Returns `(relocated, unresolved)` where `relocated` is the list of
+/// `(alias, new_path)` rewrites and `unresolved` is the list of aliases
+/// whose path is still missing.
#[must_use]
pub fn relocate_missing(&mut self) -> (Vec<(String, PathBuf)>, Vec<String>) {
let aliases: Vec<String> = self.repos.keys().cloned().collect();
@@ -431,7 +437,12 @@ impl ReposConfig {
}

/// Prune stale entries: relocate what can be relocated, then unregister the
-/// rest. Pure (no disk I/O, no logging). Returns `(relocated, removed)`.
+/// rest.
+///
+/// **Note:** this method performs disk I/O (filesystem traversal, git
+/// subprocess) via [`Self::relocate_missing`]. No logging is emitted.
+///
+/// Returns `(relocated, removed)`.
#[must_use]
pub fn prune_stale(&mut self) -> (Vec<(String, PathBuf)>, Vec<String>) {
let (relocated, unresolved) = self.relocate_missing();
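
The corrected doc-comments above say that these `&mut self` methods do blocking disk I/O and should be moved off async threads. One way to do that with a mutable value is to move it into the worker and hand it back together with the result. The sketch below shows only that ownership pattern. It is an assumption-laden illustration: `std::thread::spawn` stands in for Tokio's `spawn_blocking` so the example needs no external crates, and `RepoTable`, `drop_missing`, and `drop_missing_off_thread` are hypothetical names, not the project's `ReposConfig` API.

```rust
use std::collections::BTreeMap;
use std::path::PathBuf;
use std::thread::{self, JoinHandle};

/// Hypothetical stand-in for a repo registry keyed by alias.
#[derive(Debug, Default)]
pub struct RepoTable {
    pub repos: BTreeMap<String, PathBuf>,
}

impl RepoTable {
    /// Blocking: touches the filesystem once per entry and removes
    /// every alias whose path no longer exists. Returns the removed aliases.
    pub fn drop_missing(&mut self) -> Vec<String> {
        let missing: Vec<String> = self
            .repos
            .iter()
            .filter(|(_, path)| !path.exists())
            .map(|(alias, _)| alias.clone())
            .collect();
        for alias in &missing {
            self.repos.remove(alias);
        }
        missing
    }
}

/// Move the table into a worker thread, run the blocking call there,
/// and return both the updated table and the result through the handle.
/// (With Tokio, the handle would be awaited instead of joined.)
pub fn drop_missing_off_thread(mut table: RepoTable) -> JoinHandle<(RepoTable, Vec<String>)> {
    thread::spawn(move || {
        let removed = table.drop_missing();
        (table, removed)
    })
}
```

Returning the handle rather than joining inside the helper leaves the caller free to decide when to wait for the result.
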
254 changes: 205 additions & 49 deletions src/index/mod.rs
@@ -23,6 +23,31 @@ pub use manager::{
is_database_locked, CSharpRebuildNotifier, IndexManager, IndexingStatusCallback, SharedStores,
};

+/// Ensure the HNSW vector index is built if it was never built in a previous
+/// (possibly cancelled) run.
+///
+/// Returns `true` if the index was rebuilt, `false` if nothing needed to be
+/// done (already indexed, or no chunks present). Returns an error only if the
+/// database cannot be opened or `build_index` fails; a failure reading stats is
+/// treated as a warning and returns `Ok(false)`.
+pub(crate) fn ensure_hnsw_index_if_needed(
+db_path: &Path,
+dimensions: usize,
+) -> anyhow::Result<bool> {
+let mut vs = VectorStore::new(db_path, dimensions)?;
+match vs.stats() {
+Ok(s) if s.total_chunks > 0 && !s.indexed => {
+vs.build_index()?;
+Ok(true)
+}
+Ok(_) => Ok(false),
+Err(e) => {
+tracing::warn!("could not check vector index status: {}", e);
+Ok(false)
+}
+}
+}
+
/// Update metadata.json with current chunk/file counts so that `status(projects)`
/// can report accurate numbers without opening LMDB.
pub(crate) fn update_metadata_stats(db_path: &Path, total_chunks: usize, total_files: usize) {
@@ -631,28 +656,20 @@ async fn index_with_options(
// Safety net: if a previous run was cancelled/interrupted mid-way,
// the HNSW vector index may never have been built. Detect this and
// rebuild now so the database is usable without requiring --force.
-{
-let mut vs = VectorStore::new(&db_path, model_type.dimensions())?;
-match vs.stats() {
-Ok(s) if s.total_chunks > 0 && !s.indexed => {
-log_print!(
-"\n{}",
-format!(
-"🔨 Vector index not built ({} chunks found from previous run). Rebuilding...",
-s.total_chunks
-)
+match ensure_hnsw_index_if_needed(&db_path, model_type.dimensions()) {
+Ok(true) => {
+log_print!(
+"\n{}",
+"🔨 Vector index not built from previous run — rebuilt successfully."
.yellow()
-);
-vs.build_index()?;
-log_print!("{}", "✅ Vector index rebuilt successfully!".green());
-}
-Ok(_) => {} // already indexed or no chunks — all good
-Err(e) => {
-log_print!(
-"{}",
-format!("⚠️ Could not check vector index status: {}", e).yellow()
-);
-}
+);
}
+Ok(false) => {} // already indexed or no chunks — all good
+Err(e) => {
+log_print!(
+"{}",
+format!("⚠️ Could not rebuild vector index: {}", e).yellow()
+);
+}
}

@@ -957,47 +974,83 @@ async fn index_with_options(
log_print!(" ✅ Vector index built");
}

-// Save metadata
-std::fs::write(
-db_path.join("metadata.json"),
-serde_json::to_string_pretty(&serde_json::json!({
-"model_short_name": model_type.short_name(),
-"model_name": model_type.name(),
-"dimensions": model_type.dimensions(),
-"indexed_at": chrono::Utc::now().to_rfc3339(),
-"partial": true,
-}))?,
-)?;
-
-// Update FileMetaStore with the files that were actually processed
+// Save metadata — best-effort: log and continue on failure so the
+// partial chunks we already built are still searchable.
+let metadata_json = serde_json::to_string_pretty(&serde_json::json!({
+"model_short_name": model_type.short_name(),
+"model_name": model_type.name(),
+"dimensions": model_type.dimensions(),
+"indexed_at": chrono::Utc::now().to_rfc3339(),
+"partial": true,
+}));
+match metadata_json {
+Ok(json) => {
+if let Err(e) = std::fs::write(db_path.join("metadata.json"), json) {
+log_print!("{} metadata.json write warning: {}", "⚠️ ".yellow(), e);
+}
+}
+Err(e) => {
+log_print!(
+"{} metadata.json serialise warning: {}",
+"⚠️ ".yellow(),
+e
+);
+}
+}
+
+// Update FileMetaStore with the files that were actually processed.
+// Also best-effort: a failed save means the next incremental run will
+// re-process those files, which is acceptable for a cancelled index.
if !file_chunks.is_empty() {
-if is_incremental {
+let save_result = if is_incremental {
let mut meta = file_meta_store.take().unwrap();
for (file_path, chunk_ids) in file_chunks {
-meta.update_file(Path::new(&file_path), chunk_ids)?;
+if let Err(e) = meta.update_file(Path::new(&file_path), chunk_ids) {
+log_print!(
+"{} file-meta update warning for '{}': {}",
+"⚠️ ".yellow(),
+file_path,
+e
+);
+}
}
-meta.save(&db_path)?;
+meta.save(&db_path)
} else {
let mut meta = FileMetaStore::new(
model_type.short_name().to_string(),
model_type.dimensions(),
);
for (file_path, chunk_ids) in file_chunks {
-meta.update_file(Path::new(&file_path), chunk_ids)?;
+if let Err(e) = meta.update_file(Path::new(&file_path), chunk_ids) {
+log_print!(
+"{} file-meta update warning for '{}': {}",
+"⚠️ ".yellow(),
+file_path,
+e
+);
+}
}
-meta.save(&db_path)?;
+meta.save(&db_path)
+};
+if let Err(e) = save_result {
+log_print!("{} file-meta save warning: {}", "⚠️ ".yellow(), e);
}
}

-// Persist stats
-let db_stats = store.stats()?;
-update_metadata_stats(&db_path, db_stats.total_chunks, db_stats.total_files);
-
-log_print!(
-" Partial index finalised: {} chunks, {} files",
-db_stats.total_chunks,
-db_stats.total_files
-);
+// Persist stats — best-effort, only for display; failures are non-fatal.
+match store.stats() {
+Ok(db_stats) => {
+update_metadata_stats(&db_path, db_stats.total_chunks, db_stats.total_files);
+log_print!(
+" Partial index finalised: {} chunks, {} files",
+db_stats.total_chunks,
+db_stats.total_files
+);
+}
+Err(e) => {
+log_print!("{} Could not read final stats: {}", "⚠️ ".yellow(), e);
+}
+}
log_print!(
"{} Run {} to index the remaining files",
"💡 ".cyan(),
@@ -1088,12 +1141,15 @@ async fn index_with_options(
store.build_index()?;
let _storage_duration = storage_start.elapsed();

-// Save model metadata
+// Save model metadata. `partial: false` is explicit so the schema matches
+// the cancel path (which writes `partial: true`); readers can always check
+// the field regardless of how indexing completed.
let metadata = serde_json::json!({
"model_short_name": model_short_name,
"model_name": model_name,
"dimensions": model_dimensions,
"indexed_at": chrono::Utc::now().to_rfc3339(),
+"partial": false,
});
std::fs::write(
db_path.join("metadata.json"),
@@ -2311,3 +2367,103 @@ mod serve_probe_tests {
);
}
}
+
+#[cfg(test)]
+mod index_quality_tests {
+use super::ensure_hnsw_index_if_needed;
+use crate::chunker::{Chunk, ChunkKind};
+use crate::embed::EmbeddedChunk;
+use crate::vectordb::VectorStore;
+use tempfile::tempdir;
+
+/// Helper: create a minimal EmbeddedChunk with `dims`-dimensional embedding.
+fn fake_chunk(path: &str, dims: usize) -> EmbeddedChunk {
+EmbeddedChunk::new(
+Chunk::new(
+"fn dummy() {}".to_string(),
+0,
+1,
+ChunkKind::Function,
+path.to_string(),
+),
+vec![1.0_f32 / (dims as f32).sqrt(); dims],
+)
+}
+
+/// `ensure_hnsw_index_if_needed` returns `true` and leaves the DB indexed
+/// when there are chunks but the HNSW index was never built (simulates a
+/// prior cancellation that finished inserting chunks but never called
+/// `build_index`).
+#[test]
+fn rebuilds_unindexed_db_with_chunks() {
+let dir = tempdir().unwrap();
+let db_path = dir.path().join("test.db");
+const DIMS: usize = 4;
+
+// Insert a chunk without calling build_index — simulate cancelled run.
+{
+let mut vs = VectorStore::new(&db_path, DIMS).unwrap();
+vs.insert_chunks(vec![fake_chunk("foo.rs", DIMS)]).unwrap();
+// Deliberately do NOT call vs.build_index()
+let s = vs.stats().unwrap();
+assert!(s.total_chunks > 0, "precondition: DB has chunks");
+assert!(!s.indexed, "precondition: index not yet built");
+}
+
+// Safety-net must detect and rebuild the index.
+let rebuilt = ensure_hnsw_index_if_needed(&db_path, DIMS)
+.expect("ensure_hnsw_index_if_needed should not error");
+assert!(
+rebuilt,
+"expected the function to report it rebuilt the index"
+);
+
+// Verify: DB is now indexed.
+let vs = VectorStore::new(&db_path, DIMS).unwrap();
+assert!(
+vs.is_indexed(),
+"VectorStore must be indexed after ensure_hnsw_index_if_needed"
+);
+}
+
+/// `ensure_hnsw_index_if_needed` returns `false` (no rebuild) when the DB
+/// is already indexed — repeated calls must be idempotent.
+#[test]
+fn no_rebuild_when_already_indexed() {
+let dir = tempdir().unwrap();
+let db_path = dir.path().join("test.db");
+const DIMS: usize = 4;
+
+{
+let mut vs = VectorStore::new(&db_path, DIMS).unwrap();
+vs.insert_chunks(vec![fake_chunk("bar.rs", DIMS)]).unwrap();
+vs.build_index().unwrap();
+assert!(vs.is_indexed(), "precondition: already indexed");
+}
+
+let rebuilt = ensure_hnsw_index_if_needed(&db_path, DIMS)
+.expect("should succeed on already-indexed DB");
+assert!(
+!rebuilt,
+"should not report a rebuild on an already-indexed DB"
+);
+}
+
+/// `ensure_hnsw_index_if_needed` returns `false` (no rebuild) for an empty
+/// DB (no chunks, nothing to index).
+#[test]
+fn no_rebuild_for_empty_db() {
+let dir = tempdir().unwrap();
+let db_path = dir.path().join("test.db");
+const DIMS: usize = 4;
+
+// Empty DB — no chunks inserted.
+{
+let _vs = VectorStore::new(&db_path, DIMS).unwrap();
+}
+
+let rebuilt =
+ensure_hnsw_index_if_needed(&db_path, DIMS).expect("should succeed on empty DB");
+assert!(!rebuilt, "empty DB needs no rebuild");
+}
+}
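
The three unit tests above, plus the stats-error arm of `ensure_hnsw_index_if_needed`, cover four outcomes. The decision itself can be sketched without a database. This is an illustration only: `Stats` and `needs_rebuild` are hypothetical names that mirror the `total_chunks` and `indexed` fields used in the diff, and the error type is simplified to `String`.

```rust
/// Hypothetical mirror of the two stats fields the safety net inspects.
#[derive(Debug, Clone, Copy)]
pub struct Stats {
    pub total_chunks: usize,
    pub indexed: bool,
}

/// Rebuild only when chunks exist and no index has been built.
/// A failure to read stats is treated as "nothing to do", matching the
/// warn-and-return-`Ok(false)` arm in the diff.
pub fn needs_rebuild(stats: Result<Stats, String>) -> bool {
    match stats {
        Ok(s) => s.total_chunks > 0 && !s.indexed,
        Err(_) => false,
    }
}
```
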
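
The cancel-path hunk in `src/index/mod.rs` replaces `?` propagation with log-and-continue for each recovery step. A compact sketch of that pattern is below. It is not the project's code: `warn_on_err` and `finalise_cancelled` are hypothetical helpers, step outcomes are passed in as plain `Result` values to keep the example free of I/O, and warnings are collected in a `Vec` where the real code prints them with `log_print!`.

```rust
/// Record a warning for a failed step and keep going.
/// Returns the success value, if any, so later code can still use it.
pub fn warn_on_err<T>(
    step: &str,
    result: Result<T, String>,
    warnings: &mut Vec<String>,
) -> Option<T> {
    match result {
        Ok(value) => Some(value),
        Err(e) => {
            warnings.push(format!("{step} warning: {e}"));
            None
        }
    }
}

/// Best-effort finalisation: every step is attempted and no failure
/// stops the steps after it. Stats are `(total_chunks, total_files)`.
pub fn finalise_cancelled(
    write_metadata: Result<(), String>,
    save_file_meta: Result<(), String>,
    read_stats: Result<(usize, usize), String>,
) -> (Option<(usize, usize)>, Vec<String>) {
    let mut warnings = Vec::new();
    warn_on_err("metadata.json write", write_metadata, &mut warnings);
    warn_on_err("file-meta save", save_file_meta, &mut warnings);
    let stats = warn_on_err("final stats", read_stats, &mut warnings);
    (stats, warnings)
}
```

In this sketch a failed metadata write still lets the file-meta save and the stats read run, which is the behaviour the changelog entry describes for partial indexes.
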