crosslink integrity hydration --repair deletes SQLite rows when the JSON files do not have them #602

@vringar

Setup caveat

I originally noticed this on a workstation where only parts of the common
git dir are mounted into the hub worktree, which produces split-brain states
in .crosslink/.hub-cache/. The mechanism that put SQLite ahead of the JSON
files in that case was almost certainly an interaction between my custom
mount setup and crosslink's hub-cache, not a defect in crosslink itself.

This report is therefore a structural observation, not triage of an
upstream-caused incident: the --repair operation's clear-then-rebuild
semantics drop any SQLite row that lacks a JSON counterpart, regardless of
how the divergence was reached. Filing because the structural shape is
worth flagging on its own; the fitness-for-purpose framing should be
crosslink's call, not mine.

Summary

crosslink integrity hydration --repair clears SQLite and rehydrates from
the on-disk JSON files. The command does not inspect whether SQLite holds
rows that the JSON does not; any such rows are dropped when SQLite is
cleared, and the rehydration cannot restore them because they are not in
the source it reads from. The command emits no warning before it runs and
keeps no record of what was deleted.

Offending code

check_hydration — the repair branch calls db.clear_shared_data() and then
hydrate_to_sqlite(&cache_dir, db):

fn check_hydration(crosslink_dir: &Path, db: &Database, repair: bool) -> Result<CheckResult> {
    let cache_dir = crosslink_dir.join(HUB_CACHE_DIR);
    if !cache_dir.exists() {
        return Ok(CheckResult {
            name: "hydration".to_string(),
            status: CheckStatus::Skipped("sync not configured".to_string()),
        });
    }
    let issues_dir = cache_dir.join("issues");
    let json_issues = read_all_issue_files(&issues_dir)?;
    let json_issue_count = json_issues
        .iter()
        .filter(|i| i.display_id.is_some())
        .count() as i64;
    let db_issue_count = db.get_issue_count()?;
    // Count milestones: per-file first, fall back to legacy single-file
    let milestones_dir = cache_dir.join("meta").join("milestones");
    let json_milestone_entries = read_all_milestone_files(&milestones_dir)?;
    let json_milestone_count = if json_milestone_entries.is_empty() {
        let legacy_path = cache_dir.join("meta").join("milestones.json");
        let legacy = read_milestones_file(&legacy_path)?;
        legacy.milestones.len() as i64
    } else {
        json_milestone_entries.len() as i64
    };
    let db_milestone_count = db.get_milestone_count()?;
    let issues_ok = json_issue_count == db_issue_count;
    let milestones_ok = json_milestone_count == db_milestone_count;
    if issues_ok && milestones_ok {
        return Ok(CheckResult {
            name: "hydration".to_string(),
            status: CheckStatus::Pass,
        });
    }
    let mut issues = Vec::new();
    if !issues_ok {
        issues.push(format!(
            "{json_issue_count} issues in JSON, {db_issue_count} in SQLite"
        ));
    }
    if !milestones_ok {
        issues.push(format!(
            "{json_milestone_count} milestones in JSON, {db_milestone_count} in SQLite"
        ));
    }
    let details = issues.join("; ");
    if !repair {
        return Ok(CheckResult {
            name: "hydration".to_string(),
            status: CheckStatus::Fail(details),
        });
    }
    db.clear_shared_data()?;
    let stats = hydrate_to_sqlite(&cache_dir, db)?;
    Ok(CheckResult {
        name: "hydration".to_string(),
        status: CheckStatus::Repaired(format!(
            "re-hydrated {} issues, {} comments",
            stats.issues, stats.comments
        )),
    })
}

The specific clear-then-rehydrate pair is lines 237-238:

db.clear_shared_data()?;
let stats = hydrate_to_sqlite(&cache_dir, db)?;

The drift-detection logic immediately above (lines 207-215) compares only
counts of issues and milestones; a content-level disagreement (SQLite has
issue #24 blocked by #23, JSON does not) does not register as drift, so the
"pass" branch fires and the repair never runs. Conversely, when counts do
disagree the repair runs unconditionally — without inspecting whether SQLite
or JSON is the side with the extra rows.
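
To illustrate the gap between count-level and content-level comparison, here is a minimal sketch that is not taken from crosslink. The `Edge` alias, the `Drift` struct, and both functions are hypothetical names; the real schema and row types will differ. It assumes each store's blocker relations can be loaded into a set of (issue_id, blocker_id) pairs.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for one blocker relation: (issue_id, blocker_id).
pub type Edge = (i64, i64);

/// Directional drift between the two stores (illustrative type, not crosslink's).
#[derive(Debug, PartialEq, Eq)]
pub struct Drift {
    /// Rows present in SQLite but absent from JSON; clear-then-rebuild loses these.
    pub sqlite_only: Vec<Edge>,
    /// Rows present in JSON but absent from SQLite; rehydration restores these.
    pub json_only: Vec<Edge>,
}

/// Row-level comparison: reports which side holds the extra rows.
pub fn diff_rows(json: &BTreeSet<Edge>, sqlite: &BTreeSet<Edge>) -> Drift {
    Drift {
        sqlite_only: sqlite.difference(json).copied().collect(),
        json_only: json.difference(sqlite).copied().collect(),
    }
}

/// Count-only comparison, mirroring the shape of the check in the excerpt above.
pub fn counts_match(json: &BTreeSet<Edge>, sqlite: &BTreeSet<Edge>) -> bool {
    json.len() == sqlite.len()
}
```

With one row on each side that the other lacks, `counts_match` returns true while `diff_rows` reports one entry in each direction. That is the case a count comparison cannot see.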

Reproducer

The destructive structure can be exercised by constructing SQLite-only state
directly, since I don't have a public-CLI path that produces it on current
main:

crosslink issue create "first"  -q                # → L1
crosslink issue create "second" -q                # → L2
# Insert a block directly into the SQLite store, bypassing the JSON write:
sqlite3 .crosslink/issues.db \
  "INSERT INTO blockers (issue_id, blocker_id) VALUES (2, 1);"

crosslink issue show L2 | grep -i blocked          # → Blocked by: L1
crosslink integrity hydration --repair             # → "re-hydrated 2 issues, 0 comments"
crosslink issue show L2 | grep -i blocked          # → Blocked by: (none)

The repair runs without emitting a warning about the row it deleted, and the
deleted row is unrecoverable from crosslink's own state.

Observed behavior

  • The repair is asymmetric: JSON wins over SQLite. The command name doesn't
    communicate that asymmetry.
  • The drift-detection step only compares counts, so content drift isn't
    surfaced when running integrity hydration without --repair. There's no
    built-in way to inspect what the repair would do before running it.
  • There is no audit trail of deletions. After --repair, nothing records
    what was removed.
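
As a sketch of what an audit trail could look like, the following in-memory model of clear-then-rebuild returns the rows it is about to drop, so a caller could log or print them. The function name and the string row keys are hypothetical; this does not reflect how crosslink stores rows.

```rust
use std::collections::BTreeSet;

/// Hypothetical in-memory model of clear-then-rebuild.
/// Returns the rows that existed only in `sqlite` and are therefore dropped.
pub fn clear_then_rebuild(
    sqlite: &mut BTreeSet<String>,
    json: &BTreeSet<String>,
) -> Vec<String> {
    // Record what would be lost before anything is cleared.
    let dropped: Vec<String> = sqlite.difference(json).cloned().collect();
    // Same shape as the repair branch: clear, then rebuild from JSON only.
    sqlite.clear();
    sqlite.extend(json.iter().cloned());
    dropped
}
```

Running the same computation without the clear and rebuild steps would serve as a dry run.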

Suggested fixes

If the maintainers decide this is worth changing, in priority order:

  1. Detect drift by content, not just count. Compare JSON-derived state to
    SQLite-derived state at the row level. If SQLite has rows that JSON does
    not, surface that as a distinct condition ("SQLite has unrepresented
    state"), separate from the symmetric count-mismatch case.

  2. Default --repair to non-destructive. When SQLite contains rows that
    JSON does not, refuse to run without an explicit --accept-data-loss (or
    similar) flag. The current behavior deletes by default.

  3. Re-emit in the other direction when possible. When SQLite has rows
    JSON doesn't, write them back to the JSON files (and the git log) instead
    of deleting them.

  4. Snapshot SQLite before clearing. Even when destructive repair is what
    the user wants, dropping the previous state to
    .crosslink/integrity/hydration-backup-<ts>.sqlite makes deletions
    recoverable.
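
Fixes 2 and 4 could be combined into a single decision step ahead of the clear. The sketch below is one possible shape under stated assumptions: `RepairPlan`, `plan_repair`, and the `accept_data_loss` parameter are hypothetical and do not exist in crosslink today. The number of SQLite-only rows is assumed to come from a content-level diff.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical outcome of the pre-repair check.
#[derive(Debug, PartialEq, Eq)]
pub enum RepairPlan {
    /// Nothing in SQLite would be lost; clear-then-rebuild is safe.
    Proceed,
    /// SQLite-only rows exist and the user opted in; snapshot to this path first.
    ProceedWithSnapshot(PathBuf),
    /// SQLite-only rows exist and the user did not opt in.
    Refuse { rows_at_risk: usize },
}

/// Decide whether a destructive repair may run.
/// `unix_ts` is passed in so the backup file name is deterministic in tests.
pub fn plan_repair(
    crosslink_dir: &Path,
    sqlite_only_rows: usize,
    accept_data_loss: bool,
    unix_ts: u64,
) -> RepairPlan {
    if sqlite_only_rows == 0 {
        return RepairPlan::Proceed;
    }
    if !accept_data_loss {
        return RepairPlan::Refuse { rows_at_risk: sqlite_only_rows };
    }
    // Backup location follows the path proposed in fix 4.
    let backup = crosslink_dir
        .join("integrity")
        .join(format!("hydration-backup-{unix_ts}.sqlite"));
    RepairPlan::ProceedWithSnapshot(backup)
}
```

The caller would copy the database file to the returned path before calling db.clear_shared_data(), and would report `rows_at_risk` when the plan is `Refuse`.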
