fix(brain-repo): harden backup integrity — 4 silent data-loss fixes#98
fix(brain-repo): harden backup integrity — 4 silent data-loss fixes#98mt-alarcon wants to merge 1 commit into
Conversation
The Brain Repo mirror could silently drop files from the backup while the
sync-age healthcheck stayed green. Four independent root causes, each fixed
with regression tests (69 tests total):
1. secrets_scanner false positives deleted legitimate files. The mirror
removes any file matching a secret pattern before commit; over-broad
patterns matched variable names (password), recaptcha base64, image URLs,
ALL_CAPS binding-name references (SERVICE_API_TOKEN_PROD), ${VAR} refs and
doc placeholders (YOUR_PASSWORD). Added lookbehind/lookahead constraints +
a central _is_false_positive filter. Regression both ways: FPs cleared,
real secrets (DATABASE_URL, mixed-case passwords, Fernet, JWT) still flagged.
2. Nested .gitignore mirrored into the brain repo hid content. A source
.gitignore with a wildcard (e.g. a scratch _state/ dir) was copied into the
mirror, which then excluded that content from the backup. The mirror now
skips .gitignore files from watched paths.
3. Dropped sync tick. When a sync job was already running, enqueue_sync
returned False and the tick was discarded with no guaranteed retry — batch
writes during a busy window never got mirrored. Added trailing-run
coalescing: N enqueues during one job => exactly one trailing sync after it
finishes (lock-protected, no infinite loop).
4. Orphaned DB lock after restart-during-sync. A process killed mid-sync
leaves BrainRepoConfig.sync_in_progress=True; since _acquire_db_lock does
UPDATE WHERE sync_in_progress=0, every future enqueue fails and auto-sync
stays dead for up to JOB_STALE_SECONDS (20 min) until the janitor reclaims.
Added reclaim_orphaned_locks_on_startup() (no age gate — at boot no sync can
legitimately be running). Sibling of git_ops._clear_stale_lock.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Reviewer's GuideHardens the brain repo backup pipeline against silent data loss by coalescing dropped sync ticks into a guaranteed trailing run, reclaiming orphaned DB locks at startup, preventing nested .gitignore files from hiding content, and tightening the secrets scanner to avoid false positives while preserving true-positive detection, all backed by new regression tests. Sequence diagram for trailing-run coalescing in enqueue_sync and run_sync_pipelinesequenceDiagram
actor Caller
participant enqueue_sync
participant _acquire_db_lock
participant run_sync_pipeline
participant _release_db_lock
Caller->>enqueue_sync: enqueue_sync(flask_app, user_id, workspace, kind)
enqueue_sync->>_acquire_db_lock: _acquire_db_lock(flask_app, user_id, kind)
alt lock acquired
_acquire_db_lock-->>enqueue_sync: True
enqueue_sync->>run_sync_pipeline: start thread run_sync_pipeline(...)
enqueue_sync-->>Caller: True
else job already running
_acquire_db_lock-->>enqueue_sync: False
enqueue_sync->>enqueue_sync: set _rerun_requested[user_id] = True
enqueue_sync-->>Caller: False
end
Note over run_sync_pipeline: Existing job finishes
run_sync_pipeline->>_release_db_lock: _release_db_lock(flask_app, user_id, success, error)
_release_db_lock-->>run_sync_pipeline: lock cleared
run_sync_pipeline->>run_sync_pipeline: rerun = _rerun_requested.pop(user_id, False)
alt rerun is True
run_sync_pipeline->>run_sync_pipeline: start trailing run thread
else no rerun
run_sync_pipeline->>run_sync_pipeline: exit
end
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey - I've found 3 issues, and left some high level feedback:
- In
app.py, thetry/except Exception: passaroundreclaim_orphaned_locks_on_startup(app)silently swallows startup errors; consider at least logging the exception so failures in the lock-reclaim path are visible in logs. - In
reclaim_orphaned_locks_on_startup, you currently load allBrainRepoConfigrows withsync_in_progress=Trueand loop to update them; if this table can grow, consider a single bulk update (e.g.updatewithsynchronize_session=False) to avoid holding all rows in memory.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `app.py`, the `try/except Exception: pass` around `reclaim_orphaned_locks_on_startup(app)` silently swallows startup errors; consider at least logging the exception so failures in the lock-reclaim path are visible in logs.
- In `reclaim_orphaned_locks_on_startup`, you currently load all `BrainRepoConfig` rows with `sync_in_progress=True` and loop to update them; if this table can grow, consider a single bulk update (e.g. `update` with `synchronize_session=False`) to avoid holding all rows in memory.
## Individual Comments
### Comment 1
<location path="dashboard/backend/brain_repo/secrets_scanner.py" line_range="134-135" />
<code_context>
for name, regex in compiled:
m = regex.search(line)
if m:
+ if _is_false_positive(m.group(0)):
</code_context>
<issue_to_address>
**🚨 issue (security):** Single `search` per line can drop real secrets when the first match is a false positive.
Because we only do a single `regex.search(line)` per pattern, if that first match is treated as a false positive we skip the rest of the line and may miss a real secret later in the same line.
Consider iterating all matches with `for m in regex.finditer(line):` and running `_is_false_positive` inside that loop, adding findings for each non–false-positive match. This ensures multiple secrets on one line (or secrets after placeholders) are all detected.
</issue_to_address>
### Comment 2
<location path="dashboard/backend/brain_repo/tests/test_job_runner_trailing_run.py" line_range="16" />
<code_context>
+from unittest.mock import MagicMock, call, patch
+
+
+class TestTrailingRunCoalescing(unittest.TestCase):
+ """Testa o flag rerun_requested e o trailing run automático."""
+
</code_context>
<issue_to_address>
**issue (testing):** No tests cover the new `reclaim_orphaned_locks_on_startup` behavior described in the PR
The PR calls out `reclaim_orphaned_locks_on_startup()` as a key fix, but its behavior isn’t tested. To validate the regression and prevent it from returning, please add a test that:
- Inserts a `BrainRepoConfig` row with `sync_in_progress=True` (and fields like `sync_started_at`, `sync_job_kind`, `cancel_requested` set).
- Calls `reclaim_orphaned_locks_on_startup(flask_app)` and asserts:
- The returned count matches the number of locked rows.
- The row is updated to `sync_in_progress=False`, `sync_started_at=None`, `sync_job_kind=None`, `cancel_requested=False`.
- `last_error` is set to the expected marker string.
- Optionally, also assert that with no locked rows, it’s a no‑op that returns 0.
</issue_to_address>
### Comment 3
<location path="dashboard/backend/brain_repo/tests/test_job_runner_trailing_run.py" line_range="31-40" />
<code_context>
+ # Helpers
+ # ─────────────────────────────────────────────────────────
+
+ def _make_pipeline_that_blocks(self, event_start, event_unblock):
+ """Retorna um patch de run_sync_pipeline que bloqueia até event_unblock."""
+ def fake_pipeline(flask_app, user_id, workspace, *, kind, tag_name=None, commit_message=None):
+ # Sinaliza que começou e aguarda liberação do teste.
+ event_start.set()
+ event_unblock.wait(timeout=5)
+ # Libera o DB lock como o pipeline real faria.
+ self.jr._release_db_lock(flask_app, user_id, success=True, error=None)
+
+ return fake_pipeline
+
+ def _dummy_flask_app(self):
</code_context>
<issue_to_address>
**suggestion (testing):** Helper `_make_pipeline_that_blocks` is currently unused and can be removed or used to strengthen integration-style coverage
Right now it’s dead code in this test class, which can be confusing and suggest a missing test. I’d suggest either removing it, or using it to add an integration-style test that:
- runs `run_sync_pipeline` in a background thread with this blocking helper,
- calls `enqueue_sync` while it’s running, and
- asserts that exactly one trailing run is created when the first job finishes.
Clarifying this will keep the tests clearer and easier to maintain.
Suggested implementation:
```python
def _dummy_flask_app(self):
"""Flask app mínimo que satisfaz _acquire_db_lock / _release_db_lock."""
app = MagicMock()
# Simula que sync_in_progress começa False — primeira acquire retorna True.
locked = {"value": False}
def fake_app_context():
class Ctx:
def __enter__(self): return self
def __exit__(self, *a): pass
return Ctx()
app.app_context.side_effect = fake_app_context
def fake_query_first():
class Row:
sync_in_progress = locked["value"]
return Row()
def fake_update_sync(in_progress):
locked["value"] = in_progress
app.db.session.execute.side_effect = lambda *a, **k: None
app.db.session.commit.side_effect = lambda: None
app.db.session.query.return_value.filter_by.return_value.first.side_effect = fake_query_first
app.db.session.query.return_value.filter_by.return_value.update.side_effect = (
lambda values: fake_update_sync(values["sync_in_progress"])
)
return app
def test_enqueue_sync_cria_apenas_um_trailing_run_com_pipeline_bloqueante(self):
"""
Usa _make_pipeline_that_blocks para simular uma execução longa de run_sync_pipeline.
Enquanto a primeira execução está em progresso, chamamos enqueue_sync novamente e
verificamos que exatamente um trailing run extra é disparado após o término da primeira.
"""
app = self._dummy_flask_app()
user_id = "user-id"
workspace = "workspace"
event_start = threading.Event()
event_unblock = threading.Event()
fake_pipeline = self._make_pipeline_that_blocks(event_start, event_unblock)
# Patcha run_sync_pipeline para usar o pipeline bloqueante.
with patch.object(self.jr, "run_sync_pipeline", side_effect=fake_pipeline) as mocked_pipeline:
# Inicia a primeira execução em background.
def _run_first():
self.jr.enqueue_sync(app, user_id, workspace, kind="manual")
t = threading.Thread(target=_run_first)
t.start()
# Aguarda o início da primeira execução.
assert event_start.wait(timeout=2), "Primeira execução do pipeline não iniciou a tempo"
# Enquanto a primeira execução está rodando, agenda uma nova sync.
self.jr.enqueue_sync(app, user_id, workspace, kind="manual")
# Libera a execução bloqueante e aguarda término.
event_unblock.set()
t.join(timeout=5)
# Dá uma pequena margem para qualquer trailing run disparar.
time.sleep(0.1)
# Deve haver exatamente duas execuções do pipeline:
# - a execução original
# - exatamente um trailing run.
assert mocked_pipeline.call_count == 2
```
Para que o novo teste funcione, serão necessárias mais duas pequenas atualizações no arquivo:
1. Garantir que os imports incluam as dependências usadas pelo teste:
- Adicionar `import threading` e `import time`.
- Garantir que `patch` está importado: `from unittest.mock import MagicMock, patch` (ou adicionar `patch` à linha existente onde `MagicMock` já é importado).
2. Se a implementação atual de `_dummy_flask_app` já estiver completa em outra parte do arquivo, remova qualquer duplicação da lógica interna que eu reescrevi aqui e mantenha apenas uma versão consistente do helper (a lógica que simula `sync_in_progress` e o contexto de app deve permanecer equivalente ao que os demais testes já utilizam).
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| for name, regex in compiled: | ||
| m = regex.search(line) |
There was a problem hiding this comment.
🚨 issue (security): Single search per line can drop real secrets when the first match is a false positive.
Because we only do a single regex.search(line) per pattern, if that first match is treated as a false positive we skip the rest of the line and may miss a real secret later in the same line.
Consider iterating all matches with for m in regex.finditer(line): and running _is_false_positive inside that loop, adding findings for each non–false-positive match. This ensures multiple secrets on one line (or secrets after placeholders) are all detected.
| from unittest.mock import MagicMock, call, patch | ||
|
|
||
|
|
||
| class TestTrailingRunCoalescing(unittest.TestCase): |
There was a problem hiding this comment.
issue (testing): No tests cover the new reclaim_orphaned_locks_on_startup behavior described in the PR
The PR calls out reclaim_orphaned_locks_on_startup() as a key fix, but its behavior isn’t tested. To validate the regression and prevent it from returning, please add a test that:
- Inserts a
BrainRepoConfigrow withsync_in_progress=True(and fields likesync_started_at,sync_job_kind,cancel_requestedset). - Calls
reclaim_orphaned_locks_on_startup(flask_app)and asserts:- The returned count matches the number of locked rows.
- The row is updated to
sync_in_progress=False,sync_started_at=None,sync_job_kind=None,cancel_requested=False. last_erroris set to the expected marker string.
- Optionally, also assert that with no locked rows, it’s a no‑op that returns 0.
| def _make_pipeline_that_blocks(self, event_start, event_unblock): | ||
| """Retorna um patch de run_sync_pipeline que bloqueia até event_unblock.""" | ||
| def fake_pipeline(flask_app, user_id, workspace, *, kind, tag_name=None, commit_message=None): | ||
| # Sinaliza que começou e aguarda liberação do teste. | ||
| event_start.set() | ||
| event_unblock.wait(timeout=5) | ||
| # Libera o DB lock como o pipeline real faria. | ||
| self.jr._release_db_lock(flask_app, user_id, success=True, error=None) | ||
|
|
||
| return fake_pipeline |
There was a problem hiding this comment.
suggestion (testing): Helper _make_pipeline_that_blocks is currently unused and can be removed or used to strengthen integration-style coverage
Right now it’s dead code in this test class, which can be confusing and suggest a missing test. I’d suggest either removing it, or using it to add an integration-style test that:
- runs
run_sync_pipelinein a background thread with this blocking helper, - calls
enqueue_syncwhile it’s running, and - asserts that exactly one trailing run is created when the first job finishes.
Clarifying this will keep the tests clearer and easier to maintain.
Suggested implementation:
def _dummy_flask_app(self):
"""Flask app mínimo que satisfaz _acquire_db_lock / _release_db_lock."""
app = MagicMock()
# Simula que sync_in_progress começa False — primeira acquire retorna True.
locked = {"value": False}
def fake_app_context():
class Ctx:
def __enter__(self): return self
def __exit__(self, *a): pass
return Ctx()
app.app_context.side_effect = fake_app_context
def fake_query_first():
class Row:
sync_in_progress = locked["value"]
return Row()
def fake_update_sync(in_progress):
locked["value"] = in_progress
app.db.session.execute.side_effect = lambda *a, **k: None
app.db.session.commit.side_effect = lambda: None
app.db.session.query.return_value.filter_by.return_value.first.side_effect = fake_query_first
app.db.session.query.return_value.filter_by.return_value.update.side_effect = (
lambda values: fake_update_sync(values["sync_in_progress"])
)
return app
def test_enqueue_sync_cria_apenas_um_trailing_run_com_pipeline_bloqueante(self):
"""
Usa _make_pipeline_that_blocks para simular uma execução longa de run_sync_pipeline.
Enquanto a primeira execução está em progresso, chamamos enqueue_sync novamente e
verificamos que exatamente um trailing run extra é disparado após o término da primeira.
"""
app = self._dummy_flask_app()
user_id = "user-id"
workspace = "workspace"
event_start = threading.Event()
event_unblock = threading.Event()
fake_pipeline = self._make_pipeline_that_blocks(event_start, event_unblock)
# Patcha run_sync_pipeline para usar o pipeline bloqueante.
with patch.object(self.jr, "run_sync_pipeline", side_effect=fake_pipeline) as mocked_pipeline:
# Inicia a primeira execução em background.
def _run_first():
self.jr.enqueue_sync(app, user_id, workspace, kind="manual")
t = threading.Thread(target=_run_first)
t.start()
# Aguarda o início da primeira execução.
assert event_start.wait(timeout=2), "Primeira execução do pipeline não iniciou a tempo"
# Enquanto a primeira execução está rodando, agenda uma nova sync.
self.jr.enqueue_sync(app, user_id, workspace, kind="manual")
# Libera a execução bloqueante e aguarda término.
event_unblock.set()
t.join(timeout=5)
# Dá uma pequena margem para qualquer trailing run disparar.
time.sleep(0.1)
# Deve haver exatamente duas execuções do pipeline:
# - a execução original
# - exatamente um trailing run.
assert mocked_pipeline.call_count == 2Para que o novo teste funcione, serão necessárias mais duas pequenas atualizações no arquivo:
-
Garantir que os imports incluam as dependências usadas pelo teste:
- Adicionar
import threadingeimport time. - Garantir que
patchestá importado:from unittest.mock import MagicMock, patch(ou adicionarpatchà linha existente ondeMagicMockjá é importado).
- Adicionar
-
Se a implementação atual de
_dummy_flask_appjá estiver completa em outra parte do arquivo, remova qualquer duplicação da lógica interna que eu reescrevi aqui e mantenha apenas uma versão consistente do helper (a lógica que simulasync_in_progresse o contexto de app deve permanecer equivalente ao que os demais testes já utilizam).
Problem
The Brain Repo mirror could silently drop files from the backup while the sync-age healthcheck stayed green — the failure is invisible because an excluded/deleted file doesn't dirty
git status. A disk-vs-backup sweep found ~13% of files missing across four independent root causes.Fixes (each with regression tests — 69 tests total)
1.
secrets_scannerfalse positives deleted legitimate filesThe mirror removes any file matching a secret pattern before commit. Over-broad patterns matched non-secrets: variable names (
password), recaptcha base64 blobs, image URLs, ALL_CAPS binding-name references (SERVICE_API_TOKEN_PROD),${VAR}interpolations and doc placeholders (YOUR_PASSWORD). Added lookbehind/lookahead constraints + a central_is_false_positivefilter. Regression both ways: FPs cleared, real secrets (DATABASE_URL with password, mixed-case passwords, Fernet keys, JWTs) still flagged.2. Nested
.gitignoremirrored into the brain repo hid contentA source
.gitignorewith a wildcard rule (e.g. a scratch state dir) was copied into the mirror, which then excluded that content from the backup. The mirror now skips.gitignorefiles from watched paths (the brain repo's own root.gitignoreis unaffected).3. Dropped sync tick
When a sync job was already running,
enqueue_syncreturnedFalseand the tick was discarded with no guaranteed retry — batch writes during a busy window never got mirrored. Added trailing-run coalescing: N enqueues during one job → exactly one trailing sync after it finishes (lock-protected, no infinite loop).4. Orphaned DB lock after restart-during-sync
A process killed mid-sync leaves
BrainRepoConfig.sync_in_progress=True. Since_acquire_db_lockdoesUPDATE WHERE sync_in_progress=0, every future enqueue fails and auto-sync stays dead for up toJOB_STALE_SECONDS(20 min) until the janitor reclaims. Addedreclaim_orphaned_locks_on_startup()(no age gate — at boot no sync can legitimately be running). Sibling of the existinggit_ops._clear_stale_lockfor the.git/index.lockcase.Tests
pytest dashboard/backend/brain_repo/tests/→ 69 passed. Covers: secrets_scanner FP/TP both directions, nested-.gitignoreskip, trailing-run coalescing (exactly-one, no-loop, idle-unchanged), and the startup orphan-lock reclaim path.🤖 Generated with Claude Code
Summary by Sourcery
Harden brain repo backups against silent data loss by improving secret scanning accuracy, avoiding nested .gitignore interference, ensuring queued syncs are not dropped, and reclaiming orphaned sync locks on startup.
Bug Fixes:
Tests: