🐛 OCPBUGS-78787: fix(operator-controller): clean up orphaned temp dirs in catalog cache#2574

Open
tmshort wants to merge 1 commit into operator-framework:main from tmshort:fix-OCPBUGS-78787-operator-controller

tmshort commented Mar 18, 2026

filesystemCache.writeFS creates a temp dir (.{catalog}-{random}) and renames it into place atomically. If the process is interrupted before the rename, the temp dir persists. Each restart adds another, eventually filling the disk.

Additionally, writeFS had no defer os.RemoveAll(tmpDir), so any error during WalkMetasReader or the rename step also left the temp dir behind — no process kill required.

Two fixes:

  • Add defer os.RemoveAll(tmpDir) so errors during normal operation clean up.
  • Add removeOrphanedTempDirs, called at the start of writeFS (under the write mutex), to clean up dirs orphaned by a previous process run. This bounds worst-case accumulation to one orphaned dir per catalog regardless of restart rate.

Description

Reviewer Checklist

  • API Go Documentation
  • Tests: Unit Tests (and E2E Tests, if appropriate)
  • Comprehensive Commit Messages
  • Links to related GitHub Issue(s)

…in catalog cache

filesystemCache.writeFS creates a temp dir (.{catalog}-{random}) and renames
it into place atomically. If the process is interrupted before the rename, the
temp dir persists. Each restart adds another, eventually filling the disk.

Additionally, writeFS had no defer os.RemoveAll(tmpDir), so any error during
WalkMetasReader or the rename step also left the temp dir behind — no process
kill required.

Two fixes:
- Add defer os.RemoveAll(tmpDir) so errors during normal operation clean up.
- Add removeOrphanedTempDirs, called at the start of writeFS (under the write
  mutex), to clean up dirs orphaned by a previous process run. This bounds
  worst-case accumulation to one orphaned dir per catalog regardless of
  restart rate.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Todd Short <tshort@redhat.com>
Copilot AI review requested due to automatic review settings March 18, 2026 14:12
@openshift-ci openshift-ci bot requested review from joelanford and oceanc80 March 18, 2026 14:12

openshift-ci bot commented Mar 18, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tmshort for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


netlify bot commented Mar 18, 2026

Deploy Preview for olmv1 ready!

| Name | Link |
|---|---|
| 🔨 Latest commit | 06896bb |
| 🔍 Latest deploy log | https://app.netlify.com/projects/olmv1/deploys/69bab2c8997dd400081adb5b |
| 😎 Deploy Preview | https://deploy-preview-2574--olmv1.netlify.app |

tmshort commented Mar 18, 2026

Related to #2537 which fixed catalogd issues.

Copilot AI left a comment

Pull request overview

This pull request adds cleanup for orphaned temporary directories in the filesystem cache implementation. These orphaned directories can be left behind if a write operation is interrupted (e.g., pod eviction or crash) before the temporary staging directory is renamed to the final cache location. The changes improve reliability by automatically cleaning up these dangling directories when a new write operation begins.

Changes:

  • Added removeOrphanedTempDirs() method to scan and remove temporary directories with the catalog-specific prefix pattern that were left behind by interrupted writes
  • Integrated orphaned directory cleanup into the writeFS() method to run before creating a new temporary directory
  • Added a defer statement to ensure temporary directories are cleaned up if the write operation fails
  • Added comprehensive test coverage for the orphaned directory cleanup functionality

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
|---|---|
| internal/operator-controller/catalogmetadata/cache/cache.go | Added orphaned temp directory cleanup with removeOrphanedTempDirs() method and integrated it into writeFS() flow |
| internal/operator-controller/catalogmetadata/cache/cache_test.go | Added TestFilesystemCachePutCleansOrphanedTempDirs() test to verify orphaned directories are cleaned up while preserving directories for other catalogs |

codecov bot commented Mar 18, 2026

Codecov Report

❌ Patch coverage is 46.66667% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.58%. Comparing base (f7a8220) to head (06896bb).
⚠️ Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...operator-controller/catalogmetadata/cache/cache.go | 46.66% | 4 Missing and 4 partials ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2574      +/-   ##
==========================================
+ Coverage   63.42%   68.58%   +5.16%     
==========================================
  Files         131      131              
  Lines        9333     9348      +15     
==========================================
+ Hits         5919     6411     +492     
+ Misses       2939     2442     -497     
- Partials      475      495      +20     
| Flag | Coverage Δ |
|---|---|
| e2e | 39.02% <40.00%> (+<0.01%) ⬆️ |
| experimental-e2e | 51.57% <40.00%> (?) |
| unit | 53.82% <46.66%> (-0.02%) ⬇️ |
