
Add --clean_local_dir option to free disk space between queries#253

Draft
jihoonson wants to merge 1 commit into NVIDIA:dev from jihoonson:clean-local-dir

Conversation

@jihoonson
Collaborator

Spark's block manager shuffle files can accumulate and fill the disk before a full power run completes. This adds a --clean_local_dir flag that deletes shuffle files from Spark's block manager local directories after each query, while preserving the subdirectory structure that DiskBlockManager expects. This change was manually tested to verify that:

  • The intermediate shuffle files are cleaned up after each query.
  • The next query runs with no error.
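The flag name --clean_local_dir comes from this PR; as a minimal sketch of how such an opt-in flag is typically wired up with argparse (the parser setup below is an illustrative assumption, not the PR's actual code):

```python
import argparse

# Hypothetical parser setup; only the --clean_local_dir flag name is from the PR.
parser = argparse.ArgumentParser(description='NDS power run')
parser.add_argument('--clean_local_dir', action='store_true',
                    help='Delete shuffle files from the block manager local '
                         'directories after each query (local mode only).')

# store_true makes the flag opt-in: absent means False, present means True.
args = parser.parse_args(['--clean_local_dir'])
print(args.clean_local_dir)  # True
```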

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Jihoon Son <ghoonson@gmail.com>
greptile-apps Bot commented Apr 6, 2026

Greptile Summary

This PR adds an opt-in --clean_local_dir flag to both nds/nds_power.py and nds-h/nds_h_power.py that removes files from Spark's DiskBlockManager local directories after each query to prevent disk exhaustion at large scale factors. The implementation is correct for its intended local/single-node use case, and all remaining findings are minor.

  • import shutil is added in both files but never used — clean_block_manager_dirs relies only on os.walk, os.path.getsize, and os.remove.
  • os.path.getsize and os.remove are called without error handling; a race with Spark's own deferred cleanup could raise FileNotFoundError and abort the full run.
  • In cluster mode (YARN, Kubernetes) the cleanup silently does nothing since executor shuffle files are on remote nodes; a documentation note would clarify the scope.

Confidence Score: 5/5

Safe to merge — the flag is opt-in, the logic is correct for local mode, and all findings are non-blocking P2 suggestions.

All three issues (unused import, missing try/except, cluster-mode caveat) are style or hardening P2 items that do not block merge.

Both nds/nds_power.py and nds-h/nds_h_power.py carry the same unused shutil import; nds/nds_power.py lines 352-355 and nds-h/nds_h_power.py lines 234-237 lack error handling.

Important Files Changed

Filename | Overview
nds/nds_power.py | Adds clean_block_manager_dirs helper and --clean_local_dir flag; shutil import is unused; no error handling on file deletion
nds-h/nds_h_power.py | Mirrors nds_power.py changes: adds clean_block_manager_dirs and --clean_local_dir; same shutil/error-handling issues

Sequence Diagram

sequenceDiagram
    participant CLI as CLI (--clean_local_dir)
    participant RQS as run_query_stream()
    participant ROQ as run_one_query()
    participant Spark as Spark Engine
    participant CBD as clean_block_manager_dirs()
    participant FS as Local Filesystem

    CLI->>RQS: clean_local_dir=True
    loop For each query
        RQS->>ROQ: execute query
        ROQ->>Spark: spark.sql(query)
        Spark-->>ROQ: DataFrame result
        ROQ-->>RQS: q_report
        RQS->>CBD: clean_block_manager_dirs(spark_session)
        CBD->>Spark: diskBlockManager().localDirs()
        Spark-->>CBD: local_dirs[]
        CBD->>FS: os.remove(shuffle files)
        FS-->>CBD: freed bytes
        CBD-->>RQS: print freed GB
    end

Reviews (1): Last reviewed commit: "Add --clean_local_dir option to free dis..."

Comment thread nds/nds_power.py
import csv
import os
import re
import shutil

P2 Unused shutil import

shutil is imported but never called — clean_block_manager_dirs uses only os.walk, os.path.getsize, and os.remove. This import can be removed.

Suggested change
-import shutil

Comment thread nds-h/nds_h_power.py

import argparse
import csv
import shutil

P2 Unused shutil import

shutil is imported but never called — clean_block_manager_dirs uses only os.walk, os.path.getsize, and os.remove. This import can be removed.

Suggested change
-import shutil

Comment thread nds/nds_power.py
Comment on lines +352 to +355
for f in fns:
    fp = os.path.join(dp, f)
    total_freed += os.path.getsize(fp)
    os.remove(fp)

P2 No error handling on file size + delete

If Spark's own deferred cleanup removes a file between the os.path.getsize(fp) and os.remove(fp) calls, an unhandled FileNotFoundError will propagate and abort the entire benchmark run. Wrapping the pair in try/except OSError keeps the run alive if a file disappears mid-cleanup.

Suggested change
 for f in fns:
     fp = os.path.join(dp, f)
-    total_freed += os.path.getsize(fp)
-    os.remove(fp)
+    try:
+        total_freed += os.path.getsize(fp)
+        os.remove(fp)
+    except OSError:
+        pass
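As a standalone sanity check of the reasoning above (not part of the PR): FileNotFoundError is a subclass of OSError, so a file that vanishes before the getsize/remove pair is caught by except OSError and the run continues:

```python
import os
import tempfile

# Simulate Spark's deferred cleanup winning the race: the file is already
# gone before our getsize/remove pair runs.
fd, path = tempfile.mkstemp()
os.close(fd)
os.remove(path)

freed = 0
try:
    freed += os.path.getsize(path)  # raises FileNotFoundError (an OSError)
    os.remove(path)
except OSError:
    pass  # file already gone; keep the benchmark run alive

print(freed)  # 0 — the vanished file contributes nothing
```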

Comment thread nds-h/nds_h_power.py
Comment on lines +234 to +237
for f in fns:
    fp = os.path.join(dp, f)
    total_freed += os.path.getsize(fp)
    os.remove(fp)

P2 No error handling on file size + delete

Same race as nds/nds_power.py: if Spark concurrently removes a file between os.path.getsize and os.remove, the unhandled FileNotFoundError aborts the benchmark. A try/except OSError around the pair makes cleanup resilient.

Suggested change
 for f in fns:
     fp = os.path.join(dp, f)
-    total_freed += os.path.getsize(fp)
-    os.remove(fp)
+    try:
+        total_freed += os.path.getsize(fp)
+        os.remove(fp)
+    except OSError:
+        pass

Comment thread nds/nds_power.py
Comment on lines +335 to +357
def clean_block_manager_dirs(spark_session):
    """Remove shuffle/block data from Spark's block manager local directories.

    Uses Spark's internal DiskBlockManager to get the exact directories owned by
    this application, then deletes files while preserving the directory structure
    (Spark's DiskBlockManager pre-creates a hash-based subdirectory pool and
    expects those subdirectories to exist).
    """
    jsc = spark_session.sparkContext._jsc
    JArray = spark_session.sparkContext._gateway.jvm.java.lang.reflect.Array
    local_dirs = jsc.sc().env().blockManager().diskBlockManager().localDirs()
    total_freed = 0
    for i in range(JArray.getLength(local_dirs)):
        dir_path = JArray.get(local_dirs, i).getAbsolutePath()
        if not os.path.isdir(dir_path):
            continue
        for dp, _, fns in os.walk(dir_path):
            for f in fns:
                fp = os.path.join(dp, f)
                total_freed += os.path.getsize(fp)
                os.remove(fp)
    if total_freed > 0:
        print(f"Cleaned block manager local dirs, freed {total_freed / (1024**3):.2f} GB")

P2 Driver-only cleanup silently no-ops in cluster mode

os.walk / os.remove execute on the driver process. In cluster deployments (YARN, Kubernetes, Spark Standalone), shuffle files live on executor nodes' local filesystems — diskBlockManager().localDirs() on the driver returns directories that hold no shuffle data in those topologies, so cleanup silently does nothing. Adding a note to the docstring or --help text that this flag is only effective in local/single-node mode would prevent confusion.
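One way to make that scope explicit in code, beyond a docstring note, is to guard the call on the master URL. This is a hedged sketch: sparkContext.master is a real PySpark attribute, but the guard itself is hypothetical and not part of the PR:

```python
def is_local_master(master_url):
    """Return True for Spark master URLs like 'local', 'local[4]', 'local[*]'.

    In local mode the executors run inside the driver process, so driver-side
    os.remove actually reaches the shuffle files; under yarn / k8s:// /
    spark:// masters the executors are remote and the cleanup would no-op.
    """
    return master_url == "local" or master_url.startswith("local[")

# Hypothetical usage inside the power-run script:
#   if args.clean_local_dir and is_local_master(spark.sparkContext.master):
#       clean_block_manager_dirs(spark)
#   elif args.clean_local_dir:
#       print("--clean_local_dir only frees space in local mode; skipping")
print(is_local_master("local[*]"))        # True
print(is_local_master("yarn"))            # False
print(is_local_master("spark://h:7077"))  # False
```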

@jihoonson jihoonson marked this pull request as draft April 6, 2026 23:35