Problem
Running bulk catalog generation via scripts/build_catalog.py against local radar data fills disk with redundant copies. A partial reprocessing run accumulated 729GB in ./radar_cache/ before running out of disk space.
The root cause is that OPRConnection._open_file() unconditionally prepends fsspec's filecache:: prefix to every URL when cache_dir is set, including paths that are already on the local filesystem. The prefix tells fsspec to copy the file into the cache directory before reading it. For local-to-local access the copy is redundant: it serves no purpose and doubles disk usage.
Call chain
build_catalog.py
└─ create_items_from_flight_data(local_path) [stac/catalog.py:193]
   └─ extract_item_metadata(local_path) [stac/metadata.py:131-132]
      └─ OPRConnection(cache_dir="radar_cache") ← hardcoded
      └─ opr.load_frame_url(local_path)
         └─ _open_file(local_path) [opr_access.py:132-137]
            └─ fsspec.open_local("filecache::/data/.../Data_001.mat", ...)
               └─ COPIES 20MB file to ./radar_cache/<hash>
               └─ Returns cached path
What it should do
Local paths should be opened directly — no fsspec wrapping, no copy. Remote URLs (https://, s3://, gs://) should keep using filecache:: as before; that's the correct behavior for notebooks and remote data access.
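A minimal sketch of the local-versus-remote decision described above, using only the standard library. The helper names `is_local_path` and `cache_target` are hypothetical and do not exist in xopr; the real fix would live inside `_open_file()` and may look different.

```python
from typing import Optional
from urllib.parse import urlparse


def is_local_path(url: str) -> bool:
    """Return True if `url` refers to the local filesystem (hypothetical helper)."""
    scheme = urlparse(url).scheme
    # No scheme ("/data/x.mat"), an explicit file:// URL, or a one-letter
    # "scheme" that is really a Windows drive letter ("C:\\data\\x.mat").
    return scheme in ("", "file") or len(scheme) == 1


def cache_target(url: str, cache_dir: Optional[str]) -> str:
    """Return the string to open: filecache:: is added only for remote URLs."""
    if cache_dir is None or is_local_path(url):
        return url  # open directly, no copy into cache_dir
    return f"filecache::{url}"
```

With this check in place, `cache_target("/data/.../Data_001.mat", "radar_cache")` returns the path unchanged, while an `s3://` or `https://` URL still gets the `filecache::` prefix.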
Scale
~2000 files/campaign × ~20MB each × dozens of campaigns = 729GB of redundant copies of data that was already on local disk. The catalog output is ~5MB total.
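The estimate can be checked with a few lines of arithmetic. The inputs are the rounded figures quoted above and decimal units (1 GB = 1000 MB) are assumed, so the result is only indicative.

```python
FILES_PER_CAMPAIGN = 2000   # approximate, from the estimate above
MB_PER_FILE = 20            # approximate
CACHE_TOTAL_GB = 729        # observed size of ./radar_cache/

# Redundant data copied for one campaign, in decimal GB.
per_campaign_gb = FILES_PER_CAMPAIGN * MB_PER_FILE / 1000

# How many campaigns' worth of copies the observed cache size represents.
campaigns_equivalent = CACHE_TOTAL_GB / per_campaign_gb

print(f"{per_campaign_gb:.0f} GB copied per campaign")
print(f"{campaigns_equivalent:.1f} campaigns' worth of copies in the cache")
```

At these sizes each campaign adds about 40 GB of copies, so the 729GB cache corresponds to roughly 18 campaigns' worth of files.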
Additional issue
extract_item_metadata() hardcodes cache_dir="radar_cache" with no config knob and no way to disable caching. Even if _open_file() is fixed for local paths, this hardcoded value should be removed or parameterized.
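One possible shape for the parameterized version is sketched below. `StubConnection` is a hypothetical stand-in for OPRConnection so that the example runs on its own, and `extract_item_metadata_sketch` is an illustrative name, not the real signature. The sketch assumes that passing `cache_dir=None` means "no caching".

```python
from typing import Optional


class StubConnection:
    """Hypothetical stand-in for OPRConnection; it only records cache_dir."""

    def __init__(self, cache_dir: Optional[str] = None):
        self.cache_dir = cache_dir


def extract_item_metadata_sketch(path: str, cache_dir: Optional[str] = None) -> dict:
    """Illustrative only: the caller chooses cache_dir; None is assumed to disable caching."""
    conn = StubConnection(cache_dir=cache_dir)
    # The real function would load the frame through the connection here.
    return {"path": path, "cache_dir": conn.cache_dir}
```

Defaulting to `None` keeps bulk local runs from creating a cache at all, while callers that read remote data can still opt in with `cache_dir="radar_cache"`.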
Affected code
src/xopr/opr_access.py — _open_file() (lines 132-137), __init__ (lines 107-112)
src/xopr/stac/metadata.py — extract_item_metadata() (line 131)
What is NOT affected
- All 6 docs notebooks pass remote URLs with cache_dir: correct behavior, unchanged.
- test_cache_data() validates remote caching: unchanged.
- build_layer_parquet.py uses remote URLs: unchanged.
- load_frame_url(), load_layers_file(), format dispatch (HDF5 vs MATLAB): unchanged.