OPRConnection caches local files unnecessarily during catalog builds (i.e., copies the full set of data files) #87

@espg

Problem

Running bulk catalog generation via scripts/build_catalog.py against local radar data fills disk with redundant copies. A partial reprocessing run accumulated 729GB in ./radar_cache/ before running out of disk space.

The root cause is that OPRConnection._open_file() wraps every URL in fsspec's filecache:: prefix whenever cache_dir is set, including paths that are already on the local filesystem. The prefix tells fsspec to copy the file into the cache directory before reading it. For local-to-local access, that copy is redundant and doubles disk usage.

Call chain

build_catalog.py
  └─ create_items_from_flight_data(local_path)     [stac/catalog.py:193]
       └─ extract_item_metadata(local_path)         [stac/metadata.py:131-132]
            └─ OPRConnection(cache_dir="radar_cache")   ← hardcoded
            └─ opr.load_frame_url(local_path)
                 └─ _open_file(local_path)           [opr_access.py:132-137]
                      └─ fsspec.open_local("filecache::/data/.../Data_001.mat", ...)
                           └─ COPIES 20MB file to ./radar_cache/<hash>
                           └─ Returns cached path

What it should do

Local paths should be opened directly — no fsspec wrapping, no copy. Remote URLs (https://, s3://, gs://) should keep using filecache:: as before; that's the correct behavior for notebooks and remote data access.
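A minimal sketch of that dispatch, using only the standard library. The helper names (is_remote, resolve_open_target) and the REMOTE_SCHEMES set are hypothetical and not part of xopr; the scheme list is just the three protocols named above.

```python
from urllib.parse import urlsplit

# Assumption: only these schemes count as remote and keep the filecache:: prefix.
REMOTE_SCHEMES = {"http", "https", "s3", "gs"}


def is_remote(url) -> bool:
    """Return True when the URL scheme is one of the remote protocols."""
    return urlsplit(str(url)).scheme.lower() in REMOTE_SCHEMES


def resolve_open_target(url, cache_dir=None) -> str:
    """Return the string to open: prefixed for cached remote URLs, untouched otherwise."""
    url = str(url)
    if cache_dir and is_remote(url):
        return f"filecache::{url}"
    # Local paths (and any URL when caching is off) are passed through unchanged.
    return url


# Example paths below are illustrative only.
print(resolve_open_target("/data/campaign/Data_001.mat", cache_dir="radar_cache"))
print(resolve_open_target("https://example.org/Data_001.mat", cache_dir="radar_cache"))
```

With a check like this in _open_file(), a local path would never reach the filecache layer, while remote URLs would behave exactly as they do today.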

Scale

~2000 files/campaign × ~20MB each × dozens of campaigns: the partial run had already produced 729GB of redundant copies of data that was already on local disk when it ran out of space. The catalog output is ~5MB total.
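As a rough check on these numbers (the per-file and per-campaign figures are approximate, and decimal units are assumed), the cache size can be converted back into a campaign count:

```python
# Figures taken from the issue text; decimal units (1 GB = 1000 MB) are assumed.
FILES_PER_CAMPAIGN = 2000
MB_PER_FILE = 20
OBSERVED_CACHE_GB = 729

gb_per_campaign = FILES_PER_CAMPAIGN * MB_PER_FILE / 1000
campaigns_copied = OBSERVED_CACHE_GB / gb_per_campaign

print(f"{gb_per_campaign:.0f} GB of copies per campaign")
print(f"{campaigns_copied:.1f} campaigns' worth of data in the cache")
```

That works out to about 40GB per campaign, so 729GB is roughly 18 campaigns' worth of copies, which fits a partial run over dozens of campaigns.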

Additional issue

extract_item_metadata() hardcodes cache_dir="radar_cache" with no config knob and no way to disable caching. Even if _open_file() is fixed for local paths, this hardcoded value should be removed or parameterized.
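One possible shape for the parameterized version is sketched below. This is an assumption about how it could look, not the current xopr API: OPRConnectionStub stands in for OPRConnection so the snippet runs on its own, and extract_item_metadata_sketch and connection_cls are hypothetical names. It also assumes that passing cache_dir=None to the connection disables caching.

```python
from typing import Optional


class OPRConnectionStub:
    """Stand-in for OPRConnection so this sketch runs without xopr installed."""

    def __init__(self, cache_dir: Optional[str] = None):
        self.cache_dir = cache_dir


def extract_item_metadata_sketch(path, cache_dir: Optional[str] = None,
                                 connection_cls=OPRConnectionStub):
    """Hypothetical variant: the caller chooses cache_dir; None means no caching."""
    opr = connection_cls(cache_dir=cache_dir)
    # The real function would go on to call opr.load_frame_url(path) here.
    return opr


# Bulk local builds would pass nothing; remote callers could opt in explicitly.
local_conn = extract_item_metadata_sketch("/data/campaign/Data_001.mat")
remote_conn = extract_item_metadata_sketch(
    "https://example.org/Data_001.mat", cache_dir="radar_cache"
)
```

Defaulting to None keeps build_catalog.py from creating ./radar_cache/ unless a caller asks for it.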

Affected code

  • src/xopr/opr_access.py: _open_file() (lines 132-137), __init__ (lines 107-112)
  • src/xopr/stac/metadata.py: extract_item_metadata() (line 131)

What is NOT affected

  • All 6 docs notebooks pass remote URLs with cache_dir — correct behavior, unchanged.
  • test_cache_data() validates remote caching — unchanged.
  • build_layer_parquet.py uses remote URLs — unchanged.
  • load_frame_url(), load_layers_file(), format dispatch (HDF5 vs MATLAB) — unchanged.

Labels: bug