Description
Summary
When metadata_creation is enabled and JSON files are present in the declad dropbox, declad will treat those JSON files as data files if "*.json" is included in filename_patterns. This causes declad to:
- invoke the metadata extractor on JSON files
- generate new JSON metadata files in the dropbox
- re‑ingest those new JSON files
- repeat indefinitely
The result is runaway recursion and unbounded growth in the number of JSON files until declad is stopped and the directory is manually cleaned.
This behavior is surprising and hazardous, especially because JSON metadata sidecars are a normal part of declad’s workflow.
Environment
- declad version: 2.3.8 (via Spack)
- deployment: fermicloud848 under /home/hypotpro/declad_848
- host: fermicloud848
- dropbox: /home/hypotpro/declad_848/dropbox/
- metadata extractor: /home/hypotpro/bin/demo_meta_extractor.sh
- metadata creation: enabled
- filename patterns:
filename_patterns:
- "*.txt"
- "*.root"
- "*.art"
- "*.ddtest"
- "*.parquet"
- "*.json"
Steps to Reproduce
- Enable metadata creation in declad_config.yaml.
- Add "*.json" to filename_patterns.
- Place a JSON file in the dropbox (e.g., data_00001.json).
- Start or restart declad.
Observed Behavior
- Declad treats the JSON file as a data file.
- The metadata extractor is invoked on the JSON file.
- The extractor writes a new JSON metadata file into the dropbox.
- Declad sees the new JSON file and repeats the process.
This produces a chain like:
data_00001.json
data_00001.json.json
data_00001.json.json.json
...
The dropbox grows rapidly until declad is stopped and the directory is cleaned manually.
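The feedback loop can be reproduced outside declad with a small simulation. This is only a sketch: `scan_once` and the in-memory `dropbox` set are hypothetical stand-ins for declad's scanner, and the sidecar naming rule (data file name plus ".json") is inferred from the chain shown above rather than taken from declad's source.

```python
from fnmatch import fnmatch

# Subset of the filename_patterns from the Environment section.
FILENAME_PATTERNS = ["*.txt", "*.root", "*.json"]


def scan_once(dropbox: set) -> set:
    """Simulate one scan: every matching file without a sidecar gets one."""
    new_files = set()
    for name in dropbox:
        if any(fnmatch(name, pattern) for pattern in FILENAME_PATTERNS):
            # Assumed naming rule: sidecar = <file name> + ".json"
            sidecar = name + ".json"
            if sidecar not in dropbox:
                new_files.add(sidecar)
    return dropbox | new_files


dropbox = {"data_00001.json"}
for _ in range(3):
    dropbox = scan_once(dropbox)

for name in sorted(dropbox):
    print(name)
```

Each simulated scan adds another sidecar, because the sidecar written in the previous scan itself matches "*.json". Removing "*.json" from `FILENAME_PATTERNS` stops the simulation from creating any files.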
Expected Behavior
Declad should not treat JSON metadata files as ingestible data files.
Specifically:
- JSON files should be ignored unless they match the expected metadata sidecar naming convention for a corresponding data file.
- JSON files should not be passed to the metadata extractor.
- JSON files should never trigger metadata creation.
Root Cause
Declad currently:
- Uses filename_patterns to determine which files to ingest.
- Does not distinguish between data files and metadata sidecar files.
- Does not validate JSON filenames before treating them as ingestible data.
Thus, adding "*.json" to filename_patterns causes declad to ingest its own metadata output.
Proposed Fix
1. Add a guardrail for JSON files when metadata creation is enabled
If:
- metadata_creation is enabled
- AND a file ends in .json
- AND the filename does not match the expected metadata sidecar pattern
→ Declad should skip the file and log a warning.
Example rule:
data_<uuid>.json → valid metadata sidecar
*.json → ignore unless it matches the above pattern
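A minimal sketch of such a guardrail follows. The function name `should_skip_json`, the logger name, and the exact regular expression are assumptions made for illustration; the sidecar convention `data_<uuid>.json` is taken from the example rule above, and declad's real naming convention may differ.

```python
import logging
import re

log = logging.getLogger("declad.guardrail")  # hypothetical logger name

# Assumed sidecar convention from the example rule: data_<uuid>.json
SIDECAR_RE = re.compile(
    r"data_[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\.json"
)


def should_skip_json(filename: str, metadata_creation: bool) -> bool:
    """Return True if the file is a JSON file that must not be ingested."""
    if not metadata_creation or not filename.endswith(".json"):
        return False  # the guardrail only applies to JSON files
    if SIDECAR_RE.fullmatch(filename):
        return False  # looks like a valid sidecar; leave it to sidecar handling
    log.warning("skipping %s: JSON file does not match sidecar pattern", filename)
    return True


print(should_skip_json("data_00001.json", metadata_creation=True))
print(should_skip_json("data_00001.root", metadata_creation=True))
```

A file that passes this check is not thereby a data file; a valid sidecar would still need to be routed to metadata handling rather than to the extractor.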
2. Consider separating data and metadata patterns in configuration
For example:
data_filename_patterns:
- "*.parquet"
- "*.root"
metadata_filename_patterns:
- "*.json"
This would eliminate ambiguity and prevent recursion.
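One way the split configuration could be consumed is sketched below. The keys `data_filename_patterns` and `metadata_filename_patterns` are the ones proposed above and do not exist in declad today; `classify` is a hypothetical helper, and the config is shown as a plain dict rather than parsed YAML to keep the example self-contained.

```python
from fnmatch import fnmatch

# Proposed (not existing) configuration keys, as a plain dict.
config = {
    "data_filename_patterns": ["*.parquet", "*.root"],
    "metadata_filename_patterns": ["*.json"],
}


def classify(filename: str, cfg: dict) -> str:
    """Classify a dropbox file as 'metadata', 'data', or 'ignored'."""
    # Metadata patterns are checked first so a sidecar can never count as data.
    if any(fnmatch(filename, p) for p in cfg["metadata_filename_patterns"]):
        return "metadata"
    if any(fnmatch(filename, p) for p in cfg["data_filename_patterns"]):
        return "data"
    return "ignored"


for name in ["run1.root", "run1.root.json", "notes.txt"]:
    print(name, "->", classify(name, config))
```

Checking the metadata patterns before the data patterns is a deliberate choice in this sketch: even if a user adds an overly broad data pattern, files matching a metadata pattern are never handed to the extractor.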
3. Add a safety check before invoking the metadata extractor
If the file is already a metadata file, declad should not call the extractor.
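A sketch of that check, written as a wrapper around whatever callable runs the extractor. Both `is_metadata_file` and `maybe_extract` are hypothetical names, and treating every file ending in ".json" as metadata is an assumption; a real implementation would use declad's own sidecar convention.

```python
from typing import Callable, Optional


def is_metadata_file(filename: str) -> bool:
    # Assumption: any file ending in ".json" is a metadata file.
    return filename.endswith(".json")


def maybe_extract(
    filename: str, extractor: Callable[[str], dict]
) -> Optional[dict]:
    """Call the extractor only for files that are not already metadata."""
    if is_metadata_file(filename):
        return None  # never generate metadata for a metadata file
    return extractor(filename)


def demo_extractor(filename: str) -> dict:
    # Stand-in for the real extractor script; returns a trivial record.
    return {"name": filename}


print(maybe_extract("data_00001.root", demo_extractor))
print(maybe_extract("data_00001.root.json", demo_extractor))
```

Because the check sits directly in front of the extractor call, it still breaks the loop if a misconfigured pattern lets a JSON file through the earlier filters.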
Impact
This bug can:
- create tens of thousands of files in minutes
- fill user home directories
- cause declad to become unresponsive
- require manual cleanup
- confuse users who expect JSON files to be harmless
Workaround
Do not include "*.json" in filename_patterns when metadata creation is enabled.
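Until a fix lands, a deployment could check its own configuration for this combination before starting declad. The sketch below assumes the settings have already been loaded into a flat dict with the keys `metadata_creation` and `filename_patterns`; the actual layout of declad_config.yaml may differ, and `risky_patterns` is a hypothetical helper, not part of declad.

```python
from fnmatch import fnmatch


def risky_patterns(config: dict) -> list:
    """Return the filename patterns that would match a JSON sidecar."""
    if not config.get("metadata_creation", False):
        return []  # no sidecars are generated, so no feedback loop
    probe = "sidecar.json"  # representative JSON file name
    return [p for p in config.get("filename_patterns", []) if fnmatch(probe, p)]


config = {
    "metadata_creation": True,
    "filename_patterns": ["*.txt", "*.root", "*.parquet", "*.json"],
}

bad = risky_patterns(config)
if bad:
    print("refusing to start: patterns match JSON sidecars:", bad)
```

The probe-based check also flags catch-all patterns such as "*", which would trigger the same recursion as "*.json".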