declad recursively ingests JSON metadata files when metadata creation is enabled #67

@DouglasLeeTucker

Summary

When metadata_creation is enabled and JSON files are present in the declad dropbox, declad will treat those JSON files as data files if "*.json" is included in filename_patterns. This causes declad to:

  • invoke the metadata extractor on JSON files
  • generate new JSON metadata files in the dropbox
  • re‑ingest those new JSON files
  • repeat indefinitely

The result is runaway recursion and exponential growth of JSON files until declad is stopped and the directory is manually cleaned.

This behavior is surprising and hazardous, especially because JSON metadata sidecars are a normal part of declad’s workflow.

Environment

  • declad version: 2.3.8 (via Spack)
  • deployment: fermicloud848 under /home/hypotpro/declad_848
  • host: fermicloud848
  • dropbox: /home/hypotpro/declad_848/dropbox/
  • metadata extractor: /home/hypotpro/bin/demo_meta_extractor.sh
  • metadata creation: enabled
  • filename patterns:
filename_patterns:
  - "*.txt"
  - "*.root"
  - "*.art"
  - "*.ddtest"
  - "*.parquet"
  - "*.json"

Steps to Reproduce

  1. Enable metadata creation in declad_config.yaml.
  2. Add "*.json" to filename_patterns.
  3. Place a JSON file in the dropbox (e.g., data_00001.json).
  4. Start or restart declad.

Observed Behavior

  • Declad treats the JSON file as a data file.
  • The metadata extractor is invoked on the JSON file.
  • The extractor writes a new JSON metadata file into the dropbox.
  • Declad sees the new JSON file and repeats the process.

This produces a chain like:
data_00001.json
data_00001.json.json
data_00001.json.json.json
...

The dropbox grows rapidly until declad is stopped and the directory is cleaned manually.
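
The feedback loop can be reproduced outside declad with a small simulation. This is only a sketch of the observed behavior, not declad's actual scanner: the function name, the in-memory "dropbox", and the assumption that the extractor writes `<name>.json` next to each file (inferred from the chain above) are all illustrative.

```python
from fnmatch import fnmatch


def simulate_scans(initial_files, patterns, scans):
    """Model the loop: on each scan, every not-yet-processed file that
    matches a pattern gets a '<name>.json' sidecar written to the dropbox."""
    dropbox = set(initial_files)
    processed = set()
    for _ in range(scans):
        pending = [
            name for name in sorted(dropbox)
            if name not in processed
            and any(fnmatch(name, pattern) for pattern in patterns)
        ]
        for name in pending:
            # Assumed sidecar naming, inferred from the chain shown above.
            dropbox.add(name + ".json")
            processed.add(name)
    return sorted(dropbox)


if __name__ == "__main__":
    # With "*.json" in the patterns, every scan adds one more link to the chain.
    for name in simulate_scans(["data_00001.json"], ["*.json"], scans=3):
        print(name)
```

With `"*.json"` removed from the patterns, the same simulation stops after the first sidecar, because the sidecar no longer matches any pattern.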

Expected Behavior

Declad should not treat JSON metadata files as ingestible data files.

Specifically:

  • JSON files should be ignored unless they match the expected metadata sidecar naming convention for a corresponding data file.
  • JSON files should not be passed to the metadata extractor.
  • JSON files should never trigger metadata creation.

Root Cause

Declad currently:

  1. Uses filename_patterns to determine which files to ingest.
  2. Does not distinguish between data files and metadata sidecar files.
  3. Does not validate JSON filenames before treating them as ingestible data.

Thus, adding "*.json" to filename_patterns causes declad to ingest its own metadata output.

Proposed Fix

1. Add a guardrail for JSON files when metadata creation is enabled

If:

  • metadata_creation is enabled
  • AND a file ends in .json
  • AND the filename does not match the expected metadata sidecar pattern

→ Declad should skip the file and log a warning.

Example rule:

data_<uuid>.json → valid metadata sidecar
*.json → ignore unless it matches the above pattern
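
A minimal sketch of this guardrail in Python follows. It is not taken from the declad source: `should_skip`, the logger name, and the exact UUID regular expression are assumptions made for illustration, and the real sidecar naming convention may differ.

```python
import logging
import re

log = logging.getLogger("declad.guard")  # hypothetical logger name

# Assumption: a valid sidecar is data_<uuid>.json with a canonical
# 8-4-4-4-12 hexadecimal UUID, per the example rule above.
SIDECAR_RE = re.compile(
    r"^data_[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\.json$",
    re.IGNORECASE,
)


def should_skip(filename, metadata_creation_enabled):
    """Return True when a .json file should be skipped rather than ingested."""
    if not metadata_creation_enabled:
        return False  # the guardrail only applies when metadata creation is on
    if not filename.lower().endswith(".json"):
        return False  # not a JSON file, so the normal pattern matching applies
    if SIDECAR_RE.match(filename):
        return False  # looks like a sidecar; handle it as metadata, not data
    log.warning("skipping %s: JSON file does not match the sidecar pattern", filename)
    return True
```

A file that passes this check would still need to be handled as metadata for its data file, not handed to the extractor.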

2. Consider separating data and metadata patterns in configuration

For example:

data_filename_patterns:
  - "*.parquet"
  - "*.root"

metadata_filename_patterns:
  - "*.json"

This would eliminate ambiguity and prevent recursion.
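
One way the split configuration could be consumed is sketched below. The two keys are the hypothetical ones proposed above and do not exist in declad today; `classify` is likewise an illustrative name. Metadata patterns are checked first so that a sidecar can never be classified as data, even if the two lists overlap.

```python
from fnmatch import fnmatch

# Hypothetical configuration mirroring the proposal above.
config = {
    "data_filename_patterns": ["*.parquet", "*.root"],
    "metadata_filename_patterns": ["*.json"],
}


def classify(filename, cfg):
    """Return 'metadata', 'data', or 'ignored' for a dropbox file name."""
    # Metadata patterns take precedence over data patterns on overlap.
    if any(fnmatch(filename, p) for p in cfg["metadata_filename_patterns"]):
        return "metadata"
    if any(fnmatch(filename, p) for p in cfg["data_filename_patterns"]):
        return "data"
    return "ignored"
```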

3. Add a safety check before invoking the metadata extractor

If the file is already a metadata file, declad should not call the extractor.
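
Such a check could be as simple as asking whether the file is the `.json` companion of a file that already exists next to it. The sketch below assumes the `<name>.json` sidecar naming seen in the chain under Observed Behavior; both function names and the `run_extractor` callable are hypothetical stand-ins, not declad APIs.

```python
from pathlib import Path


def is_sidecar_of_existing_file(path):
    """True if path is '<file>.json' and '<file>' exists in the same directory."""
    path = Path(path)
    if path.suffix.lower() != ".json":
        return False
    # with_suffix("") strips only the final '.json' component.
    return path.with_suffix("").is_file()


def maybe_extract(path, run_extractor):
    """Call run_extractor(path) unless path already looks like a metadata file."""
    if is_sidecar_of_existing_file(path):
        return None  # already metadata: do not invoke the extractor
    return run_extractor(path)
```

Under this rule a standalone data_00001.json would still be processed once, but the resulting data_00001.json.json would be recognized as its sidecar and the chain would stop there.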

Impact

This bug can:

  • create tens of thousands of files in minutes
  • fill user home directories
  • cause declad to become unresponsive
  • require manual cleanup
  • confuse users who expect JSON files to be harmless

Workaround

Do not include "*.json" in filename_patterns when metadata creation is enabled.
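
Until a fix lands, a configuration can be checked for this hazard before declad is started. The helper below is a sketch with a hypothetical name; it takes the two settings as plain Python values so that it makes no assumption about how declad_config.yaml is parsed. It tests each pattern against a probe file name, so broad patterns such as "*" are caught as well as "*.json".

```python
from fnmatch import fnmatch


def risky_json_patterns(metadata_creation_enabled, filename_patterns,
                        probe="probe.json"):
    """Return the patterns that would match a JSON file while metadata
    creation is enabled; an empty list means the workaround is in place."""
    if not metadata_creation_enabled:
        return []
    return [p for p in filename_patterns if fnmatch(probe, p)]


if __name__ == "__main__":
    patterns = ["*.txt", "*.root", "*.art", "*.ddtest", "*.parquet", "*.json"]
    for pattern in risky_json_patterns(True, patterns):
        print(f"warning: pattern {pattern!r} matches JSON files")
```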
