Skip to content

Persist normalization stats to data/normalization_stats.json#17

Open
riley-1995 wants to merge 1 commit intomainfrom
normalization-stats-json
Open

Persist normalization stats to data/normalization_stats.json#17
riley-1995 wants to merge 1 commit intomainfrom
normalization-stats-json

Conversation

@riley-1995
Copy link
Copy Markdown
Collaborator

Summary

Persist computed preprocessing normalization statistics to a single JSON artifact so teammates and downstream scripts can consume stable values without recomputation.

This PR closes #15.

What changed

  • Added JSON upsert helper in src/elp_rumble/data_creation/utils.py.
  • Updated src/elp_rumble/data_creation/create_tfrecords.py to write:
    • audio_mean
    • audio_std
  • Updated src/elp_rumble/data_creation/convert_audio_to_spec_tfrecords.py to write:
    • spec_mean
    • spec_std
  • Updated README.md with:
    • output file location
    • generation steps
    • behavior notes
  • Added .github/ to .gitignore

Output location

  • data/normalization_stats.json

Expected JSON shape

{
  "audio_mean": <float>,
  "audio_std": <float>,
  "spec_mean": <float>,
  "spec_std": <float>
}

Overwrite behavior

If data/normalization_stats.json already exists, rerunning scripts overwrites the keys they compute with latest values:

  • create_tfrecords overwrites audio_mean / audio_std
  • convert_audio_to_spec_tfrecords overwrites spec_mean / spec_std

Validation run

Executed full README preprocessing flow with repo .venv:

  1. python -m elp_rumble.data_creation.pos_audio_clips --mode train
  2. python -m elp_rumble.data_creation.pos_audio_clips --mode test
  3. python -m elp_rumble.data_creation.neg_audio_clips
  4. python -m elp_rumble.data_creation.create_tfrecords
  5. python -m elp_rumble.data_creation.convert_audio_to_spec_tfrecords

Observed output example:

{
  "audio_mean": -0.013995947316288948,
  "audio_std": 0.017343521118164062,
  "spec_mean": 2.7122833728790283,
  "spec_std": 1.0036183595657349
}

All four required keys are present, numeric, and non-null.

Copilot AI review requested due to automatic review settings March 5, 2026 13:26
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds persistence of preprocessing normalization statistics into a single JSON artifact (data/normalization_stats.json) so downstream scripts/teammates can reuse stable values without recomputing them, addressing issue #15.

Changes:

  • Added a JSON “upsert” helper to merge newly-computed stats into an existing stats file.
  • Updated audio and spectrogram TFRecord creation scripts to write their respective mean/std keys.
  • Documented the output location in the README and adjusted .gitignore.

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
src/elp_rumble/data_creation/utils.py Adds upsert_normalization_stats() helper for read/merge/write of stats JSON
src/elp_rumble/data_creation/create_tfrecords.py Writes audio_mean/audio_std into the shared stats JSON
src/elp_rumble/data_creation/convert_audio_to_spec_tfrecords.py Writes spec_mean/spec_std into the shared stats JSON
README.md Notes where normalization stats are written during preprocessing
.gitignore Adds .github/ ignore rule

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.


def upsert_normalization_stats(stats_path, updates):
"""Merge normalization stats into JSON at stats_path, creating file if needed."""
os.makedirs(os.path.dirname(os.fspath(stats_path)), exist_ok=True)
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

os.makedirs(os.path.dirname(os.fspath(stats_path)), ...) will raise if stats_path has no parent directory (e.g., just "normalization_stats.json"), because os.path.dirname(...) becomes an empty string. Consider resolving the parent directory via Path(stats_path).parent (or guarding for empty dirname) before calling makedirs so the helper works for both nested and non-nested paths.

Suggested change
os.makedirs(os.path.dirname(os.fspath(stats_path)), exist_ok=True)
dir_name = os.path.dirname(os.fspath(stats_path))
if dir_name:
os.makedirs(dir_name, exist_ok=True)

Copilot uses AI. Check for mistakes.

tmp_path = f"{stats_path}.tmp"
with open(tmp_path, "w", encoding="utf-8") as f:
json.dump(existing, f, indent=2)
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue #15 requires deterministic JSON output, but json.dump(existing, ..., indent=2) does not enforce a stable key order; the serialized ordering will depend on insertion/update history. Add sort_keys=True (and keep indentation) so repeated runs produce a deterministic file layout.

Suggested change
json.dump(existing, f, indent=2)
json.dump(existing, f, indent=2, sort_keys=True)

Copilot uses AI. Check for mistakes.
Comment thread README.md
Comment on lines +352 to +355
During preprocessing, normalization statistics are saved to `data/normalization_stats.json`.
`create_tfrecords` writes `audio_mean` and `audio_std`, and
`convert_audio_to_spec_tfrecords` writes `spec_mean` and `spec_std`.
Running both scripts produces a single JSON file with all four keys.
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section mentions the output path and which scripts write which keys, but it’s still missing two items called out in issue #15/PR description: (1) an example JSON snippet showing the expected shape, and (2) a brief note on overwrite/upsert behavior (reruns overwrite only the keys they compute). Adding both here would make the README sufficient for downstream consumers without needing to read the PR description.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Save computed normalization stats (audio_mean, audio_std, spec_mean, spec_std) to JSON

2 participants