Persist normalization stats to data/normalization_stats.json#17
Persist normalization stats to data/normalization_stats.json#17riley-1995 wants to merge 1 commit intomainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR adds persistence of preprocessing normalization statistics into a single JSON artifact (data/normalization_stats.json) so downstream scripts/teammates can reuse stable values without recomputing them, addressing issue #15.
Changes:
- Added a JSON “upsert” helper to merge newly-computed stats into an existing stats file.
- Updated audio and spectrogram TFRecord creation scripts to write their respective mean/std keys.
- Documented the output location in the README and adjusted
.gitignore.
Reviewed changes
Copilot reviewed 4 out of 5 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
src/elp_rumble/data_creation/utils.py |
Adds upsert_normalization_stats() helper for read/merge/write of stats JSON |
src/elp_rumble/data_creation/create_tfrecords.py |
Writes audio_mean/audio_std into the shared stats JSON |
src/elp_rumble/data_creation/convert_audio_to_spec_tfrecords.py |
Writes spec_mean/spec_std into the shared stats JSON |
README.md |
Notes where normalization stats are written during preprocessing |
.gitignore |
Adds .github/ ignore rule |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| def upsert_normalization_stats(stats_path, updates): | ||
| """Merge normalization stats into JSON at stats_path, creating file if needed.""" | ||
| os.makedirs(os.path.dirname(os.fspath(stats_path)), exist_ok=True) |
There was a problem hiding this comment.
os.makedirs(os.path.dirname(os.fspath(stats_path)), ...) will raise if stats_path has no parent directory (e.g., just "normalization_stats.json"), because os.path.dirname(...) becomes an empty string. Consider resolving the parent directory via Path(stats_path).parent (or guarding for empty dirname) before calling makedirs so the helper works for both nested and non-nested paths.
| os.makedirs(os.path.dirname(os.fspath(stats_path)), exist_ok=True) | |
| dir_name = os.path.dirname(os.fspath(stats_path)) | |
| if dir_name: | |
| os.makedirs(dir_name, exist_ok=True) |
|
|
||
| tmp_path = f"{stats_path}.tmp" | ||
| with open(tmp_path, "w", encoding="utf-8") as f: | ||
| json.dump(existing, f, indent=2) |
There was a problem hiding this comment.
Issue #15 requires deterministic JSON output, but json.dump(existing, ..., indent=2) does not enforce a stable key order; the serialized ordering will depend on insertion/update history. Add sort_keys=True (and keep indentation) so repeated runs produce a deterministic file layout.
| json.dump(existing, f, indent=2) | |
| json.dump(existing, f, indent=2, sort_keys=True) |
| During preprocessing, normalization statistics are saved to `data/normalization_stats.json`. | ||
| `create_tfrecords` writes `audio_mean` and `audio_std`, and | ||
| `convert_audio_to_spec_tfrecords` writes `spec_mean` and `spec_std`. | ||
| Running both scripts produces a single JSON file with all four keys. |
There was a problem hiding this comment.
This section mentions the output path and which scripts write which keys, but it’s still missing two items called out in issue #15/PR description: (1) an example JSON snippet showing the expected shape, and (2) a brief note on overwrite/upsert behavior (reruns overwrite only the keys they compute). Adding both here would make the README sufficient for downstream consumers without needing to read the PR description.
Summary
Persist computed preprocessing normalization statistics to a single JSON artifact so teammates and downstream scripts can consume stable values without recomputation.
This PR closes #15.
What changed
src/elp_rumble/data_creation/utils.py.src/elp_rumble/data_creation/create_tfrecords.pyto write:audio_meanaudio_stdsrc/elp_rumble/data_creation/convert_audio_to_spec_tfrecords.pyto write:spec_meanspec_stdREADME.mdwith:.github/to.gitignoreOutput location
data/normalization_stats.jsonExpected JSON shape
{ "audio_mean": <float>, "audio_std": <float>, "spec_mean": <float>, "spec_std": <float> }Overwrite behavior
If
data/normalization_stats.jsonalready exists, rerunning scripts overwrites the keys they compute with latest values:create_tfrecordsoverwritesaudio_mean/audio_stdconvert_audio_to_spec_tfrecordsoverwritesspec_mean/spec_stdValidation run
Executed full README preprocessing flow with repo
.venv:python -m elp_rumble.data_creation.pos_audio_clips --mode trainpython -m elp_rumble.data_creation.pos_audio_clips --mode testpython -m elp_rumble.data_creation.neg_audio_clipspython -m elp_rumble.data_creation.create_tfrecordspython -m elp_rumble.data_creation.convert_audio_to_spec_tfrecordsObserved output example:
{ "audio_mean": -0.013995947316288948, "audio_std": 0.017343521118164062, "spec_mean": 2.7122833728790283, "spec_std": 1.0036183595657349 }All four required keys are present, numeric, and non-null.