Conversation
Pull request overview
Adds a new benchmark notebook/workflow pieces to train and export an ML classifier for e/π separation in the BIC, and extends the simulation matrix to cover multiple angular ranges.
Changes:
- Extend the `sim:calo_pid` job matrix to include multiple `ANGLE` values and incorporate them into output paths.
- Add an Org-mode/Jupyter notebook that loads parquet datasets, trains a CNN with TensorFlow/Keras, computes rejection/efficiency metrics, and exports an ONNX model.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| benchmarks/calo_pid/config.yml | Adds ANGLE as a matrix dimension and uses it in the snakemake input path template. |
| benchmarks/calo_pid/calo_pid_bic.org | Introduces the ML training + evaluation + ONNX export notebook for the benchmark. |
```python
#plotdir = f'{output_directory}/plots'
#datadir = f'{output_directory}/data'
os.makedirs(plotdir, exist_ok=True)
os.makedirs(datadir, exist_ok=True)
```
datadir is set to /kaggle/input/... (read-only on Kaggle) and appears to be an input location. Creating it will fail on Kaggle and is also conceptually wrong for an input dataset directory. Remove this os.makedirs(datadir, ...) and only create output directories (e.g., plotdir / output_directory).
Suggested change (delete this line):
```python
os.makedirs(datadir, exist_ok=True)
```
```python
datadir = f'/kaggle/input/results-45to135deg-1gev-data'
plotdir = f'/kaggle/working/plots/{angle_label}'
output_directory = f'/kaggle/working/output/{angle_label}/{energy_setting}'
```
These hard-coded Kaggle paths (and the hard-coded dataset naming in datadir) make the benchmark non-portable and ignore the earlier args.workdir. Use args.workdir (and angle/energy) to derive output_directory/plotdir, and pass datadir as a parameter/CLI option so the notebook can run in CI/HPC or locally without path edits.
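A minimal sketch of what this could look like, assuming hypothetical `--workdir` and `--datadir` CLI options (the option names and defaults are illustrative, not the PR's actual interface):

```python
import argparse
import os

# Hypothetical CLI options replacing the hard-coded Kaggle paths; the
# names (--workdir, --datadir) and defaults are assumptions for this sketch.
parser = argparse.ArgumentParser()
parser.add_argument('--workdir', default='output')
parser.add_argument('--datadir', default='input')
args = parser.parse_args([])  # pass real sys.argv outside this sketch

angle_label = '45to135deg'  # illustrative values
energy_setting = '1gev'

datadir = args.datadir  # input location; never created by the notebook
plotdir = os.path.join(args.workdir, 'plots', angle_label)
output_directory = os.path.join(args.workdir, angle_label, energy_setting)

# Only output directories are created.
os.makedirs(plotdir, exist_ok=True)
os.makedirs(output_directory, exist_ok=True)
```

With this structure the same notebook runs on Kaggle, in CI, or locally by changing only the two CLI options.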
```python
gpu = tf.config.list_logical_devices('GPU')
strategy = tf.distribute.MirroredStrategy(gpu) if len(gpu) == 1 else tf.distribute.MirroredStrategy([gpu[0]])
```
This strategy selection has two concrete issues: (1) it will crash when len(gpu) == 0 due to gpu[0], and (2) MirroredStrategy(...) expects device name strings (or None), not logical device objects. Prefer tf.distribute.MirroredStrategy() (auto-detect GPUs) and explicitly fall back to the default/CPU strategy when no GPUs are available.
Suggested change:
```python
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    strategy = tf.distribute.MirroredStrategy()
else:
    print('No GPUs detected, falling back to CPU strategy')
    strategy = tf.distribute.get_strategy()
```
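To make the suggestion concrete: whichever strategy is selected, model and optimizer creation then happens inside `strategy.scope()`. A sketch (the one-layer model is purely illustrative, not the notebook's CNN):

```python
import tensorflow as tf

# Fall back to the default (CPU) strategy when no GPUs are available,
# as suggested above.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    strategy = tf.distribute.MirroredStrategy()
else:
    strategy = tf.distribute.get_strategy()

# Variables must be created under the strategy's scope so they are
# placed/replicated correctly.
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    model.compile(optimizer='adam', loss='binary_crossentropy')
```

This runs unchanged on a GPU-less CI node, where `strategy.num_replicas_in_sync` is simply 1.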
```python
## do magic to avoid shard warnings of operating on DATA instead of FILE
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA
return dataset.with_options(options)
```
The dataset auto-sharding options are applied twice: once inside make_dataset() and again right before training. This duplication makes the notebook harder to maintain (and risks diverging settings later). Keep the policy application in one place (preferably inside make_dataset) and remove the redundant block near training.
Suggested change:
```python
return dataset
```
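A sketch of keeping the policy in a single place, assuming a hypothetical `make_dataset()` helper (the notebook's real helper presumably reads parquet; this one builds from in-memory tensors for illustration):

```python
import tensorflow as tf

def make_dataset(features, labels, batch_size=32):
    dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(batch_size)
    # Apply the auto-shard policy here, once, so callers need no second
    # with_options() call before training.
    options = tf.data.Options()
    options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA
    return dataset.with_options(options)
```

Every dataset returned by the helper then carries the policy, and the block near training can be deleted without changing behavior.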
```python
## avoid warning that we are operating on DATA instead of FILE
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA
train_dataset = train_dataset.with_options(options)
valid_dataset = valid_dataset.with_options(options)
```
The dataset auto-sharding options are applied twice: once inside make_dataset() and again right before training. This duplication makes the notebook harder to maintain (and risks diverging settings later). Keep the policy application in one place (preferably inside make_dataset) and remove the redundant block near training.
Suggested change (delete this block):
```python
## avoid warning that we are operating on DATA instead of FILE
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA
train_dataset = train_dataset.with_options(options)
valid_dataset = valid_dataset.with_options(options)
```
```python
with open(f'{output_directory}/results.json', 'w') as f:
    f.write(json.dumps(results, indent=2))
print(f' - Found overal rejection {results["rejection"]:.2f} at {results["efficiency"]:.2f} efficiency')
```
Correct spelling of 'overal' to 'overall'.
Suggested change:
```python
print(f' - Found overall rejection {results["rejection"]:.2f} at {results["efficiency"]:.2f} efficiency')
```
```org
#+end_src

#+begin_src jupyter-python
print('Bencmarking test data')
```
Correct spelling of 'Bencmarking' to 'Benchmarking'.
Suggested change:
```python
print('Benchmarking test data')
```
```python
with open(f'{output_directory}/results.json', 'w') as f:
    f.write(json.dumps(results, indent=2))
print(f' - Found overal rejection {results["rejection"]:.2f} at {results["efficiency"]:.2f} efficiency')
print(f' - Results written to {datadir}/results.json')
```
This message points to {datadir}/results.json, but the code writes to {output_directory}/results.json a few lines above. Update the message to report the correct output path so users can find the produced results reliably.
Suggested change:
```python
print(f' - Results written to {output_directory}/results.json')
```
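One way to make this class of bug impossible is to build the path once and use the same variable for both the write and the message. A sketch with illustrative values:

```python
import json
import os

# Illustrative values; in the notebook these come from the angle/energy settings.
output_directory = 'output/45to135deg/1gev'
os.makedirs(output_directory, exist_ok=True)
results = {'rejection': 123.4, 'efficiency': 0.95}

# Derive the path once so the write and the printed message cannot diverge.
results_path = os.path.join(output_directory, 'results.json')
with open(results_path, 'w') as f:
    json.dump(results, f, indent=2)
print(f' - Results written to {results_path}')
```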
Briefly, what does this PR introduce?
Add a benchmark to generate trained ML models for e/pi separation in the BIC.
What kind of change does this PR introduce?
Please check if this PR fulfills the following:
Does this PR introduce breaking changes? What changes might users need to make to their code?
No
Does this PR change default behavior?
No