Conversation
Pull request overview
Adds a new benchmark notebook/workflow pieces to train and export an ML classifier for e/π separation in the BIC, and extends the simulation matrix to cover multiple angular ranges.
Changes:
- Extend the `sim:calo_pid` job matrix to include multiple `ANGLE` values and incorporate them into output paths.
- Add an Org-mode/Jupyter notebook that loads parquet datasets, trains a CNN with TensorFlow/Keras, computes rejection/efficiency metrics, and exports an ONNX model.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| benchmarks/calo_pid/config.yml | Adds ANGLE as a matrix dimension and uses it in the snakemake input path template. |
| benchmarks/calo_pid/calo_pid_bic.org | Introduces the ML training + evaluation + ONNX export notebook for the benchmark. |
```python
#plotdir = f'{output_directory}/plots'
#datadir = f'{output_directory}/data'
os.makedirs(plotdir, exist_ok=True)
os.makedirs(datadir, exist_ok=True)
```
datadir is set to /kaggle/input/... (read-only on Kaggle) and appears to be an input location. Creating it will fail on Kaggle and is also conceptually wrong for an input dataset directory. Remove this os.makedirs(datadir, ...) and only create output directories (e.g., plotdir / output_directory).
Suggested change (delete this line):
```python
os.makedirs(datadir, exist_ok=True)
```
```python
datadir = f'/kaggle/input/results-45to135deg-1gev-data'
plotdir = f'/kaggle/working/plots/{angle_label}'
output_directory = f'/kaggle/working/output/{angle_label}/{energy_setting}'
```
These hard-coded Kaggle paths (and the hard-coded dataset naming in datadir) make the benchmark non-portable and ignore the earlier args.workdir. Use args.workdir (and angle/energy) to derive output_directory/plotdir, and pass datadir as a parameter/CLI option so the notebook can run in CI/HPC or locally without path edits.
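A minimal sketch of what this could look like, assuming hypothetical `--workdir` and `--datadir` CLI options (the option names and defaults are illustrative, not the PR's actual interface):

```python
import argparse
import os

# Hypothetical CLI options replacing the hard-coded Kaggle paths; the
# names (--workdir, --datadir) and defaults are assumptions for this sketch.
parser = argparse.ArgumentParser()
parser.add_argument('--workdir', default='output')
parser.add_argument('--datadir', default='input')
args = parser.parse_args([])  # pass real sys.argv outside this sketch

angle_label = '45to135deg'  # illustrative values
energy_setting = '1gev'

datadir = args.datadir  # input location; never created by the notebook
plotdir = os.path.join(args.workdir, 'plots', angle_label)
output_directory = os.path.join(args.workdir, angle_label, energy_setting)

# Only output directories are created.
os.makedirs(plotdir, exist_ok=True)
os.makedirs(output_directory, exist_ok=True)
```

With this structure the same notebook runs on Kaggle, in CI, or locally by changing only the two CLI options.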
```python
gpu = tf.config.list_logical_devices('GPU')
strategy = tf.distribute.MirroredStrategy(gpu) if len(gpu) == 1 else tf.distribute.MirroredStrategy([gpu[0]])
```
This strategy selection has two concrete issues: (1) it will crash when len(gpu) == 0 due to gpu[0], and (2) MirroredStrategy(...) expects device name strings (or None), not logical device objects. Prefer tf.distribute.MirroredStrategy() (auto-detect GPUs) and explicitly fall back to the default/CPU strategy when no GPUs are available.
Suggested change:
```python
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    strategy = tf.distribute.MirroredStrategy()
else:
    print('No GPUs detected, falling back to CPU strategy')
    strategy = tf.distribute.get_strategy()
```
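To make the suggestion concrete: whichever strategy is selected, model and optimizer creation then happens inside `strategy.scope()`. A sketch (the one-layer model is purely illustrative, not the notebook's CNN):

```python
import tensorflow as tf

# Fall back to the default (CPU) strategy when no GPUs are available,
# as suggested above.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    strategy = tf.distribute.MirroredStrategy()
else:
    strategy = tf.distribute.get_strategy()

# Variables must be created under the strategy's scope so they are
# placed/replicated correctly.
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    model.compile(optimizer='adam', loss='binary_crossentropy')
```

This runs unchanged on a GPU-less CI node, where `strategy.num_replicas_in_sync` is simply 1.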
```python
## do magic to avoid shard warnings of operating on DATA instead of FILE
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA
return dataset.with_options(options)
```
The dataset auto-sharding options are applied twice: once inside make_dataset() and again right before training. This duplication makes the notebook harder to maintain (and risks diverging settings later). Keep the policy application in one place (preferably inside make_dataset) and remove the redundant block near training.
Suggested change:
```python
return dataset
```
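A sketch of keeping the policy in a single place, assuming a hypothetical `make_dataset()` helper (the notebook's real helper presumably reads parquet; this one builds from in-memory tensors for illustration):

```python
import tensorflow as tf

def make_dataset(features, labels, batch_size=32):
    dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(batch_size)
    # Apply the auto-shard policy here, once, so callers need no second
    # with_options() call before training.
    options = tf.data.Options()
    options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA
    return dataset.with_options(options)
```

Every dataset returned by the helper then carries the policy, and the block near training can be deleted without changing behavior.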
```python
## avoid warning that we are operating on DATA instead of FILE
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA
train_dataset = train_dataset.with_options(options)
valid_dataset = valid_dataset.with_options(options)
```
The dataset auto-sharding options are applied twice: once inside make_dataset() and again right before training. This duplication makes the notebook harder to maintain (and risks diverging settings later). Keep the policy application in one place (preferably inside make_dataset) and remove the redundant block near training.
Suggested change (delete this block):
```python
## avoid warning that we are operating on DATA instead of FILE
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA
train_dataset = train_dataset.with_options(options)
valid_dataset = valid_dataset.with_options(options)
```
```python
with open(f'{output_directory}/results.json', 'w') as f:
    f.write(json.dumps(results, indent=2))
print(f' - Found overal rejection {results["rejection"]:.2f} at {results["efficiency"]:.2f} efficiency')
```
Correct spelling of 'overal' to 'overall'.
Suggested change:
```python
print(f' - Found overall rejection {results["rejection"]:.2f} at {results["efficiency"]:.2f} efficiency')
```
```org
#+end_src

#+begin_src jupyter-python
print('Bencmarking test data')
```
Correct spelling of 'Bencmarking' to 'Benchmarking'.
Suggested change:
```python
print('Benchmarking test data')
```
```python
with open(f'{output_directory}/results.json', 'w') as f:
    f.write(json.dumps(results, indent=2))
print(f' - Found overal rejection {results["rejection"]:.2f} at {results["efficiency"]:.2f} efficiency')
print(f' - Results written to {datadir}/results.json')
```
This message points to {datadir}/results.json, but the code writes to {output_directory}/results.json a few lines above. Update the message to report the correct output path so users can find the produced results reliably.
Suggested change:
```python
print(f' - Results written to {output_directory}/results.json')
```
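One way to make this class of bug impossible is to build the path once and use the same variable for both the write and the message. A sketch with illustrative values:

```python
import json
import os

# Illustrative values; in the notebook these come from the angle/energy settings.
output_directory = 'output/45to135deg/1gev'
os.makedirs(output_directory, exist_ok=True)
results = {'rejection': 123.4, 'efficiency': 0.95}

# Derive the path once so the write and the printed message cannot diverge.
results_path = os.path.join(output_directory, 'results.json')
with open(results_path, 'w') as f:
    json.dump(results, f, indent=2)
print(f' - Results written to {results_path}')
```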
Briefly, what does this PR introduce?
Add a benchmark to generate trained ML models for e/pi separation in the BIC.
What kind of change does this PR introduce?
Please check if this PR fulfills the following:
Does this PR introduce breaking changes? What changes might users need to make to their code?
No
Does this PR change default behavior?
No