diff --git a/AI-ML/app-level-benchmark/README.md b/AI-ML/app-level-benchmark/README.md index 13f6e13..5faac52 100644 --- a/AI-ML/app-level-benchmark/README.md +++ b/AI-ML/app-level-benchmark/README.md @@ -39,61 +39,21 @@ pip3 install torch torchvision torchaudio --index-url https://download.pytorch.o pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu ``` -*Any version of PyTorch that might be optimized for a targeted hardware architecture is acceptable for this benchmark, as long as the distribution is widely available and its results can be reproduced on any system hosting the hardware in question.* +*Any version of PyTorch that might be optimized for a targeted hardware architecture not mentioned above is acceptable for this benchmark, as long as the distribution is widely available and its results can be reproduced on any system hosting the hardware in question.* ### Step 2: DeepCAM -Install the DeepCAM Python package dependencies from pip and/or conda on inside the PyTorch environment from step 1. For example, since Kestrel (NLR's reference system) uses NVIDIA hardware, we install the following. Note that the specific packages may change depending on the type of accelerator being tested. As per our [general baseline run rules](../../README.md#draft-definitions-for-baselineas-is-ported-and-optimized-runs), the Offeror may freely substitute publicly available packages/libraries as necessary for baseline submissions. +Install the DeepCAM Python package dependencies from pip and/or conda inside the PyTorch environment from step 1. For an example, see the Slurm script [`prep-env-kestrel.sh`](prep-env-kestrel.sh), which documents how we created the appropriate PyTorch environment to run the DeepCAM benchmark.
Note that on Kestrel, we explicitly compile PyTorch against a system module for NCCL that is configured to work with the HPE Slingshot network (`nccl/2.23.4_cuda124`) rather than using a precompiled version from pip. This step may not be necessary depending on the Offeror's hardware and network configuration. -``` -conda activate $ENV_NAME - -# -echo "h5py -basemap -wandb -sympy -filelock -fsspec -jinja2 -networkx -mlperf-logging -git+https://github.com/NVIDIA/mlperf-common.git -nvidia-ml-py -cupy -" > deepcam-requirements.txt -pip install -r deepcam-requirements.txt - -# DALI -pip install --extra-index-url https://pypi.nvidia.com --upgrade nvidia-dali-cuda130 - -# mpi4py -mpicc=`which mpicc` pip install mpi4py --no-cache-dir - -# io_helpers - from NVIDIA DeepCAM MLCommons HPC v3.0 submission folder -cd deepcam-mlcommons-hpcv3/io_helpers -python setup.py clean -python setup.py install - -# APEX -if [ ! -d apex ]; then - git clone https://github.com/NVIDIA/apex -fi -cd apex -APEX_CPP_EXT=1 APEX_CUDA_EXT=1 pip install -v --no-build-isolation --disable-pip-version-check . -``` +The training scripts for DeepCAM do not require any special installation once the above environment is created. For convenience, this repository contains a lightly modified version of [the NVIDIA submission to MLCommons HPC Results v3.0](https://github.com/mlcommons/hpc_results_v3.0/tree/main/NVIDIA/benchmarks/deepcam/implementations/pytorch) (in `deepcam-mlcommons-hpcv3/`) to enable DeepCAM to run with newer versions (>=2.3.0) of PyTorch (specifically, by updating the calls to `MultiStepLRWarmup` and `CosineAnnealingLRWarmup` in `schedulers.py` to reflect the newer API). As demonstrated in [`prep-env-kestrel.sh`](prep-env-kestrel.sh), the `io_helpers` package can also be installed from the submitted NVIDIA implementation folder. -The training scripts for DeepCAM do not require any special installation once the above environment is created.
For convenience, this repository contains a lightly modified version of [the NVIDIA submission to MLCommons HPC Results v3.0](https://github.com/mlcommons/hpc_results_v3.0/tree/main/NVIDIA/benchmarks/deepcam/implementations/pytorch) (`./deepcam-mlcommons-hpcv3`) to enable DeepCAM to run with newer versions (>=2.3.0) of PyTorch (specifically, by updating the calls to `MultiStepLRWarmup` and `CosineAnnealingLRWarmup` in `schedulers.py` to reflect the newer API). As demonstrated in the chunk above, the `io_helpers` package can also be installed from the submitted NVIDIA implementation folder. +Note: The specific packages mentioned in [`prep-env-kestrel.sh`](prep-env-kestrel.sh) may change depending on the type of accelerator being tested. As per our [general baseline run rules](../../README.md#draft-definitions-for-baselineas-is-ported-and-optimized-runs), the Offeror may freely substitute publicly available packages/libraries (and their subsequent calls in the training code itself) as necessary for baseline submissions. ### Step 3: Download and preprocess training data Input training data can be downloaded via Globus using the [endpoint linked here](https://app.globus.org/file-manager?origin_id=0b226e2c-4de0-11ea-971a-021304b0cca7&origin_path=%2F). Note that the training data requires roughly 10TB of storage and contains HDF5-formatted files for training, validation, and test splits. -Note that before training can occur, submitters must convert the HDF5-formatted input data into numpy format, following guidance from MLCommons HPC Results v3.0. Please see [`preprocess-deepcam-data.sh`](./preprocess-deepcam-data.sh) for instructions on how to preprocess the input data accordingly. - -### Kestrel build example - -See the Slurm script [`prep-env-kestrel.sh`](prep-env-kestrel.sh) for reference instructions on how we created the appropriate PyTorch environment to run the DeepCAM benchmark following the general guidance above. 
Note that on Kestrel, we explicitly compile PyTorch against a system module for NCCL that is configured to work with the HPE Slingshot network (`nccl/2.23.4_cuda124`) rather than using a precompiled version from pip. This step may not be necessary depending on your hardware and network configuration. +Before training can occur, submitters must convert the HDF5-formatted input data into numpy format, following guidance from MLCommons HPC Results v3.0. Please see [`preprocess-deepcam-data.sh`](./preprocess-deepcam-data.sh) for instructions on how to preprocess the input data accordingly. ## Run Definitions and Requirements @@ -111,8 +71,8 @@ There are three types of submissions possible for this benchmark: *baseline*, *p The ESIF-HPC-4 DeepCAM benchmark encompasses **two** types of scenarios: -- **Scenario 1.** The *local* (i.e., per-device) batch size is fixed to `12`. In this scenario, the reported metric is the average time required per training step over 5 epochs. Device-level "weak scaling" is intended to be measured in this scenario; this test should span *N*, *4N*, and *8N* nodes (in which *N* may equal 1) accordingly. We ask for 3 replicates for each node size, for a total of 15 runs. Model convergence is **not** required in this scenario. This scenario requires enabling verbose, per-step logging via the environment variable `LOGGING_FREQUENCY` - see [below](#baseline-scenario-1). -- **Scenario 2.** The *global* batch size is fixed to `1024`, with no specific requirement on the local batch size accordingly. In this scenario, the reported metric is the time required to reach an evaluation accuracy of 82%. This test should span *M*, *4M*, and *8M* nodes (in which *M* may equal 1) accordingly (*N* and *M* may vary between Scenario 1 and Scenario 2). We ask for 3 replicates for each node size, for a total of 15 runs. 
**Results from scenario 2 are what will be considered as part of the overall throughput metric.** +- **Scenario 1.** The *local* (i.e., per-device) batch size is fixed to `12`. In this scenario, the reported metric is the average time required per training step over 3 epochs. Device-level "weak scaling" is intended to be measured in this scenario; this test should span *N*, *4N*, and *8N* nodes (in which *N* may equal 1) accordingly. We ask for 3 replicates for each node size, for a total of 9 runs. Model convergence is **not** required in this scenario. This scenario requires enabling verbose, per-step logging via the environment variable `LOGGING_FREQUENCY` - see [below](#baseline-scenario-1). +- **Scenario 2.** The *global* batch size is fixed to `1024`, with no specific requirement on the local batch size accordingly. In this scenario, the reported metric is the time required to reach an evaluation accuracy of 82%. This test should span *M*, *4M*, and *8M* nodes (in which *M* may equal 1) accordingly (*N* and *M* may vary between Scenario 1 and Scenario 2). We ask for 3 replicates for each node size, for a total of 9 runs. **Results from Scenario 2 are what will be considered as part of the overall throughput metric.** **Summary of tests** @@ -149,16 +109,16 @@ The following environment variables set in [`config_scenario1.sh`](./config_scen The following environment variables set in [`config_scenario1.sh`](./config_scenario1.sh) **must be set** for *baseline* **Scenario 1** submissions: -| Variable | Description | Required value | -| :-- | :-- | :-- | -| `LOGGING_FREQUENCY` | Whether to gather logs per-step (`1`) or per-epoch (`0`) | `1` | -| `MAX_EPOCHS` | Number of epochs at which training ends, regardless of convergence. 
| `5` | -| `LOCAL_BATCH_SIZE` | Per-accelerator batch size | `12` | -| `START_LR` | Starting learning rate | `0.0005` | -| `LR_SCHEDULE_TYPE` | Learning rate scheduler type | `cosine_annealing` | -| `LR_WARMUP_STEPS` | Number of LR warmup steps | `0` | -| `OPTIMIZER` | Learning rate optimizer | `AdamW` | -| `WEIGHT_DECAY` | Strength of L2 regularization | `0.2` | +| Variable | Description | Required value | +| :-- | :-- | :-- | +| `LOGGING_FREQUENCY` | Whether to gather logs per-step (`1`) or per-epoch (`0`) | `1` | +| `MAX_EPOCHS` | Number of epochs at which training ends, regardless of convergence. | `3` | +| `LOCAL_BATCH_SIZE` | Per-accelerator batch size | `12` | +| `START_LR` | Starting learning rate | `0.0005` | +| `LR_SCHEDULE_TYPE` | Learning rate scheduler type | `cosine_annealing` | +| `LR_WARMUP_STEPS` | Number of LR warmup steps | `0` | +| `OPTIMIZER` | Learning rate optimizer | `AdamW` | +| `WEIGHT_DECAY` | Strength of L2 regularization | `0.2` | #### Baseline Scenario 2 @@ -175,19 +135,19 @@ The following environment variables set in [`config_scenario2.sh`](./config_scen The following environment variables set in [`config_scenario2.sh`](./config_scenario2.sh) **must be set** for *baseline* Scenario 2 submissions: -| Variable | Description | Required value | -| :-- | :-- | :-- | -| `LOGGING_FREQUENCY` | Whether to gather logs per-step (`1`) or per-epoch (`0`) | `0` | -| `MAX_EPOCHS` | Number of epochs at which training ends, regardless of convergence.
| `50` | -| `START_LR` | Starting learning rate | `0.0005` | -| `LR_SCHEDULE_TYPE` | Learning rate scheduler type | `cosine_annealing` | -| `LR_WARMUP_STEPS` | Number of LR warmup steps | `0` | -| `OPTIMIZER` | Learning rate optimizer | `AdamW` | -| `WEIGHT_DECAY` | Strength of L2 regularization | `0.2` | +| Variable | Description | Required value | +| :-- | :-- | :-- | +| `LOGGING_FREQUENCY` | Whether to gather logs per-step (`1`) or per-epoch (`0`) | `0` | +| `MAX_EPOCHS` | Number of epochs at which training ends, regardless of convergence. | `50` | +| `START_LR` | Starting learning rate | `0.0005` | +| `LR_SCHEDULE_TYPE` | Learning rate scheduler type | `cosine_annealing` | +| `LR_WARMUP_STEPS` | Number of LR warmup steps | `0` | +| `OPTIMIZER` | Learning rate optimizer | `AdamW` | +| `WEIGHT_DECAY` | Strength of L2 regularization | `0.2` | ### Ported submissions -For *ported* submissions, the *baseline* parameters must be used, though training code modifications necessary to port the code to a new/different device architecture are also permitted. As described in the repository's [top-level README](../../README.md#draft-definitions-for-baselineas-is-ported-and-optimized-runs), *ported* submissions should not be reported without *baseline*, unless *baseline* is not possible. +For *ported* submissions, the *baseline* parameters must be used, though training code modifications necessary to port the code to a new/different device architecture are also permitted. **This includes replacing any vendor-specific package/library imports and calls in the training code.** As described in the repository's [top-level README](../../README.md#draft-definitions-for-baselineas-is-ported-and-optimized-runs), *ported* submissions should not be reported without *baseline*, unless *baseline* is not possible. 
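To illustrate the bolded allowance above, a ported submission might guard a vendor-specific import behind a portable fallback so the training code runs on other accelerators. The sketch below is a hypothetical illustration only: the helper name and the default package names (NVIDIA DALI, the plain PyTorch loader) are assumptions for the example and are not part of the benchmark's run rules.

```python
import importlib

def resolve_dataloader_backend(preferred="nvidia.dali", fallback="torch.utils.data"):
    """Return the name of the first importable data-loading backend.

    `preferred` and `fallback` are illustrative defaults; a ported
    submission would substitute whatever packages its hardware supports.
    """
    for name in (preferred, fallback):
        try:
            # Attempt the import; ModuleNotFoundError subclasses ImportError,
            # so missing vendor packages simply fall through to the fallback.
            importlib.import_module(name)
            return name
        except ImportError:
            continue
    return None
```

The training code would then branch on the returned name (for example, swapping a DALI pipeline for a plain `DataLoader`) while keeping the baseline hyperparameters untouched.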
### Optimized submissions diff --git a/AI-ML/app-level-benchmark/config_common.sh b/AI-ML/app-level-benchmark/config_common.sh new file mode 100644 index 0000000..acb654e --- /dev/null +++ b/AI-ML/app-level-benchmark/config_common.sh @@ -0,0 +1,265 @@ +#!/bin/bash + +# These are common variables for each training scenario. This file should generally not need to be modified, +# except under certain circumstances such as modifying SLURM_* and/or SRUN_* variables if a non-Slurm scheduler is used. + +# create output directory +mkdir -p ${OUTPUT_DIR} + +# other learning rate hyperparameters +export LR_T_MAX=9000 +export LR_ETA_MIN=0.0 +export LR_WARMUP_FACTOR=1. +# These variables are only required if LR_SCHEDULE_TYPE="multistep": +# export LR_MILESTONES="" +# export LR_DECAY_RATE="" + +# data parameters +export SHUFFLE_MODE="global" +export DATA_FORMAT="dali-numpy" +export PRECISION_MODE="amp" +export LOCAL_VALIDATION_BATCH_SIZE=8 + +# staging parameters +if [ ! -z $STAGE_DIR_PREFIX ]; then + mkdir -p $STAGE_DIR_PREFIX +fi +export STAGE_BATCH_SIZE=8 +export STAGE_MODE="global" +export STAGE_VERIFY=0 +export STAGE_FULL_DATA_PER_NODE=0 +export STAGE_USE_DIRECT_IO=0 # note: leads to a segfault on Kestrel when =1 +export STAGE_NUM_READ_WORKERS=6 # can be freely tuned as necessary +export STAGE_NUM_WRITE_WORKERS=12 # can be freely tuned as necessary + +# this is for some global parameters: +export ADDITIONAL_ARGS="--disable_tuning --enable_graph --disable_comm_overlap" +export ADDITIONAL_SRUN_ARGS="--no-kill" + +# direct io settings +export DALI_ODIRECT_ALIGNMENT=4096 +export DALI_ODIRECT_LEN_ALIGNMENT=4096 + +# run parameters +export RUN_TAG=${RUN_TAG:-${SLURM_JOB_ID}} +export ENABLE_IB_BINDING=0 + +# system parameters +export DGXNNODES=$SLURM_NNODES +export TOTALGPUS=$(( ${DGXNNODES} * ${DGXNGPU} )) +export WALLTIME=01:00:00 + +# system parameters (optional, not necessary to modify) +export DGXSYSTEM=$(basename $(readlink -f ${BASH_SOURCE[0]}) | sed 's/^config_//' | sed 
's/\.sh$//' ) +export BASE_COMP_CLOCK=1980 # obtained via nvidia-smi for SXM H100 HBM3 +export BASE_MEM_CLOCK=2619 # obtained via nvidia-smi for SXM H100 HBM3 + +# DO NOT MODIFY THESE VARIABLES - Placeholders +export TRAINING_INSTANCE_SIZE=${TOTALGPUS} # Must equal number of GPUs to use during training +export BATCHNORM_GROUP_SIZE=1 +export NEXP=1 +export NUM_INSTANCES=1 +export gpu_config=${TOTALGPUS} + +##### DO NOT MODIFY THESE VARIABLES - parameters for run script based on values set above ##### +# LR switch +if [ -z ${LR_SCHEDULE_TYPE} ]; then + lr_schedule_arg="" +elif [ "${LR_SCHEDULE_TYPE}" == "multistep" ]; then + lr_schedule_arg="--lr_schedule type=${LR_SCHEDULE_TYPE},milestones=${LR_MILESTONES},decay_rate=${LR_DECAY_RATE}" +elif [ "${LR_SCHEDULE_TYPE}" == "cosine_annealing" ]; then + lr_schedule_arg="--lr_schedule type=${LR_SCHEDULE_TYPE},t_max=${LR_T_MAX},eta_min=${LR_ETA_MIN}" +fi + +# GDS switch +if [ "${ENABLE_GDS}" == "1" ]; then + ADDITIONAL_ARGS="${ADDITIONAL_ARGS} --enable_gds" +fi + +# ignore stop switch +if [ "${MLPERF_POWER_TRAIN_AFTER_RUN_STOP}" == "1" ]; then + MIN_EPOCHS=${MAX_EPOCHS} +fi + +PARAMS=( + --wireup_method ${WIREUP_METHOD} + --run_tag ${RUN_TAG} + --experiment_id ${EXP_ID:-1} + --data_dir_prefix ${DATA_DIR_PREFIX} + --output_dir ${OUTPUT_DIR} + --model_prefix "segmentation" + --optimizer ${OPTIMIZER} + --start_lr ${START_LR} + ${lr_schedule_arg} + --lr_warmup_steps ${LR_WARMUP_STEPS} + --lr_warmup_factor ${LR_WARMUP_FACTOR} + --weight_decay ${WEIGHT_DECAY} + --logging_frequency ${LOGGING_FREQUENCY} + --save_frequency 0 + --min_epochs ${MIN_EPOCHS:-0} + --max_epochs ${MAX_EPOCHS:-200} + --data_num_threads ${MAX_THREADS:-4} + --seed ${SEED:-1} + --batchnorm_group_size ${BATCHNORM_GROUP_SIZE} + --shuffle_mode "${SHUFFLE_MODE}" + --data_format "${DATA_FORMAT}" + --data_oversampling_factor ${DATA_OVERSAMPLING_FACTOR:-1} + --precision_mode "${PRECISION_MODE}" + --enable_nhwc + --local_batch_size ${LOCAL_BATCH_SIZE} + 
--local_batch_size_validation ${LOCAL_VALIDATION_BATCH_SIZE} + ${ADDITIONAL_ARGS} +) + +# profile command: +if [ ! -z ${OMPI_COMM_WORLD_RANK} ]; then + WORLD_RANK=${OMPI_COMM_WORLD_RANK} +elif [ ! -z ${PMIX_RANK} ]; then + WORLD_RANK=${PMIX_RANK} +elif [ ! -z ${PMI_RANK} ]; then + WORLD_RANK=${PMI_RANK} +fi +PROFILE_BASE_CMD="nsys profile --mpi-impl=openmpi --trace=cuda,cublas,nvtx,mpi --cuda-graph-trace=node --kill none -c cudaProfilerApi -f true -o ${OUTPUT_DIR}/profile_job${SLURM_JOBID}_rank${WORLD_RANK}" +ANNA_BASE_CMD="nsys profile --trace cuda,nvtx --sample cpu --output ${OUTPUT_DIR}/anna_job${SLURM_JOBID}_rank${WORLD_RANK} --export sqlite --force-overwrite true --stop-on-exit true --capture-range cudaProfilerApi --capture-range-end stop --kill none" +DLPROF_BASE_CMD="dlprof --mode=pytorch --force=true --reports=summary,detail,iteration --nsys_profile_range=true --output_path=${OUTPUT_DIR} --profile_name=dlprof_rank${WORLD_RANK}" +METRICS_BASE_CMD="ncu --target-processes=all --profile-from-start=off --nvtx --print-summary=per-nvtx --csv -f -o ${OUTPUT_DIR}/metrics_rank${WORLD_RANK} --metrics=smsp__sass_thread_inst_executed_op_hadd_pred_on.sum,smsp__sass_thread_inst_executed_op_hmul_pred_on.sum,smsp__sass_thread_inst_executed_op_hfma_pred_on.sum,smsp__sass_thread_inst_executed_op_fadd_pred_on.sum,smsp__sass_thread_inst_executed_op_fmul_pred_on.sum,smsp__sass_thread_inst_executed_op_ffma_pred_on.sum,sm__inst_executed_pipe_tensor.sum" + +if [[ ${ENABLE_PROFILING} == 1 ]]; then + if [[ ${ENABLE_METRICS_COLLECTION} == 1 ]]; then + echo "Metric Collection enabled" + if [[ "${WORLD_RANK}" == "0" ]]; then + PROFILE_CMD=${METRICS_BASE_CMD} + else + PROFILE_CMD="" + fi + elif [[ ${ENABLE_DLPROF} == 1 ]]; then + echo "Dlprof enabled" + if [[ "${WORLD_RANK}" == "0" ]]; then + PROFILE_CMD=${DLPROF_BASE_CMD} + else + PROFILE_CMD="" + fi + PARAMS+=(--profile_markers=dlprof) + elif [[ ${ENABLE_ANNA} == 1 ]]; then + echo "ANNA enabled" + if [[ "${WORLD_RANK}" == "0" ]]; then 
+ PROFILE_CMD=${ANNA_BASE_CMD} + else + PROFILE_CMD="" + fi + PARAMS+=(--profile_markers=anna) + else + echo "Profiling enabled" + PROFILE_CMD=${PROFILE_BASE_CMD} + fi +elif [[ ${API_LOGGING} == 1 ]]; then + echo "ApiLog enabled" + if [ ${SLURM_PROCID} == 0 ]; then + PROFILE_CMD="apiLog.sh" + else + PROFILE_CMD="" + fi +else + PROFILE_CMD="" +fi + +if [[ ${DEBUG_MEMCHECK} == 1 ]]; then + echo "Debugging enabled" + DEBUG_CMD="compute-sanitizer --tool=memcheck" +else + DEBUG_CMD="" +fi + +IB_BIND='' +if [[ "${SLURM_JOB_NUM_NODES}" -gt 1 && "${ENABLE_IB_BINDING}" -eq 1 ]]; then + IB_BIND='--ib=single' +fi +BIND_BASE_CMD="bindpcie --cpu=exclusive ${IB_BIND} --" +BIND="${BIND_CMD:-${BIND_BASE_CMD}}" + +if [ "$LOGGER" = "apiLog.sh" ]; +then + LOGGER="${LOGGER} -p MLPerf/${MODEL_NAME} -v ${FRAMEWORK}/train/${DGXSYSTEM}" + readonly node_rank="${SLURM_NODEID:-0}" + readonly local_rank="${LOCAL_RANK:=${SLURM_LOCALID:=${OMPI_COMM_WORLD_LOCAL_RANK:-}}}" + if [ "$node_rank" -eq 0 ] && [ "$local_rank" -eq 0 ]; + then + LOGGER=$LOGGER + else + LOGGER="" + fi +fi + +# do we cache data +if [ ! -z ${DATA_CACHE_DIRECTORY} ]; then + PARAMS+=(--data_cache_directory ${DATA_CACHE_DIRECTORY}) +fi + +# run script selection: +if [ ! -z ${TRAINING_INSTANCE_SIZE} ]; then + # echo "Running Multi Instance Training" + RUN_SCRIPT="./train_instance_oo.py" + PARAMS+=(--training_instance_size ${TRAINING_INSTANCE_SIZE}) + + if [ ! 
-z ${STAGE_DIR_PREFIX} ]; then + PARAMS+=( + --stage_dir_prefix ${STAGE_DIR_PREFIX} + --stage_num_read_workers ${STAGE_NUM_READ_WORKERS:-1} + --stage_num_write_workers ${STAGE_NUM_WRITE_WORKERS:-1} + --stage_batch_size ${STAGE_BATCH_SIZE:--1} + --stage_mode ${STAGE_MODE:-"node"} + --stage_max_num_files ${STAGE_MAX_NUM_FILES:--1} + ) + # do we need to verify the staging results + if [ "${STAGE_VERIFY:-0}" -eq 1 ]; then + PARAMS+=(--stage_verify) + fi + if [ "${STAGE_ONLY:-0}" -eq 1 ]; then + echo "WARNING: You are about to run a staging only benchmark" + PARAMS+=(--stage_only) + fi + if [ "${STAGE_FULL_DATA_PER_NODE:-0}" -eq 1 ]; then + PARAMS+=(--stage_full_data_per_node) + fi + if [ "${STAGE_ARCHIVES:-0}" -eq 1 ]; then + PARAMS+=(--stage_archives) + fi + if [ "${STAGE_USE_DIRECT_IO:-0}" -eq 1 ]; then + PARAMS+=(--stage_use_direct_io) + fi + if [ "${STAGE_READ_ONLY:-0}" -eq 1 ]; then + PARAMS+=(--stage_read_only) + fi + fi +else + echo "Running Single Instance Training" + RUN_SCRIPT="./train.py" +fi + +# decide whether to enable profiling +if [ ! -z ${ENABLE_PROFILING} ] && [ ${ENABLE_PROFILING} == 1 ]; then + echo "Running Profiling" + if [ ! -z ${TRAINING_INSTANCE_SIZE} ]; then + RUN_SCRIPT="./train_instance_oo_profile.py" + else + RUN_SCRIPT="./train_profile.py" + fi + + if [ ! -z ${CAPTURE_RANGE_START} ]; then + PARAMS+=( + --capture_range_start ${CAPTURE_RANGE_START} + --capture_range_stop ${CAPTURE_RANGE_STOP} + ) + fi + + if [ ! 
-z ${PROFILE_FRACTION} ]; then + PARAMS+=(--profile_fraction ${PROFILE_FRACTION}) + fi +fi + +# Export final variables for run_and_time.sh +RUN_CMD="${RUN_SCRIPT} ${PARAMS[@]}" +export RUN_CMD +export LOGGER +export PROFILE_CMD +export DEBUG_CMD diff --git a/AI-ML/app-level-benchmark/config_scenario1.sh b/AI-ML/app-level-benchmark/config_scenario1.sh index 72e4859..de9ed94 100644 --- a/AI-ML/app-level-benchmark/config_scenario1.sh +++ b/AI-ML/app-level-benchmark/config_scenario1.sh @@ -1,20 +1,20 @@ #!/bin/bash # user inputs -export DATA_DIR_PREFIX="/scratch/$USER/deepcam/numpy" # path to preprocessed numpy-formatted data +export DATA_DIR_PREFIX="/scratch/$USER/deepcam_numpy_preprocessed14Jan26" # path to preprocessed numpy-formatted data export OUTPUT_DIR="/scratch/$USER/DeepCAM-testing/results/$SLURM_JOB_ID" # output directory for training logs -#### BASELINE: CAN CHANGE THESE! #### +#### BASELINE/PORTED: CAN CHANGE THESE! #### #export STAGE_DIR_PREFIX="$TMPDIR/deepcam_staging" # If this variable is missing, no data staging occurs. export WIREUP_METHOD="nccl-slurm" export DGXNGPU=4 # Number of accelerators per node export MAX_THREADS=4 # Number of data loading threads per node #### -#### BASELINE: DO NOT CHANGE THESE! #### +#### BASELINE/PORTED: DO NOT CHANGE THESE! #### export LOGGING_FREQUENCY=1 # Must be set to 1 for 'Scenario 1' -export MAX_EPOCHS=5 # Must be set to 5 for 'Scenario 1' -export LOCAL_BATCH_SIZE=8 # Per-accelerator batch size +export MAX_EPOCHS=3 # Must be set to 3 for 'Scenario 1' +export LOCAL_BATCH_SIZE=12 # Per-accelerator batch size export START_LR=0.0001 # Starting learning rate. Roughly 10X lower than target end LR. 
export LR_SCHEDULE_TYPE="cosine_annealing" # Learning rate scheduler type export LR_WARMUP_STEPS=0 # Not necessary to set for 'Scenario 1' @@ -22,63 +22,6 @@ export OPTIMIZER="AdamW" # Learning rate optimizer export WEIGHT_DECAY=0.2 # L2 regularization factor - 0.2 is good for AdamW, 0.01 good for LAMB #### -# These variables are only required if LR_SCHEDULE_TYPE="multistep" -# export LR_MILESTONES="" -# export LR_DECAY_RATE="" - -# other hyperparameters -export LR_T_MAX=9000 -export LR_ETA_MIN=0.0 -export LR_WARMUP_FACTOR=1. -export BATCHNORM_GROUP_SIZE=1 - -# this is for some global parameters: -export ADDITIONAL_ARGS="--disable_tuning" - -# direct io settings -export DALI_ODIRECT_ALIGNMENT=4096 -export DALI_ODIRECT_LEN_ALIGNMENT=4096 - -# run parameters -export NEXP="${NEXP:-10}" - -# system parameters -export DGXSYSTEM=$(basename $(readlink -f ${BASH_SOURCE[0]}) | sed 's/^config_//' | sed 's/\.sh$//' ) -export BASE_COMP_CLOCK=1980 # obtained via nvidia-smi for SXM H100 HBM3 -export BASE_MEM_CLOCK=2619 # obtained via nvidia-smi for SXM H100 HBM3 - -# data parameters -export SHUFFLE_MODE="global" -export DATA_FORMAT="dali-numpy" -export PRECISION_MODE="amp" -export LOCAL_VALIDATION_BATCH_SIZE=8 - -# staging parameter -if [ ! -z $STAGE_DIR_PREFIX ]; then - mkdir -p $STAGE_DIR_PREFIX -fi -export STAGE_BATCH_SIZE=8 -export STAGE_MODE="global" -export STAGE_VERIFY=0 -export STAGE_FULL_DATA_PER_NODE=0 -export STAGE_USE_DIRECT_IO=0 -#export STAGE_USE_DIRECT_IO=1 # this leads to a segfault -export STAGE_NUM_READ_WORKERS=6 -export STAGE_NUM_WRITE_WORKERS=12 - -# misc args -export ADDITIONAL_SRUN_ARGS="--no-kill" -export ADDITIONAL_ARGS="${ADDITIONAL_ARGS} --enable_graph --disable_comm_overlap" - -# number of experiments -export NEXP=1 -export NUM_INSTANCES=1 - -# system parameters -export DGXNNODES=$SLURM_NNODES -export WALLTIME=01:00:00 - -# final things -if [ ! 
-z $STAGE_DIR_PREFIX ]; then - mkdir -p $STAGE_DIR_PREFIX -fi \ No newline at end of file +# common configuration settings +# note: CONFIG_DIR is set in run_and_time_kestrel.sh +source ${CONFIG_DIR}/config_common.sh diff --git a/AI-ML/app-level-benchmark/config_scenario2.sh b/AI-ML/app-level-benchmark/config_scenario2.sh index 2ccec92..3c316d3 100644 --- a/AI-ML/app-level-benchmark/config_scenario2.sh +++ b/AI-ML/app-level-benchmark/config_scenario2.sh @@ -4,15 +4,15 @@ export DATA_DIR_PREFIX="/scratch/$USER/deepcam/numpy" # path to preprocessed numpy-formatted data export OUTPUT_DIR="/scratch/$USER/DeepCAM-testing/results/$SLURM_JOB_ID" # output directory for training logs -#### BASELINE: CAN CHANGE THESE! #### +#### BASELINE/PORTED: CAN CHANGE THESE! #### #export STAGE_DIR_PREFIX="$TMPDIR/deepcam_staging" # If this variable is missing, no data staging occurs. export WIREUP_METHOD="nccl-slurm" export DGXNGPU=4 # Number of accelerators per node export MAX_THREADS=4 # Number of data loading threads per node -export LOCAL_BATCH_SIZE=8 # Per-accelerator batch size (`DGXNGPU`\*`NUMBER_OF_NODES`\*`LOCAL_BATCH_SIZE` must equal `1024` +export LOCAL_BATCH_SIZE=12 # Per-accelerator batch size (`DGXNGPU`\*`NUMBER_OF_NODES`\*`LOCAL_BATCH_SIZE` must equal `1024`) #### -#### BASELINE: DO NOT CHANGE THESE! #### +#### BASELINE/PORTED: DO NOT CHANGE THESE! #### export LOGGING_FREQUENCY=0 # Must be set to 0 for 'Scenario 2' export MAX_EPOCHS=50 # Must be set to 50 for 'Scenario 2' export START_LR=0.0001 # Starting learning rate. Roughly 10X lower than target end LR. @@ -22,63 +22,6 @@ export OPTIMIZER="AdamW" # Learning rate optimizer export WEIGHT_DECAY=0.2 # L2 regularization factor - 0.2 is good for AdamW, 0.01 good for LAMB #### -# These variables are only required if LR_SCHEDULE_TYPE="multistep" -# export LR_MILESTONES="" -# export LR_DECAY_RATE="" - -# other hyperparameters -export LR_T_MAX=9000 -export LR_ETA_MIN=0.0 -export LR_WARMUP_FACTOR=1.
-export BATCHNORM_GROUP_SIZE=1 - -# this is for some global parameters: -export ADDITIONAL_ARGS="--disable_tuning" - -# direct io settings -export DALI_ODIRECT_ALIGNMENT=4096 -export DALI_ODIRECT_LEN_ALIGNMENT=4096 - -# run parameters -export NEXP="${NEXP:-10}" - -# system parameters -export DGXSYSTEM=$(basename $(readlink -f ${BASH_SOURCE[0]}) | sed 's/^config_//' | sed 's/\.sh$//' ) -export BASE_COMP_CLOCK=1980 # obtained via nvidia-smi for SXM H100 HBM3 -export BASE_MEM_CLOCK=2619 # obtained via nvidia-smi for SXM H100 HBM3 - -# data parameters -export SHUFFLE_MODE="global" -export DATA_FORMAT="dali-numpy" -export PRECISION_MODE="amp" -export LOCAL_VALIDATION_BATCH_SIZE=8 - -# staging parameter -if [ ! -z $STAGE_DIR_PREFIX ]; then - mkdir -p $STAGE_DIR_PREFIX -fi -export STAGE_BATCH_SIZE=8 -export STAGE_MODE="global" -export STAGE_VERIFY=0 -export STAGE_FULL_DATA_PER_NODE=0 -export STAGE_USE_DIRECT_IO=0 -#export STAGE_USE_DIRECT_IO=1 # this leads to a segfault -export STAGE_NUM_READ_WORKERS=6 -export STAGE_NUM_WRITE_WORKERS=12 - -# misc args -export ADDITIONAL_SRUN_ARGS="--no-kill" -export ADDITIONAL_ARGS="${ADDITIONAL_ARGS} --enable_graph --disable_comm_overlap" - -# number of experiments -export NEXP=1 -export NUM_INSTANCES=1 - -# system parameters -export DGXNNODES=$SLURM_NNODES -export WALLTIME=01:00:00 - -# final things -if [ ! 
-z $STAGE_DIR_PREFIX ]; then - mkdir -p $STAGE_DIR_PREFIX -fi \ No newline at end of file +# common configuration settings +# note: CONFIG_DIR is set in run_and_time_kestrel.sh +source ${CONFIG_DIR}/config_common.sh diff --git a/AI-ML/app-level-benchmark/run_and_time_kestrel.sh b/AI-ML/app-level-benchmark/run_and_time_kestrel.sh index 7e144ec..c3e7eee 100644 --- a/AI-ML/app-level-benchmark/run_and_time_kestrel.sh +++ b/AI-ML/app-level-benchmark/run_and_time_kestrel.sh @@ -21,266 +21,34 @@ # IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN # CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. -# -# runs benchmark and reports time to convergence -# to use the script: -# run_and_time.sh +# CONFIG_DIR is the folder containing the config files (assumes they are in the same folder as this run script) +export CONFIG_DIR=$(dirname "$(realpath $0)") # Load DeepCAM environment module load mamba nccl/2.23.4_cuda124 cudnn DEEPCAM_WORK_DIR=/projects/esifapps/$USER/DeepCAM-testing_torch2.9 mkdir -p $DEEPCAM_WORK_DIR -cd $DEEPCAM_WORK_DIR PYTHON_VERSION=3.11 PYTORCH_VERSION=2.9.0 -ENV_NAME=`pwd`/deepcam-torch${PYTORCH_VERSION}-env-py${PYTHON_VERSION}_HPC-v3 +ENV_NAME=${DEEPCAM_WORK_DIR}/deepcam-torch${PYTORCH_VERSION}-env-py${PYTHON_VERSION}_HPC-v3 eval "$(conda shell.bash hook)" conda activate $ENV_NAME # config*.sh controls many environment variables # make sure the correct config file for the intended scenario is loaded! -source config_scenario1.sh -# source config_scenario2.sh - -# start timing -start=$(date +%s) -start_fmt=$(date +%Y-%m-%d\ %r) -echo "STARTING TIMING RUN AT $start_fmt" - -ENABLE_IB_BINDING=0 - -# assemble launch command -export TOTALGPUS=$(( ${SLURM_NNODES} * ${DGXNGPU} )) -if [ ! 
-z ${TRAINING_INSTANCE_SIZE} ]; then - gpu_config="$(( ${TOTALGPUS} / ${TRAINING_INSTANCE_SIZE} ))x${TRAINING_INSTANCE_SIZE}" -else - gpu_config=${TOTALGPUS} -fi -export RUN_TAG=${RUN_TAG:-${SLURM_JOB_ID}} - - -# create output directory -mkdir -p ${OUTPUT_DIR} - -# LR switch -if [ -z ${LR_SCHEDULE_TYPE} ]; then - lr_schedule_arg="" -elif [ "${LR_SCHEDULE_TYPE}" == "multistep" ]; then - lr_schedule_arg="--lr_schedule type=${LR_SCHEDULE_TYPE},milestones=${LR_MILESTONES},decay_rate=${LR_DECAY_RATE}" -elif [ "${LR_SCHEDULE_TYPE}" == "cosine_annealing" ]; then - lr_schedule_arg="--lr_schedule type=${LR_SCHEDULE_TYPE},t_max=${LR_T_MAX},eta_min=${LR_ETA_MIN}" -fi +source ${CONFIG_DIR}/config_scenario1.sh +# source ${CONFIG_DIR}/config_scenario2.sh -# GDS switch -if [ "${ENABLE_GDS}" == "1" ]; then - ADDITIONAL_ARGS="${ADDITIONAL_ARGS} --enable_gds" -fi - -# ignore stop switch -if [ "${MLPERF_POWER_TRAIN_AFTER_RUN_STOP}" == "1" ]; then - MIN_EPOCHS=${MAX_EPOCHS} -fi - -PARAMS=( - --wireup_method ${WIREUP_METHOD} - --run_tag ${RUN_TAG} - --experiment_id ${EXP_ID:-1} - --data_dir_prefix ${DATA_DIR_PREFIX} - --output_dir ${OUTPUT_DIR} - --model_prefix "segmentation" - --optimizer ${OPTIMIZER} - --start_lr ${START_LR} - ${lr_schedule_arg} - --lr_warmup_steps ${LR_WARMUP_STEPS} - --lr_warmup_factor ${LR_WARMUP_FACTOR} - --weight_decay ${WEIGHT_DECAY} - --logging_frequency ${LOGGING_FREQUENCY} - --save_frequency 0 - --min_epochs ${MIN_EPOCHS:-0} - --max_epochs ${MAX_EPOCHS:-200} - --data_num_threads ${MAX_THREADS:-4} - --seed ${SEED} - --batchnorm_group_size ${BATCHNORM_GROUP_SIZE} - --shuffle_mode "${SHUFFLE_MODE}" - --data_format "${DATA_FORMAT}" - --data_oversampling_factor ${DATA_OVERSAMPLING_FACTOR:-1} - --precision_mode "${PRECISION_MODE}" - --enable_nhwc - --local_batch_size ${LOCAL_BATCH_SIZE} - --local_batch_size_validation ${LOCAL_VALIDATION_BATCH_SIZE} - ${ADDITIONAL_ARGS} -) - -# go to source code directory +# go to training code directory cd 
deepcam-mlcommons-hpcv3/src/deepCam -# profile command: -if [ ! -z ${OMPI_COMM_WORLD_RANK} ]; then - WORLD_RANK=${OMPI_COMM_WORLD_RANK} -elif [ ! -z ${PMIX_RANK} ]; then - WORLD_RANK=${PMIX_RANK} -elif [ ! -z ${PMI_RANK} ]; then - WORLD_RANK=${PMI_RANK} -fi -PROFILE_BASE_CMD="nsys profile --mpi-impl=openmpi --trace=cuda,cublas,nvtx,mpi --cuda-graph-trace=node --kill none -c cudaProfilerApi -f true -o ${OUTPUT_DIR}/profile_job${SLURM_JOBID}_rank${WORLD_RANK}" -ANNA_BASE_CMD="nsys profile --trace cuda,nvtx --sample cpu --output ${OUTPUT_DIR}/anna_job${SLURM_JOBID}_rank${WORLD_RANK} --export sqlite --force-overwrite true --stop-on-exit true --capture-range cudaProfilerApi --capture-range-end stop --kill none" -DLPROF_BASE_CMD="dlprof --mode=pytorch --force=true --reports=summary,detail,iteration --nsys_profile_range=true --output_path=${OUTPUT_DIR} --profile_name=dlprof_rank${WORLD_RANK}" -METRICS_BASE_CMD="ncu --target-processes=all --profile-from-start=off --nvtx --print-summary=per-nvtx --csv -f -o ${OUTPUT_DIR}/metrics_rank${WORLD_RANK} --metrics=smsp__sass_thread_inst_executed_op_hadd_pred_on.sum,smsp__sass_thread_inst_executed_op_hmul_pred_on.sum,smsp__sass_thread_inst_executed_op_hfma_pred_on.sum,smsp__sass_thread_inst_executed_op_fadd_pred_on.sum,smsp__sass_thread_inst_executed_op_fmul_pred_on.sum,smsp__sass_thread_inst_executed_op_ffma_pred_on.sum,sm__inst_executed_pipe_tensor.sum" - -if [[ ${ENABLE_PROFILING} == 1 ]]; then - if [[ ${ENABLE_METRICS_COLLECTION} == 1 ]]; then - echo "Metric Collection enabled" - if [[ "${WORLD_RANK}" == "0" ]]; then - PROFILE_CMD=${METRICS_BASE_CMD} - else - PROFILE_CMD="" - fi - elif [[ ${ENABLE_DLPROF} == 1 ]]; then - echo "Dlprof enabled" - if [[ "${WORLD_RANK}" == "0" ]]; then - PROFILE_CMD=${DLPROF_BASE_CMD} - else - PROFILE_CMD="" - fi - PARAMS+=(--profile_markers=dlprof) - elif [[ ${ENABLE_ANNA} == 1 ]]; then - echo "ANNA enabled" - if [[ "${WORLD_RANK}" == "0" ]]; then - PROFILE_CMD=${ANNA_BASE_CMD} - else - 
PROFILE_CMD="" - fi - PARAMS+=(--profile_markers=anna) - else - echo "Profiling enabled" - PROFILE_CMD=${PROFILE_BASE_CMD} - fi -elif [[ ${API_LOGGING} == 1 ]]; then - echo "ApiLog enabled" - if [ ${SLURM_PROCID} == 0 ]; then - PROFILE_CMD="apiLog.sh" - else - PROFILE_CMD="" - fi -else - PROFILE_CMD="" -fi - -if [[ ${DEBUG_MEMCHECK} == 1 ]]; then - echo "Debugging enabled" - DEBUG_CMD="compute-sanitizer --tool=memcheck" -else - DEBUG_CMD="" -fi - -IB_BIND='' -if [[ "${SLURM_JOB_NUM_NODES}" -gt 1 && "${ENABLE_IB_BINDING}" -eq 1 ]]; then - IB_BIND='--ib=single' -fi -BIND_BASE_CMD="bindpcie --cpu=exclusive ${IB_BIND} --" -BIND="${BIND_CMD:-${BIND_BASE_CMD}}" - -if [ "$LOGGER" = "apiLog.sh" ]; -then - LOGGER="${LOGGER} -p MLPerf/${MODEL_NAME} -v ${FRAMEWORK}/train/${DGXSYSTEM}" - readonly node_rank="${SLURM_NODEID:-0}" - readonly local_rank="${LOCAL_RANK:=${SLURM_LOCALID:=${OMPI_COMM_WORLD_LOCAL_RANK:-}}}" - if [ "$node_rank" -eq 0 ] && [ "$local_rank" -eq 0 ]; - then - LOGGER=$LOGGER - else - LOGGER="" - fi -fi - -# do we cache data -if [ ! -z ${DATA_CACHE_DIRECTORY} ]; then - PARAMS+=(--data_cache_directory ${DATA_CACHE_DIRECTORY}) -fi - -# run script selection: -if [ ! -z ${TRAINING_INSTANCE_SIZE} ]; then - echo "Running Multi Instance Training" - RUN_SCRIPT="./train_instance_oo.py" - PARAMS+=(--training_instance_size ${TRAINING_INSTANCE_SIZE}) - - if [ ! 
-z ${STAGE_DIR_PREFIX} ]; then - PARAMS+=( - --stage_dir_prefix ${STAGE_DIR_PREFIX} - --stage_num_read_workers ${STAGE_NUM_READ_WORKERS:-1} - --stage_num_write_workers ${STAGE_NUM_WRITE_WORKERS:-1} - --stage_batch_size ${STAGE_BATCH_SIZE:--1} - --stage_mode ${STAGE_MODE:-"node"} - --stage_max_num_files ${STAGE_MAX_NUM_FILES:--1} - ) - # do we need to verify the staging results - if [ "${STAGE_VERIFY:-0}" -eq 1 ]; then - PARAMS+=(--stage_verify) - fi - if [ "${STAGE_ONLY:-0}" -eq 1 ]; then - echo "WARNING: You are about to run a staging only benchmark" - PARAMS+=(--stage_only) - fi - if [ "${STAGE_FULL_DATA_PER_NODE:-0}" -eq 1 ]; then - PARAMS+=(--stage_full_data_per_node) - fi - if [ "${STAGE_ARCHIVES:-0}" -eq 1 ]; then - PARAMS+=(--stage_archives) - fi - if [ "${STAGE_USE_DIRECT_IO:-0}" -eq 1 ]; then - PARAMS+=(--stage_use_direct_io) - fi - if [ "${STAGE_READ_ONLY:-0}" -eq 1 ]; then - PARAMS+=(--stage_read_only) - fi - fi -else - echo "Running Single Instance Training" - RUN_SCRIPT="./train.py" -fi - -# decide whether to enable profiling -if [ ! -z ${ENABLE_PROFILING} ] && [ ${ENABLE_PROFILING} == 1 ]; then - echo "Running Profiling" - if [ ! -z ${TRAINING_INSTANCE_SIZE} ]; then - RUN_SCRIPT="./train_instance_oo_profile.py" - else - RUN_SCRIPT="./train_profile.py" - fi - - if [ ! -z ${CAPTURE_RANGE_START} ]; then - PARAMS+=( - --capture_range_start ${CAPTURE_RANGE_START} - --capture_range_stop ${CAPTURE_RANGE_STOP} - ) - fi - - if [ ! -z ${PROFILE_FRACTION} ]; then - PARAMS+=(--profile_fraction ${PROFILE_FRACTION}) - fi -fi - -# assemble run command -RUN_CMD="${RUN_SCRIPT} ${PARAMS[@]}" - +# assemble run command (LOGGER, PROFILE_CMD, DEBUG_CMD, and RUN_CMD are set in config_common.sh) # libmpi.so.12 is specifically being looked for. 
symlink this in a different directory
-# to libmpi.so from system module (a little hacky)
-PSEUDO_LIB=`pwd`/lib
-mkdir -p $PSEUDO_LIB
-ln -s $OPENMPI_ROOT_DIR/lib/libmpi.so $PSEUDO_LIB/libmpi.so.12
+# to libmpi.so from system MPICH (a little hacky)
+PSEUDO_LIB=`pwd`/lib && mkdir -p $PSEUDO_LIB
+# -f keeps reruns idempotent: the work dir persists, so the link may already exist
+ln -sf $CRAY_MPICH_DIR/lib/libmpi.so.12 $PSEUDO_LIB/libmpi.so.12
 export LD_LIBRARY_PATH=$PSEUDO_LIB:$LD_LIBRARY_PATH
 
 srun --overlap -u -N ${SLURM_NNODES} -n ${SLURM_NTASKS} -c ${SLURM_CPUS_PER_TASK} --cpu_bind=cores --gres=gpu:${SLURM_GPUS_ON_NODE} \
     ${LOGGER:-} ${PROFILE_CMD} ${DEBUG_CMD} $(which python) ${RUN_CMD};
 ret_code=$?
 
 if [[ $ret_code != 0 ]]; then exit $ret_code; fi
-
-
-# end timing
-end=$(date +%s)
-end_fmt=$(date +%Y-%m-%d\ %r)
-echo "ENDING TIMING RUN AT $end_fmt"
-
-# report result
-result=$(( $end - $start ))
-result_name="DEEPCAM_HPC"
-echo "RESULT,$result_name,,$result,$USER,$start_fmt"
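
Note: the final hunk above removes the original timing harness (the `STARTING TIMING RUN` / `RESULT,DEEPCAM_HPC,...` lines), on the assumption that timing now lives in `config_common.sh` or is handled elsewhere. If the old wall-clock report is still wanted, it can be reproduced as a standalone wrapper along these lines (a sketch built from the removed lines; `sleep 2` is a placeholder for the real `srun ... ${RUN_CMD}` launch):

```shell
#!/bin/bash
# Sketch of the removed timing harness: wraps the launch command and
# emits the same RESULT line the old run_and_time_kestrel.sh printed.
start=$(date +%s)
start_fmt=$(date +%Y-%m-%d\ %r)
echo "STARTING TIMING RUN AT $start_fmt"

# placeholder for the real launch, e.g. srun ... $(which python) ${RUN_CMD}
sleep 2

end=$(date +%s)
end_fmt=$(date +%Y-%m-%d\ %r)
echo "ENDING TIMING RUN AT $end_fmt"

# report elapsed seconds in the MLPerf-style CSV line
result=$(( end - start ))
result_name="DEEPCAM_HPC"
echo "RESULT,$result_name,,$result,$USER,$start_fmt"
```

The wrapper only measures end-to-end wall time of whatever command replaces the placeholder; convergence itself is still reported by the mlperf-logging output of the training script.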