
Expand run_mpmd.sh to support multi-core multi-threaded tasks#54

Closed
Copilot wants to merge 10 commits into develop from copilot/expand-run-mpmd-script-threading

Conversation

Copilot AI commented Mar 16, 2026

Description

ush/run_mpmd.sh only supported single-core, single-threaded tasks. This adds a table format for heterogeneous MPMD jobs while preserving full backward compatibility.

Table format uses double-quoted columns: command, ntasks, nthreads (columns 2-3 default to 1 if missing):

"./gfs_model" "128" "2"
"${HOMEgfs}/ush/product_manager.sh ./file_list.txt" "1" "1"
"./single_core_task"

Format is auto-detected from the first non-empty, non-comment line. Simple (unquoted) command files behave identically to before.
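
A minimal sketch of how the detection and parsing described above could work. It is not copied from ush/run_mpmd.sh, so the function bodies, the tab-separated output, and the tolerance for leading whitespace are assumptions made for illustration. It also assumes a command never contains an embedded double quote.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; the real functions in ush/run_mpmd.sh may differ.

# Return 0 if the first non-empty, non-comment line starts with a double
# quote (table format), 1 otherwise.
is_table_format() {
  local cmdfile=$1 line
  while IFS= read -r line || [[ -n "${line}" ]]; do
    # Assumption: leading whitespace is ignored before the check.
    line="${line#"${line%%[![:space:]]*}"}"
    [[ -z "${line}" || "${line}" == \#* ]] && continue
    [[ "${line}" == \"* ]]
    return
  done < "${cmdfile}"
  return 1 # Empty file: treat as simple format.
}

# Print "command<TAB>ntasks<TAB>nthreads" for one table line.
# Splitting on '"' puts the quoted columns in awk fields 2, 4, and 6;
# missing columns 2 and 3 default to 1.
parse_table_line() {
  awk -F'"' '{
    ntasks   = ($4 != "") ? $4 : 1
    nthreads = ($6 != "") ? $6 : 1
    printf "%s\t%s\t%s\n", $2, ntasks, nthreads
  }' <<< "$1"
}

# Usage: build a small table file and parse each entry.
cmdfile=$(mktemp)
cat > "${cmdfile}" << 'EOF'
# heterogeneous MPMD table
"./gfs_model" "128" "2"
"./single_core_task"
EOF

if is_table_format "${cmdfile}"; then
  while IFS= read -r line; do
    [[ "${line}" == \"* ]] || continue
    IFS=$'\t' read -r cmd ntasks nthreads < <(parse_table_line "${line}")
    echo "cmd=${cmd} ntasks=${ntasks} nthreads=${nthreads}"
  done < "${cmdfile}"
fi
rm -f "${cmdfile}"
```

Because only the first meaningful line is inspected, a file that mixes quoted and unquoted lines would be classified by whichever style comes first.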

Key additions:

  • is_table_format() — detects table format by leading double quote
  • parse_table_line() — extracts command, ntasks, nthreads via awk
  • run_table_mpmd() — builds heterogeneous launch commands using colon-separated syntax with per-entry wrapper scripts for OMP_NUM_THREADS
    • srun: -n <ntasks> -c <nthreads> wrapper : ...
    • mpiexec: -np <ntasks> --depth <nthreads> --cpu-bind depth wrapper : ...
  • Chunking based on total allocated tasks to avoid oversubscription
  • Serial mode fallback runs table entries sequentially with per-command OMP_NUM_THREADS
  • OMP_NUM_THREADS=1 moved from global to simple-format-only scope
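
The colon-separated launch syntax listed above can be sketched as follows. This is a hedged illustration rather than the code in this PR: `make_wrapper`, `build_launch_args`, the wrapper file names, and the `cmd|ntasks|nthreads` entry encoding are all hypothetical. The sketch only prints the launch command instead of running it.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; names and layout are not taken from ush/run_mpmd.sh.

# Write a wrapper that exports OMP_NUM_THREADS, then runs the command.
make_wrapper() {
  local wrapper=$1 nthreads=$2 cmd=$3
  printf '#!/usr/bin/env bash\nexport OMP_NUM_THREADS=%s\n%s\n' \
    "${nthreads}" "${cmd}" > "${wrapper}"
  chmod +x "${wrapper}"
}

# Print a heterogeneous launch command for entries given as
# "cmd|ntasks|nthreads" (hypothetical encoding for this sketch).
build_launch_args() {
  local launcher=$1 workdir=$2
  shift 2
  local entry cmd ntasks nthreads wrapper args="" idx=0
  for entry in "$@"; do
    IFS='|' read -r cmd ntasks nthreads <<< "${entry}"
    wrapper="${workdir}/mpmd_wrapper_${idx}.sh"
    make_wrapper "${wrapper}" "${nthreads}" "${cmd}"
    # Entries after the first are joined with " :".
    if [[ -n "${args}" ]]; then args+=" :"; fi
    case "${launcher}" in
      srun) args+=" -n ${ntasks} -c ${nthreads} ${wrapper}" ;;
      mpiexec) args+=" -np ${ntasks} --depth ${nthreads} --cpu-bind depth ${wrapper}" ;;
      *)
        echo "unsupported launcher: ${launcher}" >&2
        return 1
        ;;
    esac
    idx=$((idx + 1))
  done
  echo "${launcher}${args}"
}

# Usage: print the srun form for two table entries.
workdir=$(mktemp -d)
build_launch_args srun "${workdir}" "./gfs_model|128|2" "./single_core_task|1|1"
```

Setting OMP_NUM_THREADS inside a per-entry wrapper keeps each program's thread count independent of the others in the same launch.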

Resolves NOAA-EMC#3088

Type of change

  • Bug fix (fixes something broken)
  • New feature (adds functionality)
  • Maintenance (code refactor, clean-up, new CI test, etc.)

Change characteristics

  • Is this change expected to change outputs (e.g. value changes to existing outputs, new files stored in COM, files removed from COM, filename changes, additions/subtractions to archives)? NO
    • GFS
    • GEFS
    • SFS
    • GCAFS
  • Is this a breaking change (a change in existing functionality)? NO
  • Does this change require a documentation update? NO
  • Does this change require an update to any of the following submodules? NO
    • EMC verif-global
    • GDAS
    • GFS-utils
    • GSI
    • GSI-monitor
    • GSI-utils
    • UFS-utils
    • UFS-weather-model
    • wxflow

How has this been tested?

  • Shellcheck passes clean
  • 21 unit tests covering format detection, table parsing (full/partial columns), serial execution of both formats, and backward compatibility

Checklist

  • Any dependent changes have been merged and published
  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have documented my code, including function, input, and output descriptions
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • This change is covered by an existing CI test or a new one has been added
  • Any new scripts have been added to the .github/CODEOWNERS file with owners
  • I have made corresponding changes to the system documentation if necessary
Original prompt

This section details the original issue you should resolve

<issue_title>Expand run_mpmd.sh to be able to run threaded and multi-core tasks</issue_title>
<issue_description>Problem description and feature request:

The ush/run_mpmd.sh script currently only allows single-threaded tasks to execute and assumes each line of the command file to be a single-core task.

This utility script should be expanded to enable the use of multiple threads per subtask and heterogeneous threading (e.g. task 1 uses 2 cores, single threaded; task 2 uses 4 cores, with two tasks, double threaded).

The script currently assumes that the commands listed in the command file are one-core, single-threaded tasks, one per line. This is fine and should stay intact by default. However, the script should be expanded to enable multi-core, multi-threaded command files. When more complicated MPMD tasks are required, the script should parse input ASCII tables, so the input command files should be structured as tables. Here is an example table:

"./gfs_model" "128" "2"
"${HOMEgfs}/ush/product_manager.sh ./file_list.txt" "1" "1"

The table should follow an assumed construct (i.e. the first column is always the command, the second column is always the number of tasks, and the third is always the number of threads). If the second or third column is missing, it should be assumed to be 1.

Acceptance criteria:

  • The solution works on both WCOSS2 (Cray/PALS mpiexec) and RDHPCS (slurm srun) systems.
  • Current tasks using run_mpmd.sh are unaffected (i.e. lists of single-core, single-threaded sub-jobs).
  • More complex tasks are supported (multi-core, multi-threaded sub-jobs)
  • (Bonus) two test scripts are added to run run_mpmd.sh. One on a Cray/PALS mpiexec platform and one on a slurm srun platform.</issue_description>

Comments on the Issue (you are @copilot in this section)


Copilot AI and others added 2 commits March 16, 2026 12:30
Add table format support for heterogeneous MPMD command files:
- is_table_format(): Detects double-quoted table format command files
- parse_table_line(): Parses command, ntasks, and nthreads from table lines
- run_table_mpmd(): Executes heterogeneous MPMD using colon-separated
  launcher syntax with per-task wrapper scripts for OMP_NUM_THREADS
- Serial mode handles table format by running commands sequentially
- Simple format behavior is completely unchanged (backward compatible)
- Works with both srun (Slurm) and mpiexec (PALS/Cray) launchers

Co-authored-by: DavidHuber-NOAA <69919478+DavidHuber-NOAA@users.noreply.github.com>
- Document that mixed format files are not supported
- Add warning when a single entry exceeds total allocated tasks
- Add clarifying comments for mpiexec --depth and --cpu-bind depth

Co-authored-by: DavidHuber-NOAA <69919478+DavidHuber-NOAA@users.noreply.github.com>
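
The chunking and oversubscription warning mentioned in these commits could look roughly like the sketch below. It is an assumption-laden illustration: `chunk_entries` is a hypothetical name, the packing is a simple greedy pass in file order, and only task counts (not threads) are counted against the allocation, which may not match what the PR does.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; not the chunking code from ush/run_mpmd.sh.

# Greedily group entries into chunks whose task counts sum to at most
# total_tasks. Arguments after the first are per-entry task counts.
# Prints one line per chunk listing the zero-based entry indices in it.
chunk_entries() {
  local total_tasks=$1
  shift
  local -a chunk=()
  local used=0 idx=0 ntasks
  for ntasks in "$@"; do
    if ((ntasks > total_tasks)); then
      echo "WARNING: entry ${idx} requests ${ntasks} tasks but only ${total_tasks} are allocated" >&2
    fi
    # Start a new chunk when adding this entry would oversubscribe.
    if ((used > 0 && used + ntasks > total_tasks)); then
      echo "${chunk[*]}"
      chunk=()
      used=0
    fi
    chunk+=("${idx}")
    used=$((used + ntasks))
    idx=$((idx + 1))
  done
  if ((${#chunk[@]} > 0)); then
    echo "${chunk[*]}"
  fi
}

# Usage: 128 allocated tasks, five entries with the given task counts.
chunk_entries 128 128 1 64 64 200
```

In this sketch an oversized entry still gets a chunk of its own after the warning; whether to run or reject such an entry is a policy choice left open here.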
Copilot AI changed the title [WIP] Expand run_mpmd.sh to support threaded and multi-core tasks Expand run_mpmd.sh to support multi-core multi-threaded tasks Mar 16, 2026
Copilot AI requested a review from DavidHuber-NOAA March 16, 2026 12:33
@DavidHuber-NOAA DavidHuber-NOAA force-pushed the copilot/expand-run-mpmd-script-threading branch from 3e04518 to 8c45a65 Compare March 19, 2026 13:49
Comment thread ush/run_mpmd.sh
# Almost works!! launch_args+=" -n ${task_counts[idx]} --env OMP_PLACES=threads --env OMP_PROC_BIND=spread --env OMP_NUM_THREADS=${thread_counts[idx]} --cpu-bind verbose,none ${cmds[idx]}"
# Test config
list=$(seq -s, 0 $((task_counts[idx] - 1)))
if [[ -z "${cpu_list}" ]]; then

[shfmt] reported by reviewdog 🐶

Suggested change
if [[ -z "${cpu_list}" ]]; then
if [[ -z "${cpu_list}" ]]; then

echo "\"${DATA}/${FCSTEXEC}\"" '"128" "1"' > cmdfile
echo "\"${HOMEglobal}/dev/scripts/run_date.sh\"" '"1"' '"1"' >> cmdfile

${USHglobal}/run_mpmd.sh cmdfile

📝 [shellcheck] reported by reviewdog 🐶
Double quote to prevent globbing and word splitting. SC2086
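
A small, self-contained illustration of why shellcheck flags the unquoted expansion. The directory value below is hypothetical and chosen only to make word splitting visible; `count_args` is a helper invented for this sketch.

```shell
#!/usr/bin/env bash
# Illustrative sketch only: shows the word splitting behind SC2086.

# Print how many arguments the function received.
count_args() { echo "$#"; }

USHglobal="/tmp/dir with space" # hypothetical value containing spaces

# Unquoted: the path is split on whitespace into three words.
# shellcheck disable=SC2086
count_args ${USHglobal}/run_mpmd.sh cmdfile # prints 4

# Quoted: the path stays a single word.
count_args "${USHglobal}"/run_mpmd.sh cmdfile # prints 2
```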

echo "\"${DATA}/${FCSTEXEC}\"" '"128" "1"' > cmdfile
echo "\"${HOMEglobal}/dev/scripts/run_date.sh\"" '"1"' '"1"' >> cmdfile

${USHglobal}/run_mpmd.sh cmdfile

[shellcheck (suggestion)] reported by reviewdog 🐶

Suggested change
${USHglobal}/run_mpmd.sh cmdfile
"${USHglobal}"/run_mpmd.sh cmdfile

Development

Successfully merging this pull request may close these issues.

Expand run_mpmd.sh to be able to run threaded and multi-core tasks
Ensemble members do not recognize breakpoints

2 participants