Replace cpu-benchmark with similar stress-ng monte-carlo test by quantumsteve · Pull Request #137 · wfcommons/WfCommons

quantumsteve · 2026-02-09T20:10:11Z

First attempt at replacing cpu-benchmark with a nearly identical test in stress-ng.

differences

cpu-benchmark uses a full sphere, while stress-ng uses a single quadrant.
cpu-benchmark uses a "terrible" but efficient RNG, while stress-ng has many options. I chose "lcg" to start with.
cpu-benchmark batches cpu-work into 1000000 samples, while stress-ng uses ~16384 samples.
stress-ng defines samples and ops as int32_t, while samples in cpu-benchmark are int64_t.
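To make the first two differences concrete, here is a minimal sketch of a Monte Carlo pi estimate driven by a linear congruential generator, once over a single quadrant and once over the full shape. It is an illustration only, not the code of either benchmark: the `Lcg` class, its constants (the common Numerical Recipes values), the function names, and the use of a 2D circle in place of the "full sphere" are all assumptions.

```python
import math


class Lcg:
    """32-bit linear congruential generator (assumed constants, not stress-ng's)."""

    def __init__(self, seed=1):
        self.state = seed & 0xFFFFFFFF

    def next_unit(self):
        # Advance the state and map it to a float in [0, 1).
        self.state = (1664525 * self.state + 1013904223) & 0xFFFFFFFF
        return self.state / 4294967296.0


def estimate_pi_quadrant(samples, rng):
    """Sample the unit square; the fraction inside the quarter circle approximates pi/4."""
    hits = 0
    for _ in range(samples):
        x, y = rng.next_unit(), rng.next_unit()
        if x * x + y * y <= 1.0:
            hits += 1
    return 4.0 * hits / samples


def estimate_pi_full(samples, rng):
    """Sample the square [-1, 1) x [-1, 1); the fraction inside the circle approximates pi/4."""
    hits = 0
    for _ in range(samples):
        x = 2.0 * rng.next_unit() - 1.0
        y = 2.0 * rng.next_unit() - 1.0
        if x * x + y * y <= 1.0:
            hits += 1
    return 4.0 * hits / samples


if __name__ == "__main__":
    print(estimate_pi_quadrant(200_000, Lcg(seed=1)))
    print(estimate_pi_full(200_000, Lcg(seed=1)))
```

Both variants converge to the same value; they differ only in how the random draws are mapped to coordinates. With n samples the estimate can only take values of the form 4k/n, so very small sample counts give a coarse result.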

stress-ng launches multiple processes, which changes the logic for stopping parent and child processes. A quick search recommended psutil.
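The PR uses psutil to stop the parent and its children. For comparison only, the sketch below shows a standard-library alternative on POSIX: start the command in its own session so that its children share one process group, then signal the whole group. The helper names `start_in_own_group` and `stop_group` are hypothetical, and the shell command is a stand-in for a multi-process tool, not the actual stress-ng invocation.

```python
import os
import signal
import subprocess


def start_in_own_group(cmd):
    """Start cmd as the leader of a new session, so its children share its process group."""
    return subprocess.Popen(cmd, start_new_session=True)


def stop_group(proc, timeout=5.0):
    """Signal the whole process group, then reap the parent; returns its exit code."""
    try:
        # For a session leader the process group id equals its pid.
        os.killpg(proc.pid, signal.SIGTERM)
    except ProcessLookupError:
        pass  # the group is already gone
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Escalate if SIGTERM was ignored or handled too slowly.
        os.killpg(proc.pid, signal.SIGKILL)
        return proc.wait()


if __name__ == "__main__":
    # Stand-in for a parent that forks several workers.
    parent = start_in_own_group(["sh", "-c", "sleep 30 & sleep 30 & wait"])
    print(stop_group(parent))
```

This only reaches children that stay in the parent's process group, which is an assumption about the launched tool. A psutil-based walk of the process tree does not depend on it.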

rafaelfsilva · 2026-02-10T02:38:43Z

Hi @quantumsteve, it seems the tests for Dask are hanging. Could you please take a look at them? Thanks!

henricasanova · 2026-02-10T04:12:33Z

Confirming that tests hang locally (not only on the GitHub runner). Let me know if you need help/guidance regarding this, as I implemented the testing infrastructure for translators/loggers. Oh, and other tests hang, so it's not specific to Dask, which is good as it means it should be easier to diagnose/fix. AND, some tests fail, like for the Bash translator, which should also be relatively easy to diagnose. All the testing code is in tests/test_helpers.py and tests/translators_loggers/.

quantumsteve · 2026-02-10T15:28:43Z

Yes, I'm struggling to run some tests locally. It looks like the containers are unable to write in my /tmp directory. Do I need to change permissions?

henricasanova · 2026-02-10T17:24:45Z

The permissions should be fine, as I have dealt with that as well. I'll take a look today and let you know what I find out. Testing with Docker is pretty finicky due to users/permissions.

henricasanova · 2026-02-10T17:44:44Z

One thing I noticed is that psutil wasn't listed in pyproject.toml. That "fixed" the bash executor test, in that now it hangs like the others :)

henricasanova · 2026-02-10T17:58:33Z

Ok, news. Connecting to the container and running wfbench by hand, without involving wfcommons or any of my test infrastructure, hangs:

bin/wfbench --name split_fasta_00000001 --percent-cpu 1.0 --cpu-work 1 
[WfBench][09:57:07][INFO] Starting split_fasta_00000001 Benchmark
[WfBench][09:57:07][INFO] Starting CPU and Memory Benchmarks for split_fasta_00000001...
stress-ng: info:  [311] defaulting to a 1 day, 0 secs run per stressor
stress-ng: info:  [311] dispatching hogs: 10 monte-carlo
stress-ng: info:  [312] monte-carlo: pi   ~ 2.6666666666667 vs 3.1415926535898 using lcg (average of 1 runs)
stress-ng: info:  [311] skipped: 0
stress-ng: info:  [311] passed: 10: monte-carlo (10)
stress-ng: info:  [311] failed: 0
stress-ng: info:  [311] metrics untrustworthy: 0
stress-ng: info:  [311] successful run completed in 0.06 secs

That should be easy to diagnose, and I'll look at it soon-ish.

henricasanova · 2026-02-10T18:37:37Z

Ok, so the culprit is io_proc.join(), which hangs. Also, I am noticing 76 stress-ng processes, and 3 stress-ng zombie processes while this hangs. I assume that's fine/intended, but thought I'd mention it.

henricasanova · 2026-02-10T18:45:44Z

@quantumsteve A fix is to call io_proc.kill() right before io_proc.join(), because I don't believe the I/O process can actually terminate on its own. I see you had a commented-out io_proc.terminate() before the join, so perhaps you had thought of that. With that fix, the execution does complete. BUT it leaves behind tons of zombie stress-ng-vm processes, which is of course not good. With this information, you can likely fix your code now? What do you think?
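The pattern described here, a worker that never returns so a bare join() blocks forever, can be sketched with the standard multiprocessing module. This is a minimal illustration and not the wfbench code: `io_loop`, `stop_worker`, and `demo` are hypothetical names, and the explicit "fork" start method assumes a POSIX host.

```python
import multiprocessing as mp
import time


def io_loop():
    # Stand-in for an I/O worker that never returns on its own.
    while True:
        time.sleep(0.05)


def stop_worker(proc, timeout=5.0):
    """Stop a multiprocessing.Process and reap it; returns its exit code."""
    if proc.is_alive():
        proc.terminate()      # sends SIGTERM on POSIX
    proc.join(timeout)        # reaping here keeps the child from lingering as a zombie
    if proc.is_alive():
        proc.kill()           # escalate to SIGKILL if SIGTERM was not enough
        proc.join()
    return proc.exitcode


def demo():
    ctx = mp.get_context("fork")  # assumption: POSIX host
    io_proc = ctx.Process(target=io_loop)
    io_proc.start()
    # Calling io_proc.join() alone at this point would block indefinitely.
    return stop_worker(io_proc)


if __name__ == "__main__":
    print(demo())
```

A negative exit code means the worker was ended by a signal. Note that this reaps only the worker itself. Any processes the worker spawned, such as the stress-ng-vm processes mentioned above, need to be stopped and waited on separately.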

quantumsteve · 2026-02-11T21:55:17Z

bin/wfbench

                    proc.wait()
        if io_proc is not None and io_proc.is_alive():
-            # io_proc.terminate()
+            io_proc.terminate()


I think calling terminate here is causing the io_proc to not finish.

henricasanova · 2026-02-11

Yes, but the problem is that if we call join() instead of terminate(), it seems to hang, so perhaps the I/O process isn't the kind of process that ever terminates on its own? I'll inspect the code tomorrow (Thursday) when I have a minute.

the "* 1024 * 1024" to "* 1000 * 1000". The alternative was to document "MiB", but as far as I can tell none of the above layers use 1024. This said, _every time_ we've used any unit other than just byte, we've had errors. Perhaps we should change it all to "memory size in bytes" and that's it.
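The gap between the two multipliers is easy to quantify. The snippet below is a small illustration with hypothetical helper names, not functions from the code base:

```python
def mb_to_bytes(size_mb):
    """Decimal megabytes (MB): 1 MB = 1000 * 1000 bytes."""
    return size_mb * 1000 * 1000


def mib_to_bytes(size_mib):
    """Binary mebibytes (MiB): 1 MiB = 1024 * 1024 bytes."""
    return size_mib * 1024 * 1024


if __name__ == "__main__":
    print(mb_to_bytes(100))    # 100000000
    print(mib_to_bytes(100))   # 104857600
```

For a value of 100 the two interpretations differ by 4857600 bytes, about 4.9%. Passing sizes in plain bytes avoids the ambiguity altogether.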

Signed-off-by: Steven Hahn <hahnse@ornl.gov>

Replace cpu-benchmark with similar stress-ng monte-carlo test

b9d87fc

quantumsteve marked this pull request as draft February 9, 2026 20:10

rafaelfsilva added this to the v1.5 milestone Feb 10, 2026

rafaelfsilva approved these changes Feb 10, 2026

quantumsteve and others added 6 commits February 10, 2026 14:33

Add dependency psutil

d3778c2

terminate io process

827c189

check for division by zero

e41861c

mute stress-ng

901e87b

Added a --quiet flag to another stress-ng invocation

8c560de

Fixed the zombie problem

ffac645

quantumsteve commented Feb 11, 2026

henricasanova and others added 4 commits February 12, 2026 08:33

typo-- !!

45124d9

try to workaround missing cpu_queue

c92654e

Signed-off-by: Steven Hahn <hahnse@ornl.gov>

typos

510ca58

Signed-off-by: Steven Hahn <hahnse@ornl.gov>

quantumsteve wants to merge 11 commits into main from stress-ng_cpu_benchmark

3 participants
