When I was trying to run the examples from the test branch, I frequently ran into errors like:
OSError: Timed out trying to connect to 'tcp://10.255.32.50:35106' after 10 s:
Timed out trying to connect to 'tcp://10.255.32.50:35106' after 10 s: connect() didn't finish in time
It looks like the communication between workers on different nodes sometimes fails. I'm not sure why this is happening.
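To narrow down whether the connection itself is slow or unreachable, something like the sketch below could be run from both a login node and an allocated node. It only uses the Python standard library and just times a plain TCP connect. The function name `time_tcp_connect` is made up for illustration, and the address in the usage comment is the one from the traceback above (the port will differ on every run, so it has to be replaced with a live worker address).

```python
import socket
import time


def time_tcp_connect(host, port, timeout=10.0):
    """Return the seconds needed to open a TCP connection to (host, port).

    Returns None if the connection fails or does not finish within `timeout`.
    """
    start = time.perf_counter()
    try:
        # create_connection raises an OSError subclass on refusal or timeout
        with socket.create_connection((host, port), timeout=timeout):
            return time.perf_counter() - start
    except OSError:
        return None


# Hypothetical usage, with the address taken from the traceback:
# elapsed = time_tcp_connect("10.255.32.50", 35106)
# print("unreachable" if elapsed is None else f"connected in {elapsed:.3f} s")
```

If the connect time is close to 10 s or the call returns None only from allocated nodes, that would support the observation that the submission host matters. It would not explain the cause by itself.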
I used a new venv for the installation of MIDTools. The issue occurred less often when I lowered the number of workers and nodes, but the processing never ran consistently for me with the settings provided in the test branch.
The best I could get was 100 trains with 8 jobs and 8 procs instead of 12; after bumping it to 12, it failed (seemingly) every time.
I troubleshot this a bit, and it seemed like the errors happened more often when the jobs were submitted from an allocated node (instead of a login node). They may also have happened more often when the GPFS scratch directory (/gpfs/data/scratch/...) was used instead of the home directory. These are pretty weak observations, though, since I didn't test this very much.
Note to self: try increasing the file limit.
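For the file limit idea, here is a minimal sketch, assuming "file limit" means the per-process limit on open file descriptors (`RLIMIT_NOFILE`, the value shown by `ulimit -n`). It uses Python's standard `resource` module (Unix only). The function name `raise_open_file_limit` and the default of 4096 are placeholders, not anything taken from MIDTools.

```python
import resource


def raise_open_file_limit(requested=4096):
    """Try to raise the soft open-file limit of this process.

    `requested` is an arbitrary example value. The soft limit is never
    raised above the hard limit. Returns the (soft, hard) limits in effect
    after the attempt.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        # Already unlimited, nothing to do
        return soft, hard
    if hard == resource.RLIM_INFINITY:
        new_soft = requested
    else:
        new_soft = min(requested, hard)
    if new_soft > soft:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
        except (ValueError, OSError):
            # Some systems reject the new value; keep the old limit
            pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)


print(raise_open_file_limit())
```

This only changes the limit for the current process and its children. Workers started through the batch system get their limits from the job environment, so the equivalent setting would probably have to be applied there as well.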