OSError and timeouts #5

@RobertRosca

When I was trying to run the examples from the test branch, I frequently ran into errors like:

OSError: Timed out trying to connect to 'tcp://10.255.32.50:35106' after 10 s:
Timed out trying to connect to 'tcp://10.255.32.50:35106' after 10 s: connect() didn't finish in time

Communication between workers on different nodes would sometimes fail, and I'm not sure why this is happening.
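One way to narrow this down could be to check whether the address from the error message is reachable at all from the node that reports the timeout. The snippet below is only a rough sketch of that idea, not part of MIDTools or the test branch: `probe_tcp` is a hypothetical helper name, it uses nothing beyond the Python standard library, and it only tests plain TCP reachability, so a successful probe does not prove that the worker itself is healthy.

```python
import socket
import time
from urllib.parse import urlparse


def probe_tcp(address, timeout=10.0):
    """Try to open a TCP connection to an address like 'tcp://host:port'.

    Returns a tuple (ok, elapsed_seconds, error_message_or_None).
    """
    parsed = urlparse(address)
    host, port = parsed.hostname, parsed.port
    if host is None or port is None:
        raise ValueError(f"expected 'tcp://host:port', got {address!r}")

    start = time.monotonic()
    try:
        # create_connection raises an OSError subclass on timeout or refusal.
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, None
    except OSError as exc:
        return False, time.monotonic() - start, str(exc)
```

Calling `probe_tcp("tcp://10.255.32.50:35106", timeout=30.0)` from the affected node while the worker is still running would show whether the connection is refused, unreachable, or just slower than the 10 s limit in the error above.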

I used a new venv for the installation of MIDTools. The issue occurred less often when I lowered the number of workers and nodes, but the processing never ran consistently for me with the settings provided in the test branch.

The best I could get was 100 trains with 8 jobs and 8 procs instead of 12; bumping it to 12 made it fail (seemingly) every time.

I troubleshot this a bit. The errors seemed to happen more often when the jobs were submitted from an allocated node (instead of a login node), and perhaps also when the GPFS scratch directory (/gpfs/data/scratch/...) was used instead of the home directory. These are pretty weak observations, though, since I didn't test it very much.

Note to self: try increasing the file limit.
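A possible way to try this from Python, assuming "file limit" here means the per-process limit on open file descriptors (which also counts sockets): the sketch below raises the soft limit towards the hard limit using the standard-library `resource` module (Unix only). The function name `raise_open_file_limit` and the default target of 4096 are placeholders for illustration, not values taken from the test branch.

```python
import resource


def raise_open_file_limit(target=4096):
    """Raise the soft open-file limit to `target`, capped at the hard limit.

    Returns (old_soft, new_soft). The limit is never lowered.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    # An unprivileged process cannot go above the hard limit.
    if hard == resource.RLIM_INFINITY:
        wanted = target
    else:
        wanted = min(target, hard)

    # Nothing to do if the current soft limit is already high enough.
    if soft == resource.RLIM_INFINITY or wanted <= soft:
        return soft, soft

    resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
    return soft, wanted
```

The change only applies to the calling process and the children it starts afterwards, so it would have to run inside each worker process (or be replaced by an equivalent `ulimit -n` in the job script) to affect workers started by the batch system.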
