Provide guidance on working around docker rate limiting for large CHTC runs #462

@jhiemstrawisc

I was helping @nisthapanda debug failures in one of her workflows, where Snakemake reported issues unpacking a singularity container image:

RuleException:
ContainerError in file "/var/lib/condor/execute/slot1/dir_3175487/scratch/spras/Snakefile", line 285:
An unexpected non-zero exit status (255) occurred while running this singularity container:
    FATAL:   While checking container encryption: could not open image /var/lib/condor/execute/slot1/dir_3175487/scratch/unpacked/omics-integrator-1_v2: failed to retrieve path for /var/lib/condor/execute/slot1/dir_3175487/scratch/unpacked/omics-integrator-1_v2: lstat /var/lib/condor/execute/slot1/dir_3175487/scratch/unpacked/omics-integrator-1_v2: no such file or directory

The real cause was in the job's individual .out file (for some reason apptainer writes errors to stdout, while Snakemake writes regular output to stderr... go figure):

FATAL:   While making image from oci registry: error fetching image to cache: while building SIF from layers: conveyor failed to get: GET https://index.docker.io/v2/reedcompbio/omics-integrator-1/manifests/v2: TOOMANYREQUESTS: You have reached your unauthenticated pull rate limit. https://www.docker.com/increase-rate-limit

This error occurs when the execution point (EP) a job lands on tries to fetch an image from Docker Hub after it has already exceeded the rate limit for unauthenticated pulls. It is a consequence of pulling images inside the jobs instead of pre-declaring them in the submit description (I think that would be difficult given the way SPRAS is designed, which is why I'm not proposing it as the solution).
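
Until the error surfaces somewhere more visible, a quick way to confirm that a failed run hit this limit is to scan the per-job output files for the marker in the message above. The following is a minimal sketch, not part of SPRAS: the function name `find_rate_limited_logs`, the `*.out` pattern, and the directory layout are assumptions to adjust for your own submit setup.

```python
from pathlib import Path

# Marker taken from the apptainer error quoted above.
RATE_LIMIT_MARKER = "TOOMANYREQUESTS"


def find_rate_limited_logs(log_dir, pattern="*.out"):
    """Return sorted paths of job output files that mention the rate-limit marker.

    log_dir and pattern are assumptions about where the job .out files live.
    """
    hits = []
    for path in sorted(Path(log_dir).rglob(pattern)):
        # errors="replace" keeps the scan going if a log contains odd bytes.
        text = path.read_text(errors="replace")
        if RATE_LIMIT_MARKER in text:
            hits.append(path)
    return hits
```

Calling `find_rate_limited_logs("logs")` from the submit directory would list the jobs whose image pull was rejected, assuming the .out files are written under a `logs` directory.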

I think there's probably a two-part fix here:

  1. We should catch the error at fetch time instead of at open time for these images. That is, fail early where we might provide a more meaningful error message so users don't have to hunt through logs.
  2. I should work on providing guidance for the SPRAS profile that would set an apptainer image repository cache. This can then be configured to point at whatever hostname CHTC uses, which would help us avoid this rate limiting issue.
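
The first item could look roughly like the sketch below. It is only an illustration under stated assumptions, not how SPRAS or Snakemake currently handle pulls: the names `ContainerPullRateLimited`, `check_pull_result`, and `pull_image` are hypothetical, and the pull command is passed in by the caller rather than hard-coded. Both output streams are captured together because, as noted above, apptainer reported this error on stdout.

```python
import subprocess

# Marker taken from the apptainer error quoted in this issue.
RATE_LIMIT_MARKER = "TOOMANYREQUESTS"


class ContainerPullRateLimited(RuntimeError):
    """Raised when a registry rejects an image pull because of rate limiting."""


def check_pull_result(image, returncode, output):
    """Raise a descriptive error if the pull failed; do nothing on success."""
    if returncode == 0:
        return
    if RATE_LIMIT_MARKER in output:
        raise ContainerPullRateLimited(
            f"Pulling {image} failed because the registry's pull rate limit "
            "was reached on this execution point."
        )
    raise RuntimeError(f"Pulling {image} failed (exit {returncode}):\n{output}")


def pull_image(image, pull_cmd):
    """Run pull_cmd (a hypothetical, caller-supplied argument list) and fail early."""
    proc = subprocess.run(
        pull_cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge streams so no error text is missed
        text=True,
    )
    check_pull_result(image, proc.returncode, proc.stdout)
    return proc.stdout
```

Keeping the check in a separate function from the subprocess call means the detection logic can be exercised against saved log text without pulling anything.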
