CI timeout (test-llava-runner-linux) since #7922 #8180

@swolchok

🐛 Describe the bug

Changing from the nightly wheel to --use-pt-pinned-commit (a from-source build of the pinned PyTorch commit, which matches nightly) caused CI timeouts for long jobs, apparently with large timestamp "gaps" in the logs.

In the raw logs for the first test-llava-runner-linux timeout on main, there are almost 40 minutes (EDIT: actually 83 minutes, see 43-minute gap below) of "gaps" in the logs with no timestamps. Specifically:

  • 14-minute "gap" in logs, jumping from 2025-01-31T23:11:50.6702881Z to 2025-01-31T23:25:21.0383351Z, during export.
  • 25-minute "gap" in logs from 2025-01-31T23:25:21.2243143Z to 2025-01-31T23:42:40.6914293Z, and the second message is just a job timeout. This also seems to be during export; offhand I'm not sure why we are exporting multiple times, but that's a separate problem regardless.
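
For anyone who wants to reproduce the gap hunt on another run, here is a minimal sketch of how such gaps could be located automatically. It assumes each raw log line starts with a UTC timestamp in the same format as the ones quoted above (seven fractional digits, trailing `Z`). The names `parse_ts` and `find_gaps` and the 5-minute threshold are hypothetical choices for illustration, not part of any existing CI tooling.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed line prefix, e.g. "2025-01-31T23:11:50.6702881Z ...".
# datetime only stores microseconds, so the 7th fractional digit is dropped.
_TS = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\.(\d+)Z")


def parse_ts(line):
    """Return the leading timestamp of a log line, or None if it has none."""
    m = _TS.match(line)
    if m is None:
        return None
    base, frac = m.groups()
    micro = frac[:6].ljust(6, "0")  # truncate or pad to exactly 6 digits
    parsed = datetime.strptime(f"{base}.{micro}", "%Y-%m-%dT%H:%M:%S.%f")
    return parsed.replace(tzinfo=timezone.utc)


def find_gaps(lines, threshold=timedelta(minutes=5)):
    """Return (start, end, duration) for consecutive timestamps >= threshold apart."""
    gaps = []
    prev = None
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue  # lines without a timestamp prefix are ignored
        if prev is not None and ts - prev >= threshold:
            gaps.append((prev, ts, ts - prev))
        prev = ts
    return gaps
```

Passing the lines of a downloaded raw log to `find_gaps` yields one tuple per silent stretch, which makes it easier to compare the last good run against the first bad run.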

@metascroy found that increasing the timeout to 180 minutes causes the job in question to succeed after 150 minutes.

I've ruled out safetensors download being the cause; it took about 6.5 minutes in the last good run and about 6 minutes in the first bad run.

Versions

N/A

Labels

  • module: ci (issues related to continuous integration)
  • triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
