
cdx_writer.py times out when a large number of URIs are present in a WARC #33

@kiska3


I currently have 71 tasks that have timed out (or at least did not finish within ~76k seconds), apparently because of the large number of URIs in each megawarc.

These can be found in the tinypic collection from archiveteam.

Example tasks:
warning: ulimit -v 1048576 && PYTHONPATH=/petabox/sw/lib/python timeout 76800 /petabox/sw/bin/cdx_writer.pex 'tinypic_20190830091905_c83d08f5.megawarc.warc.gz' --file-prefix='archiveteam_tinypic_20190830091905_c83d08f5' --exclude-list='/petabox/sw/wayback/web_excludes.txt' --stats-file='/f/_archiveteam_tinypic_20190830091905_c83d08f5/cdxstats.json'> '/t/_archiveteam_tinypic_20190830091905_c83d08f5/cdx.txt' failed with exit code: 124, but told to continue on...

warning: ulimit -v 1048576 && PYTHONPATH=/petabox/sw/lib/python timeout 76764 /petabox/sw/bin/cdx_writer.pex 'tinypic_20190830120442_36ec361d.megawarc.warc.gz' --file-prefix='archiveteam_tinypic_20190830120442_36ec361d' --exclude-list='/petabox/sw/wayback/web_excludes.txt' --stats-file='/f/_archiveteam_tinypic_20190830120442_36ec361d/cdxstats.json'> '/t/_archiveteam_tinypic_20190830120442_36ec361d/cdx.txt' failed with exit code: 124, but told to continue on...
And 69 more
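
For anyone triaging the remaining tasks: exit code 124 is the status GNU `timeout` returns when it has to kill the command for running past the limit, so these are real timeouts rather than crashes inside cdx_writer. Below is a minimal sketch for pulling the WARC name, time limit, and exit code out of warning lines like the two above. The regex and the `parse_warning` helper are my own assumptions, inferred only from those two example lines; they are not part of cdx_writer or the petabox tooling.

```python
import re

# Hypothetical pattern: the field layout is inferred from the two warning
# lines quoted in this issue, not from any documented log format.
WARNING_RE = re.compile(
    r"timeout (?P<limit>\d+) \S*cdx_writer\.pex '(?P<warc>[^']+)'"
    r".*failed with exit code: (?P<code>\d+)"
)


def parse_warning(line):
    """Return (warc, limit_seconds, exit_code, timed_out), or None if no match."""
    m = WARNING_RE.search(line)
    if m is None:
        return None
    code = int(m.group("code"))
    # GNU timeout exits with 124 when the command exceeded the time limit.
    return m.group("warc"), int(m.group("limit")), code, code == 124
```

Running this over the full task log should give a list of the affected megawarcs and the limit each one hit, which may help check whether the failures line up with record counts.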
