Excessive memory usage when loading a WARC with big files #5

Description

@bzc6p

I tried to load a WARC containing a few larger (200-300 MB) files. During loading (indexing), the memory usage of the Python process doing the indexing grew to about 700 MB; the process then ran out of memory and left the following error message in the terminal:

Loading /media/datadisk/upload_queue/hajduvolan_hu_2015_05.warc.gz
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
    self.run()
  File "./warcproxy.py", line 112, in run
    http_response = parse_http_response(record)
  File "./warcproxy.py", line 24, in parse_http_response
    remainder = message.feed(record.content[1])
  File "/home/istvan/warc-proxy/hanzo/httptools/messaging.py", line 576, in feed
    text = HTTPMessage.feed(self, text)
  File "/home/istvan/warc-proxy/hanzo/httptools/messaging.py", line 94, in feed
    text = self.feed_start(text)
  File "/home/istvan/warc-proxy/hanzo/httptools/messaging.py", line 179, in feed_start
    line, text = self.feed_line(text)
  File "/home/istvan/warc-proxy/hanzo/httptools/messaging.py", line 159, in feed_line
    text = str(self.buffer[pos:])
MemoryError

The progress bar got stuck and the indexing stopped.
I suspect the big files are responsible, as I have been using this great tool for a long time and have not run into such a problem before (this was the first time I tried to load a WARC with files larger than a few tens of megabytes). However, I can't imagine why warc-proxy would need 700 MB of memory to index a 250 MB file.
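
One possible explanation, going only by the last frame of the traceback (`text = str(self.buffer[pos:])`): slicing a large buffer and then converting the slice to a string each create a full copy of the remaining data, on top of the record content already held in memory, so a 250 MB body could plausibly be held several times at once. The sketch below is not the hanzo code; `split_first_line` is a hypothetical helper, written for Python 3, that shows how the first line can be split off while the remainder is exposed as a zero-copy `memoryview`.

```python
def split_first_line(buffer):
    """Split off the first CRLF-terminated line of a bytearray.

    Hypothetical helper for illustration only (not part of hanzo/httptools).
    Returns (line, rest): `line` is a small bytes copy of the first line,
    or None if no CRLF is present yet; `rest` is a memoryview over the
    same underlying bytearray, so the large remainder is never copied.
    """
    pos = buffer.find(b"\r\n")
    if pos == -1:
        # No complete line yet; hand back a view of everything.
        return None, memoryview(buffer)
    line = bytes(buffer[:pos])            # copies only the short line
    rest = memoryview(buffer)[pos + 2:]   # zero-copy view of the body
    return line, rest


# Small stand-in for a record with a large body.
payload = bytearray(b"HTTP/1.1 200 OK\r\n" + b"x" * 1024)
line, rest = split_first_line(payload)
```

Note that while a `memoryview` of a `bytearray` is alive, the `bytearray` cannot be resized, so this approach only fits code that does not append to the buffer while the view is in use.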

I think you can easily reproduce the problem; the problematic WARC is available here: https://archive.org/details/hajduvolan_hu_2015_05. The files most likely responsible are http://www.hajduvolan.hu/files/userfiles/Flash/EU_projekt_2010-2012.flv (249 MB) and http://www.hajduvolan.hu/files/userfiles/Flash/EU_projekt.flv (146 MB).
