I tried to load a WARC containing a few larger files (200-300 MB each). While the WARC was being loaded (indexed), the memory usage of the Python process doing the indexing climbed to about 700 MB, and then the process ran out of memory, leaving the following error message in the terminal:
Loading /media/datadisk/upload_queue/hajduvolan_hu_2015_05.warc.gz
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
    self.run()
  File "./warcproxy.py", line 112, in run
    http_response = parse_http_response(record)
  File "./warcproxy.py", line 24, in parse_http_response
    remainder = message.feed(record.content[1])
  File "/home/istvan/warc-proxy/hanzo/httptools/messaging.py", line 576, in feed
    text = HTTPMessage.feed(self, text)
  File "/home/istvan/warc-proxy/hanzo/httptools/messaging.py", line 94, in feed
    text = self.feed_start(text)
  File "/home/istvan/warc-proxy/hanzo/httptools/messaging.py", line 179, in feed_start
    line, text = self.feed_line(text)
  File "/home/istvan/warc-proxy/hanzo/httptools/messaging.py", line 159, in feed_line
    text = str(self.buffer[pos:])
MemoryError
The progress bar got stuck and the indexing stopped.
I suspect the big files are responsible, since I have been using this great tool for a long time and have never run into this problem before (this was the first time I tried to load a WARC with files larger than a few tens of megabytes). However, I can't see why warc-proxy would need 700 MB of memory to index a 250 MB file.
I think you can easily reproduce the problem. The problematic WARC is available here: https://archive.org/details/hajduvolan_hu_2015_05. The files most likely causing the problem are http://www.hajduvolan.hu/files/userfiles/Flash/EU_projekt_2010-2012.flv (249 MB) and http://www.hajduvolan.hu/files/userfiles/Flash/EU_projekt.flv (146 MB).
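A guess based on the last frame of the traceback, `text = str(self.buffer[pos:])`: slicing the buffer and then converting the slice can create two extra copies of the remaining data, which would explain memory use of several times the file size. The sketch below is not the hanzo code. It assumes the buffer is a `bytearray` (I have not checked this), it is written for Python 3 so that `tracemalloc` can be used, and `feed_line_copying`, `feed_line_view` and `peak_bytes` are hypothetical names used only for illustration.

```python
import tracemalloc


def feed_line_copying(buffer, pos):
    # Roughly what the traceback's last frame does: the slice is one copy of
    # the remainder, and the conversion to an immutable string is a second.
    return bytes(buffer[pos:])


def feed_line_view(buffer, pos):
    # Possible alternative: a memoryview slice refers to the same memory,
    # so no copy of the remainder is made.
    return memoryview(buffer)[pos:]


def peak_bytes(func, buffer, pos):
    # Measure the peak amount of memory allocated while func runs.
    tracemalloc.start()
    result = func(buffer, pos)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak, len(result)


if __name__ == "__main__":
    data = bytearray(16 * 1024 * 1024)  # stand-in for a large HTTP body
    for func in (feed_line_copying, feed_line_view):
        peak, remaining = peak_bytes(func, data, 16)
        print(func.__name__, "peak bytes:", peak, "remaining bytes:", remaining)
```

If the real buffer behaves like this, keeping an offset into the buffer (or a view of it) instead of re-slicing it for every header line might keep memory use close to the size of the record itself.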