Make a .tar archive containing a single large file, archivemount it, and copy out the single entry in the archive:
$ truncate --size=256M zeroes
$ tar cf zeroes-256mib.tar zeroes
$ archivemount zeroes-256mib.tar ~/mnt
$ dd if=~/mnt/zeroes of=/dev/null status=progress
267633152 bytes (268 MB, 255 MiB) copied, 77 s, 3.5 MB/s
524288+0 records in
524288+0 records out
268435456 bytes (268 MB, 256 MiB) copied, 77.5176 s, 3.5 MB/s
Notice that the "MB/s" dd throughput slows down (in the interactive output, not copy/pasted above) as time goes on. It looks like there's some sort of super-linear algorithm involved, and the whole thing takes more than a minute. In comparison, a straight tar extraction takes less than a second.
$ time tar xf zeroes-256mib.tar
real 0m0.216s
Sprinkling some logging in the _ar_read function in archivemount.c shows that the dd leads to multiple _ar_read calls. In the steady state, it reads size = 128 KiB each time, with the offset argument incrementing on each call.
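For context, _ar_read is archivemount's FUSE read handler. Assuming the high-level FUSE 2.x API, the kernel splits a sequential read into bounded requests (typically capped at 128 KiB) and delivers each one as a separate callback with an explicit offset, which matches the pattern the logging shows:

#define FUSE_USE_VERSION 26
#include <fuse.h>

/* Stub with the same signature archivemount's _ar_read implements; the
 * kernel delivers dd's single sequential read as many such calls:
 *   _ar_read(path, buf, 131072,      0, fi)
 *   _ar_read(path, buf, 131072, 131072, fi)
 *   ...                                                               */
static int _ar_read(const char *path, char *buf, size_t size,
                    off_t offset, struct fuse_file_info *fi)
{
    (void)path; (void)buf; (void)size; (void)offset; (void)fi;
    return 0;  /* stub only; the real per-call work is described below */
}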
Inside that _ar_read function, if we take the false branch of if (node->modified), then for each call (sketched in code after this list):

- we call archive_read_new, even though it's the same archive every time;
- we iterate from the start of that archive until we find the archive_entry for the file, again once per call, even though it's conceptually the same archive_entry object used in previous calls;
- we copy out offset bytes from the start of the entry's contents into a 'trash' buffer, before finally copying out the payload that _ar_read is actually interested in.
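To make the cost concrete, here is a paraphrased sketch of that per-call work. It is not the literal archivemount source (the function and parameter names are placeholders), but the libarchive calls are the real API:

#include <archive.h>
#include <archive_entry.h>
#include <string.h>
#include <sys/types.h>

/* Paraphrased sketch of the work _ar_read repeats on every single call
 * when node->modified is false. */
static ssize_t
ar_read_sketch(const char *archive_path, const char *entry_path,
               char *buf, size_t size, off_t offset)
{
    struct archive *a = archive_read_new();   /* fresh handle, every call */
    struct archive_entry *entry;
    ssize_t ret = -1;

    archive_read_support_format_all(a);
    archive_read_support_filter_all(a);
    if (archive_read_open_filename(a, archive_path, 10240) != ARCHIVE_OK)
        goto out;

    /* Linear scan from the start of the archive, on every call. */
    while (archive_read_next_header(a, &entry) == ARCHIVE_OK)
        if (strcmp(archive_entry_pathname(entry), entry_path) == 0)
            break;

    /* Decompress and throw away the first `offset` bytes of the entry. */
    char trash[8192];
    off_t skipped = 0;
    while (skipped < offset) {
        size_t chunk = sizeof(trash);
        if ((off_t)chunk > offset - skipped)
            chunk = (size_t)(offset - skipped);
        ssize_t n = archive_read_data(a, trash, chunk);
        if (n <= 0)
            goto out;
        skipped += n;
    }

    /* Only now read the bytes the caller actually asked for. */
    ret = archive_read_data(a, buf, size);
out:
    archive_read_free(a);
    return ret;
}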
The total number of bytes produced by archive_read_data calls is therefore quadratic in the decompressed size of the archive entry: for our 256 MiB file read in 128 KiB chunks, that is 2048 calls decompressing roughly 128 KiB × (1 + 2 + ... + 2048), about 256 GiB in total. This is already slow for .tar files and probably worse for .tar.gz files, where the skipped-over bytes must be decompressed too. The whole thing is reminiscent of the Shlemiel the Painter story.
There may be some complications if we're re-writing the archive, but when mounting read-only, we should be able to get much better performance.
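As a minimal sketch of what a read-only fast path could look like (hypothetical names throughout, not a patch against archivemount): cache the open libarchive handle and the position reached so far, and only re-open the archive when a caller seeks backwards. dd-style sequential reads then cost O(size) per call instead of O(offset), and the whole copy becomes linear.

#include <archive.h>
#include <archive_entry.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical per-file cursor; archivemount could hang one off its node. */
struct read_cursor {
    struct archive *a;   /* handle positioned `pos` bytes into the entry */
    off_t pos;
};

/* Open the archive and position the handle at the start of entry_path
 * (the same scan-from-the-front work as before, but now done rarely). */
static struct archive *
open_at_entry(const char *archive_path, const char *entry_path)
{
    struct archive *a = archive_read_new();
    struct archive_entry *entry;

    archive_read_support_format_all(a);
    archive_read_support_filter_all(a);
    if (archive_read_open_filename(a, archive_path, 10240) != ARCHIVE_OK)
        goto fail;
    while (archive_read_next_header(a, &entry) == ARCHIVE_OK)
        if (strcmp(archive_entry_pathname(entry), entry_path) == 0)
            return a;
fail:
    archive_read_free(a);
    return NULL;
}

static ssize_t
cursor_read(struct read_cursor *c, const char *archive_path,
            const char *entry_path, char *buf, size_t size, off_t offset)
{
    /* Re-open (and pay the scan cost) only when the caller seeks backwards. */
    if (c->a == NULL || offset < c->pos) {
        if (c->a)
            archive_read_free(c->a);
        if ((c->a = open_at_entry(archive_path, entry_path)) == NULL)
            return -1;
        c->pos = 0;
    }

    /* Skip forward to the requested offset; for a sequential reader like
     * dd the gap is zero, so this loop does not run at all. */
    char trash[8192];
    while (c->pos < offset) {
        size_t chunk = sizeof(trash);
        if ((off_t)chunk > offset - c->pos)
            chunk = (size_t)(offset - c->pos);
        ssize_t skipped = archive_read_data(c->a, trash, chunk);
        if (skipped <= 0)
            return -1;
        c->pos += skipped;
    }

    ssize_t n = archive_read_data(c->a, buf, size);
    if (n > 0)
        c->pos += n;
    return n;
}

libarchive streams strictly forward, so a backwards seek still costs a full rescan; but dd, cp, and most other readers are sequential, which is exactly the case this fixes.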