Make a .tar archive containing a single large file, archivemount it, and copy out the single entry in the archive:
$ truncate --size=256M zeroes
$ tar cf zeroes-256mib.tar zeroes
$ archivemount zeroes-256mib.tar ~/mnt
$ dd if=~/mnt/zeroes of=/dev/null status=progress
267633152 bytes (268 MB, 255 MiB) copied, 77 s, 3.5 MB/s
524288+0 records in
524288+0 records out
268435456 bytes (268 MB, 256 MiB) copied, 77.5176 s, 3.5 MB/s
Notice that the "MB/s" dd throughput slows down (in the interactive output, not copy/pasted above) as time goes on. It looks like there's some sort of super-linear algorithm involved, and the whole thing takes more than a minute. In comparison, a straight tar extraction takes less than a second.
$ time tar xf zeroes-256mib.tar
real 0m0.216s
Sprinkling some logging in the _ar_read function in archivemount.c shows that the dd leads to multiple _ar_read calls. In the steady state, it reads size = 128 KiB each time, with the offset argument incrementing on each call.
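For context, _ar_read is archivemount's FUSE read handler. Assuming the high-level FUSE 2.x API, the kernel splits a sequential read into bounded requests (typically capped at 128 KiB) and delivers each one as a separate callback with an explicit offset, which matches the pattern the logging shows:

#define FUSE_USE_VERSION 26
#include <fuse.h>

/* Stub with the same signature archivemount's _ar_read implements; the
 * kernel delivers dd's single sequential read as many such calls:
 *   _ar_read(path, buf, 131072,      0, fi)
 *   _ar_read(path, buf, 131072, 131072, fi)
 *   ...                                                               */
static int _ar_read(const char *path, char *buf, size_t size,
                    off_t offset, struct fuse_file_info *fi)
{
    (void)path; (void)buf; (void)size; (void)offset; (void)fi;
    return 0;  /* stub only; the real per-call work is described below */
}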
Inside that _ar_read function, if we take the false branch of if (node->modified), then for each call (sketched in code after this list):

- we call archive_read_new, even though it's the same archive every time;
- we iterate from the start of that archive until we find the archive_entry for the file, again once per call, even though it's conceptually the same archive_entry object used in previous calls;
- we copy out offset bytes from the start of the entry's contents into a 'trash' buffer, before finally copying out the payload that _ar_read is actually interested in.
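To make the cost concrete, here is a paraphrased sketch of that per-call work. It is not the literal archivemount source (the function and parameter names are placeholders), but the libarchive calls are the real API:

#include <archive.h>
#include <archive_entry.h>
#include <string.h>
#include <sys/types.h>

/* Paraphrased sketch of the work _ar_read repeats on every single call
 * when node->modified is false. */
static ssize_t
ar_read_sketch(const char *archive_path, const char *entry_path,
               char *buf, size_t size, off_t offset)
{
    struct archive *a = archive_read_new();   /* fresh handle, every call */
    struct archive_entry *entry;
    ssize_t ret = -1;

    archive_read_support_format_all(a);
    archive_read_support_filter_all(a);
    if (archive_read_open_filename(a, archive_path, 10240) != ARCHIVE_OK)
        goto out;

    /* Linear scan from the start of the archive, on every call. */
    while (archive_read_next_header(a, &entry) == ARCHIVE_OK)
        if (strcmp(archive_entry_pathname(entry), entry_path) == 0)
            break;

    /* Decompress and throw away the first `offset` bytes of the entry. */
    char trash[8192];
    off_t skipped = 0;
    while (skipped < offset) {
        size_t chunk = sizeof(trash);
        if ((off_t)chunk > offset - skipped)
            chunk = (size_t)(offset - skipped);
        ssize_t n = archive_read_data(a, trash, chunk);
        if (n <= 0)
            goto out;
        skipped += n;
    }

    /* Only now read the bytes the caller actually asked for. */
    ret = archive_read_data(a, buf, size);
out:
    archive_read_free(a);
    return ret;
}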
The total number of bytes produced by archive_read_data calls is therefore quadratic in the decompressed size of the archive entry: for our 256 MiB file read in 128 KiB chunks, that is 2048 calls decompressing roughly 128 KiB × (1 + 2 + ... + 2048), about 256 GiB in total. This is already slow for .tar files and probably worse for .tar.gz files, where the skipped-over bytes must be decompressed too. The whole thing is reminiscent of the Shlemiel the Painter story.
There may be some complications if we're re-writing the archive, but when mounting read-only, we should be able to get much better performance.
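As a minimal sketch of what a read-only fast path could look like (hypothetical names throughout, not a patch against archivemount): cache the open libarchive handle and the position reached so far, and only re-open the archive when a caller seeks backwards. dd-style sequential reads then cost O(size) per call instead of O(offset), and the whole copy becomes linear.

#include <archive.h>
#include <archive_entry.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical per-file cursor; archivemount could hang one off its node. */
struct read_cursor {
    struct archive *a;   /* handle positioned `pos` bytes into the entry */
    off_t pos;
};

/* Open the archive and position the handle at the start of entry_path
 * (the same scan-from-the-front work as before, but now done rarely). */
static struct archive *
open_at_entry(const char *archive_path, const char *entry_path)
{
    struct archive *a = archive_read_new();
    struct archive_entry *entry;

    archive_read_support_format_all(a);
    archive_read_support_filter_all(a);
    if (archive_read_open_filename(a, archive_path, 10240) != ARCHIVE_OK)
        goto fail;
    while (archive_read_next_header(a, &entry) == ARCHIVE_OK)
        if (strcmp(archive_entry_pathname(entry), entry_path) == 0)
            return a;
fail:
    archive_read_free(a);
    return NULL;
}

static ssize_t
cursor_read(struct read_cursor *c, const char *archive_path,
            const char *entry_path, char *buf, size_t size, off_t offset)
{
    /* Re-open (and pay the scan cost) only when the caller seeks backwards. */
    if (c->a == NULL || offset < c->pos) {
        if (c->a)
            archive_read_free(c->a);
        if ((c->a = open_at_entry(archive_path, entry_path)) == NULL)
            return -1;
        c->pos = 0;
    }

    /* Skip forward to the requested offset; for a sequential reader like
     * dd the gap is zero, so this loop does not run at all. */
    char trash[8192];
    while (c->pos < offset) {
        size_t chunk = sizeof(trash);
        if ((off_t)chunk > offset - c->pos)
            chunk = (size_t)(offset - c->pos);
        ssize_t skipped = archive_read_data(c->a, trash, chunk);
        if (skipped <= 0)
            return -1;
        c->pos += skipped;
    }

    ssize_t n = archive_read_data(c->a, buf, size);
    if (n > 0)
        c->pos += n;
    return n;
}

libarchive streams strictly forward, so a backwards seek still costs a full rescan; but dd, cp, and most other readers are sequential, which is exactly the case this fixes.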