Issue: storing Hi-C contacts in a gzipped .tsv causes major slowdowns for some computations. We need to pick a binary container and write software for common operations.
.tsv/.csv:
Cons:
- .tsv/.csv is row-oriented. Extra fields like readID or sam_fields are heavy compared to chrom and pos, yet they have to be unpacked every time the file is read through.
- text is very expensive to compress/decompress and parse. As a result, calculating P(s) curves and other stats can take 10 minutes or more; it could potentially be done in seconds if chrom and pos were stored in binary.
- no random access. There is a solution, bgzip+pairix, but it has many moving parts/dependencies.
Pros:
- tsv/csv is an easy format for a community to agree on, and it is the default expectation in bioinformatics
- platform-agnostic: command-line, python, R, C, win/linux/mac - all can work with tsv, to some extent.
- can be streamed between processes via pipes
- merge-sort is available and fairly efficient for text files
- .pairs.gz is already used by the 4DN and is not going away.
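As a rough illustration of the binary-columns point above: with a fixed-width record layout, jumping to any pair is a single offset calculation instead of a scan through text lines. This is only a stdlib sketch with a made-up record layout, not a proposal for the actual format.

```python
import struct

# Hypothetical fixed-width record for one pair:
# chrom1 id (uint8), pos1 (uint32), chrom2 id (uint8), pos2 (uint32).
PAIR = struct.Struct("<BIBI")  # 10 bytes per record, no text parsing

def pack_pairs(pairs):
    """Pack (chrom1, pos1, chrom2, pos2) tuples into one binary blob."""
    return b"".join(PAIR.pack(*p) for p in pairs)

def read_pair(blob, i):
    """Random access: jump straight to record i, no line-by-line scanning."""
    return PAIR.unpack_from(blob, i * PAIR.size)

blob = pack_pairs([(0, 100, 0, 5000), (0, 250, 1, 42)])
```

The same offset arithmetic works against a file on disk via seek(), which is exactly what gzipped text cannot offer without an external index like pairix.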
The alternative is to store pair tables in existing binary container files. The two options are:
HDF5:
Pros:
- a major standard, developed by a company, used by NASA, not going away
- an existing dependency, can be seen as an extension of cooler
- can store multiple tables per file: chromsizes and artifacts, like P(s) curves, trans-levels and other summaries, can be kept inside the file. HDF5 can even store non-tabular data, which could be useful for summary tables.
- easy appending, both along columns and along rows
Cons:
- columnar storage has to be implemented on top of HDF5. @nvictus has prototyped it: https://github.com/nvictus/coltab, but it needs more work. The result can potentially be popular and useful for other people and projects (including cooler?..).
- variable-length strings are not compressed, which makes them impractical. The solution is to store string arrays as multiple chunks of fixed-length strings, with the length varying between chunks; the chunks could then be merged together into a virtual dataset. Requires some work to implement and maintain.
- merge-sort has to be implemented.
- glacial development of the core format.
- there is a certain dislike of the format in the community, due to issues with parallelization and its complex implementation. Both seem to be improving over time.
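The columnar-storage-on-HDF5 idea can be sketched as one chunked, compressed 1-D dataset per column, so a scan that only needs chrom/pos never decompresses the heavy optional columns. This is only a minimal sketch of the concept behind coltab; the group and dataset names are illustrative, not coltab's actual layout.

```python
import os
import tempfile

import h5py
import numpy as np

# Illustrative file location and layout (not coltab's real schema).
path = os.path.join(tempfile.mkdtemp(), "pairs.h5")

with h5py.File(path, "w") as f:
    grp = f.create_group("pairs")
    # Each column is its own chunked, gzip-compressed dataset.
    grp.create_dataset("chrom1", data=np.array([0, 0, 1], dtype="u1"),
                       chunks=True, compression="gzip")
    grp.create_dataset("pos1", data=np.array([100, 250, 42], dtype="u4"),
                       chunks=True, compression="gzip")

with h5py.File(path, "r") as f:
    pos1 = f["pairs/pos1"][:]  # reads and decompresses only this column
```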
Parquet:
Pros:
- a major standard, supported by big players in the IT industry. Not likely to die.
- the library is young and is developing very rapidly.
- first-class support by pandas and arrow.
- supports compressed variable-length strings and dictionary compression.
- has built-in indexing of dataframes via block statistics
Cons:
- only one table per file! Chromsizes would have to be stored in the header (could be an issue for low-quality assemblies), and artifacts would have to be stored in separate parquet and non-parquet files (for non-tabular data). Alternatively, we could keep all related datasets together in a single zero-compression zip file, but that could be difficult to use from C/C++ and R.
- not designed for appending columns. Adding extra columns to .pairs would require a complete rewrite of the file. Appending rows seems to work, but is discouraged by design.
- merge-sort has to be implemented.
- extra dependency, though not a major one.
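On the merge-sort point, which applies to both containers: the k-way merge at the core of an external sort is not much work on top of either format, since sorted chunks can be streamed through Python's stdlib heapq.merge. The chunk contents below are made-up (chrom1, pos1, chrom2, pos2) tuples, just to show the shape of the operation.

```python
import heapq

# Each chunk is already sorted (as produced by sorting memory-sized
# batches of pairs before writing them out).
chunk_a = [(0, 100, 0, 5000), (0, 250, 1, 42)]
chunk_b = [(0, 150, 0, 10), (1, 7, 1, 8)]

# heapq.merge lazily merges any number of sorted iterators, so the full
# table never has to fit in memory; chunks can be streamed from disk.
merged = list(heapq.merge(chunk_a, chunk_b))
```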
Personally, I'm not happy with either of the solutions. Thoughts?...