binary file format for pairs #85

@golobor

Issue: storing Hi-C contacts in a gzipped .tsv causes major slowdowns for some computations. We need to pick a binary container and write software for the common operations.

.tsv/.csv:
Cons:

  • .tsv/.csv is row-oriented. Extra fields, like readID or sam_fields, are really heavy compared to chrom and pos, yet they have to be unpacked every time the file is read through.
  • text is very expensive to compress/decompress and parse. As a result, calculating P(s) curves and other stats can take 10 minutes or more. It could potentially be done in seconds if chrom and pos were stored in binary.
  • no random access. There is a workaround, bgzip+pairix, but it has many moving parts/dependencies.

Pros:

  • tsv/csv is a format that is easy for a community to agree on, and it is the default expectation in bioinformatics
  • platform-agnostic: command line, Python, R, C, win/linux/mac can all work with tsv, to some extent
  • can be streamed between processes via pipes
  • merge-sort is available and fairly efficient for text files
  • .pairs.gz is already used by 4DN and is not going away.
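To make the row-oriented cost above concrete, here is a minimal sketch (column names follow the .pairs convention; the two rows are made up for illustration): even with `usecols`, gzip has to decompress and the parser has to tokenize every field of every row before the heavy ones can be discarded.

```python
import gzip
import io

import pandas as pd

# Toy two-row .pairs-style table, gzipped in memory.
text = (
    "readID1\tchr1\t100\tchr1\t5000\t+\t-\n"
    "readID2\tchr1\t200\tchr2\t300\t+\t+\n"
)
buf = io.BytesIO(gzip.compress(text.encode()))

cols = ["readID", "chrom1", "pos1", "chrom2", "pos2", "strand1", "strand2"]
# Only chrom/pos are needed for a P(s) curve, but every field of every
# row still gets decompressed and tokenized before usecols drops it.
df = pd.read_csv(buf, sep="\t", names=cols,
                 usecols=["chrom1", "pos1", "chrom2", "pos2"],
                 compression="gzip")

# Genomic separations for cis pairs -- the input to a P(s) calculation.
cis = df[df["chrom1"] == df["chrom2"]]
distances = (cis["pos2"] - cis["pos1"]).abs()
```

With binary columnar storage, the `pos1`/`pos2` arrays could be read directly, skipping the decompress-and-parse step entirely.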

The alternative is to store pair tables in an existing binary container format. The two main options are:

HDF5:
Pros:

  • a major standard, developed by a company, used by NASA, not going away
  • an existing dependency, can be seen as an extension of cooler
  • can store multiple tables per file: chromsizes and artifacts, like P(s) curves, trans-levels and other summaries, can be kept inside the file. HDF5 can even store non-tabular data, which could be useful for summary tables.
  • easy appending, both along columns and along rows

Cons:

  • columnar storage has to be implemented on top of HDF5. @nvictus has prototyped it (https://github.com/nvictus/coltab), but it needs more work. The result could end up popular and useful for other people and projects (including cooler?..).
  • variable-length strings are not compressed, which makes them useless. The workaround is to store string arrays as multiple chunks of fixed-length strings, with the length varying between chunks; the chunks can then be stitched together into a virtual dataset. This requires some work to implement and maintain.
  • merge-sort has to be implemented.
  • glacial development of the core format.
  • there is a certain dislike of the format in the community, due to issues with parallelization and its complex implementation. Both seem to be improving over time.
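For reference, the "columnar storage on top of HDF5" idea can be sketched in a few lines of h5py: one compressed, resizable 1-D dataset per column (this is NOT the actual coltab layout, just an illustration of the principle). A P(s) scan can then read pos1/pos2 without touching heavier columns, and row appends are a resize plus a slice assignment.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "pairs.h5")
with h5py.File(path, "w") as f:
    grp = f.create_group("pairs")
    # One chunked, compressed, unlimited-length dataset per column.
    for name, dtype in [("chrom1", "i4"), ("pos1", "i8"),
                        ("chrom2", "i4"), ("pos2", "i8")]:
        grp.create_dataset(name, shape=(0,), maxshape=(None,),
                           dtype=dtype, chunks=True, compression="gzip")

    def append(dset, values):
        # Row-wise appending: grow the dataset, then write the new tail.
        n = dset.shape[0]
        dset.resize((n + len(values),))
        dset[n:] = values

    append(grp["chrom1"], np.array([0, 0], dtype="i4"))
    append(grp["pos1"], np.array([100, 200], dtype="i8"))
    append(grp["chrom2"], np.array([0, 1], dtype="i4"))
    append(grp["pos2"], np.array([5000, 300], dtype="i8"))

with h5py.File(path, "r") as f:
    pos1 = f["pairs/pos1"][:]  # reads only this column's chunks
```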

Parquet:
Pros:

  • a major standard, backed by big players in the IT industry. Not likely to die.
  • the library is young and is being developed very rapidly.
  • first-class support by pandas and arrow.
  • supports compressed variable-length strings and dictionary compression.
  • has built-in indexing of dataframes via per-row-group block statistics

Cons:

  • only one table per file! Chromsizes would have to be stored in the header (which could be an issue for low-quality assemblies); artifacts would have to be stored in separate parquet files, or non-parquet files for non-tabular data. Alternatively, all related datasets could be kept together in a single zero-compression zip file, but that could be difficult to use from C/C++ and R.
  • not designed for appending columns: adding extra columns to .pairs would require a complete rewrite of the file. Appending rows seems to work, but is not preferred by design.
  • merge-sort has to be implemented.
  • extra dependency, though not a major one.
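On the merge-sort point, which applies to both containers: the k-way merge at the core of an external sort is small with the standard library, as this sketch over in-memory chunks shows (real code would iterate lazily over sorted shards on disk):

```python
import heapq

# Pairs as tuples keyed lexicographically by (chrom1, pos1, chrom2, pos2),
# with chromosomes pre-mapped to integer indices. Each chunk is already
# sorted, as produced by an in-memory sort of one shard.
chunk_a = [(0, 100, 0, 5000), (0, 400, 1, 30)]
chunk_b = [(0, 200, 1, 300), (1, 50, 1, 75)]

# heapq.merge streams the globally sorted sequence without loading
# all chunks into memory at once.
merged = list(heapq.merge(chunk_a, chunk_b))
```

The harder part for either HDF5 or Parquet is the chunked reading/writing around this merge, not the merge itself.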

Personally, I'm not happy with either solution. Thoughts?..
