Thursday, January 9, 2020

GABAC: The Genomic Adaptive Binary Arithmetic Coder

Recently a paper was published in Bioinformatics titled "GABAC: an arithmetic coding solution for genomic data", by Voges et al.  This is the entropy encoder used in the MPEG-G standard.

The design of GABAC is interesting.  It consists of data transformations (including various byte- or word-to-bit packing functions), entropy encoding with some limited higher-order modelling, and a method for LZ-ish dictionary usage.  This means it is designed to be a one-stop shop, replacing the multitude of methods employed in CRAM or Deez.
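As a rough illustration of the "to bit" end of those transformations, below is a toy order-0 exponential-Golomb binariser in Python - the sort of value-to-bits mapping that a CABAC-style coder then compresses with adaptive binary models.  This is purely my own sketch, not GABAC's code, and real binarisations are chosen per data stream with tuned parameters.

# Toy order-0 exponential-Golomb binariser; illustrative only, not GABAC's code.
def exp_golomb_encode(n):
    """Bit-string code for a non-negative integer n."""
    bits = bin(n + 1)[2:]                  # n+1 in binary
    return "0" * (len(bits) - 1) + bits    # leading zeros announce the length

def exp_golomb_decode(code):
    """Decode one value; return (value, remaining bits)."""
    zeros = 0
    while code[zeros] == "0":
        zeros += 1
    return int(code[zeros:2 * zeros + 1], 2) - 1, code[2 * zeros + 1:]

for v in (0, 1, 2, 7, 255):
    c = exp_golomb_encode(v)
    assert exp_golomb_decode(c)[0] == v    # round-trips correctly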

First things first: despite the shortcomings below, I like this paper, in both its design and how it is written.  This post may seem like I'm picking holes, but that is science and open peer review.  Although I have been publicly damning of some aspects of MPEG-G, I wish to emphasise that this is aimed at the behind-closed-doors process used by ISO and the actions of a few specific individuals and companies, mainly revolving around GenomSys, and does not at all reflect on how I view these particular authors, whose work I have a lot of respect for.

They are producing an entirely Open Source implementation of MPEG-G as part of the Genie project.  At the moment it focuses on compression of unmapped data (e.g. FASTQ), but it gets some impressive results in that department.  I'm a total data compression geek, so any new public compression software gets me all happy!

I also have a huge amount of respect for the attention to detail in the reproducibility of results.  Not only have they documented the data sets they used (a "must", but still sadly not a given), they also provide all of their separate data streams and even the log files from their analysis steps.  This is crucial for validating the figures in the paper.  My sincere thanks for this.  Without it I'd have been left scratching my head, unable to dig down into the data to discover the unfortunate mistakes.

However... In December I informed the authors about my concerns with some of the figures in the paper, and followed up with a more complete set of issues in early January.  The authors have since acknowledged some of the problems and confirmed that a revised version of the paper will appear in due course.  I've done my own analysis of their published log files (thanks again!) but I won't steal their thunder here.  However, given the paper makes some striking - and factually incorrect - claims about the relative speeds and efficacy of certain codecs, including my own, I feel duty bound to at least point out where the flaws are, so people don't put too much emphasis on the results until the corrected ones are made available.

The supplementary data


I'll start with this as it's where the most obvious smoking gun lies.  Table 4 is below:


Unfortunately pretty much every single item in this table is incorrect bar the column headings.

For now, I'd advise readers to simply ignore this table.  Problems include:
  • Most rows have been transposed, so where it says "GABAC" read bzip2 instead.  (GABAC is actually the "xz" line.)
  • rANS-0 and rANS-1 were built without compiler optimisation, so their speeds are far too low.
  • Decompression speed is computed as compressed_size/time rather than decompressed_size/time, hence a tool that compresses well is penalised by appearing to decompress slowly (see the sketch after this list).
  • All figures are averages over a whole set of results with highly variable input sizes (sizes that also don't reflect real-world data).
  • Speeds are for maximum compression, so gzip -9, xz -9, etc.  While interesting, this doesn't represent the more usual default levels, which offer the more commonly used speed vs size trade-off.
  • Speeds include the time to write to disk.  That's perhaps realistic for compression, but for decompression we more likely want the time to load from disk and decompress, not the time to write the result back.  I never end up writing an uncompressed file to disk, so this penalises the faster codecs that can out-perform disk I/O.
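To make the decompression-speed point concrete, here's a toy calculation with made-up numbers - not figures from the paper - showing how dividing compressed bytes by time under-reports a strong compressor:

# Hypothetical numbers, purely to illustrate the metric definition.
compressed_bytes   = 25_000_000      # output of a 4:1 codec
decompressed_bytes = 100_000_000
decode_seconds     = 0.5

print("compressed/time:   %6.1f MB/s" % (compressed_bytes   / decode_seconds / 1e6))
print("decompressed/time: %6.1f MB/s" % (decompressed_bytes / decode_seconds / 1e6))
# The first figure makes the very same decode look four times slower.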
A new version will be appearing soon, although it may not fix all of these issues.  I have my own analysis already, but I'll hold off publishing it until the authors have had a chance to present their corrected paper.


The Main Paper

Table 1 in the main paper has some figures for compressing the CRAM and Deez byte streams using the CRAM codecs, along with CRAM+GABAC.

I like how this was done, although it's perhaps a bit confusing as this isn't the actual size of a CRAM or Deez file and doesn't take into account things like random access.  It does however give a handy guide to the efficacy of the pick-and-mix approach to codec usage.

The authors demonstrate a modest improvement if GABAC is added into the mix, and also show that GABAC essentially replaces the need for gzip or rANS-0 in CRAM.  It's a nice way of looking at the problem.
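As a side note, the pick-and-mix idea itself is conceptually simple: try each candidate codec on a data stream and keep the smallest output.  A minimal Python sketch, using stand-in general-purpose codecs rather than CRAM's actual candidate set, and ignoring the speed and random-access constraints a real encoder also has to weigh:

# Per-stream codec selection sketch; stand-in codecs, size as the only criterion.
import bz2, gzip, lzma

CODECS = {"gzip": gzip.compress, "bzip2": bz2.compress, "xz": lzma.compress}

def pick_codec(stream):
    """Return (codec_name, compressed_bytes) giving the smallest output."""
    return min(((name, fn(stream)) for name, fn in CODECS.items()),
               key=lambda candidate: len(candidate[1]))

name, blob = pick_codec(b"ACGTACGGATTA" * 5000)
print(name, len(blob), "bytes")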

It does raise the question though: how well would GABAC work alone, without the other codecs (mainly rANS-1, bzip2 and xz)?  Especially given it is meant to be used alone as part of MPEG-G.  Using their log files I get this:



Note this isn't an indication of how well MPEG-G works, because the data streams are different, just as they are between CRAM and Deez.  However it does indicate a potential weakness in the encoding, or at least in this particular implementation of it.  Finding the optimal configuration is likely a black art and I fully expect improvements to be made here.


Actual CRAM files

The above charts are a proxy for real data, but for the curious, here are some actual numbers (sadly not speeds; I forgot to measure them, but can supply them if people are interested) on the same MPEG-G data set 02 (chr11).

-rw-r--r-- 1 jkb jkb 4493790248 Feb  8  2019 NA12878_S1.chr11.bam
-rw-r--r-- 1 jkb jkb 2584990037 Jan  8 11:33 NA12878_S1.chr11.v30.cram
-rw-r--r-- 1 jkb jkb 2338767711 Jan  7 15:23 NA12878_S1.chr11.v40.cram
-rw-r--r-- 1 jkb jkb 2152252590 Jan  7 15:28 NA12878_S1.chr11.v40-small.cram
-rw-r--r-- 1 jkb jkb 2026174338 Jan  7 15:33 NA12878_S1.chr11.v40-archive7.cram

These were produced using the "CRAM4b" branch of Scramble.  "small" is the "-X small" parameter and "archive7" is "-X archive -7".  The latter is very slow, but mainly because it also adds in xz (lzma).

Note these are larger than the ones quoted above because they cover the whole file rather than the first 200MiB per data type.  However they show the progress made with the more modern codecs available to CRAM 4.0.
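For a quick sense of scale, those byte counts work out as follows relative to the BAM (just a trivial ratio calculation on the sizes in the listing):

# Ratios computed directly from the ls output above.
bam = 4493790248
for name, size in [("v3.0",          2584990037),
                   ("v4.0",          2338767711),
                   ("v4.0 small",    2152252590),
                   ("v4.0 archive7", 2026174338)]:
    print("%-14s %5.1f%% of BAM" % (name, 100.0 * size / bam))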

I'll repeat my oft-made appeal to the MPEG-G community as a whole: please publish some numbers for how MPEG-G performs on this dataset.

PS.  Happy New Year