Friday, May 3, 2019

Data compression sweet spot

Historical context


In testing the CRAM 3.0, 3.1 (and 4.0) compression codecs I was reminded of a chart I made of the results of the SequenceSqueeze competition for the Fqzcomp/Fastqz paper.  The source picture is in that paper, but I embed an annotated copy here:



The nearly vertical line there shows a trend - more CPU time for slightly less storage cost.  We see this trend a lot.  Note that the X axis is linear while the Y axis is logarithmic; i.e. it is exponentially more expensive to gain a little more compression ratio.  Less extreme examples are visible at the Large Text Compression Benchmark, presumably because more information can still be gleaned from text than from a FASTQ file (or so we thought at the time).

Modern tools such as Spring completely trounce this line - let's face it, I got lucky with the timing - although still at the expense of high memory, if not quite such high CPU (it's not cheap in CPU either), bridging the gap between referenceless (B) and reference-based (A) compression tools.

So what is the sweet spot?  My aim has always been the bottom left, at the base of "the wall".   Somewhere close to optimal compression without the burden of high memory or high CPU.

With the newer CRAM codecs I aim to push this further and to give people more choice in the form of named profiles (fast, normal, small and archive).  Still, it has to be practical to be useful, and by default that is where the focus will be - sufficient compression without sacrificing performance.

CRAM profiles


Profiles in CRAM are purpose driven, but are simply shorthand for a whole bunch of command line arguments (much as bwa has profiles for specific types of data).  They change the granularity of random access, from 1,000 reads per slice (equivalent to BAM and ideal for lots of random access) to 100,000 reads per slice (archive), and they also select the compression levels and algorithms to be used: if we are doing lots of random access then we are probably also reading and writing files a lot, so are more speed oriented.  We can always modify the options afterwards though.  E.g. "scramble -V3.1 -X normal -s 2000" will use normal compression but at a finer granularity.
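For anyone driving this from code rather than the scramble command line, htslib exposes a similar knob (CRAM_OPT_SEQS_PER_SLICE via hts_set_opt).  The sketch below is just an illustration of the same idea, not anything io_lib specific; the file names and reference FASTA are placeholders and error checking is omitted.

#include <htslib/sam.h>

int main(void) {
    samFile *in  = sam_open("in.bam", "r");          /* placeholder input */
    samFile *out = sam_open("out.cram", "wc");       /* "wc" = write CRAM */

    hts_set_fai_filename(out, "ref.fa");             /* reference needed for CRAM encoding */
    hts_set_opt(out, CRAM_OPT_SEQS_PER_SLICE, 2000); /* finer-grained random access, akin to -s 2000 */

    bam_hdr_t *hdr = sam_hdr_read(in);
    sam_hdr_write(out, hdr);

    bam1_t *b = bam_init1();
    while (sam_read1(in, hdr, b) >= 0)
        sam_write1(out, hdr, b);

    bam_destroy1(b);
    bam_hdr_destroy(hdr);
    sam_close(in);
    sam_close(out);
    return 0;
}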


The "normal" profile is how CRAM works out of the box right now, with larger and smaller variants being available.

I've tried to pick parameters that give a meaningful choice, still permitting maximum compression for long term archival, but with the "small" mode offering improved compression at reasonable speed.  On NovaSeq data this does particularly well without needing to go to the extreme modes, in part because the heavy binning of quality values means there is less to be gained from the FQZcomp codec there.
With older HiSeq 2000 era data, which has many distinct quality values with strong correlation between neighbouring scores and clear cycle-by-cycle differences, the FQZcomp codec really shines (at a CPU cost).

CRAM 3.1 (I'll probably delay 4.0 until I have done some more major format revamping) is still a work in progress, but you can explore the current badly-named branch at https://github.com/jkbonfield/io_lib/tree/CRAM4_updates2 (soon to be merged to master I hope) and track the ongoing work to create the specification at https://github.com/jkbonfield/hts-specs/tree/CRAMv4 (a PDF is also available, although it is not automatically refreshed with updates).

Monday, February 25, 2019

LinkedIn's MiGz format, vs bgzf / bgzip

LinkedIn have created a new gzip variant that permits both multi-threaded encoding and decoding.

It achieves this by using the EXTRA field in the gzip header to indicate the size of each block, so the decoder can read ahead and do multiple decompressions in parallel.  In order for this to work, each block (512KB by default) also has to be an entirely self-contained gzip stream, with no LZ operations referring back to data in earlier blocks.

LinkedIn themselves mention that this is very similar to bgzip, which also uses the gzip EXTRA field to store additional meta-data, albeit with a smaller 64KB block size.  They recognise the similarity, but sadly failed to notice the "-@" option of bgzip (see the man page) for specifying the number of threads.  Quoting from their blog (emboldening mine):

Interestingly, BGZip uses an EXTRA field to record the size of the compressed data just as MiGz does (to allow an index to be built efficiently without having to decode every block), and this theoretically allows for multithreaded decompression. However, we are aware of no implementation of this and, because BGZip limits each block to 64KB of data (uncompressed), BGZip-compressed files will tend to be larger than those compressed with MiGz (which uses a 512KB block size by default).

The comment about block size is a valid one, although bgzf (the gzip variant that bgzip reads and writes) was optimised for fine-grained random access rather than size.  Sadly, however, the statement about the lack of multi-threading support is plain wrong.  It has been in htslib since the 1.4 release in mid 2017, and furthermore with the 1.8 release we added support for linking against libdeflate to get access to a more efficient implementation of the Deflate standard than the one provided by Zlib.
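To make that block-size mechanism concrete, here is a rough C sketch of the reader side (an illustration, not the htslib implementation) using the BGZF layout documented in the SAM specification: each block's gzip header carries a "BC" EXTRA subfield whose 2-byte value is the total block size minus one, so a reader can hop from block to block without inflating anything and hand whole blocks to worker threads.  MiGz uses the same idea, presumably with its own subfield layout.  Error handling is minimal and the hand-off to a thread pool is only indicated by a comment.

#include <stdio.h>
#include <stdint.h>

/* Return the total size of the BGZF block starting at the current file
   offset, or -1 on EOF / format error, leaving the file positioned at
   the start of the next block. */
static long bgzf_block_size(FILE *fp) {
    uint8_t hdr[12];
    if (fread(hdr, 1, 12, fp) != 12) return -1;
    if (hdr[0] != 0x1f || hdr[1] != 0x8b || !(hdr[3] & 4)) return -1; /* no EXTRA field */
    int xlen = hdr[10] | (hdr[11] << 8);

    long bsize = -1;
    for (int i = 0; i < xlen; ) {                    /* scan EXTRA subfields for "BC" */
        uint8_t sf[4];
        if (fread(sf, 1, 4, fp) != 4) return -1;
        int slen = sf[2] | (sf[3] << 8);
        if (sf[0] == 'B' && sf[1] == 'C' && slen == 2) {
            uint8_t b[2];
            if (fread(b, 1, 2, fp) != 2) return -1;
            bsize = (long)(b[0] | (b[1] << 8)) + 1;  /* BSIZE is stored as size-1 */
        } else {
            fseek(fp, slen, SEEK_CUR);
        }
        i += 4 + slen;
    }
    if (bsize < 0) return -1;
    fseek(fp, bsize - 12 - xlen, SEEK_CUR);          /* skip the rest of this block */
    return bsize;
}

int main(int argc, char **argv) {
    FILE *fp = fopen(argc > 1 ? argv[1] : "in.bam", "rb");
    if (!fp) return 1;
    long sz, n = 0;
    while ((sz = bgzf_block_size(fp)) > 0)
        n++;      /* real code would read the block here and queue it for a worker thread */
    printf("%ld BGZF blocks\n", n);
    fclose(fp);
    return 0;
}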

It's always a pleasure to compare new compression tools, so thanks to LinkedIn for making their work public.  I was curious enough to do a side-by-side comparison, using the first 1GB of a Wikipedia dump ("enwik9").  Timings are an average of 5 runs for compression and 20 runs for decompression, excluding any obvious outliers.  Compression wrote to a new file, while decompression sent its output to /dev/null.

Mzip file size = 32611530

Tool Encode(s) Decode(s)
mzip -6 -t 1 41.175 4.568
mzip -6 -t 8 6.647 0.622
mzip -6 -t 16 3.848 0.441
mzip -6 -t 32 2.914 0.480
mzip -6 -t 54 2.459 0.564

Bgzip here was built using libdeflate ("./configure --with-libdeflate CPPFLAGS=-I$HOME/ftp/compression/libdeflate LDFLAGS=-L$HOME/ftp/compression/libdeflate"), so the unthreaded speed is already faster than it would be with stock zlib.

Bgzip file size = 340338947

Tool Encode(s) Decode(s)
bgzip -l6 -@1 16.571 1.929
bgzip -l6 -@8 2.858 0.319
bgzip -l6 -@16 1.883 0.243
bgzip -l6 -@32 1.548 0.226
bgzip -l6 -@54 1.577 0.222

Bgzip's file size is around 5% larger at the default compression level, primarily due to the small block size.  Both tools suffer from poor multi-threading scaling at high thread counts, sometimes even increasing in real (wall-clock) time with more threads, but a lot of this comes down to I/O bottlenecks and likely contention between hyperthreads (the system only has 27 real cores) or hypervisor / virtualisation problems.  As an example, when writing to /dev/null instead of enwik9.gz the fastest bgzip run takes 0.78s instead of 1.58s.  However it is still clear that some high thread count problems exist in both implementations.

Bgzip, or rather htslib from which it is derived, also uses a thread-pool system.  This shares worker threads between file descriptors, so for example merging 5 BAM files and writing a new BAM file with 16 threads will still only use 16 threads for the entire job, auto-balancing the number of workers per input or output file descriptor based on their throughput.
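As a rough illustration of what that looks like from code (a minimal sketch; file names are placeholders and error checking is omitted), the same pool can be attached to any number of htslib file handles with hts_set_thread_pool(), so the 16 workers below are shared between the reader and the writer rather than being allocated per file.

#include <htslib/sam.h>
#include <htslib/thread_pool.h>

int main(void) {
    htsThreadPool p = { hts_tpool_init(16), 0 };  /* one pool of 16 worker threads */

    samFile *in  = sam_open("in.bam",  "r");
    samFile *out = sam_open("out.bam", "wb");
    hts_set_thread_pool(in,  &p);                 /* both handles share the same workers */
    hts_set_thread_pool(out, &p);

    bam_hdr_t *hdr = sam_hdr_read(in);
    sam_hdr_write(out, hdr);

    bam1_t *b = bam_init1();
    while (sam_read1(in, hdr, b) >= 0)
        sam_write1(out, hdr, b);

    bam_destroy1(b);
    bam_hdr_destroy(hdr);
    sam_close(in);
    sam_close(out);
    hts_tpool_destroy(p.pool);
    return 0;
}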

I'll happily concede bgzf has some deficiencies, such as the inability to control the block size, so it's nice that MiGz improves on this.  Ideally MiGz would also be capable of using different (better) implementations of Deflate, but I appreciate that this may be harder to achieve from within Java.