Monday, February 25, 2019

LinkedIn's MiGz format, vs bgzf / bgzip

LinkedIn have created a new gzip variant that permits both multi-threaded encoding and decoding.

It achieves this by using the EXTRA field in the gzip headers to record the size of each compressed block, so the decoder can read ahead and run multiple decompressions in parallel.  In order for this to work, each block (512KB by default) also has to be an entirely self-contained gzip stream, without LZ operations referring back to data in earlier blocks.
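
To make this concrete, below is a rough C sketch of walking a file block by block via the gzip EXTRA field.  It follows the BGZF convention (a "BC" subfield holding the total block size minus one); MiGz stores its size in its own EXTRA subfield, so treat this as an illustration of the general technique rather than of MiGz's exact layout.

    /* Sketch: list the blocks of a BGZF-style file by reading the compressed
     * block size from the "BC" subfield of the gzip EXTRA header.
     * MiGz uses the same idea with its own subfield, not shown here.
     * Build: cc -O2 -o blocklist blocklist.c
     */
    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        if (argc != 2) { fprintf(stderr, "Usage: %s file.gz\n", argv[0]); return 1; }
        FILE *fp = fopen(argv[1], "rb");
        if (!fp) { perror(argv[1]); return 1; }

        long offset = 0;
        for (;;) {
            uint8_t hdr[12];
            if (fread(hdr, 1, 12, fp) != 12) break;       /* EOF or truncated */
            if (hdr[0] != 0x1f || hdr[1] != 0x8b) break;  /* not a gzip member */
            if (!(hdr[3] & 4)) break;                     /* no EXTRA field */

            int xlen = hdr[10] | (hdr[11] << 8);          /* EXTRA field length */
            uint8_t *extra = malloc(xlen);
            if (!extra || fread(extra, 1, xlen, fp) != (size_t)xlen) break;

            long bsize = -1;
            for (int i = 0; i + 4 <= xlen; ) {            /* scan the subfields */
                int slen = extra[i+2] | (extra[i+3] << 8);
                if (extra[i] == 'B' && extra[i+1] == 'C' && slen == 2 && i + 6 <= xlen)
                    bsize = (extra[i+4] | (extra[i+5] << 8)) + 1; /* stored as BSIZE-1 */
                i += 4 + slen;
            }
            free(extra);
            if (bsize < 0) break;                         /* size subfield not found */

            printf("block at offset %ld, compressed size %ld\n", offset, bsize);
            offset += bsize;                              /* jump to the next block */
            if (fseek(fp, offset, SEEK_SET) != 0) break;
        }
        fclose(fp);
        return 0;
    }

An index built this way is what lets a decoder hand whole compressed blocks to worker threads, or seek to an arbitrary block, without inflating anything along the way.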

LinkedIn themselves mention that this is very similar to bgzip, which also uses the gzip EXTRA field to store additional meta-data, albeit with a smaller 64KB block size.  They recognise the similarity, but sadly failed to notice the "-@" option of bgzip (see the man page above) for specifying the number of threads.  Quoting from their blog (emboldening mine):

Interestingly, BGZip uses an EXTRA field to record the size of the compressed data just as MiGz does (to allow an index to be built efficiently without having to decode every block), and this theoretically allows for multithreaded decompression. However, we are aware of no implementation of this and, because BGZip limits each block to 64KB of data (uncompressed), BGZip-compressed files will tend to be larger than those compressed with MiGz (which uses a 512KB block size by default).

The comment about block size is a valid one, although bgzf (the gzip variant that bgzip reads and writes) was optimised for fine-grained random access rather than size.  Sadly, however, the statement about the lack of multi-threading support is plain wrong.  It has been in htslib since the 1.4 release in mid-2017, and furthermore with the 1.8 release we added support for linking against libdeflate to get access to a more efficient implementation of the Deflate standard than the one provided by Zlib.
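
To illustrate the point, here is a minimal sketch of threaded decompression using htslib's bgzf API, roughly what bgzip -@ does internally.  It assumes htslib 1.4 or later (link with -lhts) and keeps error handling to a bare minimum.

    /* Sketch: decompress a BGZF file to stdout using N worker threads.
     * Blocks are inflated in parallel and returned to the caller in order.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <htslib/bgzf.h>

    int main(int argc, char **argv) {
        if (argc != 3) { fprintf(stderr, "Usage: %s in.gz nthreads\n", argv[0]); return 1; }

        BGZF *fp = bgzf_open(argv[1], "r");
        if (!fp) { perror(argv[1]); return 1; }

        /* Attach a pool of worker threads to this file handle. */
        if (bgzf_mt(fp, atoi(argv[2]), 256) < 0) {
            fprintf(stderr, "Failed to enable multi-threading\n");
            return 1;
        }

        char buf[65536];
        ssize_t n;
        while ((n = bgzf_read(fp, buf, sizeof(buf))) > 0)
            if (fwrite(buf, 1, n, stdout) != (size_t)n) { perror("write"); return 1; }

        bgzf_close(fp);
        return n < 0 ? 1 : 0;    /* n < 0 indicates a read error */
    }

The same bgzf_mt() call on a handle opened for writing gives threaded compression, which is essentially what bgzip does when encoding.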

It's always a pleasure to compare new compression tools, so thanks to LinkedIn for making their work public.  I was curious enough to do a side-by-side comparison, using the first 1GB of a Wikipedia archive ("enwik9").  Times are the average of 5 runs for compression and 20 runs for decompression, excluding any obvious outliers.  Compression was writing to a new file, while decompression was sending the output to /dev/null.

Mzip file size = 32611530

Tool            Encode(s)   Decode(s)
mzip -6 -t 1       41.175       4.568
mzip -6 -t 8        6.647       0.622
mzip -6 -t 16       3.848       0.441
mzip -6 -t 32       2.914       0.480
mzip -6 -t 54       2.459       0.564

Bgzip here was built using libdeflate ("./configure --with-libdeflate CPPFLAGS=-I$HOME/ftp/compression/libdeflate LDFLAGS=-L$HOME/ftp/compression/libdeflate"), so the unthreaded speed is already faster.

Bgzip file size = 340338947

Tool            Encode(s)   Decode(s)
bgzip -l6 -@1      16.571       1.929
bgzip -l6 -@8       2.858       0.319
bgzip -l6 -@16      1.883       0.243
bgzip -l6 -@32      1.548       0.226
bgzip -l6 -@54      1.577       0.222

Bgzip file size is around 5% larger at the default compression level, primarily due to the small block size.  Both tools suffer from poor multi-threading scaling at high thread counts, with real time sometimes even increasing as more threads are added, but a lot of this comes down to I/O bottlenecks and likely contention between hyperthreads (the system only has 27 real cores) or hypervisor / virtualization problems.  As an example, when writing to /dev/null instead of enwik9.gz the fastest bgzip time is 0.78s instead of 1.58s.  However it is still clear that some high-thread-count problems exist in both implementations.

Bgzip, or rather htslib on which it is built, also uses a thread-pool system.  This shares worker threads between file descriptors, so for example merging 5 BAM files and writing a new BAM file with 16 threads will still use only 16 threads for the entire job, auto-balancing the number of workers per input or output file descriptor based on their throughput.
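
Here is a small sketch of that sharing using htslib's public API.  The filenames are placeholders and error handling is minimal, but it shows a single pool of 16 workers being attached to both an input and an output handle.

    /* Sketch: one thread pool shared between two file handles, so the
     * 16 workers service both decompression of the input and
     * compression of the output.
     */
    #include <stdio.h>
    #include <htslib/hts.h>
    #include <htslib/sam.h>
    #include <htslib/thread_pool.h>

    int main(void) {
        htsThreadPool p = { NULL, 0 };
        p.pool = hts_tpool_init(16);               /* 16 workers for the whole job */
        if (!p.pool) return 1;

        htsFile *in  = hts_open("in.bam",  "r");   /* placeholder input */
        htsFile *out = hts_open("out.bam", "wb");  /* placeholder output */
        if (!in || !out) return 1;

        /* Both handles draw from the same pool of workers. */
        hts_set_thread_pool(in,  &p);
        hts_set_thread_pool(out, &p);

        bam_hdr_t *hdr = sam_hdr_read(in);
        if (!hdr || sam_hdr_write(out, hdr) < 0) return 1;

        bam1_t *rec = bam_init1();
        int r;
        while ((r = sam_read1(in, hdr, rec)) >= 0)
            if (sam_write1(out, hdr, rec) < 0) return 1;

        bam_destroy1(rec);
        bam_hdr_destroy(hdr);
        hts_close(in);
        hts_close(out);
        hts_tpool_destroy(p.pool);
        return r < -1 ? 1 : 0;                     /* -1 is normal EOF */
    }

Samtools uses the same mechanism, which is why a multi-way merge doesn't multiply the thread count by the number of open files.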

I'll happily concede bgzf has some deficiencies, such as the inability to control the block size, so it's nice that MiGz improves on this.  Ideally MiGz would also be able to use different (better) implementations of Deflate, but I appreciate this may be harder to achieve from within Java.