Data Geekdom: MPEG-G: An open challenge

Wednesday, October 10, 2018

MPEG-G: An open challenge

Preface: all my blogs are my own personal views and may not necessarily be the views of my employer, the Wellcome Sanger Institute.

Several key people behind the MPEG-G specification sent a preprint to bioRxiv of a paper covering the format. The first version appeared on 27th September, with a follow up to address some of the comments on 8th October.

Version 1

The first version of the MPEG-G paper is available at bioRxiv here: https://www.biorxiv.org/content/early/2018/09/27/426353 This is the version that has the most comments.

In it, we see no comparison to any other tools out there whatsoever, other than uncompressed SAM and BAM, nor even any mention of their existence. Furthermore the MPEG-G figure was produced using a lossy compressor. No reference was given on the data sets used, and no actual size values were shown - only a poor quality graph.

There is no direct evidence that their lossy mode did not harm analysis, other than a referral to other papers that have demonstrated this. We do not know whether their method is the same. We cannot take things purely on trust. Science is skeptical for a reason.

It got pretty much savaged in the comments, and quite rightly so.

Version 2

A new version has just been uploaded. This second version of the paper initially seems like a significant improvement. Now there are references to lots of other tools considered state of the art and some benchmarks between them. Unfortunately that is as far as it goes. There are even fewer figures on MPEG-G performance (ie none). The authors state that their aim is to describe the format and not the implementation, but really we need demonstration of at least one implementation to even judge whether the format is worthy of further investigation. They just refer to a "sense" of the capabilities it could have. I don't want a "sense", I want evidence!

"Nevertheless, to give the reader a sense of the compression capabilities achievable by MPEG-G, next we show the compression performance of some of the technologies supported by the standard (and hence implementable in an encoder), and how they compare with current formats(2)."

Unless the format is simply a container that permits other tools to be wrapped inside it, then I'm struggling to understand this statement, nor the benefit of such a container format. Even worse, MPEG-G is repeatedly referred to in how it "could" be implemented using this method or that algorithm, but never what it actually "does":

"When it comes to aligned data, an MPEG-G encoder could use a compression method comparable to that of DeeZ [8], which is able to compress a 437 GB H. Sapiens SAM file to about 63 GB, as compared to 75 GB by CRAM (Scramble) or 106 GB by BAM [8]"

This simply doesn't cut the mustard! I want to know how it "does", not "could do". However let us go with the assumption that right now, MPEG-G is indeed comparable to DeeZ for compression ratios, as suggested. Note that the figures quoted above are also old and do not represent the current performance of the tools cited (except for BAM).

So let's do some more digging. That little footnote (2) at the end of the first quote indicates the figures come from unspecified cited publications: "The specific numbers are taken from the corresponding publications".

The paper in question is the DeeZ paper by Hach, Namanagic and Sahinalp. Unfortunately it's not open access, although I can read it via work and the supplementary text is also freely available. The figures quoted in the 2nd version of the MPEG-G paper come from Table 1 of the DeeZ paper; 62,808 MB for DeeZ, 75,784 MB for Scramble (CRAM), 106,596 MB for BAM and 437,589 MB for SAM. Furthermore, because the DeeZ paper is rigorously written and follows accepted notions of reproducible research, the supplementary data lists the exact file tested as ftp://ftp.sra.ebi.ac.uk/vol1/ERA207/ERA207860/bam/NA12878_S1.bam.

Given the age of the DeeZ paper, the figures shown there are for an earlier DeeZ version as well as a considerly weaker CRAM v2.1, rather than the current CRAM v3 standard. (This isn't a complaint about the DeeZ paper as they reviewed what was available at the time.) I did my own analysis on this BAM file.

Format	Options	Size (GiB)	Notes
BAM	(default)	105.8	Orig file
BAM	-9	103.0	Scramble
Deez	(default)	60.6	1 million reads random access granularity
CRAM2.1	(default)	74.0	Using Scramble 1.13.7 to match DeeZ paper
CRAM3	(default)	64.2	Scramble, 10k reads granularity
CRAM3	-7 -s 100000 -j	58.5	Scramble, 100k reads granularity + bzip2
CRAM3	-7 -s 100000 -j -Z	56.1	Scramble, 100k reads granularity + bzip2 + lzma
CRAM4	(default)	58.8	Scramble, 10k reads granularity
CRAM4	-7 -s 100000	53.1	Scramble, 100k reads granularity
CRAM4	-7 -s 100000 -j -J	52.9	Scramble, 100k reads granularity + bzip2 + libbsc

The above chart shows sizes from DeeZ (in random-access mode, rather than the non-random access -q2 method), BAM as downloaded and as generated using scramble -9, and CRAM using a variety of scramble (1.14.10) command line options and versions. Note version 2.1 was created using the old 1.13.7 release to match the one cited in the DeeZ paper. Also note CRAM 4 is highly experimental - it's a demonstration of what may become a standard, rather than what already is. Nevertheless, it's not vapourware and it can be downloaded and ran, so I include it as demonstration of where we could go with CRAM if the community desires it. Obviously (not recorded, sorry) there are big CPU time differences and I've also adjusted some of the CRAM "slice sizes" (granularity of random access) to demonstrate the ranges of sizes we get. The smaller files are more suitable for archival than ongoing analysis.

Basically, we see CRAM can beat DeeZ quite considerably even using just the existing CRAM3 standard. I find it disheartening therefore that the new revision of the MPEG-G paper contains outdated and therefore misleading conclusions. The new proposed codecs in version 4 pushes CRAM to smaller files still, so we need to consider whether we want a complete new format or just an update to existing ones. Obviously I'm biased on this point.

Analysis of the version 1 paper graph shows this is likely the same file reported in the "Human WGS (high coverage)" figure. Unfortunately given the MPEG-G values shown were with lossy compression, so no direct comparison can be made against DeeZ and/or CRAM.

Conclusion

MPEG-G were already called out in the comments of their original paper for incorrectly representing the current state of the art. It appears they now have made the same mistake a second time by using outdated benchmarks. Once may be an error, but two is starting to look like a pattern.

Scientists are more savvy than this. We work on open and reproducible research, not secrecy and promises of what something "could" do. So far MPEG's own paper has stated it could be comparable to DeeZ, and I have compared CRAM to DeeZ showing it to be smaller. One logical conclusion therefore is that MPEG-G can already be beaten by CRAM on file size. I doubt this is actually true as they have had some big heavy-weights working on it, but the paucity of evidence hardly instills confidence. Indeed I am starting to wonder if there is even a fully working implemention out there yet.

An open challenge

So let's get right down to it. I hearby openly challenge the MPEG-G community to provide their own figures, using the data set chosen by them in their paper. Please email me your MPEG-G figures for the same NA12878_S1.bam file presented in the DeeZ paper that you quote, in GiB units. I'd like to know both size, some measure of granularity, be it MB access unit size, number of reads, or a similar metric, and what data is retained. We can start with the easy lossless case. (Lossy gets so hard to evaluate the effect and it's a set of complex adjustments and tradeoffs.)

I will email this to the two authors with email addresses listed in the paper.

6 comments:

UnknownOctober 12, 2018 at 10:33 AM
James, thanks a lot for putting together this blog post and for your continued efforts to improving genomics data formats. For the sake of completion, I believe that you could add one important compression tool to the table: /bin/rm.
So far, deletion has been the most effective compression tool for genomic data (e.g. TIF files, intensity files). Conversely, failure to delete redundant and unnecessary data is a substantial contributor to escalating costs of storage in many organizations. This could add some perspective to the evaluation of the relevance, value or potential of MPEG-G.
ReplyDelete
Replies
UnknownOctober 14, 2018 at 7:45 AM
[1/2]
[disclaimer: this is my personal opinion and it does not reflect the sentiments and/or opinions of the other co-authors or anyone else involved in MPEG-G, MPEG, ISO or IEC]

James did contact me via email and told me that he may quote my replies to this email publicly. To provide a full context, I have decided to reply openly here:

Regarding the challenge, I think it is a great idea. My group, in collaboration with other groups, is currently working on an implementation of the MPEG-G specifications that will be open-source and we hope it will showcase the potential of MPEG-G. I am sure that other groups are working on different implementations.

However, I would like to clarify that MPEG-G as an ISO/IEC standard is actually not comparable to the existing codecs for compression of sequencing data. Existing codecs such as DeeZ, FaStore or CRAM are implementations of specific compression methods that come with an encoder and a decoder. One can download the code, compile it and run it. MPEG-G, however, is different: it is a decoding specification. It isn't code. Furthermore, the specification is asymmetrical in the sense that the encoder is not specified. Any encoder implementation would work fine as long as it outputs a conformant bitstreams.

A good example of this asymmetrical specification paradigm is lossy compression of quality values. MPEG-G does not specify any quantization mechanism for quality values. But what MPEG-G does is specifying that when decoding quality values, one or more codebooks can be used to reconstruct them. The actual codebook(s) to be used can be freely determined by the encoding side. Hence, the effect of quantized quality values on e.g. variant calling is independent of the MPEG-G specifications.

Another example would be the reordering of the reads prior to compression (as done in HARC, ORCOM or FaStore, for example). This is also an encoder-only process, and MPEG-G does not specify how to reorder the reads. It specifies, however, how these reads should be compressed if using an on-the-fly reference for compression (like done in HARC, ORCOM or FaStore). This is a clear example of how different implementations of an MPEG-G-conformant encoder can achieve very different results.

The asymmetry of the MPEG-G specification creates the possibility for different teams to develop different MPEG-G encoders targeted at different applications (e.g., streaming, selective access, long-term archival) and to improve compression performance over time while maintaining interoperability among them. As an example of how the MPEG framework has helped advance compression technology, the compression ratios obtained by AVC (video) encoders improved significantly in the last years while using the same decoding specification (being AVC). Regarding the interoperability, the key advantage that this ecosystem creates is that different encoding implementations can work seamlessly together, as they could all be decoded with the same MPEG-G decoder. This is as opposed to what happens today, in that if one uses one method to compress, needs to use the corresponding decoder, and same for downstream applications that work with compressed data. Currently, the application is tailored to a particular implementation, whereas MPEG-G will allow downstream applications to work with files compressed with different encoders as long as they are MPEG-G compliant.

I hope that I could make clear why comparing existing compressors to MPEG-G [i.e., the specification] is actually not possible. However, what it is possible is to compare an implementation of the MPEG-G specification to existing compressors. We will gladly do so once our implementation is finished.
ReplyDelete
Replies
UnknownOctober 14, 2018 at 7:46 AM
[2/2]

The above information is what we tried to convey in the pre-print paper, as the focus of the paper was not on the compression performance, but on the other capabilities that the standard brings to the table. The compression methods part was intended to be informative pending the release of actual implementations. (it is less than a page out of the 9 pages of the main manuscript).

In this context, I would like to make clear that it was not our intention to undermine the compression capabilities of CRAM, and I am very sorry if there was any misunderstanding. Note that very similar results to the ones mentioned in the pre-print are shown in the Numanagic et al (2016). In all fairness, Numanagic et al (2016) showed that DeeZ with bzip2 and sam_comp qual compression achieves significant improvements over the default DeeZ (Figure 1 (b), last column). Since no MPEG-G implementation was featured in the pre-print we decided to take the numbers from recently published data as an indication of how different methods have been shown to perform.

I’d like to conclude with a more personal note: I hold your work in highest regards and I was extremely pleased when I learned that you were involved in MPEG-G. In fact, your involvement encouraged me to keep contributing to the standard. When your contributions stopped, I felt sad but I understood it considering the complex process of MPEG standards development. However, after your series of blog posts, I wish you had raised your concerns at that time to the group, as some of us might have agreed with you in some points and it could have helped drive the discussions; also, you would have given the chance to the people disagreeing with you to explain themselves.

I am sorry for any misunderstanding that this pre-print may have caused. It was meant to be an informative overview of the features of the MPEG-G standard, and what they bring to the table. It was not intended to be a claim of compression superiority against any existing solution. I hope that in the future we will be able to collaborate and bring the best of two worlds (CRAM and MPEG-G specs) together. In the meantime, I think that your proposed challenge is a great idea!

Kind regards,
Mikel Hernaez
ReplyDelete
Replies

Add comment