Wednesday, October 10, 2018

MPEG-G: An open challenge

Preface: all my blog posts are my own personal views and are not necessarily the views of my employer, the Wellcome Sanger Institute.


Several key people behind the MPEG-G specification have submitted a preprint covering the format to bioRxiv.  The first version appeared on 27th September, with a follow-up on 8th October to address some of the comments.

Version 1


The first version of the MPEG-G paper is available on bioRxiv at https://www.biorxiv.org/content/early/2018/09/27/426353 and is the version that has attracted the most comments.

In it, we see no comparison whatsoever to any other tools out there, other than uncompressed SAM and BAM, nor even any mention of their existence.  Furthermore, the MPEG-G figure was produced using a lossy compressor.  No reference was given for the data sets used, and no actual size values were shown, only a poor-quality graph.

There is no direct evidence that their lossy mode does not harm analysis, other than a reference to other papers that have demonstrated this.  We do not know whether their method is the same.  We cannot take things purely on trust; science is skeptical for a reason.

It got pretty much savaged in the comments, and quite rightly so.


Version 2


A new version has just been uploaded.  This second version of the paper initially seems like a significant improvement.  Now there are references to lots of other tools considered state of the art, and some benchmarks between them.  Unfortunately that is as far as it goes.  There are even fewer figures on MPEG-G performance (i.e. none).  The authors state that their aim is to describe the format and not the implementation, but really we need a demonstration of at least one implementation even to judge whether the format is worthy of further investigation.  They just refer to a "sense" of the capabilities it could have.  I don't want a "sense", I want evidence!
"Nevertheless, to give the reader a sense of the compression capabilities achievable by MPEG-G, next we show the compression performance of some of the technologies supported by the standard (and hence implementable in an encoder), and how they compare with current formats(2)."
Unless the format is simply a container that permits other tools to be wrapped inside it, I'm struggling to understand this statement, or to see the benefit of such a container format.  Even worse, MPEG-G is repeatedly described in terms of how it "could" be implemented using this method or that algorithm, but never in terms of what it actually "does":
"When it comes to aligned data, an MPEG-G encoder could use a compression method comparable to that of DeeZ [8], which is able to compress a 437 GB H. Sapiens SAM file to about 63 GB, as compared to 75 GB by CRAM (Scramble) or 106 GB by BAM [8]"
This simply doesn't cut the mustard!  I want to know what it "does", not what it "could do".  However, let us go with the assumption that, right now, MPEG-G is indeed comparable to DeeZ in compression ratio, as suggested.  Note that the figures quoted above are also old and do not represent the current performance of the tools cited (except for BAM).

So let's do some more digging.  That little footnote (2) at the end of the first quote indicates the figures come from unspecified cited publications: "The specific numbers are taken from the corresponding publications".

The paper in question is the DeeZ paper by Hach, Numanagić and Sahinalp.  Unfortunately it's not open access, although I can read it via work and the supplementary text is also freely available.  The figures quoted in the 2nd version of the MPEG-G paper come from Table 1 of the DeeZ paper: 62,808 MB for DeeZ, 75,784 MB for Scramble (CRAM), 106,596 MB for BAM and 437,589 MB for SAM.  Furthermore, because the DeeZ paper is rigorously written and follows accepted notions of reproducible research, the supplementary data lists the exact file tested as ftp://ftp.sra.ebi.ac.uk/vol1/ERA207/ERA207860/bam/NA12878_S1.bam.

Given the age of the DeeZ paper, the figures shown there are for an earlier DeeZ version as well as a considerably weaker CRAM v2.1, rather than the current CRAM v3 standard.  (This isn't a complaint about the DeeZ paper; they reviewed what was available at the time.)  I did my own analysis on this BAM file.
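
For anyone wishing to reproduce this, the first step is simply fetching that file (a sketch assuming wget is available; any FTP client will do):

    wget ftp://ftp.sra.ebi.ac.uk/vol1/ERA207/ERA207860/bam/NA12878_S1.bam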


Format     Options               Size (GiB)  Notes
BAM        (default)             105.8       Original file as downloaded
BAM        -9                    103.0       Scramble
DeeZ       (default)              60.6       1 million reads random-access granularity
CRAM 2.1   (default)              74.0       Scramble 1.13.7, to match the DeeZ paper
CRAM 3     (default)              64.2       Scramble, 10k reads granularity
CRAM 3     -7 -s 100000 -j        58.5       Scramble, 100k reads granularity + bzip2
CRAM 3     -7 -s 100000 -j -Z     56.1       Scramble, 100k reads granularity + bzip2 + lzma
CRAM 4     (default)              58.8       Scramble, 10k reads granularity
CRAM 4     -7 -s 100000           53.1       Scramble, 100k reads granularity
CRAM 4     -7 -s 100000 -j -J     52.9       Scramble, 100k reads granularity + bzip2 + libbsc

The above table shows sizes from DeeZ (in random-access mode, rather than the non-random-access -q2 method), BAM both as downloaded and as regenerated using scramble -9, and CRAM using a variety of scramble (1.14.10) versions and command-line options.  Note that the CRAM 2.1 file was created using the old 1.13.7 release to match the version cited in the DeeZ paper.  Also note that CRAM 4 is highly experimental: it is a demonstration of what may become a standard, rather than what already is.  Nevertheless, it's not vapourware; it can be downloaded and run, so I include it as a demonstration of where we could go with CRAM if the community desires it.  There are obviously big CPU time differences (not recorded, sorry), and I've also adjusted some of the CRAM "slice sizes" (the granularity of random access) to demonstrate the range of sizes we get; the smaller files are more suitable for archival than ongoing analysis.  Representative command lines are sketched below.
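
The commands were along these lines (a sketch rather than the exact invocations; ref.fa is a placeholder for the reference used by the alignments, the output names are illustrative, and the flag spellings are as I recall them from io_lib's scramble and DeeZ, so check each tool's usage text):

    # BAM recompressed at maximum zlib level
    scramble -O bam -9 NA12878_S1.bam NA12878_S1.max.bam

    # CRAM 2.1 via the old scramble 1.13.7 release
    scramble -O cram -r ref.fa NA12878_S1.bam NA12878_S1.v21.cram

    # CRAM 3: default, then larger slices with bzip2 and optionally lzma enabled
    scramble -O cram -r ref.fa -V 3.0 NA12878_S1.bam NA12878_S1.v3.cram
    scramble -O cram -r ref.fa -V 3.0 -7 -s 100000 -j NA12878_S1.bam NA12878_S1.v3_bz2.cram
    scramble -O cram -r ref.fa -V 3.0 -7 -s 100000 -j -Z NA12878_S1.bam NA12878_S1.v3_bz2_lzma.cram

    # DeeZ in its default random-access mode
    deez -r ref.fa NA12878_S1.bam -o NA12878_S1.deez

The CRAM 4 rows came from an experimental io_lib branch driven with the same style of options (plus -J for libbsc).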

Basically, we see that CRAM can beat DeeZ quite considerably even using just the existing CRAM 3 standard.  I therefore find it disheartening that the new revision of the MPEG-G paper contains outdated, and therefore misleading, conclusions.  The new codecs proposed for version 4 push CRAM to smaller files still, so we need to consider whether we want a completely new format or just an update to existing ones.  Obviously I'm biased on this point.

Analysis of the graph in the version 1 paper shows this is likely the same file reported in the "Human WGS (high coverage)" figure.  Unfortunately, the MPEG-G values shown there were produced with lossy compression, so no direct comparison can be made against DeeZ and/or CRAM.


Conclusion


The MPEG-G authors were already called out in the comments on their original paper for misrepresenting the current state of the art.  It appears they have now made the same mistake a second time by using outdated benchmarks.  Once may be an error, but twice is starting to look like a pattern.

Scientists are more savvy than this.  We work on open and reproducible research, not secrecy and promises of what something "could" do.  So far the MPEG-G authors' own paper has stated that it could be comparable to DeeZ, and I have compared CRAM to DeeZ, showing CRAM to be smaller.  One logical conclusion, therefore, is that MPEG-G can already be beaten by CRAM on file size.  I doubt this is actually true, as they have had some heavyweights working on it, but the paucity of evidence hardly instils confidence.  Indeed, I am starting to wonder if there is even a fully working implementation out there yet.


An open challenge


So let's get right down to it.  I hereby openly challenge the MPEG-G community to provide their own figures, using the data set they themselves chose in their paper.  Please email me your MPEG-G figures, in GiB, for the same NA12878_S1.bam file presented in the DeeZ paper that you quote.  I'd like to know the size, some measure of granularity (be it access-unit size in MB, number of reads, or a similar metric), and what data is retained.  We can start with the easy lossless case.  (Lossy compression is much harder to evaluate, as its effect depends on a set of complex adjustments and tradeoffs.)
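
To be explicit about the units: by GiB I mean 2^30 bytes, as reported by something like the following (the .mpegg extension is purely a placeholder for whatever your encoder emits):

    ls -l NA12878_S1.mpegg | awk '{printf "%.1f GiB\n", $5/1024/1024/1024}'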

I will email this to the two authors with email addresses listed in the paper.

6 comments:

  1. James, thanks a lot for putting together this blog post and for your continued efforts to improve genomics data formats. For the sake of completeness, I believe that you could add one important compression tool to the table: /bin/rm.
    So far, deletion has been the most effective compression tool for genomic data (e.g. TIF files, intensity files). Conversely, failure to delete redundant and unnecessary data is a substantial contributor to escalating costs of storage in many organizations. This could add some perspective to the evaluation of the relevance, value or potential of MPEG-G.

    Replies
    1. The question is, when will institutions be ready to get rid of FASTQ/SAM files and only store VCF files? What about sequencing data of non-model organisms?
      While I agree that at some point in time FASTQ and SAM files will end up like the intensity files, how long until we are there?

      Regarding compression of VCF files, I recommend taking a look at these promising methods: GTRAC (Tatwawadi et al., 2016) and GTC (Danek and Deorowicz, 2018).

      Thanks!
      Mikel

  2. [1/2]
    [disclaimer: this is my personal opinion and it does not reflect the sentiments and/or opinions of the other co-authors or anyone else involved in MPEG-G, MPEG, ISO or IEC]

    James did contact me via email and told me that he may quote my replies to that email publicly. To provide full context, I have decided to reply openly here:

    Regarding the challenge, I think it is a great idea. My group, in collaboration with other groups, is currently working on an implementation of the MPEG-G specifications that will be open-source and we hope it will showcase the potential of MPEG-G. I am sure that other groups are working on different implementations.

    However, I would like to clarify that MPEG-G as an ISO/IEC standard is actually not comparable to the existing codecs for compression of sequencing data. Existing codecs such as DeeZ, FaStore or CRAM are implementations of specific compression methods that come with an encoder and a decoder. One can download the code, compile it and run it. MPEG-G, however, is different: it is a decoding specification. It isn't code. Furthermore, the specification is asymmetrical in the sense that the encoder is not specified. Any encoder implementation would work fine as long as it outputs a conformant bitstream.

    A good example of this asymmetrical specification paradigm is lossy compression of quality values. MPEG-G does not specify any quantization mechanism for quality values. What MPEG-G does specify is that, when decoding quality values, one or more codebooks can be used to reconstruct them. The actual codebook(s) to be used can be freely determined by the encoding side. Hence, the effect of quantized quality values on e.g. variant calling is independent of the MPEG-G specifications.

    Another example would be the reordering of the reads prior to compression (as done in HARC, ORCOM or FaStore, for example). This is also an encoder-only process, and MPEG-G does not specify how to reorder the reads. It specifies, however, how these reads should be compressed if an on-the-fly reference is used for compression (as is done in HARC, ORCOM or FaStore). This is a clear example of how different implementations of an MPEG-G-conformant encoder can achieve very different results.

    The asymmetry of the MPEG-G specification creates the possibility for different teams to develop different MPEG-G encoders targeted at different applications (e.g., streaming, selective access, long-term archival) and to improve compression performance over time while maintaining interoperability among them. As an example of how the MPEG framework has helped advance compression technology, the compression ratios obtained by AVC (video) encoders improved significantly in recent years while using the same decoding specification (namely AVC). Regarding interoperability, the key advantage that this ecosystem creates is that different encoding implementations can work seamlessly together, as they can all be decoded with the same MPEG-G decoder. This is as opposed to what happens today, where if one uses one method to compress, one needs to use the corresponding decoder, and the same applies to downstream applications that work with compressed data. Currently, the application is tailored to a particular implementation, whereas MPEG-G will allow downstream applications to work with files compressed with different encoders as long as they are MPEG-G compliant.

    I hope that I have made clear why comparing existing compressors to MPEG-G [i.e., the specification] is actually not possible. However, what is possible is to compare an implementation of the MPEG-G specification to existing compressors. We will gladly do so once our implementation is finished.

    Replies
    1. I think you misunderstand the CRAM specification here. It is not code either. Indeed, GA4GH explicitly keeps specification and code separate, but it does require that working code exists (two independent implementations) as validation of the specification being complete and comprehensive.

      CRAM also does not dictate the encoder process directly. It describes the file format and how to decode, although for clarity it also explains how to encode in some situations (e.g. an entropy encoder may explain both sides of the process). There are many choices available to the encoder which are not dictated by the specification: how many reads per slice, slices per container, which data series goes into which block (separate, or interleaved into the same block?), which compression scheme is applied to each block, and for some codecs which transforms will be applied and other codec-specific parameters. These tend to be written to the container compression header, which is essentially the metadata used by the decoder. It saddens me that this lack of understanding of the CRAM format has directly led to these very same methods appearing in the GenomSys patents.
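
      To make that concrete, here are two very different sets of encoder-side choices, both yielding files that any conformant CRAM 3 decoder can read (a sketch only; ref.fa is a placeholder and the flags are io_lib scramble ones):

        scramble -O cram -r ref.fa -V 3.0 -s 1000 in.bam fine_grained.cram        # small slices, fine-grained random access
        scramble -O cram -r ref.fa -V 3.0 -s 100000 -7 -j in.bam archival.cram    # large slices + bzip2, smaller but coarser
        scramble -r ref.fa -O sam archival.cram | head                            # either file decodes with the same decoder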

      Obviously it is also possible to reorder sequences in CRAM if you so desire. It's not so sensible for aligned data, but it could be done (although you'd get out what you put in). It does make a significant difference when storing unaligned data though; CRAM is just a container in this regard. If you presort the data with a sequence-based collation step, then even non-reference-based, non-mapped encoding (good old-fashioned FASTQ style) will compress far better.

      Similarly when it comes to lossy compression. Crumble is a separate application and does not appear anywhere within the CRAM specification, nor within any of the implementations of CRAM. It takes an aligned SAM, BAM or CRAM as input, does some analysis of the "pileup", and outputs a new SAM, BAM or CRAM with most qualities quantised to 1 or 2 values where appropriate. It happens to work well with CRAM because I designed it with the format in mind, but equally someone else could come up with a better quantiser and apply that to CRAM too. We could take QVZ2 or CALQ qualities and put those into CRAM too if we wished, and obviously Crumble output could be written to MPEG-G.
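
      As a sketch of that pipeline (the exact Crumble options vary by version, so treat the flags below as illustrative; ref.fa is again a placeholder):

        crumble -O bam in.bam quantised.bam                       # pileup analysis; qualities quantised where deemed safe
        scramble -O cram -r ref.fa -V 3.0 quantised.bam out.cram  # then compress with whichever CRAM encoder/options you like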

      So far all of this is sounding very similar to your MPEG-G description.

      Ultimately, when it comes down to it, there *have* been comparisons made between MPEG-G and SAM / BAM, and it is these one-sided comparisons that led to me starting these blog posts; see Marco's talk last year or the compression ratios stated in the IEEE Spectrum article. So MPEG-G can be compared to BAM, yet the notion that no comparison to CRAM is possible simply doesn't hold water. In fact, by your argument it is fairer to compare to CRAM than to BAM, as BAM *is* as you describe (a format where both encode and decode are tightly specified) whereas CRAM has many of the same concerns over implementations.

      I'll rephrase my request then - can we please have some comparisons between an *implementation* of MPEG-G and one of the standard implementations of CRAM?

      I'm curious where the numbers for the BAM comparison came from if you don't yet have a working implementation. I assume that is GenomSys' own implementation. Are they keeping it close to their chest, not even telling the other MPEG-G members what figures they get?

  3. [2/2]

    The above information is what we tried to convey in the pre-print paper, as the focus of the paper was not on the compression performance but on the other capabilities that the standard brings to the table. The compression methods part was intended to be informative pending the release of actual implementations (it is less than a page out of the nine pages of the main manuscript).

    In this context, I would like to make clear that it was not our intention to undermine the compression capabilities of CRAM, and I am very sorry if there was any misunderstanding. Note that very similar results to the ones mentioned in the pre-print are shown in Numanagić et al. (2016). In all fairness, Numanagić et al. (2016) showed that DeeZ with bzip2 and sam_comp quality compression achieves significant improvements over the default DeeZ (Figure 1 (b), last column). Since no MPEG-G implementation was featured in the pre-print, we decided to take the numbers from recently published data as an indication of how different methods have been shown to perform.

    I'd like to conclude with a more personal note: I hold your work in the highest regard, and I was extremely pleased when I learned that you were involved in MPEG-G. In fact, your involvement encouraged me to keep contributing to the standard. When your contributions stopped, I felt sad, but I understood it considering the complex process of MPEG standards development. However, in light of your series of blog posts, I wish you had raised your concerns with the group at that time, as some of us might have agreed with you on some points and it could have helped drive the discussions; also, you would have given the people disagreeing with you the chance to explain themselves.

    I am sorry for any misunderstanding that this pre-print may have caused. It was meant to be an informative overview of the features of the MPEG-G standard and what they bring to the table. It was not intended to be a claim of compression superiority over any existing solution. I hope that in the future we will be able to collaborate and bring the best of both worlds (the CRAM and MPEG-G specs) together. In the meantime, I think that your proposed challenge is a great idea!

    Kind regards,
    Mikel Hernaez

    Replies
    1. Thank you, Mikel, for taking the time to reply here, and thank you also for the kind comments. For what it's worth, I do think the updates to the paper have improved it, despite my still having some misgivings about the figures. I disagree, though, with not wanting to show any MPEG-G figures at all. If I were a reviewer of such a paper, it would be rejected as lacking any demonstrable benefit. It's not that I disbelieve it can perform well, but that it hasn't yet been demonstrated, let alone in a way that permits others to build upon it and compare to it. That is a core tenet of science.

      I accept the point that I could have handled this better and brought these issues up at a more appropriate time. When I bailed out, I was already struggling with the workload, but the Voges et al. patent was definitely the final straw. They were certainly aware of how I felt, as I made a prior-art submission against their patent (which I felt was too close a description of sam_comp to be considered novel). To my shame, I didn't make those feelings public on the MPEG-G reflector.

      I should have been more public in my complaints, and not doing so meant that I potentially left others with like-minded views unaware of the impending commercialisation. However, believe it or not, I'm not really a confrontational person! I know that seems odd given these blog posts, but I was rather forced into this situation as a way of fighting the PR and lobbying machinery (MPEG-G talks, IEEE articles, etc.), up to and including MPEG-G apparently being considered "a good candidate" by EC eHealth.

      Hindsight is a wonderful thing. There are many things I would have done differently if I'd known where this was going, not least of which would be to raise the whole question of patent-encumbered formats at the very start, so that everyone was clear which boat they'd got into.
