Thursday, September 27, 2018

MPEG-G: the bad

Preface:  all my blogs are my own personal views and may not necessarily be the views of my employer, the Wellcome Sanger Institute.

OK I'll come clean - having read this title you'll probably have realised that the previous post could have been called "MPEG-G: the good".  It's not all smelling of roses.  I covered the history and process before, so now I'll look at what, to my mind, isn't so good.

Transparency


MPEG-G has been bubbling along in the background.  It's hard to gain much knowledge of what's being discussed as it's all behind the ISO machinery, which seems designed to lock away as much information as possible unless you're one of the club.   Given the history of intellectual property surrounding video coding, I can understand why this is their usual policy, but I don't feel it is an appropriate fit for a relatively open field such as bioinformatics.

I know, as I almost got there, but was put off by the BSI IST/37 committee and the workload involved.  As it happened, I think that was beneficial (more on that to come), but it does mean that even an interested party like myself could see absolutely nothing of what was going on behind closed doors.    I wasn't even aware for some time that one of my own submissions to the Core Experiments had been adopted into the standard.    Yes, I know I could have gone through all the bureaucracy to get there, but that's still going to be a very self-limiting set of people.   My strong view here is that if you *really* want the best format known to mankind, then you should be as public about it as you conceivably can.

MPEG-G is a complex specification.  In fact, it's huge.  Partly this is because it's very precisely documented down to every tiny nut and bolt (and I have to confess, this puts CRAM to shame), but it also has a lot going on.  Is all of that necessary?   I'd like to be able to see the evaluations, of which I am sure there were many, that demonstrate why the format is the way it is.  These are key to implementing good encoders, but they would also give me confidence that the format is appropriate and doesn't contain pointless bloat.  (Again, this is something less than ideal with CRAM too.)

Risk vs Reward


A shiny new format is all well and good if the gains to be had are worth it.  Switching format is a costly exercise.  CRAM started off at around a 30% saving over BAM and has improved to around a 50% saving now.  With CRAM v4 it may even be a 60-70% saving for NovaSeq data.  That's clearly worth it.  What percentage reduction over CRAM is worth switching again? 10%? 20%?  Maybe, maybe not, depending on how hard it is to implement and what else we gain from it.
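To put that in perspective, here's a quick back-of-the-envelope calculation (illustrative sizes only, not real benchmark figures) showing how small the absolute gain from a further 10-20% over CRAM would be compared with the original BAM-to-CRAM switch:

    # Illustrative arithmetic only: a nominal 100 GB BAM, not a real measurement.
    bam = 100.0
    cram3 = bam * (1 - 0.50)          # ~50% saving over BAM  -> 50 GB
    cram4 = bam * (1 - 0.65)          # hypothetical 60-70% saving -> ~35 GB
    print(f"CRAM 3 ~{cram3:.0f} GB, CRAM 4 prototype ~{cram4:.0f} GB")

    for extra in (0.10, 0.20):        # a further 10% or 20% saving over CRAM 3
        new_fmt = cram3 * (1 - extra)
        print(f"{extra:.0%} better than CRAM 3: ~{new_fmt:.0f} GB, "
              f"only {(cram3 - new_fmt) / bam:.0%} of the original BAM reclaimed")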

Unfortunately MPEG-G is a large unknown right now, not only in size vs CRAM (all the comparisons I've seen to date are vs BAM), but also in how any savings are achieved: size isn't everything.  Will it have similar speed, with similar granularity of random access?  So far nothing I've seen shows it to be a significant advance over CRAM (eyeballing the respective ratios vs BAM).

There are potentially new features (rewards) in MPEG-G too, but it's too early to know how beneficial any of these are.  For example, it has the ability to append new data without rewriting the file.  This is simply analogous to the on-the-fly file merging we already use, and have been using for a decade.  It's a nice feature, but one we could live without.  There are also options to retrieve only those sequences containing, say, at most X differences from the reference.  Is this useful?  I really don't know, but I'd need some convincing test cases.

Reproducibility


That neatly brings me on to evaluation and reproducibility of research.  MPEG-G's proponents make a few claims vs BAM, but to compare against CRAM ourselves we need access to the software, or failing that, some common data sets and more information on speeds and access granularity.

In July 2017 Marco Mattavelli gave a talk including some MPEG figures, among them BAM and MPEG-G file sizes for ERR174324 chromosome 11, a public data set.  Unfortunately, in the ENA archives it's only available in FASTQ format.  How was the BAM produced?  I've tried making one with "bwa mem", but the file size is simply not a good match.
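For what it's worth, this is roughly the kind of pipeline I mean (a sketch only; the reference build, bwa version and options used to produce the original BAM are unknown, which is exactly the problem):

    # Rough reconstruction attempt (assumes bwa and samtools on the PATH; the
    # reference file name and options here are my guesses, not the talk's).
    import subprocess

    ref = "hs37d5.fa"
    fq1, fq2 = "ERR174324_1.fastq.gz", "ERR174324_2.fastq.gz"

    bwa = subprocess.Popen(["bwa", "mem", "-t", "8", ref, fq1, fq2],
                           stdout=subprocess.PIPE)
    subprocess.run(["samtools", "sort", "-@", "8", "-o", "ERR174324.bam", "-"],
                   stdin=bwa.stdout, check=True)
    bwa.stdout.close()
    bwa.wait()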

 Source: https://mpeg.chiariglione.org/sites/default/files/events/Mattavelli.pdf

The same talk also gives figures for NA12878_S1 - one of the Illumina Platinum Genomes BAMs.  This one is released in BAM format; however, the BAM file size quoted in the talk is around a third larger than the one released by Illumina.    There are more recent slides from Marco, but they don't reference which data sets were used.  We need more than this.  What are the timings for encode and decode?   How many seeks are needed, and how much data is returned, for a small region query?
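For illustration, this is the sort of measurement I mean, sketched with pysam (the file name, chromosome naming and region are arbitrary, and an index is assumed to be present):

    # Time a small region query and count the records it returns.  Purely a
    # sketch of the metric I'd like to see reported, not an official benchmark.
    import time
    import pysam

    with pysam.AlignmentFile("NA12878_S1.bam", "rb") as f:   # needs a .bai index
        start = time.time()
        nrec = sum(1 for _ in f.fetch("chr11", 1_000_000, 1_100_000))
        print(f"{nrec} records returned in {time.time() - start:.3f}s")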

I'm not necessarily expecting a PowerPoint presentation to be the source of the data I want - this is normally the sort of thing relegated to the Supplementary Material section of a paper - but it should be available somewhere, and as far as I can tell it isn't.  I sent an email to Marco on Wed 19th September and am awaiting a reply:
(From me:) "I'd like to be able to see how CRAM compares against MPEG-G, and for a fair comparison it needs identical input BAMs.  Are there any recent public data sets with MPEG-G figures available?"
The best I can do so far is a side-by-side comparison against CRAM 3 and the CRAM 4 prototype using my bwa mem generated BAM for ERR174324: break it down by the approximate sizes of sequence, quality, name and auxiliary data (by removing each from the BAM in turn and observing the impact), and then scale my charts to match.   This gives us compression ratios, although we cannot draw any hard conclusions about absolute file sizes; the resulting charts are below.
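The stripping step itself is simple enough to sketch.  Something along these lines, using pysam, gives the flavour (illustrative only, and not the exact pipeline behind my charts):

    # Remove one class of field at a time, rewrite the BAM, and compare sizes.
    # Uses pysam (assumed available); glosses over details such as mate naming.
    import os
    import pysam

    def strip(in_bam, out_bam, drop):
        with pysam.AlignmentFile(in_bam, "rb") as src, \
             pysam.AlignmentFile(out_bam, "wb", template=src) as dst:
            for n, read in enumerate(src):
                if drop == "qual":
                    read.query_qualities = None    # qualities become "*"
                elif drop == "aux":
                    read.set_tags([])              # discard all auxiliary tags
                elif drop == "name":
                    read.query_name = f"r{n}"      # replace names with trivial ones
                dst.write(read)

    orig = "ERR174324.bam"
    for field in ("qual", "aux", "name"):
        out = f"no_{field}.bam"
        strip(orig, out, field)
        saved = os.path.getsize(orig) - os.path.getsize(out)
        print(f"{field}: ~{saved / 1e9:.2f} GB of the compressed BAM")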


The PR machinery


No one wants to shout "Adopt our format - it's marginally better than the current ones".  It just doesn't work as a slogan!   So it's hardly surprising that this gets glossed over in the talks, but it goes beyond that.   The talks and articles I've seen perpetuate the myth that there was never anything better out there:

 Source: https://mpeg.chiariglione.org/sites/default/files/files/standards/parts/docs/w17514.zip

We know MPEG are aware that better formats exist.   They collaborated with the authors of more advanced tools during the early days while learning the field, have cited them in online publications, and even invited them to MPEG conferences to talk about their work.  Yet sadly all of this is lost in the quest for a nice soft target to compare against.

We need proper comparisons against other well-established formats, or even the not-so-well-established ones if they look a better choice to switch to.  Their very absence is suspicious.  I do still think MPEG-G has something to offer, but convince me, because I'm not a man of faith!

It also extends past the immediate ISO/MPEG ecosystem and into related standards bodies such as the IEEE.   An article in IEEE Spectrum, with the rather condescending title of "THE QUEST TO SAVE GENOMICS", states:
"Bioinformatics experts already use standard compression tools like gzip to shrink the size of a file by up to a factor of 20. Some researchers also use more specialized compression tools that are optimized for genomic data, but none of these tools have seen wide adoption."
This is why I added "Myth 4: CRAM has not seen wide adoption" to my earlier blog.  CRAM has gone beyond research and into production.  I wouldn't describe over 1 million CRAM files deposited in the EBI public archives, in use by multiple major sequencing centres across the globe, as "not seeing wide adoption".

MPEG - you can do better than this.  Give us some hard numbers to work with and please stop painting the community as being in the stone age as it's simply not true.  We don't need "saving", we need proper open collaboration instead.

Taking it up a notch


Worryingly, it goes further than simple press releases.  I have recently become aware that, despite the lack of any official figures and public benchmarking, MPEG-G is being recommended to the European Commission's eHealth project.  A report on Data Driven Healthcare explicitly mentions MPEG-G as a likely format for standardisation across Europe:

"The European Health Record Recommendation will also talk about content-related standards for documents, images, complex datasets, and multimedia recordings. Viola admitted that there will be discussions around these standards, but he also said that the Commission is not planning to reinvent anything that already exists.
[...]
A good candidate, according to Viola, is the MPEG-G standard, a genomic information representation that is currently being developed by the Moving Picture Experts Group (MPEG). MPEG-G aims to bring the terabyte datasets of whole-genome sequencing down to a few gigabytes. This would allow them to be stored on a smartphone."

Note that "Not planning to reinvent anything that already exists" statement.  Roberto Viola (Director General for Communications Networks, Content and Technology) is listed elsewhere in the above report as the main driving force behind the eHealth Network project.  We can only speculate on whether Viola is aware of the full range of existing technologies.

I'm not anti-progress, but I am a scientist and that means evidence - before you start the marketing.

This is dangerous ground, and history is replete with battles where a good PR campaign won the day.

Addendum:

During preparation of this blog, someone pointed me at an MPEG-G paper on bioRxiv.  It's good to have a better description of the format, fitting somewhere between a few simple slides and the complexity of the full-blown specification.  However, the compression performance section again appears to be a missed opportunity.  The chart shows lossy MPEG-G only and offers no direct evidence of the impact this has on variant calling.  It does not compare against formats other than SAM and BAM, or even acknowledge their existence.  If I were a reviewer I would be requesting something that others can reproduce, using public data sets, and a comparison against the current state of the art.  If readers wish to move away from BAM, they need to know which format to move to.  This paper does not demonstrate that.
