Wednesday, October 17, 2018

MPEG-G challenge, 1 week on.

Preface:  all my blogs are my own personal views and may not necessarily be the views of my employer, the Wellcome Sanger Institute.

 

Replies


It's been one week since I publicly asked for figures on how MPEG-G performs on the data set for which they provided DeeZ and CRAMv2 figures in their paper.

Yesterday I had a reply from a GenomSys employee, and a couple days before that from Mikel Hernaez (as you can see in the comments to the previous post). Thank you both for taking the time to reply and I want you to know none of this is personal.

Sadly neither is willing or able to provide figures at this moment.  GenomSys gave no reason other than "we will provide them according to the established workplan within MPEG".  Make of that rather unclear statement what you will.  Mikel because he's working on an open source implementation which isn't yet finished - fair enough, and a huge thank you for working on that.  I look forward to experimenting with it myself when available.  I do still genuinely believe the format would have something to offer, if it weren't for the patents.

I was also told that the BAM figures from Marco's talk last year were estimates based on the outcomes from the Call for Proposals experiments, rather than from a fully formed MPEG-G file.  Maybe this is why there was no auxiliary data shown in the bar-graphs, although I am unsure whether auxiliary data is part of the specification anyway.  There is still a problem here though.  Stating how much better MPEG-G is than SAM / BAM in various press pieces ([1], [2], [3]) and then claiming it is not yet appropriate to do comparisons against CRAM, DeeZ, etc. just doesn't hold water.  Either it's ready for comparisons or it isn't.

I ended my last blog by saying "indeed I am starting to wonder if there is even a fully working implementation out there yet".  While it may appear that this was bang on the money, note it is unclear whether GenomSys' demonstration at the GA4GH conference in Basel was a mock-up, or whether they are refusing to give numbers for some other reason (e.g. to avoid tipping off the "opposition" - other MPEG-G implementers - before the final acceptance of the standard).  Either way, this doesn't fit well with the ISO Code of Conduct[4]: "uphold the key principles of international standardization : consensus, transparency, openness, [...]".  If the performance of the MPEG-G format was any less open it'd be in a locked filing cabinet in a disused lavatory with a sign on the door saying "Beware of the leopard"!


My goals


So where to go from here?  Wait, I guess, as far as benchmarks go.  However some people have wondered what my desired outcomes were from starting this series of blogs.  I'll elaborate a little.

  1. To educate people about the CRAM format.  I consistently hear of what CRAM cannot do, when in reality either CRAM does it differently or the complaint is about inadequacies of our tool chain rather than the format itself.

    This is an ongoing project, but I think we're better off than before. 


  2. To ensure that as many people as possible are aware that there is more than just BAM vs MPEG-G.  Numerous press pieces have been making these comparisons without mention of the other technologies, some of which made it into MPEG-G.

    Decision makers need to be fully aware of the choices available.

  3. To inform about software patents: both my own personal views on them (which I believe are shared by the GA4GH, although I do not speak on their behalf) and the existence of patents that directly cover both encoding and decoding of the MPEG-G standard.  *ALL* implementations will be covered by these patents, and we are not yet aware of the licensing terms.

    Again, decision makers need to be fully aware of the implications of their choices.  I have not yet observed any other press releases covering this issue, and I still cannot see any of the GenomSys patents registered in the ISO patent database[5].

    To date there have been over 25,000 views of the blog, so the messages in this and point 2 above are hopefully getting through.

  4. On the topic of patents, to research prior art and where appropriate block patent claims.  We are aware of numerous existing bioinformatics tools that may become unusable without obtaining licenses if these patents are granted.  These will be filed as prior art.

    The cost to the community of these patents isn't just in any royalty fees, but in hampered public research as the existing bioinformatics community will have to tiptoe around the minefield of confusingly worded patents, with the fear of having their work squashed by a patent licensing authority if they step too close.

    Fighting this is time consuming and thus costly, but it is ongoing work.  It's not something I can speak about yet though.

  5. To improve!

    One beneficial outcome of this whole unfortunate tangle is acknowledging our own weaknesses and correcting them.  For CRAM this means documentation, specification clarity, conformance tests, and generally more capable tools.

    This is a bit more long term, but it's good to have a lot of ideas to work on.

Conclusion


As a final thought, there are things in the MPEG-G standard that I like and that are an improvement over CRAM.  It was designed by a proficient group with several world class researchers involved, so it is sad that it comes with strings attached that will ultimately hamper its uptake.  There are things I could suggest to improve it further, but I will not while it is a patented format.

Certainly it is the case that no MPEG-G decoder can be patent free while decoding any valid MPEG-G file.  However it remains to be seen whether an encoder could restrict itself to a subset of the MPEG-G specification that avoids the patents (e.g. by having just 1 data class).  If so, maybe something is salvageable and I would encourage the open source implementation to consider this as an option.  I'm not a lawyer though and the patents waffle on forever!  The more egregious claims will need squashing first though.


References


[1] https://mpeg.chiariglione.org/sites/default/files/events/Mattavelli.pdf
[2] https://spectrum.ieee.org/ns/Blast/Sept18/09_Spectrum_2018_INT.pdf 
[3] https://www.biorxiv.org/content/early/2018/09/27/426353
[4] https://www.iso.org/publication/PUB100397.html
[5] https://www.iso.org/iso-standards-and-patents.html
