Tuesday, September 25, 2018

"ISO/IEC JTC 1/SC 29/WG 11"

Preface:  all my blogs are my own personal views and are not necessarily the views of my employer, the Wellcome Sanger Institute.

Who?


You'll probably know ISO/IEC JTC 1/SC 29/WG 11 better as the Moving Picture Experts Group, or MPEG for short.

They formed out of a requirement for interoperability between the burgeoning multi-media formats and devices, and have been chaired throughout their history by Leonardo Chiariglione.  MPEG's aims were eminently sensible: put together the best format they can, irrespective of who owns which ideas, and come up with something better than the sum of its parts.  Since the companies involved wanted to monetise their intellectual property, this was done via a central pot (licensed under FRAND terms) which was divvied up among those due their slice of the MPEG pie.  The net effect is a win/win for everyone, consumer and producer alike.

This blog will cover some history of how MPEG got to grips with genomic data formats, up to the current publication of "MPEG-G".

MPEG discovers genomics


It appears to have been back in July 2014, during a meeting in Japan, that MPEG formally established an ad-hoc group (AHG) to start investigations into the field of genomic data compression.


This document shows the first tentative steps to understanding the field, along with the expected gaps in knowledge: (Emboldening theirs.)
"Such challenges imply the use/existence of an appropriate open standard for both compression and storage, [...] but that currently does not exist".
It doesn't take long to get a basic understanding, and by the next MPEG meeting in Strasbourg the team had grown and learnt far more about the field of genomic data compression, mentioning standard formats including FASTQ, SAM, BAM and CRAM and ego-pleasingly citing and including figures from my own Sequence Squeeze paper too.


Enter MPEG, stage left


Before the following meeting (Feb 2015) they reached out to the authors of existing tools and formats, and the process gathered momentum.  In January 2015 Vadim Zalunin, author of the original CRAM implementation at the EBI, had a call from MPEG.  He relayed this to me shortly after:
"MPEG community has contacted us, they are reviewing CRAM and if it has the potential to become a standard format they would want to support.  If you are interested please subscribe to <redacted>."
This was rather a bolt from the blue for me, having heard nothing of the above until then.  The video coding experts are looking at genomic data?  Well, I guess they have a lot of data compression expertise, so why not?

I joined them.  Vadim had already subscribed to their "reflector" (ISO speak for a mailing list), as had a number of other authors of related tools such as DeeZ,  SCALCE and GeneCodeq, and numerous data compression experts in the video coding world.  From one of my first postings of 7th Jan 2015, when I was just working out what all this entailed:
"I am eager to know what the MPEG team has to offer though.  For sure I think there is a rich vein around lossy quality encoding, maybe applying rate distortion theory and other bounded loss methods.  There have been a number of papers published on lossy encodings, but they've found it hard to gain traction so far.  I think perhaps there is a lack of a gold-standard method of evaluating the impact on lossy quality values."
We had lots of really technical discussions on various formats, their strengths and weaknesses, and how to improve.  Even the main man at MPEG, Leonardo Chiariglione, occasionally took part in discussions.   We also had a series of conference calls, and produced a set of test data and results of running various already existing tools, which subsequently formed the basis of a Nature Methods paper, of which I am a co-author: a survey of existing genomics compression formats.

So far my involvement was purely conference calls, emails and a lot of benchmarking, but in July 2015 Claudio Alberti invited me to the next MPEG conference:
"MPEG will organize a Workshop/seminar on Genome compression on 20th October 2015 in Geneva and we'd like to invite you as keynote speaker on genome compression issues/tools/technologies."
I spoke at this meeting, presenting CRAM both as it stood then and potential improvements to the codecs based on my earlier fqzcomp (Sequence Squeeze) work.  The talk slides can be seen here:  https://www.slideshare.net/JamesBonfield/cram-116143348

Formal processes


Once the field had been surveyed and a few explorations made, the next step of MPEG-G (as it later became known) was to construct a series of "Core Experiments" which related to the specific tasks at hand, and a public "call for contributions".   These included compression of sequences, identifiers, quality values (both lossless and lossy), auxiliary tags, as well as more nebulous things.
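To make the lossy quality-value experiment more concrete, here is a minimal sketch of one well-known bounded-loss approach: quality binning. The bin boundaries below mimic Illumina's published 8-level scheme; this is purely an illustration of the idea and is not taken from any actual MPEG-G submission.

```python
# Hypothetical sketch of lossy quality-value binning, one of the simpler
# "bounded loss" approaches.  Bin boundaries follow Illumina's 8-level
# scheme: (low, high, representative) in Phred units.
ILLUMINA_BINS = [
    (0, 1, 0), (2, 9, 6), (10, 19, 15), (20, 24, 22),
    (25, 29, 27), (30, 34, 33), (35, 39, 37), (40, 93, 40),
]

def bin_quality(q: int) -> int:
    """Map a Phred quality score to its bin's representative value."""
    for lo, hi, rep in ILLUMINA_BINS:
        if lo <= q <= hi:
            return rep
    raise ValueError(f"quality {q} out of range")

def bin_string(quals: str) -> str:
    """Apply binning to a FASTQ quality string (Phred+33 ASCII encoding)."""
    return "".join(chr(bin_quality(ord(c) - 33) + 33) for c in quals)
```

With only eight distinct symbols remaining, a downstream entropy coder compresses the quality string far more tightly, at the cost of a bounded error in each score.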

ISO/MPEG procedures are complicated!  ISO do not accept comments and submissions from individuals, only from member nations or other recognised liaison committees.  Each nation in turn has its own standards body (BSI for the UK), and people affiliated with an appropriate technical committee (eg BSI's IST/37) are permitted to take part in ISO discussions on behalf of the national standards body.  This was all very bureaucratic, and they seemed to want me to turn up to meetings in London and do all sorts of other duties.  I opted out of this complexity.

However I still contributed via Claudio Alberti, who kindly made submissions on my behalf.  These consisted of the various experimental CRAM codecs I had presented at the October 2015 meeting in Geneva.  However I wasn't present at the subsequent meetings (without standards accreditation, you can only attend if formally invited), nor was I able to see anyone else's submissions in the internal MPEG document archive.  The workload was getting quite high and I took this as an opportunity to bow out.  I later heard one of my submissions (identifier compression) was adopted as part of the MPEG-G specification.
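The general idea behind tokenised identifier compression can be sketched in a few lines. This is a simplified illustration of the technique, not the actual submission or the MPEG-G encoder: read names are split into alternating string and numeric tokens, and each token is encoded relative to the same token in the previous name, where in typical sequencing data most tokens either match exactly or differ by a small numeric delta.

```python
import re

# Hypothetical illustration of tokenised read-identifier compression.
# Real implementations are considerably more elaborate (per-token byte
# streams, dedicated entropy coders, etc.).
TOKEN = re.compile(r"\d+|\D+")  # alternating runs of digits / non-digits

def tokenise(name: str) -> list:
    return TOKEN.findall(name)

def encode(names: list) -> list:
    """Encode each name as per-token ops: MATCH, DELTA or LITERAL."""
    out, prev = [], []
    for name in names:
        toks, ops = tokenise(name), []
        for i, t in enumerate(toks):
            p = prev[i] if i < len(prev) else None
            if t == p:
                ops.append(("MATCH",))                   # same as previous name
            elif p and t.isdigit() and p.isdigit():
                ops.append(("DELTA", int(t) - int(p)))   # small numeric change
            else:
                ops.append(("LITERAL", t))               # store verbatim
        out.append(ops)
        prev = toks
    return out
```

For a run of names like "SRR001.1", "SRR001.2", … everything after the first name reduces to a stream of MATCH ops plus a DELTA of 1, which compresses to almost nothing.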

The MPEG-G standard itself


Having bowed out of regular communication and meetings, I am hazy about the process from then on, apart from the occasional public document.  The most relevant of these is the MPEG-G standard itself, now bearing the snappy title of ISO/IEC 23092-2 CD, all 250+ pages of it!  The MPEG site also hosts documents beyond the file format itself, covering topics such as transport / streaming facilities (similar to htsget) and APIs.

I am uncertain on the timeline, but it is probable the standard will be finalised by the end of this year or early next year, with a reference implementation available soon.  (I do not know whether this would include the encoder or just a decoder.)

You're probably curious, as I am, about how it compares to other existing technologies in compression ratios and encode / decode times.  I don't think we have long to wait, but right now information appears to be sketchy, with only rather vague numbers published.
