Wednesday, October 17, 2018

MPEG-G challenge, 1 week on.

Preface:  all my blogs are my own personal views and may not necessarily be the views of my employer, the Wellcome Sanger Institute.

 

Replies


It's been one week since I publicly asked for figures showing how MPEG-G performs on the data set for which they provided DeeZ and CRAM v2 figures in their paper.

Yesterday I had a reply from a GenomSys employee, and a couple of days before that from Mikel Hernaez (as you can see in the comments to the previous post).  Thank you both for taking the time to reply, and I want you to know none of this is personal.

Sadly neither is willing or able to provide figures at the moment.  GenomSys for unspecified reasons other than "we will provide them according to the established workplan within MPEG" - make of that rather unclear statement what you will.  Mikel because he's working on an open source implementation which isn't yet finished - fair enough, and a huge thank you for working on that; I look forward to experimenting with it myself when it becomes available.  I do still genuinely believe the format has something to offer, were it not for the patents.

I was also told that the BAM figures from Marco's talk last year were estimates based on the outcomes of the Call for Proposals experiments, rather than measurements from a fully formed MPEG-G file.  Maybe this is why no auxiliary data was shown in the bar graphs, although I am unsure whether auxiliary data is part of the specification anyway.  There is still a problem here though.  Stating how much better MPEG-G is than SAM / BAM in various press pieces ([1], [2], [3]) and then claiming it is not yet appropriate to do comparisons against CRAM, DeeZ, etc. just doesn't hold water.  Either it's ready for comparisons or it isn't.

I ended my last blog by saying "indeed I am starting to wonder if there is even a fully working implementation out there yet".  While it may appear that this was bang on the money, note it is unclear whether GenomSys' demonstration at the GA4GH conference in Basel was a mock-up, or whether they are refusing to give numbers for some other reason (e.g. a fear of tipping off the "opposition" - other MPEG-G implementers - before the final acceptance of the standard).  Either way, this doesn't fit well with the ISO Code of Conduct[4]: "uphold the key principles of international standardization: consensus, transparency, openness, [...]".  If the performance of the MPEG-G format was any less open it'd be in a locked filing cabinet in a disused lavatory with a sign on the door saying "Beware of the leopard"!


My goals


So where to go from here?  As far as benchmarks go, I guess we wait.  However, some people have wondered what my desired outcomes were in starting this series of blog posts, so I'll elaborate a little.

  1. To educate people about the CRAM format.   I consistently hear claims about what CRAM cannot do, when in reality the format simply does things differently, or the complaint is about inadequacies of our tool chain rather than the format itself.

    This is an ongoing project, but I think we're better off than before. 


  2. To ensure that as many people as possible are aware that there is more than just BAM vs MPEG-G.  Numerous press pieces have been making these comparisons without mention of the other technologies, some of which made it into MPEG-G.

    Decision makers need to be fully aware of the choices available.

  3. To inform people about software patents: both my own personal views (and, I believe, those of the GA4GH, although I do not speak on their behalf) and the existence of patents that directly cover both encoding and decoding of the MPEG-G standard.  *ALL* implementations will be covered by these patents, and we are not yet aware of the licensing terms.

    Again, decision makers need to be fully aware of the implications of their choices.  I have not yet observed any other press coverage of this issue, and I still cannot see any of the GenomSys patents registered in the ISO patent database[5].

    To date there have been over 25,000 views of the blog, so the messages in this and point 2 above are hopefully getting through.

  4. On the topic of patents, to research prior art and where appropriate block patent claims.  We are aware of numerous existing bioinformatics tools that may become unusable without obtaining licenses if these patents are granted.  These will be filed as prior art.

    The cost to the community of these patents isn't just in any royalty fees, but in hampered public research as the existing bioinformatics community will have to tiptoe around the minefield of confusingly worded patents, with the fear of having their work squashed by a patent licensing authority if they step too close.

    Fighting this is time consuming and thus costly, but it is ongoing work.  It's not something I can speak about yet though.

  5. To improve!

    One beneficial outcome of this whole unfortunate tangle is acknowledging our own weaknesses and correcting them.  For CRAM this means documentation, specification clarity, conformance tests, and generally more capable tools.

    This is a bit more long term, but it's good to have a lot of ideas to work on.

Conclusion


As a final thought, there are things in the MPEG-G standard that I like and that are an improvement over CRAM.  It was designed by a proficient group with several world-class researchers involved, so it is sad that it comes with strings attached that will ultimately hamper its uptake.  There are things I could suggest to improve it further, but I will not while it remains a patented format.

Certainly it is the case that no MPEG-G decoder can be patent free while decoding any valid MPEG-G file.  However, it remains to be seen whether an encoder could restrict itself to a subset of the MPEG-G specification that avoids the patents (e.g. by using just one data class).  If so, maybe something is salvageable, and I would encourage the open source implementation to consider this as an option.  I'm not a lawyer though, and the patents waffle on forever!  The more egregious claims will need squashing first.


References


[1] https://mpeg.chiariglione.org/sites/default/files/events/Mattavelli.pdf
[2] https://spectrum.ieee.org/ns/Blast/Sept18/09_Spectrum_2018_INT.pdf 
[3] https://www.biorxiv.org/content/early/2018/09/27/426353
[4] https://www.iso.org/publication/PUB100397.html
[5] https://www.iso.org/iso-standards-and-patents.html

Wednesday, October 10, 2018

MPEG-G: An open challenge

Preface:  all my blogs are my own personal views and may not necessarily be the views of my employer, the Wellcome Sanger Institute.


Several key people behind the MPEG-G specification have submitted a preprint covering the format to bioRxiv.  The first version appeared on 27th September, with a follow-up on 8th October to address some of the comments.

Version 1


The first version of the MPEG-G paper is available at bioRxiv here: https://www.biorxiv.org/content/early/2018/09/27/426353  This is the version that has the most comments.

In it we see no comparison against any other tools whatsoever, other than uncompressed SAM and BAM, nor even any mention of their existence.  Furthermore, the MPEG-G figure was produced using a lossy compressor.  No reference was given for the data sets used, and no actual size values were shown - only a poor-quality graph.

There is no direct evidence that their lossy mode does not harm analysis, other than a reference to other papers that have demonstrated this for their own methods.  We do not know whether their method is the same.  We cannot take things purely on trust; science is skeptical for a reason.

It got pretty much savaged in the comments, and quite rightly so.


Version 2


A new version has just been uploaded.  This second version of the paper initially seems like a significant improvement.  Now there are references to lots of other tools considered state of the art, and some benchmarks between them.  Unfortunately that is as far as it goes.  There are even fewer figures on MPEG-G performance (i.e. none).  The authors state that their aim is to describe the format and not the implementation, but really we need a demonstration of at least one implementation to even judge whether the format is worthy of further investigation.  They just refer to a "sense" of the capabilities it could have.  I don't want a "sense", I want evidence!
"Nevertheless, to give the reader a sense of the compression capabilities achievable by MPEG-G, next we show the compression performance of some of the technologies supported by the standard (and hence implementable in an encoder), and how they compare with current formats(2)."
Unless the format is simply a container that permits other tools to be wrapped inside it, I'm struggling to understand this statement - or the benefit of such a container format.  Even worse, MPEG-G is repeatedly referred to in terms of how it "could" be implemented using this method or that algorithm, but never what it actually "does":
"When it comes to aligned data, an MPEG-G encoder could use a compression method comparable to that of DeeZ [8], which is able to compress a 437 GB H. Sapiens SAM file to about 63 GB, as compared to 75 GB by CRAM (Scramble) or 106 GB by BAM [8]"
This simply doesn't cut the mustard!    I want to know how it "does", not "could do".  However let us go with the assumption that right now, MPEG-G is indeed comparable to DeeZ for compression ratios, as suggested.  Note that the figures quoted above are also old and do not represent the current performance of the tools cited (except for BAM).

So let's do some more digging.  That little footnote (2) at the end of the first quote indicates the figures come from unspecified cited publications: "The specific numbers are taken from the corresponding publications".

The paper in question is the DeeZ paper by Hach, Numanagić and Sahinalp.  Unfortunately it's not open access, although I can read it via work and the supplementary text is freely available.  The figures quoted in the 2nd version of the MPEG-G paper come from Table 1 of the DeeZ paper: 62,808 MB for DeeZ, 75,784 MB for Scramble (CRAM), 106,596 MB for BAM and 437,589 MB for SAM.  Furthermore, because the DeeZ paper is rigorously written and follows accepted notions of reproducible research, the supplementary data lists the exact file tested as ftp://ftp.sra.ebi.ac.uk/vol1/ERA207/ERA207860/bam/NA12878_S1.bam.

Given the age of the DeeZ paper, the figures shown there are for an earlier DeeZ version as well as a considerably weaker CRAM v2.1, rather than the current CRAM v3 standard.  (This isn't a complaint about the DeeZ paper; they reviewed what was available at the time.)  I did my own analysis on this BAM file.


Format    Options              Size (GiB)  Notes
BAM       (default)            105.8       Orig file
BAM       -9                   103.0       Scramble
Deez      (default)             60.6       1 million reads random access granularity
CRAM2.1   (default)             74.0       Using Scramble 1.13.7 to match DeeZ paper
CRAM3     (default)             64.2       Scramble, 10k reads granularity
CRAM3     -7 -s 100000 -j       58.5       Scramble, 100k reads granularity + bzip2
CRAM3     -7 -s 100000 -j -Z    56.1       Scramble, 100k reads granularity + bzip2 + lzma
CRAM4     (default)             58.8       Scramble, 10k reads granularity
CRAM4     -7 -s 100000          53.1       Scramble, 100k reads granularity
CRAM4     -7 -s 100000 -j -J    52.9       Scramble, 100k reads granularity + bzip2 + libbsc

The above table shows sizes from DeeZ (in random-access mode, rather than the non-random-access -q2 method), BAM as downloaded and as regenerated using scramble -9, and CRAM using a variety of Scramble (1.14.10) command-line options and versions.  Note the CRAM 2.1 file was created using the old 1.13.7 release to match the one cited in the DeeZ paper.  Also note CRAM 4 is highly experimental - it's a demonstration of what may become a standard, rather than what already is.  Nevertheless, it's not vapourware; it can be downloaded and run, so I include it as a demonstration of where we could go with CRAM if the community desires it.  There are obviously big CPU time differences (not recorded, sorry), and I've also adjusted some of the CRAM "slice sizes" (the granularity of random access) to demonstrate the range of sizes we get.  The smaller files are more suitable for archival than ongoing analysis.
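
For anyone who would rather stay within htslib than drive Scramble directly, similar knobs are exposed through hts_set_opt when writing CRAM.  Below is a minimal sketch, assuming the CRAM_OPT_* option names as they appear in htslib's hts_fmt_option enum; check your htslib version, and note the compression-level and codec mapping will not be identical to the Scramble runs above.

#include <htslib/sam.h>

/* Recompress a BAM as CRAM with a larger slice size and extra codecs
 * enabled, loosely mirroring the "-s 100000 -j -Z" Scramble runs above. */
int bam_to_cram(const char *in_fn, const char *out_fn, const char *ref_fa) {
    samFile *in  = sam_open(in_fn, "r");
    samFile *out = sam_open(out_fn, "wc");              /* 'c' selects CRAM output  */
    if (!in || !out) return -1;

    hts_set_opt(out, CRAM_OPT_REFERENCE, ref_fa);       /* reference FASTA          */
    hts_set_opt(out, CRAM_OPT_SEQS_PER_SLICE, 100000);  /* random-access granularity */
    hts_set_opt(out, CRAM_OPT_USE_BZIP2, 1);
    hts_set_opt(out, CRAM_OPT_USE_LZMA, 1);

    bam_hdr_t *hdr = sam_hdr_read(in);
    if (!hdr || sam_hdr_write(out, hdr) < 0) return -1;

    bam1_t *b = bam_init1();
    int ret;
    while ((ret = sam_read1(in, hdr, b)) >= 0)
        if (sam_write1(out, hdr, b) < 0) return -1;

    bam_destroy1(b);
    bam_hdr_destroy(hdr);
    if (sam_close(out) < 0 || sam_close(in) < 0) return -1;
    return ret == -1 ? 0 : -1;   /* -1 from sam_read1 means clean EOF */
}

The same options are also reachable from the samtools command line via --output-fmt-option, if you prefer not to write C.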

Basically, we see that CRAM can beat DeeZ quite considerably even using just the existing CRAM 3 standard.  I therefore find it disheartening that the new revision of the MPEG-G paper contains outdated, and hence misleading, conclusions.  The newly proposed codecs in version 4 push CRAM to smaller files still, so we need to consider whether we want a completely new format or just an update to existing ones.  Obviously I'm biased on this point.

Analysis of the graph in the version 1 paper shows this is likely the same file reported in the "Human WGS (high coverage)" figure.  Unfortunately the MPEG-G values shown there were produced with lossy compression, so no direct comparison can be made against DeeZ and/or CRAM.


Conclusion


The MPEG-G authors were already called out in the comments on their original paper for misrepresenting the current state of the art.  It appears they have now made the same mistake a second time by using outdated benchmarks.  Once may be an error, but twice is starting to look like a pattern.

Scientists are more savvy than this.  We work on open and reproducible research, not secrecy and promises of what something "could" do.  So far MPEG's own paper has stated it could be comparable to DeeZ, and I have compared CRAM to DeeZ, showing CRAM to be smaller.  One logical conclusion therefore is that MPEG-G can already be beaten by CRAM on file size.  I doubt this is actually true, as they have had some heavy-weights working on it, but the paucity of evidence hardly instils confidence.  Indeed I am starting to wonder if there is even a fully working implementation out there yet.


An open challenge


So let's get right down to it.  I hereby openly challenge the MPEG-G community to provide their own figures, using the data set chosen by them in their paper.  Please email me your MPEG-G figures for the same NA12878_S1.bam file presented in the DeeZ paper that you quote, in GiB units.  I'd like to know the size, some measure of granularity (be it access unit size in MB, number of reads, or a similar metric), and what data is retained.  We can start with the easy lossless case.  (Lossy compression is much harder to evaluate, being a complex set of adjustments and trade-offs.)

I will email this to the two authors with email addresses listed in the paper.

Tuesday, October 9, 2018

Reflections on the GA4GH conference

From 3rd to 5th October we had a large GA4GH conference, attended by around 400 people from across the world and from many sectors (academia, pharma, manufacturers, medical, and so on).  It was an exciting meeting with lots of discussions, important face-to-face networking opportunities, and some genuinely moving and inspiring talks.

"Collaborate on the interfaces; compete on the implementations"


This quote was in Ewan Birney's opening address to the delegates, and I couldn't agree more with this philosophy.  "Interface" can mean multiple things, from the htsget and refget protocols to the underlying file formats.  The GA4GH embraces, and actively wants, involvement from the commercial sector, many of whose members have been instrumental in setting new standards.  The file formats are our pre-competitive interchange.  They're the common language we speak, and it benefits everyone, both for-profit and non-profit organisations, if they are free for all to use.  This still leaves room for clever and optimised implementations, and no one raises objections to, say, a commercial FPGA implementation of BAM.


CRAM: format vs toolchain implementation


In this light, I also wish to discuss what CRAM can do as a format and what the implementations currently offer us, because sadly they differ.  We should not fall into the mistake of wishing for a new format when a new implementation will suffice.


Firstly, it is up to the encoder to decide which data goes where.  Earlier CRAM implementations put most aux tags into the same block, which improves compression in some cases and harms it in others.  Newer ones put each tag into its own block (which, again, is sometimes good and sometimes bad for compression).  We have files that are smaller with samtools 1.3 than 1.9 and vice versa.  Clearly there is room for optimisation and improvement over time, while still adhering to the same CRAM bitstream format.  In this regard, CRAM shares the same methodology as MPEG-G.

Cram_filter


I covered the htslib REQUIRED_FIELDS parameter in an earlier blog, but in summary it indicates to the CRAM decoder which fields we wish to retrieve, and in turn which fields it may choose to discard during reading.  However, avoiding unnecessary decompression and decoding is a rather naive solution.  There are times we wish to manipulate files at the block level, for fast streaming in the context of, for example, an htsget server.

A rather crude tool, cram_filter from io_lib, demonstrates this capability in a proof-of-concept fashion.  Currently cram_filter takes a range of slice numbers and emits just those slices, using nothing more than seek, read and write calls.  It is *extremely* fast (hence appropriate for serving CRAM files via htsget).
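
To spell out what "nothing more than seek, read and write" means in practice, here is a minimal sketch of the core operation, assuming you already know a container or slice's byte offset and length.  In a real tool those values would come from the index; this is not cram_filter's actual code.

#include <stdio.h>

/* Copy `len` bytes starting at `offset` from `in` to `out` without ever
 * decoding them: the compressed container/slice is emitted verbatim. */
int copy_raw_range(FILE *in, FILE *out, long offset, size_t len) {
    char buf[65536];
    if (fseek(in, offset, SEEK_SET) != 0)
        return -1;
    while (len > 0) {
        size_t chunk = len < sizeof(buf) ? len : sizeof(buf);
        size_t got = fread(buf, 1, chunk, in);
        if (got == 0)
            return -1;                       /* short read / EOF */
        if (fwrite(buf, 1, got, out) != got)
            return -1;
        len -= got;
    }
    return 0;
}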

We can also do the vertical component, called data series in CRAM and descriptors in MPEG-G, but fundamentally both are simply blocks.  Cram_filter permits us to specify the tags we wish to include or drop.  Sadly at the moment it only permits filtering of quality values and aux tags, mainly because I was lazy and the others are harder to do, but many combinations should be possible, including read names, mapping qualities and ancillary fields such as SAM PNEXT, RNEXT and TLEN.  In theory any block can be discarded on the fly provided there is no data or code-path dependency on it from a block we wish to keep.

However, I also have to acknowledge the implementation is, frankly, amateurish.  (It's mine, so I'm not offending anyone else!)  It could trivially accept a chromosome / range and use the index instead of slice numbers, but it doesn't.  That's a toolchain problem though, not a specification issue.  This capability should also be available natively within htslib / samtools, instead of relying on a third-party tool.  It all comes under the "compete on the implementations" mantra.

The next step?


Filtering data, by row or column, is just a starting point.  What if we wish to add a column or modify a value?  Let us consider a hypothetical scenario: we read the quality values for each record and produce mean and standard deviation auxiliary tags per record.  From a data-processing viewpoint, our tool would decode the qualities, compute, and add two new tags, with all other fields being passed straight through from input file to output file.  In this example we also remove the OQ:Z tag.



This needs a very different API to the one currently implemented in htslib.  Perhaps this should become htslib 2.0.  Fundamentally the current library is scuppered by operating at the data-structure level (all data is decoded up front and available directly) rather than through a function-level query API.  However, in principle the scenario above could be implemented as follows.

  • API call to fetch the next record.
    • On first call, the library loads the entire next slice and holds the compressed blocks in memory.
    • Return a reference to a specific record within this slice - just a link, not decoded yet.
  • API call to obtain the quality values.
    • On first call, we decompress the quality block.
    • Return the appropriate pointer into the decompressed quality values.
  • Compute the mean and standard deviation scores.
  • API call to write the mean and standard deviation aux. tags.
    • On first call for this slice, create the new blocks.
    • Append data to the uncompressed in-memory mean and sd blocks.
  • API call to write the new record.
    • Create a new output slice if necessary.
    • Add the CRAM record, with a link to the record from the input slice.
    • When the slice is full, store the entire slice.
This last part is the cunning bit.  If we have 10,000 records in our output read name block, and all of those 10,000 records are simply links to 10,000 records in our corresponding input block, then we already have a copy of the compressed block.  Thus we can spot the opportunity to avoid an expensive decode and encode step.  We'll possibly need to recompress the quality block, unless we cached the compressed form, and obviously we'll need to compress the new mean and sd value blocks.
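
To make that concrete, here is a minimal sketch of what such an API might look like.  Everything below is hypothetical: the cram2_* names, signatures and tag names are invented purely to illustrate the decode-on-demand and compressed-block pass-through ideas, and this is not a proposal for actual htslib function names.

#include <math.h>
#include <stddef.h>

/* Hypothetical "htslib 2.0"-style declarations -- none of these types or
 * functions exist today; they only sketch the lazy, block-level model
 * described in the list above. */
typedef struct cram2_fd  cram2_fd;   /* an open CRAM file               */
typedef struct cram2_rec cram2_rec;  /* a lightweight, undecoded record */

cram2_fd   *cram2_open(const char *fn, const char *mode);
cram2_rec  *cram2_next(cram2_fd *in);             /* loads a slice on demand    */
const char *cram2_qual(cram2_rec *r, int *len);   /* decompresses the quality
                                                     block on first use only    */
int cram2_aux_set_f(cram2_rec *r, const char tag[2], float val);
int cram2_write(cram2_fd *out, cram2_rec *r);     /* untouched blocks are copied
                                                     still-compressed on flush  */
int cram2_close(cram2_fd *fd);

/* Add mean/sd quality tags; every other data series passes through untouched. */
void add_qual_stats(cram2_fd *in, cram2_fd *out) {
    cram2_rec *r;
    while ((r = cram2_next(in)) != NULL) {
        int len = 0;
        const char *q = cram2_qual(r, &len);      /* the only block we decode */
        double sum = 0, sum2 = 0;
        for (int i = 0; i < len; i++) {
            sum  += q[i];
            sum2 += (double)q[i] * q[i];
        }
        double mean = len ? sum / len : 0;
        double sd   = len ? sqrt(sum2 / len - mean * mean) : 0;
        cram2_aux_set_f(r, "qm", (float)mean);    /* tag names invented here   */
        cram2_aux_set_f(r, "qs", (float)sd);
        cram2_write(out, r);                      /* names, seqs, flags are
                                                     never decoded             */
    }
}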

Sadly there is no timeline for implementing this, but that does not mean we need a new format simply to get a new library with a better API!  That would be a case of the tail wagging the dog.

CRAM 4 proposal

Just a quick one this time.

The slides for the talk I gave at the GA4GH 6th Plenary meeting in Basel, on 3rd October 2018 are now online at https://www.slideshare.net/JamesBonfield/cram4

This talk covers the CRAM v4.0 proposal, which could also become v3.1 if we restrict it to purely codec improvements.

Note that this hasn't actually changed much since the MPEG-G talk I gave 3 years ago!  It has improved slightly, but back then it was a proposal, and right now it still is.  In the interim there simply wasn't any strong community desire to eke out another 10-20% saving given the extra CPU cost, but with NovaSeq data we see we can gain at least 10% with no real CPU hit (more if we want to spend the CPU, of course), and there are other issues, such as extra-long chromosomes, that require format updates anyway.

Saturday, October 6, 2018

CRAM structure

Having just got back from the super GA4GH conference (more on that later), it is clear that some people are still confused about how CRAM actually looks under the hood, so I'll take some time to illustrate it in more detail.

CRAM file structure


The basic CRAM structure is a magic number / version, a whole-file header (the SAM header), followed by a series of containers.  For position-sorted data, these containers are anchored to locations in the genome and, by way of an index, provide a means of random access.  (We can index unsorted / unmapped data too, permitting retrieval by container number.)  Each container contains a Compression Header block, indicating which type of data (Data Series) is stored in which block, and a series of Slices - typically we use just one slice per container, but the format permits many.



Each slice in turn has its own Header block and multiple Data Series blocks corresponding to specific types of data.  It is the container's Compression Header which dictates which CRAM data series goes into which block, permitting multiplexing of data series into a single block if it helps compression.  Usually, though, it is best to separate them out, one data series per block.
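
As a rough mental model, the nesting can be sketched in C-like structs.  This is deliberately simplified and illustrative only: the field names and types are mine, not those of the CRAM specification nor of the htslib / io_lib implementations.

#include <stdint.h>
#include <stddef.h>

/* Illustrative only: a simplified view of the container / slice / block
 * hierarchy, not the actual on-disk encoding or any real implementation. */
typedef struct {
    int32_t  content_id;   /* which data series this block carries        */
    int32_t  method;       /* codec: raw, gzip, rANS, ...                 */
    size_t   comp_size;    /* compressed size on disk                     */
    uint8_t *data;         /* compressed payload                          */
} block_sketch;

typedef struct {
    block_sketch  header;  /* slice header block                          */
    int           nblocks;
    block_sketch *blocks;  /* e.g. read names, qualities, BAM flags, OQ:Z */
} slice_sketch;

typedef struct {
    block_sketch  comp_header; /* maps each data series to a block content_id */
    int32_t       ref_id;      /* genomic anchor used by the index            */
    int64_t       start, span;
    int           nslices;     /* usually 1, but the format permits more      */
    slice_sketch *slices;
} container_sketch;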


Typically there will be many more data series than the four mentioned above (read names, quality scores, BAM flags and an OQ:Z auxiliary tag).  For random access on chromosome- and position-sorted data, we work out which containers cover the regions in question.  We also get the option to specify which data series we wish to pull back; for example, a tool that only needs flags and mapping qualities can leave the sequence and quality blocks untouched on disk.


You may have seen very similar diagrams recently elsewhere, such as in the MPEG-G bioRxiv paper, but rest assured it's not a new idea.  Their terms differ - replace Slice with "Access Unit" and Data Series with "Descriptor" - and you'll get something looking very similar.  I doubt CRAM was the first to implement such an obvious data structure either.

Htslib "REQUIRED_FIELDS"


This block-based nature of CRAM means we don't have to decode everything if we do not wish to, permitting what MPEG terms "selective access".  In the htslib implementation this is achieved by specifying the SAM columns that the decoder is interested in (CRAM specific), in addition to the more common region restriction using an iterator (which works for SAM, BAM and CRAM).  The code example below is from "samtools flagstat"; its effect can be seen by benchmarking that command on BAM and CRAM files.

hts_set_opt(fp, CRAM_OPT_REQUIRED_FIELDS, SAM_FLAG | SAM_MAPQ | SAM_RNEXT);

At the samtools command line level, this is equivalent to:

  samtools view --input-fmt-option required_fields=0x52 file.cram

(See the samtools man page for the hex column codes.)
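
For context, here is a minimal sketch of how that option slots into a normal htslib read loop (error handling trimmed; the counting logic mirrors flagstat's "mate mapped to a different chr (mapQ>=5)" statistic):

#include <htslib/sam.h>

/* Count reads whose mate maps to a different chromosome with mapQ >= 5,
 * decoding only the FLAG, MAPQ and RNEXT data series from a CRAM input. */
int count_distant_mates(const char *fn) {
    samFile *in = sam_open(fn, "r");
    if (!in) return -1;

    /* Tell the CRAM decoder which SAM columns we actually need. */
    hts_set_opt(in, CRAM_OPT_REQUIRED_FIELDS,
                SAM_FLAG | SAM_MAPQ | SAM_RNEXT);

    bam_hdr_t *hdr = sam_hdr_read(in);
    bam1_t *b = bam_init1();
    int n = 0;

    while (sam_read1(in, hdr, b) >= 0) {
        if (!(b->core.flag & (BAM_FUNMAP | BAM_FMUNMAP)) &&
            b->core.mtid >= 0 && b->core.mtid != b->core.tid &&
            b->core.qual >= 5)
            n++;
    }

    bam_destroy1(b);
    bam_hdr_destroy(hdr);
    sam_close(in);
    return n;
}

On a BAM file every field is decoded regardless; on CRAM the name, sequence, quality and aux blocks are never even decompressed, which is where the flagstat speed-up comes from.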

The implementation of this is complex.  It is up to the encoder to decide which data types go where, even permitting multiple data types to share the same block, creating data dependencies.  It also requires knowledge of the specific implementation code dependencies - e.g. I know my own code only decodes data X and data Y together.

The important point though is this is an implementation trick, permitted but not mandated by the file format structure.  CRAM does not dictate anything more than the bitstream format to enable decoding, leaving choice of encoding method up to the implementor as well as avoiding overly prescriptive rules for optimising decode.

I have described a method of implementing fast decoding, but there are ways we can also do fast writing.  More on the pros and cons of that later.