Wednesday, October 17, 2018

MPEG-G challenge, 1 week on.

Preface:  all my blogs are my own personal views and may not necessarily be the views of my employer, the Wellcome Sanger Institute.

 

Replies


It's been one week since I publicly asked for figures showing how MPEG-G performs on the data set for which they provided DeeZ and CRAM v2 figures in their paper.

Yesterday I had a reply from a GenomSys employee, and a couple of days before that from Mikel Hernaez (as you can see in the comments to the previous post).  Thank you both for taking the time to reply, and I want you to know that none of this is personal.

Sadly neither is willing or able to provide figures at this moment.    GenomSys for unspecified reasons other than "we will provide them according to the established workplan within MPEG".  Make of that rather unclear statement what you will.  Mikel because he's working on an open source implementation which isn't yet finished - fair enough, and a huge thank you for working on that.  I look forward to experimenting with it myself when available.  I do still genuinely believe the format has something to offer, were it not for the patents.

I was also told that the BAM figures from Marco's talk last year were estimates based on the outcomes of the Call for Proposals experiments, rather than from a fully formed MPEG-G file.  Maybe this is why no auxiliary data were shown in the bar graphs, although I am unsure whether auxiliary data are part of the specification anyway.  There is still a problem here though.  Stating how much better MPEG-G is than SAM / BAM in various press pieces ([1], [2], [3]) and then claiming it is not yet appropriate to do comparisons against CRAM, DeeZ, etc. just doesn't hold water.  Either it's ready for comparisons or it isn't.

I ended my last blog post by saying "indeed I am starting to wonder if there is even a fully working implementation out there yet".    While it may appear that this was bang on the money, note it is unclear whether GenomSys' demonstration at the GA4GH conference in Basel was a mock-up, or whether they are refusing to give numbers for some other reason (e.g. not wanting to tip off the "opposition" - other MPEG-G implementers - before the final acceptance of the standard).  Either way, this doesn't fit well with the ISO Code of Conduct[4]: "uphold the key principles of international standardization: consensus, transparency, openness, [...]".  If the performance of the MPEG-G format were any less open it'd be in a locked filing cabinet in a disused lavatory with a sign on the door saying "Beware of the leopard"!


My goals


So where to go from here?  As far as benchmarks go, wait, I guess.   However, some people have wondered what my desired outcomes were in starting this series of blog posts, so I'll elaborate a little.

  1. To educate people about the CRAM format.   I consistently hear claims about what CRAM cannot do, when the reality is that either the format simply works differently or the complaint is about inadequacies in our tool chain rather than the format itself.

    This is an ongoing project, but I think we're better off than before. 


  2. To ensure that as many people as possible are aware that there is more than just BAM vs MPEG-G.  Numerous press pieces have been making these comparisons without mention of the other technologies, some of which made it into MPEG-G.

    Decision makers need to be fully aware of the choices available.

  3. To inform about software patents: both my own personal views on them (and, I believe, those of the GA4GH, although I do not speak on their behalf), and also the existence of patents that directly cover both encoding and decoding of the MPEG-G standard.  *ALL* implementations will be covered by these patents, and we are not yet aware of the licensing terms.

    Again, decision makers need to be fully aware of the implications of their choices.  I have not yet observed any other press coverage of this issue, and I still cannot see any of the GenomSys patents registered in the ISO patent database[5].

    To date there have been over 25,000 views of the blog, so the messages in this and point 2 above are hopefully getting through.

  4. On the topic of patents, to research prior art and where appropriate block patent claims.  We are aware of numerous existing bioinformatics tools that may become unusable without obtaining licenses if these patents are granted.  These will be filed as prior art.

    The cost to the community of these patents isn't just in any royalty fees, but in hampered public research as the existing bioinformatics community will have to tiptoe around the minefield of confusingly worded patents, with the fear of having their work squashed by a patent licensing authority if they step too close.

    Fighting this is time consuming and thus costly, but it is ongoing work.  It's not something I can speak about yet though.

  5. To improve!

    One beneficial outcome of this whole unfortunate tangle is acknowledging our own weaknesses and correcting them.  For CRAM this means documentation, specification clarity, conformance tests, and generally more capable tools.

    This is a bit more long term, but it's good to have a lot of ideas to work on.

Conclusion


As a final thought, there are things in the MPEG-G standard that I like and that are an improvement over CRAM.  It was designed by a proficient group with several world-class researchers involved, so it is sad that it comes with strings attached that will ultimately hamper its uptake.  There are things I could suggest to improve it further, but I will not while it is a patented format.

Certainly it is the case that no MPEG-G decoder can be patent free, as it must be able to decode any valid MPEG-G file.  However, it remains to be seen whether an encoder could produce files using a subset of the MPEG-G specification that avoids the patents (e.g. by having just one data class).   If so, maybe something is salvageable, and I would encourage the open source implementation to consider this as an option.   I'm not a lawyer though, and the patents waffle on forever!  The more egregious claims will need squashing first, too.


References


[1] https://mpeg.chiariglione.org/sites/default/files/events/Mattavelli.pdf
[2] https://spectrum.ieee.org/ns/Blast/Sept18/09_Spectrum_2018_INT.pdf 
[3] https://www.biorxiv.org/content/early/2018/09/27/426353
[4] https://www.iso.org/publication/PUB100397.html
[5] https://www.iso.org/iso-standards-and-patents.html

Wednesday, October 10, 2018

MPEG-G: An open challenge

Preface:  all my blogs are my own personal views and may not necessarily be the views of my employer, the Wellcome Sanger Institute.


Several key people behind the MPEG-G specification submitted a preprint to bioRxiv covering the format.  The first version appeared on 27th September, with a follow-up on 8th October to address some of the comments.

Version 1


The first version of the MPEG-G paper is available at bioRxiv here: https://www.biorxiv.org/content/early/2018/09/27/426353  This is the version that has the most comments.

In it, we see no comparison to any other tools out there whatsoever, other than uncompressed SAM and BAM - not even a mention of their existence.   Furthermore, the MPEG-G figure was produced using a lossy compressor.  No reference was given for the data sets used, and no actual size values were shown - only a poor-quality graph.

There is no direct evidence that their lossy mode did not harm analysis, other than a referral to other papers that have demonstrated this.  We do not know whether their method is the same.  We cannot take things purely on trust.  Science is skeptical for a reason.

It got pretty much savaged in the comments, and quite rightly so.


Version 2


A new version has just been uploaded.   This second version of the paper initially seems like a significant improvement.  Now there are references to lots of other tools considered state of the art, and some benchmarks between them.  Unfortunately that is as far as it goes.  There are even fewer figures on MPEG-G performance (i.e. none).  The authors state that their aim is to describe the format and not the implementation, but really we need a demonstration of at least one implementation to even judge whether the format is worthy of further investigation.   They just refer to a "sense" of the capabilities it could have.  I don't want a "sense", I want evidence!
"Nevertheless, to give the reader a sense of the compression capabilities achievable by MPEG-G, next we show the compression performance of some of the technologies supported by the standard (and hence implementable in an encoder), and how they compare with current formats(2)."
Unless the format is simply a container that permits other tools to be wrapped inside it, I struggle to understand this statement, or the benefit of such a container format.  Even worse, MPEG-G is repeatedly described in terms of how it "could" be implemented using this method or that algorithm, but never what it actually "does":
"When it comes to aligned data, an MPEG-G encoder could use a compression method comparable to that of DeeZ [8], which is able to compress a 437 GB H. Sapiens SAM file to about 63 GB, as compared to 75 GB by CRAM (Scramble) or 106 GB by BAM [8]"
This simply doesn't cut the mustard!    I want to know what it "does", not what it "could do".  However, let us go with the assumption that right now MPEG-G is indeed comparable to DeeZ for compression ratios, as suggested.  Note that the figures quoted above are also old and do not represent the current performance of the tools cited (except for BAM).

So let's do some more digging.  That little footnote (2) at the end of the first quote indicates the figures come from unspecified cited publications: "The specific numbers are taken from the corresponding publications".

The paper in question is the DeeZ paper by Hach, Numanagic and Sahinalp.  Unfortunately it's not open access, although I can read it via work, and the supplementary text is also freely available.  The figures quoted in the 2nd version of the MPEG-G paper come from Table 1 of the DeeZ paper: 62,808 MB for DeeZ, 75,784 MB for Scramble (CRAM), 106,596 MB for BAM and 437,589 MB for SAM.  Furthermore, because the DeeZ paper is rigorously written and follows accepted notions of reproducible research, the supplementary data lists the exact file tested as ftp://ftp.sra.ebi.ac.uk/vol1/ERA207/ERA207860/bam/NA12878_S1.bam.

Given the age of the DeeZ paper, the figures shown there are for an earlier DeeZ version as well as a considerably weaker CRAM v2.1, rather than the current CRAM v3 standard.  (This isn't a complaint about the DeeZ paper, as they reviewed what was available at the time.)  I did my own analysis on this BAM file.


Format    Options               Size (GiB)  Notes
BAM       (default)             105.8       Orig file
BAM       -9                    103.0       Scramble
DeeZ      (default)              60.6       1 million reads random access granularity
CRAM2.1   (default)              74.0       Using Scramble 1.13.7 to match DeeZ paper
CRAM3     (default)              64.2       Scramble, 10k reads granularity
CRAM3     -7 -s 100000 -j        58.5       Scramble, 100k reads granularity + bzip2
CRAM3     -7 -s 100000 -j -Z     56.1       Scramble, 100k reads granularity + bzip2 + lzma
CRAM4     (default)              58.8       Scramble, 10k reads granularity
CRAM4     -7 -s 100000           53.1       Scramble, 100k reads granularity
CRAM4     -7 -s 100000 -j -J     52.9       Scramble, 100k reads granularity + bzip2 + libbsc

The above table shows sizes from DeeZ (in random-access mode, rather than the non-random-access -q2 method), BAM as downloaded and as generated using scramble -9, and CRAM using a variety of scramble (1.14.10) command line options and versions.  Note the version 2.1 file was created using the old 1.13.7 release to match the one cited in the DeeZ paper.  Also note CRAM 4 is highly experimental - it's a demonstration of what may become a standard, rather than what already is.  Nevertheless, it's not vapourware: it can be downloaded and run, so I include it as a demonstration of where we could go with CRAM if the community desires it.  Obviously there are big CPU time differences (not recorded, sorry), and I've also adjusted some of the CRAM "slice sizes" (granularity of random access) to demonstrate the range of sizes we get.  The smaller files are more suitable for archival than ongoing analysis.

Basically, we see CRAM can beat DeeZ quite considerably even using just the existing CRAM3 standard.   I therefore find it disheartening that the new revision of the MPEG-G paper contains outdated and therefore misleading conclusions.  The new codecs proposed for version 4 push CRAM to smaller files still, so we need to consider whether we want a completely new format or just an update to existing ones.  Obviously I'm biased on this point.

Analysis of the graph in the version 1 paper suggests this is likely the same file reported in the "Human WGS (high coverage)" figure.  Unfortunately the MPEG-G values shown there were produced with lossy compression, so no direct comparison can be made against DeeZ and/or CRAM.


Conclusion


The MPEG-G authors were already called out in the comments on their original paper for misrepresenting the current state of the art.   It appears they have now made the same mistake a second time by using outdated benchmarks.  Once may be an error, but twice is starting to look like a pattern.

Scientists are more savvy than this.  We work on open and reproducible research, not secrecy and promises of what something "could" do.   So far MPEG's own paper has stated it could be comparable to DeeZ, and I have compared CRAM to DeeZ, showing CRAM to be smaller.  One logical conclusion, therefore, is that MPEG-G can already be beaten by CRAM on file size.  I doubt this is actually true, as they have had some heavyweights working on it, but the paucity of evidence hardly instills confidence.  Indeed I am starting to wonder if there is even a fully working implementation out there yet.


An open challenge


So let's get right down to it.  I hereby openly challenge the MPEG-G community to provide their own figures, using the data set chosen by them in their paper.   Please email me your MPEG-G figures for the same NA12878_S1.bam file presented in the DeeZ paper that you quote, in GiB units.  I'd like to know the size, some measure of granularity (be it access unit size in MB, number of reads, or a similar metric), and what data is retained.  We can start with the easy lossless case.  (Lossy compression is much harder to evaluate, as it involves a complex set of adjustments and trade-offs.)

I will email this to the two authors with email addresses listed in the paper.

Tuesday, October 9, 2018

Reflections on the GA4GH conference

From 3rd to 5th October we had a large GA4GH conference, attended by around 400 people from across the world and across sectors (academia, pharma, manufacturers, medical, and so on).  It was an exciting meeting with lots of discussion, important face-to-face networking opportunities, and some genuinely moving and inspiring talks.

"Collaborate on the interfaces; compete on the implementations"


This quote was in Ewan Birney's opening address to the delegates, and I couldn't agree more with the philosophy.  "Interface" can mean multiple things, from the htsget and refget protocols to the underlying file formats. GA4GH welcomes and actively encourages the commercial sector to get involved, and many commercial members have been instrumental in setting new standards.  The file formats are our pre-competitive interchange.  They're the common language we speak, and it benefits everyone, both for-profit and non-profit organisations, if they are free for all to use.  This still leaves room for clever and optimised implementations, and no one raises objections to, say, a commercial FPGA implementation of BAM.


CRAM: format vs toolchain implementation


In this light, I also wish to discuss what CRAM can do as a format and what the implementations currently offer us, because sadly they differ.  We should not fall into the mistake of wishing for a new format when a new implementation will suffice.


Firstly, it is up to the encoder to decide which data goes where.  Earlier CRAM implementations put most aux tags into the same block, which improves compression in some cases and harms it in others.  Newer ones put each tag into its own block (which again is sometimes good and sometimes bad for compression). We have files that are smaller with samtools 1.3 than 1.9 and vice versa.  Clearly there is room for optimisation and improvement over time, while still adhering to the same CRAM bitstream format.  In this regard, CRAM shares the same methodology as MPEG-G.

Cram_filter


I covered the htslib REQUIRED_FIELDS parameter in an earlier blog post, but in summary it indicates to the CRAM decoder which fields we wish to retrieve, and in turn informs the decoder which fields it may choose to discard during reading.  However avoiding unnecessary decompression and decode is a rather naive solution.  There are times we wish to manipulate files at the block level, for fast streaming in the context of, for example, an htsget server.

A rather crude tool, cram_filter from io_lib, demonstrates this capability in a proof-of-concept fashion.  Currently cram_filter takes a range of slice numbers and just emits those slices using nothing more than seek, read and write calls.  It is *extremely* fast (hence appropriate for serving CRAM files via htsget).

We can also do the vertical component, termed data series in CRAM and descriptors in MPEG-G, but fundamentally both are simply blocks.  Cram_filter permits us to specify the tags we wish to include or to drop.  Sadly it currently only permits filtering of quality values and aux tags, mainly because I was lazy and the other fields are harder to do; however many combinations should be possible, including read names, mapping qualities and ancillary fields such as SAM PNEXT, RNEXT and TLEN.  In theory any block can be discarded on-the-fly provided there is no data or code-path dependency on it from a block we wish to keep.

However I also have to acknowledge the implementation is, frankly, amateurish.  (It's mine, so I'm not offending anyone else!)  It could trivially accept a chromosome / range and use the index instead of slice numbers, but it doesn't.  That's a tool-chain problem though, not a specification issue.  This capability also ought to be available natively within htslib / samtools, instead of relying on a third-party tool.  This fully comes under the "compete on the implementations" mantra.

The next step?


Filtering data, by row or column, is just a starting point.  What if we wish to add a column or modify a value?  Let us consider a hypothetical scenario: we read the quality values for each record and produce mean and standard deviation auxiliary tags per record.  From a data-processing viewpoint, our tool would decode the qualities, compute the statistics, and add two new tags, with all other fields being passed through from input file to output file.  In the example below we also remove the OQ:Z tag:



This needs a very different API to the one currently implemented in htslib.  Perhaps this should become htslib 2.0.  Fundamentally the current library is scuppered by the lack of a function-level API, operating instead at the data-structure level (where all data is available directly, not via query functions).  However, in principle this could be implemented as follows.

  • API call to fetch the next record
    • On first call, library loads the entire next slice and holds compressed blocks in memory.
    • Return link to a reference to a specific record from this slice - just a link, not decoded yet.
  • API call to obtain the quality values
    • On first call, we decompress the quality block.
    • Return appropriate pointer to decompressed quality values.
  •  Compute mean and standard deviation scores
  • API call to write mean and standard deviation aux. tags.
    • On first call for this slice, create the new blocks.
    • Append data to the uncompressed in-memory mean & sd blocks
  • API call to write the new record
    • Create a new output slice if necessary.
    • Add cram record, with link to record from input slice
    • When slice is full, store entire slice.
This last part is the cunning bit.  If we have 10,000 records in our output read name block, and all of those 10,000 records are simply links to 10,000 records in our corresponding input block, then we already have a copy of the compressed block.  Thus we can spot the opportunity to avoid an expensive decode and encode step.  We'll possibly need to recompress the quality block, unless we cached the compressed form, and obviously we'll need to compress the new mean and sd value blocks.
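
To make the shape of such an API a little more concrete, here is a purely hypothetical C sketch.  None of these cram2_* functions exist in htslib or io_lib today; the names and signatures are invented solely to illustrate the lazy-decode and pass-through idea described above.

    /* Hypothetical API sketch only: nothing called cram2_* exists anywhere yet. */
    #include <math.h>
    #include <stdint.h>

    typedef struct cram2_fd  cram2_fd;    /* file handle (imaginary)              */
    typedef struct cram2_rec cram2_rec;   /* lightweight link to a record         */

    /* Imaginary function-level API, declared here for readability.              */
    cram2_fd      *cram2_open(const char *fn, const char *mode);
    cram2_rec     *cram2_next_rec(cram2_fd *fd);             /* loads a slice lazily    */
    const uint8_t *cram2_get_qual(cram2_rec *r, int *len);   /* decodes qual block only */
    void           cram2_set_aux_f(cram2_rec *r, const char *tag, float val);
    void           cram2_del_aux(cram2_rec *r, const char *tag);
    int            cram2_write_rec(cram2_fd *fd, cram2_rec *r);
    int            cram2_close(cram2_fd *fd);

    int main(void) {
        cram2_fd *in  = cram2_open("in.cram",  "r");
        cram2_fd *out = cram2_open("out.cram", "w");
        cram2_rec *rec;

        while ((rec = cram2_next_rec(in))) {
            int len;
            const uint8_t *qual = cram2_get_qual(rec, &len);

            double sum = 0, sumsq = 0;
            for (int i = 0; i < len; i++) {
                sum   += qual[i];
                sumsq += (double)qual[i] * qual[i];
            }
            double mean = len ? sum / len : 0;
            double sd   = len ? sqrt(sumsq / len - mean * mean) : 0;

            cram2_set_aux_f(rec, "qm", (float)mean);  /* new blocks created on demand */
            cram2_set_aux_f(rec, "qs", (float)sd);
            cram2_del_aux(rec, "OQ");                 /* drop the original OQ:Z tag   */

            cram2_write_rec(out, rec);   /* untouched blocks copied through compressed */
        }
        cram2_close(out);
        cram2_close(in);
        return 0;
    }

In this model, cram2_write_rec can notice that every record in an output slice still points at the same input slice and so reuse the already-compressed blocks verbatim, which is exactly the short cut described above.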

There is no timeline for implementing this sadly, but this does not indicate that we need a new format simply to get a new library with a better API!  That would be a case of the tail wagging the dog.

CRAM 4 proposal

Just a quick one this time.

The slides for the talk I gave at the GA4GH 6th Plenary meeting in Basel, on 3rd October 2018 are now online at https://www.slideshare.net/JamesBonfield/cram4

This talk covers the CRAM v4.0 proposal, which could also become v3.1 if we restrict it to purely codec improvements.

Note that this hasn't actually changed much since the talk I gave to MPEG 3 years ago!  It's improved slightly, but back then it was a proposal, and right now it still is.   In the interim there simply wasn't any strong community desire to eke out another 10-20% saving given that it costs more CPU, but with NovaSeq we see we can gain at least 10% with no real CPU hit (more if we want to spend CPU, of course), and there are other issues such as extra-long chromosomes that require format updates anyway.

Saturday, October 6, 2018

CRAM structure

Having just got back from the super GA4GH conference (more on that later), it is clear to me that some people are still confused about how CRAM actually looks under the hood, so I'll take some time to illustrate it in more detail.

CRAM file structure


The basic CRAM structure is a magic number / version, a whole-file header (the SAM header), followed by a series of containers.  For position-sorted data, these containers are anchored to locations in the genome and, by way of an index, provide a means of random access.  (We can index unsorted / unmapped data too, permitting retrieval by container number.)  Each container contains a Compression Header block, indicating which type of data (Data Series) is stored in which block, and a series of Slices - typically we just use one slice per container, but the format permits many:



Each slice in turn has its own Header block and multiple Data Series blocks corresponding to specific types of data.  It is the container's Compression Header which dictates which CRAM data series goes into which block, permitting multiplexing of data series into a single block if it helps compression.  Usually it is best to separate them out though, as follows:


Typically there will be many more data series than the four I listed in the above diagram (read names, quality scores, BAM flags and an OQ:Z auxiliary tag).  For random access on chromosome and position sorted data, we work out which containers cover the regions in question.  We also get the option to specify which data series we wish to pull back.  For example:


You may have seen very similar diagrams recently elsewhere, such as in the MPEG-G bioRxiv paper, but rest assured it's not a new idea.  Their terms differ, but replace Slice with "Access Unit" and Data Series with "Descriptor" and you'll get something looking very similar.  I doubt CRAM was the first to implement such an obvious data structure either.
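
To make the hierarchy above a little more concrete, below is a deliberately simplified C sketch of the relationships between containers, slices and blocks.  The type and field names are purely illustrative; they are neither the structs used by htslib nor the exact layout mandated by the CRAM specification.

    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative only: a rough mental model, not the real on-disk layout.     */

    typedef struct {                  /* one block: a single compressed byte stream */
        int32_t  content_id;          /* which data series (or aux tag) it holds    */
        int32_t  method;              /* codec: raw, gzip, bzip2, lzma, rANS, ...   */
        size_t   size;
        uint8_t *data;                /* compressed bytes                           */
    } block_t;

    typedef struct {                  /* slice; roughly MPEG-G's "access unit"      */
        block_t  header;              /* slice header block                         */
        int      n_blocks;
        block_t *data_series;         /* one block per data series ("descriptor"),  */
                                      /* or several series multiplexed into one     */
    } slice_t;

    typedef struct {                  /* container: the unit of random access       */
        int32_t  ref_id, start, span; /* genomic anchor, for position-sorted data   */
        block_t  compression_header;  /* maps data series to blocks and codecs      */
        int      n_slices;
        slice_t *slices;              /* typically one, but many are permitted      */
    } container_t;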

Htslib "REQUIRED_FIELDS"


This block-based nature of CRAM means we don't have to decode everything if we do not wish to, permitting what MPEG terms "selective access".  In the htslib implementation this is achieved by specifying the SAM columns that the decoder is interested in (a CRAM-specific option), in addition to the more common region restriction using an iterator (available for SAM, BAM and CRAM).  This code example is from "samtools flagstat"; the effect can be seen by benchmarking that command on BAM and CRAM files.

hts_set_opt(fp, CRAM_OPT_REQUIRED_FIELDS, SAM_FLAG | SAM_MAPQ | SAM_RNEXT);

At the samtools command line level, this is equivalent to:

  samtools view --input-fmt-option required_fields=0x52 file.cram

(See the samtools man page for the hex column codes.)
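
For completeness, here is a minimal stand-alone sketch of the same trick driven directly through htslib's C API (as I understand it; error handling is pared back, and the field mask matches the flagstat example above):

    #include <stdio.h>
    #include <htslib/hts.h>
    #include <htslib/sam.h>

    int main(int argc, char **argv) {
        if (argc < 2) return 1;

        htsFile *fp = hts_open(argv[1], "r");
        if (!fp) return 1;

        /* Ask the CRAM decoder for FLAG, MAPQ and RNEXT only; blocks holding
         * sequence, qualities, names and tags can then be skipped entirely. */
        hts_set_opt(fp, CRAM_OPT_REQUIRED_FIELDS,
                    SAM_FLAG | SAM_MAPQ | SAM_RNEXT);

        bam_hdr_t *hdr = sam_hdr_read(fp);
        bam1_t *b = bam_init1();
        long n_paired = 0, n_total = 0;

        while (sam_read1(fp, hdr, b) >= 0) {
            n_total++;
            if (b->core.flag & BAM_FPAIRED)
                n_paired++;
        }
        printf("%ld paired out of %ld records\n", n_paired, n_total);

        bam_destroy1(b);
        bam_hdr_destroy(hdr);
        hts_close(fp);
        return 0;
    }

Compiling against htslib (e.g. gcc demo.c -lhts, with demo.c being whatever you call the file) and timing it with and without the hts_set_opt call should show the benefit on CRAM input; on BAM the option should simply have no effect.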

The implementation of this is complex.  It is up to the encoder to decide which data types go where, even permitting multiple data types to share the same block, creating data dependencies.  It also requires knowledge of the specific implementation code dependencies - e.g. I know my own code only decodes data X and data Y together.

The important point though is this is an implementation trick, permitted but not mandated by the file format structure.  CRAM does not dictate anything more than the bitstream format to enable decoding, leaving choice of encoding method up to the implementor as well as avoiding overly prescriptive rules for optimising decode.

I have described a method of implementing fast decoding, but there are ways we can also do fast writing.  More on the pros and cons of that later.

Friday, September 28, 2018

MPEG-G: the ugly


Preface:  all my blogs are my own personal views and may not necessarily be the views of my employer, the Wellcome Sanger Institute.

  

Commercialisation


I ended the last blog with the statement "history is resplendent with battles where a good PR campaign won the day".  I truly wish this wasn't a battle.  I engaged with MPEG-G from the moment I heard about it, submitting ideas to improve it, despite being instrumental in recent CRAM improvements.  I had hopes of a better format.

I bowed out after a while, making rather weak excuses about workloads.  However the honest reason I disengaged was the discovery of patent applications by people working on the format.  I wanted nothing to do with helping others make profits at the expense of the bioinformatics community.  I regret now that I helped make the format that little bit better.  I am guilty of being hopelessly naive.

I am not against commercialisation of software.   I use plenty of it daily.  Indeed I once worked on software that was sold on a semi-commercial basis, from which I personally benefited.

A commercial file format however is a different beast entirely.  It touches everything that interacts with those files.  Programs that read and write the format need to be licensed, adding a burden to software developers.

I'm also not against patents, when applied appropriately.  I can even see potential benefits to software patents, just, although the 25 year expiry is far too long in computing.  25 years ago the Intel Pentium had just come out, but I was still using an 8MB Intel 486 PC.   It seems ludicrous to think something invented back then would only just be opening up for others to use without having to pay royalties.   Holding a patent for that long in such a fast moving field is extreme - 5 to 10 years max seems more appropriate.

[Correction: a reader pointed out this should be 20 years.  Correct, and sorry for my mistake.  The point still makes sense, even if the exact dates are wrong.]

Anyway, I digress.   In my opinion commercialisation of software is nearly always best done by efficient and robust implementations, protected by copyright.

Patent applications


The first related one I became aware of was WO2016202918A1: Method for Compressing Genomic Data.  This describes a way to use sequence position plus alignment (a "CIGAR" field in SAM) to identify a previously observed sequence at that location and thus compress via redundancy removal.  This is a subset of the method I used in samcomp for the SequenceSqueeze contest and published 2 years earlier, where I used all prior sequences instead of just one.  I made prior art observations and the authors of the patent appear to have amended it slightly, but this is still an ongoing concern.

The same authors also later submitted a second patent involving lossy compression of quality values, by forming a sequence consensus along with an associated confidence and using that to dictate how much loss can be applied to the quality values.  This too treads on my earlier work, as it closely follows the Crumble program.

As far as I'm aware, however, neither of these two patents made it into the MPEG-G specification, but the authors were actively involved in the MPEG-G process, and this directly led to my withdrawal.


GenomSoft-SA and GenomSys-SA



 The same cannot be said however for this barrage of 12 patents from GenomSys:

Source: https://globaldossier.uspto.gov/#/result/publication/WO/2018071080/1

The patents all relate to how the sequence is encoded in the MPEG-G draft specification.   Many of these patents have similar problems with prior art, with some claims being sufficiently woolly that CRAM, among others, would infringe on them if granted.  Some are simply huge - over 100 pages of impenetrable lawyer speak and 80+ claims.  Fighting this has become an enormous and unwelcome time sink.

Crunchbase reports one GenomSys founder as Claudio Alberti:

Source: https://www.crunchbase.com/search/organizations/field/people/num_founded_organizations/claudio-alberti

MoneyHouse.ch reports the GenomSys board of directors as including Marco Mattavelli:

Source: https://www.moneyhouse.ch/en/company/genomsys-sa-15920162331

As reported earlier, both Alberti and Mattavelli are founders of MPEG-G:

Source: https://mpeg.chiariglione.org/standards/exploration/genome-compression/issues-genome-compression-and-storage
There is no proof that this was their intention all along, but I will leave the readers to draw their own conclusions.

As mentioned before, I was not formally accepted as a member of BSI and therefore not party to all the documents registered in the MPEG repository, only receiving communication via the ad-hoc group mailing list.  Therefore I do not know when the other members were notified of these patents (as is required under ISO regulations), but I was certainly taken by surprise and would not have engaged if I had known where it was headed.  As it happens, I consider it fortunate that I did not become a formal part of the process, as my lack of access to the document repository also means I am untainted by any confidential knowledge, which may help with any future prior art claims.

Licensing


Patents by themselves do not automatically mean royalties.  Many patents are obtained purely to protect against others attempting to patent similar things (as patent officials are considerably more likely to find prior art in a previous patent than they are in other literature).

A patent means we need a license to use that invention.  Licenses can be granted free of charge, but we have no idea of the intentions here.   To date none of this has been openly discussed, and most people I have spoken with have been blissfully unaware of the patent applications.  There have been some suggestions that there will be an open source reference implementation, but do not assume open source means free from royalties.  The two are orthogonal issues.

Implications


My concerns over these MPEG-G patents are many fold, but include:

  • Fear:  even if a program has clear prior art and a patent is unenforceable, the fear of potential legal action can dissuade people from using that program.  This may put people off using my work, including CRAM and Crumble.
  • Prevention of further research: a core public Idea can be extended within patent claims as Idea+X, Idea+Y and Idea+Z.  The original inventor of Idea can then become stuck trying to find a way to continue his/her own research, simply because they didn't publicly state up front which next avenues of research they were going to continue down.  Crumble vs the 2nd Voges patent is an example of this.

    To this end, I am now trying to publicly document as many ideas as I can, regardless of whether I have time to implement them yet; however, this does give ideas to my competitors.
  • Needless complexity: every program touching the patents now needs to deal with licensing.  If the patent is monetised, it makes every tool a commercial tool unless the author has money to throw away.  This simply will not work with most academic groups.
  • Time sink:  fighting patents takes a LOT of time.   Patents are deliberately written to be hard to read, and, much like builders applying for planning permission over and over again for the same plot in the hope that objectors will give up, many subtle variations on the same patent can be filed in the hope that one of them will get through.   I have spent weeks on this, as have others.  During that time I could have been doing other, more productive work, improving software elsewhere.  This costs us money and doesn't help the community as a whole.
  • Cost: an obvious one.  We do not yet know what the cost will be or how it will be applied.  It may be free, it could be a one-off charge when first creating the file, or it could be a charge every time we read / write data in the file format.  Assuming a non-zero cost, this will increase the price of research and medicine.  We need to know whether we would be spending that money wisely.  What do we get for it that we didn't have before?

 A better way?


Leonardo Chiariglione, chairman of MPEG, describes the MPEG model as this:

"Patent holders who allow use of their patents get hefty royalties with which they can develop new technologies for the next generation of MPEG standards. A virtuous cycle everybody benefits from."

It sounds sensible on the face of it, and probably is when the field is already covered by a myriad of patents.  He has concerns that free alternatives for video coding are now arriving, such as the Alliance for Open Media, and views this as damaging both to MPEG and to the very idea of progress:

"At long last everybody realises that the old MPEG business model is now broke, all the investments (collectively hundreds of millions USD) made by the industry for the new video codec will go up in smoke and AOM’s royalty free model will spread to other business segments as well.

[...]

AOM will certainly give much needed stability to the video codec market but this will come at the cost of reduced if not entirely halted technical progress. There will simply be no incentive for companies to develop new video compression technologies, at very significant cost because of the sophistication of the field, knowing that their assets will be thankfully – and nothing more – accepted and used by AOM in their video codecs."
I believe however that he is mistaken.   There are clear non-royalty-based incentives for large companies to develop new compression algorithms and drive the industry forward.  Both Google and Facebook have active data compression teams, led by some of the world's top experts in the field.   It is easy to see why: bandwidth costs money, and lots of it.  Google's Brotli algorithm is now supported by all the major browsers, and Facebook's Zstd is positioning itself as a replacement for the legacy Zlib library.

There are also more enlightened approaches, where like-minded companies pay for research into "pre-competitive" areas which benefit all but are not, by themselves, considered a commercial edge.  File formats absolutely fit the bill.  Indeed one such organisation, the Pistoia Alliance, organised the Sequence Squeeze contest, which led to so many innovations in genomics data compression.

Conclusion


"Patents + extensive high profile marketing + lack of evidence for benefit" jointly add up to something tasting and smelling rather unpleasant.   The situation can, however, be rescued with appropriate licensing, (assuming a benefit exists).

I urge the community now - do not adopt MPEG-G without a cast iron assurance that no royalties will be applied, now and forever.

Thursday, September 27, 2018

MPEG-G: the bad

Preface:  all my blogs are my own personal views and may not necessarily be the views of my employer, the Wellcome Sanger Institute.

OK I'll come clean - having read this title you'll probably have realised that the previous post could have been called "MPEG-G: the good".  It's not all smelling of roses.  I covered the history and process before, so now I'll look at what, to my mind, isn't so good.

Transparency


MPEG-G has been bubbling along in the background.  It's hard to gain much knowledge of what's being discussed as it's all behind the ISO machinery, which seems designed to lock in as much information as possible unless you're one of the club.   Given the history of intellectual property surrounding video coding, I can understand why this is their usual policy, but I don't feel it is an appropriate fit for a relatively open field such as bioinformatics.

I know because I almost got there, but was put off by the BSI IST/37 committee bureaucracy and the workload.  As it happened, I think that was beneficial (more on that to come), but it does mean that even an interested party like myself could see absolutely nothing of what was going on behind closed doors.    I wasn't even aware for some time that one of my own submissions to the Core Experiments had been adopted by the standard.    Yes, I know I could have gone through all the bureaucracy to get there, but that's still going to be a very self-limiting set of people.   My strong view here is that if you *really* want the best format known to mankind, then be as public about it as you conceivably can.

MPEG-G is a complex specification.  In fact it's huge.  Partly this is because it's very precisely documented down to every tiny nut and bolt (and I have to confess, this puts CRAM to shame), but it's also got a lot going on.  Is that necessary?   I'd like to be able to see the evaluations, of which I am sure there were many, that demonstrate why the format is as it is.  These are key to implementing good encoders, but they would also give me confidence that the format is appropriate and doesn't contain pointless bloat.  (Again, this is something less than ideal with CRAM too.)

Risk vs Reward


A shiny new format is all well and good if the gains to be had are worth it.  Switching format is a costly exercise.  CRAM started off around 30% smaller than BAM and has improved to around a 50% saving now.  With CRAM v4 it may even be a 60-70% saving for NovaSeq data.  That's clearly worth it.  What percentage reduction over CRAM is worth switching again? 10%? 20%?  Maybe, maybe not, depending on how hard it is to implement and what else we gain from it.

Unfortunately MPEG-G is a large unknown right now, not only in size vs CRAM as all comparisons I've seen to date are vs BAM, but also in how any savings are achieved: size isn't everything.  Will it be similar speed, with similar granularity of random access?  So far nothing I've seen shows it is a significant advance over CRAM (eyeballing vs BAM ratios).

There are potentially new features (rewards) too in MPEG-G, but it's too early to know how beneficial any of these are.  For example it has the ability to append new data without rewriting the file.  This is simply analogous to on-the-fly file merging which we use already and have been for a decade.  It's a nice feature, but not one we couldn't live without.  There are also options to retrieve sequences that only contain, say, X differences to the reference.  Is this useful?  I really don't know, but I'd need some convincing test cases.

Reproducibility


That neatly brings me on to evaluation and reproducibility of research.  MPEG-G make a few claims vs BAM, but to compare vs CRAM ourselves we need access to the software, or failing that to some common data sets and more information on speeds and access granularity.

In July 2017 Marco Mattavelli gave a talk including some MPEG figures, among them BAM and MPEG-G file sizes for ERR174324 chromosome 11 - a public data set.  Unfortunately in the ENA archive it is only available in FASTQ format.  How was the BAM produced?  I've tried using "bwa mem", but the resulting file size simply isn't a good match.

 Source: https://mpeg.chiariglione.org/sites/default/files/events/Mattavelli.pdf

The same talk also gives figures for NA12878_S1 - one of the Illumina Platinum Genomes BAMs.  This one is released in BAM format; however, the BAM file size in the talk is around a third larger than the one released by Illumina.    There are more recent slides from Marco, but they don't reference which data sets were used.  We need more than this.  What are the timings for encode and decode?   How many seeks are needed, and how much data is returned, for a small region query?

I'm not necessarily expecting a PowerPoint presentation to be the source of the data I want - this is normally the sort of thing relegated to a Supplementary Material section of a paper - but it should be somewhere and as far as I can tell it isn't.  I sent an email to Marco on Wed 19th September and am awaiting a reply:
(From me:) "I'd like to be able to see how CRAM compares against MPEG-G, and for a fair comparison it needs identical input BAMs.  Are there any recent public data sets with MPEG-G figures available?"
The best I can do so far is a side-by-side comparison against CRAM 3 and the CRAM 4 prototype using my bwa mem generated BAM for ERR174324: break it down by approximate size of sequence, quality, name and aux data (by removing them from the BAM one at a time and observing the impact), and then scale my charts to look the same.   This gives us compression ratios, although we cannot draw any hard conclusions on absolute file sizes:


The PR machinery


No one wants to shout "Adopt our format - it's marginally better than the current ones".  It just doesn't work as a slogan!   So it's hardly surprising that this is glossed over in the talks given, but it goes beyond that.   The talks and articles I've seen perpetuate the myth that there has never been anything better out there:

 Source: https://mpeg.chiariglione.org/sites/default/files/files/standards/parts/docs/w17514.zip

We know MPEG are aware there are better formats.   They collaborated with authors of more advanced tools during the early days while learning the field, have cited them in online publications, and even invited them to MPEG conferences to talk about them.  Yet sadly this is all lost in the goal for a nice soft target to compare against.

We need proper comparisons against other well-established formats, or even the not-so-well-established ones if they look like a better choice to switch to.  The very absence of such comparisons is suspicious.  I do still think MPEG-G has something to offer, but convince me, because I'm not a man of faith!

It also extends past the immediate ISO/MPEG ecosystem and into related standards bodies such as the IEEE.   An article in IEEE Spectrum with the rather condescending title of "THE QUEST TO SAVE GENOMICS" states:
"Bioinformatics experts already use standard compression tools like gzip to shrink the size of a file by up to a factor of 20. Some researchers also use more specialized compression tools that are optimized for genomic data, but none of these tools have seen wide adoption."
This is why I added "Myth 4: CRAM has not seen wide adoption" to my earlier blog post.  CRAM has gone beyond research and into production.  I wouldn't describe over 1 million CRAM files deposited in the EBI public archives, and use by multiple major sequencing centres across the globe, as a lack of wide adoption.

MPEG - you can do better than this.  Give us some hard numbers to work with and please stop painting the community as being in the stone age as it's simply not true.  We don't need "saving", we need proper open collaboration instead.

Taking it up a notch


Worryingly it goes further than simple press releases.  I have recently become aware that, despite the lack of any official figures and public benchmarking, MPEG-G is being recommended to the European Commission eHealth project.  A report on Data Driven Healthcare explicitly mentions MPEG-G as a likely format for standardisation across Europe:

"The European Health Record Recommendation will also talk about content-related standards for documents, images, complex datasets, and multimedia recordings. Viola admitted that there will be discussions around these standards, but he also said that the Commission is not planning to reinvent anything that already exists.
[...]
A good candidate, according to Viola, is the MPEG-G standard, a genomic information representation that is currently being developed by the Moving Picture Experts Group (MPEG). MPEG-G aims to bring the terabyte datasets of whole-genome sequencing down to a few gigabytes. This would allow to them to be stored on a smartphone."

Note that "Not planning to reinvent anything that already exists" statement.  Roberto Viola (Director General for Communications Networks, Content and Technology) is listed elsewhere in the above report as the main driving force behind the eHealth Network project.  We can only speculate on whether Viola is aware of the full range of existing technologies.

I'm not anti-progress, but I am a scientist and that means evidence - before you start the marketing.

This is dangerous ground and history is resplendent with battles where a good PR campaign won the day. 

Addendum:

During preparation of this blog post, someone pointed me to an MPEG-G paper on bioRxiv.  It's good to have a better description of the format, fitting somewhere between a few simple slides and the complexity of the full-blown specification.  However, again it appears to be a missed opportunity in terms of the compression performance section.  The chart shows lossy MPEG-G only and gives no direct evidence regarding the impact this has on variant calling.  It does not compare against formats other than SAM or BAM, or even acknowledge their existence.  If I were a reviewer I would be requesting something that others can reproduce, using public datasets, and a comparison against the current state of the art.  If the reader wishes to move away from BAM, they need to know which format to move to.  This paper does not demonstrate that.

Tuesday, September 25, 2018

"ISO/IEC JTC 1/SC 29/WG 11"

Preface:  all my blogs are my own personal views and may not necessarily be the views of my employer, the Wellcome Sanger Institute.

Who?


You'll probably know ISO/IEC JTC 1/SC 29/WG 11 as the Moving Picture Experts Group, or MPEG for short.

The group formed out of a requirement for interoperability between the burgeoning multimedia formats and devices, and has been chaired throughout its history by Leonardo Chiariglione.  MPEG's aims were eminently sensible: put together the best formats they can, irrespective of who owns which ideas, and come up with a format better than the sum of its parts.  Given the companies wanted to monetise their intellectual property, this was done via a central pot (the FRAND system), divvied up to those due their slice of the MPEG pie.  The net effect is a win/win for everyone, both consumer and producer.

This blog will cover some history of how MPEG got to grips with genomic data formats, up to the current publication of "MPEG-G".

MPEG discovers genomics


It appears to be back in July 2014, during a meeting in Japan, that MPEG formally acknowledged an ad-hoc group (AHG) to start investigations into the field of genomic data compression.


This document shows the first tentative steps to understanding the field, along with the expected gaps in knowledge: (Emboldening theirs.)
"Such challenges imply the use/existence of an appropriate open standard for both compression and storage, [...] but that currently does not exist".
It doesn't take long to get a basic understanding, and by the next MPEG meeting in Strasbourg the team had grown and learnt far more about the field of genomic data compression, mentioning standard formats including FASTQ, SAM, BAM and CRAM and ego-pleasingly citing and including figures from my own Sequence Squeeze paper too.


Enter MPEG, stage left


Before the following meeting (Feb 2015) they reached out to authors of existing tools and formats, and the process gathered momentum.   In January 2015 Vadim Zalunin, author of the original CRAM implementation at the EBI, had a call from MPEG.  He relayed this to me shortly after:
"MPEG community has contacted us, they are reviewing CRAM and if it has the potential to become a standard format they would want to support.  If you are interested please subscribe to <redacted>."
This was rather a bolt out of the blue to me, having heard none of the above yet.  The video coding experts are looking at genomic data?  Well, I guess they have a lot of data compression expertise, so why not?

I joined them.  Vadim had already subscribed to their "reflector" (ISO speak for a mailing list), as had a number of other authors of related tools such as DeeZ,  SCALCE and GeneCodeq, and numerous data compression experts in the video coding world.  From one of my first postings of 7th Jan 2015, when I was just working out what all this entailed:
"I am eager to know what the MPEG team has to offer though.  For sure I think there is a rich vein around lossy quality encoding, maybe applying rate distortion theory and other bounded loss methods.  There have been a number of papers published on lossy encodings, but they've found it hard to gain traction so far.  I think perhaps there is a lack of a gold-standard method of evaluating the impact on lossy quality values."
We had lots of really technical discussions on various formats, their strengths and weaknesses, and how to improve.  Even the main man at MPEG, Leonardo Chiariglione, occasionally took part in discussions.   We also had a series of conference calls, and produced a set of test data and results of running various already existing tools, which subsequently formed the basis of a Nature Methods paper, of which I am a co-author: a survey of existing genomics compression formats.

So far my involvement was purely conference calls, emails and a lot of benchmarking, but in July 2015 Claudio Alberti invited me to the next MPEG conference:
"MPEG will organize a Workshop/seminar on Genome compression on 20th October 2015 in Geneva and we'd like to invite you as keynote speaker on genome compression issues/tools/technologies."
I spoke at this meeting, presenting CRAM both as it stood then and potential improvements to the codecs based on my earlier fqzcomp (Sequence Squeeze) work.  The talk slides can be seen here:  https://www.slideshare.net/JamesBonfield/cram-116143348

Formal processes


Once the field had been surveyed and a few explorations made, the next step of MPEG-G (as it later became known) was to construct a series of "Core Experiments" which related to the specific tasks at hand, and a public "call for contributions".   These included compression of sequences, identifiers, quality values (both lossless and lossy), auxiliary tags, as well as more nebulous things.

ISO/MPEG procedures are complicated!  ISO do not accept comments and submissions from individuals, only from member nations or other recognised liaison committees.  Each nation in turn has its own standards body (BSI for the UK), and people affiliated with an appropriate technical committee (e.g. BSI's IST/37) are permitted to take part in ISO discussions on behalf of the national standards body.  This was all very bureaucratic, and they seemed to want me to turn up to meetings in London and do all sorts of other duties.  I opted out of this complexity.

However I was still able to contribute via Claudio Alberti, who kindly submitted my work on my behalf.  These submissions consisted of the various experimental CRAM codecs I had presented at the October 2015 meeting in Geneva.  However I wasn't present at the subsequent meetings (without standards accreditation, you can only attend if formally invited), nor was I able to see anyone else's submissions in the internal MPEG document archive.   The workload was getting quite high and I took this as an opportunity to bow out.  I later heard one of my submissions (identifier compression) was adopted as part of the MPEG-G specification.

The MPEG-G standard itself


Having bowed out of regular communication and meetings, I am hazy about the process from then on, apart from the occasional public document.  The most relevant of these is the MPEG-G standard itself, now designated the snappy title of ISO/IEC 23092-2 CD, all 250+ pages of it!  The MPEG site also has more documents than just the file format, with other topics covering transport / streaming facilities (similar to htsget) and APIs.

I am uncertain on the timeline, but it is probable the standard will be finalised by the end of this year or early next year, with a reference implementation available soon.  (I do not know whether this would include the encoder or just a decoder.)

You're probably curious, like me, as to how it compares to other existing technologies in compression ratios and encode / decode times.  I don't think we have long to wait, but right now information appears to be sketchy, with only rather vague numbers published.

Saturday, September 22, 2018

Dispelling the CRAM myths

This is a blog about the CRAM file format for storing DNA sequencing data.

There is a lot of misinformation about CRAM out there.  I'll be charitable and say it's accidental, but even accidental wrongs sometimes need correction.  Some of it also comes down to lack of decent tools rather than limitations of the file format, so consider this also to be my TO-DO list!

Myth 1: CRAM is reference based only

By default, the main implementations of CRAM encode sequence as a difference to the reference genome; however, this is not mandatory.  Scramble has -x (referenceless) and -e (embedded reference) modes.  These can be enabled in htslib too, via samtools.  For example:

    samtools view -O cram,embed_ref in.bam -o out.cram
    samtools view -O cram,no_ref in.bam -o out.cram
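
These options can also be set programmatically when driving htslib directly rather than via the samtools front end.  A minimal sketch (assuming htslib's hts_open / hts_set_opt API with the CRAM_OPT_NO_REF option; error handling omitted for brevity):

    #include <htslib/hts.h>
    #include <htslib/sam.h>

    int main(void) {
        htsFile *in  = hts_open("in.bam", "r");
        htsFile *out = hts_open("out.cram", "wc");   /* "wc" = write CRAM */

        /* Referenceless compression; CRAM_OPT_EMBED_REF would embed one instead. */
        hts_set_opt(out, CRAM_OPT_NO_REF, 1);

        bam_hdr_t *hdr = sam_hdr_read(in);
        sam_hdr_write(out, hdr);

        bam1_t *b = bam_init1();
        while (sam_read1(in, hdr, b) >= 0)
            sam_write1(out, hdr, b);                 /* straight copy of each record */

        bam_destroy1(b);
        bam_hdr_destroy(hdr);
        hts_close(out);
        hts_close(in);
        return 0;
    }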

Where CRAM suffers though is the tool chain.  This isn't a fault of the format, but of our implementations and usage.

Embedded references are ideal for assemblies where the reference is a one-off, but the tool chain forgets this if you pipe in.cram to out.cram - you have to specify embedding the reference again, which is tricky as you don't even have the reference to embed!  Ideally we'd simply replace embed_ref with a generate_ref mode and have it done on-the-fly by building a consensus as it goes.  If the input started this way, then do it on output too.  Furthermore, it ought to do this automatically if the sequence differs from the reference too much.  I'd even go as far as to suggest that generating and embedding the consensus may be the optimal solution for deep alignments to the human genome, and it avoids a whole host of reference shenanigans.

"No_ref" mode is ideal for data that isn't position sorted, as doing reference based compression puts a heavy burden on the I/O or memory bandwidth if you're jumping all over the place in the entire genome.  Again, this ought to be completely automatic and transparent.

Myth 2: CRAM is for aligned data only

CRAM can quite happily store unaligned data.  Indeed at the Sanger Institute we use unaligned CRAM instead of FASTQ as it's smaller and it's also one less format to contend with.  (We can convert back on-the-fly for tools that need it.)

It's even possible to index an unaligned CRAM, although retrieval again suffers due to the lack of a good tool chain.  The cram_filter tool from io_lib  permits extraction of specific CRAM slices based on their numerical IDs if the file has been indexed.  Eg:

    cram_filter -n 1000-1100 in.cram out.cram

You can specify - or /dev/stdout for the output to permit streaming.  This permits subsets of FASTQ to be generated with a little bit of tooling and so works better in a map/reduce style manner than streaming the entire file and dispatching commands, but again the tool chain makes this clunky and could do with tidying up.

Myth 3: CRAM is not lossless

I think this stems from the original Cramtools code having lossy compression enabled by default.  CRAM can do lossy compression, but Samtools, Scramble and Picard are lossless by default.

Malformed data and a few very quirky corner cases aside, CRAM is lossless in that the same information will come back.

Technically there are a few legitimate things which CRAM cannot store, such as distinguishing between the "=" / "X" CIGAR operators and "M", and knowledge of whether the read had an NM/MD tag or not (it'll just regenerate these), but everything else which doesn't survive the round-trip is either broken data or information which the SAM specification states has no meaning anyway, e.g. CIGAR fields for unmapped data.  By and large, it's lossless enough not to matter in practice.  It's also worth noting that even BAM isn't perfectly lossless if we really want to get down to it.  SAM -> BAM -> SAM can change some fields, such as uppercasing sequence and, for Picard, changing some auxiliary types (H to B).

Myth 4: CRAM has not seen wide adoption

CRAM is sometimes viewed as niche, with poor take-up.  While it's true it had a slow start, it is now widespread and in active use across the world.

More alignment files have been deposited in the EBI's ENA and EGA archives in CRAM than in BAM, by a factor of around 2:1 - that's well over a million CRAM files on one site alone.  The Broad Institute has also adopted CRAM widely, more on a per-project basis than as an overall strategy, but with more projects switching over as the space savings are so large.

The code base has been downloaded maybe a million times - conda shows 300k and github release counts are around 150k, but htslib itself has also been embedded into 100s of other github projects (not counting all those which use submodules).  Obviously most of these downloads will not be specifically for CRAM, but if you use htslib you have the ability to use CRAM.  And because many programming languages have bindings to htslib, this also means CRAM support in many languages.  Similarly for htsjdk.



Myth 5: CRAM is of size "X"

There are numerous comparisons of CRAM to <insert-format-here> that gloss over a huge wealth of detail.  Point 4 above already details one way in which CRAM files can differ in size - by implementation.  Even within the same implementation there are different levels of compression.  Like gzip -1 to gzip -9, scramble has -1 to -9 arguments to favour lighter or heavier compression.  By default neither scramble nor samtools use the bzip2 and lzma codecs, but they can be enabled for higher compression.  Increasing the number of sequences per slice also improves the compression ratio, at the cost of coarser-grained random access.  For example:

    samtools view \
        -O cram,use_lzma,use_bzip2,level=9,seqs_per_slice=100000 \
        in.bam -o out.cram

Note: within Sanger some groups reduce the number of sequences per slice (the default is 10,000) in order to permit finer-grained random access.
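A minimal sketch of what that looks like (the slice size here is purely illustrative):

    samtools view -O cram,seqs_per_slice=1000 in.bam -o out.cram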

All of this means that if you're quoting a CRAM size for comparison with another tool, please also quote the software used, its version, and the command-line options.  See the benchmarks on htslib.org for some, rather outdated, examples of how the size and speed trade-offs work.

To further illustrate the point above about there being no one single size, and also to demonstrate that CRAM is actively being worked on, here we show a prototype of CRAM 4 with more compression codecs on a set of NovaSeq alignments.  The time axis is log-scale, so note that it costs to get that little bit more saved.  Also note that the level of random-access granularity differs between each point on this plot, with some (the left-most DeeZ and Quip points) having no random access at all.

 (The faded points represent the encode times; the main chart is showing decode.)

Myth 6: CRAM doesn't support selective access

This claim, I assume, relates to selection of specific types of data (SAM columns) rather than region queries, as the latter obviously works.  Note, though, that it's possible to use cram_filter to extract regions of CRAMs with zero transcoding (again using the -n option I mentioned before, although it could really do with a saner interface that permits e.g. X:10000000-20000000).  This, along with the BAM equivalent, is the backbone of the efficient htsget protocol for streaming NGS data.

Selection of specific columns within CRAM may or may not be possible, depending on how the CRAM was encoded and on the type of data field you're trying to select.  It also depends on a decent tool chain.  Remember I said that the CRAM format gives the encoder complete freedom to write data in whichever blocks it likes?  If the encoder puts all the data in one block then there is no way to select only some types.  For some time now the major CRAM implementations have all been putting each type of data into its own block, to facilitate ease of decoding.  This means we can get certain types of data out much faster.

An example of this is samtools flagstat.  Try it on a BAM vs a CRAM file and you'll see it's much faster on CRAM, as it selectively skips decoding of the fields it doesn't need, making file reading much quicker.  Similarly, indexing a CRAM is an order of magnitude faster than indexing a BAM.
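A quick way to see this for yourself (absolute timings will obviously depend on your data and hardware):

    time samtools flagstat in.bam
    time samtools flagstat in.cram   # only decodes the fields flagstat needs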

In some cases it's possible to do the same for writing too.  For example, if we wish to drop one single tag, say the large OQ field, then we don't want to decode sequence, name, quality and more only to immediately re-encode them again.  We'd be better off simply copying the compressed data over, bar the block containing the OQ field and the meta-data that indicates its presence.  This isn't possible within samtools / htslib, but the cram_filter tool demonstrates some of it in the -t option, which permits dropping of specific auxiliary tags.  It can do this with little more than a read/write loop, which even with the effort to adjust the tag list and meta-data blocks comes out at least 50x faster than a traditional transcoding loop:

    cram_filter -t OQ  in.cram out.cram

If we can do it with removal, in theory we can do it with addition too, but this exposes an issue with our tool chain: again this isn't a CRAM file format woe, but a coding problem.

Say we wish to add an auxiliary tag stating the average quality value; we'd only need to decode the quality from the CRAM file.  We know what to do - decode the quality block, add a new tag, update the various CRAM meta-data blocks, and do read/write loops on all the other blocks.  Sadly we simply haven't got a good API for block-level manipulations like this.  (Of course this also doesn't work too well if we want to strip out entire reads, but we could simply mark them as unmapped and leave them in situ.)

Myth 7: CRAM specifies how to encode, and so cannot improve

This is a peculiar one, but I've seen it written that CRAM suffers as it leaves no room for efficient implementations, due to dictating the encoding process instead of the decoding process.

That is simply not true.

The CRAM format specifies the file layout and the decoding method, but the encoder has complete freedom in how it breaks the data up.  It can decide which data types to put in which blocks and with which codec.  It also permits storing multiple data series in the same block if there is strong correlation between values and a mixture would lead to better compression ratios.  In the early days we gradually learned how to better encode CRAM 2.0 files, which led to Scramble producing smaller files than the Cramtools of the same era.  (Cramtools implemented the same ideas so it's now comparable again.)

Myth 8: CRAM was invented by me

Finally, it embarrasses me to even have to put this here, but I've seen it said before so I'll address it.  No, I didn't invent CRAM!  I just helped continue it along the path.


The initial ideas came from the EBI in Markus Fritz's paper Efficient storage of high throughput DNA sequencing data using reference-based compression, which doesn't give it the CRAM moniker, and were subsequently implemented in Java by the EBI's Vadim Zalunin as the cramtools package.  My involvement came later, implementing it in C.  In doing so I found a number of discrepancies between the specification and the Java implementation, which led to CRAM 2.0 being released (largely to get the spec and code in line with each other).  I was also the main instigator behind CRAM 3.0, with its newer codecs.

Interestingly, the original paper had ideas which were never implemented, perhaps due to speed concerns.  One idea was to assemble the unmapped data to find any deep contigs and use these as fake references to improve compression.  Coupled with embedded references this would work well.  With fast tools like miniasm it may now be possible without too much of a speed problem.
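Purely as a sketch of how such a pipeline might look today (the tool choices and options are illustrative only, and miniasm assumes reasonably long reads):

    # Pull out the unmapped reads, then overlap and assemble them
    samtools fastq -f 4 in.cram > unmapped.fastq
    minimap2 -x ava-ont unmapped.fastq unmapped.fastq > overlaps.paf
    miniasm -f unmapped.fastq overlaps.paf > contigs.gfa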

Final words

Is it perfect?  By no means!  The specification could do with improving and I'd have designed a few things differently, but by and large it's been highly successful and my hat is off to the EBI people who designed and implemented it.  You should be proud of your achievements and I'm pleased to have helped it along its path.

Introduction

I've never had a blog before, but events over the last year or so have inspired me to have an area to write in that is more permanent and easier to find than simple facebook ramblings.  This will primarily be about my adventures and explorations in Big Data, most notably my work on compression of genomic data, but I won't discount other forays into undiscovered (by me) lands.

I've been a fan of data compression since my early university days, with the first compression tool I wrote being a Huffman entropy encoder and decoder in 6502 assembly for the BBC Micro.  Fun days.  It worked, but was hardly earth shattering.  It sparked a geeky passion which has stayed with me ever since.

In my professional life I've mostly been doing work in and around Bioinformatics, but occasionally this gave me room to exercise my love of compression.  It started with DNA sequencing chromatograms ("traces") and the ZTR format, and much later moved on to experimental and unpublished FASTQ compressors, which subsequently led to taking part in the Sequence Squeeze contest (and somehow winning it, which paid for several Sanger Informatics punting trips among other things).  The contest led me to write a paper with Matt Mahoney, who I have to confess has been a long-time hero of mine due to his fantastic work on the PAQ compressors and running sites like the Large Text Compression Benchmark, so it was an honour to collaborate with him.

Off the back of the Sequence Squeeze contest, I was asked to work on writing a C implementation of the CRAM format, an alternative to the BAM format for storing next generation DNA sequence alignments.   I didn't know whether to be pleased or not, but my "Great Grandboss" (how else do people refer to their boss's boss's boss?) said words to the effect of "We're looking for someone to do a quick and dirty C implementation of CRAM, and I immediately thought of you".  Umm. :-)

It's with CRAM where my blogs will start proper.

As to why it is important, take a look at this well-publicised image from https://www.genome.gov/sequencingcostsdata/.


When the first Human Genome was published, it was a truly international effort that took over a decade to complete and cost nearly $3 billion dollars, which included a lot of learning and development of techniques.  By the end of it, the cost of doing it again was already under 1/10th of that, and now we're well under $1000 for the same thing and completed in a couple of days (along with numerous other genomes on the same instrument at the same time).  Because of this though we now do other experiments: we sequence cancers and "normals" (non-cancer) in 100s of patients in order to work out what genetic differences lead to the cancer, we're even sequencing all tissues in the human body to produce a Human Cell Atlas.  In short, as the cost went down, the amount of data went up.  Up by a HUGE amount, far more than Moore's Law for computing and Kryder's Law for storage.   Ultimately data compression means nothing against a disparity in exponential growths, but it's a start and we'd be foolish to not exploit it.  A good reason to scratch an itch.