Data Geekdom: September 2018

Friday, September 28, 2018

MPEG-G: the ugly

Preface: all my blogs are my own personal views and may not necessarily be the views of my employer, the Wellcome Sanger Institute.

Commercialisation

I ended the last blog with the statement "history is resplendent with battles where a good PR campaign won the day". I truly wish this wasn't a battle. I engaged with MPEG-G from the moment I heard about it, submitting ideas to improve it, despite being instrumental in recent CRAM improvements. I had hopes of a better format.

I bowed out after a while, making rather weak excuses about work loads. However the honest reason I disengaged was due to the discovery of patent applications by people working on the format. I wanted nothing to do with helping others making profits, at the expense of the bioinformatics community. I regret now that I helped make the format that little bit better. I am guilty of being hopelessly naive.

I am not against commercialisation of software.   I use plenty of it daily. Indeed I once worked on software that was sold on a semi-commercial basis, from which I personally benefited.

A commercial file format however is a different beast entirely. It touches everything that interacts with those files. Programs that read and write the format need to be licensed, adding a burden to software developers

I'm also not against patents, when applied appropriately. I can even see potential benefits to software patents, just, although the 25 year expiry is far too long in computing. 25 years ago the Intel Pentium had just come out, but I was still using an 8MB Intel 486 PC. It seems ludicrous to think something invented back then would only just be opening up for others to use without having to pay royalties.   Holding a patent for that long in such a fast moving field is extreme - 5 to 10 years max seems more appropriate.

[Correction: a reader pointer out this should be 20 years. Correct and sorry for my mistake. The point still makes sense, even if the exact dates are wrong.]

Anyway, I digress.   In my opinion commercialisation of software is nearly always best done by efficient and robust implementations, protected by copyright.

Patent applications

The first related one I became aware of was WO2016202918A1: Method for Compressing Genomic Data. This describes a way to use sequence position plus alignment (a "CIGAR" field in SAM) to identify a previously observed sequence at that location and thus compress via redundancy removal. This is a subset of the method I used in samcomp for the SequenceSqueeze contest and published 2 years earlier, where I used all prior sequences instead of just one. I made prior art observations and the authors of the patent appear to have amended it slightly, but this is still an ongoing concern.

The same authors also later submitted a second patent involving lossy compression of quality values by forming a sequence consensus along with an associated confidence, and using that to dictate how much loss can be applied to the quality values. This too infringes on my earlier work as it closely follows the Crumble program.

As far as I'm aware however neither of these two patents made it in to the MPEG-G specification, but the authors were actively involved in the MPEG-G process and this directly lead to my withdrawal.

GenomSoft-SA and GenomSys-SA

The same cannot be said however for this barrage of 12 patents from GenomSys:

Source: https://globaldossier.uspto.gov/#/result/publication/WO/2018071080/1

The patents all relate to how the sequence is encoded in the MPEG-G draft specification. Many of these patents have similar problems with prior art, with some claims being sufficiently woolly that CRAM, among others, would infringe on them if granted. Some are simply huge - over 100 pages of impenetrable lawyer speak and 80+ claims. Fighting this has become an enormous and unwelcome time sink.

Crunchbase reports one GenomSys founder as Claudio Alberti:

Source: https://www.crunchbase.com/search/organizations/field/people/num_founded_organizations/claudio-alberti

MoneyHouse.ch reports GenomSys board of directories as including Marco Mattavelli:

Source: https://www.moneyhouse.ch/en/company/genomsys-sa-15920162331

As reported earlier, both Alberti and Mattavelli are founders of MPEG-G:

Source: https://mpeg.chiariglione.org/standards/exploration/genome-compression/issues-genome-compression-and-storage

There is no proof that this was their intention all along, but I will leave the readers to draw their own conclusions.

As mentioned before, I was not formally accepted as a member of BSI and therefore not party to all the documents registered in the MPEG repository, only receiving communication via the ad-hoc group mailing list. Therefore I do not know when the other members were notified of these patents (as is the requirement under ISO regulations), but I was certainly taken by surprise and would not have engaged if I knew where it was headed. As it happens I consider it fortunate that I did not become a formal part of the process, as my lack of access to the document repository also means I am untainted by any confidential knowledge, which may help with any future prior art claims.

Licensing

Patents by themselves do not automatically mean royalties. Many patents are obtained purely to protect against others attempting to patent similar things (as patent officials are considerably more likely to find prior art in a previous patent than they are in other literature).

A patent means we need a license to use that invention. Licenses can be granted free, but we have no idea of the intentions here. However to date none of this has been openly discussed and most I have spoken with have been blissfully unaware of the patent applications. There have been some suggestions that there will be an open source reference implementation, but do not assume open source means free from royalty. The two are orthogonal issues.

Implications

My concerns over these MPEG-G patents are many fold, but include:

Fear: even if a program has clear prior art and a patent is unenforceable, the fear of potential legal action being taken can dissuade people from using that program. This may dissuade people form using my work, including CRAM and Crumble.
Prevention of further research: a core public Idea can be extended within patent claims as Idea+X, Idea+Y and Idea+Z. The original inventor of Idea can then become stuck trying to find a way to continue his/her own research, simply because they didn't publicly state up front which next avenues of research they were going to continue down. Crumble vs the 2nd Voges patent is an example of this.

To this end, I am now trying to publicly document as many ideas I have as I can, regardless of whether I have time to implement them yet, however this does give ideas to my competitors.
Needless complexity: every program touching the patents now needs to deal with licensing. If the patent is monetised, it makes every tool a commercial tool unless the author has money to throw away. This simply will not work with most academic groups.
Time sink: fighting patents takes a LOT of time. Patents are deliberately written to be hard to read and much like builders applying for planning permission over and over again for the same plot, assuming people will give up objecting, many subtle variations on the same patents can be applied for in the hope one of them will get through. I have spent weeks on this, as have others. During that time I could have been doing other more productive work, improving software elsewhere. This costs us money and doesn't help the community as a whole.
Cost: an obvious one. We do not yet know what the cost will be or how it will be applied. It may be free, it could be a one off charge when first creating the file, or it could be a charge every time we read / write data from the file format. Assuming a non-zero cost, this will increase the price of research and medicine. We need to know if we spent the money wisely. What do we get from that money that we didn't have before?

A better way?

Leonardo Chiariglione, chairman of MPEG, describes the MPEG model as this:

"Patent holders who allow use of their patents get hefty royalties with which they can develop new technologies for the next generation of MPEG standards. A virtuous cycle everybody benefits from."

It sounds sensible on the face of it, and probably is when the field is already covered by a myriad of patents. He has concerns that there are now free alternatives to video coding arriving, such as the Alliance for Open Media, and views this as both damaging to MPEG but also damaging to the very idea of progress:

"At long last everybody realises that the old MPEG business model is now broke, all the investments (collectively hundreds of millions USD) made by the industry for the new video codec will go up in smoke and AOM’s royalty free model will spread to other business segments as well.

[...]

AOM will certainly give much needed stability to the video codec market but this will come at the cost of reduced if not entirely halted technical progress. There will simply be no incentive for companies to develop new video compression technologies, at very significant cost because of the sophistication of the field, knowing that their assets will be thankfully – and nothing more – accepted and used by AOM in their video codecs."

I believe however that he is mistaken. There are clear non-royalty based incentives for large companies to develop new compression algorithms and drive the industry forward. Both Google and Facebook have active data compression teams, lead by some of the world's top experts in the field. It is clear to see why: bandwidth costs money, and lots of it. Google's Brotli algorithm is now supported by all the major browsers and Facebook's Zstd is targetting itself as a replacement for the legacy Zlib library.

There are also more enlightened approaches, where like minded companies pay for research into "pre-competitive" areas which benefit all but are not, by themselves, considered a commercial edge. File formats absolutely fits the bill. Indeed one such organisation, the Pistoia Alliance were the people who organised the Sequence Squeeze contest, which lead to so many innovations in genomics data compression.

Conclusion

"Patents + extensive high profile marketing + lack of evidence for benefit" jointly add up to something tasting and smelling rather unpleasant. The situation can, however, be rescued with appropriate licensing, (assuming a benefit exists).

I urge the community now - do not adopt MPEG-G without a cast iron assurance that no royalties will be applied, now and forever.

Thursday, September 27, 2018

MPEG-G: the bad

Preface: all my blogs are my own personal views and may not necessarily be the views of my employer, the Wellcome Sanger Institute.

OK I'll come clean - having read this title you'll probably have realised that the previous post could have been called "MPEG-G: the good". It's not all smelling of roses. I covered the history and process before, so now I'll look at what, to my mind, isn't so good.

Transparency

MPEG-G has been bubbling along in the background. It's hard to gain much knowledge of what's being discussed as it's all behind the ISO machinery, which seems designed to lock in as much information as possible unless you're one of the club.   Given the history of intellectual property surrounding video coding, I can understand why this is their usual policy, but I don't feel it is an appropriate fit for a relatively open field such as bioinformatics.

I know as I almost got there, but was put off by the BSI IST/37 committee and work-loads. As it happened, I think it was beneficial (more on that to come), but it does mean that even an interested party like myself could see absolutely nothing of what was going on behind closed doors. I wasn't even aware for some time that one of my own submissions to the Core Experiments was adopted by the standard.    Yes I know I could have gone through all the bureaucracy to get there, but that's still going to be a very self limiting set of people.   My strong view here is if you *really* want the best format known to mankind, then be as public about it as you conceivably can.

MPEG-G is a complex specification. Infact it's huge. Partly this is because it's very precisely documented down to every tiny nut and bolt (and I have to confess, this puts CRAM to shame), but it's also got a lot going on. Is that necessary?   I'd like to be able to see the evaluations, of which I am sure there were many, that demonstrate why the format is as it is. These are key to implementing good encoders, but it would also give me confidence that the format is appropriate and doesn't contain pointless bloat. (Again, this is something less than ideal with CRAM too.)

Risk vs Reward

A shiny new format is all well and good if the gains to be had are worth it. Switching format is a costly exercise. CRAM started off around 30% less than BAM and has improved to be around 50% saving now. With CRAM v4 is may even be 60-70% saving for NovaSeq data. That's clearly worth it. What percentage reduction over CRAM is worth switching again? 10%? 20%? Maybe, maybe not, depending on how hard it is to implement and what else we gain from it.

Unfortunately MPEG-G is a large unknown right now, not only in size vs CRAM as all comparisons I've seen to date are vs BAM, but also in how any savings are achieved: size isn't everything. Will it be similar speed, with similar granularity of random access? So far nothing I've seen shows it is a significant advance over CRAM (eyeballing vs BAM ratios).

There are potentially new features (rewards) too in MPEG-G, but it's too early to know how beneficial any of these are. For example it has the ability to append new data without rewriting the file. This is simply analogous to on-the-fly file merging which we use already and have been for a decade. It's a nice feature, but not one we couldn't live without. There are also options to retrieve sequences that only contain, say, X differences to the reference. Is this useful? I really don't know, but I'd need some convincing test cases.

Reproducibility

That neatly brings me on to evaluation and reproducibility of research. MPEG-G make a few claims vs BAM, but to compare vs CRAM ourselves we need access to the software, or failing that to some common data sets and more information on speeds and access granularity.

In July 2017 Marco Mattavelli gave a talk including some MPEG figures, which includes BAM and MPEG-G file sizes for ERR174324 chromosome 11; a public data set. Unfortunately in the ENA archives it's in FASTQ format. How was the BAM produced? I've tried using "bwa mem", but it's simply not a good match for file size.

Source: https://mpeg.chiariglione.org/sites/default/files/events/Mattavelli.pdf

The same talk also gives figures for NA12878_S1 - one of the Illumina Platinum Genome BAMs. This one is released in BAM format, however the BAM file size in the talk is around 1/3rd larger than the one released by Illumina. There are more recent slides from Marco, but they don't reference which data sets were used. We need more than this. What are the timings for encode and decode? How many seeks and how much data is returned for a small region query?

I'm not necessarily expecting a PowerPoint presentation to be the source of the data I want - this is normally the sort of thing relegated to a Supplementary Material section of a paper - but it should be somewhere and as far as I can tell it isn't. I sent an email to Marco on Wed 19th September and am awaiting a reply:

(From me:) "I'd like to be able to see how CRAM compares against MPEG-G, and for a fair comparison it needs identical input BAMs. Are there any recent public data sets with MPEG-G figures available?"

The best I can do so far is a side by side comparison vs CRAM 3 and the CRAM 4 prototype using my bwa mem generated BAM for ERR174324, break it down by approximate size of seq, quality, name and aux (by removing them from the BAM one at a time and observing the impact), and then scale my charts to look the same. This gives us compression ratios, although we cannot draw any hard conclusions on absolute file sizes:

The PR machinery

No one wants to shout "Adopt our format - it's marginally better than the current ones". It just doesn't work as a slogan! So it's hardly surprising in the talks given this is glossed over, but it goes beyond that. The talks and articles I've seen perpetuate the myth that there was never anything better out there:

Source: https://mpeg.chiariglione.org/sites/default/files/files/standards/parts/docs/w17514.zip

We know MPEG are aware there are better formats. They collaborated with authors of more advanced tools during the early days while learning the field, have cited them in online publications, and even invited them to MPEG conferences to talk about them. Yet sadly this is all lost in the goal for a nice soft target to compare against.

We need proper comparisons against other well established formats, or even the not so well established ones if they look a better choice to switch to. Their very absence is suspicious. I do still think MPEG-G has something to offer, but convince me because I'm not a man of faith!

It also extends passed the immediate ISO/MPEG eco system too and into the related standards bodies such as the IEEE. In an article in IEEE Spectrum with the rather condescending title of "THE QUEST TO SAVE GENOMICS", it states:

"Bioinformatics experts already use standard compression tools like gzip to shrink the size of a file by up to a factor of 20. Some researchers also use more specialized compression tools that are optimized for genomic data, but none of these tools have seen wide adoption."

This is why I added "Myth 4: CRAM has not seen wide adoption" to my earlier blog. CRAM has gone beyond research and into production. I wouldn't say over 1 million deposited CRAM files in the EBI public archives and in use by multiple major sequencing centres across the globe as not seeing wide adoption.

MPEG - you can do better than this. Give us some hard numbers to work with and please stop painting the community as being in the stone age as it's simply not true. We don't need "saving", we need proper open collaboration instead.

Taking it up a notch

Worryingly it goes further than simple press releases. I have recently become aware that, despite the lack of any official figures and public benchmarking, MPEG-G is being recommended to the European Commission eHealth project. A report on Data Driven Healthcare explicitly mentions MPEG-G as a likely format for standardisation across Europe:

"The European Health Record Recommendation will also talk about content-related standards for documents, images, complex datasets, and multimedia recordings. Viola admitted that there will be discussions around these standards, but he also said that the Commission is not planning to reinvent anything that already exists.

[...]

A good candidate, according to Viola, is the MPEG-G standard, a genomic information representation that is currently being developed by the Moving Picture Experts Group (MPEG). MPEG-G aims to bring the terabyte datasets of whole-genome sequencing down to a few gigabytes. This would allow to them to be stored on a smartphone."

Note that "Not planning to reinvent anything that already exists" statement. Roberto Viola (Director General for Communications Networks, Content and Technology) is listed elsewhere in the above report as the main driving force behind the eHealth Network project. We can only speculate on whether Viola is aware of the full range of existing technologies.

I'm not anti-progress, but I am a scientist and that means evidence - before you start the marketing.

This is dangerous ground and history is resplendent with battles where a good PR campaign won the day.

Addendum:

During preparation of this blog, someone pointed me an MPEG-G paper at bioRxiv. It's good to have a better description of the format, fitting somewhere between a few simple slides and the complexity of the full blown specification. However again it appears to be a missed opportunity in terms of the compression performance section. The chart shows lossy MPEG-G only and has no direct evidence regarding the impact this has on variant calling. It does not compare against formats other than SAM or BAM, or even acknowledge their existance. If I was a reviewer I would be requesting something that others can reproduce, using public datasets, and a comparison against the current state of the art. If the reader wishes to move away from BAM, they need to know which format to move to. This paper does not demonstrate that.

Tuesday, September 25, 2018

"ISO/IEC JTC 1/SC 29/WG 11"

Preface: all my blogs are my own personal views and may not necessarily the views of my employer, the Wellcome Sanger Institute.

Who?

You'll probably know ISO/IEC JTC 1/SC 29/WG 11 as the Motion Picture Expert Group, or MPEG for short.

They formed out of a requirement for interoperability between the burgeoning multi-media formats and devices, chaired throughout its history by Leonardo Chiariglione. MPEG's aims were eminently sensible; put together the best formats they can, irrespective or who owns what ideas, and come up with a format better than the sum of its parts. Given the companies wanted to monetise their intellectual property, this was done via a central pot (the FRAND system) and it was divvied up to those due their slice of the MPEG pie. The net effect is a win/win for everyone, both consumer and producer.

This blog will cover some history of how MPEG got to grips with genomic data formats, up to the current publication of "MPEG-G".

MPEG discovers genomics

It appears to be back in July 2014, during a meeting in Japan, that MPEG formally acknowledge an ad-hoc group (AHG) to start investigations into the field of genomic data compression.

This document shows the first tentative steps to understanding the field, along with the expected gaps in knowledge: (Emboldening theirs.)

"Such challenges imply the use/existence of an appropriate open standard for both compression and storage, [...] but that currently does not exist".

It doesn't take long to get a basic understanding, and by the next MPEG meeting in Strasbourg the team had grown and learnt far more about the field of genomic data compression, mentioning standard formats including FASTQ, SAM, BAM and CRAM and ego-pleasingly citing and including figures from my own Sequence Squeeze paper too.

Enter MPEG, stage left

Before the following meeting (Feb 2015) they reached out to authors of existing tools and formats. The process gathered momentum. In January 2015 Vadim Zalunin, author of the original CRAM implementation at the EBI had a call from MPEG. He relayed this to me shortly after:

"MPEG community has contacted us, they are reviewing CRAM and if it has the potential to become a standard format they would want to support. If you are interested please subscribe to <redacted>."

This was rather a bolt out of the blue to me, having heard none of the above yet. The video coding experts are looking at genomic data? Well, I guess they have a lot of data compression expertise, so why not?

I joined them. Vadim had already subscribed to their "reflector" (ISO speak for a mailing list), as had a number of other authors of related tools such as DeeZ, SCALCE and GeneCodeq, and numerous data compression experts in the video coding world. From one of my first postings of 7th Jan 2015, when I was just working out what all this entailed:

"I am eager to know what the MPEG team has to offer though. For sure I think there is a rich vein around lossy quality encoding, maybe applying rate distortion theory and other bounded loss methods. There have been a number of papers published on lossy encodings, but they've found it hard to gain traction so far. I think perhaps there is a lack of a gold-standard method of evaluating the impact on lossy quality values."

We had lots of really technical discussions on various formats, their strengths and weaknesses, and how to improve. Even the main man at MPEG, Leonardo Chiariglione, occasionally took part in discussions. We also had a series of conference calls, and produced a set of test data and results of running various already existing tools, which subsequently formed the basis of a Nature Methods paper, of which I am a co-author: a survey of existing genomics compression formats.

So far my involvement was purely conference calls, emails and a lot of benchmarking, but in July 2015 Claudio Alberti invited me to the next MPEG conference:

"MPEG will organize a Workshop/seminar on Genome compression on 20th October 2015 in Geneva and we'd like to invite you as keynote speaker on genome compression issues/tools/technologies."

I spoke at this meeting, presenting CRAM both as it stood then and potential improvements to the codecs based on my earlier fqzcomp (Sequence Squeeze) work. The talk slides can be seen here: https://www.slideshare.net/JamesBonfield/cram-116143348

Formal processes

Once the field had been surveyed and a few explorations made, the next step of MPEG-G (as it later became known) was to construct a series of "Core Experiments" which related to the specific tasks at hand, and a public "call for contributions". These included compression of sequences, identifiers, quality values (both lossless and lossy), auxiliary tags, as well as more nebulous things.

ISO/MPEG procedures are complicated! ISO do not accept comments and submissions from individuals, only from member nations or other recognised liaison committees. Each nation in turn has its own standards body (BSI for the UK), and people affiliated with an appropriate technical committee (eg BSI's IST/37) are permitted to take part in ISO discussions on behalf of the nation standards body. This was all very bureaucratic, and they seemed to want me to turn up to meetings in London and do all sorts of other duties. I opted out of this complexity.

However I still made submissions via Claudio Alberti, who kindly submitted them on my behalf. These consisted of the various experimental CRAM codecs I had presented to the October 2015 meeting in Geneva. However I wasn't present at the subsequent meetings (without standards accreditation, you can only attend if formally invited), nor was I able to see anyone else's submissions in the internal MPEG document archive. The work load was getting quite high and I took this as an opportunity to bow out. I later heard one of my submissions (identifier compression) was adopted as part of the MPEG-G specification.

The MPEG-G standard itself

Having bowed out of regular communication and meetings, I am hazy of the process from then on apart from the occasional public document. The most relevant of these is the MPEG-G standard itself, now designated the snappy title of ISO/IEC 23092-2 CD, all 250+ pages of it! The MPEG site also has more documents than the file format, with other topics covering transport / streaming facilities (similar to htsget) and APIs.

I am uncertain on the timeline, but it is probable the standard will be finalised by the end of this year or early next year, with a reference implementation available soon. (I do not know whether this would include the encoder or just a decoder.)

You're probably curious, like me, as to how it compares to other existing technologies for compression ratios and encode / decode times. I don't think we have long wait, but right now information appears to be sketchy, with only rather vague numbers published.

Saturday, September 22, 2018

Dispelling the CRAM myths

This is a blog about the CRAM file format for storing DNA sequencing data.

There is a lot of misinformation about CRAM out there. I'll be charitable and say it's accidental, but even accidental wrongs sometimes need correction. Some of it also comes down to lack of decent tools rather than limitations of the file format, so consider this also to be my TO-DO list!

Myth 1: CRAM is reference based only

By default, the main implementations of CRAM encode sequence as a difference to the reference genome, however this is not mandatory. Scramble has -x (referenceless) and -e (embedded ref) modes. These can be enabled in Htslib too by using Samtools. For example:

samtools view -O cram,embed_ref in.bam -o out.cram
samtools view -O cram,no_ref in.bam -o out.cram

Where CRAM suffers though is the tool chain. This isn't a fault of the format, but of our implementations and usage.

Embedded references are ideal for assemblies where the reference is a one off, but it'll forget this if you pipe in.cram to out.cram - you have to specify once again to embed the reference again, which is tricky as you don't even have the reference to embed! Ideally we'd simply replace embed_ref with generate_ref and get it done on-the-fly by building a consensus as it goes. If it started this way, then do it on output too. Further more it ought to do this automatically if the sequence differs from the reference too much. I'd even go as far to suggest generating and embedding the consensus may even be the optimal solution for deep alignments to the human genome and it avoids a whole host of ref shenanigans.

"No_ref" mode is ideal for data that isn't position sorted, as doing reference based compression puts a heavy burden on the I/O or memory bandwidth if you're jumping all over the place in the entire genome. Again, this ought to be completely automatic and transparent.

Myth 2: CRAM is for aligned data only

CRAM can quite happily store unaligned data. Indeed at the Sanger Institute we use unaligned CRAM instead of FASTQ as it's smaller and it's also one less format to contend with. (We can convert back on-the-fly for tools that need it.)

It's even possible to index an unaligned CRAM, although retrieval again suffers due to the lack of a good tool chain. The cram_filter tool from io_lib permits extraction of specific CRAM slices based on their numerical IDs if the file has been indexed. Eg:

cram_filter -n 1000-1100 in.cram out.cram

You can specify - or /dev/stdout for the output to permit streaming. This permits subsets of FASTQ to be generated with a little bit of tooling and so works better in a map/reduce style manner than streaming the entire file and dispatching commands, but again the tool chain makes this clunky and could do with tidying up.

Myth 3: CRAM is not lossless

I think this stems from the original Cramtools code having lossy compression enabled by default. CRAM can do lossy compression, but Samtools, Scramble and Picard are lossless by default.

Malformed data and a few very quirky corner cases aside, CRAM is lossless in that the same information will come back.

Technically there are a few legitimate things which CRAM cannot store, such as distinguishing between "=" / "X" CIGAR operators and "M" and knowledge of whether the read had an NM/MD tag or not (it'll just regenerate these), but everything else which doesn't survive the round-trip is either broken data or information which the SAM specification states have no meaning anyway. Eg CIGAR fields for unmapped data. By and large, it's lossless enough to not practically matter. It's also worth noting that even BAM isn't perfectly lossless if we really want to get down to it. SAM -> BAM -> SAM can change some fields, such as uppercasing sequence and, for Picard, changing some auxiliary types (H to B).

Myth 4: CRAM has not seen wide adoption

CRAM is sometimes viewed as niche, and with poor take up. While it's true it had a slow start, it is now wide spread and it active use across the world.

More alignment files have been deposited in the EBI's ENA and EGA archives in CRAM than BAM, by a factor of around 2:1 - that's well over a million CRAM files on one site alone. The Broad Institute also have wide adoption of CRAM, more focused on per-project than an overall strategy, but with more projects switching over as the space savings are so large.

The code base has been downloaded maybe 1 million times - conda shows 300k and github release counts are around 150k, but htslib itself has been embedded into 100s of other github projects also (not counting all those who use submodules). Obviously most of these downloads will not be specifically for CRAM, but if you use htslib you'll have the ability to use CRAM. Due to lots of programming languages havings buildngs to htslib, this also means CRAM support in many languages. Similarly for htsjdk.

Myth 5: CRAM is of size "X"

There are numerous comparisons of CRAM to <insert-format-here> that gloss over a huge wealth of detail. Point 4 above already details one way in which CRAM files can differ in size - by implementation.   Even within the same implementation there are different levels of compression. Eg like gzip -1 to gzip -9, scramble has -1 to -9 arguments to favour light or heavy levels of compression. By default neither scramble nor samtools use the bzip2 and lzma codecs, but they can be enabled for higher compression.   The number of sequences per slice also improves compression ratio, at the cost of less fine grained random access. For example:

    samtools view \
        -O cram,use_lzma,use_bzip2,level=9,seqs_per_slice=100000 \
        in.bam -o out.cram

Note: within Sanger some groups reduce the number of sequences per slice (default to 10,000) in order to permit finer-grained random access.

All of this means if you're quoting CRAM size for purposes of comparison with another tool, please quote the software you use, the version of it, and the command line options. See the benchmarks on htslib.org for some, rather outdated, examples of how size and speed trade-offs work.

To further illustrate the point above about there being no one single size, and also to demonstrate that CRAM is actively being worked on, here we show a prototype of CRAM 4 with more compression codecs on a set of NovaSeq alignments. The time is log-scale, so note it costs to get that little bit more saved. Also note that the level of granularity differs between each point on this plot, with some (left-most Deez and Quip) having no random access at all.

(The faded points represent the encode times; the main chart is showing decode.)

Myth 6: CRAM doesn't support selective access

This claim I assume relates to selection of specific types of data (SAM columns) rather than region queries, as that is obviously working. Although note it's possible using cram_filter to extract regions of CRAMs with zero transcoding (again using the -n option I mentioned before, although it could really do with a saner interface that permits e.g. X:10000000-20000000). This, along with BAM equivalents, are the backbone of the efficient htsget protocol for streaming NGS data.

Selection of specific columns within CRAM is something which may be possible, or may not be, depending on how the CRAM was encoded and on the type of data field you're trying to select. It is also dependent on a decent tool chain. Remember I said that the CRAM format gives complete freedom to write data in whichever blocks it likes? If the encoder puts all data in one block then there is no way to select only some types. For some time now the major CRAM implementations have all been putting each type of data into its own block, to facilitate ease of decoding. This means we can get only certain types of data out much faster.

An example of this is samtools flagstat. Try it on a BAM vs a CRAM file and you'll see it's much faster in CRAM, as it has selectively skipped decoding of some fields. This makes file reading much quicker. Similarly indexing a CRAM is an order of magnitude faster than BAM.

In some cases it's possible to do the same for writing too. For example if we wish to drop one single tag, say the large OQ field, then we don't want to decode sequence, name, quality and more only to immediately reencode them again. We'd be better off simply copying the compressed data over, bar the block containing the OQ field and the meta-data that indicates its presence. This isn't possible within samtools / htslib, but the cram_filter tool demonstrates some of this in the -t option which permits dropping of specific axillary tags. It can do this with little more than a read/write loop, which even with the effort to adjust the tag list and meta-data blocks comes out at least 50x faster than a traditional transcoding loop:

cram_filter -t OQ in.cram out.cram

If we can do it with removal, in theory we can do it with addition too, but this exposes an issue with our tool chain: again this isn't a CRAM file format woe, but a coding problem.

Say we wish to add an auxiliary tag stating the average quality value, we'd only need to decode the quality from the CRAM file. We know what to do - decode the quality block, add a new tag, update various CRAM meta-data blocks, and do read/write loops on all the other blocks. Sadly we simply haven't got a good API for block level manipulations like this. (Of course this also doesn't work too well if we want to strip out entire reads, but we could simply mark them as unmapped and leave in-situ.)

Myth 7: CRAM specifies how to encode, and so cannot improve

This is a peculiar one, but I've seen it written that CRAM suffers as it leaves no room for efficient implementations, due to dictating the encoding process instead of the decoding process.

That is simply not true.

The CRAM format specifies the file layout and the decoding method, but the encoder has complete freedom on how to break the data up. It can decide which data types to put in which blocks and with with codec. It also permits storing multiple data series in the same block if there is strong correlation between values and a mixture would lead to better compression ratios. In the early days we gradually learned how to better encode CRAM 2.0 files which lead to Scramble producing files smaller than Cramtools of the same era. (Cramtools implemented the same ideas so it's now comparable again.)

Myth 8: CRAM was invented by me

Finally, it embarrasses me to even have to put this here, but I've seen it said before so I'll address it. No, I didn't invent CRAM! I just helped continue it along the path.

The initial ideas were from the EBI in Markus Fritz's paper Efficient storage of high throughput DNA sequencing data using reference-based compression, which doesn't give it the CRAM moniker, and subsequently implemented by the EBI's Vadim Zalunin in Java as the cramtools package. My involvements was later, to implement it in C. In doing so I found a number of discrepancies between the specification and the Java implementation, which lead to CRAM 2.0 being released. ( This was largely to get the spec and code in line with each other.) I was also the main instigator behind CRAM 3.0, with the newer codecs.

Interestingly the original paper had ideas which were never implemented, perhaps due to speed concerns. One idea was to assemble the unmapped data to find any deep contigs and use these as fake references to improve compression. Coupled with embedded references this would work well. With fast tools like miniasm it may be possible now without too much of a speed problem.

Final words

Is it perfect? By no means! The specification could do with improving and I'd have designed a few things differently, but by and large it's been highly successful and my hat is off to the EBI people who designed and implemented it. You should be proud of your achievements and I'm pleased to have helped it along its path.

Introduction

I've never had a blog before, but events over the last year or so have inspired me to have an area to write in that is more permanent and easier to find than simple facebook ramblings. This will primarily be about my adventures and explorations in Big Data, most notably my work on compression of genomic data, but I won't discount other forays into undiscovered (by me) lands.

I've been a fan of data compression since my early university days, with the first compression tool I wrote being a Huffman entropy encoder and decoder in 6502 assembly for the BBC Micro. Fun days. It worked, but was hardly earth shattering. It sparked a geeky passion which has stayed with me ever since.

In my professional life I've mostly been doing work in and around Bioinformatics, but occasionally this gave me room to exercise my love of compression. It started with DNA sequencing chromatograms ("traces") and the ZTR format and much later moved on to experimental and unpublished FASTQ compressors which subsequently lead to taking part in the Sequence Squeeze contest (and somehow winning it, which paid for several Sanger Informatics punting trips among other things). The contest lead me to write a paper with Matt Mahoney, who I have to confess has been a long time hero of mine due to his fantastic work on the PAQ compressors and running sites like the Large Text Compression Benchmark, so it was an honour to collaborate with him.

Off the back of the Sequence Squeeze contest, I was asked to work on writing a C implementation of the CRAM format, an alternative to the BAM format for storing next generation DNA sequence alignments. I didn't know whether to be pleased or not, but my "Great Grandboss" (how else do people refer to their boss's boss's boss?) said words to the effect of "We're looking for someone to do a quick and dirty C implementation of CRAM, and I immediately thought of you". Umm. :-)

It's with CRAM where my blogs will start proper.

As to why it is important, take a look at this well published image from https://www.genome.gov/sequencingcostsdata/.

When the first Human Genome was published, it was a truly international effort that took over a decade to complete and cost nearly $3 billion dollars, which included a lot of learning and development of techniques. By the end of it, the cost of doing it again was already under 1/10th of that, and now we're well under $1000 for the same thing and completed in a couple of days (along with numerous other genomes on the same instrument at the same time). Because of this though we now do other experiments: we sequence cancers and "normals" (non-cancer) in 100s of patients in order to work out what genetic differences lead to the cancer, we're even sequencing all tissues in the human body to produce a Human Cell Atlas. In short, as the cost went down, the amount of data went up. Up by a HUGE amount, far more than Moore's Law for computing and Kryder's Law for storage. Ultimately data compression means nothing against a disparity in exponential growths, but it's a start and we'd be foolish to not exploit it. A good reason to scratch an itch.