Saturday, September 22, 2018

Dispelling the CRAM myths

This is a blog about the CRAM file format for storing DNA sequencing data.

There is a lot of misinformation about CRAM out there.  I'll be charitable and say it's accidental, but even accidental wrongs sometimes need correction.  Some of it also comes down to lack of decent tools rather than limitations of the file format, so consider this also to be my TO-DO list!

Myth 1: CRAM is reference based only

By default the main implementations of CRAM encode sequence as a difference to the reference genome; however, this is not mandatory.  Scramble has -x (referenceless) and -e (embedded reference) modes.  These can be enabled in htslib too via samtools format options.  For example:

    samtools view -O cram,embed_ref in.bam -o out.cram
    samtools view -O cram,no_ref in.bam -o out.cram

Where CRAM suffers though is the tool chain.  This isn't a fault of the format, but of our implementations and usage.

Embedded references are ideal for assemblies where the reference is a one-off, but this setting is forgotten if you pipe in.cram to out.cram - you have to ask for the reference to be embedded all over again, which is tricky as you may not even have the reference to hand!  Ideally we'd simply replace embed_ref with a generate_ref mode that builds a consensus on-the-fly and embeds that, and if the input started out with an embedded reference, the output would keep one too.  Furthermore, it ought to do this automatically if the sequence differs from the reference too much.  I'd even go as far as to suggest that generating and embedding a consensus may be the optimal solution for deep alignments to the human genome, and it avoids a whole host of ref shenanigans.
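
To illustrate the current behaviour, here is a hedged sketch of the workaround: to keep the reference embedded across a CRAM-to-CRAM conversion you have to supply it again yourself (ref.fa and the file names are hypothetical).

    # Re-embedding must be requested explicitly on every conversion,
    # and assumes ref.fa is still available to embed.
    samtools view -T ref.fa -O cram,embed_ref in.cram -o out.cram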

"No_ref" mode is ideal for data that isn't position sorted, as doing reference based compression puts a heavy burden on the I/O or memory bandwidth if you're jumping all over the place in the entire genome.  Again, this ought to be completely automatic and transparent.

Myth 2: CRAM is for aligned data only

CRAM can quite happily store unaligned data.  Indeed at the Sanger Institute we use unaligned CRAM instead of FASTQ as it's smaller and it's also one less format to contend with.  (We can convert back on-the-fly for tools that need it.)
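
For illustration, a typical round-trip might look something like the sketch below (file names hypothetical); no_ref avoids needing a reference for data that has no alignments anyway.

    # Unaligned BAM (or SAM) to unaligned CRAM - smaller than gzipped FASTQ
    samtools view -O cram,no_ref unaligned.bam -o unaligned.cram

    # Regenerate FASTQ on the fly for tools that insist on it
    samtools fastq unaligned.cram > reads.fq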

It's even possible to index an unaligned CRAM, although retrieval again suffers due to the lack of a good tool chain.  The cram_filter tool from io_lib permits extraction of specific CRAM slices based on their numerical IDs if the file has been indexed.  E.g.:

    cram_filter -n 1000-1100 in.cram out.cram

You can specify - or /dev/stdout as the output to allow streaming.  With a little bit of tooling this lets subsets of FASTQ be generated in a map/reduce style, rather than streaming the entire file and dispatching commands, but again the tool chain makes this clunky and could do with tidying up.
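
As a rough sketch of what that tooling looks like today, the slice range can be piped straight into samtools to produce a FASTQ chunk (names illustrative; a reference-compressed CRAM will also need its reference available, e.g. via REF_PATH):

    # Extract slices 1000-1100 and convert just that chunk to FASTQ
    cram_filter -n 1000-1100 in.cram - | samtools fastq - > chunk.fq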

Myth 3: CRAM is not lossless

I think this stems from the original Cramtools code having lossy compression enabled by default.  CRAM can do lossy compression, but Samtools, Scramble and Picard are lossless by default.

Malformed data and a few very quirky corner cases aside, CRAM is lossless in that the same information will come back.

Technically there are a few legitimate things which CRAM cannot store, such as distinguishing between the "=" / "X" CIGAR operators and "M", or knowledge of whether the read had an NM/MD tag or not (it'll just regenerate these), but everything else which doesn't survive the round-trip is either broken data or information which the SAM specification states has no meaning anyway - e.g. CIGAR fields for unmapped data.  By and large, it's lossless enough not to practically matter.  It's also worth noting that even BAM isn't perfectly lossless if we really want to get down to it.  SAM -> BAM -> SAM can change some fields, such as uppercasing sequence and, for Picard, changing some auxiliary types (H to B).
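
If you want to convince yourself on your own data, a quick (hedged) check is to round-trip the file and compare the SAM records, bearing in mind that the corner cases above may produce benign differences (file names hypothetical):

    # BAM -> CRAM -> SAM vs BAM -> SAM; any differences should be limited
    # to the corner cases described above (e.g. regenerated NM/MD values).
    samtools view -T ref.fa -O cram in.bam -o out.cram
    diff <(samtools view in.bam) <(samtools view -T ref.fa out.cram)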

Myth 4: CRAM has not seen wide adoption

CRAM is sometimes viewed as niche, with poor take-up.  While it's true it had a slow start, it is now widespread and in active use across the world.

More alignment files have been deposited in the EBI's ENA and EGA archives in CRAM than BAM, by a factor of around 2:1 - that's well over a million CRAM files on one site alone.  The Broad Institute has also adopted CRAM widely, more on a per-project basis than as an overall strategy, but with more projects switching over as the space savings are so large.

The code base has been downloaded maybe 1 million times - conda shows 300k and github release counts are around 150k, and htslib itself has been embedded into 100s of other github projects too (not counting all those who use submodules).  Obviously most of these downloads will not be specifically for CRAM, but if you use htslib you'll have the ability to use CRAM.  Because lots of programming languages have bindings to htslib, this also means CRAM support in many languages.  Similarly for htsjdk.



Myth 5: CRAM is of size "X"

There are numerous comparisons of CRAM to <insert-format-here> that gloss over a huge wealth of detail.  Myth 4 above already touches on one way in which CRAM files can differ in size - by implementation.  Even within the same implementation there are different levels of compression.  E.g. like gzip -1 to gzip -9, scramble has -1 to -9 arguments to favour light or heavy levels of compression.  By default neither scramble nor samtools use the bzip2 and lzma codecs, but they can be enabled for higher compression.  Increasing the number of sequences per slice also improves the compression ratio, at the cost of less fine-grained random access.  For example:

    samtools view \
        -O cram,use_lzma,use_bzip2,level=9,seqs_per_slice=100000 \
        in.bam -o out.cram

Note: within Sanger some groups reduce the number of sequences per slice (which defaults to 10,000) in order to permit finer-grained random access.
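
For example, a sketch of trading compression for granularity (the value of 1,000 here is purely illustrative):

    # Smaller slices: finer-grained random access, slightly larger file
    samtools view -O cram,seqs_per_slice=1000 in.bam -o out.cram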

All of this means that if you're quoting CRAM size for the purposes of comparison with another tool, please quote the software you used, its version, and the command-line options.  See the benchmarks on htslib.org for some, rather outdated, examples of how the size and speed trade-offs work.

To further illustrate the point that there is no one single size, and also to demonstrate that CRAM is actively being worked on, here we show a prototype of CRAM 4 with more compression codecs on a set of NovaSeq alignments.  The time axis is log-scale, so note that it costs time to get that little bit more saved.  Also note that the level of random-access granularity differs between each point on this plot, with some (the left-most Deez and Quip points) having no random access at all.

(The faded points represent the encode times; the main chart shows decode.)

Myth 6: CRAM doesn't support selective access

This claim I assume relates to selection of specific types of data (SAM columns) rather than region queries, as the latter obviously work.  Note, though, that it's possible to use cram_filter to extract regions of CRAMs with zero transcoding (again using the -n option I mentioned before, although it could really do with a saner interface that permits e.g. X:10000000-20000000).  This, along with the BAM equivalents, is the backbone of the efficient htsget protocol for streaming NGS data.
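
For completeness, a standard region query - the part that obviously works - is just the usual indexed lookup (names hypothetical; the CRAM needs a .crai index from samtools index):

    samtools index in.cram
    # Retrieve a region; the reference can come from -T or the REF_PATH cache
    samtools view -T ref.fa in.cram X:10000000-20000000 | head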

Selection of specific columns within CRAM may or may not be possible, depending on how the CRAM was encoded and on the type of data field you're trying to select.  It is also dependent on a decent tool chain.  Remember that the CRAM format gives an encoder complete freedom to write data in whichever blocks it likes (more on this under Myth 7).  If the encoder puts all data in one block then there is no way to select only some types.  For some time now the major CRAM implementations have all been putting each type of data into its own block, to facilitate ease of decoding.  This means certain types of data can be pulled out much faster.

An example of this is samtools flagstat.  Try it on a BAM vs a CRAM file and you'll see it's much faster on CRAM, as it selectively skips decoding of the fields it doesn't need, making file reading much quicker.  Similarly, indexing a CRAM is an order of magnitude faster than indexing a BAM.
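
A hedged way to see this for yourself, on equivalent BAM and CRAM copies of the same data (file names hypothetical; actual timings depend on the data, the hardware, and whether the reference is already cached):

    time samtools flagstat aln.bam
    time samtools flagstat aln.cram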

In some cases it's possible to do the same for writing too.  For example, if we wish to drop one single tag, say the large OQ field, then we don't want to decode sequence, name, quality and more only to immediately re-encode them again.  We'd be better off simply copying the compressed data over, bar the block containing the OQ field and the meta-data that indicates its presence.  This isn't possible within samtools / htslib, but the cram_filter tool demonstrates some of this in its -t option, which permits dropping of specific auxiliary tags.  It can do this with little more than a read/write loop, which even with the effort to adjust the tag list and meta-data blocks comes out at least 50x faster than a traditional transcoding loop:

    cram_filter -t OQ  in.cram out.cram

If we can do it with removal, in theory we can do it with addition too, but this exposes an issue with our tool chain: again this isn't a CRAM file format woe, but a coding problem.

Say we wish to add an auxiliary tag stating the average quality value: we'd only need to decode the quality block from the CRAM file.  We know what to do - decode the quality block, add a new tag, update various CRAM meta-data blocks, and do read/write loops on all the other blocks.  Sadly we simply haven't got a good API for block-level manipulations like this.  (Of course this also doesn't work too well if we want to strip out entire reads, but we could simply mark them as unmapped and leave them in situ.)

Myth 7: CRAM specifies how to encode, and so cannot improve

This is a peculiar one, but I've seen it written that CRAM suffers as it leaves no room for efficient implementations, due to dictating the encoding process instead of the decoding process.

That is simply not true.

The CRAM format specifies the file layout and the decoding method, but the encoder has complete freedom over how to break the data up.  It can decide which data types to put in which blocks and with which codec.  It also permits storing multiple data series in the same block if there is strong correlation between values and a mixture would lead to better compression ratios.  In the early days we gradually learned how to better encode CRAM 2.0 files, which led to Scramble producing smaller files than the Cramtools of the same era.  (Cramtools implemented the same ideas, so it's now comparable again.)

Myth 8: CRAM was invented by me

Finally, it embarrasses me to even have to put this here, but I've seen it said before so I'll address it.  No, I didn't invent CRAM!  I just helped continue it along the path.


The initial ideas were from the EBI in Markus Fritz's paper Efficient storage of high throughput DNA sequencing data using reference-based compression, which doesn't give it the CRAM moniker, and it was subsequently implemented by the EBI's Vadim Zalunin in Java as the cramtools package.  My involvement came later, implementing it in C.  In doing so I found a number of discrepancies between the specification and the Java implementation, which led to CRAM 2.0 being released.  (This was largely to get the spec and code in line with each other.)  I was also the main instigator behind CRAM 3.0, with the newer codecs.

Interestingly the original paper had ideas which were never implemented, perhaps due to speed concerns.   One idea was to assemble the unmapped data to find any deep contigs and use these as fake references to improve compression.  Coupled with embedded references this would work well.  With fast tools like miniasm it may be possible now without too much of a speed problem.

Final words

Is it perfect?  By no means!  The specification could do with improving and I'd have designed a few things differently, but by and large it's been highly successful and my hat is off to the EBI people who designed and implemented it.  You should be proud of your achievements and I'm pleased to have helped it along its path.
