Data Geekdom: CRAM structure

Having just got back from the super GA4GH conference (more on that later), it is clear some are still confused as to how CRAM actually looks under the hood, so I'll take some time to illustrate it in more detail.

CRAM file structure

The basic CRAM structure is a magic number / version, a whole file header (SAM header), followed by a series of containers. For position sorted data, these containers are anchored to locations in the genome and by way of an index act as a way to do random access. (We can index unsorted / unmapped data too, permitting retrieval by container number.) Each container contains a Compression Header block, indicating what type of data (Data Series) is stored in which block, and a series of Slices - typically we just use one slice per container, but the format permits many:

Each slice in turn has its own Header block and a multiple Data Series blocks corresponding to specific types of data. It is the container Compression Header which dictates which cram data series goes into which block, permitting multiplexing of data series into a single block if it helps compression. Usually it is best to separate them out though, as follows:

Typically there will be many more data series than the four I listed in the above diagram (read names, quality scores, BAM flags and an OQ:Z auxiliary tag). For random access on chromosome and position sorted data, we work out which containers cover the regions in question. We also get the option to specify which data series we wish to pull back. For example:

You may have seen very similar plots recently elsewhere, such as the MPEG-G bioRxiv paper, but be rest assured it's not a new idea. Their terms differ, but replace Slice by "Access Unit" and Data series by "Descriptor" and you'll get something looking very similar. I doubt CRAM was the first to implement such an obvious data structure either.

Htslib "REQUIRED_FIELDS"

This block nature of CRAM means we don't have to decode everything if we do not wish to, permitting something that MPEG term as "selective access". In the htslib implementation this is achieved by specifying the SAM columns that the decoder is interested in (CRAM specific), in addition to the more common region restriction by using an iterator (used in SAM, BAM and CRAM). This code example is from "samtools flagstat", the effect of which can be seen by benchmarking this command on BAM and CRAM files.

hts_set_opt(fp, CRAM_OPT_REQUIRED_FIELDS, SAM_FLAG | SAM_MAPQ | SAM_RNEXT);

At the samtools command line level, this is equivalent too:

samtools view --input-fmt-option=0x52 file.cram

(See the samtools man page for the hex column codes.)

The implementation of this is complex. It is up to the encoder to decide which data types go where, even permitting multiple data types to share the same block, creating data dependencies. It also requires knowledge of the specific implementation code dependencies - e.g. I know my own code only decodes data X and data Y together.

The important point though is this is an implementation trick, permitted but not mandated by the file format structure. CRAM does not dictate anything more than the bitstream format to enable decoding, leaving choice of encoding method up to the implementor as well as avoiding overly prescriptive rules for optimising decode.

I have described a method of implementing fast decoding, but there are ways we can also do fast writing. More on the pros and cons of that later.

Data Geekdom

Saturday, October 6, 2018

CRAM structure

CRAM file structure

Htslib "REQUIRED_FIELDS"

No comments:

Post a Comment