Thursday, January 9, 2020

GABAC: The Genomic Adaptive Binary Arithmetic Coder

Recently a paper was published in Bioinformatics titled "GABAC: an arithmetic coding solution for genomic data", by Voges et al.  This is the entropy encoder used in the MPEG-G standard.

The design of GABAC is interesting.  It consists of data transformations (including various byte- or word-to-bit packing functions), entropy encoding with some limited higher-order modelling, and a method for LZ-ish dictionary usage.  This means it is designed to be a one-stop shop to replace the multitude of methods employed in CRAM or DeeZ.

First things first, despite some shortcomings below, I like this paper, both in its design and in how it's been written.  This post may seem like I'm picking holes, but that is science and open peer review.  Although I have been publicly damning of some aspects of MPEG-G, I wish to emphasise that this is aimed at the behind-closed-doors process used by ISO and the actions of a few specific individuals and companies, mainly revolving around GenomSys.  It does not at all reflect on how I view these particular authors, whose work I have a lot of respect for.

They are producing an entirely Open Source implementation of MPEG-G, as part of the Genie project.  At the moment it focuses on compression of unmapped data (e.g. FASTQ), but it gets some impressive results in that department.  I'm a total data compression geek, and therefore any new public compression software gets me all happy!

I also have a huge amount of respect for the attention to detail in reproducibility of results.  Not only have they documented the data sets they used (a "must", but still sadly not a given), but they also provide all of their separate data streams and even the log files from their analysis steps.  This is crucial in being able to validate the figures in the paper.  My sincere thanks for this.  Without that I'd have been left scratching my head and unable to dig down into the data to discover the unfortunate mistakes.

However... In December I informed the authors about my concerns with some of the figures in the paper, and followed up with a more complete set of issues in early January.  I have since received acknowledgement of some of the problems and confirmation that a revised version of the paper will appear in due course.  I've done my own analysis of their published log files (thanks again!) but I won't steal their thunder here.  However, given the paper does make some striking - and factually incorrect - claims about the relative speeds and efficacy of certain codecs, including my own, I do feel duty bound to at least point out where the flaws are, so people don't put too much emphasis on the results until the corrected ones are made available.

The supplementary data


I'll start with this as it's where the most obvious smoking gun lies.  Table 4 is below:


Unfortunately pretty much every single item in this table is incorrect bar the column headings.

For now, I'd advise readers to simply ignore this table.  Problems include:
  • Most rows have been transposed, so where it says "GABAC", read bzip2 instead.  (GABAC is actually the "xz" line.)
  • rANS-0 and rANS-1 were built without compiler optimisation, so speeds are far too low.
  • Decompression speed is computed as compressed_size/time rather than decompressed_size/time.  Hence a tool that compresses well is penalised by appearing to decompress slowly.
  • All figures are computed as an average over a whole set of results, but with highly variable input sizes (which also don't reflect the actual sizes of real-world data).
  • Speeds are for maximum compression, so gzip -9, xz -9, etc.  While interesting, this doesn't represent the default levels, which offer the more commonly used speed vs size trade-off.
  • Speeds include the time to write to disk.  That's perhaps realistic for compression, but for decompression we arguably want the time to load from disk and decompress, not the time to write the result back.  I never end up writing an uncompressed file to disk, so this penalises the faster codecs that can out-perform disk I/O.
A new version will be appearing soon, although it may not fix all of these issues.  I have my own analysis already, but I'll hold off on publishing it until the authors have had a chance to present their corrected paper.

 

The Main Paper

Table 1 in the main paper has some figures for compressing the CRAM and DeeZ byte streams using the CRAM codecs, along with CRAM+GABAC.

I like how this was done, although it's perhaps a bit confusing as this isn't the actual size of a CRAM or DeeZ file and doesn't take into account things like random access.  It does however give a handy guide to the efficacy of the pick-and-mix approach to codec usage.

The authors demonstrate a modest improvement if GABAC is added into the mix, and also show that GABAC essentially replaces the need for gzip or rANS-0 in CRAM.  It's a nice way of looking at the problem.

It does beg the question though: how well would GABAC work alone, without the other codecs (mainly rANS-1, bzip2 and xz), especially given it is meant to be used alone as part of MPEG-G?  Using their log files I get this:



Note this isn't an indication of how well MPEG-G works, because the data streams are different, just as they are between CRAM and DeeZ.  However it does give an indication of a potential weakness in the encoding, or at least in this particular implementation of it.  Finding the optimal configuration is likely a black art and I fully expect improvements to be made here.


Actual CRAM files

The above charts are a proxy for real data, but for the curious, here are some actual numbers (not speeds sadly - I forgot to measure them, but can supply them if people are interested) on the same MPEG-G data set 02 (chr11).

-rw-r--r-- 1 jkb jkb 4493790248 Feb  8  2019 NA12878_S1.chr11.bam
-rw-r--r-- 1 jkb jkb 2584990037 Jan  8 11:33 NA12878_S1.chr11.v30.cram
-rw-r--r-- 1 jkb jkb 2338767711 Jan  7 15:23 NA12878_S1.chr11.v40.cram
-rw-r--r-- 1 jkb jkb 2152252590 Jan  7 15:28 NA12878_S1.chr11.v40-small.cram
-rw-r--r-- 1 jkb jkb 2026174338 Jan  7 15:33 NA12878_S1.chr11.v40-archive7.cram

These were produced using the "CRAM4b" branch of Scramble.  "small" is the "-X small" parameter and "archive7" is "-X archive -7".  The latter is very slow, but mainly due to also adding in xz (lzma).

Note these are larger than the ones quoted above because it's the whole file rather than the first 200MiB per data type.  However it shows the progress with more modern codecs available to CRAM 4.0.

I'll repeat my oft-made appeal to the MPEG-G community as a whole: please publish some numbers for how MPEG-G performs on this dataset.

PS.  Happy New Year





Friday, May 3, 2019

Data compression sweet spot

Historical context


In testing CRAM 3.0, 3.1 (and 4.0) compression codecs I was reminded of a chart I made of the results of the SequenceSqueeze competition for the Fqzcomp/Fastqz paper.  The source pic is there, but I embed an annotated copy:



The nearly vertical line there shows a trend - more CPU time for slightly less storage cost.  We see this trend a lot.  Note that we have a linear X axis and a logarithmic Y axis, i.e. it's exponentially more expensive to gain a little more compression ratio.  Less extreme examples are visible at the Large Text Compression Benchmark.  Presumably it is less extreme there as more information can be gleaned from text than from a FASTQ file (or so we thought at the time).

Modern tools such as Spring completely trounce this line - let's face it, I got lucky with the timing - although it is still at the expense of high memory, if not quite so much CPU (it's not cheap in CPU either), bridging the gap between referenceless (B) and reference-based (A) compression tools.

So what is the sweet spot?  My aim has always been the bottom left, at the base of "the wall".   Somewhere close to optimal compression without the burden of high memory or high CPU.

With newer CRAM codecs I aim to push it further and to give people more choice in the way of named profiles (fast, normal, small and archive), but still it's got to be practical to be useful and by default that's where it'll be focused - sufficient compression without sacrificing performance.

CRAM profiles


Profiles in CRAM are purpose driven, but simply map to a whole bunch of command line arguments (much as bwa has profiles for specific types of data).  As such they change the granularity of random access, from 1,000 reads per slice (equivalent to BAM and ideal for lots of random access) to 100,000 reads per slice (archive).  These are also coupled with the compression levels and algorithms to be used: if we are doing lots of random access then we're probably also still reading and writing files a lot, so are more speed oriented.  We can always modify the options afterwards though.  E.g. "scramble -V3.1 -X normal -s 2000" will use normal compression but at finer granularity.


The "normal" profile is how CRAM works out of the box right now, with larger and smaller variants being available.

I've tried to pick parameters that give meaningful choice, still permitting maximum compression for long term archival, but with the "small" mode offering improved compression at reasonable speed.  On NovaSeq data this does particularly well without needing to go to the extreme modes.  This is in part due to the high degree of binning of quality values, meaning there is less to be gained from the FQZcomp codec there:
With older HiSeq 2000 era data, with lots of distinct quality values showing high correlation between neighbouring scores and distinct cycle by cycle differences, the FQZcomp codec really shines (at a CPU cost):

CRAM 3.1 (I'll probably delay 4.0 until I have some more major format revamping) is still a work in progress, but you can explore the current badly-named branch at https://github.com/jkbonfield/io_lib/tree/CRAM4_updates2 (soon to be merged to master, I hope) and track the ongoing work to create the specification at https://github.com/jkbonfield/hts-specs/tree/CRAMv4 (PDF here, although not automatically refreshed with updates).

Monday, February 25, 2019

LinkedIn's MiGz format, vs bgzf / bgzip

LinkedIn have created a new gzip variant that permits both multi-threaded encoding and decoding.

It achieves this by using the EXTRA field in the gzip headers to indicate the size of the block, so the decoder can read ahead and do multiple decompressions in parallel.  In order for this to work, each block (512 KB by default) also has to be an entirely self-contained gzip stream, without LZ operations referring back to data in earlier blocks.

LinkedIn themselves mention that this is very similar to bgzip, which also uses the gzip EXTRA field to store additional meta-data, albeit with a smaller 64 KB block size.  They recognise the similarity, but sadly failed to notice the "-@" option of bgzip (see the man page above) to specify the number of threads.  Quoting from their blog (emboldening mine):

Interestingly, BGZip uses an EXTRA field to record the size of the compressed data just as MiGz does (to allow an index to be built efficiently without having to decode every block), and this theoretically allows for multithreaded decompression. However, we are aware of no implementation of this and, because BGZip limits each block to 64KB of data (uncompressed), BGZip-compressed files will tend to be larger than those compressed with MiGz (which uses a 512KB block size by default).

The comment about block size is a valid one, although bgzf (the gzip variant that bgzip reads and writes) was optimised for fine-grained random access rather than size.  Sadly, however, the statement about lack of multi-threading support is plain wrong.  It's been in htslib since the 1.4 release in mid 2017, and furthermore with the 1.8 release we added support for linking against libdeflate to get access to a more efficient implementation of the Deflate standard than the one provided by Zlib.
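
To make the mechanics concrete, here is a small stand-alone C sketch (my own illustration, not code from htslib or MiGz, with only minimal error handling) that walks a BGZF file block by block by pulling the BSIZE value out of the "BC" subfield in each gzip EXTRA header.  Being able to find every block boundary without inflating anything is precisely what makes read-ahead and parallel decompression possible.

#include <stdio.h>
#include <stdint.h>

int main(int argc, char **argv) {
    if (argc < 2) return 1;
    FILE *fp = fopen(argv[1], "rb");
    if (!fp) return 1;

    uint8_t hdr[12];
    long offset = 0;
    /* 12 fixed bytes: magic, CM, FLG, MTIME, XFL, OS, XLEN */
    while (fread(hdr, 1, 12, fp) == 12) {
        if (hdr[0] != 0x1f || hdr[1] != 0x8b || !(hdr[3] & 4))
            break;                      /* not gzip with FEXTRA set */
        int xlen = hdr[10] | (hdr[11] << 8);
        uint32_t bsize = 0;

        /* Scan the EXTRA subfields for the BGZF "BC" entry */
        for (int i = 0; i < xlen; ) {
            uint8_t sub[4], val[2];
            if (fread(sub, 1, 4, fp) != 4) return 1;
            int slen = sub[2] | (sub[3] << 8);
            if (sub[0] == 'B' && sub[1] == 'C' && slen == 2) {
                if (fread(val, 1, 2, fp) != 2) return 1;
                bsize = (val[0] | (val[1] << 8)) + 1;  /* stored as size-1 */
            } else {
                fseek(fp, slen, SEEK_CUR);
            }
            i += 4 + slen;
        }
        if (!bsize) break;
        printf("block at offset %ld, compressed size %u\n", offset, bsize);

        offset += bsize;                /* jump straight to the next block */
        fseek(fp, offset, SEEK_SET);
    }
    fclose(fp);
    return 0;
}

A threaded decoder just reads ahead in this manner and hands whole compressed blocks to worker threads.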

It's always a pleasure to compare new compression tools, so thanks to LinkedIn for making their work public.  I was curious enough to do a side-by-side comparison, using the first 1 GB of a Wikipedia archive ("enwik9").  Times are an average of 5 runs for compression and 20 runs for decompression, excluding any obvious outliers.  Compression was writing to a new file, while decompression was sending the output to /dev/null.

Mzip file size = 32611530

Tool Encode(s) Decode(s)
mzip -6 -t 1 41.175 4.568
mzip -6 -t 8 6.647 0.622
mzip -6 -t 16 3.848 0.441
mzip -6 -t 32 2.914 0.480
mzip -6 -t 54 2.459 0.564

Bgzip here was built using libdeflate ("./configure --with-libdeflate CPPFLAGS=-I$HOME/ftp/compression/libdeflate LDFLAGS=-L$HOME/ftp/compression/libdeflate") so the unthreaded speed is faster already.

Bgzip file size = 340338947

Tool Encode(s) Decode(s)
bgzip -l6 -@1 16.571 1.929
bgzip -l6 -@8 2.858 0.319
bgzip -l6 -@16 1.883 0.243
bgzip -l6 -@32 1.548 0.226
bgzip -l6 -@54 1.577 0.222

Bgzip file size is around 5% larger at the default compression level, primarily due to the small block size.  Both suffer from poor multi-threading performance at high thread counts, sometimes even increasing in real time with more threads, but a lot of this comes down to I/O bottlenecks and likely contention between hyperthreads (the system only has 27 real cores) or hypervisor / virtualisation problems.  As an example, when output goes to /dev/null instead of enwik9.gz the fastest bgzip time is 0.78s instead of 1.58s.  However it is still clear that some high-thread-count problems exist in both implementations.

Bgzip, or rather htslib from which it is derived, also uses a thread-pool system.  This shares worker threads between file descriptors, so for example merging 5 BAM files and writing a new BAM file with 16 threads will still only use 16 threads for the entire job, auto-balancing the number of workers per input or output file descriptor based on their throughput.
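
As a rough illustration of that shared-pool model, here is a minimal htslib sketch (file names are placeholders and error handling is pared down) that attaches one worker pool to both the reader and the writer of a simple BAM copy, so the whole job never uses more than the requested number of threads:

/* Build (roughly): gcc copy.c -lhts */
#include <stdio.h>
#include <htslib/sam.h>
#include <htslib/thread_pool.h>

int main(void) {
    samFile *in  = sam_open("in.bam",  "r");
    samFile *out = sam_open("out.bam", "wb");
    if (!in || !out) return 1;

    /* One pool of 16 workers, shared by the reader and the writer */
    htsThreadPool tp = { hts_tpool_init(16), 0 };
    if (!tp.pool) return 1;
    hts_set_thread_pool(in,  &tp);
    hts_set_thread_pool(out, &tp);

    sam_hdr_t *hdr = sam_hdr_read(in);
    if (!hdr || sam_hdr_write(out, hdr) < 0) return 1;

    bam1_t *b = bam_init1();
    while (sam_read1(in, hdr, b) >= 0)
        if (sam_write1(out, hdr, b) < 0) return 1;

    bam_destroy1(b);
    sam_hdr_destroy(hdr);
    sam_close(out);
    sam_close(in);
    hts_tpool_destroy(tp.pool);
    return 0;
}

Adding more inputs is just more sam_open() plus hts_set_thread_pool() calls against the same pool, with htslib balancing the workers between them.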

I'll happily concede bgzf has some deficiencies, such as the inability to control the block size, so it's nice that MiGz improves on this.  Ideally it should also be capable of using different (better) implementations of Deflate, but I appreciate this may be harder to achieve from within Java.

Wednesday, October 17, 2018

MPEG-G challenge, 1 week on.

Preface:  all my blogs are my own personal views and may not necessarily be the views of my employer, the Wellcome Sanger Institute.

 

Replies


It's been one week since I publicly asked for some figures for how MPEG-G performs on the data set they provided DeeZ and CRAMv2 figures for in their paper.

Yesterday I had a reply from a GenomSys employee, and a couple of days before that from Mikel Hernaez (as you can see in the comments to the previous post).  Thank you both for taking the time to reply, and I want you to know none of this is personal.

Sadly neither is willing or able to provide figures at this moment.  GenomSys for unspecified reasons other than "we will provide them according to the established workplan within MPEG" - make of that rather unclear statement what you will.  Mikel because he's working on an open source implementation which isn't yet finished - fair enough, and a huge thank you for working on that.  I look forward to experimenting with it myself when available.  I do still genuinely believe the format has something to offer, if it wasn't for the patents.

I was also told that the BAM figures from Marco's talk last year were estimates based on the outcomes from the Call  for Proposals experiments, rather than from a fully formed MPEG-G file.  Maybe this is why there were no auxiliary data shown in the bar-graphs, although I am unsure if this is part of the specification anyway.  There is still a problem here though.  Stating how much better MPEG-G is than SAM / BAM in various press pieces ([1], [2], [3]) and then claiming it is not yet appropriate to do comparisons against CRAM, DeeZ, etc just doesn't hold water.  Either it's ready for comparisons or it isn't.

I ended my last blog by saying "indeed I am starting to wonder if there is even a fully working implementation out there yet".  While it may appear that this was bang on the money, note it is unclear whether GenomSys' demonstration at the GA4GH conference in Basel was a mock-up, or whether they are refusing to give numbers for some other reason (e.g. tipping off the "opposition" - other MPEG-G implementers - before the final acceptance of the standard).  Either way, this doesn't fit well with the ISO Code of Conduct[4]: "uphold the key principles of international standardization : consensus, transparency, openness, [...]".  If the performance of the MPEG-G format was any less open it'd be in a locked filing cabinet in a disused lavatory with a sign on the door saying "Beware of the leopard"!


My goals


So where to go from here?  Wait, I guess, as far as benchmarks go.  However some people have wondered what my desired outcomes were from starting this series of blogs, so I'll elaborate a little.

  1. To educate people about the CRAM format.  I consistently hear of what CRAM cannot do, when the reality is either that it simply does things differently, or that the complaint is about inadequacies of our tool chain rather than the format itself.

    This is an ongoing project, but I think we're better off than before. 


  2. To ensure that as many people as possible are aware that there is more than just BAM vs MPEG-G.  Numerous press pieces have been making these comparisons without mention of the other technologies, some of which made it into MPEG-G.

    Decision makers need to be fully aware of the choices available.

  3. To inform about software patents, both my own personal views (and I believe of the GA4GH, although I do not speak on their behalf), but also of the existence of patents that directly cover both encoding and decoding of the MPEG-G standard.  *ALL* implementations will be covered by these patents, and we are not yet aware of the licensing terms.

    Again, decision makers need to be fully aware of the implications of their choices, and I have not yet observed any other press releases covering this issue and I still cannot see any of the GenomSys patents registered in the ISO patent database[5].

    To date there have been over 25,000 views of the blog, so the messages in this and point 2 above are hopefully getting through.

  4. On the topic of patents, to research prior art and where appropriate block patent claims.  We are aware of numerous existing bioinformatics tools that may become unusable without obtaining licenses if these patents are granted.  These will be filed as prior art.

    The cost to the community of these patents isn't just in any royalty fees, but in hampered public research as the existing bioinformatics community will have to tiptoe around the minefield of confusingly worded patents, with the fear of having their work squashed by a patent licensing authority if they step too close.

    Fighting this is time consuming and thus costly, but it is ongoing work.  It's not something I can speak about yet though.

  5. To improve!

    One beneficial outcome of this whole unfortunate tangle is acknowledging our own weaknesses and correcting them.  For CRAM this means documentation, specification clarity, conformance tests, and generally more capable tools.

    This is a bit more long term, but it's good to have a lot of ideas to work on.

Conclusion


As a final thought, there are things in the MPEG-G standard that I like and that are an improvement over CRAM.  It was designed by a proficient group with several world class researchers involved, so it is sad that it comes with strings attached that will ultimately hamper its uptake.  There are things I can suggest to improve it further, but will not while it is a patented format.

Certainly it is the case that no MPEG-G decoder can be patent-free while decoding any valid MPEG-G file.  However it remains to be seen whether it is possible for an encoder to produce a subset of the MPEG-G specification that avoids the patents (e.g. by having just 1 data class).  If so, maybe something is salvageable, and I would encourage the open source implementation to consider this as an option.  I'm not a lawyer though and the patents waffle on forever!  The more egregious claims will need squashing first though.


References


[1] https://mpeg.chiariglione.org/sites/default/files/events/Mattavelli.pdf
[2] https://spectrum.ieee.org/ns/Blast/Sept18/09_Spectrum_2018_INT.pdf 
[3] https://www.biorxiv.org/content/early/2018/09/27/426353
[4] https://www.iso.org/publication/PUB100397.html
[5] https://www.iso.org/iso-standards-and-patents.html

Wednesday, October 10, 2018

MPEG-G: An open challenge

Preface:  all my blogs are my own personal views and may not necessarily be the views of my employer, the Wellcome Sanger Institute.


Several key people behind the MPEG-G specification submitted a preprint of a paper covering the format to bioRxiv.  The first version appeared on 27th September, with a follow-up to address some of the comments on 8th October.

Version 1


The first version of the MPEG-G paper is available at bioRxiv here: https://www.biorxiv.org/content/early/2018/09/27/426353  This is the version that has the most comments.

In it, we see no comparison to any other tools out there whatsoever, other than uncompressed SAM and BAM, nor even any mention of their existence.  Furthermore the MPEG-G figure was produced using a lossy compressor.  No reference was given for the data sets used, and no actual size values were shown - only a poor quality graph.

There is no direct evidence that their lossy mode did not harm analysis, other than a referral to other papers that have demonstrated this.  We do not know whether their method is the same.  We cannot take things purely on trust.  Science is skeptical for a reason.

It got pretty much savaged in the comments, and quite rightly so.


Version 2


A new version has just been uploaded.  This second version of the paper initially seems like a significant improvement.  Now there are references to lots of other tools considered state of the art, and some benchmarks between them.  Unfortunately that is as far as it goes.  There are even fewer figures on MPEG-G performance (i.e. none).  The authors state that their aim is to describe the format and not the implementation, but really we need a demonstration of at least one implementation to even judge whether the format is worthy of further investigation.  They just refer to a "sense" of the capabilities it could have.  I don't want a "sense", I want evidence!
"Nevertheless, to give the reader a sense of the compression capabilities achievable by MPEG-G, next we show the compression performance of some of the technologies supported by the standard (and hence implementable in an encoder), and how they compare with current formats(2)."
Unless the format is simply a container that permits other tools to be wrapped inside it, I'm struggling to understand this statement, or indeed the benefit of such a container format.  Even worse, MPEG-G is repeatedly referred to in terms of how it "could" be implemented using this method or that algorithm, but never what it actually "does":
"When it comes to aligned data, an MPEG-G encoder could use a compression method comparable to that of DeeZ [8], which is able to compress a 437 GB H. Sapiens SAM file to about 63 GB, as compared to 75 GB by CRAM (Scramble) or 106 GB by BAM [8]"
This simply doesn't cut the mustard!  I want to know how it "does", not how it "could do".  However let us go with the assumption that right now, MPEG-G is indeed comparable to DeeZ for compression ratios, as suggested.  Note that the figures quoted above are also old and do not represent the current performance of the tools cited (except for BAM).

So let's do some more digging.  That little footnote (2) at the end of the first quote indicates the figures come from unspecified cited publications: "The specific numbers are taken from the corresponding publications".

The paper in question is the DeeZ paper by Hach, Numanagić and Sahinalp.  Unfortunately it's not open access, although I can read it via work and the supplementary text is also freely available.  The figures quoted in the 2nd version of the MPEG-G paper come from Table 1 of the DeeZ paper: 62,808 MB for DeeZ, 75,784 MB for Scramble (CRAM), 106,596 MB for BAM and 437,589 MB for SAM.  Furthermore, because the DeeZ paper is rigorously written and follows accepted notions of reproducible research, the supplementary data lists the exact file tested as ftp://ftp.sra.ebi.ac.uk/vol1/ERA207/ERA207860/bam/NA12878_S1.bam.

Given the age of the DeeZ paper, the figures shown there are for an earlier DeeZ version as well as a considerably weaker CRAM v2.1, rather than the current CRAM v3 standard.  (This isn't a complaint about the DeeZ paper, as they reviewed what was available at the time.)  I did my own analysis of this BAM file.


Format Options Size (GiB) Notes
BAM (default) 105.8 Orig file
BAM -9 103.0 Scramble
Deez (default) 60.6 1 million reads random access granularity
CRAM2.1 (default) 74.0 Using Scramble 1.13.7 to match DeeZ paper
CRAM3 (default) 64.2 Scramble, 10k reads granularity
CRAM3 -7 -s 100000 -j 58.5 Scramble, 100k reads granularity + bzip2
CRAM3 -7 -s 100000 -j -Z 56.1 Scramble, 100k reads granularity + bzip2 + lzma
CRAM4 (default) 58.8 Scramble, 10k reads granularity
CRAM4 -7 -s 100000 53.1 Scramble, 100k reads granularity
CRAM4 -7 -s 100000 -j -J 52.9 Scramble, 100k reads granularity + bzip2 + libbsc

The above table shows sizes from DeeZ (in random-access mode, rather than the non-random-access -q2 method), BAM as downloaded and as generated using scramble -9, and CRAM using a variety of scramble (1.14.10) command line options and versions.  Note version 2.1 was created using the old 1.13.7 release to match the one cited in the DeeZ paper.  Also note CRAM 4 is highly experimental - it's a demonstration of what may become a standard, rather than what already is.  Nevertheless, it's not vapourware and it can be downloaded and run, so I include it as a demonstration of where we could go with CRAM if the community desires it.  Obviously (not recorded, sorry) there are big CPU time differences, and I've also adjusted some of the CRAM "slice sizes" (granularity of random access) to demonstrate the range of sizes we get.  The smaller files are more suitable for archival than ongoing analysis.

Basically, we see CRAM can beat DeeZ quite considerably even using just the existing CRAM3 standard.  I find it disheartening therefore that the new revision of the MPEG-G paper contains outdated and therefore misleading conclusions.  The new proposed codecs in version 4 push CRAM to smaller files still, so we need to consider whether we want a complete new format or just an update to existing ones.  Obviously I'm biased on this point.

Analysis of the graph in the version 1 paper shows this is likely the same file reported in the "Human WGS (high coverage)" figure.  Unfortunately the MPEG-G values shown there were with lossy compression, so no direct comparison can be made against DeeZ and/or CRAM.


Conclusion


The MPEG-G authors were already called out in the comments of their original paper for incorrectly representing the current state of the art.  It appears they have now made the same mistake a second time by using outdated benchmarks.  Once may be an error, but twice is starting to look like a pattern.

Scientists are more savvy than this.  We work on open and reproducible research, not secrecy and promises of what something "could" do.  So far MPEG's own paper has stated it could be comparable to DeeZ, and I have compared CRAM to DeeZ, showing it to be smaller.  One logical conclusion therefore is that MPEG-G can already be beaten by CRAM on file size.  I doubt this is actually true, as they have had some big heavyweights working on it, but the paucity of evidence hardly instills confidence.  Indeed I am starting to wonder if there is even a fully working implementation out there yet.


An open challenge


So let's get right down to it.  I hereby openly challenge the MPEG-G community to provide their own figures, using the data set chosen by them in their paper.  Please email me your MPEG-G figures for the same NA12878_S1.bam file presented in the DeeZ paper that you quote, in GiB units.  I'd like to know the size, some measure of granularity (be it access unit size in MB, number of reads, or a similar metric), and what data is retained.  We can start with the easy lossless case.  (Lossy compression is much harder to evaluate, being a set of complex adjustments and trade-offs.)

I will email this to the two authors with email addresses listed in the paper.

Tuesday, October 9, 2018

Reflections on the GA4GH conference

From 3rd to 5th October we had a large GA4GH conference, attended by around 400 people from across the world and across sectors (academia, pharma, manufacturers, medical, and so on).  It was an exciting meeting with lots of discussions, important face-to-face networking opportunities, and some genuinely moving and inspiring talks.

"Collaborate on the interfaces; compete on the implementations"


This quote was in Ewan Birney's opening address to the delegates, and I couldn't agree more with this philosophy.  "Interface" can mean multiple things, from the htsget and refget protocols to the underlying file formats.  GA4GH embraces the commercial sector and wants it to get involved, and many commercial players have been instrumental in setting new standards.  The file formats are our pre-competitive interchange.  They're the common language we speak, and it benefits everyone, both for-profit and non-profit organisations, if they are free for all to use.  This still leaves room for clever and optimised implementations, and no one raises objections to, say, a commercial FPGA implementation of BAM.


CRAM: format vs toolchain implementation


In this light, I also wish to discuss what CRAM can do as a format and what the implementations currently offer us, because sadly they differ.  We should not fall into the mistake of wishing for a new format when a new implementation will suffice.


Firstly, it is up to the encoder to decide which data goes where.  Earlier CRAM implementations put most aux tags into the same block, which improves compression in some cases and harms it in others.  Newer ones put each tag into its own block (which again is sometimes good and sometimes bad for compression).  We have files that are smaller with samtools 1.3 than 1.9 and vice versa.  Clearly there is room for optimisation and improvement over time, while still adhering to the same CRAM bitstream format.  In this regard, CRAM shares the same methodology as MPEG-G.

Cram_filter


I covered the htslib REQUIRED_FIELDS parameter in an earlier blog, but in summary it indicates to the CRAM decoder which fields we wish to retrieve, and in turn informs the decoder which fields it may choose to discard during reading.  However, avoiding unnecessary decompression and decoding is a rather naive solution.  There are times we wish to manipulate files at the block level, for fast streaming in the context of, for example, an htsget server.
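
As a reminder of what that looks like in practice, here is a minimal htslib sketch (the file name is a placeholder): by declaring up front that only the FLAG column is needed, the CRAM decoder is free to leave the name, sequence, quality and tag blocks untouched.

/* Build (roughly): gcc count.c -lhts */
#include <stdio.h>
#include <htslib/sam.h>

int main(void) {
    samFile *in = sam_open("in.cram", "r");
    if (!in) return 1;

    /* Tell the CRAM decoder which SAM columns we actually need */
    hts_set_opt(in, CRAM_OPT_REQUIRED_FIELDS, SAM_FLAG);

    sam_hdr_t *hdr = sam_hdr_read(in);
    bam1_t *b = bam_init1();
    long n = 0, mapped = 0;
    while (sam_read1(in, hdr, b) >= 0) {
        n++;
        if (!(b->core.flag & BAM_FUNMAP)) mapped++;
    }
    printf("%ld records, %ld mapped\n", n, mapped);

    bam_destroy1(b);
    sam_hdr_destroy(hdr);
    sam_close(in);
    return 0;
}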

A rather crude tool, cram_filter from io_lib, demonstrates this capability in a proof-of-concept fashion.  Currently cram_filter takes slice numbers and just emits those slices using nothing more than seek, read and write calls.  It is *extremely* fast (hence appropriate for serving CRAM files via htsget).

We can also handle the vertical component, termed data series in CRAM and descriptors in MPEG-G; fundamentally both are simply blocks.  Cram_filter permits us to specify the tags we wish to include or to drop.  Sadly it currently only permits filtering of quality values and aux tags, mainly because I was lazy and it's harder to do the others, but many other combinations should be possible, including read names, mapping qualities and ancillary fields such as SAM PNEXT, RNEXT and TLEN.  In theory any block can be discarded on-the-fly provided there is not a data or code-path dependency on it from a block we wish to keep.

However I also have to acknowledge the implementation is, frankly, amateurish (it's mine, so I don't offend anyone else!).  It could so trivially accept a chromosome / range and use the index instead of slice numbers, but it doesn't.  That's a toolchain problem though and not a specification issue.  This capability should also be available natively within htslib / samtools, instead of relying on a third party tool.  This fully comes under the "Compete on the Implementation" mantra.

The next step?


Filtering data, by row or column, is just a starting point.  What if we wish to add a column or modify a value?  Let us consider a hypothetical scenario.  We read the quality values for each record and produce mean and standard deviation auxiliary tags per record.  From a data processing viewpoint, our tool would decode qualities, compute, and add two new tags, with all other fields being passed through from input file to output file.  In the example below we also remove the OQ:Z tag:



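To make that scenario concrete, here is roughly how it has to be done today with the record-level htslib API, where every record is fully decoded and re-encoded even though we only ever look at the qualities (file names and the "qm"/"qs" tag names are placeholders chosen for this illustration):

/* Build (roughly): gcc meansd.c -lhts -lm */
#include <math.h>
#include <htslib/sam.h>

int main(void) {
    samFile *in  = sam_open("in.bam",  "r");   /* works equally on CRAM */
    samFile *out = sam_open("out.bam", "wb");
    if (!in || !out) return 1;

    sam_hdr_t *hdr = sam_hdr_read(in);
    if (!hdr || sam_hdr_write(out, hdr) < 0) return 1;

    bam1_t *b = bam_init1();
    while (sam_read1(in, hdr, b) >= 0) {
        /* Mean and standard deviation of the quality values */
        uint8_t *qual = bam_get_qual(b);
        int n = b->core.l_qseq;
        double sum = 0, sum2 = 0;
        for (int i = 0; i < n; i++) {
            sum  += qual[i];
            sum2 += (double)qual[i] * qual[i];
        }
        float mean = n ? sum / n : 0;
        double var = n ? sum2 / n - mean * mean : 0;
        float sd   = var > 0 ? sqrt(var) : 0;

        /* Add the two new tags and drop the original OQ tag */
        bam_aux_append(b, "qm", 'f', 4, (uint8_t *)&mean);
        bam_aux_append(b, "qs", 'f', 4, (uint8_t *)&sd);
        uint8_t *oq = bam_aux_get(b, "OQ");
        if (oq) bam_aux_del(b, oq);

        if (sam_write1(out, hdr, b) < 0) return 1;
    }

    bam_destroy1(b);
    sam_hdr_destroy(hdr);
    sam_close(out);
    return sam_close(in) < 0 ? 1 : 0;
}

With the block-level approach sketched below, the untouched columns - names, sequences and so on - would never need to be decoded at all.
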
This needs a very different API to the one currently implemented in htslib.  Perhaps this should become htslib 2.0.  Fundamentally it is scuppered by the lack of a function-level API; htslib instead operates on a data-structure-level API (where all data is available directly, not via query functions).  However in principle this could be implemented as follows.

  • API call to fetch the next record
    • On first call, library loads the entire next slice and holds compressed blocks in memory.
    • Return link to a reference to a specific record from this slice - just a link, not decoded yet.
  • API call to obtain the quality values
    • On first call, we decompress the quality block.
    • Return appropriate pointer to decompressed quality values.
  •  Compute mean and standard deviation scores
  • API call to write mean and standard deviation aux. tags.
    • On first call for this slice, create the new blocks.
    • Append data to the uncompressed in-memory mean & sd blocks
  • API call to write the new record
    • Create a new output slice if necessary.
    • Add cram record, with link to record from input slice
    • When slice is full, store entire slice.
This last part is the cunning bit.  If we have 10,000 records in our output read name block, and all of those 10,000 records are simply links to 10,000 records in our corresponding input block, then we already have a copy of the compressed block.  Thus we can spot the opportunity to avoid an expensive decode and encode step.  We'll possibly need to recompress the quality block, unless we cached the compressed form, and obviously we'll need to compress the new mean and sd value blocks.

There is no timeline for implementing this sadly, but this does not indicate that we need a new format simply to get a new library with a better API!  That would be a case of the tail wagging the dog.

CRAM 4 proposal

Just a quick one this time.

The slides for the talk I gave at the GA4GH 6th Plenary meeting in Basel, on 3rd October 2018 are now online at https://www.slideshare.net/JamesBonfield/cram4

This talk covers the CRAM v4.0 proposal, which could also become v3.1 if we restrict it to purely codec improvements.

Note that this hasn't actually changed much since the MPEG-G talk I gave 3 years ago!  It's improved slightly, but back then it was a proposal, and right now it still is.  In the interim there simply wasn't any strong community desire to eke out another 10-20% saving given it cost more CPU, but with NovaSeq we see we can gain at least 10% with no real CPU hit (more if we want to spend CPU, of course), and there are other issues, such as extra-long chromosomes, that require format updates anyway.