Tuesday, October 9, 2018

Reflections on the GA4GH conference

From the 3rd to the 5th of October we had a large GA4GH conference, attended by around 400 people from across the world and from many sectors (academia, pharma, manufacturers, medical, and so on).  It was an exciting meeting with lots of discussion, important face-to-face networking opportunities, and some genuinely moving and inspiring talks.

"Collaborate on the interfaces; compete on the implementations"


This quote came from Ewan Birney's opening address to the delegates, and I couldn't agree more with the philosophy.  "Interface" can mean multiple things here, from the htsget and refget protocols to the underlying file formats.  GA4GH embraces the commercial sector and actively wants it involved, and many commercial players have been instrumental in setting new standards.  The file formats are our pre-competitive interchange.  They are the common language we speak, and it benefits everyone, for-profit and non-profit organisations alike, if they are free for all to use.  This still leaves room for clever and optimised implementations, and no one raises objections to, say, a commercial FPGA implementation of BAM.


CRAM: format vs toolchain implementation


In this light, I also wish to discuss what CRAM can do as a format versus what the implementations currently offer us, because sadly the two differ.  We should not fall into the trap of wishing for a new format when a new implementation would suffice.


Firstly, it is up to the encoder to decide which data goes where.  Earlier CRAM implementations put most auxiliary tags into the same block, which improves compression in some cases and harms it in others.  Newer ones put each tag into its own block (which, again, is sometimes good and sometimes bad for compression). We have files that are smaller with samtools 1.3 than with 1.9, and vice versa.  Clearly there is room for optimisation and improvement over time, while still adhering to the same CRAM bitstream format.  In this regard, CRAM shares the same methodology as MPEG-G.
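
To make this concrete, htslib already exposes several encoder parameters that change how a CRAM file is laid out without changing the format itself.  The sketch below (my own illustration, with arbitrarily chosen values and error handling trimmed) uses hts_set_opt and the CRAM_OPT_* options from htslib's hts.h to transcode a BAM file into CRAM with non-default settings; the result is still plain CRAM that any conformant decoder can read.

    /* Sketch: transcode BAM to CRAM with non-default encoder parameters.
     * The output is still standard CRAM; only the encoder's layout choices
     * differ.  Option values here are arbitrary examples. */
    #include <htslib/sam.h>

    int main(int argc, char **argv) {
        if (argc < 4) return 1;               /* usage: prog in.bam out.cram ref.fa */
        samFile *in  = sam_open(argv[1], "r");
        samFile *out = sam_open(argv[2], "wc");            /* "wc" = write CRAM */
        hts_set_opt(out, CRAM_OPT_REFERENCE, argv[3]);     /* reference FASTA   */
        hts_set_opt(out, CRAM_OPT_VERSION, "3.0");         /* target CRAM 3.0   */
        hts_set_opt(out, CRAM_OPT_SEQS_PER_SLICE, 25000);  /* bigger slices     */
        hts_set_opt(out, CRAM_OPT_USE_LZMA, 1);            /* extra codec       */

        bam_hdr_t *hdr = sam_hdr_read(in);
        sam_hdr_write(out, hdr);

        bam1_t *b = bam_init1();
        while (sam_read1(in, hdr, b) >= 0)
            sam_write1(out, hdr, b);

        bam_destroy1(b);
        bam_hdr_destroy(hdr);
        sam_close(out);
        sam_close(in);
        return 0;
    }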

Cram_filter


I covered the htslib REQUIRED_FIELDS parameter in an earlier blog, but in summary it indicates to the CRAM decoder which fields we wish to retrieve, and in turn tells it which fields it may choose to discard during reading.  However, avoiding unnecessary decompression and decoding is a rather naive solution.  There are times when we wish to manipulate files at the block level, for fast streaming in the context of, for example, an htsget server.
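
As a refresher, here is roughly what that hint looks like from the API side: a minimal sketch of my own (error handling trimmed) that counts records while telling the CRAM reader that only FLAG and QUAL are required, via hts_set_opt and CRAM_OPT_REQUIRED_FIELDS.

    /* Count records while telling the CRAM decoder that only FLAG and QUAL
     * are required, so it is free to skip decoding the other data series. */
    #include <stdio.h>
    #include <htslib/sam.h>

    int main(int argc, char **argv) {
        if (argc < 2) return 1;
        samFile *in = sam_open(argv[1], "r");

        /* Only FLAG and QUAL are needed; other fields may be discarded. */
        hts_set_opt(in, CRAM_OPT_REQUIRED_FIELDS, SAM_FLAG | SAM_QUAL);

        bam_hdr_t *hdr = sam_hdr_read(in);
        bam1_t *b = bam_init1();
        long n = 0;
        while (sam_read1(in, hdr, b) >= 0)
            n++;
        printf("%ld records\n", n);

        bam_destroy1(b);
        bam_hdr_destroy(hdr);
        sam_close(in);
        return 0;
    }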

A rather crude tool, cram_filter from io_lib, demonstrates this capability in a proof-of-concept fashion.  Currently cram_filter takes a slice number range and simply emits those slices using nothing more than seek, read and write calls.  It is *extremely* fast (hence well suited to serving CRAM files over htsget).

We can also handle the vertical component, termed data series in CRAM and descriptors in MPEG-G, but fundamentally both are simply blocks.  Cram_filter permits us to specify which tags we wish to include or drop.  Sadly it currently only permits filtering of quality values and aux tags, mainly because I was lazy and the others are harder to do, but many more combinations should be possible, including read names, mapping qualities and ancillary fields such as SAM PNEXT, RNEXT and TLEN.  In theory any block can be discarded on-the-fly provided there is no data or code-path dependency on it from a block we wish to keep.

However, I also have to acknowledge that the implementation is, frankly, amateurish. (It's mine, so I'm not offending anyone else!)  It could quite trivially accept a chromosome and range and use the index instead of slice numbers, but it doesn't.  That is a toolchain problem though, not a specification issue.  This capability also ought to be available natively within htslib / samtools, rather than relying on a third-party tool.  All of this falls squarely under the "Compete on the implementations" mantra.

The next step?


Filtering data, by row or column, is just a starting point.  What if we wish to add a column or modify a value?  Let us consider a hypothetical scenario: we read the quality values for each record and produce mean and standard deviation auxiliary tags per record.  From a data-processing viewpoint, our tool would decode the qualities, compute the statistics, and add two new tags, with all other fields being passed straight through from input file to output file.  In the example below we also remove the OQ:Z tag:
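
A rough sketch of that tool, written against today's record-level htslib API, follows; the ms:f and sd:f tag names are purely invented for this illustration.  Note how every record has to be fully decoded and re-encoded, even though most of its fields are merely being copied from input to output.

    /* Rough sketch using today's record-level htslib API.  The ms:f and sd:f
     * tag names are invented for this example.  Every record is fully decoded
     * and re-encoded, even though most fields are simply copied through. */
    #include <math.h>
    #include <htslib/sam.h>

    int main(int argc, char **argv) {
        if (argc < 4) return 1;               /* usage: prog in.cram out.cram ref.fa */
        samFile *in  = sam_open(argv[1], "r");
        samFile *out = sam_open(argv[2], "wc");
        hts_set_opt(out, CRAM_OPT_REFERENCE, argv[3]);
        bam_hdr_t *hdr = sam_hdr_read(in);
        sam_hdr_write(out, hdr);

        bam1_t *b = bam_init1();
        while (sam_read1(in, hdr, b) >= 0) {
            uint8_t *qual = bam_get_qual(b);
            int n = b->core.l_qseq;
            double sum = 0, sum2 = 0;
            for (int i = 0; i < n; i++) {
                sum  += qual[i];
                sum2 += (double)qual[i] * qual[i];
            }
            double m = n ? sum / n : 0;
            double v = n ? sum2 / n - m * m : 0;
            float mean = (float)m;
            float sd   = (float)(v > 0 ? sqrt(v) : 0);

            bam_aux_append(b, "ms", 'f', 4, (uint8_t *)&mean);  /* mean quality   */
            bam_aux_append(b, "sd", 'f', 4, (uint8_t *)&sd);    /* std. deviation */

            uint8_t *oq = bam_aux_get(b, "OQ");                 /* drop OQ:Z      */
            if (oq) bam_aux_del(b, oq);

            sam_write1(out, hdr, b);
        }

        bam_destroy1(b);
        bam_hdr_destroy(hdr);
        sam_close(out);
        sam_close(in);
        return 0;
    }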



Doing this at the block level, passing the untouched blocks straight through, needs a very different API to the one currently implemented in htslib.  Perhaps this should become htslib 2.0.  Fundamentally the idea is scuppered by the lack of a function-level API; htslib instead offers a data-structure-level API (where all data is available directly, not via query functions).  In principle, however, it could be implemented as follows.

  • API call to fetch the next record.
    • On the first call, the library loads the entire next slice and holds its compressed blocks in memory.
    • Return a link to a specific record within this slice - just a link, nothing decoded yet.
  • API call to obtain the quality values.
    • On the first call, we decompress the quality block.
    • Return a pointer to the appropriate decompressed quality values.
  • Compute the mean and standard deviation scores.
  • API call to write the mean and standard deviation aux tags.
    • On the first call for this slice, create the new blocks.
    • Append the data to the uncompressed in-memory mean and sd blocks.
  • API call to write the new record.
    • Create a new output slice if necessary.
    • Add a CRAM record, with a link to the record from the input slice.
    • When the slice is full, store the entire slice.

This last part is the cunning bit.  If we have 10,000 records in our output read-name block, and all 10,000 of them are simply links to 10,000 records in the corresponding input block, then we already have a copy of the compressed block.  Thus we can spot the opportunity to avoid an expensive decode and encode step.  We will possibly need to recompress the quality block, unless we cached the compressed form, and obviously we will need to compress the new mean and sd value blocks.
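
Purely as a thought experiment, the call sequence above might look something like the sketch below.  None of these types or functions exist in htslib or io_lib; every name is invented for illustration only.

    /* Thought experiment only: none of these types or functions exist in
     * htslib or io_lib today; the names are invented to illustrate the call
     * sequence described in the list above. */
    cram_stream *in  = cram2_open("in.cram",  "r");      /* hypothetical */
    cram_stream *out = cram2_open("out.cram", "w");      /* hypothetical */
    cram_rec *rec;

    while ((rec = cram2_next_record(in)) != NULL) {
        /* The first call loads the next slice and keeps its blocks compressed
         * in memory; rec is just a link into that slice, nothing decoded yet. */

        int len;
        const uint8_t *qual = cram2_get_qual(rec, &len);
        /* First use decompresses the quality block; later calls are cheap. */

        float mean, sd;
        compute_mean_sd(qual, len, &mean, &sd);          /* user code */

        cram2_append_aux_f(rec, "ms", mean);  /* appended to new uncompressed */
        cram2_append_aux_f(rec, "sd", sd);    /* in-memory blocks             */

        cram2_write_record(out, rec);
        /* Stores a link back to the input record; when the output slice is
         * full, blocks whose records are all links can be copied verbatim in
         * compressed form, and only the modified blocks are (re)compressed. */
    }
    cram2_close(out);
    cram2_close(in);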

Sadly there is no timeline for implementing this, but that is no reason to demand a new format simply to get a new library with a better API!  That would be a case of the tail wagging the dog.
