Saturday, September 22, 2018

Introduction

I've never had a blog before, but events over the last year or so have inspired me to have an area to write in that is more permanent and easier to find than simple Facebook ramblings.  This will primarily be about my adventures and explorations in Big Data, most notably my work on compression of genomic data, but I won't discount other forays into undiscovered (by me) lands.

I've been a fan of data compression since my early university days, with the first compression tool I wrote being a Huffman entropy encoder and decoder in 6502 assembly for the BBC Micro.  Fun days.  It worked, but was hardly earth-shattering.  It sparked a geeky passion which has stayed with me ever since.
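(For the curious, here is a minimal sketch of the code-building half of a Huffman coder, written in Python rather than 6502 assembly; the names and structure are purely my illustration of the technique, not anything salvaged from that old tool.)

    import heapq
    from collections import Counter

    def huffman_code(data):
        # Count symbol frequencies and seed a min-heap of single-symbol trees.
        # Each entry is (frequency, tie_breaker, tree); a tree is either a bare
        # symbol (leaf) or a (left, right) pair (internal node).
        freq = Counter(data)
        heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
        heapq.heapify(heap)
        if len(heap) == 1:                 # degenerate input: one distinct symbol
            return {heap[0][2]: "0"}
        tie = len(heap)
        while len(heap) > 1:               # repeatedly merge the two rarest trees
            f1, _, left = heapq.heappop(heap)
            f2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, tie, (left, right)))
            tie += 1
        codes = {}
        def walk(node, prefix):            # assign 0/1 bits down the final tree
            if isinstance(node, tuple):
                walk(node[0], prefix + "0")
                walk(node[1], prefix + "1")
            else:
                codes[node] = prefix
        walk(heap[0][2], "")
        return codes

    print(huffman_code("abracadabra"))     # frequent symbols get the short codes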

In my professional life I've mostly been doing work in and around Bioinformatics, but occasionally this gave me room to exercise my love of compression.  It started with DNA sequencing chromatograms ("traces") and the ZTR format, and much later moved on to experimental and unpublished FASTQ compressors, which subsequently led to taking part in the Sequence Squeeze contest (and somehow winning it, which paid for several Sanger Informatics punting trips among other things).  The contest led me to write a paper with Matt Mahoney, who I have to confess has been a long-time hero of mine due to his fantastic work on the PAQ compressors and running sites like the Large Text Compression Benchmark, so it was an honour to collaborate with him.

Off the back of the Sequence Squeeze contest, I was asked to work on writing a C implementation of the CRAM format, an alternative to the BAM format for storing next-generation DNA sequence alignments.  I didn't know whether to be pleased or not, but my "Great Grandboss" (how else do people refer to their boss's boss's boss?) said words to the effect of "We're looking for someone to do a quick and dirty C implementation of CRAM, and I immediately thought of you".  Umm. :-)

It's with CRAM that my blog will start proper.

As to why it is important, take a look at this widely published image from https://www.genome.gov/sequencingcostsdata/.


When the first Human Genome was published, it was a truly international effort that took over a decade to complete and cost nearly $3 billion, which included a lot of learning and development of techniques.  By the end of it, the cost of doing it again was already under a tenth of that, and now the same thing costs well under $1000 and is completed in a couple of days (along with numerous other genomes on the same instrument at the same time).  Because of this, though, we now do other experiments: we sequence cancers and "normals" (non-cancer tissue) from hundreds of patients in order to work out which genetic differences lead to the cancer, and we're even sequencing all tissues in the human body to produce a Human Cell Atlas.  In short, as the cost went down, the amount of data went up.  Up by a HUGE amount, far faster than Moore's Law for computing or Kryder's Law for storage.  Ultimately data compression cannot win against a disparity in exponential growth rates, but it's a start and we'd be foolish not to exploit it.  A good reason to scratch an itch.
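To put that last point in numbers, here is a back-of-envelope sketch (the doubling times are illustrative assumptions of mine, not measured growth rates) showing that a fixed compression ratio only buys a constant number of years when data growth outpaces storage growth.

    # Back-of-envelope: if sequence data doubles every d_data years and storage
    # capacity per dollar doubles every d_disk years, a one-off compression
    # ratio only postpones the storage crunch by a fixed number of years.
    # The default doubling times are illustrative assumptions, not measurements.
    import math

    def years_bought(ratio, d_data=1.0, d_disk=2.0):
        # Relative storage cost grows as 2**(t/d_data - t/d_disk); dividing the
        # data volume by `ratio` buys log2(ratio) of those net doublings.
        net_doublings_per_year = 1.0 / d_data - 1.0 / d_disk
        return math.log2(ratio) / net_doublings_per_year

    print(years_bought(3))    # about 3.2 years of breathing room from 3x compression
    print(years_bought(10))   # about 6.6 years -- better, but still only a fixed offset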

2 comments:

  1. As a user of other MPEG-style standards, I say: Don't let them steamroll you into requiring their stuff. They don't know how to keep things simple.

    Open source stuff, with rough consensus and working code, is far far better.

    If you need to standardize, do it at the Internet Engineering Task Force. Write a document called an RFC.

  2. This comment has been removed by a blog administrator.
