Friday, September 28, 2018

MPEG-G: the ugly


Preface:  all my blogs are my own personal views and may not necessarily be the views of my employer, the Wellcome Sanger Institute.

  

Commercialisation


I ended the last blog with the statement "history is resplendent with battles where a good PR campaign won the day".  I truly wish this wasn't a battle.  I engaged with MPEG-G from the moment I heard about it, submitting ideas to improve it, despite being instrumental in recent CRAM improvements.  I had hopes of a better format.

I bowed out after a while, making rather weak excuses about workloads.  However the honest reason I disengaged was the discovery of patent applications by people working on the format.  I wanted nothing to do with helping others make profits at the expense of the bioinformatics community.  I regret now that I helped make the format that little bit better.  I am guilty of being hopelessly naive.

I am not against commercialisation of software.   I use plenty of it daily.  Indeed I once worked on software that was sold on a semi-commercial basis, from which I personally benefited.

A commercial file format however is a different beast entirely.  It touches everything that interacts with those files.  Programs that read and write the format need to be licensed, adding a burden to software developers.

I'm also not against patents, when applied appropriately.  I can even see potential benefits to software patents, just, although the 25 year expiry is far too long in computing.  25 years ago the Intel Pentium had just come out, but I was still using an 8MB Intel 486 PC.   It seems ludicrous to think something invented back then would only just be opening up for others to use without having to pay royalties.   Holding a patent for that long in such a fast moving field is extreme - 5 to 10 years max seems more appropriate.

[Correction: a reader pointed out this should be 20 years.  Correct, and sorry for my mistake.  The point still makes sense, even if the exact dates are wrong.]

Anyway, I digress.   In my opinion commercialisation of software is nearly always best done by efficient and robust implementations, protected by copyright.

Patent applications


The first related one I became aware of was WO2016202918A1: Method for Compressing Genomic Data.  This describes a way to use sequence position plus alignment (a "CIGAR" field in SAM) to identify a previously observed sequence at that location and thus compress via redundancy removal.  This is a subset of the method I used in samcomp for the SequenceSqueeze contest and published 2 years earlier, where I used all prior sequences instead of just one.  I made prior art observations and the authors of the patent appear to have amended it slightly, but this is still an ongoing concern.
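To make the general idea concrete, here is a toy sketch of position-plus-CIGAR redundancy removal.  It is neither the claims of the patent nor the actual samcomp model; the function names and the tuple layout are made up for illustration.  It only shows the principle: if a read with the same position and CIGAR has been seen before, store just the bases that differ from it.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical record layout for this sketch: (position, CIGAR, sequence).
Read = Tuple[int, str, str]
# The payload is either a literal sequence or a list of (offset, base) edits
# relative to the previous read seen with the same (position, CIGAR).
Payload = Union[str, List[Tuple[int, str]]]
Encoded = Tuple[int, str, Payload]


def encode(reads: List[Read]) -> List[Encoded]:
    """Replace a read by its differences from the last read at the same key."""
    last_seen: Dict[Tuple[int, str], str] = {}
    out: List[Encoded] = []
    for pos, cigar, seq in reads:
        key = (pos, cigar)
        prev = last_seen.get(key)
        if prev is not None and len(prev) == len(seq):
            # Only the mismatching bases need storing.
            diffs = [(i, b) for i, (a, b) in enumerate(zip(prev, seq)) if a != b]
            out.append((pos, cigar, diffs))
        else:
            # Nothing to predict from: store the sequence verbatim.
            out.append((pos, cigar, seq))
        last_seen[key] = seq
    return out


def decode(encoded: List[Encoded]) -> List[Read]:
    """Invert encode() by replaying the same history of previous reads."""
    last_seen: Dict[Tuple[int, str], str] = {}
    reads: List[Read] = []
    for pos, cigar, payload in encoded:
        key = (pos, cigar)
        if isinstance(payload, str):
            seq = payload
        else:
            bases = list(last_seen[key])
            for i, b in payload:
                bases[i] = b
            seq = "".join(bases)
        last_seen[key] = seq
        reads.append((pos, cigar, seq))
    return reads
```

A real compressor would feed the resulting symbols into an entropy coder; this sketch stops at the redundancy-removal step.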

The same authors also later submitted a second patent involving lossy compression of quality values by forming a sequence consensus along with an associated confidence, and using that to dictate how much loss can be applied to the quality values.  This too infringes on my earlier work as it closely follows the Crumble program.

As far as I'm aware however neither of these two patents made it into the MPEG-G specification, but the authors were actively involved in the MPEG-G process and this directly led to my withdrawal.


GenomSoft-SA and GenomSys-SA



The same cannot be said, however, for this barrage of 12 patents from GenomSys:

Source: https://globaldossier.uspto.gov/#/result/publication/WO/2018071080/1

The patents all relate to how the sequence is encoded in the MPEG-G draft specification.   Many of these patents have similar problems with prior art, with some claims being sufficiently woolly that CRAM, among others, would infringe on them if granted.  Some are simply huge - over 100 pages of impenetrable lawyer speak and 80+ claims.  Fighting this has become an enormous and unwelcome time sink.

Crunchbase reports one GenomSys founder as Claudio Alberti:

Source: https://www.crunchbase.com/search/organizations/field/people/num_founded_organizations/claudio-alberti

MoneyHouse.ch reports the GenomSys board of directors as including Marco Mattavelli:

Source: https://www.moneyhouse.ch/en/company/genomsys-sa-15920162331

As reported earlier, both Alberti and Mattavelli are founders of MPEG-G:

Source: https://mpeg.chiariglione.org/standards/exploration/genome-compression/issues-genome-compression-and-storage

There is no proof that this was their intention all along, but I will leave the readers to draw their own conclusions.

As mentioned before, I was not formally accepted as a member of BSI and therefore not party to all the documents registered in the MPEG repository, only receiving communication via the ad-hoc group mailing list.  Therefore I do not know when the other members were notified of these patents (as is the requirement under ISO regulations), but I was certainly taken by surprise and would not have engaged had I known where it was headed.  As it happens I consider it fortunate that I did not become a formal part of the process, as my lack of access to the document repository also means I am untainted by any confidential knowledge, which may help with any future prior art claims.

Licensing


Patents by themselves do not automatically mean royalties.  Many patents are obtained purely to protect against others attempting to patent similar things (as patent officials are considerably more likely to find prior art in a previous patent than they are in other literature).

A patent means we need a license to use that invention.  Licenses can be granted free, but we have no idea of the intentions here.   However to date none of this has been openly discussed and most I have spoken with have been blissfully unaware of the patent applications.  There have been some suggestions that there will be an open source reference implementation, but do not assume open source means free from royalty.  The two are orthogonal issues.

Implications


My concerns over these MPEG-G patents are manifold, but include:

  • Fear:  even if a program has clear prior art and a patent is unenforceable, the fear of potential legal action being taken can dissuade people from using that program.  This may dissuade people from using my work, including CRAM and Crumble.
  • Prevention of further research: a core public Idea can be extended within patent claims as Idea+X, Idea+Y and Idea+Z.  The original inventor of Idea can then become stuck trying to find a way to continue his/her own research, simply because they didn't publicly state up front which next avenues of research they were going to continue down.  Crumble vs the 2nd Voges patent is an example of this.

    To this end, I am now trying to publicly document as many of my ideas as I can, regardless of whether I have time to implement them yet.  The downside is that this gives ideas to my competitors.
  • Needless complexity: every program touching the patents now needs to deal with licensing.  If the patent is monetised, it makes every tool a commercial tool unless the author has money to throw away.  This simply will not work with most academic groups.
  • Time sink:  fighting patents takes a LOT of time.   Patents are deliberately written to be hard to read and much like builders applying for planning permission over and over again for the same plot, assuming people will give up objecting, many subtle variations on the same patents can be applied for in the hope one of them will get through.   I have spent weeks on this, as have others.  During that time I could have been doing other more productive work, improving software elsewhere.  This costs us money and doesn't help the community as a whole.
  • Cost: an obvious one.  We do not yet know what the cost will be or how it will be applied.  It may be free, it could be a one off charge when first creating the file, or it could be a charge every time we read / write data from the file format.  Assuming a non-zero cost, this will increase the price of research and medicine.  We need to know if we spent the money wisely.  What do we get from that money that we didn't have before?

A better way?


Leonardo Chiariglione, chairman of MPEG, describes the MPEG model as this:

"Patent holders who allow use of their patents get hefty royalties with which they can develop new technologies for the next generation of MPEG standards. A virtuous cycle everybody benefits from."

It sounds sensible on the face of it, and probably is when the field is already covered by a myriad of patents.  He has concerns that there are now free alternatives to video coding arriving, such as the Alliance for Open Media, and views this as both damaging to MPEG but also damaging to the very idea of progress:

"At long last everybody realises that the old MPEG business model is now broke, all the investments (collectively hundreds of millions USD) made by the industry for the new video codec will go up in smoke and AOM’s royalty free model will spread to other business segments as well.

[...]

AOM will certainly give much needed stability to the video codec market but this will come at the cost of reduced if not entirely halted technical progress. There will simply be no incentive for companies to develop new video compression technologies, at very significant cost because of the sophistication of the field, knowing that their assets will be thankfully – and nothing more – accepted and used by AOM in their video codecs."

I believe however that he is mistaken.   There are clear non-royalty based incentives for large companies to develop new compression algorithms and drive the industry forward.  Both Google and Facebook have active data compression teams, led by some of the world's top experts in the field.   It is clear to see why: bandwidth costs money, and lots of it.  Google's Brotli algorithm is now supported by all the major browsers and Facebook's Zstd is targeting itself as a replacement for the legacy Zlib library.

There are also more enlightened approaches, where like-minded companies pay for research into "pre-competitive" areas which benefit all but are not, by themselves, considered a commercial edge.  File formats absolutely fit the bill.  Indeed one such organisation, the Pistoia Alliance, were the people who organised the Sequence Squeeze contest, which led to so many innovations in genomics data compression.

Conclusion


"Patents + extensive high profile marketing + lack of evidence for benefit" jointly add up to something tasting and smelling rather unpleasant.   The situation can, however, be rescued with appropriate licensing, (assuming a benefit exists).

I urge the community now - do not adopt MPEG-G without a cast-iron assurance that no royalties will be applied, now and forever.
