Bioinformatics Zen

A blog about bioinformatics and mindfulness by Michael Barton

Continuous, reproducible genome assembler benchmarking

New bioinformatics software is always being produced and published. This stream of new developments makes it difficult to keep track of the software available for common bioinformatics tasks. Genome assembly is one example: a large amount of software already exists for it.

If you are researching which bioinformatics software to use, it can be difficult to understand how effective each piece of software will be. For example, given a new genome assembler publication, how can you know how well it will assemble your data? A publication may include benchmarks, but these may not be applicable to all types of data, or may include authorship bias in the way the results are presented.

Even if you know which genome assembler is best for your data, another question is how easy it is to install. Does the software require complex dependencies such as language libraries or other third-party software? How easy is it to debug the software if it fails? Does it give cryptic error messages? The poor usability of much bioinformatics software can lead scientists to use the software that their colleagues know how to use, rather than what may be the state of the art.

A registry of assemblers

There is precedent for objective benchmarking of genome assemblers, namely the Assemblathon and GAGE.

I believe that these objective evaluations are critical for the bioinformatics community. The development of performance benchmarks, not just for genome assembly but for all common tasks, helps researchers determine which software they should be using in their research. I believe that this can be taken a step further to solve the problems I described above. For instance, using genome assembly again as an example:

I created nucleotid.es, a registry of genome assemblers and benchmarks, with the aim of satisfying these goals. This website provides a, currently short, list of genome assemblers. Each assembler is shown alongside benchmarks produced by assembling a test set of reads and then comparing the result back to the reference genome. Finally, each assembler is packaged as a Docker image so that it can be downloaded from Docker.io.

Reproducible genome assembly benchmarks using Docker

An important part of this is that all assemblers are packaged as Docker images. If you are unfamiliar with Docker, an image is built from a Dockerfile, which is analogous to a list of instructions or a blueprint specifying how an assembler should be installed and used. If a system has Docker installed, the image built from this blueprint provides everything required to get an assembler running. This simplifies installation: each assembler can be installed with the docker pull command.
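As a rough illustration of how small the installation step becomes, here is a minimal Python sketch that builds and runs a docker pull command. The image name example/assembler is a placeholder I made up for this example, not one of the images in the registry, and the helper names are hypothetical too.

```python
import subprocess


def docker_pull_command(image, tag="latest"):
    """Return the argument list for pulling one assembler image."""
    return ["docker", "pull", f"{image}:{tag}"]


def pull(image, tag="latest"):
    """Run docker pull; requires Docker to be installed on this machine."""
    # check=True raises CalledProcessError if the pull fails.
    subprocess.run(docker_pull_command(image, tag), check=True)


if __name__ == "__main__":
    # "example/assembler" is a hypothetical image name used for illustration.
    print(" ".join(docker_pull_command("example/assembler")))
```

Keeping the command construction separate from the call that executes it means the command can be inspected or logged before anything is downloaded.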

If assemblers are packaged as Docker images using a common API, then running benchmarks against a variety of reference data is much simpler. Each assembler image can be pulled and run against test data. The assembled contigs are then compared against the reference genome for accuracy using quast. These benchmarks make up the homepage of nucleotid.es. Eventually, by running against a variety of test datasets, this site will provide a way to see which assembler performs best for different kinds of data.
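To make the idea concrete, here is a sketch of what such a benchmark loop could look like in Python. It only builds the commands rather than running them. The container arguments (a reads path followed by an output directory), the mount points /inputs and /outputs, the output file name contigs.fa and the image names are all assumptions made for this example, since the actual common API is not described here. The quast invocation uses its -r (reference) and -o (output directory) options.

```python
from pathlib import Path


def assemble_command(image, reads, out_dir):
    """Build a docker run command for one assembler image.

    The trailing container arguments are a hypothetical common API:
    every image is assumed to accept <reads> <output directory>.
    """
    reads = Path(reads).resolve()
    out_dir = Path(out_dir).resolve()
    return [
        "docker", "run", "--rm",
        "--volume", f"{reads.parent}:/inputs:ro",  # reads mounted read-only
        "--volume", f"{out_dir}:/outputs",         # contigs are written here
        image,
        f"/inputs/{reads.name}", "/outputs",
    ]


def quast_command(contigs, reference, report_dir):
    """Build the quast command comparing contigs with the reference genome."""
    return ["quast.py", str(contigs), "-r", str(reference), "-o", str(report_dir)]


def benchmark_plan(images, reads, reference, work_dir):
    """List the commands needed to benchmark every image on one dataset."""
    plan = []
    for image in images:
        # Give each image its own output directory with a filesystem-safe name.
        name = image.replace("/", "_").replace(":", "_")
        out_dir = Path(work_dir) / name
        plan.append(assemble_command(image, reads, out_dir))
        # "contigs.fa" is an assumed output file name.
        plan.append(quast_command(out_dir / "contigs.fa", reference, out_dir / "quast"))
    return plan


if __name__ == "__main__":
    # Hypothetical image names, used only for illustration.
    for command in benchmark_plan(["example/assembler-a", "example/assembler-b"],
                                  "reads.fq", "reference.fa", "work"):
        print(" ".join(command))
```

Because every image is called the same way, adding another assembler to the benchmark only means adding one more name to the list.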