Bioinformatics Zen

A blog about bioinformatics and mindfulness by Michael Barton.

Three libraries and a tool to enhance your bioinformatics coding

Coding is a fact of life in bioinformatics. If you work in the field you probably enjoy coding to some extent. It's our equivalent of PCR, western blots and sequencing. So whether your weapon of choice is Java, Perl, Python or C++, here are three packages and a tool worth a look.

Logging

The most common way to debug a script or application is to add statements that print the state of a variable. Where a variable is not what you expect, you have found the problem. Once all the bugs have been fixed, these print statements are removed, as they are not part of the end product.

Logging takes a different approach. Logging statements are included in the code in the same way as print statements. The difference, however, is that each has a priority, for example DEBUG, INFO, WARN, ERROR or FATAL. Setting the log priority to WARN, usually in a config file, prints only the WARN, ERROR and FATAL statements. DEBUG and INFO have a lower priority than WARN and are therefore ignored.

The reason I use logging is so that I can liberally scatter DEBUG logging statements, turn them on when there is a problem, then turn them off when it's fixed. That's better than adding and deleting lots of print statements.
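As a rough illustration of the priority idea, here is a minimal sketch using Python's standard logging module. The logger name and the gc_content function are made up for this example, and Python calls the WARN and FATAL levels WARNING and CRITICAL.

```python
import io
import logging


def make_logger(level, stream):
    """Return a logger that writes messages at `level` or above to `stream`."""
    logger = logging.getLogger("bioinf_example")  # hypothetical logger name
    logger.handlers.clear()   # start from a clean slate on repeated calls
    logger.propagate = False  # keep output out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(level)
    return logger


def gc_content(sequence, logger):
    """Fraction of G and C bases in a DNA sequence (illustrative helper)."""
    logger.debug("sequence length is %d", len(sequence))
    if not sequence:
        logger.warning("empty sequence, returning 0.0")
        return 0.0
    gc = sum(1 for base in sequence.upper() if base in "GC")
    logger.info("counted %d G/C bases", gc)
    return gc / len(sequence)


if __name__ == "__main__":
    # Priority WARNING: the DEBUG and INFO statements stay silent.
    quiet = io.StringIO()
    gc_content("ATGC", make_logger(logging.WARNING, quiet))
    print(repr(quiet.getvalue()))

    # Priority DEBUG: the same code now reports everything.
    verbose = io.StringIO()
    gc_content("ATGC", make_logger(logging.DEBUG, verbose))
    print(verbose.getvalue())
```

The debug and info calls stay in the code permanently; only the level passed to make_logger changes between a quiet run and a diagnostic one.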

Examples : Java - log4j, C++ - log4cxx, Perl - Log4perl, Python - logging

Unit testing

The consequences of in-house bioinformatics programs not doing what you expect have recently been in the news. You hope that your program does what you expect, but how do you know for sure? By using unit tests.

A unit testing approach creates a set of routines or methods that check your program performs as it should. Once these are written, you can write your code to the specifications of the tests. If the code doesn't do what you intended, you'll get a message when you run the tests. Even better, when you change or add to the code in the future, these unit tests will still check that it does what you originally intended.
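To make this concrete, here is a small sketch using Python's built-in unittest module. The reverse_complement function is a hypothetical piece of code under test, not something from a particular library.

```python
import unittest


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence (hypothetical helper)."""
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(sequence.upper()))


class ReverseComplementTest(unittest.TestCase):
    def test_known_sequence(self):
        self.assertEqual(reverse_complement("ATGC"), "GCAT")

    def test_round_trip(self):
        # Applying the operation twice should give back the input.
        sequence = "GATTACA"
        self.assertEqual(
            reverse_complement(reverse_complement(sequence)), sequence
        )

    def test_rejects_unknown_base(self):
        # This sketch signals a bad base with KeyError; other designs are possible.
        with self.assertRaises(KeyError):
            reverse_complement("ATGX")


if __name__ == "__main__":
    unittest.main()
```

Running the file executes all three tests. If a later change breaks reverse_complement, the failing test names which expectation no longer holds.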

Examples : Java - JUnit, C++ - CppUnit, Perl - PerlUnit, Python - PyUnit

Object relational mapping

I think one of the best tips in bioinformatics is to use a database to store all of your data. Accessing a database inside code is often rather cumbersome though, requiring some rather unwieldy generation of SQL statements and an ad hoc knowledge of databases. Enter object relational mapping.

ORM removes the need for these rather ridiculous "in code" SQL statements. Instead, every row in a database table is treated as an object, and each field or variable of the object corresponds to a column entry. So rather than hard coding SQL statements, you create an object for each database entry and the ORM package takes care of the rest, running the SQL and updating the database in the background.

You still need a basic knowledge of relational mapping and databases, but ORM takes the drudgery out of using a database in computer programs. I switched to using ORMs a couple of months ago and have never looked back.
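The row-as-object idea can be sketched in a few lines. This is a hand-rolled toy built on Python's standard sqlite3 module, not the API of SQLObject or any other real ORM package; the gene table, the Gene class and its method names are all invented for illustration.

```python
import sqlite3


class Gene:
    """One row of a hypothetical `gene` table, exposed as an object."""

    def __init__(self, name, length, id=None):
        self.id = id          # primary key, None until the row is saved
        self.name = name      # maps to the `name` column
        self.length = length  # maps to the `length` column

    @staticmethod
    def create_table(conn):
        conn.execute(
            "CREATE TABLE IF NOT EXISTS gene ("
            "id INTEGER PRIMARY KEY, name TEXT, length INTEGER)"
        )

    def save(self, conn):
        """Insert a new row, or update the row this object already maps to."""
        if self.id is None:
            cursor = conn.execute(
                "INSERT INTO gene (name, length) VALUES (?, ?)",
                (self.name, self.length),
            )
            self.id = cursor.lastrowid
        else:
            conn.execute(
                "UPDATE gene SET name = ?, length = ? WHERE id = ?",
                (self.name, self.length, self.id),
            )
        conn.commit()

    @classmethod
    def find_by_name(cls, conn, name):
        """Return the first matching row as a Gene, or None if absent."""
        row = conn.execute(
            "SELECT id, name, length FROM gene WHERE name = ?", (name,)
        ).fetchone()
        return None if row is None else cls(row[1], row[2], id=row[0])


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    Gene.create_table(conn)
    gene = Gene("geneA", 1200)  # made-up example values
    gene.save(conn)             # runs the INSERT
    gene.length = 1500
    gene.save(conn)             # runs the UPDATE
    print(Gene.find_by_name(conn, "geneA").length)
```

The calling code only touches attributes and save; the SQL lives in one place. A real ORM package generates that SQL for you from the class definition, which is where the saving in drudgery comes from.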

Examples : Java - Hibernate, C++ - OpenORM, Perl - Rose::DB, Python - SQLObject

Automated building

Not a library or package, but a great way of chaining a set of commands together. For example: compile a set of source files, package them up, run the unit tests, send logging statements to a specified file, then export them to a web page. Automated building usually becomes more important the larger the project gets, but it comes in handy anywhere you repeatedly use the same set of commands. I even use automated building to compile my LaTeX documents. But maybe that's a bit too much geekery!
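The core idea behind these tools, named steps that run in dependency order, can be sketched in plain Python. This is only a toy under stated assumptions: the target names and print actions are placeholders, there is no cycle detection or timestamp checking, and a real build tool would run actual commands instead.

```python
def run_target(name, targets, done=None, log=None):
    """Run `name` after its dependencies, each target at most once.

    `targets` maps a target name to (list of dependencies, action).
    Returns the list of target names in the order they were run.
    """
    done = set() if done is None else done
    log = [] if log is None else log
    if name in done:
        return log
    dependencies, action = targets[name]
    for dependency in dependencies:
        run_target(dependency, targets, done, log)
    action()
    done.add(name)
    log.append(name)
    return log


# Hypothetical targets; the actions only print what a real step would do.
TARGETS = {
    "compile": ([], lambda: print("compiling sources")),
    "test": (["compile"], lambda: print("running unit tests")),
    "package": (["compile", "test"], lambda: print("packaging")),
}

if __name__ == "__main__":
    run_target("package", TARGETS)
```

Asking for the package target runs compile once, then test, then package, even though compile is listed as a dependency of both later steps.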

Examples : Java - Ant/Maven, C++ - make, Perl - PerlBuildSystem, Python - SCons