Organised bioinformatics experiments
May 24th, 2008One of the things I’ve found in two years of doing bioinformatics, is that directories quickly fill up with files, usually data, scripts, and results. Working out the contents of each file is difficult as the only identifier is the name, which with lots of similarly named files, is confusing. Using lots of scripts gets more complicated when there are dependencies. For example scripts need the data from one file, or are dependent on an intermediate set of results from the output of another script. These dependencies mean that when a set of results needs updating, usually many times when producing a manuscript, scripts need to be re-run in the correct order. The requirement of manually re-running scripts in a specific order is cumbersome, and easily generates errors.
A previous post I wrote tried to address some of these problems by being strict about directory and file naming, however this does not solve dependencies between files. For the last year I’ve been working with Ruby on Rails, and I think many of the techniques used in this framework are also applicable to bioinformatics experiment structuring.
- Always use a database and object relational management library for data, not flat files
- Use a single Make-type file in each directory, instead of many scripts
- Put code related to data in models, put analysis code in make-like tasks
- Use validations and testing to verify data and code
Use a database and object relational management system
My first point is the most important, and the one that has made my work much easier. Without exception, always use a database to store data. Manipulating flat files in scripts is hard work, and is also a source of bugs. The only time I open a flat file, is to load the contents into a database. A database is allows you to access data in a standard format, in the same way, every time. The alternative is to have lots of little snippets of code to manipulate different flat file formats, then link each dataset with hashes and arrays: messy and very hard to maintain if new datasets are added.
Databases are useful but the language to access them (SQL) has a steep learning curve. Object relational management libraries (ORM) allow you to access a database using an object orientated approach in the language you are familiar with. SQL is still useful for creating complex joins between different datasets, but unless you have to use SQL, use ORM instead, and the library takes care of all the SQL for you. Here’s and example, with out a single line of SQL, using Datamapper, the Ruby ORM library.
The first block of code defines the link between the database table and the Ruby class. The name of the class (Gene) corresponds to the table (genes). The properties (name & sequence) correspond to columns in the table, and get/set methods for the object.
class Gene < DataMapper::Base property :name, :string property :sequence, :text end
Using this class, data can be manipulated using object oriented programming, making the code more readable. Even if you’re not familiar with Ruby I think the below code is simple and easy to understand, which would not necessarily be the case for hard-coded SQL statements.
# Create object using accessors gene = Gene.new gene.name = 'ALS2' gene.sequence = 'ATG...' gene.save! # Search and get an array of # records matching a specific criteria genes = Gene.all( :name => 'ALS2' ) # Delete the first record in this array genes.first.destroy!
Use make-type files instead of scripts
As I start doing some bioinformatics, I’ll need to write more scripts, and soon I’ll have a directory full of files. So this brings me to next point, instead of scripts I recommend using make-type files. If you’re not familiar with make files, the best way to explain is with an example. Here’s an illustration using the Ruby version of make - Rake. I’ve created an example set of analyses where instead of each being a script they are instead a task in the Rake file.
desc 'Delete all exiting rows' task :delete_sequences do # Clear the sequence table end desc 'Load protein sequences into the databases' task :load_sequences => :delete_sequences do # Load a set of DNA sequences into my database end desc 'Calculates statistics for gene sequences' task :sequence_stats do # Calculate sequence mean and SD. end desc 'Reset project and repeat all analysis' task :rebuild => [ :delete_sequences, :load_sequences, :sequence_stats]
As the example shows, the characteristic of a Rake file is a set of named tasks that each perform an action. Within these tasks any valid Ruby code can be inserted. Therefore anything I would put in a script can instead be put in a task. Rake also allows me to manage the dependencies between the tasks, so if I need to repeat one set of analyses, the corresponding dependencies are also run. In my example, before my gene sequences are loaded, any pre-existing rows in the database are first cleared. I’ve also created a task called rebuild that clears all the data project and repeats the analysis from scratch.
Calling “rake -T” at the command line list all the tasks outlined in the file. This is a handy way of keeping track of what all the tasks in your project are doing. This would also be useful for anyone else who was looking at your research.
rake :rebuild rake :delete_sequences rake :load_sequences rake :sequence_stats
Keep data code in models, and analysis code in tasks.
My next point is to decouple the code that gets data from the database, from the code that analyses it. My reasoning is that you don’t care how the data is retrieved, only that it is as you expect. If I change the way I get sequences from the database, I don’t need to the change the corresponding analysis code as long as it is returned in the expected format. Decoupling the two sets of code, also stops it being repeated. If I need the same data else where, I can access it from the model, without having to cut and paste, which makes code easier to maintain. Here’s an example for calculating the mean of a set of sequences. The data fetching code is in the Gene class, and manipulation of the data is in the rake task.
gene.rb
def self.mean_length genes = Gene.find(:all) lengths = genes.map{|gene| gene.sequence.length} lengths.to_statarray.mean end
analysis.rake
desc 'Calculates statistics for gene sequences' task :sequence_stats do File.open('results.txt','w') do |file| file.puts "Mean length : #{ Gene.mean_length }" end end
Use testing and validations
The importance of code testing is widely discussed, so I’m not going to go into detail. Validations however receive less attention, but I believe they can be very important in computational research. The majority of data in bioinformatics has been compiled by a script, and you can’t be 100% sure that the data you’re given is in the format you expect, or even that you’re not making mistakes yourself. Validations are a really simple way of making sure that the data is what you expect. Here’s an simple one line validation that makes sure the DNA sequence is coding, by ensuring that it begins with a start codon, finishes with a stop codon, and contains only valid DNA sequence. I use this in the following method so that any sequences that don’t match this criteria will print an error at the command line, when the data is loaded.
# Checks that the sequence has a start codon, # a stop codon, and contains only ATGCs # and line breaks validates_format_of :sequence, :with => /^ATG[ATGC\n]+(TAG|TAA|TGA)$/im def self.create_from_flatfile(entry) gene = Gene.new gene.name = entry.definition.split(/\s+/).first gene.sequence = entry.data if gene.valid? gene.save! else puts "#{gene.name} is not valid sequence " # Could raise an error here instead # Or even better write to an error log end end
Summary
I’ve given some examples on how to use these principles, but they have been very brief. I’ve implemented a complete example on Github so if you are interested how this can be put into practice, pull the repository and take a look. I’ve done this in Ruby as it’s the language I know, but all major languages have libraries to implement all these techniques, so the language you choose is less important than using good programming practices for the task you’re working on.
May 24th, 2008 at 9:20 pm
[...] Mike over at Bioinformatics Zen has written a more thorough post about organised bioinformatics experiments with examples using Rake and DataMapper. Definitely check that [...]
May 24th, 2008 at 9:38 pm
Excellent article as always Mike and thanks for the link. I especially like the DataMapper examples. I haven’t had a chance to try out DM yet. Lately I’ve been using Sequel but the ‘define schema in the model’ aspect of DM is very slick. Just when I thought I had pushed myself too far into software obscurity you come along and make me feel less alone. Awesome.
May 25th, 2008 at 1:35 am
Thank Mike for this excellent idea to solve the problem. I also encounter this problem in my Bioinformatics research. However, I use Python for programming. I will find the related modules in Python.
May 25th, 2008 at 1:40 am
[...] Organised bioinformatics experiments [...]
May 25th, 2008 at 2:42 am
Wubin: Something similar to Ruby’s Datamapper in the Python world are either SQLObject or SQLAlchemy.
I certainly prefer the ORM approach as opposed to writing SQL expressions, and try to use it whenever I can get away with it.
May 25th, 2008 at 9:27 am
Great post again Mike. I will surely try this out as I always end up having too many scripts in my working directory. Or you did a whole analysis and then you’re told to do the same thing for this-and-this gene.
Two questions: (1) do you rewrite a little Gene class within every project (tweaking it to just do what is necessary within that project), or do you have a “master” gene class defined somewhere that you use whenever there’s a new project? We all know the second option should probably be preferred in the long run, but we all end up doing the first…
And (2): this approach means you have to do everything in ruby isn’t it? However, several of the steps in my own workflows can often more easily be done with linux commands (grep, sort, uniq and wc, anyone?) How do you handle that? Do you use a ruby equivalent? Or do you use a “system(’sort’)” in your rakefiles?
Good idea of putting an example on github…
jan.
May 25th, 2008 at 7:29 pm
Just had a little more time to actually look at your example on github. I think I’ll try to use this approach on my next project. What I find particularly useful as well is how you load task-specific Rakefiles (the 001 namespace) into the project Rakefile (using the project-task-step I described here.
May 26th, 2008 at 2:56 pm
Thanks for your comments guys, I’m always very flattered when you take the time to write that something I’ve done could be useful.
@Adam
I think DataMapper and Sequel are interchangeable in terms of the attention they are getting in the Ruby community. I’ve never tried sequel though. Comparing DM and ActiveRecord, the best point about DM is that it does away with Migrations which are a bit heavy weight for this type of work. However on the downside, I’m not sure how many AR plugins will work with DM. An example would be acts_as_reportable which I don’t think does.
@Wubin and @Andrew
I’ve never used Python so I’ve haven’t got a clue about the libraries. Andrew’s suggestions look good. Another option could be to look at what Django uses, as I think it’s somewhat equivalent to Rails so will probably have a handy ORM involved somewhere.
@Jan
As you say in an ideal world you there would be an complete gene class that could be interchangeable for any project. At the moment I’m only using this approach in one project, so I can’t really say what I would do, and to be honest I hadn’t really thought about this until you mentioned it. I agree though that a set of generic classes would useful. Also a set of validations that could included when required would be useful as well, as these are things that I spend most of my time messing around with until they work. BioSQL could be place to start for a set of generic ORM classes though? As always it’s the problem of trying to fit in writing all the sexy Ruby libraries I’d like compared with publishing something.
As for your next point, it’s true that you can’t write everything in Ruby. For bash scripting, Ruby allows ` quoting for running Unix commands. For example `cat data.txt | sort` and so forth. It’d be nice to write everything in Ruby, but I guess pragmatism comes first before style.
May 27th, 2008 at 10:00 am
Another thought (sorry
Something I’ve bumped into myself using rake for this type of thing…
I’m sure you tried this as well: within your sequence_stats task, you obviously would want to add the prerequisite that the sequences are loaded in the first place, so instead of
task :sequence_stats do
you’d do
task :sequence_stats => :load_sequences do
However, this doesn’t work in rake because it can’t know if you did your loading or not. As a result, it would load the sequences every single time you want to get the sequence stats. Do you know of a way to get this working? I wonder if we can use timestamps on files to do this… Or would we have to tweak rake so that it takes into account some “status” metatable in the database? Any suggestions appreciated…
May 27th, 2008 at 11:17 am
Yes I did think about this a bit. I made that the assumption that I would know whether the sequences were in the database or not, therefore there would not need to be a dependency. I think adding a database status that would run the load_sequence task if necessary could start to make maintaining the Rakefile a little complicated. As a compromise you could add a check on the status of the database and notify the user to run the corresponding task. Some thing like.
On the other hand, the rebuild task runs all of the tasks in the correct order when the project is run from scratch, so in this case you would know that all the required data was in the database.
I do see what you mean about a meta-table of information, and it would take care of things like this, but I think it could add an extra layer of complexity. Another option is to have a hard and a soft load_sequences task. The hard task clears the database and loads all the sequences. The soft task only loads them if there are none . The stats task could then be dependent on the soft load_sequence task.
May 27th, 2008 at 11:19 am
Also Jan, Jay Fields has a great article on running bash processes from Ruby
May 27th, 2008 at 12:43 pm
Thanks for those thoughts, Mike. I like the idea of the ‘hard’ and ’soft’ task.
May 27th, 2008 at 12:57 pm
This does the trick: sequences are only loaded _once_ if necessary (notice the new prerequisite for sequence_stats):
May 27th, 2008 at 1:43 pm
I really like how you’re implemented that Jan. I’ve added the change to the repository. I didn’t add the STDERR messages as I prefer to do this using a project logger. I hope you don’t mind.
I wonder in future if it might be worth packaging this up into a Rails type gem focused on organising bioinformatics experiments. I think it might take a fair amount of work, but could ultimately prove very worth while.
May 27th, 2008 at 4:27 pm
Good suggestion about that gem. We’ll have to think about how that would work, though.
I actually just performed a little project using this approach, and it works great!
May 27th, 2008 at 8:41 pm
There’s an excellent make-like tool that handles prerequisites, allows “shelling out”, etc. — it’s called Make. Ruby is great, and so is the shell (as already admitted by most). You don’t reimplement grep, sort, uniq, etc. in Ruby (though you certainly could), you just use the shell commands. “make” is a shell command, you should learn it along with all the other arcane Rubyesque wizardry …
May 28th, 2008 at 10:42 pm
@Aaron
I think we’re all in agreement that make is a great utility. I personally use it almost everyday. But Make isn’t always ideal, otherwise there wouldn’t be so many alternatives (ant, rake, scons, omake, etc, etc). Rake is not so much a replacement for Make as it is a tool designed in the spirit of make that lets you define tasks and their prerequisites easily in Ruby. You probably shouldn’t use Rake to build C or Fortran programs, and you probably shouldn’t use Make to manipulate a database.
May 29th, 2008 at 11:19 am
[...] 2008-05-24: Excellent advice from Bioinformatics Zen about how to organize bioinformatics experiments. See this post. [...]
May 29th, 2008 at 1:40 pm
@Mike:
`cat data.txt | sort`
Definitely a nomination for the Useless Use of Cat Award.
http://partmaps.org/era/unix/award.html
June 11th, 2008 at 6:56 pm
I am glad this is ruby, rather than perl. Shows people which language is indeed better, and there are so many people still relying on perl rather than ruby (or python, which I think is better than perl as well, although the syntax in ruby is much clear from my point of view when writing. And I have written quite some python code as well.)
June 12th, 2008 at 3:18 pm
All,
To solve the issue we discussed above about keeping track of what tasks have already been done in your Rakefile, I have written an extension to rake. Rake is ideal if you’re working with files because it takes the timestamps of the files into consideration. However, no files are created when you’re loading stuff into a database. I tried to solve that by extending rake so that it puts timestamps in a little meta table in the database as well.
You can get it at http://github.com/jandot/biorake/tree/master
jan.
June 13th, 2008 at 9:00 am
How do you manage ‘To-Do’ lists?
For example, let’s say you don’t have the time to test some particular parameters in your statistical analysis, but you want to have a memo in case you will be able to do it later.
Do you use the same revision control system software, or do you have something like a personal bug-tracker/feature report installed in your computer?
June 14th, 2008 at 9:26 am
That’s not always as easy as it seems.
Many laboratories and people have used flat-files for years: that means that if you work in such a place, all the scripts, programs produced and used internally by everyone are based on flat files.
What will you do then? If you spend time trying re-writing or adapting all these scripts, your colleagues will most likely think you are kind of wasting time, because the flat-file based script already works :).
Think of biopython, bioperl, emboss: all of these are meant to be used with flat-files, even if you can re-adapt them with a small amount of work.
Wow, I didn’t know of make before!
I am using it now and it is a really good tool.
A good primer on using make for bioinformaticists is also in http://www.swc.scipy.org/
I try to validate and test my data as often as possible.
I think there is an additional complexity in the problem of testing for bioinformatics software: you have to check that your scripts don’t make any mystake, and moreover, you have to check that your results are compatible with their biological context.
This last kind of testing is more likely what other scientists do when they use negative and positive controls in their experiments.
However, do you know whether there are any common guidelines for testing bioinformatics scripts?
They would be very useful.
Well, I really appreciated your post. You are really a good bioinformatician because you are so keen in sharing your experience and known-how with others, publicly.
Thank you very much: I am going to install a mysql database on my computer, and I didn’t know of ORM modules, so, if only there were more people like you in bioinformatics :).
June 16th, 2008 at 6:20 pm
@Andrew
Point taken, I will try to curb my cat abuse habit.
@Michael
I agree, Python and Ruby syntax make coding much more of a pleasure and would recommend either of these two to any bioinformatician. Though I’ll continue to write in Ruby as this is the language I know.
@Gioby
For to-do lists I use lighthouse, as it suits my personal preference. However there’s plenty of tools available for keeping todo lists, and I think it comes down to which one suits you.
I agree with your point about old software that works. I would say that if you have a system that you still have to maintain and change, then “upgrade” to more easily maintainable system. In your example I would say leave it be, as you say, if it works then your extra efforts could be used elsewhere.
June 21st, 2008 at 6:16 pm
[...] Zen GTA4 is a competitive inhibitor of blogging Organised bioinformatics experiments [...]
July 4th, 2008 at 3:53 pm
I’m not convinced. It’s very important for me to be able to look at the data, and to look at old data. It’s very convenient for me to be able to use Unix command-line tools to count the data, sort the data, randomly sort the data, remove duplicates, sum columns, compute average + standard deviation, etc. I sometimes have to move files between different computers and different operating systems, and I don’t have to worry over whether someone else’s DB configuration is different. Also, I will still be able to look at that data 3 years from now, when I have a new version of my DB installed that won’t read the old files.
July 17th, 2008 at 3:15 pm
That’s fair enough Phil. I understand that using Unix like scripts to manually look at the data. However I would say that when you start manipulating and joining dataset in scripts, this is when a database is essential, as it will make things so much easier - especially when combined with an ORM.
As to your point about database format interoperability, that is a valid point. I think most database allow the export of common data formats such as CSV, which is ideal for situations such as creating supplementary materials in manuscripts.
October 17th, 2008 at 2:06 am
[...] a comment » As Mike keeps reminding me, getting your data into database tables is A Good Thing. Like many people, my database of choice is [...]