Posts about best practice

Organised bioinformatics experiments

May 24th, 2008

One of the things I’ve found in two years of doing bioinformatics is that directories quickly fill up with files, usually data, scripts, and results. Working out the contents of each file is difficult because the only identifier is the name, which is confusing when many files are similarly named. Using lots of scripts gets more complicated when there are dependencies. For example, a script may need the data from one file, or depend on an intermediate set of results produced by another script. These dependencies mean that when a set of results needs updating, which usually happens many times when producing a manuscript, the scripts need to be re-run in the correct order. Manually re-running scripts in a specific order is cumbersome and error-prone.
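The ordering problem described above can be pictured as a dependency graph. The following is a minimal sketch using Ruby's standard `tsort` library; it is only an illustration of the idea, not necessarily the approach the full post takes, and the script names and the `run_order` helper are hypothetical.

```ruby
require 'tsort'

# Hypothetical scripts, each mapped to the scripts whose output it needs.
SCRIPT_DEPENDENCIES = {
  'plot_results.rb'       => ['correlate.rb'],
  'correlate.rb'          => ['parse_interactions.rb', 'parse_rates.rb'],
  'parse_interactions.rb' => [],
  'parse_rates.rb'        => []
}.freeze

# Returns the scripts ordered so that every script appears after
# everything it depends on.
def run_order(dependencies)
  each_node  = ->(&block) { dependencies.each_key(&block) }
  each_child = ->(script, &block) { dependencies.fetch(script).each(&block) }
  TSort.tsort(each_node, each_child)
end

# Print the scripts in an order that is safe to run them in.
run_order(SCRIPT_DEPENDENCIES).each { |script| puts script }
```

Writing the dependencies down once means the order no longer has to be remembered by hand, and a circular dependency surfaces as a `TSort::Cyclic` error instead of a silently stale result.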


Decouple the file parsing from the analysis

January 7th, 2008

A common task in bioinformatics is to read data from a set of files, arrange it into the required format, then run an analysis to verify or falsify your expectation. An example would be reading in the yeast interaction network and protein evolution rates, then correlating the two sets of data to see if there is a trend. Using Perl, you would specify how each file gets read in, arrange the sets of data by gene name, then correlate the two.
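As a rough sketch of what keeping the two concerns apart might look like, the example below is written in Ruby to match the other code on this page rather than Perl. The file layouts (columns `gene_a`, `gene_b`, `gene`, `rate`) and all function names are assumptions made for illustration; real interaction and evolution-rate files will differ.

```ruby
require 'csv'

# Parsing: each function knows about exactly one (hypothetical) file
# layout and returns a plain hash of gene name => number.
def parse_interaction_counts(csv_text)
  counts = Hash.new(0)
  CSV.parse(csv_text, headers: true) do |row|
    counts[row['gene_a']] += 1
    counts[row['gene_b']] += 1
  end
  counts
end

def parse_evolution_rates(csv_text)
  rates = {}
  CSV.parse(csv_text, headers: true) do |row|
    rates[row['gene']] = Float(row['rate'])
  end
  rates
end

# Analysis: works on plain arrays and hashes and knows nothing about files.
def pearson_correlation(xs, ys)
  n = xs.size.to_f
  mean_x = xs.sum / n
  mean_y = ys.sum / n
  covariance = xs.zip(ys).sum { |x, y| (x - mean_x) * (y - mean_y) }
  spread_x = Math.sqrt(xs.sum { |x| (x - mean_x)**2 })
  spread_y = Math.sqrt(ys.sum { |y| (y - mean_y)**2 })
  covariance / (spread_x * spread_y)
end

# Glue: line the two data sets up by gene name, then hand them over.
def correlate_by_gene(first, second)
  shared = first.keys & second.keys
  pearson_correlation(shared.map { |gene| first[gene].to_f },
                      shared.map { |gene| second[gene].to_f })
end
```

With this split, a change in a file's layout only touches its parser, and the correlation can be checked on small hand-made arrays without any files at all.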


Good programming versus biological intuition

November 20th, 2007


As I write my first paper, my biggest worry is that my results are wrong. In particular, I worry that my code, which I think does one thing, has a bug and does something different. This, in turn, produces inaccurate results and leads me to incorrect conclusions. I then produce a paper where the story I am telling is wrong.


How to avoid errors when processing CSV files

November 1st, 2007

A lot of bioinformatics involves reading data from files to manipulate them for our analysis. For example, I spend a lot of time importing data from CSV files into my database. Doing this involves creating a script to iterate over each line of the file, then referencing each token in the row by its column number.

However, this is bad for two reasons. The first is that it introduces a dependency on the column number, which may well change. You can fix this by changing the script, though, so this is not too bad.

The second reason is much worse, because it could introduce a silent error. If the column number were wrong, the wrong entry would be referenced. If the correct and wrong entries were both of the same type, e.g. floats, there is a chance you would miss the mistake, which is very bad.

One approach to fix this is to treat each row as a hash or map. I’ve laid out two examples in Ruby using the FasterCSV gem. They’re quite simple, so you should get the idea whatever language you use; hopefully there are equivalent libraries too.

Bad example

require 'fastercsv'

FasterCSV.foreach(file_path) do |row|
  # In this instance the row is an array
  # and has to be accessed by the column number.
  # Bad, because this introduces a dependency
  # on the position of the column and doesn't
  # throw an error if you are using the wrong column
  row[column_number] # Do something here
end

Good example

require 'fastercsv'

# Set the header processing option...
FasterCSV.foreach(data_path, :headers => true) do |row|
  # ...each row now behaves like a hash, and the
  # data can be accessed using a key
  row['column_name']

  # This is dependent on the column
  # name, but not its position.
  # A missing column name gives nil rather
  # than another column's value, so you
  # will always reference the column you expect
end

Importantly, by using a third-party library you also follow another programming best practice: don’t reinvent the wheel.
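For readers on a current Ruby, the same header-based idea can be sketched with the standard `csv` library, which is what FasterCSV became from Ruby 1.9 onwards. This is a minimal sketch under that assumption; the `column_values` helper and the column names in the usage note are hypothetical. It uses `CSV::Row#fetch`, which raises a `KeyError` for an unknown header, whereas `row['name']` returns `nil`.

```ruby
require 'csv'

# Hypothetical helper: collects the values found in one named column.
def column_values(csv_text, column_name)
  CSV.parse(csv_text, headers: true).map do |row|
    # fetch raises KeyError when the header is absent, so a
    # misspelt column name fails loudly instead of yielding nil.
    row.fetch(column_name)
  end
end
```

For example, with a file whose header line is `gene,rate`, asking for `'rate'` returns the values as strings, while asking for a misspelling such as `'rat'` stops the script with a `KeyError` rather than carrying on with missing data.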

Be more productive - throw away your mouse

August 22nd, 2007


The mouse, or two-dimensional motion pointing device, is undoubtedly useful, especially when you’re new to computers. It lets you open windows, click around, and explore. However, the more time you spend using a computer, and the more proficient you become, the more the mouse becomes a hindrance to how fast you can work.
