Posts about howto

Organised bioinformatics experiments

May 24th, 2008

One of the things I’ve found in two years of doing bioinformatics is that directories quickly fill up with files: usually data, scripts, and results. Working out the contents of each file is difficult because the only identifier is the name, which is confusing when many files are similarly named. Using lots of scripts gets more complicated when there are dependencies. For example, a script may need the data from one file, or depend on intermediate results produced by another script. These dependencies mean that when a set of results needs updating, which usually happens many times while producing a manuscript, the scripts need to be re-run in the correct order. Manually re-running scripts in a specific order is cumbersome and error-prone.

Read more »

How to avoid errors when processing CSV files

November 1st, 2007

A lot of bioinformatics involves reading data from files and manipulating it for analysis. For example, I spend a lot of time importing data from CSV files into my database. Doing this involves writing a script that iterates over each line of the file, then references each token in the row by its column number.

However, this is bad for two reasons. The first is that it introduces a dependency on the column number, which may change. You can fix this by updating the script, though, so it is not too bad.

The second reason is much worse, because it can introduce a silent error. If the column number is wrong, then the wrong entry is referenced. If the correct and wrong entries are both of the same type, e.g. floats, then there is a good chance you will miss the mistake, which is very bad.

One approach to fix this is to treat each row as a hash or map. I’ve laid out two examples in Ruby using the gem FasterCSV. They’re quite simple, so you should get the idea whatever language you use. Hopefully there are equivalent libraries too.

Bad example

require 'fastercsv'

FasterCSV.foreach(file_path) do |row|
  # In this instance the row is an array
  # and has to be accessed by the column number.
  # Bad, because this introduces a dependency
  # on the position of the column and doesn't
  # throw an error if you are using the wrong column
  row[column_number] # Do something here
end

Good example

require 'fastercsv'

# Set the header processing option...
FasterCSV.foreach(file_path, :headers => true) do |row|
  # ...each row now behaves like a hash, and the
  # data can be accessed using a key
  row['column_name']

  # This is dependent on the column
  # name, but not its position, so you
  # will always reference the column you expect.
  # Note that a missing column returns nil
  # rather than raising an error, so check for nil.
end

Importantly, by using a third-party library you also follow another programming best practice: don’t reinvent the wheel.
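
If you want a missing or misspelled column to fail loudly rather than quietly return nil, something like the following may help. It is only a sketch, and it uses Ruby's standard csv library (the successor to FasterCSV from Ruby 1.9 onwards) rather than the gem itself. The sample data, the column names, and the helper `column_as_floats` are all made up for illustration.

```ruby
require 'csv'

# Hypothetical sample data; the column names are invented for this sketch.
SAMPLE = <<~DATA
  gene,treatment,control
  abc1,1.5,0.9
  xyz2,2.25,1.1
DATA

# Hypothetical helper: return the named column as an array of floats.
# row.fetch raises KeyError when the header does not exist, and
# Float() raises ArgumentError when a value is not numeric, so
# neither kind of mistake can pass silently.
def column_as_floats(csv_text, column_name)
  CSV.parse(csv_text, headers: true).map do |row|
    Float(row.fetch(column_name))
  end
end

treatment = column_as_floats(SAMPLE, 'treatment')
puts treatment.inspect
```

Calling `column_as_floats(SAMPLE, 'treatmnt')` with the header misspelled raises a `KeyError` instead of handing back a column of nils.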

Comparing two populations using different graph types

October 5th, 2007

I think the title says it all. If you have two populations such as “Treatment” and “Control”, what type of graphs can you use to compare the two? Have a look at the examples, then pick the corresponding R code.

All of the charts come from either the excellent lattice package or the superb ggplot2 package. The code should also work for more than two populations.

Read more »

Deriving biological meaning from principal components analysis

August 1st, 2007

Back from Madrid. I spent three weeks there on an excellent data analysis course, which I would recommend. Not only did I learn valuable techniques, I also got the chance to spend my evenings by the pool or in Sol eating tapas - which explains the lack of posts this July. I offer this brief tutorial in recompense, continuing the theme of data analysis.

Read more »

Visualising and exploring multivariate datasets using singular value decomposition and self organising maps

July 17th, 2007

Hola from Madrid! I’ve come here for a data analysis summer school. Last week there was an interesting class on dimensionality reduction, and since multivariate datasets are prevalent in this -omic era, I thought I would post a discussion of what I learnt. The aim of this example is to illustrate one technique for visualising multivariate data, singular value decomposition, and a second technique for exploring it, self organising maps.

Read more »