February 2008 edition of Bio::Blogs

February 1st, 2008

This is the February edition of Bio::Blogs, an aggregation of bioinformatics posts from the past month. This issue has a particular focus on Open Notebook Science - researchers sharing their work as they produce it.

For this edition, with the help of Cameron Neylon, Jean-Claude Bradley, and Pedro Beltrao, I’ve tried to write a short essay on open science as well as the open notebook variety. I’ve posted this on my research website.

Read more »

World of Bioinformatics Quest: Character generation

January 28th, 2008

In World of Bioinformatics QuestTM (WoBQ) having the right character that suits your style is essential. You may think that a hot shot Rubyist is the coolest class to be, but remember that you have to play this character for the next 50 years. In general it’s better to be a character that you’ll enjoy playing, rather than one that will get you more publications but have less fun with. This page will guide you through all the parts of character generation for WoBQ.

Read more »

Three stories about science and the web : The movie

January 21st, 2008

In a previous post I wrote about how great new web tools are making it easier for scientists to collaborate, find information, and share information. This light-hearted introduction was rather popular, so heres’s a tongue-in-cheek video version.

Decouple the file parsing from the analysis

January 7th, 2008

A common task in bioinformatics is to read data from a set of files, arrange into the required format, then run an analysis to verify or falsify your expectation. An example would be reading in the yeast interaction network, and protein evolution rates, then correlating the two sets of data to see if there is a trend. Using Perl, you would specify how each file gets read in, arrange the sets of data by gene name, then correlate the two.

Read more »

Why data testing is important in computational research

December 31st, 2007

I wrote in a previous post about the importance of testing in computational research. If you’re developing a piece of software, functional testing is essential. However, we bioinformaticians don’t just develop software, we also have to develop conclusions and hypothesis, based on data, as well as code we’ve written. Here is an example of why I think data testing is as equally important as functional testing in research.

Imagine you want to see if the structural stability of protein correlates with the number of disulphide bonds. To do this you can create two methods in a Ruby mixin.

module CysteineStatistics
  def n_cys
    to_s.downcase.scan('c').length
  end
 
  def avg_cys
    n_cys / to_s.length.to_f
  end
end

These two methods can be be mixed into the String class to add the functionality. For the sake of this demonstration I’m using String, but in reality you would look at adding this to the Bioruby Bio::Sequence class.

class String
  include CysteineStatistics
end

Being a responsible bioinformatician I’ll write some testing code to make sure that the methods do what I expect.

class TestCysteineStatistics < Test::Unit::TestCase
 
  def test_n_cysteines
    small_sequence = 'AAACAAA'
    assert_equal(small_sequence.n_cys,1)
  end
 
  def test_avg_cysteines
    small_sequence = 'AAACAAA'
    assert_equal(small_sequence.avg_cys,1.0/7.0)
  end
end

Again in real life I would probably write a few more tests just to make sure. Both these tests pass though, so the code is doing what I want. Next I write a short script to analyse my set of protein sequences.

# Read in each sequence, then print out the average number of cysteines
  CSV.open(out_file,'w') do |csv|
  CSV.open(in_file,'r') {|row| csv << [row[0], row[1].avg_cys]}
end

Which gives me a file containing average cysteine count for all my protein sequences. Next I would correlate these values with some structural stability measure I’ve worked out. Here’s what the file looks like.

1,0.0203291384317522
2,0.0248447204968944
3,0.0388349514563107
.
.
.

This is fine and I could finish the post here. The point I would like to make though, that in an ideal world this is not the end of the story, and before I move on to the next stage of my research I’m going to test my data.

class TestCysteineStatistics < Test::Unit::TestCase
 
  # A shortcut method to find the entry in the data file
  def entry(id)
    CSV.open('count.csv','r') {|row| return row[1].to_f if row[0] == id.to_s}
  end
 
  def test_avg_cysteines_for_2
    assert_equal(0.025,entry(2))
  end
 
  def test_avg_cysteines_for_3
    assert_equal(0.0392156862745098,entry(3))
  end
end

What I’ve done for this script is take two proteins from my original data set and calculate their average cysteine content by hand, then compare this with the calculated value. Running this test produces the following result.

Loaded suite test_data
Started
FF
Finished in 0.018369 seconds.

1) Failure:
test_avg_cysteines_for_2(TestCysteineStatistics) [test_data.rb:11]:
<0.025> expected but was
<0.0248447204968944>.

2) Failure:
test_avg_cysteines_for_3(TestCysteineStatistics) [test_data.rb:15]:
<0.0392156862745098> expected but was
<0.0388349514563107>.

2 tests, 2 assertions, 2 failures, 0 errors

Both tests fail, indicating a difference in what I’ve written my code to do, and what I’ve calculated by hand. So where is the error? The problem is I’ve divided the number of cysteines by the length of the sequence, which in theory is correct. However, when DNA sequence is translated to protein, stop codons are translated and added as a special character ‘*’, meaning that the length of the protein sequence is actually one residue longer than is correct. Something that was not picked up by the code testing, but was by the data testing.

Summary
Code testing will find all the possible errors that you can think of, data testing will find all the errors you don’t. This could seem laborious, and will not be applicable to every situation, but can be useful in the real world.

All the code in this tutorial can be found here.