Posts tagged with research

Why data testing is important in computational research

December 31st, 2007

I wrote in a previous post about the importance of testing in computational research. If you’re developing a piece of software, functional testing is essential. However, we bioinformaticians don’t just develop software, we also have to develop conclusions and hypothesis, based on data, as well as code we’ve written. Here is an example of why I think data testing is as equally important as functional testing in research.

Imagine you want to see if the structural stability of protein correlates with the number of disulphide bonds. To do this you can create two methods in a Ruby mixin.

module CysteineStatistics
  def n_cys
    to_s.downcase.scan('c').length
  end
 
  def avg_cys
    n_cys / to_s.length.to_f
  end
end

These two methods can be be mixed into the String class to add the functionality. For the sake of this demonstration I’m using String, but in reality you would look at adding this to the Bioruby Bio::Sequence class.

class String
  include CysteineStatistics
end

Being a responsible bioinformatician I’ll write some testing code to make sure that the methods do what I expect.

class TestCysteineStatistics < Test::Unit::TestCase
 
  def test_n_cysteines
    small_sequence = 'AAACAAA'
    assert_equal(small_sequence.n_cys,1)
  end
 
  def test_avg_cysteines
    small_sequence = 'AAACAAA'
    assert_equal(small_sequence.avg_cys,1.0/7.0)
  end
end

Again in real life I would probably write a few more tests just to make sure. Both these tests pass though, so the code is doing what I want. Next I write a short script to analyse my set of protein sequences.

# Read in each sequence, then print out the average number of cysteines
  CSV.open(out_file,'w') do |csv|
  CSV.open(in_file,'r') {|row| csv << [row[0], row[1].avg_cys]}
end

Which gives me a file containing average cysteine count for all my protein sequences. Next I would correlate these values with some structural stability measure I’ve worked out. Here’s what the file looks like.

1,0.0203291384317522
2,0.0248447204968944
3,0.0388349514563107
.
.
.

This is fine and I could finish the post here. The point I would like to make though, that in an ideal world this is not the end of the story, and before I move on to the next stage of my research I’m going to test my data.

class TestCysteineStatistics < Test::Unit::TestCase
 
  # A shortcut method to find the entry in the data file
  def entry(id)
    CSV.open('count.csv','r') {|row| return row[1].to_f if row[0] == id.to_s}
  end
 
  def test_avg_cysteines_for_2
    assert_equal(0.025,entry(2))
  end
 
  def test_avg_cysteines_for_3
    assert_equal(0.0392156862745098,entry(3))
  end
end

What I’ve done for this script is take two proteins from my original data set and calculate their average cysteine content by hand, then compare this with the calculated value. Running this test produces the following result.

Loaded suite test_data
Started
FF
Finished in 0.018369 seconds.

1) Failure:
test_avg_cysteines_for_2(TestCysteineStatistics) [test_data.rb:11]:
<0.025> expected but was
<0.0248447204968944>.

2) Failure:
test_avg_cysteines_for_3(TestCysteineStatistics) [test_data.rb:15]:
<0.0392156862745098> expected but was
<0.0388349514563107>.

2 tests, 2 assertions, 2 failures, 0 errors

Both tests fail, indicating a difference in what I’ve written my code to do, and what I’ve calculated by hand. So where is the error? The problem is I’ve divided the number of cysteines by the length of the sequence, which in theory is correct. However, when DNA sequence is translated to protein, stop codons are translated and added as a special character ‘*’, meaning that the length of the protein sequence is actually one residue longer than is correct. Something that was not picked up by the code testing, but was by the data testing.

Summary
Code testing will find all the possible errors that you can think of, data testing will find all the errors you don’t. This could seem laborious, and will not be applicable to every situation, but can be useful in the real world.

All the code in this tutorial can be found here.

Good programming versus biological intuition

November 20th, 2007

Good programming versus biological intuition

As I write my first paper, my biggest worry is that my results are wrong. In particular, that my code, which I think does one thing, has a bug and does something different. This, in turn, produces inaccurate results and leads me to incorrect conclusions. I then produce a paper where the story I am telling is wrong.

Read more »

Three stories about science and the web

October 19th, 2007

Picture of many different web logos

Collaborating on the same document

Tom, Dick, and Harry are collaborating on a paper. Tom, being the PhD student, does all the work and then writes the paper. Tom then sends a copy to Dick and Harry who edit it with their opinions. Unfortunately Dick completely removes the second paragraph of the discussion, while Harry expands it. Both then send their edited copies back to Tom.

Read more »

Five resources for beginning bioinformaticians

October 4th, 2007

Lists

Back from a weeks holiday in Hungary just in time for my, hopefully, last ever year as a student. Last month I had a flurry of work completing a report and poster for the end of my second year, but now I’m aiming to work hard and try and get at least two papers out in my final year: in time to write up my thesis.

But now, to coincide with the beginning of the academic year, and the time that new PhD and Masters students start, I thought I would share some the resources that I found useful through out the course of my own Masters degree, then first two years of PhD.

Read more »

My open science notebook

September 10th, 2007

Notebook

I said, a month ago, that there would be an open science section for this month’s issue of Bio::Blogs. Unfortunately this wasn’t the case. There are a number of theories as to why this didn’t materialise, including global warming, the recent stock market panic, or the appointment of new British Prime Minister. However research has suggested that it was because I didn’t work on it; though the other reasons are linked to having had secondary effects.

Read more »