Posts tagged with unit testing

Why data testing is important in computational research

December 31st, 2007

I wrote in a previous post about the importance of testing in computational research. If you’re developing a piece of software, functional testing is essential. However, we bioinformaticians don’t just develop software, we also have to develop conclusions and hypothesis, based on data, as well as code we’ve written. Here is an example of why I think data testing is as equally important as functional testing in research.

Imagine you want to see if the structural stability of protein correlates with the number of disulphide bonds. To do this you can create two methods in a Ruby mixin.

module CysteineStatistics
  def n_cys
    to_s.downcase.scan('c').length
  end
 
  def avg_cys
    n_cys / to_s.length.to_f
  end
end

These two methods can be be mixed into the String class to add the functionality. For the sake of this demonstration I’m using String, but in reality you would look at adding this to the Bioruby Bio::Sequence class.

class String
  include CysteineStatistics
end

Being a responsible bioinformatician I’ll write some testing code to make sure that the methods do what I expect.

class TestCysteineStatistics < Test::Unit::TestCase
 
  def test_n_cysteines
    small_sequence = 'AAACAAA'
    assert_equal(small_sequence.n_cys,1)
  end
 
  def test_avg_cysteines
    small_sequence = 'AAACAAA'
    assert_equal(small_sequence.avg_cys,1.0/7.0)
  end
end

Again in real life I would probably write a few more tests just to make sure. Both these tests pass though, so the code is doing what I want. Next I write a short script to analyse my set of protein sequences.

# Read in each sequence, then print out the average number of cysteines
  CSV.open(out_file,'w') do |csv|
  CSV.open(in_file,'r') {|row| csv << [row[0], row[1].avg_cys]}
end

Which gives me a file containing average cysteine count for all my protein sequences. Next I would correlate these values with some structural stability measure I’ve worked out. Here’s what the file looks like.

1,0.0203291384317522
2,0.0248447204968944
3,0.0388349514563107
.
.
.

This is fine and I could finish the post here. The point I would like to make though, that in an ideal world this is not the end of the story, and before I move on to the next stage of my research I’m going to test my data.

class TestCysteineStatistics < Test::Unit::TestCase
 
  # A shortcut method to find the entry in the data file
  def entry(id)
    CSV.open('count.csv','r') {|row| return row[1].to_f if row[0] == id.to_s}
  end
 
  def test_avg_cysteines_for_2
    assert_equal(0.025,entry(2))
  end
 
  def test_avg_cysteines_for_3
    assert_equal(0.0392156862745098,entry(3))
  end
end

What I’ve done for this script is take two proteins from my original data set and calculate their average cysteine content by hand, then compare this with the calculated value. Running this test produces the following result.

Loaded suite test_data
Started
FF
Finished in 0.018369 seconds.

1) Failure:
test_avg_cysteines_for_2(TestCysteineStatistics) [test_data.rb:11]:
<0.025> expected but was
<0.0248447204968944>.

2) Failure:
test_avg_cysteines_for_3(TestCysteineStatistics) [test_data.rb:15]:
<0.0392156862745098> expected but was
<0.0388349514563107>.

2 tests, 2 assertions, 2 failures, 0 errors

Both tests fail, indicating a difference in what I’ve written my code to do, and what I’ve calculated by hand. So where is the error? The problem is I’ve divided the number of cysteines by the length of the sequence, which in theory is correct. However, when DNA sequence is translated to protein, stop codons are translated and added as a special character ‘*’, meaning that the length of the protein sequence is actually one residue longer than is correct. Something that was not picked up by the code testing, but was by the data testing.

Summary
Code testing will find all the possible errors that you can think of, data testing will find all the errors you don’t. This could seem laborious, and will not be applicable to every situation, but can be useful in the real world.

All the code in this tutorial can be found here.

Good programming versus biological intuition

November 20th, 2007

Good programming versus biological intuition

As I write my first paper, my biggest worry is that my results are wrong. In particular, that my code, which I think does one thing, has a bug and does something different. This, in turn, produces inaccurate results and leads me to incorrect conclusions. I then produce a paper where the story I am telling is wrong.

Read more »

Three libraries and a tool to enhance your bioinformatics coding

March 30th, 2007

Coding is fact of life for bioinformatics. If you work in bioinformatics you probably enjoy coding to some extent. It’s our equivalent to PCR, western blots and sequencing. So whether your weapon of choice is Java, Perl, Python or C++, here’s three packages and a tool worth a look.

Read more »