Posts tagged with Ruby

BioRuby and Ruby on Rails: Active BioRecords

March 6th, 2008

A common practice in any computational field is writing code whose functionality has already been implemented by someone else. This is usually called reinventing the wheel. It isn’t very useful, since you’re spending time on an intermediate step when you could use existing code and jump ahead to the next step in your research. Of course, it’s easy for me to call this bad practice on my blog, but I’m the worst offender. I work in bioinformatics because I like writing code to solve problems, and my first response is to start coding rather than to check whether someone has already created a solution. The benefit of using existing libraries, though, is that you can build new things on what has already been done.

Bioinformatics Zen FAQ

February 12th, 2008

I guess one of the golden rules of blogging is to write about what people are interested in. Here are the questions I’m most often emailed, and my answers to them.

Why data testing is important in computational research

December 31st, 2007

I wrote in a previous post about the importance of testing in computational research. If you’re developing a piece of software, functional testing is essential. However, we bioinformaticians don’t just develop software; we also have to develop conclusions and hypotheses based on data, as well as on code we’ve written. Here is an example of why I think data testing is just as important as functional testing in research.

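Imagine you want to see if the structural stability of a protein correlates with the number of disulphide bonds. To do this you can create two methods in a Ruby mixin: one to count the cysteine residues that could form such bonds, and one to give that count relative to the sequence length.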

module CysteineStatistics
  def n_cys
    to_s.downcase.scan('c').length
  end
 
  def avg_cys
    n_cys / to_s.length.to_f
  end
end

These two methods can be mixed into the String class to add the functionality. For the sake of this demonstration I’m using String, but in reality you would look at adding this to the BioRuby Bio::Sequence class.

class String
  include CysteineStatistics
end
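
As a rough sketch of why the mixin calls to_s rather than working on the string directly: any class that can describe itself as a string could include it. The ProteinSequence class below is a hypothetical stand-in made up for this illustration; it is not the BioRuby Bio::Sequence API.

```ruby
# The mixin from above, repeated so this example runs on its own.
module CysteineStatistics
  def n_cys
    to_s.downcase.scan('c').length
  end

  def avg_cys
    n_cys / to_s.length.to_f
  end
end

# Hypothetical stand-in for a sequence class; NOT the BioRuby API.
class ProteinSequence
  include CysteineStatistics

  def initialize(residues)
    @residues = residues
  end

  # The mixin only needs the object to describe itself as a string.
  def to_s
    @residues
  end
end

seq = ProteinSequence.new('AAACAAA')
puts seq.n_cys    # one cysteine in seven residues
puts seq.avg_cys  # that count divided by the length, as a float
```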

Being a responsible bioinformatician I’ll write some testing code to make sure that the methods do what I expect.

require 'test/unit'

class TestCysteineStatistics < Test::Unit::TestCase
 
  def test_n_cysteines
    small_sequence = 'AAACAAA'
    # assert_equal takes the expected value first, then the actual value
    assert_equal(1, small_sequence.n_cys)
  end
 
  def test_avg_cysteines
    small_sequence = 'AAACAAA'
    assert_equal(1.0/7.0, small_sequence.avg_cys)
  end
end

Again in real life I would probably write a few more tests just to make sure. Both these tests pass though, so the code is doing what I want. Next I write a short script to analyse my set of protein sequences.

require 'csv'

# Read in each sequence, then write out its id and average cysteine content
CSV.open(out_file, 'w') do |csv|
  CSV.foreach(in_file) { |row| csv << [row[0], row[1].avg_cys] }
end

This gives me a file containing the average cysteine content of each of my protein sequences. Next I would correlate these values with some structural stability measure I’ve worked out. Here’s what the file looks like.

1,0.0203291384317522
2,0.0248447204968944
3,0.0388349514563107
.
.
.

This is fine and I could finish the post here. The point I would like to make, though, is that in an ideal world this is not the end of the story: before I move on to the next stage of my research I’m going to test my data.

require 'test/unit'
require 'csv'
class TestCysteineStatistics < Test::Unit::TestCase
 
  # A shortcut method to find the entry in the data file
  def entry(id)
    CSV.foreach('count.csv') {|row| return row[1].to_f if row[0] == id.to_s}
  end
 
  def test_avg_cysteines_for_2
    assert_equal(0.025,entry(2))
  end
 
  def test_avg_cysteines_for_3
    assert_equal(0.0392156862745098,entry(3))
  end
end

What I’ve done for this script is take two proteins from my original data set and calculate their average cysteine content by hand, then compare each hand-calculated value with the one my code wrote to the data file. Running this test produces the following result.

Loaded suite test_data
Started
FF
Finished in 0.018369 seconds.

1) Failure:
test_avg_cysteines_for_2(TestCysteineStatistics) [test_data.rb:11]:
<0.025> expected but was
<0.0248447204968944>.

2) Failure:
test_avg_cysteines_for_3(TestCysteineStatistics) [test_data.rb:15]:
<0.0392156862745098> expected but was
<0.0388349514563107>.

2 tests, 2 assertions, 2 failures, 0 errors

Both tests fail, indicating a difference between what I’ve written my code to do and what I’ve calculated by hand. So where is the error? I divided the number of cysteines by the length of the sequence, which in theory is correct. However, when a DNA sequence is translated to protein, the stop codon is translated too and added as the special character ‘*’, so the protein sequence string is one character longer than the true number of residues. This was not picked up by the code testing, but it was by the data testing.
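
One possible fix, sketched here under the assumption that ‘*’ only ever marks a translated stop codon and never a real residue, is to drop it before counting. The residues helper is a name introduced for this illustration; it is not part of the original code. The failing values above are consistent with, for example, four cysteines in a 160-residue protein: 4/160 = 0.025 by hand, but 4/161 ≈ 0.02484 once the ‘*’ is counted in the length. The actual sequences aren’t shown here, so treat those counts as illustrative.

```ruby
module CysteineStatistics
  # Hypothetical helper: the sequence with stop-codon markers removed.
  # Assumes '*' is never a real residue.
  def residues
    to_s.delete('*')
  end

  def n_cys
    residues.downcase.count('c')
  end

  # Fraction of residues that are cysteine; NaN for an empty sequence.
  def avg_cys
    n_cys / residues.length.to_f
  end
end

class String
  include CysteineStatistics
end

with_stop = 'AAACAAA*'
# With the fix, the trailing '*' no longer changes the result.
puts with_stop.avg_cys == 'AAACAAA'.avg_cys
```

A code test using a sequence that ends in ‘*’ would now guard against the same mistake creeping back in.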

Summary
Code testing will find the errors that you can think of; data testing can find the ones you didn’t think of. This may seem laborious, and it will not be applicable to every situation, but it can be useful in the real world.

All the code in this tutorial can be found here.

Bioinformatics : which programming language to use?

March 14th, 2007

Two recent posts discuss programming languages in bioinformatics, one at biowhat and the other at Omics! Omics!. Both consider what type of language to use: heavy weight languages such as C++ and Java versus lighter scripting languages such as Perl, Ruby and Python.

I think this depends on what your research goals are. If your aim is to build a tool for biologists, then you probably need an application building language such as C or Java. On the other hand, if you want to find an answer to a biological question, then it’s a lot easier to create a short Perl script that manipulates the data to produce the desired result.

Heavy weight
My background is biology rather than computing science, but I find languages like Java encourage a better coding style, which is what you want if you’re working on a large project. Object-oriented features such as polymorphism and encapsulation help to prevent bugs. The syntax of these languages is often a lot stricter; object types are declared, and generics can be used to further enforce that the correct types are used. Development environments such as Eclipse and NetBeans can also make development relatively quick. On the other hand, using a language like this to strip a set of protein names from a file can be rather cumbersome and somewhat overkill.

Light weight
Perl was originally designed for extracting and manipulating text, with regular expressions built into the language. This is still very useful in biology, given the vast array of non-standard formats that biological data is distributed in. If you want to quickly strip data from a file, then Perl is by far the best choice. This is probably what has made Perl the most popular choice of language in bioinformatics, and led to the incredibly successful BioPerl project: a very useful set of libraries for performing common bioinformatics tasks, created and maintained by the community.
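
To give a feel for the kind of quick text stripping described here, the sketch below pulls protein names out of FASTA-style header lines with a regular expression. It is written in Ruby rather than Perl, to match the other code on this page; the idea carries over directly. The records and names are made up for illustration.

```ruby
# Made-up FASTA-style records; in practice this text would come from a file.
fasta = <<~FASTA
  >prot_A hypothetical protein
  MKCAAC
  >prot_B another made-up entry
  MAAACK
FASTA

# Header lines start with '>'; capture the first whitespace-delimited token.
names = fasta.scan(/^>(\S+)/).flatten

puts names.inspect  # ["prot_A", "prot_B"]
```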

Specialised
If you want to create a non-linear mixed effects model, or solve a series of stochastic differential equations, then you’ll need a language designed with a specific set of functions in mind. Examples are the impenetrably named “R” for statistics, and the more descriptive Matlab/Mathematica for, unsurprisingly, mathematics. Numerical languages such as these also provide tools for dealing with the sometimes tricky binary imprecision problem, where storing a base 10 fraction in base 2 format can lead to small inaccuracies.
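
The imprecision problem is easy to reproduce. The following is a small sketch in Ruby (used here only for consistency with the other examples on this page) showing the classic case where 0.1 + 0.2 is not exactly 0.3 in binary floating point, along with two common workarounds: comparing within a tolerance, and using exact rational arithmetic. The tolerance of 1e-9 is an arbitrary choice for illustration.

```ruby
sum = 0.1 + 0.2

# Neither 0.1 nor 0.2 has an exact base 2 representation, so the sum is off
# by a tiny amount and an exact comparison fails.
exact_match = (sum == 0.3)

# Workaround 1: compare within a small tolerance (1e-9 is arbitrary).
close_enough = (sum - 0.3).abs < 1e-9

# Workaround 2: use exact fractions instead of floats.
rational_match = (Rational(1, 10) + Rational(2, 10) == Rational(3, 10))

puts exact_match     # false
puts close_enough    # true
puts rational_match  # true
```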

Of course no programming language is a golden hammer that can solve all of your problems. Each has its own place. In my work I use a combination of Java, Ant and Hibernate to maintain a large omic database. I then use R to pull the data and run my statistical analyses. Using a database also decouples stripping the data out of the files from running the statistical analysis. Have I mentioned before that databases are great?