Posts about programming

git, github, and bioinformatics software development

April 15th, 2008

Github, a source code management (SCM) repository host based on git, has exited beta and is ready for people to sign up. Git and github offer interesting opportunities for bioinformatics software development, and I think it’s worth taking a few minutes to explore them. There’s a free option too, so it doesn’t cost anything to sign up and play around.

Source code management

Github is based on git. If you’re familiar with a source code management tool like subversion, git uses a similar command syntax and should only take about 20 minutes to become familiar with. Git improves on earlier SCM tools in several ways, and one of the first things I noticed is how much faster it is than subversion. If you’ve ever used subversion, you’ll also know that it creates a .svn directory in every subdirectory of the project, which can make a project rather difficult to share and maintain. Git, on the other hand, creates a single .git directory in the root of the project, so if you want to share the project minus git revision control, you can just delete this directory.

Git also simplifies the situation where more than one developer is working on a project and each needs to work on the same code, which will inevitably lead to conflicts between the different versions. Git’s approach allows each developer to copy the main project and work on their own version. This copy can be modified and committed to, while nothing is sent back to the master copy. Only when you decide to push your changes are they sent back to the original, at which point the maintainer decides which changes to merge into the original version. Subversion does have the option to create branches, but I find that git’s interface is much simpler and gives the developer more freedom to take risks and try out new code.

Social Software

Github builds on git and takes the easy branching feature a step further to create a social software site. I know everyone and their dog is creating a social [insert verb]ing application/site, but you might find that github’s approach makes a difference to how you develop software. Github makes it possible to see who is creating branches of your project, visualised as a network in which branch and merge points are shown on a timeline.

Image of github network feature

As a use case, I’m working on a manuscript and I have a set of ruby classes which I’ve been using in my analysis. I think these might be useful to other bioinformaticians, and I’d like to contribute them to the BioRuby library. To do this, I have to contact the BioRuby mailing list with my suggestion, get CVS access, make my changes, then commit them to the trunk. Were BioRuby a git repository, I could fork it at the beginning of my project, edit BioRuby as I am doing my research, then when my manuscript is done I could prune and tidy my changes and push them back as a patch. Even better, with github’s network feature, anyone interested in BioRuby can see that I’ve forked it, follow the link to my changes and see what I’m doing, even before I’ve committed my changes back to the main project. The BioRuby developers spend a lot of time maintaining the code and so are entitled to tell me what I can do with my ideas. I’m offering this only as a suggestion for how BioRuby could grow further and encourage contributions.

I think it would be great if bioinformatics researchers, on publication of a manuscript, included a link to a github repository. After all, how often is bioinformatics code reinvented? And when someone emails another researcher for their code, wouldn’t it be great to know what they’re up to? In particular, when you see some code mentioned in a paper, you want to be able to get access quickly and start playing around. Whether people would want to share code in this way is one issue, but if they choose to, the features that git and github offer can make it much easier.

More on git and github

Repository Formats Matter

Moving from subversion to git

Video tutorial on using git

Comments in github

Project forking using github

Ruby on Rails moves to github

BioRuby and Ruby on Rails: Active BioRecords

March 6th, 2008

A common practice in any computationally based field is writing code where the intended functionality has already been produced by someone else. This is usually called reinventing the wheel. This isn’t very useful since you’re spending time on an intermediate step, when instead you can use existing code and jump ahead to the next step in your research. Of course, it’s easy for me to shout bad practice on my blog, but I’m the worst person for doing this. I work in bioinformatics because I like writing code to solve problems, and my first response is to start coding, rather than look to see if someone has created a solution already. On the other hand, the benefit of using existing libraries is that you can build new things on what has already been done.

Read more »

Using helper scripts to make bioinformatics analysis easier to maintain

February 29th, 2008

One of the differences between researching a scientific problem using a computer and developing software is the approach to writing code. If you’re producing a bioinformatics application, there is more emphasis on generating high quality, flexible code, as this makes future maintenance easier. On the other hand, if you’re trying to find the answer to a biological question using a series of scripts, then the focus is on the results rather than the standard of code. During my work, the number of scripts I have tends to grow quickly, and this leads to problems with maintaining dependencies across scripts. Examples are database connection parameters, or the file system location of a library I’m calling. The problem arises because the fastest way to get this information into a script is to cut and paste it from an existing one. This becomes difficult to manage when something changes, because I have to go back through all my scripts and update each in turn.
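One common way to avoid this kind of duplication, sketched here as an assumption rather than as the approach the full post describes, is to keep shared settings in a single file that every script loads. The module name `AnalysisSettings`, the keys, and all the values below are hypothetical placeholders.

```ruby
# settings.rb: a hypothetical single home for values shared by analysis scripts.
# Every name and value here is a placeholder chosen for illustration.
module AnalysisSettings
  # Database connection parameters, defined once instead of in every script.
  DATABASE = {
    :host     => 'localhost',
    :database => 'sequences',
    :username => 'analysis'
  }.freeze

  # Directory holding shared library code, resolved relative to this file.
  LIBRARY_DIR = File.expand_path('lib', File.dirname(__FILE__))

  # Build the full path to a shared library file from its short name.
  def self.library(name)
    File.join(LIBRARY_DIR, "#{name}.rb")
  end
end
```

Each script would then require this one file and read, for example, `AnalysisSettings::DATABASE[:host]`, so a changed host or library location is edited in one place only.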

Read more »

Bioinformatics Zen FAQ

February 12th, 2008

I guess one of the golden rules of blogging is to write about what people are interested in. Here are the most common questions I get emailed, and my answers to them.

Read more »

Why data testing is important in computational research

December 31st, 2007

I wrote in a previous post about the importance of testing in computational research. If you’re developing a piece of software, functional testing is essential. However, we bioinformaticians don’t just develop software; we also have to develop conclusions and hypotheses based on data, as well as on the code we’ve written. Here is an example of why I think data testing is as important as functional testing in research.

Imagine you want to see if the structural stability of a protein correlates with the number of disulphide bonds. To do this you can create two methods in a Ruby mixin.

module CysteineStatistics
  def n_cys
    to_s.downcase.scan('c').length
  end
 
  def avg_cys
    n_cys / to_s.length.to_f
  end
end

These two methods can be mixed into the String class to add the functionality. For the sake of this demonstration I’m using String, but in reality you would look at adding this to the BioRuby Bio::Sequence class.

class String
  include CysteineStatistics
end

Being a responsible bioinformatician I’ll write some testing code to make sure that the methods do what I expect.

require 'test/unit'

class TestCysteineStatistics < Test::Unit::TestCase

  def test_n_cysteines
    small_sequence = 'AAACAAA'
    # assert_equal takes the expected value first, then the actual value
    assert_equal(1, small_sequence.n_cys)
  end

  def test_avg_cysteines
    small_sequence = 'AAACAAA'
    assert_equal(1.0/7.0, small_sequence.avg_cys)
  end
end

Again, in real life I would probably write a few more tests just to make sure. Both these tests pass though, so the code is doing what I want. Next I write a short script to analyse my set of protein sequences.

require 'csv'

# Read in each sequence, then write out the average number of cysteines
CSV.open(out_file, 'w') do |csv|
  # CSV.foreach yields each row of the input file in turn
  CSV.foreach(in_file) { |row| csv << [row[0], row[1].avg_cys] }
end

This gives me a file containing the average cysteine content for each of my protein sequences. Next I would correlate these values with some structural stability measure I’ve worked out. Here’s what the file looks like.

1,0.0203291384317522
2,0.0248447204968944
3,0.0388349514563107
.
.
.

This is fine and I could finish the post here. The point I would like to make, though, is that in an ideal world this is not the end of the story, and before I move on to the next stage of my research I’m going to test my data.

require 'test/unit'
require 'csv'
class TestCysteineStatistics < Test::Unit::TestCase

  # A shortcut method to find the entry in the data file
  def entry(id)
    CSV.foreach('count.csv') { |row| return row[1].to_f if row[0] == id.to_s }
  end

  def test_avg_cysteines_for_2
    assert_equal(0.025,entry(2))
  end

  def test_avg_cysteines_for_3
    assert_equal(0.0392156862745098,entry(3))
  end
end

What I’ve done for this script is take two proteins from my original data set and calculate their average cysteine content by hand, then compare each with the value my code calculated. Running this test produces the following result.

Loaded suite test_data
Started
FF
Finished in 0.018369 seconds.

1) Failure:
test_avg_cysteines_for_2(TestCysteineStatistics) [test_data.rb:11]:
<0.025> expected but was
<0.0248447204968944>.

2) Failure:
test_avg_cysteines_for_3(TestCysteineStatistics) [test_data.rb:15]:
<0.0392156862745098> expected but was
<0.0388349514563107>.

2 tests, 2 assertions, 2 failures, 0 errors

Both tests fail, indicating a difference between what I’ve written my code to do and what I’ve calculated by hand. So where is the error? The problem is that I’ve divided the number of cysteines by the length of the sequence, which in theory is correct. However, when a DNA sequence is translated to protein, stop codons are translated and added as a special character ‘*’, meaning that the protein sequence string is one character longer than the true number of residues. This was not picked up by the code testing, but it was by the data testing.
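A minimal sketch of one possible fix, assuming the only unwanted character is a translated stop codon written as ‘*’, is to strip it before counting and before measuring the length. The helper name `residues` is hypothetical and is not part of the code shown earlier in this post.

```ruby
# Sketch of a stop-codon-aware version of the mixin.
# Assumes stop codons appear only as '*' characters in the sequence string.
module CysteineStatistics
  # Hypothetical helper: the sequence with stop characters removed.
  def residues
    to_s.delete('*')
  end

  # Number of cysteines, ignoring case.
  def n_cys
    residues.downcase.count('c')
  end

  # Fraction of residues that are cysteine, excluding stop characters
  # from the length. An empty sequence gives NaN rather than raising.
  def avg_cys
    n_cys / residues.length.to_f
  end
end

class String
  include CysteineStatistics
end
```

With this version a sequence such as 'AAACAAA*' has the same avg_cys as 'AAACAAA', so the hand-calculated values and the computed values should agree for sequences that end in a stop character.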

Summary
Code testing will find the errors that you can think of; data testing can find the ones you didn’t think of. This can seem laborious and will not be applicable to every situation, but it can be useful in the real world.

All the code in this tutorial can be found here.