git, github, and bioinformatics software development

April 15th, 2008

Github, a source code management (SCM) repository based on git, has exited beta and is ready for people to sign up. Git and github offer interesting opportunities for bioinformatics software development, and I think it’s worth taking a few minutes to explore them. There’s a free option too, so it doesn’t cost anything to sign up and play around.

Source code management

Github is based on git, and if you’re familiar with a source code management tool like subversion, git uses a similar command syntax and takes only about 20 minutes to become familiar with. Git does many things to improve upon SCM, and one of the first things I noticed is how much faster it is than subversion. Also, if you’ve ever used subversion, you’ll know that it creates a .svn directory in every subdirectory of the project. This can make the project rather difficult to share and maintain. Git, on the other hand, creates a single .git directory in the root of the project, so if you want to share the project minus git revision control, you can just delete this directory.

Git also simplifies the case where more than one developer is working on a project and each developer needs to work on the same code, which will inevitably lead to conflicts between the different versions. Git’s approach allows each developer to copy the main project and work on their own version. This copy can be modified and committed to while nothing is sent back to the master copy. Only when you decide to push your changes are they sent back to the original, at which point the maintainer decides which changes to merge into the original version. Subversion does have the option to create branches, but I find that git’s interface is much simpler and gives the developer more freedom to take risks and try out new code.
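As a rough illustration of the copy, commit, and push cycle described above, here is a minimal sketch that runs entirely in a temporary directory. The repository paths, file name, and committer identity are made up for the example, and a local bare repository stands in for the master copy.

```shell
#!/bin/sh
set -e
WORK=$(mktemp -d)

# Stand-in for the master copy (hypothetical local path, not a real project)
git init --quiet --bare "$WORK/main.git"

# A developer takes a full copy of the project, history included
git clone --quiet "$WORK/main.git" "$WORK/dev" 2>/dev/null
cd "$WORK/dev"

# Commit locally; nothing is sent back to the master copy yet
echo "puts 'hello'" > analysis.rb
git add analysis.rb
git -c user.name="Dev" -c user.email="dev@example.com" \
    commit --quiet -m "Add analysis script"
BEFORE=$(git --git-dir="$WORK/main.git" for-each-ref | wc -l | tr -d ' ')

# Only an explicit push sends the commit to the master copy
git push --quiet origin HEAD
AFTER=$(git --git-dir="$WORK/main.git" for-each-ref | wc -l | tr -d ' ')

echo "branches in master copy before push: $BEFORE, after push: $AFTER"
```

The counts are taken from the master copy itself, so the local commit leaves it untouched until the push.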

Social Software

Github builds on git and takes the easy branching feature a step further to create a social software site. I know everyone and their dog is creating a social [insert verb]ing application/site, but you might find that github’s approach can make a difference in your approach to software development. Github makes it possible to see who is creating branches of your project, visualised as a network, where branch and merge points are shown in a timeline.

Image of github network feature

As a use case, I’m working on a manuscript and I have a set of ruby classes which I’ve been using in my analysis. I think these might be useful to other bioinformaticians, and I’d like to contribute them to the BioRuby library. To do this, I have to contact the BioRuby mailing list with my suggestion, get CVS access, add my changes, then commit them to the trunk. Were BioRuby a git repository, I could fork it at the beginning of my project, edit BioRuby as I am doing my research, then when my manuscript is done I could prune and tidy my changes and push them back as a patch. Even better, with github’s network feature, anyone interested in BioRuby can see that I’ve forked it, follow the link to my changes and see what I’m doing, even before I’ve committed my changes back to the main project. The BioRuby developers spend a lot of time maintaining the code and so are entitled to tell me what I can do with my ideas. However, I’m writing this as a suggestion for a way that BioRuby could grow further and encourage contributions.
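To make the fork-then-patch idea a little more concrete, here is a minimal sketch using a throwaway local repository as the upstream library. It is not the real BioRuby repository; the file names, class name, and committer identity are invented for the example. The changes made in the fork are exported with git format-patch, which writes one patch file per commit.

```shell
#!/bin/sh
set -e
WORK=$(mktemp -d)

# Hypothetical identity used for the example commits
export GIT_AUTHOR_NAME="Dev" GIT_AUTHOR_EMAIL="dev@example.com"
export GIT_COMMITTER_NAME="Dev" GIT_COMMITTER_EMAIL="dev@example.com"

# Stand-in for the upstream library (not the real BioRuby repository)
git init --quiet "$WORK/upstream"
cd "$WORK/upstream"
echo "module Bio; end" > bio.rb
git add bio.rb
git commit --quiet -m "Initial library"
BASE=$(git rev-parse HEAD)

# Fork it at the start of the project and work in the fork
git clone --quiet "$WORK/upstream" "$WORK/fork"
cd "$WORK/fork"
echo "class CodonUsage; end" > codon_usage.rb
git add codon_usage.rb
git commit --quiet -m "Add CodonUsage class"

# Export everything committed since the fork point as patch files
PATCH=$(git format-patch -o "$WORK/patches" "$BASE")
echo "patch written to: $PATCH"
```

The resulting file can be sent to the maintainers, who decide whether to apply it.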

I think it would be great if bioinformatics researchers, on publication of a manuscript, included a link to a github repository. After all, how often is bioinformatics code reinvented? Or when someone emails another researcher for their code, wouldn’t it be great to know what they’re up to? In particular, when you see some code mentioned in a paper, you want to be able to get access quickly and start playing around. Whether people would want to share code in this way is one issue, but if they choose to, the features that git and github offer can make it much easier.

More on git and github

Repository Formats Matter

Moving from subversion to git

Video tutorial on using git

Comments in github

Project forking using github

Ruby on Rails moves to github

Passive research streaming using Twitter, Flickr, and CiteULike

March 18th, 2008

Deepak, Neil, and Cameron have set up life streams which aggregate the feeds from sites like Last.fm and Flickr into a single set of posts. I’m a bit wary of doing this because I already get easily distracted by Ruby and bioinformatics blogs, but Neil gave me an idea when he wrote about using these technologies to track research. I currently use Subversion to back up my project files, and I noticed Twitter status updates are very similar in length to subversion log messages. I created a short script so that every time I do a subversion repository check in, the message is also sent to Twitter.

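#!/bin/sh
# Inspired by tinyurl.com/yt4ssq

# Keep the commit message intact so Subversion receives it unchanged
MSG="$*"

# Send twitter request, letting curl URL-encode spaces and odd characters
curl --basic --user "username:password" --data-urlencode "status=$MSG" "http://twitter.com/statuses/update.json"

# Send SVN request (quoted so a multi-word message stays one argument)
svn ci -m "$MSG"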

Combined with an RSS Wordpress plugin, my most recent research activity from Twitter is displayed as a stream on my blog. Taking this a step further, I included feeds for my tagged research figures on Flickr, my paper bibliography on CiteULike, and discussion of my research on my blog. This stream is available on www.michaelbarton.me.uk/research-stream/, and shows the general idea of what I’m trying to do. I like this because in bioinformatics it’s sometimes difficult to know what other people are doing, but now I hope that other people in my group can have a quick glance to see what I’ve been up to. Furthermore, this all works passively: I’m already using these services in my research anyway, and the only thing I had to do was use Yahoo Pipes to aggregate the already existing information.

Because bioinformatics work is amenable to being displayed, shared, and edited on the web, I think that the field should be at the bleeding edge of using Web 2.0 services like this. Of course many other bloggers before me, in particular Deepak, are already discussing this. Compared to most mashups, what I’ve created is rather shoddy, as I’m cobbling together various services and trying to use Wordpress plugins to create something they weren’t exactly intended for. However, I don’t have much time in my PhD to spend experimenting, and I think this would be true for most scientists. Therefore the more that existing services can be used, the better. As a further example, I think Flickr has a lot of potential, and I would like to create a group for my lab, so everyone can upload and tag their figures as they are producing them. Then the group’s pictures can be browsed and organised by tag to visualise what everyone is working on. The only effort required is for people to upload and tag their figures as they are making them.

BioRuby and Ruby on Rails: Active BioRecords

March 6th, 2008

A common practice in any computationally based field is writing code where the intended functionality has already been produced by someone else. This is usually called reinventing the wheel. This isn’t very useful since you’re spending time on an intermediate step, when instead you can use existing code and jump ahead to the next step in your research. Of course, it’s easy for me to shout bad practice on my blog, but I’m the worst person for doing this. I work in bioinformatics because I like writing code to solve problems, and my first response is to start coding, rather than look to see if someone has created a solution already. On the other hand, the benefit of using existing libraries is that you can build new things on what has already been done.


Using helper scripts to make bioinformatics analysis easier to maintain

February 29th, 2008

One of the differences between researching a scientific problem using a computer, and developing software, is the approach to writing code. If you’re producing a bioinformatics application there is more emphasis on generating high quality, flexible code, as this makes future maintenance easier. On the other hand if you’re trying to find the answer to a biological question using a series of scripts, then the focus is on the results, rather than the standard of code. During my work, the number of scripts I have tends to grow quickly, and this leads to problems with maintaining dependencies across scripts. Examples of this can be database connection parameters, or the file system location of a library I’m calling. This is because the fastest way to get this information into a script, is to cut and paste from an already existing one. However this becomes difficult to manage, when something changes, because I have to go back through all my scripts and update each in turn.
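One way around the cut-and-paste problem is to keep the shared values in a single file that every script loads. The sketch below is only an illustration of that idea: the file names, database name, and library path are all hypothetical, and everything is written to a temporary directory.

```shell
#!/bin/sh
set -e
WORK=$(mktemp -d)

# One shared settings file (all names and values here are hypothetical)
cat > "$WORK/settings.sh" <<'EOF'
DB_HOST="localhost"
DB_NAME="yeast_genes"
LIB_DIR="$HOME/lib/analysis"
EOF

# Each analysis script loads the shared file instead of pasting the values in
cat > "$WORK/count_genes.sh" <<'EOF'
#!/bin/sh
. "$(dirname "$0")/settings.sh"
echo "count_genes: connecting to $DB_NAME on $DB_HOST"
EOF

cat > "$WORK/plot_expression.sh" <<'EOF'
#!/bin/sh
. "$(dirname "$0")/settings.sh"
echo "plot_expression: connecting to $DB_NAME on $DB_HOST"
EOF

# Changing the database name in one place updates both scripts
sed 's/yeast_genes/yeast_genes_v2/' "$WORK/settings.sh" > "$WORK/settings.tmp"
mv "$WORK/settings.tmp" "$WORK/settings.sh"

OUT1=$(sh "$WORK/count_genes.sh")
OUT2=$(sh "$WORK/plot_expression.sh")
echo "$OUT1"
echo "$OUT2"
```

When a connection parameter changes, only settings.sh needs editing, rather than every script in turn.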


Bioinformatics Zen FAQ

February 12th, 2008

I guess one of the golden rules of blogging is to write about what people are interested in. Here are the most common questions I get emailed, and my answers to them.
