Using helper scripts to make bioinformatics analysis easier to maintain

February 29th, 2008

One of the differences between researching a scientific problem with a computer and developing software is the approach to writing code. If you’re producing a bioinformatics application, there is more emphasis on writing high-quality, flexible code, as this makes future maintenance easier. On the other hand, if you’re trying to answer a biological question using a series of scripts, the focus is on the results rather than the standard of the code. In my work, the number of scripts tends to grow quickly, and this leads to problems maintaining the dependencies they share. Examples are database connection parameters, or the file system location of a library I’m calling. The problem arises because the fastest way to get this information into a new script is to cut and paste it from an existing one. This becomes difficult to manage when something changes, because I have to go back through all my scripts and update each in turn.

I think a better way to organise these dependencies is to use a helper script that encapsulates what the other scripts have in common. For example, if I’m analysing the evolution of gene expression, my directory might look something like this:

- genome_analysis.rb
- transcript_analysis.rb
- protein_analysis.rb
- analysis_helper.rb

As I described, the analysis_helper.rb script would contain the common database parameters and library calls, and each of the other scripts would load this helper in its first line. If anything changes, I only have to update analysis_helper.rb, and the change is reflected in all the other scripts.
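
As a rough illustration, analysis_helper.rb could look something like the sketch below. The module name, the connection values, and the lib directory are all placeholders made up for this example, not details of a real project.

```ruby
# analysis_helper.rb (hypothetical contents; every name and value is a placeholder)
module AnalysisHelper
  # Database connection parameters, defined once for all analysis scripts.
  DB_CONFIG = {
    host:     'localhost',
    database: 'expression_data',
    username: 'analyst'
  }.freeze

  # File system location of a shared library directory, next to this file.
  LIB_DIR = File.expand_path('lib', File.dirname(__FILE__))

  # Make files in LIB_DIR loadable with a plain `require`.
  def self.setup_load_path
    $LOAD_PATH.unshift(LIB_DIR) unless $LOAD_PATH.include?(LIB_DIR)
  end
end

# In genome_analysis.rb and the other scripts, the first line would be:
#   require_relative 'analysis_helper'
# after which the shared settings are available:
AnalysisHelper.setup_load_path
puts AnalysisHelper::DB_CONFIG[:host]
```

If the database moves to another host, only the DB_CONFIG hash needs editing, and every script that loads the helper picks up the new value.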

This might seem trivial, but the more commonality I extract into analysis_helper.rb, the more useful it becomes. As a further example, if my expression analysis always begins with a multiple sequence alignment of genes from related species, then this could be extracted into a method in the helper script. This becomes useful when, during peer review, a reviewer asks for a more appropriate sequence alignment method. All I have to do is change the method in the helper script, and the analysis can be repeated without editing any of the other scripts.
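
A minimal sketch of that idea follows. The method name align_sequences is an assumption, and its body is only a stand-in that pads sequences with gap characters to equal length. It is not a real alignment algorithm, and a real helper would call whatever alignment tool the analysis uses.

```ruby
# Hypothetical addition to analysis_helper.rb
module AnalysisHelper
  # Single entry point for alignment, shared by every analysis script.
  # Placeholder body: pads each sequence with '-' to the longest length.
  # Replacing this body changes the alignment method for all callers at once.
  def self.align_sequences(sequences)
    width = sequences.values.map(&:length).max
    sequences.transform_values { |seq| seq.ljust(width, '-') }
  end
end

# How one of the analysis scripts might call it (made-up sequences):
genes = { 'human' => 'ATGGCC', 'mouse' => 'ATGG' }
aligned = AnalysisHelper.align_sequences(genes)
aligned.each { |species, seq| puts "#{species}\t#{seq}" }
```

Because the scripts only ever call AnalysisHelper.align_sequences, switching to the method a reviewer asks for means editing this one method and re-running the analysis.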

2 responses

  1. Chris Lasher comments:

    Andrew Hunt and Dave Thomas covered this well in The Pragmatic Programmer chapter on the “DRY (Don’t Repeat Yourself) Principle”. Anytime I repeat a single line of code, my Spidey-sense tingles and I think of how I can place it into a common area. Need to repeat a for-loop? Make it a function/method. Need to repeat a function/method? Put it in a module/mixin class. Need to repeat a module/class? Make a package/library.

    I’m not sure if you experience this, but in my experience, scripts intended as “one-offs” have a habit of staying around. Cutting and pasting between them creates crufty, brittle code. Each successive time I touch that “one-off” I make sure the code becomes more tidy when I leave it than when I open it. This way, it incrementally develops into a full-fledged module or program appropriately.

    And, of course, if I plan to publish the result, making sure that script has unit/functional tests is a Really Good Idea.

  2. Max comments:

    Are you talking about the use of a library for common stuff? I might have misunderstood, but is there anything more basic in programming than regrouping your common stuff in your own lib/ dir? I guess anyone who programs more than a couple of lines per week will use a directory with the common functions he or she uses all the time, and it will grow and grow over time…
