Using helper scripts to make bioinformatics analysis easier to maintain
February 29th, 2008
One of the differences between researching a scientific problem using a computer and developing software is the approach to writing code. If you’re producing a bioinformatics application, there is more emphasis on writing high-quality, flexible code, as this makes future maintenance easier. On the other hand, if you’re trying to answer a biological question using a series of scripts, the focus is on the results rather than the standard of the code. During my work, the number of scripts I have tends to grow quickly, and this leads to problems with maintaining the dependencies they share. Examples include database connection parameters, or the file system location of a library I’m calling. The problem arises because the fastest way to get this information into a new script is to cut and paste it from an existing one. This becomes difficult to manage when something changes, because I have to go back through all my scripts and update each in turn.
I think a better way to organise these dependencies is to use a helper script which encapsulates what the other scripts have in common. For example, if I’m analysing the evolution of gene expression, my directory might look something like this:
- genome_analysis.rb
- transcript_analysis.rb
- protein_analysis.rb
- analysis_helper.rb
As I described, the analysis_helper.rb script would contain the common database parameters and library calls, and each of the other scripts loads this helper on its first line. If anything changes, I only have to update analysis_helper.rb, and the change is reflected in all the other scripts.
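As a rough sketch of what analysis_helper.rb might contain, assuming the shared settings are a set of database parameters and one library directory. The module name, the constant names, and every value below are hypothetical placeholders, not taken from a real analysis.

```ruby
# analysis_helper.rb (illustrative sketch; all names and values are made up)
module AnalysisHelper
  # Database connection parameters, defined once for every script.
  DB_CONFIG = {
    adapter:  'mysql',
    host:     'localhost',
    database: 'expression_data',
    username: 'analysis'
  }.freeze

  # File system location of a shared library directory (hypothetical path,
  # resolved relative to wherever this helper file lives).
  LIB_DIR = File.expand_path('lib', File.dirname(__FILE__))

  # Make files under LIB_DIR loadable with a plain `require`.
  # Safe to call more than once: the path is only added if missing.
  def self.add_lib_to_load_path
    $LOAD_PATH.unshift(LIB_DIR) unless $LOAD_PATH.include?(LIB_DIR)
  end
end
```

Each analysis script would then begin with `require_relative 'analysis_helper'` (on Ruby versions that predate `require_relative`, `require File.join(File.dirname(__FILE__), 'analysis_helper')` does the same job) and read its settings from `AnalysisHelper::DB_CONFIG` instead of carrying its own copy.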
This might seem trivial, but the more commonality I extract into analysis_helper.rb, the more useful it becomes. As a further example, if my expression analysis always begins with a multiple sequence alignment of genes from related species, then this step could be extracted into a method in the helper script. This pays off when, during peer review, a reviewer asks for a more appropriate sequence alignment method. All I have to do is change the method in the helper script, and then the analysis can be repeated without having to edit any of the other scripts.
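One way the alignment step could be kept swappable is sketched below. The names `AnalysisHelper`, `PAD_ALIGNER`, `aligner` and `align` are assumptions made for illustration, and the placeholder aligner only pads sequences with gap characters; it stands in for a call to whichever real alignment tool the analysis uses.

```ruby
# Part of a hypothetical analysis_helper.rb
module AnalysisHelper
  # Placeholder "aligner": right-pads every sequence with '-' to equal length.
  # This is NOT a real alignment algorithm, only a stand-in for one.
  PAD_ALIGNER = lambda do |sequences|
    width = sequences.values.map(&:length).max || 0
    sequences.transform_values { |seq| seq.ljust(width, '-') }
  end

  # The single place where the alignment method is chosen.
  @aligner = PAD_ALIGNER

  class << self
    # Reader and writer for the module-level @aligner set above.
    attr_accessor :aligner
  end

  # Analysis scripts call this method and never an aligner directly.
  # `sequences` is a hash of species name => sequence string.
  def self.align(sequences)
    aligner.call(sequences)
  end
end

# Example call, as it might appear in genome_analysis.rb:
genes   = { 'human' => 'ATGGCC', 'mouse' => 'ATGG' }
aligned = AnalysisHelper.align(genes)
```

With this arrangement, responding to the reviewer means assigning a different callable to `AnalysisHelper.aligner` (or editing `align` itself) in one file; the three analysis scripts keep calling `AnalysisHelper.align` unchanged.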
March 2nd, 2008 at 11:36 pm
Andrew Hunt and Dave Thomas covered this well in The Pragmatic Programmer chapter on the “DRY (Don’t Repeat Yourself) Principle”. Anytime I repeat a single line of code, my Spidey-sense tingles and I think of how I can place it into a common area. Need to repeat a for-loop? Make it a function/method. Need to repeat a function/method? Put it in a module/mixin class. Need to repeat a module/class? Make a package/library.
I’m not sure if you see this too, but in my experience, scripts intended as “one-offs” have a habit of staying around. Cutting and pasting between them creates crufty, brittle code. Each successive time I touch that “one-off”, I make sure the code is tidier when I leave it than when I opened it. This way, it incrementally develops into a full-fledged module or program.
And, of course, if I plan to publish the result, making sure that script has unit/functional tests is a Really Good Idea.
March 3rd, 2008 at 6:44 pm
Are you talking about the use of a library for common stuff? I might have misunderstood, but is there anything more basic in programming than grouping your common stuff in your own lib/ dir? I guess anyone who programs more than a couple of lines per week will use a directory of common functions he or she uses all the time, and it will grow and grow over time…