Decouple the file parsing from the analysis

January 7th, 2008

A common task in bioinformatics is to read data from a set of files, arrange it into the required format, then run an analysis to verify or falsify your expectation. An example would be reading in the yeast interaction network and protein evolution rates, then correlating the two sets of data to see if there is a trend. Using Perl, you would specify how each file gets read in, arrange the sets of data by gene name, then correlate the two.

Anti Pattern
I argue that it is better to split this single script into two: one that takes the data from the files and outputs it in the required tabular format, and a second that reads in the table and runs the analysis. Why go to the trouble of doing this? Because in a single script, the analysis of the data is tightly coupled to the format the data is in. In other words, the analysis code knows too much about the format of the original data files.

Imagine a more comprehensive study of protein evolution is published. If you want to use this data, you’ll have to edit the whole script. You will usually end up changing the way the analysis is done, because the code that reads the files overlaps with it, and every such change is a chance to introduce a new bug. Testing for these bugs is difficult because the script is doing two things at the same time, and the data itself can’t be verified because it’s hidden inside the program.
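As a rough illustration of the first of the two scripts, here is a minimal sketch of a formatting step. It is written in Python rather than Perl, and it assumes that each source file holds tab-separated "gene, value" lines. The function names, the column names and the numbers in the stand-in data are all made up for the example.

```python
import csv
import io


def read_pairs(handle, delimiter="\t"):
    """Read 'gene<TAB>value' lines into a dict keyed by gene name (assumed layout)."""
    pairs = {}
    for row in csv.reader(handle, delimiter=delimiter):
        if len(row) < 2 or row[0].startswith("#"):
            continue  # skip blank lines, short rows and comment lines
        pairs[row[0]] = float(row[1])
    return pairs


def write_table(interactions, rates, out):
    """Write one CSV row for every gene that appears in both data sets."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["gene", "interactions", "evolution_rate"])  # hypothetical column names
    for gene in sorted(interactions.keys() & rates.keys()):
        writer.writerow([gene, int(interactions[gene]), rates[gene]])


# Stand-in data with invented values; a real script would open the source files instead.
interaction_file = io.StringIO("YAL001C\t12\nYAL002W\t3\nYAL003W\t40\n")
rate_file = io.StringIO("YAL001C\t0.21\nYAL003W\t0.05\nYAL005C\t0.33\n")

table = io.StringIO()
write_table(read_pairs(interaction_file), read_pairs(rate_file), table)
print(table.getvalue())
```

Only `read_pairs` knows anything about the layout of the source files. If a new data set arrives in a different format, that is the one function that should need to change.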

Advantages
So what is the benefit of the extra time required to split this script into two? First of all, the scripts need to know only one thing about each other: how the data needs to be tabulated so the analysis can be performed. That’s all. As long as this doesn’t change, everything else can. If the data formatting takes a long time, you can rewrite it in C to speed it up, and the analysis script doesn’t need to change. The reverse is also true: if the analysis is modified, the data formatting script can stay as it is. The correct tabulation of the data can also be tested, since it is output before the analysis is run, which is important for making sure that you are analysing the correct information. Also, a CSV file is programming language agnostic, so anyone can repeat or extend the analysis without being tied to the language you used.
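To make the "only one thing" concrete, here is a minimal sketch of an analysis step that depends on nothing but the agreed table. It is in Python rather than Perl. The column names, the function names and the three data rows are assumptions made up for the example, and the hand-written Pearson correlation simply stands in for whatever analysis you actually run.

```python
import csv
import io
import math

# The one thing both scripts agree on (hypothetical column names).
EXPECTED_HEADER = ["gene", "interactions", "evolution_rate"]


def load_table(handle):
    """Read the agreed table, failing early if the columns are not the expected ones."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_HEADER:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    return [
        (row["gene"], float(row["interactions"]), float(row["evolution_rate"]))
        for row in reader
    ]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


# Stand-in table with invented values; a real script would open the CSV file instead.
table = io.StringIO(
    "gene,interactions,evolution_rate\nA,1,0.4\nB,2,0.3\nC,3,0.2\n"
)
rows = load_table(table)
r = pearson([x for _, x, _ in rows], [y for _, _, y in rows])
print(f"{len(rows)} genes, r = {r:.3f}")
```

The header check in `load_table` is one cheap way to test the tabulation before any analysis runs. Because the table is an ordinary file, you can also open it and inspect it by eye.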

Finally, ensuring that each script does only one thing is good programming practice. This prevents monolithic, hard-to-debug code, and encourages modularity and reusability. If there is a bunch of code you’re always copying and pasting from one script to another, pull it out and make it into a library. You’ll save yourself effort each time you need to use it, and if you find a bug you only need to fix it once.

4 responses

  1. Deepak comments:

    Excellent point. It actually speaks to a couple of things. One is some of Perl’s limitations, since the ideal way to do this IMO would be to create class libraries for the two tasks. The second is the power of workflow systems. The problem you describe is ideally handled by workflow engines, especially if you have to do it repeatedly.

  2. jan. comments:

    Good point Mike. I know that I don’t do this enough. I often like to use YAML instead of CSV because it’s self-describing but still human-readable. Just my 2c.

  3. DMK comments:

    If I had a nickel for every time that kind of decoupling was useful, I’d, uh, have at least enough for a couple coffees. At Starbucks. Grandes.

    In other words, I could not stress enough how decoupling parsing into a more easily used data structure helps.

    Compared to a straight-up load of a CSV or fixed-width file, XML is really quite sluggish. If particular algorithms could benefit from some re-arranging of the data from the original source, not only does this encourage the building of more flexible tools, but expedites repeated executions with tweakable probabilistic algorithms (read: machine learning).

  4. Martino comments:

    I have just stumbled across this blog, and must say it is a nice article. As a non-scripter, I don’t know much more than general knowledge Perl. However, the modular approach is a good example of an object oriented approach where each script (or object or module) will do one job and simply connect with other scripts via some interface. The biostatistical landscape is cluttered with far too many monolithic structures, be it code or databases or …, and is creating a tangled web of unnecessary complexity. The simple idea of modularity that you suggest can be applied to many of the existing models that attempt to represent various aspects of biological representation and would allow us to take advantage of the tremendous output currently being generated in this science.
