Good programming versus biological intuition

November 20th, 2007

As I write my first paper, my biggest worry is that my results are wrong. In particular, I worry that my code, which I think does one thing, has a bug and does something different. That would produce inaccurate results, lead me to incorrect conclusions, and leave me with a paper that tells the wrong story.

For example, in a recent story, five papers were retracted because an incorrect sign change led to the wrong protein structure prediction. This is very unfortunate, and I think that every bioinformatician could empathise with this: how often can you be 100% sure that your scripts and programs are doing what you expect?

To contrast this with the wet lab, one of the most important criteria there is to be meticulous about positive and negative controls. This makes me think: could I use positive and negative controls in my computational work to test my experiments? Whether I am testing what my program is doing or whether the results I have produced are true, without this level of certainty how can I be sure about my conclusions?

Good programming

This may sound rather like I am preaching, but I hope that by following good programming practices I can prevent many errors that could result from simple mistakes.

Behaviour, or unit driven code testing

If you’re unfamiliar, unit/behaviour testing involves writing short snippets of code to test how your program functions. The best way to do this is to write as many tests as you can think of before you begin writing your program, then keep running them as you code. This makes sure the program is always doing what you expect.

I think regular testing is one of the most critical parts of code production, if not the most critical, and if you take anything away from reading this article, consider implementing tests in your work routine.

There are plenty of libraries to support code testing, so the process isn’t too arduous. My personal favourite is test/spec, which I use for behaviour driven code testing in Ruby, but there are libraries for every language.

For example, as in the laboratory, you could write tests to make sure that your program produces positive and negative results when you expect. A good habit to get into is to keep adding tests whenever you think of other possible errors that could be introduced. Then, when you start actually doing the analysis, you can be much more confident that your code is doing everything that you want, and nothing else.
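As a rough sketch of what such controls might look like, here is a small test file. It uses Minitest, which ships with Ruby, rather than test/spec, and the gc_content method and the test names are hypothetical, made up purely for illustration.

```ruby
require "minitest/autorun"

# Hypothetical method under test: the fraction of G and C bases
# in a DNA sequence. Not taken from any real library.
def gc_content(sequence)
  seq = sequence.upcase
  return 0.0 if seq.empty? # avoid dividing by zero
  seq.count("GC").to_f / seq.length
end

class GcContentTest < Minitest::Test
  # Positive control: an input where the signal is certainly present.
  def test_all_gc_sequence_scores_one
    assert_in_delta 1.0, gc_content("GGCC"), 1e-9
  end

  # Negative control: an input where the signal is certainly absent.
  def test_all_at_sequence_scores_zero
    assert_in_delta 0.0, gc_content("ATAT"), 1e-9
  end

  # Known answer worked out by hand: 2 of the 4 bases are G or C.
  def test_mixed_sequence_scores_half
    assert_in_delta 0.5, gc_content("ATGC"), 1e-9
  end

  # The kind of test added later, after thinking of another possible error.
  def test_empty_sequence_scores_zero
    assert_in_delta 0.0, gc_content(""), 1e-9
  end
end
```

Running the file with ruby executes every test and reports any failures, so it is cheap to rerun after each change.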

Reusing code, and making it open source

Using existing libraries is a great habit to get into. Writing your own solution to a problem means more code that has to be tested, and debugged. On the other hand, if code has already been written, and in my experience 70% of the time it has, then you can just slot this in, and move on to the next step, saving you time.

If you have a problem for which there isn’t already an existing library, then package up your solution and release it on a site like SourceForge.net or RubyForge.org. You’ll be helping other people in their work, and if your problem is an important one, you will find that people contribute to and improve the code you’ve written, which will feed directly back into your own work.

A bonus of creating packages is that it keeps your code modular. For example, do you have a piece of code that you regularly use in your scripts? Something you keep pasting in when you need it? If you found an error in this code, you would have to go back to all the places where this code was used, and correct the problem.

On the other hand, if the code was part of a package that your script was calling, when you update the package to fix the problem, every corresponding script gets updated too. This is an important principle in software production, "Don’t repeat yourself", and one I try to stick to as much as possible.
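A minimal sketch of the idea, assuming a shared file called seq_utils.rb; the file name, the SeqUtils module, and its method are all hypothetical. The frequently pasted snippet lives in one module, and each script loads it instead of carrying its own copy.

```ruby
# seq_utils.rb: hypothetical shared module, kept in one place
module SeqUtils
  COMPLEMENT = { "A" => "T", "T" => "A", "G" => "C", "C" => "G" }.freeze

  # Returns the reverse complement of a DNA string.
  # Raises ArgumentError on any character that is not A, C, G or T.
  def self.reverse_complement(sequence)
    sequence.upcase.reverse.each_char.map do |base|
      COMPLEMENT.fetch(base) { raise ArgumentError, "unexpected base: #{base}" }
    end.join
  end
end

# In each analysis script, instead of pasting the method body, load it:
#   require_relative "seq_utils"
puts SeqUtils.reverse_complement("ATGC") # prints GCAT
```

A bug fixed in SeqUtils is then fixed for every script that loads it.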

Biological understanding

Writing tests and using open source software only takes a little extra time, so the next step is to produce results, and then interpret them.

Let negative results drive the research

Again, this is something that I think is really important. Before I start to find my important results, I will perform the simple analyses: the things I already know the answer to. Doing this will confirm that, again, all the code I’ve written is working, but also that in terms of biology I am seeing the types of results I expect. This point may seem obvious, but I am a bioinformatician, and after writing the code as an informatician, I have to put my biological hat on and use my knowledge to verify that what I am producing matches what the current literature predicts.

I hope that by doing my research this way, I can creep slowly forward from the areas already understood to those less certain. This means I can be confident that each further, and more uncertain, conclusion is built on the solid logical foundation of the previous result. Of course, this is easy to write in principle but is not possible in every situation. Still, I try where I can, and I believe it’s worth it.

Results match expectations

Finally, getting to the crux of my research, I hope I’ll be producing the results that will answer important questions and give me the material I need to write a paper. Here is where I may have an expectation of what the result is going to be, but a wildly different result is possible. This could indicate a breakthrough in understanding, or a mistake in my method. But hopefully, by being rigorous both in my practice of informatics and biology, I can attribute this result to a new insight, rather than worrying that there is still a bug lurking in my code.

Summary

The point I’ve tried to make here is that it’s worth taking a little extra time to be rigorous in code production. However, I always keep in mind that you can never be 100% sure that your code does what you think, so I also try to keep biological expectations in mind when I interpret results.

Related
Boscoh also wrote a similar article to this a few months ago at his blog, Trapped in the USA.

11 responses

  1. Chris comments:

    Good article about a topic that doesn’t get enough attention. I like the idea of treating test cases as pos/neg controls. Would being required to mention these in the methods or supplemental be overkill, especially in purely computational and algorithmic papers? (probably is overkill, but worth mulling over for a second or two)

  2. paradoxus comments:

    I like this post which is very relevant.

    Another issue is the program you use which was written and published by someone else. If (just an imaginary scenario) a well known program is found to be faulty, what will the consequence be?

    It also takes time to go through all of your code again and again. Debugging is never ending. For my analysis, first I make sure the core code (analysis/model) is correct, then I spend more time focusing on parsing the data in the correct format.

  3. Andy comments:

    A couple of points that I think need to be kept in mind.

    1) Testing is good. In fact, testing is essential. But - particularly in bioinformatics - it is equally important that you understand why your test inputs should produce your test outputs. Writing code to produce predictable results is required where the process is well understood and validated, but if there is the potential for uncertainty every concept needs to be challenged. One should never be afraid to come up with a result that differs from the expected. If you read the boscoh post fully, you can see that it took a long time for the results to be questioned because no-one believed in their results enough to challenge the previous ones. If we were all coding routines that had to conform with the conventional wisdom, we would never advance.

    2) The really important thing is to be *auditable*. Version control is a non-negotiable. Accurate records of which version of the code was run on what date with what input is, likewise, something that is essential. Checksum everything. Record everything. Dare to challenge, but be prepared to be challenged is the essential maxim.

  4. Pedro Beltrao comments:

    Good post and an important topic. One thing that I have thought about a few times is that I am much more likely to review code and assumptions when the outcome is not what I was expecting. I think you cover several good practices that help overcome potential bias.

  5. Mike comments:

    Thanks for your comments guys.

    @Chris
    I have thought about what you wrote about journals requiring authors to submit tests as part of papers. I don’t think it is overkill. Wet lab biologists have to be very rigorous about what controls they use, so why shouldn’t bioinformaticians have to do the same? Most modern testing software produces readable results that, for example, a reviewer could quickly cast their eye over.

    @Paradoxus
    I agree that going through your code can be repetitive and boring, so in no way do I advocate this. In fact, I think in programming if something is hard work there is usually already an existing solution to make it easier. For example, unit testing is meant to run quickly and easily every time you want to make sure that your program is working. This is usually better than adding print statements throughout your code. Worth a look perhaps?

    @Andy
    I agree that you should be prepared for unexpected results, and I hope the point I was trying to make was that if you’re sure that your code has no bugs in, then you can be more confident in the results you have produced. Especially in the case where the results challenge previous research. I’m not advocating routines that conform, rather that they do what you have written them to do.

    @Pedro
    I am exactly the same, most of the time I write posts of what I think I should be doing rather than what I am actually doing.

  6. Animesh Sharma comments:

    CS grads advise me to shift to a functional programming language like Haskell to avoid many of the bugs.

  7. Cleaning my dirty laundry in public: errors in my data < michael barton pings back:

    [...] I wrote a post on Bioinformatics Zen about the importance of testing your code and data to make confirm you are producing what you think [...]

  8. News: Link fest. « Thirst for Science pings back:

    [...] while not a literature review, I enjoyed the recent post Good Programming versus Biological Intuition over on Bioinformatics [...]

  9. Bioinformatics Zen » Why data testing is important in computational research pings back:

    [...] Zen Good programming versus biological intuition [...]

  10. giovanni comments:

    Testing stands to bioinformatics like negative and positive controls do to wet biology experiments, isn’t it true?

  11. Bioinfo Blog! » testing del software: un problema di metodologia in bioinformatica pings back:

    [...] the bioinformaticszen blog discussed this topic recently. In the case of bioinformatics, testing takes on an even more complicated dimension compared to [...]
