The dark side of bioinformatics data mining · Posted: Feb 13, 2007

I spend much of my day analysing yeast high-throughput data recently produced in the laboratory. On one hand, I’m very lucky to have access to fresh data at many cellular levels. On the other hand, with all this information, I’m easily swept away by the number of variables I have access to - the dark side of data mining.

Dark side

Principal component analysis. Singular value decomposition. Hierarchical clustering. Constrained ordination. Unconstrained ordination. Bayesian clustering. I can throw every technique I can think of at the data, but none of them will give me the answer I’m looking for - because I don’t have a question. Instead I have component weights for variables and parameters, identified clusters, and interesting but ultimately useless graphs. I know I’ve slipped into the dark side of data mining because the time is half past go home, my motivation is in my shoes, and my face is on my desk.

Light side

The power of the light side comes from reading and understanding the literature. Plenty of reading gives you the question to answer. I spend a portion of my day reading, then thinking about how this relates to the question I’m trying to answer. Work in small portions: find one small question, then set out to answer it. With your answer in hand, review it in light of what you have understood from the literature. Is this what you expected? How does it relate to previous findings? Review the literature again, until you have another small question. Repeat as necessary, then publish. Bask in the admiration of your peers.

Lure of the dark side

Why is it so easy to end up on the dark side of data mining, or of bioinformatics in general? I believe it’s because computational analysis takes so little effort to carry out. If you worked in a wet laboratory, even a small, well-defined piece of analytical work would likely take more than a week, so you would think it through before starting. As a bioinformatician sat at your computer, you can start an analysis on a whim, and no one is going to check on what you’re doing. Only when you show your results to someone and the obvious flaw is highlighted do you wish you had considered the problem a bit more first. Taking a few moments to sketch out what you’re thinking before you perform an analysis can really help with this.

The lesson here is that you can use the most advanced Bayesian multivariate analysis technique available, but if you don’t know what you’re looking for, it’s going to be very hard to find an answer. However, if you have a clearly stated question, you’ll find that it can often be answered with something as simple as a t-test or correlation test.
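To show how little code a well-posed question can need, here is a minimal sketch in Python. It assumes a hypothetical question ("is this gene expressed differently in treated and control cultures?") and uses made-up numbers, not real yeast data. The function name welch_t is my own, and the sketch only computes Welch's t statistic and its approximate degrees of freedom using the standard library; for a p-value you would look the statistic up against a t distribution or use a statistics package.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return Welch's t statistic and approximate degrees of freedom."""
    # Squared standard error of each group's mean
    se2_a = variance(a) / len(a)
    se2_b = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df


# Made-up expression values for a single gene, purely for illustration
control = [5.1, 4.9, 5.3, 5.0, 5.2]
treated = [6.0, 6.2, 5.8, 6.1, 5.9]

t, df = welch_t(treated, control)
print(f"t = {t:.2f}, df = {df:.1f}")
```

The point is not the arithmetic but the order of work: the question came first, and it decided which two columns of numbers were worth comparing.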
