An introduction to data mining in bioinformatics
January 15th, 2007
In other words, you’re a bioinformatician, and data has been dumped in your lap. Find the patterns, trends, answers, or whatever meaningful knowledge the data is hiding.
From experience, I can say that this is one of the most frustrating positions to be in. Data mining is a huge field and can easily bewilder a beginner. However, high-throughput techniques in molecular biology increasingly need bioinformatics to interpret the data. Furthermore, people working in bioinformatics generally come from computer science or biology backgrounds. Data mining, however, involves statistics to one degree or another, which means entering a field that may not be your strong point.
Here are some tips from my own forays into the quagmire of data mining in bioinformatics. Hopefully they will give you some guidance and a place to start.
What’s your question?
I cannot emphasise this enough. Always have a question in mind. Frame this question so that it can be answered in a statistically sound way. Then answer the question. Even if the answer is negative, you’ve answered the question. This is essentially what much of science boils down to. The more important or relevant the question, the more important or relevant your work.
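As a rough illustration of what a statistically framed question can look like, here is a minimal sketch in Python. The question is "do the treated samples differ from the controls in mean expression of one gene?", and it is answered with a simple permutation test. The function name `permutation_test`, the choice of test, and the expression values are all my own assumptions for the sake of the example, not a recommended pipeline.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in group means.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    hits = 0
    for _ in range(n_permutations):
        # Shuffle the labels: under the null hypothesis they are arbitrary
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1

    # Add-one correction so the estimated p-value is never exactly zero
    return (hits + 1) / (n_permutations + 1)


# Made-up expression values for a single gene, purely illustrative
control = [5.1, 4.8, 5.3, 5.0, 4.9, 5.2]
treated = [6.0, 6.4, 5.9, 6.2, 6.1, 6.3]

p_value = permutation_test(control, treated)
print(f"p = {p_value:.4f}")
```

The point is less the test itself than the shape of the exercise: one question, one measurable quantity, and an answer you can report whether it comes out positive or negative.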
Unfortunately, life is never simple. In molecular biology, it’s becoming more common to generate reams of data and then ask someone in bioinformatics to produce an answer. This is exploratory data analysis, one of the most difficult things to do well, especially if you’re thrown in at the deep end.
Find someone who knows what they are doing
And be very nice to them. People will often spend some time talking to you if you bring cakes or biscuits. But don’t expect anyone to do the work for you. The more you pester someone, the less inclined they will be to give up their time. However, telling a statistically minded person the nature of your problem will often result in some advice on how to proceed. If they offer to meet on a semi-regular basis, even better. I guarantee that discussing the data with a third person will also bring fresh perspectives. Just keep the cakes coming.
Set a deadline
Without an end point, exploratory data mining can drag into weeks, months, or, god forbid, years, spent searching for the elusive answer that will validate all of the time you have sunk into the project. This can be one of the most demoralising traps in bioinformatics. You’d rather be working on something else, but you don’t want to waste all the effort you’ve already put in. But do you still want to be working on the same thing in another six months? What if there isn’t even an answer?
Set a deadline. When you reach it, write up what you’ve done as a paper. Seeing the work outlined in this format will tell you what you need to do to finish off. Even if the paper is not the ground-breaking work you were hoping for, you’ve still gotten something for your time.
The right tools
Excel is fine for creating graphs. If you’re serious about data mining, though, you’ll need something more heavyweight. I use R, which is free and has good data mining packages such as vegan and labdsv. For beginners R can be impenetrable, so I recommend this book as an introduction to R as well as the underlying statistics.
Also consider investing in one of the highest-rated data mining books on Amazon.
What are you doing?
Any of us can rush headlong into a land of support vector machines, hidden Markov models and neural networks. But coming back to the first point, what are you trying to prove? Always question what you are doing and how it fits into the wider picture. Regularly review and keep track of where you are going. This will prevent you from falling into data mining despair.