Bioinformatics : use a database for data

February 26th, 2007

Previously, I wrote about organising your file system to make the relationships between files that produce data, and files containing data more descriptive. One of the best tips I’ve been given, is to store all my data in a database. Regardless of what the data is, or how “mission critical”. Here are some reasons to use a database, rather than files, to store your data.

Location independent
You create a perl script that analyses file A. You later move file A. So you have to update your perl script with the new location. What if you’ve got a perl script that analyses file A, B, C etc. Or if you’ve moved the file several months ago, and you can’t rember which is the one you need.
Instead, if you have everything as tables in a database, you can pull the data, location independently. The database doesn’t even need to be on your computer.

Databases are clean
Unless they are XML, data files are messy. Missing commas. Too many commas. Blank lines at the end of file. Bizarre header lines. Binary data files are even worse, you’ll need a library to parse it. Databases on the other hand are consistent - data is always stored the same way. Named columns in a named table. You’ll always use the same methods to pull the data. You’ll always use the same program to view the data

Easier to backup
Obviously you backup regularly. If you use files to store your data, every time you create a new file you’ll have to inform your backup application that the file needs to be included. On the other hand, databases can be saved into a single text. If you’ve 5, 10, or 20 tables in your database, everything can still be backed up into one file.

Relational meaning
Relational data management is a huge topic and I’m not going into detail here. But a simple illustration is table for organisms and a table for sequences. Each sequence can referenced to the originating organism using SQL, and vice versa. A operation that would more difficult if the two data sets were in separate files.

Where to get started
I personally use MySQL for my databases. Not for any particular technical reason, but because this is what I was taught using. I know that PostgreSQL is popular, HSQLDB also. As for tutorials, this page has a good explanation on different database types.

8 responses

  1. Pierre comments:

    Hi,
    FYI: javadb/derby is now part of the standard java distribution and, I guess, it should be preferred over hsqlb.
    People might also use SQLite which doesn’t require a server.
    Flat files could also be tagged just like http://del.icio.us ?

    Pierre

  2. Mike comments:

    Thanks for the information, Pierre.
    I read on your website about javadb. I wasn’t sure of the relationship between the two. Is javadb the Sun equivalent of hsqldb?

    You could tag flat files, but what I find with lots of different flat files is the necessity for unique code for each file to pull the data you need. If someone gives you an SQL dump you know how the data is going to formatted.

  3. Bioinformatics Zen » Blog Archive » Bioinformatics : which programming language to use? pings back:

    [...] stripping the data out of the files, from running the statistical analysis. Have I mentioned before that databases are [...]

  4. Bioinformatics Zen » Blog Archive » Three libraries and a tool to enhance your bioinformatics coding pings back:

    [...] relational mapping I think one of the best tips in bioinformatics is to use a database to store all of your data. Accessing a database inside code is often rather cumbersome though, requiring some rather unwieldy [...]

  5. Organised bioinformatics experiments | Bioinformatics Zen pings back:

    [...] point is the most important, and the one that has made my work much easier. Without exception, always use a database to store data. Manipulating flat files in scripts is hard work, and is also a source of bugs. The only time I [...]

  6. gioby comments:

    Hi,
    so, do you use any database design software to design your database?
    For example, something like this: http://www.programurl.com/software/database-design.htm ?

    Do you create any UML schema before creating the database? or you use only ORM modules like you said in another post?

    Cheers

  7. aroth comments:

    I am curious about how the best way to organize sequence data for a small scale bioinformatics project. I am mainly interested in performing phylogenetic analysis on several gene families, and I am only dealing with a few hundred sequences. I am trying to find some software to organize the sequences in a centralized location so I can reuse them easily for new analysis. Text files are inconvenient, but the learning curve for implementing my own SQL database seems rather steep. Does anyone have a suggestion for a program or a tutorial that might help?

  8. gioby comments:

    @aroth: maybe openoffice-calc and cvs files?

Leave a comment