How to avoid errors when processing CSV files

November 1st, 2007

A lot of bioinformatics involves reading data from files to manipulate them for our analysis. For example, I spend a lot of time importing data from CSV files into my database. Doing this involves creating a script to iterate over each line of the file, then referencing each token in the row by its column number.

However this is bad for two reasons. The first reason is because it introduces a dependency on the column number, which may feasibly change. You can fix this by changing the script though, so this is not too bad.

The second reason is much more worse, because it could introduce a silent error. If the column number was wrong, then the wrong entry would be referenced. If correct and wrong entry where both of the same type, e.g. floats, then there is a chance you would miss the mistake, which is very bad.

One approach to fix this is to treat each row as a hash or map. I’ve laid out two examples in Ruby using the gem FasterCSV. They’re quite simple, so you should get the idea whatever language you use, hopefully there are equivalent libraries too.

Bad example

FasterCSV.foreach(file_path) do |row|


# In this instance the row is an array
# and has to accessed by the column number.
# Bad, because this introduces a dependency
# on the position of the column and doesn't
# throw an error if you are using the wrong column
row[column_number] # Do something here


end

Good example

#Set the header processing option...
FasterCSV.foreach(data_path, :headers => true) do |row|


# ...each row is now a hash, and the
# data can be accessed using a key
row['column_name']

# This is dependent on the column
# name, but not its position.
# Also you will get an error if
# the column doesn't exist and you
# will always reference the column you expect

end

Importantly by using a third party library, you implement another programming best practice which is, don’t reinvent the wheel.

4 responses

  1. Stephen comments:

    I often have to process CSV files in my work. Sometimes just poorly formatted text. I just read an article about importing text, that you might find interesting. http://www.simple-talk.com/sql/t-sql-programming/importing-text-based-data-workbench/

    I enjoy reading your blog. I’m adding R and Ruby to my research toolbox and appreciate your insights.

  2. Luca Beltrame comments:

    Python has a similar feature in the csv module. Specifically, it uses the DictReader method to create a dictionary (Python hashes) with the names of the columns. It’s rather convenient.

  3. Mike comments:

    Thanks for both your comments guys.

  4. Stephen comments:

    Have you tried to use Kettle? An ETL tool from the Pentaho BI suite. http://kettle.pentaho.org/

    I’ve been playing with it for some of my projects. Java based GUI. Open source. Tried is since my previous post above and thought it might be something to consider also.

    Here’s a link to the product overview screencast. http://www.pentaho.com/products/demos/PDI_overview/PDI_overview.html

Leave a comment