BioRuby and Ruby on Rails: Active BioRecords

March 6th, 2008

A common practice in any computational field is writing code whose functionality has already been implemented by someone else. This is usually called reinventing the wheel. It isn’t very useful, since you spend time on an intermediate step when you could use existing code and jump ahead to the next step in your research. Of course, it’s easy for me to call this bad practice on my blog, but I’m one of the worst offenders. I work in bioinformatics because I like writing code to solve problems, and my first response is to start coding rather than look to see if someone has already created a solution. The benefit of using existing libraries, on the other hand, is that you can build new things on what has already been done.

BioRuby on Rails

During my research I’ve been trying to programmatically get EMBL files from the EBI database. I saw Neil’s post on accessing PDB files directly over HTTP, and there is also a similar method for the EBI. I thought it would be interesting to combine this with Bio::EMBL so that I can instantiate a BioRuby EMBL object just by calling a method with the required accession.

  # Needs require 'bio' and require 'open-uri' (see the full class further down).
  def self.fetch(id)
    uri = 'http://www.ebi.ac.uk/cgi-bin/dbfetch?db=EMBL&id=' + id.downcase + '&style=raw'
    embl = Bio::EMBL.new(open(uri).read)
    return embl
  end

This has a shortcoming: downloading each EMBL record takes a couple of seconds, which is slow for repeated use. This led me to think about ways to store the object after it has been downloaded once, then reload it whenever it’s requested again. ActiveRecord, part of Ruby on Rails, is useful for storing objects in a database.

require 'rubygems'
require 'bio'
require 'open-uri'
 
require 'active_record'
 
class ActiveEMBL < ActiveRecord::Base
 
  serialize :embl_obj
 
  def after_create
    self.embl_obj = ActiveEMBL.fetch(self.accession)
    save
  end
 
  def self.get(embl_id)
    ActiveEMBL.find_or_create_by_accession(embl_id)
  end
 
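  # Note: `private` does not apply to methods defined with `def self.`,
  # so fetch below stays public (after_create calls it as ActiveEMBL.fetch).
  private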
 
  def self.fetch(id)
    uri = 'http://www.ebi.ac.uk/cgi-bin/dbfetch?db=EMBL&id=' + id.downcase + '&style=raw'
    embl = Bio::EMBL.new(open(uri).read)
    return embl
  end
end

The ActiveEMBL class inherits from ActiveRecord::Base, and the Bio::EMBL object is stored as an attribute. The get method calls ActiveRecord’s dynamic find_or_create_by_accession finder, which returns the corresponding record if it exists, or creates a new one if it doesn’t. When a record is created, the after_create callback is called automatically; it calls the fetch method and then saves the record with the newly created Bio::EMBL object. Storing the Bio::EMBL object is also taken care of by ActiveRecord, because the serialize declaration at the top of the class marks embl_obj as a serialised attribute. The only other thing I have to do is create a table called active_embls with three columns: id, accession, and embl_obj.
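
The find-or-create flow described above can be tried without Rails or a database. The following is only a rough sketch of the pattern, not the ActiveEMBL code: it keeps one Marshal file per accession in a temporary directory, and the names RecordCache and get, along with the stub fetcher block, are made up for the example.

```ruby
require 'tmpdir'

# Hypothetical stand-in for the active_embls table: one Marshal file per
# accession inside a cache directory.
class RecordCache
  def initialize(dir, &fetcher)
    @dir     = dir
    @fetcher = fetcher # only called when an accession is not cached yet
  end

  # Return the cached object for `accession`, fetching and storing it first
  # if it is missing. This mirrors find_or_create_by_accession followed by
  # the after_create callback.
  def get(accession)
    path = File.join(@dir, "#{accession}.bin")
    return Marshal.load(File.binread(path)) if File.exist?(path)

    object = @fetcher.call(accession)
    File.binwrite(path, Marshal.dump(object))
    object
  end
end

calls = 0
cache = RecordCache.new(Dir.mktmpdir) do |accession|
  calls += 1 # in real use this would be the slow download
  "record for #{accession}"
end

first  = cache.get('J00231') # cache miss: the fetcher runs
second = cache.get('J00231') # cache hit: read back from disk
puts first == second # true
puts calls           # 1
```

The fetcher is passed in as a block so that the slow step can be swapped for anything, which is the same reason only the fetch method would need to change in the ActiveRecord version.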

Ruby’s Benchmark module can be used to show the difference in time between downloading the file the first time and reloading it from the local database.

  require File.dirname(__FILE__) + '/active_embl.rb'
  require 'benchmark'
  include Benchmark
 
  ActiveRecord::Base.establish_connection(
    # Database connection details
  )
 
  id = 'J00231'
 
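  # bmbm runs every report twice: a rehearsal pass, then the measured pass.
  bmbm(10) do |x|
    x.report('fetching') { ActiveEMBL.get(id) }
    # Destroy the record after loading it, so that the next 'fetching'
    # run has to download it again.
    x.report('loading') { ActiveEMBL.get(id).destroy }
  end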

This shows a substantial difference in running time between the two.

                user     system       total      real
fetching    0.030000   0.010000   0.040000 ( 2.561105)
loading     0.000000   0.000000   0.000000 ( 0.001548)
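
To try the same comparison without a database or a network connection, the sketch below times a simulated slow fetch against a plain hash lookup. The slow_fetch helper and its 0.05 second sleep are stand-ins invented for the example, not measurements of the EBI service.

```ruby
require 'benchmark'

# Hypothetical stand-in for the download: sleeps briefly instead of
# contacting the EBI server.
def slow_fetch(accession)
  sleep 0.05
  "record for #{accession}"
end

cache = {}
id = 'J00231'

# First call: cache miss, so the slow fetch runs and the result is stored.
fetching = Benchmark.realtime { cache[id] ||= slow_fetch(id) }
# Second call: cache hit, so only the hash lookup is timed.
loading  = Benchmark.realtime { cache[id] ||= slow_fetch(id) }

puts format('fetching %.6f s', fetching)
puts format('loading  %.6f s', loading)
```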

The only problem is that the Bio::EMBL object is a composite attribute, so the Bio::EMBL methods are not directly accessible from the ActiveEMBL object and must instead be called via the embl_obj attribute.

  embl.embl_obj.sequence # 'ATG...'

This isn’t very elegant, as I want to treat the ActiveEMBL object as a BioRuby EMBL object. I could write aliases for all the methods, but an easier way is to use Ruby’s metaprogramming ability to forward any call that the stored Bio::EMBL object understands to embl_obj.

  alias original_method_missing method_missing
 
  def method_missing(meth,*args)
    if read_attribute(:embl_obj).respond_to? meth
      read_attribute(:embl_obj).send meth, *args
    else
      original_method_missing meth, *args
    end
  end

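The BioRuby EMBL methods can now be called directly. The drawback is that this only works as long as none of the Bio::EMBL methods share a name with a method already defined by ActiveRecord::Base, because method_missing is only reached when no method of that name exists on the ActiveEMBL object.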

  embl.sequence # 'ATG...'
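
The forwarding approach and its name-clash drawback can be seen in isolation with two plain Ruby classes. This is a minimal sketch rather than the ActiveEMBL code: Inner and Wrapper are hypothetical stand-ins for the Bio::EMBL object and the ActiveRecord model, and it uses super and respond_to_missing? instead of the alias shown above.

```ruby
# Hypothetical stand-ins: Inner plays the role of the Bio::EMBL object and
# Wrapper the role of the ActiveEMBL record.
class Inner
  def sequence
    'ATG'
  end

  def id
    'inner id'
  end
end

class Wrapper
  def initialize(inner)
    @inner = inner
  end

  # Defined on the wrapper itself, so it is never forwarded.
  def id
    'wrapper id'
  end

  # Forward unknown methods to the wrapped object; `super` raises the usual
  # NoMethodError when the wrapped object cannot handle the call either.
  def method_missing(name, *args, &block)
    if @inner.respond_to?(name)
      @inner.public_send(name, *args, &block)
    else
      super
    end
  end

  # Keep respond_to? consistent with what method_missing forwards.
  def respond_to_missing?(name, include_private = false)
    @inner.respond_to?(name) || super
  end
end

wrapper = Wrapper.new(Inner.new)
puts wrapper.sequence # forwarded to Inner: prints "ATG"
puts wrapper.id       # handled by Wrapper: prints "wrapper id"
```

Both classes define id, but the call never reaches Inner because Wrapper answers it first, which is the clash to watch out for.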

The code for this is here, and it’s pretty generic: a similar approach for other objects would only need a different fetch method. A further example could use BioRuby’s Bio::Fetch to fetch data generically from any bioinformatics database, and BioSQL could be used to represent bioinformatics objects explicitly in SQL. This could then be combined with ActiveRecord’s dynamic finders to search by field, something like this.

  ActiveEMBL.find_by_sequence_length(30)

3 responses

  1. Neil comments:

    Nice. I’ve used something vaguely similar in the past with BioPerl, to suck sequences from a URL into a sequence object. The ugly Perl one-line syntax would go something like:

    my $seqio = Bio::SeqIO->new('-fh' => IO::String->new(get($url)), '-format' => 'fasta');

    One thing to watch out for: raw/plain text from URLs is sometimes wrapped in HTML ‘pre’ tags.

  2. BioBlogs 19: Bioengineering « O’Really? at Duncan.Hull.name pings back:

    […] while pondering the merits of workflows. Michael Barton also outlined some of the challenges of re-using code in software engineering and bioinformatics. Data integration is always a massive engineering challenge in bioinformatics projects, Rod Page […]

  3. jan. comments:

    Congratulations Mike. This post apparently made it onto the “Latest Ruby Links” section of RubyInside.

    Nice work. I’ll have to keep this in mind as I’m interfacing a lot with external databases. As a matter of fact, I’ll have to think about this to incorporate in the Ensembl API: make it possible to use a local sqlite3 store or something.

    BTW: We’re at the moment rewriting part of bioruby so that Bio::EMBL, Bio::GenBank and colleagues actually create Bio::Sequence objects (which makes sense, because EMBL and GenBank are nothing more than sequence _formats_). That’s all still in an SVN branch though and has not been merged into trunk.
