<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>
<channel>
	<title>Comments on: Organised bioinformatics experiments</title>
	<atom:link href="http://www.bioinformaticszen.com/2008/05/organised-bioinformatics-experiments/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.bioinformaticszen.com/2008/05/organised-bioinformatics-experiments/</link>
	<description></description>
	<pubDate>Fri, 21 Nov 2008 18:36:51 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.6.2</generator>
		<item>
		<title>By: The three phases of MySQL usage &#171; What You&#8217;re Doing Is Rather Desperate</title>
		<link>http://www.bioinformaticszen.com/2008/05/organised-bioinformatics-experiments/#comment-30842</link>
		<dc:creator>The three phases of MySQL usage &#171; What You&#8217;re Doing Is Rather Desperate</dc:creator>
		<pubDate>Fri, 17 Oct 2008 02:06:55 +0000</pubDate>
		<guid isPermaLink="false">http://www.bioinformaticszen.com/?p=153#comment-30842</guid>
		<description>[...] a comment &#187;  As Mike keeps reminding me, getting your data into database tables is A Good Thing. Like many people, my database of choice is [...]</description>
		<content:encoded><![CDATA[<p>[...] a comment &raquo;  As Mike keeps reminding me, getting your data into database tables is A Good Thing. Like many people, my database of choice is [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Mike</title>
		<link>http://www.bioinformaticszen.com/2008/05/organised-bioinformatics-experiments/#comment-22755</link>
		<dc:creator>Mike</dc:creator>
		<pubDate>Thu, 17 Jul 2008 15:15:54 +0000</pubDate>
		<guid isPermaLink="false">http://www.bioinformaticszen.com/?p=153#comment-22755</guid>
		<description>That's fair enough, Phil. I understand the appeal of using Unix-like scripts to look at the data manually. However, I would say that once you start manipulating and joining datasets in scripts, a database becomes essential, as it makes things so much easier, especially when combined with an ORM.

Your point about database format interoperability is a valid one. I think most databases allow export to common data formats such as CSV, which is ideal for situations such as creating supplementary materials for manuscripts.</description>
		<content:encoded><![CDATA[<p>That&#8217;s fair enough, Phil. I understand the appeal of using Unix-like scripts to look at the data manually. However, I would say that once you start manipulating and joining datasets in scripts, a database becomes essential, as it makes things so much easier, especially when combined with an ORM.</p>
<p>Your point about database format interoperability is a valid one. I think most databases allow export to common data formats such as CSV, which is ideal for situations such as creating supplementary materials for manuscripts.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Phil</title>
		<link>http://www.bioinformaticszen.com/2008/05/organised-bioinformatics-experiments/#comment-21554</link>
		<dc:creator>Phil</dc:creator>
		<pubDate>Fri, 04 Jul 2008 15:53:09 +0000</pubDate>
		<guid isPermaLink="false">http://www.bioinformaticszen.com/?p=153#comment-21554</guid>
		<description>&lt;blockquote&gt;Without exception, always use a database to store data. Manipulating flat files in scripts is hard work, and is also a source of bugs.&lt;/blockquote&gt;

I'm not convinced.  It's very important for me to be able to look at the data, and to look at old data.  It's very convenient for me to be able to use Unix command-line tools to count the data, sort the data, randomly sort the data, remove duplicates, sum columns, compute average + standard deviation, etc.  I sometimes have to move files between different computers and different operating systems, and I don't have to worry over whether someone else's DB configuration is different.  Also, I will still be able to look at that data 3 years from now, when I have a new version of my DB installed that won't read the old files.</description>
		<content:encoded><![CDATA[<blockquote><p>Without exception, always use a database to store data. Manipulating flat files in scripts is hard work, and is also a source of bugs.</p></blockquote>
<p>I&#8217;m not convinced.  It&#8217;s very important for me to be able to look at the data, and to look at old data.  It&#8217;s very convenient for me to be able to use Unix command-line tools to count the data, sort the data, randomly sort the data, remove duplicates, sum columns, compute average + standard deviation, etc.  I sometimes have to move files between different computers and different operating systems, and I don&#8217;t have to worry over whether someone else&#8217;s DB configuration is different.  Also, I will still be able to look at that data 3 years from now, when I have a new version of my DB installed that won&#8217;t read the old files.</p>
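<p>For comparison, the kind of column summary described above (count, sum, mean and standard deviation) can also be sketched in a few lines of Ruby working directly on flat-file text. This is only a minimal sketch: the tab-separated layout, the gene names and the column_stats helper are all made up for illustration.</p>

```ruby
# Hypothetical tab-separated data: a gene name, then a sequence length.
# In practice the text would come from a flat file, e.g. via File.read.
data_text = "geneA\t2\ngeneB\t4\ngeneC\t4\ngeneD\t4\n" \
            "geneE\t5\ngeneF\t5\ngeneG\t7\ngeneH\t9\n"

# Returns the count, sum, mean and sample standard deviation of one column.
# column_stats is an illustrative helper, not part of any library.
def column_stats(text, column)
  values = text.each_line.map { |line| Float(line.chomp.split("\t")[column]) }
  n = values.length
  sum = values.sum
  mean = sum / n
  # The sample variance divides by n - 1 rather than n.
  variance = values.sum { |v| (v - mean)**2 } / (n - 1)
  { count: n, sum: sum, mean: mean, sd: Math.sqrt(variance) }
end

stats = column_stats(data_text, 1)
puts format('n=%d sum=%.1f mean=%.2f sd=%.2f',
            stats[:count], stats[:sum], stats[:mean], stats[:sd])
```

<p>Nothing here needs a database, which is the point being made: the data stays in a plain file that any tool can read.</p>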
]]></content:encoded>
	</item>
	<item>
		<title>By: Shortcuts for generating HTML &#124; Bioinformatics Zen</title>
		<link>http://www.bioinformaticszen.com/2008/05/organised-bioinformatics-experiments/#comment-20438</link>
		<dc:creator>Shortcuts for generating HTML &#124; Bioinformatics Zen</dc:creator>
		<pubDate>Sat, 21 Jun 2008 18:16:49 +0000</pubDate>
		<guid isPermaLink="false">http://www.bioinformaticszen.com/?p=153#comment-20438</guid>
		<description>[...] Zen GTA4 is a competitive inhibitor of blogging     Organised bioinformatics experiments [...]</description>
		<content:encoded><![CDATA[<p>[...] Zen GTA4 is a competitive inhibitor of blogging     Organised bioinformatics experiments [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Mike</title>
		<link>http://www.bioinformaticszen.com/2008/05/organised-bioinformatics-experiments/#comment-19990</link>
		<dc:creator>Mike</dc:creator>
		<pubDate>Mon, 16 Jun 2008 18:20:13 +0000</pubDate>
		<guid isPermaLink="false">http://www.bioinformaticszen.com/?p=153#comment-19990</guid>
		<description>@Andrew
Point taken, I will try to curb my cat abuse habit.

@Michael
I agree, Python and Ruby syntax make coding much more of a pleasure, and I would recommend either of the two to any bioinformatician. I'll continue to write in Ruby, though, as this is the language I know.

@Gioby
For to-do lists I use Lighthouse, as it suits my personal preference. However, there are plenty of tools available for keeping to-do lists, and I think it comes down to which one suits you.

I agree with your point about old software that works. I would say that if you have a system that you still have to maintain and change, then "upgrade" to a more easily maintainable system. In your example I would leave it be: as you say, if it works, your extra effort could be used elsewhere.</description>
		<content:encoded><![CDATA[<p>@Andrew<br />
Point taken, I will try to curb my cat abuse habit.</p>
<p>@Michael<br />
I agree, Python and Ruby syntax make coding much more of a pleasure, and I would recommend either of the two to any bioinformatician. I&#8217;ll continue to write in Ruby, though, as this is the language I know.</p>
<p>@Gioby<br />
For to-do lists I use Lighthouse, as it suits my personal preference. However, there are plenty of tools available for keeping to-do lists, and I think it comes down to which one suits you.</p>
<p>I agree with your point about old software that works. I would say that if you have a system that you still have to maintain and change, then &#8220;upgrade&#8221; to a more easily maintainable system. In your example I would leave it be: as you say, if it works, your extra effort could be used elsewhere.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: gioby</title>
		<link>http://www.bioinformaticszen.com/2008/05/organised-bioinformatics-experiments/#comment-19743</link>
		<dc:creator>gioby</dc:creator>
		<pubDate>Sat, 14 Jun 2008 09:26:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.bioinformaticszen.com/?p=153#comment-19743</guid>
		<description>&lt;blockquote&gt;Without exception, always use a database to store data. Manipulating flat files in scripts is hard work, and is also a source of bugs. &lt;/blockquote&gt;
That's not always as easy as it seems.
Many laboratories and people have used flat-files for years: that means that if you work in such a place, all the scripts, programs produced and used internally by everyone are based on flat files.
What will you do then? If you spend time rewriting or adapting all these scripts, your colleagues will most likely think you are wasting time, because the flat-file-based scripts already work :).
Think of Biopython, BioPerl and EMBOSS: all of these are meant to be used with flat files, even if you can adapt them with a small amount of work.


&lt;blockquote&gt;Use make-type files instead of scripts&lt;/blockquote&gt;Wow, I didn't know of make before!
I am using it now and it is a really good tool. 
A good primer on using make for bioinformaticians is also available at http://www.swc.scipy.org/

&lt;blockquote&gt;Use testing and validations&lt;/blockquote&gt;I try to validate and test my data as often as possible.
Testing bioinformatics software has an additional complexity: you have to check that your scripts don't make any mistakes and, moreover, that your results are compatible with their biological context.
This last kind of testing is more like what other scientists do when they use negative and positive controls in their experiments.
However, do you know whether there are any common guidelines for testing bioinformatics scripts?
They would be very useful.


Well, I really appreciated your post. You are a really good bioinformatician because you are so keen to share your experience and know-how with others, publicly.
Thank you very much: I am going to install a MySQL database on my computer, and I didn't know about ORM modules. If only there were more people like you in bioinformatics :).</description>
		<content:encoded><![CDATA[<blockquote><p>Without exception, always use a database to store data. Manipulating flat files in scripts is hard work, and is also a source of bugs. </p></blockquote>
<p>That&#8217;s not always as easy as it seems.<br />
Many laboratories and people have used flat-files for years: that means that if you work in such a place, all the scripts, programs produced and used internally by everyone are based on flat files.<br />
What will you do then? If you spend time rewriting or adapting all these scripts, your colleagues will most likely think you are wasting time, because the flat-file-based scripts already work :).<br />
Think of Biopython, BioPerl and EMBOSS: all of these are meant to be used with flat files, even if you can adapt them with a small amount of work.</p>
<blockquote><p>Use make-type files instead of scripts</p></blockquote>
<p>Wow, I didn&#8217;t know of make before!<br />
I am using it now and it is a really good tool.<br />
A good primer on using make for bioinformaticists is also in <a href="http://www.swc.scipy.org/" rel="nofollow">http://www.swc.scipy.org/</a></p>
<blockquote><p>Use testing and validations</p></blockquote>
<p>I try to validate and test my data as often as possible.<br />
Testing bioinformatics software has an additional complexity: you have to check that your scripts don&#8217;t make any mistakes and, moreover, that your results are compatible with their biological context.<br />
This last kind of testing is more like what other scientists do when they use negative and positive controls in their experiments.<br />
However, do you know whether there are any common guidelines for testing bioinformatics scripts?<br />
They would be very useful.</p>
<p>Well, I really appreciated your post. You are a really good bioinformatician because you are so keen to share your experience and know-how with others, publicly.<br />
Thank you very much: I am going to install a MySQL database on my computer, and I didn&#8217;t know about ORM modules. If only there were more people like you in bioinformatics :).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: gioby</title>
		<link>http://www.bioinformaticszen.com/2008/05/organised-bioinformatics-experiments/#comment-19650</link>
		<dc:creator>gioby</dc:creator>
		<pubDate>Fri, 13 Jun 2008 09:00:44 +0000</pubDate>
		<guid isPermaLink="false">http://www.bioinformaticszen.com/?p=153#comment-19650</guid>
		<description>How do you manage 'To-Do' lists?
For example, let's say you don't have the time to test some particular parameters in your statistical analysis, but you want to have a memo in case you will be able to do it later.
Do you use the same revision control system software, or do you have something like a personal bug-tracker/feature report installed in your computer?</description>
		<content:encoded><![CDATA[<p>How do you manage &#8216;To-Do&#8217; lists?<br />
For example, let&#8217;s say you don&#8217;t have the time to test some particular parameters in your statistical analysis, but you want to have a memo in case you will be able to do it later.<br />
Do you use the same revision control system software, or do you have something like a personal bug-tracker/feature report installed in your computer?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: jan.</title>
		<link>http://www.bioinformaticszen.com/2008/05/organised-bioinformatics-experiments/#comment-19591</link>
		<dc:creator>jan.</dc:creator>
		<pubDate>Thu, 12 Jun 2008 15:18:46 +0000</pubDate>
		<guid isPermaLink="false">http://www.bioinformaticszen.com/?p=153#comment-19591</guid>
		<description>All,

To solve the issue we discussed above about keeping track of what tasks have already been done in your Rakefile, I have written an extension to rake. Rake is ideal if you're working with files because it takes the timestamps of the files into consideration. However, no files are created when you're loading stuff into a database. I tried to solve that by extending rake so that it puts timestamps in a little meta table in the database as well.

You can get it at http://github.com/jandot/biorake/tree/master

jan.</description>
		<content:encoded><![CDATA[<p>All,</p>
<p>To solve the issue we discussed above about keeping track of what tasks have already been done in your Rakefile, I have written an extension to rake. Rake is ideal if you&#8217;re working with files because it takes the timestamps of the files into consideration. However, no files are created when you&#8217;re loading stuff into a database. I tried to solve that by extending rake so that it puts timestamps in a little meta table in the database as well.</p>
<p>You can get it at <a href="http://github.com/jandot/biorake/tree/master" rel="nofollow">http://github.com/jandot/biorake/tree/master</a></p>
<p>jan.</p>
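<p>The timestamp idea described above can be sketched in plain Ruby. This is a minimal illustration only and not biorake&#8217;s actual code: a Hash stands in for the database meta table, and the run_if_stale helper and its arguments are hypothetical names.</p>

```ruby
# A plain Hash stands in for the database meta table
# (task name -> time the task last completed).
# Illustration only; this is not how biorake itself is implemented.
def run_if_stale(meta, task_name, source_time)
  # A task that has never run is treated as being as old as the epoch.
  last_run = meta.fetch(task_name, Time.at(0))
  return false if last_run >= source_time

  yield
  meta[task_name] = Time.now
  true
end

meta = {}
source_time = Time.now - 60 # pretend the input data changed a minute ago

first = run_if_stale(meta, 'load_sequences', source_time) { puts 'loading sequences' }
second = run_if_stale(meta, 'load_sequences', source_time) { puts 'loading sequences' }
puts "first run executed: #{first}, second run executed: #{second}"
```

<p>The comparison mirrors what rake does with file modification times: the work is repeated only when the recorded completion time is older than the input it depends on.</p>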
]]></content:encoded>
	</item>
	<item>
		<title>By: michael</title>
		<link>http://www.bioinformaticszen.com/2008/05/organised-bioinformatics-experiments/#comment-19524</link>
		<dc:creator>michael</dc:creator>
		<pubDate>Wed, 11 Jun 2008 18:56:36 +0000</pubDate>
		<guid isPermaLink="false">http://www.bioinformaticszen.com/?p=153#comment-19524</guid>
		<description>I am glad this is Ruby rather than Perl. It shows people which language is indeed better, and there are so many people still relying on Perl rather than Ruby (or Python, which I think is better than Perl as well, although Ruby's syntax is much clearer from my point of view when writing, and I have written quite a bit of Python code as well).</description>
		<content:encoded><![CDATA[<p>I am glad this is Ruby rather than Perl. It shows people which language is indeed better, and there are so many people still relying on Perl rather than Ruby (or Python, which I think is better than Perl as well, although Ruby&#8217;s syntax is much clearer from my point of view when writing, and I have written quite a bit of Python code as well).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Andrew Clegg</title>
		<link>http://www.bioinformaticszen.com/2008/05/organised-bioinformatics-experiments/#comment-18244</link>
		<dc:creator>Andrew Clegg</dc:creator>
		<pubDate>Thu, 29 May 2008 13:40:33 +0000</pubDate>
		<guid isPermaLink="false">http://www.bioinformaticszen.com/?p=153#comment-18244</guid>
		<description>@Mike:

`cat data.txt &#124; sort`

Definitely a nomination for the Useless Use of Cat Award.

http://partmaps.org/era/unix/award.html</description>
		<content:encoded><![CDATA[<p>@Mike:</p>
<p>`cat data.txt | sort`</p>
<p>Definitely a nomination for the Useless Use of Cat Award.</p>
<p><a href="http://partmaps.org/era/unix/award.html" rel="nofollow">http://partmaps.org/era/unix/award.html</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Sphaerula &#187; Twitter for Week of 18 May 2008</title>
		<link>http://www.bioinformaticszen.com/2008/05/organised-bioinformatics-experiments/#comment-18237</link>
		<dc:creator>Sphaerula &#187; Twitter for Week of 18 May 2008</dc:creator>
		<pubDate>Thu, 29 May 2008 11:19:14 +0000</pubDate>
		<guid isPermaLink="false">http://www.bioinformaticszen.com/?p=153#comment-18237</guid>
		<description>[...] 2008-05-24: Excellent advice from Bioinformatics Zen about how to organize bioinformatics experiments. See this post. [...]</description>
		<content:encoded><![CDATA[<p>[...] 2008-05-24: Excellent advice from Bioinformatics Zen about how to organize bioinformatics experiments. See this post. [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Adam</title>
		<link>http://www.bioinformaticszen.com/2008/05/organised-bioinformatics-experiments/#comment-18197</link>
		<dc:creator>Adam</dc:creator>
		<pubDate>Wed, 28 May 2008 22:42:17 +0000</pubDate>
		<guid isPermaLink="false">http://www.bioinformaticszen.com/?p=153#comment-18197</guid>
		<description>@Aaron
I think we're all in agreement that make is a great utility.  I personally use it almost every day.  But Make isn't always ideal, otherwise there wouldn't be so many alternatives (ant, rake, scons, omake, etc, etc).  Rake is not so much a replacement for Make as it is a tool designed in the spirit of make that lets you define tasks and their prerequisites easily in Ruby.  You probably shouldn't use Rake to build C or Fortran programs, and you probably shouldn't use Make to manipulate a database.</description>
		<content:encoded><![CDATA[<p>@Aaron<br />
I think we&#8217;re all in agreement that make is a great utility.  I personally use it almost every day.  But Make isn&#8217;t always ideal, otherwise there wouldn&#8217;t be so many alternatives (ant, rake, scons, omake, etc, etc).  Rake is not so much a replacement for Make as it is a tool designed in the spirit of make that lets you define tasks and their prerequisites easily in Ruby.  You probably shouldn&#8217;t use Rake to build C or Fortran programs, and you probably shouldn&#8217;t use Make to manipulate a database.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Aaron</title>
		<link>http://www.bioinformaticszen.com/2008/05/organised-bioinformatics-experiments/#comment-18077</link>
		<dc:creator>Aaron</dc:creator>
		<pubDate>Tue, 27 May 2008 20:41:58 +0000</pubDate>
		<guid isPermaLink="false">http://www.bioinformaticszen.com/?p=153#comment-18077</guid>
		<description>There's an excellent make-like tool that handles prerequisites, allows "shelling out", etc. -- it's called Make.  Ruby is great, and so is the shell (as already admitted by most).  You don't reimplement grep, sort, uniq, etc. in Ruby (though you certainly could), you just use the shell commands.  "make" is a shell command, you should learn it along with all the other arcane Rubyesque wizardry ...</description>
		<content:encoded><![CDATA[<p>There&#8217;s an excellent make-like tool that handles prerequisites, allows &#8220;shelling out&#8221;, etc. &#8212; it&#8217;s called Make.  Ruby is great, and so is the shell (as already admitted by most).  You don&#8217;t reimplement grep, sort, uniq, etc. in Ruby (though you certainly could), you just use the shell commands.  &#8220;make&#8221; is a shell command, you should learn it along with all the other arcane Rubyesque wizardry &#8230;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: jan.</title>
		<link>http://www.bioinformaticszen.com/2008/05/organised-bioinformatics-experiments/#comment-18051</link>
		<dc:creator>jan.</dc:creator>
		<pubDate>Tue, 27 May 2008 16:27:13 +0000</pubDate>
		<guid isPermaLink="false">http://www.bioinformaticszen.com/?p=153#comment-18051</guid>
		<description>Good suggestion about that gem. We'll have to think about how that would work, though.

I actually just performed a little project using this approach, and it works great!</description>
		<content:encoded><![CDATA[<p>Good suggestion about that gem. We&#8217;ll have to think about how that would work, though.</p>
<p>I actually just performed a little project using this approach, and it works great!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Mike</title>
		<link>http://www.bioinformaticszen.com/2008/05/organised-bioinformatics-experiments/#comment-18045</link>
		<dc:creator>Mike</dc:creator>
		<pubDate>Tue, 27 May 2008 13:43:57 +0000</pubDate>
		<guid isPermaLink="false">http://www.bioinformaticszen.com/?p=153#comment-18045</guid>
		<description>I really like how you've implemented that, Jan. &lt;a href="http://github.com/michaelbarton/organised_experiments/commit/91c0f6c439126729fd16f8fe03adb5c21c212fd1" rel="nofollow"&gt;I've added the change to the repository.&lt;/a&gt; I didn't add the STDERR messages as I prefer to do this using a project logger. I hope you don't mind.

I wonder whether, in future, it might be worth packaging this up into a Rails-type gem focused on organising bioinformatics experiments. I think it might take a fair amount of work, but could ultimately prove very worthwhile.</description>
		<content:encoded><![CDATA[<p>I really like how you&#8217;ve implemented that, Jan. <a href="http://github.com/michaelbarton/organised_experiments/commit/91c0f6c439126729fd16f8fe03adb5c21c212fd1" rel="nofollow">I&#8217;ve added the change to the repository.</a> I didn&#8217;t add the STDERR messages as I prefer to do this using a project logger. I hope you don&#8217;t mind.</p>
<p>I wonder whether, in future, it might be worth packaging this up into a Rails-type gem focused on organising bioinformatics experiments. I think it might take a fair amount of work, but could ultimately prove very worthwhile.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: jan.</title>
		<link>http://www.bioinformaticszen.com/2008/05/organised-bioinformatics-experiments/#comment-18042</link>
		<dc:creator>jan.</dc:creator>
		<pubDate>Tue, 27 May 2008 12:57:44 +0000</pubDate>
		<guid isPermaLink="false">http://www.bioinformaticszen.com/?p=153#comment-18042</guid>
		<description>This does the trick: sequences are only loaded _once_ if necessary (notice the new prerequisite for sequence_stats):

&lt;pre lang="ruby"&gt;
  desc 'Checks to see if sequences need to be loaded'
  task :check_load_sequences do
    STDERR.puts "DEBUG: checking if sequences need loading"
    if Gene.all.length == 0
      Rake::Task['001:load_sequences'].invoke
    end
  end
  
  desc 'Loads the protein sequences into the databases'
  task :load_sequences =&gt; :delete_sequences do
    STDERR.puts "DEBUG: loading sequences"
    file_gz = File.dirname(__FILE__) + '/data/protein.fasta.gz'
    Zlib::GzipReader.open(file_gz) do &#124;file&#124;
      Bio::FlatFile.auto(file).each {&#124;entry&#124; Gene.create_from_flatfile entry }
    end
  end

  desc 'Calculates statistics for gene sequences'
  task :sequence_stats =&gt; :check_load_sequences do
    STDERR.puts "DEBUG: calculating statistics"
    File.open(File.dirname(__FILE__) + '/results/sequence_statistics.txt','w') do &#124;file&#124;
      file.puts "Gene mean length : #{ Gene.mean_length }"
      file.puts "Gene length standard deviation : #{ Gene.sd_length }"
    end
  end
&lt;/pre&gt;</description>
		<content:encoded><![CDATA[<p>This does the trick: sequences are only loaded _once_ if necessary (notice the new prerequisite for sequence_stats):</p>

<div class="wp_syntax"><div class="code"><pre class="ruby">  desc <span style="color:#996600;">'Checks to see if sequences need to be loaded'</span>
  task <span style="color:#ff3333; font-weight:bold;">:check_load_sequences</span> <span style="color:#9966CC; font-weight:bold;">do</span>
    STDERR.<span style="color:#CC0066; font-weight:bold;">puts</span> <span style="color:#996600;">&quot;DEBUG: checking if sequences need loading&quot;</span>
    <span style="color:#9966CC; font-weight:bold;">if</span> Gene.<span style="color:#9900CC;">all</span>.<span style="color:#9900CC;">length</span> == <span style="color:#006666;">0</span>
      <span style="color:#6666ff; font-weight:bold;">Rake::Task</span><span style="color:#006600; font-weight:bold;">&#91;</span><span style="color:#996600;">'001:load_sequences'</span><span style="color:#006600; font-weight:bold;">&#93;</span>.<span style="color:#9900CC;">invoke</span>
    <span style="color:#9966CC; font-weight:bold;">end</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
  desc <span style="color:#996600;">'Loads the protein sequences into the databases'</span>
  task <span style="color:#ff3333; font-weight:bold;">:load_sequences</span> =&gt; <span style="color:#ff3333; font-weight:bold;">:delete_sequences</span> <span style="color:#9966CC; font-weight:bold;">do</span>
    STDERR.<span style="color:#CC0066; font-weight:bold;">puts</span> <span style="color:#996600;">&quot;DEBUG: loading sequences&quot;</span>
    file_gz = <span style="color:#CC00FF; font-weight:bold;">File</span>.<span style="color:#9900CC;">dirname</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#0000FF; font-weight:bold;">__FILE__</span><span style="color:#006600; font-weight:bold;">&#41;</span> + <span style="color:#996600;">'/data/protein.fasta.gz'</span>
    <span style="color:#6666ff; font-weight:bold;">Zlib::GzipReader</span>.<span style="color:#CC0066; font-weight:bold;">open</span><span style="color:#006600; font-weight:bold;">&#40;</span>file_gz<span style="color:#006600; font-weight:bold;">&#41;</span> <span style="color:#9966CC; font-weight:bold;">do</span> |file|
      <span style="color:#6666ff; font-weight:bold;">Bio::FlatFile</span>.<span style="color:#9900CC;">auto</span><span style="color:#006600; font-weight:bold;">&#40;</span>file<span style="color:#006600; font-weight:bold;">&#41;</span>.<span style="color:#9900CC;">each</span> <span style="color:#006600; font-weight:bold;">&#123;</span>|entry| Gene.<span style="color:#9900CC;">create_from_flatfile</span> entry <span style="color:#006600; font-weight:bold;">&#125;</span>
    <span style="color:#9966CC; font-weight:bold;">end</span>
  <span style="color:#9966CC; font-weight:bold;">end</span>
&nbsp;
  desc <span style="color:#996600;">'Calculates statistics for gene sequences'</span>
  task <span style="color:#ff3333; font-weight:bold;">:sequence_stats</span> =&gt; <span style="color:#ff3333; font-weight:bold;">:check_load_sequences</span> <span style="color:#9966CC; font-weight:bold;">do</span>
    STDERR.<span style="color:#CC0066; font-weight:bold;">puts</span> <span style="color:#996600;">&quot;DEBUG: calculating statistics&quot;</span>
    <span style="color:#CC00FF; font-weight:bold;">File</span>.<span style="color:#CC0066; font-weight:bold;">open</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#CC00FF; font-weight:bold;">File</span>.<span style="color:#9900CC;">dirname</span><span style="color:#006600; font-weight:bold;">&#40;</span><span style="color:#0000FF; font-weight:bold;">__FILE__</span><span style="color:#006600; font-weight:bold;">&#41;</span> + <span style="color:#996600;">'/results/sequence_statistics.txt'</span>,<span style="color:#996600;">'w'</span><span style="color:#006600; font-weight:bold;">&#41;</span> <span style="color:#9966CC; font-weight:bold;">do</span> |file|
      file.<span style="color:#CC0066; font-weight:bold;">puts</span> <span style="color:#996600;">&quot;Gene mean length : #{ Gene.mean_length }&quot;</span>
      file.<span style="color:#CC0066; font-weight:bold;">puts</span> <span style="color:#996600;">&quot;Gene length standard deviation : #{ Gene.sd_length }&quot;</span>
    <span style="color:#9966CC; font-weight:bold;">end</span>
  <span style="color:#9966CC; font-weight:bold;">end</span></pre></div></div>

]]></content:encoded>
	</item>
	<item>
		<title>By: jan.</title>
		<link>http://www.bioinformaticszen.com/2008/05/organised-bioinformatics-experiments/#comment-18040</link>
		<dc:creator>jan.</dc:creator>
		<pubDate>Tue, 27 May 2008 12:43:17 +0000</pubDate>
		<guid isPermaLink="false">http://www.bioinformaticszen.com/?p=153#comment-18040</guid>
		<description>Thanks for those thoughts, Mike. I like the idea of the 'hard' and 'soft' task.</description>
		<content:encoded><![CDATA[<p>Thanks for those thoughts, Mike. I like the idea of the &#8216;hard&#8217; and &#8216;soft&#8217; task.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Mike</title>
		<link>http://www.bioinformaticszen.com/2008/05/organised-bioinformatics-experiments/#comment-18035</link>
		<dc:creator>Mike</dc:creator>
		<pubDate>Tue, 27 May 2008 11:19:24 +0000</pubDate>
		<guid isPermaLink="false">http://www.bioinformaticszen.com/?p=153#comment-18035</guid>
		<description>Also Jan, Jay Fields has a great &lt;a href="http://blog.jayfields.com/2006/06/ruby-kernel-system-exec-and-x.html" rel="nofollow"&gt;article on running bash processes from Ruby&lt;/a&gt;</description>
		<content:encoded><![CDATA[<p>Also Jan, Jay Fields has a great <a href="http://blog.jayfields.com/2006/06/ruby-kernel-system-exec-and-x.html" rel="nofollow">article on running bash processes from Ruby</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Mike</title>
		<link>http://www.bioinformaticszen.com/2008/05/organised-bioinformatics-experiments/#comment-18034</link>
		<dc:creator>Mike</dc:creator>
		<pubDate>Tue, 27 May 2008 11:17:16 +0000</pubDate>
		<guid isPermaLink="false">http://www.bioinformaticszen.com/?p=153#comment-18034</guid>
		<description>Yes, I did think about this a bit. I made the assumption that I would know whether the sequences were in the database or not, so there would be no need for a dependency. I think adding a database status check that runs the load_sequences task if necessary could start to make maintaining the Rakefile a little complicated. As a compromise, you could add a check on the status of the database and notify the user to run the corresponding task. Something like:

&lt;pre lang="ruby"&gt;
if Gene.all.length == 0
  # Notify the user
else
  # Do analysis
end
&lt;/pre&gt;

On the other hand, the rebuild task runs all of the tasks in the correct order when the project is run from scratch, so in this case you would know that all the required data was in the database.

I do see what you mean about a meta-table of information, and it would take care of things like this, but I think it could add an extra layer of complexity. Another option is to have a hard and a soft load_sequences task. The hard task clears the database and loads all the sequences. The soft task only loads them if there are none. The stats task could then depend on the soft load_sequences task.</description>
		<content:encoded><![CDATA[<p>Yes, I did think about this a bit. I made the assumption that I would know whether the sequences were in the database or not, so there would be no need for a dependency. I think adding a database status check that runs the load_sequences task if necessary could start to make maintaining the Rakefile a little complicated. As a compromise, you could add a check on the status of the database and notify the user to run the corresponding task. Something like:</p>

<div class="wp_syntax"><div class="code"><pre class="ruby"><span style="color:#9966CC; font-weight:bold;">if</span> Gene.<span style="color:#9900CC;">all</span>.<span style="color:#9900CC;">length</span> == <span style="color:#006666;">0</span>
  <span style="color:#008000; font-style:italic;"># Notify the user</span>
<span style="color:#9966CC; font-weight:bold;">else</span>
  <span style="color:#008000; font-style:italic;"># Do analysis</span>
<span style="color:#9966CC; font-weight:bold;">end</span></pre></div></div>

<p>On the other hand, the rebuild task runs all of the tasks in the correct order when the project is run from scratch, so in this case you would know that all the required data was in the database.</p>
<p>I do see what you mean about a meta-table of information, and it would take care of things like this, but I think it could add an extra layer of complexity. Another option is to have a hard and a soft load_sequences task. The hard task clears the database and loads all the sequences. The soft task only loads them if there are none. The stats task could then depend on the soft load_sequences task.</p>
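<p>A rough sketch of the hard/soft idea might look something like the following. It is only an illustration under a few assumptions: the task names load_sequences:hard and load_sequences:soft, the GENES array and the load_all_sequences helper are all made up for this example, and the array stands in for the Gene table so that the sketch runs without a database.</p>

```ruby
require "rake"

# Hypothetical stand-in for the Gene table, so the sketch runs without a database.
GENES = []

# Hypothetical loader; in a real project this would parse the sequence
# file and save one record per gene.
def load_all_sequences
  GENES.replace(%w[geneA geneB geneC])
end

namespace :load_sequences do
  desc "Hard load: clear the table, then load every sequence"
  task :hard do
    GENES.clear
    load_all_sequences
  end

  desc "Soft load: only load the sequences if none are present"
  task :soft do
    load_all_sequences if GENES.empty?
  end
end

desc "Sequence statistics, dependent on the soft load only"
task :sequence_stats => "load_sequences:soft" do
  puts "#{GENES.length} sequences in the database"
end
```

<p>With this layout, running the sequence_stats task still triggers the soft task every time, but the soft task does nothing once the sequences are there, and the hard task remains available for a rebuild from scratch.</p>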
]]></content:encoded>
	</item>
	<item>
		<title>By: jan.</title>
		<link>http://www.bioinformaticszen.com/2008/05/organised-bioinformatics-experiments/#comment-18028</link>
		<dc:creator>jan.</dc:creator>
		<pubDate>Tue, 27 May 2008 10:00:54 +0000</pubDate>
		<guid isPermaLink="false">http://www.bioinformaticszen.com/?p=153#comment-18028</guid>
		<description>Another thought (sorry :-)). Something I've bumped into myself while using Rake for this type of thing...

I'm sure you tried this as well: within your sequence_stats task, you obviously would want to add the prerequisite that the sequences are loaded in the first place, so instead of

  task :sequence_stats do

you'd do

  task :sequence_stats =&#62; :load_sequences do

However, this doesn't work in rake because it can't know if you did your loading or not. As a result, it would load the sequences every single time you want to get the sequence stats. Do you know of a way to get this working? I wonder if we can use timestamps on files to do this... Or would we have to tweak rake so that it takes into account some "status" metatable in the database? Any suggestions appreciated...</description>
		<content:encoded><![CDATA[<p>Another thought (sorry <img src='http://www.bioinformaticszen.com/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' />). Something I&#8217;ve bumped into myself while using Rake for this type of thing&#8230;</p>
<p>I&#8217;m sure you tried this as well: within your sequence_stats task, you obviously would want to add the prerequisite that the sequences are loaded in the first place, so instead of</p>
<p>  task :sequence_stats do</p>
<p>you&#8217;d do</p>
<p>  task :sequence_stats =&gt; :load_sequences do</p>
<p>However, this doesn&#8217;t work in rake because it can&#8217;t know if you did your loading or not. As a result, it would load the sequences every single time you want to get the sequence stats. Do you know of a way to get this working? I wonder if we can use timestamps on files to do this&#8230; Or would we have to tweak rake so that it takes into account some &#8220;status&#8221; metatable in the database? Any suggestions appreciated&#8230;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Mike</title>
		<link>http://www.bioinformaticszen.com/2008/05/organised-bioinformatics-experiments/#comment-17950</link>
		<dc:creator>Mike</dc:creator>
		<pubDate>Mon, 26 May 2008 14:56:46 +0000</pubDate>
		<guid isPermaLink="false">http://www.bioinformaticszen.com/?p=153#comment-17950</guid>
		<description>Thanks for your comments guys, I'm always very flattered when you take the time to write that something I've done could be useful.

@Adam
I think DataMapper and Sequel are interchangeable in terms of the attention they are getting in the Ruby community. I've never tried Sequel though. Comparing DM and ActiveRecord, the best thing about DM is that it does away with migrations, which are a bit heavyweight for this type of work. On the downside, I'm not sure how many AR plugins will work with DM. One example is acts_as_reportable, which I don't think does.

@Wubin and @Andrew
I've never used Python so I haven't got a clue about the libraries. Andrew's suggestions look good. Another option could be to look at what Django uses, as I think it's somewhat equivalent to Rails, so it will probably have a handy ORM involved somewhere.

@Jan
As you say, in an ideal world there would be a complete Gene class that could be reused in any project. At the moment I'm only using this approach in one project, so I can't really say what I would do, and to be honest I hadn't really thought about this until you mentioned it. I agree though that a set of generic classes would be useful. A set of validations that could be included when required would be useful as well, as these are the things I spend most of my time messing around with until they work. Could BioSQL be a place to start for a set of generic ORM classes? As always, it's the problem of fitting in writing all the sexy Ruby libraries I'd like alongside publishing something.

As for your next point, it's true that you can't write everything in Ruby. For shell commands, Ruby allows ` quoting for running Unix commands, for example `cat data.txt &#124; sort` and so forth. It'd be nice to write everything in Ruby, but I guess pragmatism comes before style.</description>
		<content:encoded><![CDATA[<p>Thanks for your comments guys, I&#8217;m always very flattered when you take the time to write that something I&#8217;ve done could be useful.</p>
<p>@Adam<br />
I think DataMapper and Sequel are interchangeable in terms of the attention they are getting in the Ruby community. I&#8217;ve never tried Sequel though. Comparing DM and ActiveRecord, the best thing about DM is that it does away with migrations, which are a bit heavyweight for this type of work. On the downside, I&#8217;m not sure how many AR plugins will work with DM. One example is acts_as_reportable, which I don&#8217;t think does.</p>
<p>@Wubin and @Andrew<br />
I&#8217;ve never used Python so I haven&#8217;t got a clue about the libraries. Andrew&#8217;s suggestions look good. Another option could be to look at what Django uses, as I think it&#8217;s somewhat equivalent to Rails, so it will probably have a handy ORM involved somewhere.</p>
<p>@Jan<br />
As you say, in an ideal world there would be a complete Gene class that could be reused in any project. At the moment I&#8217;m only using this approach in one project, so I can&#8217;t really say what I would do, and to be honest I hadn&#8217;t really thought about this until you mentioned it. I agree though that a set of generic classes would be useful. A set of validations that could be included when required would be useful as well, as these are the things I spend most of my time messing around with until they work. Could BioSQL be a place to start for a set of generic ORM classes? As always, it&#8217;s the problem of fitting in writing all the sexy Ruby libraries I&#8217;d like alongside publishing something.</p>
<p>As for your next point, it&#8217;s true that you can&#8217;t write everything in Ruby. For shell commands, Ruby allows ` quoting for running Unix commands, for example `cat data.txt | sort` and so forth. It&#8217;d be nice to write everything in Ruby, but I guess pragmatism comes before style.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: jan.</title>
		<link>http://www.bioinformaticszen.com/2008/05/organised-bioinformatics-experiments/#comment-17872</link>
		<dc:creator>jan.</dc:creator>
		<pubDate>Sun, 25 May 2008 19:29:24 +0000</pubDate>
		<guid isPermaLink="false">http://www.bioinformaticszen.com/?p=153#comment-17872</guid>
		<description>Just had a little more time to actually look at your example on GitHub. I think I'll try to use this approach on my next project. What I find particularly useful as well is how you load task-specific Rakefiles (the 001 namespace) into the project Rakefile (using the project-task-step approach I described &lt;a href="http://saaientist.blogspot.com/2008/05/keeping-track-of-things-using-labbook.html" rel="nofollow"&gt;here&lt;/a&gt;).</description>
		<content:encoded><![CDATA[<p>Just had a little more time to actually look at your example on GitHub. I think I&#8217;ll try to use this approach on my next project. What I find particularly useful as well is how you load task-specific Rakefiles (the 001 namespace) into the project Rakefile (using the project-task-step approach I described <a href="http://saaientist.blogspot.com/2008/05/keeping-track-of-things-using-labbook.html" rel="nofollow">here</a>).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: jan.</title>
		<link>http://www.bioinformaticszen.com/2008/05/organised-bioinformatics-experiments/#comment-17837</link>
		<dc:creator>jan.</dc:creator>
		<pubDate>Sun, 25 May 2008 09:27:37 +0000</pubDate>
		<guid isPermaLink="false">http://www.bioinformaticszen.com/?p=153#comment-17837</guid>
		<description>Great post again Mike. I will surely try this out as I always end up having too many scripts in my working directory. Or you did a whole analysis and then you're told to do the same thing for this-and-this gene.

Two questions: (1) do you rewrite a little Gene class within every project (tweaking it to just do what is necessary within that project), or do you have a "master" gene class defined somewhere that you use whenever there's a new project? We all know the second option should probably be preferred in the long run, but we all end up doing the first...
And (2): this approach means you have to do everything in Ruby, doesn't it? However, several of the steps in my own workflows can often be done more easily with Linux commands (grep, sort, uniq and wc, anyone?). How do you handle that? Do you use a Ruby equivalent? Or do you use a "system('sort')" in your Rakefiles?

Good idea of putting an example on github...

jan.</description>
		<content:encoded><![CDATA[<p>Great post again Mike. I will surely try this out as I always end up having too many scripts in my working directory. Or you did a whole analysis and then you&#8217;re told to do the same thing for this-and-this gene.</p>
<p>Two questions: (1) do you rewrite a little Gene class within every project (tweaking it to just do what is necessary within that project), or do you have a &#8220;master&#8221; gene class defined somewhere that you use whenever there&#8217;s a new project? We all know the second option should probably be preferred in the long run, but we all end up doing the first&#8230;<br />
And (2): this approach means you have to do everything in Ruby, doesn&#8217;t it? However, several of the steps in my own workflows can often be done more easily with Linux commands (grep, sort, uniq and wc, anyone?). How do you handle that? Do you use a Ruby equivalent? Or do you use a &#8220;system('sort')&#8221; in your Rakefiles?</p>
<p>Good idea of putting an example on github&#8230;</p>
<p>jan.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Andrew Perry</title>
		<link>http://www.bioinformaticszen.com/2008/05/organised-bioinformatics-experiments/#comment-17824</link>
		<dc:creator>Andrew Perry</dc:creator>
		<pubDate>Sun, 25 May 2008 02:42:09 +0000</pubDate>
		<guid isPermaLink="false">http://www.bioinformaticszen.com/?p=153#comment-17824</guid>
		<description>Wubin: The closest things to Ruby's DataMapper in the Python world are SQLObject and SQLAlchemy.

I certainly prefer the ORM approach as opposed to writing SQL expressions, and try to use it whenever I can get away with it.</description>
		<content:encoded><![CDATA[<p>Wubin: The closest things to Ruby&#8217;s DataMapper in the Python world are SQLObject and SQLAlchemy.</p>
<p>I certainly prefer the ORM approach as opposed to writing SQL expressions, and try to use it whenever I can get away with it.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Bioinformatics（生物信息学） &#187; Recommending a well-known bioinformatics blog</title>
		<link>http://www.bioinformaticszen.com/2008/05/organised-bioinformatics-experiments/#comment-17823</link>
		<dc:creator>Bioinformatics（生物信息学） &#187; Recommending a well-known bioinformatics blog</dc:creator>
		<pubDate>Sun, 25 May 2008 01:40:42 +0000</pubDate>
		<guid isPermaLink="false">http://www.bioinformaticszen.com/?p=153#comment-17823</guid>
		<description>[...] Organised bioinformatics experiments [...]</description>
		<content:encoded><![CDATA[<p>[...] Organised bioinformatics experiments [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Wubin Qu</title>
		<link>http://www.bioinformaticszen.com/2008/05/organised-bioinformatics-experiments/#comment-17822</link>
		<dc:creator>Wubin Qu</dc:creator>
		<pubDate>Sun, 25 May 2008 01:35:36 +0000</pubDate>
		<guid isPermaLink="false">http://www.bioinformaticszen.com/?p=153#comment-17822</guid>
		<description>Thanks, Mike, for this excellent idea for solving the problem. I have also run into this problem in my bioinformatics research. However, I program in Python, so I will look for the equivalent modules there.</description>
		<content:encoded><![CDATA[<p>Thanks, Mike, for this excellent idea for solving the problem. I have also run into this problem in my bioinformatics research. However, I program in Python, so I will look for the equivalent modules there.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Adam</title>
		<link>http://www.bioinformaticszen.com/2008/05/organised-bioinformatics-experiments/#comment-17804</link>
		<dc:creator>Adam</dc:creator>
		<pubDate>Sat, 24 May 2008 21:38:14 +0000</pubDate>
		<guid isPermaLink="false">http://www.bioinformaticszen.com/?p=153#comment-17804</guid>
		<description>Excellent article as always Mike and thanks for the link.  I especially like the DataMapper examples.  I haven't had a chance to try out DM yet.  Lately I've been using &lt;a href="http://code.google.com/p/ruby-sequel/" rel="nofollow"&gt;Sequel&lt;/a&gt; but the 'define schema in the model' aspect of DM is very slick.  Just when I thought I had pushed myself too far into software obscurity you come along and make me feel less alone.  Awesome.</description>
		<content:encoded><![CDATA[<p>Excellent article as always Mike and thanks for the link.  I especially like the DataMapper examples.  I haven&#8217;t had a chance to try out DM yet.  Lately I&#8217;ve been using <a href="http://code.google.com/p/ruby-sequel/" rel="nofollow">Sequel</a> but the &#8216;define schema in the model&#8217; aspect of DM is very slick.  Just when I thought I had pushed myself too far into software obscurity you come along and make me feel less alone.  Awesome.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: A Pipeline is a Rakefile at Bleeding Edge Biotech</title>
		<link>http://www.bioinformaticszen.com/2008/05/organised-bioinformatics-experiments/#comment-17799</link>
		<dc:creator>A Pipeline is a Rakefile at Bleeding Edge Biotech</dc:creator>
		<pubDate>Sat, 24 May 2008 21:20:39 +0000</pubDate>
		<guid isPermaLink="false">http://www.bioinformaticszen.com/?p=153#comment-17799</guid>
		<description>[...] Mike over at Bioinformatics Zen has written a more thorough post about organised bioinformatics experiments with examples using Rake and DataMapper. Definitely check that [...]</description>
		<content:encoded><![CDATA[<p>[...] Mike over at Bioinformatics Zen has written a more thorough post about organised bioinformatics experiments with examples using Rake and DataMapper. Definitely check that [...]</p>
]]></content:encoded>
	</item>
</channel>
</rss>

