Monday, November 9, 2009

Automated informatics pipelines, public datasets, and the cloud

Although Biodivertido typically focuses on biodiversity informatics, I'm going to step it back to informatics of another stripe in the hopes of making a few points and getting people primed for Javier's TDWG talk Moving Biodiversity to the Cloud on Friday.

A couple of days ago I became curious whether I could use CloudBurst to compare novel influenza sequences against the entire known sequence record of influenza, in the hopes of detecting novel reassortants on the fly. The intended use of CloudBurst is to take small sequence reads from high-throughput sequencing projects and query them against a complete genome for a variety of reasons, including SNP discovery, genotyping, and personal genomics. So why not take a novel influenza, break it up into many small overlapping fragments, and query them against a genome composed of all known sequences to detect reassortment?

Comparisons similar to these can likely be done with BLAST or a similar algorithm, but in my case I was particularly interested in CloudBurst because of its available implementation on EC2, which uses Hadoop to utilise the cluster environment. This means that, if successful, I could later wrap my method into an automated workflow for detecting and reporting potentially novel influenza reassortants as soon as new sequences are reported.

I posted an overview of the steps I took below. In the end, as a first-pass detection of reassortment it appears promising, or at least interesting. All in all, I went from idea, to data assembly, to trial implementation in about 2 hours (not counting a few hours trying to get Hadoop to run the CloudBurst.jar; Hadoop version number = VERY important). That is great for the study of influenza, but what about biodiversity informatics?

There are a few things that made this so simple.

First, rapid adoption of EC2 by the bioinformatics community. Look, for example, at the JCVI-maintained Bio-Linux distribution for use on EC2. It comes prepackaged with a whole smörgåsbord of very fun little tools. It costs 34 cents per hour (not 10, because it requires a 64-bit instance) and takes about 1 minute to set up your very own copy. Where is our (the biodiversity informatics community's) machine image for GISing, niche modeling, distribution mapping, PDing, SGAing, and Hadoop spatial joining?

Second, a growing community of users who are exchanging knowledge about how to tackle these very large datasets in the cloud. Biodiversity informatics is growing its own such community; the writers of this blog can attest to that.

Third, cloud-hosted datasets. Amazon Web Services is making it fairly simple to host very large datasets that anyone in the community can access. The GenBank image is ~200 GB and takes 5 minutes to set up as a mounted volume for my own use. This targets a completely different consumer than our data portals and APIs do. It opens up the data for the community of informatics enthusiasts doing cool things on the cloud (think spelling correction, retrospective georeferencing, spatial joining).

With all that said, I'd like to return to how this relates to biodiversity informatics. I've been talking with a variety of people over the past few weeks, and all of them show some level of interest in moving to the cloud. What worries me now is that we will adopt many of the same stances as a community as we have held without the cloud. The case study I developed above relied heavily on easily accessible data and tools. We as a community must move forward with this as a primary goal. There will likely always be a need for portals and APIs, but for really big questions it is sometimes just easier to have the entire dataset ready for access. Why hasn't our community got it together enough to launch a unified public dataset in the cloud?

I guess data quality and ownership are the two primary concerns. I'm sorry, but those are bad reasons; time to grow up, it isn't 2001 any longer.

Once such a dataset is in place we can begin to build new methods (or reapply existing ones) for parsing out duplicate records, linking data to geographic areas, merging error types into targeted datasets, and sharing findings with the owners of the original data. The 'snapshot' approach as implemented in the AWS Public Data Sets means that we consumers will always rely on the original providers to include the newest and most up-to-date records; none of our hard work will be included in future snapshots unless we come up with methods for reporting corrections back to the source.

It is important to restate that I don't think portals and APIs will go away. They serve completely different consumer communities: those interested in looking at the entire dataset at once, versus those pulling smaller subsets of data manually or developing tools to do so via API interfaces. I do think that by providing public datasets we will see new methods and technologies for enhancing the portals and APIs, as well as still-unknown methods for improving the datasets at the source, ultimately enhancing our knowledge about the world's biodiversity.


After downloading Hadoop (version 0.18.3), here is what I ran.

Launch the Hadoop cluster, then create and mount a volume containing the GenBank influenza data (commands prefixed with > are run locally; those prefixed with $ are run on the cluster's master node after logging in)

>src/contrib/ec2/bin/hadoop-ec2 launch-cluster cloud 5

>ec2-create-volume --snapshot snap-fe3ec297 -z us-east-1d

>ec2-attach-volume vol-7b49a612 -i MASTERNODEIDHERE -d /dev/sdh

>src/contrib/ec2/bin/hadoop-ec2 login cloud

$mkdir /inf

$mount /dev/sdh /inf

Here I ran a small Python script to extract all sequences (hemagglutinin only) known prior to the original swine flu outbreak, concatenate those sequences into a single “genome” saved as data/genome.fa, and record a map of where in the genome each sequence began and ended.
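
I haven't reproduced that script here, but the idea is simple enough to sketch. A minimal Python version, in which the input file name, the map file, and the field layout are placeholders of my own rather than the originals, might look something like this:

# build_genome.py - a sketch of the concatenation step (file names are assumptions)

def read_fasta(path):
    # yield (header, sequence) pairs from a FASTA file
    header, chunks = None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

if __name__ == "__main__":
    offset = 0
    with open("data/genome.fa", "w") as genome, open("data/genome_map.txt", "w") as gmap:
        genome.write(">all_known_HA\n")
        for header, seq in read_fasta("data/influenza_ha.fa"):
            genome.write(seq + "\n")
            # record accession, start and end of this sequence within the concatenated genome
            gmap.write("%s\t%d\t%d\n" % (header.split()[0], offset, offset + len(seq)))
            offset += len(seq)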

Next, a group of sequences (again, hemagglutinin only) from the earliest swine flu outbreaks was broken up into numerous overlapping fragments and saved as data/segs.fa, again keeping a map of where each fragment belonged. These are the 'novel sequences' I would test for any cases of reassortment.
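
That script isn't shown either; a rough sketch of the fragmentation step, with the fragment length, step size, and file names chosen arbitrarily for illustration, could be:

# fragment_query.py - a sketch of the fragmentation step (parameters and file names are assumptions)
from build_genome import read_fasta  # FASTA helper from the previous sketch

FRAG_LEN = 50  # fragment length, an arbitrary choice for this sketch
STEP = 25      # consecutive fragments overlap by FRAG_LEN - STEP bases

frag_id = 0
with open("data/segs.fa", "w") as segs, open("data/segs_map.txt", "w") as smap:
    for header, seq in read_fasta("data/novel_ha.fa"):
        for start in range(0, max(1, len(seq) - FRAG_LEN + 1), STEP):
            frag = seq[start:start + FRAG_LEN]
            segs.write(">frag%d\n%s\n" % (frag_id, frag))
            # remember which parent sequence and position this fragment came from
            smap.write("frag%d\t%s\t%d\n" % (frag_id, header.split()[0], start))
            frag_id += 1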

Download and extract CloudBurst


$tar xzf CloudBurst-1.0.1.tgz

$mv CloudBurst-1.0.1/ data/

Convert the FASTA files to Hadoop-ready files

$java -jar data/ConvertFastaForCloud.jar data/genome.fa data/

$java -jar data/ConvertFastaForCloud.jar data/segs.fa data/

Move the data to HDFS

$/usr/local/hadoop-0.17.0/bin/hadoop fs -put ~/data /data

Run the analysis

$/usr/local/hadoop-0.17.0/bin/hadoop jar ~/data/CloudBurst.jar /data/ /data/ /results 40 3 0 1 50 15 5 5 128 5

Copy results from hdfs

$/usr/local/hadoop-0.17.0/bin/hadoop fs -get /results results

Convert the results back to a human-readable format

$java -jar data/PrintAlignments.jar results >results.txt


Parsing meaning out of the results was a bit more labor intensive and I forwent any automation (for now), using OpenOffice spreadsheets to map matched portions back to their original viruses and report accession numbers. What I found was interesting, exciting and promising, even matching some of the findings reported in Kingsford et al., 2009, from only 2 hours of work and a couple of dollars. I was pretty satisfied.
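
If I do get around to automating that last step, it is essentially an interval lookup: take the genome coordinate of each reported alignment and ask which source sequence's interval contains it, using the map recorded earlier. A rough sketch, assuming the map file from the first script and a simple tab-delimited dump of fragment id and genome position (that layout is my assumption, not CloudBurst's actual output format):

# map_hits.py - a sketch of mapping alignment coordinates back to source accessions
import bisect

# load the genome map written by the concatenation script
starts, entries = [], []
for line in open("data/genome_map.txt"):
    acc, start, end = line.split("\t")
    starts.append(int(start))
    entries.append((acc, int(start), int(end)))

def source_of(position):
    # find the source sequence whose interval contains this genome coordinate
    i = bisect.bisect_right(starts, position) - 1
    acc, start, end = entries[i]
    return acc if start <= position < end else None

# hits.txt: one "fragment_id<TAB>genome_position" line per alignment (assumed layout)
for line in open("hits.txt"):
    frag, pos = line.strip().split("\t")
    print("%s\t%s" % (frag, source_of(int(pos))))

Using bisect keeps the lookup cheap even when the concatenated record gets large.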
