Robert R. McCormick School of Engineering and Applied Science Electrical Engineering and Computer Science Department Center for Ultra-scale Computing and Information Security at Northwestern University

Sponsor:

National Science Foundation award no. 0905205

Project Team Members:






Northwestern University - EECS Dept.



AGILE: AliGnIng Long rEads


Overview

Recent advances in Next Generation Sequencing (NGS) technology have led to affordable desktop-sized sequencers with low running costs and high throughput. The sequencing process produces small fragments (reads) of the genome being sequenced. By mapping these reads to a reference genome, we can sequence the DNA of a new individual. NGS sequencers are making it possible to conduct such studies at a mass scale. This is expected to usher in an era of personal genomics, in which each individual can have his or her DNA sequenced and studied to find more personalized ways of anticipating, diagnosing and treating diseases.

Studies of this nature have already begun; two recent examples follow. James Lupski, a physician-scientist who suffers from a neurological disorder called Charcot-Marie-Tooth disease, found the genetic cause of his disease by sequencing his entire genome (late 2009). Another study, the first to describe the genomes of an entire family of four, confirmed the genetic root of a rare disease, Miller syndrome, afflicting both children (March 2010).

A number of companies build sequencers: 454, Illumina and Applied Biosystems, to name a few. Both the throughput and the read lengths of NGS sequencers are increasing at a pace that outstrips even Moore's law. Hence there is a growing need for tools that can handle longer reads and still keep pace with the sequencers.

AGILE

AGILE is a sequence mapping tool specifically designed to map longer reads (read length > 200) to a given reference genome. Currently it works for 454 reads, but efforts are under way to make it suitable for all sequencers that produce longer reads. Given the current trend of increasing read lengths, most sequencers will soon have read lengths > 200. In comparison with the existing tools, the most significant features of AGILE are:

Download AGILE

Please download AGILE here: (agile_x86_64_0.4.0.tar.gz)
This version works on 64-bit Linux operating systems. Other versions can be provided on request.

Download PAR-AGILE

Please download PAR-AGILE here: (PAR-AGILE_0.1.0.tar.gz).
This version was compiled on a 64-bit Linux operating system using mpicc.

Download Data

The FASTA version of the human genome hg19 can be downloaded here: (hg19_unmasked.fa.tar.bz2).
A sample FASTA file containing real 454 reads can be downloaded here: (SRR005010_15.fa).
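
Since AGILE targets reads longer than 200, it can be useful to check the read lengths in a FASTA file before mapping. The following is a minimal sketch, not part of AGILE: the function names `parse_fasta` and `read_length_summary` are hypothetical, and the parser assumes plain FASTA input with `>` header lines followed by one or more sequence lines.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (read_name, sequence) pairs from FASTA-formatted lines."""
    name = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            # The read name is the first whitespace-separated token after '>'.
            name = header[0] if header else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def read_length_summary(lines: Iterable[str], min_length: int = 200) -> Dict[str, int]:
    """Count reads and how many are strictly longer than min_length."""
    lengths = [len(seq) for _, seq in parse_fasta(lines)]
    return {
        "reads": len(lengths),
        "longer_than_min": sum(1 for n in lengths if n > min_length),
        "max_length": max(lengths, default=0),
    }


if __name__ == "__main__":
    # Tiny in-memory example; the threshold is lowered so the toy reads qualify.
    demo = [">read1 length=5", "ACGTA", ">read2", "ACG", "TTT"]
    print(read_length_summary(demo, min_length=4))
```

To summarize the sample reads, pass an open file handle instead of the in-memory list, for example `with open("SRR005010_15.fa") as fh: print(read_length_summary(fh))`.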

Usage

agile database query [options] output_file > mapping_quality_file
where:
database and query are each either a .fa, .nib or .2bit file,
or a list of these files, one file name per line.
output_file is the file for the mapped result.
Options:
-tileSize=k Sets the length of the tuples used to create the hash table.
Usually between 11 and 20 (default 16).
-maxSIMs=n Sets the maximum number of SIMs (single imperfect matches) allowed, as a percentage of the read length. SIMs include mismatches and indels.
(default 5 (i.e. 5%) with the -all option and 100 (i.e. 100%) without it)
-maxFreq=F Sets the maximum number of occurrences allowed for a pattern (k-tuple). k-tuples that occur more than F times are marked as overused and ignored. The default value depends on the read length (for example, F = 8 for a read length of 500).
-all If this option is used, the program outputs all alignments that satisfy -maxSIMs=n. Otherwise, the program tries to find the best alignment and outputs it along with all other alignments that have the same score.
-out=type Sets the output file format. Type is one of:
psl - Default. Tab separated format, no sequence
pslx - Tab separated format with sequence
axt - blastz-associated axt format
maf - multiz-associated maf format
sim4 - similar to sim4 format
wublast - similar to wublast format
blast - similar to NCBI blast format
blast8 - NCBI blast tabular format
blast9 - NCBI blast tabular format with comments
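
To make the roles of -tileSize, -maxFreq and -maxSIMs more concrete, here is a small illustrative sketch. It is not AGILE's implementation: the function names are hypothetical, the index uses every overlapping k-tuple of the reference, and the SIM budget rounds down. All three are assumptions of this sketch, since the usage text above does not specify those details.

```python
from collections import defaultdict
from typing import Dict, List


def build_tuple_index(reference: str, tile_size: int = 16, max_freq: int = 8) -> Dict[str, List[int]]:
    """Map each k-tuple of the reference to its start positions.

    k-tuples that occur more than max_freq times are treated as overused
    and left out of the index (the role described for -maxFreq).
    """
    positions: Dict[str, List[int]] = defaultdict(list)
    for start in range(len(reference) - tile_size + 1):
        positions[reference[start:start + tile_size]].append(start)
    return {kmer: hits for kmer, hits in positions.items() if len(hits) <= max_freq}


def candidate_offsets(read: str, index: Dict[str, List[int]], tile_size: int) -> Dict[int, int]:
    """Count k-tuple hits per implied alignment offset (reference pos - read pos)."""
    votes: Dict[int, int] = defaultdict(int)
    for i in range(len(read) - tile_size + 1):
        for ref_pos in index.get(read[i:i + tile_size], []):
            votes[ref_pos - i] += 1
    return dict(votes)


def sim_budget(read_length: int, max_sims_percent: float = 5.0) -> int:
    """SIMs allowed for a read when the limit is a percentage of its length.

    Rounding down is an assumption of this sketch.
    """
    return int(read_length * max_sims_percent / 100)


if __name__ == "__main__":
    reference = "ACGTACGTTTGACCA"
    # Toy parameters: with max_freq=1, the repeated tuple "ACGT" is dropped.
    index = build_tuple_index(reference, tile_size=4, max_freq=1)
    print(candidate_offsets("GTTTGACC", index, tile_size=4))
    print(sim_budget(500, 5))
```

In the toy run, every k-tuple of the read votes for the same offset, which is the kind of signal a hash-based mapper can use to pick candidate locations before a more careful alignment that must stay within the SIM budget.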

Third Party Software for AGILE

Galaxy wrapper for AGILE

Galaxy is an online tool that facilitates analysis of genomic data. Simon Lank of the O'Connor Lab, WNPRC, Madison, WI, has written a Galaxy Python script and XML wrapper for AGILE and has graciously shared them with us. We are happy to share them here: AGILE_Galaxy_wrapper

Publications

  1. Sanchit Misra, Ankit Agrawal, Wei-keng Liao, Alok Choudhary. Anatomy of a Hash-based Long Read Sequence Mapping Algorithm for Next Generation DNA Sequencing. Bioinformatics 2010; doi: 10.1093/bioinformatics/btq648. (PDF)
  2. Sanchit Misra, Ramanathan Narayanan, Wei-keng Liao, Alok Choudhary and Simon Lin. pFANGS: Parallel High Speed Sequence Mapping for Next Generation 454-Roche Sequencing Reads. In Proc. Ninth IEEE International Workshop on High Performance Computational Biology (IPDPS 2010), April 2010, Atlanta, GA. (PDF)
  3. Sanchit Misra, Ramanathan Narayanan, Simon Lin and Alok Choudhary. FANGS: High Speed Sequence Mapping for Next Generation Sequencing Reads. In Proceedings of ACM Symposium of Applied Computing (ACM SAC), March 22-26, 2010, Sierre, Switzerland. (PDF)

Current Status


Acknowledgements:

This material is based upon work supported by the National Science Foundation under collaborating grants at the Northwestern University (NSF grant no. 0905205) and University of Minnesota (NSF grant no. 0905581).

© 2011 Robert R. McCormick School of Engineering and Applied Science, Northwestern University

Last Updated: $LastChangedDate: 2015-02-22 10:04:49 -0600 (Sun, 22 Feb 2015) $