
NAR's Top Ten Articles - August 2011

NAR’s Top Ten Articles are updated monthly and show recent articles that have been most often accessed in HTML Full-Text and PDF formats in the specified month.

Database

gkp985

The Pfam protein families database
Finn RD, Mistry J, Tate J, Coggill P, Heger A, Pollington JE, Gavin OL, Gunasekaran P, Ceric G, Forslund K, Holm L, Sonnhammer EL, Eddy SR, Bateman A.
Nucl. Acids Res. January 2010: D211-222.
Free Full Text

Pfam is a widely used database of protein families and domains. This article describes a set of major updates that we have implemented in the latest release (version 24.0). The most important change is that we now use HMMER3, the latest version of the popular profile hidden Markov model package. This software is approximately 100 times faster than HMMER2 and is more sensitive due to the routine use of the forward algorithm. The move to HMMER3 has necessitated numerous changes to Pfam that are described in detail. Pfam release 24.0 contains 11,912 families, of which a large number have been significantly updated during the past two years. Pfam is available via servers in the UK (http://pfam.sanger.ac.uk/), the USA (http://pfam.janelia.org/) and Sweden (http://pfam.sbc.su.se/).

Database

gkn879

The Ribosomal Database Project: improved alignments and new tools for rRNA analysis
Cole JR, Wang Q, Cardenas E, Fish J, Chai B, Farris RJ, Kulam-Syed-Mohideen AS, McGarrell DM, Marsh T, Garrity GM, Tiedje JM.
Nucl. Acids Res. January 2009: D141-145.
Free Full Text

The Ribosomal Database Project (RDP) provides researchers with quality-controlled bacterial and archaeal small subunit rRNA alignments and analysis tools. An improved alignment strategy uses the Infernal secondary structure aware aligner to provide a more consistent higher quality alignment and faster processing of user sequences. Substantial new analysis features include a new Pyrosequencing Pipeline that provides tools to support analysis of ultra high-throughput rRNA sequencing data. This pipeline offers a collection of tools that automate the data processing and simplify the computationally intensive analysis of large sequencing libraries. In addition, a new Taxomatic visualization tool allows rapid visualization of taxonomic inconsistencies and suggests corrections, and a new class Assignment Generator provides instructors with a lesson plan and individualized teaching materials. Details about RDP data and analytical functions can be found at http://rdp.cme.msu.edu/.

Methods online

gkq873

Comparison of two next-generation sequencing technologies for resolving highly complex microbiota composition using tandem variable 16S rRNA gene regions
Claesson MJ, Wang Q, O'Sullivan O, Greene-Diniz R, Cole JR, Ross RP, O'Toole PW.
Nucl. Acids Res. December 2010: e200.
Free Full Text

High-throughput molecular technologies can profile microbial communities at high resolution even in complex environments like the intestinal microbiota. Recent improvements in next-generation sequencing technologies allow for even finer resolution. We compared phylogenetic profiling of both longer (454 Titanium) sequence reads with shorter, but more numerous, paired-end reads (Illumina). For both approaches, we targeted six tandem combinations of 16S rRNA gene variable regions, in microbial DNA extracted from a human faecal sample, in order to investigate their limitations and potentials. In silico evaluations predicted that the V3/V4 and V4/V5 regions would provide the highest classification accuracies for both technologies. However, experimental sequencing of the V3/V4 region revealed significant amplification bias compared to the other regions, emphasising the necessity for experimental validation of primer pairs. The latest developments of 454 and Illumina technologies offered higher resolution compared to their previous versions, and showed relative consistency with each other. However, the majority of the Illumina reads could not be classified down to genus level due to their shorter length and higher error rates beyond 60 nt. Nonetheless, with improved quality and longer reads, the far greater coverage of Illumina promises unparalleled insights into highly diverse and complex environments such as the human gut.

Survey and Summary

gkp1137


The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants
Cock PJ, Fields CJ, Goto N, Heuer ML, Rice PM.
Nucl. Acids Res. April 2010: 1767-1771.
Free Full Text

FASTQ has emerged as a common file format for sharing sequencing read data combining both the sequence and an associated per base quality score, despite lacking any formal definition to date, and existing in at least three incompatible variants. This article defines the FASTQ format, covering the original Sanger standard, the Solexa/Illumina variants and conversion between them, based on publicly available information such as the MAQ documentation and conventions recently agreed by the Open Bioinformatics Foundation projects Biopython, BioPerl, BioRuby, BioJava and EMBOSS. Being an open access publication, it is hoped that this description, with the example files provided as Supplementary Data, will serve in future as a reference for this important file format.
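To illustrate the kind of conversion between variants that the abstract refers to, here is a minimal Python sketch. It assumes the commonly documented encodings: Sanger FASTQ stores PHRED scores with an ASCII offset of 33, Illumina 1.3+ stores PHRED scores with an offset of 64, and the older Solexa variant stores Solexa-scaled scores with an offset of 64. The function names are hypothetical and are not taken from the article or from any of the Bio* projects.

```python
import math

SANGER_OFFSET = 33    # assumed ASCII offset for Sanger FASTQ (PHRED scores)
ILLUMINA_OFFSET = 64  # assumed ASCII offset for Solexa and Illumina 1.3+ FASTQ


def solexa_to_phred(q_solexa: float) -> float:
    """Map a Solexa-scaled quality score to the PHRED scale."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def decode_sanger(quality: str) -> list[int]:
    """Decode a Sanger FASTQ quality string into PHRED scores."""
    return [ord(c) - SANGER_OFFSET for c in quality]


def decode_illumina13(quality: str) -> list[int]:
    """Decode an Illumina 1.3+ quality string into PHRED scores."""
    return [ord(c) - ILLUMINA_OFFSET for c in quality]


def decode_solexa(quality: str) -> list[int]:
    """Decode a Solexa quality string into PHRED scores, rounded to integers."""
    return [round(solexa_to_phred(ord(c) - ILLUMINA_OFFSET)) for c in quality]


def illumina13_to_sanger(quality: str) -> str:
    """Re-encode an Illumina 1.3+ quality string with the Sanger offset."""
    return "".join(chr(ord(c) - ILLUMINA_OFFSET + SANGER_OFFSET) for c in quality)


if __name__ == "__main__":
    print(decode_sanger("!I"))
    print(decode_illumina13("@h"))
    print(illumina13_to_sanger("@h"))
```

Because the three variants overlap in their printable ASCII ranges, a quality string alone does not always reveal which encoding was used; a converter generally needs to be told the input variant.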

Methods online

gkq224

Biases in Illumina transcriptome sequencing caused by random hexamer priming
Hansen KD, Brenner SE, Dudoit S.
Nucl. Acids Res. July 2010: e131.
Free Full Text

Generation of cDNA using random hexamer priming induces biases in the nucleotide composition at the beginning of transcriptome sequencing reads from the Illumina Genome Analyzer. The bias is independent of organism and laboratory and impacts the uniformity of the reads along the transcriptome. We provide a read count reweighting scheme, based on the nucleotide frequencies of the reads, that mitigates the impact of the bias.

Database

gkp875

The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases
Caspi R, Altman T, Dale JM, Dreher K, Fulcher CA, Gilham F, Kaipa P, Karthikeyan AS, Kothari A, Krummenacker M, Latendresse M, Mueller LA, Paley S, Popescu L, Pujar A, Shearer AG, Zhang P, Karp PD.
Nucl. Acids Res. January 2010: D473-479.
Free Full Text

The MetaCyc database (MetaCyc.org) is a comprehensive and freely accessible resource for metabolic pathways and enzymes from all domains of life. The pathways in MetaCyc are experimentally determined, small-molecule metabolic pathways and are curated from the primary scientific literature. With more than 1400 pathways, MetaCyc is the largest collection of metabolic pathways currently available. Pathway reactions are linked to one or more well-characterized enzymes, and both pathways and enzymes are annotated with reviews, evidence codes, and literature citations. BioCyc (BioCyc.org) is a collection of more than 500 organism-specific Pathway/Genome Databases (PGDBs). Each BioCyc PGDB contains the full genome and predicted metabolic network of one organism. The network, which is predicted by the Pathway Tools software using MetaCyc as a reference, consists of metabolites, enzymes, reactions and metabolic pathways. BioCyc PGDBs also contain additional features, such as predicted operons, transport systems, and pathway hole-fillers. The BioCyc Web site offers several tools for the analysis of the PGDBs, including Omics Viewers that enable visualization of omics datasets on two different genome-scale diagrams and tools for comparative analysis. The BioCyc PGDBs generated by SRI are offered for adoption by any party interested in curation of metabolic, regulatory, and genome-related information about an organism.

Methods online

gkq603


ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data
Wang K, Li M, Hakonarson H.
Nucl. Acids Res. September 2010: e164.
Free Full Text

High-throughput sequencing platforms are generating massive amounts of genetic variation data for diverse genomes, but it remains a challenge to pinpoint a small subset of functionally important variants. To fill these unmet needs, we developed the ANNOVAR tool to annotate single nucleotide variants (SNVs) and insertions/deletions, such as examining their functional consequence on genes, inferring cytogenetic bands, reporting functional importance scores, finding variants in conserved regions, or identifying variants reported in the 1000 Genomes Project and dbSNP. ANNOVAR can utilize annotation databases from the UCSC Genome Browser or any annotation data set conforming to Generic Feature Format version 3 (GFF3). We also illustrate a 'variants reduction' protocol on 4.7 million SNVs and indels from a human genome, including two causal mutations for Miller syndrome, a rare recessive disease. Through a stepwise procedure, we excluded variants that are unlikely to be causal, and identified 20 candidate genes including the causal gene. Using a desktop computer, ANNOVAR requires ∼4 min to perform gene-based annotation and ∼15 min to perform variants reduction on 4.7 million variants, making it practical to handle hundreds of human genomes in a day. ANNOVAR is freely available at http://www.openbioinformatics.org/annovar/.

Methods online

gkp596

Transcriptome analysis by strand-specific sequencing of complementary DNA
Parkhomchuk D, Borodina T, Amstislavskiy V, Banaru M, Hallen L, Krobitsch S, Lehrach H, Soldatov A.
Nucl. Acids Res. October 2009: e123.
Free Full Text

High-throughput complementary DNA sequencing (RNA-Seq) is a powerful tool for whole-transcriptome analysis, supplying information about a transcript's expression level and structure. However, it is difficult to determine the polarity of transcripts, and therefore identify which strand is transcribed. Here, we present a simple cDNA sequencing protocol that preserves information about a transcript's direction. Using Saccharomyces cerevisiae and mouse brain transcriptomes as models, we demonstrate that knowing the transcript's orientation allows more accurate determination of the structure and expression of genes. It also helps to identify new genes and enables studying promoter-associated and antisense transcription. The transcriptional landscapes we obtained are available online.

Methods online

gkp1235

A sensitive non-radioactive northern blot method to detect small RNAs
Kim SW, Li Z, Moore PS, Monaghan AP, Chang Y, Nichols M, John B.
Nucl. Acids Res. April 2010: e98.
Free Full Text

The continuing discoveries of potentially active small RNAs at an unprecedented rate using high-throughput sequencing have raised the need for methods that can reliably detect and quantitate the expression levels of small RNAs. Currently, northern blot is the most widely used method for validating small RNAs that are identified by methods such as high-throughput sequencing. We describe a new northern blot-based protocol (LED) for small RNA (approximately 15-40 bases) detection using digoxigenin (DIG)-labeled oligonucleotide probes containing locked nucleic acids (LNA) and 1-ethyl-3-(3-dimethylaminopropyl) carbodiimide for cross-linking the RNA to the membrane. LED generates clearly visible signals for RNA amounts as low as 0.05 fmol. This method requires as little as a few seconds of membrane exposure to outperform the signal intensity using overnight exposure of isotope-based methods, corresponding to approximately 1000-fold improvement in exposure-time. In contrast to commonly used radioisotope-based methods, which require freshly prepared and hazardous probes, LED probes can be stored for at least 6 months, facilitate faster and more cost-effective experiments, and are more environmentally friendly. A detailed protocol of LED is provided in the Supplementary Data.

Survey and Summary

gkn923

Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists
Huang DW, Sherman BT, Lempicki RA.
Nucl. Acids Res. January 2009: 1-13.
Free Full Text

Functional analysis of large gene lists, derived in most cases from emerging high-throughput genomic, proteomic and bioinformatics scanning approaches, is still a challenging and daunting task. The gene-annotation enrichment analysis is a promising high-throughput strategy that increases the likelihood for investigators to identify biological processes most pertinent to their study. Approximately 68 bioinformatics enrichment tools that are currently available in the community are collected in this survey. Tools are uniquely categorized into three major classes, according to their underlying enrichment algorithms. The comprehensive collections, unique tool classifications and associated questions/issues will provide a more comprehensive and up-to-date view regarding the advantages, pitfalls and recent trends in a simpler tool-class level rather than by a tool-by-tool approach. Thus, the survey will help tool designers/developers and experienced end users understand the underlying algorithms and pertinent details of particular tool categories/tools, enabling them to make the best choices for their particular research interests.
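The abstract does not name the statistics behind the tool classes it surveys. As a rough illustration of what a gene-annotation enrichment calculation can look like, the following Python sketch computes a one-sided hypergeometric p-value, one test commonly used for this purpose. The function name, parameter names, and numbers are hypothetical and do not come from the survey.

```python
from math import comb


def hypergeom_enrichment_p(population: int, annotated: int,
                           sample: int, hits: int) -> float:
    """Probability of seeing at least `hits` annotated genes in a gene list.

    population: number of genes in the background set
    annotated:  background genes carrying the annotation term
    sample:     size of the user's gene list
    hits:       genes in the list carrying the annotation term
    """
    total = comb(population, sample)
    upper = min(annotated, sample)
    # Sum the upper tail of the hypergeometric distribution.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Illustrative numbers only: 40 of 300 listed genes carry a term
    # that annotates 400 of 20000 background genes.
    print(hypergeom_enrichment_p(20000, 400, 300, 40))
```

A small p-value suggests the term is over-represented in the gene list relative to the background. In practice such a test is run across many terms, so the resulting p-values would also need a multiple-testing correction.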