
The Gene Indices


NAR Molecular Biology Database Collection entry number 5
Lee, Y., Antonescu, V., Cheung, F., Karamycheva, S., Parvizi, B., Pertea, G., Sultana, R., Sunkara, S., Tsai, J., White, J., Quackenbush, J.
The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA
Contact johnq@tigr.org

Database Description

Expressed Sequence Tags (ESTs) have provided a first glimpse of the collection of transcribed sequences in a variety of organisms. However, a careful analysis of this sequence data can provide significant additional functional, structural, and evolutionary information. Our analysis of the public EST sequences, available through the TIGR Gene Indices (TGI; http://www.tigr.org/tdb/tdb.html), is an attempt to identify the genes represented by that data and to provide additional information regarding those genes. Gene Indices are constructed for selected organisms by first clustering, then assembling EST and annotated gene sequences from GenBank. This process produces a set of unique, high-fidelity virtual transcripts, or Tentative Consensus (TC) sequences. The TC sequences can be used to provide putative genes with functional annotation, to link the transcripts to mapping and genomic sequence data, and to provide links between orthologous and paralogous genes.

Recent Developments

Construction of the Gene Index databases

The TIGR Gene Index databases (http://www.jcvi.org/cms/research/projects/tdb/overview/) are an attempt to identify and classify transcribed sequences in eukaryotic species using available EST and gene sequence data. Sequences are first cleaned to identify and remove contaminating sequences, including vector, adapter, mitochondrial, ribosomal, and chimeric sequences. The cleaned sequences are then searched pairwise against each other and grouped into clusters based on shared sequence similarity. The clusters are assembled at high stringency to produce Tentative Consensus (TC) sequences, which are annotated using a variety of tools, including ORF prediction, putative annotation using a controlled vocabulary, Gene Ontology (GO) and Enzyme Commission (EC) number assignments, and maps to completed and draft genomes. The TCs are used to construct a variety of other databases, including the Eukaryotic Gene Ortholog (EGO) database and RESOURCERER, a database that annotates and cross-references microarray resources for human, mouse, and rat.

At present, 57 species are represented in the Gene Index databases, including 19 animals, 17 plants, 7 fungi and 14 protists; this includes the 24 species most highly represented in public EST sequencing projects. Release information for each species-specific database is included in Table 1. Individual databases are updated and released three times yearly, on February 1, June 1, and October 1, if the number of available ESTs for that species has increased by more than 10% or 25,000, whichever is fewer.

The process used to assemble each Gene Index is similar to that described previously (1, 2, 3), although some small modifications have been made to improve the efficiency and accuracy of the process. mgBLAST, a modified version of the MegaBLAST (4) program, is now used for the pairwise sequence comparisons that are the basis for defining clusters.
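
The clustering step can be pictured as single-linkage grouping: any two sequences joined by a qualifying pairwise hit end up in the same cluster. The sketch below is a minimal illustration using a union-find structure. The function name `cluster_sequences`, the hit tuples, and the sequence identifiers are hypothetical, and the sketch does not reproduce mgBLAST or the similarity criteria actually applied when building the Gene Indices.

```python
from collections import defaultdict


def cluster_sequences(seq_ids, hits):
    """Group sequence IDs into clusters by transitive closure over pairwise hits.

    seq_ids: iterable of sequence identifiers (hypothetical names).
    hits: iterable of (id_a, id_b) pairs that already passed a similarity
          filter; the filter itself is outside this sketch.
    """
    parent = {s: s for s in seq_ids}

    def find(x):
        # Walk to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = defaultdict(list)
    for s in parent:
        clusters[find(s)].append(s)
    # Sort members and clusters so the output is deterministic.
    return sorted(sorted(members) for members in clusters.values())


if __name__ == "__main__":
    ids = ["EST1", "EST2", "EST3", "EST4", "GENE1"]
    pairwise_hits = [("EST1", "EST2"), ("EST2", "GENE1")]
    for members in cluster_sequences(ids, pairwise_hits):
        print(members)
```

In this toy input EST1 and GENE1 never match directly, yet they share a cluster because both match EST2. Sequences with no hits remain as singletons.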
Each cluster is assembled using the Paracel Transcript Assembler (PTA), a modified version of the CAP3 assembly program (5). TGICL, an open source software system for EST clustering and assembly, is available for users interested in performing a similar analysis on their own datasets (6).

New Features of the TC reports

The central elements of the TGI databases are the TC sequences and the TC reports that are presented through the database web site. As described previously, the TC reports contain a number of features: the TC sequence in FASTA format with a history from previous builds in the header, a map showing component EST and gene sequences, a table providing links to those sequences, putative annotation, an expression summary based on the number of ESTs from various libraries, genomic locations, and links to tentative orthologs in EGO. Since the last description in this special issue of Nucleic Acids Research, several new features have been added to the TC report. Putative polyadenylation signals are identified and shaded in the consensus sequence. Potential open reading frames are predicted in the TC using a variety of software tools, including the NCBI ORF Finder, ESTScan (9), DIANA-EST (10) and FrameFinder; predicted ORFs can be searched against a variety of databases using WU-BLAST. Because assembly can leave a TC consensus in the wrong orientation, an attempt is now made to determine the proper orientation using the annotated direction of component gene and EST sequences as well as BLAST search results. Putative Single Nucleotide Polymorphism (SNP) sites are found by analyzing the multiple sequence alignments that are produced in the assembly stage; putative SNPs are reported only if a variant is found in multiple sequences from independent libraries.
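
The support filter for putative SNPs can be illustrated with a small sketch. It assumes each alignment column is available as a list of (base, library) observations. The function name `putative_snp_alleles`, the `min_reads` and `min_libraries` thresholds, and the requirement that every reported allele meet both thresholds are assumptions made for illustration; the exact criteria used for the TC reports are not specified here.

```python
from collections import defaultdict

VALID_BASES = {"A", "C", "G", "T"}


def putative_snp_alleles(column, min_reads=2, min_libraries=2):
    """Return the alleles in one alignment column that pass the support filter.

    column: list of (base, library_id) pairs, one per aligned sequence.
    An allele is kept if it is seen in at least `min_reads` sequences drawn
    from at least `min_libraries` distinct libraries (assumed thresholds).
    The column counts as a putative SNP only if two or more alleles pass;
    otherwise an empty list is returned.
    """
    reads = defaultdict(int)
    libraries = defaultdict(set)
    for base, library_id in column:
        if base in VALID_BASES:  # skip gaps and ambiguity codes
            reads[base] += 1
            libraries[base].add(library_id)

    supported = sorted(
        base for base in reads
        if reads[base] >= min_reads and len(libraries[base]) >= min_libraries
    )
    return supported if len(supported) >= 2 else []


if __name__ == "__main__":
    column = [
        ("A", "lib1"), ("A", "lib2"), ("A", "lib3"),
        ("G", "lib1"), ("G", "lib4"),
        ("T", "lib2"),  # a single read: treated as a likely sequencing error
    ]
    print(putative_snp_alleles(column))
```

Requiring independent libraries is what separates a recurring variant from an error or artifact confined to one library: two G reads from the same library would not be enough under these assumed thresholds.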
New databases and tools

The Eukaryotic Gene Ortholog (EGO; http://www.tigr.org/.tdb/tgi/ego/) database, previously known as TIGR Orthologous Gene Alignments (TOGA), uses pairwise sequence similarity searches and a transitive, reciprocal closure process to identify Tentative Ortholog Groups (TOGs) in eukaryotes (7). EGO has expanded its representation to include all 57 species represented in the TGI, and TOGs have been cross-referenced to the Online Mendelian Inheritance in Man (OMIM; http://www.ncbi.nlm.nih.gov/OMIM) database of human disease genes.

RESOURCERER (8) provides annotation based on the TIGR Gene Indices for widely available microarray resources in human, mouse, and rat, including widely used clone sets and Affymetrix GeneChips™. RESOURCERER also allows users to compare microarray resources within or across species and microarray platforms using links established in EGO. Users can also submit a list of GenBank accessions corresponding to their microarray databases for annotation.

Genomic maps align TCs to available complete or draft genomes, including human, mouse, fly, worm, Fugu, Arabidopsis, yeast, fission yeast, and rice. These alignments can be viewed using either a graphical alignment viewer or a number of distributed annotation system (DAS; 11) viewers, including one developed at TIGR. Each Gene Index also includes graphical metabolic pathway maps linked to TCs associated with specific pathways through GO term and EC number annotation. Comparisons between TCs are also used to identify putative alternative splice forms based on shared blocks of sequence similarity.

Using TIGR Gene Indices

There are many ways users can access the TIGR Gene Index databases. Nucleotide or protein sequences can be searched using WU-BLAST against individual TGI databases, EGO, or pre-selected classes of species, such as animals or plants.
The TGI can be searched using unique identifiers (GB and TC accessions, EST identifiers, and ET numbers from the TIGR EGAD database), gene product names, functional classifications based on GO terms, metabolic pathways, library-related expression analysis, map position within various sequenced genomes, TOGs in the EGO database, and alternative splice forms. Complete summary annotation for all of the ESTs and TCs in each TGI database is now also provided through the EST Annotator and TC Annotator features, which provide comprehensive lists of sequences within each species-specific database.

Software

Many of the software tools used to create the TGI are available with source code to the research community through the TGI software tools web site (http://www.jcvi.org/cms/research/projects/tdb/overview/software/). The TGI Clustering tools (TGICL; 6) is a software system for fast clustering and assembly of large EST datasets. TGICL starts with a large multi-FASTA file (and an optional quality value file) and outputs the assemblies produced by CAP3 (5). Both the clustering and assembly phases can be parallelized by distributing the searches and the assembly jobs across multiple CPUs, as TGICL can take advantage of either SMP machines or PVM (Parallel Virtual Machine) clusters. Other available software includes clview, for viewing sequence assemblies in .ace format; SeqClean, which is used to remove contaminating sequences from EST and gene sequences; and cdbfasta/cdbyank, which index FASTA-formatted files and can be used to rapidly extract sequences from them.
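
The general idea behind indexing a FASTA file for rapid extraction can be sketched as follows: record the byte offset of each header once, then seek directly to a record on request instead of rescanning the file. This is only an illustration of the technique. The function names `build_fasta_index` and `fetch_record` are hypothetical, the index is held in memory, and nothing here reproduces the on-disk index format or command-line interface of cdbfasta/cdbyank.

```python
import io


def build_fasta_index(handle):
    """Map each record ID to the byte offset of its '>' header line.

    handle: a binary file object positioned at the start of a FASTA file.
    The record ID is taken to be the first whitespace-delimited token
    after '>' (an assumption of this sketch).
    """
    index = {}
    offset = handle.tell()
    for line in iter(handle.readline, b""):
        if line.startswith(b">"):
            fields = line[1:].split()
            if fields:
                index[fields[0].decode("ascii")] = offset
        offset = handle.tell()  # start of the next line
    return index


def fetch_record(handle, index, record_id):
    """Seek to a record and return its header and sequence lines as text."""
    handle.seek(index[record_id])
    lines = [handle.readline()]  # the header line
    for line in iter(handle.readline, b""):
        if line.startswith(b">"):
            break  # reached the next record
        lines.append(line)
    return b"".join(lines).decode("ascii")


if __name__ == "__main__":
    fasta = io.BytesIO(b">TC1 example\nACGT\nTTGA\n>TC2\nGGCC\n")
    idx = build_fasta_index(fasta)
    print(fetch_record(fasta, idx, "TC2"), end="")
```

With a real file, the handle would come from `open(path, "rb")`. Working in binary mode keeps the stored offsets valid for `seek` regardless of line-ending conventions.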

Acknowledgements

The authors wish to thank the TIGR IT group for their database and computer system support. This work was supported by the US Department of Energy, grant DE-FG02-99ER62852, and the US National Science Foundation, grant DBI-9983070.

References

1. Quackenbush, J., Liang, F., Holt, I., Pertea, G. and Upton, J. (2000) Nucleic Acids Res, 28(1), 141-5.
2. Quackenbush, J., Cho, J., Lee, D., Liang, F., Holt, I., Karamycheva, S., Parvizi, B., Pertea, G., Sultana, R. and White, J. (2001) Nucleic Acids Res, 29(1), 159-64.
3. Liang, F., Holt, I., Pertea, G., Karamycheva, S., Salzberg, S. L. and Quackenbush, J. (2000) Nucleic Acids Res, 28(18), 3657-65.
4. Zhang, Z., Schwartz, S., Wagner, L. and Miller, W. (2000) J Comput Biol, 7, 203-14.
5. Huang, X. and Madan, A. (1999) CAP3: A DNA Sequence Assembly Program. Genome Res, 9, 868-77.
6. Pertea, G., Huang, X., Liang, F., Antonescu, V., Sultana, R., Karamycheva, S., Lee, Y., White, J., Cheung, F., Parvizi, B., Tsai, J., and Quackenbush, J. (2002) TIGR Gene Indices clustering tools (TGICL): a software system for fast clustering of large EST datasets. Bioinformatics, to appear.
7. Lee, Y., Sultana, R., Pertea, G., Cho, J., Karamycheva, S., Tsai, J., Parvizi, B., Cheung, F., Antonescu, V., White, J., Holt, I., Liang, F. and Quackenbush, J. (2002) Genome Res, 12(3), 493-502.
8. Tsai, J., Sultana, R., Lee, Y., Pertea, G., Karamycheva, S., Antonescu, V., Cho, J., Parvizi, B., Cheung, F. and Quackenbush, J. (2001) Genome Biol, 2(11), software0002.1-0002.4.
9. Iseli, C., Jongeneel, C. V. and Bucher, P. (1999) Proc Int Conf Intell Syst Mol Biol, 138-48.
10. Hatzigeorgiou, A.G., Fiziev, P. and Reczko, M. (2001) Bioinformatics, 17(10), 913-919.
11. Dowell, R. D., Jokerst, R. M., Day, A., Eddy, S. R. and Stein, L. (2001) BMC Bioinformatics, 2(1), 7.

Subcategory: Human ORFs

Go to the abstract in the NAR 2005 Database Issue.