
SEVENS


NAR Molecular Biology Database Collection entry number 373
Suwa, M.1, Sato, T.2, Okouchi, I.2, Arita, M.1, Matsumoto, S.3, Tsutsumi, S.3, Aburatani, H.3, Asai, K.1, Akiya, Y.1
1Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology
2Center for Computational Science and Engineering, Fuji Research Institute Corporation
3Genome Science Division, Research Center for Advanced Science and Technology (RCAST), University of Tokyo, Japan

Database Description

Seven-transmembrane-helix receptors (7-TMRs), also known as G-protein-coupled receptors [1], form an important gene family that acts as the gateway for signal transduction induced by ligand binding. Recent progress in determining the human draft genome sequence [2,3] has accelerated comprehensive analysis of 7-TMRs across the whole human genome. We have developed an automated system that discovers 7-TMR genes in the whole human genome in three stages.
(I) Gene prediction stage: Genomic sequences were obtained from the human genome resources of NCBI. To maximize the number of gene candidates, we generated three kinds of sequence sets: (a) "6f-sequences", all possible combinations of initiation and stop codons in the six reading frames; (b) "ALN-sequences", obtained with ALN [4], a dynamic programming algorithm that aligns a genomic sequence to known protein sequences; and (c) "GD-sequences", generated by GeneDecoder [5], which is based on hidden Markov models (HMMs).
(II) Screening stage: The predicted genes were passed through a filter that combines BLASTP [6] for similarity search, HMMER [7] and an in-house program for assigning 7-TMR-specific HMMs (Pfam domains [7]), PROSITE patterns [8], and transmembrane helix (TMH) prediction tools [9]. By carefully assessing each component, two threshold settings, best specificity and best sensitivity, were determined. Four confidence levels of datasets were then obtained by combining the best-specificity and best-sensitivity thresholds.
(III) Quality improvement stage: Sequence redundancy was reduced as follows. (1) Pairwise alignment was applied to the candidate sequences in an all-against-all fashion. (2) Sequences were linked only when they matched over > 50 amino acid residues with > 95% identity, lay on the same chromosome, and had overlapping genomic positions. (3) Each transitive closure of the links was regarded as one cluster, and one representative gene was selected from each cluster.
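The redundancy step in stage (III) can be sketched in Python as a union-find over the pairwise hits. This is only an illustrative sketch, not the SEVENS implementation: the `Candidate` and `Hit` records, the function names, and the rule for choosing the representative (the longest protein) are assumptions, since the description above does not say how the representative is selected. Only the linking thresholds (> 50 residues, > 95% identity, same chromosome, overlapping position) come from the text.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one predicted gene."""
    name: str
    chromosome: str
    start: int           # genomic start, inclusive
    end: int             # genomic end, inclusive
    protein_length: int  # residues


@dataclass(frozen=True)
class Hit:
    """Hypothetical record for one pairwise alignment result."""
    a: str
    b: str
    aligned_residues: int
    identity: float      # percent identity over the aligned region


def positions_overlap(x: Candidate, y: Candidate) -> bool:
    """True when the two genomic intervals share at least one position."""
    return x.start <= y.end and y.start <= x.end


def is_linked(x: Candidate, y: Candidate, hit: Hit) -> bool:
    """Linking rule from the text: > 50 residues, > 95% identity,
    same chromosome, overlapping genomic position."""
    return (hit.aligned_residues > 50
            and hit.identity > 95.0
            and x.chromosome == y.chromosome
            and positions_overlap(x, y))


def cluster_candidates(candidates: Iterable[Candidate],
                       hits: Iterable[Hit]) -> List[List[Candidate]]:
    """Group candidates by the transitive closure of the links (union-find)."""
    by_name: Dict[str, Candidate] = {c.name: c for c in candidates}
    parent: Dict[str, str] = {name: name for name in by_name}

    def find(name: str) -> str:
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for hit in hits:
        if is_linked(by_name[hit.a], by_name[hit.b], hit):
            root_a, root_b = find(hit.a), find(hit.b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters: Dict[str, List[Candidate]] = {}
    for cand in by_name.values():
        clusters.setdefault(find(cand.name), []).append(cand)
    return list(clusters.values())


def pick_representative(cluster: List[Candidate]) -> Candidate:
    """Assumption: keep the longest protein; ties are broken by name."""
    return max(cluster, key=lambda c: (c.protein_length, c.name))
```

Because links are merged transitively, two candidates can end up in the same cluster without aligning to each other directly, as long as a chain of qualifying hits connects them.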
Applying this system to the human genome sequence (Apr. 2003), we collected 7-TMR genes at four confidence levels, ranging from 1,114 candidates at the highest specificity to 2,235 at the highest sensitivity. These are summarized in SEVENS (http://sevens.cbrc.jp/1.20/). The database aims to cover the whole "7-TMR universe", including not only known sequences but also sequences newly discovered by computational gene-finding programs. This aspect clearly distinguishes it from previous databases [10-12]. The content search button leads to a page where candidates are retrieved by the "AND" combination of (a) keyword in nr.aa database search results, (b) chromosome number, (c) data level, (d) predicted exon number, (e) gene length, (f) protein length, (g) E-value of the sequence search against SWISS-PROT or nr.aa, (h) PROSITE motifs, and (i) Pfam domains. The search lists the 7-TMR candidate sequences in a chromosomal viewer and in a table. Each chromosome or sequence then links to the sequence analysis page, where the chromosomal viewer shows the mapping information of the selected genes (purple), which link to their protein sequence analysis. The Result of Similarity Search part shows an alignment of the query against the SWISS-PROT and nr.aa databases using BLASTP. The Structure part shows the results of TMH prediction, PROSITE motif patterns, and Pfam domains on the amino acid sequence. We plan to maintain SEVENS with regular updates that follow new versions of the human genome sequence. Additional information (such as expression data and tertiary structure data) will be added to the database with each update. We hope these datasets will be of value to researchers engaged in 7-TMR studies.
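The "AND" combination used by the content search can be illustrated with a small Python sketch. It is not the SEVENS web interface or its query code: the `Entry` fields, the `search` function, its parameter names, and the numbering of data levels as integers are all hypothetical, and only a subset of the listed criteria is shown. Each criterion that the user leaves unset is simply skipped, and an entry is returned only if it satisfies every criterion that was set.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class Entry:
    """Hypothetical summary of one 7-TMR candidate."""
    description: str        # best-hit description from the similarity search
    chromosome: str
    data_level: int         # assumed integer label for the confidence level
    exon_count: int
    protein_length: int
    e_value: float
    pfam_domains: FrozenSet[str]


def search(entries: Iterable[Entry],
           keyword: Optional[str] = None,
           chromosome: Optional[str] = None,
           data_level: Optional[int] = None,
           max_exons: Optional[int] = None,
           min_protein_length: Optional[int] = None,
           max_e_value: Optional[float] = None,
           pfam_domain: Optional[str] = None) -> List[Entry]:
    """Return entries that satisfy every criterion that is not None."""
    tests: List[Callable[[Entry], bool]] = []
    if keyword is not None:
        tests.append(lambda e: keyword.lower() in e.description.lower())
    if chromosome is not None:
        tests.append(lambda e: e.chromosome == chromosome)
    if data_level is not None:
        tests.append(lambda e: e.data_level == data_level)
    if max_exons is not None:
        tests.append(lambda e: e.exon_count <= max_exons)
    if min_protein_length is not None:
        tests.append(lambda e: e.protein_length >= min_protein_length)
    if max_e_value is not None:
        tests.append(lambda e: e.e_value <= max_e_value)
    if pfam_domain is not None:
        tests.append(lambda e: pfam_domain in e.pfam_domains)
    # "AND" combination: all active tests must pass.
    return [e for e in entries if all(test(e) for test in tests)]
```

For example, `search(entries, keyword="olfactory", chromosome="11")` would keep only the entries on chromosome 11 whose description mentions "olfactory"; calling `search(entries)` with no criteria returns every entry.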

Recent Developments

We reran the data collection process on the human genome sequence (Apr. 2003). The web pages are now more visual, with a chromosomal mapping viewer.

References

1. Watson, S. & Arkinstall, S. (1994) The G-protein Linked Receptor Facts Book, Academic Press, London.
2. International Human Genome Sequencing Consortium. (2001) Initial sequencing and analysis of the human genome. Nature 409, 860-921.
3. Venter, J. C., et al. (2001) The sequence of the human genome. Science 291, 1304-1351.
4. Gotoh, O. (2000) Homology-based gene structure prediction: simplified matching algorithm using a translated codon (tron) and improved accuracy by allowing for long gaps. Bioinformatics 16, 190-202.
5. Asai, K., Itou, K., Ueno, Y. and Yada, T. (1998) Recognition of human genes by stochastic parsing, Pacific Symposium on Biocomputing 98, pp. 228-239 (PSB98, 1998).
6. Altschul, S. F., et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389-3402.
7. Bateman, A., Birney, E., Durbin, R., Eddy, S. R., Howe, K. L. and Sonnhammer, E. L. (2000) The Pfam protein families database. Nucleic Acids Res. 28, 263-266.
8. Bairoch, A. (1992) Prosite: A dictionary of sites and patterns in proteins. Nucleic Acids Res. 20, 2013-2018.
9. Hirokawa, T., Boon-Chieng, S. and Mitaku, S. (1998) SOSUI: classification and secondary structure prediction system for membrane proteins. Bioinformatics 14, 378-379.
10. Horn, F., Vriend, G. & Cohen, F. E. (2001) Collecting and harvesting biological data: the GPCRDB and NucleaRDB information systems. Nucleic Acids Res. 29, 346-349.
11. Crasto, C., Marenco, L., Miller, P. and Shepherd, G. (2002) Olfactory Receptor Database: a metadata-driven automated population from sources of gene and protein sequences. Nucleic Acids Res. 30, 354-360.
12. Hodges, P. E., Carrico, P. M., Hogan, J. D., O'Neill, K. E., Owen, J. J., Mangan, M., Davis, B. P., Brooks, J. E. and Garrels, J. I. (2002) Annotating the human proteome: the Human Proteome Survey Database (HumanPSD™) and an in-depth target database for G protein-coupled receptors (GPCR-PD™) from Incyte Genomics. Nucleic Acids Res. 30, 137-141.

