SIMAP
NAR Molecular Biology Database Collection entry number 867
Rattei, T.1, Tischler, P.1, Götz, S.2, Jehl, M.-A.1, Hoser, J.1, Arnold, R.1, Conesa, A.2, and Mewes, H.-W.1,3
1Technische Universität München, Department of Genome Oriented Bioinformatics, Wissenschaftszentrum Weihenstephan, Freising, Germany
2Bioinformatics Department, Centro de Investigación Príncipe Felipe, Valencia, Spain
3Institute for Bioinformatics and Systems Biology (MIPS), Helmholtz Zentrum München, German Research Center for Environmental Health (GmbH), Neuherberg, Germany
Contact t.rattei@wzw.tum.de
Database Description
Protein sequences are of utmost importance for studying the function and evolution of genes and genomes. Therefore a rich collection of methods in computational biology relies on the analysis and comparison of protein sequences. Many of these intensively used methods perform sequence similarity searches (e.g. BLAST (1)) or compare protein sequences against secondary databases of protein families (e.g. InterPro (2)).
The rapidly increasing volume of publicly available protein sequences creates a computational dilemma for bioinformatics tasks that require repeated all-against-all calculations of sequence similarities or sequence features. Such tasks, which are conceptually straightforward but technically challenging, include the annotation of genomes and the clustering of the protein sequence space into protein families. The Similarity Matrix of Proteins (SIMAP) resolves this dilemma by incrementally pre-calculating the sequence similarities that form the known protein sequence space (3). Because similarity scores are symmetric, comparing new sequences against the known ones yields scores that can also be added to the existing records, so previously computed comparisons need not be repeated. To complement the pairwise sequence similarity matrix with position-specific searches against known protein families, SIMAP additionally pre-calculates sequence-based features such as InterPro matches (2).
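The incremental update described above can be illustrated with a minimal Python sketch. It is not the SIMAP implementation: the class name, the method names and the toy scoring function (a count of shared 3-mers) are assumptions chosen only to show that adding one sequence requires comparisons against the known sequences alone, and that each symmetric score is stored once per pair.

```python
def toy_score(a: str, b: str) -> int:
    """Hypothetical symmetric score: number of shared 3-mers.

    This stands in for a real alignment score and is NOT the
    algorithm used by SIMAP.
    """
    def kmers(s: str) -> set:
        return {s[i:i + 3] for i in range(len(s) - 2)}
    return len(kmers(a) & kmers(b))


class SimilarityMatrix:
    """Toy pairwise matrix that grows incrementally (illustrative only)."""

    def __init__(self, score=toy_score):
        self.score = score
        self.sequences = {}    # sequence id -> sequence
        self.scores = {}       # unordered id pair -> score
        self.comparisons = 0   # number of score computations so far

    def add(self, seq_id: str, sequence: str) -> None:
        """Compare a new sequence against the known ones only."""
        if seq_id in self.sequences:
            return  # already covered, nothing to recompute
        for other_id, other_seq in self.sequences.items():
            # The score is symmetric, so one entry serves both directions.
            pair = frozenset((seq_id, other_id))
            self.scores[pair] = self.score(sequence, other_seq)
            self.comparisons += 1
        self.sequences[seq_id] = sequence

    def get(self, a: str, b: str) -> int:
        """Look up a stored score; the order of the ids does not matter."""
        return self.scores[frozenset((a, b))]


matrix = SimilarityMatrix()
matrix.add("p1", "MKTAYIAK")
matrix.add("p2", "MKTAYLLK")
matrix.add("p3", "GGGSSSPP")
before = matrix.comparisons
matrix.add("p4", "MKTAYIAKQR")
new_comparisons = matrix.comparisons - before
```

In this sketch, adding the fourth sequence triggers three new comparisons, one per known sequence, rather than a recomputation of all six pairs.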
The SIMAP database provides a comprehensive and up-to-date pre-calculation of the protein sequence similarity matrix, sequence-based features and sequence clusters. As of September 2009, SIMAP covers 48 million proteins and more than 23 million non-redundant sequences.
Access to SIMAP is free of charge. Individual users can query the web portal (http://mips.gsf.de/simap/), and programmatic access is available through DAS (http://webclu.bio.wzw.tum.de/das/) and a Web Service (http://mips.gsf.de/webservices/services/SimapService2.0?wsdl).
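As a rough illustration of programmatic access, the following Python sketch builds a request URL in the general form used by the DAS protocol, where a "features" command is addressed to a named data source and restricted to a segment. The base URL is the DAS address given above; the data source name "example_source" and the segment identifier "P12345" are hypothetical placeholders, since the names of the SIMAP DAS sources are not stated here. The sketch only constructs the URL and does not contact the server.

```python
from urllib.parse import urlencode

# DAS address as given in the text above.
DAS_BASE = "http://webclu.bio.wzw.tum.de/das"


def das_features_url(source: str, segment: str, base: str = DAS_BASE) -> str:
    """Build a DAS 'features' request URL for one segment.

    'source' is the DAS data source name and 'segment' the reference
    identifier (for protein sources, typically a protein accession).
    Both values used below are hypothetical placeholders.
    """
    query = urlencode({"segment": segment})
    return f"{base.rstrip('/')}/{source}/features?{query}"


# Hypothetical source name and accession, for illustration only.
url = das_features_url("example_source", "P12345")
print(url)
```

A DAS server is expected to answer such a request with an XML document, which could be parsed with a standard XML library once the actual source names are known.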
Recent Developments
Novel features of SIMAP include the expansion of the sequence space through additional databases such as ENSEMBL and the integration of metagenomes, which are processed and annotated consistently. Furthermore, protein function predictions by Blast2GO are pre-calculated for all sequences in SIMAP, and the data access and query functions have been improved. SIMAP helps biologists query the up-to-date sequence space systematically and facilitates large-scale downstream projects in computational biology.
Acknowledgements
The authors gratefully acknowledge the BOINCSIMAP community for donating their CPU power for the calculation of protein similarities and features. We are grateful to our colleagues at MIPS, in particular Mathias Walter, Martin Muensterkoetter and Manuel Spannagl, for many helpful discussions and suggestions. The authors wish to thank SUN Microsystems Inc. for funding a fully equipped X4500 data center server that is hosting parts of the SIMAP database, through a SUN Academic Excellence Grant.
References
1. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 25, 3389-3402
2. Hunter, S., Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Binns, D., Bork, P., Das, U., Daugherty, L. and Duquenne, L. (2009) InterPro: the integrative protein signature database. Nucleic Acids Research 37, D211-215
3. Arnold, R., Rattei, T., Tischler, P., Truong, M.D., Stumpflen, V. and Mewes, W. (2005) SIMAP-The similarity matrix of proteins. Bioinformatics 21, ii42-46
Category: Protein sequence databases
Subcategory: Protein domain databases; protein classification
Go to the abstract in the NAR 2010 Database Issue.
DOI: 10.1093/nar/gkp949