NAR Molecular Biology Database Collection entry number 867
Rattei, Thomas; Arnold, Roland; Goldenberg, Florian; Mewes, Hans-Werner
Terrence Donnelly Centre for Cellular and Biomolecular Research, Kim Lab, University of Toronto, Toronto, ON M5S 3E1, Canada, CUBE-Division of Computational Systems Biology, Department of Microbiology and Ecosystem Science, University of Vienna, 1090 Vienna, Austria and Institute of Bioinformatics and Systems Biology, Helmholtz Zentrum München, Technische Universität München, Wissenschaftszentrum Weihenstephan, 85764 Neuherberg, Germany.

Database Description

Protein sequences are of utmost importance for studying the function and evolution of genes and genomes. Therefore a rich collection of methods in computational biology relies on the analysis and comparison of protein sequences. Many of these intensively used methods perform sequence similarity searches (e.g. BLAST (1)) or compare protein sequences against secondary databases of protein families (e.g. InterPro (2)).
The rapidly increasing volume of publicly available protein sequences creates a computational dilemma for bioinformatics tasks that require repeated all-against-all calculations of sequence similarities or sequence features. Such tasks, which are conceptually straightforward but technically challenging, include the annotation of genomes and the clustering of the protein sequence space into protein families. The Similarity Matrix of Proteins (SIMAP) addresses this dilemma by incrementally pre-calculating the sequence similarities that span the known protein sequence space (3). Comparing new sequences against the known ones returns symmetric scores, so the existing records can be updated accordingly without repeating earlier comparisons. To complement the pairwise sequence similarity matrix with position-specific searches against known protein families, SIMAP additionally pre-calculates sequence-based features such as InterPro matches (2).
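The incremental update described above can be illustrated with a small sketch. It is not SIMAP's implementation: the k-mer Jaccard score is only a symmetric stand-in for a real alignment score, and the names `toy_similarity` and `SimilarityMatrix` are hypothetical. The point is that each new sequence is compared once against the sequences already stored, and each symmetric score is kept once per unordered pair.

```python
def kmer_set(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def toy_similarity(a, b, k=3):
    """Jaccard similarity of k-mer sets.

    Assumption: a symmetric placeholder for a real alignment score.
    """
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


class SimilarityMatrix:
    """Hypothetical incremental store of pairwise similarity scores."""

    def __init__(self, threshold=0.0):
        self.sequences = {}    # sequence id -> sequence
        self.scores = {}       # frozenset({id_a, id_b}) -> score
        self.threshold = threshold
        self.comparisons = 0   # number of pairwise comparisons performed

    def add(self, seq_id, seq):
        """Compare a new sequence against the known ones only."""
        if seq_id in self.sequences:
            return
        for other_id, other_seq in self.sequences.items():
            score = toy_similarity(seq, other_seq)
            self.comparisons += 1
            if score > self.threshold:
                # One entry per unordered pair, because the score is symmetric.
                self.scores[frozenset((seq_id, other_id))] = score
        self.sequences[seq_id] = seq

    def score(self, a, b):
        """Look up a stored score; pairs below the threshold read as 0.0."""
        return self.scores.get(frozenset((a, b)), 0.0)


matrix = SimilarityMatrix()
matrix.add("p1", "MKVLAAGIVG")
matrix.add("p2", "MKVLAAGLVG")
matrix.add("p3", "WWWWWWWW")
print(matrix.comparisons)        # 0 + 1 + 2 comparisons in total
print(matrix.score("p2", "p1"))  # same value as matrix.score("p1", "p2")
```

Adding the n-th sequence costs n - 1 comparisons in this sketch, whereas rebuilding the matrix from scratch would repeat all n(n - 1)/2 of them.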
The SIMAP database provides a comprehensive and up-to-date pre-calculation of the protein sequence similarity matrix, sequence-based features and sequence clusters. As of September 2009, SIMAP covers 48 million proteins and more than 23 million non-redundant sequences.
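The gap between the number of proteins and the number of non-redundant sequences arises when several database entries share one amino-acid sequence. The sketch below shows one way such entries could be collapsed. It rests on assumptions: the text does not state how SIMAP detects redundancy, so treating exact string identity as the criterion and using an MD5 digest as the key are illustrative choices, and `sequence_key` and `collapse_redundant` are hypothetical names.

```python
import hashlib
from collections import defaultdict


def sequence_key(seq):
    """Digest of a whitespace-free, upper-cased sequence (assumed key)."""
    normalized = "".join(seq.split()).upper()
    return hashlib.md5(normalized.encode("ascii")).hexdigest()


def collapse_redundant(entries):
    """Group protein ids by identical sequence.

    entries: iterable of (protein_id, sequence) pairs.
    Returns a dict mapping each sequence key to the ids that share it.
    """
    groups = defaultdict(list)
    for protein_id, seq in entries:
        groups[sequence_key(seq)].append(protein_id)
    return dict(groups)


entries = [
    ("A", "MKV LAA"),
    ("B", "mkvlaa"),
    ("C", "MKVLAG"),
]
groups = collapse_redundant(entries)
print(len(entries), "proteins,", len(groups), "non-redundant sequences")
```

With such a grouping, similarities and features need to be calculated only once per distinct sequence and can then be shared by every protein entry that maps to it.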
Access to SIMAP is freely provided through the web portal for individual users and, for programmatic access, through DAS and a Web Service.

Recent Developments

Novel features of SIMAP include the expansion of the sequence space through additional databases such as ENSEMBL and the integration of metagenomes based on their consistent processing and annotation. Furthermore, protein function predictions by Blast2GO are pre-calculated for all sequences in SIMAP, and the data access and query functions have been improved. SIMAP helps biologists query the up-to-date sequence space systematically and facilitates large-scale downstream projects in computational biology.


Acknowledgements

The authors gratefully acknowledge the BOINCSIMAP community for donating their CPU power for the calculation of protein similarities and features. We are grateful to our colleagues at MIPS, in particular Mathias Walter, Martin Muensterkoetter and Manuel Spannagl, for many helpful discussions and suggestions. The authors wish to thank SUN Microsystems Inc. for funding a fully equipped X4500 data center server that is hosting parts of the SIMAP database, through a SUN Academic Excellence Grant.


References

1. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 25, 3389-3402
2. Hunter, S., Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Binns, D., Bork, P., Das, U., Daugherty, L. and Duquenne, L. (2009) InterPro: the integrative protein signature database. Nucleic Acids Research 37, D211-215
3. Arnold, R., Rattei, T., Tischler, P., Truong, M.D., Stumpflen, V. and Mewes, W. (2005) SIMAP-The similarity matrix of proteins. Bioinformatics 21, ii42-46
