COG - Clusters of Orthologous Groups of proteins

NAR Molecular Biology Database Collection entry number 7
Galperin, Michael Y.; Makarova, Kira; Wolf, Yuri; Koonin, Eugene
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA

Database Description

The database of Clusters of Orthologous Groups of proteins (COGs) is an attempt at a phylogenetic classification of the proteins encoded in complete genomes. Each COG includes proteins that are inferred to be orthologs (direct evolutionary counterparts). The current release consists of 138,458 proteins, which form 4873 COGs and comprise 75% of the 185,505 proteins from 50 bacterial genomes, 13 archaeal genomes, and three genomes of unicellular eukaryotes: the yeasts Saccharomyces cerevisiae and Schizosaccharomyces pombe, and the microsporidian Encephalitozoon cuniculi. The COG database is updated periodically as new genomes become available. The COGs can be applied to the functional annotation of newly sequenced genomes by using the COGNITOR program, which is available on the COG front page.
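The release figures quoted above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not part of the COG distribution: the variable and function names are illustrative assumptions, and the mean COG size is a derived value that the description itself does not state.

```python
# Figures copied from the database description; all names are illustrative.
proteins_in_cogs = 138_458   # proteins assigned to COGs
total_proteins = 185_505     # proteins encoded in all included genomes
cog_count = 4873             # number of COGs in the release
genomes = {"bacterial": 50, "archaeal": 13, "eukaryotic": 3}


def coverage_percent(in_cogs: int, total: int) -> float:
    """Return the share of proteins assigned to COGs, in percent."""
    return 100.0 * in_cogs / total


coverage = coverage_percent(proteins_in_cogs, total_proteins)
mean_cog_size = proteins_in_cogs / cog_count  # derived, not quoted in the text
total_genomes = sum(genomes.values())

print(f"genomes covered: {total_genomes}")
print(f"coverage: {coverage:.1f}%")
print(f"mean proteins per COG: {mean_cog_size:.1f}")
```

The computed coverage is about 74.6%, which agrees with the rounded 75% given in the description.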

Recent Developments

The COG database can now be searched using RPS-BLAST through the Conserved Domain Database web site.

A version of the COGs (termed KOGs, for euKaryotic Orthologous Groups) covering seven (nearly) complete eukaryotic genomes, S. cerevisiae, S. pombe, E. cuniculi, the green plant Arabidopsis thaliana, the nematode Caenorhabditis elegans, the fruit fly Drosophila melanogaster, and Homo sapiens, is currently available. The current set consists of 4852 KOGs, which include 60,579 proteins. Detailed analysis of the KOGs revealed various trends in the evolution of eukaryotic genomes, including widely different, lineage-specific propensities for gene loss. Manual validation and annotation of the KOGs, and an update to include additional eukaryotic genomes, are underway.


References

1. Tatusov RL, Galperin MY, Natale DA, Koonin EV. (2000) The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res. 28(1): 33-36. PMID: 10592175
2. Tatusov RL, Natale DA, Garkavtsev IV, Tatusova TA, Shankavaram UT, Rao BS, Kiryutin B, Galperin MY, Fedorova ND, Koonin EV. (2001) The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res. 29(1): 22-28. PMID: 11125040
3. Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA. (2003) The COG database: an updated version includes eukaryotes. BMC Bioinformatics. 4(1): 41. PMID: 12969510
4. Koonin EV, Fedorova ND, Jackson JD, Jacobs AR, Krylov DM, Makarova KS, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Rogozin IB, Smirnov S, Sorokin AV, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA. (2004) A comprehensive evolutionary classification of proteins encoded in complete eukaryotic genomes. Genome Biol. 5(2): R7. PMID: 14759257

Go to the abstract in the NAR 2015 Database Issue.