NAR Molecular Biology Database Collection entry number 914
Hu Z.Z. and Wu C.H.
Georgetown University Medical Center, 3900 Reservoir Road, NW, Washington, DC 20057, USA

Database Description

iProLINK (integrated Protein Literature, INformation and Knowledge) is a resource to facilitate text mining research in the areas of literature-based database curation, named entity recognition, and protein ontology development. Computational and biological researchers can use this collection of annotated data sources to explore literature information on proteins and their features or properties (Hu et al., 2004). The data sets for bibliography mapping and feature evidence attribution include mapped citations (PubMed ID to protein entry and feature line mapping) and annotation-tagged literature corpora. The latter include ~800 abstracts and/or full-text articles in which text evidence was tagged for ~1200 experimentally validated post-translational modifications (PTMs) annotated in the PIR protein sequence database (PIR-PSD). The data sets for entity recognition and ontology development include protein name dictionaries, word token dictionaries, protein name-tagged literature corpora along with tagging guidelines, and a protein ontology based on PIRSF protein family names. All data sets are freely accessible and can be downloaded at

Recent Developments

iProLINK now provides tools developed using its annotated data sources, such as the text mining system RLIMS-P (Rule-based Literature Mining System for Protein Phosphorylation), which is specifically designed to extract protein phosphorylation information, including the protein kinase, substrate, and phosphorylation sites, from MEDLINE abstracts (Hu et al., 2005). In addition, a new resource called BioThesaurus, a comprehensive collection of protein/gene names and their associations with UniProtKB protein entries (Liu et al., 2005), was recently developed as part of the iProLINK resources. BioThesaurus is described separately in this on-line Database Collection.


The iProLINK project is supported by grants DBI-0138188, ITR-0205470, and IIS-0430743 from the National Science Foundation, and in part by grant U01-HG02712 from the National Institutes of Health, USA.


Subcategory: Protein properties

