Manually Curated Database of Rice Proteins
NAR Molecular Biology Database Collection entry number 1743
Priyanka Garg, Rashmi Jain, Shaji V. Joseph, Akhilesh K. Tyagi, Saurabh Raghuvanshi
Department of Plant Molecular Biology, University of Delhi South Campus, Benito Juarez Road, New Delhi 110021, India
The Manually Curated Database of Rice Proteins (MCDRP) is a manually curated database based on published experimental data. Semantic integration of scientific data is essential to gain a higher level of understanding of biological systems. Since the majority of scientific data is available only as published literature, text mining is an essential step before the data can be integrated and made available for computer-based search in various databases. However, text mining is a tedious exercise, and thus there is a large gap between the data available in curated databases and the data in the published literature. Moreover, the data in an experiment can be perceived from several perspectives, which may not be reflected in text-based curation. To address these issues, we have demonstrated the feasibility of digitizing the experimental data itself by creating a database on rice proteins based on data curation models developed in-house. Digitizing the experimental data itself offers several advantages:
1. Digitization renders the experimental data amenable to computerized search, such that exact data from multiple articles can be searched and retrieved rapidly.
2. Digitization into a standard computer-readable format would facilitate seamless, semantic integration of data.
3. Data models for digitizing experimental data may be extended further to enable pre-publication integration of data. Thus, from the very beginning, the data would be in a format in which it can be integrated into any database with minimal manual intervention.
Such an approach would help close the gap between curated databases and the accumulated scientific literature. The manual curation workflow adopted in developing the database is designed to digitize the actual experimental data (presented as images or graphs) published in peer-reviewed articles.
We developed a data curation workflow guided by the need to digitize experimental data as well as to lay the foundation for a standard representation of experimental data. In simple terms, every data point in a graph or image represents data of several dimensions, such as gene name, plant type, tissue and growth conditions. Ontologies and notations are used together systematically to represent the information contained in every data point, so that each data point is represented by a collection of these pre-defined terms. Since the notations are alphanumeric and have pre-defined definitions, the data is stored in a relational database and can be easily searched and correlated.
The current release of the database catalogues data for over 1800 rice genes from about 400 peer-reviewed research articles. Data from more than 4000 different experiments have been digitized using the manual data curation models developed in-house. These experiments are based on a total of about 140 different experimental techniques, mostly related to gene expression analysis, biochemical activity analysis, protein-protein interaction, DNA-protein interaction and cellular/sub-cellular localization.
Since every aspect of the experimental data (gene id, tissue, growth conditions etc.) has been encoded with the help of defined notations (ontologies etc.), it is possible to retrieve the same set of data from several different perspectives. The database can be accessed either by browsing or by searching for a particular keyword or term. A global overview of the data from different perspectives (PubMed id, gene/protein id, environmental/growth conditions, plant part/developmental stage, or gene function and localization) can be obtained by "browsing" the database. Specific queries can be made by searching the database with any of the ontology terms or with a keyword.
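The idea of encoding each data point as a collection of pre-defined terms in a relational table, and then retrieving it from different perspectives, can be illustrated with a minimal sketch. This is not the MCDRP schema or interface: the table name, column names, term notations (such as "PART:leaf" or "COND:drought"), the `build_db` and `search` helpers, and all rows are hypothetical placeholders invented for illustration, and an in-memory SQLite database stands in for whatever relational system the database actually uses.

```python
import sqlite3

# Hypothetical schema: one row per digitized data point. Every descriptive
# column holds a pre-defined term (ontology id or notation), not free text.
SCHEMA = """
CREATE TABLE data_point (
    id               INTEGER PRIMARY KEY,
    pubmed_id        TEXT NOT NULL,
    gene_id          TEXT NOT NULL,
    plant_part       TEXT NOT NULL,
    growth_condition TEXT NOT NULL,
    technique        TEXT NOT NULL,
    measurement      REAL
);
"""

# Placeholder rows, invented for illustration only (not real MCDRP data).
ROWS = [
    ("PMID_1", "GENE_A", "PART:leaf", "COND:drought", "TECH:qPCR", 4.2),
    ("PMID_1", "GENE_A", "PART:root", "COND:drought", "TECH:qPCR", 1.1),
    ("PMID_2", "GENE_B", "PART:leaf", "COND:control", "TECH:northern", 0.8),
    ("PMID_2", "GENE_A", "PART:leaf", "COND:salt", "TECH:northern", 2.5),
]

# Columns that may be used as search "perspectives".
SEARCHABLE = {"pubmed_id", "gene_id", "plant_part", "growth_condition", "technique"}


def build_db():
    """Create an in-memory database and load the placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO data_point "
        "(pubmed_id, gene_id, plant_part, growth_condition, technique, measurement) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        ROWS,
    )
    conn.commit()
    return conn


def search(conn, **terms):
    """Return data points matching every given term, e.g. gene_id='GENE_A'."""
    unknown = set(terms) - SEARCHABLE
    if unknown:
        raise ValueError(f"unsupported search columns: {sorted(unknown)}")
    sql = "SELECT gene_id, plant_part, growth_condition, measurement FROM data_point"
    if terms:
        # Column names come from the SEARCHABLE whitelist; values are bound
        # as parameters rather than interpolated into the SQL string.
        sql += " WHERE " + " AND ".join(f"{col} = ?" for col in terms)
    sql += " ORDER BY id"
    return conn.execute(sql, tuple(terms.values())).fetchall()


if __name__ == "__main__":
    conn = build_db()
    # Same table, two perspectives: by gene, then by plant part and condition.
    print(search(conn, gene_id="GENE_A"))
    print(search(conn, plant_part="PART:leaf", growth_condition="COND:drought"))
```

Because every dimension is a controlled term in its own column, the same rows can be reached by gene, by plant part, by growth condition or by article, and any combination of these narrows the result with a simple conjunction.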
The database will be updated every 6 months, and as curation progresses the data will be enriched with studies covering diverse aspects such as biotic stress and yield. The ultimate aim is to digitize all the relevant articles, but since the task is enormous we are proceeding methodically so that important aspects such as stress biology (abiotic or biotic) and yield are better represented.
The database is hosted at the University of Delhi South Campus, New Delhi, India, and is financed by the Department of Biotechnology, Government of India.
Gour P, Garg P, Jain R, Joseph SV, Tyagi AK, Raghuvanshi S. (2014). Nucleic Acids Res. (2014 Database issue).
Category: Plant databases