1
Bult CJ, Sternberg PW. The alliance of genome resources: transforming comparative genomics. Mamm Genome 2023; 34:531-544. [PMID: 37666946 PMCID: PMC10628019 DOI: 10.1007/s00335-023-10015-2] [Citation(s) in RCA: 23] [Impact Index Per Article: 11.5] [Received: 05/16/2023] [Accepted: 08/11/2023] [Indexed: 09/06/2023]
Abstract
Comparing genomic and biological characteristics across multiple species is essential to using model systems to investigate the molecular and cellular mechanisms underlying human biology and disease and to translate mechanistic insights from studies in model organisms for clinical applications. Building a scalable knowledge commons platform that supports cross-species comparison of rich, expertly curated knowledge regarding gene function, phenotype, and disease associations available for model organisms and humans is the primary mission of the Alliance of Genome Resources (the Alliance). The Alliance is a consortium of seven model organism knowledgebases (mouse, rat, yeast, nematode, zebrafish, frog, fruit fly) and the Gene Ontology resource. The Alliance uses a common set of gene ortholog assertions as the basis for comparing biological annotations across the organisms represented in the Alliance. The major types of knowledge associated with genes that are represented in the Alliance database currently include gene function, phenotypic alleles and variants, human disease associations, pathways, gene expression, and both protein-protein and genetic interactions. The Alliance has enhanced the ability of researchers to easily compare biological annotations for common data types across model organisms and human through the implementation of shared programmatic access mechanisms, data-specific web pages with a unified "look and feel", and interactive user interfaces specifically designed to support comparative biology. The modular infrastructure developed by the Alliance allows the resource to serve as an extensible "knowledge commons" capable of expanding to accommodate additional model organisms.
2
Egorova KS, Smirnova NS, Toukach PV. CSDB_GT, a curated glycosyltransferase database with close-to-full coverage on three most studied nonanimal species. Glycobiology 2020; 31:524-529. [PMID: 33242091 DOI: 10.1093/glycob/cwaa107] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 10/03/2020] [Revised: 11/13/2020] [Accepted: 11/18/2020] [Indexed: 11/13/2022]
Abstract
We report the completion of the first stage of development of CSDB_GT, a novel manually curated database of glycosyltransferase (GT) activities. CSDB_GT (http://csdb.glycoscience.ru/gt.html) has been supplemented with GT activities from Saccharomyces cerevisiae. It now provides close-to-complete coverage of experimentally confirmed GTs from the three most studied model organisms from three kingdoms: Plantae (Arabidopsis thaliana, ca. 930 activities), Bacteria (Escherichia coli, ca. 820 activities) and Fungi (S. cerevisiae, ca. 270 activities).
Affiliation(s)
- Ksenia S Egorova
- Laboratory of Metal-Complex and Nano-Scale Catalysts, N.D. Zelinsky Institute of Organic Chemistry, Russian Academy of Sciences, Leninsky prospect 47, Moscow 119991, Russia
- Nadezhda S Smirnova
- Kurnakov Institute of General and Inorganic Chemistry, Russian Academy of Sciences, Leninsky prospect 31, Moscow 119991, Russia
- Philip V Toukach
- Laboratory of Carbohydrate Chemistry, N.D. Zelinsky Institute of Organic Chemistry, Russian Academy of Sciences, Leninsky prospect 47, Moscow 119991, Russia
3
Baryshnikova A. Data libraries - the missing element for modeling biological systems. FEBS J 2020; 287:4594-4601. [PMID: 32100391 PMCID: PMC7687078 DOI: 10.1111/febs.15261] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 12/09/2019] [Revised: 02/19/2020] [Accepted: 02/24/2020] [Indexed: 11/29/2022]
Abstract
The primary bottleneck in understanding and modeling biological systems is shifting from data collection to data analysis and integration. This process critically depends on data being available in an organized form, so that they can be accessed, understood, and reused by a broad community of scientists. A proven solution for organizing data is literature curation, which extracts, aggregates, and distributes findings from publications. Here, I describe the benefits of extending curation practices to datasets, especially those that are not deposited in centralized databases. I argue that dataset curation (or ‘data librarianship’ as I suggest we call it) will overcome many barriers in data visibility and reusability and make a unique contribution to integration and modeling.
4
Abstract
Data, including information generated from them by processing and analysis, are an asset with measurable value. The assets that biological research funding produces are the data generated, the information derived from these data, and, ultimately, the discoveries and knowledge these lead to. From the time when Henry Oldenburg published the first scientific journal in 1665 (Proceedings of the Royal Society) to the founding of the United States National Library of Medicine in 1879 to the present, there has been a sustained drive to improve how researchers can record and discover what is known. Researchers’ experimental work builds upon years and (collectively) billions of dollars’ worth of earlier work. Today, researchers are generating data at ever-faster rates because of advances in instrumentation and technology, coupled with decreases in production costs. Unfortunately, the ability of researchers to manage and disseminate their results has not kept pace, so their work cannot achieve its maximal impact. Strides have recently been made, but more awareness is needed of the essential role that biological data resources, including biocuration, play in maintaining and linking this ever-growing flood of data and information. The aim of this paper is to describe the nature of data as an asset, the role biocurators play in increasing its value, and consistent, practical means to measure effectiveness that can guide planning and justify costs in biological research information resources’ development and management.
5
MacPherson KA, Starr B, Wong ED, Dalusag KS, Hellerstedt ST, Lang OW, Nash RS, Skrzypek MS, Engel SR, Cherry JM. Outreach and online training services at the Saccharomyces Genome Database. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2017; 2017:3053434. [PMID: 28365719 PMCID: PMC5467555 DOI: 10.1093/database/bax002] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 10/31/2016] [Accepted: 01/05/2017] [Indexed: 11/12/2022]
Abstract
The Saccharomyces Genome Database (SGD; www.yeastgenome.org), the primary genetics and genomics resource for the budding yeast S. cerevisiae, provides free public access to expertly curated information about the yeast genome and its gene products. As the central hub for the yeast research community, SGD engages in a variety of social outreach efforts to inform our users about new developments, promote collaboration, increase public awareness of the importance of yeast to biomedical research, and facilitate scientific discovery. Here we describe these various outreach methods, from networking at scientific conferences to the use of online media such as blog posts and webinars, and include our perspectives on the benefits provided by outreach activities for model organism databases. Database URL:http://www.yeastgenome.org
Affiliation(s)
- Barry Starr
- Department of Genetics, Stanford University, Stanford, CA 94305, USA
- Edith D Wong
- Department of Genetics, Stanford University, Stanford, CA 94305, USA
- Kyla S Dalusag
- Department of Genetics, Stanford University, Stanford, CA 94305, USA
- Olivia W Lang
- Department of Genetics, Stanford University, Stanford, CA 94305, USA
- Robert S Nash
- Department of Genetics, Stanford University, Stanford, CA 94305, USA
- Marek S Skrzypek
- Department of Genetics, Stanford University, Stanford, CA 94305, USA
- Stacia R Engel
- Department of Genetics, Stanford University, Stanford, CA 94305, USA
- J Michael Cherry
- Department of Genetics, Stanford University, Stanford, CA 94305, USA
6
Rodriguez-Esteban R. Biocuration with insufficient resources and fixed timelines. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2015; 2015:bav116. [PMID: 26708987 PMCID: PMC4691339 DOI: 10.1093/database/bav116] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Received: 08/11/2015] [Accepted: 11/17/2015] [Indexed: 11/14/2022]
Abstract
Biological curation, or biocuration, is often studied from the perspective of creating and maintaining databases that have the goal of mapping and tracking certain areas of biology. However, much biocuration is, in fact, dedicated to finite and time-limited projects in which insufficient resources demand trade-offs. This typically more ephemeral type of curation is nonetheless of importance in biomedical research. Here, I propose a framework to understand such restricted curation projects from the point of view of return on curation (ROC), value, efficiency and productivity. Moreover, I suggest general strategies to optimize these curation efforts, such as the ‘multiple strategies’ approach, as well as a metric called overhead that can be used in the context of managing curation resources.
Affiliation(s)
- Raul Rodriguez-Esteban
- Roche Pharmaceutical Research and Early Development, pRED Informatics, Roche Innovation Center Basel, Basel 4070, Switzerland
7
8
Skrzypek MS, Nash RS. Biocuration at the Saccharomyces genome database. Genesis 2015; 53:450-7. [PMID: 25997651 DOI: 10.1002/dvg.22862] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Received: 04/03/2015] [Revised: 05/12/2015] [Accepted: 05/13/2015] [Indexed: 11/06/2022]
Abstract
Saccharomyces Genome Database is an online resource dedicated to managing information about the biology and genetics of the model organism, yeast (Saccharomyces cerevisiae). This information is derived primarily from scientific publications through a process of human curation that involves manual extraction of data and their organization into a comprehensive system of knowledge. This system provides a foundation for further analysis of experimental data coming from research on yeast as well as other organisms. In this review we will demonstrate how biocuration and biocurators add a key component, the biological context, to our understanding of how genes, proteins, genomes and cells function and interact. We will explain the role biocurators play in sifting through the wealth of biological data to incorporate and connect key information. We will also discuss the many ways we assist researchers with their various research needs. We hope to convince the reader that manual curation is vital in converting the flood of data into organized and interconnected knowledge, and that biocurators play an essential role in the integration of scientific information into a coherent model of the cell.
Affiliation(s)
- Marek S Skrzypek
- Department of Genetics, Saccharomyces Genome Database, Stanford University, Stanford, California
- Robert S Nash
- Department of Genetics, Saccharomyces Genome Database, Stanford University, Stanford, California
9
Abstract
Gene Ontology (GO) provides dynamic controlled vocabularies to aid in the description of the functional biological attributes and subcellular locations of gene products from all taxonomic groups (www.geneontology.org). Here we describe collaboration between the renal biomedical research community and the GO Consortium to improve the quality and quantity of GO terms describing renal development. In the associated annotation activity, the new and revised terms were associated with gene products involved in renal development and function. This project resulted in a total of 522 GO terms being added to the ontology and the creation of approximately 9,600 kidney-related GO term associations to 940 UniProt Knowledgebase (UniProtKB) entries, covering 66 taxonomic groups. We demonstrate the impact of these improvements on the interpretation of GO term analyses performed on genes differentially expressed in kidney glomeruli affected by diabetic nephropathy. In summary, we have produced a resource that can be utilized in the interpretation of data from small- and large-scale experiments investigating molecular mechanisms of kidney function and development and thereby help towards alleviating renal disease.
10
Zimmer AD, Lang D, Buchta K, Rombauts S, Nishiyama T, Hasebe M, Van de Peer Y, Rensing SA, Reski R. Reannotation and extended community resources for the genome of the non-seed plant Physcomitrella patens provide insights into the evolution of plant gene structures and functions. BMC Genomics 2013; 14:498. [PMID: 23879659 PMCID: PMC3729371 DOI: 10.1186/1471-2164-14-498] [Citation(s) in RCA: 136] [Impact Index Per Article: 11.3] [Received: 03/06/2013] [Accepted: 07/19/2013] [Indexed: 11/24/2022]
Abstract
Background The moss Physcomitrella patens as a model species provides an important reference for early-diverging lineages of plants and the release of the genome in 2008 opened the doors to genome-wide studies. The usability of a reference genome greatly depends on the quality of the annotation and the availability of centralized community resources. Therefore, in the light of accumulating evidence for missing genes, fragmentary gene structures, false annotations and a low rate of functional annotations on the original release, we decided to improve the moss genome annotation. Results Here, we report the complete moss genome re-annotation (designated V1.6) incorporating the increased transcript availability from a multitude of developmental stages and tissue types. We demonstrate the utility of the improved P. patens genome annotation for comparative genomics and new extensions to the cosmoss.org resource as a central repository for this plant “flagship” genome. The structural annotation of 32,275 protein-coding genes results in 8387 additional loci including 1456 loci with known protein domains or homologs in Plantae. This is the first release to include information on transcript isoforms, suggesting alternative splicing events for at least 10.8% of the loci. Furthermore, this release now also provides information on non-protein-coding loci. Functional annotations were improved regarding quality and coverage, resulting in 58% annotated loci (previously: 41%) that comprise also 7200 additional loci with GO annotations. Access and manual curation of the functional and structural genome annotation is provided via the http://www.cosmoss.org model organism database. Conclusions Comparative analysis of gene structure evolution along the green plant lineage provides novel insights, such as a comparatively high number of loci with 5’-UTR introns in the moss. 
Comparative analysis of functional annotations reveals expansions of moss house-keeping and metabolic genes and further possibly adaptive, lineage-specific expansions and gains including at least 13% orphan genes.
Affiliation(s)
- Andreas D Zimmer
- Plant Biotechnology, Faculty of Biology, University of Freiburg, Schaenzlestrasse 1, 79104, Freiburg, Germany
11
Neves M, Damaschun A, Mah N, Lekschas F, Seltmann S, Stachelscheid H, Fontaine JF, Kurtz A, Leser U. Preliminary evaluation of the CellFinder literature curation pipeline for gene expression in kidney cells and anatomical parts. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2013; 2013:bat020. [PMID: 23599415 PMCID: PMC3629873 DOI: 10.1093/database/bat020] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.0] [Indexed: 01/11/2023]
Abstract
Biomedical literature curation is the process of automatically and/or manually deriving knowledge from scientific publications and recording it into specialized databases for structured delivery to users. It is a slow, error-prone, complex, costly and, yet, highly important task. Previous experiences have proven that text mining can assist in its many phases, especially, in triage of relevant documents and extraction of named entities and biological events. Here, we present the curation pipeline of the CellFinder database, a repository of cell research, which includes data derived from literature curation and microarrays to identify cell types, cell lines, organs and so forth, and especially patterns in gene expression. The curation pipeline is based on freely available tools in all text mining steps, as well as the manual validation of extracted data. Preliminary results are presented for a data set of 2376 full texts from which >4500 gene expression events in cell or anatomical part have been extracted. Validation of half of this data resulted in a precision of ∼50% of the extracted data, which indicates that we are on the right track with our pipeline for the proposed task. However, evaluation of the methods shows that there is still room for improvement in the named-entity recognition and that a larger and more robust corpus is needed to achieve a better performance for event extraction. Database URL: http://www.cellfinder.org/
Affiliation(s)
- Mariana Neves
- Humboldt-Universität zu Berlin, Knowledge Management in Bioinformatics, Berlin, 10099, Germany.
12
Davis AP, Wiegers TC, Johnson RJ, Lay JM, Lennon-Hopkins K, Saraceni-Richards C, Sciaky D, Murphy CG, Mattingly CJ. Text mining effectively scores and ranks the literature for improving chemical-gene-disease curation at the comparative toxicogenomics database. PLoS One 2013; 8:e58201. [PMID: 23613709 PMCID: PMC3629079 DOI: 10.1371/journal.pone.0058201] [Citation(s) in RCA: 54] [Impact Index Per Article: 4.5] [Received: 10/26/2012] [Accepted: 01/31/2013] [Indexed: 11/30/2022]
Abstract
The Comparative Toxicogenomics Database (CTD; http://ctdbase.org/) is a public resource that curates interactions between environmental chemicals and gene products, and their relationships to diseases, as a means of understanding the effects of environmental chemicals on human health. CTD provides a triad of core information in the form of chemical-gene, chemical-disease, and gene-disease interactions that are manually curated from scientific articles. To increase the efficiency, productivity, and data coverage of manual curation, we have leveraged text mining to help rank and prioritize the triaged literature. Here, we describe our text-mining process that computes and assigns each article a document relevancy score (DRS), wherein a high DRS suggests that an article is more likely to be relevant for curation at CTD. We evaluated our process by first text mining a corpus of 14,904 articles triaged for seven heavy metals (cadmium, cobalt, copper, lead, manganese, mercury, and nickel). Based upon initial analysis, a representative subset corpus of 3,583 articles was then selected from this corpus and sent to five CTD biocurators for review. The resulting curation of these 3,583 articles was analyzed for a variety of parameters, including article relevancy, novel data content, interaction yield rate, mean average precision, and biological and toxicological interpretability. We show that for all measured parameters, the DRS is an effective indicator for scoring and improving the ranking of literature for the curation of chemical-gene-disease information at CTD. Here, we demonstrate how fully incorporating text mining-based DRS scoring into our curation pipeline enhances manual curation by prioritizing more relevant articles, thereby increasing data content, productivity, and efficiency.
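As an illustration of one metric named in this abstract, the following minimal sketch shows how mean average precision could be computed for score-ranked article lists. The abstract does not say how the DRS is calculated, so the scores, article identifiers, relevance labels, and all function names below are hypothetical.

```python
from typing import List, Sequence, Tuple


def average_precision(relevance: Sequence[bool]) -> float:
    """Average precision of one ranked list (best-ranked item first)."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision at this rank, counted only where a relevant item sits.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def rank_by_score(articles: Sequence[Tuple[str, float, bool]]) -> List[bool]:
    """Order (article_id, score, curator_found_relevant) by score, highest first."""
    ordered = sorted(articles, key=lambda article: article[1], reverse=True)
    return [relevant for _, _, relevant in ordered]


def mean_average_precision(rankings: Sequence[Sequence[bool]]) -> float:
    """Mean of the per-list average precision values."""
    if not rankings:
        return 0.0
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Invented data: one small ranked list per chemical (not taken from the paper).
cadmium = [("a1", 0.91, True), ("a2", 0.40, False), ("a3", 0.75, True)]
lead = [("b1", 0.20, True), ("b2", 0.80, False)]

score = mean_average_precision([rank_by_score(cadmium), rank_by_score(lead)])
print(f"MAP over two toy corpora: {score:.2f}")
```

A value near 1 means the relevant articles sit at the top of each ranked list, which is the behaviour a useful relevancy score should show.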
Affiliation(s)
- Allan Peter Davis
- Department of Biology, North Carolina State University, Raleigh, North Carolina, United States of America
- Thomas C. Wiegers
- Department of Biology, North Carolina State University, Raleigh, North Carolina, United States of America
- Robin J. Johnson
- Department of Bioinformatics, The Mount Desert Island Biological Laboratory, Salisbury Cove, Maine, United States of America
- Jean M. Lay
- Department of Bioinformatics, The Mount Desert Island Biological Laboratory, Salisbury Cove, Maine, United States of America
- Kelley Lennon-Hopkins
- Department of Bioinformatics, The Mount Desert Island Biological Laboratory, Salisbury Cove, Maine, United States of America
- Cynthia Saraceni-Richards
- Department of Bioinformatics, The Mount Desert Island Biological Laboratory, Salisbury Cove, Maine, United States of America
- Daniela Sciaky
- Department of Bioinformatics, The Mount Desert Island Biological Laboratory, Salisbury Cove, Maine, United States of America
- Cynthia Grondin Murphy
- Department of Biology, North Carolina State University, Raleigh, North Carolina, United States of America
- Carolyn J. Mattingly
- Department of Biology, North Carolina State University, Raleigh, North Carolina, United States of America
13
Davis AP, Johnson RJ, Lennon-Hopkins K, Sciaky D, Rosenstein MC, Wiegers TC, Mattingly CJ. Targeted journal curation as a method to improve data currency at the Comparative Toxicogenomics Database. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2012; 2012:bas051. [PMID: 23221299 PMCID: PMC3515863 DOI: 10.1093/database/bas051] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Indexed: 12/14/2022]
Abstract
The Comparative Toxicogenomics Database (CTD) is a public resource that promotes understanding about the effects of environmental chemicals on human health. CTD biocurators read the scientific literature and manually curate a triad of chemical–gene, chemical–disease and gene–disease interactions. Typically, articles for CTD are selected using a chemical-centric approach by querying PubMed to retrieve a corpus containing the chemical of interest. Although this technique ensures adequate coverage of knowledge about the chemical (i.e. data completeness), it does not necessarily reflect the most current state of all toxicological research in the community at large (i.e. data currency). Keeping databases current with the most recent scientific results, as well as providing a rich historical background from legacy articles, is a challenging process. To address this issue of data currency, CTD designed and tested a journal-centric approach of curation to complement our chemical-centric method. We first identified priority journals based on defined criteria. Next, over 7 weeks, three biocurators reviewed 2425 articles from three consecutive years (2009–2011) of three targeted journals. From this corpus, 1252 articles contained relevant data for CTD and 52 752 interactions were manually curated. Here, we describe our journal selection process, two methods of document delivery for the biocurators and the analysis of the resulting curation metrics, including data currency, and both intra-journal and inter-journal comparisons of research topics. Based on our results, we expect that curation by select journals can (i) be easily incorporated into the curation pipeline to complement our chemical-centric approach; (ii) build content more evenly for chemicals, genes and diseases in CTD (rather than biasing data by chemicals-of-interest); (iii) reflect developing areas in environmental health and (iv) improve overall data currency for chemicals, genes and diseases. 
Database URL: http://ctdbase.org/
Affiliation(s)
- Allan Peter Davis
- Department of Biology, North Carolina State University, Raleigh, NC 27695-7617, USA.
14
Klionsky DJ, Bruford EA, Cherry JM, Hodgkin J, Laulederkind SJF, Singer AG. In the beginning there was babble... Autophagy 2012; 8:1165-7. [PMID: 22836666 PMCID: PMC3625114 DOI: 10.4161/auto.20665] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.1] [Indexed: 12/12/2022]
Abstract
“Go to, let us go down, and there confound their language, that they may not understand one another's speech. …Therefore is the name of it called Babel; because the Lord did there confound the language of all the earth…” Genesis 11:7,9
15
Park J, Costanzo MC, Balakrishnan R, Cherry JM, Hong EL. CvManGO, a method for leveraging computational predictions to improve literature-based Gene Ontology annotations. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2012; 2012:bas001. [PMID: 22434836 PMCID: PMC3308158 DOI: 10.1093/database/bas001] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Indexed: 01/18/2023]
Abstract
The set of annotations at the Saccharomyces Genome Database (SGD) that classifies the cellular function of S. cerevisiae gene products using Gene Ontology (GO) terms has become an important resource for facilitating experimental analysis. In addition to capturing and summarizing experimental results, the structured nature of GO annotations allows for functional comparison across organisms as well as propagation of functional predictions between related gene products. Due to their relevance to many areas of research, ensuring the accuracy and quality of these annotations is a priority at SGD. GO annotations are assigned either manually, by biocurators extracting experimental evidence from the scientific literature, or through automated methods that leverage computational algorithms to predict functional information. Here, we discuss the relationship between literature-based and computationally predicted GO annotations in SGD and extend a strategy whereby comparison of these two types of annotation identifies genes whose annotations need review. Our method, CvManGO (Computational versus Manual GO annotations), pairs literature-based GO annotations with computational GO predictions and evaluates the relationship of the two terms within GO, looking for instances of discrepancy. We found that this method will identify genes that require annotation updates, taking an important step towards finding ways to prioritize literature review. Additionally, we explored factors that may influence the effectiveness of CvManGO in identifying relevant gene targets to find in particular those genes that are missing literature-supported annotations, but our survey found that there are no immediately identifiable criteria by which one could enrich for these under-annotated genes. Finally, we discuss possible ways to improve this strategy, and the applicability of this method to other projects that use the GO for curation. Database URL:http://www.yeastgenome.org
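The pairing step described in this abstract can be pictured with a small sketch that classifies how a literature-based term and a computational prediction relate within an ontology hierarchy. This is not the authors' implementation: the toy `is_a` graph, the category names, and the rule for which categories count as a discrepancy are all assumptions made for illustration.

```python
from typing import Dict, Set

# Toy is_a graph (child term -> parent terms); illustrative, not the real GO.
PARENTS: Dict[str, Set[str]] = {
    "molecular_function": set(),
    "catalytic activity": {"molecular_function"},
    "binding": {"molecular_function"},
    "kinase activity": {"catalytic activity"},
    "protein kinase activity": {"kinase activity"},
    "protein binding": {"binding"},
}


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Collect every term reachable by following parent links upward."""
    seen: Set[str] = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def compare_terms(manual: str, computational: str,
                  parents: Dict[str, Set[str]] = PARENTS) -> str:
    """Classify the relationship between a manual and a predicted term."""
    if manual == computational:
        return "same"
    if manual in ancestors(computational, parents):
        return "computational_more_specific"
    if computational in ancestors(manual, parents):
        return "manual_more_specific"
    return "unrelated_branch"


def needs_review(manual: str, computational: str) -> bool:
    """Assumed rule: flag pairs where the prediction adds or contradicts detail."""
    return compare_terms(manual, computational) in {
        "computational_more_specific",
        "unrelated_branch",
    }


print(compare_terms("kinase activity", "protein kinase activity"))
print(needs_review("protein binding", "kinase activity"))
```

Under this assumed rule, a gene whose manual annotation is already at least as specific as the prediction is left alone, while the remaining pairs are queued for a curator to check against the literature.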
Affiliation(s)
- Julie Park
- Department of Genetics, Stanford University, Stanford, CA 94305-5120, USA
16
Vasilevsky N, Johnson T, Corday K, Torniai C, Brush M, Segerdell E, Wilson M, Shaffer C, Robinson D, Haendel M. Research resources: curating the new eagle-i discovery system. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2012; 2012:bar067. [PMID: 22434835 PMCID: PMC3308157 DOI: 10.1093/database/bar067] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.6] [Indexed: 02/07/2023]
Abstract
Development of biocuration processes and guidelines for new data types or projects is a challenging task. Each project finds its way toward defining annotation standards and ensuring data consistency with varying degrees of planning and different tools to support and/or report on consistency. Further, this process may be data type specific even within the context of a single project. This article describes our experiences with eagle-i, a 2-year pilot project to develop a federated network of data repositories in which unpublished, unshared or otherwise ‘invisible’ scientific resources could be inventoried and made accessible to the scientific community. During the course of eagle-i development, the main challenges we experienced related to the difficulty of collecting and curating data while the system and the data model were simultaneously built, and a deficiency and diversity of data management strategies in the laboratories from which the source data was obtained. We discuss our approach to biocuration and the importance of improving information management strategies to the research process, specifically with regard to the inventorying and usage of research resources. Finally, we highlight the commonalities and differences between eagle-i and similar efforts with the hope that our lessons learned will assist other biocuration endeavors. Database URL:www.eagle-i.net
Affiliation(s)
- Nicole Vasilevsky
- Oregon Health & Science University, Library, LIB, 3181 S.W. Sam Jackson Park Rd., Portland, OR 97239-3098, USA
17
Fang R, Schindelman G, Van Auken K, Fernandes J, Chen W, Wang X, Davis P, Tuli MA, Marygold SJ, Millburn G, Matthews B, Zhang H, Brown N, Gelbart WM, Sternberg PW. Automatic categorization of diverse experimental information in the bioscience literature. BMC Bioinformatics 2012; 13:16. [PMID: 22280404 PMCID: PMC3305665 DOI: 10.1186/1471-2105-13-16] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.1] [Received: 10/05/2011] [Accepted: 01/26/2012] [Indexed: 02/04/2023]
Abstract
BACKGROUND Curation of information from bioscience literature into biological knowledge databases is a crucial way of capturing experimental information in a computable form. During the biocuration process, a critical first step is to identify from all published literature the papers that contain results for a specific data type the curator is interested in annotating. This step normally requires curators to manually examine many papers to ascertain which few contain information of interest and thus, is usually time consuming. We developed an automatic method for identifying papers containing these curation data types among a large pool of published scientific papers based on the machine learning method Support Vector Machine (SVM). This classification system is completely automatic and can be readily applied to diverse experimental data types. It has been in use in production for automatic categorization of 10 different experimental datatypes in the biocuration process at WormBase for the past two years and it is in the process of being adopted in the biocuration process at FlyBase and the Saccharomyces Genome Database (SGD). We anticipate that this method can be readily adopted by various databases in the biocuration community and thereby greatly reducing time spent on an otherwise laborious and demanding task. We also developed a simple, readily automated procedure to utilize training papers of similar data types from different bodies of literature such as C. elegans and D. melanogaster to identify papers with any of these data types for a single database. This approach has great significance because for some data types, especially those of low occurrence, a single corpus often does not have enough training papers to achieve satisfactory performance.
RESULTS We successfully tested the method on ten data types from WormBase, fifteen data types from FlyBase and three data types from Mouse Genomics Informatics (MGI). It is being used in the curation work flow at WormBase for automatic association of newly published papers with ten data types including RNAi, antibody, phenotype, gene regulation, mutant allele sequence, gene expression, gene product interaction, overexpression phenotype, gene interaction, and gene structure correction.
CONCLUSIONS Our methods are applicable to a variety of data types with training set containing several hundreds to a few thousand documents. It is completely automatic and, thus can be readily incorporated to different workflow at different literature-based databases. We believe that the work presented here can contribute greatly to the tremendous task of automating the important yet labor-intensive biocuration effort.
Affiliation(s)
- Ruihua Fang
  - Howard Hughes Medical Institute and Biology Division, California Institute of Technology, Pasadena, CA 91125, USA
- Gary Schindelman
  - Howard Hughes Medical Institute and Biology Division, California Institute of Technology, Pasadena, CA 91125, USA
- Kimberly Van Auken
  - Howard Hughes Medical Institute and Biology Division, California Institute of Technology, Pasadena, CA 91125, USA
- Jolene Fernandes
  - Howard Hughes Medical Institute and Biology Division, California Institute of Technology, Pasadena, CA 91125, USA
- Wen Chen
  - Howard Hughes Medical Institute and Biology Division, California Institute of Technology, Pasadena, CA 91125, USA
- Xiaodong Wang
  - Howard Hughes Medical Institute and Biology Division, California Institute of Technology, Pasadena, CA 91125, USA
- Paul Davis
  - Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge, CB10 1SA, UK
- Mary Ann Tuli
  - Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge, CB10 1SA, UK
- Steven J Marygold
  - Department of Genetics, University of Cambridge, Downing Street, Cambridge, CB2 3EH, UK
- Gillian Millburn
  - Department of Genetics, University of Cambridge, Downing Street, Cambridge, CB2 3EH, UK
- Beverley Matthews
  - Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA 02138, USA
- Haiyan Zhang
  - Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA 02138, USA
- Nick Brown
  - The Gurdon Institute and Department of Physiology, Development & Neuroscience, University of Cambridge, Tennis Court Road, Cambridge, CB2 1QN, UK
- William M Gelbart
  - Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA 02138, USA
- Paul W Sternberg
  - Howard Hughes Medical Institute and Biology Division, California Institute of Technology, Pasadena, CA 91125, USA
18
Baran J, Gerner M, Haeussler M, Nenadic G, Bergman CM. pubmed2ensembl: a resource for mining the biological literature on genes. PLoS One 2011; 6:e24716. [PMID: 21980353 PMCID: PMC3183000 DOI: 10.1371/journal.pone.0024716] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.0] [Received: 06/06/2011] [Accepted: 08/17/2011] [Indexed: 11/19/2022]
Abstract
BACKGROUND The last two decades have witnessed a dramatic acceleration in the production of genomic sequence information and publication of biomedical articles. Despite the fact that genome sequence data and publications are two of the most heavily relied-upon sources of information for many biologists, very little effort has been made to systematically integrate data from genomic sequences directly with the biological literature. For a limited number of model organisms dedicated teams manually curate publications about genes; however for species with no such dedicated staff many thousands of articles are never mapped to genes or genomic regions.
METHODOLOGY/PRINCIPAL FINDINGS To overcome the lack of integration between genomic data and biological literature, we have developed pubmed2ensembl (http://www.pubmed2ensembl.org), an extension to the BioMart system that links over 2,000,000 articles in PubMed to nearly 150,000 genes in Ensembl from 50 species. We use several sources of curated (e.g., Entrez Gene) and automatically generated (e.g., gene names extracted through text-mining on MEDLINE records) sources of gene-publication links, allowing users to filter and combine different data sources to suit their individual needs for information extraction and biological discovery. In addition to extending the Ensembl BioMart database to include published information on genes, we also implemented a scripting language for automated BioMart construction and a novel BioMart interface that allows text-based queries to be performed against PubMed and PubMed Central documents in conjunction with constraints on genomic features. Finally, we illustrate the potential of pubmed2ensembl through typical use cases that involve integrated queries across the biomedical literature and genomic data.
CONCLUSION/SIGNIFICANCE By allowing biologists to find the relevant literature on specific genomic regions or sets of functionally related genes more easily, pubmed2ensembl offers a much-needed genome informatics inspired solution to accessing the ever-increasing biomedical literature.
Affiliation(s)
- Joachim Baran
  - Faculty of Life Sciences, University of Manchester, Manchester, United Kingdom
- Martin Gerner
  - Faculty of Life Sciences, University of Manchester, Manchester, United Kingdom
- Goran Nenadic
  - School of Computer Science, University of Manchester, Manchester, United Kingdom
- Casey M. Bergman
  - Faculty of Life Sciences, University of Manchester, Manchester, United Kingdom
19
Costanzo MC, Park J, Balakrishnan R, Cherry JM, Hong EL. Using computational predictions to improve literature-based Gene Ontology annotations: a feasibility study. Database: The Journal of Biological Databases and Curation 2011; 2011:bar004. [PMID: 21411447 PMCID: PMC3067894 DOI: 10.1093/database/bar004] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.1] [Indexed: 02/03/2023]
Abstract
Annotation using Gene Ontology (GO) terms is one of the most important ways in which biological information about specific gene products can be expressed in a searchable, computable form that may be compared across genomes and organisms. Because literature-based GO annotations are often used to propagate functional predictions between related proteins, their accuracy is critically important. We present a strategy that employs a comparison of literature-based annotations with computational predictions to identify and prioritize genes whose annotations need review. Using this method, we show that comparison of manually assigned ‘unknown’ annotations in the Saccharomyces Genome Database (SGD) with InterPro-based predictions can identify annotations that need to be updated. A survey of literature-based annotations and computational predictions made by the Gene Ontology Annotation (GOA) project at the European Bioinformatics Institute (EBI) across several other databases shows that this comparison strategy could be used to maintain and improve the quality of GO annotations for other organisms besides yeast. The survey also shows that although GOA-assigned predictions are the most comprehensive source of functional information for many genomes, a large proportion of genes in a variety of different organisms entirely lack these predictions but do have manual annotations. This underscores the critical need for manually performed, literature-based curation to provide functional information about genes that are outside the scope of widely used computational methods. Thus, the combination of manual and computational methods is essential to provide the most accurate and complete functional annotation of a genome. Database URL: http://www.yeastgenome.org
Affiliation(s)
- Maria C Costanzo
  - Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305-5120, USA
20
Lardenois A, Gattiker A, Collin O, Chalmel F, Primig M. GermOnline 4.0 is a genomics gateway for germline development, meiosis and the mitotic cell cycle. Database: The Journal of Biological Databases and Curation 2010; 2010:baq030. [PMID: 21149299 PMCID: PMC3004465 DOI: 10.1093/database/baq030] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.5] [Indexed: 12/16/2022]
Abstract
GermOnline 4.0 is a cross-species database portal focusing on high-throughput expression data relevant for germline development, the meiotic cell cycle and mitosis in healthy versus malignant cells. It is thus a source of information for life scientists as well as clinicians who are interested in gene expression and regulatory networks. The GermOnline gateway provides unlimited access to information produced with high-density oligonucleotide microarrays (3'-UTR GeneChips), genome-wide protein-DNA binding assays and protein-protein interaction studies in the context of Ensembl genome annotation. Samples used to produce high-throughput expression data and to carry out genome-wide in vivo DNA binding assays are annotated via the MIAME-compliant Multiomics Information Management and Annotation System (MIMAS 3.0). Furthermore, the Saccharomyces Genomics Viewer (SGV) was developed and integrated into the gateway. SGV is a visualization tool that outputs genome annotation and DNA-strand specific expression data produced with high-density oligonucleotide tiling microarrays (Sc_tlg GeneChips) which cover the complete budding yeast genome on both DNA strands. It facilitates the interpretation of expression levels and transcript structures determined for various cell types cultured under different growth and differentiation conditions. Database URL: www.germonline.org/
Affiliation(s)
- Aurélie Lardenois
  - Inserm, U625, GERHM, IFR-140, Université de Rennes 1, F-35042 Rennes, France