1
Devkota P, Mohanty SD, Manda P. A Gated Recurrent Unit based architecture for recognizing ontology concepts from biological literature. BioData Min 2022; 15:22. [PMID: 36171616 PMCID: PMC9516808 DOI: 10.1186/s13040-022-00310-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/21/2022] [Accepted: 09/17/2022] [Indexed: 11/27/2022] Open
Abstract
Background: Annotating scientific literature with ontology concepts is a critical task in biology and several other domains for knowledge discovery. Ontology-based annotations can power large-scale comparative analyses in a wide range of applications ranging from evolutionary phenotypes to rare human diseases to the study of protein functions. Computational methods that can tag scientific text with ontology terms have included lexical/syntactic methods, traditional machine learning, and most recently, deep learning. Results: Here, we present state-of-the-art deep learning architectures based on Gated Recurrent Units for annotating text with ontology concepts. We use the Colorado Richly Annotated Full Text Corpus (CRAFT) as a gold standard for training and testing. We explore a number of additional information sources, including NCBI’s BioThesaurus and the Unified Medical Language System (UMLS), to augment information from CRAFT for increasing prediction accuracy. Our best model results in a 0.84 F1 and semantic similarity. Conclusion: The results shown here underscore the impact of using deep learning architectures for automatically recognizing ontology concepts from literature. The augmentation of the models with biological information beyond that present in the gold standard corpus shows a distinct improvement in prediction accuracy.
Affiliation(s)
- Pratik Devkota
  - Department of Computer Science, University of North Carolina at Greensboro, Greensboro, USA
- Somya D Mohanty
  - Department of Computer Science, University of North Carolina at Greensboro, Greensboro, USA
- Prashanti Manda
  - Informatics and Analytics, University of North Carolina at Greensboro, Greensboro, USA
2
Cui H, Ford B, Starr J, Reznicek A, Zhang L, Macklin JA. Authors’ attitude toward adopting a new workflow to improve the computability of phenotype publications. Database (Oxford) 2022; 2022:6519872. [PMID: 35106535 PMCID: PMC9278328 DOI: 10.1093/database/baac001] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 07/29/2021] [Revised: 11/24/2021] [Accepted: 01/10/2022] [Indexed: 11/13/2022]
Abstract
Critical to answering large-scale questions in biology is the integration of knowledge from different disciplines into a coherent, computable whole. Controlled vocabularies such as ontologies represent a clear path toward this goal. Using survey questionnaires, we examined the attitudes of biologists toward adopting controlled vocabularies in phenotype publications. Our questions cover current experience and overall attitude with controlled vocabularies, the awareness of the issues around ambiguity and inconsistency in phenotype descriptions and post-publication professional data curation, the preferred solutions and the effort and desired rewards for adopting a new authoring workflow. Results suggest that although the existence of controlled vocabularies is widespread, their use is not common. A majority of respondents (74%) are frustrated with ambiguity in phenotypic descriptions, and there is a strong agreement (mean agreement score 4.21 out of 5) that author curation would better reflect the original meaning of phenotype data. Moreover, the vast majority (85%) of researchers would try a new authoring workflow if resultant data were more consistent and less ambiguous. Even more respondents (93%) suggested that they would try and possibly adopt a new authoring workflow if it required 5% additional effort as compared to normal, but higher rates resulted in a steep decline in likely adoption rates. Among the four different types of rewards, two types of citations were the most desired incentives for authors to produce computable data. Overall, our results suggest the adoption of a new authoring workflow would be accelerated by a user-friendly and efficient software-authoring tool, an increased awareness of the challenges text ambiguity creates for external curators and an elevated appreciation of the benefits of controlled vocabularies.
Affiliation(s)
- Hong Cui
  - School of Information, University of Arizona, 1103 E. Second Street, Tucson, AZ 85705, USA
- Bruce Ford
  - Department of Biological Sciences, University of Manitoba, 50 Sifton Road, Winnipeg, MB R3T 2N2, Canada
- Julian Starr
  - Department of Biology, University of Ottawa, 30 Marie Curie Road, Ottawa, ON K1N 6N5, Canada
- Anton Reznicek
  - LSA Herbarium, University of Michigan, 3600 Varsity Drive #1046, Ann Arbor, MI 48019, USA
- Limin Zhang
  - School of Information, University of Arizona, 1103 E. Second Street, Tucson, AZ 85705, USA
- James A Macklin
  - Ottawa Research and Development Centre, Agriculture and Agri-Food Canada, 960 Carling Avenue, Ottawa, ON K1A 0C6, Canada
3
Davis AP, Wiegers TC, Wiegers J, Grondin CJ, Johnson RJ, Sciaky D, Mattingly CJ. CTD Anatomy: analyzing chemical-induced phenotypes and exposures from an anatomical perspective, with implications for environmental health studies. Curr Res Toxicol 2021; 2:128-139. [PMID: 33768211 PMCID: PMC7990325 DOI: 10.1016/j.crtox.2021.03.001] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 12/18/2020] [Revised: 02/01/2021] [Accepted: 03/01/2021] [Indexed: 12/12/2022] Open
Abstract
The Comparative Toxicogenomics Database (CTD) is a freely available public resource that curates and interrelates chemical, gene/protein, phenotype, disease, organism, and exposure data. CTD can be used to address toxicological mechanisms for environmental chemicals and facilitate the generation of testable hypotheses about how exposures affect human health. At CTD, manually curated interactions for chemical-induced phenotypes are enhanced with anatomy terms (tissues, fluids, and cell types) to describe the physiological system of the reported event. These same anatomy terms are used to annotate the human media (e.g., urine, hair, nail, blood, etc.) in which an environmental chemical was assayed for exposure. Currently, CTD uses more than 880 unique anatomy terms to contextualize over 255,000 chemical-phenotype interactions and 167,000 exposure statements. These annotations allow chemical-phenotype interactions and exposure data to be explored from a novel, anatomical perspective. Here, we describe CTD's anatomy curation process (including the construction of a controlled, interoperable vocabulary) and new anatomy webpages (that coalesce and organize the curated chemical-phenotype and exposure data sets). We also provide examples that demonstrate how this feature can be used to identify system- and cell-specific chemical-induced toxicities, help inform exposure data, prioritize phenotypes for environmental diseases, survey tissue and pregnancy exposomes, and facilitate data connections with external resources. Anatomy annotations advance understanding of environmental health by providing new ways to explore and survey chemical-induced events and exposure studies in the CTD framework.
Affiliation(s)
- Allan Peter Davis
  - Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695, United States
- Thomas C. Wiegers
  - Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695, United States
- Jolene Wiegers
  - Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695, United States
- Cynthia J. Grondin
  - Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695, United States
- Robin J. Johnson
  - Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695, United States
- Daniela Sciaky
  - Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695, United States
- Carolyn J. Mattingly
  - Department of Biological Sciences, North Carolina State University, Raleigh, NC 27695, United States
  - Center for Human Health and the Environment, North Carolina State University, Raleigh, NC 27695, United States
4
Teletchea S, Teletchea F. STOREFISH 2.0: a database on the reproductive strategies of teleost fishes. Database (Oxford) 2020; 2020:baaa095. [PMID: 33216894 PMCID: PMC7678788 DOI: 10.1093/database/baaa095] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 08/04/2020] [Revised: 09/04/2020] [Accepted: 10/14/2020] [Indexed: 01/08/2023]
Abstract
Teleost fishes show the most outstanding reproductive diversity of all vertebrates. Yet to date, no one has been able to decisively explain this striking variability nor to perform large-scale phylogenetic analyses of reproductive modes. Here, we describe STrategies Of REproduction in FISH (STOREFISH) 2.0, an online database easing the sharing of an original data set on reproduction published in 2007, enriched with automated data extraction and presentation to display the knowledge acquired on temperate freshwater fish species. STOREFISH 2.0 contains the information for 80 freshwater fish species and 50 traits from the analysis of 1219 references. It is anticipated that this new database could be useful for freshwater biodiversity research, conservation, assessment and management. Database URL: www.storefish.org.
Affiliation(s)
- Stéphane Teletchea
  - UFIP, Université de Nantes, UMR CNRS 6286, 2 rue de la Houssinière, 44322 Nantes cedex 3, France
- Fabrice Teletchea
  - University of Lorraine, INRAE, UR AFPA, 2 avenue de la Forêt de Haye - BP 20163, F-54000, Vandoeuvre-lès-Nancy Cedex, France
5
Cui H, Zhang L, Ford B, Cheng HL, Macklin JA, Reznicek A, Starr J. Measurement Recorder: developing a useful tool for making species descriptions that produces computable phenotypes. Database (Oxford) 2020; 2020:5995854. [PMID: 33216896 PMCID: PMC7678789 DOI: 10.1093/database/baaa079] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 05/07/2020] [Revised: 08/24/2020] [Accepted: 08/27/2020] [Indexed: 12/31/2022]
Abstract
To use published phenotype information in computational analyses, there have been efforts to convert descriptions of phenotype characters from human languages to ontologized statements. This postpublication curation process is not only slow and costly, it is also burdened with significant intercurator variation (including curator-author variation), due to different interpretations of a character by various individuals. This problem is inherent in any human-based intellectual activity. To address this problem, making scientific publications semantically clear (i.e. computable) by the authors at the time of publication is a critical step if we are to avoid postpublication curation. To help authors efficiently produce species phenotypes while producing computable data, we are experimenting with an author-driven ontology development approach and developing and evaluating a series of ontology-aware software modules that would create publishable species descriptions that are readily useable in scientific computations. The first software module prototype called Measurement Recorder has been developed to assist authors in defining continuous measurements and reported in this paper. Two usability studies of the software were conducted with 22 undergraduate students majoring in information science and 32 in biology. Results suggest that participants can use Measurement Recorder without training and they find it easy to use after limited practice. Participants also appreciate the semantic enhancement features. Measurement Recorder's character reuse features facilitate character convergence among participants by 48% and have the potential to further reduce user errors in defining characters. A set of software design issues have also been identified and then corrected. Measurement Recorder enables authors to record measurements in a semantically clear manner and enriches phenotype ontology along the way. Future work includes representing the semantic data as Resource Description Framework (RDF) knowledge graphs and characterizing the division of work between authors as domain knowledge providers and ontology engineers as knowledge formalizers in this new author-driven ontology development approach.
Affiliation(s)
- Hong Cui
  - School of Information, University of Arizona, Tucson, AZ 85705, USA
- Limin Zhang
  - School of Information, University of Arizona, Tucson, AZ 85705, USA
- Bruce Ford
  - Department of Biological Sciences, University of Manitoba, Winnipeg, MB R3T 2N2, Canada
- Hsin-Liang Cheng
  - Curtis Laws Wilson Library, Missouri University of Science and Technology, Rolla, MO 65409, USA
- James A Macklin
  - Ottawa Research and Development Centre, Agriculture and Agri-Food Canada, Ottawa, ON K1A 0C6, Canada
- Anton Reznicek
  - LSA Herbarium, University of Michigan, Ann Arbor, MI 48019, USA
- Julian Starr
  - Department of Biology, University of Ottawa, Ottawa, ON K1N 6N5, Canada
6
Thessen AE, Walls RL, Vogt L, Singer J, Warren R, Buttigieg PL, Balhoff JP, Mungall CJ, McGuinness DL, Stucky BJ, Yoder MJ, Haendel MA. Transforming the study of organisms: Phenomic data models and knowledge bases. PLoS Comput Biol 2020; 16:e1008376. [PMID: 33232313 PMCID: PMC7685442 DOI: 10.1371/journal.pcbi.1008376] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 01/07/2023] Open
Abstract
The rapidly decreasing cost of gene sequencing has resulted in a deluge of genomic data from across the tree of life; however, outside a few model organism databases, genomic data are limited in their scientific impact because they are not accompanied by computable phenomic data. The majority of phenomic data are contained in countless small, heterogeneous phenotypic data sets that are very difficult or impossible to integrate at scale because of variable formats, lack of digitization, and linguistic problems. One powerful solution is to represent phenotypic data using data models with precise, computable semantics, but adoption of semantic standards for representing phenotypic data has been slow, especially in biodiversity and ecology. Some phenotypic and trait data are available in a semantic language from knowledge bases, but these are often not interoperable. In this review, we will compare and contrast existing ontology and data models, focusing on nonhuman phenotypes and traits. We discuss barriers to integration of phenotypic data and make recommendations for developing an operationally useful, semantically interoperable phenotypic data ecosystem.
Affiliation(s)
- Anne E. Thessen
  - Environmental and Molecular Toxicology, Oregon State University, Corvallis, Oregon, United States of America
  - Ronin Institute for Independent Scholarship, Montclair, New Jersey, United States of America
- Ramona L. Walls
  - Bio5 Institute, University of Arizona, Tucson, Arizona, United States of America
- Lars Vogt
  - TIB Leibniz Information Centre for Science and Technology, Hannover, Germany
- Pier Luigi Buttigieg
  - Alfred-Wegener-Institut, Helmholtz-Zentrum für Polar- und Meeresforschung, Bremerhaven, Germany
- James P. Balhoff
  - Renaissance Computing Institute, University of North Carolina, Chapel Hill, North Carolina, United States of America
- Christopher J. Mungall
  - Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, California, United States of America
- Brian J. Stucky
  - Florida Museum of Natural History, University of Florida, Gainesville, Florida, United States of America
- Matthew J. Yoder
  - Illinois Natural History Survey, Champaign, Illinois, United States of America
- Melissa A. Haendel
  - Environmental and Molecular Toxicology, Oregon State University, Corvallis, Oregon, United States of America
7
Cui H, Macklin JA, Sachs J, Reznicek A, Starr J, Ford B, Penev L, Chen HL. Incentivising use of structured language in biological descriptions: Author-driven phenotype data and ontology production. Biodivers Data J 2018; 6:e29616. [PMID: 30473620 PMCID: PMC6235995 DOI: 10.3897/bdj.6.e29616] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 09/07/2018] [Accepted: 10/23/2018] [Indexed: 01/17/2023] Open
Abstract
Phenotypes are used for a multitude of purposes such as defining species, reconstructing phylogenies, diagnosing diseases or improving crop and animal productivity, but most of this phenotypic data is published in free-text narratives that are not computable. This means that the complex relationship between the genome, the environment and phenotypes is largely inaccessible to analysis and important questions related to the evolution of organisms, their diseases or their response to climate change cannot be fully addressed. It takes great effort to manually convert free-text narratives to a computable format before they can be used in large-scale analyses. We argue that this manual curation approach is not a sustainable solution to produce computable phenotypic data for three reasons: 1) it does not scale to all of biodiversity; 2) it does not stop the publication of free-text phenotypes that will continue to need manual curation in the future and, most importantly, 3) it does not solve the problem of inter-curator variation (curators interpret/convert a phenotype differently from each other). Our empirical studies have shown that inter-curator variation is as high as 40% even within a single project. With this level of variation, it is difficult to imagine that data integrated from multiple curation projects can be of high quality. The key causes of this variation have been identified as semantic vagueness in original phenotype descriptions and difficulties in using standardised vocabularies (ontologies). We argue that the authors describing phenotypes are the key to the solution. Given the right tools and appropriate attribution, the authors should be in charge of developing a project's semantics and ontology. This will speed up ontology development and improve the semantic clarity of phenotype descriptions from the moment of publication. A proof-of-concept project on this idea was funded by NSF ABI in July 2017. We seek readers' input or critique of the proposed approaches to help achieve community-based computable phenotype data production in the near future. Results from this project will be accessible through https://biosemantics.github.io/author-driven-production.
Affiliation(s)
- Hong Cui
  - University of Arizona, Tucson, United States of America
- James A. Macklin
  - Agriculture and Agri-Food Canada, Ottawa, Canada
- Joel Sachs
  - Agriculture and Agri-Food Canada, Ottawa, Canada
- Anton Reznicek
  - University of Michigan, Ann Arbor, United States of America
- Julian Starr
  - University of Ottawa, Ottawa, Canada
- Bruce Ford
  - University of Manitoba, Winnipeg, Canada
- Lyubomir Penev
  - Pensoft Publishers & Bulgarian Academy of Sciences, Sofia, Bulgaria
- Hsin-Liang Chen
  - University of Massachusetts at Boston, Boston, United States of America
8
Jackson LM, Fernando PC, Hanscom JS, Balhoff JP, Mabee PM. Automated Integration of Trees and Traits: A Case Study Using Paired Fin Loss Across Teleost Fishes. Syst Biol 2018; 67:559-575. [PMID: 29325126 PMCID: PMC6005059 DOI: 10.1093/sysbio/syx098] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 01/19/2017] [Revised: 12/15/2017] [Accepted: 12/21/2017] [Indexed: 11/24/2022] Open
Abstract
Data synthesis required for large-scale macroevolutionary studies is challenging with the current tools available for integration. Using a classic question regarding the frequency of paired fin loss in teleost fishes as a case study, we sought to create automated methods to facilitate the integration of broad-scale trait data with a sizable species-level phylogeny. Similar to the evolutionary pattern previously described for limbs, pelvic and pectoral fin reduction and loss are thought to have occurred independently multiple times in the evolution of fishes. We developed a bioinformatics pipeline to identify the presence and absence of pectoral and pelvic fins of 12,582 species. To do this, we integrated a synthetic morphological supermatrix of phenotypic data for the pectoral and pelvic fins for teleost fishes from the Phenoscape Knowledgebase (two presence/absence characters for 3047 taxa) with a species-level tree for teleost fishes from the Open Tree of Life project (38,419 species). The integration method detailed herein harnessed a new combined approach by utilizing data based on ontological inference, as well as phylogenetic propagation, to reduce overall data loss. Using inference enabled by ontology-based annotations, missing data were reduced from 98.0% to 85.9%, and further reduced to 34.8% by phylogenetic data propagation. These methods allowed us to extend the data to an additional 11,293 species for a total of 12,582 species with trait data. The pectoral fin appears to have been independently lost in a minimum of 19 lineages and the pelvic fin in 48. Though interpretation is limited by lack of phylogenetic resolution at the species level, it appears that following loss, both pectoral and pelvic fins were regained several (3) to many (14) times, respectively. Focused investigation into putative regains of the pectoral fin, all within one clade (Anguilliformes), showed that the pectoral fin was regained at least twice following loss. Overall, this study points to specific teleost clades where strategic phylogenetic resolution and genetic investigation will be necessary to understand the pattern and frequency of pectoral fin reversals.
Affiliation(s)
- Laura M Jackson
  - Department of Biology, University of South Dakota, 414 East Clark St., Vermillion, SD 57069, USA
- Pasan C Fernando
  - Department of Biology, University of South Dakota, 414 East Clark St., Vermillion, SD 57069, USA
- Josh S Hanscom
  - Department of Biology, University of South Dakota, 414 East Clark St., Vermillion, SD 57069, USA
- James P Balhoff
  - Renaissance Computing Institute, University of North Carolina, 100 Europa Drive Suite 540, Chapel Hill, NC 27517, USA
- Paula M Mabee
  - Department of Biology, University of South Dakota, 414 East Clark St., Vermillion, SD 57069, USA
9
Dahdul W, Manda P, Cui H, Balhoff JP, Dececchi TA, Ibrahim N, Lapp H, Vision T, Mabee PM. Annotation of phenotypes using ontologies: a gold standard for the training and evaluation of natural language processing systems. Database (Oxford) 2018; 2018:5255130. [PMID: 30576485 PMCID: PMC6301375 DOI: 10.1093/database/bay110] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 05/14/2018] [Revised: 08/22/2018] [Accepted: 09/24/2018] [Indexed: 11/12/2022]
Abstract
Natural language descriptions of organismal phenotypes, a principal object of study in biology, are abundant in the biological literature. Expressing these phenotypes as logical statements using ontologies would enable large-scale analysis on phenotypic information from diverse systems. However, considerable human effort is required to make these phenotype descriptions amenable to machine reasoning. Natural language processing tools have been developed to facilitate this task, and the training and evaluation of these tools depend on the availability of high-quality, manually annotated gold standard data sets. We describe the development of an expert-curated gold standard data set of annotated phenotypes for evolutionary biology. The gold standard was developed for the curation of complex comparative phenotypes for the Phenoscape project. It was created by consensus among three curators and consists of entity-quality expressions of varying complexity. We use the gold standard to evaluate annotations created by human curators and those generated by the Semantic CharaParser tool. Using four annotation accuracy metrics that can account for any level of relationship between terms from two phenotype annotations, we found that machine-human consistency, or similarity, was significantly lower than inter-curator (human-human) consistency. Surprisingly, allowing curators access to external information did not significantly increase the similarity of their annotations to the gold standard or have a significant effect on inter-curator consistency. We found that the similarity of machine annotations to the gold standard increased after new relevant ontology terms had been added. Evaluation by the original authors of the character descriptions indicated that the gold standard annotations came closer to representing their intended meaning than did either the curator or machine annotations. These findings point toward ways to better design software to augment human curators and the use of the gold standard corpus will allow training and assessment of new tools to improve phenotype annotation accuracy at scale.
Affiliation(s)
- Prashanti Manda
  - University of North Carolina at Greensboro, Greensboro, NC, USA
- Hong Cui
  - University of Arizona, Tucson, AZ, USA
- James P Balhoff
  - University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- T Alexander Dececchi
  - University of South Dakota, Vermillion, SD, USA
  - Current affiliation: University of Pittsburgh at Johnstown, Johnstown, PA, USA
- Nizar Ibrahim
  - University of Chicago, Chicago, IL, USA
  - Current affiliation: University of Detroit Mercy, Detroit, MI, USA & University of Portsmouth, Portsmouth, UK
- Todd Vision
  - University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
10
Dececchi TA, Mabee PM, Blackburn DC. Data Sources for Trait Databases: Comparing the Phenomic Content of Monographs and Evolutionary Matrices. PLoS One 2016; 11:e0155680. [PMID: 27191170 PMCID: PMC4871461 DOI: 10.1371/journal.pone.0155680] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 01/27/2016] [Accepted: 05/03/2016] [Indexed: 01/17/2023] Open
Abstract
Databases of organismal traits that aggregate information from one or multiple sources can be leveraged for large-scale analyses in biology. Yet the differences among these data streams and how well they capture trait diversity have never been explored. We present the first analysis of the differences between phenotypes captured in free text of descriptive publications ('monographs') and those used in phylogenetic analyses ('matrices'). We focus our analysis on osteological phenotypes of the limbs of four extinct vertebrate taxa critical to our understanding of the fin-to-limb transition. We find that there is low overlap between the anatomical entities used in these two sources of phenotype data, indicating that phenotypes represented in matrices are not simply a subset of those found in monographic descriptions. Perhaps as expected, compared to characters found in matrices, phenotypes in monographs tend to emphasize descriptive and positional morphology, be somewhat more complex, and relate to fewer additional taxa. While based on a small set of focal taxa, these qualitative and quantitative data suggest that either source of phenotypes alone will result in incomplete knowledge of variation for a given taxon. As a broader community develops to use and expand databases characterizing organismal trait diversity, it is important to recognize the limitations of the data sources and develop strategies to more fully characterize variation both within species and across the tree of life.
Affiliation(s)
- T. Alex Dececchi
  - Department of Biology, University of South Dakota, Vermillion, South Dakota, United States of America
- Paula M. Mabee
  - Department of Biology, University of South Dakota, Vermillion, South Dakota, United States of America
- David C. Blackburn
  - Florida Museum of Natural History, University of Florida, Gainesville, Florida, United States of America
11
Druzinsky RE, Balhoff JP, Crompton AW, Done J, German RZ, Haendel MA, Herrel A, Herring SW, Lapp H, Mabee PM, Muller HM, Mungall CJ, Sternberg PW, Van Auken K, Vinyard CJ, Williams SH, Wall CE. Muscle Logic: New Knowledge Resource for Anatomy Enables Comprehensive Searches of the Literature on the Feeding Muscles of Mammals. PLoS One 2016; 11:e0149102. [PMID: 26870952 PMCID: PMC4752357 DOI: 10.1371/journal.pone.0149102] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 08/14/2015] [Accepted: 01/27/2016] [Indexed: 01/27/2023] Open
Abstract
Background: In recent years large bibliographic databases have made much of the published literature of biology available for searches. However, the capabilities of the search engines integrated into these databases for text-based bibliographic searches are limited. To enable searches that deliver the results expected by comparative anatomists, an underlying logical structure known as an ontology is required. Development and Testing of the Ontology: Here we present the Mammalian Feeding Muscle Ontology (MFMO), a multi-species ontology focused on anatomical structures that participate in feeding and other oral/pharyngeal behaviors. A unique feature of the MFMO is that a simple, computable definition of each muscle, which includes its attachments and innervation, is true across mammals. This construction mirrors the logical foundation of comparative anatomy and permits searches using language familiar to biologists. Further, it provides a template for muscles that will be useful in extending any anatomy ontology. The MFMO is developed to support the Feeding Experiments End-User Database Project (FEED, https://feedexp.org/), a publicly available, online repository for physiological data collected from in vivo studies of feeding (e.g., mastication, biting, swallowing) in mammals. Currently the MFMO is integrated into FEED and also into two literature-specific implementations of Textpresso, a text-mining system that facilitates powerful searches of a corpus of scientific publications. We evaluate the MFMO by asking questions that test the ability of the ontology to return appropriate answers (competency questions). We compare the results of queries of the MFMO to results from similar searches in PubMed and Google Scholar. Results and Significance: Our tests demonstrate that the MFMO is competent to answer queries formed in the common language of comparative anatomy, but PubMed and Google Scholar are not. Overall, our results show that by incorporating anatomical ontologies into searches, an expanded and anatomically comprehensive set of results can be obtained. The broader scientific and publishing communities should consider taking up the challenge of semantically enabled search capabilities.
Affiliation(s)
- Robert E. Druzinsky
- Department of Oral Biology, University of Illinois at Chicago, Chicago, Illinois, United States of America
- James P. Balhoff
- RTI International, Research Triangle Park, North Carolina, United States of America
- Alfred W. Crompton
- Organismic and Evolutionary Biology, Harvard University, Cambridge, Massachusetts, United States of America
- James Done
- Division of Biology and Biological Engineering, M/C 156–29, California Institute of Technology, Pasadena, California, United States of America
- Rebecca Z. German
- Department of Anatomy and Neurobiology, Northeast Ohio Medical University, Rootstown, Ohio, United States of America
- Melissa A. Haendel
- Oregon Health and Science University, Portland, Oregon, United States of America
- Anthony Herrel
- Département d’Ecologie et de Gestion de la Biodiversité, Museum National d’Histoire Naturelle, Paris, France
- Susan W. Herring
- University of Washington, Department of Orthodontics, Seattle, Washington, United States of America
- Hilmar Lapp
- National Evolutionary Synthesis Center, Durham, North Carolina, United States of America
- Center for Genomic and Computational Biology, Duke University, Durham, North Carolina, United States of America
- Paula M. Mabee
- Department of Biology, University of South Dakota, Vermillion, South Dakota, United States of America
- Hans-Michael Muller
- Division of Biology and Biological Engineering, M/C 156–29, California Institute of Technology, Pasadena, California, United States of America
- Christopher J. Mungall
- Genomics Division, Lawrence Berkeley National Laboratory, Berkeley, California, United States of America
- Paul W. Sternberg
- Division of Biology and Biological Engineering, M/C 156–29, California Institute of Technology, Pasadena, California, United States of America
- Howard Hughes Medical Institute, M/C 156–29, California Institute of Technology, Pasadena, California, United States of America
- Kimberly Van Auken
- Division of Biology and Biological Engineering, M/C 156–29, California Institute of Technology, Pasadena, California, United States of America
- Christopher J. Vinyard
- Department of Anatomy and Neurobiology, Northeast Ohio Medical University, Rootstown, Ohio, United States of America
- Susan H. Williams
- Department of Biomedical Sciences, Ohio University Heritage College of Osteopathic Medicine, Athens, Ohio, United States of America
- Christine E. Wall
- Department of Evolutionary Anthropology, Duke University, Durham, North Carolina, United States of America
12
Thessen AE, Bunker DE, Buttigieg PL, Cooper LD, Dahdul WM, Domisch S, Franz NM, Jaiswal P, Lawrence-Dill CJ, Midford PE, Mungall CJ, Ramírez MJ, Specht CD, Vogt L, Vos RA, Walls RL, White JW, Zhang G, Deans AR, Huala E, Lewis SE, Mabee PM. Emerging semantics to link phenotype and environment. PeerJ 2015; 3:e1470. [PMID: 26713234 PMCID: PMC4690371 DOI: 10.7717/peerj.1470] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Received: 07/15/2015] [Accepted: 11/12/2015] [Indexed: 11/20/2022]
Abstract
Understanding the interplay between environmental conditions and phenotypes is a fundamental goal of biology. Unfortunately, data that include observations on phenotype and environment are highly heterogeneous and thus difficult to find and integrate. One approach that is likely to improve the status quo involves the use of ontologies to standardize and link data about phenotypes and environments. Specifying and linking data through ontologies will allow researchers to increase the scope and flexibility of large-scale analyses aided by modern computing methods. Investments in this area would advance diverse fields such as ecology, phylogenetics, and conservation biology. While several biological ontologies are well-developed, using them to link phenotypes and environments is rare because of gaps in ontological coverage and limits to interoperability among ontologies and disciplines. In this manuscript, we present (1) use cases from diverse disciplines to illustrate questions that could be answered more efficiently using a robust linkage between phenotypes and environments, (2) two proof-of-concept analyses that show the value of linking phenotypes to environments in fishes and amphibians, and (3) two proposed example data models for linking phenotypes and environments using the extensible observation ontology (OBOE) and the Biological Collections Ontology (BCO); these provide a starting point for the development of a data model linking phenotypes and environments.
Affiliation(s)
- Anne E. Thessen
- Ronin Institute for Independent Scholarship, Montclair, NJ, United States
- The Data Detektiv, Waltham, MA, United States
- Daniel E. Bunker
- Department of Biological Sciences, New Jersey Institute of Technology, Newark, NJ, United States
- Pier Luigi Buttigieg
- HGF-MPG Group for Deep Sea Ecology and Technology, Alfred-Wegener-Institut, Helmholtz-Zentrum für Polar-und Meeresforschung, Bremerhaven, Germany
- Laurel D. Cooper
- Department of Botany and Plant Pathology, Oregon State University, Corvallis, OR, United States
- Wasila M. Dahdul
- Department of Biology, University of South Dakota, Vermillion, SD, United States
- Sami Domisch
- Department of Ecology and Evolutionary Biology, Yale University, New Haven, CT, United States
- Nico M. Franz
- School of Life Sciences, Arizona State University, Tempe, AZ, United States
- Pankaj Jaiswal
- Department of Botany and Plant Pathology, Oregon State University, Corvallis, OR, United States
- Carolyn J. Lawrence-Dill
- Departments of Genetics, Development and Cell Biology and Agronomy, Iowa State University, Ames, IA, United States
- Martín J. Ramírez
- Division of Arachnology, Museo Argentino de Ciencias Naturales–CONICET, Buenos Aires, Argentina
- Chelsea D. Specht
- Departments of Plant and Microbial Biology & Integrative Biology, University of California, Berkeley, CA, United States
- Lars Vogt
- Institut für Evolutionsbiologie und Ökologie, Universität Bonn, Bonn, Germany
- Ramona L. Walls
- iPlant Collaborative, University of Arizona, Tucson, AZ, United States
- Jeffrey W. White
- US Arid Land Agricultural Research Center, United States Department of Agriculture—ARS, Maricopa, AZ, United States
- Guanyang Zhang
- School of Life Sciences, Arizona State University, Tempe, AZ, United States
- Andrew R. Deans
- Department of Entomology, Pennsylvania State University, University Park, PA, United States
- Eva Huala
- Phoenix Bioinformatics, Redwood City, CA, United States
- Suzanna E. Lewis
- Genomics Division, Lawrence Berkeley National Laboratory, Berkeley, CA, United States
- Paula M. Mabee
- Department of Biology, University of South Dakota, Vermillion, SD, United States
13
Dececchi TA, Balhoff JP, Lapp H, Mabee PM. Toward Synthesizing Our Knowledge of Morphology: Using Ontologies and Machine Reasoning to Extract Presence/Absence Evolutionary Phenotypes across Studies. Syst Biol 2015; 64:936-52. [PMID: 26018570 PMCID: PMC4604830 DOI: 10.1093/sysbio/syv031] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.4] [Received: 11/24/2014] [Accepted: 05/20/2015] [Indexed: 02/02/2023]
Abstract
The reality of larger and larger molecular databases and the need to integrate data scalably have presented a major challenge for the use of phenotypic data. Morphology is currently primarily described in discrete publications, entrenched in noncomputer readable text, and requires enormous investments of time and resources to integrate across large numbers of taxa and studies. Here we present a new methodology, using ontology-based reasoning systems working with the Phenoscape Knowledgebase (KB; kb.phenoscape.org), to automatically integrate large amounts of evolutionary character state descriptions into a synthetic character matrix of neomorphic (presence/absence) data. Using the KB, which includes more than 55 studies of sarcopterygian taxa, we generated a synthetic supermatrix of 639 variable characters scored for 1051 taxa, resulting in over 145,000 populated cells. Of these characters, over 76% were made variable through the addition of inferred presence/absence states derived by machine reasoning over the formal semantics of the source ontologies. Inferred data reduced the missing data in the variable character-subset from 98.5% to 78.2%. Machine reasoning also enables the isolation of conflicts in the data, that is, cells where both presence and absence are indicated; reports regarding conflicting data provenance can be generated automatically. Further, reasoning enables quantification and new visualizations of the data, here for example, allowing identification of character space that has been undersampled across the fin-to-limb transition. The approach and methods demonstrated here to compute synthetic presence/absence supermatrices are applicable to any taxonomic and phenotypic slice across the tree of life, providing the data are semantically annotated. 
Because such data can also be linked to model organism genetics through computational scoring of phenotypic similarity, they open a rich set of future research questions into phenotype-to-genome relationships.
Affiliation(s)
- James P Balhoff
- National Evolutionary Synthesis Center, Durham, NC 27705, USA; University of North Carolina, Chapel Hill, NC 27599, USA
- Hilmar Lapp
- National Evolutionary Synthesis Center, Durham, NC 27705, USA; Center for Genomics and Computational Biology, Duke University, Durham, NC 27708, USA
- Paula M Mabee
- Department of Biology, University of South Dakota, Vermillion, SD 57069, USA
14
Manda P, Balhoff JP, Lapp H, Mabee P, Vision TJ. Using the phenoscape knowledgebase to relate genetic perturbations to phenotypic evolution. Genesis 2015. [PMID: 26220875 DOI: 10.1002/dvg.22878] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Indexed: 01/03/2023]
Abstract
The abundance of phenotypic diversity among species can enrich our knowledge of development and genetics beyond the limits of variation that can be observed in model organisms. The Phenoscape Knowledgebase (KB) is designed to enable exploration and discovery of phenotypic variation among species. Because phenotypes in the KB are annotated using standard ontologies, evolutionary phenotypes can be compared with phenotypes from genetic perturbations in model organisms. To illustrate the power of this approach, we review the use of the KB to find taxa showing evolutionary variation similar to that of a query gene. Matches are made between the full set of phenotypes described for a gene and an evolutionary profile, the latter of which is defined as the set of phenotypes that are variable among the daughters of any node on the taxonomic tree. Phenoscape's semantic similarity interface allows the user to assess the statistical significance of each match and flags matches that may only result from differences in annotation coverage between genetic and evolutionary studies. Tools such as this will help meet the challenge of relating the growing volume of genetic knowledge in model organisms to the diversity of phenotypes in nature. The Phenoscape KB is available at http://kb.phenoscape.org.
Affiliation(s)
- Prashanti Manda
- Department of Biology, University of North Carolina, Chapel Hill, North Carolina; US National Evolutionary Synthesis Center, Durham, North Carolina
- James P Balhoff
- Department of Biology, University of North Carolina, Chapel Hill, North Carolina; US National Evolutionary Synthesis Center, Durham, North Carolina
- Hilmar Lapp
- US National Evolutionary Synthesis Center, Durham, North Carolina; Center for Genomic and Computational Biology, Duke University, Durham, North Carolina
- Paula Mabee
- Department of Biology, University of South Dakota, Vermillion, South Dakota
- Todd J Vision
- Department of Biology, University of North Carolina, Chapel Hill, North Carolina; US National Evolutionary Synthesis Center, Durham, North Carolina