1
Grams M, Richter S. On the four complementary aspects of hierarchical character relationships and their bearing on scoring constraints, expressed in a new syntax for character dependencies. Cladistics 2023; 39:437-455. [PMID: 37428134 DOI: 10.1111/cla.12550] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/22/2022] [Revised: 06/02/2023] [Accepted: 06/03/2023] [Indexed: 07/11/2023]
Abstract
Morphological matrices, including the conceptualization of characters and character states and scoring thereof, still are a valuable and necessary tool for phylogenetic analyses. Although they are often seen only as numerically simplified summaries of observations for the purpose of cladistic analyses, they also hold value as collections of ideas, concepts and the current state of knowledge, conveying various hypotheses on character state identity, homology and evolutionary transformations. A common and persistent issue in scoring and analysing morphological matrices is the phenomenon of inapplicable characters ("inapplicables"). Inapplicables result from the ontological dependency (based on hierarchical relationships) between characters. Traditionally handled the same as "missing data", inapplicables were shown to be problematic in holding the potential to result in unreasonable algorithmic preference for certain cladograms over others. Recently, though, this problem has been solved by approaching parsimony as a maximization of homology rather than a minimization of transformational steps. We herein aim to further improve our theoretical understanding of the underlying hierarchical nature of morphological characters, which causes the phenomenon of ontological dependencies and, thereby, inapplicables. As a result, we present a discussion of various character-dependency scenarios and a new concept of hierarchical character relationships as being composed of four complementary sub-aspects. Building on this, a new syntax for the designation of character dependencies as part of the character statement is proposed, to help identify and apply scoring constraints for manual and automated scoring of morphological character matrices and their cladistic analysis.
Affiliation(s)
- Markus Grams
  - Universität Rostock, Institut für Biowissenschaften, Allgemeine & Spezielle Zoologie, Rostock, Germany
- Stefan Richter
  - Universität Rostock, Institut für Biowissenschaften, Allgemeine & Spezielle Zoologie, Rostock, Germany
2
Vogt L, Mikó I, Bartolomaeus T. Anatomy and the type concept in biology show that ontologies must be adapted to the diagnostic needs of research. J Biomed Semantics 2022; 13:18. [PMID: 35761389 PMCID: PMC9235205 DOI: 10.1186/s13326-022-00268-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/11/2021] [Accepted: 04/12/2022] [Indexed: 11/10/2022]
Abstract
BACKGROUND In times of exponential data growth in the life sciences, machine-supported approaches are becoming increasingly important and with them the need for FAIR (Findable, Accessible, Interoperable, Reusable) and eScience-compliant data and metadata standards. Ontologies, with their queryable knowledge resources, play an essential role in providing these standards. Unfortunately, biomedical ontologies only provide ontological definitions that answer What is it? questions, but no method-dependent empirical recognition criteria that answer How does it look? questions. Consequently, biomedical ontologies contain knowledge of the underlying ontological nature of structural kinds, but often lack sufficient diagnostic knowledge to unambiguously determine the reference of a term. RESULTS We argue that this is because ontology terms are usually textually defined and conceived as essentialistic classes, while recognition criteria often require perception-based definitions because perception-based contents more efficiently document and communicate spatial and temporal information: a picture is worth a thousand words. Therefore, diagnostic knowledge often must be conceived as cluster classes or fuzzy sets. Using several examples from anatomy, we point out the importance of diagnostic knowledge in anatomical research and discuss the role of cluster classes and fuzzy sets as concepts of grouping needed in anatomy ontologies in addition to essentialistic classes. In this context, we evaluate the role of the biological type concept and discuss its function as a general container concept for groupings not covered by the essentialistic class concept. CONCLUSIONS We conclude that many recognition criteria can be conceptualized as text-based cluster classes that use terms that are in turn based on perception-based fuzzy set concepts. 
Finally, we point out that only if biomedical ontologies model also relevant diagnostic knowledge in addition to ontological knowledge, they will fully realize their potential and contribute even more substantially to the establishment of FAIR and eScience-compliant data and metadata standards in the life sciences.
Affiliation(s)
- Lars Vogt
  - TIB Leibniz Information Centre for Science and Technology, Welfengarten 1B, 30167, Hannover, Germany.
- István Mikó
  - Don Chandler Entomological Collection, University of New Hampshire, Durham, NH, USA
- Thomas Bartolomaeus
  - Institut für Evolutionsbiologie und Ökologie, Universität Bonn, An der Immenburg 1, 53121, Bonn, Germany
3
Porto DS, Dahdul WM, Lapp H, Balhoff JP, Vision TJ, Mabee PM, Uyeda J. Assessing Bayesian Phylogenetic Information Content of Morphological Data Using Knowledge from Anatomy Ontologies. Syst Biol 2022; 71:1290-1306. [PMID: 35285502 PMCID: PMC9558846 DOI: 10.1093/sysbio/syac022] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 06/03/2021] [Revised: 02/09/2022] [Accepted: 03/05/2022] [Indexed: 11/18/2022]
Abstract
Morphology remains a primary source of phylogenetic information for many groups of organisms, and the only one for most fossil taxa. Organismal anatomy is not a collection of randomly assembled and independent “parts”, but instead a set of dependent and hierarchically nested entities resulting from ontogeny and phylogeny. How do we make sense of these dependent and at times redundant characters? One promising approach is using ontologies—structured controlled vocabularies that summarize knowledge about different properties of anatomical entities, including developmental and structural dependencies. Here, we assess whether evolutionary patterns can explain the proximity of ontology-annotated characters within an ontology. To do so, we measure phylogenetic information across characters and evaluate if it matches the hierarchical structure given by ontological knowledge—in much the same way as across-species diversity structure is given by phylogeny. We implement an approach to evaluate the Bayesian phylogenetic information (BPI) content and phylogenetic dissonance among ontology-annotated anatomical data subsets. We applied this to data sets representing two disparate animal groups: bees (Hexapoda: Hymenoptera: Apoidea, 209 chars) and characiform fishes (Actinopterygii: Ostariophysi: Characiformes, 463 chars). For bees, we find that BPI is not substantially explained by anatomy since dissonance is often high among morphologically related anatomical entities. For fishes, we find substantial information for two clusters of anatomical entities instantiating concepts from the jaws and branchial arch bones, but among-subset information decreases and dissonance increases substantially moving to higher-level subsets in the ontology. We further applied our approach to address particular evolutionary hypotheses with an example of morphological evolution in miniature fishes. 
While we show that phylogenetic information does match ontology structure for some anatomical entities, additional relationships and processes, such as convergence, likely play a substantial role in explaining BPI and dissonance, and merit future investigation. Our work demonstrates how complex morphological data sets can be interrogated with ontologies by allowing one to access how information is spread hierarchically across anatomical concepts, how congruent this information is, and what sorts of processes may play a role in explaining it: phylogeny, development, or convergence. [Apidae; Bayesian phylogenetic information; Ostariophysi; Phenoscape; phylogenetic dissonance; semantic similarity.]
Affiliation(s)
- Diego S Porto
  - Department of Biological Sciences, Virginia Polytechnic Institute and State University, 926 West Campus Drive, Blacksburg, VA 24061, USA
- Wasila M Dahdul
  - UCI Libraries, University of California, Irvine, Irvine, CA 92623, USA
  - Department of Biology, University of South Dakota, 414 East Clark Street, Vermillion, SD 57069, USA
- Hilmar Lapp
  - Center for Genomic and Computational Biology, Duke University, 101 Science Drive, Durham, NC 27708, USA
- James P Balhoff
  - Renaissance Computing Institute, University of North Carolina, 100 Europa Drive, Suite 540, Chapel Hill, NC 27517, USA
- Todd J Vision
  - Department of Biology and School of Information and Library Sciences, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Paula M Mabee
  - Department of Biology, University of South Dakota, 414 East Clark Street, Vermillion, SD 57069, USA
  - Battelle, National Ecological Observatory Network, Boulder, CO 80301, USA
- Josef Uyeda
  - Department of Biological Sciences, Virginia Polytechnic Institute and State University, 926 West Campus Drive, Blacksburg, VA 24061, USA
4
Vogt L. FAIR data representation in times of eScience: a comparison of instance-based and class-based semantic representations of empirical data using phenotype descriptions as example. J Biomed Semantics 2021; 12:20. [PMID: 34823588 PMCID: PMC8613519 DOI: 10.1186/s13326-021-00254-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 08/17/2021] [Accepted: 11/11/2021] [Indexed: 12/27/2022]
Abstract
BACKGROUND The size, velocity, and heterogeneity of Big Data outclasses conventional data management tools and requires data and metadata to be fully machine-actionable (i.e., eScience-compliant) and thus findable, accessible, interoperable, and reusable (FAIR). This can be achieved by using ontologies and through representing them as semantic graphs. Here, we discuss two different semantic graph approaches of representing empirical data and metadata in a knowledge graph, with phenotype descriptions as an example. Almost all phenotype descriptions are still being published as unstructured natural language texts, with far-reaching consequences for their FAIRness, substantially impeding their overall usability within the life sciences. However, with an increasing amount of anatomy ontologies becoming available and semantic applications emerging, a solution to this problem becomes available. Researchers are starting to document and communicate phenotype descriptions through the Web in the form of highly formalized and structured semantic graphs that use ontology terms and Uniform Resource Identifiers (URIs) to circumvent the problems connected with unstructured texts. RESULTS Using phenotype descriptions as an example, we compare and evaluate two basic representations of empirical data and their accompanying metadata in the form of semantic graphs: the class-based TBox semantic graph approach called Semantic Phenotype and the instance-based ABox semantic graph approach called Phenotype Knowledge Graph. Their main difference is that only the ABox approach allows for identifying every individual part and property mentioned in the description in a knowledge graph. This technical difference results in substantial practical consequences that significantly affect the overall usability of empirical data. 
The consequences affect findability, accessibility, and explorability of empirical data as well as their comparability, expandability, universal usability and reusability, and overall machine-actionability. Moreover, TBox semantic graphs often require querying under entailment regimes, which is computationally more complex. CONCLUSIONS We conclude that, from a conceptual point of view, the advantages of the instance-based ABox semantic graph approach outweigh its shortcomings and outweigh the advantages of the class-based TBox semantic graph approach. Therefore, we recommend the instance-based ABox approach as a FAIR approach for documenting and communicating empirical data and metadata in a knowledge graph.
Affiliation(s)
- Lars Vogt
  - TIB Leibniz Information Centre for Science and Technology, Welfengarten 1B, 30167, Hanover, Germany.
5
Lehtonen S. Phenotypic characters of static homology increase phylogenetic stability under direct optimization of otherwise dynamic homology characters. Cladistics 2021; 36:617-626. [PMID: 34618977 DOI: 10.1111/cla.12438] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 08/20/2020] [Indexed: 11/29/2022]
Abstract
Direct optimization of unaligned sequence characters provides a natural framework to explore the sensitivity of phylogenetic hypotheses to variation in analytical parameters. Phenotypic data, when combined into such analyses, are typically analyzed with static homology correspondences unlike the dynamic homology sequence data. Static homology characters may be expected to constrain the direct optimization and thus, potentially increase the similarity of phylogenetic hypotheses under different cost sets. However, whether a total-evidence approach increases the phylogenetic stability or not remains empirically largely unexplored. Here, I studied the impact of static homology data on sensitivity using six empirical data sets composed of several molecular markers and phenotypic data. The inclusion of static homology phenotypic data increased the average stability of phylogenetic hypothesis in five out of the six data sets. To investigate if any static homology characters would have similar effect, the analyses were repeated with randomized phenotypic data, and with one of the molecular markers fixed as static homology characters. These analyses had, on average, almost no effect on the phylogenetic stability, although the randomized phenotypic data sometimes resulted in even higher stability than empirical phenotypic data. The impact was related to the strength of the phylogenetic signal in the phenotypic data: higher average jackknife support of the phenotypic tree correlated with stronger stabilizing effect in the total-evidence analysis. Phenotypic data with a strong signal made the total-evidence trees topologically more similar to the phenotypic trees, thus, they constrained the dynamic homology correspondences of the sequence data. Characters that increase phylogenetic stability are particularly valuable for phylogenetic inference. 
These results indicate an important role and additive value of phenotypic data in increasing the stability of phylogenetic hypotheses in total-evidence analyses.
Affiliation(s)
- Samuli Lehtonen
  - Biodiversity Unit, University of Turku, Turku, FI-20014, Finland
6
7
Mabee PM, Balhoff JP, Dahdul WM, Lapp H, Mungall CJ, Vision TJ. A Logical Model of Homology for Comparative Biology. Syst Biol 2020; 69:345-362. [PMID: 31596473 PMCID: PMC7672696 DOI: 10.1093/sysbio/syz067] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 05/09/2019] [Revised: 09/20/2019] [Accepted: 09/26/2019] [Indexed: 01/09/2023]
Abstract
There is a growing body of research on the evolution of anatomy in a wide variety of organisms. Discoveries in this field could be greatly accelerated by computational methods and resources that enable these findings to be compared across different studies and different organisms and linked with the genes responsible for anatomical modifications. Homology is a key concept in comparative anatomy; two important types are historical homology (the similarity of organisms due to common ancestry) and serial homology (the similarity of repeated structures within an organism). We explored how to most effectively represent historical and serial homology across anatomical structures to facilitate computational reasoning. We assembled a collection of homology assertions from the literature with a set of taxon phenotypes for the skeletal elements of vertebrate fins and limbs from the Phenoscape Knowledgebase. Using seven competency questions, we evaluated the reasoning ramifications of two logical models: the Reciprocal Existential Axioms (REA) homology model and the Ancestral Value Axioms (AVA) homology model. The AVA model returned all user-expected results in addition to the search term and any of its subclasses. The AVA model also returns any superclass of the query term in which a homology relationship has been asserted. The REA model returned the user-expected results for five out of seven queries. We identify some challenges of implementing complete homology queries due to limitations of OWL reasoning. This work lays the foundation for homology reasoning to be incorporated into other ontology-based tools, such as those that enable synthetic supermatrix construction and candidate gene discovery. [Homology; ontology; anatomy; morphology; evolution; knowledgebase; phenoscape.].
Affiliation(s)
- Paula M Mabee
  - Department of Biology, University of South Dakota, 414 East Clark Street, Vermillion, SD 57069, USA
- James P Balhoff
  - Renaissance Computing Institute, University of North Carolina, 100 Europa Drive, Suite 540, Chapel Hill, NC 27517, USA
- Wasila M Dahdul
  - Department of Biology, University of South Dakota, 414 East Clark Street, Vermillion, SD 57069, USA
- Hilmar Lapp
  - Center for Genomic and Computational Biology, Duke University, 101 Science Drive, Durham, NC 27708, USA
- Christopher J Mungall
  - Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Todd J Vision
  - Department of Biology and School of Information and Library Sciences, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599-3280, USA
8
Tarasov S. Integration of Anatomy Ontologies and Evo-Devo Using Structured Markov Models Suggests a New Framework for Modeling Discrete Phenotypic Traits. Syst Biol 2019; 68:698-716. [PMID: 30668800 PMCID: PMC6701457 DOI: 10.1093/sysbio/syz005] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Received: 09/19/2017] [Revised: 01/06/2019] [Accepted: 01/15/2019] [Indexed: 11/12/2022]
Abstract
Modeling discrete phenotypic traits for either ancestral character state reconstruction or morphology-based phylogenetic inference suffers from ambiguities of character coding, homology assessment, dependencies, and selection of adequate models. These drawbacks occur because trait evolution is driven by two key processes, hierarchical and hidden, which are not accommodated simultaneously by the available phylogenetic methods. The hierarchical process refers to the dependencies between anatomical body parts, while the hidden process refers to the evolution of gene regulatory networks (GRNs) underlying trait development. Herein, I demonstrate that these processes can be efficiently modeled using structured Markov models (SMM) equipped with hidden states, which resolves the majority of the problems associated with discrete traits. Integration of SMM with anatomy ontologies can adequately incorporate the hierarchical dependencies, while the use of the hidden states accommodates hidden evolution of GRNs and substitution rate heterogeneity. I assess the new models using simulations and theoretical synthesis. The new approach solves the long-standing "tail color problem," in which the trait is scored for species with tails of different colors or no tails. It also presents a previously unknown issue called the "two-scientist paradox," in which the nature of coding the trait and the hidden processes driving the trait's evolution are confounded; failing to account for the hidden process may result in a bias, which can be avoided by using hidden state models. All this provides a clear guideline for coding traits into characters. This article gives practical examples of using the new framework for phylogenetic inference and comparative analysis.
Affiliation(s)
- Sergei Tarasov
  - National Institute for Mathematical and Biological Synthesis, University of Tennessee, Knoxville, TN 37996, USA
  - Department of Biological Sciences, Virginia Tech, 4076 Derring Hall, 926 West Campus Drive, Blacksburg, VA 24061, USA
9
Vogt L. Organizing phenotypic data-a semantic data model for anatomy. J Biomed Semantics 2019; 10:12. [PMID: 31221226 PMCID: PMC6585074 DOI: 10.1186/s13326-019-0204-6] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 01/30/2019] [Accepted: 06/05/2019] [Indexed: 12/24/2022]
Abstract
BACKGROUND Currently, almost all morphological data are published as unstructured free text descriptions. This not only brings about terminological problems regarding semantic transparency, which hampers their re-use by non-experts, but the data cannot be parsed by computers either, which in turn hampers their integration across many fields in the life sciences, including genomics, systems biology, development, medicine, evolution, ecology, and systematics. With an ever-increasing amount of available ontologies and the development of adequate semantic technology, however, a solution to this problem becomes available. Instead of free text descriptions, morphological data can be recorded, stored, and communicated through the Web in the form of highly formalized and structured directed graphs (semantic graphs) that use ontology terms and URIs as terminology. RESULTS After introducing an instance-based approach of recording morphological descriptions as semantic graphs (i.e., Semantic Instance Anatomy Knowledge Graphs) and discussing accompanying metadata graphs, I propose a general scheme of how to efficiently organize the resulting graphs in a tuple store framework based on instances of defined named graph ontology classes. The use of such named graph resources allows meaningful fragmentation of the data, which in turn enables subsequent specification of all kinds of data views for managing and accessing morphological data. CONCLUSIONS Morphological data that comply with the here proposed semantic data model will not only be computer-parsable but also re-usable by non-experts and could be better integrated with other sources of data in the life sciences. This would allow morphology as a discipline to further participate in eScience and Big Data.
Affiliation(s)
- Lars Vogt
  - Institut für Evolutionsbiologie und Ökologie, Rheinische Friedrich-Wilhelms-Universität Bonn, An der Immenburg 1, 53121, Bonn, Germany.
10
Burdíková N, Kjærandsen J, Lindemann JP, Kaspřák D, Tóthová A, Ševčík J. Molecular phylogeny of the Paleogene fungus gnat tribe Exechiini (Diptera: Mycetophilidae) revisited: Monophyly of genera established and rapid radiation confirmed. J Zool Syst Evol Res 2019. [DOI: 10.1111/jzs.12287] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 12/15/2022]
Affiliation(s)
- Nikola Burdíková
  - Department of Biology and Ecology, Faculty of Science, University of Ostrava, Ostrava, Czech Republic
- Jostein Kjærandsen
  - UiT The Arctic University of Norway, Tromsø University Museum, Tromsø, Norway
- David Kaspřák
  - Department of Biology and Ecology, Faculty of Science, University of Ostrava, Ostrava, Czech Republic
- Andrea Tóthová
  - Department of Botany and Zoology, Faculty of Science, Masaryk University, Brno, Czech Republic
- Jan Ševčík
  - Department of Biology and Ecology, Faculty of Science, University of Ostrava, Ostrava, Czech Republic
11
Vogt L. Levels and building blocks-toward a domain granularity framework for the life sciences. J Biomed Semantics 2019; 10:4. [PMID: 30691505 PMCID: PMC6348634 DOI: 10.1186/s13326-019-0196-2] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 09/24/2018] [Accepted: 01/14/2019] [Indexed: 11/26/2022]
Abstract
BACKGROUND With the emergence of high-throughput technologies, Big Data and eScience, the use of online data repositories and the establishment of new data standards that require data to be computer-parsable become increasingly important. As a consequence, there is an increasing need for an integrated system of hierarchies of levels of different types of material entities that helps with organizing, structuring and integrating data from disparate sources to facilitate data exploration, data comparison and analysis. Theories of granularity provide such integrated systems. RESULTS On the basis of formal approaches to theories of granularity authored by information scientists and ontology researchers, I discuss the shortcomings of some applications of the concept of levels and argue that the general theory of granularity proposed by Keet circumvents these problems. I introduce the concept of building blocks, which gives rise to a hierarchy of levels that can be formally characterized by Keet's theory. This hierarchy functions as an organizational backbone for integrating various other hierarchies that I briefly discuss, resulting in a domain granularity framework for the life sciences. I also discuss the consequences of this granularity framework for the structure of the top-level category of 'material entity' in Basic Formal Ontology. CONCLUSIONS The domain granularity framework suggested here is meant to provide the basis on which a more comprehensive information framework for the life sciences can be developed, which would provide the much needed conceptual framework for representing domains that cover multiple granularity levels. This framework can be used for intuitively structuring data in the life sciences, facilitating data exploration, and it can be employed for reasoning over different granularity levels across different hierarchies. 
It would provide a methodological basis for establishing comparability between data sets and for quantitatively measuring their degree of semantic similarity.
Affiliation(s)
- Lars Vogt
  - Rheinische Friedrich-Wilhelms-Universität Bonn, Institut für Evolutionsbiologie und Ökologie, An der Immenburg 1, 53121, Bonn, Germany.
12
Dahdul W, Manda P, Cui H, Balhoff JP, Dececchi TA, Ibrahim N, Lapp H, Vision T, Mabee PM. Annotation of phenotypes using ontologies: a gold standard for the training and evaluation of natural language processing systems. Database (Oxford) 2018; 2018:5255130. [PMID: 30576485 PMCID: PMC6301375 DOI: 10.1093/database/bay110] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 05/14/2018] [Revised: 08/22/2018] [Accepted: 09/24/2018] [Indexed: 11/12/2022]
Abstract
Natural language descriptions of organismal phenotypes, a principal object of study in biology, are abundant in the biological literature. Expressing these phenotypes as logical statements using ontologies would enable large-scale analysis on phenotypic information from diverse systems. However, considerable human effort is required to make these phenotype descriptions amenable to machine reasoning. Natural language processing tools have been developed to facilitate this task, and the training and evaluation of these tools depend on the availability of high quality, manually annotated gold standard data sets. We describe the development of an expert-curated gold standard data set of annotated phenotypes for evolutionary biology. The gold standard was developed for the curation of complex comparative phenotypes for the Phenoscape project. It was created by consensus among three curators and consists of entity-quality expressions of varying complexity. We use the gold standard to evaluate annotations created by human curators and those generated by the Semantic CharaParser tool. Using four annotation accuracy metrics that can account for any level of relationship between terms from two phenotype annotations, we found that machine-human consistency, or similarity, was significantly lower than inter-curator (human-human) consistency. Surprisingly, allowing curators access to external information did not significantly increase the similarity of their annotations to the gold standard or have a significant effect on inter-curator consistency. We found that the similarity of machine annotations to the gold standard increased after new relevant ontology terms had been added. Evaluation by the original authors of the character descriptions indicated that the gold standard annotations came closer to representing their intended meaning than did either the curator or machine annotations. 
These findings point toward ways to better design software to augment human curators and the use of the gold standard corpus will allow training and assessment of new tools to improve phenotype annotation accuracy at scale.
Affiliation(s)
- Prashanti Manda
  - University of North Carolina at Greensboro, Greensboro, NC, USA
- Hong Cui
  - University of Arizona, Tucson, AZ, USA
- James P Balhoff
  - University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- T Alexander Dececchi
  - University of South Dakota, Vermillion, SD, USA
  - Current affiliation: University of Pittsburgh at Johnstown, Johnstown, PA, USA
- Nizar Ibrahim
  - University of Chicago, Chicago, IL, USA
  - Current affiliation: University of Detroit Mercy, Detroit, MI, USA & University of Portsmouth, Portsmouth, UK
- Todd Vision
  - University of North Carolina at Chapel Hill, Chapel Hill, NC, USA