1
Mora-Cross M, Morales-Carmiol A, Chen-Huang T, Barquero-Pérez M. Essential Biodiversity Variables: extracting plant phenological data from specimen labels using machine learning. Research Ideas and Outcomes 2022. [DOI: 10.3897/rio.8.e86012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/12/2022]
Abstract
Essential Biodiversity Variables (EBVs) make it possible to evaluate and monitor the state of biodiversity over time at different spatial scales. Their development is led by the Group on Earth Observations Biodiversity Observation Network (GEO BON) to harmonize, consolidate and standardize biodiversity data from varied biodiversity sources. This document presents a mechanism to obtain baseline data to feed the Species Traits Variable Phenology or other biodiversity indicators by extracting species characters and structure names from morphological descriptions of specimens and classifying such descriptions using machine learning (ML).
A workflow that performs Named Entity Recognition (NER) and Classification of morphological descriptions using ML algorithms was evaluated with excellent results. It was implemented using Python, PyTorch, scikit-learn, Pomegranate, python-crfsuite, and other libraries applied to 106,804 herbarium records from the National Biodiversity Institute of Costa Rica (INBio). The text classification results were near-excellent (F1 score between 96% and 99%) using three traditional ML methods: Multinomial Naive Bayes (NB), Linear Support Vector Classification (SVC), and Logistic Regression (LR). Furthermore, results extracting names of species morphological structures (e.g., leaves, trichomes, flowers, petals, sepals) and character names (e.g., length, width, pigmentation patterns, and smell) using NER algorithms were competitive (F1 score between 95% and 98%) using Hidden Markov Models (HMM), Conditional Random Fields (CRFs), and Bidirectional Long Short-Term Memory Networks with CRF (BI-LSTM-CRF).
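As a rough illustration of the text classification step named in the abstract, the following is a minimal pure-Python sketch of a multinomial Naive Bayes classifier with Laplace smoothing. It is not the authors' implementation (which the abstract says used scikit-learn and other libraries); the training phrases, the labels `flowering` and `fruiting`, and all function names are invented for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (deliberately simplistic)."""
    return text.lower().split()


def train_nb(samples, alpha=1.0):
    """Fit a multinomial Naive Bayes model on (text, label) pairs."""
    class_docs = Counter()              # number of documents per class
    word_counts = defaultdict(Counter)  # token counts per class
    vocab = set()
    for text, label in samples:
        class_docs[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return {"class_docs": class_docs, "word_counts": word_counts,
            "vocab": vocab, "alpha": alpha}


def predict_nb(model, text):
    """Return the class with the highest log posterior for the text."""
    total_docs = sum(model["class_docs"].values())
    vocab_size = len(model["vocab"])
    alpha = model["alpha"]
    best_label, best_score = None, -math.inf
    for label, n_docs in model["class_docs"].items():
        counts = model["word_counts"][label]
        total_tokens = sum(counts.values())
        score = math.log(n_docs / total_docs)  # class prior
        for tok in tokenize(text):
            if tok in model["vocab"]:          # ignore unseen tokens
                score += math.log((counts[tok] + alpha)
                                  / (total_tokens + alpha * vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented label text, standing in for specimen descriptions.
training = [
    ("flowers white fragrant", "flowering"),
    ("corolla yellow petals 5", "flowering"),
    ("fruits green immature", "fruiting"),
    ("fruits red berries ripe", "fruiting"),
]
model = train_nb(training)
```

With this toy model, `predict_nb(model, "petals white")` selects `flowering` and `predict_nb(model, "ripe fruits")` selects `fruiting`. Real label text would need proper tokenization and a held-out set to estimate F1.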
2
Vogt L. FAIR data representation in times of eScience: a comparison of instance-based and class-based semantic representations of empirical data using phenotype descriptions as example. J Biomed Semantics 2021; 12:20. [PMID: 34823588 PMCID: PMC8613519 DOI: 10.1186/s13326-021-00254-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 08/17/2021] [Accepted: 11/11/2021] [Indexed: 12/27/2022]
Abstract
BACKGROUND The size, velocity, and heterogeneity of Big Data outclass conventional data management tools and require data and metadata to be fully machine-actionable (i.e., eScience-compliant) and thus findable, accessible, interoperable, and reusable (FAIR). This can be achieved by using ontologies and through representing them as semantic graphs. Here, we discuss two different semantic graph approaches to representing empirical data and metadata in a knowledge graph, with phenotype descriptions as an example. Almost all phenotype descriptions are still being published as unstructured natural language texts, with far-reaching consequences for their FAIRness, substantially impeding their overall usability within the life sciences. However, with an increasing number of anatomy ontologies becoming available and semantic applications emerging, a solution to this problem becomes available. Researchers are starting to document and communicate phenotype descriptions through the Web in the form of highly formalized and structured semantic graphs that use ontology terms and Uniform Resource Identifiers (URIs) to circumvent the problems connected with unstructured texts. RESULTS Using phenotype descriptions as an example, we compare and evaluate two basic representations of empirical data and their accompanying metadata in the form of semantic graphs: the class-based TBox semantic graph approach called Semantic Phenotype and the instance-based ABox semantic graph approach called Phenotype Knowledge Graph. Their main difference is that only the ABox approach allows for identifying every individual part and property mentioned in the description in a knowledge graph. This technical difference results in substantial practical consequences that significantly affect the overall usability of empirical data. 
The consequences affect findability, accessibility, and explorability of empirical data as well as their comparability, expandability, universal usability and reusability, and overall machine-actionability. Moreover, TBox semantic graphs often require querying under entailment regimes, which is computationally more complex. CONCLUSIONS We conclude that, from a conceptual point of view, the advantages of the instance-based ABox semantic graph approach outweigh its shortcomings and outweigh the advantages of the class-based TBox semantic graph approach. Therefore, we recommend the instance-based ABox approach as a FAIR approach for documenting and communicating empirical data and metadata in a knowledge graph.
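The difference between the two representations can be pictured with a toy sketch. It assumes a simplified triple model written as Python tuples; every identifier (`ex:specimen_1`, `ex:Leaf`, `ex:has_part`, and so on) is a made-up placeholder, and the nested tuple is only a stand-in for an ontology class expression, not real OWL syntax.

```python
# Toy triples; every identifier below (ex:...) is a made-up placeholder.

# Instance-based (ABox): each part and each quality is its own individual.
abox = [
    ("ex:specimen_1", "ex:has_part", "ex:leaf_1"),
    ("ex:leaf_1", "rdf:type", "ex:Leaf"),
    ("ex:leaf_1", "ex:has_quality", "ex:green_1"),
    ("ex:green_1", "rdf:type", "ex:Green"),
]

# Class-based (TBox): the specimen is typed with one nested class expression;
# the leaf and its colour have no identifiers of their own.
tbox = [
    ("ex:specimen_1", "rdf:type",
     ("ex:has_part", "some",
      ("ex:Leaf", "and", ("ex:has_quality", "some", "ex:Green")))),
]


def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


# A plain pattern match finds the individual leaf only in the ABox graph.
leaves_in_abox = match(abox, p="rdf:type", o="ex:Leaf")
leaves_in_tbox = match(tbox, p="rdf:type", o="ex:Leaf")
```

In the ABox list the leaf can be retrieved and annotated directly through `ex:leaf_1`. In the TBox list the same query returns nothing, because recovering the leaf would require interpreting the class expression, loosely analogous to the reasoning step the abstract mentions.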
Affiliation(s)
- Lars Vogt
- TIB Leibniz Information Centre for Science and Technology, Welfengarten 1B, 30167, Hanover, Germany.
3
Folk RA, Siniscalchi CM. Biodiversity at the global scale: the synthesis continues. American Journal of Botany 2021; 108:912-924. [PMID: 34181762 DOI: 10.1002/ajb2.1694] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 02/04/2021] [Accepted: 04/14/2021] [Indexed: 06/13/2023]
Abstract
Traditionally, the generation and use of biodiversity data and their associated specimen objects have been primarily the purview of individuals and small research groups. While deposition of data and specimens in herbaria and other repositories has long been the norm, throughout most of their history, these resources have been accessible only to a small community of specialists. Through recent concerted efforts, primarily at the level of national and international governmental agencies over the last two decades, the pace of biodiversity data accumulation has accelerated, and a wider array of biodiversity scientists has gained access to this massive accumulation of resources, applying them to an ever-widening compass of research pursuits. We review how these new resources and increasing access to them are affecting the landscape of biodiversity research in plants today, focusing on new applications across evolution, ecology, and other fields that have been enabled specifically by the availability of these data and the global scope that was previously beyond the reach of individual investigators. We give an overview of recent advances organized along three lines: broad-scale analyses of distributional data and spatial information, phylogenetic research circumscribing large clades with comprehensive taxon sampling, and data sets derived from improved accessibility of biodiversity literature. We also review synergies between large data resources and more traditional data collection paradigms, describe shortfalls and how to overcome them, and reflect on the future of plant biodiversity analyses in light of increasing linkages between data types and scientists in our field.
Affiliation(s)
- Ryan A Folk
- Department of Biological Sciences, Mississippi State University, Mississippi State, Mississippi, USA
- Carolina M Siniscalchi
- Department of Biological Sciences, Mississippi State University, Mississippi State, Mississippi, USA
4
Eliason CM, Edwards SV, Clarke JA. phenotools: An R package for visualizing and analysing phenomic datasets. Methods Ecol Evol 2019. [DOI: 10.1111/2041-210x.13217] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 11/28/2022]
Affiliation(s)
- Chad M. Eliason
- Department of Geological Sciences, University of Texas Austin, Austin, Texas
- Grainger Bioinformatics Center, Field Museum of Natural History, Chicago, Illinois
- Scott V. Edwards
- Department of Organismic and Evolutionary Biology and Museum of Comparative Zoology, Harvard University, Cambridge, Massachusetts
- Julia A. Clarke
- Department of Geological Sciences, University of Texas Austin, Austin, Texas
5
Endara L, Thessen AE, Cole HA, Walls R, Gkoutos G, Cao Y, Chong SS, Cui H. Modifier Ontologies for frequency, certainty, degree, and coverage phenotype modifier. Biodivers Data J 2018; 6:e29232. [PMID: 30532623 PMCID: PMC6281706 DOI: 10.3897/bdj.6.e29232] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/21/2018] [Accepted: 11/20/2018] [Indexed: 11/21/2022]
Abstract
Background: When phenotypic characters are described in the literature, they may be constrained or clarified with additional information such as the location or degree of expression; these terms are called "modifiers". With efforts underway to convert narrative character descriptions to computable data, ontologies for such modifiers are needed. Such ontologies can also be used to guide term usage in future publications. Spatial and method modifiers are the subjects of ontologies that already have been developed or are under development. In this work, frequency (e.g., rarely, usually), certainty (e.g., probably, definitely), degree (e.g., slightly, extremely), and coverage modifiers (e.g., sparsely, entirely) are collected, reviewed, and used to create two modifier ontologies with different design considerations. The basic goal is to express the sequential relationships within a type of modifier, for example, usually is more frequent than rarely, in order to allow data annotated with ontology terms to be classified accordingly. Method: Two designs are proposed for the ontology, both using the list pattern: a closed ordered list (i.e., five-bin design) and an open ordered list design. The five-bin design puts the modifier terms into a set of 5 fixed bins with interval object properties, for example, one_level_more/less_frequently_than, where new terms can only be added as synonyms to existing classes. The open list approach starts with 5 bins, but supports the extensibility of the list via ordinal properties, for example, more/less_frequently_than, allowing new terms to be inserted as a new class anywhere in the list. The consequences of the different design decisions are discussed in the paper. CharaParser was used to extract modifiers from plant, ant, and other taxonomic descriptions. After a manual screening, 130 modifier words were selected as the candidate terms for the modifier ontologies. 
Four curators/experts (three biologists and one information scientist specialized in biosemantics) reviewed and categorized the terms into 20 bins using the Ontology Term Organizer (OTO) (http://biosemantics.arizona.edu/OTO). Inter-curator variations were reviewed and expressed in the final ontologies. Results: Frequency, certainty, degree, and coverage terms with complete agreement among all curators were used as class labels or exact synonyms. Terms with different interpretations were either excluded or included using "broader synonym" or "not recommended" annotation properties. These annotations explicitly allow for the user to be aware of the semantic ambiguity associated with the terms and whether they should be used with caution or avoided. Expert categorization results showed that 16 out of 20 bins contained terms with full agreements, suggesting differentiating the modifiers into 5 levels/bins balances the need to differentiate modifiers and the need for the ontology to reflect user consensus. Two ontologies, developed using the Protege ontology editor, are made available as OWL files and can be downloaded from https://github.com/biosemantics/ontologies. Contribution: We built the first two modifier ontologies following a consensus-based approach with terms commonly used in taxonomic literature. The five-bin ontology has been used in the Explorer of Taxon Concepts web toolkit to compute the similarity between characters extracted from literature to facilitate taxon concepts alignments. The two ontologies will also be used in an ontology-informed authoring tool for taxonomists to facilitate consistency in modifier term usage.
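The two list designs described in the abstract can be sketched as plain data structures. This is only an analogy in Python, not the published OWL ontologies: the class and method names are invented, and apart from "rarely" and "usually" (which appear in the abstract) the term labels are illustrative placeholders rather than the actual bin labels.

```python
class FiveBinScale:
    """Closed ordered list: five fixed bins; new terms can only be synonyms."""

    def __init__(self, labels):
        if len(labels) != 5:
            raise ValueError("the closed design uses exactly five bins")
        self.bins = [[label] for label in labels]

    def rank(self, term):
        """Index of the bin that holds the term (0 = lowest)."""
        for i, members in enumerate(self.bins):
            if term in members:
                return i
        raise KeyError(term)

    def add_synonym(self, term, existing):
        """Attach a new term to the bin of an existing term."""
        self.bins[self.rank(existing)].append(term)

    def levels_apart(self, a, b):
        """Interval comparison: how many bins a sits above b."""
        return self.rank(a) - self.rank(b)


class OpenOrderedScale:
    """Open ordered list: new terms may be inserted anywhere as new classes."""

    def __init__(self, labels):
        self.terms = list(labels)

    def insert_after(self, new_term, existing):
        self.terms.insert(self.terms.index(existing) + 1, new_term)

    def more_than(self, a, b):
        """Ordinal comparison only; distances are not meaningful here."""
        return self.terms.index(a) > self.terms.index(b)


# Placeholder labels for a frequency scale.
labels = ["never", "rarely", "sometimes", "usually", "always"]

closed = FiveBinScale(labels)
closed.add_synonym("seldom", "rarely")         # joins an existing bin

open_scale = OpenOrderedScale(labels)
open_scale.insert_after("often", "sometimes")  # becomes a sixth class
```

The closed scale keeps five bins no matter how many terms are added, so a fixed interval such as `closed.levels_apart("usually", "rarely")` stays meaningful. The open scale grows with each insertion and therefore supports only ordering comparisons such as `open_scale.more_than("usually", "often")`.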
Affiliation(s)
- Lorena Endara
- University of Florida, Gainesville, United States of America
- Anne E Thessen
- The Ronin Institute for Independent Scholarship, Montclair, NJ, United States of America
- Heather A Cole
- Science and Technology Branch, Agriculture and Agri-Food Canada, Government of Canada, Ottawa, Canada
- Ramona Walls
- CyVerse, Tucson, United States of America
- Georgios Gkoutos
- College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, Centre for Computational Biology, University of Birmingham, Birmingham, United Kingdom
- Institute of Translational Medicine, University Hospitals Birmingham NHS Foundation Trust, B15 2TT, Birmingham, United Kingdom
- Yujie Cao
- Center for Studies of Information Resources, Wuhan University, Wuhan, China
- Steven S. Chong
- National Center for Ecological Analysis and Synthesis, University of California, Santa Barbara, Santa Barbara, United States of America
- University of Arizona, Tucson, United States of America
- Hong Cui
- University of Arizona, Tucson, United States of America
6
Page R. Liberating links between datasets using lightweight data publishing: an example using plant names and the taxonomic literature. Biodivers Data J 2018:e27539. [PMID: 30065607 PMCID: PMC6066477 DOI: 10.3897/bdj.6.e27539] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 06/15/2018] [Accepted: 07/11/2018] [Indexed: 11/12/2022]
Abstract
Constructing a biodiversity knowledge graph will require making millions of cross links between diversity entities in different datasets. Researchers trying to bootstrap the growth of the biodiversity knowledge graph by constructing databases of links between these entities lack obvious ways to publish these sets of links. One appealing and lightweight approach is to create a "datasette", a database that is wrapped together with a simple web server that enables users to query the data. Datasettes can be packaged into Docker containers and hosted online with minimal effort. This approach is illustrated using a dataset of links between globally unique identifiers for plant taxonomic names and identifiers for the taxonomic articles that published those names.
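A link dataset of the kind described can be held in a single small database table. The sketch below uses Python's standard `sqlite3` module; the table name `links`, the column names, and the identifier strings are assumptions made for illustration and do not reproduce the published dataset or its schema.

```python
import sqlite3


def build_links_db(path=":memory:"):
    """Create a tiny name-to-article link table and return the connection."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE links ("
        " name_id TEXT NOT NULL,"
        " article_id TEXT NOT NULL,"
        " PRIMARY KEY (name_id, article_id))"
    )
    # Placeholder identifiers; real rows would pair globally unique name
    # identifiers with identifiers for the publishing articles.
    rows = [
        ("name:example-1", "article:example-A"),
        ("name:example-1", "article:example-B"),
        ("name:example-2", "article:example-A"),
    ]
    conn.executemany("INSERT INTO links VALUES (?, ?)", rows)
    conn.commit()
    return conn


def articles_for_name(conn, name_id):
    """Return the article identifiers linked to one name, sorted."""
    cur = conn.execute(
        "SELECT article_id FROM links WHERE name_id = ? ORDER BY article_id",
        (name_id,),
    )
    return [row[0] for row in cur]


conn = build_links_db()
```

Passing a file path instead of `":memory:"` would produce a standalone database file, which is the kind of artifact a lightweight web server can wrap and expose for querying.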
Affiliation(s)
- Roderic Page
- University of Glasgow, Glasgow, United Kingdom
7
Mora MA, Araya JE. Semi-automatic Extraction of Plants Morphological Characters from Taxonomic Descriptions Written in Spanish. Biodivers Data J 2018; 6:e21282. [PMID: 29991903 PMCID: PMC6030177 DOI: 10.3897/bdj.6.e21282] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 09/28/2017] [Accepted: 06/11/2018] [Indexed: 12/02/2022]
Abstract
Taxonomic literature keeps records of the planet's biodiversity and gives access to the knowledge needed for its sustainable management. Unfortunately, most of the taxonomic information is available in scientific publications in text format. The number of publications generated is very large; processing them to obtain highly structured texts would therefore be complex and very expensive. Approaches like citizen science may help the process by selecting whole fragments of texts dealing with morphological descriptions; but a deeper analysis, compatible with accepted ontologies, will require specialised tools. The Biodiversity Heritage Library (BHL) estimates that there are more than 120 million pages published in over 5.4 million books since 1469, plus about 800,000 monographs and 40,000 journal titles (12,500 of these are current titles). It is necessary to develop standards and software tools to extract, integrate and publish this information into existing free and open access repositories of biodiversity knowledge to support science, education and biodiversity conservation. This document presents an algorithm based on computational linguistics techniques to extract structured information from morphological descriptions of plants written in Spanish. The developed algorithm is based on the work of Dr. Hong Cui from the University of Arizona; it uses semantic analysis, ontologies and a repository of knowledge acquired from the same descriptions. The algorithm was applied to the books Trees of Costa Rica Volume III (TCRv3), Trees of Costa Rica Volume IV (TCRv4) and to a subset of descriptions of the Manual of Plants of Costa Rica (MPCR) with very competitive results (more than 92.5% of average performance). The system receives the morphological descriptions in tabular format and generates XML documents. The XML schema allows documenting structures, characters and relations between characters and structures. 
Each extracted object is associated with attributes such as name, value, modifiers, restrictions and ontology term id, amongst others. The implemented tool is free software. It was developed using Java and integrates existing technologies such as FreeLing, the Plant Ontology (PO), the Plant Glossary, the Ontology Term Organizer (OTO) and the Flora Mesoamericana English-Spanish Glossary.
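To make the idea of an XML output with structures, characters and their attributes concrete, here is a small sketch using Python's standard `xml.etree.ElementTree`. The described tool is written in Java and its real schema is not given in the abstract, so the element names (`description`, `structure`, `character`), the attribute names, and the sample values are all hypothetical.

```python
import xml.etree.ElementTree as ET


def description_to_xml(taxon, structures):
    """Serialize extracted structures and characters to an XML string.

    The element and attribute names are illustrative, not the tool's schema.
    """
    root = ET.Element("description", {"taxon": taxon})
    for structure in structures:
        s_el = ET.SubElement(root, "structure", {"name": structure["name"]})
        for char in structure["characters"]:
            attrs = {"name": char["name"], "value": char["value"]}
            if char.get("modifier"):          # optional attribute
                attrs["modifier"] = char["modifier"]
            ET.SubElement(s_el, "character", attrs)
    return ET.tostring(root, encoding="unicode")


# Invented example input, standing in for one row of extracted data.
xml_text = description_to_xml(
    "Example taxon",
    [{
        "name": "leaf",
        "characters": [
            {"name": "shape", "value": "elliptic"},
            {"name": "length", "value": "10-15 cm", "modifier": "usually"},
        ],
    }],
)
```

Nesting each `character` under its `structure` is one simple way to record which character belongs to which structure; a real schema might instead use identifiers and explicit relation elements.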
8
Vogt L. Towards a semantic approach to numerical tree inference in phylogenetics. Cladistics 2018; 34:200-224. [PMID: 34645075 DOI: 10.1111/cla.12195] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Accepted: 02/03/2017] [Indexed: 12/24/2022]
Abstract
Conventional approaches to phylogeny reconstruction require a character analysis step prior to and methodologically separated from a numerical tree inference step. The former results in a character matrix that contains the empirical data analysed in the latter. This separation of steps involves various methodological and conceptual problems (e.g. homology assessment independent of tree inference and character optimization, character dependencies, discounting of alternative homology hypotheses). In morphology, the character analysis step covers the stages of morphological comparative studies, homology assessment and the identification and coding of morphological characters. Unfortunately, only the last stage requires some formalism, whereas the preceding stages are commonly regarded to be pre-rational and intuitive, which is why their reproducibility and analytical accessibility are limited. Here, I introduce a rationale for a semantic approach to numerical tree inference that uses sets of semantic instance anatomies as data source instead of character matrices, thereby avoiding the above-mentioned problems. A semantic instance anatomy is an ontology-based description of the anatomical organization of a specimen in the form of a semantic graph. The semantic approach to numerical tree inference combines and integrates the steps of character analysis and numerical tree inference and makes both analytically accessible and communicable. Before outlining first steps for a research programme dedicated to the semantic approach to numerical tree inference, I discuss in detail the methodological, conceptual, and computational challenges and requirements that first have to be dealt with before adequate algorithms can be developed.
Affiliation(s)
- Lars Vogt
- Institut für Evolutionsbiologie und Ökologie, Universität Bonn, An der Immenburg 1, Bonn, D-53121, Germany
9
Endara L, Cui H, Burleigh JG. Extraction of phenotypic traits from taxonomic descriptions for the tree of life using natural language processing. Applications in Plant Sciences 2018; 6:e1035. [PMID: 29732265 PMCID: PMC5895189 DOI: 10.1002/aps3.1035] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 10/05/2017] [Accepted: 01/31/2018] [Indexed: 05/09/2023]
Abstract
PREMISE OF THE STUDY Phenotypic data sets are necessary to elucidate the genealogy of life, but assembling phenotypic data for taxa across the tree of life can be technically challenging and prohibitively time consuming. We describe a semi-automated protocol to facilitate and expedite the assembly of phenotypic character matrices of plants from formal taxonomic descriptions. This pipeline uses new natural language processing (NLP) techniques and a glossary of over 9000 botanical terms. METHODS AND RESULTS Our protocol includes the Explorer of Taxon Concepts (ETC), an online application that assembles taxon-by-character matrices from taxonomic descriptions, and MatrixConverter, a Java application that enables users to evaluate and discretize the characters extracted by ETC. We demonstrate this protocol using descriptions from Araucariaceae. CONCLUSIONS The NLP pipeline unlocks the phenotypic data found in taxonomic descriptions and makes them usable for evolutionary analyses.
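The two stages the abstract describes, assembling a taxon-by-character matrix and then discretizing characters, can be sketched in a few lines of Python. This does not reproduce ETC or MatrixConverter; the function names, the cutoff-based discretization rule, the taxa and the values are all invented for illustration.

```python
from bisect import bisect_right


def build_matrix(records):
    """Assemble {taxon: {character: value}} from (taxon, character, value)."""
    matrix = {}
    for taxon, character, value in records:
        matrix.setdefault(taxon, {})[character] = value
    return matrix


def discretize(matrix, character, cutoffs):
    """Turn a numeric character into discrete states.

    The state is the number of cutoffs the value meets or exceeds, so one
    cutoff gives states "0" and "1". Taxa lacking the character get "?".
    """
    states = {}
    for taxon, chars in matrix.items():
        value = chars.get(character)
        states[taxon] = "?" if value is None else str(bisect_right(cutoffs, value))
    return states


# Invented records, standing in for characters extracted from descriptions.
records = [
    ("Taxon A", "leaf length (cm)", 2.0),
    ("Taxon B", "leaf length (cm)", 7.5),
    ("Taxon C", "leaf shape", "ovate"),
]
matrix = build_matrix(records)
states = discretize(matrix, "leaf length (cm)", cutoffs=[5.0])
```

Here `states` assigns "0" to Taxon A, "1" to Taxon B, and "?" to Taxon C, which has no recorded leaf length. Choosing the cutoffs is a judgment the user has to make when evaluating the extracted characters.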
Affiliation(s)
- Lorena Endara
- Department of Biology, University of Florida, Gainesville, Florida 32611, USA
- Hong Cui
- School of Information, University of Arizona, Tucson, Arizona 85719, USA
10
11
Hao T, Zhu C, Mu Y, Liu G. A user-oriented semantic annotation approach to knowledge acquisition and conversion. J Inf Sci 2017. [DOI: 10.1177/0165551516642688] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/16/2022]
Abstract
Semantic annotation on natural language texts labels the meaning of an annotated element in specific contexts, and thus is an essential procedure for domain knowledge acquisition. An extensible and coherent annotation method is crucial for knowledge engineers to reduce human efforts to keep annotations consistent. This article proposes a comprehensive semantic annotation approach supported by a user-oriented markup language named UOML to enhance annotation efficiency with the aim of building a high quality knowledge base. UOML is operable by human annotators and convertible to formal knowledge representation languages. A pattern-based annotation conversion method named PAC is further proposed for knowledge exchange by utilizing automatic pattern learning. We designed and implemented a semantic annotation platform Annotation Assistant to test the effectiveness of the approach. By applying this platform in a long-term international research project for more than three years aiming at high quality knowledge acquisition from a classical Chinese poetry corpus containing 52,621 Chinese characters, we effectively acquired 150,624 qualified annotations. Our test shows that the approach has improved operational efficiency by 56.8%, on average, compared with text-based manual annotation. By using UOML, PAC achieved a conversion error ratio of 0.2% on average, significantly improving the annotation consistency compared with baseline annotations. The results indicate the approach is feasible for practical use in knowledge acquisition and conversion.
Affiliation(s)
- Tianyong Hao
- School of Informatics, Guangdong University of Foreign Studies, China
- Chunshen Zhu
- Department of Chinese and History, City University of Hong Kong, Hong Kong
- Yuanyuan Mu
- Center for Corpus-based Translation Studies, Hefei University of Technology, China
- Gang Liu
- Department of Computer Science, City University of Hong Kong, Hong Kong
12
Mao J, Moore LR, Blank CE, Wu EHH, Ackerman M, Ranade S, Cui H. Microbial phenomics information extractor (MicroPIE): a natural language processing tool for the automated acquisition of prokaryotic phenotypic characters from text sources. BMC Bioinformatics 2016; 17:528. [PMID: 27955641 PMCID: PMC5153691 DOI: 10.1186/s12859-016-1396-8] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Received: 05/05/2016] [Accepted: 11/29/2016] [Indexed: 12/24/2022]
Abstract
BACKGROUND The large-scale analysis of phenomic data (i.e., full phenotypic traits of an organism, such as shape, metabolic substrates, and growth conditions) in microbial bioinformatics has been hampered by the lack of tools to rapidly and accurately extract phenotypic data from existing legacy text in the field of microbiology. To quickly obtain knowledge on the distribution and evolution of microbial traits, an information extraction system needed to be developed to extract phenotypic characters from large numbers of taxonomic descriptions so they can be used as input to existing phylogenetic analysis software packages. RESULTS We report the development and evaluation of Microbial Phenomics Information Extractor (MicroPIE, version 0.1.0). MicroPIE is a natural language processing application that uses a robust supervised classification algorithm (Support Vector Machine) to identify characters from sentences in prokaryotic taxonomic descriptions, followed by a combination of algorithms applying linguistic rules with groups of known terms to extract characters as well as character states. The input to MicroPIE is a set of taxonomic descriptions (clean text). The output is a taxon-by-character matrix, with taxa in the rows and a set of 42 pre-defined characters (e.g., optimum growth temperature) in the columns. The performance of MicroPIE was evaluated against a gold standard matrix and another student-made matrix. Results show that, compared to the gold standard, MicroPIE extracted 21 characters (50%) with a Relaxed F1 score > 0.80 and 16 characters (38%) with Relaxed F1 scores ranging between 0.50 and 0.80. Inclusion of a character prediction component (SVM) improved the overall performance of MicroPIE, notably the precision. Evaluated against the same gold standard, MicroPIE performed significantly better than the undergraduate students. 
CONCLUSION MicroPIE is a promising new tool for the rapid and efficient extraction of phenotypic character information from prokaryotic taxonomic descriptions. However, further development, including incorporation of ontologies, will be necessary to improve the performance of the extraction for some character types.
Affiliation(s)
- Jin Mao
- School of Information, University of Arizona, Tucson, 85721 AZ USA
- Lisa R. Moore
- Department of Biological Sciences, University of Southern Maine, Portland, 04103 ME USA
- Carrine E. Blank
- Department of Geosciences, University of Montana, Missoula, 59812 MT USA
- Marcia Ackerman
- Department of Biological Sciences, University of Southern Maine, Portland, 04103 ME USA
- Sonali Ranade
- School of Information, University of Arizona, Tucson, 85721 AZ USA
- Hong Cui
- School of Information, University of Arizona, Tucson, 85721 AZ USA
13
Dietrich CH, Dmitriev DA. Insect phylogenetics in the digital age. Current Opinion in Insect Science 2016; 18:48-52. [PMID: 27939710 DOI: 10.1016/j.cois.2016.09.008] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 06/15/2016] [Accepted: 09/21/2016] [Indexed: 06/06/2023]
Abstract
Insect systematists have long used digital data management tools to facilitate phylogenetic research. Web-based platforms developed over the past several years support creation of comprehensive, openly accessible data repositories and analytical tools that support large-scale collaboration, accelerating efforts to document Earth's biota and reconstruct the Tree of Life. New digital tools have the potential to further enhance insect phylogenetics by providing efficient workflows for capturing and analyzing phylogenetically relevant data. Recent initiatives streamline various steps in phylogenetic studies and provide community access to supercomputing resources. In the near future, automated, web-based systems will enable researchers to complete a phylogenetic study from start to finish using resources linked together within a single portal and incorporate results into a global synthesis.
Affiliation(s)
- Christopher H Dietrich
- Illinois Natural History Survey, Prairie Research Institute, University of Illinois, 1816 S Oak St., Champaign, IL 61820, USA.
- Dmitry A Dmitriev
- Illinois Natural History Survey, Prairie Research Institute, University of Illinois, 1816 S Oak St., Champaign, IL 61820, USA
14
Cui H, Xu D, Chong SS, Ramirez M, Rodenhausen T, Macklin JA, Ludäscher B, Morris RA, Soto EM, Koch NM. Introducing Explorer of Taxon Concepts with a case study on spider measurement matrix building. BMC Bioinformatics 2016; 17:471. [PMID: 27855645 PMCID: PMC5114841 DOI: 10.1186/s12859-016-1352-7] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 04/01/2016] [Accepted: 11/11/2016] [Indexed: 11/10/2022]
Abstract
BACKGROUND Taxonomic descriptions are traditionally composed in natural language and published in a format that cannot be directly used by computers. The Exploring Taxon Concepts (ETC) project has been developing a set of web-based software tools that convert morphological descriptions published in telegraphic style to character data that can be reused and repurposed. This paper introduces the first semi-automated pipeline, to our knowledge, that converts morphological descriptions into taxon-character matrices to support systematics and evolutionary biology research. We then demonstrate and evaluate the use of the ETC Input Creation - Text Capture - Matrix Generation pipeline to generate body part measurement matrices from a set of 188 spider morphological descriptions and report the findings. RESULTS From the given set of spider taxonomic publications, two versions of input (original and normalized) were generated and used by the ETC Text Capture and ETC Matrix Generation tools. The tools produced two corresponding spider body part measurement matrices, and the matrix from the normalized input was found to be much more similar to a gold standard matrix hand-curated by the scientist co-authors. The lower performance with the original input was attributed to special conventions used in the original descriptions (e.g., the omission of measurement units). The results show that simple normalization of the description text greatly increased the quality of the machine-generated matrix and reduced edit effort. The machine-generated matrix also helped identify issues in the gold standard matrix. CONCLUSIONS ETC Text Capture and ETC Matrix Generation are low-barrier and effective tools for extracting measurement values from spider taxonomic descriptions and are more effective when the descriptions are self-contained. 
Special conventions that make the description text less self-contained challenge automated extraction of data from biodiversity descriptions and hinder the automated reuse of the published knowledge. The tools will be updated to support new requirements revealed in this case study.
Affiliation(s)
- Hong Cui
  - University of Arizona, Tucson, AZ, USA
- Martin Ramirez
  - Museo Argentino de Ciencias Naturales, CONICET, Buenos Aires, Argentina
- Robert A. Morris
  - University of Massachusetts at Boston and Harvard University Herbaria, Massachusetts, USA
- Eduardo M. Soto
  - Department of Geology & Geophysics, Yale University, New Haven, Connecticut, USA
15
Hoehndorf R, Alshahrani M, Gkoutos GV, Gosline G, Groom Q, Hamann T, Kattge J, de Oliveira SM, Schmidt M, Sierra S, Smets E, Vos RA, Weiland C. The flora phenotype ontology (FLOPO): tool for integrating morphological traits and phenotypes of vascular plants. J Biomed Semantics 2016; 7:65. [PMID: 27842607 PMCID: PMC5109718 DOI: 10.1186/s13326-016-0107-8] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Received: 12/02/2015] [Accepted: 11/01/2016] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The systematic analysis of a large number of comparable plant trait data can support investigations into phylogenetics and ecological adaptation, with broad applications in evolutionary biology, agriculture, conservation, and the functioning of ecosystems. Floras, i.e., books collecting the information on all known plant species found within a region, are a potentially rich source of such plant trait data. Floras describe plant traits with a focus on morphology and other traits relevant for species identification in addition to other characteristics of plant species, such as ecological affinities, distribution, economic value, health applications, traditional uses, and so on. However, a key limitation in systematically analyzing information in Floras is the lack of a standardized vocabulary for the described traits as well as the difficulties in extracting structured information from free text. RESULTS We have developed the Flora Phenotype Ontology (FLOPO), an ontology for describing traits of plant species found in Floras. We used the Plant Ontology (PO) and the Phenotype And Trait Ontology (PATO) to extract entity-quality relationships from digitized taxon descriptions in Floras, and used a formal ontological approach based on phenotype description patterns and automated reasoning to generate the FLOPO. The resulting ontology consists of 25,407 classes and is based on the PO and PATO. The classified ontology closely follows the structure of Plant Ontology in that the primary axis of classification is the observed plant anatomical structure, and more specific traits are then classified based on parthood and subclass relations between anatomical structures as well as subclass relations between phenotypic qualities. CONCLUSIONS The FLOPO is primarily intended as a framework based on which plant traits can be integrated computationally across all species and higher taxa of flowering plants. 
Importantly, it is not intended to replace established vocabularies or ontologies, but rather serve as an overarching framework based on which different application- and domain-specific ontologies, thesauri and vocabularies of phenotypes observed in flowering plants can be integrated.
Affiliation(s)
- Robert Hoehndorf
  - Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology, 4700 KAUST, Thuwal, 23955–6900 Kingdom of Saudi Arabia
  - Computer, Electrical and Mathematical Sciences & Engineering Division (CEMSE), King Abdullah University of Science and Technology, 4700 KAUST, Thuwal, 23955–6900 Kingdom of Saudi Arabia
- Mona Alshahrani
  - Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology, 4700 KAUST, Thuwal, 23955–6900 Kingdom of Saudi Arabia
  - Computer, Electrical and Mathematical Sciences & Engineering Division (CEMSE), King Abdullah University of Science and Technology, 4700 KAUST, Thuwal, 23955–6900 Kingdom of Saudi Arabia
- Georgios V. Gkoutos
  - College of Medical and Dental Sciences, Institute of Cancer and Genomic Sciences, Centre for Computational Biology, University of Birmingham, Birmingham, B15 2TT United Kingdom
  - Institute of Translational Medicine, University Hospitals Birmingham, NHS Foundation Trust, Birmingham, B15 2TT United Kingdom
  - Institute of Biological, Environmental and Rural Sciences, Aberystwyth University, Aberystwyth, SY23 2AX United Kingdom
- George Gosline
  - Royal Botanical Gardens, Kew, Richmond, Surrey, TW9 3AB United Kingdom
- Quentin Groom
  - Botanic Garden Meise, Nieuwelaan 38, Meise, 1860 Belgium
- Thomas Hamann
  - Naturalis Biodiversity Center, P.O. Box 9517, Leiden, 2300 RA The Netherlands
- Jens Kattge
  - Max Planck Institute for Biogeochemistry, Hans Knoell Str. 10, Jena, 07745 Germany
  - German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Deutscher Platz 5e, Leipzig, 04103 Germany
- Marco Schmidt
  - Senckenberg Biodiversity and Climate Research Centre (BiK-F), Senckenberganlage 25, Frankfurt am Main, 60325 Germany
- Soraya Sierra
  - Naturalis Biodiversity Center, P.O. Box 9517, Leiden, 2300 RA The Netherlands
- Erik Smets
  - Naturalis Biodiversity Center, P.O. Box 9517, Leiden, 2300 RA The Netherlands
- Rutger A. Vos
  - Naturalis Biodiversity Center, P.O. Box 9517, Leiden, 2300 RA The Netherlands
- Claus Weiland
  - Senckenberg Biodiversity and Climate Research Centre (BiK-F), Senckenberganlage 25, Frankfurt am Main, 60325 Germany
16
Franz NM, Pier NM, Reeder DM, Chen M, Yu S, Kianmajd P, Bowers S, Ludäscher B. Two Influential Primate Classifications Logically Aligned. Syst Biol 2016; 65:561-82. [PMID: 27009895 PMCID: PMC4911943 DOI: 10.1093/sysbio/syw023] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.4] [Received: 02/16/2015] [Revised: 03/11/2016] [Accepted: 03/17/2016] [Indexed: 01/02/2023] Open
Abstract
Classifications and phylogenies of perceived natural entities change in the light of new evidence. Taxonomic changes, translated into Code-compliant names, frequently lead to name:meaning dissociations across succeeding treatments. Classification standards such as the Mammal Species of the World (MSW) may experience significant levels of taxonomic change from one edition to the next, with potential costs to long-term, large-scale information integration. This circumstance challenges the biodiversity and phylogenetic data communities to express taxonomic congruence and incongruence in ways that both humans and machines can process, that is, to logically represent taxonomic alignments across multiple classifications. We demonstrate that such alignments are feasible for two classifications of primates corresponding to the second and third MSW editions. Our approach has three main components: (i) use of taxonomic concept labels, that is name sec. author (where sec. means according to), to assemble each concept hierarchy separately via parent/child relationships; (ii) articulation of select concepts across the two hierarchies with user-provided Region Connection Calculus (RCC-5) relationships; and (iii) the use of an Answer Set Programming toolkit to infer and visualize logically consistent alignments of these input constraints. Our use case entails the Primates sec. Groves (1993; MSW2: 317 taxonomic concepts; 233 at the species level) and Primates sec. Groves (2005; MSW3: 483 taxonomic concepts; 376 at the species level). Using 402 RCC-5 input articulations, the reasoning process yields a single, consistent alignment and 153,111 Maximally Informative Relations that constitute a comprehensive meaning resolution map for every concept pair in the Primates sec. MSW2/MSW3.
The complete alignment, and various partitions thereof, facilitate quantitative analyses of name:meaning dissociation, revealing that nearly one in three taxonomic names are not reliable across treatments, in the sense of the same name identifying congruent taxonomic meanings. The RCC-5 alignment approach is potentially widely applicable in systematics and can achieve scalable, precise resolution of semantically evolving name usages in synthetic, next-generation biodiversity, and phylogeny data platforms.
Affiliation(s)
- Nico M Franz
  - School of Life Sciences, PO Box 874501, Arizona State University, Tempe, AZ 85287, USA
- Naomi M Pier
  - School of Life Sciences, PO Box 874501, Arizona State University, Tempe, AZ 85287, USA
- Deeann M Reeder
  - Department of Biology, Bucknell University, 1 Dent Drive, Lewisburg, PA 17837, USA
- Mingmin Chen
  - Department of Computer Science, 2063 Kemper Hall, 1 Shields Avenue, University of California at Davis, CA 95616, USA
- Shizhuo Yu
  - Department of Computer Science, 2063 Kemper Hall, 1 Shields Avenue, University of California at Davis, CA 95616, USA
- Parisa Kianmajd
  - Department of Computer Science, 2063 Kemper Hall, 1 Shields Avenue, University of California at Davis, CA 95616, USA
- Shawn Bowers
  - Department of Computer Science, 502 East Boone Avenue, AD Box 26, Gonzaga University, Spokane, WA 99258, USA
- Bertram Ludäscher
  - Graduate School of Library and Information Science, 510 East Daniel Street, University of Illinois at Urbana-Champaign, Champaign, IL 61820
17
Druzinsky RE, Balhoff JP, Crompton AW, Done J, German RZ, Haendel MA, Herrel A, Herring SW, Lapp H, Mabee PM, Muller HM, Mungall CJ, Sternberg PW, Van Auken K, Vinyard CJ, Williams SH, Wall CE. Muscle Logic: New Knowledge Resource for Anatomy Enables Comprehensive Searches of the Literature on the Feeding Muscles of Mammals. PLoS One 2016; 11:e0149102. [PMID: 26870952 PMCID: PMC4752357 DOI: 10.1371/journal.pone.0149102] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 08/14/2015] [Accepted: 01/27/2016] [Indexed: 01/27/2023] Open
Abstract
Background In recent years large bibliographic databases have made much of the published literature of biology available for searches. However, the capabilities of the search engines integrated into these databases for text-based bibliographic searches are limited. To enable searches that deliver the results expected by comparative anatomists, an underlying logical structure known as an ontology is required. Development and Testing of the Ontology Here we present the Mammalian Feeding Muscle Ontology (MFMO), a multi-species ontology focused on anatomical structures that participate in feeding and other oral/pharyngeal behaviors. A unique feature of the MFMO is that a simple, computable, definition of each muscle, which includes its attachments and innervation, is true across mammals. This construction mirrors the logical foundation of comparative anatomy and permits searches using language familiar to biologists. Further, it provides a template for muscles that will be useful in extending any anatomy ontology. The MFMO is developed to support the Feeding Experiments End-User Database Project (FEED, https://feedexp.org/), a publicly-available, online repository for physiological data collected from in vivo studies of feeding (e.g., mastication, biting, swallowing) in mammals. Currently the MFMO is integrated into FEED and also into two literature-specific implementations of Textpresso, a text-mining system that facilitates powerful searches of a corpus of scientific publications. We evaluate the MFMO by asking questions that test the ability of the ontology to return appropriate answers (competency questions). We compare the results of queries of the MFMO to results from similar searches in PubMed and Google Scholar. Results and Significance Our tests demonstrate that the MFMO is competent to answer queries formed in the common language of comparative anatomy, but PubMed and Google Scholar are not. 
Overall, our results show that by incorporating anatomical ontologies into searches, an expanded and anatomically comprehensive set of results can be obtained. The broader scientific and publishing communities should consider taking up the challenge of semantically enabled search capabilities.
Affiliation(s)
- Robert E. Druzinsky
  - Department of Oral Biology, University of Illinois at Chicago, Chicago, Illinois, United States of America
- James P. Balhoff
  - RTI International, Research Triangle Park, North Carolina, United States of America
- Alfred W. Crompton
  - Organismic and Evolutionary Biology, Harvard University, Cambridge, Massachusetts, United States of America
- James Done
  - Division of Biology and Biological Engineering, M/C 156–29, California Institute of Technology, Pasadena, California, United States of America
- Rebecca Z. German
  - Department of Anatomy and Neurobiology, Northeast Ohio Medical University, Rootstown, Ohio, United States of America
- Melissa A. Haendel
  - Oregon Health and Science University, Portland, Oregon, United States of America
- Anthony Herrel
  - Département d’Ecologie et de Gestion de la Biodiversité, Museum National d’Histoire Naturelle, Paris, France
- Susan W. Herring
  - University of Washington, Department of Orthodontics, Seattle, Washington, United States of America
- Hilmar Lapp
  - National Evolutionary Synthesis Center, Durham, North Carolina, United States of America
  - Center for Genomic and Computational Biology, Duke University, Durham, North Carolina, United States of America
- Paula M. Mabee
  - Department of Biology, University of South Dakota, Vermillion, South Dakota, United States of America
- Hans-Michael Muller
  - Division of Biology and Biological Engineering, M/C 156–29, California Institute of Technology, Pasadena, California, United States of America
- Christopher J. Mungall
  - Genomics Division, Lawrence Berkeley National Laboratory, Berkeley, California, United States of America
- Paul W. Sternberg
  - Division of Biology and Biological Engineering, M/C 156–29, California Institute of Technology, Pasadena, California, United States of America
  - Howard Hughes Medical Institute, M/C 156–29, California Institute of Technology, Pasadena, California, United States of America
- Kimberly Van Auken
  - Division of Biology and Biological Engineering, M/C 156–29, California Institute of Technology, Pasadena, California, United States of America
- Christopher J. Vinyard
  - Department of Anatomy and Neurobiology, Northeast Ohio Medical University, Rootstown, Ohio, United States of America
- Susan H. Williams
  - Department of Biomedical Sciences, Ohio University Heritage College of Osteopathic Medicine, Athens, Ohio, United States of America
- Christine E. Wall
  - Department of Evolutionary Anthropology, Duke University, Durham, North Carolina, United States of America
18
Ferro MV, Gavilanes MF, González AB, Gómez-Rodríguez C. Intelligent Retrieval for Biodiversity. Int J Artif Intell T 2016. [DOI: 10.1142/s0218213015500293] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
A proposal for intelligent retrieval in the biodiversity domain is described. It applies natural language processing to integrate linguistic and domain knowledge in a mathematical model for information management, formalizing the notion of semantic similarity in different degrees. The goal is to provide computational tools to identify, extract and relate not only data but also scientific notions, even if the information available to start the process is not complete. The use of conceptual graphs as a basis for interpretation makes it possible to avoid the use of classic ontologies, whose start-up requires costly generation and maintenance protocols and also unnecessarily overload the accessing task for inexpert users. We exploit the automatic generation of these structures from raw texts through graphical and natural language interaction, at the same time providing a solid logical and linguistic foundation to sustain the curation of databases.
Affiliation(s)
- M. Vilares Ferro
  - Department of Computer Science, University of Vigo, Campus As Lagoas s/n 32004 Ourense, Spain
- M. Fernández Gavilanes
  - Department of Computer Science, University of Vigo, Campus As Lagoas s/n 32004 Ourense, Spain
- A. Blanco González
  - Department of Computer Science, University of Vigo, Campus As Lagoas s/n 32004 Ourense, Spain
- C. Gómez-Rodríguez
  - Department of Computer Science, University of A Coruña, Campus de Elviña s/n 15071 A Coruña, Spain
19
Chang J, Alfaro ME. Crowdsourced geometric morphometrics enable rapid large‐scale collection and analysis of phenotypic data. Methods Ecol Evol 2015. [DOI: 10.1111/2041-210x.12508] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Indexed: 12/26/2022]
Affiliation(s)
- Jonathan Chang
  - Department of Ecology and Evolutionary Biology, University of California, Los Angeles, CA, USA
- Michael E. Alfaro
  - Department of Ecology and Evolutionary Biology, University of California, Los Angeles, CA, USA
20
Dececchi TA, Balhoff JP, Lapp H, Mabee PM. Toward Synthesizing Our Knowledge of Morphology: Using Ontologies and Machine Reasoning to Extract Presence/Absence Evolutionary Phenotypes across Studies. Syst Biol 2015; 64:936-52. [PMID: 26018570 PMCID: PMC4604830 DOI: 10.1093/sysbio/syv031] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.4] [Received: 11/24/2014] [Accepted: 05/20/2015] [Indexed: 02/02/2023] Open
Abstract
The reality of larger and larger molecular databases and the need to integrate data scalably have presented a major challenge for the use of phenotypic data. Morphology is currently primarily described in discrete publications, entrenched in noncomputer readable text, and requires enormous investments of time and resources to integrate across large numbers of taxa and studies. Here we present a new methodology, using ontology-based reasoning systems working with the Phenoscape Knowledgebase (KB; kb.phenoscape.org), to automatically integrate large amounts of evolutionary character state descriptions into a synthetic character matrix of neomorphic (presence/absence) data. Using the KB, which includes more than 55 studies of sarcopterygian taxa, we generated a synthetic supermatrix of 639 variable characters scored for 1051 taxa, resulting in over 145,000 populated cells. Of these characters, over 76% were made variable through the addition of inferred presence/absence states derived by machine reasoning over the formal semantics of the source ontologies. Inferred data reduced the missing data in the variable character-subset from 98.5% to 78.2%. Machine reasoning also enables the isolation of conflicts in the data, that is, cells where both presence and absence are indicated; reports regarding conflicting data provenance can be generated automatically. Further, reasoning enables quantification and new visualizations of the data, here for example, allowing identification of character space that has been undersampled across the fin-to-limb transition. The approach and methods demonstrated here to compute synthetic presence/absence supermatrices are applicable to any taxonomic and phenotypic slice across the tree of life, providing the data are semantically annotated. 
Because such data can also be linked to model organism genetics through computational scoring of phenotypic similarity, they open a rich set of future research questions into phenotype-to-genome relationships.
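The kind of inference this abstract describes can be illustrated with a minimal Python sketch. It assumes two simple parthood rules (a present part implies that each containing whole is present; an absent whole implies that each contained part is absent). These rules are an illustrative simplification, not necessarily the rule set applied to the Phenoscape Knowledgebase, and the `PART_OF` edges, term names, and function names are hypothetical.

```python
# Hypothetical part_of edges (part -> whole); real anatomy ontologies are far richer.
PART_OF = {"digit": "autopod", "autopod": "limb"}


def wholes(term):
    """Yield every structure that `term` is (transitively) part of."""
    while term in PART_OF:
        term = PART_OF[term]
        yield term


def infer_states(asserted):
    """Expand asserted presence (True) / absence (False) states for one taxon.

    Returns the expanded state table and a list of (part, whole) conflicts,
    i.e. pairs where a part is asserted present but its whole is asserted absent.
    """
    inferred = dict(asserted)
    conflicts = []
    # Rule 1 (assumed): a present part implies every containing whole is present.
    for term, present in asserted.items():
        if present:
            for whole in wholes(term):
                if inferred.get(whole) is False:
                    conflicts.append((term, whole))
                else:
                    inferred[whole] = True
    # Rule 2 (assumed): an absent whole implies every contained part is absent.
    for part in PART_OF:
        if part not in inferred and any(inferred.get(w) is False for w in wholes(part)):
            inferred[part] = False
    return inferred, conflicts


states, conflicts = infer_states({"digit": True})
print(states, conflicts)
```

Calling `infer_states({"limb": False})` fills in absence for both parts, and asserting a present digit alongside an absent limb is reported as a conflict instead of being silently overwritten.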
Affiliation(s)
- James P Balhoff
  - National Evolutionary Synthesis Center, Durham, NC 27705, USA; University of North Carolina, Chapel Hill, NC 27599, USA
- Hilmar Lapp
  - National Evolutionary Synthesis Center, Durham, NC 27705, USA; Center for Genomics and Computational Biology, Duke University, Durham, NC 27708, USA
- Paula M Mabee
  - Department of Biology, University of South Dakota, Vermillion, SD 57069, USA
21
Dahdul W, Dececchi TA, Ibrahim N, Lapp H, Mabee P. Moving the mountain: analysis of the effort required to transform comparative anatomy into computable anatomy. Database: The Journal of Biological Databases and Curation 2015; 2015:bav040. [PMID: 25972520 PMCID: PMC4429748 DOI: 10.1093/database/bav040] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Received: 01/05/2015] [Accepted: 04/05/2015] [Indexed: 11/28/2022]
Abstract
The diverse phenotypes of living organisms have been described for centuries, and though they may be digitized, they are not readily available in a computable form. Using over 100 morphological studies, the Phenoscape project has demonstrated that by annotating characters with community ontology terms, links between novel species anatomy and the genes that may underlie them can be made. But given the enormity of the legacy literature, how can this largely unexploited wealth of descriptive data be rendered amenable to large-scale computation? To identify the bottlenecks, we quantified the time involved in the major aspects of phenotype curation as we annotated characters from the vertebrate phylogenetic systematics literature. This involves attaching fully computable logical expressions consisting of ontology terms to the descriptions in character-by-taxon matrices. The workflow consists of: (i) data preparation, (ii) phenotype annotation, (iii) ontology development and (iv) curation team discussions and software development feedback. Our results showed that the completion of this work required two person-years by a team of two post-docs, a lead data curator, and students. Manual data preparation required close to 13% of the effort. This part in particular could be reduced substantially with better community data practices, such as depositing fully populated matrices in public repositories. Phenotype annotation required ∼40% of the effort. We are working to make this more efficient with Natural Language Processing tools. Ontology development (40%), however, remains a highly manual task requiring domain (anatomical) expertise and use of specialized software. The large overhead required for data preparation and ontology development contributed to a low annotation rate of approximately two characters per hour, compared with 14 characters per hour when activity was restricted to character annotation. 
Unlocking the potential of the vast stores of morphological descriptions requires better tools for efficiently processing natural language, and better community practices towards a born-digital morphology. Database URL: http://kb.phenoscape.org
Affiliation(s)
- Wasila Dahdul
  - Department of Biology, University of South Dakota, Vermillion, SD, USA, Department of Organismal Biology and Anatomy, University of Chicago, Chicago, IL, USA and National Evolutionary Synthesis Center, Durham, NC, USA
- T Alexander Dececchi
  - Department of Biology, University of South Dakota, Vermillion, SD, USA, Department of Organismal Biology and Anatomy, University of Chicago, Chicago, IL, USA and National Evolutionary Synthesis Center, Durham, NC, USA
- Nizar Ibrahim
  - Department of Biology, University of South Dakota, Vermillion, SD, USA, Department of Organismal Biology and Anatomy, University of Chicago, Chicago, IL, USA and National Evolutionary Synthesis Center, Durham, NC, USA
- Hilmar Lapp
  - Department of Biology, University of South Dakota, Vermillion, SD, USA, Department of Organismal Biology and Anatomy, University of Chicago, Chicago, IL, USA and National Evolutionary Synthesis Center, Durham, NC, USA
- Paula Mabee
  - Department of Biology, University of South Dakota, Vermillion, SD, USA, Department of Organismal Biology and Anatomy, University of Chicago, Chicago, IL, USA and National Evolutionary Synthesis Center, Durham, NC, USA
22
Daly M, Endara LA, Burleigh JG. Peeking behind the page: using natural language processing to identify and explore the characters used to classify sea anemones. Zool Anz 2015. [DOI: 10.1016/j.jcz.2015.03.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/23/2022]
23
Fu X, Batista-Navarro R, Rak R, Ananiadou S. Supporting the annotation of chronic obstructive pulmonary disease (COPD) phenotypes with text mining workflows. J Biomed Semantics 2015; 6:8. [PMID: 25789153 PMCID: PMC4364458 DOI: 10.1186/s13326-015-0004-6] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Received: 11/11/2014] [Accepted: 02/22/2015] [Indexed: 02/03/2023] Open
Abstract
BACKGROUND Chronic obstructive pulmonary disease (COPD) is a life-threatening lung disorder whose recent prevalence has led to an increasing burden on public healthcare. Phenotypic information in electronic clinical records is essential in providing suitable personalised treatment to patients with COPD. However, as phenotypes are often "hidden" within free text in clinical records, clinicians could benefit from text mining systems that facilitate their prompt recognition. This paper reports on a semi-automatic methodology for producing a corpus that can ultimately support the development of text mining tools that, in turn, will expedite the process of identifying groups of COPD patients. METHODS A corpus of 30 full-text papers was formed based on selection criteria informed by the expertise of COPD specialists. We developed an annotation scheme that is aimed at producing fine-grained, expressive and computable COPD annotations without burdening our curators with a highly complicated task. This was implemented in the Argo platform by means of a semi-automatic annotation workflow that integrates several text mining tools, including a graphical user interface for marking up documents. RESULTS When evaluated using gold standard (i.e., manually validated) annotations, the semi-automatic workflow was shown to obtain a micro-averaged F-score of 45.70% (with relaxed matching). Utilising the gold standard data to train new concept recognisers, we demonstrated that our corpus, although still a work in progress, can foster the development of significantly better performing COPD phenotype extractors. CONCLUSIONS We describe in this work the means by which we aim to eventually support the process of COPD phenotype curation, i.e., by the application of various text mining tools integrated into an annotation workflow. 
Although the corpus being described is still under development, our results thus far are encouraging and show great potential in stimulating the development of further automatic COPD phenotype extractors.
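For readers unfamiliar with the metric reported above, the following Python sketch shows how a micro-averaged F-score is computed: true positives, false positives, and false negatives are pooled across all annotation types before precision and recall are calculated. The category names and counts are hypothetical and are not taken from the COPD corpus; whether a predicted span counts as a true positive (exact or relaxed matching) is decided upstream of this function.

```python
def micro_f1(counts):
    """Micro-averaged F1 from per-category tp/fp/fn counts.

    `counts` maps a category name to a dict with integer keys
    "tp", "fp" and "fn". Counts are pooled over all categories first.
    """
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    fn = sum(c["fn"] for c in counts.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical per-category counts, for illustration only.
example_counts = {
    "category_a": {"tp": 8, "fp": 2, "fn": 4},
    "category_b": {"tp": 2, "fp": 3, "fn": 1},
}
print(round(micro_f1(example_counts), 4))
```

Because counts are pooled, frequent categories dominate the score; a macro average would instead compute F1 per category and then take the mean.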
Affiliation(s)
- Xiao Fu
  - National Centre for Text Mining, School of Computer Science, University of Manchester, Manchester Institute of Biotechnology, 131 Princess Street, Manchester, UK
- Riza Batista-Navarro
  - National Centre for Text Mining, School of Computer Science, University of Manchester, Manchester Institute of Biotechnology, 131 Princess Street, Manchester, UK; Department of Computer Science, University of the Philippines Diliman, Quezon City, 1101 Philippines
- Rafal Rak
  - National Centre for Text Mining, School of Computer Science, University of Manchester, Manchester Institute of Biotechnology, 131 Princess Street, Manchester, UK
- Sophia Ananiadou
  - National Centre for Text Mining, School of Computer Science, University of Manchester, Manchester Institute of Biotechnology, 131 Princess Street, Manchester, UK
24
Franz NM, Chen M, Yu S, Kianmajd P, Bowers S, Ludäscher B. Reasoning over taxonomic change: exploring alignments for the Perelleschus use case. PLoS One 2015; 10:e0118247. [PMID: 25700173 PMCID: PMC4336294 DOI: 10.1371/journal.pone.0118247] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Received: 04/02/2014] [Accepted: 01/02/2015] [Indexed: 11/19/2022] Open
Abstract
Classifications and phylogenetic inferences of organismal groups change in light of new insights. Over time these changes can result in an imperfect tracking of taxonomic perspectives through the re-/use of Code-compliant or informal names. To mitigate these limitations, we introduce a novel approach for aligning taxonomies through the interaction of human experts and logic reasoners. We explore the performance of this approach with the Perelleschus use case of Franz & Cardona-Duque (2013). The use case includes six taxonomies published from 1936 to 2013, 54 taxonomic concepts (i.e., circumscriptions of names individuated according to their respective source publications), and 75 expert-asserted Region Connection Calculus articulations (e.g., congruence, proper inclusion, overlap, or exclusion). An Open Source reasoning toolkit is used to analyze 13 paired Perelleschus taxonomy alignments under heterogeneous constraints and interpretations. The reasoning workflow optimizes the logical consistency and expressiveness of the input and infers the set of maximally informative relations among the entailed taxonomic concepts. The latter are then used to produce merge visualizations that represent all congruent and non-congruent taxonomic elements among the aligned input trees. In this small use case with 6-53 input concepts per alignment, the information gained through the reasoning process is on average one order of magnitude greater than in the input. The approach offers scalable solutions for tracking provenance among succeeding taxonomic perspectives that may have differential biases in naming conventions, phylogenetic resolution, ingroup and outgroup sampling, or ostensive (member-referencing) versus intensional (property-referencing) concepts and articulations.
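The Region Connection Calculus articulations named in this abstract (congruence, proper inclusion, overlap, exclusion) can be illustrated with a short Python sketch. It assumes each taxonomic concept is modelled as a non-empty set of member identifiers, which corresponds to a purely ostensive (member-referencing) reading; the reasoning toolkit described in the paper works from expert-asserted articulations and logic constraints, not from this set comparison. The function name and the member labels are hypothetical.

```python
def rcc5(a, b):
    """Return the RCC-5 relation of concept `a` to concept `b`.

    Both concepts are modelled as non-empty sets of member identifiers.
    """
    if not a or not b:
        raise ValueError("RCC-5 regions must be non-empty")
    if a == b:
        return "congruence"
    if b < a:  # a strictly contains b
        return "proper inclusion"
    if a < b:  # a is strictly contained in b
        return "inverse proper inclusion"
    if a & b:  # some, but not all, members are shared
        return "overlap"
    return "exclusion"


# Hypothetical member sets for two treatments of the same genus name.
earlier = {"sp1", "sp2", "sp3"}
later = {"sp2", "sp3", "sp4"}
print(rcc5(earlier, later))  # overlap
```

The five outcomes are mutually exclusive and jointly exhaustive for non-empty sets, which is what allows a reasoner to propagate a small number of asserted articulations into relations for every concept pair.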
Affiliation(s)
- Nico M. Franz
  - School of Life Sciences, Arizona State University, Tempe, Arizona, United States of America
- Mingmin Chen
  - Department of Computer Science, University of California Davis, Davis, California, United States of America
- Shizhuo Yu
  - Department of Computer Science, University of California Davis, Davis, California, United States of America
- Parisa Kianmajd
  - Department of Computer Science, University of California Davis, Davis, California, United States of America
- Shawn Bowers
  - Department of Computer Science, Gonzaga University, Spokane, Washington, United States of America
- Bertram Ludäscher
  - Department of Computer Science, University of California Davis, Davis, California, United States of America
25
Huang F, Macklin JA, Cui H, Cole HA, Endara L. OTO: Ontology Term Organizer. BMC Bioinformatics 2015; 16:47. [PMID: 25887779 PMCID: PMC4339750 DOI: 10.1186/s12859-015-0488-1] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 09/28/2014] [Accepted: 01/30/2015] [Indexed: 11/17/2022] Open
Abstract
Background: The need to create controlled vocabularies such as ontologies for knowledge organization and access has been widely recognized in various domains. Despite the indispensable need for thorough domain knowledge in ontology construction, most software tools for ontology construction are designed for knowledge engineers rather than for domain experts. Differences in the opinions of domain experts and in terminology usage in the source literature are rarely addressed by existing software. Methods: OTO software was developed based on Agile principles. Through iterations of software release and user feedback, new features were added and existing features modified to make the tool more intuitive and efficient to use for small and large data sets. The software is open source and built in Java. Results: Ontology Term Organizer (OTO; http://biosemantics.arizona.edu/OTO/) is a user-friendly, web-based, consensus-promoting, open source application for organizing domain terms by dragging and dropping terms to appropriate locations. The application is designed for users with specific domain knowledge, such as biology, but without in-depth ontology construction skills. Specifically, OTO can be used to establish is_a, part_of, synonym, and order relationships among terms in any domain, reflecting the terminology usage in source literature and the opinions of multiple experts. The organized terms may be fed into formal ontologies to boost their coverage. All datasets organized on OTO are publicly available. Conclusion: OTO has been used to organize the terms extracted from thirty volumes of Flora of North America and Flora of China combined, in addition to some smaller datasets of different taxon groups. User feedback indicates that the tool is efficient and user friendly. Being open source software, the application can be modified to fit varied term organization needs for different domains.
Affiliation(s)
- Fengqiong Huang
  - School of Information Resources and Library Science, University of Arizona, Tucson, USA.
- Hong Cui
  - School of Information Resources and Library Science, University of Arizona, Tucson, USA.
- Lorena Endara
  - Department of Biology, University of Florida, Gainesville, USA.
26
Liu J, Endara L, Burleigh JG. MatrixConverter: Facilitating construction of phenomic character matrices. Applications in Plant Sciences 2015; 3:apps1400088. [PMID: 25699217 PMCID: PMC4332142 DOI: 10.3732/apps.1400088] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 09/17/2014] [Accepted: 01/18/2015] [Indexed: 05/24/2023]
Abstract
Premise of the study: While numerous software packages enable scientists to evaluate molecular data and transform them for phylogenetic analyses, few such tools exist for phenomic data. We introduce MatrixConverter, a program that helps expedite and facilitate the transformation of raw phenomic character data into discrete character matrices that can be used in most evolutionary inference programs. Methods and results: MatrixConverter is an open source program written in Java; a platform-independent binary executable, as well as sample data sets and a user's manual, are available at https://github.com/gburleigh/MatrixConverter/tree/master/distribution. MatrixConverter has a simple, intuitive user interface that enables the user to immediately begin scoring phenomic characters. We demonstrate the performance of MatrixConverter on a phenomic data set from cycads. Conclusions: New technologies and software make it possible to obtain phenomic data from species across the tree of life, and MatrixConverter helps to transform these new data for evolutionary or ecological inference.
Affiliation(s)
- Jing Liu
  - Department of Biology, University of Florida, P.O. Box 118526, Gainesville, Florida 32611 USA
  - State Key Laboratory of Software Engineering, Computer School, Wuhan University, Wuhan 430072, People’s Republic of China
- Lorena Endara
  - Department of Biology, University of Florida, P.O. Box 118526, Gainesville, Florida 32611 USA
- J. Gordon Burleigh
  - Department of Biology, University of Florida, P.O. Box 118526, Gainesville, Florida 32611 USA
27
Deans AR, Lewis SE, Huala E, Anzaldo SS, Ashburner M, Balhoff JP, Blackburn DC, Blake JA, Burleigh JG, Chanet B, Cooper LD, Courtot M, Csösz S, Cui H, Dahdul W, Das S, Dececchi TA, Dettai A, Diogo R, Druzinsky RE, Dumontier M, Franz NM, Friedrich F, Gkoutos GV, Haendel M, Harmon LJ, Hayamizu TF, He Y, Hines HM, Ibrahim N, Jackson LM, Jaiswal P, James-Zorn C, Köhler S, Lecointre G, Lapp H, Lawrence CJ, Le Novère N, Lundberg JG, Macklin J, Mast AR, Midford PE, Mikó I, Mungall CJ, Oellrich A, Osumi-Sutherland D, Parkinson H, Ramírez MJ, Richter S, Robinson PN, Ruttenberg A, Schulz KS, Segerdell E, Seltmann KC, Sharkey MJ, Smith AD, Smith B, Specht CD, Squires RB, Thacker RW, Thessen A, Fernandez-Triana J, Vihinen M, Vize PD, Vogt L, Wall CE, Walls RL, Westerfeld M, Wharton RA, Wirkner CS, Woolley JB, Yoder MJ, Zorn AM, Mabee P. Finding our way through phenotypes. PLoS Biol 2015; 13:e1002033. [PMID: 25562316 PMCID: PMC4285398 DOI: 10.1371/journal.pbio.1002033] [Citation(s) in RCA: 124] [Impact Index Per Article: 13.8] [Indexed: 01/04/2023] Open
Abstract
Despite a large and multifaceted effort to understand the vast landscape of phenotypic data, their current form inhibits productive data analysis. The lack of a community-wide, consensus-based, human- and machine-interpretable language for describing phenotypes and their genomic and environmental contexts is perhaps the most pressing scientific bottleneck to integration across many key fields in biology, including genomics, systems biology, development, medicine, evolution, ecology, and systematics. Here we survey the current phenomics landscape, including data resources and handling, and the progress that has been made to accurately capture relevant data descriptions for phenotypes. We present an example of the kind of integration across domains that computable phenotypes would enable, and we call upon the broader biology community, publishers, and relevant funding agencies to support efforts to surmount today's data barriers and facilitate analytical reproducibility.
Affiliation(s)
- Andrew R. Deans
  - Department of Entomology, Pennsylvania State University, University Park, Pennsylvania, United States of America
- Suzanna E. Lewis
  - Genome Division, Lawrence Berkeley National Lab, Berkeley, California, United States of America
- Eva Huala
  - Department of Plant Biology, Carnegie Institution for Science, Stanford, California, United States of America
  - Phoenix Bioinformatics, Palo Alto, California, United States of America
- Salvatore S. Anzaldo
  - School of Life Sciences, Arizona State University, Tempe, Arizona, United States of America
- Michael Ashburner
  - Department of Genetics, University of Cambridge, Cambridge, United Kingdom
- James P. Balhoff
  - National Evolutionary Synthesis Center, Durham, North Carolina, United States of America
- David C. Blackburn
  - Department of Vertebrate Zoology and Anthropology, California Academy of Sciences, San Francisco, California, United States of America
- Judith A. Blake
  - The Jackson Laboratory, Bar Harbor, Maine, United States of America
- J. Gordon Burleigh
  - Department of Biology, University of Florida, Gainesville, Florida, United States of America
- Bruno Chanet
  - Muséum national d'Histoire naturelle, Département Systématique et Evolution, Paris, France
- Laurel D. Cooper
  - Department of Botany and Plant Pathology, Oregon State University, Corvallis, Oregon, United States of America
- Mélanie Courtot
  - Molecular Biology and Biochemistry Department, Simon Fraser University, Burnaby, British Columbia, Canada
- Sándor Csösz
  - MTA-ELTE-MTM, Ecology Research Group, Pázmány Péter sétány 1C, Budapest, Hungary
- Hong Cui
  - School of Information Resources and Library Science, University of Arizona, Tucson, Arizona, United States of America
- Wasila Dahdul
  - Department of Biology, University of South Dakota, Vermillion, South Dakota, United States of America
- Sandip Das
  - Department of Botany, University of Delhi, Delhi, India
- T. Alexander Dececchi
  - Department of Biology, University of South Dakota, Vermillion, South Dakota, United States of America
- Agnes Dettai
  - Muséum national d'Histoire naturelle, Département Systématique et Evolution, Paris, France
- Rui Diogo
  - Department of Anatomy, Howard University College of Medicine, Washington D.C., United States of America
- Robert E. Druzinsky
  - Department of Oral Biology, College of Dentistry, University of Illinois, Chicago, Illinois, United States of America
- Michel Dumontier
  - Stanford Center for Biomedical Informatics Research, Stanford, California, United States of America
- Nico M. Franz
  - School of Life Sciences, Arizona State University, Tempe, Arizona, United States of America
- Frank Friedrich
  - Biocenter Grindel and Zoological Museum, Hamburg University, Hamburg, Germany
- George V. Gkoutos
  - Department of Computer Science, Aberystwyth University, Aberystwyth, Ceredigion, United Kingdom
- Melissa Haendel
  - Department of Medical Informatics & Epidemiology, Oregon Health & Science University, Portland, Oregon, United States of America
- Luke J. Harmon
  - Department of Biological Sciences, University of Idaho, Moscow, Idaho, United States of America
- Terry F. Hayamizu
  - Mouse Genome Informatics, The Jackson Laboratory, Bar Harbor, Maine, United States of America
- Yongqun He
  - Unit for Laboratory Animal Medicine, Department of Microbiology and Immunology, Center for Computational Medicine and Bioinformatics, and Comprehensive Cancer Center, University of Michigan Medical School, Ann Arbor, Michigan, United States of America
- Heather M. Hines
  - Department of Entomology, Pennsylvania State University, University Park, Pennsylvania, United States of America
- Nizar Ibrahim
  - Department of Organismal Biology and Anatomy, University of Chicago, Chicago, Illinois, United States of America
- Laura M. Jackson
  - Department of Biology, University of South Dakota, Vermillion, South Dakota, United States of America
- Pankaj Jaiswal
  - Department of Botany and Plant Pathology, Oregon State University, Corvallis, Oregon, United States of America
- Christina James-Zorn
  - Cincinnati Children's Hospital, Division of Developmental Biology, Cincinnati, Ohio, United States of America
- Sebastian Köhler
  - Institute for Medical Genetics and Human Genetics, Charité-Universitätsmedizin Berlin, Berlin, Germany
- Guillaume Lecointre
  - Muséum national d'Histoire naturelle, Département Systématique et Evolution, Paris, France
- Hilmar Lapp
  - National Evolutionary Synthesis Center, Durham, North Carolina, United States of America
- Carolyn J. Lawrence
  - Department of Genetics, Development and Cell Biology and Department of Agronomy, Iowa State University, Ames, Iowa, United States of America
- John G. Lundberg
  - Department of Ichthyology, The Academy of Natural Sciences, Philadelphia, Pennsylvania, United States of America
- James Macklin
  - Eastern Cereal and Oilseed Research Centre, Ottawa, Ontario, Canada
- Austin R. Mast
  - Department of Biological Science, Florida State University, Tallahassee, Florida, United States of America
- István Mikó
  - Department of Entomology, Pennsylvania State University, University Park, Pennsylvania, United States of America
- Christopher J. Mungall
  - Genome Division, Lawrence Berkeley National Lab, Berkeley, California, United States of America
- Anika Oellrich
  - European Molecular Biology Laboratory - European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, United Kingdom
- David Osumi-Sutherland
  - European Molecular Biology Laboratory - European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, United Kingdom
- Helen Parkinson
  - European Molecular Biology Laboratory - European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, United Kingdom
- Martín J. Ramírez
  - Division of Arachnology, Museo Argentino de Ciencias Naturales - CONICET, Buenos Aires, Argentina
- Stefan Richter
  - Allgemeine & Spezielle Zoologie, Institut für Biowissenschaften, Universität Rostock, Universitätsplatz 2, Rostock, Germany
- Peter N. Robinson
  - Institut für Medizinische Genetik und Humangenetik Charité – Universitätsmedizin Berlin, Berlin, Germany
- Alan Ruttenberg
  - School of Dental Medicine, University at Buffalo, Buffalo, New York, United States of America
- Katja S. Schulz
  - Smithsonian Institution, National Museum of Natural History, Washington, D.C., United States of America
- Erik Segerdell
  - Knight Cancer Institute, Oregon Health & Science University, Portland, Oregon, United States of America
- Katja C. Seltmann
  - Division of Invertebrate Zoology, American Museum of Natural History, New York, New York, United States of America
- Michael J. Sharkey
  - Department of Entomology, University of Kentucky, Lexington, Kentucky, United States of America
- Aaron D. Smith
  - Department of Biological Sciences, Northern Arizona University, Flagstaff, Arizona, United States of America
- Barry Smith
  - Department of Philosophy, University at Buffalo, Buffalo, New York, United States of America
- Chelsea D. Specht
  - Department of Plant and Microbial Biology, Integrative Biology, and the University and Jepson Herbaria, University of California, Berkeley, California, United States of America
- R. Burke Squires
  - Bioinformatics and Computational Biosciences Branch, Office of Cyber Infrastructure and Computational Biology, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, Maryland, United States of America
- Robert W. Thacker
  - Department of Biology, University of Alabama at Birmingham, Birmingham, Alabama, United States of America
- Anne Thessen
  - The Data Detektiv, 1412 Stearns Hill Road, Waltham, Massachusetts, United States of America
- Mauno Vihinen
  - Department of Experimental Medical Science, Lund University, Lund, Sweden
- Peter D. Vize
  - Department of Biological Sciences, University of Calgary, Calgary, Alberta, Canada
- Lars Vogt
  - Universität Bonn, Institut für Evolutionsbiologie und Ökologie, Bonn, Germany
- Christine E. Wall
  - Department of Evolutionary Anthropology, Duke University, Durham, North Carolina, United States of America
- Ramona L. Walls
  - iPlant Collaborative, University of Arizona, Thomas J. Keating Bioresearch Building, Tucson, Arizona, United States of America
- Monte Westerfeld
  - Institute of Neuroscience, University of Oregon, Eugene, Oregon, United States of America
- Robert A. Wharton
  - Department of Entomology, Texas A & M University, College Station, Texas, United States of America
- Christian S. Wirkner
  - Allgemeine & Spezielle Zoologie, Institut für Biowissenschaften, Universität Rostock, Universitätsplatz 2, Rostock, Germany
- James B. Woolley
  - Department of Entomology, Texas A & M University, College Station, Texas, United States of America
- Matthew J. Yoder
  - Illinois Natural History Survey, University of Illinois, Champaign, Illinois, United States of America
- Aaron M. Zorn
  - Cincinnati Children's Hospital, Division of Developmental Biology, Cincinnati, Ohio, United States of America
- Paula Mabee
  - Department of Biology, University of South Dakota, Vermillion, South Dakota, United States of America
28
Haendel MA, Balhoff JP, Bastian FB, Blackburn DC, Blake JA, Bradford Y, Comte A, Dahdul WM, Dececchi TA, Druzinsky RE, Hayamizu TF, Ibrahim N, Lewis SE, Mabee PM, Niknejad A, Robinson-Rechavi M, Sereno PC, Mungall CJ. Unification of multi-species vertebrate anatomy ontologies for comparative biology in Uberon. J Biomed Semantics 2014; 5:21. [PMID: 25009735 PMCID: PMC4089931 DOI: 10.1186/2041-1480-5-21] [Citation(s) in RCA: 88] [Impact Index Per Article: 8.8] [Received: 07/10/2013] [Accepted: 03/25/2014] [Indexed: 12/25/2022] Open
Abstract
Background: Elucidating disease and developmental dysfunction requires understanding variation in phenotype. Single-species model organism anatomy ontologies (ssAOs) have been established to represent this variation. Multi-species anatomy ontologies (msAOs; vertebrate skeletal, vertebrate homologous, teleost, amphibian AOs) have been developed to represent 'natural' phenotypic variation across species. Our aim has been to integrate ssAOs and msAOs for various purposes, including establishing links between phenotypic variation and candidate genes. Results: Previously, msAOs contained a mixture of unique and overlapping content. This hampered integration and coordination due to the need to maintain cross-references or inter-ontology equivalence axioms to the ssAOs, or to perform large-scale obsolescence and modular import. Here we present the unification of anatomy ontologies into Uberon, a single ontology resource that enables interoperability among disparate data and research groups. As a consequence, independent development of TAO, VSAO, AAO, and vHOG has been discontinued. Conclusions: The newly broadened Uberon ontology is a unified cross-taxon resource for metazoans (animals) that has been substantially expanded to include a broad diversity of vertebrate anatomical structures, permitting reasoning across anatomical variation in extinct and extant taxa. Uberon is a core resource that supports single- and cross-species queries for candidate genes using annotations for phenotypes from the systematics, biodiversity, medical, and model organism communities, while also providing entities for logical definitions in the Cell and Gene Ontologies. The ontology release files associated with the ontology merge described in this manuscript are available at http://purl.obolibrary.org/obo/uberon/releases/2013-02-21/ and current ontology release files are always available at http://purl.obolibrary.org/obo/uberon/releases/
Affiliation(s)
- Melissa A Haendel
  - Department of Medical Informatics & Epidemiology, Oregon Health & Science University, Portland, OR, USA
- James P Balhoff
  - Department of Biology, University of North Carolina, Chapel Hill, NC 27599-3280, USA; National Evolutionary Synthesis Center, Durham, NC, USA
- Frederic B Bastian
  - Department of Ecology and Evolution, University of Lausanne, Lausanne, Switzerland; Swiss Institute of Bioinformatics, Lausanne, Switzerland
- David C Blackburn
  - Department of Vertebrate Zoology and Anthropology, California Academy of Sciences, San Francisco, CA 94118, USA
- Yvonne Bradford
  - The Zebrafish Model Organism Database, University of Oregon, Eugene, OR 97403, USA
- Aurelie Comte
  - Department of Ecology and Evolution, University of Lausanne, Lausanne, Switzerland; Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Wasila M Dahdul
  - National Evolutionary Synthesis Center, Durham, NC, USA; Department of Biology, University of South Dakota, Vermillion, SD 57069, USA
- Thomas A Dececchi
  - Department of Biology, University of South Dakota, Vermillion, SD 57069, USA
- Robert E Druzinsky
  - Department of Oral Biology, University of Illinois-Chicago, Chicago, IL 60612, USA
- Nizar Ibrahim
  - Department of Organismal Biology and Anatomy, University of Chicago, Chicago, IL 60637, USA
- Suzanna E Lewis
  - Lawrence Berkeley National Laboratory, 1 Cyclotron Rd, Berkeley, CA 94720, USA
- Paula M Mabee
  - Department of Biology, University of South Dakota, Vermillion, SD 57069, USA
- Anne Niknejad
  - Department of Ecology and Evolution, University of Lausanne, Lausanne, Switzerland; Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Marc Robinson-Rechavi
  - Department of Ecology and Evolution, University of Lausanne, Lausanne, Switzerland; Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Paul C Sereno
  - Department of Organismal Biology and Anatomy, University of Chicago, Chicago, IL 60637, USA
29
Thessen AE, Parr CS. Knowledge extraction and semantic annotation of text from the Encyclopedia of Life. PLoS One 2014; 9:e89550. [PMID: 24594988 PMCID: PMC3940440 DOI: 10.1371/journal.pone.0089550] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Received: 09/04/2013] [Accepted: 01/21/2014] [Indexed: 11/19/2022] Open
Abstract
Numerous digitization and ontological initiatives have focused on translating biological knowledge from narrative text to machine-readable formats. In this paper, we describe two workflows for knowledge extraction and semantic annotation of text data objects featured in an online biodiversity aggregator, the Encyclopedia of Life. One workflow tags text with DBpedia URIs based on keywords. Another workflow finds taxon names in text using GNRD for the purpose of building a species association network. Both workflows work well: the annotation workflow has an F1 Score of 0.941 and the association algorithm has an F1 Score of 0.885. Existing text annotators such as Terminizer and DBpedia Spotlight performed well, but require some optimization to be useful in the ecology and evolution domain. Important future work includes scaling up and improving accuracy through the use of distributional semantics.
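For reference, an F1 score such as those reported in this abstract is the harmonic mean of precision and recall. A minimal Python sketch of that computation follows; the function name and the counts in the usage line are illustrative assumptions and are not data from the paper.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Compute F1 as the harmonic mean of precision and recall."""
    predicted = true_positives + false_positives  # items the workflow labelled
    actual = true_positives + false_negatives     # items that should be labelled
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 8 correct annotations, 2 spurious, 2 missed.
print(round(f1_score(8, 2, 2), 3))  # prints: 0.8
```

Because the harmonic mean is pulled toward the smaller value, a high F1 requires both precision and recall to be high.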
Affiliation(s)
- Anne E. Thessen
  - Arizona State University, School of Life Sciences, Tempe, Arizona, United States of America
- Cynthia Sims Parr
  - National Museum of Natural History, Smithsonian Institution, Washington, District of Columbia, United States of America
30
Balhoff JP, Mikó I, Yoder MJ, Mullins PL, Deans AR. A semantic model for species description applied to the ensign wasps (Hymenoptera: Evaniidae) of New Caledonia. Syst Biol 2013; 62:639-59. [PMID: 23652347 PMCID: PMC3739881 DOI: 10.1093/sysbio/syt028] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.3] [Received: 10/24/2012] [Revised: 02/14/2013] [Accepted: 04/23/2013] [Indexed: 12/01/2022] Open
Abstract
Taxonomic descriptions are unparalleled sources of knowledge of life's phenotypic diversity. As natural language prose, these data sets are largely refractory to computation and integration with other sources of phenotypic data. By formalizing taxonomic descriptions using ontology-based semantic representation, we aim to increase the reusability and computability of taxonomists' primary data. Here, we present a revision of the ensign wasp (Hymenoptera: Evaniidae) fauna of New Caledonia using this new model for species description. Descriptive matrices, specimen data, and taxonomic nomenclature are gathered in a unified Web-based application, mx, then exported as both traditional taxonomic treatments and semantic statements using the OWL Web Ontology Language. Character:character-state combinations are then annotated following the entity-quality phenotype model, originally developed to represent mutant model organism phenotype data; concepts of anatomy are drawn from the Hymenoptera Anatomy Ontology and linked to phenotype descriptors from the Phenotypic Quality Ontology. The resulting set of semantic statements is provided in Resource Description Framework format. Applying the model to real data, that is, specimens, taxonomic names, diagnoses, descriptions, and redescriptions, provides us with a foundation to discuss limitations and potential benefits such as automated data integration and reasoner-driven queries. Four species of ensign wasp are now known to occur in New Caledonia: Szepligetella levipetiolata, Szepligetella deercreeki Deans and Mikó sp. nov., Szepligetella irwini Deans and Mikó sp. nov., and the nearly cosmopolitan Evania appendigaster. A fifth species, Szepligetella sericea, including Szepligetella impressa, syn. nov., has not yet been collected in New Caledonia but can be found on islands throughout the Pacific and so is included in the diagnostic key.
Affiliation(s)
- James P. Balhoff
  - National Evolutionary Synthesis Center, Durham, NC 27705, USA; Department of Biology, University of North Carolina, Chapel Hill, NC 27599, USA; Insect Museum, Department of Entomology, North Carolina State University, Box 7613, Raleigh, NC 27695, USA; Department of Entomology, Pennsylvania State University, 501 ASI Building, University Park, PA 16802, USA; Illinois Natural History Survey, University of Illinois, 1816 South Oak Street, MC 652 Champaign, IL 61820, USA; and Department of Ecology, Evolution, and Organismal Biology, Iowa State University, Ames, IA 50011, USA
- István Mikó
  - National Evolutionary Synthesis Center, Durham, NC 27705, USA; Department of Biology, University of North Carolina, Chapel Hill, NC 27599, USA; Insect Museum, Department of Entomology, North Carolina State University, Box 7613, Raleigh, NC 27695, USA; Department of Entomology, Pennsylvania State University, 501 ASI Building, University Park, PA 16802, USA; Illinois Natural History Survey, University of Illinois, 1816 South Oak Street, MC 652 Champaign, IL 61820, USA; and Department of Ecology, Evolution, and Organismal Biology, Iowa State University, Ames, IA 50011, USA
- Matthew J. Yoder
  - National Evolutionary Synthesis Center, Durham, NC 27705, USA; Department of Biology, University of North Carolina, Chapel Hill, NC 27599, USA; Insect Museum, Department of Entomology, North Carolina State University, Box 7613, Raleigh, NC 27695, USA; Department of Entomology, Pennsylvania State University, 501 ASI Building, University Park, PA 16802, USA; Illinois Natural History Survey, University of Illinois, 1816 South Oak Street, MC 652 Champaign, IL 61820, USA; and Department of Ecology, Evolution, and Organismal Biology, Iowa State University, Ames, IA 50011, USA
- Patricia L. Mullins
  - National Evolutionary Synthesis Center, Durham, NC 27705, USA; Department of Biology, University of North Carolina, Chapel Hill, NC 27599, USA; Insect Museum, Department of Entomology, North Carolina State University, Box 7613, Raleigh, NC 27695, USA; Department of Entomology, Pennsylvania State University, 501 ASI Building, University Park, PA 16802, USA; Illinois Natural History Survey, University of Illinois, 1816 South Oak Street, MC 652 Champaign, IL 61820, USA; and Department of Ecology, Evolution, and Organismal Biology, Iowa State University, Ames, IA 50011, USA
- Andrew R. Deans
  - National Evolutionary Synthesis Center, Durham, NC 27705, USA; Department of Biology, University of North Carolina, Chapel Hill, NC 27599, USA; Insect Museum, Department of Entomology, North Carolina State University, Box 7613, Raleigh, NC 27695, USA; Department of Entomology, Pennsylvania State University, 501 ASI Building, University Park, PA 16802, USA; Illinois Natural History Survey, University of Illinois, 1816 South Oak Street, MC 652 Champaign, IL 61820, USA; and Department of Ecology, Evolution, and Organismal Biology, Iowa State University, Ames, IA 50011, USA
31
Franz NM, Cardona-Duque J. Description of two new species and phylogenetic reassessment of Perelleschus O’Brien & Wibmer, 1986 (Coleoptera: Curculionidae), with a complete taxonomic concept history of Perelleschus sec. Franz & Cardona-Duque, 2013. Syst Biodivers 2013. [DOI: 10.1080/14772000.2013.806371] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 10/26/2022]
Affiliation(s)
- Nico M. Franz
  - School of Life Sciences, PO Box 874501, Arizona State University, Tempe, AZ 85287-4501, USA
- Juliana Cardona-Duque
  - Grupo de Entomología, Universidad de Antioquia (GEUA), Medellín, AA 1226, Colombia
32
Seltmann KC, Pénzes Z, Yoder MJ, Bertone MA, Deans AR. Utilizing descriptive statements from the Biodiversity Heritage Library to expand the Hymenoptera Anatomy Ontology. PLoS One 2013; 8:e55674. [PMID: 23441153 PMCID: PMC3575469 DOI: 10.1371/journal.pone.0055674] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 11/13/2012] [Accepted: 12/29/2012] [Indexed: 12/02/2022] Open
Abstract
Hymenoptera, the insect order that includes sawflies, bees, wasps, and ants, exhibits an incredible diversity of phenotypes, with over 145,000 species described in a corpus of textual knowledge since Carolus Linnaeus. In the absence of specialized training, often spanning decades, however, these articles can be challenging to decipher. Much of the vocabulary is domain-specific (e.g., Hymenoptera biology), historically without a comprehensive glossary, and contains much homonymous and synonymous terminology. The Hymenoptera Anatomy Ontology was developed to surmount this challenge and to aid future communication related to hymenopteran anatomy, as well as provide support for domain experts so they may actively benefit from the anatomy ontology development. As part of HAO development, an active learning, dictionary-based, natural language recognition tool was implemented to facilitate Hymenoptera anatomy term discovery in literature. We present this tool, referred to as the 'Proofer', as part of an iterative approach to growing phenotype-relevant ontologies, regardless of domain. The process of ontology development results in a critical mass of terms that is applied as a filter to the source collection of articles in order to reveal term occurrence and biases in natural language species descriptions. Our results indicate that taxonomists use domain-specific terminology that follows taxonomic specialization, particularly at superfamily and family level groupings and that the developed Proofer tool is effective for term discovery, facilitating ontology construction.
Affiliation(s)
- Katja C Seltmann
  - Department of Invertebrate Zoology, American Museum of Natural History, New York, New York, United States of America.
33
Applications of natural language processing in biodiversity science. Adv Bioinformatics 2012; 2012:391574. [PMID: 22685456 PMCID: PMC3364545 DOI: 10.1155/2012/391574] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.9] [Received: 11/04/2011] [Accepted: 02/15/2012] [Indexed: 12/11/2022] Open
Abstract
Centuries of biological knowledge are contained in the massive body of scientific literature, written for human readability but too big for any one person to consume. Large-scale mining of information from the literature is necessary if biology is to transform into a data-driven science. A computer can handle the volume but cannot make sense of the language. This paper reviews and discusses the use of natural language processing (NLP) and machine-learning algorithms to extract information from systematic literature. NLP algorithms have been used for decades, but require special development for application in the biological realm due to the special nature of the language. Many tools exist for biological information extraction (cellular processes, taxonomic names, and morphological characters), but none has been applied life-wide and most still require testing and development. Progress has been made in developing algorithms for automated annotation of taxonomic text, identification of taxonomic names in text, and extraction of morphological character information from taxonomic descriptions. This manuscript will briefly discuss the key steps in applying information extraction tools to enhance biodiversity science.