1
|
Vaidya G, Cellinese N, Lapp H. A new phylogenetic data standard for computable clade definitions: the Phyloreference Exchange Format (Phyx). PeerJ 2022; 10:e12618. [PMID: 35186448 PMCID: PMC8855714 DOI: 10.7717/peerj.12618] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/23/2021] [Accepted: 11/18/2021] [Indexed: 01/06/2023] Open
Abstract
To be computationally reproducible and efficient, integration of disparate data depends on shared entities whose matching meaning (semantics) can be computationally assessed. For biodiversity data one of the most prevalent shared entities for linking data records is the associated taxon concept. Unlike Linnaean taxon names, the traditional way in which taxon concepts are provided, phylogenetic definitions are native to phylogenetic trees and offer well-defined semantics that can be transformed into formal, computationally evaluable logic expressions. These attributes make them highly suitable for phylogeny-driven comparative biology by allowing computationally verifiable and reproducible integration of taxon-linked data against Tree of Life-scale phylogenies. To achieve this, the first step is transforming phylogenetic definitions from the natural language text in which they are published to a structured interoperable data format that maintains strong ties to semantics and lends itself well to sharing, reuse, and long-term archival. To this end, we developed the Phyloreference Exchange Format (Phyx), a JSON-LD-based text format encompassing rich metadata for all elements of a phylogenetic definition, and we created a supporting software library, phyx.js, to streamline computational management of such files. Together they form a foundation layer for digitizing and computing with phylogenetic definitions of clades.
Collapse
Affiliation(s)
- Gaurav Vaidya
- Renaissance Computing Institute (RENCI), University of North Carolina at Chapel Hill, Chapel Hill, NC, United States of America,Florida Museum of Natural History, University of Florida, Gainesville, FL, United States of America
| | - Nico Cellinese
- Florida Museum of Natural History, University of Florida, Gainesville, FL, United States of America,Informatics Institute, University of Florida, Gainesville, FL, United States of America
| | - Hilmar Lapp
- Department of Biostatistics and Bioinformatics, Duke University, Durham, NC, United States of America
| |
Collapse
|
2
|
Vos RA, Katayama T, Mishima H, Kawano S, Kawashima S, Kim JD, Moriya Y, Tokimatsu T, Yamaguchi A, Yamamoto Y, Wu H, Amstutz P, Antezana E, Aoki NP, Arakawa K, Bolleman JT, Bolton E, Bonnal RJP, Bono H, Burger K, Chiba H, Cohen KB, Deutsch EW, Fernández-Breis JT, Fu G, Fujisawa T, Fukushima A, García A, Goto N, Groza T, Hercus C, Hoehndorf R, Itaya K, Juty N, Kawashima T, Kim JH, Kinjo AR, Kotera M, Kozaki K, Kumagai S, Kushida T, Lütteke T, Matsubara M, Miyamoto J, Mohsen A, Mori H, Naito Y, Nakazato T, Nguyen-Xuan J, Nishida K, Nishida N, Nishide H, Ogishima S, Ohta T, Okuda S, Paten B, Perret JL, Prathipati P, Prins P, Queralt-Rosinach N, Shinmachi D, Suzuki S, Tabata T, Takatsuki T, Taylor K, Thompson M, Uchiyama I, Vieira B, Wei CH, Wilkinson M, Yamada I, Yamanaka R, Yoshitake K, Yoshizawa AC, Dumontier M, Kosaki K, Takagi T. BioHackathon 2015: Semantics of data for life sciences and reproducible research. F1000Res 2020; 9:136. [PMID: 32308977 PMCID: PMC7141167 DOI: 10.12688/f1000research.18236.1] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Accepted: 02/05/2020] [Indexed: 01/08/2023] Open
Abstract
We report on the activities of the 2015 edition of the BioHackathon, an annual event that brings together researchers and developers from around the world to develop tools and technologies that promote the reusability of biological data. We discuss issues surrounding the representation, publication, integration, mining and reuse of biological data and metadata across a wide range of biomedical data types of relevance for the life sciences, including chemistry, genotypes and phenotypes, orthology and phylogeny, proteomics, genomics, glycomics, and metabolomics. We describe our progress to address ongoing challenges to the reusability and reproducibility of research results, and identify outstanding issues that continue to impede the progress of bioinformatics research. We share our perspective on the state of the art, continued challenges, and goals for future research and development for the life sciences Semantic Web.
Collapse
Affiliation(s)
- Rutger A. Vos
- Institute of Biology Leiden, Leiden University, Leiden, The Netherlands
- Naturalis Biodiversity Center, Leiden, The Netherlands
| | | | - Hiroyuki Mishima
- Department of Human Genetics, Nagasaki University Graduate School of Biomedical Sciences, Nagasaki, Japan
| | - Shin Kawano
- Database Center for Life Science, Tokyo, Japan
| | | | | | - Yuki Moriya
- Database Center for Life Science, Tokyo, Japan
| | | | | | | | - Hongyan Wu
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
| | | | - Erick Antezana
- Department of Biology, Norwegian University of Science and Technology, Trondheim, Norway
| | - Nobuyuki P. Aoki
- Faculty of Science and Engineering, SOKA University, Tokyo, Japan
| | - Kazuharu Arakawa
- Institute for Advanced Biosciences, Keio University, Tokyo, Japan
| | - Jerven T. Bolleman
- SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, Lausanne, Switzerland
| | - Evan Bolton
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, USA
| | - Raoul J. P. Bonnal
- Istituto Nazionale Genetica Molecolare, Romeo ed Enrica Invernizzi, Milan, Italy
| | | | - Kees Burger
- Dutch Techcentre for Life Sciences, Utrecht, The Netherlands
| | - Hirokazu Chiba
- National Institute for Basic Biology, National Institutes of Natural Sciences, Okazaki, Japan
| | - Kevin B. Cohen
- Computational Bioscience Program, University of Colorado School of Medicine, Denver, USA
- Université Paris-Saclay, LIMSI, CNRS, Paris, France
| | | | | | - Gang Fu
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, USA
| | | | | | | | - Naohisa Goto
- Research Institute for Microbial Diseases, Osaka University, Osaka, Japan
| | - Tudor Groza
- St Vincent's Clinical School, Faculty of Medicine, University of New South Wales, Darlinghurst, Australia
- Kinghorn Centre for Clinical Genomics, Garvan Institute of Medical Research, Darlinghurst, Australia
| | - Colin Hercus
- Novocraft Technologies Sdn. Bhd., Selangor, Malaysia
| | - Robert Hoehndorf
- Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
| | - Kotone Itaya
- Institute for Advanced Biosciences, Keio University, Tokyo, Japan
| | - Nick Juty
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
| | | | - Jee-Hyub Kim
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
| | - Akira R. Kinjo
- Institute for Protein Research, Osaka University, Osaka, Japan
| | - Masaaki Kotera
- School of Life Science and Technology, Tokyo Institute of Technology, Tokyo, Japan
| | - Kouji Kozaki
- The Institute of Scientific and Industrial Research, Osaka University, Osaka, Japan
| | | | - Tatsuya Kushida
- National Bioscience Database Center, Japan Science and Technology Agency, Tokyo, Japan
| | - Thomas Lütteke
- Institute of Veterinary Physiology and Biochemistry, Justus-Liebig University Giessen, Giessen, Germany
- Gesellschaft für innovative Personalwirtschaftssysteme mbH (GIP GmbH), Offenbach, Germany
| | | | | | - Attayeb Mohsen
- National Institutes of Biomedical Innovation, Health and Nutrition, Osaka, Japan
| | - Hiroshi Mori
- Center for Information Biology, National Institute of Genetics, Mishima, Japan
| | - Yuki Naito
- Database Center for Life Science, Tokyo, Japan
| | | | | | | | - Naoki Nishida
- Department of Systems Science, Osaka University, Osaka, Japan
| | - Hiroyo Nishide
- National Institute for Basic Biology, National Institutes of Natural Sciences, Okazaki, Japan
| | - Soichi Ogishima
- Tohoku Medical Megabank Organization, Tohoku University, Sendai, Japan
| | - Tazro Ohta
- Database Center for Life Science, Tokyo, Japan
| | - Shujiro Okuda
- Niigata University Graduate School of Medical and Dental Sciences, Niigata, Japan
| | - Benedict Paten
- UC Santa Cruz Genomics Institute, University of California, Santa Cruz, USA
| | | | - Philip Prathipati
- National Institutes of Biomedical Innovation, Health and Nutrition, Osaka, Japan
| | - Pjotr Prins
- University Medical Center Utrecht, Utrecht, The Netherlands
- University of Tennessee Health Science Center, Memphis, USA
| | - Núria Queralt-Rosinach
- Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, USA
| | | | - Shinya Suzuki
- School of Life Science and Technology, Tokyo Institute of Technology, Tokyo, Japan
| | - Tsuyosi Tabata
- Graduate School of Pharmaceutical Sciences, Kyoto University, Kyoto, Japan
| | | | - Kieron Taylor
- European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
| | - Mark Thompson
- Leiden University Medical Center, Leiden, The Netherlands
| | - Ikuo Uchiyama
- National Institute for Basic Biology, National Institutes of Natural Sciences, Okazaki, Japan
| | - Bruno Vieira
- WurmLab, School of Biological & Chemical Sciences, Queen Mary University of London, London, UK
| | - Chih-Hsuan Wei
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, USA
| | - Mark Wilkinson
- Escuela Técnica Superior de Ingeniería Agronómica, Alimentaria y de Biosistemas, Universidad Politécnica de Madrid, Madrid, Spain
| | | | | | - Kazutoshi Yoshitake
- Graduate School of Agricultural and Life Sciences, The University of Tokyo, Tokyo, Japan
| | | | - Michel Dumontier
- Institute of Data Science, Maastricht University, Maastricht, The Netherlands
| | - Kenjiro Kosaki
- Center for Medical Genetics, Keio University School of Medicine, Tokyo, Japan
| | - Toshihisa Takagi
- National Bioscience Database Center, Japan Science and Technology Agency, Tokyo, Japan
- Department of Biological Sciences, Graduate School of Science, The University of Tokyo, Tokyo, Japan
| |
Collapse
|
3
|
ASP Applications in Bio-informatics: A Short Tour. KUNSTLICHE INTELLIGENZ 2018. [DOI: 10.1007/s13218-018-0551-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
|
4
|
Fernández-Breis JT, Chiba H, Legaz-García MDC, Uchiyama I. The Orthology Ontology: development and applications. J Biomed Semantics 2016; 7:34. [PMID: 27259657 PMCID: PMC4893294 DOI: 10.1186/s13326-016-0077-x] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/02/2015] [Accepted: 05/17/2016] [Indexed: 11/16/2022] Open
Abstract
Background Computational comparative analysis of multiple genomes provides valuable opportunities to biomedical research. In particular, orthology analysis can play a central role in comparative genomics; it guides establishing evolutionary relations among genes of organisms and allows functional inference of gene products. However, the wide variations in current orthology databases necessitate the research toward the shareability of the content that is generated by different tools and stored in different structures. Exchanging the content with other research communities requires making the meaning of the content explicit. Description The need for a common ontology has led to the creation of the Orthology Ontology (ORTH) following the best practices in ontology construction. Here, we describe our model and major entities of the ontology that is implemented in the Web Ontology Language (OWL), followed by the assessment of the quality of the ontology and the application of the ORTH to existing orthology datasets. This shareable ontology enables the possibility to develop Linked Orthology Datasets and a meta-predictor of orthology through standardization for the representation of orthology databases. The ORTH is freely available in OWL format to all users at http://purl.org/net/orth. Conclusions The Orthology Ontology can serve as a framework for the semantic standardization of orthology content and it will contribute to a better exploitation of orthology resources in biomedical research. The results demonstrate the feasibility of developing shareable datasets using this ontology. Further applications will maximize the usefulness of this ontology.
Collapse
Affiliation(s)
| | - Hirokazu Chiba
- National Institute for Basic Biology, National Institutes of Natural Sciences, Okazaki, 444-8585, Aichi, Japan
| | | | - Ikuo Uchiyama
- National Institute for Basic Biology, National Institutes of Natural Sciences, Okazaki, 444-8585, Aichi, Japan
| |
Collapse
|
5
|
Evolutionary relationships among barley and Arabidopsis core circadian clock and clock-associated genes. J Mol Evol 2015; 80:108-19. [PMID: 25608480 PMCID: PMC4320304 DOI: 10.1007/s00239-015-9665-0] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/22/2014] [Accepted: 01/06/2015] [Indexed: 12/13/2022]
Abstract
The circadian clock regulates a multitude of plant developmental and metabolic processes. In crop species, it contributes significantly to plant performance and productivity and to the adaptation and geographical range over which crops can be grown. To understand the clock in barley and how it relates to the components in the Arabidopsis thaliana clock, we have performed a systematic analysis of core circadian clock and clock-associated genes in barley, Arabidopsis and another eight species including tomato, potato, a range of monocotyledonous species and the moss, Physcomitrella patens. We have identified orthologues and paralogues of Arabidopsis genes which are conserved in all species, monocot/dicot differences, species-specific differences and variation in gene copy number (e.g. gene duplications among the various species). We propose that the common ancestor of barley and Arabidopsis had two-thirds of the key clock components identified in Arabidopsis prior to the separation of the monocot/dicot groups. After this separation, multiple independent gene duplication events took place in both monocot and dicot ancestors.
Collapse
|
6
|
Sonnhammer ELL, Gabaldón T, Sousa da Silva AW, Martin M, Robinson-Rechavi M, Boeckmann B, Thomas PD, Dessimoz C. Big data and other challenges in the quest for orthologs. Bioinformatics 2014; 30:2993-8. [PMID: 25064571 PMCID: PMC4201156 DOI: 10.1093/bioinformatics/btu492] [Citation(s) in RCA: 98] [Impact Index Per Article: 9.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/16/2014] [Revised: 06/25/2014] [Accepted: 07/16/2014] [Indexed: 01/29/2023] Open
Abstract
UNLABELLED Given the rapid increase of species with a sequenced genome, the need to identify orthologous genes between them has emerged as a central bioinformatics task. Many different methods exist for orthology detection, which makes it difficult to decide which one to choose for a particular application. Here, we review the latest developments and issues in the orthology field, and summarize the most recent results reported at the third 'Quest for Orthologs' meeting. We focus on community efforts such as the adoption of reference proteomes, standard file formats and benchmarking. Progress in these areas is good, and they are already beneficial to both orthology consumers and providers. However, a major current issue is that the massive increase in complete proteomes poses computational challenges to many of the ortholog database providers, as most orthology inference algorithms scale at least quadratically with the number of proteomes. The Quest for Orthologs consortium is an open community with a number of working groups that join efforts to enhance various aspects of orthology analysis, such as defining standard formats and datasets, documenting community resources and benchmarking. AVAILABILITY AND IMPLEMENTATION All such materials are available at http://questfororthologs.org.
Collapse
Affiliation(s)
- Erik L L Sonnhammer
- Stockholm Bioinformatics Center, Science for Life Laboratory, Box 1031, SE-17121 Solna, Sweden, Swedish eScience Research Center, Stockholm, Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden, Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG), 08003 Barcelona, Spain, Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain, Institució Catalana de Recerca i Estudis Avançats (ICREA), 08010 Barcelona, Spain, EMBL-European Bioinformatics Institute, Hinxton CB10 1SD, UK, Department of Ecology and Evolution, University of Lausanne, Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland, SwissProt, Swiss Institute of Bioinformatics, 1211 Geneva, Switzerland, Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90089, USA and Department of Genetics, Evolution and Environment, and Department of Computer Science, University College London, Gower St, London WC1E 6BT, UK Stockholm Bioinformatics Center, Science for Life Laboratory, Box 1031, SE-17121 Solna, Sweden, Swedish eScience Research Center, Stockholm, Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden, Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG), 08003 Barcelona, Spain, Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain, Institució Catalana de Recerca i Estudis Avançats (ICREA), 08010 Barcelona, Spain, EMBL-European Bioinformatics Institute, Hinxton CB10 1SD, UK, Department of Ecology and Evolution, University of Lausanne, Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland, SwissProt, Swiss Institute of Bioinformatics, 1211 Geneva, Switzerland, Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90089, USA and Department of Genetics, Evolution and Environment, and Department of Computer Science, University College London, Gower St, London
| | - Toni Gabaldón
- Stockholm Bioinformatics Center, Science for Life Laboratory, Box 1031, SE-17121 Solna, Sweden, Swedish eScience Research Center, Stockholm, Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden, Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG), 08003 Barcelona, Spain, Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain, Institució Catalana de Recerca i Estudis Avançats (ICREA), 08010 Barcelona, Spain, EMBL-European Bioinformatics Institute, Hinxton CB10 1SD, UK, Department of Ecology and Evolution, University of Lausanne, Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland, SwissProt, Swiss Institute of Bioinformatics, 1211 Geneva, Switzerland, Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90089, USA and Department of Genetics, Evolution and Environment, and Department of Computer Science, University College London, Gower St, London WC1E 6BT, UK Stockholm Bioinformatics Center, Science for Life Laboratory, Box 1031, SE-17121 Solna, Sweden, Swedish eScience Research Center, Stockholm, Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden, Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG), 08003 Barcelona, Spain, Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain, Institució Catalana de Recerca i Estudis Avançats (ICREA), 08010 Barcelona, Spain, EMBL-European Bioinformatics Institute, Hinxton CB10 1SD, UK, Department of Ecology and Evolution, University of Lausanne, Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland, SwissProt, Swiss Institute of Bioinformatics, 1211 Geneva, Switzerland, Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90089, USA and Department of Genetics, Evolution and Environment, and Department of Computer Science, University College London, Gower St, London
| | - Alan W Sousa da Silva
- Stockholm Bioinformatics Center, Science for Life Laboratory, Box 1031, SE-17121 Solna, Sweden, Swedish eScience Research Center, Stockholm, Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden, Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG), 08003 Barcelona, Spain, Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain, Institució Catalana de Recerca i Estudis Avançats (ICREA), 08010 Barcelona, Spain, EMBL-European Bioinformatics Institute, Hinxton CB10 1SD, UK, Department of Ecology and Evolution, University of Lausanne, Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland, SwissProt, Swiss Institute of Bioinformatics, 1211 Geneva, Switzerland, Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90089, USA and Department of Genetics, Evolution and Environment, and Department of Computer Science, University College London, Gower St, London WC1E 6BT, UK
| | - Maria Martin
- Stockholm Bioinformatics Center, Science for Life Laboratory, Box 1031, SE-17121 Solna, Sweden, Swedish eScience Research Center, Stockholm, Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden, Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG), 08003 Barcelona, Spain, Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain, Institució Catalana de Recerca i Estudis Avançats (ICREA), 08010 Barcelona, Spain, EMBL-European Bioinformatics Institute, Hinxton CB10 1SD, UK, Department of Ecology and Evolution, University of Lausanne, Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland, SwissProt, Swiss Institute of Bioinformatics, 1211 Geneva, Switzerland, Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90089, USA and Department of Genetics, Evolution and Environment, and Department of Computer Science, University College London, Gower St, London WC1E 6BT, UK
| | - Marc Robinson-Rechavi
- Stockholm Bioinformatics Center, Science for Life Laboratory, Box 1031, SE-17121 Solna, Sweden, Swedish eScience Research Center, Stockholm, Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden, Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG), 08003 Barcelona, Spain, Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain, Institució Catalana de Recerca i Estudis Avançats (ICREA), 08010 Barcelona, Spain, EMBL-European Bioinformatics Institute, Hinxton CB10 1SD, UK, Department of Ecology and Evolution, University of Lausanne, Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland, SwissProt, Swiss Institute of Bioinformatics, 1211 Geneva, Switzerland, Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90089, USA and Department of Genetics, Evolution and Environment, and Department of Computer Science, University College London, Gower St, London WC1E 6BT, UK Stockholm Bioinformatics Center, Science for Life Laboratory, Box 1031, SE-17121 Solna, Sweden, Swedish eScience Research Center, Stockholm, Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden, Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG), 08003 Barcelona, Spain, Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain, Institució Catalana de Recerca i Estudis Avançats (ICREA), 08010 Barcelona, Spain, EMBL-European Bioinformatics Institute, Hinxton CB10 1SD, UK, Department of Ecology and Evolution, University of Lausanne, Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland, SwissProt, Swiss Institute of Bioinformatics, 1211 Geneva, Switzerland, Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90089, USA and Department of Genetics, Evolution and Environment, and Department of Computer Science, University College London, Gower St, London
| | - Brigitte Boeckmann
- Stockholm Bioinformatics Center, Science for Life Laboratory, Box 1031, SE-17121 Solna, Sweden, Swedish eScience Research Center, Stockholm, Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden, Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG), 08003 Barcelona, Spain, Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain, Institució Catalana de Recerca i Estudis Avançats (ICREA), 08010 Barcelona, Spain, EMBL-European Bioinformatics Institute, Hinxton CB10 1SD, UK, Department of Ecology and Evolution, University of Lausanne, Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland, SwissProt, Swiss Institute of Bioinformatics, 1211 Geneva, Switzerland, Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90089, USA and Department of Genetics, Evolution and Environment, and Department of Computer Science, University College London, Gower St, London WC1E 6BT, UK
| | - Paul D Thomas
- Stockholm Bioinformatics Center, Science for Life Laboratory, Box 1031, SE-17121 Solna, Sweden, Swedish eScience Research Center, Stockholm, Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden, Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG), 08003 Barcelona, Spain, Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain, Institució Catalana de Recerca i Estudis Avançats (ICREA), 08010 Barcelona, Spain, EMBL-European Bioinformatics Institute, Hinxton CB10 1SD, UK, Department of Ecology and Evolution, University of Lausanne, Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland, SwissProt, Swiss Institute of Bioinformatics, 1211 Geneva, Switzerland, Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90089, USA and Department of Genetics, Evolution and Environment, and Department of Computer Science, University College London, Gower St, London WC1E 6BT, UK
| | - Christophe Dessimoz
- Stockholm Bioinformatics Center, Science for Life Laboratory, Box 1031, SE-17121 Solna, Sweden, Swedish eScience Research Center, Stockholm, Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden, Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG), 08003 Barcelona, Spain, Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain, Institució Catalana de Recerca i Estudis Avançats (ICREA), 08010 Barcelona, Spain, EMBL-European Bioinformatics Institute, Hinxton CB10 1SD, UK, Department of Ecology and Evolution, University of Lausanne, Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland, SwissProt, Swiss Institute of Bioinformatics, 1211 Geneva, Switzerland, Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90089, USA and Department of Genetics, Evolution and Environment, and Department of Computer Science, University College London, Gower St, London WC1E 6BT, UK Stockholm Bioinformatics Center, Science for Life Laboratory, Box 1031, SE-17121 Solna, Sweden, Swedish eScience Research Center, Stockholm, Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden, Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG), 08003 Barcelona, Spain, Universitat Pompeu Fabra (UPF), 08003 Barcelona, Spain, Institució Catalana de Recerca i Estudis Avançats (ICREA), 08010 Barcelona, Spain, EMBL-European Bioinformatics Institute, Hinxton CB10 1SD, UK, Department of Ecology and Evolution, University of Lausanne, Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland, SwissProt, Swiss Institute of Bioinformatics, 1211 Geneva, Switzerland, Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90089, USA and Department of Genetics, Evolution and Environment, and Department of Computer Science, University College London, Gower St, London
| |
Collapse
|
7
|
Ramírez MJ, Michalik P. Calculating structural complexity in phylogenies using ancestral ontologies. Cladistics 2014; 30:635-649. [DOI: 10.1111/cla.12075] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 02/20/2014] [Indexed: 01/29/2023] Open
Affiliation(s)
- Martín J. Ramírez
- Museo Argentino de Ciencias Naturales “Bernardino Rivadavia” - CONICET; Av. Angel Gallardo 470 C1405DJR Buenos Aires Argentina
| | - Peter Michalik
- Zoologisches Institut und Museum; Ernst-Moritz-Arndt-Universität; J.-S.-Bach-Str. 11/12 D-17489 Greifswald Germany
| |
Collapse
|
8
|
Panahiazar M, Sheth AP, Ranabahu A, Vos RA, Leebens-Mack J. Advancing data reuse in phyloinformatics using an ontology-driven Semantic Web approach. BMC Med Genomics 2013; 6 Suppl 3:S5. [PMID: 24565381 PMCID: PMC3980757 DOI: 10.1186/1755-8794-6-s3-s5] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022] Open
Abstract
Phylogenetic analyses can resolve historical relationships among genes, organisms or higher taxa. Understanding such relationships can elucidate a wide range of biological phenomena, including, for example, the importance of gene and genome duplications in the evolution of gene function, the role of adaptation as a driver of diversification, or the evolutionary consequences of biogeographic shifts. Phyloinformaticists are developing data standards, databases and communication protocols (e.g. Application Programming Interfaces, APIs) to extend the accessibility of gene trees, species trees, and the metadata necessary to interpret these trees, thus enabling researchers across the life sciences to reuse phylogenetic knowledge. Specifically, Semantic Web technologies are being developed to make phylogenetic knowledge interpretable by web agents, thereby enabling intelligently automated, high-throughput reuse of results generated by phylogenetic research. This manuscript describes an ontology-driven, semantic problem-solving environment for phylogenetic analyses and introduces artefacts that can promote phyloinformatic efforts to promote accessibility of trees and underlying metadata. PhylOnt is an extensible ontology with concepts describing tree types and tree building methodologies including estimation methods, models and programs. In addition we present the PhylAnt platform for annotating scientific articles and NeXML files with PhylOnt concepts. The novelty of this work is the annotation of NeXML files and phylogenetic related documents with PhylOnt Ontology. This approach advances data reuse in phyloinformatics.
Collapse
|
9
|
Balhoff JP, Mikó I, Yoder MJ, Mullins PL, Deans AR. A semantic model for species description applied to the ensign wasps (hymenoptera: evaniidae) of New Caledonia. Syst Biol 2013; 62:639-59. [PMID: 23652347 PMCID: PMC3739881 DOI: 10.1093/sysbio/syt028] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/24/2012] [Revised: 02/14/2013] [Accepted: 04/23/2013] [Indexed: 12/01/2022] Open
Abstract
Taxonomic descriptions are unparalleled sources of knowledge of life's phenotypic diversity. As natural language prose, these data sets are largely refractory to computation and integration with other sources of phenotypic data. By formalizing taxonomic descriptions using ontology-based semantic representation, we aim to increase the reusability and computability of taxonomists' primary data. Here, we present a revision of the ensign wasp (Hymenoptera: Evaniidae) fauna of New Caledonia using this new model for species description. Descriptive matrices, specimen data, and taxonomic nomenclature are gathered in a unified Web-based application, mx, then exported as both traditional taxonomic treatments and semantic statements using the OWL Web Ontology Language. Character:character-state combinations are then annotated following the entity-quality phenotype model, originally developed to represent mutant model organism phenotype data; concepts of anatomy are drawn from the Hymenoptera Anatomy Ontology and linked to phenotype descriptors from the Phenotypic Quality Ontology. The resulting set of semantic statements is provided in Resource Description Framework format. Applying the model to real data, that is, specimens, taxonomic names, diagnoses, descriptions, and redescriptions, provides us with a foundation to discuss limitations and potential benefits such as automated data integration and reasoner-driven queries. Four species of ensign wasp are now known to occur in New Caledonia: Szepligetella levipetiolata, Szepligetella deercreeki Deans and Mikó sp. nov., Szepligetella irwini Deans and Mikó sp. nov., and the nearly cosmopolitan Evania appendigaster. A fifth species, Szepligetella sericea, including Szepligetella impressa, syn. nov., has not yet been collected in New Caledonia but can be found on islands throughout the Pacific and so is included in the diagnostic key.
Collapse
Affiliation(s)
- James P. Balhoff
- National Evolutionary Synthesis Center, Durham, NC 27705, USA; Department of Biology, University of North Carolina, Chapel Hill, NC 27599, USA; Insect Museum, Department of Entomology, North Carolina State University, Box 7613, Raleigh, NC 27695, USA; Department of Entomology, Pennsylvania State University, 501 ASI Building, University Park, PA 16802, USA; Illinois Natural History Survey, University of Illinois, 1816 South Oak Street, MC 652 Champaign, IL 61820, USA; and Department of Ecology, Evolution, and Organismal Biology, Iowa State University, Ames, IA 50011, USA
| | - István Mikó
- National Evolutionary Synthesis Center, Durham, NC 27705, USA; Department of Biology, University of North Carolina, Chapel Hill, NC 27599, USA; Insect Museum, Department of Entomology, North Carolina State University, Box 7613, Raleigh, NC 27695, USA; Department of Entomology, Pennsylvania State University, 501 ASI Building, University Park, PA 16802, USA; Illinois Natural History Survey, University of Illinois, 1816 South Oak Street, MC 652 Champaign, IL 61820, USA; and Department of Ecology, Evolution, and Organismal Biology, Iowa State University, Ames, IA 50011, USA
| | - Matthew J. Yoder
- National Evolutionary Synthesis Center, Durham, NC 27705, USA; Department of Biology, University of North Carolina, Chapel Hill, NC 27599, USA; Insect Museum, Department of Entomology, North Carolina State University, Box 7613, Raleigh, NC 27695, USA; Department of Entomology, Pennsylvania State University, 501 ASI Building, University Park, PA 16802, USA; Illinois Natural History Survey, University of Illinois, 1816 South Oak Street, MC 652 Champaign, IL 61820, USA; and Department of Ecology, Evolution, and Organismal Biology, Iowa State University, Ames, IA 50011, USA
| | - Patricia L. Mullins
- National Evolutionary Synthesis Center, Durham, NC 27705, USA; Department of Biology, University of North Carolina, Chapel Hill, NC 27599, USA; Insect Museum, Department of Entomology, North Carolina State University, Box 7613, Raleigh, NC 27695, USA; Department of Entomology, Pennsylvania State University, 501 ASI Building, University Park, PA 16802, USA; Illinois Natural History Survey, University of Illinois, 1816 South Oak Street, MC 652 Champaign, IL 61820, USA; and Department of Ecology, Evolution, and Organismal Biology, Iowa State University, Ames, IA 50011, USA
| | - Andrew R. Deans
- National Evolutionary Synthesis Center, Durham, NC 27705, USA; Department of Biology, University of North Carolina, Chapel Hill, NC 27599, USA; Insect Museum, Department of Entomology, North Carolina State University, Box 7613, Raleigh, NC 27695, USA; Department of Entomology, Pennsylvania State University, 501 ASI Building, University Park, PA 16802, USA; Illinois Natural History Survey, University of Illinois, 1816 South Oak Street, MC 652 Champaign, IL 61820, USA; and Department of Ecology, Evolution, and Organismal Biology, Iowa State University, Ames, IA 50011, USA
| |
Collapse
|
10
|
Ison J, Kalas M, Jonassen I, Bolser D, Uludag M, McWilliam H, Malone J, Lopez R, Pettifer S, Rice P. EDAM: an ontology of bioinformatics operations, types of data and identifiers, topics and formats. Bioinformatics 2013; 29:1325-32. [PMID: 23479348 PMCID: PMC3654706 DOI: 10.1093/bioinformatics/btt113] [Citation(s) in RCA: 126] [Impact Index Per Article: 11.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/16/2012] [Revised: 02/28/2013] [Accepted: 03/01/2013] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Advancing the search, publication and integration of bioinformatics tools and resources demands consistent machine-understandable descriptions. A comprehensive ontology allowing such descriptions is therefore required. RESULTS EDAM is an ontology of bioinformatics operations (tool or workflow functions), types of data and identifiers, application domains and data formats. EDAM supports semantic annotation of diverse entities such as Web services, databases, programmatic libraries, standalone tools, interactive applications, data schemas, datasets and publications within bioinformatics. EDAM applies to organizing and finding suitable tools and data and to automating their integration into complex applications or workflows. It includes over 2200 defined concepts and has successfully been used for annotations and implementations. AVAILABILITY The latest stable version of EDAM is available in OWL format from http://edamontology.org/EDAM.owl and in OBO format from http://edamontology.org/EDAM.obo. It can be viewed online at the NCBO BioPortal and the EBI Ontology Lookup Service. For documentation and license please refer to http://edamontology.org. This article describes version 1.2 available at http://edamontology.org/EDAM_1.2.owl. CONTACT jison@ebi.ac.uk.
Collapse
Affiliation(s)
- Jon Ison
- EMBL European Bioinformatics Institute, Hinxton, Cambridge CB10 1SD, UK.
| | | | | | | | | | | | | | | | | | | |
Collapse
|
11
|
Stoltzfus A, Lapp H, Matasci N, Deus H, Sidlauskas B, Zmasek CM, Vaidya G, Pontelli E, Cranston K, Vos R, Webb CO, Harmon LJ, Pirrung M, O'Meara B, Pennell MW, Mirarab S, Rosenberg MS, Balhoff JP, Bik HM, Heath TA, Midford PE, Brown JW, McTavish EJ, Sukumaran J, Westneat M, Alfaro ME, Steele A, Jordan G. Phylotastic! Making tree-of-life knowledge accessible, reusable and convenient. BMC Bioinformatics 2013; 14:158. [PMID: 23668630 PMCID: PMC3669619 DOI: 10.1186/1471-2105-14-158] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/18/2013] [Accepted: 04/30/2013] [Indexed: 01/17/2023] Open
Abstract
BACKGROUND Scientists rarely reuse expert knowledge of phylogeny, in spite of years of effort to assemble a great "Tree of Life" (ToL). A notable exception involves the use of Phylomatic, which provides tools to generate custom phylogenies from a large, pre-computed, expert phylogeny of plant taxa. This suggests great potential for a more generalized system that, starting with a query consisting of a list of any known species, would rectify non-standard names, identify expert phylogenies containing the implicated taxa, prune away unneeded parts, and supply branch lengths and annotations, resulting in a custom phylogeny suited to the user's needs. Such a system could become a sustainable community resource if implemented as a distributed system of loosely coupled parts that interact through clearly defined interfaces. RESULTS With the aim of building such a "phylotastic" system, the NESCent Hackathons, Interoperability, Phylogenies (HIP) working group recruited 2 dozen scientist-programmers to a weeklong programming hackathon in June 2012. During the hackathon (and a three-month follow-up period), 5 teams produced designs, implementations, documentation, presentations, and tests including: (1) a generalized scheme for integrating components; (2) proof-of-concept pruners and controllers; (3) a meta-API for taxonomic name resolution services; (4) a system for storing, finding, and retrieving phylogenies using semantic web technologies for data exchange, storage, and querying; (5) an innovative new service, DateLife.org, which synthesizes pre-computed, time-calibrated phylogenies to assign ages to nodes; and (6) demonstration projects. These outcomes are accessible via a public code repository (GitHub.com), a website (http://www.phylotastic.org), and a server image. CONCLUSIONS Approximately 9 person-months of effort (centered on a software development hackathon) resulted in the design and implementation of proof-of-concept software for 4 core phylotastic components, 3 controllers, and 3 end-user demonstration tools. While these products have substantial limitations, they suggest considerable potential for a distributed system that makes phylogenetic knowledge readily accessible in computable form. Widespread use of phylotastic systems will create an electronic marketplace for sharing phylogenetic knowledge that will spur innovation in other areas of the ToL enterprise, such as annotation of sources and methods and third-party methods of quality assessment.
Collapse
Affiliation(s)
- Arlin Stoltzfus
- Institute for Bioscience and Biotechnology Research (IBBR), Biosystems and Biomaterials Division, National Institute of Standards and Technology, Gaithersburg, MD, 20899, USA
| | - Hilmar Lapp
- National Evolutionary Synthesis Center, 2024 W. Main St, Durham, NC, 27705, USA
| | - Naim Matasci
- The iPlant Collaborative and EEB Department, University of Arizona, 1657 E Helen St, Tucson, AZ, 85721, USA
| | - Helena Deus
- Digital Enterprise Research Institute, National University of Ireland, University Road, Galway, Ireland
| | - Brian Sidlauskas
- Department of Fisheries and Wildlife, Oregon State University, 104 Nash Hall, Corvallis, OR, 97331-3803, USA
| | - Christian M Zmasek
- Sanford-Burnham Medical Research Institute, 10901 North Torrey Pines Road, La Jolla, CA, 92037, USA
| | - Gaurav Vaidya
- Department of Ecology and Evolutionary Biology, University of Colorado Boulder, Boulder, CO, 80309-0334, USA
| | - Enrico Pontelli
- Department of Computer Science, New Mexico State University, MSC CS, Box 30001, Las Cruces, NM, 88003, USA
| | - Karen Cranston
- National Evolutionary Synthesis Center, 2024 W. Main St, Durham, NC, 27705, USA
| | - Rutger Vos
- NCB Naturalis, Einsteinweg 2, Leiden, 2333 CC, the Netherlands
| | - Campbell O Webb
- Arnold Arboretum of Harvard University, Boston, MA, 02130, USA
| | - Luke J Harmon
- Institute for Bioinformatics and Evolutionary Studies (IBEST), University of Idaho, PO Box 443051, Moscow, ID, 83844-3051, USA
| | - Megan Pirrung
- University of Colorado Denver Anschutz Medical Campus, Aurora, CO, 80045, USA
| | - Brian O'Meara
- Department of Ecology & Evolutionary Biology, 569 Dabney Hall, University of Tennessee, Knoxville, TN, 37996, USA
| | - Matthew W Pennell
- Institute for Bioinformatics and Evolutionary Studies (IBEST), University of Idaho, PO Box 443051, Moscow, ID, 83844-3051, USA
| | - Siavash Mirarab
- Department of Computer Science, University of Texas at Austin, Austin, TX, 78701, USA
| | - Michael S Rosenberg
- Center for Evolutionary Medicine and Informatics, The Biodesign Institute, and School of Life Sciences, Arizona State University, PO Box 874501, Tempe, AZ, 85287-4501, USA
| | - James P Balhoff
- National Evolutionary Synthesis Center, 2024 W. Main St, Durham, NC, 27705, USA
| | - Holly M Bik
- UC Davis Genome Center, One Shields Ave, Davis, CA, 95618, USA
| | - Tracy A Heath
- Department of Integrative Biology, University of California, Berkeley, CA, 94720-3140, USA
| | - Peter E Midford
- National Evolutionary Synthesis Center, 2024 W. Main St, Durham, NC, 27705, USA
| | - Joseph W Brown
- Institute for Bioinformatics and Evolutionary Studies (IBEST), University of Idaho, PO Box 443051, Moscow, ID, 83844-3051, USA
| | | | - Jeet Sukumaran
- Biology Department, Duke University, Biological Sciences Building, 125 Science Drive, Durham, NC, 27708, USA
| | - Mark Westneat
- Biodiversity Synthesis Center, Field Museum of Natural History, 1400 S Lakeshore Dr, Chicago, IL, 60605, USA
| | - Michael E Alfaro
- Department of Ecology and Evolutionary Biology, South University of California Los Angeles, 621 Charles E. Young Dr, Los Angeles, CA, 90095, USA
| | - Aaron Steele
- U.C. Berkeley Museum of Vertebrate Zoology, University of California, 3101 Valley Life Sciences Building, Berkeley, CA, 94720, USA
| | - Greg Jordan
- Paperpile, 34 Houghton Street, Somerville, MA, 02143, USA
| |
Collapse
|
12
|
The 3rd DBCLS BioHackathon: improving life science data integration with Semantic Web technologies. J Biomed Semantics 2013; 4:6. [PMID: 23398680 PMCID: PMC3598643 DOI: 10.1186/2041-1480-4-6] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/13/2012] [Accepted: 02/05/2013] [Indexed: 01/20/2023] Open
Abstract
BACKGROUND BioHackathon 2010 was the third in a series of meetings hosted by the Database Center for Life Sciences (DBCLS) in Tokyo, Japan. The overall goal of the BioHackathon series is to improve the quality and accessibility of life science research data on the Web by bringing together representatives from public databases, analytical tool providers, and cyber-infrastructure researchers to jointly tackle important challenges in the area of in silico biological research. RESULTS The theme of BioHackathon 2010 was the 'Semantic Web', and all attendees gathered with the shared goal of producing Semantic Web data from their respective resources, and/or consuming or interacting those data using their tools and interfaces. We discussed on topics including guidelines for designing semantic data and interoperability of resources. We consequently developed tools and clients for analysis and visualization. CONCLUSION We provide a meeting report from BioHackathon 2010, in which we describe the discussions, decisions, and breakthroughs made as we moved towards compliance with Semantic Web technologies - from source provider, through middleware, to the end-consumer.
Collapse
|
13
|
Talevich E, Invergo BM, Cock PJA, Chapman BA. Bio.Phylo: a unified toolkit for processing, analyzing and visualizing phylogenetic trees in Biopython. BMC Bioinformatics 2012; 13:209. [PMID: 22909249 PMCID: PMC3468381 DOI: 10.1186/1471-2105-13-209] [Citation(s) in RCA: 80] [Impact Index Per Article: 6.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/06/2012] [Accepted: 08/08/2012] [Indexed: 02/01/2023] Open
Abstract
BACKGROUND Ongoing innovation in phylogenetics and evolutionary biology has been accompanied by a proliferation of software tools, data formats, analytical techniques and web servers. This brings with it the challenge of integrating phylogenetic and other related biological data found in a wide variety of formats, and underlines the need for reusable software that can read, manipulate and transform this information into the various forms required to build computational pipelines. RESULTS We built a Python software library for working with phylogenetic data that is tightly integrated with Biopython, a broad-ranging toolkit for computational biology. Our library, Bio.Phylo, is highly interoperable with existing libraries, tools and standards, and is capable of parsing common file formats for phylogenetic trees, performing basic transformations and manipulations, attaching rich annotations, and visualizing trees. We unified the modules for working with the standard file formats Newick, NEXUS and phyloXML behind a consistent and simple API, providing a common set of functionality independent of the data source. CONCLUSIONS Bio.Phylo meets a growing need in bioinformatics for working with heterogeneous types of phylogenetic data. By supporting interoperability with multiple file formats and leveraging existing Biopython features, this library simplifies the construction of phylogenetic workflows. We also provide examples of the benefits of building a community around a shared open-source project. Bio.Phylo is included with Biopython, available through the Biopython website, http://biopython.org.
Collapse
Affiliation(s)
- Eric Talevich
- Institute of Bioinformatics, University of Georgia, 120 Green Street, Athens, GA 30602, USA
| | - Brandon M Invergo
- Institute of Evolutionary Biology (CSIC-UPF), CEXS-UPF-PRBB, C/ Doctor Aiguader 88, 08003 Barcelona, Spain
| | - Peter JA Cock
- James Hutton Institute, InvergowrieDundee DD2 5DA, UK
| | - Brad A Chapman
- Harvard School of Public Health Bioinformatics Core, 655 Huntington Ave, Boston, MA 02115, USA
| |
Collapse
|
14
|
Vos RA, Balhoff JP, Caravas JA, Holder MT, Lapp H, Maddison WP, Midford PE, Priyam A, Sukumaran J, Xia X, Stoltzfus A. NeXML: rich, extensible, and verifiable representation of comparative data and metadata. Syst Biol 2012; 61:675-89. [PMID: 22357728 PMCID: PMC3376374 DOI: 10.1093/sysbio/sys025] [Citation(s) in RCA: 64] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/16/2011] [Revised: 07/29/2011] [Accepted: 02/07/2012] [Indexed: 12/13/2022] Open
Abstract
In scientific research, integration and synthesis require a common understanding of where data come from, how much they can be trusted, and what they may be used for. To make such an understanding computer-accessible requires standards for exchanging richly annotated data. The challenges of conveying reusable data are particularly acute in regard to evolutionary comparative analysis, which comprises an ever-expanding list of data types, methods, research aims, and subdisciplines. To facilitate interoperability in evolutionary comparative analysis, we present NeXML, an XML standard (inspired by the current standard, NEXUS) that supports exchange of richly annotated comparative data. NeXML defines syntax for operational taxonomic units, character-state matrices, and phylogenetic trees and networks. Documents can be validated unambiguously. Importantly, any data element can be annotated, to an arbitrary degree of richness, using a system that is both flexible and rigorous. We describe how the use of NeXML by the TreeBASE and Phenoscape projects satisfies user needs that cannot be satisfied with other available file formats. By relying on XML Schema Definition, the design of NeXML facilitates the development and deployment of software for processing, transforming, and querying documents. The adoption of NeXML for practical use is facilitated by the availability of (1) an online manual with code samples and a reference to all defined elements and attributes, (2) programming toolkits in most of the languages used commonly in evolutionary informatics, and (3) input-output support in several widely used software applications. An active, open, community-based development process enables future revision and expansion of NeXML.
Collapse
|
15
|
Abstract
In gene prediction, studying phenotypes is highly valuable for reducing the number of locus candidates in association studies and to aid disease gene candidate prioritization. This is due to the intrinsic nature of phenotypes to visibly reflect genetic activity, making them potentially one of the most useful data types for functional studies. However, systematic use of these data has begun only recently. 'Comparative phenomics' is the analysis of genotype-phenotype associations across species and experimental methods. This is an emerging research field of utmost importance for gene discovery and gene function annotation. In this chapter, we review the use of phenotype data in the biomedical field. We will give an overview of phenotype resources, focusing on PhenomicDB--a cross-species genotype-phenotype database--which is the largest available collection of phenotype descriptions across species and experimental methods. We report on its latest extension by which genotype-phenotype relationships can be viewed as graphical representations of similar phenotypes clustered together ('phenoclusters'), supplemented with information from protein-protein interactions and Gene Ontology terms. We show that such 'phenoclusters' represent a novel approach to group genes functionally and to predict novel gene functions with high precision. We explain how these data and methods can be used to supplement the results of gene discovery approaches. The aim of this chapter is to assist researchers interested in understanding how phenotype data can be used effectively in the gene discovery field.
Collapse
|
16
|
Chisham B, Wright B, Le T, Son TC, Pontelli E. CDAO-store: ontology-driven data integration for phylogenetic analysis. BMC Bioinformatics 2011; 12:98. [PMID: 21496247 PMCID: PMC3101187 DOI: 10.1186/1471-2105-12-98] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/08/2010] [Accepted: 04/15/2011] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The Comparative Data Analysis Ontology (CDAO) is an ontology developed, as part of the EvoInfo and EvoIO groups supported by the National Evolutionary Synthesis Center, to provide semantic descriptions of data and transformations commonly found in the domain of phylogenetic analysis. The core concepts of the ontology enable the description of phylogenetic trees and associated character data matrices. RESULTS Using CDAO as the semantic back-end, we developed a triple-store, named CDAO-Store. CDAO-Store is a RDF-based store of phylogenetic data, including a complete import of TreeBASE. CDAO-Store provides a programmatic interface, in the form of web services, and a web-based front-end, to perform both user-defined as well as domain-specific queries; domain-specific queries include search for nearest common ancestors, minimum spanning clades, filter multiple trees in the store by size, author, taxa, tree identifier, algorithm or method. In addition, CDAO-Store provides a visualization front-end, called CDAO-Explorer, which can be used to view both character data matrices and trees extracted from the CDAO-Store. CDAO-Store provides import capabilities, enabling the addition of new data to the triple-store; files in PHYLIP, MEGA, nexml, and NEXUS formats can be imported and their CDAO representations added to the triple-store. CONCLUSIONS CDAO-Store is made up of a versatile and integrated set of tools to support phylogenetic analysis. To the best of our knowledge, CDAO-Store is the first semantically-aware repository of phylogenetic data with domain-specific querying capabilities. The portal to CDAO-Store is available at http://www.cs.nmsu.edu/~cdaostore.
Collapse
Affiliation(s)
- Brandon Chisham
- Department of Computer Science, New Mexico State University, Las Cruces, USA
| | | | | | | | | |
Collapse
|
17
|
Vos RA, Caravas J, Hartmann K, Jensen MA, Miller C. BIO::Phylo-phyloinformatic analysis using perl. BMC Bioinformatics 2011; 12:63. [PMID: 21352572 PMCID: PMC3056726 DOI: 10.1186/1471-2105-12-63] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/21/2010] [Accepted: 02/27/2011] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Phyloinformatic analyses involve large amounts of data and metadata of complex structure. Collecting, processing, analyzing, visualizing and summarizing these data and metadata should be done in steps that can be automated and reproduced. This requires flexible, modular toolkits that can represent, manipulate and persist phylogenetic data and metadata as objects with programmable interfaces. RESULTS This paper presents Bio::Phylo, a Perl5 toolkit for phyloinformatic analysis. It implements classes and methods that are compatible with the well-known BioPerl toolkit, but is independent from it (making it easy to install) and features a richer API and a data model that is better able to manage the complex relationships between different fundamental data and metadata objects in phylogenetics. It supports commonly used file formats for phylogenetic data including the novel NeXML standard, which allows rich annotations of phylogenetic data to be stored and shared. Bio::Phylo can interact with BioPerl, thereby giving access to the file formats that BioPerl supports. Many methods for data simulation, transformation and manipulation, the analysis of tree shape, and tree visualization are provided. CONCLUSIONS Bio::Phylo is composed of 59 richly documented Perl5 modules. It has been deployed successfully on a variety of computer architectures (including various Linux distributions, Mac OS X versions, Windows, Cygwin and UNIX-like systems). It is available as open source (GPL) software from http://search.cpan.org/dist/Bio-Phylo.
Collapse
Affiliation(s)
- Rutger A Vos
- School of Biological Sciences, University of Reading, UK.
| | | | | | | | | |
Collapse
|
18
|
Quest DJ, Land ML, Brettin TS, Cottingham RW. Next generation models for storage and representation of microbial biological annotation. BMC Bioinformatics 2010; 11 Suppl 6:S15. [PMID: 20946598 PMCID: PMC3026362 DOI: 10.1186/1471-2105-11-s6-s15] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/04/2022] Open
Abstract
Background Traditional genome annotation systems were developed in a very different computing era, one where the World Wide Web was just emerging. Consequently, these systems are built as centralized black boxes focused on generating high quality annotation submissions to GenBank/EMBL supported by expert manual curation. The exponential growth of sequence data drives a growing need for increasingly higher quality and automatically generated annotation. Typical annotation pipelines utilize traditional database technologies, clustered computing resources, Perl, C, and UNIX file systems to process raw sequence data, identify genes, and predict and categorize gene function. These technologies tightly couple the annotation software system to hardware and third party software (e.g. relational database systems and schemas). This makes annotation systems hard to reproduce, inflexible to modification over time, difficult to assess, difficult to partition across multiple geographic sites, and difficult to understand for those who are not domain experts. These systems are not readily open to scrutiny and therefore not scientifically tractable. The advent of Semantic Web standards such as Resource Description Framework (RDF) and OWL Web Ontology Language (OWL) enables us to construct systems that address these challenges in a new comprehensive way. Results Here, we develop a framework for linking traditional data to OWL-based ontologies in genome annotation. We show how data standards can decouple hardware and third party software tools from annotation pipelines, thereby making annotation pipelines easier to reproduce and assess. An illustrative example shows how TURTLE (Terse RDF Triple Language) can be used as a human readable, but also semantically-aware, equivalent to GenBank/EMBL files. Conclusions The power of this approach lies in its ability to assemble annotation data from multiple databases across multiple locations into a representation that is understandable to researchers. In this way, all researchers, experimental and computational, will more easily understand the informatics processes constructing genome annotation and ultimately be able to help improve the systems that produce them.
Collapse
Affiliation(s)
- Daniel J Quest
- Biosciences Division, Oak Ridge National Laboratory, P.O. Box 2008, Oak Ridge, TN 37831-6420, USA.
| | | | | | | |
Collapse
|
19
|
Balhoff JP, Dahdul WM, Kothari CR, Lapp H, Lundberg JG, Mabee P, Midford PE, Westerfield M, Vision TJ. Phenex: ontological annotation of phenotypic diversity. PLoS One 2010; 5:e10500. [PMID: 20463926 PMCID: PMC2864769 DOI: 10.1371/journal.pone.0010500] [Citation(s) in RCA: 67] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/13/2009] [Accepted: 12/27/2009] [Indexed: 02/04/2023] Open
Abstract
Background Phenotypic differences among species have long been systematically itemized and described by biologists in the process of investigating phylogenetic relationships and trait evolution. Traditionally, these descriptions have been expressed in natural language within the context of individual journal publications or monographs. As such, this rich store of phenotype data has been largely unavailable for statistical and computational comparisons across studies or integration with other biological knowledge. Methodology/Principal Findings Here we describe Phenex, a platform-independent desktop application designed to facilitate efficient and consistent annotation of phenotypic similarities and differences using Entity-Quality syntax, drawing on terms from community ontologies for anatomical entities, phenotypic qualities, and taxonomic names. Phenex can be configured to load only those ontologies pertinent to a taxonomic group of interest. The graphical user interface was optimized for evolutionary biologists accustomed to working with lists of taxa, characters, character states, and character-by-taxon matrices. Conclusions/Significance Annotation of phenotypic data using ontologies and globally unique taxonomic identifiers will allow biologists to integrate phenotypic data from different organisms and studies, leveraging decades of work in systematics and comparative morphology.
Collapse
Affiliation(s)
- James P Balhoff
- National Evolutionary Synthesis Center, Durham, North Carolina, United States of America.
| | | | | | | | | | | | | | | | | |
Collapse
|