1
Santangelo BE, Apgar M, Colorado ASB, Martin CG, Sterrett J, Wall E, Joachimiak MP, Hunter LE, Lozupone CA. Integrating biological knowledge for mechanistic inference in the host-associated microbiome. Front Microbiol 2024; 15:1351678. [PMID: 38638909 PMCID: PMC11024261 DOI: 10.3389/fmicb.2024.1351678] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/06/2023] [Accepted: 02/26/2024] [Indexed: 04/20/2024] Open
Abstract
Advances in high-throughput technologies have enhanced our ability to describe microbial communities as they relate to human health and disease. Alongside the growth in sequencing data has come an influx of resources that synthesize knowledge surrounding microbial traits, functions, and metabolic potential with knowledge of how they may impact host pathways to influence disease phenotypes. These knowledge bases can enable the development of mechanistic explanations that may underlie correlations detected between microbial communities and disease. In this review, we survey existing resources and methodologies for the computational integration of broad classes of microbial and host knowledge. We evaluate these knowledge bases in their access methods, content, and source characteristics. We discuss challenges of the creation and utilization of knowledge bases including inconsistency of nomenclature assignment of taxa and metabolites across sources, whether the biological entities represented are rooted in ontologies or taxonomies, and how the structure and accessibility limit the diversity of applications and user types. We make this information available in a code and data repository at: https://github.com/lozuponelab/knowledge-source-mappings. Addressing these challenges will allow for the development of more effective tools for drawing from abundant knowledge to find new insights into microbial mechanisms in disease by fostering a systematic and unbiased exploration of existing information.
Affiliation(s)
- Brook E. Santangelo
  - Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, United States
- Madison Apgar
  - Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, United States
- Casey G. Martin
  - Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, United States
- John Sterrett
  - Department of Integrative Physiology, University of Colorado, Boulder, CO, United States
- Elena Wall
  - Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, United States
- Marcin P. Joachimiak
  - Lawrence Berkeley National Laboratory, Environmental Genomics and Systems Biology Division, Biosystems Data Science Department, Berkeley, CA, United States
- Lawrence E. Hunter
  - Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, United States
- Catherine A. Lozupone
  - Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, United States
2
Alqaissi E, Alotaibi F, Ramzan MS. Graph data science and machine learning for the detection of COVID-19 infection from symptoms. PeerJ Comput Sci 2023; 9:e1333. [PMID: 37346701 PMCID: PMC10280642 DOI: 10.7717/peerj-cs.1333] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 01/04/2023] [Accepted: 03/16/2023] [Indexed: 06/23/2023]
Abstract
BACKGROUND COVID-19 is an infectious disease caused by SARS-CoV-2. The symptoms of COVID-19 range from mild to moderate respiratory illness, and the disease sometimes requires urgent medication. Therefore, it is crucial to detect COVID-19 at an early stage through specific clinical tests, testing kits, and medical devices. However, these tests are not always available during a pandemic. Therefore, this study developed an automatic, intelligent, rapid, and real-time diagnostic model for the early detection of COVID-19 based on its symptoms. METHODS The COVID-19 knowledge graph (KG), constructed based on literature from heterogeneous data, is imported to understand the different relations in COVID-19. We added the human disease ontology to the COVID-19 KG and applied a node-embedding graph algorithm called fast random projection to extract an extra feature from the COVID-19 dataset. Subsequently, experiments were conducted using two machine learning (ML) pipelines to predict COVID-19 infection from its symptoms. Additionally, automatic tuning of the model hyperparameters was adopted. RESULTS We compared two graph-based ML models, logistic regression (LR) and random forest (RF). The proposed graph-based RF model achieved a low error rate (0.0064) and the best scores on all performance metrics, including specificity = 98.71%, accuracy = 99.36%, precision = 99.65%, recall = 99.53%, and F1-score = 99.59%. Furthermore, the Matthews correlation coefficient achieved by the RF model was higher than that of the LR model. Comparative analysis with other ML algorithms and with studies from the literature showed that the proposed RF model exhibited the best detection accuracy. CONCLUSION The graph-based RF model registered high performance in classifying the symptoms of COVID-19 infection, indicating that graph data science, in conjunction with ML techniques, helps improve performance and accelerate innovation.
Affiliation(s)
- Eman Alqaissi
  - Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
  - Information Systems, King Khalid University, Abha, Saudi Arabia
- Fahd Alotaibi
  - Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
- Muhammad Sher Ramzan
  - Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
3
Wood EC, Glen AK, Kvarfordt LG, Womack F, Acevedo L, Yoon TS, Ma C, Flores V, Sinha M, Chodpathumwan Y, Termehchy A, Roach JC, Mendoza L, Hoffman AS, Deutsch EW, Koslicki D, Ramsey SA. RTX-KG2: a system for building a semantically standardized knowledge graph for translational biomedicine. BMC Bioinformatics 2022; 23:400. [PMID: 36175836 PMCID: PMC9520835 DOI: 10.1186/s12859-022-04932-3] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Received: 03/25/2022] [Accepted: 09/14/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Biomedical translational science is increasingly using computational reasoning on repositories of structured knowledge (such as UMLS, SemMedDB, ChEMBL, Reactome, DrugBank, and SMPDB) in order to facilitate discovery of new therapeutic targets and modalities. The NCATS Biomedical Data Translator project is working to federate autonomous reasoning agents and knowledge providers within a distributed system for answering translational questions. Within that project and the broader field, there is a need for a framework that can efficiently and reproducibly build an integrated, standards-compliant, and comprehensive biomedical knowledge graph that can be downloaded in standard serialized form or queried via a public application programming interface (API). RESULTS To create a knowledge provider system within the Translator project, we have developed RTX-KG2, an open-source software system for building (and hosting a web API for querying) a biomedical knowledge graph that uses an Extract-Transform-Load approach to integrate 70 knowledge sources (including the aforementioned core six sources) into a knowledge graph with provenance information including (where available) citations. The semantic layer and schema for RTX-KG2 follow the standard Biolink model to maximize interoperability. RTX-KG2 is currently being used by multiple Translator reasoning agents, both in its downloadable form and via its SmartAPI-registered interface. Serializations of RTX-KG2 are available for download in both the pre-canonicalized form and in canonicalized form (in which synonyms are merged). The current canonicalized version (KG2.7.3) of RTX-KG2 contains 6.4M nodes and 39.3M edges with a hierarchy of 77 relationship types from Biolink. CONCLUSION RTX-KG2 is the first knowledge graph that integrates UMLS, SemMedDB, ChEMBL, DrugBank, Reactome, SMPDB, and 64 additional knowledge sources within a knowledge graph that conforms to the Biolink standard for its semantic layer and schema.
RTX-KG2 is publicly available for querying via its API at arax.rtx.ai/api/rtxkg2/v1.2/openapi.json. The code to build RTX-KG2 is publicly available at github:RTXteam/RTX-KG2.
Affiliation(s)
- E C Wood
  - School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA
- Amy K Glen
  - School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA
- Lindsey G Kvarfordt
  - School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA
- Finn Womack
  - Computer Science and Engineering, Penn State University, State College, PA, USA
- Liliana Acevedo
  - School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA
- Timothy S Yoon
  - School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA
- Chunyu Ma
  - Huck Institutes of the Life Sciences, Penn State University, State College, PA, USA
- Veronica Flores
  - School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA
- Meghamala Sinha
  - School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA
- Arash Termehchy
  - School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA
- Andrew S Hoffman
  - Interdisciplinary Hub for Digitalization and Society, Radboud University, Nijmegen, The Netherlands
- David Koslicki
  - Computer Science and Engineering, Penn State University, State College, PA, USA
  - Huck Institutes of the Life Sciences, Penn State University, State College, PA, USA
  - Department of Biology, Penn State University, State College, PA, USA
- Stephen A Ramsey
  - School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA
  - Department of Biomedical Sciences, Oregon State University, Corvallis, OR, USA
4
Rodriguez-Esteban R, Duarte J, Teixeira PC, Richard F, Koltsova S, So WV. Prediction of standard cell types and functional markers from textual descriptions of flow cytometry gating definitions using machine learning. CYTOMETRY. PART B, CLINICAL CYTOMETRY 2022; 102:220-227. [PMID: 35253974 DOI: 10.1002/cyto.b.22065] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/20/2021] [Revised: 02/02/2022] [Accepted: 02/28/2022] [Indexed: 06/14/2023]
Abstract
BACKGROUND A key step in clinical flow cytometry data analysis is gating, which involves the identification of cell populations. The process of gating produces a set of reportable results, which are typically described by gating definitions. The non-standardized, non-interpreted nature of gating definitions represents a hurdle for data interpretation and data sharing across and within organizations. Interpreting and standardizing gating definitions for subsequent analysis of gating results requires a curation effort from experts. Machine learning approaches have the potential to help in this process by predicting expert annotations associated with gating definitions. METHODS We created a gold-standard dataset by manually annotating thousands of gating definitions with cell type and functional marker annotations. We used this dataset to train and test a machine learning pipeline able to predict standard cell types and functional marker genes associated with gating definitions. RESULTS The machine learning pipeline predicted annotations with high accuracy for both cell types and functional marker genes. Accuracy was lower for gating definitions from assays belonging to laboratories from which limited or no prior data was available in the training. Manual error review ensured that resulting predicted annotations could be reused subsequently as additional gold-standard training data. CONCLUSIONS Machine learning methods are able to consistently predict annotations associated with gating definitions from flow cytometry assays. However, a hybrid automatic and manual annotation workflow would be recommended to achieve optimal results.
Affiliation(s)
- Raul Rodriguez-Esteban
  - Roche Pharmaceutical Research and Early Development, Roche Innovation Center Basel, Basel, Switzerland
- José Duarte
  - Roche Pharmaceutical Research and Early Development, Roche Innovation Center Basel, Basel, Switzerland
- Priscila C Teixeira
  - Roche Pharmaceutical Research and Early Development, Roche Innovation Center Basel, Basel, Switzerland
- Fabien Richard
  - Roche Pharmaceutical Research and Early Development, Roche Innovation Center Basel, Basel, Switzerland
- Svetlana Koltsova
  - Curation Department, Rancho BioSciences LLC, San Diego, California, USA
- W Venus So
  - Roche Pharmaceutical Research and Early Development, Roche Innovation Center New York, New York, USA
5
Chen C, Ross KE, Gavali S, Cowart JE, Wu CH. COVID-19 knowledge graph from semantic integration of biomedical literature and databases. Bioinformatics 2021; 37:4597-4598. [PMID: 34613368 PMCID: PMC8513397 DOI: 10.1093/bioinformatics/btab694] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 05/02/2021] [Revised: 09/26/2021] [Accepted: 10/04/2021] [Indexed: 11/12/2022] Open
Abstract
SUMMARY The global response to the COVID-19 pandemic has led to a rapid increase of scientific literature on this deadly disease. Extracting knowledge from biomedical literature and integrating it with relevant information from curated biological databases is essential to gain insight into COVID-19 etiology, diagnosis, and treatment. We used Semantic Web technology RDF to integrate COVID-19 knowledge mined from literature by iTextMine, PubTator, and SemRep with relevant biological databases and formalized the knowledge in a standardized and computable COVID-19 Knowledge Graph (KG). We published the COVID-19 KG via a SPARQL endpoint to support federated queries on the Semantic Web and developed a knowledge portal with browsing and searching interfaces. We also developed a RESTful API to support programmatic access and provided RDF dumps for download. AVAILABILITY AND IMPLEMENTATION The COVID-19 Knowledge Graph is publicly available under CC-BY 4.0 license at https://research.bioinformatics.udel.edu/covid19kg/.
Affiliation(s)
- Chuming Chen
  - Department of Computer and Information Sciences, University of Delaware, Newark, Delaware, 19716, USA
- Karen E Ross
  - Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center, Washington, DC, 20007, USA
- Sachin Gavali
  - Department of Computer and Information Sciences, University of Delaware, Newark, Delaware, 19716, USA
- Julie E Cowart
  - Department of Computer and Information Sciences, University of Delaware, Newark, Delaware, 19716, USA
- Cathy H Wu
  - Department of Computer and Information Sciences, University of Delaware, Newark, Delaware, 19716, USA
  - Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center, Washington, DC, 20007, USA