1
Wang S, Liu JX, Li F, Wang J, Gao YL. M3HOGAT: A Multi-View Multi-Modal Multi-Scale High-Order Graph Attention Network for Microbe-Disease Association Prediction. IEEE J Biomed Health Inform 2024; 28:6259-6267. [PMID: 39012741 DOI: 10.1109/jbhi.2024.3429128] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/18/2024]
Abstract
Numerous scientific studies have found a link between diverse microorganisms in the human body and complex human diseases. Because traditional experimental approaches are time-consuming and expensive, using computational methods to identify microbes correlated with diseases is critical. In this paper, a new microbe-disease association prediction model, called M3HOGAT, is proposed that combines a multi-view multi-modal network with a multi-scale feature fusion mechanism. First, a microbe-disease association network and multiple similarity views are constructed from multi-source information. Then, because neighbor information from different orders may help in learning node representations, a higher-order graph attention network (HOGAT) is devised to aggregate neighbor information from different orders and extract microbe and disease features from the different networks and views. Given that the embedding features of microbes and diseases from different views vary in importance, a multi-scale feature fusion mechanism is employed to learn their interaction information, thereby generating the final features of microbes and diseases. Finally, an inner product decoder is used to reconstruct the microbe-disease association matrix. Compared with five state-of-the-art methods on the HMDAD and Disbiome datasets, the results of 5-fold cross-validation show that M3HOGAT achieves the best performance. Furthermore, case studies on asthma and obesity confirm the effectiveness of M3HOGAT in identifying potential disease-related microbes.
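The final step in the abstract above, an inner product decoder, can be illustrated with a small sketch. This is not the authors' implementation: the function name `inner_product_decoder`, the toy embeddings, and the use of a sigmoid to map scores into (0, 1) are all assumptions made for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, used here (as an assumption) to squash scores into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def inner_product_decoder(microbe_emb, disease_emb):
    """Score every microbe-disease pair by the dot product of their embeddings.

    Returns a matrix with one row per microbe and one column per disease.
    """
    return [
        [sigmoid(sum(m * d for m, d in zip(m_vec, d_vec))) for d_vec in disease_emb]
        for m_vec in microbe_emb
    ]


# Hypothetical 2-dimensional embeddings for 2 microbes and 3 diseases.
microbes = [[1.0, 0.0], [0.0, 2.0]]
diseases = [[1.0, 1.0], [0.0, -1.0], [0.0, 0.0]]
scores = inner_product_decoder(microbes, diseases)
```

Each entry `scores[i][j]` is a reconstructed association score for microbe `i` and disease `j`; pairs whose embeddings point in similar directions score above 0.5.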
2
Toussaint PA, Leiser F, Thiebes S, Schlesner M, Brors B, Sunyaev A. Explainable artificial intelligence for omics data: a systematic mapping study. Brief Bioinform 2023; 25:bbad453. [PMID: 38113073 PMCID: PMC10729786 DOI: 10.1093/bib/bbad453] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/23/2022] [Revised: 07/28/2023] [Accepted: 11/08/2023] [Indexed: 12/21/2023] Open
Abstract
Researchers increasingly turn to explainable artificial intelligence (XAI) to analyze omics data and gain insights into the underlying biological processes. Yet, given the interdisciplinary nature of the field, many findings have only been shared in their respective research community. An overview of XAI for omics data is needed to highlight promising approaches and help detect common issues. Toward this end, we conducted a systematic mapping study. To identify relevant literature, we queried Scopus, PubMed, Web of Science, BioRxiv, MedRxiv and arXiv. Based on keywording, we developed a coding scheme with 10 facets regarding the studies' AI methods, explainability methods and omics data. Our mapping study resulted in 405 included papers published between 2010 and 2023. The inspected papers analyze DNA-based (mostly genomic), transcriptomic, proteomic or metabolomic data by means of neural networks, tree-based methods, statistical methods and further AI methods. The preferred post-hoc explainability methods are feature relevance (n = 166) and visual explanation (n = 52), while papers using interpretable approaches often resort to the use of transparent models (n = 83) or architecture modifications (n = 72). With many research gaps still apparent for XAI for omics data, we deduced eight research directions and discuss their potential for the field. We also provide exemplary research questions for each direction. Many problems with the adoption of XAI for omics data in clinical practice are yet to be resolved. This systematic mapping study outlines extant research on the topic and provides research directions for researchers and practitioners.
Affiliation(s)
- Philipp A Toussaint
  - Department of Economics and Management, Karlsruhe Institute of Technology, Karlsruhe, Germany
  - HIDSS4Health – Helmholtz Information and Data Science School for Health, Karlsruhe, Heidelberg, Germany
- Florian Leiser
  - Department of Economics and Management, Karlsruhe Institute of Technology, Karlsruhe, Germany
- Scott Thiebes
  - Department of Economics and Management, Karlsruhe Institute of Technology, Karlsruhe, Germany
- Matthias Schlesner
  - Biomedical Informatics, Data Mining and Data Analytics, Faculty of Applied Computer Science and Medical Faculty, University of Augsburg, Augsburg, Germany
- Benedikt Brors
  - Division of Applied Bioinformatics, German Cancer Research Center (DKFZ), Heidelberg, Germany
  - Translational Oncology, National Center for Tumor Diseases, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Ali Sunyaev
  - Department of Economics and Management, Karlsruhe Institute of Technology, Karlsruhe, Germany
3
Ibrahimi E, Lopes MB, Dhamo X, Simeon A, Shigdel R, Hron K, Stres B, D’Elia D, Berland M, Marcos-Zambrano LJ. Overview of data preprocessing for machine learning applications in human microbiome research. Front Microbiol 2023; 14:1250909. [PMID: 37869650 PMCID: PMC10588656 DOI: 10.3389/fmicb.2023.1250909] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 06/30/2023] [Accepted: 09/22/2023] [Indexed: 10/24/2023] Open
Abstract
Although metagenomic sequencing is now the preferred technique to study microbiome-host interactions, analyzing and interpreting microbiome sequencing data presents challenges primarily attributed to the statistical specificities of the data (e.g., sparse, over-dispersed, compositional, inter-variable dependency). This mini review explores preprocessing and transformation methods applied in recent human microbiome studies to address microbiome data analysis challenges. Our results indicate a limited adoption of transformation methods targeting the statistical characteristics of microbiome sequencing data. Instead, there is a prevalent usage of relative and normalization-based transformations that do not specifically account for the specific attributes of microbiome data. The information on preprocessing and transformations applied to the data before analysis was incomplete or missing in many publications, leading to reproducibility concerns, comparability issues, and questionable results. We hope this mini review will provide researchers and newcomers to the field of human microbiome research with an up-to-date point of reference for various data transformation tools and assist them in choosing the most suitable transformation method based on their research questions, objectives, and data characteristics.
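As an illustration of a transformation that targets the compositional nature of such data, the sketch below applies a centered log-ratio (CLR) transform to one sample. The abstract does not name specific transformations, so the choice of CLR, the function name `clr`, and the pseudocount of 0.5 used to handle zeros are assumptions made for this example only.

```python
import math


def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of one sample.

    A pseudocount is added first (an assumption; other zero-handling
    strategies exist) so that zero counts do not produce log(0).
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    return [value - mean_log for value in logs]


# Hypothetical read counts for four taxa in a single sample.
sample = [120, 0, 30, 850]
transformed = clr(sample)
```

CLR values sum to zero within a sample and, when no pseudocount is needed, do not change if all counts are multiplied by a constant, which is why the transform is insensitive to sequencing depth.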
Affiliation(s)
- Eliana Ibrahimi
  - Department of Biology, Faculty of Natural Sciences, University of Tirana, Tirana, Albania
- Marta B. Lopes
  - Department of Mathematics, Center for Mathematics and Applications (NOVA Math), NOVA School of Science and Technology, Caparica, Portugal
  - UNIDEMI, Department of Mechanical and Industrial Engineering, NOVA School of Science and Technology, Caparica, Portugal
- Xhilda Dhamo
  - Department of Applied Mathematics, Faculty of Natural Sciences, University of Tirana, Tirana, Albania
- Andrea Simeon
  - BioSense Institute, University of Novi Sad, Novi Sad, Serbia
- Rajesh Shigdel
  - Department of Clinical Science, University of Bergen, Bergen, Norway
- Karel Hron
  - Department of Mathematical Analysis and Applications of Mathematics, Faculty of Science, Palacký University Olomouc, Olomouc, Czechia
- Blaž Stres
  - Department of Catalysis and Chemical Reaction Engineering, National Institute of Chemistry, Ljubljana, Slovenia
  - Faculty of Civil and Geodetic Engineering, Institute of Sanitary Engineering, Ljubljana, Slovenia
  - Department of Automation, Biocybernetics and Robotics, Jožef Stefan Institute, Ljubljana, Slovenia
  - Department of Animal Science, Biotechnical Faculty, University of Ljubljana, Ljubljana, Slovenia
- Domenica D’Elia
  - Department of Biomedical Sciences, National Research Council, Institute for Biomedical Technologies, Bari, Italy
- Magali Berland
  - INRAE, MetaGenoPolis, Université Paris-Saclay, Jouy-en-Josas, France
- Laura Judith Marcos-Zambrano
  - Computational Biology Group, Precision Nutrition and Cancer Research Program, IMDEA Food Institute, Madrid, Spain
4
Salim F, Mizutani S, Zolfo M, Yamada T. Recent advances of machine learning applications in human gut microbiota study: from observational analysis toward causal inference and clinical intervention. Curr Opin Biotechnol 2023; 79:102884. [PMID: 36623442 DOI: 10.1016/j.copbio.2022.102884] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 10/30/2021] [Revised: 02/24/2022] [Accepted: 12/09/2022] [Indexed: 01/08/2023]
Abstract
Statistical methods, especially machine learning (ML), are pivotal for the analysis of the large datasets generated by multiomics studies of the human gut microbiota. These analyses lead to the discovery of microbe-disease associations. Furthermore, recent efforts toward more data transparency and accessible analytical tools have improved data availability and study reproducibility. Our recently accumulated knowledge on microbe-disease associations brings to light the next questions: what is the role of microbes in disease progression, and how can we apply our knowledge of the microbiome in clinical settings? Here, we introduce recent studies that implemented ML to answer the questions of causal inference and clinical translation.
Affiliation(s)
- Felix Salim
  - School of Life Science and Technology, Tokyo Institute of Technology
- Sayaka Mizutani
  - School of Life Science and Technology, Tokyo Institute of Technology; Japan Society for the Promotion of Science
- Moreno Zolfo
  - School of Life Science and Technology, Tokyo Institute of Technology
- Takuji Yamada
  - School of Life Science and Technology, Tokyo Institute of Technology; Metagen, Inc.; Metagen Therapeutics, Inc.; digzyme, Inc.
5
Cui Z, Chen ZH, Zhang QH, Gribova V, Filaretov VF, Huang DS. RMSCNN: A Random Multi-Scale Convolutional Neural Network for Marine Microbial Bacteriocins Identification. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:3663-3672. [PMID: 34699364 DOI: 10.1109/tcbb.2021.3122183] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 06/13/2023]
Abstract
The abuse of traditional antibiotics has led to an increase in the resistance of bacteria and viruses. Similar to the function of antibacterial peptides, bacteriocins are more common as a kind of peptides produced by bacteria that have bactericidal or bacterial effects. More importantly, the marine environment is one of the most abundant resources for extracting marine microbial bacteriocins (MMBs). Identifying bacteriocins from marine microorganisms is a common goal for the development of new drugs. Effective use of MMBs will greatly alleviate the current antibiotic abuse problem. In this work, deep learning is used to identify meaningful MMBs. We propose a random multi-scale convolutional neural network method. In the scale setting, we set a random model to update the scale value randomly. The scale selection method can reduce the contingency caused by artificial setting under certain conditions, thereby making the method more extensive. The results show that the classification performance of the proposed method is better than the state-of-the-art classification methods. In addition, some potential MMBs are predicted, and some different sequence analyses are performed on these candidates. It is worth mentioning that after sequence analysis, the HNH endonucleases of different marine bacteria are considered as potential bacteriocins.
6
Costantini C, Nunzi E, Romani L. From the nose to the lungs: the intricate journey of airborne pathogens amidst commensal bacteria. Am J Physiol Cell Physiol 2022; 323:C1036-C1043. [PMID: 36036448 PMCID: PMC9529274 DOI: 10.1152/ajpcell.00287.2022] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 11/25/2022]
Abstract
The recent COVID-19 pandemic has dramatically brought the pitfalls of airborne pathogens to the attention of the scientific community. Not only viruses but also bacteria and fungi may exploit air transmission to colonize and infect potential hosts and be the cause of significant morbidity and mortality in susceptible populations. The efforts to decipher the mechanisms of pathogenicity of airborne microbes have brought to light the delicate equilibrium that governs the homeostasis of mucosal membranes. The microorganisms already thriving in the permissive environment of the respiratory tract represent a critical component of this equilibrium and a potent barrier to infection by means of direct competition with airborne pathogens or indirectly via modulation of the immune response. Moving down the respiratory tract, physicochemical and biological constraints promote site-specific expansion of microbes that engage in cross talk with the local immune system to maintain homeostasis and promote protection. In this review, we critically assess the site-specific microbial communities that an airborne pathogen encounters in its hypothetical travel along the respiratory tract and discuss the changes in the composition and function of the microbiome in airborne diseases by taking fungal and SARS-CoV-2 infections as examples. Finally, we discuss how technological and bioinformatics advancements may turn microbiome analysis into a valuable tool in the hands of clinicians to predict the risk of disease onset, the clinical course, and the response to treatment of individual patients in the direction of personalized medicine implementation.
Affiliation(s)
- Claudio Costantini
  - Department of Medicine and Surgery, University of Perugia, Perugia, Italy
- Emilia Nunzi
  - Department of Medicine and Surgery, University of Perugia, Perugia, Italy
- Luigina Romani
  - Department of Medicine and Surgery, University of Perugia, Perugia, Italy
7
Boyraz A, Pawlowsky-Glahn V, Egozcue JJ, Acar AC. Principal microbial groups: compositional alternative to phylogenetic grouping of microbiome data. Brief Bioinform 2022; 23:6675749. [PMID: 36007229 DOI: 10.1093/bib/bbac328] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/01/2022] [Revised: 07/19/2022] [Accepted: 07/20/2022] [Indexed: 11/13/2022] Open
Abstract
Statistical and machine learning techniques based on relative abundances have been used to predict health conditions and to identify microbial biomarkers. However, high dimensionality, sparsity and the compositional nature of microbiome data represent statistical challenges. On the other hand, taxon grouping allows microbiome abundance to be summarized at a coarser resolution in a lower dimension, but it presents new challenges when correlating taxa with a disease. In this work, we present a novel approach that groups Operational Taxonomical Units (OTUs) based only on relative abundances as an alternative to taxon grouping. The proposed procedure acknowledges the compositional nature of the data by making use of principal balances. The identified groups are called Principal Microbial Groups (PMGs). The procedure reduces the need for user-defined aggregation of OTUs and offers the possibility of working with coarse groups of OTUs that are not present in a phylogenetic tree. PMGs can be used for two different goals: (1) as a dimensionality reduction method for compositional data, and (2) as an aggregation procedure that provides an alternative to taxon grouping for the construction of microbial balances afterward used for disease prediction. We illustrate the procedure with data from a cirrhosis study. PMGs provide a coherent data analysis for the search for biomarkers in human microbiota. The source code and demo data for PMGs are available at: https://github.com/asliboyraz/PMGs.
Affiliation(s)
- Aslı Boyraz
  - Department of Computer Programming, Recep Tayyip Erdoğan University, Ardeşen Vocational School, Rize, 53400, Turkey
- Vera Pawlowsky-Glahn
  - Department of Computer Sciences, Applied Mathematics and Statistics, University of Girona, Campus Montilivi, 17003 Girona, Spain
- Juan José Egozcue
  - Department of Civil and Environmental Engineering, Universitat Politécnica de Catalunya, Barcelona, 08034, Spain
- Aybar Can Acar
  - Department of Medical Informatics, Middle East Technical University, Ankara, Turkey
8
Liu B, Sträuber H, Saraiva J, Harms H, Silva SG, Kasmanas JC, Kleinsteuber S, Nunes da Rocha U. Machine learning-assisted identification of bioindicators predicts medium-chain carboxylate production performance of an anaerobic mixed culture. Microbiome 2022; 10:48. [PMID: 35331330 PMCID: PMC8952268 DOI: 10.1186/s40168-021-01219-2] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 06/28/2021] [Accepted: 12/17/2021] [Indexed: 05/10/2023]
Abstract
BACKGROUND: The ability to quantitatively predict ecophysiological functions of microbial communities provides an important step to engineer microbiota for desired functions related to specific biochemical conversions. Here, we present the quantitative prediction of medium-chain carboxylate production in two continuous anaerobic bioreactors from 16S rRNA gene dynamics in enriched communities.
RESULTS: By progressively shortening the hydraulic retention time (HRT) from 8 to 2 days with different temporal schemes in two bioreactors operated for 211 days, we achieved higher productivities and yields of the target products n-caproate and n-caprylate. The datasets generated from each bioreactor were applied independently for training and testing machine learning algorithms using 16S rRNA genes to predict n-caproate and n-caprylate productivities. Our dataset consisted of 14 and 40 samples from HRT of 8 and 2 days, respectively. Because of the size and balance of our dataset, we compared linear regression, support vector machine and random forest regression algorithms using the original and balanced datasets generated using synthetic minority oversampling. Further, we performed cross-validation to estimate model stability. Random forest regression was the best algorithm, producing more consistent results with median error rates below 8%. More than 90% accuracy in the prediction of n-caproate and n-caprylate productivities was achieved. Four inferred bioindicators belonging to the genera Olsenella, Lactobacillus, Syntrophococcus and Clostridium IV suggest their relevance to the higher carboxylate productivity at shorter HRT. The recovery of metagenome-assembled genomes of these bioindicators confirmed their genetic potential to perform key steps of medium-chain carboxylate production.
CONCLUSIONS: Shortening the hydraulic retention time of the continuous bioreactor systems allows the communities to be shaped toward desired chain elongation functions. Using machine learning, we demonstrated that 16S rRNA amplicon sequencing data can be used to predict bioreactor process performance quantitatively and accurately. Characterizing and harnessing bioindicators holds promise to manage reactor microbiota towards selection of the target processes. Our mathematical framework is transferable to other ecosystem processes and microbial systems where community dynamics is linked to key functions. The general methodology used here can be adapted to data types of other functional categories such as genes, transcripts, proteins or metabolites.
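The abstract above mentions balancing the dataset with synthetic minority oversampling. The sketch below shows only the core interpolation idea behind that family of methods and is not the authors' pipeline: the function name `smote_like`, its parameters, and the toy feature vectors are hypothetical, and the base point is drawn at random rather than iterating over every minority sample as a full implementation would.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours.

    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # position along the segment from base to other
        synthetic.append([a + t * (b - a) for a, b in zip(base, other)])
    return synthetic


# Hypothetical 2-feature samples from the smaller group.
minority = [[0.0, 1.0], [1.0, 3.0], [2.0, 2.0]]
new_points = smote_like(minority, n_new=4)
```

Because every synthetic point lies on a segment between two real samples, the new points stay inside the range already covered by the minority group.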
Affiliation(s)
- Bin Liu
  - Department of Environmental Microbiology, Helmholtz Centre for Environmental Research - UFZ, Leipzig, Germany
- Heike Sträuber
  - Department of Environmental Microbiology, Helmholtz Centre for Environmental Research - UFZ, Leipzig, Germany
- João Saraiva
  - Department of Environmental Microbiology, Helmholtz Centre for Environmental Research - UFZ, Leipzig, Germany
- Hauke Harms
  - Department of Environmental Microbiology, Helmholtz Centre for Environmental Research - UFZ, Leipzig, Germany
- Sandra Godinho Silva
  - Institute for Bioengineering and Biosciences, Department of Bioengineering, Instituto Superior Técnico Universidade de Lisboa, Lisbon, Portugal
- Jonas Coelho Kasmanas
  - Department of Environmental Microbiology, Helmholtz Centre for Environmental Research - UFZ, Leipzig, Germany
  - Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos, Brazil
  - Department of Computer Science and Interdisciplinary Center of Bioinformatics, University of Leipzig, Leipzig, Germany
- Sabine Kleinsteuber
  - Department of Environmental Microbiology, Helmholtz Centre for Environmental Research - UFZ, Leipzig, Germany
- Ulisses Nunes da Rocha
  - Department of Environmental Microbiology, Helmholtz Centre for Environmental Research - UFZ, Leipzig, Germany
9
Yang F, Qiao Y, Qi Y, Bo J, Wang X. BACS: blockchain and AutoML-based technology for efficient credit scoring classification. Annals of Operations Research 2022:1-21. [PMID: 35095154 PMCID: PMC8785710 DOI: 10.1007/s10479-022-04531-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Accepted: 01/03/2022] [Indexed: 05/02/2023]
Abstract
Credit evaluation is of high scientific significance and practical use, especially as the world suffers from the COVID-19 epidemic. However, because building a credit scoring model involves a large number of data mining steps and requires a lot of time to process the data and build the model, efficient and accurate credit scoring methods are urgently required. To solve this problem, we propose BACS, a blockchain and automated machine learning based classification model that uses credit datasets so that the credit modelling processes are performed in a pipeline in an automated manner to eventually obtain the classification results of credit scoring. The BACS scheme consists of credit data storage on the blockchain, feature extraction, feature selection, modelling algorithm and hyperparameter optimization, and model evaluation. Firstly, we propose a mechanism for credit data management and storage using blockchain to ensure that the entire credit scoring system is traceable and that the information of each scoring candidate is stored securely, efficiently and in a tamper-proof manner on the blockchain nodes. Next, we design a pipeline using a random forest model to effectively integrate the key steps of credit data feature extraction, feature selection, credit model construction, and model evaluation. The experimental results demonstrate that our proposed automated machine learning-based credit scoring classification scheme BACS can assess the credit condition efficiently and accurately.
Affiliation(s)
- Fan Yang
  - School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, Shaanxi, People’s Republic of China
- Yanan Qiao
  - School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, Shaanxi, People’s Republic of China
- Yong Qi
  - School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, Shaanxi, People’s Republic of China
- Junge Bo
  - School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, Shaanxi, People’s Republic of China
- Xiao Wang
  - School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, Shaanxi, People’s Republic of China
10
Curry KD, Nute MG, Treangen TJ. It takes guts to learn: machine learning techniques for disease detection from the gut microbiome. Emerg Top Life Sci 2021; 5:815-827. [PMID: 34779841 PMCID: PMC8786294 DOI: 10.1042/etls20210213] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 08/17/2021] [Revised: 09/29/2021] [Accepted: 10/06/2021] [Indexed: 02/01/2023]
Abstract
Associations between the human gut microbiome and expression of host illness have been noted in a variety of conditions ranging from gastrointestinal dysfunctions to neurological deficits. Machine learning (ML) methods have generated promising results for disease prediction from gut metagenomic information for diseases including liver cirrhosis and irritable bowel disease, but have lacked efficacy when predicting other illnesses. Here, we review current ML methods designed for disease classification from microbiome data. We highlight the computational challenges these methods have effectively overcome and discuss the biological components that have been overlooked to offer perspectives on future work in this area.
Affiliation(s)
- Kristen D. Curry
  - Department of Computer Science, Rice University, Houston, TX 77005, USA
- Michael G. Nute
  - Department of Computer Science, Rice University, Houston, TX 77005, USA
- Todd J. Treangen
  - Department of Computer Science, Rice University, Houston, TX 77005, USA
11
Narayana JK, Mac Aogáin M, Goh WWB, Xia K, Tsaneva-Atanasova K, Chotirmall SH. Mathematical-based microbiome analytics for clinical translation. Comput Struct Biotechnol J 2021; 19:6272-6281. [PMID: 34900137 PMCID: PMC8637001 DOI: 10.1016/j.csbj.2021.11.029] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 09/25/2021] [Revised: 11/17/2021] [Accepted: 11/17/2021] [Indexed: 12/20/2022] Open
Abstract
Traditionally, human microbiology has been strongly built on the laboratory-focused culture of microbes isolated from human specimens in patients with acute or chronic infection. These approaches primarily view human disease through the lens of a single species and its relevant clinical setting; however, such approaches fail to account for the surrounding environment and the wide microbial diversity that exists in vivo. Given the emergence of next-generation sequencing technologies and advancing bioinformatic pipelines, researchers now have unprecedented capabilities to characterise the human microbiome in terms of its taxonomy, function, antibiotic resistance and even bacteriophages. Despite this, analysis of microbial communities has largely been restricted to ordination, ecological measures, and discriminant taxa analysis. This is predominantly due to a lack of suitable computational tools to facilitate microbiome analytics. In this review, we first evaluate the key concerns related to the inherent structure of microbiome datasets, which include compositionality and batch effects. We describe the available and emerging analytical techniques including integrative analysis, machine learning, microbial association networks, topological data analysis (TDA) and mathematical modelling. We also present how these methods may translate to clinical settings, including tools for implementation. Mathematical-based analytics for microbiome analysis represents a promising avenue for clinical translation across a range of acute and chronic disease states.
Affiliation(s)
- Jayanth Kumar Narayana
  - Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore
- Micheál Mac Aogáin
  - Biochemical Genetics Laboratory, Department of Biochemistry, St. James’s Hospital, Dublin, Ireland
  - Clinical Biochemistry Unit, School of Medicine, Trinity College Dublin, Dublin, Ireland
- Wilson Wen Bin Goh
  - Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore
  - School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
- Kelin Xia
  - Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore, Singapore
- Krasimira Tsaneva-Atanasova
  - Department of Mathematics & Living Systems Institute, College of Engineering, Mathematics and Physical Sciences, University of Exeter, Exeter EX4 4QF, UK
- Sanjay H. Chotirmall
  - Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore
  - Department of Respiratory and Critical Care Medicine, Tan Tock Seng Hospital, Singapore
12
Cheng L, Qi C, Yang H, Lu M, Cai Y, Fu T, Ren J, Jin Q, Zhang X. gutMGene: a comprehensive database for target genes of gut microbes and microbial metabolites. Nucleic Acids Res 2021; 50:D795-D800. [PMID: 34500458 PMCID: PMC8728193 DOI: 10.1093/nar/gkab786] [Citation(s) in RCA: 100] [Impact Index Per Article: 25.0] [Received: 07/29/2021] [Revised: 08/28/2021] [Accepted: 09/01/2021] [Indexed: 12/19/2022] Open
Abstract
gutMGene (http://bio-annotation.cn/gutmgene), a manually curated database, aims at providing a comprehensive resource of target genes of gut microbes and microbial metabolites in humans and mice. Metagenomic sequencing of fecal samples has identified 3.3 × 10^6 non-redundant microbial genes from up to 1500 different species. One of the contributions of gut microbiota to host biology is the circulating pool of bacterially derived small-molecule metabolites. It has been estimated that 10% of metabolites found in mammalian blood are derived from the gut microbiota, where they can produce systemic effects on the host through activating or inhibiting gene expression. The current version of gutMGene documents 1331 curated relationships between 332 gut microbes, 207 microbial metabolites and 223 genes in humans, and 2349 curated relationships between 209 gut microbes, 149 microbial metabolites and 544 genes in mice. Each entry in gutMGene contains detailed information on a relationship between gut microbe, microbial metabolite and target gene, a brief description of the relationship, experiment technology and platform, literature reference and so on. gutMGene provides a user-friendly interface to browse and retrieve each entry using gut microbes, disorders and intervention measures. It also offers the option to download all the entries and submit new experimentally validated associations.
Affiliation(s)
- Liang Cheng
  - NHC and CAMS Key Laboratory of Molecular Probe and Targeted Theranostics, Harbin Medical University, Harbin 150028, Heilongjiang, China
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, Heilongjiang, China
- Changlu Qi
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, Heilongjiang, China
- Haixiu Yang
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, Heilongjiang, China
- Minke Lu
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, Heilongjiang, China
- Yiting Cai
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, Heilongjiang, China
- Tongze Fu
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, Heilongjiang, China
- Jialiang Ren
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, Heilongjiang, China
- Qu Jin
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, Heilongjiang, China
- Xue Zhang
  - NHC and CAMS Key Laboratory of Molecular Probe and Targeted Theranostics, Harbin Medical University, Harbin 150028, Heilongjiang, China
  - McKusick-Zhang Center for Genetic Medicine, Peking Union Medical College, Beijing 100005, China
13
Yang H, Tong F, Qi C, Wang P, Li J, Cheng L. Prioritizing Disease-Related Microbes Based on the Topological Properties of a Comprehensive Network. Front Microbiol 2021; 12:685549. [PMID: 34326821 PMCID: PMC8315281 DOI: 10.3389/fmicb.2021.685549] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/25/2021] [Accepted: 05/10/2021] [Indexed: 01/09/2023] Open
Abstract
Many microbes are parasitic within the human body, engaging in various physiological processes and playing an important role in human diseases. The discovery of new microbe-disease associations aids our understanding of disease pathogenesis. Computational methods can be applied in such investigations, thereby avoiding the time-consuming and laborious nature of experimental methods. In this study, we constructed a comprehensive microbe-disease network by integrating known microbe-disease associations from three large-scale databases (Peryton, Disbiome, and gutMDisorder), and extended the random walk with restart to the network for prioritizing unknown microbe-disease associations. The area under the curve values of the leave-one-out cross-validation and the fivefold cross-validation exceeded 0.9370 and 0.9366, respectively, indicating the high performance of this method. In case studies of inflammatory bowel disease, asthma, and obesity, some prioritized disease-related microbes were validated by recent literature even though these diseases are already widely studied. This suggests that our method is effective at prioritizing novel disease-related microbes and may offer further insight into disease pathogenesis.
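The random walk with restart mentioned in this abstract can be illustrated with a minimal sketch. It assumes the textbook formulation, p(t+1) = (1 - r) W p(t) + r p(0), with W the column-normalized adjacency matrix; the abstract does not say how the authors extended it, so the toy network, the function name and the restart probability of 0.7 are all illustrative assumptions.

```python
def random_walk_with_restart(adj, seeds, restart=0.7, tol=1e-10, max_iter=1000):
    """Return steady-state visiting probabilities for every node.

    Textbook RWR only; not the extended variant used in the cited paper.
    """
    n = len(adj)
    # Column-normalize the adjacency matrix so each non-empty column sums to 1.
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    w = [[adj[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
         for i in range(n)]
    # The walker starts from, and restarts at, the seed nodes uniformly.
    p0 = [1.0 / len(seeds) if i in seeds else 0.0 for i in range(n)]
    p = p0[:]
    for _ in range(max_iter):
        nxt = [(1 - restart) * sum(w[i][j] * p[j] for j in range(n))
               + restart * p0[i] for i in range(n)]
        # Stop once the distribution no longer changes (L1 difference).
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


# Hypothetical toy network: nodes 0-1 are diseases, nodes 2-4 are microbes.
# Known associations (edges): d0-m2, d0-m3, d1-m3, d1-m4.
adj = [
    [0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1],
    [1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
]
scores = random_walk_with_restart(adj, seeds={0})
# Microbe node 4 has no known link to disease 0; its score is the
# prioritization value for that candidate association.
candidate_score = scores[4]
```

In this sketch, unlinked microbes would be ranked for a disease by their steady-state score after seeding the walk at that disease.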
Affiliation(s)
- Haixiu Yang
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
- Fan Tong
  - Academy of Military Medical Science, Beijing, China
- Changlu Qi
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
- Ping Wang
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
- Jiangyu Li
  - Academy of Military Medical Science, Beijing, China
- Liang Cheng
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
  - NHC and CAMS Key Laboratory of Molecular Probe and Targeted Theranostics, Harbin Medical University, Harbin, China
|
14
|
Zhu Z, Han X, Cheng L. Identification of gene signature associated with type 2 diabetes mellitus by integrating mutation and expression data. Curr Gene Ther 2021; 22:51-58. [PMID: 34238156 DOI: 10.2174/1566523221666210707140839] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 01/15/2021] [Revised: 04/08/2021] [Accepted: 04/18/2021] [Indexed: 11/22/2022]
Abstract
Type 2 diabetes mellitus (T2DM) is a chronic disease. Molecular diagnosis should be helpful for the treatment of T2DM patients. With the development of sequencing technology, a large number of differentially expressed genes have been identified from expression data. However, machine learning methods can identify only a locally optimal solution as the signature. Mutation information obtained by inheritance can better reflect the relationship between genes and diseases. Therefore, we need to integrate mutation information to identify the signature more accurately. To this end, we integrated genome-wide association study (GWAS) data and expression data, combined with expression quantitative trait loci (eQTL) technology, to obtain a T2DM predictive signature (T2DMSig-10). First, we used GWAS data to obtain a list of T2DM susceptibility loci. Then, we used eQTL technology to obtain risk single nucleotide polymorphisms (SNPs) and combined them with pancreatic β-cell gene expression data to obtain 10 protein-coding genes. Next, we combined these genes with equal weights. Receiver operating characteristic (ROC) analysis, the single-gene removal and increase method, gene ontology function enrichment and protein-protein interaction network analysis were then used for verification; the results showed that T2DMSig-10 had an excellent predictive effect on T2DM (AUC=0.99) and was highly robust. In short, we obtained a predictive signature of T2DM and further verified it.
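The step of combining genes with equal weights and checking the result with ROC analysis can be sketched roughly as follows. This assumes the signature score is a plain average of expression values and computes AUC by pairwise comparison of cases and controls; the gene names and expression values are made-up placeholders, not the T2DMSig-10 genes or data, and the abstract does not specify how the sign or scaling of each gene was handled.

```python
def equal_weight_score(expression, genes):
    """Average the expression of the signature genes (equal weights)."""
    return sum(expression[g] for g in genes) / len(genes)


def roc_auc(scores, labels):
    """AUC as the chance that a random case outscores a random control."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Ties between a case and a control count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up toy samples: (expression profile, label), label 1 = case.
genes = ["gene_a", "gene_b", "gene_c"]  # placeholder names
samples = [
    ({"gene_a": 2.1, "gene_b": 1.8, "gene_c": 2.4}, 1),
    ({"gene_a": 1.9, "gene_b": 2.2, "gene_c": 1.7}, 1),
    ({"gene_a": 0.9, "gene_b": 1.1, "gene_c": 2.0}, 0),
    ({"gene_a": 1.0, "gene_b": 0.7, "gene_c": 0.8}, 0),
]
scores = [equal_weight_score(expr, genes) for expr, _ in samples]
labels = [label for _, label in samples]
auc = roc_auc(scores, labels)
```

A single-gene removal check in the same spirit would recompute `auc` with each gene dropped from `genes` in turn and compare the results.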
Affiliation(s)
- Zijun Zhu
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, Heilongjiang, China
- Xudong Han
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, Heilongjiang, China
- Liang Cheng
  - NHC and CAMS Key Laboratory of Molecular Probe and Targeted Theranostics, College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, Heilongjiang, China
|
15
|
Yang F, Zou Q. DisBalance: a platform to automatically build balance-based disease prediction models and discover microbial biomarkers from microbiome data. Brief Bioinform 2021; 22:6217721. [PMID: 33834198 DOI: 10.1093/bib/bbab094] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 01/27/2021] [Revised: 02/22/2021] [Accepted: 03/03/2021] [Indexed: 12/23/2022] Open
Abstract
How best to utilize the microbial taxonomic abundances in regard to the prediction and explanation of human diseases remains appealing and challenging, and the relative nature of microbiome data necessitates a proper feature selection method to resolve the compositional problem. In this study, we developed an all-in-one platform to address a series of issues in microbiome-based human disease prediction and taxonomic biomarker discovery. We prioritize the interpretation, runtime and classification accuracy of the distal discriminative balances analysis (DBA-distal) method in selecting a set of distal discriminative balances, and develop DisBalance, a comprehensive platform, to integrate and streamline the workflows of disease model building, disease risk prediction and disease-related biomarker discovery for microbiome-based binary classifications. DisBalance allows de novo model-building and disease risk prediction in a very fast and convenient way. To facilitate model-driven and knowledge-driven discoveries, DisBalance dedicates multiple strategies for the mining of microbial biomarkers. The independent validation of the models constructed by the DisBalance pipeline is performed on seven microbiome datasets from the original article of DBA-distal. The implementation of the DisBalance platform is demonstrated by a complete analysis of a shotgun metagenomic dataset of Ulcerative Colitis (UC). As free and open-source software, DisBalance can be accessed at http://lab.malab.cn/soft/DisBalance. The source code and demo data for DisBalance are available at https://github.com/yangfenglong/DisBalance.
Affiliation(s)
- Fenglong Yang
  - University of Electronic Science and Technology of China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China
|
16
|
Dou L, Yang F, Xu L, Zou Q. A comprehensive review of the imbalance classification of protein post-translational modifications. Brief Bioinform 2021; 22:6217722. [PMID: 33834199 DOI: 10.1093/bib/bbab089] [Citation(s) in RCA: 29] [Impact Index Per Article: 7.3] [Received: 01/10/2021] [Revised: 02/17/2021] [Accepted: 02/24/2021] [Indexed: 12/13/2022] Open
Abstract
Post-translational modifications (PTMs) play significant roles in regulating protein structure, activity and function, and they are closely involved in various pathologies. Therefore, the identification of associated PTMs is the foundation of in-depth research on related biological mechanisms, disease treatments and drug design. Due to the high cost and time consumption of high-throughput sequencing techniques, developing machine learning-based predictors has been considered an effective approach to rapidly recognize potential modified sites. However, the imbalanced distribution of true and false PTM sites, namely, the data imbalance problem, largely affects the reliability and application of prediction tools. In this article, we conduct a systematic survey of the research progress in imbalanced PTM classification. First, we describe the modeling process in detail and outline useful data imbalance solutions. Then, we summarize the recently proposed bioinformatics tools based on imbalanced PTM data and simultaneously build a convenient website, ImClassi_PTMs (available at lab.malab.cn/∼dlj/ImbClassi_PTMs/), for researchers to consult. Moreover, we analyze the challenges of current computational predictors and propose some suggestions to improve the efficiency of imbalance learning. We hope that this work will provide comprehensive knowledge of imbalanced PTM recognition and contribute to advanced predictors in the future.
Affiliation(s)
- Lijun Dou
  - University of Electronic Science and Technology of China and the Shenzhen Polytechnic, China
- Fenglong Yang
  - University of Electronic Science and Technology of China and the Shenzhen Polytechnic, China
- Lei Xu
  - School of Electronic and Communication Engineering, Shenzhen Polytechnic, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
|
17
|
Yang F, Zou Q, Gao B. GutBalance: a server for the human gut microbiome-based disease prediction and biomarker discovery with compositionality addressed. Brief Bioinform 2021; 22:6123951. [PMID: 33515036 DOI: 10.1093/bib/bbaa436] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 11/23/2020] [Revised: 12/17/2020] [Accepted: 12/26/2020] [Indexed: 02/07/2023] Open
Abstract
The compositionality of the microbiome data is well-known but often neglected. The compositional transformation pertains to the supervised learning of microbiome data and is a critical step that decides the performance and reliability of the disease classifiers. We value the excellent performance of the distal discriminative balance analysis (DBA) method, which selects distal balances of pairs and trios of bacteria, in addressing the classification of high-dimensional microbiome data. By applying this method to the species-level abundances of all the disease phenotypes in the GMrepo database, we build a balance-based model repository for the classification of human gut microbiome-related diseases. The model repository supports the prediction of disease risks for new sample(s). More importantly, we highlight the concept of balance-disease associations rather than the conventional microbe-disease associations and develop the human Gut Balance-Disease Association Database (GBDAD). Each predictable balance for each disease model indicates a potential biomarker-disease relationship and can be interpreted as a bacteria ratio positively or negatively correlated with the disease. Furthermore, by linking the balance-disease associations to the evidenced microbe-disease associations in MicroPhenoDB, we surprisingly found that most species-disease associations inferred from the shotgun metagenomic datasets can be validated by external evidence beyond MicroPhenoDB. The balance-based species-disease association inference will accelerate the generation of new microbe-disease association hypotheses in gastrointestinal microecology research and clinical trials. The model repository and the GBDAD database are deployed on the GutBalance server, which supports interactive visualization and systematic interrogation of the disease models, disease-related balances and disease-related species of interest.
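The idea of a balance as a bacteria ratio can be made concrete with a small sketch. It assumes the standard isometric log-ratio balance from compositional data analysis, sqrt(rs/(r+s)) · ln(g(numerator)/g(denominator)), where g is the geometric mean and r, s are the group sizes; whether GutBalance applies exactly this normalization is not stated in the abstract. The taxon names, abundances and pseudocount are hypothetical.

```python
import math


def balance(abundances, numerator, denominator, pseudocount=1e-6):
    """Log-ratio balance between two groups of taxa in one sample.

    Positive values mean the numerator taxa dominate; negative values
    mean the denominator taxa dominate. The pseudocount guards against
    log(0) for absent taxa.
    """
    def log_gmean(taxa):
        # Log of the geometric mean of the selected abundances.
        return sum(math.log(abundances[t] + pseudocount) for t in taxa) / len(taxa)

    r, s = len(numerator), len(denominator)
    return math.sqrt(r * s / (r + s)) * (log_gmean(numerator) - log_gmean(denominator))


# Hypothetical relative abundances for one sample.
sample = {"taxon_a": 0.2, "taxon_b": 0.1, "taxon_c": 0.7}
# A pair balance (one taxon against one) and a trio balance (two against one).
pair = balance(sample, ["taxon_a"], ["taxon_b"], pseudocount=0.0)
trio = balance(sample, ["taxon_a", "taxon_b"], ["taxon_c"], pseudocount=0.0)
```

Because only ratios enter the calculation, the value is unchanged when every abundance in the sample is multiplied by the same factor, which is why balances sidestep the compositional problem the abstract describes.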
Affiliation(s)
- Fenglong Yang
  - University of Electronic Science and Technology of China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
  - Hainan Key Laboratory for Computational Science and Application, Hainan Normal University, Haikou 571158, China
- Bo Gao
  - Department of Radiology, The Second Affiliated Hospital, Harbin Medical University, Harbin 150001, China
|