1
Porreca A, Ibrahimi E, Maturo F, Marcos Zambrano LJ, Meto M, Lopes MB. Robust prediction of colorectal cancer via gut microbiome 16S rRNA sequencing data. J Med Microbiol 2024; 73. [PMID: 39377779 DOI: 10.1099/jmm.0.001903] [Indexed: 10/09/2024] Open
Abstract
Introduction. The study addresses the challenge of utilizing human gut microbiome data for the early detection of colorectal cancer (CRC). The research emphasizes the potential of using machine learning techniques to analyze complex microbiome datasets, providing a non-invasive approach to identifying CRC-related microbial markers. Hypothesis/Gap Statement. The primary hypothesis is that a robust machine learning-based analysis of 16S rRNA microbiome data can identify specific microbial features that serve as effective biomarkers for CRC detection, overcoming the limitations of classical statistical models in high-dimensional settings. Aim. The primary objective of this study is to explore and validate the potential of the human microbiome, specifically in the colon, as a valuable source of biomarkers for CRC detection and progression. The focus is on developing a classifier that effectively predicts the presence of CRC and normal samples based on the analysis of three previously published faecal 16S rRNA sequencing datasets. Methodology. To achieve the aim, various machine learning techniques are employed, including random forest (RF), recursive feature elimination (RFE) and a robust correlation-based technique known as the fuzzy forest (FF). The study utilizes these methods to analyse the three datasets, comparing their performance in predicting CRC and normal samples. The emphasis is on identifying the most relevant microbial features (taxa) associated with CRC development via partial dependence plots, i.e. a machine learning tool focused on explainability, visualizing how a feature influences the predicted outcome. Results. The analysis of the three faecal 16S rRNA sequencing datasets reveals the consistent and superior predictive performance of the FF compared to the RF and RFE. Notably, FF proves effective in addressing the correlation problem when assessing the importance of microbial taxa in explaining the development of CRC. The results highlight the potential of the human microbiome as a non-invasive means to detect CRC and underscore the significance of employing FF for improved predictive accuracy. Conclusion. In conclusion, this study underscores the limitations of classical statistical techniques in handling high-dimensional information such as human microbiome data. The research demonstrates the potential of the human microbiome, specifically in the colon, as a valuable source of biomarkers for CRC detection. Applying machine learning techniques, particularly the FF, is a promising approach for building a classifier to predict CRC and normal samples. The findings advocate for integrating FF to overcome the challenges associated with correlation when identifying crucial microbial features linked to CRC development.
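The partial dependence idea mentioned in the abstract above can be illustrated with a minimal sketch. It is not the authors' code: `toy_predict` is a hypothetical stand-in for a fitted classifier, and the taxon names and sample values are invented. For each grid value, the feature of interest is overwritten in every sample and the predictions are averaged.

```python
from statistics import mean


def partial_dependence(predict, rows, feature, grid):
    """Average prediction when `feature` is forced to each grid value."""
    curve = []
    for value in grid:
        # Overwrite one feature in every sample; leave the others untouched.
        modified = [{**row, feature: value} for row in rows]
        curve.append(mean(predict(row) for row in modified))
    return curve


def toy_predict(row):
    # Hypothetical model: the CRC score rises with both taxa (capped at 1).
    return min(1.0, 0.1 + 0.5 * row["taxon_a"] + 0.2 * row["taxon_b"])


# Invented relative abundances for three samples.
samples = [
    {"taxon_a": 0.0, "taxon_b": 0.5},
    {"taxon_a": 0.4, "taxon_b": 0.0},
    {"taxon_a": 0.8, "taxon_b": 1.0},
]

pd_curve = partial_dependence(toy_predict, samples, "taxon_a", [0.0, 0.5, 1.0])
print(pd_curve)
```

Plotting `pd_curve` against the grid gives the partial dependence plot for `taxon_a`; a rising curve means that, averaged over the other features, higher abundance pushes the predicted score up.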
Affiliation(s)
- Annamaria Porreca
- Department of Economics, Statistics and Business, Faculty of Economics and Law, Universitas Mercatorum, Rome, Italy
- Eliana Ibrahimi
- Department of Biology, University of Tirana, Tirana, Albania
- Fabrizio Maturo
- Department of Economics, Statistics and Business, Faculty of Technological and Innovation Sciences, Universitas Mercatorum, Rome, Italy
- Laura Judith Marcos Zambrano
- Computational Biology Group, Precision Nutrition and Cancer Research Program, IMDEA Food Institute, Madrid, Spain
- Melisa Meto
- Department of Biology, University of Tirana, Tirana, Albania
- Marta B Lopes
- Center for Mathematics and Applications (NOVA Math), NOVA School of Science and Technology, Caparica, Portugal
- UNIDEMI, Research and Development Unit for Mechanical and Industrial Engineering, NOVA School of Science and Technology, Caparica, Portugal
2
Prasath ST, Navaneethan C. Colorectal cancer prognosis based on dietary pattern using synthetic minority oversampling technique with K-nearest neighbors approach. Sci Rep 2024; 14:17709. [PMID: 39085324 PMCID: PMC11292025 DOI: 10.1038/s41598-024-67848-3] [Received: 11/27/2023] [Accepted: 07/16/2024] [Indexed: 08/02/2024] Open
Abstract
A person's life span depends in part on their diet, which can contribute to deadly diseases such as colorectal cancer (CRC). In 2020, CRC accounted for one million deaths globally, representing 10% of all cancer deaths. A total of 76,679 men and 78,213 women over the age of 59 from ten states in the United States participated in this analysis. During follow-up, 1378 men and 981 women were diagnosed with colon cancer. This prospective cohort study used 231 food items and their variants as input features to identify CRC patients. Before labelling any food as cancer-causing, it is ethically important to analyse how many grams of it are consumed daily and how many times a week. This research examines five classification algorithms on real-time datasets: K-Nearest Neighbour (KNN), Decision Tree (DT), Random Forest (RF), Logistic Regression with Classifier Chain (LRCC), and Logistic Regression with Label Powerset (LRLC). Then, the SMOTE algorithm is applied to address class imbalance in the data. Our study shows that eating more than 10 g/d of low-fat butter in bread (RR 1.99, CI 0.91-4.39) and eating it more than twice a week (RR 1.49, CI 0.93-2.38) increases CRC risk. Eating more than 74 g of beef steak daily (RR 0.88, CI 0.50-1.55) and eating it more than once a week (RR 0.88, CI 0.62-1.23) decreases the risk of CRC. Beef and dairy products in a daily diet should therefore be consumed with attention to quantity; consuming them in moderation on a regular basis protects against CRC risk. Meanwhile, a high intake of poultry (RR 0.2, CI 0.05-0.81), fish (RR 0.82, CI 0.31-2.16), and pork (RR 0.67, CI 0.17-2.65) is negatively correlated with CRC risk.
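For readers unfamiliar with the RR and CI figures quoted above, the sketch below shows how a relative risk and its 95% confidence interval are commonly computed from cohort counts. The abstract does not state which method the authors used, so the log-scale (Katz) interval is an assumption here, and the counts are invented.

```python
import math


def relative_risk(exposed_cases, exposed_total,
                  unexposed_cases, unexposed_total, z=1.96):
    """Relative risk with a log-scale confidence interval (Katz method)."""
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    rr = risk_exposed / risk_unexposed
    # Standard error of ln(RR) for two independent binomial proportions.
    se_log = math.sqrt(
        1 / exposed_cases - 1 / exposed_total
        + 1 / unexposed_cases - 1 / unexposed_total
    )
    lower = math.exp(math.log(rr) - z * se_log)
    upper = math.exp(math.log(rr) + z * se_log)
    return rr, lower, upper


# Hypothetical counts: 30 cases among 1000 exposed, 20 among 1000 unexposed.
rr, lower, upper = relative_risk(30, 1000, 20, 1000)
print(f"RR {rr:.2f}, CI {lower:.2f}-{upper:.2f}")
```

An interval that contains 1, as in this toy example, is compatible with no difference in risk between the two groups at the chosen confidence level.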
Affiliation(s)
- S Thanga Prasath
- School of Computer Science Engineering and Information Systems, Vellore Institute of Technology, Vellore, Tamil Nadu, India
- C Navaneethan
- School of Computer Science Engineering and Information Systems, Vellore Institute of Technology, Vellore, Tamil Nadu, India.
3
Bars-Cortina D, Ramon E, Rius-Sansalvador B, Guinó E, Garcia-Serrano A, Mach N, Khannous-Lleiffe O, Saus E, Gabaldón T, Ibáñez-Sanz G, Rodríguez-Alonso L, Mata A, García-Rodríguez A, Obón-Santacana M, Moreno V. Comparison between 16S rRNA and shotgun sequencing in colorectal cancer, advanced colorectal lesions, and healthy human gut microbiota. BMC Genomics 2024; 25:730. [PMID: 39075388 PMCID: PMC11285316 DOI: 10.1186/s12864-024-10621-7] [Received: 04/08/2024] [Accepted: 07/15/2024] [Indexed: 07/31/2024] Open
Abstract
BACKGROUND Gut dysbiosis has been associated with colorectal cancer (CRC), the third most prevalent cancer in the world. This study compares microbiota taxonomic and abundance results obtained by 16S rRNA gene sequencing (16S) and whole shotgun metagenomic sequencing to investigate their reliability for bacteria profiling. The experimental design included 156 human stool samples from healthy controls, advanced (high-risk) colorectal lesion patients (HRL), and CRC cases, with each sample sequenced using both 16S and shotgun methods. We thoroughly compared both sequencing technologies at the species, genus, and family annotation levels, the abundance differences in these taxa, sparsity, alpha and beta diversities, ability to train prediction models, and the similarity of the microbial signature derived from these models. RESULTS As expected, the results showed that 16S detects only part of the gut microbiota community revealed by shotgun, although some genera were only profiled by 16S. The 16S abundance data was sparser and exhibited lower alpha diversity. In lower taxonomic ranks, shotgun and 16S highly differed, partially due to a disagreement in reference databases. When considering only shared taxa, the abundance was positively correlated between the two strategies. We also found a moderate correlation between the shotgun and 16S alpha-diversity measures, as well as their PCoAs. Regarding the machine learning models, only some of the shotgun models showed some degree of predictive power in an independent test set, but we could not demonstrate a clear superiority of one technology over the other. Microbial signatures from both sequencing techniques revealed taxa previously associated with CRC development, e.g., Parvimonas micra. CONCLUSIONS Shotgun and 16S sequencing provide two different lenses to examine microbial communities. 
While we have demonstrated that they can unravel common patterns (including microbial signatures), shotgun often gives a more detailed snapshot than 16S, both in depth and breadth. Instead, 16S will tend to show only part of the picture, giving greater weight to dominant bacteria in a sample. Therefore, we recommend choosing one or another sequencing technique before launching a study. Specifically, shotgun sequencing is preferred for stool microbiome samples and in-depth analyses, while 16S is more suitable for tissue samples and studies with targeted aims.
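Two of the quantities compared in this abstract, sparsity and alpha diversity, can be sketched on a single sample as follows. The abstract does not say which alpha-diversity index was used, so the Shannon index is an assumption, and both count vectors are invented for illustration.

```python
import math


def shannon_index(counts):
    """Shannon alpha diversity (natural log) of one sample's taxon counts."""
    total = sum(counts)
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in proportions)


def sparsity(counts):
    """Fraction of taxa with zero counts in the sample."""
    return sum(1 for c in counts if c == 0) / len(counts)


# Hypothetical counts for the same four taxa in one stool sample.
profile_16s = [90, 10, 0, 0]
profile_shotgun = [40, 30, 20, 10]

shannon_16s = shannon_index(profile_16s)
shannon_shotgun = shannon_index(profile_shotgun)
print(sparsity(profile_16s), round(shannon_16s, 3))
print(sparsity(profile_shotgun), round(shannon_shotgun, 3))
```

In this toy case the 16S-like profile is sparser and less even, so its Shannon index is lower, which is the direction of the difference the abstract reports.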
Affiliation(s)
- David Bars-Cortina
- Unit of Biomarkers and Susceptibility (UBS), Oncology Data Analytics Program (ODAP), Catalan Institute of Oncology (ICO), L'Hospitalet del Llobregat, Barcelona, 08908, Spain
- ONCOBELL Program, Bellvitge Biomedical Research Institute (IDIBELL), L'Hospitalet de Llobregat, Barcelona, 08908, Spain
- Elies Ramon
- Unit of Biomarkers and Susceptibility (UBS), Oncology Data Analytics Program (ODAP), Catalan Institute of Oncology (ICO), L'Hospitalet del Llobregat, Barcelona, 08908, Spain
- ONCOBELL Program, Bellvitge Biomedical Research Institute (IDIBELL), L'Hospitalet de Llobregat, Barcelona, 08908, Spain
- Blanca Rius-Sansalvador
- Unit of Biomarkers and Susceptibility (UBS), Oncology Data Analytics Program (ODAP), Catalan Institute of Oncology (ICO), L'Hospitalet del Llobregat, Barcelona, 08908, Spain
- ONCOBELL Program, Bellvitge Biomedical Research Institute (IDIBELL), L'Hospitalet de Llobregat, Barcelona, 08908, Spain
- Doctoral Programme in Biomedicine, University of Barcelona (UB), Barcelona, 08907, Spain
- Elisabet Guinó
- Unit of Biomarkers and Susceptibility (UBS), Oncology Data Analytics Program (ODAP), Catalan Institute of Oncology (ICO), L'Hospitalet del Llobregat, Barcelona, 08908, Spain
- ONCOBELL Program, Bellvitge Biomedical Research Institute (IDIBELL), L'Hospitalet de Llobregat, Barcelona, 08908, Spain
- Consortium for Biomedical Research in Epidemiology and Public Health (CIBERESP), Madrid, 28029, Spain
- Ainhoa Garcia-Serrano
- Department of Clinical Science, Intervention and Technology, Karolinska Institutet, Stockholm, 14186, Sweden
- Núria Mach
- IHAP, Université de Toulouse, INRAE, ENVT, Toulouse, France
- Olfat Khannous-Lleiffe
- Barcelona Supercomputing Centre (BSC-CNS), Barcelona, 08034, Spain
- Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Barcelona, 08028, Spain
- Ester Saus
- Barcelona Supercomputing Centre (BSC-CNS), Barcelona, 08034, Spain
- Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Barcelona, 08028, Spain
- Toni Gabaldón
- Barcelona Supercomputing Centre (BSC-CNS), Barcelona, 08034, Spain
- Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Barcelona, 08028, Spain
- Catalan Institution for Research and Advanced Studies (ICREA), Barcelona, 08010, Spain
- Centro de Investigación Biomédica en Red de Enfermedades Infecciosas (CIBERINFEC), Barcelona, 08028, Spain
- Gemma Ibáñez-Sanz
- Unit of Biomarkers and Susceptibility (UBS), Oncology Data Analytics Program (ODAP), Catalan Institute of Oncology (ICO), L'Hospitalet del Llobregat, Barcelona, 08908, Spain
- ONCOBELL Program, Bellvitge Biomedical Research Institute (IDIBELL), L'Hospitalet de Llobregat, Barcelona, 08908, Spain
- Gastroenterology Department, Bellvitge University Hospital, L'Hospitalet de Llobregat, Barcelona, 08907, Spain
- Lorena Rodríguez-Alonso
- Gastroenterology Department, Bellvitge University Hospital, L'Hospitalet de Llobregat, Barcelona, 08907, Spain
- Alfredo Mata
- Digestive System Service, Moisés Broggi Hospital, Sant Joan Despí, 08970, Spain
- Ana García-Rodríguez
- Endoscopy Unit, Digestive System Service, Viladecans Hospital-IDIBELL, Viladecans, 08840, Spain
- Mireia Obón-Santacana
- Unit of Biomarkers and Susceptibility (UBS), Oncology Data Analytics Program (ODAP), Catalan Institute of Oncology (ICO), L'Hospitalet del Llobregat, Barcelona, 08908, Spain.
- ONCOBELL Program, Bellvitge Biomedical Research Institute (IDIBELL), L'Hospitalet de Llobregat, Barcelona, 08908, Spain.
- Consortium for Biomedical Research in Epidemiology and Public Health (CIBERESP), Madrid, 28029, Spain.
- Victor Moreno
- Unit of Biomarkers and Susceptibility (UBS), Oncology Data Analytics Program (ODAP), Catalan Institute of Oncology (ICO), L'Hospitalet del Llobregat, Barcelona, 08908, Spain.
- ONCOBELL Program, Bellvitge Biomedical Research Institute (IDIBELL), L'Hospitalet de Llobregat, Barcelona, 08908, Spain.
- Consortium for Biomedical Research in Epidemiology and Public Health (CIBERESP), Madrid, 28029, Spain.
- Department of Clinical Sciences, Faculty of Medicine and Health Sciences, Universitat de Barcelona Institute of Complex Systems (UBICS), University of Barcelona (UB), L'Hospitalet de Llobregat, Barcelona, 08908, Spain.
4
Acheampong DA, Jenjaroenpun P, Wongsurawat T, Kurilung A, Pomyen Y, Kandel S, Kunadirek P, Chuaypen N, Kusonmano K, Nookaew I. CAIM: coverage-based analysis for identification of microbiome. Brief Bioinform 2024; 25:bbae424. [PMID: 39222062 PMCID: PMC11367759 DOI: 10.1093/bib/bbae424] [Received: 04/24/2024] [Revised: 06/26/2024] [Accepted: 08/13/2024] [Indexed: 09/04/2024] Open
Abstract
Accurate taxonomic profiling of microbial taxa in a metagenomic sample is vital to gain insights into microbial ecology. Recent advancements in sequencing technologies have contributed tremendously toward understanding these microbes at species resolution through a whole shotgun metagenomic approach. In this study, we developed a new bioinformatics tool, coverage-based analysis for identification of microbiome (CAIM), for accurate taxonomic classification and quantification within both long- and short-read metagenomic samples using an alignment-based method. CAIM depends on two different containment techniques to identify species in metagenomic samples, using their genome coverage information to filter out false positives rather than the traditional approach of relative abundance. In addition, we propose a nucleotide-count-based abundance estimation, which yields a lower root mean square error than the traditional read-count approach. We evaluated the performance of CAIM on 28 metagenomic mock communities and 2 synthetic datasets by comparing it with other top-performing tools. CAIM maintained consistently better performance than other tools across datasets in identifying microbial taxa and in estimating relative abundances. CAIM was then applied to a real dataset sequenced on both Nanopore (with and without amplification) and Illumina sequencing platforms, and the taxonomic profiles were highly similar between the sequencing platforms. Lastly, CAIM was applied to fecal shotgun metagenomic datasets of 232 colorectal cancer patients and 229 controls obtained from 4 different countries and of 44 primary liver cancer patients and 76 controls. Models using the genome-coverage cutoff predicted better than those using the relative-abundance cutoffs in discriminating colorectal cancer and primary liver cancer patients from healthy controls, with high-confidence species markers.
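The two ideas in this abstract, filtering species by genome coverage and estimating abundance from aligned nucleotides rather than read counts, can be sketched roughly as below. This is not CAIM's implementation: the cutoff value, the alignment records, the coverage figures, and the "expected" composition are all invented, and the exact definition of nucleotide-count abundance used by the tool is assumed to be the share of aligned bases.

```python
import math
from collections import defaultdict

# Hypothetical alignments: (species, aligned read length in bases).
alignments = [
    ("species_a", 5000), ("species_a", 5000),
    ("species_b", 1000), ("species_b", 1000),
    ("species_b", 1000), ("species_b", 1000),
    ("species_c", 200),
]
# Hypothetical (covered bases, genome length) per species.
coverage_stats = {
    "species_a": (9000, 10000),
    "species_b": (3500, 10000),
    "species_c": (200, 10000),
}
COVERAGE_CUTOFF = 0.10  # assumed threshold, not taken from the paper


def normalize(totals):
    """Turn raw totals into relative abundances that sum to 1."""
    grand_total = sum(totals.values())
    return {name: value / grand_total for name, value in totals.items()}


def rmse(estimate, truth):
    """Root mean square error between two abundance profiles."""
    return math.sqrt(
        sum((estimate[k] - truth[k]) ** 2 for k in truth) / len(truth)
    )


# Keep only species whose genome coverage passes the cutoff.
kept = {
    name for name, (covered, length) in coverage_stats.items()
    if covered / length >= COVERAGE_CUTOFF
}

read_counts = defaultdict(int)
base_counts = defaultdict(int)
for species, length in alignments:
    if species in kept:
        read_counts[species] += 1        # read-count approach
        base_counts[species] += length   # nucleotide-count approach

by_reads = normalize(read_counts)
by_bases = normalize(base_counts)

# Invented mock-community composition, chosen only to exercise rmse().
expected = {"species_a": 0.7, "species_b": 0.3}
rmse_reads = rmse(by_reads, expected)
rmse_bases = rmse(by_bases, expected)
print(sorted(kept), round(rmse_reads, 3), round(rmse_bases, 3))
```

The toy numbers were picked so that read lengths differ strongly between species, which is where the two estimates diverge; they say nothing about real performance.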
Affiliation(s)
- Daniel A Acheampong
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, 4301 W Markham St, Little Rock, AR 72205, United States
- Stowers Institute for Medical Research, 1000 E 50 St, Kansas City, MO 64110, United States
- Piroon Jenjaroenpun
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, 4301 W Markham St, Little Rock, AR 72205, United States
- Division of Medical Bioinformatics, Department of Research, Faculty of Medicine Siriraj Hospital, Mahidol University, 2 Wang Lang Road, Siriraj, Bangkok Noi, Bangkok 10700, Thailand
- Thidathip Wongsurawat
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, 4301 W Markham St, Little Rock, AR 72205, United States
- Division of Medical Bioinformatics, Department of Research, Faculty of Medicine Siriraj Hospital, Mahidol University, 2 Wang Lang Road, Siriraj, Bangkok Noi, Bangkok 10700, Thailand
- Alongkorn Kurilung
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, 4301 W Markham St, Little Rock, AR 72205, United States
- Yotsawat Pomyen
- Translational Research Unit, Chulabhorn Research Institute, 54 Kamphaeng Phet Rd., Laksi, Bangkok 10210, Thailand
- Sangam Kandel
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, 4301 W Markham St, Little Rock, AR 72205, United States
- Influenza Research Institute, Department of Pathobiological Sciences, School of Veterinary Medicine, University of Wisconsin-Madison, 575 Science Drive, Madison, WI 53711, United States
- Pattapon Kunadirek
- Center of Excellence in Hepatitis and Liver Cancer, Department of Biochemistry, Faculty of Medicine, Chulalongkorn University, Rama 4 road, Pathumwan, Bangkok 10330, Thailand
- Natthaya Chuaypen
- Center of Excellence in Hepatitis and Liver Cancer, Department of Biochemistry, Faculty of Medicine, Chulalongkorn University, Rama 4 road, Pathumwan, Bangkok 10330, Thailand
- Kanthida Kusonmano
- Bioinformatics and Systems Biology Program, School of Bioresources and Technology, King Mongkut’s University of Technology Thonburi, 49 Soi Thian Thale 25, Bang Khun Thian Chai Thale Road, Tha Kham, Bang Khun Thian, Bangkok 10150, Thailand
- Systems Biology and Bioinformatics Research Laboratory, Pilot Plant Development and Training Institute, King Mongkut’s University of Technology Thonburi, 49 Soi Thian Thale 25, Bang Khun Thian Chai Thale Road, Tha Kham, Bang Khun Thian, Bangkok 10150, Thailand
- Intawat Nookaew
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, 4301 W Markham St, Little Rock, AR 72205, United States
- Division of Endocrinology, Department of Medicine, University of Arkansas for Medical Sciences, 4301 W Markham St, Little Rock, AR 72205, United States
- Department of Physiology and Cell Biology, University of Arkansas for Medical Sciences, 4301 W Markham St, Little Rock, AR 72205, United States
- Department of Biochemistry, Faculty of Medicine Siriraj Hospital, Mahidol University, 2 Wang Lang Road, Siriraj, Bangkok Noi, Bangkok 10700, Thailand
5
Hagen M, Dass R, Westhues C, Blom J, Schultheiss SJ, Patz S. Interpretable machine learning decodes soil microbiome's response to drought stress. Environ Microbiome 2024; 19:35. [PMID: 38812054 PMCID: PMC11138018 DOI: 10.1186/s40793-024-00578-1] [Received: 11/30/2023] [Accepted: 05/10/2024] [Indexed: 05/31/2024]
Abstract
BACKGROUND Extreme weather events induced by climate change, particularly droughts, have detrimental consequences for crop yields and food security. Concurrently, these conditions provoke substantial changes in the soil bacterial microbiota and affect plant health. Early recognition of soil affected by drought enables farmers to implement appropriate agricultural management practices. In this context, interpretable machine learning holds immense potential for drought stress classification of soil based on marker taxa. RESULTS This study demonstrates that the 16S rRNA-based metagenomic approach of Differential Abundance Analysis methods and machine learning-based Shapley Additive Explanation values provide similar information. They exhibit their potential as complementary approaches for identifying marker taxa and investigating their enrichment or depletion under drought stress in grass lineages. Additionally, the Random Forest Classifier trained on a diverse range of relative abundance data from the soil bacterial microbiome of various plant species achieves a high accuracy of 92.3% at the genus rank for drought stress prediction. It demonstrates its generalization capacity for the lineages tested. CONCLUSIONS In the detection of drought stress in soil bacterial microbiota, this study emphasizes the potential of an optimized and generalized location-based ML classifier. By identifying marker taxa, this approach holds promising implications for microbe-assisted plant breeding programs and contributes to the development of sustainable agriculture practices. These findings are crucial for preserving global food security in the face of climate change.
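The Shapley Additive Explanation values mentioned above attribute one prediction to individual features. A minimal sketch of the underlying Shapley computation follows; it enumerates every coalition exactly and replaces "absent" features with a single baseline sample, which is a simplification assumed here (SHAP tooling typically approximates this over background data). The model and genus names are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, sample, baseline):
    """Exact Shapley values of one prediction against a baseline sample."""
    names = list(sample)
    n = len(names)

    def value(coalition):
        # Features in the coalition take the sample's value, others the baseline's.
        mixed = {k: (sample[k] if k in coalition else baseline[k]) for k in names}
        return predict(mixed)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                total += weight * (value(members | {name}) - value(members))
        phi[name] = total
    return phi


def toy_predict(row):
    # Hypothetical drought score with one interaction term.
    return (0.2 + 0.6 * row["genus_a"] - 0.3 * row["genus_b"]
            + 0.4 * row["genus_a"] * row["genus_c"])


sample = {"genus_a": 1.0, "genus_b": 1.0, "genus_c": 1.0}
baseline = {"genus_a": 0.0, "genus_b": 0.0, "genus_c": 0.0}
phi = shapley_values(toy_predict, sample, baseline)
print(phi)
```

The values sum to the difference between the prediction for the sample and for the baseline, and their signs indicate whether a taxon pushes the prediction up or down, which is how such values can be read alongside enrichment or depletion.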
Affiliation(s)
- Michelle Hagen
- Computomics GmbH, Eisenbahnstraße 1, 72072, Tübingen, Baden-Württemberg, Germany
- Rupashree Dass
- Computomics GmbH, Eisenbahnstraße 1, 72072, Tübingen, Baden-Württemberg, Germany
- Cathy Westhues
- Computomics GmbH, Eisenbahnstraße 1, 72072, Tübingen, Baden-Württemberg, Germany
- Jochen Blom
- Bioinformatics and Systems Biology, Justus Liebig University Gießen, Heinrich-Buff-Ring 58, 35390, Gießen, Hesse, Germany
- Sascha Patz
- Computomics GmbH, Eisenbahnstraße 1, 72072, Tübingen, Baden-Württemberg, Germany.
6
Acheampong DA, Jenjaroenpun P, Wongsurawat T, Krulilung A, Pomyen Y, Kandel S, Kunadirek P, Chuaypen N, Kusonmano K, Nookaew I. CAIM: Coverage-based Analysis for Identification of Microbiome. bioRxiv 2024:2024.04.25.591018. [PMID: 38746391 PMCID: PMC11091946 DOI: 10.1101/2024.04.25.591018] [Indexed: 05/16/2024]
Abstract
Accurate taxonomic profiling of microbial taxa in a metagenomic sample is vital to gain insights into microbial ecology. Recent advancements in sequencing technologies have contributed tremendously toward understanding these microbes at species resolution through a whole shotgun metagenomic (WMS) approach. In this study, we developed a new bioinformatics tool, CAIM, for accurate taxonomic classification and quantification within both long- and short-read metagenomic samples using an alignment-based method. CAIM depends on two different containment techniques to identify species in metagenomic samples, using their genome coverage information to filter out false positives rather than the traditional approach of relative abundance. In addition, we propose a nucleotide-count-based abundance estimation, which yields a lower root mean square error than the traditional read-count approach. We evaluated the performance of CAIM on 28 metagenomic mock communities and 2 synthetic datasets by comparing it with other top-performing tools. CAIM maintained consistently better performance than other tools across datasets in identifying microbial taxa and in estimating relative abundances. CAIM was then applied to a real dataset sequenced on both Nanopore (with and without amplification) and Illumina sequencing platforms, and the taxonomic profiles were highly similar between the sequencing platforms. Lastly, CAIM was applied to fecal shotgun metagenomic datasets of 232 colorectal cancer patients and 229 controls obtained from 4 different countries and of 44 primary liver cancer patients and 76 controls. Models using the genome-coverage cutoff predicted better than those using the relative-abundance cutoffs in discriminating colorectal cancer and primary liver cancer patients from healthy controls, with high-confidence species markers.
Affiliation(s)
- Daniel A. Acheampong
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, USA
- Stowers Institute for Medical Research, Kansas City, MO, USA
- Piroon Jenjaroenpun
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, USA
- Division of Medical Bioinformatics, Department of Research, Faculty of Medicine Siriraj Hospital, Mahidol University, Bangkok, Thailand
- Thidathip Wongsurawat
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, USA
- Division of Medical Bioinformatics, Department of Research, Faculty of Medicine Siriraj Hospital, Mahidol University, Bangkok, Thailand
- Alongkorn Krulilung
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, USA
- Yotsawat Pomyen
- Translational Research Unit, Chulabhorn Research Institute, Bangkok, 10210, Thailand
- Sangam Kandel
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, USA
- Pattapon Kunadirek
- Center of Excellence in Hepatitis and Liver Cancer, Department of Biochemistry, Faculty of Medicine, Chulalongkorn University, Bangkok, Thailand
- Natthaya Chuaypen
- Center of Excellence in Hepatitis and Liver Cancer, Department of Biochemistry, Faculty of Medicine, Chulalongkorn University, Bangkok, Thailand
- Kanthida Kusonmano
- Bioinformatics and Systems Biology Program, School of Bioresources and Technology, King Mongkut’s University of Technology Thonburi, Bangkok, 10150, Thailand
- Systems Biology and Bioinformatics Research Laboratory, Pilot Plant Development and Training Institute, King Mongkut’s University of Technology Thonburi, Bangkok, 10150, Thailand
- Intawat Nookaew
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, USA
7
Gorman ED, Lladser ME. Interpretable metric learning in comparative metagenomics: The adaptive Haar-like distance. PLoS Comput Biol 2024; 20:e1011543. [PMID: 38768195 PMCID: PMC11142682 DOI: 10.1371/journal.pcbi.1011543] [Received: 09/26/2023] [Revised: 05/31/2024] [Accepted: 04/25/2024] [Indexed: 05/22/2024] Open
Abstract
Random forests have emerged as a promising tool in comparative metagenomics because they can predict environmental characteristics based on microbial composition in datasets where β-diversity metrics fall short of revealing meaningful relationships between samples. Nevertheless, despite this efficacy, they lack biological insight in tandem with their predictions, potentially hindering scientific advancement. To overcome this limitation, we leverage a geometric characterization of random forests to introduce a data-driven phylogenetic β-diversity metric, the adaptive Haar-like distance. This new metric assigns a weight to each internal node (i.e., split or bifurcation) of a reference phylogeny, indicating the relative importance of that node in discerning environmental samples based on their microbial composition. Alongside this, a weighted nearest-neighbors classifier, constructed using the adaptive metric, can be used as a proxy for the random forest while maintaining accuracy on par with that of the original forest and another state-of-the-art classifier, CoDaCoRe. As shown in datasets from diverse microbial environments, however, the new metric and classifier significantly enhance the biological interpretability and visualization of high-dimensional metagenomic samples.
Affiliation(s)
- Evan D. Gorman
- Department of Applied Mathematics, University of Colorado, Boulder, Colorado, United States of America
- Manuel E. Lladser
- Department of Applied Mathematics, University of Colorado, Boulder, Colorado, United States of America
8
Biró B, Gál Z, Fekete Z, Klecska E, Hoffmann OI. Mitochondrial genome plasticity of mammalian species. BMC Genomics 2024; 25:278. [PMID: 38486136 PMCID: PMC10941376 DOI: 10.1186/s12864-024-10201-9] [Received: 06/27/2023] [Accepted: 03/08/2024] [Indexed: 03/17/2024] Open
Abstract
There is an ongoing process in which mitochondrial sequences are being integrated into the nuclear genome. The importance of these sequences has already been revealed in cancer biology, forensics, phylogenetic studies and the evolution of eukaryotic genetic information. The genomes of humans and numerous model organisms have been described from the point of view of these sequences. Furthermore, recent studies have been published on the patterns of these nuclear localised mitochondrial sequences in different taxa. However, the results of the previously released studies are difficult to compare due to the lack of standardised methods and/or the small number of genomes used. Therefore, in this paper our primary goal is to establish a uniform mining pipeline to explore these nuclear localised mitochondrial sequences. Our results show that the frequency of several repetitive elements is higher in the flanking regions of these sequences than expected. A machine learning model reveals that the flanking regions' repetitive elements and different structural characteristics are highly influential during the integration process. In this paper, we introduce a general mining pipeline for all mammalian genomes. The workflow is publicly available and is believed to serve as a validated baseline for future research in this field. On what is, to our current knowledge, the largest dataset, we confirm the widespread opinion that structural circumstances and events corresponding to repetitive elements are highly significant. An accurate model has also been trained to predict these sequences and their corresponding flanking regions.
Affiliation(s)
- Bálint Biró
- Agribiotechnology and Precision Breeding for Food Security National Laboratory, Department of Animal Biotechnology, Institute of Genetics and Biotechnology, Hungarian University of Agriculture and Life Sciences, Szent-Györgyi Albert str. 4, 2100, Gödöllő, Hungary.
- Group BM, Data Insights Team, _VOIS, Kerepesi str. 35, 1087, Budapest, Hungary.
- Zoltán Gál
- Agribiotechnology and Precision Breeding for Food Security National Laboratory, Department of Animal Biotechnology, Institute of Genetics and Biotechnology, Hungarian University of Agriculture and Life Sciences, Szent-Györgyi Albert str. 4, 2100, Gödöllő, Hungary
| | - Zsófia Fekete
- Department of Genetics and Genomics, Institute of Genetics and Biotechnology, Hungarian University of Agriculture and Life Sciences, Szent-Györgyi Albert str. 4, 2100, Gödöllő, Hungary
| | - Eszter Klecska
- FamiCord Group, Krio Institute, Kelemen László str, 1026, Budapest, Hungary
| | - Orsolya Ivett Hoffmann
- Agribiotechnology and Precision Breeding for Food Security National Laboratory, Department of Animal Biotechnology, Institute of Genetics and Biotechnology, Hungarian University of Agriculture and Life Sciences, Szent-Györgyi Albert str. 4, 2100, Gödöllő, Hungary.
| |
|
9
|
Gao Y, Sun F. Batch normalization followed by merging is powerful for phenotype prediction integrating multiple heterogeneous studies. PLoS Comput Biol 2023; 19:e1010608. [PMID: 37844077 PMCID: PMC10602384 DOI: 10.1371/journal.pcbi.1010608] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/27/2022] [Revised: 10/26/2023] [Accepted: 09/30/2023] [Indexed: 10/18/2023]
Abstract
Heterogeneity in different genomic studies compromises the performance of machine learning models in cross-study phenotype predictions. Overcoming heterogeneity when incorporating different studies for phenotype prediction is a challenging and critical step in developing machine learning algorithms with reproducible prediction performance on independent datasets. We investigated the best approaches to integrate different studies of the same type of omics data under a variety of heterogeneities. We developed a comprehensive workflow to simulate different types of heterogeneity and to evaluate the performance of different integration methods together with batch normalization using ComBat. We also demonstrated the results through realistic applications on six colorectal cancer (CRC) metagenomic studies and six tuberculosis (TB) gene expression studies, respectively. We showed that heterogeneity in different genomic studies can markedly reduce the reproducibility of machine learning classifiers. ComBat normalization improved the prediction performance of the machine learning classifier when heterogeneous populations were present, and could successfully remove batch effects within the same population. We also showed that the classifier's prediction accuracy can decrease markedly as the underlying disease models of the training and test populations become more different. Comparing merging and integration methods, we found that each can outperform the other in different scenarios. In the realistic applications, we observed that prediction accuracy improved when ComBat normalization was applied with merging or integration methods in both CRC and TB studies. We illustrated that batch normalization is essential for mitigating both population differences between studies and batch effects. We also showed that both the merging strategy and integration methods can achieve good performance when combined with batch normalization. In addition, we explored the potential of boosting phenotype prediction performance with rank aggregation methods and showed that they performed similarly to other ensemble learning approaches.
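The "normalize each batch, then merge" idea in this abstract can be illustrated with a minimal sketch. It does not reproduce ComBat: it assumes a plain per-study centering and scaling of each feature as a simplified stand-in (no empirical Bayes shrinkage, no covariates), and the function name, study identifiers, and toy values are hypothetical.

```python
from statistics import mean, pstdev


def standardize_then_merge(studies):
    """Center and scale every feature within each study, then pool all samples.

    NOTE: simplified stand-in for ComBat, used only to illustrate the
    "batch normalization followed by merging" order of operations.
    `studies` maps a (hypothetical) study id to a list of samples, where
    each sample is a list of feature values.
    """
    merged = []
    for study_id, samples in studies.items():
        n_features = len(samples[0])
        columns = [[s[j] for s in samples] for j in range(n_features)]
        centers = [mean(col) for col in columns]
        # A constant feature has zero spread; fall back to 1.0 to avoid dividing by zero.
        scales = [pstdev(col) or 1.0 for col in columns]
        for s in samples:
            adjusted = [(s[j] - centers[j]) / scales[j] for j in range(n_features)]
            merged.append((study_id, adjusted))
    return merged


# Toy data: study_B is shifted by a large study-specific offset on feature 0.
studies = {
    "study_A": [[1.0, 10.0], [3.0, 14.0]],
    "study_B": [[101.0, 20.0], [103.0, 20.0]],
}
merged = standardize_then_merge(studies)
```

After the adjustment, the study-specific offset on the first feature no longer separates the two studies, so the pooled samples could be passed to a single classifier.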
Affiliation(s)
- Yilin Gao
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, California, United States of America
- Fengzhu Sun
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, California, United States of America
|
10
|
Sun Y, Zhang X, Jin C, Yue K, Sheng D, Zhang T, Dou X, Liu J, Jing H, Zhang L, Yue J. Prospective, longitudinal analysis of the gut microbiome in patients with locally advanced rectal cancer predicts response to neoadjuvant concurrent chemoradiotherapy. J Transl Med 2023; 21:221. [PMID: 36967379 PMCID: PMC10041716 DOI: 10.1186/s12967-023-04054-1] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 12/04/2022] [Accepted: 03/10/2023] [Indexed: 03/28/2023]
Abstract
BACKGROUND Neoadjuvant concurrent chemoradiotherapy (nCCRT) is a standard treatment for locally advanced rectal cancer (LARC). The gut microbiome may be reshaped by radiotherapy through its effects on microbial composition, mucosal immunity, and the systemic immune system. We sought to clarify dynamic, longitudinal changes in the gut microbiome and blood immunomodulators throughout nCCRT and to explore the relationship of such changes with outcomes after nCCRT.
METHODS A total of 39 patients with LARC were recruited for this study. Fecal samples and peripheral blood samples were collected from all 39 patients before nCCRT, during nCCRT (at week 3), and after nCCRT (at week 5). The gut microbiota and the microbial community structure were analyzed by 16S rRNA sequencing of the V3-V4 region. Levels of blood immunomodulatory proteins were measured with a Millipore HCKPMAG-11K kit and Luminex 200 platform (Luminex, USA).
RESULTS Cross-sectional and longitudinal analyses revealed that the gut microbiome profile and enterotype exhibited characteristic variations that could distinguish patients with good response (AJCC TRG classification 0-1) vs poor response (TRG 2-3) to nCCRT. Sparse partial least squares regression and canonical correspondence analyses showed multivariate associations between specific microbial taxa, host immunomodulatory proteins, immune cells, and outcomes after nCCRT. An integrated model consisting of baseline Clostridium sensu stricto 1 levels, fold changes in Intestinimonas, blood levels of the herpesvirus entry mediator (HVEM/CD270), and lymphocyte counts could predict good vs poor outcome after nCCRT (area under the receiver operating characteristic curve [AUC] = 0.821; area under the precision-recall curve [AUPR] = 0.911).
CONCLUSIONS Our results showed that longitudinal variations in specific gut taxa, associated host immune cells, and immunomodulatory proteins before and during nCCRT could be useful for early predictions of the efficacy of nCCRT, which could guide the choice of individualized treatment for patients with LARC.
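For readers unfamiliar with the AUC reported above, the following is a minimal sketch of how an area under the ROC curve can be computed from predicted scores. It uses the rank-based (Mann-Whitney) formulation; the labels and scores are made-up toy values, not data from the study, and the function name is hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Equals the probability that a randomly chosen positive sample is scored
    higher than a randomly chosen negative sample (ties count as one half).
    `labels` uses 1 for the positive class and 0 for the negative class.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example (hypothetical values): 1 = good response, 0 = poor response.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)  # 3 of 4 positive/negative pairs are ordered correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.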
Affiliation(s)
- Yi Sun
  - Department of Radiation Oncology, Shandong Cancer Hospital and Institute, Shandong First Medical University and Shandong Academy of Medical Sciences, Jinan, China
  - The Comprehensive Cancer Centre of Nanjing Drum Tower Hospital, The Affiliated Hospital of Nanjing University Medical School, Nanjing, China
- Xiang Zhang
  - Department of Radiation Oncology, Shandong Cancer Hospital and Institute, Shandong First Medical University and Shandong Academy of Medical Sciences, Jinan, China
- Chuandi Jin
  - Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, China
  - Microbiome-X, National Institute of Health Data Science of China & Institute for Medical Dataology, Cheeloo College of Medicine, Shandong University, Jinan, China
- Kaile Yue
  - Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, China
  - Microbiome-X, National Institute of Health Data Science of China & Institute for Medical Dataology, Cheeloo College of Medicine, Shandong University, Jinan, China
- Dashuang Sheng
  - Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, China
  - Microbiome-X, National Institute of Health Data Science of China & Institute for Medical Dataology, Cheeloo College of Medicine, Shandong University, Jinan, China
- Tao Zhang
  - Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, China
- Xue Dou
  - Department of Radiation Oncology, Shandong Cancer Hospital and Institute, Shandong First Medical University and Shandong Academy of Medical Sciences, Jinan, China
- Jing Liu
  - Department of Radiation Oncology, Shandong Cancer Hospital and Institute, Shandong First Medical University and Shandong Academy of Medical Sciences, Jinan, China
- Hongbiao Jing
  - Department of Pathology, Shandong Cancer Hospital and Institute, Shandong First Medical University and Shandong Academy of Medical Sciences, Jinan, China
- Lei Zhang
  - Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, China
  - Microbiome-X, National Institute of Health Data Science of China & Institute for Medical Dataology, Cheeloo College of Medicine, Shandong University, Jinan, China
  - State Key Laboratory of Microbial Technology, Shandong University, Qingdao, China
- Jinbo Yue
  - Department of Radiation Oncology, Shandong Cancer Hospital and Institute, Shandong First Medical University and Shandong Academy of Medical Sciences, Jinan, China
|
11
|
Leveraging Scheme for Cross-Study Microbiome Machine Learning Prediction and Feature Evaluations. Bioengineering (Basel) 2023; 10:bioengineering10020231. [PMID: 36829725 PMCID: PMC9952031 DOI: 10.3390/bioengineering10020231] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/05/2023] [Revised: 02/02/2023] [Accepted: 02/04/2023] [Indexed: 02/11/2023]
Abstract
The microbiota has proved to be a critical factor in many diseases, and researchers have been using microbiome data for disease prediction. However, models trained on one microbiome study may not be easily applicable to other independent studies because of the high level of variability in microbiome data. In this study, we developed a method for improving the generalizability and interpretability of machine learning models for predicting three different conditions (colorectal cancer, Crohn's disease, and immunotherapy response) using nine independent microbiome datasets. Our method involves combining a smaller dataset with a larger dataset, and we found that using at least 25% of the target samples in the source data resulted in improved model performance. We identified random forest as our top model and employed feature selection to identify common and important taxa for disease prediction across the different studies. Our results suggest that this leveraging scheme is a promising approach for improving the accuracy and interpretability of machine learning models for predicting diseases based on microbiome data.
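One way to read the leveraging scheme described above is as a data split in which a fraction of the target study's samples is added to the source study's training data, with the remaining target samples held out for evaluation. The sketch below implements only that split, under this assumed reading; the function name, the default fraction, and the sample identifiers are hypothetical, and the abstract does not specify how the samples were actually chosen.

```python
import math
import random


def leverage_split(source, target, fraction=0.25, seed=0):
    """Move a fraction of target samples into the source training set.

    Returns (train, held_out): `train` is all source samples plus a random
    subset of target samples; `held_out` is the rest of the target samples.
    Assumed interpretation of the scheme, not the authors' implementation.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    shuffled = list(target)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_moved = math.ceil(fraction * len(shuffled))
    train = list(source) + shuffled[:n_moved]
    held_out = shuffled[n_moved:]
    return train, held_out


# Toy sample identifiers standing in for microbiome profiles.
source_samples = [f"source_{i}" for i in range(10)]
target_samples = [f"target_{i}" for i in range(8)]
train, held_out = leverage_split(source_samples, target_samples, fraction=0.25)
```

With 8 target samples and a fraction of 0.25, two target samples join the 10 source samples for training and six remain for testing.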
|
12
|
Lee Y, Cappellato M, Di Camillo B. Machine learning-based feature selection to search stable microbial biomarkers: application to inflammatory bowel disease. Gigascience 2022; 12:giad083. [PMID: 37882604 PMCID: PMC10600917 DOI: 10.1093/gigascience/giad083] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 06/21/2023] [Revised: 08/23/2023] [Accepted: 09/17/2023] [Indexed: 10/27/2023]
Abstract
BACKGROUND Biomarker discovery exploiting the feature importance of machine learning models has recently gained ground in the microbiome landscape, given its high predictive performance in several disease states. To make a concrete selection among a large number of features, recursive feature elimination (RFE) has been widely used in the bioinformatics field. However, machine learning-based RFE has factors that decrease the stability of feature selection. In this article, we suggest methods to improve stability while sustaining performance.
RESULTS We exploited the abundance matrices of the gut microbiome (283 taxa at species level and 220 at genus level) to classify patients with inflammatory bowel disease (IBD) and healthy controls (1,569 samples). We found that applying an already published data transformation before RFE improves feature stability significantly. Moreover, we performed an in-depth evaluation of different variants of the data transformation and identified those that demonstrate better improvement in stability while not sacrificing classification performance. To ensure a robust comparison, we evaluated stability using various similarity metrics, distances, the common number of features, and the ability to filter out noise features. We were able to confirm that the mapping by the Bray-Curtis similarity matrix before RFE consistently improves stability while maintaining good performance. The multilayer perceptron algorithm exhibited the highest performance among 8 different machine learning algorithms when a large number of features (a few hundred) were considered, based on the best performance across 100 bootstrapped internal test sets. Conversely, when utilizing only a limited number of biomarkers as a trade-off between optimal performance and method generalizability, the random forest algorithm demonstrated the best performance. Using the optimal pipeline we developed, we identified 14 biomarkers for IBD at the species level and analyzed their roles using Shapley additive explanations.
CONCLUSION Taken together, our work not only showed how to improve biomarker discovery in the metataxonomic field without sacrificing classification performance but also provided useful insights for future comparative studies.
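The Bray-Curtis similarity mentioned above can be sketched as follows. The abstract does not state how the similarity matrix is applied as a mapping before RFE, so this example only computes pairwise similarities between abundance profiles; the function names and the toy profiles are hypothetical, and treating two all-zero profiles as identical is an assumption made here.

```python
def bray_curtis_similarity(u, v):
    """Bray-Curtis similarity: 1 - sum|u_i - v_i| / sum(u_i + v_i).

    Inputs are non-negative abundance vectors of equal length. Two all-zero
    profiles are treated as identical (similarity 1.0); this is an
    assumption, since the index is undefined in that case.
    """
    if len(u) != len(v):
        raise ValueError("profiles must have the same length")
    total = sum(u) + sum(v)
    if total == 0:
        return 1.0
    return 1.0 - sum(abs(a - b) for a, b in zip(u, v)) / total


def similarity_matrix(profiles):
    """Symmetric matrix of pairwise Bray-Curtis similarities."""
    return [[bray_curtis_similarity(p, q) for q in profiles] for p in profiles]


# Toy abundance profiles (hypothetical values).
profiles = [[1, 0], [0, 1], [1, 1]]
sim = similarity_matrix(profiles)
```

Profiles that share no abundance get a similarity of 0, identical profiles get 1, and the matrix is symmetric.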
Affiliation(s)
- Youngro Lee
  - Department of Electrical and Computer Engineering, Seoul National University, Seoul, 08826, Korea
  - Institute of Engineering Research at Seoul National University, Seoul, 08826, Korea
- Marco Cappellato
  - Department of Information Engineering, University of Padova, Padova, 35122, Italy
- Barbara Di Camillo
  - Department of Information Engineering, University of Padova, Padova, 35122, Italy
|
13
|
Zuo W, Michail S, Sun F. Metagenomic Analyses of Multiple Gut Datasets Revealed the Association of Phage Signatures in Colorectal Cancer. Front Cell Infect Microbiol 2022; 12:918010. [PMID: 35782128 PMCID: PMC9240273 DOI: 10.3389/fcimb.2022.918010] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 04/11/2022] [Accepted: 05/12/2022] [Indexed: 12/24/2022]
Abstract
The association between colorectal cancer (CRC) and human gut microbiome dysbiosis has been the focus of several studies. Many bacterial taxa have been shown to be differentially abundant in CRC patients compared to healthy controls. However, the relationship between CRC and non-bacterial components of the gut microbiome, such as the gut virome, is under-studied and not well understood. In this study we conducted a comprehensive analysis of the association of viral abundances with CRC using metagenomic shotgun sequencing data of 462 CRC subjects and 449 healthy controls from 7 studies performed in 8 different countries. Despite the high heterogeneity, our results showed that the virome alpha diversity was consistently higher in CRC patients than in healthy controls (p-value <0.001). This finding is in sharp contrast to previous reports of low alpha diversity of prokaryotes in CRC compared to healthy controls. In addition to the previously known association of Podoviridae, Siphoviridae and Myoviridae with CRC, we further demonstrate that Herelleviridae, a newly constructed viral family, is significantly depleted in CRC subjects. Our interkingdom association analysis reveals a less intertwined correlation between the gut virome and bacteriome in CRC compared to healthy controls. Furthermore, we show that the viral abundance profiles can be used to accurately predict CRC disease status (AUROC >0.8) in both within-study and cross-study settings. The combination of training sets resulted in rather generalized and accurate prediction models. Our study clearly shows that subjects with colorectal cancer harbor a distinct human gut virome profile which may have an important role in this disease.
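Alpha diversity, as compared above between CRC patients and controls, summarizes the diversity within a single sample. The abstract does not say which index was used; the sketch below assumes the Shannon index as one common choice, with hypothetical function and variable names and toy abundance counts.

```python
import math


def shannon_index(abundances):
    """Shannon diversity H = -sum(p_i * ln(p_i)) over taxa with p_i > 0.

    `abundances` are non-negative counts or relative abundances for one
    sample; they are normalized to proportions before the index is computed.
    """
    total = sum(abundances)
    if total <= 0:
        raise ValueError("sample must contain at least one non-zero abundance")
    proportions = [a / total for a in abundances if a > 0]
    return -sum(p * math.log(p) for p in proportions)


# Toy samples (hypothetical counts): an even community and a dominated one.
even_sample = [25, 25, 25, 25]
dominated_sample = [97, 1, 1, 1]
h_even = shannon_index(even_sample)
h_dominated = shannon_index(dominated_sample)
```

The even community reaches the maximum value for four taxa, ln(4), while the community dominated by one taxon scores much lower.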
Affiliation(s)
- Wenxuan Zuo
  - Quantitative and Computational Biology Department, University of Southern California, Los Angeles, CA, United States
- Sonia Michail
  - Department of Pediatrics, Keck School of Medicine of the University of Southern California, Los Angeles, CA, United States
- Fengzhu Sun
  - Quantitative and Computational Biology Department, University of Southern California, Los Angeles, CA, United States
  - *Correspondence: Fengzhu Sun,
|