Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Gao Y, Zhu Z, Sun F. Increasing prediction performance of colorectal cancer disease status using random forests classification based on metagenomic shotgun sequencing data. Synth Syst Biotechnol 2022;7:574-585. [PMID: 35155839 PMCID: PMC8801753 DOI: 10.1016/j.synbio.2022.01.005] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 11/01/2021] [Revised: 12/14/2021] [Accepted: 01/19/2022] [Indexed: 12/14/2022] Open



Cited by Other Article(s)

Porreca A, Ibrahimi E, Maturo F, Marcos Zambrano LJ, Meto M, Lopes MB. Robust prediction of colorectal cancer via gut microbiome 16S rRNA sequencing data. J Med Microbiol 2024;73. [PMID: 39377779 DOI: 10.1099/jmm.0.001903] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/09/2024] Open

Abstract

Introduction. The study addresses the challenge of utilizing human gut microbiome data for the early detection of colorectal cancer (CRC). The research emphasizes the potential of using machine learning techniques to analyze complex microbiome datasets, providing a non-invasive approach to identifying CRC-related microbial markers.

Hypothesis/Gap Statement. The primary hypothesis is that a robust machine learning-based analysis of 16S rRNA microbiome data can identify specific microbial features that serve as effective biomarkers for CRC detection, overcoming the limitations of classical statistical models in high-dimensional settings.

Aim. The primary objective of this study is to explore and validate the potential of the human microbiome, specifically in the colon, as a valuable source of biomarkers for colorectal cancer (CRC) detection and progression. The focus is on developing a classifier that effectively predicts the presence of CRC and normal samples based on the analysis of three previously published faecal 16S rRNA sequencing datasets.

Methodology. To achieve the aim, various machine learning techniques are employed, including random forest (RF), recursive feature elimination (RFE) and a robust correlation-based technique known as the fuzzy forest (FF). The study utilizes these methods to analyse the three datasets, comparing their performance in predicting CRC and normal samples. The emphasis is on identifying the most relevant microbial features (taxa) associated with CRC development via partial dependence plots, i.e. a machine learning tool focused on explainability, visualizing how a feature influences the predicted outcome.

Results. The analysis of the three faecal 16S rRNA sequencing datasets reveals the consistent and superior predictive performance of the FF compared to the RF and RFE. Notably, FF proves effective in addressing the correlation problem when assessing the importance of microbial taxa in explaining the development of CRC. The results highlight the potential of the human microbiome as a non-invasive means to detect CRC and underscore the significance of employing FF for improved predictive accuracy.

Conclusion. In conclusion, this study underscores the limitations of classical statistical techniques in handling high-dimensional information such as human microbiome data. The research demonstrates the potential of the human microbiome, specifically in the colon, as a valuable source of biomarkers for CRC detection. Applying machine learning techniques, particularly the FF, is a promising approach for building a classifier to predict CRC and normal samples. The findings advocate for integrating FF to overcome the challenges associated with correlation when identifying crucial microbial features linked to CRC development.
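The partial dependence idea mentioned in the methodology can be illustrated with a minimal sketch. It is not the authors' code: `toy_model`, the sample values, and the grid are hypothetical stand-ins for a fitted classifier and real taxon abundances. For each grid value, the chosen feature is overwritten in every sample and the model's predictions are averaged.

```python
def partial_dependence(predict, samples, feature_index, grid):
    """Average prediction when one feature is forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in samples:
            modified = list(row)          # copy so the input data is untouched
            modified[feature_index] = value
            total += predict(modified)
        curve.append((value, total / len(samples)))
    return curve


def toy_model(row):
    # Hypothetical score: rises with feature 0, shifted slightly by feature 1.
    return min(1.0, 0.2 + 0.5 * row[0] + 0.1 * row[1])


# Hypothetical abundances for two taxa in three samples.
samples = [[0.0, 1.0], [0.4, 0.0], [1.0, 2.0]]
curve = partial_dependence(toy_model, samples, feature_index=0, grid=[0.0, 0.5, 1.0])
for value, avg in curve:
    print(f"feature 0 = {value:.1f} -> mean prediction {avg:.2f}")
```

Plotting the `(value, mean prediction)` pairs gives the partial dependence curve for that feature.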


Prasath ST, Navaneethan C. Colorectal cancer prognosis based on dietary pattern using synthetic minority oversampling technique with K-nearest neighbors approach. Sci Rep 2024;14:17709. [PMID: 39085324 PMCID: PMC11292025 DOI: 10.1038/s41598-024-67848-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/27/2023] [Accepted: 07/16/2024] [Indexed: 08/02/2024] Open

Bars-Cortina D, Ramon E, Rius-Sansalvador B, Guinó E, Garcia-Serrano A, Mach N, Khannous-Lleiffe O, Saus E, Gabaldón T, Ibáñez-Sanz G, Rodríguez-Alonso L, Mata A, García-Rodríguez A, Obón-Santacana M, Moreno V. Comparison between 16S rRNA and shotgun sequencing in colorectal cancer, advanced colorectal lesions, and healthy human gut microbiota. BMC Genomics 2024;25:730. [PMID: 39075388 PMCID: PMC11285316 DOI: 10.1186/s12864-024-10621-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/08/2024] [Accepted: 07/15/2024] [Indexed: 07/31/2024] Open

Abstract

BACKGROUND

Gut dysbiosis has been associated with colorectal cancer (CRC), the third most prevalent cancer in the world. This study compares microbiota taxonomic and abundance results obtained by 16S rRNA gene sequencing (16S) and whole shotgun metagenomic sequencing to investigate their reliability for bacteria profiling. The experimental design included 156 human stool samples from healthy controls, advanced (high-risk) colorectal lesion patients (HRL), and CRC cases, with each sample sequenced using both 16S and shotgun methods. We thoroughly compared both sequencing technologies at the species, genus, and family annotation levels, the abundance differences in these taxa, sparsity, alpha and beta diversities, ability to train prediction models, and the similarity of the microbial signature derived from these models.

RESULTS

As expected, the results showed that 16S detects only part of the gut microbiota community revealed by shotgun, although some genera were only profiled by 16S. The 16S abundance data was sparser and exhibited lower alpha diversity. In lower taxonomic ranks, shotgun and 16S highly differed, partially due to a disagreement in reference databases. When considering only shared taxa, the abundance was positively correlated between the two strategies. We also found a moderate correlation between the shotgun and 16S alpha-diversity measures, as well as their PCoAs. Regarding the machine learning models, only some of the shotgun models showed some degree of predictive power in an independent test set, but we could not demonstrate a clear superiority of one technology over the other. Microbial signatures from both sequencing techniques revealed taxa previously associated with CRC development, e.g., Parvimonas micra.

CONCLUSIONS

Shotgun and 16S sequencing provide two different lenses to examine microbial communities. While we have demonstrated that they can unravel common patterns (including microbial signatures), shotgun often gives a more detailed snapshot than 16S, both in depth and breadth. Instead, 16S will tend to show only part of the picture, giving greater weight to dominant bacteria in a sample. Therefore, we recommend choosing one or another sequencing technique before launching a study. Specifically, shotgun sequencing is preferred for stool microbiome samples and in-depth analyses, while 16S is more suitable for tissue samples and studies with targeted aims.


Affiliation(s)

David Bars-Cortina Unit of Biomarkers and Susceptibility (UBS), Oncology Data Analytics Program (ODAP), Catalan Institute of Oncology (ICO), L'Hospitalet del Llobregat, Barcelona, 08908, Spain ONCOBELL Program, Bellvitge Biomedical Research Institute (IDIBELL), L'Hospitalet de Llobregat, Barcelona, 08908, Spain
Elies Ramon Unit of Biomarkers and Susceptibility (UBS), Oncology Data Analytics Program (ODAP), Catalan Institute of Oncology (ICO), L'Hospitalet del Llobregat, Barcelona, 08908, Spain ONCOBELL Program, Bellvitge Biomedical Research Institute (IDIBELL), L'Hospitalet de Llobregat, Barcelona, 08908, Spain
Blanca Rius-Sansalvador Unit of Biomarkers and Susceptibility (UBS), Oncology Data Analytics Program (ODAP), Catalan Institute of Oncology (ICO), L'Hospitalet del Llobregat, Barcelona, 08908, Spain ONCOBELL Program, Bellvitge Biomedical Research Institute (IDIBELL), L'Hospitalet de Llobregat, Barcelona, 08908, Spain Doctoral Programme in Biomedicine, University of Barcelona (UB), Barcelona, 08907, Spain
Elisabet Guinó Unit of Biomarkers and Susceptibility (UBS), Oncology Data Analytics Program (ODAP), Catalan Institute of Oncology (ICO), L'Hospitalet del Llobregat, Barcelona, 08908, Spain ONCOBELL Program, Bellvitge Biomedical Research Institute (IDIBELL), L'Hospitalet de Llobregat, Barcelona, 08908, Spain Consortium for Biomedical Research in Epidemiology and Public Health (CIBERESP), Madrid, 28029, Spain
Ainhoa Garcia-Serrano Department of Clinical Science, Intervention and Technology, Karolinska Institutet, Stockholm, 14186, Sweden
Núria Mach IHAP, Université de Toulouse, INRAE, ENVT, Toulouse, France
Olfat Khannous-Lleiffe Barcelona Supercomputing Centre (BSC-CNS), Barcelona, 08034, Spain Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Barcelona, 08028, Spain
Ester Saus Barcelona Supercomputing Centre (BSC-CNS), Barcelona, 08034, Spain Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Barcelona, 08028, Spain
Toni Gabaldón Barcelona Supercomputing Centre (BSC-CNS), Barcelona, 08034, Spain Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Barcelona, 08028, Spain Catalan Institution for Research and Advanced Studies (ICREA), Barcelona, 08010, Spain Centro de Investigación Biomédica en Red de Enfermedades Infecciosas (CIBERINFEC), Barcelona, 08028, Spain
Gemma Ibáñez-Sanz Unit of Biomarkers and Susceptibility (UBS), Oncology Data Analytics Program (ODAP), Catalan Institute of Oncology (ICO), L'Hospitalet del Llobregat, Barcelona, 08908, Spain ONCOBELL Program, Bellvitge Biomedical Research Institute (IDIBELL), L'Hospitalet de Llobregat, Barcelona, 08908, Spain Gastroenterology Department, Bellvitge University Hospital, L'Hospitalet de Llobregat, Barcelona, 08907, Spain
Lorena Rodríguez-Alonso Gastroenterology Department, Bellvitge University Hospital, L'Hospitalet de Llobregat, Barcelona, 08907, Spain
Alfredo Mata Digestive System Service, Moisés Broggi Hospital, Sant Joan Despí, 08970, Spain
Ana García-Rodríguez Endoscopy Unit, Digestive System Service, Viladecans Hospital-IDIBELL, Viladecans, 08840, Spain
Mireia Obón-Santacana Unit of Biomarkers and Susceptibility (UBS), Oncology Data Analytics Program (ODAP), Catalan Institute of Oncology (ICO), L'Hospitalet del Llobregat, Barcelona, 08908, Spain. ONCOBELL Program, Bellvitge Biomedical Research Institute (IDIBELL), L'Hospitalet de Llobregat, Barcelona, 08908, Spain. Consortium for Biomedical Research in Epidemiology and Public Health (CIBERESP), Madrid, 28029, Spain.
Victor Moreno Unit of Biomarkers and Susceptibility (UBS), Oncology Data Analytics Program (ODAP), Catalan Institute of Oncology (ICO), L'Hospitalet del Llobregat, Barcelona, 08908, Spain. ONCOBELL Program, Bellvitge Biomedical Research Institute (IDIBELL), L'Hospitalet de Llobregat, Barcelona, 08908, Spain. Consortium for Biomedical Research in Epidemiology and Public Health (CIBERESP), Madrid, 28029, Spain. Department of Clinical Sciences, Faculty of Medicine and Health Sciences, Universitat de Barcelona Institute of Complex Systems (UBICS), University of Barcelona (UB), L'Hospitalet de Llobregat, Barcelona, 08908, Spain.


Acheampong DA, Jenjaroenpun P, Wongsurawat T, Kurilung A, Pomyen Y, Kandel S, Kunadirek P, Chuaypen N, Kusonmano K, Nookaew I. CAIM: coverage-based analysis for identification of microbiome. Brief Bioinform 2024;25:bbae424. [PMID: 39222062 PMCID: PMC11367759 DOI: 10.1093/bib/bbae424] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/24/2024] [Revised: 06/26/2024] [Accepted: 08/13/2024] [Indexed: 09/04/2024] Open

Abstract

Accurate taxonomic profiling of microbial taxa in a metagenomic sample is vital to gain insights into microbial ecology. Recent advancements in sequencing technologies have contributed tremendously toward understanding these microbes at species resolution through a whole shotgun metagenomic approach. In this study, we developed a new bioinformatics tool, coverage-based analysis for identification of microbiome (CAIM), for accurate taxonomic classification and quantification within both long- and short-read metagenomic samples using an alignment-based method. CAIM depends on two different containment techniques to identify species in metagenomic samples, using their genome coverage information to filter out false positives rather than the traditional approach of relative abundance. In addition, we propose a nucleotide-count-based abundance estimation, which yields a lower root mean square error than the traditional read-count approach. We evaluated the performance of CAIM on 28 metagenomic mock communities and 2 synthetic datasets by comparing it with other top-performing tools. CAIM maintained consistently better performance than other tools across datasets in identifying microbial taxa and in estimating relative abundances. CAIM was then applied to a real dataset sequenced on both Nanopore (with and without amplification) and Illumina sequencing platforms and found high similarity of taxonomic profiles between the sequencing platforms. Lastly, CAIM was applied to fecal shotgun metagenomic datasets of 232 colorectal cancer patients and 229 controls obtained from 4 different countries and 44 primary liver cancer patients and 76 controls. The predictive performance of models using the genome-coverage cutoff was better than that of models using the relative-abundance cutoffs in discriminating colorectal cancer and primary liver cancer patients from healthy controls with high-confidence species markers.
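The two ideas in this abstract, filtering species by genome coverage and estimating abundance from aligned nucleotides rather than read counts, can be sketched as follows. This is an illustration only, not CAIM's implementation: the function names, the 0.5 breadth cutoff, the genome lengths, and the alignment coordinates are all hypothetical.

```python
from collections import defaultdict


def covered_bases(intervals):
    """Count reference positions covered by at least one half-open [start, end) interval."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Gap before this interval: close the previous merged block.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def profile(alignments, genome_lengths, min_breadth=0.5):
    """Drop species with low breadth of coverage, then compute both abundance estimates."""
    by_species = defaultdict(list)
    for species, start, end in alignments:
        by_species[species].append((start, end))

    breadth = {sp: covered_bases(iv) / genome_lengths[sp] for sp, iv in by_species.items()}
    kept = {sp: iv for sp, iv in by_species.items() if breadth[sp] >= min_breadth}
    if not kept:
        return {}

    total_reads = sum(len(iv) for iv in kept.values())
    total_bases = sum(end - start for iv in kept.values() for start, end in iv)
    return {
        sp: {
            "breadth": breadth[sp],
            "read_frac": len(iv) / total_reads,                                # read-count abundance
            "base_frac": sum(end - start for start, end in iv) / total_bases,  # nucleotide-count abundance
        }
        for sp, iv in kept.items()
    }


# Hypothetical data: A has two long reads, B four short reads, C three reads piled on one spot.
genome_lengths = {"A": 1000, "B": 1000, "C": 1000}
alignments = [
    ("A", 0, 600), ("A", 500, 900),
    ("B", 0, 150), ("B", 200, 350), ("B", 400, 550), ("B", 600, 750),
    ("C", 100, 200), ("C", 100, 200), ("C", 100, 200),
]
result = profile(alignments, genome_lengths)
print(result)
```

In this toy input, species C has three reads but covers only a tenth of its genome, so a coverage cutoff removes it, whereas a read-count threshold would not distinguish it from A. The two abundance estimates also diverge when read lengths differ between species, as they can in long-read data.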


Affiliation(s)

Daniel A Acheampong Department of Biomedical Informatics, University of Arkansas for Medical Sciences, 4301 W Markham St, Little Rock, AR 72205, United States Stowers Institute for Medical Research, 1000 E 50 St, Kansas City, MO 64110, United States
Piroon Jenjaroenpun Department of Biomedical Informatics, University of Arkansas for Medical Sciences, 4301 W Markham St, Little Rock, AR 72205, United States Division of Medical Bioinformatics, Department of Research, Faculty of Medicine Siriraj Hospital, Mahidol University, 2 Wang Lang Road, Siriraj, Bangkok Noi, Bangkok 10700, Thailand
Thidathip Wongsurawat Department of Biomedical Informatics, University of Arkansas for Medical Sciences, 4301 W Markham St, Little Rock, AR 72205, United States Division of Medical Bioinformatics, Department of Research, Faculty of Medicine Siriraj Hospital, Mahidol University, 2 Wang Lang Road, Siriraj, Bangkok Noi, Bangkok 10700, Thailand
Alongkorn Kurilung Department of Biomedical Informatics, University of Arkansas for Medical Sciences, 4301 W Markham St, Little Rock, AR 72205, United States
Yotsawat Pomyen Translational Research Unit, Chulabhorn Research Institute, 54 Kamphaeng Phet Rd., Laksi, Bangkok 10210, Thailand
Sangam Kandel Department of Biomedical Informatics, University of Arkansas for Medical Sciences, 4301 W Markham St, Little Rock, AR 72205, United States Influenza Research Institute, Department of Pathobiological Sciences, School of Veterinary Medicine, University of Wisconsin-Madison, 575 Science Drive, Madison, WI 53711, United States
Pattapon Kunadirek Center of Excellence in Hepatitis and Liver Cancer, Department of Biochemistry, Faculty of Medicine, Chulalongkorn University, Rama 4 road, Pathumwan, Bangkok 10330, Thailand
Natthaya Chuaypen Center of Excellence in Hepatitis and Liver Cancer, Department of Biochemistry, Faculty of Medicine, Chulalongkorn University, Rama 4 road, Pathumwan, Bangkok 10330, Thailand
Kanthida Kusonmano Bioinformatics and Systems Biology Program, School of Bioresources and Technology, King Mongkut’s University of Technology Thonburi, 49 Soi Thian Thale 25, Bang Khun Thian Chai Thale Road, Tha Kham, Bang Khun Thian, Bangkok 10150, Thailand Systems Biology and Bioinformatics Research Laboratory, Pilot Plant Development and Training Institute, King Mongkut’s University of Technology Thonburi, 49 Soi Thian Thale 25, Bang Khun Thian Chai Thale Road, Tha Kham, Bang Khun Thian, Bangkok 10150, Thailand
Intawat Nookaew Department of Biomedical Informatics, University of Arkansas for Medical Sciences, 4301 W Markham St, Little Rock, AR 72205, United States Division of Endocrinology, Department of Medicine, University of Arkansas for Medical Sciences, 4301 W Markham St, Little Rock, AR 72205, United States Department of Physiology and Cell Biology, University of Arkansas for Medical Sciences, 4301 W Markham St, Little Rock, AR 72205, United States Department of Biochemistry, Faculty of Medicine Siriraj Hospital, Mahidol University, 2 Wang Lang Road, Siriraj, Bangkok Noi, Bangkok 10700, Thailand


Hagen M, Dass R, Westhues C, Blom J, Schultheiss SJ, Patz S. Interpretable machine learning decodes soil microbiome's response to drought stress. Environmental Microbiome 2024;19:35. [PMID: 38812054 PMCID: PMC11138018 DOI: 10.1186/s40793-024-00578-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/30/2023] [Accepted: 05/10/2024] [Indexed: 05/31/2024]

Acheampong DA, Jenjaroenpun P, Wongsurawat T, Krulilung A, Pomyen Y, Kandel S, Kunadirek P, Chuaypen N, Kusonmano K, Nookaew I. CAIM: Coverage-based Analysis for Identification of Microbiome. bioRxiv 2024:2024.04.25.591018. [PMID: 38746391 PMCID: PMC11091946 DOI: 10.1101/2024.04.25.591018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/16/2024]

Abstract

Accurate taxonomic profiling of microbial taxa in a metagenomic sample is vital to gain insights into microbial ecology. Recent advancements in sequencing technologies have contributed tremendously toward understanding these microbes at species resolution through a whole shotgun metagenomic (WMS) approach. In this study, we developed a new bioinformatics tool, CAIM, for accurate taxonomic classification and quantification within both long- and short-read metagenomic samples using an alignment-based method. CAIM depends on two different containment techniques to identify species in metagenomic samples, using their genome coverage information to filter out false positives rather than the traditional approach of relative abundance. In addition, we propose a nucleotide-count-based abundance estimation, which yields a lower root mean square error than the traditional read-count approach. We evaluated the performance of CAIM on 28 metagenomic mock communities and 2 synthetic datasets by comparing it with other top-performing tools. CAIM maintained consistently better performance than other tools across datasets in identifying microbial taxa and in estimating relative abundances. CAIM was then applied to a real dataset sequenced on both Nanopore (with and without amplification) and Illumina sequencing platforms and found high similarity of taxonomic profiles between the sequencing platforms. Lastly, CAIM was applied to fecal shotgun metagenomic datasets of 232 colorectal cancer patients and 229 controls obtained from 4 different countries and 44 primary liver cancer patients and 76 controls. The predictive performance of models using the genome-coverage cutoff was better than that of models using the relative-abundance cutoffs in discriminating colorectal cancer and primary liver cancer patients from healthy controls with high-confidence species markers.


Affiliation(s)

Daniel A. Acheampong Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, USA Stowers Institute for Medical Research, Kansas City, MO, USA
Piroon Jenjaroenpun Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, USA Division of Medical Bioinformatics, Department of Research, Faculty of Medicine Siriraj Hospital, Mahidol University, Bangkok, Thailand
Thidathip Wongsurawat Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, USA Division of Medical Bioinformatics, Department of Research, Faculty of Medicine Siriraj Hospital, Mahidol University, Bangkok, Thailand
Alongkorn Krulilung Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, USA
Yotsawat Pomyen Translational Research Unit, Chulabhorn Research Institute, Bangkok, 10210, Thailand
Sangam Kandel Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, USA
Pattapon Kunadirek Center of Excellence in Hepatitis and Liver Cancer, Department of Biochemistry, Faculty of Medicine, Chulalongkorn University, Bangkok, Thailand
Natthaya Chuaypen Center of Excellence in Hepatitis and Liver Cancer, Department of Biochemistry, Faculty of Medicine, Chulalongkorn University, Bangkok, Thailand
Kanthida Kusonmano Bioinformatics and Systems Biology Program, School of Bioresources and Technology, King Mongkut’s University of Technology Thonburi, Bangkok, 10150, Thailand Systems Biology and Bioinformatics Research Laboratory, Pilot Plant Development and Training Institute, King Mongkut’s University of Technology Thonburi, Bangkok, 10150, Thailand
Intawat Nookaew Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, USA


Gorman ED, Lladser ME. Interpretable metric learning in comparative metagenomics: The adaptive Haar-like distance. PLoS Comput Biol 2024;20:e1011543. [PMID: 38768195 PMCID: PMC11142682 DOI: 10.1371/journal.pcbi.1011543] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/26/2023] [Revised: 05/31/2024] [Accepted: 04/25/2024] [Indexed: 05/22/2024] Open

Biró B, Gál Z, Fekete Z, Klecska E, Hoffmann OI. Mitochondrial genome plasticity of mammalian species. BMC Genomics 2024;25:278. [PMID: 38486136 PMCID: PMC10941376 DOI: 10.1186/s12864-024-10201-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/27/2023] [Accepted: 03/08/2024] [Indexed: 03/17/2024] Open

Gao Y, Sun F. Batch normalization followed by merging is powerful for phenotype prediction integrating multiple heterogeneous studies. PLoS Comput Biol 2023;19:e1010608. [PMID: 37844077 PMCID: PMC10602384 DOI: 10.1371/journal.pcbi.1010608] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/27/2022] [Revised: 10/26/2023] [Accepted: 09/30/2023] [Indexed: 10/18/2023] Open

Abstract

Heterogeneity in different genomic studies compromises the performance of machine learning models in cross-study phenotype predictions. Overcoming heterogeneity when incorporating different studies in terms of phenotype prediction is a challenging and critical step for developing machine learning algorithms with reproducible prediction performance on independent datasets. We investigated the best approaches to integrate different studies of the same type of omics data under a variety of different heterogeneities. We developed a comprehensive workflow to simulate a variety of different types of heterogeneity and evaluate the performances of different integration methods together with batch normalization by using ComBat. We also demonstrated the results through realistic applications on six colorectal cancer (CRC) metagenomic studies and six tuberculosis (TB) gene expression studies, respectively. We showed that heterogeneity in different genomic studies can markedly reduce the machine learning classifier's reproducibility. ComBat normalization improved the prediction performance of the machine learning classifier when heterogeneous populations were present, and could successfully remove batch effects within the same population. We also showed that the machine learning classifier's prediction accuracy can be markedly decreased as the underlying disease model became more different in training and test populations. Comparing different merging and integration methods, we found that merging and integration methods can outperform each other in different scenarios. In the realistic applications, we observed that the prediction accuracy improved when applying ComBat normalization with merging or integration methods in both CRC and TB studies. We illustrated that batch normalization is essential for mitigating both population differences of different studies and batch effects. We also showed that both merging strategy and integration methods can achieve good performances when combined with batch normalization. In addition, we explored the potential of boosting phenotype prediction performance by rank aggregation methods and showed that rank aggregation methods had performance similar to that of other ensemble learning approaches.
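The "normalize each study, then merge" pattern described in this abstract can be sketched in a simplified form. The sketch below is not ComBat, which additionally uses empirical Bayes shrinkage and can preserve covariates of interest; it swaps in plain per-study, per-feature standardization as a stand-in. The study names and values are hypothetical.

```python
from statistics import mean, pstdev


def standardize_then_merge(studies):
    """Center and scale every feature within each study, then pool all samples.

    studies: dict mapping study name -> list of samples (each a list of feature values).
    Returns a list of (study name, standardized sample) pairs.
    """
    merged = []
    for name, samples in studies.items():
        n_features = len(samples[0])
        mus = [mean(s[j] for s in samples) for j in range(n_features)]
        # Fall back to 1.0 when a feature is constant within the study.
        sds = [pstdev([s[j] for s in samples]) or 1.0 for j in range(n_features)]
        for s in samples:
            merged.append((name, [(s[j] - mus[j]) / sds[j] for j in range(n_features)]))
    return merged


# Hypothetical studies: study_b has an additive offset of 100 on feature 0.
studies = {
    "study_a": [[1.0, 10.0], [3.0, 14.0]],
    "study_b": [[101.0, 10.0], [103.0, 14.0]],
}
merged = standardize_then_merge(studies)
for name, sample in merged:
    print(name, sample)
```

After standardization, the additive offset between the two toy studies disappears and the pooled samples can be passed to a single classifier.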


Sun Y, Zhang X, Jin C, Yue K, Sheng D, Zhang T, Dou X, Liu J, Jing H, Zhang L, Yue J. Prospective, longitudinal analysis of the gut microbiome in patients with locally advanced rectal cancer predicts response to neoadjuvant concurrent chemoradiotherapy. J Transl Med 2023;21:221. [PMID: 36967379 PMCID: PMC10041716 DOI: 10.1186/s12967-023-04054-1] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 12/04/2022] [Accepted: 03/10/2023] [Indexed: 03/28/2023] Open

Abstract

BACKGROUND

Neoadjuvant concurrent chemoradiotherapy (nCCRT) is a standard treatment for locally advanced rectal cancer (LARC). The gut microbiome may be reshaped by radiotherapy through its effects on microbial composition, mucosal immunity, and the systemic immune system. We sought to clarify dynamic, longitudinal changes in the gut microbiome and blood immunomodulators throughout nCCRT and to explore the relationship of such changes with outcomes after nCCRT.

METHODS

A total of 39 patients with LARC were recruited for this study. Fecal samples and peripheral blood samples were collected from all 39 patients before nCCRT, during nCCRT (at week 3), and after nCCRT (at week 5). The gut microbiota and the microbial community structure were analyzed by 16S rRNA sequencing of the V3-V4 region. Levels of blood immunomodulatory proteins were measured with a Millipore HCKPMAG-11 K kit and Luminex 200 platform (Luminex, USA).

RESULTS

Cross-sectional and longitudinal analyses revealed that the gut microbiome profile and enterotype exhibited characteristic variations that could distinguish patients with good response (AJCC TRG classification 0-1) vs poor response (TRG 2-3) to nCCRT. Sparse partial least squares regression and canonical correspondence analyses showed multivariate associations between specific microbial taxa, host immunomodulatory proteins, immune cells, and outcomes after nCCRT. An integrated model consisting of baseline Clostridium sensu stricto 1 levels, fold changes in Intestinimonas, blood levels of the herpesvirus entry mediator (HVEM/CD270), and lymphocyte counts could predict good vs poor outcome after nCCRT (area under the receiver operating characteristic curve [AUC] = 0.821; area under the precision-recall curve [AUPR] = 0.911).

CONCLUSIONS

Our results showed that longitudinal variations in specific gut taxa, associated host immune cells, and immunomodulatory proteins before and during nCCRT could be useful for early predictions of the efficacy of nCCRT, which could guide the choice of individualized treatment for patients with LARC.
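For readers unfamiliar with the two metrics reported in the results, the sketch below shows one common way to compute them. It is not the study's code: the labels and scores are hypothetical, AUC is computed as the fraction of correctly ordered positive/negative pairs, and AUPR is approximated by average precision, which is only one of several estimators of the area under the precision-recall curve.

```python
def roc_auc(labels, scores):
    """Probability that a random positive scores above a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive (assumes at least one positive)."""
    ranked = sorted(zip(scores, labels), reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Hypothetical predictions: 1 = good response, 0 = poor response.
labels = [1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.2]
auc = roc_auc(labels, scores)
aupr = average_precision(labels, scores)
print(f"AUC = {auc:.3f}, average precision = {aupr:.3f}")
```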


Affiliation(s)

Yi Sun Department of Radiation Oncology, Shandong Cancer Hospital and Institute, Shandong First Medical University and Shandong Academy of Medical Sciences, Jinan, China The Comprehensive Cancer Centre of Nanjing Drum Tower Hospital, The Affiliated Hospital of Nanjing University Medical School, Nanjing, China
Xiang Zhang Department of Radiation Oncology, Shandong Cancer Hospital and Institute, Shandong First Medical University and Shandong Academy of Medical Sciences, Jinan, China
Chuandi Jin Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, China Microbiome-X, National Institute of Health Data Science of China & Institute for Medical Dataology, Cheeloo College of Medicine, Shandong University, Jinan, China
Kaile Yue Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, China Microbiome-X, National Institute of Health Data Science of China & Institute for Medical Dataology, Cheeloo College of Medicine, Shandong University, Jinan, China
Dashuang Sheng Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, China Microbiome-X, National Institute of Health Data Science of China & Institute for Medical Dataology, Cheeloo College of Medicine, Shandong University, Jinan, China
Tao Zhang Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, China
Xue Dou Department of Radiation Oncology, Shandong Cancer Hospital and Institute, Shandong First Medical University and Shandong Academy of Medical Sciences, Jinan, China
Jing Liu Department of Radiation Oncology, Shandong Cancer Hospital and Institute, Shandong First Medical University and Shandong Academy of Medical Sciences, Jinan, China
Hongbiao Jing Department of Pathology, Shandong Cancer Hospital and Institute, Shandong First Medical University and Shandong Academy of Medical Sciences, Jinan, China
Lei Zhang Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, China. Microbiome-X, National Institute of Health Data Science of China & Institute for Medical Dataology, Cheeloo College of Medicine, Shandong University, Jinan, China. State Key Laboratory of Microbial Technology, Shandong University, Qingdao, China.
Jinbo Yue Department of Radiation Oncology, Shandong Cancer Hospital and Institute, Shandong First Medical University and Shandong Academy of Medical Sciences, Jinan, China.

Leveraging Scheme for Cross-Study Microbiome Machine Learning Prediction and Feature Evaluations. Bioengineering (Basel) 2023;10:bioengineering10020231. [PMID: 36829725 PMCID: PMC9952031 DOI: 10.3390/bioengineering10020231] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/05/2023] [Revised: 02/02/2023] [Accepted: 02/04/2023] [Indexed: 02/11/2023]

Lee Y, Cappellato M, Di Camillo B. Machine learning-based feature selection to search stable microbial biomarkers: application to inflammatory bowel disease. Gigascience 2022;12:giad083. [PMID: 37882604 PMCID: PMC10600917 DOI: 10.1093/gigascience/giad083] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 06/21/2023] [Revised: 08/23/2023] [Accepted: 09/17/2023] [Indexed: 10/27/2023]

Abstract

BACKGROUND

Biomarker discovery exploiting feature importance of machine learning has risen recently in the microbiome landscape with its high predictive performance in several disease states. To have a concrete selection among a high number of features, recursive feature elimination (RFE) has been widely used in the bioinformatics field. However, machine learning-based RFE has factors that decrease the stability of feature selection. In this article, we suggested methods to improve stability while sustaining performance.

RESULTS

We exploited the abundance matrices of the gut microbiome (283 taxa at species level and 220 at genus level) to classify between patients with inflammatory bowel disease (IBD) and healthy controls (1,569 samples). We found that applying an already published data transformation before RFE improves feature stability significantly. Moreover, we performed an in-depth evaluation of different variants of the data transformation and identified those that demonstrate better improvement in stability while not sacrificing classification performance. To ensure a robust comparison, we evaluated stability using various similarity metrics, distances, the common number of features, and the ability to filter out noise features. We were able to confirm that the mapping by the Bray-Curtis similarity matrix before RFE consistently improves the stability while maintaining good performance. The multilayer perceptron algorithm exhibited the highest performance among 8 different machine learning algorithms when a large number of features (a few hundred) were considered, based on the best performance across 100 bootstrapped internal test sets. Conversely, when utilizing only a limited number of biomarkers as a trade-off between optimal performance and method generalizability, the random forest algorithm demonstrated the best performance. Using the optimal pipeline we developed, we identified 14 biomarkers for IBD at the species level and analyzed their roles using Shapley additive explanations.

CONCLUSION

Taken together, our work not only showed how to improve biomarker discovery in the metataxonomic field without sacrificing classification performance but also provided useful insights for future comparative studies.
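
As a rough illustration of the two ideas named in the abstract above (mapping an abundance table by a Bray-Curtis similarity matrix, and scoring how stable a selected feature set is across resamples), the following is a minimal sketch and not the authors' pipeline. The abstract does not spell out the transformation, so the sketch assumes it is a projection of the samples-by-taxa table onto a taxon-by-taxon similarity matrix. The function names, the toy data, and the use of the mean pairwise Jaccard index as the stability score are all assumptions made here for illustration; the RFE step itself is omitted.

```python
from itertools import combinations


def bray_curtis_similarity(u, v):
    """Bray-Curtis similarity (1 - dissimilarity) of two non-negative vectors."""
    total = sum(u) + sum(v)
    if total == 0:
        return 1.0  # assumption: two all-zero profiles count as identical
    return 1.0 - sum(abs(a - b) for a, b in zip(u, v)) / total


def feature_similarity_matrix(X):
    """Taxon-by-taxon similarity matrix from a samples x taxa abundance table."""
    columns = list(zip(*X))  # one abundance profile per taxon
    return [[bray_curtis_similarity(ci, cj) for cj in columns] for ci in columns]


def map_by_similarity(X, S):
    """Project each sample onto the similarity matrix (plain matrix product X @ S)."""
    n = len(S)
    return [[sum(row[k] * S[k][j] for k in range(n)) for j in range(n)] for row in X]


def mean_jaccard(feature_sets):
    """Average pairwise Jaccard index over selected-feature sets (needs >= 2 sets)."""
    pairs = list(combinations(feature_sets, 2))
    scores = [len(a & b) / len(a | b) if (a | b) else 1.0 for a, b in pairs]
    return sum(scores) / len(scores)


# Hypothetical toy table: 3 samples x 3 taxa (relative abundances).
X = [
    [0.5, 0.5, 0.0],
    [0.2, 0.2, 0.6],
    [0.1, 0.1, 0.8],
]
S = feature_similarity_matrix(X)
X_mapped = map_by_similarity(X, S)  # would be passed to a feature selector such as RFE

# Hypothetical feature sets selected on three bootstrap resamples.
selected = [{"taxon_a", "taxon_b"}, {"taxon_a", "taxon_b"}, {"taxon_a", "taxon_c"}]
stability = mean_jaccard(selected)
```

In this sketch, taxa with similar abundance profiles share weight after the mapping, which is one plausible reason such a transformation could make a selector less sensitive to which of several correlated taxa it picks. A higher `stability` value means the selected sets agree more across resamples.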

Zuo W, Michail S, Sun F. Metagenomic Analyses of Multiple Gut Datasets Revealed the Association of Phage Signatures in Colorectal Cancer. Front Cell Infect Microbiol 2022;12:918010. [PMID: 35782128 PMCID: PMC9240273 DOI: 10.3389/fcimb.2022.918010] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 04/11/2022] [Accepted: 05/12/2022] [Indexed: 12/24/2022]