1
Miao J, Chen T, Misir M, Lin Y. Deep learning for predicting 16S rRNA gene copy number. Sci Rep 2024; 14:14282. [PMID: 38902329 PMCID: PMC11190246 DOI: 10.1038/s41598-024-64658-5] [Received: 01/24/2024] [Accepted: 06/11/2024] [Indexed: 06/22/2024]
Abstract
Culture-independent 16S rRNA gene metabarcoding is a commonly used method for microbiome profiling. To achieve more quantitative cell fraction estimates, it is important to account for the 16S rRNA gene copy number (hereafter 16S GCN) of different community members. Currently, several bioinformatic tools are available to estimate 16S GCN values, based on either taxonomy assignment or phylogeny. Here we present a novel approach, ANNA16 (Artificial Neural Network Approximator for 16S rRNA gene copy number), a deep learning-based method that estimates 16S GCN values directly from 16S gene sequence strings. Based on 27,579 16S rRNA gene sequences and gene copy number data from the rrnDB database, we show that ANNA16 outperforms the commonly used 16S GCN prediction algorithms. Interestingly, SHapley Additive exPlanations (SHAP) shows that ANNA16 can identify unexpected informative positions in 16S rRNA gene sequences without any prior phylogenetic knowledge, which suggests potential applications beyond 16S GCN prediction.
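To illustrate why accounting for 16S GCN changes cell fraction estimates, the following is a minimal sketch of the standard copy-number correction: each taxon's read count is divided by its copy number and the results are renormalized. It is not the ANNA16 method itself; it assumes GCN values are already available from some predictor, and the function name, taxon labels, and numbers are hypothetical.

```python
from typing import Dict


def gcn_corrected_fractions(
    read_counts: Dict[str, float], gcn: Dict[str, float]
) -> Dict[str, float]:
    """Estimate cell fractions from 16S read counts and per-taxon copy numbers."""
    cell_equivalents = {}
    for taxon, reads in read_counts.items():
        copies = gcn[taxon]
        if copies <= 0:
            raise ValueError(f"copy number for {taxon} must be positive")
        # A cell with more 16S copies contributes more reads,
        # so divide reads by copy number to get cell equivalents.
        cell_equivalents[taxon] = reads / copies

    total = sum(cell_equivalents.values())
    if total == 0:
        raise ValueError("no reads to normalize")
    # Renormalize so the estimated cell fractions sum to 1.
    return {taxon: value / total for taxon, value in cell_equivalents.items()}


# Hypothetical toy community (illustrative numbers only).
reads = {"taxon_A": 700, "taxon_B": 300}
copies = {"taxon_A": 7, "taxon_B": 1}
fractions = gcn_corrected_fractions(reads, copies)
```

In this toy case taxon_A accounts for 70% of the reads but, because it carries seven 16S copies per cell against one for taxon_B, its estimated cell fraction is lower than its read fraction.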
Affiliation(s)
- Jiazheng Miao
  - Division of Applied and Natural Sciences, Duke Kunshan University, Suzhou, China
  - Department of Biomedical Informatics, Harvard Medical School, Boston, USA
- Tianlai Chen
  - Division of Applied and Natural Sciences, Duke Kunshan University, Suzhou, China
  - Department of Biomedical Engineering, Duke University, Durham, USA
- Mustafa Misir
  - Division of Applied and Natural Sciences, Duke Kunshan University, Suzhou, China
- Yajuan Lin
  - Division of Applied and Natural Sciences, Duke Kunshan University, Suzhou, China
  - Department of Life Sciences, Texas A&M University-Corpus Christi, Corpus Christi, USA
2
Regueira-Iglesias A, Balsa-Castro C, Blanco-Pintos T, Tomás I. Critical review of 16S rRNA gene sequencing workflow in microbiome studies: From primer selection to advanced data analysis. Mol Oral Microbiol 2023; 38:347-399. [PMID: 37804481 DOI: 10.1111/omi.12434] [Received: 04/26/2023] [Revised: 09/01/2023] [Accepted: 09/14/2023] [Indexed: 10/09/2023]
Abstract
The multi-batch reanalysis approach of jointly reevaluating gene/genome sequences from different works has gained particular relevance in the literature in recent years. The large amount of 16S ribosomal ribonucleic acid (rRNA) gene sequence data stored in public repositories and information in taxonomic databases of the same gene far exceeds that related to complete genomes. This review is intended to guide researchers new to studying microbiota, particularly the oral microbiota, using 16S rRNA gene sequencing and those who want to expand and update their knowledge to optimise their decision-making and improve their research results. First, we describe the advantages and disadvantages of using the 16S rRNA gene as a phylogenetic marker and the latest findings on the impact of primer pair selection on diversity and taxonomic assignment outcomes in oral microbiome studies. Strategies for primer selection based on these results are introduced. Second, we identified the key factors to consider in selecting the sequencing technology and platform. The process and particularities of the main steps for processing 16S rRNA gene-derived data are described in detail to enable researchers to choose the most appropriate bioinformatics pipeline and analysis methods based on the available evidence. We then produce an overview of the different types of advanced analyses, both the most widely used in the literature and the most recent approaches. Several indices, metrics and software for studying microbial communities are included, highlighting their advantages and disadvantages. Considering the principles of clinical metagenomics, we conclude that future research should focus on rigorous analytical approaches, such as developing predictive models to identify microbiome-based biomarkers to classify health and disease states. Finally, we address the batch effect concept and the microbiome-specific methods for accounting for or correcting them.
Affiliation(s)
- Alba Regueira-Iglesias
  - Oral Sciences Research Group, Special Needs Unit, Department of Surgery and Medical-Surgical Specialties, School of Medicine and Dentistry, Universidade de Santiago de Compostela, Health Research Institute of Santiago de Compostela (IDIS), Santiago de Compostela, A Coruña, Spain
- Carlos Balsa-Castro
  - Oral Sciences Research Group, Special Needs Unit, Department of Surgery and Medical-Surgical Specialties, School of Medicine and Dentistry, Universidade de Santiago de Compostela, Health Research Institute of Santiago de Compostela (IDIS), Santiago de Compostela, A Coruña, Spain
- Triana Blanco-Pintos
  - Oral Sciences Research Group, Special Needs Unit, Department of Surgery and Medical-Surgical Specialties, School of Medicine and Dentistry, Universidade de Santiago de Compostela, Health Research Institute of Santiago de Compostela (IDIS), Santiago de Compostela, A Coruña, Spain
- Inmaculada Tomás
  - Oral Sciences Research Group, Special Needs Unit, Department of Surgery and Medical-Surgical Specialties, School of Medicine and Dentistry, Universidade de Santiago de Compostela, Health Research Institute of Santiago de Compostela (IDIS), Santiago de Compostela, A Coruña, Spain
3
Sokhansanj BA, Zhao Z, Rosen GL. Interpretable and Predictive Deep Neural Network Modeling of the SARS-CoV-2 Spike Protein Sequence to Predict COVID-19 Disease Severity. Biology 2022; 11:1786. [PMID: 36552295 PMCID: PMC9774807 DOI: 10.3390/biology11121786] [Received: 11/15/2022] [Revised: 11/28/2022] [Accepted: 12/05/2022] [Indexed: 12/13/2022]
Abstract
Through the COVID-19 pandemic, SARS-CoV-2 has gained and lost multiple mutations in novel or unexpected combinations. Predicting how complex mutations affect COVID-19 disease severity is critical in planning public health responses as the virus continues to evolve. This paper presents a novel computational framework to complement conventional lineage classification and applies it to predict the severe disease potential of viral genetic variation. The transformer-based neural network model architecture has additional layers that provide sample embeddings and sequence-wide attention for interpretation and visualization. First, training a model to predict SARS-CoV-2 taxonomy validates the architecture's interpretability. Second, an interpretable predictive model of disease severity is trained on spike protein sequence and patient metadata from GISAID. Confounding effects of changing patient demographics, increasing vaccination rates, and improving treatment over time are addressed by including demographics and case date as independent input to the neural network model. The resulting model can be interpreted to identify potentially significant virus mutations and proves to be a robust predictive tool. Although trained on sequence data obtained entirely before the availability of empirical data for Omicron, the model can predict Omicron's reduced risk of severe disease, in accord with epidemiological and experimental data.
Affiliation(s)
- Bahrad A. Sokhansanj
  - Ecological and Evolutionary Signal-Processing and Informatics Laboratory, Department of Electrical & Computer Engineering, College of Engineering, Drexel University, Philadelphia, PA 19104, USA
4
Zhou X, Chen L, Liu HX. Applications of Machine Learning Models to Predict and Prevent Obesity: A Mini-Review. Front Nutr 2022; 9:933130. [PMID: 35866076 PMCID: PMC9294383 DOI: 10.3389/fnut.2022.933130] [Received: 04/30/2022] [Accepted: 05/19/2022] [Indexed: 11/28/2022]
Abstract
Research on obesity and related diseases has received attention from government policymakers; interventions targeting nutrient intake, dietary patterns, and physical activity are deployed globally. An urgent issue now is how we can improve the efficiency of obesity research or obesity interventions. Currently, machine learning (ML) methods are widely applied in obesity-related studies to detect obesity disease biomarkers or discover intervention strategies to optimize weight loss results. In addition, open-source availability of these algorithms is necessary to check the reproducibility of research results. Furthermore, appropriate applications of these algorithms could greatly improve the efficiency of similar studies by other researchers. Here, we present a mini-review of several open-source ML algorithms, platforms, or related databases that are of particular interest or can be applied in the field of obesity research. We focus on ML algorithms applied to nutrition, environmental and social factors, genetics or genomics, and the microbiome.
Affiliation(s)
- Xiaobei Zhou
  - Health Sciences Institute, China Medical University, Shenyang, China
  - Liaoning Key Laboratory of Obesity and Glucose/Lipid Associated Metabolic Diseases, China Medical University, Shenyang, China
- Lei Chen
  - Health Sciences Institute, China Medical University, Shenyang, China
  - Liaoning Key Laboratory of Obesity and Glucose/Lipid Associated Metabolic Diseases, China Medical University, Shenyang, China
  - Institute of Life Sciences, China Medical University, Shenyang, China
- Hui-Xin Liu
  - Health Sciences Institute, China Medical University, Shenyang, China
  - Liaoning Key Laboratory of Obesity and Glucose/Lipid Associated Metabolic Diseases, China Medical University, Shenyang, China
  - Institute of Life Sciences, China Medical University, Shenyang, China
5
Borgman J, Stark K, Carson J, Hauser L. Deep Learning Encoding for Rapid Sequence Identification on Microbiome Data. Frontiers in Bioinformatics 2022; 2:871256. [PMID: 36304316 PMCID: PMC9580936 DOI: 10.3389/fbinf.2022.871256] [Received: 02/08/2022] [Accepted: 05/30/2022] [Indexed: 11/18/2022]
Abstract
We present a novel approach for rapidly identifying sequences that leverages the representational power of Deep Learning techniques and is applied to the analysis of microbiome data. The method involves creating a latent sequence space, training a convolutional neural network to rapidly identify sequences by mapping them into that space, and leveraging the novel encoded latent space for denoising to correct sequencing errors. Using mock bacterial communities of known composition, we show that this approach achieves single nucleotide resolution, generating results for sequence identification and abundance estimation that match the best available microbiome algorithms in terms of accuracy while vastly increasing the speed of accurate processing. We further show the ability of this approach to support phenotypic prediction at the sample level on an experimental data set for which the ground truth for sequence identities and abundances is unknown, but the expected phenotypes of the samples are definitive. Moreover, this approach offers a potential solution for the analysis of data from other types of experiments that currently rely on computationally intensive sequence identification.
6
Sokhansanj BA, Rosen GL. Mapping Data to Deep Understanding: Making the Most of the Deluge of SARS-CoV-2 Genome Sequences. mSystems 2022; 7:e0003522. [PMID: 35311562 PMCID: PMC9040592 DOI: 10.1128/msystems.00035-22] [Accepted: 02/27/2022] [Indexed: 12/22/2022]
Abstract
Next-generation sequencing has been essential to the global response to the COVID-19 pandemic. As of January 2022, nearly 7 million severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) sequences are available to researchers in public databases. Sequence databases are an abundant resource from which to extract biologically relevant and clinically actionable information. As the pandemic has gone on, SARS-CoV-2 has rapidly evolved, involving complex genomic changes that challenge current approaches to classifying SARS-CoV-2 variants. Deep sequence learning could be a powerful way to build complex sequence-to-phenotype models. Unfortunately, while it can be predictive, deep learning typically produces "black box" models that cannot directly provide biological and clinical insight. Researchers should therefore consider implementing emerging methods for visualizing and interpreting deep sequence models. Finally, researchers should address important data limitations, including (i) global sequencing disparities, (ii) insufficient sequence metadata, and (iii) screening artifacts due to poor sequence quality control.
Affiliation(s)
- Bahrad A. Sokhansanj
  - Drexel University, Ecological and Evolutionary Signal-Processing and Informatics Laboratory, Department of Electrical & Computer Engineering, College of Engineering, Philadelphia, Pennsylvania, USA
- Gail L. Rosen
  - Drexel University, Ecological and Evolutionary Signal-Processing and Informatics Laboratory, Department of Electrical & Computer Engineering, College of Engineering, Philadelphia, Pennsylvania, USA
7
David MM, Tataru C, Pope Q, Baker LJ, English MK, Epstein HE, Hammer A, Kent M, Sieler MJ, Mueller RS, Sharpton TJ, Tomas F, Vega Thurber R, Fern XZ. Revealing General Patterns of Microbiomes That Transcend Systems: Potential and Challenges of Deep Transfer Learning. mSystems 2022; 7:e0105821. [PMID: 35040699 PMCID: PMC8765061 DOI: 10.1128/msystems.01058-21] [Indexed: 01/11/2023]
Abstract
A growing body of research has established that the microbiome can mediate the dynamics and functional capacities of diverse biological systems. Yet, we understand little about what governs the response of these microbial communities to host or environmental changes. Most efforts to model microbiomes focus on defining the relationships between the microbiome, host, and environmental features within a specified study system and therefore fail to capture those that may be evident across multiple systems. In parallel with these developments in microbiome research, computer scientists have developed a variety of machine learning tools that can identify subtle, but informative, patterns from complex data. Here, we recommend using deep transfer learning to resolve microbiome patterns that transcend study systems. By leveraging diverse public data sets in an unsupervised way, such models can learn contextual relationships between features and build on those patterns to perform subsequent tasks (e.g., classification) within specific biological contexts.
Affiliation(s)
- Maude M. David
  - Department of Microbiology, Oregon State University, Corvallis, Oregon, USA
  - Department of Pharmaceutical Sciences, Oregon State University, Corvallis, Oregon, USA
- Christine Tataru
  - Department of Microbiology, Oregon State University, Corvallis, Oregon, USA
- Quintin Pope
  - School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, Oregon, USA
- Lydia J. Baker
  - Department of Microbiology, Oregon State University, Corvallis, Oregon, USA
- Mary K. English
  - Department of Microbiology, Oregon State University, Corvallis, Oregon, USA
- Hannah E. Epstein
  - Department of Microbiology, Oregon State University, Corvallis, Oregon, USA
- Austin Hammer
  - Department of Microbiology, Oregon State University, Corvallis, Oregon, USA
- Michael Kent
  - Department of Microbiology, Oregon State University, Corvallis, Oregon, USA
- Michael J. Sieler
  - Department of Microbiology, Oregon State University, Corvallis, Oregon, USA
- Ryan S. Mueller
  - Department of Microbiology, Oregon State University, Corvallis, Oregon, USA
- Thomas J. Sharpton
  - Department of Microbiology, Oregon State University, Corvallis, Oregon, USA
  - Department of Statistics, Oregon State University, Corvallis, Oregon, USA
- Fiona Tomas
  - Instituto Mediterráneo de Estudios Avanzados, IMEDEA, Esporles, Balearic Islands, Spain
- Xiaoli Z. Fern
  - School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, Oregon, USA