1
Miao J, Chen T, Misir M, Lin Y. Deep learning for predicting 16S rRNA gene copy number. Sci Rep 2024; 14:14282. [PMID: 38902329 PMCID: PMC11190246 DOI: 10.1038/s41598-024-64658-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/24/2024] [Accepted: 06/11/2024] [Indexed: 06/22/2024]
Abstract
Culture-independent 16S rRNA gene metabarcoding is a commonly used method for microbiome profiling. To achieve more quantitative cell fraction estimates, it is important to account for the 16S rRNA gene copy number (hereafter 16S GCN) of different community members. Currently, there are several bioinformatic tools available to estimate the 16S GCN values, either based on taxonomy assignment or phylogeny. Here we present a novel approach ANNA16, Artificial Neural Network Approximator for 16S rRNA gene copy number, a deep learning-based method that estimates the 16S GCN values directly from the 16S gene sequence strings. Based on 27,579 16S rRNA gene sequences and gene copy number data from the rrnDB database, we show that ANNA16 outperforms the commonly used 16S GCN prediction algorithms. Interestingly, Shapley Additive exPlanations (SHAP) shows that ANNA16 can identify unexpected informative positions in 16S rRNA gene sequences without any prior phylogenetic knowledge, which suggests potential applications beyond 16S GCN prediction.
Affiliation(s)
- Jiazheng Miao
  - Division of Applied and Natural Sciences, Duke Kunshan University, Suzhou, China
  - Department of Biomedical Informatics, Harvard Medical School, Boston, USA
- Tianlai Chen
  - Division of Applied and Natural Sciences, Duke Kunshan University, Suzhou, China
  - Department of Biomedical Engineering, Duke University, Durham, USA
- Mustafa Misir
  - Division of Applied and Natural Sciences, Duke Kunshan University, Suzhou, China
- Yajuan Lin
  - Division of Applied and Natural Sciences, Duke Kunshan University, Suzhou, China
  - Department of Life Sciences, Texas A&M University-Corpus Christi, Corpus Christi, USA
2
Unger M, Kather JN. Deep learning in cancer genomics and histopathology. Genome Med 2024; 16:44. [PMID: 38539231 PMCID: PMC10976780 DOI: 10.1186/s13073-024-01315-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 02/15/2023] [Accepted: 03/13/2024] [Indexed: 07/08/2024]
Abstract
Histopathology and genomic profiling are cornerstones of precision oncology and are routinely obtained for patients with cancer. Traditionally, histopathology slides are manually reviewed by highly trained pathologists. Genomic data, on the other hand, is evaluated by engineered computational pipelines. In both applications, the advent of modern artificial intelligence methods, specifically machine learning (ML) and deep learning (DL), has opened up a fundamentally new way of extracting actionable insights from raw data, which could augment and potentially replace some aspects of traditional evaluation workflows. In this review, we summarize current and emerging applications of DL in histopathology and genomics, including basic diagnostic as well as advanced prognostic tasks. Based on a growing body of evidence, we suggest that DL could be the groundwork for a new kind of workflow in oncology and cancer research. However, we also point out that DL models can have biases and other flaws that users in healthcare and research need to know about, and we propose ways to address them.
Affiliation(s)
- Michaela Unger
  - Else Kroener Fresenius Center for Digital Health, Medical Faculty Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany
- Jakob Nikolas Kather
  - Else Kroener Fresenius Center for Digital Health, Medical Faculty Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany
  - Department of Medicine I, University Hospital Dresden, Dresden, Germany
  - Medical Oncology, National Center for Tumor Diseases (NCT), University Hospital Heidelberg, Heidelberg, Germany
3
Santorsola M, Lescai F. The promise of explainable deep learning for omics data analysis: Adding new discovery tools to AI. N Biotechnol 2023; 77:1-11. [PMID: 37329982 DOI: 10.1016/j.nbt.2023.06.002] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 02/27/2023] [Revised: 06/01/2023] [Accepted: 06/14/2023] [Indexed: 06/19/2023]
Abstract
Deep learning has already revolutionised the way a wide range of data is processed in many areas of daily life. The ability to learn abstractions and relationships from heterogeneous data has provided impressively accurate prediction and classification tools to handle increasingly big datasets. This has a significant impact on the growing wealth of omics datasets, with the unprecedented opportunity for a better understanding of the complexity of living organisms. While this revolution is transforming the way these data are analyzed, explainable deep learning is emerging as an additional tool with the potential to change the way biological data is interpreted. Explainability addresses critical issues such as transparency, so important when computational tools are introduced especially in clinical environments. Moreover, it empowers artificial intelligence with the capability to provide new insights into the input data, thus adding an element of discovery to these already powerful resources. In this review, we provide an overview of the transformative effects explainable deep learning is having on multiple sectors, ranging from genome engineering and genomics, from radiomics to drug design and clinical trials. We offer a perspective to life scientists, to better understand the potential of these tools, and a motivation to implement them in their research, by suggesting learning resources they can use to move their first steps in this field.
Affiliation(s)
- Francesco Lescai
  - Department of Biology and Biotechnology, University of Pavia, Pavia, Italy
4
Tran KA, Addala V, Johnston RL, Lovell D, Bradley A, Koufariotis LT, Wood S, Wu SZ, Roden D, Al-Eryani G, Swarbrick A, Williams ED, Pearson JV, Kondrashova O, Waddell N. Performance of tumour microenvironment deconvolution methods in breast cancer using single-cell simulated bulk mixtures. Nat Commun 2023; 14:5758. [PMID: 37717006 PMCID: PMC10505141 DOI: 10.1038/s41467-023-41385-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/31/2022] [Accepted: 09/01/2023] [Indexed: 09/18/2023]
Abstract
Cells within the tumour microenvironment (TME) can impact tumour development and influence treatment response. Computational approaches have been developed to deconvolve the TME from bulk RNA-seq. Using scRNA-seq profiling from breast tumours we simulate thousands of bulk mixtures, representing tumour purities and cell lineages, to compare the performance of nine TME deconvolution methods (BayesPrism, Scaden, CIBERSORTx, MuSiC, DWLS, hspe, CPM, Bisque, and EPIC). Some methods are more robust in deconvolving mixtures with high tumour purity levels. Most methods tend to mis-predict normal epithelial for cancer epithelial as tumour purity increases, a finding that is validated in two independent datasets. The breast cancer molecular subtype influences this mis-prediction. BayesPrism and DWLS have the lowest combined numbers of false positives and false negatives, and have the best performance when deconvolving granular immune lineages. Our findings highlight the need for more single-cell characterisation of rarer cell types, and suggest that tumour cell compositions should be considered when deconvolving the TME.
Affiliation(s)
- Khoa A Tran
  - Cancer Program, QIMR Berghofer Medical Research Institute, Brisbane, QLD, 4006, Australia
  - School of Biomedical Sciences, Queensland University of Technology (QUT), Brisbane, QLD, 4000, Australia
- Venkateswar Addala
  - Cancer Program, QIMR Berghofer Medical Research Institute, Brisbane, QLD, 4006, Australia
- Rebecca L Johnston
  - Cancer Program, QIMR Berghofer Medical Research Institute, Brisbane, QLD, 4006, Australia
- David Lovell
  - School of Computer Science, Queensland University of Technology, Brisbane, QLD, 4000, Australia
  - QUT Centre for Data Science, Brisbane, QLD, 4000, Australia
- Andrew Bradley
  - Faculty of Engineering, Queensland University of Technology, Brisbane, QLD, 4000, Australia
- Lambros T Koufariotis
  - Cancer Program, QIMR Berghofer Medical Research Institute, Brisbane, QLD, 4006, Australia
- Scott Wood
  - Cancer Program, QIMR Berghofer Medical Research Institute, Brisbane, QLD, 4006, Australia
- Sunny Z Wu
  - Cancer Ecosystems Program, Garvan Institute of Medical Research, Darlinghurst, NSW, 2010, Australia
  - School of Clinical Medicine, Faculty of Medicine and Health, UNSW Sydney, Kensington, NSW, 2052, Australia
- Daniel Roden
  - Cancer Ecosystems Program, Garvan Institute of Medical Research, Darlinghurst, NSW, 2010, Australia
  - School of Clinical Medicine, Faculty of Medicine and Health, UNSW Sydney, Kensington, NSW, 2052, Australia
- Ghamdan Al-Eryani
  - Cancer Ecosystems Program, Garvan Institute of Medical Research, Darlinghurst, NSW, 2010, Australia
  - School of Clinical Medicine, Faculty of Medicine and Health, UNSW Sydney, Kensington, NSW, 2052, Australia
- Alexander Swarbrick
  - Cancer Ecosystems Program, Garvan Institute of Medical Research, Darlinghurst, NSW, 2010, Australia
  - School of Clinical Medicine, Faculty of Medicine and Health, UNSW Sydney, Kensington, NSW, 2052, Australia
- Elizabeth D Williams
  - School of Biomedical Sciences, Queensland University of Technology (QUT), Brisbane, QLD, 4000, Australia
  - Australian Prostate Cancer Research Centre - Queensland (APCRC-Q) and Queensland Bladder Cancer Initiative (QBCI), Brisbane, QLD, 4000, Australia
- John V Pearson
  - Cancer Program, QIMR Berghofer Medical Research Institute, Brisbane, QLD, 4006, Australia
- Olga Kondrashova
  - Cancer Program, QIMR Berghofer Medical Research Institute, Brisbane, QLD, 4006, Australia
- Nicola Waddell
  - Cancer Program, QIMR Berghofer Medical Research Institute, Brisbane, QLD, 4006, Australia
  - School of Biomedical Sciences, Queensland University of Technology (QUT), Brisbane, QLD, 4000, Australia
5
Wang Z, Zhu Y, Liu Z, Li H, Tang X, Jiang Y. Comparative analysis of tissue-specific genes in maize based on machine learning models: CNN performs technically best, LightGBM performs biologically soundest. Front Genet 2023; 14:1190887. [PMID: 37229198 PMCID: PMC10203421 DOI: 10.3389/fgene.2023.1190887] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2023] [Accepted: 04/17/2023] [Indexed: 05/27/2023]
Abstract
Introduction: With the advancement of RNA-seq technology and machine learning, training large-scale RNA-seq data from databases with machine learning models can generally identify genes with important regulatory roles that were previously missed by standard linear analytic methodologies. Finding tissue-specific genes could improve our comprehension of the relationship between tissues and genes. However, few machine learning models for transcriptome data have been deployed and compared to identify tissue-specific genes, particularly for plants. Methods: In this study, an expression matrix was processed with linear models (Limma), machine learning models (LightGBM), and deep learning models (CNN) with information gain and the SHAP strategy based on 1,548 maize multi-tissue RNA-seq data obtained from a public database to identify tissue-specific genes. In terms of validation, V-measure values were computed based on k-means clustering of the gene sets to evaluate their technical complementarity. Furthermore, GO analysis and literature retrieval were used to validate the functions and research status of these genes. Results: Based on clustering validation, the convolutional neural network outperformed the others with a higher V-measure value of 0.647, indicating that its gene set could cover as many specific properties of various tissues as possible, whereas LightGBM discovered key transcription factors. The combination of three gene sets produced 78 core tissue-specific genes that had previously been shown in the literature to be biologically significant. Discussion: Different tissue-specific gene sets were identified due to the distinct interpretation strategy for machine learning models and researchers may use multiple methodologies and strategies for tissue-specific gene sets based on their goals, types of data, and computational resources. This study provided comparative insight for large-scale data mining of transcriptome datasets, shedding light on resolving high dimensions and bias difficulties in bioinformatics data processing.
Affiliation(s)
- Zijie Wang
  - School of Agriculture, Sun Yat-sen University, Shenzhen, China
- Yuzhi Zhu
  - School of Agriculture, Sun Yat-sen University, Shenzhen, China
- Zhule Liu
  - School of Agriculture, Sun Yat-sen University, Shenzhen, China
- Hongfu Li
  - School of Agriculture, Sun Yat-sen University, Shenzhen, China
- Xinqiang Tang
  - School of Intelligent Systems Engineering, Sun Yat-sen University, Shenzhen, China
- Yi Jiang
  - School of Agriculture, Sun Yat-sen University, Shenzhen, China
6
MacDonald S, Foley H, Yap M, Johnston RL, Steven K, Koufariotis LT, Sharma S, Wood S, Addala V, Pearson JV, Roosta F, Waddell N, Kondrashova O, Trzaskowski M. Generalising uncertainty improves accuracy and safety of deep learning analytics applied to oncology. Sci Rep 2023; 13:7395. [PMID: 37149669 PMCID: PMC10164181 DOI: 10.1038/s41598-023-31126-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/22/2022] [Accepted: 03/07/2023] [Indexed: 05/08/2023]
Abstract
Uncertainty estimation is crucial for understanding the reliability of deep learning (DL) predictions, and critical for deploying DL in the clinic. Differences between training and production datasets can lead to incorrect predictions with underestimated uncertainty. To investigate this pitfall, we benchmarked one pointwise and three approximate Bayesian DL models for predicting cancer of unknown primary, using three RNA-seq datasets with 10,968 samples across 57 cancer types. Our results highlight that simple and scalable Bayesian DL significantly improves the generalisation of uncertainty estimation. Moreover, we designed a prototypical metric, the area between development and production curve (ADP), which evaluates the accuracy loss when deploying models from development to production. Using ADP, we demonstrate that Bayesian DL improves accuracy under data distributional shifts when utilising 'uncertainty thresholding'. In summary, Bayesian DL is a promising approach for generalising uncertainty, improving performance, transparency, and safety of DL models for deployment in the real world.
Affiliation(s)
- Samual MacDonald
  - Max Kelsen, Brisbane, QLD, Australia
  - ARC Training Centre for Information Resilience (CIRES), Brisbane, Australia
  - The University of Queensland, Brisbane, Australia
- Sowmya Sharma
  - QIMR Berghofer Medical Research Institute, Brisbane, QLD, Australia
  - ACL Pathology, Bella Vista, NSW, Australia
- Scott Wood
  - QIMR Berghofer Medical Research Institute, Brisbane, QLD, Australia
- John V Pearson
  - QIMR Berghofer Medical Research Institute, Brisbane, QLD, Australia
- Fred Roosta
  - ARC Training Centre for Information Resilience (CIRES), Brisbane, Australia
  - The University of Queensland, Brisbane, Australia
- Nicola Waddell
  - QIMR Berghofer Medical Research Institute, Brisbane, QLD, Australia
- Olga Kondrashova
  - QIMR Berghofer Medical Research Institute, Brisbane, QLD, Australia
- Maciej Trzaskowski
  - Max Kelsen, Brisbane, QLD, Australia
  - ARC Training Centre for Information Resilience (CIRES), Brisbane, Australia
  - The University of Queensland, Brisbane, Australia
  - QIMR Berghofer Medical Research Institute, Brisbane, QLD, Australia
7
Pandey D, Onkara Perumal P. A scoping review on deep learning for next-generation RNA-Seq. data analysis. Funct Integr Genomics 2023; 23:134. [PMID: 37084004 DOI: 10.1007/s10142-023-01064-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 11/09/2022] [Revised: 03/24/2023] [Accepted: 04/17/2023] [Indexed: 04/22/2023]
Abstract
In the last decade, transcriptome research adopting next-generation sequencing (NGS) technologies has gathered incredible momentum amongst functional genomics scientists, particularly amongst clinical/biomedical research groups. The progressive enfoldment/adoption of NGS technologies has incited an abundance of next-generation transcriptomic data harbouring an opulence of new knowledge in public databases. Nevertheless, knowledge discovery from these next-generation RNA-Seq. data analysis necessitates extensive bioinformatics know-how besides elaborate data analysis software packages consistent with the type and context of data analysis. Several reliability and reproducibility concerns continue to impede RNA-Seq. data analysis. Characteristic challenges comprise of data quality, hardware and networking provisions, selection and prioritisation of data analysis tools, and yet significantly implementing of robust machine learning algorithms for maximised exploitation of these experimental transcriptomic data. Over the years, numerous machine learning algorithms have been implemented for improved transcriptomic data analysis executing predominantly shallow learning approaches. More recently, deep learning algorithms are becoming more mainstream, and enactment for next-generation RNA-Seq. data analysis could be revolutionary in the coming years in the biomedical domain. In this scoping review, we attempt to determine the existing literature's size and potential nature in deep learning and NGS RNA-Seq. data analysis. An analysis of the contemporary topics of next-generation RNA-Seq. data analysis based on deep learning algorithms is critically reviewed, emphasising open-source resources.
Affiliation(s)
- Diksha Pandey
  - Department of Biotechnology, National Institute of Technology, Warangal, Telangana, 506004, India
- P Onkara Perumal
  - Department of Biotechnology, National Institute of Technology, Warangal, Telangana, 506004, India
8
Ferrato MH, Marsh AG, Franke KR, Huang BJ, Kolb EA, DeRyckere D, Grahm DK, Chandrasekaran S, Crowgey EL. Machine learning classifier approaches for predicting response to RTK-type-III inhibitors demonstrate high accuracy using transcriptomic signatures and ex vivo data. Bioinformatics Advances 2023; 3:vbad034. [PMID: 37250111 PMCID: PMC10209528 DOI: 10.1093/bioadv/vbad034] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/08/2022] [Revised: 02/16/2023] [Accepted: 03/21/2023] [Indexed: 05/31/2023]
Abstract
Motivation: The application of machine learning (ML) techniques in the medical field has demonstrated both successes and challenges in the precision medicine era. The ability to accurately classify a subject as a potential responder versus a nonresponder to a given therapy is still an active area of research pushing the field to create new approaches for applying machine-learning techniques. In this study, we leveraged publicly available data through the BeatAML initiative. Specifically, we used gene count data, generated via RNA-seq, from 451 individuals matched with ex vivo data generated from treatment with RTK-type-III inhibitors. Three feature selection techniques were tested, principal component analysis, Shapley Additive Explanation (SHAP) technique and differential gene expression analysis, with three different classifiers, XGBoost, LightGBM and random forest (RF). Sensitivity versus specificity was analyzed using the area under the curve (AUC)-receiver operating curves (ROCs) for every model developed. Results: Our work demonstrated that feature selection technique, rather than the classifier, had the greatest impact on model performance. The SHAP technique outperformed the other feature selection techniques and was able to predict outcome response with high accuracy; the highest-performing model, for Foretinib, reached 89% AUC using the SHAP technique and RF classifier. Our ML pipelines demonstrate that at the time of diagnosis, a transcriptomics signature exists that can potentially predict response to treatment, demonstrating the potential of using ML applications in precision medicine efforts. Availability and implementation: https://github.com/UD-CRPL/RCDML. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
Affiliation(s)
- Karl R Franke
  - Nemours Children Health System, Wilmington, DE 19803, USA
- Benjamin J Huang
  - Department of Pediatrics, University of California San Francisco, San Francisco, CA 94143, USA
  - Helen Diller Family Comprehensive Cancer Center, University of California San Francisco, San Francisco, CA 94143, USA
- E Anders Kolb
  - Nemours Children Health System, Wilmington, DE 19803, USA
- Deborah DeRyckere
  - Department of Pediatrics, Emory University School of Medicine, Atlanta, GA 30322, USA
- Douglas K Grahm
  - Department of Pediatrics, Emory University School of Medicine, Atlanta, GA 30322, USA
9
Big Data in Gastroenterology Research. Int J Mol Sci 2023; 24:ijms24032458. [PMID: 36768780 PMCID: PMC9916510 DOI: 10.3390/ijms24032458] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 12/30/2022] [Revised: 01/18/2023] [Accepted: 01/20/2023] [Indexed: 01/28/2023]
Abstract
Studying individual data types in isolation provides only limited and incomplete answers to complex biological questions and particularly falls short in revealing sufficient mechanistic and kinetic details. In contrast, multi-omics approaches to studying health and disease permit the generation and integration of multiple data types on a much larger scale, offering a comprehensive picture of biological and disease processes. Gastroenterology and hepatobiliary research are particularly well-suited to such analyses, given the unique position of the luminal gastrointestinal (GI) tract at the nexus between the gut (mucosa and luminal contents), brain, immune and endocrine systems, and GI microbiome. The generation of 'big data' from multi-omic, multi-site studies can enhance investigations into the connections between these organ systems and organisms and more broadly and accurately appraise the effects of dietary, pharmacological, and other therapeutic interventions. In this review, we describe a variety of useful omics approaches and how they can be integrated to provide a holistic depiction of the human and microbial genetic and proteomic changes underlying physiological and pathophysiological phenomena. We highlight the potential pitfalls and alternatives to help avoid the common errors in study design, execution, and analysis. We focus on the application, integration, and analysis of big data in gastroenterology and hepatobiliary research.
10
D’Agostino N, Li W, Wang D. High-throughput transcriptomics. Sci Rep 2022; 12:20313. [PMID: 36446824 PMCID: PMC9708670 DOI: 10.1038/s41598-022-23985-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2022] [Indexed: 11/30/2022]
Affiliation(s)
- Nunzio D’Agostino
  - Department of Agricultural Sciences, University of Naples Federico II, Portici, NA, Italy
- Wenli Li
  - Dairy Forage Research Center, USDA-ARS, 1925 Linden Drive, Madison, WI 53706, USA
- Dapeng Wang
  - National Heart and Lung Institute, Imperial College London, London, SW3 6LY, UK
11
Qin R, Mahal LK, Bojar D. Deep learning explains the biology of branched glycans from single-cell sequencing data. iScience 2022; 25:105163. [PMID: 36217547 PMCID: PMC9547197 DOI: 10.1016/j.isci.2022.105163] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 07/03/2022] [Revised: 09/06/2022] [Accepted: 09/16/2022] [Indexed: 11/03/2022]
Abstract
Glycosylation is ubiquitous and often dysregulated in disease. However, the regulation and functional significance of various types of glycosylation at cellular levels is hard to unravel experimentally. Multi-omics, single-cell measurements such as SUGAR-seq, which quantifies transcriptomes and cell surface glycans, facilitate addressing this issue. Using SUGAR-seq data, we pioneered a deep learning model to predict the glycan phenotypes of cells (mouse T lymphocytes) from transcripts, with the example of predicting β1,6GlcNAc-branching across T cell subtypes (test set F1 score: 0.9351). Model interpretation via SHAP (SHapley Additive exPlanations) identified highly predictive genes, in part known to impact (i) branched glycan levels and (ii) the biology of branched glycans. These genes included physiologically relevant low-abundance genes that were not captured by conventional differential expression analysis. Our work shows that interpretable deep learning models are promising for uncovering novel functions and regulatory mechanisms of glycans from integrated transcriptomic and glycomic datasets.
Affiliation(s)
- Rui Qin
  - Department of Chemistry, University of Alberta, Edmonton, AB T6G 2G2, Canada
- Lara K. Mahal
  - Department of Chemistry, University of Alberta, Edmonton, AB T6G 2G2, Canada
- Daniel Bojar
  - Department of Chemistry and Molecular Biology, University of Gothenburg, 405 30 Gothenburg, Sweden
  - Wallenberg Centre for Molecular and Translational Medicine, University of Gothenburg, 405 30 Gothenburg, Sweden
12
Zeng W, Gautam A, Huson DH. DeepToA: An Ensemble Deep-Learning Approach to Predicting the Theater of Activity of a Microbiome. Bioinformatics 2022; 38:4670-4676. [PMID: 36029249 DOI: 10.1093/bioinformatics/btac584] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/05/2022] [Revised: 07/19/2022] [Accepted: 08/26/2022] [Indexed: 11/14/2022]
Abstract
Motivation: Metagenomics is the study of microbiomes using DNA sequencing. A microbiome consists of an assemblage of microbes that is associated with a "theater of activity" (ToA). An important question is, to what degree does the taxonomic and functional content of the former depend on the (details of the) latter? Here we investigate a related technical question: Given a taxonomic and/or functional profile estimated from metagenomic sequencing data, how to predict the associated ToA? We present a deep-learning approach to this question. We use both taxonomic and functional profiles as input. We apply node2vec to embed hierarchical taxonomic profiles into numerical vectors. We then perform dimension reduction using clustering, to address the sparseness of the taxonomic data and thus make the problem more amenable to deep-learning algorithms. Functional features are combined with textual descriptions of protein families or domains. We present an ensemble deep-learning framework DeepToA for predicting the "theater of activity" of a microbial community, based on taxonomic and functional profiles. We use SHAP (SHapley Additive exPlanations) values to determine which taxonomic and functional features are important for the prediction. Results: Based on 7,560 metagenomic profiles downloaded from MGnify, classified into ten different theaters of activity, we demonstrate that DeepToA has an accuracy of 98.30%. We show that adding textual information to functional features increases the accuracy. Availability: Our approach is available at http://ab.inf.uni-tuebingen.de/software/deeptoa. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Wenhuan Zeng
  - Institute for Bioinformatics and Medical Informatics, University of Tübingen, Tübingen, 72076, Germany
- Anupam Gautam
  - Institute for Bioinformatics and Medical Informatics, University of Tübingen, Tübingen, 72076, Germany
  - International Max Planck Research School "From Molecules to Organisms", Max Planck Institute for Biology Tübingen, Max-Planck-Ring 5, Tübingen, 72076, Germany
- Daniel H Huson
  - Institute for Bioinformatics and Medical Informatics, University of Tübingen, Tübingen, 72076, Germany
  - International Max Planck Research School "From Molecules to Organisms", Max Planck Institute for Biology Tübingen, Max-Planck-Ring 5, Tübingen, 72076, Germany
  - Cluster of Excellence: Controlling Microbes to Fight Infection, Tübingen, Germany
13
Pathway importance by graph convolutional network and Shapley additive explanations in gene expression phenotype of diffuse large B-cell lymphoma. PLoS One 2022; 17:e0269570. [PMID: 35749395 PMCID: PMC9231717 DOI: 10.1371/journal.pone.0269570] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2021] [Accepted: 05/09/2022] [Indexed: 11/30/2022]
Abstract
Deep learning techniques have recently been applied to analyze associations between gene expression data and disease phenotypes. However, there are concerns regarding the black box problem: it is difficult to interpret why the prediction results are obtained using deep learning models from model parameters. New methods have been proposed for interpreting deep learning model predictions but have not been applied to genetics. In this study, we demonstrated that applying SHapley Additive exPlanations (SHAP) to a deep learning model using graph convolutions of genetic pathways can provide pathway-level feature importance for classification prediction of diffuse large B-cell lymphoma (DLBCL) gene expression subtypes. Using Kyoto Encyclopedia of Genes and Genomes pathways, a graph convolutional network (GCN) model was implemented to construct graphs with nodes and edges. DLBCL datasets, including microarray gene expression data and clinical information on subtypes (germinal center B-cell-like type and activated B-cell-like type), were retrieved from the Gene Expression Omnibus to evaluate the model. The GCN model showed an accuracy of 0.914, precision of 0.948, recall of 0.868, and F1 score of 0.906 in analysis of the classification performance for the test datasets. The pathways with high feature importance by SHAP included highly enriched pathways in the gene set enrichment analysis. Moreover, a logistic regression model with explanatory variables of genes in pathways with high feature importance showed good performance in predicting DLBCL subtypes. In conclusion, our GCN model for classifying DLBCL subtypes is useful for interpreting important regulatory pathways that contribute to the prediction.
|
14
|
A deep learning model to classify neoplastic state and tissue origin from transcriptomic data. Sci Rep 2022; 12:9669. [PMID: 35690622 PMCID: PMC9188604 DOI: 10.1038/s41598-022-13665-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 06/15/2021] [Accepted: 04/11/2022] [Indexed: 12/20/2022]
Abstract
Application of deep learning methods to transcriptomic data has the potential to enhance the accuracy and efficiency of tissue classification and cell state identification. Herein, we developed a multitask deep learning model for tissue classification combining publicly available whole transcriptomic (RNA-seq) datasets of non-neoplastic, neoplastic and peri-neoplastic tissue to classify disease state, tissue origin and neoplastic subclass. RNA-seq data from a total of 10,116 patient samples processed through a common pipeline were used for model training and validation. The model achieved 99% accuracy for disease state classification (ROC-AUC of 0.98) and 97% accuracy for tissue origin (ROC-AUC of 0.99). Moreover, the model achieved an accuracy of 92% (ROC-AUC 0.95) for neoplastic subclassification. This is the first multitask deep learning algorithm developed for tissue classification employing a uniform pipeline analysis of transcriptomic data with multiple tissue classifiers. This model serves as a framework for incorporating large transcriptomic datasets across conditions to facilitate clinical diagnosis and cell-based treatment strategies.
|
15
|
Interpretation of Machine-Learning-Based (Black-box) Wind Pressure Predictions for Low-Rise Gable-Roofed Buildings Using Shapley Additive Explanations (SHAP). BUILDINGS 2022. [DOI: 10.3390/buildings12060734] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 02/01/2023]
Abstract
Conventional methods of estimating the pressure coefficients of buildings are constrained by time and cost. Recently, machine learning (ML) has been successfully used to predict wind pressure coefficients. However, regardless of their accuracy, ML models fail to earn end-users’ confidence because of the black-box nature of their predictions. In this study, we employed tree-based regression models (Decision Tree, XGBoost, Extra-tree, LightGBM) to predict the surface-averaged mean pressure coefficient (Cp,mean), fluctuation pressure coefficient (Cp,rms), and peak pressure coefficient (Cp,peak) of low-rise gable-roofed buildings. The accuracy of the models was verified using Tokyo Polytechnic University (TPU) wind tunnel data. Subsequently, we used Shapley Additive Explanations (SHAP) to explain the black-box nature of the ML predictions. The comparison revealed that tree-based models are efficient and accurate in predicting wind pressure coefficients. Interestingly, SHAP provided human-comprehensible explanations for the interaction of variables, the importance of features towards the outcome, and the underlying reasoning behind the predictions. Moreover, SHAP confirmed that tree-based predictions adhere to the flow physics of wind engineering, advancing the fidelity of ML-based predictions.
|
16
|
Interpretable AI in Healthcare: Enhancing Fairness, Safety, and Trust. Artif Intell Med 2022. [DOI: 10.1007/978-981-19-1223-8_11] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]
|
17
|
Shared Blocks-Based Ensemble Deep Learning for Shallow Landslide Susceptibility Mapping. REMOTE SENSING 2021. [DOI: 10.3390/rs13234776] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 02/07/2023]
Abstract
Natural disaster impact assessment is of the utmost significance for post-disaster recovery, environmental protection, and hazard mitigation plans. With their recent usage in landslide susceptibility mapping, deep learning (DL) architectures have proven their efficiency in many scientific studies. However, some restrictions, including insufficient model variance and limited generalization capabilities, have been reported in the literature. To overcome these restrictions, ensembling DL models has often been preferred as a practical solution. In this study, an ensemble DL architecture, based on shared blocks, was proposed to improve the prediction capability of individual DL models. For this purpose, three DL models, namely Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), and Long Short-Term Memory (LSTM), together with their ensemble form (CNN–RNN–LSTM) were utilized to model landslide susceptibility in Trabzon province, Turkey. The proposed DL architecture produced the highest modeling performance of 0.93, followed by CNN (0.92), RNN (0.91), and LSTM (0.86). The findings showed that the proposed model outperformed the individual DL models by up to 7% in terms of overall accuracy, which was also confirmed by the Wilcoxon signed-rank test. The area under curve analysis also showed a significant improvement (~4%) in susceptibility map accuracy by the proposed strategy.
|
18
|
Abstract
High-throughput technologies such as next-generation sequencing allow biologists to observe cell function with unprecedented resolution, but the resulting datasets are too large and complicated for humans to understand without the aid of advanced statistical methods. Machine learning (ML) algorithms, which are designed to automatically find patterns in data, are well suited to this task. Yet these models are often so complex as to be opaque, leaving researchers with few clues about underlying mechanisms. Interpretable machine learning (iML) is a burgeoning subdiscipline of computational statistics devoted to making the predictions of ML models more intelligible to end users. This article is a gentle and critical introduction to iML, with an emphasis on genomic applications. I define relevant concepts, motivate leading methodologies, and provide a simple typology of existing approaches. I survey recent examples of iML in genomics, demonstrating how such techniques are increasingly integrated into research workflows. I argue that iML solutions are required to realize the promise of precision medicine. However, several open challenges remain. I examine the limitations of current state-of-the-art tools and propose a number of directions for future research. While the horizon for iML in genomics is wide and bright, continued progress requires close collaboration across disciplines.
Affiliation(s)
- David S Watson
- Department of Statistical Science, University College London, London, UK.
|