1
|
Yao X, Jiang X, Luo H, Liang H, Ye X, Wei Y, Cong S. MOCAT: multi-omics integration with auxiliary classifiers enhanced autoencoder. BioData Min 2024; 17:9. [PMID: 38444019 PMCID: PMC10916109 DOI: 10.1186/s13040-024-00360-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2023] [Accepted: 02/29/2024] [Indexed: 03/07/2024] Open
Abstract
BACKGROUND Integrating multi-omics data is emerging as a critical approach in enhancing our understanding of complex diseases. Innovative computational methods capable of managing high-dimensional and heterogeneous datasets are required to unlock the full potential of such rich and diverse data. METHODS We propose a Multi-Omics integration framework with auxiliary Classifiers-enhanced AuToencoders (MOCAT) to utilize intra- and inter-omics information comprehensively. Additionally, attention mechanisms with confidence learning are incorporated for enhanced feature representation and trustworthy prediction. RESULTS Extensive experiments were conducted on four benchmark datasets to evaluate the effectiveness of our proposed model, including BRCA, ROSMAP, LGG, and KIPAN. Our model significantly improved most evaluation measurements and consistently surpassed the state-of-the-art methods. Ablation studies showed that the auxiliary classifiers significantly boosted classification accuracy in the ROSMAP and LGG datasets. Moreover, the attention mechanisms and confidence evaluation block contributed to improvements in the predictive accuracy and generalizability of our model. CONCLUSIONS The proposed framework exhibits superior performance in disease classification and biomarker discovery, establishing itself as a robust and versatile tool for analyzing multi-layer biological data. This study highlights the significance of elaborated designed deep learning methodologies in dissecting complex disease phenotypes and improving the accuracy of disease predictions.
Collapse
Affiliation(s)
- Xiaohui Yao
- Qingdao Innovation and Development Center, Harbin Engineering University, 1777 Sansha Rd, Qingdao, 266000, Shandong, China
- College of Intelligent Systems Science and Engineering, Harbin Engineering University, 145 Nantong St, Harbin, 150001, Heilongjiang, China
| | - Xiaohan Jiang
- Qingdao Innovation and Development Center, Harbin Engineering University, 1777 Sansha Rd, Qingdao, 266000, Shandong, China
| | - Haoran Luo
- Qingdao Innovation and Development Center, Harbin Engineering University, 1777 Sansha Rd, Qingdao, 266000, Shandong, China
- College of Intelligent Systems Science and Engineering, Harbin Engineering University, 145 Nantong St, Harbin, 150001, Heilongjiang, China
| | - Hong Liang
- College of Intelligent Systems Science and Engineering, Harbin Engineering University, 145 Nantong St, Harbin, 150001, Heilongjiang, China
| | - Xiufen Ye
- College of Intelligent Systems Science and Engineering, Harbin Engineering University, 145 Nantong St, Harbin, 150001, Heilongjiang, China
| | - Yanhui Wei
- College of Intelligent Systems Science and Engineering, Harbin Engineering University, 145 Nantong St, Harbin, 150001, Heilongjiang, China
| | - Shan Cong
- Qingdao Innovation and Development Center, Harbin Engineering University, 1777 Sansha Rd, Qingdao, 266000, Shandong, China.
- College of Intelligent Systems Science and Engineering, Harbin Engineering University, 145 Nantong St, Harbin, 150001, Heilongjiang, China.
| |
Collapse
|
2
|
Turgut H, Turanli B, Boz B. DCDA: CircRNA-Disease Association Prediction with Feed-Forward Neural Network and Deep Autoencoder. Interdiscip Sci 2024; 16:91-103. [PMID: 37978116 DOI: 10.1007/s12539-023-00590-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/20/2023] [Revised: 10/13/2023] [Accepted: 10/15/2023] [Indexed: 11/19/2023]
Abstract
Circular RNA is a single-stranded RNA with a closed-loop structure. In recent years, academic research has revealed that circular RNAs play critical roles in biological processes and are related to human diseases. The discovery of potential circRNAs as disease biomarkers and drug targets is crucial since it can help diagnose diseases in the early stages and be used to treat people. However, in conventional experimental methods, conducting experiments to detect associations between circular RNAs and diseases is time-consuming and costly. To overcome this problem, various computational methodologies are proposed to extract essential features for both circular RNAs and diseases and predict the associations. Studies showed that computational methods successfully predicted performance and made it possible to detect possible highly related circular RNAs for diseases. This study proposes a deep learning-based circRNA-disease association predictor methodology called DCDA, which uses multiple data sources to create circRNA and disease features and reveal hidden feature codings of a circular RNA-disease pair with a deep autoencoder, then predict the relation score of the pair by a deep neural network. Fivefold cross-validation results on the benchmark dataset showed that our model outperforms state-of-the-art prediction methods in the literature with the AUC score of 0.9794.
Collapse
Affiliation(s)
- Hacer Turgut
- Computer Engineering Department, Marmara University, 34854, Istanbul, Türkiye.
| | - Beste Turanli
- Bioengineering Department, Marmara University, 34854, Istanbul, Türkiye
| | - Betül Boz
- Computer Engineering Department, Marmara University, 34854, Istanbul, Türkiye.
| |
Collapse
|
3
|
Meimetis N, Pullen KM, Zhu DY, Nilsson A, Hoang TN, Magliacane S, Lauffenburger DA. AutoTransOP: translating omics signatures without orthologue requirements using deep learning. NPJ Syst Biol Appl 2024; 10:13. [PMID: 38287079 PMCID: PMC10825146 DOI: 10.1038/s41540-024-00341-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/22/2023] [Accepted: 01/17/2024] [Indexed: 01/31/2024] Open
Abstract
The development of therapeutics and vaccines for human diseases requires a systematic understanding of human biology. Although animal and in vitro culture models can elucidate some disease mechanisms, they typically fail to adequately recapitulate human biology as evidenced by the predominant likelihood of clinical trial failure. To address this problem, we developed AutoTransOP, a neural network autoencoder framework, to map omics profiles from designated species or cellular contexts into a global latent space, from which germane information for different contexts can be identified without the typically imposed requirement of matched orthologues. This approach was found in general to perform at least as well as current alternative methods in identifying animal/culture-specific molecular features predictive of other contexts-most importantly without requiring homology matching. For an especially challenging test case, we successfully applied our framework to a set of inter-species vaccine serology studies, where 1-to-1 mapping between human and non-human primate features does not exist.
Collapse
Affiliation(s)
- Nikolaos Meimetis
- Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
| | - Krista M Pullen
- Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
| | - Daniel Y Zhu
- Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
| | - Avlant Nilsson
- Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
- Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, SE, 41296, Sweden
| | - Trong Nghia Hoang
- School of Electrical Engineering and Computer Science, Washington State University, Pullman, WA, 99164-236, USA
| | - Sara Magliacane
- Institute of Informatics, University of Amsterdam, Amsterdam, The Netherlands
- MIT-IBM Watson AI Lab, Cambridge, MA, 02139, USA
| | - Douglas A Lauffenburger
- Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA.
| |
Collapse
|
4
|
Tjärnberg A, Beheler-Amass M, Jackson CA, Christiaen LA, Gresham D, Bonneau R. Structure-primed embedding on the transcription factor manifold enables transparent model architectures for gene regulatory network and latent activity inference. Genome Biol 2024; 25:24. [PMID: 38238840 PMCID: PMC10797903 DOI: 10.1186/s13059-023-03134-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/28/2023] [Accepted: 11/30/2023] [Indexed: 01/22/2024] Open
Abstract
BACKGROUND Modeling of gene regulatory networks (GRNs) is limited due to a lack of direct measurements of genome-wide transcription factor activity (TFA) making it difficult to separate covariance and regulatory interactions. Inference of regulatory interactions and TFA requires aggregation of complementary evidence. Estimating TFA explicitly is problematic as it disconnects GRN inference and TFA estimation and is unable to account for, for example, contextual transcription factor-transcription factor interactions, and other higher order features. Deep-learning offers a potential solution, as it can model complex interactions and higher-order latent features, although does not provide interpretable models and latent features. RESULTS We propose a novel autoencoder-based framework, StrUcture Primed Inference of Regulation using latent Factor ACTivity (SupirFactor) for modeling, and a metric, explained relative variance (ERV), for interpretation of GRNs. We evaluate SupirFactor with ERV in a wide set of contexts. Compared to current state-of-the-art GRN inference methods, SupirFactor performs favorably. We evaluate latent feature activity as an estimate of TFA and biological function in S. cerevisiae as well as in peripheral blood mononuclear cells (PBMC). CONCLUSION Here we present a framework for structure-primed inference and interpretation of GRNs, SupirFactor, demonstrating interpretability using ERV in multiple biological and experimental settings. SupirFactor enables TFA estimation and pathway analysis using latent factor activity, demonstrated here on two large-scale single-cell datasets, modeling S. cerevisiae and PBMC. We find that the SupirFactor model facilitates biological analysis acquiring novel functional and regulatory insight.
Collapse
Affiliation(s)
- Andreas Tjärnberg
- Center for Developmental Genetics, New York University, New York, NY, 10003, USA.
- Center For Genomics and Systems Biology, NYU, New York, NY, 10008, USA.
- Department of Biology, NYU, New York, NY, 10008, USA.
- Mortimer B. Zuckerman Mind Brain Behavior Institute, Columbia University, New York, NY, 10010, USA.
- Department of Neuro-Science, University of Wisconsin-Madison - Waisman Center, Madison, USA.
| | - Maggie Beheler-Amass
- Center For Genomics and Systems Biology, NYU, New York, NY, 10008, USA
- Department of Biology, NYU, New York, NY, 10008, USA
| | - Christopher A Jackson
- Center For Genomics and Systems Biology, NYU, New York, NY, 10008, USA
- Department of Biology, NYU, New York, NY, 10008, USA
| | - Lionel A Christiaen
- Center for Developmental Genetics, New York University, New York, NY, 10003, USA
- Department of Biology, NYU, New York, NY, 10008, USA
- Sars International Centre for Marine Molecular Biology, University of Bergen, Bergen, Norway
- Department of Heart Disease, Haukeland University Hospital, Bergen, Norway
| | - David Gresham
- Center For Genomics and Systems Biology, NYU, New York, NY, 10008, USA
- Department of Biology, NYU, New York, NY, 10008, USA
| | - Richard Bonneau
- Center For Genomics and Systems Biology, NYU, New York, NY, 10008, USA.
- Department of Biology, NYU, New York, NY, 10008, USA.
- Flatiron Institute, Center for Computational Biology, Simons Foundation, New York, NY, 10010, USA.
- Courant Institute of Mathematical Sciences, Computer Science Department, New York University, New York, NY, 10003, USA.
- Center For Data Science, NYU, New York, NY, 10008, USA.
- Prescient Design, a Genentech accelerator, New York, NY, 10010, USA.
| |
Collapse
|
5
|
Yue T, Wang Y, Zhang L, Gu C, Xue H, Wang W, Lyu Q, Dun Y. Deep Learning for Genomics: From Early Neural Nets to Modern Large Language Models. Int J Mol Sci 2023; 24:15858. [PMID: 37958843 PMCID: PMC10649223 DOI: 10.3390/ijms242115858] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/30/2023] [Revised: 10/24/2023] [Accepted: 10/30/2023] [Indexed: 11/15/2023] Open
Abstract
The data explosion driven by advancements in genomic research, such as high-throughput sequencing techniques, is constantly challenging conventional methods used in genomics. In parallel with the urgent demand for robust algorithms, deep learning has succeeded in various fields such as vision, speech, and text processing. Yet genomics entails unique challenges to deep learning, since we expect a superhuman intelligence that explores beyond our knowledge to interpret the genome from deep learning. A powerful deep learning model should rely on the insightful utilization of task-specific knowledge. In this paper, we briefly discuss the strengths of different deep learning models from a genomic perspective so as to fit each particular task with proper deep learning-based architecture, and we remark on practical considerations of developing deep learning architectures for genomics. We also provide a concise review of deep learning applications in various aspects of genomic research and point out current challenges and potential research directions for future genomics applications. We believe the collaborative use of ever-growing diverse data and the fast iteration of deep learning models will continue to contribute to the future of genomics.
Collapse
Affiliation(s)
- Tianwei Yue
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA; (Y.W.); (L.Z.); (W.W.)
| | - Yuanxin Wang
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA; (Y.W.); (L.Z.); (W.W.)
| | - Longxiang Zhang
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA; (Y.W.); (L.Z.); (W.W.)
| | - Chunming Gu
- Department of Biomedical Engineering, School of Medicine, Johns Hopkins University, Baltimore, MD 21218, USA;
| | - Haoru Xue
- The Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA;
| | - Wenping Wang
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA; (Y.W.); (L.Z.); (W.W.)
| | - Qi Lyu
- Department of Computational Mathematics, Science, and Engineering, Michigan State University, East Lansing, MI 48824, USA;
| | - Yujie Dun
- School of Information and Communications Engineering, Xi’an Jiaotong University, Xi’an 710049, China;
| |
Collapse
|
6
|
Young JD, Ren S, Chen L, Lu X. Revealing the Impact of Genomic Alterations on Cancer Cell Signaling with an Interpretable Deep Learning Model. Cancers (Basel) 2023; 15:3857. [PMID: 37568673 PMCID: PMC10416927 DOI: 10.3390/cancers15153857] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/30/2023] [Revised: 07/24/2023] [Accepted: 07/26/2023] [Indexed: 08/13/2023] Open
Abstract
Cancer is a disease of aberrant cellular signaling resulting from somatic genomic alterations (SGAs). Heterogeneous SGA events in tumors lead to tumor-specific signaling system aberrations. We interpret the cancer signaling system as a causal graphical model, where SGAs affect signaling proteins, propagate their effects through signal transduction, and ultimately change gene expression. To represent such a system, we developed a deep learning model called redundant-input neural network (RINN) with a transparent redundant-input architecture. Our findings demonstrate that by utilizing SGAs as inputs, the RINN can encode their impact on the signaling system and predict gene expression accurately when measured as the area under ROC curves. Moreover, the RINN can discover the shared functional impact (similar embeddings) of SGAs that perturb a common signaling pathway (e.g., PI3K, Nrf2, and TGF). Furthermore, the RINN exhibits the ability to discover known relationships in cellular signaling systems.
Collapse
Affiliation(s)
- Jonathan D. Young
- Intelligent Systems Program, School of Computing and Information, University of Pittsburgh, Pittsburgh, PA 15260, USA;
| | - Shuangxia Ren
- Intelligent Systems Program, School of Computing and Information, University of Pittsburgh, Pittsburgh, PA 15260, USA;
| | - Lujia Chen
- Department of Biomedical Informatics, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15260, USA;
| | - Xinghua Lu
- Department of Biomedical Informatics, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15260, USA;
| |
Collapse
|
7
|
Ren S, Cooper GF, Chen L, Lu X. An interpretable deep learning framework for genome-informed precision oncology. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.07.11.548534. [PMID: 37503199 PMCID: PMC10369905 DOI: 10.1101/2023.07.11.548534] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 07/29/2023]
Abstract
Cancers result from aberrations in cellular signaling systems, typically resulting from driver somatic genome alterations (SGAs) in individual tumors. Precision oncology requires understanding the cellular state and selecting medications that induce vulnerability in cancer cells under such conditions. To this end, we developed a computational framework consisting of two components: 1) A representation-learning component, which learns a representation of the cellular signaling systems when perturbed by SGAs, using a biologically-motivated and interpretable deep learning model. 2) A drug-response-prediction component, which predicts the response to drugs by leveraging the information of the cellular state of the cancer cells derived by the first component. Our cell-state-oriented framework significantly enhances the accuracy of genome-informed prediction of drug responses in comparison to models that directly use SGAs as inputs. Importantly, our framework enables the prediction of response to chemotherapy agents based on SGAs, thus expanding genome-informed precision oncology beyond molecularly targeted drugs.
Collapse
Affiliation(s)
- Shuangxia Ren
- Intelligent Systems Program, School of Computing and Information, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
| | - Gregory F Cooper
- Intelligent Systems Program, School of Computing and Information, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
- Department of Biomedical Informatics, School of Medicine, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
| | - Lujia Chen
- Department of Biomedical Informatics, School of Medicine, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
| | - Xinghua Lu
- Intelligent Systems Program, School of Computing and Information, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
- Department of Biomedical Informatics, School of Medicine, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
| |
Collapse
|
8
|
Tapak L, Ghasemi MK, Afshar S, Mahjub H, Soltanian A, Khotanlou H. Identification of gene profiles related to the development of oral cancer using a deep learning technique. BMC Med Genomics 2023; 16:35. [PMID: 36849997 PMCID: PMC9972685 DOI: 10.1186/s12920-023-01462-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/30/2022] [Accepted: 02/15/2023] [Indexed: 03/01/2023] Open
Abstract
BACKGROUND Oral cancer (OC) is a debilitating disease that can affect the quality of life of these patients adversely. Oral premalignant lesion patients have a high risk of developing OC. Therefore, identifying robust survival subgroups among them may significantly improve patient therapy and care. This study aimed to identify prognostic biomarkers that predict the time-to-development of OC and survival stratification for patients using state-of-the-art machine learning and deep learning. METHODS Gene expression profiles (29,096 probes) related to 86 patients from the GSE26549 dataset from the GEO repository were used. An autoencoder deep learning neural network model was used to extract features. We also used a univariate Cox regression model to select significant features obtained from the deep learning method (P < 0.05). High-risk and low-risk groups were then identified using a hierarchical clustering technique based on 100 encoded features (the number of units of the encoding layer, i.e., bottleneck of the network) from autoencoder and selected by Cox proportional hazards model and a supervised random forest (RF) classifier was used to identify gene profiles related to subtypes of OC from the original 29,096 probes. RESULTS Among 100 encoded features extracted by autoencoder, seventy features were significantly related to time-to-OC-development, based on the univariate Cox model, which was used as the inputs for the clustering of patients. Two survival risk groups were identified (P value of log-rank test = 0.003) and were used as the labels for supervised classification. The overall accuracy of the RF classifier was 0.916 over the test set, yielded 21 top genes (FUT8-DDR2-ATM-CD247-ETS1-ZEB2-COL5A2-GMAP7-CDH1-COL11A2-COL3A1-AHR-COL2A1-CHORDC1-PTP4A3-COL1A2-CCR2-PDGFRB-COL1A1-FERMT2-PIK3CB) associated with time to developing OC, selected among the original 29,096 probes. CONCLUSIONS Using deep learning, our study identified prominent transcriptional biomarkers in determining high-risk patients for developing oral cancer, which may be prognostic as significant targets for OC therapy. The identified genes may serve as potential targets for oral cancer chemoprevention. Additional validation of these biomarkers in experimental prospective and retrospective studies will launch them in OC clinics.
Collapse
Affiliation(s)
- Leili Tapak
- Department of Biostatistics, School of Public Health and Modeling of Noncommunicable Diseases Research Center, Hamadan University of Medical Sciences, Hamadan, Iran
| | - Mohammad Kazem Ghasemi
- Department of Biostatistics, School of Public Health, Hamadan University of Medical Sciences, Hamadan, Iran
| | - Saeid Afshar
- Research Center for Molecular Medicine, Hamadan University of Medical Sciences, Hamadan, Iran.
| | - Hossein Mahjub
- Department of Biostatistics, School of Public Health and Modeling of Noncommunicable Diseases Research Center, Hamadan University of Medical Sciences, Hamadan, Iran
| | - Alireza Soltanian
- Department of Biostatistics, School of Public Health and Modeling of Noncommunicable Diseases Research Center, Hamadan University of Medical Sciences, Hamadan, Iran
| | - Hassan Khotanlou
- Department of Computer Engineering, Bu-Ali Sina University, Hamadan, Iran
| |
Collapse
|
9
|
Compendium-Wide Analysis of Pseudomonas aeruginosa Core and Accessory Genes Reveals Transcriptional Patterns across Strains PAO1 and PA14. mSystems 2023; 8:e0034222. [PMID: 36541762 PMCID: PMC9948736 DOI: 10.1128/msystems.00342-22] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/24/2022] Open
Abstract
Pseudomonas aeruginosa is an opportunistic pathogen that causes difficult-to-treat infections. Two well-studied divergent P. aeruginosa strain types, PAO1 and PA14, have significant genomic heterogeneity, including diverse accessory genes present in only some strains. Genome content comparisons find core genes that are conserved across both PAO1 and PA14 strains and accessory genes that are present in only a subset of PAO1 and PA14 strains. Here, we use recently assembled transcriptome compendia of publicly available P. aeruginosa RNA sequencing (RNA-seq) samples to create two smaller compendia consisting of only strain PAO1 or strain PA14 samples with each aligned to their cognate reference genome. We confirmed strain annotations and identified other samples for inclusion by assessing each sample's median expression of PAO1-only or PA14-only accessory genes. We then compared the patterns of core gene expression in each strain. To do so, we developed a method by which we analyzed genes in terms of which genes showed similar expression patterns across strain types. We found that some core genes had consistent correlated expression patterns across both compendia, while others were less stable in an interstrain comparison. For each accessory gene, we also determined core genes with correlated expression patterns. We found that stable core genes had fewer coexpressed neighbors that were accessory genes. Overall, this approach for analyzing expression patterns across strain types can be extended to other groups of genes, like phage genes, or applied for analyzing patterns beyond groups of strains, such as samples with different traits, to reveal a deeper understanding of regulation. IMPORTANCE Pseudomonas aeruginosa is a ubiquitous pathogen. There is much diversity among P. aeruginosa strains, including two divergent but well-studied strains, PAO1 and PA14. Understanding how these different strain-level traits manifest is important for identifying targets that regulate different traits of interest. With the availability of thousands of PAO1 and PA14 samples, we created two strain-specific RNA-seq compendia where each one contains hundreds of samples from PAO1 or PA14 strains and used them to compare the expression patterns of core genes that are conserved in both strain types and to determine which core genes have expression patterns that are similar to those of accessory genes that are unique to one strain or the other using an approach that we developed. We found a subset of core genes with different transcriptional patterns across PAO1 and PA14 strains and identified those core genes with expression patterns similar to those of strain-specific accessory genes.
Collapse
|
10
|
Tjärnberg A, Beheler-Amass M, Jackson CA, Christiaen LA, Gresham D, Bonneau R. Structure primed embedding on the transcription factor manifold enables transparent model architectures for gene regulatory network and latent activity inference. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.02.02.526909. [PMID: 36778259 PMCID: PMC9915715 DOI: 10.1101/2023.02.02.526909] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 02/07/2023]
Abstract
The modeling of gene regulatory networks (GRNs) is limited due to a lack of direct measurements of regulatory features in genome-wide screens. Most GRN inference methods are therefore forced to model relationships between regulatory genes and their targets with expression as a proxy for the upstream independent features, complicating validation and predictions produced by modeling frameworks. Separating covariance and regulatory influence requires aggregation of independent and complementary sets of evidence, such as transcription factor (TF) binding and target gene expression. However, the complete regulatory state of the system, e.g. TF activity (TFA) is unknown due to a lack of experimental feasibility, making regulatory relations difficult to infer. Some methods attempt to account for this by modeling TFA as a latent feature, but these models often use linear frameworks that are unable to account for non-linearities such as saturation, TF-TF interactions, and other higher order features. Deep learning frameworks may offer a solution, as they are capable of modeling complex interactions and capturing higher-order latent features. However, these methods often discard central concepts in biological systems modeling, such as sparsity and latent feature interpretability, in favor of increased model complexity. We propose a novel deep learning autoencoder-based framework, StrUcture Primed Inference of Regulation using latent Factor ACTivity (SupirFactor), that scales to single cell genomic data and maintains interpretability to perform GRN inference and estimate TFA as a latent feature. We demonstrate that SupirFactor outperforms current leading GRN inference methods, predicts biologically relevant TFA and elucidates functional regulatory pathways through aggregation of TFs.
Collapse
Affiliation(s)
- Andreas Tjärnberg
- Center for Developmental Genetics, New York University, New York 10003 NY, USA
- Center For Genomics and Systems Biology, NYU, New York, NY 10008, USA
- Department of Biology, NYU, New York, NY 10008, USA
- Mortimer B. Zuckerman Mind Brain Behavior Institute, Columbia University, New York, NY, 10010, USA
| | - Maggie Beheler-Amass
- Center For Genomics and Systems Biology, NYU, New York, NY 10008, USA
- Department of Biology, NYU, New York, NY 10008, USA
| | - Christopher A Jackson
- Center For Genomics and Systems Biology, NYU, New York, NY 10008, USA
- Department of Biology, NYU, New York, NY 10008, USA
| | - Lionel A Christiaen
- Center for Developmental Genetics, New York University, New York 10003 NY, USA
- Department of Biology, NYU, New York, NY 10008, USA
- Sars International Centre for Marine Molecular Biology, University of Bergen, Bergen, Norway
- Department of Heart Disease, Haukeland University Hospital, Bergen, Norway
| | - David Gresham
- Center For Genomics and Systems Biology, NYU, New York, NY 10008, USA
- Department of Biology, NYU, New York, NY 10008, USA
| | - Richard Bonneau
- Center For Genomics and Systems Biology, NYU, New York, NY 10008, USA
- Department of Biology, NYU, New York, NY 10008, USA
- Flatiron Institute, Center for Computational Biology, Simons Foundation, New York, NY 10010, USA
- Courant Institute of Mathematical Sciences, Computer Science Department, New York University, New York, NY 10003, USA
- Center For Data Science, NYU, New York, NY 10008, USA
- Prescient Design, a Genentech accelerator, New York, NY, 10010, USA
| |
Collapse
|
11
|
Duan H, Li F, Shang J, Liu J, Li Y, Liu X. scVAEBGM: Clustering Analysis of Single-Cell ATAC-seq Data Using a Deep Generative Model. Interdiscip Sci 2022; 14:917-928. [PMID: 35939233 DOI: 10.1007/s12539-022-00536-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/13/2022] [Revised: 07/15/2022] [Accepted: 07/20/2022] [Indexed: 06/15/2023]
Abstract
A surge in research has occurred because of current developments in single-cell technologies. Above all, single-cell Assay for Transposase-Accessible Chromatin with high throughput sequencing (scATAC-seq) is a popular approach of analyzing chromatin accessibility differences at the level of single cell, either within or between groups. As a result, it is critical to examine cell heterogeneity at a previously unseen level and to identify both recognized and unknown cell types. However, with the ever-increasing number of cells engendered by technological development and the characteristics of the data, such as high noise, sparsity and dimension, challenges in distinguishing cell types have emerged. We propose scVAEBGM, which integrates a Variational Autoencoder (VAE) with a Bayesian Gaussian-mixture model (BGM) to process and analyze scATAC-seq data. This method combines and takes benefits of a Bayesian Gaussian mixture model to estimate the number of cell types without determining the cluster number in a beforehand. In other words, the size of the clusters is inferred from the data, thus avoiding biases introduced by subjective assessments when manually determining the size of the clusters. Additionally, the method is more robust to noise and can better represent single-cell data in lower dimensions. We also create a further clustering strategy. It is indicated by experiments that further clustering based on the already completed clustering can improve the clustering accuracy again. We test on six public datasets, and scVAEBGM outperforms various dimension reduction baselines. In downstream applications, scVAEBGM can reveal biological cell types.
Collapse
Affiliation(s)
- Hongyu Duan
- School of Computer Science, Qufu Normal University, Rizhao, 276826, China
| | - Feng Li
- School of Computer Science, Qufu Normal University, Rizhao, 276826, China.
| | - Junliang Shang
- School of Computer Science, Qufu Normal University, Rizhao, 276826, China
| | - Jinxing Liu
- School of Computer Science, Qufu Normal University, Rizhao, 276826, China
| | - Yan Li
- Department of Electrical Engineering and Information Technology, Shandong University of Science and Technology, Jinan, 250031, Shandong, China
| | - Xikui Liu
- Department of Electrical Engineering and Information Technology, Shandong University of Science and Technology, Jinan, 250031, Shandong, China
| |
Collapse
|
12
|
Zhang L, Zhou L, Wang Y, Li C, Liao P, Zhong L, Geng S, Lai P, Du X, Weng J. Deep learning-based transcriptome model predicts survival of T-cell acute lymphoblastic leukemia. Front Oncol 2022; 12:1057153. [DOI: 10.3389/fonc.2022.1057153] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/29/2022] [Accepted: 10/17/2022] [Indexed: 11/06/2022] Open
Abstract
Identifying subgroups of T-cell acute lymphoblastic leukemia (T-ALL) with poor survival will significantly influence patient treatment options and improve patient survival expectations. Current efforts to predict T-ALL survival expectations in multiple patient cohorts are lacking. A deep learning (DL)-based model was developed to determine the prognostic staging of T-ALL patients. We used transcriptome sequencing data from TARGET to build a DL-based survival model using 265 T-ALL patients. We found that patients could be divided into two subgroups (K0 and K1) with significant difference (P< 0.0001) in survival rate. The more malignant subgroup was significantly associated with some tumor-related signaling pathways, such as PI3K-Akt, cGMP-PKG and TGF-beta signaling pathway. DL-based model showed good performance in a cohort of patients from our clinical center (P = 0.0248). T-ALL patients survival was successfully predicted using a DL-based model, and we hope to apply it to clinical practice in the future.
Collapse
|
13
|
Yang S, Shen T, Fang Y, Wang X, Zhang J, Yang W, Huang J, Han X. DeepNoise: Signal and Noise Disentanglement Based on Classifying Fluorescent Microscopy Images via Deep Learning. GENOMICS, PROTEOMICS & BIOINFORMATICS 2022; 20:989-1001. [PMID: 36608842 PMCID: PMC10025761 DOI: 10.1016/j.gpb.2022.12.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 10/16/2021] [Revised: 11/26/2022] [Accepted: 12/11/2022] [Indexed: 01/05/2023]
Abstract
The high-content image-based assay is commonly leveraged for identifying the phenotypic impact of genetic perturbations in biology field. However, a persistent issue remains unsolved during experiments: the interferential technical noises caused by systematic errors (e.g., temperature, reagent concentration, and well location) are always mixed up with the real biological signals, leading to misinterpretation of any conclusion drawn. Here, we reported a mean teacher-based deep learning model (DeepNoise) that can disentangle biological signals from the experimental noises. Specifically, we aimed to classify the phenotypic impact of 1108 different genetic perturbations screened from 125,510 fluorescent microscopy images, which were totally unrecognizable by the human eye. We validated our model by participating in the Recursion Cellular Image Classification Challenge, and DeepNoise achieved an extremely high classification score (accuracy: 99.596%), ranking the 2nd place among 866 participating groups. This promising result indicates the successful separation of biological and technical factors, which might help decrease the cost of treatment development and expedite the drug discovery process. The source code of DeepNoise is available at https://github.com/Scu-sen/Recursion-Cellular-Image-Classification-Challenge.
Collapse
Affiliation(s)
- Sen Yang
- Tencent AI Lab, Shenzhen 518057, China
| | - Tao Shen
- Tencent AI Lab, Shenzhen 518057, China
| | - Yuqi Fang
- Department of Electronic Engineering, The Chinese University of Hong Kong, Hong Kong Special Administrative Region 999077, China
| | - Xiyue Wang
- College of Computer Science, Sichuan University, Chengdu 610065, China
| | - Jun Zhang
- Tencent AI Lab, Shenzhen 518057, China.
| | - Wei Yang
- Tencent AI Lab, Shenzhen 518057, China
| | | | - Xiao Han
- Tencent AI Lab, Shenzhen 518057, China.
| |
Collapse
|
14
|
Integrating Radiomics with Genomics for Non-Small Cell Lung Cancer Survival Analysis. JOURNAL OF ONCOLOGY 2022; 2022:5131170. [PMID: 36065309 PMCID: PMC9440821 DOI: 10.1155/2022/5131170] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 03/07/2022] [Revised: 06/14/2022] [Accepted: 07/11/2022] [Indexed: 11/18/2022]
Abstract
Purpose The objectives of our study were to assess the association of radiological imaging and gene expression with patient outcomes in non-small cell lung cancer (NSCLC) and construct a nomogram by combining selected radiomic, genomic, and clinical risk factors to improve the performance of the risk model. Methods A total of 116 cases of NSCLC with CT images, gene expression, and clinical factors were studied, wherein 87 patients were used as the training cohort, and 29 patients were used as an independent testing cohort. Handcrafted radiomic features and deep-learning genomic features were extracted and selected from CT images and gene expression analysis, respectively. Two risk scores were calculated through Cox regression models for each patient based on radiomic features and genomic features to predict overall survival (OS). Finally, a fusion survival model was constructed by incorporating these two risk scores and clinical factors. Results The fusion model that combined CT images, gene expression data, and clinical factors effectively stratified patients into low- and high-risk groups. The C-indexes for OS prediction were 0.85 and 0.736 in the training and testing cohorts, respectively, which was better than that based on unimodal data. Conclusions Combining radiomics and genomics can effectively improve OS prediction for NSCLC patients.
Collapse
|
15
|
Lee AJ, Reiter T, Doing G, Oh J, Hogan DA, Greene CS. Using genome-wide expression compendia to study microorganisms. Comput Struct Biotechnol J 2022; 20:4315-4324. [PMID: 36016717 PMCID: PMC9396250 DOI: 10.1016/j.csbj.2022.08.012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/10/2022] [Revised: 08/07/2022] [Accepted: 08/07/2022] [Indexed: 11/30/2022] Open
Abstract
A gene expression compendium is a heterogeneous collection of gene expression experiments assembled from data collected for diverse purposes. The widely varied experimental conditions and genetic backgrounds across samples creates a tremendous opportunity for gaining a systems level understanding of the transcriptional responses that influence phenotypes. Variety in experimental design is particularly important for studying microbes, where the transcriptional responses integrate many signals and demonstrate plasticity across strains including response to what nutrients are available and what microbes are present. Advances in high-throughput measurement technology have made it feasible to construct compendia for many microbes. In this review we discuss how these compendia are constructed and analyzed to reveal transcriptional patterns.
Collapse
Affiliation(s)
- Alexandra J. Lee
- Genomics and Computational Biology Graduate Program, University of Pennsylvania, Philadelphia, PA, USA
| | - Taylor Reiter
- Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Denver, CO, USA
| | - Georgia Doing
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
| | - Julia Oh
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
| | - Deborah A. Hogan
- Department of Microbiology and Immunology, Geisel School of Medicine, Dartmouth, Hanover, NH, USA
| | - Casey S. Greene
- Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Denver, CO, USA
| |
Collapse
|
16
|
DNA Computing: Concepts for Medical Applications. APPLIED SCIENCES-BASEL 2022. [DOI: 10.3390/app12146928] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/17/2022]
Abstract
The branch of informatics that deals with construction and operation of computers built of DNA, is one of the research directions which investigates issues related to the use of DNA as hardware and software. This concept assumes the use of DNA computers due to their biological origin mainly for intelligent, personalized and targeted diagnostics frequently related to therapy. Important elements of this concept are (1) the retrieval of unique DNA sequences using machine learning methods and, based on the results of this process, (2) the construction/design of smart diagnostic biochip projects. The authors of this paper propose a new concept of designing diagnostic biochips, the key elements of which are machine-learning methods and the concept of biomolecular queue automata. This approach enables the scheduling of computational tasks at the molecular level by sequential events of cutting and ligating DNA molecules. We also summarize current challenges and perspectives of biomolecular computer application and machine-learning approaches using DNA sequence data mining.
Collapse
|
17
|
Greener JG, Kandathil SM, Moffat L, Jones DT. A guide to machine learning for biologists. Nat Rev Mol Cell Biol 2022; 23:40-55. [PMID: 34518686 DOI: 10.1038/s41580-021-00407-0] [Citation(s) in RCA: 514] [Impact Index Per Article: 257.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 07/23/2021] [Indexed: 02/08/2023]
Abstract
The expanding scale and inherent complexity of biological data have encouraged a growing use of machine learning in biology to build informative and predictive models of the underlying biological processes. All machine learning techniques fit models to data; however, the specific methods are quite varied and can at first glance seem bewildering. In this Review, we aim to provide readers with a gentle introduction to a few key machine learning techniques, including the most recently developed and widely used techniques involving deep neural networks. We describe how different techniques may be suited to specific types of biological data, and also discuss some best practices and points to consider when one is embarking on experiments involving machine learning. Some emerging directions in machine learning methodology are also discussed.
Collapse
Affiliation(s)
- Joe G Greener
- Department of Computer Science, University College London, London, UK
| | - Shaun M Kandathil
- Department of Computer Science, University College London, London, UK
| | - Lewis Moffat
- Department of Computer Science, University College London, London, UK
| | - David T Jones
- Department of Computer Science, University College London, London, UK.
| |
Collapse
|
18
|
Wei H, Zhao Z, Luo R. Machine-Learned Molecular Surface and Its Application to Implicit Solvent Simulations. J Chem Theory Comput 2021; 17:6214-6224. [PMID: 34516109 DOI: 10.1021/acs.jctc.1c00492] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
Abstract
Implicit solvent models, such as Poisson-Boltzmann models, play important roles in computational studies of biomolecules. A vital step in almost all implicit solvent models is to determine the solvent-solute interface, and the solvent excluded surface (SES) is the most widely used interface definition in these models. However, classical algorithms used for computing SES are geometry-based, so that they are neither suitable for parallel implementations nor convenient for obtaining surface derivatives. To address the limitations, we explored a machine learning strategy to obtain a level set formulation for the SES. The training process was conducted in three steps, eventually leading to a model with over 95% agreement with the classical SES. Visualization of tested molecular surfaces shows that the machine-learned SES overlaps with the classical SES in almost all situations. Further analyses show that the machine-learned SES is incredibly stable in terms of rotational variation of tested molecules. Our timing analysis shows that the machine-learned SES is roughly 2.5 times as efficient as the classical SES routine implemented in Amber/PBSA on a tested central processing unit (CPU) platform. We expect further performance gain on massively parallel platforms such as graphics processing units (GPUs) given the ease in converting the machine-learned SES to a parallel procedure. We also implemented the machine-learned SES into the Amber/PBSA program to study its performance on reaction field energy calculation. The analysis shows that the two sets of reaction field energies are highly consistent with a 1% deviation on average. Given its level set formulation, we expect the machine-learned SES to be applied in molecular simulations that require either surface derivatives or high efficiency on parallel computing platforms.
Collapse
Affiliation(s)
- Haixin Wei
- Departments of Materials Science and Engineering, Molecular Biology and Biochemistry, Chemical and Biomolecular Engineering, and Biomedical Engineering, Graduate Program in Chemical and Materials Physics, University of California, Irvine, California 92697, United States
| | - Zekai Zhao
- Departments of Materials Science and Engineering, Molecular Biology and Biochemistry, Chemical and Biomolecular Engineering, and Biomedical Engineering, Graduate Program in Chemical and Materials Physics, University of California, Irvine, California 92697, United States
| | - Ray Luo
- Departments of Materials Science and Engineering, Molecular Biology and Biochemistry, Chemical and Biomolecular Engineering, and Biomedical Engineering, Graduate Program in Chemical and Materials Physics, University of California, Irvine, California 92697, United States
| |
Collapse
|
19
|
Pratella D, Ait-El-Mkadem Saadi S, Bannwarth S, Paquis-Fluckinger V, Bottini S. A Survey of Autoencoder Algorithms to Pave the Diagnosis of Rare Diseases. Int J Mol Sci 2021; 22:10891. [PMID: 34639231 PMCID: PMC8509321 DOI: 10.3390/ijms221910891] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/14/2021] [Revised: 10/04/2021] [Accepted: 10/07/2021] [Indexed: 12/28/2022] Open
Abstract
Rare diseases (RDs) concern a broad range of disorders and can result from various origins. For a long time, the scientific community was unaware of RDs. Impressive progress has already been made for certain RDs; however, due to the lack of sufficient knowledge, many patients are not diagnosed. Nowadays, the advances in high-throughput sequencing technologies such as whole genome sequencing, single-cell and others, have boosted the understanding of RDs. To extract biological meaning using the data generated by these methods, different analysis techniques have been proposed, including machine learning algorithms. These methods have recently proven to be valuable in the medical field. Among such approaches, unsupervised learning methods via neural networks including autoencoders (AEs) or variational autoencoders (VAEs) have shown promising performances with applications on various type of data and in different contexts, from cancer to healthy patient tissues. In this review, we discuss how AEs and VAEs have been used in biomedical settings. Specifically, we discuss their current applications and the improvements achieved in diagnostic and survival of patients. We focus on the applications in the field of RDs, and we discuss how the employment of AEs and VAEs would enhance RD understanding and diagnosis.
Collapse
Affiliation(s)
- David Pratella
- Center of Modeling, Simulation and Interactions, Université Côte d’Azur, 06200 Nice, France;
| | - Samira Ait-El-Mkadem Saadi
- Centre Hospitalier Universitaire (CHU) de Nice, Institute for Research on Cancer and Aging, Nice (IRCAN), Université Côte d’Azur, Inserm U1081, CNRS UMR 7284, 06200 Nice, France; (S.A.-E.-M.S.); (S.B.); (V.P.-F.)
| | - Sylvie Bannwarth
- Centre Hospitalier Universitaire (CHU) de Nice, Institute for Research on Cancer and Aging, Nice (IRCAN), Université Côte d’Azur, Inserm U1081, CNRS UMR 7284, 06200 Nice, France; (S.A.-E.-M.S.); (S.B.); (V.P.-F.)
| | - Véronique Paquis-Fluckinger
- Centre Hospitalier Universitaire (CHU) de Nice, Institute for Research on Cancer and Aging, Nice (IRCAN), Université Côte d’Azur, Inserm U1081, CNRS UMR 7284, 06200 Nice, France; (S.A.-E.-M.S.); (S.B.); (V.P.-F.)
| | - Silvia Bottini
- Center of Modeling, Simulation and Interactions, Université Côte d’Azur, 06200 Nice, France;
| |
Collapse
|
20
|
Umarov R, Li Y, Arner E. DeepCellState: An autoencoder-based framework for predicting cell type specific transcriptional states induced by drug treatment. PLoS Comput Biol 2021; 17:e1009465. [PMID: 34610009 PMCID: PMC8519465 DOI: 10.1371/journal.pcbi.1009465] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/03/2021] [Revised: 10/15/2021] [Accepted: 09/20/2021] [Indexed: 11/18/2022] Open
Abstract
Drug treatment induces cell type specific transcriptional programs, and as the number of combinations of drugs and cell types grows, the cost for exhaustive screens measuring the transcriptional drug response becomes intractable. We developed DeepCellState, a deep learning autoencoder-based framework, for predicting the induced transcriptional state in a cell type after drug treatment, based on the drug response in another cell type. Training the method on a large collection of transcriptional drug perturbation profiles, prediction accuracy improves significantly over baseline and alternative deep learning approaches when applying the method to two cell types, with improved accuracy when generalizing the framework to additional cell types. Treatments with drugs or whole drug families not seen during training are predicted with similar accuracy, and the same framework can be used for predicting the results from other interventions, such as gene knock-downs. Finally, analysis of the trained model shows that the internal representation is able to learn regulatory relationships between genes in a fully data-driven manner.
Collapse
Affiliation(s)
- Ramzan Umarov
- Graduate School of Integrated Sciences for Life, Hiroshima University, Higashi-Hiroshima, Japan
- * E-mail: (RU); (EA)
| | - Yu Li
- Department of Computer Science and Engineering (CSE), The Chinese University of Hong Kong (CUHK), Hong Kong, People’s Republic of China
| | - Erik Arner
- Graduate School of Integrated Sciences for Life, Hiroshima University, Higashi-Hiroshima, Japan
- Laboratory for Applied Regulatory Genomics Network Analysis, RIKEN Center for Integrative Medical Sciences, Yokohama, Kanagawa, Japan
- * E-mail: (RU); (EA)
| |
Collapse
|
21
|
Albaradei S, Thafar M, Alsaedi A, Van Neste C, Gojobori T, Essack M, Gao X. Machine learning and deep learning methods that use omics data for metastasis prediction. Comput Struct Biotechnol J 2021; 19:5008-5018. [PMID: 34589181 PMCID: PMC8450182 DOI: 10.1016/j.csbj.2021.09.001] [Citation(s) in RCA: 62] [Impact Index Per Article: 20.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/18/2021] [Revised: 08/16/2021] [Accepted: 09/02/2021] [Indexed: 12/14/2022] Open
Abstract
Knowing metastasis is the primary cause of cancer-related deaths, incentivized research directed towards unraveling the complex cellular processes that drive the metastasis. Advancement in technology and specifically the advent of high-throughput sequencing provides knowledge of such processes. This knowledge led to the development of therapeutic and clinical applications, and is now being used to predict the onset of metastasis to improve diagnostics and disease therapies. In this regard, predicting metastasis onset has also been explored using artificial intelligence approaches that are machine learning, and more recently, deep learning-based. This review summarizes the different machine learning and deep learning-based metastasis prediction methods developed to date. We also detail the different types of molecular data used to build the models and the critical signatures derived from the different methods. We further highlight the challenges associated with using machine learning and deep learning methods, and provide suggestions to improve the predictive performance of such methods.
Collapse
Key Words
- AE, autoencoder
- ANN, Artificial Neural Network
- AUC, area under the curve
- Acc, Accuracy
- Artificial intelligence
- BC, Betweenness centrality
- BH, Benjamini-Hochberg
- BioGRID, Biological General Repository for Interaction Datasets
- CCP, compound covariate predictor
- CEA, Carcinoembryonic antigen
- CNN, convolution neural networks
- CV, cross-validation
- Cancer
- DBN, deep belief network
- DDBN, discriminative deep belief network
- DEGs, differentially expressed genes
- DIP, Database of Interacting Proteins
- DNN, Deep neural network
- DT, Decision Tree
- Deep learning
- EMT, epithelial-mesenchymal transition
- FC, fully connected
- GA, Genetic Algorithm
- GANs, generative adversarial networks
- GEO, Gene Expression Omnibus
- HCC, hepatocellular carcinoma
- HPRD, Human Protein Reference Database
- KNN, K-nearest neighbor
- L-SVM, linear SVM
- LIMMA, linear models for microarray data
- LOOCV, Leave-one-out cross-validation
- LR, Logistic Regression
- MCCV, Monte Carlo cross-validation
- MLP, multilayer perceptron
- Machine learning
- Metastasis
- NPV, negative predictive value
- PCA, Principal component analysis
- PPI, protein-protein interaction
- PPV, positive predictive value
- RC, ridge classifier
- RF, Random Forest
- RFE, recursive feature elimination
- RMA, robust multi‐array average
- RNN, recurrent neural networks
- SGD, stochastic gradient descent
- SMOTE, synthetic minority over-sampling technique
- SVM, Support Vector Machine
- Se, sensitivity
- Sp, specificity
- TCGA, The Cancer Genome Atlas
- k-CV, k-fold cross validation
- mRMR, minimum redundancy maximum relevance
Collapse
Affiliation(s)
- Somayah Albaradei
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
- King Abdulaziz University, Faculty of Computing and Information Technology, Jeddah, Saudi Arabia
| | - Maha Thafar
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
- Taif University, Collage of Computers and Information Technology, Taif, Saudi Arabia
| | - Asim Alsaedi
- King Saud bin Abdulaziz University for Health Sciences, Jeddah, Saudi Arabia
- King Abdulaziz Medical City, Jeddah, Saudi Arabia
| | - Christophe Van Neste
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
| | - Takashi Gojobori
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
- Biological and Environmental Sciences and Engineering Division (BESE), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
| | - Magbubah Essack
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
| | - Xin Gao
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
| |
Collapse
|
22
|
Integrated knowledge mining, genome-scale modeling, and machine learning for predicting Yarrowia lipolytica bioproduction. Metab Eng 2021; 67:227-236. [PMID: 34242777 DOI: 10.1016/j.ymben.2021.07.003] [Citation(s) in RCA: 27] [Impact Index Per Article: 9.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/14/2021] [Revised: 06/17/2021] [Accepted: 07/05/2021] [Indexed: 01/14/2023]
Abstract
Predicting bioproduction titers from microbial hosts has been challenging due to complex interactions between microbial regulatory networks, stress responses, and suboptimal cultivation conditions. This study integrated knowledge mining, feature extraction, genome-scale modeling (GSM), and machine learning (ML) to develop a model for predicting Yarrowia lipolytica chemical titers (i.e., organic acids, terpenoids, etc.). First, Y. lipolytica production data, including cultivation conditions, genetic engineering strategies, and product information, was manually collected from literature (~100 papers) and stored as either numerical (e.g., substrate concentrations) or categorical (e.g., bioreactor modes) variables. For each case recorded, central pathway fluxes were estimated using GSMs and flux balance analysis (FBA) to provide metabolic features. Second, a ML ensemble learner was trained to predict strain production titers. Accurate predictions on the test data were obtained for instances with production titers >1 g/L (R2 = 0.87). However, the model had reduced predictability for low performance strains (0.01-1 g/L, R2 = 0.29) potentially due to biosynthesis bottlenecks not captured in the features. Feature ranking indicated that the FBA fluxes, the number of enzyme steps, the substrate inputs, and thermodynamic barriers (i.e., Gibbs free energy of reaction) were the most influential factors. Third, the model was evaluated on other oleaginous yeasts and indicated there were conserved features for some hosts that can be potentially exploited by transfer learning. The platform was also designed to assist computational strain design tools (such as OptKnock) to screen genetic targets for improved microbial production in light of experimental conditions.
Collapse
|
23
|
Yang S, Zhu F, Ling X, Liu Q, Zhao P. Intelligent Health Care: Applications of Deep Learning in Computational Medicine. Front Genet 2021; 12:607471. [PMID: 33912213 PMCID: PMC8075004 DOI: 10.3389/fgene.2021.607471] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/17/2020] [Accepted: 03/05/2021] [Indexed: 12/24/2022] Open
Abstract
With the progress of medical technology, biomedical field ushered in the era of big data, based on which and driven by artificial intelligence technology, computational medicine has emerged. People need to extract the effective information contained in these big biomedical data to promote the development of precision medicine. Traditionally, the machine learning methods are used to dig out biomedical data to find the features from data, which generally rely on feature engineering and domain knowledge of experts, requiring tremendous time and human resources. Different from traditional approaches, deep learning, as a cutting-edge machine learning branch, can automatically learn complex and robust feature from raw data without the need for feature engineering. The applications of deep learning in medical image, electronic health record, genomics, and drug development are studied, where the suggestion is that deep learning has obvious advantage in making full use of biomedical data and improving medical health level. Deep learning plays an increasingly important role in the field of medical health and has a broad prospect of application. However, the problems and challenges of deep learning in computational medical health still exist, including insufficient data, interpretability, data privacy, and heterogeneity. Analysis and discussion on these problems provide a reference to improve the application of deep learning in medical health.
Collapse
Affiliation(s)
- Sijie Yang
- School of Computer Science and Technology, Soochow University, Suzhou, China
| | - Fei Zhu
- School of Computer Science and Technology, Soochow University, Suzhou, China
| | - Xinghong Ling
- School of Computer Science and Technology, Soochow University, Suzhou, China
- WenZheng College of Soochow University, Suzhou, China
| | - Quan Liu
- School of Computer Science and Technology, Soochow University, Suzhou, China
| | - Peiyao Zhao
- School of Computer Science and Technology, Soochow University, Suzhou, China
| |
Collapse
|
24
|
Grønbech CH, Vording MF, Timshel PN, Sønderby CK, Pers TH, Winther O. scVAE: variational auto-encoders for single-cell gene expression data. Bioinformatics 2021; 36:4415-4422. [PMID: 32415966 DOI: 10.1093/bioinformatics/btaa293] [Citation(s) in RCA: 99] [Impact Index Per Article: 33.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/18/2018] [Revised: 02/25/2020] [Accepted: 05/12/2020] [Indexed: 12/24/2022] Open
Abstract
MOTIVATION Models for analysing and making relevant biological inferences from massive amounts of complex single-cell transcriptomic data typically require several individual data-processing steps, each with their own set of hyperparameter choices. With deep generative models one can work directly with count data, make likelihood-based model comparison, learn a latent representation of the cells and capture more of the variability in different cell populations. RESULTS We propose a novel method based on variational auto-encoders (VAEs) for analysis of single-cell RNA sequencing (scRNA-seq) data. It avoids data preprocessing by using raw count data as input and can robustly estimate the expected gene expression levels and a latent representation for each cell. We tested several count likelihood functions and a variant of the VAE that has a priori clustering in the latent space. We show for several scRNA-seq datasets that our method outperforms recently proposed scRNA-seq methods in clustering cells and that the resulting clusters reflect cell types. AVAILABILITY AND IMPLEMENTATION Our method, called scVAE, is implemented in Python using the TensorFlow machine-learning library, and it is freely available at https://github.com/scvae/scvae. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Christopher Heje Grønbech
- Department of Biology, Bioinformatics Centre, University of Copenhagen.,Centre for Genomic Medicine, Rigshospitalet, Copenhagen University Hospital, København Ø 2100, Denmark.,Section for Cognitive Systems, Department of Applied Mathematics and Computer Science, Technical University of Denmark, Kongens Lyngby 2800, Denmark
| | - Maximillian Fornitz Vording
- Section for Cognitive Systems, Department of Applied Mathematics and Computer Science, Technical University of Denmark, Kongens Lyngby 2800, Denmark
| | - Pascal N Timshel
- Faculty of Health and Medical Sciences, The Novo Nordisk Foundation Center for Basic Metabolic Research, University of Copenhagen, København N 2200, Denmark
| | | | - Tune H Pers
- Faculty of Health and Medical Sciences, The Novo Nordisk Foundation Center for Basic Metabolic Research, University of Copenhagen, København N 2200, Denmark
| | - Ole Winther
- Department of Biology, Bioinformatics Centre, University of Copenhagen.,Centre for Genomic Medicine, Rigshospitalet, Copenhagen University Hospital, København Ø 2100, Denmark.,Section for Cognitive Systems, Department of Applied Mathematics and Computer Science, Technical University of Denmark, Kongens Lyngby 2800, Denmark
| |
Collapse
|
25
|
Kong L, Chen Y, Xu F, Xu M, Li Z, Fang J, Zhang L, Pian C. Mining influential genes based on deep learning. BMC Bioinformatics 2021; 22:27. [PMID: 33482718 PMCID: PMC7821411 DOI: 10.1186/s12859-021-03972-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/29/2020] [Accepted: 01/15/2021] [Indexed: 11/17/2022] Open
Abstract
Background Currently, large-scale gene expression profiling has been successfully applied to the discovery of functional connections among diseases, genetic perturbation, and drug action. To address the cost of an ever-expanding gene expression profile, a new, low-cost, high-throughput reduced representation expression profiling method called L1000 was proposed, with which one million profiles were produced. Although a set of ~ 1000 carefully chosen landmark genes that can capture ~ 80% of information from the whole genome has been identified for use in L1000, the robustness of using these landmark genes to infer target genes is not satisfactory. Therefore, more efficient computational methods are still needed to deep mine the influential genes in the genome. Results Here, we propose a computational framework based on deep learning to mine a subset of genes that can cover more genomic information. Specifically, an AutoEncoder framework is first constructed to learn the non-linear relationship between genes, and then DeepLIFT is applied to calculate gene importance scores. Using this data-driven approach, we have re-obtained a landmark gene set. The result shows that our landmark genes can predict target genes more accurately and robustly than that of L1000 based on two metrics [mean absolute error (MAE) and Pearson correlation coefficient (PCC)]. This reveals that the landmark genes detected by our method contain more genomic information. Conclusions We believe that our proposed framework is very suitable for the analysis of biological big data to reveal the mysteries of life. Furthermore, the landmark genes inferred from this study can be used for the explosive amplification of gene expression profiles to facilitate research into functional connections.
Collapse
Affiliation(s)
- Lingpeng Kong
- College of Agriculture, Nanjing Agricultural University, Jiangsu, 210095, Nanjing, China
| | - Yuanyuan Chen
- Department of Mathematics, College of Science, Nanjing Agricultural University, Nanjing, 210095, China
| | - Fengjiao Xu
- Department of Mathematics, College of Science, Nanjing Agricultural University, Nanjing, 210095, China
| | - Mingmin Xu
- College of Agriculture, Nanjing Agricultural University, Jiangsu, 210095, Nanjing, China
| | - Zutan Li
- College of Agriculture, Nanjing Agricultural University, Jiangsu, 210095, Nanjing, China
| | - Jingya Fang
- College of Agriculture, Nanjing Agricultural University, Jiangsu, 210095, Nanjing, China
| | - Liangyun Zhang
- Department of Mathematics, College of Science, Nanjing Agricultural University, Nanjing, 210095, China.
| | - Cong Pian
- Department of Mathematics, College of Science, Nanjing Agricultural University, Nanjing, 210095, China.
| |
Collapse
|
26
|
Wang J, Xie X, Shi J, He W, Chen Q, Chen L, Gu W, Zhou T. Denoising Autoencoder, A Deep Learning Algorithm, Aids the Identification of A Novel Molecular Signature of Lung Adenocarcinoma. GENOMICS PROTEOMICS & BIOINFORMATICS 2020; 18:468-480. [PMID: 33346087 PMCID: PMC8242334 DOI: 10.1016/j.gpb.2019.02.003] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 09/17/2018] [Revised: 01/11/2019] [Accepted: 03/01/2019] [Indexed: 02/06/2023]
Abstract
Precise biomarker development is a key step in disease management. However, most of the published biomarkers were derived from a relatively small number of samples with supervised approaches. Recent advances in unsupervised machine learning promise to leverage very large datasets for making better predictions of disease biomarkers. Denoising autoencoder (DA) is one of the unsupervised deep learning algorithms, which is a stochastic version of autoencoder techniques. The principle of DA is to force the hidden layer of autoencoder to capture more robust features by reconstructing a clean input from a corrupted one. Here, a DA model was applied to analyze integrated transcriptomic data from 13 published lung cancer studies, which consisted of 1916 human lung tissue samples. Using DA, we discovered a molecular signature composed of multiple genes for lung adenocarcinoma (ADC). In independent validation cohorts, the proposed molecular signature is proved to be an effective classifier for lung cancer histological subtypes. Also, this signature successfully predicts clinical outcome in lung ADC, which is independent of traditional prognostic factors. More importantly, this signature exhibits a superior prognostic power compared with the other published prognostic genes. Our study suggests that unsupervised learning is helpful for biomarker development in the era of precision medicine.
Collapse
Affiliation(s)
- Jun Wang
- Department of Thoracic Surgery, Jiangsu Province People's Hospital and the First Affiliated Hospital of Nanjing Medical University, Nanjing 210029, China
| | - Xueying Xie
- State Key Laboratory of Bioelectronics, School of Biological Sciences and Medical Engineering, Southeast University, Nanjing 210096, China
| | - Junchao Shi
- Department of Physiology and Cell Biology, University of Nevada, Reno School of Medicine, Reno, NV 89557, USA
| | - Wenjun He
- State Key Lab of Respiratory Disease, Guangzhou Medical University, Guangzhou 510000, China
| | - Qi Chen
- Department of Physiology and Cell Biology, University of Nevada, Reno School of Medicine, Reno, NV 89557, USA
| | - Liang Chen
- Department of Thoracic Surgery, Jiangsu Province People's Hospital and the First Affiliated Hospital of Nanjing Medical University, Nanjing 210029, China.
| | - Wanjun Gu
- State Key Laboratory of Bioelectronics, School of Biological Sciences and Medical Engineering, Southeast University, Nanjing 210096, China.
| | - Tong Zhou
- Department of Physiology and Cell Biology, University of Nevada, Reno School of Medicine, Reno, NV 89557, USA.
| |
Collapse
|
27
|
Simon LM, Yan F, Zhao Z. DrivAER: Identification of driving transcriptional programs in single-cell RNA sequencing data. Gigascience 2020; 9:giaa122. [PMID: 33301553 PMCID: PMC7727875 DOI: 10.1093/gigascience/giaa122] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/05/2020] [Revised: 05/27/2020] [Accepted: 10/07/2020] [Indexed: 11/12/2022] Open
Abstract
BACKGROUND Single-cell RNA sequencing (scRNA-seq) unfolds complex transcriptomic datasets into detailed cellular maps. Despite recent success, there is a pressing need for specialized methods tailored towards the functional interpretation of these cellular maps. FINDINGS Here, we present DrivAER, a machine learning approach for the identification of driving transcriptional programs using autoencoder-based relevance scores. DrivAER scores annotated gene sets on the basis of their relevance to user-specified outcomes such as pseudotemporal ordering or disease status. DrivAER iteratively evaluates the information content of each gene set with respect to the outcome variable using autoencoders. We benchmark our method using extensive simulation analysis as well as comparison to existing methods for functional interpretation of scRNA-seq data. Furthermore, we demonstrate that DrivAER extracts key pathways and transcription factors that regulate complex biological processes from scRNA-seq data. CONCLUSIONS By quantifying the relevance of annotated gene sets with respect to specified outcome variables, DrivAER greatly enhances our ability to understand the underlying molecular mechanisms.
Collapse
Affiliation(s)
- Lukas M Simon
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, 7000 Fannin St, Houston, TX 77030, USA
| | - Fangfang Yan
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, 7000 Fannin St, Houston, TX 77030, USA
| | - Zhongming Zhao
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, 7000 Fannin St, Houston, TX 77030, USA
- Human Genetics Center, School of Public Health, The University of Texas Health Science Center at Houston, 7000 Fannin St, Houston, TX 77030, USA
- MD Anderson Cancer Center UTHealth Graduate School of Biomedical Sciences, 6767 Bertner Ave, Houston, TX 77030, USA
- Department of Biomedical Informatics, Vanderbilt University Medical Center, 2525 West End, Nashville, TN 37203, USA
| |
Collapse
|
28
|
Xue Y, Ding MQ, Lu X. Learning to encode cellular responses to systematic perturbations with deep generative models. NPJ Syst Biol Appl 2020; 6:35. [PMID: 33159077 PMCID: PMC7648057 DOI: 10.1038/s41540-020-00158-2] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/16/2020] [Accepted: 10/07/2020] [Indexed: 11/09/2022] Open
Abstract
Cellular signaling systems play a vital role in maintaining homeostasis when a cell is exposed to different perturbations. Components of the systems are organized as hierarchical networks, and perturbing different components often leads to transcriptomic profiles that exhibit compositional statistical patterns. Mining such patterns to investigate how cellular signals are encoded is an important problem in systems biology, where artificial intelligence techniques can be of great assistance. Here, we investigated the capability of deep generative models (DGMs) to modeling signaling systems and learn representations of cellular states underlying transcriptomic responses to diverse perturbations. Specifically, we show that the variational autoencoder and the supervised vector-quantized variational autoencoder can accurately regenerate gene expression data in response to perturbagen treatments. The models can learn representations that reveal the relationships between different classes of perturbagens and enable mappings between drugs and their target genes. In summary, DGMs can adequately learn and depict how cellular signals are encoded. The resulting representations have broad applications, demonstrating the power of artificial intelligence in systems biology and precision medicine.
Collapse
Affiliation(s)
- Yifan Xue
- Department of Biomedical Informatics, School of Medicine, University of Pittsburgh, Pittsburgh, PA, 15206, USA
| | - Michael Q Ding
- Department of Biomedical Informatics, School of Medicine, University of Pittsburgh, Pittsburgh, PA, 15206, USA
| | - Xinghua Lu
- Department of Biomedical Informatics, School of Medicine, University of Pittsburgh, Pittsburgh, PA, 15206, USA.
- Department of Pharmaceutical Sciences, School of Pharmacy, University of Pittsburgh, Pittsburgh, PA, 15206, USA.
| |
Collapse
|
29
|
Lee AJ, Park Y, Doing G, Hogan DA, Greene CS. Correcting for experiment-specific variability in expression compendia can remove underlying signals. Gigascience 2020; 9:giaa117. [PMID: 33140829 PMCID: PMC7607552 DOI: 10.1093/gigascience/giaa117] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/06/2020] [Revised: 08/28/2020] [Accepted: 09/29/2020] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION In the past two decades, scientists in different laboratories have assayed gene expression from millions of samples. These experiments can be combined into compendia and analyzed collectively to extract novel biological patterns. Technical variability, or "batch effects," may result from combining samples collected and processed at different times and in different settings. Such variability may distort our ability to extract true underlying biological patterns. As more integrative analysis methods arise and data collections get bigger, we must determine how technical variability affects our ability to detect desired patterns when many experiments are combined. OBJECTIVE We sought to determine the extent to which an underlying signal was masked by technical variability by simulating compendia comprising data aggregated across multiple experiments. METHOD We developed a generative multi-layer neural network to simulate compendia of gene expression experiments from large-scale microbial and human datasets. We compared simulated compendia before and after introducing varying numbers of sources of undesired variability. RESULTS The signal from a baseline compendium was obscured when the number of added sources of variability was small. Applying statistical correction methods rescued the underlying signal in these cases. However, as the number of sources of variability increased, it became easier to detect the original signal even without correction. In fact, statistical correction reduced our power to detect the underlying signal. CONCLUSION When combining a modest number of experiments, it is best to correct for experiment-specific noise. However, when many experiments are combined, statistical correction reduces our ability to extract underlying patterns.
Collapse
Affiliation(s)
- Alexandra J Lee
- Genomics and Computational Biology Graduate Program, University of Pennsylvania, 3400 Civic Center Blvd, Philadelphia, PA, 19104, USA
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, 3400 Civic Center Blvd, Philadelphia, PA, 19104, USA
| | - YoSon Park
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, 3400 Civic Center Blvd, Philadelphia, PA, 19104, USA
| | - Georgia Doing
- Department of Microbiology and Immunology, Geisel School of Medicine, Dartmouth, 1 Rope Ferry Rd, Hanover, NH, 03755, USA
| | - Deborah A Hogan
- Department of Microbiology and Immunology, Geisel School of Medicine, Dartmouth, 1 Rope Ferry Rd, Hanover, NH, 03755, USA
| | - Casey S Greene
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, 3400 Civic Center Blvd, Philadelphia, PA, 19104, USA
- Childhood Cancer Data Lab, Alex's Lemonade Stand Foundation, 1429 Walnut St, Floor 10, Philadelphia, PA, 19102 USA
| |
Collapse
|
30
|
McDermott MBA, Wang J, Zhao WN, Sheridan SD, Szolovits P, Kohane I, Haggarty SJ, Perlis RH. Deep Learning Benchmarks on L1000 Gene Expression Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:1846-1857. [PMID: 30990190 PMCID: PMC6980363 DOI: 10.1109/tcbb.2019.2910061] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/09/2023]
Abstract
Gene expression data can offer deep, physiological insights beyond the static coding of the genome alone. We believe that realizing this potential requires specialized, high-capacity machine learning methods capable of using underlying biological structure, but the development of such models is hampered by the lack of published benchmark tasks and well characterized baselines. In this work, we establish such benchmarks and baselines by profiling many classifiers against biologically motivated tasks on two curated views of a large, public gene expression dataset (the LINCS corpus) and one privately produced dataset. We provide these two curated views of the public LINCS dataset and our benchmark tasks to enable direct comparisons to future methodological work and help spur deep learning method development on this modality. In addition to profiling a battery of traditional classifiers, including linear models, random forests, decision trees, K nearest neighbor (KNN) classifiers, and feed-forward artificial neural networks (FF-ANNs), we also test a method novel to this data modality: graph convolugtional neural networks (GCNNs), which allow us to incorporate prior biological domain knowledge. We find that GCNNs can be highly performant, with large datasets, whereas FF-ANNs consistently perform well. Non-neural classifiers are dominated by linear models and KNN classifiers.
Collapse
|
31
|
Tan K, Huang W, Hu J, Dong S. A multi-omics supervised autoencoder for pan-cancer clinical outcome endpoints prediction. BMC Med Inform Decis Mak 2020; 20:129. [PMID: 32646413 PMCID: PMC7477832 DOI: 10.1186/s12911-020-1114-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
Background With the rapid development of sequencing technologies, collecting diverse types of cancer omics data become more cost-effective. Many computational methods attempted to represent and fuse multiple omics into a comprehensive view of cancer. However, different types of omics are related and heterogeneous. Most of the existing methods do not consider the difference between omics, so the biological knowledge of individual omics may not be fully excavated. And for a given task (e.g. predicting overall survival), these methods prefer to use sample similarity or domain knowledge to learn a more reasonable representation of omics, but it’s not enough. Methods For the purpose of learning more useful representation for individual omics and fusing them to improve the prediction ability, we proposed an autoencoder-based method named MOSAE (Multi-omics Supervised Autoencoder). In our method, a specific autoencoder were designed for each omics according to their size of dimension to generate omics-specific representations. Then, a supervised autoencoder was constructed based on specific autoencoder by using labels to enforce each specific autoencoder to learn both omics-specific and task-specific representations. Finally, representations of different omics that generate from supervised autoencoders were fused in a traditional but powerful way, and the fused representation was used for subsequent predictive tasks. Results We applied our method over TCGA Pan-Cancer dataset to predict four different clinical outcome endpoints (OS, PFI, DFI, and DSS). Compared with traditional and state-of-the-art methods, MOSAE achieved better predictive performance. We also tested the effects of each improvement, which all have a positive effect on predictive performance. Conclusions Predicting clinical outcome endpoints are very important for precision medicine and personalized medicine. And multi-omics fusion is an effective way to solve this problem. MOSAE is a powerful multi-omics fusion method, which can generate both omics-specific and task-specific representation for given endpoint predictive tasks and improve the predictive performance.
Collapse
Affiliation(s)
- Kaiwen Tan
- Communication & Computer Network Lab of Guangdong, School of Computer Science & Engineering, South China University of Technology, Wushan Road, Guangzhou, 381, China
| | - Weixian Huang
- Communication & Computer Network Lab of Guangdong, School of Computer Science & Engineering, South China University of Technology, Wushan Road, Guangzhou, 381, China
| | - Jinlong Hu
- Communication & Computer Network Lab of Guangdong, School of Computer Science & Engineering, South China University of Technology, Wushan Road, Guangzhou, 381, China
| | - Shoubin Dong
- Communication & Computer Network Lab of Guangdong, School of Computer Science & Engineering, South China University of Technology, Wushan Road, Guangzhou, 381, China.
| |
Collapse
|
32
|
Koumakis L. Deep learning models in genomics; are we there yet? Comput Struct Biotechnol J 2020; 18:1466-1473. [PMID: 32637044 PMCID: PMC7327302 DOI: 10.1016/j.csbj.2020.06.017] [Citation(s) in RCA: 52] [Impact Index Per Article: 13.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/10/2020] [Revised: 06/07/2020] [Accepted: 06/08/2020] [Indexed: 12/23/2022] Open
Abstract
With the evolution of biotechnology and the introduction of the high throughput sequencing, researchers have the ability to produce and analyze vast amounts of genomics data. Since genomics produce big data, most of the bioinformatics algorithms are based on machine learning methodologies, and lately deep learning, to identify patterns, make predictions and model the progression or treatment of a disease. Advances in deep learning created an unprecedented momentum in biomedical informatics and have given rise to new bioinformatics and computational biology research areas. It is evident that deep learning models can provide higher accuracies in specific tasks of genomics than the state of the art methodologies. Given the growing trend on the application of deep learning architectures in genomics research, in this mini review we outline the most prominent models, we highlight possible pitfalls and discuss future directions. We foresee deep learning accelerating changes in the area of genomics, especially for multi-scale and multimodal data analysis for precision medicine.
Collapse
Affiliation(s)
- Lefteris Koumakis
- Foundation for Research and Technology - Hellas (FORTH), Institute of Computer Science, Heraklion, Crete, Greece
| |
Collapse
|
33
|
Zhao Z, Li Y, Wu Y, Chen R. Deep learning-based model for predicting progression in patients with head and neck squamous cell carcinoma. Cancer Biomark 2020; 27:19-28. [PMID: 31658045 DOI: 10.3233/cbm-190380] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/14/2022]
Abstract
PURPOSE This study endeavors to build a deep learning (DL)-based model for predicting disease progression in head and neck squamous cell carcinoma (HNSCC) patients by integrating multi-omics data. METHODS RNA sequencing, miRNA sequencing, and methylation data from The Cancer Genome Atlas (TCGA) were used as input for autoencoder, a DL approach. An autoencoder-based prognosis model for PFS was built by SVM algorithm and tested in three confirmation sets. Predictive performance of the model was compared to two alternative approaches. Differential expression analysis for mRNAs, microRNAs (miRNA) and methylation was conducted. Moreover, functional annotation of differentially expressed genes (DEGs) was achieved through function enrichment analysis. RESULT The DL-based prognosis model identified two subgroups of patients with significantly different PFS, and showcased a good model fitness (C-index = 0.73). The two identified PFS subtypes were successfully validated in three confirmation sets. The DL-based model was more accurate and efficient than principal component analysis (PCA) or individual Cox-PH-based models. There were 348 DEGs, 23 differentially expressed miRNAs and 55 differentially methylated genes between the two PFS subtypes. These genes were significantly involved in several immune-related biological processes and primary immunodeficiency, cell adhesion molecules (CAMs), B cell receptor signaling and leukocyte transendothelial migration pathways. CONCLUSION The DL-based model introduced in this study is reliable and robust in predicting disease progression in HNSCC patients. A number of pathways and genes targets are unraveled to be implicated in cancer progression. Utility of this model would facilitate development of more individualized therapy for HNSCC patients and improve prognosis.
Collapse
Affiliation(s)
- Zhen Zhao
- Department of Otolaryngology, Nanjing First Hospital, Nanjing Medical University, Nanjing, Jiangsu 210006, China.,Department of Otolaryngology, Nanjing First Hospital, Nanjing Medical University, Nanjing, Jiangsu 210006, China
| | - Yingli Li
- Department of Plastic Surgery, The 960th Hospital of the PLA Joint Logistic Support Force, Jinan, Shandong 250000, China.,Department of Otolaryngology, Nanjing First Hospital, Nanjing Medical University, Nanjing, Jiangsu 210006, China
| | - Yuanqing Wu
- Department of Otolaryngology, Nanjing First Hospital, Nanjing Medical University, Nanjing, Jiangsu 210006, China
| | - Rongrong Chen
- Department of Otolaryngology, Nanjing First Hospital, Nanjing Medical University, Nanjing, Jiangsu 210006, China
| |
Collapse
|
34
|
Zhang S, Li X, Lin Q, Lin J, Wong KC. Uncovering the key dimensions of high-throughput biomolecular data using deep learning. Nucleic Acids Res 2020; 48:e56. [PMID: 32232416 PMCID: PMC7261195 DOI: 10.1093/nar/gkaa191] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/16/2019] [Revised: 03/06/2020] [Accepted: 03/16/2020] [Indexed: 01/09/2023] Open
Abstract
Recent advances in high-throughput single-cell RNA-seq have enabled us to measure thousands of gene expression levels at single-cell resolution. However, the transcriptomic profiles are high-dimensional and sparse in nature. To address it, a deep learning framework based on auto-encoder, termed DeepAE, is proposed to elucidate high-dimensional transcriptomic profiling data in an encode-decode manner. Comparative experiments were conducted on nine transcriptomic profiling datasets to compare DeepAE with four benchmark methods. The results demonstrate that the proposed DeepAE outperforms the benchmark methods with robust performance on uncovering the key dimensions of single-cell RNA-seq data. In addition, we also investigate the performance of DeepAE in other contexts and platforms such as mass cytometry and metabolic profiling in a comprehensive manner. Gene ontology enrichment and pathology analysis are conducted to reveal the mechanisms behind the robust performance of DeepAE by uncovering its key dimensions.
Collapse
Affiliation(s)
- Shixiong Zhang
- Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
| | - Xiangtao Li
- School of Artificial Intelligence, Jilin University, Jilin 132000, China
| | - Qiuzhen Lin
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
| | - Jiecong Lin
- Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
| | - Ka-Chun Wong
- Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
| |
Collapse
|
35
|
Way GP, Zietz M, Rubinetti V, Himmelstein DS, Greene CS. Compressing gene expression data using multiple latent space dimensionalities learns complementary biological representations. Genome Biol 2020; 21:109. [PMID: 32393369 PMCID: PMC7212571 DOI: 10.1186/s13059-020-02021-3] [Citation(s) in RCA: 30] [Impact Index Per Article: 7.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/07/2019] [Accepted: 04/16/2020] [Indexed: 12/27/2022] Open
Abstract
BACKGROUND Unsupervised compression algorithms applied to gene expression data extract latent or hidden signals representing technical and biological sources of variation. However, these algorithms require a user to select a biologically appropriate latent space dimensionality. In practice, most researchers fit a single algorithm and latent dimensionality. We sought to determine the extent by which selecting only one fit limits the biological features captured in the latent representations and, consequently, limits what can be discovered with subsequent analyses. RESULTS We compress gene expression data from three large datasets consisting of adult normal tissue, adult cancer tissue, and pediatric cancer tissue. We train many different models across a large range of latent space dimensionalities and observe various performance differences. We identify more curated pathway gene sets significantly associated with individual dimensions in denoising autoencoder and variational autoencoder models trained using an intermediate number of latent dimensionalities. Combining compressed features across algorithms and dimensionalities captures the most pathway-associated representations. When trained with different latent dimensionalities, models learn strongly associated and generalizable biological representations including sex, neuroblastoma MYCN amplification, and cell types. Stronger signals, such as tumor type, are best captured in models trained at lower dimensionalities, while more subtle signals such as pathway activity are best identified in models trained with more latent dimensionalities. CONCLUSIONS There is no single best latent dimensionality or compression algorithm for analyzing gene expression data. Instead, using features derived from different compression models across multiple latent space dimensionalities enhances biological representations.
Collapse
Affiliation(s)
- Gregory P Way
- Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, 10-131 SCTR 34th and Civic Center Blvd, Philadelphia, PA, 19104, USA
- Imaging Platform, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
| | - Michael Zietz
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, 10-131 SCTR 34th and Civic Center Blvd, Philadelphia, PA, 19104, USA
| | - Vincent Rubinetti
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, 10-131 SCTR 34th and Civic Center Blvd, Philadelphia, PA, 19104, USA
| | - Daniel S Himmelstein
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, 10-131 SCTR 34th and Civic Center Blvd, Philadelphia, PA, 19104, USA
| | - Casey S Greene
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, 10-131 SCTR 34th and Civic Center Blvd, Philadelphia, PA, 19104, USA.
- Childhood Cancer Data Lab, Alex's Lemonade Stand Foundation, Philadelphia, PA, 19102, USA.
| |
Collapse
|
36
|
Yadav A, Sahu R, Nath A. A representation transfer learning approach for enhanced prediction of growth hormone binding proteins. Comput Biol Chem 2020; 87:107274. [PMID: 32416563 DOI: 10.1016/j.compbiolchem.2020.107274] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/07/2019] [Revised: 04/19/2020] [Accepted: 04/27/2020] [Indexed: 10/24/2022]
Abstract
Growth hormone binding proteins (GHBPs) are soluble proteins that play an important role in the modulation of signaling pathways pertaining to growth hormones. GHBPs are selective and bind non-covalently with growth hormones, but their functions are still not fully understood. Identification and characterization of GHBPs are the preliminary steps for understanding their roles in various cellular processes. As wet lab based experimental methods involve high cost and labor, computational methods can facilitate in narrowing down the search space of putative GHBPs. Performance of machine learning algorithms largely depends on the quality of features that it feeds on. Informative and non-redundant features generally result in enhanced performance and for this purpose feature selection algorithms are commonly used. In the present work, a novel representation transfer learning approach is presented for prediction of GHBPs. For their accurate prediction, deep autoencoder based features were extracted and subsequently SMO-PolyK classifier is trained. The prediction model is evaluated by both leave one out cross validation (LOOCV) and hold out independent testing set. On LOOCV, the prediction model achieved 89.8%% accuracy, with 89.4% sensitivity and 90.2% specificity and accuracy of 93.5%, sensitivity of 90.2% and specificity of 96.8% is attained on the hold out testing set. Further a comparison was made between the full set of sequence-based features, top performing sequence features extracted using feature selection algorithm, deep autoencoder based features and generalized low rank model based features on the prediction accuracy. Principal component analysis of the representative features along with t-sne visualization demonstrated the effectiveness of deep features in prediction of GHBPs. The present method is robust and accurate and may complement other wet lab based methods for identification of novel GHBPs.
Collapse
Affiliation(s)
- Amisha Yadav
- Department of Biochemistry, Pt. Jawahar Lal Nehru Memorial Medical College, Raipur 492001, India
| | - Roopshikha Sahu
- Department of Biochemistry, Pt. Jawahar Lal Nehru Memorial Medical College, Raipur 492001, India
| | - Abhigyan Nath
- Department of Biochemistry, Pt. Jawahar Lal Nehru Memorial Medical College, Raipur 492001, India.
| |
Collapse
|
37
|
Li Y, Huang H, Chen H, Liu T. Deep Neural Networks for In Situ Hybridization Grid Completion and Clustering. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:536-546. [PMID: 30106689 PMCID: PMC7199204 DOI: 10.1109/tcbb.2018.2864262] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
Transcriptome in brain plays a crucial role in understanding the cortical organization and the development of brain structure and function. Two challenges, incomplete data and high dimensionality of transcriptome, remain unsolved. Here, we present a novel training scheme that successfully adapts the U-net architecture to the problem of volume recovery. By analogy to denoising autoencoder, we hide a portion of each training sample so that the network can learn to recover missing voxels from context. Then on the completed volumes, we show that Restricted Boltzmann Machines (RBMs) can be used to infer co-occurrences among voxels, providing foundations for dividing the cortex into discrete subregions. As we stack multiple RBMs to form a deep belief network (DBN), we progressively map the high-dimensional raw input into abstract representations and create a hierarchy of transcriptome architecture. A coarse to fine organization emerges from the network layers. This organization incidentally corresponds to the anatomical structures, suggesting a close link between structures and the genetic underpinnings. Thus, we demonstrate a new way of learning transcriptome-based hierarchical organization using RBM and DBN.
Collapse
|
38
|
Beykikhoshk A, Quinn TP, Lee SC, Tran T, Venkatesh S. DeepTRIAGE: interpretable and individualised biomarker scores using attention mechanism for the classification of breast cancer sub-types. BMC Med Genomics 2020; 13:20. [PMID: 32093737 PMCID: PMC7038647 DOI: 10.1186/s12920-020-0658-5] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/13/2019] [Accepted: 01/06/2020] [Indexed: 01/10/2023] Open
Abstract
Background Breast cancer is a collection of multiple tissue pathologies, each with a distinct molecular signature that correlates with patient prognosis and response to therapy. Accurately differentiating between breast cancer sub-types is an important part of clinical decision-making. Although this problem has been addressed using machine learning methods in the past, there remains unexplained heterogeneity within the established sub-types that cannot be resolved by the commonly used classification algorithms. Methods In this paper, we propose a novel deep learning architecture, called DeepTRIAGE (Deep learning for the TRactable Individualised Analysis of Gene Expression), which uses an attention mechanism to obtain personalised biomarker scores that describe how important each gene is in predicting the cancer sub-type for each sample. We then perform a principal component analysis of these biomarker scores to visualise the sample heterogeneity, and use a linear model to test whether the major principal axes associate with known clinical phenotypes. Results Our model not only classifies cancer sub-types with good accuracy, but simultaneously assigns each patient their own set of interpretable and individualised biomarker scores. These personalised scores describe how important each feature is in the classification of any patient, and can be analysed post-hoc to generate new hypotheses about latent heterogeneity. Conclusions We apply the DeepTRIAGE framework to classify the gene expression signatures of luminal A and luminal B breast cancer sub-types, and illustrate its use for genes as well as the GO and KEGG gene sets. Using DeepTRIAGE, we calculate personalised biomarker scores that describe the most important features for classifying an individual patient as luminal A or luminal B. In doing so, DeepTRIAGE simultaneously reveals heterogeneity within the luminal A biomarker scores that significantly associate with tumour stage, placing all luminal samples along a continuum of severity.
Collapse
Affiliation(s)
- Adham Beykikhoshk
- Centre for Pattern Recognition and Data Analytics, Deakin University, Geelong, Australia.
| | - Thomas P Quinn
- Centre for Pattern Recognition and Data Analytics, Deakin University, Geelong, Australia
| | - Samuel C Lee
- Centre for Pattern Recognition and Data Analytics, Deakin University, Geelong, Australia
| | - Truyen Tran
- Centre for Pattern Recognition and Data Analytics, Deakin University, Geelong, Australia
| | - Svetha Venkatesh
- Centre for Pattern Recognition and Data Analytics, Deakin University, Geelong, Australia
| |
Collapse
|
39
|
Dwivedi SK, Tjärnberg A, Tegnér J, Gustafsson M. Deriving disease modules from the compressed transcriptional space embedded in a deep autoencoder. Nat Commun 2020; 11:856. [PMID: 32051402 PMCID: PMC7016183 DOI: 10.1038/s41467-020-14666-6] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/11/2019] [Accepted: 01/22/2020] [Indexed: 01/05/2023] Open
Abstract
Disease modules in molecular interaction maps have been useful for characterizing diseases. Yet biological networks, that commonly define such modules are incomplete and biased toward some well-studied disease genes. Here we ask whether disease-relevant modules of genes can be discovered without prior knowledge of a biological network, instead training a deep autoencoder from large transcriptional data. We hypothesize that modules could be discovered within the autoencoder representations. We find a statistically significant enrichment of genome-wide association studies (GWAS) relevant genes in the last layer, and to a successively lesser degree in the middle and first layers respectively. In contrast, we find an opposite gradient where a modular protein-protein interaction signal is strongest in the first layer, but then vanishing smoothly deeper in the network. We conclude that a data-driven discovery approach is sufficient to discover groups of disease-related genes.
Collapse
Affiliation(s)
- Sanjiv K Dwivedi
- Bioinformatics, Department of Physics, Chemistry and Biology, Linköping University, Linköping, Sweden
| | - Andreas Tjärnberg
- Bioinformatics, Department of Physics, Chemistry and Biology, Linköping University, Linköping, Sweden
- Department of Biology, Center For Genomics and Systems Biology, New York University, New York, NY, 10008, USA
- Center for Developmental Genetics, Department of Biology, New York University, New York, NY, USA
| | - Jesper Tegnér
- Biological and Environmental Sciences and Engineering Division, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
- Unit of Computational Medicine, Department of Medicine, Solna, Center for Molecular Medicine, Karolinska Institutet, Stockholm, Sweden
- Science for Life Laboratory, Solna, Sweden
| | - Mika Gustafsson
- Bioinformatics, Department of Physics, Chemistry and Biology, Linköping University, Linköping, Sweden.
| |
Collapse
|
40
|
Tao Y, Cai C, Cohen WW, Lu X. From genome to phenome: Predicting multiple cancer phenotypes based on somatic genomic alterations via the genomic impact transformer. PACIFIC SYMPOSIUM ON BIOCOMPUTING. PACIFIC SYMPOSIUM ON BIOCOMPUTING 2020; 25:79-90. [PMID: 31797588 PMCID: PMC6932864] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
Abstract
Cancers are mainly caused by somatic genomic alterations (SGAs) that perturb cellular signaling systems and eventually activate oncogenic processes. Therefore, understanding the functional impact of SGAs is a fundamental task in cancer biology and precision oncology. Here, we present a deep neural network model with encoder-decoder architecture, referred to as genomic impact transformer (GIT), to infer the functional impact of SGAs on cellular signaling systems through modeling the statistical relationships between SGA events and differentially expressed genes (DEGs) in tumors. The model utilizes a multi-head self-attention mechanism to identify SGAs that likely cause DEGs, or in other words, differentiating potential driver SGAs from passenger ones in a tumor. GIT model learns a vector (gene embedding) as an abstract representation of functional impact for each SGA-affected gene. Given SGAs of a tumor, the model can instantiate the states of the hidden layer, providing an abstract representation (tumor embedding) reflecting characteristics of perturbed molecular/cellular processes in the tumor, which in turn can be used to predict multiple phenotypes. We apply the GIT model to 4,468 tumors profiled by The Cancer Genome Atlas (TCGA) project. The attention mechanism enables the model to better capture the statistical relationship between SGAs and DEGs than conventional methods, and distinguishes cancer drivers from passengers. The learned gene embeddings capture the functional similarity of SGAs perturbing common pathways. The tumor embeddings are shown to be useful for tumor status representation, and phenotype prediction including patient survival time and drug response of cancer cell lines.
Collapse
Affiliation(s)
- Yifeng Tao
- School of Computer Science, Carnegie Mellon University
| | - Chunhui Cai
- Department of Biomedical Informatics, University of Pittsburgh
| | - William W. Cohen
- School of Computer Science, Carnegie Mellon University,To whom correspondence should be addressed. ,
| | - Xinghua Lu
- Department of Biomedical Informatics, University of Pittsburgh,Department of Pharmaceutical Sciences, School of Pharmacy, University of Pittsburgh, Pittsburgh, PA, USA,To whom correspondence should be addressed. ,
| |
Collapse
|
41
|
Exploring single-cell data with deep multitasking neural networks. Nat Methods 2019; 16:1139-1145. [PMID: 31591579 PMCID: PMC10164410 DOI: 10.1038/s41592-019-0576-7] [Citation(s) in RCA: 158] [Impact Index Per Article: 31.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/24/2018] [Accepted: 08/19/2019] [Indexed: 01/22/2023]
Abstract
It is currently challenging to analyze single-cell data consisting of many cells and samples, and to address variations arising from batch effects and different sample preparations. For this purpose, we present SAUCIE, a deep neural network that combines parallelization and scalability offered by neural networks, with the deep representation of data that can be learned by them to perform many single-cell data analysis tasks. Our regularizations (penalties) render features learned in hidden layers of the neural network interpretable. On large, multi-patient datasets, SAUCIE's various hidden layers contain denoised and batch-corrected data, a low-dimensional visualization and unsupervised clustering, as well as other information that can be used to explore the data. We analyze a 180-sample dataset consisting of 11 million T cells from dengue patients in India, measured with mass cytometry. SAUCIE can batch correct and identify cluster-based signatures of acute dengue infection and create a patient manifold, stratifying immune response to dengue.
Collapse
|
42
|
Way GP, Greene CS. Discovering Pathway and Cell Type Signatures in Transcriptomic Compendia with Machine Learning. Annu Rev Biomed Data Sci 2019. [DOI: 10.1146/annurev-biodatasci-072018-021348] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
Abstract
Pathway and cell type signatures are patterns present in transcriptome data that are associated with biological processes or phenotypic consequences. These signatures result from specific cell type and pathway expression but can require large transcriptomic compendia to detect. Machine learning techniques can be powerful tools for signature discovery through their ability to provide accurate and interpretable results. In this review, we discuss various machine learning applications to extract pathway and cell type signatures from transcriptomic compendia. We focus on the biological motivations and interpretation for both supervised and unsupervised learning approaches in this setting. We consider recent advances, including deep learning, and their applications to expanding bulk and single-cell RNA data. As data and computational resources increase, there will be more opportunities for machine learning to aid in revealing biological signatures.
Collapse
Affiliation(s)
- Gregory P. Way
- Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA
| | - Casey S. Greene
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA
| |
Collapse
|
43
|
Abdolhosseini F, Azarkhalili B, Maazallahi A, Kamal A, Motahari SA, Sharifi-Zarchi A, Chitsaz H. Cell Identity Codes: Understanding Cell Identity from Gene Expression Profiles using Deep Neural Networks. Sci Rep 2019; 9:2342. [PMID: 30787315 PMCID: PMC6382891 DOI: 10.1038/s41598-019-38798-y] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/05/2017] [Accepted: 01/10/2019] [Indexed: 01/11/2023] Open
Abstract
Understanding cell identity is an important task in many biomedical areas. Expression patterns of specific marker genes have been used to characterize some limited cell types, but exclusive markers are not available for many cell types. A second approach is to use machine learning to discriminate cell types based on the whole gene expression profiles (GEPs). The accuracies of simple classification algorithms such as linear discriminators or support vector machines are limited due to the complexity of biological systems. We used deep neural networks to analyze 1040 GEPs from 16 different human tissues and cell types. After comparing different architectures, we identified a specific structure of deep autoencoders that can encode a GEP into a vector of 30 numeric values, which we call the cell identity code (CIC). The original GEP can be reproduced from the CIC with an accuracy comparable to technical replicates of the same experiment. Although we use an unsupervised approach to train the autoencoder, we show different values of the CIC are connected to different biological aspects of the cell, such as different pathways or biological processes. This network can use CIC to reproduce the GEP of the cell types it has never seen during the training. It also can resist some noise in the measurement of the GEP. Furthermore, we introduce classifier autoencoder, an architecture that can accurately identify cell type based on the GEP or the CIC.
Collapse
Affiliation(s)
- Farzad Abdolhosseini
- Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
| | | | - Abbas Maazallahi
- Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
| | - Aryan Kamal
- Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
| | | | - Ali Sharifi-Zarchi
- Department of Computer Engineering, Sharif University of Technology, Tehran, Iran.
| | - Hamidreza Chitsaz
- Department of Computer Science, Colorado State University, Fort Collins, CO, USA.
| |
Collapse
|
44
|
Zou J, Huss M, Abid A, Mohammadi P, Torkamani A, Telenti A. A primer on deep learning in genomics. Nat Genet 2019; 51:12-18. [PMID: 30478442 PMCID: PMC11180539 DOI: 10.1038/s41588-018-0295-5] [Citation(s) in RCA: 377] [Impact Index Per Article: 75.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/14/2018] [Accepted: 09/26/2018] [Indexed: 12/13/2022]
Abstract
Deep learning methods are a class of machine learning techniques capable of identifying highly complex patterns in large datasets. Here, we provide a perspective and primer on deep learning applications for genome analysis. We discuss successful applications in the fields of regulatory genomics, variant calling and pathogenicity scores. We include general guidance for how to effectively use deep learning methods as well as a practical guide to tools and resources. This primer is accompanied by an interactive online tutorial.
Collapse
Affiliation(s)
- James Zou
- Department of Biomedical Data Science, Stanford University, Palo Alto, CA, USA.
- Chan-Zuckerberg Biohub, San Francisco, CA, USA.
- Department of Electrical Engineering, Stanford University, Palo Alto, CA, USA.
| | - Mikael Huss
- Peltarion, Stockholm, Sweden
- Department of Learning, Informatics, Management and Ethics, Karolinska Institutet, Stockholm, Sweden
| | - Abubakar Abid
- Department of Electrical Engineering, Stanford University, Palo Alto, CA, USA
| | - Pejman Mohammadi
- Scripps Research Translational Institute, La Jolla, CA, USA
- Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, USA
| | - Ali Torkamani
- Scripps Research Translational Institute, La Jolla, CA, USA
- Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, USA
| | - Amalio Telenti
- Scripps Research Translational Institute, La Jolla, CA, USA.
- Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, USA.
| |
Collapse
|
45
|
Abstract
BACKGROUND Micro-RNAs (miRNAs) play a significant role in regulating gene expression under physiological and pathological conditions such as cancers. However, it remains a challenging problem to discover the target messenger RNAs (mRNAs) of a miRNA in a data driven fashion. On one hand, sequence-based methods for predicting miRNA targets tend to make too many false positive calls. On the other hand, analyzing expression correlation between miRNAs and mRNAs cannot establish whether relationship between a pair of correlated miRNA and mRNA is causal. METHODS In this study, we designed a deep learning model, referred to as miRNA causal deep net (mCADET), which aims to explicitly represent two types of statistical relationships between miRNAs and mRNAs: correlation resulting from confounded co-regulation and correlation as a result of causal regulation. The model utilizes a deep neural network to simulate transcription mechanism that leads to co-expression of miRNA and mRNA, and, in addition, it also contains directed edges from miRNAs to mRNAs to capture causal relationships among them. RESULTS We trained the mCADET model using pan-cancer miRNA and mRNA data from The Cancer Genome Atlas (TCGA) project to investigate mechanism of co-expression and causal interactions between miRNAs and mRNAs. Quantitative analyses of the results indicate that the mCADET significantly outperforms conventional deep learning models when modeling combined miRNA and mRNA expression data, indicating its superior capability of capturing the high-order statistical structures in the data. Qualitative analysis of predicted targets of miRNAs indicate that predictions by mCADET agree well with existing knowledge. Finally, the predictions by mCADET have a significantly lower false discovery rate and better overall accuracy in comparison to sequence-based and correlation-based methods when comparing to experimental results. CONCLUSION The mCADET model can simultaneously infer the states of cellular signaling system regulating co-expression of miRNAs and mRNAs, while capturing their causal relationships in a data-driven fashion.
Collapse
Affiliation(s)
- Lujia Chen
- Department of Biomedical Informatics, School of Medicine, University of Pittsburgh, 5607 Baum Blvd, Pittsburgh, PA, USA.
| | - Xinghua Lu
- Department of Biomedical Informatics, School of Medicine, University of Pittsburgh, 5607 Baum Blvd, Pittsburgh, PA, USA.,Center for Causal Discovery, University of Pittsburgh, 5607 Baum Blvd, Pittsburgh, PA, USA.,Department of Pharmaceutical Sciences, School of Pharmacy, University of Pittsburgh, 5607 Baum Blvd, Pittsburgh, PA, USA
| |
Collapse
|
46
|
Ching T, Himmelstein DS, Beaulieu-Jones BK, Kalinin AA, Do BT, Way GP, Ferrero E, Agapow PM, Zietz M, Hoffman MM, Xie W, Rosen GL, Lengerich BJ, Israeli J, Lanchantin J, Woloszynek S, Carpenter AE, Shrikumar A, Xu J, Cofer EM, Lavender CA, Turaga SC, Alexandari AM, Lu Z, Harris DJ, DeCaprio D, Qi Y, Kundaje A, Peng Y, Wiley LK, Segler MHS, Boca SM, Swamidass SJ, Huang A, Gitter A, Greene CS. Opportunities and obstacles for deep learning in biology and medicine. J R Soc Interface 2018; 15:20170387. [PMID: 29618526 PMCID: PMC5938574 DOI: 10.1098/rsif.2017.0387] [Citation(s) in RCA: 806] [Impact Index Per Article: 134.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/26/2017] [Accepted: 03/07/2018] [Indexed: 11/12/2022] Open
Abstract
Deep learning describes a class of machine learning algorithms that are capable of combining raw inputs into layers of intermediate features. These algorithms have recently shown impressive results across a variety of domains. Biology and medicine are data-rich disciplines, but the data are complex and often ill-understood. Hence, deep learning techniques may be particularly well suited to solve problems of these fields. We examine applications of deep learning to a variety of biomedical problems-patient classification, fundamental biological processes and treatment of patients-and discuss whether deep learning will be able to transform these tasks or if the biomedical sphere poses unique challenges. Following from an extensive literature review, we find that deep learning has yet to revolutionize biomedicine or definitively resolve any of the most pressing challenges in the field, but promising advances have been made on the prior state of the art. Even though improvements over previous baselines have been modest in general, the recent progress indicates that deep learning methods will provide valuable means for speeding up or aiding human investigation. Though progress has been made linking a specific neural network's prediction to input features, understanding how users should interpret these models to make testable hypotheses about the system under study remains an open challenge. Furthermore, the limited amount of labelled data for training presents problems in some domains, as do legal and privacy constraints on work with sensitive health records. Nonetheless, we foresee deep learning enabling changes at both bench and bedside with the potential to transform several areas of biology and medicine.
Collapse
Affiliation(s)
- Travers Ching
- Molecular Biosciences and Bioengineering Graduate Program, University of Hawaii at Manoa, Honolulu, HI, USA
| | - Daniel S Himmelstein
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| | - Brett K Beaulieu-Jones
- Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| | - Alexandr A Kalinin
- Department of Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, MI, USA
| | | | - Gregory P Way
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| | - Enrico Ferrero
- Computational Biology and Stats, Target Sciences, GlaxoSmithKline, Stevenage, UK
| | | | - Michael Zietz
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| | - Michael M Hoffman
- Princess Margaret Cancer Centre, Toronto, Ontario, Canada
- Department of Medical Biophysics, University of Toronto, Toronto, Ontario, Canada
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
| | - Wei Xie
- Electrical Engineering and Computer Science, Vanderbilt University, Nashville, TN, USA
| | - Gail L Rosen
- Ecological and Evolutionary Signal-processing and Informatics Laboratory, Department of Electrical and Computer Engineering, Drexel University, Philadelphia, PA, USA
| | - Benjamin J Lengerich
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
| | - Johnny Israeli
- Biophysics Program, Stanford University, Stanford, CA, USA
| | - Jack Lanchantin
- Department of Computer Science, University of Virginia, Charlottesville, VA, USA
| | - Stephen Woloszynek
- Ecological and Evolutionary Signal-processing and Informatics Laboratory, Department of Electrical and Computer Engineering, Drexel University, Philadelphia, PA, USA
| | - Anne E Carpenter
- Imaging Platform, Broad Institute of Harvard and MIT, Cambridge, MA, USA
| | - Avanti Shrikumar
- Department of Computer Science, Stanford University, Stanford, CA, USA
| | - Jinbo Xu
- Toyota Technological Institute at Chicago, Chicago, IL, USA
| | - Evan M Cofer
- Department of Computer Science, Trinity University, San Antonio, TX, USA
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA
| | - Christopher A Lavender
- Integrative Bioinformatics, National Institute of Environmental Health Sciences, National Institutes of Health, Research Triangle Park, NC, USA
| | - Srinivas C Turaga
- Howard Hughes Medical Institute, Janelia Research Campus, Ashburn, VA, USA
| | - Amr M Alexandari
- Department of Computer Science, Stanford University, Stanford, CA, USA
| | - Zhiyong Lu
- National Center for Biotechnology Information and National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - David J Harris
- Department of Wildlife Ecology and Conservation, University of Florida, Gainesville, FL, USA
| | | | - Yanjun Qi
- Department of Computer Science, University of Virginia, Charlottesville, VA, USA
| | - Anshul Kundaje
- Department of Computer Science, Stanford University, Stanford, CA, USA
- Department of Genetics, Stanford University, Stanford, CA, USA
| | - Yifan Peng
- National Center for Biotechnology Information and National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Laura K Wiley
- Division of Biomedical Informatics and Personalized Medicine, University of Colorado School of Medicine, Aurora, CO, USA
| | - Marwin H S Segler
- Institute of Organic Chemistry, Westfälische Wilhelms-Universität Münster, Münster, Germany
| | - Simina M Boca
- Innovation Center for Biomedical Informatics, Georgetown University Medical Center, Washington, DC, USA
| | - S Joshua Swamidass
- Department of Pathology and Immunology, Washington University in Saint Louis, St Louis, MO, USA
| | - Austin Huang
- Department of Medicine, Brown University, Providence, RI, USA
| | - Anthony Gitter
- Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, USA
- Morgridge Institute for Research, Madison, WI, USA
| | - Casey S Greene
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| |
Collapse
|
47
|
Chaudhary K, Poirion OB, Lu L, Garmire LX. Deep Learning-Based Multi-Omics Integration Robustly Predicts Survival in Liver Cancer. Clin Cancer Res 2018; 24:1248-1259. [PMID: 28982688 PMCID: PMC6050171 DOI: 10.1158/1078-0432.ccr-17-0853] [Citation(s) in RCA: 501] [Impact Index Per Article: 83.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/23/2017] [Revised: 06/18/2017] [Accepted: 10/02/2017] [Indexed: 02/07/2023]
Abstract
Identifying robust survival subgroups of hepatocellular carcinoma (HCC) will significantly improve patient care. Currently, endeavor of integrating multi-omics data to explicitly predict HCC survival from multiple patient cohorts is lacking. To fill this gap, we present a deep learning (DL)-based model on HCC that robustly differentiates survival subpopulations of patients in six cohorts. We built the DL-based, survival-sensitive model on 360 HCC patients' data using RNA sequencing (RNA-Seq), miRNA sequencing (miRNA-Seq), and methylation data from The Cancer Genome Atlas (TCGA), which predicts prognosis as good as an alternative model where genomics and clinical data are both considered. This DL-based model provides two optimal subgroups of patients with significant survival differences (P = 7.13e-6) and good model fitness [concordance index (C-index) = 0.68]. More aggressive subtype is associated with frequent TP53 inactivation mutations, higher expression of stemness markers (KRT19 and EPCAM) and tumor marker BIRC5, and activated Wnt and Akt signaling pathways. We validated this multi-omics model on five external datasets of various omics types: LIRI-JP cohort (n = 230, C-index = 0.75), NCI cohort (n = 221, C-index = 0.67), Chinese cohort (n = 166, C-index = 0.69), E-TABM-36 cohort (n = 40, C-index = 0.77), and Hawaiian cohort (n = 27, C-index = 0.82). This is the first study to employ DL to identify multi-omics features linked to the differential survival of patients with HCC. Given its robustness over multiple cohorts, we expect this workflow to be useful at predicting HCC prognosis prediction. Clin Cancer Res; 24(6); 1248-59. ©2017 AACR.
Collapse
Affiliation(s)
| | - Olivier B Poirion
- Epidemiology Program, University of Hawaii Cancer Center, Honolulu, Hawaii
| | - Liangqun Lu
- Epidemiology Program, University of Hawaii Cancer Center, Honolulu, Hawaii
- Molecular Biosciences and Bioengineering Graduate Program, University of Hawaii at Manoa, Honolulu, Hawaii
| | - Lana X Garmire
- Epidemiology Program, University of Hawaii Cancer Center, Honolulu, Hawaii.
- Molecular Biosciences and Bioengineering Graduate Program, University of Hawaii at Manoa, Honolulu, Hawaii
| |
Collapse
|
48
|
Pardamean B, Cenggoro TW, Rahutomo R, Budiarto A, Karuppiah EK. Transfer Learning from Chest X-Ray Pre-trained Convolutional Neural Network for Learning Mammogram Data. ACTA ACUST UNITED AC 2018. [DOI: 10.1016/j.procs.2018.08.190] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/14/2022]
|
49
|
Ding MQ, Chen L, Cooper GF, Young JD, Lu X. Precision Oncology beyond Targeted Therapy: Combining Omics Data with Machine Learning Matches the Majority of Cancer Cells to Effective Therapeutics. Mol Cancer Res 2017; 16:269-278. [PMID: 29133589 DOI: 10.1158/1541-7786.mcr-17-0378] [Citation(s) in RCA: 89] [Impact Index Per Article: 12.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/14/2017] [Revised: 10/02/2017] [Accepted: 11/02/2017] [Indexed: 02/06/2023]
Abstract
Precision oncology involves identifying drugs that will effectively treat a tumor and then prescribing an optimal clinical treatment regimen. However, most first-line chemotherapy drugs do not have biomarkers to guide their application. For molecularly targeted drugs, using the genomic status of a drug target as a therapeutic indicator has limitations. In this study, machine learning methods (e.g., deep learning) were used to identify informative features from genome-scale omics data and to train classifiers for predicting the effectiveness of drugs in cancer cell lines. The methodology introduced here can accurately predict the efficacy of drugs, regardless of whether they are molecularly targeted or nonspecific chemotherapy drugs. This approach, on a per-drug basis, can identify sensitive cancer cells with an average sensitivity of 0.82 and specificity of 0.82; on a per-cell line basis, it can identify effective drugs with an average sensitivity of 0.80 and specificity of 0.82. This report describes a data-driven precision medicine approach that is not only generalizable but also optimizes therapeutic efficacy. The framework detailed herein, when successfully translated to clinical environments, could significantly broaden the scope of precision oncology beyond targeted therapies, benefiting an expanded proportion of cancer patients. Mol Cancer Res; 16(2); 269-78. ©2017 AACR.
Collapse
Affiliation(s)
- Michael Q Ding
- Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania
| | - Lujia Chen
- Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania
| | - Gregory F Cooper
- Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania
| | - Jonathan D Young
- Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania
| | - Xinghua Lu
- Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh, Pennsylvania. .,Center for Translational Bioinformatics, University of Pittsburgh, Pittsburgh, Pennsylvania
| |
Collapse
|
50
|
Fu L, Peng Q. A deep ensemble model to predict miRNA-disease association. Sci Rep 2017; 7:14482. [PMID: 29101378 PMCID: PMC5670180 DOI: 10.1038/s41598-017-15235-6] [Citation(s) in RCA: 50] [Impact Index Per Article: 7.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/14/2017] [Accepted: 10/23/2017] [Indexed: 02/08/2023] Open
Abstract
Cumulative evidence from biological experiments has confirmed that microRNAs (miRNAs) are related to many types of human diseases through different biological processes. It is anticipated that precise miRNA-disease association prediction could not only help infer potential disease-related miRNA but also boost human diagnosis and disease prevention. Considering the limitations of previous computational models, a more effective computational model needs to be implemented to predict miRNA-disease associations. In this work, we first constructed a human miRNA-miRNA similarity network utilizing miRNA-miRNA functional similarity data and heterogeneous miRNA Gaussian interaction profile kernel similarities based on the assumption that similar miRNAs with similar functions tend to be associated with similar diseases, and vice versa. Then, we constructed disease-disease similarity using disease semantic information and heterogeneous disease-related interaction data. We proposed a deep ensemble model called DeepMDA that extracts high-level features from similarity information using stacked autoencoders and then predicts miRNA-disease associations by adopting a 3-layer neural network. In addition to five-fold cross-validation, we also proposed another cross-validation method to evaluate the performance of the model. The results show that the proposed model is superior to previous methods with high robustness.
Collapse
Affiliation(s)
- Laiyi Fu
- Systems Engineering Institute, School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, Shannxi, 710049, China
| | - Qinke Peng
- Systems Engineering Institute, School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, Shannxi, 710049, China.
| |
Collapse
|