1
Ding Y, Jiang X, Wu J, Wang Y, Zhao L, Pan Y, Xi Y, Zhao G, Li Z, Zhang L. Synergistic horizontal transfer of antibiotic resistance genes and transposons in the infant gut microbial genome. mSphere 2024; 9:e0060823. [PMID: 38112433 PMCID: PMC10826358 DOI: 10.1128/msphere.00608-23] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/21/2023] [Accepted: 11/07/2023] [Indexed: 12/21/2023] Open
Abstract
Transposons, plasmids, bacteriophages, and other mobile genetic elements facilitate horizontal gene transfer in the gut microbiota, allowing some pathogenic bacteria to acquire antibiotic resistance genes (ARGs). Currently, the relationship between specific ARGs and specific transposons in the comprehensive infant gut microbiome has not been elucidated. In this study, ARGs and transposons were annotated from the Unified Human Gastrointestinal Genome (UHGG) and the Early-Life Gut Genomes (ELGG). Association rule mining was used to explore the association between specific ARGs and specific transposons in UHGG, and the robustness of the association rules was validated using ELGG as an external database. Our results suggested that ARGs and transposons were more likely to be associated in infant gut microbiota than in adult gut microbiota, and nine robust association rules were identified, among which Klebsiella pneumoniae, Enterobacter hormaechei_A, and Escherichia coli_D played important roles in this association phenomenon. The emphasis of this study is to investigate the synergistic transfer of specific ARGs and specific transposons in the infant gut microbiota, which can contribute to the study of microbial pathogenesis and the ARG dissemination dynamics.
IMPORTANCE: The transfer of transposons carrying antibiotic resistance genes (ARGs) among microorganisms accelerates antibiotic resistance dissemination among infant gut microbiota. Nonetheless, the relationship between specific ARGs and specific transposons within the infant gut microbiota remains unclear. K. pneumoniae, E. hormaechei_A, and E. coli_D were identified as key players in the nine robust association rules we discovered. Meanwhile, we found that infant gut microorganisms were more susceptible to horizontal gene transfer events involving specific ARGs and specific transposons than adult gut microorganisms. These discoveries could enhance the understanding of microbial pathogenesis and the ARG dissemination dynamics within the infant gut microbiota.
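The abstract does not say how its association rules were scored, so the following is only a minimal Python sketch of the standard association rule metrics (support, confidence, lift) for a rule of the form "ARG -> transposon". The genome annotations and the labels `ARG_1`, `ARG_2`, `Tn_A`, and `Tn_B` are hypothetical placeholders, not data from the study.

```python
def rule_metrics(genomes, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent.

    Each genome is modelled as a set of annotated element labels.
    """
    n = len(genomes)
    n_a = sum(1 for g in genomes if antecedent in g)
    n_c = sum(1 for g in genomes if consequent in g)
    n_both = sum(1 for g in genomes if antecedent in g and consequent in g)

    support = n_both / n                            # P(A and C)
    confidence = n_both / n_a if n_a else 0.0       # P(C | A)
    lift = confidence / (n_c / n) if n_c else 0.0   # P(C | A) / P(C)
    return support, confidence, lift


# Hypothetical toy annotations: one set of labels per genome.
genomes = [
    {"ARG_1", "Tn_A"},
    {"ARG_1", "Tn_A"},
    {"ARG_1", "Tn_A", "ARG_2"},
    {"ARG_2"},
    {"Tn_B"},
]

support, confidence, lift = rule_metrics(genomes, "ARG_1", "Tn_A")
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 means the two elements co-occur more often than independence would predict, which is the kind of signal a rule-mining step would typically filter on before validating rules on a second database.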
Affiliation(s)
- Yanwen Ding
  - Microbiome-X, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, China
- Xin Jiang
  - Microbiome-X, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, China
- Jiacheng Wu
  - Microbiome-X, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, China
- Yihui Wang
  - Microbiome-X, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, China
- Lanlan Zhao
  - Microbiome-X, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, China
- Yingmiao Pan
  - Microbiome-X, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, China
- Yaxuan Xi
  - Microbiome-X, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, China
- Guoping Zhao
  - Microbiome-X, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, China
  - Shandong University, State Key Laboratory of Microbial Technology, Qingdao, China
  - University of Chinese Academy of Sciences, Chinese Academy of Sciences, CAS Key Laboratory of Computational Biology, Bio-Med Big Data Center, Shanghai Institute of Nutrition and Health, China National Institute of Health, Shanghai, China
- Ziyun Li
  - Microbiome-X, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, China
- Lei Zhang
  - Microbiome-X, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, China
  - Shandong University, State Key Laboratory of Microbial Technology, Qingdao, China
2
Mallick K, Chakraborty S, Mallik S, Bandyopadhyay S. A scalable unsupervised learning of scRNAseq data detects rare cells through integration of structure-preserving embedding, clustering and outlier detection. Brief Bioinform 2023; 24:bbad125. [PMID: 37185897 DOI: 10.1093/bib/bbad125] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/31/2022] [Revised: 02/06/2023] [Accepted: 02/24/2023] [Indexed: 05/17/2023] Open
Abstract
Single-cell RNA-seq analysis has become a powerful tool to analyse the transcriptomes of individual cells. In turn, it has made it possible to screen thousands of single cells in parallel. Thus, contrary to the traditional bulk measurements that only paint a macroscopic picture, gene measurements at the cell level aid researchers in studying different tissues and organs at various stages. However, accurate clustering methods for such high-dimensional data remain scarce and a persistent challenge in this domain. Of late, several methods and techniques have been proposed to address this issue. In this article, we propose a novel framework for clustering large-scale single-cell data and subsequently identifying the rare-cell sub-populations. To handle such sparse, high-dimensional data, we leverage PaCMAP (Pairwise Controlled Manifold Approximation), a feature extraction algorithm that preserves both the local and the global structures of the data, and a Gaussian Mixture Model to cluster single-cell data. Subsequently, we exploit Edited Nearest Neighbours sampling and Isolation Forest/One-class Support Vector Machine to identify rare-cell sub-populations. The performance of the proposed method is validated using publicly available datasets with varying degrees of cell types and rare-cell sub-populations. On several benchmark datasets, the proposed method outperforms the existing state-of-the-art methods. The proposed method successfully identifies cell types that constitute populations ranging from 0.1 to 8% with F1-scores of 0.91 ± 0.09. The source code is available at https://github.com/scrab017/RarPG.
Affiliation(s)
- Koushik Mallick
  - Computer Science and Engineering, RCC Institute of Information Technology, Canal South Road, 700015, West Bengal, India
- Sikim Chakraborty
  - Centre for Economy and Growth, Observer Research Foundation, Rouse Avenue, New Delhi, 110002, Delhi, India
- Saurav Mallik
  - Department of Environmental Health, Harvard T H Chan School of Public Health, 677 Huntington Ave, 02115, MA, USA
- Sanghamitra Bandyopadhyay
  - Machine Intelligence Unit, Indian Statistical Institute, Barrackpore Trunk Rd., 700108, West Bengal, India
3
Patterson A, Elbasir A, Tian B, Auslander N. Computational Methods Summarizing Mutational Patterns in Cancer: Promise and Limitations for Clinical Applications. Cancers (Basel) 2023; 15:cancers15071958. [PMID: 37046619 PMCID: PMC10093138 DOI: 10.3390/cancers15071958] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2022] [Revised: 02/24/2023] [Accepted: 03/09/2023] [Indexed: 03/29/2023] Open
Abstract
Since the rise of next-generation sequencing technologies, the catalogue of mutations in cancer has been continuously expanding. To address the complexity of the cancer-genomic landscape and extract meaningful insights, numerous computational approaches have been developed over the last two decades. In this review, we survey the current leading computational methods to derive intricate mutational patterns in the context of clinical relevance. We begin with mutation signatures, explaining first how mutation signatures were developed and then examining the utility of studies using mutation signatures to correlate environmental effects on the cancer genome. Next, we examine current clinical research that employs mutation signatures and discuss the potential use cases and challenges of mutation signatures in clinical decision-making. We then examine computational studies developing tools to investigate complex patterns of mutations beyond the context of mutational signatures. We survey methods to identify cancer-driver genes, from single-driver studies to pathway and network analyses. In addition, we review methods inferring complex combinations of mutations for clinical tasks and using mutations integrated with multi-omics data to better predict cancer phenotypes. We examine the use of these tools for either discovery or prediction, including prediction of tumor origin, treatment outcomes, prognosis, and cancer typing. We further discuss the main limitations preventing widespread clinical integration of computational tools for the diagnosis and treatment of cancer. We end by proposing solutions to address these challenges using recent advances in machine learning.
Affiliation(s)
- Andrew Patterson
  - Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
  - The Wistar Institute, Philadelphia, PA 19104, USA
- Bin Tian
  - The Wistar Institute, Philadelphia, PA 19104, USA
- Noam Auslander
  - The Wistar Institute, Philadelphia, PA 19104, USA
  - Department of Cancer Biology, University of Pennsylvania, Philadelphia, PA 19104, USA
  - Correspondence:
4
Li A, Xiong S, Li J, Mallik S, Liu Y, Fei R, Zhou H, Liu G. AngClust: Angle Feature-Based Clustering for Short Time Series Gene Expression Profiles. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:1574-1580. [PMID: 35853049 DOI: 10.1109/tcbb.2022.3192306] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
When clustering gene expression, it is expected that correlation coefficients of genes in the same clusters are high, and that gene ontology (GO) enrichment analysis of most clusters will be significant. However, existing short-term gene expression clustering algorithms have limitations. To address this problem, we proposed a novel clustering process based on angular features for short-term gene expression. Our method (named AngClust) uses angular features to indicate the change of trend in gene expression levels at two neighboring time points. The changes of angles at multiple time points reflect the change of trend of the overall expression levels. Such changes are used to measure whether the expression trends of different genes are similar. To obtain functionally significant clusters from the clustering results, we evaluated numbers of genes in clusters, average correlation coefficient, fluctuation, and their correlation with GO term enrichment. AngClust outperforms two other measures, Euclidean distance (ED) and dynamic time warping of correlation (DTW), on a dataset of yeast gene expression. The ratios of GO and pathway term-enriched clusters of AngClust are higher than or equal to those of STEM and TMixClust on human, mouse, and yeast time series of gene expression.
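The abstract does not give the formula for AngClust's angular features, so the Python sketch below is only one plausible reading of "angles between neighboring time points": the angle of each segment of the expression profile, assuming evenly spaced time points. The function names, the `dt` spacing parameter, and the toy profiles are assumptions made for illustration.

```python
import math


def angle_features(series, dt=1.0):
    """Angle (radians) of each segment between neighboring time points.

    Assumes evenly spaced time points separated by dt (an assumption,
    not something stated in the abstract).
    """
    return [math.atan2(b - a, dt) for a, b in zip(series, series[1:])]


def angle_distance(s1, s2, dt=1.0):
    """Mean absolute difference between the angle features of two profiles."""
    a1, a2 = angle_features(s1, dt), angle_features(s2, dt)
    return sum(abs(x - y) for x, y in zip(a1, a2)) / len(a1)


# Hypothetical expression profiles over four time points.
gene_a = [0.0, 1.0, 2.0, 1.0]
gene_b = [5.0, 6.0, 7.0, 6.0]  # same trend as gene_a, shifted upward
gene_c = [2.0, 1.0, 0.0, 1.0]  # opposite trend to gene_a

print(angle_distance(gene_a, gene_b))
print(angle_distance(gene_a, gene_c))
```

Because the angles depend only on differences between consecutive points, two genes with the same trend but different baseline levels come out as identical, which a plain Euclidean distance on the raw values would not do.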
5
Mallik S, Sarkar A, Nath S, Maulik U, Das S, Pati SK, Ghosh S, Zhao Z. 3PNMF-MKL: A non-negative matrix factorization-based multiple kernel learning method for multi-modal data integration and its application to gene signature detection. Front Genet 2023; 14:1095330. [PMID: 36865387 PMCID: PMC9971618 DOI: 10.3389/fgene.2023.1095330] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/11/2022] [Accepted: 01/30/2023] [Indexed: 02/16/2023] Open
Abstract
In the current era, handling biomedical big data is a challenging task. In particular, the integration of multi-modal data, followed by significant feature mining (gene signature detection), becomes a daunting task. With this in mind, we proposed a novel framework, namely, three-factor penalized, non-negative matrix factorization-based multiple kernel learning with soft margin hinge loss (3PNMF-MKL) for multi-modal data integration, followed by gene signature detection. In brief, limma, employing the empirical Bayes statistics, was initially applied to each individual molecular profile, and the statistically significant features were extracted, which was followed by the three-factor penalized non-negative matrix factorization method used for data/matrix fusion using the reduced feature sets. Multiple kernel learning models with soft margin hinge loss were deployed to estimate average accuracy scores and the area under the curve (AUC). Gene modules were identified by the consecutive analysis of average linkage clustering and dynamic tree cut. The best module containing the highest correlation was considered the potential gene signature. We utilized an acute myeloid leukemia cancer dataset from The Cancer Genome Atlas (TCGA) repository containing five molecular profiles. Our algorithm generated a 50-gene signature that achieved a high classification AUC score (viz., 0.827). We explored the functions of signature genes using pathway and Gene Ontology (GO) databases. Our method outperformed the state-of-the-art methods in terms of computing AUC. Furthermore, we included some comparative studies with other related methods to enhance the acceptability of our method. Finally, we note that our algorithm can be applied to any multi-modal dataset for data integration, followed by gene module discovery.
Affiliation(s)
- Saurav Mallik
  - Department of Environmental Health, Harvard T H Chan School of Public Health, Boston, MA, United States
  - *Correspondence: Saurav Mallik; Zhongming Zhao
- Anasua Sarkar
  - Department of Computer Science & Engineering, Jadavpur University, Kolkata, India
- Sagnik Nath
  - Department of Computer Science & Engineering, Jadavpur University, Kolkata, India
- Ujjwal Maulik
  - Department of Computer Science & Engineering, Jadavpur University, Kolkata, India
- Supantha Das
  - Department of Information Technology, Academy of Technology, Hooghly, West Bengal, India
- Soumen Kumar Pati
  - Department of Bioinformatics, Maulana Abul Kalam Azad University, Kolkata, West Bengal, India
- Soumadip Ghosh
  - Department of Computer Science & Engineering, Sister Nivedita University, New Town, West Bengal, India
- Zhongming Zhao
  - Human Genetics Center, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, United States
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, United States
  - *Correspondence: Saurav Mallik; Zhongming Zhao
6
Pandey D, Onkara PP. Improved downstream functional analysis of single-cell RNA-sequence data using DGAN. Sci Rep 2023; 13:1618. [PMID: 36709340 PMCID: PMC9884242 DOI: 10.1038/s41598-023-28952-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/10/2022] [Accepted: 01/27/2023] [Indexed: 01/29/2023] Open
Abstract
The dramatic increase in the number of single-cell RNA-sequence (scRNA-seq) investigations reflects the new capabilities of next generation sequencing technologies, which facilitate the accurate measurement of tens of thousands of RNA expression levels at cellular resolution. Nevertheless, missing values from RNA amplification persist and remain a significant computational challenge, as these data omissions induce further noise in their respective cellular data and ultimately impede downstream functional analysis of scRNA-seq data. Consequently, it becomes imperative to develop robust and efficient scRNA-seq data imputation methods for improved downstream functional analysis outcomes. To overcome this difficulty, we have designed an imputation framework, namely the deep generative autoencoder network (DGAN). In essence, DGAN is an evolved variational autoencoder designed to robustly impute data dropouts in scRNA-seq data manifested as a sparse gene expression matrix. DGAN principally accounts for the count distribution, as well as data sparsity, using a Gaussian model whereby cell dependencies are exploited to detect and exclude outlier cells via imputation. When tested on five publicly available scRNA-seq datasets, DGAN outperformed every baseline method compared, with respect to downstream functional analysis including cell data visualization, clustering, classification and differential expression analysis. DGAN is implemented in Python and is accessible at https://github.com/dikshap11/DGAN.
Affiliation(s)
- Diksha Pandey
  - Department of Biotechnology, National Institute of Technology, Warangal, India
- Perumal P Onkara
  - Department of Biotechnology, National Institute of Technology, Warangal, India
7
Wei Y, Li L, Zhao X, Yang H, Sa J, Cao H, Cui Y. Cancer subtyping with heterogeneous multi-omics data via hierarchical multi-kernel learning. Brief Bioinform 2023; 24:6847203. [PMID: 36433785 DOI: 10.1093/bib/bbac488] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/28/2022] [Revised: 09/14/2022] [Accepted: 10/15/2022] [Indexed: 11/27/2022] Open
Abstract
Differentiating cancer subtypes is crucial to guide personalized treatment and improve the prognosis for patients. Integrating multi-omics data can offer a comprehensive landscape of cancer biological processes and provide promising ways for cancer diagnosis and treatment. Taking the heterogeneity of different omics data types into account, we propose a hierarchical multi-kernel learning (hMKL) approach, a novel cancer molecular subtyping method to identify cancer subtypes by adopting a two-stage kernel learning strategy. In stage 1, we obtain a composite kernel, borrowing the cancer integration via multi-kernel learning (CIMLR) idea, by optimizing the kernel parameters for each individual omics data type. In stage 2, we obtain a final fused kernel through a weighted linear combination of individual kernels learned from stage 1 using an unsupervised multiple kernel learning method. Based on the final fused kernel, k-means clustering is applied to identify cancer subtypes. Simulation studies show that hMKL outperforms the one-stage CIMLR method when there is data heterogeneity. hMKL can estimate the number of clusters correctly, which is the key challenge in subtyping. Application to two real data sets shows that hMKL identified meaningful subtypes and key cancer-associated biomarkers. The proposed method provides a novel toolkit for heterogeneous multi-omics data integration and cancer subtype identification.
Affiliation(s)
- Yifang Wei
  - Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, Shanxi 030001, PR China
- Lingmei Li
  - Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, Shanxi 030001, PR China
- Xin Zhao
  - Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, Shanxi 030001, PR China
- Haitao Yang
  - Division of Health Statistics, School of Public Health, Hebei Medical University, Shijiazhuang, Hebei 050017, PR China
- Jian Sa
  - Department of Science and Technology, Shanxi Provincial Key Laboratory of Major Disease Risk Assessment, Shanxi Medical University, Taiyuan, Shanxi 030001, PR China
- Hongyan Cao
  - Division of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, Shanxi 030001, PR China
  - Department of Mathematics, Shanxi Medical University, Taiyuan, Shanxi 030001, PR China
- Yuehua Cui
  - Department of Statistics and Probability, Michigan State University, East Lansing, MI 48824, USA
8
Designing optimal convolutional neural network architecture using differential evolution algorithm. Patterns 2022; 3:100567. [PMID: 36124301 PMCID: PMC9481963 DOI: 10.1016/j.patter.2022.100567] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 04/10/2022] [Revised: 06/04/2022] [Accepted: 07/13/2022] [Indexed: 01/08/2023]
Abstract
Convolutional neural networks (CNNs) are deep learning models used widely for solving various tasks like computer vision and speech recognition. CNNs are developed manually based on problem-specific domain knowledge and tricky settings, which are laborious, time consuming, and challenging. To solve these, our study develops an improved differential evolution of convolutional neural network (IDECNN) algorithm to design CNN layer architectures for image classification. Variable-length encoding is utilized to represent the flexible layer architecture of a CNN model in IDECNN. An efficient heuristic mechanism is proposed in IDECNN to evolve CNN architecture through mutation and crossover to prevent premature convergence during the evolutionary process. Eight well-known imaging datasets were utilized. The results showed that IDECNN could design suitable architecture compared with 20 existing CNN models. Finally, CNN architectures are applied to pneumonia and coronavirus disease 2019 (COVID-19) X-ray biomedical image data. The results demonstrated the usefulness of the proposed approach to generate a suitable CNN model.
- Introduce DE algorithm to automatically design CNN architectures
- Variable-length encoding strategy is proposed to encode each CNN model
- For the DE framework, two CNN architectures undergo a refinement difference approach
- Design a heuristic mechanism for mutation operation to evolve CNN architectures
Convolutional neural networks (CNNs) are a class of deep learning (DL) methods that have demonstrated improved performance in various computer vision tasks. With the growing popularity of CNNs, several CNN architectures have been introduced with a large number of design options that are problem dependent. In most situations, the constructed CNN model performs well on the dataset used to train it. There is no guarantee that the designed CNN model can achieve sufficient classification accuracy for other datasets. Designing an appropriate CNN model architecture for a particular problem requires human interaction and trial-and-error procedures, which are laborious and time consuming. This study uses an improved differential evolution of convolutional neural network (IDECNN) technique to automatically construct effective CNN architectures for several image classification problems, which mitigates the issues found with manually designed CNN models.
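The entry above relies on differential evolution (DE) with mutation and crossover. As a rough illustration of those two operators only, the Python sketch below runs the classic fixed-length DE/rand/1/bin scheme on a toy numeric objective. It is not IDECNN: the variable-length CNN encoding, the refinement difference step, and the heuristic mutation described in the abstract are not reproduced, and the parameter values (`pop_size`, `f`, `cr`, `generations`) and the `sphere` objective are assumptions chosen for the example.

```python
import random


def differential_evolution(objective, bounds, pop_size=20, f=0.8, cr=0.9,
                           generations=200, seed=0):
    """Minimize objective over box bounds with classic DE/rand/1/bin."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [objective(ind) for ind in pop]

    for _ in range(generations):
        for i in range(pop_size):
            # Mutation: combine three distinct individuals other than i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(dim)  # guarantees at least one mutated gene
            trial = []
            for j in range(dim):
                if rng.random() < cr or j == j_rand:
                    # Binomial crossover takes the mutant value, clipped to bounds.
                    lo, hi = bounds[j]
                    value = pop[a][j] + f * (pop[b][j] - pop[c][j])
                    trial.append(min(max(value, lo), hi))
                else:
                    trial.append(pop[i][j])
            # Greedy selection: keep the trial only if it is no worse.
            trial_score = objective(trial)
            if trial_score <= scores[i]:
                pop[i], scores[i] = trial, trial_score

    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]


def sphere(x):
    """Toy objective with its minimum of 0 at the origin."""
    return sum(v * v for v in x)


bounds = [(-5.0, 5.0)] * 3
best_x, best_score = differential_evolution(sphere, bounds)
print(best_x, best_score)
```

In an architecture search, the objective would be something like validation error after training the decoded network, which makes each evaluation far more expensive than this toy function and is one reason population size and generation count are kept small in practice.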
9
Xi E, Bai J, Zhang K, Yu H, Guo Y. Genomic variants disrupt miRNA-mRNA regulation. Chem Biodivers 2022; 19:e202200623. [PMID: 35985010 DOI: 10.1002/cbdv.202200623] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/29/2022] [Accepted: 08/17/2022] [Indexed: 11/09/2022]
Abstract
Micro RNA (miRNA) and its regulatory effect on messenger RNA (mRNA) gene expression are a major focus in cancer research. Disruption in the normal miRNA-mRNA regulation network can result in serious cascading biological repercussions. In this study, we curated miRNA-related variants from major genomic consortiums and thoroughly evaluated how these variants could exert their effects by cross-validating with independent functional knowledge bases. Nearly all known variants (more than 664 million) categorized by type (germline, somatic, epigenetic) were mapped to the genomic regions involved in miRNA-mRNA binding (miRNA seeds and miRNA-mRNA 3'-UTR binding sequence). Subsets of miRNA-related variants supported by additional functional evidence, such as expression Quantitative Trait Loci (eQTL) and Genome-Wide Association Study (GWAS), were identified and scrutinized. Our results show that variants in miRNA seeds can substantially alter the composition of an miRNA's target mRNA set. Various functional analyses converged to reveal a post-transcriptional complex regulatory network where miRNA, eQTL, and RNA-binding protein intertwined to disseminate the impact of genomic variants. These results may potentially explain how certain variants affect disease/trait risks in genome wide association studies.
Affiliation(s)
- Ellie Xi
  - University of New Mexico - Albuquerque: The University of New Mexico, Internal Medicine, 100A Cancer Research Facility, 87131, Albuquerque, United States
- Judy Bai
  - University of New Mexico - Albuquerque: The University of New Mexico, Internal Medicine, 100A Cancer Research Facility, 87131, Albuquerque, United States
- Klaira Zhang
  - University of New Mexico - Albuquerque: The University of New Mexico, Internal Medicine, 100A Cancer Research Facility, 87131, Albuquerque, United States
- Hui Yu
  - University of New Mexico - Albuquerque: The University of New Mexico, Internal Medicine, 100A Cancer Research Facility, Albuquerque, United States
- Yan Guo
  - University of New Mexico, Cancer Research Facility 100A, 87131, Albuquerque, United States
10
Hu R, Zhou XJ, Li W. Computational Analysis of High-Dimensional DNA Methylation Data for Cancer Prognosis. J Comput Biol 2022; 29:769-781. [PMID: 35671506 PMCID: PMC9419965 DOI: 10.1089/cmb.2022.0002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022] Open
Abstract
Developing cancer prognostic models using multiomics data is a major goal of precision oncology. DNA methylation provides promising prognostic biomarkers, which have been used to predict survival and treatment response in solid tumor or plasma samples. This review article presents an overview of recently published computational analyses on DNA methylation for cancer prognosis. To address the challenges of survival analysis with high-dimensional methylation data, various feature selection methods have been applied to screen a subset of informative markers. Using candidate markers associated with survival, prognostic models either predict risk scores or stratify patients into subtypes. The model's discriminatory power can be assessed by multiple evaluation metrics. Finally, we discuss the limitations of existing studies and present the prospects of applying machine learning algorithms to fully exploit the prognostic value of DNA methylation.
Affiliation(s)
- Ran Hu
  - Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, University of California at Los Angeles, Los Angeles, California, USA
  - Bioinformatics Interdepartmental Graduate Program, University of California at Los Angeles, Los Angeles, California, USA
  - Institute for Quantitative & Computational Biosciences, University of California at Los Angeles, Los Angeles, California, USA
- Xianghong Jasmine Zhou
  - Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, University of California at Los Angeles, Los Angeles, California, USA
  - Institute for Quantitative & Computational Biosciences, University of California at Los Angeles, Los Angeles, California, USA
- Wenyuan Li
  - Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, University of California at Los Angeles, Los Angeles, California, USA
  - Institute for Quantitative & Computational Biosciences, University of California at Los Angeles, Los Angeles, California, USA
11
Dhar R, Mallik S, Devi A. Exosomal microRNAs (exoMIRs): micromolecules with macro impact in oral cancer. 3 Biotech 2022; 12:155. [DOI: 10.1007/s13205-022-03217-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/31/2021] [Accepted: 05/31/2022] [Indexed: 12/16/2022] Open
12
Unsupervised Learning for Feature Representation Using Spatial Distribution of Amino Acids in Aldehyde Dehydrogenase (ALDH2) Protein Sequences. Mathematics 2022. [DOI: 10.3390/math10132228] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 12/12/2022]
Abstract
Aldehyde dehydrogenase 2 (ALDH2) enzyme is required for alcohol detoxification. ALDH2 belongs to the aldehyde dehydrogenase family, the most important oxidative pathway of alcohol digestion. Two main liver isoforms of aldehyde dehydrogenase are cytosolic and mitochondrial. Approximately 50% of East Asians have ALDH2 deficiency (inactive mitochondrial isozyme), with lysine (K) for glutamate (E) substitution at position 487 (E487K). ALDH2 deficiency is also known as Alcohol Flushing Syndrome or Asian Glow. For people with an ALDH2 deficiency, their face turns red after drinking alcohol, and they are more susceptible to various diseases than ALDH2-normal people. This study performed a machine learning analysis of ALDH2 sequences of thirteen other species by comparing them with the human ALDH2 sequence. Based on the various quantitative metrics (physicochemical properties, secondary structure, Hurst exponent, Shannon entropy, and fractal dimension), these fourteen species were clustered into four clusters using the unsupervised machine learning (K-means clustering) algorithm. We also analyze these species using hierarchical clustering (agglomerative clustering) and draw the phylogenetic trees. The results show that Homo sapiens is more closely related to the Bos taurus and Sus scrofa species. Our experimental results suggest that the testing for discovering medicines may be done on these species before being tested in humans to alleviate the impacts of ALDH2 deficiency.
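Shannon entropy is one of the sequence metrics listed in the abstract above. The abstract does not say exactly how it was computed, so the Python sketch below assumes the simplest variant: the entropy (in bits) of the residue composition of a sequence. The example strings are toy inputs, not ALDH2 sequences.

```python
import math
from collections import Counter


def shannon_entropy(sequence):
    """Shannon entropy (bits) of the residue composition of a sequence."""
    counts = Counter(sequence)
    n = len(sequence)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Toy inputs: a single repeated residue versus four equally frequent residues.
print(shannon_entropy("AAAA"))
print(shannon_entropy("ACDE"))
```

A feature like this can be computed per species and stacked with the other metrics into one vector per sequence before clustering; for a protein alphabet of 20 residues the value is bounded above by log2(20), about 4.32 bits.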
13
Bhadra T, Mallik S, Hasan N, Zhao Z. Comparison of five supervised feature selection algorithms leading to top features and gene signatures from multi-omics data in cancer. BMC Bioinformatics 2022; 23:153. [PMID: 35484501 PMCID: PMC9052461 DOI: 10.1186/s12859-022-04678-y] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 04/11/2022] [Accepted: 04/11/2022] [Indexed: 11/24/2022] Open
Abstract
BACKGROUND As many complex omics data have been generated during the last two decades, dimensionality reduction has become a challenging issue in mining such data effectively. Omics data typically consist of many features. Accordingly, many feature selection algorithms have been developed. The performance of those feature selection methods often varies by specific data, making the discovery and interpretation of results challenging. METHODS AND RESULTS In this study, we performed a comprehensive comparative study of five widely used supervised feature selection methods (mRMR, INMIFS, DFS, SVM-RFE-CBR and VWMRmR) for multi-omics datasets. Specifically, we used five representative datasets: gene expression (Exp), exon expression (ExpExon), DNA methylation (hMethyl27), copy number variation (Gistic2), and pathway activity dataset (Paradigm IPLs) from a multi-omics study of acute myeloid leukemia (LAML) from The Cancer Genome Atlas (TCGA). The feature subsets selected by these five feature selection algorithms are assessed using three evaluation criteria: (1) classification accuracy (Acc), (2) representation entropy (RE) and (3) redundancy rate (RR). Four different classifiers, viz., C4.5, NaiveBayes, KNN, and AdaBoost, were used to measure the classification accuracy (Acc) for each selected feature subset. The VWMRmR algorithm obtains the best Acc for three datasets (ExpExon, hMethyl27 and Paradigm IPLs). The VWMRmR algorithm offers the best RR (obtained using normalized mutual information) for three datasets (Exp, Gistic2 and Paradigm IPLs), while it gives the best RR (obtained using Pearson correlation coefficient) for two datasets (Gistic2 and Paradigm IPLs). It also obtains the best RE for three datasets (Exp, Gistic2 and Paradigm IPLs). Overall, the VWMRmR algorithm yields the best performance on all three evaluation criteria for the majority of the datasets.
In addition, we identified signature genes using supervised learning collected from the overlapped top feature set among the five feature selection methods. We obtained a 7-gene signature (ZMIZ1, ENG, FGFR1, PAWR, KRT17, MPO and LAT2) for Exp, a 9-gene signature for ExpExon, a 7-gene signature for hMethyl27, a single-gene signature (PIK3CG) for Gistic2 and a 3-gene signature for Paradigm IPLs. CONCLUSION We performed a comprehensive comparison of the performance of five well-known feature selection methods for mining features from various high-dimensional datasets. We identified signature genes using supervised learning for the specific omic data for the disease. The study will help incorporate higher order dependencies among features.
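The abstract names redundancy rate (RR) and representation entropy (RE) as evaluation criteria but does not define them. The sketch below uses definitions that are common in the feature selection literature and are assumptions here, not the authors' confirmed formulas: RR as the mean absolute Pearson correlation over all pairs of selected features, and RE as the entropy of the normalised eigenvalues of the feature covariance matrix. The function names are hypothetical.

```python
import numpy as np

def redundancy_rate(X):
    """Mean absolute Pearson correlation over all distinct feature pairs.

    X has shape (n_samples, n_selected_features). Assumed definition;
    constant columns are not handled (they yield NaN correlations).
    """
    corr = np.corrcoef(X, rowvar=False)
    upper = np.triu_indices_from(corr, k=1)  # each pair counted once
    return float(np.mean(np.abs(corr[upper])))

def representation_entropy(X):
    """Entropy of the normalised eigenvalues of the feature covariance matrix.

    Assumed definition: low when variance is concentrated in a few directions
    (redundant features), high when it is spread evenly across features.
    """
    cov = np.cov(X, rowvar=False)
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)  # guard tiny negatives
    p = eig / eig.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))
```

Under these definitions a lower RR and a higher RE both indicate a less redundant feature subset, so two duplicated features give RR close to 1 and RE close to 0.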
Affiliation(s)
- Tapas Bhadra
- Department of Computer Science and Engineering, Aliah University, Kolkata, West Bengal, 700160, India
- Saurav Mallik
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA
- Neaj Hasan
- Department of Computer Science and Engineering, Aliah University, Kolkata, West Bengal, 700160, India
- Zhongming Zhao
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA.
- Human Genetics Center, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA.
14
Munquad S, Si T, Mallik S, Das AB, Zhao Z. A Deep Learning-Based Framework for Supporting Clinical Diagnosis of Glioblastoma Subtypes. Front Genet 2022; 13:855420. [PMID: 35419027 PMCID: PMC9000988 DOI: 10.3389/fgene.2022.855420] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 01/15/2022] [Accepted: 02/17/2022] [Indexed: 12/12/2022] Open
Abstract
Understanding the molecular features that facilitate aggressive phenotypes in glioblastoma multiforme (GBM) remains a major clinical challenge. Accurate diagnosis of GBM subtypes, namely classical, proneural, and mesenchymal, and identification of specific molecular features are crucial for clinicians for systematic treatment. We develop a biologically interpretable and highly efficient deep learning framework based on a convolutional neural network for subtype identification. The classifiers were generated from high-throughput data of different molecular levels, i.e., transcriptome and methylome. Furthermore, an integrated subsystem of transcriptome and methylome data was also used to build the biologically relevant model. Our results show that the deep learning model outperforms traditional machine learning algorithms. Furthermore, to evaluate the biological and clinical applicability of the classification, we performed weighted gene correlation network analysis, gene set enrichment, and survival analysis of the feature genes. We identified the genotype-phenotype relationship of GBM subtypes and the subtype-specific predictive biomarkers for potential diagnosis and treatment.
Affiliation(s)
- Sana Munquad
- Department of Biotechnology, National Institute of Technology Warangal, Warangal, India
- Tapas Si
- Department of Computer Science and Engineering, Bankura Unnayani Institute of Engineering, Bankura, India
- Saurav Mallik
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, United States
- Asim Bikas Das
- Department of Biotechnology, National Institute of Technology Warangal, Warangal, India
- Zhongming Zhao
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, United States
- Human Genetics Center, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, United States
- Department of Pathology and Laboratory Medicine, McGovern Medical School, The University of Texas Health Science Center at Houston, Houston, TX, United States
15
Serra A, Saarimäki LA, Pavel A, del Giudice G, Fratello M, Cattelani L, Federico A, Laurino O, Marwah VS, Fortino V, Scala G, Sofia Kinaret PA, Greco D. Nextcast: a software suite to analyse and model toxicogenomics data. Comput Struct Biotechnol J 2022; 20:1413-1426. [PMID: 35386103 PMCID: PMC8956870 DOI: 10.1016/j.csbj.2022.03.014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/25/2021] [Revised: 03/16/2022] [Accepted: 03/16/2022] [Indexed: 11/28/2022] Open
Abstract
Toxicogenomics is emerging as a valid approach to characterise the mechanism of action of chemicals. Structured pipelines for toxicogenomics increase standardisation and regulatory acceptance. We developed the Nextcast software suite for robust analysis and modelling of toxicogenomic data. Nextcast offers customisable modular pipelines to tackle multiple biological questions.
The recent advancements in toxicogenomics have led to the availability of large omics data sets, representing the starting point for studying the exposure mechanism of action and identifying candidate biomarkers for toxicity prediction. The current lack of standard methods in data generation and analysis hampers the full exploitation of toxicogenomics-based evidence in regulatory risk assessment. Moreover, the pipelines for the preprocessing and downstream analyses of toxicogenomic data sets can be quite challenging to implement. Over the years, we have developed a number of software packages to address specific questions related to multiple steps of toxicogenomics data analysis and modelling. In this review we present the Nextcast software collection and discuss how its individual tools can be combined into efficient pipelines to answer specific biological questions. Nextcast components are of great support to the scientific community for analysing and interpreting large data sets for the toxicity evaluation of compounds in an unbiased, straightforward, and reliable manner. The Nextcast software suite is available at https://github.com/fhaive/nextcast.
Affiliation(s)
- Angela Serra
- Faculty of Medicine and Health Technology, Tampere University, Tampere, Finland
- BioMediTech Institute, Tampere University, Tampere University, Tampere, Finland
- Finnish Hub for Development and Validation of Integrated Approaches (FHAIVE), Tampere, Finland
- Laura Aliisa Saarimäki
- Faculty of Medicine and Health Technology, Tampere University, Tampere, Finland
- BioMediTech Institute, Tampere University, Tampere University, Tampere, Finland
- Finnish Hub for Development and Validation of Integrated Approaches (FHAIVE), Tampere, Finland
- Alisa Pavel
- Faculty of Medicine and Health Technology, Tampere University, Tampere, Finland
- BioMediTech Institute, Tampere University, Tampere University, Tampere, Finland
- Finnish Hub for Development and Validation of Integrated Approaches (FHAIVE), Tampere, Finland
- Giusy del Giudice
- Faculty of Medicine and Health Technology, Tampere University, Tampere, Finland
- BioMediTech Institute, Tampere University, Tampere University, Tampere, Finland
- Finnish Hub for Development and Validation of Integrated Approaches (FHAIVE), Tampere, Finland
- Michele Fratello
- Faculty of Medicine and Health Technology, Tampere University, Tampere, Finland
- BioMediTech Institute, Tampere University, Tampere University, Tampere, Finland
- Finnish Hub for Development and Validation of Integrated Approaches (FHAIVE), Tampere, Finland
- Luca Cattelani
- Faculty of Medicine and Health Technology, Tampere University, Tampere, Finland
- BioMediTech Institute, Tampere University, Tampere University, Tampere, Finland
- Finnish Hub for Development and Validation of Integrated Approaches (FHAIVE), Tampere, Finland
- Antonio Federico
- Faculty of Medicine and Health Technology, Tampere University, Tampere, Finland
- BioMediTech Institute, Tampere University, Tampere University, Tampere, Finland
- Finnish Hub for Development and Validation of Integrated Approaches (FHAIVE), Tampere, Finland
- Veer Singh Marwah
- Faculty of Medicine and Health Technology, Tampere University, Tampere, Finland
- BioMediTech Institute, Tampere University, Tampere University, Tampere, Finland
- Vittorio Fortino
- Institute of Biomedicine, University of Eastern Finland, Kuopio, Finland
- Giovanni Scala
- Faculty of Medicine and Health Technology, Tampere University, Tampere, Finland
- BioMediTech Institute, Tampere University, Tampere University, Tampere, Finland
- Department of Biology, University of Naples Federico II, Naples, Italy
- Pia Anneli Sofia Kinaret
- Faculty of Medicine and Health Technology, Tampere University, Tampere, Finland
- BioMediTech Institute, Tampere University, Tampere University, Tampere, Finland
- Finnish Hub for Development and Validation of Integrated Approaches (FHAIVE), Tampere, Finland
- Institute of Biotechnology, University of Helsinki, Helsinki, Finland
- Dario Greco
- Faculty of Medicine and Health Technology, Tampere University, Tampere, Finland
- BioMediTech Institute, Tampere University, Tampere University, Tampere, Finland
- Finnish Hub for Development and Validation of Integrated Approaches (FHAIVE), Tampere, Finland
- Institute of Biotechnology, University of Helsinki, Helsinki, Finland
- Corresponding author.
16
Uzunangelov V, Wong CK, Stuart JM. Accurate cancer phenotype prediction with AKLIMATE, a stacked kernel learner integrating multimodal genomic data and pathway knowledge. PLoS Comput Biol 2021; 17:e1008878. [PMID: 33861732 PMCID: PMC8081343 DOI: 10.1371/journal.pcbi.1008878] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 09/23/2020] [Revised: 04/28/2021] [Accepted: 03/15/2021] [Indexed: 02/03/2023] Open
Abstract
Advancements in sequencing have led to the proliferation of multi-omic profiles of human cells under different conditions and perturbations. In addition, many databases have amassed information about pathways and gene "signatures", that is, patterns of gene expression associated with specific cellular and phenotypic contexts. An important current challenge in systems biology is to leverage such knowledge about gene coordination to maximize the predictive power and generalization of models applied to high-throughput datasets. However, few such integrative approaches exist that also provide interpretable results quantifying the importance of individual genes and pathways to model accuracy. We introduce AKLIMATE, a first kernel-based stacked learner that seamlessly incorporates multi-omics feature data with prior information in the form of pathways for either regression or classification tasks. AKLIMATE uses a novel multiple-kernel learning framework where individual kernels capture the prediction propensities recorded in random forests, each built from a specific pathway gene set that integrates all omics data for its member genes. AKLIMATE has comparable or improved performance relative to state-of-the-art methods on diverse phenotype learning tasks, including predicting microsatellite instability in endometrial and colorectal cancer, survival in breast cancer, and cell line response to gene knockdowns. We show how AKLIMATE is able to connect feature data across data platforms through their common pathways to identify examples of several known and novel contributors of cancer and synthetic lethality.
Affiliation(s)
- Vladislav Uzunangelov
- Department of Biomolecular Engineering, University of California, Santa Cruz, California, United States of America
- Christopher K. Wong
- Department of Biomolecular Engineering, University of California, Santa Cruz, California, United States of America
- Joshua M. Stuart
- Department of Biomolecular Engineering, University of California, Santa Cruz, California, United States of America
17
Bora K, Bhuyan MK, Kasugai K, Mallik S, Zhao Z. Computational learning of features for automated colonic polyp classification. Sci Rep 2021; 11:4347. [PMID: 33623086 PMCID: PMC7902635 DOI: 10.1038/s41598-021-83788-8] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 08/23/2020] [Accepted: 02/04/2021] [Indexed: 12/24/2022] Open
Abstract
Shape, texture, and color are critical features for assessing the degree of dysplasia in colonic polyps. A comprehensive analysis of these features is presented in this paper. Shape features are extracted using the generic Fourier descriptor. The nonsubsampled contourlet transform is used as the texture and color feature descriptor, with different combinations of filters. Analysis of variance (ANOVA) is applied to measure the statistical significance of the contribution of different descriptors between two classes of colonic polyps: non-neoplastic and neoplastic. The final descriptors selected after ANOVA are optimized using the fuzzy entropy-based feature ranking algorithm. Finally, classification is performed using Least Square Support Vector Machine and Multi-layer Perceptron with five-fold cross-validation to avoid overfitting. Evaluation of our analytical approach using two datasets suggested that the feature descriptors could efficiently designate a colonic polyp, which subsequently can help the early detection of colorectal carcinoma. Based on the comparison with four deep learning models, we demonstrate that the proposed approach outperforms the existing feature-based methods of colonic polyp identification.
Affiliation(s)
- Kangkana Bora
- Department of Computer Science and IT, Cotton University, Pan Bazar, Guwahati, Assam, 781001, India
- M K Bhuyan
- Department of Electrical and Electronics Engineering, Indian Institute of Technology Guwahati (IITG), Guwahati, Assam, 781039, India
- Kunio Kasugai
- Department of Gastroenterology, Aichi Medical University, Nagakute, 480-1195, Japan
- Saurav Mallik
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA
- Zhongming Zhao
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA.
- Human Genetics Center, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, USA.
- Department of Pathology and Laboratory Medicine, McGovern Medical School, The University of Texas Health Science Center at Houston, Houston, TX, USA.
18
Mandal M, Sahoo SK, Patra P, Mallik S, Zhao Z. In silico ranking of phenolics for therapeutic effectiveness on cancer stem cells. BMC Bioinformatics 2020; 21:499. [PMID: 33371879 PMCID: PMC7768647 DOI: 10.1186/s12859-020-03849-z] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 10/25/2020] [Accepted: 10/27/2020] [Indexed: 12/22/2022] Open
Abstract
BACKGROUND Cancer stem cells (CSCs) have features such as the ability to self-renew, differentiate into defined progenies, and initiate tumor growth. Cancer treatments include drugs, chemotherapy, radiotherapy, or a combination of these. However, treatment of cancer by various therapeutic strategies often fails. One possible reason is that the stem-like nature of CSCs makes them more dynamic and complex and may cause therapeutic resistance. Another limitation is the side effects associated with chemotherapy or radiotherapy. To explore better or alternative treatment options, the current study investigates natural drug-like molecules that can be used as CSC-targeted therapy. Among various natural products, the anticancer potential of phenolics is well established. We collected 21 phytochemicals from the phenolic group and their interacting CSC genes from publicly available databases. A bipartite graph is then constructed with the collected CSC genes as one node set and their interacting phenolic phytochemicals as the other. The bipartite graph is then transformed into a weighted bipartite graph by considering the interaction strength between the phenolics and the CSC genes. The CSC genes are also weighted by two scores, namely DSI (Disease Specificity Index) and DPI (Disease Pleiotropy Index). For each gene, the DSI score reflects its specific relationship with the disease and the DPI score reflects its association with multiple diseases. Finally, a ranking technique based on the PageRank (PR) algorithm is developed for ranking the phenolics. RESULTS We collected 21 phytochemicals from the phenolic group and 1118 CSC genes. The top-ranked phenolics were evaluated by their molecular and pharmacokinetic properties and disease association networks.
We selected the top five ranked phenolics (Resveratrol, Curcumin, Quercetin, Epigallocatechin Gallate, and Genistein) for further examination of their oral bioavailability through molecular properties, drug-likeness through pharmacokinetic properties, and associated networks with CSC genes. CONCLUSION Our PR-ranking-based approach is useful for ranking the phenolics that are associated with CSC genes. Our results suggest that some phenolics are potential molecules for CSC-related cancer treatment.
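The ranking idea described in this abstract, PageRank over a weighted phenolic-gene bipartite graph, can be illustrated with a minimal sketch. This is a plain power-iteration PageRank, not the authors' ranking technique: the DSI and DPI gene weights are omitted, and the 3 x 4 interaction matrix `B`, the damping factor, and the function name are hypothetical values chosen only for illustration.

```python
import numpy as np

def pagerank(W, damping=0.85, tol=1e-10, max_iter=1000):
    """Power-iteration PageRank on a weighted adjacency matrix.

    W[i, j] is the weight of the edge from node i to node j.
    """
    n = W.shape[0]
    out = W.sum(axis=1, keepdims=True)
    # Row-normalise the weights; nodes without out-edges jump uniformly.
    P = np.where(out > 0, W / np.where(out > 0, out, 1.0), 1.0 / n)
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_new = (1.0 - damping) / n + damping * (P.T @ r)
        converged = np.abs(r_new - r).sum() < tol
        r = r_new
        if converged:
            break
    return r

# Hypothetical interaction strengths: 3 phenolics (rows) x 4 CSC genes (columns).
B = np.array([[3.0, 2.0, 2.0, 1.0],
              [1.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
n_phen, n_gene = B.shape

# Symmetric adjacency of the bipartite graph: phenolic nodes first, then genes.
W = np.block([[np.zeros((n_phen, n_phen)), B],
              [B.T, np.zeros((n_gene, n_gene))]])

scores = pagerank(W)
phenolic_ranking = np.argsort(-scores[:n_phen])  # best-ranked phenolic first
```

In this toy graph the first phenolic interacts most strongly with every gene, so it receives the highest score among the phenolic nodes.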
Affiliation(s)
- Monalisa Mandal
- Department of School of Computer Science and Engineering, Xavier University, Bhubaneswar, Odisha, 752050, India
- Priyadarsan Patra
- Department of School of Computer Science and Engineering, Xavier University, Bhubaneswar, Odisha, 752050, India
- Saurav Mallik
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center At Houston, Houston, TX, 77030, USA
- Zhongming Zhao
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center At Houston, Houston, TX, 77030, USA.
- Human Genetics Center, School of Public Health, The University of Texas Health Science Center At Houston, Houston, TX, 77030, USA.
19
Kabeshova A, Yu Y, Lukacs B, Bacry E, Gaïffas S. ZiMM: A deep learning model for long term and blurry relapses with non-clinical claims data. J Biomed Inform 2020; 110:103531. [PMID: 32818667 DOI: 10.1016/j.jbi.2020.103531] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 03/06/2020] [Revised: 07/25/2020] [Accepted: 08/09/2020] [Indexed: 11/28/2022]
Abstract
This paper considers the problems of modeling and predicting a long-term and "blurry" relapse that occurs after a medical act, such as a surgery. We do not consider a short-term complication related to the act itself, but a long-term relapse that clinicians cannot explain easily, since it depends on unknown sets or sequences of past events that occurred before the act. The relapse is observed only indirectly, in a "blurry" fashion, through longitudinal prescriptions of drugs over a long period of time after the medical act. We introduce a new model, called ZiMM (Zero-inflated Mixture of Multinomial distributions) in order to capture long-term and blurry relapses. On top of it, we build an end-to-end deep-learning architecture called ZiMM Encoder-Decoder (ZiMM ED) that can learn from the complex, irregular, highly heterogeneous and sparse patterns of health events that are observed through a claims-only database. ZiMM ED is applied on a "non-clinical" claims database, that contains only timestamped reimbursement codes for drug purchases, medical procedures and hospital diagnoses, the only available clinical feature being the age of the patient. This setting is more challenging than a setting where bedside clinical signals are available. Our motivation for using such a non-clinical claims database is its exhaustivity population-wise, compared to clinical electronic health records coming from a single or a small set of hospitals. Indeed, we consider a dataset containing the claims of almost all French citizens who had surgery for prostatic problems, with a history between 1.5 and 5 years. We consider a long-term (18 months) relapse (urination problems still occur despite surgery), which is blurry since it is observed only through the reimbursement of a specific set of drugs for urination problems. 
Our experiments show that ZiMM ED improves on several baselines, including non-deep-learning and deep-learning approaches, and that it allows working on such a dataset with minimal preprocessing work.
Affiliation(s)
- Stéphane Gaïffas
- LPSM, Université de Paris, France; DMA, Ecole normale supérieure, Paris, France.
20
A Linear Regression and Deep Learning Approach for Detecting Reliable Genetic Alterations in Cancer Using DNA Methylation and Gene Expression Data. Genes (Basel) 2020; 11:genes11080931. [PMID: 32806782 PMCID: PMC7465138 DOI: 10.3390/genes11080931] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 07/21/2020] [Revised: 08/03/2020] [Accepted: 08/06/2020] [Indexed: 12/12/2022] Open
Abstract
DNA methylation change has been useful for cancer biomarker discovery, classification, and potential treatment development. So far, existing methods use either differentially methylated CpG sites or combined CpG sites, namely differentially methylated regions, that can be mapped to genes. However, such methylation signal mapping has limitations. To address these limitations, in this study, we introduced a combinatorial framework using linear regression, differential expression, and a deep learning method for accurate biological interpretation of DNA methylation through integrating DNA methylation data and corresponding TCGA gene expression data. We demonstrated it for uterine cervical cancer. First, we pre-filtered outliers from the data set and then determined the predicted gene expression value from the pre-filtered methylation data through linear regression. We identified differentially expressed genes (DEGs) by the Empirical Bayes test using Limma. Then we applied a deep learning method, "nnet", to classify the cervical cancer labels based on those DEGs and determined classification metrics, including accuracy and area under the curve (AUC), through 10-fold cross-validation. We applied our approach to a uterine cervical cancer DNA methylation dataset (NCBI accession ID: GSE30760, 27,578 features covering 63 tumor and 152 matched normal samples). After linear regression and differential expression analysis, we obtained 6287 DEGs with false discovery rate (FDR) < 0.001. After performing deep learning analysis, we obtained an average classification accuracy of 90.69% (±1.97%) for the uterine cervical cancer labels. This performance is better than that of other peer methods. We performed in-degree and out-degree hub gene network analysis using Cytoscape. We reported five top in-degree genes (PAIP2, GRWD1, VPS4B, CRADD and LLPH) and five top out-degree genes (MRPL35, FAM177A1, STAT4, ASPSCR1 and FABP7).
After that, we performed KEGG pathway and Gene Ontology enrichment analysis of DEGs using the tool WebGestalt (WEB-based Gene SeT AnaLysis Toolkit). In summary, our proposed framework, which integrates linear regression, differential expression, and deep learning, provides a robust approach to better interpret DNA methylation analysis and gene expression data in disease study.
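Two steps of the pipeline described above lend themselves to a short sketch: predicting a gene's expression from its methylation by linear regression, and controlling the false discovery rate over many per-gene p-values. The code below is an illustration under assumptions, not the authors' implementation: the abstract reports Limma's empirical Bayes test, which is not reproduced here, and the Benjamini-Hochberg adjustment is assumed as the FDR method. Function names are hypothetical.

```python
import numpy as np

def predict_expression(methylation, expression):
    """Fit expression ~ slope * methylation + intercept for one gene
    and return the fitted (predicted) expression values."""
    methylation = np.asarray(methylation, dtype=float)
    slope, intercept = np.polyfit(methylation, expression, deg=1)
    return slope * methylation + intercept

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (assumed FDR procedure)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    adjusted = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.clip(adjusted, 0.0, 1.0)
    return out
```

Genes whose adjusted p-value falls below a chosen threshold (the abstract uses FDR < 0.001) would then be retained as DEGs for the classification step.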
21
Mallik S, Qin G, Jia P, Zhao Z. Molecular signatures identified by integrating gene expression and methylation in non-seminoma and seminoma of testicular germ cell tumours. Epigenetics 2020; 16:162-176. [PMID: 32615059 DOI: 10.1080/15592294.2020.1790108] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 02/01/2023] Open
Abstract
Testicular germ cell tumours (TGCTs) are the most common cancer in young male adults (aged 15 to 40). Unlike most other cancer types, identification of molecular signatures in TGCT has rarely been reported. In this study, we developed a novel integrative analysis framework to identify modules of co-methylated and co-expressed genes [mRNAs and microRNAs (miRNAs)] in two TGCT subtypes: non-seminoma (NSE) and seminoma (SE). We first integrated DNA methylation and mRNA/miRNA expression data and then used a statistical method, CoMEx (Combined score of DNA Methylation and Expression), to assess differentially expressed and methylated (DEM) genes/miRNAs. Next, we identified co-methylation and co-expression modules by applying the WGCNA (Weighted Gene Correlation Network Analysis) tool to these DEM genes/miRNAs. The module with the highest average Pearson's Correlation Coefficient (PCC) after considering all pair-wise molecules (genes/miRNAs) included 91 molecules. By integrating both transcription factor and miRNA regulations, we constructed subtype-specific regulatory networks for NSE and SE. We identified four hub miRNAs (miR-182-5p, miR-520b, miR-520c-3p, and miR-7-5p), two hub TFs (MYC and SP1), and two genes (RECK and TERT) in the NSE-specific regulatory network, and two hub miRNAs (miR-182-5p and miR-338-3p), five hub TFs (ETS1, HIF1A, HNF1A, MYC, and SP1), and three hub genes (CDH1, CXCR4, and SNAI1) in the SE-specific regulatory network. One miRNA (miR-182-5p) and two TFs (MYC and SP1) were common hubs of NSE and SE. We further examined pathways enriched in these subtype-specific networks. Our study provides a comprehensive view of the molecular signatures and co-regulation in two TGCT subtypes.
Affiliation(s)
- Saurav Mallik
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Guimin Qin
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Peilin Jia
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Zhongming Zhao
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Human Genetics Center, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, USA
- MD Anderson Cancer Center UTHealth Graduate School of Biomedical Sciences, Houston, TX, USA
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
22
Nakhl S, Sleilaty G, Chouery E, Salem N, Chahine R, Farès N. FokI vitamin D receptor gene polymorphism and serum 25-hydroxyvitamin D in patients with cardiovascular risk. Arch Med Sci Atheroscler Dis 2019; 4:e298-e303. [PMID: 32368685 PMCID: PMC7191939 DOI: 10.5114/amsad.2019.91437] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 07/18/2019] [Accepted: 11/19/2019] [Indexed: 01/02/2023] Open
Abstract
INTRODUCTION The biological actions of vitamin D are mediated through the vitamin D receptor (VDR). Numerous single-nucleotide polymorphisms (SNPs) in the VDR gene have been identified, and some have been associated with cardiovascular disease (CVD) risk factors. This study aims to evaluate the association of five SNPs in the VDR gene with 25-hydroxyvitamin D (25[OH]D) levels in patients with at least one CVD risk factor. MATERIAL AND METHODS Genomic DNA was sequenced using standard Sanger methods for five VDR SNPs (BsmI rs1544410; ApaI rs7975232; Cdx2 rs11568820; TaqI rs731236; FokI rs2228570) in 50 Mediterranean subjects having hypovitaminosis D with at least one documented CVD risk factor, aged 18 years or more. The collected variables were serum levels of 25[OH]D, HbA1c, fasting plasma glucose, triglycerides, LDL cholesterol, and total cholesterol. RESULTS BsmI, ApaI, and TaqI were moderately to highly intercorrelated. Cdx2 was less frequent than expected. With respect to the number of mutations in FokI, levels of 25[OH]D were 11.2 ± 5.5 ng/ml in the absence of mutations, 12.6 ± 4.7 ng/ml in the presence of one mutation, and 16.5 ± 5.5 ng/ml in the presence of two mutations. CONCLUSIONS FokI polymorphism is more frequent in subjects with cardiovascular risk factors than in the general Caucasian population.
Affiliation(s)
- Sahar Nakhl
- Research Laboratory in Physiology and Physiopathology (LRPP), Health Technology Centre, Faculty of Medicine, Saint Joseph University, Beirut, Lebanon
- Research Laboratory in Oxidative Stress and Antioxidants, Faculty of Medical Sciences and Doctoral School in Science and Technology, Lebanese University, Beirut, Lebanon
- Ghassan Sleilaty
- Faculty of Medicine, Higher Institute of Public Health, Saint Joseph University, Beirut, Lebanon
- Eliane Chouery
- Medical Genetics Unit, Faculty of Medicine, Saint Joseph University, Beirut, Lebanon
- Nabiha Salem
- Medical Genetics Unit, Faculty of Medicine, Saint Joseph University, Beirut, Lebanon
- Ramez Chahine
- Research Laboratory in Oxidative Stress and Antioxidants, Faculty of Medical Sciences and Doctoral School in Science and Technology, Lebanese University, Beirut, Lebanon
- Faculty of Public Health, Sagesse University, Beirut, Lebanon
- Nassim Farès
- Research Laboratory in Physiology and Physiopathology (LRPP), Health Technology Centre, Faculty of Medicine, Saint Joseph University, Beirut, Lebanon
23
Mallik S, Zhao Z. Multi-Objective Optimized Fuzzy Clustering for Detecting Cell Clusters from Single-Cell Expression Profiles. Genes (Basel) 2019; 10:E611. [PMID: 31412637 PMCID: PMC6723724 DOI: 10.3390/genes10080611] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 06/21/2019] [Revised: 07/30/2019] [Accepted: 08/07/2019] [Indexed: 02/06/2023] Open
Abstract
Rapid advances in single-cell RNA sequencing (scRNA-seq) allow measurement of the expression of genes at single-cell resolution in complex disease or tissue. While many methods have been developed to detect cell clusters from scRNA-seq data, this task remains a major challenge. We proposed a multi-objective optimization-based fuzzy clustering approach for detecting cell clusters from scRNA-seq data. First, we conducted initial filtering and SCnorm normalization. We considered various case studies by selecting different cluster numbers (cl = 2 to a user-defined number), and applied the fuzzy c-means clustering algorithm to each individually. For each case, we evaluated the scores of four cluster validity index measures: Partition Entropy (PE), Partition Coefficient (PC), Modified Partition Coefficient (MPC), and Fuzzy Silhouette Index (FSI). Next, we set the first measure as a minimization objective (↓) and the remaining three as maximization objectives (↑), and then applied a multi-objective decision-making technique, TOPSIS, to identify the optimal solution. The solution (case study) that had the highest TOPSIS score was selected as the final optimal clustering. Finally, we obtained differentially expressed genes (DEGs) using Limma by comparing the expression of the samples between each resultant cluster and the remaining clusters. We applied our approach to a scRNA-seq dataset for the rare intestinal cell type in mice [GEO ID: GSE62270, 23,630 features (genes) and 288 cells]. The optimal cluster result (TOPSIS optimal score = 0.858) comprised two clusters, one with 115 cells and the other with 91 cells. The evaluated scores of the four cluster validity indices, FSI, PE, PC, and MPC, for the optimized fuzzy clustering were 0.482, 0.578, 0.607, and 0.215, respectively. The Limma analysis identified 1240 DEGs (cluster 1 vs. cluster 2).
The top ten gene markers were Rps21, Slc5a1, Crip1, Rpl15, Rpl3, Rpl27a, Khk, Rps3a1, Aldob and Rps17. In this list, Khk (encoding ketohexokinase) is a novel marker for the rare intestinal cell type. In summary, this method is useful for detecting cell clusters from scRNA-seq data.
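The validity indices and the TOPSIS selection step in this abstract can be sketched as follows. The formulas for PC, MPC, and PE are the standard textbook ones and are assumed here rather than taken from the paper; the Fuzzy Silhouette Index is omitted, TOPSIS is shown with equal criterion weights (an assumption), and all function names are hypothetical.

```python
import numpy as np

def partition_coefficient(U):
    """PC for a fuzzy membership matrix U of shape (n_cells, n_clusters),
    where each row sums to 1. Higher is better; 1 means a crisp partition."""
    return float(np.sum(U ** 2) / U.shape[0])

def modified_partition_coefficient(U):
    """MPC rescales PC so that its range no longer depends on the cluster count."""
    c = U.shape[1]
    return 1.0 - c / (c - 1.0) * (1.0 - partition_coefficient(U))

def partition_entropy(U, eps=1e-12):
    """PE: lower is better; 0 means a crisp partition."""
    return float(-np.sum(U * np.log(U + eps)) / U.shape[0])

def topsis(scores, benefit):
    """Equal-weight TOPSIS closeness scores.

    scores: (n_alternatives, n_criteria) matrix.
    benefit: per-criterion booleans, True if the criterion is maximised.
    Assumes the alternatives are not all identical.
    """
    scores = np.asarray(scores, dtype=float)
    norm = scores / np.linalg.norm(scores, axis=0)  # vector normalisation
    ideal = np.where(benefit, norm.max(axis=0), norm.min(axis=0))
    anti = np.where(benefit, norm.min(axis=0), norm.max(axis=0))
    d_pos = np.linalg.norm(norm - ideal, axis=1)
    d_neg = np.linalg.norm(norm - anti, axis=1)
    return d_neg / (d_pos + d_neg)
```

Each candidate cluster number contributes one row of (PE, PC, MPC) scores, with `benefit = [False, True, True]`, and the row with the highest TOPSIS score is selected. As a consistency check on the assumed formula, for two clusters MPC reduces to 2 PC - 1, which agrees with the reported PC = 0.607 and MPC = 0.215 up to rounding.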
Affiliation(s)
- Saurav Mallik
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
- Zhongming Zhao
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA.
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, USA.