1
Mallik S, Seth S, Si A, Bhadra T, Zhao Z. Optimal ranking and directional signature classification using the integral strategy of multi-objective optimization-based association rule mining of multi-omics data. Frontiers in Bioinformatics 2023; 3:1182176. [PMID: 37576714 PMCID: PMC10415913 DOI: 10.3389/fbinf.2023.1182176] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/08/2023] [Accepted: 06/19/2023] [Indexed: 08/15/2023]
Abstract
Introduction: Association rule mining (ARM) is a powerful tool for exploring informative relationships among multiple items (genes) in a dataset. Its main problem is that it generates many rules with differing rule-informative values, which makes it hard for the user to choose the effective rules. In addition, few works have addressed the integration of multiple biological datasets and variable cutoff values in ARM. Methods: To address these problems, we developed a novel framework, MOOVARM (multi-objective optimized variable cutoff-based association rule mining), for multi-omics profiles. Results: We identified the positive ideal solution (PIS), which maximized the profit and minimized the loss, and the negative ideal solution (NIS), which minimized the profit and maximized the loss, over all gene sets (item sets) belonging to each extracted rule. We then computed the distance from the PIS (d+) and the distance from the NIS (d-) for each gene set or product. These two distances played an important role in determining the optimized associations among pairs of genes in the multi-omics dataset. We then globally estimated the relative closeness to the PIS to rank the gene sets. A rule whose relative closeness score is greater than or equal to the pre-defined threshold is retained as a final resultant rule. Moreover, MOOVARM evaluated the relative score of a rule based on the status of all its genes rather than of individual genes. Conclusions: MOOVARM produced the final ranking of the extracted (multi-objective optimized) rules of correlated genes, which classified disease better than state-of-the-art algorithms for gene signature identification.
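The PIS/NIS ranking described in this abstract follows the general TOPSIS pattern, which can be sketched as below. This is a minimal illustration, not the MOOVARM implementation: the criteria (support, confidence, rule length), the example rules, the vector normalization, the equal weighting, and the 0.5 threshold are all assumptions made for the example.

```python
import math

def relative_closeness(scores, is_benefit):
    """Return the relative closeness to the PIS for each row of `scores`.

    scores: one row per rule, one column per criterion.
    is_benefit: True for criteria to maximize ("profit"), False for "loss".
    """
    n_crit = len(is_benefit)
    # Vector-normalize each criterion column (a common TOPSIS choice, assumed here).
    norms = [math.sqrt(sum(row[j] ** 2 for row in scores)) or 1.0 for j in range(n_crit)]
    normed = [[row[j] / norms[j] for j in range(n_crit)] for row in scores]
    cols = list(zip(*normed))
    # PIS takes the best value per criterion, NIS the worst.
    pis = [max(c) if b else min(c) for c, b in zip(cols, is_benefit)]
    nis = [min(c) if b else max(c) for c, b in zip(cols, is_benefit)]
    result = []
    for row in normed:
        d_plus = math.dist(row, pis)    # distance from the PIS
        d_minus = math.dist(row, nis)   # distance from the NIS
        total = d_plus + d_minus
        result.append(d_minus / total if total else 0.0)
    return result

# Hypothetical rules scored on (support, confidence, rule length);
# a shorter rule is treated as better, i.e. length is a "loss" criterion.
rules = {"A->B": [0.40, 0.90, 2], "A,C->B": [0.25, 0.95, 3], "D->B": [0.10, 0.50, 2]}
closeness = relative_closeness(list(rules.values()), [True, True, False])

threshold = 0.5  # illustrative cutoff, not a value from the paper
kept = [name for name, c in zip(rules, closeness) if c >= threshold]
ranked = sorted(zip(rules, closeness), key=lambda t: t[1], reverse=True)
```

A rule close to the PIS and far from the NIS gets a closeness near 1, so sorting by this score gives the ranking and the threshold gives the final rule set.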
Affiliation(s)
- Saurav Mallik
  - Environmental Health, Harvard T. H. Chan School of Public Health, Boston, MA, United States
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, United States
- Soumita Seth
  - Department of Computer Science and Engineering, Brainware University, Kolkata, India
  - Department of Computer Science and Engineering, Aliah University, Kolkata, India
- Amalendu Si
  - School of Information Technology, Maulana Abul Kalam Azad University of Technology, Haringhata, India
- Tapas Bhadra
  - Department of Computer Science and Engineering, Aliah University, Kolkata, India
- Zhongming Zhao
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, United States
  - Human Genetics Center, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, United States

2
Mallik S, Sarkar A, Nath S, Maulik U, Das S, Pati SK, Ghosh S, Zhao Z. 3PNMF-MKL: A non-negative matrix factorization-based multiple kernel learning method for multi-modal data integration and its application to gene signature detection. Front Genet 2023; 14:1095330. [PMID: 36865387 PMCID: PMC9971618 DOI: 10.3389/fgene.2023.1095330] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/11/2022] [Accepted: 01/30/2023] [Indexed: 02/16/2023]
Abstract
Handling biomedical big data is a challenging task, and integrating multi-modal data followed by significant feature mining (gene signature detection) is especially daunting. With this in mind, we proposed a novel framework, three-factor penalized non-negative matrix factorization-based multiple kernel learning with soft margin hinge loss (3PNMF-MKL), for multi-modal data integration followed by gene signature detection. In brief, limma, which employs empirical Bayes statistics, was first applied to each individual molecular profile to extract the statistically significant features; the three-factor penalized non-negative matrix factorization method was then used for data/matrix fusion on the reduced feature sets. Multiple kernel learning models with soft margin hinge loss were deployed to estimate average accuracy scores and the area under the curve (AUC). Gene modules were identified by successive application of average linkage clustering and dynamic tree cut. The module with the highest correlation was considered the potential gene signature. We used an acute myeloid leukemia dataset from The Cancer Genome Atlas (TCGA) repository containing five molecular profiles. Our algorithm generated a 50-gene signature that achieved a high classification AUC score (0.827). We explored the functions of the signature genes using pathway and Gene Ontology (GO) databases. Our method outperformed the state-of-the-art methods in terms of AUC. Furthermore, we included comparative studies with other related methods to support the acceptability of our method. Finally, our algorithm can be applied to any multi-modal dataset for data integration followed by gene module discovery.
Affiliation(s)
- Saurav Mallik
  - Department of Environmental Health, Harvard T. H. Chan School of Public Health, Boston, MA, United States
- Anasua Sarkar
  - Department of Computer Science & Engineering, Jadavpur University, Kolkata, India
- Sagnik Nath
  - Department of Computer Science & Engineering, Jadavpur University, Kolkata, India
- Ujjwal Maulik
  - Department of Computer Science & Engineering, Jadavpur University, Kolkata, India
- Supantha Das
  - Department of Information Technology, Academy of Technology, Hooghly, West Bengal, India
- Soumen Kumar Pati
  - Department of Bioinformatics, Maulana Abul Kalam Azad University, Kolkata, West Bengal, India
- Soumadip Ghosh
  - Department of Computer Science & Engineering, Sister Nivedita University, New Town, West Bengal, India
- Zhongming Zhao
  - Human Genetics Center, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, United States
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, United States

Correspondence: Saurav Mallik; Zhongming Zhao

3
Mallick K, Mallik S, Bandyopadhyay S, Chakraborty S. A Novel Graph Topology-Based GO-Similarity Measure for Signature Detection From Multi-Omics Data and its Application to Other Problems. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:773-785. [PMID: 32866101 DOI: 10.1109/tcbb.2020.3020537] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 06/11/2023]
Abstract
Large-scale multi-omics data analysis and signature prediction have been topics of interest for the last two decades. Although various traditional clustering/correlation-based methods have been proposed, the overall prediction is not always satisfactory. To address this, we propose a new approach that combines Gene Ontology (GO) similarity with multi-omics data. A new GO similarity measure, ModSchlicker, is proposed, and its effectiveness is reviewed alongside other standard measures using various graph topology-based Information Content (IC) values of GO terms. The proposed measure is applied to protein-protein interaction (PPI) prediction. Furthermore, by involving GO similarity, we propose a new framework for stronger disease-based gene signature detection from multi-omics data. For the first objective, we predict interactions on various benchmark PPI datasets of yeast and human. For the second, gene expression and methylation profiles are used to identify Differentially Expressed and Methylated (DEM) genes. Thereafter, the GO similarity score together with a statistical method is used to determine the potential gene signature. The proposed method performs better (0.9 average accuracy and 0.95 AUC) than other existing related methods when classifying the participating features (genes) of the signature. Moreover, the proposed method is useful in other prediction/classification problems for any kind of large-scale omics data.

4
Dey L, Mukhopadhyay A. A systems biology approach for identifying key genes and pathways of gastric cancer using microarray data. Gene Reports 2021. [DOI: 10.1016/j.genrep.2020.101011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]

5
Mallik S, Zhao Z. Detecting methylation signatures in neurodegenerative disease by density-based clustering of applications with reducing noise. Sci Rep 2020; 10:22164. [PMID: 33335112 PMCID: PMC7747741 DOI: 10.1038/s41598-020-78463-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 08/02/2020] [Accepted: 11/23/2020] [Indexed: 12/20/2022]
Abstract
There have been numerous genetic and epigenetic datasets generated for the study of complex disease including neurodegenerative disease. However, analysis of such data often suffers from detecting the outliers of the samples, which subsequently affects the extraction of the true biological signals involved in the disease. To address this critical issue, we developed a novel framework for identifying methylation signatures using consecutive adaptation of a well-known outlier detection algorithm, density based clustering of applications with reducing noise (DBSCAN) followed by hierarchical clustering. We applied the framework to two representative neurodegenerative diseases, Alzheimer's disease (AD) and Down syndrome (DS), using DNA methylation datasets from public sources (Gene Expression Omnibus, GEO accession ID: GSE74486). We first applied DBSCAN algorithm to eliminate outliers, and then used Limma statistical method to determine differentially methylated genes. Next, hierarchical clustering technique was applied to detect gene modules. Our analysis identified a methylation signature comprising 21 genes for AD and a methylation signature comprising 89 genes for DS, respectively. Our evaluation indicated that these two signatures could lead to high classification accuracy values (92% and 70%) for these two diseases. In summary, this framework will be useful to better detect outlier-free genetic and epigenetic signatures in various complex diseases and their developmental stages.
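The outlier-elimination step can be illustrated with the noise rule of DBSCAN: a point is noise when it is neither a core point nor within `eps` of a core point. The sketch below covers only that noise labelling, not cluster assignment, and it is not the authors' code; the `eps` and `min_pts` values and the toy 2-D samples are assumptions for the example.

```python
import math

def dbscan_noise(points, eps, min_pts):
    """Return indices of points that DBSCAN would label as noise."""
    n = len(points)
    # Neighbourhood of each point within radius eps, including the point itself.
    neighbours = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    # A core point has at least min_pts points in its neighbourhood.
    core = [len(nb) >= min_pts for nb in neighbours]
    # A non-core point within eps of a core point is a border point;
    # a non-core point with no core neighbour is noise.
    return [
        i for i in range(n)
        if not core[i] and not any(core[j] for j in neighbours[i])
    ]

# Toy samples: four tightly grouped profiles and one distant outlier.
samples = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
noise = dbscan_noise(samples, eps=0.3, min_pts=3)
clean = [p for i, p in enumerate(samples) if i not in noise]
```

In a pipeline like the one described, the samples left in `clean` would then go on to differential analysis and hierarchical clustering.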
Affiliation(s)
- Saurav Mallik
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA
- Zhongming Zhao
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA
  - Human Genetics Center, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA
  - Department of Psychiatry and Behavioral Sciences, McGovern Medical School, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA

6
Mallik S, Qin G, Jia P, Zhao Z. Molecular signatures identified by integrating gene expression and methylation in non-seminoma and seminoma of testicular germ cell tumours. Epigenetics 2020; 16:162-176. [PMID: 32615059 DOI: 10.1080/15592294.2020.1790108] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 02/01/2023]
Abstract
Testicular germ cell tumours (TGCTs) are the most common cancer in young male adults (aged 15 to 40). Unlike for most other cancer types, the identification of molecular signatures in TGCT has rarely been reported. In this study, we developed a novel integrative analysis framework to identify modules of co-methylated and co-expressed genes [mRNAs and microRNAs (miRNAs)] in two TGCT subtypes: non-seminoma (NSE) and seminoma (SE). We first integrated DNA methylation and mRNA/miRNA expression data and then used a statistical method, CoMEx (Combined score of DNA Methylation and Expression), to assess differentially expressed and methylated (DEM) genes/miRNAs. Next, we identified co-methylation and co-expression modules by applying the WGCNA (Weighted Gene Correlation Network Analysis) tool to these DEM genes/miRNAs. The module with the highest average Pearson's Correlation Coefficient (PCC) over all pairs of molecules (genes/miRNAs) included 91 molecules. By integrating both transcription factor and miRNA regulations, we constructed subtype-specific regulatory networks for NSE and SE. We identified four hub miRNAs (miR-182-5p, miR-520b, miR-520c-3p, and miR-7-5p), two hub TFs (MYC and SP1), and two genes (RECK and TERT) in the NSE-specific regulatory network, and two hub miRNAs (miR-182-5p and miR-338-3p), five hub TFs (ETS1, HIF1A, HNF1A, MYC, and SP1), and three hub genes (CDH1, CXCR4, and SNAI1) in the SE-specific regulatory network. One miRNA (miR-182-5p) and two TFs (MYC and SP1) were common hubs of NSE and SE. We further examined pathways enriched in these subtype-specific networks. Our study provides a comprehensive view of the molecular signatures and co-regulation in two TGCT subtypes.
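Selecting the module with the highest average pairwise PCC can be sketched as follows. This is an illustration only: the module names and expression profiles are invented, and whether the study averaged signed or absolute correlations is not stated in the abstract, so the signed value is assumed here.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    # A constant profile has no defined correlation; treat it as 0 here.
    return cov / (sx * sy) if sx and sy else 0.0

def mean_pairwise_pcc(profiles):
    """Average PCC over all unordered pairs (needs at least two profiles)."""
    pairs = list(combinations(profiles, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)

# Hypothetical modules: each row is one molecule's profile across four samples.
modules = {
    "M1": [[1, 2, 3, 4], [2, 4, 6, 8], [1, 3, 5, 7]],
    "M2": [[1, 2, 3, 4], [4, 3, 2, 1], [1, 3, 2, 4]],
}
best = max(modules, key=lambda m: mean_pairwise_pcc(modules[m]))
```

Here `M1` wins because all of its profiles rise together, while `M2` mixes positively and negatively correlated pairs.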
Affiliation(s)
- Saurav Mallik
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Guimin Qin
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Peilin Jia
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Zhongming Zhao
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
  - Human Genetics Center, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, USA
  - MD Anderson Cancer Center UTHealth Graduate School of Biomedical Sciences, Houston, TX, USA
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA

7
Mallik S, Zhao Z. Graph- and rule-based learning algorithms: a comprehensive review of their applications for cancer type classification and prognosis using genomic data. Brief Bioinform 2020; 21:368-394. [PMID: 30649169 PMCID: PMC7373185 DOI: 10.1093/bib/bby120] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 07/27/2018] [Revised: 10/26/2018] [Accepted: 11/21/2018] [Indexed: 12/20/2022]
Abstract
Cancer is well recognized as a complex disease with dysregulated molecular networks or modules. Graph- and rule-based analytics have been applied extensively for cancer classification as well as prognosis using large genomic and other data over the past decade. This article provides a comprehensive review of various graph- and rule-based machine learning algorithms that have been applied to numerous genomics data to determine the cancer-specific gene modules, identify gene signature-based classifiers and carry out other related objectives of potential therapeutic value. This review focuses mainly on the methodological design and features of these algorithms to facilitate the application of these graph- and rule-based analytical approaches for cancer classification and prognosis. Based on the type of data integration, we divided all the algorithms into three categories: model-based integration, pre-processing integration and post-processing integration. Each category is further divided into four sub-categories (supervised, unsupervised, semi-supervised and survival-driven learning analyses) based on learning style. Therefore, a total of 11 categories of methods are summarized with their inputs, objectives and description, advantages and potential limitations. Next, we briefly demonstrate well-known and most recently developed algorithms for each sub-category along with salient information, such as data profiles, statistical or feature selection methods and outputs. Finally, we summarize the appropriate use and efficiency of all categories of graph- and rule mining-based learning methods when input data and specific objective are given. This review aims to help readers to select and use the appropriate algorithms for cancer classification and prognosis study.
Affiliation(s)
- Saurav Mallik
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center, Houston
- Zhongming Zhao
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center, Houston

8
Lin C, Peng B, Li Y, Wang P, Zhao G, Ding X, Li R, Zhao L, Zhang C. Cytoplasm Types Affect DNA Methylation among Different Cytoplasmic Male Sterility Lines and Their Maintainer Line in Soybean (Glycine max L.). Plants (Basel, Switzerland) 2020; 9:E385. [PMID: 32245080 PMCID: PMC7155767 DOI: 10.3390/plants9030385] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/10/2020] [Revised: 03/14/2020] [Accepted: 03/16/2020] [Indexed: 11/18/2022]
Abstract
Cytoplasmic male sterility (CMS) lines and their maintainer line have the same nucleus but different cytoplasm types. We used three soybean (Glycine max L.) CMS lines, JLCMS9A, JLCMSZ9A, and JLCMSPI9A, and their maintainer line, JLCMS9B, to explore whether methylation levels differed in their nuclei. Whole-genome bisulfite sequencing of these four lines was performed. The results show that the cytosine methylation level in the maintainer line was lower than in the CMS lines. Compared with JLCMS9B, the Gene Ontology (GO) enrichment analysis of differentially methylated region (DMR)-related genes of JLCMS9A revealed that their different 5-methylcytosine backgrounds were enriched in molecular function, whereas JLCMSZ9A and JLCMSPI9A were enriched in biological process and cellular component. The Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis of DMR-related genes and differentially methylated promoter regions in different cytosine contexts, hypomethylated or hypermethylated, showed that the numbers of DMR-related genes and promoter regions were clearly different. According to the DNA methylation and genetic distances separately, JLCMS9A clustered with JLCMS9B, and JLCMSPI9A with JLCMSZ9A. Thus, the effects of different cytoplasm types on DNA methylation were significantly different. This may be related to their genetic distances revealed by re-sequencing these lines. The detected DMR-related genes and pathways that are probably associated with CMS are also discussed.
Affiliation(s)
- Limei Zhao
  - Soybean Research Institute, The National Engineering Research Center for Soybean, Jilin Academy of Agricultural Sciences, No. 1363, Shengtai St., Changchun 130000, China
- Chunbao Zhang
  - Soybean Research Institute, The National Engineering Research Center for Soybean, Jilin Academy of Agricultural Sciences, No. 1363, Shengtai St., Changchun 130000, China

9
Mallik S, Bandyopadhyay S. WeCoMXP: Weighted Connectivity Measure Integrating Co-Methylation, Co-Expression and Protein-Protein Interactions for Gene-Module Detection. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2020; 17:690-703. [PMID: 30183644 DOI: 10.1109/tcbb.2018.2868348] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 06/08/2023]
Abstract
The identification of modules (groups of several tightly interconnected genes) in a gene interaction network is essential for a better understanding of the architecture of the whole network. In this article, we develop a novel weighted connectivity measure integrating co-methylation, co-expression, and protein-protein interactions (called WeCoMXP) to detect gene modules in multi-omics datasets. The proposed measure goes beyond the basic degree centrality measure by incorporating a formulation of higher-order connections. Thereafter, we apply the average linkage clustering method to the dissimilarity (distance) values derived from the WeCoMXP scores, and use a dynamic tree cut method to identify gene modules. We validate the modules through literature search, KEGG pathway, and Gene Ontology analyses of the genes representing the modules. Furthermore, we identify the top 10 TFs/miRNAs that are connected with the maximum number of gene modules and that regulate/target the maximum number of genes from these connected modules. Moreover, the proposed method outperforms the existing methods on several cluster-validity indices in most cases.

10
Qin G, Mallik S, Mitra R, Li A, Jia P, Eischen CM, Zhao Z. MicroRNA and transcription factor co-regulatory networks and subtype classification of seminoma and non-seminoma in testicular germ cell tumors. Sci Rep 2020; 10:852. [PMID: 31965022 PMCID: PMC6972857 DOI: 10.1038/s41598-020-57834-w] [Citation(s) in RCA: 25] [Impact Index Per Article: 6.3] [Received: 11/07/2019] [Accepted: 12/24/2019] [Indexed: 12/11/2022]
Abstract
Recent studies have revealed that feed-forward loops (FFLs) as regulatory motifs have synergistic roles in cellular systems and their disruption may cause diseases including cancer. FFLs may include two regulators such as transcription factors (TFs) and microRNAs (miRNAs). In this study, we extensively investigated TF and miRNA regulation pairs, their FFLs, and TF-miRNA mediated regulatory networks in two major types of testicular germ cell tumors (TGCT): seminoma (SE) and non-seminoma (NSE). Specifically, we identified differentially expressed mRNA genes and miRNAs in 103 tumors using the transcriptomic data from The Cancer Genome Atlas. Next, we determined significantly correlated TF-gene/miRNA and miRNA-gene/TF pairs with regulation direction. Subsequently, we determined 288 and 664 dysregulated TF-miRNA-gene FFLs in SE and NSE, respectively. By constructing dysregulated FFL networks, we found that many hub nodes (12 out of 30 for SE and 8 out of 32 for NSE) in the top ranked FFLs could predict subtype-classification (Random Forest classifier, average accuracy ≥90%). These hub molecules were validated by an independent dataset. Our network analysis pinpointed several SE-specific dysregulated miRNAs (miR-200c-3p, miR-25-3p, and miR-302a-3p) and genes (EPHA2, JUN, KLF4, PLXDC2, RND3, SPI1, and TIMP3) and NSE-specific dysregulated miRNAs (miR-367-3p, miR-519d-3p, and miR-96-5p) and genes (NR2F1 and NR2F2). This study is the first systematic investigation of TF and miRNA regulation and their co-regulation in two major TGCT subtypes.
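A feed-forward loop of the kind discussed here is a three-node motif in which one regulator controls a second regulator and both control the same target gene. A minimal sketch of enumerating TF-miRNA-gene FFLs from a directed edge list is shown below; the node names and edges are hypothetical, and the correlation and direction filters that the study applied before this step are not reproduced.

```python
def find_ffls(edges, tfs, mirnas):
    """List (regulator_a, regulator_b, gene) feed-forward loops.

    edges: iterable of (source, target) regulation pairs.
    tfs, mirnas: sets of node names for the two regulator types.
    An FFL needs a -> b, a -> gene and b -> gene, with a and b being
    one TF and one miRNA (in either order).
    """
    targets = {}
    for src, dst in set(edges):
        targets.setdefault(src, set()).add(dst)
    regulators = set(tfs) | set(mirnas)
    ffls = []
    for a in sorted(regulators):
        for b in sorted(targets.get(a, ())):
            # The second regulator must be of the other type (TF vs. miRNA).
            if b not in regulators or (a in tfs) == (b in tfs):
                continue
            shared = targets.get(a, set()) & targets.get(b, set())
            for gene in sorted(shared - regulators):
                ffls.append((a, b, gene))
    return ffls

# Hypothetical regulatory edges.
edges = [
    ("TF1", "miR-x"), ("TF1", "G1"), ("miR-x", "G1"),
    ("miR-x", "G2"), ("TF2", "G2"),
]
ffls = find_ffls(edges, tfs={"TF1", "TF2"}, mirnas={"miR-x"})
```

Only `TF1`, `miR-x` and `G1` close a loop in this toy network; `G2` is targeted by two regulators that do not regulate each other, so it forms no FFL.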
Affiliation(s)
- Guimin Qin
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
  - School of Computer Science and Technology, Xidian University, Xi'an, Shaanxi, China
- Saurav Mallik
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Ramkrishna Mitra
  - Department of Cancer Biology, Sidney Kimmel Cancer Center, Thomas Jefferson University, Philadelphia, PA, USA
- Aimin Li
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
  - School of Computer Science and Engineering, Xi'an University of Technology, Xi'an, Shaanxi, China
- Peilin Jia
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Christine M Eischen
  - Department of Cancer Biology, Sidney Kimmel Cancer Center, Thomas Jefferson University, Philadelphia, PA, USA
- Zhongming Zhao
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
  - Human Genetics Center, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, USA

11
Mallik S, Zhao Z. Multi-Objective Optimized Fuzzy Clustering for Detecting Cell Clusters from Single-Cell Expression Profiles. Genes (Basel) 2019; 10:E611. [PMID: 31412637 PMCID: PMC6723724 DOI: 10.3390/genes10080611] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 06/21/2019] [Revised: 07/30/2019] [Accepted: 08/07/2019] [Indexed: 02/06/2023]
Abstract
Rapid advances in single-cell RNA sequencing (scRNA-seq) allow measurement of gene expression at single-cell resolution in complex disease or tissue. While many methods have been developed to detect cell clusters from scRNA-seq data, this task remains a major challenge. We proposed a multi-objective optimization-based fuzzy clustering approach for detecting cell clusters from scRNA-seq data. First, we conducted initial filtering and SCnorm normalization. We considered various case studies by selecting different cluster numbers (cl = 2 to a user-defined number) and applied the fuzzy c-means clustering algorithm to each. For each case, we evaluated four cluster validity indices: Partition Entropy (PE), Partition Coefficient (PC), Modified Partition Coefficient (MPC), and Fuzzy Silhouette Index (FSI). Next, we set the first measure as a minimization objective (↓) and the remaining three as maximization objectives (↑), and then applied a multi-objective decision-making technique, TOPSIS, to identify the optimal solution. The case study with the highest TOPSIS score was selected as the final optimal clustering. Finally, we obtained differentially expressed genes (DEGs) using Limma by comparing the expression of the samples in each resultant cluster with those in the remaining clusters. We applied our approach to a scRNA-seq dataset for the rare intestinal cell type in mice [GEO ID: GSE62270, 23,630 features (genes) and 288 cells]. The optimal cluster result (TOPSIS optimal score = 0.858) comprised two clusters, one with 115 cells and the other with 91 cells. The scores of the four cluster validity indices FSI, PE, PC, and MPC for the optimized fuzzy clustering were 0.482, 0.578, 0.607, and 0.215, respectively. The Limma analysis identified 1240 DEGs (cluster 1 vs. cluster 2). The top ten gene markers were Rps21, Slc5a1, Crip1, Rpl15, Rpl3, Rpl27a, Khk, Rps3a1, Aldob and Rps17. In this list, Khk (encoding ketohexokinase) is a novel marker for the rare intestinal cell type. In summary, this method is useful for detecting cell clusters from scRNA-seq data.
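Three of the four validity indices named in this abstract can be computed directly from a fuzzy membership matrix. The sketch below uses their standard textbook definitions, which are assumed rather than taken from the paper; the natural logarithm in PE and the toy membership matrix are likewise assumptions, and FSI is left out because it also needs the pairwise distances between cells.

```python
import math

def partition_coefficient(u):
    """PC: mean of squared memberships; 1.0 for a crisp partition (maximize)."""
    return sum(m * m for row in u for m in row) / len(u)

def partition_entropy(u):
    """PE: mean membership entropy; 0.0 for a crisp partition (minimize)."""
    return -sum(m * math.log(m) for row in u for m in row if m > 0) / len(u)

def modified_partition_coefficient(u):
    """MPC: PC rescaled to [0, 1] so that it does not favour few clusters."""
    c = len(u[0])  # number of clusters, must be at least 2
    return 1 - c / (c - 1) * (1 - partition_coefficient(u))

# Toy membership matrix: one row per cell, one column per cluster, rows sum to 1.
u = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.5, 0.5]]
pc = partition_coefficient(u)
pe = partition_entropy(u)
mpc = modified_partition_coefficient(u)
```

As a consistency check on the formula, two clusters and the reported PC of 0.607 give an MPC of 1 - 2 × (1 - 0.607) ≈ 0.214, in line with the reported 0.215. Scores like these, computed once per candidate cluster number, would form the decision matrix passed to TOPSIS, with PE as the cost criterion.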
Affiliation(s)
- Saurav Mallik
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
- Zhongming Zhao
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, USA

12
Karwacki MW, Wysocki M, Perek-Polnik M, Jatczak-Gaca A. Coordinated medical care for children with neurofibromatosis type 1 and related RASopathies in Poland. Arch Med Sci 2019; 17:1221-1231. [PMID: 34522251 PMCID: PMC8425254 DOI: 10.5114/aoms.2019.85143] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 01/11/2019] [Accepted: 04/12/2019] [Indexed: 11/17/2022]
Abstract
Coordinated medical care offered in Poland to patients with neurofibromatosis type 1 and related RASopathies combines complex multispecialty consultation with permanent supervision and patient-oriented longitudinal care. Neurofibromatosis type 1 is one of the most common single-gene disorders in the global population, observed in 1 out of 2500-3000 live births. It is a primary neoplasia disease with 100% penetrance of the gene mutation but a remarkable age-dependent onset of different signs and symptoms, outstanding clinical heterogeneity between patients even within one family, a lack of genotype-phenotype correlation, a high rate of spontaneous mutation exceeding 50%, and multiple comorbidities, among which the increased risk of malignancy is the most important. Medical practice has shown that care which is not only patient-oriented and complex but also coordinated, provided in centers of competence, is indispensable for patients and their families; it gives them a sense of medical security while rationalizing public health costs.
Affiliation(s)
- Marek W. Karwacki
  - Coordinated Care Center for Neurofibromatoses and related RASopathies, Department of Pediatrics, Hematology and Oncology, Medical University of Warsaw, Poland
- Mariusz Wysocki
  - Department of Paediatrics, Haematology and Oncology, Collegium Medicum in Bydgoszcz, Nicolaus Copernicus University in Torun, Poland
- Marta Perek-Polnik
  - Neuro-oncology Division, Department of Oncology, The Children’s Memorial Health Institute, Warsaw, Poland
- Agnieszka Jatczak-Gaca
  - Department of Paediatrics, Haematology and Oncology, Collegium Medicum in Bydgoszcz, Nicolaus Copernicus University in Torun, Poland

13
Mallik S, Zhao Z. Identification of gene signatures from RNA-seq data using Pareto-optimal cluster algorithm. BMC Systems Biology 2018; 12:126. [PMID: 30577846 PMCID: PMC6302366 DOI: 10.1186/s12918-018-0650-2] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Indexed: 12/24/2022]
Abstract
Background: Gene signatures represent the molecular changes in disease genomes or in cells under specific conditions, and are often used to separate samples into groups for research or clinical treatment. Although many methods and applications are available in the literature, powerful methods that can account for complex data and detect the most informative signatures are still lacking. Methods: In this article, we present a new framework for identifying gene signatures using Pareto-optimal cluster size identification for RNA-seq data. We first performed pre-filtering and normalization, then used the empirical Bayes test in the Limma package to identify the differentially expressed genes (DEGs). Next, we applied a multi-objective optimization technique, “Multi-objective optimization for collecting cluster alternatives” (the MOCCA R package), to these DEGs to find the Pareto-optimal cluster size, and then applied k-means clustering to the RNA-seq data based on the optimal cluster size. The best cluster was obtained by computing the average pairwise Spearman’s correlation score among all genes belonging to each module. The best cluster is treated as the signature for the respective disease or cellular condition. Results: We applied our framework to a cervical cancer RNA-seq dataset, which included 253 squamous cell carcinoma (SCC) samples and 22 adenocarcinoma (ADENO) samples. We identified a total of 582 DEGs by Limma analysis of SCC versus ADENO samples. Among them, 260 are up-regulated genes and 322 are down-regulated genes. Using MOCCA, we obtained seven Pareto-optimal clusters. The best cluster has a total of 35 DEGs, all of which are up-regulated. For validation, we ran the PAMR (prediction analysis for microarrays) classifier on the selected best cluster and assessed the classification performance. Our evaluation, measured by sensitivity, specificity, precision, and accuracy, showed high confidence.
Conclusions: Our framework identified a multi-objective-based cluster that is treated as a signature and that can classify the disease and control groups of samples with high classification performance (accuracy 0.935) for the corresponding disease. Our method is useful for finding signatures in any RNA-seq or microarray data.
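The cluster-selection step described in this abstract (pick the cluster whose genes have the highest average pairwise Spearman correlation) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy expression profiles, and the cluster labels `c1` and `c2` are all hypothetical, and the sketch assumes every cluster has at least two genes and no gene has a constant profile.

```python
from itertools import combinations

def ranks(values):
    """Return 1-based ranks, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def mean_pairwise_spearman(profiles):
    """Average Spearman correlation over all gene pairs in one cluster."""
    pairs = list(combinations(profiles, 2))
    return sum(spearman(a, b) for a, b in pairs) / len(pairs)

def best_cluster(clusters):
    """Return the name of the cluster with the highest average correlation."""
    return max(clusters, key=lambda name: mean_pairwise_spearman(clusters[name]))

# Hypothetical toy data: each cluster maps to a list of gene expression profiles.
clusters = {
    "c1": [[1, 2, 3, 4], [2, 4, 6, 9], [1, 3, 5, 8]],   # co-varying genes
    "c2": [[1, 2, 3, 4], [4, 3, 2, 1], [2, 1, 4, 3]],   # mixed directions
}
print(best_cluster(clusters))
```

In practice the cluster assignments would come from k-means at the Pareto-optimal cluster size; here they are supplied by hand only to show the scoring rule.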
Affiliation(s)
- Saurav Mallik
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, 77030, TX, USA
- Zhongming Zhao
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, 77030, TX, USA; Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, 37232, TN, USA
14
Cheng AA, Li W, Hernandez LL. Effect of high-fat diet feeding and associated transcriptome changes in the peak lactation mammary gland in C57BL/6 dams. Physiol Genomics 2018; 50:1059-1070. [DOI: 10.1152/physiolgenomics.00052.2018] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Indexed: 12/14/2022] Open
Abstract
Maternal consumption of a high-fat diet (HFD) during pregnancy has established adverse effects on the developing neonate. In this study, we aimed to investigate the effect of an HFD on the murine mammary gland during midlactation. Female C57BL/6J mice were placed on either a low-fat diet (LFD; 10% fat) or an HFD (60% fat) from 3 wk of age through peak lactation (lactation day 11, L11). After 4 wk of consuming either the LFD or HFD, female mice were bred. Milk yield, measured from L1 to L9, did not differ significantly between treatment groups. On L10, mice were subjected to an overnight fast and then euthanized on the morning of L11. Total RNA was isolated from inguinal mammary glands for whole-transcriptome sequencing. We found 628 genes that were differentially expressed between the treatment groups. Notably, HFD feeding altered the expression of genes involved in collagen and cytoplasmic components. Genes related to inflammatory and immune responses were also affected. Differential expression of gene transcript isoforms between the treatment groups was detected in three genes related to mammary duct development. This study sheds light on how an HFD may affect the mammary gland transcriptome during midlactation.
Affiliation(s)
- A. A. Cheng
- Department of Dairy Sciences, University of Wisconsin, Madison, Wisconsin
- W. Li
- United States Department of Agriculture Dairy Forage, Madison, Wisconsin
- L. L. Hernandez
- Department of Dairy Sciences, University of Wisconsin, Madison, Wisconsin
15
Mallik S, Bhadra T, Mukherji A. DTFP-Growth: Dynamic Threshold-Based FP-Growth Rule Mining Algorithm Through Integrating Gene Expression, Methylation, and Protein-Protein Interaction Profiles. IEEE Trans Nanobioscience 2018; 17:117-125. [PMID: 29870335 DOI: 10.1109/tnb.2018.2803021] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 11/10/2022]
Abstract
Association rule mining is an important technique for identifying interesting relationships between gene pairs in a biological data set. Earlier methods generally work on a single biological data set, and in most cases a single minimum support cutoff is applied globally, i.e., across all genesets/itemsets. To overcome this limitation, we propose a dynamic threshold-based FP-growth rule mining algorithm that integrates gene expression, methylation, and protein-protein interaction profiles based on weighted shortest distance to find novel associations among different pairs of genes in multi-view data sets. For this purpose, we introduce three new thresholds for each rule, namely Distance-based Variable/Dynamic Supports (DVS), Distance-based Variable Confidences (DVC), and Distance-based Variable Lifts (DVL), by integrating the co-expression, co-methylation, and protein-protein interactions present in the multi-omics data set. We develop the proposed algorithm using these three novel threshold measures. In the proposed algorithm, the values of DVS, DVC, and DVL are computed for each rule separately, and it is then verified whether the support, confidence, and lift of each evolved rule are greater than or equal to the corresponding DVS, DVC, and DVL values, respectively. If all three conditions hold for a rule, the rule is treated as a resultant rule. One of the major advantages of the proposed method over related state-of-the-art methods is that it considers both the quantitative and the interactive significance among all pairwise genes belonging to each rule. Moreover, the proposed method generates fewer rules, takes less running time, and provides greater biological significance for the resultant top-ranking rules than previous methods.
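The acceptance test described in this abstract (a rule is kept only if its support, confidence, and lift each reach that rule's own threshold) can be sketched as follows. This is an illustrative sketch only: the abstract does not give the formulas for DVS, DVC, and DVL, so the thresholds are passed in as plain numbers, and the function names, gene labels, and transactions are hypothetical. Support, confidence, and lift use their standard definitions.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) of the rule antecedent -> consequent."""
    s_rule = support(transactions, set(antecedent) | set(consequent))
    s_ante = support(transactions, antecedent)
    s_cons = support(transactions, consequent)
    confidence = s_rule / s_ante
    lift = confidence / s_cons
    return s_rule, confidence, lift

def is_resultant_rule(transactions, antecedent, consequent, dvs, dvc, dvl):
    """Keep the rule only if all three metrics reach its own thresholds.

    dvs, dvc, dvl stand in for the rule-specific thresholds; how they are
    derived from the integrated distances is not reproduced here.
    """
    s, c, l = rule_metrics(transactions, antecedent, consequent)
    return s >= dvs and c >= dvc and l >= dvl

# Hypothetical transactions: each set lists the genes "active" in one sample.
transactions = [
    {"G1", "G2", "G3"},
    {"G1", "G2"},
    {"G1", "G3"},
    {"G2", "G3"},
    {"G1", "G2", "G3"},
]

print(rule_metrics(transactions, {"G1"}, {"G2"}))
print(is_resultant_rule(transactions, {"G1"}, {"G2"}, dvs=0.5, dvc=0.7, dvl=0.9))
```

Because the thresholds are arguments, two rules mined from the same data can be held to different cutoffs, which is the point of contrast with a single global minimum support.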
16
Li W, Bickhart DM, Ramunno L, Iamartino D, Williams JL, Liu GE. Genomic structural differences between cattle and River Buffalo identified through comparative genomic and transcriptomic analysis. Data Brief 2018; 19:236-239. [PMID: 29892639 PMCID: PMC5993156 DOI: 10.1016/j.dib.2018.05.015] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 02/28/2018] [Revised: 04/16/2018] [Accepted: 05/04/2018] [Indexed: 12/01/2022] Open
Abstract
Water buffalo (Bubalus bubalis L.) is an important livestock species worldwide. Like many other livestock species, water buffalo lacks a high-quality, continuous reference genome assembly, which is required for fine-scale comparative genomics studies. In this work, we present a dataset that characterizes genomic differences between the water buffalo genome and the extensively studied cattle (Bos taurus taurus) reference genome. The dataset was obtained by aligning 14 river buffalo whole-genome sequencing datasets to the cattle reference. It consists of 13,444 deletion CNV regions and 11,050 merged mobile element insertion (MEI) events within the upstream regions of annotated cattle genes. Gene expression data from cattle and buffalo are also presented for genes affected by these regions. Public assessment of this dataset will allow further analyses and functional annotation of genes that are potentially associated with phenotypic differences between cattle and water buffalo.
Affiliation(s)
- Wenli Li
- The Cell Wall Utilization and Biology Laboratory, US Dairy Forage Research Center, USDA ARS, Madison, WI 53706, USA
- Derek M Bickhart
- The Cell Wall Utilization and Biology Laboratory, US Dairy Forage Research Center, USDA ARS, Madison, WI 53706, USA
- Luigi Ramunno
- Dipartimento di Agraria, Università degli Studi di Napoli "Federico II", via Università 100, 80055 Portici, NA, Italy
- Daniela Iamartino
- AIA-LGS, Associazione Italiana Allevatori - Laboratorio Genetica e Servizi, Via Bergamo 292, 26100 Cremona, CR, Italy; Parco Tecnologico Padano, Via Einstein, 26500 Lodi, Italy
- John L Williams
- Davies Research Centre, School of Animal and Veterinary Sciences, University of Adelaide, Roseworthy, SA 5371, Australia
- George E Liu
- The Animal Genomics and Improvement Laboratory, USDA ARS, Beltsville, MD, USA
17
Li W, Bickhart DM, Ramunno L, Iamartino D, Williams JL, Liu GE. Comparative sequence alignment reveals River Buffalo genomic structural differences compared with cattle. Genomics 2018; 111:418-425. [PMID: 29501677 DOI: 10.1016/j.ygeno.2018.02.018] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 10/27/2017] [Revised: 02/12/2018] [Accepted: 02/28/2018] [Indexed: 10/17/2022]
Abstract
This study sought to characterize differences in gene content, regulation, and structure between taurine cattle and river buffalo (one subspecies of domestic water buffalo) using the extensively annotated UMD3.1 cattle reference genome as a basis for comparison. We identified 127 deletion CNV regions in river buffalo representing 5 annotated cattle genes. We also characterized 583 merged mobile element insertion (MEI) events within the upstream regions of annotated cattle genes. Transcriptome analysis of various river buffalo tissue types confirmed the absence of four cattle genes. Four genes that may be related to phenotypic differences in meat quality and color had upstream MEI predictions and were found to have significantly elevated expression in river buffalo compared with cattle. Our comparative alignment approach and gene expression analyses suggested a functional role for many genomic structural variations, which may contribute to the unique phenotypes of river buffalo.
Affiliation(s)
- Wenli Li
- The Cell Wall Utilization and Biology Laboratory, US Dairy Forage Research Center, USDA ARS, Madison, WI 53706, USA
- Derek M Bickhart
- The Cell Wall Utilization and Biology Laboratory, US Dairy Forage Research Center, USDA ARS, Madison, WI 53706, USA
- Luigi Ramunno
- Dipartimento di Agraria, Università degli Studi di Napoli "Federico II", via Università 100, 80055 Portici (NA), Italy
- Daniela Iamartino
- AIA-LGS, Associazione Italiana Allevatori - Laboratorio Genetica e Servizi, Via Bergamo 292, 26100 Cremona (CR), Italy; Parco Tecnologico Padano, Via Einstein, 26500 Lodi, Italy
- John L Williams
- Davies Research Centre, School of Animal and Veterinary Sciences, University of Adelaide, Roseworthy, SA 5371, Australia
- George E Liu
- The Animal Genomics and Improvement Laboratory, USDA ARS, Beltsville, MD, USA
18
Bandyopadhyay S, Mallik S. Integrating Multiple Data Sources for Combinatorial Marker Discovery: A Study in Tumorigenesis. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2018; 15:673-687. [PMID: 28114033 DOI: 10.1109/tcbb.2016.2636207] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Indexed: 06/06/2023]
Abstract
Identification of combinatorial markers from multiple data sources is a challenging task in bioinformatics. Here, we propose a novel computational framework for identifying significant combinatorial markers using both gene expression and methylation data. The gene expression and methylation data are integrated into a single continuous data set as well as a (post-discretized) boolean data set based on their intrinsic (i.e., inverse) relationship. A novel combined score of methylation and expression data is introduced, which is computed on the integrated continuous data to identify an initial non-redundant set of genes. Thereafter, (maximal) frequent closed homogeneous genesets are identified using a well-known biclustering algorithm applied to the integrated boolean data of the determined non-redundant set of genes. A novel sample-based weighted support is then proposed, which is calculated on the same integrated boolean data in order to identify the non-redundant significant genesets. The top few resulting genesets are identified as potential combinatorial markers. Since our proposed method generates fewer significant non-redundant genesets than other popular methods, it is much faster than the others. Application of the proposed technique to expression and methylation data for uterine tumor or prostate carcinoma produces a set of significant combinations of markers. We expect that such a combination of markers will produce fewer false positives than individual markers.
19
Maulik U, Sen S, Mallik S, Bandyopadhyay S. Detecting TF-miRNA-gene network based modules for 5hmC and 5mC brain samples: a intra- and inter-species case-study between human and rhesus. BMC Genet 2018; 19:9. [PMID: 29357837 PMCID: PMC5776763 DOI: 10.1186/s12863-017-0574-7] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Received: 03/23/2017] [Accepted: 11/29/2017] [Indexed: 01/09/2023] Open
Abstract
Background: The study of epigenetics is currently a high-impact research topic. Multi-stage methylation is also an area of high-dimensional prospect. In this article, we provide a new intra- and inter-species study of brain tissue in human and rhesus on data profiles of two methylated cytosine variants (viz., 5-hydroxymethylcytosine (5hmC) and 5-methylcytosine (5mC) samples) through TF-miRNA-gene network based module detection. Results: First, we determine differentially 5hmC-methylated genes for human as well as rhesus for the intra-species analysis, and differentially multi-stage methylated genes for the inter-species analysis. Thereafter, we apply the weighted topological overlap matrix (TOM) measure and average linkage clustering consecutively to these genesets for the intra- and inter-species study. We identify co-methylated and multi-stage co-methylated gene modules by using dynamic tree cut for the intra- and inter-species cases, respectively. Each module is represented by an individual color in the dendrogram. Gene Ontology and KEGG pathway based analyses are then performed to identify the biological functions of the identified modules. Finally, the top ten regulator TFs and targeter miRNAs that are associated with the maximum number of gene modules are determined for both the intra- and inter-species analyses. Conclusions: The novel TFs and miRNAs obtained from the analysis are: MYST3 and ZNF771 as TFs (for human intra-species analysis); BAZ2B, RCOR3 and ATF1 as TFs (for rhesus intra-species analysis); mml-miR-768-3p and mml-miR-561 as miRs (for rhesus intra-species analysis); and MYST3 and ZNF771 as miRs (for inter-species study).
Furthermore, the genes/TFs/miRNAs that have already been implicated in several severe brain-related diseases as well as rare neglected diseases (e.g., Wolf-Hirschhorn syndrome, Joubert syndrome, Huntington’s disease, simian immunodeficiency virus (SIV)-mediated encephalitis, Parkinson’s disease, bipolar disorder, and schizophrenia) are mentioned. Electronic supplementary material: The online version of this article (doi:10.1186/s12863-017-0574-7) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Ujjwal Maulik
- Department of Computer Science and Engineering, Jadavpur University, Kolkata, 700032, India
- Sagnik Sen
- Department of Computer Science and Engineering, Jadavpur University, Kolkata, 700032, India
- Saurav Mallik
- Department of Computer Science and Engineering, Jadavpur University, Kolkata, 700032, India
20
Mallik S, Zhao Z. Towards integrated oncogenic marker recognition through mutual information-based statistically significant feature extraction: an association rule mining based study on cancer expression and methylation profiles. Quantitative Biology 2017; 5:302-327. [PMID: 30221015 DOI: 10.1007/s40484-017-0119-0] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Indexed: 02/07/2023]
Abstract
Background: Marker detection is an important task in complex disease studies. Here we provide an association rule mining (ARM) based approach for identifying integrated markers through mutual information (MI) based statistically significant feature extraction, and apply it to acute myeloid leukemia (AML) and prostate carcinoma (PC) gene expression and methylation profiles. Methods: We first collect the genes having both expression and methylation values in AML as well as PC. Next, we run the Jarque-Bera normality test on the expression/methylation data to divide the whole dataset into two parts: one that follows a normal distribution and one that does not. This gives four parts of the dataset: normally distributed expression data, normally distributed methylation data, non-normally distributed expression data, and non-normally distributed methylation data. A feature-extraction technique, "mRMR", is then applied to each part, which results in a list of top-ranked genes. Next, we apply the Welch t-test (parametric test) and the Shrink t-test (non-parametric test) to the expression/methylation data for the top selected normally distributed genes and non-normally distributed genes, respectively. We then use a recent weighted ARM method, "RANWAR", to combine all/specific resultant genes to generate top oncogenic rules along with the respective integrated markers. Finally, we perform a literature search as well as KEGG pathway and Gene Ontology (GO) analyses using the Enrichr database for in silico validation of the prioritized oncogenes as markers and for labeling the markers as existing or novel. Results: The novel markers of AML are {ABCB11↑∪KRT17↓} (i.e., ABCB11 as up-regulated, & KRT17 as down-regulated), and {AP1S1-∪KRT17↓∪NEIL2-∪DYDC1↓} (i.e., AP1S1 and NEIL2 both as hypo-methylated, & KRT17 and DYDC1 both as down-regulated).
The novel marker of PC is {UBIAD1¶∪APBA2‡∪C4orf31‡} (i.e., UBIAD1 as up-regulated and hypo-methylated, & APBA2 and C4orf31 both as down-regulated and hyper-methylated). Conclusion: The identified novel markers might have critical roles in AML as well as PC. The approach can be applied to other complex diseases.
Affiliation(s)
- Saurav Mallik
- Computer Science & Engineering, Aliah University, Newtown, Newtown 700156, India
- Zhongming Zhao
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
21
Lee H, Shin M. Mining pathway associations for disease-related pathway activity analysis based on gene expression and methylation data. BioData Min 2017; 10:3. [PMID: 28168005 PMCID: PMC5286825 DOI: 10.1186/s13040-017-0127-7] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 05/03/2016] [Accepted: 01/26/2017] [Indexed: 12/19/2022] Open
Abstract
Background: Discovering genetic markers as disease signatures is of great significance for the successful diagnosis, treatment, and prognosis of complex diseases. Although many earlier studies worked on identifying disease markers from a variety of biological resources, they mostly focused on markers of genes or gene-sets (i.e., pathways). However, these markers may not be enough to explain biological interactions between genetic variables that are related to diseases. Thus, in this study, our aim is to investigate distinctive associations among active pathways (i.e., pathway-sets), observed separately in case and control samples, that can be derived from gene expression and/or methylation data. Results: The pathway-sets are obtained by identifying a set of associated pathways that are often active together over a significant number of class samples. For this purpose, gene expression or methylation profiles are first analyzed to identify significant (active) pathways via gene-set enrichment analysis. Then, for these active pathways, an association rule mining approach is applied to find interesting pathway-sets in each class of samples (case or control). In this way, the sets of associated pathways that often work together in the activity profiles are chosen as the distinctive signature of each class. The identified pathway-sets are aggregated into a pathway activity network (PAN), which facilitates the visualization of differential pathway associations between case and control samples. In our experiments with two publicly available datasets, we found interesting PAN structures as the distinctive signatures of breast cancer and uterine leiomyoma cancer, respectively. Conclusions: Our pathway-set markers were shown to be superior or very comparable to other genetic markers (such as genes or gene-sets) in disease classification.
Furthermore, the PAN structure, which can be constructed from the identified pathway-set markers, could provide deeper insights into distinctive associations between pathway activities in case and control samples.
Affiliation(s)
- Hyeonjeong Lee
- Bio-Intelligence & Data Mining Laboratory, Graduate School of Electronics Engineering, Kyungpook National University, 80, Daehak-ro, Buk-gu, Daegu, 41566 Republic of Korea
- Miyoung Shin
- School of Electronics Engineering, Kyungpook National University, 80, Daehak-ro, Buk-gu, Daegu, 41566 Republic of Korea
22
Mallik S, Bhadra T, Maulik U. Identifying Epigenetic Biomarkers using Maximal Relevance and Minimal Redundancy Based Feature Selection for Multi-Omics Data. IEEE Trans Nanobioscience 2017; 16:3-10. [PMID: 28092570 DOI: 10.1109/tnb.2017.2650217] [Citation(s) in RCA: 36] [Impact Index Per Article: 5.1] [Indexed: 11/08/2022]
Abstract
Epigenetic biomarker discovery is an important task in bioinformatics. In this article, we develop a new framework for identifying statistically significant epigenetic biomarkers using maximal-relevance and minimal-redundancy criterion based feature (gene) selection for multi-omics datasets. First, we determine the genes that have both expression and methylation values and follow a normal distribution. Similarly, we identify the genes that have both expression and methylation values but do not follow a normal distribution. For each case, we use a gene-selection method that provides maximally relevant but variable-weighted minimally redundant genes as the top-ranked genes. For statistical validation, we apply a t-test to both the expression and methylation data consisting of only the normally distributed top-ranked genes to determine how many of them are both differentially expressed and methylated. Similarly, we use the Limma package to perform a non-parametric empirical Bayes test on both the expression and methylation data comprising only the non-normally distributed top-ranked genes to identify how many of them are both differentially expressed and methylated. We finally report the top-ranking significant gene markers with biological validation. Moreover, our framework improves the positive predictive rate and reduces the false positive rate in marker identification. In addition, we provide a comparative analysis of our gene-selection method and other methods based on the classification performance obtained using several well-known classifiers.
23
Mallik S, Sen S, Maulik U. IDPT: Insights into potential intrinsically disordered proteins through transcriptomic analysis of genes for prostate carcinoma epigenetic data. Gene 2016; 586:87-96. [DOI: 10.1016/j.gene.2016.03.056] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 12/02/2015] [Revised: 02/22/2016] [Accepted: 03/30/2016] [Indexed: 12/13/2022]
24
Mallik S, Maulik U. MiRNA-TF-gene network analysis through ranking of biomolecules for multi-informative uterine leiomyoma dataset. J Biomed Inform 2015; 57:308-19. [PMID: 26297985 DOI: 10.1016/j.jbi.2015.08.014] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Received: 12/11/2014] [Revised: 06/26/2015] [Accepted: 08/11/2015] [Indexed: 12/12/2022]
Abstract
Gene ranking is an important problem in bioinformatics. Here, we propose a new framework for ranking biomolecules (viz., miRNAs, transcription factors/TFs, and genes) in a multi-informative uterine leiomyoma dataset having both gene expression and methylation data, using a (statistical) eigenvector centrality based approach. First, genes that are both differentially expressed and differentially methylated are identified using the Limma statistical test. A network comprising these genes, the corresponding TFs from the TRANSFAC and ITFP databases, and targeter miRNAs from the miRWalk database is then built. The biomolecules are then ranked based on eigenvector centrality. Our proposed method provides better average accuracy in hub gene and non-hub gene classification than other methods. Furthermore, pre-ranked gene set enrichment analysis is applied to the pathway database as well as the GO-term databases of the Molecular Signatures Database, using a pre-ranked gene list based on different centrality values, to compare the ranking methods. Finally, top novel potential gene markers for uterine leiomyoma are provided.
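The ranking step in this abstract relies on eigenvector centrality, which can be sketched with a plain power iteration on a small undirected network. This is a minimal sketch under stated assumptions, not the authors' pipeline: the node names (`TF1`, `geneA`, `geneB`, `miR1`) and edges are hypothetical, the network is assumed to be connected and unweighted, and the iteration uses A + I instead of A (same eigenvectors, but it avoids oscillation on bipartite networks).

```python
def eigenvector_centrality(adjacency, iterations=200, tol=1e-10):
    """Power iteration for eigenvector centrality on an undirected graph.

    `adjacency` maps each node to the list of its neighbours.
    Scores are normalised to unit Euclidean length.
    """
    nodes = sorted(adjacency)
    score = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # One multiplication by (A + I): own score plus neighbours' scores.
        new = {n: score[n] + sum(score[m] for m in adjacency[n]) for n in nodes}
        norm = sum(v * v for v in new.values()) ** 0.5
        new = {n: v / norm for n, v in new.items()}
        converged = max(abs(new[n] - score[n]) for n in nodes) < tol
        score = new
        if converged:
            break
    return score

# Hypothetical TF-miRNA-gene network (undirected edges listed both ways).
network = {
    "TF1": ["geneA", "geneB", "miR1"],
    "geneA": ["TF1", "geneB"],
    "geneB": ["TF1", "geneA"],
    "miR1": ["TF1"],
}

scores = eigenvector_centrality(network)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

A node scores highly when it is connected to other high-scoring nodes, so the hub `TF1` ranks first and the singly connected `miR1` ranks last in this toy network.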
Affiliation(s)
- Saurav Mallik
- Machine Intelligence Unit, Indian Statistical Institute, Kolkata 700108, India
- Ujjwal Maulik
- Department of Computer Science and Engineering, Jadavpur University, Kolkata 700032, India