1
Bhowmick SS, Bhattacharjee D, Rato L. Integrated analysis of the miRNA–mRNA next-generation sequencing data for finding their associations in different cancer types. Comput Biol Chem 2020; 84:107152. [DOI: 10.1016/j.compbiolchem.2019.107152] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/09/2017] [Revised: 10/05/2019] [Accepted: 10/15/2019] [Indexed: 12/21/2022]

2
Bhowmick SS, Bhattacharjee D, Rato L. In silico markers: an evolutionary and statistical approach to select informative genes of human breast cancer subtypes. Genes Genomics 2019; 41:1371-1382. [PMID: 31004329 DOI: 10.1007/s13258-019-00816-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/25/2018] [Accepted: 04/02/2019] [Indexed: 10/27/2022]
Abstract
BACKGROUND: Recent advances in bioinformatics make it possible to identify informative genes from high-dimensional gene expression data. Selecting informative genes from these large datasets has become a major concern among researchers. OBJECTIVE: Gene functionality and regulatory mechanisms can be understood through the analysis of gene expression data. Here, we present a computational method to identify informative genes for breast cancer subtypes such as Basal, human epidermal growth factor receptor 2 (Her2), luminal A (LumA), and luminal B (LumB). METHODS: The proposed In Silico Markers method is a wrapper feature selection method based on the Least Absolute Shrinkage and Selection Operator (LASSO) and the Covariance Matrix Adaptation Evolution Strategy (CMA-ES), with a Support Vector Machine (SVM) as the classifier. Moreover, a composite measure consisting of the relevance, redundancy, and rank score of frequently appearing genes is used to select informative genes. RESULTS: The informative genes are validated by statistical and biologically relevant criteria. For a comparative evaluation of the proposed approach, a biological similarity score based on a semantic similarity measure of GO terms is investigated. Further, the proposed technique is compared with 7 existing gene selection techniques using two-class annotated breast cancer subtype datasets. CONCLUSION: This method can support the discovery of informative genes. Furthermore, under a multiple-criteria decision-making setup, the informative genes selected by In Silico Markers are found to be superior to the genes selected by the compared methods.
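As a rough illustration of the composite measure mentioned in the abstract, the Python sketch below scores genes by relevance, redundancy, and selection frequency. The abstract does not give the formula, so every definition here is an assumption: relevance is taken as the absolute Pearson correlation with the class label, redundancy as the mean absolute correlation with the other genes, and the rank score as the fraction of repeated selection runs in which a gene appears. The function names and the toy data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def composite_scores(expr, labels, selection_runs):
    """Score each gene as relevance - redundancy + frequency (assumed weighting).

    expr: dict mapping gene name -> list of expression values (one per sample)
    labels: list of 0/1 class labels (one per sample)
    selection_runs: list of gene sets, one per repeated selection run
    """
    genes = list(expr)
    scores = {}
    for g in genes:
        relevance = abs(pearson(expr[g], labels))
        others = [h for h in genes if h != g]
        redundancy = (
            sum(abs(pearson(expr[g], expr[h])) for h in others) / len(others)
            if others
            else 0.0
        )
        frequency = sum(g in run for run in selection_runs) / len(selection_runs)
        scores[g] = relevance - redundancy + frequency
    return scores


# Toy data (hypothetical): geneA tracks the label and is selected in every run,
# geneB is unrelated to the label, geneC is nearly a copy of geneA.
expr = {
    "geneA": [1.0, 1.2, 0.9, 3.1, 3.0, 2.9],
    "geneB": [2.0, 1.0, 2.0, 1.0, 2.0, 1.0],
    "geneC": [0.5, 0.6, 0.4, 1.6, 1.5, 1.4],
}
labels = [0, 0, 0, 1, 1, 1]
runs = [{"geneA", "geneC"}, {"geneA"}, {"geneA", "geneB"}]
ranking = sorted(composite_scores(expr, labels, runs).items(), key=lambda kv: -kv[1])
```

On this toy input, geneA ranks first because it is both relevant and selected in every run; a different weighting of the three terms could change the order of the remaining genes.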
Affiliation(s)
- Shib Sankar Bhowmick: Department of Electronics and Communication Engineering, Heritage Institute of Technology, Kolkata, 700107, India
- Debotosh Bhattacharjee: Department of Computer Science and Engineering, Jadavpur University, Kolkata, 700032, India
- Luis Rato: Department of Informatics, University of Evora, 7004-516, Evora, Portugal

3
Bhowmick SS, Bhattacharjee D, Rato L. Identification of tissue-specific tumor biomarker using different optimization algorithms. Genes Genomics 2018; 41:431-443. [PMID: 30535858 DOI: 10.1007/s13258-018-0773-2] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 04/16/2018] [Accepted: 12/03/2018] [Indexed: 11/25/2022]
Abstract
BACKGROUND: Identification of differentially expressed genes, i.e., genes whose transcript abundance level differs across biological or physiological conditions, has long been a challenging task. However, the advent of transcriptome sequencing (RNA-seq) technology revolutionized the simultaneous measurement of transcript abundance levels for thousands of genes. OBJECTIVE: In this paper, such next-generation sequencing (NGS) data is used to identify biomarker signatures for several of the most common cancer types (bladder, colon, kidney, brain, liver, lung, prostate, skin, and thyroid). METHODS: Here, the problem is mapped into a comparison of optimization algorithms for selecting a set of genes that leads to the highest classification accuracy in a two-class classification task between healthy and tumor samples. Artificial Bee Colony (ABC), Ant Colony Optimization, Differential Evolution, and Particle Swarm Optimization are chosen as the optimization algorithms for this experiment. A standard statistical method called DESeq2 is used to select differentially expressed genes before they are fed to the optimization algorithms. Classification of healthy and tumor samples is done by a support vector machine. RESULTS: Cancer-specific validation yields remarkably good results in terms of accuracy. The highest classification accuracy, 99.10%, is achieved by the ABC algorithm on the brain lower grade glioma data. This validation is well supported by a statistical test, gene ontology enrichment analysis, and KEGG pathway enrichment analysis for each cancer biomarker signature. CONCLUSION: The current study identified robust genes as biomarker signatures, and these biomarkers might help to accurately identify tumors of unknown origin.
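The objective that the optimization algorithms above compete on can be sketched as a function from a gene subset to a classification accuracy. The Python below is a minimal sketch under stated assumptions, not the authors' pipeline: it substitutes a nearest-centroid classifier for the support vector machine so that it runs with the standard library only, it uses leave-one-out validation (the abstract does not name the validation scheme), and the function name and toy data are hypothetical.

```python
def nearest_centroid_loo_accuracy(samples, labels, mask):
    """Leave-one-out accuracy of a nearest-centroid classifier on the masked genes.

    samples: list of expression vectors (one list of floats per sample)
    labels: list of class labels, e.g. "healthy" / "tumor"
    mask: list of 0/1 flags, one per gene; 1 keeps the gene
    """
    keep = [j for j, bit in enumerate(mask) if bit]
    if not keep:
        return 0.0  # an empty gene subset cannot classify anything
    correct = 0
    for i, (x, y) in enumerate(zip(samples, labels)):
        # Build class centroids from every sample except the held-out one.
        sums, counts = {}, {}
        for k, (xs, ys) in enumerate(zip(samples, labels)):
            if k == i:
                continue
            acc = sums.setdefault(ys, [0.0] * len(keep))
            for pos, j in enumerate(keep):
                acc[pos] += xs[j]
            counts[ys] = counts.get(ys, 0) + 1

        # Predict the class whose centroid is closest in squared Euclidean distance.
        def dist(c):
            return sum(
                (x[j] - sums[c][pos] / counts[c]) ** 2 for pos, j in enumerate(keep)
            )

        predicted = min(sorted(sums), key=dist)
        correct += predicted == y
    return correct / len(samples)


# Toy data (hypothetical): gene 0 separates the classes, genes 1 and 2 do not.
samples = [
    [1.0, 5.0, 2.0],
    [1.2, 1.0, 2.1],
    [0.9, 3.0, 1.9],
    [3.0, 5.1, 2.0],
    [3.1, 0.9, 2.1],
    [2.9, 3.1, 1.9],
]
labels = ["healthy"] * 3 + ["tumor"] * 3
accuracy_gene0 = nearest_centroid_loo_accuracy(samples, labels, [1, 0, 0])
accuracy_gene1 = nearest_centroid_loo_accuracy(samples, labels, [0, 1, 0])
```

Any of the named optimizers could then search over `mask` to maximize this value; swapping in a different classifier only changes the body of the function, not the interface the optimizer sees.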
Affiliation(s)
- Shib Sankar Bhowmick: Department of Electronics and Communication Engineering, Heritage Institute of Technology, Kolkata, 700107, India
- Debotosh Bhattacharjee: Department of Computer Science and Engineering, Jadavpur University, Kolkata, 700032, India
- Luis Rato: Department of Informatics, University of Evora, 7004-516, Evora, Portugal

4
Bhowmick SS, Saha I, Bhattacharjee D, Genovese LM, Geraci F. Genome-wide analysis of NGS data to compile cancer-specific panels of miRNA biomarkers. PLoS One 2018; 13:e0200353. [PMID: 30048452 PMCID: PMC6061989 DOI: 10.1371/journal.pone.0200353] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 09/22/2017] [Accepted: 06/25/2018] [Indexed: 12/22/2022] [Open Access]
Abstract
MicroRNAs are small non-coding RNAs that influence gene expression by binding to the 3’ UTR of target mRNAs in order to repress protein synthesis. Soon after their discovery, microRNA dysregulation was associated with several pathologies. In particular, microRNAs have often been reported as differentially expressed between healthy and tumor samples. This suggests that microRNAs are likely to be good candidate biomarkers for cancer diagnosis and personalized medicine. With the advent of Next-Generation Sequencing (NGS), measuring the expression level of the whole miRNAome at once is now routine. Moreover, the collaborative effort of sharing data opens up the possibility of population-level analyses. This context motivated us to perform an in silico study to distill cancer-specific panels of microRNAs that can serve as biomarkers. We observed that the problem of finding biomarkers can be modeled as a two-class classification task where, given the miRNAomes of a population of healthy and cancerous samples, we want to find the subset of microRNAs that leads to the highest classification accuracy. We accomplish this task by leveraging a sensible combination of data mining tools. In particular, we used differential evolution for candidate selection, component analysis to preserve the relationships among miRNAs, and SVM for sample classification. We identified 10 cancer-specific panels whose classification accuracy is always higher than 92%. These panels have very little overlap, suggesting that miRNAs are not only predictive of the onset of cancer but can be used for classification purposes as well. We experimentally validated the contribution of each of the employed tools to the selection of discriminating miRNAs. Moreover, we tested the significance of each panel for the corresponding cancer type. In particular, enrichment analysis showed that the selected miRNAs are involved in oncogenesis pathways, while survival analysis showed that miRNAs can be used to evaluate cancer severity. In summary, the results demonstrate that our method is able to produce cancer-specific panels that are promising candidates for subsequent in vitro validation.
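To make the candidate-selection step concrete, here is a minimal Python sketch of differential evolution applied to subset selection. It is a generic DE/rand/1/bin variant under stated assumptions, not the authors' implementation: candidate vectors live in [0, 1] and are thresholded at 0.5 to obtain a 0/1 mask, the population is updated in place, and the fitness is a hypothetical toy function standing in for the SVM classification accuracy described in the abstract. All names and parameter values are illustrative.

```python
import random


def differential_evolution_select(fitness, n_genes, pop_size=20, generations=50,
                                  f=0.5, cr=0.9, seed=0):
    """Maximize fitness(mask) with DE/rand/1/bin over real vectors in [0, 1].

    Each vector is decoded into a 0/1 mask by thresholding at 0.5.
    Requires pop_size >= 4 so that three distinct partners exist.
    """
    rng = random.Random(seed)

    def decode(vec):
        return [1 if v > 0.5 else 0 for v in vec]

    pop = [[rng.random() for _ in range(n_genes)] for _ in range(pop_size)]
    scores = [fitness(decode(v)) for v in pop]
    for _ in range(generations):
        for i in range(pop_size):
            a, b, c = rng.sample([k for k in range(pop_size) if k != i], 3)
            forced = rng.randrange(n_genes)  # at least one coordinate comes from the mutant
            trial = []
            for j in range(n_genes):
                if j == forced or rng.random() < cr:
                    v = pop[a][j] + f * (pop[b][j] - pop[c][j])
                    trial.append(min(1.0, max(0.0, v)))  # clip to the search box
                else:
                    trial.append(pop[i][j])
            s = fitness(decode(trial))
            if s >= scores[i]:  # greedy replacement never lowers a member's score
                pop[i], scores[i] = trial, s
    best = max(range(pop_size), key=scores.__getitem__)
    return decode(pop[best]), scores[best]


def toy_fitness(mask):
    """Hypothetical stand-in for accuracy: reward items 0 and 3, penalize the rest."""
    informative = {0, 3}
    return sum(2.0 if j in informative else -0.5 for j, bit in enumerate(mask) if bit)


mask, score = differential_evolution_select(toy_fitness, n_genes=8)
```

In a real pipeline the toy fitness would be replaced by a cross-validated classifier accuracy, which makes each evaluation far more expensive and usually dictates the population size and generation budget.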
Affiliation(s)
- Shib Sankar Bhowmick: Department of Computer Science and Engineering, Jadavpur University, Kolkata, India; Department of Electronics & Communication Engineering, Heritage Institute of Technology, Kolkata, India
- Indrajit Saha: Department of Computer Science and Engineering, National Institute of Technical Teachers’ Training & Research, Kolkata, India
- Loredana M. Genovese: Institute for Informatics and Telematics, National Research Council, Pisa, Italy
- Filippo Geraci: Institute for Informatics and Telematics, National Research Council, Pisa, Italy

5
Saha I, Rak B, Bhowmick SS, Maulik U, Bhattacharjee D, Koch U, Lazniewski M, Plewczynski D. Binding Activity Prediction of Cyclin-Dependent Inhibitors. J Chem Inf Model 2015; 55:1469-82. [PMID: 26079845 DOI: 10.1021/ci500633c] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Indexed: 11/30/2022]
Abstract
The Cyclin-Dependent Kinases (CDKs) are the core components coordinating the eukaryotic cell division cycle. Generally, the crystal structure of CDKs provides information on possible molecular mechanisms of ligand binding. However, reliable and robust estimation of ligand binding activity has been a challenging task in drug design. In this regard, various machine learning techniques, such as Support Vector Machine, Naive Bayesian classifier, Decision Tree, and K-Nearest Neighbor classifier, have been used. The performance of these heterogeneous classification techniques depends on proper selection of features from the data set. This fact motivated us to propose an integrated classification technique using a Genetic Algorithm (GA), a Rotational Feature Selection (RFS) scheme, and an Ensemble of Machine Learning methods, named the Genetic Algorithm integrated Rotational Ensemble based classification technique, for the prediction of ligand binding activity of CDKs. This technique can automatically find the important features and the ensemble size. For this purpose, the GA encodes the features and ensemble size in a chromosome as a binary string. The encoded features are then used to create diverse sets of training points using RFS in order to train the machine learning method multiple times. The RFS scheme uses Principal Component Analysis (PCA) to preserve the variability information of the rotational nonoverlapping subsets of the original data. Thereafter, the testing points are fed to the different instances of the trained machine learning method in order to produce the ensemble result. Here, accuracy is computed as the final result after 10-fold cross-validation and is also used as the objective function for the GA to maximize. The effectiveness of the proposed classification technique has been demonstrated quantitatively and visually in comparison with different machine learning methods for 16 ligand binding CDK docking and rescoring data sets. In addition, the best possible features have been reported for the CDK docking and rescoring data sets separately. Finally, the Friedman test has been conducted to judge the statistical significance of the results produced by the proposed technique. The results indicate that the integrated classification technique has high relevance in predicting protein-ligand binding activity.
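The chromosome encoding described above can be sketched as follows. The abstract says only that features and ensemble size are encoded together as a binary string, so the layout here is an assumption: the first `n_features` bits flag the selected features, and the remaining bits are read as an unsigned binary integer to which one is added so the ensemble size is never zero. The function name is hypothetical.

```python
def decode_chromosome(bits, n_features):
    """Split a binary chromosome into selected feature indices and an ensemble size.

    Assumed layout (not given in the abstract): the first n_features bits flag
    selected features; the remaining bits form an unsigned binary integer, and
    the ensemble size is that integer plus one.
    """
    if len(bits) <= n_features or any(b not in (0, 1) for b in bits):
        raise ValueError("chromosome must contain only 0/1 and be longer than n_features")
    mask = bits[:n_features]
    size_bits = bits[n_features:]
    ensemble_size = int("".join(map(str, size_bits)), 2) + 1
    selected = [i for i, b in enumerate(mask) if b]
    return selected, ensemble_size


# Five feature bits followed by three size bits (binary 011 = 3, so size 4).
selected, ensemble_size = decode_chromosome([1, 0, 1, 1, 0, 0, 1, 1], n_features=5)
```

With this layout, the GA's crossover and mutation operators can treat the whole chromosome uniformly, while the fitness evaluation decodes it before training the ensemble.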
Affiliation(s)
- Indrajit Saha: Centre of New Technologies, University of Warsaw, 02-097 Warsaw, Poland; Institute of Informatics and Telematics, National Research Council, 56124 Pisa, Italy; Institute of Computer Science, University of Wroclaw, 50-383 Wroclaw, Poland
- Benedykt Rak: Centre of New Technologies, University of Warsaw, 02-097 Warsaw, Poland
- Shib Sankar Bhowmick: Department of Computer Science and Engineering, Jadavpur University, Kolkata-700032, West Bengal, India; Department of Informatics, University of Evora, Evora 7004-516, Portugal
- Ujjwal Maulik: Department of Computer Science and Engineering, Jadavpur University, Kolkata-700032, West Bengal, India
- Debotosh Bhattacharjee: Department of Computer Science and Engineering, Jadavpur University, Kolkata-700032, West Bengal, India
- Uwe Koch: Lead Discovery Center, Emil-Figge-Strasse 76a, 44227 Dortmund, Germany
- Michal Lazniewski: Centre of New Technologies, University of Warsaw, 02-097 Warsaw, Poland
- Dariusz Plewczynski: Centre of New Technologies, University of Warsaw, 02-097 Warsaw, Poland; The Jackson Laboratory for Genomic Medicine, c/o University of Connecticut Health Center, Administrative Services Building-Call Box 901, 263 Farmington Avenue, Farmington, Connecticut 06030, United States; Yale University, New Haven, Connecticut 06520, United States