1
Genç M. Penalized logistic regression with prior information for microarray gene expression classification. Int J Biostat 2024; 20:107-122. [PMID: 36427223 DOI: 10.1515/ijb-2022-0025] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 02/15/2022] [Accepted: 11/07/2022] [Indexed: 02/17/2024]
Abstract
Cancer classification and gene selection are important applications in DNA microarray gene expression data analysis. Since DNA microarray data suffers from the high-dimensionality problem, automatic gene selection methods are used to enhance the classification performance of expert classifier systems. In this paper, a new penalized logistic regression method that performs simultaneous gene coefficient estimation and variable selection in DNA microarray data is discussed. The method employs prior information about the gene coefficients to improve the classification accuracy of the underlying model. The coordinate descent algorithm with screening rules is given to obtain the gene coefficient estimates of the proposed method efficiently. The performance of the method is examined on five high-dimensional cancer classification datasets using the area under the curve, the number of selected genes, misclassification rate and F-score measures. The real data analysis results indicate that the proposed method achieves a good cancer classification performance with a small misclassification rate, large area under the curve and F-score by trading off some sparsity level of the underlying model. Hence, the proposed method can be seen as a reliable penalized logistic regression method in the scope of high-dimensional cancer classification.
Affiliation(s)
- Murat Genç
- Department of Management Information Systems, Faculty of Economics and Administrative Sciences, Tarsus University Mersin, Mersin 33400, Türkiye
2
Zhang W, Kenney T, Ho LST. Evolutionary shift detection with ensemble variable selection. BMC Ecol Evol 2024; 24:11. [PMID: 38245667 PMCID: PMC10800078 DOI: 10.1186/s12862-024-02201-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/19/2023] [Accepted: 01/10/2024] [Indexed: 01/22/2024] Open
Abstract
Abrupt environmental changes can lead to evolutionary shifts in trait evolution. Identifying these shifts is an important step in understanding the evolutionary history of phenotypes. The detection performances of different methods are influenced by many factors, including different numbers of shifts, shift sizes, where a shift occurs on a tree, and the types of phylogenetic structure. Furthermore, the model assumptions are oversimplified, so are likely to be violated in real data, which could cause the methods to fail. We perform simulations to assess the effect of these factors on the performance of shift detection methods. To make the comparisons more complete, we also propose an ensemble variable selection method (R package ELPASO) and compare it with existing methods (R packages ℓ1ou and PhylogeneticEM). The performances of methods are highly dependent on the selection criterion. ℓ1ou+pBIC is usually the most conservative method and it performs well when signal sizes are large. ℓ1ou+BIC is the least conservative method and it performs well when signal sizes are small. The ensemble method provides more balanced choices between those two methods. Moreover, the performances of all methods are heavily impacted by measurement error, tree reconstruction error and shifts in variance.
Affiliation(s)
- Wensha Zhang
- Department of Mathematics and Statistics, Dalhousie University, Nova Scotia, Canada
- Toby Kenney
- Department of Mathematics and Statistics, Dalhousie University, Nova Scotia, Canada
- Lam Si Tung Ho
- Department of Mathematics and Statistics, Dalhousie University, Nova Scotia, Canada
3
Park S, Kim JH, Cha YK, Chung MJ, Woo JH, Park S. Application of Machine Learning Algorithm in Predicting Axillary Lymph Node Metastasis from Breast Cancer on Preoperative Chest CT. Diagnostics (Basel) 2023; 13:2953. [PMID: 37761320 PMCID: PMC10528867 DOI: 10.3390/diagnostics13182953] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/10/2023] [Revised: 09/05/2023] [Accepted: 09/13/2023] [Indexed: 09/29/2023] Open
Abstract
Axillary lymph node (ALN) status is one of the most critical prognostic factors in patients with breast cancer. However, ALN evaluation with contrast-enhanced CT (CECT) has been challenging. Machine learning (ML) is known to show excellent performance in image recognition tasks. The purpose of our study was to evaluate the performance of the ML algorithm for predicting ALN metastasis by combining preoperative CECT features of both ALN and primary tumor. This was a retrospective single-institutional study of a total of 266 patients with breast cancer who underwent preoperative chest CECT. Random forest (RF), extreme gradient boosting (XGBoost), and neural network (NN) algorithms were used. Statistical analysis and recursive feature elimination (RFE) were adopted as feature selection for ML. The best ML-based ALN prediction model for breast cancer was NN with RFE, which achieved an AUROC of 0.76 ± 0.11 and an accuracy of 0.74 ± 0.12. By comparing NN with RFE model performance with and without ALN features from CECT, NN with RFE model with ALN features showed better performance at all performance evaluations, which indicated the effect of ALN features. Through our study, we were able to demonstrate that the ML algorithm could effectively predict the final diagnosis of ALN metastases from CECT images of the primary tumor and ALN. This suggests that ML has the potential to differentiate between benign and malignant ALNs.
Affiliation(s)
- Soyoung Park
- Department of Health Sciences and Technology, SAIHST, Sungkyunkwan University, Seoul 06351, Republic of Korea
- Jong Hee Kim
- Department of Radiology, Samsung Medical Center, Sungkyunkwan University School of Medicine, Seoul 06351, Republic of Korea
- Yoon Ki Cha
- Department of Radiology, Samsung Medical Center, Sungkyunkwan University School of Medicine, Seoul 06351, Republic of Korea
- Myung Jin Chung
- Department of Radiology, Samsung Medical Center, Sungkyunkwan University School of Medicine, Seoul 06351, Republic of Korea
- Jung Han Woo
- Department of Radiology, Samsung Medical Center, Sungkyunkwan University School of Medicine, Seoul 06351, Republic of Korea
- Subin Park
- Department of Health Sciences and Technology, SAIHST, Sungkyunkwan University, Seoul 06351, Republic of Korea
4
A Survey on Feature Selection Techniques Based on Filtering Methods for Cyber Attack Detection. INFORMATION 2023. [DOI: 10.3390/info14030191] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/19/2023] Open
Abstract
Cyber attack detection technology plays a vital role today, since cyber attacks have been causing great harm and loss to organizations and individuals. Feature selection is a necessary step for many cyber-attack detection systems, because it can reduce training costs, improve detection performance, and make the detection system lightweight. Many techniques related to feature selection for cyber attack detection have been proposed, and each technique has advantages and disadvantages. Determining which technology should be selected is a challenging problem for many researchers and system developers, and although there have been several survey papers on feature selection techniques in the field of cyber security, most of them try to be all-encompassing and are too general, making it difficult for readers to grasp the concrete and comprehensive image of the methods. In this paper, we survey the filter-based feature selection technique in detail and comprehensively for the first time. The filter-based technique is one popular kind of feature selection technique and is widely used in both research and application. In addition to general descriptions of this kind of method, we also explain in detail search algorithms and relevance measures, which are two necessary technical elements commonly used in the filter-based technique.
5
Liu B, Zhai J, Wang W, Liu T, Liu C, Zhu X, Wang Q, Tian W, Zhang F. Identification of Tumor Microenvironment and DNA Methylation-Related Prognostic Signature for Predicting Clinical Outcomes and Therapeutic Responses in Cervical Cancer. Front Mol Biosci 2022; 9:872932. [PMID: 35517856 PMCID: PMC9061945 DOI: 10.3389/fmolb.2022.872932] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 02/10/2022] [Accepted: 03/17/2022] [Indexed: 01/14/2023] Open
Abstract
Background: Tumor microenvironment (TME) has been reported to have a strong association with tumor progression and therapeutic outcome, and epigenetic modifications such as DNA methylation can affect TMB and play an indispensable role in tumorigenesis. However, the potential mechanisms of TME and DNA methylation remain unclear in cervical cancer (CC). Methods: The immune and stromal scores of TME were generated by the ESTIMATE algorithm for CC patients in The Cancer Genome Atlas (TCGA) database. The TME and DNA methylation-related genes were identified by the integrative analysis of DNA promoter methylation and gene expression. The least absolute shrinkage and selection operator (LASSO) Cox regression was performed 1,000 times to further identify a nine-gene TME and DNA methylation-related prognostic signature. The signature was further validated in Gene Expression Omnibus (GEO) dataset. Then, the identified signature was integrated with the Federation International of Gynecology and Obstetrics (FIGO) stage to establish a composite prognostic nomogram. Results: CC patients with high immunity levels have better survival than those with low immunity levels. Both in the training and validation datasets, the risk score of the signature was an independent prognosis factor. The composite nomogram showed higher accuracy of prognosis and greater net benefits than the FIGO stage and the signature. The high-risk group had a significantly higher fraction of genome altered than the low-risk group. Eleven genes were significantly different in mutation frequencies between the high- and low-risk groups. Interestingly, patients with mutant TTN had better overall survival (OS) than those with wild type. Patients in the low-risk group had significantly higher tumor mutational burden (TMB) than those in the high-risk group. 
Taken together, the results of TMB, immunophenoscore (IPS), and tumor immune dysfunction and exclusion (TIDE) score suggested that patients in the low-risk group may have greater immunotherapy benefits. Finally, four drugs (panobinostat, lenvatinib, everolimus, and temsirolimus) were found to have potential therapeutic implications for patients with a high-risk score. Conclusions: Our findings highlight that the TME and DNA methylation-related prognostic signature can accurately predict the prognosis of CC and may be important for stratified management of patients and precision targeted therapy.
Affiliation(s)
- Bangquan Liu
- Department of Epidemiology, College of Public Health, Harbin Medical University, Harbin, China
- Jiabao Zhai
- Department of Epidemiology, College of Public Health, Harbin Medical University, Harbin, China
- Wanyu Wang
- Department of Epidemiology, College of Public Health, Harbin Medical University, Harbin, China
- Tianyu Liu
- Department of Epidemiology, College of Public Health, Harbin Medical University, Harbin, China
- Chang Liu
- Department of Epidemiology, College of Public Health, Harbin Medical University, Harbin, China
- Xiaojie Zhu
- Department of Epidemiology, College of Public Health, Harbin Medical University, Harbin, China
- Qi Wang
- Department of Epidemiology, College of Public Health, Harbin Medical University, Harbin, China
- Wenjing Tian
- Department of Epidemiology, College of Public Health, Harbin Medical University, Harbin, China
- Fubin Zhang
- Department of Gynecological Oncology, Harbin Medical University Cancer Hospital, Harbin, China
6
Ai H. GSEA-SDBE: A gene selection method for breast cancer classification based on GSEA and analyzing differences in performance metrics. PLoS One 2022; 17:e0263171. [PMID: 35472078 PMCID: PMC9041804 DOI: 10.1371/journal.pone.0263171] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/23/2021] [Accepted: 01/13/2022] [Indexed: 12/20/2022] Open
Abstract
MOTIVATION Selecting the most relevant genes for sample classification is a common process in gene expression studies. Moreover, determining the smallest set of relevant genes that can achieve the required classification performance is particularly important in diagnosing cancer and improving treatment. RESULTS In this study, I propose a novel method to eliminate irrelevant and redundant genes, and thus determine the smallest set of relevant genes for breast cancer diagnosis. The method is based on random forest models, gene set enrichment analysis (GSEA), and my developed Sort Difference Backward Elimination (SDBE) algorithm; hence, the method is named GSEA-SDBE. Using this method, genes are filtered according to their importance following random forest training and GSEA is used to select genes by core enrichment of Kyoto Encyclopedia of Genes and Genomes pathways that are strongly related to breast cancer. Subsequently, the SDBE algorithm is applied to eliminate redundant genes and identify the most relevant genes for breast cancer diagnosis. In the SDBE algorithm, the differences in the Matthews correlation coefficients (MCCs) of performing random forest models are computed before and after the deletion of each gene to indicate the degree of redundancy of the corresponding deleted gene on the remaining genes during backward elimination. Next, the obtained MCC difference list is divided into two parts from a set position and each part is respectively sorted. By continuously iterating and changing the set position, the most relevant genes are stably assembled on the left side of the gene list, facilitating their identification, and the redundant genes are gathered on the right side of the gene list for easy elimination. 
A cross-comparison of the SDBE algorithm was performed by respectively computing differences between MCCs and ROC_AUC_score and then respectively using 10-fold classification models, e.g., random forest (RF), support vector machine (SVM), k-nearest neighbor (KNN), extreme gradient boosting (XGBoost), and extremely randomized trees (ExtraTrees). Finally, the classification performance of the proposed method was compared with that of three advanced algorithms for five cancer datasets. Results showed that analyzing MCC differences and using random forest models was the optimal solution for the SDBE algorithm. Accordingly, three consistently relevant genes (i.e., VEGFD, TSLP, and PKMYT1) were selected for the diagnosis of breast cancer. The performance metrics (MCC and ROC_AUC_score, respectively) of the random forest models based on 10-fold verification reached 95.28% and 98.75%. In addition, survival analysis showed that VEGFD and TSLP could be used to predict the prognosis of patients with breast cancer. Moreover, the proposed method significantly outperformed the other methods tested as it allowed selecting a smaller number of genes while maintaining the required classification accuracy.
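The Matthews correlation coefficient that the abstract above relies on can be computed directly from a binary confusion matrix. The following is a minimal sketch of the standard MCC formula only, not the author's SDBE implementation; the function names and the `mcc_difference` helper are hypothetical and introduced for illustration.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Standard MCC for a binary confusion matrix; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # By convention, MCC is taken as 0 when any marginal total is zero.
        return 0.0
    return (tp * tn - fp * fn) / denom


def mcc_difference(mcc_before: float, mcc_after: float) -> float:
    """Hypothetical helper: change in MCC after deleting one gene.

    A value near zero suggests the deleted gene was redundant given the
    remaining genes; a large drop suggests it carried unique signal.
    """
    return mcc_before - mcc_after


if __name__ == "__main__":
    full = matthews_corrcoef(tp=45, tn=40, fp=10, fn=5)
    reduced = matthews_corrcoef(tp=44, tn=40, fp=10, fn=6)
    print(round(full, 4), round(mcc_difference(full, reduced), 4))
```

In a backward-elimination loop of the kind the abstract describes, such a difference would be recomputed for each candidate gene from cross-validated predictions.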
Affiliation(s)
- Hu Ai
- Department of Criminal Technology, Guizhou Police College, Guiyang, Guizhou, China
7
Asad E, Mollah AF. Biomarker Identification From Gene Expression Based on Symmetrical Uncertainty. INTERNATIONAL JOURNAL OF INTELLIGENT INFORMATION TECHNOLOGIES 2021. [DOI: 10.4018/ijiit.289966] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Abstract
In this paper, we present an effective information-theoretic feature selection method based on Symmetrical Uncertainty to classify gene expression microarray data and detect biomarkers from it. Here, Information Gain and Symmetrical Uncertainty are used to rank the features. Based on the computed values of Symmetrical Uncertainty, features are sorted from most to least informative. Then, the top features from the sorted list are passed to Random Forest, Logistic Regression, and other well-known classifiers with leave-one-out cross-validation to construct the best classification model(s) and accordingly select the most important genes from microarray datasets. The results obtained in terms of classification accuracy, running time, root mean square error, and other parameters computed on Leukemia and Colon cancer datasets demonstrate the effectiveness of the proposed approach. The proposed method is much faster than many wrapper or ensemble methods.
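The ranking step described above can be illustrated with the usual definition of symmetrical uncertainty, SU(X, Y) = 2·IG(X; Y) / (H(X) + H(Y)). This is a minimal sketch that assumes features have already been discretized (continuous expression values would need binning first); the function names and the toy data are illustrative and are not taken from the paper.

```python
import math
from collections import Counter


def entropy(values) -> float:
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(x, y) -> float:
    """SU(X, Y) = 2 * IG(X; Y) / (H(X) + H(Y)), a value in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0  # both variables are constant
    joint = entropy(list(zip(x, y)))
    info_gain = hx + hy - joint  # mutual information I(X; Y)
    return 2 * info_gain / (hx + hy)


def rank_features(features: dict, labels) -> list:
    """Sort feature names from most to least informative by SU with the labels."""
    scores = {name: symmetrical_uncertainty(col, labels) for name, col in features.items()}
    return sorted(scores, key=lambda name: scores[name], reverse=True)


if __name__ == "__main__":
    labels = [0, 0, 1, 1]
    toy = {"gene_a": [0, 0, 1, 1], "gene_b": [0, 1, 0, 1]}
    print(rank_features(toy, labels))
```

The top-ranked names would then be passed to the downstream classifiers; how many to keep is a tuning choice the abstract does not specify.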
8
Yang X, Wu W, Xin X, Su L, Xue L. Adaptive factorization rank selection-based NMF and its application in tumor recognition. INT J MACH LEARN CYB 2021. [DOI: 10.1007/s13042-021-01353-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/21/2022]
9
Abstract
Feature selection (FS) has attracted the attention of many researchers in the last few years due to the increasing sizes of datasets, which contain hundreds or thousands of columns (features). Typically, not all columns represent relevant values. Consequently, the noise or irrelevant columns could confuse the algorithms, leading to a weak performance of machine learning models. Different FS algorithms have been proposed to analyze highly dimensional datasets and determine their subsets of relevant features to overcome this problem. However, very often, FS algorithms are biased by the data. Thus, methods for ensemble feature selection (EFS) algorithms have become an alternative to integrate the advantages of single FS algorithms and compensate for their disadvantages. The objective of this research is to propose a conceptual and implementation framework to understand the main concepts and relationships in the process of aggregating FS algorithms and to demonstrate how to address FS on datasets with high dimensionality. The proposed conceptual framework is validated by deriving an implementation framework, which incorporates a set of Python packages with functionalities to support the assembly of feature selection algorithms. The performance of the implementation framework was demonstrated in several experiments discovering relevant features in the Sonar, SPECTF, and WDBC datasets. The experiments contrasted the accuracy of two machine learning classifiers (decision tree and logistic regression), trained with subsets of features generated either by single FS algorithms or the set of features selected by the ensemble feature selection framework. We observed that for the three datasets used (Sonar, SPECTF, and WDBC), the highest precision percentages (86.95%, 74.73%, and 93.85%, respectively) were obtained when the classifiers were trained with the subset of features generated by our framework. 
Additionally, the stability of the feature sets generated using our ensemble method was evaluated. The results showed that the method achieved perfect stability for the three datasets used in the evaluation.
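One simple way to aggregate the outputs of several feature selection algorithms is a Borda count over their rankings. The sketch below is an assumption chosen for illustration and is not the framework described in the abstract, which does not state its aggregation rule here; the function name `borda_aggregate` and the toy rankings are hypothetical.

```python
from collections import defaultdict


def borda_aggregate(rankings: list) -> list:
    """Combine several best-first feature rankings with a Borda count.

    Each ranking awards len(ranking) points to its top feature, one point
    fewer to the next, and so on. Ties are broken alphabetically.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, feature in enumerate(ranking):
            scores[feature] += n - position
    return sorted(scores, key=lambda f: (-scores[f], f))


if __name__ == "__main__":
    # Hypothetical rankings from three single FS algorithms.
    rankings = [
        ["f1", "f2", "f3"],
        ["f2", "f1", "f3"],
        ["f1", "f3", "f2"],
    ]
    print(borda_aggregate(rankings))
```

Keeping only the first k aggregated features gives the ensemble subset on which a classifier can then be trained.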
10
Gomes J, Kong J, Kurc T, Melo ACMA, Ferreira R, Saltz JH, Teodoro G. Building robust pathology image analyses with uncertainty quantification. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2021; 208:106291. [PMID: 34333205 DOI: 10.1016/j.cmpb.2021.106291] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/12/2021] [Accepted: 07/09/2021] [Indexed: 06/13/2023]
Abstract
BACKGROUND AND OBJECTIVE Computerized pathology image analysis is an important tool in research and clinical settings, which enables quantitative tissue characterization and can assist a pathologist's evaluation. The aim of our study is to systematically quantify and minimize uncertainty in the output of computer-based pathology image analysis. METHODS Uncertainty quantification (UQ) and sensitivity analysis (SA) methods, such as Variance-Based Decomposition (VBD) and Morris One-At-a-Time (MOAT), are employed to track and quantify uncertainty in a real-world application with large Whole Slide Imaging datasets - 943 Breast Invasive Carcinoma (BRCA) and 381 Lung Squamous Cell Carcinoma (LUSC) patients. Because these studies are compute intensive, high-performance computing systems and efficient UQ/SA methods were combined to provide efficient execution. UQ/SA has been able to highlight parameters of the application that impact the results, as well as nuclear features that carry most of the uncertainty. Using this information, we built a method for selecting stable features that minimize application output uncertainty. RESULTS The results show that input parameter variations significantly impact all stages (segmentation, feature computation, and survival analysis) of the use case application. We then identified and classified features according to their robustness to parameter variation; using the proposed feature selection strategy, patient grouping stability in survival analysis, for instance, improved by 17% and 34% for BRCA and LUSC, respectively. CONCLUSIONS This strategy created more robust analyses, demonstrating that SA and UQ are important methods that may increase confidence in digital pathology.
Affiliation(s)
- Jeremias Gomes
- Department of Computer Science, University of Brasília, Brasília, Brazil
- Jun Kong
- Biomedical Informatics Department, Emory University, Atlanta, USA; Department of Biomedical Engineering, Emory-Georgia Institute of Technology, Atlanta, USA; Department of Mathematics and Statistics, Georgia State University, Atlanta, USA
- Tahsin Kurc
- Biomedical Informatics Department, Stony Brook University, Stony Brook, USA; Scientific Data Group, Oak Ridge National Laboratory, Oak Ridge, USA
- Alba C M A Melo
- Department of Computer Science, University of Brasília, Brasília, Brazil
- Renato Ferreira
- Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
- Joel H Saltz
- Biomedical Informatics Department, Stony Brook University, Stony Brook, USA
- George Teodoro
- Department of Computer Science, University of Brasília, Brasília, Brazil; Biomedical Informatics Department, Stony Brook University, Stony Brook, USA; Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
11
Ge C, Luo L, Zhang J, Meng X, Chen Y. FRL: An Integrative Feature Selection Algorithm Based on the Fisher Score, Recursive Feature Elimination, and Logistic Regression to Identify Potential Genomic Biomarkers. BIOMED RESEARCH INTERNATIONAL 2021; 2021:4312850. [PMID: 34235216 PMCID: PMC8218915 DOI: 10.1155/2021/4312850] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/22/2021] [Accepted: 05/21/2021] [Indexed: 01/06/2023]
Abstract
Accurate screening on cancer biomarkers contributes to health assessment, drug screening, and targeted therapy for precision medicine. The rapid development of high-throughput sequencing technology has identified abundant genomic biomarkers, but most of them are limited to single-cancer analysis. Based on the combination of Fisher score, Recursive feature elimination, and Logistic regression (FRL), this paper proposes an integrative feature selection algorithm named FRL to explore potential cancer genomic biomarkers on cancer subsets. Fisher score is initially used to calculate the weights of genes to rapidly reduce the dimension. Recursive feature elimination and Logistic regression are then jointly employed to extract the optimal subset. Compared to the current differential expression analysis tool GEO2R based on the Limma algorithm, FRL has greater classification precision than Limma. Compared with five traditional feature selection algorithms, FRL exhibits excellent performance on accuracy (ACC) and F1-score and greatly improves computational efficiency. On high-noise datasets such as esophageal cancer, the ACC of FRL is 30% superior to the average ACC achieved with other traditional algorithms. As biomarkers found in multiple studies are more reliable and reproducible, and reveal stronger association on potential clinical value than single analysis, through literature review and spatial analyses of gene functional enrichment and functional pathways, we conduct cluster analysis on 10 diverse cancers with high mortality and form a potential biomarker module comprising 19 genes. All genes in this module can serve as potential biomarkers to provide more information on the overall oncogenesis mechanism for the detection of diverse early cancers and assist in targeted anticancer therapies for further developments in precision medicine.
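The first stage described above, weighting genes by Fisher score to cut the dimension quickly, can be sketched with the common textbook definition: between-class scatter of the class means divided by the pooled within-class variance. This is an illustration under that assumption, not the authors' FRL code; the function names, the `top_k` helper, and the toy values are hypothetical.

```python
from statistics import mean, pvariance


def fisher_score(feature, labels) -> float:
    """Fisher score of one feature: sum_k n_k (mu_k - mu)^2 / sum_k n_k var_k."""
    overall = mean(feature)
    between = 0.0
    within = 0.0
    for cls in set(labels):
        values = [v for v, lab in zip(feature, labels) if lab == cls]
        between += len(values) * (mean(values) - overall) ** 2
        within += len(values) * pvariance(values)
    if within == 0:
        # Zero within-class spread: perfectly separating unless also zero between.
        return float("inf") if between > 0 else 0.0
    return between / within


def top_k(features: dict, labels, k: int) -> list:
    """Hypothetical helper: names of the k features with the highest Fisher score."""
    ranked = sorted(features, key=lambda name: fisher_score(features[name], labels), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    labels = [0, 0, 0, 1, 1, 1]
    genes = {
        "gene_sep": [1, 2, 3, 7, 8, 9],    # class means differ strongly
        "gene_noise": [5, 1, 9, 5, 1, 9],  # identical distribution in both classes
    }
    print(top_k(genes, labels, k=1))
```

The surviving genes would then go to a wrapper stage such as recursive feature elimination around a logistic regression, which is the costlier step the filter is meant to shrink.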
Affiliation(s)
- Chenyu Ge
- School of Mechanical, Electrical, & Information Engineering, Shandong University, Jinan 250000, China
- Liqun Luo
- Department of Information Management, Peking University, Beijing 100000, China
- Jialin Zhang
- Laboratoire de Recherche en Informatique, Paris-Saclay University, Paris 91405, France
- Xiangbing Meng
- Qufu Institute of Traditional Chinese Medical Health and Rehabilitation, Qufu 273100, China
- Yun Chen
- The Second Hospital Affiliated to Shandong University of TCM, Jinan 250000, China
12
13
Nazari E, Farzin AH, Aghemiri M, Avan A, Tara M, Tabesh H. Deep Learning for Acute Myeloid Leukemia Diagnosis. J Med Life 2020; 13:382-387. [PMID: 33072212 PMCID: PMC7550141 DOI: 10.25122/jml-2019-0090] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 11/17/2022] Open
Abstract
With changing lifestyles and rising cancer incidence, accurate diagnosis has become a significant medical task. Today, DNA microarrays are widely used in cancer diagnosis and screening because they can measure gene expression levels. Analyzing these data with common statistical methods is not suitable because of the high dimensionality of gene expression data. This study therefore aims to use new techniques to diagnose acute myeloid leukemia. The leukemia microarray gene data, containing 22283 genes, were extracted from the Gene Expression Omnibus repository. Initial preprocessing was applied using a normalization test and principal component analysis (PCA) in Python. A deep neural network (DNN) was then designed and applied to the data, and finally the results were cross-validated by classifiers. The normalization test was significant (P>0.05), and the results show the PCA gene segregation potential and independence of cancer and healthy cells. The accuracies of the single-layer neural network and the DNN with three hidden layers are 63.33 and 96.67, respectively. Using new methods such as deep learning can improve diagnosis accuracy and performance compared to the old methods. It is recommended to use these methods in cancer diagnosis and effective gene selection in various types of cancer.
Affiliation(s)
- Elham Nazari
- Department of Medical Informatics, Faculty of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran
- Mehran Aghemiri
- Department of Medical Informatics, Faculty of Medical Sciences, Tarbiat Modares University, Tehran, Iran
- Amir Avan
- Molecular Medicine Group, Department of Modern Sciences and Technologies, School of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran
- Mahmood Tara
- Department of Medical Informatics, Faculty of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran
- Hamed Tabesh
- Department of Medical Informatics, Faculty of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran
14
A novel dictionary learning method based on total least squares approach with application in high dimensional biological data. ADV DATA ANAL CLASSI 2020. [DOI: 10.1007/s11634-020-00417-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/27/2022]
15
Comprehensive relative importance analysis and its applications to high dimensional gene expression data analysis. Knowl Based Syst 2020. [DOI: 10.1016/j.knosys.2020.106120] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/24/2022]
16
Li M, Zhang C, Zhou L, Li S, Cao YJ, Wang L, Xiang R, Shi Y, Piao Y. Identification and validation of novel DNA methylation markers for early diagnosis of lung adenocarcinoma. Mol Oncol 2020; 14:2744-2758. [PMID: 32688456 PMCID: PMC7607165 DOI: 10.1002/1878-0261.12767] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 03/04/2020] [Revised: 06/07/2020] [Accepted: 07/16/2020] [Indexed: 12/15/2022] Open
Abstract
Lung cancer has the highest mortality of all cancers worldwide. Epigenetic alterations have emerged as potential biomarkers for early diagnosis of various cancer tissue types. To identify methylation markers for early diagnosis of lung adenocarcinoma, we aimed to integrate genome‐wide DNA methylation and gene expression data from The Cancer Genome Atlas. To this end, we first examined the global DNA methylation pattern of lung adenocarcinoma and investigated the relationship between DNA methylation subtypes and clinical features. We then extracted differentially methylated and expressed genes, and adopted feature selection techniques to determine the final methylation markers. The performance of the markers in predicting lung adenocarcinoma was evaluated on three independent datasets from Gene Expression Omnibus. Protein levels of marker genes were validated by immunohistochemistry, and their biological function was further verified in vivo. We identified three novel methylation markers in lung adenocarcinoma including cg08032924, cg14823851, and cg19161124, mapping to CMTM2, TBX4, and DPP6, respectively. Validating these results on three independent datasets indicated that the three markers can achieve extremely high sensitivity and specificity in distinguishing lung adenocarcinoma from normal samples. Immunohistochemistry quantification results confirmed that markers are weakly expressed in human lung adenocarcinoma, and CMTM2 decreased tumor growth of mouse Lewis lung carcinoma in vivo. Overall, our study identified three novel methylation markers in lung adenocarcinoma which may contribute toward an improved diagnosis potentially leading to a better outcome for patients with lung adenocarcinoma.
Affiliation(s)
- Miao Li
- School of Medicine, Nankai University, Tianjin, China
- Chen Zhang
- School of Medicine, Nankai University, Tianjin, China
- Lijun Zhou
- School of Medicine, Nankai University, Tianjin, China
- Siyu Li
- School of Medicine, Nankai University, Tianjin, China
- Yuan Jie Cao
- Department of Radiation and Oncology, National Clinical Research Center for Cancer and Tianjin Key Laboratory of Cancer Prevention and Therapy, Tianjin Medical University Cancer Institute and Hospital, Tianjin, China
- Longlong Wang
- School of Medicine, Nankai University, Tianjin, China; Tianjin Key Laboratory of Human Development and Reproductive Regulation, Nankai University Affiliated Hospital of Obstetrics and Gynecology, Tianjin, China
- Rong Xiang
- School of Medicine, Nankai University, Tianjin, China
- Yi Shi
- School of Medicine, Nankai University, Tianjin, China; Tianjin Key Laboratory of Human Development and Reproductive Regulation, Nankai University Affiliated Hospital of Obstetrics and Gynecology, Tianjin, China
- Yongjun Piao
- School of Medicine, Nankai University, Tianjin, China; Tianjin Key Laboratory of Human Development and Reproductive Regulation, Nankai University Affiliated Hospital of Obstetrics and Gynecology, Tianjin, China
|
17
|
Abdulrauf Sharifai G, Zainol Z. Feature Selection for High-Dimensional and Imbalanced Biomedical Data Based on Robust Correlation Based Redundancy and Binary Grasshopper Optimization Algorithm. Genes (Basel) 2020; 11:genes11070717. [PMID: 32605144 PMCID: PMC7397300 DOI: 10.3390/genes11070717] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 09/26/2019] [Revised: 12/19/2019] [Accepted: 01/07/2020] [Indexed: 11/16/2022] Open
Abstract
Training a machine learning algorithm on an imbalanced data set is an inherently challenging task. It becomes more demanding with limited samples but a massive number of features (high dimensionality). High dimensional and imbalanced data sets pose severe challenges in many real-world applications, such as biomedical data sets. Numerous researchers have investigated either imbalanced-class or high dimensional data sets and proposed various methods. Nonetheless, few approaches reported in the literature have addressed the intersection of the high dimensional and imbalanced class problems, owing to their complicated interactions. Lately, feature selection has become a well-known technique for overcoming this problem by selecting discriminative features that represent the minority and majority classes. This paper proposes a new method called Robust Correlation Based Redundancy and Binary Grasshopper Optimization Algorithm (rCBR-BGOA); rCBR-BGOA employs an ensemble of multi-filters coupled with the Correlation-Based Redundancy method to select optimal feature subsets. A binary Grasshopper optimisation algorithm (BGOA) is used to formulate the feature selection process as an optimisation problem and to select the best (near-optimal) combination of features from the majority and minority classes. The obtained results, supported by statistical analysis, indicate that rCBR-BGOA can improve the classification performance for high dimensional and imbalanced datasets in terms of the G-mean and Area Under the Curve (AUC) performance metrics.
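For readers unfamiliar with the G-mean metric named in this abstract: it is commonly defined as the geometric mean of sensitivity and specificity, which penalizes a classifier that ignores the minority class. The sketch below is a minimal illustration under that standard definition and is not the rCBR-BGOA implementation; the function name `g_mean` and the toy labels are assumptions made for illustration.

```python
import math


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity for binary labels.

    Illustrative helper (hypothetical name); assumes the common definition
    G-mean = sqrt(sensitivity * specificity).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # Guard against classes that are absent from y_true.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


# Toy imbalanced data: 2 minority (positive) and 8 majority samples.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_majority = [0] * 10  # 80% accurate, but finds no positives
half_recall = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

print(g_mean(y_true, always_majority))
print(g_mean(y_true, half_recall))
```

A classifier that always predicts the majority class scores 80% accuracy on this toy data but a G-mean of zero, which is why the metric is often preferred for imbalanced problems.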
Affiliation(s)
- Garba Abdulrauf Sharifai
- Department of Computer Sciences, Yusuf Maitama Sule University, 700222 Kofar Nassarawa, Kano, Nigeria
- School of Computer Sciences, Universiti Sains Malaysia, 11800 Gelugor, Malaysia
- Correspondence: Tel.: +60-111-317-0481 or +60-194-004-327
- Zurinahni Zainol
- School of Computer Sciences, Universiti Sains Malaysia, 11800 Gelugor, Malaysia
|
18
|
SGL-SVM: A novel method for tumor classification via support vector machine with sparse group Lasso. J Theor Biol 2020; 486:110098. [DOI: 10.1016/j.jtbi.2019.110098] [Citation(s) in RCA: 33] [Impact Index Per Article: 8.3] [Received: 05/15/2019] [Revised: 11/27/2019] [Accepted: 11/28/2019] [Indexed: 02/07/2023]
|
19
|
Xu W, Xu M, Wang L, Zhou W, Xiang R, Shi Y, Zhang Y, Piao Y. Integrative analysis of DNA methylation and gene expression identified cervical cancer-specific diagnostic biomarkers. Signal Transduct Target Ther 2019; 4:55. [PMID: 31871774 PMCID: PMC6908647 DOI: 10.1038/s41392-019-0081-6] [Citation(s) in RCA: 52] [Impact Index Per Article: 10.4] [Received: 03/02/2019] [Revised: 04/25/2019] [Accepted: 05/10/2019] [Indexed: 12/24/2022] Open
Abstract
Cervical cancer is the leading cause of death among women with cancer worldwide. Here, we performed an integrative analysis of Illumina HumanMethylation450K and RNA-seq data from TCGA to identify cervical cancer-specific DNA methylation markers. We first identified differentially methylated and expressed genes and examined the correlation between DNA methylation and gene expression. The DNA methylation profiles of 12 types of cancers, including cervical cancer, were used to generate a candidate set, and machine-learning techniques were adopted to define the final cervical cancer-specific markers in the candidate set. Then, we assessed the protein levels of marker genes by immunohistochemistry by using tissue arrays containing 93 human cervical squamous cell carcinoma samples and cancer-adjacent normal tissues. Promoter methylation was negatively correlated with the local regulation of gene expression. In the distant regulation of gene expression, the methylation of hypermethylated genes was more likely to be negatively correlated with gene expression, while the methylation of hypomethylated genes was more likely to be positively correlated with gene expression. Moreover, we identified four cervical cancer-specific methylation markers, cg07211381 (RAB3C), cg12205729 (GABRA2), cg20708961 (ZNF257), and cg26490054 (SLC5A8), with 96.2% sensitivity and 95.2% specificity by using the tenfold cross-validation of TCGA data. The four markers could distinguish tumors from normal tissues with a 94.2, 100, 100, and 100% AUC in four independent validation sets from the GEO database. Overall, our study demonstrates the potential use of methylation markers in cervical cancer diagnosis and may boost the development of new epigenetic therapies.
Affiliation(s)
- Wanxue Xu
- School of Medicine, Nankai University, Tianjin, China
- Mengyao Xu
- School of Medicine, Nankai University, Tianjin, China
- Longlong Wang
- School of Medicine, Nankai University, Tianjin, China
- Tianjin Key Laboratory of Human Development and Reproductive Regulation, Nankai University Affiliated Hospital of Obstetrics and Gynecology, Tianjin, China
- Wei Zhou
- School of Medicine, Nankai University, Tianjin, China
- Rong Xiang
- School of Medicine, Nankai University, Tianjin, China
- Yi Shi
- School of Medicine, Nankai University, Tianjin, China
- Tianjin Key Laboratory of Human Development and Reproductive Regulation, Nankai University Affiliated Hospital of Obstetrics and Gynecology, Tianjin, China
- Yunshan Zhang
- Reproductive Medical Center, Nankai University Affiliated Hospital of Obstetrics and Gynecology, Tianjin, China
- Yongjun Piao
- School of Medicine, Nankai University, Tianjin, China
- Tianjin Key Laboratory of Human Development and Reproductive Regulation, Nankai University Affiliated Hospital of Obstetrics and Gynecology, Tianjin, China
|
20
|
Zhao Q, Zhang Y. Ensemble Method of Feature Selection and Reverse Construction of Gene Logical Network Based on Information Entropy. INT J PATTERN RECOGN 2019. [DOI: 10.1142/s0218001420590041] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 11/18/2022]
Abstract
In this paper, we propose a novel ensemble gene selection method to obtain a gene subset. We then provide a reverse construction method for a gene network derived from the expression profile data of the gene subset. The uncertainty coefficient based on information entropy is used to define the existence of logical relations among these genes. If the uncertainty coefficient between genes exceeds predefined thresholds, the gene nodes are connected by directed edges. Thus, a gene network is generated, which we define as a gene logical network. This method is applied to breast cancer data including a control group and an experimental group, with comparisons of the 2nd-order logic type distribution, average degree, and average path length of the networks. It is found that the structures of the different networks are quite distinct. By comparing the degree difference between the control group and the experimental group, the key genes are identified. By defining the dynamic evolution rules of state transition based on the logical regulation among the key genes in the network, the dynamic behaviors of normal breast cells and cancer cells at different stages are simulated numerically. Some of them are highly related to the development of breast cancer according to a literature search. The study may provide useful insight into the biological mechanisms underlying the formation and development of cancer.
Affiliation(s)
- Qingfeng Zhao
- College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao, Shandong 266590, P. R. China
- Shandong Province Key Laboratory of Wisdom Mine Information Technology, Shandong University of Science and Technology, Qingdao 266590, P. R. China
- Yulin Zhang
- College of Mathematics and Systems Science, Shandong University of Science and Technology, Qingdao, Shandong 266590, P. R. China
|
21
|
Symmetrical Uncertainty-Based Feature Subset Generation and Ensemble Learning for Electricity Customer Classification. Symmetry (Basel) 2019. [DOI: 10.3390/sym11040498] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 11/17/2022] Open
Abstract
The use of actual electricity consumption data provided the chance to detect the change of customer class types. This work could be done by using classification techniques. However, there are several challenges in computational techniques. The most important one is to efficiently handle a large number of dimensions to increase customer classification performance. In this paper, we proposed a symmetrical uncertainty based feature subset generation and ensemble learning method for the electricity customer classification. Redundant and significant feature sets are generated according to symmetrical uncertainty. After that, a classifier ensemble is built based on significant feature sets and the results are combined for the final decision. The results show that the proposed method can efficiently find useful feature subsets and improve classification performance.
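Symmetrical uncertainty, the measure this abstract relies on, is usually defined for two discrete variables as SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)), where H is Shannon entropy and I is mutual information. The sketch below assumes that standard definition and already-discretized features; it is not the authors' code, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), ranging from 0 to 1."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0  # both variables are constant
    joint = entropy(list(zip(x, y)))
    mutual_information = hx + hy - joint
    return 2.0 * mutual_information / (hx + hy)


# Toy discretized feature columns and a class label (hypothetical data).
label = [0, 0, 1, 1]
informative = ["low", "low", "high", "high"]  # tracks the label exactly
uninformative = ["a", "b", "a", "b"]  # independent of the label

print(symmetrical_uncertainty(informative, label))
print(symmetrical_uncertainty(uninformative, label))
```

A feature with a high SU against the class label is a candidate for a significant set, while a high SU between two features suggests redundancy; the thresholds for either decision are a design choice not given in the abstract.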
|
22
|
Novel tumor suppressor SPRYD4 inhibits tumor progression in hepatocellular carcinoma by inducing apoptotic cell death. Cell Oncol (Dordr) 2018; 42:55-66. [PMID: 30238408 DOI: 10.1007/s13402-018-0407-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Accepted: 08/29/2018] [Indexed: 02/06/2023] Open
Abstract
BACKGROUND Hepatocellular carcinoma (HCC) is one of the leading causes of cancer-associated deaths worldwide. Although recent studies have proposed different biomarkers for HCC progression and therapy resistance, a better understanding of the molecular mechanisms underlying HCC progression and recurrence, as well as the identification of molecular markers with a higher diagnostic accuracy, are necessary for the development of more effective clinical management strategies. Here, we aimed to identify novel players in HCC progression. METHODS SPRYD4 mRNA and protein expression analyses were carried out on a normal liver-derived cell line (HL-7702) and four HCC-derived cell lines (HepG2, SMMC7721, Huh-7, BEL-7402) using qRT-PCR and Western blotting, respectively. Cell proliferation Cell Counting Kit-8 (CCK-8) assays, protein expression analyses for apoptosis markers using Western blotting, and Caspase-Glo 3/7 apoptosis assays were carried out on the four HCC-derived cell lines. Expression comparison, functional annotation, gene set enrichment, correlation and survival analyses were carried out on patient data retrieved from the NCBI Gene module, the NCBI GEO database and the TCGA database. RESULTS Through a meta-analysis we found that the expression of SPRYD4 was downregulated in primary HCC tissues compared to non-tumor tissues. We also found that the expression of SPRYD4 was downregulated in HCC-derived cells compared to normal liver-derived cells. Subsequently, we found that the expression of SPRYD4 was inversely correlated with a gene signature associated with HCC cell proliferation. Exogenous SPRYD4 expression was found to inhibit HCC cell proliferation by inducing apoptotic cell death. We also found that SPRYD4 expression was associated with a good prognosis and that its expression became downregulated when HCCs progressed towards more aggressive stages and higher grades. 
Finally, we found that SPRYD4 expression may serve as a biomarker for a good overall and relapse-free survival in HCC patients. CONCLUSIONS Our data indicate that a decreased SPRYD4 expression may serve as an independent predictor for a poor prognosis in patients with HCC and that increased SPRYD4 expression may reduce HCC growth and progression through the induction of apoptotic cell death, thereby providing a potential therapeutic target.
|
23
|
Xia CQ, Han K, Qi Y, Zhang Y, Yu DJ. A Self-Training Subspace Clustering Algorithm under Low-Rank Representation for Cancer Classification on Gene Expression Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018; 15:1315-1324. [PMID: 28600258 PMCID: PMC5986621 DOI: 10.1109/tcbb.2017.2712607] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Indexed: 05/04/2023]
Abstract
Accurate identification of cancer types is essential to cancer diagnosis and treatment. Since cancer tissue and normal tissue have different gene expression, gene expression data can be used as an efficient feature source for cancer classification. However, accurate cancer classification directly using original gene expression profiles remains challenging due to the intrinsically high dimensionality of the features and the small size of the data samples. We proposed a new self-training subspace clustering algorithm under low-rank representation, called SSC-LRR, for cancer classification on gene expression data. Low-rank representation (LRR) is first applied to extract discriminative features from the high-dimensional gene expression data; the self-training subspace clustering (SSC) method is then used to generate the cancer classification predictions. SSC-LRR was tested on two separate benchmark datasets in comparison with four state-of-the-art classification methods. It generated cancer classification predictions with an overall accuracy of 89.7 percent and a general correlation of 0.920, which are 18.9 and 24.4 percent higher than those of the best control method, respectively. In addition, several genes (RNF114, HLA-DRB5, USP9Y, and PTPN20) were identified by SSC-LRR as new cancer identifiers that deserve further clinical investigation. Overall, the study demonstrated a new sensitive avenue for recognizing cancer classes from large-scale gene expression data.
|
24
|
A novel effective diagnosis model based on optimized least squares support machine for gene microarray. Appl Soft Comput 2018. [DOI: 10.1016/j.asoc.2018.02.009] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Indexed: 11/21/2022]
|
25
|
|
26
|
Piao Y, Piao M, Ryu KH. Multiclass cancer classification using a feature subset-based ensemble from microRNA expression profiles. Comput Biol Med 2017; 80:39-44. [DOI: 10.1016/j.compbiomed.2016.11.008] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.1] [Received: 09/09/2016] [Revised: 11/15/2016] [Accepted: 11/20/2016] [Indexed: 11/24/2022]
|
27
|
Salem H, Attiya G, El-Fishawy N. Classification of human cancer diseases by gene expression profiles. Appl Soft Comput 2017. [DOI: 10.1016/j.asoc.2016.11.026] [Citation(s) in RCA: 64] [Impact Index Per Article: 9.1] [Indexed: 01/08/2023]
|
28
|
Neumann U, Riemenschneider M, Sowa JP, Baars T, Kälsch J, Canbay A, Heider D. Compensation of feature selection biases accompanied with improved predictive performance for binary classification by using a novel ensemble feature selection approach. BioData Min 2016; 9:36. [PMID: 27891179 PMCID: PMC5116216 DOI: 10.1186/s13040-016-0114-4] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.6] [Received: 06/23/2016] [Accepted: 10/27/2016] [Indexed: 11/10/2022] Open
Abstract
MOTIVATION Biomarker discovery methods are essential to identify a minimal subset of features (e.g., serum markers in predictive medicine) that are relevant to develop prediction models with high accuracy. By now, there exist diverse feature selection methods, which either are embedded, combined, or independent of predictive learning algorithms. Many preceding studies showed the defectiveness of single feature selection results, which cause difficulties for professionals in a variety of fields (e.g., medical practitioners) to analyze and interpret the obtained feature subsets. Whereas each of these methods is highly biased, an ensemble feature selection has the advantage to alleviate and compensate for such biases. Concerning the reliability, validity, and reproducibility of these methods, we examined eight different feature selection methods for binary classification datasets and developed an ensemble feature selection system. RESULTS By using an ensemble of feature selection methods, a quantification of the importance of the features could be obtained. The prediction models that have been trained on the selected features showed improved prediction performance.
Affiliation(s)
- Ursula Neumann
- Department of Bioinformatics, Straubing, 94315 Germany; University of Applied Science, Weihenstephan-Triesdorf, Freising, 85354 Germany; Wissenschaftszentrum Weihenstephan, Technische Universität München, Freising, 85354 Germany
- Mona Riemenschneider
- Department of Bioinformatics, Straubing, 94315 Germany; University of Applied Science, Weihenstephan-Triesdorf, Freising, 85354 Germany
- Jan-Peter Sowa
- Department of Gastroenterology and Hepatology, University Hospital, University Duisburg-Essen, Essen, 45122 Germany
- Theodor Baars
- Clinic for Cardiology, West German Heart and Vascular Centre Essen, University Hospital, University Duisburg-Essen, Essen, 45122 Germany
- Julia Kälsch
- Department of Gastroenterology and Hepatology, University Hospital, University Duisburg-Essen, Essen, 45122 Germany
- Ali Canbay
- Department of Gastroenterology and Hepatology, University Hospital, University Duisburg-Essen, Essen, 45122 Germany
- Dominik Heider
- Department of Bioinformatics, Straubing, 94315 Germany; University of Applied Science, Weihenstephan-Triesdorf, Freising, 85354 Germany; Wissenschaftszentrum Weihenstephan, Technische Universität München, Freising, 85354 Germany
|
29
|
Devi Arockia Vanitha C, Devaraj D, Venkatesulu M. Multiclass cancer diagnosis in microarray gene expression profile using mutual information and Support Vector Machine. INTELL DATA ANAL 2016. [DOI: 10.3233/ida-150203] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 01/08/2023]
Affiliation(s)
- D. Devaraj
- Department of Electrical and Electronics Engineering, Kalasalingam Academy of Research and Education, Krishnankoil, Tamil Nadu, India
- M. Venkatesulu
- Department of Computer Applications, Kalasalingam Academy of Research and Education, Krishnankoil, Tamil Nadu, India
|
30
|
Izadi F, Zarrini HN, Kiani G, Jelodar NB. A comparative analytical assay of gene regulatory networks inferred using microarray and RNA-seq datasets. Bioinformation 2016; 12:340-346. [PMID: 28293077 PMCID: PMC5320930 DOI: 10.6026/97320630012340] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 07/07/2016] [Revised: 08/05/2016] [Accepted: 08/06/2016] [Indexed: 01/16/2023] Open
Abstract
A Gene Regulatory Network (GRN) is a collection of interactions between molecular regulators and their targets in cells governing gene expression levels. The explosion of omics data generated from high-throughput genomic assays such as microarray and RNA-Seq technologies, together with the emergence of a number of pre-processing methods, demands suitable guidelines to determine the impact of transcript data platforms and normalization procedures on describing associations in GRNs. In this study, exploiting publicly available microarray and RNA-Seq datasets and a gold standard of transcriptional interactions in Arabidopsis, we compared six GRNs derived from RNA-Seq and microarray data with different normalization procedures. We observed that the compared algorithms were highly data-specific and that networks reconstructed from RNA-Seq data showed considerable accuracy relative to the corresponding networks derived from microarrays. Topological analysis showed that GRNs inferred from the two platforms were similar in several topological features, although we observed more connectivity in the RNA-Seq-derived gene network. Taken together, transcriptional regulatory networks obtained from Robust Multiarray Averaging (RMA) and Variance-Stabilizing Transformed (VST) normalized data predicted a higher rate of true edges than the other methods used in this comparison.
Affiliation(s)
- Fereshteh Izadi
- Plant Breeding Department, Sari Agricultural Sciences and Natural Resources, Iran
- Hamid Najafi Zarrini
- Plant Breeding Department, Sari Agricultural Sciences and Natural Resources, Iran
- Ghaffar Kiani
- Plant Breeding Department, Sari Agricultural Sciences and Natural Resources, Iran
|
31
|
Sun S, Peng Q, Zhang X. Global feature selection from microarray data using Lagrange multipliers. Knowl Based Syst 2016. [DOI: 10.1016/j.knosys.2016.07.035] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Indexed: 11/30/2022]
|
32
|
Chen H, Zhang Y, Gutman I. A kernel-based clustering method for gene selection with gene expression data. J Biomed Inform 2016; 62:12-20. [DOI: 10.1016/j.jbi.2016.05.007] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.6] [Received: 08/09/2015] [Revised: 05/08/2016] [Accepted: 05/19/2016] [Indexed: 12/21/2022]
|
33
|
Nayyeri M, Sharifi Noghabi H. Cancer classification by correntropy-based sparse compact incremental learning machine. GENE REPORTS 2016. [DOI: 10.1016/j.genrep.2016.01.001] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 10/22/2022]
|
34
|
Mohammadi M, Sharifi Noghabi H, Abed Hodtani G, Rajabi Mashhadi H. Robust and stable gene selection via Maximum–Minimum Correntropy Criterion. Genomics 2016; 107:83-87. [DOI: 10.1016/j.ygeno.2015.12.006] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.9] [Received: 09/14/2015] [Revised: 12/13/2015] [Accepted: 12/23/2015] [Indexed: 11/17/2022]
|
35
|
Mishra S, Mishra D. Enhanced gene ranking approaches using modified trace ratio algorithm for gene expression data. INFORMATICS IN MEDICINE UNLOCKED 2016. [DOI: 10.1016/j.imu.2016.09.005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022] Open
|
36
|
Liao B, Jiang Y, Liang W, Peng L, Peng L, Hanyurwimfura D, Li Z, Chen M. On Efficient Feature Ranking Methods for High-Throughput Data Analysis. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2015; 12:1374-1384. [PMID: 26684461 DOI: 10.1109/tcbb.2015.2415790] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 06/05/2023]
Abstract
Efficient mining of high-throughput data has become one of the popular themes in the big data era. Existing biology-related feature ranking methods mainly focus on statistical and annotation information. In this study, two efficient feature ranking methods are presented. Multi-target regression and graph embedding are incorporated in an optimization framework, and feature ranking is achieved by introducing a structured sparsity norm. Unlike existing methods, the presented methods have two advantages: (1) The feature subset simultaneously accounts for global margin information as well as locality manifold information. Consequently, both global and locality information are considered. (2) Features are selected by batch rather than individually in the algorithm framework. Thus, the interactions between features are considered and the optimal feature subset can be guaranteed. In addition, this study presents a theoretical justification. Empirical experiments on a set of real-world gene expression data sets demonstrate the effectiveness and efficiency of the two algorithms in comparison with some state-of-the-art feature ranking methods.
|
37
|
Li P, Piao Y, Shon HS, Ryu KH. Comparing the normalization methods for the differential analysis of Illumina high-throughput RNA-Seq data. BMC Bioinformatics 2015; 16:347. [PMID: 26511205 PMCID: PMC4625728 DOI: 10.1186/s12859-015-0778-7] [Citation(s) in RCA: 103] [Impact Index Per Article: 11.4] [Received: 06/03/2015] [Accepted: 10/14/2015] [Indexed: 01/08/2023] Open
Abstract
Background: Recently, rapid improvements in technology and a decrease in sequencing costs have made RNA-Seq a widely used technique to quantify gene expression levels. Various normalization approaches have been proposed, owing to the importance of normalization in the analysis of RNA-Seq data. A comparison of recently proposed normalization methods is required to generate suitable guidelines for the selection of the most appropriate approach for future experiments. Results: In this paper, we compared eight non-abundance (RC, UQ, Med, TMM, DESeq, Q, RPKM, and ERPKM) and two abundance estimation normalization methods (RSEM and Sailfish). The experiments were based on real Illumina high-throughput RNA-Seq of 35- and 76-nucleotide sequences produced in the MAQC project and on simulated reads. Reads were mapped to the human genome obtained from the UCSC Genome Browser Database. For precise evaluation, we investigated the Spearman correlation between the normalization results from RNA-Seq and MAQC qRT-PCR values for 996 genes. Based on this work, we showed that out of the eight non-abundance estimation normalization methods, RC, UQ, Med, TMM, DESeq, and Q gave similar normalization results for all data sets. For RNA-Seq of a 35-nucleotide sequence, RPKM showed the highest correlation, but for RNA-Seq of a 76-nucleotide sequence, it showed the lowest correlation among the methods. ERPKM did not improve on the results of RPKM. Between the two abundance estimation normalization methods, for RNA-Seq of a 35-nucleotide sequence, a higher correlation was obtained with Sailfish than with RSEM, which in turn was better than not using abundance estimation methods. However, for RNA-Seq of a 76-nucleotide sequence, the results achieved by RSEM were similar to those obtained without abundance estimation methods, and were much better than those with Sailfish. Furthermore, we found that adding a poly-A tail increased alignment numbers, but did not improve normalization results.
Conclusion: Spearman correlation analysis revealed that RC, UQ, Med, TMM, DESeq, and Q did not noticeably improve gene expression normalization, regardless of read length. Other normalization methods were more efficient when alignment accuracy was low; Sailfish with RPKM gave the best normalization results. When alignment accuracy was high, RC was sufficient for gene expression calculation. We suggest ignoring the poly-A tail during differential gene expression analysis. Electronic supplementary material: The online version of this article (doi:10.1186/s12859-015-0778-7) contains supplementary material, which is available to authorized users.
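The evaluation criterion in this abstract, Spearman correlation between normalized RNA-Seq values and qRT-PCR values, is the Pearson correlation of the two rank vectors. The sketch below illustrates that computation in plain Python with average ranks for ties; it is not the authors' pipeline, and the function names and the five toy expression values are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    if va == 0 or vb == 0:
        return float("nan")  # undefined when one vector is constant
    return cov / math.sqrt(va * vb)


# Hypothetical per-gene values: normalized RNA-Seq vs. qRT-PCR.
rnaseq = [12.0, 150.0, 3.5, 980.0, 45.0]
qpcr = [0.8, 6.1, 0.2, 30.0, 2.4]

print(spearman(rnaseq, qpcr))
```

Because only ranks matter, the result is unaffected by monotone rescaling of either measurement, which is what makes the statistic suitable for comparing normalization methods that output values on different scales.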
Affiliation(s)
- Peipei Li
- College of Electrical and Computer Engineering, Chungbuk National University, Cheongju-si, South Korea.
- Yongjun Piao
- College of Electrical and Computer Engineering, Chungbuk National University, Cheongju-si, South Korea.
- Ho Sun Shon
- College of Electrical and Computer Engineering, Chungbuk National University, Cheongju-si, South Korea.
- Keun Ho Ryu
- College of Electrical and Computer Engineering, Chungbuk National University, Cheongju-si, South Korea.
|
38
|
Sachnev V, Saraswathi S, Niaz R, Kloczkowski A, Suresh S. Multi-class BCGA-ELM based classifier that identifies biomarkers associated with hallmarks of cancer. BMC Bioinformatics 2015; 16:166. [PMID: 25986937 PMCID: PMC4448565 DOI: 10.1186/s12859-015-0565-5] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Received: 02/02/2015] [Accepted: 03/31/2015] [Indexed: 12/05/2022] Open
Abstract
Background Traditional cancer treatments have centered on cytotoxic drugs and general purpose chemotherapy that may not be tailored to treat specific cancers. Identification of molecular markers that are related to different types of cancers might lead to discovery of drugs that are patient and disease specific. This study aims to use microarray gene expression cancer data to identify biomarkers that are indicative of different types of cancers. Our aim is to provide a multi-class cancer classifier that can simultaneously differentiate between cancers and identify type-specific biomarkers, through the application of the Binary Coded Genetic Algorithm (BCGA) and a neural network based Extreme Learning Machine (ELM) algorithm. Results BCGA and ELM are combined and used to select a subset of genes that are present in the Global Cancer Mapping (GCM) data set. This set of candidate genes contains over 52 biomarkers that are related to multiple cancers, according to the literature. They include APOA1, VEGFC, YWHAZ, B2M, EIF2S1, CCR9 and many other genes that have been associated with the hallmarks of cancer. BCGA-ELM is tested on several cancer data sets and the results are compared to other classification methods. BCGA-ELM compares or exceeds other algorithms in terms of accuracy. We were also able to show that over 50% of genes selected by BCGA-ELM on GCM data are cancer related biomarkers. Conclusions We were able to simultaneously differentiate between 14 different types of cancers, using only 92 genes, to achieve a multi-class classification accuracy of 95.4% which is between 21.6% and 38% higher than other results in the literature for multi-class cancer classification. Our findings suggest that computational algorithms such as BCGA-ELM can facilitate biomarker-driven integrated cancer research that can lead to a detailed understanding of the complexities of cancer. 
Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0565-5) contains supplementary material, which is available to authorized users.
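As a rough illustration of the binary-coded genetic algorithm idea named in this abstract, the following minimal sketch evolves 0/1 masks over genes. It is not the authors' implementation: in the paper the fitness comes from an ELM classifier, whereas `toy_fitness`, `INFORMATIVE`, and all parameter values here are hypothetical stand-ins chosen only so the example runs on its own.

```python
import random

def bcga_select(fitness, n_genes, pop_size=30, generations=40,
                crossover_rate=0.9, mutation_rate=0.02, seed=0):
    """Binary-coded GA sketch: each chromosome is a 0/1 mask over genes.

    Returns the best mask found and the best fitness seen per generation.
    """
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    history = [fitness(best)]

    def tournament():
        # Pick two distinct chromosomes and keep the fitter one.
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        children = [best[:]]  # elitism: the best mask always survives
        while len(children) < pop_size:
            p1, p2 = tournament(), tournament()
            if rng.random() < crossover_rate:
                cut = rng.randrange(1, n_genes)  # single-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Bit-flip mutation, applied independently to each position.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        pop = children
        best = max(pop, key=fitness)
        history.append(fitness(best))
    return best, history

# Hypothetical fitness (assumption): reward three "informative" genes and
# penalise mask size. A real wrapper would use classifier accuracy instead.
INFORMATIVE = {2, 5, 7}

def toy_fitness(mask):
    hits = sum(1 for i, bit in enumerate(mask) if bit and i in INFORMATIVE)
    return hits - 0.1 * sum(mask)

best_mask, history = bcga_select(toy_fitness, n_genes=12)
print(best_mask, history[-1])
```

Because the best chromosome is copied unchanged into each new generation, the best fitness in `history` can never decrease. Swapping `toy_fitness` for a cross-validated classifier score would turn this into a wrapper-style gene selector.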
Affiliation(s)
- Vasily Sachnev: Department of Information, Communication and Electronics Engineering, Catholic University of Korea, Bucheon, Republic of Korea.
- Saras Saraswathi: Battelle Center for Mathematical Medicine at The Research Institute at Nationwide Children's Hospital; currently at Sidra Medical and Research Center, Doha, Qatar.
- Rashid Niaz: Department of Medical Informatics, Sidra Medical and Research Center, Doha, Qatar.
- Andrzej Kloczkowski: Battelle Center for Mathematical Medicine at The Research Institute at Nationwide Children's Hospital; Department of Pediatrics, College of Medicine, The Ohio State University, Columbus, USA.
- Sundaram Suresh: School of Computer Science, Nanyang Technological University, Nanyang, Singapore.
|
39
|
Yang L, Ainali C, Kittas A, Nestle FO, Papageorgiou LG, Tsoka S. Pathway-level disease data mining through hyper-box principles. Math Biosci 2015; 260:25-34. [DOI: 10.1016/j.mbs.2014.09.005] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 05/12/2014] [Revised: 09/11/2014] [Accepted: 09/13/2014] [Indexed: 01/16/2023]
|
40
|
Liao B, Jiang Y, Liang W, Zhu W, Cai L, Cao Z. Gene Selection Using Locality Sensitive Laplacian Score. IEEE/ACM Trans Comput Biol Bioinform 2014; 11:1146-1156. [PMID: 26357051 DOI: 10.1109/tcbb.2014.2328334] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.6] [Indexed: 06/05/2023]
Abstract
Gene selection based on microarray data is highly important for classifying tumors accurately. Existing gene selection schemes are mainly based on ranking statistics. From a manifold learning standpoint, local geometrical structure is more essential for characterizing features than global information. In this study, we propose a supervised gene selection method called locality sensitive Laplacian score (LSLS), which incorporates discriminative information into local geometrical structure by minimizing local within-class information and maximizing local between-class information simultaneously. In addition, variance information is considered in our algorithm framework. Finally, to find superior gene subsets, which is significant for biomarker discovery, a two-stage feature selection method that combines LSLS with a wrapper method (sequential forward selection or sequential backward selection) is presented. Experimental results on six publicly available gene expression profile data sets demonstrate the effectiveness of the proposed approach compared with a number of state-of-the-art gene selection methods.
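To make the "local geometrical structure" idea concrete, here is a minimal sketch of the classic unsupervised Laplacian score on which LSLS builds. It is a swapped-in baseline, not LSLS itself: the supervised within-class and between-class terms described in the abstract are not reproduced. The function name `laplacian_scores`, the parameters `k` and `t`, and the toy matrix `X` are assumptions made for illustration.

```python
import math

def laplacian_scores(X, k=2, t=1.0):
    """Classic Laplacian score per feature (lower = better locality preservation).

    X is a list of samples, each a list of feature values.
    """
    n, p = len(X), len(X[0])
    # Squared Euclidean distances between all pairs of samples.
    dist2 = [[sum((a - b) ** 2 for a, b in zip(X[i], X[j])) for j in range(n)]
             for i in range(n)]
    # Symmetric k-nearest-neighbour graph with heat-kernel weights.
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        neighbours = sorted((j for j in range(n) if j != i),
                            key=lambda j: dist2[i][j])[:k]
        for j in neighbours:
            S[i][j] = S[j][i] = math.exp(-dist2[i][j] / t)
    d = [sum(row) for row in S]  # node degrees (diagonal of D)
    total = sum(d)
    scores = []
    for r in range(p):
        f = [X[i][r] for i in range(n)]
        # Remove the degree-weighted mean so constant features score poorly.
        mean = sum(fi * di for fi, di in zip(f, d)) / total
        f = [fi - mean for fi in f]
        # f^T L f written as a sum over graph edges, with L = D - S.
        num = 0.5 * sum(S[i][j] * (f[i] - f[j]) ** 2
                        for i in range(n) for j in range(n))
        den = sum(di * fi * fi for fi, di in zip(f, d))  # f^T D f
        scores.append(num / den if den > 0 else float("inf"))
    return scores

# Toy data (assumption): feature 0 separates two clusters, feature 1 is noise.
X = [[0.0, 0.3], [0.1, -0.4], [0.2, 0.5],
     [5.0, -0.5], [5.1, 0.4], [5.2, -0.3]]
scores = laplacian_scores(X)
print(scores)
```

On this toy matrix the cluster-separating feature receives the smaller score, so a ranking by ascending score would pick it first.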
|
41
|
Sun S, Peng Q, Shakoor A. A kernel-based multivariate feature selection method for microarray data classification. PLoS One 2014; 9:e102541. [PMID: 25048512 PMCID: PMC4105478 DOI: 10.1371/journal.pone.0102541] [Citation(s) in RCA: 41] [Impact Index Per Article: 4.1] [Received: 04/07/2014] [Accepted: 06/20/2014] [Indexed: 11/19/2022] Open
Abstract
High dimensionality and small sample sizes, with their inherent risk of overfitting, pose great challenges for constructing efficient classifiers in microarray data classification. Therefore, a feature selection technique should be applied prior to data classification to enhance prediction performance. In general, filter methods can serve as the principal or auxiliary selection mechanism because of their simplicity, scalability, and low computational complexity. However, a series of trivial examples show that filter methods result in less accurate performance because they ignore the dependencies among features. Although a few publications have sought to reveal the relationships among features using multivariate methods, these methods describe such relationships only linearly, and simple linear combinations restrict the improvement in performance. In this paper, we used a kernel method to discover inherent nonlinear correlations among features as well as between features and the target. Moreover, the number of orthogonal components was determined by kernel Fisher's linear discriminant analysis (FLDA) in a self-adaptive manner rather than by manual parameter settings. To demonstrate the effectiveness of our method, we performed several experiments and compared the results of our method with those of other competitive multivariate feature selectors. In our comparison, we used two classifiers (support vector machine, k-nearest neighbor) on two groups of datasets, namely two-class and multi-class datasets. Experimental results demonstrate that our method performs better than the others, especially on three hard-to-classify datasets, namely Wang's Breast Cancer, Gordon's Lung Adenocarcinoma and Pomeroy's Medulloblastoma.
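The remark that univariate filters ignore feature dependencies can be shown with a small XOR example. This sketch is not the paper's kernel method. It only demonstrates the limitation the abstract describes, using a hypothetical `pearson` helper and made-up binary data.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

# Toy data (assumption): the class label is the XOR of two binary features.
X = [(0, 0), (0, 1), (1, 0), (1, 1)] * 5
y = [a ^ b for a, b in X]

# A univariate filter scores each feature on its own against the label.
univariate = [pearson([row[j] for row in X], y) for j in range(2)]

# A nonlinear interaction of the two features recovers the label exactly.
interaction = [(2 * a - 1) * (2 * b - 1) for a, b in X]
joint = pearson(interaction, y)

print(univariate, joint)
```

Each feature alone is uncorrelated with the label, so a correlation-based filter would discard both, while their nonlinear interaction is perfectly (negatively) correlated with it. Kernel methods are one way to expose such nonlinear dependencies without writing the interaction by hand.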
Affiliation(s)
- Shiquan Sun: Systems Engineering Institute, School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, China
- Qinke Peng: Systems Engineering Institute, School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, China
- Adnan Shakoor: Systems Engineering Institute, School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, China
|