1
Chen R, Tang L, Melendy T, Yang L, Goodison S, Sun Y. Prostate Cancer Progression Modeling Provides Insight into Dynamic Molecular Changes Associated with Progressive Disease States. Cancer Research Communications 2024; 4:2783-2798. [PMID: 39347576 PMCID: PMC11500312 DOI: 10.1158/2767-9764.crc-24-0210] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/06/2024] [Revised: 08/27/2024] [Accepted: 09/25/2024] [Indexed: 10/01/2024]
Abstract
Prostate cancer is a significant health concern and the most commonly diagnosed cancer in men worldwide. Understanding the complex process of prostate tumor evolution and progression is crucial for improved diagnosis, treatments, and patient outcomes. Previous studies have focused on unraveling the dynamics of prostate cancer evolution using phylogenetic or lineage analysis approaches. However, those approaches have limitations in capturing the complete disease process or incorporating genomic and transcriptomic variations comprehensively. In this study, we applied a novel computational approach to derive a prostate cancer progression model using multidimensional data from 497 prostate tumor samples and 52 tumor-adjacent normal samples obtained from The Cancer Genome Atlas study. The model was validated using data from an independent cohort of 545 primary tumor samples. By integrating transcriptomic and genomic data, our model provides a comprehensive view of prostate tumor progression, identifies crucial signaling pathways and genetic events, and uncovers distinct transcription signatures associated with disease progression. Our findings have significant implications for cancer research and hold promise for guiding personalized treatment strategies in prostate cancer. SIGNIFICANCE We developed and validated a progression model of prostate cancer using >1,000 tumor and normal tissue samples. The model provided a comprehensive view of prostate tumor evolution and progression.
Affiliation(s)
- Runpu Chen
  - Department of Microbiology and Immunology, University at Buffalo, State University of New York, Buffalo, New York
- Li Tang
  - Department of Cancer Prevention and Control, Roswell Park Comprehensive Cancer Center, Buffalo, New York
- Thomas Melendy
  - Department of Microbiology and Immunology, University at Buffalo, State University of New York, Buffalo, New York
- Le Yang
  - Department of Microbiology and Immunology, University at Buffalo, State University of New York, Buffalo, New York
- Steve Goodison
  - Department of Quantitative Health Sciences, Mayo Clinic, Jacksonville, Florida
- Yijun Sun
  - Department of Microbiology and Immunology, University at Buffalo, State University of New York, Buffalo, New York
  - Department of Computer Science and Engineering, University at Buffalo, State University of New York, Buffalo, New York

2
Xiao Q, Xu H, Chu Z, Feng Q, Zhang Y. Margin-Maximized Norm-Mixed Representation Learning for Autism Spectrum Disorder Diagnosis With Multi-Level Flux Features. IEEE Trans Biomed Eng 2024; 71:183-194. [PMID: 37432838 DOI: 10.1109/tbme.2023.3294223] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/13/2023]
Abstract
Early diagnosis and timely intervention are significantly beneficial to patients with autism spectrum disorder (ASD). Although structural magnetic resonance imaging (sMRI) has become an essential tool to facilitate the diagnosis of ASD, these sMRI-based approaches still have the following issues. The heterogeneity and subtle anatomical changes place higher demands for effective feature descriptors. Additionally, the original features are usually high-dimensional, while most existing methods prefer to select feature subsets in the original space, in which noises and outliers may hinder the discriminative ability of selected features. In this article, we propose a margin-maximized norm-mixed representation learning framework for ASD diagnosis with multi-level flux features extracted from sMRI. Specifically, a flux feature descriptor is devised to quantify comprehensive gradient information of brain structures on both local and global levels. For the multi-level flux features, we learn latent representations in an assumed low-dimensional space, in which a self-representation term is incorporated to characterize the relationships among features. We also introduce mixed norms to finely select original flux features for the construction of latent representations while preserving the low-rankness of latent representations. Furthermore, a margin maximization strategy is applied to enlarge the inter-class distance of samples, thereby increasing the discriminative ability of latent representations. The extensive experiments on several datasets show that our proposed method can achieve promising classification performance (the average area under curve, accuracy, specificity, and sensitivity on the studied ASD datasets are 0.907, 0.896, 0.892, and 0.908, respectively) and also find potential biomarkers for ASD diagnosis.

3
Yin K, Zhai J, Xie A, Zhu J. Feature selection using max dynamic relevancy and min redundancy. Pattern Anal Appl 2023. [DOI: 10.1007/s10044-023-01138-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/19/2023]

4
Chen B, Guan J, Li Z. Unsupervised Feature Selection via Graph Regularized Nonnegative CP Decomposition. IEEE Transactions on Pattern Analysis and Machine Intelligence 2023; 45:2582-2594. [PMID: 35298373 DOI: 10.1109/tpami.2022.3160205] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 06/14/2023]
Abstract
Unsupervised feature selection has attracted remarkable attention recently. With the development of data acquisition technology, multi-dimensional tensor data has appeared in an enormous number of real-world applications. However, most existing unsupervised feature selection methods are non-tensor-based, which requires vectorizing the tensor data as a preprocessing step. This seemingly ordinary operation leads to an unnecessary loss of the multi-dimensional structural information and ultimately restricts the quality of the selected features. To overcome this limitation, we propose a novel unsupervised feature selection model: nonnegative tensor CP (CANDECOMP/PARAFAC) decomposition based unsupervised feature selection, CPUFS for short. Specifically, we devise a new tensor-oriented linear classifier and feature selection matrix for CPUFS. In addition, CPUFS simultaneously conducts graph regularized nonnegative CP decomposition and newly designed tensor-oriented pseudo label regression and feature selection to fully preserve the multi-dimensional data structure. To solve the CPUFS model, we propose an efficient iterative optimization algorithm with theoretically guaranteed convergence, whose computational complexity scales linearly in the number of features. A variation of the CPUFS model that incorporates nonnegativity into the linear classifier, namely CPUFSnn, is also proposed and studied. Experimental results on ten real-world benchmark datasets demonstrate the effectiveness of both CPUFS and CPUFSnn over state-of-the-art methods.

5
Interpretable Bayesian network abstraction for dimension reduction. Neural Comput Appl 2022. [DOI: 10.1007/s00521-022-07810-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022]

6
Zheng W, Chen S, Fu Z, Zhu F, Yan H, Yang J. Feature Selection Boosted by Unselected Features. IEEE Transactions on Neural Networks and Learning Systems 2022; 33:4562-4574. [PMID: 33646957 DOI: 10.1109/tnnls.2021.3058172] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 06/12/2023]
Abstract
Feature selection aims to select strongly relevant features and discard the rest. Recently, embedded feature selection methods, which incorporate feature weights learning into the training process of a classifier, have attracted much attention. However, traditional embedded methods merely focus on the combinatorial optimality of all selected features. They sometimes select the weakly relevant features with satisfactory combination abilities and leave out some strongly relevant features, thereby degrading the generalization performance. To address this issue, we propose a novel embedded framework for feature selection, termed feature selection boosted by unselected features (FSBUF). Specifically, we introduce an extra classifier for unselected features into the traditional embedded model and jointly learn the feature weights to maximize the classification loss of unselected features. As a result, the extra classifier recycles the unselected strongly relevant features to replace the weakly relevant features in the selected feature subset. Our final objective can be formulated as a minimax optimization problem, and we design an effective gradient-based algorithm to solve it. Furthermore, we theoretically prove that the proposed FSBUF is able to improve the generalization ability of traditional embedded feature selection methods. Extensive experiments on synthetic and real-world data sets exhibit the comprehensibility and superior performance of FSBUF.

7
Time Series Data Prediction and Feature Analysis of Sports Dance Movements Based on Machine Learning. Computational Intelligence and Neuroscience 2022; 2022:5611829. [PMID: 36059406 PMCID: PMC9433201 DOI: 10.1155/2022/5611829] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/13/2022] [Revised: 07/14/2022] [Accepted: 07/25/2022] [Indexed: 11/17/2022]
Abstract
Sports dance is both a competitive event and a form of exercise, characterized by smooth, graceful, and unhurried steps and flowing movements that make full use of the indoor space. In the new era, sports dance is playing an increasingly important role. By applying machine learning to time series data and feature analysis of sports dance movements, the underlying information can be mined to find trends and regularities. Machine learning in the era of big data is widely used in research as the main tool for data analysis and mining, and time series data has always been a key difficulty in data mining. Machine learning refers to a method of using data in a computer to derive a model and then using this model to make predictions. The core is “using algorithms to parse data, learn from it, and then make decisions or predictions about new data.”

8
Computational approach to modeling microbiome landscapes associated with chronic human disease progression. PLoS Comput Biol 2022; 18:e1010373. [PMID: 35926003 PMCID: PMC9380910 DOI: 10.1371/journal.pcbi.1010373] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 01/18/2022] [Revised: 08/16/2022] [Accepted: 07/11/2022] [Indexed: 11/20/2022] Open
Abstract
A microbial community is a dynamic system undergoing constant change in response to internal and external stimuli. These changes can have significant implications for human health. However, due to the difficulty in obtaining longitudinal samples, the study of the dynamic relationship between the microbiome and human health remains a challenge. Here, we introduce a novel computational strategy that uses massive cross-sectional sample data to model microbiome landscapes associated with chronic disease development. The strategy is based on the rationale that each static sample provides a snapshot of the disease process, and if the number of samples is sufficiently large, the footprints of individual samples populate progression trajectories, which enables us to recover disease progression paths along a microbiome landscape by using computational approaches. To demonstrate the validity of the proposed strategy, we developed a bioinformatics pipeline and applied it to a gut microbiome dataset available from a Crohn’s disease study. Our analysis resulted in one of the first working models of microbial progression for Crohn’s disease. We performed a series of interrogations to validate the constructed model. Our analysis suggested that the model recapitulated the longitudinal progression of microbial dysbiosis during the known clinical trajectory of Crohn’s disease. By overcoming restrictions associated with complex longitudinal sampling, the proposed strategy can provide valuable insights into the role of the microbiome in the pathogenesis of chronic disease and facilitate the shift of the field from descriptive research to mechanistic studies.
The delineation of system dynamics of a microbial community can provide a wealth of insights into the roles of the microbiome in the pathogenesis of chronic disease. However, due to the difficulty in obtaining longitudinal samples, most existing microbiome studies have been cross-sectional and largely descriptive. Here, we present a novel computational strategy that leverages massive static sample data to model microbiome landscapes associated with chronic disease development. To demonstrate the validity of the proposed strategy, we applied it to a gut microbiome dataset available from a Crohn’s disease study and constructed one of the first microbial progression models of the disease. We performed a series of interrogations on the constructed model. Our analysis suggested that the constructed model recapitulated the longitudinal progression of microbial dysbiosis during the known clinical trajectory of Crohn’s disease. By overcoming the sampling restrictions inherent to slowly progressive diseases, our approach is potentially widely applicable in many different studies across body sites, diseases, and other conditions.

9
Kumar A, Halder A. Greedy fuzzy vaguely quantified rough approach for cancer relevant gene selection from gene expression data. Soft comput 2022. [DOI: 10.1007/s00500-022-07312-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/17/2022]

10
Tsanas A. Relevance, redundancy, and complementarity trade-off (RRCT): A principled, generic, robust feature-selection tool. Patterns (New York, N.Y.) 2022; 3:100471. [PMID: 35607618 PMCID: PMC9122960 DOI: 10.1016/j.patter.2022.100471] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 12/07/2021] [Revised: 01/19/2022] [Accepted: 02/24/2022] [Indexed: 12/21/2022]
Abstract
We present a new heuristic feature-selection (FS) algorithm that integrates in a principled algorithmic framework the three key FS components: relevance, redundancy, and complementarity. Thus, we call it relevance, redundancy, and complementarity trade-off (RRCT). The association strength between each feature and the response and between feature pairs is quantified via an information theoretic transformation of rank correlation coefficients, and the feature complementarity is quantified using partial correlation coefficients. We empirically benchmark the performance of RRCT against 19 FS algorithms across four synthetic and eight real-world datasets in indicative challenging settings evaluating the following: (1) matching the true feature set and (2) out-of-sample performance in binary and multi-class classification problems when presenting selected features into a random forest. RRCT is very competitive in both tasks, and we tentatively make suggestions on the generalizability and application of the best-performing FS algorithms across settings where they may operate effectively.
Affiliation(s)
- Athanasios Tsanas
  - Usher Institute, Edinburgh Medical School, University of Edinburgh, NINE Edinburgh BioQuarter, 9 Little France Road, Edinburgh, UK
  - School of Mathematics, University of Edinburgh, Edinburgh, UK
  - Alan Turing Institute, British Library, London, UK
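As an illustration of the relevance term described in the RRCT abstract (entry 10), the sketch below maps a Spearman rank correlation r to -0.5 ln(1 - r^2), which is the mutual information of a bivariate Gaussian with correlation r. That RRCT uses exactly this transformation is an assumption made here, and the helper names (`ranks`, `spearman`, `info_transform`) and the synthetic data are hypothetical. The redundancy and complementarity terms of the algorithm are not shown.

```python
import math
import random

def ranks(values):
    """Return 0-based ranks; ties are broken by position (fine for continuous data)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for position, index in enumerate(order):
        result[index] = float(position)
    return result

def pearson(a, b):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def info_transform(r, eps=1e-12):
    """Assumed transformation: -0.5 * ln(1 - r^2), in nats; eps guards |r| = 1."""
    return -0.5 * math.log(max(1.0 - r * r, eps))

# Synthetic data (hypothetical): one informative feature, one pure-noise feature.
rng = random.Random(0)
response = [rng.gauss(0.0, 1.0) for _ in range(500)]
informative = [v + 0.5 * rng.gauss(0.0, 1.0) for v in response]
noise = [rng.gauss(0.0, 1.0) for _ in range(500)]

relevance_informative = info_transform(spearman(informative, response))
relevance_noise = info_transform(spearman(noise, response))
print(relevance_informative, relevance_noise)
```

The transformation is zero for uncorrelated variables and grows without bound as |r| approaches 1, so strongly associated features stand out more than they would on the raw correlation scale.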

11
Komeili M, Armanfard N, Hatzinakos D. Multiview Feature Selection for Single-View Classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 2021; 43:3573-3586. [PMID: 32305902 DOI: 10.1109/tpami.2020.2987013] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 06/11/2023]
Abstract
In many real-world scenarios, data from multiple modalities (sources) are collected during a development phase. Such data are referred to as multiview data. While additional information from multiple views often improves the performance, collecting data from such additional views during the testing phase may not be desired due to the high costs associated with measuring such views or the unavailability of such additional views. Therefore, in many applications, despite having a multiview training data set, it is desired to do performance testing using data from only one view. In this paper, we present a multiview feature selection method that leverages the knowledge of all views and uses it to guide the feature selection process in an individual view. We realize this via a multiview feature weighting scheme such that the local margins of samples in each view are maximized and similarities of samples to some reference points in different views are preserved. Also, the proposed formulation can be used for cross-view matching when the view-specific feature weights are pre-computed on an auxiliary data set. Promising results have been achieved on nine real-world data sets as well as three biometric recognition applications. On average, the proposed feature selection method has improved the classification error rate by 31 percent of the error rate of the state-of-the-art.

12
Zhang S, Dang X, Nguyen D, Wilkins D, Chen Y. Estimating Feature-Label Dependence Using Gini Distance Statistics. IEEE Transactions on Pattern Analysis and Machine Intelligence 2021; 43:1947-1963. [PMID: 31869782 DOI: 10.1109/tpami.2019.2960358] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 06/10/2023]
Abstract
Identifying statistical dependence between the features and the label is a fundamental problem in supervised learning. This paper presents a framework for estimating dependence between numerical features and a categorical label using generalized Gini distance, an energy distance in reproducing kernel Hilbert spaces (RKHS). Two Gini distance based dependence measures are explored: Gini distance covariance and Gini distance correlation. Unlike Pearson covariance and correlation, which do not characterize independence, the above Gini distance based measures define dependence as well as independence of random variables. The test statistics are simple to calculate and do not require probability density estimation. Uniform convergence bounds and asymptotic bounds are derived for the test statistics. Comparisons with distance covariance statistics are provided. It is shown that Gini distance statistics converge faster than distance covariance statistics in the uniform convergence bounds, hence tighter upper bounds on both Type I and Type II errors. Moreover, the probability of Gini distance covariance statistic under-performing the distance covariance statistic in Type II error decreases to 0 exponentially with the increase of the sample size. Extensive experimental results are presented to demonstrate the performance of the proposed method.

13
Pang Q, Zhang L. A recursive feature retention method for semi-supervised feature selection. Int J Mach Learn Cyb 2021. [DOI: 10.1007/s13042-021-01346-0] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 10/21/2022]

14
Jin B, Fu C, Jin Y, Yang W, Li S, Zhang G, Wang Z. An Adaptive Unsupervised Feature Selection Algorithm Based on MDS for Tumor Gene Data Classification. Sensors 2021; 21:s21113627. [PMID: 34071066 PMCID: PMC8197094 DOI: 10.3390/s21113627] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/20/2021] [Revised: 05/19/2021] [Accepted: 05/20/2021] [Indexed: 11/29/2022]
Abstract
Identifying the key genes related to tumors from gene expression data with a large number of features is important for the accurate classification of tumors and to make special treatment decisions. In recent years, unsupervised feature selection algorithms have attracted considerable attention in the field of gene selection as they can find the most discriminating subsets of genes, namely the potential information in biological data. Recent research also shows that maintaining the important structure of data is necessary for gene selection. However, most current feature selection methods merely capture the local structure of the original data while ignoring the importance of the global structure of the original data. We believe that the global structure and local structure of the original data are equally important, and so the selected genes should maintain the essential structure of the original data as far as possible. In this paper, we propose a new, adaptive, unsupervised feature selection scheme which not only reconstructs high-dimensional data into a low-dimensional space with the constraint of feature distance invariance but also employs ℓ2,1-norm to enable a matrix with the ability to perform gene selection embedding into the local manifold structure-learning framework. Moreover, an effective algorithm is developed to solve the optimization problem based on the proposed scheme. Comparative experiments with some classical schemes on real tumor datasets demonstrate the effectiveness of the proposed method.
Affiliation(s)
- Bo Jin
  - School of Artificial Intelligence, Henan University, Kaifeng 475004, China
  - School of Computer and Information Engineering, Henan University, Kaifeng 475004, China
- Chunling Fu
  - School of Physics and Electronics, Henan University, Kaifeng 475004, China
  - Correspondence:
- Yong Jin
  - School of Artificial Intelligence, Henan University, Kaifeng 475004, China
  - School of Computer and Information Engineering, Henan University, Kaifeng 475004, China
- Wei Yang
  - School of Computer and Information Engineering, Henan University, Kaifeng 475004, China
- Shengbin Li
  - School of Artificial Intelligence, Henan University, Kaifeng 475004, China
  - School of Computer and Information Engineering, Henan University, Kaifeng 475004, China
- Guangyao Zhang
  - School of Artificial Intelligence, Henan University, Kaifeng 475004, China
  - School of Computer and Information Engineering, Henan University, Kaifeng 475004, China
- Zheng Wang
  - School of Artificial Intelligence, Henan University, Kaifeng 475004, China
  - School of Computer and Information Engineering, Henan University, Kaifeng 475004, China
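The ℓ2,1-norm named in the abstract of entry 14 is the sum of the Euclidean norms of a matrix's rows. Penalizing it pushes whole rows toward zero, which is why it is commonly used for feature (gene) selection. The sketch below only computes the norm and ranks features by row norm; it is not the authors' optimization algorithm, and `l21_norm`, `row_norms`, `top_features`, and the toy matrix `W` are hypothetical names chosen for illustration.

```python
import math

def row_norms(matrix):
    """Euclidean norm of each row; one row per original feature."""
    return [math.sqrt(sum(v * v for v in row)) for row in matrix]

def l21_norm(matrix):
    """l2,1-norm: the sum of the row-wise Euclidean norms."""
    return sum(row_norms(matrix))

def top_features(matrix, k):
    """Indices of the k rows with the largest norms, largest first."""
    scores = row_norms(matrix)
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Toy weight matrix (hypothetical): 3 features mapped to a 2-dimensional space.
W = [
    [3.0, 4.0],  # row norm 5: strongly used feature
    [0.0, 0.0],  # row norm 0: feature effectively discarded
    [1.0, 0.0],  # row norm 1: weakly used feature
]

print(l21_norm(W))         # 6.0
print(top_features(W, 2))  # [0, 2]
```

A feature whose row is entirely zero contributes nothing to the low-dimensional representation, so ranking rows by norm gives a simple selection rule once such a matrix has been learned.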

15
Fu J, Luo Y, Mou M, Zhang H, Tang J, Wang Y, Zhu F. Advances in Current Diabetes Proteomics: From the Perspectives of Label-free Quantification and Biomarker Selection. Curr Drug Targets 2021; 21:34-54. [PMID: 31433754 DOI: 10.2174/1389450120666190821160207] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 05/06/2019] [Revised: 07/17/2019] [Accepted: 07/24/2019] [Indexed: 12/13/2022]
Abstract
BACKGROUND Due to its prevalence and negative impacts on both the economy and society, diabetes mellitus (DM) has emerged as a worldwide concern. In light of this, label-free quantification (LFQ) proteomics and diabetic marker selection methods have been applied to elucidate the underlying mechanisms associated with insulin resistance, explore novel protein biomarkers, and discover innovative therapeutic protein targets. OBJECTIVE The purpose of this manuscript is to review and analyze the recent computational advances and development of label-free quantification and diabetic marker selection in diabetes proteomics. METHODS The Web of Science database, the PubMed database, and Google Scholar were used to search for label-free quantification, computational advances, feature selection, and diabetes proteomics. RESULTS In this study, we systematically review the computational advances of label-free quantification and diabetic marker selection methods that have been applied to understand DM pathological mechanisms. Firstly, different popular quantification measurements and proteomic quantification software tools that have been applied to diabetes studies are comprehensively discussed. Secondly, a number of popular manipulation methods, including transformation, pretreatment (centering, scaling, and normalization), and missing value imputation, and a variety of popular feature selection techniques applied to diabetes proteomic data are overviewed with an objective evaluation of their advantages and disadvantages. Finally, guidelines for the efficient use of computation-based LFQ technology and feature selection methods in diabetes proteomics are proposed. CONCLUSION In summary, this review provides guidelines for researchers who will engage in proteomics biomarker discovery; by properly applying these proteomic computational advances, more reliable therapeutic targets will be found in the field of diabetes mellitus.
Affiliation(s)
- Jianbo Fu
  - College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Yongchao Luo
  - College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Minjie Mou
  - College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Hongning Zhang
  - College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Jing Tang
  - College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
  - School of Pharmaceutical Sciences and Innovative Drug Research Centre, Chongqing University, Chongqing 401331, China
- Yunxia Wang
  - College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Feng Zhu
  - College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
  - School of Pharmaceutical Sciences and Innovative Drug Research Centre, Chongqing University, Chongqing 401331, China

16
Ben Brahim A. Stable feature selection based on instance learning, redundancy elimination and efficient subsets fusion. Neural Comput Appl 2021. [DOI: 10.1007/s00521-020-04971-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/24/2022]

17

18
Zhang B, Liu Q, Zhang X, Liu S, Chen W, You J, Chen Q, Li M, Chen Z, Chen L, Chen L, Dong Y, Zeng Q, Zhang S. Clinical Utility of a Nomogram for Predicting 30-Days Poor Outcome in Hospitalized Patients With COVID-19: Multicenter External Validation and Decision Curve Analysis. Front Med (Lausanne) 2020; 7:590460. [PMID: 33425939 PMCID: PMC7785751 DOI: 10.3389/fmed.2020.590460] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 08/01/2020] [Accepted: 11/18/2020] [Indexed: 12/14/2022] Open
Abstract
Aim: Early detection of coronavirus disease 2019 (COVID-19) patients who are likely to develop worse outcomes is of great importance, which may help select patients at risk of rapid deterioration who should require high-level monitoring and more aggressive treatment. We aimed to develop and validate a nomogram for predicting 30-days poor outcome of patients with COVID-19. Methods: The prediction model was developed in a primary cohort consisting of 233 patients with laboratory-confirmed COVID-19, and data were collected from January 3 to March 20, 2020. We identified and integrated significant prognostic factors for 30-days poor outcome to construct a nomogram. The model was subjected to internal validation and to external validation with two separate cohorts of 110 and 118 cases, respectively. The performance of the nomogram was assessed with respect to its predictive accuracy, discriminative ability, and clinical usefulness. Results: In the primary cohort, the mean age of patients was 55.4 years and 129 (55.4%) were male. Prognostic factors contained in the clinical nomogram were age, lactic dehydrogenase, aspartate aminotransferase, prothrombin time, serum creatinine, serum sodium, fasting blood glucose, and D-dimer. The model was externally validated in two cohorts achieving an AUC of 0.946 and 0.878, sensitivity of 100 and 79%, and specificity of 76.5 and 83.8%, respectively. Although adding CT score to the clinical nomogram (clinical-CT nomogram) did not yield better predictive performance, decision curve analysis showed that the clinical-CT nomogram provided better clinical utility than the clinical nomogram. Conclusions: We established and validated a nomogram that can provide an individual prediction of 30-days poor outcome for COVID-19 patients. This practical prognostic model may help clinicians in decision making and reduce mortality.
Affiliation(s)
- Bin Zhang
  - Department of Radiology, The First Affiliated Hospital of Jinan University, Guangzhou, China
- Qin Liu
  - Department of Radiology, The First Affiliated Hospital of Guangzhou Medical University, Guangzhou, China
- Xiao Zhang
  - Zhuhai Precision Medical Center, Zhuhai People's Hospital (Zhuhai Hospital Affiliated With Jinan University), Zhuhai, China
- Shuyi Liu
  - Department of Radiology, The First Affiliated Hospital of Jinan University, Guangzhou, China
- Weiqi Chen
  - Big Data Decision Institute, Jinan University, Guangzhou, China
- Jingjing You
  - Department of Radiology, The First Affiliated Hospital of Jinan University, Guangzhou, China
- Qiuying Chen
  - Department of Radiology, The First Affiliated Hospital of Jinan University, Guangzhou, China
- Minmin Li
  - Department of Radiology, The First Affiliated Hospital of Jinan University, Guangzhou, China
- Zhuozhi Chen
  - Department of Radiology, The First Affiliated Hospital of Jinan University, Guangzhou, China
- Luyan Chen
  - Department of Radiology, The First Affiliated Hospital of Jinan University, Guangzhou, China
- Lv Chen
  - Department of Radiology, The First Affiliated Hospital of Jinan University, Guangzhou, China
- Yuhao Dong
  - Guangdong Provincial People's Hospital, Guangdong Academy of Medical Sciences, Guangzhou, China
- Qingsi Zeng
  - Department of Radiology, The First Affiliated Hospital of Guangzhou Medical University, Guangzhou, China
- Shuixing Zhang
  - Department of Radiology, The First Affiliated Hospital of Jinan University, Guangzhou, China
|
19
|
|
20
|
Shukla AK. Feature selection inspired by human intelligence for improving classification accuracy of cancer types. Comput Intell 2020. [DOI: 10.1111/coin.12341] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
Affiliation(s)
- Alok Kumar Shukla
- Department of Computer Science & Engineering, G.L. Bajaj Institute of Technology and Management, Greater Noida, India
| |
|
21
|
Shukla AK, Pippal SK, Gupta S, Ramachandra Reddy B, Tripathi D. Knowledge discovery in medical and biological datasets by integration of Relief-F and correlation feature selection techniques. JOURNAL OF INTELLIGENT & FUZZY SYSTEMS 2020. [DOI: 10.3233/jifs-179743] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]
Affiliation(s)
- Alok Kumar Shukla
- Department of CSE, G.L. Bajaj Institute of Technology & Management, Greater Noida, India
| | - Sanjeev Kumar Pippal
- Department of CSE, G.L. Bajaj Institute of Technology & Management, Greater Noida, India
| | | | | | | |
|
22
|
An Efficient Filter-Based Feature Selection Model to Identify Significant Features from High-Dimensional Microarray Data. ARABIAN JOURNAL FOR SCIENCE AND ENGINEERING 2020. [DOI: 10.1007/s13369-020-04380-2] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/26/2022]
|
23
|
Adeel A, Khan MA, Sharif M, Azam F, Shah JH, Umer T, Wan S. Diagnosis and recognition of grape leaf diseases: An automated system based on a novel saliency approach and canonical correlation analysis based multiple features fusion. SUSTAINABLE COMPUTING: INFORMATICS AND SYSTEMS 2019; 24:100349. [DOI: 10.1016/j.suscom.2019.08.002] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 08/25/2024]
|
24
|
Brankovic A, Hosseini M, Piroddi L. A Distributed Feature Selection Algorithm Based on Distance Correlation with an Application to Microarrays. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2019; 16:1802-1815. [PMID: 29993889 DOI: 10.1109/tcbb.2018.2833482] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
DNA microarray datasets are characterized by a large number of features with very few samples, which is a typical cause of overfitting and poor generalization in the classification task. Here, we introduce a novel feature selection (FS) approach which employs the distance correlation (dCor) as a criterion for evaluating the dependence of the class on a given feature subset. The dCor index provides a reliable dependence measure among random vectors of arbitrary dimension, without any assumption on their distribution. Moreover, it is sensitive to the presence of redundant terms. The proposed FS method is based on a probabilistic representation of the feature subset model, which is progressively refined by a repeated process of model extraction and evaluation. A key element of the approach is a distributed optimization scheme based on a vertical partitioning of the dataset, which alleviates the negative effects of its unbalanced dimensions. The proposed method has been tested on several microarray datasets, resulting in quite compact and accurate models obtained at a reasonable computational cost.
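The dependence measure named in the abstract above can be illustrated with a minimal sketch of the sample distance correlation between two sets of observations. This is only the standard textbook estimator (double-centered pairwise distance matrices), not the authors' distributed feature selection scheme; the function names are hypothetical and each observation is assumed to be a tuple of numbers.

```python
import math


def _centered_distance_matrix(points):
    """Pairwise Euclidean distances, double-centered (row, column, grand mean)."""
    n = len(points)
    d = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    row_means = [sum(row) / n for row in d]
    grand_mean = sum(row_means) / n
    # The distance matrix is symmetric, so column means equal row means.
    return [
        [d[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def _mean_product(a, b):
    """Mean of the element-wise product of two n x n matrices."""
    n = len(a)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * n)


def distance_correlation(x, y):
    """Sample distance correlation between paired observations x and y.

    x and y are equal-length sequences of tuples (vectors of any dimension).
    Returns a value in [0, 1]; 0.0 is returned when either side is constant.
    """
    a = _centered_distance_matrix(x)
    b = _centered_distance_matrix(y)
    dcov2 = _mean_product(a, b)
    dvar2 = _mean_product(a, a) * _mean_product(b, b)
    if dvar2 <= 0:
        return 0.0
    return math.sqrt(max(dcov2, 0.0) / math.sqrt(dvar2))
```

Because only pairwise distances enter the computation, the two arguments may have different dimensions, which is what allows a feature subset (a vector) to be scored against a class label (a scalar).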
|
25
|
Shukla AK. Identification of cancerous gene groups from microarray data by employing adaptive genetic and support vector machine technique. Comput Intell 2019. [DOI: 10.1111/coin.12245] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/14/2022]
Affiliation(s)
- Alok Kumar Shukla
- Department of Computer Science & Engineering, G.L. Bajaj Institute of Technology & Management, Greater Noida, India
| |
|
26
|
Arora S, Visanji NP, Mestre TA, Tsanas A, AlDakheel A, Connolly BS, Gasca-Salas C, Kern DS, Jain J, Slow EJ, Faust-Socher A, Lang AE, Little MA, Marras C. Investigating Voice as a Biomarker for Leucine-Rich Repeat Kinase 2-Associated Parkinson's Disease. JOURNAL OF PARKINSONS DISEASE 2019; 8:503-510. [PMID: 30248062 DOI: 10.3233/jpd-181389] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/14/2022]
Abstract
We investigate the potential association between leucine-rich repeat kinase 2 (LRRK2) mutations and voice. Sustained phonations ('aaah' sounds) were recorded from 7 individuals with LRRK2-associated Parkinson's disease (PD), 17 participants with idiopathic PD (iPD), 20 non-manifesting LRRK2-mutation carriers, 25 related non-carriers, and 26 controls. In distinguishing LRRK2-associated PD and iPD, the mean sensitivity was 95.4% (SD 17.8%) and mean specificity was 89.6% (SD 26.5%). Voice features for non-manifesting carriers, related non-carriers, and controls were much less discriminatory. Vocal deficits in LRRK2-associated PD may be different than those in iPD. These preliminary results warrant longitudinal analyses and replication in larger cohorts.
Affiliation(s)
| | - Naomi P. Visanji
- The Edmond J. Safra Program in Parkinson’s Disease and the Morton and Gloria Shulman Movement Disorders Centre and, Toronto Western Hospital, Toronto, ON, Canada
| | - Tiago A. Mestre
- Department of Medicine, Parkinson’s Disease and Movement Disorders Center, Division of Neurology, The Ottawa Hospital Research Institute, University of Ottawa Brain and Mind Institute, Ottawa, Canada
| | - Athanasios Tsanas
- Usher Institute of Population Health Sciences and Informatics, University of Edinburgh, Edinburgh, UK
| | - Amaal AlDakheel
- The Edmond J. Safra Program in Parkinson’s Disease and the Morton and Gloria Shulman Movement Disorders Centre and, Toronto Western Hospital, Toronto, ON, Canada
| | - Barbara S. Connolly
- Department of Medicine, Division of Neurology, Hamilton Health Sciences, McMaster University, Hamilton, ON, Canada
| | - Carmen Gasca-Salas
- The Edmond J. Safra Program in Parkinson’s Disease and the Morton and Gloria Shulman Movement Disorders Centre and, Toronto Western Hospital, Toronto, ON, Canada
| | - Drew S. Kern
- Department of Neurology, Movement Disorders Center, University of Colorado, Anschutz Medical Campus, Aurora, CO, USA
| | - Jennifer Jain
- The Edmond J. Safra Program in Parkinson’s Disease and the Morton and Gloria Shulman Movement Disorders Centre and, Toronto Western Hospital, Toronto, ON, Canada
| | - Elizabeth J. Slow
- The Edmond J. Safra Program in Parkinson’s Disease and the Morton and Gloria Shulman Movement Disorders Centre and, Toronto Western Hospital, Toronto, ON, Canada
| | - Achinoam Faust-Socher
- The Edmond J. Safra Program in Parkinson’s Disease and the Morton and Gloria Shulman Movement Disorders Centre and, Toronto Western Hospital, Toronto, ON, Canada
| | - Anthony E. Lang
- The Edmond J. Safra Program in Parkinson’s Disease and the Morton and Gloria Shulman Movement Disorders Centre and, Toronto Western Hospital, Toronto, ON, Canada
| | - Max A. Little
- Engineering and Applied Science, Aston University, Birmingham, UK
- Media Lab, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Connie Marras
- The Edmond J. Safra Program in Parkinson’s Disease and the Morton and Gloria Shulman Movement Disorders Centre and, Toronto Western Hospital, Toronto, ON, Canada
| |
|
27
|
|
28
|
Zhao Q, Zhang Y. Ensemble Method of Feature Selection and Reverse Construction of Gene Logical Network Based on Information Entropy. INT J PATTERN RECOGN 2019. [DOI: 10.1142/s0218001420590041] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
In this paper, we propose a novel ensemble gene selection method to obtain a gene subset. We then provide a reverse construction method for a gene network derived from the expression profile data of the gene subset. The uncertainty coefficient based on information entropy is used to define the existence of logical relations among these genes. If the uncertainty coefficient between some genes exceeds predefined thresholds, the gene nodes are connected by directed edges. Thus, a gene network is generated, which we define as a gene logical network. This method is applied to breast cancer data including a control group and an experimental group, with comparisons of the 2nd-order logic type distribution, average degree, and average path length of the networks. It is found that the structures of the different networks are quite distinct. By comparing the degree difference between the control group and the experimental group, the key genes are picked out. By defining the dynamic evolution rules of state transition based on the logical regulation among the key genes in the network, the dynamic behaviors of normal breast cells and of cancer cells at different stages are simulated numerically. Some of them are highly related to the development of breast cancer according to the literature. The study may provide useful insight into the biological mechanism of the formation and development of cancer.
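The entropy-based thresholding step described in the abstract above can be sketched as follows. This is a minimal illustration using the standard uncertainty coefficient U(X|Y) = I(X;Y) / H(X) on discretized profiles; the function names, the edge direction (from the explaining gene to the explained gene), the single threshold, and the convention of returning 0.0 for a constant profile are all assumptions, not details taken from the paper.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def uncertainty_coefficient(x, y):
    """U(X|Y): fraction of the entropy of x that is explained by y."""
    h_x = entropy(x)
    if h_x == 0:
        return 0.0  # assumed convention: a constant profile carries no information
    mutual_information = h_x + entropy(y) - entropy(list(zip(x, y)))
    return mutual_information / h_x


def directed_edges(profiles, threshold):
    """Connect gene b -> gene a when U(a|b) exceeds the threshold.

    profiles maps a gene name to its discretized expression values.
    Both the direction and the single threshold are illustrative choices.
    """
    edges = []
    for a, values_a in profiles.items():
        for b, values_b in profiles.items():
            if a != b and uncertainty_coefficient(values_a, values_b) > threshold:
                edges.append((b, a))
    return sorted(edges)
```

The coefficient is asymmetric in general, which is why thresholding it can yield directed rather than undirected edges.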
Affiliation(s)
- Qingfeng Zhao
- College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao, Shandong 266590, P. R. China
- Shandong Province Key Laboratory of Wisdom Mine Information Technology, Shandong University of Science and Technology, Qingdao 266590, P. R. China
| | - Yulin Zhang
- College of Mathematics and Systems Science, Shandong University of Science and Technology, Qingdao, Shandong 266590, P. R. China
| |
|
29
|
Choi YG, Lim J, Roy A, Park J. Fixed support positive-definite modification of covariance matrix estimators via linear shrinkage. J MULTIVARIATE ANAL 2019. [DOI: 10.1016/j.jmva.2018.12.002] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/27/2022]
|
30
|
Zhang Y, Zhou Y, Zhang D, Song W. A Stroke Risk Detection: Improving Hybrid Feature Selection Method. J Med Internet Res 2019; 21:e12437. [PMID: 30938684 PMCID: PMC6466481 DOI: 10.2196/12437] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/07/2018] [Revised: 01/04/2019] [Accepted: 01/26/2019] [Indexed: 01/16/2023] Open
Abstract
Background Stroke is one of the most common diseases that cause mortality. Detecting the risk of stroke for individuals is critical yet challenging because of a large number of risk factors for stroke. Objective This study aimed to address the limitation of ineffective feature selection in existing research on stroke risk detection. We have proposed a new feature selection method called weighting- and ranking-based hybrid feature selection (WRHFS) to select important risk factors for detecting ischemic stroke. Methods WRHFS integrates the strengths of various filter algorithms by following the principle of a wrapper approach. We employed a variety of filter-based feature selection models as the candidate set, including standard deviation, Pearson correlation coefficient, Fisher score, information gain, Relief algorithm, and chi-square test and used sensitivity, specificity, accuracy, and Youden index as performance metrics to evaluate the proposed method. Results This study chose 792 samples from the electronic records of 13,421 patients in a community hospital. Each sample included 28 features (24 blood test features and 4 demographic features). The results of evaluation showed that the proposed method selected 9 important features out of the original 28 features and significantly outperformed baseline methods. Their cumulative contribution was 0.51. The WRHFS method achieved a sensitivity of 82.7% (329/398), specificity of 80.4% (317/394), classification accuracy of 81.5% (645/792), and Youden index of 0.63 using only the top 9 features. We have also presented a chart for visualizing the risk of having ischemic strokes. Conclusions This study has proposed, developed, and evaluated a new feature selection method for identifying the most important features for building effective and parsimonious models for stroke risk detection. The findings of this research provide several novel research contributions and practical implications.
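The performance metrics named in the abstract above follow directly from confusion-matrix counts, as in this minimal sketch. The counts are derived from the fractions reported in the abstract (329/398 positives detected, 317/394 negatives correctly rejected); the function name is hypothetical. Note that these counts sum to 646 correct out of 792, whereas the abstract reports 645/792 for accuracy; the sketch does not attempt to reconcile the two.

```python
def screening_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy, and Youden index from raw counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    # Youden index J = sensitivity + specificity - 1, ranging from -1 to 1.
    youden = sensitivity + specificity - 1
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "youden": youden,
    }


# Counts implied by the reported fractions: 329 of 398 and 317 of 394.
metrics = screening_metrics(tp=329, fn=398 - 329, tn=317, fp=394 - 317)
```

With these inputs the Youden index comes out at about 0.63, matching the value stated in the abstract.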
Affiliation(s)
- Yonglai Zhang
- Medical Big Data Institute, Software School, North University of China, Taiyuan, China
| | - Yaojian Zhou
- Medical Big Data Institute, Software School, North University of China, Taiyuan, China
| | - Dongsong Zhang
- Department of Business Information Systems and Operations Research, Belk School of Business, University of North Carolina, Charlotte, NC, United States
| | - Wenai Song
- Medical Big Data Institute, Software School, North University of China, Taiyuan, China
| |
|
31
|
Komeili M, Pou-Prom C, Liaqat D, Fraser KC, Yancheva M, Rudzicz F. Talk2Me: Automated linguistic data collection for personal assessment. PLoS One 2019; 14:e0212342. [PMID: 30917120 PMCID: PMC6436678 DOI: 10.1371/journal.pone.0212342] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/25/2018] [Accepted: 01/31/2019] [Indexed: 11/18/2022] Open
Abstract
Language is one of the earliest capacities affected by cognitive change. To monitor that change longitudinally, we have developed a web portal for remote linguistic data acquisition, called Talk2Me, consisting of a variety of tasks. In order to facilitate research in different aspects of language, we provide baselines including the relations between different scoring functions within and across tasks. These data can be used to augment studies that require a normative model; for example, we provide baseline classification results in identifying dementia. These data are released publicly along with a comprehensive open-source package for extracting approximately two thousand lexico-syntactic, acoustic, and semantic features. This package can be applied arbitrarily to studies that include linguistic data. To our knowledge, this is the most comprehensive publicly available software for extracting linguistic features. The software includes scoring functions for different tasks.
Affiliation(s)
- Majid Komeili
- School of Computer Science, Carleton University, Ottawa, Ontario, Canada
| | - Chloé Pou-Prom
- Li Ka Shing Knowledge Institute, Saint Michael’s Hospital, Toronto, Ontario, Canada
| | - Daniyal Liaqat
- Vector Institute for Artificial Intelligence, Toronto, Ontario, Canada
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
| | | | | | - Frank Rudzicz
- Li Ka Shing Knowledge Institute, Saint Michael’s Hospital, Toronto, Ontario, Canada
- Vector Institute for Artificial Intelligence, Toronto, Ontario, Canada
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
- Surgical Safety Technologies, Toronto, Ontario, Canada
| |
|
32
|
Bhola A, Singh S. Visualisation and Modelling of High-Dimensional Cancerous Gene Expression Dataset. JOURNAL OF INFORMATION & KNOWLEDGE MANAGEMENT 2019. [DOI: 10.1142/s0219649219500011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
An increase in the number of dimensions of a cancerous gene expression dataset increases its complexity and the risk of misinterpretation, and makes the dataset harder to visualise for further analysis. Therefore, dimensionality reduction, visualisation and modelling of these datasets become challenging. In this paper, a framework is developed which helps to understand, visualise and model high-dimensional cancerous gene expression datasets in lower dimensions, which may be helpful in revealing cancer mechanisms and in diagnosis. Initially, cancerous gene expression datasets are preprocessed to make them complete, precise and efficient, and principal component analysis is applied for dimensionality reduction and visualisation. Regression is used to model the cancerous gene expression dataset so that the type of association (linear or nonlinear) and the directions between gene profiles may be estimated. To assess the performance of the developed framework, three publicly available cancerous gene expression datasets are used: breast (GEO Acc. No. GDS5076), lung (GEO Acc. No. GDS5040) and prostate (GEO Acc. No. GDS5072). Cross-validation is used to validate the regression results. The results revealed that a linear approach is to be used for the prostate cancer dataset and a nonlinear approach for the breast and lung cancer datasets in finding an association between gene pairs.
Affiliation(s)
- Abhishek Bhola
- Department of Computer Science and Engineering, Punjab Engineering College (Deemed to be University), Sector 12, Chandigarh 160012, India
| | - Shailendra Singh
- Department of Computer Science and Engineering, Punjab Engineering College (Deemed to be University), Sector 12, Chandigarh 160012, India
| |
|
33
|
Singh D, Singh B. Hybridization of feature selection and feature weighting for high dimensional data. APPL INTELL 2018. [DOI: 10.1007/s10489-018-1348-2] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/01/2022]
|
34
|
Ye Q, Sun Y. Weighted structure preservation and redundancy minimization for feature selection. Soft comput 2018. [DOI: 10.1007/s00500-017-2727-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022]
|
35
|
Armanfard N, Komeili M, Reilly JP, Connolly JF. A Machine Learning Framework for Automatic and Continuous MMN Detection With Preliminary Results for Coma Outcome Prediction. IEEE J Biomed Health Inform 2018; 23:1794-1804. [PMID: 30369457 DOI: 10.1109/jbhi.2018.2877738] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]
Abstract
Mismatch negativity (MMN) is a component of the event-related potential (ERP) that is elicited through an odd-ball paradigm. The existence of the MMN in a coma patient has a good correlation with coma emergence; however, this component can be difficult to detect. Previously, MMN detection was based on visual inspection of the averaged ERPs by a skilled clinician, a process that is expensive and not always feasible in practice. In this paper, we propose a practical machine learning (ML) based approach for detection of the MMN component, thereby improving the accuracy of prediction of emergence from coma. Furthermore, the method can operate on an automatic and continuous basis, thus alleviating the need for clinician involvement. The proposed method is capable of MMN detection over intervals as short as two minutes. This finer time resolution enables identification of waxing and waning cycles of a conscious state. An auditory odd-ball paradigm was applied to 22 healthy subjects and 2 coma patients. A coma patient is tested by measuring the similarity of the patient's ERP responses with the aggregate healthy responses. Because the training process for measuring similarity requires only healthy subjects, the complexity and practicality of the training procedure of the proposed method are greatly improved relative to training on coma patients directly. Since there are only two coma patients involved in this study, the results are reported on a very preliminary basis. Preliminary results indicate we can detect the MMN component with an accuracy of 92.7% on healthy subjects. The method successfully predicted emergence in both coma patients when conventional methods failed. The proposed method for collecting training data using exclusively healthy subjects is a novel approach that may prove useful in future, unrelated studies where ML methods are used.
|
36
|
Bharti P, Mittal D, Ananthasivan R. Characterization of chronic liver disease based on ultrasound images using the variants of grey-level difference matrix. Proc Inst Mech Eng H 2018; 232:884-900. [DOI: 10.1177/0954411918796531] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/18/2022]
Abstract
Chronic liver diseases are the fifth leading cause of fatality in developing countries. Early diagnosis is important for timely treatment and to save lives. Ultrasound imaging is frequently used to examine abnormalities of the liver. However, ambiguity lies in the visual interpretation of liver stages on ultrasound images. This difficult visualization problem can be solved by analysing textural features extracted from the images. The grey-level difference matrix, a texture feature extraction method, can provide information about roughness of the liver surface, sharpness of liver borders and echotexture of liver parenchyma. In this article, the behaviour of variants of the grey-level difference matrix in characterizing liver stages is investigated. The texture feature sets are extracted by using variants of the grey-level difference matrix based on two, three, five and seven neighbouring pixels. Thereafter, to take advantage of complementary information from the extracted feature sets, feature fusion schemes are implemented. In addition, hybrid feature selection (a combination of the ReliefF filter method and the sequential forward selection wrapper method) is used to obtain the optimal feature set for characterizing liver stages. Finally, a computer-aided system is designed with the optimal feature set to classify liver health in terms of normal, chronic liver, cirrhosis and hepatocellular carcinoma evolved over cirrhosis. In the proposed work, experiments are performed to (1) identify the best approximation of the derivative (forward, central or backward); (2) analyse the performance of individual feature sets of variants of the grey-level difference matrix; (3) obtain the optimal feature set by exploiting the complementary information from variants of the grey-level difference matrix and (4) analyse the performance of the proposed method in comparison with existing feature extraction methods. These experiments are carried out on a database of 754 segmented regions of interest formed from clinically acquired ultrasound images. The results show that a classification accuracy of 94.5% is obtained by the optimal feature set having complementary information from variants of the grey-level difference matrix.
Affiliation(s)
- Puja Bharti
- Department of Electrical & Instrumental Engineering, Thapar Institute of Engineering & Technology, Patiala, India
| | - Deepti Mittal
- Department of Electrical & Instrumental Engineering, Thapar Institute of Engineering & Technology, Patiala, India
| | | |
|
37
|
An S, Wang J, Wei J. Local-Nearest-Neighbors-Based Feature Weighting for Gene Selection. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018; 15:1538-1548. [PMID: 28600259 DOI: 10.1109/tcbb.2017.2712775] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/07/2023]
Abstract
Selecting functional genes is essential for analyzing microarray data. Among many available feature (gene) selection approaches, the ones on the basis of the large margin nearest neighbor receive more attention due to their low computational costs and high accuracies in analyzing the high-dimensional data. Yet, there still exist some problems that hamper the existing approaches in sifting real target genes, including selecting erroneous nearest neighbors, high sensitivity to irrelevant genes, and inappropriate evaluation criteria. Previous pioneer works have partly addressed some of the problems, but none of them are capable of solving these problems simultaneously. In this paper, we propose a new local-nearest-neighbors-based feature weighting approach to alleviate the above problems. The proposed approach is based on the trick of locally minimizing the within-class distances and maximizing the between-class distances with the nearest neighbors rule. We further define a feature weight vector, and construct it by minimizing the cost function with a regularization term. The proposed approach can be applied naturally to the multi-class problems and does not require extra modification. Experimental results on the UCI and the open microarray data sets validate the effectiveness and efficiency of the new approach.
|
38
|
Urbanowicz RJ, Meeker M, La Cava W, Olson RS, Moore JH. Relief-based feature selection: Introduction and review. J Biomed Inform 2018; 85:189-203. [PMID: 30031057 PMCID: PMC6299836 DOI: 10.1016/j.jbi.2018.07.014] [Citation(s) in RCA: 346] [Impact Index Per Article: 49.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/15/2018] [Revised: 06/29/2018] [Accepted: 07/14/2018] [Indexed: 01/25/2023]
Abstract
Feature selection plays a critical role in biomedical data mining, driven by increasing feature dimensionality in target problems and growing interest in advanced but computationally expensive methodologies able to model complex associations. Specifically, there is a need for feature selection methods that are computationally efficient, yet sensitive to complex patterns of association, e.g. interactions, so that informative features are not mistakenly eliminated prior to downstream modeling. This paper focuses on Relief-based algorithms (RBAs), a unique family of filter-style feature selection algorithms that have gained appeal by striking an effective balance between these objectives while flexibly adapting to various data characteristics, e.g. classification vs. regression. First, this work broadly examines types of feature selection and defines RBAs within that context. Next, we introduce the original Relief algorithm and associated concepts, emphasizing the intuition behind how it works, how feature weights generated by the algorithm can be interpreted, and why it is sensitive to feature interactions without evaluating combinations of features. Lastly, we include an expansive review of RBA methodological research beyond Relief and its popular descendant, ReliefF. In particular, we characterize branches of RBA research, and provide comparative summaries of RBA algorithms including contributions, strategies, functionality, time complexity, adaptation to key data characteristics, and software availability.
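The intuition behind the original Relief algorithm mentioned in the abstract above can be sketched in a few lines: for each instance, find its nearest neighbor of the same class (the "hit") and of the other class (the "miss"), then reward features that differ on the miss and penalize features that differ on the hit. This is a minimal sketch under stated assumptions, not the reviewed paper's reference implementation: it handles only two classes and numeric features, uses range-normalized Manhattan distance, and visits every instance once instead of drawing a random sample. The function name is hypothetical.

```python
def relief_weights(X, y):
    """Relief-style feature weights for binary classes and numeric features.

    X is a list of equal-length numeric tuples, y the list of class labels.
    Deterministic variant: every instance is used once as the target
    (the original algorithm samples a user-chosen number of instances).
    """
    n, p = len(X), len(X[0])
    # Per-feature value range, used to scale differences into [0, 1].
    ranges = [(max(col) - min(col)) or 1.0 for col in zip(*X)]

    def diff(f, a, b):
        return abs(a[f] - b[f]) / ranges[f]

    def distance(a, b):
        return sum(diff(f, a, b) for f in range(p))

    weights = [0.0] * p
    for i in range(n):
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        if not hits or not misses:
            continue  # no same-class or other-class neighbor available
        hit = min(hits, key=lambda j: distance(X[i], X[j]))
        miss = min(misses, key=lambda j: distance(X[i], X[j]))
        for f in range(p):
            # Differing on the nearest miss is good; differing on the hit is bad.
            weights[f] += (diff(f, X[i], X[miss]) - diff(f, X[i], X[hit])) / n
    return weights
```

Neighbors are chosen using all features jointly, but each feature is then scored on its own, which is how the method stays sensitive to interactions without enumerating feature combinations.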
Affiliation(s)
- Ryan J Urbanowicz
- Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA.
| | | | - William La Cava
- Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA.
| | - Randal S Olson
- Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA.
| | - Jason H Moore
- Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA.
| |
|
39
|
Cai Z, Gu J, Wen C, Zhao D, Huang C, Huang H, Tong C, Li J, Chen H. An Intelligent Parkinson's Disease Diagnostic System Based on a Chaotic Bacterial Foraging Optimization Enhanced Fuzzy KNN Approach. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2018; 2018:2396952. [PMID: 30034509 PMCID: PMC6032994 DOI: 10.1155/2018/2396952] [Citation(s) in RCA: 35] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 11/29/2017] [Revised: 04/02/2018] [Accepted: 05/21/2018] [Indexed: 11/17/2022]
Abstract
Parkinson's disease (PD) is a common neurodegenerative disease, which has attracted more and more attention. Many artificial intelligence methods have been used for the diagnosis of PD. In this study, an enhanced fuzzy k-nearest neighbor (FKNN) method for the early detection of PD based upon vocal measurements was developed. The proposed method, an evolutionary instance-based learning approach termed CBFO-FKNN, was developed by coupling the chaotic bacterial foraging optimization with Gauss mutation (CBFO) approach with FKNN. The integration of the CBFO technique efficiently resolved the parameter tuning issues of the FKNN. The effectiveness of the proposed CBFO-FKNN was rigorously compared to those of the PD datasets in terms of classification accuracy, sensitivity, specificity, and AUC (area under the receiver operating characteristic curve). The simulation results indicated the proposed approach outperformed the other five FKNN models based on BFO, particle swarm optimization, Genetic algorithms, fruit fly optimization, and firefly algorithm, as well as three advanced machine learning methods including support vector machine (SVM), SVM with local learning-based feature selection, and kernel extreme learning machine in a 10-fold cross-validation scheme. The method presented in this paper has a very good prospect, which will bring great convenience to the clinicians to make a better decision in the clinical diagnosis.
Affiliation(s)
- Zhennao Cai
- School of Computer Science and Engineering, Northwestern Polytechnical University, Xi'an 710072, China
| | - Jianhua Gu
- School of Computer Science and Engineering, Northwestern Polytechnical University, Xi'an 710072, China
| | - Caiyun Wen
- Department of Radiology, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou, Zhejiang 325035, China
| | - Dong Zhao
- College of Computer Science and Technology, Changchun Normal University, Changchun 130032, China
| | - Chunyu Huang
- College of Computer Science and Technology, Changchun University of Science Technology, Changchun 130032, China
| | - Hui Huang
- College of Mathematics, Physics and Electronic Information Engineering, Wenzhou University, Wenzhou, Zhejiang 325035, China
| | - Changfei Tong
- College of Mathematics, Physics and Electronic Information Engineering, Wenzhou University, Wenzhou, Zhejiang 325035, China
| | - Jun Li
- College of Mathematics, Physics and Electronic Information Engineering, Wenzhou University, Wenzhou, Zhejiang 325035, China
| | - Huiling Chen
- College of Mathematics, Physics and Electronic Information Engineering, Wenzhou University, Wenzhou, Zhejiang 325035, China
| |
|
40
|
Komeili M, Louis W, Armanfard N, Hatzinakos D. Feature Selection for Nonstationary Data: Application to Human Recognition Using Medical Biometrics. IEEE TRANSACTIONS ON CYBERNETICS 2018; 48:1446-1459. [PMID: 28534806 DOI: 10.1109/tcyb.2017.2702059] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/07/2023]
Abstract
Electrocardiogram (ECG) and transient evoked otoacoustic emission (TEOAE) are among the physiological signals that have attracted significant interest in the biometric community due to their inherent robustness to replay and falsification attacks. However, they are time-dependent signals, which makes them hard to handle in across-session human recognition scenarios where only one session is available for enrollment. This paper presents a novel feature selection method to address this issue. It relies on an auxiliary dataset with multiple sessions, from which it selects a subset of features that are more persistent across sessions. It uses local information in terms of sample margins while enforcing an across-session measure, which makes it a good fit for the aforementioned biometric recognition problem. Comprehensive experiments on ECG and TEOAE variability due to time lapse and body posture are conducted. The performance of the proposed method is compared against seven state-of-the-art feature selection algorithms as well as six other approaches in the area of ECG and TEOAE biometric recognition. Experimental results demonstrate that the proposed method performs noticeably better than the other algorithms.
|
41
|
Armanfard N, Reilly JP, Komeili M. Logistic Localized Modeling of the Sample Space for Feature Selection and Classification. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2018; 29:1396-1413. [PMID: 28333643 DOI: 10.1109/tnnls.2017.2676101] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Indexed: 06/06/2023]
Abstract
Conventional feature selection algorithms assign a single common feature set to all regions of the sample space. In contrast, this paper proposes a novel algorithm for localized feature selection for which each region of the sample space is characterized by its individual distinct feature subset that may vary in size and membership. This approach can therefore select an optimal feature subset that adapts to local variations of the sample space, and hence offer the potential for improved performance. Feature subsets are computed by choosing an optimal coordinate space so that, within a localized region, within-class distances and between-class distances are, respectively, minimized and maximized. Distances are measured using a logistic function metric within the corresponding region. This enables the optimization process to focus on a localized region within the sample space. A local classification approach is utilized for measuring the similarity of a new input data point to each class. The proposed logistic localized feature selection (lLFS) algorithm is invariant to the underlying probability distribution of the data; hence, it is appropriate when the data are distributed on a nonlinear or disjoint manifold. lLFS is efficiently formulated as a joint convex/increasing quasi-convex optimization problem with a unique global optimum point. The method is most applicable when the number of available training samples is small. The performance of the proposed localized method is successfully demonstrated on a large variety of data sets. We demonstrate that the number of features selected by the lLFS method saturates at the number of available discriminative features. In addition, we have shown that the Vapnik-Chervonenkis dimension of the localized classifier is finite. Both these factors suggest that the lLFS method is insensitive to the overfitting issue, relative to other methods.
|
42
|
Jiang L, Wang Y, Cai B, Wang Y, Wang Y. Spatial-Temporal Feature Analysis on Single-Trial Event Related Potential for Rapid Face Identification. Front Comput Neurosci 2017; 11:106. [PMID: 29230171 PMCID: PMC5711855 DOI: 10.3389/fncom.2017.00106] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 07/05/2017] [Accepted: 11/08/2017] [Indexed: 11/19/2022]
Abstract
The event-related potential (ERP) is the brain response measured in electroencephalography (EEG), which reflects the process of human cognitive activity. ERP has been introduced into brain-computer interfaces (BCIs) to communicate the subject's intention to the computer. Due to the low signal-to-noise ratio of EEG, most ERP studies are based on grand-averaging over many trials. Recently, single-trial ERP detection has attracted more attention because it enables real-time processing tasks such as rapid face identification. Each target to be retrieved may appear only once, and no target labels are available for averaging. More interestingly, how the features contribute temporally and spatially to single-trial ERP detection has not been fully investigated. In this paper, we propose a local-learning-based (LLB) feature extraction method to investigate the importance of the spatial-temporal components of ERP in a rapid face identification task using single-trial detection. Compared with previous methods, the LLB method preserves the nonlinear structure of the EEG signal distribution and analyzes the importance of the original spatial-temporal components via optimization in feature space. As a data-driven method, its weighting of the spatial-temporal components does not depend on the ERP detection method. The importance weights are optimized by making the targets more different from non-targets in feature space, and a regularization penalty is introduced into the optimization to obtain sparse weights. The method is evaluated on EEG data from 15 participants performing a face identification task under the rapid serial visual presentation paradigm. Compared with other methods, the proposed spatial-temporal analysis uses sparser features (only 10% of the total) and achieves single-trial ERP detection performance comparable (98%) to that of the full feature set across different detection methods.
An interesting finding is that the N250 is the earliest temporal component that contributes to single-trial ERP detection in face identification, and that the importance of the N250 component is lateralized toward the left hemisphere. We show that using only the left N250 component outperforms using the right N250 in the face identification task with single-trial ERP detection. The finding is also important for building a fast and efficient (fewer electrodes) BCI system for rapid face identification.
Affiliation(s)
- Lei Jiang
- Qiushi Academy for Advanced Studies, Zhejiang University, Hangzhou, China
- Department of Computer Science and Technology, Zhejiang University, Hangzhou, China
| | - Yun Wang
- Qiushi Academy for Advanced Studies, Zhejiang University, Hangzhou, China
- Department of Biomedical Engineering, Zhejiang University, Hangzhou, China
| | - Bangyu Cai
- Qiushi Academy for Advanced Studies, Zhejiang University, Hangzhou, China
- Department of Biomedical Engineering, Zhejiang University, Hangzhou, China
| | - Yueming Wang
- Qiushi Academy for Advanced Studies, Zhejiang University, Hangzhou, China
- Department of Computer Science and Technology, Zhejiang University, Hangzhou, China
| | - Yiwen Wang
- Department of Electronic and Computer Engineering, Department of Chemical and Biology Engineering, Hong Kong University of Science and Technology, Kowloon, Hong Kong
| |
|
43
|
Liu X, He J, Chang SF. Hash Bit Selection for Nearest Neighbor Search. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2017; 26:5367-5380. [PMID: 28436872 DOI: 10.1109/tip.2017.2695895] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Indexed: 06/07/2023]
Abstract
To overcome the barrier of storage and computation when dealing with gigantic-scale data sets, compact hashing has been studied extensively to approximate the nearest neighbor search. Despite the recent advances, critical design issues remain open in how to select the right features, hashing algorithms, and/or parameter settings. In this paper, we address these by posing an optimal hash bit selection problem, in which an optimal subset of hash bits is selected from a pool of candidate bits generated by different features, algorithms, or parameters. Inspired by the optimization criteria used in existing hashing algorithms, we adopt bit reliability and complementarity as the selection criteria, which can be carefully tailored for hashing performance in different tasks. Then, the bit selection solution is discovered by finding the best tradeoff between search accuracy and time using a modified dynamic programming method. To further reduce the computational complexity, we employ the pairwise relationship among hash bits to approximate the high-order independence property, and formulate it as an efficient quadratic programming method that is theoretically equivalent to the normalized dominant set problem in a vertex- and edge-weighted graph. Extensive large-scale experiments have been conducted under several important application scenarios of hash techniques, where our bit selection framework achieves superior performance over both naive selection methods and state-of-the-art hashing algorithms, with significant relative accuracy gains ranging from 10% to 50%.
|
44
|
Sun Y, Yao J, Yang L, Chen R, Nowak NJ, Goodison S. Computational approach for deriving cancer progression roadmaps from static sample data. Nucleic Acids Res 2017; 45:e69. [PMID: 28108658 PMCID: PMC5436003 DOI: 10.1093/nar/gkx003] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 11/14/2016] [Accepted: 01/07/2017] [Indexed: 12/26/2022]
Abstract
As with any biological process, cancer development is inherently dynamic. While major efforts continue to catalog the genomic events associated with human cancer, it remains difficult to interpret and extrapolate the accumulating data to provide insights into the dynamic aspects of the disease. Here, we present a computational strategy that enables the construction of a cancer progression model using static tumor sample data. The developed approach overcame many technical limitations of existing methods. Application of the approach to breast cancer data revealed a linear, branching model with two distinct trajectories for malignant progression. The validity of the constructed model was demonstrated in 27 independent breast cancer data sets, and through visualization of the data in the context of disease progression we were able to identify a number of potentially key molecular events in the advance of breast cancer to malignancy.
Affiliation(s)
- Yijun Sun
- Department of Microbiology and Immunology; Department of Computer Science and Engineering; Department of Biostatistics, The State University of New York, Buffalo, NY 14203, USA; Department of Biochemistry, The State University of New York, Buffalo, NY 14203, USA
| | - Jin Yao
- Department of Microbiology and Immunology
| | - Le Yang
- Department of Computer Science and Engineering
| | - Runpu Chen
- Department of Computer Science and Engineering
| | - Norma J Nowak
- Department of Bioinformatics and Biostatistics, Roswell Park Cancer Institute, Buffalo, NY 14201, USA
| | - Steve Goodison
- Department of Health Sciences Research, Mayo Clinic, Jacksonville, FL 32224, USA
| |
|
45
|
Armanfard N, Komeili M, Reilly JP, Mah R, Connolly JF. Automatic and continuous assessment of ERPs for mismatch negativity detection. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2017; 2016:969-972. [PMID: 28268485 DOI: 10.1109/embc.2016.7590863] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 11/06/2022]
Abstract
Accurate and fast detection of event-related potential (ERP) components is an unresolved issue in neuroscience and critical health care. Mismatch negativity (MMN) is a component of the ERP to an odd stimulus in a sequence of identical stimuli, and it correlates well with coma awakening. All previous studies of MMN detection are based on visual inspection of the averaged ERPs (over a long recording time) by a skilled neurophysiologist. However, in practical situations, such an expert may not be available or familiar with all aspects of evoked potential methods. Further, clinically essential events may be missed because of the implicit averaging process used to acquire the ERPs. In this paper we propose a practical machine learning approach for automatic and continuous assessment of the ERPs for detecting the presence of the MMN component. The proposed method is realized in a classification framework. Its performance is demonstrated on 22 healthy subjects through a leave-one-subject-out strategy, in which the MMN components are identified with about 93% accuracy.
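The leave-one-subject-out evaluation mentioned in this abstract can be sketched as follows. This is a generic illustration in plain Python, not the authors' code: the helper name `leave_one_subject_out` and the subject labels are hypothetical.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) splits.

    Every sample from one subject is held out for testing while the
    samples from all other subjects form the training set.
    """
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


# Four recordings from three hypothetical subjects.
splits = list(leave_one_subject_out(["s1", "s1", "s2", "s3"]))
```

Splitting by subject rather than by sample keeps recordings from the same person out of both the training and test sets at once, so the reported accuracy reflects generalization to unseen subjects.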
|
46
|
The impact of machine learning techniques in the study of bipolar disorder: A systematic review. Neurosci Biobehav Rev 2017; 80:538-554. [PMID: 28728937 DOI: 10.1016/j.neubiorev.2017.07.004] [Citation(s) in RCA: 81] [Impact Index Per Article: 10.1] [Received: 02/10/2017] [Revised: 06/15/2017] [Accepted: 07/08/2017] [Indexed: 01/10/2023]
Abstract
Machine learning techniques provide new methods to predict diagnosis and clinical outcomes at an individual level. We aim to review the existing literature on the use of machine learning techniques in the assessment of subjects with bipolar disorder. We systematically searched PubMed, Embase and Web of Science for articles published in any language up to January 2017. We found 757 abstracts and included 51 studies in our review. Most of the included studies used multiple levels of biological data to distinguish the diagnosis of bipolar disorder from other psychiatric disorders or healthy controls. We also found studies that assessed the prediction of clinical outcomes and studies using unsupervised machine learning to build more consistent clinical phenotypes of bipolar disorder. We concluded that given the clinical heterogeneity of samples of patients with BD, machine learning techniques may provide clinicians and researchers with important insights in fields such as diagnosis, personalized treatment and prognosis orientation.
|
47
|
Gui J, Sun Z, Ji S, Tao D, Tan T. Feature Selection Based on Structured Sparsity: A Comprehensive Study. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2017; 28:1490-1507. [PMID: 28287983 DOI: 10.1109/tnnls.2016.2551724] [Citation(s) in RCA: 126] [Impact Index Per Article: 15.8] [Indexed: 06/06/2023]
Abstract
Feature selection (FS) is an important component of many pattern recognition tasks. In these tasks, one is often confronted with very high-dimensional data. FS algorithms are designed to identify the relevant feature subset from the original features, which can facilitate subsequent analysis, such as clustering and classification. Structured sparsity-inducing feature selection (SSFS) methods have been widely studied in the last few years, and a number of algorithms have been proposed. However, there is no comprehensive study concerning the connections between different SSFS methods, and how they have evolved. In this paper, we attempt to provide a survey on various SSFS methods, including their motivations and mathematical representations. We then explore the relationship among different formulations and propose a taxonomy to elucidate their evolution. We group the existing SSFS methods into two categories, i.e., vector-based feature selection (feature selection based on lasso) and matrix-based feature selection (feature selection based on the $\ell_{r,p}$-norm). Furthermore, FS has been combined with other machine learning algorithms for specific applications, such as multitask learning, multilabel learning, multiview learning, classification, and clustering. This paper not only compares the differences and commonalities of these methods based on regression and regularization strategies, but also provides practitioners working in related fields with useful guidelines on how to perform feature selection.
|
48
|
|
49
|
Abstract
Background: The Receiver Operator Characteristic (ROC) curve is widely used to evaluate classification performance in the biomedical field. Owing to its superiority in dealing with imbalanced and cost-sensitive data, the ROC curve has been exploited as a popular metric to evaluate and identify disease-related genes (features). The existing ROC-based feature selection approaches are simple and effective in evaluating individual features. However, these approaches may fail to find the real target feature subset because they lack an effective means of reducing the redundancy between features, which is essential in machine learning. Results: In this paper, we propose to assess feature complementarity by measuring the distances between the misclassified instances and their nearest misses on the dimensions of pairwise features. If a misclassified instance and its nearest miss on one feature dimension are far apart on another feature dimension, the two features are regarded as complementary to each other. Subsequently, we propose a novel filter feature selection approach on the basis of the ROC analysis. The new approach employs an efficient heuristic search strategy to select optimal features with the highest complementarities. The experimental results on a broad range of microarray data sets validate that the classifiers built on the feature subset selected by our approach achieve the minimal balanced error rate with a small number of significant features. Conclusions: Compared with other ROC-based feature selection approaches, our new approach can select fewer features and effectively improve the classification performance.
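As a minimal illustration of the single-feature ROC scoring that this abstract describes as the existing baseline, the sketch below ranks features by their individual AUC. It is not the authors' complementarity-based method; the function names `auc` and `rank_features_by_auc` and the toy data are assumptions made for illustration.

```python
def auc(scores, labels):
    """Area under the ROC curve as the fraction of (positive, negative)
    pairs ranked correctly, counting ties as half a correct pair."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def rank_features_by_auc(X, y):
    """Return feature indices sorted from most to least discriminative.

    A feature that separates the classes in the reverse direction is as
    useful as one that separates them directly, so the score used is
    max(AUC, 1 - AUC).
    """
    scored = []
    for j in range(len(X[0])):
        a = auc([row[j] for row in X], y)
        scored.append((max(a, 1.0 - a), j))
    return [j for _, j in sorted(scored, key=lambda t: (-t[0], t[1]))]


# Toy data: feature 0 separates the classes perfectly, feature 1 does not.
X = [[1, 5], [2, 1], [3, 4], [4, 2]]
y = [0, 0, 1, 1]
ranking = rank_features_by_auc(X, y)
```

Scoring each feature in isolation, as here, says nothing about redundancy between features, which is the gap the abstract's complementarity measure is meant to address.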
|
50
|
Hamzeh-Mivehroud M, Sokouti B, Dastmalchi S. An Introduction to the Basic Concepts in QSAR-Aided Drug Design. Oncology 2017. [DOI: 10.4018/978-1-5225-0549-5.ch002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Abstract
The need for the development of new drugs to combat existing and newly identified conditions is unavoidable. One of the important tools used in the advanced drug development pipeline is computer-aided drug design. Traditionally, to find a drug, many ligands were synthesized and evaluated for their effectiveness using suitable bioassays, and if all other drug-likeness features were met, the candidate(s) would possibly reach the market. Although this approach is still in use in a more advanced form, computational methods are an indispensable component of modern drug development projects. One of the methods used since the very early days of rational drug design is the Quantitative Structure-Activity Relationship (QSAR). This chapter overviews QSAR modeling steps by introducing molecular descriptors, mathematical model development for relating biological activities to molecular structures, and model validation. At the end, several successful cases where QSAR studies were used extensively are presented.
Affiliation(s)
- Siavoush Dastmalchi
- Biotechnology Research Center, Tabriz University of Medical Sciences, Iran & School of Pharmacy, Tabriz University of Medical Sciences, Iran
| |
|