1
|
Rao RS, Dewangan S, Mishra A, Gupta M. A study of dealing class imbalance problem with machine learning methods for code smell severity detection using PCA-based feature selection technique. Sci Rep 2023; 13:16245. [PMID: 37758824 PMCID: PMC10533884 DOI: 10.1038/s41598-023-43380-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/01/2023] [Accepted: 09/22/2023] [Indexed: 09/29/2023] Open
Abstract
Detecting code smells can help reduce maintenance costs and raise source code quality. Code smells help developers and researchers understand several types of design flaws. Code smells with high severity can cause significant problems for the software and may threaten the system's maintainability. Assessing the severity of the code smells detected in software is essential because it prioritizes refactoring efforts. The class imbalance problem further increases the difficulty of code smell severity detection. In this study, four code smell severity datasets (Data class, God class, Feature envy, and Long method) are selected to detect code smell severity. To address class imbalance, the Synthetic Minority Oversampling Technique (SMOTE) is applied. Each dataset's relevant features are chosen using a feature selection technique based on principal component analysis. The severity of code smells is determined using five machine learning techniques: K-nearest neighbor, Random forest, Decision tree, Multi-layer Perceptron, and Logistic Regression. This study obtained a severity accuracy score of 0.99 with the Random forest and Decision tree approaches on the Long method code smell. The models are compared on accuracy and three other performance measures (Precision, Recall, and F-measure). Performance with and without SMOTE is also compared and presented. The results obtained in the study are promising and can pave the way for further studies in this area.
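The SMOTE step named in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `smote`, its parameters, and the toy `minority` data are assumptions made for illustration. It shows only the core idea, which is to create synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority neighbours.

```python
import random

def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Hypothetical helper for illustration; `minority` is a list of feature vectors.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base` (squared Euclidean distance),
        # excluding the base point itself
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Toy minority class with two features (made-up values)
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_new=6)
```

Because every synthetic point lies on a segment between two existing minority samples, the new points stay inside the region the minority class already occupies. In practice a maintained implementation such as the one in the imbalanced-learn package would normally be preferred over a hand-written version.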
Affiliation(s)
- Rajwant Singh Rao
- Department of Computer Science and Information Technology, Guru Ghasidas Vishwavidyalaya, Bilaspur, India
| | - Seema Dewangan
- Department of Computer Science and Information Technology, Guru Ghasidas Vishwavidyalaya, Bilaspur, India
| | - Alok Mishra
- Faculty of Engineering, Norwegian University of Science and Technology, Trondheim, Norway.
| | - Manjari Gupta
- (Computer Science), DST - Centre for Interdisciplinary Mathematical Sciences, Institute of Science, Banaras Hindu University, Varanasi, India
| |
|
2
|
Shi Y, Li Y, Koike Y. Sparse Logistic Regression-Based EEG Channel Optimization Algorithm for Improved Universality across Participants. Bioengineering (Basel) 2023; 10:664. [PMID: 37370595 DOI: 10.3390/bioengineering10060664] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/08/2023] [Revised: 05/25/2023] [Accepted: 05/26/2023] [Indexed: 06/29/2023] Open
Abstract
Electroencephalogram (EEG) channel optimization can reduce redundant information and improve EEG decoding accuracy by selecting the most informative channels. This article aims to investigate the universality regarding EEG channel optimization in terms of how well the selected EEG channels can be generalized to different participants. In particular, this study proposes a sparse logistic regression (SLR)-based EEG channel optimization algorithm using a non-zero model parameter ranking method. The proposed channel optimization algorithm was evaluated in both individual analysis and group analysis using the raw EEG data, compared with the conventional channel selection method based on the correlation coefficients (CCS). The experimental results demonstrate that the SLR-based EEG channel optimization algorithm not only filters out most redundant channels (filters 75-96.9% of channels) with a 1.65-5.1% increase in decoding accuracy, but it can also achieve a satisfactory level of decoding accuracy in the group analysis by employing only a few (2-15) common EEG electrodes, even for different participants. The proposed channel optimization algorithm can realize better universality for EEG decoding, which can reduce the burden of EEG data acquisition and enhance the real-world application of EEG-based brain-computer interface (BCI).
Affiliation(s)
- Yuxi Shi
- School of Engineering, Tokyo Institute of Technology, Yokohama 226-8503, Japan
| | - Yuanhao Li
- School of Engineering, Tokyo Institute of Technology, Yokohama 226-8503, Japan
| | - Yasuharu Koike
- Institute of Innovative Research, Tokyo Institute of Technology, Yokohama 226-8503, Japan
| |
|
3
|
Bugata P, Drotar P. Feature Selection Based on a Sparse Neural-Network Layer With Normalizing Constraints. IEEE TRANSACTIONS ON CYBERNETICS 2023; 53:161-172. [PMID: 34236981 DOI: 10.1109/tcyb.2021.3087776] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]
Abstract
Feature selection (FS) is an important step in machine learning since it has been shown to improve prediction accuracy while suppressing the curse of dimensionality of high-dimensional data. Neural networks have experienced tremendous success in solving many nonlinear learning problems. Here, we propose a new neural-network-based FS approach that introduces two constraints, the satisfaction of which leads to a sparse FS layer. We performed extensive experiments on synthetic and real-world data to evaluate the performance of our proposed FS method. In the experiments, we focus on high-dimensional, low-sample-size data since they represent the main challenge for FS. The results confirm that the proposed FS method based on a sparse neural-network layer with normalizing constraints (SNeL-FS) is able to select the important features and yields superior performance compared to other conventional FS methods.
|
4
|
Li Y, Mansmann U, Du S, Hornung R. Benchmark study of feature selection strategies for multi-omics data. BMC Bioinformatics 2022; 23:412. [PMID: 36199022 PMCID: PMC9533501 DOI: 10.1186/s12859-022-04962-x] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 03/22/2022] [Accepted: 09/21/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND In the last few years, multi-omics data, that is, datasets containing different types of high-dimensional molecular variables for the same samples, have become increasingly available. To date, several comparison studies focused on feature selection methods for omics data, but to our knowledge, none compared these methods for the special case of multi-omics data. Given that these data have specific structures that differentiate them from single-omics data, it is unclear whether different feature selection strategies may be optimal for such data. In this paper, using 15 cancer multi-omics datasets we compared four filter methods, two embedded methods, and two wrapper methods with respect to their performance in the prediction of a binary outcome in several situations that may affect the prediction results. As classifiers, we used support vector machines and random forests. The methods were compared using repeated fivefold cross-validation. The accuracy, the AUC, and the Brier score served as performance metrics. RESULTS The results suggested that, first, the chosen number of selected features affects the predictive performance for many feature selection methods but not all. Second, whether the features were selected by data type or from all data types concurrently did not considerably affect the predictive performance, but for some methods, concurrent selection took more time. Third, regardless of which performance measure was considered, the feature selection methods mRMR, the permutation importance of random forests, and the Lasso tended to outperform the other considered methods. Here, mRMR and the permutation importance of random forests already delivered strong predictive performance when considering only a few selected features. Finally, the wrapper methods were computationally much more expensive than the filter and embedded methods. 
CONCLUSIONS We recommend the permutation importance of random forests and the filter method mRMR for feature selection using multi-omics data, where, however, mRMR is considerably more computationally costly.
Affiliation(s)
- Yingxia Li
- Institute for Medical Information Processing, Biometry and Epidemiology, University of Munich, Marchioninistr. 15, 81377, Munich, Germany.
| | - Ulrich Mansmann
- Institute for Medical Information Processing, Biometry and Epidemiology, University of Munich, Marchioninistr. 15, 81377, Munich, Germany
| | - Shangming Du
- Institute for Medical Information Processing, Biometry and Epidemiology, University of Munich, Marchioninistr. 15, 81377, Munich, Germany
| | - Roman Hornung
- Institute for Medical Information Processing, Biometry and Epidemiology, University of Munich, Marchioninistr. 15, 81377, Munich, Germany
| |
|
5
|
Performance of soft sensors based on stochastic configuration networks with nonnegative garrote. Neural Comput Appl 2022. [DOI: 10.1007/s00521-022-07254-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022]
|
6
|
Varanasi SK, Daemi A, Huang B, Slot G, Majoko P. Sparsity constrained wavelet neural networks for robust soft sensor design with application to the industrial KIVCET unit. Comput Chem Eng 2022. [DOI: 10.1016/j.compchemeng.2022.107695] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/26/2022]
|
7
|
Huang Z, Ren Y, Pu X, Pan L, Yao D, Yu G. Dual self-paced multi-view clustering. Neural Netw 2021; 140:184-192. [PMID: 33770727 DOI: 10.1016/j.neunet.2021.02.022] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 11/30/2020] [Revised: 01/17/2021] [Accepted: 02/18/2021] [Indexed: 10/22/2022]
Abstract
By utilizing the complementary information from multiple views, multi-view clustering (MVC) algorithms typically achieve much better clustering performance than conventional single-view methods. Although great progress has been made in this field in the past few years, most existing multi-view clustering methods still suffer from the following shortcomings: (1) most MVC methods are non-convex and thus easily get stuck in suboptimal local minima; (2) the effectiveness of these methods is sensitive to noise and outliers; and (3) the qualities of different features and views are usually ignored, which can also influence the clustering result. To address these issues, we propose dual self-paced multi-view clustering (DSMVC) in this paper. Specifically, DSMVC takes advantage of self-paced learning to tackle the non-convex issue. By applying a soft-weighting scheme of self-paced learning for instances, the negative impact caused by noise and outliers can be significantly reduced. Moreover, to alleviate the feature and view quality issues, we develop a novel feature selection approach in a self-paced manner and a weighting term for views. Experimental results on real-world data sets demonstrate the effectiveness of the proposed method.
Affiliation(s)
- Zongmo Huang
- School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
| | - Yazhou Ren
- School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China.
| | - Xiaorong Pu
- School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
| | - Lili Pan
- School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China; Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China
| | - Dezhong Yao
- The Clinical Hospital of Chengdu Brain Science Institute, MOE Key Laboratory for Neuroinformation, University of Electronic Science and Technology of China, Chengdu 611731, China; Research Unit of NeuroInformation, Chinese Academy of Medical Sciences, 2019RU035, Chengdu, China; School of Electrical Engineering, Zhengzhou University, Zhengzhou 450001, China
| | - Guoxian Yu
- School of Software, Shandong University, Jinan 250101, China
| |
|
8
|
Abstract
In modern network infrastructure, Distributed Denial of Service (DDoS) attacks are considered severe network security threats. Conventional network security tools find it extremely difficult to distinguish between the high traffic volume of a DDoS attack and a large number of legitimate users accessing a targeted network service or resource. Although these attacks have been widely studied, few works collect and analyse truly representative characteristics of DDoS traffic. Current research mostly focuses on DDoS detection and mitigation with predefined DDoS data-sets, which are often hard to generalise to various network services and legitimate users’ traffic patterns. To deal with considerably large DDoS traffic flows in Software Defined Networking (SDN), in this work we propose a fast and effective entropy-based DDoS detection method. We deployed a generalised entropy calculation combining Shannon and Renyi entropy to identify distributed features of DDoS traffic; this also helped the SDN controller to deal effectively with heavy malicious traffic. To reduce the network traffic overhead, we collected data-plane traffic with signature-based Snort detection. We then analysed the collected traffic for entropy-based features to improve the detection accuracy of deep learning models: Stacked Auto Encoder (SAE) and Convolutional Neural Network (CNN). This work also investigated the trade-off between SAE and CNN classifiers using accuracy and false-positive results. Quantitative results demonstrated that SAE achieved a relatively higher detection accuracy of 94% with only 6% false-positive alerts, whereas the CNN classifier achieved an average accuracy of 93%.
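The two entropy measures named in this abstract can be illustrated with a minimal sketch. The abstract does not say how the authors combine them, so the code only computes each measure separately. The function names, the choice of destination addresses as the observed feature, and the toy traffic windows are all assumptions made for illustration.

```python
import math
from collections import Counter

def probabilities(items):
    """Empirical probability of each distinct item in a traffic window."""
    counts = Counter(items)
    total = sum(counts.values())
    return [c / total for c in counts.values()]

def shannon_entropy(items):
    """Shannon entropy in bits: -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probabilities(items))

def renyi_entropy(items, alpha=2.0):
    """Renyi entropy of order alpha in bits (alpha > 0, alpha != 1)."""
    return math.log2(sum(p ** alpha for p in probabilities(items))) / (1.0 - alpha)

# Hypothetical windows of destination addresses seen by a controller
normal = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]  # evenly spread
attack = ["10.0.0.1"] * 7 + ["10.0.0.2"]                   # concentrated on one host

h_normal = shannon_entropy(normal)
h_attack = shannon_entropy(attack)
r_normal = renyi_entropy(normal)
r_attack = renyi_entropy(attack)
```

When traffic concentrates on a single destination, both entropies drop relative to the evenly spread window, which is the kind of signal an entropy-based detector can threshold or feed to a classifier as a feature.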
|
9
|
Curreri F, Graziani S, Xibilia MG. Input selection methods for data-driven Soft sensors design: Application to an industrial process. Inf Sci (N Y) 2020. [DOI: 10.1016/j.ins.2020.05.028] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Indexed: 10/24/2022]
|
10
|
Cho BJ, Kim KM, Bilegsaikhan SE, Suh YJ. Machine learning improves the prediction of febrile neutropenia in Korean inpatients undergoing chemotherapy for breast cancer. Sci Rep 2020; 10:14803. [PMID: 32908182 PMCID: PMC7481240 DOI: 10.1038/s41598-020-71927-6] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 12/18/2019] [Accepted: 08/24/2020] [Indexed: 01/01/2023] Open
Abstract
Febrile neutropenia (FN) is one of the most concerning complications of chemotherapy, and its prediction remains difficult. This study aimed to reveal the risk factors for FN and to build prediction models of FN using machine learning algorithms. Medical records of hospitalized patients who underwent chemotherapy after surgery for breast cancer between May 2002 and September 2018 were selectively reviewed for development of models. Demographic, clinical, pathological, and therapeutic data were analyzed to identify risk factors for FN. Using machine learning algorithms, prediction models were developed and evaluated for performance. Of 933 selected inpatients with a mean age of 51.8 ± 10.7 years, FN developed in 409 (43.8%) patients. There was a significant difference in FN incidence according to age, staging, taxane-based regimen, and blood count 5 days after chemotherapy. The area under the curve (AUC) of the logistic regression model built on these findings was 0.870; with machine learning, the AUC improved to 0.908. Machine learning improves the prediction of FN in patients undergoing chemotherapy for breast cancer compared to the conventional statistical model. In these high-risk patients, primary prophylaxis with granulocyte colony-stimulating factor could be considered.
Affiliation(s)
- Bum-Joo Cho
- Department of Ophthalmology, Hallym University Sacred Heart Hospital, Anyang, Korea
| | - Kyoung Min Kim
- Institute of New Frontier Research, Hallym University College of Medicine, Chuncheon, Korea
| | | | - Yong Joon Suh
- Department of Breast and Endocrine Surgery, Hallym University Sacred Heart Hospital, 22, Gwanpyeong-ro 170 beon-gil, Dongan-gu, Anyang, 14068, Korea.
| |
|
11
|
Sparse evolutionary deep learning with over one million artificial neurons on commodity hardware. Neural Comput Appl 2020. [DOI: 10.1007/s00521-020-05136-7] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 10/23/2022]
Abstract
Artificial neural networks (ANNs) have emerged as hot topics in the research community. Despite the success of ANNs, it is challenging to train and deploy modern ANNs on commodity hardware due to the ever-increasing model size and the unprecedented growth in the data volumes. Particularly for microarray data, the very high dimensionality and the small number of samples make it difficult for machine learning techniques to handle. Furthermore, specialized hardware such as graphics processing unit (GPU) is expensive. Sparse neural networks are the leading approaches to address these challenges. However, off-the-shelf sparsity-inducing techniques either operate from a pretrained model or enforce the sparse structure via binary masks. The training efficiency of sparse neural networks cannot be obtained practically. In this paper, we introduce a technique allowing us to train truly sparse neural networks with fixed parameter count throughout training. Our experimental results demonstrate that our method can be applied directly to handle high-dimensional data, while achieving higher accuracy than the traditional two-phase approaches. Moreover, we have been able to create truly sparse multilayer perceptron models with over one million neurons and to train them on a typical laptop without GPU (https://github.com/dcmocanu/sparse-evolutionary-artificial-neural-networks/tree/master/SET-MLP-Sparse-Python-Data-Structures), this being way beyond what is possible with any state-of-the-art technique.
|
12
|
Software Metrics and tree-based machine learning algorithms for distinguishing and detecting similar structure design patterns. SN APPLIED SCIENCES 2020. [DOI: 10.1007/s42452-019-1815-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/25/2022] Open
|
13
|
Qiu L, Zou X. Scoring Functions for Protein-RNA Complex Structure Prediction: Advances, Applications, and Future Directions. COMMUNICATIONS IN INFORMATION AND SYSTEMS 2020; 20:1-22. [PMID: 33867869 DOI: 10.4310/cis.2020.v20.n1.a1] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/22/2022]
Abstract
Protein-RNA interaction is among the most essential biological events in living cells, being involved in protein synthesis, RNA processing and transport, DNA transcription, regulation of gene expression, and many other critical bio-molecular activities. A thorough understanding of this interaction is of paramount importance in the fundamental study of a variety of vital cellular processes and in therapeutic applications for a broad range of diseases. Experimental high-resolution 3D structure determination is the primary source of knowledge for protein-RNA complexes. However, due to technical limitations, the existing techniques for experimental structure determination cannot match the demand from fast-growing interest in academia and industry. This problem necessitates alternative high-throughput computational methods for protein-RNA complex structure prediction. Similar to the in silico methods used for protein-protein and protein-DNA interactions, a reliable prediction of protein-RNA complex structure requires a scoring function with commensurate discriminatory power. Derived from determined structures and intended to predict the to-be-determined structures, the scoring function is not only a predictive tool but also a gauge of our knowledge of protein-RNA interaction. In this review, we present an overview of the status of existing scoring functions and the scientific principles behind their construction, as well as their strengths and limitations. Finally, we discuss future directions of scoring function development for protein-RNA structure prediction.
Affiliation(s)
- Liming Qiu
- Dalton Cardiovascular Research Center, University of Missouri, Columbia, Missouri 65211
| | - Xiaoqin Zou
- Dalton Cardiovascular Research Center, University of Missouri, Columbia, Missouri 65211; Department of Physics & Astronomy, University of Missouri, Columbia, Missouri 65211; Department of Biochemistry, University of Missouri, Columbia, Missouri 65211; Informatics Institute, University of Missouri, Columbia, Missouri 65211
| |
|
14
|
Sun K, Tian P, Qi H, Ma F, Yang G. An Improved Normalized Mutual Information Variable Selection Algorithm for Neural Network-Based Soft Sensors. SENSORS 2019; 19:s19245368. [PMID: 31817459 PMCID: PMC6960561 DOI: 10.3390/s19245368] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 10/01/2019] [Revised: 11/24/2019] [Accepted: 12/02/2019] [Indexed: 11/28/2022]
Abstract
In this paper, normalized mutual information feature selection (NMIFS) and tabu search (TS) are integrated to develop a new variable selection algorithm for soft sensors. NMIFS is applied to select influential variables contributing to the output variable and avoids selecting redundant variables by calculating mutual information (MI). A TS based strategy is designed to prevent NMIFS from falling into a local optimal solution. The proposed algorithm performs the variable selection by combining the entropy information and MI and validating error information of artificial neural networks (ANNs); therefore, it has advantages over previous MI-based variable selection algorithms. Several simulation datasets with different scales, correlations and noise parameters are implemented to demonstrate the performance of the proposed algorithm. A set of actual production data from a power plant is also used to check the performance of these algorithms. The experiments showed that the developed variable selection algorithm presents better model accuracy with fewer selected variables, compared with other state-of-the-art methods. The application of this algorithm to soft sensors can achieve reliable results.
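The mutual-information quantity that this abstract relies on can be illustrated for discrete variables with a minimal sketch. It is not the authors' algorithm: the tabu search and neural-network validation steps are omitted. Normalising by the smaller of the two entropies is the convention usually associated with NMIFS and is treated here as an assumption. All function and variable names are hypothetical.

```python
import math
from collections import Counter

def entropy(xs):
    """Empirical Shannon entropy (bits) of a discrete sequence."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def normalized_mi(xs, ys):
    """Mutual information divided by min(H(X), H(Y)); 0.0 if either is constant."""
    denom = min(entropy(xs), entropy(ys))
    return mutual_information(xs, ys) / denom if denom > 0 else 0.0

# Toy discrete variables (made-up values)
x = [0, 0, 1, 1, 0, 0, 1, 1]
copy_of_x = list(x)                     # fully redundant with x
unrelated = [0, 1, 0, 1, 0, 1, 0, 1]    # empirically independent of x

redundant_score = normalized_mi(x, copy_of_x)
independent_score = normalized_mi(x, unrelated)
```

A selection procedure of this family would favour candidate variables with high mutual information with the output and low normalized mutual information with the variables already selected, so that a redundant copy such as `copy_of_x` is penalised.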
Affiliation(s)
- Kai Sun
- School of Electrical Engineering and Automation, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, China (F.M.)
- Correspondence: (K.S.); (G.Y.); Tel.: +86-15269190537 (K.S.); +86-13651869523 (G.Y.)
| | - Pengxin Tian
- School of Electrical Engineering and Automation, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, China (F.M.)
| | - Huanning Qi
- School of Electrical Engineering and Automation, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, China (F.M.)
| | - Fengying Ma
- School of Electrical Engineering and Automation, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, China (F.M.)
| | - Genke Yang
- Department of Automation, Shanghai Jiao Tong University, Shanghai 200240, China
- Ningbo Artificial Intelligence Institute, Shanghai Jiao Tong University, Ningbo 315000, China
- Correspondence: (K.S.); (G.Y.); Tel.: +86-15269190537 (K.S.); +86-13651869523 (G.Y.)
| |
|
15
|
Zhang F, Sun K, Wu X. A novel variable selection algorithm for multi-layer perceptron with elastic net. Neurocomputing 2019. [DOI: 10.1016/j.neucom.2019.04.091] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Indexed: 01/20/2023]
|
16
|
|
17
|
Hu H, Wang R, Yang X, Nie F. Scalable and Flexible Unsupervised Feature Selection. Neural Comput 2019; 31:517-537. [PMID: 30645178 DOI: 10.1162/neco_a_01163] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 11/04/2022]
Abstract
Recently, graph-based unsupervised feature selection algorithms (GUFS) have been shown to efficiently handle prevalent high-dimensional unlabeled data. One common drawback associated with existing graph-based approaches is that they tend to be time-consuming and in need of large storage, especially when faced with the increasing size of data. Research has started using anchors to accelerate graph-based learning model for feature selection, while the hard linear constraint between the data matrix and the lower-dimensional representation is usually overstrict in many applications. In this letter, we propose a flexible linearization model with anchor graph and ℓ2,1-norm regularization, which can deal with large-scale data sets and improve the performance of the existing anchor-based method. In addition, the anchor-based graph Laplacian is constructed to characterize the manifold embedding structure by means of a parameter-free adaptive neighbor assignment strategy. An efficient iterative algorithm is developed to address the optimization problem, and we also prove the convergence of the algorithm. Experiments on several public data sets demonstrate the effectiveness and efficiency of the method we propose.
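For reference, the norm used as the regularizer in this abstract is usually defined as the sum of the Euclidean norms of the rows of a matrix. The symbols below are assumptions, not taken from the paper: $W$ is a $d \times c$ projection matrix, and $w^{i}$ is its $i$-th row, corresponding to feature $i$.

```latex
\|W\|_{2,1} \;=\; \sum_{i=1}^{d} \sqrt{\sum_{j=1}^{c} W_{ij}^{2}} \;=\; \sum_{i=1}^{d} \|w^{i}\|_{2}
```

Penalizing this quantity pushes entire rows of $W$ toward zero, so features whose rows vanish can be discarded. This is why the norm is a common choice for feature selection.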
Affiliation(s)
- Haojie Hu
- Xi'an Research Institute of Hi-Tech, Xi'an 710025, China
| | - Rong Wang
- Center for Optical Imagery Analysis and Learning, Northwestern Polytechnical University, Xi'an 710072, China
| | - Xiaojun Yang
- School of Information Engineering, Guangdong University of Technology, Guangzhou 510006, China
| | - Feiping Nie
- Center for Optical Imagery Analysis and Learning, Northwestern Polytechnical University, Xi'an 710072, China
| |
|
18
|
Zhang R, Lv Q, Tao J, Gao F. Data Driven Modeling Using an Optimal Principle Component Analysis Based Neural Network and Its Application to a Nonlinear Coke Furnace. Ind Eng Chem Res 2018. [DOI: 10.1021/acs.iecr.8b00071] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Indexed: 11/29/2022]
Affiliation(s)
- Ridong Zhang
- The Belt and Road Information Research Institute, Automation College, Hangzhou Dianzi University, Hangzhou, 310018, P.R. China
| | - Qiang Lv
- The Belt and Road Information Research Institute, Automation College, Hangzhou Dianzi University, Hangzhou, 310018, P.R. China
| | - Jili Tao
- Ningbo Institute of Technology, Zhejiang University, Ningbo 315100, P.R. China
| | - Furong Gao
- Department of Chemical and Biological Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, P.R. China
| |
|
19
|
Luo M, Nie F, Chang X, Yang Y, Hauptmann AG, Zheng Q. Adaptive Unsupervised Feature Selection With Structure Regularization. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2018; 29:944-956. [PMID: 28141533 DOI: 10.1109/tnnls.2017.2650978] [Citation(s) in RCA: 92] [Impact Index Per Article: 15.3] [Indexed: 06/06/2023]
Abstract
Feature selection is one of the most important dimension reduction techniques because of its efficiency and interpretability. Since large-scale practical data are usually collected without labels, and labeling these data is dramatically expensive and time-consuming, unsupervised feature selection has become a ubiquitous and challenging problem. Without label information, the fundamental problem of unsupervised feature selection lies in how to characterize the geometric structure of the original feature space and produce a faithful feature subset that preserves the intrinsic structure accurately. In this paper, we characterize the intrinsic local structure by an adaptive reconstruction graph and simultaneously consider its multiconnected-components (multicluster) structure by imposing a rank constraint on the corresponding Laplacian matrix. To achieve a desirable feature subset, we learn the optimal reconstruction graph and selective matrix simultaneously, instead of using a predetermined graph. We exploit an efficient alternative optimization algorithm to solve the proposed challenging problem, together with theoretical analyses of its convergence and computational complexity. Finally, extensive experiments on clustering tasks are conducted over several benchmark data sets to verify the effectiveness and superiority of the proposed unsupervised feature selection algorithm.
|
20
|
Sun K, Huang SH, Wong DSH, Jang SS. Design and Application of a Variable Selection Method for Multilayer Perceptron Neural Network With LASSO. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2017; 28:1386-1396. [PMID: 28113826 DOI: 10.1109/tnnls.2016.2542866] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Indexed: 05/11/2023]
Abstract
In this paper, a novel variable selection method for neural networks that can be applied to describe nonlinear industrial processes is developed. The proposed method is an iterative two-step approach. First, a multilayer perceptron is constructed. Second, the least absolute shrinkage and selection operator is introduced to select the input variables that are truly essential to the model, with the shrinkage parameter determined using a cross-validation method. Then, variables whose input weights are zero are eliminated from the data set. The algorithm is repeated until there is no improvement in the model accuracy. Simulation examples as well as an industrial application in a crude distillation unit are used to validate the proposed algorithm. The results show that the proposed approach can be used to construct a more compressed model, which incorporates a higher level of prediction accuracy than other existing methods.
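The elimination rule in this abstract (drop the variables whose weights are shrunk to zero) can be illustrated with a deliberately simplified sketch. It swaps the multilayer perceptron for a plain linear model fitted by LASSO with cyclic coordinate descent, uses a fixed shrinkage parameter instead of cross-validation, and performs a single elimination pass. The function names, `lam`, and the toy data are assumptions for illustration only, not the authors' procedure.

```python
def soft_threshold(value, threshold):
    """Shrink `value` toward zero by `threshold`; exactly zero inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_iter=50):
    """Minimise (1/2n)*||y - Xw||^2 + lam*||w||_1 by cyclic coordinate descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_iter):
        for j in range(d):
            # correlation of feature j with the residual that ignores feature j
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * w[k] for k in range(d) if k != j))
                for i in range(n)
            ) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, lam) / z if z > 0 else 0.0
    return w

# Toy data: the target depends on feature 0 only; feature 1 is irrelevant
x0 = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]
x1 = [1.0, -1.0, 0.0, 0.0, -1.0, 1.0]
X = [[a, b] for a, b in zip(x0, x1)]
y = [3.0 * a for a in x0]

w = lasso_coordinate_descent(X, y, lam=0.1)
selected = [j for j, wj in enumerate(w) if wj != 0.0]  # variables that survive
```

In the iterative scheme the abstract describes, the model would then be refitted on the surviving variables and the pass repeated until accuracy stops improving.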
|
21
|
Paul S, Das S. Simultaneous feature selection and weighting – An evolutionary multi-objective optimization approach. Pattern Recognit Lett 2015. [DOI: 10.1016/j.patrec.2015.07.007] [Citation(s) in RCA: 44] [Impact Index Per Article: 4.9] [Indexed: 10/23/2022]
|
22
|
Kesharaju M, Nagarajah R. Feature selection for neural network based defect classification of ceramic components using high frequency ultrasound. ULTRASONICS 2015; 62:271-277. [PMID: 26081920 DOI: 10.1016/j.ultras.2015.05.027] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 04/01/2015] [Accepted: 05/30/2015] [Indexed: 06/04/2023]
Abstract
The motivation for this research stems from the need for a non-destructive testing method capable of detecting and locating defects and microstructural variations within armour ceramic components before they are issued to the soldiers who rely on them for their survival. An automated classification system based on ultrasonic inspection would make it possible to check each ceramic component and immediately alert the operator to the presence of defects. In many classification problems, the choice of features or dimensionality reduction is both significant and very difficult, as substantial computational effort is required to evaluate possible feature subsets. In this research, a combination of artificial neural networks and genetic algorithms is used to optimize the feature subset used in the classification of various defects in reaction-sintered silicon carbide ceramic components. Initially, wavelet-based feature extraction is performed on the region of interest. An Artificial Neural Network classifier is employed to evaluate the performance of these features. Genetic Algorithm based feature selection is then performed. Principal Component Analysis, a popular technique for feature selection, is compared with the genetic algorithm based technique in terms of classification accuracy and selection of the optimal number of features. The experimental results confirm that features identified by Principal Component Analysis lead to a higher classification percentage (96%) than those identified by the Genetic Algorithm (94%).
Affiliation(s)
- Manasa Kesharaju
- Swinburne University of Technology, Faculty of Engineering & Industrial Sciences, Melbourne, Victoria 3122, Australia; Defence Materials Technology Centre (DMTC LTD), Melbourne, Victoria 3122, Australia.
- Romesh Nagarajah
- Swinburne University of Technology, Faculty of Engineering & Industrial Sciences, Melbourne, Victoria 3122, Australia; Defence Materials Technology Centre (DMTC LTD), Melbourne, Victoria 3122, Australia
|
23
|
Fock E. Global sensitivity analysis approach for input selection and system identification purposes--a new framework for feedforward neural networks. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2014; 25:1484-1495. [PMID: 25050946 DOI: 10.1109/tnnls.2013.2294437] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Indexed: 06/03/2023]
Abstract
A new algorithm for the selection of input variables of a neural network is proposed. This new method, applied after the training stage, ranks the inputs according to their contribution to the variance of the model output. The use of a global sensitivity analysis technique, the extended Fourier amplitude sensitivity test, gives the total sensitivity index for each variable, which allows for the ranking and the removal of the less relevant inputs. Applied to some benchmark problems in the field of feature selection, the proposed approach shows good agreement in keeping the relevant variables. This new method is a useful tool for removing superfluous inputs and for system identification.
|
24
|
Parisien M, Wang X, Perdrizet G, Lamphear C, Fierke CA, Maheshwari KC, Wilde MJ, Sosnick TR, Pan T. Discovering RNA-protein interactome by using chemical context profiling of the RNA-protein interface. Cell Rep 2013; 3:1703-13. [PMID: 23665222 DOI: 10.1016/j.celrep.2013.04.010] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.9] [Received: 06/12/2012] [Revised: 03/04/2013] [Accepted: 04/12/2013] [Indexed: 02/04/2023] Open
Abstract
RNA-protein (RNP) interactions generally are required for RNA function. At least 5% of human genes code for RNA-binding proteins. Whereas many approaches can identify the RNA partners for a specific protein, finding the protein partners for a specific RNA is difficult. We present a machine-learning method that scores a protein's binding potential for an RNA structure by utilizing the chemical context profiles of the interface from known RNP structures. Our approach is applicable even when only a single RNP structure is available. We examined 801 mammalian proteins and found that 37 (4.6%) potentially bind transfer RNA (tRNA). Most are enzymes involved in cellular processes unrelated to translation and were not known to interact with RNA. We experimentally tested six positive and three negative predictions for tRNA binding in vivo, and all nine predictions were correct. Our computational approach provides a powerful complement to experiments in discovering new RNPs.
Affiliation(s)
- Marc Parisien
- Department of Biochemistry and Molecular Biology, University of Chicago, Chicago, IL 60637, USA
|
25
|
Multimodality GPU-based computer-assisted diagnosis of breast cancer using ultrasound and digital mammography images. Int J Comput Assist Radiol Surg 2013; 8:547-60. [DOI: 10.1007/s11548-013-0813-y] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.0] [Received: 10/26/2012] [Accepted: 01/08/2013] [Indexed: 02/04/2023]
|
26
|
Xiang S, Nie F, Meng G, Pan C, Zhang C. Discriminative least squares regression for multiclass classification and feature selection. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2012; 23:1738-54. [PMID: 24808069 DOI: 10.1109/tnnls.2012.2212721] [Citation(s) in RCA: 161] [Impact Index Per Article: 13.4] [Indexed: 05/27/2023]
Abstract
This paper presents a framework of discriminative least squares regression (LSR) for multiclass classification and feature selection. The core idea is to enlarge the distance between different classes under the conceptual framework of LSR. First, a technique called ε-dragging is introduced to force the regression targets of different classes to move in opposite directions, so that the distances between classes are enlarged. Then, the ε-draggings are integrated into the LSR model for multiclass classification. Our learning framework, referred to as discriminative LSR, has a compact model form, in which there is no need to train two-class machines that are independent of each other. With its compact form, this model can be naturally extended for feature selection. This goal is achieved by means of the L2,1 norm of a matrix, generating a sparse learning model for feature selection. The model for multiclass classification and its extension for feature selection are finally solved elegantly and efficiently. Experimental evaluation over a range of benchmark datasets indicates the validity of our method.
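The L2,1 norm mentioned in this abstract can be illustrated with a minimal sketch. It uses the standard definition (the sum of the Euclidean norms of a matrix's rows) and assumes, purely for illustration, that each row of a hypothetical weight matrix `W` corresponds to one feature; the function names are not taken from the paper, and the paper's own optimization procedure is not reproduced here.

```python
import numpy as np

def l21_norm(W):
    """L2,1 norm: sum of the Euclidean norms of the rows of W."""
    return float(np.sum(np.linalg.norm(W, axis=1)))

def rank_features(W):
    """Rank features (assumed to be rows of W) by row norm, largest first."""
    row_norms = np.linalg.norm(W, axis=1)
    return np.argsort(-row_norms)

# Hypothetical 3-feature, 2-class weight matrix; the second row is all zeros,
# which is the kind of row sparsity an L2,1 penalty encourages.
W = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [1.0, 0.0]])

print(l21_norm(W))       # row norms are 5, 0 and 1, so the sum is 6.0
print(rank_features(W))  # [0 2 1]
```

Because the penalty acts on whole rows, a feature whose row is driven to zero contributes to none of the classes and can be dropped.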
|
27
|
Vellido A, Romero E, Julià-Sapé M, Majós C, Moreno-Torres Á, Pujol J, Arús C. Robust discrimination of glioblastomas from metastatic brain tumors on the basis of single-voxel (1)H MRS. NMR IN BIOMEDICINE 2012; 25:819-828. [PMID: 22081447 DOI: 10.1002/nbm.1797] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.1] [Received: 04/04/2011] [Revised: 08/01/2011] [Accepted: 09/13/2011] [Indexed: 05/31/2023]
Abstract
This article investigates methods for the accurate and robust differentiation of metastases from glioblastomas on the basis of single-voxel (1)H MRS information. Single-voxel (1)H MR spectra from a total of 109 patients (78 glioblastomas and 31 metastases) from the multicenter, international INTERPRET database, plus a test set of 40 patients (30 glioblastomas and 10 metastases) from three different centers in the Barcelona (Spain) metropolitan area, were analyzed using a robust method for feature (spectral frequency) selection coupled with a linear-in-the-parameters single-layer perceptron classifier. For the test set, a parsimonious selection of five frequencies yielded an area under the receiver operating characteristic curve of 0.86, and an area under the convex hull of the receiver operating characteristic curve of 0.91. Moreover, these accurate results for the discrimination between glioblastomas and metastases were obtained using a small number of frequencies that are amenable to metabolic interpretation, which should ease their use as diagnostic markers. Importantly, the prediction can be expressed as a simple formula based on a linear combination of these frequencies. As a result, new cases could be straightforwardly predicted by integrating this formula into a computer-based medical decision support system. This work also shows that the combination of spectra acquired at different TEs (short TE, 20-32 ms; long TE, 135-144 ms) is key to the successful discrimination between glioblastomas and metastases from single-voxel (1)H MRS.
Affiliation(s)
- A Vellido
- Departamento de Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya, Barcelona, Spain.
|
28
|
Parisien M, Freed KF, Sosnick TR. On docking, scoring and assessing protein-DNA complexes in a rigid-body framework. PLoS One 2012; 7:e32647. [PMID: 22393431 PMCID: PMC3290582 DOI: 10.1371/journal.pone.0032647] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 11/24/2011] [Accepted: 01/28/2012] [Indexed: 01/20/2023] Open
Abstract
We consider the identification of interacting protein-nucleic acid partners using the rigid body docking method FTdock, which is systematic and exhaustive in the exploration of docking conformations. The accuracy of rigid body docking methods is tested using known protein-DNA complexes for which the docked and undocked structures are both available. Additional tests with large decoy sets probe the efficacy of two published statistically derived scoring functions that contain a huge number of parameters. In contrast, we demonstrate that state-of-the-art machine learning techniques can enormously reduce the number of parameters required, thereby identifying the relevant docking features using a minuscule fraction of the number of parameters in the prior works. The present machine learning study considers a 300-dimensional vector (dependent on only 15 parameters), termed the Chemical Context Profile (CCP), where each dimension reflects a specific type of protein amino acid-nucleic acid base interaction. The CCP is designed to capture the chemical complementarities of the interface and is well suited for machine learning techniques. Our objective function is the Chemical Context Discrepancy (CCD), which is defined as the angle between the native system's CCP vector and the decoy's vector and which serves as a substitute for the more commonly used root mean squared deviation (RMSD). We demonstrate that the CCP provides a useful scoring function when certain dimensions are properly weighted. Finally, we explore how the amino acids on a protein's surface can help guide DNA binding, first through long-range interactions, followed by direct contacts, according to specific preferences for either the major or minor grooves of the DNA.
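The abstract defines the Chemical Context Discrepancy as the angle between two CCP vectors. A minimal sketch of that angle computation follows; the function name, the use of degrees, and the toy 2-dimensional vectors are assumptions made for illustration (the CCP described above has 300 dimensions), and the dimension weighting mentioned in the abstract is not modeled.

```python
import numpy as np

def chemical_context_discrepancy(native_ccp, decoy_ccp):
    """Angle in degrees between two profile vectors (hypothetical helper).

    Both vectors must be non-zero; a zero vector has no defined direction.
    """
    native = np.asarray(native_ccp, dtype=float)
    decoy = np.asarray(decoy_ccp, dtype=float)
    cos_angle = np.dot(native, decoy) / (
        np.linalg.norm(native) * np.linalg.norm(decoy)
    )
    # Clip to guard against rounding slightly outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))

# Toy vectors: identical profiles give an angle of 0, orthogonal ones about 90.
same = chemical_context_discrepancy([1.0, 2.0], [2.0, 4.0])
orthogonal = chemical_context_discrepancy([1.0, 0.0], [0.0, 1.0])
print(same, orthogonal)
```

Because the angle depends only on direction, two profiles that differ by a constant scale factor have zero discrepancy under this measure.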
Affiliation(s)
- Marc Parisien
- Department of Biochemistry and Molecular Biology, University of Chicago, Chicago, Illinois, United States of America
- Karl F. Freed
- Department of Chemistry, University of Chicago, Chicago, Illinois, United States of America
- Computation Institute, University of Chicago, Chicago, Illinois, United States of America
- The James Franck Institute, University of Chicago, Chicago, Illinois, United States of America
- Tobin R. Sosnick
- Department of Biochemistry and Molecular Biology, University of Chicago, Chicago, Illinois, United States of America
- Computation Institute, University of Chicago, Chicago, Illinois, United States of America
- Institute for Biophysical Dynamics, University of Chicago, Chicago, Illinois, United States of America
|
29
|
Kabir MM, Shahjahan M, Murase K. A new local search based hybrid genetic algorithm for feature selection. Neurocomputing 2011. [DOI: 10.1016/j.neucom.2011.03.034] [Citation(s) in RCA: 106] [Impact Index Per Article: 8.2] [Indexed: 10/18/2022]
|
30
|
Kabir MM, Shahjahan M, Murase K. Ant Colony Optimization for Feature Selection Involving Effective Local Search. JOURNAL OF ADVANCED COMPUTATIONAL INTELLIGENCE AND INTELLIGENT INFORMATICS 2011. [DOI: 10.20965/jaciii.2011.p0671] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Abstract
This paper proposes an effective algorithm for feature selection (ACOFS) that uses a global Ant Colony Optimization algorithm (ACO) search strategy. To make ACO effective in feature selection, our proposed algorithm uses an effective local search in selecting significant features. The novelty of ACOFS lies in its effective balance between ant exploration and exploitation using new pheromone update and heuristic information computation rules to generate a subset of a smaller number of significant features. We evaluate algorithm performance using seven real-world benchmark classification datasets. Results show that ACOFS generates smaller subsets of significant features with improved classification accuracy.
|
31
|
Windeatt T, Duangsoithong R, Smith R. Embedded Feature Ranking for Ensemble MLP Classifiers. IEEE TRANSACTIONS ON NEURAL NETWORKS 2011; 22:988-94. [DOI: 10.1109/tnn.2011.2138158] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.8] [Indexed: 11/05/2022]
|
32
|
Monirul Kabir M, Monirul Islam M, Murase K. A new wrapper feature selection approach using neural network. Neurocomputing 2010. [DOI: 10.1016/j.neucom.2010.04.003] [Citation(s) in RCA: 76] [Impact Index Per Article: 5.4] [Indexed: 10/19/2022]
|
33
|
Xu Z, King I, Lyu MRT, Jin R. Discriminative semi-supervised feature selection via manifold regularization. IEEE TRANSACTIONS ON NEURAL NETWORKS 2010; 21:1033-47. [PMID: 20570772 DOI: 10.1109/tnn.2010.2047114] [Citation(s) in RCA: 239] [Impact Index Per Article: 17.1] [Indexed: 11/10/2022]
Abstract
Feature selection has attracted a huge amount of interest in both research and application communities of data mining. We consider the problem of semi-supervised feature selection, where we are given a small amount of labeled examples and a large amount of unlabeled examples. Since a small number of labeled samples are usually insufficient for identifying the relevant features, the critical problem arising from semi-supervised feature selection is how to take advantage of the information underneath the unlabeled data. To address this problem, we propose a novel discriminative semi-supervised feature selection method based on the idea of manifold regularization. The proposed approach selects features through maximizing the classification margin between different classes and simultaneously exploiting the geometry of the probability distribution that generates both labeled and unlabeled data. In comparison with previous semi-supervised feature selection algorithms, our proposed semi-supervised feature selection method is an embedded feature selection method and is able to find more discriminative features. We formulate the proposed feature selection method into a convex-concave optimization problem, where the saddle point corresponds to the optimal solution. To find the optimal solution, the level method, a fairly recent optimization method, is employed. We also present a theoretic proof of the convergence rate for the application of the level method to our problem. Empirical evaluation on several benchmark data sets demonstrates the effectiveness of the proposed semi-supervised feature selection method.
Affiliation(s)
- Zenglin Xu
- Cluster of Excellence, Saarland University, Max Planck Institute for Informatics, Saarbruecken 66123, Germany.
|
34
|
Banerjee AK, M S, M N, Murty US. Classification and clustering analysis of pyruvate dehydrogenase enzyme based on their physicochemical properties. Bioinformation 2010; 4:456-62. [PMID: 20975910 PMCID: PMC2951700 DOI: 10.6026/97320630004456] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 01/12/2010] [Revised: 03/02/2010] [Accepted: 04/09/2010] [Indexed: 11/23/2022] Open
Abstract
Biological systems are highly organized and enormously coordinated, maintaining great complexity. The growth of secondary data generation and the progress of modern mining techniques provide an opportunity to discover hidden intra- and inter-relations within these nonlinear datasets. This will help in understanding complex biological phenomena with greater efficiency. In this paper we report a comparative classification of Pyruvate Dehydrogenase protein sequences from bacterial sources based on 28 different physicochemical parameters (such as bulkiness, hydrophobicity, total positively and negatively charged residues, α helices, β strands, etc.) and 20 amino acid compositions. Logistic, MLP (Multi Layer Perceptron), SMO (Sequential Minimal Optimization), RBFN (Radial Basis Function Network) and SL (simple logistic) methods were compared in this study. MLP was found to be the best method, with a maximum average accuracy of 88.20%. The same dataset was clustered using a 2*2 grid of a two-dimensional SOM (Self Organizing Map). Clustering analysis revealed the proximity of the unannotated sequences to the Mycobacterium and Synechococcus genera.
Affiliation(s)
- Amit Kumar Banerjee
- Bioinformatics Group, Biology Division, Indian Institute of Chemical Technology, Hyderabad-500607, A.P, India
|
35
|
Arizmendi C, Romero E, Alquezar R, Caminal P, Díaz I, Benito S, Giraldo BF. Data mining of patients on weaning trials from mechanical ventilation using cluster analysis and neural networks. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2009; 2009:4343-4346. [PMID: 19963824 DOI: 10.1109/iembs.2009.5332742] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 05/28/2023]
Abstract
The process of weaning from mechanical ventilation is one of the challenges in intensive care. A total of 149 patients undergoing the extubation process (T-tube test) were studied: 88 patients with successful trials (group S), 38 patients who failed to maintain spontaneous breathing and were reconnected (group F), and 23 patients with a successful test who had to be reintubated within 48 hours (group R). Each patient was characterized using 8 time series and 6 statistics extracted from respiratory and cardiac signals. A moving-window statistical analysis was applied, yielding for each patient a sequence of patterns of 48 features. Applying a cluster analysis, two groups with the majority dataset were obtained. Neural networks were applied to discriminate between patients from groups S, F and R. The best performance obtained was 84.0% of patients correctly classified, using a linear perceptron trained with a feature selection procedure (that selected 19 of the 48 features) and taking as input the main cluster centroid. However, the classification baseline of 69.8% could not be improved when using the original set of patterns instead of the centroids to classify the patients.
Affiliation(s)
- Carlos Arizmendi
- Department of LSI, Technical University of Catalonia (UPC), C. Jordi Girona, 1-3, 08034, Barcelona, Spain.
|