1
Perera-Lago J, Toscano-Duran V, Paluzo-Hidalgo E, Gonzalez-Diaz R, Gutiérrez-Naranjo MA, Rucco M. An in-depth analysis of data reduction methods for sustainable deep learning. Open Research Europe 2024; 4:101. [PMID: 39309190] [PMCID: PMC11413558] [DOI: 10.12688/openreseurope.17554.2]
Abstract
In recent years, deep learning has gained popularity for its ability to solve complex classification tasks. It provides increasingly better results thanks to the development of more accurate models, the availability of huge volumes of data and the improved computational capabilities of modern computers. However, these improvements in performance also bring efficiency problems, related to the storage of datasets and models, and to the waste of energy and time involved in both the training and inference processes. In this context, data reduction can help reduce energy consumption when training a deep learning model. In this paper, we present up to eight different methods to reduce the size of a tabular training dataset, and we develop a Python package to apply them. We also introduce a representativeness metric based on topology to measure the similarity between the reduced datasets and the full training dataset. Additionally, we develop a methodology to apply these data reduction methods to image datasets for object detection tasks. Finally, we experimentally compare how these data reduction methods affect the representativeness of the reduced dataset, the energy consumption and the predictive performance of the model.
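The authors' Python package and topology-based representativeness metric are described in the paper itself; the sketch below is only a rough, self-contained illustration of the general workflow (reduce a tabular training set, then compare training cost and predictive performance). Stratified random sampling, the toy dataset, the MLP model, and the 20% reduction ratio are assumptions made for the example, not the paper's setup.

```python
# Illustrative sketch only: stratified random sampling as one simple data
# reduction baseline, with a rough comparison of training time and accuracy.
import time
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Keep a stratified 20% subset of the training data (arbitrary ratio).
X_red, _, y_red, _ = train_test_split(
    X_train, y_train, train_size=0.2, stratify=y_train, random_state=0)

for name, (Xt, yt) in {"full": (X_train, y_train), "reduced": (X_red, y_red)}.items():
    clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=0)
    t0 = time.time()
    clf.fit(Xt, yt)
    print(f"{name}: {len(Xt)} samples, "
          f"train time {time.time() - t0:.2f}s, "
          f"test accuracy {clf.score(X_test, y_test):.3f}")
```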
Affiliation(s)
- Javier Perera-Lago
- Applied Mathematics I Department, University of Seville, Seville, Andalusia, Spain
- Victor Toscano-Duran
- Applied Mathematics I Department, University of Seville, Seville, Andalusia, Spain
- Eduardo Paluzo-Hidalgo
- Quantitative Methods Department, Loyola University of Andalusia, Dos Hermanas, Andalusia, Spain
- Rocio Gonzalez-Diaz
- Applied Mathematics I Department, University of Seville, Seville, Andalusia, Spain
- Matteo Rucco
- Applied Mathematics I Department, University of Seville, Seville, Andalusia, Spain
- Data Science Department, Biocentis, Milan, Lombardy, Italy
2
Yin R, Pan X, Zhang L, Yang J, Lu W. A Rule-based Deep Fuzzy System with Nonlinear Fuzzy Feature Transform for Data Classification. Inf Sci (N Y) 2023. [DOI: 10.1016/j.ins.2023.03.071]
3
ITL-IDS: Incremental Transfer Learning for Intrusion Detection Systems. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.109542]
4
Hoshino T, Kanoga S, Tsubaki M, Aoyama A. Comparing subject-to-subject transfer learning methods in surface electromyogram-based motion recognition with shallow and deep classifiers. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2021.12.081]
5
Kanoga S, Hoshino T, Asoh H. Subject-transfer framework with unlabeled data based on multiple distance measures for surface electromyogram pattern recognition. Biomed Signal Process Control 2022. [DOI: 10.1016/j.bspc.2022.103522]
6
Kanoga S, Hoshino T, Asoh H. Semi-supervised style transfer mapping-based framework for sEMG-based pattern recognition with 1- or 2-DoF forearm motions. Biomed Signal Process Control 2021. [DOI: 10.1016/j.bspc.2021.102817]
7
Sucholutsky I, Schonlau M. Optimal 1-NN prototypes for pathological geometries. PeerJ Comput Sci 2021; 7:e464. [PMID: 33954242] [PMCID: PMC8049135] [DOI: 10.7717/peerj-cs.464]
Abstract
Using prototype methods to reduce the size of training datasets can drastically reduce the computational cost of classification with instance-based learning algorithms like the k-Nearest Neighbour classifier. The number and distribution of prototypes required for the classifier to match its original performance is intimately related to the geometry of the training data. As a result, it is often difficult to find the optimal prototypes for a given dataset, and heuristic algorithms are used instead. However, we consider a particularly challenging setting where commonly used heuristic algorithms fail to find suitable prototypes and show that the optimal number of prototypes can instead be found analytically. We also propose an algorithm for finding nearly-optimal prototypes in this setting, and use it to empirically validate the theoretical results. Finally, we show that a parametric prototype generation method that normally cannot solve this pathological setting can actually find optimal prototypes when combined with the results of our theoretical analysis.
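As a minimal illustration of the prototype idea discussed above (and not the analytical construction or the heuristic algorithms studied in the paper), the following sketch replaces each class by its mean and compares 1-NN on the full training set against 1-NN on the resulting prototypes; the dataset choice is arbitrary.

```python
# Naive baseline: one mean prototype per class, classified with 1-NN.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

classes = np.unique(y_tr)
prototypes = np.array([X_tr[y_tr == c].mean(axis=0) for c in classes])

full_1nn = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
proto_1nn = KNeighborsClassifier(n_neighbors=1).fit(prototypes, classes)

print("full 1-NN accuracy :", full_1nn.score(X_te, y_te))
print("3-prototype accuracy:", proto_1nn.score(X_te, y_te))
```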
8
Bello M, Nápoles G, Vanhoof K, Bello R. On the generation of multi-label prototypes. Intell Data Anal 2020. [DOI: 10.3233/ida-200014]
Abstract
Data reduction techniques play a key role in instance-based classification to lower the amount of data to be processed. Prototype generation aims to obtain a reduced training set in order to obtain accurate results with less effort. This translates into a significant reduction in both algorithms’ spatial and temporal burden. This issue is particularly relevant in multi-label classification, which is a generalization of multiclass classification that allows objects to belong to several classes simultaneously. Although this field is quite active in terms of learning algorithms, there is a lack of data reduction methods. In this paper, we propose several prototype generation methods from multi-label datasets based on Granular Computing. The simulations show that these methods significantly reduce the number of examples to a set of prototypes without significantly affecting classifiers’ performance.
Affiliation(s)
- Marilyn Bello
- Computer Science Department, Universidad Central de Las Villas, Cuba
- Faculty of Business Economics, Hasselt University, Belgium
- Gonzalo Nápoles
- Faculty of Business Economics, Hasselt University, Belgium
- Department of Cognitive Science and Artificial Intelligence, Tilburg University, The Netherlands
- Koen Vanhoof
- Faculty of Business Economics, Hasselt University, Belgium
- Rafael Bello
- Computer Science Department, Universidad Central de Las Villas, Cuba
9
Gan H, Zhang J, Towsey M, Truskinger A, Stark D, van Rensburg BJ, Li Y, Roe P. Data selection in frog chorusing recognition with acoustic indices. Ecol Inform 2020. [DOI: 10.1016/j.ecoinf.2020.101160]
10
Li J, Qiu S, Shen YY, Liu CL, He H. Multisource Transfer Learning for Cross-Subject EEG Emotion Recognition. IEEE Transactions on Cybernetics 2020; 50:3281-3293. [PMID: 30932860] [DOI: 10.1109/tcyb.2019.2904052]
Abstract
Electroencephalogram (EEG) has been widely used in emotion recognition due to its high temporal resolution and reliability. Since individual differences in EEG are large, emotion recognition models cannot be shared across persons, and new labeled data must be collected to train personal models for new users. In some applications, we want to acquire models for new persons as quickly as possible and reduce the amount of labeled data required. To achieve this goal, we propose a multisource transfer learning method, where existing persons are sources and the new person is the target. The target data are divided into calibration sessions for training and subsequent sessions for testing. The first stage of the method is source selection, aimed at locating appropriate sources. The second is style transfer mapping, which reduces the EEG differences between the target and each source. We use few labeled data in the calibration sessions to conduct source selection and style transfer. Finally, we integrate the source models to recognize emotions in the subsequent sessions. The experimental results show that the three-category classification accuracy on the benchmark SEED dataset improves by 12.72% compared with the nontransfer method. Our method facilitates the fast deployment of emotion recognition models by reducing the reliance on labeled data, which has practical significance especially in fast-deployment scenarios.
11
Abstract
A large variety of issues influence the success of data mining on a given problem. Two primary and important issues are the representation and the quality of the dataset. Specifically, if much redundant, unrelated, noisy, or unreliable information is present, then knowledge discovery becomes a very difficult problem. It is well known that data preparation steps require significant processing time in machine learning tasks. It would be very useful if there were preprocessing algorithms with the same reliable and effective performance across all datasets, but this is impossible. To this end, we present the most well-known and widely used up-to-date algorithms for each step of data preprocessing in the framework of predictive data mining.
12
Kube R, Bianchi FM, Brunner D, LaBombard B. Outlier classification using autoencoders: Application for fluctuation driven flows in fusion plasmas. The Review of Scientific Instruments 2019; 90:013505. [PMID: 30709222] [DOI: 10.1063/1.5049519]
Abstract
Understanding the statistics of fluctuation driven flows in the boundary layer of magnetically confined plasmas is desired to accurately model the lifetime of the vacuum vessel components. Mirror Langmuir probes (MLPs) are a novel diagnostic that uniquely allow us to sample the plasma parameters on a time scale shorter than the characteristic time scale of their fluctuations. Sudden large-amplitude fluctuations in the plasma degrade the precision and accuracy of the plasma parameters reported by MLPs for cases in which the probe bias range is of insufficient amplitude. While some data samples can readily be classified as valid and invalid, we find that such a classification may be ambiguous for up to 40% of data sampled for the plasma parameters and bias voltages considered in this study. In this contribution, we employ an autoencoder (AE) to learn a low-dimensional representation of valid data samples. By definition, the coordinates in this space are the features that mostly characterize valid data. Ambiguous data samples are classified in this space using standard classifiers for vectorial data. In this way, we avoid defining complicated threshold rules to identify outliers, which require strong assumptions and introduce biases in the analysis. By removing the outliers that are identified in the latent low-dimensional space of the AE, we find that the average conductive and convective radial heat fluxes are between approximately 5% and 15% lower than when removing outliers identified by threshold values. For contributions to the radial heat flux due to triple correlations, the difference is up to 40%.
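A minimal sketch of the general recipe described in the abstract, assuming PyTorch and scikit-learn are available: train an autoencoder on samples known to be valid, then classify the remaining (ambiguous) samples in its latent space with an ordinary classifier. The network size, two-dimensional latent space, logistic regression, and the synthetic demo data are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn
from sklearn.linear_model import LogisticRegression

class AutoEncoder(nn.Module):
    def __init__(self, n_features, n_latent=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 16), nn.ReLU(),
                                     nn.Linear(16, n_latent))
        self.decoder = nn.Sequential(nn.Linear(n_latent, 16), nn.ReLU(),
                                     nn.Linear(16, n_features))
    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

def fit_autoencoder(x_valid, n_epochs=200, lr=1e-2):
    # Learn a low-dimensional representation from clearly valid samples only.
    model = AutoEncoder(x_valid.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(n_epochs):
        opt.zero_grad()
        recon, _ = model(x_valid)
        loss = nn.functional.mse_loss(recon, x_valid)
        loss.backward()
        opt.step()
    return model

def classify_ambiguous(model, x_lab, y_lab, x_ambiguous):
    # x_lab: labelled valid/invalid samples (tensor); y_lab: 0/1 numpy labels.
    with torch.no_grad():
        z_lab = model.encoder(x_lab).numpy()
        z_amb = model.encoder(x_ambiguous).numpy()
    clf = LogisticRegression().fit(z_lab, y_lab)
    return clf.predict(z_amb)

if __name__ == "__main__":
    torch.manual_seed(0)
    x_valid = torch.randn(200, 8)                       # pretend valid samples
    model = fit_autoencoder(x_valid)
    x_lab = torch.randn(40, 8)
    y_lab = (torch.rand(40) > 0.5).numpy().astype(int)  # toy labels
    print(classify_ambiguous(model, x_lab, y_lab, torch.randn(10, 8)))
```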
Affiliation(s)
- R Kube
- Department of Physics and Technology, UiT The Arctic University of Norway, N-9037 Tromsø, Norway
- F M Bianchi
- Department of Physics and Technology, UiT The Arctic University of Norway, N-9037 Tromsø, Norway
- D Brunner
- Commonwealth Fusion Systems, Cambridge, Massachusetts 02139, USA
- B LaBombard
- MIT Plasma Science and Fusion Center, Cambridge, Massachusetts 02139, USA
13
Kusunoki Y, Wakou C, Tatsumi K. Maximum-Margin Model for Nearest Prototype Classifiers. Journal of Advanced Computational Intelligence and Intelligent Informatics 2018. [DOI: 10.20965/jaciii.2018.p0565]
Abstract
In this paper, we study nearest prototype classifiers, which classify data instances into the classes to which their nearest prototypes belong. We propose a maximum-margin model for nearest prototype classifiers. To provide the margin, we define a class-wise discriminant function for instances as the negative of the distance to their nearest prototype of that class. Then, we define the margin as the minimum difference between the discriminant function value of an instance for the class it belongs to and its values for the other classes. The optimization problem corresponding to the maximum-margin model is a difference of convex functions (DC) program. It is solved using a DC algorithm, which is a k-means-like algorithm, i.e., the members and positions of prototypes are alternately optimized. Through a numerical study, we analyze the effects of hyperparameters of the maximum-margin model, especially considering the classification performance.
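For contrast with the maximum-margin formulation above, the sketch below implements only the plain k-means-style baseline it generalizes: prototype positions are obtained by per-class k-means (memberships and positions optimized alternately), and test points are assigned to the class of their nearest prototype. Three prototypes per class and the dataset are arbitrary assumptions; the DC program itself is not implemented here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Per-class k-means: prototype memberships and positions alternate internally.
protos, proto_labels = [], []
for c in np.unique(y_tr):
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_tr[y_tr == c])
    protos.append(km.cluster_centers_)
    proto_labels.append(np.full(3, c))
protos = np.vstack(protos)
proto_labels = np.concatenate(proto_labels)

# Classify each test point by its nearest prototype.
dists = ((X_te[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
y_pred = proto_labels[dists.argmin(axis=1)]
print("nearest-prototype accuracy:", accuracy_score(y_te, y_pred))
```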
14
Gorzalczany MB, Rudzinski F. Generalized Self-Organizing Maps for Automatic Determination of the Number of Clusters and Their Multiprototypes in Cluster Analysis. IEEE Transactions on Neural Networks and Learning Systems 2018; 29:2833-2845. [PMID: 28600264] [DOI: 10.1109/tnnls.2017.2704779]
Abstract
This paper presents a generalization of self-organizing maps with 1-D neighborhoods (neuron chains) that can be effectively applied to complex cluster analysis problems. The essence of the generalization consists in introducing mechanisms that allow the neuron chain, during learning, to disconnect into subchains, to reconnect some of the subchains again, and to dynamically regulate the overall number of neurons in the system. These features enable the network, working in a fully unsupervised way (i.e., using unlabeled data without a predefined number of clusters), to automatically generate collections of multiprototypes that are able to represent a broad range of clusters in data sets. First, the operation of the proposed approach is illustrated on some synthetic data sets. Then, this technique is tested using several real-life, complex, and multidimensional benchmark data sets available from the University of California at Irvine (UCI) Machine Learning repository and the Knowledge Extraction based on Evolutionary Learning data set repository. A sensitivity analysis of our approach to changes in control parameters and a comparative analysis with an alternative approach are also performed.
15
Olvera-López JA, Carrasco-Ochoa JA, Martínez-Trinidad JF. Accurate and fast prototype selection based on the notion of relevant and border prototypes. Journal of Intelligent & Fuzzy Systems 2018. [DOI: 10.3233/jifs-169478]
Affiliation(s)
- J. Ariel Carrasco-Ochoa
- Coordinación de Ciencias Computacionales, National Institute of Astrophysics, Optics and Electronics, Puebla, Mexico
- J. Francisco Martínez-Trinidad
- Coordinación de Ciencias Computacionales, National Institute of Astrophysics, Optics and Electronics, Puebla, Mexico
16
Liu C, Wang W, Wang M, Lv F, Konan M. An efficient instance selection algorithm to reconstruct training set for support vector machine. Knowl Based Syst 2017. [DOI: 10.1016/j.knosys.2016.10.031]
17
Gibert K, Sànchez-Marrè M, Izquierdo J. A survey on pre-processing techniques: Relevant issues in the context of environmental data mining. AI Commun 2016. [DOI: 10.3233/aic-160710]
Affiliation(s)
- Karina Gibert
- Knowledge Engineering and Machine Learning Group, Department of Statistics and Operation Research, Universitat Politècnica de Catalunya-BarcelonaTech, Barcelona, Catalonia, Spain
- Miquel Sànchez-Marrè
- Knowledge Engineering and Machine Learning Group, Computer Science Department, Universitat Politècnica de Catalunya-BarcelonaTech, Barcelona, Catalonia, Spain
18
19
Ragozini G, Palumbo F, D'Esposito MR. Archetypal analysis for data-driven prototype identification. Stat Anal Data Min 2016. [DOI: 10.1002/sam.11325]
Affiliation(s)
- G. Ragozini
- Department of Political Sciences, Università di Napoli Federico II, 80138 Napoli, Italy
- F. Palumbo
- Department of Political Sciences, Università di Napoli Federico II, 80138 Napoli, Italy
- M. R. D'Esposito
- Department of Economics and Statistics, Università di Salerno, 84084 Fisciano (SA), Italy
20
Al-Jarrah OY, Alhussein O, Yoo PD, Muhaidat S, Taha K, Kim K. Data Randomization and Cluster-Based Partitioning for Botnet Intrusion Detection. IEEE Transactions on Cybernetics 2016; 46:1796-806. [PMID: 26540724] [DOI: 10.1109/tcyb.2015.2490802]
Abstract
Botnets, which consist of remotely controlled compromised machines called bots, provide a distributed platform for several threats against cyber world entities and enterprises. Intrusion detection system (IDS) provides an efficient countermeasure against botnets. It continually monitors and analyzes network traffic for potential vulnerabilities and possible existence of active attacks. A payload-inspection-based IDS (PI-IDS) identifies active intrusion attempts by inspecting transmission control protocol and user datagram protocol packet's payload and comparing it with previously seen attacks signatures. However, the PI-IDS abilities to detect intrusions might be incapacitated by packet encryption. Traffic-based IDS (T-IDS) alleviates the shortcomings of PI-IDS, as it does not inspect packet payload; however, it analyzes packet header to identify intrusions. As the network's traffic grows rapidly, not only the detection-rate is critical, but also the efficiency and the scalability of IDS become more significant. In this paper, we propose a state-of-the-art T-IDS built on a novel randomized data partitioned learning model (RDPLM), relying on a compact network feature set and feature selection techniques, simplified subspacing and a multiple randomized meta-learning technique. The proposed model has achieved 99.984% accuracy and 21.38 s training time on a well-known benchmark botnet dataset. Experiment results demonstrate that the proposed methodology outperforms other well-known machine-learning models used in the same detection task, namely, sequential minimal optimization, deep neural network, C4.5, reduced error pruning tree, and randomTree.
21
Ashfaq RAR, He YL, Chen DG. Toward an efficient fuzziness based instance selection methodology for intrusion detection system. Int J Mach Learn Cyb 2016. [DOI: 10.1007/s13042-016-0557-4]
22
Li Y, Oommen BJ, Ngom A, Rueda L. Pattern classification using a new border identification paradigm: The nearest border technique. Neurocomputing 2015. [DOI: 10.1016/j.neucom.2015.01.030]
23
Rezaei M, Nezamabadi-pour H. Using gravitational search algorithm in prototype generation for nearest neighbor classification. Neurocomputing 2015. [DOI: 10.1016/j.neucom.2015.01.008]
24
Acilar AM, Arslan A. A novel approach for designing adaptive fuzzy classifiers based on the combination of an artificial immune network and a memetic algorithm. Inf Sci (N Y) 2014. [DOI: 10.1016/j.ins.2013.12.023]
25
Fazzolari M, Giglio B, Alcalá R, Marcelloni F, Herrera F. A study on the application of instance selection techniques in genetic fuzzy rule-based classification systems: Accuracy-complexity trade-off. Knowl Based Syst 2013. [DOI: 10.1016/j.knosys.2013.07.011]
26
Kheradpisheh SR, Behjati-Ardakani F, Ebrahimpour R. Combining classifiers using nearest decision prototypes. Appl Soft Comput 2013. [DOI: 10.1016/j.asoc.2013.07.028]
27
García S, Derrac J, Cano JR, Herrera F. Prototype selection for nearest neighbor classification: taxonomy and empirical study. IEEE Transactions on Pattern Analysis and Machine Intelligence 2012; 34:417-35. [PMID: 21768651] [DOI: 10.1109/tpami.2011.142]
Abstract
The nearest neighbor classifier is one of the most used and well-known techniques for performing recognition tasks. It has also proved to be one of the most useful algorithms in data mining, in spite of its simplicity. However, the nearest neighbor classifier suffers from several drawbacks, such as high storage requirements, low efficiency in classification response, and low noise tolerance. These weaknesses have been the subject of study for many researchers, and many solutions have been proposed. Among them, one of the most promising consists of reducing the data used for establishing a classification rule (training data) by selecting relevant prototypes. Many prototype selection methods exist in the literature and research in this area is still advancing. Different properties can be observed in their definitions, but no formal categorization has been established yet. This paper provides a survey of the prototype selection methods proposed in the literature from a theoretical and empirical point of view. From a theoretical point of view, we propose a taxonomy based on the main characteristics of prototype selection methods and analyze their advantages and drawbacks. Empirically, we conduct an experimental study involving different sizes of data sets to measure their performance in terms of accuracy, reduction capability, and runtime. The results obtained by all the methods studied have been verified by nonparametric statistical tests. Several remarks, guidelines, and recommendations are made for the use of prototype selection in nearest neighbor classification.
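One classic prototype selection method of the kind covered by this survey is Hart's condensed nearest neighbor rule; the following is a minimal sketch of it (the dataset and random ordering are arbitrary choices), not a reproduction of the survey's taxonomy or experimental setup.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier

def condensed_nearest_neighbor(X, y, random_state=0):
    rng = np.random.default_rng(random_state)
    order = rng.permutation(len(X))
    keep = [order[0]]                      # start with one arbitrary instance
    changed = True
    while changed:                         # repeat until a full pass adds nothing
        changed = False
        for i in order:
            nn = KNeighborsClassifier(n_neighbors=1).fit(X[keep], y[keep])
            if nn.predict(X[i:i + 1])[0] != y[i]:
                keep.append(i)             # keep instances the subset misclassifies
                changed = True
    return np.array(keep)

X, y = load_breast_cancer(return_X_y=True)
idx = condensed_nearest_neighbor(X, y)
print(f"kept {len(idx)} of {len(X)} training instances")
```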
28
Luengo J, Herrera F. Shared domains of competence of approximate learning models using measures of separability of classes. Inf Sci (N Y) 2012. [DOI: 10.1016/j.ins.2011.09.022]
29
On the choice of the best imputation methods for missing values considering three groups of classification methods. Knowl Inf Syst 2011. [DOI: 10.1007/s10115-011-0424-2]
30
Li Y, Maguire L. Selecting critical patterns based on local geometrical and statistical information. IEEE Transactions on Pattern Analysis and Machine Intelligence 2011; 33:1189-1201. [PMID: 21493967] [DOI: 10.1109/tpami.2010.188]
Abstract
Pattern selection methods have been traditionally developed with a dependency on a specific classifier. In contrast, this paper presents a method that selects critical patterns deemed to carry essential information applicable to train those types of classifiers which require spatial information of the training data set. Critical patterns include those edge patterns that define the boundary and those border patterns that separate classes. The proposed method selects patterns from a new perspective, primarily based on their location in input space. It determines class edge patterns with the assistance of the approximated tangent hyperplane of a class surface. It also identifies border patterns between classes using local probability. The proposed method is evaluated on benchmark problems using popular classifiers, including multilayer perceptrons, radial basis functions, support vector machines, and nearest neighbors. The proposed approach is also compared with four state-of-the-art approaches and it is shown to provide similar but more consistent accuracy from a reduced data set. Experimental results demonstrate that it selects patterns sufficient to represent class boundary and to preserve the decision surface.
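As a crude stand-in for the border-pattern idea (not the paper's tangent-hyperplane or local-probability criteria), the sketch below keeps only training points whose k nearest neighbors contain at least one point of another class; k = 5 and the synthetic dataset are assumptions.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors

X, y = make_moons(n_samples=400, noise=0.25, random_state=0)

k = 5
nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1: each point is its own neighbor
_, idx = nbrs.kneighbors(X)
is_border = np.array([(y[nbr[1:]] != y[i]).any() for i, nbr in enumerate(idx)])

X_border, y_border = X[is_border], y[is_border]
print(f"kept {is_border.sum()} border points out of {len(X)}")

# The reduced set still supports a reasonable decision boundary.
clf = KNeighborsClassifier(n_neighbors=3).fit(X_border, y_border)
print("accuracy on all points:", clf.score(X, y))
```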
Affiliation(s)
- Yuhua Li
- School of Computing and Intelligent Systems, University of Ulster, Londonderry BT48 7JL, UK
31
Somorjai RL, Dolenko B, Nikulin A, Roberson W, Thiessen N. Class proximity measures--dissimilarity-based classification and display of high-dimensional data. J Biomed Inform 2011; 44:775-88. [PMID: 21545844] [DOI: 10.1016/j.jbi.2011.04.004]
Abstract
For two-class problems, we introduce and construct mappings of high-dimensional instances into dissimilarity (distance)-based Class-Proximity Planes. The Class Proximity Projections are extensions of our earlier relative distance plane mapping, and thus provide a more general and unified approach to the simultaneous classification and visualization of many-feature datasets. The mappings display all L-dimensional instances in two-dimensional coordinate systems, whose two axes represent the two distances of the instances to various pre-defined proximity measures of the two classes. The Class Proximity mappings provide a variety of different perspectives of the dataset to be classified and visualized. We report and compare the classification and visualization results obtained with various Class Proximity Projections and their combinations on four datasets from the UCI data base, as well as on a particular high-dimensional biomedical dataset.
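A minimal version of the mapping idea, using class centroids as the proximity measure (the paper considers a range of dissimilarity-based measures): each instance is represented by its two distances to the per-class references and then classified in that two-dimensional plane.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

centroids = np.array([X_tr[y_tr == c].mean(axis=0) for c in (0, 1)])

def to_proximity_plane(X):
    # Each row becomes (distance to class-0 centroid, distance to class-1 centroid).
    return np.stack([np.linalg.norm(X - centroids[c], axis=1) for c in (0, 1)], axis=1)

clf = LogisticRegression().fit(to_proximity_plane(X_tr), y_tr)
print("accuracy in the 2-D proximity plane:", clf.score(to_proximity_plane(X_te), y_te))
```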
Affiliation(s)
- R L Somorjai
- Institute for Biodiagnostics, National Research Council Canada, 435 Ellice Avenue, Winnipeg, MB R3B1Y6, Canada
32
33
Stavrakoudis DG, Theocharis JB, Zalidis GC. A multistage genetic fuzzy classifier for land cover classification from satellite imagery. Soft Comput 2010. [DOI: 10.1007/s00500-010-0666-z]
34
Pérez-Godoy MD, Fernández A, Rivera AJ, del Jesus MJ. Analysis of an evolutionary RBFN design algorithm, CO2RBFN, for imbalanced data sets. Pattern Recognit Lett 2010. [DOI: 10.1016/j.patrec.2010.07.010]
35
Olvera-López JA, Carrasco-Ochoa JA, Martínez-Trinidad JF, Kittler J. A review of instance selection methods. Artif Intell Rev 2010. [DOI: 10.1007/s10462-010-9165-y]
36
Derrac J, García S, Herrera F. A Survey on Evolutionary Instance Selection and Generation. International Journal of Applied Metaheuristic Computing 2010. [DOI: 10.4018/jamc.2010102604]
Abstract
The use of Evolutionary Algorithms to perform data reduction tasks has become an effective approach to improve the performance of data mining algorithms. Many proposals in the literature have shown that Evolutionary Algorithms obtain excellent results in their application as Instance Selection and Instance Generation procedures. The purpose of this paper is to present a survey on the application of Evolutionary Algorithms to the Instance Selection and Generation processes. It covers approaches applied to the enhancement of the nearest neighbor rule, as well as approaches focused on improving the models extracted by some well-known data mining algorithms. Furthermore, some proposals developed to tackle two emerging problems in data mining, Scaling Up and Imbalanced Data Sets, are also reviewed.
37
38
39
40
Shen F, Hasegawa O. A fast nearest neighbor classifier based on self-organizing incremental neural network. Neural Netw 2008; 21:1537-47. [PMID: 18678468] [DOI: 10.1016/j.neunet.2008.07.001]
Abstract
A fast prototype-based nearest neighbor classifier is introduced. The proposed Adjusted SOINN Classifier (ASC) is based on SOINN (self-organizing incremental neural network); it automatically learns the number of prototypes needed to determine the decision boundary, and it learns new information without destroying previously learned information. It is robust to noisy training data, and it achieves very fast classification. In the experiments, we use artificial and real-world datasets to illustrate ASC. We also compare ASC with other prototype-based classifiers with regard to classification error, compression ratio, and speed-up ratio. The results show that ASC has the best performance and is a very efficient classifier.
Affiliation(s)
- Furao Shen
- The State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, PR China
41
Kim SW, Oommen BJ. On Using Prototype Reduction Schemes to Optimize Kernel-Based Fisher Discriminant Analysis. IEEE Transactions on Systems, Man, and Cybernetics, Part B 2008; 38:564-70. [DOI: 10.1109/tsmcb.2007.914446]
42
Fernandez F, Isasi P. Local Feature Weighting in Nearest Prototype Classification. IEEE Transactions on Neural Networks 2008; 19:40-53. [DOI: 10.1109/tnn.2007.902955]
43
Olvera-López JA, Carrasco-Ochoa JA, Martínez-Trinidad JF. Prototype Selection Via Prototype Relevance. 2008. [DOI: 10.1007/978-3-540-85920-8_19]
44
Ros F, Guillaume S, Pintore M, Chrétien JR. Hybrid genetic algorithm for dual selection. Pattern Anal Appl 2007. [DOI: 10.1007/s10044-007-0089-3]
45
Abstract
Data reduction algorithms determine a small data subset from a given large data set. In this article, new types of data reduction criteria, based on the concept of entropy, are first presented. These criteria can evaluate the data reduction performance in a sophisticated and comprehensive way. As a result, new data reduction procedures are developed. Using the newly introduced criteria, the proposed data reduction scheme is shown to be efficient and effective. In addition, an outlier-filtering strategy, which is computationally insignificant, is developed. In some instances, this strategy can substantially improve the performance of supervised data analysis. The proposed procedures are compared with related techniques in two types of application: density estimation and classification. Extensive comparative results are included to corroborate the contributions of the proposed algorithms.
Affiliation(s)
- D Huang
- Department of Electronic Engineering, City University of Hong Kong, Kowloon, Hong Kong
46
Veenman CJ, Reinders MJT. The nearest subclass classifier: a compromise between the nearest mean and nearest neighbor classifier. IEEE Transactions on Pattern Analysis and Machine Intelligence 2005; 27:1417-29. [PMID: 16173185] [DOI: 10.1109/tpami.2005.187]
Abstract
We present the Nearest Subclass Classifier (NSC), which is a classification algorithm that unifies the flexibility of the nearest neighbor classifier with the robustness of the nearest mean classifier. The algorithm is based on the Maximum Variance Cluster algorithm and, as such, it belongs to the class of prototype-based classifiers. The variance constraint parameter of the cluster algorithm serves to regularize the classifier, that is, to prevent overfitting. With a low variance constraint value, the classifier turns into the nearest neighbor classifier and, with a high variance parameter, it becomes the nearest mean classifier with the respective properties. In other words, the number of prototypes ranges from the whole training set to only one per class. In the experiments, we compared the NSC with regard to its performance and data set compression ratio to several other prototype-based methods. On several data sets, the NSC performed similarly to the k-nearest neighbor classifier, which is a well-established classifier in many domains. Also concerning storage requirements and classification speed, the NSC has favorable properties, so it gives a good compromise between classification performance and efficiency.
Affiliation(s)
- Cor J Veenman
- Department of Mediamatics, Delft University of Technology, P.O. Box 5031, 2600 GA Delft, The Netherlands
47
Oommen BJ. On using prototype reduction schemes and classifier fusion strategies to optimize kernel-based nonlinear subspace methods. IEEE Transactions on Pattern Analysis and Machine Intelligence 2005; 27:455-460. [PMID: 15747799] [DOI: 10.1109/tpami.2005.60]
Abstract
In Kernel-based Nonlinear Subspace (KNS) methods, the length of the projections onto the principal component directions in the feature space is computed using a kernel matrix, K, whose dimension is equivalent to the number of sample data points. Clearly, this is problematic, especially for large data sets. In this paper, we solve this problem by subdividing the data into smaller subsets and utilizing a Prototype Reduction Scheme (PRS) as a preprocessing module to yield more refined representative prototypes. Thereafter, a Classifier Fusion Strategy (CFS) is invoked as a postprocessing module to combine the individual KNS classification results into a consensus decision. Essentially, the PRS is used to yield a computational advantage, and the CFS, in turn, is used to compensate for the decreased efficiency caused by the data set division. Our experimental results demonstrate that the proposed mechanism significantly reduces the prototype extraction time as well as the computation time without sacrificing classification accuracy. The results especially demonstrate a significant computational advantage for large data sets within a parallel processing philosophy.
48
49
Kim SW, Oommen BJ. Enhancing Prototype Reduction Schemes With Recursion: A Method Applicable for “Large” Data Sets. IEEE Transactions on Systems, Man, and Cybernetics, Part B 2004; 34:1384-97. [PMID: 15484911] [DOI: 10.1109/tsmcb.2004.824524]
Abstract
Most of the prototype reduction schemes (PRS) reported in the literature process the data in its entirety to yield a subset of prototypes that are useful in nearest-neighbor-like classification. Foremost among these are the prototypes for nearest neighbor classifiers, the vector quantization technique, and the support vector machines. These methods suffer from a major disadvantage, namely, the excessive computational burden of processing all the data. In this paper, we suggest a recursive and computationally superior mechanism referred to as adaptive recursive partitioning (ARP_PRS). Rather than process all the data using a PRS, we propose that the data be recursively subdivided into smaller subsets. This recursive subdivision can be arbitrary and need not utilize any underlying clustering philosophy. The advantage of ARP_PRS is that the PRS processes subsets of data points that effectively sample the entire space to yield smaller subsets of prototypes. These prototypes are then, in turn, gathered and processed by the PRS to yield more refined prototypes. In this manner, prototypes that are in the interior of the Voronoi spaces, and thus ineffective in the classification, are eliminated at subsequent invocations of the PRS. We are unaware of any PRS that employs such a recursive philosophy. Although we marginally forfeit accuracy in return for computational efficiency, our experimental results demonstrate that the proposed recursive mechanism yields classification comparable to the best prototype condensation schemes reported to date. Indeed, this is true both for artificial data sets and for real-life data sets. The results especially demonstrate that a fair computational advantage can be obtained by using such a recursive strategy for "large" data sets, such as those involved in data mining and text categorization applications.
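A toy sketch of the recursive philosophy described above, with a simple per-class k-means standing in for the prototype reduction scheme (the paper plugs in established condensation methods instead): the data are partitioned arbitrarily, each partition is reduced to prototypes, and the pooled prototypes are reduced once more.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

def reduce_by_class_kmeans(X, y, per_class=5, seed=0):
    """Plug-in prototype reduction scheme: k prototypes per class."""
    P, labels = [], []
    for c in np.unique(y):
        Xc = X[y == c]
        k = min(per_class, len(Xc))
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(Xc)
        P.append(km.cluster_centers_)
        labels.append(np.full(k, c))
    return np.vstack(P), np.concatenate(labels)

def recursive_reduce(X, y, n_parts=4, per_class=5, seed=0):
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(len(X)), n_parts)  # arbitrary partition
    pooled_X, pooled_y = [], []
    for idx in parts:
        Px, Py = reduce_by_class_kmeans(X[idx], y[idx], per_class, seed)
        pooled_X.append(Px)
        pooled_y.append(Py)
    # Second pass over the pooled prototypes yields the final, smaller set.
    return reduce_by_class_kmeans(np.vstack(pooled_X), np.concatenate(pooled_y),
                                  per_class, seed)

X, y = load_digits(return_X_y=True)
P, Py = recursive_reduce(X, y)
print(f"{len(X)} training points reduced to {len(P)} prototypes")
```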
50