1. Korycki Ł, Krawczyk B. Adversarial concept drift detection under poisoning attacks for robust data stream mining. Mach Learn 2022; 112:1-36. [PMID: 35668720 PMCID: PMC9162121 DOI: 10.1007/s10994-022-06177-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/19/2020] [Revised: 11/01/2021] [Accepted: 04/12/2022] [Indexed: 11/30/2022]
Abstract
Continuous learning from streaming data is among the most challenging topics in contemporary machine learning. In this domain, learning algorithms must not only be able to handle massive volumes of rapidly arriving data, but also adapt themselves to potential emerging changes. The evolving nature of data streams is known as concept drift. While there is a plethora of methods designed for detecting its occurrence, all of them assume that the drift is connected with underlying changes in the source of data. However, one must consider the possibility of a malicious injection of false data that simulates a concept drift. This adversarial setting assumes a poisoning attack that may be conducted in order to damage the underlying classification system by forcing an adaptation to false data. Existing drift detectors are not capable of differentiating between real and adversarial concept drift. In this paper, we propose a framework for robust concept drift detection in the presence of adversarial and poisoning attacks. We introduce a taxonomy for two types of adversarial concept drifts, as well as a robust trainable drift detector. It is based on the augmented restricted Boltzmann machine with improved gradient computation and energy function. We also introduce Relative Loss of Robustness, a novel measure for evaluating the performance of concept drift detectors under poisoning attacks. Extensive computational experiments, conducted on both fully and sparsely labeled data streams, prove the high robustness and efficacy of the proposed drift detection framework in adversarial scenarios.
Affiliation(s)
- Łukasz Korycki
  - Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA
- Bartosz Krawczyk
  - Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA
2. Han M, Chen Z, Li M, Wu H, Zhang X. A survey of active and passive concept drift handling methods. Comput Intell 2022. [DOI: 10.1111/coin.12520] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/30/2022]
Affiliation(s)
- Meng Han
  - School of Computer Science and Engineering, North Minzu University, Yinchuan, China
- Zhiqiang Chen
  - School of Computer Science and Engineering, North Minzu University, Yinchuan, China
- Muhang Li
  - School of Computer Science and Engineering, North Minzu University, Yinchuan, China
- Hongxin Wu
  - School of Computer Science and Engineering, North Minzu University, Yinchuan, China
- Xilong Zhang
  - School of Computer Science and Engineering, North Minzu University, Yinchuan, China
3. Disposition-Based Concept Drift Detection and Adaptation in Data Stream. Arabian Journal for Science and Engineering 2022. [DOI: 10.1007/s13369-022-06653-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/26/2022]
4. The impact of data difficulty factors on classification of imbalanced and concept drifting data streams. Knowl Inf Syst 2021. [DOI: 10.1007/s10115-021-01560-w] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 10/21/2022]
Abstract
Class imbalance introduces additional challenges when learning classifiers from concept drifting data streams. Most existing work focuses on designing new algorithms for dealing with the global imbalance ratio and does not consider other data complexities. Independent research on static imbalanced data has highlighted the influential role of local data difficulty factors such as minority class decomposition and presence of unsafe types of examples. Despite often being present in real-world data, the interactions between concept drifts and local data difficulty factors have not been investigated in concept drifting data streams yet. We thoroughly study the impact of such interactions on drifting imbalanced streams. For this purpose, we put forward a new categorization of concept drifts for class imbalanced problems. Through comprehensive experiments with synthetic and real data streams, we study the influence of concept drifts, global class imbalance, local data difficulty factors, and their combinations, on predictions of representative online classifiers. Experimental results reveal the high influence of new considered factors and their local drifts, as well as differences in existing classifiers’ reactions to such factors. Combinations of multiple factors are the most challenging for classifiers. Although existing classifiers are partially capable of coping with global class imbalance, new approaches are needed to address challenges posed by imbalanced data streams.
5. Liu A, Lu J, Zhang G. Diverse Instance-Weighting Ensemble Based on Region Drift Disagreement for Concept Drift Adaptation. IEEE Transactions on Neural Networks and Learning Systems 2021; 32:293-307. [PMID: 32217484 DOI: 10.1109/tnnls.2020.2978523] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Indexed: 06/10/2023]
Abstract
Concept drift refers to changes in the distribution of underlying data and is an inherent property of evolving data streams. Ensemble learning, with dynamic classifiers, has proved to be an efficient method of handling concept drift. However, the best way to create and maintain ensemble diversity with evolving streams is still a challenging problem. In contrast to estimating diversity via inputs, outputs, or classifier parameters, we propose a diversity measurement based on whether the ensemble members agree on the probability of a regional distribution change. In our method, estimations over regional distribution changes are used as instance weights. Constructing different region sets through different schemes will lead to different drift estimation results, thereby creating diversity. The classifiers that disagree the most are selected to maximize diversity. Accordingly, an instance-based ensemble learning algorithm, called the diverse instance-weighting ensemble (DiwE), is developed to address concept drift for data stream classification problems. Evaluations of various synthetic and real-world data stream benchmarks show the effectiveness and advantages of the proposed algorithm.
6. Stachl C, Pargent F, Hilbert S, Harari GM, Schoedel R, Vaid S, Gosling SD, Bühner M. Personality Research and Assessment in the Era of Machine Learning. European Journal of Personality 2020. [DOI: 10.1002/per.2257] [Citation(s) in RCA: 29] [Impact Index Per Article: 5.8] [Indexed: 11/08/2022]
Abstract
The increasing availability of high-dimensional, fine-grained data about human behaviour, gathered from mobile sensing studies and in the form of digital footprints, is poised to drastically alter the way personality psychologists perform research and undertake personality assessment. These new kinds and quantities of data raise important questions about how to analyse the data and interpret the results appropriately. Machine learning models are well suited to these kinds of data, allowing researchers to model highly complex relationships and to evaluate the generalizability and robustness of their results using resampling methods. The correct usage of machine learning models requires specialized methodological training that considers issues specific to this type of modelling. Here, we first provide a brief overview of past studies using machine learning in personality psychology. Second, we illustrate the main challenges that researchers face when building, interpreting, and validating machine learning models. Third, we discuss the evaluation of personality scales, derived using machine learning methods. Fourth, we highlight some key issues that arise from the use of latent variables in the modelling process. We conclude with an outlook on the future role of machine learning models in personality research and assessment.
Affiliation(s)
- Clemens Stachl
  - Department of Communication, Stanford University, CA, USA
  - Department of Psychology, Psychological Methods and Assessment, Ludwig-Maximilians-Universität München, Germany
- Florian Pargent
  - Department of Psychology, Psychological Methods and Assessment, Ludwig-Maximilians-Universität München, Germany
- Sven Hilbert
  - Faculty of Psychology, Educational Science and Sport Science, University of Regensburg, Germany
- Ramona Schoedel
  - Department of Psychology, Psychological Methods and Assessment, Ludwig-Maximilians-Universität München, Germany
- Sumer Vaid
  - Department of Communication, Stanford University, CA, USA
- Samuel D. Gosling
  - Department of Psychology, University of Texas at Austin, TX, USA
  - Melbourne School of Psychological Sciences, University of Melbourne, Australia
- Markus Bühner
  - Department of Psychology, Psychological Methods and Assessment, Ludwig-Maximilians-Universität München, Germany
9. Shan J, Zhang H, Liu W, Liu Q. Online Active Learning Ensemble Framework for Drifted Data Streams. IEEE Transactions on Neural Networks and Learning Systems 2019; 30:486-498. [PMID: 29994730 DOI: 10.1109/tnnls.2018.2844332] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Indexed: 06/08/2023]
Abstract
In practical applications, data stream classification faces significant challenges, such as high cost of labeling instances and potential concept drifting. We present a new online active learning ensemble framework for drifting data streams based on a hybrid labeling strategy that includes the following: 1) an ensemble classifier, which consists of a long-term stable classifier and multiple dynamic classifiers (a multilevel sliding window model is used to create and update the dynamic classifiers to effectively process both the gradual drift type and sudden drift type data stream) and 2) active learning, which takes a nonfixed labeling budget, supports on-demand request labeling, and adopts an uncertainty strategy and random strategy to label instances. The decision threshold of the uncertainty strategy is adjusted dynamically, i.e., when concept drift occurs, the threshold is gradually reduced to query the most uncertain instances in priority to reduce the request expense as much as possible. Experiments on synthetic and real data sets show that precise prediction accuracy can be obtained by the proposed method without increasing the total cost of labeling, and that the labeling cost can be dynamically allocated according to the concept drift.
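The dynamic threshold idea described in this abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the class name `UncertaintyQuerier`, the margin-based uncertainty score, the multiplicative decay, and all parameter values are assumptions made only for illustration.

```python
import random


class UncertaintyQuerier:
    """Hypothetical label-query rule combining an uncertainty strategy
    with a random strategy (illustrative only, not the published method)."""

    def __init__(self, threshold=0.3, decay=0.9, min_threshold=0.05,
                 random_rate=0.05, seed=0):
        self.threshold = threshold          # margin below which a label is requested
        self.decay = decay                  # assumed multiplicative reduction per drift signal
        self.min_threshold = min_threshold  # floor so the threshold never reaches zero
        self.random_rate = random_rate      # probability of a random label request
        self.rng = random.Random(seed)

    def on_drift(self):
        # Gradually reduce the threshold so that only the most
        # uncertain instances keep triggering label requests.
        self.threshold = max(self.min_threshold, self.threshold * self.decay)

    def should_query(self, class_probs):
        # Uncertainty strategy: a small gap between the two most
        # probable classes means the classifier is unsure.
        top_two = sorted(class_probs, reverse=True)[:2]
        margin = top_two[0] - top_two[1]
        if margin < self.threshold:
            return True
        # Random strategy: occasionally request a label regardless of confidence.
        return self.rng.random() < self.random_rate
```

In this sketch, each drift signal passed to `on_drift()` narrows the set of instances that qualify for a label request, while the random strategy still samples confident regions of the input space from time to time.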
11. Borchani H, Larrañaga P, Gama J, Bielza C. Mining multi-dimensional concept-drifting data streams using Bayesian network classifiers. Intell Data Anal 2016. [DOI: 10.3233/ida-160804] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Indexed: 11/15/2022]
Affiliation(s)
- Hanen Borchani
  - Computational Intelligence Group, Departamento de Inteligencia Artificial, Facultad de Informática, Universidad Politécnica de Madrid, Madrid, Spain
- Pedro Larrañaga
  - Computational Intelligence Group, Departamento de Inteligencia Artificial, Facultad de Informática, Universidad Politécnica de Madrid, Madrid, Spain
- João Gama
  - LIAAD-INESC Porto, Faculty of Economics, University of Porto, Porto, Portugal
- Concha Bielza
  - Computational Intelligence Group, Departamento de Inteligencia Artificial, Facultad de Informática, Universidad Politécnica de Madrid, Madrid, Spain
12. Li P, Wu X, Hu X, Wang H. Learning concept-drifting data streams with random ensemble decision trees. Neurocomputing 2015. [DOI: 10.1016/j.neucom.2015.04.024] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.2] [Indexed: 10/23/2022]
14. Evaluation methods and decision theory for classification of streaming data with temporal dependence. Mach Learn 2014. [DOI: 10.1007/s10994-014-5441-4] [Citation(s) in RCA: 76] [Impact Index Per Article: 6.9] [Indexed: 11/27/2022]
15. Pitfalls in Benchmarking Data Stream Classification and How to Avoid Them. Advanced Information Systems Engineering 2013. [DOI: 10.1007/978-3-642-40988-2_30] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.2] [Indexed: 12/24/2022]
16. Using a classifier pool in accuracy based tracking of recurring concepts in data stream classification. Evolving Systems 2012. [DOI: 10.1007/s12530-012-9064-3] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Indexed: 10/27/2022]
17. A semi-supervised dynamic version of Fuzzy K-Nearest Neighbours to monitor evolving systems. Evolving Systems 2010. [DOI: 10.1007/s12530-010-9001-2] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.7] [Indexed: 11/25/2022]
18. Pechenizkiy M, Bakker J, Žliobaitė I, Ivannikov A, Kärkkäinen T. Online mass flow prediction in CFB boilers with explicit detection of sudden concept drift. ACTA ACUST UNITED AC 2010. [DOI: 10.1145/1809400.1809423] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Indexed: 10/19/2022]
Abstract
Fuel feeding and inhomogeneity of fuel typically cause fluctuations in the circulating fluidized bed (CFB) process. If control systems fail to compensate for the fluctuations, the whole plant will suffer from dynamics that are reinforced by the closed-loop controls. This phenomenon reduces the efficiency and the lifetime of process components. In this paper we address the problem of online mass flow prediction, which is a part of control. In particular, we consider the problem of learning an accurate predictor with explicit detection of abrupt concept drift and noise handling mechanisms. We emphasize the importance of having domain knowledge concerning the considered case and constructing the ground truth for facilitating the quantitative evaluation of different approaches. We demonstrate the performance of change detection methods and show their effect on the accuracy of the online mass flow prediction with real datasets collected from the experimental laboratory-scale CFB boiler.
19. Online Mass Flow Prediction in CFB Boilers. Advances in Data Mining. Applications and Theoretical Aspects 2009. [DOI: 10.1007/978-3-642-03067-3_17] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 12/12/2022]