1
Wang J, Wang W, Liang X. Finite-horizon optimal secure tracking control under denial-of-service attacks. ISA Transactions 2024; 149:44-53. [PMID: 38692974] [DOI: 10.1016/j.isatra.2024.04.025]
Abstract
The finite-horizon optimal secure tracking control (FHOSTC) problem for cyber-physical systems under actuator denial-of-service (DoS) attacks is addressed in this paper. A model-free method based on the Q-function is designed to achieve FHOSTC without system model information. First, an augmented time-varying Riccati equation (TVRE) is derived by integrating the system and the reference system into a unified augmented system. Second, a lower bound on the probability of malicious DoS attacks that guarantees the solvability of the TVRE is provided. Third, a time-varying Q-function (TVQF) is devised, and a TVQF-based method is proposed to solve the TVRE without knowledge of the augmented system dynamics. The developed method works backward in time and uses least squares. Finally, simulation studies are conducted to validate the performance and features of the developed method.
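The backward-in-time sweep that the TVQF method emulates can be illustrated with the model-based finite-horizon Riccati recursion it replaces. The sketch below assumes known matrices A and B (exactly the knowledge the paper's model-free method avoids needing); the function name and weight matrices Q, R, QN are illustrative.

```python
import numpy as np

def backward_riccati(A, B, Q, R, QN, N):
    """Backward-in-time recursion for the finite-horizon Riccati equation
    P_k = Q + A'P_{k+1}A - A'P_{k+1}B (R + B'P_{k+1}B)^{-1} B'P_{k+1}A.
    The paper solves an augmented, time-varying version of this without a
    model via a time-varying Q-function; here A, B are assumed known purely
    to illustrate the backward sweep."""
    P = [None] * (N + 1)
    K = [None] * N
    P[N] = QN
    for k in range(N - 1, -1, -1):
        BtP = B.T @ P[k + 1]
        K[k] = np.linalg.solve(R + BtP @ B, BtP @ A)   # time-varying feedback gain
        P[k] = Q + A.T @ P[k + 1] @ (A - B @ K[k])     # Riccati step
    return P, K
```

For a scalar system with A = B = Q = R = 1 and horizon 2, the sweep gives P = [1.6, 1.5, 1].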
Affiliation(s)
- Jian Wang
- Key Laboratory of Marine Intelligent Equipment and System, Ministry of Education, Shanghai Jiao Tong University, Shanghai, 200240, PR China
- Wei Wang
- School of Information Engineering, Zhongnan University of Economics and Law, Wuhan 430073, PR China; School of Electrical and Information Engineering, Tianjin University, Tianjin, 300072, PR China.
- Xiaofeng Liang
- Key Laboratory of Marine Intelligent Equipment and System, Ministry of Education, Shanghai Jiao Tong University, Shanghai, 200240, PR China
2
Yuan Y, Xu X, Yang C, Luo B, Dubljevic S. Concurrent Learning Robust Adaptive Fault Tolerant Boundary Regulation of Hyperbolic Distributed Parameter Systems. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:6286-6300. [PMID: 36449581] [DOI: 10.1109/tnnls.2022.3224245]
Abstract
This article develops a robust adaptive boundary output regulation approach for a class of complex anticollocated hyperbolic partial differential equations subjected to multiplicative unknown faults in both the boundary sensor and actuator. The regulator design is based on the internal model principle, which amounts to stabilizing a coupled cascade system consisting of a finite-dimensional internal model driven by a hyperbolic distributed parameter system (DPS). To this end, a systematic sliding-mode method equipped with a backstepping approach is developed so that robust state feedback control can be realized. Moreover, since the only available information is a faulty boundary measurement at the right boundary point, state estimation is required. Due to the presence of unknown boundary faults, a joint fault-state estimation problem must be solved; restrictive persistent excitation conditions are usually required to guarantee exact fault estimation but are unrealistic in practice. To this end, a novel concurrent learning (CL) adaptive observer is proposed so that exponential convergence is obtained. This is the first time the spirit of CL has been introduced to the field of DPSs. Consequently, an observer-based adaptive boundary fault-tolerant control scheme is developed, and rigorous theoretical analysis shows that exponential output regulation can be achieved. Finally, the effectiveness of the proposed methodology is demonstrated via comparative simulations.
3
Jiang X, Huang M, Shi H, Wang X, Zhang Y. Off-policy two-dimensional reinforcement learning for optimal tracking control of batch processes with network-induced dropout and disturbances. ISA Transactions 2024; 144:228-244. [PMID: 38030447] [DOI: 10.1016/j.isatra.2023.11.011]
Abstract
In this paper, a new off-policy two-dimensional (2D) reinforcement learning approach is proposed to deal with the optimal tracking control (OTC) problem of batch processes with network-induced dropout and disturbances. A dropout 2D augmented Smith predictor is first devised to estimate the present extended state using past data along the time and batch directions. The dropout 2D value function and Q-function are then defined, and their relation is analyzed to meet the optimal performance. On this basis, the dropout 2D Bellman equation is derived according to the principle of the Q-function. To address the dropout 2D OTC problem of batch processes, two algorithms, i.e., an off-line 2D policy iteration algorithm and an off-policy 2D Q-learning algorithm, are presented. The latter uses only the input and the estimated state, not the underlying system information. Meanwhile, the unbiasedness of the solutions and the convergence are analyzed separately. The effectiveness of the proposed methodologies is finally validated through a simulated case of a filling process.
Affiliation(s)
- Xueying Jiang
- College of Information Science and Engineering, Northeastern University, China; State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University, China
- Min Huang
- College of Information Science and Engineering, Northeastern University, China; State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University, China.
- Huiyuan Shi
- State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University, China; School of Information and Control Engineering, Liaoning Petrochemical University, China
- Xingwei Wang
- College of Computer Science and Engineering, Northeastern University, China
- Yanfeng Zhang
- College of Computer Science and Engineering, Northeastern University, China
4
Cheng Y, Huang L, Chen CLP, Wang X. Robust Actor-Critic With Relative Entropy Regulating Actor. IEEE Transactions on Neural Networks and Learning Systems 2023; 34:9054-9063. [PMID: 35286268] [DOI: 10.1109/tnnls.2022.3155483]
Abstract
The accurate estimation of the Q-function and the enhancement of the agent's exploration ability have always been challenges for off-policy actor-critic algorithms. To address these two concerns, a novel robust actor-critic (RAC) is developed in this article. We first derive a robust policy improvement mechanism (RPIM) by using the local optimal policy of the current estimated Q-function to guide policy improvement. By constraining the relative entropy between the new policy and the previous one during policy improvement, the proposed RPIM enhances the stability of the policy update process. The theoretical analysis shows that an incentive to increase the policy entropy is endowed when the policy is updated, which is conducive to enhancing the exploration ability of agents. RAC is then developed by applying the proposed RPIM to regulate the actor improvement process, and is proven to be convergent. Finally, the proposed RAC is evaluated on continuous-action control tasks on the MuJoCo platform, and the experimental results show that RAC outperforms several state-of-the-art reinforcement learning algorithms.
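Constraining the relative entropy between successive policies admits a well-known closed form in the discrete-action case. The sketch below shows that generic KL-regularized improvement step, not RAC's actual actor update; the function name and the temperature `eta` are illustrative.

```python
import numpy as np

def kl_regularized_improvement(q_values, old_policy, eta=1.0):
    """Closed-form solution of  max_pi <pi, Q> - eta * KL(pi || old_pi)
    over a discrete action space: pi proportional to old_pi * exp(Q / eta).
    A generic sketch of relative-entropy-regulated policy improvement;
    the actor update in the cited RAC paper is derived differently."""
    logits = np.log(old_policy) + q_values / eta
    logits -= logits.max()          # shift for numerical stability
    pi = np.exp(logits)
    return pi / pi.sum()            # normalized new policy
```

As `eta` grows, the update stays close to the old policy (a conservative step); as `eta` shrinks, it approaches a greedy argmax over Q.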
5
Mu C, Peng J, Sun C. Hierarchical Multiagent Formation Control Scheme via Actor-Critic Learning. IEEE Transactions on Neural Networks and Learning Systems 2023; 34:8764-8777. [PMID: 35302940] [DOI: 10.1109/tnnls.2022.3153028]
Abstract
This article presents a nearly optimal solution to the cooperative formation control problem for large-scale multiagent systems (MASs). First, although the multigroup technique is widely used to decompose large-scale problems, it provides no consensus between different subgroups. Inspired by the hierarchical structure applied in MASs, a hierarchical leader-following formation control structure with the multigroup technique is constructed, in which two layers and three types of agents are designed. Second, the adaptive dynamic programming technique is applied to the optimal formation control problem through the establishment of a performance index function. Based on the traditional generalized policy iteration (PI) algorithm, a multistep generalized policy iteration (MsGPI) algorithm is developed with a modified policy evaluation step. The novel algorithm not only inherits the high convergence speed and low computational complexity of the generalized PI algorithm but also further accelerates convergence and reduces run time. Stability, convergence, and optimality analyses are given for the proposed multistep PI algorithm. Afterward, a neural network-based actor-critic structure is built to approximate the iterative control policies and value functions. Finally, a large-scale formation control problem is provided to demonstrate the performance of the developed hierarchical leader-following formation control structure and the MsGPI algorithm.
6
Shi H, Wang M, Wang C. Leader-Follower Formation Learning Control of Discrete-Time Nonlinear Multiagent Systems. IEEE Transactions on Cybernetics 2023; 53:1184-1194. [PMID: 34606467] [DOI: 10.1109/tcyb.2021.3110645]
Abstract
This article investigates the leader-follower formation learning control (FLC) problem for discrete-time strict-feedback multiagent systems (MASs). The objective is to acquire experiential knowledge from the stable leader-follower adaptive formation control process and to improve control performance by reusing that knowledge. First, a two-layer control scheme is proposed to solve the leader-follower formation control problem. In the first layer, by combining adaptive distributed observers with constructed n-step predictors, the leader's future state is predicted by the followers in a distributed manner. In the second layer, adaptive neural network (NN) controllers are constructed for the followers to ensure that all followers track the predicted output of the leader. In the stable formation control process, the NN weights are verified to converge exponentially to their optimal values by developing an extended stability corollary for linear time-varying (LTV) systems. Second, by constructing some specific "learning rules," the NN weights with convergent sequences are acquired and stored in the followers as experiential knowledge. The stored knowledge is then reused to construct the FLC. The proposed FLC method not only solves the leader-follower formation problem but also improves transient control performance. Finally, the validity of the presented FLC scheme is illustrated by simulations.
7
Lin M, Zhao B, Liu D. Policy gradient adaptive dynamic programming for nonlinear discrete-time zero-sum games with unknown dynamics. Soft Computing 2023. [DOI: 10.1007/s00500-023-07817-6]
8
Wang D, Ren J, Ha M. Discounted linear Q-learning control with novel tracking cost and its stability. Information Sciences 2023. [DOI: 10.1016/j.ins.2023.01.030]
9
Xue S, Luo B, Liu D, Gao Y. Neural network-based event-triggered integral reinforcement learning for constrained H∞ tracking control with experience replay. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.09.119]
10
Gao X, Si J, Wen Y, Li M, Huang H. Reinforcement Learning Control of Robotic Knee With Human-in-the-Loop by Flexible Policy Iteration. IEEE Transactions on Neural Networks and Learning Systems 2022; 33:5873-5887. [PMID: 33956634] [DOI: 10.1109/tnnls.2021.3071727]
Abstract
We are motivated by the real challenges presented in a human-robot system to develop new designs that are efficient at the data level and carry performance guarantees, such as stability and optimality, at the system level. Existing approximate/adaptive dynamic programming (ADP) results that consider system performance theoretically do not readily provide practically useful learning control algorithms for this problem, while reinforcement learning (RL) algorithms that address data efficiency usually lack performance guarantees for the controlled system. This study fills these important voids by introducing innovative features into the policy iteration algorithm. We introduce flexible policy iteration (FPI), which can flexibly and organically integrate experience replay and supplemental values from prior experience into the RL controller. We show system-level performance, including convergence of the approximate value function, (sub)optimality of the solution, and stability of the system. We demonstrate the effectiveness of FPI via realistic simulations of the human-robot system. The problem faced in this study may be difficult to address by design methods based on classical control theory, as it is nearly impossible to obtain a customized mathematical model of a human-robot system either online or offline. The results also indicate the great potential of RL control for solving realistic and challenging problems with high-dimensional control inputs.
11
Assawinchaichote W, Pongfai J, Zhang H, Shi Y. Optimal design of a nonlinear control system based on new deterministic neural network scheduling. Information Sciences 2022. [DOI: 10.1016/j.ins.2022.07.076]
12
Wei Q, Ma H, Chen C, Dong D. Deep Reinforcement Learning With Quantum-Inspired Experience Replay. IEEE Transactions on Cybernetics 2022; 52:9326-9338. [PMID: 33600343] [DOI: 10.1109/tcyb.2021.3053414]
Abstract
In this article, a novel training paradigm inspired by quantum computation is proposed for deep reinforcement learning (DRL) with experience replay. In contrast to the traditional experience replay mechanism in DRL, the proposed DRL with quantum-inspired experience replay (DRL-QER) adaptively chooses experiences from the replay buffer according to the complexity and the number of replays of each experience (also called a transition), to achieve a balance between exploration and exploitation. In DRL-QER, transitions are first formulated in quantum representations, and then a preparation operation and a depreciation operation are performed on them. The preparation operation reflects the relationship between the temporal-difference errors (TD-errors) and the importance of the experiences, while the depreciation operation ensures the diversity of the transitions. Experimental results on Atari 2600 games show that DRL-QER outperforms state-of-the-art algorithms such as DRL-PER and DCRL on most of these games with improved training efficiency, and is also applicable to memory-based DRL approaches such as double and dueling networks.
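The two ingredients of DRL-QER (priority rising with TD-error, depreciation with replay count) can be mimicked classically by a prioritized-replay-style sampler. The sketch below is such a classical approximation, not the paper's quantum-amplitude scheme; `alpha` and `decay` are illustrative knobs.

```python
import numpy as np

def sample_indices(td_errors, replay_counts, batch_size, rng,
                   alpha=0.6, decay=0.9):
    """Classical sketch of the two DRL-QER ingredients: priority grows
    with |TD-error| (cf. the preparation operation) and shrinks with the
    number of times a transition has already been replayed (cf. the
    depreciation operation). Returns sampled buffer indices."""
    priority = (np.abs(td_errors) + 1e-6) ** alpha * decay ** replay_counts
    probs = priority / priority.sum()          # normalize to a distribution
    return rng.choice(len(td_errors), size=batch_size, p=probs)
```

Transitions with large TD-errors are sampled far more often until they have been replayed enough times for the decay factor to restore diversity.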
13
Moghadam R, Natarajan P, Jagannathan S. Online Optimal Adaptive Control of Partially Uncertain Nonlinear Discrete-Time Systems Using Multilayer Neural Networks. IEEE Transactions on Neural Networks and Learning Systems 2022; 33:4840-4850. [PMID: 33710960] [DOI: 10.1109/tnnls.2021.3061414]
Abstract
This article addresses the online optimal adaptive regulation of nonlinear discrete-time systems in affine form with partially uncertain dynamics using a multilayer neural network (MNN). The actor-critic framework estimates both the optimal control input and the value function. The instantaneous control input error and the temporal difference are used to tune the weights of the critic and actor networks, respectively. The selection of basis functions and their derivatives is not required in the proposed approach. The state vector and the critic and actor NN weights are proven to be bounded using the Lyapunov method. The approach can be extended to neural networks with an arbitrary number of hidden layers, and is demonstrated via a simulation example.
14
Event-triggered integral reinforcement learning for nonzero-sum games with asymmetric input saturation. Neural Networks 2022; 152:212-223. [DOI: 10.1016/j.neunet.2022.04.013]
15
Wen S, Ni X, Wang H, Zhu S, Shi K, Huang T. Observer-Based Adaptive Synchronization of Multiagent Systems With Unknown Parameters Under Attacks. IEEE Transactions on Neural Networks and Learning Systems 2022; 33:3109-3119. [PMID: 33513114] [DOI: 10.1109/tnnls.2021.3051017]
Abstract
This article studies the observer-based adaptive synchronization of multiagent systems (MASs) with unknown parameters under attacks. First, an observer for the MAS is introduced to estimate the states of the agents. When disturbance, nonlinear function, and system model uncertainty are not considered, a nominal controller is proposed to achieve synchronization and state estimation. Then, to eliminate the effect of unknown parameters in the disturbance, nonlinear function, and system model uncertainty, an adaptive controller with a switching term is introduced. However, attacks destroy the network topology and thus the nominal controller. By constructing an appropriate Lyapunov function, we analyze the effect caused by attacks, and a security control law is given to ensure the synchronization of the MASs under attacks. Finally, a numerical simulation is given to verify the validity of the obtained theorem.
16
Yu X, Hou Z, Polycarpou MM. A Data-Driven ILC Framework for a Class of Nonlinear Discrete-Time Systems. IEEE Transactions on Cybernetics 2022; 52:6143-6157. [PMID: 33571102] [DOI: 10.1109/tcyb.2020.3029596]
Abstract
In this article, we propose a data-driven iterative learning control (ILC) framework for unknown nonlinear nonaffine repetitive discrete-time single-input-single-output systems by applying the dynamic linearization (DL) technique. The ILC law is constructed based on the equivalent DL expression of an unknown ideal learning controller in the iteration and time domains. The learning control gain vector is adaptively updated using a Newton-type optimization method. The monotonic convergence of the tracking errors of the controlled plant is theoretically guaranteed with respect to the 2-norm under some conditions. In the proposed ILC framework, existing proportional-, integral-, and derivative-type ILC and high-order ILC can be considered as special cases. The proposed framework is purely data-driven, that is, the ILC law is independent of the physical dynamics of the controlled plant, and the learning gain updating algorithm is formulated using only the measured input-output data of the nonlinear system. The framework is effectively verified by two illustrative examples, on a complicated unknown nonlinear system and on a linear time-varying system.
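The simplest special case covered by such a framework is the classical P-type ILC law with a fixed learning gain. The sketch below shows that baseline under a static black-box plant; the cited paper's contribution is instead an adaptively updated gain via a Newton-type method, which this sketch does not implement.

```python
import numpy as np

def p_type_ilc(plant, u0, y_ref, gain, n_iters):
    """Classical P-type ILC iteration: after each trial, correct the input
    trajectory in proportion to the tracking error, u_{k+1} = u_k + L * e_k.
    `plant` maps an input sequence to an output sequence and is treated as
    a black box (only measured I/O data are used). Shift-free sketch for a
    static-gain plant; real ILC typically uses the shifted error e_k(t+1)."""
    u = np.array(u0, dtype=float)
    for _ in range(n_iters):
        y = plant(u)            # run one trial of the repetitive process
        e = y_ref - y           # trial tracking error
        u = u + gain * e        # learning update along the iteration axis
    return u
```

For a plant y = 0.5u with gain L = 1, the contraction factor is |1 - 0.5L| = 0.5, so the input converges geometrically to the inverse-plant solution.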
17
Fu Y, Hong C, Fu J, Chai T. Approximate Optimal Tracking Control of Nondifferentiable Signals for a Class of Continuous-Time Nonlinear Systems. IEEE Transactions on Cybernetics 2022; 52:4441-4450. [PMID: 33141675] [DOI: 10.1109/tcyb.2020.3027344]
Abstract
In this article, for a class of continuous-time nonlinear nonaffine systems with unknown dynamics, a robust approximate optimal tracking controller (RAOTC) is proposed in the framework of adaptive dynamic programming (ADP). The distinguishing contribution of this article is that a new Lyapunov function is constructed whose time derivative along the solution of the closed-loop system does not require the derivative information of the tracking errors. Thus, the proposed method can make the system states follow nondifferentiable reference signals, which removes the common assumption in the literature that reference signals must be continuous for tracking control of continuous-time nonlinear systems. Theoretical analysis, simulation, and application results illustrate the effectiveness and superiority of the proposed method.
18
Yuan L, Li T, Tong S, Xiao Y, Gao X. NN adaptive optimal tracking control for a class of uncertain nonstrict feedback nonlinear systems. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.03.049]
19
Wen X, Shi H, Su C, Jiang X, Li P, Yu J. Novel data-driven two-dimensional Q-learning for optimal tracking control of batch process with unknown dynamics. ISA Transactions 2022; 125:10-21. [PMID: 34130858] [DOI: 10.1016/j.isatra.2021.06.007]
Abstract
Since previous control methods usually rely heavily on models of the batch process and are difficult to apply to practical batch processes with unknown dynamics, a novel data-driven two-dimensional (2D) off-policy Q-learning approach for optimal tracking control (OTC) is proposed to obtain a model-free control law for batch processes. First, an extended state-space equation composed of the state and the output error is established to ensure the tracking performance of the designed controller. Second, a behavior policy for generating data and a target policy for optimization and learning are introduced based on this extended system. Then, a Bellman equation independent of the model parameters is derived by analyzing the relation between the 2D value function and the 2D Q-function. Only measured data along the batch and time directions of the batch process are used to carry out policy iteration, which solves the optimal control problem despite the lack of system dynamic information. The unbiasedness and convergence of the designed 2D off-policy Q-learning algorithm are proved. Finally, a simulation case of an injection molding process shows that the control and tracking performance gradually improve as the number of batches increases.
Affiliation(s)
- Xin Wen
- School of Information and Control Engineering, Liaoning Petrochemical University, China
- Huiyuan Shi
- School of Information and Control Engineering, Liaoning Petrochemical University, China; School of Automation, Northwestern Polytechnical University, China; State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University, Shenyang, China.
- Chengli Su
- School of Information and Control Engineering, Liaoning Petrochemical University, China; School of Electronic and Information Engineering, University of Science and Technology Liaoning, China.
- Xueying Jiang
- School of Information Science and Engineering, Northeastern University, China
- Ping Li
- School of Information and Control Engineering, Liaoning Petrochemical University, China; School of Electronic and Information Engineering, University of Science and Technology Liaoning, China
- Jingxian Yu
- School of Sciences, Liaoning Petrochemical University, China
20
Ye J, Bian Y, Luo B, Hu M, Xu B, Ding R. Costate-Supplement ADP for Model-Free Optimal Control of Discrete-Time Nonlinear Systems. IEEE Transactions on Neural Networks and Learning Systems 2022; PP:45-59. [PMID: 35544498] [DOI: 10.1109/tnnls.2022.3172126]
Abstract
In this article, an adaptive dynamic programming (ADP) scheme utilizing a costate function is proposed for optimal control of unknown discrete-time nonlinear systems. The state-action data are obtained by interacting with the environment under the iterative scheme without any model information. In contrast with the traditional ADP scheme, the collected data in the proposed algorithm are generated with different policies, which improves data utilization in the learning process. In order to approximate the cost function more accurately and to achieve a better policy improvement direction in the case of insufficient data, a separate costate network is introduced to approximate the costate function under the actor-critic framework, and the costate is utilized as supplement information to estimate the cost function more precisely. Furthermore, convergence properties of the proposed algorithm are analyzed to demonstrate that the costate function plays a positive role in the convergence process of the cost function based on the alternate iteration mode of the costate function and cost function under a mild assumption. The uniformly ultimately bounded (UUB) property of all the variables is proven by using the Lyapunov approach. Finally, two numerical examples are presented to demonstrate the effectiveness and computation efficiency of the proposed method.
21
Wang N, Gao Y, Yang C, Zhang X. Reinforcement learning-based finite-time tracking control of an unknown unmanned surface vehicle with input constraints. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2021.04.133]
22
Lv P, Wang X, Cheng Y, Duan Z, Chen CLP. Integrated Double Estimator Architecture for Reinforcement Learning. IEEE Transactions on Cybernetics 2022; 52:3111-3122. [PMID: 33027028] [DOI: 10.1109/tcyb.2020.3023033]
Abstract
Estimation bias is an important index for evaluating the performance of reinforcement learning (RL) algorithms. Popular RL algorithms such as Q-learning and deep Q-networks (DQN) often suffer from overestimation due to the maximum operation used in estimating the maximum expected action values of the next states, while double Q-learning (DQ) and double DQN may fall into underestimation by using a double estimator (DE) to avoid overestimation. To keep the balance between overestimation and underestimation, we propose a novel integrated DE (IDE) architecture that combines the maximum operation and the DE operation to estimate the maximum expected action value. Based on IDE, two RL algorithms are proposed: 1) integrated DQ (IDQ) and 2) its deep network version, integrated double DQN (IDDQN). The main idea is that the maximum and DE operations are integrated to eliminate estimation bias: one estimator is stochastically used to perform action selection based on the maximum operation, and a convex combination of the two estimators is used to carry out action evaluation. We theoretically analyze why estimation bias arises when a nonmaximum operation is used to estimate the maximum expected value and investigate possible reasons for the underestimation in DQ. We also prove the unbiasedness of IDE and the convergence of IDQ. Experiments on grid-world and Atari 2600 games indicate that IDQ and IDDQN can reduce or even eliminate estimation bias effectively, make learning more stable and balanced, and improve performance effectively.
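The IDE idea as described (stochastic estimator choice for greedy action selection, convex combination for action evaluation) can be sketched in tabular form. The class name, hyperparameters, and the weight `beta` below are illustrative, not taken from the paper.

```python
import random
from collections import defaultdict

class IntegratedDoubleQ:
    """Tabular sketch of the integrated double estimator (IDE) idea:
    one of two estimators, picked at random, selects the greedy next
    action (maximum operation), while a convex combination of both
    estimators evaluates that action (DE operation)."""

    def __init__(self, n_actions, alpha=0.1, gamma=0.99, beta=0.5):
        self.QA = defaultdict(lambda: [0.0] * n_actions)
        self.QB = defaultdict(lambda: [0.0] * n_actions)
        self.n_actions = n_actions
        self.alpha, self.gamma, self.beta = alpha, gamma, beta

    def update(self, s, a, r, s_next, done):
        # Stochastically choose which estimator performs greedy selection.
        sel, other = (self.QA, self.QB) if random.random() < 0.5 else (self.QB, self.QA)
        a_star = max(range(self.n_actions), key=lambda i: sel[s_next][i])
        # Convex combination of both estimators evaluates the chosen action.
        q_eval = self.beta * sel[s_next][a_star] + (1 - self.beta) * other[s_next][a_star]
        target = r + (0.0 if done else self.gamma * q_eval)
        # Update the estimator that performed the selection.
        sel[s][a] += self.alpha * (target - sel[s][a])
```

With `beta = 1` this collapses toward single-estimator Q-learning (overestimation-prone); with `beta = 0` it recovers a double-Q-style evaluation (underestimation-prone), which is the trade-off IDE interpolates.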
23
Rizvi SAA, Lin Z. Adaptive Dynamic Programming for Model-Free Global Stabilization of Control Constrained Continuous-Time Systems. IEEE Transactions on Cybernetics 2022; 52:1048-1060. [PMID: 32471805] [DOI: 10.1109/tcyb.2020.2989419]
Abstract
This article addresses the problem of global stabilization of continuous-time linear systems subject to control constraints using a model-free approach. We propose a gain-scheduled low-gain feedback scheme that prevents saturation from occurring and achieves global stabilization. The framework of parameterized algebraic Riccati equations (AREs) is employed to design the low-gain feedback control laws. An adaptive dynamic programming (ADP) method is presented to find the solution of the parameterized ARE without requiring the knowledge of the system dynamics. In particular, we present an iterative ADP algorithm that searches for an appropriate value of the low-gain parameter and iteratively solves the parameterized ADP Bellman equation. We present both state feedback and output feedback algorithms. The closed-loop stability and the convergence of the algorithm to the nominal solution of the parameterized ARE are shown. The simulation results validate the effectiveness of the proposed scheme.
24
Li J, Ding J, Chai T, Lewis FL, Jagannathan S. Adaptive Interleaved Reinforcement Learning: Robust Stability of Affine Nonlinear Systems With Unknown Uncertainty. IEEE Transactions on Neural Networks and Learning Systems 2022; 33:270-280. [PMID: 33112750] [DOI: 10.1109/tnnls.2020.3027653]
Abstract
This article investigates adaptive robust controller design for discrete-time (DT) affine nonlinear systems using adaptive dynamic programming. A novel adaptive interleaved reinforcement learning algorithm is developed for finding a robust controller of DT affine nonlinear systems subject to matched or unmatched uncertainties. To this end, the robust control problem is converted into an optimal control problem for the nominal system by selecting an appropriate utility function. The performance evaluation and control policy update, combined with neural-network approximation, are alternately implemented at each time step to solve a simplified Hamilton-Jacobi-Bellman (HJB) equation, such that the uniformly ultimately bounded (UUB) stability of DT affine nonlinear systems can be guaranteed for all realizations of the unknown bounded uncertainties. Rigorous theoretical proofs of the convergence of the proposed interleaved RL algorithm and of the UUB stability of the uncertain systems are provided. Simulation results are given to verify the effectiveness of the proposed method.
|
25
|
Wen G, Chen CLP, Ge SS. Simplified Optimized Backstepping Control for a Class of Nonlinear Strict-Feedback Systems With Unknown Dynamic Functions. IEEE TRANSACTIONS ON CYBERNETICS 2021; 51:4567-4580. [PMID: 32639935 DOI: 10.1109/tcyb.2020.3002108] [Citation(s) in RCA: 22] [Impact Index Per Article: 7.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
In this article, a control scheme based on the optimized backstepping (OB) technique is developed for a class of nonlinear strict-feedback systems with unknown dynamic functions. Reinforcement learning (RL) is employed to achieve the optimized control and is designed on the basis of neural-network (NN) approximations under an identifier-critic-actor architecture, where the identifier, critic, and actor are utilized for estimating the unknown dynamics, evaluating the system performance, and implementing the control action, respectively. The idea of OB control is to design all virtual controls and the actual control of backstepping as the optimized solutions of the corresponding subsystems. If the control were developed with existing RL-based optimal control methods, it would become very intricate, because their critic and actor updating laws are derived by applying gradient descent to the square of the Bellman residual error, i.e., the approximation error of the Hamilton-Jacobi-Bellman (HJB) equation, which contains multiple nonlinear terms. In order to accomplish the optimized control effectively, a simplified RL algorithm is designed by deriving the updating laws from the negative gradient of a simple positive function generated from the partial derivative of the HJB equation. Meanwhile, the design also relaxes the persistence-of-excitation condition required in most existing optimal control methods. Finally, effectiveness is demonstrated by both theory and simulation.
|
26
|
Learn to grasp unknown objects in robotic manipulation. INTEL SERV ROBOT 2021. [DOI: 10.1007/s11370-021-00380-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/20/2022]
|
27
|
Liu C, Zhang H, Sun S, Ren H. Online H∞ control for continuous-time nonlinear large-scale systems via single echo state network. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2021.03.017] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022]
|
28
|
Adaptive output-feedback optimal control for continuous-time linear systems based on adaptive dynamic programming approach. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2021.01.070] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022]
|
29
|
Singh B, Kumar R, Singh VP. Reinforcement learning in robotic applications: a comprehensive survey. Artif Intell Rev 2021. [DOI: 10.1007/s10462-021-09997-9] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/21/2022]
|
31
|
Nguyen TT, Nguyen ND, Nahavandi S. Deep Reinforcement Learning for Multiagent Systems: A Review of Challenges, Solutions, and Applications. IEEE TRANSACTIONS ON CYBERNETICS 2020; 50:3826-3839. [PMID: 32203045 DOI: 10.1109/tcyb.2020.2977374] [Citation(s) in RCA: 99] [Impact Index Per Article: 24.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
Reinforcement learning (RL) algorithms have been around for decades and have been employed to solve various sequential decision-making problems. These algorithms, however, have faced great challenges when dealing with high-dimensional environments. The recent development of deep learning has enabled RL methods to derive optimal policies for sophisticated and capable agents, which can perform efficiently in these challenging environments. This article addresses an important aspect of deep RL related to situations that require multiple agents to communicate and cooperate to solve complex tasks. A survey of different approaches to problems related to multiagent deep RL (MADRL) is presented, including nonstationarity, partial observability, continuous state and action spaces, multiagent training schemes, and multiagent transfer learning. The merits and demerits of the reviewed methods are analyzed and discussed, and their corresponding applications are explored. It is envisaged that this review provides insights into various MADRL methods and can lead to the future development of more robust and highly useful multiagent learning methods for solving real-world problems.
|
32
|
Jiang H, Zhang H, Xie X. Critic-only adaptive dynamic programming algorithms' applications to the secure control of cyber-physical systems. ISA TRANSACTIONS 2020; 104:138-144. [PMID: 30853105 DOI: 10.1016/j.isatra.2019.02.012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/26/2018] [Revised: 01/22/2019] [Accepted: 02/14/2019] [Indexed: 06/09/2023]
Abstract
Industrial cyber-physical systems generally suffer from malicious attacks and unmatched perturbations, and thus the security issue is a core research topic in the related fields. This paper proposes a novel intelligent secure control scheme that integrates optimal control theory, zero-sum game theory, reinforcement learning, and neural networks. First, the secure control problem of the compromised system is converted into a zero-sum game problem for a nominal auxiliary system, and then both policy-iteration-based and value-iteration-based adaptive dynamic programming methods are introduced to solve the Hamilton-Jacobi-Isaacs equations. The proposed secure control scheme can mitigate the effects of actuator attacks and unmatched perturbation and stabilize the compromised cyber-physical systems by tuning the system performance parameters, which is proved through Lyapunov stability theory. Finally, the proposed approach is applied to a Quanser helicopter to verify its effectiveness.
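In the scalar linear-quadratic case, the zero-sum formulation described above reduces to a quadratic Hamilton-Jacobi-Isaacs (HJI) equation that policy iteration solves in a few steps; a minimal sketch under illustrative numbers (not the paper's helicopter example, and a model-based simplification of its neural-network scheme):

```python
def scalar_hinf_policy_iteration(a, b, d, q, gamma, iters=25):
    """Policy iteration on the scalar HJI equation for the plant
    x' = a*x + b*u + d*w with utility q*x^2 + u^2 - gamma^2*w^2.
    The control u minimizes, the disturbance/attack w maximizes,
    and the value function is p*x^2."""
    p = 0.0  # a < 0 below keeps the initial policies admissible
    for _ in range(iters):
        ku = b * p                 # minimizer's gain:  u = -ku * x
        kw = d * p / gamma ** 2    # maximizer's gain:  w =  kw * x
        a_c = a - b * ku + d * kw  # closed-loop drift under both players
        # Policy evaluation: solve 2*a_c*p + q + ku^2 - gamma^2*kw^2 = 0.
        p = -(q + ku ** 2 - gamma ** 2 * kw ** 2) / (2.0 * a_c)
    return p

p_star = scalar_hinf_policy_iteration(a=-1.0, b=1.0, d=0.5, q=1.0, gamma=1.0)
# p_star solves (b^2 - d^2/gamma^2)*p^2 - 2*a*p - q = 0, the scalar HJI.
```

The value-iteration variant mentioned in the abstract differs only in how the evaluation step is relaxed; both converge to the same game-theoretic value here.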
Affiliation(s)
- He Jiang
- College of Information Science and Engineering, Northeastern University, Box 134, 110819, Shenyang, PR China.
| | - Huaguang Zhang
- College of Information Science and Engineering, Northeastern University, Box 134, 110819, Shenyang, PR China.
| | - Xiangpeng Xie
- Institute of Advanced Technology, Nanjing University of Posts and Telecommunications, 210003, Nanjing, PR China.
| |
|
33
|
Integral reinforcement learning based event-triggered control with input saturation. Neural Netw 2020; 131:144-153. [PMID: 32771844 DOI: 10.1016/j.neunet.2020.07.016] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/06/2020] [Revised: 06/13/2020] [Accepted: 07/10/2020] [Indexed: 11/20/2022]
Abstract
In this paper, a novel integral reinforcement learning (IRL)-based event-triggered adaptive dynamic programming scheme is developed for input-saturated continuous-time nonlinear systems. By using the IRL technique, the learning system does not require knowledge of the drift dynamics. Then, a single critic neural network is designed to approximate the unknown value function, and its learning does not require an initial admissible control. In order to reduce computational and communication costs, an event-triggered control law is designed, with a triggering threshold chosen to guarantee the asymptotic stability of the control system. Two examples are employed in the simulation studies, and the results verify the effectiveness of the developed IRL-based event-triggered control method.
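The event-triggering mechanism described above can be sketched generically: the control input is recomputed only when the gap between the current state and the last-sampled state crosses a threshold. The relative rule sigma*||x|| below is a common stand-in, not the paper's stability-derived threshold, and the plant, gain, and numbers are all illustrative:

```python
import numpy as np

def simulate_event_triggered(A, B, K, x0, steps, dt, sigma):
    """Hold the control between events; trigger a new sample only when
    ||x - x_hat|| exceeds sigma * ||x||, where x_hat is the state at
    the last triggering instant."""
    x, x_hat = x0.astype(float).copy(), x0.astype(float).copy()
    events = 0
    for _ in range(steps):
        if np.linalg.norm(x - x_hat) > sigma * np.linalg.norm(x):
            x_hat = x.copy()              # event: sample the state
            events += 1
        u = -K @ x_hat                    # control held between events
        x = x + dt * (A @ x + B @ u)      # forward-Euler plant step
    return x, events

A = np.array([[0.0, 1.0], [-2.0, -3.0]])  # stable toy plant
B = np.array([[0.0], [1.0]])
K = np.array([[1.0, 1.0]])                # illustrative stabilizing gain
x_final, n_events = simulate_event_triggered(
    A, B, K, np.array([1.0, 0.0]), steps=2000, dt=0.01, sigma=0.1)
```

The point of the scheme is visible in `n_events`: the control is updated far fewer times than the number of simulation steps while the state still converges.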
|
35
|
Davoud S, Gao W, Riveros-Perez E. Adaptive optimal target controlled infusion algorithm to prevent hypotension associated with labor epidural: An adaptive dynamic programming approach. ISA TRANSACTIONS 2020; 100:74-81. [PMID: 31813558 DOI: 10.1016/j.isatra.2019.11.017] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 01/12/2019] [Revised: 11/11/2019] [Accepted: 11/12/2019] [Indexed: 06/10/2023]
Abstract
Patients receiving labor epidurals commonly experience arterial hypotension as a complication of neuraxial block. The purpose of this study was to design an adaptive optimal controller for an infusion system to regulate mean arterial pressure. A state-space model relating mean arterial pressure to the norepinephrine (NE) infusion rate was derived for controller design. A data-driven adaptive optimal control algorithm was developed based on adaptive dynamic programming (ADP). The stability and disturbance-rejection ability of the closed-loop system were tested via a simulation model calibrated using available clinical data. Simulation results indicated that the settling time was six minutes and that the system showed effective disturbance rejection. The results also demonstrate that the adaptive optimal control algorithm can achieve individualized control of mean arterial pressure in pregnant patients with no prior knowledge of patient parameters.
Affiliation(s)
- Sherwin Davoud
- Department of Anesthesiology and Perioperative Medicine, Medical College of Georgia, Augusta University, 1120 15th st, Augusta, GA 30912, United States of America
| | - Weinan Gao
- Department of Electrical and Computer Engineering, Allen E. Paulson College of Engineering and Computing, Georgia Southern University, 1100 IT Drive, Statesboro, GA 30460, United States of America.
| | - Efrain Riveros-Perez
- Department of Anesthesiology and Perioperative Medicine, Medical College of Georgia, Augusta University, 1120 15th st, Augusta, GA 30912, United States of America
| |
|
36
|
Zhang Y, Zhao B, Liu D. Deterministic policy gradient adaptive dynamic programming for model-free optimal control. Neurocomputing 2020. [DOI: 10.1016/j.neucom.2019.11.032] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
|