1. Li K, Xu H, Zhao E, Wu Z, Xing J. OpenHoldem: A Benchmark for Large-Scale Imperfect-Information Game Research. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:14618-14632. [PMID: 37314914] [DOI: 10.1109/tnnls.2023.3280186]
Abstract
Owing to the unremitting efforts from a few institutes, researchers have recently made significant progress in designing superhuman artificial intelligence (AI) in no-limit Texas hold'em (NLTH), the primary testbed for large-scale imperfect-information game research. However, it remains challenging for new researchers to study this problem since there are no standard benchmarks for comparing with existing methods, which hinders further developments in this research area. This work presents OpenHoldem, an integrated benchmark for large-scale imperfect-information game research using NLTH. OpenHoldem makes three main contributions to this research direction: 1) a standardized evaluation protocol for thoroughly evaluating different NLTH AIs; 2) four publicly available strong baselines for NLTH AI; and 3) an online testing platform with easy-to-use APIs for public NLTH AI evaluation. We will publicly release OpenHoldem and hope it facilitates further studies on the unsolved theoretical and computational issues in this area and cultivates crucial research problems like opponent modeling and human-computer interactive learning.
2. Song R, Yang G, Lewis FL. Nearly Optimal Control for Mixed Zero-Sum Game Based on Off-Policy Integral Reinforcement Learning. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:2793-2804. [PMID: 35877793] [DOI: 10.1109/tnnls.2022.3191847]
Abstract
In this article, we solve a class of mixed zero-sum games for nonlinear systems whose dynamics are unknown. A policy iteration algorithm that adopts integral reinforcement learning (IRL), and therefore does not depend on system information, is proposed to obtain the optimal controls of the competitor and the collaborators. An adaptive update law that combines an actor-critic structure with experience replay is proposed. The actor not only approximates the optimal control of every player but also estimates an auxiliary control, which does not participate in the actual control process and exists only in theory. The parameters of the actor-critic structure are updated simultaneously. Then, it is proven that the parameter errors of the polynomial approximation are uniformly ultimately bounded. Finally, the effectiveness of the proposed algorithm is verified by two simulations.
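For readers less familiar with IRL, the following is a standard textbook form (not the cited article's exact formulation) of the interval Bellman equation that such algorithms evaluate from measured trajectories; the stage cost r and the reinforcement interval T are generic placeholders.

$$ V\bigl(x(t)\bigr)=\int_{t}^{t+T} r\bigl(x(\tau),u(\tau),w(\tau)\bigr)\,d\tau+V\bigl(x(t+T)\bigr) $$

Because the measured state trajectory already reflects the effect of the drift dynamics over the interval, the drift term never needs to be modeled explicitly.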
3. Qin C, Zhang Z, Shang Z, Zhang J, Zhang D. Adaptive optimal safety tracking control for multiplayer mixed zero-sum games of continuous-time systems. Applied Intelligence 2023. [DOI: 10.1007/s10489-022-04348-9]
4. Liu P, Zhang H, Sun J, Tan Z. Event-triggered adaptive integral reinforcement learning method for zero-sum differential games of nonlinear systems with incomplete known dynamics. Neural Computing and Applications 2022. [DOI: 10.1007/s00521-022-07010-0]
5. Experimental Verification of the Differential Games and H∞ Theory in Tracking Control of a Wheeled Mobile Robot. Journal of Intelligent & Robotic Systems 2022. [DOI: 10.1007/s10846-022-01584-6]
6. Zhu Y, Zhao D. Online Minimax Q Network Learning for Two-Player Zero-Sum Markov Games. IEEE Transactions on Neural Networks and Learning Systems 2022; 33:1228-1241. [PMID: 33306474] [DOI: 10.1109/tnnls.2020.3041469]
Abstract
The Nash equilibrium is an important concept in game theory. It describes the strategy profile under which each player is least exploitable by any opponent. We combine game theory, dynamic programming, and recent deep reinforcement learning (DRL) techniques to learn the Nash equilibrium policy for two-player zero-sum Markov games (TZMGs) online. The problem is first formulated as a Bellman minimax equation, and generalized policy iteration (GPI) provides a double-loop iterative way to find the equilibrium. Then, neural networks are introduced to approximate Q functions for large-scale problems. An online minimax Q network learning algorithm is proposed to train the network with observations. Experience replay, dueling network, and double Q-learning are applied to improve the learning process. The contributions are twofold: 1) DRL techniques are combined with GPI to find the TZMG Nash equilibrium for the first time and 2) the convergence of the online learning algorithm with a lookup table and experience replay is proven, whose proof is not only useful for TZMGs but also instructive for single-agent Markov decision problems. Experiments on different examples validate the effectiveness of the proposed algorithm on TZMG problems.
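As an illustrative aside (a tabular sketch rather than the authors' neural network implementation), the core minimax Q-learning update for a two-player zero-sum Markov game can be written as below; the state and action sizes, the transition tuple, and the learning parameters are hypothetical placeholders, and the maximin state value is obtained from a small linear program.

```python
import numpy as np
from scipy.optimize import linprog

def maximin_value(Q_s):
    """Solve max_pi min_b sum_a pi[a] * Q_s[a, b] with a linear program.
    Q_s: (n_a, n_b) matrix of Q-values for one state; the row player maximizes."""
    n_a, n_b = Q_s.shape
    # Decision variables: [pi_1, ..., pi_{n_a}, v]; maximize v <=> minimize -v.
    c = np.zeros(n_a + 1)
    c[-1] = -1.0
    # For every opponent action b: v - sum_a pi[a] * Q_s[a, b] <= 0.
    A_ub = np.hstack([-Q_s.T, np.ones((n_b, 1))])
    b_ub = np.zeros(n_b)
    # The mixed strategy pi must sum to one.
    A_eq = np.hstack([np.ones((1, n_a)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * n_a + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:-1]  # maximin value and mixed strategy

def minimax_q_step(Q, s, a, b, r, s_next, alpha=0.1, gamma=0.95):
    """One tabular minimax Q-learning update; Q has shape (n_states, n_a, n_b)."""
    v_next, _ = maximin_value(Q[s_next])
    Q[s, a, b] += alpha * (r + gamma * v_next - Q[s, a, b])
    return Q
```

The deep variant described in the abstract replaces the lookup table with a Q network and draws the update tuples from an experience replay buffer.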
7. Yang X, He H. Event-Driven H∞-Constrained Control Using Adaptive Critic Learning. IEEE Transactions on Cybernetics 2021; 51:4860-4872. [PMID: 32112694] [DOI: 10.1109/tcyb.2020.2972748]
Abstract
This article considers an event-driven H∞ control problem of continuous-time nonlinear systems with asymmetric input constraints. Initially, the H∞-constrained control problem is converted into a two-person zero-sum game with a discounted nonquadratic cost function. Then, we present the event-driven Hamilton-Jacobi-Isaacs equation (HJIE) associated with the two-person zero-sum game. Meanwhile, we develop a novel event-triggering condition that excludes Zeno behavior. The proposed event-triggering condition differs from the existing literature in that it keeps the triggering threshold non-negative without requiring a properly selected prescribed level of disturbance attenuation. After that, under the framework of adaptive critic learning, we use a single critic network to solve the event-driven HJIE and tune its weight parameters using historical and instantaneous state data simultaneously. Based on the Lyapunov approach, we demonstrate that the uniform ultimate boundedness of all the signals in the closed-loop system is guaranteed. Finally, simulations of a nonlinear plant are presented to validate the developed event-driven H∞ control strategy.
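For context, and setting aside the input constraints and the discount factor handled in the article, the undiscounted zero-sum formulation of H∞ control for dynamics $\dot{x}=f(x)+g(x)u+k(x)w$ is commonly written as

$$ V^{*}(x_{0})=\min_{u}\max_{w}\int_{0}^{\infty}\bigl(Q(x)+u^{\top}Ru-\gamma^{2}w^{\top}w\bigr)\,dt, $$

whose value function satisfies the textbook (non-event-driven) Hamilton-Jacobi-Isaacs equation

$$ 0=Q(x)+\nabla V^{*\top}f(x)-\tfrac{1}{4}\nabla V^{*\top}g(x)R^{-1}g(x)^{\top}\nabla V^{*}+\tfrac{1}{4\gamma^{2}}\nabla V^{*\top}k(x)k(x)^{\top}\nabla V^{*}, $$

with saddle-point policies $u^{*}=-\tfrac{1}{2}R^{-1}g(x)^{\top}\nabla V^{*}$ and $w^{*}=\tfrac{1}{2\gamma^{2}}k(x)^{\top}\nabla V^{*}$.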
8. Liu P, Zhang H, Ren H, Liu C. Online event-triggered adaptive critic design for multi-player zero-sum games of partially unknown nonlinear systems with input constraints. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2021.07.058]
9. Luo B, Yang Y, Liu D. Policy Iteration Q-Learning for Data-Based Two-Player Zero-Sum Game of Linear Discrete-Time Systems. IEEE Transactions on Cybernetics 2021; 51:3630-3640. [PMID: 32092032] [DOI: 10.1109/tcyb.2020.2970969]
Abstract
In this article, the data-based two-player zero-sum game problem is considered for linear discrete-time systems. Solving this problem theoretically amounts to solving the discrete-time game algebraic Riccati equation (DTGARE), which requires complete knowledge of the system dynamics. To avoid solving the DTGARE, the Q-function is introduced and a data-based policy iteration Q-learning (PIQL) algorithm is developed to learn the optimal Q-function from data collected on the real system. By writing the Q-function in a quadratic form, it is proved that the PIQL algorithm is equivalent to the Newton iteration method in the Banach space by using the Fréchet derivative. Then, the convergence of the PIQL algorithm can be guaranteed by Kantorovich's theorem. For the realization of the PIQL algorithm, an off-policy learning scheme is proposed that uses real data rather than the system model. Finally, the efficiency of the developed data-based PIQL method is validated through simulation studies.
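As a notational reminder (the standard zero-sum Q-learning structure rather than the paper's exact derivation), the quadratic Q-function and the two alternating policy-iteration steps in such data-based schemes read

$$ Q(x_{k},u_{k},w_{k})=\begin{bmatrix}x_{k}\\u_{k}\\w_{k}\end{bmatrix}^{\top}H\begin{bmatrix}x_{k}\\u_{k}\\w_{k}\end{bmatrix}, $$

$$ Q^{i}(x_{k},u_{k},w_{k})=r(x_{k},u_{k},w_{k})+Q^{i}\bigl(x_{k+1},u^{i}(x_{k+1}),w^{i}(x_{k+1})\bigr),\qquad \bigl(u^{i+1},w^{i+1}\bigr)=\arg\min_{u}\max_{w}Q^{i}(x,u,w), $$

where $r(x_{k},u_{k},w_{k})=x_{k}^{\top}Q_{1}x_{k}+u_{k}^{\top}Ru_{k}-\gamma^{2}w_{k}^{\top}w_{k}$ is a generic stage cost; because the kernel matrix H is identified directly from input-state data, the system matrices never enter the computation.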
10. Yang X, He H, Zhong X. Approximate Dynamic Programming for Nonlinear-Constrained Optimizations. IEEE Transactions on Cybernetics 2021; 51:2419-2432. [PMID: 31329149] [DOI: 10.1109/tcyb.2019.2926248]
Abstract
In this paper, we study the constrained optimization problem of a class of uncertain nonlinear interconnected systems. First, we prove that the solution of the constrained optimization problem can be obtained through solving an array of optimal control problems of constrained auxiliary subsystems. Then, under the framework of approximate dynamic programming, we present a simultaneous policy iteration (SPI) algorithm to solve the Hamilton-Jacobi-Bellman equations corresponding to the constrained auxiliary subsystems. By building an equivalence relationship, we demonstrate the convergence of the SPI algorithm. Meanwhile, we implement the SPI algorithm via an actor-critic structure, where actor networks are used to approximate optimal control policies and critic networks are applied to estimate optimal value functions. By using the least squares method and the Monte Carlo integration technique together, we are able to determine the weight vectors of actor and critic networks. Finally, we validate the developed control method through the simulation of a nonlinear interconnected plant.
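A minimal sketch of the least-squares step that such actor-critic implementations typically rely on (my own illustration; the basis functions, sampling region, and target values below are hypothetical placeholders): draw Monte Carlo state samples, evaluate the basis functions, and solve the resulting linear least-squares problem for the critic weights.

```python
import numpy as np

def fit_critic_weights(states, basis, targets):
    """Least-squares fit of weights w so that basis(x) @ w approximates target(x).
    states:  (N, n) Monte Carlo samples from the operating region
    basis:   callable mapping a state to a feature vector phi(x)
    targets: (N,) target values (e.g., Bellman targets) at the samples."""
    Phi = np.array([basis(x) for x in states])          # (N, m) design matrix
    w, *_ = np.linalg.lstsq(Phi, targets, rcond=None)   # min_w ||Phi w - targets||^2
    return w

# Usage sketch with quadratic-in-state features and placeholder targets.
rng = np.random.default_rng(0)
states = rng.uniform(-1.0, 1.0, size=(500, 2))
basis = lambda x: np.array([x[0] ** 2, x[0] * x[1], x[1] ** 2])
targets = np.array([x @ x for x in states])
w = fit_critic_weights(states, basis, targets)
```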
11. Zhang Y, Zhao B, Liu D. Event-triggered adaptive dynamic programming for multi-player zero-sum games with unknown dynamics. Soft Computing 2021. [DOI: 10.1007/s00500-020-05293-w]
12. Paul S, Ni Z, Mu C. A Learning-Based Solution for an Adversarial Repeated Game in Cyber-Physical Power Systems. IEEE Transactions on Neural Networks and Learning Systems 2020; 31:4512-4523. [PMID: 31899439] [DOI: 10.1109/tnnls.2019.2955857]
Abstract
Due to the rapidly expanding complexity of cyber-physical power systems, the probability of a system malfunctioning and failing is increasing. Most of the existing works combining smart grid (SG) security and game theory fail to replicate adversarial events in a simulated environment close to real-life events. In this article, a repeated game is formulated to mimic the real-life interactions between the adversaries of the modern electric power system. The optimal action strategies for different environment settings are analyzed. The advantage of the repeated game is that the players can generate actions independently of the previous actions' history. The solution of the game is designed based on a reinforcement learning algorithm, which ensures the desired outcome in favor of the players. An outcome in favor of a player means achieving a higher mixed-strategy payoff than the other player. Different from the existing game-theoretic approaches, both the attacker and the defender participate actively in the game and learn the sequences of actions applied to the power transmission lines. In this game, we consider several factors (e.g., attack and defense costs, allocated budgets, and the players' strengths) that could affect the outcome of the game. These considerations make the game close to real-life events. To evaluate the game outcome, both players' utilities are compared; they reflect how much power is lost due to the attacks and how much power is saved due to the defenses. The players' favorable outcome is achieved for different attack and defense strengths (probabilities). The IEEE 39-bus system is used here as the test benchmark. Learned attack and defense strategies are applied in a simulated power system environment (PowerWorld) to illustrate the post-attack effects on the system.
13. Cao W, Yang Q. Online sequential extreme learning machine based adaptive control for wastewater treatment plant. Neurocomputing 2020. [DOI: 10.1016/j.neucom.2019.05.109]
14. Event-driven H∞ control with critic learning for nonlinear systems. Neural Networks 2020; 132:30-42. [PMID: 32861146] [DOI: 10.1016/j.neunet.2020.08.004]
Abstract
In this paper, we study an event-driven H∞ control problem of continuous-time nonlinear systems. Initially, with the introduction of a discounted cost function, we convert the nonlinear H∞ control problem into an event-driven nonlinear two-player zero-sum game. Then, we develop an event-driven Hamilton-Jacobi-Isaacs equation (HJIE) related to the two-player zero-sum game. After that, we propose a novel event-triggering condition guaranteeing that Zeno behavior does not occur. The triggering threshold in the newly proposed event-triggering condition can be kept positive without requiring a properly chosen prescribed level of disturbance attenuation. To solve the event-driven HJIE, we employ an adaptive critic architecture containing a single critic neural network (NN). The weight parameters of the critic NN are tuned via the gradient descent method. After that, we carry out stability analysis of the hybrid closed-loop system based on Lyapunov's direct approach. Finally, we provide two nonlinear plants, including a pendulum system, to validate the proposed event-driven H∞ control scheme.
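As a generic reminder of how a single critic NN is typically used in this setting (standard adaptive-critic notation with a placeholder basis φ and learning rate α, not the paper's exact tuning law): the value function is approximated as a linear combination of basis functions, the HJIE residual e is formed from that approximation, and the weights follow gradient descent on the squared residual,

$$ \hat{V}(x)=\hat{W}^{\top}\varphi(x),\qquad E=\tfrac{1}{2}e^{2},\qquad \dot{\hat{W}}=-\alpha\frac{\partial E}{\partial\hat{W}}=-\alpha\,e\,\frac{\partial e}{\partial\hat{W}}. $$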
15. Li H, Zhang Q, Zhao D. Deep Reinforcement Learning-Based Automatic Exploration for Navigation in Unknown Environment. IEEE Transactions on Neural Networks and Learning Systems 2020; 31:2064-2076. [PMID: 31398138] [DOI: 10.1109/tnnls.2019.2927869]
Abstract
This paper investigates the automatic exploration problem in unknown environments, which is key to applying robotic systems to social tasks. Solving this problem by stacking decision rules cannot cover the variety of environments and sensor properties. Learning-based control methods are adaptive to these scenarios. However, these methods are hampered by low learning efficiency and poor transferability from simulation to reality. In this paper, we construct a general exploration framework by decomposing the exploration process into decision, planning, and mapping modules, which increases the modularity of the robotic system. Based on this framework, we propose a deep reinforcement learning-based decision algorithm that uses a deep neural network to learn the exploration strategy from the partial map. The results show that the proposed algorithm has better learning efficiency and adaptability to unknown environments. In addition, we conduct experiments on a physical robot, and the results suggest that the learned policy can be well transferred from simulation to the real robot.
16. Ni Z, Paul S. A Multistage Game in Smart Grid Security: A Reinforcement Learning Solution. IEEE Transactions on Neural Networks and Learning Systems 2019; 30:2684-2695. [PMID: 30624227] [DOI: 10.1109/tnnls.2018.2885530]
Abstract
Existing smart grid security research investigates different attack techniques and cascading failures from the attackers' viewpoints, while the defenders' or operators' protection strategies are somewhat neglected. Game-theoretic methods are applied to attacker-defender games in the smart grid security area. Yet, most of the existing works only use a one-shot game and do not consider the dynamic process of the electric power grid. In this paper, we propose a new solution for a multistage game (also called a dynamic game) between the attacker and the defender based on reinforcement learning to identify the optimal attack sequences given certain objectives (e.g., transmission line outages or generation loss). Different from a one-shot game, the attacker here learns a sequence of attack actions applied to the transmission lines and the defender protects a set of selected lines. After each time step, the cascading failure is measured, and the line outage (and/or generation loss) is used as the feedback for the attacker to generate the next action. The performance is evaluated on the W&W 6-bus and IEEE 39-bus systems. A comparison between a multistage attack and a one-shot attack is conducted to show the significance of the multistage attack. Furthermore, different protection strategies are evaluated in simulation, which shows that the proposed reinforcement learning solution can identify optimal attack sequences under several attack objectives. It also indicates that the attacker's learned information helps the defender to enhance the security of the system.
17. Zhang Q, Zhao D. Data-Based Reinforcement Learning for Nonzero-Sum Games With Unknown Drift Dynamics. IEEE Transactions on Cybernetics 2019; 49:2874-2885. [PMID: 29994780] [DOI: 10.1109/tcyb.2018.2830820]
Abstract
This paper is concerned with the nonlinear optimization problem of nonzero-sum (NZS) games with unknown drift dynamics. A data-based integral reinforcement learning (IRL) method is proposed to approximate the Nash equilibrium of NZS games iteratively. Furthermore, we prove that the data-based IRL method is equivalent to the model-based policy iteration algorithm, which guarantees the convergence of the proposed method. For implementation purposes, a single-critic neural network structure for the NZS games is given. To enhance the applicability of the data-based IRL method, we design the updating laws of the critic weights based on offline and online iterative learning methods, respectively. Note that the experience replay technique is introduced in the online iterative learning, which can improve the convergence rate of the critic weights during the learning process. The uniform ultimate boundedness of the critic weights is guaranteed using the Lyapunov method. Finally, the numerical results demonstrate the effectiveness of the data-based IRL algorithm for nonlinear NZS games with unknown drift dynamics.
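For orientation (the standard model-based continuous-time NZS game setup rather than the paper's data-based IRL variant), with dynamics $\dot{x}=f(x)+\sum_{j}g_{j}(x)u_{j}$ and player costs $J_{i}=\int_{0}^{\infty}\bigl(Q_{i}(x)+\sum_{j}u_{j}^{\top}R_{ij}u_{j}\bigr)\,dt$, the Nash policies and the coupled Hamilton-Jacobi equations are

$$ u_{i}^{*}=-\tfrac{1}{2}R_{ii}^{-1}g_{i}(x)^{\top}\nabla V_{i},\qquad 0=Q_{i}(x)+\nabla V_{i}^{\top}\Bigl(f(x)+\sum_{j}g_{j}(x)u_{j}^{*}\Bigr)+\sum_{j}u_{j}^{*\top}R_{ij}u_{j}^{*},\quad i=1,\dots,N. $$

Data-based IRL replaces the model-dependent evaluation of these equations with interval Bellman equations evaluated along measured trajectories.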
18. Song R, Zhu L. Stable value iteration for two-player zero-sum game of discrete-time nonlinear systems based on adaptive dynamic programming. Neurocomputing 2019. [DOI: 10.1016/j.neucom.2019.03.002]
19. Shao K, Zhu Y, Zhao D. StarCraft Micromanagement With Reinforcement Learning and Curriculum Transfer Learning. IEEE Transactions on Emerging Topics in Computational Intelligence 2019. [DOI: 10.1109/tetci.2018.2823329]
20.
21. Liu C, Zhu E, Zhang Q, Wei X. Modeling of Agent Cognition in Extensive Games via Artificial Neural Networks. IEEE Transactions on Neural Networks and Learning Systems 2018; 29:4857-4868. [PMID: 29993959] [DOI: 10.1109/tnnls.2017.2782266]
Abstract
The decision-making process, which is regarded as cognitive and ubiquitous, has been exploited in diverse fields, such as psychology, economics, and artificial intelligence. This paper considers the problem of modeling agent cognition in a class of game-theoretic decision-making scenarios called extensive games. We present a novel framework in which artificial neural networks are incorporated to simulate agent cognition regarding the structure of the underlying game and the goodness of the game situations therein. An algorithmic procedure is investigated to describe the process for solving games with cognition, and then a new equilibrium concept is proposed, involving players' cognitive reasoning, as a refinement of the classical subgame perfect equilibrium. Moreover, a series of results concerning the computational complexity, soundness, and completeness of the algorithm, as well as the existence of an equilibrium solution, is obtained. This framework, which is shown to be general enough to model the way in which AlphaGo plays Go, may offer a means for bridging the gap between theoretical models and practical problem-solving.
22. A data-driven online ADP control method for nonlinear system based on policy iteration and nonlinear MIMO decoupling ADRC. Neurocomputing 2018. [DOI: 10.1016/j.neucom.2018.04.024]
23. Jiang H, Zhang H, Han J, Zhang K. Iterative adaptive dynamic programming methods with neural network implementation for multi-player zero-sum games. Neurocomputing 2018. [DOI: 10.1016/j.neucom.2018.04.005]
24. Pan J, Wang X, Cheng Y, Yu Q. Multisource Transfer Double DQN Based on Actor Learning. IEEE Transactions on Neural Networks and Learning Systems 2018; 29:2227-2238. [PMID: 29771674] [DOI: 10.1109/tnnls.2018.2806087]
Abstract
Deep reinforcement learning (RL) combines the psychological mechanisms of "trial and error" and "reward and punishment" in RL with the powerful feature expression and nonlinear mapping of deep learning. Currently, it plays an essential role in the fields of artificial intelligence and machine learning. Since an RL agent needs to constantly interact with its surroundings, the deep Q network (DQN) is inevitably faced with the need to learn numerous network parameters, which results in low learning efficiency. In this paper, a multisource transfer double DQN (MTDDQN) based on actor learning is proposed. The transfer learning technique is integrated with deep RL to make the RL agent collect, summarize, and transfer action knowledge, including policy mimicking and feature regression, to the training of related tasks. There exists action overestimation in DQN, i.e., the lower probability limit of the action corresponding to the maximum Q value is nonzero. Therefore, the transfer network is trained by using double DQN to eliminate the error accumulation caused by action overestimation. In addition, to avoid negative transfer, i.e., to ensure strong correlations between source and target tasks, a multisource transfer learning mechanism is applied. Atari 2600 games are tested on the Arcade Learning Environment platform to evaluate the feasibility and performance of MTDDQN by comparing it with mainstream approaches such as DQN and double DQN. Experiments show that MTDDQN achieves not only human-like actor-learning transfer capability, but also the desired learning efficiency and testing accuracy on the target task.
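For reference, a generic double-DQN target computation (not the MTDDQN transfer machinery itself) decouples action selection from action evaluation, which is the overestimation fix mentioned above; the network callables and batch layout below are assumed placeholders.

```python
import numpy as np

def double_dqn_targets(rewards, next_states, dones, q_online, q_target, gamma=0.99):
    """Compute double-DQN regression targets for a batch of transitions.
    q_online / q_target: callables mapping a batch of states to (batch, n_actions) Q-values."""
    # The online network selects the greedy next action ...
    greedy_actions = np.argmax(q_online(next_states), axis=1)
    # ... while the target network evaluates it, reducing DQN's overestimation bias.
    next_q = q_target(next_states)[np.arange(len(greedy_actions)), greedy_actions]
    return rewards + gamma * (1.0 - dones) * next_q
```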
25. Sledge IJ, Emigh MS, Principe JC. Guided Policy Exploration for Markov Decision Processes Using an Uncertainty-Based Value-of-Information Criterion. IEEE Transactions on Neural Networks and Learning Systems 2018; 29:2080-2098. [PMID: 29771664] [DOI: 10.1109/tnnls.2018.2812709]
Abstract
Reinforcement learning in environments with many action-state pairs is challenging. The issue is the number of episodes needed to thoroughly search the policy space. Most conventional heuristics address this search problem in a stochastic manner. This can leave large portions of the policy space unvisited during the early training stages. In this paper, we propose an uncertainty-based, information-theoretic approach for performing guided stochastic searches that more effectively cover the policy space. Our approach is based on the value of information, a criterion that provides the optimal tradeoff between expected costs and the granularity of the search process. The value of information yields a stochastic routine for choosing actions during learning that can explore the policy space in a coarse to fine manner. We augment this criterion with a state-transition uncertainty factor, which guides the search process into previously unexplored regions of the policy space. We evaluate the uncertainty-based value-of-information policies on the games Centipede and Crossy Road. Our results indicate that our approach yields better performing policies in fewer episodes than stochastic-based exploration strategies. We show that the training rate for our approach can be further improved by using the policy cross entropy to guide our criterion's hyperparameter selection.
26. Zhong X, He H, Wang D, Ni Z. Model-Free Adaptive Control for Unknown Nonlinear Zero-Sum Differential Game. IEEE Transactions on Cybernetics 2018; 48:1633-1646. [PMID: 28727566] [DOI: 10.1109/tcyb.2017.2712617]
Abstract
In this paper, we present a new model-free globalized dual heuristic dynamic programming (GDHP) approach for discrete-time nonlinear zero-sum game problems. First, an online learning algorithm is proposed based on the GDHP method to solve the Hamilton-Jacobi-Isaacs equation associated with the optimal regulation control problem. By shifting the definition of the performance index backward one step, the requirement for system dynamics, or an identifier, is relaxed in the proposed method. Then, three neural networks are established to approximate the optimal saddle-point feedback control law, the disturbance law, and the performance index, respectively. The explicit updating rules for these three neural networks are provided based on the data generated during online learning along the system trajectories. The stability analysis in terms of the neural network approximation errors is discussed based on the Lyapunov approach. Finally, two simulation examples are provided to show the effectiveness of the proposed method.
27. Xiao G, Zhang H, Zhang K, Wen Y. Value iteration based integral reinforcement learning approach for H∞ controller design of continuous-time nonlinear systems. Neurocomputing 2018. [DOI: 10.1016/j.neucom.2018.01.029]
28. Event-driven optimal control for uncertain nonlinear systems with external disturbance via adaptive dynamic programming. Neurocomputing 2018. [DOI: 10.1016/j.neucom.2017.12.010]
29. Zhu Y, Zhao D, Yang X, Zhang Q. Policy Iteration for H∞ Optimal Control of Polynomial Nonlinear Systems via Sum of Squares Programming. IEEE Transactions on Cybernetics 2018; 48:500-509. [PMID: 28092589] [DOI: 10.1109/tcyb.2016.2643687]
Abstract
Sum of squares (SOS) polynomials have provided a computationally tractable way to deal with inequality constraints appearing in many control problems. They can also act as approximators in the framework of adaptive dynamic programming. In this paper, an approximate solution to the H∞ optimal control of polynomial nonlinear systems is proposed. Under a given attenuation coefficient, the Hamilton-Jacobi-Isaacs equation is relaxed to an optimization problem with a set of inequalities. After applying the policy iteration technique and constraining the inequalities to be SOS, the optimization problem is divided into a sequence of feasible semidefinite programming problems. With the converged solution, the attenuation coefficient is further minimized to a lower value. After iterations, approximate solutions to the smallest L2-gain and the associated optimal controller are obtained. Four examples are employed to verify the effectiveness of the proposed algorithm.
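Schematically (standard SOS-relaxed policy iteration under the usual dynamics $\dot{x}=f(x)+g(x)u+k(x)w$, not necessarily the paper's exact program), each iteration fixes the current policies, searches for a polynomial value function whose Hamiltonian inequality is certified by an SOS constraint, and then updates the policies:

$$ \text{find polynomial } V(x)\ \text{s.t.}\ -\Bigl(\nabla V^{\top}\bigl(f+gu_{i}+kw_{i}\bigr)+Q(x)+u_{i}^{\top}Ru_{i}-\gamma^{2}w_{i}^{\top}w_{i}\Bigr)\ \text{is SOS}, $$

$$ u_{i+1}=-\tfrac{1}{2}R^{-1}g(x)^{\top}\nabla V,\qquad w_{i+1}=\tfrac{1}{2\gamma^{2}}k(x)^{\top}\nabla V, $$

after which the attenuation coefficient $\gamma$ is decreased as long as the resulting semidefinite program remains feasible.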
30. Data-driven adaptive dynamic programming schemes for non-zero-sum games of unknown discrete-time nonlinear systems. Neurocomputing 2018. [DOI: 10.1016/j.neucom.2017.09.020]
31. Wang D, He H, Liu D. Adaptive Critic Nonlinear Robust Control: A Survey. IEEE Transactions on Cybernetics 2017; 47:3429-3451. [PMID: 28682269] [DOI: 10.1109/tcyb.2017.2712188]
Abstract
Adaptive dynamic programming (ADP) and reinforcement learning are closely related when performing intelligent optimization. They are both regarded as promising methods involving the important components of evaluation and improvement, against the background of information technologies such as artificial intelligence, big data, and deep learning. Although great progress has been achieved and surveyed in addressing nonlinear optimal control problems, the research on the robustness of ADP-based control strategies under uncertain environments has not been fully summarized. Hence, this survey reviews the recent main results of adaptive-critic-based robust control design of continuous-time nonlinear systems. The ADP-based nonlinear optimal regulation is reviewed, followed by robust stabilization of nonlinear systems with matched uncertainties, guaranteed cost control design of unmatched plants, and decentralized stabilization of interconnected systems. Additionally, further comprehensive discussions are presented, including event-based robust control design, improvement of the critic learning rule, nonlinear H∞ control design, and several notes on future perspectives. By applying the ADP-based optimal and robust control methods to a practical power system and an overhead crane plant, two typical examples are provided to verify the effectiveness of the theoretical results. Overall, this survey is beneficial for promoting the development of adaptive critic control methods with robustness guarantees and the construction of higher-level intelligent systems.
32. Zhu Y, Zhao D. Comprehensive comparison of online ADP algorithms for continuous-time optimal control. Artificial Intelligence Review 2017. [DOI: 10.1007/s10462-017-9548-4]
33. Lyapunov stability-based control and identification of nonlinear dynamical systems using adaptive dynamic programming. Soft Computing 2017. [DOI: 10.1007/s00500-017-2500-3]