1
Lin K, Li D, Li Y, Chen S, Liu Q, Gao J, Jin Y, Gong L. TAG: Teacher-Advice Mechanism With Gaussian Process for Reinforcement Learning. IEEE Transactions on Neural Networks and Learning Systems 2024; 35:12419-12433. [PMID: 37023165] [DOI: 10.1109/tnnls.2023.3262956]
Abstract
Reinforcement learning (RL) still suffers from sample inefficiency and struggles with exploration, particularly in settings with long-delayed rewards, sparse rewards, and deep local optima. The learning-from-demonstration (LfD) paradigm was recently proposed to tackle this problem; however, such methods usually require a large number of demonstrations. In this study, we present a sample-efficient teacher-advice mechanism with Gaussian process (TAG) that leverages only a few expert demonstrations. In TAG, a teacher model is built to provide both an advice action and its associated confidence value. A guided policy is then formulated to guide the agent during the exploration phase according to the defined criteria. Through the TAG mechanism, the agent can explore the environment more intentionally, and the confidence value allows the guided policy to guide the agent precisely. Moreover, owing to the strong generalization ability of the Gaussian process, the teacher model uses the demonstrations more effectively, so substantial improvements in performance and sample efficiency can be attained. Extensive experiments on sparse-reward environments demonstrate that the TAG mechanism helps typical RL algorithms achieve significant performance gains. In addition, the TAG mechanism with the soft actor-critic algorithm (TAG-SAC) attains state-of-the-art performance over other LfD counterparts on several delayed-reward and complicated continuous control environments.
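As an illustrative sketch only (not the paper's implementation), a Gaussian-process teacher fit on a handful of demonstrations can return an advice action together with a confidence value derived from the predictive standard deviation; the kernel choice, confidence rule, threshold, and synthetic data below are assumptions.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

demo_states = np.random.randn(50, 4)      # hypothetical expert demonstration states
demo_actions = np.random.randn(50)        # corresponding expert actions (1-D for simplicity)

teacher = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
teacher.fit(demo_states, demo_actions)

def guided_action(state, policy_action, conf_threshold=0.8):
    """Return the teacher's advice when the GP is confident, else the RL policy's action."""
    advice, std = teacher.predict(state.reshape(1, -1), return_std=True)
    confidence = 1.0 / (1.0 + float(std[0]))   # higher predictive std -> lower confidence
    if confidence > conf_threshold:
        return float(advice[0]), confidence    # exploit the teacher's advice
    return policy_action, confidence           # fall back to the learned policy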
2
Xue W, Lian B, Fan J, Kolaric P, Chai T, Lewis FL. Inverse Reinforcement Q-Learning Through Expert Imitation for Discrete-Time Systems. IEEE Transactions on Neural Networks and Learning Systems 2023; 34:2386-2399. [PMID: 34520364] [DOI: 10.1109/tnnls.2021.3106635]
Abstract
In inverse reinforcement learning (RL), there are two agents. An expert target agent has a performance cost function and exhibits control and state behaviors to a learner. The learner agent does not know the expert's performance cost function but seeks to reconstruct it by observing the expert's behaviors, and it tries to imitate these behaviors optimally with its own response. In this article, we formulate an imitation problem in which the optimal performance intent of a discrete-time (DT) expert target agent is unknown to a DT learner agent. Using only the observed expert behavior trajectory, the learner seeks a cost function that yields the same optimal feedback gain as the expert's and thus imitates the expert's optimal response. We develop an inverse RL approach with a new scheme to solve this behavior imitation problem. The approach consists of a cost-function update based on an extension of RL policy iteration and inverse optimal control, and a control-policy update based on optimal control. Under this scheme, we then develop an inverse reinforcement Q-learning algorithm, an extension of RL Q-learning that does not require any knowledge of the agent dynamics. Proofs of stability, convergence, and optimality are given, and a key property concerning the nonuniqueness of the solution is also shown. Finally, simulation experiments demonstrate the effectiveness of the new approach.
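For orientation only, the linear-quadratic special case of this imitation problem can be summarized with the standard discrete-time LQR relations below; this is a sketch of the structure being inverted, not the paper's model-free Q-learning derivation.

\[
\begin{aligned}
x_{k+1} &= A x_k + B u_k, \qquad J = \sum_{k=0}^{\infty}\bigl(x_k^\top Q^* x_k + u_k^\top R^* u_k\bigr),\\
P &= A^\top P A - A^\top P B\,(R + B^\top P B)^{-1} B^\top P A + Q,\\
K &= (R + B^\top P B)^{-1} B^\top P A, \qquad u_k = -K x_k .
\end{aligned}
\]

Observing the expert's trajectories implicitly fixes the gain K* induced by the unknown pair (Q*, R*); the learner then searches for a cost pair whose optimal gain reproduces K*. Since many (Q, R) pairs yield the same gain, the reconstructed cost is nonunique, which is the key property noted in the abstract.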
3
Li T, Yang D, Xie X, Zhang H. Event-Triggered Control of Nonlinear Discrete-Time System With Unknown Dynamics Based on HDP(λ). IEEE Transactions on Cybernetics 2022; 52:6046-6058. [PMID: 33531312] [DOI: 10.1109/tcyb.2020.3044595]
Abstract
The heuristic dynamic programming HDP(λ)-based optimal control strategy, which incorporates a long-term prediction parameter λ in an iterative manner, noticeably accelerates learning and reduces the computational complexity caused by the state-associated extra variable in the λ-return computation of traditional value-gradient learning. However, as the number of iterations increases, the calculation cost grows dramatically, which poses a serious challenge for optimal control with limited bandwidth and computational units. In this article, we propose an event-triggered HDP(λ) (ETHDP(λ)) optimal control strategy for nonlinear discrete-time (NDT) systems with unknown dynamics. The iterative relation for the λ-return of the final target value is derived first. An event-triggered condition ensuring system stability is designed to reduce the computation and communication requirements. Next, we build a model-actor-critic neural network (NN) structure, in which the model NN evaluates the system state to obtain the λ-return of the current target value, which is used to form the real-time update errors of the critic NN. The event-triggered optimal control signal and the one-step-return value are approximated by the actor and critic NNs, respectively. Uniformly ultimately bounded (UUB) stability of the system state and NN weight errors under event triggering is then demonstrated using Lyapunov techniques. Finally, we illustrate the effectiveness of the proposed ETHDP(λ) strategy in two cases.
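Two of the ingredients mentioned above can be sketched generically: the recursive λ-return used as a critic target and a simple event-triggering test that only releases a new control update when the state has drifted enough. The recursion is the standard one; the trigger norm and threshold are illustrative assumptions, not the paper's stability-derived condition.

import numpy as np

def lambda_returns(rewards, values, gamma=0.99, lam=0.9):
    """Backward recursion G_t = r_t + gamma*((1-lam)*V(s_{t+1}) + lam*G_{t+1}).
    `values` must contain one more entry than `rewards` (bootstrap value at the end)."""
    G = np.zeros(len(rewards))
    next_return = values[-1]                  # bootstrap from the final value estimate
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1]
        next_return = rewards[t] + gamma * ((1 - lam) * next_value + lam * next_return)
        G[t] = next_return
    return G

def event_triggered(x, x_last_sent, threshold=0.05):
    """Hypothetical trigger: update/transmit the control only when the gap exceeds a bound."""
    return np.linalg.norm(x - x_last_sent) > threshold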
4
Toward reliable designs of data-driven reinforcement learning tracking control for Euler–Lagrange systems. Neural Netw 2022; 153:564-575. [DOI: 10.1016/j.neunet.2022.05.017]
5
Perrusquia A, Yu W. Neural H₂ Control Using Continuous-Time Reinforcement Learning. IEEE Transactions on Cybernetics 2022; 52:4485-4494. [PMID: 33232250] [DOI: 10.1109/tcyb.2020.3028988]
Abstract
In this article, we discuss continuous-time H₂ control for unknown nonlinear systems. We use differential neural networks to model the system and then apply H₂ tracking control based on the neural model. Since the neural H₂ control is very sensitive to the neural modeling error, we use reinforcement learning to improve the control performance. The stability of both the neural modeling and the H₂ tracking control is proven, and the convergence of the approach is also given. The proposed method is validated on two benchmark control problems.
6
An Enhanced Full-Form Model-Free Adaptive Controller for SISO Discrete-Time Nonlinear Systems. Entropy 2022; 24:e24020163. [PMID: 35205458] [PMCID: PMC8871481] [DOI: 10.3390/e24020163]
Abstract
This study focuses on the full-form model-free adaptive controller (FFMFAC) for SISO discrete-time nonlinear systems and proposes an enhanced FFMFAC (EFFMFAC). The proposed design incorporates long short-term memory neural networks (LSTMs) and fuzzy neural networks (FNNs): LSTMs adjust vital parameters of the FFMFAC online, and, owing to the strong nonlinear approximation capability of FNNs, the pseudo-gradient (PG) values of the controller are estimated online. EFFMFAC uses only measured I/O data for the online training of all introduced neural networks and requires neither offline training nor a specific model of the controlled system. Finally, its rationality and superiority are verified through two simulations and a supporting ablation analysis. Five individual performance indices are reported, and the experimental findings show that EFFMFAC outperforms all other methods; in particular, compared with FFMFAC, EFFMFAC reduces the RMSE by 21.69% and 11.21%, respectively, proving it applicable to SISO discrete-time nonlinear systems.
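To make the underlying structure concrete, the sketch below shows a compact-form MFAC step, i.e., the projection-type pseudo-gradient estimate followed by the control update. This is deliberately the simpler compact-form variant with fixed hand-picked parameters; the paper's scheme is the full-form version with LSTM-tuned parameters and an FNN pseudo-gradient estimator.

def mfac_step(y, y_prev, u_prev, u_prev2, phi_prev, y_ref,
              eta=0.5, mu=1.0, rho=0.5, lam=1.0):
    """One compact-form MFAC update for a SISO discrete-time plant (illustrative only)."""
    du_prev = u_prev - u_prev2
    dy = y - y_prev
    # Online pseudo-gradient (PG) estimate via a projection-type update.
    phi = phi_prev + eta * du_prev / (mu + du_prev**2) * (dy - phi_prev * du_prev)
    if abs(phi) < 1e-5:                  # standard reset to keep the estimate away from zero
        phi = phi_prev
    # Control update driven by the tracking error w.r.t. the next reference sample.
    u = u_prev + rho * phi / (lam + phi**2) * (y_ref - y)
    return u, phi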
7
Perrusquia A, Yu W. Discrete-Time H₂ Neural Control Using Reinforcement Learning. IEEE Transactions on Neural Networks and Learning Systems 2021; 32:4879-4889. [PMID: 33017294] [DOI: 10.1109/tnnls.2020.3026010]
Abstract
In this article, we discuss H₂ control for unknown nonlinear systems in discrete time. A discrete-time recurrent neural network is used to model the nonlinear system, and the H₂ tracking control is then applied based on the neural model. Since this neural H₂ control is very sensitive to the neural modeling error, we use reinforcement learning and another neural approximator to improve the tracking accuracy and robustness of the controller. The stability of both the neural identifier and the H₂ tracking control is proven, and the convergence of the approach is also given. The proposed method is validated on the control of a pan-and-tilt robot and a surge tank.
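As a toy sketch of the identification ingredient only (the H₂ tracking law and the RL compensation are not reproduced), the loop below trains a simple one-layer recurrent identifier of the form x̂(k+1) = A x̂(k) + W σ(V x(k)) + u(k) by gradient descent on the identification error; the stand-in plant, dimensions, and learning rate are assumptions.

import numpy as np

rng = np.random.default_rng(0)
n = 2
A = 0.5 * np.eye(n)                      # fixed stable linear part of the identifier
V = rng.standard_normal((n, n))          # fixed hidden-layer weights
W = 0.1 * rng.standard_normal((n, n))    # trainable output weights

def plant(x, u):                         # stand-in for the unknown nonlinear plant
    return 0.8 * x + 0.1 * np.sin(x) + u

x = np.zeros(n)
x_hat = np.zeros(n)
for k in range(200):
    u = 0.1 * rng.standard_normal(n)             # exploratory input
    sigma = np.tanh(V @ x)
    x_hat_next = A @ x_hat + W @ sigma + u        # identifier prediction
    x_next = plant(x, u)
    e = x_hat_next - x_next                       # identification error
    W -= 0.05 * np.outer(e, sigma)                # gradient step on 0.5*||e||^2 w.r.t. W
    x, x_hat = x_next, x_hat_next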
8
Abstract
A general control-system tracking-learning framework is proposed, by which an optimal learned tracking behavior called a 'primitive' is extrapolated to new, unseen trajectories without requiring relearning. This is considered intelligent behavior and is strongly related to the neuro-motor cognitive control of biological (human-like) systems, which deliver suboptimal executions for tasks outside their current knowledge base by using previously memorized experience. However, biological systems do not solve explicit mathematical equations for learning and prediction tasks. This motivates the proposed hierarchical cognitive-like learning framework, based on state-of-the-art model-free control: (1) at the low level L1, an approximate iterative Value Iteration is first employed to linearize the closed-loop system (CLS) behavior through linear reference model output tracking; (2) at the secondary level L2, experiment-driven Iterative Learning Control (EDILC), applied to the CLS from the reference input to the controlled output, learns simple tracking tasks called 'primitives'; and (3) the tertiary level L3 extrapolates the primitives' optimal tracking behavior to new tracking tasks without trial-based relearning. The learning framework relies only on input-output system data to build a virtual state-space representation of the underlying controlled system, which is assumed to be observable. It has been shown to be effective by experimental validation on a representative, coupled, nonlinear, multivariable real-world system. Able to cope with new, unseen scenarios in an optimal fashion, the hierarchical learning framework is an advance toward cognitive control systems.
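The L2 "primitive"-learning ingredient can be sketched in isolation as a P-type iterative learning control (ILC) update that refines a feedforward input over repeated trials of the same reference; the stand-in closed-loop model, learning gain, and trial count below are illustrative assumptions, not the experiment-driven ILC of the paper.

import numpy as np

T = 50
ref = np.sin(np.linspace(0, 2 * np.pi, T))      # one simple tracking task (a "primitive")
u = np.zeros(T)                                 # feedforward input refined trial by trial

def closed_loop(u):
    """Stand-in for the already-stabilized closed-loop system (first-order lag)."""
    y = np.zeros(T)
    for k in range(T - 1):
        y[k + 1] = 0.9 * y[k] + 0.1 * u[k]
    return y

for trial in range(30):
    y = closed_loop(u)
    e = ref - y
    u[:-1] += 0.8 * e[1:]                       # P-type ILC: u_{j+1}(k) = u_j(k) + L*e_j(k+1)

print(np.max(np.abs(ref - closed_loop(u))))     # residual tracking error after learning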
9
Chiluka SK, Ambati SR, Seepana MM, Babu Gara UB. A novel robust Virtual Reference Feedback Tuning approach for minimum and non-minimum phase systems. ISA Transactions 2021; 115:163-191. [PMID: 33454053] [DOI: 10.1016/j.isatra.2021.01.018]
Abstract
In real-world applications, it is often desired that the design of a closed-loop system attain not only high performance but also robustness. This paper presents a novel robustness-based formulation for the control of discrete-time minimum and non-minimum phase systems using the Virtual Reference Feedback Tuning (VRFT) framework. The proposed idea is to design robust lower- and higher-order Proportional-Integral-Derivative (PID) controllers using a Maximum Sensitivity (Ms)-based closed-loop reference model, by minimizing the VRFT objective function (JVR). The efficacy of the approach is verified on various systems in simulation and is also demonstrated on Temperature Control Process and Level Control Process experimental setups. The proposed VRFT approach provides improved performance and robustness for set-point tracking and disturbance rejection, and it ensures stability and the specified robustness of the closed-loop system. Further, the fragility of the controller is investigated for perturbations in the controller parameters.
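A minimal sketch of the basic VRFT computation (without the paper's Ms-based robustness shaping or data prefiltering) is given below for a first-order reference model M(z) = (1-a)/(z-a): build the virtual reference from measured output data, form the virtual error, and fit PID gains by least squares. The stand-in plant, data, and reference-model pole are assumptions.

import numpy as np

Ts, a = 0.1, 0.8                                    # sample time and reference-model pole
rng = np.random.default_rng(1)

# Open-loop experiment on a stand-in plant (unknown to the designer).
N = 400
u = rng.uniform(-1, 1, N)
y = np.zeros(N)
for k in range(N - 1):
    y[k + 1] = 0.9 * y[k] + 0.2 * u[k]

# Virtual reference r_v such that M maps r_v to the measured y: y(k+1) = a*y(k) + (1-a)*r_v(k).
r_v = (y[1:] - a * y[:-1]) / (1 - a)
e_v = r_v - y[:-1]                                  # virtual tracking error

# PID regressor: proportional, integral (cumulative sum * Ts), and derivative terms.
phi = np.column_stack([e_v,
                       Ts * np.cumsum(e_v),
                       np.r_[0.0, np.diff(e_v)] / Ts])
theta, *_ = np.linalg.lstsq(phi, u[:-1], rcond=None)
print("Kp, Ki, Kd approx.:", theta)                 # gains of the fitted PID controller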
Affiliation(s)
- Suresh Kumar Chiluka: Department of Chemical Engineering, National Institute of Technology, Warangal, Telangana, 506 004, India.
- Seshagiri Rao Ambati: Department of Chemical Engineering, National Institute of Technology, Warangal, Telangana, 506 004, India.
- Murali Mohan Seepana: Department of Chemical Engineering, National Institute of Technology, Warangal, Telangana, 506 004, India.
- Uday Bhaskar Babu Gara: Department of Chemical Engineering, National Institute of Technology, Warangal, Telangana, 506 004, India.
10
Abstract
In this paper, we propose an adaptive data-driven control approach for linear time-varying systems affected by bounded measurement noise. The plant to be controlled is assumed to be unknown, and no information regarding its time-varying behaviour is exploited. First, using set-membership identification techniques, we formulate the controller design problem through a model-matching scheme, i.e., designing a controller such that the closed-loop behaviour matches that of a given reference model. The problem is then reformulated so as to derive the controller whose parameters have the minimum variation bound. Finally, a convex relaxation approach is proposed to solve the formulated controller design problem by means of linear programming. The effectiveness of the proposed scheme is demonstrated by means of two simulation examples.
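The set-membership ingredient alone can be sketched as follows: with bounded noise, every measurement adds two linear inequalities on the parameters, so the feasible parameter set is a polytope and linear programming yields tight per-parameter bounds. The plant, noise bound, and regressor below are illustrative; the paper goes further and selects the controller with minimum parameter variation via a convex relaxation.

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(2)
N, eps = 60, 0.05
theta_true = np.array([0.7, 0.3])

phi = rng.uniform(-1, 1, (N, 2))                    # regressors
y = phi @ theta_true + rng.uniform(-eps, eps, N)    # bounded-noise measurements

# Feasible set: -eps <= y - phi@theta <= eps  ->  A_ub @ theta <= b_ub
A_ub = np.vstack([phi, -phi])
b_ub = np.concatenate([y + eps, eps - y])

intervals = []
for i in range(2):
    c = np.zeros(2); c[i] = 1.0
    lo = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * 2).fun
    hi = -linprog(-c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * 2).fun
    intervals.append((lo, hi))
print("parameter intervals:", intervals)            # tight bounds on each parameter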
11
Improved Data Association of Hypothesis-Based Trackers Using Fast and Robust Object Initialization. Sensors 2021; 21:s21093146. [PMID: 34062836] [PMCID: PMC8125535] [DOI: 10.3390/s21093146]
Abstract
The tracking of Vulnerable Road Users (VRUs) is one of the vital tasks of autonomous cars and includes estimating the positions and velocities of the VRUs surrounding a car. To do this, VRU trackers must utilize measurements received from sensors. However, even the most accurate VRU trackers are affected by measurement noise, background clutter, and VRU interaction and occlusion. Such uncertainties can cause deviations in the sensors' data association, thereby leading to dangerous situations and potentially even the failure of a tracker. The initialization of data association depends on various parameters. This paper proposes steps to reveal the trade-offs between stochastic model parameters in order to improve the accuracy of data association in autonomous cars. The proposed steps reduce the number of false tracks and are independent of variations in measurement noise and the number of VRUs; our initialization also reduces the lag between the first detection and the initialization of the VRU trackers. As a proof of concept, the procedure is validated using experiments, simulation data, and the publicly available KITTI dataset, and our initialization method is compared with the most popular approaches found in the literature. The results show that tracking precision and accuracy increase by up to 3.6% with the proposed initialization compared to state-of-the-art VRU-tracking algorithms.
12
Virtual State Feedback Reference Tuning and Value Iteration Reinforcement Learning for Unknown Observable Systems Control. Energies 2021. [DOI: 10.3390/en14041006]
Abstract
In this paper, a novel Virtual State-Feedback Reference Tuning (VSFRT) method and Approximate Iterative Value Iteration Reinforcement Learning (AI-VIRL) are applied to learning linear reference model output (LRMO) tracking control of observable systems with unknown dynamics. For the observable system, a new state representation in terms of input/output (IO) data is derived. Consequently, the Virtual Reference Feedback Tuning (VRFT)-based solution is redefined to accommodate virtual state-feedback control, leading to an original stability-certified VSFRT concept. Both VSFRT and AI-VIRL use neural-network controllers. We find that AI-VIRL is significantly more computationally demanding and more sensitive to the exploration settings, while leading to inferior LRMO tracking performance compared to VSFRT; nor does transfer-learning the VSFRT controller as an initialization for AI-VIRL help. State dimensionality reduction using machine learning techniques such as principal component analysis and autoencoders does not improve on the best learned tracking performance, although it trades off the learning complexity. Surprisingly, unlike AI-VIRL, the VSFRT control is one-shot (non-iterative) and learns stabilizing controllers even in poorly explored open-loop environments, proving superior in learning LRMO tracking control. Validation on two nonlinear, coupled, multivariable complex systems serves as a comprehensive case study.
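The virtual-state idea can be sketched independently of the controller training: for an observable system of assumed order n, the last n inputs and outputs are stacked into a virtual state z(k) on which a state-feedback controller is then learned. The window length and synthetic IO record below are assumptions; the VSFRT/AI-VIRL training itself is not reproduced.

import numpy as np

def virtual_states(u, y, n=3):
    """z(k) = [y(k), ..., y(k-n+1), u(k-1), ..., u(k-n)] built from IO samples."""
    z = []
    for k in range(n, len(y)):
        z.append(np.concatenate([y[k - n + 1:k + 1][::-1], u[k - n:k][::-1]]))
    return np.array(z)

u = np.random.randn(200)
y = np.random.randn(200)       # stand-in IO record from an open-loop experiment
Z = virtual_states(u, y)
print(Z.shape)                 # (197, 6): 197 virtual states of dimension 2*n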
13
Bai W, Li T, Tong S. NN Reinforcement Learning Adaptive Control for a Class of Nonstrict-Feedback Discrete-Time Systems. IEEE Transactions on Cybernetics 2020; 50:4573-4584. [PMID: 31995515] [DOI: 10.1109/tcyb.2020.2963849]
Abstract
This article investigates an adaptive reinforcement learning (RL) optimal control design problem for a class of nonstrict-feedback discrete-time systems. Based on the approximation ability of neural networks (NNs) and the RL control design technique, an adaptive backstepping RL optimal controller and a minimal learning parameter (MLP) adaptive RL optimal controller are developed by establishing a novel strategic utility function and introducing external function terms. It is proved that the proposed adaptive RL optimal controllers guarantee that all signals in the closed-loop systems are semiglobally uniformly ultimately bounded (SGUUB). The main feature is that the proposed schemes can solve optimal control problems that the previous literature cannot handle. Furthermore, the proposed MLP adaptive optimal control scheme reduces the number of adaptive laws, and thus the computational complexity is decreased. Finally, the simulation results illustrate the validity of the proposed optimal control schemes.
14
Disturbance Observer and L2-Gain-Based State Error Feedback Linearization Control for the Quadruple-Tank Liquid-Level System. Energies 2020. [DOI: 10.3390/en13205500]
Abstract
This paper proposes a new state error feedback linearization control method with a disturbance observer (DOB) and L2 gain for a quadruple-tank liquid-level system. First, given the highly nonlinear and strongly coupled characteristics of the quadruple-tank system, a state error feedback linearization technique is employed to design the controller for liquid-level position control and tracking control. Second, a DOB is designed to estimate uncertain exogenous disturbances and is applied for compensation control. Moreover, an L2-gain disturbance attenuation technique is designed to address the class of disturbances caused by uncertain parameter perturbations in the quadruple-tank liquid-level system. Finally, compared with classical proportional-integral-derivative (PID) and sliding mode control (SMC) methods, extensive experimental results validate that the proposed strategy has good position control, tracking control, and disturbance rejection performance.
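A toy sketch of the DOB ingredient alone, reduced to a single tank with level dynamics h' = f(h) + b*u + d, is given below: the observer integrates the mismatch between the measured level change and the nominal model, so the estimate converges toward slowly varying disturbances. The single-tank simplification, nominal model, and gains are assumptions, not the paper's four-tank design.

import numpy as np

Ts, b, l_gain = 0.05, 1.0, 2.0

def f(h):                                  # nominal outflow term (Torricelli-like)
    return -0.5 * np.sqrt(max(h, 0.0))

h, d_hat = 1.0, 0.0
for k in range(400):
    u = 0.6                                            # some control input
    d = 0.2 * np.sin(0.05 * k)                         # unknown disturbance
    h_prev = h
    h = h + Ts * (f(h) + b * u + d)                    # "true" plant step
    # DOB: correct d_hat with the gap between measured and predicted level change,
    # so that d_hat' ~ l_gain * (d - d_hat).
    d_hat += Ts * l_gain * ((h - h_prev) / Ts - (f(h_prev) + b * u + d_hat))

print(d_hat, d)                                        # estimate tracks the disturbance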
15
Design of control framework based on deep reinforcement learning and Monte-Carlo sampling in downstream separation. Comput Chem Eng 2020. [DOI: 10.1016/j.compchemeng.2020.106910]
16
Alves Goulart D, Dutra Pereira R. Autonomous pH control by reinforcement learning for electroplating industry wastewater. Comput Chem Eng 2020. [DOI: 10.1016/j.compchemeng.2020.106909]
17
Abstract
Nowadays, deep learning is the fastest growing research field in machine learning and has a tremendous impact on a plethora of daily life applications, ranging from security and surveillance to autonomous driving, automatic indexing and retrieval of media content, text analysis, speech recognition, automatic translation, and many others [...]
18
Zhang C, Gan M, Zhao J, Xue C. Data-Driven Suboptimal Scheduling of Switched Systems. Sensors 2020; 20:s20051287. [PMID: 32120901] [PMCID: PMC7085537] [DOI: 10.3390/s20051287]
Abstract
In this paper, a data-driven optimal scheduling approach is investigated for continuous-time switched systems with unknown subsystems and infinite-horizon cost functions. First, a policy iteration (PI)-based algorithm is proposed to quickly approximate the optimal switching policy online for known switched systems. Second, a data-driven PI-based algorithm that works online solely on system state data is proposed for switched systems with unknown subsystems. Function approximators are introduced, and their weight vectors are obtained step by step from different data within the algorithm; the weight vectors are then employed to approximate the switching policy and the cost function. The convergence and the performance are analyzed. Finally, the simulation results of two examples validate the effectiveness of the proposed approaches.
19
Long Short-Term Memory Neural Network Applied to Train Dynamic Model and Speed Prediction. Algorithms 2019. [DOI: 10.3390/a12080173]
Abstract
The automatic train operation system is a significant component of intelligent railway transportation. As a fundamental problem, the construction of the train dynamic model has been extensively researched using parametric approaches, but parametric models may perform poorly because of unrealistic assumptions and changing environments. In this paper, a long short-term memory (LSTM) network is carefully developed to build the train dynamic model in a nonparametric way. By optimizing the hyperparameters of the proposed model, more accurate outputs can be obtained with the same inputs as the parametric approaches. The proposed model was compared with two parametric methods using actual data, and the experimental results suggest that its performance is better than that of traditional models thanks to its strong learning ability. Through a detailed feature-engineering process, the proposed LSTM-based algorithm was further extended to predict train speed multiple steps ahead.
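A minimal PyTorch sketch of such a one-step LSTM dynamics model is shown below: from a short history of (speed, control) pairs, predict the next speed. The window length, layer sizes, features, and synthetic data are illustrative assumptions; multi-step prediction would be obtained by feeding predictions back recursively.

import torch
import torch.nn as nn

class TrainDynamicsLSTM(nn.Module):
    def __init__(self, n_features=2, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)        # next-step speed

    def forward(self, x):                       # x: (batch, window, n_features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])         # use the last hidden state

model = TrainDynamicsLSTM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.randn(64, 10, 2)                      # stand-in (speed, traction/brake) windows
y = torch.randn(64, 1)                          # stand-in next-step speeds
for epoch in range(5):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()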
20
Learning Output Reference Model Tracking for Higher-Order Nonlinear Systems with Unknown Dynamics. Algorithms 2019. [DOI: 10.3390/a12060121]
Abstract
This work suggests a solution for the output reference model (ORM) tracking control problem based on approximate dynamic programming. General nonlinear systems are included in a control system (CS) and subjected to state feedback. By selecting a linear ORM, indirect feedback linearization of the CS is obtained, leading to favorable linear behavior of the CS. The Value Iteration (VI) algorithm ensures model-free learning of a nonlinear state-feedback controller, without relying on the process dynamics. From linear to nonlinear parameterizations, a reliable approximate VI implementation in continuous state-action spaces depends on several key parameters such as the problem dimension, the exploration of the state-action space, the size of the state-transitions dataset, and a suitable selection of the function approximators. Herein, we find that, given a transition-sample dataset and a general linear parameterization of the Q-function, the ORM tracking performance obtained with an approximate VI scheme can reach the performance level of a more general implementation using neural networks (NNs). Although the NN-based implementation takes more time to learn owing to its higher complexity (more parameters), it is less sensitive to the exploration settings, the number of transition samples, and the selected hyper-parameters; hence it is recommended as the de facto practical implementation. Contributions of this work include the following: VI convergence is guaranteed under general function approximators; a case study for a low-order linear system is used to generalize toward the more complex ORM tracking validation on a real-world nonlinear multivariable aerodynamic process; comparisons with an offline deep deterministic policy gradient solution are given; and implementation details and further discussion of the obtained results are provided.
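The linearly parameterized variant compared above can be sketched as batch approximate Value Iteration (fitted Q-iteration) over a transition dataset: at each sweep, Bellman targets are formed from the cost plus the discounted best next-state Q-value, and the weights are refit by least squares. The feature basis, action grid, cost, and data below are illustrative; the paper's ORM-tracking cost penalizes deviation from a reference-model output.

import numpy as np

rng = np.random.default_rng(3)
gamma, n_iter = 0.95, 60
actions = np.linspace(-1.0, 1.0, 11)                   # discretized action grid

def features(s, a):
    return np.array([1.0, s, a, s * a, s**2, a**2])    # quadratic basis for Q(s, a)

# Batch of transitions (s, a, cost, s') from an exploratory experiment.
S = rng.uniform(-2, 2, 2000)
A = rng.choice(actions, 2000)
S_next = 0.8 * S + 0.3 * A + 0.01 * rng.standard_normal(2000)
C = S_next**2 + 0.1 * A**2                              # track a zero reference

theta = np.zeros(6)
Phi = np.array([features(s, a) for s, a in zip(S, A)])
for _ in range(n_iter):
    # Bellman targets: cost plus discounted minimum Q over actions at the next state.
    q_next = np.array([[features(sp, ap) @ theta for ap in actions] for sp in S_next])
    targets = C + gamma * q_next.min(axis=1)
    theta, *_ = np.linalg.lstsq(Phi, targets, rcond=None)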