1. Aubret A, Matignon L, Hassas S. An Information-Theoretic Perspective on Intrinsic Motivation in Reinforcement Learning: A Survey. Entropy (Basel, Switzerland) 2023; 25:327. PMID: 36832693; PMCID: PMC9954873; DOI: 10.3390/e25020327.
Abstract
The reinforcement learning (RL) research area is very active, with a large number of new contributions, especially in the emergent field of deep RL (DRL). However, several scientific and technical challenges remain open, among them the ability to abstract actions and the difficulty of exploring the environment in sparse-reward settings, both of which can be addressed by intrinsic motivation (IM). We propose to survey these research works through a new taxonomy based on information theory: we computationally revisit the notions of surprise, novelty, and skill learning. This allows us to identify the advantages and disadvantages of existing methods and to exhibit current research outlooks. Our analysis suggests that novelty and surprise can assist the building of a hierarchy of transferable skills that abstracts dynamics and makes the exploration process more robust.
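One widely used computational notion of novelty surveyed in this line of work is a count-based exploration bonus: the intrinsic reward shrinks as a state is revisited. The sketch below is a minimal, illustrative formulation; the class name and the 1/sqrt(n) schedule are assumptions, not the survey's specific definition.

```python
import math
from collections import defaultdict

class NoveltyBonus:
    """Count-based novelty bonus: intrinsic reward decays with visit count."""

    def __init__(self, scale=1.0):
        self.counts = defaultdict(int)  # state -> number of visits so far
        self.scale = scale

    def reward(self, state):
        # Each visit increments the count, so repeated states earn less bonus.
        self.counts[state] += 1
        return self.scale / math.sqrt(self.counts[state])

bonus = NoveltyBonus()
r1 = bonus.reward("s0")  # first visit: full bonus of 1.0
r2 = bonus.reward("s0")  # second visit: 1/sqrt(2), already less novel
```

In practice such a bonus is added to the extrinsic reward before the RL update, which is one way novelty-driven IM makes sparse-reward exploration more tractable.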
2. A DDQN Path Planning Algorithm Based on Experience Classification and Multi Steps for Mobile Robots. Electronics 2022. DOI: 10.3390/electronics11142120.
Abstract
Constrained by the sizes of the action space and state space, Q-learning cannot be applied to continuous state spaces. Targeting this problem, the double deep Q network (DDQN) algorithm and corresponding improvements were explored. First, to improve the accuracy of the DDQN algorithm in estimating the target Q value during training, a multi-step guided strategy was introduced into the traditional DDQN algorithm, in which the single-step reward is replaced with the reward obtained over continuous multi-step interactions of the mobile robot. Furthermore, an experience classification training method was introduced, in which the state transitions generated by robot-environment interaction are divided into two different experience pools, both pools are used to train the Q network, and their sampling proportions are updated according to the training loss. Afterward, the advantages of the multi-step guided DDQN (MS-DDQN) and experience classification DDQN (EC-DDQN) algorithms were combined to develop a novel experience classification multi-step DDQN (ECMS-DDQN) algorithm. Finally, path planning with the four algorithms (DDQN, MS-DDQN, EC-DDQN, and ECMS-DDQN) was simulated on the OpenAI Gym platform. The simulation results reveal that ECMS-DDQN outperforms the other three in total return and in generalization for path planning.
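The multi-step replacement described above amounts to using a discounted n-step return as the TD target, bootstrapping from the target network only after n rewards. A minimal sketch (function name and the bootstrap value are illustrative, not the paper's exact notation):

```python
def n_step_target(rewards, q_target_next, gamma=0.99, done=False):
    """Discounted n-step return: sum of gamma^k * r_k over the collected
    rewards, plus gamma^n * Q_target(s_{t+n}, a*) if the episode continues."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    if not done:
        # Bootstrap from the target network's value estimate after n steps.
        g += (gamma ** len(rewards)) * q_target_next
    return g

# Three-step example: rewards from three consecutive interactions, then bootstrap.
t = n_step_target([1.0, 0.0, 1.0], q_target_next=2.0, gamma=0.5)
# 1.0 + 0.5*0.0 + 0.25*1.0 + 0.125*2.0 = 1.5
```

Compared with the single-step target, the multi-step target propagates reward information backward faster, which is the accuracy improvement MS-DDQN aims for.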
3. Indoor Emergency Path Planning Based on the Q-Learning Optimization Algorithm. ISPRS International Journal of Geo-Information 2022. DOI: 10.3390/ijgi11010066.
Abstract
The internal structure of buildings is becoming increasingly complex. Providing a scientific and reasonable evacuation route for trapped persons in a complex indoor environment is important for reducing casualties and property losses. In emergency and disaster relief environments, indoor path planning carries great uncertainty and higher safety requirements. Q-learning is a value-based reinforcement learning algorithm that can complete path planning tasks through autonomous learning without establishing mathematical models or environmental maps. We therefore propose an indoor emergency path planning method based on a Q-learning optimization algorithm. First, a grid environment model is established. A discount rate on the exploration factor is used to optimize the Q-learning algorithm, and the exploration factor in the ε-greedy strategy is dynamically adjusted before random actions are selected, accelerating the convergence of Q-learning in a large-scale grid environment. Indoor emergency path planning experiments based on the optimized algorithm were carried out using both simulated data and real indoor environment data. The proposed Q-learning optimization algorithm essentially converges after 500 learning iterations, nearly 2000 rounds earlier than the classic Q-learning algorithm, while the SARSA algorithm shows no obvious convergence trend within 5000 iterations. The results show that the proposed Q-learning optimization algorithm is superior to the SARSA algorithm and the classic Q-learning algorithm in solving time and convergence speed when planning the shortest path in a grid environment; its convergence is approximately five times faster than that of the classic Q-learning algorithm, and it can successfully plan the shortest path around obstacle areas in a short time.
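The dynamically decayed ε-greedy selection described above can be sketched as follows; the particular decay rate, floor value, and variable names are illustrative assumptions, not the paper's exact schedule.

```python
import random

def select_action(q_row, epsilon, n_actions):
    """Epsilon-greedy: explore with probability epsilon, else act greedily
    on the Q values of the current state (q_row)."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: q_row[a])

# Multiplicative decay of the exploration factor per episode, with a floor
# so the agent never becomes fully greedy.
epsilon, decay, eps_min = 1.0, 0.99, 0.05
for episode in range(100):
    epsilon = max(eps_min, epsilon * decay)
```

Shrinking ε this way front-loads exploration and shifts the agent toward exploitation, which is what speeds convergence in large grid environments.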
4. dos Santos DDML, Santana EEC, Junior PFDS, Queiroz JA, Neto JVDF, Barros AK, Augusto de Moraes Cruz C, de Aquino VS, de Castro LSO, Freire RCS, Silva PHDF. Autofocus Entropy Repositioning Method Bioinspired in the Magnetic Field Memory of the Bees Applied to Pollination. Sensors 2021; 21:6198. PMID: 34577405; PMCID: PMC8472858; DOI: 10.3390/s21186198.
Abstract
In this paper, a method bioinspired by the magnetic field memory of bees, applied to a precision pollination rover, is presented. The method calculates sharpness features in real time through the entropy and the variance of the Laplacian of images segmented by color in the HSV system. A complementary positioning method based on extracting area features between active markers was developed, analyzing the color characteristics, noise, and vibrations of the probe in time and frequency through the probe's lateral image. The observed results show that the unsupervised method requires no prior calibration of target dimensions, histogram, or the distances involved in positioning. The algorithm showed lower sensitivity in extracting sharpness characteristics with respect to the number of edges and greater sensitivity to the gradient, allowing operation in unforeseen scenarios, even under small sharpness variations, with a robust response to local, temporal, and geophysical variance of the magnetic declination, requiring no luminosity after scanning, with the two degrees of freedom of rotation.
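The variance of the Laplacian mentioned above is a standard focus/sharpness measure: a sharp image has strong second-derivative responses at edges, so the Laplacian response varies widely. A minimal pure-Python sketch using the 4-neighbour Laplacian kernel (the grayscale input format and kernel choice are assumptions, not the paper's pipeline):

```python
def laplacian_variance(img):
    """img: 2D list of grayscale values. Returns the variance of the
    4-neighbour Laplacian response over interior pixels."""
    h, w = len(img), len(img[0])
    vals = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Discrete Laplacian: sum of the 4 neighbours minus 4x the centre.
            lap = (img[y - 1][x] + img[y + 1][x]
                   + img[y][x - 1] + img[y][x + 1]
                   - 4 * img[y][x])
            vals.append(lap)
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)

flat = [[10] * 5 for _ in range(5)]                            # uniform patch
edge = [[0] * 5 for _ in range(2)] + [[10] * 5 for _ in range(3)]  # sharp edge
sharper = laplacian_variance(edge) > laplacian_variance(flat)
```

A defocused image smooths the edges, lowering the variance, which is why the measure can drive an autofocus loop without prior calibration.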
Affiliation(s)
- Daniel de Matos Luna dos Santos
- Graduate Program in Electrical Engineering, Federal University of Maranhão, Sao Luis 65085-580, Brazil
- Correspondence: ; Tel.: +55-98-985-428-512
- Ewaldo Eder Carvalho Santana
- Graduate Program in Computation Engineering and Systems, State University of Maranhão, Sao Luis 65081-000, Brazil
- Paulo Fernandes da Silva Junior
- Graduate Program in Computation Engineering and Systems, State University of Maranhão, Sao Luis 65081-000, Brazil
- Jonathan Araujo Queiroz
- Graduate Program in Electrical Engineering, Federal University of Maranhão, Sao Luis 65085-580, Brazil
- João Viana da Fonseca Neto
- Graduate Program in Electrical Engineering, Federal University of Maranhão, Sao Luis 65085-580, Brazil
- Allan Kardec Barros
- Graduate Program in Electrical Engineering, Federal University of Maranhão, Sao Luis 65085-580, Brazil
- Carlos Augusto de Moraes Cruz
- Graduate Program in Electrical Engineering, Federal University of Amazonas, Manaus 69080-900, Brazil
- Viviane S. de Aquino
- Graduate Program in Electrical Engineering, Federal University of Amazonas, Manaus 69080-900, Brazil
- Luís S. O. de Castro
- Graduate Program in Electrical Engineering, Federal University of Amazonas, Manaus 69080-900, Brazil
- Paulo Henrique da Fonseca Silva
- Coordination of Postgraduate Studies in Electrical Engineering, Federal Institute of Paraíba, Joao Pessoa 58059-900, Brazil

5.
Abstract
To address several problems of the traditional Q-learning algorithm, such as heavy repetition and imbalance of exploration, a reinforcement-exploration strategy was used to replace the decayed ε-greedy strategy of traditional Q-learning, yielding a novel self-adaptive reinforcement-exploration Q-learning (SARE-Q) algorithm. First, the concept of a behavior utility trace was introduced, and the probability of each action being chosen is adjusted according to this trace, improving exploration efficiency. Second, the attenuation of the exploration factor ε was designed in two phases: the first centers on exploration, the second shifts the focus from exploration to exploitation, and the exploration rate is dynamically adjusted according to the success rate. Finally, by maintaining a list of state visit counts, the exploration factor of the current state is adaptively adjusted according to the number of times that state has been visited. A symmetric grid-map environment was built on the OpenAI Gym platform to run simulation experiments on the Q-learning, self-adaptive Q-learning (SA-Q), and SARE-Q algorithms. The experimental results show that the proposed algorithm has obvious advantages over the first two in the average number of turns, average success rate, and number of runs achieving the shortest planned route.
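The state-visit-count adaptation described in the final step can be sketched as a per-state exploration schedule; the 1/(1+n) decay and the class name below are illustrative assumptions, not SARE-Q's exact formula.

```python
from collections import defaultdict

class StateAdaptiveEpsilon:
    """Per-state exploration factor: epsilon for a state shrinks as that
    state is visited more often, with a global floor eps_min."""

    def __init__(self, eps0=1.0, eps_min=0.05):
        self.visits = defaultdict(int)  # the "list of state access times"
        self.eps0, self.eps_min = eps0, eps_min

    def epsilon(self, state):
        self.visits[state] += 1
        # Frequently visited states are explored less; rare states keep
        # a high exploration rate, reducing wasteful repetition.
        return max(self.eps_min, self.eps0 / (1 + self.visits[state]))

sched = StateAdaptiveEpsilon()
e1 = sched.epsilon("s")  # first visit
e2 = sched.epsilon("s")  # second visit, lower epsilon
```

Compared with a single global decayed ε, this keeps exploration pressure exactly where the agent still lacks experience.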
6.
Abstract
This paper presents a novel bio-inspired predictive model of visual navigation inspired by mammalian navigation. The model takes inspiration from specific types of neurons observed in the brain, namely place cells, grid cells, and head direction cells. In the proposed model, place cells are structures that store and connect local representations of the explored environment, while grid and head direction cells make predictions based on these representations to define the position of the agent in a place cell's reference frame. This specific use of navigation cells has three advantages. First, the environment representations stored by place cells require only a few spatialized descriptors or elements, making the model suitable for integrating large-scale environments (indoor and outdoor). Second, the grid cell modules act as an efficient visual and absolute odometry system. Third, the model provides sequential spatial tracking that can integrate and track an agent in redundant environments or environments with very few or no distinctive cues, while being very robust to environmental changes. This paper focuses on the formalization of the architecture and on the main elements and properties of the model. The model has been successfully validated on basic functions: mapping, guidance, homing, and finding shortcuts. The precision of the estimated agent position and the robustness to environmental changes during navigation were shown to be satisfactory. The proposed predictive model is intended to be used on autonomous platforms, but also to assist visually impaired people in their mobility.
7. Yu X, Wang P, Zhang Z. Learning-Based End-to-End Path Planning for Lunar Rovers with Safety Constraints. Sensors 2021; 21:796. PMID: 33504073; PMCID: PMC7866010; DOI: 10.3390/s21030796.
Abstract
Path planning is an essential technology for a lunar rover to achieve safe and efficient autonomous exploration; this paper proposes a learning-based end-to-end path planning algorithm for lunar rovers with safety constraints. First, a training environment integrating real lunar surface terrain data was built in the Gazebo simulator, and a lunar rover model was created in it to simulate the real lunar surface environment and the rover system. Then an end-to-end path planning algorithm based on deep reinforcement learning was designed, including the state space, action space, network structure, a reward function considering slip behavior, and a training method based on proximal policy optimization. In addition, to improve generalization to different lunar surface topographies and environment scales, a variety of training scenarios were set up to train the network model following the idea of curriculum learning. The simulation results show that the proposed algorithm successfully achieves end-to-end path planning for the lunar rover, and the generated paths carry a higher safety guarantee than those of classical path planning algorithms.
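A reward function "considering slip behavior" of the kind described can be sketched as goal progress minus slip and collision penalties. All weights, the slip-ratio definition, and the function name below are illustrative assumptions, not the paper's actual reward design.

```python
def step_reward(dist_prev, dist_now, slip_ratio, collided,
                w_progress=1.0, w_slip=0.5, w_collision=10.0):
    """One-step reward for the rover.
    slip_ratio: assumed here as 1 - (actual / commanded displacement), in [0, 1].
    """
    reward = w_progress * (dist_prev - dist_now)  # positive when nearer the goal
    reward -= w_slip * slip_ratio                 # penalize slippery terrain
    if collided:
        reward -= w_collision                     # strong safety penalty
    return reward

# Rover moved 1 m closer to the goal with 20% slip and no collision.
r = step_reward(dist_prev=5.0, dist_now=4.0, slip_ratio=0.2, collided=False)
```

Under PPO, such a shaped reward steers the learned policy away from high-slip terrain, which is one plausible mechanism behind the reported safety margin.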
Affiliation(s)
- Xiaoqiang Yu
- School of Astronautics, Harbin Institute of Technology, Harbin 150002, China
- Ping Wang
- China Academy of Space Technology, Beijing 100094, China
- Zexu Zhang
- School of Astronautics, Harbin Institute of Technology, Harbin 150002, China
- Correspondence:
8. Zhang K, Yang Y, Fu M, Wang M. Traversability Assessment and Trajectory Planning of Unmanned Ground Vehicles with Suspension Systems on Rough Terrain. Sensors 2019; 19:4372. PMID: 31658645; PMCID: PMC6833019; DOI: 10.3390/s19204372.
Abstract
This paper presents a traversability assessment method and a trajectory planning method, key features for the navigation of an unmanned ground vehicle (UGV) in a non-planar environment. In this work, a 3D light detection and ranging (LiDAR) sensor is used to obtain geometric information about a rough terrain surface. For a given SE(2) pose of the vehicle and a specific vehicle model, the SE(3) pose of the vehicle is estimated from the LiDAR points, and a traversability is then computed. The traversability tells the vehicle the effects of its interaction with the rough terrain. Note that traversability is computed on demand during trajectory planning, so there is no explicit terrain discretization. The proposed trajectory planner finds an initial path through non-holonomic A*, a modified form of the conventional A* planner; a path is a sequence of poses without timestamps. The initial path is then optimized with respect to the traversability using the method of Lagrange multipliers. The optimization accounts for the model of the vehicle's suspension system, so the optimized trajectory is dynamically feasible and the trajectory tracking error is small. The proposed methods were tested in both simulation and real-world experiments. The simulation experiments were conducted in Gazebo, which uses a physics engine to compute the vehicle motion, across various non-planar environments. The results indicate that the proposed methods can accurately estimate the SE(3) pose of the vehicle, and the trajectory cost of the proposed planner was lower than that of other state-of-the-art trajectory planners.
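A traversability computed on demand from the estimated SE(3) pose can be sketched as a cost on the fitted roll and pitch angles: the more the vehicle must tilt to conform to the terrain, the less traversable the pose. The linear cost and the 30-degree limit below are illustrative assumptions, not the paper's actual metric.

```python
import math

def traversability_cost(roll, pitch, max_tilt=math.radians(30)):
    """Cost in [0, 1] from the terrain-fitted attitude of the vehicle.
    1.0 marks the pose as untraversable (tilt at or beyond the limit)."""
    tilt = max(abs(roll), abs(pitch))
    if tilt >= max_tilt:
        return 1.0
    return tilt / max_tilt  # linearly worse as the vehicle tilts more

flat_cost = traversability_cost(0.0, 0.0)                  # flat ground
steep_cost = traversability_cost(math.radians(35), 0.0)    # beyond the limit
```

Because this cost is a function of pose rather than of a precomputed grid, the planner can query it only where candidate trajectories actually go, matching the on-demand evaluation described in the abstract.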
Affiliation(s)
- Kai Zhang
- School of Automation, Beijing Institute of Technology, Beijing 100081, China
- Yi Yang
- School of Automation, Beijing Institute of Technology, Beijing 100081, China
- Mengyin Fu
- School of Automation, Beijing Institute of Technology, Beijing 100081, China
- School of Automation, Nanjing University of Science and Technology, Nanjing 210094, China
- Meiling Wang
- School of Automation, Beijing Institute of Technology, Beijing 100081, China