MarsExplorer: Exploration of Unknown Terrains via Deep Reinforcement Learning and Procedurally Generated Environments

Cited by: 5
Authors
Koutras, Dimitrios I. [1,2]
Kapoutsis, Athanasios C. [2]
Amanatiadis, Angelos A. [3]
Kosmatopoulos, Elias B. [1,2]
Affiliations
[1] Democritus Univ Thrace, Dept Elect & Comp Engn, Xanthi 67100, Greece
[2] Informat Technol Inst, Ctr Res & Technol, Thessaloniki 57001, Greece
[3] Democritus Univ Thrace, Dept Prod & Management Engn, Xanthi 67100, Greece
Funding
European Union's Horizon 2020;
Keywords
Deep Reinforcement Learning; OpenAI gym; exploration; unknown terrains; MULTIROBOT; ALGORITHM;
DOI
10.3390/electronics10222751
Chinese Library Classification (CLC)
TP [automation technology; computer technology];
Subject classification code
0812;
Abstract
This paper is an initial endeavor to bridge the gap between powerful Deep Reinforcement Learning methodologies and the problem of exploration/coverage of unknown terrains. Within this scope, MarsExplorer, an OpenAI-Gym-compatible environment tailored to the exploration/coverage of unknown areas, is presented. MarsExplorer translates the original robotics problem into a Reinforcement Learning setup that various off-the-shelf algorithms can tackle. Any learned policy can be applied straightforwardly to a robotic platform, without requiring an elaborate simulation model of the robot's dynamics or an additional learning/adaptation phase. One of its core features is the controllable, multi-dimensional procedural generation of terrains, which is key to producing policies with strong generalization capabilities. Four state-of-the-art RL algorithms (A3C, PPO, Rainbow, and SAC) are trained on the MarsExplorer environment, and their results are evaluated against average human-level performance. In the follow-up experimental analysis, the effect of the multi-dimensional difficulty setting on the learning capabilities of the best-performing algorithm (PPO) is analyzed. A milestone result is the generation of an exploration policy that follows the Hilbert curve, without this structure being encoded in the environment or Hilbert-curve-like trajectories being rewarded, directly or indirectly. The experimental analysis concludes by evaluating the PPO-learned policy side by side with frontier-based exploration strategies. A study of the performance curves revealed that the PPO-based policy was capable of performing adaptive-to-the-unknown-terrain sweeping without leaving expensive-to-revisit areas uncovered, underlining the capability of RL-based methodologies to tackle exploration tasks efficiently.
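Because MarsExplorer exposes the standard OpenAI Gym interface, any policy, learned or scripted, interacts with it through the usual reset/step loop. Below is a minimal sketch of that loop. The package name mars_explorer and the environment id exploConf-v1 are assumptions made for illustration (the exact registered ids are defined in the project's repository), and the random action stands in for a trained policy such as the PPO agent evaluated in the paper.

```python
# Minimal sketch of driving a Gym-compatible exploration environment.
# NOTE: "mars_explorer:exploConf-v1" is an assumed/hypothetical env id;
# check the MarsExplorer repository for the actual registration string.
import gym

env = gym.make("mars_explorer:exploConf-v1")

obs = env.reset()           # initial observation of the (unknown) terrain
done = False
episode_return = 0.0

while not done:
    action = env.action_space.sample()  # placeholder for a trained policy
    obs, reward, done, info = env.step(action)
    episode_return += reward

print(f"episode return: {episode_return:.2f}")
env.close()
```

The same loop works unchanged for any of the four algorithms reported in the paper: only the line that selects the action changes, since an off-the-shelf agent maps the observation to an action instead of sampling at random.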
Pages: 15