MarsExplorer: Exploration of Unknown Terrains via Deep Reinforcement Learning and Procedurally Generated Environments

Cited by: 5
Authors
Koutras, Dimitrios I. [1,2]
Kapoutsis, Athanasios C. [2]
Amanatiadis, Angelos A. [3]
Kosmatopoulos, Elias B. [1,2]
Affiliations
[1] Democritus Univ Thrace, Dept Elect & Comp Engn, Xanthi 67100, Greece
[2] Informat Technol Inst, Ctr Res & Technol, Thessaloniki 57001, Greece
[3] Democritus Univ Thrace, Dept Prod & Management Engn, Xanthi 67100, Greece
Funding
European Union's Horizon 2020;
Keywords
Deep Reinforcement Learning; OpenAI gym; exploration; unknown terrains; MULTIROBOT; ALGORITHM;
DOI
10.3390/electronics10222751
Chinese Library Classification (CLC)
TP [automation technology; computer technology];
Subject classification code
0812;
Abstract
This paper is an initial endeavor to bridge the gap between powerful Deep Reinforcement Learning methodologies and the problem of exploration/coverage of unknown terrains. Within this scope, MarsExplorer, an OpenAI-Gym-compatible environment tailored to the exploration/coverage of unknown areas, is presented. MarsExplorer translates the original robotics problem into a Reinforcement Learning setup that various off-the-shelf algorithms can tackle. Any learned policy can be applied straightforwardly to a robotic platform, without requiring an elaborate simulation model of the robot's dynamics or an additional learning/adaptation phase. One of its core features is the controllable, multi-dimensional procedural generation of terrains, which is key to producing policies with strong generalization capabilities. Four state-of-the-art RL algorithms (A3C, PPO, Rainbow, and SAC) are trained on the MarsExplorer environment, and their results are evaluated against average human-level performance. In the follow-up experimental analysis, the effect of the multi-dimensional difficulty setting on the learning capabilities of the best-performing algorithm (PPO) is analyzed. A milestone result is the generation of an exploration policy that follows the Hilbert curve, without this structure being encoded in the environment or Hilbert-curve-like trajectories being rewarded, directly or indirectly. The experimental analysis concludes by evaluating the PPO-learned policy side by side with frontier-based exploration strategies. A study of the performance curves revealed that the PPO-based policy was capable of performing adaptive-to-the-unknown-terrain sweeping without leaving expensive-to-revisit areas uncovered, underlining the capability of RL-based methodologies to tackle exploration tasks efficiently.
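Because MarsExplorer exposes the standard OpenAI Gym interface, any policy, learned or scripted, interacts with it through the usual reset/step loop. Below is a minimal sketch of that loop. The package name mars_explorer and the environment id exploConf-v1 are assumptions made for illustration (the exact registered ids are defined in the project's repository), and the random action stands in for a trained policy such as the PPO agent evaluated in the paper.

```python
# Minimal sketch of driving a Gym-compatible exploration environment.
# NOTE: "mars_explorer:exploConf-v1" is an assumed/hypothetical env id;
# check the MarsExplorer repository for the actual registration string.
import gym

env = gym.make("mars_explorer:exploConf-v1")

obs = env.reset()           # initial observation of the (unknown) terrain
done = False
episode_return = 0.0

while not done:
    action = env.action_space.sample()  # placeholder for a trained policy
    obs, reward, done, info = env.step(action)
    episode_return += reward

print(f"episode return: {episode_return:.2f}")
env.close()
```

The same loop works unchanged for any of the four algorithms reported in the paper: only the line that selects the action changes, since an off-the-shelf agent maps the observation to an action instead of sampling at random.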
Pages: 15