Safe Policy Improvement for POMDPs via Finite-State Controllers

Cited by: 0
Authors
Simao, Thiago D. [1 ]
Suilen, Marnix [1 ]
Jansen, Nils [1 ]
Affiliations
[1] Radboud Univ Nijmegen, Dept Software Sci, Nijmegen, Netherlands
Source
THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 12 | 2023
Funding
European Research Council;
Keywords
MARKOV-PROCESSES;
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We study safe policy improvement (SPI) for partially observable Markov decision processes (POMDPs). SPI is an offline reinforcement learning (RL) problem that assumes access to (1) historical data about an environment, and (2) the so-called behavior policy that previously generated this data by interacting with the environment. SPI methods require access neither to a model nor to the environment itself, and aim to reliably improve upon the behavior policy in an offline manner. Existing methods make the strong assumption that the environment is fully observable. In our novel approach to the SPI problem for POMDPs, we assume that a finite-state controller (FSC) represents the behavior policy and that finite memory is sufficient to derive optimal policies. This assumption allows us to map the POMDP to a finite-state fully observable MDP, the history MDP. We estimate this MDP by combining the historical data and the memory of the FSC, and compute an improved policy using an off-the-shelf SPI algorithm. The underlying SPI method constrains the policy space according to the available data, such that the newly computed policy differs from the behavior policy only when sufficient data is available. We show that this new policy, converted into a new FSC for the (unknown) POMDP, outperforms the behavior policy with high probability. Experimental results on several well-established benchmarks show the applicability of the approach, even in cases where finite memory is not sufficient.
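To make the pipeline described in the abstract concrete, the following is a minimal, hypothetical Python sketch, not the authors' code: it pairs the FSC's memory nodes with observations to form history-MDP states, estimates transition counts from the offline data, and deviates from the behavior policy only where enough data was observed, using a count threshold in the spirit of SPI with baseline bootstrapping (SPIBB) as a stand-in for whatever off-the-shelf SPI algorithm is applied. All names here (FSC, N_WEDGE, estimate_history_mdp, greedy_action) are illustrative assumptions.

```python
"""Hypothetical sketch of SPI for POMDPs via a finite-state controller."""
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class FSC:
    """Minimal finite-state controller: a stochastic action distribution per
    (memory node, observation) and a deterministic memory update."""
    action_probs: dict   # (node, obs) -> {action: prob}  (the behavior policy)
    node_update: dict    # (node, obs) -> next memory node
    initial_node: int = 0

N_WEDGE = 20  # count threshold below which the behavior policy is kept

def estimate_history_mdp(episodes, fsc):
    """Replay the FSC's memory update along each logged trajectory so that
    every observation is paired with a memory node, and count transitions of
    the resulting finite, fully observable history MDP."""
    counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}
    for episode in episodes:  # episode: [(obs, action, reward, next_obs), ...]
        node = fsc.initial_node
        for obs, action, _reward, next_obs in episode:
            next_node = fsc.node_update[(node, obs)]
            s, s_next = (node, obs), (next_node, next_obs)
            counts[(s, action)][s_next] += 1
            node = next_node
    return counts

def improved_policy(counts, fsc, greedy_action):
    """SPIBB-style improvement: follow a greedy action (assumed to come from
    planning on the estimated history MDP) only in states where every logged
    action was seen at least N_WEDGE times; otherwise keep the FSC's policy."""
    visits = defaultdict(list)  # s -> observation counts of each action in s
    for (s, _action), successors in counts.items():
        visits[s].append(sum(successors.values()))
    policy = {}
    for s, action_counts in visits.items():
        if min(action_counts) >= N_WEDGE:
            policy[s] = {greedy_action(s): 1.0}   # enough data: improve
        else:
            policy[s] = fsc.action_probs[s]       # too little data: stay safe
    return policy
```

Because each history-MDP state is a (memory node, observation) pair, a policy over these states translates directly back into a new FSC for the original, unknown POMDP, which is the form in which the abstract states its high-probability improvement guarantee.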
Pages: 15109-15117
Number of pages: 9