Skill Disentanglement for Imitation Learning from Suboptimal Demonstrations

Cited by: 0

Authors:

Zhao, Tianxiang [1]

Yu, Wenchao [2]

Wang, Suhang [1]

Wang, Lu [3]

Zhang, Xiang [1]

Chen, Yuncong [2]

Liu, Yanchi [2]

Cheng, Wei [2]

Chen, Haifeng [2]

Affiliations:

[1] Penn State Univ, University Pk, PA 16802 USA

[2] NEC Labs Amer, Princeton, NJ USA

[3] East China Normal Univ, Shanghai, Peoples R China

Source:

PROCEEDINGS OF THE 29TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, KDD 2023 | 2023

Funding:

U.S. National Science Foundation

Keywords:

imitation learning; hierarchical reinforcement learning; skill discovery; noisy data

DOI:

10.1145/3580305.3599506

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology]

Discipline classification code:

0812

Abstract:

Imitation learning has achieved great success in many sequential decision-making tasks, in which a neural agent is trained by imitating collected human demonstrations. However, existing algorithms typically require a large number of high-quality demonstrations, which are difficult and expensive to collect. In practice, a trade-off usually has to be made between demonstration quality and quantity. To address this problem, we consider imitation from sub-optimal demonstrations, given both a small clean demonstration set and a large noisy set. Some pioneering works have been proposed, but they suffer from limitations, e.g., assuming that a demonstration has the same optimality at every time step, and failing to provide any interpretation of the knowledge learned from the noisy set. To address these problems, we propose SDIL, which evaluates and imitates at the sub-demonstration level, encoding action primitives of varying quality into different skills. Concretely, SDIL consists of a high-level controller that discovers skills and a skill-conditioned module that captures action-taking policies. It is trained with a two-phase pipeline: skills are first discovered using all demonstrations, and the controller is then adapted to the clean set only. A mutual-information-based regularization and a dynamic sub-demonstration optimality estimator are designed to promote disentanglement in the skill space. Extensive experiments on two Gym environments and a real-world healthcare dataset demonstrate the superiority of SDIL in learning from sub-optimal demonstrations, and an examination of the learned skills shows its improved interpretability.
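The two-phase ordering described in the abstract can be illustrated with a deliberately simplified tabular toy. This is only a sketch under strong assumptions and not the authors' neural implementation: the function names (`discover_skills`, `adapt_controller`, `act`) are hypothetical, skill discovery is reduced to collecting the distinct actions seen in any demonstration, and the mutual-information regularization and optimality estimator are omitted entirely.

```python
from collections import Counter, defaultdict


def discover_skills(all_demos):
    """Phase 1 (toy): build a skill library from ALL demonstrations.

    Assumption: each distinct action is treated as one "skill".
    A demonstration is a list of (state, action) pairs.
    """
    actions = sorted({action for demo in all_demos for _, action in demo})
    return {skill_id: action for skill_id, action in enumerate(actions)}


def adapt_controller(skills, clean_demos):
    """Phase 2 (toy): fit a state -> skill controller on the CLEAN set only.

    The skill library from phase 1 is kept fixed; only the controller changes.
    """
    action_to_skill = {action: skill_id for skill_id, action in skills.items()}
    votes = defaultdict(Counter)
    for demo in clean_demos:
        for state, action in demo:
            votes[state][action_to_skill[action]] += 1
    # Pick the most frequently used skill for each state seen in clean data.
    return {state: counts.most_common(1)[0][0] for state, counts in votes.items()}


def act(state, controller, skills):
    """Select a skill with the controller, then emit that skill's action."""
    return skills[controller[state]]


# Hypothetical toy data: one clean and one noisy demonstration.
clean_demos = [[("s0", "left"), ("s1", "right")]]
noisy_demos = [[("s0", "up"), ("s1", "right"), ("s2", "left")]]

skills = discover_skills(clean_demos + noisy_demos)
controller = adapt_controller(skills, clean_demos)
print(act("s0", controller, skills))
```

In this toy, the noisy demonstration contributes the extra skill `"up"` to the library, but the controller never selects it, because the controller is fitted on the clean set alone.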


Pages: 3513-3524

Page count: 12
