Policy Learning with Human Reinforcement

Cited by: 4
Authors
Hwang, Kao-Shing [1 ,2 ]
Lin, Jin-Ling [3 ]
Shi, Haobin [4 ]
Chen, Yu-Ying [5 ]
Affiliations
[1] Natl Sun Yat Sen Univ, Dept Elect Engn, Kaohsiung, Taiwan
[2] Kaohsiung Med Univ, Dept Healthcare Adm & Med Informat, Kaohsiung, Taiwan
[3] Shih Hsin Univ, Dept Informat Management, Taipei 11678, Taiwan
[4] Northwestern Polytech Univ, Sch Comp Sci, Xian, Peoples R China
[5] Natl Chung Cheng Univ, Dept Elect Engn, Chiayi, Taiwan
Keywords
Knowledge transfer; Reinforcement learning; Relative entropy; Reward shaping; Robot; Systems
DOI
10.1007/s40815-016-0194-9
Chinese Library Classification (CLC) number
TP [Automation & Computer Technology]
Discipline code
0812
Abstract
A reinforcement learning agent learns an optimal policy for a given environment under an evaluative reward function. In some applications, such an agent may lack the adaptability to handle scenarios that are merely similar to the one encountered during learning. In other words, a learning agent may be expected to satisfy minor demands that are not part of the reward function. This paper proposes an interactive approach that accommodates both human reinforcement and environmental rewards in a shaped reward function, so that a robot can be coached despite a modified goal or an inaccurate reward. The proposed approach coaches a robot, already equipped with a reinforcement learning mechanism, through human reinforcement feedback to overcome insufficiencies or shortsightedness in the environmental reward function. The proposed reinforcement learning algorithm links direct policy evaluation and human reinforcement to autonomous robots, shaping the reward function by combining both reinforcement signals. Relative information entropy is applied to resolve conflicts between the human reinforcement and the robot's core policy, yielding more effective learning. In this work, the human coaching is conveyed by a bystander's facial expressions, which a type-2 fuzzy system transforms into a scalar index. Simulated and experimental results show that a short-sighted robot could walk successfully through a swamp, and that an under-powered car could reach the top of a mountain with coaching from a bystander. The learning system worked quickly enough that the robot could continually adapt to an altered goal or environment.
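The general idea of combining an environmental reward with a human reinforcement signal into one shaped reward can be illustrated with a minimal sketch. This is an assumption-laden toy, not the authors' algorithm: a tabular Q-learner on a one-dimensional "swamp" corridor, where the environmental reward is short-sighted (a flat step cost until the goal) and a stand-in for the bystander's facial-expression index (the hypothetical `human_reward` function and weight `BETA` below) nudges the agent toward the goal.

```python
import random

N_STATES = 8          # states 0..7; state 7 is the goal
ACTIONS = (-1, +1)    # step left / step right
BETA = 0.5            # weight of the human signal in the shaped reward (assumed)

def env_reward(s_next):
    # Short-sighted environmental reward: only reaching the goal pays off.
    return 10.0 if s_next == N_STATES - 1 else -1.0

def human_reward(s, s_next):
    # Hypothetical stand-in for the bystander's scalar index:
    # approval (+1) for moving toward the goal, disapproval (-1) otherwise.
    return 1.0 if s_next > s else -1.0

def train(episodes=500, alpha=0.1, gamma=0.95, eps=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s = 0
        while s != N_STATES - 1:
            # epsilon-greedy action selection
            if rng.random() < eps:
                a = rng.randrange(2)
            else:
                a = max((0, 1), key=lambda i: q[s][i])
            s_next = min(max(s + ACTIONS[a], 0), N_STATES - 1)
            # Shaped reward: environmental reward plus weighted human signal.
            r = env_reward(s_next) + BETA * human_reward(s, s_next)
            q[s][a] += alpha * (r + gamma * max(q[s_next]) - q[s][a])
            s = s_next
    return q

q = train()
# Greedy policy after training: action index 1 means "step right".
policy = [max((0, 1), key=lambda i: q[s][i]) for s in range(N_STATES - 1)]
print(policy)
```

In this sketch the two signals happen to agree; the paper's relative-entropy mechanism addresses the harder case where the human feedback and the robot's core policy conflict.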
Pages: 618-629
Page count: 12