Cross coordination of behavior clone and reinforcement learning for autonomous within-visual-range air combat

Cited by: 2
Authors
Li, Lun [1 ,2 ]
Zhang, Xuebo [1 ]
Qian, Chenxu [1 ,2 ]
Zhao, Minghui [1 ,2 ]
Wang, Runhua [1 ,2 ]
Affiliations
[1] Nankai Univ, Inst Robot & Automat Informat Syst, Coll Artificial Intelligence, Tianjin, Peoples R China
[2] Nankai Univ, Tianjin Key Lab Intelligent Robot, Tianjin, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
WVR air combat; Fixed-wing plane; Behavior clone; PPO; IMITATION; LEVEL; GAME;
DOI
10.1016/j.neucom.2024.127591
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
In this article, we propose a novel hierarchical framework to resolve within-visual-range (WVR) air-to-air combat under the complex nonlinear six-degrees-of-freedom (6-DOF) dynamics of the aircraft and missile. The decision process is constructed with two layers, from top to bottom, and reinforcement learning is adopted to solve them separately. The top layer uses a new combat policy to decide the autopilot commands (such as the target heading, velocity, and altitude) and missile launch according to the current combat situation. The bottom layer then uses a control policy to execute the autopilot commands by calculating the actual input signals (deflections of the rudder, elevator, and aileron, and the throttle setting) for the aircraft. For the combat policy, we present a new learning method called "E2L" that mimics the knowledge of the expert under the two-layer decision framework to bootstrap the intelligence of the agent in the early stage of training. This method establishes a cross coordination of behavior clone (BC) and proximal policy optimization (PPO). Under this mechanism, the agent is alternately updated around the latest strategy, using BC with gradient clipping and PPO with a Kullback-Leibler divergence loss and the modified BC demonstration trajectories, so that competitive combat strategies can be learned more stably and quickly. Extensive experimental results show that the proposed method achieves better combat performance than the baselines.
Pages: 13
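The abstract's "E2L" mechanism alternates behavior-clone (BC) updates on expert demonstrations with PPO updates that carry an added Kullback-Leibler (KL) divergence term, both applied around the latest policy. The following Python/PyTorch sketch is an illustrative reconstruction of that alternation, not the authors' implementation: the Gaussian policy head, network sizes, hyperparameters, and random placeholder data are assumptions made here for self-containment.

```python
# Sketch only: alternating BC (with gradient clipping) and PPO (with a KL penalty)
# updates around the latest policy, loosely following the abstract's description.
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Simple Gaussian policy over continuous autopilot-style commands (assumed head)."""
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                 nn.Linear(64, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def dist(self, obs: torch.Tensor) -> torch.distributions.Normal:
        return torch.distributions.Normal(self.net(obs), self.log_std.exp())

def bc_update(policy, optimizer, demo_obs, demo_act, clip_norm=0.5):
    """One BC step: maximize demonstration log-likelihood with gradient-norm clipping."""
    loss = -policy.dist(demo_obs).log_prob(demo_act).sum(-1).mean()
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(policy.parameters(), clip_norm)
    optimizer.step()
    return loss.item()

def ppo_kl_update(policy, optimizer, obs, act, old_logp, adv,
                  clip_eps=0.2, kl_coef=0.5):
    """One PPO step: clipped surrogate objective plus a KL penalty to the old policy."""
    logp = policy.dist(obs).log_prob(act).sum(-1)
    ratio = torch.exp(logp - old_logp)
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)
    approx_kl = (old_logp - logp).mean()          # sample-based KL estimate
    loss = -surrogate.mean() + kl_coef * approx_kl
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    obs_dim, act_dim = 12, 4                      # e.g. commands for heading, velocity, altitude
    policy = GaussianPolicy(obs_dim, act_dim)
    opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
    for it in range(10):                          # alternate BC and PPO around the latest policy
        demo_obs, demo_act = torch.randn(32, obs_dim), torch.randn(32, act_dim)
        bc_loss = bc_update(policy, opt, demo_obs, demo_act)
        obs, act = torch.randn(32, obs_dim), torch.randn(32, act_dim)
        with torch.no_grad():
            old_logp = policy.dist(obs).log_prob(act).sum(-1)
        adv = torch.randn(32)                     # placeholder advantages
        ppo_loss = ppo_kl_update(policy, opt, obs, act, old_logp, adv)
        print(f"iter {it}: bc={bc_loss:.3f} ppo={ppo_loss:.3f}")
```

In this reading, clipping the BC gradients and penalizing KL divergence in the PPO objective both serve to keep each alternating update close to the most recent policy, which matches the stability argument made in the abstract; the exact losses, demonstration handling, and schedule used in the paper may differ.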