Modern Defence Technology ›› 2022, Vol. 50 ›› Issue (5): 36-42. DOI: 10.3969/j.issn.1009-086x.2022.05.006

• Military Intelligence •

Multi-agent Autonomous Cooperative Confrontation Based on Meta Curriculum Reinforcement Learning

Ji-shi-yu DING, Ke-wu SUN, Bo DONG, Xi-rui YANG, Chang-chao FAN, Zhe MA

  1. XLAB, The Second Academy of CASIC, Beijing 100854, China
  • Received: 2021-12-16  Revised: 2022-05-19  Online: 2022-10-28  Published: 2022-11-03
  • About the author: Ji-shi-yu DING (1993-), male, born in Baoding, Hebei, China; engineer, Ph.D.; research interest: multi-agent reinforcement learning.
  • Supported by: National Natural Science Foundation of China (62103386)

Abstract:

Multi-agent cooperative confrontation is characterized by real-time, continuous actions, incomplete information, a huge search space, multiple complex tasks, and spatio-temporal reasoning, and it remains one of the most challenging problems in artificial intelligence. To address the long training time and poor convergence of large-scale multi-agent reinforcement learning, this paper proposes an Actor-Critic-based multi-agent cooperative confrontation framework. A meta curriculum reinforcement learning method is used to extract basic-course meta-models from small-scale scenarios; these meta-models are then transferred to large-scale scenarios via curriculum learning, where training continues on top of the meta-model and its policy network is expanded, finally yielding an effective cooperative strategy. Simulation experiments on the StarCraft II platform show that the proposed technique effectively accelerates training: compared with conventional training it reaches a higher win rate in a shorter time, improving training speed by about 40%. The method supports efficient generation of multi-agent cooperative confrontation strategies and lays a theoretical foundation for efficient reinforcement learning training under low-resource conditions.

Key words: multi-agent, reinforcement learning, cooperative confrontation, meta curriculum learning, high efficiency training
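
As a rough illustration of the curriculum-transfer step described in the abstract, the sketch below (PyTorch; not the authors' code) shows how an actor-critic meta-model trained on a small-scale scenario could seed a larger-scenario network: the overlapping weight slices are copied, the newly added rows and columns keep their random initialisation, and training then continues on the large-scale scenario. All class names, dimensions, and the slice-copying rule are illustrative assumptions.

```python
# Minimal sketch (assumed names and dimensions, not the paper's code):
# expand a small-scale actor-critic "meta-model" into a larger policy network
# and copy the overlapping weights before continuing training at large scale.
import torch
import torch.nn as nn


class ActorCritic(nn.Module):
    """Shared-trunk actor-critic network for one curriculum stage."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.actor = nn.Linear(hidden, n_actions)  # policy logits
        self.critic = nn.Linear(hidden, 1)         # state-value estimate

    def forward(self, obs: torch.Tensor):
        h = self.trunk(obs)
        return self.actor(h), self.critic(h)


def expand_from_meta(meta: ActorCritic, obs_dim: int, n_actions: int) -> ActorCritic:
    """Build a network for the larger scenario and initialise every weight slice
    that overlaps with the small-scenario meta-model; the newly added rows and
    columns keep their fresh random initialisation and are learned later."""
    big = ActorCritic(obs_dim, n_actions, hidden=meta.trunk[0].out_features)
    with torch.no_grad():
        for p_big, p_meta in zip(big.parameters(), meta.parameters()):
            overlap = tuple(slice(0, min(a, b)) for a, b in zip(p_big.shape, p_meta.shape))
            p_big[overlap].copy_(p_meta[overlap])
    return big


if __name__ == "__main__":
    # Stage 1: small-scale scenario (short observation vector, few actions).
    meta_model = ActorCritic(obs_dim=30, n_actions=9)
    # ... train meta_model here with any actor-critic algorithm ...

    # Stage 2: large-scale scenario (longer observations, more actions).
    big_model = expand_from_meta(meta_model, obs_dim=120, n_actions=14)
    logits, value = big_model(torch.randn(4, 120))
    print(logits.shape, value.shape)  # torch.Size([4, 14]) torch.Size([4, 1])
```

Continuing to optimise the expanded network on the large-scale maps, rather than training from scratch, is the curriculum-transfer effect to which the abstract attributes the roughly 40% improvement in training speed.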

CLC number: