Modern Defence Technology ›› 2022, Vol. 50 ›› Issue (5): 36-42. DOI: 10.3969/j.issn.1009-086x.2022.05.006

• Military Intelligence •

Multi-agent Autonomous Cooperative Confrontation Based on Meta Curriculum Reinforcement Learning

Ji-shi-yu DING, Ke-wu SUN, Bo DONG, Xi-rui YANG, Chang-chao FAN, Zhe MA

  1. XLAB, The Second Academy of CASIC, Beijing 100854, China
  • Received: 2021-12-16  Revised: 2022-05-19  Online: 2022-10-28  Published: 2022-11-03
  • About the author: Ji-shi-yu DING (1993-), male, born in Baoding, Hebei, China; engineer, Ph.D.; research interest: multi-agent reinforcement learning.
  • Supported by: National Natural Science Foundation of China (62103386)

Abstract:

Multi-agent cooperative confrontation is characterized by real-time, continuous actions, incomplete information, a huge search space, multiple complex tasks, and spatio-temporal reasoning, and it remains one of the most challenging problems in artificial intelligence. To address the long training time and poor convergence of large-scale multi-agent reinforcement learning, this paper proposes an Actor-Critic-based multi-agent cooperative confrontation framework. A meta curriculum reinforcement learning method is used to extract basic-course meta-models from small-scale scenarios; these meta-models are then transferred to large-scale scenarios via curriculum learning, where training continues on top of the meta-model and its policy network is expanded, finally yielding an effective cooperative strategy. Simulation experiments on the StarCraft II platform show that the proposed technique effectively accelerates training: compared with conventional training it reaches a higher win rate in a shorter time, improving training speed by about 40%. The method supports efficient generation of multi-agent cooperative confrontation strategies and lays a theoretical foundation for efficient reinforcement learning training under low-resource conditions.

Key words: multi-agent, reinforcement learning, cooperative confrontation, meta curriculum learning, high efficiency training
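
As a rough illustration of the curriculum-transfer step described in the abstract, the sketch below (PyTorch; not the authors' code) shows how an actor-critic meta-model trained on a small-scale scenario could seed a larger-scenario network: the overlapping weight slices are copied, the newly added rows and columns keep their random initialisation, and training then continues on the large-scale scenario. All class names, dimensions, and the slice-copying rule are illustrative assumptions.

```python
# Minimal sketch (assumed names and dimensions, not the paper's code):
# expand a small-scale actor-critic "meta-model" into a larger policy network
# and copy the overlapping weights before continuing training at large scale.
import torch
import torch.nn as nn


class ActorCritic(nn.Module):
    """Shared-trunk actor-critic network for one curriculum stage."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.actor = nn.Linear(hidden, n_actions)  # policy logits
        self.critic = nn.Linear(hidden, 1)         # state-value estimate

    def forward(self, obs: torch.Tensor):
        h = self.trunk(obs)
        return self.actor(h), self.critic(h)


def expand_from_meta(meta: ActorCritic, obs_dim: int, n_actions: int) -> ActorCritic:
    """Build a network for the larger scenario and initialise every weight slice
    that overlaps with the small-scenario meta-model; the newly added rows and
    columns keep their fresh random initialisation and are learned later."""
    big = ActorCritic(obs_dim, n_actions, hidden=meta.trunk[0].out_features)
    with torch.no_grad():
        for p_big, p_meta in zip(big.parameters(), meta.parameters()):
            overlap = tuple(slice(0, min(a, b)) for a, b in zip(p_big.shape, p_meta.shape))
            p_big[overlap].copy_(p_meta[overlap])
    return big


if __name__ == "__main__":
    # Stage 1: small-scale scenario (short observation vector, few actions).
    meta_model = ActorCritic(obs_dim=30, n_actions=9)
    # ... train meta_model here with any actor-critic algorithm ...

    # Stage 2: large-scale scenario (longer observations, more actions).
    big_model = expand_from_meta(meta_model, obs_dim=120, n_actions=14)
    logits, value = big_model(torch.randn(4, 120))
    print(logits.shape, value.shape)  # torch.Size([4, 14]) torch.Size([4, 1])
```

Continuing to optimise the expanded network on the large-scale maps, rather than training from scratch, is the curriculum-transfer effect to which the abstract attributes the roughly 40% improvement in training speed.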

CLC number: