Modern Defence Technology (现代防御技术), 2024, Vol. 52, Issue (2): 63-71. DOI: 10.3969/j.issn.1009-086x.2024.02.007

• Military Intelligence •

Policy Transfer Reinforcement Learning Method for Partially Observable Conditions

Zhongyu WANG (王忠禹), Xiaopeng XU (徐晓鹏), Dong WANG (王东)

  1. School of Control Science and Engineering, Dalian University of Technology, Dalian 116024, Liaoning, China
  • Received: 2023-12-29 Revised: 2024-02-27 Online: 2024-04-28 Published: 2024-04-29
  • About the author: Zhongyu WANG (born 1999), male, from Suihua, Heilongjiang. Master's student; research interests: cooperative confrontation of unmanned swarms and multi-agent reinforcement learning. E-mail: zhongyu@mail.dlut.edu.cn
  • Funding:
    National Natural Science Foundation of China (61973050)

Abstract:

Multi-agent reinforcement learning algorithms struggle to form effective cooperative policies under partially observable conditions. To address this problem, a policy transfer reinforcement learning method is proposed, based on the centralized training and decentralized execution (CTDE) paradigm. First, a teacher module is trained under global observation, where it can explore good cooperative policies. Then, under partially observable conditions, the student module is trained online with maximization of the expected cumulative return as its objective function; at the same time, policy distillation is used to transfer the policy from the teacher module, and the weight of the teacher policy's influence on the student policy is adjusted adaptively. Finally, the proposed method is validated by simulation in multiple map scenarios. The experimental results show that, under partially observable conditions, the win rate of the student module is higher than that of the baseline algorithms it is compared against. The results can be applied to multi-agent cooperative tasks to improve the cooperative performance of agents during decentralized execution.

Key words: multi-agent, reinforcement learning, partial observation, policy transfer, centralized training and decentralized execution (CTDE)
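
The teacher-to-student transfer described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the distillation loss or the adaptive weighting rule, so everything below is an assumption made for illustration: the KL-divergence form of the distillation term, the function names (`distillation_loss`, `adaptive_weight`, `student_loss`), and in particular the return-gap schedule, which is a hypothetical stand-in and not the authors' rule.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(teacher_logits, student_logits):
    """Mean KL(teacher || student) over a batch of action distributions.

    Assumption: the teacher logits come from global observations and the
    student logits from partial observations of the same time steps.
    """
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    eps = 1e-12  # guards against log(0)
    kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return float(np.mean(kl))

def adaptive_weight(student_return, teacher_return, max_weight=1.0):
    """Hypothetical schedule: the teacher's influence shrinks as the
    student's return approaches the teacher's return."""
    if teacher_return <= 0:
        return 0.0
    gap = 1.0 - student_return / teacher_return
    return float(np.clip(gap, 0.0, 1.0) * max_weight)

def student_loss(rl_loss, teacher_logits, student_logits, weight):
    """Student objective: its own RL loss plus the weighted distillation term."""
    return rl_loss + weight * distillation_loss(teacher_logits, student_logits)

# Usage with made-up numbers: one agent, three discrete actions.
teacher_logits = np.array([[2.0, 0.5, -1.0]])
student_logits = np.array([[0.1, 0.2, 0.0]])
weight = adaptive_weight(student_return=5.0, teacher_return=20.0)
total = student_loss(1.0, teacher_logits, student_logits, weight)
```

In this sketch the weight falls to zero once the student matches the teacher's return, so late in training the student is driven only by its own reinforcement learning objective. Other schedules (for example, annealing over training steps) would fit the abstract's description equally well.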

CLC number: