Reinforcement Learning-Based Cooperative Trajectory Planning for Unmanned Combat Aerial Vehicles and Decoy UAVs

doi:10.3969/j.issn.1009-086x.2026.03.007

Abstract

Cooperative operations by unmanned aerial vehicles (UAVs) have become a key technology for improving battlefield effectiveness in modern warfare. Within this area, the cooperative mode between unmanned combat aerial vehicles (UCAVs) and decoy UAVs has attracted wide attention for its tactical value. For missions in which a UCAV and a decoy UAV jointly strike key enemy targets, this paper proposes a cooperative trajectory planning method based on the proximal policy optimization (PPO) algorithm. A Markov decision process (MDP) model incorporating dynamic threat assessment is constructed. The model integrates UAV kinematics and battlefield environment constraints, and the state space, the action space, and a hierarchical reward function are designed on top of it. Simulation results show that the proposed method effectively guides the UCAV and the decoy UAV to cooperate efficiently in complex battlefield environments, significantly increasing the mission success rate and reducing the risk of interception by enemy air defense systems. The work provides theoretical and technical support for intelligent path planning in UAV cooperative operations.

Key words: UAV cooperative operations, trajectory planning, reinforcement learning, proximal policy optimization (PPO), Markov decision process
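
For orientation, the clipped surrogate objective that PPO is generally understood to maximize can be written as below. This is the textbook form, not a formula taken from the paper: the symbols (probability ratio r_t, advantage estimate \hat{A}_t, clipping threshold \epsilon) are the conventional ones and are assumptions about the authors' notation. Table 4 lists a clipping threshold of 0.2, which would correspond to \epsilon here.

```latex
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},
\qquad
L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[
  \min\left( r_t(\theta)\,\hat{A}_t,\;
  \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t \right)
\right]
```

With \epsilon = 0.2 the ratio is effectively confined to [0.8, 1.2], so a single update cannot move the new policy far from the policy that collected the data.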

CLC number:

V279

Haozhe QI, Mingfa ZHENG, Xiaorong HU, Nan YANG. Reinforcement Learning-Based Cooperative Trajectory Planning for Unmanned Combat Aerial Vehicles and Decoy UAVs[J]. Modern Defense Technology, 2026, 54(3): 71-81.

Figures/Tables 19

Fig. 1 Cooperative strike scenario of the unmanned combat aerial vehicle and the decoy UAV

Fig. 2 Radar threat constraint modeling diagram

Fig. 3 UAV collision avoidance modeling

Fig. 4 No-fly zone constraint modeling

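Table 1 Episode reward configuration

Event	Description	Value
Win	Enemy surface-to-air missile (SAM) site or target destroyed	1 000
Loss	Friendly UAV shot down	-1 000
Draw	No engagement between the two sides	-10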

Table 2 Critical events reward configuration

Event	Description	Value
Kill	Enemy SAM site shot down	1 000
Shot down	UCAV shot down by the enemy	-1 000
Shot down	Decoy UAV shot down by the enemy	-100

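Table 3 Single-step reward configuration

Event	Description	Value
Per step	Negative reward given at every time step	-1
Decoy UAV draws fire	Enemy radar is tied down	200
Locked on	Locked on by the enemy	-100
UCAV approaches	Approaches the target while the decoy UAV ties down the enemy	100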
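
The three reward tables can be read as layers of one reward signal. The following is a minimal sketch of that reading, assuming the layers are simply summed at each time step; the excerpt does not say how the authors combine them, and all identifiers (`EPISODE_REWARD`, `step_reward`, `total_reward`, the event keys) are hypothetical names chosen for illustration. Only the numeric values come from Tables 1 to 3.

```python
from typing import Iterable, Optional

# Numeric values are transcribed from Tables 1-3; every key and function
# name below is a hypothetical label, not taken from the paper.
EPISODE_REWARD = {"win": 1000.0, "loss": -1000.0, "draw": -10.0}      # Table 1
EVENT_REWARD = {                                                      # Table 2
    "sam_destroyed": 1000.0,
    "ucav_shot_down": -1000.0,
    "decoy_shot_down": -100.0,
}
STEP_PENALTY = -1.0                                                   # Table 3, "per step"
STEP_REWARD = {                                                       # Table 3, conditional terms
    "decoy_draws_fire": 200.0,
    "locked_on": -100.0,
    "ucav_approaches": 100.0,
}


def step_reward(flags: Iterable[str]) -> float:
    """Per-step layer: constant time penalty plus any active step conditions."""
    return STEP_PENALTY + sum(STEP_REWARD[f] for f in flags)


def event_reward(events: Iterable[str]) -> float:
    """Critical-event layer: sum of the events that fired during this step."""
    return sum(EVENT_REWARD[e] for e in events)


def episode_reward(outcome: Optional[str]) -> float:
    """Episode layer: applied only on the terminal step (outcome is None otherwise)."""
    return EPISODE_REWARD[outcome] if outcome is not None else 0.0


def total_reward(flags: Iterable[str] = (),
                 events: Iterable[str] = (),
                 outcome: Optional[str] = None) -> float:
    """ASSUMPTION: the three layers are added together; the source does not state this."""
    return step_reward(flags) + event_reward(events) + episode_reward(outcome)
```

Under this additive assumption, a step on which the decoy UAV ties down the radar while the UCAV closes in yields `total_reward(flags=["decoy_draws_fire", "ucav_approaches"])`, that is, 299.0. Note that a winning step that also records a SAM kill would count both the event and the episode bonus; whether the authors intend that overlap cannot be determined from the tables alone.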

Fig. 5 Policy network architecture

Fig. 6 Research framework

Fig. 7 Intelligent planning framework

Fig. 8 Simulation environment

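Table 4 Deep reinforcement learning training parameter configuration

Category	Parameter	Value	Description
Basic parameters	Maximum training steps	100 000	Termination condition
Basic parameters	Maximum steps per episode	1 000	Maximum number of steps executed in a single episode
Optimization parameters	Learning rate	0.000 3	Initial learning rate of the Adam optimizer
Optimization parameters	Clipping threshold between new and old policies	0.2	PPO Clip constraint parameter
Discount parameters	Reward discount factor γ	0.995	Decay coefficient for future rewards
Discount parameters	GAE parameter λ	0.95	Generalized advantage estimation parameter
Loss function parameters	Policy loss weight	1.0	Policy gradient weight
Loss function parameters	Value function loss weight	0.5	Value network loss weight
Loss function parameters	Entropy regularization coefficient	0.01	Weight of the policy entropy term
Data processing parameters	Mini-batch size	64	Batch size for gradient updates
Data processing parameters	Number of random seeds	3	Random seeds used for initialization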
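
To make the roles of these hyperparameters concrete, here is a minimal, dependency-free sketch of how they typically enter a PPO update: the standard generalized advantage estimation (GAE) recursion and the usual composite loss (clipped policy term, value term, entropy bonus). It is not the authors' implementation, which the excerpt does not show. The class `PPOConfig` and the functions `compute_gae` and `ppo_loss` are hypothetical names, and the sketch assumes a single-episode trajectory with no intermediate terminal states.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class PPOConfig:
    """Values transcribed from Table 4; field names are hypothetical."""
    max_train_steps: int = 100_000
    max_episode_steps: int = 1_000
    learning_rate: float = 3e-4        # Adam initial learning rate
    clip_epsilon: float = 0.2          # PPO clipping threshold
    gamma: float = 0.995               # reward discount factor
    gae_lambda: float = 0.95           # GAE parameter
    policy_loss_weight: float = 1.0
    value_loss_weight: float = 0.5
    entropy_coef: float = 0.01
    minibatch_size: int = 64
    num_seeds: int = 3


def compute_gae(rewards: Sequence[float], values: Sequence[float],
                last_value: float, cfg: PPOConfig) -> List[float]:
    """Standard GAE recursion over one trajectory.

    values[t] is V(s_t); last_value is the bootstrap V(s_T), 0.0 if s_T is terminal.
    """
    advantages = [0.0] * len(rewards)
    gae, next_value = 0.0, last_value
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + cfg.gamma * next_value - values[t]  # TD residual
        gae = delta + cfg.gamma * cfg.gae_lambda * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages


def ppo_loss(ratios: Sequence[float], advantages: Sequence[float],
             values: Sequence[float], returns: Sequence[float],
             entropies: Sequence[float], cfg: PPOConfig) -> float:
    """Composite loss to minimize: weighted policy and value terms minus entropy bonus."""
    n = len(ratios)
    lo, hi = 1.0 - cfg.clip_epsilon, 1.0 + cfg.clip_epsilon
    # Negated clipped surrogate, averaged over the batch.
    policy = -sum(min(r * a, min(max(r, lo), hi) * a)
                  for r, a in zip(ratios, advantages)) / n
    value = sum((v - g) ** 2 for v, g in zip(values, returns)) / n
    entropy = sum(entropies) / n
    return (cfg.policy_loss_weight * policy
            + cfg.value_loss_weight * value
            - cfg.entropy_coef * entropy)
```

As a quick check of the clipping behavior, a sample with ratio 1.5 and advantage 1.0 contributes as if its ratio were 1.2, because 1 + 0.2 caps the gain from moving further away from the old policy.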

Fig. 9 Convergence curve

Fig. 10 Reinforcement learning offline training plots

Fig. 11 Pre-training results

Fig. 12 Post-training results

Fig. 13 Robustness test results

Fig. 14 Quantitative performance comparison results

Table 5 Quantitative performance comparison of algorithms

Algorithm	TSR/%	ASP/%	APL_UCAV/km	APL_Decoy/km	ASC (kSteps)	AR
PPO Clip	92.0	88.0	142.3	98.7	40.0	856.2
PPO	85.0	82.0	138.5	95.2	50.0	782.4
DDPG	78.0	75.0	145.8	101.5	>60.0	701.8
A3C	80.0	77.0	140.1	97.3	55.0	725.6
DQN	65.0	68.0	130.5	90.8	45.0	620.3
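
The margins between PPO Clip and the other algorithms can be read off Table 5 with a few lines of arithmetic. The sketch below transcribes three of the columns and computes the differences. The column abbreviations are not expanded in this excerpt, so they are treated as opaque labels, and the code assumes that a higher value is better for each of the three. The names `TABLE_5`, `best_by`, and `margin_over_baselines` are hypothetical.

```python
# Values transcribed from Table 5. The metric abbreviations (TSR, ASP, AR)
# are not defined in this excerpt and are used only as column labels.
TABLE_5 = {
    "PPO Clip": {"TSR": 92.0, "ASP": 88.0, "AR": 856.2},
    "PPO":      {"TSR": 85.0, "ASP": 82.0, "AR": 782.4},
    "DDPG":     {"TSR": 78.0, "ASP": 75.0, "AR": 701.8},
    "A3C":      {"TSR": 80.0, "ASP": 77.0, "AR": 725.6},
    "DQN":      {"TSR": 65.0, "ASP": 68.0, "AR": 620.3},
}


def best_by(metric: str) -> str:
    """Algorithm with the largest value of the metric (ASSUMES higher is better)."""
    return max(TABLE_5, key=lambda name: TABLE_5[name][metric])


def margin_over_baselines(metric: str, method: str = "PPO Clip") -> dict:
    """Absolute difference between `method` and every other row, to one decimal."""
    ref = TABLE_5[method][metric]
    return {name: round(ref - row[metric], 1)
            for name, row in TABLE_5.items() if name != method}


if __name__ == "__main__":
    for metric in ("TSR", "ASP", "AR"):
        print(metric, best_by(metric), margin_over_baselines(metric))
```

For the TSR column this gives margins of 7.0, 14.0, 12.0, and 27.0 percentage points over PPO, DDPG, A3C, and DQN respectively.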
