Offline Reward Backfilling-Based Intelligent Cooperative Jamming Strategy Learning Method

YANG Boyang, LI Kang, JIU Bo, JIANG Wenqi, WANG Yinghua, LIU Hongwei

Citation: YANG Boyang, LI Kang, JIU Bo, et al. Offline reward backfilling-based intelligent cooperative jamming strategy learning method[J]. Journal of Radars, in press. doi: 10.12000/JR26061


DOI: 10.12000/JR26061 CSTR: 32380.14.JR26061
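Author information
    Author biographies:

    YANG Boyang, Ph.D. candidate. Main research interests: intelligent jamming decision-making and radar anti-jamming.

    LI Kang, associate professor. Main research interests: intelligent radar anti-jamming and intelligent radar signal processing.

    JIU Bo, professor. Main research interests: intelligent radar signal processing and intelligent radar anti-jamming.

    JIANG Wenqi, Ph.D. candidate. Main research interests: intelligent radar detection and detection of high-speed maneuvering targets.

    WANG Yinghua, professor. Main research interests: automatic target recognition for synthetic aperture radar (SAR) and SAR image processing.

    LIU Hongwei, professor. Main research interests: radar target classification and recognition, cognitive radar, and networked cooperative detection.

    Corresponding author:

    LI Kang likang@xidian.edu.cn

    Corresponding Editor: CUI Guolong

  • Chinese Library Classification (CLC) number: TN974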


Funds: The National Natural Science Foundation of China (62201429, 62192714), the Fund for Foreign Scholars in University Research and Teaching Programs (the 111 Project) (No. B18039), National Key Laboratory of Radar Signal Processing (KGJ202X0X), the Fundamental Research Funds for the Central Universities (QTZX22160)
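  • Abstract: Cooperative jamming of a netted radar by multiple jammers in a complex electromagnetic environment faces three difficulties: partial observability, sparse rewards, and distorted credit assignment. To address them, this paper proposes an intelligent cooperative jamming strategy learning method based on an offline reward backfilling mechanism. The method models the cooperative jamming process of the jammers as a partially observable Markov decision process (POMDP) and builds a two-level reward structure that combines instantaneous rewards with offline reward backfilling, so that the jamming effect can be evaluated accurately. Specifically, the jammers compensate for the information lost to partial observability through periodic information aggregation. Within a period, each jammer uses its local observations to guide its interception timing and transmission behavior online. At the end of the period, the jammers jointly perform retrospective evaluation and reward backfilling on the data they exchange, which corrects the policy gradient signal and improves the policy's ability to represent and learn the true jamming effect. On this basis, and within the centralized training with decentralized execution (CTDE) framework, the paper proposes a multi-agent proximal policy optimization algorithm based on offline reward backfilling, ORB-MAPPO, to learn cooperative time-frequency jamming strategies for multiple jammers. Simulation results show that the method stably learns effective cooperative time-frequency jamming strategies, reaching a jamming coverage rate above 95% and an information interception rate close to 100%. Compared with typical multi-agent policy optimization methods, it raises the jamming coverage rate by about 20% and shows better cooperative jamming performance and training stability.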

     

  • Figure  1.  Schematic diagram of a sea-air cooperative confrontation scenario in which the jammer side holds the resource advantage

    Figure  2.  Schematic diagram of the two-level time units

    Figure  3.  Schematic diagram of information aggregation

    Figure  4.  Flowchart of the cooperative jamming strategy learning method

    Figure  5.  Comparison of jamming coverage rate curves

    Figure  6.  Comparison of information interception rate curves

    Figure  7.  Comparison of jamming coverage rate curves in the ablation experiments

    Figure  8.  Comparison of information interception rate curves in the ablation experiments

    Figure  9.  Comparison of reward curves in the ablation experiments

    Figure  10.  Schematic diagram of the actions of the proposed method

    Figure  11.  Schematic diagram of the actions of the baseline MAPPO algorithm

    Figure  12.  Jamming coverage rate curves under different weights

    Figure  13.  Information interception rate curves under different weights

    Figure  14.  Jamming coverage rate curves under different missed-detection probabilities

    Figure  15.  Jamming coverage rate curves under different misclassification probabilities

    Figure  16.  Jamming coverage rate curves under different frequency measurement error probabilities

    Figure  17.  Jamming coverage rate curves under different confrontation scales

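    Table  1.   Procedure of the intelligent cooperative jamming strategy learning method

     Input: number of jammers $ N $, number of radars $ R $, duration of a single action $ \Delta t $, length of the information aggregation period $ T_{\text{sum}} $;
      Actor parameters $ \theta $, Critic parameters $ \phi $; PPO clipping coefficient $ \varepsilon $, discount factor $ \gamma $.
     Output: cooperative jamming policy $ \pi_{\theta} $
     1) Initialize the Actor parameters $ \theta $ and the Critic parameters $ \phi $;
     2) For episode = $ 1,2,\cdots ,M $ do
     3)  Reset the environment and obtain the initial cross-period memory $ P_0 $ and the local observations $ \{o_i(0)\}_{i=1}^{N} $;
     4)  Clear the trajectory buffer $ D $;
     5)  For k = $ 1,2,\cdots ,K $ do % k-th information aggregation period
     6)   Clear the interception log, transmission log, and time-frequency occupancy record of the current period;
     7)   For t = $ 0,1,2,\cdots ,T-1 $ do
     8)    For i = $ 1,2,\cdots ,N $ do
     9)     Concatenate the agent id with the local observation $ o_i(t) $ and feed the result to the Actor;
     10)     Sample the multi-head action $ a_i(t) = \left(a_i^{\text{mode}}(t),\, a_i^{\text{tar}}(t),\, a_i^{\text{style}}(t),\, a_i^{\text{para}}(t)\right) $;
     11)     Correct the probabilities of illegal actions with the feasible-action mask;
     12)     Record the action and its log-probability $ \log \pi_{\theta}\left(a_i(t) \mid o_i(t)\right) $;
     13)    End for
     14)    The environment executes the joint action $ a(t) = \{a_i(t)\}_{i=1}^{N} $;
     15)    Return the next local observations $ \{o_i(t+1)\} $ and the instantaneous rewards $ \{r_i^{\text{inst}}(t)\} $;
     16)    Record the local observations, centralized observation, actions, instantaneous rewards, and done flag of this step;
     17)    Update the interception history, transmission history, energy state, and time-frequency occupancy table of the current period;
     18)   End for
     19)   At the end of the period, aggregate all interception logs and transmission logs;
     20)   Compute the system metrics: jamming coverage rate, information interception rate, and conflict rate;
     21)   Construct the period-end reward $ R_{\text{done}} = \omega_{c} C_{\text{bar}} + \omega_{\text{cov}} \text{Cov}_{\text{int}} - \omega_{\text{conf}} \text{Conf} + F_{\text{fair}}\left(\{C_r\}\right) $;
     22)   Compute the structured offline backfilled rewards $ \{r_i^{\text{off}}(t)\} $ from the time-frequency matching relations within the period;
     23)   Backfill $ r_i^{\text{off}}(t) $ and $ R_{\text{done}} $ to every time step of the period according to the predefined rule;
     24)   Replace the training rewards of the corresponding samples in buffer $ D $ with the backfilled rewards;
     25)   Generate the cross-period memory $ P_k $ for the next period and update the local observation input;
     26)  End for
     27)  Use the centralized Critic to compute the state value at every time step of the whole trajectory;
     28)  Compute the returns and the advantage function from the backfilled rewards;
     29)  Update the Actor parameters $ \theta $ and the Critic parameters $ \phi $ with the PPO clipped objective, the value loss, and the entropy regularization term;
     30) End for
     31) Return the trained cooperative jamming policy $ \pi_{\theta} $.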
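    Steps 21 to 28 of Table 1 can be illustrated with a minimal sketch. It is not the authors' implementation: the page does not state the fairness term $ F_{\text{fair}} $ or the "predefined rule" for backfilling, so the sketch assumes (i) $ F_{\text{fair}} $ is a penalty equal to the negative standard deviation of the per-radar coverage, weighted by a hypothetical `w_fair`, and (ii) $ R_{\text{done}} $ is shared equally among the steps of the period and added, together with $ r_i^{\text{off}}(t) $, to the instantaneous reward with weight $ \beta $. The function names and default weights are illustrative; the defaults for $ \gamma $ and $ \lambda $ follow Table 3.

```python
from statistics import pstdev


def period_end_reward(cover_mean, intercept_rate, conflict_rate, per_radar_cover,
                      w_c=1.0, w_cov=1.0, w_conf=1.0, w_fair=1.0):
    """Step 21: R_done = w_c*C_bar + w_cov*Cov_int - w_conf*Conf + F_fair({C_r}).

    ASSUMPTION: F_fair is the negative population standard deviation of the
    per-radar coverage; the source does not define it.
    """
    fairness = -w_fair * pstdev(per_radar_cover)
    return w_c * cover_mean + w_cov * intercept_rate - w_conf * conflict_rate + fairness


def backfill_rewards(r_inst, r_off, r_done, beta=1.0):
    """Steps 23-24: build the training rewards for one period.

    r_inst and r_off are lists of T rows, one row per step, one entry per jammer.
    ASSUMPTION: R_done is split equally over the T steps of the period.
    """
    share = r_done / len(r_inst)
    return [
        [ri + beta * (ro + share) for ri, ro in zip(inst_row, off_row)]
        for inst_row, off_row in zip(r_inst, r_off)
    ]


def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Step 28: generalized advantage estimation for one agent.

    Assumes the trajectory has no termination before its last step.
    """
    padded = list(values) + [last_value]
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * padded[t + 1] - padded[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


if __name__ == "__main__":
    r_done = period_end_reward(0.9, 1.0, 0.1, [0.9, 0.9, 0.9, 0.9])
    train_r = backfill_rewards([[0.0, 0.0], [1.0, 0.0]],
                               [[0.5, 0.0], [0.0, 0.5]], r_done)
    adv, ret = gae([row[0] for row in train_r], [0.0, 0.0], 0.0)
    print(train_r, adv, ret)
```

    Because the backfilled rewards replace the stored training rewards before the advantages are computed, a step whose transmission only proves useful after the logs are aggregated still receives credit in the policy gradient.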

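    Table  2.   Simulation parameters

    | Parameter | Value |
    | --- | --- |
    | Number of jammers $ N $ | 3 |
    | Number of radars $ R $ | 4 |
    | Distance between jammers and radars | 100 km |
    | Duration of a single action $ \Delta t $ | 500 μs |
    | Information aggregation period $ T_{\text{sum}} $ | 10 ms |
    | Barrage jamming bandwidth $ B_{\text{blk}} $ | 200 MHz |
    | Spot jamming bandwidth range $ \left[B_{\text{sp,min}}, B_{\text{sp,max}}\right] $ | [10, 100] MHz |
    | Radar pulse repetition interval (PRI) | 1000 μs |
    | Radar signal arrival time offsets | (0, 200, 500, 800) μs |
    | Radar pulse width $ \tau $ | 100 μs |
    | Radar frequency range $ [f_{\text{min}}, f_{\text{max}}] $ | [3.0, 4.0] GHz |
    | Number of radar frequency points | 10 |
    | Radar frequency hopping interval $ \Delta f $ | 100 MHz |
    | Radar bandwidth $ B_r $ | 50 MHz |
    | Stepped sequence length $ L_1 $ | 10 |
    | Number of pulses per pulse group $ L_2 $ | 4 |
    | Pseudo-random sequence length $ L_3 $ | 10 |
    | Reward backfilling weight coefficient $ \beta $ | 1 |
    | Number of interactions per episode | 20 |
    | Number of training episodes | 500 |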
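    The timing values in Table 2 imply a simple slot structure, which the following sketch works out. It assumes that the arrival time offsets are measured from the start of the first action slot, that every radar repeats its pulse exactly once per PRI, and that radars are indexed in the order the offsets are listed; none of these is stated on this page, and the function names are illustrative.

```python
# Values copied from Table 2, all in microseconds.
DT_US = 500                             # duration of a single action
T_SUM_US = 10_000                       # information aggregation period (10 ms)
PRI_US = 1000                           # radar pulse repetition interval
PULSE_WIDTH_US = 100                    # radar pulse width
ARRIVAL_OFFSETS_US = (0, 200, 500, 800)  # one arrival offset per radar


def steps_per_period():
    """Number of action slots in one information aggregation period."""
    return T_SUM_US // DT_US


def pulses_per_period():
    """Number of pulses each radar emits in one aggregation period."""
    return T_SUM_US // PRI_US


def radars_in_slot(slot_index):
    """0-based indices of the radars whose pulse overlaps the given action slot."""
    slot_start = slot_index * DT_US
    slot_end = slot_start + DT_US
    hits = []
    for radar, offset in enumerate(ARRIVAL_OFFSETS_US):
        # Pulse k of this radar occupies [offset + k*PRI, offset + k*PRI + width).
        for k in range(slot_end // PRI_US + 1):
            pulse_start = offset + k * PRI_US
            pulse_end = pulse_start + PULSE_WIDTH_US
            if pulse_start < slot_end and pulse_end > slot_start:
                hits.append(radar)
                break
    return hits


if __name__ == "__main__":
    print("steps per period:", steps_per_period())
    print("pulses per radar per period:", pulses_per_period())
    for slot in range(4):
        print("slot", slot, "radars", radars_in_slot(slot))
```

    Under these assumptions each 500 μs action slot contains the pulses of two of the four radars, which is consistent with a setting where three jammers must split interception and transmission across time and frequency.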

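    Table  3.   Network architecture and training hyperparameter settings

    | Parameter | Symbol | Value |
    | --- | --- | --- |
    | Single-agent observation dimension / Actor input dimension | $ d_o $ | 93 |
    | Centralized Critic input dimension | $ 3\times d_o $ | 279 |
    | Number of neurons per Actor/Critic hidden layer | - | 256 |
    | Number of Actor/Critic hidden layers | - | 2 |
    | Actor/Critic activation function | - | ReLU |
    | Actor output dimensions (mode/target/style/parameter) | - | 2/4/2/10 |
    | Critic output dimension | - | 3 |
    | Discount factor | $ \gamma $ | 0.99 |
    | GAE parameter | $ \lambda $ | 0.95 |
    | PPO clipping coefficient | $ \varepsilon $ | 0.2 |
    | Actor learning rate | $ lr_a $ | $ 3\times 10^{-4} $ |
    | Critic learning rate | $ lr_c $ | $ 3\times 10^{-4} $ |
    | Batch size | Batch_size | 512 |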
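    The four Actor output heads in Table 3 (sizes 2/4/2/10) and the feasible-action mask of Table 1, steps 10 to 12, can be sketched as follows. This is a hedged illustration, not the authors' network: it assumes the heads are sampled independently, so that the joint log-probability is the sum of the per-head log-probabilities, and that masking is done by zeroing the probability of infeasible entries and renormalizing. The head names reuse the superscripts of $ a_i(t) $; everything else is hypothetical.

```python
import math
import random

# Head sizes from Table 3: mode / target / style / parameter.
HEAD_SIZES = {"mode": 2, "tar": 4, "style": 2, "para": 10}


def masked_softmax(logits, mask):
    """Softmax in which entries with mask == 0 get probability exactly 0."""
    if not any(mask):
        raise ValueError("at least one action must be feasible")
    peak = max(x for x, m in zip(logits, mask) if m)
    weights = [math.exp(x - peak) if m else 0.0 for x, m in zip(logits, mask)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_multi_head(logits_by_head, masks_by_head, rng):
    """Sample one index per head and return (action, joint log-probability).

    ASSUMPTION: heads are independent given the observation.
    """
    action = {}
    log_prob = 0.0
    for head, size in HEAD_SIZES.items():
        probs = masked_softmax(logits_by_head[head], masks_by_head[head])
        if len(probs) != size:
            raise ValueError(f"head {head!r} expects {size} logits")
        choice = rng.choices(range(size), weights=probs, k=1)[0]
        action[head] = choice
        log_prob += math.log(probs[choice])
    return action, log_prob


if __name__ == "__main__":
    rng = random.Random(0)
    logits = {head: [0.0] * size for head, size in HEAD_SIZES.items()}
    masks = {"mode": [1, 0], "tar": [0, 0, 1, 0], "style": [0, 1], "para": [1] * 10}
    print(sample_multi_head(logits, masks, rng))
```

    The stored log-probability is the one computed after masking, so the PPO probability ratio is evaluated on the same distribution that was actually sampled.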
Publication history
  • Received:  2026-03-17
  • Revised:  2026-06-08
