Impact Factor: 2.118


    Tactical Prior Knowledge Inspiring Multi-Agent Bilevel Reinforcement Learning


      Abstract: To address the stringent requirements on timeliness, accuracy, and cross-domain fusion that command-and-control technology faces in typical air-sea cooperative operations, a bilevel reinforcement learning framework inspired by prior knowledge is proposed. Guided by prior-knowledge-inspired reward shaping, combat subtasks are extracted to design a state-aggregation method that maps concrete states to abstract states. The abstract states are then modeled as a Markov decision process (MDP), the model is solved with a reinforcement learning algorithm, and the resulting abstract-state value function is used for potential-based reward shaping. This upper-level process is solved in parallel with the lower-level concrete MDP, forming a bilevel reinforcement learning framework. Experiments were conducted on the wargaming platform of the national wargame competition, and the algorithm was refined in terms of state space, action space, and reward function. Prior knowledge embodies top-down mission-style command, whereas multi-agent reinforcement learning structurally matches bottom-up event-driven command; combining the two enables the combat units controlled by the algorithm to learn cooperative tactics and to be more robust in complex environments. In simulation experiments, the red-side agent controlled by the algorithm achieves a 70% win rate against a blue side controlled by a rule-based agent.
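The shaping scheme the abstract describes can be sketched briefly: a state-aggregation function maps a concrete state to an abstract state, the upper-level value function over abstract states serves as the potential, and the lower-level reward is augmented with the standard potential-based term. This is a minimal illustrative sketch, not the paper's implementation; the names `aggregate`, `V`, `GAMMA`, and the grid-cell aggregation are all assumptions.

```python
# Minimal sketch of potential-based reward shaping over aggregated states.
# Assumptions (not from the paper): concrete states are (x, y) positions,
# aggregation is a coarse 10x10 grid, and V is a dict of abstract-state
# values produced by the upper-level learning process.

GAMMA = 0.99  # discount factor (assumed)

def aggregate(state):
    """Map a concrete state to an abstract state (here: a coarse grid cell)."""
    x, y = state
    return (x // 10, y // 10)

def shaped_reward(reward, state, next_state, V):
    """Potential-based shaping F(s, s') = gamma * Phi(s') - Phi(s).

    Using the abstract-state value function as the potential Phi leaves
    the optimal policy of the underlying MDP unchanged.
    """
    phi_s = V.get(aggregate(state), 0.0)
    phi_s_next = V.get(aggregate(next_state), 0.0)
    return reward + GAMMA * phi_s_next - phi_s

# Usage: abstract-state values assumed to come from the upper-level RL loop.
V = {(0, 0): 0.5, (0, 1): 1.2}
r = shaped_reward(0.0, (3, 4), (5, 12), V)  # transition crosses cell (0,0) -> (0,1)
```

The key property of this construction is that the shaping term telescopes along any trajectory, so it accelerates learning without altering which policy is optimal.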

       
