Research on Military Equipment Knowledge Extraction Method Based on Retrieval-Augmented Generation

doi:10.3969/j.issn.1009-086x.2026.03.017

Abstract:

To address the difficulty of extracting knowledge from unstructured data in the military equipment domain, this paper proposes a knowledge extraction method based on retrieval-augmented generation (RAG) with hybrid search. First, a large language model is used to assist in constructing an ontology model. Guided by this ontology model, knowledge is extracted from semi-structured data as triples, and the extracted triples are used to build a database. Then, for unstructured data, a novel hybrid search method is proposed that combines sparse retrieval and dense retrieval to retrieve similar knowledge blocks, which serve as reference examples in the prompt. Finally, prompts for knowledge extraction in the military equipment domain are designed, and a large language model uses them to extract knowledge from unstructured data. The results show that the proposed method can extract knowledge from unstructured data. Compared with knowledge extraction without the RAG framework and with RAG-based extraction without hybrid search, the proposed method extracts more triples and achieves a higher recall.

Key words: retrieval-augmented generation (RAG), unstructured data, military equipment, large language model, hybrid search method, prompt

Fengguang ZHOU, Chunyan HU, Yuan ZHOU, Haoyuan ZHANG. Research on Military Equipment Knowledge Extraction Method Based on Retrieval-Augmented Generation[J]. Modern Defense Technology, 2026, 54(3): 190-200.

Figures/Tables 10

Fig. 1 Basic framework of RAG technology

Fig. 2 Knowledge extraction process based on retrieval enhancement

Fig. 3 Jieba word segmentation result for an example text

Fig. 4 Knowledge retrieval process based on hybrid search method

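Table 1 Prompt design for knowledge extraction in the field of military equipment

Design step	Description
Background knowledge	This paper focuses on the military equipment domain, specifically the military equipment of US aircraft carrier formations. In building the ontology model, eight classes are constructed from three perspectives of the US carrier formation (combat platforms, electronic equipment, and combat operations): US carrier formation, aircraft carrier, escort vessels, carrier-based aircraft, weapon systems, sensors and electronic warfare, command and control, and combat operations. These classes need to be further divided into subclasses: the subclasses of escort vessels are cruisers, destroyers, frigates, nuclear submarines, and supply ships; the subclasses of weapon systems…
Task description	Assume you are an expert in knowledge extraction. Using the background knowledge, the original text, the output requirements, and the reference examples, extract triple data.
Input text	The aircraft carrier Constellation carries three 8-cell "Sea Sparrow" ship-to-air missile launchers and four SRBOC electronic countermeasure decoy launchers. …
Output requirements	The output must be in JSON format: [["subject", "relation", "object"], …]. Relation types include: - Attribute relations 1) The attributes defined for the US carrier formation include name, parent formation, home port location, deployment status, etc. … Relations for the other classes are defined as follows, considered along different dimensions: physical deployment, functional dependency, task coordination, technical support, logistics support, information flow, other relations, etc. a) Physical deployment relation. A physical deployment relation means that an entity of one class (e.g., a weapon or a piece of equipment) must be carried on an entity of another class (e.g., a ship or a carrier-based aircraft); specific relation names include "equipped with", "carries", "fitted with", etc. For example, the F-18 fighter is deployed on an aircraft carrier. b) Functional dependency relation. Functional dependency is…
Reference examples	[["Aircraft carrier Constellation", "carries", "Sea Sparrow ship-to-air missile launcher"], ["Aircraft carrier Constellation", "carries", "SRBOC electronic countermeasure decoy launcher"]]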
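
The prompt sections listed in Table 1 can be assembled mechanically, and the JSON output format it requests can be checked on the way back. The following is a minimal sketch, assuming the prompt is a plain concatenation of labeled sections and that the model replies with a bare JSON list of triples. The function names `build_prompt` and `parse_triples`, the section labels, and the separator are hypothetical and are not taken from the paper.

```python
import json


def build_prompt(background, input_text, output_requirements, examples):
    """Concatenate the Table 1 sections into one prompt string (assumed layout)."""
    # Reference examples are serialized as JSON so they match the requested output format.
    example_json = json.dumps(examples, ensure_ascii=False)
    sections = [
        ("Background knowledge", background),
        ("Task description",
         "Assume you are an expert in knowledge extraction. Using the background "
         "knowledge, the original text, the output requirements, and the reference "
         "examples, extract triple data."),
        ("Input text", input_text),
        ("Output requirements", output_requirements),
        ("Reference examples", example_json),
    ]
    return "\n\n".join(f"{title}:\n{body}" for title, body in sections)


def parse_triples(reply):
    """Parse a model reply into (subject, relation, object) tuples.

    Returns an empty list if the reply is not valid JSON or not a list.
    Items that are not 3-element lists of strings are skipped.
    """
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    triples = []
    for item in data:
        if (isinstance(item, list) and len(item) == 3
                and all(isinstance(x, str) for x in item)):
            triples.append(tuple(item))
    return triples


# Hypothetical usage with a retrieved reference example.
prompt = build_prompt(
    background="Military equipment of US aircraft carrier formations ...",
    input_text="The aircraft carrier Constellation carries ...",
    output_requirements='Output must be JSON: [["subject", "relation", "object"], ...]',
    examples=[["Aircraft carrier Constellation", "carries", "Sea Sparrow launcher"]],
)
print(prompt)
```

In a RAG setting, the `examples` argument would be filled with the triples of the most similar retrieved knowledge blocks rather than with a fixed example.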

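Table 2 Experimental method configurations

Experimental model	Method 1	Method 2	Method 3
Sparse model	None	None	TF-IDF+BM25
Dense model (text embedding)	None	paraphrase-multilingual-MiniLM-L12-v2	paraphrase-multilingual-MiniLM-L12-v2
Text generation model	DeepSeek-V3	DeepSeek-V3	DeepSeek-V3
Tuning parameter	None	None	Tunable parameter $\alpha$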
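
To make the role of the tunable parameter α in Method 3 concrete, the sketch below fuses a sparse score and a dense score for each knowledge block. It is an illustration under stated assumptions, not the paper's implementation: the source only says that Method 3 combines a sparse model (TF-IDF+BM25) with a dense embedding model and a tunable α. The weighted-sum fusion rule, the min-max normalization, the use of BM25 alone as the sparse scorer, the constants `k1` and `b`, and all function names are assumptions. Texts are taken as already tokenized, and the dense vectors are toy values standing in for real embeddings.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document against the query."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(t for d in docs_tokens for t in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1.0 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def min_max(values):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_rank(sparse, dense, alpha=0.5, top_k=3):
    """Rank blocks by alpha * dense + (1 - alpha) * sparse (assumed fusion rule)."""
    s, d = min_max(sparse), min_max(dense)
    fused = [alpha * dv + (1.0 - alpha) * sv for sv, dv in zip(s, d)]
    order = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
    return order[:top_k], fused


# Toy knowledge blocks (pre-tokenized) and toy 2-d "embeddings".
docs = [
    ["carrier", "carries", "sea", "sparrow", "launcher"],
    ["destroyer", "radar", "system"],
    ["carrier", "home", "port"],
]
doc_vecs = [[0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
query = ["carrier", "sea", "sparrow"]
query_vec = [1.0, 0.0]

sparse = bm25_scores(query, docs)
dense = [cosine(query_vec, v) for v in doc_vecs]
top, fused = hybrid_rank(sparse, dense, alpha=0.5, top_k=2)
print(top, fused)
```

With this fusion rule, α = 0 reduces to purely sparse ranking and α = 1 to purely dense ranking, so tuning α trades exact lexical matching against semantic similarity.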

Table 3 Triple extraction results under the three methods

Data	Method 1	Method 2	Method 3
Total	30 723	39 350	44 792
Group 1	13 070	15 571	16 022
Group 2	12 512	17 111	20 166
Group 3	5 141	6 668	8 604

Table 4 Comparison of evaluation metrics on the evaluation data under the three methods

Experiment No.	1	2	3
Precision	0.773 2	0.812 8	0.822 3
Recall	0.588 3	0.731 6	0.901 8
F1 score	0.668 3	0.770 1	0.860 2
Number of extracted triples	7 012	8 296	10 108
Number of correctly extracted triples	5 422	6 743	8 312
Number of triples that should be extracted	9 217	9 217	9 217
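
The metrics in Table 4 follow from the three counts in its last rows. The sketch below recomputes them using the standard definitions of precision, recall, and F1; the function name `precision_recall_f1` is hypothetical, and the counts are copied from the table.

```python
def precision_recall_f1(extracted, correct, expected):
    """Compute precision, recall, and F1 from raw triple counts.

    Assumes `extracted` and `expected` are positive.
    """
    precision = correct / extracted  # share of extracted triples that are correct
    recall = correct / expected      # share of expected triples that were found
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts copied from Table 4: (extracted, correct, expected) per experiment.
counts = {
    1: (7012, 5422, 9217),
    2: (8296, 6743, 9217),
    3: (10108, 8312, 9217),
}

for exp_id, (extracted, correct, expected) in counts.items():
    p, r, f1 = precision_recall_f1(extracted, correct, expected)
    print(f"Experiment {exp_id}: P={p:.4f} R={r:.4f} F1={f1:.4f}")
```

Recomputed this way, the values agree with those reported in Table 4 to within about 0.0001.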

Fig. 5 Line chart of evaluation data

Fig. 6 Line chart of extracted data
