Text Categorization Algorithm for Automatic Document Review

Modern Defense Technology (现代防御技术), 2020, Vol. 48, Issue (5): 97-104. DOI: 10.3969/j.issn.1009-086x.2020.05.015

GUO Ze (郭泽), JIAO Qian-qian (焦倩倩)

Beijing Institute of Electronic System Engineering, Beijing 100854, China

Received: 2020-04-09 Revised: 2020-05-07 Online: 2020-10-20 Published: 2021-02-01
About the author: GUO Ze (born 1988), male, from Banan, Chongqing. Engineer, master's degree, working mainly on machine learning and overall design of command and control systems. Mailing address: Sub-box 30, P.O. Box 142, Beijing 100854. E-mail: guoze0987@126.com

Abstract

Abstract: To address paragraph-level text classification in automatic document review, an improved naive Bayes algorithm based on machine learning is proposed. First, the naive Bayes algorithm is improved and used as the classifier. Then a genetic algorithm is adopted to train all the feature weights in the classifier. Finally, a correction algorithm based on the positions of figures and tables is used to refine the classification results. Experiments on a real data set show that the method gives better classification results than the traditional K-nearest neighbor (KNN) and naive Bayes algorithms in most cases, especially when the sample set contains many erroneous samples, and that it can substantially improve the accuracy of automatic document review.

Key words: machine learning, text categorization, naive Bayes, genetic algorithm, automatic document review
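The abstract does not give the weighting formula, the fitness function, or the genetic operators, so the following is only a minimal sketch of the general idea under stated assumptions: a multinomial naive Bayes classifier in which each term's log likelihood is scaled by a per-feature weight, and a simple elitist genetic algorithm that searches for weights maximizing training accuracy. All names (`train_nb`, `predict`, `evolve_weights`), the toy paragraphs, and the class labels are hypothetical, and the figure and table position correction step is omitted because its details are not stated here.

```python
import math
import random
from collections import Counter

def train_nb(docs, labels):
    """Estimate log priors and Laplace-smoothed log likelihoods per class."""
    vocab = sorted({t for d in docs for t in d})
    classes = sorted(set(labels))
    prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    loglik = {}
    for c in classes:
        counts = Counter(t for d, y in zip(docs, labels) if y == c for t in d)
        total = sum(counts.values()) + len(vocab)
        loglik[c] = {t: math.log((counts[t] + 1) / total) for t in vocab}
    return vocab, classes, prior, loglik

def predict(doc, weights, model):
    """Weighted naive Bayes (assumed form): scale each term's log likelihood."""
    vocab, classes, prior, loglik = model
    index = {t: i for i, t in enumerate(vocab)}
    def score(c):
        # Terms outside the training vocabulary are ignored.
        return prior[c] + sum(weights[index[t]] * loglik[c][t]
                              for t in doc if t in index)
    return max(classes, key=score)

def accuracy(weights, model, docs, labels):
    hits = sum(predict(d, weights, model) == y for d, y in zip(docs, labels))
    return hits / len(docs)

def evolve_weights(model, docs, labels, pop_size=20, generations=30, seed=0):
    """Elitist GA over weight vectors; fitness is training accuracy (assumption)."""
    rng = random.Random(seed)
    n = len(model[0])
    # Seed the population with all-ones weights, i.e. plain naive Bayes.
    pop = [[1.0] * n] + [[rng.uniform(0, 2) for _ in range(n)]
                         for _ in range(pop_size - 1)]
    fitness = lambda w: accuracy(w, model, docs, labels)
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        elite = pop[: pop_size // 2]           # selection: keep the better half
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, n)           # one-point crossover
            child = a[:cut] + b[cut:]
            i = rng.randrange(n)                # mutate one weight, keep it >= 0
            child[i] = max(0.0, child[i] + rng.gauss(0, 0.3))
            children.append(child)
        pop = elite + children
    return max(pop, key=fitness)

# Hypothetical toy data: tokenized paragraphs with paragraph-type labels.
docs = [["1", "scope"], ["2", "references"],
        ["this", "standard", "specifies", "the", "scope"],
        ["the", "following", "references", "apply"],
        ["figure", "1", "structure"], ["table", "2", "parameters"]]
labels = ["heading", "heading", "body", "body", "caption", "caption"]

model = train_nb(docs, labels)
best = evolve_weights(model, docs, labels)
baseline = accuracy([1.0] * len(model[0]), model, docs, labels)
print(baseline, accuracy(best, model, docs, labels))
```

Because the initial population contains the all-ones vector and the best individual always survives selection, the evolved weights can never score below plain naive Bayes on the training set. Fitness measured on a held-out set would be a more realistic choice for a real experiment.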

CLC number: TP391

Cite this article: GUO Ze, JIAO Qian-qian. Text Categorization Algorithm for Automatic Document Review[J]. Modern Defense Technology, 2020, 48(5): 97-104.


