Modern Defense Technology ›› 2020, Vol. 48 ›› Issue (5): 97-104. DOI: 10.3969/j.issn.1009-086x.2020.05.015

• INTEGRATED LOGISTICS SUPPORT TECHNOLOGY •

Text Categorization Algorithm for Automatic Document Review

GUO Ze, JIAO Qian-qian   

  1. Beijing Institute of Electronic System Engineering, Beijing 100854, China
  • Received: 2020-04-09 Revised: 2020-05-07 Online: 2020-10-20 Published: 2021-02-01

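  • About the author: GUO Ze (born 1988), male, from Banan, Chongqing. Engineer, master's degree; works mainly on machine learning and overall command-and-control system design. Mailing address: Sub-box 30, P.O. Box 142, Beijing 100854. E-mail: guoze0987@126.com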

Abstract: An improved naive Bayes algorithm based on machine learning is proposed to solve the paragraph-level text classification problem in automatic document review. First, the naive Bayes algorithm is improved and used as the classifier. Then a genetic algorithm is adopted to train all of the classifier's feature weights. Finally, a correction algorithm based on table and figure positions is used to refine the classification results. Experiments on a real data set show that the method outperforms the traditional K-nearest neighbors (KNN) and naive Bayes algorithms in most cases, especially when the sample set contains many erroneous samples, and that it can effectively improve the accuracy of automatic document review.

Key words: machine learning, text categorization, naive Bayes, genetic algorithm, automatic document review
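
The abstract does not give the paper's formulas, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a multinomial naive Bayes classifier in which each term's log-likelihood is scaled by a per-feature weight, and a simple elitist genetic algorithm (truncation selection, one-point crossover, Gaussian mutation) that searches for weights maximizing training accuracy. The function names, the toy paragraphs, the labels, and all hyperparameters are hypothetical. The table and figure position correction step is not sketched.

```python
import math
import random
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Estimate log priors and Laplace-smoothed log likelihoods per class."""
    vocab = sorted({t for d in docs for t in d})
    index = {t: i for i, t in enumerate(vocab)}
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    for d, y in zip(docs, labels):
        term_counts[y].update(d)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_like = {}
    for c in class_docs:
        total = sum(term_counts[c].values()) + len(vocab)
        log_like[c] = {t: math.log((term_counts[c][t] + 1) / total) for t in vocab}
    return index, log_prior, log_like

def predict(doc, model, weights):
    """Weighted naive Bayes (assumed form): weight * log P(term | class)."""
    index, log_prior, log_like = model
    def score(c):
        return log_prior[c] + sum(
            weights[index[t]] * log_like[c][t] for t in doc if t in index)
    return max(sorted(log_prior), key=score)  # sorted() makes ties deterministic

def accuracy(docs, labels, model, weights):
    hits = sum(predict(d, model, weights) == y for d, y in zip(docs, labels))
    return hits / len(docs)

def evolve_weights(docs, labels, model, pop_size=20, generations=30, seed=0):
    """Search for feature weights with a small elitist genetic algorithm."""
    rng = random.Random(seed)
    n = len(model[0])
    fitness = lambda w: accuracy(docs, labels, model, w)
    # Seed the population with the unweighted classifier plus random candidates.
    population = [[1.0] * n] + [
        [rng.uniform(0.0, 2.0) for _ in range(n)] for _ in range(pop_size - 1)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]      # best half survives unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n) if n > 1 else 0
            child = a[:cut] + b[cut:]              # one-point crossover
            i = rng.randrange(n)
            child[i] = max(0.0, child[i] + rng.gauss(0.0, 0.3))  # mutation
            children.append(child)
        population = parents + children
    return max(population, key=fitness)

# Hypothetical toy paragraphs, already tokenized, with paragraph-type labels.
docs = [["1", "introduction"], ["2", "method"],
        ["the", "method", "is", "described"], ["the", "results", "are", "shown"],
        ["figure", "1", "results"], ["table", "2", "method"]]
labels = ["heading", "heading", "body", "body", "caption", "caption"]

model = train_nb(docs, labels)
weights = evolve_weights(docs, labels, model)
print(accuracy(docs, labels, model, weights))
```

Because the best half of each generation is carried over unchanged and the initial population contains the all-ones weight vector, the returned weights never score worse on the training set than plain naive Bayes. A real experiment would measure fitness on held-out data instead, since training accuracy alone rewards overfitting.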

