Chinese Short Text Classification Algorithm Based on STACKING-BERT Ensemble Learning
DOI:
Author:
Affiliation:

Author biography:

Corresponding author:

CLC number: TP391

Fund project: National Natural Science Foundation of China (61363022); Yunnan Provincial Department of Education Science Research Foundation (2021Y670)


Abstract:

Static word-vector representations such as word2vec and GloVe cannot fully capture text semantics, and the prediction performance of current mainstream neural network models on text classification often depends on the specific problem, leaving them with poor scene adaptability and weak generalization ability. To address these problems, a Chinese short text classification method based on a multi-base-model framework (Stacking-BERT) is proposed. The model uses the BERT pre-trained language model to represent text at the character level and output deep feature vectors of the text; neural network models such as TextCNN, DPCNN, TextRNN, and TextRCNN are used to construct heterogeneous base classifiers, and Stacking ensemble learning combines the different feature representations they extract from the text vectors to improve the generalization ability of the model; finally, an SVM is used as the meta-classifier for training and prediction. Comparative experiments against text classification algorithms such as word2vec-CNN, word2vec-BiLSTM, BERT-TextCNN, BERT-DPCNN, BERT-RNN, and BERT-RCNN on three publicly available Chinese datasets show that the Stacking-BERT ensemble learning model achieves the highest accuracy, precision, recall, and F1-score, effectively improving the classification performance of Chinese short texts.
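
The stacking stage described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' released code: it assumes the BERT features for each text have already been extracted into a NumPy array X, and that each base network (TextCNN, DPCNN, TextRNN, TextRCNN) is wrapped in a hypothetical object exposing scikit-learn-style fit()/predict_proba() methods. Meta-features are built from out-of-fold base-model probabilities so the SVM meta-classifier never sees predictions made on a base model's own training data.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def stacking_meta_features(base_models, X, y, n_splits=5):
    """Out-of-fold probability features for the meta-classifier.

    X: (n_samples, dim) array of pre-computed BERT text features.
    base_models: hypothetical wrappers with fit()/predict_proba(),
    one per base network (TextCNN, DPCNN, TextRNN, TextRCNN).
    """
    n_classes = len(np.unique(y))
    meta = np.zeros((X.shape[0], len(base_models) * n_classes))
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    for train_idx, val_idx in skf.split(X, y):
        for m, model in enumerate(base_models):
            # refit on this fold's training split, predict the held-out split
            model.fit(X[train_idx], y[train_idx])
            meta[val_idx, m * n_classes:(m + 1) * n_classes] = \
                model.predict_proba(X[val_idx])
    return meta

# SVM as the meta-classifier, as in the paper:
# meta_train = stacking_meta_features(base_models, X_train, y_train)
# meta_clf = SVC(kernel="rbf").fit(meta_train, y_train)

At prediction time, each base model (retrained on the full training set) produces class probabilities for the new text, which are concatenated in the same column order and passed to the trained SVM.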

Cite this article:

Zheng Chengyu, Wang Xin, Wang Ting, et al. Chinese Short Text Classification Algorithm Based on STACKING-BERT Ensemble Learning[J]. Science Technology and Engineering, 2022, 22(10): 4033-4038.

History
  • Received: 2021-06-28
  • Revised: 2022-03-23
  • Accepted: 2021-11-16
  • Published online: 2022-04-14
  • Publication date: