Study on Method of Word Segmentation in Feature Selection in Chinese Text Categorization
(中文文本特征选择中的分词方法研究)

Fund Project: "Eleventh Five-Year Plan" Weapons and Equipment Pre-Research Project (513300102)

    Abstract:

    Automatic Chinese word segmentation tends to produce terms that have lost feature information. To address this problem, we propose segmenting text with word strings (lexical chunks) as the unit and decompose the segmentation process into three sub-processes. First, the text is segmented by backward maximum matching. Second, stop words are removed from the segmentation result. Third, the mutual information and adjacent co-occurrence frequency of the terms obtained in the first segmentation pass are computed, and the results decide which terms are combined into word strings. Experimental results show that the combined word strings carry richer feature information, which improves text feature selection and raises text classification performance.
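
The three sub-processes described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the author's implementation: the toy lexicon, the stop-word list, the thresholds `min_pmi` and `min_count`, the count-based estimate of mutual information, and the greedy left-to-right merging of adjacent pairs are all assumptions, since the abstract gives no formulas or parameter values.

```python
import math
from collections import Counter


def backward_max_match(text, lexicon, max_len=4):
    """Segment text right to left, taking the longest lexicon match each time."""
    tokens = []
    end = len(text)
    while end > 0:
        # Try the longest candidate ending at `end` first, then shrink it.
        for size in range(min(max_len, end), 0, -1):
            piece = text[end - size:end]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                end -= size
                break
    tokens.reverse()
    return tokens


def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the stop-word list."""
    return [t for t in tokens if t not in stop_words]


def merge_word_strings(docs, min_pmi=1.0, min_count=2):
    """Join adjacent terms whose co-occurrence count and PMI pass both thresholds."""
    unigrams = Counter(t for doc in docs for t in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    total = sum(unigrams.values())
    total_pairs = sum(bigrams.values())

    def pmi(a, b):
        # Pointwise mutual information estimated from corpus counts (assumed form).
        p_ab = bigrams[(a, b)] / total_pairs
        return math.log2(p_ab / ((unigrams[a] / total) * (unigrams[b] / total)))

    merged_docs = []
    for doc in docs:
        out, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc):
                a, b = doc[i], doc[i + 1]
                if bigrams[(a, b)] >= min_count and pmi(a, b) >= min_pmi:
                    out.append(a + b)  # combine the pair into one word string
                    i += 2
                    continue
            out.append(doc[i])
            i += 1
        merged_docs.append(out)
    return merged_docs


# Hypothetical toy data, for illustration only.
LEXICON = {"文本", "分类", "特征", "选择", "方法", "研究"}
STOP_WORDS = {"的", "和"}
TEXTS = ["文本分类的方法", "文本分类和特征选择", "特征选择的研究"]

segmented = [backward_max_match(t, LEXICON) for t in TEXTS]
filtered = [remove_stop_words(doc, STOP_WORDS) for doc in segmented]
word_strings = merge_word_strings(filtered)
print(word_strings)
```

On this toy corpus the pairs 文本/分类 and 特征/选择 each co-occur twice and are merged into the word strings 文本分类 and 特征选择, while 方法 and 研究 stay as single terms. In practice the thresholds would need tuning on a real corpus.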

Cite this article

黄魏. 中文文本特征选择中的分词方法研究[J]. 科学技术与工程, 2010, (1): .
Huang Wei. Study on Method of Word Segmentation in Feature Selection in Chinese Text Categorization[J]. Science Technology and Engineering, 2010, (1).

History
  • Received: 2009-09-15
  • Last revised: 2009-09-15
  • Accepted: 2009-09-21
  • Published online: 2010-01-11