Research on Cross-language Text Similarity Calculation

Cited by: 0
Authors
Yuan, Sun [1]
Qian, Zhao [1]
Affiliation
[1] Minzu Univ China, Sch Informat Engn, Natl Language Resource & Monitoring Res Ctr, Minor Languages Branch, Beijing, Peoples R China
Source
PROCEEDINGS OF 2015 IEEE 5TH INTERNATIONAL CONFERENCE ON ELECTRONICS INFORMATION AND EMERGENCY COMMUNICATION | 2015
Keywords
text similarity; cross-language; Tibetan-Chinese; LDA model
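DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812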
Abstract
Cross-language text similarity calculation is a fundamental problem in natural language processing. It is widely used in cross-language research, for example in cross-language information retrieval. In this paper, we use the LDA (Latent Dirichlet Allocation) model to calculate the similarity of Tibetan and Chinese texts at the topic level. Through topic modelling and topic prediction, the texts are mapped to a feature space of topics. This method reduces the dimensionality of the text vector space and improves the speed and efficiency of computation.
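The topic-level comparison described in the abstract can be illustrated with a minimal sketch. It assumes each text has already been mapped to a topic distribution by a trained LDA model; the three example vectors are hypothetical, and cosine similarity is an assumed choice of measure, since the abstract does not name the one used in the paper.

```python
import math


def cosine_similarity(p, q):
    """Cosine similarity between two topic-distribution vectors of equal length."""
    if len(p) != len(q):
        raise ValueError("topic vectors must have the same number of topics")
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0  # an all-zero vector carries no topic information
    return dot / (norm_p * norm_q)


# Hypothetical topic distributions over 3 shared topics (not data from the paper).
tibetan_doc_topics = [0.70, 0.20, 0.10]
chinese_doc_topics = [0.65, 0.25, 0.10]
unrelated_doc_topics = [0.05, 0.15, 0.80]

related_score = cosine_similarity(tibetan_doc_topics, chinese_doc_topics)
unrelated_score = cosine_similarity(tibetan_doc_topics, unrelated_doc_topics)
```

Because each text is represented by a short vector over topics rather than a vector over the whole vocabulary, the comparison costs time proportional to the number of topics, which is the dimensionality reduction the abstract refers to.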
Pages: 423-426
Page count: 4