Semantic Text Alignment based on Topic Modeling

Cited by: 0
Authors
Le, Huong T. [1]
Pham, Lam N. [1]
Nguyen, Duy D. [1]
Nguyen, Son V. [2]
Nguyen, An N. [2]
Affiliations
[1] Hanoi Univ Sci & Technol, Sch Informat & Comp Sci Technol, Hanoi, Vietnam
[2] Minist Def, Inst Mil Sci & Technol, Hanoi, Vietnam
Source
2016 IEEE RIVF INTERNATIONAL CONFERENCE ON COMPUTING & COMMUNICATION TECHNOLOGIES, RESEARCH, INNOVATION, AND VISION FOR THE FUTURE (RIVF) | 2016
Keywords
text alignment; topic modeling; Latent Dirichlet Allocation; Apriori
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The growth of the Internet has made plagiarism an increasingly serious problem. Plagiarism takes different forms, ranging from copying text to adopting ideas without giving credit to the original author. Most research on plagiarism checking concentrates on string matching, which cannot handle intelligent plagiarism, where the same content is expressed in different ways. To address this problem, this paper proposes an approach to semantic text alignment based on sentence-level topic modeling. Experiments on the PAN corpora yielded much higher recall and roughly comparable plagdet scores relative to the winning system in PAN2014. These results suggest that topic modeling is a promising solution for detecting intelligent plagiarism.
Pages: 67-72
Page count: 6