Extracting digital fingerprints from Chinese documents

Cited by: 0
Authors
Liu, Guo-Hua [1]
Ma, Hui-Dong [1]
Li, Xu [1]
Liang, Peng [1]
Affiliation
[1] Yanshan Univ, Coll Comp Sci, Qinhuangdao 066004, Peoples R China
Source
CIS: 2007 INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND SECURITY, PROCEEDINGS | 2007
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Extracting features from Chinese documents is an important problem in protecting intellectual property. Existing approaches are mainly oriented toward word frequency or semantics, and they cannot extract features efficiently. By mapping Chinese documents into ordered sets of integers, we find that each Chinese document corresponds to a unique ordered set of integers and that this set is isomorphic to the document. We therefore propose an algorithm that hashes the set into three kinds of hash-value sequences (a paragraph sequence, a sentence sequence, and a chunk sequence), which together represent the features of the document completely. To reduce the number of features, which this paper defines as digital fingerprints, we present an optimal strategy for selecting some of the hash values from the sequences. Experimental results show that the proposed algorithms are efficient.
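The abstract does not specify the integer mapping, the hash function, the chunk definition, or the selection strategy, so the Python sketch below only illustrates the general pipeline under stated assumptions: characters are mapped to Unicode code points, sequences are hashed with truncated MD5, chunks are overlapping windows of k characters, and selection keeps hash values divisible by p. All function names and parameters are hypothetical and are not taken from the paper.

```python
import hashlib
import re


def to_integers(text):
    """Map a string to an ordered list of integers.

    Assumption: Unicode code points stand in for the paper's mapping.
    """
    return [ord(ch) for ch in text]


def hash_ints(ints):
    """Hash an ordered integer list to a 32-bit value.

    Assumption: truncated MD5 is an arbitrary stand-in hash function.
    """
    data = b"".join(n.to_bytes(4, "big") for n in ints)
    return int.from_bytes(hashlib.md5(data).digest()[:4], "big")


def split_paragraphs(text):
    """Treat each non-empty line as one paragraph."""
    return [p.strip() for p in text.split("\n") if p.strip()]


def split_sentences(paragraph):
    """Split on Chinese sentence-final punctuation, keeping the mark."""
    return [s for s in re.findall(r"[^。！？]+[。！？]?", paragraph) if s.strip()]


def fingerprint_sequences(text, k=4):
    """Build paragraph, sentence, and chunk hash sequences.

    Assumption: a chunk is an overlapping window of k characters.
    """
    paragraphs = split_paragraphs(text)
    sentences = [s for p in paragraphs for s in split_sentences(p)]
    ints = to_integers("".join(paragraphs))
    chunks = [ints[i:i + k] for i in range(len(ints) - k + 1)]
    if not chunks and ints:
        chunks = [ints]  # text shorter than k: one chunk for the whole text
    return {
        "paragraph": [hash_ints(to_integers(p)) for p in paragraphs],
        "sentence": [hash_ints(to_integers(s)) for s in sentences],
        "chunk": [hash_ints(c) for c in chunks],
    }


def select_fingerprints(hashes, p=4):
    """Keep hash values divisible by p.

    Assumption: this simple rule replaces the paper's unspecified
    optimal selection strategy.
    """
    return [h for h in hashes if h % p == 0]


if __name__ == "__main__":
    doc = "你好世界。今天天气很好！\n第二段。"
    seqs = fingerprint_sequences(doc, k=4)
    for name, hashes in seqs.items():
        print(name, len(hashes), len(select_fingerprints(hashes)))
```

Because every step is deterministic, two identical documents yield identical sequences, and comparing the selected values from two documents gives a rough overlap measure. A larger p keeps fewer hash values and therefore a smaller fingerprint, at the cost of coarser matching.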
Pages: 438-441
Page count: 4