An Ontology-based and Domain Specific Clustering Methodology for Financial Documents

Citations: 0
Authors
Kulathunga, Chalitha [1]
Karunaratne, D. D. [1]
Affiliations
[1] Univ Colombo, Sch Comp, Colombo, Sri Lanka
Source
2017 17TH INTERNATIONAL CONFERENCE ON ADVANCES IN ICT FOR EMERGING REGIONS (ICTER) | 2017
Keywords
Financial document clustering; WordNet based clustering; Resnik similarity; Word sense disambiguation; SEMANTIC SIMILARITY; WORDNET;
DOI
Not available
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
081203 ; 0835 ;
Abstract
Financial documents play an important role in modern financial analysis and information retrieval tasks. To meet various investigational needs, financial organizations continuously search for accurate and meaningful unsupervised document classification techniques. Nevertheless, unsupervised document categorization, or document clustering, remains a challenging problem studied by many researchers. Incorporating semantic knowledge from an ontology into document clustering has been extensively studied and has been shown to improve clustering performance. The incorporated semantic knowledge is generally used to identify the correct meanings of ambiguous words in the documents. Most of the proposed methodologies were evaluated on general document datasets, and most of the few available domain-specific clustering studies were constrained to domains where complete domain ontologies are available. Although the financial domain has several domain ontologies, none of them is complete and suitable for semantic document clustering. In this context, our study proposes a document clustering methodology for financial documents that adapts the WordNet ontology to the financial domain to serve as an external knowledge source. The study empirically shows that nouns are relatively prevalent in a document and more important for document clustering than other terms. A subset of nouns is then identified as most important for clustering, based on their frequency distribution within the main noun list. We developed a word sense disambiguation technique that uses ontological knowledge for noun disambiguation. Finally, the nouns in each document are disambiguated with the proposed word sense disambiguation technique, associated with tf-idf weights, and clustered.
On the basis of the empirical results of this research, it can be concluded that the proposed methodology can significantly enhance clustering performance compared with both the no-disambiguation approach and the pure WordNet-based disambiguation approach.
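The pipeline outlined in the abstract (disambiguate nouns with ontological knowledge, then weight the resulting senses with tf-idf) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy taxonomy `PARENT`, the frequencies `FREQ`, the sense inventory `SENSES`, and all function names are hypothetical stand-ins for WordNet and a financial corpus, and the disambiguation rule (pick the sense with the highest total Resnik similarity to the context nouns) is a generic choice assumed here for illustration.

```python
import math
from collections import Counter

# Hypothetical toy taxonomy (child -> parent), standing in for WordNet hypernym links.
PARENT = {
    "entity": None,
    "financial_concept": "entity",
    "geographic_concept": "entity",
    "bank#finance": "financial_concept",
    "loan": "financial_concept",
    "bank#river": "geographic_concept",
    "river": "geographic_concept",
}
# Hypothetical corpus frequencies of the leaf concepts.
FREQ = {"bank#finance": 20, "loan": 30, "bank#river": 5, "river": 10}
# Hypothetical sense inventory: surface noun -> candidate concepts.
SENSES = {"bank": ["bank#finance", "bank#river"], "loan": ["loan"], "river": ["river"]}

def ancestors(concept):
    """Return the concept followed by all of its ancestors, nearest first."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain

def information_content(concept):
    """IC(c) = -log p(c), where p(c) counts c and every concept below it."""
    total = sum(FREQ.values())
    below = sum(n for leaf, n in FREQ.items() if concept in ancestors(leaf))
    return -math.log(below / total)

def resnik(c1, c2):
    """Resnik similarity: IC of the most informative shared ancestor."""
    shared = set(ancestors(c1)) & set(ancestors(c2))
    return max(information_content(c) for c in shared)

def disambiguate(noun, context_nouns):
    """Pick the sense of `noun` with the highest total similarity to the context."""
    def score(sense):
        return sum(
            max(resnik(sense, other) for other in SENSES[ctx])
            for ctx in context_nouns
        )
    return max(SENSES[noun], key=score)

def tfidf(docs):
    """Weight each sense token by term frequency times log inverse document frequency."""
    df = Counter(token for doc in docs for token in set(doc))
    n = len(docs)
    return [
        {tok: (cnt / len(doc)) * math.log(n / df[tok]) for tok, cnt in Counter(doc).items()}
        for doc in docs
    ]

# Two tiny "documents", already reduced to their nouns.
raw_docs = [["bank", "loan", "loan"], ["bank", "river"]]
sense_docs = [
    [disambiguate(noun, [n for n in doc if n != noun]) for noun in doc]
    for doc in raw_docs
]
weights = tfidf(sense_docs)
```

In this toy setting the undisambiguated token "bank" occurs in both documents and would receive a tf-idf weight of zero, whereas the two disambiguated senses become distinct features. The resulting weight dictionaries could then be passed to any clustering algorithm; the abstract does not say which one the authors used.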
Pages: 209-216
Number of pages: 8