Using psycholinguistic features for profiling first language of authors

Cited by: 5
Authors
Torney, Rosemary [1]
Vamplew, Peter [2]
Yearwood, John [2]
Affiliations
[1] Univ Ballarat, Internet Commerce Secur Lab, Ballarat, Vic 3353, Australia
[2] Univ Ballarat, Sch Sci Informat Technol & Engn, Ballarat, Vic 3353, Australia
Source
JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY | 2012 / Vol. 63 / No. 6
Keywords
artificial intelligence; natural language processing; text mining; IDENTIFICATION; WORDS;
DOI
10.1002/asi.22627
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
This study empirically evaluates the effectiveness of different feature types for the classification of the first language of an author. In particular, it examines the utility of psycholinguistic features, extracted by the Linguistic Inquiry and Word Count (LIWC) tool, that have not previously been applied to the task of author profiling. As LIWC is a tool that has been developed in the psycholinguistic field rather than the computational linguistics field, it was hypothesized that it would be effective, both as a single type feature set because of its psycholinguistic basis, and in combination with other feature sets, because it should be sufficiently different to add insight rather than redundancy. It was found that LIWC features were competitive with previously used feature types in identifying the first language of an author, and that combined feature sets including LIWC features consistently showed better accuracy rates and average F measures than were achieved by the same feature sets without the LIWC features. As a secondary issue, this study also examined how effectively first language classification scaled up to a larger number of possible languages. It was found that the classification scheme scaled up effectively to the entire 16 language collection from the International Corpus of Learner English, when compared with results achieved on just 5 languages in previous research.
Pages: 1256-1269
Page count: 14