A Semi-Supervised Approach for Gender Identification

Cited by: 0
Authors
Soler-Company, Juan [1]
Wanner, Leo [1, 2]
Affiliations
[1] Pompeu Fabra Univ, NLP Grp, Dept Informat & Commun Technol, C Roc Boronat 138, Barcelona 08018, Spain
[2] Catalan Inst Res & Adv Studies ICREA, Barcelona, Spain
Source
LREC 2016 - TENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION | 2016
Keywords
author profiling; gender identification; semi-supervised learning; text classification; machine learning
DOI
Not available
Chinese Library Classification number
H [Language, Linguistics];
Discipline classification code
05;
Abstract
Most research on author profiling trains models on large quantities of correctly labeled data. However, this does not reflect the reality of forensic scenarios: in practical linguistic forensic investigations, the resources available to profile the author of a text are usually scarce. To account for this, we implemented a semi-supervised learning variant of the k-nearest neighbors (KNN) algorithm that uses small sets of labeled data and a larger amount of unlabeled data to classify the authors of texts by gender (man vs. woman). We describe the enriched KNN algorithm and show that the use of unlabeled instances improves the accuracy of our gender identification model. We also present a feature set that facilitates the use of a very small number of instances, reaching accuracies higher than 70% with only 113 instances to train the model. We also show that the algorithm performs equally well on publicly available data.
Pages: 1282-1287
Page count: 6