Gender classification of microblog text based on authorial style

Cited by: 0
Authors
Shubhadeep Mukherjee
Pradip Kumar Bala
Affiliation
[1] Indian Institute of Management Ranchi, Information System and Analytics Area
Source
Information Systems and e-Business Management | 2017 / Vol. 15
Keywords
Text mining; Twitter; Natural language processing; Gender classification; Knowledge discovery; Supervised learning; Artificial intelligence; Business intelligence
DOI
Not available
Abstract
Gender profiling of unstructured text data has several applications in areas such as marketing, advertising, legal investigation, and recommender systems. The automatic detection of gender in microblogs such as Twitter is a difficult task. It requires a system that can use knowledge to interpret the linguistic styles used by the genders. In this paper, we try to provide this knowledge for such a system by considering different sets of features that are relatively independent of the text's content, such as function words and part-of-speech n-grams. We test a range of feature sets using two classifiers, namely Naïve Bayes and maximum entropy. Our results show that the gender detection task benefits from the inclusion of features that capture the authorial style of the microblog authors. We achieve an accuracy of approximately 71%, which outperforms the classification accuracy of commercially available gender detection software such as Gender Genie and Gender Guesser.
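The abstract's idea of content-independent style features fed to a Naïve Bayes classifier can be illustrated with a minimal sketch. This is not the authors' implementation: the function-word list, the toy training texts, the neutral labels "A" and "B", and all function names below are assumptions made purely for illustration, and part-of-speech n-grams and the maximum entropy classifier are omitted.

```python
import math
import re
from collections import Counter

# Hypothetical, very small function-word list (an assumption for this sketch;
# the paper's actual feature set is not given in this record).
FUNCTION_WORDS = ["the", "a", "of", "and", "i", "you", "my", "so", "but", "to"]


def function_word_counts(text):
    """Return a count vector over FUNCTION_WORDS for one text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t in FUNCTION_WORDS)
    return [counts[w] for w in FUNCTION_WORDS]


def train_naive_bayes(samples):
    """Fit a multinomial Naive Bayes model from (text, label) pairs.

    Returns {label: (log_prior, [log_likelihood per function word])}.
    """
    docs_per_label = Counter()
    word_totals = {}
    for text, label in samples:
        docs_per_label[label] += 1
        totals = word_totals.setdefault(label, [0] * len(FUNCTION_WORDS))
        for i, c in enumerate(function_word_counts(text)):
            totals[i] += c
    n_docs = sum(docs_per_label.values())
    model = {}
    for label, totals in word_totals.items():
        # Add-one (Laplace) smoothing so unseen words keep nonzero probability.
        denom = sum(totals) + len(FUNCTION_WORDS)
        model[label] = (
            math.log(docs_per_label[label] / n_docs),
            [math.log((c + 1) / denom) for c in totals],
        )
    return model


def predict(model, text):
    """Return the label with the highest posterior log score."""
    vec = function_word_counts(text)

    def score(label):
        log_prior, log_likes = model[label]
        return log_prior + sum(c * ll for c, ll in zip(vec, log_likes))

    return max(model, key=score)
```

Because the features are counts of closed-class words rather than topic words, a model of this kind responds to how something is written rather than what it is about, which is the property the abstract attributes to authorial-style features.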
Pages: 117-138
Page count: 21