Strength of linguistic text evidence: A fused forensic text comparison system

Cited by: 9
Authors
Ishihara, Shunichi [1]
Affiliations
[1] Australian Natl Univ, Dept Linguist, Canberra, ACT, Australia
Keywords
Forensic text comparison; Likelihood ratio; Logistic-regression fusion; Multivariate kernel density; N-grams; Authorship attribution features; LIKELIHOOD-RATIO; PROBABILISTIC EVALUATION; AUTHORSHIP ATTRIBUTION; HANDWRITING EVIDENCE; IDENTIFICATION; FRAMEWORK; MESSAGES; STYLE; DNA;
DOI
10.1016/j.forsciint.2017.06.040
Chinese Library Classification
DF [Law]; D9 [Law]; R [Medicine, Health];
Discipline classification code
0301; 10;
Abstract
Compared with other forensic comparative sciences, studies of the efficacy of the likelihood ratio (LR) framework in forensic authorship analysis are lagging. This study describes an experiment on estimating the strength of linguistic text evidence within that framework. LRs were estimated with three procedures: one based on the multivariate kernel density (MVKD) formula, with each group of messages modelled as a vector of authorship attribution features, and two based on N-grams of word tokens and of characters, respectively. The LRs estimated separately by the three procedures were then fused by logistic regression to obtain a single LR for each author comparison. The study used predatory chatlog messages sampled from 115 authors. To see how the number of word tokens affects the performance of a forensic text comparison (FTC) system, the number of tokens used to model each group of messages was progressively increased: 500, 1000, 1500 and 2500 tokens. System performance was assessed using the log-likelihood-ratio cost (Cllr), a gradient metric for the quality of LRs, and the strength of the derived LRs is charted as Tippett plots. The study demonstrates that (i) of the three procedures, the MVKD procedure with authorship attribution features performed best in terms of Cllr, and (ii) the fused system outperformed all three single procedures. With a token length of 1500, for example, the fused system achieved a Cllr of 0.15. Some unrealistically strong LRs were observed in the results. The reasons for these are discussed, and a possible solution, the empirical lower and upper bound LR (ELUB) method, is trialled and applied to the LRs of the best-performing fused system. (C) 2017 Elsevier B.V. All rights reserved.
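The abstract reports performance as a log-likelihood-ratio cost (Cllr). As a rough illustration only, the sketch below computes Cllr using the commonly cited definition: half the sum of the mean of log2(1 + 1/LR) over same-author comparisons and the mean of log2(1 + LR) over different-author comparisons. The record does not give the paper's exact formulation or implementation, so the function name `cllr`, its parameter names and the sample LR values are all assumptions made for illustration.

```python
import math


def cllr(same_author_lrs, different_author_lrs):
    """Log-likelihood-ratio cost under the commonly cited definition.

    Both arguments are non-empty sequences of likelihood ratios (LR > 0).
    This is an illustrative sketch, not the paper's implementation.
    """
    # Penalty for same-author comparisons: grows as the LR falls below 1.
    same_penalty = sum(math.log2(1 + 1 / lr) for lr in same_author_lrs)
    same_penalty /= len(same_author_lrs)

    # Penalty for different-author comparisons: grows as the LR rises above 1.
    diff_penalty = sum(math.log2(1 + lr) for lr in different_author_lrs)
    diff_penalty /= len(different_author_lrs)

    return 0.5 * (same_penalty + diff_penalty)


if __name__ == "__main__":
    # Hypothetical LRs, invented for illustration (not data from the study).
    uninformative = cllr([1.0, 1.0], [1.0, 1.0])
    well_behaved = cllr([1000.0, 200.0], [0.001, 0.01])
    print(uninformative, well_behaved)
```

Under this definition, a system that always outputs LR = 1 scores exactly 1, values well below 1 indicate informative LRs, and strongly misleading LRs push the cost above 1. Lower is better.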
Pages: 184-197
Page count: 14