Author identification of literary works based on text analysis and deep learning

Cited by: 5

Authors:

Tang, Xu [1]

Affiliations:

[1] Chongqing Normal University, College of Literature, Chongqing 401331, People's Republic of China

Source:

HELIYON | 2024, Vol. 10, Issue 03

Keywords:

Text analysis; Convolutional neural networks (CNN); Attention mechanisms; Long short-term memory network (LSTM)

DOI:

10.1016/j.heliyon.2024.e25464

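Chinese Library Classification (CLC):

O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]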

Subject classification codes:

07 ; 0710 ; 09 ;

Abstract:

With advances in science, analysis problems in speech, images, and other modalities have gradually been solved more effectively, but the analysis of Chinese text remains a difficult problem. Chinese text analysis requires not only statistics but also semantic comprehension. Different text types require different language-style feature models to achieve good recognition results. In this study, we use deep learning to build an automatic text feature extraction model and classify texts with the author as the class label. We present a literary author recognition model based on deep learning that is divided into three phases: text preprocessing, feature extraction, and classification. Each phase consists of several small modules or steps. First, we feed the corpus into Word2Vec to generate new word vectors. Then, an improved text feature extractor based on CNN and Attention extracts the text features, which are used as the input of the CNN convolution layer. After convolution, the outputs are combined position by position to obtain the Window Feature Sequence, which serves as the text feature vector. Next, in the stage based on LSTM and Softmax classification output, the Window Feature Sequence is used as the input of the LSTM to obtain two one-dimensional vectors, which are joined by a concatenation layer. Finally, the result is classified through a fully connected layer, a Batch Normalization layer, and Softmax. The performance of the proposed model in recognizing authors of Chinese literature was evaluated on two datasets. The data we collected include works of different forms, such as prose and fiction. The results show that the proposed model can effectively identify authors, and its classification accuracy is significantly better than that of the benchmark model.
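The abstract does not spell out how the Window Feature Sequence is assembled, so the following is only a toy sketch of one common construction, assumed here for illustration: each convolution filter yields one feature map over the token windows, and the values that share a window position are grouped into one vector, giving a sequence that a recurrent layer such as an LSTM could consume. The function names, the ReLU activation, and the tiny hand-written inputs are all hypothetical and are not taken from the paper.

```python
from typing import List

Vector = List[float]
Kernel = List[Vector]  # one filter: k rows, each as wide as a word vector


def conv1d_feature_map(embedded: List[Vector], kernel: Kernel, bias: float = 0.0) -> Vector:
    """Slide one filter over consecutive token vectors and apply ReLU.

    Assumption: ReLU is used as the activation; the abstract does not say.
    """
    k = len(kernel)
    feature_map = []
    for start in range(len(embedded) - k + 1):
        response = bias
        for row in range(k):
            # Dot product of one filter row with one token vector.
            response += sum(w * x for w, x in zip(kernel[row], embedded[start + row]))
        feature_map.append(max(0.0, response))
    return feature_map


def window_feature_sequence(embedded: List[Vector], kernels: List[Kernel]) -> List[Vector]:
    """Group feature-map values by window position.

    Entry i holds the responses of every filter at window i, so the result
    is a sequence with one vector per window.
    """
    maps = [conv1d_feature_map(embedded, kernel) for kernel in kernels]
    return [list(column) for column in zip(*maps)]


if __name__ == "__main__":
    # Four made-up 2-dimensional "word vectors" standing in for Word2Vec output.
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
    # Two made-up filters, each spanning 2 consecutive tokens.
    filters = [[[1.0, 0.0], [0.0, 1.0]], [[-1.0, 0.0], [0.0, 0.0]]]
    sequence = window_feature_sequence(tokens, filters)
    print(len(sequence), len(sequence[0]))  # 3 windows, 2 filter responses each
```

In this arrangement the sequence length equals the number of windows and the vector width equals the number of filters, which is what lets the convolution output be read step by step by a sequence model.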

Pages: 16
