Development of content-based SMS classification application by using Word2Vec-based feature extraction

Cited by: 17

Authors:

Balli, Serkan [1]

Karasoy, Onur [1]

Affiliations:

[1] Mugla Sitki Kocman Univ, Fac Technol, Dept Informat Syst Engn, TR-48000 Mugla, Turkey

Source:

IET SOFTWARE | 2019 / Vol. 13 / Issue 04

Keywords:

electronic messaging; feature extraction; pattern classification; unsolicited e-mail; data privacy; e-mail filters; learning (artificial intelligence); mobile computing; unwanted messages; Word2Vec word embedding tool; ham words; classification algorithms; successful correct classification percentage; content-based SMS classification application; Word2Vec-based feature extraction; mobile instant messaging applications; Viber offer benefits; phone users; stable communication; collective communication; direct communication; short message service; reliable privacy-preserving technology; mobile communication; product promotion; promotion etc; spam messages; unknown sources; serious problem; SMS recipients; content-based classification model; SPAM;

DOI:

10.1049/iet-sen.2018.5046

Chinese Library Classification (CLC) number:

TP31 [Computer software];

Discipline classification code:

081202 ; 0835 ;

Abstract:

While mobile instant messaging applications such as WhatsApp, Messenger and Viber offer phone users benefits such as low cost, ease of use, and stable, collective and direct communication, SMS (short message service) is still considered a more reliable, privacy-preserving technology for mobile communication. For this reason, institutions that want to promote products through advertising, announcements, promotions, etc. turn to SMS. However, spam messages sent from unknown sources are a serious problem for SMS recipients. This study proposes a content-based classification model that uses machine learning to filter out unwanted messages. The model used for classification is built from the selected dataset with the Word2Vec word embedding tool. From this model, two new features are derived that measure the distance of a message to spam words and to ham words. The performance of several classification algorithms is compared using these two new features. The random forest method achieved an accuracy of 99.64%. This is a higher correct classification percentage than that reported by other studies using the same dataset.
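The two distance features mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the distances are computed, so the version below assumes cosine distance between a message's averaged word vector and the centroids of a spam word list and a ham word list. The toy vectors, the word lists and the function names are all hypothetical stand-ins for a trained Word2Vec model, not the authors' implementation.

```python
import math

# Hypothetical 3-dimensional vectors standing in for trained Word2Vec embeddings.
TOY_VECTORS = {
    "free":  [0.90, 0.10, 0.00],
    "win":   [0.80, 0.20, 0.10],
    "prize": [0.85, 0.15, 0.05],
    "lunch": [0.10, 0.90, 0.20],
    "home":  [0.00, 0.80, 0.30],
    "later": [0.20, 0.70, 0.40],
}
# Hypothetical word lists; the paper's actual spam and ham words are not given here.
SPAM_WORDS = ["free", "win", "prize"]
HAM_WORDS = ["lunch", "home", "later"]


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_vector(words, vectors):
    """Average the vectors of the known words; None if no word is known."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def distance_features(message, vectors=TOY_VECTORS):
    """Return (distance_to_spam, distance_to_ham) for one message."""
    msg_vec = mean_vector(message.lower().split(), vectors)
    if msg_vec is None:
        # Assumed fallback when no token has an embedding.
        return (1.0, 1.0)
    spam_centroid = mean_vector(SPAM_WORDS, vectors)
    ham_centroid = mean_vector(HAM_WORDS, vectors)
    return (cosine_distance(msg_vec, spam_centroid),
            cosine_distance(msg_vec, ham_centroid))
```

In a full pipeline, the resulting pair of values would be passed, alone or with other features, to a classifier such as the random forest named in the abstract.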

Citation

Pages: 295-304

Page count: 10
