Improving FastText with inverse document frequency of subwords

Cited by: 18
Authors
Choi, Jaekeol [1 ,2 ]
Lee, Sang-Woong [2 ]
Affiliations
[1] NAVER, Bundang Gu Buljeong Ro 8, Seongnam Si, South Korea
[2] Gachon Univ, Pattern Recognit & Machine Learning Lab, Sujeong Gu SeongNam Daero 1342, Seongnam Si, South Korea
Keywords
Word embedding; FastText; Inverse document frequency; Word2vec;
DOI
10.1016/j.patrec.2020.03.003
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Word embedding is important in natural language processing, and word2vec is a representative algorithm. However, word2vec and many other dictionary-based word embedding algorithms create word vectors only for words that appear in the training data, ignoring the morphological features of those words. The FastText algorithm was proposed to solve this problem: it builds a word vector from subword vectors, making it possible to embed even words never seen during training. Because of these morphological features, FastText is strong in syntactic tasks but weak in semantic tasks compared with word2vec. In this paper, we propose a method for improving FastText using the inverse document frequency of subwords. Our approach is intended to overcome the weakness of FastText in semantic tasks. In our experiments, the proposed method shows improved results on semantic tests with only a small loss on syntactic tests. Our method can be applied to any word embedding algorithm that uses subwords. We additionally tested probabilistic FastText, an algorithm designed to distinguish words with multiple meanings, with the inverse document frequency added, and the results confirmed an improved performance. (C) 2020 Elsevier B.V. All rights reserved.
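To make the idea concrete, the sketch below illustrates the general scheme the abstract describes: extract FastText-style character n-gram subwords, compute each subword's inverse document frequency over a corpus, and compose a word vector as an IDF-weighted average of its subword vectors rather than a plain sum. This is a minimal illustration of the weighting idea, not the authors' exact formulation; the helper names (`char_ngrams`, `subword_idf`, `word_vector`) and the n-gram range defaults are assumptions.

```python
import math
from collections import Counter


def char_ngrams(word, n_min=3, n_max=6):
    """FastText-style subwords: character n-grams of '<word>' (with boundary markers)."""
    token = f"<{word}>"
    return [token[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(token) - n + 1)]


def subword_idf(docs, n_min=3, n_max=6):
    """IDF of each subword, counting in how many documents it occurs (hypothetical helper)."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        # Count each subword at most once per document.
        df.update({g for w in doc for g in char_ngrams(w, n_min, n_max)})
    return {g: math.log(n_docs / df[g]) for g in df}


def word_vector(word, subword_vecs, idf):
    """IDF-weighted average of the word's known subword vectors (instead of a plain sum)."""
    grams = [g for g in char_ngrams(word) if g in subword_vecs]
    if not grams:
        return None
    weights = [idf.get(g, 0.0) for g in grams]
    total = sum(weights) or 1.0  # fall back to uniform-ish scale if all IDFs are zero
    dim = len(next(iter(subword_vecs.values())))
    vec = [0.0] * dim
    for g, w in zip(grams, weights):
        for i, x in enumerate(subword_vecs[g]):
            vec[i] += (w / total) * x
    return vec
```

Subwords shared by many words (e.g. common suffixes) occur in most documents and get a low IDF, so under this weighting they contribute less to the composed vector, which is the intuition behind the reported gain on semantic tests.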
Pages: 165-172
Number of pages: 8