Improving word embeddings in Portuguese: increasing accuracy while reducing the size of the corpus

Cited by: 1
Authors
Pinto, Jose Pedro [1]
Viana, Paula [1, 2]
Teixeira, Ines [1]
Andrade, Maria [1, 3]
Affiliations
[1] INESC TEC, Porto, Portugal
[2] Polytech Porto, Sch Engn, Porto, Portugal
[3] Univ Porto, Fac Engn, Porto, Portugal
Keywords
Natural language processing; Machine learning; Multimedia systems; Context awareness; Word2Vec; Models
DOI
10.7717/peerj-cs.964
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The subjectiveness of multimedia content description has a strong negative impact on tag-based information retrieval. In our work, we propose enhancing available descriptions by adding semantically related tags. To this end, we use a word embedding technique based on the Word2Vec neural network, parameterized and trained on a new dataset built by scraping and pre-processing a large number of news stories from online newspapers. Our target language is Portuguese, one of the most spoken languages worldwide. The results achieved significantly outperform similar existing solutions developed for different languages, including Portuguese. Contributions also include an online application and an API available for external use. Although the presented work was designed to enhance multimedia content annotation, it can be used in several other application areas.
Pages: 22