On the Semi-unsupervised Construction of Auto-keyphrases Corpus from Large-Scale Chinese Automobile E-Commerce Reviews

Cited by: 0

Authors:

Li, Yang ^{[1]}

Qian, Cheng ^{[1]}

Che, Haoyang ^{[2]}

Wang, Rui ^{[3]}

Wang, Zhichun ^{[1]}

Zhang, Jiacai ^{[1]}

Affiliations:

[1] Beijing Normal Univ, Coll Artificial Intelligence, Beijing, Peoples R China

[2] Autosmart Inc, Data Intelligence Lab, Beijing, Peoples R China

[3] Princeton Int Sch Math & Sci, Princeton, NJ 08540 USA

Source:

CHINESE COMPUTATIONAL LINGUISTICS, CCL 2019 | 2019 / Vol. 11856

Keywords:

Auto-keyphrases corpus; Keyphrases corpus; Chinese corpus; E-Commerce website reviews; Position Rank; Semi-unsupervised method;

DOI:

10.1007/978-3-030-32381-3_37

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The long-standing automobile e-commerce websites in China have accumulated huge amounts of auto reviews, and extracting keyphrases from these reviews can help researchers and practitioners identify online users' typical opinions and understand their underlying motivations. However, no relevant text corpus has existed so far. In this paper, the authors propose a semi-unsupervised scheme for constructing a comprehensive auto-keyphrases corpus from reviews collected online from Chinese automobile e-commerce websites. The scheme uses Position Rank, which performs very well at keyphrase extraction when labeled data is scarce. The iterative annotation process consists of three rounds of labeling and two rounds of correction. In each of the three rounds of unsupervised labeling, the model extracts the seven most important words as the keyphrases of the whole paragraph. Between labeling rounds there are manual check, correction, re-check, and arbitration stages, in which earlier labeling errors are corrected and new vocabulary and rules are summarized to further improve the unsupervised model. For comparison, the paper also runs experiments with two other unsupervised approaches, TF-IDF and Text Rank; the experimental results show that Position Rank is a more efficient and effective method for keyphrase extraction. At the time of writing, the auto-keyphrases corpus contained 110,023 entries, and there is still much room for improvement in corpus volume and labeling quality.
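The unsupervised labeling step described in the abstract can be illustrated with a minimal sketch of a position-biased PageRank over a word co-occurrence graph, which is the general idea behind Position Rank. This is not the authors' implementation: the function name `position_rank`, the window size, the damping factor, and the iteration count are assumptions chosen for illustration, and the sketch assumes the review has already been segmented into words (Chinese word segmentation and part-of-speech filtering are left out). Only the default of seven returned words comes from the abstract.

```python
from collections import defaultdict


def position_rank(tokens, window=3, alpha=0.85, iterations=50, top_k=7):
    """Rank words with a position-biased PageRank (illustrative sketch).

    tokens: list of already-segmented words, in document order.
    window, alpha, iterations: assumed values, not taken from the paper.
    top_k: number of words to return (the abstract mentions seven).
    """
    if not tokens:
        return []

    # Undirected co-occurrence weights for words within `window` tokens.
    weights = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window, len(tokens))):
            other = tokens[j]
            if other != word:
                weights[word][other] += 1.0
                weights[other][word] += 1.0

    # Position bias: an occurrence at 1-based position p contributes 1/p,
    # so words that appear early (and often) get a larger restart weight.
    bias = defaultdict(float)
    for pos, word in enumerate(tokens, start=1):
        bias[word] += 1.0 / pos
    total = sum(bias.values())
    bias = {word: value / total for word, value in bias.items()}

    # Power iteration of the biased PageRank update.
    scores = {word: 1.0 / len(bias) for word in bias}
    for _ in range(iterations):
        updated = {}
        for word in bias:
            incoming = sum(
                weights[nbr][word] / sum(weights[nbr].values()) * scores[nbr]
                for nbr in weights[word]
            )
            updated[word] = (1 - alpha) * bias[word] + alpha * incoming
        scores = updated

    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Calling `position_rank(segmented_review)` on a list of segmented words returns up to seven candidate words; in the scheme the abstract describes, such output would then pass through the manual check, correction, re-check, and arbitration stages before entering the corpus.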


Pages: 452-464

Page count: 13

