On the Semi-unsupervised Construction of Auto-keyphrases Corpus from Large-Scale Chinese Automobile E-Commerce Reviews

Authors
Li, Yang [1]
Qian, Cheng [1]
Che, Haoyang [2]
Wang, Rui [3]
Wang, Zhichun [1]
Zhang, Jiacai [1]
Affiliations
[1] Beijing Normal Univ, Coll Artificial Intelligence, Beijing, Peoples R China
[2] Autosmart Inc, Data Intelligence Lab, Beijing, Peoples R China
[3] Princeton Int Sch Math & Sci, Princeton, NJ 08540 USA
Source
CHINESE COMPUTATIONAL LINGUISTICS, CCL 2019 | 2019 / Vol. 11856
Keywords
Auto-keyphrases corpus; Keyphrases corpus; Chinese corpus; E-Commerce website reviews; Position Rank; Semi-unsupervised method
DOI
10.1007/978-3-030-32381-3_37
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Long-standing automobile e-commerce websites in China have accumulated huge numbers of auto reviews, and extracting keyphrases from these reviews can help researchers and practitioners capture online users' typical opinions and understand their underlying motivations. However, no relevant text corpus has existed so far. In this paper, the authors propose a semi-unsupervised scheme for constructing a comprehensive auto-keyphrases corpus from reviews collected online from Chinese automobile e-commerce websites. The scheme is based on Position Rank, which performs well at keyphrase extraction when labeled data are scarce. The iterative annotation process consists of three rounds of labeling and two rounds of correction. In each of the three rounds of unsupervised labeling, the model extracts the seven most important words as the keyphrases of the whole paragraph. Between labeling rounds there are manual check, correction, re-check, and arbitration stages, in which earlier labeling errors are corrected and new vocabulary and rules are summarized to further improve the unsupervised model. For comparison, the paper also runs experiments with two other unsupervised approaches, TF-IDF and Text Rank; the experimental results show that Position Rank is a more efficient and effective method for keyphrase extraction. At the time of writing, the auto-keyphrases corpus contained 110,023 entries, and there is still much room for improvement in corpus volume and labeling quality.
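To make the extraction step in the abstract concrete, the following is a minimal sketch of a position-biased PageRank over a word co-occurrence graph, the general idea behind Position Rank. It is not the authors' implementation. The function name `position_rank`, the window size, the damping factor, and the sample tokens are all assumptions for illustration. The input is assumed to be already segmented into words (a Chinese review would first need a word segmenter), and no part-of-speech filtering or phrase assembly is done.

```python
from collections import defaultdict


def position_rank(tokens, top_k=7, window=3, alpha=0.85, iters=100, tol=1e-6):
    """Rank words with a position-biased PageRank (illustrative sketch only)."""
    if not tokens:
        return []

    # Undirected co-occurrence graph: two distinct words are linked each time
    # they appear within `window - 1` positions of each other.
    weights = defaultdict(lambda: defaultdict(float))
    for i, u in enumerate(tokens):
        for v in tokens[i + 1 : i + window]:
            if u != v:
                weights[u][v] += 1.0
                weights[v][u] += 1.0

    # Position bias: an occurrence at 1-based position i contributes 1 / i,
    # so early and frequent words receive more restart probability.
    bias = defaultdict(float)
    for i, word in enumerate(tokens, start=1):
        bias[word] += 1.0 / i
    total = sum(bias.values())
    bias = {word: b / total for word, b in bias.items()}

    vocab = list(bias)
    out_weight = {word: sum(weights[word].values()) for word in vocab}
    score = {word: 1.0 / len(vocab) for word in vocab}

    # Power iteration of the biased PageRank update.
    for _ in range(iters):
        new_score = {}
        for v in vocab:
            incoming = sum(
                score[u] * wt / out_weight[u] for u, wt in weights[v].items()
            )
            new_score[v] = (1 - alpha) * bias[v] + alpha * incoming
        delta = sum(abs(new_score[w] - score[w]) for w in vocab)
        score = new_score
        if delta < tol:
            break

    return sorted(vocab, key=lambda w: score[w], reverse=True)[:top_k]


# Hypothetical, pre-segmented review used only to exercise the function.
sample = ["engine", "noise", "engine", "power", "seat", "engine", "fuel"]
print(position_rank(sample, top_k=3))
```

The default `top_k=7` mirrors the seven words per paragraph mentioned in the abstract; every other parameter value here is a placeholder, not a setting reported by the paper.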
Pages: 452-464
Page count: 13