Shallow Semantic Parsing of Product Offering Titles (for better automatic hyperlink insertion)

Cited by: 13
Authors
Melli, Gabor [1]
Affiliation
[1] VigLink Inc, 539 Bryant St, San Francisco, CA 94107 USA
Source
PROCEEDINGS OF THE 20TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING (KDD'14) | 2014
Keywords
shallow semantic parsing; automated terminology extraction; composite CRF ensembles; product offer titles; hyperlink insertion;
DOI
10.1145/2623330.2623343
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
With billions of database-generated pages on the Web where consumers can readily add priced product offerings to their virtual shopping cart, several opportunities will become possible once we can automatically recognize what exactly is being offered for sale on each page. We present a case study of a deployed data-driven system that first chunks individual titles into semantically classified sub-segments, and then uses this information to improve a hyperlink insertion service. To accomplish this process, we propose an annotation structure that is general enough to apply to offering titles from most e-commerce industries while also being specific enough to identify useful semantics about each offer. To automate the parsing task, we apply the best-practices approach of training a supervised conditional random fields model and discover that creating separate prediction models for some of the industries along with the use of model-ensembles achieves the best performance to date. We further report on a real-world application of the trained parser to the task of growing a lexical dictionary of product-related terms which critically provides background knowledge to an affiliate-marketing hyperlink insertion service. On a regular basis, we apply the parser to offering titles to produce a large set of labeled terms. From these candidates we select the most confidently predicted novel terms for review by crowd-sourced annotators. The agreed-upon terms are then added into a dictionary which significantly improves the performance of the link-insertion service. Finally, to continually improve system performance, we retrain the model in an online fashion by performing additional annotations on titles with incorrect predictions on each batch.
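The chunk-then-select step described in the abstract can be illustrated with a minimal sketch. It assumes the parser emits one BIO tag and one confidence score per token; the label names (BRAND, PRODUCT, FEATURE), the 0.9 threshold, and the helper names chunk_title and novel_candidates are hypothetical and are not taken from the paper. The hard-coded tags stand in for the output of a trained CRF model.

```python
from typing import List, Set, Tuple

# (word, BIO tag, per-token confidence). The tags here are hand-written
# placeholders for what a trained sequence model would predict.
Token = Tuple[str, str, float]
Segment = Tuple[str, str, float]  # (term, label, lowest token confidence)


def chunk_title(tokens: List[Token]) -> List[Segment]:
    """Group BIO-tagged tokens into labeled sub-segments of a title."""
    segments: List[Segment] = []
    words: List[str] = []
    label = None
    conf = 1.0
    for word, tag, p in tokens:
        # A segment ends at "O", at any "B-" tag, or at an "I-" tag
        # whose label differs from the segment currently being built.
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != label
        )
        if tag == "O" or starts_new:
            if words:
                segments.append((" ".join(words), label, conf))
            words, label, conf = [], None, 1.0
        if tag != "O":
            if not words:
                label = tag[2:]
            words.append(word)
            conf = min(conf, p)  # a segment is only as sure as its weakest token
    if words:
        segments.append((" ".join(words), label, conf))
    return segments


def novel_candidates(
    segments: List[Segment], dictionary: Set[str], threshold: float = 0.9
) -> List[Segment]:
    """Keep confident segments whose term is not yet in the dictionary."""
    return [
        (term, label, conf)
        for term, label, conf in segments
        if conf >= threshold and term.lower() not in dictionary
    ]


title = [
    ("Canon", "B-BRAND", 0.98),
    ("EOS", "B-PRODUCT", 0.95),
    ("Rebel", "I-PRODUCT", 0.92),
    ("T5", "I-PRODUCT", 0.91),
    ("with", "O", 0.99),
    ("18-55mm", "B-FEATURE", 0.60),
    ("Lens", "I-FEATURE", 0.70),
]
segments = chunk_title(title)
candidates = novel_candidates(segments, dictionary={"canon"})
```

In this sketch the surviving candidates would be the ones passed on for human review; taking the minimum token confidence as the segment score is one simple choice among several, and the abstract does not say how the deployed system scores a term.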
Pages: 1670-1678
Number of pages: 9