Large-Scale Information Extraction from Emails with Data Constraints

Cited by: 1
Authors
Gupta, Rajeev [1]
Kondapally, Ranganath [1]
Guha, Siddharth [1]
Affiliations
[1] Microsoft R&D, Hyderabad, India
Source
BIG DATA ANALYTICS (BDA 2019) | 2019 / Vol. 11932
Keywords
Emails; Information extraction; Machine learning; Learning by examples; Anonymization;
DOI
10.1007/978-3-030-37188-3_8
Chinese Library Classification number
TP [Automation technology, computer technology];
Subject classification code
0812;
Abstract
Email is the most frequently used web application for communication and collaboration because of its easy access, fast interactions, and convenient management. More than 60% of email traffic consists of business-to-consumer (B2C) emails (e.g., flight reservations, payment reminders, and order confirmations). Most of these emails are generated by filling a template with user- or transaction-specific values from databases. In this paper, we describe various algorithms for extracting important information from these emails. Unlike web pages, emails are personal, and because of privacy and legal considerations, no human other than the receiver can view them. Thus, adapting extraction techniques used for web pages, such as HTML wrapper-based techniques, poses privacy and scalability challenges. We describe an end-to-end information extraction system for emails: data collection, anonymization, classification, building the information extraction models, deployment, and monitoring. To handle the privacy and scalability issues, we focus on algorithms that can work with a minimal number of human-annotated samples for building classifiers and extraction techniques. Similarly, we present algorithms that minimize the number of samples requiring human inspection to detect precision and recall gaps in the extraction pipeline.
Pages: 124-139
Page count: 16