A Machine Learning Approach for Layout Inference in Spreadsheets

Cited by: 31
Authors
Koci, Elvis [1 ]
Thiele, Maik [1 ]
Romero, Oscar [2 ]
Lehner, Wolfgang [1 ]
Affiliations
[1] Tech Univ Dresden, Dept Comp Sci, Database Technol Grp, Dresden, Germany
[2] Univ Politecn Catalunya UPC BarcelonaTech, Dept Engn Serv & Sist Informacio, C-Jordi Girona 1, Campus Nord, Barcelona, Spain
Source
KDIR: PROCEEDINGS OF THE 8TH INTERNATIONAL JOINT CONFERENCE ON KNOWLEDGE DISCOVERY, KNOWLEDGE ENGINEERING AND KNOWLEDGE MANAGEMENT - VOL. 1 | 2016
Keywords
Spreadsheets; Tabular; Layout; Structure; Machine Learning; Knowledge Discovery
DOI
10.5220/0006052200770088
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Spreadsheet applications are among the most widely used tools for content generation and presentation in industry and on the Web. Despite this success, no comprehensive approach exists to automatically extract and reuse the wealth of data maintained in this format. The biggest obstacle is the lack of awareness about the structure of the data in spreadsheets, which otherwise could provide the means to automatically understand and extract knowledge from these files. In this paper, we propose a classification approach to discover the layout of tables in spreadsheets. To this end, we focus on the cell level, considering a wide range of features not covered before by related work. We evaluated the performance of our classifiers on a large dataset covering three different corpora from various domains. Finally, our work includes a novel technique for detecting and repairing incorrectly classified cells in a post-processing step. The experimental results show that our approach delivers very high accuracy, bringing us a crucial step closer to automatic table extraction.
Pages: 77 - 88
Number of pages: 12
Related papers
50 in total
  • [31] Detection of layout-purpose TABLE tags based on machine learning
    Okada, Hidehiko
    Miura, Taiki
    UNIVERSAL ACCESS IN HUMAN-COMPUTER INTERACTION: APPLICATIONS AND SERVICES, PT 3, PROCEEDINGS, 2007, : 116 - +
  • [32] Learning constraints in spreadsheets and tabular data
    Samuel Kolb
    Sergey Paramonov
    Tias Guns
    Luc De Raedt
    Machine Learning, 2017, 106 : 1441 - 1468
  • [33] Learning constraints in spreadsheets and tabular data
    Kolb, Samuel
    Paramonov, Sergey
    Guns, Tias
    De Raedt, Luc
    MACHINE LEARNING, 2017, 106 (9-10) : 1441 - 1468
  • [34] The ALAMO approach to machine learning
    Sahinidis, Nick
    26TH EUROPEAN SYMPOSIUM ON COMPUTER AIDED PROCESS ENGINEERING (ESCAPE), PT B, 2016, 38B : 2410 - 2410
  • [35] A Machine Learning Approach to SDL
    Kannavara, Raghudeep
    Gressel, Gilad
    Fagbemi, Damilare
    Chow, Richard
    2017 IEEE CYBERSECURITY DEVELOPMENT (SECDEV), 2017, : 10 - 15
  • [36] Evaluating the layout quality of UML class diagrams using machine learning
    Bergstroem, Gustav
    Hujainah, Fadhl
    Truong, Ho-Quang
    Jolak, Rodi
    Rukmono, Satrio Adi
    Nurwidyantoro, Arif
    Chaudron, Michel R. V.
    JOURNAL OF SYSTEMS AND SOFTWARE, 2022, 192
  • [37] Music Document Layout Analysis through Machine Learning and Human Feedback
    Calvo-Zaragoza, Jorge
    Zhang, Ke
    Saleh, Zeyad
    Vigliensoni, Gabriel
    Fujinaga, Ichiro
    2017 14TH IAPR INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION (ICDAR 2017), VOL 2, 2017, : 23 - 24
  • [38] Voting: A machine learning approach
    Burka, David
    Puppe, Clemens
    Szepesvary, Laszlo
    Tasnadi, Attila
    EUROPEAN JOURNAL OF OPERATIONAL RESEARCH, 2022, 299 (03) : 1003 - 1017
  • [39] Disease Inference on Medical Datasets Using Machine Learning and Deep Learning Algorithms
    Chinnaswamy, Arunkumar
    Srinivasan, Ramakrishnan
    Gaurang, Desai Prutha
    COMPUTATIONAL VISION AND BIO-INSPIRED COMPUTING, 2020, 1108 : 902 - 908
  • [40] Planter: Rapid Prototyping of In-Network Machine Learning Inference
    Zheng, Changgang
    Zang, Mingyuan
    Hong, Xinpeng
    Perreault, Liam
    Bensoussane, Riyad
    Vargaftik, Shay
    Ben-Itzhak, Yaniv
    Zilberman, Noa
    ACM SIGCOMM COMPUTER COMMUNICATION REVIEW, 2024, 54 (01) : 2 - 20