A Machine Learning Approach for Layout Inference in Spreadsheets

Cited by: 31
Authors
Koci, Elvis [1 ]
Thiele, Maik [1 ]
Romero, Oscar [2 ]
Lehner, Wolfgang [1 ]
Affiliations
[1] Tech Univ Dresden, Dept Comp Sci, Database Technol Grp, Dresden, Germany
[2] Univ Politecn Catalunya UPC BarcelonaTech, Dept Engn Serv & Sist Informacio, C-Jordi Girona 1, Campus Nord, Barcelona, Spain
Source
KDIR: PROCEEDINGS OF THE 8TH INTERNATIONAL JOINT CONFERENCE ON KNOWLEDGE DISCOVERY, KNOWLEDGE ENGINEERING AND KNOWLEDGE MANAGEMENT - VOL. 1 | 2016
Keywords
Spreadsheets; Tabular; Layout; Structure; Machine Learning; Knowledge Discovery
DOI
10.5220/0006052200770088
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Spreadsheet applications are among the most widely used tools for content generation and presentation in industry and on the Web. Despite this success, there is no comprehensive approach for automatically extracting and reusing the wealth of data maintained in this format. The biggest obstacle is the lack of awareness about the structure of the data in spreadsheets, which could otherwise provide the means to automatically understand and extract knowledge from these files. In this paper, we propose a classification approach to discover the layout of tables in spreadsheets. To this end, we focus on the cell level, considering a wide range of features not covered before by related work. We evaluated the performance of our classifiers on a large dataset covering three different corpora from various domains. Finally, our work includes a novel technique for detecting and repairing incorrectly classified cells in a post-processing step. The experimental results show that our approach delivers very high accuracy, bringing us a crucial step closer to automatic table extraction.
Pages: 77 - 88
Number of pages: 12