Automated formatting verification technique of paperwork based on the gradient boosting on decision trees

Cited by: 3
Authors
Nasyrov, Nail [1]
Komarov, Mikhail [1]
Tartynskikh, Petr [1]
Gorlushkina, Nataliya [1]
Affiliations
[1] ITMO Univ, 49 Kronverksky Pr, St Petersburg 197101, Russia
Source
9TH INTERNATIONAL YOUNG SCIENTISTS CONFERENCE IN COMPUTATIONAL SCIENCE, YSC2020 | 2020 / Vol. 178
Keywords
machine learning; gradient boosting on decision trees; CatBoost; design standard; multiclass classification; docx; formatting verification; modelling
DOI
10.1016/j.procs.2020.11.038
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
The article describes an automated technique for verifying the formatting of docx document elements, which forms the basis of a developed online service. Checking that a document is formatted correctly is an important task when writing research papers, explanatory notes for course projects, and other scientific works. The article describes an approach for identifying the design features of text document elements. It also presents the structure of the client-server interaction of the service, whose operation was simulated. CatBoost, a gradient boosting on decision trees algorithm, was chosen as the primary tool for multiclass classification. Empirically refined parameters of the algorithm that increase its accuracy are described. Special attention is paid to the developed method for checking the classification results of elements whose sequence follows different regularities. This approach makes it possible to identify docx file elements whose formatting class was incorrectly identified by the classifier. In some cases, the classifier's results can be overridden to increase the accuracy of checking element formatting in docx files. (C) 2020 The Authors. Published by Elsevier B.V.
Pages: 365-374
Page count: 10