Russian Web Tables: A Public Corpus of Web Tables for Russian Language Based on Wikipedia

Cited by: 0
Authors
Fedorov P.E. [1]
Mironov A.V. [1]
Chernishev G.A. [1, 2]
Affiliations
[1] UniData LLC, St. Petersburg
[2] Saint-Petersburg University, St. Petersburg
Keywords
corpus; dataset construction; Russian language; Russian Wikipedia; table crawler; Web tables; Wikipedia
DOI
10.1134/S1995080223010110
Abstract
Corpora that contain tabular data, such as WebTables, are a vital resource for the academic community. Essentially, they are the backbone of modern research in information management. They are used for tasks such as data extraction, knowledge base construction, question answering, column semantic type detection, and many others. Such corpora are useful not only as a source of data but also as a basis for building test datasets. Until now, no such corpus existed for the Russian language, which seriously hindered research in these areas. In this paper, we present the first corpus of Web tables built specifically from Russian-language material. It was built with a dedicated toolkit that we developed to crawl the Russian Wikipedia. Both the corpus and the toolkit are open-source and publicly available. Finally, we present a short study that describes Russian Wikipedia tables and their statistics. © 2023, Pleiades Publishing, Ltd.
Pages: 111-122
Number of pages: 11