Big Data Approach to Developing Adaptable Corpus Tools

Times Cited: 0
Authors
Lutskiv, Andriy [1 ]
Popovych, Nataliya [2 ]
Affiliations
[1] Ternopil Ivan Puluj Natl Tech Univ, Comp Syst & Networks Dept, Ternopol, Ukraine
[2] Uzhgorod Natl Univ, State Univ, Dept Multicultural Educ & Translat, Uzhgorod, Ukraine
Source
COMPUTATIONAL LINGUISTICS AND INTELLIGENT SYSTEMS (COLINS 2020), VOL I: MAIN CONFERENCE | 2020 / Vol. 2604
Keywords
adaptable text corpus; Big Data; natural language processing; natural language understanding; statistics; machine learning; data mining; conceptual analysis; corpus-based translation studies; conceptual seme; componential analysis
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Theory of artificial intelligence]
Discipline codes
081104; 0812; 0835; 1405
Abstract
This paper deals with the development of corpus tools for building a corpus of religious and historical texts. The corpus is designed to support data ingestion, text preprocessing, statistics calculation, and qualitative and quantitative text analysis, and all of these features are customizable. Under the Big Data approach, the corpus tools are treated as a data platform, and the corpus itself is treated as a combination of data lake and data warehouse solutions. Ways are suggested for resolving the algorithmic, methodological, and architectural problems that arise while building a corpus tool. The effectiveness of natural language processing and natural language understanding methods, libraries, and tools is checked on the example of building corpora of historical and religious texts. Workflows were created that comprise data extraction from sources, data transformation, data enrichment, and loading into corpus storage with the proper qualitative and quantitative characteristics. Data extraction approaches common for ingestion into a data lake were used; transformations and enrichments were realized by means of natural language processing and natural language understanding techniques, and statistical characteristics were calculated by means of machine learning techniques. Finding keywords and the relations between them became possible through latent semantic analysis, term and N-gram frequencies, and term frequency-inverse document frequency. Computational complexity and the amount of information noise were reduced by singular value decomposition, and the influence of singular value decomposition parameters on text processing accuracy has been analyzed. The results of a corpus-based computational experiment for religious text concept analysis are shown.
Architectural approaches to building a corpus-based data platform and the usage of software tools, frameworks, and specific libraries are suggested.
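The abstract names term and N-gram frequencies together with TF-IDF as the basis for keyword extraction. As a minimal illustrative sketch only (the paper's actual tooling is not reproduced here; the toy documents, whitespace tokenizer, and unsmoothed IDF formula are all assumptions), plain-Python TF-IDF over a small document set:

```python
import math
from collections import Counter

def tokenize(text):
    # Naive whitespace tokenizer; real corpus tools would use an NLP library.
    return text.lower().split()

def ngrams(tokens, n):
    # Contiguous n-grams as space-joined strings.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(docs):
    """Per-document TF-IDF weights: tf(t, d) * log(N / df(t))."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))  # document frequency counts each document once
    weights = []
    for toks in tokenized:
        tf = Counter(toks)
        total = len(toks)
        weights.append({
            t: (c / total) * math.log(n_docs / df[t])
            for t, c in tf.items()
        })
    return weights

docs = ["the holy text", "the old text corpus"]
w = tf_idf(docs)
# "the" occurs in every document, so its IDF (and hence its weight) is zero,
# while document-specific terms such as "holy" receive positive weight.
```

With this weighting, terms shared by all documents are suppressed and document-specific candidate keywords stand out, which is the effect the abstract relies on.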
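The noise reduction described above rests on singular value decomposition: latent semantic analysis keeps only the top-k singular components of the term-document matrix and discards the rest as noise. A self-contained sketch of the underlying computation, assuming a hypothetical toy matrix rather than the paper's data, estimates the dominant singular value by power iteration on AᵀA:

```python
import math

def matmul(X, Y):
    # Plain-Python matrix product.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(M):
    return [list(col) for col in zip(*M)]

def top_singular_value(A, iters=200):
    """Largest singular value of A via power iteration on A^T A.

    LSA retains the top-k such components; the smaller ones carry noise.
    """
    AtA = matmul(transpose(A), A)
    v = [1.0] * len(AtA)
    for _ in range(iters):
        w = [sum(a * x for a, x in zip(row, v)) for row in AtA]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
    # Rayleigh quotient: dominant eigenvalue of A^T A equals sigma_1 squared.
    Av = [sum(a * x for a, x in zip(row, v)) for row in AtA]
    return math.sqrt(sum(x * y for x, y in zip(v, Av)))

A = [[3.0, 0.0], [4.0, 0.0]]  # singular values 5 and 0
```

Truncating at k components in this way is what governs the accuracy/noise trade-off whose parameter influence the abstract says was analyzed.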
Pages: 22