On Automating Basic Data Curation Tasks

Cited by: 19
Authors
Beheshti, Seyed-Mehdi-Reza [1]
Tabebordbar, Alireza [1]
Benatallah, Boualem [1]
Nouri, Reza [1]
Affiliations
[1] Univ New South Wales, Sydney, NSW, Australia
Source
WWW'17 COMPANION: PROCEEDINGS OF THE 26TH INTERNATIONAL CONFERENCE ON WORLD WIDE WEB | 2017
Keywords
Data Curation; Big Data Analytics; Curation API; SOCIAL MEDIA; ANALYTICS;
DOI
10.1145/3041021.3054726
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812 ;
Abstract
Big data analytics is firmly recognized as a strategic priority for modern enterprises. At the heart of big data analytics lies the data curation process, which consists of tasks that transform raw data (unstructured, semi-structured and structured data sources) into curated data, i.e. contextualized data and knowledge that is maintained and made available for use by end-users and applications. To achieve this, the data curation process may involve techniques and algorithms for extracting, classifying, linking, merging, enriching, sampling, and summarizing data and knowledge. To facilitate the data curation process and enhance the productivity of researchers and developers, we identify and implement a set of basic data curation APIs and make them available as services that assist researchers and developers in transforming their raw data into curated data. The curation APIs enable developers to easily add the following features to their data applications: extracting keywords, parts of speech, and named entities such as Persons, Locations, Organizations, Companies, Products, Diseases, Drugs, etc.; providing synonyms and stems for extracted information items by leveraging lexical knowledge bases for the English language such as WordNet; linking extracted entities to external knowledge bases such as Google Knowledge Graph and Wikidata; discovering similarity among the extracted information items, such as calculating similarity between strings and numbers; classifying, sorting and categorizing data into various types, forms or any other distinct class; and indexing structured and unstructured data. These services can be accessed via a REST API, and the data is returned as a JSON file that can be integrated into data applications. The curation APIs are available as an open source project on GitHub.
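To make the similarity task named in the abstract concrete, the following is a minimal sketch, not the authors' implementation or the project's actual API. It assumes a case-insensitive sequence-matching ratio for strings (Python's standard `difflib`) and a relative-difference score for numbers, and it returns the result as JSON in the spirit of the services described. The function names `string_similarity`, `number_similarity`, and `similarity_as_json`, and the JSON field names, are hypothetical.

```python
import json
from difflib import SequenceMatcher


def string_similarity(a: str, b: str) -> float:
    """Score in [0, 1] from difflib's SequenceMatcher, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def number_similarity(x: float, y: float) -> float:
    """1.0 for equal numbers, falling toward 0.0 as the relative gap grows."""
    if x == y:
        return 1.0
    gap = abs(x - y)
    # The denominator is never zero here because gap > 0 when x != y.
    return 1.0 - gap / max(abs(x), abs(y), gap)


def similarity_as_json(a, b) -> str:
    """Pick a measure by input type and serialize the result as JSON.

    The field names below are illustrative assumptions only.
    """
    if isinstance(a, str) and isinstance(b, str):
        kind, score = "string", string_similarity(a, b)
    else:
        kind, score = "number", number_similarity(float(a), float(b))
    return json.dumps(
        {"type": kind, "left": a, "right": b, "score": round(score, 3)}
    )


if __name__ == "__main__":
    print(similarity_as_json("Data Curation", "data curation API"))
    print(similarity_as_json(165, 169))
```

A real curation service would likely offer several measures per data type; the sketch picks one of each only to show how a typed similarity score can be packaged as JSON.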
Pages: 165-169
Page count: 5