Profiler: Integrated Statistical Analysis and Visualization for Data Quality Assessment

Cited by: 121
Authors
Kandel, Sean [1 ]
Parikh, Ravi [1 ]
Paepcke, Andreas [1 ]
Hellerstein, Joseph M.
Heer, Jeffrey [1 ]
Affiliations
[1] Stanford Univ, Stanford, CA 94305 USA
Source
PROCEEDINGS OF THE INTERNATIONAL WORKING CONFERENCE ON ADVANCED VISUAL INTERFACES | 2012
Funding
U.S. National Science Foundation
Keywords
Data analysis; visualization; data quality; anomaly detection
DOI
10.1145/2254556.2254659
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Data quality issues such as missing, erroneous, extreme and duplicate values undermine analysis and are time-consuming to find and fix. Automated methods can help identify anomalies, but determining what constitutes an error is context-dependent and so requires human judgment. While visualization tools can facilitate this process, analysts must often manually construct the necessary views, requiring significant expertise. We present Profiler, a visual analysis tool for assessing quality issues in tabular data. Profiler applies data mining methods to automatically flag problematic data and suggests coordinated summary visualizations for assessing the data in context. The system contributes novel methods for integrated statistical and visual analysis, automatic view suggestion, and scalable visual summaries that support real-time interaction with millions of data points. We present Profiler's architecture, including modular components for custom data types, anomaly detection routines and summary visualizations, and describe its application to motion picture, natural disaster and water quality data sets.
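As a rough illustration of the kind of automated flagging the abstract describes, the sketch below marks missing, extreme and duplicate values in a single numeric column. It is not Profiler's implementation: the function name `flag_column`, the use of a median/MAD robust z-score, and the 3.5 threshold are all assumptions chosen for the example. The flags are only candidates, since deciding what counts as an error still requires human judgment.

```python
from collections import Counter
from statistics import median


def flag_column(values, threshold=3.5):
    """Flag candidate quality issues in one numeric column.

    Returns row indices for missing, extreme, and duplicate values.
    The function name, scoring rule, and threshold are illustrative
    assumptions, not taken from the Profiler paper.
    """
    # Missing values are represented here as None (an assumption).
    missing = [i for i, v in enumerate(values) if v is None]
    present = [(i, v) for i, v in enumerate(values) if v is not None]

    # Extreme values: robust z-score based on the median and the
    # median absolute deviation (MAD). Skipped when MAD is zero.
    extreme = []
    if present:
        center = median(v for _, v in present)
        mad = median(abs(v - center) for _, v in present)
        if mad > 0:
            extreme = [i for i, v in present
                       if 0.6745 * abs(v - center) / mad > threshold]

    # Duplicates: every row whose value occurs more than once.
    counts = Counter(v for _, v in present)
    duplicate = [i for i, v in present if counts[v] > 1]

    return {"missing": missing, "extreme": extreme, "duplicate": duplicate}


flags = flag_column([10, 12, 11, None, 13, 12, 500])
print(flags)
```

Returning row indices rather than a cleaned column keeps the decision with the analyst, who can inspect each flagged row in context before changing anything.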
Pages: 547-554
Page count: 8