Profiler: Integrated Statistical Analysis and Visualization for Data Quality Assessment

Cited by: 121
Authors
Kandel, Sean [1]
Parikh, Ravi [1]
Paepcke, Andreas [1]
Hellerstein, Joseph M.
Heer, Jeffrey [1]
Affiliations
[1] Stanford Univ, Stanford, CA 94305 USA
Source
PROCEEDINGS OF THE INTERNATIONAL WORKING CONFERENCE ON ADVANCED VISUAL INTERFACES | 2012
Funding
U.S. National Science Foundation
Keywords
Data analysis; visualization; data quality; anomaly detection;
DOI
10.1145/2254556.2254659
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Data quality issues such as missing, erroneous, extreme and duplicate values undermine analysis and are time-consuming to find and fix. Automated methods can help identify anomalies, but determining what constitutes an error is context-dependent and so requires human judgment. While visualization tools can facilitate this process, analysts must often manually construct the necessary views, requiring significant expertise. We present Profiler, a visual analysis tool for assessing quality issues in tabular data. Profiler applies data mining methods to automatically flag problematic data and suggests coordinated summary visualizations for assessing the data in context. The system contributes novel methods for integrated statistical and visual analysis, automatic view suggestion, and scalable visual summaries that support real-time interaction with millions of data points. We present Profiler's architecture, including modular components for custom data types, anomaly detection routines and summary visualizations, and describe its application to motion picture, natural disaster and water quality data sets.
Pages: 547-554
Page count: 8