Profiler: Integrated Statistical Analysis and Visualization for Data Quality Assessment

Cited by: 121
Authors
Kandel, Sean [1]
Parikh, Ravi [1]
Paepcke, Andreas [1]
Hellerstein, Joseph M.
Heer, Jeffrey [1]
Affiliations
[1] Stanford Univ, Stanford, CA 94305 USA
Source
PROCEEDINGS OF THE INTERNATIONAL WORKING CONFERENCE ON ADVANCED VISUAL INTERFACES | 2012
Funding
U.S. National Science Foundation
Keywords
Data analysis; visualization; data quality; anomaly detection
DOI
10.1145/2254556.2254659
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Data quality issues such as missing, erroneous, extreme and duplicate values undermine analysis and are time-consuming to find and fix. Automated methods can help identify anomalies, but determining what constitutes an error is context-dependent and so requires human judgment. While visualization tools can facilitate this process, analysts must often manually construct the necessary views, requiring significant expertise. We present Profiler, a visual analysis tool for assessing quality issues in tabular data. Profiler applies data mining methods to automatically flag problematic data and suggests coordinated summary visualizations for assessing the data in context. The system contributes novel methods for integrated statistical and visual analysis, automatic view suggestion, and scalable visual summaries that support real-time interaction with millions of data points. We present Profiler's architecture, including modular components for custom data types, anomaly detection routines and summary visualizations, and describe its application to motion picture, natural disaster and water quality data sets.
Pages: 547-554
Page count: 8
Related papers
50 in total
  • [1] Geo-spatial data analysis, quality assessment and visualization
    Ge, Yong
    Bai, Hexiang
    Li, Sanping
    COMPUTATIONAL SCIENCE AND ITS APPLICATIONS - ICCSA 2008, PT 1, PROCEEDINGS, 2008, 5072 : 258 - 267
  • [2] Multidimensional data visualization in the statistical analysis of curricula
    Dzemyda, G
    COMPUTATIONAL STATISTICS & DATA ANALYSIS, 2005, 49 (01) : 265 - 281
  • [3] vcfView: An Extensible Data Visualization and Quality Assurance Platform for Integrated Somatic Variant Analysis
    O'Sullivan, Brian
    Seoighe, Cathal
    CANCER INFORMATICS, 2020, 19
  • [4] ROOT - A C++ framework for petabyte data storage, statistical analysis and visualization
    Antcheva, I.
    Ballintijn, M.
    Bellenot, B.
    Biskup, M.
    Brun, R.
    Buncic, N.
    Canal, Ph.
    Casadei, D.
    Couet, O.
    Fine, V.
    Franco, L.
    Ganis, G.
    Gheata, A.
    Maline, D. Gonzalez
    Goto, M.
    Iwaszkiewicz, J.
    Kreshuk, A.
    Segura, D. Marcos
    Maunder, R.
    Moneta, L.
    Naumann, A.
    Offermann, E.
    Onuchin, V.
    Panacek, S.
    Rademakers, F.
    Russo, R.
    Tadel, M.
    COMPUTER PHYSICS COMMUNICATIONS, 2009, 180 (12) : 2499 - 2512
  • [5] ROOT - A C++ framework for petabyte data storage, statistical analysis and visualization
    Antcheva, I.
    Ballintijn, M.
    Bellenot, B.
    Biskup, M.
    Brun, R.
    Buncic, N.
    Canal, Ph
    Casadei, D.
    Couet, O.
    Fine, V.
    Franco, L.
    Ganis, G.
    Gheata, A.
    Maline, D. Gonzalez
    Goto, M.
    Iwaszkiewicz, J.
    Kreshuk, A.
    Segura, D. Marcos
    Maunder, R.
    Moneta, L.
    Naumann, A.
    Offermann, E.
    Onuchin, V.
    Panacek, S.
    Rademakers, F.
    Russo, P.
    Tadel, M.
    COMPUTER PHYSICS COMMUNICATIONS, 2011, 182 (06) : 1384 - 1385
  • [6] WQA: an integrated DSS and statistical package for water quality data management, processing and analysis
    Tennakoon, S.
    Ramsay, I.
    Shen, S.
    Christiansen, N.
    19TH INTERNATIONAL CONGRESS ON MODELLING AND SIMULATION (MODSIM2011), 2011 : 3532 - 3538
  • [7] VISUALIZATION INVESTIGATION ON THE MARINE DATA WITH MULTIVARIATE STATISTICAL ANALYSIS METHODS
    Li, Yajie
    Lv, Zhengdong
    Wang, Maonan
    POLISH MARITIME RESEARCH, 2017, 24 : 89 - 94
  • [8] NGS data analysis and quality assessment
    Weissmann, R.
    Gilissen, C.
    MEDIZINISCHE GENETIK, 2014, 26 (02) : 239 - 245
  • [9] Integrated geoscience databanks for interactive analysis and visualization
    Khan, Khalid Amin
    Akhter, Gulraiz
    Ahmad, Zulfiqar
    INTERNATIONAL JOURNAL OF DIGITAL EARTH, 2013, 6 : 41 - 49
  • [10] Data Visualization and Statistical Graphics in big data analysis by Google Data Studio - Sales Case Study
    Allaymoun, Mohammad H.
    Khaled, Masooma
    Saleh, Fatima
    Merza, Fatima
    2022 IEEE TECHNOLOGY AND ENGINEERING MANAGEMENT CONFERENCE (TEMSCON EUROPE), 2022 : 228 - 234