Selection Bias Tracking and Detailed Subset Comparison for High-Dimensional Data

Cited by: 14
Authors
Borland, David [1]
Wang, Wenyuan [2]
Zhang, Jonathan [3]
Shrestha, Joshua [4]
Gotz, David [2]
Affiliations
[1] Univ N Carolina, RENCI, Chapel Hill, NC 27515 USA
[2] Univ N Carolina, Sch Informat & Lib Sci, Chapel Hill, NC 27515 USA
[3] Univ N Carolina, Dept Biostat, Chapel Hill, NC 27515 USA
[4] Univ N Carolina, Dept Comp Sci, Chapel Hill, NC 27515 USA
Funding
U.S. National Science Foundation
Keywords
High-dimensional visualization; visual analytics; cohort selection; medical informatics; selection bias; VISUAL ANALYTICS; ADJUST; VISUALIZATION;
DOI
10.1109/TVCG.2019.2934209
Chinese Library Classification (CLC) number
TP31 [Computer Software]
Discipline classification code
081202 ; 0835 ;
Abstract
The collection of large, complex datasets has become common across a wide variety of domains. Visual analytics tools increasingly play a key role in exploring and answering complex questions about these large datasets. However, many visualizations are not designed to concurrently visualize the large number of dimensions present in complex datasets (e.g., tens of thousands of distinct codes in an electronic health record system). This fact, combined with the ability of many visual analytics systems to enable rapid, ad-hoc specification of groups, or cohorts, of individuals based on a small subset of visualized dimensions, leads to the possibility of introducing selection bias: when the user creates a cohort based on a specified set of dimensions, differences across many other unseen dimensions may also be introduced. These unintended side effects may result in the cohort no longer being representative of the larger population intended to be studied, which can negatively affect the validity of subsequent analyses. We present techniques for selection bias tracking and visualization that can be incorporated into high-dimensional exploratory visual analytics systems, with a focus on medical data with existing data hierarchies. These techniques include: (1) tree-based cohort provenance and visualization, including a user-specified baseline cohort that all other cohorts are compared against, and visual encoding of cohort "drift", which indicates where selection bias may have occurred, and (2) a set of visualizations, including a novel icicle-plot based visualization, to compare in detail the per-dimension differences between the baseline and a user-specified focus cohort. These techniques are integrated into a medical temporal event sequence visual analytics tool. We present example use cases and report findings from domain expert user interviews.
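The per-dimension comparison between a baseline and a focus cohort described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the paper's implementation: the toy patient records, the function names (`prevalence`, `hellinger_binary`, `drift_by_dimension`), and the choice of Hellinger distance between per-code prevalence rates as the difference measure.

```python
from math import sqrt

def prevalence(cohort, dimension):
    """Fraction of cohort members whose record contains the given code."""
    if not cohort:
        return 0.0
    return sum(1 for codes in cohort if dimension in codes) / len(cohort)

def hellinger_binary(p, q):
    """Hellinger distance between two Bernoulli distributions with rates p and q."""
    return sqrt(((sqrt(p) - sqrt(q)) ** 2 + (sqrt(1 - p) - sqrt(1 - q)) ** 2) / 2)

def drift_by_dimension(baseline, focus):
    """Map each code seen in either cohort to its baseline-vs-focus distance."""
    dims = set().union(*baseline, *focus)
    return {
        d: hellinger_binary(prevalence(baseline, d), prevalence(focus, d))
        for d in dims
    }

# Hypothetical toy data: each patient is a set of recorded codes.
baseline = [
    {"diabetes", "insulin"},
    {"diabetes", "insulin", "hypertension"},
    {"hypertension"},
    {"asthma"},
]

# The user filters on ONE visible dimension ("diabetes") ...
focus = [patient for patient in baseline if "diabetes" in patient]

# ... yet other, unseen dimensions shift as a side effect.
drift = drift_by_dimension(baseline, focus)
for code, distance in sorted(drift.items(), key=lambda kv: -kv[1]):
    print(f"{code:14s} {distance:.3f}")
```

In this toy data, filtering on "diabetes" also shifts the prevalence of "insulin" and "asthma", while "hypertension" is unchanged. This is the kind of unintended difference across unseen dimensions that the abstract calls selection bias.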
Pages: 429 - 439
Page count: 11