Evaluating the Effects of Missing Values and Mixed Data Types on Social Sequence Clustering Using t-SNE Visualization
Cited by: 7
Authors:
Lazar, Alina [1]
Jin, Ling [2]
Spurlock, C. Anna [2]
Wu, Kesheng [3]
Sim, Alex [3]
Todd, Annika [2]
Affiliations:
[1] Youngstown State Univ, Dept Comp Sci & Informat Syst, 1 Univ Plaza, Youngstown, OH 44555 USA
[2] Lawrence Berkeley Natl Lab, Energy Anal & Environm Impacts Div, 1 Cyclotron Rd, Berkeley, CA 94720 USA
[3] Lawrence Berkeley Natl Lab, Computat Res Div, 1 Cyclotron Rd, Berkeley, CA 94720 USA
Source:
ACM JOURNAL OF DATA AND INFORMATION QUALITY | 2019 / Vol. 11 / Issue 02
Keywords:
Joint sequence analysis;
optimal matching;
missing values;
time series clustering;
data quality;
t-SNE;
dimensionality reduction;
life trajectories;
TIME-SERIES DATA;
TRAJECTORIES;
POLICY;
LIFE;
DOI:
10.1145/3301294
Chinese Library Classification:
TP [Automation technology, computer technology];
Discipline classification code:
0812;
Abstract:
The goal of this work is to investigate the impact of missing values on clustering joint categorical social sequences. Identifying patterns in sociodemographic longitudinal data is important in a number of social science settings. However, performing analytical operations, such as clustering, on life course trajectories is challenging due to the categorical and multidimensional nature of the data, their mixed data types, and corruption by missing and inconsistent values. Data quality issues were previously investigated on single-variable sequences. To understand their effects on multivariate sequence analysis, we employ a dataset of mixed data types and missing values, a dissimilarity measure designed for joint categorical sequence data, together with dimensionality reduction methodologies in a systematic design of sequence clustering experiments. Given the categorical nature of our data, we employ an "edit" distance using optimal matching. Because each data record has multiple variables of different types, we investigate the impact of mixing these variables in a single dissimilarity measure. Between variables with binary values and those with multiple nominal values, we find that overcoming missing data problems is more difficult in the nominal domain than in the binary domain. Additionally, alignment of leading missing values can result in systematic biases in dissimilarity matrices and subsequently introduce both artificial clusters and unrealistic interpretations of associated data domains. We demonstrate the use of t-distributed stochastic neighbor embedding (t-SNE) to visually guide mitigation of such biases by tuning the missing value substitution cost parameter or determining an optimal sequence span.
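The role of the missing value substitution cost in an optimal matching distance can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `om_distance`, the use of `None` as the missing-state marker, the state labels, and all cost values are assumptions chosen for illustration only.

```python
MISSING = None  # hypothetical marker for a missing state


def om_distance(a, b, sub_cost=2.0, indel_cost=1.0, missing_cost=2.0):
    """Optimal matching (edit) distance between two categorical sequences.

    missing_cost is the cost of substituting a missing state with an
    observed one; two missing states are treated as identical (cost 0).
    """
    def sub(x, y):
        if x == y:
            return 0.0
        if x is MISSING or y is MISSING:
            return missing_cost
        return sub_cost

    n, m = len(a), len(b)
    # prev[j] holds the distance between a[:i-1] and b[:j]
    prev = [j * indel_cost for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [i * indel_cost] + [0.0] * m
        for j in range(1, m + 1):
            curr[j] = min(
                prev[j] + indel_cost,                   # delete a[i-1]
                curr[j - 1] + indel_cost,               # insert b[j-1]
                prev[j - 1] + sub(a[i - 1], b[j - 1]),  # substitute
            )
        prev = curr
    return prev[m]


# Illustrative sequences with made-up state labels.
x = [MISSING, MISSING, "E", "E", "U"]  # leading missing values
y = [MISSING, MISSING, "E", "U", "U"]  # same gap, different observed states
z = ["E", "E", "E", "E", "U"]          # fully observed, agrees with x where x is observed

for cost in (2.0, 0.5):
    d_xy = om_distance(x, y, missing_cost=cost)
    d_xz = om_distance(x, z, missing_cost=cost)
    print(f"missing_cost={cost}: d(x, y)={d_xy}, d(x, z)={d_xz}")
```

When a missing state is priced like any other state, `x` is closer to `y`, which shares only its leading gap, than to `z`, which agrees with every observed state of `x`. Lowering `missing_cost` reverses that ordering in this toy case. A pairwise matrix built from such distances could then be handed to an embedding method that accepts precomputed dissimilarities in order to inspect the effect visually.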