Evaluation of Synthetic Data Generation Techniques in the Domain of Anonymous Traffic Classification

Cited by: 6
Authors
Cullen, Drake [1]
Halladay, James [1]
Briner, Nathan [1]
Basnet, Ram [1]
Bergen, Jeremy [1]
Doleck, Tenzin [2]
Affiliations
[1] Colorado Mesa Univ CMU, Dept Comp Sci & Engn, Grand Junction, CO 81501 USA
[2] Simon Fraser Univ, Fac Educ, Burnaby, BC V5A 1S6, Canada
Keywords
Anonymous traffic; synthetic data; CopulaGAN; CTGAN; SMOTE; VAE; TabNet; deep learning; machine learning; unbalanced data;
DOI
10.1109/ACCESS.2022.3228507
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Anonymous network traffic is more pervasive than ever due to the accessibility of services such as virtual private networks (VPN) and The Onion Router (Tor). To address the need to identify and classify this traffic, machine and deep learning solutions have become the standard. However, high-performing classifiers often scale poorly when applied to real-world traffic classification due to the heavily skewed nature of network traffic data. Prior research has found synthetic data generation to be effective at alleviating concerns surrounding class imbalance, though only a limited number of these techniques have been applied to the domain of anonymous network traffic detection. This work compares the ability of a Conditional Tabular Generative Adversarial Network (CTGAN), Copula Generative Adversarial Network (CopulaGAN), Variational Autoencoder (VAE), and Synthetic Minority Over-sampling Technique (SMOTE) to create viable synthetic anonymous network traffic samples. Moreover, we evaluate the performance of several shallow boosting and bagging classifiers as well as deep learning models on the synthetic data. Ultimately, we amalgamate the data generated by the GANs, VAE, and SMOTE into a comprehensive dataset dubbed CMU-SynTraffic-2022 for future research on this topic. Our findings show that SMOTE consistently outperformed the other upsampling techniques, improving classifiers' F1-scores over the control by approximately 7.5% for application type characterization. Among the tested classifiers, Light Gradient Boosting Machine achieved the highest F1-score of 90.3% on eight application types.
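For readers unfamiliar with SMOTE, the upsampling idea named in the abstract can be sketched in a few lines of plain Python. This is only an illustration of the general technique (interpolating between a minority-class sample and one of its nearest minority-class neighbors), not the authors' implementation. The function name `smote_sketch`, its parameters, and the two-feature toy rows are all hypothetical.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class neighbors.

    Hypothetical helper for illustration only; not the paper's code.
    """
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    if k < 1:
        raise ValueError("need at least two minority samples")
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbors of the base row (excluding itself)
        neighbors = sorted(
            (row for row in minority if row is not base),
            key=lambda row: math.dist(base, row),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Assumed toy minority class with two made-up features per row
minority_rows = [[1.0, 10.0], [1.2, 11.0], [0.9, 9.5], [1.1, 10.5]]
new_rows = smote_sketch(minority_rows, n_new=6, k=2)
```

Because each synthetic row lies on a line segment between two real minority rows, it stays inside the range spanned by the minority class, which is the property that distinguishes this interpolation from simply duplicating rows.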
Pages: 129612-129625
Page count: 14