DongTing: A large-scale dataset for anomaly detection of the Linux kernel

Cited by: 2
Authors
Duan, Guoyun [1, 2]
Fu, Yuanzhi [1 ]
Cai, Minjie [1 ]
Chen, Hao [1 ]
Sun, Jianhua [1 ]
Affiliations
[1] CSEE Hunan Univ, 2 Lushan South Rd, Changsha 410082, Peoples R China
[2] Hunan Univ Sci & Engn, Informat & Network Ctr, Yongzhou 425199, Hunan, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Anomaly detection; Dataset; Linux kernel; System calls; Kernel BUG; Deep learning;
DOI
10.1016/j.jss.2023.111745
Chinese Library Classification number
TP31 [Computer software];
Discipline classification code
081202 ; 0835 ;
Abstract
Host-based intrusion detection systems (HIDS) can automatically identify adversarial applications by learning models from system events that represent normal system behaviors. The system call is the only way for applications to interact with the operating system (OS). Thus, system call sequences are traditionally used in HIDS to train models to detect novel attacks, and a wide range of datasets has been proposed for this task. However, existing datasets are either built for user-level applications (not for OS kernels) or completely outdated (proposed more than 20 years ago). To address this issue, this paper presents the first large-scale dataset specifically assembled for anomaly detection of the Linux kernel. Creating such a dataset is challenging because it is difficult both to collect a diversified set of programs that can trigger bugs in the kernel and to trace events that may crash the kernel at runtime. In this paper, we describe in detail how to collect the data through an automated and efficient framework. The raw dataset is 85 GB in size and contains 18,966 system call sequences that are labeled with normal and abnormal attributes. Our dataset covers more than 200 kernel versions (including major/minor releases and revisions) and 3,600 bug-triggering programs from the past five years. In addition, we conduct cross-dataset evaluation to demonstrate that training on our dataset yields better generalization than training on other related datasets, and we provide benchmark results for anomaly detection of the Linux kernel on our dataset. Our extensive dataset is useful both for machine learning researchers focusing on algorithmic optimizations and for practitioners in kernel development who are interested in deploying deep learning models in OS kernels. (c) 2023 Elsevier Inc. All rights reserved.
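As a rough illustration of how a model of normal behavior can be learned from system call sequences, the sketch below uses a classic sliding-window (n-gram) lookup. This is not the deep learning approach benchmarked in the paper, and it does not reflect the dataset's file format. The function names, the window length, and the toy traces are all assumptions made for illustration.

```python
from typing import Iterable, List, Sequence, Set, Tuple

Window = Tuple[str, ...]


def windows(seq: Sequence[str], n: int) -> List[Window]:
    """Return every contiguous window of n system calls in seq."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def train_normal_profile(normal_sequences: Iterable[Sequence[str]],
                         n: int = 3) -> Set[Window]:
    """Collect all windows observed in sequences labeled as normal."""
    profile: Set[Window] = set()
    for seq in normal_sequences:
        profile.update(windows(seq, n))
    return profile


def anomaly_score(seq: Sequence[str], profile: Set[Window],
                  n: int = 3) -> float:
    """Fraction of windows in seq that never occurred in normal data."""
    wins = windows(seq, n)
    if not wins:
        # Sequence shorter than the window: nothing to compare.
        return 0.0
    unseen = sum(1 for w in wins if w not in profile)
    return unseen / len(wins)


# Toy traces (hypothetical, not taken from the dataset).
normal_traces = [
    ["open", "read", "write", "close"],
    ["open", "read", "read", "close"],
]
profile = train_normal_profile(normal_traces, n=3)
seen_score = anomaly_score(["open", "read", "write", "close"], profile)
novel_score = anomaly_score(["open", "mmap", "ioctl", "close"], profile)
```

A sequence whose score exceeds a chosen threshold would be flagged as abnormal. A learned sequence model plays the same role as the lookup table here, but it generalizes to windows it has not seen verbatim.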
Pages: 12