Large-Scale Analysis of Docker Images and Performance Implications for Container Storage Systems

Cited by: 30
Authors
Zhao, Nannan [1, 2]
Tarasov, Vasily [4]
Albahar, Hadeel [3]
Anwar, Ali [4]
Rupprecht, Lukas [4]
Skourtis, Dimitrios [4]
Paul, Arnab K. [3]
Chen, Keren [3]
Butt, Ali R. [3]
Affiliations
[1] Northwestern Polytech Univ, Key Lab Big Data Storage & Management MIIT, Sch Comp Sci, Xian 710129, Shaanxi, Peoples R China
[2] Northwestern Polytech Univ, Natl Engn Labo Integrated Aerosp Ground Ocean Big, Xian 710129, Shaanxi, Peoples R China
[3] Virginia Tech, Dept Comp Sci, Blacksburg, VA 24061 USA
[4] IBM Res Almaden, San Jose, CA 95120 USA
Funding
Beijing Natural Science Foundation
Keywords
Containers; Image coding; Cows; Crawlers; Measurement; Libraries; Ecosystems; Docker; container images; container registry; deduplication; Docker hub; container storage drivers;
DOI
10.1109/TPDS.2020.3034517
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Docker containers have become a prominent solution for supporting modern enterprise applications due to the highly desirable features of isolation, low overhead, and efficient packaging of the application's execution environment. Containers are created from images which are shared between users via a registry. The amount of data registries store is massive. For example, Docker Hub, a popular public registry, stores at least half a million public images. In this article, we analyze over 167 TB of uncompressed Docker Hub images, characterize them using multiple metrics and evaluate the potential of file-level deduplication. Our analysis helps to make conscious decisions when designing storage for containers in general and Docker registries in particular. For example, only 3 percent of the files in images are unique while others are redundant file copies, which means file-level deduplication has a great potential to save storage space. Furthermore, we carry out a comprehensive analysis of both small I/O request performance and copy-on-write performance for multiple popular container storage drivers. Our findings can motivate and help improve the design of data reduction and caching methods for images, pulling optimizations for registries, and storage drivers.
Pages: 918-930
Page count: 13
Related papers
50 in total
  • [1] Performance virtualization for large-scale storage systems
    Chambliss, DD
    Alvarez, GA
    Pandey, P
    Jadav, D
    Xu, J
    Menon, R
    Lee, TP
    22ND INTERNATIONAL SYMPOSIUM ON RELIABLE DISTRIBUTED SYSTEMS, PROCEEDINGS, 2003, : 109 - 118
  • [2] A Large-scale Data Set and an Empirical Study of Docker Images Hosted on Docker Hub
    Lin, Changyuan
    Nadi, Sarah
    Khazaei, Hamzeh
    2020 IEEE INTERNATIONAL CONFERENCE ON SOFTWARE MAINTENANCE AND EVOLUTION (ICSME 2020), 2020, : 371 - 381
  • [3] Large-Scale Analysis of the Docker Hub Dataset
    Zhao, Nannan
    Tarasov, Vasily
    Albahar, Hadeel
    Anwar, Ali
    Rupprecht, Lukas
    Skourtis, Dimitrios
    Warke, Amit S.
    Mohamed, Mohamed
    Butt, Ali R.
    2019 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER), 2019, : 215 - 224
  • [4] An End-to-end High-performance Deduplication Scheme for Docker Registries and Docker Container Storage Systems
    Zhao, Nannan
    Lin, Muhui
    Albahar, Hadeel
    Paul, Arnab K.
    Huang, Zhijie
    Abraham, Subil
    Chen, Keren
    Tarasov, Vasily
    Skourtis, Dimitrios
    Anwar, Ali
    Butt, Ali R.
    ACM TRANSACTIONS ON STORAGE, 2024, 20 (03)
  • [5] H∞ Performance Analysis of Large-Scale Networked Systems
    Guan, Rongxing
    Liu, Huabo
    Huang, Keke
    Yu, Haisheng
    IEEE SYSTEMS JOURNAL, 2024, 18 (03): : 1528 - 1537
  • [6] Churros: a Docker-based pipeline for large-scale epigenomic analysis
    Wang, Jiankang
    Nakato, Ryuichiro
    DNA RESEARCH, 2024, 31 (01)
  • [7] Performance analysis of a parallel algorithm for restoring large-scale CT images
    Harizanov, Stanislav
    Lirkov, Ivan
    Georgiev, Krassimir
    Paprzycki, Marcin
    Ganzha, Maria
    JOURNAL OF COMPUTATIONAL AND APPLIED MATHEMATICS, 2017, 310 : 104 - 114
  • [8] Identification and Authentication in Large-scale Storage Systems
    Niu, Zhongying
    Zhou, Ke
    Jiang, Hong
    Yang, Tianming
    Yan, Wei
    NAS: 2009 IEEE INTERNATIONAL CONFERENCE ON NETWORKING, ARCHITECTURE, AND STORAGE, 2009, : 421 - +
  • [9] Analysis and prediction of performance variability in large-scale computing systems
    Beni, Majid Salimi
    Hunold, Sascha
    Cosenza, Biagio
    JOURNAL OF SUPERCOMPUTING, 2024, 80 (10): : 14978 - 15005
  • [10] Performance Analysis of Feedbacked Passive Systems for Decentralized Design of Large-Scale Systems
    Urata, Kengo
    Inoue, Masaki
    2017 IEEE 56TH ANNUAL CONFERENCE ON DECISION AND CONTROL (CDC), 2017,