Analysis of DAWNBench, a Time-to-Accuracy Machine Learning Performance Benchmark

Cited by: 58
Authors
Coleman C. [1]
Kang D. [1]
Narayanan D. [1]
Nardi L. [1]
Zhao T. [1]
Zhang J. [1]
Bailis P. [1]
Olukotun K. [1]
Ré C. [1]
Zaharia M. [1]
Affiliations
[1] Stanford DAWN
Source
Operating Systems Review (ACM) | 2019 / Vol. 53 / No. 1
Funding
U.S. National Science Foundation
Keywords
Competition - Benchmarking - Deep learning - Economic and social effects
DOI
10.1145/3352020.3352024
Abstract
Researchers have proposed hardware, software, and algorithmic optimizations to improve the computational performance of deep learning. While some of these optimizations perform the same operations faster (e.g., increasing GPU clock speed), many others modify the semantics of the training procedure (e.g., reduced precision) and can impact the final model's accuracy on unseen data. Due to a lack of standard evaluation criteria that consider these trade-offs, it is difficult to directly compare these optimizations. To address this problem, we recently introduced DAWNBench, a benchmark competition focused on end-to-end training time to achieve near-state-of-the-art accuracy on an unseen dataset, a combined metric called time-to-accuracy (TTA). In this work, we analyze the entries from DAWNBench, which received optimized submissions from multiple industrial groups, to investigate the behavior of TTA as a metric as well as trends in the best-performing entries. We show that TTA has a low coefficient of variation and that models optimized for TTA generalize nearly as well as those trained using standard methods. Additionally, even though DAWNBench entries were able to train ImageNet models in under 3 minutes, we find they still underutilize hardware capabilities such as Tensor Cores. Furthermore, we find that distributed entries can spend more than half of their time on communication. We show similar findings with entries to the MLPerf v0.5 benchmark. © Copyright held by the owner/author(s). Publication rights licensed to ACM.
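To make the time-to-accuracy (TTA) metric concrete, the minimal Python sketch below computes TTA for each of several repeated runs from a (time, validation accuracy) training log and then reports the coefficient of variation across runs, the statistic the abstract refers to. The log values, the time_to_accuracy helper, and the 0.93 threshold are illustrative assumptions for this sketch, not data or code from the paper (DAWNBench's ImageNet target was 93% top-5 validation accuracy).

import statistics

# Hypothetical training logs: (elapsed seconds, validation accuracy) pairs per run.
# All numbers below are made up for illustration, not DAWNBench entries.
runs = [
    [(600, 0.71), (1200, 0.89), (1800, 0.932), (2400, 0.940)],
    [(600, 0.70), (1200, 0.90), (1900, 0.931), (2500, 0.938)],
    [(600, 0.72), (1250, 0.88), (1750, 0.933), (2300, 0.941)],
]

TARGET = 0.93  # e.g., a 93% top-5 accuracy threshold


def time_to_accuracy(log, target=TARGET):
    """Return the first wall-clock time at which validation accuracy
    reaches the target, or None if the run never gets there."""
    for elapsed, acc in log:
        if acc >= target:
            return elapsed
    return None


ttas = [t for t in (time_to_accuracy(r) for r in runs) if t is not None]
mean_tta = statistics.mean(ttas)
cv = statistics.stdev(ttas) / mean_tta  # coefficient of variation across runs

print(f"TTA per run (s): {ttas}")
print(f"mean TTA: {mean_tta:.0f} s, coefficient of variation: {cv:.3f}")

In the actual benchmark, the elapsed time is the end-to-end wall-clock time of the full training pipeline to reach the accuracy target; a low coefficient of variation across repeated runs is what makes TTA usable as a comparison metric despite the stochasticity of training.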
Pages: 14-25
Page count: 11