Attack of the Killer Microseconds

Cited by: 156
Authors
Barroso, Luiz [1,2]
Marty, Mike [1]
Patterson, David [1,3]
Ranganathan, Parthasarathy [1]
Affiliations
[1] Google Inc, Mountain View, CA 94043 USA
[2] Google Inc, Engineering, Mountain View, CA 94043 USA
[3] University of California, Berkeley, CA 94720 USA
DOI
10.1145/3015146
CLC number (Chinese Library Classification)
TP3 [Computing technology, computer technology];
Subject classification number
0812;
Abstract
The computer systems we use today make it easy for programmers to mitigate event latencies at the nanosecond and millisecond time scales (such as DRAM accesses at tens or hundreds of nanoseconds and disk I/Os at a few milliseconds) but offer little support for microsecond (µs)-scale events. This oversight is quickly becoming a serious problem for programming warehouse-scale computers, where efficient handling of microsecond-scale events is becoming paramount for a new breed of low-latency I/O devices ranging from datacenter networking to emerging memories (see the first sidebar, "Is the Microsecond Getting Enough Respect?").

Processor designers have developed multiple techniques to facilitate a deep memory hierarchy that works at the nanosecond scale by providing a simple synchronous programming interface to the memory system: a load operation logically blocks a thread's execution, and the program appears to resume after the load completes. A host of complex microarchitectural techniques, including prefetching, out-of-order execution, and branch prediction, make high performance possible while supporting this intuitive programming model. Since nanosecond-scale devices are so fast, low-level interactions with them are performed primarily by hardware.

At the other end of the latency-mitigating spectrum, computer scientists have devised a number of techniques, typically software based, to deal with the millisecond time scale. Operating-system context switching is a notable example: when a read() system call to a disk is made, the operating system kicks off the low-level I/O operation but also performs a software context switch to a different thread so the processor stays busy during the disk operation. The original thread resumes execution sometime after the I/O completes. The long latency of a disk access (milliseconds) easily outweighs the cost of two context switches (microseconds), so millisecond-scale devices are slow enough for these software-based mechanisms to be amortized (see Table 1). A minimal sketch of this blocking-read pattern follows the abstract.

These synchronous models for interacting with nanosecond- and millisecond-scale devices are easier to program than the alternative asynchronous models, in which the program sends a request to a device and continues processing other work until the request completes (the second sketch below illustrates this style). High-performance computing systems can tolerate synchronous blocking in part because they focus primarily on raw performance (vs. performance-per-total-cost-of-ownership in large-scale Web deployments); consequently, they can keep processors highly underutilized when, say, blocking for MPI-style rendezvous messages. In contrast, a key emphasis in warehouse-scale computing systems is the need to optimize for low latency while achieving high utilization.
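The blocking-read pattern described in the abstract can be made concrete with a short sketch. The following is a minimal, hypothetical illustration (not code from the paper), assuming a POSIX system with pthreads and a placeholder file name "data.bin": one thread blocks in read() while the OS context-switches to a second thread, keeping the processor busy during the millisecond-scale disk access.

```c
/* Hypothetical sketch of the synchronous model: read() blocks the calling
 * thread, and the OS runs another thread while the disk I/O is in flight.
 * Assumes a POSIX system and a placeholder file "data.bin".
 * Build with: cc sync_read.c -lpthread */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *disk_reader(void *arg) {
    char buf[4096];
    int fd = open("data.bin", O_RDONLY);
    if (fd < 0) return NULL;
    /* Thread blocks here; the two context switches (out and back in) cost
     * microseconds, dwarfed by the millisecond-scale disk access. */
    ssize_t n = read(fd, buf, sizeof buf);
    printf("read %zd bytes\n", n);
    close(fd);
    return NULL;
}

static void *other_work(void *arg) {
    /* Placeholder computation: the processor stays busy with this thread
     * while the reader thread is blocked on I/O. */
    for (volatile long i = 0; i < 100000000L; i++) ;
    puts("other work done");
    return NULL;
}

int main(void) {
    pthread_t reader, worker;
    pthread_create(&reader, NULL, disk_reader, NULL);
    pthread_create(&worker, NULL, other_work, NULL);
    pthread_join(reader, NULL);
    pthread_join(worker, NULL);
    return 0;
}
```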
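For contrast, here is an equally hypothetical sketch of the asynchronous model the abstract mentions. POSIX AIO is used here only as one possible interface (an assumption; the paper does not prescribe an API): the program issues the request with aio_read() and overlaps other work with the in-flight I/O, resuming only when it observes completion.

```c
/* Hypothetical sketch of the asynchronous model: issue the I/O request,
 * keep working, and poll for completion. Same placeholder "data.bin" file
 * as in the previous sketch. Build with: cc async_read.c -lrt */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    char buf[4096];
    int fd = open("data.bin", O_RDONLY);
    if (fd < 0) return 1;

    struct aiocb cb;                     /* asynchronous I/O control block */
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof buf;
    cb.aio_offset = 0;

    if (aio_read(&cb) != 0) return 1;    /* kick off the I/O, don't block */

    long work = 0;
    while (aio_error(&cb) == EINPROGRESS) {
        work++;                          /* placeholder work overlapped with I/O */
    }
    ssize_t n = aio_return(&cb);         /* collect the result */
    printf("read %zd bytes after %ld units of other work\n", n, work);
    close(fd);
    return 0;
}
```

The design point the abstract makes is visible in the contrast: the synchronous version pays for context switches but keeps the code trivially sequential, while the asynchronous version pushes the bookkeeping (request state, completion checks) into the program itself.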
Pages: 47-54
Page count: 7