Compiler-Assisted Detection of Transient Memory Errors

Cited by: 0
Authors
Tavarageri, Sanket [1]
Krishnamoorthy, Sriram [2]
Sadayappan, P. [1]
Affiliations
[1] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA
[2] Pacific NW Natl Lab, Comp Sci & Math Div, Richland, WA 99352 USA
Keywords
Performance; Reliability; Transient memory errors; def-use tracking; checksums
DOI
10.1145/2666356.2594298
Chinese Library Classification number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
The probability of bit flips in hardware memory systems is projected to increase significantly as memory systems continue to scale in size and complexity. Effective hardware-based error detection and correction require that the complete data path, involving all parts of the memory system, be protected with sufficient redundancy. First, this may be costly to employ on commodity computing platforms, and second, even on high-end systems, protection against multi-bit errors may be lacking. Therefore, augmenting hardware error detection schemes with software techniques is of considerable interest. In this paper, we consider software-level mechanisms to comprehensively detect transient memory faults. We develop novel compile-time algorithms to instrument application programs with checksum computation codes to detect memory errors. Unlike prior approaches that employ checksums on computational and architectural states, our scheme verifies every data access and works by tracking variables as they are produced and consumed. Experimental evaluation demonstrates that the proposed comprehensive error detection solution is viable as a completely software-only scheme. We also demonstrate that with limited hardware support, overheads of error detection can be further reduced.
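The idea of tracking values as they are produced and consumed can be illustrated with a pair of running checksums. The following is a minimal sketch, not the paper's algorithm: it assumes that the number of times each value will be read is known when the value is written, and the names `ChecksumTracker`, `on_def`, `on_use`, and `run_dot` are hypothetical. Each definition adds `value * use_count` to one checksum, each use adds the value actually read to another, and a mismatch at the end indicates that some value changed between its definition and a use.

```python
MASK = (1 << 64) - 1  # checksums wrap modulo 2**64, like a machine word


class ChecksumTracker:
    """Hypothetical def/use checksum pair (illustrative only)."""

    def __init__(self):
        self.def_sum = 0  # accumulated at definitions
        self.use_sum = 0  # accumulated at uses

    def on_def(self, value, use_count):
        # Assumption: use_count (future reads of this value) is known here.
        self.def_sum = (self.def_sum + value * use_count) & MASK

    def on_use(self, value):
        self.use_sum = (self.use_sum + value) & MASK

    def verify(self):
        # Equal sums mean every read saw the value that was written.
        return self.def_sum == self.use_sum


def run_dot(a, b, flip=None):
    """Instrumented dot product of two integer lists.

    flip=(index, bit) simulates a transient fault by flipping one bit of
    a[index] after it has been defined but before it is used.
    Returns (result, checksums_match).
    """
    tracker = ChecksumTracker()
    mem_a, mem_b = list(a), list(b)

    # Each element is defined once and read exactly once below.
    for v in mem_a + mem_b:
        tracker.on_def(v, 1)

    if flip is not None:
        index, bit = flip
        mem_a[index] ^= 1 << bit  # injected memory error

    total = 0
    for x, y in zip(mem_a, mem_b):
        tracker.on_use(x)
        tracker.on_use(y)
        total += x * y
    return total, tracker.verify()


if __name__ == "__main__":
    print(run_dot([1, 2, 3], [4, 5, 6]))               # fault-free run
    print(run_dot([1, 2, 3], [4, 5, 6], flip=(1, 3)))  # bit 3 of a[1] flipped
```

In the fault-free run the two checksums agree; with the injected bit flip they differ, so the corrupted result is flagged. A simple additive checksum like this one can miss errors that cancel each other out, which is one reason the sketch should be read as an illustration of the def-use idea rather than as a complete detection scheme.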
Pages: 204-215
Page count: 12