ERrOR: Improving Performance and Fault Tolerance using Early Execution

Cited by: 1
Authors
Choudhary, Raj Kumar [1]
Patel, Janeel [1]
Singh, Virendra [1]
Affiliations
[1] Indian Inst Technol, Comp Architecture & Dependable Syst Lab, Mumbai, India
Source
2023 IEEE 29TH INTERNATIONAL SYMPOSIUM ON ON-LINE TESTING AND ROBUST SYSTEM DESIGN, IOLTS | 2023
Keywords
reliability; soft errors; fault tolerance; instruction re-execution; CORE;
DOI
10.1109/IOLTS59296.2023.10224863
Chinese Library Classification
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Contemporary integrated circuits are increasingly susceptible to soft errors caused by single-event upsets, which reduces the reliability of operation. In this paper, we propose the ERrOR microarchitecture, which detects soft errors in processor operation using temporal redundancy with minimal hardware overhead. Previous proposals have explored introducing an Early Execution Unit (EXU) at the processor frontend to execute dynamic instructions with short dependency chains early and so improve performance. However, we observe that the functional units in the EXU are idle for a significant fraction of program execution time. ERrOR leverages these idle frontend functional units to re-execute dynamic instructions for error detection. A lightweight verifier introduced at the backend uses idle resources for redundant execution by interleaving program execution with re-execution. ERrOR provides exhaustive transient fault coverage while improving performance by 7.5% over an existing restricted OoO microarchitecture, Freeflow Core.
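The temporal-redundancy idea in the abstract (execute an instruction, re-execute it later, and compare the two results) can be illustrated with a small software sketch. This is only a toy model, not the ERrOR hardware design: the names `Instr`, `execute`, and `run_with_reexecution` are hypothetical, the single bit flip is an assumed stand-in for a single-event upset, and the idle-unit scheduling and backend verifier described in the abstract are not modeled.

```python
import operator
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Instr:
    """A toy two-operand instruction (hypothetical, for illustration only)."""
    op: Callable[[int, int], int]
    a: int
    b: int


def execute(instr: Instr, fault_bit: Optional[int] = None) -> int:
    """Compute the result; optionally flip one bit to mimic a transient fault."""
    result = instr.op(instr.a, instr.b)
    if fault_bit is not None:
        result ^= 1 << fault_bit  # assumed fault model: a single bit flip
    return result


def run_with_reexecution(program: List[Instr],
                         fault_at: Optional[int] = None,
                         fault_bit: int = 0) -> List[int]:
    """Execute each instruction twice and report indices whose results differ."""
    detected = []
    for i, instr in enumerate(program):
        # First execution; a fault is injected here if requested.
        first = execute(instr, fault_bit if i == fault_at else None)
        # Re-execution at a later point in time, assumed fault-free.
        second = execute(instr)
        if first != second:
            detected.append(i)
    return detected


program = [Instr(operator.add, 2, 3), Instr(operator.mul, 4, 5)]
print(run_with_reexecution(program))                           # no fault injected
print(run_with_reexecution(program, fault_at=1, fault_bit=2))  # fault in instruction 1
```

Because a transient fault is assumed to corrupt only one of the two executions, comparing the results is enough to flag the affected instruction in this model.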
Pages: 3