Power and Reliability in Extreme Scale Computing

Published August 8, 2016, 21:30
Since its inception, semiconductor process technology has seen unhindered growth in transistor integration levels. However, in the forthcoming CMOS technology generations, this aggressive scaling poses critical reliability issues due to increasing power density and process variation. To address these challenges, this talk will present three reliable, energy-efficient solutions for the network-on-chip, on-chip caches, and the processor pipeline.

Dynamic voltage scaling is commonly used to reduce power consumption. However, the supply voltage cannot be lowered below a certain threshold unless the resulting failures are addressed. Tangle, the network-on-chip solution, monitors the error rate observed in the network and, based on its value across different network routes, selectively increases or decreases the voltage of individual voltage domains. With Tangle, the voltages of the different domains continuously adapt to the most energy-efficient, error-free conditions.

Next, I present Archipelago, a highly flexible cache design that, by reconfiguring its internal organization, can efficiently tolerate a large number of SRAM failures. Archipelago partitions the cache into multiple autonomous islands of various sizes that can operate correctly without borrowing redundancy from each other. An adapted version of the minimum clique covering algorithm is used to minimize the amount of cache space lost when operating in the low-voltage region.

With proper solutions in place for the network-on-chip and caches, a robust, heterogeneous core-coupling execution scheme, Necromancer, is presented to protect the general core area against failures. Although a faulty core cannot be trusted, for most defects the execution traces on a defective core coarsely resemble those of fault-free executions. Consequently, Necromancer exploits a functionally dead core to improve system throughput by supplying hints about high-level program behavior.
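
The error-driven voltage adaptation described for Tangle can be pictured as a simple feedback loop. The sketch below is only an illustration under stated assumptions, not the design presented in the talk: the names `VoltageDomain`, `domain_error_rates`, and `adjust`, the voltage limits, the step size, and the rule of charging each domain with the worst error rate of any route crossing it are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical tuning constants; none of these values come from the talk.
V_MIN_MV = 500    # assumed lower bound on supply voltage (millivolts)
V_MAX_MV = 1000   # assumed upper bound on supply voltage (millivolts)
STEP_MV = 10      # assumed size of one voltage adjustment


@dataclass
class VoltageDomain:
    name: str
    voltage_mv: int  # current supply voltage in millivolts


def domain_error_rates(route_errors, route_domains):
    """Charge each domain with the worst error rate of any route crossing it.

    This attribution rule is an assumption made for the sketch.
    """
    rates = {}
    for route, err in route_errors.items():
        for dom in route_domains[route]:
            rates[dom] = max(rates.get(dom, 0.0), err)
    return rates


def adjust(domain, error_rate, threshold=1e-3):
    """One control step: raise the voltage when errors exceed the threshold,
    otherwise lower it to save energy. The result is clamped to the limits."""
    if error_rate > threshold:
        domain.voltage_mv = min(V_MAX_MV, domain.voltage_mv + STEP_MV)
    else:
        domain.voltage_mv = max(V_MIN_MV, domain.voltage_mv - STEP_MV)
    return domain.voltage_mv


if __name__ == "__main__":
    domains = {n: VoltageDomain(n, 800) for n in ("A", "B", "C")}
    # Made-up observations: route r1 crosses A and B, route r2 crosses B and C.
    route_domains = {"r1": ["A", "B"], "r2": ["B", "C"]}
    route_errors = {"r1": 0.0, "r2": 0.02}
    rates = domain_error_rates(route_errors, route_domains)
    for name, dom in domains.items():
        adjust(dom, rates.get(name, 0.0))
        print(name, dom.voltage_mv)
```

Repeating the `adjust` step over successive observation windows lets each domain drift toward the lowest voltage at which its routes stay below the error threshold, which is the behavior the abstract attributes to Tangle in general terms.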