Disk Failure: How It Happens And What To Do About It

74
Следующее
06.09.16 – 431:13:54
Demystifying Internet Traffic
Популярные
Опубликовано 6 сентября 2016, 17:12
Disk drive failures continue to be one of the primary causes of data loss and system failures. In most failure scenarios, the disk does not stop working entirely; rather, the failures tend to be partial failures, where some disk sectors are unavailable due to a latent sector error or block corruption. In this talk, I will present the characteristics, storage-stack impact, and tolerance techniques for such partial failures. First, I will present our large-scale studies of two important kinds of partial disk failures -- latent sector errors and data corruption. Our studies show that these failures do occur and a significant percentage of disk drives suffer from them. The studies also identify interesting failure characteristics such as non-independence, spatial locality, and temporal locality, all which greatly impact techniques used to protect against these failures. Second, I will briefly discuss our analyses of how these failures affect various components of the storage stack, including file systems, virtual memory systems, and RAID storage systems. Our analyses show that the storage stack components are ineffective, and often inconsistent in dealing with partial disk failures. Third, I will present a file system architecture that I am building to tolerate such failures. The architecture, based on N-version programming principles, uses multiple different file systems to store and retrieve data in order to provide robust and efficient file service.
автотехномузыкадетское