FAILSafe

FAILSafe: Failure Analysis & Inference in Large-Scale Systems

Large-scale distributed systems such as content distribution networks, peer-to-peer systems, computation grids, and network testbeds provide an essential platform for numerous distributed applications, ranging from content sharing and Web services to VoIP and scientific simulations. Many of these systems consist of a large number of nodes communicating with each other over widely distributed networks. Due to their inherent scale, diversity, and complexity, these systems are prone to frequent failures, which can be caused by a variety of network, hardware, and software problems. Any downtime due to failures, whatever the cause, can lead to large disruptions and huge losses. Identifying the location and cause of a failure is therefore critical for the reliability and availability of such systems, but it is a challenging task because of their large scale and the variety of possible failure causes.

The goal of this project is to develop techniques for better understanding such failures, in order to identify their likely locations and causes and to predict their occurrence in advance. Such an understanding would support failure prevention, quick diagnosis and troubleshooting, and better system design for handling inevitable failures, resulting in more fault-tolerant systems. Our methodology involves analyzing existing failure logs from real systems and inferring failure patterns from their statistical properties as well as from other supporting monitoring data.
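
As a rough illustration of what a statistical property of a failure log might look like, the sketch below computes the time between consecutive failures for each node and its mean. It is not the project's actual tooling: the log format, the node names, the timestamps, and the function names are all hypothetical and chosen only for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical failure log: (node_id, failure_time_in_hours).
# These records are made up for illustration; they are not real data.
failure_log = [
    ("node-a", 2.0),
    ("node-b", 5.0),
    ("node-a", 10.0),
    ("node-a", 26.0),
    ("node-b", 29.0),
]


def time_between_failures(log):
    """Return, per node, the gaps between consecutive failure times."""
    by_node = defaultdict(list)
    for node, t in log:
        by_node[node].append(t)
    gaps = {}
    for node, times in by_node.items():
        times.sort()
        # Pair each failure time with the next one and take the difference.
        gaps[node] = [later - earlier for earlier, later in zip(times, times[1:])]
    return gaps


def mean_time_between_failures(log):
    """Return the mean gap per node, skipping nodes with fewer than two failures."""
    return {
        node: mean(node_gaps)
        for node, node_gaps in time_between_failures(log).items()
        if node_gaps
    }


mtbf = mean_time_between_failures(failure_log)
print(mtbf)
```

A per-node summary like this is only a starting point; in practice one would presumably look at the full distribution of gaps and relate it to the supporting monitoring data rather than rely on a single mean.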