FAILSafe: Failure Analysis & Inference in
Large-Scale Systems
Large-scale distributed systems such as content distribution networks,
peer-to-peer systems, computation grids, and network testbeds
provide an essential platform for numerous distributed applications ranging
from content sharing and Web services to VoIP and scientific simulations. Many
of these systems consist of a large number of nodes communicating with each
other over widely distributed networks. Due to their inherent scale, diversity,
and complexity, these systems are prone to frequent failures, which can be
caused by a variety of network, hardware, and software problems.
Any downtime due to failures, whatever the cause, can lead to large disruptions
and huge losses. Identifying the location and cause of a failure is therefore
critical for the reliability and availability of such systems. However,
pinpointing the actual cause is challenging because of the systems' large scale
and the wide variety of possible failure causes.
The goal of this project is to develop techniques for better understanding such
failures, in order to identify their likely locations and causes and to predict
their occurrence in advance. Such an understanding would support failure
prevention, quick diagnosis and troubleshooting, and better system design for
handling inevitable failures, resulting in more fault-tolerant systems. Our
methodology involves analyzing existing failure logs from real systems and
inferring failure patterns from their statistical properties as well as from
other supporting monitoring data.
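
As a rough illustration of the kind of statistical property that can be read off a failure log, the sketch below computes a per-node failure count and the mean time between consecutive failures. The log format (pairs of node identifier and failure timestamp in seconds) and the function name failure_stats are assumptions made for this example; they are not taken from the project's actual tools or data.

```python
from collections import defaultdict
from statistics import mean


def failure_stats(events):
    """Summarize failures per node.

    `events` is an iterable of (node_id, failure_time_seconds) pairs.
    This input format is hypothetical and chosen only for illustration.
    """
    times_by_node = defaultdict(list)
    for node, t in events:
        times_by_node[node].append(t)

    stats = {}
    for node, times in times_by_node.items():
        times.sort()
        # Gaps between consecutive failures on the same node.
        gaps = [later - earlier for earlier, later in zip(times, times[1:])]
        stats[node] = {
            "failures": len(times),
            # Undefined when a node has failed only once.
            "mean_time_between_failures": mean(gaps) if gaps else None,
        }
    return stats


if __name__ == "__main__":
    sample = [("a", 0), ("a", 100), ("a", 400), ("b", 50)]
    for node, summary in sorted(failure_stats(sample).items()):
        print(node, summary)
```

Simple summaries like these would only be a starting point; inferring failure locations and causes would also draw on the supporting monitoring data mentioned above.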