Research
In this work, we seek to understand failure characteristics in a large-scale system. We use PlanetLab as our case study and demonstrate how combining classification with additional monitoring information provides insights into the nature of some of the failures.
Methodology
We followed a two-step methodology to
identify failures and their possible causes (Figure 1). At a high level, this
methodology consisted of first separating out failures based on their
statistical properties, and then using other monitoring data to explain these
failures. The first step consisted of classifying the various failures observed
in the data based on characteristics such as duration (how long a failure
lasted), size (how many nodes failed together), and whether a failure was hard
(node failure) or soft (software/network failure). This classification was
largely done based on the failure time series data itself, and the goal was to
separate different kinds of failures, since failures with different
characteristics are likely to have different causes. The second step consisted
of failure inference, where we correlated the failures in each class with
additional monitoring data, such as node location, resource usage, and types of
error messages. The goal of this step was to explain the causes of the various
failure classes using the additional information available.
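As a concrete illustration, the following is a minimal sketch of the two-step methodology in Python. It assumes a hypothetical list of failure records (node, site, down time, up time, hard/soft flag) extracted from the up/down time series; the Failure class, thresholds, and grouping window are illustrative assumptions, not the definitions used in the study.

# A minimal sketch of the two-step methodology, under the assumptions above.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Failure:
    node: str      # node identifier
    site: str      # hosting site (used later for correlation)
    down: float    # failure start (epoch seconds)
    up: float      # recovery time (epoch seconds)
    hard: bool     # True if the node itself failed, False for software/network

# Step 1: classify failures by duration, size, and hard/soft.
def classify(failures, long_threshold=3600, window=300):
    # Group failures that start within the same time window to estimate
    # "size" (how many nodes failed together); thresholds are illustrative.
    by_window = defaultdict(list)
    for f in failures:
        by_window[int(f.down // window)].append(f)
    classes = defaultdict(list)
    for f in failures:
        duration = "long" if (f.up - f.down) >= long_threshold else "short"
        size = "large" if len(by_window[int(f.down // window)]) > 1 else "small"
        kind = "hard" if f.hard else "soft"
        classes[(duration, size, kind)].append(f)
    return classes

# Step 2: failure inference -- correlate each class with additional
# monitoring data (here, just the site of each node) to look for
# site-wise correlation.
def site_correlation(failure_class):
    per_site = defaultdict(int)
    for f in failure_class:
        per_site[f.site] += 1
    return sorted(per_site.items(), key=lambda kv: -kv[1])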
Figure 1: Schematic diagram of the two-step methodology
Sample Results and Findings
The following figure illustrates the node-wise failures occurring in PlanetLab over a one-month period (February 2007). The y-axis shows nodes and the x-axis shows time, with each dot representing a 5-minute interval. A green dot marks a transition from down to up, a blue dot marks a transition from up to down, and both fade to black while a node's up/down status remains unchanged.
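For reference, here is a minimal sketch of how such a node-wise transition plot can be rendered, assuming a hypothetical boolean availability matrix up[node, t] with one column per 5-minute interval; the matrix name and rendering details are assumptions for illustration, not the actual PlanetLab monitoring pipeline.

# A minimal plotting sketch, assuming a boolean matrix `up` of shape
# (num_nodes, num_5min_slots); colors follow the legend described above.
import numpy as np
import matplotlib.pyplot as plt

def plot_transitions(up):
    n_nodes, n_slots = up.shape
    img = np.zeros((n_nodes, n_slots, 3))        # black background
    diff = np.diff(up.astype(int), axis=1)
    nodes, times = np.where(diff == 1)           # down -> up transitions
    img[nodes, times + 1] = (0.0, 1.0, 0.0)      # green
    nodes, times = np.where(diff == -1)          # up -> down transitions
    img[nodes, times + 1] = (0.0, 0.0, 1.0)      # blue
    plt.imshow(img, aspect="auto", interpolation="nearest")
    plt.xlabel("Time (5-minute intervals)")
    plt.ylabel("Nodes")
    plt.show()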
Our case study demonstrates that combining classification with additional information provides insights into the nature of some of the failures. Our results show that most failures that required restarting a node were small in size and lasted for long durations. Some failures were correlated site-wise, and some could be explained using the error-message information collected by the monitoring node.
At the same time, we find that failure analysis and monitoring systems need to be tightly coupled and co-designed more closely.