Finding service outages with robust statistics

Feb 13, 2021 5:00 AM

Check out the monitoring dashboard in action. Note that this post is interactive and may not render correctly without JavaScript.

The Probe Scraper underpins the data schema infrastructure at Mozilla. Every night, it trawls through Firefox Mercurial and Git repositories, searching for registry files. The output powers the Probe-Info Service API (probe-info) that is used to generate schemas and build data dictionaries.

It is scheduled to run on business days (Monday to Friday UTC+00) on Airflow. While there are notifications to the data engineering team when the job inevitably fails for one reason or another, the status of the schema deployment pipeline has not always been clear from the outside. I put together a monitoring dashboard to address the lack of visibility. The monitoring mechanism is not unlike a watchdog timer in embedded electronics, where it runs on a separate clock checking on the health of the main system.

In this post, I’ll write a little bit about how the probe-info service is monitored and displayed for operational transparency. I’ll focus the discussion on the merits of statistical tools like the median absolute deviation (MAD) for figuring out whether the probe-info service is up to date without actually having access to the internals of the service.

Collecting monitoring data

Every 15 minutes, an endpoint from the probe-info service is queried to obtain the last updated timestamp. This gets loaded into a BigQuery table, then transformed and dumped into a JSON file. During this process, the timestamp series is transformed into deltas representing the time since the last update. In the SQL, we encode some of the business logic to account for the weekend lull.
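To make the weekend handling concrete, here is a minimal Python sketch of the idea. It is not the SQL used by the dashboard; the function name `business_hours_between` and the rule "skip Saturdays and Sundays in UTC" are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def business_hours_between(start: datetime, end: datetime) -> float:
    """Hours elapsed from start to end, ignoring Saturdays and Sundays.

    Hypothetical helper: both datetimes are assumed to be in UTC.
    """
    if end <= start:
        return 0.0
    total = timedelta(0)
    cursor = start
    while cursor < end:
        # Walk one calendar day at a time, stopping at midnight or at `end`.
        next_midnight = (cursor + timedelta(days=1)).replace(
            hour=0, minute=0, second=0, microsecond=0
        )
        segment_end = min(next_midnight, end)
        if cursor.weekday() < 5:  # Monday=0 ... Friday=4
            total += segment_end - cursor
        cursor = segment_end
    return total.total_seconds() / 3600.0


# Last update seen on a Friday, checked again the following Monday.
friday = datetime(2021, 1, 29, 5, 0, tzinfo=timezone.utc)
monday = datetime(2021, 2, 1, 5, 0, tzinfo=timezone.utc)
delta_hours = business_hours_between(friday, monday)
```

Under these assumptions, an update on Friday at 05:00 followed by one on Monday at 05:00 counts as 24 business hours, the same as any two consecutive weekday updates.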

We expect to see five updates a week at a regular interval of 24 hours. Plotting the data reveals apparent irregularities on certain dates. We consider anything that takes longer than 24 business hours a partial-outage. It is a partial-outage because the infrastructure continues to serve requests despite holding out-of-date information about schemas.

You can obtain a copy of the frozen output here or updated data directly from the monitoring dashboard.

The large spike on 2021-01-28 was due to a memory pressure issue that took several days to resolve. This is the largest partial-outage in recorded history, but not the only one. Between November and December, there were five outages in total.

Is there a way to automatically detect whether there is currently a partial-outage without knowing the intricacies of scheduling? It turns out we can make this inference easily with the help of statistics.

A refresher on statistics, are you MAD‽

To talk about outliers, we’ll need to know how to describe the data and their patterns.

Standard statistics

You might already be familiar with these. If you are, feel free to skip ahead. Otherwise:

The mean ($\mu$) is the average of the dataset. We’ll let $\text{avg}$ be the arithmetic mean, or the sum of all the values divided by the number of values.

$\mu = \text{avg}(X) = \frac{ \sum_{i=1}^n x_i }{n}$

The standard deviation ($\sigma$) is a measure of spread which quantifies the average distance from the mean. More precisely:

$\sigma = \sqrt{ \text{avg}((x_i - \mu)^2) }$

We can determine how many standard deviations a point is from the mean by computing a z-score ($z$).

$z = \frac{x - \mu}{\sigma}$

For a set where $\mu=10$ and $\sigma=5$, the point $x=20$ has a score $z=2$, while the point $x=5$ has a score $z=-1$.
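As a quick check of the example above, here is a small sketch using Python's standard `statistics` module. The two-point dataset `[5.0, 15.0]` is made up so that $\mu=10$ and $\sigma=5$; `pstdev` is the population standard deviation, which matches the $\text{avg}$-based definition used here.

```python
from statistics import fmean, pstdev


def z_score(x: float, values: list[float]) -> float:
    """Number of standard deviations between x and the mean of values."""
    mu = fmean(values)
    sigma = pstdev(values, mu)  # population standard deviation
    return (x - mu) / sigma


# Illustrative data chosen so that the mean is 10 and the deviation is 5.
values = [5.0, 15.0]
z_high = z_score(20.0, values)
z_low = z_score(5.0, values)
```

Here `z_high` comes out at 2 and `z_low` at -1, matching the hand calculation.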

Robust statistics

We’ll define robust equivalents of the mean, standard deviation, and z-score. A robust statistic is less susceptible to the effects of outliers. The median ($\tilde{\mu}$) is the middle point in the sorted set of values (or the average of the two middle points when the count is even), which we compute with $\text{med}$.

$\tilde{\mu} = \text{med}(X)$

The median absolute deviation ($MAD$) is the median of the absolute differences from the median, and a robust measure of spread.

$MAD = \text{med}(|x_i - \tilde{\mu}|)$

Finally, we can compute a modified z-score ($\tilde{z}$) using $MAD$. We apply a scale-factor $k$ that’s appropriate for the distribution. I’ll assume the data is Gaussian (or normal) for lack of better prior understanding. According to Rousseeuw, this means $k=1.4826$. NIST suggests the nearly identical value of $k = 1 / 0.6745 \approx 1.4826$. I’ll be using the former for all calculations.

$\tilde{z} = \frac{x - \tilde{\mu}}{k \cdot MAD}$
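The robust definitions translate into a few lines of Python. This is a sketch rather than the code behind the dashboard: the helper names and the sample deltas are invented for illustration, and it assumes $MAD$ is non-zero (if more than half of the values are identical, $MAD$ is 0 and the score is undefined).

```python
from statistics import median

K = 1.4826  # scale factor for normally distributed data


def mad(values: list[float]) -> float:
    """Median of the absolute differences from the median."""
    center = median(values)
    return median([abs(v - center) for v in values])


def modified_z_score(x: float, values: list[float], k: float = K) -> float:
    """Robust z-score of x; assumes mad(values) is non-zero."""
    return (x - median(values)) / (k * mad(values))


# Made-up deltas in hours: four ordinary days and one long gap.
deltas = [23.0, 24.0, 24.0, 25.0, 72.0]
spread = mad(deltas)
score_ordinary = modified_z_score(25.0, deltas)
score_long_gap = modified_z_score(72.0, deltas)
```

With these numbers the median is 24 and $MAD$ is 1, so the 25-hour delta scores about 0.67 while the 72-hour delta scores above 32.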

Detecting partial-outages on historical data

Now armed with the necessary statistical tools, let’s run some analysis on historical data to find partial-outages in the service. We’ll mark points that are 3 deviations away from the center (mean or median) following the three-sigma rule of thumb.
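The marking step can be sketched as follows. The deltas below are made up and are not the dashboard's data, the function names are hypothetical, and the robust score assumes a non-zero $MAD$.

```python
from statistics import fmean, median, pstdev

K = 1.4826  # scale factor for normally distributed data


def standard_scores(values: list[float]) -> list[float]:
    """Classic z-score of every point, using the population deviation."""
    mu = fmean(values)
    sigma = pstdev(values, mu)
    return [(v - mu) / sigma for v in values]


def modified_scores(values: list[float], k: float = K) -> list[float]:
    """MAD-based z-score of every point; assumes MAD is non-zero."""
    center = median(values)
    spread = median([abs(v - center) for v in values])
    return [(v - center) / (k * spread) for v in values]


def flag_outliers(scores: list[float], threshold: float = 3.0) -> list[int]:
    """Indices of points more than `threshold` deviations from the center."""
    return [i for i, s in enumerate(scores) if abs(s) > threshold]


# Illustrative deltas in hours: nine ordinary days and two long gaps.
deltas = [23.0, 23.5, 24.0, 24.0, 24.5, 25.0, 24.0, 23.5, 24.5, 48.0, 96.0]
robust_flags = flag_outliers(modified_scores(deltas))
standard_flags = flag_outliers(standard_scores(deltas))
```

On these made-up numbers, the robust score flags both long gaps at a threshold of 3, while the standard score flags neither because the two gaps inflate the standard deviation themselves.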

All of history (7 partial-outages)

First, let’s take a look at the data over the entire dataset so far, representing approximately three months of data.

The MAD-based threshold of 3 handily finds all the legitimate outages. If we had set the threshold for the standard score to 1, we would obtain similar results.

Month of November 2020 (3 partial-outages)

Now let’s repeat it for November. This is a period with three partial-outages.

Again, the MAD-based threshold is working well. The optimal threshold for the standard score is ~2, but the threshold from the previous section would work too.

Month of October 2020 (no outages)

And again for October. This was a perfect month without any outages 😊.

Robust measures of spread like MAD are good at finding outliers in data with minimal tuning of the threshold. Note that the standard deviation varies between $0.5$ and $21.0$ hours depending on the period, whereas MAD never exceeds $1.0$ hour. Outliers can inflate the standard deviation by a large margin. This is less of a problem with a robust statistic like MAD, which makes it useful when we don’t know much about the data. While we can use the standard deviation to find service outages, we have to tune the z-score threshold to meet our needs.
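A small experiment illustrates how differently the two measures of spread react to a single outlier. The numbers are invented for this sketch and only mimic the shape of the real deltas.

```python
from statistics import median, pstdev


def mad(values: list[float]) -> float:
    """Median of the absolute differences from the median."""
    center = median(values)
    return median([abs(v - center) for v in values])


# Illustrative deltas in hours for a quiet period.
quiet = [23.0, 23.5, 24.0, 24.0, 24.5, 25.0, 24.0, 23.5, 24.5]
# The same period with one long partial-outage appended.
with_outage = quiet + [96.0]

std_quiet, std_outage = pstdev(quiet), pstdev(with_outage)
mad_quiet, mad_outage = mad(quiet), mad(with_outage)
```

In this sketch the standard deviation jumps from under an hour to more than 20 hours once the outage is included, while MAD stays at half an hour in both cases.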