Adjust's post mortem for incident on 2018-06-21
Posted Jun 29, 2018
Dear clients and partners,
As promised, here is a full rundown of the events leading up to last Thursday.
What happened? And how do we ensure it won't happen again?
Before I can answer these questions, it's important to get some background on the changes we made to our infrastructure in the past 13 months, as well as what happened in the weeks before the outage.
The changes we made to our infrastructure in the last 13 months
After an outage in 2017, we took serious measures to increase redundancy. The first step was to run all tracking services in three fully redundant data centers. We chose Amsterdam (AMS) and Los Angeles (LAX), in addition to our main data center in Frankfurt (FRA). In the event of an outage of the main data center, the affected services would have been limited to the dashboard and some critical functionality that is not real-time. Recently, we also began creating a full copy of the dashboard in LAX, mainly to give our clients in the US and APAC faster access to their KPIs. This came with the added bonus that an outage in FRA would no longer limit dashboard access.
We also upgraded our network setup; we now control all global routing rules for all Adjust services. In an emergency, this allows us to move traffic from one data center to another within a few seconds.
Last, and most importantly, we switched to a fully distributed setup for our in-memory database. This means that all requests from our SDK and partners are globally synced across three identical sets of database clusters, which allows us to redundantly route traffic from one location to another without any inconsistencies. This real-time syncing requires a high-bandwidth, low-latency connection between the clusters, so we chose a triangular approach with two independent connections for each link: LAX - AMS, AMS - FRA, and FRA - LAX. The FRA - LAX connection is not yet in place and will be online within a few months.
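To make the effect of the missing third link concrete, here is a minimal Python sketch of the triangle described above. The site names and links come from the text; the function names are hypothetical, and each link is modelled as a single edge that is "down" only when both of its independent connections fail. It is an illustration only, not a description of Adjust's routing.

```python
from collections import deque

# Links named in the text. FRA-LAX is described as not yet online.
LINKS = {("LAX", "AMS"), ("AMS", "FRA")}
PLANNED = {("FRA", "LAX")}


def reachable(start, links):
    """Return the set of sites reachable from `start` over undirected links."""
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        site = queue.popleft()
        for nxt in neighbours.get(site, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def isolated_sites(links, failed, origin="FRA"):
    """Sites that can no longer reach `origin` once the `failed` link is down."""
    sites = {site for link in links for site in link} | {origin}
    remaining = {link for link in links if link != failed}
    return sites - reachable(origin, remaining)


# With only two sides of the triangle, losing LAX-AMS cuts LAX off from FRA.
print(isolated_sites(LINKS, ("LAX", "AMS")))
# With the planned FRA-LAX link, the same failure isolates no site.
print(isolated_sites(LINKS | PLANNED, ("LAX", "AMS")))
```

With only two sides of the triangle in place, every path between FRA and LAX runs through AMS, so a failure of either existing link separates at least one site from FRA.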
To initially synchronize two database clusters, a full backup needs to be made and copied from one location to another. That means reading around 100TB from disk, copying it over the network, and writing it back to disk at its destination. After that, all transactions made during the transfer are replayed to establish a real-time stream. We currently use 10 Gbit/s connections and the fastest commercially available hardware for our setup. However, due to the size of the data set, and the fact that the cluster performing the backup has to simultaneously respond to around 500k queries per second, a full sync takes between two and three weeks.
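A back-of-the-envelope check of these figures, as a small Python sketch. It assumes decimal units (1 TB = 10^12 bytes), a single 10 Gbit/s connection, and it ignores the transaction replay phase; the function names are hypothetical.

```python
DATASET_BYTES = 100e12    # "around 100TB", decimal terabytes assumed
LINK_BITS_PER_S = 10e9    # one 10 Gbit/s connection


def line_rate_hours(num_bytes, bits_per_s):
    """Hours needed to move num_bytes if the link ran at full line rate."""
    return num_bytes * 8 / bits_per_s / 3600


def effective_mb_per_s(num_bytes, days):
    """Average throughput in MB/s if the copy takes `days` days in total."""
    return num_bytes / (days * 86400) / 1e6


print(f"{line_rate_hours(DATASET_BYTES, LINK_BITS_PER_S):.1f} h at line rate")
print(f"{effective_mb_per_s(DATASET_BYTES, 14):.1f} MB/s for a two-week sync")
print(f"{effective_mb_per_s(DATASET_BYTES, 21):.1f} MB/s for a three-week sync")
```

At full line rate the copy alone would take roughly 22 hours. A two-to-three-week sync corresponds to an average of only about 55 to 83 MB/s, which suggests that the production load on the source cluster limits the sync far more than the link does.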
What happened in the weeks leading up to the outage?
On Monday, May 7th, we started an important upgrade to our database system. This upgrade was a major release and required a full resync of a cluster. After careful planning and testing, we decided to start the upgrade at the LAX data center. Two and a half weeks later, on May 24th, the upgrade and resync finished without any major incidents. On Monday, June 4th, having experienced no issues with the new version in the roughly 10 days since, we decided to upgrade the AMS cluster.
On June 6th, Level3, one of our providers for fiber optic connections between AMS and LAX, announced a scheduled maintenance from 04:15 to 05:50 UTC. Unfortunately, our other connection, provided by Zayo, had already gone down at 00:15 UTC the same day: an undersea cable between Zandvoort, NL and Lowestoft, UK was severed, we assume by anchoring fishing boats. That connection was restored by 08:40 UTC.
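The overlap between the two outages can be computed directly from the times above. The following Python sketch assumes the Level3 link was down for the entire announced maintenance window and that June 6th refers to 2018; the helper names are hypothetical.

```python
from datetime import datetime, timedelta


def utc(hhmm, day="2018-06-06"):
    """Parse an HH:MM time on the given day (times treated as UTC)."""
    return datetime.strptime(f"{day} {hhmm}", "%Y-%m-%d %H:%M")


def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time windows (zero if disjoint)."""
    start = max(a_start, b_start)
    end = min(a_end, b_end)
    return max(end - start, timedelta(0))


zayo = (utc("00:15"), utc("08:40"))     # cable cut until restoration
level3 = (utc("04:15"), utc("05:50"))   # announced maintenance window

both_down = overlap(*zayo, *level3)
print(both_down)  # 1:35:00
```

Under these assumptions, both connections between AMS and LAX were unavailable for 95 minutes, the full length of the maintenance window.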
Due to the overlap in the connection outages, the cluster in LAX fell out of sync and needed to be shut down to avoid tracking inconsistencies. Once the connections were reestablished, we were unable to replay the backlog of transactions and were forced to restart the full syncing process.
Unfortunately, this happened while we were still in the middle of syncing AMS from FRA. FRA therefore had to handle all tracking requests as well as a new backup process for LAX. We saw no other option than to wait for AMS to finish syncing. However, the bandwidth between FRA and AMS was now additionally strained by the traffic going from FRA to LAX via AMS.
What occurred on June 21st?
At 09:29 UTC, during a scheduled maintenance, human error led to a short circuit at our data center in FRA; this resulted in blown fuses on all electrical equipment connected to the main power line. While designed for power failures, these electrical systems were not built for a human-induced power surge, and our entire setup in FRA went offline.
At 09:41 UTC, power was restored and our servers and network equipment began to reboot; we had suffered some damage from the power surge and many components had to be replaced.
By 10:17 UTC, we had brought back some systems and were able to re-activate the dashboard.
Soon after, we began responding to click redirects, albeit at an initially low success rate. This was because a damaged hardware component stretched the reboot process from minutes to hours.
While the above was happening, we examined the status of our last remaining database cluster. In the worst possible turn of events, the failed network equipment had left the cluster in a state that required it to be manually re-configured. This process took approximately 30 minutes.
The main challenge in booting the database cluster from a “cold” state is that all data needs to be read from disk into RAM. To avoid inconsistencies, we must wait for all data to be read before enabling the cluster. Even under optimal conditions, reading 100TB takes a while: at over 10GB/s, the process still took 2.5 hours and was completed at 12:48 UTC.
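The read time follows from the data size and the read throughput. A quick Python sketch, assuming decimal units (1 TB = 10^12 bytes, 1 GB = 10^9 bytes) and hypothetical function names:

```python
DATASET_BYTES = 100e12   # "100TB", decimal terabytes assumed


def read_hours(num_bytes, bytes_per_s):
    """Hours needed to read num_bytes at a sustained rate of bytes_per_s."""
    return num_bytes / bytes_per_s / 3600


def implied_gb_per_s(num_bytes, hours):
    """Sustained read rate in GB/s implied by reading num_bytes in `hours`."""
    return num_bytes / (hours * 3600) / 1e9


print(f"{read_hours(DATASET_BYTES, 10e9):.2f} h at exactly 10 GB/s")
print(f"{implied_gb_per_s(DATASET_BYTES, 2.5):.1f} GB/s implied by 2.5 h")
```

At exactly 10 GB/s the read would take about 2.8 hours, so finishing in 2.5 hours implies a sustained aggregate rate of roughly 11 GB/s, consistent with "over 10GB/s".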
Once the database was back online, we slowly enabled all backend services in preparation for the queued-up requests from partners and SDKs. At 14:40 UTC, half of all requests were receiving responses, with successful requests running at 140% above the normal rate. By 15:28 UTC, all backends were back and fully operational, handling a request volume more than 200% above normal.
At 17:38 UTC, all services were operating normally and incoming traffic had settled at 150% of the usual rate. By then, we had caught up on over 75% of all requests that failed during the outage.
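For reference, the timestamps in this section can be turned into elapsed times since the power loss. This is a minimal Python sketch; the event labels are shortened paraphrases of the text, and the function name is hypothetical.

```python
from datetime import datetime

# (UTC time on June 21st, shortened description), taken from the text above.
EVENTS = [
    ("09:29", "short circuit, FRA offline"),
    ("09:41", "power restored"),
    ("10:17", "dashboard re-activated"),
    ("12:48", "database cluster loaded into RAM"),
    ("14:40", "half of all requests answered"),
    ("15:28", "all backends operational"),
    ("17:38", "all services operating normally"),
]


def minutes_since_start(events):
    """Return (label, minutes elapsed since the first event) for each event."""
    fmt = "%H:%M"
    start = datetime.strptime(events[0][0], fmt)
    result = []
    for time_str, label in events:
        elapsed = datetime.strptime(time_str, fmt) - start
        result.append((label, int(elapsed.total_seconds() // 60)))
    return result


for label, minutes in minutes_since_start(EVENTS):
    print(f"+{minutes // 60}h{minutes % 60:02d}m  {label}")
```

By this count, the database was available again 3 hours 19 minutes after the short circuit, and all services were back to normal after 8 hours 9 minutes.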
How do we make sure this won’t happen again?
Adjust holds an important position in the mobile marketing ecosystem and its uninterrupted availability is crucial for both our clients and partners. We take this responsibility extremely seriously and have created a series of initiatives detailed below to mitigate the possibility of another outage and further improve overall uptime and network performance.
For starters, we are going to increase bandwidth between all locations: FRA - AMS will be upgraded to 100 Gbit/s, while FRA - LAX will finally receive the missing link. Secondly, all database servers will be upgraded to a faster type of storage, going from SSD to NVMe technology. Furthermore, we will investigate the option of falling back to the public internet in cases where all connections to a location are cut. These measures should reduce syncing times in case of cluster resets and increase resilience against connection outages.
We are also re-evaluating our upgrade strategies and exploring how we can avoid being vulnerable to such events.
Additionally, we’ve started evaluations for a fourth data center in the APAC region. While we originally planned only to provide the region with a faster dedicated dashboard, we will now also evaluate a full tracking setup on site.
Last but not least, in August we will physically relocate our servers from our current FRA location to another housing facility within the city. This move was actually planned a year ago, but due diligence on potential locations, obtaining access certification, and network equipment purchases prolonged the process.
During the outage, we also experienced issues with our email provider, which led to many clients not receiving all the communications we sent out. To extend your access to status information during any incident, we are going to rigorously test all future email providers and create a new status page for Adjust. This page will allow you to keep track of all updates on any incident and the status of each service Adjust provides.
There is no such thing as an acceptable amount of downtime and I sincerely apologize for letting you, our clients and partners, down. Please know that we will do everything we can to prevent this incident from repeating itself.
While no system is impervious to failure, we will continuously strive to make it harder for bad luck or human error—no matter how unlikely—to bring us down. Last Thursday, Murphy may have won the battle, but we will learn from it and rebuild better and more resilient.
I also want to thank every client and partner that has reached out to us with supportive words or has simply exercised patience during this time. It is much appreciated.
Should you have any more questions about the outage, please reach out to me at firstname.lastname@example.org or simply contact our support team.
Paul H. Müller
CTO & Co-Founder Adjust