A postmortem of Adjust's recent server outage
May 24, 2017
Last week, Adjust suffered the worst outage since its founding: for approximately eight hours, we did not process real-time data. That single incident effectively doubled the total downtime we had experienced over the previous five years.
Since then, our SDK has replayed most data retroactively, and self-attributing networks—like Facebook or Google—used this data stream to claim attributions, as usual. All inventory transmitted via server-to-server also functioned normally, but wasn't attributed in that same time frame. However, we are currently investigating the possibility of replaying those clicks and impressions to repair missing attributions.
During the outage, we didn't redirect many clicks into the store. While this may be the smallest portion of tracked inventory, we still left users stranded on a white page. For this, I sincerely apologize. When it comes to app marketing, you trust Adjust with your most critical data, and on that day, we failed you.
Make no mistake: preventing anything similar from recurring is my highest priority. This task has our fullest attention, and I would like to share with you what we are working on right now.
Adjust rents its servers, racks, and network equipment from a large hosting company with data centers (DCs) around the world. We use the same standard hardware employed by the likes of Dell and other large-scale vendors. However, not all data centers are operated directly by our hosting company, and our main data center in Frankfurt (Germany) is one of the third-party operated DCs. This DC operator rents out the physical space, power, and air conditioning for our racks to our hosting company. This means that Adjust manages its servers remotely, just like any cloud-hosted competitor; the biggest difference is that we do not share our hardware and infrastructure with any other companies.
Around 0:00 UTC we detected a failing network card in one of our servers and immediately contacted our hosting company for a replacement—they provide us with 24/7 on-site support for hardware issues.
While we investigated the card failure, we also lost connectivity to central parts of our infrastructure and multiple racks full of servers immediately went dark.
Even though Adjust currently runs on two DCs, Frankfurt and Amsterdam (the Netherlands), one remaining component of our software stack, which is crucial to tracking, still runs centrally in Frankfurt. This component was built redundantly across multiple racks, with redundant uplinks. In other words, it would basically take the entire DC going offline to break this service... which is exactly what happened next.
Once we reached our hosting company’s head of operations, we learned that the DC’s cooling system in Frankfurt had failed, and the temperature in the server rooms had risen to critical levels. As of today, the most probable cause seems to be a series of extreme power spikes that first destroyed some of the DC’s uninterruptible power supplies (UPSs) and then reached the redundant cooling control system, triggering a set of valves to close while still reporting them as open.
The DC operator worked on fixing the system but was unable to confirm an ETA for reestablished operations. Therefore, it was impossible for us to communicate to our clients when we would restore service.
Around 2:45 UTC, as we messaged clients to pause their campaigns, our hosting company had to shut down all core routers to prevent permanent heat damage. This effectively took all servers in the DC offline. At this point, the server rooms had reached about 60°C, which made it nearly impossible for the on-site engineers to repair the hardware.
The following three hours were the most painful time in Adjust's existence. Without an ETA for a working cooling system, we were forced to shut down the majority of our systems to preserve our hardware. It probably goes without saying, but the inability to deliver an update to our clients made for an excruciating support experience for everyone involved.
Finally, at 5:30 UTC, the cooling system was fixed and temperatures in the server room had started to drop. As soon as we regained access, we learned that most parts of our internal networking were fried and dozens of hard disks had permanently failed. Almost all servers had damaged hardware, and some were simply gone for good.
Despite this reality, our sysops and the hosting company’s technicians managed to create a running system and powered up the core routers by 6:30 UTC. Some of our components, like in-memory database systems, need up to 60 minutes to reboot if the entire cluster has failed. Therefore, it took another hour until all systems were back online.
When Adjust’s SDK could finally reach our servers, hundreds of millions of devices started to send us their stored sessions and events. Normally, we see about 100k requests per second around this time of day. But on this day, we peaked at close to 500k.
Although we anticipated that this increase would be significant, we could never have predicted the size of the load after such a long downtime: the flood almost brought down the system again. Only the quick reaction of our sysops and backend engineers (re-purposing anything with a CPU and a heartbeat) allowed us to consume the billions of incoming requests. A few hours later, we were back to real-time and had started the process of replacing all damaged hardware.
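As a rough sanity check on the scale of that replay, the figures above can be plugged into a back-of-the-envelope estimate. This is only a sketch under loud assumptions that are not part of the incident data: traffic is treated as a constant 100k requests per second for the full eight hours, every missed request is assumed to be replayed, and the 500k peak is assumed to be sustained. The function names are hypothetical.

```python
def backlog_requests(normal_rps, outage_hours):
    """Requests that pile up on devices while the servers are unreachable.

    Assumes a constant request rate for the whole outage (a simplification).
    """
    return normal_rps * outage_hours * 3600


def drain_hours(backlog, ingest_rps, normal_rps):
    """Hours needed to work off a backlog while live traffic keeps arriving.

    Only the capacity above the normal rate is available for the backlog.
    """
    surplus_rps = ingest_rps - normal_rps
    if surplus_rps <= 0:
        raise ValueError("ingest rate must exceed normal traffic to catch up")
    return backlog / surplus_rps / 3600


# Figures from the text: ~100k rps normally, ~500k rps at peak, ~8 hours down.
backlog = backlog_requests(normal_rps=100_000, outage_hours=8)
print(backlog)                                # 2880000000
print(drain_hours(backlog, 500_000, 100_000)) # 2.0
```

Under these simplified assumptions the backlog is on the order of a few billion requests and takes roughly two hours to drain, which is in line with "billions of incoming requests" and "a few hours later" above.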
What will Adjust do about it?
Let's discuss the consequences and next steps for Adjust resulting from this outage.
The first decision we’ve made is to relocate our Frankfurt data center to a different facility within Frankfurt.
While it may have been a freak accident (the first of its kind in our hosting company’s history) that disabled the cooling system, the operator did not communicate any status updates during the entire outage and still hasn't released a final incident report. As a result, we have asked our hosting company to relocate our servers to another operator that has a better power supply system and has successfully survived the same power spikes without any issue.
But of course, there is no way that we can put this entirely on a data center failure.
Consequently, Adjust has formed a taskforce of senior engineers who have already begun to distribute our tracking infrastructure across three independent data centers, so that it can survive the failure of an entire DC.
In addition to the servers in Frankfurt and Amsterdam, we will also add another location inside the US. Furthermore, the currently central component of the tracking stack will be rewritten to deal with high latencies between globally distributed servers.
All of this will be completed within the next three months.
We should have done this a long time ago, but always assumed that the complete DC failure of so many redundant systems was highly unlikely. This assumption has proven fatal, but we will not make this mistake again.
Provided that the rest of the year passes without any major incident, we will just barely reach 99.9 percent availability for 2017, missing both our previous average of 99.97 percent and our 2016 uptime of 99.99 percent.
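For readers who want to check the arithmetic behind these percentages, here is a minimal sketch. It assumes a 365-day year (8,760 hours) and counts only the roughly eight-hour outage described above as downtime; the helper names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours; 2017 is not a leap year


def availability_percent(downtime_hours, period_hours=HOURS_PER_YEAR):
    """Availability over a period, given the total downtime in hours."""
    return 100.0 * (1.0 - downtime_hours / period_hours)


def downtime_budget_hours(target_percent, period_hours=HOURS_PER_YEAR):
    """Maximum downtime in hours that still meets an availability target."""
    return period_hours * (1.0 - target_percent / 100.0)


# One outage of roughly eight hours and nothing else for the rest of the year.
print(round(availability_percent(8.0), 3))  # 99.909

# Yearly downtime allowed by each figure mentioned in the text.
for target in (99.9, 99.97, 99.99):
    print(target, round(downtime_budget_hours(target), 2))
```

A 99.9 percent target allows about 8.76 hours of downtime per year, so an outage of about eight hours leaves well under an hour of headroom. By comparison, 99.97 percent corresponds to about 2.6 hours per year and 99.99 percent to just under one hour.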
We will work hard to earn back your trust and rebuild any lost confidence.
If you have any further questions about this issue, please reach out and I will personally get back to you.
Paul H. Müller
CTO & Co-Founder