4 reasons Facebook went down so badly last week

Packet-switched networks were designed to withstand a nuclear attack. Yet a series of mistakes, triggered by a command that slipped through the auditing tools, caused a global outage of Facebook on October 4, 2021, estimated to have temporarily cost the company $47 billion in market capitalization, on top of a revenue loss between $60 and $100 million. More importantly, such events reveal how vulnerable our society has become to complex software systems. In this blog post we review the known facts about Facebook’s outage and the causes that triggered it. We will follow up in a second post with a plea for BGP change validation, a key technology for making BGP networks more resilient.

Fundamentals

The Internet is a network of networks, connected with each other at various locations around the world to exchange traffic and deliver services to other networks or end users. To support this traffic exchange on the data plane, where data packets flow, networks need to communicate on the control plane, where control information flows. The current inter-domain routing system, built upon the mechanisms of the Border Gateway Protocol (BGP) [14], facilitates this control-plane interaction. It enables networks to send and receive reachability information about their Internet Protocol (IP) prefixes (i.e., groups of IP addresses) and the paths leading to them, expressed at the level of Autonomous Systems (ASes), which are assigned to different organizations. Despite its scalability and its ability to express complex routing policies, BGP is known to have many fundamental problems and has been the subject of extensive research for almost three decades [13].
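To make the reachability-information exchange concrete, here is a minimal, illustrative sketch (not the BGP wire protocol; prefix and AS numbers are hypothetical) of a router choosing among several advertised paths to the same prefix, using AS-path length, one of the criteria real BGP implementations apply:

```python
# Illustrative sketch, not the BGP wire protocol: a router picking among
# advertised paths for a prefix. Prefix and AS numbers are hypothetical.
from dataclasses import dataclass

@dataclass
class Route:
    prefix: str          # e.g. "198.51.100.0/24"
    as_path: list[int]   # ASes the announcement traversed, origin AS last

def best_path(routes: list[Route]) -> Route:
    """Pick the route with the shortest AS path (one of several real BGP criteria)."""
    return min(routes, key=lambda r: len(r.as_path))

routes = [
    Route("198.51.100.0/24", [64500, 64501, 64502]),
    Route("198.51.100.0/24", [64510, 64502]),
]
print(best_path(routes).as_path)  # the two-hop path wins
```

Real BGP best-path selection involves many more tie-breakers (local preference, origin type, MED, and so on); the point here is only that each network hears multiple AS-level paths per prefix and picks one.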

Facebook’s global outage: What happened?

On October 4, Facebook experienced a global outage that lasted almost six hours and affected Facebook itself along with its affiliated services WhatsApp, Instagram, and Oculus. The company initially released a brief update on the same day [7], stating that the outage was the result of a configuration change in the network, specifically on the backbone routers that coordinate network traffic between its data centers. The change cascaded to inter-data-center communications and led to Facebook resources (including other application domains) disappearing from the Internet, with the staff themselves having trouble getting the service going again. In effect, Facebook was “off the grid” for several hours.

A follow-up update shed more light on the matter [8]. During a maintenance operation, an erroneous configuration command, which slipped past the auditing tools, unintentionally took down the backbone inter-data-center network connections, isolating Facebook’s data centers from each other and from the Internet. This is remarkable because packet-switched networks were designed to withstand failures; it seems that logical centralization in the design of the inter-data-center backbone introduced a single point of failure. Facebook’s DNS servers could no longer contact the data centers, signaling unhealthy/broken network connections on the data plane. This data-plane behavior was coupled with automated control-plane countermeasures: the servers were programmed to automatically withdraw the BGP advertisements for the affected IP prefixes. This is an indication that Facebook uses anycast [2], announcing the same route from multiple locations and selectively withdrawing (some of) them when needed for maintenance, load redirection, or other reasons. BGP withdrawals are useful when a limited number of sites are offline: client requests are served by the next (closest) available Point of Presence (PoP) as a fallback. They can prove disastrous, however, in a global site outage, which is exactly the scenario that occurred. Moreover, Facebook did not seem to have any out-of-band access in place. The end result: although the DNS servers themselves remained operational, they became unreachable. DNS stopped working, rendering inoperable the internal management tools that would normally be used to resolve the situation. Activating on-site security protocols, getting people into the data centers, and gradually restoring services in a staged fashion stretched the recovery into several hours.
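The coupling described above can be sketched in a few lines. This is our illustration of the reported automation, not Facebook’s actual code: each anycast DNS site keeps announcing the shared prefix only while it can reach a backend data center, so a partial failure shifts traffic to healthy sites, while a backbone-wide failure makes every site withdraw at once:

```python
# Hedged sketch of the reported automation (names and logic are our
# illustration, not Facebook's code): each anycast DNS PoP withdraws its
# BGP announcement when it cannot reach any backend data center.

def announced_sites(sites: dict[str, bool]) -> list[str]:
    """sites maps PoP name -> 'can this PoP reach a data center?'.
    Returns the PoPs that keep announcing the anycast DNS prefix."""
    return [pop for pop, healthy in sites.items() if healthy]

# Partial outage: the remaining PoPs absorb traffic (anycast fallback works).
print(announced_sites({"ams": True, "sfo": False, "sin": True}))   # ['ams', 'sin']

# Backbone-wide outage: every PoP withdraws -> the prefix vanishes globally.
print(announced_sites({"ams": False, "sfo": False, "sin": False}))  # []
```

The second case is exactly the failure mode of October 4: a mechanism designed for graceful partial degradation, applied under a correlated global failure, removed every copy of the announcement.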

Public data confirm that during the incident the IP prefixes of Facebook, including critical routes to its DNS servers, were entirely withdrawn from the Internet. In particular, according to Cloudflare [3], numerous withdrawals of Facebook prefixes (which are typically quiet on a minute-to-minute basis) were observed in BGP at around 15:40 UTC. At 15:58 UTC, Cloudflare noticed that Facebook had stopped announcing the routes to its Domain Name System (DNS) [4] prefixes, and DNS stopped resolving Facebook domains globally. Cloudflare actually observed the incident as an increase in failed DNS requests for the Facebook domains at its resolver, 1.1.1.1. Meanwhile, other Facebook IP addresses remained routed but were of little use: without DNS, Facebook and its affiliated services were effectively unavailable. Interestingly, applications kept retrying the domains aggressively, creating even more DNS traffic and even impacting other services. At around 21:00 UTC, engineers saw renewed BGP activity from Facebook’s network, peaking at 21:17 UTC as the previous BGP state was restored. The DNS name ‘facebook.com’ became resolvable again at 21:20 UTC, and as of 21:28 UTC Facebook appeared to be reconnected to the global Internet with DNS working again.
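The retry-storm effect on resolvers is easy to see with back-of-the-envelope arithmetic (all numbers below are hypothetical, chosen only to illustrate the amplification): if every lookup fails and clients immediately retry each failure several times instead of backing off, resolver load is multiplied roughly by the retry count.

```python
# Back-of-the-envelope illustration (hypothetical numbers) of the retry storm:
# when all lookups fail, clients that retry each failure N times multiply
# resolver query load compared with normal operation.

def query_load(clients: int, lookups_per_client: int,
               retries_on_failure: int, failure_rate: float) -> int:
    """Total queries a resolver sees, assuming each failed lookup is retried."""
    base = clients * lookups_per_client
    return int(base + base * failure_rate * retries_on_failure)

normal = query_load(1_000_000, 10, retries_on_failure=3, failure_rate=0.01)
outage = query_load(1_000_000, 10, retries_on_failure=3, failure_rate=1.0)
print(normal, outage)  # with total failure, load roughly quadruples
```

This matches Cloudflare’s observation that failing Facebook lookups drove extra traffic toward 1.1.1.1 and even affected unrelated services.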

These observations were visualized in the context of BGP by BGPlay [1], one of the BGP visualization services available from RIPEstat [15]. According to RIPE [5], Facebook went down at 15:42 UTC. The outage went on for several hours, with the first signs of recovery showing at around 21:00 and another half hour or so passing before things started to look stable again. As a representative example, AS32934 normally originates the prefix 129.134.30.0/24, which contains an important piece of the Facebook network: the authoritative DNS name server a.ns.facebook.com for the facebook.com domain. At 15:53:47 UTC, all the links between AS32934 and the rest of the Internet, in the context of routing to this critical prefix, were gone. Similar observations hold for both IPv4 and IPv6, since both protocols were impacted.
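Anyone can reproduce this kind of check, since RIPEstat also exposes its data through a public API alongside the BGPlay widget. A sketch of querying the routing state of the affected prefix around the outage follows; the endpoint and parameter names reflect our reading of the RIPEstat data API and should be verified against its current documentation:

```python
# Sketch of querying RIPEstat's public data API for the BGP state of the
# affected prefix around the outage. Endpoint/parameter names are per our
# reading of the RIPEstat docs -- verify before relying on them.
import json
import urllib.parse
import urllib.request

def bgp_state_url(prefix: str, timestamp: str) -> str:
    """URL for the routing state of `prefix` as seen at `timestamp` (ISO 8601)."""
    params = urllib.parse.urlencode({"resource": prefix, "timestamp": timestamp})
    return f"https://stat.ripe.net/data/bgp-state/data.json?{params}"

url = bgp_state_url("129.134.30.0/24", "2021-10-04T16:00:00")
print(url)

# Uncomment to fetch live data (requires network access):
# with urllib.request.urlopen(url) as resp:
#     state = json.load(resp)
#     print(len(state["data"]["bgp_state"]), "routes seen by RIS peers")
```

Running the same query with a timestamp before 15:40 UTC and one during the outage window should show the prefix going from widely visible to absent.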

More sources also verified the timeline of the incident as seen in BGP and DNS [6, 12], as well as in NetFlow data [10, 11] and from users’ perspectives [9].

The root causes

In summary, the impact of the issue was total unavailability of Facebook’s services for several hours, while its root cause actually comprised four factors:

1. A network configuration command that slipped through the cracks during maintenance.

2. Logical centralization of configuration and control affecting several data centers at once.

3. Tight coupling between the data plane (DNS-to-data center connectivity) and the control plane (BGP advertisements/withdrawals) without proper validation in place for related changes.

4. Problematic procedures in the service restoration phase, including lack of out-of-band management network access for remediation of in-band issues.

While all four factors are important, the third is the actual point where a Facebook-internal issue cascaded to the entire Internet. We therefore argue that BGP change validation and informed network configuration via real-time monitoring and verification are critical in today’s complex operational network environments: evaluating the impact of a (manual or automated) change as soon as possible and notifying the affected party about it can save hours of remediation and avoid extensive service outages and damages. This is exactly what we are striving for at Code BGP. In a follow-up post next week we will elaborate further on BGP change validation.
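To give a flavor of the idea before that post, here is a minimal sketch of pre-change validation. This is our own illustration of the concept, not Code BGP’s product: before committing a set of withdrawals, simulate them against the current announcement state and refuse any change that would leave a critical prefix with no remaining announcing site.

```python
# Minimal sketch of pre-change BGP validation (our illustration of the
# concept, not Code BGP's product): simulate a set of withdrawals and
# report any critical prefix the change would make unreachable.

def validate_change(announcements: dict[str, set[str]],
                    withdrawals: list[tuple[str, str]],
                    critical: set[str]) -> list[str]:
    """announcements: prefix -> set of sites currently announcing it.
    withdrawals: (prefix, site) pairs the change would remove.
    Returns the critical prefixes left with no announcing site."""
    after = {p: set(sites) for p, sites in announcements.items()}
    for prefix, site in withdrawals:
        after.get(prefix, set()).discard(site)
    return [p for p in critical if not after.get(p)]

state = {"129.134.30.0/24": {"ams", "sfo"}}
critical = {"129.134.30.0/24"}

# Draining one site is safe; draining both should be blocked.
print(validate_change(state, [("129.134.30.0/24", "sfo")], critical))  # []
print(validate_change(state, [("129.134.30.0/24", "sfo"),
                              ("129.134.30.0/24", "ams")], critical))
# -> ['129.134.30.0/24']: the change would take the DNS prefix off the Internet
```

A production validator would of course reason over full BGP feeds, AS paths, and policy, but even this toy check would have flagged a change that withdraws every announcement of a DNS prefix at once.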

In conclusion, managing complex interconnected networks at scale is a hard task. Even when automation mechanisms are in place, like in the Facebook case, misconfigurations can cause outages and propagate to the entire Internet if not properly verified and examined. 

What are we working on at Code BGP?

Code BGP provides innovative SaaS solutions for real-time BGP monitoring, data analysis, alerting, and automation. Our cloud-based platform collects an organization’s BGP data from multiple public and private data sources and makes it accessible in real time. Data-driven APIs allow organizations to query and consume BGP data, as well as inferred information about their neighbors and policies, updated asynchronously and automatically. On top of the BGP data platform, we provide immediate alerting about hijacks and other misconfiguration events, coupled with mechanisms for BGP change validation. Let’s make the Internet more robust for everyone! Follow us on LinkedIn and Twitter. If you would like to test our pilot, please message us at info[at]codebgp[dot]com! We’d love to hear from you.

References

[1] BGPLay by RIPEstat. https://stat.ripe.net/widget/bgplay 

[2] Calder, M., Flavel, A., Katz-Bassett, E., Mahajan, R. and Padhye, J., “Analyzing the Performance of an Anycast CDN”. In IMC (2015)

[3] Celso, M., & Strickx, T. “Understanding How Facebook Disappeared from the Internet.” The Cloudflare Blog. https://blog.cloudflare.com/october-2021-facebook-outage/

[4] Cloudflare. “What is DNS? | How DNS works.” https://www.cloudflare.com/learning/dns/what-is-dns/ 

[5] Davies A. “Facebook Down and Out in BGPlay.” RIPE Labs. https://labs.ripe.net/author/alun_davies/facebook-down-and-out-in-bgplay/ 

[6] Glenn, A. “When giants fall there is always an aftershock.” https://radar.qrator.net/blog/giants_fall_aftershock

[7] Janardhan S. “Update about the October 4th outage.” Facebook Engineering. https://engineering.fb.com/2021/10/04/networking-traffic/outage/ 

[8] Janardhan S. “More details about the October 4 outage.” Facebook Engineering. https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/ 

[9] Krebs, B. “What Happened to Facebook, Instagram, & WhatsApp?” https://krebsonsecurity.com/2021/10/what-happened-to-facebook-instagram-whatsapp/

[10] Madory D. “Facebook suffers global outage.” https://www.kentik.com/analysis/facebook-suffers-global-outage/ 

[11] Madory D. “Facebook’s historic outage, explained.” https://www.kentik.com/blog/facebooks-historic-outage-explained/ 

[12] Medina A. “Facebook Outage Analysis.” ThousandEyes. https://www.thousandeyes.com/blog/facebook-outage-analysis 

[13] Mitseva, A., Panchenko, A., & Engel, T. (2018). “The state of affairs in BGP security: A survey of attacks and defenses.” Computer Communications, 124, 45-60.

[14] Rekhter, Y., Li, T., & Hares, S. “A border gateway protocol 4 (BGP-4).” No. RFC 4271. 2005

[15] RIPEstat. https://stat.ripe.net/