
[Clue to the root cause of the outage]: The problem is related to BGP routing and was most likely caused by a configuration error.
[Clue to the long duration of the impact]: At the time, Facebook's office network could not reach the external network. The impact very likely lasted as long as 6 hours because engineers could not log in to the servers to fix the production problem.
[Downtime period]: 2021-10-04 23:39~2021-10-05 06:45

On Monday, an outage took down Facebook, Instagram, WhatsApp, and Oculus, reaching every corner of Mark Zuckerberg’s empire. It is a social media blackout best described as "complete," and it looks difficult to resolve.

Facebook itself has not yet confirmed the root cause of the outage, but clues abound on the Internet. All of the company's applications disappeared from the Internet at 11:40 am Eastern Time, the moment its DNS records became unreachable. DNS is often referred to as the phone book of the Internet; it translates the host names you type into a URL bar (such as facebook.com) into IP addresses, which are where those sites actually live.
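The phone-book lookup described above can be tried with Python's standard library. This is only a minimal sketch: `resolve` is a hypothetical helper name, and it asks whatever resolver the operating system is configured with, so the result depends on the machine and network it runs on. During an outage like this one, a lookup of this kind would be expected to come back empty.

```python
import socket


def resolve(hostname: str) -> list[str]:
    """Return the sorted, de-duplicated IP addresses for hostname, or [] if the lookup fails."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        # The resolver could not produce an address for this name.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    for name in ("facebook.com", "127.0.0.1"):
        addresses = resolve(name)
        print(name, "->", addresses if addresses else "no answer")
```

A numeric address such as `127.0.0.1` is returned as-is without any network lookup, which makes it a convenient sanity check.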

DNS mishaps are common enough that, when in doubt, they are the first suspect when a given site goes down. They can happen for all kinds of quirky technical reasons, usually related to configuration issues, and are relatively simple to resolve. This time, however, something more serious seems to have happened.

Troy Mursch, chief research officer of the cyber threat intelligence company Bad Packets, said, “Facebook’s downtime seems to be caused by DNS; however, this is just the symptom.” The underlying problem, Mursch said (and other experts agree), is that Facebook withdrew the so-called Border Gateway Protocol (BGP) routes that contained the IP addresses of its DNS servers. If DNS is the phone book of the Internet, then BGP is its navigation system; it determines the route that data takes when traveling on the information highway.

"You can think of it as a game of telephone, but instead of people playing, it's smaller networks letting each other know how to reach them. They announce this route to their neighbors, and their neighbors will spread it to their neighbors," said Angelique Medina, director of product marketing at Cisco ThousandEyes, a network monitoring company.

That is a lot of jargon, but it is easy to put plainly: Facebook has disappeared from the Internet's map. If you try to ping those IP addresses now, as Mursch said, "These packets end up in a black hole."
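To illustrate why packets end up in a black hole, here is a minimal sketch of how a router might pick a route by longest-prefix match. The prefixes come from address ranges reserved for documentation and the next-hop names are made up; none of them are Facebook's real values. `lookup` is a hypothetical helper, not any router's actual API.

```python
import ipaddress


def lookup(routes, destination):
    """Return the next hop of the most specific matching prefix, or None if nothing matches."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in routes.items() if addr in net]
    if not matches:
        return None  # no route: the packet is dropped
    # The longest prefix (largest prefixlen) is the most specific route.
    return max(matches, key=lambda match: match[0].prefixlen)[1]


if __name__ == "__main__":
    # Hypothetical routing table built from documentation-only prefixes.
    routes = {
        ipaddress.ip_network("192.0.2.0/24"): "peer-1",
        ipaddress.ip_network("192.0.2.128/25"): "peer-2",
    }
    print(lookup(routes, "192.0.2.53"))

    # Simulate a withdrawal: the prefixes vanish from the table.
    routes.clear()
    print(lookup(routes, "192.0.2.53"))
```

Once the prefixes are gone from the table, the lookup has nothing to return and the router has nowhere to send the packet.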


The map shows Facebook unreachable because of DNS resolution failures. Courtesy of Cisco ThousandEyes
https://www.thousandeyes.com/outages/

The obvious but still unanswered question is why these BGP routes disappeared in the first place. This is not a common problem, especially at this scale or duration. During the outage, Facebook said nothing beyond a tweet saying that "it is working hard to get things back to normal as soon as possible." After service gradually resumed late Monday afternoon, Facebook issued a statement that still lacked any technical details. The company said, "To everyone affected by the disruption of our platform today: We are sorry! We know that billions of people and businesses around the world rely on our products and services to stay in touch. We thank you for your patience."

Internet infrastructure experts who spoke to WIRED said that the most likely answer is a misconfiguration on Facebook's part. John Graham-Cumming, chief technology officer of Internet infrastructure company Cloudflare, said: “It looks like Facebook did something to their routers, which connect the Facebook network to the Internet.” He emphasized that he does not know the details of what happened. After all, he said, the Internet is essentially a network of networks, and each network advertises its existence to the others. This time, Facebook stopped advertising.

This also means that it is not just Facebook's external services that are affected. For example, you cannot use Facebook login on third-party websites. And because the company’s own internal network cannot reach the external Internet, its employees are reportedly unable to work today. (Instagram head Adam Mosseri even said on Twitter, "It feels like a snow day.")

This also explains why it is taking so long to resume operation. In 2019, a Google Cloud outage prevented Google engineers from getting online to fix that very outage. It seems at least plausible that Facebook has fallen into a similar catch-22, unable to reach the Internet to fix the BGP routing problem.

Medina said the good news is that once Facebook fixes whatever configuration is broken, it should be back in business soon. "When it is corrected, traffic will really start to flow."

At the same time, the rest of the Internet has also felt Facebook's outage. More specifically, DNS resolvers (services that convert domain names into IP addresses) like Cloudflare are seeing twice their normal traffic because people keep trying to load Facebook, Instagram, and WhatsApp. These requests are not enough to overwhelm the entire system, but the surge is a reminder that the Internet is interdependent and sometimes fragile.

Quoted from "Why Facebook, Instagram, and WhatsApp All Went Down Today" (WIRED)


誉儿