Original: [https://blog.cloudflare.com/october-2021-facebook-outage/]
Translation: Timing

image.png

"Facebook can't be down, can it?", we thought for a few minutes

Today, October 4, 2021, at 15:51 UTC, we opened an internal incident titled "Facebook DNS lookup returning SERVFAIL" because we were worried that something was wrong with our DNS resolver 1.1.1.1. But as we were about to post on our [Public Status] page, we realized that something more serious was going on.

Social media quickly lit up with reports of what our engineers had also confirmed: Facebook and its affiliated services WhatsApp and Instagram were all down. Their DNS names stopped resolving, and their infrastructure IPs were unreachable. It was as if someone had "pulled the network cables" from their data centers all at once and disconnected them from the Internet.

How could this happen?

Meet BGP

BGP stands for Border Gateway Protocol. It is a mechanism for exchanging routing information between autonomous systems (AS) on the Internet. The big routers that make the Internet work keep huge, constantly updated lists of the possible routes that can be used to deliver every network packet to its final destination. Without BGP, Internet routers would not know what to do, and the Internet would not work.

The Internet is literally a network of networks, and it is bound together by BGP. BGP allows one network (Facebook, in this case) to advertise its presence to the other networks that make up the Internet. As we write, Facebook is not advertising its presence, so ISPs and other networks cannot find Facebook's network, and it is unavailable.

Each individual network has an ASN: an Autonomous System Number. An Autonomous System (AS) is an individual network with a unified internal routing policy. An AS can originate prefixes (saying that it controls a group of IP addresses), and it can also transit prefixes (saying that it knows how to reach a specific group of IP addresses).

Cloudflare's ASN is AS13335. Every AS needs to announce its prefix routes to the Internet using BGP; otherwise, no one will know how to connect to us or where to find us.

Our [Learning Center] has good information on how [BGP] and [ASN] work.

In this simplified diagram, you can see six autonomous systems on the Internet and two possible routes that a packet can take from start to end. AS1->AS2->AS3 is the fastest, and AS1->AS6->AS5->AS4->AS3 is the slowest, but it can be used if the first path fails.
image.png
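As a rough illustration of the diagram above, the Python sketch below models the six autonomous systems as a graph and picks the path with the fewest AS hops, falling back to the longer path when AS2 is unavailable. The link layout and the function name `shortest_as_path` are assumptions made for this example, and hop count is only a stand-in for "fastest": real BGP route selection weighs several other attributes as well.

```python
from collections import deque

# Hypothetical adjacency list mirroring the simplified six-AS diagram.
# Only the two paths named in the text are modeled.
LINKS = {
    "AS1": ["AS2", "AS6"],
    "AS2": ["AS1", "AS3"],
    "AS3": ["AS2", "AS4"],
    "AS4": ["AS3", "AS5"],
    "AS5": ["AS4", "AS6"],
    "AS6": ["AS1", "AS5"],
}


def shortest_as_path(links, start, end, down=frozenset()):
    """Breadth-first search for the path with the fewest AS hops.

    Any AS listed in `down` is treated as unreachable and skipped.
    Returns a list of AS names, or None if no path exists.
    """
    if start in down or end in down:
        return None
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == end:
            return path
        for neighbor in links.get(node, []):
            if neighbor not in seen and neighbor not in down:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


print(shortest_as_path(LINKS, "AS1", "AS3"))
# ['AS1', 'AS2', 'AS3']
print(shortest_as_path(LINKS, "AS1", "AS3", down={"AS2"}))
# ['AS1', 'AS6', 'AS5', 'AS4', 'AS3']
```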
At 15:58 UTC we noticed that Facebook had stopped announcing the routes to its DNS prefixes. That meant that, at the very least, Facebook's DNS servers were unavailable. Because of this, Cloudflare's 1.1.1.1 DNS resolver could no longer answer queries asking for the IP address of facebook.com or instagram.com.
route-views>show ip bgp 185.89.218.0/23
% Network not in table
route-views>

route-views>show ip bgp 129.134.30.0/23
% Network not in table
route-views>
Meanwhile, other Facebook IP addresses remained routed, but they were not particularly useful: without DNS, Facebook and its related services were effectively unavailable:
route-views>show ip bgp 129.134.30.0
BGP routing table entry for 129.134.0.0/17, version 1025798334
Paths: (24 available, best #14, table default)
Not advertised to any peer
Refresh Epoch 2
3303 6453 32934

217.192.89.50 from 217.192.89.50 (138.187.128.158)
  Origin IGP, localpref 100, valid, external
  Community: 3303:1004 3303:1006 3303:3075 6453:3000 6453:3400 6453:3402
  path 7FE1408ED9C8 RPKI State not found
  rx pathid: 0, tx pathid: 0

Refresh Epoch 1
route-views>
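The two kinds of lookup above behave differently: a query that includes the prefix length (`129.134.30.0/23`) asks for that exact prefix, while a query for a bare address is answered with the most specific prefix that still covers it. A minimal Python sketch of that longest-prefix match, using the prefixes from the output above, might look like this. The "before" table is an assumption about what was announced prior to the withdrawal, and `longest_prefix_match` is a hypothetical helper, not part of any router software.

```python
import ipaddress


def longest_prefix_match(table, address):
    """Return the most specific network in `table` containing `address`, or None."""
    addr = ipaddress.ip_address(address)
    matches = [net for net in table if addr in net]
    return max(matches, key=lambda net: net.prefixlen, default=None)


# Assumed state before the withdrawal: the covering /17 plus the two DNS /23s.
before = [
    ipaddress.ip_network(p)
    for p in ("129.134.0.0/17", "129.134.30.0/23", "185.89.218.0/23")
]
# State after the withdrawal, as seen in the route-views output: only the /17.
after = [ipaddress.ip_network("129.134.0.0/17")]

print(longest_prefix_match(before, "129.134.30.0"))  # 129.134.30.0/23
print(longest_prefix_match(after, "129.134.30.0"))   # 129.134.0.0/17
print(longest_prefix_match(after, "185.89.218.0"))   # None
```

The address is still covered by the /17, which is why it shows up as routed even though the more specific DNS prefix is gone.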

We keep track of all the BGP updates and announcements we see across our global network. At our scale, the data we collect gives us a view of how the Internet is connected and where traffic is meant to flow from and to around the world.

A BGP UPDATE message tells a router about any change made to a prefix advertisement, or withdraws the prefix entirely. When we check our time-series BGP database, we can clearly see this in the number of updates we received from Facebook. Normally this chart is fairly quiet: Facebook does not make many changes to its network from minute to minute.

But at around 15:40 UTC we saw a spike in routing changes from Facebook. That is when the trouble started.
image.png
If we split this view into route announcements and withdrawals, we get an even clearer picture of what happened. Routes were withdrawn, Facebook's DNS servers went offline, and one minute after the problem occurred, Cloudflare engineers were in a room trying to work out why 1.1.1.1 could not resolve facebook.com, and worrying that it was a fault in our own systems.
image.png
With those withdrawals, Facebook and its sites were effectively disconnected from the Internet.

DNS is affected

As a direct consequence, DNS resolvers all over the world stopped resolving Facebook's domain names.
➜ ~ dig @1.1.1.1 facebook.com
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 31322
;facebook.com. IN A
➜ ~ dig @1.1.1.1 whatsapp.com
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 31322
;whatsapp.com. IN A
➜ ~ dig @8.8.8.8 facebook.com
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 31322
;facebook.com. IN A
➜ ~ dig @8.8.8.8 whatsapp.com
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 31322
;whatsapp.com. IN A
This happens because DNS, like many other systems on the Internet, has its own routing mechanism. When someone types https://facebook.com into a browser, the DNS resolver, which is responsible for translating domain names into the IP addresses to connect to, first checks whether it has an answer in its cache and, if so, uses it. If not, it tries to fetch the answer from the domain's nameservers, which are typically hosted by the entity that owns the domain.

If the nameservers are unreachable or fail to respond for some other reason, a SERVFAIL is returned, and the browser shows an error to the user.
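The lookup flow described above (check the cache, otherwise ask the domain's nameservers, and answer SERVFAIL if they cannot be reached) can be sketched in a few lines of Python. This is a toy model rather than a real resolver: `ToyResolver`, `fake_nameserver`, the address and the TTL are all invented for illustration, and details such as caching of SERVFAIL responses are left out.

```python
import time

SERVFAIL = "SERVFAIL"


class ToyResolver:
    """Illustrative cache-then-upstream lookup; not a real DNS implementation."""

    def __init__(self, query_authoritative):
        # query_authoritative is a hypothetical callable: name -> (ip, ttl_seconds).
        # It raises ConnectionError when the nameserver cannot be reached.
        self._query = query_authoritative
        self._cache = {}  # name -> (ip, expiry timestamp)

    def resolve(self, name, now=None):
        now = time.monotonic() if now is None else now
        cached = self._cache.get(name)
        if cached and cached[1] > now:
            return cached[0]  # fresh cache entry: answer without going upstream
        try:
            ip, ttl = self._query(name)
        except ConnectionError:
            return SERVFAIL  # nameserver unreachable, e.g. its routes were withdrawn
        self._cache[name] = (ip, now + ttl)
        return ip


reachable = {"up": True}


def fake_nameserver(name):
    """Stand-in for an authoritative nameserver (made-up answer and TTL)."""
    if not reachable["up"]:
        raise ConnectionError("no route to nameserver")
    return "192.0.2.10", 60  # documentation-range address, 60-second TTL


resolver = ToyResolver(fake_nameserver)
print(resolver.resolve("facebook.com", now=0))   # 192.0.2.10 (fetched upstream)
reachable["up"] = False                          # simulate the route withdrawal
print(resolver.resolve("facebook.com", now=30))  # 192.0.2.10 (still cached)
print(resolver.resolve("facebook.com", now=61))  # SERVFAIL (TTL expired)
```

In this model the outage only becomes visible to a given client once the cached answer expires, which is one reason failures spread gradually rather than instantly.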

Again, our Learning Center provides a good [explanation] of how DNS works.
image.png
Because Facebook had stopped announcing its DNS prefix routes through BGP, our DNS resolvers and everyone else's had no way to reach its nameservers. As a result, 1.1.1.1, 8.8.8.8 and other major public DNS resolvers started issuing (and caching) SERVFAIL responses.

But that is not all. Human behavior and application logic now kick in and cause a further exponential effect: a tsunami of additional DNS traffic follows.

This happened partly because apps will not take an error for an answer and start retrying, and partly because end users will not take an error for an answer either: they start reloading pages, or killing and relaunching their apps, which also generates a large number of requests.

This is the increase in traffic that we saw on 1.1.1.1:
image.png

So now, because Facebook and its sites are so big, DNS resolvers worldwide were handling 30 times more queries than usual, potentially causing latency and timeout issues for other platforms.

Fortunately, 1.1.1.1 was built to be free, fast (as the independent DNS monitor DNSPerf can attest), and scalable, and we were able to keep serving our users with minimal impact.

The vast majority of our DNS requests kept resolving in under 10 ms. At the same time, a small fraction of requests, visible in the p95 and p99 percentiles, saw increased response times, probably because expired TTLs forced a fresh query to Facebook's nameservers, which then timed out. The 10-second DNS timeout limit is well known among engineers.
image.png
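To make the percentile remark concrete, here is a small Python sketch with entirely made-up latencies: most queries are answered in 5 ms, a few need a slower upstream lookup, and a handful run into the 10-second timeout. It is only meant to show how the median can stay flat while p95 and p99 climb; the numbers are not Cloudflare measurements.

```python
import statistics

# Hypothetical latencies in milliseconds (invented for illustration):
# 930 fast answers, 50 slower upstream lookups, 20 that hit a 10-second timeout.
latencies_ms = [5.0] * 930 + [200.0] * 50 + [10_000.0] * 20

# quantiles(n=100) returns the 99 cut points between percentiles 1..99.
cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
p50, p95, p99 = cuts[49], cuts[94], cuts[98]

print(f"p50={p50} ms, p95={p95} ms, p99={p99} ms")
# p50=5.0 ms, p95=200.0 ms, p99=10000.0 ms
```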

Impact on other services

People look for alternatives and want to know what is going on. When Facebook became unreachable, we started seeing increased DNS queries to Twitter, Signal and other news and social media platforms.
image.png
We can also see another side effect of this unreachability in our WARP traffic to and from Facebook's affected ASN 32934. This chart shows how traffic changed in each country between 15:45 UTC and 16:45 UTC compared with three hours earlier. All over the world, WARP traffic to and from Facebook's network simply disappeared.
image.png

The Internet

Today's events are a gentle reminder that the Internet is a very complex and interdependent system made up of millions of systems and protocols working together. Trust, standardization, and cooperation between entities are what make it work for almost five billion active users worldwide.

Update

At around 21:00 UTC we saw renewed BGP activity from Facebook's network, which peaked at 21:17 UTC.
image.png

This chart shows the availability of the DNS name facebook.com on Cloudflare's DNS resolver 1.1.1.1. It stopped being available at around 15:50 UTC and returned at 21:20 UTC.
image.png

Facebook, WhatsApp and Instagram will undoubtedly need more time to come fully back online, but as of 21:28 UTC Facebook appears to be reconnected to the global Internet, with DNS working again.


This article is from Zhu Kunrong's WeChat public account "Malt Bread", the public account id "darkjune_think"

Developer/Science Fiction Enthusiast/Hardcore Console Gamer/Amateur Translator
Please credit the source when reprinting.

Weibo: Zhu Kunrong
Bilibili: https://space.bilibili.com/23185593/

Communication Email: zhukunrong@yeah.net

