Author | Bai Yu

While we were enjoying the National Day holiday, a major "accident" hit the Internet on the other side of the ocean: Facebook, together with its Instagram and WhatsApp apps, went down across the entire network. The outage lasted about 7 hours and 5 minutes, during which browsers trying to open the sites displayed a DNS error. This was a huge loss for Facebook, whose family of apps has 3.51 billion monthly and 2.76 billion daily active users. According to estimates by investment institutions, the 7-hour outage cost more than US$968 million. It also directly wiped US$64.3 billion off Facebook's market value, and the net worth of its founder, Mark Zuckerberg, shrank by US$7 billion.


Facebook stated that the root cause of the failure was a problem during routine maintenance work. A configuration change on the backbone routers that coordinate network traffic between its data centers caused problems with its DNS servers and took down internal tools and systems. Operations staff could not remotely access the equipment to restore the network, so they had to physically enter data centers protected by strict procedures and security measures and restart the equipment manually. As a result, the MTTR (mean time to repair) was severely prolonged.

In short, a bad command, a flawed audit tool, a DNS design that hindered network recovery, and cumbersome data center access procedures together led to Facebook's major 7-hour outage.

Specifically, operations staff were performing maintenance on part of the backbone network. As part of this routine maintenance, a command was issued to assess the availability of global backbone capacity, but it inadvertently took down every connection in the backbone network and cut off Facebook's data centers around the world. Meanwhile, Facebook's architecture is designed to scale DNS service up or down based on server availability: when server availability drops to zero because of a network failure, all DNS servers are taken out of service. This automatic response to the backbone collapse appears to be what paralyzed DNS. The deactivation is carried out by Facebook's DNS name servers sending messages to Border Gateway Protocol (BGP) routers on the Internet, which store information about the routes used to reach specific IP addresses. These routes are normally advertised to routers so that they know how to direct traffic correctly.

The BGP messages sent by Facebook's DNS servers withdrew those route advertisements, so traffic could no longer be routed to anything on Facebook's backbone network. The end result was that even though the DNS servers were still running, they could not be reached, and users lost service because the network they were trying to reach had effectively disappeared. Worse still, the same DNS service is used not only for customer-facing websites but also for Facebook's own internal tools and systems.
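
The automatic withdrawal described above can be pictured with a small toy model. This is only an illustrative sketch, not Facebook's actual implementation: the `DnsNode` and `RoutingTable` classes, the `can_reach_datacenter` flag, and the 192.0.2.0/24 documentation prefix are all assumptions made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class DnsNode:
    """Hypothetical DNS server that announces its prefix only while healthy."""
    name: str
    prefix: str
    can_reach_datacenter: bool = True


@dataclass
class RoutingTable:
    """Toy stand-in for the routes that Internet routers learned via BGP."""
    routes: dict = field(default_factory=dict)  # prefix -> set of announcing nodes

    def announce(self, prefix, node_name):
        self.routes.setdefault(prefix, set()).add(node_name)

    def withdraw(self, prefix, node_name):
        announcers = self.routes.get(prefix, set())
        announcers.discard(node_name)
        if not announcers:
            # Nobody announces the prefix any more, so the route disappears.
            self.routes.pop(prefix, None)

    def is_reachable(self, prefix):
        return prefix in self.routes


def run_health_checks(nodes, table):
    """Each node keeps or withdraws its announcement based on its own health."""
    for node in nodes:
        if node.can_reach_datacenter:
            table.announce(node.prefix, node.name)
        else:
            table.withdraw(node.prefix, node.name)
```

In this model, if every node loses contact with the data centers at the same time, every announcement is withdrawn and the prefix becomes unreachable even though the DNS processes themselves are still running.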

As we can see, DNS played a central role in this incident. So what is DNS? DNS stands for Domain Name System, a distributed database that maps domain names and IP addresses to each other. Simply put, DNS resolves domain names. Under normal circumstances, every Internet access request from a user is directed to the matching IP address through DNS resolution before the request can complete. As an application layer protocol, DNS mainly serves other application layer protocols, including but not limited to HTTP, SMTP, and FTP, by resolving the host name supplied by the user into an IP address. The process is as follows:

(1) A DNS client runs on the user's host (PC or mobile phone);
(2) The browser extracts the host name from the URL it receives, for example www.aliyun.com from http://www.aliyun.com/ , and passes it to the DNS client;
(3) The DNS client sends a query message containing the host name to a DNS server (behind this step are a series of cache lookups and the work of distributed DNS clusters);
(4) The DNS client eventually receives a reply message containing the IP address that corresponds to the host name;
(5) Once the browser receives the IP address from DNS, it can open a TCP connection to the HTTP server at that IP address.
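
The steps above can be sketched with Python's standard library. This is a minimal illustration rather than how any particular browser works: the helper names `extract_hostname` and `resolve` are made up for this example, and the `lookup` parameter exists only so that the system resolver can be swapped out for testing.

```python
import socket
from urllib.parse import urlsplit


def extract_hostname(url):
    """Step 2: pull the host name out of a URL."""
    hostname = urlsplit(url).hostname
    if hostname is None:
        raise ValueError(f"no host name in URL: {url!r}")
    return hostname


def resolve(hostname, lookup=socket.getaddrinfo):
    """Steps 3-4: ask the resolver and collect the IP addresses it returns."""
    results = lookup(hostname, None, proto=socket.IPPROTO_TCP)
    addresses = []
    # Each result is (family, type, proto, canonname, sockaddr);
    # the first element of sockaddr is the IP address as a string.
    for *_, sockaddr in results:
        ip = sockaddr[0]
        if ip not in addresses:
            addresses.append(ip)
    return addresses


# Example usage (needs network access, so it is left commented out):
# host = extract_hostname("http://www.aliyun.com/")
# print(host, resolve(host))
```

When DNS is unreachable, as it was during the outage, `socket.getaddrinfo` raises `socket.gaierror` at step 3 and the browser never gets as far as opening the TCP connection in step 5.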

Facebook's outage lasted nearly 7 hours and affected approximately 85 million users, the worst since 2008. Looking back on this failure as bystanders, we find a critical problem. As far as we know, users kept reporting that the websites and apps of all four major social platforms returned server errors and could not be refreshed. Facebook was almost completely offline in Europe, America, and Oceania, and it was also inaccessible in Asian countries such as Japan, South Korea, and India, affecting users in dozens of countries and regions around the world. Yet Facebook does not seem to have detected these problems right away; they came to light only after feedback from users in many countries and regions.

Even a company as large as Facebook failed to detect a DNS failure right away and suffered serious economic losses. Faced with this kind of failure, how can we detect problems and monitor the health of our product and its DNS as early as possible, and keep track of how users in different countries and regions around the world are experiencing the service?

Among all kinds of APM products, non-intrusive cloud dial testing is the best solution. Alibaba Cloud dial test uses 1000+ monitoring points around the world, including real-user monitoring points, to send network requests to the target domain name 24 hours a day, helping users monitor the availability and resolution performance of their DNS services. DNS dial testing supports specifying the query mode (recursive or iterative) and the resolving server; through flexible parameter configuration, it simulates real user access as closely as possible.
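
To make the query mode and resolving server parameters concrete, here is a rough sketch of a single DNS probe built on Python's standard library. It is not the Alibaba Cloud dial test implementation; `build_query`, `probe`, and the fixed `QUERY_ID` are assumptions for this example. The only protocol facts relied on are the DNS header layout and the "recursion desired" (RD) bit from RFC 1035.

```python
import socket
import struct
import time

RD_FLAG = 0x0100   # "recursion desired" bit in the DNS header flags (RFC 1035)
QUERY_ID = 0x1234  # arbitrary fixed ID, enough for a single-shot sketch


def build_query(hostname, recursion_desired=True):
    """Build a minimal DNS query for the A record of hostname."""
    flags = RD_FLAG if recursion_desired else 0
    # Header fields: ID, flags, QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack("!HHHHHH", QUERY_ID, flags, 1, 0, 0, 0)
    # QNAME is a sequence of length-prefixed labels ending with a zero byte
    labels = hostname.rstrip(".").split(".")
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def probe(hostname, server, recursion_desired=True, timeout=2.0):
    """Send one query over UDP to server (IPv4) and return (succeeded, elapsed_ms)."""
    query = build_query(hostname, recursion_desired)
    start = time.monotonic()
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            sock.sendto(query, (server, 53))
            reply, _ = sock.recvfrom(512)
    except OSError:  # includes timeouts and unreachable-network errors
        return False, (time.monotonic() - start) * 1000
    elapsed_ms = (time.monotonic() - start) * 1000
    if len(reply) < 4:
        return False, elapsed_ms
    reply_id, reply_flags = struct.unpack("!HH", reply[:4])
    # Count it as a success when the ID matches and RCODE (low 4 bits) is 0
    return reply_id == QUERY_ID and (reply_flags & 0x000F) == 0, elapsed_ms
```

With `recursion_desired=False` the server is only asked for what it already knows; a full iterative resolution would go on to follow the referrals it returns, which this sketch does not do. Calling `probe("www.aliyun.com", server)` with the IPv4 address of a resolver you are allowed to query yields one availability and latency sample.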


After each scheduled dial test task, Alibaba Cloud dial test generates reports on DNS resolution time by region and lists the details of each DNS request, including the A record address, DNS time, and the DNS resolution process. This helps users quickly analyze and locate DNS resolution problems.
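
The idea of a per-region report can be sketched as a simple aggregation over probe samples. The tuple layout `(region, succeeded, dns_ms)` and the function name `summarize_by_region` are assumptions for this example and do not reflect the product's actual report format.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_region(results):
    """Aggregate (region, succeeded, dns_ms) samples into per-region statistics."""
    grouped = defaultdict(list)
    for region, succeeded, dns_ms in results:
        grouped[region].append((succeeded, dns_ms))

    report = {}
    for region, samples in grouped.items():
        # Only successful lookups contribute to the average resolution time.
        ok_times = [ms for succeeded, ms in samples if succeeded]
        report[region] = {
            "availability": len(ok_times) / len(samples),
            "avg_dns_ms": mean(ok_times) if ok_times else None,
        }
    return report
```

A region where every probe fails shows an availability of 0 and no average time, which is roughly the picture an outage like Facebook's would produce from every monitoring point at once.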

In addition, by configuring DNS alarms for availability and resolution performance issues, you can buy time to repair problems before users notice and report them, improving user satisfaction and reducing economic losses.
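
An alarm rule of this kind boils down to comparing measured values against thresholds. The sketch below assumes a per-region report shaped as `{"availability": float, "avg_dns_ms": float or None}`; the function name `evaluate_alarms` and the default thresholds (99% availability, 200 ms) are hypothetical values chosen for illustration, not product defaults.

```python
def evaluate_alarms(report, min_availability=0.99, max_avg_dns_ms=200.0):
    """Return a list of alarm messages for regions that violate a threshold."""
    alarms = []
    for region, stats in sorted(report.items()):
        availability = stats["availability"]
        if availability < min_availability:
            alarms.append(
                f"{region}: DNS availability {availability:.0%} "
                f"is below {min_availability:.0%}"
            )
        avg = stats["avg_dns_ms"]
        # avg is None when no lookup succeeded, so there is no latency to check.
        if avg is not None and avg > max_avg_dns_ms:
            alarms.append(
                f"{region}: average DNS time {avg:.0f} ms "
                f"exceeds {max_avg_dns_ms:.0f} ms"
            )
    return alarms
```

In practice the thresholds would be tuned per service, and the returned messages would be handed to whatever notification channel the team already uses.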


To avoid similar problems, start using cloud dial test!

Part of the content is quoted from
1. "Europe 丨 Facebook has the worst downtime in history, burning 6 billion in 7 hours"
https://www.163.com/dy/article/GLI6PFA70552C1I4.html
2. "Causes of Facebook's Major Failure"
https://baijiahao.baidu.com/s?id=1712926610001333324&wfr=spider&for=pc


Alibaba Cloud Native