Abstract: At the 7th Global Software Conference, Huawei software engineer Du Zhigang shared the high-availability assurance solution behind Huawei Cloud's official website and analyzed in depth the solutions and engineering practices that let the website recover quickly in various extreme disaster scenarios.
This article is shared from the HUAWEI CLOUD Community post "What Happened Behind the Website Access Failure? Huawei Engineers Teach You How to Respond Quickly [Global Software Technology Conference Technology Sharing]"; original author: Technical Torchbearer.
Recently, a failure at a CDN service left a large number of well-known overseas news websites inaccessible or unable to load properly, which caused quite a stir. As more and more businesses move to the cloud, whether a website or service can stay continuously online is a real test of the high-availability, high-reliability design behind it.
At the 7th Global Software Conference, Huawei software engineer Du Zhigang shared with developers the high-availability assurance plan of the Huawei Cloud official website, with an in-depth look at the engineering practices for recovering the website quickly in various extreme, large-scale disaster scenarios.
An unreliable website means immeasurable losses
From the website owner's perspective, unavailability directly hits revenue. This is especially true for e-commerce websites, where transactions happen every second and any interruption in access causes obvious economic loss. From the customer's perspective, an inaccessible website simply feels unreliable, which damages the reputation of and trust in both the website and the corporate brand behind it, sometimes irreparably.
Looking at the major Internet outages of the past decade, the large-scale impact of DNS and CDN failures is still fresh in memory, and regional and global failures of other IT infrastructure have also had a significant impact.
The website availability indicators widely used in the industry are website unavailability time and annual website availability rate. Different types of websites and applications have different availability requirements.
Website unavailability time (downtime) = time of recovery - time of failure. Annual website availability rate (Yearly Uptime Percentage) = (1 - website unavailability time / total time in the year) * 100%.
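As a quick worked example of the two formulas, the sketch below computes the downtime of a single outage and the resulting yearly uptime percentage. The timestamps are made up for illustration, and a 365-day year is assumed.

```python
from datetime import datetime, timedelta


def downtime(failed_at: datetime, recovered_at: datetime) -> timedelta:
    """Unavailability time = time of recovery - time of failure."""
    return recovered_at - failed_at


def yearly_uptime_percentage(total_downtime: timedelta, days_in_year: int = 365) -> float:
    """Yearly Uptime Percentage = (1 - downtime / total time in the year) * 100."""
    year = timedelta(days=days_in_year)
    return (1 - total_downtime / year) * 100


# Hypothetical outage: 15 minutes, the failover target mentioned in this article.
outage = downtime(datetime(2021, 1, 1, 10, 0), datetime(2021, 1, 1, 10, 15))
print(outage)                                      # 0:15:00
print(round(yearly_uptime_percentage(outage), 4))  # 99.9971
```

In other words, a single 15-minute outage in a year still leaves the site above 99.99% yearly availability, while roughly 53 minutes of accumulated downtime would bring it down to 99.99%.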
As the Internet-facing portal of a cloud infrastructure provider, the Huawei Cloud official website has extremely high availability requirements. The core pages facing end users must be online 24/7. If a major failure occurs, such as a cloud service area (region) level failure or a global failure of a single cloud caused by infrastructure, an alarm must reach the responsible owner within 5 minutes and failover must complete within 15 minutes.
What happened behind the website access failure?
Let's use the figure to analyze the overall flow of a page request and its key failure points:
- ① A DNS failure usually makes the website inaccessible as a whole.
- ② A CDN failure makes the website inaccessible to users in some geographic areas.
- ③ A global failure of a single cloud makes the website inaccessible as a whole.
- ④ A cloud service area level failure makes the website inaccessible to users routed to that area.
- ⑤ A cloud service availability zone (AZ) level failure makes the website inaccessible to users routed to the faulty AZ.
- ⑥ A container cluster failure makes the website inaccessible to users routed to the corresponding container service.
- ⑦ A service node failure makes the website inaccessible to users routed to the faulty service node.
In summary, in cloud scenarios, page access faces several key technical challenges, including:
- How to deal with the overall failure of a single DNS service provider?
- How to deal with the failure of a single CDN vendor, as a whole or in multiple areas?
- How can pages remain accessible when an infrastructure failure brings down a single cloud as a whole?
- How can the duration of user impact be minimized when a single cloud service area fails?
- Page access depends on many back-end services. How to minimize failure points, reduce the overall complexity and cost of the solution, and keep the solution general and feasible?
Four solutions to easily deal with various failures of the website
In response to these challenges, we have distilled four solutions from the practice of the Huawei Cloud official website over recent years. We will walk through them one by one and show how they work in practice.
1. Overall failure of a single DNS service provider: dual DNS service provider resolution
DNS is an important weak link that has not received the attention it deserves. Commercial portals with extremely high availability requirements often rely on the DNS of a single service provider. This works as long as nothing goes wrong, but if a global failure occurs, the impact can be catastrophic.
Our current strategy is a dual-vendor DNS resolution solution. When one service provider fails partially or completely, failover happens automatically within a short time and domain name resolution is handed over to the other provider. In addition, we have built a unified operation and maintenance platform that provides unified configuration of multi-vendor domain name resolution, DNS availability monitoring, and the ability to quickly remove a faulty service.
The dual-vendor DNS configuration is shown in the figure:
The premise of this configuration is that both the domain name registrar and the DNS resolution providers support multi-vendor Name Server configuration. In terms of specific steps, first migrate the domain name registration to a registrar that supports multi-vendor NS configuration, then synchronize the resolution records configured at the existing DNS vendor to the new vendor. Finally, configure NS records pointing to both vendors' Name Servers at the domain name registration service and at the resolution services (this takes 0 to 72 hours to take effect).
With this configuration, when a single vendor's Name Server fails, the ISP's Local DNS automatically lowers the priority of the failed Name Server (the BIND SRTT algorithm applies a penalty on failure) and uses the preferred Name Server to resolve A records or CNAME records.
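To make the failure-penalty idea concrete, here is a deliberately simplified Python model of SRTT-style server selection. It is not BIND's actual implementation: the class name, the smoothing factor `alpha`, the penalty value, and the Name Server host names are all assumptions chosen for illustration.

```python
class SrttSelector:
    """Toy model: prefer the Name Server with the lowest smoothed RTT."""

    def __init__(self, alpha: float = 0.7, timeout_penalty_ms: float = 500.0):
        self.alpha = alpha                            # assumed weight of the old SRTT
        self.timeout_penalty_ms = timeout_penalty_ms  # assumed penalty per timeout
        self.srtt_ms = {}                             # server name -> smoothed RTT

    def observe_rtt(self, server: str, rtt_ms: float) -> None:
        """Blend a successful response time into the server's smoothed RTT."""
        old = self.srtt_ms.get(server)
        if old is None:
            self.srtt_ms[server] = rtt_ms
        else:
            self.srtt_ms[server] = self.alpha * old + (1 - self.alpha) * rtt_ms

    def observe_timeout(self, server: str) -> None:
        """Penalize a server that failed to answer, pushing it down the ranking."""
        self.srtt_ms[server] = self.srtt_ms.get(server, 0.0) + self.timeout_penalty_ms

    def preferred(self) -> str:
        """Return the server that would be queried first."""
        return min(self.srtt_ms, key=self.srtt_ms.get)


selector = SrttSelector()
selector.observe_rtt("ns1.vendor-a.example", 20.0)
selector.observe_rtt("ns1.vendor-b.example", 35.0)
print(selector.preferred())   # ns1.vendor-a.example
selector.observe_timeout("ns1.vendor-a.example")
print(selector.preferred())   # ns1.vendor-b.example
```

The point of the sketch is only that one timeout is enough to move queries to the other vendor's Name Server, without any manual change to the NS records.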
The rehearsal steps can be broken down as follows:
Step 1: Dual-vendor NS record configuration.
Step 2: Check that the service can be accessed normally through the browser.
Step 3: Run dial tests against the Name Servers to check their availability and to verify whether ISPs in different regions use different vendors' Name Servers for domain name resolution.
Step 4: Shut down BIND to simulate a single-vendor DNS failure.
Finally, run HTTP dial tests from multiple regions to verify that the service can still be accessed normally.
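The final check of the rehearsal can be thought of as a small aggregation over per-region probe results. The sketch below assumes a hypothetical `probe(region)` callable that returns an HTTP status code (or raises `OSError` on a network error); the real dial test service and its regions are not described in this article, so a fake probe stands in for it here.

```python
from typing import Callable, Dict, Iterable


def dial_test(regions: Iterable[str], probe: Callable[[str], int]) -> Dict[str, bool]:
    """Probe the site from each region; a region passes only on HTTP 200."""
    results = {}
    for region in regions:
        try:
            results[region] = probe(region) == 200
        except OSError:
            # Timeouts and connection errors count as a failed dial test.
            results[region] = False
    return results


def rehearsal_passed(results: Dict[str, bool]) -> bool:
    """The rehearsal passes only if every region can still reach the site."""
    return all(results.values())


def fake_probe(region: str) -> int:
    """Stand-in for a real HTTP probe launched from the given region."""
    if region == "region-c":
        raise TimeoutError("probe timed out")
    return 200


results = dial_test(["region-a", "region-b", "region-c"], fake_probe)
print(results)                    # region-c fails, the others pass
print(rehearsal_passed(results))  # False
```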
2. Regional failure of a single CDN vendor: multi-CDN service provider solution
The following describes the configuration and switching of multiple CDN vendors, as shown in the figure:
Three constraints shape this solution: the DNS protocol does not allow one name to carry CNAME records for several CDN vendors at the same time; intelligent DNS resolution does support different CNAME records per region or network; and CDNs rarely fail as a whole, while regional failures are more common.
To configure multiple CDN vendors, first set up primary and secondary CDN acceleration separately for domestic and overseas access. Then set the TTL of the CDN CNAME records to 60s, so that failover takes effect quickly when a single CDN vendor's service becomes unavailable. Finally, build a CDN management platform that integrates with the multi-vendor DNS management APIs and holds pre-configured switching and fallback strategies, so that traffic can be switched with one click when a failure occurs.
The resulting effect is clear. After CDN monitoring raises an alarm about a large-scale failure at vendor A, the CNAME for the affected area can be switched to vendor B through the CDN operation and maintenance management platform, and the change takes effect within 1 minute.
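A minimal sketch of the switching step, assuming the platform keeps one CNAME record per resolution line (domestic or overseas) with a 60s TTL. The record type, the vendor keys, and the `*.example.net` targets are hypothetical, and the function only computes the new record set; pushing it to a DNS vendor's management API is left out because that API is not described here.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class CnameRecord:
    host: str
    line: str       # resolution line, e.g. "domestic" or "overseas"
    target: str     # CDN vendor's accelerated domain name
    ttl: int = 60   # short TTL so a switch takes effect quickly


# Hypothetical CNAME targets for the two CDN vendors.
CDN_TARGETS = {
    "vendor-a": "www.example.com.cdn-a.example.net",
    "vendor-b": "www.example.com.cdn-b.example.net",
}


def switch_vendor(records: List[CnameRecord], line: str,
                  failed_vendor: str, backup_vendor: str) -> List[CnameRecord]:
    """Repoint records on one resolution line from the failed vendor to the backup."""
    failed_target = CDN_TARGETS[failed_vendor]
    backup_target = CDN_TARGETS[backup_vendor]
    return [
        replace(r, target=backup_target)
        if r.line == line and r.target == failed_target else r
        for r in records
    ]


records = [
    CnameRecord("www.example.com", "domestic", CDN_TARGETS["vendor-a"]),
    CnameRecord("www.example.com", "overseas", CDN_TARGETS["vendor-a"]),
]
switched = switch_vendor(records, "overseas", "vendor-a", "vendor-b")
for r in switched:
    print(r.line, "->", r.target, f"(TTL {r.ttl}s)")
```

Keeping the switch scoped to one resolution line matches the observation that CDN failures are usually regional: domestic users stay on vendor A while overseas users move to vendor B.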
The following figure is an example of the switching interface of our operation and maintenance platform, which can be switched according to different second-level domain names for domestic and overseas user access scenarios.
In 2020 and 2021, we encountered real failures on the live network. CDN failover was applied effectively and page access recovered quickly.
3. Regional geographic disaster scenarios: geo-distributed multi-active page access
Here is the geo-distributed multi-active networking strategy of our Chinese site and international site, as shown in the figure:
For regional geographic disaster scenarios, we use a multi-active site deployment. This solution keeps the page content published by the content management service synchronized across multiple cloud service areas, and keeps the LB and gateway routing configuration consistent across the multi-active cloud service areas.
In the specific configuration, CDN back-to-source traffic from domestic and overseas users is split proportionally across different cloud service areas. Health check policies are then configured to raise an alarm when a cloud service area level failure occurs, so that back-to-source traffic can be switched, automatically or manually, to a healthy cloud service area. Where overseas and domestic services differ, cross-area routing is performed at the LB or gateway over the cloud vendor's internal dedicated line.
In this way, in normal (non-disaster) operation, multiple cloud service areas serve page access at the same time, which reduces the back-to-source load on any single area. Even when a cloud service area level failure occurs, one-click failover through the CDN Admin API quickly restores CDN back-to-source traffic to an available state.
As shown in the figure, our operation and maintenance platform can quickly take a faulty cloud service area out of service in a single-area failure scenario. This is mainly achieved by batch-switching the Region-level back-to-source DNS A records of the second-level domain names.
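The proportional split and the removal of a faulty area can be sketched as a weight recalculation. The area names and the 60/40 split below are illustrative assumptions, and the function only computes the traffic shares; applying them through a CDN Admin API or DNS A records is outside the sketch.

```python
from typing import Dict


def effective_weights(weights: Dict[str, int], healthy: Dict[str, bool]) -> Dict[str, float]:
    """Redistribute back-to-source traffic over the healthy cloud service areas."""
    alive = {area: w for area, w in weights.items()
             if healthy.get(area, False) and w > 0}
    total = sum(alive.values())
    if total == 0:
        # Every area is down: this is the single-cloud global failure case,
        # which needs the backup site rather than re-weighting.
        raise RuntimeError("no healthy cloud service area left")
    return {area: w / total for area, w in alive.items()}


weights = {"area-a": 60, "area-b": 40}   # hypothetical proportional split

print(effective_weights(weights, {"area-a": True, "area-b": True}))
# {'area-a': 0.6, 'area-b': 0.4}
print(effective_weights(weights, {"area-a": False, "area-b": True}))
# {'area-b': 1.0}
```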
4. Single cloud global failure scenario: website backup and switching plan
Finally, here is the last line of defense of the whole high-availability solution: website backup and failover. First, let's look at the website backup process, as shown in the figure:
Operation and maintenance personnel first configure the site metadata and the backup strategy. Site management then issues backup tasks to the scheduling service according to the backup strategy, and the scheduling service periodically calls the backup service to run them.
For collection, the backup service starts a Headless Browser to load the entry page, then loads the static page resources, runs the page's own scripts to load dynamic page resources, and then runs preset scripts to load further dynamic page resources. Finally, it identifies the URLs the page can jump to, including those in HTML tags and those at script-triggered dynamic jump points, and starts a new Headless Browser instance for each so that the site is crawled in cascade.
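The cascading part of the collection can be illustrated with a much simpler stand-in. The sketch below does not drive a headless browser and does not execute scripts; it parses static HTML with Python's standard library and follows same-site `<a href>` links breadth-first. The `fetch` callable and the in-memory `SITE` pages are hypothetical placeholders for the real page loading.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Dict
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(entry_url: str, fetch: Callable[[str], str], max_pages: int = 100) -> Dict[str, str]:
    """Breadth-first crawl of same-site pages, starting from the entry page."""
    site = urlparse(entry_url).netloc
    queue = deque([entry_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in pages:
            continue                      # already backed up
        html = fetch(url)
        pages[url] = html
        collector = LinkCollector()
        collector.feed(html)
        for link in collector.links:
            absolute = urljoin(url, link)
            if urlparse(absolute).netloc == site and absolute not in pages:
                queue.append(absolute)    # cascade into newly found pages
    return pages


# Hypothetical in-memory site used instead of real network access.
SITE = {
    "https://www.example.com/": '<a href="/product">P</a><a href="https://other.example.org/">X</a>',
    "https://www.example.com/product": '<a href="/">Home</a><a href="/pricing">Pricing</a>',
    "https://www.example.com/pricing": "<p>No links here</p>",
}
backup = crawl("https://www.example.com/", SITE.__getitem__)
print(sorted(backup))
```

A headless browser is the better fit for the real backup because many jump points only exist after scripts run; the static parser here only shows the queue-and-visit structure of cascading crawling.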
After collection, the main page documents and the related page resources are dumped to the object storage service through the OBS interface. The cross-region synchronization capability of the cloud vendor's object storage then provides remote disaster recovery for the page content. For cross-cloud disaster recovery, cross-cloud synchronization tools copy the backup site's page content to the object storage services of other cloud vendors.
With backup covered, let's look at the failover process. When infrastructure problems or other causes lead to a multi-region failure of a single cloud and the web service becomes unavailable as a whole, fault detection kicks in: the page availability dial test service detects that cloud service areas A and B are unavailable, and an alarm is issued within 5 minutes.
The next step is failover. A major-incident emergency response team is set up, and at the same time the operation and maintenance disaster recovery management platform is used to check which areas are unavailable and whether the backup sites pass their dial tests. If the same-cloud backup site is available, switch to it first; if it is not available and the third-party cloud vendor's backup site is, switch to that one. The switch itself is done by updating the A record of the back-to-source domain name so that it resolves to the OBS public network access address.
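The priority order described above can be written down as a small decision function. The site names and addresses are hypothetical (the IPs come from documentation-only ranges), and updating the actual A record is intentionally left out.

```python
from typing import Dict, Optional, Tuple

# Backup sites in priority order: same-cloud first, then the third-party cloud.
# Names and addresses are illustrative placeholders.
BACKUP_SITES = [
    ("same-cloud backup", "192.0.2.10"),
    ("third-party cloud backup", "198.51.100.20"),
]


def choose_backup_site(dial_test_ok: Dict[str, bool]) -> Optional[Tuple[str, str]]:
    """Return the first backup site whose dial test passes, or None if none do."""
    for name, address in BACKUP_SITES:
        if dial_test_ok.get(name, False):
            return name, address
    return None


# Same-cloud backup is also down, so the third-party backup is chosen.
target = choose_backup_site({
    "same-cloud backup": False,
    "third-party cloud backup": True,
})
print(target)  # ('third-party cloud backup', '198.51.100.20')
```

The returned address is what the back-to-source A record would be pointed at; a `None` result means neither backup passed its dial test and the incident has to be escalated.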
The last phase is fault repair. First locate and fix the problem, then run dial tests to confirm that the Web Server is available, and then switch back manually, after which users return to normal access.
Summary
The above summarizes our practical experience in keeping the website online in various extreme scenarios. These solutions have been verified in real incidents and are exercised in continuous routine drills.
In addition, there is no specific quantitative standard for high availability across websites of different types and scales. A few rough levels can serve as a reference. At the most basic level, the goal is simply that functions are available, without addressing single points of failure in network elements. At the next level, application services are deployed in clusters, and the DB, cache, and other middleware get corresponding high-availability deployments so that no basic single point of failure remains. Beyond that, multi-data-center deployment addresses the unavailability of a single data center. Finally, geo-distributed multi-active deployment or remote disaster recovery handles a disaster that affects an entire geographic area.
Beyond these traditional approaches, as more and more companies move to the cloud, it is also necessary to consider how to quickly replace or move away from a single cloud vendor's infrastructure when it fails. CDN and DNS, for example, are failure points that must be considered in basic website access scenarios.
Bonus
At this event, two more Huawei experts presented "Five Key Measures for Intelligent Practice of Huawei Cloud Official Website" and "Technical Evolution and Low Code Practices of the Front End of Huawei Cloud Official Website". They also answered questions developers care about, such as practical experience with smart recommendation on websites and the selection of low-code platforms. You are welcome to scan the code to watch the videos.