1. The day of collapse
December 20 can fairly be called the day Xi'an collapsed.
There were 21 new cases on December 19 and 42 on the 20th, and some of them had already spread in the community...
Xi'an is under great pressure for epidemic prevention, and all organizations and companies require a nucleic acid test report from the past 48 hours before anyone can go to work.
Under such severe circumstances, Xi'an Yimatong, the core system for prevention and control, collapsed unexpectedly, and the collapse was complete.
It was paralyzed for more than 15 hours!
For a whole day, countless office workers were blocked at subway entrances and countless passengers were stranded mid-journey, unable to move forward or turn back...
In the afternoon, the news even reminded:
"To reduce the pressure on the system, the general public is advised not to open or display the code unless necessary. When the system is stuck, please wait patiently and avoid refreshing repeatedly. Thank you for your understanding and cooperation."
Is this a solution to the problem?
If traffic really has to be throttled to keep the system from crashing, wouldn't it be easier to throttle it by technical means? Even putting an Nginx in front of the system could do that.
Today, let's try to analyze this business and the technical issues behind it.
2. Product analysis
We will set aside the other services of Xi'an Yimatong for now. They are not the point, and they did not completely crash that day. The only thing that crashed was the code-scanning function.
This is a very typical read-heavy, write-light business. Even analyzing it with your eyes closed, you can say that more than 90% of the traffic is queries.
Let's look at the first version of the product. After the code is scanned, part of the person's name and ID number is displayed, with the green, yellow, or red code shown below.
This is what Xi'an Yimatong looked like at the beginning. The business flow needed only one request, perhaps even a single SQL query.
Later, this interface went through two major revisions.
The first revision added vaccination information and a border; the second added nucleic acid test information, showing the test time and result at the bottom.
That adds two query services to the page. If the system is backed by a relational database, that may mean at least two more SQL queries.
That is basically the whole requirement. Xi'an has a population of about 13 million. Even if, at most, 10% of residents scanned their codes at the same time (I doubt it would be that many), that is a concurrency of about 1.3 million, on the order of one million.
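The estimate above can be written out as a short calculation. This is only a back-of-envelope sketch using the figures quoted in this article; the function names are made up for illustration, and the count of three queries per scan assumes the two revisions each added exactly one query on top of the original one.

```python
# Back-of-envelope estimate using the figures quoted in the article.
POPULATION = 13_000_000      # Xi'an population, as quoted
PEAK_SHARE_PERCENT = 10      # worst-case share of residents scanning at once


def peak_concurrent_scans(population: int, share_percent: int) -> int:
    """Number of users scanning at the same moment."""
    return population * share_percent // 100


def backend_queries(scans: int, queries_per_scan: int) -> int:
    """Queries generated if every scan goes straight to the database."""
    return scans * queries_per_scan


scans = peak_concurrent_scans(POPULATION, PEAK_SHARE_PERCENT)
print(scans)                      # 1300000
print(backend_queries(scans, 1))  # 1300000 (first version: one query per scan)
print(backend_queries(scans, 3))  # 3900000 (assumed: one extra query per revision)
```

Even under this pessimistic assumption, the result is a number of simple key lookups that is large but not unusual.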
Workloads at this level of concurrency are very common at Internet companies, often in more complicated scenarios than this one.
Why did it collapse?
3. Technical analysis
In the official reply that evening, we saw this statement:
"On December 20, the number of visits by users of Xi'an "Yimatong" increased sharply, and visits per second reached more than 10 times the previous peak, causing network congestion, so that some application systems, including "Yimatong", could not be used normally.
The "Yimatong" background monitoring raised an alarm immediately. The on-site 24-hour communication, network, government cloud, security, and operations teams immediately carried out investigations. The platform application system and database were operating normally, and the problem was judged to be on the network interface side."
According to this statement, the database and platform system were normal, and the problem was in the network.
I drew an access diagram in an earlier article, "A tragedy caused by a DNS cache", and will use it here to walk through where network problems can occur.
A typical user request starts with the domain name, obtains an external IP address from the DNS server, and then goes through that external IP, past the firewall and the load balancer, to reach the server. Finally, the server responds and the result is returned to the browser.
When the network is at fault, the most common causes are a DNS resolution error or saturated external bandwidth.
A DNS resolution error cannot be the cause this time, otherwise more than this one function would have failed. If the external bandwidth were saturated, simply adding bandwidth would fix it, and it would not take a whole day to resolve.
If the problem were on the network side, there would generally be no need to change the business logic. Yet when the system was restored, everyone found that the interface had gone back to the first version described earlier in this article.
In other words, the system had been "rolled back".
The interface no longer shows vaccination or nucleic acid test information, and a new nucleic acid query page has been added to the Yimatong homepage.
So was the problem really only on the network interface side? I have my doubts.
4. Personal analysis
In my experience, this is a typical case of system overload: for a short period, the request volume exceeds what the servers can respond to.
In plain terms, the volume of external requests exceeds the maximum processing capacity of the system.
Of course, that maximum capacity depends heavily on the system architecture. On the same servers, different architectures can sustain very different loads.
There are essentially two ways to deal with this problem: rate limiting and scaling out.
Rate limiting keeps some users out and serves the requests that can be handled first; scaling out means adding servers and increasing database capacity.
The official advice mentioned above, asking everyone not to use Yimatong unless necessary, is also rate limiting, only done manually; technical systems basically never do it that way.
There are many technical rate-limiting solutions. The simplest is to put an Nginx with a rate-limiting configuration in front; a more involved one is to implement your own algorithm in the access layer.
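As an illustration of the "write your own algorithm" option, here is a minimal token-bucket sketch. It is not what Yimatong runs, and the class and parameter names are invented for this example. Each request consumes one token, tokens are refilled at a fixed rate, and requests that find the bucket empty are rejected instead of reaching the backend.

```python
import time
from typing import Callable


class TokenBucket:
    """Token-bucket limiter: bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so the limiter can be tested without sleeping
        self.tokens = capacity      # start with a full bucket
        self.last = clock()

    def allow(self) -> bool:
        """Return True if the request may proceed, False if it should be rejected."""
        now = self.clock()
        # Refill according to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Usage: allow about 1000 requests per second with bursts of up to 200.
limiter = TokenBucket(rate=1000, capacity=200)
if limiter.allow():
    pass  # forward the request to the backend
else:
    pass  # reject early, for example with a "busy, try again later" response
```

This sketch is single-threaded; a real access layer would need locking or a per-worker bucket, and usually a shared counter if several gateway instances must enforce one global limit.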
Of course, rate limiting does not really solve the problem; it only blocks part of the requests. The real solution is to scale out so that all users can be served.
But judging from how the problem was handled and from the product rollback, Yimatong did not scale out immediately and chose to roll back instead.
This suggests that scaling out was not fully considered in the system architecture design, so it could not be the first choice when the incident happened.
5. The ideal solution?
Everything above is only personal speculation. In reality, they may face more practical problems, such as tight schedules, a boss who controls the budget, and so on...
That said, if you were the architect in charge at the company behind Yimatong, how would you design the whole technical solution? Feel free to leave a comment; here are my own thoughts.
The first step is read/write separation and caching.
Split the system into at least two large parts, and pull the read services used day to day into their own part so that it can carry the bulk of external traffic.
Separate out a subsystem responsible for business updates, such as updating vaccination information, changing nucleic acid test information, or changing the color of the code on a business schedule.
At the same time, put a cache in front of the large volume of single-user queries and read from the cache first, so that the database behind it is not overwhelmed.
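The caching idea can be sketched with the cache-aside pattern: read from the cache first, and go to the database only on a miss. The sketch below uses an in-process dictionary as a stand-in for a real cache service, and `load_from_db`, the TTL value, and the record fields are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class CacheAside:
    """Cache-aside reads: serve from the cache, fall back to the database on a miss."""

    def __init__(self, load_from_db: Callable[[str], dict], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._load = load_from_db
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[dict, float]] = {}  # user_id -> (record, expiry time)

    def get(self, user_id: str) -> dict:
        now = self._clock()
        entry = self._entries.get(user_id)
        if entry is not None and entry[1] > now:
            return entry[0]                      # cache hit: the database is not touched
        record = self._load(user_id)             # cache miss or expired: query the database
        self._entries[user_id] = (record, now + self._ttl)
        return record

    def invalidate(self, user_id: str) -> None:
        """Called by the update subsystem after it changes a user's record."""
        self._entries.pop(user_id, None)


# Usage with a hypothetical database loader.
def load_health_code(user_id: str) -> dict:
    return {"user_id": user_id, "color": "green"}  # placeholder for a real query


cache = CacheAside(load_health_code, ttl_seconds=60)
record = cache.get("user-001")   # first call loads from the database
record = cache.get("user-001")   # second call is served from the cache
```

With a short TTL, repeated scans by the same person within a minute cost the database a single query, which is the effect the read path needs.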
The second step is to shard the database and tables, and to split the services.
Queries for different users are independent of one another, so the data can be sharded across databases and tables by a user attribute.
For example, split the data into 64 tables by user ID, or even into 64 query subsystems, and distribute the traffic at the entry point to reduce the pressure on any single table or service.
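A minimal sketch of routing by user ID to one of 64 shards follows. The shard count comes from the example above; the table-name prefix and the choice of SHA-256 as the hash are assumptions for illustration. A stable hash is used instead of Python's built-in `hash()`, which for strings varies between processes.

```python
import hashlib

SHARD_COUNT = 64  # the figure used in the example above


def shard_index(user_id: str, shard_count: int = SHARD_COUNT) -> int:
    """Map a user ID to a shard number in [0, shard_count) using a stable hash."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def table_for(user_id: str) -> str:
    """Name of the table holding this user's record (hypothetical naming scheme)."""
    return f"health_code_{shard_index(user_id):02d}"


# Usage: the same user always lands on the same table.
print(table_for("user-001"))
print(table_for("user-001") == table_for("user-001"))  # True
```

The same function can pick a query subsystem instead of a table, so that an overloaded shard can be scaled on its own.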
As analyzed above, capacity was not expanded in time, possibly because the services had not been split. A single-purpose sub-service is easy to scale out when it runs into overload.
Of course, if conditions allow, a microservice architecture is better still, since it comes with a whole set of solutions for problems like this.
The third step is a big data system and disaster recovery.
If one page displays a lot of information, there is another technical option: use asynchronous data cleaning to merge the data into one wide table in a NoSQL store.
Scanning, querying, and related services can then read directly from the NoSQL database.
The advantage is that even if the update service goes down completely, code scanning and querying are unaffected, because the two systems and their databases are fully separated.
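The separation described above can be sketched as follows. This is only an illustration: a dictionary stands in for the NoSQL wide table, an in-memory queue stands in for the asynchronous cleaning pipeline, and all function and field names are made up.

```python
import queue
from typing import Dict, Optional

wide_table: Dict[str, dict] = {}        # stand-in for a NoSQL store keyed by user ID
updates: queue.Queue = queue.Queue()    # stand-in for the asynchronous pipeline


def publish_update(user_id: str, field: str, value: str) -> None:
    """Write path: the update service only enqueues a change event."""
    updates.put({"user_id": user_id, "field": field, "value": value})


def drain_updates() -> int:
    """Background worker step: merge pending events into the wide table."""
    applied = 0
    while True:
        try:
            event = updates.get_nowait()
        except queue.Empty:
            return applied
        document = wide_table.setdefault(event["user_id"], {})
        document[event["field"]] = event["value"]
        applied += 1


def scan_code(user_id: str) -> Optional[dict]:
    """Read path: a single key lookup that never calls the update service."""
    return wide_table.get(user_id)


# Usage: three sources of data end up in one document per user.
publish_update("user-001", "color", "green")
publish_update("user-001", "vaccination", "2 doses")
publish_update("user-001", "nucleic_acid", "negative")
drain_updates()
print(scan_code("user-001"))
```

If the worker stops, `scan_code` keeps serving the last merged document, which is the trade-off this design accepts: reads stay available but may be slightly stale.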
Deploy the services across two geographically separate data centers, and prepare an overall disaster recovery plan to guard against extreme situations such as the data center's optical cable being cut.
There are many more detailed optimizations that I won't go through one by one. These are just some of my thoughts; feel free to add yours in the comments.
6. Finally
However you analyze it, this was a man-made disaster, not a natural one.
The system went into production without rigorous testing, and it collapsed as soon as conditions became slightly more demanding.
Many cities are larger than Xi'an, and other cities have faced outbreaks more serious than Xi'an's current one. Why did they not have similar problems?
Xi'an, as a major center of science and technology, really should not have problems like this, especially after I looked at the domain name used behind this mini program.
It leaves a feeling of powerlessness. Although the domain name has nothing to do with how the program works, details like this really do reveal the strength of a technical team.
I hope this lesson is learned and similar problems are avoided in the future!