Everyone knows that although I am a programmer, I am also really into things like dancing. Almost every day, after I get home and before going to bed, I study dance routines in the dance section of Station B.
Yesterday was no exception. As soon as I finished washing up, I hurried over to the computer and opened Station B's dance section to catch up on the new routines from Yaorenmiao, Xin Xiaomeng, and Xiaoxian. I have to say, my "wives" really can dance; even an introvert like me started swaying along without realizing it.
Just as I was about to learn the next move, the page suddenly turned into a 404 Not Found.
Broken. As a developer, my first instinct was that their system had crashed, though I also suspected my own network for a moment. My phone's connection was fine and my computer could reach other websites, so I knew their dev team was in for a long night.
I refreshed several times and it was still the same. I felt a bit of sympathy for the developers on call; their year-end bonus is probably gone. (The site still had not recovered by the time I wrote this.)
As a programmer, old habits kicked in: I started thinking about how Station B's site is probably put together and, post-mortem style, which points could have failed. (Occupational hazard.)
First, let's roughly sketch a simple diagram of the site's architecture, and then guess where the problem might lie.
Since I was staying up late to write this, and I've never worked at a company whose core business is video and live streaming, I don't know that tech stack well, so I drew the sketch using the general logic of an e-commerce system. Please bear with it.
From top to bottom: entry point, CDN content distribution, front-end servers, back-end servers, distributed storage, big data analysis, risk control, and search/recommendation. I drew it fairly casually, but I don't think the overall architecture would differ all that much from the real thing.
I went online and looked at a few companies like Douyu, Station B, and Station A. Their main technology stacks and technical pain points are roughly:
Video access storage
- Traffic
- Nearest node
- Video codec
- Resumable uploads (quite different from the toy IO examples we write ourselves)
- Database system & file system isolation
Concurrent access
- Streaming media servers (every major vendor has them; bandwidth costs are relatively high)
- Data cluster, distributed storage, cache
- CDN content distribution
- Load balancing
- Search engine (sharding)
Danmaku (bullet comment) system
- Concurrency and threading (a toy fan-out sketch follows after this list)
- Kafka
- NIO frameworks (Netty)
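To make the concurrency point above a bit more concrete, here is a toy fan-out hub in Go: a central inbox channel receives danmaku and a broadcast loop pushes them to every connected viewer. All names and sizes here are my own invention; a real barrage system would buffer through Kafka and deliver over an async network layer like Netty rather than in-process channels.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// hub fans incoming danmaku out to every connected viewer.
// Purely illustrative: no persistence, no network layer.
type hub struct {
	mu      sync.RWMutex
	viewers map[int]chan string
	inbox   chan string
}

func newHub() *hub {
	h := &hub{viewers: make(map[int]chan string), inbox: make(chan string, 1024)}
	go func() {
		for msg := range h.inbox {
			h.mu.RLock()
			for _, ch := range h.viewers {
				select {
				case ch <- msg:
				default: // slow viewer: drop instead of blocking the whole room
				}
			}
			h.mu.RUnlock()
		}
	}()
	return h
}

func (h *hub) join(id int) chan string {
	ch := make(chan string, 64)
	h.mu.Lock()
	h.viewers[id] = ch
	h.mu.Unlock()
	return ch
}

func main() {
	room := newHub()
	for i := 0; i < 3; i++ {
		ch := room.join(i)
		go func(id int) {
			for msg := range ch {
				fmt.Printf("viewer %d sees: %s\n", id, msg)
			}
		}(i)
	}
	room.inbox <- "2233 is dancing!"
	time.Sleep(100 * time.Millisecond) // let the goroutines print before exit
}
```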
In fact, the technologies we all learn are much the same; the difference is that the language mix in their microservices probably leans more heavily toward Go, PHP, Vue, and Node.
Now let's analyze where this incident could have originated and why:
1. Someone deleted the database and ran
This is what happened at Weimob a while back. I don't think companies should hand out such broad ops permissions in the first place; for example, host-level policy can simply forbid commands such as rm -rf, fdisk, and DROP.
Besides, databases these days are mostly multi-master, multi-slave, with off-site backups, and disaster recovery should be well covered. Even if the database were blown away, the CDN's static resources shouldn't vanish with it, yet the page went straight to 404.
2. The whole cluster went down
Front end and back end are separated these days. If only the back end goes down, much of the front end can still load; it just can't fetch data and throws errors. So either the front-end cluster went down, or front end and back end went down together, but the same problem remains: right now even static resources seem to be unreachable.
That said, I do think this is somewhat plausible: a few services crash, errors pile up, the whole cluster gets dragged down, and the more that happens, the more users hit refresh, which makes it even harder for the remaining services to come back up. Still, I don't think this is as likely as the last possibility below.
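On the "errors pile up and users keep refreshing" point, one common mitigation is to degrade gracefully instead of returning a bare 404 or 502: the edge layer serves a friendly static page when the upstream is unreachable. A minimal sketch in Go, assuming a hypothetical upstream address; a real CDN would do this with origin-failover rules rather than a hand-written proxy.

```go
package main

import (
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// Hypothetical upstream application server.
	upstream, _ := url.Parse("http://127.0.0.1:8080")
	proxy := httputil.NewSingleHostReverseProxy(upstream)

	// If the back end is unreachable, fall back to a static page
	// instead of surfacing a raw error to every refreshing user.
	proxy.ErrorHandler = func(w http.ResponseWriter, r *http.Request, err error) {
		w.WriteHeader(http.StatusServiceUnavailable)
		w.Write([]byte("<html><body>We'll be right back. (2233 is on it.)</body></html>"))
	}

	http.ListenAndServe(":8000", proxy)
}
```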
3. The server vendor has a problem
A large site like this sits behind CDN + SLB + application clusters, with rate limiting, degradation, and load balancing all presumably done well. So what's left is the possibility that the hardware at the vendor hosting those front-line layers has a problem.
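Since rate limiting and degradation come up here, a minimal token-bucket sketch in Go (standard library only, all names and numbers are made up by me) shows the idea: when the bucket is empty, shed the request early so a retry storm doesn't pile onto services that are trying to recover.

```go
package main

import (
	"net/http"
	"time"
)

// newTokenBucket refills up to `capacity` tokens every `interval`;
// each request consumes one token or is rejected.
func newTokenBucket(capacity int, interval time.Duration) chan struct{} {
	bucket := make(chan struct{}, capacity)
	go func() {
		for range time.Tick(interval) {
			for i := 0; i < capacity; i++ {
				select {
				case bucket <- struct{}{}:
				default: // bucket already full
				}
			}
		}
	}()
	return bucket
}

func main() {
	bucket := newTokenBucket(100, time.Second) // roughly 100 req/s per instance

	http.HandleFunc("/api/feed", func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-bucket:
			w.Write([]byte("ok"))
		default:
			// Degrade: fail fast instead of queueing work we cannot serve.
			http.Error(w, "too many requests, please retry later", http.StatusTooManyRequests)
		}
	})
	http.ListenAndServe(":8080", nil)
}
```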
What puzzles me more, though, is that Station B's BFF layer should route users to whichever data center's access nodes are closest to them, so when people all over the country hit refresh, some should be fine, some broken, and some flaky. Yet right now it looks completely down for everyone. Did they bet everything on a single vendor's nodes in one region?
Rumor online also has it that the Yunhai data center caught fire; I don't know whether that's true, and I can only wait until morning for Station B's official statement. In principle, from CDN to distributed storage to big data to search, Station B should have plenty of safeguards in place; if they really went all-in on one location, that was not wise.
My gut feeling is that they haven't moved everything to the cloud, their self-hosted servers had a problem, and what hadn't been moved to the cloud happened to be the key business. Companies now typically run a hybrid of public cloud plus private cloud, but the private cloud part carries Station B's internal business, so their own machine room alone shouldn't have been the problem.
If it really is what I described, everything bet on one server vendor with a physical machine failure on top of it, then data recovery could be slow. I used to do big data, so I know backups are incremental plus full; restores are actually fine when part of the data can be pulled from nodes in other regions, but if everything sits in one place, it gets troublesome.
Post-mortem
Whatever the final cause turns out to be, what we engineers and our companies should be thinking about is how to keep this kind of thing from happening.
Data backup: backups are a must, otherwise any natural disaster will hurt badly. That's why many cloud vendors now pick sites with few natural disasters, like my hometown Guizhou, or even the bottom of a lake or the sea (it's cooler there, which cuts costs a lot).
Basically you always want both full and incremental backups: daily, weekly, and monthly incrementals plus scheduled full backups can greatly reduce losses, even if every spinning disk in a region dies (with off-site disaster recovery, short of the earth being destroyed, the data can be pulled back).
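As a rough illustration of the incremental idea (a toy sketch with made-up paths, not how a real backup system such as database binlog shipping or snapshot tooling works): copy only the files modified since the previous backup's timestamp.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
	"time"
)

// incrementalBackup copies files under src modified after `since` into dst,
// preserving relative paths. Error handling is kept minimal for brevity.
func incrementalBackup(src, dst string, since time.Time) error {
	return filepath.Walk(src, func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() || !info.ModTime().After(since) {
			return err
		}
		rel, _ := filepath.Rel(src, path)
		target := filepath.Join(dst, rel)
		if err := os.MkdirAll(filepath.Dir(target), 0o755); err != nil {
			return err
		}
		in, err := os.Open(path)
		if err != nil {
			return err
		}
		defer in.Close()
		out, err := os.Create(target)
		if err != nil {
			return err
		}
		defer out.Close()
		_, err = io.Copy(out, in)
		return err
	})
}

func main() {
	// Hypothetical paths; `since` would come from the previous backup's manifest.
	since := time.Now().Add(-24 * time.Hour)
	dst := "/backup/incr-" + time.Now().Format("20060102")
	if err := incrementalBackup("/data/videos", dst, since); err != nil {
		fmt.Println("backup failed:", err)
	}
}
```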
Ops permission convergence: the "delete the database and run" scenario is still a worry. I use rm -rf on servers all the time myself, but access generally goes through a jump server (bastion host), where dangerous commands can be blocked outright.
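A very rough sketch of that "command prohibition" idea on a jump server, with a hypothetical denylist of my own; real bastion products enforce this with audited sessions and fine-grained policy, not a string filter like this.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// Commands the bastion refuses to forward. Purely illustrative.
var denylist = []string{"rm -rf", "fdisk", "mkfs", "drop database", "drop table"}

func blocked(cmd string) bool {
	lower := strings.ToLower(cmd)
	for _, bad := range denylist {
		if strings.Contains(lower, bad) {
			return true
		}
	}
	return false
}

func main() {
	scanner := bufio.NewScanner(os.Stdin)
	fmt.Print("bastion> ")
	for scanner.Scan() {
		line := scanner.Text()
		if blocked(line) {
			fmt.Println("command rejected by bastion policy")
		} else if strings.TrimSpace(line) != "" {
			// Forward the command; output goes straight back to the operator.
			c := exec.Command("sh", "-c", line)
			c.Stdout, c.Stderr = os.Stdout, os.Stderr
			_ = c.Run()
		}
		fmt.Print("bastion> ")
	}
}
```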
Cloud + cloud native: cloud products are very mature now, and enterprises should place enough trust in their cloud vendor, provided they pick the right one. A cloud product's feature set is only one criterion; the disaster recovery and emergency response it provides at critical moments is something many companies simply don't have on their own.
Cloud native has only gotten real attention in recent years. The Docker + Kubernetes combination, together with the capabilities of cloud platforms, can genuinely deliver the unattended operation, dynamic scaling, and emergency response mentioned above. But the technology has its own adoption cost, and I'm not sure how well it suits a video-heavy system like Station B.
Building your own strength: whether or not you go to the cloud, I don't think you can rely too heavily on cloud vendors. You still need your own core technology and emergency mechanisms. What if the cloud vendor really turns out to be unreliable? How to achieve genuine high availability is something every company's engineers need to think about.
For example, many cloud vendors carve one physical machine into multiple virtual machines to sell, so several tenants share a single host. If one of them is an e-commerce site running Double Eleven and another is a game company, the first can hog the network bandwidth and cause packet loss, which is a miserable experience for the game's users. That's why I say don't trust or depend on cloud vendors too much.
And if a neighboring tenant bought the machine to mine cryptocurrency, it's even worse: the compute gets drained and the host runs at full load, making things even more painful for everyone else on it.
For Station B, it's fortunate that a problem like this surfaced early, and at night, so there are plenty of low-traffic hours to recover in. As I finish writing, most pages have come back, though recovery still looks partial.
In any case, it means this can be ruled out next time. I expect the teams behind Station B will be busy with architectural overhauls for a good while to make the system truly highly available.
I just hope that in the future I can watch the dance section in peace at night, instead of staring blankly at the 2233 girls on 502 and 404 error pages.