
In 2018, Tencent officially launched its strategy of moving self-developed businesses to the cloud, and massive businesses such as WeChat, QQ, and Tencent Games began a large-scale migration.

Simply lifting locally deployed programs onto the cloud does not capture the full advantages of cloud computing. Only by redeploying, partially refactoring, or even completely rewriting the technology stack for the cloud can an ill-adapted "outsider" become a true "native" of the cloud.

Tencent's developers played an extremely important role in this process. They iterated on their own technology around cloud native and worked together to build a technical "road to the sky" leading to the cloud.

On June 16, 2022, Tencent officially announced that its massive internal self-developed businesses had been fully migrated to the cloud.

Behind that short sentence is the largest cloud-native practice in China. According to statistics, Tencent's self-developed businesses on the cloud exceed 50 million cores in scale, with cumulative cost savings of more than 3 billion.

The four years Tencent's self-developed businesses spent moving to the cloud were also four years of growth and transformation for Tencent's developers.

Moving to the cloud gives technical capability a new life

The elastic scaling that comes with moving to the cloud can greatly improve resource utilization and reduce costs. For most businesses, that is reason enough to migrate. For game teams, however, content development is continuous and intense, so the cost and risk of refactoring cannot be ignored.

"Frankly speaking, the main goal of our initial cloud adoption was not to save computing costs. Games are a relatively complex business scenario. Compared with continuing to iterate on the original architecture, a complete cloud-native refactoring requires a lot of investment. Is it worth the extra technical cost?" This was the biggest dilemma facing Ma Tongxing, technical director of Tencent's IEG Joy Studio, and his team before they started their cloud-native refactoring.

"After thinking it over, we concluded that cloud-native refactoring systematically improves the dynamic management and scheduling of traffic and strengthens disaster recovery and fault tolerance, which in turn improves business reliability," Ma Tongxing said.

"By refactoring for the cloud, we have turned the dynamic scaling of services into a routine action that can be performed at any time, which means that when any node or network fails, failover completes automatically. Without this ability, we would have to get out of bed in the middle of the night and switch over manually. That is a whole different level of capability."

The same applies to fault management. In a traditional architecture, when a client request fails, a series of complex calls has already taken place on the backend, so engineers have to dig through logs to find the source of the problem. Under a cloud-native architecture, the call chain is visible and the faulty node can be identified at a glance.

"These governance capabilities are pushed down to the infrastructure layer. Although the cost and challenge of refactoring the business architecture up front are considerable, the gains in R&D and operations efficiency and quality pay off over the long term," Ma Tongxing told his team as he persuaded them. "This is an industry trend, and it is an opportunity. We can run perfectly well today without doing it, but what about three years from now? We might be far behind the first-class teams in the industry."

In this way, the team reached a consensus and committed to cloud-native practice. They did a great deal of research and knowledge sharing, and even translated a book on K8s. After several years of cloud-native refactoring, the team's professional capabilities improved significantly, and many colleagues earned promotions as a natural result.

If the team had not stuck to the cloud-native technical route, where would it be now? Ma Tongxing cannot imagine it, but he knows that new technologies bring revolutions. If you cannot guarantee that others will not upend you, it is better to take the initiative and join the technological revolution yourself.

Diving into cloud-native practice to break through technical anxiety

Before he joined the project to move self-developed businesses to the cloud, Wang Ang felt a lingering technical anxiety, even while working on a star product like QQ. The same anxiety hung over everyone on the team like a low-pressure system.

In 2013, Wang Ang joined Tencent after graduation and worked on the QQ backend system. At that time, QQ was a star product that had been live for more than ten years, had more than 800 million monthly active users, and ran on a mature and stable self-developed architecture. To outsiders, it was definitely an enviable job.

But after the initial excitement passed, Wang Ang fell into self-doubt: "If you only tinker with QQ's self-developed technology stack and work behind closed doors, will your skills drift out of touch with the industry in the long run, or even fall behind?"

After a few years of living with this persistent anxiety, Wang Ang finally reached a turning point. After the 930 reform in 2018, CSIG was established, and his team was folded into it and began building a new product called "Tencent Classroom". The team now faced a decision: should it keep using QQ's mature and stable old technology stack, or adopt the new stack that Tencent Cloud was continuously improving?

This was a hard trade-off to weigh. From a pragmatic point of view, QQ's technology stack had served a huge number of users and was thoroughly battle-tested. Moving to the new stack would greatly increase the workload and learning cost in the early stage of the migration, and service quality might even suffer because the components were not yet mature enough.

The desire for new technology outweighed the concerns, and technical anxiety became the final weight that tipped the balance. For the business, moving to the cloud or not might make little difference in the short term, but in the long run it would open a huge gap in R&D efficiency. For developers personally, the public cloud's technology stack is more advanced and more in line with the wider industry. They had long been unwilling to be mere tinkerers. Who would want to let this opportunity slip away?

"I didn't have a choice before, and now I want to try new technology."

Wang Ang gave an example: "If you keep using QQ's technology stack, you have to write a pile of code to implement the business logic. After the business moves to the cloud, Redis can directly handle many complex data-structure problems, and the same effect can be achieved in a few lines of code. The public cloud offers a large number of excellent components that can greatly improve R&D efficiency."

As expected, changing the technology stack was not smooth sailing. One day, some user data suddenly turned into garbled text. The R&D team spent a long time tracking down the cause and found that the local database used some special encodings that did not match those of the cloud database components.

After falling into this pit, Wang Ang realized that the migration process has to be rigorous from end to end: tool preparation beforehand, monitoring during the migration, and data reconciliation after it completes, all of which help ensure the migration's success rate.
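The post-migration reconciliation step can be illustrated with a minimal sketch. It is not the Tencent Classroom team's tooling; the function names, the snapshot format (a dict keyed by primary key), and the sample rows are all assumptions made for illustration. It compares a digest of every source row with its migrated counterpart, so that records garbled by an encoding mismatch or lost in transit show up in a report.

```python
import hashlib


def row_digest(row: dict) -> str:
    """Return a stable digest of one record (keys sorted, text encoded as UTF-8)."""
    canonical = "\x1f".join(f"{key}={row[key]}" for key in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def reconcile(source: dict, target: dict) -> dict:
    """Compare two {primary_key: row} snapshots and report the keys that differ."""
    missing = sorted(k for k in source if k not in target)
    unexpected = sorted(k for k in target if k not in source)
    mismatched = sorted(
        k for k in source
        if k in target and row_digest(source[k]) != row_digest(target[k])
    )
    return {"missing": missing, "unexpected": unexpected, "mismatched": mismatched}


# Hypothetical sample data: row 2 was decoded with the wrong charset on the
# way to the target, and row 3 never arrived.
source_rows = {
    1: {"nickname": "alice", "course": "Python"},
    2: {"nickname": "小明", "course": "K8s"},
    3: {"nickname": "bob", "course": "Go"},
}
target_rows = {
    1: {"nickname": "alice", "course": "Python"},
    2: {"nickname": "小明".encode("utf-8").decode("latin-1"), "course": "K8s"},
}

report = reconcile(source_rows, target_rows)
print(report)
```

On real data sets the same idea is usually applied per batch or per key range rather than to whole tables held in memory, so that a mismatch can be narrowed down without rescanning everything.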

To this end, Wang Ang and his team worked through many details of the cloud components with Tencent Cloud's product teams and reported nearly 400 issues in one year. The two sides kept up a continuous exchange on components such as containers, operations and maintenance, and audio and video, which both improved Tencent Cloud's technology stack and safeguarded Tencent Classroom's migration.

At the beginning of 2020, with the outbreak of COVID-19, Tencent Classroom's traffic swung sharply with the course of the epidemic, and peak traffic could reach 100 times the usual level.

Under QQ's self-developed architecture, the team would have had to apply for server resources several days in advance and then expand capacity manually. After moving to the cloud, with containerized deployment and elastic scale-out and scale-in, Tencent Classroom scaled automatically and smoothly. This greatly eased the pressure on operations, and the stable, smooth performance was also recognized by the business side.
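How automatic scaling decides on a replica count can be sketched with the proportional rule documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × observed metric / target metric). The article does not say which autoscaler or metrics Tencent Classroom used, so the function below, its parameter names, and the numbers are illustrative assumptions, and the real autoscaler adds a tolerance band and stabilization windows that are omitted here.

```python
import math


def desired_replicas(current: int, observed_load: float, target_load: float,
                     min_replicas: int = 1, max_replicas: int = 1000) -> int:
    """Proportional scaling rule, clamped to [min_replicas, max_replicas].

    observed_load and target_load are per-replica averages of the same metric,
    for example CPU utilization in percent (an assumption for this sketch).
    """
    if current < 1 or target_load <= 0:
        raise ValueError("current must be >= 1 and target_load must be > 0")
    wanted = math.ceil(current * observed_load / target_load)
    return max(min_replicas, min(max_replicas, wanted))


# Traffic spike: 10 replicas running at 450% of what one replica should carry.
scale_out = desired_replicas(current=10, observed_load=450, target_load=60)

# Traffic falls away again: 75 replicas, each nearly idle.
scale_in = desired_replicas(current=75, observed_load=6, target_load=60)

# A very large spike is capped by the configured maximum.
capped = desired_replicas(current=10, observed_load=6000, target_load=60,
                          max_replicas=500)

print(scale_out, scale_in, capped)
```

The upper bound matters in practice: it keeps a sudden spike or a faulty metric from requesting more capacity than the cluster or budget allows.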

At that moment, Wang Ang felt that the technical anxiety hanging over the team had finally lifted.

Today, Wang Ang is the technical director of Tencent Classroom. He reached this position after nine years of work, which is a fast rise, and he believes it has a lot to do with taking part in the migration of self-developed businesses to the cloud. Beyond the help with promotion, he is even more grateful that the migration exposed him to a large number of excellent open source components, broadened his technical horizons, and let him apply cutting-edge technologies in his business, all of which greatly eased his technical anxiety.

"If you feel anxious, you must act and do the hard and right thing. Action is the best way to relieve anxiety."

Pursue technical ideals, but also understand practical needs

Technology geek Yu Guangyou is one of the earliest pioneers of the K8s community in China and has a deep understanding of containers. Together with the container team, he built the earliest K8s-based container platform product in China: Tencent Cloud Container Service (Tencent Kubernetes Engine, TKE).

When he first heard that the company was going to move its self-developed businesses to the cloud, Yu Guangyou's first reaction was that it was "very cool": "Tencent is going to run on Tencent Cloud".

But the reality is not as cool as imagined.

Yu Guangyou once considered himself a technical fundamentalist who held strictly to the technical route. During the migration, however, the businesses raised a series of requirements that ran counter to cloud native, which frustrated him greatly.

The differences stem from history. For example, WeChat's original architecture ran on physical machines, and some systems, such as access control, hard-code IP addresses. Over the past ten years, WeChat has served more than one billion users reliably and stably. Although the architecture has kept iterating, this characteristic is woven into the entire system, and changing it would ripple through everything.

After many technical discussions, the container team finally developed a feature that supports fixed IPs, so as to leave WeChat's original architecture intact and reduce the cost of migration.

After developing this "industry-exclusive" capability, Yu Guangyou was dejected because the technology was not pure enough. He soon discovered, however, that support for fixed IPs was actually a common demand: many existing businesses in CSIG and IEG also needed fixed IPs when moving onto containers, and the feature even became a powerful selling point for attracting external customers.

Similar situations arose more than once. Some business teams wanted to specify the interface parameters of TKE clusters, and others wanted TKE to support in-place updates of Pods. These requests differed from cloud native as Yu Guangyou understood it. Judging by the results, however, the needs raised by business teams often turned out to be common needs across the industry.

"Cloud-native technology has to support not only new services moving to the cloud but also the needs of existing services, and pure cloud-native technology cannot fully support the latter." To support the migration of a large number of existing services, TKE developed many capabilities that are not so "fundamentalist" but that genuinely reduce the cost of moving to the cloud for the business.

After technical ideals and practical needs had worn against each other, Yu Guangyou let go of his fundamentalist insistence on cloud-native purity and became more flexible and pragmatic. In addition to microservices, the container technology developed by him and his team has begun to support stateful services such as databases and big data... Today he no longer regards these as customized services but as innovations built on a cloud-native core, which both help businesses reduce migration costs and let cloud native expand from specialized scenarios to general-purpose ones.

"What is the difference between customization and innovation? The core of cloud native is a declarative API and immutable infrastructure. As long as a customization requirement does not affect that essence, it is still cloud native, and it can even become an opportunity for innovation that further enriches TKE's capabilities."
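The two ideas named in this quote, a declarative API and immutable infrastructure, can be made concrete with a small sketch. It is not TKE or Kubernetes code; the `Spec` type, the action names, and the workload names are hypothetical. The caller only declares the desired state, a planner derives the actions by comparing it with the observed state, and a changed image leads to a replacement rather than an in-place modification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spec:
    """Declared state of one workload (hypothetical fields for illustration)."""
    image: str
    replicas: int


def plan(desired: dict[str, Spec], observed: dict[str, Spec]) -> list[tuple[str, str]]:
    """Derive the actions that move the observed state toward the desired state."""
    actions = []
    for name, spec in sorted(desired.items()):
        current = observed.get(name)
        if current is None:
            actions.append(("create", name))
        elif current.image != spec.image:
            # Immutable infrastructure: a new image means new instances.
            actions.append(("replace", name))
        elif current.replicas != spec.replicas:
            actions.append(("scale", name))
    for name in sorted(set(observed) - set(desired)):
        actions.append(("delete", name))
    return actions


def reconcile(desired: dict[str, Spec], observed: dict[str, Spec]) -> dict[str, Spec]:
    """Apply the plan to an in-memory copy of the observed state and return it."""
    result = dict(observed)
    for action, name in plan(desired, observed):
        if action == "delete":
            del result[name]
        else:
            result[name] = desired[name]
    return result


desired_state = {"api": Spec("api:v2", 3), "worker": Spec("worker:v1", 5)}
observed_state = {
    "api": Spec("api:v1", 3),
    "worker": Spec("worker:v1", 2),
    "legacy": Spec("legacy:v1", 1),
}

actions = plan(desired_state, observed_state)
converged = reconcile(desired_state, observed_state)
print(actions)
```

Running the planner again on the converged state yields no actions, which is the property that makes a declarative interface safe to re-apply at any time.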

"Originally, I thought technology should push ahead in a forward-looking direction without hesitation. Later I found that any technological evolution must first land in the scenarios that suit it best and only then be strengthened and generalized, so that it stays compatible with the old architecture while continuing to evolve."

"We should pursue not only local optimality but also global optimality," Yu Guangyou said. "Who would have thought that the things that tormented me back then have now become TKE's technical strengths?"


思否 (SegmentFault) editorial team