Header image

Image credit: https://unsplash.com/@alistairdent

Author: Yan Shisan

Problem

Assembly is inseparable from the old topic of reuse. For most scenarios that are considered common, we can encapsulate them at the system level into highly reusable services and expose part of their capabilities through interfaces and extension points. One scenario remains unsolved, however: when the code of one function finishes executing, we want another function to be triggered, and we want this to be done through configuration rather than new development.
For example, a user sends a gift to a host, and the user gains +1 point on the contribution list of the live room. This is a typical "performance" scenario, in the sense of fulfilling a contract. Placing an order in the live room and having it fulfilled to the warehousing system follows the same principle, but it is built on a fixed process model. Activities are often not like this; they are far more flexible than such fixed processes. The author's team has developed the following scenario:
By giving gifts, users help the host fill a virtual energy bar in the live room, and every time the energy bar fills up, something should happen. In the first event, points are added to a certain list; in the second, a virtual treasure box is dropped for the anchor; in the third, raffle tickets are sent to users; and so on. As long as a function has already been built, the hope is that it can be reused here. Over time, the code of this module gets modified frequently: when the module "finishes", either a chain of if conditions triggers each piece of code, or the strategy pattern is used to remove the magic code. From the perspective of long-term construction, this only enumerates the things to do after this one module ends. If the next module picks two of those things, won't we have to write another pile of code?
So we thought of a relatively primitive solution: a bus service.

Combination capability of the carrier pigeon

Rather than "combination", the word "performance" is more appropriate, because for activities the carrier pigeon only needs to play the role of the "performance" side. What it has to solve is distribution in asynchronous scenarios, rather than acting as a combination of adapters for various systems. Here is an example activity scenario:
image

A simple musician activity page mainly contains the following modules:

  1. Users complete related tasks through behaviors such as giving gifts and watching.
  2. Users add progress bar points for their favorite musicians.
  3. When the progress bar points reach a certain tier, rewards such as a treasure chest drop or a floating-screen gift are triggered. The reward type differs for each tier.
  4. A Lucky Koi module.

From a development perspective, such an activity is actually assembled from several modules. Except for 4, which needs to be developed independently, 1, 2, and 3 can all be assembled from existing systems. Here, 1 is abstracted as a task system, 2 as a progress bar system, and 3 as a treasure chest system, a gift-giving system, and so on. The following figure shows the modules abstracted from the scenario:
image

So here is the difficulty: by what means can these modules be assembled?

As the figure shows, when a module completes its life cycle, it can send out a piece of data. After the routing system receives the data, it performs a layer of routing and forwarding and decides which system the data is routed to next. The next system could be a "reward system" or a "progress bar system".

We can refine this abstraction further. We want the routing system to act as a "bus". A "letter" represents the data that each system expects to receive, and each system is abstracted as a destination. If we configure a routing relationship (abstracted as a carrier pigeon configuration), the "carrier pigeon" can carry the data letter to any destination we want. The benefit is that a system only needs to provide its own capability and connect to the routing system; for any future activity, the combination relationships can then be configured directly.
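The letter, destination, and routing configuration described above can be sketched in a few lines. This is a minimal in-memory illustration, not the team's implementation: the class `PigeonRouter`, its method names, and the ids `task_done`, `reward`, and `progress_bar` are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Letter = Dict[str, object]               # the data a destination expects to receive
Destination = Callable[[Letter], None]   # a connected system, seen as a callable


class PigeonRouter:
    """Hypothetical in-memory stand-in for the routing ("bus") system."""

    def __init__(self) -> None:
        self._destinations: Dict[str, Destination] = {}
        self._routes: Dict[str, List[str]] = defaultdict(list)

    def register_destination(self, name: str, handler: Destination) -> None:
        # A system connects to the bus by exposing its own capability.
        self._destinations[name] = handler

    def configure(self, pigeon_id: str, destination: str) -> None:
        # A routing relationship ("carrier pigeon configuration").
        self._routes[pigeon_id].append(destination)

    def fly(self, pigeon_id: str, letter: Letter) -> List[str]:
        # Deliver the letter to every destination configured for this pigeon id.
        delivered = []
        for name in self._routes.get(pigeon_id, []):
            self._destinations[name](letter)
            delivered.append(name)
        return delivered


received = []
router = PigeonRouter()
router.register_destination(
    "reward", lambda letter: received.append(("reward", letter["userId"])))
router.register_destination(
    "progress_bar", lambda letter: received.append(("progress_bar", letter["userId"])))
router.configure("task_done", "reward")
router.configure("task_done", "progress_bar")
delivered = router.fly("task_done", {"userId": 42})
```

Adding a new combination for the next activity is then a matter of calling `configure` with different ids, without touching the code of either system.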

With the idea of carrier pigeon routing, we can lay out the complete flow for the musician activity scenario:

image

As the figure shows, we have added extension points (EXT-POINT) to basic modules such as tasks and progress bars to support the orchestration of business processes. For combining groups of system modules, there is the additional concept of the carrier pigeon service: in the processing flows of [2], [3], and [4], the carrier pigeon determines which system the data flows to. Practice has shown that this scheme is feasible and effective.

Overall architecture

So how should such a combination capability be designed? The R&D focus can be divided into four levels:

  1. Performance capability: when a piece of code finishes, its end-of-execution trigger fires.
  2. SDK access capability: interface-level encapsulation with built-in Autoconfigure capability.
  3. A global identification dictionary: how a piece of data lets all connected systems reach a consensus on what its fields mean.
  4. Automatic registration: once a system integrates the SDK, it automatically reports itself to the carrier pigeon service for unified management and is activated without manual intervention.

Based on these four levels, the architecture design is as follows:
image

As the figure shows, the dotted box on the left represents the trigger in one system, and the dotted box on the right represents the final delivery to another system. The carrier pigeon system acts as a proxy service. The triggering system only needs to integrate the carrier pigeon SDK: after performing its own responsibilities, it assembles the data and sends it to the carrier pigeon system. The carrier pigeon system looks up the related configuration by the carrier pigeon id it receives, and that id determines which connected subsystem the data flows to. This flow can be synchronous or asynchronous (most scenarios are asynchronous). The asynchronous path relies mainly on the delivery capability of "RocketMQ". When forwarding fails, the data being delivered is backed up and stored so that a scheduled job can retry it.
In past practice, most scenarios are asynchronous links that do not need the result returned by the next subsystem, and errors in "RocketMQ" itself are low-probability events (after all, with four brokers, the chance that three of them fail is extremely low). For these scenarios this is a clear advantage over communication-level interfaces such as RPC.
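The "back up on failure, retry on a schedule" behavior can be sketched as follows. This is a simplified illustration under stated assumptions: the `Forwarder` class is hypothetical, the backup store is an in-memory list rather than a real database, and `send` stands in for whatever actually delivers the message (RocketMQ in the article).

```python
from typing import Callable, Dict, List, Tuple

Message = Dict[str, object]


class Forwarder:
    """Hypothetical forwarder: failed deliveries are backed up for a later retry."""

    def __init__(self, send: Callable[[str, Message], None]) -> None:
        self._send = send
        self.backup: List[Tuple[str, Message]] = []  # stand-in for durable storage

    def forward(self, destination: str, message: Message) -> bool:
        try:
            self._send(destination, message)
            return True
        except Exception:
            # Keep the undelivered data so a scheduled job can retry it.
            self.backup.append((destination, message))
            return False

    def retry_failed(self) -> int:
        # Intended to be called by a timer; returns how many retries succeeded.
        pending, self.backup = self.backup, []
        succeeded = 0
        for destination, message in pending:
            if self.forward(destination, message):
                succeeded += 1
        return succeeded


calls = {"count": 0}


def flaky_send(destination: str, message: Message) -> None:
    # Simulated transport that fails on the first call only.
    calls["count"] += 1
    if calls["count"] == 1:
        raise ConnectionError("broker unavailable")


forwarder = Forwarder(flaky_send)
first_attempt = forwarder.forward("treasure_chest", {"userId": 1})
retried = forwarder.retry_failed()
```

Messages that fail again during `retry_failed` are appended back to `backup`, so they are picked up by the next scheduled run.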

SDK

Route trigger

In the figure above, the subsystems fall into roughly two types: those in the activity business domain and those in non-activity business domains, depending on the business scenario. We would like all subsystems to integrate the SDK according to the standard, but we cannot guarantee that every subsystem is able to depend on it. For some non-activity business domains, bridging is done through customized development; this bridging work is closer to the traditional adapter and ESB bus idea.
Subsystems in the activity business domain can integrate the SDK. Here we mainly introduce the asynchronous design. When a subsystem integrates the SDK, a PushConsumer bean is created by default while beans are being created in the Spring container, and a "Listener" that listens for the carrier pigeon's "fly" is added to it, so that routed messages are consumed and parsed automatically. Assuming the module capabilities this system can undertake are S1, S2, ..., SN, the overall implementation is shown in the following diagram:
image
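The consumption side can be pictured as a listener that parses each routed message and dispatches it to one of the capabilities S1 ... SN. The sketch below is language-agnostic in spirit and does not use the real Spring or RocketMQ APIs; `PigeonListener`, the `type`/`payload` message layout, and the two handlers are assumptions made for illustration.

```python
import json
from typing import Callable, Dict

Handler = Callable[[dict], str]


class PigeonListener:
    """Hypothetical listener that the SDK would attach to the consumer."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}

    def capability(self, module_type: str):
        # Decorator that binds a module capability (S1, S2, ...) to a handler.
        def decorator(func: Handler) -> Handler:
            self._handlers[module_type] = func
            return func
        return decorator

    def on_message(self, raw: bytes) -> str:
        # Parse the routed message and dispatch by its module type.
        message = json.loads(raw)
        handler = self._handlers.get(message["type"])
        if handler is None:
            return "ignored"
        return handler(message["payload"])


listener = PigeonListener()


@listener.capability("S1")
def drop_treasure_chest(payload: dict) -> str:
    return f"chest for anchor {payload['anchorId']}"


@listener.capability("S2")
def add_progress(payload: dict) -> str:
    return f"+{payload['points']} progress"


result = listener.on_message(b'{"type": "S1", "payload": {"anchorId": 7}}')
```

A message whose type is not handled by this subsystem is simply ignored, which keeps each subsystem unaware of capabilities owned by others.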

Automatic registration

Each Provider's Server is an independent application service cluster, and a Provider usually offers more than one capability. As mentioned above, a domain service such as the activity TOY service ("act-toy", which mainly provides various building-block-style gameplay modules) can provide very diverse modules, such as treasure chests, progress bars, and message boards. When the "act-toy" service integrates the SDK, it needs to register with the carrier pigeon, as a Provider, the functions of its subdomain that should be published. The registration scheme is shown in the figure below. When a Server starts with the SDK, it scans the defined Interface at startup and periodically afterwards, reads the custom annotation on each implementing Class, and registers the annotated Type with the carrier pigeon service by means of ordered messages (startup + periodic push). The carrier pigeon service stores the registration information once it receives it.
image
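The "scan the interface, read the annotation, report the types" step can be sketched as follows. This is only an analogy in Python: the decorator `pigeon_type` plays the role of the custom annotation, `PigeonProvider` the role of the defined Interface, and the shape of the registration payload is an assumption. Sending the payload (as an ordered message, at startup and periodically) is left out.

```python
from abc import ABC, abstractmethod
from typing import Dict


def pigeon_type(name: str):
    """Hypothetical stand-in for the custom annotation on an implementing class."""
    def annotate(cls):
        cls.pigeon_type = name
        return cls
    return annotate


class PigeonProvider(ABC):
    """Hypothetical stand-in for the Interface defined by the SDK."""

    @abstractmethod
    def handle(self, letter: dict) -> None:
        ...


@pigeon_type("treasure_chest")
class TreasureChest(PigeonProvider):
    def handle(self, letter: dict) -> None:
        pass


@pigeon_type("progress_bar")
class ProgressBar(PigeonProvider):
    def handle(self, letter: dict) -> None:
        pass


class Unpublished(PigeonProvider):
    # No annotation, so this capability is not published to the carrier pigeon.
    def handle(self, letter: dict) -> None:
        pass


def build_registration(app: str) -> Dict[str, object]:
    # Collect the annotated types of all direct implementations of the interface.
    types = sorted({
        cls.pigeon_type
        for cls in PigeonProvider.__subclasses__()
        if "pigeon_type" in vars(cls)
    })
    return {"app": app, "types": types}


registration = build_registration("act-toy")
```

Because the payload is rebuilt from the code on every push, a newly deployed capability shows up in the carrier pigeon service without any manual step.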

Comparison with ESB

In terms of its combination capability, the overall design follows the idea of a bus-type service, and the classic bus-type service in the industry is the ESB. From the perspective of today's Internet, ESB technology is relatively dated: it is better suited to the cross-language service architectures of large IT enterprises, where it takes on the job of bridging systems. The design of the carrier pigeon is different. It focuses on the concept of a "dumb pipe", ignores the adaptation and conversion work of integration, and is not suited to centralized synchronous integration. The specific comparison is as follows:

| | Applicable scenarios | Advantages | Shortcomings |
| --- | --- | --- | --- |
| Carrier pigeon service | Code module integration for temporary, short-lived activities | The "dumb pipe" design idea under a microservice system; does not focus on data integration or packet protocols. Adapts to the business characteristics of the activity domain, integrates flexibly, and shortens the R&D cycle | Weak ability to integrate large IT systems |
| ESB bus | Traditional large-scale system integration | Highly robust integration of traditional large-scale IT systems; adapts to the non-Internet system architectures of traditional industries | Carries heavy business logic and protocol conversion; service governance is highly centralized and becomes a bottleneck |

Next, the advantages of the carrier pigeon can be seen against two kinds of ESB scenario. As I understand it, ESB technology falls roughly into two categories:

  1. Open source ESB solutions: an open source framework in which several bundles are developed to run in the same Tomcat container and can be hot-replaced, but communication between bundles requires a custom private protocol. Because these frameworks are relatively old, most of the communication protocols they provide are of the WebService type, so development cost is very high. With Apache's early open source ServiceMix, for example, a lot of "wsdl" has to be written for communication, and most of it depends on interfaces published as WebServices.
  2. Self-developed ESB integration: the author once took part in an enterprise-level project in which the ESB integrated various systems, including REST interfaces, WebService interfaces published by C++ services, and WebService interfaces published by overseas third-party companies. Connecting the company's internal C++ services, Java Web services, and overseas third-party systems through an integration bus required a great deal of protocol conversion, including parsing "wsdl" files, reading node data, and converting it into JSON. This type of scenario is common in enterprise-level services and does not suit Internet scenarios. Its focus is on converting synchronous interface protocols, and it has little orchestration capability, so most orchestration is hard-coded.

Both 1 and 2 differ from the design idea of our carrier pigeon. In essence, the carrier pigeon emphasizes the asynchronous execution of end-of-code events. It uses a lightweight protocol structure, and because the connected systems share the same technology stack, the definition of the dictionary is also standardized: userId and anchorId represent the user id and the host id respectively, and no party needs to invent names such as uid and aid on a whim. The unified dictionary is defined by the carrier pigeon, the orchestration is collected in the carrier pigeon service, and it is visual, so there is no need to write complex XML node files.
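A unified dictionary is easy to enforce mechanically. The sketch below checks a letter against a shared dictionary; only the field names `userId` and `anchorId` come from the text, while the integer types, the `GLOBAL_DICTIONARY` constant, and the `validate_letter` function are assumptions for illustration.

```python
from typing import Dict, List

# Hypothetical shared dictionary: field name -> expected type (types are assumed).
GLOBAL_DICTIONARY: Dict[str, type] = {"userId": int, "anchorId": int}


def validate_letter(letter: Dict[str, object]) -> List[str]:
    """Return a list of problems; an empty list means the letter is consistent."""
    errors = []
    for key, value in letter.items():
        expected = GLOBAL_DICTIONARY.get(key)
        if expected is None:
            errors.append(f"unknown field: {key}")
        elif not isinstance(value, expected):
            errors.append(f"wrong type for {key}")
    return errors


ok = validate_letter({"userId": 1001, "anchorId": 2002})
bad = validate_letter({"uid": 1001, "anchorId": "2002"})
```

Rejecting ad hoc names such as `uid` at the boundary is what lets every connected system read the same letter without a conversion layer.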

Comparison with Pipeline

The concept of a bus is more or less similar to a "pipeline", and the pipeline idea is widely used in many technical fields:

  1. CI/CD continuous integration scenarios.
  2. Pipeline design patterns in open source frameworks, such as network byte stream processing in the Netty framework.
  3. Workflow technologies customized by the business.
  • The first scenario leans toward DevOps: continuous integration, continuous delivery, and continuous deployment, which process engineering work in a fast, automated, and repeatable way. This is a completely different field from the upper-level business orchestration we are addressing here.
  • The second scenario is essentially a design pattern implemented in code. Netty, for example, implements it with the "chain of responsibility" pattern: the network byte stream passes through a "factory pipeline", is processed step by step, and comes out as a finished product. This is also not the same field as the business problem we are addressing here.
  • The third scenario is often encountered in business development, especially where processes are complex. It involves the orchestration of processes and services, and each code block or service may be treated as a "node" that is chained into the pipeline to implement the business. How does it differ from our carrier pigeon?

| | Applicable scenarios |
| --- | --- |
| Carrier pigeon service | "Performance" at the end of code execution for temporary, short-term activities. It does not care about the order of the core process; it handles the diversity of scenarios at the end. |
| Pipeline | Business orchestration in the process domain; it can orchestrate and schedule core processes such as approval flows and marketing activities. |

In my opinion, the two technical solutions can coexist. For domain scenarios that are already stable, we can build flexible, customized process orchestration using the "pipeline" idea. One of the process nodes at the end of the "pipeline" can then be a "carrier pigeon" node, from which customized activity scenarios can continue to be freely combined.
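The coexistence of the two ideas can be illustrated with a tiny sequential pipeline whose last node hands the result to the carrier pigeon. Everything here is hypothetical: the node functions, the `task_done` pigeon id, and the `sent_letters` list that stands in for actually sending the letter.

```python
from typing import Callable, Dict, List, Tuple

Context = Dict[str, object]
Node = Callable[[Context], Context]


def run_pipeline(nodes: List[Node], context: Context) -> Context:
    # The pipeline owns the order of the core process: nodes run one after another.
    for node in nodes:
        context = node(context)
    return context


def check_task(context: Context) -> Context:
    return {**context, "taskDone": True}


def add_points(context: Context) -> Context:
    return {**context, "points": int(context.get("points", 0)) + 10}


sent_letters: List[Tuple[str, Context]] = []


def pigeon_node(context: Context) -> Context:
    # Final node: hand the result to the carrier pigeon, which decides where it goes.
    sent_letters.append(("task_done", dict(context)))
    return context


result = run_pipeline([check_task, add_points, pigeon_node], {"userId": 42})
```

The pipeline stays responsible for the stable core process, while whatever should happen afterwards is decided by carrier pigeon configuration rather than by adding more nodes.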

Carrier pigeon forwarding capability

Problem

Live events differ from marketing events: most of their trigger scenarios sit downstream of the platform, so a service has to listen to many topics to implement its own business logic. While the business form is still immature, events inevitably produce a large amount of short-term code. The life cycle of such code is usually limited to the activity cycle, and it represents exploration and pioneering. Domain modeling for live events is therefore two-sided: on the one hand, experience has to be replicated and accumulated from history; on the other hand, short-term code has to be developed quickly. As the number of services grows, more and more services need to listen to topics, and the same topic may already be consumed by both service A and service B. The usual answer is to copy the topic access code or wrap it in a package, which is not a friendly solution. There is a second problem: when a code block has completed its mission and is never enabled again, the topics it subscribes to keep being consumed idly until someone goes to the "Nydus" (Cloud Music's version of RocketMQ) management platform and takes the consumers offline. Over time, message management becomes harder and harder. Just as we use layering to resolve circular dependencies between services, it would be unthinkable to find one day that the asynchronous links have become a mess as well. The following figure shows the problems encountered during the "dev" process:
image

Solution

We hope the carrier pigeon can keep playing to its strengths. It should not be just a bus service that performs end-of-code "performance"; it should also help the services in the activity domain with message governance. This first requires a change in how we think about development: each developer provides a dedicated topic for their own "dev" module, so that the topic and the "dev" module form a closed loop. If every "dev" module has this capability, the carrier pigeon only needs to do what it is good at: the topics that a module used to listen to directly become topics that the carrier pigeon forwards to the module.
image

Under this solution, what the carrier pigeon needs is the ability to forward messages, and the forwarding scenarios can be abstracted into three types:

  1. Business-defined distribution: the original message must be handled by a specific service, which does some business cleaning and then forwards it to certain business scenarios. This is a customized scenario.
  2. Automatic distribution based on registration reports: as long as the receiver of the carrier pigeon service listens to enough topics, the carrier pigeon can distribute messages automatically to the topic of each business module. The distributed messages have different structures and are distinguished by tags, so consumers can choose which ones to pull. Distinguishing by tag also relieves the starvation caused by too many messages on one topic, as mentioned in the figure above.
  3. Message storage: not all messages need to be stored. For events, the more important messages need to be backed up for later replay, for example gift messages, which are heavily used in every scenario. To improve write throughput, batch consumption can be used to write in batches. TiDB was the initial choice here; compared with DDB (NetEase's distributed database), TiDB is better suited to storing such messages, which most of the time do not need to be read.
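The batch-write idea in the third scenario can be sketched as a small buffer that flushes in fixed-size batches. This is an illustration only: `BatchWriter` is hypothetical, `write_batch` stands in for a multi-row insert into the store, and the `giftId` field is made up.

```python
from typing import Callable, List


class BatchWriter:
    """Hypothetical buffer that turns many consumed messages into few writes."""

    def __init__(self, write_batch: Callable[[List[dict]], None], batch_size: int) -> None:
        self._write_batch = write_batch   # e.g. one multi-row INSERT per call
        self._batch_size = batch_size
        self._buffer: List[dict] = []

    def consume(self, messages: List[dict]) -> None:
        # Called with each batch of consumed messages; writes only full batches.
        self._buffer.extend(messages)
        while len(self._buffer) >= self._batch_size:
            batch = self._buffer[:self._batch_size]
            self._buffer = self._buffer[self._batch_size:]
            self._write_batch(batch)

    def flush(self) -> None:
        # Write whatever is left, for example on a timer or at shutdown.
        if self._buffer:
            self._write_batch(self._buffer)
            self._buffer = []


written: List[List[dict]] = []
writer = BatchWriter(written.append, batch_size=2)
writer.consume([{"giftId": 1}, {"giftId": 2}, {"giftId": 3}])
writer.flush()
```

A real implementation would also need to decide when to acknowledge consumed messages relative to the write, which this sketch does not address.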

According to the three scenarios mentioned, the overall architecture is as follows:
image

Our receiver can dynamically create consumer beans in the Spring container according to the configured topics, so that accessing a new topic does not require modifying the code of the carrier pigeon service. When a topic message is received, it is forwarded along three paths, as shown in the figure: automatic forwarding, custom forwarding, and message storage. Three different consumers are used so that each consumption thread pool is wide enough and the forwarding topic queues are sufficient for consumption. In automatic forwarding, a received message is tagged again according to the type of its source topic and forwarded to the configured topics. The forwarding relationships can be managed by R&D themselves, and their life cycle is configurable, which suits short-term activities well: after an event ends, idle consumption by the service is reduced. Of course, one more problem has to be considered: the extra hop can fail. In that case, we store the failed message as an exception and retry it.
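A forwarding relationship with a configurable life cycle can be modeled as a small record plus a lookup. The sketch below is an assumption-laden illustration: `ForwardRule`, `plan_forwarding`, the topic names, and the dates are hypothetical, and the tag is simply set to the source topic.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Tuple


@dataclass(frozen=True)
class ForwardRule:
    """Hypothetical forwarding relationship managed by the module's developer."""
    source_topic: str
    target_topic: str
    starts_at: datetime
    ends_at: datetime

    def active(self, now: datetime) -> bool:
        return self.starts_at <= now < self.ends_at


def plan_forwarding(rules: List[ForwardRule], source_topic: str,
                    now: datetime) -> List[Tuple[str, str]]:
    # Return (target topic, tag) pairs; the tag carries the source topic.
    return [
        (rule.target_topic, source_topic)
        for rule in rules
        if rule.source_topic == source_topic and rule.active(now)
    ]


rules = [
    ForwardRule("gift", "energy_bar_dev", datetime(2024, 1, 1), datetime(2024, 1, 8)),
    ForwardRule("gift", "lucky_koi_dev", datetime(2024, 1, 1), datetime(2024, 2, 1)),
]
during_event = plan_forwarding(rules, "gift", datetime(2024, 1, 5))
after_event = plan_forwarding(rules, "gift", datetime(2024, 1, 10))
```

Once a rule expires, messages are no longer forwarded to that module's topic, so nothing is left consuming idly after the event ends.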

We also stress-tested the forwarding capability of the carrier pigeon. The queue length was set wide enough, and the table-writing link was left out. For pure forwarding, IO consumption is concentrated mainly in broker communication; if all messages are written to disk asynchronously, system throughput is better still. With the "8U16G" server specification and 32 Docker cloud containers, the receiver can sustain 3 million messages per minute (300w/min) with CPU usage holding at about 45%.
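To put the stress test figure in per-second terms, reading "300w" as 3,000,000 (w stands for 10,000) and assuming, purely for illustration, that the load is spread evenly across the 32 containers:

$$
\frac{3{,}000{,}000\ \text{msg/min}}{60\ \text{s/min}} = 50{,}000\ \text{msg/s},
\qquad
\frac{50{,}000\ \text{msg/s}}{32\ \text{containers}} = 1{,}562.5\ \text{msg/s per container}
$$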

Although the forwarding capability of the carrier pigeon solves our problem, that does not make it the optimal solution. The optimal solution I hope for is to pair the carrier pigeon with a FaaS platform. FaaS can cover many message-cleaning scenarios, and it schedules machine resources better. The combination of FaaS + BaaS is the trend of future system technology transformation.

Summary

What are the benefits of spending time building such a system, and what difficulties did we encounter?

Development and maintenance

Here is a case that illustrates the benefit to R&D of developing business with the carrier pigeon service. Dayong, a business developer, needs to build an activity that involves list promotion, monster killing in the live room, live-room floating screens, and anchor tasks. The flow of the event is:

  1. After the anchor completes the "xx" task, a "monster" is dropped in the live room and a floating screen notification is sent.
  2. After the anchor completes the "xx" task, some points are added to the list.
  3. At the day rollover, the top N of the list are promoted, and a floating screen notification is sent for the top N.

Fortunately, all of the functions involved have been developed before, and this time only the combination needs to be built. Unfortunately, Dayong is the one who has to develop that combination.
Once he met the carrier pigeon, everything was solved. Every function was already connected to the carrier pigeon, so Dayong only needed to open the carrier pigeon admin console and configure task completion to fly to carrier pigeons such as "killing monsters" and "floating screen". Testing takes only a short time, the release process is shortened, and no one-off glue code is needed. After all, the logic of dropping a monster when a task completes does not really belong in the task system.

Difficulties

When dealing with message governance, we ran into many difficulties in rolling it out across the business, because the topics already connected to existing systems had to be changed. The approach stands up well in new business scenarios, but an old system (for example, the task system) needs a new task topic to send and receive the routed old messages. Although the SDK distinguishes different services by tag (source topic), data cleaning and conversion of data structure protocols cannot be avoided here, and the task system's own cleaning logic may be part of the problem. The best solution to this kind of problem is generally to do the message cleaning in a scripting language, which is more flexible.
Capacity evaluation did not go smoothly at first either. The table-writing link and the carrier pigeon service were stress-tested inside one service, and when some messages were written to TiDB, overall throughput was hard to raise even with batch writes. We therefore split the TiDB-writing link into its own service and stress-tested it separately, so that the original carrier pigeon service only forwards and write storage can be evaluated independently. With the TiDB-writing link removed, we stress-tested the throughput of the forwarding service on its own. The "4U8G" server specification was selected at first; because forwarding is CPU-intensive, overall throughput improved markedly after the specification was upgraded, roughly doubling. Based on our estimate of online message volume, 1 million messages per minute (100w/min) is probably the limit of our existing business. We ran stress tests at different message volumes and produced a set of results, and capacity will be scaled out or in according to the message volume range.

Future outlook

This article presents a system design for the problems of business scenario combination and message governance that the Cloud Music live event technical team encountered in daily development, in the hope of giving readers a useful reference. At present the work focuses on solving problems in the existing technology. Later, we will consider message replay as an important means of repairing data and, from the perspective of exception handling, look at how to help developers quickly fix online problems.
Looking ahead to internationalization, where messages are not sufficiently aware of the data center of the user's region, I hope some mechanism can help the service side forward messages to the relevant data center and solve business-side routing after "Nydus" is internationalized. Given international data center routing, how to route messages between modules precisely to the data center where the user is located is also a problem we need to think about more deeply.
We hope the carrier pigeon can become one of the important solutions in the event business domain of live broadcast products, helping more colleagues with the troubles of "reuse" and "combination", and that it can go international and adapt to more product scenarios.

Team introduction

The Cloud Music live event middle-platform technical team is mainly responsible for event business R&D for live broadcast products. It provides one-stop event middle-platform solutions for products such as Look live, Sonic, and Xinyu, with building the architecture of a large live event middle platform as its direction. Interested readers are welcome to get in touch.

This article is published from the NetEase Cloud Music technical team, and any form of reprinting of the article is prohibited without authorization. We recruit all kinds of technical positions all year round. If you are ready to change jobs and happen to like cloud music, then join us at staff.musicrecruit@service.netease.com
