1. Preface

In today's microservice architectures, the dependency problems caused by a large number of services often become a stumbling block during development. The topic comes up regularly at technical meetups, and people are actively discussing how to solve it. This article therefore introduces the principle of traffic coloring and the development problems it can solve in a microservice architecture.

2. The concept of traffic coloring

Simply put, traffic coloring attaches a label to request traffic. The request then carries that label along the entire call link, where it can drive traffic scheduling and similar functions.

Many capabilities can be built on traffic coloring, such as grayscale release, blue-green deployment, and swimlane isolation.

First, a brief note on how traffic coloring relates to microservices, so that the title does not read as clickbait. Would a monolithic application have any use for traffic coloring? Its request path is simply App -> Load Balancer -> Application. The link is so short that traffic coloring has nothing to offer. Only when there are many services, and a single business function spans N of them, does it become necessary to color and steer traffic in order to solve the problems that arise during development and testing.

3. Applications of traffic coloring

Test environments: deploying only what changed instead of everything

Besides the frequently used T environment, our test setup contains many MF environments. MF environments are mostly used for independent requirements, while regular version iterations are tested in the T environment.

  • Problem 1: Configurations differ between environments

Most features are tested in the T environment. When an independent requirement has to be tested in an MF environment, the corresponding services must be deployed there. During deployment, configuration entries are often missing or wrong, and the application fails to start.

  • Problem 2: Unchanged services must also be deployed

Suppose a requirement needs to be tested in an MF environment and the changed service has been deployed there. During joint debugging it turns out that none of the downstream services it depends on are deployed, even though those services are not changed by this requirement and the interfaces it relies on are already live.

If those services are not deployed in that environment, the link cannot be exercised end to end. The downstream teams then have to be asked to deploy their services, and they may hit the same per-environment configuration problems. The preparation phase of joint debugging drags on and the project schedule suffers.

  • How does traffic coloring solve these problems?

Suppose a requirement is under development. A version is configured in each application that the requirement changes, and that version is stored in the application's metadata in the registry.

Then a swimlane (an independent environment) is created for this requirement, and only the applications changed by the requirement are deployed into it. The downstream applications they depend on do not need to be deployed. If no matching service provider is found in the current environment, the request is routed to the stable environment. If the stable environment has no such service either, an error is reported.
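The fallback rule described here can be sketched in a few lines. This is a minimal illustration and not the author's implementation: the `ServiceInstance` and `SwimlaneRouter` classes and the `stable` label are hypothetical names chosen for the example, and a real system would read each instance's version from the registry metadata.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a registry entry: an address plus its "version" metadata.
final class ServiceInstance {
    final String address;
    final String version;

    ServiceInstance(String address, String version) {
        this.address = address;
        this.version = version;
    }
}

final class SwimlaneRouter {
    // Assumed label of the baseline (stable) environment.
    static final String STABLE = "stable";

    // Prefer providers in the requested swimlane, fall back to stable, otherwise fail.
    static List<ServiceInstance> select(List<ServiceInstance> all, String lane) {
        List<ServiceInstance> inLane = withVersion(all, lane);
        if (!inLane.isEmpty()) {
            return inLane;
        }
        List<ServiceInstance> stable = withVersion(all, STABLE);
        if (stable.isEmpty()) {
            throw new IllegalStateException(
                "no provider in lane '" + lane + "' or in the stable environment");
        }
        return stable;
    }

    private static List<ServiceInstance> withVersion(List<ServiceInstance> all, String version) {
        List<ServiceInstance> matched = new ArrayList<>();
        for (ServiceInstance instance : all) {
            if (instance.version.equals(version)) {
                matched.add(instance);
            }
        }
        return matched;
    }
}
```

Calling `SwimlaneRouter.select(instances, "feature-a")` returns the `feature-a` providers when the swimlane has any, and the stable providers otherwise, which is why unchanged downstream services never need a swimlane deployment.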


Locally started services registering themselves

Developers sometimes start a service locally, mainly to debug a specific problem. The advantage is that a problem seen in the test environment can be reproduced quickly and the faulty code found in time.

Because a locally started service also registers itself in the registry, test-environment requests may be routed to it. The code running locally may not be the latest, which leads to failed calls.

The common workaround today is to disable registration when a service is started locally, so that normal test requests are never routed to it.

With traffic coloring, a developer can instead give the locally started service its own version number. It only has to differ from the version under normal test. Normal test requests will then not be routed to the developer's instance.

Application-level grayscale release

Interface-level grayscale is currently controlled inside the application. At the application level, however, there is no particularly good way to control grayscale at the moment. Take a technical refactoring that switches the Redis client from Lettuce to Jedis: the grayscale scope here is the whole application. The current approach is to release a single node and then halt the release process. The share of traffic that reaches the new code is determined by the total number of service instances and cannot be controlled flexibly.

With traffic coloring, you can release a new node and bump its version. If the previous version is V1, the newly released node is V2. Initially V1 carries all production traffic. The gateway then forwards selected traffic to V2 according to some rule, such as a user whitelist, region, or percentage of users. If a problem appears, traffic can be switched back to V1 at any time.
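As a rough sketch of such a gateway rule, the class below decides which version label a request should be colored with. `GrayReleaseRule` and its fields are hypothetical names for illustration. It covers only the whitelist and user-percentage rules and leaves region routing out.

```java
import java.util.Set;

// Hypothetical gateway-side rule that picks the version label for a user's request.
final class GrayReleaseRule {
    private final Set<String> whitelist; // users always sent to V2
    private final int percent;           // share of the remaining users sent to V2, 0..100

    GrayReleaseRule(Set<String> whitelist, int percent) {
        if (percent < 0 || percent > 100) {
            throw new IllegalArgumentException("percent must be within 0..100");
        }
        this.whitelist = whitelist;
        this.percent = percent;
    }

    // Returns the version label the gateway would attach to the request.
    String versionFor(String userId) {
        if (whitelist.contains(userId)) {
            return "V2";
        }
        // Hashing the user id keeps each user in the same bucket on every request.
        int bucket = Math.floorMod(userId.hashCode(), 100);
        return bucket < percent ? "V2" : "V1";
    }
}
```

Under these assumptions, rolling back amounts to publishing a rule with an empty whitelist and `percent` set to 0, after which every request is labeled V1 again.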


Graceful service offline

Taking a service offline gracefully, without failed requests, still requires a lot of work. For example, an instance is first deregistered from the registry when it is released, but callers still hold a cached copy of the service's instance list. Only after that cache has been refreshed, which takes some time, do they stop sending requests to the target instance.

With a coloring-based implementation, the instances to be taken offline (IP:PORT) are pushed to the gateway through the configuration center, and the gateway colors requests with that information. The coloring information follows the request across the entire link. Load-balancing components inside applications, and middleware such as MQ, filter out the listed instances, so no traffic reaches an instance that is about to go offline.
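The filtering step inside a load-balancing component might look like the sketch below. The header name `x-offline-instances` and its comma-separated format are assumptions made for this example, since the article does not specify how the offline list is encoded.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class OfflineInstanceFilter {
    // Hypothetical header name; the value is a comma-separated list of IP:PORT entries.
    static final String HEADER = "x-offline-instances";

    // Parse the coloring value carried by the request into a set of addresses.
    static Set<String> parse(String headerValue) {
        Set<String> offline = new HashSet<>();
        if (headerValue == null) {
            return offline;
        }
        for (String part : headerValue.split(",")) {
            String address = part.trim();
            if (!address.isEmpty()) {
                offline.add(address);
            }
        }
        return offline;
    }

    // Drop every candidate that is about to go offline before load balancing picks one.
    static List<String> selectable(List<String> candidates, Set<String> offline) {
        List<String> result = new ArrayList<>();
        for (String address : candidates) {
            if (!offline.contains(address)) {
                result.add(address);
            }
        }
        return result;
    }
}
```

Because the list travels with the request, the filter takes effect immediately and does not have to wait for each caller's instance cache to refresh.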

Faster production releases

Rolling deployment is the mainstream release method today. Its advantage is low cost: no extra deployment resources are needed, because instances are replaced one for one, a few at a time. Its drawbacks are the long release time and the heavy dependence on release order across the full link. If services are released in the wrong dependency order, the result is a production incident.

To speed up releases, blue-green deployment can be built on traffic coloring. At release time a complete V2 fleet is deployed alongside V1, with the same number of instances. Because V2 receives no traffic yet, release order does not matter and all services can be released at the same time. Traffic is then distributed through the gateway: a small amount goes to V2 first for verification, the share is increased gradually if no problems appear, and finally the V1 containers are reclaimed.


Release speed does improve, but blue-green deployment is expensive: resource cost doubles. Although the old resources are reclaimed after the release, the total resource pool must still be able to hold both versions running in parallel.

Is there a compromise that can improve release efficiency without increasing resource costs?

You can release by replacement instead: release half of the instances first. This half becomes the V2 version. It carries no traffic while it is being released, so the services can still be released in parallel.

After that release completes, gradually shift traffic to V2 and verify it. Once verification passes, the other half of the instances can be released. Total resources stay the same, but there is a serious problem: half of the instances are stopped outright. Can the remaining instances carry the current traffic? The applications in the trading domain all serve C-end (consumer) users, and their traffic can climb to a high volume within a short time.
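Whether the remaining half can carry the load comes down to simple arithmetic, sketched below. The class name and all figures are made up for illustration. Real per-instance capacity would have to come from stress-test results.

```java
final class HalfFleetCapacity {
    // True if the instances still serving during the first half of the release
    // can absorb the expected peak.
    static boolean canCarryPeak(int totalInstances, int qpsPerInstance, int peakQps) {
        // Half of the fleet (rounded down) is taken out of service to become V2.
        int remaining = totalInstances - totalInstances / 2;
        return (long) remaining * qpsPerInstance >= peakQps;
    }
}
```

For example, with 20 instances that each sustain 500 QPS, the 10 remaining instances provide 5,000 QPS. A peak of 4,000 QPS fits, while a peak of 6,000 QPS does not, and the release would need to wait for a quieter period or a smaller batch.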

Full-link stress testing

Full-link stress testing is essential for an e-commerce business. There are N major promotions every year, and stress tests must be run in advance to make sure the system stays stable during them. The core of full-link stress testing is telling traffic apart: is a request from a normal user, or is it stress-test traffic generated by the stress-testing platform?

Only once the traffic can be told apart can stress-test traffic be routed accordingly. For example, its database and Redis access must be routed to shadow databases. Traffic coloring makes it easy to label traffic by type.
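A minimal sketch of that decision is shown below, assuming the stress-testing platform marks its requests with a header. The header name `x-stress-test` and the data source names `primary` and `shadow` are hypothetical.

```java
import java.util.Map;

final class StressTrafficRouter {
    // Hypothetical header set by the stress-testing platform on every request it sends.
    static final String FLAG_HEADER = "x-stress-test";
    static final String PRIMARY = "primary";
    static final String SHADOW = "shadow";

    // A request counts as stress-test traffic only when the flag is explicitly "true".
    static boolean isStressTest(Map<String, String> headers) {
        return "true".equalsIgnoreCase(headers.get(FLAG_HEADER));
    }

    // Name of the data source the data access layer should use for this request.
    static String dataSourceFor(Map<String, String> headers) {
        return isStressTest(headers) ? SHADOW : PRIMARY;
    }
}
```

Defaulting to the primary data source when the flag is absent means that real user traffic can never end up in the shadow database because of a missing label. The opposite mistake, stress traffic that loses its label and writes to production data, is the reason the label has to survive every hop.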

4. Implementing traffic coloring

Applications need a version

Every application needs a version, which in practice can be tied to each iteration. The version is simply written into the project's configuration file. When the project starts, it registers the version together with its instance information in the registry. This information is generally called metadata.

With metadata in place, routing can match instances against the coloring information. For example, if a request specifies that calls to the order service must go to version V2, how does routing find the V2 instances? By looking at their metadata.

Propagating coloring information across the full link

Propagating the coloring information across the full link is essential. If it is dropped anywhere along the link, traffic routing can no longer be controlled at every node. The propagation works on the same principle as distributed tracing.

Mainstream distributed tracing systems such as SkyWalking and Jaeger largely borrow the ideas of Google Dapper. Each request gets a unique TraceId at the entry point, which ties the whole link together. The TraceId must be passed along the entire link, and the traffic coloring information must be passed along in the same way.

Propagation is generally implemented in one of two ways: in a standalone agent package, or through instrumentation in the base framework. If services on the intranet call each other over HTTP, the information is passed in request headers. If they use RPC, it can be passed through RpcContext.
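For the HTTP case, the idea reduces to reading a header on the way in and writing the same header on the way out. The sketch below uses plain maps to stay framework-neutral. The header name `x-traffic-tag` and the class name are assumptions, and a real implementation would go through the header API of its HTTP client and server.

```java
import java.util.HashMap;
import java.util.Map;

final class TagHeaderPropagation {
    // Hypothetical header that carries the coloring label between services.
    static final String TAG_HEADER = "x-traffic-tag";

    // Read the label from an inbound request; null means the request is not colored.
    static String extract(Map<String, String> inboundHeaders) {
        return inboundHeaders.get(TAG_HEADER);
    }

    // Build the headers for a downstream call, copying the label so the next hop sees it.
    static Map<String, String> outboundHeaders(Map<String, String> inboundHeaders) {
        Map<String, String> outbound = new HashMap<>();
        String tag = extract(inboundHeaders);
        if (tag != null) {
            outbound.put(TAG_HEADER, tag);
        }
        return outbound;
    }
}
```

Note that HTTP header names are case-insensitive, which a plain `Map` lookup does not honor. That is one more reason to rely on the HTTP library's own header handling in practice.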

Once the information reaches an application, that application goes on to call other downstream interfaces, and propagation must continue. The information is usually stored in a ThreadLocal and read back when the downstream call is initiated. Be careful when work is handed off to a thread pool: the task runs on a different thread, and the information in the ThreadLocal is lost. There are ways to carry ThreadLocal values into asynchronous execution, such as transmittable-thread-local.
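The loss on a thread-pool hop, and one hand-rolled way around it, can be demonstrated with the JDK alone. `TrafficTagContext` is a hypothetical class written for this sketch. It captures the label when a task is wrapped and restores it on the worker thread, which is the general idea that libraries such as transmittable-thread-local automate, and it does not use that library's API.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class TrafficTagContext {
    private static final ThreadLocal<String> TAG = new ThreadLocal<>();

    static void set(String tag) { TAG.set(tag); }
    static String get() { return TAG.get(); }
    static void clear() { TAG.remove(); }

    // Capture the caller's tag now and restore it around the task on the worker thread.
    static <T> Callable<T> wrap(Callable<T> task) {
        final String captured = TAG.get(); // read on the submitting thread
        return () -> {
            String previous = TAG.get();   // whatever the pooled thread held before
            TAG.set(captured);
            try {
                return task.call();
            } finally {
                // Put the worker back as it was so tags do not leak between tasks.
                if (previous == null) {
                    TAG.remove();
                } else {
                    TAG.set(previous);
                }
            }
        };
    }

    // Demo helper: returns the tag that a pool thread observes for the current caller.
    static String tagSeenInPool(boolean wrapped) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Callable<String> task = TrafficTagContext::get;
            Callable<String> toRun = wrapped ? wrap(task) : task;
            return pool.submit(toRun).get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

After `TrafficTagContext.set("lane-a")`, calling `tagSeenInPool(false)` returns `null` because the worker thread has its own empty ThreadLocal, while `tagSeenInPool(true)` returns `lane-a`. Restoring the previous value in `finally` matters because pooled threads are reused across unrelated requests.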

Traffic Routing Control

Once traffic carries label information, the remaining work is routing each request to the right instance based on the label. If the internal stack is Spring Cloud, routing can be controlled through Ribbon. If it is Dubbo, the routing logic can be customized by extending Dubbo's AbstractRouter. An in-house RPC framework will need corresponding extension points for controlling routing.

5. Summary

Traffic coloring is very useful overall, but it is also a major technical undertaking. Beyond getting coloring information to flow through the base frameworks, what matters even more is cooperation from the business teams. Onboarding through an agent is preferable. Otherwise every business team has to upgrade a dependency package, which is tedious.

*Text / Yin Jihuan
@德物科技 official account

