Preface
Human-machine confrontation aims to unite the various security teams to jointly combat the black- and gray-market industries. For historical reasons, the business side had many different ways to access each security capability, with more than a dozen integration systems and protocols in a fragmented state. Externally, this made it inconvenient for businesses to access the security capabilities; internally, it hindered joint construction within the security teams. To improve efficiency on all fronts, the human-machine confrontation service made large-scale use of cloud services during its construction, with good results. Looking back on moving security capabilities onto the cloud, it was a journey from vagueness to clarity, from hesitation to conviction; here is a brief account to share with everyone.
About the cloud
What is the cloud
As I understand it, the essence of the cloud is free and flexible resource sharing. Resources can join and leave at any time, like clouds drifting in the sky: they come and go unpredictably, gathering and dispersing. From afar it looks like the same cloud, but up close it is never quite the same.
Expectations for the cloud
From the perspective of computer applications, the ideal cloud turns computing, network, and storage resources into something like water and electricity in daily life, available at the turn of a switch. Compared with the era of physical-machine deployment, cloud users no longer need to flee from data loss caused by a machine crash, nor grieve over damaged hardware while working overtime to restore services. Concretely, we expect:
- Automatic disaster recovery (restart on abnormal exit, fault migration)
- Easy multi-region deployment (multi-cluster)
- Resource isolation: one service's failure or greed harms only itself, never its neighbors
- Rapid scaling: resources arrive on demand and are ready to serve as soon as they do

If all of that holds, it is perfect: a blessing for the development, operations, and testing folks!
Analyzing the move to the cloud
As the company's basic services have improved, the in-house service facilities can now support our move to the cloud. Investigation showed that the company's cloud-related platforms and deployment methods include:
CVM
CVM virtualizes resources such as hard disks and CPUs on top of physical machines. Users essentially still work the way they did with physical machines, so the experience does not fundamentally change, but the pain of machine decommissioning is avoided.
Containerized deployment platform
Docker-based containerized deployment is currently the industry's mainstream way of deploying to the cloud. It lets us build once and run anywhere, which nicely meets the requirements of free movement and resource isolation. The system environment is inherently well maintained: all programs, scripts, and configurations live in the image, so nothing is ever lost or overlooked in maintenance. The physical-machine-era phenomenon of scripts and configurations being irrecoverably destroyed along with a damaged machine no longer appears, and the maintenance problems that used to rely on personal diligence or hard process constraints naturally disappear.
The emergence of container-orchestration systems such as K8s brought powerful platform features like automatic disaster recovery, failover, and multi-cluster deployment, taking us one step closer to the goal of cloud services. On top of the K8s orchestration and scheduling mechanism, the company has built a series of deployment platforms, such as the 123 platform, GaiaStack, and TKE, which cooperate neatly with the automatic association management of addressing services such as L5/Polaris, providing complete platform support for cloud services. Add the flexible resource allocation of the resource-management platform, and the convenience is greater still. For example, applying for TKE container resources (CPU/memory/storage, etc.) on the "cloud ladder" platform is as smooth as placing an order on Taobao; with approvals moving promptly, resources can be in place within minutes. The first time I experienced it, I was surprised and impressed.
Based on this in-depth understanding and analysis of the company's services, we finally decided to move the human-machine confrontation service to the cloud on the TKE deployment platform, using Docker containerized deployment.
The core impact of going to the cloud on development
A core change brought by moving to the cloud is that resources become fluid. To ease system resource scheduling, service node IPs are variable. Once on the cloud, we must face IP changes on the upstream business side, in the service itself, and on the downstream dependency side, which gives rise to a series of constraints and dependencies:
- Upstream changes: authenticating clients by source IP is no longer feasible; a more flexible authentication method is needed.
- Changes to the service itself: the externally published service address must stay associated with the actual serving addresses. If a downstream dependency requires authentication and uses source-IP authentication, that downstream must be modified to support a more flexible method. In most cases the service also needs routine operations work on itself, such as frequent configuration changes and releases; the old operations tools no longer apply, and a centralized configuration center is required.
- Downstream changes: not a big problem, as long as L5 or Polaris automatic addressing is provided; the platform currently offers the corresponding service-governance functions.
System architecture and cloud planning
The main module of the human-machine confrontation data center is a variable-sharing platform. It has two cores: a query service module and a web module backing the variable-management API. Both are developed on the tRPC-Go framework. The system architecture diagram is as follows:
Setting aside some dependent systems, only the two core parts are currently deployed to the cloud with TKE. The overall TKE deployment architecture is as follows:
In the deployment plan for the whole system, two workloads are created on TKE: black_centre and http_apiserver. These two parts are the core. black_centre carries users' variable queries. Requests from the web side pass through the smart gateway, then through CLB, and finally reach http_apiserver, which performs the real business processing and mainly supports functions such as case lookup, system variable management, and applications for variable-query access. You may wonder why http_apiserver sits behind a CLB instead of being reached directly from the smart gateway. The main reason is that once the module is on the cloud, the IPs of its compute nodes can change at any time, yet when applying for an internal domain name the company does not support pointing it at a Polaris or L5 service; only fixed IPs can be configured. The fixed VIP provided by CLB solves this problem nicely.
A brief word about CLB (Cloud Load Balancer), a service that distributes traffic across multiple cloud servers. CLB expands a system's external service capacity through traffic distribution and improves availability by eliminating single points of failure. The most common usage is to forward requests automatically, according to the configured forwarding rules (access port, HTTP URL, and so on), to the workloads bound to each rule. Back in the human-machine confrontation scenario, ours are mainly low-load HTTP services, so multiple services can share one service address simply by configuring URL forwarding rules. For example, if cluster services A and B both serve HTTP externally and both need port 80, the traditional approach requires at least two machines; with sharing, we only need URL distribution rules to route different interfaces to the corresponding cloud services. nginx offers similar functionality, of course, but in ease of use, maintainability, and stability, CLB is a level above.
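To make URL-based distribution concrete, here is a toy Go sketch of the pattern CLB implements at the platform level with its forwarding rules; the backend addresses and path prefixes are hypothetical, and a real CLB does this outside your process:

```go
package main

// Toy illustration of URL forwarding: requests arriving at one port-80
// address are routed to different backends by path prefix. CLB applies the
// same idea via its forwarding rules; this sketch only shows the concept.
import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// proxyTo builds a reverse proxy for one (hypothetical) backend address.
func proxyTo(raw string) *httputil.ReverseProxy {
	target, err := url.Parse(raw)
	if err != nil {
		log.Fatal(err)
	}
	return httputil.NewSingleHostReverseProxy(target)
}

func main() {
	mux := http.NewServeMux()
	mux.Handle("/service-a/", proxyTo("http://10.0.0.10:8080")) // rule 1: service A
	mux.Handle("/service-b/", proxyTo("http://10.0.0.20:8080")) // rule 2: service B
	log.Fatal(http.ListenAndServe(":80", mux))
}
```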
When deploying applications on TKE, containers scale easily and are generally not pinned to a fixed IP, so cloud applications should be stateless by design. The image should be as small as possible and the business logic as lean as possible, avoiding piling too many responsibilities into one module. Since the human-machine confrontation modules are brand new, there is little legacy baggage; although the protocol is flexible and compatible, each module is functionally independent with a single responsibility, which suits this model well.
As for how to apply for resources, create namespaces, create workloads, and so on, the process is long, so I will spare the screenshots. The product help documentation gives good guidance; see https://cloud.tencent.com/document/product/457/54224. Although I use the intranet TKE, the overall experience of the cloud service differs little between the intranet and the public cloud.
Throughout my use of TKE I have kept feeling its power and stability, but the feature I want most is the ability to clone an existing workload. In real scenarios I often want to take an existing workload, tweak two or three parameters (most likely the workload name or the image version), and create a new workload quickly. Common cases include test verification, redeployment elsewhere, rebuilding a workload (many workload parameters cannot be changed and force a rebuild), and even deploying a new service that uses resources and runs the same way as an existing one. Today, creating a new workload means filling in a long list of parameters, which is cumbersome. A small need, a big gain: solving it would make TKE even more convenient to use.
Service discovery on the cloud
From the architecture analysis above, we have prepared the images and the TKE platform, and our service is running; but how does anyone else find its address? Some would say: register the service address in L5/Polaris and be done. But don't forget that while cloud nodes run, service IPs can change at any time. We need a way to associate the changing cloud-service addresses with Polaris and keep the Polaris address list in sync; only then is the address list truly managed. As it happens, TKE provides exactly this feature, serving our purpose perfectly. The steps are:
- Create the workload, i.e., get the service running. Ours is already in place.
- Create a corresponding Polaris service for the workload, for later use.
- Create a Polaris association rule: first enter the Polaris association page, shown below.
Take care to select the corresponding business cluster, then enter the creation page:
Fill in the information of the Polaris service created earlier, associate it with the specified container service, and submit. This completes the binding between Polaris and the dynamic service addresses.
When we scale the container service in the workload up or down, we can see addresses added to or removed from the Polaris service accordingly, so whether deployments change or disaster-recovery migration occurs, the service address seen by the business side stays valid.
Meanwhile, to accommodate old users' habit of using L5, Polaris supports creating an L5 alias for a Polaris service; through the alias, users can happily address the same Polaris-published service the L5 way. On the Polaris site, choose "Service Management" -> "Service Alias" in the left menu to create an alias.
Past analysis found that old versions of l5agent are incompatible with this; upgrading l5agent to the latest version solves it.
Rethinking authentication on the cloud
After the system moves to the cloud, the most typical change is the impact on the authentication model from node IPs no longer being fixed. Source-IP authentication from the old model no longer applies; a more flexible method is needed. Two authentication schemes are common in the industry:
- SSL/TLS authentication, often used for transport encryption rather than access control; the tRPC-Go API supports it.
- Token-based authentication, which is the key solution in this system.

Many methods exist; we need to pick the right one for each scenario.
Upstream access authentication
When a user applies for access, the user must be authenticated, and source-IP authentication is clearly infeasible; after weighing the options we adopted token authentication. When users apply for access, we assign them an appid and a token. When they visit, we require the message header to carry a timestamp, a serial number, the appid, a random string, and a signature, where the signature is generated from the timestamp, serial number, appid, random string, and token together. On receiving a request, the server computes a signature over the same fields with the same algorithm as the client and compares it with the one in the request: if they match, the request passes; otherwise it is rejected. The timestamp also serves to prevent replay.
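A minimal Go sketch of this scheme; the concatenation order, the choice of HMAC-SHA256, and the five-minute replay window are illustrative assumptions, not the exact production algorithm:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
	"time"
)

// sign derives a request signature from the header fields plus the secret
// token. The field order and HMAC-SHA256 are assumptions for illustration.
func sign(appid, serialNo, randStr string, ts int64, token string) string {
	mac := hmac.New(sha256.New, []byte(token))
	mac.Write([]byte(appid + "|" + serialNo + "|" + randStr + "|" + strconv.FormatInt(ts, 10)))
	return hex.EncodeToString(mac.Sum(nil))
}

// verify recomputes the signature on the server side and checks the
// timestamp against an assumed five-minute replay window.
func verify(appid, serialNo, randStr, gotSig string, ts int64, token string) bool {
	if d := time.Now().Unix() - ts; d < -300 || d > 300 {
		return false // stale or future timestamp: likely replay or clock skew
	}
	want := sign(appid, serialNo, randStr, ts, token)
	return hmac.Equal([]byte(want), []byte(gotSig)) // constant-time compare
}

func main() {
	ts := time.Now().Unix()
	sig := sign("app001", "sn-42", "r4nd0m", ts, "secret-token")
	fmt.Println("pass:", verify("app001", "sn-42", "r4nd0m", sig, ts, "secret-token"))
}
```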
In fact the company's internal authentication platform knocknock also provides a token-signature scheme, based on the tRPC-Go authentication mechanism. However, for simplicity of user access and to reduce platform dependencies, human-machine confrontation ultimately built its own token authentication along the lines above.
Downstream dependency authentication
At present our downstream is relatively simple. The big-data query agent (see the architecture diagram) has been modified to support token authentication, and ckv+ uses password authentication at login; apart from CDB, the other services pose few problems. For CDB access, security regulations forbid using root, and access is authorized per IP, which requires specifying concrete IPs in advance, while on the cloud container IPs drift frequently. For such cases TKE has been planning authorization integration with downstream services, CDB among them; individual services can also integrate with TKE to get registration working. In essence this adds an automatic registration mechanism between CDB and TKE: as service IPs change, they are automatically registered into CDB's authorization list, logically similar to the relationship between Polaris and TKE workload changes.
Why tRPC-Go
At the very start of building the human-machine confrontation service platform, I faced the choice of language and framework. The department was used to C++, with frameworks such as SecAppFramework and the spp framework; what should the new system use? Before answering, let us return to the problems and goals we faced:
- Heavy traffic and high concurrency requirements, demanding high machine-resource utilization
- Rapid business growth, requiring efficient support across development, release and operations, troubleshooting, and data analysis
- Moving services to the cloud is the company-wide trend, and the various in-house cloud platforms and services will all be used; the chosen language and framework should tap these capabilities quickly and easily. C++ with AppFramework looked heavyweight here, with weak or awkward support for many of those services, somewhat powerless.
Facing choices among the department's legacy framework, the spp framework, tRPC-cpp, and tRPC-Go, and weighing performance, development convenience, concurrency control, and the richness of surrounding service support, we finally chose tRPC-Go, with engineering efficiency as the central goal. A more detailed analysis:
- The Go language is simple, its package ecosystem complete and rich, and its performance close to C++, yet it natively supports coroutines with simple concurrency control. Straightforward concurrent designs can squeeze the machine's resources, with a lighter mental burden and higher productivity than C++.
- The company's coroutine frameworks such as the spp framework and tRPC-C++ are all C++-based. Under the spp framework, moreover, a single worker process can use at most one core, and the proxy itself becomes a bottleneck; tRPC-C++ has been tried too, and its complexity is high.
- tRPC is a company-promoted OTeam project under continuous improvement. tRPC-Go service interfaces are easy to develop, and the surrounding service components are rich: add them to the configuration file and they run, e.g. Polaris/L5, Zhiyan logging/monitoring, various storage-access components (mysql/redis, etc.), the R configuration service, and more. It basically covers every link of development; with some accumulated experience and familiarity, calling the various services feels as natural as moving one's own hands (see the sketch after this list).
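As a taste of the development experience, a minimal tRPC-Go service looks roughly like this; pb and its Greeter types stand in for the stub code generated from a service's .proto file, so those names are illustrative:

```go
package main

import (
	"context"

	trpc "trpc.group/trpc-go/trpc-go"

	pb "example.com/hello/pb" // hypothetical package generated from the .proto
)

// greeterImpl implements the service interface generated from the proto file.
type greeterImpl struct{}

func (g *greeterImpl) Hello(ctx context.Context, req *pb.HelloRequest) (*pb.HelloReply, error) {
	return &pb.HelloReply{Msg: "hello " + req.GetName()}, nil
}

func main() {
	// Ports, naming-service registration, and plugins all come from trpc_go.yaml.
	s := trpc.NewServer()
	pb.RegisterGreeterService(s, &greeterImpl{})
	if err := s.Serve(); err != nil {
		panic(err)
	}
}
```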
In the course of building the system with tRPC-Go, apart from some issues while getting familiar, there were no big pitfalls: coding errors could be located quickly, and no strange, unpredictable problems appeared. Overall it was smooth sailing, with a light mental burden, more focus on the business, and strong productivity.
The blessing of the surrounding services
Beyond the module's core logic, a series of peripheral services is needed to make the service more stable and operations more efficient: logging, monitoring, a configuration center, and other supporting services.
Unified log
For nodes on the cloud, writing local logs is a poor fit for troubleshooting. The core reason: when a problem occurs you must first find the node where it happened and then log in to read the local log, a tedious process; worse, a cloud node restart may lose the logs entirely. A unified web log center is therefore imperative. The mainstream log services in the company today are:
- Zhiyan: a TEG product, simple to operate and easy to use, with successful practice in the captcha service. Under the tRPC-Go framework it is simpler still: configuration alone forwards logs to the log center, with no changes to business code.
- Eagle Eye: feature-rich, but integration is complicated and ease of use needs strengthening.
- uls, cls, etc.
The final choice is the Zhiyan log: where requirements are met, simple and easy wins. With tRPC-Go's support, all it takes is the two steps below (see the sketch after the list):
- Import the plugin package in the code
- Add a few lines to the yaml configuration
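A rough sketch of the two steps; the plugin's import path is a hypothetical placeholder here, and the real path and yaml keys come from the Zhiyan plugin project's documentation:

```go
package main

import (
	trpc "trpc.group/trpc-go/trpc-go"
	"trpc.group/trpc-go/trpc-go/log"

	// Blank import so the Zhiyan log plugin registers itself with the
	// framework; this import path is a hypothetical placeholder.
	_ "git.example.com/trpc-go/trpc-log-zhiyan"
)

func main() {
	// With the plugin enabled in the plugins/log section of trpc_go.yaml,
	// ordinary framework log calls are shipped to the Zhiyan log center.
	s := trpc.NewServer()
	log.Infof("service started; logs flow to the unified log center")
	_ = s.Serve()
}
```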
When locating a problem, the logs can be viewed on the web:
For the middleware's implementation details and further help, see the Zhiyan log plugin project under tRPC-Go.
Powerful monitoring
Monitoring services include:
- Zhiyan: a TEG product; multi-dimensional monitoring, powerful yet easy to use, verified many times in the department
- monitor: attribute-based monitoring; an old product, mature and stable, but a single monitoring mode
- 007: feature-rich, but integration is complex and ease of use needs strengthening
The final choice is Zhiyan monitoring. Zhiyan's multi-dimensional monitoring makes the monitoring capability much richer, locates problems faster, and stays simple to use. Under tRPC-Go, routine data reporting takes only plugin configuration, plugin registration, and an API call. With well-chosen dimension definitions, multi-dimensional monitoring also makes the data more three-dimensional and easier to analyze:
Take the figure above as an example: you can define a variable-query indicator and associate it with two dimensions, source IP and processing result. When we care about a particular business's access, we can see the traffic from each source IP; equally, the service's overall traffic broken down by processing result is visible at a glance (as in the figure above). Compared with the original monitor-style attribute monitoring, this is an improvement of a whole dimension.
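A toy Go sketch of dimensional counting; this is not the Zhiyan API, it only shows why keying one indicator by (source IP, result) beats a flat counter:

```go
package main

import "fmt"

// key identifies one cell of a two-dimensional metric: the same indicator
// sliced simultaneously by source IP and by processing result.
type key struct {
	SourceIP string
	Result   string
}

func main() {
	counts := map[key]int{}
	report := func(ip, result string) { counts[key{ip, result}]++ }

	report("10.1.2.3", "ok")
	report("10.1.2.3", "ok")
	report("10.9.9.9", "rejected")

	// Aggregate along either dimension: per-result totals here,
	// per-IP traffic the same way.
	perResult := map[string]int{}
	for k, n := range counts {
		perResult[k.Result] += n
	}
	fmt.Println(counts, perResult)
}
```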
Unified Configuration Center
Our business generally runs on top of some configuration, which we often modify and then release. The traditional way was to modify the configuration file on every machine and restart, or to store the configuration centrally and have each machine's modules read it periodically and load it into the program. With OTeam collaboration, the company now provides several configuration-synchronization services, mainly two:
T configuration service
- Simple to use, average feature richness
- Only basic permission control over configurations; after inquiry, a version with data encryption is not planned

R configuration service
- A company-level solution with dedicated OTeam support
- Configuration synchronization comes with permission control, and encryption features will be supported in the future
- Supports rich configuration formats: json, yaml, string, xml, etc.
- Supports public/private configuration groups, which eases configuration reuse and separation between modules, and supports gray/staged releases
Both services are simple and easy to use in the tRPC-Go development model, and backend data changes take effect immediately. All things considered, the R configuration service was finally chosen for this platform's configuration synchronization. Using it is fairly simple:
- Register the project on the R configuration service and set up the configuration groups
- Connect to the R configuration service in code, read the data, and listen for changes to the configuration items
The interface is simple and easy to use; a sketch follows:
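This is a conceptual Go sketch of the read-and-watch pattern; the client type and method names are hypothetical placeholders, not the R configuration service's real API, and the in-memory fake merely simulates a release from the web console:

```go
package main

import "fmt"

// ConfigClient abstracts what we need from a configuration center: read a
// value now, and be notified when it changes. The real client API differs.
type ConfigClient interface {
	Get(key string) (string, error)
	Watch(key string, onChange func(newValue string))
}

// fakeClient is an in-memory stand-in so this sketch runs on its own.
type fakeClient struct {
	data      map[string]string
	listeners map[string][]func(string)
}

func (f *fakeClient) Get(k string) (string, error) { return f.data[k], nil }
func (f *fakeClient) Watch(k string, fn func(string)) {
	f.listeners[k] = append(f.listeners[k], fn)
}

// publish simulates someone releasing a new value on the web console.
func (f *fakeClient) publish(k, v string) {
	f.data[k] = v
	for _, fn := range f.listeners[k] {
		fn(v)
	}
}

func main() {
	c := &fakeClient{
		data:      map[string]string{"released_variable_ids": "1001,1002"},
		listeners: map[string][]func(string){},
	}

	apply := func(ids string) { fmt.Println("effective variable IDs:", ids) }
	v, _ := c.Get("released_variable_ids")
	apply(v)                                // initial load at startup
	c.Watch("released_variable_ids", apply) // hot reload, no restart needed

	c.publish("released_variable_ids", "1001,1002,1003") // every node reacts at once
}
```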
Take the externally released variable-ID list as an example: it is continually extended and modified as needed. Once a user modifies and releases it on the web, every node on the cloud perceives the change and the new configuration takes effect immediately. In both service stability and release efficiency, this is a qualitative improvement over the traditional modify-and-release approach.
tRPC-Go builds services around plugins, which greatly simplifies calling each service. Combined with the active open-source projects under tRPC-Go, engineering efficiency has improved by more than 50%. Take our everyday mysql/redis access as an example: we must handle exceptions, wrap connection pools, wrap addressing (L5 or Polaris inside the company), and so on. With ordinary open-source libraries or hand-rolled wrappers, development plus testing basically takes 1-2 days; with the corresponding open-source components under trpc-go/database it drops to about 2 hours. Likewise, compared with full or partial in-house development, the development time for configuration synchronization and log-center integration fell from 2 days to 4 hours. Objectively, components self-developed in such a short time would be custom, barely usable things; in stability and generality they could hardly rival platform services. Crucially, in-company code collaboration and the various OTeams keep making these components richer and more capable, and their growing maturity translates into growing productivity for the whole organization. On balance, using the company's mature middleware was the right choice.
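For a sense of the difference, here is a sketch of mysql access through a trpc-go database component; the import path, the service name, and the exact client methods are assumptions for illustration, so consult the component's own documentation for the real API:

```go
package main

import (
	"context"
	"fmt"

	// Assumed import path for the mysql component under trpc-go/database.
	"trpc.group/trpc-go/trpc-database/mysql"
)

func main() {
	// The name selects the target in trpc_go.yaml, where addressing
	// (Polaris/L5), timeouts, and pooling are configured declaratively,
	// so none of that plumbing appears in business code.
	proxy := mysql.NewClientProxy("trpc.mysql.black_centre.db")

	var total int
	// QueryRow-style access; the method shape is an assumption.
	err := proxy.QueryRow(context.Background(),
		[]interface{}{&total},
		"SELECT COUNT(*) FROM variable WHERE released = ?", 1)
	if err != nil {
		panic(err)
	}
	fmt.Println("released variables:", total)
}
```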
What going to the cloud brings
At present the human-machine confrontation service keeps absorbing traffic from old systems and connecting new services. The whole system's read/write access peaks at 12 million requests/min, and variable-dimension access peaks at 140 million/min (a single request can access multiple variables). The service has run stably on the cloud for more than 3 months, living up to expectations.
Stability improvement
Setting aside the code quality of the modules themselves, the stability gains come mainly from several platform features of cloud container deployment:
- TKE supports service heartbeat checks, application-exception detection, and fault migration
- Easy multi-region deployment and remote disaster recovery
- Resource isolation, so an abnormal business does not affect normal ones
If 99.999% is our pursuit of service stability, then compared with the hour- or day-level machine-failure recovery of the old deployment model, a gain of one to two nines in service stability is foreseeable. As a rough illustration: a single hour-long outage per year already caps availability at roughly 99.99% (which allows only about 53 minutes of downtime per year), while automatic recovery within seconds keeps cumulative downtime in the minutes range, approaching five nines.
Improved resource utilization
In the physical-machine/CVM deployment model, overall machine-resource utilization of 20% is already considered good, especially for small modules. Under TKE's flexible scaling after moving to the cloud, container resource utilization can reach 70% with simple configuration.
For example, below is the current monitoring of my application. To cope with possible traffic surges I configured auto-scaling and set a fairly large number of base nodes, so the CPU usage is low; all of this is configurable. Container-mode deployment on the cloud can raise system resource utilization by 50% compared with the traditional model.
Improvements after going to the cloud
- Using the company's open-source technology raised development efficiency by more than 50%
- Thanks to higher machine-resource utilization, the machine budget for human-machine confrontation fell by 50% from the original
- Centralized services make releases and problem locating faster; with the R configuration service and Zhiyan in service, releases and fixes went from half an hour to minutes
- Under container deployment the system enters a naturally well-maintained state: because the image is complete and reproducible, every fix necessarily ends up committed and stored in the repository, with no fear of loss or omission. In the physical-machine/CVM era, programs, scripts, and configurations often lived on a single machine (especially in some offline preprocessing systems), and machine damage or a system crash meant they were gone. Of course physical-machine deployments can also be backed up and archived promptly, but that relies on personal diligence or enforced constraints, at a very different cost; in most cases Murphy's law becomes the nightmare you cannot avoid.
Concluding remarks
Looking back on the center's and the company's efforts to improve engineering efficiency over the years: system frameworks went from the initial srv_framework, to the center's SecAppFramework, to coroutine frameworks such as the spp framework, and on to the flourishing tRPC. Data synchronization and consistency went from manual synchronization and active/standby synchronization, to the synchronization center, and on to the company's various distributed storage services. System deployment went from manual deployment on physical machines, to assorted release tools, to platforms such as Zhiyun and Blue Whale, with physical machines, CVM, and the cloud all carrying services verified by large-scale business. The company's engineering efficiency has improved every step of the way, from a toddler to a youth who walks fast and steady. Standing on the cloud, the pinnacle of the modern computing-power system, and looking back, the company's understanding, exploration, and practice of the cloud has likewise gone from vague to clear, from hesitant to firm, and our efforts have not let this era down. The improvements across the metrics and the recognition from users are our affirmation and reward, inspiring us to press forward without hesitation.
Recognize and build the cloud; innovate and grow on the cloud. We stand on the earth, yet we look up at the stars. As Confucius said by the river: "It passes on just like this, never ceasing day or night!"
The above is a little of my practice and insight from moving human-machine confrontation to the cloud, shared here in mutual encouragement.