author
Tomdu, a senior engineer at Tencent Cloud, is mainly responsible for the architecture design and backend development of the Aegis security protection system's control center.
Introduction
As a company-level network security product, the Aegis DDoS protection system provides professional, reliable DDoS/CC attack protection for a wide range of businesses. In an increasingly intense attack-and-defense environment, DDoS mitigation demands both lower cost and higher efficiency. With the rise of cloud native, the Aegis team has continued to invest in cloud native architecture transformation and optimization to improve the system's processing capacity and efficiency. This article introduces the practices and lessons learned from moving the Aegis protection scheduling platform to the cloud.
Why go to the cloud?
Cloud native, a very popular concept in recent years, has been embraced both by teams across the company and by peers in the wider industry. It covers technologies such as containers, microservices, DevOps, and continuous delivery. What benefits do these technologies and ideas bring? From our perspective:
Resource sharing and dynamic scaling: "cost reduction"
Take the Aegis protection scheduling platform as an example. Because previously purchased physical machines are still in service, most backend services currently run on them. When applying for machines, a buffer is reserved (resource consumption depends on external attack activity, and the gap between peak and off-peak can be as much as tenfold). Although this buffer serves as a reserve, it sits idle most of the time, and physical machines are hard to share across systems and projects. In the resource-clouding stage, CVM virtualized the physical machines and achieved a degree of resource sharing. With the arrival of containerized management platforms, resource control became even finer-grained: on TKE, a container's allocation can be as precise as 0.1 CPU cores and 1 MB of memory, and it can be scaled up or down at any time based on load. All resources are shared as one large pool, reducing waste.
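On TKE, which is Kubernetes-based, this fine-grained allocation maps to standard container resource requests and limits. The fragment below is only an illustration of the granularity described above; the names and values are placeholders, not the Aegis platform's actual configuration:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: aegis-demo                    # placeholder name
spec:
  replicas: 2
  selector:
    matchLabels: {app: aegis-demo}
  template:
    metadata:
      labels: {app: aegis-demo}
    spec:
      containers:
        - name: worker
          image: example/worker:latest   # placeholder image
          resources:
            requests:
              cpu: "100m"              # 100 millicores = the 0.1-core granularity
              memory: "64Mi"
            limits:
              cpu: "500m"
              memory: "256Mi"
```

Paired with a HorizontalPodAutoscaler, `replicas` can then track load automatically, which is what makes the shared pool elastic between attack peaks and idle periods.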
Containerized management and rapid deployment: "improved efficiency"
To speed up iteration, testing, release, and operations all need optimizing alongside the development process. With physical machines, releases had to be completed by operations engineers manually or through a dedicated release platform, and differences between machine environments could cause release failures or exceptions that required manual handling. Containerized deployment guarantees a consistent runtime environment for the service and avoids "snowflake servers". With the help of the container management platform, one-click release and rapid scaling become possible, and the platform's standard disaster-tolerance strategies provide a baseline guarantee of service stability.
How to go to the cloud?
Current architecture
As shown in the figure above, the DDoS protection process consists of attack detection and attack protection (cleaning the attack traffic and re-injecting the clean traffic back to the source).
This process also needs a central "brain" that coordinates the detection system and the protection system to automate the whole flow: that brain is the protection scheduling platform. When a DDoS/CC attack is detected, the platform automatically decides which protection data centers to invoke, how many devices to use, and which protection strategies to apply. During an intense confrontation, it must also adjust the strategies on the protection devices in real time, which shows how critical its role is.
The overall architecture of the current protection scheduling platform is shown in the figure below.
- Protection devices: divided into three categories, ADS (Anti-DDoS System), HTTP CC, and HTTPS CC, each integrating self-developed protection algorithms for different protocols and scenarios, built on the team's years of DDoS/CC attack-and-defense experience. Protection devices are deployed at the entrances of data centers around the world; after an attack alarm is triggered, they pull in the attack traffic, clean it, and return the clean traffic to the source. A management-and-control agent is deployed on each device, responsible for communicating with the backend and managing the device.
- Access layer: multi-point access over both the internal network and the public network. After the control agent starts, it randomly selects an available access point and establishes a TCP connection.
- Backend services: deployed in multi-active or active/standby mode, registering heartbeats with the access layer. All backend requests are distributed and load-balanced through the access layer, and all agent requests are forwarded through it.
If deployed directly to TKE with the current architecture, the system would run, but because machine deployment and container deployment behave differently, a direct migration would create friction. When weighing the plan, we concluded that the current architecture was not well suited to TKE deployment, mainly in these areas:
- Service discovery: backend services register heartbeats against a pre-configured list of access-layer IPs. After containerized deployment, IPs change frequently, so loading the access layer from static configuration no longer works.
- Statelessness: containerized deployment enables rapid load-based scaling, but full horizontal scaling requires services to be completely stateless. The current multi-active services are stateless and can be migrated directly, but the services deployed in active/standby mode need to be reworked.
- Configuration management: configuration is currently managed per machine, mixing settings that depend on the runtime environment with those that do not.
At the same time, we wanted to use the migration as an opportunity for a major architecture overhaul and to adopt the company's mature shared services, such as the Polaris naming service, the Rainbow (Qicaishi) configuration center, and Zhiyan, to improve engineering efficiency.
Cloud architecture
Based on the current architecture, besides packaging services into images and migrating them to TKE, we also transformed and optimized the parts that did not fit. The architecture and flow after the transformation are as follows:
- Service discovery: all backend services connect to the Polaris naming service, register their instances with Polaris, and send heartbeats periodically; the access layer obtains the healthy instances of each service from Polaris and distributes requests to them.
- Statelessness: the current system has two main stateful scenarios.
- File download: mainly the policy file downloads for protection devices. The stateless transformation requires keeping the downloadable files synchronized across multiple file service instances; we chose CFS to synchronize them.
- Chunked policy delivery: when a policy is too large, the application layer splits it into chunks, and requests for the same policy were hashed to the same backend policy service instance. The fix is to carry the current chunking state in each request, so that any policy service instance can handle it and produce a consistent result.
- Configuration management: configuration unrelated to the runtime environment moves to the Rainbow (Qicaishi) configuration center, ensuring that instances of the same deployment type share identical configuration.
- Logging and monitoring: migrated to Zhiyan for log collection and monitoring.
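The chunked-delivery fix above can be sketched as follows. This is an illustrative model, not the actual Aegis protocol (function and field names are assumptions): because every request carries the chunk index, any stateless instance computes the same response, so the hash-to-one-instance affinity is no longer needed.

```python
import hashlib

CHUNK_SIZE = 4  # bytes per chunk; tiny on purpose, for illustration


def serve_chunk(policy: bytes, chunk_index: int) -> dict:
    """Stateless server handler: the request itself carries the chunking
    state (chunk_index), so any instance computes the same answer."""
    total = (len(policy) + CHUNK_SIZE - 1) // CHUNK_SIZE
    start = chunk_index * CHUNK_SIZE
    return {
        "chunk_index": chunk_index,
        "total_chunks": total,
        "data": policy[start:start + CHUNK_SIZE],
        # checksum of the whole policy lets the agent verify reassembly
        "checksum": hashlib.md5(policy).hexdigest(),
    }


def download_policy(policy: bytes) -> bytes:
    """Client side: fetch chunks one by one; each request could land on a
    different backend instance and the result is still consistent."""
    first = serve_chunk(policy, 0)
    parts = [first["data"]]
    for i in range(1, first["total_chunks"]):
        parts.append(serve_chunk(policy, i)["data"])
    return b"".join(parts)
```

Reassembling `download_policy(policy)` yields identical bytes regardless of which instance served each chunk, which is exactly the property that makes the policy service horizontally scalable.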
Two "blockers"
How to migrate smoothly
The current physical-machine environment runs stably, and the plan is to gradually grayscale traffic over to the TKE environment, so the two will run side by side for a period of time. Migrating the access layer and backend services to TKE is transparent to the protection-device agents, because each agent simply picks one available access point to connect to; that is, an agent connects to either the physical-machine environment or the TKE environment. Therefore, while both environments run in parallel, the backend services' interaction with the agents involves cross-access between the two. For this, TKE provides flexible configuration support.
When deploying services on TKE, two network modes are available:
- Global Route: a private IP within the VPC that cannot be accessed from outside the cluster and cannot be registered in the CMDB. With random port mapping enabled, it becomes reachable from outside the cluster and can be bound to CLB and Polaris.
- ENI IP: a company-routable IP that can be accessed from outside the cluster and can be registered with the CMDB, CLB, and Polaris.
During the grayscale mixed-running period, the access layer is deployed with ENI IPs, so backend services on physical machines reach the TKE access layer just like any ordinary intranet service. After the migration completes, backend services switch to Global Route, allowing access only within the cluster: backend services are reached through native Services, and only the access layer is exposed externally through CLB.
Changing IPs
Because of the nature of DDoS attack-and-defense, we have dealt with IPs for a long time and have a special attachment to them. In internal discussions, we found that teams migrating to TKE run into a set of common problems, and IP-related issues come up frequently.
In a physical-machine environment, a machine's IP is fixed and rarely changes (machines are decommissioned only every few years). In the TKE environment, however, a single service restart can change the assigned container IP and node. As a result, any feature in the system that relies on IPs cannot adapt well to TKE.
Access authentication
The simpler kind of authentication is based on whitelisting the source IP, for example for interface access and DB access. For interface access, we have defined a set of interface authentication specifications based on JWT, and external interfaces no longer use source-IP whitelists. For DB access, we currently use independent DB accounts with unrestricted source addresses, with permissions divided at a finer granularity (down to the table). Later, DB permissions will support real-time application: when a container starts, it will request access permission for its current node through an interface.
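JWT-based interface authentication can be sketched with the standard HS256 scheme. This is a generic illustration of how signed tokens replace source-IP whitelists, not Aegis's actual specification; the secret and claim names are placeholders:

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"demo-secret"  # per-service signing key; placeholder value


def _b64(data: bytes) -> str:
    """URL-safe base64 without padding, as used in compact JWTs."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def sign(claims: dict) -> str:
    """Issue a compact HS256 JWT: header.payload.signature."""
    header = _b64(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64(json.dumps(claims).encode())
    sig = hmac.new(SECRET, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64(sig)}"


def verify(token: str):
    """Return the claims if the signature is valid and the token is not
    expired; otherwise return None."""
    header, payload, sig = token.split(".")
    expected = hmac.new(SECRET, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(_b64(expected), sig):
        return None
    pad = "=" * (-len(payload) % 4)  # restore stripped base64 padding
    claims = json.loads(base64.urlsafe_b64decode(payload + pad))
    if claims.get("exp", 0) < time.time():
        return None
    return claims
```

A caller sends the token (e.g. in an `Authorization: Bearer <token>` header) and the server verifies the signature and expiry instead of checking the request's source IP, so container IP churn no longer affects authentication.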
Service discovery
In the original architecture, the management-and-control access layer implemented simple service registration and discovery, with backend services registering and reporting heartbeats against configured IPs. If the access layer did not migrate to TKE and stayed relatively fixed, this scheme would still work. But once the access layer itself moves to TKE, its deployment nodes also keep changing, so a registration-and-discovery module that is independent of the access layer and relatively fixed is required. Services deployed inside the cluster can use native K8s/TKE Services, and the Polaris naming service is a good fit while physical machines and TKE run side by side.
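A name service of this kind boils down to registration plus heartbeat-driven expiry. The toy in-memory model below only illustrates that contract (Polaris itself is a distributed production system, and these names are assumptions, not its API):

```python
import time


class Registry:
    """Toy service registry: instances register, send heartbeats, and are
    considered dead once `ttl` seconds pass without a heartbeat."""

    def __init__(self, ttl: float = 5.0):
        self.ttl = ttl
        # (service name, "ip:port") -> timestamp of the last heartbeat
        self._instances = {}

    def register(self, service: str, addr: str, now: float = None):
        self._instances[(service, addr)] = now if now is not None else time.time()

    # a heartbeat just refreshes the timestamp, so it is the same operation
    heartbeat = register

    def healthy(self, service: str, now: float = None) -> list:
        """Instances of `service` whose last heartbeat is within the TTL."""
        now = now if now is not None else time.time()
        return sorted(addr for (svc, addr), ts in self._instances.items()
                      if svc == service and now - ts <= self.ttl)
```

The access layer would query the healthy-instance list before distributing each request, so container IP churn never requires a configuration change: a restarted instance simply re-registers under its new address.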
Service exposure
This has two aspects: how to keep services that should be exposed externally stable, and how to hide services that should not be exposed.
(1) Exposing services externally
In the physical-machine environment, service IP changes caused by machine decommissioning are a common problem, solved with domain names and VIPs. On TKE, this is handled through CLB.
(2) Hiding internal services
Using private IPs within the VPC ensures that a service cannot be accessed from outside the cluster, achieving isolation.
Our system's end state is: all external services are exposed through the access layer, which performs authentication; internal backend services stay hidden to ensure security.
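In Kubernetes terms, this expose/hide split maps onto two standard Service types. The fragment below is illustrative (names are placeholders): a `LoadBalancer` Service, which TKE backs with a CLB, for the access layer, and a `ClusterIP` Service, unreachable from outside the cluster, for internal backends.

```yaml
# Exposed: access layer behind a CLB
apiVersion: v1
kind: Service
metadata:
  name: aegis-access              # placeholder
spec:
  type: LoadBalancer              # TKE provisions a CLB for this Service
  selector: {app: aegis-access}
  ports:
    - port: 443
      targetPort: 8443
---
# Hidden: internal backend, cluster-only
apiVersion: v1
kind: Service
metadata:
  name: aegis-policy              # placeholder
spec:
  type: ClusterIP                 # no route from outside the cluster
  selector: {app: aegis-policy}
  ports:
    - port: 80
      targetPort: 8080
```

Because a `ClusterIP` Service has no externally routable address, the isolation of internal services is enforced by the platform rather than by firewall rules on each machine.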
Results of going to the cloud
After the protection scheduling platform moved to the cloud:
(1) On cost reduction: resource utilization is expected to exceed 50%. Much of the capacity previously reserved to ride out DDoS attack peaks is now shared in the common resource pool when idle and dynamically expanded when busy.
(2) On deployment efficiency: the time needed for deployment and scaling has dropped from days to minutes, where releases used to require a dedicated operations engineer. Meanwhile, as business scenarios change rapidly, TKE has met our needs for fast deployment and scaling in scenarios such as high defense, gateways, and third-party clouds.
As the company's cloud services grow richer, migrating to the cloud and adopting shared services has optimized the architecture of the Aegis protection scheduling platform, improving the system's scalability and iteration efficiency.
Aegis's core capabilities are DDoS and CC protection. Beyond moving the management-and-control plane to the cloud, we are also exploring virtualizing the protection capability itself, to provide flexible, elastic protection for all kinds of businesses and scenarios on the cloud.
About us
For more cases and knowledge about cloud native, please follow the public account of the same name [Tencent Cloud Native]~
Benefits:
① Reply [手册] in the official account's backend to get the Tencent Cloud Native Roadmap Handbook and Tencent Cloud Native Best Practices.
② Reply [系列] in the official account's backend to get a collection of 100+ practical original cloud native articles across 15 series, covering Kubernetes cost reduction and efficiency improvement, K8s performance optimization practices, best practices, and more.
[Tencent Cloud Native] New products, new technologies, new cases, and cloud news: scan the QR code to follow the official account of the same name and get more practical content in time!