
Abstract: This article is compiled from a talk by Qin Yongjing, senior technical expert at Qi Anxin Group, in the industry practice session of Flink Forward Asia 2021. It is divided into four parts:

  1. What Is Security Analysis
  2. Choice of Computing Framework
  3. Engine Design
  4. Practice and Outlook

View the live replay & speech PDF

1. What Is Security Analysis


In the past 10 years, big data technology has developed rapidly, and security analysis scenarios and methods have evolved with it. The commonly used security analysis process today is roughly divided into three stages:

  1. Log collection and parsing. Traffic logs, threat logs, operation logs, terminal logs, and other data are collected from servers, gateway devices, terminals, databases, and other sources through various means;
  2. Real-time security analysis. This stage mainly covers security baselines, correlation analysis, security rules, threat intelligence, and the security knowledge base;
  3. Security operations and situational awareness. Based on the results of security analysis, this stage provides situation visualization, security operations, multi-device security linkage, security orchestration, and other capabilities.


The security field has three main characteristics:

  • Rapid response. Security events are often sudden: vulnerability disclosures, virus outbreaks, and the like can break out within a very short time, so sudden incidents must be handled in an effective and easy-to-use way, responding to customer needs and to all kinds of security incidents at the first moment;
  • Scenario customization. Security analysis differs from conventional big data analysis in that it detects abnormality rather than normality, and the field has some unique requirements, so there is a great deal of demand for customized development;
  • Limited resources. Compared with conventional Internet big data platforms, the resources available for security analysis are heavily constrained. Users are usually limited by budget and will try to compress and optimize the scale of available computing and storage resources, so large numbers of components are deployed in mixed mode, and later hardware and cluster expansion is both costly and slow.


Real-time security analysis imposes five requirements:

  • The first is real-time analysis. Security detection has strict latency requirements: attack and defense is a race against time, and anomalies must be detected at the first moment. Because the work is driven by security events, solutions must also launch quickly and respond promptly, forming protection capabilities in the shortest possible time;
  • The second is rich analysis semantics. Security detection scenarios are usually complex, so rich security analysis semantics must be provided to cover most of them;
  • The third is flexible deployment. Customer environments vary widely, so the engine must support all kinds of big data platforms, whether self-built by the customer or purchased as a cloud platform, with maximum version compatibility;
  • The fourth is minimal resource usage. The engine must support clusters from a single node up to dozens or even hundreds of nodes, and scale up (for example, to thousands of analysis rules or security baselines running simultaneously) while occupying as few resources as possible;
  • The last is stable operation. Security products are deployed on the customer side and must run 7×24 hours without interruption. They need extremely high operational stability, minimizing manual maintenance and intervention so that customers barely notice them.


Traditional security analysis is generally based on prior knowledge and uses feature-based detection: the data or log characteristics of the detection object are analyzed, detectable features are manually summarized and tallied, and security models such as security rules and threat intelligence are built from them. This approach is simple and reliable and deals well with known attack methods, but it cannot effectively detect unknown attacks that have no corresponding detection features.

The security baseline, by contrast, adopts behavior-based detection: it takes the detection object's data and logs, learns behavioral characteristics by various methods, establishes a security baseline, and uses it for anomaly detection.


To give a better sense of where security baselines are used, here are three real scenarios:

  • Scenario 1: abnormal login of a DBA account, such as an abnormal login location, login time, or database usage behavior. For example, a DBA usually logs in at certain times, from certain IPs, or from certain places; if one day the account logs in from somewhere else, at an unusual time, this may be an anomaly and should generate an abnormal event;
  • Scenario 2: the number of email attachments sent exceeds the normal value. The security baseline learns the email-sending behavior of a department or the whole company; if the number of attachments a user sends deviates greatly from the learned history, it may be an anomaly;
  • Scenario 3: an abnormal number of recent VPN logins by an account. The historical VPN login behavior of user accounts is learned to build a security baseline; if an account later logs in abnormally, an abnormal event is generated.

2. Choice of Computing Framework

There are currently two mainstream real-time computing frameworks, Spark and Flink. We designed this engine around 2018; at the time we studied Storm, Spark, and Flink, and based on various factors finally chose Flink as the underlying computing framework. The Flink version then was around 1.4, already relatively mature, and compared with the other frameworks its API and its underlying distributed and stream-computing implementation better matched our usage scenarios.


Flink's advantages are prominent. It is a distributed computing framework that deploys flexibly on common big data platforms. It offers good processing performance with high throughput and low latency, which suits real-time security analysis very well. It provides the flexible DataStream API for customized requirements, plus an easy-to-use checkpoint and savepoint mechanism. Moreover, as one of the most popular computing frameworks today, it has an active community with rich documentation and scenario samples.
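
To make this concrete, here is a minimal, self-contained sketch of the kind of DataStream job such an engine builds on: counting logins per user over a sliding event-time window. The LoginEvent type and all values are illustrative assumptions, not the engine's actual code.

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class LoginCountJob {

    /** Hypothetical event: a user login with an event-time timestamp. */
    public static class LoginEvent {
        public String userId;
        public long timestamp;
        public LoginEvent() {}
        public LoginEvent(String userId, long timestamp) {
            this.userId = userId;
            this.timestamp = timestamp;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements(new LoginEvent("dba01", 1_000L), new LoginEvent("dba01", 2_000L))
           // Event time with bounded out-of-orderness; real input would be Kafka etc.
           .assignTimestampsAndWatermarks(
                   WatermarkStrategy.<LoginEvent>forBoundedOutOfOrderness(Duration.ofSeconds(30))
                           .withTimestampAssigner((e, ts) -> e.timestamp))
           .map(e -> Tuple2.of(e.userId, 1L))
           .returns(Types.TUPLE(Types.STRING, Types.LONG))
           .keyBy(t -> t.f0)
           // Logins per user over the last hour, refreshed every 5 minutes.
           .window(SlidingEventTimeWindows.of(Time.hours(1), Time.minutes(5)))
           .sum(1)
           .print();

        env.execute("login-count-example");
    }
}
```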


Although Flink has many advantages, when enterprise resources are limited and rule sets reach the scale of thousands, it runs into many problems in meeting business and performance requirements: it has no large-scale rule semantic/flow optimization, lacks windows and logic customized for security scenarios, has no security-baseline operators, and has no resource protection mechanism.

3. Engine Design


The engine application framework is divided into three layers:

  1. The bottom layer is the deployment layer, usually a big data cluster;
  2. The second layer is the security analysis layer, which builds the security baseline engine on the Flink DataStream API. Flink is responsible for the underlying distributed computing and event-stream delivery, while the concrete business computation is done by the security baseline engine. The engine's user interface consists of rules and a DSL: users send rules and DSL to the engine through the interface, the engine analyzes the event stream according to them, and, depending on the rule semantics, consults external data such as knowledge data, threat intelligence, assets, and vulnerabilities;
  3. Users manage and use the engine through the third layer, the application layer, and build concrete security services such as situation analysis, security operations, and resource monitoring on top of the engine's results.


The business process of the engine is divided into three parts: the user interface, the engine service, and the engine analysis tasks. Users configure rules, manage baselines, and monitor operation through the user interface. The engine service exposes rule distribution, baseline distribution, status monitoring, and other services as a RESTful API. When the engine service receives a rule-issuing request, it parses and optimizes the issued rule set and generates an analysis-task code package, which is submitted to the big data cluster to run. While running, the analysis task receives baseline data issued by the engine service and adds, deletes, or modifies runtime baselines accordingly. The analysis task also reports its running status to the engine service, which maps that status into business monitoring information for users to query and analyze.


Since most users are not developers, a security analysis language optimized for security analysis scenarios must be provided. It needs the following properties:

  • Simple and easy to use, with a low learning cost: even someone without an R&D background can use it after brief study, and it matches the intuitive thinking of security analysts;
  • Rich data types: first the basic data types, and then direct support for data commonly used in security analysis, such as IPs, various kinds of time, assets, vulnerabilities, threat intelligence, and geographic locations, so users can work with all of them without any customization;
  • Rich semantics, especially enhanced and customized security analysis semantics;
  • Support for extension. Although the provided security analysis semantics cover most security analysis scenarios, some special scenarios cannot be supported directly, and such requirements must be met through extension.


The security analysis language needs a dedicated compiler to compile and optimize the analysis statements and rules written by security analysts. The compiler must provide several features and optimizations:

  • Common-expression optimization: identical semantic logic across analysis statements is shared to reduce repeated calculation and computational complexity (see the sketch after this list);
  • Referenced data-table optimization: a rule set may contain thousands of analysis statements and rules referencing large amounts of external table data, so table computation must be optimized, for example hash matching, large-scale IP matching, and large-scale regular-expression and string matching;
  • Constant-expression optimization, to improve runtime performance;
  • Table-reference optimization, consisting of two parts, merging of referenced instances and merging of reference semantics, to reduce the resource usage of referenced tables.
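
As an illustration of the first point, here is a minimal common-subexpression-elimination sketch in Java: structurally identical expression trees are interned to one shared node so the generated graph evaluates each distinct expression only once. All names are hypothetical, not the engine's compiler internals.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ExprDeduper {

    /** Minimal expression tree: op is e.g. "AND", "EQ", "FIELD", "CONST". */
    public static class Expr {
        final String op;
        final String value;          // field name or constant literal, may be ""
        final List<Expr> children;

        Expr(String op, String value, List<Expr> children) {
            this.op = op;
            this.value = value;
            this.children = children;
        }

        /** Canonical key: two expressions with equal keys are semantically equal. */
        String key() {
            StringBuilder sb = new StringBuilder(op).append('(').append(value);
            for (Expr c : children) {
                sb.append(',').append(c.key());
            }
            return sb.append(')').toString();
        }
    }

    private final Map<String, Expr> pool = new HashMap<>();

    /** Returns the shared instance for any expression seen before. */
    public Expr intern(Expr e) {
        List<Expr> shared = e.children.stream().map(this::intern).collect(Collectors.toList());
        Expr canonical = new Expr(e.op, e.value, shared);
        return pool.computeIfAbsent(canonical.key(), k -> canonical);
    }
}
```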


After analysis statements and rules are compiled, each statement and rule corresponds to a running subgraph. All the rules' subgraphs are then brought together for global graph optimization, which is divided into four steps:

  1. The first step is graph fusion: all subgraphs in the rule set are fused into one running graph, followed by semantic fusion of graph nodes, merging of time windows, and optimization of references to shared resources;
  2. The second step is data-flow optimization, which reduces the size of the graph and the amount of data transmitted, mainly through key pre-positioning, fusion of semantically equivalent nodes, network-throughput balancing, reduction of data skew, and node merging, greatly compressing the number of nodes in very large graphs;
  3. The third step is field pruning, which shrinks the size of transmitted events and thus network IO pressure, mainly through field derivation, pruning, and merging on the graph;
  4. The last step is code generation, which turns statement and rule semantics into code and maps the execution graph onto the Flink DataStream API, as the sketch below illustrates.
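
As a rough illustration of that final step, the sketch below maps one hypothetical optimized rule node onto a filter/keyBy/window/process chain; RuleNode, Event, and the counting logic are placeholders for whatever the real code generator emits, not the engine's actual output.

```java
import java.io.Serializable;
import java.util.Map;
import java.util.function.Predicate;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

public class RuleCodeGen {

    /** Hypothetical event shape: a flat map of parsed log fields. */
    public static class Event implements Serializable {
        public Map<String, String> fields;
        public String field(String name) { return fields.get(name); }
    }

    /** Serializable predicate so the compiled filter can ship to the cluster. */
    public interface EventPredicate extends Predicate<Event>, Serializable {}

    /** Hypothetical compiled rule node produced by graph optimization. */
    public static class RuleNode implements Serializable {
        public EventPredicate predicate; // compiled filter expression
        public String keyField;          // key chosen during key pre-positioning
        public long windowMinutes;       // fused time-window size
        public long threshold;           // e.g. max events per key per window
    }

    public static SingleOutputStreamOperator<String> generate(
            DataStream<Event> fusedInput, RuleNode node) {
        return fusedInput
                .filter(e -> node.predicate.test(e))
                .keyBy(e -> e.field(node.keyField))
                .window(TumblingEventTimeWindows.of(Time.minutes(node.windowMinutes)))
                .process(new ProcessWindowFunction<Event, String, String, TimeWindow>() {
                    @Override
                    public void process(String key, Context ctx,
                                        Iterable<Event> events, Collector<String> out) {
                        long count = 0;
                        for (Event ignored : events) {
                            count++;
                        }
                        if (count > node.threshold) {
                            out.collect("rule hit: key=" + key + " count=" + count);
                        }
                    }
                });
    }
}
```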


A core element of real-time computing is time: different time-processing approaches and implementations can produce very different results. In real-time analysis, time mainly affects two functions, the time window and the timeline.

In security analysis scenarios, the time window must support both general sliding windows and natural-time sliding windows, by natural year, month, week, or even longer. It must also fuse duplicate data across cascading windows to reduce storage, automatically eliminate repeated calculation to avoid duplicate alarms, merge time timers, and handle out-of-order events correctly to avoid the miscalculation that disorder causes.

The timeline divides into two categories, event time and processing time, further refined by time precision; different precisions put very different pressure on processing performance and storage, for example in scenarios that must sort by time. Since events in real-time analysis may arrive out of order, a delay time must be supported to absorb most of the computational inaccuracy caused by disorder. Some calculation scenarios involve converting between system time and event time, so conversion methods between the two kinds of time are required. And since the execution graph fuses a large number of subgraphs, global and local watermarks must be managed simultaneously so that the timeline advances correctly across the whole graph.
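
In Flink terms, one way to realize such a delay time is an allowed lateness plus a side output for extreme stragglers. A minimal sketch, assuming the keyed (userId, 1) stream from the earlier login example:

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

public class LateDataHandling {

    /** keyedLogins: the keyed (userId, 1) stream from the earlier login example. */
    public static SingleOutputStreamOperator<Tuple2<String, Long>> count(
            KeyedStream<Tuple2<String, Long>, String> keyedLogins) {

        // Stragglers later than the watermark but within 5 minutes still
        // refine the window result; anything later lands on a side output.
        final OutputTag<Tuple2<String, Long>> veryLate =
                new OutputTag<Tuple2<String, Long>>("very-late") {};

        SingleOutputStreamOperator<Tuple2<String, Long>> counts = keyedLogins
                .window(TumblingEventTimeWindows.of(Time.minutes(10)))
                .allowedLateness(Time.minutes(5))
                .sideOutputLateData(veryLate)
                .sum(1);

        counts.getSideOutput(veryLate).print("late"); // audit or reprocess them
        return counts;
    }
}
```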


Security baselines fall into three categories:

  • The first category is statistical security baselines, including the common baselines of time, frequency, space, range, and multi-level statistics;
  • The second is sequence baselines, such as exponential smoothing and periodic security baselines (see the sketch after this list);
  • The third is machine-learning security baselines, such as baselines built with clustering algorithms or decision trees.
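
As a taste of the second category, here is a minimal single-exponential-smoothing baseline in plain Java; the engine's actual baselines are richer, and all parameters here are illustrative:

```java
public class SmoothingBaseline {

    private double level = Double.NaN; // learned expectation
    private final double alpha;        // smoothing factor, 0 < alpha <= 1
    private final double tolerance;    // allowed relative deviation, e.g. 0.5

    public SmoothingBaseline(double alpha, double tolerance) {
        this.alpha = alpha;
        this.tolerance = tolerance;
    }

    /** Learning phase: fold a new observation into the baseline. */
    public void learn(double observed) {
        level = Double.isNaN(level) ? observed : alpha * observed + (1 - alpha) * level;
    }

    /** Detection phase: does the observation deviate too far from the baseline? */
    public boolean isAnomalous(double observed) {
        return !Double.isNaN(level) && Math.abs(observed - level) > tolerance * level;
    }
}
```

During the learning phase only learn() would be called; once the baseline is ready, isAnomalous() runs alongside continued learning.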


The baseline processing flow is divided into three main parts, baseline learning, baseline detection, and baseline routing, interspersed with event filtering, time windows, baseline noise reduction, baseline management, and other steps. Baseline learning reads event streams from message queues and storage; after event filtering and time-window aggregation, the streams may still contain noisy data, so a noise-reduction step is needed as well; finally, the learning process learns from the input events and generates the corresponding security baseline. After passing through baseline management, the learned baseline is used for prediction and anomaly detection; when abnormal behavior is found, an abnormal event is generated and output downstream for subsequent business use. During use, users may need to modify or delete learned baselines or create new ones; these additions, deletions, and modifications are completed through the baseline routing function, which routes user-edited baselines across the execution graph and distributes them correctly to the corresponding graph node instances.


The baseline lifecycle is divided into four stages: learn, ready, close, and expire:

  • learn is the learning stage, in which the baseline learns from the input event stream;
  • ready indicates that the current timeline has reached the baseline's learning deadline, but because of the delay time the baseline must wait a little longer; during this period it can still learn from delayed events and can already be used for anomaly detection;
  • close indicates that the current timeline has passed the delay time; the baseline no longer learns from input events and is used only for anomaly detection;
  • expire indicates that the current timeline has reached the baseline's timeout; the baseline must stop being used for anomaly detection and be deleted.

The calculation of the baseline is triggered in two ways:

  • The first is event-triggered calculation: each arriving event triggers an anomaly-detection calculation;
  • The second is time-triggered calculation: the baseline period registers a time timer, and when the timer fires, the related baseline calculation runs. Both trigger paths are sketched below.
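
A minimal sketch of both trigger paths in a single Flink KeyedProcessFunction, reusing exponential smoothing as the learned level; the threshold, smoothing factor, and timer interval are illustrative assumptions, not the engine's values:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class BaselineProcess extends KeyedProcessFunction<String, Double, String> {

    private transient ValueState<Double> level;

    @Override
    public void open(Configuration parameters) {
        level = getRuntimeContext().getState(
                new ValueStateDescriptor<>("baseline-level", Double.class));
    }

    @Override
    public void processElement(Double observed, Context ctx, Collector<String> out)
            throws Exception {
        Double current = level.value();
        // Event-triggered path: check each arriving value against the level.
        if (current != null && Math.abs(observed - current) > 0.5 * current) {
            out.collect("anomaly: observed=" + observed + " expected~" + current);
        }
        // Keep learning (exponential smoothing with an illustrative alpha).
        level.update(current == null ? observed : 0.2 * observed + 0.8 * current);
        // Time-triggered path: schedule maintenance one hour ahead in event time.
        if (ctx.timestamp() != null) {
            ctx.timerService().registerEventTimeTimer(ctx.timestamp() + 3_600_000L);
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) {
        // Periodic maintenance would run here, e.g. moving the baseline
        // through learn -> ready -> close, or expiring and clearing state.
    }
}
```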

The output of the baseline is divided into baseline abnormal-event output and baseline content output:

  • Abnormal-event output happens during baseline anomaly detection: when abnormal behavior is found, the corresponding events are output;
  • Baseline content output happens after baseline learning completes: the baseline itself is output, for baseline editing and for anomaly analysis of the baseline itself.


During use, users often edit existing baselines or, based on analysis and data, create new security baselines for specific scenarios. An edited baseline must then be deliverable to the baseline engine, which raises the question of how to edit and update baselines online.

  • First, the baseline must be editable: the analysis language must support baseline-editing semantics, the baseline data structure must be designed for them, and a visual baseline-editing workflow should be provided, covering baseline display, modification, deletion, and other functions, so users can edit and issue baselines directly on the page;
  • Second, the baseline must be routable. The actual execution graph after compilation and graph optimization differs greatly from the rules shown on the page, so routability requires building a global baseline-update flow graph at compile time and a set of runtime baseline routing methods, including constructing routing flows over the execution graph and supporting both broadcast and directional routing, so that baseline data is distributed accurately (a routing sketch follows this list);
  • Finally, the baseline must be updatable. This requires clear baseline-update semantics that define the baseline's running cycle and calculation method; and since an exception during a baseline update can occur at any point and cause the update to fail, a mechanism is needed to feed failure information back to the user for error determination and repair.
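
On Flink, one natural way to implement such routing is a broadcast stream that carries baseline edits, with each keyed node applying only the entries addressed to it. A minimal sketch with hypothetical BaselineEdit and Metric types, not the engine's real routing layer:

```java
import java.io.Serializable;

import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.KeyedBroadcastProcessFunction;
import org.apache.flink.util.Collector;

public class BaselineRouting {

    /** Hypothetical user-edited baseline: a target entity and its new expected level. */
    public static class BaselineEdit implements Serializable {
        public String targetKey;
        public double level;
    }

    /** Hypothetical event: an entity id plus an observed metric value. */
    public static class Metric implements Serializable {
        public String entityId;
        public double value;
    }

    public static DataStream<String> wire(DataStream<Metric> events,
                                          DataStream<BaselineEdit> edits) {
        final MapStateDescriptor<String, BaselineEdit> desc = new MapStateDescriptor<>(
                "baseline-edits", String.class, BaselineEdit.class);

        return events
                .keyBy(m -> m.entityId)
                .connect(edits.broadcast(desc)) // edits reach every parallel instance
                .process(new KeyedBroadcastProcessFunction<String, Metric, BaselineEdit, String>() {
                    @Override
                    public void processElement(Metric m, ReadOnlyContext ctx,
                                               Collector<String> out) throws Exception {
                        // Directional routing: only the edit targeted at this key applies.
                        BaselineEdit edit = ctx.getBroadcastState(desc).get(m.entityId);
                        if (edit != null && Math.abs(m.value - edit.level) > 0.5 * edit.level) {
                            out.collect("anomaly for " + m.entityId + ": " + m.value);
                        }
                    }

                    @Override
                    public void processBroadcastElement(BaselineEdit edit, Context ctx,
                                                        Collector<String> out) throws Exception {
                        // Store the edit under its target key; a wildcard key
                        // could implement pure broadcast routing instead.
                        ctx.getBroadcastState(desc).put(edit.targetKey, edit);
                    }
                });
    }
}
```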


During baseline learning, the learning period is usually long, such as the last week or the last month, and long-period learning usually faces data fragmentation. For example, suppose we learn from the last week's data and today is Wednesday: the last week's data is split into two parts, with Monday-to-Tuesday data sitting in historical storage and data from Wednesday onward generated in real time. This raises the problem of fused learning over historical and real-time data, and there are three cases:

  • The first is that all the data to be learned is historical, which requires supporting detection over the historical learning range and online baseline updates;
  • The second is that all the data to be learned is real-time, which requires automatic baseline learning, automatic detection, and automatic baseline updates;
  • The third, just mentioned and the most complicated, is the fusion of historical and real-time data, which requires supporting the boundary division between historical and real-time data, baseline fusion, and data deduplication, as sketched below.
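
A minimal sketch of that third case, assuming every event carries a globally unique id: the bounded historical stream is unioned with the live stream, and duplicates on the boundary are dropped with keyed state. The Event type is a placeholder, and real boundary handling would be more involved.

```java
import java.io.Serializable;

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class HistoryRealtimeFusion {

    /** Hypothetical event with a globally unique id for boundary dedup. */
    public static class Event implements Serializable {
        public String eventId;
    }

    /** Union the bounded historical stream with the live one, dropping duplicates. */
    public static DataStream<Event> merge(DataStream<Event> historical,
                                          DataStream<Event> realtime) {
        return historical.union(realtime)
                .keyBy(e -> e.eventId)
                .process(new KeyedProcessFunction<String, Event, Event>() {
                    private transient ValueState<Boolean> seen;

                    @Override
                    public void open(Configuration conf) {
                        seen = getRuntimeContext().getState(
                                new ValueStateDescriptor<>("seen", Boolean.class));
                    }

                    @Override
                    public void processElement(Event e, Context ctx, Collector<Event> out)
                            throws Exception {
                        if (seen.value() == null) { // first occurrence of this event id
                            seen.update(true);
                            out.collect(e);
                        }
                        // Duplicates on the history/real-time boundary are dropped.
                    }
                });
    }
}
```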


There is usually noise in the data used for baseline learning. The noise may come from an abnormal operation, such as a user's abnormal login, or from erroneous data introduced during collection; it must therefore be eliminated to improve baseline accuracy and reduce false positives.

Data noise reduction can be divided by data type into numerical and non-numerical noise reduction, and the two are handled differently. There are four main ways to judge noise:

  • The first judges relative to the current cycle, comparing a value with the other data of the same cycle to decide whether it is noise;
  • The second compares the data of the previous cycle with the latest cycle to decide whether it is noise;
  • The third compares a value against all historical data to decide whether it is noise;
  • Finally, the user can define the noise-judgment logic, such as treating values above or below some threshold as noise. A sketch of the first, within-cycle judgment follows.
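
A minimal sketch of the within-cycle judgment for numeric data, dropping values more than k standard deviations from the cycle mean; k is an illustrative parameter, not the engine's actual rule:

```java
import java.util.Arrays;

public class CycleDenoiser {

    public static double[] denoise(double[] values, double k) {
        double mean = Arrays.stream(values).average().orElse(0);
        double variance = Arrays.stream(values)
                                .map(v -> (v - mean) * (v - mean))
                                .average().orElse(0);
        double std = Math.sqrt(variance);
        // With zero spread nothing is noise; otherwise keep values within k sigma.
        return Arrays.stream(values)
                     .filter(v -> std == 0 || Math.abs(v - mean) <= k * std)
                     .toArray();
    }
}
```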

When denoising, relevant data usually has to be saved; for example, judging noise against history requires storing key historical data. Since historical data is usually large, the noise-reduction data structures must be optimized to reduce storage, for example by minimizing the key denoising data and pruning fields.


A very important part of engine operation is how resources are monitored and protected. Three aspects are involved:

  • The first is stability enhancement, which requires dynamic monitoring of memory usage while baselines run. Hundreds of thousands of baseline rules may run in the engine simultaneously, so the memory usage of each must be monitored; for rules with abnormal memory usage, protective measures such as deleting part of their data or isolating them are needed so that other, normal rules keep running unaffected. Deletion can be governed by resource priority: a rule with low priority and high resource usage may have its resources reduced, or even be disabled. The engine also monitors the baseline calculation process; if monitoring finds a slow path that seriously hurts graph-processing performance, the corresponding subgraph is isolated so that other analysis flows are not affected;
  • The second is status monitoring, which has two parts. First, the engine reports the status data of all computing nodes in the execution graph, such as CPU, memory, disk, and input/output, to the monitoring service; second, the monitoring service processes the reports and maps the execution graph's operating information into rule-state information, completing the conversion from graph state to business state. For large execution graphs and highly concurrent analysis tasks, the status-reporting process itself must be optimized to reduce resource usage;
  • The third is flow control. Downstream of the engine there may be slow consumers, such as database writers, so flow control is needed to prevent a faster upstream from feeding a slower downstream too much data, which would cause excessive resource consumption and stalls. Flow control must support active flow control, passive flow control, and time-window-related flow control, configured by the user or handled automatically, to avoid the data loss and instability caused by mismatched processing performance; a simple active rate limiter is sketched below.
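
As one illustration of active flow control, here is a minimal fixed-window rate limiter placed in front of a slow sink; blocking the subtask lets Flink's own backpressure (the passive case) slow the upstream. The rate and mechanism are illustrative, not the engine's implementation:

```java
import org.apache.flink.api.common.functions.RichMapFunction;

public class RateLimitedMap<T> extends RichMapFunction<T, T> {

    private final long permitsPerSecond;
    private transient long windowStart;
    private transient long used;

    public RateLimitedMap(long permitsPerSecond) {
        this.permitsPerSecond = permitsPerSecond;
    }

    @Override
    public T map(T value) throws Exception {
        long now = System.currentTimeMillis();
        if (now - windowStart >= 1000) { // a fresh one-second window
            windowStart = now;
            used = 0;
        }
        if (++used > permitsPerSecond) { // budget spent: stall this subtask
            Thread.sleep(windowStart + 1000 - now);
            windowStart = System.currentTimeMillis();
            used = 1;
        }
        return value;
    }
}
```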


Users often need to operate on rules during use, and these operations start and stop running tasks. Across a start and stop, data must remain consistent, and saved data must not be lost.

Flink itself supports reloading state when a task restarts, but the problem is more complicated in the baseline engine, because users may disable, enable, or modify rules, which changes the rule set and therefore the execution graph. To ensure that unchanged rules still load their correct data from the savepoint after a restart, the graph must have stable local state: during graph optimization, local changes must not affect other subgraphs, and during code generation, stable subgraphs must produce stable execution code, so that changed rules affect only their own subgraphs while all unchanged rules are untouched.

Baseline learning usually keeps a large amount of intermediate data. To speed up savepoints and checkpoints, serialization and deserialization of complex data structures must be optimized, and incremental state must be supported. The engine service usually provides analysis services to multiple users, so the state of multiple users and multiple tasks must also be managed, ensuring that each task is accurately associated with its own state data.
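
One concrete Flink-level lever for this is pinning a stable uid, derived from the rule's identity rather than its position in the fused graph, on every stateful operator; a minimal sketch reusing the BaselineProcess class from the trigger-path sketch above:

```java
import org.apache.flink.streaming.api.datastream.DataStream;

public class StableUids {

    public static void buildRule(DataStream<Double> metrics, String ruleId) {
        metrics.keyBy(v -> "global")             // illustrative key choice
               .process(new BaselineProcess())   // the sketch from the trigger section
               .uid("baseline-" + ruleId)        // stable across recompilations, so
                                                 // savepoint state is re-matched
               .name("baseline rule " + ruleId); // readable in the Flink web UI
    }
}
```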


4. Practice and Outlook

The real-time security analysis capability provided by the engine serves most of the company's big data products, such as the big data and security operations platform, situational awareness, EDR, cloud security, industrial control Internet, and intelligent security. These products have been deployed to nearly a thousand customers, including central enterprises, governments, banks, and public security, and also support common localized systems and various private clouds. Deployment environments range from a single node to clusters of hundreds of nodes, with event volumes from hundreds to millions of EPS. The engine has also participated in and supported hundreds of special operations for ministries and central enterprises.


As attack knowledge spreads and security vulnerabilities surface frequently, new attack methods and security threats emerge one after another, demanding ever-stronger security analysis capability. The engine must be continuously updated and optimized to improve its ability to detect attacks, and more and better behavior-learning algorithms and techniques must be integrated into the security baseline to improve its detection ability. At the same time, we hope some of the engine's practices can be fed back to the community through appropriate channels, so that more people can benefit from these designs and practices.


View the live replay & speech PDF


