
This article is compiled from the talk "Practice of PB-scale ad hoc data queries based on Flink", shared by big data engineers Su Jun and Liu Jia of the 360 Government-Enterprise Security Group at Flink Forward Asia 2020. The article covers:

  1. Architecture and design of the Threat Hunting platform (Su Jun)
  2. Optimization and exploration with the goal of reducing IO (Liu Jia)
  3. Future plans

img

First, a brief introduction of ourselves and the team. We are from the 360 Government-Enterprise Security Group and are currently working on the "Threat Hunting" project of the 360 Security Brain. Our team came into contact with Flink early on; in the meantime we developed a number of products based on Flink, and at the Flink Forward conferences held in Berlin in 2017 and 2019 we presented two of them, "UEBA" and "AutoML".

img

This talk is divided into two parts:

  • The first part, "Threat Hunting Platform Architecture and Design", is shared by Su Jun;
  • the second part, "Optimization and Exploration with the Goal of Reducing IO", is shared by Liu Jia.

1. Architecture and Design of the Threat Hunting Platform (Su Jun)

The first part covers three topics:

  • Evolution of the platform
  • Architecture design
  • A deep dive into the index structure

1. Evolution of the platform

img

We believe that all technological evolution and innovation is driven by concrete business problems. The following are several products our team has developed based on Flink in recent years:

  • In 2017, we developed UEBA, a user behavior analytics system based on the Flink DataStream API. It ingests various kinds of behavioral data from the enterprise IT topology, such as identity authentication data, application access data, endpoint security data, and network traffic analysis data. With users and assets at its core, it performs real-time detection of threat behaviors and ultimately builds a system of user threat levels and profiles;
  • In 2018, drawing on the implementation experience with UEBA, we found that security analysts often need a way to obtain the raw logs behind a security incident in order to further confirm the source and resolution of a security threat. So we developed HQL based on Spark to solve the problem of data retrieval in offline mode. HQL can be thought of as a query language with richer expressive power than SQL: roughly, SQL plus algorithmic computation;
  • In 2019, as offline HQL saw use on the customer side, we found that it let analysts quickly define security rules and build threat models. If a statement written in offline mode could be published directly as an online task, the development cycle would shrink dramatically. Since Flink SQL's capabilities are relatively complete, we upgraded HQL based on Flink SQL + CEP and released a real-time version of HQL;
  • In 2020, customer data volumes kept growing, with many reaching the PB level, and the offline retrieval performance of the old solutions fell far below expectations. Security analysts are accustomed to fuzzy matching operations such as like and full-text retrieval, which made query latency very high. So starting that year we focused on optimizing HQL's offline retrieval capability and launched a new Threat Hunting platform.

img

According to our research, customers with PB-level data typically have the following business requirements:

  • The first is a low-cost cloud-native architecture. Most current big data architectures are based on Hadoop, whose defining characteristic is that data lives on the compute nodes, which saves a lot of network overhead and speeds up computation. However, to keep resources balanced, the whole cluster usually needs a uniform resource configuration, and to store as much data as possible the cluster must be large, so this kind of architecture requires heavy hardware investment up front.

    Storage-compute separation plus elastic computing solves this problem: disks are much cheaper than memory and CPU, so data is stored on cheap disks paired with low-end CPUs and small memory, while a small number of high-end machines handle computation. This reduces costs to a great extent.

  • The second is low-latency query response. When security analysts do threat detection, most of their work is ad hoc queries, that is, data retrieval and correlation through filters and joins. To get query results as quickly as possible, the corresponding technical solutions are columnar storage, indexing, and caching.

    • Columnar storage needs no elaboration; it is a common storage solution in the big data field;
    • on top of columnar storage, an efficient indexing scheme can greatly reduce IO and improve query performance;
    • the network latency introduced by storage-compute separation can be offset by a distributed cache.
  • The third is rich query capability, covering single-row operations (field access/filter/UDFs), multi-row operations (aggregation/join), and even algorithmic analysis. We provide this mainly through HQL, the analysis language we developed ourselves.

2. Architecture Design

img

First of all, the data comes from two sources: historical data already stored in ES and real-time data in Kafka. The historical data in ES is synchronized by a tool we developed ourselves, while the real-time data in Kafka is written as ORC files to the storage cluster through the Streaming File Sink. While data is being synchronized, we update the index information for that batch of data in the index database.

A security analyst initiates a data retrieval request from the front-end page by writing HQL, our interactive analysis language. The request enters the scheduling system; once the job executes, the analysis statement is first parsed into a list of operators, and the operator cache algorithm determines whether the query can hit existing cached data in the cache system.

  • If the input of the analysis statement is an intermediate result that has already been computed and cached, the cache is read directly and computation continues from there;
  • if there is no hit, we must recompute starting from the ORC files.

We first extract the filter or join conditions of the query to perform predicate pushdown: the index database is consulted to obtain the list of files matching the query, and that file list is then passed to the compute engine. The compute engine uses a dual-engine model: highly complex statements are executed by the Flink engine, while simpler tasks are handed to the platform's internal "Hummingbird engine". The Hummingbird engine performs vectorized execution based on Apache Arrow, combined with LLVM compilation, so its query latency is very small.

Because the system separates storage and compute, we added Alluxio on the compute cluster nodes to provide a data caching service that speeds up data reads. It not only caches data from the remote cluster, but also caches part of the historical job results, which the operator cache algorithm uses to accelerate subsequent computing tasks.

Two points need to be emphasized here:

  • The first point: the index database returns a batch of files matching the condition. If that file list is very large, the current Flink version, when building the job graph, traverses all the files in the logic that obtains the FileList statistics, so with a huge number of files the job graph cannot be constructed in a reasonable time. We have fixed this and will contribute the fix to the community later.
  • The second point is the data cache. Our HQL was previously implemented on Spark. Those who have used Spark know that it can cache or persist a table, and we used that operator when migrating to Flink. In Flink we implemented this ourselves: when the user caches a table, we register the cached result as a brand-new table source, and subsequent reads go through this new table source only, which completes the whole pipeline.

3. A deep dive into the index structure

img

To speed up data retrieval, we often create an index over the data in advance, then use the index to locate the starting position of the data before scanning it. In traditional databases, row indexes are common: an index is created on one or several fields, and the index entries are stored in a tree structure. This type of index is precise down to the row level and is the most efficient.

Some big data projects also support row indexes, but their drawback is that the large volume of index data causes delays in both writing and retrieval. Moreover, our platform processes machine data, such as endpoint and network data, which is highly repetitive, while the results of security analysis are often very sparse: a tiny number of threats hide in massive data, often at a ratio of 1/1000 or even less.

Therefore we chose a more cost-effective block index solution, which is sufficient for our current application scenarios. According to current customer data, the index provides a pruning rate of more than 90% for 85% of the statements, which basically meets the latency requirements.

img

Some big data platforms store index data as files on disk, plus some caching mechanism to speed up index access; we store the index data directly in a database instead. There are two main considerations:

  • The first is transactions. Columnar files usually cannot be updated in place, and we periodically perform merge-file operations to optimize the file distribution. To guarantee query consistency, the database needs to provide transaction capabilities.
  • The second is performance. Databases have strong read, write, and search capabilities; predicates can even be pushed down into the database, and the database's high compression ratio further saves storage.

img

The picture above shows the design of the block index. In our index database, data is divided into different data sources; for example, endpoint data is one data source and network data is another. Our criterion for classifying data sources is whether they share a unified schema. Within a single data source, the date serves as the partition, and inside each partition there are a large number of small ORC files. As for the index structure itself, we build a min/max index for every field, and additionally a Bloom index for fields whose cardinality ratio is below 0.001.
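
As a sketch of how such a per-file block index might be built, the following Python snippet computes a min/max entry for every field and a Bloom entry for designated low-cardinality fields. It is illustrative only: the real platform stores these entries in the index database, and the `BloomFilter` class here is a toy stand-in for a production implementation.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter (illustrative only; a real system would use a tuned
    implementation with a sized bit array and fast hash functions)."""
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, value):
        # Derive num_hashes bit positions from salted MD5 digests.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{value}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all(self.bits & (1 << pos) for pos in self._positions(value))

def build_block_index(rows, bloom_fields):
    """Build one index entry per field of a file: min/max for every field,
    plus a Bloom filter for the designated low-cardinality fields."""
    index = {}
    for field in rows[0]:
        values = [row[field] for row in rows]
        entry = {"min": min(values), "max": max(values)}
        if field in bloom_fields:
            bloom = BloomFilter()
            for value in values:
                bloom.add(value)
            entry["bloom"] = bloom
        index[field] = entry
    return index
```

An equality predicate can then be rejected for a file either because the value falls outside the min/max range or because the Bloom filter rules it out.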

As mentioned above, security analysts prefer like queries and full-text retrieval, so we made some optimizations for like as well. For full-text retrieval, we tokenize the data to build an inverted index, and for each individual token we also build a bitmap index at the file-distribution level.

img

The figure above is a rough estimate of the index size. The raw logs in JSON format are 50 PB; converted to ORC they are about 1 PB. Our index data is 508 GB, of which 8 GB is the min/max index and 500 GB the Bloom index. Adding the bitmap and inverted indexes mentioned above, the proportion of index data grows further, which is why we adopt a distributed indexing scheme.

img

We know that logs keep changing. For some data sources, fields are sometimes added or removed, and sometimes even a field's type changes.

So we adopt a Merge Schema scheme: as files are written incrementally, that is, while updating the index information of each batch of data, we also perform a schema merge operation. As shown in the figure, in block123, file 3 is the last to be written; as files keep arriving, a brand-new merged schema is formed. You can see that fields B and C are actually historical fields, and field A_V is a historical version of field A. In this way we let customers see as much data as possible. Finally, a new table source is constructed from the merged schema and an InputFormat we developed ourselves, completing the whole pipeline.
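
A minimal sketch of the schema-merge step described above. The `_V` historical-version naming follows the figure; the platform's exact merge rules are not spelled out in the talk, so treat this as an assumption:

```python
def merge_schema(old_schema, new_schema):
    """Merge a newly written file's schema into the accumulated schema.

    Fields missing from the new file are kept as historical fields; when a
    field's type changes, the old type is preserved under a '<name>_V'
    historical-version name (an assumption based on the figure).
    """
    merged = dict(old_schema)
    for name, typ in new_schema.items():
        if name in merged and merged[name] != typ:
            merged[name + "_V"] = merged[name]  # keep the historical version
        merged[name] = typ
    return merged
```

For example, merging `{"A": "int", "B": "string"}` with a later file's `{"A": "string", "C": "long"}` keeps B as a historical field and records A's old type under A_V.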

2. Optimization and exploration with the goal of reducing IO (Liu Jia)

The previous section explained why we chose the block index; this section describes how we use it. The core of the block index comes down to one word: "pruning". Pruning means filtering out irrelevant files before the query statement actually executes, reducing as much as possible the amount of data that enters the compute engine, and throttling right at the data source.

img

This picture shows the pruning process the whole system performs with the index DB:

  • The first step is to parse the query statement and obtain the relevant filters. In the leftmost SQL statement you can see two filter conditions: src_address = some IP, and occur_time > some timestamp.
  • The second step is to bring the query conditions into the meta table of the corresponding data source in the index DB to filter the files. src_address is a string field, so it is pruned using a combination of the min/max and Bloom indexes; occur_time is a numeric time field, so the min/max index is consulted first for file pruning. It should be emphasized that we wrap the user's filter into an index-DB query condition, pushing the filter straight down to the database.
  • In the third step, after the file list is obtained, these files together with the merged schema mentioned above are used to construct a TableSource, which is handed to Flink for subsequent computation.
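
The pruning in step two can be sketched in plain Python as follows. `prune_files` and its index layout are hypothetical names, and a plain set stands in for the Bloom filter:

```python
def prune_files(file_index, filters):
    """Return the files that may satisfy all filters, using per-file
    min/max statistics and (for equality) a Bloom-style membership test.

    file_index: {filename: {field: {"min": ..., "max": ..., "bloom": set}}}
    filters: list of (field, op, value) with op in {"=", ">", "<"}.
    """
    survivors = []
    for name, fields in file_index.items():
        keep = True
        for field, op, value in filters:
            stats = fields[field]
            if op == "=":
                if not (stats["min"] <= value <= stats["max"]):
                    keep = False      # outside the file's value range
                elif stats.get("bloom") is not None and value not in stats["bloom"]:
                    keep = False      # Bloom says definitely absent
            elif op == ">" and stats["max"] <= value:
                keep = False
            elif op == "<" and stats["min"] >= value:
                keep = False
            if not keep:
                break
        if keep:
            survivors.append(name)
    return survivors
```

Only the surviving files are wrapped into the TableSource; everything else never reaches the engine.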

When building the source, we also made some optimizations in the details. For example, when passing filters to the ORC reader, filters that have already been pushed down are removed, to avoid filtering a second time on the engine side. Not all filters are removed here, though: we keep like expressions; the pushdown of like filters is introduced later.

img

Next, I will focus on the four optimization points:

  • The first point: when the data is unsorted, the pruning rate has a theoretical upper limit, so we sort the raw data along a Hilbert curve at write time to improve the pruning rate;
  • second, because of the particularities of the security domain, threat detection relies heavily on the like syntax, so we enhanced the ORC API to support pushdown of like;
  • third, because our usage scenarios rely heavily on joins, we also optimized join operations accordingly;
  • fourth, our system supports multiple file systems underneath, so we chose Alluxio, a mature cloud-native data orchestration system, for data caching to improve the locality of data access.

1. The theoretical upper limit of the pruning rate and the Hilbert space-filling curve

img

Pruning can be modeled as the probability problem of throwing N balls into M buckets. We will go straight to the conclusion: assuming rows are randomly and uniformly distributed across blocks, the total number of rows over all blocks is fixed, and the total number of rows hit by the query condition is also fixed, then the block hit rate is directly determined by the ratio "total hit rows / total number of blocks".

There are two conclusions:

  • First, if the total number of hit rows equals the total number of blocks, that is, when the X-axis value is 1, the hit rate is about 2/3: roughly 2/3 of the blocks contain at least one hit row, so the upper limit of the block pruning rate is about 1/3. That is a very low value, but its premise is that the data is randomly and uniformly distributed; to get a better distribution, we need to sort the raw data at write time.
  • Second, with the total number of hit rows fixed, drastically reducing the number of rows per block (thereby increasing the total number of blocks) also raises the block pruning rate. So we reduced the block size; based on test results, we set each file's size to 16 MB. Reducing the file size is simple; for the sorting, we introduce the Hilbert space-filling curve.
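
The 2/3 figure can be checked with a quick simulation: throwing N hit rows uniformly into M blocks, the expected fraction of blocks hit is 1 - (1 - 1/M)^N, which for N = M approaches 1 - 1/e, about 0.63, close to the 2/3 quoted above. This sketch is just a sanity check of that bound, not part of the platform:

```python
import random

random.seed(0)  # deterministic illustration

def block_hit_fraction(num_hit_rows, num_blocks, trials=200):
    """Monte-Carlo estimate of the fraction of blocks containing at least
    one hit row, assuming hit rows land uniformly at random in the blocks."""
    total = 0.0
    for _ in range(trials):
        hit_blocks = {random.randrange(num_blocks) for _ in range(num_hit_rows)}
        total += len(hit_blocks) / num_blocks
    return total / trials

# With as many hit rows as blocks, roughly 1 - 1/e of the blocks are hit,
# so the pruning rate tops out near 1/3 when the data is unsorted.
rate = block_hit_fraction(10_000, 10_000)
```

This is exactly why the upper limit is so poor without sorting: even a selective predicate touches most blocks when its hits are scattered.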

img

Why use the Hilbert curve? Mainly for two reasons:

  • First, what path should be used to traverse a multi-dimensional space so that the address sequence along the path stays roughly ordered in every single dimension? And why does each column, or subset of columns, need to be ordered? Because the query conditions are not fixed when the system is in use: five fields may be used for sorting at write time, while a query may use only one or two of them. Hilbert ordering keeps multiple fields ordered both globally and locally.
  • Second, there are many space-filling curves, such as the Z-order curve, the serpentine curve, and so on. Looking at the two comparison pictures on the right: intuitively, the fewer long-span jumps along the curve's path the better, and the more stable the positions of points across iterations the better. Among space-filling curves, the Hilbert curve has the best overall behavior.

The way we use Hilbert is to implement a UDF that takes column values as input and outputs a coordinate value; rows are then sorted by that coordinate value.
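
For reference, the mapping such a UDF would compute looks like the classic 2-D xy2d Hilbert algorithm below. The platform sorts on more fields; this sketch only illustrates the 2-D case for a curve of a given order:

```python
def hilbert_index(order, x, y):
    """Map 2-D coordinates onto their distance along a Hilbert curve whose
    grid has side 2**order (the classic xy2d algorithm)."""
    d = 0
    s = 2 ** order // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # rotate the quadrant so the recursion stays oriented correctly
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        s //= 2
    return d

# Sorting rows by their Hilbert index keeps both columns locally clustered.
rows = [(0, 0), (1, 1), (1, 0), (0, 1)]
ordered = sorted(rows, key=lambda p: hilbert_index(1, *p))
```

Neighboring points on the curve stay neighbors in every dimension most of the time, which is what makes the per-file min/max ranges tight after sorting.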

img

We sampled 1,500 SQL statements used in customer environments and filtered out those with a pruning rate of 100%, that is, invalid statements that hit no file at all, leaving 1,148 statements. We then sorted the data, reran these statements, and compared the pruning rates: the 95th-percentile pruning rate rose from 68% to 87%, an increase of 19 percentage points. 19% may not sound particularly impressive, but against a large base, say 100,000 files, the effect is considerable.

2. Optimization of Like on the dictionary index

img

I mentioned the particularity of the security industry earlier: in threat detection we rely heavily on like queries, so we optimized for this as well.

  • First, we added a like conditional expression to the ORC API to ensure that a like in SQL can be pushed down to the ORC record reader.
  • Second, we reworked the row-group filtering logic of the ORC record reader: when a like expression is found, the field's dict stream is read first to determine whether the dictionary contains the like target string. If the value does not exist in the dictionary, the whole row group is skipped directly, with no need to read the data stream and length stream, which greatly improves file reading speed. Later on we are also considering building the dictionary index into the index database and pushing the dictionary filtering straight down to the database.

For example, in the leftmost SQL in the figure there are three expressions. The first two were covered above: their filters are pushed straight down to the index DB. The only filter handed to the ORC reader is the last one, attachment_name like '%bid%', and the only row groups that need to be read are those whose dictionary contains "bid". Filtering at the row-group level further reduces the amount of data that has to enter the compute engine.
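
The row-group-level dictionary check can be illustrated as follows. The names are hypothetical, and in real ORC the dictionary would be decoded from the dict stream of a dictionary-encoded column rather than given as a Python list:

```python
def row_group_may_match_like(dictionary, needle):
    """Row-group-level LIKE '%needle%' filter: if no dictionary entry of the
    column contains the target substring, the whole row group can be skipped
    without reading its data stream or length stream."""
    return any(needle in entry for entry in dictionary)

# Only row groups whose dictionary contains "bid" need to be read for
# the predicate: attachment_name LIKE '%bid%'
row_group_dicts = {
    "rg0": ["report.doc", "notes.txt"],
    "rg1": ["bid_2020.xlsx", "summary.pdf"],
}
to_read = [rg for rg, d in row_group_dicts.items()
           if row_group_may_match_like(d, "bid")]
```

Because a dictionary is typically far smaller than the data stream it encodes, this check is cheap relative to the reads it avoids.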

3. Optimization of join based on index

img

Threat-intelligence matching uses a large number of join operations, and to speed up join performance, pushing down the where-condition filters alone is far from enough.

Flink has several built-in join strategies, such as broadcast join, hash join, and sort-merge join. Sort-merge join is very friendly to tables pre-sorted on the join key, but as mentioned above we sort multiple fields jointly with the Hilbert curve, so sort-merge join is temporarily outside our optimization scope.

In addition, we know that join performance is positively correlated with the sizes of the left and right tables, and threat-intelligence joins are very sparse, so pruning both tables in advance can greatly reduce the data entering the join stage.

As mentioned above, we have already built Bloom indexes for commonly used fields, so using these existing Blooms to pre-filter files is a natural step that also saves the overhead of building Blooms at join time.

For broadcast join, we scan the small table directly, probe its records against the Blooms of the files belonging to the large table to determine whether each data block is needed, and thus pre-prune the table with the larger data volume.

For hash join, as we have seen, we pre-join on the file-level Blooms of the join key: the Bloom of each file of the left table is "AND"-ed in turn with the Bloom of each file of the right table, and on both sides we keep only the files for which at least one result is non-zero. The remaining files of each table then enter the engine for subsequent computation.
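
A sketch of this Bloom "AND" pre-join. Plain sets stand in for the Bloom filters, so the intersection test here mimics AND-ing the bit arrays and checking for a non-zero result; all names are illustrative:

```python
def prejoin_prune(left_blooms, right_blooms):
    """File-level Bloom pre-join on the join key. Each left-table file is
    kept only if its Bloom 'AND' some right-table file's Bloom is non-zero,
    and symmetrically for the right table."""
    left_keep = [f for f, b in left_blooms.items()
                 if any(b & rb for rb in right_blooms.values())]
    right_keep = [f for f, b in right_blooms.items()
                  if any(b & lb for lb in left_blooms.values())]
    return left_keep, right_keep

# Join-key Blooms per file; a non-empty intersection means a possible match.
left = {"l1.orc": {"a", "b"}, "l2.orc": {"x"}}
right = {"r1.orc": {"b", "c"}, "r2.orc": {"y"}}
left_keep, right_keep = prejoin_prune(left, right)
```

Because Bloom filters never yield false negatives, no genuine join match is lost: files dropped here cannot contribute any joined row.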

img

For example, the three tables in the picture are table1, table2, and table3. From the index DB we can obtain the tables' statistics, namely the number or total size of their files; the figure lists the file counts directly: table1 has 1,000 files, table2 has 50,000, and table3 has 30,000.

Following the logic in the previous picture, we pre-join and estimate the cost of each join, then let the lower-cost pre-join run first. This greatly reduces the intermediate results and improves join efficiency.

4. Alluxio as a cache for object storage

img

Because the underlying file storage systems are varied, we chose the Alluxio data orchestration system. Alluxio's advantage is that it brings data closer to the compute framework and uses multi-level memory/SSD caching to speed up file access. On a full cache hit, file access can reach memory-level IO speed, reducing how often files are read directly from the underlying file system and greatly easing the pressure on it.

For our system, it brings higher concurrency and is especially friendly to queries with a low pruning rate, because a low pruning rate means a large number of files must be read.

If these files have been loaded into the cache in the previous query, the query speed can be greatly improved.

img

After these optimizations, we ran a performance comparison. We selected an ES cluster holding 249 TB of data on 20 servers, while Flink used 2 servers. To make the comparison on the chart more intuitive, we selected 16 test results.

On the chart, red and orange are ES, blue is HQL before optimization, and green is HQL after optimization. The number above each group is HQL's performance relative to ES; for example, the first label means HQL is five times as fast as ES. Queries 6 and 7 are slower than ES, mainly because HQL uses a block index while ES uses a row index held entirely in memory, which gives it extremely fast retrieval. Query 13 is slower because it uses a not-equals predicate, for which the pruning rate is relatively poor.

Overall, the optimization effect is very obvious: most statements match or even slightly beat ES's query speed, fully meeting customer expectations for long-term data storage and query.

3. Future plans

img

The picture above shows our future plan. Customer sites often involve many BI dashboards and long-period report computations, so our next steps will consider BI pre-computation, along with the containerization and JVM warm-up that Su Jun mentioned, and of course continued benchmarking against ES and improving multi-user concurrent query capability.

