Introduction | This article is selected from the Tencent Cloud Developer Community column [Technical Thinking Guangyi · Original Collection from Tencent Engineers]. The column is a sharing and exchange window that the Tencent Cloud Developer Community has created for Tencent engineers and the wider developer community, inviting Tencent engineers to share their original technical experience and to learn and grow together with other developers. The author of this article is Yang Bo, a senior development engineer at Tencent.
This article summarizes my personal experience with the problems encountered while implementing data classification and grading. I hope it provides some useful reference to developers interested in this area.
Background
With the successive promulgation of the "Data Security Law" and the "Personal Information Protection Law", data security has been elevated to the level of national security and national strategy, and data classification and grading has become an unavoidable topic in enterprise data security governance. However, implementing classification and grading involves many industry-wide pain points, mainly the following:
- Rule formulation is complex: data classification has multiple dimensions, and each dimension has its own value. Different industries and fields, and even individual enterprises and departments, define data levels differently. Unclear dimensions and levels cause problems for much of the subsequent compliance management and control built on top of classification and grading.
- High coordination and communication cost: as the enterprise keeps growing, the organizational structure becomes complex and bloated. Scanning and reporting data involves multiple departments, business groups, and even subsidiaries, which means coordinating many people and handling issues such as network isolation, access permissions, and approvals.
- Huge data volume: with the arrival of the Internet era, enterprise informatization has developed rapidly and business systems have become increasingly complex. The resulting massive data brings enormous value to enterprises; correspondingly, once massive data is leaked, the consequences are serious. Covering the classification and grading of massive data in a real-time, efficient, and comprehensive way is a test of the technical architecture.
- Many storage components: in the Internet era, and especially in the cloud-computing era, enterprises have built various storage components such as relational databases, non-relational databases, and object storage to cope with high-traffic, high-concurrency business scenarios. Some are open-source implementations and some are developed in-house, and different implementations have different transport protocols and data structures. Covering classification and grading across such a variety of storage components requires a great deal of work.
However, most of the material available inside and outside the company focuses only on explaining the concepts and standards of data classification and grading; there is currently little reference material on how to implement classification and grading technically. The focus of this article is therefore not the formulation of classification and grading standards, but a technical description of a classification and grading architecture that abstracts and encapsulates general capabilities, identifies massive data, and provides data access across departments and platforms, so that the classification and grading capability can be reused instead of reinventing the wheel, and so that data security compliance work can be implemented and promoted in practice.
Note: For an introduction to data classification and grading, refer to Data Security Governance: Guidelines for Data Classification and Grading.
Data Security Business Process
(1) Business level
From a business perspective, data classification and grading serve as the cornerstone of data security and underpin controls such as data encryption, data desensitization, data watermarking, permission management, and security auditing. This shows how important classification and grading are to data security.
(2) Technical level
From a technical point of view, data is scanned and reported, and then identified by the data identification engine. In practice, however, many problems surfaced, such as the large number of storage component types, the heavy traffic of reported data, and issues of timeliness, accuracy, and coverage.
Overall structure
Through continuous analysis of the classification and grading business, the architecture above was designed. Its core consists of five parts:
- Scanning and reporting tools for the various storage components.
- The data identification service cluster, which uniformly receives the reported data and performs identification.
- The identification rule engine, which provides unified rule management, online hot updates, and related functions.
- The data platform, which relies on the classification and grading results to carry out data security management and control.
- The company's basic framework capabilities, which ensure the high availability of the engine services through monitoring, alerting, logging, and elastic scaling.
The focus is on the first three points.
Real-time identification of massive data
As the enterprise keeps growing, its massive user base inevitably generates massive amounts of data. Meeting the requirements of high performance, timeliness, high accuracy, and coverage is a huge test for the system architecture.
(1) Data storage
PCG currently covers nearly 20 types of storage components and platforms and about 30 million tables. Take mdb, cdb, tredis, and skydome as examples:
Storage selection
As the table shows, mdb alone has more than five million MySQL tables, and cdb has more than 10 million. Each MySQL table corresponds to one classification and grading result that must be stored. A single MySQL table is generally recommended to hold around 5 million rows at most, with database or table sharding beyond that. Sharding works in some e-commerce scenarios, such as transaction order data, but it also introduces classic problems such as distributed transactions.
Therefore, we need a database that offers large capacity, high concurrency, high availability, and ACID transactions.
Big data: Hadoop
As a classic big data storage architecture, Hadoop can store data at the petabyte level and above, but it is not suited to real-time workloads and is usually used for T+1 offline OLAP tasks. In addition, Hadoop has limited support for ACID transactions and cannot serve OLTP scenarios.
TiDB
TiDB is a cloud-native, distributed NewSQL database with massive capacity. Its storage layer uses the Raft algorithm to distribute data and guarantee consistency, and it is compatible with the MySQL protocol and supports transactions. TiDB therefore meets the requirements, but the company currently has no dedicated team maintaining it.
Cloud-native TDSQL-C
TDSQL-C is a database developed by TEG. It reworks the MySQL architecture to separate compute from storage, so storage and compute resources can be scaled out quickly. TDSQL-C supports the MySQL protocol and transactions and offers high performance, and the company currently has a dedicated team maintaining it.
Storage comparison
As the comparison table shows, both TiDB and TDSQL-C meet the requirements, but TDSQL-C is maintained by a dedicated team within the company. TDSQL-C was therefore chosen to store the classification and grading results.
(2) Data access
The server needs to connect with multiple storage component platforms for data reporting, and different platforms have different requirements for resources, performance, and timeliness. Multiple access methods, such as HTTP, tRPC, and Kafka, are therefore provided to cover the different scenarios.
Transmitting big data via Kafka
Kafka supports failure retries on the consumer side and can absorb traffic peaks, so Kafka is recommended for data reporting.
To ensure the identification results are correct, 200 sample rows are taken from each relational database table and uploaded. Some big data tables are wide or contain large fields, so the uploaded payload can exceed 1 MB, which is beyond Kafka's default limit. Besides limiting the upload packet size, the Kafka configuration also needs to be tuned.
Kafka producer configuration:

```
max.request.size=1048576    (1 MB)
batch.size=262144           (0.25 MB)
linger.ms=0
request.timeout.ms=30000
```
Because the message packets are relatively large, we do not want messages to linger in the producer's memory and put pressure on it; instead, messages should be sent to the broker as quickly as possible (hence linger.ms=0 and a small batch size).
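As a rough illustration only, the producer settings above map onto the Go sarama client roughly as follows; the field names, package layout, and acknowledgement level here are assumptions for this sketch rather than the article's actual code.

```go
package reporting

import (
	"time"

	"github.com/IBM/sarama"
)

// newReportProducer builds a synchronous producer tuned for large, short-lived
// messages: flush as soon as possible instead of letting data sit in producer memory.
func newReportProducer(brokers []string) (sarama.SyncProducer, error) {
	cfg := sarama.NewConfig()
	cfg.Producer.MaxMessageBytes = 1 << 20          // ~ max.request.size = 1 MB
	cfg.Producer.Flush.Bytes = 256 << 10            // ~ batch.size = 0.25 MB
	cfg.Producer.Flush.Frequency = 0                // ~ linger.ms = 0, send immediately
	cfg.Producer.Timeout = 30 * time.Second         // ~ request.timeout.ms = 30 s
	cfg.Producer.RequiredAcks = sarama.WaitForLocal // leader acknowledgement is enough here
	cfg.Producer.Return.Successes = true            // required by SyncProducer
	return sarama.NewSyncProducer(brokers, cfg)
}
```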
Kafka consumer configuration:

```
fetch.max.bytes=1048576             (1 MB)
fetch.max.wait.ms=1000
max.partition.fetch.bytes=262144    (0.25 MB)
max.poll.records=5
topic partitions >= 20
retention.ms=2
```
Because the message packets are relatively large and the consumer needs several hundred seconds to process a message, the number of messages pulled per batch is reduced and the wait time for pulling is increased, so that the consumer does not fetch from the broker too frequently and max out its CPU.
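For the consumer side, a hedged sarama equivalent of the fetch settings might look like the snippet below. Note that max.poll.records is a Java-client option with no direct sarama counterpart, so the batch size has to be throttled in the message handler instead; all names here are illustrative.

```go
package reporting

import (
	"time"

	"github.com/IBM/sarama"
)

// newReportConsumerConfig pulls small batches and waits longer between fetches,
// so the identification service is not flooded with large messages.
func newReportConsumerConfig() *sarama.Config {
	cfg := sarama.NewConfig()
	cfg.Consumer.Fetch.Max = 1 << 20       // ~ fetch.max.bytes = 1 MB
	cfg.Consumer.Fetch.Default = 256 << 10 // ~ max.partition.fetch.bytes = 0.25 MB
	cfg.Consumer.MaxWaitTime = time.Second // ~ fetch.max.wait.ms = 1000
	// max.poll.records has no sarama equivalent; cap in-flight work in the
	// ConsumeClaim handler (for example with a semaphore) instead.
	return cfg
}
```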
Optimization effect
Data identification
After data reporting, data storage, and data access are solved, the next step is data identification. This is the core and most complex part of the entire classification and grading architecture. The identification process consists of four parts: data mapping, rule management, weight calculation, and data validation.
Data mapping
The server identifies 200 sample rows per table. With 20 fields per table and 20 kinds of regular-expression checks per field, and assuming 10 million tables are processed every day, that amounts to 800 billion regular-expression evaluations per day. Under such a huge computational load, the incoming traffic immediately pushed the servers' CPUs to 100% and made the service unavailable.
Unlike IO-intensive workloads, CPU-intensive workloads cannot simply rely on common techniques such as caching and asynchronous processing to relieve the pressure on the server. The following points therefore need to be considered:
- Use the elastic scaling of k8s on the cloud to distribute traffic across multiple container nodes and reduce the load on any single node.
- On a single node, use multi-core parallelism to spread the computation across multiple CPU cores, and use a semaphore to throttle concurrency so the CPU is not pinned at 100%.
- Optimize the regular expressions themselves: a trap hidden in a regular expression was enough to drive the CPU to 100%.
Multi-core parallelism
Multi-core parallelism draws on the MapReduce programming model, which is essentially a "divide and conquer" idea.
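A minimal Go sketch of this idea, assuming 200 sampled values per field and a set of precompiled rules (the type and function names are made up for illustration): each field is matched in parallel across CPU cores, and a weighted semaphore caps concurrency so identification cannot pin every core at 100%.

```go
package identify

import (
	"context"
	"regexp"
	"runtime"
	"sync"

	"golang.org/x/sync/semaphore"
)

// Rule is one precompiled identification rule; compiling patterns once up front
// avoids paying regexp compilation on every sampled value.
type Rule struct {
	Category string
	Pattern  *regexp.Regexp
}

// sem caps how many field-level tasks run at once; in production you would
// leave some headroom so the node is not pinned at 100% CPU.
var sem = semaphore.NewWeighted(int64(runtime.NumCPU()))

// IdentifyTable counts rule hits per field over the sampled rows (the "map" step);
// the caller merges the per-field results (the "reduce" step).
func IdentifyTable(ctx context.Context, samples map[string][]string, rules []Rule) map[string]map[string]int {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results = make(map[string]map[string]int, len(samples))
	)
	for field, values := range samples {
		if err := sem.Acquire(ctx, 1); err != nil {
			break // context cancelled
		}
		wg.Add(1)
		go func(field string, values []string) {
			defer sem.Release(1)
			defer wg.Done()
			hits := make(map[string]int)
			for _, v := range values {
				for _, r := range rules {
					if r.Pattern.MatchString(v) {
						hits[r.Category]++
					}
				}
			}
			mu.Lock()
			results[field] = hits
			mu.Unlock()
		}(field, values)
	}
	wg.Wait()
	return results
}
```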
Optimization effect
Rule management
Classification and grading require fine-grained rule management so that subsequent data security controls can be applied more reasonably. Rules include, but are not limited to, regular expressions, NLP, machine learning, algorithms, full-text matching, fuzzy matching, and blacklists, and a single classification and grading definition may combine several rules. After actual operation and sorting, there are nearly 400 classification and grading definitions and 800 identification rules.
Therefore, rule management must be decoupled from the identification logic in a reasonable way for later maintenance and upgrades, and rules must support hot updates and being switched off without the online service noticing.
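A minimal sketch of one way to decouple rules from the identification logic and hot-swap them without a restart, assuming snapshots are periodically pulled from the rule-management service (the interfaces and names below are illustrative): the identification path only ever reads an immutable snapshot through an atomic pointer, so refreshing or disabling a rule is invisible to in-flight requests.

```go
package identify

import "sync/atomic"

// Matcher is the minimal contract an identification rule must satisfy, whether
// it is backed by a regex, NLP, a blacklist, full-text matching, or an algorithm.
type Matcher interface {
	Category() string
	Match(value string) bool
	Enabled() bool
}

// RuleSet is an immutable snapshot of rules pulled from the rule engine.
type RuleSet struct {
	Version string
	Rules   []Matcher
}

// ruleStore holds the active snapshot; readers never block on writers.
var ruleStore atomic.Pointer[RuleSet]

// Reload atomically swaps in a freshly loaded snapshot; in-flight
// identification keeps using the snapshot it already read.
func Reload(rs *RuleSet) { ruleStore.Store(rs) }

// ActiveRules returns the currently enabled rules, skipping rules that have
// been switched off, without requiring a service restart.
func ActiveRules() []Matcher {
	rs := ruleStore.Load()
	if rs == nil {
		return nil
	}
	var out []Matcher
	for _, m := range rs.Rules {
		if m.Enabled() {
			out = append(out, m)
		}
	}
	return out
}
```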
Weight calculation
Data classification and grading have different dimensions and definitions across industries and businesses, and the source data is often not clearly defined by developers and operations staff, so the boundaries of the final identification results are fuzzy. In practice, the business side frequently reports cases where the identification results are inaccurate.
Suppose there is a field named xid that might be a qqid or a wechatid, and qqid and wechatid map to different classifications, which affects the subsequent compliance process. In a real scenario, xid may match both the qqid and the wechatid identification rules at the same time, so which one should be chosen?
Therefore, the concept of weight is introduced. A weight is not a simple 0-or-1 decision on an identification result; instead, after a field is matched by multiple combined rules, a weight value is computed for each candidate result, the candidates are sorted by weight, and the result with the largest weight is taken as the classification and grading of the current field.
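A small sketch of the weight idea with made-up numbers: every matching rule contributes its own weight to a candidate category, the per-category scores are accumulated over the sampled values, and the highest-scoring category wins. With hypothetical weights such as 0.3 for a field-name match and 0.7 for a content match, an xid column whose values match the qqid content rule would outrank one that only matches the wechatid field-name rule.

```go
package identify

import "sort"

// WeightedHit is one rule match contributing to a category's score.
type WeightedHit struct {
	Category string  // e.g. "qqid" or "wechatid"
	Weight   float64 // contribution of the rule that matched
}

// ClassifyField accumulates weights per category and returns the category with
// the largest total as the field's classification and grading.
func ClassifyField(hits []WeightedHit) (best string, score float64) {
	totals := make(map[string]float64)
	for _, h := range hits {
		totals[h.Category] += h.Weight
	}
	cats := make([]string, 0, len(totals))
	for c := range totals {
		cats = append(cats, c)
	}
	// Sort by score, then by name, so ties are resolved deterministically.
	sort.Slice(cats, func(i, j int) bool {
		if totals[cats[i]] != totals[cats[j]] {
			return totals[cats[i]] > totals[cats[j]]
		}
		return cats[i] < cats[j]
	})
	if len(cats) == 0 {
		return "", 0
	}
	return cats[0], totals[cats[0]]
}
```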
Data validation
The most important data security compliance control is data encryption. To make subsequent compliance tracing easier, the server needs to verify whether the currently reported data is encrypted and store the verification result.
Judging whether the data is encrypted requires a comprehensive view of the database and table state, including whether the data is encrypted, whether the table has been deleted, whether the database has been deleted, and whether the instance has gone offline. The state transitions are represented by the following decision tree:
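The decision-tree figure is not reproduced here; the sketch below shows one plausible evaluation order inferred from the states listed above (the ordering and state names are assumptions): liveness of the instance, database, and table is checked first, and only then is the sampled data judged to be encrypted or plaintext.

```go
package identify

// TableStatus captures the information needed for the validation decision.
type TableStatus struct {
	InstanceOffline bool
	DBDeleted       bool
	TableDeleted    bool
	DataEncrypted   bool
}

// ValidationResult is stored alongside the classification result for tracing.
type ValidationResult string

const (
	ResultInstanceOffline ValidationResult = "instance_offline"
	ResultDBDeleted       ValidationResult = "db_deleted"
	ResultTableDeleted    ValidationResult = "table_deleted"
	ResultEncrypted       ValidationResult = "encrypted"
	ResultPlaintext       ValidationResult = "plaintext"
)

// Validate walks the decision tree from the coarsest state to the finest.
func Validate(s TableStatus) ValidationResult {
	switch {
	case s.InstanceOffline:
		return ResultInstanceOffline
	case s.DBDeleted:
		return ResultDBDeleted
	case s.TableDeleted:
		return ResultTableDeleted
	case s.DataEncrypted:
		return ResultEncrypted
	default:
		return ResultPlaintext
	}
}
```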
Access across departments and platforms
With the difficulties of data reporting and data identification solved, the classification and grading framework can cover most business scenarios. We therefore also hope the framework can serve more departments and save them a great deal of tedious, repetitive work.
Since classification and grading results are themselves sensitive data, cross-department and cross-platform access requires storing the data with physical isolation per department and platform.
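As a hedged sketch of what per-department physical isolation could look like, classification results can be written to a separate database instance per department, chosen through a simple routing map; the DSNs and department names below are placeholders, not the article's actual configuration.

```go
package storage

import (
	"database/sql"
	"fmt"

	_ "github.com/go-sql-driver/mysql" // MySQL-compatible driver for TDSQL-C
)

// deptDSN maps each department or platform to its own physically isolated
// TDSQL-C instance (placeholder addresses and credentials).
var deptDSN = map[string]string{
	"dept-a": "user:pass@tcp(tdsqlc-dept-a:3306)/classify",
	"dept-b": "user:pass@tcp(tdsqlc-dept-b:3306)/classify",
}

// openForDept opens a connection to the instance owned by that department, so
// one department's results never share storage with another's.
func openForDept(dept string) (*sql.DB, error) {
	dsn, ok := deptDSN[dept]
	if !ok {
		return nil, fmt.Errorf("no isolated storage configured for %q", dept)
	}
	return sql.Open("mysql", dsn)
}
```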
Summary
Data classification and grading is very complex, and the complexity exists at both the business level and the architecture level. This article focuses on the architecture level. Some issues can be planned and designed in advance, such as storage selection and general scanning capabilities; others have to be optimized continuously during rollout, such as identification of massive data, where resource cost has to be weighed alongside the performance of the service itself.
There is no good or bad architecture, only the right one. What this article describes comes from my personal experience of the problems encountered during rollout, carefully considered and organized into this article, and it is also a summary of one stage of my work. I also hope the framework will be recognized by more people and that the classification and grading capability can be reused, making a small contribution to the company's data security compliance work.