Author introduction: Hu Mengyu, development engineer on Zhihu's core architecture platform, working on big data infrastructure. His main responsibilities are the secondary development of Zhihu's internal big data components and data platform construction.
Preface
A year ago, Zhihu's big data architecture team first encountered TiDB. At that time, we migrated the metadata database of Hive MetaStore to TiDB and achieved an order-of-magnitude performance improvement over a standalone database. Having seen the power of the distributed NewSQL database TiDB, we had high hopes for it and applied it to other scenarios in our big data architecture, such as Hive large query alerting and NameNode RPC acceleration.
Hive large query alerting
Background
Inside Zhihu, Hive is mainly used in two scenarios: ETL core pipeline tasks and ad hoc (Adhoc) queries. In the ETL scenario, Hive SQL tasks are relatively fixed and stable, but in the ad hoc scenario the SQL submitted by users is random and changeable. When users do not optimize their SQL, the MapReduce jobs that get launched scan too much data, which not only makes the jobs run slowly but also puts huge pressure on HDFS and affects cluster stability. This situation appears very frequently at the end of a quarter or at the end of the year, when some users scan a quarter or even a whole year of data. Once such a query occurs, it causes a shortage of cluster resources, which in turn affects ETL tasks and delays report output.
Introduction to the real-time large SQL query alerting system
In response to the above pain points, we developed a real-time alerting system for large SQL queries. When a user submits a SQL query, we do the following:
- Parse the SQL execution plan and convert it into the table paths and partition paths that need to be scanned;
- Sum up the sizes of all partition paths to calculate the total amount of data to be scanned;
- Determine whether the total amount of scanned data exceeds the threshold; if it does, notify the user on WeChat Work (enterprise WeChat).
The implementation of each step is explained in detail below.
Get the HDFS paths scanned by Hive from the execution plan
In this step, we use the hook mechanism of Hive Server: after each SQL statement is parsed, an audit log is output to Kafka. The audit log has the following format:
```json
{
    "operation": "QUERY",
    "user": "hdfs",
    "time": "2021-07-12 15:43:16.022",
    "ip": "127.0.0.1",
    "hiveServerIp": "127.0.0.1",
    "inputPartitionSize": 2,
    "sql": "select count(*) from test_table where pdate in ('2021-07-01','2021-07-02')",
    "hookType": "PRE_EXEC_HOOK",
    "currentDatabase": "default",
    "sessionId": "5e18ff6e-421d-4868-a522-fc3d342c3551",
    "queryId": "hive_20210712154316_fb366800-2cc9-4ba3-83a7-815c97431063",
    "inputTableList": [
        "test_table"
    ],
    "outputTableList": [],
    "inputPaths": [
        "/user/hdfs/tables/default.db/test_table/2021-07-01",
        "/user/hdfs/tables/default.db/test_table/2021-07-02"
    ],
    "app.owner": "humengyu"
}
```
Here we mainly focus on the following fields:
Field | Meaning |
---|---|
operation | SQL type, such as QUERY, DROP, etc. |
user | The user who submitted the SQL; inside Zhihu this is a group account |
sql | The submitted SQL text |
inputPaths | The HDFS paths scanned by the query |
app.owner | The personal account that submitted the SQL |
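As a reference, a pre-execution hook that collects the scanned paths from the query plan and pushes such an audit record to Kafka could look roughly like the sketch below. This is a simplified illustration, not Zhihu's production hook: the JSON is built naively, only a few fields are emitted, and the broker address and topic name `hive_audit_log` are placeholders.

```java
import java.util.Properties;
import org.apache.hadoop.hive.ql.QueryPlan;
import org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext;
import org.apache.hadoop.hive.ql.hooks.HookContext;
import org.apache.hadoop.hive.ql.hooks.ReadEntity;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

/** Simplified pre-execution hook that pushes an audit record to Kafka. */
public class AuditLogHook implements ExecuteWithHookContext {

    @Override
    public void run(HookContext hookContext) throws Exception {
        QueryPlan plan = hookContext.getQueryPlan();

        // Collect the HDFS locations of every table/partition the query reads.
        StringBuilder paths = new StringBuilder();
        for (ReadEntity input : plan.getInputs()) {
            switch (input.getType()) {
                case TABLE:
                    paths.append(input.getTable().getDataLocation()).append(";");
                    break;
                case PARTITION:
                    paths.append(input.getPartition().getDataLocation()).append(";");
                    break;
                default:
                    break;
            }
        }

        // Naive JSON assembly; a real hook would use a JSON library.
        String record = String.format(
                "{\"user\":\"%s\",\"queryId\":\"%s\",\"sql\":\"%s\",\"inputPaths\":\"%s\"}",
                hookContext.getUserName(),
                plan.getQueryId(),
                plan.getQueryStr().replace("\"", "\\\""),
                paths);

        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");  // placeholder brokers
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // In production the producer would be created once and reused.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("hive_audit_log", record));
        }
    }
}
```

Such a hook is registered with HiveServer through the hive.exec.pre.hooks configuration.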
Summarize partition sizes
To summarize the partition sizes, we need to know the directory size of each HDFS path in inputPaths. We considered the following options:
Option | Advantages | Disadvantages |
---|---|---|
Call the HDFS API to get sizes in real time | Accurate results | Requires calling the getContentSummary method, which is expensive for the NameNode and means a long wait. |
Use the partition statistics in Hive MetaStore | Fast | Results may be inaccurate: some tables are written directly into HDFS directories by other compute engines such as Flink and Spark, and the statistics are not updated in time. |
Parse HDFS's fsimage, compute the size of all Hive directories, and store them in TiDB | Fast | Results have a T+1 delay; the size of partitions written on the current day cannot be counted. |
Considering the usage scenario, most large SQL queries scan months or even years of data, so ignoring partition information for the most recent day or two is acceptable. We chose the third option: every day, parse the fsimage of HDFS, calculate the size of every Hive directory, and store the results in TiDB. Because we also use fsimage information in other scenarios, we store not only the Hive directories but the entire HDFS directory tree, amounting to nearly tens of billions of rows. With such a large data volume that also needs to be indexed, TiDB is clearly a good choice.
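As an illustration of the storage side, the sketch below loads per-directory sizes (assumed to be already aggregated from the daily fsimage dump, e.g. with the HDFS offline image viewer plus a roll-up job) into TiDB through its MySQL-compatible protocol. The table name hdfs_dir_size, its columns, and the connection details are assumptions rather than the actual production schema; path is assumed to be the primary key, which also provides the index used for later lookups.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Map;

/**
 * Sketch: upsert per-directory sizes into TiDB over the MySQL protocol.
 * Table/column names and credentials are illustrative placeholders.
 */
public class DirSizeLoader {

    private static final String UPSERT =
            "INSERT INTO hdfs_dir_size (path, size_bytes) VALUES (?, ?) " +
            "ON DUPLICATE KEY UPDATE size_bytes = VALUES(size_bytes)";

    public static void load(Map<String, Long> dirSizes) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://tidb:4000/hdfs_meta", "writer", "secret");
             PreparedStatement ps = conn.prepareStatement(UPSERT)) {
            int batched = 0;
            for (Map.Entry<String, Long> e : dirSizes.entrySet()) {
                ps.setString(1, e.getKey());
                ps.setLong(2, e.getValue());
                ps.addBatch();
                if (++batched % 1000 == 0) {
                    ps.executeBatch();  // flush in batches of 1000 rows
                }
            }
            ps.executeBatch();          // flush the remainder
        }
    }
}
```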
Real-time alerting
We send the audit logs to Kafka in real time and use Flink to consume them: with KafkaTableSource and the JSON format, Kafka is used as a stream table, and with JdbcLookupTableSource, TiDB is used as a dimension table. This makes it easy to compute the amount of data scanned by each SQL query and then make the alerting decision.
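A minimal sketch of such a job, written with the Table API / Flink SQL of a recent Flink version rather than the original KafkaTableSource/JdbcLookupTableSource classes, might look like the following. It assumes the audit log has been flattened upstream so that each Kafka record carries a single scanned path, and the topic name, broker address, TiDB connection string, and table/column names are all placeholders.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class BigQueryAlertJob {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Audit-log stream from Kafka, one scanned path per record (pre-flattened).
        tEnv.executeSql(
                "CREATE TABLE hive_audit_log (" +
                "  queryId STRING," +
                "  `user` STRING," +
                "  `sql` STRING," +
                "  inputPath STRING," +
                "  proctime AS PROCTIME()" +
                ") WITH (" +
                "  'connector' = 'kafka'," +
                "  'topic' = 'hive_audit_log'," +
                "  'properties.bootstrap.servers' = 'kafka:9092'," +
                "  'scan.startup.mode' = 'latest-offset'," +
                "  'format' = 'json'" +
                ")");

        // TiDB dimension table holding per-directory sizes parsed from fsimage.
        tEnv.executeSql(
                "CREATE TABLE hdfs_dir_size (" +
                "  path STRING," +
                "  size_bytes BIGINT" +
                ") WITH (" +
                "  'connector' = 'jdbc'," +
                "  'url' = 'jdbc:mysql://tidb:4000/hdfs_meta'," +
                "  'table-name' = 'hdfs_dir_size'" +
                ")");

        // Lookup join each scanned path against TiDB and sum the bytes per query.
        tEnv.executeSql(
                "SELECT l.queryId, l.`user`, SUM(d.size_bytes) AS scanned_bytes " +
                "FROM hive_audit_log AS l " +
                "JOIN hdfs_dir_size FOR SYSTEM_TIME AS OF l.proctime AS d " +
                "  ON d.path = l.inputPath " +
                "GROUP BY l.queryId, l.`user`")
            .print();
    }
}
```

The threshold check and the WeChat Work notification would then be applied to the aggregated result; that part is omitted from the sketch.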
[Figure: the final effect, an alert message pushed to the submitting user on WeChat Work]
NameNode RPC acceleration
Background
The story started when, for a period of time, users frequently reported that Hive queries got stuck and did not respond, sometimes for ten minutes, sometimes for several hours, which was very strange. We traced the problem to Hive's getInputSummary method: it holds a global lock, and when a query is large the call takes a long time, causing other query threads to wait for the lock to be released. After reading the source code, we found that getInputSummary can actually be executed concurrently; internally it calls the getContentSummary method of the HDFS client. We removed the global lock and used a thread-pool-like approach instead, so that the calls run with a high degree of concurrency. However, this brings a new problem: the HDFS client's getContentSummary is similar to a filesystem du operation, and if the concurrency is too high it significantly hurts NameNode performance. Hive is not the only caller of getContentSummary, other compute engines call it too, so it was necessary to optimize this method.
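For illustration, the idea of replacing the global lock with bounded concurrency looks roughly like this (a simplified sketch, not the actual Hive patch): each input path is summarized on a fixed-size thread pool, so the calls run in parallel while the pool size caps the pressure on the NameNode.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Sketch: sum input sizes concurrently instead of under a global lock. */
public class ConcurrentInputSummary {

    // Cap concurrency so the NameNode is not flooded with getContentSummary RPCs.
    private static final ExecutorService POOL = Executors.newFixedThreadPool(8);

    public static long totalInputSize(Configuration conf, List<Path> inputPaths)
            throws Exception {
        List<Future<Long>> futures = new ArrayList<>();
        for (Path p : inputPaths) {
            futures.add(POOL.submit(() -> {
                FileSystem fs = p.getFileSystem(conf);
                // Equivalent to a recursive "du" on the directory.
                ContentSummary cs = fs.getContentSummary(p);
                return cs.getLength();
            }));
        }
        long total = 0L;
        for (Future<Long> f : futures) {
            total += f.get();
        }
        return total;
    }
}
```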
Cache ContentSummary information
Zhihu's HDFS was split into a federation in 2019, adopting the Router-Based Federation solution, which introduces the Router, a proxy component in front of the NameNode. We therefore only need to add a layer of ContentSummary caching at the Router layer: when a client initiates a call, if the cache hits, the result is read from the cache; if it misses, the request is forwarded to the NameNode. After internal discussion, we came up with the following caching schemes:
Option | Advantages | Disadvantages |
---|---|---|
On the client's first request, the Router fetches the result from the NameNode and updates the cache; on subsequent requests it checks the cache first and compares the directory's modification time: if the directory has not been modified in the meantime, the cached result is returned, otherwise the result is fetched from the NameNode and the cache is updated. | For directories that are rarely modified, the NameNode only needs to be queried once. | The first request still has to go to the NameNode; only directories without subdirectories can be cached, because a parent directory cannot perceive changes inside its subdirectories. |
Every day, generate ContentSummary information for the full directory tree from fsimage and cache it in TiDB; when a client makes a request, follow the same logic as the first scheme. | For most directories, even the first request does not need to go to the NameNode. | Still, only directories without subdirectories can be cached, because a parent directory cannot perceive changes inside its subdirectories. |
We chose the second option, because the ContentSummary information is already generated for the Hive large query alerting, so it was very convenient to reuse it here. After putting the cache into TiDB and indexing the request path, the latency of a getContentSummary request can normally be kept below 10 ms, whereas on a NameNode without the TiDB cache the same call may take minutes or even tens of minutes.
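For illustration, the read path of such a cache at the Router layer could look roughly like the sketch below. The TiDB table content_summary_cache(path, length, mtime) and its columns are hypothetical, and the real implementation also has to refresh the cache and restrict caching to directories without subdirectories, as noted above.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Sketch of the Router-side cache lookup; table and columns are illustrative. */
public class ContentSummaryCache {

    private final Connection tidb;      // JDBC connection to TiDB
    private final FileSystem namenode;  // client to the downstream NameNode

    public ContentSummaryCache(Connection tidb, FileSystem namenode) {
        this.tidb = tidb;
        this.namenode = namenode;
    }

    public ContentSummary getContentSummary(Path path) throws Exception {
        // Cheap call: the directory's own modification time from the NameNode.
        FileStatus status = namenode.getFileStatus(path);

        String sql = "SELECT length, mtime FROM content_summary_cache WHERE path = ?";
        try (PreparedStatement ps = tidb.prepareStatement(sql)) {
            ps.setString(1, path.toUri().getPath());
            try (ResultSet rs = ps.executeQuery()) {
                if (rs.next() && rs.getLong("mtime") >= status.getModificationTime()) {
                    // Cache hit and the directory has not changed since it was cached.
                    return new ContentSummary.Builder()
                            .length(rs.getLong("length"))
                            .build();
                }
            }
        }
        // Cache miss or stale entry: fall back to the expensive NameNode call
        // (and, in the real system, refresh the cache with the fresh result).
        return namenode.getContentSummary(path);
    }
}
```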
Outlook
This time we used TiDB's large storage capacity and indexing capability to cache HDFS meta-information, which satisfies several internal scenarios at Zhihu. We will continue to improve and expand this usage in the future: for example, the cache of HDFS file information could be made real-time by subscribing to edit log changes and merging them with the existing fsimage data in TiDB, producing a low-latency NameNode snapshot for online analysis.