Dealing with the "Druid Metadata" Mess

vivo Internet Big Data Team - Zheng Xiaofeng

1. Background

Druid is a data storage system designed for high-performance slicing and OLAP analysis on large datasets.

Because Druid serves queries over both offline and real-time data, it is most commonly used as the storage system behind GUI-based analysis, business monitoring, and real-time data warehouses.

In addition, Druid has a multi-process, distributed architecture where each Druid component type can be configured and scaled independently, providing maximum flexibility for the cluster.

Because of its architecture and the mix of offline and real-time data it handles, Druid's metadata management logic is relatively complex. The complexity shows up mainly in the number of metadata storage media Druid uses and in the metadata transfer logic between its many component types.

This article walks through Druid's metadata management in order to better understand how Druid works internally.

2. Concepts related to Druid metadata

2.1 Segment

A Segment is the most basic unit Druid uses to manage data. A Datasource contains multiple Segments, and each Segment stores the Datasource's data for a certain time period. How the data for that period is organized is defined by the Segment's payload (JSON), which records the segment's dimensions, metrics, and other information.

The payload (dimensions, metrics) can differ between segments of the same Datasource. Segment information mainly includes the following parts:

[Interval]: describes the start time and end time of the data.
[DataSource]: a string specifying which Datasource the segment belongs to.
[Version]: represented as a timestamp. Among segments covering the same time interval, the data of the segment with the higher version is visible, and the segment with the lower version will be deleted.
[Payload]: mainly the dimension and metric information of the segment, plus the location of the segment data in DeepStorage, etc.
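
The version rule can be illustrated with a small sketch. It is only an approximation under stated assumptions: the `Segment` tuple and the `visible_segments` helper are hypothetical names, versions are ISO-8601 strings compared lexicographically, and only segments with identical intervals are compared.

```python
from collections import namedtuple

# Hypothetical, simplified view of a segment: only the fields used by the rule.
Segment = namedtuple("Segment", ["datasource", "interval", "version"])

def visible_segments(segments):
    """For each (datasource, interval), keep only the highest-version segment."""
    best = {}
    for seg in segments:
        key = (seg.datasource, seg.interval)
        # ISO-8601 timestamps in the same format sort correctly as strings.
        if key not in best or seg.version > best[key].version:
            best[key] = seg
    return sorted(best.values())

segments = [
    Segment("events", "2024-01-01/2024-01-02", "2024-01-02T00:00:00.000Z"),
    Segment("events", "2024-01-01/2024-01-02", "2024-01-05T08:30:00.000Z"),
    Segment("events", "2024-01-02/2024-01-03", "2024-01-03T00:00:00.000Z"),
]
visible = visible_segments(segments)
```

Here the first interval has two versions, and only the one written on 2024-01-05 stays visible.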

The main components of the segment

Example of segment internal data

2.2 Datasource

A Datasource is the equivalent of a table in a relational database. Its schema changes dynamically according to its available Segments. If a Datasource has no available Segments (used=1), it does not appear in the Datasource list or the query interface of druid-web.

The druid_dataSource table in the metadata database does not store schema information; it only stores the offsets consumed by the real-time tasks of the corresponding datasource. So although a Druid Datasource is equivalent to a relational database table, its schema is not defined in the druid_dataSource metadata table.

So where does the Datasource schema shown on the druid-web page come from?

In fact, it is merged in real time from the metadata of all Segments under the Datasource, so the schema of the Datasource changes in real time.

The advantage of this design is that it adapts well to changes in the dimensions a Datasource needs:

Schema merging process
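
A minimal sketch of the merging idea follows. It assumes each segment payload is a dict with `dimensions` and `metrics` lists, which is a simplification for illustration; `merge_schema` is a hypothetical helper, not Druid's implementation.

```python
def merge_schema(segment_payloads):
    """Union the dimensions and metrics of all segments, keeping first-seen order."""
    dimensions, metrics = [], []
    for payload in segment_payloads:
        for name in payload.get("dimensions", []):
            if name not in dimensions:
                dimensions.append(name)
        for name in payload.get("metrics", []):
            if name not in metrics:
                metrics.append(name)
    return {"dimensions": dimensions, "metrics": metrics}

# A newer segment adds the "city" dimension; the merged schema picks it up.
payloads = [
    {"dimensions": ["country", "device"], "metrics": ["count"]},
    {"dimensions": ["country", "device", "city"], "metrics": ["count", "sum_bytes"]},
]
schema = merge_schema(payloads)
```

Because the schema is derived rather than declared, adding a dimension only requires that new segments carry it.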

2.3 Rule

A Rule defines the segment retention rules of a Datasource. Rules fall into two main categories: Load and Drop.

Load represents the segment retention policy.
Drop represents the segment deletion policy.

Load and Drop each have three subtypes: Forever Load/Drop, Interval Load/Drop, and Period Load/Drop. A Datasource has one or more Rules. If none are defined, the cluster's default Rules are used.

The Rule list of a Datasource is ordered (custom rules first, cluster default rules last). When the rules run, every available segment under the Datasource is checked against them in order; as soon as a segment matches one Rule, the remaining Rules are not evaluated for that segment (see the figure "Rule processing logic case"). A Rule mainly includes the following information:

[Type]: either a drop rule or a load rule.
[Tier and replica information]: for a Load rule, the number of replicas on Historical machines in each tier.
[Time information]: the time period whose segments are dropped or loaded.

An example of a Rule is as follows:

 [
   {
     "period": "P7D",
     "includeFuture": true,
     "tieredReplicants": {
       "_default_tier": 1,
       "vStream": 1
     },
     "type": "loadByPeriod"
   },
   {
     "type": "dropForever"
   }
 ]

Rule processing logic case
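
The first-match behaviour can be sketched as below. This is a simplified illustration: the `days` field stands in for parsing an ISO-8601 period such as P7D, a period rule is matched on the segment's start time only, and `rule_applies` and `first_matching_rule` are hypothetical helpers.

```python
from datetime import datetime, timedelta

def rule_applies(rule, segment_start, now):
    """Simplified matching: forever rules always apply, period rules look back N days."""
    if rule["type"] in ("loadForever", "dropForever"):
        return True
    if rule["type"] in ("loadByPeriod", "dropByPeriod"):
        return segment_start >= now - timedelta(days=rule["days"])
    return False

def first_matching_rule(rules, segment_start, now):
    """Walk the ordered rule list and stop at the first rule that applies."""
    for rule in rules:
        if rule_applies(rule, segment_start, now):
            return rule
    return None

rules = [
    {"type": "loadByPeriod", "days": 7},  # stands in for "period": "P7D"
    {"type": "dropForever"},
]
now = datetime(2024, 1, 10)
recent = first_matching_rule(rules, datetime(2024, 1, 8), now)
old = first_matching_rule(rules, datetime(2024, 1, 1), now)
```

With the example rule list, a segment from the last seven days is loaded, and anything older falls through to the final drop rule.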

2.4 Tasks

Tasks are mainly used for data ingestion (this article mainly discusses tasks that ingest Kafka data in real time). While running, a Task generates one or more Segments according to the time column of the data. Tasks are divided into real-time and offline tasks.

Real-time tasks (Kafka) are generated automatically by the Overlord process according to the Supervisor definition;
Offline tasks (type: index_hadoop, index_parallel) must be submitted by an external system through the API.

Each task mainly includes the following parts of information:

[dataSchema]: defines the dimensions (dimensionsSpec), metrics (metricsSpec), time column (timestampSpec), segment granularity (segmentGranularity), and data aggregation granularity (queryGranularity) of the segments generated by the task.
[tuningConfig]: tuning parameters for the ingestion process (segment generation strategy, index type, data discarding strategy, etc.). Different task types have different parameters.
[ioConfig]: defines the data input source; the configuration items differ by data source.
[context]: global configuration for the task, such as options for the task's Java process.
[datasource]: the Datasource for which the task builds Segments.

Real-time task generation segment case

2.5 Supervisor

A Supervisor manages real-time tasks; offline tasks have no corresponding Supervisor. Supervisor and Datasource have a one-to-one relationship. Supervisor objects are created by the Overlord process while the cluster is running. After Supervisor information is submitted through the Overlord interface, it is stored in the metadata database (MySQL). The content of a Supervisor is similar to that of a Task, and a real-time Task can be thought of as cloned from its Supervisor.

3. Druid overall architecture

With the metadata concepts introduced, let's look at Druid's overall architecture from a macro perspective before going deeper into the metadata itself.

A Druid cluster can be compared to a company, with the different Druid components playing the roles of different kinds of employees. The components fall roughly into three groups: leadership, workshop employees, and sales employees, as shown in the following figure:

Druid component classification

Leadership: According to external market demand (Overlord receives external ingestion task requests), the leader sends production tasks to the appropriate professional managers (MiddleManager). Each professional manager runs a team (MiddleManager starts Peon processes) and hands the specific production tasks (Tasks) to its employees (Peon processes).
Workshop employees: Production employees (Peon) produce the products (segments), and the warehouse manager (Coordinator) distributes the finished products (segments) to the warehouse (Historical).
Sales employees: The salesperson (Broker) gets the latest products (segments) from the production employees (Peon) and the existing products (segments) from the warehouse (Historical), then sorts and packs them (the data is further merged and aggregated) and hands them to the customer (the query user).

The company analogy gives a first overall impression of a Druid cluster.

The Druid cluster architecture is described in more detail below. Druid has a multi-process, distributed architecture in which each component type can be configured and scaled independently, giving the cluster maximum flexibility.

Disruption of one component does not immediately affect other components.

Below we briefly introduce the role of each Druid component in the cluster.

Druid Architecture

Overlord
Overlord is responsible for accepting tasks, coordinating task assignment, creating task locks, and collecting task status and returning it to the caller. When the cluster has multiple Overlords, a leader is chosen by an election algorithm and the other followers act as backups.
MiddleManager
MiddleManager receives the real-time tasks assigned by Overlord and creates new Peon processes to run them. Each MiddleManager can run multiple Peon instances, and each real-time Peon both ingests real-time data and serves real-time data queries.
Coordinator
The Coordinator is mainly responsible for managing and publishing segments in the Druid cluster (mainly historical segments), including loading new segments, dropping segments that do not match the rules, managing segment replicas, and balancing segment load. If the cluster has multiple Coordinator nodes, a leader is chosen by an election algorithm and the other followers act as backups.
Historical
Historical loads all segments of historical data that are outside the real-time window and match the loading rules. Each Historical node only stays in sync with Zookeeper and announces its loaded segments to Zookeeper.
Broker
The Broker node is the query entry point for the whole cluster. The Broker synchronizes, in real time, the metadata of all published segments stored in Zookeeper, that is, which storage nodes each segment resides on. From this, the Broker builds a timeline for each dataSource; a timeline describes the storage location of each segment in chronological order.

Each query request contains a dataSource and an interval. Using these two pieces of information, the Broker finds the storage nodes of all matching segments in the timeline and sends the query request to those nodes.
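
The lookup can be sketched as follows. The timeline here is just a list of (start, end, segment id, node) tuples per dataSource, with ISO dates compared as strings; this layout and the `nodes_for_query` helper are assumptions for illustration, not Druid's actual timeline structure.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """Half-open interval overlap test: [a_start, a_end) vs [b_start, b_end)."""
    return a_start < b_end and b_start < a_end

def nodes_for_query(timelines, datasource, start, end):
    """Map each storage node to the segment ids it must scan for this query."""
    targets = {}
    for seg_start, seg_end, seg_id, node in timelines.get(datasource, []):
        if overlaps(seg_start, seg_end, start, end):
            targets.setdefault(node, []).append(seg_id)
    return targets

# Hypothetical timeline: two segments on Historicals, the newest on a Peon.
timelines = {
    "events": [
        ("2024-01-01", "2024-01-02", "seg-a", "hist1:8083"),
        ("2024-01-02", "2024-01-03", "seg-b", "hist2:8083"),
        ("2024-01-03", "2024-01-04", "seg-c", "peon1:8100"),
    ]
}
targets = nodes_for_query(timelines, "events", "2024-01-02", "2024-01-04")
```

Only the nodes holding overlapping segments receive the query, so seg-a and its node are skipped in this example.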

4. Druid metadata storage medium

Druid stores metadata in different storage media according to its needs. To improve query performance, it caches all metadata in memory. The metadata of historical data is also saved to the metadata database (MySQL) so that it can be recovered when the cluster restarts.

Because Druid has a multi-process, distributed architecture, it relies on Zookeeper for metadata transmission, service discovery, leader election, and similar functions. Historical nodes additionally store segment metadata in local files.

So why does a Historical node cache the metadata of the segments it has loaded on its own local disk?

Because after a Historical node restarts, it can read the segment metadata locally instead of reading it across nodes from another metadata store such as MySQL, which greatly speeds up the recovery of the node's data.

The data and functions of each of these storage media (memory, metadata database, Zookeeper, and local files) are described below.

4.1 Metadata database (MySQL)

The MySQL database is mainly used for long-term persistence of Druid metadata. For example, segment metadata is stored in the druid_segments table, historical Task information in druid_tasks, Supervisor information in druid_supervisors, and so on.

Some Druid service processes load the metadata persisted in the metadata database when they start. For example, the Coordinator process periodically loads the list of segments whose used field equals 1 in the druid_segments table, and when Overlord starts it automatically loads the druid_supervisors table to restore the original real-time ingestion tasks.

MySQL metadata database tables

4.2 Zookeeper

Zookeeper mainly stores the metadata generated in real time while the Druid cluster is running. Its data directories fall roughly into three categories: master-node high availability, data ingestion, and data query.

The following describes the metadata content of the Druid-related Zookeeper directory.

Zookeeper metadata node classification

4.2.1 Master Node High Availability Related Directories

${druid.zk.paths.base}/coordinator: the leader/follower high-availability directory for the Coordinator. It contains multiple ephemeral sequential nodes; the node with the smallest sequence number is the leader.

${druid.zk.paths.base}/overlord: the leader/follower high-availability directory for the Overlord. It contains multiple ephemeral sequential nodes; the node with the smallest sequence number is the leader.
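
The "smallest sequence number wins" rule can be sketched without a Zookeeper client. The child node names below are hypothetical; the only assumption is that each name ends in a dash followed by the zero-padded sequence number that Zookeeper appends to sequential nodes.

```python
def elect_leader(children):
    """Return the child znode name with the smallest trailing sequence number."""
    if not children:
        return None
    return min(children, key=lambda name: int(name.rsplit("-", 1)[1]))

# Hypothetical children of the coordinator (or overlord) election directory.
children = [
    "host-b-0000000007",
    "host-a-0000000003",
    "host-c-0000000011",
]
leader = elect_leader(children)
```

Because the nodes are ephemeral, the leader's node disappears when its session ends, and the next smallest sequence number takes over.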

4.2.2 Directories related to data query

${druid.zk.paths.base}/announcements: stores only the host:port of Historical and Peon processes (no MiddleManager, Broker, Coordinator, or other processes). It is used for service discovery of query-related nodes.

${druid.zk.paths.base}/segments: the list of segments that can currently be queried in the cluster. Directory structure: ${host:port of Historical or Peon}/${segmentId}. The Broker node synchronizes this segment information in real time as an important basis for data queries.

4.2.3 Directories related to data ingestion

${druid.zk.paths.base}/loadQueue: the list of segments that a Historical needs to load or drop (not only load). The Historical process listens in this directory for the events (load or drop) it must handle, and actively deletes each event from the directory once it has been completed.

${druid.zk.paths.indexer.base}=${druid.zk.paths.base}/indexer: the base directory for ingestion task data.

${druid.zk.paths.indexer.base}/announcements: stores the list of currently live MiddleManagers. Note that Historical and Peon processes are not listed here; only ingestion-related services are stored here, for service discovery of ingestion-related nodes.

${druid.zk.paths.indexer.base}/tasks: Overlord places the task information it assigns in this directory (${host:port of MiddleManager}/taskInfo). Once the task is running on the MiddleManager, the task node is deleted.

${druid.zk.paths.indexer.base}/status: stores the running status of tasks. Overlord obtains the latest task status by watching this directory.
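
As a quick reference, the directories listed in section 4.2 can be derived from the base path. The sketch below assumes a base of /druid; the `zk_paths` helper and its dictionary keys are hypothetical names used only for illustration.

```python
def zk_paths(base="/druid"):
    """Build the Zookeeper directories described above from the base path."""
    indexer = f"{base}/indexer"  # ${druid.zk.paths.indexer.base}
    return {
        "coordinator": f"{base}/coordinator",
        "overlord": f"{base}/overlord",
        "announcements": f"{base}/announcements",
        "segments": f"{base}/segments",
        "load_queue": f"{base}/loadQueue",
        "indexer_announcements": f"{indexer}/announcements",
        "tasks": f"{indexer}/tasks",
        "status": f"{indexer}/status",
    }

paths = zk_paths()
```

Note that there are two separate announcements directories: one under the base path for query-serving nodes and one under the indexer path for MiddleManagers.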

4.3 Memory

To speed up metadata access, Druid synchronizes metadata into memory. It does this mainly in two ways: periodic SQL queries pull metadata from MySQL, and Apache Curator Recipes synchronize the metadata on Zookeeper into memory in real time, as shown in the figure below.

Each process holds different metadata. The following describes what each role caches.

Druid process metadata synchronization method

4.3.1 Overlord

Overlord synchronizes the data in the Zookeeper directory ${druid.zk.paths.indexer.base}/announcements in real time and stores it in the variable RemoteTaskRunner::zkWorkers (type: Map). Each ZkWorker corresponds to one MM process. Within each ZkWorker object, the task information in the Zookeeper directory ${druid.zk.paths.indexer.base}/status/${mm_host:port} is synchronized in real time and stored in the RemoteTaskRunner::runningTasks variable.

By default, the rows of druid_tasks with active = 1 are synchronized from the database every minute and stored in the variable TaskQueue::tasks (type: List). During synchronization, the in-memory task list is compared with the latest task list from the metadata database to obtain the lists of added and removed tasks; the added tasks are appended to TaskQueue::tasks and the removed tasks are cleaned up.
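
The comparison step can be sketched as a simple list diff. The `sync_task_queue` function below is a hypothetical illustration of the idea, with tasks reduced to their ids; it is not the TaskQueue code itself.

```python
def sync_task_queue(in_memory, active_in_db):
    """Diff the in-memory task list against the active tasks from the database."""
    known = set(in_memory)
    latest = set(active_in_db)
    added = [t for t in active_in_db if t not in known]
    removed = [t for t in in_memory if t not in latest]
    # Keep surviving tasks in their original order, then append the new ones.
    updated = [t for t in in_memory if t in latest] + added
    return updated, added, removed

updated, added, removed = sync_task_queue(["t1", "t2"], ["t2", "t3"])
```

In this example t1 is no longer active in the database and is cleaned up, while t3 is new and joins the queue.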

4.3.2 Coordinator

By default, the segments with used=1 in the druid_segments table of the metadata database are synchronized into the variable SQLMetadataSegmentManager::dataSourcesSnapshot every minute.

By default, the druid_rules table of the metadata database is synchronized into the SQLMetadataRuleManager::rules variable every minute.

The CoordinatorServerView class (described later) synchronizes the data of ${druid.zk.paths.base}/announcements and ${druid.zk.paths.base}/segments in real time. This data is compared with the segments in the metadata database to determine which segments should be loaded or dropped.
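
At its simplest, this comparison is a set difference between what the metadata database says should be available and what the cluster is currently serving. The sketch below deliberately ignores rules, tiers, and replica counts, and `plan_segment_actions` is a hypothetical helper.

```python
def plan_segment_actions(used_in_db, served_in_cluster):
    """Compare used=1 segments from MySQL with segments announced in Zookeeper."""
    to_load = sorted(set(used_in_db) - set(served_in_cluster))
    to_drop = sorted(set(served_in_cluster) - set(used_in_db))
    return to_load, to_drop

to_load, to_drop = plan_segment_actions(
    used_in_db={"s1", "s2", "s3"},
    served_in_cluster={"s2", "s4"},
)
```

Segments missing from the cluster are queued for loading, and segments that are served but no longer marked as used are queued for dropping.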

4.3.3 Historical

Historical synchronizes the data under ${druid.zk.paths.base}/loadQueue/${historical_host:port} in real time and performs the segment load and drop operations it finds there. After an operation completes, it actively deletes the corresponding node.

Historical exposes segments by reporting segment information to ${druid.zk.paths.base}/segments.

4.3.4 MiddleManager

MiddleManager synchronizes the data of ${druid.zk.paths.indexer.base}/tasks/${mm_host:port} in real time and starts the task (Peon) process; after startup completes, it automatically deletes the corresponding node.

MiddleManager reports segment information to ${druid.zk.paths.base}/segments to expose segments.

4.3.5 Broker

The BrokerServerView class synchronizes the data of ${druid.zk.paths.base}/announcements and ${druid.zk.paths.base}/segments in real time and builds a timeline object for the whole system (BrokerServerView::timelines), which is the basis for data queries. The class dependencies involved in synchronization are shown in the following figure.

A lower-level class object learns about segment additions, deletions, and modifications by listening to the upper-level class object and runs its own logic in response. The upper-level objects watch ${druid.zk.paths.base}/announcements and ${druid.zk.paths.base}/segments and notify the lower-level objects of data changes through callback listeners.

Listening relationships between objects while segments in Zookeeper are synchronized to a Druid process
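
The callback-listener relationship can be sketched with two small classes. `InventoryView` and `TimelineView` are hypothetical stand-ins for the upper-level and lower-level objects; they are not Druid's class names, and Zookeeper events are simulated by direct method calls.

```python
class InventoryView:
    """Upper level: receives raw Zookeeper change events and fans them out."""

    def __init__(self):
        self._listeners = []

    def register(self, listener):
        self._listeners.append(listener)

    def on_zk_event(self, kind, server, segment_id):
        for listener in self._listeners:
            listener(kind, server, segment_id)


class TimelineView:
    """Lower level: keeps a segment -> servers map up to date via callbacks."""

    def __init__(self, inventory):
        self.segments = {}
        inventory.register(self._handle)

    def _handle(self, kind, server, segment_id):
        if kind == "added":
            self.segments.setdefault(segment_id, set()).add(server)
        elif kind == "removed":
            servers = self.segments.get(segment_id, set())
            servers.discard(server)
            if not servers:
                # No server holds the segment any more, so forget it.
                self.segments.pop(segment_id, None)


inventory = InventoryView()
view = TimelineView(inventory)
inventory.on_zk_event("added", "hist1:8083", "seg-1")
inventory.on_zk_event("added", "hist2:8083", "seg-1")
inventory.on_zk_event("removed", "hist1:8083", "seg-1")
```

The lower-level object never talks to Zookeeper directly; it only reacts to the events pushed down by the object it listens to.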

4.4 Local files

Metadata in local files is mainly read and loaded when a single node recovers.

For example, the info_dir directory (such as /data1/druid/segment-cache/info_dir) under the first data directory of a Historical node stores the metadata of every segment loaded by that node. When the Historical process restarts, it reads the segment metadata in this directory and checks whether the segment data exists locally. If not, it downloads the data from the deep storage system (HDFS). After the download completes, the segment information is reported to Zookeeper (path: ${druid.zk.paths.base}/segments).
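
The restart logic can be sketched with in-memory stand-ins. Everything here is hypothetical: `info_dir_entries` represents the segment ids found in info_dir, `local_data` represents the local segment cache, and `download` represents fetching from deep storage.

```python
def recover_segments(info_dir_entries, local_data, download):
    """Re-announce every segment listed in info_dir, downloading missing data first."""
    announced = []
    for segment_id in info_dir_entries:
        if segment_id not in local_data:
            local_data[segment_id] = download(segment_id)
        announced.append(segment_id)  # would be reported to Zookeeper
    return announced

downloaded = []

def fake_download(segment_id):
    """Stand-in for pulling segment data from deep storage."""
    downloaded.append(segment_id)
    return b"from-deep-storage"

local_data = {"seg-1": b"cached"}
announced = recover_segments(["seg-1", "seg-2"], local_data, fake_download)
```

Only seg-2 triggers a download in this example, which is why local metadata makes the restart cheap when the data is still on disk.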

5. Druid metadata related business logic

Because Druid has many component types, its business logic is fairly complex. This section works from the whole to the parts and from the macro level to the details, building up an understanding of Druid's business logic step by step in order to see the role metadata plays in it.

5.1 Overall business logic of Druid metadata

The previous chapter described how the components of a Druid cluster cooperate as a whole. The following sorts out the role of metadata in the cluster through three pieces of business logic: ingestion task management, data ingestion, and data query.

5.1.1 Ingestion task management

Before data can be ingested, the user must submit an ingestion task. According to the task's configuration, Overlord instructs the MiddleManager to start the task's process (the Peon process), which ingests the data.

Task submission and management

The following describes the business logic of Druid's internal task management in the order of the numbers in the figure above:

① After the Overlord process receives a task submission request, it writes the task information into the druid_tasks table with the active field set to 1.
② Overlord assigns the task to a specific MiddleManager node and writes the task information to the Zookeeper directory (${druid.zk.paths.indexer.base}/tasks).
③ The MiddleManager process watches the Zookeeper directory (${druid.zk.paths.indexer.base}/tasks) for the tasks that the current node needs to start.
④ MiddleManager starts the Peon process (task) by forking. The Peon process then begins ingesting data and writes the task's running status to the Zookeeper directory (${druid.zk.paths.indexer.base}/status).
⑤ Overlord watches the Zookeeper directory (${druid.zk.paths.indexer.base}/status) in real time to obtain the latest task status.
⑥ After the task completes, Overlord updates the task status in the druid_tasks table, setting active=0.
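
The six steps can be traced with plain dictionaries standing in for the druid_tasks table and the Zookeeper directories. This is only a simulation under stated assumptions: the status strings, the MiddleManager address, and the `run_task_lifecycle` function are illustrative.

```python
def run_task_lifecycle(task_id, mm="mm1:8091"):
    """Simulate steps 1-6 with dicts in place of MySQL and Zookeeper."""
    db_tasks = {}                        # stands in for the druid_tasks table
    zk = {"tasks": {}, "status": {}}     # stands in for .../tasks and .../status

    # Step 1: Overlord records the task as active.
    db_tasks[task_id] = {"active": 1, "status": "PENDING"}
    # Step 2: Overlord assigns the task to a MiddleManager via Zookeeper.
    zk["tasks"].setdefault(mm, {})[task_id] = {"id": task_id}
    # Steps 3-4: the MiddleManager picks up the task, starts a Peon, and the
    # task node is removed; the Peon reports its status.
    zk["tasks"][mm].pop(task_id)
    zk["status"].setdefault(mm, {})[task_id] = "RUNNING"
    zk["status"][mm][task_id] = "SUCCESS"
    # Steps 5-6: Overlord reads the final status and marks the task inactive.
    final_status = zk["status"][mm][task_id]
    db_tasks[task_id] = {"active": 0, "status": final_status}
    return db_tasks, zk

db_tasks, zk = run_task_lifecycle("task-1")
```

The database holds the durable record of the task, while Zookeeper only carries the assignment and the live status between Overlord and MiddleManager.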

5.1.2 Data Ingestion Logic

Druid data ingestion logic

The following describes the business logic of Druid's internal data ingestion in the order of the numbers in the figure above:

① After the Peon process produces a segment locally, it uploads the segment data to the deep storage (HDFS).
② It inserts a row of segment metadata into the druid_segments table of the metadata database, including the HDFS address of the segment data and its Interval. Note that the used field is 1 at this point.
③ The Coordinator process periodically pulls the rows whose used value is 1 from the druid_segments table.
④ The Coordinator process writes the segment allocation information to the Zookeeper directory: ${druid.zk.paths.base}/loadQueue.
⑤ The Historical process watches the Zookeeper directory (${druid.zk.paths.base}/loadQueue) to obtain the segments that the current node needs to load.
⑥ It downloads the segment data from HDFS and loads the segment.
⑦ It synchronizes the metadata of the loaded segment to the Zookeeper directory (${druid.zk.paths.base}/segments).
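
Steps ⑤ to ⑦ on the Historical side can be sketched as a loop over the load queue. The dictionaries, the storage paths, and the `historical_process_queue` function are hypothetical; the sketch also handles drop requests, which the loadQueue directory carries as well.

```python
def historical_process_queue(load_queue, announced, deep_storage):
    """Handle each queued request, then delete it from the queue."""
    for segment_id, action in list(load_queue.items()):
        if action == "load":
            # "Download" from deep storage and announce the loaded segment.
            announced[segment_id] = deep_storage[segment_id]
        elif action == "drop":
            announced.pop(segment_id, None)
        del load_queue[segment_id]  # the request is removed once handled
    return announced

# Hypothetical state: seg-0 is currently served, seg-1 was just published.
deep_storage = {"seg-1": "hdfs:///druid/seg-1"}
announced = {"seg-0": "hdfs:///druid/seg-0"}
load_queue = {"seg-1": "load", "seg-0": "drop"}
historical_process_queue(load_queue, announced, deep_storage)
```

After the loop the queue is empty and only seg-1 is announced, which is the state a Broker would then see.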

5.1.3 Data query logic

Data query mainly involves three roles: Peon, Historical, and Broker. The Broker selects the segments to query according to the dataSource and interval in the client's query request. Acting as a client itself, it then fetches real-time data from the Peons and historical data from the Historicals, and further aggregates the two parts according to the query, as shown below:

Druid data query logic
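
The final aggregation step can be sketched for a simple count-by-dimension query. The partial results and the `merge_partial_counts` helper are hypothetical; real queries involve many aggregator types, but the merge idea is the same.

```python
from collections import Counter

def merge_partial_counts(*partials):
    """Sum per-key counts returned by different nodes into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts key by key
    return dict(total)

from_historical = {"cn": 10, "us": 4}   # older, already published segments
from_peon = {"cn": 3, "in": 2}          # data still being ingested
merged = merge_partial_counts(from_historical, from_peon)
```

The client sees a single result, even though part of it came from segments that have not yet been handed off to a Historical.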

5.2 Druid metadata specific business logic

With an overall understanding of the Druid cluster, the following looks in more detail at the role metadata plays between components.

The dashed arrows in the figure below represent metadata transfers. Following the numbers in the figure, the list below describes the metadata exchanged between the component at one end of each dashed arrow and the metadata storage medium (MySQL, Zookeeper) at the other, covering both reads and writes:

Druid metadata business logic

① Write: task information is written when a task starts, and supervisor information is written when a real-time task is submitted. Read: task information in various states is queried when the Broker calls the Overlord interface, and supervisor information is restored when the process restarts.
② Write: task information is written when a task is assigned to a MiddleManager. Read: status information of running tasks is synchronized.
③ Write: the task status of the current node is written to Zookeeper. Read: information about tasks to start or stop is read.
④ Write: real-time segment information is reported after the task starts.
⑤ Read: the Coordinator periodically reads the list of segments with used=1.
⑥ Write: the segment assignments made by the Coordinator. Read: the list of segments already assigned.
⑦ Write: information about segments that have finished loading. Read: information about segments that need to be loaded.
⑧ Read: information about loaded segments, used as the basis for data queries.

6. Summary

The preceding sections introduced the role of metadata in a Druid cluster from four angles (basic concepts of Druid metadata, Druid's overall architecture, Druid's metadata storage media, and the business logic related to Druid metadata), moving from the whole to the parts and from the abstract to the details.

Druid has a multi-process, distributed architecture. Each component only cares about its own business logic and metadata, and components are decoupled through RPC communication or Zookeeper. Each component type can be configured and scaled independently, which greatly increases the flexibility of the cluster, and the interruption of one component does not immediately affect the others. Druid's metadata can be summarized as follows:

Druid metadata storage media include memory, metadata database (MySQL), Zookeeper, and local files.
The metadata database (MySQL) and local metadata files provide backup and persistence. Zookeeper mainly acts as a bridge for metadata transmission: it holds metadata in real time and synchronizes it into memory, which greatly improves the performance of data query and data ingestion. Metadata in local files is mainly used so that a single node can quickly read it and load it into memory when recovering.
Within each Druid component process, metadata from Zookeeper and from the metadata database (MySQL) is synchronized into the process's memory through real-time synchronization and periodic pulls, respectively, to improve access efficiency.
The metadata stored in the memory of each component process is the latest and most complete metadata in the current cluster.