
Introduction | As a national-level application, WeChat covers every aspect of daily life, including social interaction, payment, and travel. Its massive and diverse business lines pose new challenges for data analysis. To meet these needs, the WeChat WeOLAP team worked with Tencent Cloud to build a ClickHouse data warehouse that runs on the order of 1,000 machines, holds PB-level data, and unifies batch and stream processing, achieving a performance improvement of more than 10 times. This article moves from the basics to the details and shares the experience and methods WeChat has accumulated in running ClickHouse as a real-time data warehouse.

Authors of this article: the WeChat WeOLAP team and the Tencent Cloud data warehouse ClickHouse team.

1. Challenges encountered by WeChat

Generally speaking, the main data analysis scenarios of WeChat include the following aspects:

  • Scientific exploration: serves data scientists, who run ad hoc queries to infer business attribution;
  • Kanban (dashboards): serves operations and management by showing the core indicators they care about;
  • A/B experiment platform: serves algorithm engineers, who put a new model on the platform and run hypothesis tests to see whether it meets expectations.

In addition, there are scenarios such as real-time monitoring and detailed queries against the log system.

Across all these scenarios, users share one key demand: speed. They want faster query responses, faster indicator development, and more timely Kanban updates. At the same time, WeChat faces massive data volumes. In business scenarios it is common for a single table to grow by a trillion rows per day, which poses new challenges for the next generation of data analysis systems.

Before adopting ClickHouse, WeChat used a data warehouse based on the Hadoop ecosystem, which had the following problems:

  • Slow response: queries typically take minutes and can take hours, which lengthens the decision-making process;
  • Slow development: because the traditional data warehouse is built in multiple layers, updating a single indicator is very expensive;
  • Bloated architecture: at WeChat's data volume, it is difficult for the traditional architecture to unify stream and batch processing. As a result, multiple sets of code have to be written, results are hard to reconcile, and storage is redundant. After more than ten years of development, the traditional Hadoop ecosystem has become very bloated, and it is difficult and costly to maintain.

WeChat had therefore been looking for a lighter, simpler, and more agile solution to these problems. After some research into the many flourishing OLAP products, ClickHouse was selected as the core engine of WeChat OLAP. There were two main reasons:

  • Efficiency: in experiments with real data, ClickHouse was more than 10 times faster than the Hadoop ecosystem (tested at the end of 2020);
  • Open source: WeChat's A/B experiments, online features, and other scenarios have customized requirements that call for changes to the engine core, which an open-source engine allows.

WeChat therefore set out to build a unified batch and stream data warehouse for OLAP scenarios, with ClickHouse as the core of both computation and storage. However, native ClickHouse showed many problems under real production load:

  • Stability: native ClickHouse is not very stable. For example, "too many parts" errors often occur in high-frequency write scenarios, a single slow query can drag down the entire cluster, and node OOMs and stuck DDL requests are common. In addition, because of flaws in ClickHouse's original design, the ZooKeeper bottleneck that grows with data volume has always existed and cannot be solved well. WeChat later made multiple kernel changes that gradually stabilized ClickHouse under massive data, and some of the issues were also contributed to the community.
  • High barrier to use: a system built by someone who knows ClickHouse well can perform 3 or even 10 times better than one built by someone who does not. Some scenarios also require targeted optimization of the kernel.
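
The "too many parts" problem mentioned above is commonly eased by batching writes, since in ClickHouse's MergeTree engines each INSERT creates at least one new data part. The following is a minimal sketch of that idea, not the change WeChat made to the kernel. `BatchingWriter`, `flush_fn`, and the thresholds are hypothetical names and values chosen for illustration; `flush_fn` stands in for whatever client call performs the actual INSERT.

```python
import time
from typing import Any, Callable, List


class BatchingWriter:
    """Buffers rows and hands them to flush_fn in large batches.

    Assumption: each flush_fn call maps to one INSERT, so fewer and larger
    flushes mean fewer new parts for background merges to keep up with.
    """

    def __init__(
        self,
        flush_fn: Callable[[List[Any]], None],  # hypothetical: performs one INSERT
        max_rows: int = 100_000,                # illustrative threshold
        max_delay_s: float = 1.0,               # illustrative threshold
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._flush_fn = flush_fn
        self._max_rows = max_rows
        self._max_delay_s = max_delay_s
        self._clock = clock
        self._buffer: List[Any] = []
        self._last_flush = clock()

    def write(self, row: Any) -> None:
        """Buffer one row; flush when the batch is full or old enough."""
        self._buffer.append(row)
        too_big = len(self._buffer) >= self._max_rows
        too_old = self._clock() - self._last_flush >= self._max_delay_s
        if too_big or too_old:
            self.flush()

    def flush(self) -> None:
        """Send the buffered rows as one batch and reset the timer."""
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._flush_fn(batch)
        self._last_flush = self._clock()
```

With `max_rows=3`, writing seven rows triggers two flushes of three rows each and leaves one row buffered until the next explicit `flush()` or until `max_delay_s` has elapsed at the next write.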

2. WeChat and Tencent Cloud Data Warehouse build together

At this point, the Tencent Cloud data warehouse ClickHouse team worked closely with the business and with the WeChat team, and the two parties began to solve the above problems jointly. Tencent Cloud data warehouse ClickHouse provides a fully managed, one-stop service, so the WeChat team does not need to spend much attention on stability issues. In addition, both teams have accumulated rich experience in query optimization, and sharing that experience helps push ClickHouse performance further.

The cooperation between WeChat and Tencent Cloud data warehouse ClickHouse began in March this year. After a small-scale trial of ClickHouse during the verification period, the business grew rapidly, and the two parties began working together to optimize stability and performance. They focused on two things: building out the whole ClickHouse OLAP ecosystem, and exploring query optimization methods that fit the business.

3. Co-building the ClickHouse OLAP ecosystem

Making ClickHouse easier to use and more stable requires ecosystem support. The overall ecosystem has the following important parts:

  • QueryServer: the data gateway, responsible for intelligent caching, large-query interception, and rate limiting;
  • Sinker: the high-performance offline/online ingestion layer, responsible for peak shaving, hash routing, traffic prioritization, and write-frequency control;
  • OP-Manager: responsible for cluster management, data balancing, disaster-recovery switching, and data migration;
  • Monitor: responsible for monitoring and alerting, sub-health detection, and query health analysis, and can be linked with OP-Manager.
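
To make the gateway's three duties concrete, here is a minimal sketch of a QueryServer-like component. It is an illustration under stated assumptions, not WeChat's implementation: `QueryGateway`, `QueryRejected`, `execute`, and `estimate_rows` are hypothetical names, the cache is a plain TTL cache keyed by SQL text, and rate limiting uses a one-second sliding window.

```python
import time
from collections import deque
from typing import Any, Callable, Deque, Dict, Tuple


class QueryRejected(Exception):
    """Raised when the gateway refuses to forward a query."""


class QueryGateway:
    def __init__(
        self,
        execute: Callable[[str], Any],        # hypothetical: runs SQL on the cluster
        estimate_rows: Callable[[str], int],  # hypothetical: estimated rows scanned
        max_scan_rows: int = 10**9,           # illustrative threshold
        max_qps: int = 100,                   # illustrative threshold
        cache_ttl_s: float = 30.0,            # illustrative threshold
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._execute = execute
        self._estimate_rows = estimate_rows
        self._max_scan_rows = max_scan_rows
        self._max_qps = max_qps
        self._cache_ttl_s = cache_ttl_s
        self._clock = clock
        self._cache: Dict[str, Tuple[float, Any]] = {}
        self._recent: Deque[float] = deque()  # timestamps of forwarded queries

    def query(self, sql: str) -> Any:
        now = self._clock()

        # 1. Caching: serve a fresh cached result without touching the cluster.
        hit = self._cache.get(sql)
        if hit is not None and now - hit[0] < self._cache_ttl_s:
            return hit[1]

        # 2. Large-query interception: refuse queries expected to scan too much.
        if self._estimate_rows(sql) > self._max_scan_rows:
            raise QueryRejected("estimated scan exceeds max_scan_rows")

        # 3. Rate limiting: at most max_qps forwarded queries per second.
        while self._recent and now - self._recent[0] >= 1.0:
            self._recent.popleft()
        if len(self._recent) >= self._max_qps:
            raise QueryRejected("rate limit exceeded")
        self._recent.append(now)

        result = self._execute(sql)
        self._cache[sql] = (now, result)
        return result
```

Checking the cache first means repeated dashboard queries do not count against the rate limit, which is one reasonable ordering; a production gateway would also need cache invalidation and per-user limits.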

The WeChat WeOLAP team and Tencent Cloud focused their cooperation on the following areas:

  • High-performance ingestion: WeChat's throughput has reached the billion level. For real-time ingestion, traffic peaks are handled with a token-based scheme combined with back pressure. In addition, data is written through hash routing, so once it has landed it can be joined directly on each node without a shuffle, which makes Join queries faster; ingestion is also exactly-once. For offline synchronization, WeChat follows much the same practice as the rest of the industry: parts are pre-built and merged offline and then shipped to the online serving nodes. This is essentially read/write separation, which makes it easier to meet the needs of scenarios that require strong consistency and high throughput.

  • Extreme query optimization: ClickHouse's design philosophy requires specific syntax in specific scenarios to reach peak performance. To lower the barrier to using ClickHouse, WeChat has built its optimization experience into the internal BI platform, so that even novice users can use ClickHouse well. Through a series of optimizations, performance in multiple cases, such as live streaming and WeChat Channels, has improved by more than 10 times.

Based on this co-built ClickHouse ecosystem, WeChat has the following typical application scenarios:
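
The two ingestion ideas above, token-based peak shaving and hash routing, can be sketched as follows. This is a simplified illustration under assumptions: `shard_for`, `route`, and `TokenBucket` are hypothetical names, CRC32 stands in for whatever hash function the real Sinker uses, and back pressure is represented only by `try_acquire` returning `False`, at which point a producer would pause or retry.

```python
import time
import zlib
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, List


def shard_for(key: str, num_shards: int) -> int:
    """Map a join key to a shard; the same key always lands on the same shard."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


def route(rows: Iterable[Dict[str, Any]], key_field: str,
          num_shards: int) -> Dict[int, List[Dict[str, Any]]]:
    """Group rows by target shard using the hash of their join key."""
    shards: Dict[int, List[Dict[str, Any]]] = defaultdict(list)
    for row in rows:
        shards[shard_for(str(row[key_field]), num_shards)].append(row)
    return shards


class TokenBucket:
    """Admits up to `rate` writes per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate
        self._capacity = capacity
        self._tokens = capacity
        self._clock = clock
        self._last = clock()

    def try_acquire(self, n: float = 1.0) -> bool:
        """Return True if n tokens are available; False signals back pressure."""
        now = self._clock()
        refill = (now - self._last) * self._rate
        self._tokens = min(self._capacity, self._tokens + refill)
        self._last = now
        if self._tokens >= n:
            self._tokens -= n
            return True
        return False
```

If two tables are both routed by the same key, for example a user ID, every row for a given user sits on one shard in both tables, so each shard can join its own data locally and no shuffle between nodes is needed.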

1. BI analysis/Kanban: Because scientific exploration is ad hoc, it is hard to serve with pre-built aggregates. Previously, the Hadoop ecosystem could only respond in hours to minutes. Now, after the ClickHouse optimizations, the P95 of most queries is within 5 seconds, even on a single table with trillions of rows. A data scientist who wants to verify an idea can now do so very quickly.

2. A/B experiment platform: In the early days of A/B experiments, all experimental statistics had to be aggregated overnight, and results could only be queried the next day. For a single table with hundreds of billions of rows per day and real-time Joins against large tables, WeChat went through several iterations of its solution and achieved a performance improvement of nearly 50 times. The leap from offline to real-time analysis brings the P95 response time under 3 seconds, makes A/B experiment conclusions more accurate, shortens the experiment cycle, and speeds up model verification.

3. Real-time feature calculation: Although ClickHouse is generally considered weak at real-time workloads, with optimization it can scan at the billion-row level with a full-link delay under 3 seconds and a P95 response time close to 1 second.

4. Significant improvement in performance

At present, WeChat's deployment is on the order of 1,000 machines, the data volume is at the PB level, and there are more than one million queries per day. Single-cluster TPS has reached the 100 million level, and queries return in seconds on average. Compared with the previous Hadoop ecosystem, the ClickHouse OLAP ecosystem has improved performance by more than 10 times. By unifying stream and batch processing, it provides more stable and reliable service, making business decisions faster and experimental conclusions more accurate.

5. Co-building a cloud-native data warehouse with separate storage and computing

ClickHouse's original design and its Shared-Nothing architecture do not handle scaling within seconds or Join scenarios well. The next goal of the WeChat and Tencent Cloud data warehouse ClickHouse cooperation is therefore a cloud-native data warehouse that separates storage and computing:

  • Elastic scaling: elasticity within seconds, with users paying only for what they use, giving faster queries at peak and lower costs off-peak;
  • Stability: no ZooKeeper bottleneck, easy read/write separation, and remote disaster recovery;
  • Easy operation and maintenance: data is easy to balance and storage is stateless;
  • Full-featured: a focus on query optimization and cache strategy, with support for efficient multi-table Joins.

The cloud-native data warehouse capability that separates storage and computing will be launched on Tencent Cloud's official website next year, so stay tuned!

This article is produced by the WeOLAP team of the WeChat Technical Architecture Department. "WeOLAP" focuses on using cutting-edge big data technology to solve the high-performance query problem of WeChat's massive data.
