Preface
In a sense, Aurora is a data company. Across the company's technical systems, large amounts of KV data need to be stored. Driven by factors such as data volume, KV structure, update frequency, hot/cold distribution, and read/write request volume and ratio, Aurora has gradually settled on three different KV solutions: CouchBase, Redis, and Pika.
Below we share Aurora's experience with these KV components, focusing on the operation and maintenance problems encountered at large data and request volumes, as well as service performance, scalability, data backup and security, and O&M tooling.
One, CouchBase practice
Aurora CouchBase scale
CouchBase currently stores about 6.5TB in total; the largest single cluster has 25 nodes (16 cores and 128GB of memory each).
CouchBase is a non-relational database, essentially CouchDB plus Membase: it can store JSON documents like CouchDB and serve key-value pairs at high speed like Membase.
At Aurora, we mainly use it for KV storage, which has the following characteristics:
High speed
Because it is an in-memory database, all reads and writes operate directly on memory, so it is very fast. Aurora's largest cluster has about 25 machines (16 cores, 128GB) and provides roughly 1.5TB of usable memory. The cluster under the heaviest read/write pressure has only about 8 machines yet must sustain around 1,000,000 QPS, with a single machine handling more than 100,000 QPS; the performance is quite impressive.
Contrary to the linear relationship between performance and node count claimed on the official website, we found in practice that clusters beyond 40 nodes have obvious problems with scaling out and in, replacing faulty nodes, and overall stability. During scaling or node replacement, the cluster can fall into a zombie state: the console stops responding and clients intermittently see timeouts or incorrect results. The process does not recover quickly or automatically; the longest such failure at Aurora lasted more than 3 hours, and in the end we had to stop the business to repair it.
In another failure, we found the root cause was that CouchBase's memory usage only ever grew, eventually filling the machine's memory until the process was OOM-killed. We worked around the problem by splitting the cluster, moving the new cluster to a newer version at the same time.
High availability
High availability comes from two aspects. First, CouchBase ships with its own cluster solution and supports a multi-replica mode. We generally choose the 2-replica mode, similar to a one-master-one-standby scheme, which ensures there is no single point of failure.
Second, it has its own persistence solution and can be configured to periodically write data to the file system asynchronously. In general, therefore, the downtime of a single node has limited impact on the availability and performance of the cluster, but if two or more nodes fail at once, data loss is possible.
Easy to configure and use
After installation, CouchBase comes with a web management console from which clusters, buckets, indexes, and search can all be managed, greatly simplifying operation and maintenance. This is a capability that open source software generally lacks; commercial products simply do a better job on user experience.
Of course, CouchBase also has some obvious disadvantages:
1. It only provides a simple KV storage model and does not support complex data structures (values do support JSON format).
2. To speed up queries, all keys are cached in memory, which consumes a lot of memory. In particular, the metadata attached to each KV pair (at least 56 bytes per pair) is also kept in memory. In Aurora's practice we often see metadata exceed 50% of memory usage, which places high demands on memory capacity and drives up cost (a rough estimate is sketched after this list).
3. It is a closed-source product with sparse documentation, so the means of handling abnormal failures are limited and carry some risk. We have hit several cases of cluster nodes going into a zombie state, and clusters with large data volumes cannot be backed up; we have found no effective solution to either. Storing core, critical data on CouchBase is therefore unreliable, and this must be judged per business scenario. At present, all the data Aurora keeps in CouchBase is intermediate-layer data that can be rebuilt from other channels if it is ever lost.
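To make disadvantage 2 concrete, here is a rough back-of-the-envelope sketch in Python, based on the at-least-56-bytes-per-pair figure above; the key count and average key size are illustrative assumptions, not measured values.

```python
# Rough estimate of the RAM CouchBase needs just for keys + metadata,
# which it keeps fully resident in memory (per disadvantage 2 above).
METADATA_BYTES_PER_KEY = 56  # lower bound quoted above

def key_cache_gb(num_keys: int, avg_key_bytes: int) -> float:
    total = num_keys * (avg_key_bytes + METADATA_BYTES_PER_KEY)
    return total / 1024 ** 3

# Illustrative assumption: 2 billion keys averaging 40 bytes each.
print(f"{key_cache_gb(2_000_000_000, 40):.1f} GB")  # ~178.8 GB, before any values
```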
Data backup problem:
Once the data volume reaches a certain scale, backups made with CouchBase's bundled backup tool start to fail. We reported this to CouchBase and received no response.
Given the zombie-state problem during rebalance on large CouchBase clusters, we now split large clusters into several small ones to spread out both the load and the O&M burden. This, however, requires data sharding on the business side, so it needs business involvement and is somewhat intrusive to the business.
The monitoring and alerting ecosystem is relatively mature: collecting metrics through the corresponding Exporter into Prometheus is simple and convenient.
We have set up the following main alarm rules to facilitate timely detection of problems:
1. Bucket memory usage exceeds 85%
2. The CPU usage of CouchBase cluster nodes exceeds 90%
3. XDCR synchronization failure
4. The CouchBase node is abnormal/restarted
Two, Redis practice
Redis is a well-known open source KV store that is widely used in the industry, and Aurora uses it extensively as well.
Aurora Redis scale
Aurora's current Redis resource usage is about 30TB, spread across roughly 7,000 instances.
Rich ecosystem
Thanks to its popularity, Redis has a wealth of production case studies, and online articles and tutorials are abundant. It has an excellent ecosystem, and both operations and R&D staff are very familiar with it. Compared with CouchBase and Pika this really matters: it saves a great deal of communication cost and lets us reuse many mature solutions directly, making troubleshooting and optimization much easier.
When choosing basic components, maturity, universality, and a rich ecosystem are genuinely important.
High performance
Also thanks to operating entirely in memory, Redis delivers quite high performance. In our actual use, the peak for a single Redis instance is 50,000 QPS (this is the company's peak standard: once an instance reaches this value, we consider it to be at its bottleneck, although actual performance tests go higher). With 8 instances per machine, a single machine can theoretically reach 400,000 QPS, which is quite amazing. Of course, because of Redis's single-process, blocking model, slow commands raise latency across the board, so special care is needed in application architecture design and O&M, such as scheduling high-latency operations during business off-peak hours.
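One routine precaution along these lines is to avoid blocking whole-keyspace commands; a minimal redis-py sketch (the address and key pattern are placeholders):

```python
import redis

r = redis.Redis(host="127.0.0.1", port=6379)  # placeholder address

# KEYS walks the entire keyspace in one blocking call and can stall a busy
# instance; SCAN iterates in small batches, letting other commands through.
for key in r.scan_iter(match="user:*", count=1000):
    pass  # process one key at a time
```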
(Figure: Redis performance test)
In master-slave mode we generally use Sentinel, and it is usually trouble-free. However, we once expanded a Sentinel group from 3 nodes to 5 to improve availability, and afterwards the Redis instances supervised by that group emitted a large number of latency alarms. Investigation showed this was directly tied to the expansion to 5 nodes: the slowlog revealed that the INFO commands sent by the Sentinels were driving up the response latency of the Redis instances. Pay special attention to the resource consumption of Sentinel probing; under high QPS, any extra pressure is keenly felt.
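Reading the slowlog is how the INFO overhead showed up; a sketch with redis-py (the address is a placeholder):

```python
import redis

r = redis.Redis(host="127.0.0.1", port=6379)  # placeholder address

# SLOWLOG GET returns the most recent slow commands; in the Sentinel
# incident described above, INFO appeared here repeatedly.
for entry in r.slowlog_get(32):
    cmd = entry["command"].decode(errors="replace")
    print(f'{entry["duration"]} us  {cmd}')  # duration is in microseconds
```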
Built-in cluster mode
Redis ships with the Redis-Cluster architecture, which to a certain extent solves the storage problem of very large data sets.
(Figure: Redis cluster performance test)
However, because of its decentralized architecture, operating large clusters and migrating their data is harder. At Aurora, our largest Redis-Cluster currently has 244 master and slave nodes, holds more than 800GB of data in total, and peaks at 8 million QPS. The load is quite heavy, and so is the O&M pressure.
First, expansion is difficult. Slot migration under heavy QPS causes clients to receive a large number of MOVED redirects, which ultimately surfaces as a large number of errors on the business side (see the sketch after these two points).
Second, expansion is very slow. Growing from 244 nodes to 272 nodes took us a week and consumed a great deal of manpower.
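For context, a cluster-aware client such as redis-py's RedisCluster follows MOVED redirects transparently, but during slot migration each redirect is still an extra round trip, which is exactly where the business-side errors came from under high QPS. A minimal sketch (the seed address is a placeholder):

```python
from redis.cluster import RedisCluster

rc = RedisCluster(host="10.0.0.1", port=6379)  # placeholder seed node

# While a slot is being migrated, the owning node answers MOVED; the client
# refreshes its slot map and retries. Invisible in code, but under heavy
# QPS the extra round trips surface as timeouts and errors upstream.
rc.set("greeting", "hello")
print(rc.get("greeting"))
```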
Gradually we formed a set of best practices for Redis-Cluster:
First, strictly control cluster size: no more than 100 master nodes.
Second, in the early days we spread data across different Redis-Cluster clusters by splitting it on the business side. Later the company built macro clusters with the self-developed JCache component, which combines multiple native Redis-Clusters into one macro cluster; JCache hashes on key rules, effectively adding a proxy layer (the routing idea is sketched below). Of course this brings new problems to solve, such as splitting the back-end Redis-Clusters and resharding their data, which also adds to JCache's complexity.
(Figure: JCache macro cluster solution)
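We cannot show JCache's internals here, but the core routing idea, hashing a key to pick one of several native Redis-Clusters, looks roughly like this; every name, address, and the hash rule itself are hypothetical:

```python
import zlib
from redis.cluster import RedisCluster

# Hypothetical sketch of JCache-style routing: several native
# Redis-Clusters combined into one "macro cluster" behind a key hash.
clusters = [
    RedisCluster(host="10.0.1.1", port=6379),  # placeholder addresses
    RedisCluster(host="10.0.2.1", port=6379),
    RedisCluster(host="10.0.3.1", port=6379),
]

def pick_cluster(key: str) -> RedisCluster:
    # A stable hash of the key decides which back-end cluster owns it.
    return clusters[zlib.crc32(key.encode()) % len(clusters)]

pick_cluster("user:42").set("user:42", "...")
```

Note that naive modulo hashing remaps most keys whenever a back-end cluster is added or removed, which is precisely the resharding problem mentioned above.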
Monitoring likewise collects data through an Exporter into Prometheus. This ecosystem is very complete, and off-the-shelf solutions can be reused directly.
We have set the following warning rules for Redis:
1. Memory usage alarm (customized per Redis application, 90% by default)
2. Client connection count alarm (default 2000)
3. Redis instance down alarm
4. Sentinel master-slave switchover alarm
5. Per-application overall QPS threshold alarm
6. Monitoring of specific Redis keys (see the sketch below)
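For rule 6, the probe amounts to spot-checking known hot keys; a sketch with redis-py (the address and key names are placeholders):

```python
import redis

r = redis.Redis(host="127.0.0.1", port=6379)  # placeholder address

# Spot-check known hot keys: existence, type, and memory footprint
# (MEMORY USAGE needs Redis 4.0+).
for key in ("hot:counter", "session:index"):  # placeholder key names
    print(key, r.exists(key), r.type(key), r.memory_usage(key))
```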
Performance optimization:
1. Turn off THP (Transparent Huge Pages)
2. Block high-risk commands (KEYS/FLUSHALL/FLUSHDB, etc.)
3. For Redis applications with a lot of frequently expiring data, raising the hz parameter speeds up the cleanup of expired keys and reclaims memory sooner. We have raised some applications from the default 10 cleanup passes per second to 100, though under heavy QPS this costs some performance.
4. For Redis applications that need to save memory and can tolerate slightly lower performance, adjust the underlying storage encodings, e.g. hash-max-ziplist-entries / hash-max-ziplist-value (items 3 and 4 are sketched below).
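Items 3 and 4 can be applied at runtime with CONFIG SET; a sketch with redis-py (the address is a placeholder, and the values shown are the ones discussed above, to be tuned per application):

```python
import redis

r = redis.Redis(host="127.0.0.1", port=6379)  # placeholder address

# Item 3: raise the background expiry frequency from the default 10/s.
# Expired keys are reclaimed sooner, at some CPU cost under heavy QPS.
r.config_set("hz", 100)

# Item 4: keep small hashes in the compact ziplist encoding to save memory.
r.config_set("hash-max-ziplist-entries", 128)
r.config_set("hash-max-ziplist-value", 64)
```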
Three, Pika practice
Pika is a domestically developed open source KV store compatible with the Redis protocol. Built on the RocksDB storage engine and supporting multi-threading, it is a disk-based KV system, well suited to scenarios where the data volume is very large (TB level) but the QPS requirements are not especially high.
Aurora Pika scale
Pika currently holds about 160TB of data.
Relatively high performance
In our own tests with NVMe SSDs as the storage medium, a single machine (8 cores, 32GB) handled a peak of 180,000 QPS without problems. In actual usage scenarios, peaks of 100,000 QPS do not affect the business.
In the company's self-built IDC, which uses NVMe SSDs, the highest observed OPS is currently about 76,000.
Immature ecosystem
As domestically developed open source software with a small audience, Pika is not especially mature and its features iterate quickly. Many improvements were made from the 2.X to the 3.X versions based on community feedback; we are now running the 3.2.X versions stably.
Note: new applications are basically all on 3.2.x; some legacy applications have not been upgraded because their data is too important.
Due to project changes over the past year, Pika appears to have been donated to an open source foundation in China and is no longer maintained by 360. Judging from feedback in the QQ group, there seem to be fewer and fewer core developers, and the project's prospects look a little bleak. This is a common problem with niche open source software: the project is easily affected by company policy or by the movements of a few core people.
Pika's last release was in December 2020, almost a year ago.
Proxy layer
Because Pika only has a master-slave mode, while our data volume runs to several TB and QPS to hundreds of thousands, in practice we add a proxy layer in front to do hash sharding. At present, the largest single instance holds 2TB, actual QPS reaches 100,000, and the ratio of write to read instances is 1:3, matching our write-less-read-more workload (a read/write-splitting sketch follows).
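Since Pika speaks the Redis protocol, ordinary Redis clients work against it directly; a rough sketch of the write-to-master, read-from-slaves split behind the proxy (the addresses, the 9221 default port, and the balancing rule are illustrative):

```python
import itertools
import redis

# Pika is Redis-protocol compatible, so redis-py talks to it directly.
writer = redis.Redis(host="pika-master", port=9221)  # placeholder addresses
readers = [redis.Redis(host=h, port=9221)
           for h in ("pika-slave-1", "pika-slave-2", "pika-slave-3")]
_next_reader = itertools.cycle(readers)  # 1 write : 3 read instances

def put(key: str, value: str) -> None:
    writer.set(key, value)              # all writes go to the master

def get(key: str):
    return next(_next_reader).get(key)  # reads round-robin across slaves
```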
NVMe SSD as standard
Both official recommendation and our actual use point the same way: Pika's disks should be NVMe SSDs as standard. Compared with ordinary SSDs they provide higher IOPS and throughput and lower read/write latency. In our Pika production environment, NVMe SSDs peak at 50,000 IOPS while ordinary SAS SSDs manage only about 10,000, so the performance improvement is very noticeable. Measured against the price difference, the performance gain is well worth it, and it is still far cheaper than memory.
Chained master-slave problem
Redis supports a chained master-slave-slave replication topology, and we deployed Pika the same way. The result was a serious incident in which we lost data. Pika's management of data synchronization state in chained master-slave mode is very crude: you can only see that a slave has established a connection to its master; whether the data is actually in sync, or whether anything is missing, has to be handled by the application layer. We only learned this later by studying the code. Partly this reflects the lack of documentation, but it also shows that with open source software you must not take anything for granted; everything must go through rigorous verification and testing, or the price can be very high.
Note: since version 3.2.X, Pika no longer supports chained master-slave setups; all slaves must synchronize data directly from the master, which puts a certain read pressure on the master node.
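After that incident we stopped treating "connection established" as "in sync"; the check has to compare replication offsets itself. A sketch against Redis-style INFO replication fields (Pika's INFO output differs in places, so the exact field names are an assumption to verify):

```python
import redis

master = redis.Redis(host="pika-master", port=9221)  # placeholder addresses
slave = redis.Redis(host="pika-slave-1", port=9221)

m = master.info("replication")
s = slave.info("replication")

# "Connected" alone proves nothing -- bound the lag by comparing offsets.
# Field names follow Redis; verify them against your Pika version's INFO.
assert s.get("master_link_status") == "up", "replication link down"
lag = m["master_repl_offset"] - s.get("slave_repl_offset", 0)
assert lag < 10_000, f"replication lag too large: {lag} bytes"
```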
Monitoring and alerting
Monitoring information is again collected through an Exporter into Prometheus, which is simple and clear.
Pika's warning rules are set as follows:
1. Pika instance process alarm
2. Pika connection count alarm
3. Sentinel master-slave switchover alarm
4. JCache slow query alarm
5. Pika master write-failure alarm
Performance parameters:
1. auto-compact: automatic compaction, for applications with a large amount of expired data
2. root-connection-num: ensures the client can still connect to Pika locally even when the connection limit is reached
3. compression: changed to lz4, which has good performance and a high compression ratio
Four, follow-up planning
Aurora's KV storage has evolved over many years and is sufficient for current needs, but problems remain: scaling is inconvenient and not smooth, large clusters hit performance bottlenecks, and cost-effectiveness could be better. Going forward, we will focus on empowering the business and turning KV storage into a more adaptable and flexible storage service. We may introduce technologies such as storage-compute separation and hot/cold tiering to improve elastic scaling, performance, and cost-effectiveness.