This article is based on a talk given by Liu Yunsong of Tuya Smart at PingCAP DevCon 2021. It covers the use of TiDB in the IoT field, especially in the smart home industry.

About Tuya Smart

Tuya Smart is a global IoT development platform that builds interconnection standards and serves the smart-product needs of brands, OEMs, developers, retailers and various industries. Built on a global public cloud, it interconnects smart scenarios and smart devices. Its offering covers three areas: hardware development tools, the global public cloud, and smart business platform development. It provides support ranging from technology to marketing channels, and aims to create a neutral and open developer ecosystem.

At present, Tuya has more than 100,000 partners in China and abroad, and the number of ecosystem customers on its IoT PaaS and IoT developer platforms has reached 320,000+, spanning manufacturing, retail, telecom operators, real estate, elder care, hotels and more. Tuya serves European and Chinese brands, including Philips, Haier, and China's three major telecom operators.

Real-time response to massive data: TiKV selection process

Devices on Tuya's platform generate 84 billion requests every day around the world, with an average peak throughput of 1.5 million TPS and an average response time of less than 10 milliseconds. Unlike traditional industries, the Internet of Things has no off-peak period, and the write volume is very large. Tuya has spent six years evaluating and trying different options in search of the most suitable data architecture.
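To put the figures above in perspective, the following back-of-the-envelope calculation converts the daily request count into an average per-second rate and compares it with the quoted peak. It assumes that "requests" and "TPS" count the same unit, which the talk does not state explicitly.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

requests_per_day = 84_000_000_000  # from the text: 84 billion requests per day
peak_tps = 1_500_000               # from the text: average peak of 1.5 million TPS

# Assumption: one request corresponds to one transaction.
average_tps = requests_per_day / SECONDS_PER_DAY
peak_to_average = peak_tps / average_tps

print(f"average load: {average_tps:,.0f} requests/s")
print(f"peak / average: {peak_to_average:.2f}")
```

Under that assumption, the peak is only about one and a half times the daily average, which is consistent with the remark that this workload has no real off-peak period.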

Tuya handles this much data because people use smart devices in their homes, such as smart lights and robot vacuums. Once connected to the Internet, these devices can communicate with Tuya's platform, and many of them are triggered at regular intervals. For example, home camera patrols and the location of a robot vacuum need to be reported to Tuya's Zeus platform. As the most important component of the Tuya platform, the Zeus system is responsible for processing reported data. The business topology is shown in the figure below. After the application gateway collects the MQTT messages reported by smart devices, it sends them to Kafka and NSQ. The Zeus system consumes these messages, decrypts and processes them, and then writes the results to storage. This article mainly describes the selection of the storage layer behind Zeus.
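The flow described above (gateway, message queue, Zeus, storage) can be sketched in a few lines. This is only an illustration of the data path, not Tuya's implementation: the deque stands in for Kafka/NSQ, the dict stands in for the storage layer, the XOR routine is a placeholder for real decryption, and all function names and device IDs are hypothetical.

```python
import json
from collections import deque

DEMO_KEY = 0x5A  # hypothetical key; XOR is a placeholder, not real encryption


def xor_bytes(data: bytes, key: int = DEMO_KEY) -> bytes:
    """Placeholder for encryption/decryption (XOR is its own inverse)."""
    return bytes(b ^ key for b in data)


def gateway_publish(queue: deque, device_id: str, payload: dict) -> None:
    """Gateway side: wrap a device report and push it onto the queue."""
    message = json.dumps({"device_id": device_id, "payload": payload}).encode()
    queue.append(xor_bytes(message))


def zeus_consume(queue: deque, storage: dict) -> int:
    """Zeus side: drain the queue, decrypt each message, and store it."""
    processed = 0
    while queue:
        raw = queue.popleft()
        report = json.loads(xor_bytes(raw).decode())
        storage.setdefault(report["device_id"], []).append(report["payload"])
        processed += 1
    return processed


queue, storage = deque(), {}
gateway_publish(queue, "vacuum-001", {"x": 3, "y": 7})
gateway_publish(queue, "camera-042", {"patrol": "started"})
processed = zeus_consume(queue, storage)
print(processed, storage)
```

The part that matters for the rest of this article is the last step of `zeus_consume`: every consumed message turns into a write, so the storage layer sees the full reporting rate.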

AWS Aurora

Tuya used AWS Aurora in the early days. Aurora is similar to Alibaba Cloud's PolarDB and has an architecture that separates storage from compute. Tuya ran steadily on Aurora for three years, and during that time Aurora fully met our needs. Six or seven years ago the Internet of Things was still a niche field: smart home devices were not widespread, and users did not use them much. As the business expanded, however, the number of devices grew exponentially, increasing three- to five-fold every year. Aurora could not withstand the surge in data volume, especially given the IoT requirement that responses complete within 10 milliseconds. Even with database and table sharding and with the cluster split up, it could not meet Tuya's business needs.

Apache Ignite

So Tuya began to try Apache Ignite, which is also a distributed KV system, similar to PingCAP's TiKV. It is built on Java and shards data into partitions. Its partitions are relatively large (1 GB of data forms one partition), and it does not scale as linearly as TiKV. If Tuya's business volume doubled, the cluster had to be shut down to add machines, with a risk of data loss. During this period, we placed Aurora behind Ignite for disaster recovery, with data written to Aurora synchronously. As business volume grew rapidly, a single Ignite cluster could no longer meet Tuya's needs and had to be expanded. However, the Ignite architecture requires downtime for expansion, which is intolerable in the Internet of Things.

TiDB 3.0 and 4.0

When Tuya set out to replace the Ignite cluster in 2019, the storage cluster in the US region had already grown to 12 nodes. This coincided with PingCAP's TUG event in Hangzhou, so we ran a verification test on TiDB 3.0. However, TiDB 3.0 did not meet Tuya's requirements: latency was high and throughput could not be pushed up, so we had to give up after a few months of trying.

In 2020, TiDB 4.0 was released. We tested it as well. It was a great improvement over 3.0, but the problems of high latency and insufficient throughput remained. The PingCAP R&D team then analyzed the problem in depth and found that most of the time was spent in the SQL parser layer, while the underlying TiKV storage was almost completely idle. Because Tuya's write volume is large and its latency requirements are strict, the results fell well short of expectations.

Since the latency was all incurred in the SQL parser layer, and IoT writes have a high TPS but fairly simple business logic, we asked ourselves: could we remove the SQL layer and write directly to TiKV? We consulted the official TiKV API documentation provided by PingCAP, which stated that Java, Go and Rust were supported, and started to experiment.

The results in production were very satisfying and were recognized by the whole company. After that, we rolled out TiDB 4.0 in regions around the world. After a year of operation, no problems have been found. The workload originally required 12 machines; now 3 machines with the same configuration are enough, which means the hardware cost is only a quarter of the original.
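Writing "directly to TiKV" means replacing SQL statements with plain put, get and scan calls on byte keys. The sketch below illustrates that access pattern only. `InMemoryRawKV` is a hypothetical in-memory stand-in, not the TiKV client API, and the key layout (`report:` prefix, device ID, big-endian timestamp) is an assumption made for the example rather than Tuya's actual schema.

```python
import bisect
import struct


class InMemoryRawKV:
    """Hypothetical stand-in for a raw key-value client (not the TiKV API)."""

    def __init__(self):
        self._keys = []    # keys kept in sorted byte order, like an ordered KV store
        self._values = {}

    def put(self, key: bytes, value: bytes) -> None:
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key: bytes):
        return self._values.get(key)

    def scan(self, start: bytes, end: bytes, limit: int):
        """Return up to `limit` pairs with start <= key < end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return [(k, self._values[k]) for k in self._keys[lo:hi][:limit]]


def report_key(device_id: str, ts_ms: int) -> bytes:
    """Assumed key layout: a big-endian timestamp sorts reports by time."""
    return b"report:" + device_id.encode() + b":" + struct.pack(">Q", ts_ms)


kv = InMemoryRawKV()
kv.put(report_key("vacuum-001", 1000), b'{"x": 3, "y": 7}')
kv.put(report_key("vacuum-001", 2000), b'{"x": 4, "y": 7}')
kv.put(report_key("camera-042", 1500), b'{"patrol": "started"}')

# All reports of one device, oldest first: a single range scan, no SQL parsing.
history = kv.scan(report_key("vacuum-001", 0),
                  report_key("vacuum-001", 2**64 - 1), limit=100)
print([value for _, value in history])
```

With a real cluster, the same three operations would go through one of the official clients mentioned above; the application then owns key encoding and any filtering that SQL would otherwise have handled.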

When the workload went live, throughput was already 200,000 TPS. In the North American cluster, which was running version 4.0.8 at the time, the 99th-percentile response time was 150 microseconds for queries and 360 microseconds for writes (well under one millisecond). Teams with similar scenarios may want to give it a try.

New challenge: cross-zone deployment

But new challenges arrived before long. On AWS we deploy across three availability zones; for example, the first deployment in Frankfurt spanned zones A, B and C. Communication among the three replicas consumes traffic, and that traffic is billed. All of Tuya's applications are also deployed across the three zones, so cross-zone calls are required as well. TiKV has no same-zone call strategy like Dubbo's, so this cost remains high: even though Tuya now uses only a quarter of the machines it used before, the overall cost is higher than it was. The current mitigation is to enable RPC-level compression to reduce network traffic, but this only reduces replication traffic between replicas; the cross-zone traffic generated by application calls has not dropped.

We found that the reason for this problem is that TiKV does not perform server-side filtering. Data stored in TiKV has to be pulled to the application for filtering and then written back. We have discussed this with the TiKV R&D team; subsequent versions may introduce server-side filtering, which could reduce server load and traffic costs.
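A simple model shows why compressing replication traffic leaves the application-side cost untouched. The sketch assumes one replica per availability zone, leaders spread evenly across zones, and clients that pick leaders without regard to zone; the `compression_ratio` value of 0.5 is purely illustrative and not a measured figure.

```python
def cross_zone_bytes(write_bytes, zones=3, replicas=3, compression_ratio=1.0):
    """Estimate billed cross-zone bytes for a given volume of writes.

    Assumptions (not from the talk): one replica per zone, leaders evenly
    distributed, and no zone-aware routing on the client side.
    """
    # The leader ships every write to each follower, all in other zones.
    replication = write_bytes * (replicas - 1) * compression_ratio
    # The leader sits in the client's own zone only 1/zones of the time.
    client_to_leader = write_bytes * (zones - 1) / zones
    return replication, client_to_leader


plain = cross_zone_bytes(900)
compressed = cross_zone_bytes(900, compression_ratio=0.5)  # illustrative ratio
print("without compression:", plain)
print("with compression:   ", compressed)
```

In this model compression shrinks only the first number. The second shrinks only if clients can prefer a replica in their own zone, which is the same-zone strategy the text says is missing.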
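The difference between the two filtering approaches can be illustrated by counting the bytes that cross the network. This is a toy model with made-up data (100 device rows, 10 of them "online"); the two functions are hypothetical and only mark where the predicate runs.

```python
# Toy data set: 100 devices, every tenth one is online.
rows = {
    f"device-{i:03d}".encode(): (b"online" if i % 10 == 0 else b"offline")
    for i in range(100)
}


def client_side_filter(store, predicate):
    """Every row is shipped to the application, which filters locally."""
    fetched = list(store.items())
    transferred = sum(len(k) + len(v) for k, v in fetched)
    matches = [(k, v) for k, v in fetched if predicate(v)]
    return matches, transferred


def server_side_filter(store, predicate):
    """The predicate runs next to the data; only matching rows are shipped."""
    matches = [(k, v) for k, v in store.items() if predicate(v)]
    transferred = sum(len(k) + len(v) for k, v in matches)
    return matches, transferred


def is_online(value):
    return value == b"online"


client_matches, client_bytes = client_side_filter(rows, is_online)
server_matches, server_bytes = server_side_filter(rows, is_online)
print(client_bytes, server_bytes)
```

Both calls return the same rows, but the client-side variant transfers the whole data set to find them, and in a three-zone deployment much of that transfer is billed cross-zone traffic.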

Cost reduction and efficiency enhancement: the architecture upgrade from X86 to ARM

The IoT industry focuses on cost reduction because its gross profit margin is very low, so we need to reduce the cost of every module. In June 2020, AWS launched the C6g instance family, whose price/performance is claimed to be 40% better than that of the previous-generation C5. So we tried C6g, but when we deployed directly with the binaries provided by TiUP, we found that response time was 6 to 7 times slower than on the X86 architecture. TiUP deploys a generically compiled build that is not tuned for the hardware. After testing and verification, we found that the existing TiKV version did not support the SSE instruction set; more precisely, the RocksDB version used by TiKV 4.0 did not support it.

The SSE instruction set is mainly used for CRC checks, hashing and floating-point operations. The compromise at the time was a mixed deployment: TiKV ran on X86 and the other nodes ran on ARM. This was inconvenient, however. During a version upgrade, the image being referenced would be X86 at one moment and ARM at the next, which was very troublesome, so we switched everything back to X86.

This year, TiKV released version 5.0. TiKV 5.0 supports CRC32C instructions optimized for aarch64, the counterpart of the SSE 4.2 instruction set, provided that the RocksDB version is greater than 6.1.2; the RocksDB version bundled with TiKV 5.0 is 6.4.6. TiKV's optimizations for the SSE instruction set can also be found in the TiKV project. This means TiKV 5.0 now fully supports these instructions, and ARM will be a key focus of our testing in the second half of the year. If that works out, costs may drop even further.
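For readers unfamiliar with CRC32C, the checksum mentioned above is the CRC-32 variant based on the Castagnoli polynomial. The sketch below is a plain table-driven software implementation, shown only to make clear what the hardware instructions compute; it is not code from RocksDB or TiKV. A storage engine runs this kind of check on every block it reads and writes, which is why falling back to a software path can be expensive.

```python
CRC32C_POLY_REFLECTED = 0x82F63B78  # Castagnoli polynomial, bit-reflected form


def _build_table():
    """Precompute the CRC of every possible byte value."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ CRC32C_POLY_REFLECTED if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _build_table()


def crc32c(data: bytes) -> int:
    """Software CRC32C: one table lookup per input byte."""
    crc = 0xFFFFFFFF
    for b in data:
        crc = _TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


# Standard check value for CRC-32C over the ASCII string "123456789".
print(hex(crc32c(b"123456789")))  # 0xe3069283
```

Dedicated CPU instructions compute the same function over several bytes at a time, so a build that cannot use them pays a per-byte cost like the loop above on every checksum.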

Business outlook

With TiDB 5.0 and 5.1, Tuya is confident that it can absorb several times the current business volume. TiKV traffic is expected to grow to three to four times the current level by the end of the year. The big data platform also uses TiDB to power its large-screen dashboards, and we are considering TiKV 5.1 as the storage for the IoT device pipeline, which should further improve ease of use. Deployment of the ARM version of TiDB is also planned for the second half of the year.

