How to efficiently store and analyze 5 billion records? Three tips from GaussDB (for Cassandra)

Abstract: The information society is moving from the Internet era to the Internet of Things era. Enterprises inevitably face a series of problems caused by the rapid growth in data volume: how to store data and expand capacity efficiently, and how to achieve intelligent, real-time analysis with minimal changes to the existing business.

This article is shared from the Huawei Cloud Community post "How to efficiently store and analyze 5 billion records? Three tips from GaussDB (for Cassandra)", author: Cassandra official.

The information society is moving from the Internet era to the Internet of Things era, and information exchange has become more complex, efficient, and intelligent. For Internet companies and IoT companies this is both an opportunity and a challenge, because they inevitably face a series of problems brought about by the rapid growth in data volume: how to store data and expand capacity efficiently, and how to achieve intelligent, real-time analysis with minimal changes to the existing business.

To meet these challenges, Huawei Cloud GaussDB (for Cassandra) provides customers with capabilities such as strong scalability, large storage capacity, efficient import and export, and real-time analysis. It has served many Internet companies and IoT companies and has been highly recognized and supported by customers. This article takes one customer's business pain points as an example and discusses three tips for efficient storage and real-time analysis.

Massive storage with transparent PB-level expansion

When users deploy a database on premises, or use another database whose storage is backed by cloud disks, they often have to plan and purchase storage resources in advance once capacity reaches a threshold, and may also have to add computing resources they do not need. GaussDB (for Cassandra) removes this trouble. It adopts an architecture that separates storage from compute, so storage can be expanded independently and efficiently, without the business noticing, up to PB scale.

In addition, to perform big data analysis, the customer used to write a copy of the database data to HDFS for MapReduce and Spark analysis. Two sets of resources had to be maintained at the same time, and the maintenance and resource costs became pain points. With GaussDB (for Cassandra), the customer needs only GaussDB (for Cassandra) both to store the data and to connect to big data analysis. GaussDB (for Cassandra) also provides an easy-to-use CQL interface, so users can focus on feature development rather than resource management.
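
As a minimal sketch of what working through the CQL interface can look like, the Python snippet below builds a table definition and a parameterized insert as plain CQL strings. The keyspace, table, and column names are assumptions made for illustration and are not the customer's schema; the statements would be executed through cqlsh or any Cassandra-compatible driver.

```python
# Hypothetical names for illustration only; not taken from the article.
KEYSPACE = "demo_ks"
TABLE = "user_events"


def create_table_cql(keyspace: str, table: str) -> str:
    """Return a CQL CREATE TABLE statement.

    The partition key combines user and day so that a single partition
    stays bounded and one day's new data for a user is stored together.
    """
    return (
        f"CREATE TABLE IF NOT EXISTS {keyspace}.{table} ("
        "user_id text, "
        "event_day date, "
        "event_id timeuuid, "
        "payload text, "
        "PRIMARY KEY ((user_id, event_day), event_id))"
    )


def insert_cql(keyspace: str, table: str) -> str:
    """Return a parameterized INSERT; '?' marks values bound at execution."""
    return (
        f"INSERT INTO {keyspace}.{table} "
        "(user_id, event_day, event_id, payload) VALUES (?, ?, now(), ?)"
    )


if __name__ == "__main__":
    print(create_table_cql(KEYSPACE, TABLE))
    print(insert_cql(KEYSPACE, TABLE))
```

The application only declares the data model and issues statements; how storage behind the table grows is left to the service.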

Data change capture and real-time analysis

One customer scenario requires online analysis and real-time recommendation based on crawled or user-entered data. The total data volume in this business has reached 5 billion, but the incremental data is less than 500 million, and the analysis mainly targets each day's new data. For this scenario, GaussDB (for Cassandra) provides a streaming service plus real-time analysis solution. At the cost of a small loss in read and write performance, the customer can read and write data and run real-time analysis in parallel without modifying the client application. The scheme is shown in the figure below. The solution has the following stages:

The customer's business already writes data to GaussDB (for Cassandra) through open source drivers.
GaussDB (for Cassandra) exposes a streaming interface through which data changes can be captured.
A streaming service component built by the customer reads data from the streaming interface and writes it to a specified Kafka queue.
The Kafka queue feeds the streaming data to Spark or Flink.
The customer analyzes the incremental data in Spark, or merges it and then performs a full analysis.
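
The stages above can be sketched as a small in-process pipeline. This is only an illustration under stated assumptions: the `ChangeEvent` fields are hypothetical (the article does not describe the streaming interface's record format), a `queue.Queue` stands in for the Kafka queue, and a plain function stands in for the Spark or Flink job.

```python
import json
import queue
from collections import Counter
from dataclasses import asdict, dataclass
from typing import Iterable, Iterator


@dataclass
class ChangeEvent:
    """One captured data change. All fields are hypothetical."""
    table: str
    op: str      # "insert", "update" or "delete"
    key: str
    value: dict


def read_change_stream(events: Iterable[ChangeEvent]) -> Iterator[ChangeEvent]:
    """Stand-in for reading the database's streaming interface."""
    yield from events


def forward_to_queue(stream: Iterable[ChangeEvent], topic: queue.Queue) -> int:
    """Serialize each change and put it on the queue (Kafka stand-in)."""
    sent = 0
    for event in stream:
        topic.put(json.dumps(asdict(event)))
        sent += 1
    return sent


def analyze_increment(topic: queue.Queue) -> Counter:
    """Drain the queue and count newly inserted rows per category.

    Stand-in for the incremental analysis job; updates and deletes
    are ignored in this simplified example.
    """
    counts: Counter = Counter()
    while True:
        try:
            message = topic.get_nowait()
        except queue.Empty:
            break
        event = json.loads(message)
        if event["op"] == "insert":
            counts[event["value"]["category"]] += 1
    return counts


def run_demo() -> Counter:
    changes = [
        ChangeEvent("items", "insert", "k1", {"category": "news"}),
        ChangeEvent("items", "insert", "k2", {"category": "video"}),
        ChangeEvent("items", "update", "k1", {"category": "news"}),
        ChangeEvent("items", "insert", "k3", {"category": "news"}),
    ]
    topic: queue.Queue = queue.Queue()
    forward_to_queue(read_change_stream(changes), topic)
    return analyze_increment(topic)


if __name__ == "__main__":
    print(run_demo())
```

The point of the design is that the writer path is untouched: the analysis side consumes only the captured changes, so the daily increment can be processed without scanning the full data set.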

Full data export and analysis

Another of the customer's businesses needs to analyze and process the full data set periodically, without affecting the online business, and preferably during idle time. GaussDB (for Cassandra) provides a full data export and analysis solution that triggers export and cold data analysis tasks during the business's off-peak periods. The data export rate is more than 10 times that of open source Cassandra, and running in off-peak periods limits the impact on the online business. The following solution is used by an Internet customer to export data and analyze user profiles on a weekly schedule. The solution has the following stages:

The customer configures the ECS specifications as required and mounts the obsfs parallel file system.
The customer configures the export job on DLF, including the ECS information, export parameters, and scheduled tasks.
CDM dispatches the job tasks.
The export task on the ECS exports the data that matches the specified conditions from the specified table in GaussDB (for Cassandra) to obsfs.
Spark reads the full data from obsfs for analysis.
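
The export and analysis stages above can be illustrated with ordinary file I/O, since a mounted file system is addressed like a local directory. This is a sketch under assumptions: a temporary directory stands in for the obsfs mount point, an in-memory list stands in for the source table, a CSV file is an assumed export format, and a plain function stands in for the Spark job. The row fields and table name are hypothetical.

```python
import csv
import tempfile
from collections import Counter
from pathlib import Path
from typing import Callable, Iterable

FIELDS = ["user_id", "region", "active"]  # hypothetical columns


def export_rows(rows: Iterable[dict], predicate: Callable[[dict], bool],
                out_dir: Path, table: str) -> Path:
    """Write the rows that match the export condition to <out_dir>/<table>.csv."""
    out_path = out_dir / f"{table}.csv"
    with out_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        for row in rows:
            if predicate(row):
                writer.writerow(row)
    return out_path


def analyze_export(path: Path) -> Counter:
    """Read the whole exported file and count users per region."""
    with path.open(newline="", encoding="utf-8") as handle:
        return Counter(row["region"] for row in csv.DictReader(handle))


def run_export_demo() -> Counter:
    rows = [
        {"user_id": "u1", "region": "north", "active": True},
        {"user_id": "u2", "region": "south", "active": False},
        {"user_id": "u3", "region": "north", "active": True},
    ]
    # The temporary directory plays the role of the mounted file system.
    with tempfile.TemporaryDirectory() as mount_point:
        path = export_rows(rows, lambda r: r["active"],
                           Path(mount_point), "user_profile")
        return analyze_export(path)


if __name__ == "__main__":
    print(run_export_demo())
```

Because the analysis reads only the exported files, it places no read load on the database once the export has finished.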

With these three tips, Huawei Cloud GaussDB (for Cassandra) addresses the problems of difficult expansion, high cost, and untimely handling of data changes, and delivers efficient storage and real-time analysis of massive data, opening up more possibilities for the digital development of Internet companies and IoT companies. For more detailed information about GaussDB (for Cassandra), please visit the Huawei Cloud official website.

Author of this article: Huawei Cloud Gauss Cassandra Team

Send resumes for positions in Hangzhou, Xi'an, and Shenzhen to: zhaojuan.zhao@huawei.com

For more technical articles, follow the Gauss Cassandra official blog.
