Foreword
Amazon Redshift is a fully managed, petabyte-scale, massively parallel data warehouse that is simple to operate and delivers efficient performance. It uses standard SQL and works with existing business intelligence (BI) tools, so you can analyze all your data quickly, easily, and cost-effectively. Today, Amazon Redshift is one of the most widely used cloud data warehouses. Customers use it for many types of big data workloads, such as accelerating existing database environments or ingesting logs for big data analytics.
As the amount of data generated on the internet has grown significantly in recent years, some customers have started asking how they can use Amazon Redshift more efficiently. To help customers use Amazon Redshift more efficiently and cost-effectively, Amazon Cloud Technology has been continuously improving the service since its launch and has released a large number of new features. Examples include Amazon Redshift Spectrum, the SSD-based RA3 node types, cluster pause and resume, Amazon Redshift data sharing, and AQUA (Advanced Query Accelerator). Each improvement or new feature can improve Amazon Redshift performance and/or reduce costs, so a deep understanding of these features can help you use Amazon Redshift more efficiently.
In this blog, we'll walk through a case study of how using Amazon Redshift RA3 nodes (https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-clusters.html#rs-ra3-node-types), data sharing (https://docs.aws.amazon.com/redshift/latest/dg/datashare-overview.html), and cluster pause/resume (https://docs.aws.amazon.com/redshift/latest/mgmt/managing-cluster-operations.html#rs-mgmt-pause-resume-cluster) greatly improved the cost-effectiveness of the Amazon Redshift clusters in this customer's business scenario.
First, let's quickly review some key features.
Introduction to RA3 Nodes for Amazon Redshift
Amazon Redshift RA3 nodes are powered by a new managed storage model that lets you optimize compute power and storage capacity separately. They take advantage of several architectural improvements, including high-bandwidth networking, managed storage that uses local SSDs backed by Amazon S3, and several advanced data management techniques that optimize data movement to and from S3. They also bring some very important capabilities, one of which is data sharing. In addition, RA3 nodes support pause and resume, which lets customers easily pause a cluster to suspend billing when the cluster is not in use.
Introduction to data sharing
Generally available since 2021, Amazon Redshift data sharing enables you to securely and easily share live data for reads across Amazon Redshift clusters. It provides instant, granular, high-performance data access across Amazon Redshift clusters within an Amazon Cloud Technology account, without the complexity and latency of data replication and data movement. Data can be shared at multiple levels, including schemas, tables, views, and user-defined functions, with fine-grained access controls that can be tailored to the different users and business units that need access to the data.
Example Corp.'s Amazon Redshift Optimization Project
Our client, Example Corp., makes extensive use of Amazon Redshift as the data warehouse for its big data analytics business and has always valued the possibilities Amazon Redshift brings to the business. The customer primarily uses Amazon Redshift to store and process user behavior data for business intelligence purposes. In recent months this data has grown by a few hundred gigabytes per day, and during office hours staff in various departments continuously query the data on the Amazon Redshift cluster through the company's business intelligence (BI) platform.
Because some data is used by all workloads, Example Corp. runs its top four analytical workloads on a single Amazon Redshift cluster. The four analytics workloads are:
- Queries from the BI platform: various queries executed mainly during office hours.
- Hourly data ETL: runs in the first few minutes of every hour and usually takes about 40 minutes to complete.
- Daily data ETL: runs twice a day, during office hours, because the operations team needs a daily report by the end of each day. Each run usually takes 1.5 to 3 hours. It is the second most resource-consuming workload.
- Weekly data ETL: runs every Sunday in the early hours of the morning. This is the most resource-intensive workload, and each run usually takes 3 to 4 hours.
Due to continuous growth in business and data volume, Example Corp.'s data analytics team migrated its original Amazon Redshift cluster to a new cluster based on RA3 nodes. Because all analytical workloads run on one cluster and the data volume is growing rapidly, the team gradually increased the cluster to 12 nodes in order to maintain the average query execution time of its BI platform.
However, they noticed that, apart from some peaks during ETL execution, average CPU utilization was usually below 50%. The data analytics team therefore hoped to work with the Amazon Cloud Technology technical team to find a solution that would optimize the performance and cost of their Amazon Redshift clusters without sacrificing query performance.
Since CPU usage peaks typically occur while the various ETL tasks are executing, our first thought was to split the workloads and the related data across multiple Amazon Redshift clusters of different sizes, hoping that reducing the total number of nodes would reduce the customer's costs.
After a series of discussions, we found that one obstacle to splitting the analytical workloads across multiple Amazon Redshift clusters is that, for business reasons, multiple workloads often need to read and/or update the same set of tables in the data warehouse. Guaranteeing data consistency across the clusters after the split, while maintaining high performance, is a challenge. For example, some tables hold data that needs to be read by ETL workloads and updated by BI workloads, while for others it is the opposite. So if we simply replicated the data to two Amazon Redshift clusters and created a data share in only one direction, the customer would additionally have to design a data synchronization process to keep all the clusters consistent, which can get complicated given the number of factors to consider.
After digging further into the customer's workloads, we found that the tables could be divided into four types. Following this idea, we proposed a bidirectional data sharing scheme across two clusters. The goal is to migrate the resource-intensive ETL workloads to a separate Amazon Redshift cluster so that we can take advantage of Amazon Redshift's pause/resume functionality and run that cluster only while these ETL workloads are executing, thereby reducing operating costs, while bidirectional data sharing keeps the two Amazon Redshift clusters consistent without the need to build an additional data synchronization process.
The old single-cluster architecture (left) was improved to the new two-cluster solution (right)
Break down workloads and data
Based on the characteristics of the four main workloads, we divided them into two categories: continuously running workloads and periodically running workloads.
The continuously running workloads are the BI platform queries and the hourly ETL. Because the hourly ETL takes about 40 minutes on average to complete, migrating it to a standalone Amazon Redshift cluster and pausing/resuming that cluster every hour would bring little benefit, so we group it together with the BI platform workload.
The periodically running workloads are the daily ETL and the weekly ETL.
Planning for data sharing
The next step is to determine the access patterns for all data (tables) for each category.
For this, we have identified 4 types of data tables:
- Type 1 data tables: can only be read and updated by continuously running workloads.
- Type 2 data tables: read and updated by continuously running workloads, and read but not updated by periodically running workloads.
- Type 3 data tables: read and updated by periodically running workloads, and read but not updated by continuously running workloads.
- Type 4 data tables: can only be read and updated by periodically running workloads.
After classifying all the tables according to these four types, we found that every table fits one of them. So we can go ahead and split the single Amazon Redshift cluster into two clusters: one with 12 RA3 nodes, named long-running-cluster, for the continuously running workloads; the other with 20 RA3 nodes, named periodic-running-cluster, for the periodically running workloads.
We will create bidirectional data sharing between the long-running cluster and the periodic-running cluster. For the type 2 tables listed above, we will create a data share L-to-P on the long-running cluster (as the producer) and set the periodic-running cluster as the consumer. For the type 3 tables, we will create a data share P-to-L on the periodic-running cluster (as producer') and set the long-running cluster as the consumer (consumer').
Two-way data sharing architecture diagram: the long-running cluster (producer) shares the type 2 tables with the periodic-running cluster (consumer), and the periodic-running cluster (producer') shares the type 3 tables with the long-running cluster (consumer')
Build bidirectional data sharing across Amazon Redshift clusters
First, let's take a snapshot of the original single Amazon Redshift cluster, which will later become the long-running-cluster.
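For reference, a manual snapshot can be taken with the AWS CLI; this is a minimal sketch, and both identifiers below are placeholders:

# Take a manual snapshot of the original cluster (identifiers are placeholders)
aws redshift create-cluster-snapshot \
    --cluster-identifier original-cluster \
    --snapshot-identifier before-split-snapshot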
Now, let's create a new Amazon Redshift cluster with 20 RA3 nodes, named periodic-running-cluster, for the periodically running workloads; make sure to select an RA3 instance type. Once it is created, we migrate the type 3 and type 4 tables to this cluster.
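One possible way to create this cluster (a sketch, not necessarily how Example Corp. did it) is to restore it from the snapshot taken above and then drop the tables that stay on the long-running cluster; the node type and identifiers below are illustrative assumptions:

# Restore the snapshot into a 20-node RA3 cluster (node type and names are assumptions)
aws redshift restore-from-cluster-snapshot \
    --cluster-identifier periodic-running-cluster \
    --snapshot-identifier before-split-snapshot \
    --node-type ra3.4xlarge \
    --number-of-nodes 20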
1. Share data from the long-running-cluster to the periodic-running-cluster
The next step is to create the data share L-to-P. We can do this with the Amazon Redshift Query Editor V2 (https://aws.amazon.com/blogs/big-data/simplify-your-data-analysis-with-amazon-redshift-query-editor-v2/).
On the periodic-running-cluster, we execute the following command to get its namespace, and record the value as [periodic-running-cluster-namespace]:
SELECT current_namespace;
Then, on the long-running-cluster, we execute the following queries:
CREATE DATASHARE ltop_share SET PUBLICACCESSIBLE TRUE;
ALTER DATASHARE ltop_share ADD SCHEMA public_long;
ALTER DATASHARE ltop_share ADD ALL TABLES IN SCHEMA public_long;
GRANT USAGE ON DATASHARE ltop_share TO NAMESPACE '[periodic-running-cluster-namespace]';
Note: replace [periodic-running-cluster-namespace] in the last query with the namespace value recorded in the previous step.
Then, we can verify the data share L-to-P with the following command:
SHOW datashares;
After verifying the data share L-to-P, we get the namespace of the long-running-cluster with the following query and record the value as [long-running-cluster-namespace]:
SELECT current_namespace;
Then, let's go back to the periodic-running-cluster and execute the following command to create a local database from the data share:
CREATE DATABASE ltop FROM DATASHARE ltop_share OF NAMESPACE '[long-running-cluster-namespace]';
Note: replace [long-running-cluster-namespace] in the query with the namespace value recorded in the previous step.
Then we verify that the table data can be read through the data share L-to-P.
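For example, a quick check might count the rows of one of the shared tables; this is a minimal sketch, where sales_table is a hypothetical table name in the shared public_long schema:

-- Cross-database query against the database created from the share (table name is hypothetical)
SELECT COUNT(*) FROM ltop.public_long.sales_table;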
2. Share data from the periodic-running-cluster to the long-running-cluster
Create the data share P-to-L
We recorded the namespaces of the long-running-cluster and the periodic-running-cluster in the previous steps, and we will use them directly in the following steps.
On the periodic-running-cluster, we execute the following queries to create the data share P-to-L:
CREATE DATASHARE ptol_share SET PUBLICACCESSIBLE TRUE;
ALTER DATASHARE ptol_share ADD SCHEMA public_periodic;
ALTER DATASHARE ptol_share ADD ALL TABLES IN SCHEMA public_periodic;
GRANT USAGE ON DATASHARE ptol_share TO NAMESPACE '[long-running-cluster-namespace]';
After that, we use the following query to verify that the data share P-to-L was created successfully:
SHOW datashares;
After verifying that the data share P-to-L was created successfully, we return to the long-running-cluster.
On the long-running-cluster, we execute the following command to create a local database from the data share:
CREATE DATABASE ptol FROM DATASHARE ptol_share OF NAMESPACE '[periodic-running-cluster-namespace]';
Then we verify that the table data can be read through the data share P-to-L on the long-running-cluster.
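As before, a minimal sketch of such a check, where etl_output_table is a hypothetical table name in the shared public_periodic schema:

-- Cross-database query against the database created from the share (table name is hypothetical)
SELECT COUNT(*) FROM ptol.public_periodic.etl_output_table;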
Once this verification succeeds, the bidirectional data sharing setup is complete.
The next step is to update the code of the different workloads to use the endpoints of the two Amazon Redshift clusters correctly, and to run business tests.
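Each cluster's endpoint can be looked up with the AWS CLI; a sketch, where the cluster identifier is a placeholder:

# Look up the endpoint (address and port) of a cluster
aws redshift describe-clusters \
    --cluster-identifier periodic-running-cluster \
    --query 'Clusters[0].Endpoint'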
At this point, we have divided the workload into two Amazon Redshift clusters and successfully established bidirectional data sharing between the two clusters.
Pause/resume a periodic-running Amazon Redshift cluster
Now, let's update the scheduling scripts (for example, cron jobs) that run the daily and weekly ETL workloads. There are two updates.
When the script starts, it calls the Amazon Redshift resume-cluster API with the cluster ID to resume the paused periodic-running-cluster:
aws redshift resume-cluster --cluster-identifier [periodic-running-cluster-id]
After the ETL workload completes, the script calls the Amazon Redshift pause-cluster API with the cluster ID to pause the periodic-running-cluster:
aws redshift pause-cluster --cluster-identifier [periodic-running-cluster-id]
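Putting the two updates together, here is a minimal sketch of such a wrapper script; the cluster ID and the ETL entry point are placeholders, and since resuming takes a few minutes, the script waits until the cluster is available before starting the ETL:

#!/bin/bash
# Hypothetical wrapper around a periodic ETL job
CLUSTER_ID="periodic-running-cluster"

# Resume the paused cluster and block until it is available
aws redshift resume-cluster --cluster-identifier "$CLUSTER_ID"
aws redshift wait cluster-available --cluster-identifier "$CLUSTER_ID"

# Run the ETL workload (placeholder command)
./run_etl_workload.sh

# Pause the cluster again once the ETL completes
aws redshift pause-cluster --cluster-identifier "$CLUSTER_ID"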
Results
After migrating the workloads to the new architecture, the Example Corp. data analytics team conducted about a month of testing.
According to the tests, performance improved for all workloads. Details are as follows:
- The average performance of the BI workload improved by about 100% while ETL workloads were running.
- The execution time of the hourly ETL workload was reduced by about 50%.
- Daily ETL workload duration was reduced from a maximum of 3 hours to an average of 40 minutes.
- Weekly ETL workload duration was reduced from a maximum of 4 hours to an average of 1.5 hours.
During the optimization period, the business added about 10% more data, yet total cost increased by only about 13%, and all business processes ran normally.
Conclusions and Limitations
From this project, after splitting the workloads across different Amazon Redshift clusters, we have the following findings.
- First, the average performance of the BI workload has improved by 100%, as it no longer competes for resources with the daily and weekly ETL workloads.
- Second, the duration of the ETL workloads on the periodically running Amazon Redshift cluster was significantly reduced, because there was no resource contention from the BI and hourly ETL workloads and the cluster had more nodes (20).
- Third, by leveraging the cluster pause/resume feature of the Amazon Redshift RA3 series, the overall cost of the Amazon Redshift clusters increased by only about 13% while business data volume grew by about 10%.
From this we conclude that, in this project, the Amazon Redshift clusters became at least 70% more cost-effective.
However, this solution has some limitations.
- First, to use the Amazon Redshift pause/resume functionality, code that calls the pause/resume APIs must be added to all relevant scripts that run ETL workloads on the periodic-running-cluster, or scheduled pause and resume must be configured on the cluster (see the sketch after this list).
- Second, an Amazon Redshift cluster takes several minutes to pause or resume, although customers are not billed during the pause and resume process.
- Third, an Amazon Redshift cluster cannot automatically scale its size up or down based on workload; it requires manual adjustments on a regular basis.
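As a sketch of the scheduled alternative mentioned in the first point, Amazon Redshift scheduled actions can pause and resume a cluster on a fixed schedule; the action name, cron expression, cluster identifier, and IAM role ARN below are all illustrative assumptions:

# Pause the cluster every day at 22:00 UTC (all values are placeholders)
aws redshift create-scheduled-action \
    --scheduled-action-name pause-periodic-cluster \
    --schedule "cron(0 22 * * ? *)" \
    --target-action '{"PauseCluster":{"ClusterIdentifier":"periodic-running-cluster"}}' \
    --iam-role arn:aws:iam::123456789012:role/RedshiftSchedulerRole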
Next steps
To ensure stable operation of the system, the number of nodes in the long-running-cluster was not reduced during this project. Considering that the cluster's average CPU utilization is not high, after testing we could try an elastic resize to reduce the number of nodes while maintaining the performance of the BI and hourly ETL workloads, further reducing costs.
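A minimal sketch of such an elastic resize; the target node count is only an example to be validated by testing:

# Elastic resize to a smaller node count (target count is illustrative)
aws redshift resize-cluster \
    --cluster-identifier long-running-cluster \
    --number-of-nodes 10 \
    --no-classic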
While the solution in this article achieves significant price/performance improvements for the customer's analytical workloads, we believe Amazon Redshift operations can be simplified further and capacity can be matched to workload duration at a finer granularity.
Amazon Redshift Serverless was announced in preview at re:Invent 2021. It automatically provisions and intelligently scales data warehouse capacity, delivering best-in-class performance for all your analytics, and you pay only per second for the compute resources used for the duration of the workload. You can benefit from it without making any changes to your existing analytics and business intelligence applications.
So Amazon Redshift Serverless is a service we can try next.
Author of this article
Ma Jingbin
Amazon Cloud Technology Solution Architect
He is committed to helping customers build well-architected products and applications, as well as helping domestic customers expand overseas and overseas users enter China. He has more than 14 years of product development experience in the internet industry, has worked overseas for many years, and is a fan of serverless and infrastructure as code.