
Introduction

Amazon Redshift is a fast, fully managed data warehouse service that lets you cost-effectively analyze all your data using standard SQL and your existing business intelligence tools.

When planning an Amazon Redshift database, the design of critical tables has a large impact on overall query performance. Good table design reduces the number of I/O operations and minimizes the memory required to process queries, which in turn lowers storage requirements and improves query performance.

Best practices for Amazon Redshift table design include the following:

  • Choose the best sort key
  • Choose the best distribution method
  • Let COPY choose the compression encoding
  • Define primary key and foreign key constraints
  • Use the smallest possible column size
  • Use date/time data types for date columns

In this article, we'll focus on "use the smallest possible column size" and describe how to semi-automatically optimize column sizes in Amazon Redshift tables through SQL scripts.

The necessity of using the smallest possible column size

According to the official Amazon Cloud Technology documentation, Amazon Redshift compresses data very effectively, so defining an excessively large column length does not greatly increase the size of the table itself. However, when running complex queries, intermediate results are temporarily materialized, and the temporary tables created for them do not use compression. Oversized columns therefore cause such queries to consume excessive memory or temporary disk space, which degrades query performance.

https://docs.aws.amazon.com/zh_cn/redshift/latest/dg/c_best-practices-smallest-column-size.html
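One way to observe this effect on your own cluster is to check whether a query's steps spilled to disk. The following is a minimal sketch, not part of the original procedure; the query ID 12345 is a placeholder that you would replace with the ID of a query you want to inspect.

-- Replace 12345 with the ID of the query you want to inspect.
select query, step, label, is_diskbased, workmem
from svl_query_summary
where query = 12345
order by step;

Steps with is_diskbased = 't' wrote intermediate results to disk; wider intermediate rows, caused for example by oversized varchar columns, make this more likely.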

Test environment preparation

First, run the following SQL script to create a new data table test_schema.customer in your Amazon Redshift database for testing.

CREATE SCHEMA test_schema;

CREATE TABLE test_schema.customer
(
  c_custkey      INTEGER NOT NULL encode delta,
  c_name         VARCHAR(65535) NOT NULL encode zstd,
  c_address      VARCHAR(65535) NOT NULL encode zstd,
  c_city         VARCHAR(65535) NOT NULL encode zstd,
  c_nation       VARCHAR(65535) NOT NULL encode zstd,
  c_region       VARCHAR(65535) NOT NULL encode zstd,
  c_phone        VARCHAR(65535) NOT NULL encode zstd,
  c_mktsegment   VARCHAR(65535) NOT NULL encode zstd
) diststyle even;

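Optionally, you can confirm the column definitions and compression encodings that were applied. The following sketch uses the PG_TABLE_DEF system view, which only lists tables in schemas on the search_path, so test_schema is added to the search_path first.

set search_path to '$user', public, test_schema;

select "column", type, encoding
from pg_table_def
where schemaname = 'test_schema' and tablename = 'customer';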

Second, execute the following SQL script to import the test data into the test_schema.customer table. The test data is in the LoadingDataSampleFiles.zip archive. Replace <your-bucket-name>, <Your-Access-Key-ID>, and <Your-Secret-Access-Key> in the script with your own values.

copy test_schema.customer
from 's3://<your-bucket-name>/load/customer-fw.tbl'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
fixedwidth 'c_custkey:10, c_name:25, c_address:25, c_city:10, c_nation:15, c_region:12, c_phone:15, c_mktsegment:10';

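Before moving on, it is worth confirming that the load succeeded. The following is a small sanity check (a sketch, not part of the original steps) that counts the loaded rows and looks at the most recent entries in the STL_LOAD_ERRORS system table.

-- Row count after the load.
select count(*) from test_schema.customer;

-- Most recent load errors, if any.
select starttime, filename, line_number, colname, err_reason
from stl_load_errors
order by starttime desc
limit 10;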

At this point, the varchar columns in the data table test_schema.customer are all defined with a length of 65535. Run the following SQL script to see that the actual field lengths in the data are much smaller than 65535.

select
    max(len(c_name)) as c_name,
    max(len(c_address)) as c_address,
    max(len(c_city)) as c_city,
    max(len(c_nation)) as c_nation,
    max(len(c_region)) as c_region,
    max(len(c_phone)) as c_phone,
    max(len(c_mktsegment)) as c_mktsegment
from test_schema.customer;


[Figure: column sizes in the existing data table]

Next, we will optimize the data table test_schema.customer through the following eight steps, resizing all of its varchar columns according to the maximum lengths of the existing data.

The first step is to create the stored procedure proc_replicate_table_with_resized_columns in the Amazon Redshift database. This stored procedure takes four parameters:

  • var_schema_name varchar(50): the schema of the table whose column sizes you want to optimize.
  • var_table_name varchar(50): the name of the table whose column sizes you want to optimize.
  • var_postfix varchar(50): the suffix appended to the original table name to form the name of the newly created table.
  • var_ratio decimal(19,2): a factor; each column is resized to the maximum length found in the existing data multiplied by this factor.

First, the stored procedure creates a new table with the same structure as the specified table; the new table's name is the original table name with the suffix you specify appended. Second, for every varchar column of the specified table, the stored procedure checks whether the maximum length in the existing data, multiplied by the factor, is still less than the length defined in the table; if so, it generates an SQL script that adjusts that column in the newly created table to a length of round(column_actual_len * var_ratio).

create or replace procedure proc_replicate_table_with_resized_columns
        (
        var_schema_name in varchar(50),
        var_table_name in varchar(50),
        var_postfix in varchar(50),
        var_ratio in decimal(19,2)
    )
    language plpgsql
as $$
DECLARE
    sql_text varchar(65535);
    table_full_name varchar (100) ;
    table_full_name_new varchar (100) ;
    rec RECORD;
BEGIN
    select into table_full_name var_schema_name || '.' || var_table_name;
    select into table_full_name_new  var_schema_name || '.' || var_table_name || var_postfix;

    EXECUTE 'drop table if exists ' || table_full_name_new;

    --create a new table with the same schema
    EXECUTE 'create table '  || table_full_name_new || ' (like ' || table_full_name || ')';

    -- Get the temp table for all columns for a table, temp_table_column_scripts
    drop table if exists temp_table_column_scripts;
    create temp table temp_table_column_scripts(script varchar);

    insert into temp_table_column_scripts
    select  'select ''' || column_name || ''' as column_name, ' ||
       'max(len(' || column_name ||')) as column_actual_len, ' ||
       cast(character_maximum_length as varchar) || ' as column_def_len' ||
       ' from ' || table_schema || '.' || table_name as column_script
    from svv_columns
    where table_schema = var_schema_name and table_name = var_table_name and data_type = 'character varying';

    --loop to insert the column info into temp_table_column_info
    drop table if exists temp_table_column_info;
    create temp table temp_table_column_info(column_name varchar, column_actual_len int, column_def_len int);

    FOR rec IN SELECT script FROM temp_table_column_scripts
      LOOP
        RAISE INFO 'insert into temp_table_column_info %', rec.script;
        EXECUTE 'insert into temp_table_column_info ' || rec.script;
      END LOOP;

    --Generate temp table for alter scripts
    drop table if exists temp_table_alter_scripts;
    create temp table temp_table_alter_scripts as
    SELECT 'alter table ' || table_full_name_new || ' alter column ' ||
           column_name || ' type varchar(' || cast(round(column_actual_len * var_ratio) as varchar) || ');' as alter_script
    FROM temp_table_column_info
    where round(column_actual_len * var_ratio) < column_def_len;

END;
$$;


The second step is to execute the stored procedure created in the previous step, and obtain the SQL script for optimizing the column size from the result table.

For specific operations, refer to the following SQL script. Adjust the schema name, table name, suffix, and ratio as needed.

call proc_replicate_table_with_resized_columns('test_schema', 'customer', '_resize_columns', 1.3);

select * from temp_table_alter_scripts;


The returned result is shown in the following figure:

[Figure: the ALTER TABLE statements generated in temp_table_alter_scripts]

In the third step, run the SQL script returned from the temp_table_alter_scripts table in the previous step.
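The statements in temp_table_alter_scripts all follow the same pattern. For illustration only (the numbers are assumptions, not actual results): if the longest c_name value were 20 characters and var_ratio is 1.3, the generated statement would look roughly like this, since round(20 * 1.3) = 26.

alter table test_schema.customer_resize_columns alter column c_name type varchar(26);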

In the fourth step, run the following SQL script to check whether the column sizes in the new data table have been adjusted.

select  a.table_name, a.column_name, a.data_type,
       a.character_maximum_length as as_is_length, b.character_maximum_length as to_be_length
from svv_columns as a
inner join svv_columns as b
    on a.table_name || '_resize_columns' = b.table_name
    and a.column_name = b.column_name
where a.table_schema = 'test_schema'
  and a.table_name = 'customer'
  and b.table_name = 'customer_resize_columns';


The result is shown in the figure below. You can see that the lengths of the varchar columns have been adjusted according to the actual lengths of the existing data.

[Figure: as_is_length and to_be_length for each varchar column]

The fifth step is to UNLOAD the data in the original data table to the specified Amazon S3 path. For specific operations, refer to the following SQL script. Replace {s3-bucket-name}, {ACCESS_KEY_ID}, and {SECRET_ACCESS_KEY} in the script with your own values.

unload ('select * from test_schema.customer')
to 's3://{s3-bucket-name}/resize-redshift-columns/customer/'
ACCESS_KEY_ID '{ACCESS_KEY_ID}'
SECRET_ACCESS_KEY '{SECRET_ACCESS_KEY}'
GZIP;


The sixth step is to copy the data in S3 to the new data table.

For specific operations, refer to the following SQL script. Replace {s3-bucket-name}, {ACCESS_KEY_ID}, and {SECRET_ACCESS_KEY} in the script with your own values.

copy test_schema.customer_resize_columns
from 's3://{s3-bucket-name}/resize-redshift-columns/customer/'
ACCESS_KEY_ID '{ACCESS_KEY_ID}'
SECRET_ACCESS_KEY '{SECRET_ACCESS_KEY}'
GZIP;


The seventh step is to check whether the data in the new data table is exactly the same as the data in the original data table.

Run the following SQL script. If the result is empty, the data in the new data table is identical to the data in the original data table. If any rows are returned, check the previous steps for mistakes that caused the inconsistency.

(
select * from test_schema.customer
except
select * from test_schema.customer_resize_columns
)
union all
(
select * from test_schema.customer_resize_columns
except
select * from test_schema.customer
);

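Because EXCEPT removes duplicate rows, the check above cannot detect a case where the two tables contain the same distinct rows but with different duplicate counts. A complementary sketch is to compare the row counts directly:

select
    (select count(*) from test_schema.customer) as original_rows,
    (select count(*) from test_schema.customer_resize_columns) as new_rows;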

In the eighth step, rename the data tables so that the new table replaces the original table, and drop the original table after the replacement.

In this step, first rename the original data table, for example by adding an _original suffix; then rename the new data table to the original table's name; finally, drop the original data table (that is, the table now carrying the _original suffix).

For specific operations, refer to the following SQL script:

alter table test_schema.customer
rename to customer_original;

alter table test_schema.customer_resize_columns
rename to customer;

drop table test_schema.customer_original;

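If you want to shrink the window during which the table name is unavailable, the two renames can be grouped together and the DROP deferred until you have re-verified the renamed table. This is a sketch and assumes your cluster allows these ALTER TABLE statements inside a transaction block.

begin;
alter table test_schema.customer rename to customer_original;
alter table test_schema.customer_resize_columns rename to customer;
commit;

-- Drop the old table only after re-checking the renamed table.
drop table test_schema.customer_original;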

Through the above eight steps, you have completed the optimization of the size of all varchar type columns in the data table test_schema.customer.
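As a final check, the SVV_TABLE_INFO system view reports, among other things, the defined size of the largest varchar column (max_varchar) and the table size in 1 MB blocks, so you can confirm that the optimized table no longer carries 65535-character columns. A sketch:

select "schema", "table", max_varchar, size, tbl_rows
from svv_table_info
where "schema" = 'test_schema' and "table" = 'customer';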

Conclusion and Outlook

This article reviewed best practices for Amazon Redshift table design, focused on the recommendation to use the smallest possible column size, and described how to optimize the column sizes of existing tables in an Amazon Redshift database with SQL scripts, shrinking the column lengths in the table definition according to the maximum column lengths found in the existing data.

Note that the scripts in each step still need to be run manually and their results checked. If necessary, you can build an automated workflow that chains the scripts from each step, so that one or more data tables in a database can be optimized automatically.


Author of this article


Yao

Amazon Cloud Technology Senior Solutions Architect

He is responsible for solution consulting, application architecture design, and optimization for multinational enterprise customers, and is dedicated to promoting the adoption of Amazon Cloud Technology data analytics services.


