Amazon Redshift ML simplifies machine learning (ML) operations by letting you create and train ML models with simple SQL statements, using data in Amazon Redshift. You can use Amazon Redshift ML to solve binary classification, multi-class classification, and regression problems, and can use techniques such as AutoML or XGBoost directly.
This article is part of the Amazon Redshift ML series. For more information about building regression models with Amazon Redshift ML, see Building Regression Models with Amazon Redshift ML.
Amazon Redshift ML can automate data preparation and preprocessing, and can even select the problem type for you. In this post, however, we assume you have a good understanding of your data and know which problem type best applies to your use case. This article focuses on creating a model in Amazon Redshift using the multi-class classification problem type, which applies when the target has at least three classes. For example, you can predict whether a transaction is fraudulent, failed, or successful; whether a customer will remain active for 3, 6, 9, or 12 months; or whether news should be tagged as sports, world news, or business content.
Prerequisites
As a prerequisite to implementing this solution, you need to set up an Amazon Redshift cluster with machine learning (ML) capabilities enabled.
Use case
In our use case, we want to find the most active customers for a special customer loyalty program. We use Amazon Redshift ML and a multi-class model to predict how many months a customer will be active over a 13-month period. This translates into up to 13 possible classes, which makes multi-class classification the better fit. Customers who are predicted to remain active for 7 months or more will be the target group for the special customer loyalty program.
Input raw data
To prepare the raw data for the model, we populate the ecommerce_sales table in Amazon Redshift with the public E-Commerce Sales Forecast dataset, which contains sales data for a UK-based online retailer.
Enter the following statements to create the table and load the data into Amazon Redshift:
CREATE TABLE IF NOT EXISTS ecommerce_sales
(
invoiceno VARCHAR(30)
,stockcode VARCHAR(30)
,description VARCHAR(60)
,quantity DOUBLE PRECISION
,invoicedate VARCHAR(30)
,unitprice DOUBLE PRECISION
,customerid BIGINT
,country VARCHAR(25)
)
;
copy ecommerce_sales
from 's3://redshift-ml-multiclass/ecommerce_data.txt'
iam_role '<<your-amazon-redshift-sagemaker-iam-role-arn>>' delimiter '\t' IGNOREHEADER 1 region 'us-east-1' maxerror 100;
To reproduce this script in your environment, replace <<your-amazon-redshift-sagemaker-iam-role-arn>> with the ARN of the Amazon Identity and Access Management (IAM) role attached to your Amazon Redshift cluster.
Data Preparation for Machine Learning (ML) Models
Now that our dataset is loaded, we split the data into three groups: training (80%), validation (10%), and prediction (10%). Note that Amazon SageMaker Autopilot automatically splits the input data into training and validation sets on its own, but splitting the data here lets us independently validate the accuracy of the model later. We also calculate the number of months each customer has remained active, because that is the value we want the model to predict on new data. We use the random function in the SQL statement to split the data. See the following code:
create table ecommerce_sales_data as (
select
t1.stockcode,
t1.description,
t1.invoicedate,
t1.customerid,
t1.country,
t1.sales_amt,
cast(random() * 100 as int) as data_group_id
from
(
select
stockcode,
description,
invoicedate,
customerid,
country,
sum(quantity * unitprice) as sales_amt
from
ecommerce_sales
group by
1,
2,
3,
4,
5
) t1
);
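Before building the split tables, you can sanity-check the bucket proportions with a quick aggregate. This verification query is a sketch added here for convenience; it is not part of the original walkthrough:

```sql
-- data_group_id is uniform over 0-99, so the bucket counts should roughly
-- match the boundaries used in the training/validation/prediction tables
-- that follow (< 80, between 80 and 90, > 90).
select
  case
    when data_group_id < 80 then 'training'
    when data_group_id between 80 and 90 then 'validation'
    else 'prediction'
  end as data_split,
  count(*) as row_count
from ecommerce_sales_data
group by 1
order by 1;
```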
Training set
create table ecommerce_sales_training as (
select
a.customerid,
a.country,
a.stockcode,
a.description,
a.invoicedate,
a.sales_amt,
(b.nbr_months_active) as nbr_months_active
from
ecommerce_sales_data a
inner join (
select
customerid,
count(
distinct(
DATE_PART(y, cast(invoicedate as date)) || '-' || LPAD(
DATE_PART(mon, cast(invoicedate as date)),
2,
'00'
)
)
) as nbr_months_active
from
ecommerce_sales_data
group by
1
) b on a.customerid = b.customerid
where
a.data_group_id < 80
);
Validation set
create table ecommerce_sales_validation as (
select
a.customerid,
a.country,
a.stockcode,
a.description,
a.invoicedate,
a.sales_amt,
(b.nbr_months_active) as nbr_months_active
from
ecommerce_sales_data a
inner join (
select
customerid,
count(
distinct(
DATE_PART(y, cast(invoicedate as date)) || '-' || LPAD(
DATE_PART(mon, cast(invoicedate as date)),
2,
'00'
)
)
) as nbr_months_active
from
ecommerce_sales_data
group by
1
) b on a.customerid = b.customerid
where
a.data_group_id between 80
and 90
);
Prediction set
create table ecommerce_sales_prediction as (
select
customerid,
country,
stockcode,
description,
invoicedate,
sales_amt
from
ecommerce_sales_data
where
data_group_id > 90);
Create a model in Amazon Redshift
Now that we have created our training and validation datasets, we can create our ML model with the create model statement in Amazon Redshift, specifying multiclass_classification as the problem type and letting AutoML take care of everything else. The target we want this model to predict is nbr_months_active. Amazon SageMaker creates the function predict_customer_activity, which we use for inference in Amazon Redshift. See the following code:
create model ecommerce_customer_activity
from
(
select
customerid,
country,
stockcode,
description,
invoicedate,
sales_amt,
nbr_months_active
from ecommerce_sales_training)
TARGET nbr_months_active FUNCTION predict_customer_activity
IAM_ROLE '<<your-amazon-redshift-sagemaker-iam-role-arn>>'
problem_type MULTICLASS_CLASSIFICATION
SETTINGS (
S3_BUCKET '<<your-amazon-s3-bucket-name>>',
S3_GARBAGE_COLLECT OFF
);
To reproduce this script in your environment, replace << your-amazon-redshift-sagemaker-iam-role-arn >> with your cluster's IAM role ARN and << your-amazon-s3-bucket-name >> with the name of your S3 bucket.
Verify prediction
In this step, we evaluate the accuracy of the ML model against the validation data.
When you create a model, Amazon SageMaker Autopilot automatically splits the input data into training and validation sets, selects the model with the best objective metric, and deploys it in the Amazon Redshift cluster. You can use the show model statement in your cluster to view various metrics, including the accuracy score. If you don't specify a metric explicitly, Amazon SageMaker automatically uses the default accuracy metric for the problem type. See the following code:
show model ecommerce_customer_activity;
The show model output reports that our model has an accuracy of 0.996580.
Let's run an inference query against the validation data with the following SQL code:
select
cast(sum(t1.match)as decimal(7,2)) as predicted_matches
,cast(sum(t1.nonmatch) as decimal(7,2)) as predicted_non_matches
,cast(sum(t1.match + t1.nonmatch) as decimal(7,2)) as total_predictions
,predicted_matches / total_predictions as pct_accuracy
from
(select
customerid,
country,
stockcode,
description,
invoicedate,
sales_amt,
nbr_months_active,
predict_customer_activity(customerid, country, stockcode, description, invoicedate, sales_amt) as predicted_months_active,
case when nbr_months_active = predicted_months_active then 1
else 0 end as match,
case when nbr_months_active <> predicted_months_active then 1
else 0 end as nonmatch
from ecommerce_sales_validation
)t1;
As you can see, the prediction accuracy on our dataset is 99.74%, which is consistent with the accuracy reported by show model.
Now let's run a query to see which customers are eligible for our customer loyalty program by being active for at least 7 months:
select
customerid,
predict_customer_activity(customerid, country, stockcode, description, invoicedate, sales_amt) as predicted_months_active
from ecommerce_sales_prediction
where predicted_months_active >=7
group by 1,2
limit 10;
The table below shows our output.
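If you also want to know the total size of the eligible audience, a count over the same prediction function gives it directly. This is a sketch using the tables and function defined earlier in this post, not part of the original walkthrough:

```sql
-- Count distinct customers the model predicts will be active for 7+ months.
select count(distinct customerid) as eligible_customers
from (
  select
    customerid,
    predict_customer_activity(customerid, country, stockcode, description,
                              invoicedate, sales_amt) as predicted_months_active
  from ecommerce_sales_prediction
) t
where predicted_months_active >= 7;
```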
Troubleshooting
The create model statement in Amazon Redshift automatically initiates the Amazon SageMaker Autopilot process to build, train, and tune the best ML model and deploy it in Amazon Redshift. Nevertheless, you can view the intermediate steps performed in this process, which can also help you troubleshoot if a problem occurs. You can retrieve the AutoML job name from the output of the show model command.
When creating the model, you need to set an Amazon Simple Storage Service (Amazon S3) bucket name as the value of the parameter s3_bucket. This bucket is used to share training data and artifacts between Amazon Redshift and Amazon SageMaker. Amazon Redshift creates a subfolder in this bucket to hold the training data. After training completes, it deletes the subfolder and its contents unless the parameter s3_garbage_collect is set to off (which is useful for troubleshooting).
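As a troubleshooting step, you can poll the model during or after training with show model; as noted above, its output includes the AutoML job name you can use to inspect the job further:

```sql
-- Poll the model during or after training. The output includes the model
-- state, the objective metric, and the SageMaker Autopilot job name, which
-- you can use to locate the job in the SageMaker console or the intermediate
-- artifacts in the bucket configured via S3_BUCKET.
show model ecommerce_customer_activity;
```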
Conclusion
Amazon Redshift ML gives database users a convenient platform to create, train, and tune models using a SQL interface. In this post, we showed you how to create a multi-class classification model. We hope you can leverage Amazon Redshift ML to gain valuable insights from your data.
Acknowledgments
The dataset was made available by Dr. Daqing Chen, Director of the Public Analytics group, via the UCI Machine Learning Repository. Email: chend@lsbu.ac.uk. Address: School of Engineering, London South Bank University, London SE1 0AA, UK.
Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
About the authors
Phil Bates
Senior Data Analytics Specialist Solutions Architect at Amazon Web Services (AWS) with over 25 years of data warehousing experience.
Debu Panda
Principal Product Manager at AWS. He is an industry leader in data analytics, application platforms, and database technologies with over 25 years of experience in IT.
Nikos Koulouris
Software Development Engineer at AWS. He received his PhD from UC San Diego and has been working in the areas of databases and data analytics.
Enrico Sartorello
Senior Software Development Engineer at AWS. He helps customers adopt machine learning solutions that fit their needs by developing new capabilities for Amazon SageMaker. In his spare time, he enjoys watching soccer matches and improving his cooking skills.