Amazon Redshift ML simplifies machine learning (ML) operations by letting you create and train ML models with simple SQL statements, using data in Amazon Redshift. You can use Amazon Redshift ML to solve binary classification, multi-class classification, and regression problems, and can use techniques such as AutoML or XGBoost directly.
This article is part of the Amazon Redshift ML series. For more information about building regression models with Amazon Redshift ML, see Building Regression Models with Amazon Redshift ML.
Amazon Redshift ML can automate data preparation and preprocessing, and can even select the problem type for you. In this post, however, we assume you have a good understanding of your data and know which problem type best applies to your use case. This article focuses on creating a model in Amazon Redshift using the multi-class classification problem type, which applies when the target has at least three classes. For example, you can predict whether a transaction is fraudulent, failed, or successful; whether a customer will remain active for 3, 6, 9, or 12 months; or whether news should be tagged as sports, world news, or business content.
Prerequisites
As a prerequisite to implementing this solution, you need to set up an Amazon Redshift cluster with machine learning (ML) capabilities enabled.
Use case
In our use case, we want to find the most active customers for a special customer loyalty program. We use Amazon Redshift ML and a multi-class model to predict how many months a customer will be active over a 13-month period. This translates into up to 13 possible classes, which makes multi-class classification the better fit. Customers who are predicted to remain active for 7 months or more will be the target group for the special customer loyalty program.
Input raw data
To prepare the raw data for the model, we populate the ecommerce_sales table in Amazon Redshift with the public E-Commerce Sales Forecast dataset, which contains sales data for a UK-based online retailer.
Enter the following statements to create the table and load the data into Amazon Redshift:
CREATE TABLE IF NOT EXISTS ecommerce_sales
(
invoiceno VARCHAR(30)
,stockcode VARCHAR(30)
,description VARCHAR(60)
,quantity DOUBLE PRECISION
,invoicedate VARCHAR(30)
,unitprice DOUBLE PRECISION
,customerid BIGINT
,country VARCHAR(25)
)
;
copy ecommerce_sales
from 's3://redshift-ml-multiclass/ecommerce_data.txt'
iam_role '<<your-amazon-redshift-sagemaker-iam-role-arn>>' delimiter '\t' IGNOREHEADER 1 region 'us-east-1' maxerror 100;
To reproduce this script in your environment, replace <<your-amazon-redshift-sagemaker-iam-role-arn>> with the ARN of the Amazon Identity and Access Management (IAM) role attached to your Amazon Redshift cluster.
Data Preparation for Machine Learning (ML) Models
Now that our dataset is loaded, we split the data into three groups: training (80%), validation (10%), and prediction (10%). Note that Amazon SageMaker Autopilot automatically splits the input data into training and validation sets on its own, but splitting the data here lets us independently validate the accuracy of the model later. We also calculate the number of months each customer has remained active, because that is the value we want the model to predict on new data. We use the random function in the SQL statement to split the data. See the following code:
create table ecommerce_sales_data as (
select
t1.stockcode,
t1.description,
t1.invoicedate,
t1.customerid,
t1.country,
t1.sales_amt,
cast(random() * 100 as int) as data_group_id
from
(
select
stockcode,
description,
invoicedate,
customerid,
country,
sum(quantity * unitprice) as sales_amt
from
ecommerce_sales
group by
1,
2,
3,
4,
5
) t1
);
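Before building the split tables, you can sanity-check the bucket proportions with a quick aggregate. This verification query is a sketch added here for convenience; it is not part of the original walkthrough:

```sql
-- data_group_id is uniform over 0-99, so the bucket counts should roughly
-- match the boundaries used in the training/validation/prediction tables
-- that follow (< 80, between 80 and 90, > 90).
select
  case
    when data_group_id < 80 then 'training'
    when data_group_id between 80 and 90 then 'validation'
    else 'prediction'
  end as data_split,
  count(*) as row_count
from ecommerce_sales_data
group by 1
order by 1;
```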
Training set
create table ecommerce_sales_training as (
select
a.customerid,
a.country,
a.stockcode,
a.description,
a.invoicedate,
a.sales_amt,
(b.nbr_months_active) as nbr_months_active
from
ecommerce_sales_data a
inner join (
select
customerid,
count(
distinct(
DATE_PART(y, cast(invoicedate as date)) || '-' || LPAD(
DATE_PART(mon, cast(invoicedate as date)),
2,
'00'
)
)
) as nbr_months_active
from
ecommerce_sales_data
group by
1
) b on a.customerid = b.customerid
where
a.data_group_id < 80
);
Validation set
create table ecommerce_sales_validation as (
select
a.customerid,
a.country,
a.stockcode,
a.description,
a.invoicedate,
a.sales_amt,
(b.nbr_months_active) as nbr_months_active
from
ecommerce_sales_data a
inner join (
select
customerid,
count(
distinct(
DATE_PART(y, cast(invoicedate as date)) || '-' || LPAD(
DATE_PART(mon, cast(invoicedate as date)),
2,
'00'
)
)
) as nbr_months_active
from
ecommerce_sales_data
group by
1
) b on a.customerid = b.customerid
where
a.data_group_id between 80
and 90
);
Prediction set
create table ecommerce_sales_prediction as (
select
customerid,
country,
stockcode,
description,
invoicedate,
sales_amt
from
ecommerce_sales_data
where
data_group_id > 90);
Create a model in Amazon Redshift
Now that we have created our training and validation datasets, we can create our ML model with the create model statement in Amazon Redshift, specifying multiclass_classification as the problem type and letting AutoML take care of everything else. The target we want this model to predict is nbr_months_active. Amazon SageMaker creates the function predict_customer_activity, which we use for inference in Amazon Redshift. See the following code:
create model ecommerce_customer_activity
from
(
select
customerid,
country,
stockcode,
description,
invoicedate,
sales_amt,
nbr_months_active
from ecommerce_sales_training)
TARGET nbr_months_active FUNCTION predict_customer_activity
IAM_ROLE '<<your-amazon-redshift-sagemaker-iam-role-arn>>'
problem_type MULTICLASS_CLASSIFICATION
SETTINGS (
S3_BUCKET '<<your-amazon-s3-bucket-name>>',
S3_GARBAGE_COLLECT OFF
);
To reproduce this script in your environment, replace << your-amazon-redshift-sagemaker-iam-role-arn >> with your cluster's IAM role ARN and << your-amazon-s3-bucket-name >> with the name of your S3 bucket.
Verify prediction
In this step, we evaluate the accuracy of the ML model against the validation data.
When you create a model, Amazon SageMaker Autopilot automatically splits the input data into training and validation sets, selects the model with the best objective metric, and deploys it in the Amazon Redshift cluster. You can use the show model statement in your cluster to view various metrics, including the accuracy score. If you don't specify a metric explicitly, Amazon SageMaker automatically uses the default accuracy metric for the problem type. See the following code:
show model ecommerce_customer_activity;
The show model output reports that our model has an accuracy of 0.996580.
Let's run an inference query against the validation data with the following SQL code:
select
cast(sum(t1.match)as decimal(7,2)) as predicted_matches
,cast(sum(t1.nonmatch) as decimal(7,2)) as predicted_non_matches
,cast(sum(t1.match + t1.nonmatch) as decimal(7,2)) as total_predictions
,predicted_matches / total_predictions as pct_accuracy
from
(select
customerid,
country,
stockcode,
description,
invoicedate,
sales_amt,
nbr_months_active,
predict_customer_activity(customerid, country, stockcode, description, invoicedate, sales_amt) as predicted_months_active,
case when nbr_months_active = predicted_months_active then 1
else 0 end as match,
case when nbr_months_active <> predicted_months_active then 1
else 0 end as nonmatch
from ecommerce_sales_validation
)t1;
As you can see, the prediction accuracy on our dataset is 99.74%, which is consistent with the accuracy reported by show model.
Now let's run a query to see which customers are eligible for our customer loyalty program by being active for at least 7 months:
select
customerid,
predict_customer_activity(customerid, country, stockcode, description, invoicedate, sales_amt) as predicted_months_active
from ecommerce_sales_prediction
where predicted_months_active >=7
group by 1,2
limit 10;
The table below shows our output.
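If you also want to know the total size of the eligible audience, a count over the same prediction function gives it directly. This is a sketch using the tables and function defined earlier in this post, not part of the original walkthrough:

```sql
-- Count distinct customers the model predicts will be active for 7+ months.
select count(distinct customerid) as eligible_customers
from (
  select
    customerid,
    predict_customer_activity(customerid, country, stockcode, description,
                              invoicedate, sales_amt) as predicted_months_active
  from ecommerce_sales_prediction
) t
where predicted_months_active >= 7;
```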
Troubleshooting
The create model statement in Amazon Redshift automatically initiates the Amazon SageMaker Autopilot process to build, train, and tune the best ML model and deploy it in Amazon Redshift. Nevertheless, you can view the intermediate steps performed in this process, which can also help you troubleshoot if a problem occurs. You can retrieve the AutoML job name from the output of the show model command.
When creating the model, you need to set an Amazon Simple Storage Service (Amazon S3) bucket name as the value of the parameter s3_bucket. This bucket is used to share training data and artifacts between Amazon Redshift and Amazon SageMaker. Amazon Redshift creates a subfolder in this bucket to hold the training data. After training completes, it deletes the subfolder and its contents unless the parameter s3_garbage_collect is set to off (which is useful for troubleshooting).
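As a troubleshooting step, you can poll the model during or after training with show model; as noted above, its output includes the AutoML job name you can use to inspect the job further:

```sql
-- Poll the model during or after training. The output includes the model
-- state, the objective metric, and the SageMaker Autopilot job name, which
-- you can use to locate the job in the SageMaker console or the intermediate
-- artifacts in the bucket configured via S3_BUCKET.
show model ecommerce_customer_activity;
```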
Conclusion
Amazon Redshift ML gives database users a convenient platform to create, train, and tune models using a SQL interface. In this post, we showed you how to create a multi-class classification model. We hope you can leverage Amazon Redshift ML to gain valuable insights from your data.
Acknowledgments
The dataset was made available by Dr. Daqing Chen, Director of the Public Analytics group, via the UCI Machine Learning Repository. Email: chend@lsbu.ac.uk. Address: School of Engineering, London South Bank University, London SE1 0AA, UK.
Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
About the authors
Phil Bates
Senior Data Analytics Specialist Solutions Architect at Amazon Web Services (AWS) with over 25 years of data warehousing experience.
Debu Panda
Principal Product Manager at AWS. He is an industry leader in data analytics, application platforms, and database technologies with over 25 years of experience in IT.
Nikos Koulouris
Software Development Engineer at AWS. He received his PhD from UC San Diego and has been working in the areas of databases and data analytics.
Enrico Sartorello
Senior Software Development Engineer at AWS. He helps customers adopt machine learning solutions that fit their needs by developing new capabilities for Amazon SageMaker. In his spare time, he enjoys watching soccer matches and improving his cooking skills.