Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that enables users to easily and cost-effectively analyze a wide variety of data using standard SQL. Amazon Redshift delivers up to three times better price performance than other cloud data warehouses. Tens of thousands of customers process exabytes of data every day with Amazon Redshift, powering analytic workloads such as high-performance business intelligence (BI) reporting, dashboard applications, data exploration, and real-time analytics.
We are excited to announce Amazon Redshift streaming ingestion for Amazon Kinesis Data Streams, which enables users to ingest data from Kinesis data streams into a cloud data warehouse without first staging it in Amazon Simple Storage Service (Amazon S3). Streaming ingestion helps users ingest hundreds of megabytes of data into an Amazon Redshift cluster within seconds, with very low latency.
This article shows how to create a Kinesis data stream, generate and load streaming data, create a materialized view, and query the data stream and visualize the results. It also covers the benefits and common use cases of streaming ingestion.
Cloud data warehouse requirements for streaming ingestion
Many customers have reported to us that they want to further expand batch analysis capabilities into real-time analysis capabilities, and access their streaming data stored in the data warehouse in a low-latency, high-throughput manner. In addition, many customers want to combine real-time analysis results with other data sources in the data warehouse to obtain richer analysis results.
The primary use cases for Amazon Redshift streaming ingestion are characterized by processing data that is continuously generated (streaming) and needs to be processed within a very short time (latency) after the data is generated. Data comes from a variety of sources, from IoT devices to system telemetry, from utility services to device location.
Before streaming ingestion was released, ingesting real-time data from Kinesis Data Streams meant staging the data in Amazon S3 and then loading it with the COPY command. This typically introduced a delay of several minutes and required a data pipeline on top of the load from the stream. Now, users can ingest data directly from the data stream.
Solution overview
Amazon Redshift Streaming Ingest lets users connect directly to Kinesis Data Streams, completely eliminating the latency and complexity of staging data and loading it into the cluster through Amazon S3. With this, users can connect and access streaming data using SQL commands, and create materialized views directly on the basis of the data stream, thereby simplifying the data pipeline. Materialized views can also contain the SQL transformations required by the ELT (extract, load, and transform) pipeline.
Once the materialized view is defined, it can be refreshed to query the latest streaming data. This means we can use SQL to perform downstream processing and transformation on streaming data at no additional cost, and then use existing BI and analytics tools for real-time analysis.
Amazon Redshift streaming ingestion acts as a consumer of the data stream, and the materialized view serves as the landing zone for the consumed streaming data. When a materialized view is refreshed, the Amazon Redshift compute nodes assign each data shard to a compute slice. Each slice processes the data in its assigned shard until the materialized view has caught up with the stream. The first refresh of the materialized view reads from the stream's TRIM_HORIZON; subsequent refreshes read from the last SEQUENCE_NUMBER recorded by the previous refresh, until the view's state is equivalent to the stream. The whole process is shown in the figure below.
Setting up streaming ingest in Amazon Redshift requires two steps. First, we need to create an external schema to map to Kinesis Data Streams, and then we need to create a materialized view to pull data from the data stream. Materialized views must be capable of incremental maintenance.
Create a Kinesis data stream
First we need to create the Kinesis data stream that will receive the streaming data.
- Select Data streams in the Amazon Kinesis console.
- Choose Create data stream.
- Enter ev_stream_data for Data stream name.
- Select On-demand for Capacity mode.
- Provide additional configuration as needed, then create the data stream.
Generate streaming data with Kinesis Data Generator
We can generate simulated JSON-formatted data with the Amazon Kinesis Data Generator (KDG) tool using the following template:
{
  "_id": "{{random.uuid}}",
  "clusterID": "{{random.number({"min":1, "max":50})}}",
  "connectionTime": "{{date.now("YYYY-MM-DD HH:mm:ss")}}",
  "kWhDelivered": "{{commerce.price}}",
  "stationID": "{{random.number({"min":1, "max":467})}}",
  "spaceID": "{{random.word}}-{{random.number({"min":1, "max":20})}}",
  "timezone": "America/Los_Angeles",
  "userID": "{{random.number({"min":1000, "max":500000})}}"
}
The image below shows the template in the KDG console.
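For local experimentation, a record matching the template above can be sketched in plain Python, without sending anything to Kinesis. This is an assumption-laden stand-in (the KDG uses Faker-style placeholders; `random.choice` below stands in for `random.word`), useful for unit-testing downstream parsing:

```python
# Hypothetical local stand-in for one KDG-generated record.
import json
import random
import uuid
from datetime import datetime

def sample_record():
    """Build one record shaped like the KDG template above."""
    return {
        "_id": str(uuid.uuid4()),
        "clusterID": str(random.randint(1, 50)),
        "connectionTime": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
        "kWhDelivered": f"{random.uniform(1, 100):.2f}",
        # random.choice is a stand-in for the KDG's {{random.word}}
        "spaceID": f"{random.choice(['alpha', 'beta'])}-{random.randint(1, 20)}",
        "stationID": str(random.randint(1, 467)),
        "timezone": "America/Los_Angeles",
        "userID": str(random.randint(1000, 500000)),
    }

# Kinesis carries the payload as bytes; this mirrors what the stream stores.
payload = json.dumps(sample_record()).encode("utf-8")
```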
Load reference data
In the previous step, we used the Kinesis Data Generator to load simulated data into the data stream. In this section, we load reference data about electric vehicle charging stations into the cluster.
Download the plug-in electric vehicle charging station network data from the City of Austin Open Data Portal. Split the latitude and longitude values in the dataset and load them into a table with the following schema:
CREATE TABLE ev_station
(
siteid INTEGER,
station_name VARCHAR(100),
address_1 VARCHAR(100),
address_2 VARCHAR(100),
city VARCHAR(100),
state VARCHAR(100),
postal_code VARCHAR(100),
no_of_ports SMALLINT,
pricing_policy VARCHAR(100),
usage_access VARCHAR(100),
category VARCHAR(100),
subcategory VARCHAR(100),
port_1_connector_type VARCHAR(100),
voltage VARCHAR(100),
port_2_connector_type VARCHAR(100),
latitude DECIMAL(10, 6),
longitude DECIMAL(10, 6),
pricing VARCHAR(100),
power_select VARCHAR(100)
) DISTSTYLE ALL;
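The "split the latitude and longitude" step above can be sketched as a small helper. This is a hypothetical parser that assumes the portal exports the location as a combined "(lat, lon)" point string; adjust it if your export uses a different format:

```python
# Hypothetical helper: split a combined "(lat, lon)" string into the two
# DECIMAL(10, 6) values expected by the ev_station table above.
def split_lat_lon(location: str) -> tuple[float, float]:
    """Parse "(lat, lon)" into (latitude, longitude) rounded to 6 places."""
    lat, lon = location.strip("() ").split(",")
    return round(float(lat), 6), round(float(lon), 6)

lat, lon = split_lat_lon("(30.267153, -97.743057)")
```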
Create a materialized view
We can use SQL to access our data from the data stream and create materialized views directly on the data stream, thereby simplifying the construction of data pipelines. To do so, complete the following steps:
- Create an Amazon Identity and Access Management (IAM) role (see the Streaming Ingestion Getting Started Guide for the required policies).
- Create an external schema to map data from Kinesis Data Streams to Amazon Redshift objects:
CREATE EXTERNAL SCHEMA evdata FROM KINESIS
IAM_ROLE 'arn:aws:iam::0123456789:role/redshift-streaming-role';
Materialized views can then be created for working with streaming data. We can either use the SUPER data type to store the payload as JSON, or use Amazon Redshift JSON functions to parse the JSON data into separate columns. In this article we use the second approach, because the schema is already defined.
- Create a materialized view distributed on the UUID value from the stream and sorted by the approximatearrivaltimestamp value:
CREATE MATERIALIZED VIEW ev_station_data_extract DISTKEY(5) sortkey(1) AS
SELECT approximatearrivaltimestamp,
partitionkey,
shardid,
sequencenumber,
json_extract_path_text(from_varbyte(data, 'utf-8'),'_id')::character(36) as ID,
json_extract_path_text(from_varbyte(data, 'utf-8'),'clusterID')::varchar(30) as clusterID,
json_extract_path_text(from_varbyte(data, 'utf-8'),'connectionTime')::varchar(20) as connectionTime,
json_extract_path_text(from_varbyte(data, 'utf-8'),'kWhDelivered')::DECIMAL(10,2) as kWhDelivered,
json_extract_path_text(from_varbyte(data, 'utf-8'),'stationID')::DECIMAL(10,2) as stationID,
json_extract_path_text(from_varbyte(data, 'utf-8'),'spaceID')::varchar(100) as spaceID,
json_extract_path_text(from_varbyte(data, 'utf-8'),'timezone')::varchar(30) as timezone,
json_extract_path_text(from_varbyte(data, 'utf-8'),'userID')::varchar(30) as userID
FROM evdata."ev_stream_data";
- Refresh this materialized view:
REFRESH MATERIALIZED VIEW ev_station_data_extract;
In the current preview, the materialized view is not refreshed automatically, so we need to schedule a query in Amazon Redshift to refresh it every minute. For instructions, see Scheduling SQL Queries in an Amazon Redshift Data Warehouse.
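As a mental model of what the materialized view's SELECT does for each record, here is a hedged local sketch. Decoding the payload bytes corresponds to from_varbyte(data, 'utf-8'), and pulling typed fields out of the JSON corresponds to json_extract_path_text plus the casts; the Python names are illustrative, not a Redshift API:

```python
# Illustrative per-record extraction matching the materialized view's SELECT.
import json
from decimal import Decimal

def extract_columns(data: bytes) -> dict:
    """Decode a Kinesis payload and pull out typed columns."""
    doc = json.loads(data.decode("utf-8"))          # from_varbyte(data, 'utf-8')
    return {
        "id": doc["_id"],                           # ::character(36)
        "cluster_id": doc["clusterID"],             # ::varchar(30)
        "connection_time": doc["connectionTime"],   # ::varchar(20)
        "kwh_delivered": Decimal(doc["kWhDelivered"]),  # ::DECIMAL(10,2)
        "station_id": Decimal(doc["stationID"]),    # ::DECIMAL(10,2)
        "space_id": doc["spaceID"],                 # ::varchar(100)
        "timezone": doc["timezone"],                # ::varchar(30)
        "user_id": doc["userID"],                   # ::varchar(30)
    }

row = extract_columns(
    b'{"_id": "abc", "clusterID": "7", '
    b'"connectionTime": "2022-04-01 10:00:00", '
    b'"kWhDelivered": "12.34", "stationID": "101", '
    b'"spaceID": "alpha-3", "timezone": "America/Los_Angeles", '
    b'"userID": "4242"}'
)
```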
Query the data stream
The refreshed materialized view can then be queried to view usage statistics:
SELECT to_timestamp(connectionTime, 'YYYY-MM-DD HH24:MI:SS') as connectiontime
,SUM(kWhDelivered) AS Energy_Consumed
,count(distinct userID) AS "#Users"
from ev_station_data_extract
group by to_timestamp(connectionTime, 'YYYY-MM-DD HH24:MI:SS')
order by 1 desc;
The results are shown in the table below.
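The aggregation in the query above can be mirrored locally to check expectations against sample rows. This sketch (illustrative names, plain Python types) groups rows by connection time, sums the energy delivered, and counts distinct users:

```python
# Local mirror of the usage-statistics query: group by connection time,
# SUM(kWhDelivered), COUNT(DISTINCT userID), ordered newest first.
from collections import defaultdict

def usage_stats(rows):
    """Map connection time -> (total kWh, distinct user count)."""
    energy = defaultdict(float)
    users = defaultdict(set)
    for r in rows:
        t = r["connectionTime"]
        energy[t] += r["kWhDelivered"]
        users[t].add(r["userID"])
    # Reverse-sorted keys mimic ORDER BY 1 DESC on the timestamp strings.
    return {t: (round(energy[t], 2), len(users[t]))
            for t in sorted(energy, reverse=True)}

stats = usage_stats([
    {"connectionTime": "2022-04-01 10:00:00", "kWhDelivered": 5.0, "userID": 1},
    {"connectionTime": "2022-04-01 10:00:00", "kWhDelivered": 7.5, "userID": 2},
    {"connectionTime": "2022-04-01 09:00:00", "kWhDelivered": 3.0, "userID": 1},
])
```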
Next, we can join the materialized view with the reference data to analyze the usage data of the charging station over the past 5 minutes, broken down by the type of charging station:
SELECT to_timestamp(connectionTime, 'YYYY-MM-DD HH24:MI:SS') as connectiontime
,SUM(kWhDelivered) AS Energy_Consumed
,count(distinct userID) AS "#Users"
,st.category
from ev_station_data_extract ext
join ev_station st on
ext.stationID = st.siteid
where approximatearrivaltimestamp > current_timestamp -interval '5 minutes'
group by to_timestamp(connectionTime, 'YYYY-MM-DD HH24:MI:SS'),st.category
order by 1 desc, 2 desc;
The results are shown in the table below.
Visualize the results
We can set up a simple visualization using Amazon QuickSight. For instructions, see Quick Start Guide: Create an Amazon QuickSight Analysis with a Single Visualization Using Sample Data.
We create a dataset in QuickSight that joins the materialized view with the charging station reference data, then build a dashboard showing power consumption and connected users over time. The dashboard also shows station locations on a map, broken down by category.
The benefits of streaming ingestion
In this section we'll look at some of the benefits that streaming ingestion can bring.
High throughput and low latency
Amazon Redshift can ingest and process data from Kinesis Data Streams at gigabytes per second (throughput depends on the number of shards in the data stream and the Amazon Redshift cluster configuration). This lets us consume streaming data with low latency and high bandwidth, gaining insights from data in seconds instead of minutes.
As mentioned above, the biggest advantage of Amazon Redshift's pull-based direct ingestion is lower latency, typically a matter of seconds. This is in stark contrast to creating a process to work with streaming data, staging the data in Amazon S3, and then running a COPY command to load it into Amazon Redshift. The latter approach tends to incur delays of several minutes because of the multiple data-processing steps involved.
Simple to set up
Streaming ingestion is easy to get started with. All setup and configuration in Amazon Redshift is done using SQL, which is already familiar to most users of cloud data warehouses. You can then get real-time insights in seconds without managing a complex pipeline. Amazon Redshift and Kinesis Data Streams are both fully managed, allowing users to run their streaming applications without managing infrastructure.
Increased productivity
Users can apply familiar SQL skills in Amazon Redshift to perform rich analytics on streaming data without learning new skills or languages. In addition, you can create additional materialized views, or views on top of materialized views, to do most of the ELT data pipeline transformation in SQL directly within Amazon Redshift.
Streaming ingestion use cases
Numerous use cases and vertical industry-specific applications will become possible through near real-time analysis of streaming data. Listed below are just some of the many use cases:
- Improve the gaming experience: By analyzing real-time data from players, you can optimize for player conversion and retention and improve the overall game experience.
- Analyze clickstream user data for online advertising: The average customer visits dozens of websites in a single session, yet marketers typically only analyze visit data from their own website. We can analyze authorized clickstream data ingested in our data warehouse to assess customer footprint and behavior, and serve customers more targeted ads in real time.
- Real-time retail analytics with streaming POS data: We can access and visualize all global point-of-sale (POS) retail transaction data for real-time analysis, reporting and visualization.
- Provide real-time application insights: By accessing and analyzing streaming data from application log files and network logs, developers and engineers can troubleshoot problems in real time, build better products, and take preventive measures based on alerts.
- Real-time analysis of IoT data: We can use Amazon Redshift streaming ingestion with Amazon Kinesis services to build real-time applications such as detection of device status and attributes (for example, location and sensor data), application monitoring, fraud detection, and live dashboards. We can ingest streaming data using Kinesis Data Streams, process it using Amazon Kinesis Data Analytics, and emit the results to any data store or application with very low end-to-end latency.
Summary
This article describes how to create an Amazon Redshift materialized view to ingest data from a Kinesis data stream using Amazon Redshift streaming ingestion. With this new capability, we can easily build and maintain data pipelines to ingest and analyze streaming data with low latency and high throughput.
Streaming ingestion is currently in preview and is available in all AWS Regions where Amazon Redshift is available. To get started with Amazon Redshift streaming ingestion, provision an Amazon Redshift cluster on the current track and verify that the cluster version is 1.0.35480 or later.
See Streaming Ingestion (Preview) for details, and check out the Real-time Analytics with Amazon Redshift Streaming Ingestion demo on YouTube. For more ideas, see the modern data architecture on AWS.
Author of this article
Sam Selvan
Senior Solutions Architect at Amazon Web Services.