Amazon Kinesis Data Analytics
Nowadays, enterprises of all kinds face a continuous stream of data that needs to be processed every day. This data may come from log files generated by mobile or web applications, online shopping activity, game player actions, social networking sites, or financial transactions. Being able to process and analyze this streaming data in a timely manner is very important: with good streaming data processing in place, enterprises can make business decisions quickly, improve the quality of their products and services, and increase user satisfaction.
There are many tools on the market today that help enterprises process and analyze streaming data. Among them, Apache Flink is a popular framework and engine for processing data streams, used for stateful computations over unbounded and bounded data streams. Flink runs in all common cluster environments and performs computations at in-memory speed and at any scale.
- Apache Flink
https://flink.apache.org
Picture from Apache Flink official website
Amazon Kinesis Data Analytics is a simple way to use Apache Flink to transform and analyze streaming data in real time, processing it through a serverless architecture. With Amazon Kinesis Data Analytics, you can build Java, Scala, and Python applications using open-source libraries based on Apache Flink.
Amazon Kinesis Data Analytics provides the underlying infrastructure for your Apache Flink applications. Its core functions include provisioning of computing resources, parallel computation, automatic scaling, and application backups (implemented as checkpoints and snapshots). You can use advanced Flink programming features (such as operators, functions, sources, and sinks) just as you would when hosting the Flink infrastructure yourself.
Python support in Amazon Kinesis Data Analytics
Amazon Kinesis Data Analytics for Apache Flink now supports building streaming data analysis applications in Python 3.7 via Apache Flink v1.11, which is very convenient for Python developers. Apache Flink v1.11 provides Python support through the PyFlink Table API, a unified relational API.
Picture from Apache Flink official website
In addition, Apache Flink also provides a DataStream API for fine-grained control of state and time, and it has supported the Python DataStream API since Apache Flink 1.12. For more information about the APIs in Apache Flink, please refer to the Flink official website introduction.
- Flink official website introduction
https://ci.apache.org/projects/flink/flink-docs-release-1.11/concepts/index.html
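To give a feel for the PyFlink Table API before the full example later in this article, here is a minimal sketch of a Flink 1.11 Table API program; the sample data and expressions below are made up purely for illustration and are not part of the demo application.
from pyflink.table import EnvironmentSettings, StreamTableEnvironment

# Create a streaming TableEnvironment using the Blink planner (Flink 1.11)
env_settings = EnvironmentSettings.new_instance().in_streaming_mode().use_blink_planner().build()
table_env = StreamTableEnvironment.create(environment_settings=env_settings)

# Build a tiny in-memory table and apply a relational filter and projection
orders = table_env.from_elements(
    [("AAPL", 120.5), ("AMZN", 3300.0)], ["ticker", "price"]
)
result = orders.filter("price > 1000").select("ticker, price")
result.execute().print()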
Amazon Kinesis Data Analytics Python application example
Next, we will demonstrate how to quickly build the Python version of an Amazon Kinesis Data Analytics for Flink application. The reference architecture of the example is shown in the following figure. We send test data to an Amazon Kinesis Data Stream, perform a basic aggregation with a tumbling window function in an Amazon Kinesis Data Analytics Python application, and persist the results to Amazon S3; afterwards, we use Amazon Glue and Amazon Athena to quickly query the data. The entire sample application adopts a serverless architecture, which not only enables rapid deployment and automatic elastic scaling, but also greatly reduces the burden of operations and maintenance.
The following example is performed on the Amazon Cloud Technology China (Beijing) region operated by Sinnet.
Create Amazon Kinesis Data Stream
The example creates the Amazon Kinesis Data Stream from the console. First select the Amazon Kinesis service, choose "Data Streams", and then click "Create data stream".
Enter the data stream name, such as "kda-input-stream", and set the number of open shards for the data stream capacity to 1. Note that this is for demonstration purposes only; please configure an appropriate capacity for your actual workload.
Click "Create data stream" and wait a moment for the data stream to be created.
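If you prefer to script this step instead of using the console, the same stream can be created with boto3; the profile name is a placeholder, as in the producer script later in this article.
import boto3

# Create the input stream with a single shard (demo sizing only)
session = boto3.Session(profile_name="<your profile>")
kinesis = session.client("kinesis", region_name="cn-north-1")
kinesis.create_stream(StreamName="kda-input-stream", ShardCount=1)

# Wait until the stream becomes ACTIVE before sending data
kinesis.get_waiter("stream_exists").wait(StreamName="kda-input-stream")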
Later, we will send sample data to this Amazon Kinesis data stream.
Create Amazon S3 bucket
The example creates an Amazon S3 bucket from the console. First select the Amazon S3 service, and then click "Create bucket".
Enter the bucket name, such as "kda-pyflink-"; this name will be used later in the Amazon Kinesis Data Analytics application.
Keep other configurations unchanged and click "Create Bucket".
After a while, you can see that the bucket has been successfully created.
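This step can also be scripted with boto3; the bucket suffix below is a placeholder, and note that China regions require an explicit LocationConstraint when creating a bucket.
import boto3

session = boto3.Session(profile_name="<your profile>")
s3 = session.client("s3", region_name="cn-north-1")

# Buckets outside us-east-1 need an explicit LocationConstraint
s3.create_bucket(
    Bucket="kda-pyflink-<your suffix>",
    CreateBucketConfiguration={"LocationConstraint": "cn-north-1"},
)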
Send sample data to Amazon Kinesis Data Stream
Next, we will use a Python program to send data to the Amazon Kinesis data stream. Create a file named kda-input-stream.py and copy the following content into it, making sure to change STREAM_NAME to the name of the Amazon Kinesis data stream you just created and profile_name to your own profile.
import datetime
import json
import random
import boto3

STREAM_NAME = "kda-input-stream"


def get_data():
    # Generate a random stock ticker record with the current timestamp
    return {
        'event_time': datetime.datetime.now().isoformat(),
        'ticker': random.choice(['AAPL', 'AMZN', 'MSFT', 'INTC', 'TBV']),
        'price': round(random.random() * 100, 2)}


def generate(stream_name, kinesis_client):
    # Continuously write records to the Kinesis data stream
    while True:
        data = get_data()
        print(data)
        kinesis_client.put_record(
            StreamName=stream_name,
            Data=json.dumps(data),
            PartitionKey="partitionkey")


if __name__ == '__main__':
    session = boto3.Session(profile_name='<your profile>')
    generate(STREAM_NAME, session.client('kinesis', region_name='cn-north-1'))
Execute the following code to start sending data to the Amazon Kinesis data stream.
$ python kda-input-stream.py
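If you want to confirm that records are actually arriving before building the Flink application, you can read a few of them back with boto3; the shard ID below assumes the single-shard demo stream created earlier.
import boto3

session = boto3.Session(profile_name="<your profile>")
kinesis = session.client("kinesis", region_name="cn-north-1")

# Read from the beginning of the only shard of the demo stream
iterator = kinesis.get_shard_iterator(
    StreamName="kda-input-stream",
    ShardId="shardId-000000000000",
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]
records = kinesis.get_records(ShardIterator=iterator, Limit=5)["Records"]
for record in records:
    print(record["Data"])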
Write the PyFlink code
Next, we write the PyFlink code. Create the kda-pyflink-demo.py file and copy the following content to this file.
# -*- coding: utf-8 -*-
"""
kda-pyflink-demo.py
~~~~~~~~~~~~~~~~~~~
1. Create the Table Environment
2. Create the source table backed by the Kinesis Data Stream
3. Create the sink table backed by the S3 bucket
4. Run the tumbling window query
5. Write the results to the sink
"""
from pyflink.table import EnvironmentSettings, StreamTableEnvironment
from pyflink.table.window import Tumble
import os
import json

# 1. Create the Table Environment
env_settings = (
    EnvironmentSettings.new_instance().in_streaming_mode().use_blink_planner().build()
)
table_env = StreamTableEnvironment.create(environment_settings=env_settings)
statement_set = table_env.create_statement_set()

APPLICATION_PROPERTIES_FILE_PATH = "/etc/flink/application_properties.json"


def get_application_properties():
    # Read the runtime properties that Kinesis Data Analytics provides to the application
    if os.path.isfile(APPLICATION_PROPERTIES_FILE_PATH):
        with open(APPLICATION_PROPERTIES_FILE_PATH, "r") as file:
            contents = file.read()
            properties = json.loads(contents)
            return properties
    else:
        print('A file at "{}" was not found'.format(APPLICATION_PROPERTIES_FILE_PATH))


def property_map(props, property_group_id):
    # Return the PropertyMap of the requested property group
    for prop in props:
        if prop["PropertyGroupId"] == property_group_id:
            return prop["PropertyMap"]


def create_source_table(table_name, stream_name, region, stream_initpos):
    return """ CREATE TABLE {0} (
                ticker VARCHAR(6),
                price DOUBLE,
                event_time TIMESTAMP(3),
                WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
              )
              PARTITIONED BY (ticker)
              WITH (
                'connector' = 'kinesis',
                'stream' = '{1}',
                'aws.region' = '{2}',
                'scan.stream.initpos' = '{3}',
                'format' = 'json',
                'json.timestamp-format.standard' = 'ISO-8601'
              ) """.format(
        table_name, stream_name, region, stream_initpos
    )


def create_sink_table(table_name, bucket_name):
    return """ CREATE TABLE {0} (
                ticker VARCHAR(6),
                price DOUBLE,
                event_time TIMESTAMP(3),
                WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
              )
              PARTITIONED BY (ticker)
              WITH (
                'connector' = 'filesystem',
                'path' = 's3a://{1}/',
                'format' = 'csv',
                'sink.partition-commit.policy.kind' = 'success-file',
                'sink.partition-commit.delay' = '1 min'
              ) """.format(
        table_name, bucket_name
    )


def count_by_word(input_table_name):
    # Use the Table API to aggregate prices per ticker in 1-minute tumbling windows
    input_table = table_env.from_path(input_table_name)
    tumbling_window_table = (
        input_table.window(
            Tumble.over("1.minute").on("event_time").alias("one_minute_window")
        )
        .group_by("ticker, one_minute_window")
        .select("ticker, price.avg as price, one_minute_window.end as event_time")
    )
    return tumbling_window_table


def main():
    # Kinesis Data Analytics application property keys
    input_property_group_key = "consumer.config.0"
    sink_property_group_key = "sink.config.0"
    input_stream_key = "input.stream.name"
    input_region_key = "aws.region"
    input_starting_position_key = "flink.stream.initpos"
    output_sink_key = "output.bucket.name"
    # Input and output table names
    input_table_name = "input_table"
    output_table_name = "output_table"
    # Read the Kinesis Data Analytics application properties
    props = get_application_properties()
    input_property_map = property_map(props, input_property_group_key)
    output_property_map = property_map(props, sink_property_group_key)
    input_stream = input_property_map[input_stream_key]
    input_region = input_property_map[input_region_key]
    stream_initpos = input_property_map[input_starting_position_key]
    output_bucket_name = output_property_map[output_sink_key]
    # 2. Create the source table backed by the Kinesis Data Stream
    table_env.execute_sql(
        create_source_table(
            input_table_name, input_stream, input_region, stream_initpos
        )
    )
    # 3. Create the sink table backed by the S3 bucket
    create_sink = create_sink_table(
        output_table_name, output_bucket_name
    )
    table_env.execute_sql(create_sink)
    # 4. Run the tumbling window query
    tumbling_window_table = count_by_word(input_table_name)
    # 5. Write the results to the sink
    tumbling_window_table.execute_insert(output_table_name).wait()
    statement_set.execute()


if __name__ == "__main__":
    main()
Because the application needs to use the Amazon Kinesis Flink SQL Connector, you also need to download the corresponding amazon-kinesis-sql-connector-flink-2.0.3.jar.
- amazon-kinesis-sql-connector-flink-2.0.3.jar
https://repo1.maven.org/maven2/software/amazon/kinesis/amazon-kinesis-sql-connector-flink/2.0.3/amazon-kinesis-sql-connector-flink-2.0.3.jar
Package kda-pyflink-demo.py and amazon-kinesis-sql-connector-flink-2.0.3.jar into a zip file, such as kda-pyflink-demo.zip; then upload this zip package to the Amazon S3 bucket you just created. Enter the bucket and click "Upload".
Select the zip file that was just packaged, and then click "Upload".
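The packaging and upload steps can also be scripted; the bucket suffix below is a placeholder for the bucket you created earlier, while the file names match the ones used in this walkthrough.
import zipfile
import boto3

# Bundle the PyFlink script and the Kinesis SQL connector into one zip file
with zipfile.ZipFile("kda-pyflink-demo.zip", "w") as bundle:
    bundle.write("kda-pyflink-demo.py")
    bundle.write("amazon-kinesis-sql-connector-flink-2.0.3.jar")

# Upload the bundle to the S3 bucket created earlier
session = boto3.Session(profile_name="<your profile>")
s3 = session.client("s3", region_name="cn-north-1")
s3.upload_file("kda-pyflink-demo.zip", "kda-pyflink-<your suffix>", "kda-pyflink-demo.zip")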
Create a Python Amazon Kinesis Data Analytics application
First select the Amazon Kinesis service, choose "Data Analytics", and then click "Create application".
Enter the application name, such as "kda-pyflink-demo"; select Apache Flink as the runtime, and keep the default version 1.11.
Keep the default access permissions, such as "create/update Amazon IAM role kinesis-analytics-kda-pyflink-demo-cn-north-1"; for the application settings template, select "Development". Note that this is for demonstration purposes; you can select "Production" based on your actual needs.
Click "Create Application" and wait a moment for the application to be created.
Following the prompts, continue to configure the application by clicking "Configure"; set the code location to the Amazon S3 location of the zip package you just uploaded.
Then expand the property configuration.
Create a property group named "consumer.config.0" and configure the following key-value pairs:
- input.stream.name: the Amazon Kinesis data stream just created, such as kda-input-stream
- aws.region: the current region, here cn-north-1
- flink.stream.initpos: the position to start reading the stream from, configured as LATEST
Create a property group named "sink.config.0" and configure the following key-value pair:
- output.bucket.name: the Amazon S3 bucket you just created, for example kda-pyflink-shtian
Create a property group named "kinesis.analytics.flink.run.options" and configure the following key-value pairs:
- python: the PyFlink program just created, kda-pyflink-demo.py
- jarfile: the file name of the Amazon Kinesis Connector, here amazon-kinesis-sql-connector-flink-2.0.3.jar
Together, these property groups make up the application's runtime properties; their structure is sketched below.
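At runtime, Kinesis Data Analytics passes these settings to the application as a JSON list of property groups, which is what get_application_properties() and property_map() in kda-pyflink-demo.py read from /etc/flink/application_properties.json. Below is a minimal sketch of that structure with the demo values used here; the kinesis.analytics.flink.run.options group is consumed by the service itself to locate the Python file and the connector jar, so it is not shown.
# Sketch of the property groups read by kda-pyflink-demo.py via get_application_properties()
application_properties = [
    {
        "PropertyGroupId": "consumer.config.0",
        "PropertyMap": {
            "input.stream.name": "kda-input-stream",
            "aws.region": "cn-north-1",
            "flink.stream.initpos": "LATEST",
        },
    },
    {
        "PropertyGroupId": "sink.config.0",
        "PropertyMap": {"output.bucket.name": "kda-pyflink-shtian"},
    },
]

# property_map() simply selects one group, for example:
# property_map(application_properties, "consumer.config.0")["input.stream.name"] -> "kda-input-stream"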
Then click "Update" to save the application configuration.
Next, configure the permissions of the Amazon IAM role used by the application. Go to the Amazon IAM console, select "Roles", and then find the newly created role.
Then, expand the attached policy and click "Edit policy".
Append the last two statements in the Amazon IAM policy below so that this role can access the Amazon Kinesis data stream and the Amazon S3 bucket. Note that you need to replace the account ID with your own Amazon Cloud Technology China account.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadCode",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion"
            ],
            "Resource": [
                "arn:aws-cn:s3:::kda-pyflink-shtian/kda-pyflink-demo.zip"
            ]
        },
        {
            "Sid": "ListCloudwatchLogGroups",
            "Effect": "Allow",
            "Action": [
                "logs:DescribeLogGroups"
            ],
            "Resource": [
                "arn:aws-cn:logs:cn-north-1:012345678901:log-group:*"
            ]
        },
        {
            "Sid": "ListCloudwatchLogStreams",
            "Effect": "Allow",
            "Action": [
                "logs:DescribeLogStreams"
            ],
            "Resource": [
                "arn:aws-cn:logs:cn-north-1:012345678901:log-group:/aws/kinesis-analytics/kda-pyflink-demo:log-stream:*"
            ]
        },
        {
            "Sid": "PutCloudwatchLogs",
            "Effect": "Allow",
            "Action": [
                "logs:PutLogEvents"
            ],
            "Resource": [
                "arn:aws-cn:logs:cn-north-1:012345678901:log-group:/aws/kinesis-analytics/kda-pyflink-demo:log-stream:kinesis-analytics-log-stream"
            ]
        },
        {
            "Sid": "ReadInputStream",
            "Effect": "Allow",
            "Action": "kinesis:*",
            "Resource": "arn:aws-cn:kinesis:cn-north-1:012345678901:stream/kda-input-stream"
        },
        {
            "Sid": "WriteObjects",
            "Effect": "Allow",
            "Action": [
                "s3:Abort*",
                "s3:DeleteObject*",
                "s3:GetObject*",
                "s3:GetBucket*",
                "s3:List*",
                "s3:ListBucket",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws-cn:s3:::kda-pyflink-shtian",
                "arn:aws-cn:s3:::kda-pyflink-shtian/*"
            ]
        }
    ]
}
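If you manage the role from code rather than the console, the same policy document can be attached as an inline policy with boto3; the policy file name below is a placeholder for wherever you saved the JSON above, and the role name is the auto-created role mentioned earlier.
import json
import boto3

session = boto3.Session(profile_name="<your profile>")
iam = session.client("iam")

# Load the policy document shown above from a local file (placeholder name)
with open("kda-pyflink-demo-policy.json") as f:
    policy_document = json.load(f)

# Attach it as an inline policy on the application's role
iam.put_role_policy(
    RoleName="kinesis-analytics-kda-pyflink-demo-cn-north-1",
    PolicyName="kda-pyflink-demo-policy",
    PolicyDocument=json.dumps(policy_document),
)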
Go back to the Amazon Kinesis Data Analytics application interface and click "Run".
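If you prefer to check the application from code rather than the console, the boto3 kinesisanalyticsv2 client can report its status; the application name below is the one created in this example.
import boto3

session = boto3.Session(profile_name="<your profile>")
kda = session.client("kinesisanalyticsv2", region_name="cn-north-1")

# The application should move from STARTING to RUNNING after "Run" is clicked
status = kda.describe_application(ApplicationName="kda-pyflink-demo")[
    "ApplicationDetail"
]["ApplicationStatus"]
print(status)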
Click "Open Apache Flink Control Panel" to jump to the Flink interface.
Click to view running tasks.
You can drill into further details as needed. Next, go to Amazon S3 to verify that the data has been written: open the bucket you created, and you can see that the data has been written successfully.
Use Amazon Glue to crawl the data
Go to the Amazon Glue service interface, select the crawler, click "Add Crawler", and enter the name of the crawler.
Keep the source type unchanged, and add a data store pointing to the output path in the Amazon S3 bucket created earlier.
Choose an existing role or create a new one.
Select the default database, and you can add table prefixes according to your needs.
After the creation is complete, click Execute.
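As a scripted alternative to the console steps above, a crawler with the same effect can be created and started with boto3; the crawler name, role, and bucket suffix below are placeholders to replace with your own values.
import boto3

session = boto3.Session(profile_name="<your profile>")
glue = session.client("glue", region_name="cn-north-1")

# Crawl the S3 prefix that the Flink application writes to
glue.create_crawler(
    Name="kda-pyflink-crawler",
    Role="<your Glue service role>",
    DatabaseName="default",
    Targets={"S3Targets": [{"Path": "s3://kda-pyflink-<your suffix>/"}]},
)
glue.start_crawler(Name="kda-pyflink-crawler")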
After the crawl is successful, you can view the detailed information in the data table.
Then, you can switch to the Amazon Athena service to query the results.
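Queries can also be submitted to Amazon Athena programmatically; the table name and result location below are placeholders, since the actual table name depends on the crawler's table prefix and your bucket name.
import boto3

session = boto3.Session(profile_name="<your profile>")
athena = session.client("athena", region_name="cn-north-1")

# Average price per ticker over everything written to S3 so far
response = athena.start_query_execution(
    QueryString="SELECT ticker, avg(price) AS avg_price FROM <your table> GROUP BY ticker",
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://<your Athena results bucket>/"},
)
print(response["QueryExecutionId"])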
Note: If the Amazon Glue crawler or the Amazon Athena query fails with a permission error, it may be because Lake Formation is enabled. You can refer to the documentation to grant the corresponding permissions to the role.
Summary
This article first introduced a quick way to use Apache Flink on the Amazon Cloud Technology platform, Amazon Kinesis Data Analytics for Flink, and then used a serverless example to demonstrate how to process and analyze streaming data with Python and PyFlink in Amazon Kinesis Data Analytics for Flink, and how to run ad hoc queries on the results with Amazon Glue and Amazon Athena. Python support in Amazon Kinesis Data Analytics for Flink is now available in the Amazon Cloud Technology China (Beijing) region operated by Sinnet and the Amazon Cloud Technology China (Ningxia) region operated by West Cloud Data. You are welcome to try it out.
Reference
1. https://aws.amazon.com/solutions/implementations/aws-streaming-data-solution-for-amazon-kinesis/
2. https://docs.aws.amazon.com/lake-formation/latest/dg/granting-catalog-permissions.html
3. https://docs.aws.amazon.com/kinesisanalytics/latest/java/examples-python-s3.html
4. https://ci.apache.org/projects/flink/flink-docs-release-1.11/
Author of this article
Shi Tian
Amazon Cloud Technology Solution Architect
He has rich experience in cloud computing, big data, and machine learning, and is currently focused on research and practice in data science, machine learning, serverless computing, and related fields. His translations include "Machine Learning as a Service", "DevOps Practice Based on Kubernetes", and "Prometheus Monitoring in Practice".