Foreword

In daily business, we usually use relational databases to store data for application use.

Because databases store data by table and row, data of different sensitivity levels often ends up mixed together. When doing analytics, however, we need to control access to different data at a finer granularity to avoid leaking sensitive information.

Taking stock exchanges as an example, different departments may have different data access needs.

  • The finance department needs to total the day's transaction amounts and handling fees
  • The compliance department needs to audit transactions in specific regions
  • The data department needs de-identified (masked) data for application research and development

These requirements are hard to satisfy with table-level permissions alone: some call for field-level access control, some for restricting specific content, and some for masking field values.

Faced with these requirements, we can turn to traditional ETL tools, or we can use the serverless Amazon Glue DataBrew. In this article, let's walk through the basic usage and common scenarios of Amazon Glue DataBrew.

Introduction to Amazon Glue DataBrew

Amazon Glue DataBrew ("DataBrew") is a serverless, visual data preprocessing and ETL tool. It provides a visual editing interface, allowing users to analyze data form data, perform processing operations on the data, and preview the effect. It provides more than 200 operations and functions to cover most basic data processing needs.

Core concepts

This article focuses on data processing, and the core concepts involved are as follows.

  • Dataset: the original data plus its corresponding metadata
  • Project: a web-based visual data editor
  • Recipe: a series of processing steps applied to the data, such as removing empty fields or converting data types

Next, let's take the first step: creating a dataset.

Create a dataset

This time we use the New York City Taxi Trip Open Data (NYC-TLC) as a demonstration.

1. Create your own S3 bucket

Because of regional restrictions, the NYC-TLC data cannot be used directly in DataBrew. To use it, we must first create an S3 bucket of our own and then copy some of the data over.

image.png
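If you are scripting the setup rather than using the console, the bucket can also be created programmatically. Below is a minimal sketch using boto3; the bucket name my-nyc-tlc is a placeholder, so replace it with your own globally unique name.

import boto3

# Create the bucket that will hold our copy of the NYC-TLC data.
# In us-east-1 no location constraint is needed; for any other region,
# pass CreateBucketConfiguration={"LocationConstraint": "<region>"}.
s3 = boto3.client("s3")
s3.create_bucket(Bucket="my-nyc-tlc")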

2. Copy the data to your own S3 bucket

After installing the AWS Command Line Interface (CLI), we can use the following command to copy the data file to our own bucket. Because S3 bucket names are globally unique, replace my-nyc-tlc below with the name of the bucket you actually created.

aws s3 cp s3://nyc-tlc/trip+data/fhvhv_tripdata_2020-12.csv s3://my-nyc-tlc

3. Connect to the dataset

Next, we can go to DataBrew's "Datasets" interface and select "Connect to a new dataset".

image.png

As you can see, DataBrew supports quite a few data sources. Besides the common direct connection to S3, you can also connect to databases through JDBC. If the data is already in the data lake and registered in the Glue Data Catalog via a crawler, we can select it directly as well.

For simplicity, in this demo we select the S3 bucket created earlier and the copied .csv file, and choose CSV as the data format.
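For those who prefer the API, the same dataset can be registered with boto3. This is a minimal sketch under the assumption that the bucket and CSV file from the previous step exist; the dataset name is a placeholder, and the parameter shapes follow the CreateDataset API as I understand it, so double-check them against the current boto3 documentation.

import boto3

databrew = boto3.client("databrew")

# Register the copied CSV file as a DataBrew dataset.
databrew.create_dataset(
    Name="nyc-tlc-fhvhv-2020-12",                 # placeholder dataset name
    Format="CSV",
    Input={
        "S3InputDefinition": {
            "Bucket": "my-nyc-tlc",               # bucket created earlier
            "Key": "fhvhv_tripdata_2020-12.csv",  # copied object key
        }
    },
)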

Data profile analysis

After creating the dataset, we can quickly preview the data in the dataset.

image.png

If you want to learn more about the shape of the data, such as value distributions, value ranges, extremes, averages, and valid values, you can click the "Run data profile" button in the upper right corner to create a data profile job.

When creating the job, we can choose between sampling analysis and full analysis, and configure parameters such as whether to analyze duplicate rows and whether to analyze correlations between fields.

Click "Create and Run", the data shape analysis task will start running, and later we can see the results similar to the following under the "Data Shape Overview" tab.

image.png

On the left side of the figure, you can see the overall data volume, valid values, duplicate rows, and so on. On the right is the field correlation matrix, which shows how fields are linked: for example, if field B tends to change whenever field A changes, the corresponding cell is darker.

Taking the first row as an example, we can see that the VendorID field (horizontal) is strongly correlated with the extra field (vertical): the larger the VendorID value, the smaller the extra value.

Assuming extra stands for the tip, we could say that the smaller the VendorID, the more tip passengers give. This may be because companies with smaller VendorIDs run higher-end taxis. We can also see that passenger count, pick-up and drop-off locations, and even payment method are all correlated with extra.

Through this correlation diagram, we can quickly decide which fields we are interested in and want to analyze in depth. Hover over any cell to see its underlying value, which helps refine the judgment.

Of course, we can also easily spot the diagonal of blue cells; naturally, this is because every field is perfectly correlated with itself.

image.png

This report is quite clear and easy to understand; here I will only briefly explain the following terms:

  • Valid means the field value conforms to DataBrew's inferred type or the type defined in the metadata
  • Distinct is the number of distinct values in the field (values that differ from one another)
  • Unique is the number of values that appear only once in the field
  • Median is the middle value after sorting by size
  • Mean is the average value
  • Mode is the value that occurs most often

Also, on the histogram, black marks extreme values that fall far outside the bulk of the data. Because there are few such values, drawing them to scale would make them invisible or unclear, so their bars are enlarged and shown in black to indicate that they are not drawn at the same scale as the blue bars.

Note: DataBrew currently profiles only string and number fields. If a field contains a date or time such as 2021-10-10, it will not be included in the analysis. To work around this, we can transform the data first, for example adding a prefix (such as DT-) to the values or changing the field's type.
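Incidentally, a profile job does not have to be created from the console. The sketch below creates and starts one through boto3; the job name, role ARN, and output location are placeholders, and the parameter names follow the CreateProfileJob API as I recall it, so verify them in the boto3 documentation.

import boto3

databrew = boto3.client("databrew")

# Create a profile job over the full dataset; to profile only a sample,
# use JobSample={"Mode": "CUSTOM_ROWS", "Size": <row count>} instead.
databrew.create_profile_job(
    Name="nyc-tlc-profile",                                       # placeholder job name
    DatasetName="nyc-tlc-fhvhv-2020-12",
    RoleArn="arn:aws:iam::123456789012:role/databrew-demo-role",  # placeholder role
    OutputLocation={"Bucket": "my-nyc-tlc", "Key": "profile-output/"},
    JobSample={"Mode": "FULL_DATASET"},
)

# Start a run of the profile job.
databrew.start_job_run(Name="nyc-tlc-profile")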

Data processing

Now that we have a good picture of the data, let's process it.

DataBrew provides a visual editing interface for data processing. This interface is called a "project" in DataBrew, so we'll first create one. Click "Create a project with this dataset" in the upper right corner of the dataset page to create a project.

image.png

Here, the required parameters are the project name, the dataset, and an IAM role. DataBrew can create the role automatically; it is used to access the data source, i.e., the S3 bucket.
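The equivalent API call is sketched below. Note that, unlike the console flow, the CreateProject API expects a recipe name up front, so the recipe referenced here is assumed to have been created beforehand; all names and the role ARN are placeholders.

import boto3

databrew = boto3.client("databrew")

# Create a project that binds the dataset, an IAM role and a recipe together.
# The Sample settings control how many rows the visual editor loads.
databrew.create_project(
    Name="nyc-tlc-project",                                       # placeholder project name
    DatasetName="nyc-tlc-fhvhv-2020-12",
    RecipeName="nyc-tlc-recipe",                                  # assumed pre-existing recipe
    RoleArn="arn:aws:iam::123456789012:role/databrew-demo-role",  # placeholder role
    Sample={"Type": "FIRST_N", "Size": 500},
)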

After the project is created, we will see a loading screen like the one below. It lasts a few minutes, as DataBrew needs to allocate the underlying compute instance, install the editor application, configure permissions, and so on.

image.png

After a few minutes, we can see the full editor. The row of icons across the top lists the operations we can perform, grouped as follows:

  • Filter: row filtering, i.e. removing rows containing invalid values, missing values, etc.
  • Column: column-level processing, such as deleting fields or changing field types
  • Format: value formatting, such as string case conversion or setting the decimal precision of numbers
  • Clean: value cleanup, such as removing leading and trailing quotation marks or spaces, adding prefixes and suffixes, etc.
  • Extract: value extraction, such as pulling the year, month, or day out of a date
  • Missing: missing-value handling, such as filling in a specified value or the average when a value is missing
  • Invalid, Duplicates, Outliers: handle invalid values, duplicate rows, and outliers; the logic is similar to missing values
  • Split and Merge: split a string field into multiple fields by a delimiter, or merge fields back together
  • Create: generate a new field based on the values of another field, usually used to generate label fields
  • Functions: apply a function to a field to generate another field
  • Condition: similar to Create, but generates new fields according to conditions
  • Nest-Unnest: combine multiple fields into a single array or object field, or split one back apart
  • Pivot: dimension pivoting. For example, given two fields, "license plate number" and "pick-up location", each license plate number becomes a column name and the corresponding pick-up locations become the values listed under that column
  • Group, Join and Union: similar to SQL's GROUP BY, JOIN and UNION
  • Text: text processing, such as tokenization
  • Scale: numeric scaling, such as normalization
  • Mapping: also creates a new field based on field values; similar to Condition above, it can be regarded as a shortcut
  • Encode: common encodings for data, such as the "one-hot encoding" common in machine learning
  • Sensitive: handling of sensitive and private data, such as masking, encryption and decryption, etc.

Next, let's look at a few common data processing scenarios.

Note: Unlike many AWS services, DataBrew projects (the interactive editor sessions) are not billed per second; each session is billed in 30-minute increments.

1. Column-level filtering – restrict which fields can be accessed

Suppose an analyst in the finance or tax department wants to analyze the following information.

  • Transaction value of cars run by different taxi service providers
  • Taxi transaction value by date
  • The main payment methods
  • Total monthly taxi transaction value

In this case, we can expose only the relevant fields and filter out the rest. To filter fields, we can use DataBrew's Column > Delete operation, which deletes the specified fields and leaves only those the user is allowed to see.
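Expressed as a recipe step, column-level filtering looks roughly like the sketch below. The DELETE operation name and the sourceColumns parameter are assumptions based on recipes exported from the console, and the column names are only examples, so export a recipe from your own project to confirm the exact spelling.

import json

import boto3

databrew = boto3.client("databrew")

# Columns the analyst should not see; everything else is kept.
columns_to_drop = ["hvfhs_license_num", "dispatching_base_num", "SR_Flag"]

databrew.create_recipe(
    Name="finance-column-filter",          # placeholder recipe name
    Steps=[
        {
            "Action": {
                "Operation": "DELETE",     # assumed operation name for column deletion
                # Parameter values are strings, so the column list is JSON-encoded.
                "Parameters": {"sourceColumns": json.dumps(columns_to_drop)},
            }
        }
    ],
)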

2. Row-level filtering – only keep data with specific semantics

Almost all data tools support table-level filtering, and some also support field-level filtering, but fewer support row-level filtering. Row-level filtering means making judgments about the semantics of the data and allowing users to view only certain kinds of records.

For example, consider the following requirements.

  • Only allow analysis of taxi records between the airport area and the CBD business district
  • Only allow analysis of taxi records at night
  • Only records with an amount less than $100 are allowed to be analyzed

To achieve row-level filtering, we can use the Filter > By condition operation, which filters rows based on simple comparisons of numbers and strings.

If a field cannot be compared directly, we can first extract the relevant value into a new field and then apply the condition to that field. For example, it is hard to judge "night" directly from a timestamp, but we can use the Functions > Date functions > HOUR function to extract the hour into a new field and then filter on it.
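At the recipe level, a row filter is expressed through condition expressions attached to a step. The sketch below is meant to drop rows whose amount exceeds 100, keeping only the cheaper trips; the REMOVE_VALUES operation and the GREATER_THAN condition name are assumptions based on exported recipes, and total_amount is a hypothetical column, so check a recipe exported from your own project for the exact vocabulary.

import boto3

databrew = boto3.client("databrew")

databrew.create_recipe(
    Name="row-filter-under-100",                       # placeholder recipe name
    Steps=[
        {
            "Action": {
                "Operation": "REMOVE_VALUES",          # assumed: drops rows matching the condition
                "Parameters": {"sourceColumn": "total_amount"},  # hypothetical column
            },
            # ConditionExpressions attach a row-level condition to the step.
            "ConditionExpressions": [
                {
                    "Condition": "GREATER_THAN",       # assumed condition name
                    "Value": "100",
                    "TargetColumn": "total_amount",
                }
            ],
        }
    ],
)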

3. Other processing

In addition to row-level and column-level filtering, DataBrew can also do a lot of other data processing. Here are a few scenarios.

Sensitive data handling

To protect user privacy, we can encrypt sensitive data.

Currently DataBrew supports two encryption methods: "deterministic encryption" and "probabilistic encryption".

The former is suitable for fields that do not need to participate in calculations but whose values must still be matched for statistics, and that eventually need to be decrypted, for example a taxi's license plate or passengers' pick-up and drop-off locations.

The latter is suitable for data that does not need to participate in calculations or statistics at all, but that people with specific permissions may still need to decrypt at some point, for example personal information such as names and ID numbers.

If you only need statistics and never need to restore the original values, you can simply hash the field, or mask part of its value (for example, replacing it with "#").

Function calls

DataBrew has many built-in functions, similar to familiar Excel functions, that help us process data quickly. Commonly used categories include:

  • Mathematical functions: rounding, absolute values, etc.
  • Text functions: case conversion, taking part of a string, replacing text, etc.
  • Date functions: extracting the year, month, or day, computing date differences, etc.
  • Window functions: totals, maximums and minimums over a given window, etc.
  • Web functions: conversion between IP addresses and integers, extraction of URL query strings, etc.

Conditional processing

DataBrew supports using IF or CASE logic to write a specified value into a new column based on field values.

This approach is suitable for manual, fine-grained processing or labeling of a field when no ready-made function is available, for example converting a user's birthday into a Chinese zodiac sign or star sign.

In addition, conditional processing supports date formats and can take the values of other fields as input, producing an effect similar to a FOR loop in a programming language, which helps users implement more complex data processing logic.

Create a data processing job

In the project, we can process the data and preview the results, but this processing applies only to the sampled data; the actual data has not been touched yet. Once the previewed results look right, we can create a data processing job to process the full dataset.

In the upper right corner of the project, click "Create job". Next, we need to enter:

  • Target S3 bucket
  • Output data format, such as CSV, Parquet, etc.
  • Execution schedule, such as one-time or recurring
  • Execution role, which must be able to access both the source and target buckets

After entering this information, we can select "Create and run" to run the job directly.
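The same job can be created and started from the API. The sketch below wires the dataset, the recipe, an output location, and a role together; all names and the role ARN are placeholders, and the parameter shapes follow the CreateRecipeJob API as I recall it, so verify them before use.

import boto3

databrew = boto3.client("databrew")

# Create a job that applies the recipe to the full dataset and writes CSV output.
databrew.create_recipe_job(
    Name="nyc-tlc-clean-job",                                     # placeholder job name
    DatasetName="nyc-tlc-fhvhv-2020-12",
    RecipeReference={"Name": "nyc-tlc-recipe", "RecipeVersion": "1.0"},  # assumed published version
    RoleArn="arn:aws:iam::123456789012:role/databrew-demo-role",  # placeholder role
    Outputs=[
        {
            "Location": {"Bucket": "my-nyc-tlc-output", "Key": "cleaned/"},
            "Format": "CSV",
        }
    ],
)

# Run it once; for recurring execution, attach a schedule instead.
databrew.start_job_run(Name="nyc-tlc-clean-job")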

Data lineage

DataBrew provides a simple data lineage diagram for each project. In it, we can see which S3 bucket the project's original data lives in, which dataset that bucket corresponds to, which project uses that dataset, what processing the project applies to the data, and where the output goes.

image.png

Note that the current lineage diagram only shows the flow of data within the current DataBrew project.

Summary

As big data systems are built out and the technology matures, data in the data lake is becoming more and more real-time. To stay compliant, data holders must take access control very seriously.

Data lake tools usually provide permission control as well, but typically only at the table level. Going down to the row or column level requires installing additional tools, which also carries the risk of being bypassed. Having departments with different access rights use different S3 buckets, each containing only the data their specific business needs, can greatly reduce the risk of data leakage.

For projects in the exploration phase, permission management may be relatively loose, but once the business settles into stable long-term operation, we should consider formalizing these processes so that users and departments can access only the data they need and are allowed to access.

As a serverless data processing service, DataBrew is well suited to data scientists, business experts, and other users who know the business well but are less versed in the underlying technology, as well as teams still exploring their data compliance approach. With Amazon Glue DataBrew, we can quickly build an elastic data processing pipeline and focus more on the business itself.

That's it for a brief introduction to Amazon Glue DataBrew, which I hope will be helpful to the reader.

Author of this article

image.png

Zhang Dai

Solutions Architect, Amazon Web Services

He has more than ten years of experience in enterprise software development, design, and consulting, and focuses on combining enterprise business with AWS services. He has translated "The Way of Software", "Lean Entrepreneurship Practice", "Lean Design", and "Internet Thinking Enterprise", and is the author of "Experience Design White Book", among other books.


亚马逊云开发者
2.9k 声望9.6k 粉丝

亚马逊云开发者社区是面向开发者交流与互动的平台。在这里,你可以分享和获取有关云计算、人工智能、IoT、区块链等相关技术和前沿知识,也可以与同行或爱好者们交流探讨,共同成长。