The General Data Protection Regulation (GDPR) is one of the most significant regulations in today's technology landscape, and a data processing requirement that many customers building solutions on the Amazon Cloud Technology public cloud must comply with. The GDPR introduces a "right to erasure", or "right to be forgotten", clause that requires organizations to implement solutions capable of deleting a specific user's personal data on request.
In the Amazon Cloud Technology big data and analytics ecosystem, virtually every architecture, whatever its target workload, uses Amazon Simple Storage Service (Amazon S3) as the core storage service. Although Amazon S3 offers a rich set of features and integrations, it lacks an out-of-the-box mechanism for mapping a user identifier to the Amazon S3 objects that contain that user's data.
In this article, we introduce a framework that helps you purge a specific user's data from your organization's data lake on Amazon Cloud Technology. In addition, we walk through several different Amazon Cloud Technology storage tiers and provide sample code for Amazon S3.
- Amazon Simple Storage Service
https://aws.amazon.com/cn/s3/
Reference Architecture
To address the challenges of implementing a data-purging framework, here we reduce the problem to a simple use case: how to implement user data deletion in a data pipeline built on Amazon Cloud Technology. The following diagram illustrates the use case.
We introduce the idea of building and maintaining an index metastore that keeps track of where each user's records reside, helping us find those locations efficiently and narrowing the search space.
You can use the following architecture to delete specific users' data within your organization's Amazon Cloud Technology data lake.
For this initial release, we created three user flows responsible for mapping tasks to the appropriate Amazon Cloud Technology service:
User Flow 1: Live Metadata Store Update
An Amazon S3 ObjectCreated or ObjectRemoved event triggers an Amazon Lambda function that parses the object and performs an add/update/delete operation to keep the metadata index up to date. A similar simple workflow can be established for any other storage tier, including Amazon Relational Database Service (RDS), Amazon Aurora, or Amazon Elasticsearch Service (ES). In this example, we use Amazon DynamoDB and Amazon RDS for PostgreSQL as the index metadata storage options, but the approach applies equally to other technology choices. A minimal sketch of such an event-driven index update follows the links below.
- Amazon Relational Database Service
https://aws.amazon.com/cn/rds/
- Amazon Aurora
https://aws.amazon.com/cn/rds/aurora/?aurora-whats-new.sort-by=item.additionalFields.postDateTime&aurora-whats-new.sort-order=desc
- Amazon Elasticsearch Service
https://aws.amazon.com/cn/elasticsearch-service/
- Amazon DynamoDB
https://aws.amazon.com/cn/dynamodb/
- Amazon RDS for PostgreSQL
https://aws.amazon.com/cn/rds/postgresql/
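The sketch below shows the event-handling skeleton for this flow: a Lambda function subscribed to the bucket's ObjectCreated and ObjectRemoved notifications that routes each event to an index add or remove operation. It is an illustrative outline only; the add_to_index and remove_from_index helpers are placeholders for the concrete DynamoDB and PostgreSQL implementations shown later in this article, and the way the user identifier is derived from the object key is an assumption.

import urllib.parse

def add_to_index(userid, s3_path):
    # Placeholder: insert (userid, s3_path) into your index store
    # (Amazon DynamoDB or Amazon RDS for PostgreSQL, see the sketches later in this article).
    print(f'ADD {userid} -> {s3_path}')

def remove_from_index(userid, s3_path):
    # Placeholder: remove the mapping when the object is deleted.
    print(f'REMOVE {userid} -> {s3_path}')

def lambda_handler(event, context):
    """Keep the metadata index in sync with S3 object-level notifications."""
    for record in event.get('Records', []):
        bucket = record['s3']['bucket']['name']
        key = urllib.parse.unquote_plus(record['s3']['object']['key'])
        s3_path = f's3://{bucket}/{key}'
        # Illustrative assumption: the file name starts with the user identifier, e.g. 1001-clicks.csv
        userid = key.split('/')[-1].split('-')[0]
        if record['eventName'].startswith('ObjectCreated'):
            add_to_index(userid, s3_path)
        elif record['eventName'].startswith('ObjectRemoved'):
            remove_from_index(userid, s3_path)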
User Flow 2: Clear Data
When a user asks for their data to be deleted, we trigger an Amazon Step Functions state machine through Amazon CloudWatch to coordinate the workflow. The first step triggers a Lambda function that queries the metadata to identify the storage tiers containing the user's records and saves the resulting report in an Amazon S3 reporting bucket. Next, an Amazon Step Functions activity is created and picked up by a Lambda-based Node.js worker, which sends an email with approval and rejection links to the reviewers through Amazon Simple Email Service (SES). A simplified Python sketch of this worker follows the link below.
- Amazon Simple Email Service
https://aws.amazon.com/cn/ses/
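The repository implements the reviewer notification as a Node.js worker; the following is a simplified Python sketch of the same idea, shown only to illustrate the mechanics. The activity ARN, sender and reviewer addresses, and API Gateway URL are placeholders you would replace with your own values.

import json
import boto3
from urllib.parse import quote

sfn = boto3.client('stepfunctions')
ses = boto3.client('ses')

ACTIVITY_ARN = 'arn:aws:states:us-east-1:123456789012:activity:purge-approval'  # placeholder ARN
APPROVAL_API = 'https://example.execute-api.us-east-1.amazonaws.com/prod/approve'  # placeholder URL

def lambda_handler(event, context):
    """Poll the Step Functions activity and email approval/rejection links to the reviewer."""
    task = sfn.get_activity_task(activityArn=ACTIVITY_ARN, workerName='purge-approval-worker')
    token = task.get('taskToken')
    if not token:
        return {'status': 'no pending approval'}

    purge_request = json.loads(task['input'])
    encoded = quote(token, safe='')
    body = (
        'A data purge report is ready for review: '
        f"{purge_request.get('reportUri', 'see the reporting bucket')}\n\n"
        f'Approve: {APPROVAL_API}?action=approve&taskToken={encoded}\n'
        f'Reject: {APPROVAL_API}?action=reject&taskToken={encoded}\n'
    )
    ses.send_email(
        Source='noreply@example.com',  # placeholder; the address must be verified in SES
        Destination={'ToAddresses': ['reviewer@example.com']},  # placeholder reviewer address
        Message={
            'Subject': {'Data': 'Data purge request awaiting approval'},
            'Body': {'Text': {'Data': body}},
        },
    )
    return {'status': 'approval email sent'}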
The figure below shows the basic architecture of the Amazon Step Functions state machine displayed on the Amazon Cloud Technology Management Console.
The reviewer chooses one of the two links, which calls an Amazon API Gateway endpoint that in turn calls Amazon Step Functions to resume the workflow. If the approval link is chosen, Amazon Step Functions triggers a Lambda function that takes the report in the bucket as input, deletes the objects or records in the storage tiers, and then updates the index metastore. After the cleanup job completes, Amazon Simple Notification Service (SNS) sends the user an email reporting whether the operation succeeded or failed. A sketch of the approval callback follows the link below.
- Amazon Simple Notification Service
https://aws.amazon.com/cn/sns/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc
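Below is a hedged sketch of what the Lambda function behind the Amazon API Gateway endpoint could look like; the real definition lives in the repository's state machine and functions. It resumes the state machine with SendTaskSuccess or SendTaskFailure depending on the link the reviewer clicked; the final success/failure email through SNS is sent by a later state in the workflow, not by this function.

import json
import boto3

sfn = boto3.client('stepfunctions')

def lambda_handler(event, context):
    """API Gateway proxy handler: resume the purge workflow with the reviewer's decision."""
    params = event.get('queryStringParameters') or {}
    token = params.get('taskToken')
    action = params.get('action', 'reject')

    if action == 'approve':
        # Resuming with success lets the state machine continue to the purge and index-update steps.
        sfn.send_task_success(taskToken=token, output=json.dumps({'approved': True}))
        message = 'Purge request approved; deletion will proceed.'
    else:
        sfn.send_task_failure(taskToken=token, error='Rejected',
                              cause='The reviewer rejected the purge request')
        message = 'Purge request rejected.'

    return {'statusCode': 200, 'body': json.dumps({'result': message})}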
The figure below shows the actual Step Functions execution flow on the console after the cleanup process has completed successfully.
For the full codebase, see the step-function-definition.json file in the GitHub repo.
- step-function-definition.json
https://github.com/aws-samples/data-purging-aws-data-lake/blob/master/Scripts/index-by-file-name/step-function-definition.json
User Flow 3: Bulk Metadata Store Update
This user flow is primarily geared toward existing data lake use cases that need an index metastore created after the fact. You can orchestrate the process through Amazon Step Functions, taking historical data as input and updating the metastore through batch jobs. The implementation in this article does not include a sample script for this user flow, but a rough sketch of the idea follows.
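The following is a rough, illustrative sketch only, not part of the original repository. It lists existing objects under a prefix and back-fills a DynamoDB index table in bulk, assuming the filename-prefix convention described later in this article; the bucket, prefix, and table names are placeholders.

from collections import defaultdict
import boto3

s3 = boto3.client('s3')
dynamodb = boto3.resource('dynamodb')
index_table = dynamodb.Table('user_index')  # placeholder index table, partition key: userid

def backfill_index(bucket='gdpr-demo', prefix=''):
    """Walk historical objects once and write the aggregated index in bulk."""
    locations = defaultdict(set)
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get('Contents', []):
            key = obj['Key']
            # Illustrative assumption: file names look like <userid>-<suffix>, e.g. 1001-clicks.csv
            userid = key.split('/')[-1].split('-')[0]
            locations[userid].add(f's3://{bucket}/{key}')

    with index_table.batch_writer() as batch:
        for userid, paths in locations.items():
            batch.put_item(Item={'userid': userid, 's3': paths})

if __name__ == '__main__':
    backfill_index()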
Our Framework
Now, we'll detail the two use cases used in the implementation:
- You store multiple user records per Amazon S3 file
- Each user record is stored in the same Amazon Web Services storage tier
Building on these two use cases, we will demonstrate several alternative ways to implement indexed metadata storage.
Indexing by Amazon S3 URI and Line Number
In this use case, we implement the index store with an Amazon RDS for PostgreSQL instance (the Free Tier is sufficient for this example). First, create a simple table with the following code:
CREATE UNLOGGED TABLE IF NOT EXISTS user_objects (
userid TEXT,
s3path TEXT,
recordline INTEGER
);
You can optimize query performance by creating an index on userid. When uploading an object, you insert a row into the user_objects table identifying the user ID, the URI of the target Amazon S3 object, and the line number of the corresponding record. For example, consider uploading the following JSON input:
{"user_id":"V34qejxNsCbcgD8C0HVk-Q","body":"…"}
{"user_id":"ofKDkJKXSKZXu5xJNGiiBQ","body":"…"}
{"user_id":"UgMW8bLE0QMJDCkQ1Ax5Mg","body ":"…"}
For the Amazon S3 location s3://gdpr-demo/year=2018/month=2/day=26/input.json, we insert the following tuples into the user_objects table:
("V34qejxNsCbcgD8C0HVk-Q", "s3://gdpr-demo/year=2018/month=2/day=26/input.json", 0)
("ofKDkJKXSKZXu5xJNGiiBQ", "s3://gdpr-demo/year=2018/month=2/day=26/input.json", 1)
("UgMW8bLE0QMJDCkQ1Ax5Mg", "s3://gdpr-demo/year=2018/month=2/day=26/input.json", 2)
You can trigger a Lambda function on any Amazon S3 ObjectCreated event to implement an index update operation.
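The sketch below illustrates what such an index-update function might look like for this schema. It assumes the psycopg2 driver is packaged with the function (or provided through a Lambda layer) and that the connection details come from environment variables; it is an illustrative outline, not the repository's implementation.

import json
import os
import urllib.parse

import boto3
import psycopg2  # must be bundled with the deployment package or a Lambda layer

s3 = boto3.client('s3')

def lambda_handler(event, context):
    """On ObjectCreated, record (userid, s3path, recordline) for every line of the new object."""
    conn = psycopg2.connect(
        host=os.environ['DB_HOST'],
        dbname=os.environ['DB_NAME'],
        user=os.environ['DB_USER'],
        password=os.environ['DB_PASSWORD'],
    )
    with conn, conn.cursor() as cur:
        for record in event.get('Records', []):
            bucket = record['s3']['bucket']['name']
            key = urllib.parse.unquote_plus(record['s3']['object']['key'])
            s3_path = f's3://{bucket}/{key}'
            body = s3.get_object(Bucket=bucket, Key=key)['Body']
            for line_number, line in enumerate(body.iter_lines()):
                user_id = json.loads(line)['user_id']
                cur.execute(
                    'INSERT INTO user_objects (userid, s3path, recordline) VALUES (%s, %s, %s)',
                    (user_id, s3_path, line_number),
                )
    conn.close()

As noted above, an index on userid (for example, CREATE INDEX ON user_objects (userid)) keeps the purge-time lookups fast.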
When we receive a delete request from a user, we need to query the index for information about where the data is stored. Please see the following code for details:
SELECT s3path,
ARRAY_AGG(recordline)
FROM user_objects
WHERE userid = 'V34qejxNsCbcgD8C0HVk-Q'
GROUP BY s3path;
The above example SQL query will return a row like this:
("s3://gdpr-review/year=2015/month=12/day=21/review-part-0.json", {2102,529})
The output shows that lines 529 and 2102 of the Amazon S3 object s3://gdpr-review/year=2015/month=12/day=21/review-part-0.json contain the requested user's data and must be purged. Next, we need to download the object, remove those lines, and overwrite the object. For a Python implementation of a Lambda function that does this, see deleteUserRecords.py in the GitHub repo; a simplified sketch follows the link below.
- deleteUserRecords.py
https://github.com/aws-samples/data-purging-aws-data-lake/blob/master/Scripts/index-by-row-number/deleteUserRecords.py
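The repository's deleteUserRecords.py is the reference implementation; the simplified sketch below only illustrates the download-rewrite-overwrite pattern, replacing purged lines with empty JSON objects as discussed next.

import boto3

s3 = boto3.client('s3')

def purge_lines(s3_uri, lines_to_delete):
    """Replace the given line numbers of an S3 object with empty JSON objects and overwrite it."""
    bucket, key = s3_uri.replace('s3://', '').split('/', 1)
    body = s3.get_object(Bucket=bucket, Key=key)['Body'].read().decode('utf-8')
    lines = body.split('\n')
    for line_number in lines_to_delete:
        lines[line_number] = '{}'  # keep line numbering stable for the rest of the index
    s3.put_object(Bucket=bucket, Key=key, Body='\n'.join(lines).encode('utf-8'))

# Example, using the line numbers returned by the query above:
purge_lines('s3://gdpr-review/year=2015/month=12/day=21/review-part-0.json', [2102, 529])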
Knowing the record line numbers lets us perform the deletion efficiently as a byte-level operation. To keep the implementation simple, we replace deleted rows with empty JSON objects rather than removing them, which purges the rows quickly at the cost of a small storage overhead and avoids the expensive work of updating the line numbers of all subsequent records in the index. To get rid of the empty JSON objects later, we can run an offline vacuum combined with index updates.
Indexing by Filename, Grouping by Index Key
In this use case, we created an Amazon DynamoDB table to store the index information. We chose Amazon DynamoDB for its ease of use and scalability: with the pay-as-you-go billing model, you don't have to guess how many capacity units you might need. When a file is uploaded to the data lake, a Lambda function parses the filename (for example, 1001-.csv) to tokenize the user identifier and populates the Amazon DynamoDB metadata table accordingly. The userid is the partition key, and each storage tier has its own attribute. For example, if user 1001 has data in Amazon S3 and Amazon RDS, the record looks like this:
{"userid:": 1001, "s3":{"s3://path1", "s3://path2"}, "RDS":{"db1.table1.column1"}}
See update-dynamo-metadata.py in the GitHub repo for an example Python implementation of this functionality; a minimal sketch of the same idea follows the link below.
- update-dynamo-metadata.py
https://github.com/aws-samples/data-purging-aws-data-lake/blob/master/Scripts/index-by-file-name/update-dynamo-metadata.py
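The following minimal sketch (not the repository's script) shows the filename tokenization and index update; the table name and the exact filename convention are assumptions.

import urllib.parse
import boto3

dynamodb = boto3.resource('dynamodb')
index_table = dynamodb.Table('user_index')  # placeholder metadata table, partition key: userid

def lambda_handler(event, context):
    """Tokenize the uploaded file name and record its location under the user's 's3' attribute."""
    for record in event.get('Records', []):
        bucket = record['s3']['bucket']['name']
        key = urllib.parse.unquote_plus(record['s3']['object']['key'])
        # For a file such as 1001-data.csv, the user identifier is the leading token.
        userid = int(key.split('/')[-1].split('-')[0])
        index_table.update_item(
            Key={'userid': userid},
            UpdateExpression='ADD s3 :paths',
            ExpressionAttributeValues={':paths': {f's3://{bucket}/{key}'}},
        )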
Upon a delete request, we query the metadata store table (Amazon DynamoDB in this case) and generate a purge report that details which user records live in which storage tiers, along with other hints that speed up record lookups. We store purge reports in Amazon S3. For an example Lambda function implementing this logic, see generate-purge-report.py in the GitHub repo; a condensed sketch follows the link below.
- generate-purge-report.py
https://github.com/aws-samples/data-purging-aws-data-lake/blob/master/Scripts/index-by-file-name/generate-purge-report.py
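The repository file is the reference implementation; the condensed sketch below only illustrates the idea, and the table name, reporting bucket, and report layout are illustrative.

import json
import boto3

dynamodb = boto3.resource('dynamodb')
s3 = boto3.client('s3')

index_table = dynamodb.Table('user_index')  # placeholder metadata table
REPORT_BUCKET = 'gdpr-purge-reports'        # placeholder reporting bucket

def lambda_handler(event, context):
    """Look up every storage tier that holds the user's data and persist a purge report to S3."""
    userid = event['userid']
    item = index_table.get_item(Key={'userid': userid}).get('Item', {})

    report = {
        'userid': userid,
        's3': sorted(item.get('s3', [])),    # Amazon S3 object URIs to rewrite
        'RDS': sorted(item.get('RDS', [])),  # relational locations such as db1.table1.column1
    }
    report_key = f'purge-reports/{userid}.json'
    s3.put_object(Bucket=REPORT_BUCKET, Key=report_key, Body=json.dumps(report).encode('utf-8'))
    return {'reportUri': f's3://{REPORT_BUCKET}/{report_key}'}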
After the purge is approved, we use the purge report as input and delete all of the corresponding records. For an example Lambda function implementation, see gdpr-purge-data.py in the GitHub repo.
- gdpr-purge-data.py
https://github.com/aws-samples/data-purging-aws-data-lake/blob/master/Scripts/index-by-file-name/purge-data.py
Implementation and Technical Alternatives
We explored and evaluated various implementation options, each with its own strengths and trade-offs in areas such as simplicity of implementation, efficiency of execution, criticality and compliance of the data, and functional completeness:
Scan every record in the data file to create an index — every time a file is uploaded, we iterate over its records and generate tuples (userid, s3Uri, row_number) that are inserted into our metadata storage layer. On a delete request, we fetch the metadata records for the requested user ID, download the corresponding Amazon S3 objects, perform the deletion in place, and re-upload the updated objects to overwrite the existing ones. This is the most flexible implementation because it supports storing multiple users' data in a single object, and it is the most common layout in practice today. Flexibility comes at a cost, however: deletions can create a network bottleneck because objects have to be downloaded and re-uploaded. User activity datasets (such as customer product reviews) are particularly well suited to this approach, because it is rare for the same user to publish many records in a given partition (such as a date partition), and it is preferable to merge the activity of many users into a single file. Refer to the instructions in the Indexing by Amazon S3 URI and Line Number section; sample code is available in the GitHub repo.
- GitHub repo
https://github.com/aws-samples/data-purging-aws-data-lake/tree/master/Scripts/index-by-row-number
Store metadata as a filename prefix — setting the user ID as the name prefix of uploaded objects, under the different partitions defined by the query pattern, reduces the search operations needed for a delete request. The metadata-handling utilities can read the user ID directly from the filename and perform index maintenance accordingly. This method offers highly efficient record removal, but each object can correspond to only one user, and it requires storing the user ID in the filename, which may conflict with information security requirements. The approach is particularly useful for clickstream data, where a single user generates multiple click events on a single date partition during a session. You can download the relevant code from the GitHub repo, as explained earlier in the Indexing by Filename, Grouping by Index Key section.
- GitHub repo
https://github.com/aws-samples/data-purging-aws-data-lake/tree/master/Scripts/index-by-file-name
Using metadata files — in addition to uploading a new object, we can also upload a metadata file that indexing tools use to create and maintain an up-to-date index. On a delete request, we query the index, which points us to the locations of the records that need to be purged. This method is most suitable when a corresponding metadata file is uploaded in step with each new object (such as for multimedia data); in other scenarios, uploading a metadata file alongside every object may put heavy pressure on capacity.
Using the Amazon S3 object tagging feature — whenever a new file is uploaded to Amazon S3, we use the PutObjectTagging Amazon S3 operation to add a key-value pair holding the user ID, and whenever a user requests deletion we use that tag to find and delete the objects. This solution is easy to implement with the existing Amazon S3 API, so the whole process is quite straightforward. However, it also faces many limitations: it assumes a 1:1 relationship between Amazon S3 objects and users (each object contains data for only a single user), searching objects by tag is not efficient, and storing user IDs as tags may conflict with your organization's information security requirements. A rough sketch of the tagging approach follows.
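As a rough sketch of this approach (assuming the one-object-per-user constraint holds), tagging on upload and a later tag lookup can be done with the standard Amazon S3 APIs; the bucket, prefix, and tag key are illustrative.

import boto3

s3 = boto3.client('s3')

def tag_object_with_user(bucket, key, userid):
    """Attach the owning user's identifier to a newly uploaded object as an S3 object tag."""
    s3.put_object_tagging(
        Bucket=bucket,
        Key=key,
        Tagging={'TagSet': [{'Key': 'userid', 'Value': str(userid)}]},
    )

def delete_objects_for_user(bucket, userid, prefix=''):
    """List objects and delete those tagged with the user's identifier.

    S3 has no server-side search by tag, so this scans the listing, which is the
    inefficiency mentioned above.
    """
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get('Contents', []):
            tags = s3.get_object_tagging(Bucket=bucket, Key=obj['Key'])['TagSet']
            if any(t['Key'] == 'userid' and t['Value'] == str(userid) for t in tags):
                s3.delete_object(Bucket=bucket, Key=obj['Key'])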
Using Apache Hudi — Apache Hudi has become a very popular choice for implementing record-level deletes on top of Amazon S3. The latest version of Hudi is limited to use on Amazon EMR, so it is best suited to users who are building a data lake from scratch and can store their data as Hudi datasets from the start. The Hudi project is quite active, and more features and integrations with more Amazon Cloud Technology services are expected. A brief PySpark sketch follows the link below.
- Amazon EMR
http://aws.amazon.com/emr
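For readers evaluating this option, here is a rough PySpark sketch of how a record-level delete is typically issued against a Hudi dataset from Amazon EMR. The dataset path, table name, record key, precombine field, and filter column are illustrative assumptions, and option names can vary between Hudi versions, so treat this as a starting point rather than a tested recipe.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gdpr-hudi-delete").getOrCreate()

TABLE_PATH = "s3://gdpr-demo/hudi/user_activity"  # illustrative Hudi dataset location

# Records identifying the user to forget; only the record key matters for a delete.
to_delete = spark.read.format("hudi").load(TABLE_PATH).where("user_id = 'V34qejxNsCbcgD8C0HVk-Q'")

hudi_options = {
    "hoodie.table.name": "user_activity",                      # illustrative table name
    "hoodie.datasource.write.recordkey.field": "record_id",    # illustrative record key field
    "hoodie.datasource.write.precombine.field": "updated_at",  # illustrative precombine field
    "hoodie.datasource.write.operation": "delete",             # issue a record-level delete
}

to_delete.write.format("hudi").options(**hudi_options).mode("append").save(TABLE_PATH)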
Whichever method you choose, we always keep the data storage layer separate from the metadata storage layer; as a result, the designs proposed here are general and can be plugged into any existing data pipeline. As with choosing a data storage layer, you should consider the following factors when choosing a storage indexing scheme:
Request concurrency — if you don't expect many index updates to happen at the same time, you might even consider a simple storage option such as Amazon S3 itself as an initial indexing choice. However, if you need to handle many concurrent writes for many users, it is better to choose a service with stronger transactional capabilities.
Consider the team's existing expertise and infrastructure — in this article, we demonstrate how to store and query metadata indexes with Amazon DynamoDB and Amazon RDS for PostgreSQL. If your team has no experience with these, but is already comfortable with Amazon ES, Amazon DocumentDB (with MongoDB compatibility), or another storage layer, feel free to use that instead. Likewise, if you already have a MySQL database with spare capacity, you can use it as the index store to save on operating costs.
- Amazon DocumentDB (MongoDB compatible)
https://aws.amazon.com/cn/documentdb/
Index size - The volume of metadata tends to be orders of magnitude lower than the actual data. However, as the size of your datasets grows significantly, you may want to consider a distributed storage solution with strong scalability as an alternative to traditional relational database management systems.
Summary
The publication of the GDPR has had a significant impact on best practices and introduced a number of additional technical challenges for the design and implementation of data lakes. Hopefully, the reference architecture and scripts presented in this article will help you implement data deletion in a GDPR-compliant manner.
If you have any suggestions or comments, including a better data deletion solution within your organization, please share with us in the comments section.
Authors of this article
George Komninos
Amazon Cloud Technology
Data Lab Solutions Architect
He helps clients turn creative inspiration into production-ready data products. Before joining Amazon Cloud Technology, he worked as a data engineer for Alexa Information Domain for 3 years. In his spare time, George loves football and is a huge fan of Olympiacos in Greece.
Sakti Mishra
Amazon Cloud Technology
Data Lab Solutions Architect
He helps clients design data analytics solutions that accelerate their modernization journeys. Outside of work, Sakti enjoys learning new technologies, watching movies and traveling.