Amazon SageMaker is a fully managed, end-to-end machine learning service that enables data scientists, developers, and machine learning experts to build, train, and host machine learning models quickly and at scale, making it easy to incorporate machine learning techniques into production applications. The main working components of Amazon SageMaker include algorithm development, model training, model evaluation, and model hosting.

Model training is a crucial step in the machine learning workflow, and providing flexible, efficient, and accurate input data for model training directly determines the quality of the training results. Model training is usually not done in one pass; it is a step-by-step process of repeated dynamic adjustment that requires collaboration across multiple departments and workflows.

In this article, we take the image classification algorithm in machine learning as an example to introduce the input data preparation process for typical machine learning model training:

To meet the training and validation data requirements of the image classification algorithm, we maintain a large image warehouse storing tens of thousands or more of classified image files. The data preparation team adds newly acquired image files to the warehouse every day and classifies them. For various reasons (for example, discovering earlier classification errors or deprecating certain categories), image files in the warehouse may also be modified or deleted.

In an actual machine learning training task, in order to control the time and scale of the task, we select a subset of the complete warehouse to form the training dataset and the validation dataset according to the model's scenario.
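As a minimal sketch of this subset selection, the snippet below draws a seeded random 80/20 train/validation split from a warehouse listing (the file names here are made up for illustration):

```python
import random

def split_dataset(image_paths, train_ratio=0.8, seed=42):
    """Randomly split a list of image paths into training and validation subsets."""
    rng = random.Random(seed)          # fixed seed makes the split reproducible
    paths = list(image_paths)
    rng.shuffle(paths)
    cut = int(len(paths) * train_ratio)
    return paths[:cut], paths[cut:]

# Hypothetical warehouse listing: 100 cat images and 100 dog images
warehouse = [f"cats/img_{i}.jpg" for i in range(100)] + \
            [f"dogs/img_{i}.jpg" for i in range(100)]
train_set, validation_set = split_dataset(warehouse, train_ratio=0.8)
```

Because the split is seeded, repeated runs of the same training task see the same subset, which matters when comparing training results across iterations.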


Differences in the use of input data in different formats

Amazon SageMaker training jobs support two input modes: Pipe mode and File mode.

When a training job runs in Pipe mode, the training instance does not download the data locally before the job starts; instead, it streams the training data from the specified pipe on demand. Pipe mode speeds up instance startup and training. In particular, when the amount of training data is very large, for example more than 16 TB, so that the training instance's local disk cannot hold the full dataset, Pipe mode must be used.

Next, we focus on File mode input for Amazon SageMaker training jobs. The data sources currently supported in File mode include Amazon S3, Amazon EFS, and Amazon FSx for Lustre. Different data sources are provided to the training instance as input data in the form of channels. A training instance can be configured with up to 20 different input channels, and different channels can use different data source types.

We use the built-in image classification algorithm provided by Amazon SageMaker to train the image classification model. The built-in algorithm accepts two data formats: RecordIO and raw image files. For how to convert data to the RecordIO format, refer to MXNet's official documentation. Below we focus on image file input:

The image classification algorithm accepts four input channels: train, validation, train_lst, and validation_lst, corresponding to the training dataset, the validation dataset, the training dataset list file, and the validation dataset list file.

Datasets and dataset list files differ widely in format, storage mode, and usage pattern.

The dataset itself consists of image files stored in some directory structure, for example by time, by category, or by department. Once an image file is generated, its content is fixed; it can be read, replaced, or deleted as a whole. Each training task selects some images from the image warehouse to form its training dataset, and the same image file may be reused many times across different training tasks.

The dataset list file is a file with the extension .lst; it is tab-separated and provides index information for the image files in the dataset.

A typical .lst index file format is as follows:

5      1   your_image_directory/train_img_dog1.jpg
1000   0   your_image_directory/train_img_cat1.jpg
22     1   your_image_directory/train_img_dog2.jpg

The first column is the unique index number of the image file; the second column is the numeric class label (starting from 0), for example 0 for cat and 1 for dog; the third column is the file path, including the relative directory and the file name.
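A list file in this format can be generated with a few lines of Python. The directory names and labels below are placeholders matching the sample above; the file name `train.lst` is likewise just an example:

```python
def build_records(class_dirs):
    """Assign sequential index numbers; class_dirs maps a label number to image paths."""
    records = []
    idx = 0
    for label, paths in sorted(class_dirs.items()):
        for path in paths:
            records.append((idx, label, path))
            idx += 1
    return records

def write_lst(records, lst_path):
    """Write (index, label, relative_path) records as a tab-separated .lst file."""
    with open(lst_path, "w") as f:
        for idx, label, rel_path in records:
            f.write(f"{idx}\t{label}\t{rel_path}\n")

records = build_records({
    0: ["your_image_directory/train_img_cat1.jpg"],              # label 0: cat
    1: ["your_image_directory/train_img_dog1.jpg",               # label 1: dog
        "your_image_directory/train_img_dog2.jpg"],
})
write_lst(records, "train.lst")
```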

The dataset list file, by contrast, is generated for the specific scenario of each training task and may be read and modified frequently, for example:

  • In multiple progressive training tasks, more image file index records need to be appended to the list file;
  • When classification information is found to be wrong, some rows need to be corrected;
  • When the image warehouse is modified, some lines need to be replaced;
  • Frequent diffs between multiple .lst files are needed to confirm how differences in the input data affect the trained model;
  • When classification rules change, an original category is subdivided or several categories are merged into a new one;
  • Teams need to share and collaborate to confirm and create the content of the same .lst file;
  • other similar scenarios.
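To support the diff scenario above, a small helper can compare two versions of a list file and report which indices were added, removed, or changed. This is a sketch, assuming the three-column tab-separated format described earlier:

```python
def parse_lst(lines):
    """Parse tab-separated .lst lines into {index: (label, path)}."""
    entries = {}
    for line in lines:
        if not line.strip():
            continue
        idx, label, path = line.rstrip("\n").split("\t")
        entries[int(idx)] = (int(label), path)
    return entries

def diff_lst(old_lines, new_lines):
    """Return (added, removed, changed) index lists between two .lst versions."""
    old, new = parse_lst(old_lines), parse_lst(new_lines)
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    # "changed" = same index, but the label or file path differs
    changed = sorted(i for i in set(old) & set(new) if old[i] != new[i])
    return added, removed, changed

old = ["5\t1\timg/dog1.jpg", "1000\t0\timg/cat1.jpg"]
new = ["5\t0\timg/dog1.jpg", "22\t1\timg/dog2.jpg"]
added, removed, changed = diff_lst(old, new)
# added=[22], removed=[1000], changed=[5] (its label flipped from 1 to 0)
```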

An image repository stored as image files is well suited to Amazon S3 object storage. Of course, if you want to avoid repeatedly downloading large amounts of image data from Amazon S3, or if existing auxiliary tools were developed against the POSIX file system interface, then Amazon EFS or Amazon FSx for Lustre can also provide simple and efficient persistent storage for your image warehouse.

The list file .lst, however, needs to be created and modified frequently, and finalizing its content may involve a workflow with collaboration among multiple members. It is therefore well suited to shared file systems with POSIX interfaces, including Amazon EFS and Amazon FSx for Lustre. You can also keep .lst files on Amazon S3, but given the need for block-level modification and reading and for easy file sharing, working with .lst files directly on a POSIX-compliant shared file system is still simpler.

Configuring input channels and data

You can configure the input channels through the Amazon SageMaker console interface when creating a training job:

Here is the configuration for S3 as the input channel data source:

image.png

The following is the configuration of EFS as the input channel data source:

image.png

Here is the configuration using Amazon FSx for Lustre as the input channel data source:

image.png

If you programmatically create and submit training jobs to Amazon SageMaker, you can use the following APIs to build different input data sources:

from sagemaker.inputs import FileSystemInput
from sagemaker.inputs import TrainingInput

content_type = 'application/x-image'

# S3 prefix holding the training .lst file (replace with your own bucket/prefix)
s3train_lst = 's3://your-bucket/image-classification/train_lst/'

# Training and validation image datasets, read from Amazon EFS
fs_train = FileSystemInput(file_system_id='fs-c0de3680',
                           file_system_type='EFS',
                           directory_path='/caltech/256_ObjectCategories',
                           content_type=content_type,
                           file_system_access_mode='ro')
fs_validation = FileSystemInput(file_system_id='fs-c0de3680',
                                file_system_type='EFS',
                                directory_path='/caltech/256_ObjectCategories',
                                content_type=content_type,
                                file_system_access_mode='ro')

# Training dataset list file, read from Amazon S3
fs_train_lst = TrainingInput(s3train_lst, content_type=content_type)

# Validation dataset list file, read from Amazon FSx for Lustre
fs_validation_lst = FileSystemInput(file_system_id='fs-0cd42e47a9d3be5e1',
                                    file_system_type='FSxLustre',
                                    directory_path='/k4jhtbmv/image-classification/validation_lst',
                                    content_type=content_type,
                                    file_system_access_mode='ro')

data_channels = {'train': fs_train, 'validation': fs_validation,
                 'train_lst': fs_train_lst, 'validation_lst': fs_validation_lst}

In the above code example, the training and validation image datasets are read from Amazon EFS. Considering the usage characteristics of .lst files, the training set list file is stored on Amazon S3, and the validation set list file is stored on Amazon FSx for Lustre.

A training job created with the above code obtains its input from three different data sources, which increases the flexibility of data preparation for the training task.
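If you assemble the request yourself with boto3 instead of the SageMaker Python SDK, the same mixed-source channel layout maps onto the InputDataConfig structure of create_training_job. The sketch below reuses the placeholder file system IDs, paths, and S3 URI from the example; the validation channel, which mirrors train, is omitted for brevity:

```python
# InputDataConfig entries for boto3's create_training_job, mixing three sources:
# image data on EFS, the train .lst on S3, and the validation .lst on FSx for Lustre.
input_data_config = [
    {
        "ChannelName": "train",
        "ContentType": "application/x-image",
        "DataSource": {
            "FileSystemDataSource": {
                "FileSystemId": "fs-c0de3680",
                "FileSystemType": "EFS",
                "DirectoryPath": "/caltech/256_ObjectCategories",
                "FileSystemAccessMode": "ro",
            }
        },
    },
    {
        "ChannelName": "train_lst",
        "ContentType": "application/x-image",
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://your-bucket/image-classification/train_lst/",
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    },
    {
        "ChannelName": "validation_lst",
        "ContentType": "application/x-image",
        "DataSource": {
            "FileSystemDataSource": {
                "FileSystemId": "fs-0cd42e47a9d3be5e1",
                "FileSystemType": "FSxLustre",
                "DirectoryPath": "/k4jhtbmv/image-classification/validation_lst",
                "FileSystemAccessMode": "ro",
            }
        },
    },
]
```

This list would be passed as the `InputDataConfig` argument of `create_training_job`, alongside the algorithm, resource, and output settings.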

Choosing a storage service

Amazon SageMaker can simultaneously use three different storage services, Amazon S3, Amazon EFS, and Amazon FSx for Lustre, as data sources for input channels. How do you choose the right storage service for a specific machine learning scenario?

We recommend starting by determining where your training data is currently saved:

If your training data is already on Amazon S3 and you are satisfied with the time taken to complete the current training task, you can continue to use Amazon SageMaker for training tasks on Amazon S3. However, if you need to start training tasks faster, with shorter training times, we recommend that you use Amazon FSx for Lustre, a file system that is natively integrated with Amazon S3.

With Amazon FSx for Lustre, Amazon SageMaker gains high-speed access to your training data on Amazon S3, accelerating your machine learning training tasks. The first time you run a training job, Amazon FSx for Lustre automatically downloads the training data from Amazon S3 and provides it to Amazon SageMaker. Subsequent iterations of the training task can continue to use this data, avoiding repeated downloads of the same data from Amazon S3. For this reason, when the training data is stored in Amazon S3 and training must be run many times with different algorithms and parameters to find the optimal result, using Amazon FSx for Lustre brings great benefits.

If your training data is already in an Amazon EFS file system, we recommend using Amazon EFS as the data source. Training jobs can use data on Amazon EFS directly, without additional data movement, which speeds up training startup. A typical scenario: data scientists already have a working directory on Amazon EFS, continuously bring in new data to iterate quickly on the model, may need to share data among colleagues, and run various experiments on data fields or labels.

Second, consider whether the training dataset, and especially the dataset index files (the .lst files mentioned above), will undergo frequent content changes:

Generally speaking, Amazon S3, as object storage, is better suited to storing data files whose content does not change;

If file content needs to be frequently appended, truncated, or partially modified, then we recommend Amazon EFS or Amazon FSx for Lustre for your data files. As shared file systems, Amazon EFS and Amazon FSx for Lustre provide a POSIX file interface and block-level modification capabilities, and integrate easily with existing machine learning workflows.


Summary

Amazon SageMaker can use three different storage services, Amazon S3, Amazon EFS, and Amazon FSx for Lustre, as input channels, flexibly combining data sources to run machine learning training tasks efficiently.

References

Image classification algorithm:
https://aws.amazon.com/blogs/machine-learning/classify-your-own-images-using-amazon-sagemaker/

Built-in image classification algorithms:
https://docs.aws.amazon.com/sagemaker/latest/dg/image-classification.html

MXNet official documentation (RecordIO):
https://mxnet.apache.org/versions/1.7.0/api/faq/recordio.html

Author of this article

Dai

Amazon Cloud Technology Partner Solutions Architect

Mainly responsible for helping partners' solutions land on Amazon Web Services and empowering partners to continuously optimize their cloud solutions. He is also committed to promoting best practices for storage technologies on the cloud.

