Foreword
Humans interact with machines through voice, which can improve efficiency in many scenarios, and voice interaction is one of the current research hotspots in the field of artificial intelligence. Application scenarios for speech recognition range from in-vehicle scenarios, such as in-vehicle voice assistants, to home scenarios, such as smart home devices, and so on. To realize voice interaction between humans and machines, the machine must first be able to recognize the spoken content. However, general-purpose speech recognition services cannot fully meet the needs of every scenario, so customers often need to train models for their own requirements.
This article will show you how to use the Amazon SageMaker service to train your own speech recognition model. We have chosen the open source speech recognition project WeNet as an example.
Amazon SageMaker is a fully managed machine learning service covering the basic machine learning workflow, including data labeling, data processing, model training, hyperparameter tuning, model deployment, and continuous model monitoring; it also provides higher-level capabilities such as automatic labeling, automatic machine learning, and training-job monitoring. Through its fully managed machine learning infrastructure and support for mainstream frameworks, it can reduce customers' total cost of ownership for machine learning.
WeNet is an open source, end-to-end speech recognition toolkit oriented toward industrial-grade products. It supports both streaming and non-streaming recognition, and can run efficiently both in the cloud and on embedded devices. Model training requires a lot of computing resources; with Amazon SageMaker we can easily start a cluster of multiple fully managed training instances to speed up the training process.
Preparation work
Before starting to train the model, we need to do some preparation, including: creating an FSx file system to store the data used during training, creating an Amazon SageMaker notebook as the experimental environment, mounting the FSx file system in the notebook, preparing the experiment code, building the runtime environment (Docker image) for data processing and model training, and pushing the image to Amazon ECR (Elastic Container Registry).
The experiments in this article are all completed using services in the us-east-1 region; you can also use other regions.
Create FSx for Lustre storage
In the past, training models in Amazon SageMaker generally used Amazon Simple Storage Service (Amazon S3) as the data store. Now, Amazon SageMaker supports multiple data sources for model training, such as Amazon FSx for Lustre and Amazon Elastic File System (Amazon EFS). By reading data directly from EFS or FSx for Lustre, Amazon SageMaker speeds up data loading during training.
FSx for Lustre supports importing data from Amazon S3 and exporting data to Amazon S3. If your data is already stored in Amazon S3, FSx for Lustre transparently presents the objects as files. The same FSx file system can also be reused across multiple Amazon SageMaker training jobs, saving the time of repeatedly downloading the training data set.
Here, we will use FSx for Lustre as the main data storage. Next, we will create an FSx for Lustre file system.
Create an Amazon S3-based FSx for Lustre file system
Set the VPC, subnet, and security group in the "Network & Security" section, and confirm that the security group's inbound rules allow traffic on port 988.
Select "Import data from Amazon S3 and export data to Amazon S3" at "Data Repository Import/Export" and specify the storage bucket and path where the Amazon S3 training data is located.
After the creation is complete, click the "Mount" button to pop up the steps to mount this file system, which we will use in Amazon SageMaker Notebook later.
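If you prefer to script this step instead of using the console, the following is a minimal sketch using boto3. The S3 bucket path is a placeholder, the subnet and security group IDs are the same example values used later in this article, and you should adjust the storage capacity to your needs.

import boto3

# Minimal sketch: create an Amazon S3-backed FSx for Lustre file system.
# The S3 path is a placeholder; subnet/security group IDs are example values.
fsx = boto3.client('fsx')
fsx.create_file_system(
    FileSystemType='LUSTRE',
    StorageCapacity=1200,                            # capacity in GiB
    SubnetIds=['subnet-07ce0abxxxxcfeb25'],
    SecurityGroupIds=['sg-04acfcxxxx929ee4e'],
    LustreConfiguration={
        'ImportPath': 's3://your-bucket/asr-data/',  # import data from Amazon S3
        'ExportPath': 's3://your-bucket/asr-data/'   # export data back to Amazon S3
    }
)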
Create Amazon SageMaker Notebook
Select the notebook instance type. Here we choose an ml.p3.8xlarge instance, which contains 4 Tesla V100 GPU cards. You can choose another GPU instance; if you don't need a GPU card, you can also choose a CPU instance.
In addition, you can set the volume size of the notebook instance yourself; in this example, 100 GB of storage is selected. You can adjust the size of this storage later.
Choose to create a new IAM role, including the required permissions, as shown below:
For the network part, just select the VPC where FSx is located and the public subnet. The security group needs to allow Amazon SageMaker to access FSx.
Mount FSx storage in the notebook
On the notebook console page, click "Open JupyterLab".
On the Launcher page, click "Terminal" to create a new command line terminal. In the terminal, install the Lustre client according to the steps shown in the section "Create an Amazon S3-based FSx for Lustre file system", and execute the mount command.
In addition, you can configure a notebook lifecycle configuration so that the notebook instance automatically mounts the FSx file system whenever it is created or started; refer to the document [2].
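As a simplified sketch of that approach (the referenced script [2] additionally installs the Lustre client and handles more edge cases), a lifecycle configuration could be created with boto3 roughly as follows; the file system DNS name and mount name below are placeholders based on the example IDs used later in this article.

import base64
import boto3

# Simplified sketch: an on-start script that mounts the FSx for Lustre file system
# whenever the notebook instance starts. DNS name and mount name are placeholders.
on_start_script = """#!/bin/bash
sudo mkdir -p /fsx
sudo mount -t lustre -o noatime,flock fs-0f8a3xxxxf47b6ff8.fsx.us-east-1.amazonaws.com@tcp:/yobzhbmv /fsx
"""

sm = boto3.client('sagemaker')
sm.create_notebook_instance_lifecycle_config(
    NotebookInstanceLifecycleConfigName='mount-fsx-lustre',
    OnStart=[{'Content': base64.b64encode(on_start_script.encode('utf-8')).decode('utf-8')}]
)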
Download the WeNet source code
In the command line terminal in the previous step, execute the following command to complete the code download.
sudo chown ec2-user.ec2-user /fsx
ln -s /fsx /home/ec2-user/SageMaker/fsx
cd ~/SageMaker/fsx
git clone -b sagemaker https://github.com/chen188/wenet
Here, we recommend that you place all files related to the experiment in the ~/SageMaker directory. The data in this directory persists after the notebook instance is shut down.
You can open the notebook file /fsx/wenet/examples/aishell/s0/SM-WeNet.ipynb, where you can find all of the subsequent commands.
Prepare Docker image
In Amazon SageMaker, many tasks are implemented based on Docker images, such as data preprocessing, model training, and model hosting. Using Docker images ensures the consistency of the environment and reduces the operational cost of environment setup.
Next, we need to build our own Docker image to implement data format conversion and model training. Amazon Web Services already provides a number of general-purpose Deep Learning Container (DLC) images; for the specific list, please refer to [6]. However, the TorchAudio package is not yet included in them, so we choose to build the runtime environment from the open source version instead.
The image is based on Ubuntu, with PyTorch 1.8.1, TorchAudio, and other related dependencies installed.
File /fsx/wenet/Dockerfile:
FROM ubuntu:latest
ENV DEBIAN_FRONTEND=noninteractive
ENV PATH /opt/conda/bin:$PATH

RUN apt-get update --fix-missing && \
    apt-get install -y gcc net-tools && \
    apt-get install -y --no-install-recommends wget bzip2 ca-certificates libglib2.0-0 libxext6 libsm6 libxrender1 git mercurial subversion && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/* && \
    wget --quiet https://repo.anaconda.com/archive/Anaconda3-2021.05-Linux-x86_64.sh -O ~/anaconda.sh && \
    /bin/bash ~/anaconda.sh -b -p /opt/conda && \
    rm ~/anaconda.sh && \
    ln -s /opt/conda/etc/profile.d/conda.sh /etc/profile.d/conda.sh && \
    echo ". /opt/conda/etc/profile.d/conda.sh" >> ~/.bashrc && \
    echo "conda activate base" >> ~/.bashrc && \
    find /opt/conda/ -follow -type f -name '*.a' -delete && \
    find /opt/conda/ -follow -type f -name '*.js.map' -delete && \
    /opt/conda/bin/conda clean -afy

COPY ./requirements.txt /tmp/

RUN pip install -r /tmp/requirements.txt && \
    pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 -f https://download.pytorch.org/whl/lts/1.8/torch_lts.html && \
    pip install sagemaker-training && \
    rm /tmp/requirements.txt
You may notice that we additionally installed the sagemaker-training package, so that the image supports Amazon SageMaker training jobs.
Build the image and push it to ECR
ECR is Amazon's fully managed container registry service. We can push the built image to ECR, and then Amazon SageMaker will download the corresponding image from here when training or hosting the model.
import boto3
account_id = boto3.client('sts').get_caller_identity().get('Account')
region = boto3.Session().region_name
ecr_repository = 'sagemaker-wenet'

# Log in to the ECR service
!aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin {account_id}.dkr.ecr.{region}.amazonaws.com

# Training image
training_docker_file_path = '/fsx/wenet'
!cat $training_docker_file_path/Dockerfile

tag = ':training-pip-pt181-py38'
training_repository_uri = '{}.dkr.ecr.{}.amazonaws.com/{}'.format(account_id, region, ecr_repository + tag)
print('training_repository_uri: ', training_repository_uri)

!cd $training_docker_file_path && docker build -t "$ecr_repository$tag" .
!docker tag {ecr_repository + tag} $training_repository_uri
!docker push $training_repository_uri
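Note that docker push requires the target repository to already exist in ECR. If you have not created the sagemaker-wenet repository yet, a minimal sketch to create it with boto3 is:

import boto3

# Minimal sketch: create the ECR repository if it does not exist yet.
ecr = boto3.client('ecr')
try:
    ecr.create_repository(repositoryName='sagemaker-wenet')
except ecr.exceptions.RepositoryAlreadyExistsException:
    pass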
Use Amazon SageMaker to train the model
Now that the experimental environment is ready, we can get to the main topic: using Amazon SageMaker to train the model.
WeNet supports training a variety of models, such as Conformer and Transformer. Here we will take the unified transformer as an example to show the entire training process. For training data, WeNet also supports multiple sources, such as AIShell-1, AIShell-2, and LibriSpeech; you only need to organize the data into the required format before training. Here, we will take AIShell-1 as an example.
Data download
We first need to download the training data to the FSx storage. Execute the following command in the notebook:
cd /fsx/wenet/examples/aishell/s0 && \
bash run.sh --stage -1 --stop_stage -1 --data /fsx/asr-data/OpenSLR/33
The data will be automatically downloaded to the /fsx/asr-data/OpenSLR/33 directory. After the download is complete, the directory contents are as follows:
sh-4.2$ ls /fsx/asr-data/OpenSLR/33
data_aishell  data_aishell.tgz  resource_aishell  resource_aishell.tgz
Data preprocessing
Next, we need to organize the data into the format required by WeNet. Here we use Amazon SageMaker to execute the logic of data preprocessing.
Mount the FSx file system to the data preprocessing container
As we mentioned earlier, the data required for model training has been stored in the FSx file system. When we process the data through Amazon SageMaker, we need to mount the FSx file system in the container. The code to mount the file system is as follows:
from sagemaker.inputs import FileSystemInput
from sagemaker.pytorch.estimator import PyTorch

file_system_id = 'fs-0f8a3xxxxf47b6ff8'
file_system_path = '/yobzhbmv'
file_system_access_mode = 'rw'
file_system_type = 'FSxLustre'

security_group_ids = ['sg-04acfcxxxx929ee4e']
subnets= ['subnet-07ce0abxxxxcfeb25']

file_system_input_train = FileSystemInput(file_system_id=file_system_id,
                                          file_system_type=file_system_type,
                                          directory_path=file_system_path,
                                          file_system_access_mode=file_system_access_mode)
It should be noted that the subnet specified in the subnets parameter requires the ability to access services such as Amazon S3. You can choose to use a private subnet and specify the default route to the NAT gateway for the subnet.
The security group specified by security_group_ids will be bound to the instance started by Amazon SageMaker and must have the ability to access the FSx service.
Start data preprocessing job
So far, we have defined the file system to be mounted by specifying the file system ID, file system path, read/write mode, and other information. Next, you can set the runtime environment and the parameters to be passed for the data processing job. The code is as follows:
hp= {
    'stage': 0, 'stop_stage': 3, 'train_set':'train',
    'trail_dir':'/opt/ml/input/data/train/sm-train/trail0',
    'data': '/opt/ml/input/data/train/asr-data/OpenSLR/33',
    'shared_dir': '/opt/ml/input/data/train/shared'
}

estimator=PyTorch(
    entry_point='examples/aishell/s0/sm-run.sh',
    image_uri=training_repository_uri,
    instance_type='ml.c5.xlarge',
    instance_count=1,
    source_dir='.',
    role=role,
    hyperparameters=hp,

    subnets=subnets,
    security_group_ids=security_group_ids,

    debugger_hook_config=False,
    disable_profiler=True
)
We use the image_uri parameter to specify the container environment in which the data processing code runs, instance_type specifies the required instance type, instance_count specifies the number of instances required, and hyperparameters specifies the hyperparameters that need to be passed.
Next, you can start the specified computing resource with a line of commands and execute the data processing logic.
estimator.fit(inputs={'train': file_system_input_train})
We set the data input for the running container through the inputs parameter. Amazon SageMaker supports multiple data sources, such as local files (file://), Amazon S3 paths (s3://bucket/path), and file systems (FSx or EFS). Here, our FSx file system is mapped to the /opt/ml/input/data/train directory of the container; train is a custom channel name, and other common channel names include test, validation, etc. For the specific path mapping rules in Amazon SageMaker, please refer to [1].
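To make that mapping rule concrete, the following purely illustrative sketch shows how each channel name passed to fit() determines a directory inside the container; the extra channel shown in the comment is hypothetical and not used in this walkthrough.

# Purely illustrative: each key in `inputs` defines a channel, and its data source
# is made available inside the container under /opt/ml/input/data/<channel_name>.
inputs = {
    'train': file_system_input_train,      # mounted at /opt/ml/input/data/train
    # 'test': file_system_input_test,      # hypothetical extra channel -> /opt/ml/input/data/test
}
estimator.fit(inputs=inputs)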
View the processed data
After processing is completed, the corresponding files will be created in the trail_dir and shared_dir directories. You can inspect them by executing the following commands on the notebook instance:
tree -L 3 /fsx/sm-train/trail0
tree -L 3 /fsx/sm-train/shared
Start model training job
At this point, the training data is ready, and we can enter the model training phase. We will show two training modes: local training and training on fully managed instances.
Local training mode
During model development, algorithm engineers need to adjust the code logic repeatedly, and it would be very troublesome to package a Docker image for every code change. Therefore, you can debug the code first through Amazon SageMaker's local training mode. Local training mode starts the corresponding container directly on the instance where the notebook is located, executes the training logic, and automatically maps the data into the container. For details on local mode training, refer to the document [3]. The local training code we use here is as follows:
instance_type='local_gpu'
instance_count = 1
CUDA_VISIBLE_DEVICES='0'

hp= {
    'stage': 4, 'stop_stage': 4, 'train_set':'train',
    'data': data_dir, 'trail_dir': trail_dir, 'shared_dir': shared_dir,
    'CUDA_VISIBLE_DEVICES': CUDA_VISIBLE_DEVICES,
    'num_nodes': instance_count
}

estimator=PyTorch(
    entry_point='examples/aishell/s0/sm-run.sh',
    image_uri=training_repository_uri,
    instance_type=instance_type,
    instance_count=instance_count,
    source_dir='.',
    role=role,
    hyperparameters=hp,

    subnets=subnets,
    security_group_ids=security_group_ids,

    debugger_hook_config=False,
    disable_profiler=True
)


estimator.fit({'train': 'file:///fsx'})
The output of the code is as follows:
Creating 2n0im72bz3-algo-1-tpyyu ...
Creating 2n0im72bz3-algo-1-tpyyu ... done
Attaching to 2n0im72bz3-algo-1-tpyyu
…
2n0im72bz3-algo-1-tpyyu | Invoking script with the following command:
2n0im72bz3-algo-1-tpyyu |
2n0im72bz3-algo-1-tpyyu | /bin/sh -c ./examples/aishell/s0/sm-run.sh --CUDA_VISIBLE_DEVICES 0 --data /opt/ml/input/data/train/asr-data/OpenSLR/33 --num_nodes 1 --shared_dir /opt/ml/input/data/train/sm-train/shared --stage 4 --stop_stage 4 --trail_dir /opt/ml/input/data/train/sm-train/trail0 --train_set train
…
2n0im72bz3-algo-1-tpyyu | algo-1-tpyyu: 2021-06-24 15:50:09,408 INFO [checkpoint.py:33] Checkpoint: save to checkpoint /opt/ml/input/data/train/sm-train/trail0/exp/unified_transformer/init.pt
2n0im72bz3-algo-1-tpyyu | algo-1-tpyyu: 2021-06-24 15:50:09,669 INFO [train.py:228] Epoch 0 TRAIN info lr 8e-08
2n0im72bz3-algo-1-tpyyu | algo-1-tpyyu: 2021-06-24 15:50:09,670 INFO [executor.py:32] using accumulate grad, new batch size is 1 times larger than before
2n0im72bz3-algo-1-tpyyu | algo-1-tpyyu: 2021-06-24 15:50:12,560 DEBUG [executor.py:103] TRAIN Batch 0/7507 loss 417.150146 loss_att 148.725983 loss_ctc 1043.473145 lr 0.00000008 rank 0
Among the above parameters, the path specified by source_dir will be packaged and uploaded to Amazon S3, and then downloaded to the container instance. In this way, every code change we make can be directly reflected in the container.
In addition, when using the local training mode, Amazon SageMaker will start the corresponding training task with the help of the local docker-compose. You can find the relevant docker-compose file in the /tmp directory.
For example, /tmp/tmp6y009akq, we can observe the following:
sh-4.2$ tree /tmp/tmp6y009akq
/tmp/tmp6y009akq
├── artifacts
├── docker-compose.yaml
├── model
└── output
    └── data
Among them, docker-compose.yaml contains the related configuration information; its content is as follows:
sh-4.2$ cat /tmp/tmp6y009akq/docker-compose.yaml
networks:
  sagemaker-local:
    name: sagemaker-local
services:
  algo-1-tpyyu:
    command: train
    container_name: 2n0im72bz3-algo-1-tpyyu
    environment:
    - AWS_REGION=us-east-1
    - TRAINING_JOB_NAME=sagemaker-wenet-2021-06-24-15-49-58-018
    image: <your-aws-account-id>.dkr.ecr.us-east-1.amazonaws.com/sagemaker-wenet:training-pip-pt181-py38
    networks:
      sagemaker-local:
        aliases:
        - algo-1-tpyyu
    stdin_open: true
    tty: true
    volumes:
    - /tmp/tmp6y009akq/algo-1-tpyyu/output:/opt/ml/output
    - /tmp/tmp6y009akq/algo-1-tpyyu/output/data:/opt/ml/output/data
    - /tmp/tmp6y009akq/algo-1-tpyyu/input:/opt/ml/input
    - /tmp/tmp6y009akq/model:/opt/ml/model
    - /opt/ml/metadata:/opt/ml/metadata
    - /fsx:/opt/ml/input/data/train
version: '2.3'
It can be seen that docker-compose maps the local path to the directory in the container through the volume parameter, without the need to perform a secondary copy of the training data.
Hosted training mode
After confirming that the code logic is correct, we can easily use the managed instance to start the real training task by modifying the parameters.
Here, we only need to adjust the instance type, the number of instances needed and the data input method. Let's take 2 instances of ml.p3.8xlarge as an example, each of which contains 4 Tesla V100 graphics cards, for a total of 8 graphics cards.
The training code is as follows:
instance_type='ml.p3.8xlarge'
instance_count = 2
CUDA_VISIBLE_DEVICES='0,1,2,3'

hp= {
    'stage': 4, 'stop_stage': 4, 'train_set':'train',
    'data': data_dir, 'trail_dir': trail_dir, 'shared_dir': shared_dir,
    'CUDA_VISIBLE_DEVICES': CUDA_VISIBLE_DEVICES,
    'ddp_init_protocol': 'tcp',
    'num_nodes': instance_count
}

estimator=PyTorch(
    entry_point='examples/aishell/s0/sm-run.sh',
    image_uri=training_repository_uri,
    instance_type=instance_type,
    instance_count=instance_count,
    source_dir='.',
    role=role,
    hyperparameters=hp,

    subnets=subnets,
    security_group_ids=security_group_ids,

    debugger_hook_config=False,
    disable_profiler=True,
    environment={
        'NCCL_SOCKET_IFNAME': 'eth0',
        'NCCL_IB_DISABLE': '1'
    }
)

estimator.fit(inputs={'train': file_system_input_train})
Among them, the parameter CUDA_VISIBLE_DEVICES needs to be set to the number of GPU cards in the training instance. If there is only one GPU graphics card, its value is '0'.
It should be noted that, at the time of writing, Amazon SageMaker training jobs do not support specifying the flock mount option when mounting FSx, which makes the file-based distributed initialization method unusable. Therefore, we slightly adjusted the WeNet training code to use TCP-based initialization to continue model training.
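For reference, a minimal sketch of what TCP-based initialization looks like in PyTorch is shown below. The address, port, rank, and world size are hypothetical placeholders; the adjusted WeNet training script derives the actual values from its own arguments and the Amazon SageMaker environment.

import torch.distributed as dist

# Minimal sketch of TCP-based process group initialization (placeholder values);
# file-based initialization would instead use init_method='file:///path/to/sharedfile'.
dist.init_process_group(backend='nccl',
                        init_method='tcp://10.0.0.1:23456',
                        world_size=8,
                        rank=0)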
You may also notice that we passed in the environment parameter, which sets the corresponding environment variables in the container. Because the training instances launched by Amazon SageMaker have more than one network card, we set the NCCL_SOCKET_IFNAME environment variable to eth0 so that NCCL uses that interface.
In addition, Amazon SageMaker supports training models on Spot instances to effectively reduce costs; you can refer to the document [4] for how to use them.
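As a rough sketch of what that looks like, the estimator can be created with managed Spot training enabled; the checkpoint bucket path below is a placeholder, and the other parameters reuse the variables defined above.

# Rough sketch: enable managed Spot training on the estimator defined above.
# checkpoint_s3_uri is a placeholder; see [4] for checkpointing details.
estimator = PyTorch(
    entry_point='examples/aishell/s0/sm-run.sh',
    image_uri=training_repository_uri,
    instance_type=instance_type,
    instance_count=instance_count,
    source_dir='.',
    role=role,
    hyperparameters=hp,
    subnets=subnets,
    security_group_ids=security_group_ids,
    debugger_hook_config=False,
    disable_profiler=True,
    use_spot_instances=True,
    max_run=36000,       # maximum training time, in seconds
    max_wait=72000,      # total wait time including Spot interruptions; must be >= max_run
    checkpoint_s3_uri='s3://your-bucket/wenet-checkpoints/'
)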
Model files
After training is completed, the model files will be generated in the directory you set; in this article, that is the /fsx/sm-train/trail0/exp/unified_transformer directory.
If you need to export a serialized and optimized (TorchScript) model, you can adjust stage and stop_stage in the hp variable and execute the training code in local mode. For TorchScript, please refer to [5].
The relevant code logic is as follows:
instance_type='local_gpu'
…
hp= {
    'stage': 5, 'stop_stage': 6, 'train_set':'train',
…
}

estimator=PyTorch(
…
)

estimator.fit({'train':'file:///fsx'})
After the execution is complete, the model file final.zip and the quantized model file final_quant.zip will be generated in the above-mentioned directory.
We have now completed one round of model training. We know that obtaining a model that meets current needs requires multiple trials, iterations, and training runs. With the methods above, you can quickly try different hyperparameters or other algorithms on Amazon SageMaker, without having to worry about configuring the underlying machine learning environment or other operations and maintenance work.
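For example, if the training script exposes a tunable hyperparameter and prints a parseable validation metric, Amazon SageMaker automatic model tuning could drive those trials for you. The sketch below is purely illustrative: the hyperparameter name, metric name, and regular expression are hypothetical and would need to match what your own training script accepts and prints.

from sagemaker.tuner import HyperparameterTuner, ContinuousParameter

# Purely illustrative sketch of automatic model tuning; the hyperparameter name,
# metric name, and regex are hypothetical and must match the training script.
tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name='cv_loss',
    objective_type='Minimize',
    hyperparameter_ranges={'learning_rate': ContinuousParameter(0.0001, 0.01)},
    metric_definitions=[{'Name': 'cv_loss', 'Regex': 'cv_loss: ([0-9\\.]+)'}],
    max_jobs=4,
    max_parallel_jobs=2
)
tuner.fit(inputs={'train': file_system_input_train})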
Model hosting
So far, we have got the trained model file. You can deploy the model through Amazon SageMaker, or you can deploy it in other ways. In a follow-up article, we will introduce in detail how to deploy the trained model on Amazon Web Services.
Summary
This article shows how to use Amazon SageMaker to run the open source end-to-end speech recognition toolkit WeNet, covering data processing, Docker runtime environment construction, model training, and more.
References
[1] Amazon SageMaker Toolkits:
https://docs.aws.amazon.com/sagemaker/latest/dg/amazon-sagemaker-toolkits.html
[2] Automatically mount the FSx file system on a notebook instance:
https://github.com/aws-samples/amazon-sagemaker-notebook-instance-lifecycle-config-samples/blob/master/scripts/mount-fsx-lustre-file-system/on-start.sh
[3] Use local mode to train the model:
https://sagemaker.readthedocs.io/en/stable/overview.html#local-mode
[4] Use Spot instances to train the model:
https://docs.aws.amazon.com/sagemaker/latest/dg/model-checkpoints.html
[5] TorchScript compiler:
https://pytorch.org/docs/1.8.1/jit.html
[6] DLC list:
https://github.com/aws/deep-learning-containers/blob/master/available_images.md
Author of this article
Chen Bin
Amazon Cloud Technology Solution Architect
Responsible for consulting on and designing cloud computing solution architectures based on Amazon Cloud Technology, he has rich experience in solving customers' practical problems and currently focuses on the research and application of deep learning.