Official AWS blog post on DeepRacer-for-Cloud (direct link): https://aws.amazon.com/cn/blogs/china/use-amazon-ec2-to-further-reduce-the-cost-of-deepracer-training/
Foreword
The blog post already describes the overall plan, so this article extracts the scripts from it and works through the problems that come up when running them.
Deep Learning AMI (Ubuntu 18.04) Version 60.1 is used here
Instance preparation: the blog post recommends G- and P-series instances, as follows:
g4dn.2xlarge: a cost-effective option; with GPU acceleration, training is slightly faster than in the DeepRacer console
p3.2xlarge: training is much faster than in the DeepRacer console, so the model can be iterated quickly
This walkthrough uses g4dn.2xlarge as a Spot Instance rather than On-Demand, which cuts the cost by 70% or more. The trade-off is that G- and P-series capacity in us-east-1 is often severely limited, so a Spot Instance may be stopped after running for a while. To cope with that, deploy the basic environment once and create an image from it. Later instances can then be launched quickly from that image without repeating the basic configuration.
Check your EC2 limits first. By default an account has no quota for G- and P-series instances, and you need to submit a support case to raise the limit. Do not request too large an increase, or the request may be rejected.
For convenience, I created an EC2 launch template, which also lays the groundwork for later training runs and avoids manual configuration each time.
Article directory structure:
1. Create an IAM role for EC2
1. First enter the IAM console
2. Select EC2 and click Next
3. Add permissions
4. Name, view and create
5. View created roles
2. Create a bucket
1. Enter the S3 console
2. Create a bucket
3. View the created bucket
3. Create a launch template
1. Select the image
2. Set the instance type and key pair
3. Define Subnets and Security Groups
4. Configure storage
5. Advanced Details
6. Review summaries and create templates
4. Create an instance
1. Select Launch instance from template
2. View spot requests
5. Connect to the instance and build the basic environment
Basic. Connect to your instance
- ①. Import the key into SecureCRT
- ②. Connect to the instance with SecureCRT
Next, build the basic environment
Step-1. Pull the code
Step-2. Install the basic components required for DeepRacer local training
- Error Scenario 1, Solution
- Error Scenario 2, Solution
Step-3. Reconnect the EC2 instance and execute the environment initialization code of the second stage
Step-4. Load the script required to train DeepRacer
Step-5. Edit reward function, training information, car information
Step-6. Edit the environment file run.env
- ①.Add bucket information
- ②. Edit track information
Step-7. Update python version
Step-8. Update configuration
Step-9. Upload the custom files to the S3 bucket
Step-10. Start training
6. Follow-up operations (retraining)
1. If the instance is terminated, launch a new instance and continue the last training
- ①. Modify the run.env file
- ②. Update to make this configuration take effect
- ③. If you have modified the files in custom_files, please execute the following command to upload custom_files again
- ④. Start training
7. Parameters and command interpretation
8. Problems encountered
1. NO_PUBKEY
2. Python3.6 reports an error, update the python version
3. Unable to acquire lock
4. Sagemaker is not running
The following is the specific operation plan
1. Create an IAM role for EC2
As the official AWS blog post explains, training the model on EC2 uses the following three services:
- AmazonKinesisVideoStreams
- CloudWatch
- S3
So here is a walkthrough of creating a new EC2 role and granting it the required permissions.
1. First enter the IAM console
IAM Create New Role Console: https://us-east-1.console.aws.amazon.com/iamv2/home#/roles/create?step=selectEntities
2. Select EC2 and click Next
3. Add permissions
Search for S3, CloudWatch, and AmazonKinesisVideoStreams in turn. For each one, select the policy whose name ends in FullAccess, which grants full access.
After all three permissions are added, click Next
4. Name, view and create
- ①. Set the role name
- ②. View the permissions attached to the role
- ③. Prepare to create
Add tags if you need them (optional).
5. View created roles
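The same role can also be created programmatically. The following is a minimal sketch using boto3, not the method from the blog post. It assumes the three AWS managed policies are the ones the console search returns, and it uses the role name for the instance profile as well. `create_ec2_role` is a hypothetical helper name.

```python
import json

# AWS managed policies assumed to correspond to the three services above.
POLICY_ARNS = [
    "arn:aws:iam::aws:policy/AmazonS3FullAccess",
    "arn:aws:iam::aws:policy/CloudWatchFullAccess",
    "arn:aws:iam::aws:policy/AmazonKinesisVideoStreamsFullAccess",
]


def ec2_trust_policy() -> str:
    """Trust policy that lets EC2 instances assume the role."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "ec2.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    })


def create_ec2_role(role_name: str) -> None:
    """Create the role, attach the policies, and add an instance profile."""
    import boto3  # imported lazily; calling this needs AWS credentials

    iam = boto3.client("iam")
    iam.create_role(RoleName=role_name,
                    AssumeRolePolicyDocument=ec2_trust_policy())
    for arn in POLICY_ARNS:
        iam.attach_role_policy(RoleName=role_name, PolicyArn=arn)
    # The console creates an instance profile for EC2 roles; the API does not.
    iam.create_instance_profile(InstanceProfileName=role_name)
    iam.add_role_to_instance_profile(InstanceProfileName=role_name,
                                     RoleName=role_name)
```

FullAccess policies keep the demo simple; for anything long-lived, narrower policies scoped to the training bucket would be a safer choice.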
2. Create a bucket
1. Enter the S3 console
S3 Bucket Control Panel: https://s3.console.aws.amazon.com/s3/
2. Create a bucket
You only need to set the bucket name and region here; keep the defaults for everything else.
Scroll to the bottom and click Create bucket.
3. View the created bucket
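For readers who prefer a script to the console, here is a small boto3 sketch of the same bucket creation. The helper names are hypothetical. The one detail worth showing is that us-east-1 must not be passed as a `LocationConstraint`, while any other region must be.

```python
def create_bucket_kwargs(bucket_name: str, region: str) -> dict:
    """Build the arguments for the S3 create_bucket call.

    us-east-1 is the default location and is not accepted as a
    LocationConstraint; every other region has to be named explicitly.
    """
    kwargs = {"Bucket": bucket_name}
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


def create_bucket(bucket_name: str, region: str = "us-east-1") -> None:
    """Create the bucket with default settings, as in the console steps."""
    import boto3  # imported lazily; calling this needs AWS credentials

    s3 = boto3.client("s3", region_name=region)
    s3.create_bucket(**create_bucket_kwargs(bucket_name, region))
```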
3. Create a launch template
EC2 Launch Templates Dashboard: https://us-east-1.console.aws.amazon.com/ec2/v2/home?region=us-east-1#LaunchTemplates
1. Select the image
Select the Deep Learning AMI (Ubuntu 18.04) image.
2. Set the instance type and key pair
Although the console advises against including these two parameters in the template, I define them here in advance to save time on later deployments.
3. Define Subnets and Security Groups
Because I use a Spot request this time, I do not set a subnet. The Spot request can then be fulfilled in any Availability Zone of us-east-1.
4. Configure storage
The official blog post requires a root volume of at least 150 GiB; it is set to 160 GiB here. Note that the AMI itself shows a storage volume of 225 GB.
5. Advanced Details
Here I ticked the Spot Instance request option to reduce cost; whether to tick it depends on your own situation. Then, for the IAM role, choose the role created for EC2 earlier.
6. Review summaries and create templates
The created template is then listed under Launch Templates.
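The template settings above map onto a single `LaunchTemplateData` structure. The sketch below builds it with boto3 under a few assumptions that are not from the blog post: the AMI ID, key pair name, and instance profile name are placeholders you supply, and `/dev/sda1` is assumed to be the root device name of the Ubuntu AMI.

```python
def launch_template_data(ami_id, key_name, instance_profile, root_gib,
                         instance_type="g4dn.2xlarge", use_spot=True,
                         root_device="/dev/sda1"):
    """Build LaunchTemplateData for the settings chosen in the console."""
    data = {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "KeyName": key_name,
        "IamInstanceProfile": {"Name": instance_profile},
        # The root volume cannot be smaller than the AMI's snapshot.
        "BlockDeviceMappings": [{
            "DeviceName": root_device,  # assumption: root device of the AMI
            "Ebs": {"VolumeSize": root_gib},
        }],
    }
    if use_spot:
        # No subnet is set, so the request is not tied to one zone.
        data["InstanceMarketOptions"] = {"MarketType": "spot"}
    return data


def create_template(name, **settings):
    """Create the launch template in us-east-1."""
    import boto3  # imported lazily; calling this needs AWS credentials

    ec2 = boto3.client("ec2", region_name="us-east-1")
    ec2.create_launch_template(
        LaunchTemplateName=name,
        LaunchTemplateData=launch_template_data(**settings),
    )
```

Passing `use_spot=False` yields an On-Demand template, which is useful as a fallback when no Spot capacity is available.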
4. Create an instance
EC2 Instance Console: https://us-east-1.console.aws.amazon.com/ec2/v2/home?region=us-east-1#Instances
1. Select Launch instance from template
Select the template created earlier.
After confirming the settings, launch the instance.
Then check that the instance was created successfully. If you ticked the Spot request option and the launch fails with an error saying there is no Spot quota, even though your account limit is sufficient, there may be no Spot capacity available to you in this region at the moment. In that case, create the instance as an On-Demand Instance instead.
2. View spot requests
If you ticked the Spot request option in the launch template and the instance started successfully, it appears in the Spot Requests panel.
5. Connect to the instance and build the basic environment
The installation steps are those given in the official AWS blog post.
First, a very basic demonstration of how to connect to EC2; I use the SecureCRT connection tool here.
Basic. Connect to your instance
Only .pem keys (the format shared with OpenSSH) are supported here. If your key is a .ppk file, connect with PuTTY instead.
- ①. Import the key into SecureCRT
- ②. Connect to the instance with SecureCRT
Note that because we use an Ubuntu image, the username is ubuntu.
If a prompt box pops up after connecting, click to accept and save.
The SecureCRT connection to the instance then succeeds.
Next, build the basic environment
The AWS DeepRacer-for-Cloud installation and training commands are as follows.
- Step-1. Pull the code
Go to the created EC2 instance and execute the following command to pull the code from GitHub:
git clone https://github.com/aws-deepracer-community/deepracer-for-cloud.git
- Step-2. Install the basic components required for DeepRacer local training
Run the first-stage environment preparation script, which installs the basic components required for local DeepRacer training, and then reboot the EC2 instance. The official post gives only these two lines, but because package versions have changed since it was written, running them can now fail in new ways.
cd deepracer-for-cloud && ./bin/prepare.sh
sudo reboot
Two failure scenarios came up in practice, so the extra steps for each are given here. Scenario 1 is a NO_PUBKEY error.
Error Scenario 1, Solution
### Add the public key
wget -qO - http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/7fa2af80.pub | sudo apt-key add -
### Update the package index
sudo apt-get update
### Re-run; make sure you are in the ubuntu home directory, not inside the deepracer-for-cloud directory
cd deepracer-for-cloud && ./bin/prepare.sh
### Reboot
sudo reboot
If a progress bar appears, the error has been fixed and the basic environment is being installed.
Error Scenario 2, Solution
If the progress bar does not appear and the following error occurs instead (it may appear several times):
E: Could not get lock /var/lib/dpkg/lock-frontend - open (11: Resource temporarily unavailable)
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?
If another apt or dpkg process is still running, wait for it to finish first. Otherwise, remove the lock files with the commands below, then re-run the basic environment installation.
sudo rm /var/lib/dpkg/lock-frontend
sudo rm /var/lib/dpkg/lock
### Re-run; make sure you are in the ubuntu home directory, not inside the deepracer-for-cloud directory
cd deepracer-for-cloud && ./bin/prepare.sh
### Reboot
sudo reboot
- Step-3. Reconnect the EC2 instance and execute the environment initialization code of the second stage
### Make sure you are in the ubuntu home directory at this point
cd deepracer-for-cloud/ && bin/init.sh -c aws -a gpu
The environment is now being initialized, and many container images need to be pulled. Please wait patiently until initialization completes.
- Step-4. Load the script required to train DeepRacer
source bin/activate.sh
- Step-5. Edit reward function, training information, car information
The deepracer-for-cloud directory contains a custom_files directory with three files:
reward_function.py #Reward function file
hyperparameters.json #Training information file
model_metadata.json #Car information file
Edit the reward function in the deepracer-for-cloud/custom_files/reward_function.py file, for example:
def reward_function(params):
    '''
    Example of penalizing steering, which helps mitigate zig-zag behavior
    '''
    # Read input parameters
    distance_from_center = params['distance_from_center']
    track_width = params['track_width']
    steering = abs(params['steering_angle'])  # Only need the absolute steering angle

    # Calculate 3 markers that are farther and farther away from the center line
    marker_1 = 0.1 * track_width
    marker_2 = 0.25 * track_width
    marker_3 = 0.5 * track_width

    # Give higher reward if the car is closer to the center line and vice versa
    if distance_from_center <= marker_1:
        reward = 1
    elif distance_from_center <= marker_2:
        reward = 0.5
    elif distance_from_center <= marker_3:
        reward = 0.1
    else:
        reward = 1e-3  # likely crashed / close to off track

    # Steering penalty threshold, change the number based on your action space setting
    ABS_STEERING_THRESHOLD = 15

    # Penalize reward if the car is steering too much
    if steering > ABS_STEERING_THRESHOLD:
        reward *= 0.8

    return float(reward)
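Before uploading a reward function, it can help to call it locally with a few hand-made inputs. The sketch below repeats the function in compact form so that it runs on its own. The sample `params` dictionaries are made up for illustration and contain only the three keys this function reads.

```python
def reward_function(params):
    """Compact copy of the example reward function, for local testing."""
    distance_from_center = params['distance_from_center']
    track_width = params['track_width']
    steering = abs(params['steering_angle'])

    if distance_from_center <= 0.1 * track_width:
        reward = 1
    elif distance_from_center <= 0.25 * track_width:
        reward = 0.5
    elif distance_from_center <= 0.5 * track_width:
        reward = 0.1
    else:
        reward = 1e-3
    if steering > 15:  # same threshold as ABS_STEERING_THRESHOLD
        reward *= 0.8
    return float(reward)


# Hypothetical sample inputs; the simulator passes many more keys.
samples = {
    "centered, straight": {
        'distance_from_center': 0.0, 'track_width': 1.0, 'steering_angle': 0.0},
    "slightly off, hard turn": {
        'distance_from_center': 0.2, 'track_width': 1.0, 'steering_angle': -20.0},
    "nearly off track": {
        'distance_from_center': 0.6, 'track_width': 1.0, 'steering_angle': 5.0},
}

for label, params in samples.items():
    print(f"{label}: {reward_function(params)}")
```

The second sample falls in the 0.5 band and is then scaled by 0.8 for steering beyond 15 degrees, which shows how the two rules combine.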
Edit the training information in the deepracer-for-cloud/custom_files/hyperparameters.json file, for example:
{
    "batch_size": 64,
    "beta_entropy": 0.01,
    "discount_factor": 0.995,
    "e_greedy_value": 0.05,
    "epsilon_steps": 10000,
    "exploration_type": "categorical",
    "loss_type": "huber",
    "lr": 0.0003,
    "num_episodes_between_training": 20,
    "num_epochs": 10,
    "stack_size": 1,
    "term_cond_avg_score": 350.0,
    "term_cond_max_episodes": 1000,
    "sac_alpha": 0.2
}
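A malformed hyperparameters.json is easy to produce when editing by hand. The following sketch parses the file content and applies a few basic sanity checks. The rules are my own assumptions for illustration, not DeepRacer's validation logic, and `check_hyperparameters` is a hypothetical helper.

```python
import json

# A shortened example; the real file contains more keys.
EXAMPLE = """
{
    "batch_size": 64,
    "discount_factor": 0.995,
    "lr": 0.0003,
    "num_epochs": 10
}
"""


def check_hyperparameters(text):
    """Return a list of problems; an empty list means none were found."""
    hp = json.loads(text)  # raises ValueError if the JSON is malformed
    problems = []
    batch = hp.get("batch_size")
    if not (isinstance(batch, int) and batch > 0):
        problems.append("batch_size should be a positive integer")
    if not 0 < hp.get("discount_factor", 0) <= 1:
        problems.append("discount_factor should be in (0, 1]")
    if not hp.get("lr", 0) > 0:
        problems.append("lr should be positive")
    if not hp.get("num_epochs", 0) >= 1:
        problems.append("num_epochs should be at least 1")
    return problems


print(check_hyperparameters(EXAMPLE))
```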
Edit the car information, including the action space, sensors, and neural network type, in the deepracer-for-cloud/custom_files/model_metadata.json file, for example:
{
    "action_space": [
        {"steering_angle": -30, "speed": 0.6},
        {"steering_angle": -15, "speed": 0.6},
        {"steering_angle": 0, "speed": 0.6},
        {"steering_angle": 15, "speed": 0.6},
        {"steering_angle": 30, "speed": 0.6}
    ],
    "sensor": ["FRONT_FACING_CAMERA"],
    "neural_network": "DEEP_CONVOLUTIONAL_NETWORK_SHALLOW",
    "training_algorithm": "clipped_ppo",
    "action_space_type": "discrete",
    "version": "3"
}
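Writing a larger discrete action space by hand is tedious. As a sketch, the `action_space` list can be generated from the steering angles and speeds you want to combine. `discrete_action_space` is a hypothetical helper, and the values below reproduce the example above.

```python
import json


def discrete_action_space(steering_angles, speeds):
    """Combine every speed with every steering angle into one action list."""
    return [
        {"steering_angle": angle, "speed": speed}
        for speed in speeds
        for angle in steering_angles
    ]


actions = discrete_action_space([-30, -15, 0, 15, 30], [0.6])
# Paste the printed list into model_metadata.json under "action_space".
print(json.dumps({"action_space": actions}, indent=4))
```

Adding a second speed, for example `[0.6, 1.0]`, doubles the number of actions, so the steering threshold in the reward function may need revisiting when the action space changes.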
- Step-6. Edit the environment file run.env
- ①.Add bucket information
Edit the deepracer-for-cloud/run.env file and add the following:
DR_LOCAL_S3_BUCKET=<created bucket name>
DR_UPLOAD_S3_BUCKET=<created bucket name>
Alternatively, run the following commands, replacing <created bucket name> with the name of the S3 bucket you created earlier:
sed -i '1i\DR_LOCAL_S3_BUCKET=<created bucket name>' run.env
sed -i '1i\DR_UPLOAD_S3_BUCKET=<created bucket name>' run.env
The bucket I created for this demonstration is named deepracer-demo-bk.
- ②. Edit track information
I use the re:Invent 2018 track here, whose DR_WORLD_NAME is reinvent_base. Find DR_WORLD_NAME in the run.env file and change its value accordingly.
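The run.env edits in this step can also be scripted. The sketch below is an alternative to editing by hand or with sed: it replaces an existing KEY=... line in place and appends the key only if it is missing, so a key never ends up defined twice. Both helper names are hypothetical.

```python
def set_env_var(text: str, key: str, value: str) -> str:
    """Return run.env content with key set to value.

    An existing KEY=... line is replaced in place; otherwise a new line
    is appended. Comments and other keys are left untouched.
    """
    lines = text.splitlines()
    new_line = f"{key}={value}"
    for i, line in enumerate(lines):
        if line.strip().startswith(f"{key}="):
            lines[i] = new_line
            break
    else:
        lines.append(new_line)
    return "\n".join(lines) + "\n"


def update_run_env(path: str, settings: dict) -> None:
    """Apply several settings to the file at path."""
    with open(path) as fh:
        text = fh.read()
    for key, value in settings.items():
        text = set_env_var(text, key, value)
    with open(path, "w") as fh:
        fh.write(text)
```

For this walkthrough the call would be `update_run_env("run.env", {"DR_LOCAL_S3_BUCKET": "deepracer-demo-bk", "DR_UPLOAD_S3_BUCKET": "deepracer-demo-bk", "DR_WORLD_NAME": "reinvent_base"})`, run from the deepracer-for-cloud directory.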
- Step-7. Update python version
The image ships with Python 3.6, which is no longer supported here, so Python needs to be updated. Run the following commands to update python3:
### Install Python 3.8
sudo apt-get -y install python3.8
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 2
- Step-8. Update configuration
### Run the following command to apply the configuration
dr-update
- Step-9. Upload the custom files to the S3 bucket
dr-upload-custom-files
At this point, the custom-files folder in the S3 bucket should contain the uploaded files.
- Step-10. Start training
Execute the following command to start training
dr-start-training
At this point, your model has started training on EC2
6. Follow-up operations (retraining)
1. If the instance is terminated, launch a new instance and continue the last training
- ①. Modify the run.env file and set the following parameters
### Folder in the S3 bucket where this training run is stored
DR_LOCAL_S3_MODEL_PREFIX=<folder in the S3 bucket for this training>
### Whether training or evaluation should be based on the model created in a previous session
DR_LOCAL_S3_PRETRAINED=True
### Folder of the previous training run; check the folder name in the S3 bucket
DR_LOCAL_S3_PRETRAINED_PREFIX=<folder in the S3 bucket of the last training>
### Checkpoint of the previous training to start from; the default is last
DR_LOCAL_S3_PRETRAINED_CHECKPOINT=best
Check the folder names in the S3 bucket to find the prefix of the previous training.
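A typo in these four settings is easy to miss, so a quick consistency check before starting can save a failed run. The sketch below parses run.env content and flags obvious mistakes. The rules are assumptions drawn from the settings described here (the prefixes should differ, and the checkpoint is either best or last), and the folder names in the example are hypothetical.

```python
def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blanks and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env


def check_retraining(env: dict) -> list:
    """Return a list of problems with the retraining settings."""
    problems = []
    if env.get("DR_LOCAL_S3_PRETRAINED") != "True":
        problems.append("DR_LOCAL_S3_PRETRAINED should be True")
    previous = env.get("DR_LOCAL_S3_PRETRAINED_PREFIX", "")
    current = env.get("DR_LOCAL_S3_MODEL_PREFIX", "")
    if not previous:
        problems.append("DR_LOCAL_S3_PRETRAINED_PREFIX is not set")
    elif previous == current:
        # Assumption: the new run should not write into the old folder.
        problems.append("model prefix should differ from pretrained prefix")
    checkpoint = env.get("DR_LOCAL_S3_PRETRAINED_CHECKPOINT", "last")
    if checkpoint not in ("best", "last"):
        problems.append("checkpoint should be best or last")
    return problems


example = """
### Hypothetical folder names
DR_LOCAL_S3_MODEL_PREFIX=demo-model-2
DR_LOCAL_S3_PRETRAINED=True
DR_LOCAL_S3_PRETRAINED_PREFIX=demo-model-1
DR_LOCAL_S3_PRETRAINED_CHECKPOINT=best
"""
print(check_retraining(parse_env(example)))
```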
- ②. Update to make this configuration take effect
dr-update-env
dr-increment-training
- ③. If you have modified the files in custom_files, please execute the following command to upload custom_files again
dr-upload-custom-files
- ④. Start training
dr-start-training
If prompted that Sagemaker is not running, please execute dr-start-training -w
dr-start-training -w
7. run.env parameters and dr- command interpretation
For the specific parameters and commands, refer to the Deepracer-for-Cloud GitHub repository.
8. Problems encountered
1. NO_PUBKEY
W: GPG error: http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 Release: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY F60F4B3D7FA2AF80
E: The repository 'http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 Release' is not signed.
N: Updating from such a repository can't be done securely, and is therefore disabled by default.
N: See apt-secure(8) manpage for repository creation and user configuration details.
Solution: because Ubuntu 18.04 is used here, import the key for the ubuntu1804 repository and update the package index:
wget -qO - http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/7fa2af80.pub | sudo apt-key add -
sudo apt-get update
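The file name 7fa2af80.pub in the command is not arbitrary: it matches the last eight hex digits of the key ID that apt reports after NO_PUBKEY. The sketch below extracts the key ID from the error text and derives that short form. The helper names are hypothetical.

```python
import re


def missing_key_ids(apt_output: str) -> list:
    """Extract the 16-hex-digit key IDs that apt prints after NO_PUBKEY."""
    return re.findall(r"NO_PUBKEY ([0-9A-F]{16})", apt_output)


def short_id(key_id: str) -> str:
    """Last 8 hex digits in lowercase, as used in the .pub file name."""
    return key_id[-8:].lower()


error_text = (
    "W: GPG error: ... the public key is not available: "
    "NO_PUBKEY F60F4B3D7FA2AF80"
)
for key_id in missing_key_ids(error_text):
    print(key_id, "->", short_id(key_id) + ".pub")
```

If apt reports a different key ID on your instance, the key file to import is likely a different one as well, so check the repository for a matching name rather than reusing this one.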
2. Python3.6 reports an error, update the python version
sudo apt-get install python3.8
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 2
Note: the syntax is update-alternatives --install link name path priority, where:
- --install registers a new alternative with update-alternatives.
- link is the generic path shared by all programs that provide the same function, such as /usr/bin/java (an absolute path is required). After registration it is a symbolic link that leads to the real command, and later management operates on this link.
- name is the name of the alternative group, such as java; it is the key used for later management.
- path is the absolute path of the actual command being registered, here /usr/bin/python3.8.
- priority is an integer; the higher the number, the higher the priority. In automatic mode the system uses the alternative with the highest priority, so if the link already has other alternatives, the new one needs a higher priority than the current ones to be selected. A common convention is to derive the number from the version.
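The priority rule can be illustrated in a few lines. This is only a model of the selection rule in automatic mode, not how update-alternatives is implemented, and the priority of 1 for python3.6 is a hypothetical value for illustration.

```python
def auto_choice(alternatives: dict) -> str:
    """Return the path with the highest priority (automatic mode)."""
    return max(alternatives, key=alternatives.get)


# Hypothetical registrations for the python3 link: path -> priority.
registered = {
    "/usr/bin/python3.6": 1,
    "/usr/bin/python3.8": 2,
}
print(auto_choice(registered))
```

With these values, /usr/bin/python3.8 is selected because 2 is greater than 1. Registering python3.6 later with a priority of 3 would switch the link back.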
3. Unable to acquire lock
E: Could not get lock /var/lib/dpkg/lock-frontend - open (11: Resource temporarily unavailable)
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?
Solution:
sudo rm /var/lib/dpkg/lock-frontend
sudo rm /var/lib/dpkg/lock
4. If prompted that Sagemaker is not running, please execute dr-start-training -w
dr-start-training -w
Hope this basic tutorial helps you!