Using Amazon EC2 to reduce DeepRacer training costs: DeepRacer-for-Cloud in practice

Official AWS blog post on DeepRacer-for-Cloud: https://aws.amazon.com/cn/blogs/china/use-amazon-ec2-to-further-reduce-the-cost-of-deepracer-training/

Foreword

The blog post already describes the overall approach, so this article extracts its scripts, runs them end to end, and fixes the problems that came up along the way.

The image used here is Deep Learning AMI (Ubuntu 18.04) Version 60.1.

For instance preparation, the blog post recommends G and P series instances:

g4dn.2xlarge: the cost-effective option. With GPU acceleration, training is slightly faster than in the DeepRacer console.

p3.2xlarge: much faster than DeepRacer console training, so you can iterate on a model quickly.

This walkthrough uses g4dn.2xlarge, requested as a Spot Instance rather than On-Demand, which can cut the cost by 70% or more. Spot has drawbacks, though: G and P series capacity in us-east-1 is often badly constrained, so an instance may be reclaimed after running for a while. To work around this, set up the base environment once and create an image from it. Later you can launch a new instance from that image quickly, without repeating most of the base configuration.

Also check your EC2 limits. By default an account has no quota for G and P series instances, so you need to open a support case to request an increase. Do not ask for too much at once, or the request may be rejected.

For convenience I created an EC2 launch template, which also lays the groundwork for later training runs and avoids manual configuration each time.

Article outline:

1. Create an IAM role for EC2
1. Open the IAM console
2. Select EC2 and click Next
3. Add permissions
4. Name, review, and create
5. View the created role

2. Create a bucket
1. Enter the S3 console
2. Create a bucket
3. View the created bucket

3. Create a launch template
1. Select the image
2. Set the instance type and key pair
3. Define subnets and security groups
4. Configure storage
5. Advanced details
6. Review the summary and create the template

4. Create an instance
1. Select Launch instance from template
2. View Spot requests

5. Connect to the instance and build the base environment
Basic. Connect to your instance

①. Import the key into SecureCRT
②. Connect to the instance with SecureCRT

Next, build the base environment
Step-1. Pull the code
Step-2. Install the base components required for DeepRacer local training

Error Scenario 1 and solution
Error Scenario 2 and solution

Step-3. Reconnect to the EC2 instance and run the second-stage environment initialization
Step-4. Load the scripts required to train DeepRacer
Step-5. Edit the reward function, training settings, and car settings
Step-6. Edit the environment file run.env

①. Add bucket information
②. Edit track information

Step-7. Update the Python version
Step-8. Update the configuration
Step-9. Upload the custom files to the S3 bucket
Step-10. Start training

6. Follow-up operations (retraining)
1. If the instance is terminated, launch a new instance and continue the previous training

①. Modify the run.env file
②. Apply the configuration
③. If you modified any files in custom_files, upload custom_files again
④. Start training

7. Parameters and command reference

8. Problems encountered
1. NO_PUBKEY
2. Python 3.6 reports an error; update the Python version
3. Unable to acquire lock
4. Sagemaker is not running

The detailed steps follow.

1. Create an IAM role for EC2

As the official AWS blog post explains, training the model on EC2 uses the following three services:

AmazonKinesisVideoStreams
CloudWatch
S3

So the first step is to create a new EC2 role and grant it access to them.

1. Open the IAM console

IAM console, Create role page: https://us-east-1.console.aws.amazon.com/iamv2/home#/roles/create?step=selectEntities

2. Select EC2 and click Next

3. Add permissions

Search for S3, CloudWatch, and AmazonKinesisVideoStreams in turn. For each one, select the policy whose name ends in FullAccess, which grants full access to that service.

After all three policies are added, click Next.

4. Name, review, and create

①. Set the role name

②. Review the permissions attached to the role

③. Create the role

Add tags if you need them (optional).

5. View the created role

The new role now appears in the IAM role list.
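
For readers who prefer a script to the console, the same role could be created with boto3. This is a minimal sketch and not the procedure from the official blog post. It assumes boto3 is installed with credentials that are allowed to create IAM roles; the role name `deepracer-ec2-role` is a placeholder, and the three ARNs are the AWS managed FullAccess policies for the services listed above.

```python
import json

# Placeholder name, not taken from the article.
ROLE_NAME = "deepracer-ec2-role"

# AWS managed policies for the three services used during training.
POLICY_ARNS = [
    "arn:aws:iam::aws:policy/AmazonS3FullAccess",
    "arn:aws:iam::aws:policy/CloudWatchFullAccess",
    "arn:aws:iam::aws:policy/AmazonKinesisVideoStreamsFullAccess",
]


def ec2_trust_policy():
    """Trust policy that lets EC2 instances assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "ec2.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def create_role(role_name=ROLE_NAME):
    """Create the role, attach the policies, and add an instance profile."""
    import boto3  # imported here so the helpers above work without boto3

    iam = boto3.client("iam")
    iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(ec2_trust_policy()),
    )
    for arn in POLICY_ARNS:
        iam.attach_role_policy(RoleName=role_name, PolicyArn=arn)
    # The console creates an instance profile automatically; the API does not.
    iam.create_instance_profile(InstanceProfileName=role_name)
    iam.add_role_to_instance_profile(
        InstanceProfileName=role_name, RoleName=role_name
    )
```

Calling `create_role()` once is enough. The explicit instance profile matters because EC2 attaches an instance profile, not the role itself.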

2. Create a bucket

1. Enter the S3 console

S3 Bucket Control Panel: https://s3.console.aws.amazon.com/s3/

2. Create a bucket

Only the bucket name and region need to be set here; leave the other settings at their defaults.

Scroll to the bottom and click Create Bucket.

3. View the created bucket

The new bucket now appears in the bucket list.
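
The bucket can also be created from a script. The sketch below assumes boto3 is installed and configured; the function names are my own. The one detail worth showing is that S3 expects no location constraint for us-east-1, while every other region must be passed explicitly.

```python
def create_bucket_kwargs(bucket_name, region):
    """Build keyword arguments for the S3 CreateBucket call."""
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the default location and must not be sent as a constraint.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


def create_bucket(bucket_name, region="us-east-1"):
    """Create the bucket, leaving every other setting at its default."""
    import boto3  # assumed to be installed and configured

    s3 = boto3.client("s3", region_name=region)
    return s3.create_bucket(**create_bucket_kwargs(bucket_name, region))
```

Bucket names are globally unique, so pick your own name rather than reusing one from this article.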

3. Create a launch template

EC2 Launch Templates dashboard: https://us-east-1.console.aws.amazon.com/ec2/v2/home?region=us-east-1#LaunchTemplates

1. Select the image

Select the Deep Learning AMI (Ubuntu 18.04) image.

2. Set the instance type and key pair

The console advises against including these two parameters in the template, but the goal here is to save time on later deployments, so both are defined in the template in advance.

3. Define subnets and security groups

Because this walkthrough uses a Spot request, no subnet is set, so the Spot request can be fulfilled in any Availability Zone of us-east-1.

4. Configure storage

The official blog post requires a root volume with at least 150 GiB of space; 160 GiB is used here. Note that the AMI itself shows a 225 GB storage volume.

5. Advanced details

I ticked the Spot Instance request option here to reduce cost; whether you tick it depends on your own situation. Then, for the IAM role, choose the role created for EC2 earlier.

6. Review the summary and create the template

The created template can then be viewed under Launch Templates.
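
The template settings above map fairly directly onto the data structure that the EC2 API accepts. The following is a sketch under assumptions, not the article's method: the function names are hypothetical, `/dev/sda1` is assumed to be the root device of the Ubuntu AMI, and the AMI ID, key pair name, and instance profile name must be supplied by you. EC2 rejects a root volume smaller than the AMI's snapshot, so pass a size that fits your AMI.

```python
def launch_template_data(ami_id, key_name, instance_profile_name, root_gib,
                         instance_type="g4dn.2xlarge",
                         root_device="/dev/sda1", use_spot=True):
    """Build the LaunchTemplateData structure for the training instance."""
    data = {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "KeyName": key_name,
        "IamInstanceProfile": {"Name": instance_profile_name},
        "BlockDeviceMappings": [
            {
                # Assumed root device name; check it against your AMI.
                "DeviceName": root_device,
                "Ebs": {"VolumeSize": root_gib, "DeleteOnTermination": True},
            }
        ],
    }
    if use_spot:
        # Equivalent to ticking the Spot request option in the console.
        data["InstanceMarketOptions"] = {"MarketType": "spot"}
    return data


def create_launch_template(name, data):
    """Register the template; assumes boto3 is installed and configured."""
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    return ec2.create_launch_template(
        LaunchTemplateName=name, LaunchTemplateData=data
    )
```

No subnet appears in the structure, which matches the choice above to let the Spot request land in any Availability Zone.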

4. Create an instance

EC2 Instances console: https://us-east-1.console.aws.amazon.com/ec2/v2/home?region=us-east-1#Instances

1. Select Launch instance from template

Select the template created earlier.

After confirming the settings, launch the instance.

Check that the instance was created successfully. If you ticked the Spot request option and the launch fails with an error saying there is no Spot quota, even though you have confirmed that your account limit is sufficient, there may simply be no Spot capacity available to you in this region at the moment. In that case, create the instance as On-Demand instead.

2. View Spot requests

If you ticked the Spot request option in the launch template and the instance started successfully, you can see it on the Spot Requests page.

5. Connect to the instance and build the base environment

The installation steps are still those given in the official AWS blog post.

First, a very basic demonstration of how to connect to EC2. I use the SecureCRT client here.

Basic. Connect to your instance

Only .pem keys (the format shared with OpenSSH) are supported here. If your key is in .ppk format, connect with PuTTY instead.

①. Import the key into SecureCRT

②. Connect to the instance with SecureCRT

Because the image is Ubuntu, the username is ubuntu.

After connecting, if a prompt box pops up, click to accept and save.

Once SecureCRT has connected to the instance successfully, move on.

Next, build the base environment

The DeepRacer-for-Cloud installation and training commands are as follows.

Step-1. Pull the code

On the EC2 instance, run the following command to pull the code from GitHub:

git clone https://github.com/aws-deepracer-community/deepracer-for-cloud.git

Step-2. Install the base components required for DeepRacer local training

Run the first-stage environment preparation script, which installs the base components required for DeepRacer local training, and then reboot the EC2 instance. The official blog post gives only these two lines, but because package versions have changed since it was written, running them now surfaces some new problems.

cd deepracer-for-cloud && ./bin/prepare.sh
sudo reboot

The extra steps below cover the failures I actually ran into. The first is a NO_PUBKEY error.

Error Scenario 1 and solution
### Add the public key
wget -qO - http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/7fa2af80.pub | sudo apt-key add -
### Update the package index
sudo apt-get update

### Run the script again. Make sure you are not inside the deepracer-for-cloud directory; go back to the ubuntu home directory first

cd deepracer-for-cloud && ./bin/prepare.sh
### Reboot
sudo reboot

If a progress bar appears, the error has been fixed and the base environment is being installed.

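Error Scenario 2 and solution

If no progress bar appears and the following error is shown instead (it may appear several times):

E: Could not get lock /var/lib/dpkg/lock-frontend - open (11: Resource temporarily unavailable)
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?

The message means another apt or dpkg process holds the lock. If that process is still running, wait for it to finish, because deleting the lock underneath a running package operation can leave the package database inconsistent. If nothing else is running, remove the stale lock files and run the base environment installation again:

sudo rm /var/lib/dpkg/lock-frontend
sudo rm /var/lib/dpkg/lock

### Run the script again. Make sure you are not inside the deepracer-for-cloud directory; go back to the ubuntu home directory first

cd deepracer-for-cloud && ./bin/prepare.sh

### Reboot

sudo reboot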
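
As an alternative to deleting the lock files straight away, a script can wait until the lock is released. The sketch below is my own addition, not part of the official steps. It assumes that apt and dpkg guard these files with POSIX fcntl locks, which is my understanding of their behavior on Ubuntu 18.04, and it has to run as root to open files under /var/lib/dpkg. The function names and the timeout values are arbitrary choices.

```python
import fcntl
import time


def lock_is_free(path):
    """Return True if no other process holds an fcntl lock on `path`."""
    # Append mode gives a writable handle without truncating the file.
    with open(path, "a") as handle:
        try:
            fcntl.lockf(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            return False  # another process owns the lock
        fcntl.lockf(handle, fcntl.LOCK_UN)
        return True


def wait_for_lock(path="/var/lib/dpkg/lock-frontend", timeout=600, poll=5):
    """Poll until the lock is free; return False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while not lock_is_free(path):
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
    return True
```

If `wait_for_lock()` returns True, prepare.sh can be run again without touching the lock files. If it returns False, something is stuck and removing the locks by hand is the fallback.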

Step-3. Reconnect to the EC2 instance and run the second-stage environment initialization

### Make sure you are in the ubuntu home directory

cd deepracer-for-cloud/ && bin/init.sh -c aws -a gpu

Initialization pulls a large number of container images, so please wait patiently for it to finish.

When the environment initialization completes, continue with the next step.

Step-4. Load the scripts required to train DeepRacer

source bin/activate.sh

Step-5. Edit the reward function, training settings, and car settings

The deepracer-for-cloud directory contains a custom_files directory with three files:

reward_function.py #reward function file
hyperparameters.json #training settings file
model_metadata.json #car settings file

Edit the reward function in the deepracer-for-cloud/custom_files/reward_function.py file, for example:

def reward_function(params):
    '''
    Example of penalizing steering, which helps mitigate zig-zag behaviors
    '''

    # Read input parameters
    distance_from_center = params['distance_from_center']
    track_width = params['track_width']
    steering = abs(params['steering_angle'])  # Only need the absolute steering angle

    # Calculate 3 markers that are farther and farther away from the center line
    marker_1 = 0.1 * track_width
    marker_2 = 0.25 * track_width
    marker_3 = 0.5 * track_width

    # Give higher reward if the car is closer to center line and vice versa
    if distance_from_center <= marker_1:
        reward = 1
    elif distance_from_center <= marker_2:
        reward = 0.5
    elif distance_from_center <= marker_3:
        reward = 0.1
    else:
        reward = 1e-3  # likely crashed / close to off track

    # Steering penalty threshold, change the number based on your action space setting
    ABS_STEERING_THRESHOLD = 15

    # Penalize reward if the car is steering too much
    if steering > ABS_STEERING_THRESHOLD:
        reward *= 0.8

    return float(reward)
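
Before uploading, it can be worth running the reward function locally to catch syntax or indentation mistakes early. The harness below is a sketch of my own, not part of DeepRacer-for-Cloud: the helper names are hypothetical, and the sample dictionaries contain only the three keys this reward function reads, whereas the simulator supplies many more.

```python
import importlib.util


def load_reward_function(path="custom_files/reward_function.py"):
    """Import reward_function from a file path without installing it."""
    spec = importlib.util.spec_from_file_location("reward_function", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.reward_function


def smoke_test(reward_function):
    """Call the function on a few hand-made inputs and return the rewards."""
    samples = [
        {"distance_from_center": 0.0, "track_width": 1.0, "steering_angle": 0.0},
        {"distance_from_center": 0.2, "track_width": 1.0, "steering_angle": -20.0},
        {"distance_from_center": 0.6, "track_width": 1.0, "steering_angle": 30.0},
    ]
    rewards = [reward_function(params) for params in samples]
    for reward in rewards:
        # Sanity check chosen for this example: a positive float is expected.
        if not isinstance(reward, float) or reward <= 0:
            raise ValueError(f"unexpected reward: {reward!r}")
    return rewards
```

Run from the deepracer-for-cloud directory, `smoke_test(load_reward_function())` should return three rewards that decrease as the car moves away from the center line.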

Edit the training settings in the deepracer-for-cloud/custom_files/hyperparameters.json file, for example:

{
  "batch_size": 64,
  "beta_entropy": 0.01,
  "discount_factor": 0.995,
  "e_greedy_value": 0.05,
  "epsilon_steps": 10000,
  "exploration_type": "categorical",
  "loss_type": "huber",
  "lr": 0.0003,
  "num_episodes_between_training": 20,
  "num_epochs": 10,
  "stack_size": 1,
  "term_cond_avg_score": 350.0,
  "term_cond_max_episodes": 1000,
  "sac_alpha": 0.2
}

Edit the car settings in the deepracer-for-cloud/custom_files/model_metadata.json file, including the action space, sensors, and neural network type, for example:

{
  "action_space": [
      {
          "steering_angle": -30,
          "speed": 0.6
      },
      {
          "steering_angle": -15,
          "speed": 0.6
      },
      {
          "steering_angle": 0,
          "speed": 0.6
      },
      {
          "steering_angle": 15,
          "speed": 0.6
      },
      {
          "steering_angle": 30,
          "speed": 0.6
      }
  ],
  "sensor": ["FRONT_FACING_CAMERA"],
  "neural_network": "DEEP_CONVOLUTIONAL_NETWORK_SHALLOW",
  "training_algorithm": "clipped_ppo",
  "action_space_type": "discrete",
  "version": "3"
}
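
A stray comma in model_metadata.json only shows up once training starts, so a quick local check can save a round trip. The sketch below is an assumption-laden convenience, not an official schema: the list of required keys is simply taken from the example above, and the function name is hypothetical.

```python
import json

# Keys taken from the example file above, not from an official schema.
REQUIRED_KEYS = ("action_space", "sensor", "neural_network",
                 "training_algorithm", "action_space_type", "version")


def check_model_metadata(text):
    """Return a list of problems found in model_metadata.json content."""
    try:
        data = json.loads(text)
    except ValueError as error:
        return [f"not valid JSON: {error}"]
    if not isinstance(data, dict):
        return ["top-level value must be a JSON object"]
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in data]
    if data.get("action_space_type") == "discrete":
        # Each discrete action in the example has a steering angle and a speed.
        for index, action in enumerate(data.get("action_space", [])):
            missing = {"steering_angle", "speed"} - set(action)
            if missing:
                problems.append(f"action {index} lacks {sorted(missing)}")
    return problems
```

An empty list means the file at least has the same shape as the example; it says nothing about whether the values train well.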

Step-6. Edit the environment file run.env

①. Add bucket information

Edit the deepracer-for-cloud/run.env file and add the following:

DR_LOCAL_S3_BUCKET=<created bucket name>
DR_UPLOAD_S3_BUCKET=<created bucket name>

Alternatively, run the following commands, replacing <created bucket name> with the name of the S3 bucket you created earlier:

sed -i '1i\DR_LOCAL_S3_BUCKET=<created bucket name>' run.env
sed -i '1i\DR_UPLOAD_S3_BUCKET=<created bucket name>' run.env

The bucket I created for this demonstration is named deepracer-demo-bk.

②. Edit track information

I am using the re:Invent 2018 track here, whose DR_WORLD_NAME is reinvent_base. Find DR_WORLD_NAME in the run.env file and change its value accordingly.
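
Editing run.env by hand or with sed works, but a small helper can update a key in place when it already exists instead of adding a second definition. This is a sketch of my own, not a DeepRacer-for-Cloud tool: the function names are hypothetical, and only the variable names come from the steps above.

```python
def set_env_values(text, updates):
    """Replace the first KEY= line for each key in `updates`, or append it."""
    remaining = dict(updates)
    lines = []
    for line in text.splitlines():
        key = line.split("=", 1)[0].strip()
        is_assignment = "=" in line and not line.lstrip().startswith("#")
        if is_assignment and key in remaining:
            lines.append(f"{key}={remaining.pop(key)}")
        else:
            lines.append(line)  # comments and unrelated keys stay untouched
    # Keys that were not present yet are appended at the end.
    lines.extend(f"{key}={value}" for key, value in remaining.items())
    return "\n".join(lines) + "\n"


def update_env_file(path, updates):
    """Apply `updates` to the env file at `path`."""
    with open(path) as handle:
        text = handle.read()
    with open(path, "w") as handle:
        handle.write(set_env_values(text, updates))
```

For the values used in this article, the call would be `update_env_file("run.env", {"DR_LOCAL_S3_BUCKET": "deepracer-demo-bk", "DR_UPLOAD_S3_BUCKET": "deepracer-demo-bk", "DR_WORLD_NAME": "reinvent_base"})`.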


Step-7. Update the Python version

The image ships with Python 3.6, which is no longer supported here, so Python needs to be updated. Run the following commands to update python3:

### Install Python 3.8

sudo apt-get -y install python3.8

sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 2
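
To confirm that the python3 command now resolves to the new interpreter, its version output can be checked. The snippet below is a small sketch with hypothetical helper names; the minimum of 3.8 simply mirrors the version installed above.

```python
import re
import subprocess


def parse_python_version(output):
    """Turn text such as 'Python 3.8.10' into a tuple like (3, 8, 10)."""
    match = re.search(r"Python (\d+)\.(\d+)(?:\.(\d+))?", output)
    if match is None:
        raise ValueError(f"no Python version found in {output!r}")
    return tuple(int(part) for part in match.groups(default="0"))


def python3_is_at_least(minimum=(3, 8)):
    """Run `python3 --version` and compare major.minor against `minimum`."""
    result = subprocess.run(
        ["python3", "--version"], capture_output=True, text=True, check=True
    )
    version = parse_python_version(result.stdout + result.stderr)
    return version[:2] >= tuple(minimum)
```

If `python3_is_at_least()` returns False after the commands above, the update-alternatives priority is the first thing to check.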

Step-8. Update the configuration

### Run the following command to apply the configuration

dr-update

Step-9. Upload the custom files to the S3 bucket

dr-upload-custom-files

At this point the custom files folder in the S3 bucket should contain the uploaded files.

Step-10. Start training

Run the following command to start training:

dr-start-training

At this point, your model has started training on EC2.

6. Follow-up operations (retraining)

1. If the instance is terminated, launch a new instance and continue the previous training

①. Modify the following parameters in the run.env file

### Folder in the S3 bucket where this training run will be stored
DR_LOCAL_S3_MODEL_PREFIX=<The directory where the S3 bucket is stored for this training>
### Whether training or evaluation should be based on the model created in a previous session
DR_LOCAL_S3_PRETRAINED=True
### Folder of the previous training run; check the folder name in the S3 bucket
DR_LOCAL_S3_PRETRAINED_PREFIX=<The directory where the last training S3 bucket is stored>
### Checkpoint of the previous training run to start from (the default is last)
DR_LOCAL_S3_PRETRAINED_CHECKPOINT=best

To find the folder names, look at the folders in the S3 bucket.
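
The folder names can also be listed from a script instead of the S3 console. This is a minimal sketch assuming boto3 is installed and the instance role grants S3 access; the function names are hypothetical. It relies on `list_objects_v2` returning top-level folders under `CommonPrefixes` when a `/` delimiter is given.

```python
def model_prefixes(response):
    """Extract top-level folder names from a list_objects_v2 response."""
    return sorted(
        entry["Prefix"].rstrip("/")
        for entry in response.get("CommonPrefixes", [])
    )


def list_model_prefixes(bucket):
    """List the top-level folders of `bucket`."""
    import boto3  # assumed to be installed and configured

    s3 = boto3.client("s3")
    response = s3.list_objects_v2(Bucket=bucket, Delimiter="/")
    return model_prefixes(response)
```

One of the returned names is the folder of the previous run, which is the value to put in DR_LOCAL_S3_PRETRAINED_PREFIX.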

②. Apply the configuration

dr-update-env

dr-increment-training

③. If you modified any files in custom_files, run the following command to upload custom_files again

dr-upload-custom-files

④. Start training

dr-start-training

If you are told that Sagemaker is not running, run dr-start-training -w:

dr-start-training -w

7. run.env parameters and dr- command reference

For the specific parameters, see the DeepRacer-for-Cloud GitHub repository.

8. Problems encountered

1. NO_PUBKEY

W: GPG error: http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 Release: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY F60F4B3D7FA2AF80
E: The repository ' http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 Release' is not signed.
N: Updating from such a repository can't be done securely, and is therefore disabled by default.
N: See apt-secure(8) manpage for repository creation and user configuration details.

Solution: because Ubuntu 18.04 is used here, add the repository key for ubuntu1804 and then update the package index:

wget -qO - http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/7fa2af80.pub | sudo apt-key add -

sudo apt-get update


2. Python 3.6 reports an error; update the Python version

sudo apt-get install python3.8
sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.8 2

Note:
The command has the form: update-alternatives --install link name path priority

--install registers a new alternative with update-alternatives.

link is the generic path shared by programs that provide the same function, /usr/bin/python3 here (an absolute path is required). After the command succeeds, this path is a symbolic link to the real command, and later management operates on that link.

name is the name of the alternative, python3 here. It is what later management commands refer to.

path is the absolute path of the command being registered, /usr/bin/python3.8 here.

priority is an integer; the higher the number, the higher the priority. In automatic mode the system enables the alternative with the highest priority, so when the link already exists, the new priority must be higher than the current one for the new command to be selected.

3. Unable to acquire lock

E: Could not get lock /var/lib/dpkg/lock-frontend - open (11: Resource temporarily unavailable)
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?

Solution:
$ sudo rm /var/lib/dpkg/lock-frontend
$ sudo rm /var/lib/dpkg/lock


4. Sagemaker is not running

If you are told that Sagemaker is not running, run dr-start-training -w:

dr-start-training -w

Hope this basic tutorial helps you!
