Deep Learning and CV Tutorial (8) | Introduction to Common Deep Learning Frameworks - ShowMeAI Research Center

Author: Han Xinzi @ShowMeAI
Tutorial address: https://www.showmeai.tech/tutorials/37
Address of this article: https://www.showmeai.tech/article-detail/267
Disclaimer: All rights reserved, please contact the platform and the author for reprinting and indicate the source

Deep Learning and Computer Vision

This series is a complete set of study notes for Stanford CS231n, "Deep Learning for Computer Vision"; the corresponding course videos are available online. See the end of the article for more information.

Introduction

The previous articles covered the principles of neural networks and practical techniques for training them. This article introduces deep learning hardware and the mainstream deep learning frameworks TensorFlow and PyTorch; with these tools you can actually build and train neural networks.

The focus of this article

Deep Learning Hardware
- CPU, GPU, TPU
Deep Learning Framework
- PyTorch/TensorFlow
Static and Dynamic Computational Graphs

1. Deep Learning Hardware

A GPU (Graphics Processing Unit), commonly sold as a graphics card, is physically much larger than a CPU (Central Processing Unit) and has its own cooling system. GPUs were originally designed to render computer graphics, especially for games. For deep learning, NVIDIA graphics cards are the usual choice; AMD graphics cards tend to cause many more problems in practice. TPUs (Tensor Processing Units) are hardware dedicated to deep learning.

1.1 CPU/GPU/TPU

Deep learning hardware; CPU / GPU / TPU

A CPU generally has multiple cores, each of which is very fast and can work independently, so it can run several processes at the same time. It shares memory with the system and is well suited to sequential tasks. The CPU in the figure reaches about 540 GFLOPS with 32-bit floating point numbers (Note: one GFLOPS (gigaFLOPS) equals one billion (\(=10^9\)) floating point operations per second).
A GPU has thousands of cores, but each core is slow and cannot work independently, so a GPU is suited to large numbers of parallel tasks. GPUs generally come with their own memory and their own cache system. The GPU in the figure runs more than 20 times faster than the CPU.
TPUs are specialized deep learning hardware and run very fast. The TITAN V is not technically a "TPU", since that is a Google term, but both have hardware dedicated to deep learning.
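
To make the GFLOPS figure concrete, the sketch below estimates how long one dense matrix multiplication would take at a given peak rate. It assumes the usual count of 2·n·k·m floating point operations for an (n×k) by (k×m) product, uses the 540 GFLOPS CPU figure quoted above, and treats the GPU as exactly 20 times faster (the text only says "more than 20 times"). Real hardware rarely reaches its peak, so these are idealized lower bounds, not measurements.

```python
def matmul_flops(n, k, m):
    """FLOPs for an (n x k) @ (k x m) product: one multiply and one add per term."""
    return 2 * n * k * m


def seconds_at_peak(flops, gflops):
    """Idealized time in seconds at a peak rate given in GFLOPS (1e9 FLOP/s)."""
    return flops / (gflops * 1e9)


flops = matmul_flops(4096, 4096, 4096)
cpu_time = seconds_at_peak(flops, 540)        # CPU figure quoted in the text
gpu_time = seconds_at_peak(flops, 540 * 20)   # assumption: exactly 20x the CPU rate
print(f"{flops:.3e} FLOPs, CPU ~{cpu_time:.3f} s, GPU ~{gpu_time:.4f} s")
```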

Dividing these speeds by the corresponding prices gives the following graph:

Deep learning hardware; speed per dollar

1.2 Advantages and Applications of GPU

GPUs have obvious advantages in multiplying large matrices.

Advantages of the GPU; accelerating large matrix operations

Each element of the result is the dot product of one row of the first matrix and one column of the second. These dot products are independent of one another, so they can be computed in parallel very quickly. Convolutional neural networks are similar: the dot products of the convolution kernel with each region of the image can also be computed in parallel.

A CPU also has multiple cores, but far fewer of them, so large matrix operations end up being executed mostly serially, which is very slow.
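
The independence of the individual dot products can be checked with a small NumPy sketch. The shapes and the seed are arbitrary choices for illustration; the explicit double loop is written serially here, but no iteration reads the result of another, which is what allows a GPU to run them side by side.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))

# Each output element depends on one row of A and one column of B only,
# so all 3 * 5 dot products could be computed at the same time.
C = np.empty((3, 5))
for i in range(3):
    for j in range(5):
        C[i, j] = np.dot(A[i, :], B[:, j])

# The library routine computes the same result in one vectorized call.
print(np.allclose(C, A @ B))
```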

NVIDIA's CUDA lets you write C-like code that runs directly on the GPU.

However, writing CUDA code by hand is difficult. Fortunately, NVIDIA provides highly optimized libraries that can be used directly: for example, cuBLAS implements many matrix operations, and cuDNN implements CNN forward propagation, backpropagation, batch normalization, and similar operations. An alternative is OpenCL, which also runs on CPUs and AMD GPUs but is less heavily optimized and usually slower. HIP can automatically convert CUDA code into code that runs on AMD hardware. A cross-platform standard may emerge in the future, but for now CUDA is the best choice.

In practice, the GPU is much faster than the CPU on the same computing task (although the CPU code could be optimized further), and using cuDNN is nearly three times faster than not using it.

Advantages of the GPU; CPU vs. GPU

Advantages of cuDNN; running time comparison

Another practical issue is that the model being trained lives on the GPU, while the training data is stored on the hard disk. Because the GPU computes quickly and a mechanical hard disk reads slowly, data loading can become the bottleneck for the whole training run. There are several workarounds:

If the training set is small, read all of the data into RAM;
Replace mechanical hard drives with solid state drives;
Use multiple CPU threads to prefetch data into a buffer for the GPU to consume.
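
The prefetching idea in the last item can be sketched with the standard library alone. This is a minimal illustration, not how any particular framework implements it: `load_batch` is a hypothetical stand-in for a slow disk read, and a single background thread is used for simplicity where a real loader would typically use several workers.

```python
import queue
import threading
import time


def load_batch(i):
    """Hypothetical stand-in for reading one batch from disk."""
    time.sleep(0.01)  # simulate slow I/O
    return [i] * 4


def prefetch(num_batches, buffer_size=2):
    """Yield batches in order while a background thread loads ahead."""
    q = queue.Queue(maxsize=buffer_size)
    sentinel = object()  # marks the end of the stream

    def worker():
        for i in range(num_batches):
            q.put(load_batch(i))  # blocks while the buffer is full
        q.put(sentinel)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        batch = q.get()
        if batch is sentinel:
            return
        yield batch


batches = list(prefetch(5))
print(batches)
```

While the consumer is busy with one batch, the worker is already loading the next ones, so the consumer waits less often.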

2. Deep Learning Software

2.1 Overview of DL Software

There are many deep learning frameworks; at the time of the course, the most popular was TensorFlow.

Most first-generation frameworks came from academia, such as Caffe, which was developed at UC Berkeley.

The second generation is mostly driven by industry; for example, Caffe2 was developed by Facebook. Here we focus on PyTorch and TensorFlow.

Deep learning software; Caffe, PyTorch and TensorFlow

Recall the concept of computational graphs: a linear classifier can be represented as a computational graph, and the more complex the network, the more complex the graph. There are three reasons to use a deep learning framework:

It makes it easy to build large computational graphs, so new ideas can be developed and tested quickly;
It computes gradients automatically; you only need to write the forward propagation code;
It runs efficiently on GPUs by wrapping libraries such as cuDNN and cuBLAS, and it handles moving data between the CPU and the GPU.

This way we do not have to implement all of this from scratch.

Consider, for example, the following computational graph:

Deep learning software; computational graph example

Previously we would write the forward pass in Numpy and then compute the gradients by hand. The code is as follows:

import numpy as np
np.random.seed(0)  # fix the seed so the random numbers are the same on every run

N, D = 3, 4

x = np.random.randn(N, D)
y = np.random.randn(N, D)
z = np.random.randn(N, D)

a = x * y
b = a + z
c = np.sum(b)

grad_c = 1.0
grad_b = grad_c * np.ones((N, D))
grad_a = grad_b.copy()
grad_z = grad_b.copy()
grad_x = grad_a * y
grad_y = grad_a * x

This approach has a clean API and is easy to write, but it does not run on the GPU and the gradients have to be derived by hand. The main goal of most deep learning frameworks is therefore to let you write forward propagation code that looks like Numpy, but that runs on the GPU and computes gradients automatically.
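
Hand-derived gradients are easy to get wrong, so it is common to compare them with a numerical estimate. The sketch below does this for the same computation, c = sum(x * y + z), using central differences; the helper names `forward` and `numeric_grad` are illustrative and not part of the original code.

```python
import numpy as np


def forward(x, y, z):
    """Same computation as in the Numpy example: c = sum(x * y + z)."""
    return np.sum(x * y + z)


def numeric_grad(f, x, eps=1e-6):
    """Central-difference estimate of df/dx, perturbing one element at a time."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        old = x[idx]
        x[idx] = old + eps
        f_plus = f()
        x[idx] = old - eps
        f_minus = f()
        x[idx] = old  # restore the original value
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad


np.random.seed(0)
N, D = 3, 4
x, y, z = (np.random.randn(N, D) for _ in range(3))

grad_x_num = numeric_grad(lambda: forward(x, y, z), x)
grad_z_num = numeric_grad(lambda: forward(x, y, z), z)

# The analytic gradients are grad_x = y and grad_z = ones.
print(np.allclose(grad_x_num, y, atol=1e-5))
print(np.allclose(grad_z_num, np.ones((N, D)), atol=1e-5))
```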

In the TensorFlow version (TensorFlow 1.x API), the forward propagation code builds a computational graph, and the gradients are computed automatically:

import numpy as np
np.random.seed(0)
import tensorflow as tf

N, D = 3, 4

# Build the forward computational graph
x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)
z = tf.placeholder(tf.float32)

a = x * y
b = a + z
c = tf.reduce_sum(b)

# Compute the gradients
grad_x, grad_y, grad_z = tf.gradients(c, [x, y, z])

with tf.Session() as sess:
    values = {
        x: np.random.randn(N, D),
        y: np.random.randn(N, D),
        z: np.random.randn(N, D),
    }
    out = sess.run([c, grad_x, grad_y, grad_z], feed_dict=values)
    c_val, grad_x_val, grad_y_val, grad_z_val = out
    print(c_val)
    print(grad_x_val)

In the PyTorch version, forward propagation looks very much like Numpy, and backpropagation computes the gradients automatically without any hand-written code.

import torch

# Run on the GPU if one is available, otherwise fall back to the CPU
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'

# Forward pass, similar to Numpy
N, D = 3, 4
# requires_grad=True asks autograd to track gradients for a tensor (the default is False)
x = torch.randn(N, D, requires_grad=True, device=device)
y = torch.randn(N, D, requires_grad=True, device=device)
z = torch.randn(N, D, requires_grad=True, device=device)

a = x * y
b = a + z
c = torch.sum(b)

c.backward()  # backpropagation computes the gradients automatically
print(x.grad)
print(y.grad)
print(z.grad)

As you can see, these frameworks compute gradients automatically and can run on the GPU.

2.2 TensorFlow

For more on using TensorFlow, see the TensorFlow cheat sheet made by ShowMeAI and the corresponding articles "AI Modeling Tools Quick Reference | TensorFlow User Guide" and "AI Modeling Tools Quick Reference | Keras User Guide".

The following example is a two-layer neural network with a ReLU nonlinearity and an L2 loss (it is, of course, only a learning example).

TensorFlow; computational graph of a two-layer neural network

The implementation is as follows:

1) Neural network

import numpy as np
import tensorflow as tf

N, D, H = 64, 1000, 100

# Build the forward computational graph
x = tf.placeholder(tf.float32, shape=(N, D))
y = tf.placeholder(tf.float32, shape=(N, D))
w1 = tf.placeholder(tf.float32, shape=(D, H))
w2 = tf.placeholder(tf.float32, shape=(H, D))

h = tf.maximum(tf.matmul(x, w1), 0)  # hidden layer with ReLU
y_pred = tf.matmul(h, w2)
diff = y_pred - y  # difference matrix
loss = tf.reduce_mean(tf.reduce_sum(diff ** 2, axis=1))  # L2 loss

# Compute the gradients
grad_w1, grad_w2 = tf.gradients(loss, [w1, w2])

# Run the computational graph (in real training, many times)
with tf.Session() as sess:
    values = {
        x: np.random.randn(N, D),
        y: np.random.randn(N, D),
        w1: np.random.randn(D, H),
        w2: np.random.randn(H, D),
    }
    out = sess.run([loss, grad_w1, grad_w2], feed_dict=values)
    loss_val, grad_w1_val, grad_w2_val = out

The process has two parts: the code before the with block defines the computational graph, and the with block runs it (in real training, many times). This pattern is common in TensorFlow.

First, we create four tf.placeholder objects, x, y, w1 and w2. They act as "input slots" into which data is fed later.
Then we use these four placeholders to build the computational graph: matrix multiplication tf.matmul and the ReLU written as tf.maximum give y_pred , and the L2 distance gives loss . No computation actually happens at this point, because the graph has only been constructed and no data has been fed in.
Next, a single line of code, tf.gradients , requests the gradients of loss with respect to w1 and w2 . Again nothing is computed yet: TensorFlow finds the paths from loss to w1 and w2 and adds the gradient operations to the graph.
Once the graph is complete, we create a session to run it. Inside the session, Numpy arrays must be supplied for the "input slots" created above.
The last two lines do the real work. sess.run takes the list of values to compute ( loss , grad_w1 , grad_w2 ) and the feed_dict dictionary of Numpy arrays; unpacking its result yields the Numpy arrays loss_val , grad_w1_val and grad_w2_val .
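
The "define the graph first, run it later" pattern can be illustrated without TensorFlow. The toy below is only a sketch of the idea and not how TensorFlow is implemented: `Node` and `run` are hypothetical names, a node without an operation plays the role of a placeholder, and a plain dictionary plays the role of feed_dict.

```python
class Node:
    """A graph node: either a placeholder (no op) or an operation on inputs."""

    def __init__(self, op=None, inputs=()):
        self.op = op
        self.inputs = inputs

    def __mul__(self, other):
        return Node(lambda a, b: a * b, (self, other))

    def __add__(self, other):
        return Node(lambda a, b: a + b, (self, other))


def run(node, feed_dict):
    """Evaluate `node`, reading placeholder values from feed_dict."""
    if node.op is None:
        return feed_dict[node]
    args = [run(inp, feed_dict) for inp in node.inputs]
    return node.op(*args)


# Phase 1: build the graph. No arithmetic happens here.
x, y, z = Node(), Node(), Node()
c = x * y + z

# Phase 2: run the same graph with different inputs.
first = run(c, {x: 2.0, y: 3.0, z: 1.0})
second = run(c, {x: 4.0, y: 0.5, z: -1.0})
print(first, second)
```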

The code above runs the graph only once. For training we need to iterate many times, choosing hyperparameters and a parameter update rule:

with tf.Session() as sess:
    values = {
        x: np.random.randn(N, D),
        y: np.random.randn(N, D),
        w1: np.random.randn(D, H),
        w2: np.random.randn(H, D),
    }
    learning_rate = 1e-5
    for t in range(50):
        out = sess.run([loss, grad_w1, grad_w2], feed_dict=values)
        loss_val, grad_w1_val, grad_w2_val = out
        values[w1] -= learning_rate * grad_w1_val
        values[w2] -= learning_rate * grad_w2_val

One problem with this loop is that at every step the weights are copied to the GPU as Numpy arrays, and the results are copied back into Numpy arrays once the GPU has finished. Because transfers between the CPU and the GPU are a bottleneck, this is very inefficient.

The solution is to make w1 and w2 variables instead of "input slots"; variables persist in the computational graph between runs.

Since w1 and w2 are now variables, they can no longer be initialized by feeding in Numpy arrays from outside. TensorFlow has to initialize them, so we specify how. Still nothing is computed at this point.

w1 = tf.Variable(tf.random_normal((D, H)))
w2 = tf.Variable(tf.random_normal((H, D)))

The parameter update must now be part of the computational graph as well. Use the assign operation to update w1 and w2 (this code goes after the gradient computation):

learning_rate = 1e-5
new_w1 = w1.assign(w1 - learning_rate * grad_w1)
new_w2 = w2.assign(w2 - learning_rate * grad_w2)

To run this network, first run the initialization step tf.global_variables_initializer() once, and then run the graph repeatedly to compute the loss:

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    values = {
        x: np.random.randn(N, D),
        y: np.random.randn(N, D),
    }
    for t in range(50):
        loss_val, = sess.run([loss], feed_dict=values)

2) Optimizer

With the code above, the loss does not actually change during training.

The reason is that sess.run([loss], feed_dict=values) only asks for loss . TensorFlow evaluates only the nodes that the requested values depend on, so the assign operations are never executed and the parameters are never updated.

One fix is to also fetch new_w1 and new_w2 in sess.run , which forces the update but copies the large weight tensors back from the GPU to the CPU at every step.

A better trick is to add to the graph a node that depends on both assignments, and fetch that node instead. This is the group operation: after the two assign operations, add updates = tf.group(new_w1, new_w2) , which creates a new node in the graph, and change the run call to loss_val, _ = sess.run([loss, updates], feed_dict=values) . The value returned for updates is empty.
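
Why fetching only the loss leaves the weights untouched can be shown with a toy evaluator that, like a static-graph runtime, executes only the nodes the fetched values depend on. Everything here is a hypothetical illustration in plain Python (`Node`, `run`, and the single scalar weight in `state`), not TensorFlow's actual machinery.

```python
executed = []  # records which nodes were actually evaluated


class Node:
    def __init__(self, name, fn, deps=()):
        self.name, self.fn, self.deps = name, fn, deps


def run(fetches):
    """Evaluate only the fetched nodes and whatever they depend on."""
    cache = {}

    def visit(node):
        if node not in cache:
            args = [visit(d) for d in node.deps]
            executed.append(node.name)
            cache[node] = node.fn(*args)
        return cache[node]

    return [visit(f) for f in fetches]


state = {"w": 1.0}
loss = Node("loss", lambda: state["w"] ** 2)
grad = Node("grad", lambda: 2 * state["w"])


def assign(g):
    state["w"] -= 0.1 * g  # side effect only; returns None, like an empty update result


update = Node("update", assign, deps=(grad,))

run([loss])                      # update is not a dependency of loss
w_after_loss_only = state["w"]
executed_loss_only = list(executed)

run([loss, update])              # fetching the update node forces the assignment
w_after_update = state["w"]
print(w_after_loss_only, w_after_update)
```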

This is still not very convenient. Fortunately, TensorFlow provides optimizers that do it for us. An optimizer takes the learning rate as a parameter and performs the parameter update. There are many optimizers to choose from, such as gradient descent and Adam.

optimizer = tf.train.GradientDescentOptimizer(1e-5)  # use an optimizer
updates = optimizer.minimize(loss)  # update the parameters to reduce the loss; uses group internally

The run call is unchanged: loss_val, _ = sess.run([loss, updates], feed_dict=values)

3) Loss

The loss can also be computed with a function built into TensorFlow:

loss = tf.losses.mean_squared_error(y_pred, y)  # L2 (mean squared error) loss

4) Layers

A remaining problem is that we have to define the shapes of x, y, w1 and w2 ourselves and make sure they fit together correctly, and bias terms have to be handled as well. This bookkeeping becomes more tedious with convolutional layers, batch normalization and so on.

TensorFlow's layer API takes care of this:

N, D, H = 64, 1000, 100
x = tf.placeholder(tf.float32, shape=(N, D))
y = tf.placeholder(tf.float32, shape=(N, D))

init = tf.variance_scaling_initializer(2.0)  # He initialization for the weights
# hidden layer with ReLU
h = tf.layers.dense(inputs=x, units=H, activation=tf.nn.relu, kernel_initializer=init)
y_pred = tf.layers.dense(inputs=h, units=D, kernel_initializer=init)

loss = tf.losses.mean_squared_error(y_pred, y)  # L2 (mean squared error) loss

optimizer = tf.train.GradientDescentOptimizer(1e-5)
updates = optimizer.minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    values = {
        x: np.random.randn(N, D),
        y: np.random.randn(N, D),
    }
    for t in range(50):
        loss_val, _ = sess.run([loss, updates], feed_dict=values)

In the code above, x and y are created as before, but the parameters w1 and w2 are hidden inside the layers and are initialized with He initialization.

Forward propagation uses the fully connected layer tf.layers.dense . This function takes the input data inputs , the number of neurons in the layer units , the activation function activation , and the weight (kernel) initializer kernel_initializer ; it sets up the weights and biases automatically.
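
What a dense layer with He initialization computes can be sketched in NumPy. This is an illustration under stated assumptions rather than a reproduction of tf.layers.dense: the helpers `he_normal`, `dense` and `relu` are hypothetical, the weights are drawn from a plain Gaussian with variance 2 / fan_in as a simplification, and the biases start at zero.

```python
import numpy as np


def he_normal(fan_in, fan_out, rng):
    """He initialization: zero-mean Gaussian with variance 2 / fan_in."""
    return rng.standard_normal((fan_in, fan_out)) * np.sqrt(2.0 / fan_in)


def relu(a):
    return np.maximum(a, 0)


def dense(x, w, b, activation=None):
    """Fully connected layer: x @ w + b, followed by an optional activation."""
    out = x @ w + b
    return activation(out) if activation is not None else out


rng = np.random.default_rng(0)
N, D, H = 64, 1000, 100
x = rng.standard_normal((N, D))
w1, b1 = he_normal(D, H, rng), np.zeros(H)
w2, b2 = he_normal(H, D, rng), np.zeros(D)

h = dense(x, w1, b1, activation=relu)   # hidden layer
y_pred = dense(h, w2, b2)               # output layer, no activation
print(h.shape, y_pred.shape)
```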

5) High level API: tensorflow.keras

Keras is a higher-level encapsulation based on TensorFlow, which makes the whole process easier. It used to be a third-party library and is now built into TensorFlow.

Part of the code using Keras is shown below; the rest is the same as above:

N, D, H = 64, 1000, 100
x = tf.placeholder(tf.float32, shape=(N, D))
y = tf.placeholder(tf.float32, shape=(N, D))

model = tf.keras.Sequential()  # a model built as a sequence of layers
# Add the layers one by one
model.add(tf.keras.layers.Dense(units=H, input_shape=(D,), activation=tf.nn.relu))
model.add(tf.keras.layers.Dense(D))
# Call the model to get the predictions
y_pred = model(x)
loss = tf.losses.mean_squared_error(y_pred, y)

The model object simplifies much of the work. The final version of the code is as follows:

import numpy as np
import tensorflow as tf

N, D, H = 64, 1000, 100

# Create the model and add layers
model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(units=H, input_shape=(D,), activation=tf.nn.relu))
model.add(tf.keras.layers.Dense(D))

# Configure the model: loss function and parameter update rule
model.compile(optimizer=tf.keras.optimizers.SGD(lr=1e-5), loss=tf.keras.losses.mean_squared_error)

x = np.random.randn(N, D)
y = np.random.randn(N, D)

# Train
history = model.fit(x, y, epochs=50, batch_size=N)

The code is very concise:

Define the model : tf.keras.Sequential() says that the model is a sequence of layers; two fully connected layers are then added, specifying the activation function, the number of neurons in each layer, and so on;
Configure the model : model.compile sets the optimizer, the loss function, and so on;
Train the model on data : model.fit takes the number of epochs, the batch size, and so on, and trains the model directly on the raw data.

6) Other knowledge

① Common high-level wrappers

Keras ( https://keras.io/ )
Shipped with TensorFlow:
- tf.keras ( https://www.tensorflow.org/api_docs/python/tf/keras )
- tf.layers ( https://www.tensorflow.org/api_docs/python/tf/layers )
- tf.estimator ( https://www.tensorflow.org/api_docs/python/tf/estimator )
- tf.contrib.estimator ( https://www.tensorflow.org/api_docs/python/tf/contrib/estimator )
- tf.contrib.layers ( https://www.tensorflow.org/api_docs/python/tf/contrib/layers )
- tf.contrib.slim ( https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/slim )
- tf.contrib.learn ( https://www.tensorflow.org/api_docs/python/tf/contrib/learn ) (deprecated)
Third-party packages:
- Sonnet ( https://github.com/deepmind/sonnet ) (by DeepMind)
- TFLearn ( http://tflearn.org/ )
- TensorLayer ( http://tensorlayer.readthedocs.io/en/latest/ )

② Pre-trained models

TensorFlow provides pre-trained models that can be used directly and fine-tuned with transfer learning.

tf.keras: ( https://www.tensorflow.org/api_docs/python/tf/keras/applications )
TF-Slim: ( https://github.com/tensorflow/models/tree/master/slim/nets )

③ Tensorboard

Add logging to the code to record the loss and other statistics;
Plot the resulting curves.

TensorFlow; loss curves plotted with Tensorboard

④ Distributed operation

TensorFlow can run across multiple machines, which is something Google is particularly good at.

⑤ TPU (Tensor Processing Units)

TPUs are dedicated deep learning hardware and run very fast. The computing power of Google Cloud TPU is 180 TFLOPs, and the computing power of NVIDIA Tesla V100 is 125 TFLOPs.

TensorFlow; Google Cloud TPU

⑥ Theano

An earlier framework that can be seen as the predecessor of TensorFlow; the two are similar in many ways.

2.3 PyTorch

For more on using PyTorch, see the PyTorch quick reference made by ShowMeAI and the corresponding article "AI Modeling Tools Quick Reference | PyTorch User Guide".

1) Basic Concepts

Tensor : similar to a Numpy array, but can run on the GPU;
Autograd : a package that builds computational graphs from Tensors and computes gradients automatically;
Module : a neural network layer that can store state and learnable weights.

The code below uses PyTorch v0.4.

2) Tensors

The following code uses Tensors to train a two-layer neural network with a ReLU activation and an L2 loss.

PyTorch; Tensors

The code is as follows:

import torch

# CPU version
device = torch.device('cpu')
# device = torch.device('cuda:0')  # uncomment to use the GPU

# Create random Tensors for the data and the parameters
N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in, device=device)
y = torch.randn(N, D_out, device=device)
w1 = torch.randn(D_in, H, device=device)
w2 = torch.randn(H, D_out, device=device)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute the prediction and the loss
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)
    loss = (y_pred - y).pow(2).sum()

    # Backward pass: compute the gradients by hand
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Gradient descent: update the parameters
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

First, create random tensors x, y, w1 and w2 , in the same way as Numpy arrays;
Then run the forward pass to compute the prediction and the loss;
Then compute the gradients by hand;
Finally, update the parameters.

The code above is simple and very close to the Numpy version, but the gradients still have to be computed by hand.
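
For comparison, here is a sketch of the same two-layer network written in plain NumPy, with the same shapes and learning rate. The only additions are a fixed seed and a `losses` list (both assumptions made for illustration) so that the progress of training can be inspected afterwards.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D_in, H, D_out = 64, 1000, 100, 10
x = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = rng.standard_normal((D_in, H))
w2 = rng.standard_normal((H, D_out))

learning_rate = 1e-6
losses = []
for t in range(500):
    # Forward pass: compute the prediction and the loss
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)
    losses.append(np.square(y_pred - y).sum())

    # Backward pass: gradients derived by hand, mirroring the Tensor code
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)

    # Gradient descent: update the parameters
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

print(losses[0], losses[-1])
```

Apart from `dot` versus `mm` and `np.maximum` versus `clamp`, the two versions are line-for-line the same; the Tensor version can additionally be moved to the GPU by changing the device.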

3) Autograd automatic gradient calculation

PyTorch can automatically compute gradients:

import torch

# Create random Tensors
N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)
w1 = torch.randn(D_in, H, requires_grad=True)
w2 = torch.randn(H, D_out, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # Forward pass
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    loss = (y_pred - y).pow(2).sum()
    # Backward pass
    loss.backward()
    # Update the parameters
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        w1.grad.zero_()
        w2.grad.zero_()

The main differences from the previous version are:

w1 and w2 are created with requires_grad=True , so PyTorch builds a computational graph for them and computes their gradients automatically. x and y do not need gradients.
Forward propagation is the same as before, but there is no need to keep the intermediate values ourselves; PyTorch keeps track of them in the computational graph.
loss.backward() computes all the required gradients automatically.
The weights are updated with a gradient step, and the gradients are then zeroed. torch.no_grad means "do not build a computational graph for this part". PyTorch methods whose names end in an underscore modify the Tensor in place instead of returning a new Tensor.
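
The bookkeeping behind these points can be sketched with a toy reverse-mode autograd for scalars. This is a simplified illustration and not PyTorch's implementation; the class name `Value` and its fields are hypothetical. It shows the graph being recorded during the forward pass, and it shows why gradients must be zeroed between steps: `backward` accumulates into `grad` with `+=`.

```python
class Value:
    """Scalar that records how it was computed so gradients can flow back."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents
        self.backward_fn = None  # pushes self.grad to the parents

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def backward_fn():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out.backward_fn = backward_fn
        return out

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def backward_fn():
            self.grad += out.grad
            other.grad += out.grad

        out.backward_fn = backward_fn
        return out

    def backward(self):
        # Topologically order the graph, then apply the chain rule in reverse.
        order, seen = [], set()

        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v.parents:
                    visit(p)
                order.append(v)

        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            if v.backward_fn is not None:
                v.backward_fn()


x, y, z = Value(2.0), Value(3.0), Value(4.0)
c = x * y + z      # the graph is recorded while this line executes
c.backward()
print(x.grad, y.grad, z.grad)
```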

The difference between TensorFlow and PyTorch is that TensorFlow first constructs the computational graph explicitly and then runs it repeatedly, whereas PyTorch builds a new graph every time forward propagation is executed, which makes the program look more concise.

PyTorch also lets you define your own autograd functions by writing forward and backward functions, much like the layers implemented in the course assignments. Such a function can be used directly in a computational graph, although in practice you rarely need to define one yourself.
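
As a sketch of what such a user-defined function looks like, the example below writes ReLU as a subclass of torch.autograd.Function with static forward and backward methods. The class name `MyReLU` and the small input tensor are illustrative choices; the built-in clamp or torch.nn.ReLU would normally be used instead.

```python
import torch


class MyReLU(torch.autograd.Function):
    """ReLU written as a custom autograd function (illustrative name)."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)       # keep the input for the backward pass
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[x < 0] = 0          # no gradient where the input was negative
        return grad_input


x = torch.tensor([-1.0, 0.5, 2.0], requires_grad=True)
out = MyReLU.apply(x).sum()
out.backward()
print(x.grad)
```

The function is invoked through `MyReLU.apply`, after which autograd treats it like any built-in operation.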

PyTorch; Autograd automatic gradient computation

4) NN

Like Keras, the torch.nn package is a high-level wrapper that keeps the code simple.

import torch

N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Define the model
model = torch.nn.Sequential(torch.nn.Linear(D_in, H),
                            torch.nn.ReLU(),
                            torch.nn.Linear(H, D_out))

learning_rate = 1e-2
for t in range(500):
    # Forward pass
    y_pred = model(x)
    loss = torch.nn.functional.mse_loss(y_pred, y)
    # Compute the gradients
    loss.backward()

    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad
    model.zero_grad()

The model is defined as a sequence of layers; layer objects such as fully connected layers and ReLU layers are defined inside it and hold the learnable weights;
For forward propagation, feed the data to the model to get the prediction and then compute the loss; torch.nn.functional contains many useful functions, such as loss functions;
Backpropagation computes the gradients of all the weights in the model;
Finally, each step updates the model parameters.

5) Optimizer

PyTorch also has its own optimizer:

import torch

N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Define the model
model = torch.nn.Sequential(torch.nn.Linear(D_in, H),
                            torch.nn.ReLU(),
                            torch.nn.Linear(H, D_out))
# Define the optimizer
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
# Training loop
for t in range(500):
    y_pred = model(x)
    loss = torch.nn.functional.mse_loss(y_pred, y)

    loss.backward()
    # Update the parameters
    optimizer.step()
    optimizer.zero_grad()

Different optimizers implement different update rules; Adam is used here;
After computing the gradients, use the optimizer to update the parameters and zero the gradients.
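
To give a sense of what an optimizer's step does internally, here is a minimal NumPy sketch of the Adam update rule with its usual default hyperparameters. The class below is a hypothetical illustration and not torch.optim.Adam: it takes the gradients as an explicit argument to `step` instead of reading them from the parameters.

```python
import numpy as np


class Adam:
    """Minimal Adam optimizer over a list of NumPy parameter arrays."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        self.params, self.lr, self.betas, self.eps = params, lr, betas, eps
        self.m = [np.zeros_like(p) for p in params]  # first-moment estimates
        self.v = [np.zeros_like(p) for p in params]  # second-moment estimates
        self.t = 0

    def step(self, grads):
        """Update every parameter in place from its gradient."""
        self.t += 1
        b1, b2 = self.betas
        for p, g, m, v in zip(self.params, grads, self.m, self.v):
            m[:] = b1 * m + (1 - b1) * g
            v[:] = b2 * v + (1 - b2) * g * g
            m_hat = m / (1 - b1 ** self.t)   # bias correction
            v_hat = v / (1 - b2 ** self.t)
            p -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)


w = np.array([1.0, -2.0])
opt = Adam([w], lr=0.1)
opt.step([2 * w])   # gradient of f(w) = sum(w ** 2)
print(w)
```

On the first step the bias-corrected moments reduce the update to roughly lr times the sign of the gradient, so each coordinate moves by about 0.1 toward zero regardless of the gradient's magnitude.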

6) Define a new module

A module in PyTorch is a neural network layer whose inputs and outputs are tensors. A module can contain weights and other modules, and you can define your own modules and rely on Autograd for the backward pass.

For example, the two-layer neural network above can be written as a module:

import torch
# Define the whole network above as a single module
class TwoLayerNet(torch.nn.Module):
    # Initialize two submodules, both linear layers
    def __init__(self, D_in, H, D_out):
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)
    # Define the forward pass with the submodules; no backward is needed, autograd handles it
    def forward(self, x):
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred

N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)
# Building and training the model is the same as before
model = TwoLayerNet(D_in, H, D_out)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
for t in range(500):
    y_pred = model(x)
    loss = torch.nn.functional.mse_loss(y_pred, y)

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

Mixing custom modules with built-in ones is very common: define a Module subclass and then add it to a Sequential container as part of a larger model.

For example, the module below passes its input through two parallel fully connected layers, multiplies the two results elementwise, and applies ReLU:

class ParallelBlock(torch.nn.Module):
    def __init__(self, D_in, D_out):
        super(ParallelBlock, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, D_out)
        self.linear2 = torch.nn.Linear(D_in, D_out)
    def forward(self, x):
        h1 = self.linear1(x)
        h2 = self.linear2(x)
        return (h1 * h2).clamp(min=0)

It can then be used in a model:

model = torch.nn.Sequential(ParallelBlock(D_in, H),
                            ParallelBlock(H, H),
                            torch.nn.Linear(H, D_out))

The computational graph of the new model built from ParallelBlock is as follows:

PyTorch; ParallelBlock computational graph

7) DataLoader

A DataLoader wraps a dataset and provides mini-batching, shuffling, multi-threaded loading, and so on. When you need to load custom data, just write your own Dataset class:

import torch
from torch.utils.data import TensorDataset, DataLoader

N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

loader = DataLoader(TensorDataset(x, y), batch_size=8)
model = TwoLayerNet(D_in, H, D_out)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

for epoch in range(20):
    for x_batch, y_batch in loader:
        y_pred = model(x_batch)
        loss = torch.nn.functional.mse_loss(y_pred, y_batch)

        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

The code above is still a two-layer neural network built from a custom module, but this time a DataLoader feeds the data. Each parameter update is computed on one mini-batch, and one epoch iterates over all the mini-batches. Most PyTorch training code looks basically like this.
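
The remark about writing your own dataset class can be made concrete with a small sketch. A map-style dataset only needs `__len__` and `__getitem__`; the `SquaresDataset` below is a made-up example whose samples are generated on the fly, where a real one would read files from disk.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class SquaresDataset(Dataset):
    """Hypothetical dataset: sample i is the pair (i, i ** 2)."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        x = torch.tensor([float(idx)])
        return x, x ** 2


loader = DataLoader(SquaresDataset(10), batch_size=4, shuffle=False)
batch_shapes = [tuple(x_batch.shape) for x_batch, y_batch in loader]
print(batch_shapes)
```

With 10 samples and a batch size of 4, the loader yields two full batches and one final batch of 2; passing shuffle=True would randomize the order on every epoch.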

8) Pre-trained model

Using a pre-trained model is very simple: https://github.com/pytorch/vision

import torch
import torchvision
alexnet = torchvision.models.alexnet(pretrained=True)
vgg16 = torchvision.models.vgg16(pretrained=True)
resnet101 = torchvision.models.resnet101(pretrained=True)

9) Visdom

A visualization package similar to TensorBoard, although unlike TensorBoard it cannot visualize computational graphs.

PyTorch; visualization with Visdom

10) Torch

The predecessor of PyTorch. It is written in Lua rather than Python and has no Autograd. It is relatively stable, but it is not recommended for new projects.

3. Static vs Dynamic Graphs

TensorFlow uses static graphs:

First build a computational graph that describes the computation, including the paths for backpropagation;
Then reuse the same graph in every iteration to perform the computation.

PyTorch, by contrast, uses dynamic graphs, in which building the graph and running the computation happen at the same time:

Create the tensor objects;
In each iteration, build the computational graph data structure, find the gradient paths for the parameters, and perform the computation;
At the end of each iteration, throw the graph away; the next iteration rebuilds it and repeats the previous step.

3.1 Advantages of static graphs

With a static graph, the same graph is run many times, so the framework has the opportunity to optimize it before running.

For example, the graph you write (left) may be rewritten by the framework into an equivalent but more efficient graph (right).

Advantages of static graphs; Static Graphs

A static graph only has to be built once. Once it is built, it can be deployed, for instance from C++, without the Python code that created it. With dynamic graphs, graph construction and execution are intertwined, so the original code is always needed.

3.2 Advantages of dynamic graphs

Dynamic-graph code is more concise and reads like ordinary Python.

Take conditional logic. Because PyTorch builds the graph dynamically, normal Python control flow can be used. TensorFlow builds the graph only once, so every branch has to be included in the graph, and TensorFlow's own control-flow operations must be used; here that is the conditional tf.cond .

Advantages of dynamic graphs; Dynamic Graphs

The same is true for loops.

In PyTorch you just write the loop in ordinary Python; the graph is rebuilt on each run, however long the sequence turns out to be;
In TensorFlow you need a looping operation such as tf.foldl , because with a static graph the loop has to be added to the graph explicitly as a node. To make sure the graph is built only once, TensorFlow generally requires its own control flow (loops, conditionals, and so on) instead of Python syntax, so you have to learn TensorFlow-specific control operations.
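
The loop case can be illustrated with an ordinary Python loop whose length follows the data. The sketch uses NumPy for the arithmetic and an assumed recurrence y_t = (y_{t-1} + x_t) * w chosen for illustration; in a dynamic-graph framework the same loop, written with tensors, would also record the graph as it runs.

```python
import numpy as np


def recurrence(xs, y0, w):
    """Compute y_t = (y_{t-1} + x_t) * w with an ordinary Python loop."""
    y = y0
    for x_t in xs:          # the number of iterations follows the data
        y = (y + x_t) * w
    return y


w = np.array([0.5, 2.0])
y0 = np.zeros(2)
short = recurrence(np.ones((2, 2)), y0, w)   # sequence of length 2
long = recurrence(np.ones((5, 2)), y0, w)    # sequence of length 5
print(short, long)
```

Nothing in the function depends on the sequence length being known in advance, which is the property a static graph has to recover with a dedicated looping operation.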

Advantages of dynamic graphs; Dynamic Graphs

3.3 Application of dynamic graph

1) Recurrent Networks

Image captioning, for example, uses a recurrent network to operate on sequences of different lengths. The sentence generated to describe the image is a sequence, so the computation depends dynamically on the length of that sequence.

2) Recursive Networks

In natural language processing, a recursive network is applied along the syntax parse tree of a sentence. The computation is therefore not a fixed stack of layers but a graph or tree whose structure differs for every data point, which is difficult to express in TensorFlow. In PyTorch, ordinary Python control flow is available, so it is easy to implement.

3) Modular Networks

Used for answering questions about the content of an image: the network, and therefore the dynamic graph, is assembled differently depending on the question.

3.4 TensorFlow and PyTorch get closer to each other

The line between TensorFlow and PyTorch is blurring: PyTorch is adding static features and TensorFlow is adding dynamic features.

TensorFlow Fold can automatically convert dynamic-graph code into static graphs
TensorFlow 1.7 adds Eager Execution, which allows dynamic graphs

import tensorflow as tf
import tensorflow.contrib.eager as tfe
tf.enable_eager_execution()

N, D = 3, 4
x = tfe.Variable(tf.random_normal((N, D)))
y = tfe.Variable(tf.random_normal((N, D)))
z = tfe.Variable(tf.random_normal((N, D)))

with tfe.GradientTape() as tape:
    a = x * y
    b = a + z
    c = tf.reduce_sum(b)

grad_x, grad_y, grad_z = tape.gradient(c, [x, y, z])
print(grad_x)

Call tf.enable_eager_execution() at the start of the program; it is a global switch
tf.random_normal now produces concrete values, so no placeholders or sessions are needed; to compute gradients with respect to them, wrap them in tfe.Variable
Operations executed under a GradientTape build a dynamic graph, similar to PyTorch
Use the tape to compute gradients, similar to backward in PyTorch; the results can be printed directly
On the PyTorch side, static graphs are available through Caffe2 ( https://caffe2.ai/ ) and ONNX support ( https://caffe2.ai/ )

4. Expand your learning

You can watch the [bilingual subtitles] version of the course videos on Bilibili.

5. Summary of key points

For deep learning hardware, the GPU is the best choice, but the communication between CPU and GPU has to be handled carefully. TPUs are specialized hardware for deep learning and are very fast.
PyTorch and TensorFlow are both very good deep learning frameworks. Both have arrays that run directly on the GPU, both compute gradients automatically, and both ship with many ready-made functions, layers, and so on. PyTorch uses dynamic graphs and TensorFlow uses static graphs, but each is moving toward the other. Which one to choose depends on the project.

Stanford CS231n full set of interpretation
