Preface

AI is constantly expanding the boundaries of front-end technology, and the support of algorithms has injected new power into front-end R&D. This article introduces what on-device intelligence is, its application scenarios, and the basic concepts of implementing AI on the Web.
What is on-device intelligence

First, let's review the development process of an AI application. The specific steps include:
- Data collection and preprocessing
- Model selection and training
- Model evaluation
- Model service deployment
The intermediate product of model training is a model file. By loading the model file, the model is deployed as a callable service, which can then be invoked to perform inference and prediction.

In the traditional process, the model service is deployed on a high-performance server: the client initiates a request, the server performs inference, and returns the prediction result to the client. On-device intelligence instead completes the inference process on the client.
Application scenarios of on-device intelligence

On-device intelligence already has many application scenarios, covering fields such as visual AR, interactive games, feed re-ranking, smart Push, live voice streaming, and intelligent noise reduction. Algorithms are gradually extending from the server to mobile devices, which can perceive users in real time.

Typical applications include:

- AR apps and games. AI provides the ability to understand visual information, and AR realizes the combination of virtual and real interaction based on that information, bringing more immersive shopping and interactive experiences. For example, beauty cameras and virtual makeup try-ons detect facial keypoints and use AR to enhance and render makeup in specific regions.
- Interactive games. Fliggy's Double Eleven interactive game "Find It" is an H5 page: it captures images in real time through the camera, calls a classification model to classify them, and scores when the target set by the game is recognized.
- On-device re-ranking. By recognizing user state in real time, the feeds stream issued by the server-side recommendation algorithm is re-ranked on the device to make more accurate content recommendations.
- Smart Push. By perceiving user state on the device, the app decides whether to intervene and when to push a notification, proactively reaching the user at the right time instead of relying on scheduled batch pushes from the server, bringing more precise marketing and a better user experience.
Advantages of on-device intelligence

From the common application scenarios above, the advantages of on-device intelligence are obvious:

- Low latency. Computing in real time on the device saves the round-trip time of network requests. For applications with high frame-rate requirements, such as a beauty camera that would otherwise have to request the server many times per second, high latency is unacceptable to users. For high-frequency interactive scenarios such as games, low latency matters even more.
- Low service cost. Local computing saves server resources. New phone launches now emphasize the AI computing power of their chips, and increasingly powerful devices make more on-device AI applications possible.
- Privacy protection. Data privacy is becoming more and more important. By performing model inference on the device, user data does not need to be uploaded to the server, which protects user privacy.
Limitations of on-device intelligence

At the same time, on-device intelligence has one obvious limitation: low computing power. Although device performance keeps getting stronger, it is still far from that of a server. To implement complex algorithms with limited resources, the runtime must be adapted to the hardware platform and optimized at the instruction level so that models can run on the device, and the model itself must be compressed to reduce its time and space consumption.

There are already some fairly mature on-device inference engines. These frameworks and engines are optimized for device hardware to make full use of its computing power, for example TensorFlow Lite, PyTorch Mobile, Alibaba's MNN, and Baidu's PaddlePaddle.
What about the Web

The Web shares the advantages and limitations of on-device AI. On PC, the browser is the primary means for users to access Internet content and services; on mobile, many apps embed Web pages. However, the browser's memory and storage quotas are limited, which makes running AI on the Web seem even less feasible.
Nevertheless, the ConvNetJS library emerged as early as 2015, making it possible to run convolutional neural networks for classification and regression tasks in the browser. Although it is no longer maintained, many JS machine learning and deep learning frameworks appeared around 2018, such as Tensorflow.js, Synaptic, Brain.js, Mind, Keras.js, WebDNN, and so on.
Limited by the browser's computing power, some frameworks, such as Keras.js and WebDNN, only support loading models for inference and cannot train in the browser.
In addition, some frameworks are not suitable for general deep learning tasks, and the network types they support differ. TensorFlow.js, Keras.js and WebDNN support DNNs, CNNs and RNNs; ConvNetJS mainly supports CNN tasks and does not support RNNs; Brain.js and Synaptic mainly support RNN tasks but not the convolution and pooling operations used in CNNs; Mind only supports basic DNNs.
When choosing a framework, check whether it supports your specific needs.
Web-side architecture
How can the Web make use of this limited computing power?

A typical JavaScript machine learning framework is layered as shown in the figure: from bottom to top are the underlying hardware, the browser interfaces that access the hardware, the various machine learning frameworks and graphics processing libraries, and finally our application.
CPU vs GPU
A prerequisite for running machine learning models in the browser is obtaining sufficient computing power through the GPU.

In machine learning, especially in deep network models, a widely used operation is multiplying a large matrix by a vector and then adding another vector. A typical operation of this type involves thousands or millions of floating-point operations, but they are usually parallelizable.

Taking simple vector addition as an example, the addition of two vectors can be divided into many smaller operations, namely the addition at each index position, and these smaller operations do not depend on each other. Although the CPU usually needs less time for each individual addition, concurrency gradually shows its advantage as the amount of computation grows.
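As a rough illustration, the sketch below implements elementwise vector addition in plain JavaScript. Each loop iteration is independent of the others, which is exactly the property that lets a GPU compute all positions in parallel (the function name is illustrative):

```js
// Elementwise vector addition: out[i] depends only on a[i] and b[i],
// so every index could be computed in parallel on a GPU.
function addVectors(a, b) {
  const out = new Float32Array(a.length);
  for (let i = 0; i < a.length; i++) {
    out[i] = a[i] + b[i]; // no dependency between iterations
  }
  return out;
}

addVectors(new Float32Array([1, 2, 3]), new Float32Array([4, 5, 6])); // [5, 7, 9]
```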
WebGPU/WebGL vs WebAssembly
Once you have the hardware, you need to make full use of it.
WebGL

WebGL is currently the highest-performance scheme for utilizing the GPU in the browser. It was designed to accelerate 2D and 3D graphics, but it can also be applied to the parallel computation in neural networks, accelerating the inference process by an order of magnitude.

WebGPU
As Web applications' requirements for programmable 3D graphics, image processing and GPU access continued to grow, the W3C proposed WebGPU in 2017 to bring GPU-accelerated scientific computing to the Web. As the next-generation Web graphics API standard, it offers lower driver overhead, better multi-threading support, and the ability to use the GPU for general computation.

WebAssembly
When the device has no WebGL support, or its performance is weak, WebAssembly is the general CPU-based computation scheme. WebAssembly is a new type of code that can run in modern web browsers: a low-level, assembly-like language with a compact binary format that runs at near-native performance. It also provides a compilation target for languages such as C/C++ so that they can run on the Web.
Tensorflow.js
Take Tensorflow.js as an example. To run in different environments, tensorflow.js supports multiple backends and automatically selects the appropriate one according to the device; manual switching is also supported:

```js
// Force the plain CPU backend, then confirm which backend is active
tf.setBackend('cpu');
console.log(tf.getBackend());
```
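For instance, a minimal sketch of switching to the WebAssembly backend; it assumes the @tensorflow/tfjs-backend-wasm package is installed, and runs in an async context:

```js
// Assumed setup: importing the wasm package registers the backend with tf.js
import '@tensorflow/tfjs-backend-wasm';
import * as tf from '@tensorflow/tfjs';

await tf.setBackend('wasm'); // setBackend returns a promise
await tf.ready();            // wait for backend initialization
console.log(tf.getBackend()); // 'wasm'
```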
In tests on some common models, the WebGL backend computes about 100 times faster than the plain CPU backend, and the WebAssembly backend is 10-30 times faster than the plain JS CPU backend.
At the same time, tensorflow also provides the tfjs-node version, which drives CPU and GPU computation through native libraries compiled from C++ and CUDA code, with training speed comparable to the Python version of Keras. There is no need to switch languages: you can add AI capabilities directly to a Node.js service instead of starting a separate Python service.
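A minimal sketch of what that looks like in a Node.js service; the tiny model and training data here are purely illustrative:

```js
// Loading tfjs-node automatically selects the native C++ backend
// (use '@tensorflow/tfjs-node-gpu' for CUDA-driven GPU computation)
const tf = require('@tensorflow/tfjs-node');

// Illustrative toy model: learn y = 2x - 1 from four points
const model = tf.sequential();
model.add(tf.layers.dense({ units: 1, inputShape: [1] }));
model.compile({ optimizer: 'sgd', loss: 'meanSquaredError' });

const xs = tf.tensor2d([1, 2, 3, 4], [4, 1]);
const ys = tf.tensor2d([1, 3, 5, 7], [4, 1]);

model.fit(xs, ys, { epochs: 200 }).then(() => {
  model.predict(tf.tensor2d([5], [1, 1])).print(); // ≈ 9
});
```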
Model compression
With a framework adapted to the hardware, the model itself also needs to be compressed. Although a complex model has better prediction accuracy, its high storage footprint, heavy consumption of computing resources, and overly long inference time are still unacceptable in most mobile scenarios.

The complexity of a model lies in its structure and the number of its parameters. A model file usually stores two kinds of information: structure and parameters. In the simplified neural network shown in the figure below, each square corresponds to a neuron, and each connection between neurons is a parameter.

Model inference feeds the input in from the left, computes with the neurons, propagates the weighted results through the connections to the next layer, and so on until the final layer produces the predicted output. The more nodes and connections, the greater the amount of computation.
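As a rough sketch of that computation, a single fully connected layer is just a matrix-vector multiply plus a bias, followed by an activation; every weight and bias below is one stored parameter (the numbers are illustrative):

```js
// Forward pass of one dense layer: y = relu(W·x + b).
// Every entry of W and b is a stored parameter, so more neurons and
// connections mean a bigger model file and more computation.
const relu = (v) => Math.max(0, v);

function denseForward(W, b, x) {
  return W.map((row, i) =>
    relu(row.reduce((sum, w, j) => sum + w * x[j], b[i]))
  );
}

// 2 inputs -> 3 neurons: 6 weights + 3 biases = 9 parameters
denseForward([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], [0, 0, 0], [1, 2]);
```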
Model pruning
Pruning the trained model is a common means of model compression. A network model contains many redundant parameters, and the activation values of a large number of neurons are close to 0. By cutting invalid or less important nodes, the redundancy of the model can be reduced.

The simplest, crudest form of pruning is DropOut, which randomly discards neurons during training.

Most pruning methods compute an importance factor for each neuron node, measuring its contribution to the final result, and then cut the less important nodes.

Model pruning is an iterative process: the model is not used directly for inference after pruning; instead, training after each pruning step restores the model's accuracy. The compression process is a constant trade-off between accuracy and compression ratio, choosing the best compression effect within an acceptable range of accuracy loss.
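A minimal sketch of the simplest magnitude-based variant, zeroing out weights whose absolute value falls below a threshold; the threshold is illustrative, and real toolkits interleave pruning rounds with retraining as described above:

```js
// Magnitude-based pruning sketch: weights close to 0 contribute little,
// so zero them out; the zeros can then be stored sparsely.
function pruneByMagnitude(weights, threshold = 0.01) {
  return weights.map((w) => (Math.abs(w) < threshold ? 0 : w));
}

pruneByMagnitude([0.003, -0.42, 0.0009, 0.27]); // -> [0, -0.42, 0, 0.27]
```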
Model quantization

To ensure high accuracy, most scientific computing uses floating-point numbers, commonly the 32-bit and 64-bit floating-point types, i.e. float32 and float64. Quantization is the conversion of high-precision values into low-precision ones.

For example, binary quantization (1-bit quantization) directly quantizes float32/float64 values to 1 bit, compressing the storage space by a factor of 32 or 64. The memory required at load time is correspondingly smaller, and the smaller model size brings lower power consumption and faster computation. In addition, there are 8-bit quantization and arbitrary-bit quantization.
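A minimal sketch of linear 8-bit quantization using a single scale factor; this is a simplified symmetric scheme, while real toolkits also handle zero-points and per-channel scales:

```js
// Symmetric int8 quantization sketch: map floats in [-max, max] to [-127, 127].
// The int8 values take 1/4 the space of float32 (1/8 of float64).
function quantizeInt8(values) {
  const max = Math.max(...values.map(Math.abs));
  const scale = max / 127;
  const q = Int8Array.from(values, (v) => Math.round(v / scale));
  return { q, scale }; // approximate recovery: q[i] * scale
}

quantizeInt8([0.12, -0.5, 0.33]); // q = [30, -127, 84], scale ≈ 0.0039
```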
Knowledge distillation
Knowledge distillation transfers the knowledge learned by a deep network to another, relatively simple network: first train a teacher network, then use the teacher network's outputs together with the true labels of the data to train the student network.
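A minimal sketch of a distillation loss in tf.js, blending the teacher's softened outputs with the true one-hot labels; the temperature T and weight alpha are illustrative hyperparameters, and this simplifies the loss commonly used in practice:

```js
const T = 4; // temperature: softens the teacher's probability distribution

function distillationLoss(studentLogits, teacherLogits, labels, alpha = 0.5) {
  // Soft term: the student mimics the teacher's softened distribution
  const soft = tf.losses.softmaxCrossEntropy(
    tf.softmax(teacherLogits.div(T)),
    studentLogits.div(T)
  );
  // Hard term: the student still fits the true labels
  const hard = tf.losses.softmaxCrossEntropy(labels, studentLogits);
  return soft.mul(alpha).add(hard.mul(1 - alpha));
}
```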
Tools

Implementing model compression is rather complicated. If you are only building applications, it is enough to understand how the techniques work; you can use packaged tools directly.

For example, the Tensorflow Model Optimization Toolkit provides quantization. Its official compression tests on some common models show that the mobilenet model is compressed from over 10 MB to 3-4 MB, with very little loss of accuracy.

PaddleSlim, provided by Baidu's PaddlePaddle, offers all three compression methods described above.
Summary

In summary, the process of developing a Web-side AI application becomes:
- Design algorithms and train models for specific scenarios
- Compress the model
- Convert to the format required by the inference engine
- Load the model to make inference predictions
For the algorithm, general deep learning frameworks already provide a number of general pre-trained models that can be used directly for inference, or retrained on your own data sets. Model compression and inference can likewise use existing tools.
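A minimal sketch of the last step, loading a converted model in the browser with tf.js and running a prediction; the model URL and input shape below are placeholders:

```js
import * as tf from '@tensorflow/tfjs';

async function run() {
  // Placeholder URL: point this at a model produced by the converter
  const model = await tf.loadLayersModel('https://example.com/model/model.json');
  const input = tf.zeros([1, 224, 224, 3]); // placeholder input shape
  model.predict(input).print();
}

run();
```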
References
[1] https://tech.taobao.org/news/2021-1-7-32.html
[2] https://juejin.cn/post/6844904152095539214
[3] Ma Y, Xiang D, Zheng S, et al. Moving deep learning into web browser: How far can we go?[C]//The World Wide Web Conference. 2019: 1234-1244.
[4] WebGPU: https://www.w3.org/TR/webgpu/
[5] Tensorflow.js: https://www.tensorflow.org/js?hl=zh-cn
[6] WebAssembly: https://developer.mozilla.org/zh-CN/docs/WebAssembly
[7] Deep Learning with JavaScript: https://www.manning.com/books/deep-learning-with-javascript
Welcome to follow the blog of AOTU Labs: aotu.io
Or follow the AOTULabs official account (AOTULabs), where articles are pushed from time to time.