
Preface

AI is constantly expanding the technical boundaries of the front end, and the support of algorithms has injected new power into front-end R&D. This article introduces what on-device intelligence is, its application scenarios, and the basic concepts of running AI on the Web side.

What is on-device intelligence

First, let's review how an AI application is built. The typical steps include:

  • Data collection and preprocessing
  • Model selection and training
  • Model evaluation
  • Model service deployment

The intermediate product of model training is a model file. By loading the model file, the model is deployed as a callable service, which can then be invoked to perform inference and prediction.

In the traditional process, the model service is deployed on a high-performance server: the client sends a request, the server runs inference, and the prediction result is returned to the client. With on-device intelligence, the inference process is instead completed on the client.

Application scenarios of on-device intelligence

On-device intelligence already has many application scenarios, covering fields such as AR for vision, interactive games, feed re-ranking in recommendation systems, smart push notifications, and intelligent noise reduction for voice and live streaming. Algorithms are gradually extending from the server to mobile terminals, which have stronger real-time awareness of users.

Typical applications include:

  • AR apps and games. AI provides the ability to understand visual information, and AR builds mixed virtual-and-real interaction on top of it, bringing more immersive shopping and interactive experiences. For example, beauty cameras and virtual makeup try-ons detect facial keypoints and then use AR to enhance and render makeup in specific regions.
  • Interactive games. Fliggy's Double Eleven interactive game "Find it" is an H5 page that captures images in real time through the camera, calls a classification model to classify them, and awards points when the target set by the game is recognized.
  • On-device re-ranking. By recognizing user behavior in real time, the feed stream issued by the server-side recommendation algorithm is re-ranked on the device, producing more accurate content recommendations.
  • Smart push. By perceiving user status on the device, the client decides whether to intervene and push a notification, choosing the right moment to proactively reach the user instead of relying on server-side scheduled batch pushes. This brings more precise marketing and a better user experience.

Advantages of on-device intelligence

From these common application scenarios, we can see the obvious advantages of on-device intelligence, including:

  • Low latency

    Computing locally saves the time of network round trips. For applications with high frame-rate requirements, such as a beauty camera that would otherwise have to send requests to the server every frame, high latency is simply unacceptable to users. For high-frequency interactive scenes such as games, low latency matters even more.

  • Low service cost

    Local computing saves server resources. New phone launches now emphasize the AI computing power of the phone's chip, and increasingly powerful terminal hardware makes more on-device AI applications possible.

  • Privacy protection

    Data privacy is an increasingly important topic today. By performing model inference on the device, user data does not need to be uploaded to a server, which keeps user privacy secure.

Limitations of on-device intelligence

At the same time, on-device intelligence has one very obvious limitation: low computing power. Although terminal performance keeps improving, it is still far behind that of servers. To run complex algorithms with limited resources, the runtime must be adapted to the hardware platform and optimized at the instruction level so that models can run on terminal devices, and the models themselves must be compressed to reduce their time and space consumption.

There are already some fairly mature on-device inference engines. These frameworks and engines are optimized for terminal devices to make full use of their computing power, for example TensorFlow Lite, PyTorch Mobile, Alibaba's MNN, and Baidu's PaddlePaddle.

What about the Web

The Web side shares both the advantages and the limitations of on-device AI. The browser is a primary means for PC users to access Internet content and services, and many mobile apps embed Web pages. However, the browser has limited memory and storage quotas, which makes running AI on the Web look even less feasible.

Nevertheless, as early as 2015 there was already ConvNetJS, a library that can run convolutional neural networks in the browser for classification and regression tasks, although it is no longer maintained. Around 2018, many JavaScript machine learning and deep learning frameworks emerged, such as TensorFlow.js, Synaptic, Brain.js, Mind, Keras.js, and WebDNN.

Limited by the browser's computing power, some frameworks such as Keras.js and WebDNN only support loading models for inference and cannot train in the browser.

In addition, some frameworks are not suitable for general deep learning tasks, and the network types they support differ. For example, TensorFlow.js, Keras.js, and WebDNN support DNNs, CNNs, and RNNs; ConvNetJS mainly supports CNN tasks and does not support RNNs; Brain.js and Synaptic mainly support RNN tasks but not the convolution and pooling operations used in CNNs; Mind only supports basic DNNs.

When choosing a framework, check whether it supports your specific needs.

Web-side architecture

How can the Web side make use of its limited computing power?

A typical JavaScript machine learning framework is layered as shown in the figure below. From bottom to top: the hardware and its drivers, the browser interfaces that expose the hardware, the various machine learning frameworks and graphics processing libraries, and finally our application.

(Figure: the layered architecture of a JavaScript machine learning framework)

CPU vs GPU

A prerequisite for running machine learning models in a Web browser is getting enough computing power from the GPU.

In machine learning, especially in deep network models, a widely used operation is multiplying a large matrix by a vector and then adding another vector. A typical operation of this type involves thousands or even millions of floating-point operations, but they are usually parallelizable.

Take simple vector addition as an example. Adding two vectors can be divided into many smaller operations, namely the addition at each index position, and these smaller operations do not depend on each other. Although a CPU usually needs very little time for each individual addition, concurrency gradually shows its advantage as the amount of computation grows.
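As a minimal sketch in plain JavaScript (illustrative only, not tied to any particular framework), element-wise addition makes this independence explicit: a CPU executes the loop iterations one after another, while a GPU backend can in principle compute every index in parallel.

// Minimal sketch: element-wise vector addition.
// Each index is independent of every other index, which is exactly
// what makes this kind of operation easy to parallelize on a GPU.
function addVectors(a, b) {
  const out = new Float32Array(a.length);
  for (let i = 0; i < a.length; i++) {
    out[i] = a[i] + b[i]; // no dependency on any other position
  }
  return out;
}

console.log(addVectors(new Float32Array([1, 2, 3]), new Float32Array([4, 5, 6]))); // [5, 7, 9]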


WebGPU/WebGL vs WebAssembly

Once you have the hardware, you need to make full use of the hardware.

  • WebGL

    WebGL is currently the highest-performance option for using the GPU. WebGL was designed to accelerate 2D and 3D graphics in the browser, but it can also be applied to the parallel computations of neural networks, accelerating the inference process by an order of magnitude.

  • WebGPU

    As Web applications' requirements for programmable 3D graphics, image processing, and GPU access kept growing, the W3C proposed WebGPU in 2017 to bring GPU-accelerated scientific computing to the Web. As the next-generation Web graphics API standard, it features lower driver overhead, better support for multi-threading, and the ability to use the GPU for general-purpose computation.

  • WebAssembly

    When the terminal device does not support WebGL, or its performance is weak, WebAssembly is the general CPU-based computation fallback. WebAssembly is a new type of code that can run in modern web browsers: a low-level, assembly-like language with a compact binary format that runs at near-native performance, and it provides languages such as C/C++ with a compilation target so that they can run on the Web.

TensorFlow.js

Take TensorFlow.js as an example. In order to run in different environments, TensorFlow.js supports multiple backends and automatically selects the appropriate one according to device conditions; of course, switching manually is also supported:


// Force the CPU backend; other options include 'webgl' and,
// with the corresponding backend package installed, 'wasm'.
tf.setBackend('cpu');
console.log(tf.getBackend()); // prints the active backend, here 'cpu'

In tests on some common models, the WebGL backend is about 100 times faster than the plain CPU backend, and the WebAssembly backend is 10-30 times faster than the plain JavaScript CPU backend.

TensorFlow.js also provides a tfjs-node version, which drives CPU and GPU computation through natively compiled C++ and CUDA libraries, achieving training speeds comparable to the Python version of Keras. There is no need to switch languages: you can add AI capabilities directly to an existing Node.js service instead of standing up a separate Python service.
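As a minimal sketch of that workflow (assuming the @tensorflow/tfjs-node package is installed; @tensorflow/tfjs-node-gpu is the CUDA-backed variant), a tiny model can be trained directly inside a Node.js process:

// Minimal tfjs-node sketch: fit y = 2x - 1 from a few sample points.
const tf = require('@tensorflow/tfjs-node');

const model = tf.sequential();
model.add(tf.layers.dense({ units: 1, inputShape: [1] }));
model.compile({ optimizer: 'sgd', loss: 'meanSquaredError' });

const xs = tf.tensor2d([-1, 0, 1, 2, 3, 4], [6, 1]);
const ys = tf.tensor2d([-3, -1, 1, 3, 5, 7], [6, 1]);

model.fit(xs, ys, { epochs: 200 }).then(() => {
  model.predict(tf.tensor2d([10], [1, 1])).print(); // prints a value close to 19
});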

Model compression

Besides a framework adapted to hardware devices, the model itself also needs to be compressed. Although complex models achieve better prediction accuracy, their high storage footprint, heavy consumption of computing resources, and overly slow inference are still unacceptable in most mobile scenarios.

The complexity of a model lies in its structure and its mass of parameters. A model file usually stores two kinds of information: the structure and the parameters. In the simplified neural network shown in the figure below, each square corresponds to a neuron, and each connection between neurons is a parameter.

Inference runs from the input on the left: the neurons perform calculations, the results are weighted along the connections and passed to the next layer, and so on until the final layer produces the predicted output. The more nodes and connections there are, the greater the amount of computation.

(Figure: a simplified neural network; squares are neurons, and the connections between them are the parameters)
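As a minimal sketch of that computation (plain JavaScript; the weights and biases here are hypothetical values chosen only for illustration), a single dense layer is just weighted sums plus biases passed through an activation function:

// One layer's forward pass: each output neuron is a weighted sum of
// the inputs plus a bias, passed through an activation (ReLU here).
function denseLayer(input, weights, biases) {
  return weights.map((row, j) => {
    const sum = row.reduce((acc, w, i) => acc + w * input[i], biases[j]);
    return Math.max(0, sum); // ReLU activation
  });
}

// Two inputs feeding two neurons: weights[j][i] connects input i to neuron j.
console.log(denseLayer([1, 2], [[0.5, -0.3], [0.8, 0.1]], [0, 0.2])); // [0, 1.2]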

Model pruning

Pruning the trained model is a common approach to model compression. A network model contains many redundant parameters, and the activation values of a large number of neurons are close to 0. By cutting away invalid or less important nodes, the redundancy of the model can be reduced.

The simplest and crudest form of pruning is DropOut, which randomly discards neurons during training.
Most pruning methods compute an importance factor that measures each neuron's contribution to the final result, and then cut the less important nodes.

Model pruning is an iterative process. The model is not used directly for inference after pruning; instead, its accuracy is restored through further training. The compression process is a constant trade-off between accuracy and compression ratio: choose the best compression within an acceptable range of accuracy loss.
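As a minimal illustrative sketch (plain JavaScript; real toolchains use more sophisticated importance criteria and retrain afterwards), magnitude-based pruning simply zeroes out weights whose absolute value falls below a threshold:

// Magnitude-based pruning sketch: small weights are treated as
// unimportant and set to 0 so they can be stored and computed cheaply.
function pruneWeights(weights, threshold) {
  return weights.map(w => (Math.abs(w) < threshold ? 0 : w));
}

console.log(pruneWeights([0.8, -0.01, 0.003, -0.6, 0.02], 0.05)); // [0.8, 0, 0, -0.6, 0]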

Model quantization

To preserve accuracy, most scientific computation uses floating-point numbers, most commonly the 32-bit and 64-bit types, i.e. float32 and float64. Quantization converts these high-precision values into low-precision ones.

For example, binary quantization (1-bit quantization) maps float32/float64 values directly to 1 bit, compressing storage by a factor of 32 or 64. The memory required at load time shrinks accordingly, and the smaller model size brings lower power consumption and faster computation. Besides this, there are 8-bit quantization and arbitrary-bit quantization schemes.
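As a minimal illustrative sketch of the idea behind 8-bit quantization (not the exact scheme any particular toolkit uses), float values can be mapped linearly onto the 0-255 integer range with a scale and an offset:

// 8-bit linear quantization sketch: store a scale and offset once,
// then each weight as a single byte instead of a 4-byte float.
function quantize8(weights) {
  const min = Math.min(...weights);
  const max = Math.max(...weights);
  const scale = (max - min) / 255 || 1; // avoid dividing by 0 for constant weights
  const q = Uint8Array.from(weights, w => Math.round((w - min) / scale));
  return { q, scale, min };
}

function dequantize8({ q, scale, min }) {
  return Array.from(q, v => v * scale + min); // approximate originals
}

console.log(dequantize8(quantize8([-1.0, -0.5, 0.0, 0.5, 1.0])));
// values close to the inputs, at a quarter of the storage cost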

Knowledge distillation

Knowledge distillation transfers the knowledge learned by a deep network to another, simpler network: first train a teacher network, then use the teacher network's outputs together with the true labels of the data to train the student network.
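As a minimal sketch of the training objective (plain JavaScript; the alpha weighting and loss form are illustrative assumptions, and practical recipes also apply a softmax temperature), the student's loss blends the teacher's soft predictions with the true labels:

// Distillation loss sketch: the student is penalized both for diverging
// from the teacher's soft predictions and from the true one-hot labels.
function crossEntropy(target, predicted) {
  return -target.reduce((sum, t, i) => sum + t * Math.log(predicted[i] + 1e-9), 0);
}

function distillationLoss(teacherProbs, trueLabels, studentProbs, alpha = 0.5) {
  return alpha * crossEntropy(teacherProbs, studentProbs)
       + (1 - alpha) * crossEntropy(trueLabels, studentProbs);
}

console.log(distillationLoss([0.7, 0.2, 0.1], [1, 0, 0], [0.6, 0.3, 0.1]));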

Tools

Implementing model compression yourself is rather complicated. If you are only application-oriented, it is enough to understand how it works in principle, and you can use packaged tools directly.

For example, the TensorFlow Model Optimization Toolkit provides quantization functionality. In its official compression tests on some common models, shown in the table below, the MobileNet model is compressed from over 10 MB to 3-4 MB with very little loss of accuracy.

(Table: official compression results of the TensorFlow Model Optimization Toolkit on common models)

PaddleSlim, provided by Baidu's PaddlePaddle, offers all three of the compression methods described above.


Summary

In summary, the process of developing an on-device AI application for the Web boils down to:

  • Design algorithms and train models for specific scenarios
  • Compress the model
  • Convert to the format required by the inference engine
  • Load the model to make inference predictions

For the algorithm part, general deep learning frameworks already provide many general-purpose pre-trained models that can be used directly for inference or retrained on your own data sets. Model compression and inference can likewise rely on existing tools.
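As a minimal sketch of the last step in the browser with TensorFlow.js (the model URL and input shape below are hypothetical placeholders, and the tfjs script is assumed to be loaded in a module context):

// Load a converted model and run inference; tf.loadLayersModel fetches
// model.json plus the binary weight files stored next to it.
const model = await tf.loadLayersModel('https://example.com/model/model.json');
const input = tf.zeros([1, 224, 224, 3]); // placeholder for a preprocessed image
model.predict(input).print();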

References

[1] https://tech.taobao.org/news/2021-1-7-32.html

[2] https://juejin.cn/post/6844904152095539214

[3] Ma Y, Xiang D, Zheng S, et al. Moving deep learning into web browser: How far can we go? In: The World Wide Web Conference. 2019: 1234-1244.

[4] WebGPU: https://www.w3.org/TR/webgpu/

[5] Tensorflow.js: https://www.tensorflow.org/js?hl=zh-cn

[6] WebAssembly: https://developer.mozilla.org/zh-CN/docs/WebAssembly

[7] Deep Learning with JavaScript: https://www.manning.com/books/deep-learning-with-javascript

Welcome to follow the Aotu Labs blog: aotu.io

Or follow the AOTULabs official WeChat account (AOTULabs), which pushes articles from time to time.

