Abstract: In this era when computing power is still limited, researchers are working on two fronts: continuously studying general-purpose networks for different scenarios, and optimizing the learning methods of neural networks. Both efforts are attempts to reduce the computing resources that AI requires.

This article is shared from the Huawei Cloud Community post "OCR Performance Optimization Series (2): From Neural Network to Plasticine", original author: HW007.

OCR refers to recognizing printed text in images. I have recently been working on performance optimization of an OCR model: I rewrote an OCR network originally implemented in TensorFlow using CUDA C, and ultimately achieved a 5x performance improvement. Through this optimization work, I gained a deeper understanding of the general structure of OCR networks and the related optimization methods. I plan to record it here in a series of blog posts, which also serve as a summary and study notes of my recent work. The first article, "OCR Performance Optimization Series (1): BiLSTM Network Structure Overview", derived step by step, from the perspective of motivation, how to build an OCR network based on the Seq2Seq structure. Next, let's talk about going from neural networks to plasticine.

1. Diving into CNN: On the Essence of Machine Learning

Now, start from the input in the lower-left corner of Figure 1 in OCR Performance Optimization Series (1) and walk through the whole flow of that figure. The input is 27 text-fragment images to be recognized, each of size 32×132. These images are encoded by a CNN network, which outputs 32 preliminary encoding matrices, as shown below:
[Figure: CNN encoding output — 32 matrices of size 27×384]

It is worth noting that the dimensions are rearranged in this operation: the input goes from 27×(32×132) to 32×(27×384). You can think of it as stretching each image into a single row (1×4224), and then reducing that to 1×384, similar to the example in optimization strategy one above, where 1024 dimensions were reduced to 128 to cut down the amount of computation. How is the reduction from 4224 to 384 done? The simplest, most brute-force way is the Y = AX + B model above: multiply the 1×4224 vector directly by a 4224×384 matrix A, where A is obtained by training on data. Obviously, for the same X, different A's give different Y's, and dropping from 4224 dimensions straight to 384 is a bit drastic. So researchers resort to strength in numbers: if one A is not enough, use several. Here, 32 A's are used. This 32 is the sequence length of the LSTM network that follows, and it is proportional to the width (132) of the input image; if the input width becomes 260, 64 A's are needed.
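To make the shape bookkeeping concrete, here is a minimal NumPy sketch of the transformation described above. The numbers 27, 32, 132 and 384 come from the article; the random matrices merely stand in for trained parameters, and a real CNN of course does not literally use 32 dense matrices.

```python
import numpy as np

# 27 text-fragment images, each 32 x 132 pixels.
images = np.random.rand(27, 32, 132)

# Flatten each image into a single row: 27 x 4224 (since 32 * 132 = 4224).
flat = images.reshape(27, -1)          # shape (27, 4224)

# Brute-force "dimensionality reducer": Y = AX + B with one 4224 x 384 matrix A.
A = np.random.rand(4224, 384)          # stands in for trained parameters
B = np.random.rand(384)
Y = flat @ A + B                       # shape (27, 384)

# "Strength in numbers": 32 such reducers, one per LSTM time step.
As = np.random.rand(32, 4224, 384)
Ys = np.stack([flat @ As[t] + B for t in range(32)])
print(Ys.shape)                        # (32, 27, 384)
```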

Someone might ask: why can't I see these 32 A's in the CNN? Indeed, they are just my abstraction of what the CNN network does. The point is to give you an intuitive picture of the dimensional changes in the CNN encoding process and of the sequence length of the LSTM layer that follows: the LSTM sequence length is in fact the number of "dimensionality reducers". If you are sharp, you will have noticed that even calling it "dimensionality reduction" is wrong, because if the outputs of the 32 "dimensionality reducers" are put together, that is 32×384 = 12288, which is much larger than 4224. After passing through the CNN network, the dimensionality has not decreased at all; it has increased! In fact, the more formal name for this component is "encoder" (or "decoder"); "dimensionality reducer" is a term I coined for vividness in this article. Call it whatever you like, as long as you remember that its essence is the coefficient matrix A.

Let's continue with the idea of 32 A's. After passing through the CNN network, the data for each text image goes from 32×132 to 32×384. On the surface the amount of data has increased, but the amount of information has not. It is just like this long article of mine introducing OCR: the text has grown, yet the information is still "the principle of OCR"; at best the readability or the fun has improved. How does the CNN network manage this kind of "nonsense" (increasing the amount of data without increasing the amount of information)? Hidden here is a top-level trick of machine learning models, formally called "parameter sharing". A simple example: if one day a friend happily tells you his new discovery, "1 apple costs 5 yuan, 2 apples cost 10 yuan, 3 apples cost 15 yuan...", you would doubt his IQ. Just say "n apples cost 5n yuan", wouldn't that do? The essence of nonsense is giving piles of redundant examples without abstracting or summarizing; the essence of intelligence is summarizing rules and experience. So-called machine learning is like this apple-buying example: you feed the machine a large number of examples, simply and crudely, and expect the machine to summarize the rules and experience by itself. In the OCR example above, these rules and experience correspond to the network structure and the model parameters of the whole model. For the network structure, the mainstream today still relies on human intelligence, such as CNN networks for image scenes, RNN networks for sequence scenes, and so on. If the network structure is not chosen well, the learned model will perform poorly.

As analyzed above, structurally speaking, 32 independent 4224×384 A matrices can fully satisfy the CNN's requirements on input and output dimensions, but a model trained this way will perform poorly. In theory, 32 independent 4224×384 A matrices amount to 32×4224×384 = 51,904,512 parameters, which means the model has too much freedom during learning and can easily learn badly. In the apple example above, the person who can state the fact clearly in fewer words has the stronger ability to summarize, because extra words, besides being redundant, may drag in irrelevant information that will interfere when the rule is used later. Suppose the apple price can be stated clearly in 10 words: if you use 20 words, it is very likely that the other 10 words describe something irrelevant, such as the weather or the time. This is the famous "Occam's razor": do not complicate simple things. In machine learning it is mainly applied to the design and selection of model structures, and "parameter sharing" is one of its common tricks. Simply put, 32 independent 4224×384 A matrices give the model too many parameters and too much freedom in learning, so we add constraints to make the model less free, forcing it to explain the apple price within 10 words. The method is very simple: for example, require these 32 A's to be exactly the same, so that the number of parameters is only 4224×384, reduced by a factor of 32 at a stroke. If you feel a 32x reduction is too harsh, relax it a little: do not require the 32 A's to be exactly the same, only that they look alike.
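As a back-of-the-envelope check of the numbers above, here is a minimal sketch of the parameter-count argument (the figures 4224, 384 and 32 are taken from the article):

```python
# Parameter count with and without sharing the 32 A matrices.
in_dim, out_dim, steps = 4224, 384, 32

independent = steps * in_dim * out_dim    # 32 separate A matrices
shared      = in_dim * out_dim            # one A matrix reused 32 times

print(independent)            # 51904512
print(shared)                 # 1622016
print(independent // shared)  # 32 -> parameter count reduced 32-fold
```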

Here I call this razor method "you never know how good a model can be until you force it". Some people may object: even if you allow me 20 words to explain the apple price, that does not rule out the possibility that I am highly motivated, pursue the ultimate, and still make it clear in 10 words. Likewise, if one A matrix is enough to cover the case above, then no matter whether I provide 32 A's or 64 A's, the model should learn exactly the same A every time. This statement is correct in theory, but unrealistic for now.

To discuss this issue, we have to return to the essence of machine learning. All models can be abstractly expressed as Y = AX, where X is the model input, Y is the model output, and A is the model. Note that the A in this paragraph is different from the A above: here it contains both the structure of the model and its parameters. Training a model means knowing X and Y and solving for A. The question above is really whether the solved A is unique. First, let me take a step back and suppose that the problem does have a law in the real world, that is, this A exists and is unique in reality, just like a law of physics. Then, can our model capture this law from a large amount of data during training?

Take the mass-energy equation E = MC^2 from physics as an example. In this model, the model structure is E = MC^2 and the model parameter is the speed of light C. This model, proposed by Einstein, can be called a crystallization of human intelligence. If today's AI methods were used to solve this problem, there would be two situations: strong AI and weak AI. First, the most extreme weak AI approach is today's mainstream AI approach: mostly human intelligence plus a small part of machine intelligence. Concretely, humans use their own intelligence to discover that the relationship between E and M satisfies the model E = MC^2, then feed a lot of E and M data to the machine and let the machine learn the parameter C in this model. In this example, the solution for C is unique, and only a small amount of E and M data needs to be fed to the machine to solve for it. Obviously, the bulk of the intelligent work lies in how Einstein eliminated all sorts of messy factors such as time, temperature and humidity, and determined that E is related only to M and satisfies E = MC^2. In machine learning this part of the work is called "feature selection", which is why many machine learning engineers jokingly call themselves "feature engineers".
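As an illustration of the weak-AI case, here is a minimal sketch: the structure E = MC^2 is fixed by the human, and the machine only has to fit the single parameter C from a handful of (M, E) pairs. The data below is synthetic, generated purely for the example.

```python
import numpy as np

# Synthetic data generated from the known law, just for illustration.
true_C = 3.0e8                       # speed of light, m/s
M = np.array([1e-6, 2e-6, 5e-6])     # masses in kg
E = M * true_C**2                    # energies in joules

# Model structure fixed by the human: E = M * C^2, i.e. E is linear in M
# with slope C^2. The machine only fits that one slope from the data:
slope = np.sum(M * E) / np.sum(M * M)   # least-squares slope through the origin
C_learned = np.sqrt(slope)

print(C_learned)                     # ~3.0e8
```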

On the contrary, the expectation for strong AI is that we feed the machine a lot of data, such as energy, mass, temperature, volume, time, speed, and so on, and the machine tells us that energy is related only to mass, that their relationship is E = MC^2, and that the value of the constant C is 3.0×10^8 (m/s). Here, the machine learns not only the parameters of the model but also its structure. To achieve this, the first step is to find a generalized model that, after suitable processing, can describe any model structure in the world, just as plasticine can be kneaded into any shape. That plasticine is the neural network of the AI field, which is why many theoretical AI books and courses like to begin with a popular-science argument that neural networks can represent anything, proving that they are the plasticine of AI. Now that we have the plasticine, the next step is how to knead it, and that is where the difficulty lies. Not every problem or scene in life can be perfectly represented by a mathematical model, and even if it can, before the model structure is discovered, nobody knows what it looks like. How can you make the machine knead out something when you yourself do not know its shape? The only way is to feed the machine a large number of examples and say "whatever you knead must be able to walk, must be able to fly, and so on". In fact, this problem has no unique solution: the machine might knead out a bird for you, or it might knead out a cockroach. There are two reasons. One is that you cannot feed the machine every possible example; there will always be a "black swan" event. The other is that the more examples you feed, the higher the demand on the machine's computing power, which is also why neural networks, proposed so long ago, have only become popular in the past few years.

After the discussion in the paragraphs above, I hope you now have a more intuitive understanding of model structure and model parameters in machine learning. If the model structure is designed by human intelligence and only the parameters are learned by the machine, then as long as the structure is accurate, such as E = MC^2 above, we only need to feed the machine a small amount of data, and the model may even have a nice analytical solution! But there is only one Einstein, so what happens more often is that ordinary people like us feed the machine a large number of examples, hoping it can knead out a shape we ourselves do not know, never mind the beautiful property of an analytical solution. Students familiar with the machine learning training process should know that machine learning kneads the plasticine with "stochastic gradient descent". In plain language: knead a bit, then check whether the result meets your requirements (i.e. fits the training data); if not, knead again, and keep repeating until it does. It follows that the machine's motivation to keep kneading is that the result does not yet meet your requirements, which shows that the machine is a very "lazy" thing: once it has described the apple-price rule in 20 words, it has no motivation to describe it in 10. So, when you do not know what you want the machine to knead out, it is best not to give it too much freedom; otherwise it will knead out something very complicated for you, something that meets your requirements yet will not work well, because it violates the razor principle. In a machine learning model, more parameters mean more freedom and a more complex structure, which is what gives rise to the phenomenon of "overfitting". Therefore, many classic networks use tricks to do "parameter sharing" and thereby reduce the number of parameters, such as the convolution used in CNN networks, or the slowly changing control matrix C introduced by LSTM.
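Here is a minimal sketch of the "knead, check, knead again" loop of stochastic gradient descent described above, fitting the single parameter of the apple example (price = w × n). Everything here is synthetic and only illustrates the loop itself, not the author's OCR training code.

```python
import numpy as np

# Training data for the apple example: n apples cost 5n yuan.
n = np.array([1.0, 2.0, 3.0, 4.0])
price = 5.0 * n

w = 0.0            # the parameter the machine has to learn
lr = 0.02          # learning rate

for step in range(200):
    i = np.random.randint(len(n))        # pick one random example ("stochastic")
    pred = w * n[i]                       # knead: current guess
    grad = 2 * (pred - price[i]) * n[i]   # check: gradient of the squared error
    w -= lr * grad                        # knead again: adjust the parameter

print(round(w, 2))   # ~5.0
```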

2. Testing the Blade: LSTM Awaits

After the discussion in the section above, I believe you have picked up some routines for analyzing machine learning. Let's finish by practicing them on the bidirectional LSTM.
[Figure: bidirectional LSTM network structure]

As shown in the figure above, first look at the input and output of the LSTM network. The most obvious input is the 32 matrices of 27×384, and the output is 32 matrices of 27×256, where each 27×256 matrix is formed by stitching together two 27×128 matrices, output respectively by the forward LSTM and the reverse LSTM. For simplicity, we look only at the forward LSTM for now. In that case, the input is 32 matrices of 27×384 and the output is 32 matrices of 27×128. According to the "dimensionality reducer" routine analyzed above, 32 matrices of size 384×128 would be needed here. With the "parameter sharing" routine applied, the structure of a real single LSTM unit is shown in the figure below:
[Figure: structure of a single LSTM unit]

It can be seen from the figure that a real LSTM unit is not a simple 384×128 matrix. Instead, the output node H of the previous unit in the LSTM sequence is pulled down and stitched together with the input X to form a 27×512 input, which is multiplied by a 512×512 matrix; the result is then processed together with the control node C output by the previous unit, reducing the 512 dimensions to 128, and finally producing two outputs: a new 27×128 output node H and a new 27×128 control node C. These new H and C are passed on to the next LSTM unit and influence its output.
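Here is a minimal NumPy sketch of one such unit, following the shapes described above (27×384 input X, 27×128 previous H and C, one 512×512 weight matrix covering the four internal gates). It is a plain textbook LSTM cell written out by hand, not the author's actual implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_unit(X, H_prev, C_prev, W, b):
    """One LSTM unit: X is 27x384, H_prev and C_prev are 27x128."""
    Z = np.hstack([H_prev, X])                # stitch H and X together: 27 x 512
    gates = Z @ W + b                         # one shared 512 x 512 matrix: 27 x 512
    i, f, o, g = np.split(gates, 4, axis=1)   # four 27 x 128 blocks
    C_new = sigmoid(f) * C_prev + sigmoid(i) * np.tanh(g)  # new control node C
    H_new = sigmoid(o) * np.tanh(C_new)                    # new output node H
    return H_new, C_new                       # both 27 x 128

# Shapes from the article: 27 samples, input width 384, hidden width 128.
X = np.random.rand(27, 384)
H = np.zeros((27, 128))
C = np.zeros((27, 128))
W = np.random.rand(512, 512) * 0.01           # stands in for trained parameters
b = np.zeros(512)

H, C = lstm_unit(X, H, C, W, b)
print(H.shape, C.shape)                       # (27, 128) (27, 128)
```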

Here we can see that, thanks to the matrices C and H, even though the 512×512 matrix is exactly the same in all 32 units of the LSTM sequence, the mapping from each unit's inputs H and X to its outputs is not identical; yet because they are all multiplied by the same 512×512 matrix, the outputs, while different, remain similar to one another, since they all follow one set of rules (that very same 512×512 matrix). In other words, LSTM takes the previous unit's output H together with the input X as its input, and at the same time introduces the control matrix C, thereby applying the razor method, achieving parameter sharing and simplifying the model. This structure also links the output of the current sequence unit to the output of the previous one, which makes it suitable for modeling sequence scenarios such as OCR, NLP, machine translation, speech recognition, and so on.
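And here is a minimal sketch of how the 32 units of the forward and reverse chains reuse the same weights, and how the two 27×128 outputs are stitched into 27×256 per step. The recurrence is deliberately simplified (not a full LSTM) just to show the shape and parameter-sharing bookkeeping.

```python
import numpy as np

steps, n, in_dim, hid = 32, 27, 384, 128
X_seq = np.random.rand(steps, n, in_dim)      # the 32 matrices of 27 x 384 from the CNN

def run_direction(X_seq, W):
    """Simplified recurrence: every step reuses the SAME 512 x 512 matrix W."""
    H = np.zeros((n, hid))
    outputs = []
    for X in X_seq:
        Z = np.hstack([H, X])                 # 27 x 512
        H = np.tanh((Z @ W)[:, :hid])         # keep 128 of the 512 outputs (simplified)
        outputs.append(H)
    return np.stack(outputs)                  # 32 x 27 x 128

W_fwd = np.random.rand(512, 512) * 0.01       # one shared matrix for all 32 forward units
W_bwd = np.random.rand(512, 512) * 0.01       # one shared matrix for all 32 reverse units

H_fwd = run_direction(X_seq, W_fwd)
H_bwd = run_direction(X_seq[::-1], W_bwd)[::-1]

out = np.concatenate([H_fwd, H_bwd], axis=2)  # 32 matrices of 27 x 256
print(out.shape)                              # (32, 27, 256)
```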

Here we can also see that although the neural network is that piece of plasticine, we cannot feed it all possible data: either the machine's computing power cannot support it, or we want to speed up its learning. So in today's AI applications, we all carefully design the network structure based on the actual scenario and our own prior knowledge, and only then hand this piece of plasticine, already largely shaped by humans, over to the machine to knead further. That is why I prefer to call this an era of weak AI. In this era when computing power is still limited, researchers are on the one hand continuously studying general-purpose networks for different scenarios, such as CNN for images and RNN, LSTM and GRU for sequences, and on the other hand working on optimizing the learning methods of neural networks, such as the various optimized variants of SGD and reinforcement learning training methods. All of these are attempts to reduce the computing resources that AI requires.

I believe that with the efforts of mankind, the era of strong AI will come.


