Preface
Word Embedding is one of the most widely used techniques in natural language processing (NLP) and appears throughout enterprise modeling practice. With Word Embedding we can map natural-language text into a numerical representation that a computer can work with, and then feed it into a neural network model for learning and computation. How can we understand Word Embedding more deeply and quickly get started generating it? This article explains the principles behind Word Embedding and the methods for generating it.
1. A preliminary exploration of Word Embedding
What is Word Embedding
In one sentence, a Word Embedding is a word vector, defined by a function that maps words to vectors. We know that in machine learning, features are passed around as numerical values; likewise, in NLP, text features need to be mapped to numerical vectors. For example, after applying Word Embedding to the word "Hello", we might map it to a 5-dimensional vector: Hello -> (0.1, 0.5, 0.3, 0.2, 0.2).
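To make this mapping concrete, here is a minimal sketch of an embedding lookup table; the words and vector values are purely illustrative and not taken from any trained model.

```python
import numpy as np

# A minimal sketch of what a learned embedding table looks like.
# The vector values below are made up purely for illustration.
embedding_table = {
    "Hello": np.array([0.1, 0.5, 0.3, 0.2, 0.2]),
    "world": np.array([0.4, 0.1, 0.0, 0.7, 0.3]),
}

def embed(word: str) -> np.ndarray:
    """Map a word (text feature) to its numerical vector."""
    return embedding_table[word]

print(embed("Hello"))  # -> [0.1 0.5 0.3 0.2 0.2]
```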
Word vector mapping process
Generally speaking, we vectorize text words through the mapping process "word -> vector space 1 -> vector space 2". The whole process can be divided into two steps:
1. Words -> Vector Space 1
This step converts a word into a numerical vector, for example by turning each text word into a One-Hot vector.
2. Vector Space 1 -> Vector Space 2
This step solves the vector optimization problem: given an existing vector, find a better way to represent it.
2. Finding Word Embedding with One-Hot and SVD
One-Hot (word -> vector space 1)
One-Hot encoding is currently one of the most common methods for extracting text features. This article uses One-Hot to complete the first step of the mapping process, that is, word -> vector space 1.
We treat each word in the corpus as a feature column. If there are V words in the corpus, there are V feature columns, for example:
In this mapping process, One-Hot has the following shortcomings: 1) it produces sparse features; 2) it easily causes a dimensionality explosion; 3) it loses the semantic relationships between words.
For example, common sense says that "hotel" and "motel" should be somewhat similar, yet under this mapping their vector dot product is zero: the similarity between "hotel" and "motel" comes out the same as the similarity between "hotel" and "cat", which is obviously unreasonable.
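A small sketch over a toy three-word vocabulary (a hypothetical example) shows the problem directly:

```python
import numpy as np

# One-Hot vectors over a toy vocabulary (hypothetical example).
vocab = ["hotel", "motel", "cat"]
one_hot = {word: np.eye(len(vocab))[i] for i, word in enumerate(vocab)}

# Every pair of distinct words has dot product 0, so One-Hot cannot tell
# us that "hotel" is more similar to "motel" than it is to "cat".
print(one_hot["hotel"] @ one_hot["motel"])  # 0.0
print(one_hot["hotel"] @ one_hot["cat"])    # 0.0
```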
Directions for improvement:
1) Map the word vectors into a lower-dimensional space;
2) at the same time, preserve semantic similarity in that low-dimensional space, so that the more related two words are, the closer their vectors lie.
SVD (vector space 1 -> vector space 2)
1. How to express the relationship between words
SVD, or Singular Value Decomposition, is an algorithm widely used in machine learning. It is used for feature decomposition in dimensionality-reduction algorithms and is also widely applied in recommender systems, natural language processing and other fields; it is a cornerstone of many machine learning algorithms. This article uses SVD to solve the vector optimization problem.
We first construct an affinity matrix so that the relationships between words are captured before any dimensionality reduction. There are many ways to construct an affinity matrix; here are two of the more common ones.
✦ Approach 1
Suppose you have N articles and a total of M de-duplicated words; you can then construct the affinity matrix as follows:
Each value is the number of times the word appears in a given article. This matrix reflects some properties of words: a word like "seeding" will tend to appear more often in "agriculture" articles, while a word like "movie" will tend to appear more often in "art" articles.
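A minimal sketch of this construction, using a hypothetical two-article corpus:

```python
import numpy as np

# Sketch: build the N x M word-count affinity matrix from a toy corpus.
# The articles and vocabulary below are hypothetical.
articles = [
    "seeding seeding harvest farm",   # an "agriculture"-flavoured article
    "movie movie actor scene",        # an "art"-flavoured article
]
vocab = sorted({w for a in articles for w in a.split()})  # M de-duplicated words

counts = np.zeros((len(articles), len(vocab)), dtype=int)  # N x M
for i, article in enumerate(articles):
    for word in article.split():
        counts[i, vocab.index(word)] += 1

print(vocab)
print(counts)
```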
✦ Approach 2
Suppose we have M de-duplicated words; we can then construct an M×M matrix in which each value is the number of times the corresponding pair of words appears together in an article, for example:
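A minimal sketch of this co-occurrence construction, again on a hypothetical toy corpus:

```python
import numpy as np
from itertools import combinations

# Sketch: build the M x M co-occurrence affinity matrix, counting how often
# each pair of words appears together in the same article (toy corpus).
articles = [
    "hotel motel room",
    "hotel room price",
    "cat movie scene",
]
vocab = sorted({w for a in articles for w in a.split()})
index = {w: i for i, w in enumerate(vocab)}

cooc = np.zeros((len(vocab), len(vocab)), dtype=int)  # M x M
for article in articles:
    for w1, w2 in combinations(set(article.split()), 2):
        cooc[index[w1], index[w2]] += 1
        cooc[index[w2], index[w1]] += 1  # keep the matrix symmetric

print(vocab)
print(cooc)
```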
2. Decompose the affinity matrix
With the affinity matrix in hand, we can apply SVD to it in order to reduce dimensionality. The result looks like this:
We decompose the original affinity matrix X (on the left) into the three factors on the right, which can be understood from left to right as follows:
✦ U matrix: the transformation from the old high-dimensional vector space to the low-dimensional vector space;
✦ Σ matrix: the variance (singular-value) matrix. Each column corresponds to the amount of information carried by one axis of the low-dimensional space: the larger the variance, the more the data fluctuates along that axis and the richer the information. When reducing dimensionality, we prefer to keep the axes with the largest variance;
✦ V matrix: the new representation of each word vector. Multiplying it with the first two matrices yields the final word-vector representation.
At this point the matrix on the right is still V-dimensional, so no dimensionality reduction has happened yet. Therefore, as mentioned above, we keep the top-k columns with the largest variance, arranging the three matrices U, Σ and V in order of variance from largest to smallest, which gives the final dimensionality-reduction result:
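A minimal numpy sketch of this decomposition and truncation, assuming the rows of X correspond to words (the matrix here is random, just to show the shapes):

```python
import numpy as np

# Sketch: SVD-decompose an affinity matrix X and keep only the top-k axes.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(6, 6)).astype(float)  # stand-in affinity matrix

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U @ diag(s) @ Vt
# numpy returns the singular values sorted from largest to smallest, so
# keeping the first k columns keeps the axes with the most information.
k = 2
X_reduced = U[:, :k] * s[:k]   # each row is now a k-dimensional word vector

print(X_reduced.shape)  # (6, 2): every word now lives in a 2-D space
```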
3. Disadvantages of SVD
1) The dimensionality of the affinity matrix may change frequently, because new words keep being added, and the SVD decomposition has to be redone every time this happens, so the method is not very general; 2) the affinity matrix may be very sparse, because many words never appear together.
Ideas for improvement:
1) To reduce sparsity, focus only on the words that have a contextual relationship with a given word; 2) for words the model has never seen, consider inferring their information from context to improve generality.
Following these two ideas, we can introduce CBOW and Skip-Gram to learn word embeddings.
3. CBOW and Skip-Gram for Word Embedding
CBOW stands for Continuous Bag of Words; its essence is to predict the probability that a word is the center word given its context words. The Skip-Gram algorithm does the reverse: given the center word, it predicts the probability that a word appears in its context.
The topic of this article is embedding. The point of predicting center words and contexts is that, by training on these predictions, the model learns the semantic relationships between words and performs dimensionality reduction at the same time, which yields the embedding we ultimately want.
CBOW
Idea:
Assume a center word and a window of context words are known.
We can then try to train a matrix V whose role is to map words into a new vector space (this is the embedding we want!).
At the same time, we can train a matrix U whose role is to map the embedding vectors into a probability space and compute the probability that each word is the center word.
Training process:
Process details:
(1) Suppose x^(c) is the center word and the context window has length m; the context sample can then be written as (x^(c-m), …, x^(c-1), x^(c+1), …, x^(c+m)), where each element is a One-Hot vector.
(2) We want Word Embedding to map these One-Hot vectors into a lower-dimensional space. As a reminder, Word Embedding is a function; mapping into a lower-dimensional space reduces sparsity while preserving the semantic relationships between words.
(3) After obtaining the embeddings, take their average. We average because these context words are related to one another, and for training convenience we can represent them in a more compact way.
(4) In this way we obtain the averaged embedding of the text in the low-dimensional space.
Next, we need to train a parameter matrix that operates on this average embedding and outputs, for each word in the vocabulary, the probability that it is the center word.
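Putting steps (1) through (4) together, here is a minimal numpy sketch of one CBOW forward pass; the matrix shapes, vocabulary size and random initialization are illustrative assumptions rather than the exact setup shown in the figures.

```python
import numpy as np

# Minimal CBOW forward pass, following steps (1)-(4) above.
V_size, d = 10, 4                   # hypothetical vocabulary size and embedding size
rng = np.random.default_rng(0)
V = rng.normal(size=(d, V_size))    # embedding matrix: One-Hot word -> d-dim vector
U = rng.normal(size=(V_size, d))    # output matrix: d-dim vector -> score per word

def one_hot(i: int) -> np.ndarray:
    x = np.zeros(V_size)
    x[i] = 1.0
    return x

context_ids = [1, 3, 6, 7]                                  # the 2m context words
context_embeddings = [V @ one_hot(i) for i in context_ids]  # step (2): embed
v_hat = np.mean(context_embeddings, axis=0)                 # step (3): average

scores = U @ v_hat                                # map back to vocabulary space
probs = np.exp(scores) / np.exp(scores).sum()     # softmax: P(word is center word)
print(probs.argmax(), probs.round(3))
```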
A review of the end-to-end CBOW training process
Softmax scoring with the trained parameter matrix
Cross entropy:
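The formula images are not reproduced here; written out in the usual CBOW notation (an assumption based on the standard formulation), the softmax scoring and the cross-entropy loss are:

```latex
\hat{v} = \frac{1}{2m}\left(v_{c-m} + \dots + v_{c-1} + v_{c+1} + \dots + v_{c+m}\right)
\quad \text{(average context embedding)}

\hat{y} = \operatorname{softmax}(U\hat{v}), \qquad
\hat{y}_j = \frac{\exp\!\left(u_j^{\top}\hat{v}\right)}{\sum_{k=1}^{V}\exp\!\left(u_k^{\top}\hat{v}\right)}

H(\hat{y}, y) = -\sum_{j=1}^{V} y_j \log \hat{y}_j = -\log \hat{y}_c
\quad \text{(since } y \text{ is One-Hot at the center word)}
```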
Skip-Gram
Skip-Gram takes the center word as known and predicts the context words; the derivation mirrors CBOW, so it is not repeated here.
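In practice, CBOW and Skip-Gram are usually trained with an off-the-shelf library rather than implemented from scratch. The following is a minimal sketch assuming gensim 4.x is installed, with a toy corpus made up purely for illustration.

```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens (made up for illustration).
sentences = [
    ["hotel", "room", "price"],
    ["motel", "room", "price"],
    ["cat", "movie", "scene"],
]

# sg=0 trains CBOW (predict the center word from its context);
# sg=1 trains Skip-Gram (predict the context from the center word).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["hotel"][:5])                  # first 5 dims of the learned embedding
print(model.wv.similarity("hotel", "motel"))  # cosine similarity between two words
```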
Summary
This article has explained the principles of Word Embedding and the methods for generating it, and has answered common questions that arise during Word Embedding generation, in the hope of helping readers use Word Embedding more effectively in practice.
Machine learning is developing rapidly today and is being applied in many industry scenarios. As a data intelligence company, GeTui continues to explore large-scale machine learning and natural language processing, and also applies Word Embedding to tag modeling and other work. GeTui has built a multidimensional profiling system covering thousands of tags, and continues to help customers in mobile Internet, brand marketing, public services and other fields carry out user insight, population analysis and data-driven operations.
In future posts, GeTui will continue to share in-depth content on algorithm modeling, machine learning and related topics. Stay tuned.