Preface
Word Embedding is one of the most widely used techniques in natural language processing (NLP) and appears throughout enterprise modeling practice. With Word Embedding we can map natural-language text into a numerical representation that a computer can work with, and then feed it into a neural network model for learning and computation. How can we understand Word Embedding more deeply and quickly get started generating it? This article explains the principles behind Word Embedding and the methods for generating it.
1. A preliminary exploration of Word Embedding
What is Word Embedding
In one sentence, a Word Embedding is a word vector, defined by a function that maps words to vectors. We know that in machine learning, features are passed around as numerical values; likewise, in NLP, text features need to be mapped to numerical vectors. For example, after applying Word Embedding to the word "Hello", we might map it to a 5-dimensional vector: Hello -> (0.1, 0.5, 0.3, 0.2, 0.2).
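To make this mapping concrete, here is a minimal sketch of an embedding lookup table; the words and vector values are purely illustrative and not taken from any trained model.

```python
import numpy as np

# A minimal sketch of what a learned embedding table looks like.
# The vector values below are made up purely for illustration.
embedding_table = {
    "Hello": np.array([0.1, 0.5, 0.3, 0.2, 0.2]),
    "world": np.array([0.4, 0.1, 0.0, 0.7, 0.3]),
}

def embed(word: str) -> np.ndarray:
    """Map a word (text feature) to its numerical vector."""
    return embedding_table[word]

print(embed("Hello"))  # -> [0.1 0.5 0.3 0.2 0.2]
```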
Word vector mapping process
Generally speaking, we vectorize text words through the mapping process "word -> vector space 1 -> vector space 2". The whole process can be divided into two steps:
1. Words -> Vector Space 1
This step converts a word into a numerical vector, for example by turning each text word into a One-Hot vector.
2. Vector Space 1 -> Vector Space 2
This step solves the vector optimization problem: given an existing vector, find a better way to represent it.
2. Finding Word Embedding with One-Hot and SVD
One-Hot (word -> vector space 1)
One-Hot encoding is currently one of the most common methods for extracting text features. This article uses One-Hot to complete the first step of the mapping process, that is, word -> vector space 1.
We treat each word in the corpus as a feature column. If there are V words in the corpus, there are V feature columns, for example:
In this mapping process, One-Hot has the following shortcomings: 1) it produces sparse features; 2) it easily causes a dimensionality explosion; 3) it loses the semantic relationships between words.
For example, common sense says that "hotel" and "motel" should be somewhat similar, yet under this mapping their vector dot product is zero: the similarity between "hotel" and "motel" comes out the same as the similarity between "hotel" and "cat", which is obviously unreasonable.
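A small sketch over a toy three-word vocabulary (a hypothetical example) shows the problem directly:

```python
import numpy as np

# One-Hot vectors over a toy vocabulary (hypothetical example).
vocab = ["hotel", "motel", "cat"]
one_hot = {word: np.eye(len(vocab))[i] for i, word in enumerate(vocab)}

# Every pair of distinct words has dot product 0, so One-Hot cannot tell
# us that "hotel" is more similar to "motel" than it is to "cat".
print(one_hot["hotel"] @ one_hot["motel"])  # 0.0
print(one_hot["hotel"] @ one_hot["cat"])    # 0.0
```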
Directions for improvement:
1) Map the word vectors into a lower-dimensional space;
2) at the same time, preserve semantic similarity in that low-dimensional space, so that the more related two words are, the closer their vectors lie.
SVD (vector space 1 -> vector space 2)
1. How to express the relationship between words
SVD, or Singular Value Decomposition, is an algorithm widely used in machine learning. It is used for feature decomposition in dimensionality-reduction algorithms and is also widely applied in recommender systems, natural language processing and other fields; it is a cornerstone of many machine learning algorithms. This article uses SVD to solve the vector optimization problem.
We first construct an affinity matrix so that the relationships between words are captured before any dimensionality reduction. There are many ways to construct an affinity matrix; here are two of the more common ones.
✦ Approach 1
Suppose you have N articles and a total of M de-duplicated words; you can then construct the affinity matrix as follows:
Each value is the number of times the word appears in a given article. This matrix reflects some properties of words: a word like "seeding" will tend to appear more often in "agriculture" articles, while a word like "movie" will tend to appear more often in "art" articles.
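A minimal sketch of this construction, using a hypothetical two-article corpus:

```python
import numpy as np

# Sketch: build the N x M word-count affinity matrix from a toy corpus.
# The articles and vocabulary below are hypothetical.
articles = [
    "seeding seeding harvest farm",   # an "agriculture"-flavoured article
    "movie movie actor scene",        # an "art"-flavoured article
]
vocab = sorted({w for a in articles for w in a.split()})  # M de-duplicated words

counts = np.zeros((len(articles), len(vocab)), dtype=int)  # N x M
for i, article in enumerate(articles):
    for word in article.split():
        counts[i, vocab.index(word)] += 1

print(vocab)
print(counts)
```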
✦ Approach 2
Suppose we have M de-duplicated words; we can then construct an M×M matrix in which each value is the number of times the corresponding pair of words appears together in an article, for example:
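A minimal sketch of this co-occurrence construction, again on a hypothetical toy corpus:

```python
import numpy as np
from itertools import combinations

# Sketch: build the M x M co-occurrence affinity matrix, counting how often
# each pair of words appears together in the same article (toy corpus).
articles = [
    "hotel motel room",
    "hotel room price",
    "cat movie scene",
]
vocab = sorted({w for a in articles for w in a.split()})
index = {w: i for i, w in enumerate(vocab)}

cooc = np.zeros((len(vocab), len(vocab)), dtype=int)  # M x M
for article in articles:
    for w1, w2 in combinations(set(article.split()), 2):
        cooc[index[w1], index[w2]] += 1
        cooc[index[w2], index[w1]] += 1  # keep the matrix symmetric

print(vocab)
print(cooc)
```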
2. Decompose the affinity matrix
With the affinity matrix in hand, we can apply SVD to it in order to reduce dimensionality. The result looks like this:
We decompose the original affinity matrix X (on the left) into the three factors on the right, which can be understood from left to right as follows:
✦ U matrix: the transformation from the old high-dimensional vector space to the low-dimensional vector space;
✦ Σ matrix: the variance (singular-value) matrix. Each column corresponds to the amount of information carried by one axis of the low-dimensional space: the larger the variance, the more the data fluctuates along that axis and the richer the information. When reducing dimensionality, we prefer to keep the axes with the largest variance;
✦ V matrix: the new representation of each word vector. Multiplying it with the first two matrices yields the final word-vector representation.
At this point the matrix on the right is still V-dimensional, so no dimensionality reduction has happened yet. Therefore, as mentioned above, we keep the top-k columns with the largest variance, arranging the three matrices U, Σ and V in order of variance from largest to smallest, which gives the final dimensionality-reduction result:
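A minimal numpy sketch of this decomposition and truncation, assuming the rows of X correspond to words (the matrix here is random, just to show the shapes):

```python
import numpy as np

# Sketch: SVD-decompose an affinity matrix X and keep only the top-k axes.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(6, 6)).astype(float)  # stand-in affinity matrix

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U @ diag(s) @ Vt
# numpy returns the singular values sorted from largest to smallest, so
# keeping the first k columns keeps the axes with the most information.
k = 2
X_reduced = U[:, :k] * s[:k]   # each row is now a k-dimensional word vector

print(X_reduced.shape)  # (6, 2): every word now lives in a 2-D space
```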
3. Disadvantages of SVD
1) The dimensionality of the affinity matrix may change frequently, because new words keep being added, and the SVD decomposition has to be redone every time this happens, so the method is not very general; 2) the affinity matrix may be very sparse, because many words never appear together.
Ideas for improvement:
1) To reduce sparsity, focus only on the words that have a contextual relationship with a given word; 2) for words the model has never seen, consider inferring their information from context to improve generality.
Following these two ideas, we can introduce CBOW and Skip-Gram to learn word embeddings.
3. CBOW and Skip-Gram for Word Embedding
CBOW stands for Continuous Bag of Words; its essence is to predict the probability that a word is the center word given its context words. The Skip-Gram algorithm does the reverse: given the center word, it predicts the probability that a word appears in its context.
The topic of this article is embedding. The point of predicting center words and contexts is that, by training on these predictions, the model learns the semantic relationships between words and performs dimensionality reduction at the same time, which yields the embedding we ultimately want.
CBOW
Idea:
Assume a center word and a window of context words are known.
We can then try to train a matrix V whose role is to map words into a new vector space (this is the embedding we want!).
At the same time, we can train a matrix U whose role is to map the embedding vectors into a probability space and compute the probability that each word is the center word.
Training process:
Process details:
(1) Suppose x^(c) is the center word and the context window has length m; the context sample can then be written as (x^(c-m), …, x^(c-1), x^(c+1), …, x^(c+m)), where each element is a One-Hot vector.
(2) We want Word Embedding to map these One-Hot vectors into a lower-dimensional space. As a reminder, Word Embedding is a function; mapping into a lower-dimensional space reduces sparsity while preserving the semantic relationships between words.
(3) After obtaining the embeddings, take their average. We average because these context words are related to one another, and for training convenience we can represent them in a more compact way.
(4) In this way we obtain the averaged embedding of the text in the low-dimensional space.
Next, we need to train a parameter matrix that operates on this average embedding and outputs, for each word in the vocabulary, the probability that it is the center word.
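Putting steps (1) through (4) together, here is a minimal numpy sketch of one CBOW forward pass; the matrix shapes, vocabulary size and random initialization are illustrative assumptions rather than the exact setup shown in the figures.

```python
import numpy as np

# Minimal CBOW forward pass, following steps (1)-(4) above.
V_size, d = 10, 4                   # hypothetical vocabulary size and embedding size
rng = np.random.default_rng(0)
V = rng.normal(size=(d, V_size))    # embedding matrix: One-Hot word -> d-dim vector
U = rng.normal(size=(V_size, d))    # output matrix: d-dim vector -> score per word

def one_hot(i: int) -> np.ndarray:
    x = np.zeros(V_size)
    x[i] = 1.0
    return x

context_ids = [1, 3, 6, 7]                                  # the 2m context words
context_embeddings = [V @ one_hot(i) for i in context_ids]  # step (2): embed
v_hat = np.mean(context_embeddings, axis=0)                 # step (3): average

scores = U @ v_hat                                # map back to vocabulary space
probs = np.exp(scores) / np.exp(scores).sum()     # softmax: P(word is center word)
print(probs.argmax(), probs.round(3))
```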
A review of the end-to-end CBOW training process
Softmax scoring with the trained parameter matrix
Cross entropy:
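The formula images are not reproduced here; written out in the usual CBOW notation (an assumption based on the standard formulation), the softmax scoring and the cross-entropy loss are:

```latex
\hat{v} = \frac{1}{2m}\left(v_{c-m} + \dots + v_{c-1} + v_{c+1} + \dots + v_{c+m}\right)
\quad \text{(average context embedding)}

\hat{y} = \operatorname{softmax}(U\hat{v}), \qquad
\hat{y}_j = \frac{\exp\!\left(u_j^{\top}\hat{v}\right)}{\sum_{k=1}^{V}\exp\!\left(u_k^{\top}\hat{v}\right)}

H(\hat{y}, y) = -\sum_{j=1}^{V} y_j \log \hat{y}_j = -\log \hat{y}_c
\quad \text{(since } y \text{ is One-Hot at the center word)}
```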
Skip-Gram
Skip-Gram takes the center word as known and predicts the context words; the derivation mirrors CBOW, so it is not repeated here.
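In practice, CBOW and Skip-Gram are usually trained with an off-the-shelf library rather than implemented from scratch. The following is a minimal sketch assuming gensim 4.x is installed, with a toy corpus made up purely for illustration.

```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens (made up for illustration).
sentences = [
    ["hotel", "room", "price"],
    ["motel", "room", "price"],
    ["cat", "movie", "scene"],
]

# sg=0 trains CBOW (predict the center word from its context);
# sg=1 trains Skip-Gram (predict the context from the center word).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["hotel"][:5])                  # first 5 dims of the learned embedding
print(model.wv.similarity("hotel", "motel"))  # cosine similarity between two words
```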
Summary
This article has explained the principles of Word Embedding and the methods for generating it, and has answered common questions that arise during Word Embedding generation, in the hope of helping readers use Word Embedding more effectively in practice.
Machine learning is developing rapidly today and is being applied in many industry scenarios. As a data intelligence company, GeTui continues to explore large-scale machine learning and natural language processing, and also applies Word Embedding to tag modeling and other work. GeTui has built a multidimensional profiling system covering thousands of tags, and continues to help customers in mobile Internet, brand marketing, public services and other fields carry out user insight, population analysis and data-driven operations.
In future posts, GeTui will continue to share in-depth content on algorithm modeling, machine learning and related topics. Stay tuned.