References:
1. Summary of using BERT with Chinese: https://blog.csdn.net/sarracode/article/details/109060358
2. Obtaining Chinese character vectors with the PyTorch version of BERT: https://blog.csdn.net/yuanren201/article/details/124500188
3. [Important] A guide to word vectors in BERT — very comprehensive and practical: https://blog.csdn.net/u011984148/article/details/99921480
It mainly covers the sentence input, the output dimensions, and what each dimension contains. It explains everything very well, so the code is copied below:
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM
# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
import logging
#logging.basicConfig(level=logging.INFO)
import matplotlib.pyplot as plt
%matplotlib inline
# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# text = "Here is the sentence I want embeddings for."
text = "After stealing money from the bank vault, the bank robber was seen fishing on the Mississippi river bank."
marked_text = "[CLS] " + text + " [SEP]"
print (marked_text)
[CLS] After stealing money from the bank vault, the bank robber was seen fishing on the Mississippi river bank. [SEP]
Tokenization
tokenized_text = tokenizer.tokenize(marked_text)
print (tokenized_text)
['[CLS]', 'after', 'stealing', 'money', 'from', 'the', 'bank', 'vault', ',', 'the', 'bank', 'robber', 'was', 'seen', 'fishing', 'on', 'the', 'mississippi', 'river', 'bank', '.', '[SEP]']
Below are some examples of the tokens contained in the vocabulary. Tokens starting with two hash signs (##) are subwords or individual characters.
list(tokenizer.vocab.keys())[5000:5020]
['knight',
'lap',
'survey',
'ma',
'##ow',
'noise',
'billy',
'##ium',
'shooting',
'guide',
'bedroom',
'priest',
'resistance',
'motor',
'homes',
'sounded',
'giant',
'##mer',
'150',
'scenes']
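As a sketch of how these ##-prefixed subwords come about: WordPiece splits each word with a greedy longest-match-first scan against the vocabulary. The function name and toy vocabulary below are made up for illustration (the real tokenizer additionally handles lowercasing, punctuation and much larger vocabularies):

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first subword split, in the spirit of WordPiece."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece   # non-initial pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1                   # shrink the candidate and try again
        if match is None:
            return ["[UNK]"]           # no piece matched: the whole word is unknown
        tokens.append(match)
        start = end
    return tokens

toy_vocab = {"play", "##ing", "##ed", "bank", "##s"}
print(wordpiece_tokenize("playing", toy_vocab))  # ['play', '##ing']
print(wordpiece_tokenize("banks", toy_vocab))    # ['bank', '##s']
```

This is why a rare word never becomes an out-of-vocabulary token unless none of its pieces match: it is decomposed into known subwords instead.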
Next, we call the tokenizer to map each token to its index in the tokenizer's vocabulary:
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
for tup in zip(tokenized_text, indexed_tokens):
    print (tup)
('[CLS]', 101)
('after', 2044)
('stealing', 11065)
('money', 2769)
('from', 2013)
('the', 1996)
('bank', 2924)
('vault', 11632)
(',', 1010)
('the', 1996)
('bank', 2924)
('robber', 27307)
('was', 2001)
('seen', 2464)
('fishing', 5645)
('on', 2006)
('the', 1996)
('mississippi', 5900)
('river', 2314)
('bank', 2924)
('.', 1012)
('[SEP]', 102)
Running our example
Next, we need to convert the data to torch tensors and call the BERT model. The BERT PyTorch interface requires data to be torch tensors rather than Python lists, so we convert the lists here — this does not change the shape or the data.
model.eval() puts our model in evaluation mode rather than training mode. In this case, evaluation mode turns off the dropout regularization used during training.
Calling from_pretrained fetches the model from the internet. When we load bert-base-uncased, we see the model definition printed in the log. The model is a 12-layer deep neural network!
# BERT expects a segment ID for every token; with a single sentence they are all the same.
segments_ids = [1] * len(tokenized_text)
# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased')
# Put the model in "evaluation" mode, meaning feed-forward operation.
model.eval()
Next, let's fetch the network's hidden states.
torch.no_grad disables gradient computation, which saves memory and speeds things up (we don't need gradients or backpropagation, since we are only running a forward pass).
# Predict hidden states features for each layer
with torch.no_grad():
    encoded_layers, _ = model(tokens_tensor, segments_tensors)
Output
The full set of hidden states for this model is stored in the object encoded_layers, which is a bit dizzying. The object has four dimensions, in the following order:
1. The layer number (12 layers)
2. The batch number (1 sentence)
3. The word/token number (22 tokens in our sentence)
4. The hidden unit/feature number (768 features)
The second dimension, the batch size, is used to submit multiple sentences to the model at once; here, we have only one sentence.
print ("Number of layers:", len(encoded_layers))
layer_i = 0
print ("Number of batches:", len(encoded_layers[layer_i]))
batch_i = 0
print ("Number of tokens:", len(encoded_layers[layer_i][batch_i]))
token_i = 0
print ("Number of hidden units:", len(encoded_layers[layer_i][batch_i][token_i]))
Number of layers: 12
Number of batches: 1
Number of tokens: 22
Number of hidden units: 768
Let's take a quick look at the range of values for a given layer and token.
You will find that the range is quite similar across all layers and tokens, with most values lying in [-2, 2] and a small number around -10.
# For the 5th token in our sentence, select its feature values from layer 5.
token_i = 5
layer_i = 5
vec = encoded_layers[layer_i][batch_i][token_i]
# Plot the values as a histogram to show their distribution.
plt.figure(figsize=(10,10))
plt.hist(vec, bins=200)
plt.show()
Grouping the values by layer makes sense for the model, but for our purposes we want them grouped by token.
The code below simply restructures the values so that we have them in the form:
[# tokens, # layers, # features]
# Convert the hidden state embeddings into single token vectors
# Holds the list of 12 layer embeddings for each token
# Will have the shape: [# tokens, # layers, # features]
token_embeddings = []

# For each token in the sentence...
for token_i in range(len(tokenized_text)):
    # Holds 12 layers of hidden states for each token
    hidden_layers = []
    # For each of the 12 layers...
    for layer_i in range(len(encoded_layers)):
        # Lookup the vector for `token_i` in `layer_i`
        vec = encoded_layers[layer_i][batch_i][token_i]
        hidden_layers.append(vec)
    token_embeddings.append(hidden_layers)

# Sanity check the dimensions:
print ("Number of tokens in sequence:", len(token_embeddings))
print ("Number of layers per token:", len(token_embeddings[0]))
Number of tokens in sequence: 22
Number of layers per token: 12
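The nested loop above is nothing more than an axis swap. To see that on its own, here is the same transposition on toy nested lists standing in for the real tensors (3 "layers", batch of 1, 4 "tokens", 2 "features" — all values invented for illustration):

```python
# Toy stand-in for `encoded_layers`: [layers][batch][tokens][features].
n_layers, n_tokens, n_features = 3, 4, 2
encoded = [[[[layer * 100 + tok * 10 + feat for feat in range(n_features)]
             for tok in range(n_tokens)]]        # one batch entry
           for layer in range(n_layers)]

batch_i = 0
# Swap the layer and token axes: [layers][batch][tokens][features] -> [tokens][layers][features]
token_major = [[encoded[layer_i][batch_i][token_i] for layer_i in range(n_layers)]
               for token_i in range(n_tokens)]

print(len(token_major), len(token_major[0]), len(token_major[0][0]))  # 4 3 2
```

With real tensors the same reshuffle could be done in one shot with torch.stack and permute, but the explicit loop makes the index gymnastics visible.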
Building word vectors and sentence vectors from the hidden states
Now, what do we do with these hidden states? We would like an individual vector for each token, or perhaps a single vector representation of the whole sentence — but for each token of our input we have 12 separate vectors, each of length 768.
To get individual vectors we need to combine some of the layer vectors... but which layer, or combination of layers, provides the best representation? The BERT authors tested this by feeding different vector combinations as input features to a BiLSTM used for a named entity recognition task and observing the resulting F1 scores.
While concatenating the last four layers produced the best results on that particular task, many other approaches came in a close second, and it is generally advisable to test different variants for your specific application: results may vary.
It turns out that the right pooling strategy (mean, max, concatenation, etc.) and the layers used (last four, all, last one, etc.) depend on the application. This discussion of pooling strategies applies both to whole-sentence embeddings and to ELMo-like individual token embeddings.
Word vectors
To give you some examples, let's create word vectors by concatenating and by summing the last four layers:
concatenated_last_4_layers = [torch.cat((layer[-1], layer[-2], layer[-3], layer[-4]), 0) for layer in token_embeddings] # [number_of_tokens, 3072]
summed_last_4_layers = [torch.sum(torch.stack(layer)[-4:], 0) for layer in token_embeddings] # [number_of_tokens, 768]
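To see what these two pooling strategies do numerically, here is the same idea for a single token with 3 toy "layers" of 2-dimensional hidden states (plain lists instead of tensors, values invented for illustration):

```python
# One token's per-layer hidden states (toy: 3 layers, 2 features each).
layer_vecs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Concatenating the last two layers glues vectors end to end -> length 2 * 2 = 4.
concat_last_2 = layer_vecs[-1] + layer_vecs[-2]

# Summing the last two layers adds them element-wise -> length stays 2.
summed_last_2 = [a + b for a, b in zip(layer_vecs[-1], layer_vecs[-2])]

print(concat_last_2)  # [5.0, 6.0, 3.0, 4.0]
print(summed_last_2)  # [8.0, 10.0]
```

This is exactly why concatenation of the last four real layers yields 4 × 768 = 3072 features, while summation stays at 768.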
Sentence vectors
To get a single vector for the whole sentence there are multiple application-dependent strategies, but a simple approach is to average the second-to-last hidden layer across all tokens, producing a single 768-length vector.
sentence_embedding = torch.mean(encoded_layers[11], 1)
print ("Our final sentence embedding vector of shape:", sentence_embedding[0].shape[0])
Our final sentence embedding vector of shape: 768
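The averaging over the token axis can be sketched on plain lists (2 toy "tokens" with 3 "features" each, standing in for the real 22 × 768 layer):

```python
# Toy per-token vectors from one layer: [tokens][features].
token_vecs = [[1.0, 2.0, 3.0],
              [3.0, 4.0, 5.0]]
n_tokens = len(token_vecs)
n_features = len(token_vecs[0])

# Mean over the token axis -> one vector with n_features entries.
sentence_vec = [sum(tok[d] for tok in token_vecs) / n_tokens
                for d in range(n_features)]
print(sentence_vec)  # [2.0, 3.0, 4.0]
```

torch.mean(..., 1) does precisely this reduction over dimension 1 (the token dimension) of the [1, 22, 768] layer tensor.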
Confirming that the vectors are context-dependent
To confirm that the values of these vectors are actually context-dependent, let's look at the output for the following sentence (if you want to try this yourself, you have to rerun the example from the top, replacing our original sentence with the one below):
print (text)
After stealing money from the bank vault, the bank robber was seen fishing on the Mississippi river bank.
for i, x in enumerate(tokenized_text):
    print (i, x)
0 [CLS]
1 after
2 stealing
3 money
4 from
5 the
6 bank
7 vault
8 ,
9 the
10 bank
11 robber
12 was
13 seen
14 fishing
15 on
16 the
17 mississippi
18 river
19 bank
20 .
21 [SEP]
print ("First fifteen values of 'bank' as in 'bank robber':")
summed_last_4_layers[10][:15]
First fifteen values of 'bank' as in 'bank robber':
tensor([ 1.1868, -1.5298, -1.3770, 1.0648, 3.1446, 1.4003, -4.2407, 1.3946,
-0.1170, -1.8777, 0.1091, -0.3862, 0.6744, 2.1924, -4.5306])
print ("First fifteen values of 'bank' as in 'bank vault':")
summed_last_4_layers[6][:15]
First fifteen values of 'bank' as in 'bank vault':
tensor([ 2.1319, -2.1413, -1.6260, 0.8638, 3.3173, 0.1796, -4.4853, 3.1215,
-0.9740, -3.1780, 0.1046, -1.5481, 0.4758, 1.1703, -4.4859])
print ("First fifteen values of 'bank' as in 'river bank':")
summed_last_4_layers[19][:15]
First fifteen values of 'bank' as in 'river bank':
tensor([ 1.1295, -1.4725, -0.7296, -0.0901, 2.4970, 0.5330, 0.9742, 5.1834,
-1.0692, -1.5941, 1.9261, 0.7119, -0.9809, 1.2127, -2.9812])
We can see that these are indeed different vectors, and they should be: although the word "bank" is the same, its meaning is different in each of its uses in our sentence, sometimes very different.
In this sentence we have three different uses of "bank", two of which should be nearly identical. Let's check the cosine similarity to see whether that is the case:
from sklearn.metrics.pairwise import cosine_similarity
# Compare "bank" as in "bank robber" to "bank" as in "river bank"
different_bank = cosine_similarity(summed_last_4_layers[10].reshape(1,-1), summed_last_4_layers[19].reshape(1,-1))[0][0]
# Compare "bank" as in "bank robber" to "bank" as in "bank vault"
same_bank = cosine_similarity(summed_last_4_layers[10].reshape(1,-1), summed_last_4_layers[6].reshape(1,-1))[0][0]
print ("Similarity of 'bank' as in 'bank robber' to 'bank' as in 'bank vault':", same_bank)
Similarity of 'bank' as in 'bank robber' to 'bank' as in 'bank vault': 0.94567525
print ("Similarity of 'bank' as in 'bank robber' to 'bank' as in 'river bank':", different_bank)
Similarity of 'bank' as in 'bank robber' to 'bank' as in 'river bank': 0.6797334
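For reference, cosine similarity is just the dot product of two vectors divided by the product of their norms; a minimal pure-Python version (with toy vectors, not the real embeddings) makes the metric concrete:

```python
import math

def cosine_sim(a, b):
    """Cosine similarity: dot(a, b) / (||a|| * ||b||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_sim([1.0, 0.0], [1.0, 0.0]))  # 1.0 (same direction)
print(cosine_sim([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```

Because it measures only the angle between vectors, two "bank" embeddings of different magnitude but similar direction (like the two financial senses above, ~0.95) still score close to 1, while the "river bank" sense points in a noticeably different direction (~0.68).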