Fine-tuning has no effect after 880,000 pretraining steps

I am using the code from https://github.com/NVIDIA/Dee...

Pretraining flags:

 15:47:02,534: INFO tensorflow 140678508230464   init_checkpoint: bertbase3layer-extract-from-google
 15:47:02,534: INFO tensorflow 140678508230464   optimizer_type: lamb
 15:47:02,534: INFO tensorflow 140678508230464   max_seq_length: 64
 15:47:02,534: INFO tensorflow 140678508230464   max_predictions_per_seq: 5
 15:47:02,534: INFO tensorflow 140678508230464   do_train: True
 15:47:02,535: INFO tensorflow 140678508230464   do_eval: False
 15:47:02,535: INFO tensorflow 140678508230464   train_batch_size: 32
 15:47:02,535: INFO tensorflow 140678508230464   eval_batch_size: 8
 15:47:02,535: INFO tensorflow 140678508230464   learning_rate: 5e-05
 15:47:02,535: INFO tensorflow 140678508230464   num_train_steps: 10000000
 15:47:02,535: INFO tensorflow 140678508230464   num_warmup_steps: 10000
 15:47:02,535: INFO tensorflow 140678508230464   save_checkpoints_steps: 1000
 15:47:02,535: INFO tensorflow 140678508230464   display_loss_steps: 10
 15:47:02,535: INFO tensorflow 140678508230464   iterations_per_loop: 1000
 15:47:02,535: INFO tensorflow 140678508230464   max_eval_steps: 100
 15:47:02,535: INFO tensorflow 140678508230464   num_accumulation_steps: 1
 15:47:02,535: INFO tensorflow 140678508230464   allreduce_post_accumulation: False
 15:47:02,535: INFO tensorflow 140678508230464   verbose_logging: False
 15:47:02,535: INFO tensorflow 140678508230464   horovod: True
 15:47:02,536: INFO tensorflow 140678508230464   report_loss: True
 15:47:02,536: INFO tensorflow 140678508230464   manual_fp16: False
 15:47:02,536: INFO tensorflow 140678508230464   amp: False
 15:47:02,536: INFO tensorflow 140678508230464   use_xla: True
 15:47:02,536: INFO tensorflow 140678508230464   init_loss_scale: 4294967296
 15:47:02,536: INFO tensorflow 140678508230464   ?: False
 15:47:02,536: INFO tensorflow 140678508230464   help: False
 15:47:02,536: INFO tensorflow 140678508230464   helpshort: False
 15:47:02,536: INFO tensorflow 140678508230464   helpfull: False
 15:47:02,536: INFO tensorflow 140678508230464   helpxml: False
 15:47:02,536: INFO tensorflow 140678508230464 **************************

Pretraining loss (I removed nsp_loss):

{'throughput_train': 1196.9646684552622, 'mlm_loss': 0.9837073683738708, 'nsp_loss': 0.0, 'total_loss': 0.9837073683738708, 'avg_loss_step': 1.200513333082199, 'learning_rate': '0.00038143058'}
{'throughput_train': 1230.5063662500734, 'mlm_loss': 1.3001925945281982, 'nsp_loss': 0.0, 'total_loss': 1.3001925945281982, 'avg_loss_step': 1.299936044216156, 'learning_rate': '0.00038143038'}
{'throughput_train': 1236.4348949169155, 'mlm_loss': 1.473339319229126, 'nsp_loss': 0.0, 'total_loss': 1.473339319229126, 'avg_loss_step': 1.2444063007831574, 'learning_rate': '0.00038143017'}
{'throughput_train': 1221.2668264552692, 'mlm_loss': 0.9924975633621216, 'nsp_loss': 0.0, 'total_loss': 0.9924975633621216, 'avg_loss_step': 1.1603020071983337, 'learning_rate': '0.00038142994'}

Fine-tuning code:

self.train_op = tf.train.AdamOptimizer(0.00001).minimize(self.loss, global_step=self.global_step)  # plain Adam, lr = 1e-5
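A symptom like the one below (the pretrained checkpoint fine-tunes far worse than random initialization) is often caused by a silent variable-name mismatch when restoring the checkpoint into the fine-tuning graph. A minimal diagnostic sketch, with the TF 1.x calls shown in comments and a pure-Python helper (`diff_variable_names` is a hypothetical name, not from the repo):

```python
def diff_variable_names(ckpt_names, graph_names):
    """Compare variable names from a checkpoint against the current graph.

    Returns (missing_from_ckpt, unused_in_graph) as sorted lists; any name
    in the first list would be left at its random initial value after restore.
    """
    ckpt = set(ckpt_names)
    graph = set(graph_names)
    return sorted(graph - ckpt), sorted(ckpt - graph)

# In a real TF 1.x session the two name lists would come from, e.g.:
#   ckpt_names  = [n for n, _ in tf.train.list_variables(ckpt_path)]
#   graph_names = [v.name.split(":")[0] for v in tf.trainable_variables()]
missing, unused = diff_variable_names(
    ["bert/embeddings/word_embeddings", "bert/pooler/dense/kernel"],
    ["bert/embeddings/word_embeddings", "cls_head/kernel"],
)
print("in graph but not in ckpt:", missing)
print("in ckpt but not in graph:", unused)
```

If `missing` is non-empty for the BERT encoder variables, the restore is effectively a no-op and the fine-tuning results would look like the "no restore" run below.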

Fine-tuning accuracy (restoring from my checkpoint pretrained with https://github.com/NVIDIA/Dee... ):

epoch 1:
training step 895429, loss 4.98, acc 0.079
dev loss 4.853, acc 0.092

epoch 2:
training step 895429, loss 4.97, acc 0.080
dev loss 4.823, acc 0.092

epoch 3:
training step 895429, loss 4.96, acc 0.081
dev loss 4.849, acc 0.092

epoch 4:
training step 895429, loss 4.95, acc 0.082
dev loss 4.843, acc 0.092

Without restoring any pretrained checkpoint:

epoch 1:
training step 10429, loss 2.48, acc 0.606
dev loss 1.604, acc 0.8036

Restoring Google's official BERT-Base pretrained checkpoint, or a checkpoint pretrained with https://github.com/guotong198... :

epoch 1:
training loss 1.89, acc 0.761
dev loss 1.351, acc 0.869