Abstract: This article walks through the practical application of THOR. Part of the THOR algorithm is currently open source in MindSpore.
This article is shared from the Huawei Cloud community post "MindSpore self-developed high-level optimizer: source code analysis and practical application", original author: HWCloudAI.
The open-source part of the THOR algorithm lives in the MindSpore repository at:
https://gitee.com/mindspore/mindspore/blob/master/mindspore/nn/optim/thor.py
Using THOR to train a network in MindSpore is very simple; apart from the import, it takes just four lines of code. Let's take a look.
from mindspore.nn.optim import THOR  # import the second-order optimizer
# create the network
net = Net()
# create the optimizer
opt = THOR(net, lr, Tensor(damping), config.momentum, config.weight_decay, config.loss_scale,
           config.batch_size, split_indices=split_indices)
# add computation graphs to improve performance
model = ConvertModelUtils().convert_to_thor_model(model=model, network=net, loss_fn=loss, optimizer=opt,
                                                  loss_scale_manager=loss_scale, metrics={'acc'},
                                                  amp_level="O2", keep_batchnorm_fp32=False,
                                                  frequency=config.frequency)
# train the network
model.train(config.epoch_size, dataset, callbacks=cb, sink_size=dataset.get_dataset_size(), dataset_sink_mode=True)
- Import the package required by the second-order optimizer THOR
- The first line of code creates the network as usual
- The second line of code defines the THOR optimizer we use
- The third line of code adds extra computation graphs so that THOR achieves better performance
- The fourth line of code trains the network
Let's go through it in detail. First import the second-order optimizer package provided by MindSpore, located in mindspore.nn.optim.
Then create the network you need; then define the THOR optimizer, passing in the network and the hyperparameters required by THOR (such as the learning rate, the damping/regularization coefficient, etc.);
then call the convert_to_thor_model function. This function adds extra computation graphs so that THOR achieves better performance. What does that mean? When a network runs, it runs as a computation graph. THOR reuses stale second-order information, so an additional computation graph is added: the two graphs respectively perform the step that updates the second-order matrices and the step that does not, which yields better performance (PS. MindSpore supports both dynamic and static graphs; the static graph mode is used here for better performance. Readers interested in this topic can follow this link: https://mindspore-website.obs.cn-north-4.myhuaweicloud.com/white_paper/MindSpore_white_paper.pdf). A sketch of this idea follows below;
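To make the two-graph idea concrete, here is a minimal, framework-free sketch of the control flow that the frequency parameter drives; the function names one_step_update and one_step_reuse are purely illustrative and are not MindSpore APIs:

# Hedged sketch only: alternate between a step that refreshes the second-order
# matrices and a step that reuses the previously saved (stale) inverses.
def train_with_thor_like_schedule(num_steps, frequency, one_step_update, one_step_reuse):
    for step in range(num_steps):
        if step % frequency == 0:
            one_step_update()   # refresh A/G and their inverses, then update the weights
        else:
            one_step_reuse()    # skip the matrix work, update the weights with the stale inverses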
Finally, call model.train to start training. That is a brief introduction to how to use THOR; now let's take a look at its source code.
Source code analysis
The __init__ function initializes THOR; it takes the hyperparameters and the network as input. THOR supports both GPU and Ascend, implemented as class THOR_GPU(Optimizer) and class THOR_Ascend(Optimizer) respectively; the main difference between the two classes is the operators they use. Let's take class THOR_Ascend(Optimizer) as an example.
class THOR_Ascend(Optimizer):
    def __init__(self, net, learning_rate, damping, momentum, weight_decay=0.0, loss_scale=1.0, batch_size=32,
                 decay_filter=lambda x: x.name not in [], split_indices=None):
        params = filter(lambda x: x.requires_grad, net.get_parameters())
        super(THOR_Ascend, self).__init__(learning_rate, params, weight_decay, loss_scale)
        if isinstance(momentum, float) and momentum < 0.0:
            raise ValueError("momentum should be at least 0.0, but got momentum {}".format(momentum))
        self.momentum = Parameter(Tensor(momentum, mstype.float32), name="momentum")
        self.params = self.parameters
        self.moments = self.params.clone(prefix="moments", init='zeros')
        self.hyper_map = C.HyperMap()
        self.opt = P.ApplyMomentum()
        self.net = net
        self.matrix_A_cov = ParameterTuple(filter(lambda x: 'matrix_A' in x.name, net.get_parameters()))
        self.matrix_G_cov = ParameterTuple(filter(lambda x: 'matrix_G' in x.name, net.get_parameters()))
        ...
All optimizers in MindSpore inherit from class Optimizer, which defines some basic functions (such as obtaining the learning rate, gradient scaling, etc.). When THOR is initialized, the hyperparameters passed in are stored as class attributes for easy access, and the operators used in the subsequent computation are defined.
In other words, the initialization function defines the operators and variables (Parameter, Tensor, etc.) needed for the THOR computation.
Pay particular attention to self.matrix_A_cov and self.matrix_G_cov. These two variables hold the information needed to compute the second-order direction: the covariance matrix of each layer's input and the covariance matrix of the first-order derivative with respect to each layer's output, which are saved during the forward and backward passes at runtime (a sketch of what these matrices look like follows below).
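As a rough illustration (a NumPy sketch of the underlying math, not the MindSpore kernels), for a fully connected layer the two matrices are built from the layer input and from the gradient of the loss with respect to the layer output; the normalization shown here is illustrative:

import numpy as np

# Hedged sketch: per-layer statistics for a fully connected layer.
# x:  layer input, shape (batch, in_dim)
# dy: gradient of the loss w.r.t. the layer output, shape (batch, out_dim)
def layer_covariances(x, dy):
    batch = x.shape[0]
    A = x.T @ x / batch    # input covariance            -> corresponds to matrix_A_cov
    G = dy.T @ dy / batch  # output-gradient covariance  -> corresponds to matrix_G_cov
    return A, G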
Let's look at the input parameters used when creating THOR:
- net: the model built for this training run;
- learning_rate: the learning-rate hyperparameter;
- damping: the hyperparameter for the regularization term added to the second-order matrices;
- momentum: the momentum hyperparameter;
- weight_decay: weight decay, used to prevent overfitting; the default value is 0.0, i.e. no weight decay;
- loss_scale: used to scale the loss during training to prevent gradient overflow; the default value is 1.0, i.e. no scaling;
- batch_size: the amount of data used for one training step, default 32;
- decay_filter: selects which layers weight decay is applied to; it takes effect when weight_decay > 0;
- split_indices: used to accelerate the allreduce process by splitting the layer-wise matrices into fused communication groups (see the sketch below).
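The sketch below illustrates the grouping idea behind split_indices (illustrative only, not MindSpore's actual communication code): the per-layer tensors are cut into buckets at the given indices so that each bucket can be fused into a single allreduce.

# Illustrative sketch: group per-layer tensors into fusion buckets at split_indices.
def bucket_by_split_indices(tensors, split_indices):
    buckets, start = [], 0
    for end in list(split_indices) + [len(tensors)]:
        buckets.append(tensors[start:end])
        start = end
    return buckets

# With 54 per-layer matrices and split_indices=[26, 53], this yields three buckets
# covering indices [0:26], [26:53] and [53:54], each all-reduced as one fused op.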
- The _get_Ainv_Ginv_Amax_Gmax_list function computes the inverses of the covariance matrices A/G and returns them. Concretely, it traverses all layers of the model and processes them layer by layer: it adds a regularization (damping) term to each layer's covariance matrix and then performs a Cholesky decomposition to compute the inverse. The currently open-sourced THOR supports fully connected layers and convolutional layers.
def _get_Ainv_Ginv_Amax_Gmax_list(self, gradients, damping_step, matrix_a_allreduce, matrix_g_allreduce,
                                  matrix_a_max_allreduce, matrix_g_max_allreduce):
    """get matrixA inverse list, matrixG inverse list, matrixA_max list, matrixG_max list"""
    for i in range(len(self.params)):
        thor_layer_count = self.weight_fim_idx_map[i]
        conv_layer_count = self.weight_conv_idx_map[i]
        layer_type = self.weight_layerType_idx_map[i]
        if layer_type in [Conv, FC, Embedding]:
            g = gradients[i]
            matrix_A = self.matrix_A_cov[thor_layer_count]
            matrix_G = self.matrix_G_cov[thor_layer_count]
            matrix_A = F.depend(matrix_A, g)
            matrix_G = F.depend(matrix_G, g)
            A_shape = self.shape(matrix_A)
            A_eye = self.eye(A_shape[0], A_shape[0], mstype.float32)
            G_shape = self.shape(matrix_G)
            G_eye = self.eye(G_shape[0], G_shape[0], mstype.float32)
            if layer_type == Conv:
                ...
            elif layer_type == FC:
                matrix_A = matrix_A + damping * A_eye
                matrix_A_inv = self.cholesky(matrix_A)
                matrix_A_inv = self.vector_matmul(matrix_A_inv, matrix_A_inv)
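For reference, the damping-plus-Cholesky inversion above corresponds to the following NumPy sketch (self.cholesky and self.vector_matmul are custom MindSpore/Ascend operators; this only restates the math):

import numpy as np

# Hedged sketch of the per-layer inversion: add damping, then invert via Cholesky.
def damped_inverse(A, damping):
    A_damped = A + damping * np.eye(A.shape[0], dtype=A.dtype)
    L = np.linalg.cholesky(A_damped)   # A_damped = L @ L.T with L lower-triangular
    L_inv = np.linalg.inv(L)
    return L_inv.T @ L_inv             # (L^-1).T @ L^-1 = A_damped^-1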
- The _get_second_gradients function computes the final parameter-update direction. In the paper the update direction is the Fisher-preconditioned gradient, roughly δw = (A ⊗ G)⁻¹ ∇w per layer, so what the code actually implements (the saved matrices already holding the inverses) is g = G⁻¹ · (∇W) · A⁻¹, as shown below:
def _get_second_gradients(self, new_grads, damping_step, gradients):
    """get second gradients for thor"""
    params_len = len(self.params)
    for i in range(params_len):
        ...
        else:
            ...
            elif layer_type == FC:
                temp_a = self.matrix_A_cov[thor_layer_count]
                temp_g = self.matrix_G_cov[thor_layer_count]
                temp_a = self.cast(temp_a, mstype.float16)
                temp_g = self.cast(temp_g, mstype.float16)
                g = self.cast(g, mstype.float16)
                g = self.matmul(temp_g, g)
                g = self.matmul(g, temp_a)
                g = self.cast(g, mstype.float32)
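Equivalently, the FC branch above preconditions the weight gradient of one layer with the two saved inverse matrices; here is a NumPy sketch of the same computation (float16 casts omitted):

import numpy as np

# Hedged sketch of the FC branch above: g = G_inv @ W_grad @ A_inv.
# W_grad: weight gradient, shape (out_dim, in_dim)
# G_inv:  inverse output-gradient covariance, shape (out_dim, out_dim)
# A_inv:  inverse input covariance, shape (in_dim, in_dim)
def second_order_direction(G_inv, W_grad, A_inv):
    return G_inv @ W_grad @ A_inv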
The construct function is what actually executes during network training. It calls the two functions above, _get_Ainv_Ginv_Amax_Gmax_list and _get_second_gradients, completing the computation of the second-order matrices and the adjustment of the gradient update direction.
def construct(self, gradients):
    params = self.params
    moments = self.moments
    damping_step = self.gather(self.damping, self.cov_step, self.axis)
    damping_step = self.cast(damping_step, mstype.float32)
    if self.thor:
        matrix_A_allreduce = ()
        matrix_G_allreduce = ()
        matrix_A_max_allreduce = ()
        matrix_G_max_allreduce = ()
        matrix_A_allreduce, matrix_G_allreduce, matrix_A_max_allreduce, matrix_G_max_allreduce = \
            self._get_Ainv_Ginv_Amax_Gmax_list(gradients, damping_step, matrix_A_allreduce, matrix_G_allreduce,
                                               matrix_A_max_allreduce, matrix_G_max_allreduce)  # compute the inverses of A/G
        ...
        new_grads = ()
        for i in range(len(self.params)):
            ...
            if self.conv_layer_count > 0:  # handling when the network contains convolutional layers
                ...
            else:  # handling when all layers are fully connected
                if layer_type == Embedding:
                    ...
                elif layer_type == FC:
                    temp_a = matrix_A_allreduce[thor_layer_count]
                    temp_g = matrix_G_allreduce[thor_layer_count]
                    fake_A = self.assign(self.matrix_A_cov[thor_layer_count], temp_a)
                    fake_G = self.assign(self.matrix_G_cov[thor_layer_count], temp_g)
                    g = F.depend(g, fake_A)  # ensure the execution order
                    g = F.depend(g, fake_G)
                    temp_a = self.cast(temp_a, mstype.float16)
                    temp_g = self.cast(temp_g, mstype.float16)
                    g = self.cast(g, mstype.float16)
                    g = self.matmul(temp_g, g)
                    g = self.matmul(g, temp_a)  # turn the first-order direction into a second-order direction
                    g = self.cast(g, mstype.float32)
                elif layer_type == LayerNorm:
                    g = self._process_layernorm(damping_step, g)
            new_grads = new_grads + (g,)
        gradients = new_grads  # the update direction obtained from the computation
    else:  # this branch updates the parameters with the stale second-order information
        new_grads = ()
        gradients = self._get_second_gradients(new_grads, damping_step, gradients)  # call _get_second_gradients to compute the direction
    ...
Practical application of THOR
In this section, I will share the practical application of THOR through two examples: ResNet50 and BERT. The code for both examples is open source. ResNet50: https://gitee.com/mindspore/mindspore/blob/master/model_zoo/official/cv/resnet/train.py; the BERT training script can be found in the same repository's model_zoo.
ResNet50[1]
The optimizer is invoked in the same way as shown at the beginning of the article; in this example we walk through the full training process.
First, create the training set required for training and define the network as ResNet50; then set the hyperparameter schedules required by THOR (the other hyperparameters can be modified in src/config.py under the same directory); then create the THOR optimizer and pass in the configured hyperparameter values; then convert the model so that it saves the information required for the second-order computation; finally, train the network.
from mindspore.nn.optim import Momentum, THOR  # import the second-order optimizer
from src.resnet import resnet50 as resnet
from mindspore.train.model import Model
...
if __name__ == '__main__':
    ...
    # create the training set used during network training
    dataset = create_dataset(dataset_path=args_opt.dataset_path, do_train=True, repeat_num=1,
                             batch_size=config.batch_size, target=target, distribute=args_opt.run_distribute)
    step_size = dataset.get_dataset_size()
    # create the ResNet50 model
    net = resnet(class_num=config.class_num)
    ...
    # init lr
    if cfg.optimizer == "Thor":
        # set the hyperparameter values
        from src.lr_generator import get_thor_lr
        lr = get_thor_lr(0, config.lr_init, config.lr_decay, config.lr_end_epoch, step_size, decay_epochs=39)
    # define loss, model
    if target == "Ascend":
        if args_opt.dataset == "imagenet2012":
            if not config.use_label_smooth:
                config.label_smooth_factor = 0.0
            loss = CrossEntropySmooth(sparse=True, reduction="mean",
                                      smooth_factor=config.label_smooth_factor, num_classes=config.class_num)
        else:
            loss = SoftmaxCrossEntropyWithLogits(sparse=True, reduction='mean')
        loss_scale = FixedLossScaleManager(config.loss_scale, drop_overflow_update=False)
        # high-level abstraction that wraps model training and evaluation
        model = Model(net, loss_fn=loss, optimizer=opt, loss_scale_manager=loss_scale, metrics={'acc'},
                      amp_level="O2", keep_batchnorm_fp32=False)
    if cfg.optimizer == "Thor" and args_opt.dataset == "imagenet2012":
        from src.lr_generator import get_thor_damping
        # set the damping hyperparameter
        damping = get_thor_damping(0, config.damping_init, config.damping_decay, 70, step_size)
        # used for parallel acceleration of communication
        split_indices = [26, 53]
        # create the THOR optimizer
        opt = THOR(net, lr, Tensor(damping), config.momentum, config.weight_decay, config.loss_scale,
                   config.batch_size, split_indices=split_indices)
        # add computation graphs to improve performance
        model = ConvertModelUtils().convert_to_thor_model(model=model, network=net, loss_fn=loss, optimizer=opt,
                                                          loss_scale_manager=loss_scale, metrics={'acc'},
                                                          amp_level="O2", keep_batchnorm_fp32=False,
                                                          frequency=config.frequency)
    ...
    # train the network
    model.train(config.epoch_size - config.pretrain_epoch_size, dataset, callbacks=cb,
                sink_size=dataset.get_dataset_size(), dataset_sink_mode=dataset_sink_mode)
Finally, launch the training script and training begins.
BERT[2]
The steps for BERT are similar to those for ResNet50. First, create the training set required for training and define the network as BERT; then set the hyperparameter schedules required by THOR (the other hyperparameters can be modified in src/config.py under the same directory); then create the optimizer, passing in the configured hyperparameter values together with a decay_filter (shown in the script below), which excludes the LayerNorm parameters and the bias parameters of the fully connected layers from the weight-decay operation (see the small illustration that follows); then convert the model so that it saves the information required for the second-order computation; finally, train the network.
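As a small illustration (with made-up parameter names), the decay_filter used in the script keeps weight decay only for parameters whose names contain neither 'layernorm' nor 'bias':

# Illustrative only: which parameters the BERT decay_filter selects for weight decay.
decay_filter = lambda x: 'layernorm' not in x.name.lower() and 'bias' not in x.name.lower()

class FakeParam:  # stand-in for a MindSpore Parameter, for illustration
    def __init__(self, name):
        self.name = name

params = [FakeParam('bert.encoder.layer0.dense.weight'),
          FakeParam('bert.encoder.layer0.dense.bias'),
          FakeParam('bert.encoder.layer0.layernorm.gamma')]
print([p.name for p in params if decay_filter(p)])
# -> ['bert.encoder.layer0.dense.weight']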
from mindspore.nn.optim import Lamb, Momentum, AdamWeightDecay, THOR  # import the second-order optimizer
from src import BertNetworkWithLoss
...
def _get_optimizer(args_opt, network):
    """get bert optimizer, support Lamb, Momentum, AdamWeightDecay."""
    if cfg.optimizer == 'Lamb':
        ...
    elif cfg.optimizer == "Thor":
        from src.utils import get_bert_thor_lr, get_bert_thor_damping
        # set the lr and damping hyperparameter values
        lr = get_bert_thor_lr(cfg.Thor.lr_max, cfg.Thor.lr_min, cfg.Thor.lr_power, cfg.Thor.lr_total_steps)
        damping = get_bert_thor_damping(cfg.Thor.damping_max, cfg.Thor.damping_min, cfg.Thor.damping_power,
                                        cfg.Thor.damping_total_steps)
        split_indices = None
        # set the parallel acceleration scheme
        if bert_net_cfg.num_hidden_layers == 12:
            if bert_net_cfg.use_relative_positions:
                split_indices = [29, 58, 87, 116, 145, 174, 203, 217]
            else:
                split_indices = [28, 55, 82, 109, 136, 163, 190, 205]
        elif bert_net_cfg.num_hidden_layers == 24:
            if bert_net_cfg.use_relative_positions:
                split_indices = [30, 90, 150, 210, 270, 330, 390, 421]
            else:
                split_indices = [38, 93, 148, 203, 258, 313, 368, 397]
        # create the optimizer
        optimizer = THOR(network, lr, damping, cfg.Thor.momentum,
                         cfg.Thor.weight_decay, cfg.Thor.loss_scale, cfg.batch_size,
                         decay_filter=lambda x: 'layernorm' not in x.name.lower() and 'bias' not in x.name.lower(),
                         split_indices=split_indices)
    ...
    return optimizer

def run_pretrain():
    ...
    # create the dataset
    ds = create_bert_dataset(device_num, rank, args_opt.do_shuffle, args_opt.data_dir, args_opt.schema_dir)
    # create the network and the loss function
    net_with_loss = BertNetworkWithLoss(bert_net_cfg, True)
    ...
    # load the initial checkpoint
    if args_opt.load_checkpoint_path:
        param_dict = load_checkpoint(args_opt.load_checkpoint_path)
        load_param_into_net(net_with_loss, param_dict)
    # dynamic loss scaling
    if args_opt.enable_lossscale == "true":
        ...
    # fixed loss scaling value
    else:
        # create the cell that computes the gradients in the backward pass
        net_with_grads = BertTrainOneStepCell(net_with_loss, optimizer=optimizer)
    # create the model
    model = Model(net_with_grads)
    # add computation graphs to improve performance
    model = ConvertModelUtils().convert_to_thor_model(model, network=net_with_grads, optimizer=optimizer,
                                                      frequency=cfg.Thor.frequency)
    # train the network
    model.train(new_repeat_count, ds, callbacks=callback,
                dataset_sink_mode=(args_opt.enable_data_sink == "true"), sink_size=args_opt.data_sink_steps)

if __name__ == '__main__':
    set_seed(0)
Finally, launch the pre-training script and training begins.
That concludes the high-level optimizer series. The three articles covered the background of optimizers, an introduction to MindSpore's self-developed optimizer, and the source code analysis and practical application of the MindSpore high-order optimizer THOR. If anything is lacking, criticism and corrections are welcome, and everyone is welcome to join the MindSpore open source community.
References:
[1]He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 770-778.
[2]Devlin J, Chang M W, Lee K, et al. Bert: Pre-training of deep bidirectional transformers for language understanding[J]. arXiv preprint arXiv:1810.04805, 2018.