The basic form of Taylor's formula is given in Eq. (1),
$$\begin{equation}f(x) = \sum_{n=0}^{\infty}\frac{f^{(n)}(x_0)}{n!}(x-x_0)^{n}\tag{1}\end{equation}$$
In particular, the first-order Taylor expansion is,
$$\begin{equation}f(x)\approx f(x_0)+f^{'}(x_0)(x-x_0)\tag{2}\end{equation}$$
The second-order Taylor expansion is,
$$\begin{equation}f(x)\approx f(x_0)+f^{'}(x_0)(x-x_0)+f^{''}(x_0)\frac{(x-x_0)^2}{2}\tag{3}\end{equation}$$
Let $x^t = x^{t-1}+\Delta x$; then the second-order Taylor expansion of $f(x^t)$ can be written as,
$$\begin{equation}f(x^t)=f(x^{t-1}+\Delta x)\approx f(x^{t-1})+f^{'}(x^{t-1})\Delta x+f^{''}(x^{t-1})\frac{\Delta x^2}{2}\tag{4}\end{equation}$$
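As a quick numerical sanity check of the second-order expansion (a minimal sketch using sympy; the test function $e^x$ and the expansion point are chosen here purely for illustration and are not part of the original derivation):

import sympy

x, x0 = sympy.symbols('x x0')
f = sympy.exp(x)

# Second-order Taylor expansion of f around x0, as in Eq. (3)
taylor2 = f.subs(x, x0) + sympy.diff(f, x).subs(x, x0) * (x - x0) \
          + sympy.diff(f, x, 2).subs(x, x0) * (x - x0) ** 2 / 2

print(f.subs(x, 0.1).evalf())                 # 1.10517..., the true value
print(taylor2.subs({x: 0.1, x0: 0}).evalf())  # 1.10500, close to the true value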

Gradient Descent

In machine learning tasks, we need to minimize a loss function $L(\theta)$, where $\theta$ is the model parameter to be solved for. Gradient descent is commonly used for this kind of unconstrained optimization problem. It is an iterative algorithm: we first assign an initial value $\theta^0$, and then in each iteration update $\theta^t$ so that the loss $L(\theta^t)$ decreases.
$$\begin{equation}\theta^t = \theta^{t-1}+\Delta \theta\tag{5}\end{equation}$$
Taking the first-order Taylor expansion of $L(\theta^t)$ at $\theta^{t-1}$:
$$\begin{equation}L(\theta^t)=L(\theta^{t-1}+\Delta \theta)\approx L(\theta^{t-1})+L^{'}(\theta^{t-1})\Delta \theta\tag{6}\end{equation}$$
Each iteration of gradient descent must make the loss function $L$ smaller, i.e. $L(\theta^t)<L(\theta^{t-1})$. From Eq. (6), this means the first-order term $L^{'}(\theta^{t-1})\Delta \theta$ should be negative, so we can take $\Delta \theta = -\alpha L^{'}(\theta^{t-1})$ with $\alpha>0$, which gives $L^{'}(\theta^{t-1})\Delta \theta = -\alpha \left(L^{'}(\theta^{t-1})\right)^2 \le 0$. The update rule for $\theta$ is then,
$$\begin{equation}\theta^t = \theta^{t-1}-\alpha L^{'}(\theta^{t-1})\tag{7}\end{equation}$$
Here $\alpha$ is the step size, commonly called the learning rate, which is usually just set directly to a small value.
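As a minimal illustration of update rule (7) (a sketch in plain Python; the quadratic loss $L(\theta)=(\theta-3)^2$ and the learning rate below are chosen only for demonstration):

# Gradient descent on L(theta) = (theta - 3)**2, whose minimum is at theta = 3
def L_grad(theta):
    return 2 * (theta - 3)

theta = 0.0
alpha = 0.1                                  # learning rate
for t in range(50):
    theta = theta - alpha * L_grad(theta)    # Eq. (7)
print(theta)                                 # approaches 3.0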

Newton's Method

Newton's method is itself a first-order algorithm and is, at heart, a root-finding algorithm. To derive it from the first-order Taylor formula, set $L(\theta^t)=0$ in Eq. (6); this yields
$$\begin{equation}\theta^t = \theta^{t-1}-\frac{L(\theta^{t-1})}{L^{'}(\theta^{t-1})}\tag{8}\end{equation}$$
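For instance (a sketch of the root-finding form in Eq. (8); the function $L(\theta)=\theta^2-2$ is chosen here only for illustration), the iteration converges quickly to $\sqrt{2}$:

# Newton root finding for L(theta) = theta**2 - 2; the root is sqrt(2)
def L(theta):
    return theta ** 2 - 2

def L_grad(theta):
    return 2 * theta

theta = 1.0
for t in range(6):
    theta = theta - L(theta) / L_grad(theta)   # Eq. (8)
print(theta)                                   # ~1.41421356
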
In machine learning tasks, however, Newton's method is used to find the optimum of an objective function. When used to find an optimum (an extremum), we need the roots of the equation "first derivative equals zero", which requires the second derivative and turns it into a second-order algorithm. Newton's method for optimization is iterative: in each iteration, we take a second-order Taylor expansion of $L(\theta^{t})$ around the current estimate of the minimum, $\theta^{t-1}$, and use it to find the next estimate $\theta^t$.
$$\begin{equation}L(\theta^t)\approx L(\theta^{t-1})+L^{'}(\theta^{t-1})\Delta \theta + L^{''}(\theta^{t-1})\frac{\Delta \theta^2}{2}\tag{9}\end{equation}$$
Consider first the case where the parameter $\theta$ is a scalar, and write the first derivative $L^{'}(\theta^{t-1})$ as $g$ and the second derivative $L^{''}(\theta^{t-1})$ as $h$,
$$\begin{equation}L(\theta^t)\approx L(\theta^{t-1})+g\Delta \theta + h\frac{\Delta \theta^2}{2}\tag{10}\end{equation}$$
To make $L(\theta^t)$ as small as possible, we minimize $g\Delta \theta + h\frac{\Delta \theta^2}{2}$ (treating $L(\theta^{t-1})$ as a constant). Setting $\frac{\partial(g\Delta \theta + h\frac{\Delta \theta^2}{2})}{\partial \Delta \theta}=0$ gives $\Delta \theta = -\frac{g}{h}$, hence
$$\begin{equation}\theta^{t} = \theta^{t-1}+\Delta \theta=\theta^{t-1}-\frac{g}{h}\tag{11}\end{equation}$$
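For a quadratic loss, update (11) lands on the minimum in a single step. The sketch below (an added illustration, not part of the original) uses $L(\theta)=\theta^2-4\theta+13$, a one-dimensional analogue of the test function used in the code section later:

# Scalar Newton step on L(theta) = theta**2 - 4*theta + 13 (minimum at theta = 2)
theta = 12.0
g = 2 * theta - 4       # first derivative at theta
h = 2.0                 # second derivative
theta = theta - g / h   # Eq. (11); exact for a quadratic, so one step suffices
print(theta)            # 2.0
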
Generalizing the parameter $\theta$ from a scalar to a vector, the parameter update at each iteration becomes,
$$\begin{equation}\theta^{t} =\theta^{t-1}-H^{-1}g\tag{12}\end{equation}$$
Here $g$ is the gradient vector (the Jacobian of the scalar loss $L$, whose elements are the first-order partial derivatives), and $H$ is the Hessian matrix, whose elements are the second-order partial derivatives. If at some iteration $H$ is not invertible, Newton's method fails.

Comparison

  1. Gradient descent and Newton's method are both iterative solvers, but gradient descent uses only the gradient, while Newton's method also uses the inverse of the second-order Hessian matrix. Newton's method therefore converges in fewer iterations, but each iteration costs more than a gradient descent step. Why does it need fewer iterations? Intuitively, gradient descent only picks the steepest direction from the current position and takes a step, whereas Newton's method, when choosing a direction, also considers whether the slope will become steeper after that step. In this sense Newton's method looks one step further ahead and reaches the bottom faster.
  2. Gradient descent is used to find an optimum, whereas the first-order Newton's method finds a zero; only the second-order Newton's method finds an optimum. Moreover, Newton's method can be derived rigorously from the Taylor formula, while gradient descent cannot.

Code Implementation


import numpy as np
import sympy
from sympy import symbols, diff
from sympy.tensor.array import derive_by_array

# Symbolic variables and the objective function to optimize
x1, x2, x3, x4 = symbols('x1, x2, x3, x4')
# Y = x1 ** 2 + x2 ** 2
# Y = x1**2 + 3*x1 - 3*x1*x2 + 2*x2**2 + 4*x2
Y = x1 ** 2 + x2 ** 2 - 4 * x1 - 6 * x2 + 13 + sympy.sqrt(x3) - sympy.sqrt(x4)
var_list = [x1, x2, x3, x4]

def gradient(f, X_, X):
    """Return the symbolic gradient of f w.r.t. the variables X_ and its numeric value at point X."""
    grad_ = sympy.Array([diff(f, x_) for x_ in X_])
    grad = grad_
    for i, x_ in enumerate(X_):
        grad = grad.subs(x_, X[i])
    return grad_, np.array(grad.tolist(), dtype=np.float32)

def jacobian2(f, X_, X):
    """Same as gradient(): the Jacobian of a scalar f is just its gradient vector."""
    G_ = sympy.Array([diff(f, x_) for x_ in X_])
    G = G_
    for i, x_ in enumerate(X_):
        G = G.subs(x_, X[i])
    return G_, np.array(G.tolist(), dtype=np.float32)

def hessian2(f, X_, X):
    """Return the symbolic Hessian of f w.r.t. X_ and its numeric value at point X."""
    H_ = sympy.Array([[diff(f, x_1).diff(x_2) for x_2 in X_] for x_1 in X_])
    H = H_
    for i, x_ in enumerate(X_):
        H = H.subs(x_, X[i])
    return H_, np.array(H.tolist(), dtype=np.float32)

# Alternative (unused) implementations via sympy.derive_by_array, limited to x1 and x2
def jacobian3():
    res = derive_by_array(Y, (x1, x2))
    return res

def hessian3():
    res = derive_by_array(derive_by_array(Y, (x1, x2)), (x1, x2))
    return res

def newton(f, X_, X, iters):
    """
    Newton's method: X <- X - H^{-1} g, as in Eq. (12).
    :param f: objective function (a sympy expression)
    :param X_: list of sympy variables
    :param X: initial point (numpy array)
    :param iters: number of iterations
    """
    G_, G = jacobian2(f, X_, X)
    H_, H = hessian2(f, X_, X)
    Hessian_T = np.linalg.inv(H)
    H_G = np.matmul(Hessian_T, G)
    X_new = X
    for i in range(0, iters):
        X_new = X_new - H_G
        print("Iteration {}: {}".format(i + 1, X_new))
        # Re-evaluate the gradient and Hessian at the new point for the next step
        G_tmp = G_
        H_tmp = H_
        for j, x_ in enumerate(X_):
            H_tmp = H_tmp.subs(x_, X_new[j])
            G_tmp = G_tmp.subs(x_, X_new[j])
        Hessian_T = np.linalg.inv(np.array(H_tmp.tolist(), dtype=np.float32))
        H_G = np.matmul(Hessian_T, np.array(G_tmp.tolist(), dtype=np.float32))
    return X_new

def gradient_descent_1d_decay(f, X_, X, iters, learning_rate=0.01, decay=0.5):
    """Gradient descent per Eq. (7), with an optional (commented-out) learning-rate decay schedule."""
    for i in range(iters):
        # learning_rate = learning_rate * 1.0 / (1.0 + decay * i)
        _, grad = gradient(f, X_, X)
        X = X - learning_rate * grad
        # Keep x3, x4 strictly positive so sqrt() and its gradient stay finite
        X[X < 1e-6] = 1e-6
        print("Iteration {}: {}".format(i + 1, X))
    return X



if __name__ == "__main__":
    j2_, j2 = jacobian2(Y, var_list, np.array([1, 2, 1, 1]))
    h2_, h2 = hessian2(Y, var_list, np.array([1, 2, 1, 1]))
    print(j2_, j2)
    print(h2_, h2)
    print("newton:----------------------------")
    print(newton(Y, var_list, np.array([12, 4, 1, 1], dtype=np.float32), 20))
    print("gradient descent:----------------------------")
    print(gradient_descent_1d_decay(Y, var_list, np.array([12, 4, 1, 1], dtype=np.float32), 200))


