# Gradient Descent and Newton's Method

Taylor-expanding $f(x)$ about $x_0$:

$$f(x) = \sum_{n=0}^{\infty}\frac{f^{(n)}(x_0)}{n!}(x-x_0)^{n}\tag{1}$$

Truncating to first order:

$$f(x)\approx f(x_0)+f^{'}(x_0)(x-x_0)\tag{2}$$

and to second order:

$$f(x)\approx f(x_0)+f^{'}(x_0)(x-x_0)+f^{''}(x_0)\frac{(x-x_0)^2}{2}\tag{3}$$

Writing $x^t = x^{t-1}+\Delta x$ for the iterate at step $t$, the second-order expansion becomes

$$f(x^t)=f(x^{t-1}+\Delta x)\approx f(x^{t-1})+f^{'}(x^{t-1})\Delta x+f^{''}(x^{t-1})\frac{\Delta x^2}{2}\tag{4}$$
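The accuracy of these truncations can be checked numerically. A minimal sketch, assuming $f(x)=e^x$ expanded around $x_0=0$ (so $f'=f''=e^x$); the second-order approximation should track the true value far more closely than the first-order one:

```python
import math

def taylor_approx(f, df, d2f, x0, dx):
    """First- and second-order Taylor approximations of f at x0 + dx."""
    first = f(x0) + df(x0) * dx
    second = first + d2f(x0) * dx ** 2 / 2
    return first, second

# f(x) = e^x, so f' = f'' = e^x
x0, dx = 0.0, 0.1
first, second = taylor_approx(math.exp, math.exp, math.exp, x0, dx)
exact = math.exp(x0 + dx)
print(abs(exact - first))   # ~5.2e-3
print(abs(exact - second))  # ~1.7e-4
```

The first-order error shrinks like $\Delta x^2$ while the second-order error shrinks like $\Delta x^3$, which is why Newton's method (built on the second-order expansion) converges faster near the optimum.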

### Gradient Descent

Each iteration updates the parameter by a step $\Delta\theta$:

$$\theta^t = \theta^{t-1}+\Delta \theta\tag{5}$$

The first-order Taylor expansion of the loss gives

$$L(\theta^t)=L(\theta^{t-1}+\Delta \theta)\approx L(\theta^{t-1})+L^{'}(\theta^{t-1})\Delta \theta\tag{6}$$

To make the loss decrease, choose $\Delta\theta = -\alpha L^{'}(\theta^{t-1})$ with a small learning rate $\alpha>0$, so that $L^{'}(\theta^{t-1})\Delta\theta = -\alpha \left(L^{'}(\theta^{t-1})\right)^2 \le 0$:

$$\theta^t = \theta^{t-1}-\alpha L^{'}(\theta^{t-1})\tag{7}$$
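Update rule (7) can be sketched in a few lines of plain Python. The loss $L(\theta)=(\theta-3)^2$ and the learning rate $\alpha=0.1$ are illustrative choices, not from the text above:

```python
def gradient_descent(grad, theta, alpha=0.1, iters=100):
    """Iterate theta <- theta - alpha * L'(theta), eq. (7)."""
    for _ in range(iters):
        theta = theta - alpha * grad(theta)
    return theta

# L(theta) = (theta - 3)**2, so L'(theta) = 2 * (theta - 3)
theta_star = gradient_descent(lambda t: 2 * (t - 3), theta=0.0)
print(theta_star)  # converges toward the minimizer 3.0
```

Each step here is just $\theta \leftarrow 0.8\,\theta + 0.6$, a contraction whose fixed point is the minimizer $\theta=3$.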

### Newton's Method

Newton's method for finding a zero of $L$ iterates

$$\theta^t = \theta^{t-1}-\frac{L(\theta^{t-1})}{L^{'}(\theta^{t-1})}\tag{8}$$

For minimization, expand the loss to second order instead:

$$L(\theta^t)\approx L(\theta^{t-1})+L^{'}(\theta^{t-1})\Delta \theta + L^{''}(\theta^{t-1})\frac{\Delta \theta^2}{2}\tag{9}$$

Writing $g=L^{'}(\theta^{t-1})$ and $h=L^{''}(\theta^{t-1})$:

$$L(\theta^t)\approx L(\theta^{t-1})+g\Delta \theta + h\frac{\Delta \theta^2}{2}\tag{10}$$

Minimizing the right-hand side over $\Delta\theta$ (set its derivative $g + h\Delta\theta$ to zero) gives $\Delta\theta=-g/h$:

$$\theta^{t} = \theta^{t-1}+\Delta \theta=\theta^{t-1}-\frac{g}{h}\tag{11}$$

In the multivariate case $g$ is the gradient vector and $h$ becomes the Hessian matrix $H$:

$$\theta^{t} =\theta^{t-1}-H^{-1}g\tag{12}$$
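A sketch of the multivariate step (12) in NumPy, assuming the quadratic $L(\theta)=\tfrac{1}{2}\theta^{\top}A\theta - b^{\top}\theta$ (a hypothetical example; its gradient is $A\theta - b$ and its Hessian is the constant matrix $A$, so a single Newton step lands exactly on the minimizer $A^{-1}b$):

```python
import numpy as np

A = np.array([[2.0, 0.0], [0.0, 4.0]])   # Hessian H of the quadratic
b = np.array([4.0, 12.0])

theta = np.array([10.0, -5.0])            # arbitrary starting point
g = A @ theta - b                         # gradient at theta
theta = theta - np.linalg.solve(A, g)     # eq. (12): theta - H^{-1} g
print(theta)  # [2. 3.], the minimizer A^{-1} b
```

Note the use of `np.linalg.solve(A, g)` rather than explicitly forming `np.linalg.inv(A)`: solving the linear system is cheaper and numerically more stable than inverting the Hessian.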

### Comparison

1. Gradient descent and Newton's method are both iterative solvers, but gradient descent uses only the first-order gradient while Newton's method uses the inverse of the second-order Hessian matrix. Newton's method therefore converges in fewer iterations, though each of its iterations costs more than a gradient-descent step. Intuitively, gradient descent only picks the steepest downhill direction from the current point, whereas Newton's method, when choosing a direction, also accounts for how the slope will change after the step. In that sense Newton's method "sees further ahead" than gradient descent and reaches the bottom faster, which is why it needs fewer iterations.
2. Gradient descent solves for an optimum, while first-order Newton's method solves for a zero of a function; it is the second-order Newton's method that solves for an optimum. Moreover, Newton's update follows rigorously from the Taylor expansion, whereas gradient descent's learning rate $\alpha$ is a heuristic choice rather than a derived quantity.
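The trade-off in point 1 can be illustrated by counting iterations on a 1-D example. This is a sketch assuming $L(\theta)=\theta^2$ (gradient $2\theta$, Hessian $2$, minimum at $0$), a fixed learning rate $\alpha=0.1$, and a convergence tolerance of $10^{-8}$:

```python
def iterations_to_converge(step, theta=1.0, tol=1e-8, max_iters=10_000):
    """Count iterations of theta <- step(theta) until |theta| < tol."""
    for k in range(1, max_iters + 1):
        theta = step(theta)
        if abs(theta) < tol:
            return k
    return max_iters

# L(theta) = theta**2: gradient is 2*theta, Hessian is the constant 2
gd_iters = iterations_to_converge(lambda t: t - 0.1 * 2 * t)        # eq. (7)
newton_iters = iterations_to_converge(lambda t: t - (2 * t) / 2.0)  # eq. (11)
print(gd_iters, newton_iters)  # GD needs dozens of steps; Newton needs 1
```

On this quadratic the Newton step $\theta - g/h$ jumps straight to the minimum, while gradient descent shrinks the error geometrically by a factor $0.8$ per step; of course, on a quadratic each Newton iteration here also does more work (a second derivative, and a Hessian inverse in higher dimensions).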

### Code Implementation

```python
import numpy as np
import sympy
from sympy import symbols, diff
from sympy.tensor.array import derive_by_array

x1, x2, x3, x4 = symbols('x1, x2, x3, x4')
# Y = x1 ** 2 + x2 ** 2
# Y = x1**2 + 3*x1 - 3*x1*x2 + 2*x2**2 + 4*x2
Y = x1 ** 2 + x2 ** 2 - 4 * x1 - 6 * x2 + 13 + sympy.sqrt(x3) - sympy.sqrt(x4)
var_list = [x1, x2, x3, x4]


def jacobian2(f, X_, X):
    """Symbolic gradient of f w.r.t. the symbols X_, evaluated at the point X."""
    G_ = sympy.Array([diff(f, x_) for x_ in X_])
    G = G_
    for i, x_ in enumerate(X_):
        G = G.subs(x_, X[i])
    return G_, np.array(G.tolist(), dtype=np.float32)


def hessian2(f, X_, X):
    """Symbolic Hessian of f w.r.t. the symbols X_, evaluated at the point X."""
    H_ = sympy.Array([[diff(f, x_1).diff(x_2) for x_2 in X_] for x_1 in X_])
    H = H_
    for i, x_ in enumerate(X_):
        H = H.subs(x_, X[i])
    return H_, np.array(H.tolist(), dtype=np.float32)


def jacobian3():
    # Same gradient via derive_by_array (only over x1, x2 here)
    return derive_by_array(Y, (x1, x2))


def hessian3():
    # Same Hessian via nested derive_by_array (only over x1, x2 here)
    return derive_by_array(derive_by_array(Y, (x1, x2)), (x1, x2))


def newton(f, X_, X, iters):
    """
    Newton's method.
    :param f: objective function (a sympy expression)
    :param X_: list of sympy symbols
    :param X: initial point
    :param iters: number of iterations
    """
    G_, G = jacobian2(f, X_, X)
    H_, H = hessian2(f, X_, X)
    H_inv = np.linalg.inv(H)
    H_G = np.matmul(H_inv, G)
    X_new = X
    for i in range(iters):
        X_new = X_new - H_G  # eq. (12): theta <- theta - H^{-1} g
        print("Iteration {}: {}".format(i + 1, X_new))
        G_tmp = G_
        H_tmp = H_
        for j, x_ in enumerate(X_):
            H_tmp = H_tmp.subs(x_, X_new[j])
            G_tmp = G_tmp.subs(x_, X_new[j])
        H_inv = np.linalg.inv(np.array(H_tmp.tolist(), dtype=np.float32))
        H_G = np.matmul(H_inv, np.array(G_tmp.tolist(), dtype=np.float32))
    return X_new


def gradient_descent_1d_decay(f, X_, X, iters, learning_rate=0.01, decay=0.5):
    """Gradient descent, eq. (7), with an optional learning-rate decay schedule."""
    for i in range(iters):
        # learning_rate = learning_rate * 1.0 / (1.0 + decay * i)
        _, grad = jacobian2(f, X_, X)   # re-evaluate the gradient at X
        X = X - learning_rate * grad
        X[X < 0] = 0.0                  # keep the sqrt arguments non-negative
        print("Iteration {}: {}".format(i + 1, X))
    return X


if __name__ == "__main__":
    j2_, j2 = jacobian2(Y, var_list, np.array([1, 2, 1, 1]))
    h2_, h2 = hessian2(Y, var_list, np.array([1, 2, 1, 1]))
    print(j2_, j2)
    print(h2_, h2)
    print("newton:----------------------------")
    print(newton(Y, var_list, np.array([12, 4, 1, 1], dtype=np.float32), 20))
```