Mathematical Principles of Machine Learning

2020-12-18 14:16

Cost Functions (also called loss functions or error functions)

MSE (mean squared error)

$$J(\theta)=\frac{1}{n}*\sum_{i=1}^n(y_i-f(x_i))^2$$

MAE (mean absolute error)

$$J(\theta)=\frac{1}{n}*\sum_{i=1}^n|y_i-f(x_i)|$$

Huber

Let the residual be $$\theta=y_i-f(x_i)$$; then

$$J(\theta)=\left\{ \begin{align} MSE& &(|\theta|\le1)\\ MAE& &(|\theta|\gt1) \end{align} \right.$$
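
A minimal NumPy sketch of the three losses, evaluated on the small dataset used in the regression section below (the smooth standard form of Huber with threshold `delta = 1`, which refines the rough piecewise definition above, is my own addition):

```python
import numpy as np

def mse(y, y_hat):
    # J = (1/n) * sum((y_i - f(x_i))^2)
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    # J = (1/n) * sum(|y_i - f(x_i)|)
    return np.mean(np.abs(y - y_hat))

def huber(y, y_hat, delta=1.0):
    # MSE-like for small residuals, MAE-like for large ones;
    # the 0.5 factors make the two pieces join smoothly at |r| = delta.
    r = y - y_hat
    return np.mean(np.where(np.abs(r) <= delta,
                            0.5 * r ** 2,
                            delta * (np.abs(r) - 0.5 * delta)))

y_true = np.array([8.0, 4.0, 8.0])
y_pred = np.array([7.0, 5.0, 9.0])
print(mse(y_true, y_pred), mae(y_true, y_pred), huber(y_true, y_pred))  # 1.0 1.0 0.5
```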

Step Functions (Activation Functions)

Sigmoid

$$y=\frac{1}{1+\exp(-x)}$$

$$\begin{align} \frac{\partial y}{\partial x}&=\frac{1}{1+\exp(-x)}*(1-\frac{1}{1+\exp(-x)})\\\\ &=y*(1-y) \end{align}$$

ReLU

$$y=\left\{ \begin{align} 0& &(x\le0)\\ x& &(x\gt0) \end{align} \right.$$

$$y=\max(0,x)$$
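
Both activations, with a numerical check of the identity $$\frac{\partial y}{\partial x}=y*(1-y)$$ derived above (a small sketch; the function names are my own):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    y = sigmoid(x)
    return y * (1.0 - y)          # dy/dx = y * (1 - y)

def relu(x):
    return np.maximum(0.0, x)     # y = max(0, x)

# Central-difference check of the sigmoid derivative at a few points.
x = np.array([-2.0, 0.0, 1.5])
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
print(np.allclose(numeric, sigmoid_grad(x)))  # True
print(relu(np.array([-1.0, 2.0])))            # [0. 2.]
```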

Regression Analysis

Simple Linear Regression

Assume the regression function $$y=ax+b$$

| x | y | predicted y |
| --- | --- | --- |
| 7 | 8 | 7a+b |
| 5 | 4 | 5a+b |
| 9 | 8 | 9a+b |

MSE error (with the denominator simplified to 2 rather than the sample count 3, so that the 2 produced by differentiating the squares cancels): $$C_t=\frac{(8-(7a+b))^2+(4-(5a+b))^2+(8-(9a+b))^2}{2}$$

When the partial derivatives $$\frac{\partial{C_t}}{\partial{a}}=0$$ and $$\frac{\partial{C_t}}{\partial{b}}=0$$, $$C_t$$ is at its minimum.

$$\begin{align} \frac{\partial{C_t}}{\partial{a}}&=\sum(y_i-(ax_i+b))*(-x_i)\\\\ &=(8-(7a+b))*(-7)+(4-(5a+b))*(-5)+(8-(9a+b))*(-9)\\\\ &=-148+155a+21b \end{align}$$

$$\begin{align} \frac{\partial{C_t}}{\partial{b}}&=\sum(y_i-(ax_i+b))*(-1)\\\\ &=(8-(7a+b))*(-1)+(4-(5a+b))*(-1)+(8-(9a+b))*(-1)\\\\ &=-20+21a+3b \end{align}$$

Solving gives $$a=1$$, $$b=-\frac{1}{3}$$.
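
A quick NumPy check of this solution and of the predictions in the table below (my own sketch, not part of the original derivation):

```python
import numpy as np

# Setting both partials to zero gives the linear system
#   155a + 21b = 148
#    21a +  3b =  20
A = np.array([[155.0, 21.0],
              [21.0, 3.0]])
rhs = np.array([148.0, 20.0])
a, b = np.linalg.solve(A, rhs)
print(a, b)                          # 1.0 -0.333...  (a = 1, b = -1/3)

for x in (7, 5, 9):
    print(x, round(a * x + b, 1))    # predictions 6.7, 4.7, 8.7
```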

| x | y | predicted y |
| --- | --- | --- |
| 7 | 8 | 6.7 |
| 5 | 4 | 4.7 |
| 9 | 8 | 8.7 |

Gradient Descent

Chain rule for a single-variable composition $$y=f(a), a=g(x)$$: $$\frac{\partial{y}}{\partial{x}}=\frac{\partial{y}}{\partial{a}}\frac{\partial{a}}{\partial{x}}$$

Chain rule for a multivariable composition $$z=f(a,b), a=g(x,y), b=h(x,y)$$: $$\frac{\partial{z}}{\partial{x}}=\frac{\partial{z}}{\partial{a}}\frac{\partial{a}}{\partial{x}}+\frac{\partial{z}}{\partial{b}}\frac{\partial{b}}{\partial{x}}$$

Derivative: $$f'(x)=\lim\limits_{\Delta{x}\to0}\frac{f(x+\Delta{x})-f(x)}{\Delta{x}}$$

Single-variable approximation formula: $$f(x+\Delta{x})\approx f(x)+f'(x)*\Delta{x}$$

Multivariable approximation formula: $$f(x+\Delta{x},y+\Delta{y})\approx f(x,y)+\frac{\partial{f(x,y)}}{\partial{x}}*\Delta{x}+\frac{\partial{f(x,y)}}{\partial{y}}*\Delta{y}$$
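
The approximation is easy to verify numerically; a tiny sketch with a made-up test function $$f(x,y)=x^2+xy$$:

```python
def f(x, y):
    return x**2 + x * y

def fx(x, y):
    return 2 * x + y      # ∂f/∂x

def fy(x, y):
    return x              # ∂f/∂y

x, y, dx, dy = 1.0, 2.0, 1e-3, -2e-3
exact = f(x + dx, y + dy)
approx = f(x, y) + fx(x, y) * dx + fy(x, y) * dy
print(exact, approx)      # differ only at ~1e-6: the error is second order in (dx, dy)
```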

$$\begin{align} \Delta{z}&=f(x+\Delta{x},y+\Delta{y})-f(x,y)\\\\ &\approx\frac{\partial{f(x,y)}}{\partial{x}}*\Delta{x}+\frac{\partial{f(x,y)}}{\partial{y}}*\Delta{y}\\\\ &=(\frac{\partial f(x,y)}{\partial x},\frac{\partial f(x,y)}{\partial y}) \cdot (\Delta x,\Delta y)\\\\ &=\vec a \cdot \vec b \end{align}$$

For a step of fixed length, the inner product (and hence $$\Delta{z}$$) is smallest when $$\vec b=-k*\vec a$$ for some $$k\gt0$$, i.e.:

$$(\Delta x,\Delta y)=-\eta*(\frac{\partial f(x,y)}{\partial x},\frac{\partial f(x,y)}{\partial y})$$

Using gradient descent to find a and b:

MSE error: $$C_t=\frac{(8-(7a+b))^2+(4-(5a+b))^2+(8-(9a+b))^2}{2}$$

$$\frac{\partial C_t}{\partial a}=-148+155a+21b$$, $$\frac{\partial C_t}{\partial b}=-20+21a+3b$$

Let $$a=0.2, b=-0.2, \eta=0.01$$; then:

$$\frac{\partial C_t}{\partial a}=-121.2$$, $$\frac{\partial C_t}{\partial b}=-16.4$$, $$\Delta a=1.21, \Delta b=0.16$$, $$C_t=48.72$$

Update $$a=a+\Delta a=1.41, b=b+\Delta b=-0.04$$; then:

$$\frac{\partial C_t}{\partial a}=69.7$$, $$\frac{\partial C_t}{\partial b}=9.49$$, $$\Delta a=-0.70, \Delta b=-0.09$$, $$C_t=17.02$$

After many more iterations:

$$a\to1, b\to-\frac{1}{3}$$, and $$C_t$$ falls toward its minimum $$\frac{4}{3}\approx1.33$$, matching the closed-form solution above.
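
The whole iteration fits in a short loop; a sketch (variable names are my own) that reproduces the step numbers above and runs to convergence:

```python
# Data points (x, y): (7, 8), (5, 4), (9, 8)
xs, ys = [7.0, 5.0, 9.0], [8.0, 4.0, 8.0]

def cost(a, b):
    # C_t = sum of squared residuals, divided by 2
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys)) / 2

def grads(a, b):
    da = sum((y - (a * x + b)) * -x for x, y in zip(xs, ys))   # ∂C_t/∂a
    db = sum((y - (a * x + b)) * -1 for x, y in zip(xs, ys))   # ∂C_t/∂b
    return da, db

a, b, eta = 0.2, -0.2, 0.01
for _ in range(10000):
    da, db = grads(a, b)
    a, b = a - eta * da, b - eta * db        # (Δa, Δb) = -η * gradient

print(round(a, 3), round(b, 3), round(cost(a, b), 3))  # 1.0 -0.333 1.333
```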

Summary

Taking the partial derivatives of the error function gives the gradient.

Given the gradient, initialize the parameters with random values, pick a learning rate, and repeatedly update the parameters in the direction opposite the gradient, lowering the error step by step; this is gradient descent.

Backpropagation

$$z=(\sum_{i=1}^n w_ix_i)+b$$

$$a=\mathrm{Sigmoid}(z)$$

$$c=\frac{1}{2}\sum_{i=1}^n(t_i-a_i)^2$$

Define $$\delta_1=\frac{\partial c}{\partial z_1}$$, i.e. $$\delta_1=\frac{\partial c}{\partial a_1}\frac{\partial a_1}{\partial z_1}$$; then:

$$\frac{\partial c}{\partial w_1}=\frac{\partial c}{\partial a_1}\frac{\partial a_1}{\partial z_1}\frac{\partial z_1}{\partial w_1}=\delta_1x_1$$

$$\frac{\partial c}{\partial b_1}=\frac{\partial c}{\partial a_1}\frac{\partial a_1}{\partial z_1}\frac{\partial z_1}{\partial b_1}=\delta_1$$

Define $$\delta^3_1=\frac{\partial c}{\partial z^3_1}$$ (superscripts index the layer); then:

$$\begin{align} \delta^2_1&=\frac{\partial c}{\partial z^2_1}\\\\ &=\frac{\partial c}{\partial a^3_1}\frac{\partial a^3_1}{\partial z^3_1}\frac{\partial z^3_1}{\partial a^2_1}\frac{\partial a^2_1}{\partial z^2_1}+\frac{\partial c}{\partial a^3_2}\frac{\partial a^3_2}{\partial z^3_2}\frac{\partial z^3_2}{\partial a^2_1}\frac{\partial a^2_1}{\partial z^2_1}\\\\ &=\delta^3_1w^3_1\frac{\partial a^2_1}{\partial z^2_1}+\delta^3_2w^3_2\frac{\partial a^2_1}{\partial z^2_1}\\\\ &=(\delta^3_1w^3_1+\delta^3_2w^3_2)\frac{\partial a^2_1}{\partial z^2_1} \end{align}$$
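
A compact sketch of the recurrence for one hidden layer, with made-up random weights and a finite-difference check (layer 2 is the hidden layer, layer 3 the output layer, matching the superscripts above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=2)                                 # layer-1 activations (input)
t = np.array([1.0, 0.0])                               # targets
W2, b2 = rng.normal(size=(2, 2)), rng.normal(size=2)   # hidden layer (layer 2)
W3, b3 = rng.normal(size=(2, 2)), rng.normal(size=2)   # output layer (layer 3)

# Forward pass: z = W*a + b, a = Sigmoid(z), c = (1/2) * sum((t - a)^2)
z2 = W2 @ x + b2
a2 = sigmoid(z2)
z3 = W3 @ a2 + b3
a3 = sigmoid(z3)

# Output deltas: δ³ = ∂c/∂z³ = (a³ - t) * a³ * (1 - a³)
d3 = (a3 - t) * a3 * (1 - a3)
# Recurrence: δ² = (Σ_j δ³_j * w³_j) * ∂a²/∂z² — no fresh differentiation needed
d2 = (W3.T @ d3) * a2 * (1 - a2)
grad_W2 = np.outer(d2, x)                              # ∂c/∂w² = δ² * input

# Finite-difference check of one hidden-layer weight gradient
def cost(W):
    a2_ = sigmoid(W @ x + b2)
    a3_ = sigmoid(W3 @ a2_ + b3)
    return 0.5 * np.sum((t - a3_) ** 2)

h = 1e-6
Wp = W2.copy()
Wp[0, 0] += h
print(np.isclose((cost(Wp) - cost(W2)) / h, grad_W2[0, 0], atol=1e-6))  # True
```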

Summary

Each layer's $$\delta$$ values are obtained from the next layer's $$\delta$$ values through this recurrence, so the gradients of all earlier layers follow directly from the last layer's without differentiating from scratch again; this is error backpropagation.
