Learning Rate in Neural Networks

0x1 Introduction to the Learning Rate

The learning rate is an important hyperparameter used by most neural network optimization algorithms (such as SGD and Adam). It controls how quickly we adjust the network's weights based on the loss gradient. If the learning rate is too large, the optimizer may overshoot the minimum and fail to converge, with the loss oscillating back and forth around some value. If the learning rate is too small, we move down the loss gradient very slowly and convergence takes a long time.

The job of a neural network optimizer is to control the variance of the updates, search for a minimum, and keep updating the model parameters until the model converges.
The parameter update rule is θ ← θ − η · ∇_θ J(θ), where η is the learning rate and ∇_θ J(θ) is the gradient of the loss function J(θ).
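To make the update rule concrete, here is a minimal sketch in plain Python/NumPy; the names theta, grad, and eta are illustrative and not taken from the test code later in this post.

```python
import numpy as np

def sgd_step(theta, grad, eta):
    """One plain gradient-descent step: theta <- theta - eta * grad."""
    return theta - eta * grad

# Toy loss J(theta) = theta^2, whose gradient is 2 * theta.
theta = np.array([1.0])
for _ in range(20):
    theta = sgd_step(theta, grad=2.0 * theta, eta=0.1)
print(theta)  # approaches the minimizer theta = 0
```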

The learning rate is one of the hardest hyperparameters to set in a neural network, and it has a large impact on model performance. Many adaptive learning-rate algorithms now exist, such as Adam, which is used in the tests below. The tests show that even though Adam adjusts the learning rate dynamically during optimization, the initial learning rate still has a large effect on the model's accuracy.

0x2 Building the Computation Graph

Using Python, I wrote TensorFlow test code that builds the convolutional neural network shown below to classify the MNIST dataset into the digits 0-9.

As the network graph below shows, the MNIST data is loaded first and then processed by convolutional layer layer1, convolutional layer layer2, a fully connected layer fc_layer, a dropout layer, and finally a softmax layer layer3, which produces the probability output.

The optimizer is Adam; an initial learning rate must be supplied when the Adam optimizer is created, as in the code below.

```python
train_step = tf.train.AdamOptimizer(learning_rate).minimize(cross_entropy)
```

The overall network graph is shown below.
computation_graph
The internal structure of layer1 is shown below.
The weight and bias parameters are initialized first. Then conv2d performs the convolution and the bias is added; the convolution parameters are [5, 5, 1, 32], meaning a 5x5 kernel, 1 color channel, and 32 different kernels. A ReLU activation then applies the non-linearity, and a pooling function downsamples the convolution result by a factor of 2: MNIST images are 28x28, so after downsampling they become 14x14.
layer1
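As a rough sketch of what layer1 could look like in TensorFlow 1.x (the names x, x_image, W_conv1, b_conv1, h_conv1 and h_pool1 are assumptions for illustration, not taken from the original code):

```python
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 784])   # flattened 28x28 MNIST images
x_image = tf.reshape(x, [-1, 28, 28, 1])      # restore 28x28 shape, 1 color channel

# layer1: 5x5 kernel, 1 input channel, 32 kernels -> parameters [5, 5, 1, 32]
W_conv1 = tf.Variable(tf.truncated_normal([5, 5, 1, 32], stddev=0.1))
b_conv1 = tf.Variable(tf.constant(0.1, shape=[32]))
h_conv1 = tf.nn.relu(
    tf.nn.conv2d(x_image, W_conv1, strides=[1, 1, 1, 1], padding='SAME') + b_conv1)
# 2x2 max pooling halves each spatial dimension: 28x28 -> 14x14
h_pool1 = tf.nn.max_pool(h_conv1, ksize=[1, 2, 2, 1],
                         strides=[1, 2, 2, 1], padding='SAME')
```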
The internal structure of layer2 is shown below.
It is similar to layer1, except that the convolution parameters are [5, 5, 32, 64]: after layer1's 32 kernels, each feature map has 32 channels, and the number of kernels is increased to 64. After the pooling function the feature maps shrink from 14x14 to 7x7.
layer2
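Continuing the sketch above with assumed names, layer2 only changes the kernel parameters to [5, 5, 32, 64] and pools again:

```python
# layer2: 5x5 kernel, 32 input channels (from layer1), 64 kernels
W_conv2 = tf.Variable(tf.truncated_normal([5, 5, 32, 64], stddev=0.1))
b_conv2 = tf.Variable(tf.constant(0.1, shape=[64]))
h_conv2 = tf.nn.relu(
    tf.nn.conv2d(h_pool1, W_conv2, strides=[1, 1, 1, 1], padding='SAME') + b_conv2)
# 2x2 max pooling: 14x14 -> 7x7
h_pool2 = tf.nn.max_pool(h_conv2, ksize=[1, 2, 2, 1],
                         strides=[1, 2, 2, 1], padding='SAME')
```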
This is followed by a fully connected layer, fc_layer, with 1024 hidden units and a ReLU activation.
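Such a layer might look as follows, assuming the 7x7x64 output of layer2 is flattened first (W_fc1, b_fc1 and h_fc1 are assumed names):

```python
W_fc1 = tf.Variable(tf.truncated_normal([7 * 7 * 64, 1024], stddev=0.1))
b_fc1 = tf.Variable(tf.constant(0.1, shape=[1024]))
h_pool2_flat = tf.reshape(h_pool2, [-1, 7 * 7 * 64])        # flatten the 7x7x64 feature maps
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)  # 1024 hidden units
```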

Next comes the dropout layer, which is controlled by the keep_prob parameter to reduce overfitting.
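In TensorFlow 1.x this is a single tf.nn.dropout call; keep_prob is fed at run time, typically below 1.0 during training and 1.0 during evaluation:

```python
keep_prob = tf.placeholder(tf.float32)
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)   # randomly zero units to reduce overfitting
```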

Next is layer3. The 1024 hidden units are first mapped to 10 units by a matrix transform whose parameters are a weight matrix and a bias vector, and the result is then fed to the softmax function to obtain the probabilities for the digits 0-9.
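Sketched in the same style (W_fc2, b_fc2 and y_conv are assumed names):

```python
W_fc2 = tf.Variable(tf.truncated_normal([1024, 10], stddev=0.1))
b_fc2 = tf.Variable(tf.constant(0.1, shape=[10]))
y_conv = tf.nn.softmax(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)  # probabilities for digits 0-9
```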

The optimizer is Adam, which minimizes the loss function defined for the network.
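The original post only shows the train_step line, so the loss definition below is an assumption; a typical cross-entropy for this setup, with y_ holding the one-hot labels, would be:

```python
y_ = tf.placeholder(tf.float32, [None, 10])   # one-hot ground-truth labels
cross_entropy = tf.reduce_mean(
    -tf.reduce_sum(y_ * tf.log(y_conv), reduction_indices=[1]))
train_step = tf.train.AdamOptimizer(learning_rate).minimize(cross_entropy)
```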

0x3 Test Results with a High Learning Rate

Here the initial learning rate is set to 0.5. The accuracy measured on the training set is very low, oscillating between roughly 0.05 and 0.15.
high_learning_rate

0x4 Test Results with a Low Learning Rate

Here the initial learning rate is set to 0.001. The accuracy measured on the training set rises steadily and eventually stabilizes above 0.95.
low_learning_rate

0x5 The Adam Optimizer

Debugging with gdb shows the following call stack for TensorFlow's Adam optimizer.

```
(gdb) bt
#0 tensorflow::functor::ApplyAdamNonCuda<Eigen::ThreadPoolDevice, float>::operator()(Eigen::ThreadPoolDevice const&, Eigen::TensorMap<Eigen::Tensor<float, 1, 1, long>, 16, Eigen::MakePointer>, Eigen::TensorMap<Eigen::Tensor<float, 1, 1, long>, 16, Eigen::MakePointer>, Eigen::TensorMap<Eigen::Tensor<float, 1, 1, long>, 16, Eigen::MakePointer>, Eigen::TensorMap<Eigen::TensorFixedSize<float const, Eigen::Sizes<>, 1, long>, 16, Eigen::MakePointer>, Eigen::TensorMap<Eigen::TensorFixedSize<float const, Eigen::Sizes<>, 1, long>, 16, Eigen::MakePointer>, Eigen::TensorMap<Eigen::TensorFixedSize<float const, Eigen::Sizes<>, 1, long>, 16, Eigen::MakePointer>, Eigen::TensorMap<Eigen::TensorFixedSize<float const, Eigen::Sizes<>, 1, long>, 16, Eigen::MakePointer>, Eigen::TensorMap<Eigen::TensorFixedSize<float const, Eigen::Sizes<>, 1, long>, 16, Eigen::MakePointer>, Eigen::TensorMap<Eigen::TensorFixedSize<float const, Eigen::Sizes<>, 1, long>, 16, Eigen::MakePointer>, Eigen::TensorMap<Eigen::Tensor<float const, 1, 1, long>, 16, Eigen::MakePointer>, bool) (this=0x7fffc58d8260, d=..., var=..., m=..., v=..., beta1_power=...,
beta2_power=..., lr=..., beta1=..., beta2=..., epsilon=..., grad=..., use_nesterov=false) at tensorflow/core/kernels/training_ops.cc:293
#1 0x00007fffede52e6d in tensorflow::ApplyAdamOp<Eigen::ThreadPoolDevice, float>::Compute (this=0x555558f61cf0, ctx=0x7fffc58d8840)
at tensorflow/core/kernels/training_ops.cc:2523
#2 0x00007fffe6a2e91b in tensorflow::ThreadPoolDevice::Compute (this=0x555557f25fa0, op_kernel=0x555558f61cf0, context=0x7fffc58d8840)
at tensorflow/core/common_runtime/threadpool_device.cc:59
#3 0x00007fffe69c9c0a in tensorflow::(anonymous namespace)::ExecutorState::Process (this=0x55555945b6c0, tagged_node=..., scheduled_usec=0)
at tensorflow/core/common_runtime/executor.cc:1652
#4 0x00007fffe69d81a7 in std::_Mem_fn_base<void (tensorflow::(anonymous namespace)::ExecutorState::*)(tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, long long), true>::operator()<tensorflow::(anonymous namespace)::ExecutorState::TaggedNode&, long long&, void> (
this=0x7fff8c04b440, __object=0x55555945b6c0) at /usr/include/c++/5/functional:600
#5 0x00007fffe69d7c62 in std::_Bind<std::_Mem_fn<void (tensorflow::(anonymous namespace)::ExecutorState::*)(tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, long long int)>(tensorflow::(anonymous namespace)::ExecutorState*, tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, long long int)>::__call<void, 0ul, 1ul, 2ul>(<unknown type in /home/kevin/anaconda3/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so, CU 0x314f884, DIE 0x31e50c1>, std::_Index_tuple<0ul, 1ul, 2ul>) (this=0x7fff8c04b440,
__args=<unknown type in /home/kevin/anaconda3/lib/python3.6/site-packages/tensorflow/python/../libtensorflow_framework.so, CU 0x314f884, DIE 0x31e50c1>) at /usr/include/c++/5/functional:1074
#6 0x00007fffe69d5a06 in std::_Bind<std::_Mem_fn<void (tensorflow::(anonymous namespace)::ExecutorState::*)(tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, long long int)>(tensorflow::(anonymous namespace)::ExecutorState*, tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, long long int)>::operator()<, void>(void) (this=0x7fff8c04b440) at /usr/include/c++/5/functional:1133
#7 0x00007fffe69d32ac in std::_Function_handler<void(), std::_Bind<std::_Mem_fn<void (tensorflow::(anonymous namespace)::ExecutorState::*)(tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, long long int)>(tensorflow::(anonymous namespace)::ExecutorState*, tensorflow::(anonymous namespace)::ExecutorState::TaggedNode, long long int)> >::_M_invoke(const std::_Any_data &) (__functor=...)
at /usr/include/c++/5/functional:1871
#8 0x00007fffe6297984 in std::function<void ()>::operator()() const (this=0x7fff8c04a930) at /usr/include/c++/5/functional:2267
#9 0x00007fffe64afb9e in tensorflow::thread::EigenEnvironment::ExecuteTask (this=0x555557f5ca48, t=...)
at tensorflow/core/lib/core/threadpool.cc:81
#10 0x00007fffe64b265c in Eigen::NonBlockingThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop (this=0x555557f5ca40,
thread_id=1) at external/eigen_archive/unsupported/Eigen/CXX11/src/ThreadPool/NonBlockingThreadPool.h:232
#11 0x00007fffe64b0aae in Eigen::NonBlockingThreadPoolTempl<tensorflow::thread::EigenEnvironment>::NonBlockingThreadPoolTempl(int, bool, tensorflow::thread::EigenEnvironment)::{lambda()#1}::operator()() const ()
at external/eigen_archive/unsupported/Eigen/CXX11/src/ThreadPool/NonBlockingThreadPool.h:65
#12 0x00007fffe64b3c7c in std::_Function_handler<void (), Eigen::NonBlockingThreadPoolTempl<tensorflow::thread::EigenEnvironment>::NonBlockingThreadPoolTempl(int, bool, tensorflow::thread::EigenEnvironment)::{lambda()#1}>::_M_invoke(std::_Any_data const&) (__functor=...)
at /usr/include/c++/5/functional:1871
#13 0x00007fffe6297984 in std::function<void ()>::operator()() const (this=0x555557f6d0b0) at /usr/include/c++/5/functional:2267
#14 0x00007fffe64af907 in tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}::operator()() const (
__closure=0x555557f6d0b0) at tensorflow/core/lib/core/threadpool.cc:56
#15 0x00007fffe64b18d8 in std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) (__functor=...) at /usr/include/c++/5/functional:1871
#16 0x00007fffe6297984 in std::function<void ()>::operator()() const (this=0x555557f6d108) at /usr/include/c++/5/functional:2267
#17 0x00007fffe64f4f38 in std::_Bind_simple<std::function<void ()> ()>::_M_invoke<>(std::_Index_tuple<>) (this=0x555557f6d108)
at /usr/include/c++/5/functional:1531
#18 0x00007fffe64f4ea1 in std::_Bind_simple<std::function<void ()> ()>::operator()() (this=0x555557f6d108)
at /usr/include/c++/5/functional:1520
#19 0x00007fffe64f4e40 in std::thread::_Impl<std::_Bind_simple<std::function<void ()> ()> >::_M_run() (this=0x555557f6d0f0)
at /usr/include/c++/5/thread:115
#20 0x00007fffe5067c80 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#21 0x00007ffff7bc16ba in start_thread (arg=0x7fffc58d9700) at pthread_create.c:333
#22 0x00007ffff78f741d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
```

The code TensorFlow's Adam optimizer uses to update the parameters is shown below; it lives in tensorflow/core/kernels/training_ops.cc.

```cpp
template <typename Device, typename T>
struct ApplyAdamNonCuda {
  void operator()(const Device& d, typename TTypes<T>::Flat var,
                  typename TTypes<T>::Flat m, typename TTypes<T>::Flat v,
                  typename TTypes<T>::ConstScalar beta1_power,
                  typename TTypes<T>::ConstScalar beta2_power,
                  typename TTypes<T>::ConstScalar lr,
                  typename TTypes<T>::ConstScalar beta1,
                  typename TTypes<T>::ConstScalar beta2,
                  typename TTypes<T>::ConstScalar epsilon,
                  typename TTypes<T>::ConstFlat grad, bool use_nesterov) {
    const T alpha = lr() * Eigen::numext::sqrt(T(1) - beta2_power()) /
                    (T(1) - beta1_power());
    // beta1 == μ
    // beta2 == ν
    // v     == n
    // var   == θ
    m.device(d) += (grad - m) * (T(1) - beta1());
    v.device(d) += (grad.square() - v) * (T(1) - beta2());
    if (use_nesterov) {
      var.device(d) -= ((grad * (T(1) - beta1()) + beta1() * m) * alpha) /
                       (v.sqrt() + epsilon());
    } else {
      var.device(d) -= (m * alpha) / (v.sqrt() + epsilon());
    }
  }
};
```

The algorithm is described below.
As it shows, although Adam adjusts the learning rate lr_t dynamically, lr_t remains proportional to the initial learning_rate, so a poorly chosen learning_rate can still prevent the trained model from converging.

```
t <- t + 1
lr_t <- learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)  // compute the dynamic learning rate
m_t <- beta1 * m_{t-1} + (1 - beta1) * g
v_t <- beta2 * v_{t-1} + (1 - beta2) * g * g
variable <- variable - lr_t * m_t / (sqrt(v_t) + epsilon)  // update the parameters
```
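To see why the initial value still dominates, the small Python sketch below (not from the original post) evaluates the bias-corrected step size lr_t for the two initial learning rates tested above, using Adam's default beta1=0.9 and beta2=0.999:

```python
import math

def lr_t(learning_rate, t, beta1=0.9, beta2=0.999):
    """Bias-corrected Adam step size at step t (same formula as above)."""
    return learning_rate * math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)

for base in (0.5, 0.001):
    print(base, [round(lr_t(base, t), 6) for t in (1, 10, 100, 1000)])
# The correction factor approaches 1 as t grows, so lr_t stays proportional
# to the initial learning_rate: 0.5 remains 500x larger than 0.001 at every step.
```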