%0 Journal Article
%T On the Relation Between the Sharpest Directions of DNN Loss and the SGD Step Length
%W http://arxiv.org/abs/1807.05031v2
%U http://arxiv.org/abs/1807.05031
%X Recent work has identified that using a high learning rate or a small batch size for Stochastic Gradient Descent (SGD) based training of deep neural networks encourages finding flatter minima of the training loss towards the end of training. Moreover, measures of the flatness of minima have been shown to correlate with good generalization performance. Extending this previous work, we investigate the loss curvature through the Hessian eigenvalue spectrum in the early phase of training and find an analogous bias: even at the beginning of training, a high learning rate or small batch size influences SGD to visit flatter loss regions. In addition, the evolution of the largest eigenvalues appears to always follow a similar pattern, with a fast increase in the early phase, and a decrease or stabilization thereafter, where the peak value is determined by the learning rate and batch size. Finally, we find that by altering the learning rate just in the direction of the eigenvectors associated with the largest eigenvalues, SGD can be steered towards regions which are an order of magnitude sharper but correspond to models with similar generalization, which suggests the curvature of the endpoint found by SGD is not predictive of its generalization properties.
%G en
%J Modern Trends in Nonconvex Optimization for Machine Learning workshop at International Conference on Machine Learning 2018
%A Jastrzębski, Stanisław
%A Kenton, Zachary
%A Ballas, Nicolas
%A Fischer, Asja
%A Bengio, Yoshua
%A Storkey, Amos
%D 2018-07-13
%K Computer Science - Machine Learning
Statistics - Machine Learning