2nd-order Optimization for Neural Network Training

Published August 11, 2016, 7:55
Neural networks have become the main workhorse of supervised learning, and their efficient training is an important technical challenge which has received a lot of attention. While stochastic gradient descent (SGD) with momentum works well enough in many situations, its performance declines dramatically as networks become deeper and more complex. Given their success in other domains, 2nd-order optimization methods seem like a promising alternative. Unfortunately, the cost of inverting the curvature matrix (traditionally the Hessian) is prohibitive for neural networks, due to the very high dimension of their parameter space. One common solution is to approximate the curvature matrix as diagonal or low-rank. Because such approximations are quite crude, most of the theoretical power of 2nd-order methods is lost, and experimental evidence suggests that they don't work much better than SGD.

In this talk I will present two methods for achieving efficient and robust 2nd-order optimization for neural networks that do not rely on such approximations. The first, called Hessian-free Optimization (HF), is a truncated-Newton method which uses preconditioned conjugate gradient (CG), in lieu of explicit matrix inversion, to approximate the 2nd-order update without making any approximations to the curvature matrix itself. Experiments show that HF works well, and in particular that it converges in orders of magnitude fewer iterations than SGD. While this makes a compelling case for the potential of 2nd-order methods, HF unfortunately suffers in practice from the high cost of computing its updates (via multiple iterations of CG).

The second method I will present, called Kronecker-Factored Approximate Curvature (K-FAC), gets around this issue by using a high-quality approximation of the curvature matrix which is neither diagonal nor low-rank, but can nonetheless be inverted very efficiently. Experiments show that K-FAC significantly outperforms existing methods, and has the potential to be 20-100 times faster in a highly distributed setting.
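To make the matrix-free ingredient behind HF concrete, here is a minimal sketch (my own illustration, not the speaker's code) on a toy least-squares problem with made-up data: the update direction is obtained from a few CG iterations that only ever touch the curvature matrix through Hessian-vector products, so the matrix is never formed or inverted. The actual method uses a damped Gauss-Newton matrix and preconditioning rather than the raw Hessian shown here.

```python
import jax
import jax.numpy as jnp

# Toy least-squares problem standing in for a network loss (illustrative only).
key_x, key_y = jax.random.split(jax.random.PRNGKey(0))
X = jax.random.normal(key_x, (64, 10))
y = jax.random.normal(key_y, (64,))

def loss(w):
    return jnp.mean((X @ w - y) ** 2)

def hvp(w, v):
    # Hessian-vector product via forward-over-reverse autodiff;
    # the full Hessian is never materialized.
    return jax.jvp(jax.grad(loss), (w,), (v,))[1]

def cg(matvec, b, num_iters=10):
    # Truncated (unpreconditioned) conjugate gradient for matvec(x) = b.
    x = jnp.zeros_like(b)
    r = b - matvec(x)
    p = r
    rs = r @ r
    for _ in range(num_iters):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

w = jnp.zeros(10)
for step in range(5):
    g = jax.grad(loss)(w)
    d = cg(lambda v: hvp(w, v), g)  # approximate Newton direction H^{-1} g
    w = w - d
    print(step, float(loss(w)))
```

The second sketch illustrates the structural trick K-FAC exploits, again on made-up numbers for a single fully connected layer: if a layer's curvature block is approximated as a Kronecker product of a small input factor and a small output-gradient factor, applying its inverse to the weight gradient reduces to two small matrix inverses. The factor estimates and damping value below are illustrative assumptions, not the method's exact statistics.

```python
import jax
import jax.numpy as jnp

# One fully connected layer with weights W of shape (out=3, in=5); batch of 256.
key_a, key_g = jax.random.split(jax.random.PRNGKey(0))
acts = jax.random.normal(key_a, (256, 5))    # layer inputs
grads = jax.random.normal(key_g, (256, 3))   # backpropagated output gradients

A = acts.T @ acts / acts.shape[0]            # small (5 x 5) input factor
G = grads.T @ grads / grads.shape[0]         # small (3 x 3) output-gradient factor
damping = 1e-2                               # illustrative damping value

# The layer's curvature block is approximated by a Kronecker product A ⊗ G,
# so inverting it only requires inverting the two small factors.
A_inv = jnp.linalg.inv(A + damping * jnp.eye(A.shape[0]))
G_inv = jnp.linalg.inv(G + damping * jnp.eye(G.shape[0]))

grad_W = grads.T @ acts / acts.shape[0]      # gradient w.r.t. W, shape (3, 5)
precond_grad = G_inv @ grad_W @ A_inv        # (A ⊗ G)^{-1} vec(grad_W), reshaped
```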