


What Is A Neural Network Learning Rate

How to Configure the Learning Rate When Training Deep Learning Neural Networks

Last Updated on August 6, 2019

The weights of a neural network cannot be calculated using an analytical method. Instead, the weights must be discovered via an empirical optimization procedure called stochastic gradient descent.

The optimization problem addressed by stochastic gradient descent for neural networks is challenging and the space of solutions (sets of weights) may comprise many good solutions (called global optima) as well as easy to find, but low in skill, solutions (called local optima).

The amount of change to the model during each step of this search process, or the step size, is called the "learning rate" and is perhaps the most important hyperparameter to tune for your neural network in order to achieve good performance on your problem.

In this tutorial, you will discover the learning rate hyperparameter used when training deep learning neural networks.

After completing this tutorial, you will know:

  • Learning rate controls how quickly or slowly a neural network model learns a problem.
  • How to configure the learning rate with sensible defaults, diagnose behavior, and develop a sensitivity analysis.
  • How to further improve performance with learning rate schedules, momentum, and adaptive learning rates.

Kick-start your project with my new book Better Deep Learning, including step-by-step tutorials and the Python source code files for all examples.

Let's get started.

How to Configure the Learning Rate Hyperparameter When Training Deep Learning Neural Networks
Photo by Bernd Thaller, some rights reserved.

Tutorial Overview

This tutorial is divided into 6 parts; they are:

  1. What Is the Learning Rate?
  2. Effect of Learning Rate
  3. How to Configure Learning Rate
  4. Add Momentum to the Learning Process
  5. Use a Learning Rate Schedule
  6. Adaptive Learning Rates

What Is the Learning Rate?

Deep learning neural networks are trained using the stochastic gradient descent algorithm.

Stochastic gradient descent is an optimization algorithm that estimates the error gradient for the current state of the model using examples from the training dataset, then updates the weights of the model using the back-propagation of errors algorithm, referred to as simply backpropagation.

The amount that the weights are updated during training is referred to as the step size or the "learning rate."

Specifically, the learning rate is a configurable hyperparameter used in the training of neural networks that has a small positive value, often in the range between 0.0 and 1.0.

… learning rate, a positive scalar determining the size of the step.

— Page 86, Deep Learning, 2016.

The learning rate is often represented using the notation of the lowercase Greek letter eta (η).

During training, the backpropagation of error estimates the amount of error for which the weights of a node in the network are responsible. Instead of updating the weight with the full amount, it is scaled by the learning rate.

This means that a learning rate of 0.1, a traditionally common default value, would mean that weights in the network are updated 0.1 * (estimated weight error) or 10% of the estimated weight error each time the weights are updated.
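
As a rough sketch of this scaling (the one-variable loss and all names here are invented purely for illustration), plain gradient descent on a quadratic moves the weight by only a fraction of the estimated error at each step:

```python
# Minimize loss(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
learning_rate = 0.1  # traditionally common default value
w = 0.0
for step in range(100):
    gradient = 2.0 * (w - 3.0)       # estimated error for the weight
    w -= learning_rate * gradient    # apply only 10% of the estimated error
print(w)  # approaches the optimum at 3.0
```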


Effect of Learning Rate

A neural network learns or approximates a function to best map inputs to outputs from examples in the training dataset.

The learning rate hyperparameter controls the rate or speed at which the model learns. Specifically, it controls the amount of apportioned error that the weights of the model are updated with each time they are updated, such as at the end of each batch of training examples.

Given a perfectly configured learning rate, the model will learn to best approximate the function given available resources (the number of layers and the number of nodes per layer) in a given number of training epochs (passes through the training data).

Generally, a large learning rate allows the model to learn faster, at the cost of arriving on a sub-optimal final set of weights. A smaller learning rate may allow the model to learn a more optimal or even globally optimal set of weights but may take significantly longer to train.

At extremes, a learning rate that is too large will result in weight updates that are too large and the performance of the model (such as its loss on the training dataset) will oscillate over training epochs. Oscillating performance is said to be caused by weights that diverge (are divergent). A learning rate that is too small may never converge or may get stuck on a suboptimal solution.

When the learning rate is too large, gradient descent can inadvertently increase rather than decrease the training error. […] When the learning rate is too small, training is not only slower, but may become permanently stuck with a high training error.

— Page 429, Deep Learning, 2016.

In the worst case, weight updates that are too large may cause the weights to explode (i.e. result in a numerical overflow).

When using high learning rates, it is possible to encounter a positive feedback loop in which large weights induce large gradients which then induce a large update to the weights. If these updates consistently increase the size of the weights, then [the weights] rapidly moves away from the origin until numerical overflow occurs.

— Page 238, Deep Learning, 2016.

Therefore, we should not use a learning rate that is too large or too small. Nevertheless, we must configure the model in such a way that on average a "good enough" set of weights is found to approximate the mapping problem as represented by the training dataset.

How to Configure Learning Rate

It is important to find a good value for the learning rate for your model on your training dataset.

The learning rate may, in fact, be the most important hyperparameter to configure for your model.

The initial learning rate […] This is often the single most important hyperparameter and one should always make sure that it has been tuned […] If there is only time to optimize one hyper-parameter and one uses stochastic gradient descent, then this is the hyper-parameter that is worth tuning

— Practical recommendations for gradient-based training of deep architectures, 2012.

In fact, if there are resources to tune hyperparameters, much of this time should be dedicated to tuning the learning rate.

The learning rate is perhaps the most important hyperparameter. If you have time to tune only one hyperparameter, tune the learning rate.

— Page 429, Deep Learning, 2016.

Unfortunately, we cannot analytically calculate the optimal learning rate for a given model on a given dataset. Instead, a good (or good enough) learning rate must be discovered via trial and error.

… in general, it is not possible to calculate the best learning rate a priori.

— Page 72, Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, 1999.

The range of values to consider for the learning rate is less than 1.0 and greater than 10^-6.

Typical values for a neural network with standardized inputs (or inputs mapped to the (0,1) interval) are less than 1 and greater than 10^−6

— Practical recommendations for gradient-based training of deep architectures, 2012.

The learning rate will interact with many other aspects of the optimization process, and the interactions may be nonlinear. Nevertheless, in general, smaller learning rates will require more training epochs. Conversely, larger learning rates will require fewer training epochs. Further, smaller batch sizes are better suited to smaller learning rates given the noisy estimate of the error gradient.

A traditional default value for the learning rate is 0.1 or 0.01, and this may represent a good starting point on your problem.

A default value of 0.01 typically works for standard multi-layer neural networks but it would be foolish to rely exclusively on this default value

— Practical recommendations for gradient-based training of deep architectures, 2012.
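
If you are working in Keras, the learning rate is set directly on the optimizer. The snippet below is a minimal sketch assuming TensorFlow 2.x, with an arbitrary two-input model used only for illustration:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

# Compile a small model with the traditional default learning rate of 0.01.
model = Sequential([
    Dense(10, activation='relu', input_shape=(2,)),
    Dense(1, activation='sigmoid'),
])
model.compile(loss='binary_crossentropy', optimizer=SGD(learning_rate=0.01))
```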

Diagnostic plots can be used to investigate how the learning rate impacts the rate of learning and learning dynamics of the model. One example is to create a line plot of loss over training epochs during training (a minimal sketch follows the list below). The line plot can show many properties, such as:

  • The rate of learning over training epochs, such as fast or slow.
  • Whether the model has learned too quickly (sharp rise and plateau) or is learning too slowly (little or no change).
  • Whether the learning rate might be too large via oscillations in loss.
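
The sketch below shows one way to produce such a diagnostic plot, assuming TensorFlow 2.x, scikit-learn, and matplotlib; the synthetic dataset and model size are chosen arbitrarily for illustration:

```python
from sklearn.datasets import make_moons
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD
from matplotlib import pyplot

# Synthetic binary classification problem, used only to produce a loss curve.
X, y = make_moons(n_samples=500, noise=0.2, random_state=1)

model = Sequential([Dense(50, activation='relu', input_shape=(2,)),
                    Dense(1, activation='sigmoid')])
model.compile(loss='binary_crossentropy', optimizer=SGD(learning_rate=0.01))

# fit() returns a history object that records the loss for each epoch.
history = model.fit(X, y, epochs=200, verbose=0)

# Line plot of loss over training epochs.
pyplot.plot(history.history['loss'])
pyplot.xlabel('epoch')
pyplot.ylabel('loss')
pyplot.show()
```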

Configuring the learning rate is challenging and time-consuming.

The choice of the value for [the learning rate] can be fairly critical, since if it is too small the reduction in error will be very slow, while, if it is too large, divergent oscillations can result.

— Page 95, Neural Networks for Pattern Recognition, 1995.

An alternative approach is to perform a sensitivity analysis of the learning rate for the chosen model, also called a grid search. This can help to both highlight an order of magnitude where good learning rates may reside, as well as describe the relationship between learning rate and performance.

It is common to grid search learning rates on a log scale from 0.1 to 10^-5 or 10^-6.

Typically, a grid search involves picking values approximately on a logarithmic scale, e.g., a learning rate taken within the set {.1, .01, 10^−3, 10^−4, 10^−5}

— Page 434, Deep Learning, 2016.

When plotted, the results of such a sensitivity analysis often show a "U" shape, where loss decreases (performance improves) as the learning rate is decreased with a fixed number of training epochs, to a point where loss sharply increases again because the model fails to converge.
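
A minimal sketch of such a sensitivity analysis, under the same assumptions as the previous snippet (TensorFlow 2.x and a synthetic scikit-learn dataset), loops over learning rates on a log scale and records the final loss for a fixed budget of epochs:

```python
from sklearn.datasets import make_moons
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

X, y = make_moons(n_samples=500, noise=0.2, random_state=1)

# Grid of learning rates on a log scale; comparing the final loss across
# rates for a fixed number of epochs often reveals the "U" shape.
for lr in [1e-0, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5]:
    model = Sequential([Dense(50, activation='relu', input_shape=(2,)),
                        Dense(1, activation='sigmoid')])
    model.compile(loss='binary_crossentropy', optimizer=SGD(learning_rate=lr))
    history = model.fit(X, y, epochs=100, verbose=0)
    print('lr=%.0e final loss=%.3f' % (lr, history.history['loss'][-1]))
```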

If you need help experimenting with the learning rate for your model, see the post:

  • Understand the Impact of Learning Rate on Model Performance With Deep Learning Neural Networks

Add Momentum to the Learning Process

Training a neural network can be made easier with the addition of history to the weight update.

Specifically, an exponentially weighted average of the prior updates to the weight can be included when the weights are updated. This change to stochastic gradient descent is called "momentum" and adds inertia to the update process, causing many past updates in one direction to continue in that direction in the future.

The momentum algorithm accumulates an exponentially decaying moving average of past gradients and continues to move in their direction.

— Page 296, Deep Learning, 2016.

Momentum can accelerate learning on those problems where the high-dimensional "weight space" that is being navigated by the optimization process has structures that mislead the gradient descent algorithm, such as flat regions or steep curvature.

The method of momentum is designed to accelerate learning, especially in the face of high curvature, small but consistent gradients, or noisy gradients.

— Page 296, Deep Learning, 2016.

The amount of inertia of past updates is controlled via the addition of a new hyperparameter, often referred to as the "momentum" or "velocity," and uses the notation of the Greek lowercase letter alpha (α).

… the momentum algorithm introduces a variable v that plays the role of velocity — it is the direction and speed at which the parameters move through parameter space. The velocity is set to an exponentially decaying average of the negative gradient.

— Page 296, Deep Learning, 2016.
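
Written out as a sketch (the toy loss and values are illustrative, not from the book), the velocity accumulates a decaying average of past gradients and the weight is moved by the velocity rather than by the raw gradient:

```python
# Gradient descent with momentum on the toy loss(w) = (w - 3)^2.
learning_rate = 0.1
momentum = 0.9      # alpha: how much of the previous update is retained
w, velocity = 0.0, 0.0
for step in range(100):
    gradient = 2.0 * (w - 3.0)
    velocity = momentum * velocity - learning_rate * gradient
    w += velocity   # the update is the velocity, not the raw gradient
```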

It has the effect of smoothing the optimization process, slowing updates to continue in the previous direction instead of getting stuck or oscillating.

One very simple technique for dealing with the problem of widely differing eigenvalues is to add a momentum term to the gradient descent formula. This effectively adds inertia to the motion through weight space and smoothes out the oscillations

— Page 267, Neural Networks for Pattern Recognition, 1995.

Momentum is set to a value greater than 0.0 and less than one, where common values such as 0.9 and 0.99 are used in practice.

Common values of [momentum] used in practice include .5, .9, and .99.

— Page 298, Deep Learning, 2016.

Momentum does not make it easier to configure the learning rate, as the step size is independent of the momentum. Instead, momentum can improve the speed of the optimization process in concert with the step size, improving the likelihood that a better set of weights is discovered in fewer training epochs.
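
In Keras, for example, momentum is configured on the optimizer alongside the learning rate (a minimal sketch, assuming TensorFlow 2.x):

```python
from tensorflow.keras.optimizers import SGD

# Step size and momentum are configured together on the same optimizer.
opt = SGD(learning_rate=0.01, momentum=0.9)
```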

Use a Learning Rate Schedule

An alternative to using a fixed learning rate is to instead vary the learning rate over the training process.

The way in which the learning rate changes over time (training epochs) is referred to as the learning rate schedule or learning rate decay.

Perhaps the simplest learning rate schedule is to decrease the learning rate linearly from a large initial value to a small value. This allows large weight changes at the beginning of the learning process and small changes or fine-tuning towards the end of the learning process.

In practice, it is necessary to gradually decrease the learning rate over time, so we now denote the learning rate at iteration […] This is because the SGD gradient estimator introduces a source of noise (the random sampling of m training examples) that does not vanish even when we arrive at a minimum.

— Page 294, Deep Learning, 2016.

In fact, using a learning rate schedule may be a best practice when training neural networks. Instead of choosing a fixed learning rate hyperparameter, the configuration challenge involves choosing the initial learning rate and a learning rate schedule. It is possible that the choice of the initial learning rate is less sensitive than choosing a fixed learning rate, given the better performance that a learning rate schedule may permit.

The learning rate can be decayed to a small value close to zero. Alternately, the learning rate can be decayed over a fixed number of training epochs, then kept constant at a small value for the remaining training epochs to facilitate more time fine-tuning.

In practice, it is common to decay the learning rate linearly until iteration [tau]. After iteration [tau], it is common to leave [the learning rate] constant.

— Page 295, Deep Learning, 2016.
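
One way to realize the "decay linearly until iteration tau, then hold constant" schedule described above is with a Keras LearningRateScheduler callback; the sketch below assumes TensorFlow 2.x, and the start, end, and tau values are placeholders to tune for your problem:

```python
from tensorflow.keras.callbacks import LearningRateScheduler

# Decay linearly from lr_start to lr_end over the first `tau` epochs,
# then keep the learning rate constant at lr_end.
lr_start, lr_end, tau = 0.1, 0.001, 50

def schedule(epoch, lr):
    if epoch >= tau:
        return lr_end
    return lr_start + (lr_end - lr_start) * (epoch / tau)

callback = LearningRateScheduler(schedule)
# model.fit(X, y, epochs=100, callbacks=[callback])
```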

Adaptive Learning Rates

The performance of the model on the training dataset can be monitored by the learning algorithm and the learning rate can be adjusted in response.

This is called an adaptive learning rate.

Perhaps the simplest implementation is to make the learning rate smaller once the performance of the model plateaus, such as by decreasing the learning rate by a factor of two or an order of magnitude.

A reasonable choice of optimization algorithm is SGD with momentum with a decaying learning rate (popular decay schemes that perform better or worse on different problems include decaying linearly until reaching a fixed minimum learning rate, decaying exponentially, or decreasing the learning rate by a factor of 2-10 each time validation error plateaus).

— Page 425, Deep Learning, 2016.
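
In Keras, this "drop on plateau" behavior is available as the ReduceLROnPlateau callback; the factor, patience, and floor below are illustrative values, not recommendations:

```python
from tensorflow.keras.callbacks import ReduceLROnPlateau

# Halve the learning rate whenever validation loss has not improved
# for 10 epochs, never going below a small floor.
callback = ReduceLROnPlateau(monitor='val_loss', factor=0.5,
                             patience=10, min_lr=1e-5)
# model.fit(X, y, validation_split=0.3, epochs=200, callbacks=[callback])
```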

Alternately, the learning rate can be increased again if performance does not improve for a fixed number of training epochs.

An adaptive learning rate method will generally outperform a model with a badly configured learning rate.

The difficulty of choosing a good learning rate a priori is one of the reasons adaptive learning rate methods are so useful and popular. A good adaptive algorithm will usually converge much faster than simple back-propagation with a poorly chosen fixed learning rate.

— Page 72, Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, 1999.

Although no single method works best on all problems, there are three adaptive learning rate methods that have proven to be robust over many types of neural network architectures and problem types.

They are AdaGrad, RMSProp, and Adam, and all maintain and adapt learning rates for each of the weights in the model.

Perhaps the most popular is Adam, as it builds upon RMSProp and adds momentum.

At this point, a natural question is: which algorithm should one choose? Unfortunately, there is currently no consensus on this point. Currently, the most popular optimization algorithms actively in use include SGD, SGD with momentum, RMSProp, RMSProp with momentum, AdaDelta and Adam.

— Page 309, Deep Learning, 2016.

A robust strategy may be to first evaluate the performance of a model with a modern version of stochastic gradient descent with adaptive learning rates, such as Adam, and use the result as a baseline. Then, if time permits, explore whether improvements can be achieved with a carefully selected learning rate or simpler learning rate schedule.
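
As a rough sketch of that strategy in Keras (assuming TensorFlow 2.x; the specific values are illustrative):

```python
from tensorflow.keras.optimizers import Adam, SGD

# Baseline: an adaptive method with its default configuration.
baseline_opt = Adam(learning_rate=0.001)

# If time permits: a carefully tuned SGD configuration for comparison.
tuned_opt = SGD(learning_rate=0.01, momentum=0.9)
```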

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Post

  • Understand the Impact of Learning Rate on Model Performance With Deep Learning Neural Networks

Papers

  • Practical recommendations for gradient-based training of deep architectures, 2012.

Books

  • Chapter 8: Optimization for Training Deep Models, Deep Learning, 2016.
  • Chapter 6: Learning Rate and Momentum, Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, 1999.
  • Section 5.7: Gradient descent, Neural Networks for Pattern Recognition, 1995.

Articles

  • Stochastic gradient descent, Wikipedia.
  • What learning rate should be used for backprop?, Neural Network FAQ.

Summary

In this tutorial, you discovered the learning rate hyperparameter used when training deep learning neural networks.

Specifically, you learned:

  • Learning rate controls how quickly or slowly a neural network model learns a problem.
  • How to configure the learning rate with sensible defaults, diagnose behavior, and develop a sensitivity analysis.
  • How to further improve performance with learning rate schedules, momentum, and adaptive learning rates.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

