Optimizing Gradient Descent

Gradient descent is the process of updating the weights for the nodes in each layer of a neural network during back propagation in order to reduce the error in the predicted values to maximize the accuracy of the model. This is the “learning” process in supervised machine learning. You train a model by giving it a dataset to learn from. Often times these datasets are extremely large. The larger the dataset, the more accurate the model becomes after training, generally. However, regular ole batch gradient descent can be extremely slow. For this reason it’s often more practical to combine various optimization techniques to find the sweet spot between speed and accuracy.

Regular batch gradient descent is so slow because it accounts for every example in the dataset when computing the gradients before updating parameters in a single pass. This ultimately will lead to the most accurate model. However, not only is it slow, it requires enough memory to hold the entire dataset at once.

One way to optimize gradient descent is to split the dataset into batches and run each pass of gradient descent on each batch separately. This can speed things up, especially when taking advantage of parallel processing to run each batch concurrently.

At the extreme edge of the spectrum, where each batch is a single example, is known as Stochastic gradient descent. Stochastic is the fastest, but has a major limitation, as the weights get updated for each example and can wander around before making it’s way towards the minimum. This is referred to as internal covariate shift.

Unfortunately, as it wanders, it’ll never reach the global minimum, but rather wander it’s way around it, never quite finding the most accurate model, but just getting close.

Instead, it’s more common to split the batches into powers of two, with 32 being a common value for the batch size. This would be called mini-batch gradient descent and is the best of both worlds, improving the speed and reducing the wandering.

Regardless of the method of gradient descent used, there are additional methods which help in optimization. One of the most crucial is feature scaling. As each feature has it’s own scale, the gradients for each layer can vary greatly, and thus the gradients will be skewed by the differences in scale. Instead, we can normalize the features to fall between [0, 1].

This normalization process reduces the oscillations in the gradients as the features were skewed and reduces the number of passes of gradient descent necessary to reach the minimum.

Similarly with normalizing feature inputs, we can normalize the activations for each layer within the neural network. This is referred to as batch normalization. If you normalize the inputs during preprocessing this is redundant.

Another optimization method is to use what’s called momentum. This is essentially just an exponential weighted moving average of the gradients. This reduces oscillations and gives greater weight in the direction of the minimum, allowing for convergence to happen sooner.

Another similar method, called RMSProp, attempts to reduce oscillations in the wrong axis or wrong direction. If you imagine vertical and horizontal oscillations, with the minimum being horizontally distanced, any vertical movements will move us away from the minimum, and each pass of gradient descent will oscillate up and down but closer towards the opitimum value. RMSProp attempts to reduce these vertical oscillations (or whatever axis), only moving in the direction toward the minimum.

Another algorithm, known as Adam, basically combined momentum and RMSProp.

Decaying the learning rate is another method used, especially in stochastic and minibatch gradient descent. As you continue learning, you reduce the learning rate, so that the oscillations away from the minimum are reduced. By waiting until later to reduce the learning rate, this allows the algorithm to take larger steps earlier, getting closer faster, then reducing the learning rate to allow the wandering around the minimum to occur within a shorter range. It’s like having your dog on a leash staked on a pole, the pole being the target, and shortening the leash being like decaying the learning rate.