Many things affect how well gradient descent works. One that may be less obvious is that rescaling the features, or rescaling inside the network, can have a big effect. It's standard practice when training deep networks to standardize all of the features input to the neural network: take each feature, subtract off its mean over all the observations, and divide by its standard deviation over all the observations, so that each input to the network has mean zero and variance one. This takes a loss surface that is elongated and oval and makes it more round and spherical. It means that gradient descent points more directly toward the local minimum that you're trying to reach.

Now, beyond just normalizing the inputs, one can also take the output of each layer, or if you will, the input to the next layer, and normalize it: take each output x sub j, subtract its mean over the mini-batch (since we're running mini-batches), and divide by its standard deviation over the mini-batch. That will make it mean zero, variance one. Some people also multiply by a constant a and add a constant b, which can be learned along with the other parameters. The idea is to transform each of the intermediate layer outputs so that they are all scaled to roughly the same size, in spite of the fact that some layers may have a big fan-out, with lots of features being created, or a big fan-in, going down to fewer units. This normalization makes gradient descent more stable. It means there's less zigging and zagging as the machine learns. So give it a try. Sometimes this helps, sometimes it hurts. Think about when it's going to help, when it's going to hurt, and see what happens when you run the code.
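The two normalizations described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the feature matrix, the layer weights, and the `batch_norm` helper (with its `gamma`, `beta`, and `eps` arguments) are all made up here for the example, with `eps` added for numerical safety when a unit's variance is near zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 100 observations of 3 features on very different scales.
X = rng.normal(loc=[5.0, -2.0, 100.0], scale=[0.1, 3.0, 50.0], size=(100, 3))

# Input standardization: subtract each feature's mean over all observations,
# divide by its standard deviation, so every column has mean 0, variance 1.
mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_std = (X - mu) / sigma

def batch_norm(h, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each unit's output over the mini-batch dimension, then
    apply an optional scale (gamma) and shift (beta); in practice these
    two constants would be learned along with the other parameters."""
    mean = h.mean(axis=0)
    var = h.var(axis=0)
    h_hat = (h - mean) / np.sqrt(var + eps)
    return gamma * h_hat + beta

# Pretend intermediate-layer output for one mini-batch, then normalize it.
H = X_std @ rng.normal(size=(3, 4))
H_bn = batch_norm(H)
```

After these steps, every column of `X_std` and of `H_bn` has mean zero and (up to the small `eps` term) variance one, regardless of the original scales or of the layer's fan-in and fan-out.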