Now that we've laid all the groundwork, we can take things up a level and see how to apply optimization in a three-parameter model by looking at another M&M dataset. We noticed when fitting our line that not every bag had identical M&M weights. Some were noticeably higher and some noticeably lower than others. We suspect there might be an underlying pattern here, and our suspicions are further aroused when we talk to some friends who work at the M&M factory. We learn that they upgraded their equipment during the period our M&Ms were sampled from, but we don't know exactly when. If there was a change in M&M weight due to the change in equipment, we would expect to see a jump at some point, with the M&Ms coming after the jump having a consistently higher or lower weight. When we plot the average M&M weight in a package against the manufacturing date of that package, we see that there probably is a small difference.

There are three questions this data could help us answer. What was the typical M&M weight before the change? What was the typical M&M weight after the change? And when did the change occur?

There are lots of generic models available. These are models like linear regression or decision trees that don't know anything about where our data came from, what it means, or how we got it. In this example, we can do a little better than these naive models. We have some extra information that helps us make a better guess at what our model should look like. Given what we know, it's reasonable to hypothesize a model with a constant M&M weight before the equipment change, a constant weight after it, and a step change between them. This is a three-parameter model: there are three numbers we need to define it completely. The weight before the change, the weight after, and the date of the change. For consistency, we will keep our loss function as the square of the deviation between our model and the data.
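As a rough sketch, the step model and its squared-error loss might look like this in Python. The data here is synthetic and purely illustrative, a stand-in for the real M&M measurements, and the specific numbers are assumptions:

```python
import numpy as np

# Synthetic stand-in for the real data: average M&M weight per bag
# versus manufacturing day, with a hypothetical step change at day 60.
rng = np.random.default_rng(0)
days = np.arange(100)
weights = np.where(days < 60, 0.91, 0.88) + rng.normal(0, 0.005, size=100)

def step_model(day, w_before, w_after, shift_day):
    """Constant weight before the equipment change, constant weight after."""
    return np.where(day < shift_day, w_before, w_after)

def loss(w_before, w_after, shift_day):
    """Sum of squared deviations between the model and the data."""
    predictions = step_model(days, w_before, w_after, shift_day)
    return np.sum((weights - predictions) ** 2)
```

Any candidate set of three parameter values plugs into `loss`, and better guesses give smaller values: on this synthetic data, `loss(0.91, 0.88, 60)` comes out lower than a flat model like `loss(0.895, 0.895, 60)`.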
We can now perform an exhaustive search again over all possible combinations of all plausible parameter values, though notice that it takes a lot longer to compute with each additional parameter. The result is a four-dimensional loss function: one dimension for each of the three parameters, plus one for the loss itself. Plotting a three-dimensional function is tricky, but plotting a four-dimensional function is really hard. Still, we'll give it a try.

One trick we can use is to choose a specific value for one of our parameters. What's left is a three-dimensional loss function over the remaining two parameters. This lets us experiment and get a feel for how those two parameters interact and where the sweet spot is for both of them. Then we can see, at least for this value of the third parameter, what combination of the other two produces the lowest loss. In our case, we have fixed the day on which the equipment was swapped out and are varying the other two parameters: the weight of an M&M before the change and the weight of an M&M after. The loss function helps us see what the best estimates for these are, at least if we assume this particular equipment shift day.

Another trick we can use is to take the parameter we fixed, our shift day, and increment it just a little bit. We can do this again and again and get a series of three-dimensional plots. What this does is take one of our dimensions and spread it out across time; we can use time to represent our fourth dimension. If you look very carefully as the shift day progresses through its values, you can see the bottom of our bowl jump up and down. This shows how the lowest possible value of the loss function changes depending on the shift day. We're looking for the very lowest of all possible values. That is the optimal estimate, and it depends not only on our estimates of M&M weight before and after the shift day, but also on our choice of the shift day itself. There are three separate choices we have to make, three separate dials we have to spin, to find the best answer.
If you watch the bottom of the bowl very carefully as the shift day progresses, you can see that it reaches a low point, then jumps back up again, then reaches another low point. This means that gradient descent will give us different answers depending on where we start; the answer is sensitive to our initial guess. This is an example of a non-convex loss function. As we mentioned before, there are a few ways to handle this. We won't go into them in detail here, but one method that's good to have in your bag of tricks is a scattershot approach. It involves choosing several random starting points and running gradient descent from each of them. Once you discover that several of them end up in the same place, or that several of them give the same minimum value, you can be reasonably confident that you have a good answer. The more random starting points you pick, and the more times you choose a new one and try again, the better your chances of finding the global optimum, although you'll never be able to guarantee that you have the absolute best answer.

It's worth keeping in mind, however, that in machine learning applications we are dealing with data that's assumed to have some noise in it, and that we'll be applying our model to future data points that will have their own unique patterns of noise. It's not worth our time to worry about getting the exact perfect global optimum. For our purposes, an answer that's about as good as we can reasonably get is usually good enough.

This one concept of looking for the lowest point of an error function is all you need to navigate optimization for machine learning. It takes a lot of different forms and can extend to many more parameters. It can be used with generic models like logistic regression, where there are just a few parameters or a few dozen of them. It can be used with very domain-specific models, such as those in macroeconomics or agriculture. The optimization method doesn't know the difference.
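The scattershot idea can be sketched in a few lines. To keep the two valleys easy to see, this uses a made-up one-dimensional non-convex function rather than our M&M loss; the function, learning rate, and number of starts are all illustrative assumptions:

```python
import numpy as np

def f(x):
    # A non-convex toy function with two valleys of different depths.
    return 0.1 * x**4 - x**2 + 0.5 * x

def grad_f(x):
    # Derivative of f, worked out by hand.
    return 0.4 * x**3 - 2.0 * x + 0.5

def gradient_descent(x0, lr=0.01, steps=2000):
    """Plain gradient descent from a single starting point."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad_f(x)
    return x

# Scattershot: run gradient descent from several random starting points
# and keep the best result. More starts means better odds of finding
# the global minimum, though never a guarantee.
rng = np.random.default_rng(1)
starts = rng.uniform(-3.0, 3.0, size=10)
candidates = [gradient_descent(x0) for x0 in starts]
best_x = min(candidates, key=f)
```

Each run lands at the bottom of whichever valley its starting point drains into; taking the minimum over all runs picks out the deepest valley any of them found.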
In either case, all it does is try to find the combination of values that gives the lowest loss. Optimization is also at the heart of deep neural networks' success. It's how all of the weights, the node-to-node connections between layers, are chosen. In these models, there can be thousands or even millions of parameters. The loss function is not only impossible to draw on a sheet of paper, it's mind-boggling to even contemplate, but the math behind optimization carries on anyway, doing its thing and seeking out the lowest points of the loss function it can find.

A quirky thing about optimization in deep neural networks is that the loss function is known to have many dips, like a many-dimensional egg crate mattress. Depending on where you drop a marble, it will very likely land in a different divot every time. Deep neural networks can take a long time to train, and in practice they are often trained only once, starting from a single random set of initial weights. That means that of all the millions of low points available on the loss function, we are only going to find one. This may be one of the reasons that training neural networks is such an art. It's not quite as grim as it sounds, however. Some recent research suggests that most of these divots have the same depth, like an actual egg crate mattress, at least for certain types of neural networks. So maybe the multitude of local minima is not as much of an issue as it might appear at first glance.

Another part of the art of deep neural networks is the alphabet soup of gradient descent methods. As you read about them, you'll hear names like SGD (stochastic gradient descent), NAG (Nesterov accelerated gradient), Adam, and AdaGrad. All of these are embellishments of gradient descent. They differ in the fine details of exactly how far to step and in precisely which direction.
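To give a flavor of how these variants differ, here is plain gradient descent next to a momentum-style update (the kind of embellishment SGD with momentum uses) on a trivial one-dimensional quadratic. The function and constants are illustrative choices, not anything from a real network:

```python
def grad(w):
    # Gradient of the toy loss f(w) = w**2, whose minimum is at w = 0.
    return 2.0 * w

def plain_gd(w, lr=0.1, steps=100):
    """Vanilla gradient descent: step straight downhill each time."""
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

def momentum_gd(w, lr=0.1, beta=0.9, steps=100):
    """Momentum variant: accumulate a velocity so that past gradients
    carry forward, changing how far each step goes and in what direction."""
    v = 0.0
    for _ in range(steps):
        v = beta * v + grad(w)  # blend the old direction with the new gradient
        w = w - lr * v
    return w
```

Both routines walk toward the same minimum; they differ only in the update rule, which is exactly the kind of fine detail that separates one named optimizer from another.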
Some work a little better for some problems than others, but mostly they just change how fast the neural network converges to the final answer. In every case, the underlying goal is the same: starting from a random place in a many-dimensional landscape of hills and valleys, go downhill to find the lowest point in the nearest valley.

The other type of optimization that occurs in deep neural networks is hyperparameter tuning. This is still just optimization, but instead of optimizing the many parameters in the model, it optimizes the values of some of the high-level settings that determine how the model works, in some cases determining how gradient descent itself is performed. If you think of decorating a house, model training is the optimization where you're choosing exactly where within a room the couch and the bookshelf and the picture should go, and how they should be oriented. Hyperparameter tuning is the next level up: deciding which room they should even be in to start with. Because testing a new set of hyperparameter values requires retraining the model entirely from scratch, it takes a long time and usually a lot of computing power. That computational intensity has driven creative machine learning practitioners to find ways to try as few hyperparameter combinations as possible. However, the underlying goal and the underlying problem are still the same: find the combination of values that gives the lowest loss. Get to the bottom of the deepest valley you can find.

Thanks for sticking with me to the end. I hope the concept of optimization is a little less scary now, and that you have an intuitive sense of what's going on under the hood when a machine learning algorithm is being trained. Good luck building your next project.