One way to think about machine learning is as a way to convert raw data into a simplified, cartoon-like representation, a model, and then to use that representation to make predictions and estimates. Whenever we're engaged in the first part, fitting a model to some data, there is some optimization going on. It's built into the foundation of machine learning, and it comes to the foreground particularly when we're training deep neural networks. It's worth taking a look at a couple of examples of optimization in machine learning models.

Let's start by looking at M&M's. We'll use a model so simple that we usually don't think of it as a model. We'll try to answer the question: how many M&M's come in a bag? Finding an answer to this is easy. I bought a bag of M&M's, I counted them, and I ate them. The answer was 53 M&M's in a bag. Then I bought a second bag. It had 57 M&M's in it.

Now I'm in a tight spot. If I answer 53, I'm only right half the time. If I answer 57, I'm still only right half the time. If I answer 55, right in the middle of the two, I'm not right at all. There's no longer a right answer. It's like the situation described by the saying: a person with a watch always knows what time it is, but a person with two watches is never sure.

Counting more bags of M&M's didn't help the situation. The more I ate, the more I saw that there was no right answer. It became clear that being right was impossible. So I changed my goal to being less wrong. What's the least wrong estimate I can make about how many M&M's come in a bag? This question I have a shot at answering. First, I had to make it very specific. I had to translate it to math. For a given guess and a given bag of M&M's, the difference between them is D. If I number each bag of M&M's, then the difference between my guess and the count of bag number one is D_1, the difference between my guess and the count of bag number two is D_2, and, in general, the difference between my guess and the count of bag i is D_i.
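As a minimal sketch, here is what those differences look like in code for the two bags from the story, taking D_i as the guess minus the count (the sign convention is a choice, not something the story specifies):

```python
# Deviations for a guess, using the two bags counted so far.
counts = [53, 57]   # bag 1 and bag 2
guess = 55          # the estimate right in the middle
deviations = [guess - c for c in counts]
print(deviations)   # [2, -2]
```

Each deviation tells us how far, and in which direction, the guess misses that bag.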
D is often called the deviation. The next thing I need to decide is: if I'm off by a certain deviation, how much do I care? What's the cost to me? There are lots of valid ways to answer this question, and they all result in different answers. The most common is to take the deviation and square it. That way, if a bag is twice as far from the estimate as another, it incurs four times the cost. This squared cost function cares more about the points that are way off than the ones that are close to correct.

Another common cost function is the absolute deviation. That way, if a bag is twice as far from the estimate as another, it incurs just twice the cost. For some problems, this cost function makes the most sense. There are lots of other options. For instance, the square root of the absolute deviation, or 10 raised to the power of the absolute deviation, could also work. In all these cases, the cost goes up as the estimate gets further away from the actual count of the bag. For our purposes, we'll start with the squared deviation as our cost function.

Calculating the total cost is straightforward. Another name for cost is loss, which comes with a cool curly L symbol, so we'll switch to that. The total loss for a given estimate can be calculated by adding up the cost associated with each bag, that is, the square of the difference between the estimate and the number of M&M's in that bag. In this form, we can calculate it. With a little Python code, we can run an exhaustive search on this problem and calculate the loss function for a wide range of estimates. There are links to all the code for this below. Feel free to check it out. Then all we have to do is pick off the M&M count that corresponds to the lowest value on the loss curve, the least wrong option. In this case, about 55.5 M&M's.

As you may or may not find interesting, there's another way to get at this estimate. Because this model is so simple, we can work out the math and find an analytic solution.
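The exhaustive search can be sketched in a few lines. The bag counts here are hypothetical stand-ins (only 53 and 57 are given explicitly in the story), and the grid of candidate estimates is a choice:

```python
# Hypothetical bag counts; only 53 and 57 appear in the story.
counts = [53, 57, 54, 56, 55, 58, 52, 57]

# Candidate estimates: a fine grid from 50 to 60 in steps of 0.01.
candidates = [50 + 0.01 * k for k in range(1001)]

def total_loss(n_est):
    # Squared-deviation loss, summed over all bags.
    return sum((c - n_est) ** 2 for c in counts)

# The least wrong estimate is the candidate with the lowest total loss.
best = min(candidates, key=total_loss)
print(best)  # lands on the mean of the counts, up to grid resolution
```

Sweeping the grid and keeping the argmin is all the exhaustive search amounts to here; it only works because there is a single parameter and a small range to check.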
There's going to be some lightweight calculus, so if that gives you the heebie-jeebies, feel free to skip to the next section. We are lucky to have a nice equation for the loss function. We know we are looking for the lowest point, and that this is the only place where the curve is flat, where it has a slope of zero. We can use this to solve for the estimate of N at this lowest point. We find the slope of the loss function with respect to the estimated number of M&M's by taking its derivative, then set that equal to zero and solve. The place where this is true will be our best estimate of N.

We take advantage of the fact that the derivative of a sum is the sum of the derivatives of all its parts, so we can take the derivative of each part separately. The exponent two comes down in front. Because the whole expression is already equal to zero, we can divide both sides by two, and what's left is still equal to zero. Then we can break the sum out into its pieces. We have a total of M bags of M&M's, so if we add up the estimate N, M times, we end up with M times N. Then we can subtract this term from both sides, moving it to the other side of the equals sign, and finally we can divide by M. What we're left with is, magically, the sum of all the data values divided by the total number of data points, that is, the average of our counts. This unexpectedly slick result at the end of such a long path is one of the reasons optimization is so popular.

Okay, calculus is over. Back to the modeling discussion. Although it may not look like it, this is a model. It's a model of central tendency. It has all the hallmarks of a model. It's a simplified representation of our data. Instead of having to remember every single bag of M&M's we counted, we created a single imaginary, but ideal, platonic, typical representative bag of M&M's. It's like a cartoon in that it resembles the originals but is far less complex.
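For reference, here is that derivation written out. The symbols are my own choices: n-hat for the estimate, x_i for the count of bag i, and m for the number of bags:

```latex
\mathcal{L}(\hat{n}) = \sum_{i=1}^{m} \left(\hat{n} - x_i\right)^2

\frac{d\mathcal{L}}{d\hat{n}} = \sum_{i=1}^{m} 2\left(\hat{n} - x_i\right) = 0

\sum_{i=1}^{m} \hat{n} - \sum_{i=1}^{m} x_i = 0
\quad\Longrightarrow\quad
m\,\hat{n} = \sum_{i=1}^{m} x_i
\quad\Longrightarrow\quad
\hat{n} = \frac{1}{m}\sum_{i=1}^{m} x_i
```

The last line is exactly the average of the counts, which is the slick result the derivation lands on.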
This is exactly what a model does. It also allows us to make predictions. If someone asks us how many M&M's are in a bag they're about to open, we can answer with confidence: 55.48. We know that we're not correct. It's rare to find 0.48 of an M&M. But we know that our answer is less wrong than any other answer we could give.

Although it was simple, this was a complete example of optimization for machine learning. Our model, central tendency, had a single parameter, the estimated number of M&M's, which was adjusted to minimize a loss function. Most machine learning models have more parameters than this. Many-layered neural networks can have millions of parameters, but the underlying process is similar. The set of all possible combinations of parameter values a model can have is called its parameter space. Exhaustive search requires visiting and evaluating every point in a model's parameter space, trying out every possible combination of parameter values. Because the central tendency model's parameter space was so small, we were able to get away with that, but it's not often the case that we can.

It's also worth noting that the beautiful shortcut we found via the mathematical long cut of the analytical derivation, using the mean of the data points as the optimal estimate, is special. We can't expect to find such nice results most of the time. Being able to do so depends both on the simplicity of the model and on the exact choice of loss function. For instance, if we had chosen absolute deviation loss rather than squared deviation loss, the mean would no longer be the shortcut to the best answer. In that case, it just so happens that the median is the right shortcut. But if we had chosen square root absolute deviation loss, I have no idea whether there would be any viable shortcut or what it would be. Most of the time, there will be no clean analytical solution and no shortcut for calculating optimal parameter values.
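We can see the mean-versus-median behavior directly by running the same exhaustive search under both loss functions. The counts below are hypothetical, chosen so that the mean and median differ:

```python
import statistics

# Hypothetical bag counts, chosen so the mean and median differ.
counts = [53, 57, 54, 56, 55, 58, 60]
candidates = [50 + 0.01 * k for k in range(1001)]  # 50.00 .. 60.00

def squared_loss(n_est):
    return sum((c - n_est) ** 2 for c in counts)

def absolute_loss(n_est):
    return sum(abs(c - n_est) for c in counts)

best_squared = min(candidates, key=squared_loss)
best_absolute = min(candidates, key=absolute_loss)

# Squared loss is minimized near the mean,
# absolute loss near the median.
print(best_squared, statistics.mean(counts))
print(best_absolute, statistics.median(counts))
```

Swapping in the square root of the absolute deviation as the loss would only require changing one function here, which is part of the appeal of search over analytic shortcuts: the search doesn't care whether a clean closed-form answer exists.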
We'll have to find them using gradient descent or some slower but tougher method. This example showed how optimization works in a single-parameter machine learning model. Click the link below to join me for part 3, where we use optimization to fit a two-parameter model.