Hello and welcome to week four of deep learning. I am not Konrad Kording; I'm Lyle Ungar for a change. Konrad and I will take turns covering the weeks, and it's my turn this week. I introduced myself much earlier on, but I'm told that faculty can seem big and scary, and that if we have a nice dog or a cat we look more friendly. I actually am friendly. I don't have a dog, but that's my cat Marmalade, who is warm and cuddly. So I hope you will have a warm and cuddly week, and with no further ado, let's talk about optimization.

Optimization requires deciding both what you want to optimize for, that is, what loss function you want to use, and how you optimize. Both are critical. I'm going to talk a little at the beginning and the end of the week about what to optimize for. We'll spend most of the week on gradient descent, talking about how to optimize.

Now, the first piece of optimization is picking a loss function. There are two types of considerations for loss functions. One is technical. Do you want an L2 error, a sum of squared errors of your y-hat predictions versus your actual y labels? Do you want to use a cross-entropy loss? For multi-class problems, such as are common in vision, people have found that cross-entropy losses are much better; they just yield more accurate models. Similarly, people used to optimize accuracy: what fraction of the labels do I get correct? But if you have a big class imbalance, say 95% of people don't have cancer and 5% do, you can get really high accuracy by saying no one has cancer. So optimizing accuracy can be a terrible loss function. Instead, people optimize things like the area under the ROC curve, which trades off the true positive rate against the false positive rate. Hopefully you've seen these terms; if not, the idea is that it balances false positives against false negatives. So it's important to pick the right loss function in a technical sense.
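To make the class-imbalance point concrete, here is a small sketch using NumPy on a made-up 95%/5% screening dataset (the data and the 10,000-example size are my own illustration, not from the lecture) showing why accuracy can be a terrible loss function:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical screening data: roughly 95% of people are negative, 5% positive.
y_true = (rng.random(10_000) < 0.05).astype(int)

# Degenerate classifier: predict "no cancer" for everyone.
y_pred = np.zeros_like(y_true)

accuracy = float((y_pred == y_true).mean())   # ~0.95, looks impressive
recall = float(y_pred[y_true == 1].mean())    # 0.0: misses every true case

print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}")
```

Accuracy rewards the degenerate model, while recall, the true positive rate from which the ROC curve is built, exposes it immediately.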
Do you want sum of squared errors or cross-entropy? Do you want accuracy or area under the curve? But it's equally important, probably more important, to pick the right loss function from a societal standpoint. Everything is optimal in some fashion. Who is it optimal for? The buyer, the seller, the first people in, the last people in, who is benefiting? And that requires being conscious about what you consider fair, about what you are trying to accomplish. The other important thing to realize about optimization and loss functions is that you can never measure the thing you really care about: how happy someone is, how much money you'll make from them over all future time, how long someone will live, whatever it is you're trying to optimize. So you use a surrogate loss function, and surrogate loss functions often lead to unintended consequences. I want to show you two examples of this, fun and very different ones. The first one: cobras in India. There was a problem, I'm told by Wikipedia, so it must be true, with cobras, nasty snakes, in one region of India, and the local authorities offered a bounty: we will pay you this many rupees for each cobra head you bring in, thereby reducing the number of cobras by paying people to kill them. Except the loss function isn't quite the same as the utility function they care about. What did the clever local people do? They started farming cobras, growing them in nice little cages on their farms, bringing in the heads, and collecting the rupees. Awesome. The local authorities then said, no, no, that's not what we intended. We wanted fewer cobras; they're in your cages, that's nice, but no, we're not going to pay for cobra heads anymore. At which point the cobra farmers, yes, released the cobras into the wild, thereby increasing the number of cobras. Oops. The surrogate loss function did not achieve what was wanted.
Now, the same thing happens all the time in reinforcement learning. So let me shift to a second example: a reinforcement learning system, very similar to the AlphaGo and AlphaZero systems you saw in the first week, which learns to play a game where you pilot a little boat along a waterway, going from the beginning to the end and trying to collect as many points as you can along the way. Sounds cool. What did the clever reinforcement learning algorithm learn? Well, it turns out you can maximize points not by going to the end of the race, but by just going round and round in circles, smashing into things over and over again and collecting points. So it finds itself this great optimum where the boat goes around in circles and in fact never completes the course. It optimized point collecting, but it didn't actually learn to win the game, because it wasn't told it needed to finish the game, only to collect points. This happens surprisingly frequently.

Cool. So we've talked about what to optimize for, but it's also important how you optimize, and we will spend most of this week on gradient descent methods, learning how to optimize. Now, you might think it doesn't make that much difference how you optimize, but it really does. And to cement that, I want to show you how expensive training deep learning models can be. A recent paper looked at data Google released about training a natural language processing system called T5. It's in the style of GPT-3; we'll cover all of these in a few weeks, but it's an 11-billion-parameter, state-of-the-art transformer model for natural language processing. And if Google were paying retail price for their computing, which of course they don't, they pay a third or a tenth, I don't know how much, the price tag of the compute time would have been about $10 million. Even if it's a tenth of that, it's still a million bucks of computing.
Now, it's probably more than a million bucks of labor as well, because Google engineers are not cheap. But it is the case that if every time you make a modification you have to wait an extra hour for your run to converge, it slows you down. So these neural nets can be enormously big, billions of parameters, and enormously expensive to train. Speed matters.

So what will we do this week? We will talk about optimization. We'll look a bit at the geometry of neural nets to see why optimization is hard in their case. We will look at a bunch of gradient descent methods. We will do mini-batch algorithms. We will normalize inputs so that networks learn better, and by thinking more about the geometry we'll get to a cool concept called momentum, which helps us learn more efficiently. We will learn state-of-the-art adaptive learning-rate schedules that adjust the learning rate in the best way. And we will almost conclude by looking at some different gradient descent methods that are not yet widely used, and may never be, but which give some insight into what we're trying to do with gradient descent. Finally, we'll come back at the very end and look at some of the unintended consequences of optimization, paying attention to how the loss function itself inevitably drives certain properties of the models, such as their accuracy and bias, that one probably wants to be aware of. So welcome to week four. Have a great week. I look forward to seeing you as you go through your Colabs.
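As a tiny preview of the gradient descent and momentum material, here is a sketch (my own toy example, not course code) that minimizes the one-dimensional quadratic loss L(w) = w**2 / 2, whose gradient is simply w; the learning rate and momentum values are illustrative choices:

```python
def descend(lr=0.1, beta=0.9, steps=200):
    """Gradient descent with heavy-ball momentum on the toy loss L(w) = w**2 / 2."""
    w, v = 5.0, 0.0                  # start far from the optimum at w = 0
    for _ in range(steps):
        grad = w                     # dL/dw for L(w) = w**2 / 2
        v = beta * v + grad          # momentum: decaying sum of past gradients
        w = w - lr * v               # step along the velocity, not the raw gradient
    return w

print(descend())            # ends up very close to the optimum at 0
print(descend(beta=0.0))    # beta = 0 recovers plain gradient descent
```

The only change from plain gradient descent is that the update uses a running velocity v instead of the current gradient, which is the idea the week builds on.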