We have Professor Mehta giving two lectures and then we'll have Professor Sagawa giving two lectures. These are the strong people; they can give two lectures back to back. There's no way I could do that. All right, let me turn this down. Welcome back. The trick to giving two lectures is to make you do something interactive in between so that you don't have to talk the whole time. So let's continue. So far we've gone through the first two of these topics. I wanted to emphasize that the difference between machine learning and, say, classical statistics or optimization is the difference between fitting and predicting. The important point is that we're minimizing the energy for the data we see, but what we care about is data we haven't seen yet, right? That's the idea of generalization. And because of that, we saw that there were basically three tools we needed: an optimizer, some loss functions, and some regularization. And what we saw is that the optimizers that have become standard actually exploit this difference: what started off as tricks to make things faster actually turned out to work better, because they implicitly introduced regularization. We also dealt with the standard loss functions. One of them was the physics-inspired cross entropy for categorical data, and the other was the L2, squared-error loss. But as we'll see today, any loss function you can take a derivative of is going to be okay, right? And basically the last five to seven years have been a bunch of tricks exploiting the fact that if I can differentiate stuff, then I can combine any differentiable architecture with all this machinery.
And somehow the fact that I make the whole architecture differentiable regularizes things, so I can work in really, really high dimensions, right? So I can map a million-dimensional space to another. You know, whatever the dimension of language is, I forget what Shannon calculated for the entropy of language (I don't have the Zipf's-law people here), but it's some humongous thing. Yet Google Translate will literally make a lookup table from almost every Korean phrase to an English phrase, and you would think that's way too high-dimensional a space to make a map between them. But that's really what the differentiable network is doing. The magic of making everything differentiable somehow regularizes things enough that you can do it, right? So there's a joke among a lot of my friends about how we think about machine learning: modern deep learning is just differentiable lookup tables. It's just a lookup table, but by making it differentiable, somehow you can do it magically. It's interpolation plus differentiable lookup tables. So I just wanted to emphasize that this differentiability is very central to everything that's going on, all right? And that requires you to calculate gradients: we noted that we need to be able to calculate gradients with respect to parameters. And the basic, important thing you should take away from this (somehow it's not emphasized enough if you're not working in industry) is that it's really the fully differentiable, or auto-diff, paradigm that has been driving things, at least for the last seven to ten years. All the tricks are some trick to make something look like a completely differentiable function. And by doing that, you can basically do new things you could never imagine doing before, all right?
And so the idea is that I make very complicated models, right, with many layers, as we'll see. And what happens is that computing derivatives gets hard, because I have to use the chain rule. And you say, what's the big deal about the chain rule? It's super easy. Well, you're dealing with problems with millions and millions of parameters. And remember, you have to take a derivative every time you take a step, right? So every step: derivative with respect to a million parameters, derivative with respect to a million parameters. So you have to be able to do that really quickly, as we'll see. And the interesting thing is that if you impose a certain structure, and we'll talk about what it is, you can do this. What is common to all modern architectures is that you have to be able to apply this chain rule in a very efficient way. And it's going to turn out that you can do it very, very quickly: one forward pass through the network and one backward pass, in an algorithm called backpropagation. All right? That's the whole point: if you design the architecture of your model right, you can calculate a million gradients, or however many parameters you have, all at once. And we'll talk about what's going on. But I want to emphasize that this algorithm is just the chain rule, organized in a computationally efficient way. You choose the architecture so that you can calculate the gradient very efficiently. So what's the basic thing? Well, deep learning is really just neural networks, right? They've been rebranded because neural networks were very out of vogue. I guess when you guys were all like six, no, probably ten, no one thought neural networks could ever work, and everyone just made fun of people who worked on them.
It was a very, very ragtag band of like three or four people, and some physicists who had moved on, who were working on these. But, you know, they made a comeback, which just goes to show you that if you think an idea is good, you should keep on working on it, no matter what everyone else says. Because you never know: in the end it may turn out you're right, and everyone else is just going to have to eat their words. I urge you to go read some of the things people said about neural networks in 2011. It's kind of amazing. And now they're supposed to take over the world; now we're having new cold wars because of these neural networks. So the basic element of a neural network is what people call a stylized neuron. A neuron is basically an element that takes inputs: here I have three inputs, x1, x2, x3. It weighs them with some parameters w1, w2, w3, so w dot x. It adds a constant shift, so this is just a linear function of the inputs. And then you stick it through a non-linearity, all right? In the old days, the non-linearity used to be this kind of Fermi function, the sigmoid, which was very famous and was what people used to use, or this tanh. But in recent times, people have figured out that the most commonly used thing, which is much more computationally efficient, is this thing called a ReLU, a rectified linear unit: zero below the threshold and linear above it, all right? And the parameters: each neuron comes with a set of weights, which are basically these arrows that tell you how you take the inputs from the neurons in the layer before, right? So this neuron is getting an input from these three things.
So for this neuron, w1, w2, w3 are associated with those arrows, right? So this neuron sums the inputs from the things before, adds a constant, puts it through a non-linearity, these days usually a ReLU, and that's the output. And you just do that over and over and over again. [Student asks whether the arrows share parameters.] No, no, every arrow comes with its own parameter. Every arrow comes with its own weight, and every neuron comes with its own offset. And you can choose the connectivity however you want, okay? This is the simplest version, a fully connected network, where everything is connected to everything. But actually the whole game is to design these arrows, to do fun, clever things with them, right? I could choose not to connect some neurons. In fact, what I want you to take away from this lecture, something that's really unclear from reading the literature, actually it's never said explicitly, is what kinds of architectures are allowed, right? Everyone who works in this field understands it, but somehow no one says it very explicitly. So I'll try to explain what you can do, all right? And the answer is going to be: things that are easy to take the chain rule of. One of the things you'll notice about this architecture is, what do you notice about the arrows? They always go in one direction, right? There are no loops. That's going to be central to everything. So I'll just tell you the answer, because it's always better to tell you the answer: any architecture that I can write like this, where there are no loops backwards, is something that you're allowed to make. And people have exploited that to no end, all right? So as long as there are no loops, we'll see, that's going to be good for us.
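Just to make that picture concrete, here's a minimal NumPy sketch of a single neuron. The particular inputs, weights, and offset are made-up numbers, not anything from the notebook:

```python
import numpy as np

def relu(z):
    # rectified linear unit: zero below the threshold, linear above it
    return np.maximum(0.0, z)

def neuron(x, w, b):
    # weigh the inputs, add a constant shift, then apply the non-linearity
    z = np.dot(w, x) + b          # linear part: w . x + b
    return relu(z)                # activation: sigma(z)

# three inputs x1, x2, x3 with weights w1, w2, w3, as in the figure
x = np.array([1.0, -2.0, 0.5])
w = np.array([0.3, 0.1, -0.4])
b = 0.2
print(neuron(x, w, b))            # relu(-0.1 + 0.2) = 0.1 (up to rounding)
```

The whole network is just this one operation composed over and over, layer after layer.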
So one thing I want to point out is that we're eventually going to start taking derivatives, right? And one of the funny things that happened is that people used to use these sigmoid-type non-linearities, and when they replaced them with this ReLU, everything started working much better, all right? [Student question.] Yes, that's going to be the whole point of the lecture. It's because otherwise you can't efficiently do the chain rule; you'd have to use the implicit function theorem, essentially. You can't explicitly calculate gradients. We're going to get there in a second; I just want to point out some more things first. So why has the ReLU become much more popular than the sigmoid, right? Well, I need to calculate gradients, and the gradients are how I get information about the cost function, right? Now, the problem is that if I take derivatives of these sigmoid-type functions out here, look at them: they're almost flat. So the gradient can't tell you much about what the input value was. The information gets lost, because everything out here has basically the same gradient, and everything over here has basically the same gradient. Whereas with the ReLU I don't have that problem. So the gradients basically disappear here: they go to zero on both sides. Here the gradient never disappears. And that turned out to be very important because of the chain rule. Because of the chain rule, I have to multiply derivatives: if I multiply a bunch of small numbers, 0.1, 0.1, 0.1, 0.1, I get a tiny number. And when I do the chain rule, eventually I have to multiply a bunch of derivatives backwards. Here's the cost function; if I want the gradient with respect to some parameter here, I take the derivative through the output of this, the output of that, the output of that. So the chain rule says I multiply a bunch of numbers.
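You can see this numerically. Here's a little sketch (the saturation value 5.0 and the depth of ten layers are arbitrary choices for illustration) of what happens when ten saturated sigmoid derivatives get multiplied together by the chain rule, versus ten ReLU derivatives:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # derivative of the sigmoid: s(1 - s), tiny out on the flat parts
    s = sigmoid(z)
    return s * (1.0 - s)

z = 5.0                                    # a saturated input, on the flat part
sig_product = np.prod([sigmoid_prime(z)] * 10)
print(sig_product)                         # ~1e-22: the gradient has vanished

# the ReLU's derivative is exactly 1 above threshold, so nothing shrinks
relu_product = np.prod([1.0] * 10)
print(relu_product)                        # 1.0
```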
And if I multiply a bunch of small numbers, everything disappears. So that's why people started using this ReLU: it solves that problem, and it made a big difference. That's one big revolution that happened in the last ten years: people realized that simple things like this solve what were called the vanishing gradient and exploding gradient problems. So the central thing in all this are what are called the backpropagation equations. And I'm just going to leave them up here, because otherwise I'll get them wrong, but we'll go through them. It sounds very complicated, but it's just the chain rule. I wish I had put the picture up here. So just so we're clear, this sigma is the non-linearity, one of these functions here. It doesn't really matter what it is; it's whatever non-linear function you apply to the inputs. It could be a tanh, it could be a ReLU, whatever. And what we're going to do is just the chain rule. So let me define some things. Remember, the inputs are x's, the weights are w, the shifts are b. And I also have the output of the neuron, which I'll call the activation, a. So I have the activation a, and I have these parameters w and b, and I just want to take derivatives with respect to them. I want to take the derivative of the output of the last layer with respect to w and b. That's all I'm trying to do. And you'll see, it's just the chain rule. If I want to know the derivative, I change a parameter a little, and in principle I can propagate how that change flows through. So there's going to be a chain rule. I just want to make sure everyone understands, so imagine the simple case of two neurons. I have neuron 1, it goes to neuron 2, and that goes to some output. Let me call this sigma 2, and this sigma 1.
And I have some input, which is x. Can you even see that x? You can. So the output is some function of sigma 2, which is a function of sigma 1, which is a function of w1 x plus b1. So if I compose all this stuff, you see that it's just composition: I take x, put it through this function sigma 1, which goes through this function sigma 2, which goes through this thing. Let me write it out explicitly. The linear part of the first neuron is z1 = w1 x + b1, and the output of the first neuron is a1 = sigma(z1). Then z2 = w2 a1 + b2, and the output is a2 = sigma(z2). These z's are what I've written down on the next page: I've taken the linear part and called it z. It looks ridiculous for me to differentiate these things, but you'll see: this is just those equations. Now I take the derivative. I want to know: how does the output f change if I change w1? Everyone see? I just want to know this. So what I do is I say, look: df/dw1 = (df/da2)(da2/dz2)(dz2/da1)(da1/dz1)(dz1/dw1). I just go through and multiply, and that's the chain rule. So the claim is: these equations are the same as those equations. That's all that's going on here. So let's think about how it goes through. The first thing worth noticing is that it's natural to work in these variables z and a. Just introduce extra variables z and a: a is the activation of each neuron, and z is just the weighted linear sum of the activations of the layer before.
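To convince yourself the chain rule really is all that's happening, here's a sketch of this two-neuron example in NumPy, with made-up numbers for x, w1, b1, w2, b2, checked against a brute-force finite-difference derivative:

```python
import numpy as np

sigma = np.tanh                           # any differentiable non-linearity
sigma_prime = lambda z: 1.0 - np.tanh(z) ** 2

def f(x, w1, b1, w2, b2):
    z1 = w1 * x + b1                      # linear part of neuron 1
    a1 = sigma(z1)                        # activation of neuron 1
    z2 = w2 * a1 + b2                     # linear part of neuron 2
    return sigma(z2)                      # the output, a2

# made-up numbers for the input and parameters
x, w1, b1, w2, b2 = 0.7, 1.5, -0.3, 0.8, 0.1

# the chain rule: df/dw1 = (df/da2)(da2/dz2)(dz2/da1)(da1/dz1)(dz1/dw1)
z1 = w1 * x + b1
z2 = w2 * sigma(z1) + b2
grad_chain = sigma_prime(z2) * w2 * sigma_prime(z1) * x

# compare against wiggling w1 a little and seeing how the output changes
eps = 1e-6
grad_fd = (f(x, w1 + eps, b1, w2, b2) - f(x, w1 - eps, b1, w2, b2)) / (2 * eps)
print(grad_chain, grad_fd)                # the two agree to high precision
```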
So you just go through, and you see I can define these important quantities, these delta^l_j's, which are the derivatives of the output, which I called f but should have called E, because that's the notation I've been using, with respect to z^l_j. And this little l labels the layer: this is l = 1, this is l = 2, this is l = 3, so it's useful to label things by neuron j and layer l. So I have this z^L_j, which is the j-th neuron in the last layer, capital L. At the last layer I can just calculate delta directly, because there's no chain rule involved. Then, for any other layer l, delta^l_j is just the derivative of the energy with respect to the activation times sigma prime of z^l_j. This is just the chain rule applied to the fact that the activation is sigma(z^l_j). So that's the first equation. The second equation, how does it change? Oh, Adobe always wants me to update something. I just don't understand that company. It's constantly updating, and it never works; that's the worst part. But now, look at these delta^l_j's. Because z is a linear function of b^l_j, the derivative with respect to z is the same as the derivative with respect to b. So these are some of the gradients we want. If I can calculate these delta^l_j's, then I'm OK; I have one set of the gradients I want. I want dE/db, and I want dE/dw. Now, what's nice is I can do the same thing and express delta at a layer l in terms of delta at the layer above. So this is a recursive relationship: if I know the deltas at the last layer, if I know the deltas at this layer, I can calculate the deltas at the layer below.
So the deltas I can propagate backwards. That's the whole point; this is why it's called backprop. If I know the deltas here, I can calculate the deltas here, then the deltas here, then the deltas here, and so on. This is just the chain rule; I haven't done anything. But it gives us this really nice property: if I know the parameters and I know the z's, then I can calculate the deltas backwards. You see that? And the final thing is that I can also take the derivative with respect to the weights. So w^l_jk is the parameter at the l-th layer connecting the j-th neuron in the l-th layer to the k-th neuron in the (l-1)-th layer; it's the w that weights this connection. And what you see, again just doing the chain rule, is that I get this gradient in terms of the delta^l_j's times the activations. I haven't done anything fancy. But now look at the amazing thing. What do I do? If I know the activations of everything, then I know the z's; the z's and the activations come together. So what I do is I give you an input. It's really computationally cheap to calculate all the z's and a's, because that's just composing a function over and over again: I give you an input, I calculate the z's here, then the a's here, which gives me the z's here, then the a's here, the z's, the a's, all the way to the end. But then, once I know those z's and a's, I can use these equations, because I know sigma prime analytically. For the ReLU, it's either 1 or 0: above the threshold it's 1, otherwise it's 0. And the last-layer delta I can just calculate directly, because it's just a single gradient.
But now I can use this recursive relation to calculate all the gradients at once. So what's the trick? This is backprop. I calculate all the gradients by going once forward, where I calculate all the activations, and then calculating all the deltas backwards. So all I have to do is run a forward pass and a backward pass, and I get all the gradients. Now, you can ask: what did I exploit when I did this? This was just a very particular architecture, but now let's think about what we can get away with. Let's see if I get the time right this time. No. Almost. All right, I took until 10:45, right? 10:45? Yeah, OK. Having said that, now someone tell me: what's the most arbitrary architecture I can draw and still have a chain rule, still exploit this backprop? What do I need? When will this trick go wrong? When won't it work? Let's do it with two neurons. Take 30 seconds and draw me an architecture with two neurons where this doesn't work. You can already see when it won't work with two neurons. So I have an input, I have a neuron, I have a neuron, I have an output. When won't this work? It's just the chain rule; you only have to take a derivative. OK, how about if I do this, and then I do this? What will happen? Work it out. Take 30 seconds, think for yourself: what will go wrong? Yeah, so now this is a function of that, which is a function of this. Just work out the chain rule. This is an exercise for you: show that the backprop algorithm doesn't work here. You can do it; it's the implicit function theorem. It's really important for this that nothing here depends on anything in front of it. This is enough for you to understand all the architectures this will work for. If you understand this example of why it fails, then you'll be OK.
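Here's what the two-pass algorithm looks like as code: a minimal NumPy sketch of backprop for a small fully connected net. The layer sizes, the quadratic cost, and the tanh non-linearity are all just choices for illustration, and one gradient is checked against a finite difference of the cost:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = np.tanh
sigma_prime = lambda z: 1.0 - np.tanh(z) ** 2

# a small fully connected net: 3 inputs -> 4 hidden -> 2 outputs
sizes = [3, 4, 2]
Ws = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [rng.normal(size=m) for m in sizes[1:]]

def forward(x):
    # forward pass: compose the layers, storing every z and activation a
    zs, activations = [], [x]
    a = x
    for W, b in zip(Ws, bs):
        z = W @ a + b
        zs.append(z)
        a = sigma(z)
        activations.append(a)
    return zs, activations

def cost(x, y):
    # quadratic cost E = 0.5 ||a_L - y||^2 (one convenient choice)
    return 0.5 * np.sum((forward(x)[1][-1] - y) ** 2)

def backprop(x, y):
    # backward pass: propagate the deltas from the last layer down
    zs, activations = forward(x)
    delta = (activations[-1] - y) * sigma_prime(zs[-1])     # delta at layer L
    grads_W, grads_b = [], []
    for l in range(len(Ws) - 1, -1, -1):
        grads_b.insert(0, delta)                            # dE/db^l = delta^l
        grads_W.insert(0, np.outer(delta, activations[l]))  # dE/dW^l = delta^l (a^(l-1))^T
        if l > 0:
            delta = (Ws[l].T @ delta) * sigma_prime(zs[l - 1])
    return grads_W, grads_b

x = rng.normal(size=3)
y = np.array([0.5, -0.5])
grads_W, grads_b = backprop(x, y)

# sanity-check one weight's gradient against a finite difference of the cost
eps = 1e-6
Ws[0][0, 0] += eps; c_plus = cost(x, y)
Ws[0][0, 0] -= 2 * eps; c_minus = cost(x, y)
Ws[0][0, 0] += eps                                          # restore the weight
fd = (c_plus - c_minus) / (2 * eps)
print(grads_W[0][0, 0], fd)                                 # these match
```

Notice the structure: one forward pass stores all the z's and a's, one backward pass reuses them, so every gradient comes out of a single sweep in each direction.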
That's all there is. So what exactly happens here? Input x, you calculate z's... Yeah, so that's the whole idea. How would you even calculate it? You'd have to go back and iterate until it converges. That's the whole point: in principle, I can make a function like this. This is x, this is a function of f of x and g, and this is g of f of x. So this implicitly defines a function of x. In principle I can calculate it: it has to converge, it's annoying, I probably have to iterate, but this is the implicit function theorem. Let me use the actual notation instead of f's and g's. Here's sigma 1 of x, but now it takes two inputs. It takes this input x1, so this function is some function of x1 and y1; let me call this y1. Now y1 itself is sigma 2 of sigma 1 of x, so this is sigma 1 of sigma 2 of sigma 1 of x, and so on. So in principle this is fine; it is whatever it is, it defines a network of functions. But you can't calculate derivatives of it this way, because you have these implicit functions: everything becomes a function of everything else. So the chain rule doesn't go through. So the most general architecture you can build is anything you want, however complicated you want it to be, anything like this, and it'll all work. An acyclic graph, is that what you wanted to say? Look at this picture: what doesn't it have? It doesn't have any backward loops. It doesn't have feedbacks. It has to be feed-forward. Any feed-forward network works. And that's been the trick: I have to turn everything into a feed-forward network. Even if I have a time series, even if I have real-time data; that's what a recurrent neural network is. If I have generative models, I can throw in stuff like drawing a random number and putting it through the neural network.
That's how you make generative models generally. But everything is feed-forward. And you can put in any module you want: as long as it's differentiable, all I need is that this is some differentiable function. So any such function can be put in here, and I just run through this. I can put a Monte Carlo sampling layer in here; that's often done. This is the modern deep learning revolution. And basically, since 2017 or '18, people have made tons of packages. This goes under the name of autodiff. The whole point of these libraries is that they'll take derivatives of everything for you; it's fully differentiable. And fully differentiable means that you can put them into a backprop algorithm. You don't have to write it yourself, though you should all write backprop one time yourself to understand it. So most of the cleverness has been: how do I exploit this basic structure? Because in any graph like this, I can do the chain rule in one pass forward and one pass backward. So for example, another big thing in imaging was that people realized this and made something called ResNet. In the old days, people used to just stack layers like this. And then around 2015 or '16 they figured out that if I made very deep networks, the information about the input was getting lost. So they decided: let's just put direct connections from the input to arbitrary deeper places in the network. Those are called residual connections, and with them that information doesn't get lost. So every time people do something clever, it's some clever way of turning a problem into a problem that looks like this. Whether it's a time series, whether it's a generative model, whatever it is, that's what modern deep learning is. That's what I want you to take away.
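As a tiny sketch of what a residual connection is (the vector size and the zeroed-out weights are just for illustration): the output of a layer gets the input added straight onto it, so even if the stacked layers do nothing useful, the input information passes through untouched.

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)

def plain_layer(x, W, b):
    # an ordinary fully connected layer
    return relu(W @ x + b)

def residual_layer(x, W, b):
    # ResNet's trick: a direct connection from input to output, so
    # information about the input can't get lost in a deep stack
    return x + plain_layer(x, W, b)

rng = np.random.default_rng(1)
x = rng.normal(size=8)
W, b = np.zeros((8, 8)), np.zeros(8)
# with the layer's weights switched off, the residual path still carries x
print(np.allclose(residual_layer(x, W, b), x))  # True
```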
That's not written down anywhere, or at least in very few places. That is the modern deep learning revolution: how do I write this as a differentiable thing? And the second thing is that it turns out that every function you care about, you can approximate pretty well with a complicated enough neural network. So what I do is I say: I don't know what this function is, so I'm going to replace it by a neural network, learn the function directly from the data, and make the whole thing differentiable. That's basically the idea of deep learning. Yeah? [Student asks about feedback loops.] You have to unroll it. That's called a recurrent neural network. So imagine I have a feedback like this. What I do is unroll this thing: this is time t, t plus 1, and so on; time is the classic case. You unroll the loop to make it look like this: this is the first time I went through the loop, the second time, the third time. You can't keep infinite feedback, but you can unroll however many times you want to go through the loop. So you have to deal with that; a lot of cleverness is dealing with this. I want to represent more complicated things, but I have to replace them with something I can fully differentiate. All right. So this is basically what modern deep learning is about, and it's getting cleverer and cleverer in how people design these things. So these are the models. At some point people figured out you could throw a Monte Carlo sampler in there; you could put in an optimizer that's fully differentiable and throw it in there. That's been basically the cleverness of the whole thing: very clever people, a lot of hyperparameter tuning, a lot of architecture tuning. So let's write the simplest version of a neural network. Let me put this down.
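To make the unrolling point concrete before we move on, here's a sketch of a one-unit recurrent network unrolled over a few time steps (the weights and inputs are made up). The same weights are reused at every step, and the feedback loop has become an ordinary feed-forward composition that backprop can handle:

```python
import numpy as np

sigma = np.tanh

def unrolled_rnn(inputs, w_in, w_rec, b, h0=0.0):
    # "unrolling the loop": the feedback h -> h becomes a feed-forward
    # chain, one copy of the same weights per time step
    h = h0
    for x_t in inputs:
        h = sigma(w_in * x_t + w_rec * h + b)
    return h

xs = [0.5, -1.0, 0.25]                 # a made-up three-step time series
print(unrolled_rnn(xs, w_in=1.0, w_rec=0.5, b=0.0))
```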
I should add that everything I'm doing here, every figure I've shown you, and all these notebooks are actually from this review, from my review. I spent my last sabbatical writing it to make it easy for people to learn machine learning, and also because I didn't like all the hype; I wanted people to get it in a hype-free format. That's what it is. This is maybe a fifteenth or a twentieth of what's in that review. I actually like the chapters that are not on fully differentiable models, the energy-based models, much, much better. But it's just like, I feel bad for all these musicians who probably have all these songs they love, and then every time they go to a concert, they have to play the same three hit songs over and over again. So deep learning is like the hits, right? You have to explain it, whether you like it or not. Anyway, it doesn't matter. In Korea, maybe with K-pop, there's no real deep music anyway, so I shouldn't say that. See, that's the kind of thing I shouldn't say. All right, so what we're going to do is go back to the same place on Google Colab and open up notebook 11. So here is Google Colab, notebook 11. And again, the basic idea is that we're going to go through these MNIST digits. This is not even considered a data set anymore; about ten years ago it was considered a serious data set, but that day is long gone. But it's good for learning how to write these things and see how all this stuff works yourself. All right, I'll use this thing for now. So this is Keras. And what's nice about Keras is that once you get more interested, there are these code examples, and you can walk through very, very complicated stuff, transformers, distillation for vision, and it's all 300 lines of code or less, right? They even have BERT, the famous text model.
But you'll be surprised how quickly you can get to state-of-the-art stuff if you spend a few hours a week, or a few hours a day, doing this. So the code we're going to go to is the notebook I said before, notebook 11. And the goal is to basically do MNIST. So again, MNIST is 70,000 handwritten digits, 28 by 28 pixels, a data set made in the '90s. It's irrelevant now, but it's nice to learn on; it's like the hello world of machine learning. I think that's the best way to describe it. Maybe once upon a time it was hard to make the screen say hello world, but now it is what it is. So the images are 28 by 28 pixels, with 256 grayscale levels. The first thing you have to do, because I wrote these notebooks a long time ago, like six years ago, which is an eternity in machine learning, is comment out this random seed thing right here. Otherwise you're going to get errors, because the command has changed. All right, so everyone comment out that thing. Now run this cell. What it's going to do is import Keras and scikit-learn. It's going to import TensorFlow, which is the backend that's also kind of been discontinued; Google's trying to push JAX on everyone. And NumPy. So I just run this thing, and it should work. Okay, apparently there are too many people calling things from here, maybe, I don't know. Oh, it's running, and you shouldn't get an error. Everyone run it and make sure you don't get an error; you should have this little check mark after five seconds or so. Did everyone get it to run? Then we're going to work in groups and think about what's going on. Raise your hand if it didn't work, or raise your hand if it did work. Did everyone get it to work? Well, everyone didn't raise their hands. I don't understand how people can be shy about raising their hand. It's mysterious to me. Raise your hand if it worked. Keep your hand in the air.
Wait a minute, if you don't... okay, no, I'm going to start kind of wrapping up here for you. You have an error? Did you turn off this tf.random.set_seed? Did you comment it out? Then you shouldn't get an error. I'm surprised, because it's working for everyone else. Did you comment this out? Someone help them. Okay, good, all right. So here's the basic idea of how you write a neural network. You load and process the data; define the model and its architecture; choose the optimizer and cost function. See, you guys all know what this stuff means now. Then we train the model and evaluate it on unseen test data. Yes, we know that, right? And then the last thing, which I haven't talked much about, is that we have to modify the hyperparameters and optimize. So this notebook is just going to walk you through that. The first cell loads and processes the data. MNIST comes naturally divided into training and test data. The images are 28 by 28 pixels, and we're going to reshape them into flat vectors. So we ignore the 2D matrix structure and just make each image one big 784-pixel input; the input is just a vector of size 784. And then we're going to rescale the data onto the interval zero to one. It's always good to rescale all your data, right? Remember what we were talking about: we want to treat all directions equally; that's what all these optimizers assume. So you don't want data where the natural scale of one feature is 100,000 and another is 0.01; the model gets confused. Rescale all your data so it lives in the same interval, generally between zero and one if you can. I mean, not exactly zero and one, but of order one. And then we're going to convert the labels; this is just making one-hot vectors, it's not really important. So now just run the cell and make sure it works.
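What that first cell does amounts to the following, sketched here in plain NumPy on fake data so it runs without downloading MNIST (the five random images and their labels are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-in for the MNIST arrays: 5 fake 28x28 images with 0..255 pixel values
X = rng.integers(0, 256, size=(5, 28, 28)).astype("float32")
y = np.array([4, 0, 9, 1, 7])

# flatten the 28x28 grid into one 784-dimensional input vector per image
X = X.reshape(len(X), 28 * 28)

# rescale every pixel onto [0, 1] so all directions are treated equally
X = X / 255.0

# one-hot encode the labels: label 4 becomes a 10-vector with a 1 in slot 4
Y = np.eye(10)[y]

print(X.shape, Y[0])  # (5, 784) and [0 0 0 0 1 0 0 0 0 0]
```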
And this is a typical example of a data point, with label four. You can change it to something else. Everyone tell me if it works — raise your hand once it starts working. All right, so now here's where the fun stuff happens. Now we have to define a neural network; we have to choose our model. And look at this very simple thing we're going to put in. I didn't tell you what dropout is, but we can just comment that out; it's not important for what you're doing. Dropout is a kind of regularization, and in three lectures I can't tell you anything about it. Just comment it out. So here is my deep neural network. This is all I need. What is it? Well, Sequential just means I'm doing a really simple network where each layer feeds into the next. It's not the complicated stuff — remember I drew things on the board with skip connections and all that. Here I skip all of that. Sequential is just for models where every layer is connected to the layer before. It's the simplest kind of deep neural network. So now what I do is just add stuff. You see, I add a layer of 400 neurons whose input is just whatever the input shape is — 784 here. And its activation function — what do people think it is? It's the nonlinearity. It's just a ReLU. Then I add another layer with how many neurons? 100. Why do you think I don't have to give the input dimension this time? Because I've already told it it's a sequential model, so it knows how many outputs were in the last layer. And then forget the dropout. And then I add the softmax layer, which is the classification layer.
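To see that this architecture is nothing mysterious, here is the forward pass it computes, hand-rolled in NumPy rather than Keras. The weights here are random stand-ins — in the real notebook, training is what sets them:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Random stand-in weights for the 784 -> 400 -> 100 -> 10 architecture.
W1, b1 = rng.normal(0, 0.05, (784, 400)), np.zeros(400)
W2, b2 = rng.normal(0, 0.05, (400, 100)), np.zeros(100)
W3, b3 = rng.normal(0, 0.05, (100, 10)), np.zeros(10)

def forward(x):
    h1 = relu(x @ W1 + b1)        # first hidden layer: 400 ReLU neurons
    h2 = relu(h1 @ W2 + b2)       # second hidden layer: 100 ReLU neurons
    return softmax(h2 @ W3 + b3)  # output layer: 10 class probabilities

x = rng.random((5, 784))  # a batch of 5 fake flattened "images"
p = forward(x)
print(p.shape)  # (5, 10) — one probability vector per input
```

Sequential layers really are just matrix multiplies alternating with pointwise nonlinearities; Keras simply builds this chain for you and keeps track of the shapes.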
The last layer — remember, I want to make categories, and the way we make categories is we need probabilities: 10 neurons, each telling you the probability of being in one of the categories. And remember, that was done with softmax, that Boltzmann-like thing we talked about yesterday. If it doesn't all make sense to you, it doesn't matter; I just want you to understand the basic logic, and you're going to have to play with it. If this is all you ever do with this, it'll never make sense to you anyway. But remember what softmax was? That was essentially this thing here. If we have 10 classes, we make a layer with 10 output neurons, so I need the probability where m runs from one to 10, because there are 10 classes — the digits zero to nine. So I make a softmax layer with 10 neurons as the output, because that's how many I need, and that's the probability of being in each class. And then I just run this. Everyone run this, and if it runs successfully we'll keep going. I think mine ran — it has a little check mark. All right, now we just choose an optimizer and a cost function. That's the next step. And look: I compile my model, so I create it, and here I have to give my loss and my optimizer, and the metrics are just what I plot. The optimizer I'm using, you see up here, is Adam. Let me run this thing — everyone run it, all right? And then we train the model. So here we go. Now I have to say what my batch size is, how many epochs I have, which model I'm using, and the history is just a way of keeping track of what's going on. So I run this thing and you'll see the training error. I'll explain what's going on here once we run it, but everyone start running this thing.
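The loss that Adam is minimizing here is the categorical cross entropy between the softmax probabilities and the one-hot labels. A minimal sketch of that computation, with made-up prediction vectors for illustration:

```python
import numpy as np

def cross_entropy(P, Y):
    """Mean categorical cross-entropy: -sum_m y_m log p_m, averaged over the batch."""
    eps = 1e-12  # avoid taking log(0)
    return -np.mean(np.sum(Y * np.log(P + eps), axis=1))

# One-hot target for the digit "4", and two candidate predictions.
Y = np.eye(10)[[4]]
P_good = np.full((1, 10), 0.02); P_good[0, 4] = 0.82  # confident and correct
P_bad  = np.full((1, 10), 0.1)                        # a uniform guess

print(cross_entropy(P_good, Y))  # small: -log(0.82) ~ 0.20
print(cross_entropy(P_bad, Y))   # larger: -log(0.1) ~ 2.30
```

Because the labels are one-hot, only the log-probability assigned to the true class matters, so the loss pushes the network to put probability mass on the right digit.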
And change your epochs — 10 is going to take too long, so change your epochs to something like four. So just start running it. So I start running it and it goes: here's the first epoch. This is the number of mini-batches there are, 938, and it's going through them. This is my loss function, this is my accuracy — this is how well I'm doing on the training data; right now I'm classifying 94% of these correctly. And then validation loss: you take some of your training data set and you try to predict on it without training on it — it's for helping you tune hyperparameters — and you ask how well you're doing. And you keep going, and I'm done. So I got an accuracy of 98.67% on my training data, and on validation, which is like a small little test set, I got 98% accuracy. All right, and then I'm going to evaluate it on test data. So now I evaluate it on test data. Here we go — oh, I did something wrong. Oh shoot, I have to change this. Oh yeah, I remember now: change this acc to accuracy. Again, this is because I wrote these five years ago and the API wasn't stable back then. Okay, and now it should work. Ah — val_acc too. Everything that says acc, change it to accuracy. All right, so here you go. This is what it looks like: in the training curve, the accuracy kept getting better and better as I went through the epochs, while the test accuracy basically leveled off, even though the training accuracy keeps increasing. That's because you start overfitting if you train too long. So early stopping — meaning you stop training early so you don't overfit — is another form of regularization. And then what you can do is modify the hyperparameters. We're not going to go through this, but basically you can change the optimizer here — here we're just looping over optimizers, I guess; I didn't put anything else in.
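The early-stopping idea can be sketched in a few lines: watch the validation loss after each epoch and stop once it hasn't improved for a few epochs in a row. Keras provides this as the `EarlyStopping` callback; the loss values below are made up for illustration:

```python
# A sketch of the early-stopping rule: stop when the validation loss hasn't
# improved for `patience` consecutive epochs.
def early_stop_epoch(val_losses, patience=2):
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch   # new best: reset the clock
        elif epoch - best_epoch >= patience:
            return epoch                     # no improvement for `patience` epochs
    return len(val_losses) - 1               # never triggered; trained to the end

# Typical shape: validation loss drops, then rises as overfitting sets in.
losses = [0.40, 0.25, 0.18, 0.17, 0.19, 0.21, 0.24]
print(early_stop_epoch(losses))  # 5: stops shortly after the minimum at epoch 3
```

The training accuracy would keep climbing past that point, but the rising validation loss is the signal that you're now fitting noise rather than structure.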
So you can ask how you do with different optimizers, and you should recognize some of these. These are all just different — it doesn't matter. Stochastic gradient descent is what we did; RMSprop is kind of like Adam — these are all just variations that use second-moment information. Generally you'd still have to search through all the choices I made and see which one works best, but we're not going to do that here. All right, so what I want you to do for the last — how much time do we have, 20 minutes? Right, let's just do some exercises. So what I want you to do is work with a friend. Take 15 minutes, go through these, see if you can do it, see if you can play with the code. Google is your friend; the documentation is your friend. Again, the purpose is to show you that you can actually just pick up some code and do it, and it's not that hard. I think there's this big activation barrier the first time you write it, because it seems so overwhelming, and then once you're done with that, you're like, oh, it's not that bad. That's what I hope you get out of this. All right, 15 minutes, and then we'll come back and discuss. It's just to play around — play with a friend. So what's our current architecture? Let's draw our current architecture. Let's go here — oh, I don't mean that, I mean after the 100. I was debating playing with a different notebook. Just change the architecture: make it really wide, make it really narrow, just add one layer of two neurons — how well do you do? Think about what you're doing. I just want you to play a little bit for 15 minutes; that's the goal. Play around. Okay, hopefully you guys played around a little bit.
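To get a feel for why these second-moment optimizers help, here is a toy comparison, not from the notebook, of plain gradient descent against an RMSprop-style update on a deliberately ill-conditioned loss, steep in one direction and shallow in the other:

```python
import numpy as np

# Toy loss f(w) = 0.5 * (100*w0^2 + w1^2): steep in w0, shallow in w1.
def grad(w):
    return np.array([100.0 * w[0], w[1]])

def sgd(w, lr=0.009, steps=200):
    for _ in range(steps):
        w = w - lr * grad(w)          # one global step size for all directions
    return w

def rmsprop(w, lr=0.05, beta=0.9, eps=1e-8, steps=200):
    v = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        v = beta * v + (1 - beta) * g**2      # running second moment of the gradient
        w = w - lr * g / (np.sqrt(v) + eps)   # per-direction effective step size
    return w

w0 = np.array([1.0, 1.0])
print(sgd(w0.copy()), rmsprop(w0.copy()))
```

Plain SGD has to pick a step size small enough for the steep direction, so it barely moves along the shallow one; dividing by the running root-mean-square gradient rescales each direction, which is the common trick behind RMSprop, Adam, and friends.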
I should add, at the bottom there's a cell with a different kind of architecture, which is called a convolutional neural network. You can go look up what it is. The basic idea is that now I do image filtering; these tend to work better for images. But the point is, you have to go Google things, you have to read. Even if you don't want to do that, you'll see in this thing you opened up, if you go back here, there are all these notebooks I wrote, and they have different kinds of things. So here's a different one with the SUSY dataset, and you can just go through and play. What I want to show you is that it's not that hard to write this code, at least initially, to start playing. And you really do have to play. It's a numerical, experimental field — an empirical field. The best thing you can do is re-implement papers you read. It's not that hard. Open it up; the documentation's amazing. You can usually find the answers to lots of things on Google, but then you'll also get confused. It's like all coding: there's some shape that doesn't agree, some funny little thing you got wrong. So whenever you take some code, try to change it. You'll find that you break it quickly, and then you don't understand why, and you go back. It's the usual empirical, numerical business. So in the last five minutes I have in this lecture, I just want to tell you how you should really think about building something. This is really the deep learning workflow, I would say, for most problems. The first thing you want to know is: how well can anyone do on this task? You have some task, classifying something, and you want to establish whether your neural network is close to being that good or not.
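The image-filtering idea behind a convolutional layer, mentioned above, can be sketched directly: slide a small kernel over the image and take dot products. This is a minimal "valid" (no padding) convolution in NumPy; a real conv layer learns many such kernels rather than fixing one by hand:

```python
import numpy as np

def conv2d(image, kernel):
    """Slide `kernel` over `image` and record the dot product at each position."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A hand-picked vertical-edge detector applied to an image that is dark on
# the left and bright on the right: the output lights up along the boundary.
img = np.zeros((6, 6)); img[:, 3:] = 1.0
edge = np.array([[-1.0, 1.0]])
print(conv2d(img, edge))  # nonzero only in the column straddling the edge
```

Because the same small kernel is reused at every position, a convolutional layer has far fewer parameters than a dense layer on the raw pixels, which is a big part of why these architectures work so well on images.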
And this is called finding the optimal error rate, or establishing a Bayes error rate. You want to know the best you could possibly do on a task, because you want some metric for how close your model is to that. And the best thing to do is usually ask a bunch of experts, or something like an expert. A lot of deep learning is about automating things that humans can do very easily, so you ask: how good is this? What's the best I could do? And then there are two things that can happen: you're either overfitting or you're underfitting. And this is going to be the theme of the next lecture too. You don't have to take pictures — it's in the review. Go read the review; all these figures are straight from my review. Our review — it's not just my review; a lot of people worked very hard on that review. I should stop saying my review; I don't mean to usurp credit in any way. Though I used to say I wasted my whole sabbatical on it, because I thought it would take me a month and it took 11 months to write, and even then I couldn't finish it, and everyone helped me on that review so much. But enough people have told me they found it useful that I feel less bad about spending my whole sabbatical on it. So: if the training error is too high, that means your model's not complicated enough. That's called underfitting — your bias is too big. So you either need to train longer, or you need a new model architecture, or — often what's missing from all this — data. You might just not have enough data; you might need to get more. If the training error is not high, then the question is the validation error. Validation is like a test set, but it's not the test set, because you're never allowed to tune anything on the test set.
So what you do is take your training data and divide it into a training set and a validation set. The validation set is like a test set that I use to tune hyperparameters, architectures, things like that. If I'm going to change anything, I can't use the test set. The test set is for after I'm all done: I declare victory, and then I check how I did on the test set. I can't look at the test set before that. So if I have to change hyperparameters, change stuff, I make a separate set — what I call a validation set — which is like a test set that I'm allowed to use for tuning. So now, if the validation error is high but the training error is low, that means you're overfitting: you're fitting weird things in the training set that don't generalize. That means you have to regularize more, or you need more data, or a new model architecture, and you keep iterating. So: high training error means your model's not expressive enough; high validation error with low training error means you're overfitting, so you have to regularize — it's a mismatch between the training and test error. And when that's done, you're done. That's basically the workflow. And often you have to tune so many hyperparameters — each of these steps is actually training lots and lots of models with different hyperparameters, changing stuff around — that it gets computationally expensive. That's why you hear numbers like it took $30 million worth of electricity to train whatever the new OpenAI thing is — I don't know what they are now, 40 million. It's because you're not just training it once. If you were just training it once, you'd be okay. It's because you have to try a million different architectures, a million different hyperparameters. You have to do all this stuff.
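The split itself is one shuffle and a slice. A minimal sketch with stand-in data (Keras does the same thing internally when you pass `validation_split` to `fit`):

```python
import numpy as np

def train_val_split(X, y, val_fraction=0.2, seed=0):
    """Shuffle the indices, then carve off a validation set from the training data."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))          # shuffle so the split is random
    n_val = int(len(X) * val_fraction)
    val, train = idx[:n_val], idx[n_val:]
    return X[train], y[train], X[val], y[val]

X = np.arange(100).reshape(50, 2)  # 50 fake two-feature examples
y = np.arange(50)
X_tr, y_tr, X_val, y_val = train_val_split(X, y)
print(len(X_tr), len(X_val))  # 40 10
```

The one rule that matters: the validation set can be reused freely for tuning, but the test set gets touched exactly once, at the very end.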
And often you can have the right idea but the wrong hyperparameters, and then you think it doesn't work when it actually is the right idea. So you really just have to play. And the last thing I want to emphasize is that there are lots and lots of examples. If I just go to Keras, these are examples of things you might want to do. Here's image segmentation. And what's nice about all of these is they're less than 300 lines of code, and you can see how much you can do with 300 lines of code. So for example, here is a neural network that learns how to segment images: all it takes as input is a picture, and the output is the outline of the object you care about. And then you have to come up with a loss function: the network is designed to take an image and produce an output, and you have to decide how you're going to measure the loss between what the network outputs and the target. I don't know offhand what they use in this particular one. There's a very prominent architecture that's become very popular for this kind of thing, called U-Net. You can just go read about it. If you're just applying pre-existing methods, you can take them off the shelf; you don't have to build them yourself. So in biology in the last three or four years, it's hilarious — like the rest of biology, three years ago you could get a Nature paper or a Nature Communications paper with a U-Net, and now everyone has a U-Net. But my whole point is you shouldn't be scared. It's 300 lines of code; this is all it is. And I can tell you what the output is, because I can read the code: it looks like the way they're producing the output is a softmax.
So they're saying this is the probability of each output class, and they're using cross entropy to train it. So I can just look through and figure this out. There's much more complicated stuff too — you can go through and look at all these examples, and you can do it all with 300 lines of code. The other thing I should point out, which I didn't have a chance to talk about, is that computation matters. Everything is faster on GPUs. But actually coding for GPUs in these libraries is trivial; it's usually just a flag that says, essentially, use the GPU. Back in the day it was really hard to do this, and now it's essentially trivial. So I think this is going to be an important tool in everyone's toolbox going forward. I'm not one of these people who thinks deep learning is going to put science out of business — I don't think it has that capability any more than linear algebra has the capability of putting science out of business. But we know linear algebra is useful, and we know this is going to be really useful going forward. So the sooner you familiarize yourself with these things, the better it'll be for you. All right, I think that's all I wanted to say about this. In the next lecture I'm going to do some more theory. There's only so much basic stuff I can do, so I'm going to give you a research talk, mostly because I'm not going to be here next week — I have to go back and teach in Boston. I'll tell you something about bias-variance and double descent in neural networks. That's kind of interesting; at least we're really proud of it — we're really proud of these papers we're going to talk about. Okay, so see you in 15 minutes, I guess. Any questions? No? The notebook answered everything. All right, so we'll see you back here at—