 So in this video, I want us to explore the sequential model a little bit more. Just delve slightly deeper, slightly deeper level. It's all about becoming familiar with what is possible. Now if this is the first video that you see in this playlist, I would really urge you to go back to the start. Watch this playlist from the start. I'm a surgeon. I'm involved in research as far as deep neural networks are concerned and this series really is about deep neural networks for domain experts. So you're not necessarily a computer scientist or statistician. You are an expert in a field outside of that, but you want to get involved with deep learning. This is a course, a series, a playlist on YouTube to get you to become comfortable and familiar with deep neural networks and we're using TensorFlow as our back end, Keras on top of that, and we're using the R programming language for statistics, a beautiful environment, an easy environment to start learning about deep neural networks. So we're at a stage where we've played before. We've used TF Runs as far as really as the tensorboard playing with that. Let's delve a little bit deeper and I've got we can see up here. It's very small, but you'll see the tabs model one, model two and model three already created. But let's first get to grips with what's going on. I'm going to set as always my working directory. I'm going to hold down control and hit enter, command and enter. So we're going to accept that. I'm going to import Keras and I'm also going to import plotly, but I'm just suppressing the messages so that I don't get all those yellow messages at the bottom here. So we're going to have Keras and plotly here. Now let's import a data set. I'm going to use one of the inbuilt data sets inside of Keras. And it's the MNIST data set. Now I just want to move here. I'm going to go on to my other monitor. Let's bring this up. There's the MNIST. In case you haven't heard about it, it really is famous. It's the easiest data set by that. I mean, easy for a computer to learn, but to explain things. So there are many of these handwritten digits. They tiny, tiny little files. It's 28 by 28 pixel pixels. It's monochrome. So every pixel in this 28 by 28. I mean, remember how big the photos are that your phone can take. So these are really tiny, 28 by 28. And every pixel of that 28 is just one after the other 28 in one column after the other. And then next row, next row, next row. It's just between 0 and 255, 255 being a bright white pixel, 0 being a pitch black pixel, and there's all the grays in between. So it just gives a value at each of these 28 times 28 pixels. And each of these are looked at by a human being, and a human being has marked all of these. That's a 6, that's a 6, that's a 6. And then you want the computer to learn if I give it one of these what it is. So that's what we are after. Now, the best way to do images is not through a normal multi-layer perceptron, a densely connected neural network. A better way to go about it is a convolutional neural network, and we are very close to start looking at convolutional neural networks. We are not there yet, so I'm going to do something with this data set that will still allow us just to use our normal, normal densely connected neural network, a multi-layer perceptron. So let's have a look at this. And I'm going to use my normal function that we have, or the symbol that we have here, this percent with a backward arrow percent, and it's the data set underscore mness, that's our function. And we're going to import that, and it's very nicely written in KRS that I can immediately split it up, done all for me. So I'm using C to have this list of lists, and inside of that I have X-train and Y-train, that's my X-train would just be my training set and my training labels, my test set, my test labels, and that's what I've called them. My objects I've called X underscore train, Y underscore train. So it's going to take a second or two, but let's import that. There we go. Now let's have a look at the dimensions. I'm just calling them on X-train, and right at the bottom we see they are 60,000, and then 28 and 28. So this is a tensor. It's not just a rank two tensor, it's not just a rank one tensor, or a rank two tensor, I should say, where it's just a matrix of rows and columns. We've added another dimension to make this a rank two tensor. So it's 60,000 images of 28 times 28. Now what I want to do, now think of that 28, it's one after the other, after the others, in the first row, so that we have 28 pixels in the first row, we jump to the second one, 28, but what we could do is just make them all into one long vector. So 28 times 28 is 784, so I'm changing this into something that has 60,000 rows and 784 columns, and that's 784 columns. So each sample in that one long row is just all the 28 times 28 pixel values in one long row, and that's what we're doing with array underscore reshape. So function, and it's going to take X-train, and it's going to reshape it into this row and column. So the columns are easiest, 28 times 28, that's 784, but the rows are a bit different. You have to call the number of rows of X-train and put that inside of our C function there, and we're going to do exactly the same with X-test, so that now, if we look at the dimensions, we see 60,000 rows for X-train across 784 columns, as simple as that. Now we need to normalize, remember we need to bring them all down, and the way that we're going to normalize this is not by doing a standardizing it, by subtracting the mean and dividing by the standard deviation. It's because we have this very gradual, very linear from zero to 255, all we're going to do is we're going to use broadcasting by dividing every value by the maximum, which is 255. So every pixel value is now just going to be some fraction between zero and one of what it was, because we're just dividing each of them by 255. So that's one way of normalizing our feature set, our feature variables, which we haven't seen before. So let's just look at what train number one was, our training label, and our training data set, what the first label was, and as far as the target's concerned, yeah, it was a five. So now we're just going to do one hot encoding. Remember that's the two categorical function inside of Keras, and because it's from zero to 900 digits, there are 10 of them, so I'm using 10 as an argument, and we're going to do both of those. So let's look at what five was, I'm going to call y train one, the row one comma space data, show me all the columns, and look at this is zero, zero, zero, zero, zero, one. So if I count from zero, zero, one, two, three, four, five, the five is one hot encoded, so indeed that five is now represented by this 10 element vector of which the fifth one here with a starting from zero is one hot encoded. Excellent. Now let's import TF runs, so that we can actually have our tensor board to look at these, and I'm going to call mnist underscore model underscore one dot r, let's open it up and let's see what this one was all about, because I promised you we're going to delve slightly deeper into the sequential model, what can we add in the first thing that's going to pop out here as this callbacks. We haven't seen callbacks before, and I want to introduce you to callbacks here. So my model is just going to be a Keras underscore model underscore sequential, we've seen this before. I'm going to use my pipe symbol here. So I'm going to pipe model as the first argument to layer dense. So I'm going to have one, two dense layers here and then the third dense layer, my first dense layer, 256 units. That's fairly conservative, remembering that I have 784 columns. So I'm going down from, remember, right at the beginning we drew all these circles with all the lines in between them, I'm going from an input vector of size 784 to down to 256. My first hidden layers, I only got 256 nodes, whereas my input had 784. So that's dropping down quite a bit. My activation function would be a rectified linear unit. And remember for that first layer, you've got to tell tensor flow or Keras in this instance, then what the input shape is, and that's 784. It's got to accept that from there, it'll infer how big the weight matrix is, the weight tensor that it has to multiply. So my second one drops down even further to 128. Rectified linear unit is my activation. My last one is 10 units, and I'm going to use a softmax. I've got to use softmax because I want a probability for zero, one, two, three, four, five, six, and eight, nine. And the one that has the maximum probability that's going to be the predicted value for that handwritten digit. Then I'm going to compile the model. My loss is going to be a categorical cross entropy as we should do. And this type of problem, my optimizer is going to be RMS prop. We've seen that video, what RMS prop is all about. My metric is going to be accuracy. And then I'm finally going to fit my model with X train and Y train. I'm going to run my mini batches, 256 rows, over 50 epochs. Now let's get to callbacks. You pass a list of all the callbacks that you want, and I'm only going to do one callback here. And the callback is early stopping. Early stopping means it's going to run, you know that we're going to see this graph on the right hand side. But when it sees that something is happening, it's going to terminate the learning process. And the arguments that we're going to use there is what to monitor and for how long. So monitor in this instance, I'm going to monitor the loss. Because I pass validation data to this, I could have just loss or I could have val underscore loss, or I could also monitor the accuracy or the validation accuracy. I've chosen loss as my simple example here. And my second argument here is patience. So every time a loss comes up, it's going to look at two beyond that, because I've set it to two. And if my loss doesn't decrease, then for two mini batches in a row, it is going to say, hold, stop the bus. We're not learning anything more. Exit from this, I'm stopping. And that is what this callback is all about. I'm still going to set my velocity to two and my validation data. I'm going to use my test data as my validation data. So that's model number one. And I just have to write it out like this. I don't have to import Keras. It's all going to come from here and from the TF runs. So let's run this training run for that model. And because we've set the working directory and these models all live in the same directory, I can just use the name dot R. I don't have to type in the full address. So let's run that. And we can see in this instance, I can do the recording and post this to our pubs. I'm not using my GPU. I'm running this off of the CPU. So it's a bit slow, but we can certainly see things starting to happen there. You can see the little zigzag pattern there, my validation data set. It probably means that, remember that my mini batch size is too small and that we get this noise inside of it going up and down, up and down, depending on the specific data inside of that mini batch. We can smooth this out by making the mini batch size, the batch sizes it's called as an argument, slightly bigger. Now, let's just, while this runs, just talk for a moment about the base truth that we have here. This was done by human being. We're going to accept that the base accuracy here, the human level accuracy was very close to 100%. So if we think about our variance and our bias and we look at this, certainly we're getting very close to 100% in our training set. So that's going very well, but there's between that and what we expect the ground truth to be, the real truth is very small. And so we're probably dealing with a bit of variance here, a bit of overfitting and we have to do something about that. But we've called TF runs and we've got called train data. So we get this beautiful tensor board illustration. But look what's happened. We see this early termination here because of our callback. So it's recognized that for two steps in a row, there was no increase in the accuracy or was it lost? Now I can't remember what I called. Let's have a quick look. We used loss. So it didn't see that the loss got less for two after one was called. So it terminated early, no problem whatsoever. And that's very good. It's very good to set that so that you don't sit with bigger models. I mean, all of these are toy models that we've dealt with. You don't wanna sit for days and have something run and nothing really is happening. Your model's not making any improvements. So to put these callbacks in, very, very good thing to do. So let's just go to our second model. Let's see what we did there. So what we've done, it's a bit more restrictive, 128. So that space, I'm really that hypothesis space, I'm shrinking it down and I'm also doing a bit of dropout, 0.2 there. So I'm adding these two dropout layers there. I'm using RMS prop still. I'm still using my callbacks and nothing's changed. So there wasn't, I don't know if one can really call this that there was high variance, but between what we assumed to be this baseline of 100% accuracy as the labels were done, the training data got very close to that, but then there was this gap slightly to to the test data, the validation set. So let's try and improve that by constricting our hypothesis space a little bit just for fun. Basically as a reminder, I haven't introduced anything new, but let's run this model and we can see training is going on. We can see quite slow. Remember, I'm running this off of a Core i7 Intel processor, so it's not running off of my GPU. So it's going slowly, but we're suddenly getting there and we can see our model there went right to the end. So the callback did not terminate. It didn't have to, it didn't find the reason with the patients being set to two to terminate early. So we see this whole run taking place over all 50 epochs. So let's compare these two runs. We've seen that before and now we can see the two runs. So let's look at this accuracy and really for the training set went up to 99, 98 for validation. And here it was on this side. Let's have a look. Yes, on this side it was 99, 98. So certainly you must always have a look at the scale. It's deceptive, this y-axis, very bad. And so just look at that. So as I say, we haven't really improved this and whether we should call this high variance is that's also debatable, but it's about the principle you get it and then we can see the changes we made. We went from 256 to 128. All the greens are the new ones. We introduced the dropout layer there and et cetera, et cetera. Let's look at model number three, something new I want to show you here as well. So we've got our constricted model still. We've got the dropouts, et cetera. And now when we get to compiling in the optimizer, before we would have just put Adam inside of quotation marks. What you do there is you just accept all the default values in the next video. If I have time, I might show you an easy way after creating your model, how to get to what all the defaults worth, very easy to do. But here, instead of just using all the defaults by putting Adam inside of quotation marks, I can actually use the proper function. It's called optimizer underscore Adam, which allows me to add some of the default values or change them even. So if I just hover there, it's tiny. You're probably not gonna see this on your YouTube video, but it says they're learning rate 0.001. So I've changed that to learning rate being 0.003. There's beta underscore one, beta underscore two. We've spoken about that before. We have an epsilon there, a decay rate, an AMS grad, clip norm, clip values, et cetera. So I've changed three of the arguments there. And that is how we drill down deep into the sequential model, where we start manipulating things to our heart's content, and it is so easy to do. So there we go, a lovely example of drilling a bit deeper. So instead of using the default values by just using the name, and this for Adam, it would just be Adam in inverted commas, I'm actually using the function, optimizer underscore Adam, so that I can manipulate the arguments that go with that. So that's beautiful. That is something new. So let's just run this model three. Now while that runs to the end, I'm gonna stop the video here. We can just compare them. It's gonna show us the comparison. You've seen that before. But now you've seen drilling down a bit deeper, getting a bit more finesse, taking more control over your sequential models.