Now let's continue our look into deep neural networks. I want to pause a bit and discuss poor performance of a deep learning model. If you've come across this video and it's the first one you see, please go back to the beginning. That really is my suggestion: watch the whole series, because one lecture builds on the other. In the preceding lecture we built our very first neural network, with two hidden layers. We wrote the code inside of RStudio using Keras with the TensorFlow backend, and that to me is very exciting. In the real world, though, no one ever writes a single neural network, runs the data through it, and has everything work perfectly. There are usually problems with the results. The code works, the neural network runs, but the choice of hyperparameters and the way the network is constructed must be changed based on the results that you get. There must always be this attempt at improving the model.

So in real life there is data pre-processing, usually with the help of a domain expert. That doesn't always happen, but having a domain expert present when the data pre-processing is happening really makes a big difference. A network is designed (later on we'll also see that you can import existing networks), we run the model, we pass the data through it, we get results, and then we have to interpret those results somehow: what they mean and, especially, how good they are, so that we can carefully change things in the model, change the hyperparameters, change the architecture of the network, or do something else to improve the results. For that to happen we must have some way of understanding what went wrong, and I don't mean what went wrong in the execution or writing of the code, but why the results are not as good as we expected, or better still, how we can think about improving them.

So first of all, let's go back to the training and test sets. There is something specific about these that we may have mentioned before, but it's worth spending a little more time on, because this preparation of the data is really important before you start assessing how well your model did. I want to remind you that you have a data set and you split it: there is a training set, the data that is passed to the model, the big black box, so that it can learn from it, and then there is a set of data the model has never seen before, the test set, which we keep completely separate. The model never learned from these new cases. You don't always need to split that from an existing data set. It might well be that while the network is being developed, data collection continues, and it is this new data that becomes the test set, so it needn't just be a split of the original.

One thing to consider, though, is the size of the data set. We mentioned before that when data sets were small the norm used to be a 70/30 split, because we needed enough data in the test set to make the testing accurate. In modern days, where we may have millions of samples, we can use a 5% or even a 1% split for the test set, and that will still contain enough samples to be representative of the whole data set.
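To make that splitting concrete, here is a minimal sketch in base R. The data frame name df is just a placeholder for your own data set, and the fraction is the one discussed above.

```r
set.seed(42)
n <- nrow(df)                      # df: your full data set (placeholder name)
test_frac <- 0.30                  # classic 70/30 split; something like 0.01 is enough with millions of rows
test_idx  <- sample(n, size = round(test_frac * n))
test_set  <- df[test_idx, ]        # kept completely separate, never shown to the model
train_set <- df[-test_idx, ]       # the data the model learns from
```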
Now a few problems might arise here. It might very well be that the test set and the training set do not look the same, especially when the data is collected at different times. Think, for instance, of cases where images form the data we want to train on, and the training set is a very nicely selected set of high-resolution images, so we get fantastic results during training. But when we then pass real-world images, they might be blurry, lower-resolution, and we get very poor performance. So there is this difference between the training set and the test set, and those differences must be minimized. In the end we want a model that will work well on real-world data.

That brings us to the point of class imbalance, another issue with the distribution of our data. Think, for instance, of a situation where one of the classes in the target variable occurs very infrequently: 95% of the samples belong to one class and less than 5% to the other. If that is really the case, I might as well just guess the majority class every time and I'll be right in 95% of the cases. Why do I need a deep, deep neural network? So if that is the truth in real life, then yes, perhaps you don't need a neural network. But if the imbalance exists because there is something wrong with the data collection, then you have to do something about it. One way to go about it, of course, is better data collection, but if that is not possible, look at something like data augmentation, which we'll discuss later.

Another point to belabor is that a separate test set is not absolutely required. Some data scientists will take the whole data set initially and just do the validation split inside of the model, and some even refer to that validation set as the test set. Other names for it are the holdout set or the development set. That split can be done inside of the model, and we saw in the preceding lecture that the validation set can simply be extracted while the model is training, and we can treat that as our test set (there is a short sketch of this at the end of this passage). So when you see that done, don't worry too much about it. Just to be formal here, we're going to talk about a training set, from which we split off a small validation set during training, and a separate test set. Just make sure that when you see these terms you're not confused by them. So really keep these thoughts in mind before designing: think about the data for a minute, about the things that can go wrong with the data, and specifically about the splitting of the data.

The next important thing to talk about is the idea of the ground truth. Have you ever stopped to think about it: somehow, by someone or by some means, every sample in the data set had its target variable recorded in a spreadsheet or a database. It doesn't matter how, but someone or something decided that that is the actual value that has to go in there. If this is a CT scan with a benign nodule or a malignant nodule, someone marked it as benign or malignant. And there might be an error; that label might be wrong. So what we refer to as the ground truth, the labels that exist in the data set we have, might not be absolutely correct, and we are training on something that has an inherent mistake in it.
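Before moving on, here is a minimal sketch of two of the practical points above: checking the class balance (the largest class proportion is the accuracy of always guessing the majority class, the baseline any model has to beat), and letting Keras carve a validation set out of the training data. The object names y, model, x_train and y_train are placeholders standing in for the previous lecture's code.

```r
# How imbalanced is the target? The largest proportion is the baseline accuracy
# you get by always guessing the majority class.
prop.table(table(y))
max(prop.table(table(y)))

# Let Keras split off a validation ("development" / "holdout") set while training,
# instead of splitting it off yourself beforehand.
library(keras)
history <- model %>% fit(
  x_train, y_train,
  epochs = 20,
  batch_size = 32,
  validation_split = 0.1   # hold out 10% of the training samples for validation
)
```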
Now there is this idea of an optimal error. That is theoretically the smallest possible error that can be achieved on the target data; it is also called the Bayes error, and that is what we are working towards. We want our models to achieve, or at least get close to, this Bayes or optimal error. At times this can be different from the human error rate. So imagine we have a bunch of CT scans, and a radiologist, or a couple of radiologists, sat and marked each one: this one is benign, that one is malignant. There is going to be some human error in that. At the very least we really want our models to exceed the capabilities of a human, so the human error rate must be exceeded, and we want to approach this theoretical optimal, the Bayes error. One way to think about the ground truth, then, when you sit down and look at the data set and try to evaluate it yourself, is to think about how that target variable was decided. An example of the human error coming very close to the optimal error is a group of experts looking at every sample: you can't get much better than a whole group of radiologists sitting together and labelling every CT scan; that is as good as it gets. When it is a piece of apparatus that makes a measurement, you want the best possible piece of equipment, the best apparatus, to record that target value. So think about what the error rate in your actual data set is versus the theoretical optimal error; it makes a difference.

So how do we evaluate the result? The data has been pre-processed, we've fed it to the model, and we've now tested against the validation or the test set. How do we know how good or bad it is? There are two things we have to discuss here, and that is bias and variance. These are easy to understand, but there is some subtlety to bias and variance. Let's quickly start with bias, which we can see here and which is also called underfitting: that is where the model does not separate two classes very well. I want to draw your attention to these little examples. They come from the scikit-learn website; there is some Python code there that produces these images, and I got them by running that code from their website. Now I just want you to use a bit of imagination, because there is a bit of a difference between this and what happens in machine learning. There are these data points, and they follow this orange line, which is the actual underlying function. Because the apparatus that measured these data points is perhaps not absolutely accurate, there is a bit of noise in them. What we want to do is fit a line, a model, to this data. If we fit a degree-one polynomial, that means a straight line through the data: given any x input, what does it predict the y output to be? A lot of mistakes are going to be made here; the straight line is a very poor fit, it underfits this data. Now, with a bit of imagination, change this to a machine learning scenario. This line would be what we call a decision boundary: with lots of data points, everything on one side of the boundary is predicted to be one class (imagine a binary target variable) and anything on the other side of the line is predicted to be the other. And then you may have points ending up on the wrong side of the line as far as the predictions are concerned.
So that model, that boundary line, would be a poor model or neural network. In this instance, where we were just fitting a line to data, it is also poor because it makes big errors: if I give an input value down here, the actual value is up there, and there is a big difference between the two. It's a poor fit. Now we can make our machine learning model more complex, so we move away from a straight-line decision boundary to something that is more curved. In the middle instance drawn here, there is a much better fit to the real-world, behind-the-scenes line, the orange one that was the true function; this model is a very close fit, and that is somewhere near the optimum. Remember, that is what we are trying to achieve with machine learning; we never know what this orange line really is. Again, if this is a decision boundary, it curls around some of the data. Use some imagination and see this graph in a different way: there will be points on either side of the line, and the points on one side and the other will be predicted as different classes. If the line sort of squirms around these data points on either side, that is a better decision boundary.

But look at the right-hand side now. We've upped the degree of the polynomial so much that this equation, this blue line, which is an actual equation, goes through almost each and every one of these points. That is complete overfitting. Complete overfitting means that if you see this blue line as a decision boundary, it curls around the training set data points so well that it completely separates the two classes, but it is way too convoluted for new data. It overfits the training data, and it is going to be very poor when it comes to real-world data. We call that overfitting, or we say this model has a high variance. I actually started off by wanting to discuss bias, so let's get back to that. On the left we have a bias problem: there is total underfitting, and it really does not separate the classes very well. If this were the decision boundary, there would not be a good separation of the two, so you are going to make a lot of errors there. So see bias, underfitting, on the one side and variance, overfitting, on the other. Overfitting is also called memorization: if you pass data to a model and it does very well on the training data, it might very well be that it has simply memorized the data, and it will perform very poorly on new data.
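If you want to reproduce these three panels yourself, here is a small sketch in base R rather than the Python from the scikit-learn site; the true function, the noise level and the degrees 1, 4 and 15 loosely follow their example and are illustrative choices, not settings from this lecture.

```r
set.seed(1)
x <- sort(runif(30))
y <- cos(1.5 * pi * x) + rnorm(30, sd = 0.1)   # true function plus a bit of measurement noise

fit_and_plot <- function(degree) {
  fit  <- lm(y ~ poly(x, degree))              # polynomial model of the given degree
  grid <- seq(0, 1, length.out = 200)
  plot(x, y, main = paste("degree", degree))
  lines(grid, cos(1.5 * pi * grid), col = "orange")                        # the true, unknown function
  lines(grid, predict(fit, newdata = data.frame(x = grid)), col = "blue")  # the fitted model
}

par(mfrow = c(1, 3))
for (d in c(1, 4, 15)) fit_and_plot(d)         # underfit (high bias), good fit, overfit (high variance)
```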
Now, how are we going to know whether we are dealing with bias, underfitting, or with high variance, overfitting? We are going to look at two things: the training set error and the validation set error. Think back to the preceding video: those two came out of the model, the training set that went in and the validation set that was split from it. So let's look at overfitting. There the training set error is very low, and there is a large difference between the error rates of the training and the validation sets. When you look at those curves that formed when we ran the model in RStudio, the training set error just drops and drops and becomes very small; but unlike the contrived example that we had with the 50,000 data samples, we might find that the error on the validation set, the green line, is much higher. So imagine the error rate of the former being about 1% and that of the latter being about 10%: that is high variance. The model has overfitted; it is trained to the training set too well and does not work well on unseen data, the validation set.

Now imagine both of these errors are poor: the training set and the validation set errors are both in the order of about 15% to 16%, under the assumption that the error in the target variable itself is very low, so very accurate data was fed to the model. Bear that little assumption in mind, because it is going to make a difference in a short while. Here the validation set and training set errors are very close to each other, and both are much higher than the error you would expect inside the target variable itself. This model is said to have high bias: it is not even doing well on the training data it was given. And then we might get a scenario where both the bias and the variance are high: still assuming a very low human or optimal error in the target variable, say a 15% error on the training set and a 30% error on the validation set. That gives us both high bias and high variance.

Let's think about the influence of the optimal error, though. When we had the 15% and 16% errors above, both the training and validation set errors were high, and under the assumption that there was essentially no error in the target, that gave us high bias. But bring the error in the target up, say to 14%, because we know there might be misclassifications in the data we bring in, and a model that still sits with an error of 14% to 15% is actually a fantastic model: low bias and low variance, at the same error rate. So you have to see this in the context of what you, as the domain expert, think the error rate in the target variable itself is. In short, this is what we are after: when you run a model, look at the training set error, the validation set error and the difference between them, keeping in mind what the baseline error, the error in the ground truth itself, might be (a small sketch of reading these two errors off the Keras output follows below). Just to mention: in older reports and research documents you might see talk of the trade-off between bias and variance, where changing your model moves you in one direction or the other. But in the modern world, where we have big data sets and very sophisticated deep neural networks, that trade-off is no longer really the norm; you can get very low bias and very low variance in the same go. So make sure you understand these errors; read this material again until it really becomes part of you, so that when you run these models you know exactly what is happening.
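A minimal sketch of reading the two error rates, assuming the model, x_train and y_train objects from the earlier sketches; note that the metric names can differ slightly between Keras versions.

```r
history <- model %>% fit(x_train, y_train, epochs = 20, validation_split = 0.1)
plot(history)   # training and validation curves, like the ones we saw in RStudio

# Final accuracies; in older Keras versions the names are "acc" / "val_acc"
train_err <- 1 - tail(history$metrics$accuracy, 1)
val_err   <- 1 - tail(history$metrics$val_accuracy, 1)
cat("training error:  ", train_err, "\n")
cat("validation error:", val_err, "\n")
# Low training error but much higher validation error       -> high variance (overfitting)
# Both errors high relative to the baseline / optimal error -> high bias (underfitting)
```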
Just a few pointers as to what to do. If you see high bias, the model really underfits the training data, and the ways you can go about solving the problem are the following; write these down somewhere and they will become part of you as you create more and more networks. If you have high bias, first create a bigger network: that means putting in more layers and more nodes in every layer. If that doesn't work, train for more epochs; it might be that you haven't reached the bottom of your gradient descent yet. And lastly, you can try a whole different architecture: when the input is images, don't use a plain neural network; try a convolutional neural network, for instance, and we will get to designing and writing code for convolutional neural networks in the future. When you have high variance, a large difference between the error rates of the training and validation sets, you can try the following possible solutions. Number one, capture more data; that is king. If you can't get your hands on more, you can augment the data, and remember the class imbalance we spoke about: data augmentation can help with both of those problems. And then there is the really interesting stuff, because this we can manipulate in code: implementing, in the design of your neural network, regularization, dropout, batch normalization and other techniques (there is a short sketch of these at the very end of this lecture). In the very next video we are going to look at the very exciting concept of regularization.

So that was a bit quick. As I said, these are easy concepts to understand, but there is a lot of subtlety to them. Don't let the figures confuse you: I am trying to depict two things here. One is just the fitting of a line to actual data, but I want you to use your imagination to also see these blue lines as a decision boundary that predicts a class on either side. Remember that in higher-dimensional space it is not just going to be a line but some hyperplane, but you can imagine it going from a straight thing to something that is convoluted and curled all around; that is what we call the decision boundary. That decision boundary can become too complex, and you have high variance on that side; it can also be not complex enough, and you have high bias on the other side. So you have got to aim for the middle ground. There was no coding in this lecture, so read this document again. The actual file will be on GitHub; I will put the links down below, and I will speak to you in the next lecture.
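As promised above, here is a minimal Keras-in-R sketch of those knobs: a somewhat bigger network with L2 weight regularization, dropout and batch normalization. The layer sizes, rates and input shape are placeholder values to tune against your own training and validation errors, not settings from this series.

```r
library(keras)

model <- keras_model_sequential() %>%
  layer_dense(units = 128, activation = "relu", input_shape = c(30),   # wider layer: one fix for high bias
              kernel_regularizer = regularizer_l2(0.01)) %>%           # L2 regularization: fights high variance
  layer_batch_normalization() %>%                                      # batch normalization
  layer_dropout(rate = 0.3) %>%                                        # dropout: fights high variance
  layer_dense(units = 64, activation = "relu",
              kernel_regularizer = regularizer_l2(0.01)) %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(optimizer = "adam", loss = "binary_crossentropy", metrics = "accuracy")
# Then fit() for more epochs if the training error itself is still too high.
```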