Greetings fellow learners! Before we get into this dazzling world of data requirements, I have a thought-provoking question for you. When preparing presentations at school or work, how many data-driven facts do you like throwing in? For example, at work I prepare small presentations to motivate why we should do project X, and when preparing my case, I add little single facts that pull a lot of weight. These facts should be big picture and hence easy to understand, and they should also act as a motivator and a conversation starter. So for example, we could make a statement like, "we are losing 7% of value per order...", and so we propose project X to mitigate this. Turning the question over to you: how would you prepare for these presentations? Please comment down below, I would love to know your thoughts.

Now this video is going to be divided into three passes. In the first pass, we're just going to go through an overview. The second pass is more details and model comparisons. And the third pass is code. So let's get to it.

How much training data does a neural network need? This depends on multiple factors, but four of the main ones are model architecture complexity, problem complexity, data quality, and data diversity. So let's talk about each.

The first factor is model architecture complexity. Consider a feedforward network that has parameters that need to be learned. If we increase the number of layers, we increase the number of parameters that need to be learned, and this means that we need more data for training, generally speaking.

The second factor is problem complexity. Neural networks, at least from a supervised learning perspective, need to learn how to map inputs to outputs, and they learn this mapping from training data. Sometimes this mapping is easy to learn. To illustrate this, let's take a trivially simple example where the network input is the age of a person in years, and the network output should be the age of that person in 10 years. This can be solved with a simple function that just adds 10 to the input, of course. But we can also train a neural network to solve this problem, and we probably just need a handful of examples (there's a short code sketch of this toy problem at the end of this overview). On the other end, the mapping from input to output can be very challenging. For example, in language modeling, we want the network to take some text as input and produce the next word as output. Here, because the relationship between the input and the output is complex, the network needs more than a handful of examples to train effectively.

The third factor is data quality. We need to ensure that the data used during training is a good representation of the data seen during inference. If we feed the model clean and crisp pictures during training, but during inference the model is exposed to not-so-clean data, then its performance will be low, even with lots of training data.

The fourth factor is data diversity. Say that we train a network that takes images and classifies them as cat, dog, or monkey. A data set with 1,500 cats, 1,500 dogs, and zero monkeys might give a performance of 60%, but a data set with 1,000 cats, 1,000 dogs, and 1,000 monkeys may give a performance of, say, 90%. Even though the data set size is the same in both cases, the second data set is more diverse than the first, and hence the model is able to generalize and perform better.
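Here is a minimal sketch of that age-plus-ten toy problem from the problem complexity factor. The five training examples, the learning rate, and the step count are my own assumptions for illustration, not from the video:

```python
import torch
import torch.nn as nn

# A single linear neuron is enough to learn f(x) = x + 10.
model = nn.Linear(1, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
criterion = nn.MSELoss()

# A handful of (age, age + 10) training examples.
ages = torch.tensor([[5.0], [20.0], [35.0], [50.0], [65.0]])
targets = ages + 10.0

for _ in range(2000):
    optimizer.zero_grad()
    loss = criterion(model(ages), targets)
    loss.backward()
    optimizer.step()

print(model(torch.tensor([[42.0]])).item())  # close to 52
```

With only five examples, the network recovers a weight near 1 and a bias near 10, which is exactly why simple mappings need so little data.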
Now there are many more factors that determine the size of the data set required, but these are among the top four.

Have you been paying attention? Let's quiz you to find out. Which of these alterations to a neural network is likely going to decrease the amount of training data required: (a) increasing the number of layers, (b) increasing the number of dimensions of the input, (c) decreasing the number of layers, or (d) decreasing the dimensions of the input? Note that more than one option could be correct. So comment your answer down below and let's have a discussion. And if you think I deserve it at this point, please do consider giving this video a like, because it will help me out a lot. That's going to do it for quiz time for now and pass one of the explanation. But keep paying attention, because I will be back to quiz you.

Now, one way to determine how much data you need for your problem is to research similar models and similar problems. So let's take a look at some models, outlining some of the factors that we described in pass one. Let's start with a computer vision model, LeNet-5. This model is a convolutional neural network with convolution, activation, and pooling layers. The problem it is trying to solve is digit recognition, where the input is a handwritten digit and the output is a 10-class classification over the digits zero through nine. The number of parameters in its original form is about 60,000, and it was originally trained on a data set of 60,000 training examples (MNIST). The LeNet-5 architecture achieved an error rate of 0.95%. So that's LeNet-5 (a small code sketch of this architecture follows at the end of this comparison).

Now let's look at another architecture, AlexNet. This model is a deeper convolutional neural network than LeNet. The architecture takes in an image and performs object classification, where the image is classified into one of 1,000 classes. The number of parameters of this architecture is 60 million, and the data set size is 1.2 million training examples, roughly 1,000 images for each of the 1,000 classification categories, trained on ImageNet. As far as performance is concerned, its top-1 error rate is 37.5% and its top-5 error rate is 17%. This means that if we take the single top prediction from AlexNet, 37.5% of the time the true label is not that prediction; but if we take the top five predictions, then only 17% of the time is the true label not among them.

Now let's actually put LeNet-5 and AlexNet into a table for comparison purposes:

Model   | Problem                           | Parameters | Training examples | Performance
LeNet-5 | 10-class digit recognition        | ~60,000    | 60,000            | 0.95% error
AlexNet | 1,000-class object classification | 60 million | 1.2 million       | 37.5% top-1 / 17% top-5 error

From this table, we notice a few things. From a problem complexity perspective, AlexNet is solving a more complex problem. The ratio of parameters to training examples is about one to one for LeNet-5, whereas it's about 50 to one for AlexNet. From a performance standpoint, AlexNet is still able to achieve good performance on a relatively complex problem, and this is likely helped by data augmentation and regularization techniques, among many others. So overall, I hope this table shows how difficult it can be to answer the question: what is the optimal number of training examples?
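To make the LeNet-5 side of that comparison concrete, here is a minimal sketch of the architecture in PyTorch. The layer sizes follow the original design for 32x32 grayscale digit images; this is my reconstruction for illustration, not code from the video:

```python
import torch
import torch.nn as nn

# LeNet-5: two convolution + pooling stages followed by three
# fully connected layers, about 60,000 parameters in total.
class LeNet5(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # 1x32x32 -> 6x28x28
            nn.Tanh(),
            nn.AvgPool2d(2),                  # -> 6x14x14
            nn.Conv2d(6, 16, kernel_size=5),  # -> 16x10x10
            nn.Tanh(),
            nn.AvgPool2d(2),                  # -> 16x5x5
        )
        self.classifier = nn.Sequential(
            nn.Linear(16 * 5 * 5, 120),
            nn.Tanh(),
            nn.Linear(120, 84),
            nn.Tanh(),
            nn.Linear(84, 10),                # 10 digit classes
        )

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))

model = LeNet5()
print(sum(p.numel() for p in model.parameters()))  # ~61,706 parameters
```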
You can also observe similar comparisons for NLP models. One way to see this is by comparing the GLUE scores of different models. GLUE is a benchmark that evaluates a model on nine language understanding tasks; each record in the table shows a model and its performance on each of these nine tasks, and their accuracies are then averaged to create a GLUE score: the higher the score, the better the model. Looking at these models, the problems they solve, and their performance, you can get a sense of how much data your problem might need.

It's that time of the video again. Have you been paying attention? Let's quiz you to find out. Increasing the training data twofold will increase the model performance by (a) one fold, (b) two fold, (c) four fold, or (d) an uncertain amount. Comment your answer down below and let's have a discussion. That's going to do it for quiz time for now. But keep paying attention, because I will be back to quiz you.

In pass three, we're actually going to look at a Colab notebook, where we're going to see how the data set size can affect model accuracy. We start by importing some libraries: NumPy for tensor and array manipulation; torch for creating the neural network and the optimizer; and from scikit-learn, make_classification for generating our synthetic data set and train_test_split to split the data into a training set and a test set. And then we're going to plot some pretty charts with matplotlib.

Now, input_dim is going to be the number of dimensions for every single example; in this case, it's a 1,000-dimensional input. The number of samples is 50,000, and the number of epochs during training for every single data set size is going to be 10. Next, we define our model, which is going to be a very simple feedforward neural network with three feedforward layers, along with ReLU activations and a sigmoid at the end. We then generate the data set, where we have 50,000 samples, each sample is 1,000 dimensions, and all of those 1,000 dimensions are informative. We then split it into a train set and a test set: 40,000 in the training set and 10,000 in the test set.

Now we're going to train the same model architecture about 10 times, and each time we're only going to take some fraction of this data. For example, of the 40,000 training examples, in the first iteration we just take 10% of the data, so that's 4,000 training examples. We train the model with those 4,000 examples: we define a loss criterion, which is going to be the binary cross-entropy loss, since this is a binary classification model; we use the Adam optimizer; and then we begin training, where we first zero out the gradients, make a prediction with the model, calculate the loss by comparing the model's outputs with the actual true labels, compute the gradients with backpropagation, and then optimizer.step() actually performs the update of the network's parameters, so the model effectively learns. We perform this for 10 epochs. Next, we evaluate the model by taking its output and determining the binary classification: if the output probability is greater than 0.5, it's the one class; otherwise, it's the zero class. And then we just compute an accuracy.

Now, if you look at the logs over here, you can see that in each iteration we train the model with a bigger data set: a size of 4,000, then 8,000, then 12,000, and so on. For each of those cases, we plot the train and test accuracy, and we get this beautiful chart.
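Here is a condensed sketch of what that notebook does, so you can follow along. The hidden layer widths and the default Adam learning rate are my own assumptions where the video doesn't specify them:

```python
import numpy as np
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

input_dim = 1000      # dimensions per example
num_samples = 50_000  # total synthetic examples
num_epochs = 10       # epochs per data set size

def make_model():
    # Very simple feedforward network: three linear layers,
    # ReLU activations, and a sigmoid output for binary classification.
    return nn.Sequential(
        nn.Linear(input_dim, 64), nn.ReLU(),
        nn.Linear(64, 32), nn.ReLU(),
        nn.Linear(32, 1), nn.Sigmoid(),
    )

# Synthetic data set: 50,000 samples, all 1,000 dimensions informative.
X, y = make_classification(n_samples=num_samples, n_features=input_dim,
                           n_informative=input_dim, n_redundant=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_test_t = torch.tensor(X_test, dtype=torch.float32)
y_test_t = torch.tensor(y_test, dtype=torch.float32).unsqueeze(1)

sizes, train_accs, test_accs = [], [], []
for frac in np.linspace(0.1, 1.0, 10):    # 10%, 20%, ..., 100% of the data
    n = round(frac * len(X_train))        # 4000, 8000, ..., 40000
    X_t = torch.tensor(X_train[:n], dtype=torch.float32)
    y_t = torch.tensor(y_train[:n], dtype=torch.float32).unsqueeze(1)

    model = make_model()
    criterion = nn.BCELoss()              # binary cross-entropy loss
    optimizer = torch.optim.Adam(model.parameters())

    for _ in range(num_epochs):
        optimizer.zero_grad()              # zero out the gradients
        loss = criterion(model(X_t), y_t)  # forward pass, compare to labels
        loss.backward()                    # backpropagation
        optimizer.step()                   # update the parameters

    with torch.no_grad():  # threshold the output probability at 0.5
        train_acc = ((model(X_t) > 0.5).float() == y_t).float().mean().item()
        test_acc = ((model(X_test_t) > 0.5).float() == y_test_t).float().mean().item()
    sizes.append(n); train_accs.append(train_acc); test_accs.append(test_acc)
    print(f"size={n}: train={train_acc:.3f}, test={test_acc:.3f}")

# Plot train and test accuracy against training set size.
plt.plot(sizes, train_accs, label="train accuracy")
plt.plot(sizes, test_accs, label="test accuracy")
plt.xlabel("training set size"); plt.ylabel("accuracy"); plt.legend()
plt.show()
```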
So you can see that as we increase the data set size, the test accuracy keeps increasing while the train accuracy keeps decreasing. This is actually a good sign, because it means that for lower amounts of data, that is, when the data set size is small, we have a tendency to overfit: the model is essentially trying to memorize patterns in the data, so it performs really well on the train set but doesn't generalize very well. As the data set size increases, you can see that we overfit less and the model actually generalizes better.

So the overall observation is that as we increase the data set size, the test performance increases in a less-than-linear fashion, and we overfit less and less. While increased data correlates with better accuracy, the correlation is less than one-to-one, and the returns diminish as we add more and more data. So to determine how much data you need for your problem, look at other similar problems and models and see how much data they required for training. Using that as a starting point, you can run tests of your own for your specific use case. I hope this answers the main question of this video.

This is going to be a fun one. If you were to graph the amount of data on the x-axis and test accuracy on the y-axis, the relationship would look (a) exponential, (b) quadratic, (c) linear, or (d) logarithmic. Comment your answer down below and let's have a discussion. And if you think I do deserve it at this point, once again, please do consider giving this video a like, because it will help me out a lot. That's going to do it for quiz time and pass three of this explanation.

But before we go, let's generate a summary. How much training data does a neural network need? This depends on four main factors, though there are many others: model architecture complexity, problem complexity, data quality, and data diversity. We saw how complicated answering this question can be when looking at computer vision models, or even the models on the GLUE benchmark. Overall, to determine how much data you need for your problem, look at other similar problems and models and see how much data was required for training there. Using that as a starting point, you can run tests of your own for your specific use case.

And that's all we have for today. The link to the code will be in the description below. To learn more about training large models and their data requirements, you can check out my playlist on training a transformer neural network from scratch. Thank you all so much for watching. And if you think I deserve it, please do consider giving this video a like on your way out. I will see you all in the next one. Bye bye.