Greetings fellow learners! Now before we get into this hypnotic world of hyperparameters, I have a thought-provoking question for you. What are some of the factors outside of you that can brighten your day? Could it be something as simple as good weather outside? Or would it have to be something that comes from another person, like somebody giving you chocolates or flowers? I ask this so that we can see if we can relate hyperparameters to real life. For example, if your body or mind is the neural network set to optimize happiness, then can these external factors in some way be considered hyperparameters, since your body depends on them? After watching this video, please do comment your thoughts below and let me know what you think, because I would love to hear them. This video is going to be divided into three passes. In the first pass, we'll go through an overview, in the second pass some details, and in the third pass code. So let's get to it. What are hyperparameters? Let's talk about that in this pass. This is a feed-forward neural network to classify an image as a dog or not. During the training phase, we pass image-label pairs to the network. During the inference phase, we can then pass some unseen image and determine if it is a dog or not. Now, during the training phase, the neural network learns some parameters: these edge weights. The performance of the model depends on the parameters of this model, which are the edge weights. The parameters are in turn dependent on two things. The first is the data used during training of the model, and the second is another set of parameters called hyperparameters. The prefix hyper is used to distinguish them from the model parameters, and it indicates that they are the top-level parameters on which the model depends. Quiz time! Have you been paying attention? Let's quiz you to find out. What makes these hyperparameters hyper?
A, they are top-level parameters that depend on the model parameters. B, they are the top-level parameters on which the model parameters depend. C, they are top-level parameters that represent the edge weights of the network. And D, they are the top-level parameters that represent the activations in the network. Comment your answer down below and let's have a discussion. And if you think I do deserve it at this point in the video, please do give this video a like, because it will mean a lot. Now, that's going to do it for pass one of this explanation, but keep paying attention because I will be back to quiz you. In pass one, we covered what hyperparameters are, and so in this pass, let's look at some hyperparameters and talk some details. So, this is a non-exhaustive list of some hyperparameters, and each of these has tens to even thousands of configurations. So, the main question here is, how does one set the value for these hyperparameters? First you need to find some ballpark value for these hyperparameters, and then you can fine-tune them to suit your own needs. Now, to get some ballpark estimate, I see two main ways to do this. The first is practical experience, that is, looking at other machine learning engineers' code and building some intuition of your own. The second way is through theoretical best practices, and this includes looking at the configurations in research papers on problems and models similar to the one you are trying to solve. Now, once you have this ballpark starting-point estimate for a hyperparameter, you can manually set these hyperparameters, which involves starting with some value for the hyperparameters, training the model, seeing the performance, and then manually changing the hyperparameter, training the model, and seeing the performance again until you are satisfied. Or you could also automatically set hyperparameters.
And this is most commonly done with the technique known as grid search, where we set a range of values that the different hyperparameters can take, and then a Cartesian product of all of these hyperparameter values is created. We then train a model on each of these configuration settings, and choose the hyperparameter settings that yielded the best model performance. Let's say now that you are a machine learning engineer and you're trying to build a neural network that is going to predict the price of a house given some information about said house. Let's walk through some hyperparameters and see how we would set them for this problem. Let's assume here that we are going to build a feed-forward neural network, and that the number of features is 10. For the number of layers, in this case we can start out super simple, with something like one to two hidden layers, and then we can evolve from there. The number of neurons per layer we can keep small. If we have an input of 10, the first hidden layer dimension could be something like 64, and the second could be 32. I like keeping these as powers of two, but it's not really necessary that you do this. Next is the type of layers. We're going to use simple dense, fully connected layers. As for the connection type, since this is a feed-forward neural network, we only need fully connected connections. Now, skip connections: we don't quite need them, because this is quite a small network. Recurrent connections: in this case we are dealing with non-sequential data, so we don't really need them either; for sequential data like language or time-series analysis, some recurrent connections could be useful. Now, the type of activation function: we can use ReLU for the hidden layers and simply use a linear activation function for the output neuron. This is, after all, a regression problem, so the output doesn't need to be constrained.
Now, ReLU can pick up on the complexities of your problem without excessively leading to vanishing gradient problems. The dropout positions in the network could be after every hidden layer, with the dropout rate initially set to something like 0.5 to start out. That is, there is a 50/50 chance that a given neuron is turned off during a training step. The batch normalization positions in the network could be after every single layer except for the final layer. Now, if we go into the training- and evaluation-related hyperparameters, the number of epochs during training could start at five and then potentially increase to 10. Then there's the train/validation/test split ratio. In this case, if we have an entire data set, we could probably split it into an 80/20 split if we are just training and then testing. But if you also want a validation set, you can do an 80/10/10 or a 70/20/10 split. Now, the loss function: in this case we are doing a regression problem to predict the price of a house, so a mean squared error would be appropriate. If you have a classification problem, a cross-entropy loss would be more appropriate. The metric for model evaluation could again simply be the mean squared error, that is, the average squared difference between the predicted price of a house and its actual price. Or it could be the mean absolute error, to make it more human-interpretable. Other evaluation metrics could be used in the case of classification problems. Now, cross-validation settings: here we can use k-fold cross-validation, in which case we would take the entire training data and split it up into k chunks. A value of k of five or 10 would be a reasonable starting value. The learning rate is how fast you want this network to learn initially.
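The architecture choices above can be sketched in PyTorch. This is a minimal sketch, not the video's exact code: the 10/64/32 layer sizes, 0.5 dropout rate, and batch-norm placement are just the illustrative starting values from the walkthrough.

```python
import torch
import torch.nn as nn

# Sketch of the house-price network described above:
# 10 input features -> 64 -> 32 -> 1, with batch norm and ReLU on each
# hidden layer, dropout of 0.5 after each hidden layer, and a plain
# linear output neuron (no activation) for the regression target.
model = nn.Sequential(
    nn.Linear(10, 64),
    nn.BatchNorm1d(64),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # each hidden neuron has a 50% chance of being zeroed
    nn.Linear(64, 32),
    nn.BatchNorm1d(32),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(32, 1),    # single output neuron: the predicted price
)

x = torch.randn(16, 10)   # a batch of 16 houses, 10 features each
print(model(x).shape)     # torch.Size([16, 1])
```

With mean squared error as the loss, `nn.MSELoss()` would then compare this one output neuron against the actual price.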
Now, the optimizer choice: typically we can use something like Adam, as it does provide some benefits over stochastic gradient descent, and for more information on optimizers, I highly recommend you check out this video right over here. Next, the batch size: neural networks take in data in parallel, but how many samples can they take at a time? We could set this to be 8, 16, or 32. Now, this is a good example of me using my practical experience to set some of these hyperparameters. But there are some more fun and complex models for which we could refer to research papers. For example, the paper that introduced AlexNet has some pretty good guidelines on how to set various hyperparameters for convolutional neural networks, the networks that deal with spatial data like images or audio. So overall, I hope you can see how practical experience and research papers can help you choose appropriate hyperparameter values. Quiz time! It's that time of the video again. Have you been paying attention? Let's quiz you to find out. How can I set the number of convolutional layers for a digit classification problem? A, use my past experience as a starting point and manually tune hyperparameters. B, use my past experience as a starting point and perform a grid search. C, use research papers that solve similar problems as a starting point and manually tune hyperparameters. Or D, use research papers that solve similar problems as a starting point and perform a grid search. Note here that multiple options may be correct. So comment your answer down below and let's have a discussion. That's going to do it for quiz time and pass two of the explanation, but keep paying attention because I will be back to quiz you. So let's run through a Colab notebook to tune the hyperparameters of a neural network using grid search. In this case, first I need to install skorch.
skorch is a library that gives PyTorch neural networks a scikit-learn-compatible interface. So we install skorch, and we are also going to import it; specifically, we're going to import NeuralNetRegressor, because in this case we are going to determine the price of a house, which is a regression problem. Next we also import GridSearchCV to actually perform the grid search. GridSearchCV is part of scikit-learn, and this is the whole reason we need skorch: so that our PyTorch model can be used with scikit-learn. Now, we can scale our data with the StandardScaler, and we can split our data into train and test sets. We can use fetch_california_housing to get the California housing data set, along with other torch-related imports and NumPy. So we'll load the California housing data set, and we can see that we have about 20,000 examples with eight features right over here. We then fit the scaler and transform the input data. We can then split the data into an 80/20 split, where 80% is for training and 20% is for testing. Next, we convert the NumPy arrays to PyTorch float tensors. We then create a neural network regression model, where we have an input layer, a hidden layer, and an output layer that consists of only one neuron, and that one neuron is going to output just the price of the house. Along the way we also add some activation functions, specifically ReLU. Next, we take our regression model and package it within a skorch NeuralNetRegressor, where we use the mean squared error loss, because again this is a regression problem, and the Adam optimizer. Next, we actually perform grid search; specifically, we want to set some possible values of hyperparameters like the max epochs, the learning rate, and the batch size. In my case, I think the max epochs should be around one or two, the learning rate should be either 0.01 or 0.1, and the batch size could be 8 or 32.
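The notebook itself uses skorch around a PyTorch model, but the same grid-search setup can be sketched self-containedly with scikit-learn's built-in MLPRegressor on synthetic data, so it runs without PyTorch or skorch installed. With skorch you would swap the estimator for `NeuralNetRegressor(...)` and the grid keys for `max_epochs`, `lr`, and `batch_size`; the parameter names and the synthetic data below are stand-ins, not the notebook's exact code.

```python
import warnings
from sklearn.datasets import make_regression
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

# With only 1-2 epochs the net won't converge; silence that warning.
warnings.filterwarnings("ignore", category=ConvergenceWarning)

# Synthetic stand-in for the California housing data: 200 samples, 8 features.
X, y = make_regression(n_samples=200, n_features=8, noise=0.1, random_state=0)

# MLPRegressor's names for the hyperparameters tuned in the notebook:
# max_iter ~ max_epochs, learning_rate_init ~ lr, batch_size ~ batch_size.
param_grid = {
    "max_iter": [1, 2],
    "learning_rate_init": [0.01, 0.1],
    "batch_size": [8, 32],
}

net = MLPRegressor(hidden_layer_sizes=(64, 32), random_state=0)
grid = GridSearchCV(net, param_grid, cv=3)  # 8 configurations x 3 folds = 24 fits
grid.fit(X, y)

print(grid.best_params_)  # the best of the 8 configurations
```

The shape is identical with skorch: wrap the model, hand it to GridSearchCV with a param grid, call `fit`, and read off `best_params_`.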
Now we're going to feed GridSearchCV with the skorch model along with the set of grid hyperparameters over here, and then we have cv, the cross-validation parameter, set to three, which means that each configuration is trained three times on different splits. Looking at this, you can see that the best score was this value right here, and the hyperparameter set that corresponded to this best score was a batch size of 8, a learning rate of 0.01, and a max epochs of 2. Now, if you want to dig down into more details of what's going on here, you can print grid.cv_results_, and you'll get this wonderful blurb over here. What's important here is that, based on the hyperparameters and their values, it came up with the Cartesian product of all possible configurations, so there are eight possible models to train over here, and each of these eight models is trained three times. So the first time is split zero, the second time this was the performance, and the third time is split two, with this performance. Now we take the average of these three numbers for the first model right here, which is batch size 8, learning rate 0.01, max epochs 1, and we get this first value here as a metric of performance; similarly, you do it for the other seven right here.
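The Cartesian product and per-split averaging described above can be sketched with nothing but the standard library; the three split scores here are made-up illustrative numbers, not the notebook's actual results.

```python
from itertools import product
from statistics import mean

# The grid from the notebook: 2 x 2 x 2 = 8 configurations.
max_epochs = [1, 2]
learning_rates = [0.01, 0.1]
batch_sizes = [8, 32]

configs = list(product(max_epochs, learning_rates, batch_sizes))
print(len(configs))  # 8

# With cv=3, each configuration is trained once per split, and its final
# score is the mean of its three split scores (illustrative values here).
split_scores = [0.51, 0.49, 0.53]
print(mean(split_scores))  # the mean test score for one configuration
```

GridSearchCV does exactly this bookkeeping internally, exposing the per-split and mean scores through `cv_results_`.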
Now, the one with the highest score is going to be this one right over here. Looking at the rank of the test scores over here, the first rank goes to the second configuration right over here, which corresponds to a batch size of 8, a learning rate of 0.01, and a max epochs of 2, which is exactly what we saw returned by the grid search. So this configuration of hyperparameters would be the best for this current model. Now, we can change the param_grid parameter to include other hyperparameters, like activation functions, dropout rates, and anything in the list of hyperparameters that I mentioned before. Quiz time! Ooh, this is going to be a fun one. We cannot perform grid search on: A, activation functions; B, dropout rate; C, model edge weights; or D, optimization algorithms. Comment your answer down below and let's have a discussion. And if you think I do deserve it at this point and you love learning, please do consider giving this video a like, because it will help me out a lot. Now, that's going to do it for quiz time and pass three of this explanation, but before we go, let's generate a summary. The performance of a model depends on the model parameters; these are the edge weights of a neural network. The model's parameters in turn depend on the training data and on a set of other parameters called hyperparameters, and here's an example of such hyperparameters. To determine these hyperparameters, we find a ballpark starting point and then tune them. We can get this ballpark starting point through practical experience or through theoretical knowledge from reading research papers. Then you can tune them manually, or automatically using techniques like grid search. And that's all we have for today. To learn more about the optimization algorithms that I mentioned in this video, do check out this video on optimizers right over here. Thank you all so much for watching, and if you think I do deserve it, please do give this video a like, and I will
see you in the next one. Bye!