Yeah, I'm Nyle, and today I want to talk to you about how we can use Bayesian optimization to configure neural networks, and how we can get the best out of them using a different approach to traditional methods. Just a bit about me at the start: I'm currently a machine learning engineer at Argos, where I work in the supply replenishment and forecasting team, so demand forecasting methods and everything associated with that. Previously I was a data scientist at an enterprise search company called Lucidworks, and before that my background is in statistics and mathematics. So what I want to go through today: we'll do a brief, high-level overview of what neural networks are, then a deeper dive into what convolutional neural networks are and how we can use them, and following on from that, how we can tune those models using Bayesian optimization. Then hopefully we'll get to a demo of how we can implement that in Python. So, a quick show of hands: who has a background or experience working with convolutional neural networks, or neural networks in Keras, or has used Keras before? Okay, a few. Worry not if not, we'll step through each of the different aspects at a high level and build up some basic intuition of what neural networks are and then how we can use optimization with them. So, what is a neural network? This is your very generic neural network, which has actually been around since the 1960s. It's called a multilayer perceptron: you have this hidden layer in the middle and everything's connected together. Let's take a very simple toy example from the MNIST set. MNIST is a universally known dataset used to benchmark machine learning models; it's basically handwritten digits from zero to nine.
It's classically used to see how well our models perform, and it's basically an image recognition problem. So we feed our digits into the network. Everything here is connected together, and these connections are weights between what we input, what's inside, and what we output. What we output is 10 different classes, and we assign a probability to each of those classes, hopefully assigning the highest probability to the digit we actually put in, and then computing some kind of loss against that input. Basically this is an iterative process: we throw in our data, we iterate over all of it, we output what we think the class is, we log the loss, and then we go back and update the weights within the model itself, typically by some kind of gradient descent via backpropagation. We keep iterating and logging this loss, and hopefully we get a loss curve that goes down and converges. So ideally we get loss and accuracy curves that look a bit like this. Here we have training loss and validation loss. Typically in machine learning we have our training data and our test data: the test data is what we hold out to test our model against, and the training data is what we actually train the model on. We can then split the training data further into training and validation sets, holding out the validation data to see how the model is doing as it trains. It's an iterative process, as I say, to see how it's performing.
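The iterate, score, and update loop just described can be sketched in a few lines of NumPy. This is a minimal illustration with a single-layer model; the toy data, learning rate, and epoch count are all invented for the example, not from the talk:

```python
import numpy as np

# Toy data: 200 points in 2D with a linearly separable label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w = np.zeros(2)
b = 0.0
losses = []
for epoch in range(100):
    # Forward pass: predicted probability for each example.
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    # Log the loss (binary cross-entropy) at this iteration.
    loss = -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
    losses.append(loss)
    # Backpropagate: gradient of the loss w.r.t. the weights.
    grad_w = X.T @ (p - y) / len(y)
    grad_b = np.mean(p - y)
    # Gradient-descent update of the weights.
    w -= 0.5 * grad_w
    b -= 0.5 * grad_b

print(losses[0], losses[-1])  # the loss should fall and flatten out
```

The `losses` list here is exactly the curve the slide shows: it starts high and converges downwards as the weights are updated each epoch.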
So with our loss we should hopefully see it going down and converging to a certain point, and with our accuracy, which is essentially the converse of the loss, we should see it going up and also converging. If we see these plots diverging, it's evidence of overfitting: the model is fitting itself to the training data and becoming less generalizable, so when we test it against the validation data it doesn't do so well. So with that high-level view of what a neural network is, what is a convolutional neural network, and how can we train one? Basically, a convolutional neural network involves a series of what are called convolutions. This might be a bit daunting at first look, but we'll break it down step by step and build a high-level intuition of what's going on. Convolutional neural networks are really effective at image recognition and classification problems, which is what our problem at hand will be. So, to start: how does our model see the data itself? Excuse the rough image, but effectively any picture can be represented as a matrix of values. Going back to our handwritten digits, here we have a hand-drawn eight, and we can represent it as a matrix of values where white space is zero, going up to 255, which is a measurement of the intensity of that pixel. So here we have an 18 by 18 grid, and each entry is a value from zero to 255 gauging the intensity of that pixel. Traditionally, neural networks have what is effectively tunnel vision, in that we feed in data that's perfectly centered.
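As a tiny illustration of this representation, here is a made-up matrix standing in for the 18 by 18 grid, together with the divide-by-255 normalization that usually follows before training (the pixel values are invented for the example):

```python
import numpy as np

# An image is just a matrix of pixel intensities: 0 for white space,
# up to 255 for a fully intense pixel. A tiny 5x5 made-up "image"
# stands in for the talk's 18x18 grid.
img = np.array([
    [0,   0, 120,   0, 0],
    [0, 200,   0, 200, 0],
    [0,   0, 255,   0, 0],
    [0, 200,   0, 200, 0],
    [0,   0, 120,   0, 0],
], dtype=np.uint8)

# Scale into [0, 1] so the network trains on normalized inputs.
normalized = img.astype(np.float32) / 255.0
print(img.shape, normalized.max())
```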
Each time the network sees that perfectly centered image, it just classifies it as one instance, meaning it doesn't generalize well to images where the data is not perfectly centered. If we show it test data that is perfectly centered it'll do really well; for anything else it's going to be way off. So how can we solve this? There's definitely a way to brute-force it: we can add more layers to our network, or we can augment the data and duplicate it to get much more training data to work from, so the network effectively learns that if the digit is in a different position it's still the same digit. But is there a smarter way? That comes in the form of convolution. Take this very generic picture of a kid on a toy horse. As humans we instantly recognize the structure of the picture: we see the kid on top of the horse, the horse on the grass, the kid in the backyard. The important thing is that we recognize the idea or concept of the child regardless of the background we see the child in. This is called translation invariance: regardless of where we see that child, on whatever background, we still recognize the concept of the child. So how can we translate that into our model? Going back to our eight recognizer, how can we tell the model that an eight is an eight regardless of where it is within the image? We need some way of getting translation invariance into our model, and convolution is the solution. Here's a very high-level view of how it works; I'll skip the maths in the interest of time and to keep this engaging. If we take our image of the child, convolution essentially works as a sliding window over the image itself.
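The sliding-window idea can be sketched directly in NumPy. This is an illustrative implementation (no padding, stride one), not the talk's code; the vertical-edge kernel and the toy image are assumptions for the example:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1), summing the
    element-wise products at each window position."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical-edge-detecting kernel: it "lights up" wherever the pattern
# it encodes appears, which is exactly the translation invariance we want.
image = np.zeros((6, 6))
image[:, 3] = 1.0                       # a vertical line at column 3
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])
response = convolve2d(image, kernel)
print(response.shape)  # (4, 4): the window shrinks the spatial dims
```

The strongest responses appear around column 3 no matter where in the image the line sits, which is the point: the filter detects its concept wherever it occurs.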
Think of it as splitting the image into a grid and scanning over it, moving that window across the image. Effectively what this does is learn concepts from the image and then distill them down, so we're shrinking our space and distilling it at every step. Going back to our original convolution setup, you can see that we layer up these convolutions along with another operation called pooling. The overarching idea is that at each level we take some kind of notion or concept within the image and distill it down, and at the end we condense it all together in a flatten and fully connected layer, and then we output, as we saw with our original model, some kind of probability for the class the model is predicting. So that's the very high-level view of what a convolutional neural network is, but the tricky thing is: how do we construct the right convolutional neural network? You saw how there were many, many layers, and you can stack as many of these layers as you want. If I quickly rewind to this slide: at each layer the network can learn some kind of concept. If we take this picture of a car, in the first layer it might learn the concept of a round shape; in the next layer it might learn the concept of a wheel, which builds on its concept of a round shape; and in the next layer it might recognize the concept of a car, which it recognizes from the concept of a wheel. You're building up like that, and you can have as many layers as you want, but with each of these layers comes a multitude of hyperparameters that you can then tune.
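The pooling step that does the "distilling" can be sketched the same way; this is an illustrative 2x2 max-pooling function, not the talk's code:

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Max pooling: keep only the strongest response in each size x size
    block, shrinking the spatial dimensions. This is the distilling step
    that sits between convolutions."""
    h, w = feature_map.shape
    cropped = feature_map[:h - h % size, :w - w % size]
    blocks = cropped.reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

fm = np.arange(16, dtype=float).reshape(4, 4)  # a toy 4x4 feature map
pooled = max_pool(fm)
print(pooled)  # each 2x2 block collapses to its maximum
```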
Things such as the number of layers you want; the number of convolution filters, which is essentially how many outputs you have from each of those convolutions; the number of epochs, which is the number of times you iterate over the data and update your model; and things like batch size and activation function. The number of configurations you can have runs into the millions, which is why neural networks are so notoriously difficult to train. It's a trial-and-error process, and sometimes just by tweaking hyperparameters by a slight amount you can get vastly different results. So how do we find the optimal hyperparameters? The methodology by which we try to find them is called hyperparameter optimization, and traditionally two approaches stand out, widely used across machine learning wherever we have some kind of hyperparameter to tune: grid search and random search. Grid search involves laying a grid over the domain of possible values and exhaustively running over it. As you can probably guess, that is computationally a lot to do, though it's very parallelizable in the sense that you can distribute it and run over it quickly. But you can often miss the optimal values, and at the same time the parameter space grows exponentially with the number of hyperparameters you have to tune: you run into O(n^k) complexity, in the sense that if you have n candidate values for each of k hyperparameters, the size of the search space grows exponentially.
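That blow-up is easy to make concrete; the candidate values below are hypothetical, chosen just to show the count:

```python
from itertools import product

# With n candidate values for each of k hyperparameters, the grid has
# n**k cells, and each cell is a full training run of the network.
grid = {
    "n_layers":   [1, 2, 3],
    "n_filters":  [16, 32, 64],
    "batch_size": [32, 64, 128],
    "epochs":     [5, 10, 20],
    "activation": ["relu", "tanh", "sigmoid"],
}
configs = list(product(*grid.values()))
print(len(configs))  # 3**5 = 243 models to train exhaustively
```

Adding one more hyperparameter with three candidates triples the count again, which is the O(n^k) growth described above.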
Another option, which seems very basic, is simply random search: you take your domain space, specify some kind of probability distribution to sample from, and effectively hope to randomly land on the optimal value. If you liken the process of finding the optimal hyperparameters to surveying a landscape where you want to find the peak of the mountain, grid search would involve laying a grid over the whole area and walking along it, hoping to pass over the peak; random search would be like jumping out of a plane multiple times and hoping to land on the tip of the mountain. As well as these tuning algorithms work, they leave a bit to be desired. So how can we leverage what we've previously learned to make a more informed decision about which hyperparameters to look at next? That comes in the form of Bayesian optimization. There's actually a whole field of statistics dedicated to Bayesian optimization; again, I'll keep it very light on the maths and just build a high-level intuition around the framework of how we do it. So what is Bayesian optimization? It's a very powerful tool for globally optimizing objective functions, which in the case of our neural network is going to be our loss function, and specifically functions which are very costly or slow to evaluate. This is extremely pertinent in the context of neural networks, which take relatively long to train, so it's not very efficient to simply grid search or random search when it takes so long to train the model each time. Bayesian optimization falls within a class of algorithms called sequential model-based optimization.
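A random-search sketch, with invented distributions and budget (none of these choices are from the talk):

```python
import numpy as np

# Instead of walking a grid, draw each hyperparameter from a distribution
# over its domain and train on whatever configuration we land on.
rng = np.random.default_rng(42)
budget = 20  # number of configurations we can afford to train

samples = [{
    "n_layers":      int(rng.integers(1, 4)),           # 1..3 layers
    "n_filters":     int(2 ** rng.integers(4, 8)),      # 16..128 filters
    "learning_rate": float(10 ** rng.uniform(-4, -1)),  # log-uniform sample
} for _ in range(budget)]

for s in samples[:3]:
    print(s)
```

Note the log-uniform draw for the learning rate: sampling on a log scale is the usual choice when a hyperparameter's plausible values span orders of magnitude.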
Sequential in the sense that you're leveraging what you've previously seen, your history of evaluations. Model-based in the sense that we're going to build some kind of surrogate or proxy function, and it's that surrogate function that we'll then optimize. So mathematically, if we have our loss function f(x), in the case of our neural network we want to find the arguments, the hyperparameters, that minimize that loss function. To do Bayesian optimization itself, we initialize a Gaussian process on a small initial set of samples from our domain, which enables us to compute a prior probability model. Then at each step we sequentially select new locations in the domain by optimizing our surrogate, which is called an acquisition function. We iterate each time and update that prior probability distribution, updating our belief about what we think our loss function, our objective function, looks like given what we've computed. Basically we're leveraging what we've seen before and asking ourselves: given what I've seen, what do I believe is the next best hyperparameter configuration to minimize that loss? A key concept here is the Gaussian process. A Gaussian process is essentially a generalized version of the Gaussian, or normal, distribution, but it is effectively a distribution over functions. As we see here, we assume our loss function is Gaussian-process distributed, with a mean function mu(x) and a covariance function k(x, x'). The covariance function is also called the kernel, and it describes the smoothness and other properties of our loss function.
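A minimal NumPy sketch of a Gaussian-process posterior, assuming an RBF (squared-exponential) kernel with unit length scale; the training points are invented for illustration:

```python
import numpy as np

def rbf(a, b, length_scale=1.0):
    """Squared-exponential covariance k(x, x'): encodes how smooth we
    believe the underlying function is."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_test, noise=1e-8):
    """Closed-form posterior mean and variance: the analytically
    tractable update that lets us fold in each new observation."""
    K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = rbf(x_train, x_test)
    K_inv = np.linalg.inv(K)
    mean = K_s.T @ K_inv @ y_train
    var = np.diag(rbf(x_test, x_test) - K_s.T @ K_inv @ K_s)
    return mean, var

# Three observed points; query at two observed and one unobserved location.
x_train = np.array([0.0, 2.0, 5.0])
y_train = np.sin(x_train)
mean, var = gp_posterior(x_train, y_train, np.array([0.0, 2.0, 3.5]))
print(mean, var)
```

At the observed points the posterior mean matches the data and the variance collapses towards zero; away from them the variance grows, and that uncertainty is exactly what the acquisition function exploits.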
The important thing is that this induces a posterior distribution that's analytically tractable, which means at each step we can update on what we've previously seen and leverage our historical data. Quickly, on what an acquisition function is: it's the proxy function that we repeatedly update and optimize. It comes in three main flavors: expected improvement, upper confidence bound, and maximum probability of improvement. The acquisition is a function of the posterior, which we continually update, and the thing we want to balance is exploration versus exploitation: exploitation in the sense of continuing to look at values around what we already know to be the maximum, versus exploration, going to different regions where we have higher variance. So here's a really quick one-dimensional example of how this works. We have a one-dimensional function whose maximum is at x equals two, and we bound it between minus two and ten. We initialize with two points. At each step, if you look just below, that's our acquisition function, and we want to maximize that utility. It's important to note that we don't know the function itself beforehand; we only learn about it by probing points and then gauging our expected improvement to see where to go next. So we go to around x equals four, our next point is at x equals ten, and we just iterate over and over, and quickly we get towards the maximum at x equals two. So, putting this all together: if we have many, many hyperparameters, how can we find the optimal set? That leads us on to the demo.
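The loop just described can be sketched with scikit-learn's Gaussian process and the expected-improvement acquisition. The objective below is a stand-in bumpy function with its global peak near x = 2 on [-2, 10] (the talk's exact function isn't specified); the candidate grid, seed, kernel, and iteration budget are all assumptions:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def f(x):
    # Stand-in objective: two bumps, global maximum near x = 2.
    return np.exp(-(x - 2) ** 2) + np.exp(-(x - 6) ** 2 / 10) + 1 / (x ** 2 + 1)

rng = np.random.default_rng(1)
X = rng.uniform(-2, 10, size=2)          # two initial probes
y = f(X)
candidates = np.linspace(-2, 10, 500)    # dense grid to optimize EI over

for _ in range(15):
    # Fit the GP surrogate to everything probed so far.
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5),
                                  alpha=1e-6, normalize_y=True)
    gp.fit(X.reshape(-1, 1), y)
    mu, sigma = gp.predict(candidates.reshape(-1, 1), return_std=True)
    # Expected improvement over the best value seen: trades off
    # exploitation (high mu) against exploration (high sigma).
    best = y.max()
    with np.errstate(divide="ignore", invalid="ignore"):
        z = (mu - best) / sigma
        ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)
        ei[sigma == 0] = 0.0
    x_next = candidates[np.argmax(ei)]   # probe where EI is maximal
    X = np.append(X, x_next)
    y = np.append(y, f(x_next))

print(X[np.argmax(y)])  # best x found by the loop
```

Each of the 15 iterations costs one evaluation of `f`, which is the whole point: when one evaluation means training a neural network, spending a cheap surrogate optimization to choose the next probe is a very good trade.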
So what we'll look at is an image classification problem using the Fashion-MNIST set. Where the original MNIST set had handwritten digits, here we have 60,000 training examples of grayscale pictures of clothes, released by the online fashion company Zalando. It's the same setup as MNIST: 60,000 training examples, 10,000 test examples, each a grayscale picture of 28 by 28 pixels. Sorry, one minute? Okay, cool. I'll quickly run through this then, sorry, I'm running a bit over. If you want to go to the GitHub as well, it's NileTurbo forward slash PylonDinium, you'll find the demo as well as what we've just been through. What I'm using here is Google Colab, effectively Google's VM in the cloud; especially for machine learning tasks, it lets you use a GPU in the cloud, which is really cool. And this is essentially just an IPython notebook, the same notebook that's on GitHub for anyone who wants to check it out. So, running through what we have here quickly, as I'm conscious of time. We're using Keras, which effectively lets us quickly build and train deep learning models, so we do our imports. Also, just to note, I'm setting a random seed here; usually in deep learning you wouldn't set a seed, as the neural network inherently relies on randomness within it. Then we define our Fashion-MNIST class and specify our inputs: these are going to be the parameters that we optimize. I'm using a widely known architecture for image recognition and classification, where you stack a few convolutions on top of each other, have a couple of dense layers, and then output your prediction.
So here we have the number of convolution filters, basically what's coming out of each convolution, and dropout, which is a method of preventing your model from overfitting by dropping a certain fraction of units at each convolution. I've set default values, which are then the values we'll optimize. First, loading the data itself: the Fashion-MNIST data can be loaded from within Keras. We specify the shape of our inputs; the depth is only going to be one because it's a black-and-white image. If it were colored we'd have three different channels. We reshape, then divide by 255 to normalize the data, because our pixel values range from zero to 255. Then we construct the architecture of the neural network itself: we layer up the convolutions and then add the two dense layers. At each convolution we scan over the image, with the kernel size being the grid that we're scanning across; we set the output shape, and we use an activation function. Within neural networks these activation functions are non-linear transforms. Okay, sorry, I'm running out of time here, but it's all on GitHub. If you want to run through it, it's pretty self-explanatory and I've commented everything, but yeah, sorry, I ran a bit over. If anyone has any questions, just catch me outside at the end. Yeah, thank you.
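A hedged Keras sketch of an architecture in this style: stacked convolutions with pooling and dropout, then dense layers and a 10-way softmax. The filter counts, kernel size, dropout rate, and dense width are illustrative defaults, not the talk's tuned values or its actual notebook code:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(n_filters=32, kernel_size=3, dropout=0.25, dense_units=128):
    """Build a small CNN; the arguments are the hyperparameters a
    Bayesian-optimization loop would tune."""
    model = keras.Sequential([
        keras.Input(shape=(28, 28, 1)),               # depth 1: grayscale
        layers.Conv2D(n_filters, kernel_size, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(n_filters * 2, kernel_size, activation="relu"),
        layers.MaxPooling2D(),
        layers.Dropout(dropout),                      # guard against overfitting
        layers.Flatten(),                             # condense the feature maps
        layers.Dense(dense_units, activation="relu"),
        layers.Dense(10, activation="softmax"),       # one probability per class
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_model()
print(model.output_shape)  # (None, 10)
```

Wrapping the hyperparameters as function arguments like this is what makes the model pluggable into an optimization loop: the tuner calls `build_model` with a candidate configuration, trains briefly, and reports the validation loss as the objective value.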