Okay, so before I start I should warn you: this is almost completely math, except for one slide of biology. I was also supposed to make a reference implementation, but I never quite got to that point. So you have been warned, there will be math, quite a bit of it, okay? So anyway, today I'm going to talk about neural networks and deep learning, as they call it nowadays. The main impetus for this is that I had to implement a number of neural networks, mostly for learning purposes, as well as for some MOOCs. One of the problems I was having was that I had very little clue as to what was actually going on inside a neural network. So what I did, as part of preparing this talk, was to try to derive the backpropagation algorithm, which is the key to training neural networks, and to see how it all fits together. The main thrust is that I will relate neural networks to nonlinear regression, and show how to look at them as part of a bigger class of problems, a way of looking at things mostly in terms of regression, okay? But before I do that, let me quickly go through what real neurons are, and give you a feel for why they have almost nothing to do with neural networks. Okay, so this is a real neuron. This is an image of the LGMD, which stands for the lobula giant movement detector, a neuron I used to work on at one point in time. These are the dendrites, which take in data from the eye; they do dendritic computation, and the result goes down the axon and synapses onto something called the DCMD. This is one neuron. And what this neuron is capable of doing is pattern recognition, and it can actually do collision detection.
So think about that: this one neuron, not a network of neurons, just a single neuron, is capable of doing computations that self-driving cars need to do. Quite a big chunk of the computation done by neural networks in those cars is done here by a single neuron. And this neuron sits just behind the eye of a very, very simple animal, the locust, specifically Schistocerca americana. So this is a real neuron; neural networks have very, very little to do with any of this. This is just for reference. Okay, so the way I'll build up neural networks is from the viewpoint of regression. In the abstract formulation, regression is defined by two things: the target y, which comes from the training data set, and the prediction y hat, which is what your model actually generates. x is your feature vector. What you are really interested in is deriving the betas, which are the parameters of the model. The model is written down as a function f, which takes in your feature vector together with a set of parameters that you do not know and want to find, and produces an estimate, a prediction. So it describes a mapping from the feature vector and the parameters to the prediction. And we want the error to be very small in some sense; we want y hat to be very, very close to y, essentially. That's the abstract formulation of regression. To make this more concrete, what we are saying is this: given a target vector y and feature vectors x_i, which together form a matrix X, where i is the index of the observation, we wish to find a beta, the parameters of the model, such that the distance between the target vector y and the prediction vector y hat is very small.
In other words, we are looking at a minimization problem here. Over all possible values of beta, we want this cost function f, which takes in y and y hat, where y hat is the model's prediction, to be as small as possible, over all i, and we want to find the betas that achieve that. So this is the cost function of regression. The next part is that, in order to find this beta, we use gradient descent, which is a method of optimization. To compute the minimizer beta, the parameters we were talking about previously, the gradient descent algorithm, which is written over here, takes the current best guess of beta and subtracts off a learning factor, the lambda over here, multiplied by the gradient of your cost function, which you have to work out, and which also depends on the current parameters. So from the current step you estimate what your next step is going to be, using the gradient operator and the learning rate. Beta zero, the initial guess that you need to start the algorithm, can be randomly initialized; what most people do is initialize a random vector with entries between zero and one. The stopping condition is met when the difference between the current step beta_k and the next step beta_{k+1} is very, very small, which means the algorithm is not going anywhere and you are very close to a proper solution, essentially. There are also some conditions that must be met: the gradient should be Lipschitz continuous, which is a strong form of uniform continuity, and local convergence depends on lambda.
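As an aside, the update rule and stopping condition just described can be sketched in a few lines of NumPy. This is my own illustrative sketch, not the reference implementation I mentioned; the names `grad_f`, `lam`, and `tol` are mine:

```python
import numpy as np

def gradient_descent(grad_f, beta0, lam=0.1, tol=1e-8, max_iter=10_000):
    """Iterate beta_{k+1} = beta_k - lam * grad_f(beta_k) until the
    step ||beta_{k+1} - beta_k|| falls below tol (the stopping
    condition from the slide)."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        beta_next = beta - lam * grad_f(beta)
        if np.linalg.norm(beta_next - beta) < tol:
            return beta_next
        beta = beta_next
    return beta

# Toy problem: minimize f(b) = (b - 3)^2, whose gradient is 2*(b - 3),
# starting from the initial guess beta0 = 0.
beta_star = gradient_descent(lambda b: 2 * (b - 3.0), np.array([0.0]))
```

The stopping test compares successive iterates, exactly the condition above: when the current step and the next step barely differ, the algorithm has stalled near a solution.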
So what happens is that it is not true for all problems that you will converge; it depends on lambda, so it is possible for the solution to diverge. This is a picture of how it works. You start off with an initial guess over here, and this is the minimizer, where your betas are going to be, and the algorithm moves bit by bit towards that beta. At each iteration, the successive estimates of beta get closer and closer to the minimal solution, which is over here. The other thing to note, and I just want to draw this quickly so everyone gets a feel for it: you can also represent it like this, a one-dimensional version of that picture. Say this is the minimizer. One of the problems gradient descent has is that your steps depend on the lambdas. If you choose it correctly, you start somewhere here, you get here, then eventually here, then eventually here, and you find the true solution. But if your lambda is too big, you can overshoot to over here, then try to come back, and the iteration gets completely lost and never converges. This is one of the problems with gradient descent from a theoretical point of view, and it also happens in practice if you don't select your lambda correctly for the problem you have. Unfortunately, a lot of this is iffy; you kind of have to guess what a good value is going to look like. Okay, so let me give you a concrete example. This is classical linear regression written in this abstract framework. Once again, this is the model f, which takes your feature matrix X multiplied by the vector of betas, your parameters.
And what you want to do is minimize this in the square norm, the square of the L2 norm, in other words your squared error. So what happens is that if we write this down, we get the minimum over beta of f, which you saw previously, and when you do the minimization in the L2 norm squared, you get this minimizing cost function, which you see over here, and which I think you see in very conventional texts on regression analysis. So to find that beta, the parameters, we can apply gradient descent. Once again, this is what we saw in the previous couple of slides, but now we substitute in the cost function, which is over here, and which we have to take the gradient of. You do some computation and come out with a gradient that looks like this, where these are, once again, the feature vectors, multiplied by the residual against the target over here. And this is classically what you do in linear regression: if you set the gradient to zero, which is what you want at the minimum, you get the normal equations, which are what you classically use to solve linear regression. In the nonlinear case, though, you have to use gradient descent as one of the methods you apply. There are also higher-order methods that converge faster, but I'm not going to talk about those here. So, now we get to neural networks. The way I perceive neural networks is that they are just an extension of regression to the nonlinear case. So, we replace that very nice X-times-beta function that was our model.
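Before moving on, the linear case is worth sketching, because you can check gradient descent against the normal equations directly. A minimal sketch on synthetic, noiseless data (all names here are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # 100 observations, 3 features
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true                      # noiseless targets, for clarity

# Gradient of the (mean) L2-squared cost (1/2n)||y - X beta||^2
# is -X^T (y - X beta) / n; apply the gradient descent update.
beta = np.zeros(3)
lam = 0.1
for _ in range(5000):
    beta -= lam * (-X.T @ (y - X @ beta)) / len(y)

# Setting the gradient to zero gives the normal equations
# (X^T X) beta = X^T y, solved here in one shot.
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)
```

Both routes land on the same betas, which is the point of the slide: gradient descent is the general tool, and the normal equations are the special closed form you get in the linear case.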
You now have a somewhat more complicated set of equations over here. Once again, y is the target vector and x is the feature vector. The input layer of our model is described by this particular equation: x equals a1, the first layer. The hidden layer is described as beta one multiplied by a1, so once again you see something like linear regression coming out, and this is passed through a nonlinearity g, which is still part of the hidden layer. Then it goes to the final output layer by multiplying again by another set of parameters. So, basically, this model has beta one as a parameter and beta two as a parameter. And then you apply, once again, the very simple cost function we saw previously, the least-squares error, the L2 norm squared. The other thing to keep in mind when you write down networks like this is that the nonlinearity g over here must be a differentiable function. Usually you use something like the hyperbolic tangent, which is the very standard choice in neural networks. I will not fix a specific function, because I like the more general formulation; you can substitute whatever function you like. So, this is what it looks like in pictures. This equation over here gives the first set of nodes, the input layer; these two equations describe the hidden layer; and of course the output layer is described over here. One thing to keep in mind is that a matrix, which people usually think of as an array of entries a11, a12, and so on, a21, a22, and so on, actually describes a network very easily.
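In code, the two-layer model just described might look like the following sketch. I'm assuming tanh as the nonlinearity g; the shapes and names are mine:

```python
import numpy as np

def forward(x, beta1, beta2, g=np.tanh):
    """a1 = x is the input layer; z2 = beta1 a1 is the linear part of
    the hidden layer; a2 = g(z2) applies the nonlinearity; the output
    layer is another linear map, y_hat = beta2 a2."""
    a1 = x
    z2 = beta1 @ a1
    a2 = g(z2)
    return beta2 @ a2

rng = np.random.default_rng(1)
beta1 = rng.normal(size=(4, 3))  # 3 input features -> 4 hidden units
beta2 = rng.normal(size=(2, 4))  # 4 hidden units -> 2 outputs
y_hat = forward(np.ones(3), beta1, beta2)
```

Notice that each layer is just the linear-regression map X-times-beta again, with a differentiable nonlinearity sandwiched in between, which is exactly the "extension of regression" view of the talk.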
So, what happens is that you think of the first index as i and the second as j, and if the entry at position (i, j) actually exists, is nonzero, then there is a real weight on the connection from node i to node j; if the entry is zero, there is no connection. So, for example, a matrix like [[a, a], [0, 0]] describes a two-node network where node one has weights going out to both nodes, but node two has no outgoing weights at all. By zeroing out entries you can describe weights going only in one direction, weights going backwards, or no connection. So this is the kind of network that can be described by matrices, and the matrix representation is another way of looking at neural networks which is quite useful from a computational point of view. Okay, so this brings us to what we're actually very interested in, which is backpropagation. Backpropagation is the fundamental algorithm used to solve, to train, neural networks, essentially. What this algorithm does is only one thing: it computes the derivative of f. If you look back at gradient descent in the general case, what backpropagation gives you is a fast way of computing that particular gradient, and that's it. The rest of it is basically standard regression, more or less. Okay, so this is the definition of backpropagation. This particular gradient is a special delta, usually called the delta rule, multiplied by a, which is the activation we saw previously. And if node j is an output node, then the delta just boils down to this simple equation: the output minus the target at the top level.
Alternatively, if this is one of the internal nodes, then you have to use this set of equations. This is the derivative of the nonlinearity, and this is the z you saw previously, with the deltas taken from the next layer along, multiplied by that layer's betas, the parameters. And you use gradient descent to minimize over beta. The way to derive backpropagation is to apply the chain rule continuously. It's very simple; it's just repeated application of the chain rule. You start with the derivative you are looking for, apply the chain rule, and pull out the z variables, which are the internal node variables, and then you take the derivative of this. This is just the chain rule. And you see very quickly that this particular term is equivalent to just the a term over here: deriving from the original expression, if you take the derivative of the z over here with respect to the beta, everything else goes to one and you just get the a term. Very simple. You just get a over here. And this term we define as a delta, and I'll show why that is the case. One thing to note about this set of equations is that the parameter beta_ji appears only through z_j, because if you look at the set of equations, it doesn't appear anywhere else except in this particular one, so only in this particular layer. For example, in this layer, z3 depends only on the beta twos, and z2 depends only on the beta ones. Okay? This breaks the chain rule down and makes it very, very simple to understand. Okay?
So, basically, that deals with breaking this down into a very simple set of equations. For the definition of delta, we just identify this delta with the partial derivative with respect to z_j. We see very simply that, for an output node, this particular term equals z_j minus y_j, because what we are doing is just taking the derivative of this particular term, and it's a simple calculation to generate that value, which is this. Internal nodes are a bit more complicated. What happens with an internal node is that you have to take partial derivatives through all the nodes it feeds into; you can't just take the derivative through itself. So you take a summation over them to get something like this. And once again, this is just the definition of the internal nodes: you take the deltas from the next layer along. This comes out straightforwardly when you do the derivation, which is over here; it's just the derivative of this with respect to the previous layer, so you get something like this. Doing a bit of substitution, you get back the original definition I showed you previously. So it's a very, very simple concept: just repeated application of the chain rule to get that set of equations. Once you have it in elementary components, you can rewrite the delta rules in matrix or vector notation. These all become vectors instead, you put a matrix over here, and that's a vector again. And you can very simply compute the derivatives with respect to beta one and beta two from this series of equations. So it's quite straightforward.
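To make the delta rules concrete, here is a sketch of one backpropagation pass for the two-layer model, for a single observation. I'm assuming g = tanh, so g'(z) = 1 - tanh(z)^2; the function and variable names are mine:

```python
import numpy as np

def backprop(x, y, beta1, beta2):
    """One backpropagation pass for the two-layer network with g = tanh.
    Returns the gradients of the L2-squared cost with respect to
    beta1 and beta2, via the delta rules."""
    # Forward pass: keep the intermediate activations.
    a1 = x
    z2 = beta1 @ a1
    a2 = np.tanh(z2)            # g(z2)
    y_hat = beta2 @ a2
    # Backward pass.
    delta_out = y_hat - y       # output-node delta: output minus target
    # Internal-node delta: g'(z2) times the next layer's deltas
    # pulled back through that layer's betas.
    delta_hid = (1.0 - a2**2) * (beta2.T @ delta_out)
    # Each gradient is a delta multiplied by an activation.
    grad_beta2 = np.outer(delta_out, a2)
    grad_beta1 = np.outer(delta_hid, a1)
    return grad_beta1, grad_beta2
```

A quick finite-difference check against the cost (1/2)||y_hat - y||^2 is a good way to convince yourself the deltas are right before wiring this into gradient descent.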
Okay, and at this point it's very easy to implement on a computer, because essentially vectors are just one-dimensional arrays and matrices are two-dimensional arrays, and you just apply some linear algebra package, like what's available in NumPy or, if you want, MATLAB, and you get the results quite easily. So, that's basically it, and the reference I used is Pattern Recognition and Machine Learning by Bishop, which is one of the very good references. There are also a bunch of references on the web, but I found them not to be particularly useful. So that covers my talk. Thank you very much. If anyone has any questions; if not, I'll be around after this. And I guess Raymond wanted to find out whether there's any interest in an applied version of this presentation. So, the thing about this is that there are lots of packages around which do all of what you see here. Python has a bunch of them; Theano is one of the packages, and then you have things like TensorFlow, which are available. So it's very easy to use them now. The problem, of course, is that you need some understanding of what goes on behind the scenes; that's why I don't like to use things without at least a bit of understanding of the underlying algorithm. So, does anyone here want a talk about those packages? Yes? One person? Okay, then I might. I think there may be a bit more to those packages, because you have different kinds of transfer functions for the neurons, and different architectures of nets that suit different applications. Oh yeah, okay. So, I'm not super familiar with those; I've mostly just played around with them, so I really can't say too much about them. That's all I can say. Sorry about that. Any other questions?
Actually, you can find tutorials online at deeplearning.net. They have a variety of tutorials, and what they give you is the model and the definition of the layers; everything else you just need to understand. I think the thing with deep learning is that an actual deep network is very complicated compared with this one, so it's very hard for a newbie to code one up from scratch at the beginning. You need to refer to a tutorial, and one source is deeplearning.net, where you can find a Theano tutorial. Very good. Okay, yeah. I have one question, sorry. How do you make these slides with such beautiful mathematical equations? Which software do you use to make your slides? Oh, to make my slides? This is just LaTeX. LaTeX is the standard if you want to write equations. The other thing you can use, which I also use a lot, is R's presentation tools, which allow you to include quite a lot of math. But these are just simple LaTeX and Beamer. Yeah. So, yeah, sorry, I had a question. Thanks so much for your presentation; it's very important for me. I have two questions. First, why is it called deep learning, and what is it actually doing? So, okay, because I come more from a regression background, this neural networks and deep learning stuff is to me more like an application of regression, and I don't know it very well; I'm just getting into it too. The way I perceive it is that it allows you to stack a whole hierarchy of models which feed into each other. That's kind of the idea. The ones we are looking at here are called neurons, but there are other neural models out there, things which take into account the synapses and channel conductances and so on.
But it's not really a neural network in the biological sense; it's more like a stacking of nonlinear models, from my point of view. Which brings me to my second question: how do you go about choosing these nonlinear models, when you have everything to choose from? So, okay, firstly, as I said, I'm a real beginner here. From what I can understand, most of it is heuristics; the underlying theory is a complete mess. Take the algorithms they use most of the time, for example stochastic gradient descent. Stochastic gradient descent is the gradient descent method I showed you just now, but instead of looking at all of the records at each step, it selects a subset of the records, so it doesn't use all of the data at once. For things like that there is no guarantee of convergence at all. We have no idea, from a theoretical point of view, why they even converge or how to make them work in general. So it turns out that a lot of it is tuning: you play around with the parameters until you get something that works. I'm not a big fan of that, but that's how it is with neural networks, yeah. Okay. So that will not be the ideal case, right, where the classic case converges to a global minimum? Yeah, no, there's no way. Even for general nonlinear problems, the best you can generally hope for is convergence on convex problems, and there's no guarantee these problems are convex. Although apparently, and I've seen a paper which I've not read because the theory is quite deep, for some reason the more layers you stack, the deeper the well around the solution becomes. But the problem is still there: you can still get trapped in local minima very easily. That's all I can say. Yeah. Oh yeah, and the other thing is that they basically use regularization to prevent overfitting and things like that.
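For what it's worth, the stochastic variant I mentioned is a small change to the gradient descent sketch from earlier: each update uses the gradient on a random mini-batch of records rather than the whole data set. Again, this is my own illustrative sketch, shown on the linear least-squares problem so the answer can be checked:

```python
import numpy as np

def sgd_linear(X, y, lam=0.05, batch=10, epochs=200, seed=0):
    """Stochastic (mini-batch) gradient descent for least squares:
    each update uses the gradient on a random subset of the records,
    not the full data set."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(y))          # reshuffle each epoch
        for start in range(0, len(y), batch):
            pick = order[start:start + batch]
            Xb, yb = X[pick], y[pick]
            # Same gradient as before, but only over the mini-batch.
            beta -= lam * (-Xb.T @ (yb - Xb @ beta)) / len(pick)
    return beta

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
beta_true = np.array([2.0, -1.0, 0.5])
beta_sgd = sgd_linear(X, X @ beta_true)          # noiseless targets
```

On this convex toy problem it converges nicely; the caveat from the discussion stands, namely that for deep, non-convex models there is no such guarantee and the step size and batch size end up being tuned by hand.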
So that kind of helps, but I think that's a somewhat ad hoc way of doing things. Any other questions? All right. Okay, yeah. Thank you.