This video is part of a course that I'm putting together on TensorFlow 2.0 and deep neural networks; once that course is ready, I'll put a link in the description down below. So far we've looked at the process of forward propagation and back propagation. What we're going to do in this notebook is look at a very simple neural network and go through the mathematics of taking the data as an input and passing it through the network to the output. It's going to involve some linear algebra and some differentiation.

Now, I want to put this out front: you do not need to understand this to write code using TensorFlow 2.0. You can just write the code and the mathematics will happen behind the scenes. What I do find, though, is that if you have some understanding of what is going on here, it really eases your way into the more complex forms of neural networks. It's just a basic understanding of what is happening, and it really is not that difficult at all. So we're going to have only two nodes as an input here, four nodes in our hidden layer with a bias node, and a single node as the output.

Let's just import some libraries. I'm going to use NumPy; from Google Colab I'm going to use the files function, and from IPython.display I'm going to use the Image function. As always, I'm going to use Plotly as my plotting library, so just the graph objects there, plus the input/output module, where I set my default plot style to have a white background.

For some of the mathematics that I'm going to show you, especially the linear algebra part, we're going to use symbolic Python, SymPy. That allows us to display the mathematics very nicely on the screen, as you would see it in a textbook, and it's also geared towards symbolic mathematical solutions, so it really works well. Because of that we also have to run a little boilerplate code here: from IPython.display we import Math and HTML, and then we load MathJax so that we can view LaTeX code, which is the formatting for mathematical notation, so it looks nice; then we initialize the printing. Have a look at that boilerplate, but we're certainly going to make use of SymPy.

Now, I'm not going to import this image; I'm just using the files.upload function here so that I can upload the image directly from my hard drive (I'll make these little figures available), and I use the Image function to show you what this is about. What it's about, though, is this little data table here. We're going to have only four samples for this very small neural network of ours. So our training set contains only four samples, and you can see we have two feature variables, feature variable one and two (very imaginatively named), and then a target variable, encoded as a zero and a one. In other words, this is a classification problem, as the target is a categorical variable, and it is binary in nature. All of these things are very important to know, because when we design our neural network, we need to know what kind of problem we're dealing with. So this is a binary classification problem, and of course it's supervised learning, inasmuch as we know what the answer is: given the 10 and the 11 we know we have to predict a 1, and given the 9 and the 8 it has to predict a 0.
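Incidentally, the setup I just described boils down to something like this. This is a minimal sketch, assuming the np / go / sp aliases used throughout; the Colab-only upload step is commented out so the cell runs anywhere:

```python
# A minimal sketch of the setup cells, assuming the aliases np / go / sp below.
import numpy as np
import plotly.graph_objects as go
import plotly.io as pio
import sympy as sp

pio.templates.default = "plotly_white"   # default plot style: white background
sp.init_printing()                       # textbook-style (LaTeX) rendering of SymPy output

# In Colab only, for uploading the figure from the local drive:
# from google.colab import files
# from IPython.display import Image
# files.upload()
```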
Now let's just take the first sample; in this whole notebook we're going to concentrate only on this first sample. For the first feature variable, this sample had a value of 10. You can see this as a patient, a customer, whatever you want: for this first subject in our data set, the first variable's value was 10, and the second was 11. Two feature variables, hence the two input nodes: in other words, the 10 is going to go in there and the 11 is going to go in there. That 10 and 11 are going to forward propagate through this whole network, and it's going to predict something at the output, and it's going to be either zero or one.

Well, I'll tell you now: it's not going to be exactly zero or one. It's going to be somewhere in between, 0.34, 0.78, and we can make an arbitrary cut-off, say at 0.5. If it's more than 0.5, we'll say our network predicted class one, and if it's below 0.5, it predicts class zero. That's how we're going to interpret it, but the output itself is going to be some numerical value between zero and one.

Again, we have two input nodes because we have two feature variables. When we put an image in as an input, that input will look quite different; the input is really designed around your feature variables, so to keep it very simple, we only have these two feature variables here.

Now, here's the first part of the mathematics, and this is a vector. You can see it in equation 1 (I've marked the equations for you): there's the 10 and the 11, and we can write them in this column format. For clarity in this notebook, I'm going to call this vector I, with the subscript one denoting that this is our first sample. We write the 10 and then the 11 below it, and we see "2 × 1" there. That is the size of this vector, the dimensions of the vector, and it's always how many rows there are times how many columns; we can clearly see there are two rows and a single column. So it's always row by column, and that's the thing about a column vector: it only ever has one column. I'll just warn you that row vectors also exist, where it's a single row of values, but here we're dealing with column vectors. So it's two rows, one column, which means, if we go back up to our little figure, this is a 2 × 1 vector.

Guess what these four hidden nodes are going to be: a 4 × 1 column vector. So how on earth do I go from the values 10 and 11 in a 2 × 1 column vector to a 4 × 1 column vector? I do it through the magic of all these connections. How many are there? Each input node is connected to each of the four hidden nodes, so there are four there and four there. Somehow I've got to bring in something with eight values to go from this 2 × 1 column vector to this 4 × 1 column vector, and we're going to see that we represent those eight values as a matrix. But hang on, here we go.
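Just to write equation 1 out properly before we move on (this is exactly the vector described above):

```latex
I_1 = \begin{bmatrix} 10 \\ 11 \end{bmatrix} \qquad (2 \times 1)
```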
We're going from this input vector I, 2 × 1, to this first hidden layer (we only have one hidden layer here), which is a 4 × 1 column vector. As I mentioned, the way to go from a vector of one dimension to a vector of another is to multiply the vector by a matrix. So we're going to have a matrix times this input vector, and that's going to give us the new vector. I'm using a generic matrix here, r × s: it has r rows and s columns, so it might be 3 × 4, three rows and four columns of data. And I'm multiplying it (that dot again) by this vector, s × 1; a column vector has one column. The result is r × 1: that little r comes from the r in the matrix. And you see those two s's: they have to be identical. You cannot multiply a matrix by a vector if those two are not the same; the column count of the matrix and the row count of the vector have to be exactly the same. They fall away, you could say, and we are left with the number of rows in the matrix and the number of columns in the vector, hence the r × 1.

So if I want to go from a 2 × 1 vector to a 4 × 1 vector, my matrix had better be of size 4 × 2. Look at that: the two inner twos are exactly the same, they "fall away" (in inverted commas), and what gets left behind is the four and the one, so my result is 4 × 1. If I go back up to this little graph, all these lines can be represented as a matrix, a 4 × 2 matrix, because if I take a 4 × 2 matrix and multiply it by a 2 × 1 column vector, I get a 4 × 1 column vector. That's exactly what we're going to do.

Here you see a representation of that, and you see the notation for the 4 × 2 matrix. Each entry has a subscript of two digits: the first digit is the row number and the second is the column number. So w₁₁ is some numerical value in row one, column one; the next one is row one, column two; then row two, column one; row two, column two; and so on, until we get to row four, column two right at the end. They are just going to take some values, and those values we're going to multiply by the 10 and the 11.

So let's simulate some random numbers. I'm going to seed the pseudo-random number generator with the integer 42; that just means that every time we run this code, we get the same random numbers. I'm going to create my input vector with SymPy's Matrix function, and you see how we do that for a column matrix: it's the 10 and the 11, and each goes inside its own set of square brackets. Every row goes inside its own set of square brackets, with square brackets on the outside to denote the whole thing. Then we create the weight matrix, and for that I'm just going to use the numpy.random.normal function: loc=0 means draw from a distribution with a mean of zero, the scale gives a standard deviation of 0.01, and the size must be 4 × 2.
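Put together, those two cells look something like this. A minimal sketch, assuming the np and sp aliases from the setup; the variable names i1 and w are my own labels:

```python
# A sketch of the creation cells, assuming the np and sp aliases from the setup.
np.random.seed(42)                 # seed the pseudo-random number generator

# 2x1 input column vector: each row goes in its own set of square brackets
i1 = sp.Matrix([[10], [11]])

# 4x2 weight matrix: normal distribution, mean 0, standard deviation 0.01
w = sp.Matrix(np.random.normal(loc=0, scale=0.01, size=(4, 2)))

i1.shape, w.shape                  # (2, 1) and (4, 2)
```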
So once I create this little 4 × 2 array inside of NumPy, I pass it to the Matrix function inside of symbolic Python, SymPy. And there you can see it already on the screen, the 10 and the 11: that's MathJax going to work to do this pretty printing, with these square brackets, or sometimes, as you can see here, large parentheses; you can use either the parentheses or the rectangular brackets. You see my column vector there, 10 and 11, and let's just run that again so you can see how it prints. There we go: our 4 × 2 weight matrix. We're going to call these values weights, and this a weight matrix, so there are my four rows over two columns, and it came from a normal distribution with a mean of zero and a standard deviation of 0.01. These are random values, and that's exactly what we do in a neural network when we start out: we take the data coming in, as you saw in our little table of only four examples, and multiply it by some weight matrix whose values are initially just chosen at random. There are various ways to choose them; I'm just using this normal distribution for now. We can also check on the matrices and the vector and make sure of the sizes: my input vector is 2 × 1 and my weight matrix is 4 × 2.

The next thing we're going to do is multiply them, and we do that just with the star symbol, normal multiplication. And it's got to be the weight matrix times the input vector, in that order, so that the dimensions line up. That's a thing about matrices and vectors: usually they do not commute. Normal numbers do commute: three times four is twelve, and three times four and four times three are the same thing. But the weight matrix times the input vector is not equal to the input vector times the weight matrix; as a matter of fact, you can't even compute the latter, because the inner numbers don't align.

Once we do that, lo and behold (let me just run this; throughout this recording you'll see the output before I run the code, because I wanted things to go smoothly), there you can see that, as promised, we deliver a 4 × 1 column vector, exactly what we want. And if we check on the shape of that, let's run it anyway, we see it is indeed a 4 × 1 column vector.

So let me just show you how that is done. I have my eight values here in my 4 × 2 weight matrix, and I multiply that by this column vector. How do we get to a 4 × 1? Well, we go row by row, as far as the matrix is concerned. We take the 10, and it gets multiplied by the weight value in row one, column one; the 11 we multiply by the w₁₂ there; and we add these two products. Then we go to the next row in the weight matrix, and again it's the 10 times the w₂₁ and the 11 times the w₂₂. You can see it's quite simple how that happens.
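In code, that multiplication and a spot check of the first row might look like this (a sketch, assuming w and i1 from the cell above):

```python
# The star symbol does matrix multiplication on SymPy matrices.
n = w * i1                          # (4x2) * (2x1) -> (4x1)
n.shape                             # (4, 1)

# Row by row: the first entry is 10*w11 + 11*w12, and so on down the rows.
n[0, 0], 10 * w[0, 0] + 11 * w[0, 1]   # the same number, computed by hand
```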
Now let's go all the way back up. What we represent here is that we add a bias term; there's a bias node here. So we're going to do vector addition: we're going to add whatever this value is to each of these nodes. And the only way to do vector addition is if the two vectors have exactly the same dimensions; you cannot add two vectors together if they don't. We usually draw just one bias node here, because all four numbers will be exactly the same, and it's also very typical, when you first run your neural network, to make all the bias values zero. So we're going to have a 4 × 1 column vector of all zeros, which means we really only have to store one value, zero, because it's the same for all of them. But I've created it here as a simple matrix, and again, just to show that it is 4 × 1: it's just four zero values.

So when we did the multiplication of the weight matrix and the input vector, we got our four values, but those are not the values that actually go into the four nodes; we are not done yet. We add this bias vector to those values. This is quite easy for now, because they're all zero, and what happens with vector addition is you just take the corresponding elements of the two vectors and add them to each other. So what happened here is: nothing changed. And by the way, I'm now calling this result of the weight matrix times the input vector plus the bias vector z, and as you can see mentioned here, z is 4 × 1: it came from a weight matrix which was 4 × 2, times an input vector which was 2 × 1 (remember, the input vector was the 10 and the 11, so those inner twos are the same), plus a bias vector which was 4 × 1. All of this makes absolute sense, and there we can see z: no problem whatsoever, nothing changed, because I added just the zero values.

Now, we are still not done; those are also not the final values that go into the four nodes. We have to put this column vector through what is called an activation function, and that's very important in neural networks: the activation function is what brings non-linearity to the deep neural network, to this model, which is quite different from linear models. There are many activation functions, and we'll learn all about them; the one I'm going to use here is probably the one used most often, the rectified linear unit. We usually just say ReLU, R-e-L-U, as you can see there (the "e" is lowercase and the rest is uppercase). What the ReLU activation function does with an input is look at every single value separately: if the value is more than zero, it remains unchanged; if it's zero, it also remains unchanged; and if it's negative, less than zero, it turns into a zero. So all negative numbers get converted to zero, and all positive numbers stay what they are. The way to do that is just to use the maximum function in NumPy: take the input or zero, whichever is the higher. So if I run that, you see that my 0.03 stayed the same, the negative 0.049 changed into a zero, and the 0.24 stayed 0.24.
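A sketch of the bias and activation steps, assuming w and i1 from above. The notebook uses np.maximum; here I apply the same max(value, 0) rule to the SymPy matrix with applyfunc, which is my substitution:

```python
# A sketch of the bias and activation steps, assuming w and i1 from above.
b = sp.Matrix(np.zeros((4, 1)))          # 4x1 bias vector, all zeros to start
z = w * i1 + b                           # z = W*i + b; adding zeros changes nothing

# ReLU: negative entries become 0, everything else is left unchanged.
relu = lambda m: m.applyfunc(lambda v: sp.Max(v, 0))
a = relu(z)                              # still 4x1
```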
So that's it; that's the rectified linear unit. It's such a simple function, but it really performs a bit of dark magic. I'm going to call my ReLU function sigma here, so I'm applying sigma to this whole 4 × 1 column vector z, and I'm just reminding you of how we got to these four rows in one column. As I'm showing here, it simply means applying ReLU to each of the rows separately, and that's how we got these four values up top, with the third one turning into a zero.

So that's where we are at the moment. I'm going to scroll all the way back up, because we're now going to go from a 4 × 1 column vector to a single value. A single value is very easy: you can see it as a 1 × 1 column vector, one row and one column, one single number. So how am I going to go from a 4 × 1 to a 1 × 1? I think you can work that out by now; it's quite simple. For this notebook, I'm going to call the new matrix v, and you can see it there. By the way, after the activation function I'm calling my vector a: this vector here, after the activation, is a, and that's how we stored it. What we're going to multiply it with is a weight matrix, which I'm calling v, and it is 1 × 4: because if I take a 1 × 4 and multiply it by a 4 × 1, the fours are exactly the same, so I can do it, and what is left is a 1 × 1. So my output is just a 1 × 1, a simple single number. And here you see v up top: v₁₁, v₁₂, v₁₃, v₁₄, that's row one column one, row one column two, row one column three, and row one column four. It's a matrix with a single row; it is actually a row vector, but we won't see it as such at the moment. It is a weight matrix with a single row and four columns, and if I do that multiplication, things work out perfectly for me.

So all I'm doing here, once again, is choosing from a normal distribution with a mean of zero and a standard deviation of 0.01, and I want it to be of size 1 × 4; it comes from the random.normal function in NumPy, and I pass it as an argument to the Matrix function in symbolic Python, so that I have this beautiful 1 × 4 weight matrix. If I do the multiplication of v times a, lo and behold, I get a single number.
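A sketch of that output layer, assuming a is the activated 4 × 1 vector from the ReLU step above:

```python
# A sketch of the output layer, assuming a is the activated 4x1 vector above.
v = sp.Matrix(np.random.normal(loc=0, scale=0.01, size=(1, 4)))   # 1x4 weight matrix

out = v * a        # (1x4) * (4x1) -> (1x1): a single number
out.shape          # (1, 1)
```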
Now, I'm not done with that single number; that's not what is going to go inside of the output node. I have to put it through, you guessed it, an activation function, and this other activation function is the sigmoid activation function. Let's run this so it can be drawn on the screen for us. I'm just using Plotly's graph objects here, a figure, creating values from negative three to three, and I'm applying the sigmoid activation function. What the sigmoid activation function is (these are the y values you can see here) is one over one plus e to the power negative x. If you do that, you get this beautiful S-shaped curve, and it goes from zero to one; it's always constrained between those two. And is that not exactly what we want? Because we want to predict either a zero or a one, and as I said, we can have this cut-off, say at 0.5, here: if the value that comes into it lands above 0.5, up here somewhere, we predict a one; if it's below, we predict a zero. So we really have to put this value through the sigmoid activation function; it's a very typical activation function to use for a binary classification problem. And out pops this value, 0.50: it's one over one plus e to the power of negative whatever came out of my 1 × 4 weight matrix times my 4 × 1 column vector. I got that one value, and passing it to this function gives 0.50, so I'm sitting right smack in the middle; I suppose it's slightly above, so this prediction would be for class one of my target variable, but it's very, very fifty-fifty. And that's exactly the point: I chose my weight values, my bias values, and my second weight matrix as random values, so of course the prediction is going to be very poor. This is exactly what a deep neural network does: it has to update and get better values for those two weight matrices and the bias node. If it can learn better values there, this should be predicted much closer to one.

But look at what we've done: we've just gone through a whole neural network. It's really as simple as that. Now, this output value we're going to call y hat sub one, because we're dealing with the first sample, the one that had the values 10 and 11, and the actual value we're just going to call y sub one. So we make this distinction: y sub one was 1 (remember, in our little table of four samples, the first actual value was one), and our prediction, y hat, is 0.50. So, not very good.
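Here's a sketch of that sigmoid step and its plot, assuming np, go, and the 1 × 1 result out from the previous step; the helper name sigmoid is my own:

```python
# A sketch of the sigmoid step, assuming np, go, and the 1x1 result `out`.
def sigmoid(x):
    return 1 / (1 + np.exp(-x))          # squashes any input into (0, 1)

x_vals = np.linspace(-3, 3, 100)
go.Figure(go.Scatter(x=x_vals, y=sigmoid(x_vals), mode="lines")).show()

y_hat = sigmoid(float(out[0, 0]))        # the prediction for the first sample
round(y_hat, 2)                          # 0.5, since the random weights are tiny
```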
So let me take you through this whole process again. We started with the 10 and the 11, and in the design of our network we had to choose two input nodes because we've got two feature variables; the input is really dependent on the data that you pass into the network. That was our input vector, 2 × 1. We multiplied it by a weight matrix that had to be 4 × 2, all random values, and that got us to these four values here, n₁, n₂, n₃, n₄; that's what you see at the bottom, the weight matrix times the input vector. Then we added the bias, so that was vector addition in this first step, all zeros, so nothing changed. There are our values, n plus b, and then we put them through an activation function. Once they're through the activation function, we multiply by another matrix, which gives us this 1 × 1 matrix, which we put through another activation function, and out comes the prediction. You'll see the values are slightly different from what I had up top, because previously I had not seeded the random number generator, so different weight values were generated at random for me and the numbers came out slightly different. You can see the difference that it makes, though not so much as far as the final output is concerned, because these are purely random numbers and the neural network has not learned anything yet; these will invariably be very poor values.

So let's scroll down here and have a look: this is how we got to y hat sub one. I just want to remind you how we got there. It was this long equation: we had the activation function of that matrix-and-vector multiplication, added to that the bias term, and then all of that we multiplied by the second weight matrix, which becomes an addition of all these terms. We had to do all of that just to get to y hat sub one. And all of these that you see there, the v₁₁'s and the w₁₁'s, are unknowns. We just chose random values to throw in so we could do the calculation and get to this prediction, but they are unknowns, and I have to somehow design something that will correct them, that will make them better. To get to y hat, I have an equation with 13 unknowns, 13 variables. On a piece of paper, I can draw a single-variable function, y equals x squared, a nice parabola; but imagine something with 13 unknowns. We can't fathom that in our heads; we can't draw it on a piece of paper or a computer screen. But that is what we have to solve: a problem with 13 unknowns, and this is just a very, very simple neural network.

So how do we go about this? Well, we use something called a loss function, and a loss function is the difference between the prediction and the actual value. Remember, our prediction was 0.50-something and the real value was one; there's a difference between the two. But I can't just treat them as numerical variables, because these are categorical variables. So how do I calculate the difference between two classes of a categorical variable? Well, this is one of the loss functions we can use, and we use it very often to determine the difference between two categorical classes: one argument is the prediction and the other is the actual value. We call it a loss function, and it's a function of those two variables. It's very simple: it's minus the sum of two parts. The first part is the product of the actual value (that would be the 1) and the natural log of our predicted value (the 0.50-something); the second part is one minus the actual value, times the natural log (that's log base e, Euler's number) of one minus the prediction.
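As a sketch (the function name loss is my own label for this binary cross-entropy formula):

```python
# A sketch of the loss for one sample (binary cross-entropy), assuming np.
def loss(y, y_hat):
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

loss(1, 0.5)       # about 0.69: actual value 1, prediction 0.5
```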
If we pass all of that in, we get a loss, and it's quite a high loss: the prediction was 0.5 and the actual value was one, so we get a loss of 0.69, just for that first sample. And we have to add up all these losses, because we've got to go through all our samples. We only had four in our little table, but if you have thousands, tens of thousands, hundreds of thousands, millions of samples, you've got to go through all of them, and all of those losses combine into a cost function. For every loss function you get corresponding cost functions; for this binary classification problem, the cost function is simply this: we add over all of the losses and then divide by how many there are. That's our cost function.

But remember, this loss function has y hat in it, and to calculate y hat was an equation with 13 unknowns, so my cost function is going to have many unknowns in it. And that is the function I somehow have to manipulate to improve those values, to get the optimum values for the randomly selected weight matrices and bias values. How do we do that? Through a process called gradient descent, and gradient descent involves derivatives. So let's have a quick look at that.

Now, as I mentioned, we can't fathom the graph of an equation that has 13 unknowns, so I'm going to stick to a very simple polynomial with only one variable, an x variable, like y equals x squared, the parabola. I'm choosing this polynomial here, in equation 14: x to the power four, minus two x cubed, minus two x squared, minus four. Just a simple polynomial, but it stands in for the cost function with its 13 unknowns; I'm simplifying things so that I can draw it on a computer screen and we can all appreciate what's going on. This y plays the part of my cost function, and the cost, remember, represents how wrong I am. With a neural network, I don't want to be wrong, I want to be as correct as possible, so I want this cost value to be at a minimum. Here we had 0.69 for the first sample, and we're going to add that up across every sample that we have.
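As a sketch, reusing the loss function from above; the four target values and predictions here are made up purely for illustration:

```python
# A sketch of the cost: the mean of the losses over all samples.
def cost(y, y_hat):
    return np.mean(loss(np.asarray(y), np.asarray(y_hat)))

cost([1, 1, 0, 0], [0.5, 0.62, 0.48, 0.51])   # illustrative values only
```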
Somehow, I want this cost value to be a minimum. I want values for all the weight matrices and bias values such that, if I plug them into this equation, I get the lowest possible number. Here we draw the graph of this polynomial of mine, and remember, for each single x value I'm going to get a different cost. With this convex-looking function it's very simple: I want to be right down here at the bottom, because this y represents cost, and I want the smallest cost. I can clearly see that if I plugged in a value of x equals 2, I'd get the lowest cost. That represents the fact that, for all the unknowns in my real equation, I've got to get each of them to just the right value so that the cost is at its absolute minimum for that multi-dimensional function.

It's actually quite simple to do this process of gradient descent, and it comes from this little equation here, equation 15. It says: take whatever x value you are at at the moment, that's x sub t, and subtract from it some alpha value, which we call the learning rate, times the derivative at where you are at the moment. Remember, the derivative is the slope. We're going to start, just at random, at x equals 1; starting there is the same as saying we took those weight matrices and just drew eight random values for the first one and four random values for the second one, and for my bias I chose zero, basically also at random. So I'm going to start at 1. Now remember, the first derivative gives us the slope of a function wherever we are. If we're right here, the slope points downwards, and if you think of yourself standing on the side of a hill, downwards, to the bottom of the valley, is exactly where we want to go. A slope is very nice: it shows you how to go downwards. So if we start at x equals 1, the slope, which is the tangent line at x equals 1, shows us the direction in which to walk. Now, the slope here is negative, but we want to move from x towards the right, in the positive direction, and that's why we have the minus sign in the update. So wherever I am, having started somewhere at random, I subtract some step size times the slope. The step size is usually very small; we'll make it 0.01. (It gets more complex than that, because we can change it during the learning process, but for now let's keep it fixed at 0.01.) The slope here is a small negative number, and minus times a negative number is a positive, so my next x value, x sub t plus 1, is going to be something larger than 1.

So let's just do that. There's my polynomial; I can print it to the screen very nicely using symbolic Python, because I've set x as a mathematical symbol (which means it's no longer available as an ordinary variable name). There's my beautiful mathematical symbol, and I can use SymPy to do the differentiation for me: I say take y and give me the derivative of y with respect to x, and there we see the first derivative of this very simple polynomial. Now we can do the update: let's plug in 1, since 1 is where I am at the moment. That's 1 minus 0.01 times my derivative, which is 4 x cubed minus 6 x squared minus 4 x, the first derivative of my polynomial. I plug 1 into all those x's and I get 1.06. So indeed, if we look here, I went from 1 to 1.06: a tiny step towards the right, but that's exactly what I want.
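Here's a sketch of that single gradient-descent step with SymPy, assuming the sp alias from the setup:

```python
# A sketch of one gradient-descent step with SymPy, assuming the sp alias.
x = sp.Symbol('x')                 # x is now a mathematical symbol
y = x**4 - 2*x**3 - 2*x**2 - 4     # the toy polynomial standing in for the cost

dy = sp.diff(y, x)                 # first derivative: 4*x**3 - 6*x**2 - 4*x
alpha = 0.01                       # the learning rate (fixed, for now)

x_next = 1 - alpha * dy.subs(x, 1)
x_next                             # 1.06: a small step to the right
```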
Now, it's easy to see here where the minimum is, but imagine a multi-variable function in multiple dimensions: you can't just look at it and know which direction to go in. Here we do know, but I'm using this simplified example first of all because I have to: I have a flat screen. And we see that the update is moving in the right direction; that's what this little update rule does. In essence, all we use in the end (I'm not going to describe it fully here) is partial derivatives; that's why you see the curly d, the partial ∂, here. With partial differentiation, we take each of those variables individually: we treat all the other variables as constants, and the derivative of a constant is just zero, so they all fall away, and we look at one of the 13 at a time. We check, for each of them, in which direction we have to move in each of those 13 dimensions, and they should all move downwards.

And there we go: 1.06. Just to show you: if we had started instead at x equals 3 and plugged 3 into this, our next value would be 2.58. So if we started at 3, just by random chance up here somewhere, we'd move down to 2.58; it also moves towards the downward part.

Now, unfortunately, you'll see there's also a little bit of a dip here, another minimum. If I started here, the updates move towards this dip, and if I started up here by random chance, I'd move towards this dip and completely miss the fact that the lowest part is actually over there. That is, unfortunately, one of the problems of deep neural networks: you can land up in these local minima that are not the global minimum. Fortunately, it's not as big a problem as you might imagine. We also have this idea of all the possible values that the weights and the bias values can take; that is called the hypothesis space, and we want to constrain the hypothesis space somehow, because what you don't want is for your neural network to memorize the training data, in which case it will only do well on the training data. We want it to generalize to unseen data, the validation or the test data; or, if we create an app and someone from outside puts data in, we want it to do well on that unseen data. We don't want it to memorize our actual training data. So all of these things conspire to make it less of an issue, this idea that we don't always get to the global minimum; we don't always really need to. That's a very gross oversimplification of the problem; suffice it to say that it is a problem, but not as big as you might imagine.

So that is gradient descent, and this new value, 1.06, is something we can now plug back into our weight matrix: say, for instance, it was one of the w₁₁ values; we plug that value back in there and start the whole forward propagation process all over again, because for each of the 13 values we will now have an updated value, which should bring the cost function down. This process of updating all 13 separate values through gradient descent is called back propagation. Repeating that update over and over walks us down the slope step by step, as in the little sketch below.
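A sketch of that repetition on the toy polynomial, assuming x, dy, and alpha from the previous cell; the step count is an arbitrary choice of mine:

```python
# A sketch of repeated updates, assuming x, dy, and alpha from above.
xt = 1.0                                  # our "random" starting point
for _ in range(500):
    xt -= alpha * float(dy.subs(x, xt))   # x_{t+1} = x_t - alpha * dy/dx at x_t

round(xt, 2)       # 2.0: step by step we settle into the minimum
```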
So the second time I do forward propagation, I again start with the 10 and the 11, but now I have new values for the weights and new values for the bias term, and that should, in the end, give me a lower cost function: my prediction should be closer to 1 after all the activation functions and all the multiplications and additions. It should be closer to 1 now, and that means my cost function has come down.

And that is really it. We've taken a very simple, densely connected neural network, and I've shown you how you do forward propagation, which is just matrix-and-vector multiplication; we did vector addition; we used activation functions; and we designed the sizes of all these pieces so that they make sense. Then, in the back propagation step, we updated all the values, trying to minimize a cost function. I think you'd agree it's really not as difficult as you might have imagined. Now, we can create all sorts of different neural networks, first of all much bigger ones, but also much more complex ones with different architectures; the basic principles, though, remain as we've seen them here. So I really hope this helped you, and that you now have a good understanding of the basic math that goes on when we use a deep neural network.