We will now see how a neural network produces an output through the feed-forward, or inference, pass. We will start by introducing the basic perceptron. We will then see how we can embed the threshold within the model by using an activation function, which will be a step function. Next, we introduce the logistic unit, whose output is differentiable with respect to the parameters and the input. Finally, we will see how to combine multiple logistic units to create a neural network, and we will derive the equations that allow us to perform an inference step, or feed-forward pass.

Let's begin with the perceptron, which was invented in the 1950s by Frank Rosenblatt. There is a series of inputs to this system; let's call them x1, x2, x3, and so on up to the last one, xn. The output, which we call z, is simply the summation of all these inputs multiplied by specific weights, which I will call θ1, θ2, θ3, and so on up to the last one, θn. So z = θ1 x1 + θ2 x2 + θ3 x3 + ... + θn xn. We can write this in a smarter way: if we define x as the vector (x1, x2, x3, ..., xn), and likewise θ as the vector (θ1, θ2, θ3, ..., θn), then z can be written as the product z = θᵀx.

So what is all this for? For example, I could use a perceptron to decide whether you pass the class. x1 could be the homework, x2 could be how many questions you answer in class, x3 could be the extra material that you cover, and the last one could be, let's say, the final project. Of course, these parts will not contribute in the same way, so we may have θ1 = 30, θ2 = 20, θ3 = 10, and the last one, θ4 (in this example we have just four inputs and four weights), could be the final project, which counts for 60 points. So if you do the homework completely you get 30 points; if you answer all the questions you get 20 more, so you are already up to 50; with the extra material you can get up to 60; and if you also do the final project perfectly you can get up to 120. Now let's put a threshold T, which expresses how difficult the class is, equal to, say, 100. Then our output h_θ(x) equals 0, meaning you fail, if the summation of all the scores z is below or equal to our threshold T = 100, which is our bias; instead, you pass the class if you achieve anything above the threshold, which here is 100. So we can use a perceptron to evaluate whether you pass or fail a specific course with these specific weights.
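As a quick sanity check of these numbers — a minimal sketch, not code from the lecture — here is this perceptron in NumPy. The weights are the ones from the grading example, while the student's scores in x are made up for illustration:

```python
import numpy as np

# Weights from the grading example: homework, questions, extra material, final project
theta = np.array([30.0, 20.0, 10.0, 60.0])
# Hypothetical student: full homework, all questions, no extra material, half the project
x = np.array([1.0, 1.0, 0.0, 0.5])

z = theta @ x      # z = theta^T x, the weighted sum of the inputs
T = 100.0          # threshold: how difficult the class is
print(z, z > T)    # 80.0 False -> fail, since 80 <= 100
```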
Let's see how to reformulate this problem by incorporating the bias T within the system, because we don't want to carry this T around, and we will see that this helps in a later formulation. We can incorporate the threshold T by simply adding one new node to the perceptron, which we will call x0, with x0 always set to 1. So z is still defined as the product θᵀx, but in this case the vector x = (x0, x1, ..., xn), so it belongs to R^{n+1}, and in the same way θ = (θ0, θ1, ..., θn), which also belongs to R^{n+1}. We have the same expression as before, but since x0 = 1 the formulation becomes θ0 · 1 + θ1 x1 + ... + θn xn. When we write the final equation for h, we can simply say it is 0 if z ≤ 0 and 1 if z > 0. We have simply brought the threshold T from the right-hand side to the left-hand side: this new z equals the old z minus T, and therefore we see that θ0 = -T.

To simplify this notation further, we can introduce an activation function, which we will call g: the step function. If I plot it, with z on the horizontal axis and g(z) on the vertical axis, the function is 0 up to and including z = 0, and then it is 1 when z > 0. So h simplifies to g(z) = g(θᵀx), and this is also called the activation.

From the previous example, our parameter vector becomes θ = (-100, 30, 20, 10, 60): θ0 = -100 is our bias, which expresses how difficult it is to pass the exam, followed by the list of values we saw before, 30 for the homework, 20 for the questions in class, 10 for the extra material, and 60 for the final project. In x, the first entry x0 is always 1, and then there are values, probably between 0 and 1, which say how well you performed in those different tasks. Both vectors belong to R^{n+1}.
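The same decision with the bias absorbed into θ0 might look like this — again a sketch with the same hypothetical scores, where step plays the role of g:

```python
import numpy as np

def step(z):
    # Step activation: g(z) = 0 for z <= 0, 1 for z > 0
    return (z > 0).astype(float)

# Bias absorbed as theta_0 = -T = -100; x_0 is fixed to 1
theta = np.array([-100.0, 30.0, 20.0, 10.0, 60.0])
x = np.array([1.0, 1.0, 1.0, 0.0, 0.5])  # x_0 = 1, then the four scores

print(step(theta @ x))  # 0.0 -> fail, since z = 80 - 100 <= 0
```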
Now let's say we would like a system that can predict whether you will pass or fail a specific class, but you are not the professor, so you don't know how much each part of the class (the homework, the exam, and so on) actually weighs in the total. We have the same system with several inputs, and we would like to find the combination of parameters θ that correctly predicts whether you pass or fail the exam. In this case, every parameter is somehow tricky to figure out, because small variations in a parameter do not change the output h_θ(x): unless the summation over all the parameters exceeds the bias, the output is 0, and only above the bias do we get a positive outcome. We would like a softer way to determine the output score, so we understand where we stand for a specific set of inputs. We can do this with the logistic unit. It is very similar to the perceptron we have just seen, but the one thing that changes is the activation function: instead of the step function, g is the logistic sigmoid function, σ(z) = (1 + e^{-z})^{-1}, which I can also rewrite as σ(z) = 1 / (1 + e^{-z}). Let's see how this works: what is the plot of this function?

Let's draw a couple of axes. If z tends to -∞, then -z tends to +∞, so e^{-z} tends to +∞, and therefore 1 / (1 + e^{-z}) tends to 0 from above; the asymptote is on the zero side. Instead, if z tends to +∞, then -z tends to -∞, so e^{-z} tends to 0 from above, and therefore 1 / (1 + e^{-z}) tends to 1/1, that is, to 1; so here we have an asymptote at 1. Finally, for z = 0 we have e^{-z} = 1, and therefore the value is 1 / (1 + 1) = 1/2. There you go.

It is very similar to what we have seen before: for values of z above roughly 5 the output of the sigmoid is roughly 1, and below roughly -5 it is basically 0. The nice part is that now, if we apply small variations to the weights, we move z slightly left or right along the axis, and the output, the sigmoid of the scalar product, also changes smoothly in value. In this way we can make small variations to the parameters and see how the output changes. Also in this case we have to remember the bias: there is always the +1 node, connected down here with the weight θ0, which is basically our bias.

Just to recap: x ∈ R^{n+1}, θ ∈ R^{n+1}, z = θᵀx, and our hypothesis h_θ(x) is in this case σ(z), which belongs to the interval (0, 1). If we would like a binary output to decide whether we have passed the exam, we can simply ask whether h_θ(x) > 0.5: if we are above 0.5 it is a pass, otherwise a fail. But in this case we can see that h_θ(x) varies continuously between 0 and 1, so we can see how small variations of each single input or each single parameter affect our final activation and therefore our final outcome.

Here we have a neural network. A neural network is simply a combination of multiple logistic units. In this case we have a three-layer neural network: an input layer, an output layer, and in the middle a single hidden layer. The dimensionality of the input, which we call s1 for the size of the first layer, is 4, so there are just four inputs; the size of the hidden layer, s2 for the second layer, is also 4 in this case; and the size of the output layer, capital L or s3 in this case, is just 1.
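A short sketch of the sigmoid in NumPy, checking the limit behavior we just derived and reusing the hypothetical grading example as a logistic unit (note that with scores this large, z = -20 sits deep in the saturated region):

```python
import numpy as np

def sigmoid(z):
    # Logistic sigmoid: 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

# Limit behavior: roughly 0 below -5, exactly 0.5 at 0, roughly 1 above 5
print(sigmoid(-5.0), sigmoid(0.0), sigmoid(5.0))  # ~0.0067  0.5  ~0.9933

# Logistic unit on the grading example: the output now varies smoothly with z
theta = np.array([-100.0, 30.0, 20.0, 10.0, 60.0])
x = np.array([1.0, 1.0, 1.0, 0.0, 0.5])
h = sigmoid(theta @ x)    # z = -20, deep in the saturated region
print(h, h > 0.5)         # ~2e-09 False -> fail
```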
Of course, there is always an x0 here, set to 1, which is connected to all the other neurons. Then we also have here a0^(2), the activation 0 of layer 2, which is also a +1 and is the bias for the last node. In this way we can perform more complicated operations: each of the single activations a1, a2, a3, a4 of the second layer can compute an operation on the input, for example evaluate whether a specific combination of the inputs is above a specific threshold or bias, and the final outcome then simply scores based on the outputs of the previous units.

The final output, as before, is called h; in this case we have the matrix of parameters, capital Θ, and h is also a function of the input: h_Θ(x). Capital Θ is a collection of parameters. In this case we have two layers of mappings, so we have Θ^(1), which goes from layer 1 to layer 2, and Θ^(2), which goes from layer 2 to layer 3. In more detail, the first row of Θ^(1) is simply the vector θ used to compute a1^(2); the second row of Θ^(1) is another vector θ, used to compute activation number two of the second layer; and the only row of Θ^(2) is the vector of parameters used to compute the only activation, a1^(3), which is also equal to the final output h_Θ(x).

Let's now see some definitions, just to recap what we have seen so far, and then an overview of all the equations we will need in order to compute the forward pass of a neural network. We have that a_i^(j) is the activation of unit i in layer j. As we saw here in the red circles, a1^(2) is the first activation of layer 2, a2^(2) is the second activation of layer 2, a3^(2) is the third, and a4^(2) is the fourth. On top of these we must not forget a0^(2), the bias term, and of course here x0 as well. As we can see, x1 is the same as activation 1 of layer 1, x2 is the same as activation 2 of layer 1, and so on: x3 is activation 3 of layer 1 and x4 is activation 4 of layer 1. In the same way, the final activation, the only activation of layer 3 in this case, is equal to the final hypothesis h_Θ(x). And Θ^(j) is the collection of mappings to go from layer j to layer j+1; Θ^(1), for example, allows us to map the input, that is layer 1, to the activations of layer 2.

Let's now see the full equations for the activations, in this case of layer 2. We have that a1^(2), the first activation of the second layer, is equal to the sigmoid of: Θ^(1)_{1,0}, first row and column 0 (the columns start from 0 because there is a bias term in the input), multiplied by x0, which we know is +1; plus Θ^(1)_{1,1} times x1; and so on until the last one, Θ^(1)_{1,4} times x4. So

a1^(2) = σ(Θ^(1)_{1,0} x0 + Θ^(1)_{1,1} x1 + Θ^(1)_{1,2} x2 + Θ^(1)_{1,3} x3 + Θ^(1)_{1,4} x4)
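To make the indexing concrete, here is that unrolled equation for a1^(2) in NumPy, with a randomly initialized Θ^(1) and a made-up input (both are placeholders, not values from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
Theta1 = rng.normal(size=(4, 5))          # Theta^(1): 4 units in layer 2, 4 inputs + bias
x = np.array([1.0, 0.2, 0.7, 0.1, 0.9])   # x_0 = 1 prepended to four made-up inputs

# Unrolled equation for the first activation of layer 2 (row 1 is index 0 here)
a1_2 = sigmoid(Theta1[0, 0] * x[0] + Theta1[0, 1] * x[1] + Theta1[0, 2] * x[2]
               + Theta1[0, 3] * x[3] + Theta1[0, 4] * x[4])
print(a1_2)  # equal to sigmoid(Theta1 @ x)[0]
```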
The same happens for the other activations. The second activation uses the second row: it is our sigmoid nonlinear function of the matrix that maps the input (the first layer), second row, element 0, which multiplies the bias (always, don't forget), plus second row, element 1, times x1, plus all the others, up to the second row, fourth element, times x4:

a2^(2) = σ(Θ^(1)_{2,0} x0 + Θ^(1)_{2,1} x1 + ... + Θ^(1)_{2,4} x4)

We have the same for a3^(2), and then the last one, which uses the last row: fourth row, bias element, plus fourth row, first element of the input, plus all the others, up to the fourth row, fourth element of the input:

a4^(2) = σ(Θ^(1)_{4,0} x0 + Θ^(1)_{4,1} x1 + ... + Θ^(1)_{4,4} x4)

We can write this in a much more compact way. Let a^(2) be the vector (a1^(2), a2^(2), ..., a4^(2)) of all the activations of the second layer. Then we can write

a^(2) = σ(Θ^(1) x)

where the sigmoid is applied element-wise. What does this do? It simply performs a matrix-vector multiplication: each row of Θ^(1) times the column x, which is, as we said before, first row first element, first row second element, and so on until first row last element, multiplied by all the elements of our input, including the bias. That whole huge amount of operations can be written down much more compactly this way.

Then the final one: the output, which is the hypothesis h_Θ(x) based on the whole set of parameters with respect to the input, and which is also called the only activation a1^(3) of the third layer. It is equal to our sigmoid nonlinear function of the map for the second layer, because we go from the second layer to the third. It has only one row, but we still write the index 1: the first element is the one connected to the bias, times a0^(2); plus Θ^(2)_{1,1} times a1^(2); plus all the others, up to the last one, Θ^(2)_{1,4} times a4^(2):

h_Θ(x) = a1^(3) = σ(Θ^(2)_{1,0} a0^(2) + Θ^(2)_{1,1} a1^(2) + ... + Θ^(2)_{1,4} a4^(2))

And here as well, if we define â^(2) as the vector (1, a^(2)), that is, the activations of the second layer with the +1 bias term prepended, then we can express our final output as

h_Θ(x) = σ(Θ^(2) â^(2))

And there we go: we have completed the whole set of equations that we need to compute the forward pass of a neural network.
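Putting it all together, here is a minimal sketch of the full forward pass under the same assumptions (random placeholder parameters and a 4-4-1 architecture as in the example):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Theta1, Theta2):
    # Layer 1 -> 2: prepend the bias x_0 = 1, then a^(2) = sigmoid(Theta^(1) x)
    x_hat = np.concatenate(([1.0], x))
    a2 = sigmoid(Theta1 @ x_hat)
    # Layer 2 -> 3: prepend the bias a_0^(2) = 1, then h = sigmoid(Theta^(2) a_hat^(2))
    a2_hat = np.concatenate(([1.0], a2))
    return sigmoid(Theta2 @ a2_hat)

rng = np.random.default_rng(0)
Theta1 = rng.normal(size=(4, 5))  # s2 x (s1 + 1) = 4 x 5
Theta2 = rng.normal(size=(1, 5))  # s3 x (s2 + 1) = 1 x 5
x = np.array([0.2, 0.7, 0.1, 0.9])

print(forward(x, Theta1, Theta2))  # a single value in (0, 1)
```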
One more definition. We said that s_j is the number of elements in layer j; as we saw before, s1 = 4, s2 = 4, and s3 = 1 in this case. Using this notation, we can write that Θ^(j) is a matrix that belongs to R^{s_{j+1} × (s_j + 1)}. Let's zoom in on why. We said that Θ^(j) maps layer j to layer j+1, so there will be as many rows of the matrix as the number of outputs, that is, s_{j+1}; and the number of columns is the number of elements we have in the input, which is s_j, plus one for the bias, which we must not forget. So if I draw it here: the matrix has s_{j+1} rows and s_j + 1 columns, it multiplies the column vector made of the +1 bias (just one element) stacked on top of our current input (s_j elements), and the output is a vector of s_{j+1} elements. And that's it.
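A small sketch that builds the parameter matrices from the layer sizes and confirms the s_{j+1} × (s_j + 1) shapes (random placeholder values again):

```python
import numpy as np

sizes = [4, 4, 1]  # s_1, s_2, s_3 from the example network
rng = np.random.default_rng(0)

# Theta^(j) has s_{j+1} rows and s_j + 1 columns (the +1 column is for the bias)
Thetas = [rng.normal(size=(sizes[j + 1], sizes[j] + 1))
          for j in range(len(sizes) - 1)]
for j, Th in enumerate(Thetas, start=1):
    print(f"Theta^({j}) shape: {Th.shape}")
# Theta^(1) shape: (4, 5)
# Theta^(2) shape: (1, 5)
```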