Hi, and welcome back to Analyzing Software Using Deep Learning. This is part three of the introduction, and what I want to do in this third part is to talk about some basics of the two fields that this course is combining, namely program analysis and deep learning. I assume that some of you already have some background in at least one of these two fields, so this is basically to make sure that everybody is on the same page, and maybe a little bit of repetition for some of you, but it ensures that we are all ready to get deeper into the material. So let me start with some background on program analysis, and in particular on the question of how a program can actually be represented when it is processed by an algorithm, or by a computer in general. There is one obvious representation that everybody has of course seen when writing code, and that is to represent code as a sequence of characters. It turns out that this is not the best representation, and there are many others that people working on program analysis like to use, including sequences of tokens, abstract syntax trees, control flow graphs, data dependency graphs, call graphs, and many others. We will not talk about all of those in this video, but focus on a few that we will visit again and again when talking about different approaches in this course, and these are the four that you can now see here. So let's start right away with the first representation, which looks at programs as a sequence of tokens. So what is a token? You can basically think of a token as something like a word in a natural language, but for programming languages.
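To make this idea concrete, here is a minimal sketch of a tokenizer in Python for a tiny Java snippet. This is a toy illustration using regular expressions, not a real Java lexer; the category names and the `tokenize` function are just my choices here, and the categories anticipate the six kinds of tokens discussed below.

```python
import re

# Toy tokenizer for a tiny subset of Java, just to illustrate the idea.
# A real lexer (e.g., the one inside a Java compiler) is far more complete.
TOKEN_SPEC = [
    ("COMMENT",    r"//[^\n]*"),                     # natural language info
    ("LITERAL",    r"\"[^\"]*\"|\d+|true|false"),    # constants in the code
    ("KEYWORD",    r"\b(?:if|while|class|else)\b"),  # built-in keywords
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),       # developer-chosen names
    ("OPERATOR",   r"==|\+\+|=|\*|\+"),              # operations on values
    ("SEPARATOR",  r"[(){};.]"),                     # structure markers
    ("SKIP",       r"\s+"),                          # whitespace, discarded
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(code):
    """Cut a character sequence into a sequence of (kind, text) tokens."""
    tokens = []
    for match in TOKEN_RE.finditer(code):
        kind = match.lastgroup
        if kind != "SKIP":  # drop whitespace between tokens
            tokens.append((kind, match.group()))
    return tokens

print(tokenize('if (flag == true) { name = "Joe"; }'))
```

Running this on the snippet prints the keyword `if`, separators, the identifiers `flag` and `name`, the operators `==` and `=`, and the literals `true` and `"Joe"`, exactly the categories we are about to walk through.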
So basically, the tokens are the words of your program. The tool that produces these tokens is called a tokenizer, or sometimes also a lexer, and it typically is part of a compiler, because one of the very first things a compiler does is to take your program and cut it into these tokens. Every token is a subsequence of the characters in the file that contains your program, and tokenization is basically taking this long sequence of characters and cutting it into smaller sequences, which then form the sequence of tokens. Just to give you an example: for Java, there are six kinds of tokens, which you can see here on the slide. One of them are identifiers, basically all these names that developers can choose for their classes, variables, fields, methods, and so on. Then there are keywords, things like if and while and class, for example, so basically all the built-in keywords that, in this case, the Java language provides. Another group of tokens are separators, so things like this dot, which is for example used if you want to access a member of an object, and also these curly braces, which are for example used to start and end blocks. Then another group of tokens are operators, so everything that is used for arithmetic operations or for other kinds of operations on values, such as this multiplication operator or the plus-plus operator. Then there are literals, which are basically constants that you literally write into your code, for example numbers like 23 or a string like "hi". Again, this is a kind of token where developers can create arbitrarily many new tokens, and we'll see in one of the later modules that this is actually quite a challenge for learning-based program analysis. And then finally, the last kind of token, which is often ignored in program analysis but which we'll also see as something very interesting in learning-based analysis, are comments, so natural language information
that is somehow associated with parts of your code. So let me give you an example of how this tokenization works for a very simple piece of Java. Let's assume our code is the following: we have an if statement that checks whether some flag is equal to true, and if this is the case, then some variable called name is set to "Joe", and that's already it. Now, if the tokenizer gets this piece of code, it will extract a couple of tokens. One of them are keyword tokens; in this example, this if, for example, is a keyword. Then there will be some separator tokens, in particular this open parenthesis here, the closing parenthesis, the curly braces that open and close the block, and also the semicolon that is used to end the statement. Then we have a couple of identifiers, so basically all these names that developers can choose, for example this variable flag or also this other variable called name; those are identifier tokens. Then we have some operators here, in particular this double equals and also this assignment operator, the single equals. Now the only things that remain are the literal tokens, so this true, for example, and the string "Joe", which are constants that are literally given in the source code. Based on these different categories of tokens, what we get is a sequence of tokens that basically consists of these colored boxes, and this is one possible way to represent a program. So tokens are basically the simplest form of representing a program, because they almost literally show you what is in the source code file, just cut into smaller subsequences of characters. A slightly more involved representation of programs is the abstract syntax tree. What an abstract syntax tree, or AST for short, does is to show you the code in a tree representation that mirrors the structure of the code. It is called abstract because some of the details of the concrete syntax are omitted in this tree.
So, for example, the curly braces that mark the beginning and the end of a block in Java are not represented in an abstract syntax tree; instead, you'll basically have a subtree that represents the entire block. The nodes in an AST correspond to constructs in your source code, and the edges in this tree represent parent-child relationships, basically telling you that one source code construct is a part of another source code construct. If you want to get a concrete idea of what such ASTs look like, I can recommend the demo of Esprima, which is one tool that gives you an AST for the JavaScript language. It has a very nice web interface where you insert a little bit of JavaScript code and then you see, right next to it, the AST for this piece of code, and you can play with the tree and with the code and see how one changes if you change the other. Let me also show you an example here in the video. This will be an example of an abstract syntax tree, and in this example we'll use a little piece of JavaScript code. This code is very simple: it's basically one statement that says, hey, there's a variable x, and I want to assign to this variable the result of multiplying six with whatever is in variable y. Now, if you take this little piece of code and parse it into an abstract syntax tree, what you'll get is a tree that looks as follows. At the root of the tree we have a node called Program, and it turns out that for the abstract syntax tree of every piece of JavaScript code, the root is called Program. Then you have an edge that goes to the statement that we see in the source code, which happens to be a variable declaration. Each of these edges has a label.
This one is marked as the body of the program, and now in this body we have another child node, which in this case is a variable declarator, which is part of the list of declarations that this variable declaration has. The reason why there are two nodes is that you could have multiple declarations in one variable declaration; for example, you could, in the same statement, also assign some value to another variable a. Now, every variable declarator consists of two parts: one is the left-hand side of the declaration, which is represented through an identifier node, and the other is the right-hand side, which shows us what expression this identifier is initialized with. In this case, this happens to be a binary expression, because we have this binary operation of multiplication that gives us the result of the expression. So what we see on the left is the id, and what we see on the right is the thing that initializes this id. Now we need to add a few more nodes to this AST. The identifier happens to have a name, and this name happens to be x. For the binary expression, we see that it consists of three things: one of them is the operator of the expression, then we have whatever is on the left of the expression, and then we have the right-hand side of the expression. The operator in this case is the multiplication operator, the left-hand side in this case is a literal, and the right-hand side is an identifier. Now each of those also needs to be made more specific: the literal will have a value, and the identifier will have a name. In this concrete example, the value of the literal happens to be six, and the name of the identifier here on the right happens to be y, and this is the entire abstract syntax tree for this little piece of JavaScript code. So the abstract syntax tree focuses mostly on the structure of the code; it abstracts away some of the concrete syntax that is not
really relevant for reasoning about the code. But what it does not show you is in what order the statements in the code are actually executed, because the order in which statements appear in the code may not be the order in which they are executed. If you do care about the flow of control, so the order in which things in your program happen, then the right representation is a control flow graph, which models the flow of control through a program. A control flow graph is, again, a graph, so it has nodes and edges; specifically, here the nodes represent so-called basic blocks, which are essentially sequences of operations that are always executed together, and the edges between these basic blocks represent possible transfers of control, basically saying that one block may be followed by another block. Typically, a control flow graph, or CFG as it's usually abbreviated, is built on the method level, so for one method we'll have one graph that shows us all the operations that happen in this method, and in what order they may happen. Let me also show you an example of such a control flow graph, and here, again, I'll use a simple piece of JavaScript code to show the idea. In this piece of code, let's say we have an if which checks some condition c, and if this condition is true, we'll be executing the statement x = 5, and else we'll be executing the statement x = 7. This whole if statement is followed by a call to console.log, which prints the value of x to the console. Now, if we create a control flow graph for this piece of code, it will have four nodes. One of them is the evaluation of the if condition, basically checking whether c evaluates to true or false. This check of the condition may be followed by two things.
It may either be followed by the statement where x gets assigned five, or by the other statement where x is assigned seven, so we'll have two edges, one going to each of these two nodes. Then, at the end, no matter which of these two statements we execute, we'll come to the console.log statement where we are writing x to the console. So this is another node in our control flow graph, and because this can be executed after the one assignment or after the other, we'll have two more edges in our graph. Things can get a little more complicated if you have a program with loops, in which case we'll basically have an edge that goes back to one of the nodes that we have already visited, which means this graph can actually have cycles; it doesn't have to be an acyclic graph as in this simple example. The final kind of program representation that I would like to briefly introduce here is the data dependency graph. In contrast to a control flow graph, which looks at the order in which statements or operations in a program are executed, a data dependency graph models the flow of data from one operation to another. Specifically, it models the flow of data from an operation called a definition, which is basically every place where you are writing some data, to an operation called a use, which is basically every place where you are reading some data. Again, a data dependency graph consists of nodes and edges. The nodes, in this case, are all the operations in your program that either define or use, or maybe both, some data, and the edges represent possible definition-use relationships between these operations; they basically show us that the data defined at some operation may be used at some other operation. So if you have an edge from node n1 to node n2, then this means that n2 may read some data that is written at operation n1. Let me also show you a little example here, so this is going to be an example of a data dependency graph, and
again, I'm going to use JavaScript syntax, even though the syntax or the specific language doesn't really matter; you could do this with any other language you'd like. In this case, we have two variables x and y that are assigned three and five, followed by an if statement where we check whether x is greater than or equal to one. If this is the case, we are assigning the value of x to y, and then we have a third variable after the if statement, where we assign the sum of x and y to variable z. Now, let's have a look at the data dependency graph. We would have a node for every operation, or every statement, that either reads or writes some data, so that either defines or uses some data. In this case, we would have one node for this assignment to x that assigns the value three. Now the question is, where can this value that is defined here be used? One operation that uses this value is the evaluation of the condition of the if, where we check whether x is greater than or equal to one, and because this can use the value defined above, we'll have an edge like this. Another place where this value may be used is in the body of the if statement, where we are assigning x to y, because this is also reading the value of x, so there's another edge here. Then it may also be used at the last statement, where x plus y is written into z, because this statement, too, may use the value that is defined in the first line. Then we still need to show where the value of y is defined. There's another node here, which corresponds to the assignment of five to y, and this may be used at the statement at the end, because that is actually reading y. The other place where y may be defined, in addition to this assignment of five to y, is the assignment inside the if, where y gets the value of x. All right, so this was a very brief introduction of some of the concepts that are relevant here for program analysis. We will introduce some more of these throughout the course, but this should be enough for
this introduction, and we can now look into some of the basics of deep learning. Let me use as an example a task that has become kind of a classic in deep learning, namely handwriting recognition. The goal is to recognize digits between 0 and 9 from handwriting that is given as an image. This is a relatively easy task for humans: for example, if I show you the sequence of digits that you see here on the slides, you can probably figure out what digits these are. But it turns out to be a pretty challenging task, or at least it has been a pretty challenging task, for computers. Now, the idea of solving this task using deep learning is to learn from a large number of training examples, where we basically have an image along with a label that tells us what digit this image represents. Using this data, we can then train a model that, in the end, becomes pretty good at predicting the digits shown on images, and if you use a deep learning model for this purpose, then typically you'll get more than 99% accuracy with modern models nowadays. So how does this work? How can a deep learning model predict what digit is represented on an image? The way this works is by having a network of so-called neurons, and what I want to do now is to give you a very simple example of such a network. In this example, I'll use these little circles to represent the individual neurons, and we will have multiple layers of neurons. The very first layer that you see here at the beginning is what is called the input layer. As the name suggests, what the input layer does is to represent the input that we want to give to our model; for our example of handwriting recognition, this could basically be a representation of the pixels of an image. Now, in addition to the neurons in this input layer, we will have more layers in this example.
Let's say we have one more layer that looks like this. Now, all the neurons, or at least some of the neurons, in our network are connected with each other, and in this example we'll connect all the neurons from the input layer to all the neurons in this first layer, which basically looks like this. So the information from the first neuron in the input layer can flow to all three neurons that we have in this first so-called hidden layer, and the same holds for all the other input neurons: they all can feed into all the neurons in the hidden layer. All right, now in addition to this one hidden layer, there may be another hidden layer, which in this example again has three neurons, and again everything is connected with everything, so all the neurons from the first hidden layer are connected with all the neurons from the second hidden layer, and it looks like that. Then, at some point, we'll also have an output layer, which in this very simple example consists of just one single neuron, and this output layer is connected with all the neurons from the last hidden layer, so we have these three connections that all go to the output layer. Let me add some terminology to this little picture. These layers in the middle are the so-called hidden layers; this layer here at the end, which in this simple example consists of just a single neuron, is the output layer; and this whole thing is the network, or the neural network. Just to make this picture complete:
All of these circles here are neurons. For the example of recognizing digits, what this output layer could represent, for example, is whether or not the digit shown on the image whose pixels we get as input is, say, the digit three. You can think of this basically as a probability that the model predicts, telling us how likely it is that the pixels actually represent the digit three. In practice, the networks that recognize digits on images are a bit more complex, and in practice they also have more than just one neuron as output, but for our simple example, this is good enough. So far, these neurons are just little circles; let's now have a slightly more detailed look at what they actually are, and let's start with one of the most simple kinds of neurons that exist, which is actually not used in today's deep learning models, but is still very interesting for understanding the basic ideas. This simple model is what is called a perceptron. Perceptrons are the most basic kind of neuron, and the reason why they are so basic is that they can only take binary inputs, so every input is either zero or one, and they also produce only binary outputs, so the output, too, is just zero or one. Let me show you a little example. We could have our neuron here, and it gets as its input, let's say, three values x1, x2, and x3, and then it is supposed to produce some output. Now the question is, how does it know what the output is
depending on the input. This is where it becomes interesting, because the output is controlled by two things. One are so-called weights, which basically give every input a specific weight and tell us how important this input is for the output; we'll have three of those, w1, w2, and w3. And then there's also a so-called bias, which can basically shift the output irrespective of the input. Now, given these weights and the bias, we can compute the output as follows. The output is either zero or one, and the question is when it is which. We say that the output is zero if the sum over all our inputs, which I'm indexing here with j, where we multiply every input with the corresponding weight, is smaller than or equal to some threshold, and we say that the output is one in the other case, which basically means: if all the inputs, weighted by their weights w and then summed up, are larger than that threshold. A shorter way to write the same thing, and this is the way that is typically used in deep learning papers and also in basically all of the material ahead of us, is to say that the output is zero if the weights times the inputs plus the bias, so w · x + b, is smaller than or equal to zero, and it is one if w · x + b is larger than zero. Just to complete this little note: the w here stands for the weights, and the b here stands for the bias that each of these neurons, and in particular these perceptrons, has. So let me illustrate this idea with a little example. In this example, we want to predict whether you should go to a cheese festival that is happening in your city, so a very, very important question. Let's assume that in order to make this decision, whether you go to the cheese festival, there are three factors that you take into consideration. One of them is whether the weather is good, because you're more likely to actually go to the festival if the weather is good. The second factor
that is important for you is whether your friends are going, because you don't want to go there alone. And then the third factor, which seems obvious, is the question of whether you actually like cheese or not. Now, using these three inputs, you want to make a decision, which is whether you go to the cheese festival, and the way you do this is, obviously, through a neuron, which we have here in the middle. The neuron gets these three inputs and is supposed to produce the output, and in order to make this decision, the neuron requires its weights and bias. In this example, let's assume that the weight for the first input is five, the weight for the second input is three, and the weight for the third input is one, and let's say our bias here is minus seven. Now, to make this concrete, let's assume we have some specific inputs given, because for a specific day and a specific cheese festival, you know the weather, whether your friends go, and whether you actually do like cheese. In this case, let's say the first input is one, the second input is also one, and the third input is zero: the weather is good, your friends are going, but you do not like cheese. Now, in order to fully understand what's going on, I invite you to just pause the video for a second and do the computation yourself, so that you really understand what's going on. But, of course, I'll also give you the answer here. So what
we'll do is this: we multiply the inputs with the weights, which in this case means we multiply five times one, so the first weight times the first input, plus three times one for the second input, and then one times zero for the third input, and this sums up to eight. Now, using this and the bias, we can determine the output. The output would be zero if eight, so our result of w · x, plus the bias, in this case minus seven, were smaller than or equal to zero, and it would be one if eight minus seven were larger than zero. In this case, eight minus seven is one, which is larger than zero, so the output is one, and this essentially means you will go to the festival. Now, computing whether you go to a cheese festival is of course very important, but the cool thing is that you can use these perceptrons for other computations as well, and one of them, perhaps the most important one here, is that you can use them to compute logical functions. Let me show you how to do this, and let me use one particular logical function, which you've probably already seen in some other kind of course, namely NAND gates, or not-and gates. What we want to have here is a gate that returns one whenever not both of the inputs are true. So essentially, what we would like to have is something that looks like this: we have these two inputs, x1 and x2, and they can be zero and zero, or zero and one, or one and zero, or one and one, and we want the output to be one if and only if not both of the inputs are one. Okay, and now the question is, how can we do this using just a single perceptron?
And the magic is that we can actually do it, by basically feeding these two inputs, x1 and x2, into the perceptron and then getting our output, and by having weights and a bias that make sure that the output is exactly what we want to have. In this case, this works, for example, if we set w1, so the first weight, to minus two, w2 also to minus two, and the bias to three. You can now convince yourself that this actually works; let me just show you one example. The output for zero and zero is one, because in this case we get zero times minus two, which is zero, plus zero times minus two again, which again is zero, and then plus three, because this is the bias, and because this is larger than zero, this gives us the output one. So why do we even care about NAND gates? Well, the reason is something you might remember from some other computer science course that you may have taken at some point: by just using NAND gates, we can express arbitrary computations. By extension, if we have a model built from a perceptron that can express what a NAND gate is doing, then by having multiple of these NAND perceptrons, we can express arbitrary computations using just a neural network. Let me illustrate this using an example. Here we have a combination of NAND gates that is adding two bits. I'm not going to go through all the details of it, so you can convince yourself that this is actually adding two bits, and at the bottom you see the corresponding network of perceptrons that is doing essentially the same thing, so it's also adding two bits, and you can basically express this computation just using the perceptrons that we've just seen. So now you know that by combining different neurons, we can basically express arbitrary computations, and you can build complex networks out of these neurons that perform arbitrarily complex algorithms. Now the big question is, how do we decide the weights and biases of these complex networks?
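As a quick sanity check, the NAND perceptron and the cheese-festival neuron from above can be written out in a few lines of Python. The function name `perceptron` is just my choice here; the weights and biases are exactly the ones from the examples.

```python
# A perceptron with the output rule from above:
#   output is 1 if w . x + b > 0, and 0 otherwise
def perceptron(x, w, b):
    total = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if total > 0 else 0

# NAND: weights (-2, -2) and bias 3, as chosen in the example.
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", perceptron((x1, x2), (-2, -2), 3))
# Prints 1 for (0,0), (0,1), (1,0) and 0 for (1,1), i.e., a NAND gate.

# The cheese-festival neuron: weights (5, 3, 1) and bias -7.
# Good weather (1), friends going (1), don't like cheese (0):
print(perceptron((1, 1, 0), (5, 3, 1), -7))  # prints 1, i.e., go
```

Note that the hand-chosen weights do all the work here; nothing is learned yet, which is exactly the limitation the next part addresses.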
So just for the NAND gate, you could do it by hand: you can basically play with the weights and the bias until you see that the neuron actually behaves like a NAND gate. But for a more complex network, this is not really what you want to do; hand-tuning these networks is infeasible as soon as you have a slightly more complex network. Instead, the key idea behind machine learning with neural networks is to automatically learn these weights and biases, and this is what enables deep learning to actually learn to express these complex computations. So let's have a look at how this learning works, and at whether the perceptrons that we've just looked at are a good candidate for neurons to help with this learning. What we want to do here is to make learning possible. As a simple example, let's assume we have a neural network that consists of some input layer, a hidden layer with three neurons, and then some output, and let's again assume we have all the neurons of one layer connected to all the neurons of the next layer, so what we'll get is essentially this. Now, for some given input, we will get some output based on the current weights that we have at all these connections. I'm only showing you one of these weights here, and let's say this weight then helps us to get this output.
Now, if the output is not what we want, we need to help the model to get better, which basically works by telling the model that we would like to have a slightly different output, which we get by adding some delta to the output that we currently get. Then we use some math, which we will not get into exactly here in this course, which helps us to determine how to adapt the weight accordingly in order to get the output that we want. In order to enable this kind of learning, we need an important property, which is that a small change of the weights and biases in our network should cause a small change of the output. Now, if you look at the perceptrons that we have seen so far, then you'll see that this is actually not the right kind of neuron to provide this property. Why is that? Well, the reason is that the output of the perceptron follows a step function, which means that it changes suddenly, instead of changing a little bit, if you change the weights and biases a little bit. This step function basically looks like this: it produces outputs between zero and one, and up to some point this output is zero, and then suddenly, at some point, this output becomes one. This is not really what we want, because it makes it very difficult to learn anything: we do not have this nice property that changing the weights and biases a little bit also changes the output by a little bit. Fortunately, there's another kind of neuron that we can replace our perceptrons with, and then we suddenly get the property that we want. This kind of neuron is called a sigmoid neuron. The neuron looks basically the same as before, so it's one of these little circles, and it gets some inputs, let's say x1, x2, and x3, which are then used to produce an output, and again there are weights and a bias here.
So again, for every input we have some weight w, and then we have the bias b. But what is different now, compared to the perceptron, is that all the inputs and also the output can take arbitrary values, specifically arbitrary values in the range between 0 and 1. So how does this sigmoid neuron compute the output given the input? It again works by multiplying each of the inputs with the corresponding weight and then adding the bias, but now, instead of applying the step function to it, as we implicitly did for the perceptron, we use the so-called sigmoid function, represented by this little sigma. The cool thing about the sigmoid function is that it does not have this sharp step from one value to the other; instead, it has an S-like shape, and it is a continuous function that does not suddenly jump to some other value. Let me try to draw this here; I'm not really good at drawing functions like this, but I hope you get the idea. It basically starts at zero, then at some point increases, and eventually approaches one. Mathematically, this is defined as follows: this little sigma stands for the sigmoid function, and it's defined such that it takes some value,
Let's say that and returns one divided by one plus e to the power of Minus z and in our case this means it's one divided by one plus and then e to the power of the negative sum over all our weights where we're summing up over j times the corresponding input and Then this whole thing plus our bias B And now the cool thing is that this kind of neuron enables learning because a small change of One of the weights or the bias will cause a small change in the output So now you've seen two of these functions The step function and the sigmoid function and the more general concept behind all of this is that these are called activation functions So one of them you've already seen it's this step function that roughly looks like this So it's zero and then at some point jumps to one and then is one Then the other one that you have already seen is the sigmoid function or sometimes also called Logistic function which as you've already seen roughly looks like that It's this S shape. So it starts at zero Let me try again starts here slowly goes up and then approaches one Another kind of function that you could use as an activation function is the identity function Which simply returns the same value as the value that is given to it So this Essentially looks like this that if a negative value is given it returns negative if zero is given it returns zero And if a positive value is given it returns this very same positive value And yet another kind of activation function which you'll probably Encounter every now and then every now and then is the rectified linear unit which is called like this because it similar to the identity function Returns the same thing as the input but not for negative values because for negative value. It simply returns zero and now What is important to know is that these different activation functions are useful in different kinds of networks and at different places in these neural Networks and one exactly to use what? 
When exactly to use which one goes beyond this course. We will cover some of these activation functions in different modules of this course, but we won't really go into the details of when exactly to use which of them. All right, so now you've seen that these neural networks are composed of neurons, and you have an idea of how a neuron computes its output based on the inputs, the weights, and the bias. Earlier we hinted at the idea that these weights and biases are adapted automatically in order to make the network compute what we actually want it to compute, and now the big question is: how does that work? One important ingredient that we look at now is that there needs to be some kind of feedback that tells the network how good the current output is, so that it gets an idea in what direction to change in order to become even better. This kind of feedback is what is called a cost function. So let's have a look at these cost functions, which enable learning because they provide feedback to the model: they tell us how good the output is for a given input. Let me illustrate this with a simple example, again motivated by the idea of recognizing handwriting, in particular digits. Let's assume we have some kind of network, which we do not have to worry about in detail, and let's assume that this network predicts the probability that the digit it gets as an input is a zero. It has another output which is interpreted as the probability that the digit is a one, and so on, down to nine. So basically, for every digit,
there is a probability that tells us how likely it is that this is the digit shown on the given image. For example, let's assume that we provide an image where we know that it is showing the digit six, so it is known that this digit is a six. Then what we want the output to be is the following. The output is represented as a vector of ten entries, and for the digits that are not six we want the output to be zero: for digit zero we want the probability to be zero, and the same for one, two, three, four, and five. But then for digit six we want the model to ideally output one as the probability, because it should be very certain that the digit shown on the image is a six. And for the remaining three digits, seven, eight, and nine, the output is again supposed to be zero. Now the problem is that the network won't be perfect, so the actual output that it provides may be something else. For example, the actual output, let's call it a, could be the following. Maybe it's pretty certain that this is not a zero, pretty certain that this is not a one, and pretty certain that this is not a two, but it isn't so sure about three, so it gives a 20 percent probability that this might be a three. It's sure that it's not a four, it's sure that it's not a five, and it has a relatively high probability assigned to digit six, so it is on the right track. But it also assumes that it might be a seven, whereas for eight and nine it's again certain that these are not the right digits. Okay, so we get a vector that looks a little bit like the one we want, but it's not quite there yet. So now we have to communicate to the model that this actual output is not quite what we want.
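To make the example concrete, here is a small sketch in Python of the desired one-hot output for an image of the digit six and a possible imperfect actual output. The concrete probabilities are my own illustration chosen to mirror the description above, not values taken from the lecture's slides.

```python
# Desired output for an image known to show the digit 6:
# probability 1 at index 6, probability 0 everywhere else ("one-hot").
desired = [0.0] * 10
desired[6] = 1.0

# A possible (imperfect) actual output a of the network: mostly on the
# right track, but with some probability spread onto digits 3 and 7.
actual = [0.0, 0.0, 0.0, 0.2, 0.0, 0.0, 0.6, 0.2, 0.0, 0.0]

# The network's best guess is the digit with the highest probability.
best_guess = max(range(10), key=lambda d: actual[d])
```

Even though `best_guess` is the correct digit here, the two vectors clearly differ, and that difference is exactly what the cost function discussed next will measure.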
We need to provide, as feedback, a cost imposed by having this wrong output. One way to do this, and this is one out of many different cost functions that we could use here, is the so-called quadratic cost function. What it does is to take the desired output that we would like to see and compute the distance from this output to the actual output that we get. Specifically, if you imagine these two outputs as two points in a vector space, we look at the distance between these two points and then square this distance. This is called the quadratic cost function. Usually this is not only done for one specific input example, but over a set of input examples, for example all the examples in our training data, and then we take the average over all of these costs, or the average over all of these errors. So this should be an n, and what the n refers to is the number of training inputs that we have. This is called the cost function, and because we are taking this error squared and then taking the average, it is also called the mean squared error. Okay, so as a little quiz, to make sure that you fully understand this idea of a cost function, and in particular of this mean squared error cost function, let's have a little example. Again we use the example of recognizing handwritten digits, but now only looking at the digits zero, one, and two. The basic idea is the same; it's just that we have slightly shorter vectors than if we had to deal with all ten digits. Now let's assume that we have two training examples, example one and example two, and for each of these training examples we get an actual value that our model is currently predicting, this one and this one, and we also have the desired output that we would like to get from our model. Now, given these four vectors, the question for you is: what is the value of the cost function for these current predictions?
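As a side note, the mean squared error described above can be sketched in Python as follows. The `mse` function name and the concrete three-digit example vectors are my own illustration, not the ones on the quiz slide, so running this does not give away the quiz answer, but you can reuse the function afterwards to check your pen-and-paper result.

```python
# Quadratic cost (mean squared error) over a set of training examples.
# Each example pairs the desired output vector y with the actual output a.
def mse(examples):
    n = len(examples)  # number of training inputs
    total = 0.0
    for desired, actual in examples:
        # squared length of the difference vector (y - a)
        total += sum((y - a) ** 2 for y, a in zip(desired, actual))
    return total / n

# Hypothetical three-digit example in the spirit of the quiz:
examples = [
    ([0.0, 0.0, 1.0], [0.0, 0.2, 0.8]),  # model is slightly unsure
    ([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]),  # model is exactly right
]
cost = mse(examples)  # (0.08 + 0.0) / 2 = 0.04
```

Note that a perfect prediction contributes zero to the cost, which is why the ideal value of the cost function over the whole training set is zero.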
That is, what cost would be computed for these current predictions that the model is giving us? So now I invite you to pause the video at this point and actually do the computation using pen and paper, so that you get a good feeling for what this cost function really means. All right, so let me show you the solution. The cost function returns the average, so one divided by n, and here we have the sum over all our inputs x, where for each of these inputs we take the value that we would like to see, the desired output, minus the value that we currently get, and then we square all these differences. In our case we have two examples, so it's one divided by two, times this sum, where it's useful to know what exactly this computation of the length of a vector means. Let me just repeat this, in case anyone has forgotten: if we have a vector (x, y, z), then the squared length of this vector can be computed as x² + y² + z². For our concrete example, this means that here we have to compute the squared length of the vector (0, −0.5, 0.5), plus the squared length of this other vector, which only consists of zeros. If you do the math, you'll see that this is 0.5 plus zero, and this whole thing divided by two means we have 0.25 as our cost. So now that we know how to compute the cost function, let's have a look at how the learning actually works. The goal of learning is to minimize the cost function: we want to find weights and biases that, across all the examples we train the model with, minimize the cost function, so that ideally it would be zero. The way deep learning typically does this minimization is using a technique called gradient descent. We will not go into the details of the math underlying gradient descent, but intuitively, the idea is that we compute the gradient of the cost function, so that we see how we need to adapt the weights and biases in order to reduce the cost. As visualized
on this little graphic here, we basically try to move closer and closer toward the minimum in a step-by-step manner. How large these steps are is determined by the so-called learning rate, which is one of the hyperparameters that you can typically set when you configure the learning process of one of these deep learning models. Now, the effort of computing this gradient depends on how many training examples you have. Ideally, you would like to optimize the model perfectly for all the training examples that you have, but in practice this can be pretty expensive. So instead, what is used most of the time is a process called stochastic gradient descent, which takes a small subset of all the examples that you have, computes an estimate of the true gradient, and then adapts the weights and biases based on this estimate of the true gradient. To obtain these samples, what is usually done is that we split all the training data we have into k so-called mini-batches, basically subsets of the training data, and then train the network with one of these mini-batches after the other, until we have seen all the data in our training set. Going through all the data once in this way, always in batches of samples that we use for stochastic gradient descent, is called one epoch. So basically, going once through the data is what is called an epoch. All right, and this is all I want to say about program analysis and deep learning at this point. I hope you now have at least a little bit of an idea of how these two fields work, and of course we will look more into this in the upcoming modules. Thank you very much for listening and see you next time.