So let's recap what we have seen so far: the basic concepts of artificial neural networks. We started with the perceptron, which is a special artificial neuron whose inputs and output are binary. Since it can mimic a NAND gate, a circuit of perceptrons can mimic universal computation. The problem is that these circuits are discrete, so changing the biases and the weights of the perceptrons does not lead to smooth changes of the output, so they are very difficult to train. For this reason the perceptron is replaced by the sigmoid neuron, or in short the artificial neuron, in which the hard step function is replaced by the sigmoid. The output it produces is the following, so it's more or less a smoothed step function. In fact, one can actually prove that these sigmoid neurons can mimic perceptrons: it is enough to scale the weights and biases up, and the sigmoid approaches a step. Now we can put these neurons together into a network. The layer that receives your data is called the input layer. Then there will be an output layer; in this example it is just one neuron, but it can be many. And whatever is in between the initial input layer and the output layer are called hidden layers. And I've already mentioned that the more hidden layers and the more neurons you have, the more expressive your network will be.
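The contrast between the two kinds of neuron can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code:

```python
import numpy as np

def perceptron(w, b, x):
    """Perceptron: hard threshold on w.x + b, so the output is 0 or 1."""
    return 1 if np.dot(w, x) + b > 0 else 0

def sigmoid_neuron(w, b, x):
    """Sigmoid neuron: the step is smoothed, the output varies in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))

# A perceptron with weights (-2, -2) and bias 3 implements a NAND gate:
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(np.array([-2.0, -2.0]), 3.0, np.array(x)))

# Scaling w and b up makes the sigmoid neuron approach the perceptron:
print(sigmoid_neuron(np.array([-200.0, -200.0]), 300.0, np.array([1.0, 1.0])))
```

With the scaled-up weights the sigmoid output at input (1, 1) is essentially 0, just like the perceptron's.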
These networks are called feedforward: the output of each layer is used as input for the next one, and there are no loops. We also have other architectures of networks, for example recurrent ones, but we are not going to see them. Now, let's see an example of how this kind of neural network can be used. This is, for example, to recognize, to classify handwritten digits. The images of handwritten digits come from the MNIST dataset, and they are black and white. So each pixel can be associated with a number between 0 and 1, and that will be the input of my neural network. And there are enough of them with a label associated, such that you can train your network. A network to recognize, to classify handwritten digits can look like something like this. So you have an initial input layer that is composed of 784 neurons, because the pixels are almost 30 times 30 (they are 28 by 28). And these are all the input data. Then there is a hidden layer with a certain number of neurons; it could be 15, or 30, or 60, this is a choice. Finally, since we are interested in the digits from 0 to 9, we have 10 output neurons, one for each digit. This is the overall structure of our neural network for handwritten digits.
There was a question about how many parameters such a network has. You can count them: each hidden neuron receives 784 inputs, so it carries 784 weights plus one bias, and the output neurons have their own weights and biases on top of that. So it... More than 20,000. No? I don't know. But, order of various thousands of... 12,000, or something like this. So, quite many parameters to optimize. And that is the point: the optimization is well defined, but there are very many parameters. So let's set the problem up explicitly. We have the MNIST dataset; each image is 28 pixels by 28 pixels, so 784 pixels in total. The training set, which we use to learn the parameters, contains 60,000 labelled images, and the test set, which we use to assess the accuracy and the performance of the network, contains 10,000. The input x of the network is therefore given by 784 real numbers between 0 and 1. As output we decided to have, but this is an arbitrary decision and I'll come back to this later on, an output a that is given by 10 real numbers between 0 and 1. And this output a, in the notation that I used this morning, is given by my function parameterized by all the weights (and with w I include also the biases) of the input x; I can be more explicit and write the biases as well. So this is the output of the neural network. But the output that we actually want, because these images are all labelled by the corresponding digit, is the desired output. I can call it y(x), and this is going to be given by 10 numbers, all 0 except a single 1, sitting at the jth position, corresponding to the jth digit. So if my handwritten digit was 3, this vector will be (0, 0, 0, 1, 0, 0, 0, 0, 0, 0).
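The desired output y(x) can be made concrete with a tiny helper. The name `one_hot` is my own, just to pin down the notation:

```python
import numpy as np

def one_hot(digit, num_classes=10):
    """Desired output y(x): ten numbers, all 0 except a 1 at the digit's position."""
    y = np.zeros(num_classes)
    y[digit] = 1.0
    return y

# For a handwritten "3" the label vector has its single 1 in position 3:
print(one_hot(3))
```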
Is the notation clear? Because I'm going to use it. Yes? Thanks. So, as I was mentioning before, in order to train the network we have to define a cost function. So let's define the cost function for this specific case. And it will be a sort of distance, as I said before, between the labels, so the desired output, and the output of the network. So my cost function, I can define it as a function, of course, of the weights and the biases. And this is an average of all the distances; n is my number of data, so 60,000 in this case. So it is the sum of, and as a distance I take simply the Euclidean distance, let's say, between the desired output and the actual output a of my neural network. Other cost functions can be used, but we will consider this one. And the cost function that you use can determine the performance of your network. Abbreviated, this is an average, over all the training set, of something that I could call a cost function C_x that depends on x. Other cost functions that can be used are the cross entropy, for example. So the aim here is, as I said before but now more explicitly, to minimize my cost function over the parameters, the weights and the biases. So: learn the parameters; this is also the language that is used. How do we do that? Let's see the stochastic gradient descent algorithm that I was already mentioning briefly before. So if that example was 2.3, this is 2.4: stochastic gradient descent. To ease notation, I'm going to substitute weights and biases with just one symbol for the parameters, nu. One letter to describe my parameters, nu; it comprises both weights and biases. And we want a strategy to update nu in such a way that the cost function decreases after each update. And the notation that I'm using is: an update for the set of parameters nu means that nu will go to nu plus delta nu. And I will indicate explicitly that these are vectors.
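The quadratic cost just described can be sketched directly; the conventional factor of 1/2 is an assumption of this snippet, not something fixed by the lecture:

```python
import numpy as np

def quadratic_cost(outputs, targets):
    """C(w, b) = 1/(2n) * sum over x of ||y(x) - a(x)||^2: the average squared
    Euclidean distance between desired outputs y and network outputs a."""
    n = len(outputs)
    return sum(np.sum((y - a) ** 2) for a, y in zip(outputs, targets)) / (2.0 * n)

y = [np.array([0.0, 1.0])]
print(quadratic_cost([np.array([0.0, 1.0])], y))  # perfect prediction: cost 0
print(quadratic_cost([np.array([1.0, 0.0])], y))  # completely wrong prediction
```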
And so nu will be given by its first component up to its last component, whatever it is, and delta nu by the deltas of each component. Now, what happens to our cost function when these parameters change? In general, the cost function will change and will become the same cost function as before plus a certain increment. And at first order in the parameters, this increment delta C is going to be given by the dot product of the gradient of C and the increment in my parameters. Here this is my gradient vector, which explicitly means that I'm just taking the partial derivatives of the cost function with respect to each of the parameters. (It depends on what was d; I think it was d, yes.) We want this increment to be always negative, so that the cost function decreases. So we can decide to update our parameters directly following the gradient itself. So delta nu, the increment of the parameters, is given by minus a certain parameter eta times the gradient. And this eta is commonly called the learning rate. Then, substituting this expression into that one, we have that the increment in the cost function is going to be given by minus eta times the modulus squared, the squared length, of my gradient, which is obviously negative, so we are happy. And this rule, star, is the update rule for the parameters of the network. So now you might ask: well, that's fine, if you can calculate the gradient then you can update your parameters and follow the gradient. There are two problems, of course. One is that your cost function, for which you want to calculate the gradient, is actually an average over many data, let's say 60,000. So that means that you have to calculate 60,000 gradients, which is already extremely slow, even if each gradient is relatively easy to calculate. And the second thing: well, this is a minimization problem, so how can I be sure that I don't get stuck in local minima?
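The update rule nu -> nu - eta * grad C can be tried on a toy cost whose gradient we know in closed form. Everything here is illustrative, not the network's actual gradient:

```python
import numpy as np

def gradient_descent_step(nu, grad_C, eta):
    """One update nu -> nu - eta * grad_C; at first order this changes the
    cost by delta C = -eta * |grad_C|^2 <= 0, so the cost cannot increase."""
    return nu - eta * grad_C

# Toy cost C(nu) = |nu|^2, with gradient 2 * nu: repeated steps reach the minimum.
nu = np.array([1.0, -2.0])
for _ in range(100):
    nu = gradient_descent_step(nu, 2.0 * nu, eta=0.1)
print(nu)  # very close to the minimum at the origin
```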
And here is where stochastic gradient descent comes into play, because essentially what is done in training neural networks is that instead of calculating the gradient averaged over all the data in the data set, we calculate an approximation of this average, not over the full training set, but over only a batch, a part of it. And we call this part of it, typically, a mini-batch. So the solution to these problems is to use stochastic gradient descent. Up to here, this is essentially the update rule of plain gradient descent: you are following the gradient, you are descending it. And now we add stochasticity to solve the two problems that I mentioned. So let's use this notation. Let's denote a random subset of my data x-tilde-1 up to x-tilde-m, with m typically much smaller than n; m can even be just equal to 1. They belong to the training data set, they are chosen randomly, and they are called a mini-batch. And the approximation that we actually want to implement is the following. Instead of calculating my gradient, which in principle is the gradient of this average of the cost function over the full data set, and which by linearity is given by 1 over n times the sum of the gradients of C over the x_i, I say: okay, I don't care, instead of calculating the average over all the gradients, I just calculate the average over a mini-batch. So this is given by approximately 1 over m times the sum of the gradients of the cost function, not over the full set of training data, but over the mini-batch only. And therefore the update rule, which in general was this one, becomes... and I'm going to... okay, I have it written here. It's going to be given now, for the weights and the biases, by these expressions. So every weight w_k is going to be...
Every weight w_k is going to change into itself minus the learning rate divided by m, where m is the size of the mini-batch, times the sum of the gradients with respect to w_k of only the data belonging to the mini-batch that I have stochastically chosen. And then, once you have calculated this gradient and performed this update of the parameters, you redo exactly the same thing, but for another, different, randomly chosen mini-batch. Okay? So you repeat the update rule, which is above in the slide, for another mini-batch. And you go on and on, and you decide when, let's say, to stop; this is, again, part of the heuristics of training a neural network. You can exhaust all your training data, or you can stop before. When you have gone through all your training data, to make it easy, you call this process, of repeating this update rule, changing mini-batch, updating, changing the mini-batch again, etc., an epoch. Okay? When you finish it, you say that you have finished an epoch of training. And then you start again. Okay? How many epochs? Again, it depends. When do you stop? It depends. But in principle you expect that at some point it converges. So ideally you decide to stop when you have trained the network on the training data in such a way that, when you go and verify how your network works on the test set, you reach an accuracy with which you are happy enough. Okay? So that's more or less how much you have to train; but I will also say something more about that later. Of course, two things. First, as I said, it's much faster than calculating the real gradient, simply because the size of the mini-batch is much smaller than the size of the training data. And second, since you are doing this stochastically, essentially you are changing the landscape of your cost function over and over again. So this helps a lot in avoiding local minima. Was this your question? No. Yeah?
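One epoch of this procedure can be sketched in a few lines. This is a toy sketch under my own simplifications (a single scalar parameter, a made-up per-datum cost), not the network's training loop:

```python
import numpy as np

def sgd_epoch(params, data, grad_fn, eta, batch_size):
    """One epoch of stochastic gradient descent: shuffle the data, split it
    into mini-batches, and for each mini-batch update the parameters using
    the gradient averaged over that mini-batch only."""
    data = list(data)
    np.random.shuffle(data)
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        grad = sum(grad_fn(params, x) for x in batch) / len(batch)
        params = params - eta * grad
    return params

# Toy use: per-datum cost C_x(p) = (p - x)^2 / 2, whose gradient is p - x,
# so SGD should drive the parameter p towards the mean of the data.
np.random.seed(0)
data = [1.0, 2.0, 3.0]
p = 0.0
for epoch in range(200):
    p = sgd_epoch(p, data, lambda p, x: p - x, eta=0.1, batch_size=2)
print(p)  # fluctuates around mean(data) = 2.0
```

Note that p never settles exactly at 2.0: each mini-batch sees a slightly different landscape, which is exactly the stochasticity that helps escape local minima.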
There was a question? Yeah. So, it depends on how many data you have at your disposal. The larger the training set, the more confident you are that, if you find a large accuracy on your training set, then whenever you receive a new data point, outside the training set and outside the test set, a real, genuine new data point, your network is going to give you a decent answer. But if my new data are taken from something that has nothing to do with the training data, then no: this is not hard science. Okay. One second. So, in order to answer this question, let me introduce a bit of extra terminology. For a network we have this w and this b, so the weights and the biases. These are what we call the parameters of the network, over which we are going to do this stochastic gradient descent. But then there is all the rest. All the rest means, on one hand, for example, what I told you before, so how you load your data, etc. But the point is that you also have to choose, for example, the number of hidden layers, the number of hidden neurons, your learning rate, et cetera, et cetera. All the rest of these parameters are called the hyperparameters of the network. Whereas stochastic gradient descent tells you how to optimize the parameters of the network, and this is quite prescriptive, okay, so it's relatively easy to implement, there is no such prescription for the hyperparameters. Training a network means essentially being able to optimize the hyperparameters. Okay? And that is difficult. This is the difficult part of training a network, where a lot of heuristics kicks in. And so, for example, a relatively basic strategy is to take what I called the training set and actually divide it in two parts: what we continue calling the training set, and what we call the validation set, which is different from the test set. Okay.
And we change the hyperparameters by finding the ones that give us better performance, so better accuracy, on the validation set. Okay. So of course one has to come to a certain tradeoff, for example with computational time, because the more neurons you have in your network, the more expressive it will be, so the expressive power of your network will be larger; but that also means that calculating all these gradients, et cetera, will take longer. So at some point you might decide: okay, my hidden layers, my hidden neurons, my learning rate (I will give you examples now) are good enough on my validation set. Okay. And so at that point I can finally use the test set, and that will give the accuracy. Because maybe my hyperparameters are very good for the training set and the validation set, but they are not good for the test set, so they are not good for new data. So you have to finally assess the performance of the network on some data on which the network has not been trained. Otherwise, you can always enlarge the number of neurons enough that it works perfectly on the data it has already seen. So let me give you some examples of training. Okay. I think I actually have this part entirely on the slides. Yes. Okay. So this is done using the code that I was mentioning before, so using Nielsen's code that is attached to his web book. It's a non-efficient code, highly non-efficient, but it has just 74 lines of Python. So it's very short, it can be understood easily, and it's not something that you will want to use in your research, but it's something you will definitely want to use to learn more or less what it means to program a neural network. What's happening here? Okay. So this is the example of the handwritten digits that I gave you before. So we have an input layer of 784 neurons, this has been decided, and here, for example, we have 30 neurons in our hidden layer. The outputs are 10.
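The three-way split just described can be sketched as follows. The sizes are those of the MNIST example; the index arrays are of course just an illustration of the bookkeeping, not Nielsen's loader:

```python
import numpy as np

# The 60,000 labelled MNIST images are divided in two: a training set that
# stochastic gradient descent actually sees, and a validation set used only
# to compare hyperparameters.  The 10,000 test images are kept aside and
# touched once, at the very end, to assess the final accuracy.
rng = np.random.default_rng(0)
shuffled = rng.permutation(60_000)
train_idx = shuffled[:50_000]   # for learning the weights and biases
valid_idx = shuffled[50_000:]   # for tuning the hyperparameters
test_idx = np.arange(10_000)    # held-out final check

print(len(train_idx), len(valid_idx), len(test_idx))
```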
Training images are 50,000, because what one is actually doing is the division I described: taking 50,000 for the training set and 10,000 for the validation set. But for the moment let's not care about that too much. The mini-batch size is very small: it's just 10, 10 out of 50,000. You have a certain learning rate and you decide to do just 30 epochs. The test images are 10,000, and this is an example of the accuracy over the test set. So after training the network with 30 epochs we arrive at this accuracy: out of the 10,000 images belonging to my test set, 9,534 were correctly classified. And this takes some minutes to run, depending on the speed of the computer, of course; it's not something that is very demanding. What, sorry? Yes, at some point, let's see, at some point it can decrease, absolutely. There is no one that tells you the accuracy cannot come back down. What do you mean? No. Yes. Not only can it come back, it can do even better. So let's see. Let's say that instead of using 30 hidden neurons we now keep all these parameters the same and we use more hidden neurons: 100. So we do this, and the final accuracy is now 96.59%, so it's slightly larger than this one. So you can increase the accuracy by increasing the number of neurons. But one has to be careful, because touching these hyperparameters is not always easy. So the number of neurons is a hyperparameter, and going from 30 to 100 was advantageous. But if you change other hyperparameters, like, for example, the learning rate, here I changed it from 3 to 10 to the minus 3, you can see that the algorithm is much slower. And this can be understood, because the learning rate was essentially how fast I follow the gradient. But no one tells you in principle what is too slow, because it depends on the function that you want to fit. So one really has to do this tuning on a validation set for this set of hyperparameters.
But here, at least, the case is easy to diagnose, because the accuracy is monotonically increasing; it doesn't decrease, so you can understand that probably I have to speed up a bit with my learning rate. Still, this final accuracy is quite bad. But even worse things can happen. If the learning rate is too high, then essentially you are not learning anything; you are just wild guessing, actually a bit worse than wild guessing. And you can see that the accuracy decreases, for example: at first it was really a wild guess, and then it got even worse. So this increasing or decreasing of the accuracy really depends on your hyperparameters, and you have to do various trainings. Of course there are techniques to improve this, and I will mention a couple, but I don't have time for the details. This is part of training a network, or learning how to train a network. Using, for the same digits, high-performance codes, so not these neural networks built with a bit more than 70 lines of code, much higher accuracy on the test set can be reached, up to an amazingly good level, something like this: all the test set images are correctly classified except 21. And this is a sample of those 21 images. They are not correctly classified, but also for a human eye they are almost impossible to classify; the label attached to them might as well look like a sort of random label to me anyway. And this is something we also touched upon briefly this morning: the labels of your original set of data can be subjective. Okay. So now, in the last 13 minutes, just a couple of things, mostly one. And it is this: calculating gradients can be done, but it is typically time-consuming. So there is this technique for calculating gradients for neural networks which is extremely efficient from a numerical viewpoint and quite intuitive to understand.
So I'm not going to give you a full derivation and a full explanation of how the backpropagation algorithm works, but just an idea. It is essentially the chain rule of differentiation implemented the other way around, for activation functions for which you can easily calculate the derivative, like the sigmoid, for example. And I will explain this intuition using computational graphs. This is really the workhorse: the backpropagation algorithm within stochastic gradient descent is what changed the field, really, some 10 or 15 years ago. These algorithms have been known for 20, 25 years; they were good, but the computers were not. As soon as the computational power caught up, all this business exploded. So the training of the network is mostly based on the backpropagation algorithm together with stochastic gradient descent. I took these slides from a website, so you can go there and have a look, because the explanation is quite clear. So let's first consider an easy example of how to do simple arithmetic computations on graphs. Let's assume that we want to compute a function e that is given by this expression, (a + b) times (b + 1). We can do it in steps: we can call a plus b c, we can call b plus 1 d, and so e is then given by c times d. And we can represent these operations as different nodes in a graph. So originally I have a and b; I put them together here into c, which is given by a plus b; here there is d, given by b plus 1; and here there is another node, a multiplication, e, given by c times d. The label of each node tells me what the node actually computes, and these are all elementary computations, multiplications and additions. And so, for a specific case in which a and b are equal to 2 and 1, we can of course follow this graph, and so we have that c assumes a specific value, which is 3, d assumes a specific value, which is 2, and therefore e is equal to 3 times 2, equal to 6.
So we follow the graph in this way. Now, this is arithmetic on the graph; we can also do derivatives on these graphs. So the numbers here are the same as before, a is equal to 2 and b is equal to 1, and we can now label each of these arrows, so the edges of the graph, with the value of the respective partial derivative of the upper node with respect to the parent node. So this will be the partial derivative of c with respect to a; in this case it is equal to 1. This will be with respect to b; also in this case it is equal to 1. And this will be d derived with respect to b, equal to 1. Now I put a label also here, for these derivatives. This will be the derivative of e with respect to d, which is equal to c; but c is equal to 3, so I put here 3. And here the same thing: the derivative of e with respect to c is equal to d, and d is equal to 2, so I put here 2. And with this I can, step by step, node by node, do my derivatives. So if I want to calculate the derivative of my final output e with respect to one of the two inputs, either a or b, I can just follow this graph along each path. If I want to calculate the derivative of e with respect to a, I go from e to a and I encounter two derivatives: 2 times 1. If I want the derivative of e with respect to b, well, there are two paths, this one and this one, to go from e to b, so I have to sum over the paths; and this comes back a bit to one of the questions that was asked before about summing over paths, about path integrals. So it is 1 times 2 here on this branch, 1 times 3 on the other branch, and summing I obtain my derivative. So this is really nothing other than the chain rule of differentiation applied on a graph, for a simple function: the chain rule is a summation over paths. Fine. But we can run into problems when there are many paths, when the combinatorics implies many paths. If, however, we factor the paths at each node, then the number of operations that we have to perform drastically diminishes.
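The graph example above, worked out numerically, can be sketched as plain Python:

```python
# e = (a + b) * (b + 1), with intermediate nodes c = a + b and d = b + 1.
# Each edge carries a local derivative, and de/da, de/db are sums over paths
# of products of the edge derivatives (the chain rule on the graph).
a, b = 2.0, 1.0
c = a + b            # c = 3
d = b + 1.0          # d = 2
e = c * d            # e = 6

# Local (edge) derivatives:
dc_da, dc_db = 1.0, 1.0   # from c = a + b
dd_db = 1.0               # from d = b + 1
de_dc, de_dd = d, c       # from e = c * d: de/dc = d, de/dd = c

# Chain rule as a sum over paths from e down to each input:
de_da = de_dc * dc_da                  # one path, e -> c -> a: 2 * 1
de_db = de_dc * dc_db + de_dd * dd_db  # two paths, e->c->b and e->d->b: 2 + 3
print(de_da, de_db)
```

You can check by hand: e = (a + b)(b + 1), so de/db = (b + 1) + (a + b) = 2 + 3 = 5, exactly the sum over the two paths.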
So, for example, here, if I assume that the derivative of z with respect to x is given by this expression and by this graph, with the notation that I gave you before, so alpha means the derivative of y with respect to x, etc., then I have an expression like this that of course can be factored. So factoring this kind of calculation can be very beneficial. And then, on top of that, we have two ways in which we can calculate derivatives. The standard way, this sort of forward-mode differentiation, gives us, let's say, the derivative of whatever I want as an output with respect to one input parameter. In that case I follow my paths from the input to the output. Here there are three paths, alpha, beta and gamma; here there are another three paths, delta, epsilon and zeta. Whatever. And so I sum together these, I multiply by the sum of the others, and I obtain the derivative of z with respect to x: the derivative of my output with respect to one input. But I can also go the other way around and obtain the derivative of z with respect to x; it's just a mirror image of what I did before. But now let's go back to the previous example and see what it means to do forward or reverse differentiation. So if I do forward differentiation, let's say that I want to compute the derivative of e with respect to the input parameter b. How do I do it forward? Well, I start with the inputs: the derivative of a with respect to b is equal to zero, the derivative of b with respect to b is equal to one. Then I derive c with respect to b, d with respect to b, and finally I have the derivative of e with respect to b. And here, in the labels of my nodes, I have the derivatives with respect to b of all the nodes, including the inputs of course; but I only have the derivative of the output with respect to b. Now, here I want to calculate gradients, so the derivatives with respect to all my... did I erase it?
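The contrast between the two modes can be sketched numerically on the same toy graph; this is only an illustration of the bookkeeping, not a general implementation:

```python
# Forward vs reverse mode on the graph e = (a + b) * (b + 1).
a, b = 2.0, 1.0
c, d = a + b, b + 1.0          # c = 3, d = 2

# Forward mode, seeded with the derivatives of the inputs with respect to b:
# one sweep from inputs to output gives de/db only.
da_db, db_db = 0.0, 1.0
dc_db = da_db + db_db          # through c = a + b
dd_db = db_db                  # through d = b + 1
de_db = d * dc_db + c * dd_db  # through e = c * d

# Reverse mode, seeded with de/de = 1: one sweep from the output back to the
# inputs gives the derivatives with respect to BOTH a and b at once.
de_dc, de_dd = d, c
de_da = de_dc * 1.0                        # back through c = a + b
de_db_reverse = de_dc * 1.0 + de_dd * 1.0  # back through both c and d

print(de_db, de_da, de_db_reverse)
```

With many input parameters and a single scalar cost, as in a neural network, the reverse sweep is the one that pays off, which is the point made next.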
Yeah, okay, I erased it. But here I want to calculate the gradient of my cost function with respect to all the parameters. So the parameters will be the input nodes, and the cost function will be the final output, of which I want to calculate the derivative. Therefore I would like to have, as fast as I can, the derivative of the cost function with respect to all the parameters, which is not what forward mode gives me, because there I am given the derivative of e with respect to b only. Okay? If instead I go the other way around, so I start from here, then, following again the same graph, I obtain the derivative of the output with respect to both the input parameters, in this case, by traversing the graph only once. Okay? So this is very practical for calculating the partial derivatives, in this case of the cost function of the network. And labeling all these partial derivatives for a network is very easy, because these are all sigmoid functions. Okay? So the derivative of one neuron with respect to the output of the former one is always the same kind of partial derivative, given by the sigmoid function. Okay? It is easy to code, because you code the derivative of the sigmoid once and you keep calling that function. And yes, this is essentially how it works. Questions about this? I know that it was very sketchy. The very, very last thing that I want to mention is overfitting; it was touched on in some of the questions and partially also in my answers. You don't want to train too much on your training data. Okay? You don't want to train on it a lot, because otherwise the function that is going to be computed by your neural network is going to be ideally perfect over the training set, but then impossible to generalize: it will generalize very badly. This can be understood immediately with this very simple example: here I have points that obviously lie on a line.
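The one derivative that backpropagation keeps reusing is cheap to write down, a sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """The sigmoid's derivative has a closed form in terms of the sigmoid
    itself, sigma'(z) = sigma(z) * (1 - sigma(z)), which is why the same
    cheap function can be reused at every edge of the graph."""
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid_prime(0.0))  # the sigmoid's maximal slope, 0.25
```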
But suppose we train too much. Here, in general, I showed you an example with tens of thousands of parameters, but as I said there are networks with millions of parameters; with all these parameters I can draw a curve with as many twists and turns as I want, and I can choose it to fit my ten points, or however many they are, exactly. This of course generalizes very badly. And I'm not even talking about generalizing outside this range, let's say for points smaller than this or larger than that; I'm really talking about something that is in between: if you get a new data point here, the accuracy is going to be very bad. So you want to avoid overfitting your training data. So, for example, this is exactly the same case that I gave you before, which, if you remember (otherwise it's written here), I was training with 30 epochs, and I obtained this result. If I try with more and more epochs, so I train more and more over my training data set, and, to make this even more apparent, I don't use all the images but just a thousand of them, so a small amount of data, then what I obtain is that, if I calculate the cost function on the training data, this cost function seems to decrease a lot with the epochs of training.
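A toy version of the straight-line picture can be sketched with `numpy.polyfit`; the numbers here are made up for illustration only:

```python
import numpy as np

# Ten points that lie roughly on the line y = 2x, plus a little noise.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 10)
y = 2.0 * x + 0.1 * rng.standard_normal(10)

line = np.polyfit(x, y, deg=1)    # 2 parameters: cannot chase the noise
wiggle = np.polyfit(x, y, deg=9)  # 10 parameters: enough to hit every point

# The degree-9 fit has essentially zero error on the training points, but in
# between them it can swing away from the line: that is overfitting.
x_new = 0.55
print(np.polyval(line, x_new), np.polyval(wiggle, x_new))
```

The line's prediction at a new point stays close to 2 * x_new, while the high-degree fit is at the mercy of the noise it memorized.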
So it looks like I'm doing something better and better. But what I'm actually doing is just fitting better and better the data belonging to the training set, which, as I said, doesn't by itself make sense. This is in fact demonstrated here by plotting, also as a function of the epochs, the accuracy not over the training set, which is 100% here, but over the test set: you see that it stops at 82.2%. So this neural network generalizes its performance to new data very, very badly. One easy way, and with this I conclude, to avoid this overfitting is training with a lot of images. If you train with a lot of images, so a lot of data, and this is pretty obvious, then it will be much harder to overfit, however much you try. Here I also reduced the number of epochs, but that doesn't really change the picture. Trust me: the larger the data set that you have, the more difficult it will obviously be for your network to fit that data set perfectly, so the better it will generalize. And with that I conclude. No convolutional neural networks, I'm sorry. And that's all, so if there are questions I'm happy to take them here or in the tutorial room. No questions, I assume.