Today I want to build the lecture in the following way, because we do not have so much time. I want to go a little faster through the theoretical material, because you can also check it in the slides. And then I will show you some more files, at least four files, not two like last time, so that you get a more practical impression of how this works.

So, where we ended up last time: we said we have a shallow network, and in the shallow network the additional hidden layers are missing. We had only weights for the connections between the input and the units of the hidden layer, and between those units and the output layer. Now, for reasons I will explain later, we introduce additional layers. If this is the layer with index 0, this will be the layer with index 1, then the layer with index 2, the layer with index 3, and the output layer will have index 4. It is not necessary that these different layers have the same number of units. But all units within one layer work with the same activation function. So if you decide, for example, that this layer will work with ReLU, that is fine, but then please use it in all units of that layer. And if you decide here to work with the hyperbolic tangent, also fine, but then please use it everywhere in that layer. Just for the notation, we put the number of the layer in brackets as a superscript.

Again we have the same procedure: we have the forward propagation, then we look at the error at the output, we have the backward propagation, we update the weights and the bias at every layer, and then comes the next turn of forward propagation. So it goes forward and backward. As you see, this comes from the input; this is just for this small neural network. We multiply our weights with the output of the previous layer. Sometimes X is also written as A at layer 0, because it is of course the output at layer 0. We add the bias, give this to the activation function and get this output. This output is then used again in the second layer, and the procedure goes on to the third layer and then the fourth layer. Finally we get the estimation, and this estimation we compare with the label.

In this slide the important thing is really only this: these are the formulas we will use. You see that in this case I can also work with matrices and vectors, because these are the weights of layer l, and the columns are the weights for each connection. A at layer l minus 1 is the output from the previous layer, already as a vector, and the bias at layer l we also use as a vector, so that the output is a vector too, and we can pass this vector to the activation function at layer l. This means again it is not necessary to work in a loop; in most languages we can do it faster with matrix or array operations. This we discussed already.

Now to the question I touched on before: why are we using deep representations? Let me explain this with at least two examples. Suppose we want to teach a computer to see. That is something a computer cannot do in a simple way. We as humans learn it in our childhood, in a rather long learning process, and then we are able to classify, to predict, to recognize; a computer is not able to do that directly. But what it can do is learn from very simple things, going from very simple things to more and more complex ones. At the first level we only recognize edges: what are horizontal edges and what are vertical edges.
Based on the edges, we combine them and get parts of a picture, and in the end we get something like the whole picture, and then we can try to classify this picture. The same thing happens when we use deep representations for audio. For a spoken word, at the level of the audio signal we first have very basic sound features; then we come to phonemes, like the sounds in "cat"; later we come to words; and last but not least to sentences and phrases. So deep learning representations give the computer the possibility to start from very simple input and build more and more complex representations. That is the idea behind it.

Of course there are also other reasons. If we look at an XOR function, not with only two inputs but with, let's say, n inputs: you can of course compute this XOR with a shallow network, where all your operations sit in one layer, but then the number of operations is on the order of 2 to the power of n minus 1. If we use a deep representation instead, the number of computations is on the order of log n. So we need fewer operations, and this is also a reason to use deep representations. For very, very big data sets, if n is very high, you can imagine that the shallow version takes a lot of time and a lot of computational power; with depth we can reduce it. The other side of this is that many things we do with deep learning we could in principle also do with a shallow network. The question is only: how big is the computer you have, and how big is the team that can work on implementing it?

Until now it seems as if deep networks and shallow networks are more or less the same. Yes and no. To implement a deep learning network, it makes sense to work with caches. We go through the layers in the forward direction and then back in the other direction, and something that we calculated in the forward part, concretely our Z, we can reuse in the backward propagation. So in many cases it makes sense to put this value in a cache and read it from the cache when you are going back. At layer l, in the forward pass, with the given weights and bias, we calculate Z and the output A, and this output goes to the next layer. When we come back, we have the change dA at layer l, we read Z, the weights and the bias from the cache, and then we can calculate dZ, dW, db and dA for the previous layer. This reduces the number of computations and makes the algorithm easier, and in a lot of implementations of the backward function you will find exactly that.

What does this mean concretely? It means that we should have, last but not least, two representations and two functions: one function for the output layer, because the calculations at the output layer are a little different. There our dA comes from the difference between the prediction and the label, and it is calculated differently from the dA at the hidden layers.
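To keep the notation in one place, here is a compact restatement of the layer-wise formulas just described, written out in LaTeX. The output-layer derivative assumes the binary cross-entropy loss used in the logistic regression lecture; treat it as a summary, not as a new formula from the slides.

```latex
% Forward step at layer l, with A^{[0]} = X the input:
\[
Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}, \qquad
A^{[l]} = g^{[l]}\!\left(Z^{[l]}\right), \qquad l = 1, \dots, L .
\]
% At the output layer, with the binary cross-entropy loss, the backward pass starts from
\[
dA^{[L]} = -\frac{Y}{A^{[L]}} + \frac{1-Y}{1-A^{[L]}} ,
\]
% and Z^{[l]} is exactly the value kept in the cache for the backward pass.
```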
So we implement one function, or let's say one pair of functions, for the last layer, and then a structure that we can use more or less at every other layer. Then it depends only on the number of layers: you go in one loop upwards and in the other loop downwards, and then it works. Hopefully.

The biggest problem in the implementation is always to have the right shape for all your values. You will see that these are matrices, and also three-dimensional and four-dimensional arrays. It also happens to me that I implement something, see that it is not working, and the program answers me that the shapes do not fit. Then you have to think about where you put the wrong indices. So it makes sense, before you implement, to think about the shapes of the inputs and outputs you are getting, and to go through the whole algorithm forward and backward so that everything fits. Then hopefully you get the right answer.

The implementation of the backward algorithm, as you see, is more or less the same as we already had for the shallow network. We have our input dA at layer l, which we got from the calculation in the layer above. Now we want to calculate the changes in the weights and the changes in the bias at layer l. The formula uses this dA: we multiply it element-wise with the first derivative of our activation function, and based on that we can calculate the changes in the weights and in the bias at layer l, and last but not least, as an output, also the change dA for the previous layer, because of course we are going backwards. These things we store in the cache: in the cache we have our weights, we have the bias, and we also have these intermediate values.

If you remember, when I spoke about logistic regression I mentioned that we have parameters for the program. Now we are working not only with parameters but also with hyperparameters. The parameters are what is really changing, what is learned while the program runs. With the hyperparameters we can influence the performance of our algorithm. One of the most important hyperparameters, as I already said, is the learning rate; also the number of iterations, the number of hidden layers and the number of units. Somebody asked me yesterday whether it matters how many hidden layers and how many units I have. Of course it matters for the computation time, but it also matters for the accuracy of the output. I will show you some architectures later, and you will see that the authors play on the dimensions of the filters, they play on the dimensions of the different layers, just to get the wanted output. And of course there is also the choice of the activation function. There are other hyperparameters which I will not mention now, because we have too little time.

We have something called momentum. Momentum comes from numerics; it is an improved gradient algorithm. You remember gradient descent: we use gradient descent, based on the derivatives, to minimize the cost function, and sometimes it does not converge as well as we want. Then an improvement can be made, for example, with momentum. The mini-batch size means that from, let's say, a thousand examples we choose, by chance, only a sample of 100. The choice is random: you always choose randomly 100 examples, for example, from your data, and one run works on those 100.
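As a minimal sketch of what such a random mini-batch choice can look like in NumPy; the variable names and the toy data here are my own, not from the lecture files:

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed so the run is reproducible
m = 1000                                # total number of training examples
batch_size = 100                        # the mini-batch size hyperparameter

X = rng.normal(size=(m, 20))            # toy inputs: 1000 examples, 20 features
Y = rng.integers(0, 2, size=(m, 1))     # toy binary labels

idx = rng.choice(m, size=batch_size, replace=False)  # 100 indices drawn at random
X_batch, Y_batch = X[idx], Y[idx]       # one mini-batch for a single training step
print(X_batch.shape, Y_batch.shape)     # (100, 20) (100, 1)
```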
Regularization is also a technique we can use to avoid overfitting. All neural networks, and especially deep learning networks, are prone to overfitting. So if you have the feeling that your algorithm is not learning properly, it makes sense to use some kind of regularization. There are different kinds of regularization, you can find them in the literature; the most common is a quadratic regularization based on the Euclidean norm.

Now something that is quite often discussed, especially by people who have no deep knowledge of deep learning. If you read the newspapers, the journals or some philosophers, then this is all very dangerous, and in the next 50 years the robots will command us, and so on. Let's put it this way; this is not only my point of view, and I am now using a method of proof which is called the authority method: you call on authorities. Of course, for deep learning there are some similarities with brain activity. For example, as I showed you for computer vision, we first learn what edges are, and based on the edges we build more and more complex representations of the picture. This is also a way our brain learns: normally we start from something simple and then get more and more complex. But nevertheless, our brain does not have a literal activation function; it has no hyperbolic tangent activation, and as you will see later, it has no convolutional layers, no padding and so on. So we can say: yes, the idea of going from the simple to the more complex is really taken from the way humans learn, and not only humans, also organizations. You remember, two days ago I mentioned a definition, or let's say a question, from Tom Mitchell: what is similar in learning for computers, for humans, for organizations, for animals, and so on. And this is something we really use. But in general we should distinguish, and of this I am very convinced, between so-called general AI, an artificial intelligence that would really be able to replace our brain activity, and what we are doing now, which we can call AI in a narrow sense. It is not a really nice term, but I have nothing better. I took this from Andrew Ng, but also Kai-Fu Lee, one of the Chinese leaders in AI, supports the same point of view.

Kai-Fu Lee distinguishes between four waves of AI. There is internet-related AI and business AI, which are already running. If you go to the internet and make your clicks on Amazon, somewhere the computer says: this user Andreas, he wants to see shoes, so next time I will show him all the brown shoes in the size he was looking for. And maybe not today, but tomorrow I will buy them. Business AI, on the one side, has to do with production: we are doing a lot now to connect our production, under the buzzword Industry 4.0, with different network-based operations. And not only the production, but all the business structures; they are trying to introduce some kind of automation there as well. I do not know how the situation is in Armenia, but I can say that the expectations in Germany are very high.
And not only among the big industries, but also among the small and medium enterprises. There is a report from last year, or from 2017, from SAS; it is in English, by the way. They did a really deep survey, not just two questions; the discussion of the survey runs to about 50 or 60 pages. They really asked: where will you use it, what are your expectations, what really works, what does not work, and so on. Most of the companies expect that they can improve their internal organization. They do not expect so much to improve sales, but rather the inner organization of the company. This is a high expectation, also from the medium-sized companies. The problem is that there is a lack of specialists, but this is not special for Europe, and it is not special for Armenia; if you go to the US there is also a lack of specialists in this area.

So we are now at the stage of perception AI. We want to understand how our senses work and how, maybe, we can replace our senses in some places with artificial intelligence. This is what we are doing in computer vision, in natural language processing, and in the classification of sounds. Until now I have not read anything about the perception of taste, but maybe this will come. And haptic perception: they have also started to try to apply this to haptics, but that is very, very difficult. The last wave is autonomous AI: autonomous driving, autonomous swimming, autonomous flying, autonomous production. Mostly everything that has to do with robotics is in this area. So these are the waves that are expected; this is going on, I think, for the next 10 to 15 years, and autonomous AI for about the next 20 years. Do you already have autonomous driving on your streets, any car or bus introduced? I think they are not used to the way the drivers drive here; that needs a hard learning process. Okay, but nevertheless I mention it here.

The term "reverse-engineering the brain": the American National Academy of Engineering put out its grand challenges for the 21st century, and reverse-engineering the brain is one of them. I do not remember how many challenges they listed, do not ask me, but this is one of the big ones they expect to be solved during the 21st century. And where are we now with deep learning? Maybe at one or two percent of that. So I see, not only for my generation but also for yours, no danger that you will be replaced by robots. By the way, Kai-Fu Lee published a book about China as a power in AI, and he also discusses which kinds of jobs really will be replaced. If somebody is interested, I can really recommend going to this book, because he has a lot of experience and it is of course also based on statistics.

But the general problem of general AI is our computational power, which is not strong enough for us to get the results that would show whether our theoretical ideas take the right way. I hear that in the USA there are already quantum processors doing computational work a million times faster than our normal computers. I have a good friend who is in close connection on one side with Microsoft, especially with supercomputing, and on the other side he works in the area of algorithms.
And he always emphasizes that it is good to have big computational power, but that this will not be enough. It is a necessary condition; big computational power by itself will not produce brain activity. The computer will not arrange the connections of its hardware and software automatically in such a way that it works like a brain. For this we need much more insight, much more understanding of how a brain works, and as I read the literature, we are still at the beginning. So again, I agree it is a big challenge, but I see another problem. Working with supercomputers is one fine thing, but how can we break this down so that it runs on the level of a chip, so that you can put it in a car, or in a street light, and so on? This is, by the way, one of the directions we are working in: we try to break the algorithms down to a level where they can run at least on a Raspberry Pi, possibly extended with a stick which is now produced by Intel. This stick is made especially for video, for machine-vision applications. We tested it, for example, and found that a normal YOLO algorithm, which you use for object detection, not object classification, that is a different thing, works in such a way that you can take a video and it picks out the car and follows the car, on a Raspberry Pi. But the Raspberry Pi alone is not enough; you need the stick in addition. And of course we first have to train the network, and that you cannot do on a Raspberry Pi or a similarly small computational unit. And it is again a question of the energy you will use for it. Let me close this.

So let me come to special networks. We now have deep learning networks, but these kinds of deep learning networks have one problem. I will not only call it a problem; everything we see as a problem, the people in Japan, let's say, call a hope, which is a different way of looking at it. We have our units, and the layers are fully connected. If the layers are fully connected, you have a lot of parameters, and this brings you to the problem that you have too many computational operations. One way to work on this is to specialize the architectures: convolutional networks, recurrent or sequential networks, and other networks. This means we design the network for special kinds of applications; maybe it is possible to transfer it later. For example, convolutional networks were designed exactly for computer vision, but now we see that with convolutional networks we can also solve other problems. Nevertheless, computer vision was the first problem.

So what does computer vision mean? On the one side, computer vision means object classification: cat or not cat, we discussed this already. Another example: is this me or not? You come to a door, the camera looks at you, and the door opens if your face is recognized. You have this now on every Apple device; my iPhone looks at me and says, today you are looking great, I will allow you to use me. Another task is style transfer, where you take, for example, this style from Picasso and transfer it to this photo, so the photo will look Picasso-like. You can go to the internet; there is a site called Deep Dream.
And there you can even upload your own photo and choose a style, and then see how the computer sees you in the style of Monet, for example. This is another kind of computer vision task; you see it here, this picture rendered with that style. And the third thing, which we quite often use for autonomous driving, is object detection: detecting the objects. We can even improve these kinds of algorithms so that the object is not only detected but also classified, so that the computer really says: this is a car. In this example only cars were detected, but of course you can also detect trees, pedestrians and so on. And for autonomous driving it makes a difference whether there is a cyclist, a pedestrian or a car beside you.

So how does this work? As I told you, computer vision starts with edge detection: we look for vertical edges and we look for horizontal edges. Edge detection is nothing new in image processing; we have a lot of algorithms with which we can find out where the sharp edges are, in which direction, and locate them. The idea behind it is always to use filtering, and filtering, from a mathematical point of view, is a convolution operation. This is the reason why these layers are called convolutional layers. The crucial thing is to have the right weights in the filter, the right numbers for the filtering. In classical image processing there are predefined values; in our case the computer should learn these weights which we use in the filter. That is, last but not least, what we are doing.

I will not go through all of these slides about filtering now, but you see our filter here is 3 by 3, a filter size which is used quite often, and it moves over your picture, over the pixels. At each position we have a convolution: we multiply element-wise, sum up, and get one output value. But there is one problem. When we do the filtering, every pixel has an influence on the output, but the amount of influence is different depending on whether the pixel is inside the picture or in the first or last row or column. Pixels at the border contribute to fewer output values, so their influence on the output is reduced. What is done about it? We add rows and columns of zeros around the picture. This is called padding. Normally zero is used for padding, because we only need the position, not a value; multiplying by zero gives zero. In this example the padding is one, but it is not necessary to use only a padding of one; you can also work with a padding of two, it is up to you, depending on which results are better.

If you want to calculate the output size without padding, you take the number of pixels in a row, minus the number of pixels of the filter in that row, plus one; you do this for the first and the second dimension and you get the dimensions of the output. If you use padding, you use the corresponding formula with the padding included. But these formulas in this form only apply if the filter always moves by one pixel: one pixel in this direction and then one pixel in that direction.
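Here is a minimal NumPy sketch of this "valid" convolution with a classic 3 by 3 vertical-edge filter; the output size comes out as n minus f plus 1 in each direction, exactly as in the formula above. The image here is random toy data and the function name is my own:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image with stride 1 and no padding."""
    n_h, n_w = image.shape
    f = kernel.shape[0]
    out = np.zeros((n_h - f + 1, n_w - f + 1))      # output size: n - f + 1
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+f, j:j+f] * kernel)
    return out

vertical_edge = np.array([[1, 0, -1],
                          [1, 0, -1],
                          [1, 0, -1]])              # classic vertical-edge filter
image = np.random.rand(6, 6)                        # toy 6 x 6 grayscale image
print(conv2d_valid(image, vertical_edge).shape)     # (4, 4) = (6 - 3 + 1, 6 - 3 + 1)
```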
This step size is what we call the stride. In the general formula we divide by the stride, but if the stride is one you do not see this number. And we distinguish between two kinds of convolutions, and this is also how you specify it in the algorithms: if you say the padding is "valid", then there is no padding; if you say it is "same", then the padding is chosen so that the output size is the same as the input size. About the stride I spoke already: the change in the formula is only that we divide by s, and then we apply a floor rounding.

Does everybody understand what a floor rounding is? Floor, ceiling? You can round in four different ways. You can round towards plus infinity: if you have the number 2.8, you take the next integer in the direction of plus infinity, which is 3; this is the ceiling. You can round in the direction of minus infinity; this is the floor. For a positive number like 2.8 this gives 2, and for minus 2.8 it gives minus 3. If instead you always round in the direction of zero, that is truncation: minus 2.8 becomes minus 2. And if you just want the rounding rules you learned in school, then say round. In our formula the brackets mean the floor. If you implement this in a programming language and you use floor, the programming language will understand you. I think it is more of a language barrier thing. Sorry? I think it is more of a language barrier thing, because of course we all learned this in school, but because of the language difference it got lost. Ah, okay. The English expressions are floor and ceil. Are you using MATLAB, for example? I am using JavaScript and Python. In Python it is also called floor; floor and ceil, and round you find there as well.

Now we come to the next question. If we are speaking about images, we have three different channels: the red channel, the green channel and the blue channel. And of course, when we work with filters, we should apply the filters across these dimensions as well. So when you find a graphical representation of a deep learning image processing algorithm, you will always find it drawn as a volume: this is the size of your image, and the depth of the block is the number of channels you have. And of course you can also play on the number of channels; this is also a design choice. So we have a certain dimensionality, and this is the reason why I said we are no longer working only with matrices, we are working with arrays. The shape of the array is always a tricky thing, and quite often you make your errors in this area. So for example, we have this input with three channels, we use this kind of filter, and in the output we also get something like a volume; it depends on how many filters you use, and it is not necessary to keep the same number of channels. And if we have, let's say, RGBA, then we are making a four-dimensional array.
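To make the shapes concrete, a small hedged sketch: with 3 input channels, every filter must also span 3 channels, and the number of filters you choose becomes the number of output channels. The sizes here (a 32 by 32 RGB image, six 5 by 5 filters) are only an illustration:

```python
import numpy as np

n_h, n_w, n_c = 32, 32, 3        # an RGB image: height, width, 3 channels
f, n_filters = 5, 6              # six 5 x 5 filters

image = np.random.rand(n_h, n_w, n_c)
filters = np.random.rand(f, f, n_c, n_filters)   # each filter spans all 3 channels

out = np.zeros((n_h - f + 1, n_w - f + 1, n_filters))   # valid convolution, stride 1
for k in range(n_filters):
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j, k] = np.sum(image[i:i+f, j:j+f, :] * filters[:, :, :, k])

print(out.shape)   # (28, 28, 6): the output channel count equals the number of filters
```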
But sorry, in this case you need to use three channels in the filters? Yes, in that case each filter must also have three channels. The number of filters you can choose freely; and if you have more channels in the input, the filters need more channels as well, and of course you can also produce more channels in the output.

So, which kinds of layers are we using? About the convolutional layer we already spoke. Another kind of layer is called the pooling layer, and the pooling layer is just there to reduce the size of the representation. The learning process only happens in the convolutional layers; in the pooling layer we learn nothing, there are no parameters to train, it is just there to reduce the size. And we have fully connected layers, and in the case of fully connected layers you have a lot of parameters to learn. These are the different layers used in a convolutional network, and the combination of these different layers is what matters. Sometimes you have two convolutional layers and after that a pooling layer; I did not mention that you sometimes also have a normalization layer; and the last one or two layers are normally fully connected layers.

Maybe I missed it in the first 20 minutes, I woke up a bit late, but how is the pooling done? No problem, I will show you the pooling now. You can do the pooling in different ways. First of all, the idea of pooling is to reduce the size of the representation and to keep the features that are most robust; that is what we want. In this case we have a filter size of 2 and a stride of 2. We go to these pixels and take the pixel with the maximum, so the result here is 9. Then we move by the stride of 2, we are in this area, and again take the pixel with the maximum. It is not necessary to work only with the maximum; you can also work with an average, but with an average you get fractional values, and this is sometimes a problem. You can of course get around it, but in practice, not from a theoretical point of view but from a practical one, max pooling is used more now than average pooling.

So we take a 4 by 4 array, or it could also be 3 by 3, an n by n array, and we basically decrease it by a power of 2, right? If the filter size is 2, yes, you reduce each dimension by a factor of 2, and normally the size is a power of 2, so the pooling works out evenly. That is the idea behind it. You see, these values are the more robust features compared with the others, because the values are higher. Would averaging be used for other, more specific tasks? For other applications; but as I said, averaging gives you fractional values, and first of all we try to work with integers, and secondly, the results with average pooling were not so good for the architectures I will show you. They tried it with average pooling but decided that max pooling is the better way. Average pooling is of course not forbidden; you can just try it. One question: in this case you lose some information, yes? Yes; I am looking for the most robust features, the ones with the highest values inside, so of course we are losing information. But are we doing this because we want to do less computation? Yes, we want to reduce the size of the representation, because the last layers are fully connected layers, and otherwise you get a lot of parameters there.
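A minimal NumPy sketch of 2 by 2 max pooling with stride 2, as just described; the toy input and the function name are my own:

```python
import numpy as np

def max_pool_2x2(x):
    """2 x 2 max pooling with stride 2 on a single-channel array."""
    n_h, n_w = x.shape
    out = np.zeros((n_h // 2, n_w // 2))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.max(x[2*i:2*i+2, 2*j:2*j+2])   # keep the strongest response
    return out

x = np.array([[1., 3., 2., 1.],
              [9., 4., 1., 1.],
              [1., 2., 6., 5.],
              [3., 1., 2., 8.]])
print(max_pool_2x2(x))   # [[9. 2.]
                         #  [3. 8.]]
```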
Let me finish this and then we come to concrete examples. Here, for example, is an average pooling layer. These are the hyperparameters you use for the pooling; normally for pooling you use no padding. The important thing, from the point of view of deep learning, is that there are no parameters here: we just take the maximum or an average, but there is nothing to train; what the pooling produces depends only on the values in the previous layer. Why do we skip the padding on the pooling layer? Because it brings nothing, especially if you have max pooling. But could you have a maximum value at the edges? Yes, but the maximum of 2 and 0 is 2, and our values are positive. So should we always do pooling after we apply some filters? You can do it, formally yes, but you should always think about whether it makes sense to pool at that particular point. It is better to speak about concrete architectures and not just in the abstract.

This, for example, is a network that was developed by LeCun. He is one of the giants of machine learning, and he developed this network already at the end of the 80s, 1989 I think. What is shown here is slightly adapted; for example, LeCun originally worked with average pooling. What he used it for was the recognition, or let's say classification, of handwritten digits: to recognize the zip code at the post office in an automated way. There is a database called MNIST, you find it everywhere on the internet, and by the way there is a newer one, Fashion-MNIST, so that people do not work only with MNIST. You can try it yourself; you will find this LeNet algorithm ready to use and ready to test with MNIST.

So the idea: the input is a handwritten digit with 32 by 32 pixels. In this picture it is shown as RGB, but that is not necessary; you can also work with gray values and the results will be more or less the same. After that you apply a first convolutional layer; the filter is 5 by 5, the stride is 1, no padding. How did it become 6 channels from RGB? Because 6 filters are used; you can change that number, and the number of filters you choose becomes the number of output channels. So now it is 28 by 28 by 6. Then you apply the first pooling layer, a max pool with 2 by 2; the pooling layer keeps the same number of channels, but you see the size is reduced. This first convolutional layer plus the first pooling layer we can count together as layer 1. Then comes a second convolutional layer, which brings it down to 10 by 10 but with 16 channels, and a max pooling again, after which we have a reduced image of 5 by 5. After that, everything is unrolled into one vector, and to this vector we apply two fully connected layers: from 400 down to 120, then to 84, and then to 10. Digits, we have 10, so we have 10 classes. And for the classification at the output here we use a softmax, not a sigmoid.
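As a hedged sketch, this LeNet-5-style architecture can be written down in Keras (which sits on top of the TensorFlow library used later in the lecture). The layer sizes follow the slide as described above, but the tanh activations and this exact code are my illustration, not the original implementation:

```python
import tensorflow as tf

# LeNet-5-style architecture: 32x32 RGB input, 10 digit classes.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(6, 5, strides=1, activation="tanh",
                           input_shape=(32, 32, 3)),              # -> 28 x 28 x 6
    tf.keras.layers.MaxPooling2D(2, strides=2),                   # -> 14 x 14 x 6
    tf.keras.layers.Conv2D(16, 5, strides=1, activation="tanh"),  # -> 10 x 10 x 16
    tf.keras.layers.MaxPooling2D(2, strides=2),                   # -> 5 x 5 x 16
    tf.keras.layers.Flatten(),                                     # -> 400
    tf.keras.layers.Dense(120, activation="tanh"),
    tf.keras.layers.Dense(84, activation="tanh"),
    tf.keras.layers.Dense(10, activation="softmax"),               # one probability per digit
])
model.summary()   # prints the layer shapes, which match the numbers on the slide
```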
The sigmoid, as I told you last time, is only used for binary classification; if it is a multi-class classification we use softmax. Without going into the formulas, the idea is that you get numbers which you can interpret as probabilities, and the label with the highest probability is the predicted label. So this is one of the classical architectures, and if you do benchmarking, for example, it makes sense to include it.

Now the question: why convolutions? The first reason for using convolutional layers is to reduce the number of parameters we have to train, if we are speaking about the computational size. The output of this first layer has 28 by 28 by 6, that is 4704 values; if we connected the 32 by 32 by 3 input to it with a fully connected layer, we would have about 14 million parameters to train, whereas the convolutional layer only has to learn the filter weights. And going down, down, down through the network, we keep having fewer parameters to train. The second reason is that with convolutional layers you hope, and it is a hope, that intermediate results computed in one place can be reused elsewhere. The third thing is parameter sharing: when we make our convolution, here we get one value and there another, but the same filter structure is repeated over the whole image, so the same weights are used at all the other places. This parameter sharing is also a reason to use convolutional networks rather than fully connected ones.

Is there a case where fully connected is preferable to convolutional? Let's say, if you do not have too many parameters, of course it makes sense to work with a fully connected network, because you are not losing information. And as you saw, even in the LeNet architecture the output is produced with fully connected layers; last but not least, the classification is done on the fully connected layers. But first we try to reduce the number of parameters.

So now we can put everything together. We have a training set with pictures and labels. We train our algorithm on a number of pictures, maybe 50,000, maybe 100,000; these are already really good numbers, and for that you need of course a bit of computational power. I agree; I also bought a better computer, because my old one was not enough for such applications. Then we apply the network we spoke about, with the fully connected layers at the end to make the prediction. We always compute the forward and backward passes, which are of course much more complicated now than in the case of a plain fully connected network. We apply a cost function, which is the sum over the losses, and we use some algorithm to optimize the parameters; optimizing the parameters means finding a minimum of the cost function. Which algorithm do you use for the optimization? That is a bit of trial and error: you can never say in advance, for a given data set, which optimization algorithm works best. Normally you start with gradient descent, and you play on the momentum.
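As a hedged sketch of what "playing on the momentum" means in a single update step; one toy weight matrix, my own variable names, and a toy gradient just so the loop does something:

```python
import numpy as np

def momentum_update(W, dW, v, learning_rate=0.01, beta=0.9):
    """One gradient-descent-with-momentum step for a single weight matrix."""
    v = beta * v + (1 - beta) * dW          # exponentially weighted average of past gradients
    W = W - learning_rate * v               # move against the smoothed gradient
    return W, v

W = np.random.randn(4, 3)                   # toy weights
v = np.zeros_like(W)                        # momentum term starts at zero
for step in range(100):
    dW = 2 * W                              # toy gradient: minimises the sum of squares
    W, v = momentum_update(W, dW, v)
print(np.round(W, 3))                       # the weights have moved toward zero
```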
And if that is not working, then you can go to other algorithms; in the libraries a lot of algorithms are implemented, so it is not necessary to program your own optimization algorithms. We have gradient descent, we have stochastic gradient descent, and Adam, which is an adaptive algorithm, is also quite often used and is a good choice. So I would always work in that direction: you start with gradient descent; if gradient descent does not satisfy you, you choose stochastic gradient descent; and if stochastic gradient descent is not working, then try Adam.

This is an architecture that was developed, as I told you, in 2012 for the ImageNet competition, and it is called AlexNet. The name sounds like Alexa, but it does not come from Alexa; the name Alex has to do with one of the authors, Alex Krizhevsky. And the authors are all former students of Geoffrey Hinton; remember, in the first lesson I told you there was a small group over there that survived the whole winter of neural networks, and Geoffrey Hinton kept working on it. Alex Krizhevsky is one of his students; by now he is of course also a well-known scientist himself.

As you see, this architecture is more complex: on the one side you see the similarity to LeNet, but it is much bigger. It also uses convolutional layers and max pooling, and they introduced, for the first time at that point, the ReLU function as an activation function; up to then mostly the sigmoid function had been used. They also introduced normalization. The AlexNet architecture is implemented in most of the libraries. If you go, for example, to NVIDIA's tools and say, I want to do this with AlexNet, you have almost nothing to do; you just give your input and you see the output. But then of course you have no influence on the architecture; you take it as it is, and it depends on which level you are a user: if you are a user as an engineer, you will not go deep into the programming part; if you are a user as a computer scientist, then of course you also want to play with the parameters. Last but not least, the output is produced over three fully connected layers. They classified a thousand classes, not ten, a thousand, and the classification was over the different objects you can find in a picture.

Also quite often used is VGG-16 or VGG-19; it is, let's say, an enlarged LeNet-style architecture. We tested it and did not get such good results, but of course that has to do with our application. Quite often it is used not only for single images but also for video.

Residual networks have to do with the following. Quite often we see this in the learning curve: in theory, if we increase the number of layers, the error should go towards zero; in practice we see that adding layers can give you a higher error. For this, the so-called residual networks, the ResNet architecture, were developed. It has to do with the problem that on zeros we learn nothing. The idea of ResNet is to take this earlier value, add it to that later one, and so combine the two layers; these are the residuals we are adding here. If you work with a residual network, then in practice the error really does keep decreasing as you add layers. That is the idea of the skip connection.
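A minimal sketch of such a skip connection, assuming two small fully connected layers inside the block; the essential point is only the addition of the block input before the final activation:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def residual_block(a_prev, W1, b1, W2, b2):
    """Two linear layers; the block input is added back in before the last activation."""
    z1 = W1 @ a_prev + b1
    a1 = relu(z1)
    z2 = W2 @ a1 + b2
    return relu(z2 + a_prev)        # skip connection: even if z2 is ~0, a_prev passes through

a_prev = np.random.randn(8, 1)
W1, b1 = np.random.randn(8, 8) * 0.01, np.zeros((8, 1))
W2, b2 = np.random.randn(8, 8) * 0.01, np.zeros((8, 1))
print(residual_block(a_prev, W1, b1, W2, b2).shape)   # (8, 1)
```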
Are these networks obtained experimentally, or is there some logic behind them? It is largely experimental. How does keeping the connection help to improve the training error? The point is what happens when you have zeros, and this happens quite often. For example, in the log loss there is the term log of one minus y hat, and if the prediction is close to zero you get the logarithm of something close to one, which is zero, and then you learn nothing. So we add some residuals from this earlier point, and then the value is a little bit away from zero. How do you see that the network is not learning? When you see that the network is not learning, you go through the layers and check at which layer you are stuck, and then maybe at that place it makes sense to apply a residual connection. Isn't it because in large dimensions the gradient is vanishing? Yes, that is a good way to put it.

One of the more recent inventions is called the Inception model. The name comes from the movie Inception; you saw it? Going deeper, deeper, deeper. Inception means there is a prepared cell of layers which you then combine. You take the activation from the previous layer as the input, and in parallel you work with a one-by-one convolution, with 3 by 3 and 5 by 5 convolutions, and with a max pool, and you concatenate the outputs along the channels. This was found by experiment; it is not derived theoretically. The one-by-one convolution is there to reduce the number of channels. It looks a little complicated, but you can implement it as a sub-cell and then just call it. A typical architecture built from such cells, let's say the one used in the 2016 competition, looked like this: about 21 layers. You can find this picture on the internet. Is that structure fixed? No; this structure was used and they won the competition. As I told you, in machine learning the way people work is in competitions: Kaggle or somebody else opens a competition and then the groups work to win it. It is not only about the prize; it is of course also about the improvements you make.

On machine learning, and especially deep learning, you can take different perspectives. One perspective, which I am showing you, is to look from the programming side, the algorithmic side. Another is to look from a probabilistic side, at the statistics behind it, but that is a more theoretical approach. And a third view is from the optimization point of view: which optimization algorithms for finding the minimum are best to apply under which conditions. If you only look at the structure of the algorithm, you cannot answer this question, because every implementation uses its own optimization algorithm; you can switch from gradient descent to stochastic gradient descent to Adam and so on, or develop your own, because it is not necessary to use one and the same optimization algorithm during the whole training. But can you go back to the previous slide?
I mean, is this structure fixed for one cell, or can we change the sizes? No, you can play on the structure, of course. But in the version that was implemented in the Inception architecture it is fixed, so that you can repeat it. So the choice of optimization algorithm is also a kind of problem like the ones we were discussing? Yes, yes: which optimization algorithm you choose. So machine learning has different perspectives: you can look at the problems from a statistical point of view, especially bias and variance; you can look from the point of view of linear algebra; from the point of view of optimization; and of course also from the point of view of which architecture you use. Should we apply machine learning here as well, to find the right architecture? Well, developing such architectures is quite hard; really only the specialists in deep learning do that. But what we did, for example, was try to find out which framework works better. We took nine frameworks, chose from them the ones fitting our requirements, which were Caffe and TensorFlow, and then tried them on a test set; we always used the same test data and checked on two different hardware setups how they perform. For our conditions we found out, to my surprise by the way, that Caffe worked better than TensorFlow, especially for the more complex algorithms. Maybe this was a problem of the hardware we used; I am convinced that with stronger hardware the results might be different, but in our case this was the result. The learning curve in the case of the VGG-16 algorithm: the algorithm did not learn at all in TensorFlow, it learned only under Caffe. And in the case of Inception, the training time was about six hours with TensorFlow and about two hours with Caffe, already a big difference. Of course you expect a long training time, but last but not least you find out what is best for your applications, and always follow the KISS rule: keep it simple and stupid. Making things complex is no problem; the real art is to work with simple things and get good results.

I think I should finish this part here; I wanted to show you some files, and if you want I still have four. One more question on the Inception cell: do these branches have the same output size? They have the same spatial size, yes, because otherwise we could not concatenate them. The choice of the filter sizes, one, three and five, was made for this particular input; you should always have a look at the size of the original data. And there is the one-by-one layer; this is a special layer that keeps a lot of channels but looks at only one pixel at a time, and with it you can reduce the number of channels. The spatial dimensions must match for the concatenation, that you cannot change, but here you can try to play; the 3 by 3 and 5 by 5 filters are common, and with the corresponding strides and padding the output dimension stays 28 by 28 even though there are different sizes of filters. That is exactly the idea: to combine different filter sizes and see what works.
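A hedged Keras sketch of such an inception cell: all branches use "same" padding and stride 1, so every branch keeps the 28 by 28 spatial size and the results can be concatenated along the channel axis. The filter counts here are placeholders of my own, not the numbers from the slide:

```python
import tensorflow as tf
from tensorflow.keras import layers

def inception_cell(x):
    """One inception-style cell: parallel 1x1, 3x3, 5x5 and pooling branches."""
    b1 = layers.Conv2D(16, 1, padding="same", activation="relu")(x)          # 1x1 branch
    b2 = layers.Conv2D(16, 1, padding="same", activation="relu")(x)
    b2 = layers.Conv2D(32, 3, padding="same", activation="relu")(b2)         # 1x1 then 3x3
    b3 = layers.Conv2D(8, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(16, 5, padding="same", activation="relu")(b3)         # 1x1 then 5x5
    b4 = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    b4 = layers.Conv2D(16, 1, padding="same", activation="relu")(b4)         # pool then 1x1
    return layers.Concatenate(axis=-1)([b1, b2, b3, b4])   # channel concatenation: 16+32+16+16

inputs = tf.keras.Input(shape=(28, 28, 192))
outputs = inception_cell(inputs)
print(tf.keras.Model(inputs, outputs).output_shape)        # (None, 28, 28, 80)
```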
So now to the files. This first one is just an example showing how a deep but fully connected network can be developed. Why am I not doing this on the web? Because I need some test cases and some utilities which are quite hard to transport, especially into Colab, so here I am using another environment.

We start with the initialization, and you see it: as we said, we use random numbers for the weights and zeros for the bias. This is how you can initialize it, and this is the concrete initialization. If your result is different, that can happen; it also depends on the computer you are using, because you set the seed, and the random generator on your computer and the one where this was developed can differ, but within the same environment it is more or less the same. Then we build the last layer; the last layer here is indexed with a capital L. Is the last one the sigmoid one? Yes, the last layer is the sigmoid one. As I said, we put all parameters into a dictionary and we can call them later from it; we call this dictionary "parameters".

Then we develop the forward propagation. The forward propagation is quite easy: this is the linear forward, we just calculate our Z and put it in the cache, because as I said we will work with a cache, and later we apply another function to it, the linear activation forward. It makes sense to split the calculation of Z and A, because sometimes you will use them in different places, and it is quite convenient; it gives you a structure. In this case it is implemented so that you can call the activation "sigmoid" or call the activation "relu", and depending on what you call, the corresponding activation function is applied. In which cases, for example, is sigmoid better than ReLU? It is up to you which activation function you use, but you will get better results if you use the sigmoid function for binary classification in the last layer, and in the other layers something like tanh or ReLU. I implemented only ReLU here, but you could also implement tanh; I tested it with tanh, but not with the sigmoid in the hidden layers, because of the problem of saturation which I showed you. That is why I was asking, because at the output we need only a decision between zero and one. Exactly, and therefore the sigmoid only in the last layer, not in the hidden layers.

And last but not least, we of course also need a function that runs the whole forward model: you see, we run over the layers, our previous A always becomes the A that is used for the next layer, and depending on which layer we are in, we call the corresponding function. The call, for example, could look like this; this is the test case which is called here, and in the test case we have two hidden layers, and for the last layer of course we use the sigmoid function, as you asked. The cost function is again implemented in the same way as we did already.
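Since the file itself is not reproduced in the transcript, here is a hedged NumPy sketch of the initialization and forward step just described; the function names imitate the structure of the lecture code but are my own:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def relu(z):
    return np.maximum(0, z)

def initialize_parameters(layer_dims, seed=1):
    """Random small weights, zero biases, one pair (W, b) per layer."""
    np.random.seed(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        params["W" + str(l)] = np.random.randn(layer_dims[l], layer_dims[l-1]) * 0.01
        params["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return params

def linear_activation_forward(A_prev, W, b, activation):
    """One forward step; Z and the inputs are kept in the cache for the backward pass."""
    Z = W @ A_prev + b
    A = sigmoid(Z) if activation == "sigmoid" else relu(Z)
    cache = (A_prev, W, b, Z)
    return A, cache

params = initialize_parameters([4, 3, 1])
X = np.random.randn(4, 5)                                   # 5 examples, 4 features
A1, _ = linear_activation_forward(X, params["W1"], params["b1"], "relu")
A2, _ = linear_activation_forward(A1, params["W2"], params["b2"], "sigmoid")
print(A2.shape)                                              # (1, 5): one prediction per example
```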
Now the backward propagation. In the backward propagation, as you see, we work with the cache: we take the values from the cache, and for every layer we calculate the changes in W, the weights, in the bias, and in the previous output, and we collect these again. This is the backward propagation; but this part is only the calculation, here is how it is activated, and again, depending on the activation function that was used, we take the corresponding derivative, because in the backward propagation we of course need the derivatives. And last but not least, everything is put together in one model: we have the dictionary of the gradients, we have the dictionary of the parameters, at every step we call everything on the calculated parameters and run the loop; here we run the loop downwards, this is my last layer, this is the calculation for the last layer, and then I go down, updating the parameters, and that's it. If you run this, you have one model you can use.

Let me show you one application, if we still have time. As you see, in this case I am already working with libraries like SciPy, so it is not necessary to implement everything myself; from the package I call the image sub-package and use it to get the dimensions of the images. By the way, this should be updated, because Python keeps changing versions, but in this environment it still works. The data set is a prepared data set; I am just training a classifier for cat and non-cat, zero and one. This is an example of a non-cat. And, as I told you, we split this into train and test examples: 209 training examples and 50 test examples. This is the architecture we use, a two-layer network: we have this layer and this layer. I cannot go through it now in detail, but if I run this, the cost function is calculated, and you should always check whether your cost function is going down, whether it is decreasing. It makes sense, for example, to also plot the cost function; the plot is implemented at the end of the command, so you can see: okay, it seems that my algorithm is learning. And then we check the accuracy: we have an accuracy of 0.72; of course, this was with only one hidden layer. If I increase the number of layers in the same way as before, I still have misclassified examples, but the accuracy is already 0.8. The structure of my program stays the same; I just added two more layers.

I have a question: in this case the cost function is always decreasing; how do we know where to stop? If the cost function is getting saturated, if you are not reducing the cost any more, then you are close to the minimum, or at the minimum already. And if you make too many iterations, it can happen that the cost function increases again. This axis is the number of iterations, this is the cost function J: you go down, and then you say, let's try more and more iterations, you hope it goes to zero, and no, it goes up again. Or sometimes you also get a picture like this, which is possible too. We had something like that in our cost function as well; how can we explain this? You cannot really explain it from outside; you would have to go inside and see what is happening with the weights. It is not always the case that the cost function goes down monotonically, but as you see, the overall direction is okay.
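Before moving on, here is a hedged sketch of the backward step and parameter update just walked through, matching the cache layout of the forward sketch above; again my own names, not the original file:

```python
import numpy as np

def linear_activation_backward(dA, cache, activation):
    """From dA of this layer and the cached values, compute dW, db and dA of the previous layer."""
    A_prev, W, b, Z = cache
    m = A_prev.shape[1]
    if activation == "sigmoid":
        s = 1 / (1 + np.exp(-Z))
        dZ = dA * s * (1 - s)               # dA times the derivative of the activation
    else:                                    # ReLU
        dZ = dA * (Z > 0)
    dW = (dZ @ A_prev.T) / m
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = W.T @ dZ
    return dA_prev, dW, db

def update_parameters(params, grads, learning_rate=0.0075):
    """Plain gradient-descent update of every W and b in the parameters dictionary."""
    L = len(params) // 2
    for l in range(1, L + 1):
        params["W" + str(l)] -= learning_rate * grads["dW" + str(l)]
        params["b" + str(l)] -= learning_rate * grads["db" + str(l)]
    return params
```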
Then I wanted to show you one more thing. If you have nothing installed on your computer for working with deep learning, and of course you need a lot of libraries and so on, depending on the version, I recommend you work with Colab. The only thing necessary is that you are a Google user, so you identify yourself with Google. Google offers this for smaller applications, not for big ones, as a free training possibility: let me say about 8 hours at a time for free. I think that is enough; I do not think you will work 16 hours in a row. Eight hours for running the models: I remember one time I tried to run a convolutional network there and it took more than 8 hours to run the model, and then you are using too much of the resources. But for smaller applications it is quite good and convenient. You can upload your files, for example from Google Drive if you have one, and you can also prepare the examples there. And yes, they also provide a GPU for free; they offer GPU as well as CPU.

So I am running all of this here; I will not go through it in detail because we do not have so much time, but what we are doing now is implementing a single convolution, implementing the padding, implementing the cost function, the forward pass, the backward pass, the pooling forward. This takes a lot of time to implement, and I, for example, had a lot of problems on my computer with the shapes of the values. And the same file works in Colab but does not work on my computer, and I could not find out why; maybe this has to do with the version of Python I am using, or the CPU and GPU also make a difference.

And last but not least, we have a convolutional network called "running"; I called the file running because this was the version that really ran on my computer, and then I changed the version of Python and I had problems. That is a real problem, because TensorFlow is also slow to support new Python versions. Here, by the way, I do not need much in the way of libraries: only NumPy and h5py, not too much. I use Anaconda, and of course the newest version of Python does not work with TensorFlow; I think it works with 2.7 and 3.6, but with 3.7 TensorFlow is not working; when you install TensorFlow you get multiple errors. That is a problem, because then you have to manage different Python versions to make it work. I tried to use an environment where it works; it was just slow. It was working on my computer, and now I changed the version because I need TensorFlow for this case. The good thing is that your own files are saved, so you can keep your files. So in this case the idea is to recognize different signs which you make with your hands, and as you see, TensorFlow is used for it.
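To make the padding step mentioned a moment ago concrete, a minimal sketch using np.pad; the batch-of-images shape convention and the function name are my assumptions, not the lecture file:

```python
import numpy as np

def zero_pad(X, pad):
    """Pad a batch of images with zeros around the height and width dimensions only."""
    # X has shape (m, n_H, n_W, n_C): batch, height, width, channels
    return np.pad(X, ((0, 0), (pad, pad), (pad, pad), (0, 0)),
                  mode="constant", constant_values=0)

X = np.random.rand(4, 3, 3, 2)       # 4 toy images, 3 x 3 pixels, 2 channels
print(zero_pad(X, 1).shape)          # (4, 5, 5, 2): one extra row/column of zeros on each side
```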
Why TensorFlow? TensorFlow is simply a high-level library: a lot of calls become much easier and you do not have to program everything yourself. It would need two additional lessons to explain how TensorFlow works and what you can do with it, but nevertheless, this is how we work with it. Again we load our training examples. The important thing is the following: we have five classes, and the indexing of the classes is done by a vector, so this vector represents class 0, this vector represents class 1, and so on. Because each vector has exactly one entry equal to 1 and all the others are 0 (a one-hot encoding), you can classify this quite easily with a softmax; that is the normalization we use at the output.

If you are working with TensorFlow, before you start you always need placeholders, that is, the graph in which you will make your computations. This means you have to think in advance about the shapes of the placeholders you need; this is not given by the program, you have to decide it yourself. The initialization of the parameters is similar to what we did before: we pick an initializer, and in this case I am calling one that is called Xavier, but you can also try the one I told you about yesterday, or some other initializer. TensorFlow already has the building blocks of your network implemented, you just call them; the only thing is that you should give a seed, otherwise the initialization is not reproducible. Then we implement the forward propagation again, and here I can build the structure: I say this is a convolutional layer, I give the strides, I say whether the padding is "same" or "valid" or something else, I define the activation function, and then the max pool. Python runs the program line by line, and the structure you give here is the structure of your network (a small sketch of such a graph follows below). Finally you get this, or maybe this. Ah, what happened here is that I should run this again, because I stopped it: the training takes a lot of time, so I stopped the program, and that is the reason for this error; it is a typical error message you get in Python. But I would now have to train this for about five minutes, so let us leave it.

This was just a quick overview of the basics of deep learning, and only the surface of convolutional networks. As I said, you can now go deeper, check which kinds of architectures make sense in which kinds of applications, and play a little bit with the files. For some of the files you need special packages, but those packages are at least on my course page, so you can download and install them.
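To make the placeholders and the forward propagation described above a bit more concrete, here is a minimal sketch in the old TensorFlow 1.x style (tf.placeholder, tf.contrib), which is the style these course files appear to use; the image size, filter shapes and learning rate are my own assumptions for illustration, and in TensorFlow 2.x the same structure would be written with tf.keras layers instead:

```python
import tensorflow as tf  # TensorFlow 1.x style graph building

n_classes = 5                                  # number of classes, as in the lecture

# placeholders: the shapes have to be decided in advance
X = tf.placeholder(tf.float32, shape=[None, 64, 64, 3], name="X")  # assuming 64x64 RGB images
Y = tf.placeholder(tf.float32, shape=[None, n_classes], name="Y")  # one-hot labels

# Xavier initialization of the filters, with a fixed seed for reproducibility
W1 = tf.get_variable("W1", [4, 4, 3, 8],
                     initializer=tf.contrib.layers.xavier_initializer(seed=0))
W2 = tf.get_variable("W2", [2, 2, 8, 16],
                     initializer=tf.contrib.layers.xavier_initializer(seed=0))

# forward propagation: CONV -> RELU -> MAXPOOL -> CONV -> RELU -> MAXPOOL -> FC
Z1 = tf.nn.conv2d(X, W1, strides=[1, 1, 1, 1], padding="SAME")
A1 = tf.nn.relu(Z1)
P1 = tf.nn.max_pool(A1, ksize=[1, 8, 8, 1], strides=[1, 8, 8, 1], padding="SAME")
Z2 = tf.nn.conv2d(P1, W2, strides=[1, 1, 1, 1], padding="SAME")
A2 = tf.nn.relu(Z2)
P2 = tf.nn.max_pool(A2, ksize=[1, 4, 4, 1], strides=[1, 4, 4, 1], padding="SAME")
F = tf.contrib.layers.flatten(P2)
Z3 = tf.contrib.layers.fully_connected(F, n_classes, activation_fn=None)

# softmax cost on the one-hot labels, and an optimizer to minimize it
cost = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits=Z3, labels=Y))
optimizer = tf.train.AdamOptimizer(learning_rate=0.009).minimize(cost)
```

Building this graph does not train anything yet; the actual training only happens when the cost and the optimizer are run inside a tf.Session with the real data fed into the placeholders.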
There is another kind of application, so-called recurrent networks, which are mainly applied for natural language processing. The structure of those algorithms and networks is different from the kind of networks we saw here; it is also deep learning, but it looks quite different. One of the big differences is that in recurrent networks the units have something like a short memory. In our case, when the algorithm runs through a unit, it is combinational logic: going forward, it is computed and forgotten; going backward, it is computed and forgotten again. Of course you can write the value into a cache, but it is not in the unit anymore. So what if the unit keeps some of the context? That is exactly what we always do ourselves: we check whether a word is in context with other words.

For that, however, you need very, very big corpora. There is, for example, a library of all words used in English that you can download; I think most people, when they want language references, simply go to Google Books, which set out to digitize more or less everything. To train recurrent networks, say for speech-to-text or for automatic translation, you need a corpus of all words, how often they are used and in which connotations, and as a source, of course, Wikipedia is used. Because the English Wikipedia is much, much bigger than the Armenian one, you have a better source file for English. You can embed this; doing it for Armenian is also possible, that is not the problem, the question is only how many references, how much data you have inside, because in that case it is not important what exactly is written, it is important which word follows or appears near another word. On that the translation is based, and on that speech-to-text is based as well. For speech-to-text implementations you also need a lot of training files covering dialects, because different speakers speak the same language in different ways, and again the question is how big your source is: in English it is much bigger than in German, and certainly than in Armenian, because in Europe we have the joke that the most spoken language in Europe is bad English.

[Student:] Sorry?
[Lecturer:] It is everywhere; we joke about ourselves. And I am not joking about Indians, but I have my own experience with people from India speaking English.
[Student:] It is actually interesting how Google can train a neural network to understand bad English. European people speak pretty good English, and people from India too, but people from Japan, for example, pronounce it so that it sounds more like Japanese than English, and Google can still understand it; that is pretty fascinating. Where do they get the data? Do they just take people from Japan and say, speak, and we will write down what you are saying? I do not know where they get the data.
[Lecturer:] As you see, getting data is one of the most important things, in legal and in illegal ways.
[Student:] I think Google just has some shady place where they keep people sitting down, speaking into a microphone, and getting food for it.
[Lecturer:] Maybe. For example, they may be using Alexa or something similar, because people are speaking to it and Alexa is listening. What you always hear or read in the newspapers is that Alexa is taking your personal data; I think they are not so interested in what we are concretely saying in this room at this moment, but they are interested in our pronunciation. I see this with my granddaughters: they pronounce things in English totally differently from me and from my wife. I do not have Alexa, I have the one from Apple; I think it is not Siri but the HomePod, and the HomePod reacts to me but not always to my granddaughters.
[Student:] Maybe the library does not know the music they want and it just does not want to play it; it likes your music taste more.
[Lecturer:] Okay, so this is more or less all. Thank you very much for coming and for your interest, and thank you for the discussion. I hope I could bring at least a few new things to you, because I see there are different people here: some came well prepared and some are totally
new to the topic. In any case, you should go deeper. I wish you success. Thank you.