Okay, hello. My name is Andreas Besta. I am coming from the Carinthia University of Applied Sciences in Austria. I want to talk with you a little bit, just as an introduction, and of course I have too little time for that, only six academic hours: a little bit of an introduction to machine learning and especially deep learning.

First I want to know what your background is. Is your background computer science? How good are you in statistics? How good are you in Python programming? Can you give me a quick show of hands? I think you said that most of you are from the computer science faculty. Okay. I can't speak for all, but some of you know machine learning and are good at Python. What did you do with machine learning already? You just passed a class on machine learning; you did all of the usual regression and classification methods, boosting and some neural networks. Okay, thanks for the answer. Others of you are data science majors, freshmen, or no, not freshmen; you did a little bit of Python, but not machine learning yet. Okay. It is not really necessary to program in Python. I will show you some files, and hopefully they work in my environment, because that is never guaranteed. And I will give you access to a Moodle account at our university where you can check the files; if you want, you can download them and then try to implement and run them by yourself.

The usual slides at the beginning are about where I am from. I said I am from the south of Austria, and this is the region where I am from: Carinthia, as you see, on the border to Italy and to Slovenia. It takes me 10 or 15 minutes by car to drink a real Italian coffee in an Italian coffee bar. So that is just where we are located. We are a quite young university; next year we will celebrate the 25th anniversary of our foundation, and we now have four departments. Of course the most important is my department, Engineering and IT, but we also have departments of civil engineering and architecture, of management, and of health sciences and social work. In our university we have about 2,000 to 2,500 students, but you should put this in relation to the number of inhabitants of Carinthia: we have half a million inhabitants, and we also have other higher education institutions, so for us this is quite a lot.

In our department we run bachelor programs in German: Information Technologies, which is a new program based on older programs and includes specializations in medical IT, networks, geoinformation and, new, multimedia; a program in Mechanical Engineering; Systems Engineering, which is, let's say, a combination of electronics, mechatronics and also informatics; and Industrial Engineering and Management. The master courses are mostly in English; only the master course in Lightweight Construction is in German, and the master course in Industrial Engineering and Management also. The reason the Industrial Engineering and Management course is in German is that it is mostly for local students. And the people from mechanical engineering told us that mechanical engineering was founded in Germany, and every mechanical engineer should speak German; that is the justification. I am mostly teaching in the courses on Electrical Energy and Mobility Systems and in Systems Design.

Besides teaching, it is normal for us to do research as well, so let me say a little bit about research. The numbers here are a little bit old; we now already have third-party funding of 5 million euros.
And we want to reach 10 million by 2021 or 2022. We are running a big lab infrastructure; some of our colleagues know our facilities. Our research is of course applied research. We work with funding from the local government, with funding from the federal government in Austria, and with funding from the European Union; for example, I am here based on funding from the European Union. And we also work with industry, if we get the contract, on development projects. Okay, so that is all about the introduction.

So let's start with deep learning. What I was thinking we would do in our lessons is: a little introduction to logistic regression; then a bit about what a shallow neural network is and how it works; and then we go to deep neural networks and especially convolutional networks. TensorFlow and so on I will leave out, because there is no time for that.

First of all, to the discussion of what machine learning is. Of course you can find a lot of definitions and interpretations in the modern literature. There is a hype now about machine learning, but I should say machine learning is a very old discipline; it was founded in the 1950s. And we have a lot of classical methods which are a little bit in the background now, like the ones you mentioned already: logistic regression, linear regression, boosting, decision trees, support vector machines and so on. It seems as if these were no longer necessary, as if everything were done with deep learning, with deep neural networks. But for your understanding: deep learning is just one of the directions of machine learning.

I like the definition from Tom Mitchell. Tom Mitchell says that machine learning is focused on two questions. The first question is: how can one construct computer systems that automatically improve through experience? This means that you do not have to reprogram your computer program when you get new data. I saw another nice definition where somebody said that in machine learning we have the data and we get the rules; classically, of course, you have the rules and the data and you get the decisions, the classification. So it is inverted. The second question, and this is even more interesting, is: what are the fundamental theoretical laws that govern learning systems, regardless of where the learning happens? Because we have learning in organizations, we have learning in humans, we have learning by computers, and we also have learning by animals. So what are the fundamental rules? And you will see that some of these rules are also implemented, for example, in deep learning; deep learning uses some of these rules, implemented in a programming environment.

There is also a definition from Tom Mitchell of what learning means: learning means improving some measure of performance P when executing some task T, through some type of training experience E. For example, take a spam filter, spam or ham, which of course you can build not only with deep learning; a logistic regression is quite enough for that. The task T is to learn a function that maps an incoming mail to a decision between spam and ham. And then you should choose the metric.
For example, the metric P can be the accuracy. But it is not necessary to take only the accuracy; you can also take the recall, or the sensitivity. Because sometimes it is more important that you classify something as spam even if it is not spam, than to let something through as ham even if it is not ham: if you are afraid of viruses, it is better not to have the virus on your computer, even if the mail was actually a good mail. So there are different kinds of metrics we can use. The training experience E is that we need a collection of mails which are already labeled. And this is one thing: one of the hardest things in machine learning is to get labeled data. Everybody is fighting for data. This is, by the way, also one reason why, if you have a look at the literature, in most of the modern papers you will find five or six names, three or four from Google or from Facebook or from somewhere like that, and the other one or two from a university. Why? Because the companies have the data. And we also see a change: in the nineties and two-thousands most of the authors came from the US; now most of the authors come from China, because China is fighting for one of the first places in machine learning.

But coming back to the question: what is a neural network, and how can it work? A very simple example, which everybody who has already had classes in machine learning knows, is the famous example about house pricing. We have the size of a house and the price, and you should be able to say, okay, I want a house of this size, and then the machine should predict the price for you. What I am showing you here uses a so-called rectified linear unit (ReLU) activation function, which is zero below some point and then a linear function. So we feed the size into the neuron, some calculation happens here, and out comes the predicted price. And of course you also have data: data where you have a concrete size and where you already know the price, the labeled data. You see here that my data points do not lie exactly on this linear fit which I am using, but they are close. So the prediction will be somewhere here, or somewhere there; I always have some errors. And it is good that we have errors: otherwise, if we had no error, our model would be overfitting. This means it would fit only the data which we used for the training, but it could not generalize. So errors are good in machine learning, not bad. This is very important to understand, because if you come from engineering, you make measurements, and of course there an error is something you should reduce. In this case we do not really want to.

So we can of course improve our features. We can say we do not only want to give the size; we also want to know the number of bedrooms, we want to know the zip code, because of course it is important where the house is located, and we also want to know the wealth of the people who want to buy it. And based on these features we make some calculations and obtain additional features like family size, walkability, school quality and so on. So you see, we already have a hidden layer here, a hidden layer in our network.
And this hidden layer is then combined and we get a predicted price. So this is a small neural network which we can use for the prediction of a house price. As you see, once this is set up, we have connections to every neuron of our network, and all these connections have weights. And what we are learning, when we speak about learning in machine learning or in deep learning, are these weights; it is really the computer that learns these weights. We have an initialization of these weights, and based on some algorithms, and based of course on the data which we have (this is a data set with, as you see, four features, but let's say we have 1,000 examples, or hopefully 10,000, or a million, even better), we learn and improve all the weights in the network, and we multiply the weights with the values on these connections. This is the way a network learns.

Of course, as input you do not have to use home features. You can use as input advertisements, user information, images, audio (for example we are working in audio: English language, Armenian language, German language), and also radar information, for example. The output could be a price; the output could be the clicks on some advertisement, that is, how often and at which time a concrete user clicks. For images, the features are the pixels; for images we normally use the pixels as the features, and as output we use an object class. I don't know who of you uses Google Photos, but you know what happens when you upload your photos to Google: they get classified, there is a big classifier behind it, and when you then go to your photos and type "seaside", you will get all your photos from the seaside. This is based on machine learning, and what you are doing is feeding this algorithm with training examples for free. That is the reason why you get this service for free and without advertisement; it would be much, much more expensive to collect all these data otherwise. Already in, let's say, the German philosophical literature of the 18th century you find descriptions of what an automaton is, and, you can say this is a bit of a joke, there was always this picture of a machine with a little child, a little apprentice, hidden inside who is really doing the work. This work is now done by all the users: the companies say, okay, we give you space for storing all your pictures and you pay nothing for that, but you give us the pictures for free to train our algorithms.

We have text transcription; this is very important. Computers are now more or less able to transcribe spoken text into written text, and if you look at the quality, the quality is really good. Of course, the computer should be trained on your voice; this takes some time, about five minutes, and after that you speak to the computer and it types. This is already no problem. The translation machines are also really good now. And about automated car driving I will not speak. So we see that for these kinds of applications different networks work well: for example, a standard neural network is really good for something like the real-estate and advertising examples, a convolutional network for photo tagging, and for speech recognition and machine translation we normally use recurrent neural networks. Autonomous driving usually ends up with some combination of these.
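Just to make this house-price picture a little more concrete, here is a minimal sketch of the forward pass of such a small fully connected network with a ReLU hidden layer. All the numbers (the input values, the layer sizes, the random weights) are made up for illustration; this is not the model from the slides.

```python
import numpy as np

def relu(z):
    # Rectified linear unit: zero below zero, linear above.
    return np.maximum(0.0, z)

# Made-up example: 4 input features (size, bedrooms, zip code, wealth),
# one hidden layer with 3 units, one output unit (the predicted price).
x = np.array([120.0, 3.0, 9020.0, 2.0])   # one house, invented values

W1 = np.random.randn(3, 4) * 0.01         # weights of the hidden layer
b1 = np.zeros(3)
W2 = np.random.randn(1, 3) * 0.01         # weights of the output layer
b2 = np.zeros(1)

hidden = relu(W1 @ x + b1)                # hidden features ("family size", ...)
price = W2 @ hidden + b2                  # predicted price
print(price)
```

The point is only that every connection carries a weight, and that training means adjusting exactly these values W1, b1, W2 and b2.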
The same is true for game applications: you can never say in advance which kind of neural network is the best one for which application. It is really experimental. If you want to go seriously in this direction, you should dig very deep and then try different algorithms and decide which one may be the best for your case. There is no recipe which can predict for you which kind of algorithm you should use. And normally, if you are an engineer or you come from economics or from business, you will use the help of people from computer science; of course they are the specialists for the algorithmic part, but nevertheless you should understand it.

So this is how a neural network can look. We have the standard neural network, which we also call the fully connected neural network: you see the connections, every node is connected. The problem of these networks is that if the number of layers increases and the number of units increases, you get a lot of these connections. This means you have a lot of parameters to train; remember, every one of these connections has a weight, and we have to train this weight. So very soon we are at a number of a million parameters which we have to train, and this takes time, or you need very, very fast computers. For this reason, for example, the convolutional networks were developed, which use something like filtering; we will see this on Wednesday. The filtering is used to reduce the number of connections, and this means the number of parameters which you have to train; nevertheless, you still get really good results. So there are different techniques; this I will show later. And the recurrent networks are networks which also have units with a short-term memory. Normally a unit, after it has processed a value, forgets it, but sometimes you need some remembering, and for this recurrent networks are used; the typical architecture looks like this.

So, which kind of data are we using? We are speaking especially about supervised learning. It is called supervised because all the data are labeled; this means for every example you have a label which tells you this is a dog, this is a horse, this is, let's say, the sound of rain, this is the sound of the seaside, or something like that. And we have structured data, like the data which I showed you in the house-price example. In this case we really know: the first feature is the size, the second feature is the number of bedrooms, and last but not least the label is the price. Also, if you remember the advertisement example: there the user age, for example, is very important, because people of different ages look for different things; so this is one of the features, and of course the advertisement ID is very important, and of course whether we had a click or not, and maybe also the time of the click is one of the features we could use. In the case of audio and images, the data are unstructured, because in the case of an image we use the pixels, and one pixel does not have a special meaning: we cannot say this pixel is just for the eyes or this pixel is just for the tail, because we do not know what is in the picture.
In the same way, if you have an audio signal, we normally use the frequencies as the features. We apply a discrete Fourier transform. Who knows what that is? Does everybody know what a discrete Fourier transform is? You transform the audio from the time domain into the frequency domain, and in the frequency domain you look at which frequencies make up this concrete signal (a short code sketch of this idea follows at the end of this passage). But again, we cannot say what some special frequency stands for; this is impossible. This is why we say these are unstructured data. And we use neural networks for both kinds of data, for structured data and for unstructured data, but with structured data we can sometimes deal better than with unstructured data.

Now to the question of why we have a hype around deep learning. If you open a newspaper now and the newspaper writes something about artificial intelligence, of course you will find the buzzword: it is all deep learning. The word "deep" is used everywhere now; we have the deep state, we have deep learning, we have a deep economy, everything is deep now. Of course it sounds good. The question is: if I said that machine learning is so old, why does deep learning have such a hype? The hype came around 2012, because the normal way progress in machine learning is measured is through competitions. There is a website you can try, kaggle.com, where competitions are always being announced. Of course there were also competitions announced before 2012, but in 2012 a competition for image classification was announced, the ImageNet challenge, and this was the first time it could be shown that convolutional networks and convolutional architectures give much better results than, let's say, other architectures; this was the first time the AlexNet architecture was used.

It also has to do with the amount of data. If you have a small amount of data, it more or less does not matter which kind of learning algorithm you use (this also depends on the application), and in terms of performance you will get more or less the same results. But if the amount of data increases, and I am speaking now about data which are not in the area of gigabytes but of terabytes, or, if we speak about data coming from the area of IoT, about zettabytes, so this keeps increasing and increasing, then it can be shown that large neural networks, the deep learning approaches, work much better.
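As the promised small aside on the frequency-feature idea: here is a minimal sketch of a discrete Fourier transform of a short audio frame using NumPy. The signal, its two tones and the sampling rate are all made up for illustration; real speech features (spectrograms, MFCCs and so on) are built on the same idea.

```python
import numpy as np

# Minimal sketch: turn a short audio frame into frequency features.
fs = 16000                        # assumed sampling rate in Hz
t = np.arange(0, 0.025, 1 / fs)   # one 25 ms frame
# Synthetic "audio": two tones plus a little noise (made up for illustration).
signal = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1000 * t)
signal += 0.05 * np.random.randn(len(t))

spectrum = np.fft.rfft(signal)            # discrete Fourier transform (real input)
freqs = np.fft.rfftfreq(len(t), 1 / fs)   # frequency of each bin in Hz
magnitude = np.abs(spectrum)              # magnitudes could serve as features

print(freqs[np.argmax(magnitude)])        # strongest frequency component
```

The magnitudes of the frequency bins are then the kind of unstructured features the network would see; no single bin "means" anything by itself.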
This has to do on the one side with the algorithms, and on the other side, of course, with the fact that the computational possibilities are now much better than 10 or 15 years ago. I will show you that even on the kind of notebook computer I am using for the demonstration here; it doesn't like me, well, never touch a running system, and it is working again. So the amount of data on the one side, and the possibilities in computational performance on the other, give us such a big improvement that we really can see that with deep learning algorithms we now get better results than with other kinds of algorithms. This is one of the reasons, but this can change. In the future we will find other algorithms, and then the deep learning hype will be behind us and we will speak about something else. Nevertheless, this is one of the reasons for my lessons: to give you a little insight, so that if you are speaking about deep learning or reading something about deep learning, you also understand what the author is writing about, or what the author did not understand; mostly, at least in the newspapers and magazines, the authors do not understand what they are writing about.

So the point, as you see, is scaling: we have the data which are scaled up, we have the computation, and of course also the algorithms. And in the algorithms, one of the things that really improved performance was the switch from one activation function to another: we will see the sigmoid function, used for logistic regression, and with the switch to the non-linear ReLU function as activation function we got really, really improved results. So these are the factors behind the scaling. You should also understand that deep learning is now mostly experimental work. As I said, I can give you some hints on how you can work, but if you really want to apply it, it takes a lot of time to get fluent with the algorithms; still, you find a lot of code on the internet to make your first start.

I said I would show you the place where you can find everything; the network doesn't like me. If you go to the URL you see here, moodle, then fh-kaernten (that is my institution), then .at, you use the user ID "aua", for American University of Armenia, and the password, you see it here; yes, I wrote it in such a way that not everybody can use it. Then you will find the materials and also some literature. Sorry, could you spell the link to the website? I can't see it from here; could you share it? The only thing I can do is write it here: yes, I did. Thank you.

Okay, that was the introduction part; now let's go to more maths. I will show you, just as an introduction, how we can use logistic regression for classification, and based on that, at the end, I would like to show an example in Python. In the second lesson, today at 2 o'clock, if I have enough time, I would like to introduce a shallow network. But why logistic regression first, and not start directly with networks? The reason is very simple: the calculation unit we use in neural networks is based on logistic regression. So if you understand logistic regression, it is easier to follow what I will say in my lesson about shallow networks.
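Before we start with logistic regression, one short code aside on the activation-function point from a moment ago. This is my own illustration, not from the course slides: it compares the gradients of the sigmoid and the ReLU, which is one common explanation for why training with ReLU tends to be faster.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])

# Gradients of the two activations:
sigmoid_grad = sigmoid(z) * (1.0 - sigmoid(z))   # at most 0.25, near 0 for large |z|
relu_grad = (z > 0).astype(float)                # exactly 1 for every positive z

print(sigmoid_grad)  # saturates -> tiny gradient-descent steps
print(relu_grad)     # stays 1   -> gradients do not vanish on the positive side
```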
So, what do we want to classify? Pictures: a 64 by 64 picture, and we want to classify cats versus non-cats. Why cats? Because they are not personal data, and you find a lot of pictures of cats on the internet; just type "cats" into an image search and your screen will be full of cats. I don't know why, but people like to upload pictures of them. What we get is a matrix of pixels, 64 by 64, and every pixel has RGB values between 0 and 255, and we use these values for every pixel as the description. So we have 12,288 features (64 x 64 x 3). About the classes, we will see how many classes we use; in our case we have only one label, which is 1 or 0: 1 stands for cat and 0 stands for non-cat. With n_x we denote the number of features, and at the end we also need the number of examples in our data set. Every picture, you see, has a first pixel, a second pixel and so on; we start here and count in this direction, but this is vectorized, unrolled into one column vector. And we can of course write this as a matrix X, where here I have my first picture as the first column, here my second picture, and then it depends on how many pictures, how many examples, you have in your data set. This is the formal description of the pictures. Normally a picture has RGB values, because we have red, green and blue in three different channels, but in our case, for the first applications, we will not use every channel: we will convert the pictures to gray scale and then use only the gray values. This is just for simplification; it is not necessary, and we could of course also do the classification with all three channels.

So, for the notation: we have m training examples, and every training example has an input x with the features (as the features, I said, we use the pixel values of our pictures) and a label y. I put this into my matrix in such a way that I have here the number of features as rows and here the number of training examples as columns. So the shape (and everybody who knows Python already recognizes the Python notation here) of my picture matrix X is (n_x, m), and the shape of Y, the matrix of all the labels, is (1, m): we have one row and m columns, and of course every value here is 0 or 1.

Now we come to logistic regression. What does logistic regression mean? We take our input x, we multiply this input x with the transposed weight vector w, and we add one value b which we call the bias. So this is an input z = w^T x + b which we calculate for every example, for every data point. And we put this z into the so-called sigmoid function. The sigmoid function looks like this: sigma(z) = 1 / (1 + e^(-z)). If you have a look at the sigmoid function, I think everybody who has learned something in statistics will say this is similar to the cumulative distribution function of the normal distribution. That is also the reason behind it, but we use a function which is much simpler; it looks a little bit similar, but it is much simpler. And if you plug into this function the value z which you calculated here: for a very small, very negative z, e^(-z) becomes very, very big, so the value of the ratio goes to zero; the smaller z gets, the closer sigma(z) gets to zero.
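Here is a minimal sketch of exactly this data layout. The "images" are just random numbers; only the shapes matter, and the variable names are mine.

```python
import numpy as np

# Minimal sketch of the data layout described above (the "images" here are
# just random numbers, only the shapes matter).
m = 5                                                      # number of examples
images = np.random.randint(0, 256, size=(m, 64, 64, 3))    # m RGB images, 64x64
labels = np.array([1, 0, 1, 1, 0])                         # 1 = cat, 0 = non-cat

# Unroll every image into one column vector of n_x = 64*64*3 = 12,288 features.
X = images.reshape(m, -1).T      # shape (n_x, m): one column per example
Y = labels.reshape(1, m)         # shape (1, m): one row of 0/1 labels

print(X.shape)   # (12288, 5)
print(Y.shape)   # (1, 5)
```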
Conversely, if the value of z increases and is positive, then e^(-z) goes to zero and the value of the ratio goes to one, and you have a saturation there: sigma(z) has a limit, a saturation, at one. So for every input we get a value somewhere on this function, and now we can make a decision and say: if the value of sigma(z) is less than 0.5, then the output is zero, and if the value of sigma(z) is greater than or equal to 0.5, then the output, the classification, is one. That is the way this works.

A remark on notation: sometimes, if you go to the literature, the weights are named omega, and one starts with omega_0. The reason is that I can rewrite w^T x + b as a single multiplication omega^T x', where omega = (omega_0, omega_1, ..., omega_n) with omega_0 = b, and x' = (1, x_1, ..., x_n), so the first entry of the input is just a one. If you multiply this out, it is the same as w^T x + b, because these entries here are my w's and this omega_0 is the bias. You find both notations in the literature; sometimes it is written the first way, but especially in deep learning the notation with w and b is more common, because in this notation you see concretely which parameters are responsible for the variance and which parameter is responsible for the bias, and variance and bias are very important for understanding the behavior of a network.

The second thing we need for logistic regression is something we call the cost function. Why do we need it? You understand: we get some output sigma(z), and this output is between zero and one, but our label is only zero or one. So there is a difference between the predicted value and the label, and the question is always whether they are equal or not. Of course, as I said, I can predict y in the following way: y-hat is zero if sigma(z) is less than 0.5, and it is one if sigma(z) is greater than or equal to 0.5, based on the shape of the sigmoid function. So I get a predicted label, and now I compare the predicted label with the given label. If they agree, my error is zero; but if they do not agree, there is a difference, and then I should calculate something like a loss function. For the loss function we use a Bernoulli distribution, and I will explain why. As I said, we have an output which is only 0 or 1. Think of a very famous experiment in statistics, in probability theory: you have a box, and in the box are, let's say, red and black balls, and you take one out by chance. You cannot really say whether the one you hold will be black or red, but you can make a prediction; you say "this will be red", you put in your hand, and if it is black, we have a difference, but if you predict red and you draw red, the prediction is true. For the case that the prediction is not equal to the label, we have some loss, and this loss, as I said, is based on this difference. And we estimate this with maximum likelihood; I will not go deeper, but this is the statistical background, and it is based on the Bernoulli distribution.
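A minimal sketch of one logistic-regression prediction, showing both notations from above. The weights, the bias and the input are made-up numbers.

```python
import numpy as np

# One example with three features (all values invented).
w = np.array([0.4, -0.2, 0.1])      # weights
b = -0.3                            # bias
x = np.array([1.5, 2.0, 0.5])

# Notation 1: keep the bias separate.
z = np.dot(w, x) + b

# Notation 2: absorb the bias as omega_0 and prepend x_0 = 1.
omega = np.concatenate(([b], w))
x_prime = np.concatenate(([1.0], x))
z_alt = np.dot(omega, x_prime)      # same value as z

sigma = 1.0 / (1.0 + np.exp(-z))
y_hat = 1 if sigma >= 0.5 else 0    # decision rule at the 0.5 threshold
print(z, z_alt, sigma, y_hat)
```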
One moment, I see something else on my computer; I will show it in another way. What you see is the probability that I get an output y under the condition of some input x, the likelihood. Over all the experiments it is a product, a product over however many times you run your experiment, of y-hat_i to the power y_i times (1 - y-hat_i) to the power (1 - y_i); here y-hat_i is the output you get in experiment i, and y_i tells you whether it was black or red. But a product like this does not work well numerically, so normally we take the logarithm of it, and then you get a sum: the sum over i of y_i * log(y-hat_i) + (1 - y_i) * log(1 - y-hat_i). Because y-hat is always between 0 and 1, these logarithms are negative, and to get positive values we put a minus in front. This is the reason for the formula you see here, and this formula tells us that we sum up over all the experiments we had; we said we have m experiments, so we sum from 1 to m (and usually we also average, dividing by m) to get the cost.

Why do we use a log function? The use of the log function has to do with its properties. If y_i is 1, the second term is 0, and then I want the prediction to be large: because the prediction is between 0 and 1, if it is close to 1 the log is close to 0 and, with the minus sign, the loss is small; but if the prediction is close to 0, the log is a large negative number, and with the minus sign the loss becomes large, everything increases. On the other side, if y equals 0, the label is 0, then the first term vanishes and the part log(1 - y-hat) stays; there I want small predictions: if the prediction is close to 0, then 1 minus the prediction is close to 1, the log is close to 0, and the loss is small; but if the prediction is very close to 1, then again I get a big negative value for the log, it is multiplied by the minus, and I get a big positive loss. So in the Bernoulli case I always have a non-negative loss function, and if I sum up the losses over all experiments, over all data, I get the cost function.

So this means: for a given training set, for a given set of my parameters and for a given bias, I get a concrete loss and also a concrete cost, and now my aim is to minimize the cost. Normally the cost function looks like this: this is the surface of my cost function, I am starting here, and I want to get to the minimum, because if I have a minimum of the cost, this means I have a minimum of loss, and this means the prediction is as accurate as possible. That is the idea behind it. The question is, of course, how to get from that point to that point. In the case of a nonlinear function we cannot just use matrix multiplications; we have to use some numerical algorithm. There are a lot of numerical algorithms, but one of the very often used ones is steepest descent. And what is the direction of the steepest descent on the surface? Sorry, I am a mathematician, so I must ask it: it is the opposite direction of the gradient. So we calculate the gradient of our cost function. The question is what the coordinates of our space are: the coordinates of our space are now the weights. These are not the input data; we are looking for a minimum of our cost function over the weights. So on one axis I have my weight w and on the other axis my bias b.
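A minimal sketch of the cross-entropy cost just derived. Here A holds the sigmoid outputs y-hat for m examples and Y the labels; the numbers are made up.

```python
import numpy as np

A = np.array([[0.9, 0.2, 0.7, 0.4]])   # predictions y-hat, shape (1, m)
Y = np.array([[1,   0,   1,   1  ]])   # labels,            shape (1, m)
m = Y.shape[1]

cost = -(1.0 / m) * np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A))
print(cost)   # small when predictions agree with labels, large when they do not
```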
For example, let's assume we have only one weight; then this is a two-dimensional space, and I have a concrete pair of coordinates in the (w, b) space where I can say: for these concrete values I have a minimum. And based on this minimum I can now improve my weights and also my bias. So we find the minimum, find the corresponding values, and based on this we improve the weights. This does not only work for logistic regression; a similar approach is used later as well, of course with many, many more parameters.

So what must we do? We must calculate the partial derivatives of J with respect to w and with respect to b. This gives us the gradient, and the opposite of the gradient gives us the direction of the steepest descent. This is not the only algorithm that can be applied; if you go to the literature and check, you will find a lot of other algorithms that have been proposed, but gradient descent is one of the most used methods. Another method is called stochastic gradient descent, and this is used in online learning. Online learning means some set of data comes in, the algorithm is trained on it, and then it is forgotten. For example, I don't know if it is also popular here, but we have an online trading house, Zalando; it is another brand, but it is something similar to Amazon. They look at what you are looking at, and they show you advertisements based on your personal preferences, and when you click, they use just this one click to train the algorithm; then it is forgotten. And this for every user, every second, all over the world; can you imagine how much computational effort that is, and of course also what you need to collect all these data and put them into the data warehouse? Nevertheless, for the gradient descent, for finding the minimum, you then do not use all the data (of course, that is impossible); you take just a sample, this sample is used for the training, and in the next time unit another sample is used for the training. This is also an algorithm which is quite often used for finding the minimum. And yes, of course they also use reinforcement learning, that is very important, but I am not speaking now about reinforcement learning; first of all we should understand what supervised learning is, then we can go on step by step.

So, logistic regression with our parameters: as I said, we use this cost function, and let's call the prediction a_i. If we now calculate the partial derivatives with respect to the weights, where of course we sum over our m examples (it doesn't like me today), then I can use these derivatives to update my weights. For updating we use a hyperparameter called alpha, the learning rate; it is something like a step size in a numerical algorithm, so it determines how fast your algorithm learns. The choice of the step size is not so easy: if you choose a very small step size, your algorithm learns very, very slowly, and if you choose a big step size, your algorithm will not converge in the beginning; if this is your surface, the algorithm will run up and down on the surface and will not converge if the step size is too big. So finding the right step size, finding the right learning rate, is one of the big things in getting a good, smoothly running learning algorithm.
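Here is a sketch of one gradient-descent step for logistic regression, putting the gradients and the update rule together. It assumes the data layout from before (X of shape (n_x, m), Y of shape (1, m)); the function name is mine.

```python
import numpy as np

def gradient_step(w, b, X, Y, alpha):
    """One gradient-descent step for logistic regression.
    Sketch: X has shape (n_x, m), Y has shape (1, m),
    w has shape (n_x, 1), b is a scalar, alpha is the learning rate."""
    m = X.shape[1]
    A = 1.0 / (1.0 + np.exp(-(np.dot(w.T, X) + b)))  # predictions, shape (1, m)

    dw = (1.0 / m) * np.dot(X, (A - Y).T)            # dJ/dw, shape (n_x, 1)
    db = (1.0 / m) * np.sum(A - Y)                   # dJ/db, a scalar

    w = w - alpha * dw      # move against the gradient
    b = b - alpha * db      # (steepest-descent direction)
    return w, b
```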
To find a good learning rate, you start with a learning rate of, for example, 0.1 or 0.01, and then you multiply it by 3, always by 3, and you watch how your algorithm behaves; and you also divide by 3 to look at smaller step sizes. And then, very empirically, you will find something which works really better.

We can of course calculate all of this with vectors; what you see here is already the code we can use in Python. We can also calculate our vector dw, these are the updates, the differences, for our parameters, and based on that we get something which looks like a simple multiplication of a matrix with a vector. And based on that we can now arrange our learning: w is a vector, dw is a vector, alpha is of course a number, and we get the new weights again as a vector. So everything can be implemented not as a loop, because as a loop this would take a lot of time, but with vector operations, which is better.

So, time for the demo; hopefully it will work. If you want to run your own version of such a notebook, normally you have to install an Anaconda environment. We don't have the time for Anaconda, and also my Anaconda environment, when I tried it yesterday, was not carrying out every operation, so I am using an environment from the internet where I have a stored notebook on this computer. And Anaconda alone is not enough; you need a lot of packages, and then you have to check whether all these packages are installed or not, and so on. For this reason it is better that I use my environment, but you cannot use it directly; for that you would first have to do this exercise and pay Coursera. But I can give you all the files, because they are open on GitHub, and you can try this in your own Anaconda environment.

So what we do now is to implement everything we just discussed. For this I am using a Python notebook which is running in IPython. First of all, as you see, we should import some packages; can you read it? We need numpy, and then we need some special packages just for the pictures. I cannot enlarge it further, otherwise I will lose the environment. So we have everything installed, and now we have a folder where I have all my data. I load the data set and I split the data into a training set and a test set. Sorry, could you please zoom in on the code? If I release this, and I need this, I can try; better? So I am splitting my data into a training set and a test set. Normally the split is 60% for training and 40% for testing, or 70% for training and 30% for testing. The important thing is that the training set you can use as much as you want, but the test set, for testing your algorithm, you can use only once. If you use the test set to improve your algorithm and then test with it again, you will have a problem, because the result is already biased: of course you will get a result, but from a statistical point of view it is not a good result, it is a biased result. This means that before you use the data set again, you should re-split it by chance and re-sample. Let's check: we just choose a picture, and I can of course change the index, as you see.
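As a rough sketch of what this loading and picture-inspection step can look like: the file name, the dataset keys and the helper name below are assumptions in the style of the Coursera exercise files, not necessarily the exact notebook code.

```python
import numpy as np
import matplotlib.pyplot as plt
import h5py

def load_dataset(path="datasets/train_catvnoncat.h5"):
    # Assumed file layout: images and labels stored as HDF5 datasets.
    with h5py.File(path, "r") as f:
        x = np.array(f["train_set_x"])   # images, shape (m, 64, 64, 3)
        y = np.array(f["train_set_y"])   # labels, shape (m,)
    return x, y.reshape(1, -1)

train_set_x_orig, train_set_y = load_dataset()

index = 25                               # change the index to look at other pictures
plt.imshow(train_set_x_orig[index])
plt.title(f"label = {train_set_y[0, index]} (1 = cat, 0 = non-cat)")
plt.show()
```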
And by the way, this one was classified as a non-cat picture, so the prediction in that case is correct, too. And as you see, I take not only the training set, I also get the labels: in this array I have the labels for the training set, and here I have the labels for the test set. Now I should reshape my training set and my test set: as I told you, every example should be unrolled into one vector, in one column, and this is the reason why this reshape is used here. I take from the original array the first shape parameter and the second shape parameter and just reshape it. And this here is just a test, a check that I have reshaped it in the right way: you see my number of training examples is 209 (this is really computed), the number of test examples is 50, the number of pixels per image is 64, and this is the expected output. So the algorithm is working at this stage the way I like.

Now I have everything in one matrix: you see this matrix has 12,288 rows (you remember this number, this was 64 by 64 by 3, the number of pixel values) and 209 columns, so we have 209 data sets, 209 examples, inside this matrix, and the same also for the other input, the test set. This is the operation we have to do: reshape our input in such a way that it can then be processed by the algorithms and that we can use the libraries. Dividing by 255 means I am normalizing: the pixel values you get are numbers from 0 to 255, and especially for some operations it is not good to have such large numbers, so the idea is to divide by 255 and normalize everything to the range from 0 to 1.

So now we start with the general algorithm. As I said, first of all we multiply every input value with its weight, then we calculate this value z here, then we calculate the sigmoid value, and based on the sigmoid value we make our prediction: is this a cat or not? The prediction is done in the following way, as you know already: if the output is above 0.5, it is predicted as one, so it is a cat, and if the output is below 0.5, it is a non-cat; for exactly 0.5 it is up to you to make a choice, I mean a choice in your programming: you can count this as non-cat or you can count it as cat, that is your decision.
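As a rough sketch of the reshaping and normalization step just described (the variable names are mine, and random numbers stand in for the real pictures; the actual notebook code may differ):

```python
import numpy as np

# Dummy random "images" stand in for the real training and test pictures.
train_set_x_orig = np.random.randint(0, 256, size=(209, 64, 64, 3))
test_set_x_orig = np.random.randint(0, 256, size=(50, 64, 64, 3))

# Flatten each 64x64x3 picture into one column of 12,288 features ...
train_set_x_flatten = train_set_x_orig.reshape(train_set_x_orig.shape[0], -1).T
test_set_x_flatten = test_set_x_orig.reshape(test_set_x_orig.shape[0], -1).T

# ... and normalize the pixel values from the range 0..255 to 0..1.
train_set_x = train_set_x_flatten / 255.0
test_set_x = test_set_x_flatten / 255.0

print(train_set_x.shape)   # (12288, 209)
print(test_set_x.shape)    # (12288, 50)
```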
The first thing we should start to do is to build a sigmoid function, and as you see, in that case I already use a numpy function for the sigmoid. Why? Because in that case the input can be a vector or a matrix, and if you use the exponential function from the math module, it is defined only for one number and you will get an error as output. This is the reason to use the numpy library. As you see, it works.

The next thing is to initialize our weights and to initialize also our bias. In the case of logistic regression we can use an initialization with 0, but in the case of neural networks this is impossible; there we have to use other initializations, especially in deep learning, and finding a good initialization for your weights and for the bias is the next problem. I spoke already about the learning rate; the initialization is also a problem. If you have a look at the literature, you will find a lot of different approaches which have been proposed to initialize the weights and the bias; for example, instead of 0 one can also use numbers chosen by chance from a normal distribution, or from a uniform distribution, but they should be numbers between 0 and 1.

The next thing we need to implement is the forward and backward propagation. The forward propagation is what I showed you already: we calculate the cost function, going through all input data. In the backward propagation we calculate all our partial derivatives, and based on the partial derivatives we make the improvements of our weights. As you can see here, the cost function is just plugged in, and I am not using a loop: I multiply matrices, or rather I multiply vectors, and also for the sigmoid function you see this is a dot product which I am using here, with the transpose. So all the calculations are done on vectors or on matrices; of course you could also program this with loops, but I am not doing that here. As output we have the cost, we have the gradient, and also the parameters. As you can see here: w, b, dw, db. We have a learning rate used here of 0.009, and this is the expected output; we are getting an output which is more or less the same.

So we can start to make our predictions and put everything together: you see, here we have the initialization, here we are optimizing, we put everything together in a dictionary, and based on the values which we have for w and for b we make our predictions for the test set and also for the training set, and then we just print the output. Again d: as you see, d is a dictionary where we store the costs, the test predictions, the train predictions, w and b, the learning rate and the number of iterations. So if we go into this dictionary, we can always check the data which we have got. So it was trained, and you see, the logistic regression is really quite fast. The test accuracy was 70%, and the accuracy for the training set was about 99%. You see, this is something like overfitting, because if you have a training accuracy which is around 100%, this means the model is fitted just to this training set, and then you can expect that if you generalize it (generalization means applying it to other data), the result will be much, much worse. How can this be improved? One of the methods for improvement, for example, would be regularization, or we use another algorithm.
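For reference, here is a compressed sketch of how such an optimization loop and the final model can be put together in one function. The structure follows what was just described (zero initialization, gradient descent, prediction at the 0.5 threshold, everything returned in a dictionary), but the names and details are mine, not necessarily the notebook's.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def model(X_train, Y_train, X_test, num_iterations=2000, learning_rate=0.009):
    """Sketch of the whole logistic-regression model:
    zero initialization, gradient-descent loop, prediction at threshold 0.5."""
    n_x, m = X_train.shape
    w, b = np.zeros((n_x, 1)), 0.0          # zero initialization is fine here
    costs = []

    for i in range(num_iterations):
        A = sigmoid(np.dot(w.T, X_train) + b)
        cost = -(1.0 / m) * np.sum(Y_train * np.log(A) + (1 - Y_train) * np.log(1 - A))
        dw = (1.0 / m) * np.dot(X_train, (A - Y_train).T)
        db = (1.0 / m) * np.sum(A - Y_train)
        w, b = w - learning_rate * dw, b - learning_rate * db
        if i % 100 == 0:
            costs.append(cost)              # record the learning curve

    Y_prediction_test = (sigmoid(np.dot(w.T, X_test) + b) >= 0.5).astype(int)
    # Store everything in a dictionary, as in the lecture demo.
    return {"costs": costs, "w": w, "b": b,
            "learning_rate": learning_rate, "num_iterations": num_iterations,
            "Y_prediction_test": Y_prediction_test}
```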
Regularization would be one of the improvements we could use for this function. And here you see a prediction which is wrong: this example was predicted as a cat; I chose it specially so that you can see how this behaves. The picture is not a cat, but it was predicted as a cat, so the algorithm did not classify image number 5 of the test set correctly.

This is how the algorithm behaves if we use a learning rate of 0.005. If you change the learning rate, the learning curve will change. Sometimes it happens that the learning curve flattens out, and when the learning curve saturates, it means it makes no sense to enlarge the set of training examples; in that plot the horizontal axis is the number of iterations, and we will not really get an improvement any more. So you can also play with the size of the training set, or you can play with the learning rate, and then always check the learning curve. And this picture, which I put in the folder, was predicted as well, though it is really hard to recognize a cat in it; and if you, for example, give it the picture of a dog which looks something like a cat, it can happen that it is not classified correctly.

Okay, this is where I am closing the lesson. In the next lesson we will see how this knowledge which we have gained here with logistic regression can now be used for neural networks. The idea for neural networks is this: in the logistic regression case we had only one unit; we had the inputs x_1 up to x_n, all of this is combined here as w^T x plus b, then the sigmoid function is applied and we get an output. And now we will apply this same computation at every node: if you have a second node and a third node, then at every node the same unit is used, and so we are getting many, many more connections.
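As a small follow-up sketch of the learning-rate experiment mentioned above: this reuses the model sketch and the prepared data (train_set_x, train_set_y, test_set_x) from the earlier sketches and simply compares the learning curves for a few learning-rate values; the values themselves are just examples.

```python
import matplotlib.pyplot as plt

# Train the model sketch from above with several learning rates and
# compare the learning curves (cost vs. iterations).
for lr in [0.01, 0.005, 0.001]:
    d = model(train_set_x, train_set_y, test_set_x,
              num_iterations=1500, learning_rate=lr)
    plt.plot(d["costs"], label=f"learning rate = {lr}")

plt.xlabel("iterations (per hundreds)")
plt.ylabel("cost")
plt.legend()
plt.show()
```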