It's okay? Yes. Okay, great. Okay, hello everybody. It's a pleasure for me to teach this introduction to deep learning for biology, or mainly for biologists. I have a little problem with moving the slides now; it's usually the mouse, the mouse takes over. Yeah. Okay, that works. Yeah, thank you. So this morning will be more of a general introduction into deep learning: what it is, what you can do with it, and what some of the issues were that people had to overcome to make it successful, and it will be very brief on the more biological part. That will be covered in the afternoon. Just to start with a little bit of terminology: we talk about artificial intelligence, and this is basically everything where you use a computer to support reasoning. Then a part of artificial intelligence is machine learning. There you use data, and that is the fundamental difference: you train maybe a classifier or a machine learning model using data in a more or less automatic way. And deep learning is a part of machine learning where you do that with deep neural networks, so the model is restricted to this neural network model. The types of learning, just very briefly for those who don't know: we have supervised learning. This is the learning we mainly use for deep learning. In supervised learning you have examples that are annotated, so each example carries a label. For example, you have images of objects and the label is the type of object, and you can have thousands or even millions of these examples that you can use for training. In unsupervised learning, you don't have the labels, so you try to extract knowledge from the data just by, for example, using a clustering algorithm, by detecting structures that are inherent in the data. Then you have a bit of a mixture of the two: sometimes you have a lot of data, but not all of these data points are annotated, and then you can use techniques from semi-supervised learning that allow you to learn from the annotated part of the data and transfer that to the non-annotated part. You have reinforcement learning; this is also very important, especially for robotics or if you develop algorithms that play games. There we don't have an immediate reward, so we don't have an annotation for each action, but the reward lies in the future, and we have to find the actions that give us the best future reward. And then we also have, for example, dimension reduction or embeddings, where we try to reduce the dimension of a problem to a lower level. Okay, this was just a little bit of terminology. Now I'd like to start with a little bit of deep learning history. This is a fairly long history, as you can see here. I took this picture from the blog by Favio Vázquez. It goes back to results from brain research, basically, with Hebb's learning rule and all that. So people started to have an idea how neurons work and how they interact, and they started building models that allow us to understand that better from a theoretical point of view. And the first neural net that was published, which was actually more a Boolean network, by McCulloch and Pitts, was published already in 1943. And they already realized that such a network is capable of almost universal computation. And then it went on, and maybe the first neural net that resembles a little bit what we understand as a neural network today was the perceptron by Rosenblatt, which was just an input layer and one output layer, one output neuron.
And some sort of a training algorithm. And this could be used for classification. However, the training algorithm they used at the time wasn't really stable, and it didn't really work well, especially if the two classes overlap. But the concept was there, the proof of concept worked, and one could improve on it. Then the big boost actually came later. But maybe before that, there were also these Hopfield networks, which came probably more from physicists who became interested in the field, because they realized that these neural networks have some resemblance with spin networks, networks of interacting spins, and you can use very similar models to describe them. These Hopfield networks are also closely connected to the Boltzmann machines, which are still in use as restricted Boltzmann machines for pre-training neural networks nowadays. The first real neural network, as we know it now, was trained with the back propagation algorithm; I will explain in the lecture what that is. It was this very important publication by Rumelhart, Hinton and Williams, where they trained a fairly small neural network with back propagation, and they proved that this can actually be done, and that the results are reasonable and make sense. Back propagation was then used for convolutional neural networks, which are mainly used to process and analyze signals. And it was the famous LeNet by Yann LeCun that was introduced. This was also a neural net that was very successful and had industrial applications; for example, banks used it to read handwritten digits on checks, or other documents. Then it went on. There were the recurrent neural networks that came, the LSTMs, which will be covered later and which, very importantly, allow you to analyze sequence data. Then the deep belief networks came, which used restricted Boltzmann machines to pre-train and then back propagation to refine the weights. And other things like improvements of these algorithms, like dropout, batch normalization, and so on, were also introduced. And the field really took off when a couple of publications came out. One of the probable incubators of this neural network revolution was the foundation of the Neural Computation and Adaptive Perception program, which was led by Geoffrey Hinton at CIFAR, the Canadian Institute for Advanced Research, in 2004. This program was founded and headed by Hinton and had Yoshua Bengio and Yann LeCun as members. These were already very well-established researchers, and that was really a concentration of people who had a lot of knowledge in the field, believed in the field, and were able to push that field forward. What also came were databases, or large collections, for example of images that are annotated, or other databases of sound and text, and these could be used by researchers to train their models. Also very important is the use of GPUs, especially the NVIDIA GPUs with CUDA, which allowed the fairly slow training process of neural networks to be sped up by a factor of about 10. This made it possible for large neural networks to be trained in a reasonable time. And all this then led to solutions like, for example, the famous AlexNet here in 2012, which used a bit of everything that had been invented: they used GPUs, they used new activation functions, they used dropout and data augmentation. So all these newly developed techniques came together and showed that they can be successfully used.
And this AlexNet, in an image classification competition, really almost halved the error rate compared to the competitors. So people started seeing: okay, there's something in this deep learning, there's something in these multilayer neural networks. It's not just an interesting thing, you can really do things better with deep learning. And then, what also makes deep learning really attractive for us, people who want to apply it, is that you have these Python or R packages that are fairly simple to use, like TensorFlow, PyTorch, MXNet, or CNTK from Microsoft. And all this success was then also rewarded with the Turing Award for Geoffrey Hinton, Yoshua Bengio and Yann LeCun in 2018 for their contributions. So here you see this ImageNet competition. You see the dates here from 2010 to 2015. And you see that up to 2011, there were shallow neural nets, with maybe one or two layers, or other systems that performed best. And then in 2012 there was this AlexNet. And you see that, you probably can't see my pointer. We see your pointer, yes. Okay. So in 2012, the error rate dropped from 25% to 16%. These are fairly difficult tasks: you have to classify the images, and I think for this competition they had about 1,000 classes, and the error evaluates whether the true class is in the top five of your classifications. And they really almost halved the error here by introducing a fairly deep neural network with eight layers, and these new methods like ReLU activations, dropout and so on. And this was then further improved, and with time the number of layers also started increasing, up to this ResNet here, which had 152 layers, so a really deep neural network, which brought the error down to an astonishing 3.6%. So this image shows that, at least for these convolutional nets that do image classification, going deep really improves performance, which can serve as a justification for deep learning. So the great successes in deep learning are mainly in image, text, and sound processing, maybe also in playing games like the famous game of Go. And in these fields, the success is mainly driven by the availability of large volumes of training data; we will see that the deeper the network is, the more data we need to train it. Also by algorithmic improvements and the increase in computational power, mainly that it became possible to parallelize the neural networks on GPUs. And also things that look fairly small but are actually quite important, especially if you have deep neural networks: improvements in weight initialization, activation functions, the way you do gradient descent, how you batch your data, whether you use dropout or maybe batch normalization. All these things contributed to making it possible to robustly train these deep networks, as we will see. Just this nice overview here, which I have from this paper, shows how we visualize these neural networks. We have an input layer; I will come to what that is in more detail. We have the hidden layers, maybe one or two in the older era of artificial neural networks, and then we have an output. The input layer is then processed via these hidden layers and transformed into an output. In deep learning, the various architectures, the convolutional ones, the LSTMs and all that, differ from this layout, but the basic architecture remains the same as these classical neural networks, you just have more hidden layers here.
And having more hidden layers has an advantage, as you will hopefully see during the course of this lecture. This is just a plot that I did here; I just did Google Scholar searches on these search terms. I looked for "deep learning" and "neural networks" in red, for "artificial neural networks", which was a name for the more shallow neural networks that were used for a long time, and for "support vector machine", which is another famous machine learning tool. And you see that the artificial neural networks got a boost in the nineties, and this is a logarithmic scale here, and then steadily increased; the same with the support vector machine, where the boost came a little bit later, but you also have a steady increase, which just reflects the increased activity in the field of machine learning. Then deep learning really took off in 2012, after maybe this AlexNet paper or other papers, and it started exploding, and it overtook the number of support vector machine publications here, for example. Whether the trend will go on, or whether there is at the moment again a bit of an over-expectation of deep learning and people will realize that maybe it's not all that easy, and whether it will come back down a bit or continue like this, I don't know, we will see. But it is definite that deep learning has contributed irreversible things; it has had successes, especially in image classification and in speech and text processing, that couldn't be achieved with other means, and I think for those things there is at the moment at least no alternative. So if you're a biologist and you want to answer some biological questions, obviously the most important thing is to have good questions that can be answered by these means, but then it's a good time to use, or at least think about using, deep learning to answer them, because meanwhile we have quite a lot of data available. I work in the field of mass spectrometry and proteomics, and we have tons of MS/MS spectra in the repositories, we have gene expression data, a lot of RNA sequencing (RNA-seq) data is available in the GTEx and TCGA repositories, we have protein structures, and so on. So we have a lot of good quality data available that we can use for deep learning, and having a lot of data is important. You can also do deep learning with less data, I come to that very briefly, but if you start a project, you should have a good amount of data. Then we have the computational power: nowadays these NVIDIA graphics cards, for instance, are affordable; you can buy a decent computer with maybe two graphics cards for something like 5,000 to 10,000 dollars, and you can already do deep learning on such a desktop computer. Then we also have powerful toolboxes like TensorFlow or PyTorch, which make it, not extremely easy, but fairly easy to implement deep learning architectures and test them. So now, finally, I want to explain in a bit more detail what deep learning is and what it does. Therefore, I want to start with the perceptron that was published by Rosenblatt. In the perceptron, you have this input layer, so we have an input vector x, which has n components, x1 to xn, and what that 1 means here, I come to in a minute, and these input vector components have to be numeric.
If you have categorical data or ordinal data, you have to convert these to numeric values. To do that, you can use, we come to that as well, a one-hot encoder, or you can embed your categorical values in an embedding space or something like that, but these values have to be numeric. Then what you do in this perceptron, which has just the input and one output, is apply this mathematical function here: you have these connections that connect the input neurons, these circles here, to this neuron here, and you just sum up the multiplications of these weights with the inputs. You do x1 times w1, x2 times w2, up to xn times wn, you sum them up, and you add what we call a bias term, a constant term. This constant term is there because sometimes, if this sum here is maybe 10 and you would rather have a value around zero, this constant term can pull this 10 back towards zero. In vector notation, this just looks like this: it is the bias term w0 plus the scalar product of the weight vector w with the input vector x. Then, very importantly, after that summing, you apply an activation function. It is known from brain research that neurons do not respond linearly to input, but have some kind of saturation: below a certain threshold there's no activation, then there's a range where the activation grows more or less linearly, and above another threshold the activation saturates, so you don't get more activation if you have more input. This is also what people used at the beginning: they mainly used the sigmoid activation functions, which are zero for large negative values here, then go up to one here. So you apply this activation function to your sum, and this gives your output. So this is what the perceptron does. Now, geometrically, how does this look? It does this. These are our weights here, and what it does is create a linear step function. So this function is zero in this purple region, and then it grows to one and becomes one in the yellow region. And the separation between the two regions is a straight line. I changed this slide a little bit, so you don't have this little vector in your slides, but we'll update them. And the direction of the weight vector is just perpendicular to this separation line. It's also important to know that for w's that have a large norm, the step becomes narrow, so it becomes quite crisp, and for w's that are small, the separation is much smoother. So large weights in the neural network usually correspond to crisp boundaries that are more prone to overfitting, whereas small weights give smoother boundaries that are a bit less prone to overfitting. This will be important when we talk about regularization later in the lecture. Okay, this is what a single perceptron does. Not particularly interesting, you can just do a linear separation. But the interesting part comes now, when we start combining single perceptrons into an actual neural network. This I would like to demonstrate here. So we have a simple example here, just two inputs, x1 and x2. And we have three intermediate neurons, z1, z2, and z3. And we have another perceptron here to create the output neuron. Okay, let's start with our first perceptron, which just takes the inputs, applies the perceptron computation, and creates an output z1. As we saw before, this corresponds to a linear separation.
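Before combining perceptrons into a network, here is a minimal sketch of what a single perceptron computes, assuming Python with NumPy; the weights, bias and input values are made up purely for illustration and are not the ones on the slides:

```python
import numpy as np

def sigmoid(z):
    # smooth activation: close to 0 for very negative inputs, close to 1 for very positive ones
    return 1.0 / (1.0 + np.exp(-z))

# illustrative weights, bias and input (hypothetical values)
w = np.array([3.0, -1.0])   # one weight per input component
w0 = -0.5                   # bias term
x = np.array([0.2, 0.7])    # numeric input vector

z = w0 + np.dot(w, x)       # weighted sum of the inputs plus the bias
output = sigmoid(z)         # apply the activation function
print(output)               # a value between 0 and 1
```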
Let's assume here, just for simplicity, that we have a step activation function, which is exactly 0 for negative values, and then exactly 1. So we have a stepwise activation function, which means that for these weight values 3, minus 1, minus 1, it will look like this: this z1 will be 1 in this orange region here, and 0 in the rest. Now we add another perceptron, a second one here in parallel to the first one, with different weights, and that calculates a z2. And with these values for the weights, again, it will give a linear separation: it will create output 0 here, and output 1 in this green region. Now let's add a third perceptron with yet again different weights, and the output z3 will look like this: it will be 0 here, and 1 here. Okay, so now we want to add another layer, and that layer, in this example, will just add up all the three intermediate values, z1, z2, and z3. And it has a bias of minus 2.5. So what does it do? If you overlap all these images here and add up the values, you will get this image: we get a 2 in this region, a 2 in this region, 1 here, 1 here, 2 here, and a 3 in the region where all the 1 regions overlap. Now we subtract our bias of 2.5, which is not yet subtracted here, and again apply our activation function. We will see that, after the subtraction of the bias, this is only positive in this region here in the middle. So our output will be positive in this triangle, and it will be 0 all around. So what have we gained with that? We did a classification. Now let's assume we have red points that lie within this triangle, and blue points that are outside. With this neural net here, we managed to separate the red points from the blue ones. And if you have such an arrangement where the red points are in the middle, surrounded by blue points, we see that we need at least three hidden neurons to do this separation; with two lines, you couldn't do that. Okay, that's what this simple neural network does. The first layer perceptrons do this separation of the plane. Then you add all these things up in the second layer, and you can create more complex regions, or you can separate more complex regions from each other, just by combining these linear separations you have from the first layer. Now in practice, just a bit of notation, usually we don't write it out with the sum and then the activation function; this will all be combined. So in the future, if you have a neuron, apart from the input neurons, so these hidden or output neurons, it will always include the sum with the bias term plus the activation function. And as I said, the activation function is usually not step-wise, but some smoother function, like for example the sigmoid function. And then our triangle, which was an exact triangle before, will look smoother. It will look something like this. And if you, for example, set a threshold of, I don't know, 0.5, then we can again separate the two classes. So the system remains the same, it just creates a smoother boundary. Now, we don't need to stop there. We can continue doing that. So we can have several perceptrons here in the first hidden layer. Then we can add a second hidden layer, a third hidden layer; the number of neurons in each layer can vary, can become larger or smaller.
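Coming back to the triangle example from before: a small sketch, assuming NumPy and a hard step activation, of how the three hidden perceptrons plus the summing output neuron could be written down. The first row of weights follows the 3, minus 1, minus 1 mentioned above; the other two rows and the test points are just illustrative, chosen so that the three half-planes enclose a triangle:

```python
import numpy as np

def step(z):
    # hard step activation: exactly 0 for negative values, 1 otherwise
    return np.where(z < 0, 0.0, 1.0)

# each row is (bias, w1, w2) for one hidden perceptron, i.e. one line in the plane
hidden = np.array([[3.0, -1.0, -1.0],   # z1: one side of the line x1 + x2 = 3
                   [0.0,  1.0,  0.0],   # z2: right of the vertical axis
                   [0.0,  0.0,  1.0]])  # z3: above the horizontal axis

def network(x1, x2):
    x = np.array([1.0, x1, x2])          # prepend 1 so the bias is handled like a weight
    z = step(hidden @ x)                 # three 0/1 half-plane indicators
    out = step(np.sum(z) - 2.5)          # output perceptron: only 1 where all three are 1
    return out

print(network(1.0, 1.0))   # inside the triangle  -> 1.0
print(network(5.0, 5.0))   # outside the triangle -> 0.0
```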
It matters a bit how you choose these sizes, but everything is possible here, until we finally have an output layer. If it's just a simple regression value, you want to regress only one value, it's just one output. If you want to do a multinomial regression, we have several outputs. If you want to do a classification with several classes, for example 10 classes, we will have 10 outputs. But the system remains exactly the same as before, and the equations we have also remain the same. So for each internal hidden neuron, we apply the same maths as before: we just linearly multiply the input with our weights and we add the bias term. Now, these are weight matrices here, because we have several hidden neurons, and these weight matrices summarize all the weights: the weights going to the first hidden neuron, the second hidden neuron, the third hidden neuron, and so on. But basically it's exactly the same thing as we did for the simple perceptron: we just multiply the weights with the inputs, add the bias term, and apply the activation. And we do this for each layer. So this is a feed-forward neural network, we only go forward until we arrive at the output layer. Okay, I would like to do a little exercise. This is maybe, for those who are already quite familiar with neural networks, on the simple side, but for those who hear it for the first time, I find it quite instructive. And therefore I ask you to go to this playground that Google TensorFlow has. It allows you, just by clicking with the mouse, to configure some neural nets and to run them on four different geometrical configurations. And you can already see certain things that these neural networks can do, and you can study them in a very playful and simple way. So I don't know whether you want me to give you a minute to open that yourself, or you can just follow what I do on my screen. Patricia, can you see the screen? Yes, I can. I'll also type the name of the site in the chat. So as you can see here, you can choose various things. I will explain later in the lecture what these are. You can choose the learning rate, you can choose the activation function; you already know a little bit what that is. Regularization we will discuss later, so leave that at zero at the moment. And you have different inputs: you have this double circle, you have this configuration of four squares, you have the two blobs, and you have the spiral. And the aim of this neural net here is to separate the orange points from the blue ones, a bit similar to what we did with our simple perceptron, where we separated out the points that are within a triangle. You can configure how many layers you want, so you can add a layer, or you can remove it again. You can add neurons or remove them again. Here the inputs are somehow already pre-processed, so you have these several input options. You have the input where points on the left of this square are minus one and points on the right of the square are positive. And here points in the lower half are minus one and points in the upper half are positive. Here points in the middle have a higher value, and here points in this horizontal region have a higher value. And you have an input that scores points like this, and one that scores them like this. For this first exercise, we just want to use these first two inputs here, and we want to leave the activation function as linear, and we want to first separate the two dots.
Then we can start training our neural network, and this actually runs the back propagation algorithm here; there's an implementation that runs in JavaScript, and you see this converges extremely fast, and we have a perfect linear separation. That's exactly what we want for separating these two dots. So the linear activation works just fine for linear problems. But now let's switch to the circle data, keep the linear activation, and go again. And you see here on this side the test loss, so they have test data, which you can visualize here, and they have training data, which is here, and you see the test loss remains high, and the training loss also remains constant, so the network can't learn anything. Obviously, because a linear separation is not able to separate these two classes. So a linear activation function only works for linear problems. Now let's use the tangens hyperbolicus (tanh) activation function, which is like the sigmoid, but instead of being zero for negative values it goes to minus one. We can see what it is doing, and we wait a little bit, and voilà: we can see from the losses here that the losses dropped, both on the training and on the test set, which is what we want, and we have a nice separation between our outer circle and the inner circle. So the important message here, as we go back to our slides, is that a nonlinear activation function is absolutely essential. That's the heart of the neural network. With a linear activation function, we can combine as many layers as we want, the output will still be linear, so we won't gain anything. Whereas when we introduce these nonlinearities, we can produce much more complex outputs, and therefore this activation function is the central part of the neural network. So that was the first part, where I tried to explain a little bit how these neural networks work, and this may be a little bit of a theoretical slide here, but I find it important. It can be shown mathematically that with a single hidden layer neural network, so a neural network that has many inputs, just one hidden layer and an output, you can approximate any function you want. So why do we need to go deep? What is the advantage of having, instead of just one or two layers like people had for a long time, 10 layers or 20 or even 150? The reason is, and there's sound theoretical work on that, that we can learn any function we want much more efficiently with more layers. If we want to do it with one layer, we can, but the learning part is complex and slow. We can do it much more efficiently, with many fewer weights and much less training time, if we have stacked layers. And the reason for that, as we will see, is that stacked layers are actually very useful because they, in a certain sense, decompose the problem. Each layer extracts the essential information from the layer before, combines it into a new set of information, and then the following layer does the same thing again. So we are able to subdivide our problem into a set of sub-problems, and this allows us to solve the task more efficiently.
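A quick way to convince yourself of the point about linear activations: if you stack purely linear layers, the whole network collapses into a single linear map, whereas inserting a tanh between them does not. A small sketch, assuming NumPy, with random illustrative weights and biases omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))   # first layer: 2 inputs -> 3 hidden units
W2 = rng.normal(size=(1, 3))   # second layer: 3 hidden units -> 1 output
x = rng.normal(size=(2,))      # a random input vector

# two stacked *linear* layers ...
deep_linear = W2 @ (W1 @ x)
# ... give exactly the same result as one linear layer with the combined matrix
single_linear = (W2 @ W1) @ x
print(np.allclose(deep_linear, single_linear))   # True: depth gained nothing

# with a nonlinearity in between, the composition is no longer a single linear map
deep_nonlinear = W2 @ np.tanh(W1 @ x)
print(deep_nonlinear, single_linear)             # generally different values
```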
Marcus, do you want to give time for people to ask questions? Yes, if you can keep track of the time as well. There were some questions already in the chat, and Jago has answered them as well. So, what are the similarities between a perceptron and logistic regression? Yes, that's basically the same thing; if you use a sigmoid function, it's very, very similar. If you use stochastic gradient descent, it's even almost the same learning algorithm, so they give the same result. Another question is about the difficult choice of the number of hidden layers that we need, and maybe also the number of neurons. I'll come to that. There is no general rule, unfortunately; you can't just give a problem to an algorithm and have it tell you the number of layers in an easy way. That's something one needs to figure out, and that's complicated. Other people, if you have any questions, you can turn on your mic as well and ask, or put them in the chat. Don't forget. When should we have the break? At 10:30, in 40 minutes. Good. So I'll go on then. So this is a little sketch I spent some time drawing, where I tried to illustrate a bit the difference between what we do in neural networks or deep learning and what we do in, let's call it, classical machine learning. In classical machine learning, let's say we want to do image classification. Usually we have a researcher, maybe a specialist in image processing, or we take an image processing toolbox, and we program this toolbox to extract features. For example, we have edge detectors, corner detectors, various shape detectors, color detectors and so on. Then we assemble all these detectors into a vector of values, and we throw this vector of values into a classifier, for example a support vector machine, that then classifies the images based on these feature vectors. This has advantages and disadvantages. The advantage is that while you develop all these features, maybe optimize the features and do feature selection and all that, you become quite an expert in the field if you're not already one. So you learn a lot about your data, you understand your data much better, and you can then choose your features so well that you can maybe use a simple linear classifier at the end just to do the classification. The deep learning approach is different, in the sense that you don't do this feature selection or feature engineering, but the deep network does it for you. What you input here are just the raw images; maybe you need to compress them a little bit or transform them a little bit, but mainly you input the raw pixel values of the images. So you will have large input vectors, and then you just have your training set, where for each image you have the label, bike or cow and so on, and then you use the back propagation algorithm. I'll explain what that is later, and that algorithm will then train a set of weights in your neural network that transforms these input pixels to the label. For example, this will be transformed to the label bike, this will be transformed to the label cow, and so on. And what we will see is that the first layer of the neural network usually acts as a low-level feature detector. So the first layer usually detects things like edges and simple shapes. And similar to the perceptron example that we discussed, the second layer kind of combines these simple shapes into more complex shapes. And the deeper down you go in the neural network, the more complex the shapes you will obtain, until in the last layer you have the objects themselves. And this is done automatically, so you don't need to do much feature selection. And this is a huge advantage, because often we don't know the features.
Sometimes we think we have a super good idea about what the features are, but we might just miss some of the features and therefore not be able to correctly or successfully classify our objects. And this is an example where they did that; it comes from a deep belief network, which is a special kind of neural network. They did handwritten digit classification, so we have 10 classes for the digits 0 to 9. They had three layers, and the last layer here, the output layer, is just the digit. And you see that in the first layer you see these little streaks here. These are edge detectors that detect important parts of these digits. In the second layer you already have somewhat more complex shapes; sometimes the digits are already a little bit visible, sometimes it's not quite clear which digit it is, it might be a mixture of several digits. And in the third layer the digits become more apparent. And then in the last layer you just combine these layers again with some weights and you output the class of the digit. This is a deep belief network that also uses restricted Boltzmann machines, but basically you can do the same thing with just back propagation, and it will do this sort of feature detection. This is from a very nice publication here. This is for convolutional neural networks, and we will hear in a later talk what those are all about. But the paper nicely illustrates the way neural networks create or detect features, and also the refinement of the features with the depth of the layers. I really recommend, if you're interested in that, that you read the publication. It's a really nice publication. It's not that easy to create these images here; they actually had to create another neural network to take the information from intermediate layers and project it back into the input layer. So what we see here: in these convolutional nets you have so-called feature maps, these are like layers in a convolutional net, and for each feature map they just took the nine neurons with the highest activation, and these neurons were then projected back to input space, so these images show the input that activates these neurons in the selected feature maps. So you can visualize a bit what activates the neurons. In the first layer, you see again we have these simple shape detectors: edge detectors, color detectors and so on. In the second layer it already gets a bit more complicated: we have combinations of two edge detectors, a corner detector here, we have these grid pattern detectors, we have color detectors, we have shape detectors for spirals and round shapes and so on. And the third layer becomes already more complex, so we can detect wheels, we can detect more complex shapes and even faces, and in the fourth layer we can detect parts of faces of dogs; this is a huge set of images with various objects. We can detect parts of faces of dogs that are important to distinguish dogs, for example, from cats, and so on. And the further down you go, the more detailed this becomes and the more useful it becomes for the final classification. And again, all this is not input into the neural network; the only input is the training data, and this is all learned by the neural network and extracted from the training data. And this next example is from the famous publication by Rumelhart, Hinton and Williams in 1986.
That's where they actually introduced the back propagation algorithm for neural networks, and they had, at least for this example here, a very, very simple and small data set. They had two family trees, one with Italian names, one with English names. And here you see that Christopher is married to Penelope and they have children Arthur and Victoria, and Victoria married James, and again children. And the relations here are husband, wife, child, uncle and so on. So what you input, and this is the neural network they used: we have 24 names, and if we give in a name, for example Colin, Colin has one field here in these 24 names; this Colin field is one and all the other fields are zero. And then we also give a relation, for example husband, so we have a one where husband is and zero otherwise, and the output should be Charlotte, because Colin is the husband of Charlotte. So they trained this: they just sampled triples of a name, a relation and another name, and trained the network by providing these samples and using back propagation. The output is then a name, so we have input and output, and we use back propagation to train the weights in these layers. They used a fully connected layer to six neurons here for the names, and also six neurons for the relations; then they combined these two layers, again fully connected, into 12 neurons, and reduced that again to six neurons before decoding to the output layer, which has again 24 neurons. Question, sorry: Colin is Charlotte's brother, not her husband. All right, yes, they are the children of Victoria, and this one here is the husband. Okay, and there were other questions as well. One of them is: will the limitations of deep learning be covered later on? Yes, I come to the limitations at the end. And there's also a question about what a layer is actually made of in practice, I mean how a layer manages to recognize shapes, colors and so on, so it's more about the data, I guess. Yeah, it just depends on the data. There's a lot of theory on how this works; in practice it works, but if you want to build a theoretical model of how this works mathematically, that is quite complex, so I'm not sure that people understand it 100 percent, and it probably doesn't work out so nicely for every architecture either; it depends a bit on the architecture and the training data. Okay, so we can come back to this later, maybe; David and Sebastian still have questions as well. Okay, thanks. But if you're interested in that, you can, for example, start reading that old publication, because they address exactly these issues a little bit. For example, they look at this first layer here, the six neurons for the names, and what they see when they do this training is that all these neurons correspond to certain classes. For example, the first neuron just separates the two family trees, so we have the Italian name family tree and the English name family tree, and they're just separated by the first neuron. So if that first neuron fires, we're in the English family tree; if it doesn't fire, in the Italian one. Now, for example, if we take the second neuron, we see that it doesn't fire for Charlotte, Victoria, Colin and James, and we see that Charlotte, Victoria, Colin and James are here, so it separates out this part of the family tree from the rest.
And you can go on like this through this figure: a neuron is activated if the relation stems from a certain part of the family tree and deactivated if it doesn't. And then the next layer will combine all this information, combine it again, until we can calculate the output. This was really the first published example, at least, with this back propagation network, and it was published in Nature, so this was quite an important result, where people actually realized that just with a very general training algorithm you can already obtain these insights. So we already saw a bit in this Rumelhart paper that neural network architectures often have so-called encoder-decoder or even autoencoder layouts. What we do in an encoder is take an image as an input, then pass it through the first layer, which might have a lot of neurons, then pass it to a second layer which has a bit fewer neurons, a third layer with even fewer neurons, and then a central layer which has the fewest neurons. The number of neurons in that central layer is just large enough to keep most of the information and small enough to discard most of the noise. So you have to choose that quite carefully, but we can then compress all these images to a dimension of 30, starting at a dimension of 2000, and then decompress them again: we can just revert these calculations here and get the decoded image, which we can, for example, use to denoise images. If the dimension here in the coding layer is just two, we can also use that, as shown in this publication, to do projections: two dimensions for two-dimensional projections, or dimension three for three-dimensional projections. And these projections actually work quite well; they work like other projection methods, here we have principal components, and they seem to separate the classes, in this case at least, even a little better. And with similar approaches we can also embed basically any object: we can embed a word, so we can embed categorical variables. For example, in this publication, this is the famous word2vec, they embed words, lots of words, into this feature space here, and they see that by doing this embedding, the vectors that they create, so you project these words into vectors, are such that the relations between the vectors seem to make sense: the difference between a country and its capital is always about the same direction and lies more or less in the same plane. This was not input into the training explicitly, so this just came out of these embeddings. These embeddings are used quite often: for example, we can use them to turn categorical values into numerical values, or they are often used in a neural network architecture to prepare your data and then do the actual calculation in this embedded space. Now we come to the next playground exercise, and I would like to investigate a little bit these emerging patterns. What we need now are more hidden layers: one with eight neurons and two with six, so add a layer and add another layer with six. Then we choose a learning rate of 0.1, the activation function, and we introduce a little regularization; I'll explain in a bit what regularization is. And we try to solve the spiral problem, which is probably the most difficult problem here. Let's see how it manages to do that; adjust the learning rate just a little bit. There was a question, actually two questions, Marcus. One is: what's the purpose of an encoder-decoder or autoencoder?
Yes, so as I tried to explain, the autoencoder is often used, for example, for noise reduction, because we project the signal onto a low-dimensional space, and in the low-dimensional space there is less room for noise, so we kind of compact the whole signal and then decompress it again; there are many applications where we get rid of noise like that. There's a little practical later, and I will show you how to program a simple encoder-decoder; it's actually quite simple to produce, and you can do these sorts of PCA-like plots. Now, sometimes it just doesn't find what it's supposed to find, so maybe sometimes it just has an initialization that doesn't work out. What I forgot here is to add these two inputs, to show that the features we put in at the beginning are still somehow diverse. Another question was about the meaning of embedding: an embedding is just a projection to a lower-dimensional space, or into a vector space generally. So if you have a word, a word is not a numerical feature, a word is just a word; so how do we turn that into a numerical value? The neural network needs numerical values somehow, and this we can do with an embedding. And there was a question as well about the purpose of the classification with the family trees: was the idea to determine whether a person is Italian or English, or to predict their generation or their names? No, to predict the name: I give you a name, I give you a relation like brother, and the system then outputs a name. There is no practical importance in that; it was just, at the time, astonishing that you can learn something like that with a neural network, and also astonishing that the network actually managed to extract sensible, useful information about these relationships in the family. All right, thanks. Okay, so here we did our little spiral thing, and we can see that this is the activation of that neuron, how the output looks if you only have this neuron activated. So you see that the patterns it learns in the first layer are fairly general, it's just, for example, activation here, or activation in the upper part, or activation in the lower part, and in the second layer we already have more complex activation patterns that already resemble quite a bit the actual spiral pattern. And this again is just learned by the neural net: we have input data from the spiral here, we have these features we calculate, and the rest is done by the neural network. Okay, so now let's go on with the lecture. I would like to recommend at least some books for those who are interested and maybe want to dig deeper into the field. I really like this book by Aurélien Géron; this is quite a practical introduction, it also covers quite a bit of theory in a nice way, it doesn't go too far, but it really explains well what you need to know, and it comes with Jupyter notebooks, so you have some code that goes with it, you can reproduce all the examples and play with them, and there are lots of exercises as well. Also a very nice book is this one from Ian Goodfellow, Yoshua Bengio and Aaron Courville; these are people who were very much in the midst of this deep learning revolution, and Bengio is one of the Turing Award winners, and they have a lot of interesting things to tell. It's maybe a little bit more on the theoretical side, it doesn't have code examples, but it's certainly a very good book. Then you find plenty of other material, like this MIT lecture here, or the lectures by Andrew Ng, who is also an important figure in this neural network community.
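Coming back to the encoder-decoder question from before: the practical later will show this properly, but as a rough idea, a minimal autoencoder sketch, assuming TensorFlow/Keras, with made-up layer sizes and random placeholder data, could look roughly like this:

```python
import numpy as np
import tensorflow as tf

# made-up data: 1000 examples with 100 input dimensions each (placeholder for real data)
x = np.random.rand(1000, 100).astype("float32")

# encoder compresses 100 -> 32 -> 2, decoder expands 2 -> 32 -> 100 again
inputs = tf.keras.Input(shape=(100,))
h = tf.keras.layers.Dense(32, activation="relu")(inputs)
code = tf.keras.layers.Dense(2, name="bottleneck")(h)      # the small central layer
h2 = tf.keras.layers.Dense(32, activation="relu")(code)
outputs = tf.keras.layers.Dense(100)(h2)

autoencoder = tf.keras.Model(inputs, outputs)
encoder = tf.keras.Model(inputs, code)   # just the compressing half

# the input is also the target: the network learns to reconstruct its own input
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(x, x, epochs=10, batch_size=32, verbose=0)

# the 2-dimensional bottleneck values can be used as a projection, similar to PCA
projection = encoder.predict(x)    # shape (1000, 2)
```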
And you find a lot of examples on Kaggle or on the TensorFlow or PyTorch sites, so there are lots of things you can look up; you just need to google and usually you'll find a simple example, for example for an autoencoder or a dropout example or something. So, I guess this was the question before: how do we choose the number of layers and the number of neurons in a layer? The simple answer is, we don't know; it really depends on the problem. As we saw, if the problem is linear, one simple perceptron is enough, we don't need more than that. We usually don't know at the beginning how complex the problem is, so the only way is to experiment, and this experimentation can be quite complex sometimes: you have to somehow find the right architecture that gives you the best results for your problem, and I guess the more complex the problem, the more experience this requires. But if you just want to do a feed-forward neural network, the simplest neural network, then you just need to experiment with the number of layers and the number of neurons per layer. You can use, for example, an optimization framework like hyperopt or similar that tries to find the optimal hyperparameters of your neural network. But generally it's better to make the neural network a bit too large, because the neural network can cope with that: it will just put some of the weights to zero, and you will see in a later example that many of the neurons will simply not be used. So the neural network can be a bit too large; this is called the stretch pants approach, and in certain applications you deliberately make the neural network too large at the beginning and then the learning will bring it back into shape. But yeah, this requires some experience; at the beginning you just try things out. And this is a paper that investigates this a bit: the architecture of the neural network, the size of the neural network, obviously depends on the training data you have. If you have just a small amount of training data, you need to choose a small size; if you have more training data, you can, or rather have to, choose a larger neural network to get optimal results. Also, the neural network does not necessarily perform well: for example, if you use a neural network that is far too complex for the amount of data you have, the results are basically random; you only get reasonable results if the size of your training data corresponds more or less to the complexity of your network. Now, this relationship between the size of the training data and the complexity of the neural network is a complex one, and at least I don't know how to judge it a priori, so the only way to find it out is by experience and by trying things out. Okay, so what do we do when we actually train a neural network? The basic principles are not any different from other machine learning algorithms. We have what we call a loss function. This loss function evaluates the difference between our labeled outputs and the outputs calculated by the neural network, and we want our calculated outputs to be as close as possible to the labeled outputs. The difference between the calculated and the labeled outputs we evaluate with a loss function, and we want this loss function to be as small as possible. And our parameters, let's assume we have defined the architecture, the number of layers and neurons of our neural network, the parameters we have are
our weights; this w here includes all the weights from all the layers, and all the bias terms as well. So we need an algorithm that finds us those weights that minimize this loss. We have several loss functions; we usually talk about the empirical loss, which is just the loss when we sum the losses up for all the examples in our training or test set. There are many, many losses in frameworks like TensorFlow or PyTorch, and you can even define your own loss functions if you want to; sometimes this makes sense, but to start with, you often just use the standard loss functions. For example the square loss, which can be used for regression, or for classification as well: it's just the output of the neural net, this f as a function of the input and the weights, minus the annotation, the true output y, squared. So if the output of the neural net is always equal to the true output, we have zero loss. Or we have the cross-entropy loss, which is used for classification and which evaluates whether the output of our neural network always corresponds to the right class. We have one output per class; ideally the output is one for the true class and zero for all the other classes, and in that case we get zero loss, otherwise we obtain a positive loss. Just simple examples here, for the square loss first: let's assume this is our training data, so we have these three input data vectors, and then we have the labels 1, 1, 0. Then, with the same neural net we had before, we can calculate our outputs from the neural network, and we compare those outputs to the desired or true outputs with the square loss, which we do here, and since we use the empirical square loss, we just take the mean square loss over the training or test set. If you do a classification with two class labels, of course with two class labels you could also compress it into one class label, because the two sum up to one, but this is just to illustrate the principle: again we have input vectors, we have output vectors, so this time there are two outputs. The first one is for class one, 1, 1, 0, and the other one is for class two, 0, 0, 1, so they add up to one. And then we can calculate this log loss just by applying the formula I showed you before, and we then calculate the mean log loss over our training or test set. This is also a very nice paper that investigates a little bit the way this loss landscape looks, and I would really recommend you to read it, it's really nice. It's obviously very difficult to visualize that, because in a large neural network we can have hundreds of thousands or millions of weights, and obviously it's not possible to display a loss function in such a high-dimensional space. So what they did was just take random projections in two dimensions, and then you can display this projected loss function here. And there are a couple of interesting things, actually, which are maybe a little bit counter-intuitive. At the beginning of this neural network story, a lot of the argument was that there are too many local minima, that in such a high-dimensional space you will get lost in local minima. But it turns out that these local minima, if you have the right learning procedures, are actually not a huge problem. That is also because, in very high-dimensional spaces, the vast majority of points with vanishing gradients, so of stationary points, are saddle points, not local minima.
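To make the two worked loss examples from above concrete, here is a small sketch, assuming NumPy; the network outputs and labels are toy numbers of the kind used on the slides, not real ones:

```python
import numpy as np

# empirical square loss for three training examples
y_true = np.array([1.0, 1.0, 0.0])       # annotated labels
y_pred = np.array([0.8, 0.6, 0.1])       # outputs calculated by the neural network
mse = np.mean((y_pred - y_true) ** 2)    # mean square loss over the set
print(mse)

# cross-entropy (log) loss for two classes, one output per class
# rows are examples, columns are classes; each target row is one-hot
t = np.array([[1, 0], [1, 0], [0, 1]], dtype=float)      # true classes (1, 1, 0) and (0, 0, 1)
p = np.array([[0.9, 0.1], [0.7, 0.3], [0.2, 0.8]])       # predicted class probabilities
log_loss = -np.mean(np.sum(t * np.log(p), axis=1))        # mean log loss over the set
print(log_loss)   # zero only if the predicted probability of the true class is always 1
```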
This is because, in order to have a local minimum, you need all second derivatives to be positive, and if just one out of a million parameters has a second derivative that is either zero or negative, it already turns the local minimum into a saddle point. So there is always a way, even a very narrow one, to escape that local minimum through the saddle, and this means that local minima are actually not that frequent in this loss landscape. There's also another paper; of course all this depends, it's difficult to obtain general theoretical results, it always depends on the architecture of the neural network and the problem, but in these papers they investigated these local minima a bit and they found that they are actually quite benign. So there are different local minima, but the loss values of these local minima, especially when you have a lot of weights, seem to be quite close, so if you end up in any of these local minima you are really doing quite well. And what is also a nice result from this paper: you can change the smoothness of your loss landscape with different things that help you to train your neural network better; for example, skip connections increase the smoothness of this landscape. Now, how do we train a neural network? We do gradient descent, like in many other algorithms. First we start with a random initialization of our weights, then we define a certain number of epochs, maybe 100 or 200 for a small neural network, as a stopping criterion. Then we compute our loss and we compute the gradients, so for each weight we have to calculate the partial derivative of the loss function, by back propagation. And then we do gradient descent: the gradient always points towards higher values, so we go with the negative gradient, so we change our weights by going down towards lower losses. And we have a learning rate, a very important parameter which tells us how fast we go down the slope. And we iterate that until we have exceeded our number of training rounds or we have reached a certain stopping criterion. Now, this is the famous back propagation algorithm, which I'll try to explain briefly here. In these frameworks like TensorFlow and PyTorch, they have this thing called a computation graph: your whole neural network is basically transformed into a computation graph, or a tree in this case. This is actually very simple; I just tried to illustrate it for this very simple neural network here with two inputs and one output. So we start in the first layer, we have our inputs x1 and x2 and our weights w1 and w2. In the computation graph we first multiply them together, as was shown for the perceptron, and that's what we do in the neural net. Then we add z1 and z2 and we add the bias term w0, then we apply the activation function, and finally we calculate the loss. So in order to calculate the loss, we have to go through that computation tree one time, and we get a loss for a specific input vector x. Now, you probably all remember the chain rule from your high school years. If you have a function, for example a loss function, that depends on another function u, which in turn depends on x, and you want the partial derivative with respect to x, then this is equal to the partial derivative of the function with respect to u times the partial derivative of u with respect to x. Okay, this is the chain rule, a classical rule from calculus that we all learned at some point. So we can use that rule now to calculate our gradients, and you will see how this goes.
It is actually quite simple. We now go back through the tree; I should say that we always store the values here in these nodes of our computation tree. First we want to calculate the partial derivative of our loss J with respect to the output of the neural network, f. The loss is (f minus y) squared, so this derivative is just two times (f minus y). Now we go one node further: we want to calculate the partial derivative of J with respect to z, and according to the chain rule this is just the partial derivative of J with respect to f times the partial derivative of f with respect to z. And this first factor we already know, we calculated it here, so we can just take that value and plug it into this formula; we just need to multiply it with the partial derivative of the activation. In the same way we can also obtain the gradients for the weights, and finally the partial derivative with respect to the bias term w0, and we can go on like this until we are all the way through the network, through the computation tree. Like this we can get all the partial derivatives of the loss function with respect to all the weights by just going through that tree two times. That's why it's called back propagation: we first propagate forward through the tree up to the loss function, and then propagate backward through the tree, using the values calculated in the first pass, to calculate the gradients. And this makes a huge difference. For example, if you were to calculate your derivatives numerically, then for each weight you would have to calculate the value of the tree at w and the value of the tree at w plus h, for some kind of small difference h. So you would need at least one extra evaluation of the computation tree per weight, whereas here we just need one backward pass through the computation tree for all of them, and this really drastically reduces computation time. Thanks to that, we can actually train neural networks in reasonable time. But this back propagation algorithm had problems, especially for large neural networks: it had the vanishing gradient problem, so gradients can become small, and basically we have an underflow, so they disappear, or they can explode; both are possible, and this is mainly true for large or recurrent neural networks with lots of layers. So, how can we better train our neural networks? Do you want to do the break now, Marcus, or do you want to... yeah, maybe we can have a small break, I think. Yeah, it seems like a good moment to stop. Okay, then we can continue with how we can better train deeper neural networks after the break. Let's do a 15-minute break, is that fine with everyone? Yeah, that's great. There were some questions as well in the chat. One was: is it recommended to use Python for deep learning, or is it just fine to do it in R? And Jago has answered that, depending on the library you use, they can have the same back ends, like Torch, which is available both in Python and R. Yeah, so I would personally recommend Python; I actually learned Python just for that and I'm very happy with it, before that I also did R. I think Python is just closer to these implementations of deep learning in TensorFlow or PyTorch than R is, but there are also interfaces, so you can call all these functions from R as well. But I personally prefer Python; I guess there's also more documentation in Python, more debugging tools, more news groups and stuff like that. It depends on your personal case, but that's my personal preference.
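To tie the chain-rule walkthrough from before the break to actual numbers, here is a small sketch, assuming NumPy and the tiny two-input network from the slides (weighted sum plus bias, a sigmoid activation, a squared loss); the concrete input, weight and target values are made up, and the manual gradient is checked against a finite-difference approximation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# illustrative values for inputs, weights, bias and target (hypothetical)
x1, x2 = 0.5, -1.0
w1, w2, w0 = 0.8, 0.3, 0.1
y = 1.0

# forward pass through the computation graph, storing the intermediate values
z = x1 * w1 + x2 * w2 + w0      # weighted sum plus bias
f = sigmoid(z)                   # activation
loss = (f - y) ** 2              # squared loss

# backward pass: chain rule, reusing the stored forward values
dloss_df = 2 * (f - y)               # dJ/df
df_dz = f * (1 - f)                  # derivative of the sigmoid at z
dloss_dz = dloss_df * df_dz          # dJ/dz
dloss_dw1 = dloss_dz * x1            # dJ/dw1
dloss_dw2 = dloss_dz * x2            # dJ/dw2
dloss_dw0 = dloss_dz * 1.0           # dJ/dw0 (bias)

# numerical check for one weight: this needs one extra forward pass per weight,
# which is exactly what back propagation avoids
h = 1e-6
loss_h = (sigmoid(x1 * (w1 + h) + x2 * w2 + w0) - y) ** 2
print(dloss_dw1, (loss_h - loss) / h)   # the two numbers should be very close
```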
The second question was: when I add more layers in the TensorFlow exercise, the neural network gets stuck and fails to identify the spiral patterns; what is happening? We have a little exercise time at the end where we do exactly that: we add more layers and see how we can still manage to find a good solution. The question is really why more layers are detrimental, why adding more and more becomes a problem. One reason is that you overfit, and it also becomes more difficult to train the network. I will show a small example where you already see this vanishing gradient problem with the simple playground examples, and that might be one of the causes; it is hard to see exactly what is going on because the calculations are complex, but you can usually rescue the training by choosing a regularization or different activation functions.

Okay, thanks a lot, so let's have a break now, 15 minutes; I'll put it in the chat. Oops, sorry, I was on mute. So let's have the break now and be back here at 10:53. There is also a question from David: what do we mean by overfitting, is it something like learning by heart? Yes, learning by heart is a good way to put it. For example, if you overfit, the class boundary becomes very wiggly. Assume you have two classes that are perfectly linearly separable, so the optimal separation is linear, but you only have a finite amount of training data. If you learn from that data with a very complex classifier, it starts to follow the individual training examples rather than the true class boundary, and then you overfit. Usually, if you overfit, you get great performance on your training set but bad performance on your test set, because what you learn from the training data is not generally true; it is specific to the training set. Hopefully that explains it a little bit. And van der Poet also posted a picture, that helps too. Great, let's have the break now; we'll be back in about 15 minutes. Thanks a lot, stretch your legs, have a coffee, see you later.

I guess I might run a bit out of time, so for the sake of time I have to go a little faster now. So, how can we train these deep neural networks successfully? In the earlier days of neural networks we had the deep belief networks, which used restricted Boltzmann machines to pre-train the network before applying the back propagation algorithm. This pre-training already drives the network close to the minimum, and back propagation can then refine it. However, we don't really need this pre-training anymore, or at least not for all networks, because there have been quite a lot of advances in the algorithms. Some of these advances come simply from changing the gradient descent. With plain gradient descent, as I described before, you will get stuck at a saddle point, because there the gradient vanishes. But we can introduce a momentum: we keep some of the speed the gradient descent has built up in previous iterations, and that speed will carry us past the saddle point and towards the minimum, so that helps (there is a small sketch of this below).
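Here is a minimal Python sketch of the momentum idea just described, before the look-ahead refinement. The 1-D toy loss, the starting point, and all constants are made up purely to illustrate how momentum coasts across a flat, saddle-like region where plain gradient descent slows to a crawl.

    import numpy as np

    # Toy 1-D loss f(w) = w^4 - w^3, with a flat (saddle-like) point at w = 0
    # and a minimum near w = 0.75. Everything here is illustrative.
    def grad(w):
        return 4 * w**3 - 3 * w**2

    lr, beta, n_steps = 0.01, 0.9, 500

    # Plain gradient descent: crawls once the gradient gets tiny near w = 0.
    w_plain = -0.5
    for _ in range(n_steps):
        w_plain -= lr * grad(w_plain)

    # Gradient descent with momentum: keeps some speed from earlier steps
    # and coasts across the flat region.
    w_mom, velocity = -0.5, 0.0
    for _ in range(n_steps):
        velocity = beta * velocity - lr * grad(w_mom)
        w_mom += velocity

    # Plain GD should still be creeping near the flat point around 0,
    # while the momentum run should have reached the minimum near 0.75.
    print(w_plain, w_mom)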
We can even optimize that further by looking ahead a bit at where the gradient step will land; if it lands somewhere where the slope is already going back up, we reduce the speed, which helps us converge to the minimum faster. This is the so-called Nesterov accelerated gradient descent. There are many more such algorithms, Adam, AdaGrad and so on, which are all implemented in these frameworks, and you can choose the one that works best.

Then, we batch our data. If you computed the loss function over the entire training set, that would be very slow; it is more efficient to split it into several smaller batches, and each batch can then be processed efficiently, for example on a GPU, which accelerates your training. The batches are also chosen randomly, which introduces some randomness into the gradient descent. That is actually good, because it also helps you wiggle out of saddle points and not-so-good solutions.

A very important point, which sounds quite simple but matters a lot, is the weight initialization you choose. Assume you have a neural network with n inputs, and you compute a hidden-layer value by summing the weights times all the inputs. If the inputs come from a standard normal distribution and the weights are also initialized from a standard normal distribution, then the standard deviation of this pre-activation value z' will be the square root of n (the number of inputs) times the standard deviation of the inputs times the standard deviation of the weights. So if n is 100, the square root of n is 10, and you already have a 10 times higher standard deviation in the first layer. If you use the sigmoid function, for example, this drives your values into the region where the derivative of the sigmoid is essentially zero, which can cause the vanishing gradient problem. If instead you initialize the weights with smaller initial standard deviations, as is done here, the values stay in the range where the activation function is most responsive, and you obtain much better results. Again, all of this is implemented in TensorFlow (see the little numerical check below).

You can also choose the activation function. Traditionally a lot of people still use the tanh, the hyperbolic tangent, or the sigmoid, sometimes when you want a sort of binary response. But the ReLU activation function, which is zero for negative values and grows linearly without saturation for positive values, turned out to be very powerful, and it is also very fast to compute, especially on GPUs. And it doesn't suffer from the vanishing gradient problem on the positive side: even for very high inputs the gradient doesn't vanish. There are advantages and disadvantages to all of these activation functions; if you want to know which one works best, it is quite easy to swap them, so you can just try several of them, as you will see in our little exercise later. In these papers here, around 2011 and 2012, the ReLU activation function was introduced, and in one of them they managed to show that with this function they no longer needed the pre-training via restricted Boltzmann machines; the network could be trained from scratch. Also, in the famous AlexNet paper, it turned out that the ReLU converges much faster and reaches a lower error rate much faster compared to the tanh activation function.
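Going back to the initialization argument from a moment ago, here is a quick numerical check in Python. It simply verifies that with standard-normal inputs and weights the pre-activation standard deviation grows like the square root of n, and that dividing the initial weights by sqrt(n) (in the spirit of Glorot/He-style initialization) brings it back to around 1; the sizes are made up for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    n_inputs, n_samples = 100, 10000

    # Inputs drawn from a standard normal distribution.
    X = rng.normal(size=(n_samples, n_inputs))

    # Naive initialization: std of the pre-activation grows like sqrt(n_inputs) ~ 10.
    w_naive = rng.normal(size=n_inputs)
    print(np.std(X @ w_naive))          # roughly 10

    # Scaled initialization: divide the initial weights by sqrt(n_inputs).
    w_scaled = rng.normal(size=n_inputs) / np.sqrt(n_inputs)
    print(np.std(X @ w_scaled))         # roughly 1, keeping a sigmoid/tanh in its responsive range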
There are other activation functions that, at least in the papers, work even better than ReLU; again, it really depends on the problem, and I think one needs to try at least a few of them. I don't have time to go through these playground examples here, but you can do that yourself; it is described what you should do. This one is an exercise on the vanishing gradient problem, where you introduce the ReLU activation to mitigate it.

Another very important parameter is the learning rate. As I said, this is how fast you go down the gradient to find the minimum. If the learning rate is way too high, you will just jump around the minimum and never really descend into it, so your loss stays high and you won't be able to learn. If it is still a little too high, you will converge, but the convergence will be quite noisy: you jump around inside the basin of the local minimum and converge slowly. If the learning rate is just about right, you get a fast initial descent with a fairly smooth curve afterwards, and the two losses, the training loss and the validation loss (the loss computed on the training set you train on and on an independent validation set), will be about the same. If the learning rate is way too low, you will creep much too slowly towards the minimum, and you might also run into the vanishing gradient problem, so that you effectively stop before you actually reach it. So the learning rate is an important parameter. You can also have more complex learning rate schedules: power scheduling, where the learning rate drops to one half, then one third and so on after a certain number of steps; exponential scheduling, where it is divided by 10 after a certain number of epochs; piecewise scheduling; and so on. This really helps. Here I implemented the same problem with piecewise scheduling, and you see the initial phase has a fairly high learning rate, so it converges quickly, but then you reduce the learning rate and you get a smooth continuation of the curve.

Then you can also use regularization. Maybe you remember, when I was talking about the perceptron at the very beginning, I said that if the weights are small you get smoother boundaries and less overfitting. So if you are overfitting, if your class boundaries are too wiggly, you can try to reduce the size of your weights by adding terms to the loss function that grow with the size of the weights; then, when you minimize the loss, you also minimize the size of the weights. I have little playground exercises for that, and you can see that you actually get smoother class boundaries and better classification with this regularization method (a small sketch of how these knobs look in a framework follows below). Some of these methods also change the shape of the loss landscape. Again, this is from the paper by Li et al., where you see the loss landscape without and with so-called skip connections. I don't have time to explain exactly what those are, but they are a mechanism that smoothens out the loss function, and you see the result is quite drastic: in this situation it is much easier to reach the minimum and to learn.
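To show how the learning-rate schedule and the L2 weight penalty just described are typically wired up in a framework, here is a minimal PyTorch sketch. The tiny model, the random data, and every hyperparameter value are made up for illustration; the only point is where the weight decay (L2 term) and the piecewise schedule plug in.

    import torch
    from torch import nn

    # Tiny model and random data, just to have something to optimize.
    model = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))
    X = torch.randn(256, 2)
    y = torch.randn(256, 1)
    loss_fn = nn.MSELoss()

    # weight_decay adds an L2 penalty on the weights (the regularization term described above).
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)

    # Piecewise schedule: divide the learning rate by 10 every 30 epochs.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

    for epoch in range(90):
        optimizer.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()          # back propagation through the computation graph
        optimizer.step()         # gradient descent step
        scheduler.step()         # update the learning rate according to the schedule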
You also have, as we already heard, pre-training with the Boltzmann machines, and there is something called transfer learning, where you take the weights learned on a similar problem. For example, in image classification the first layers will be quite similar across tasks, because they act as edge detectors and so on. So if you want to classify images of dogs but you have a large training set of cat images, you can take the weights learned on the cats and transfer them to the dog problem as the initialization of the weights. If you do that, you already bring the weights quite close to the minimum and you make the back propagation much easier. You can also do data augmentation: you don't just take the image itself, you also shift, twist or rotate it, or change the lighting, so you get more training examples and it becomes less likely that you get stuck in a bad solution.

Very famous is also early stopping; we were talking about overfitting before. When you train a neural network you usually compute both the training loss and the validation loss: the training loss on the data you actually train on, and the validation loss on an independent validation set that is not used for training. When the validation loss starts to increase, you are probably overfitting your data. So for early stopping you need to provide a validation data set, and you simply stop when the loss on the validation data doesn't go down anymore. You can allow a lag, a patience: you say, okay, I will try maybe 10 or 20 more epochs, and if I don't find anything better than before, I will stop and take the weights that gave me the lowest validation loss (there is a small sketch of this logic after this section). Here are some images: this wiggly class boundary is where you actually overfit, so this is an example of overfitting; here you have the best generalization, the lowest validation loss; and this is the first guess, which is obviously not a very good one yet.

Dropout is also an important technique: for each batch you randomly switch off certain neurons, setting their outputs to zero, so they drop out of the learning process. That means for each batch you train a slightly different neural network, and the final network is effectively an average over all these different networks. Again you introduce some randomness into the learning, which prevents the algorithm from adapting too closely to the training data, and it usually improves performance quite a bit. You can also use dropout in a Monte Carlo fashion to compute distributions of the output values of a neural network, if you are interested in that. And as a last regularization method there is batch normalization, which is also a very strong regularizer: within each batch you normalize the values and learn scale and shift parameters, so that the values are rescaled and shifted in the way that gives you the lowest loss at the end. That also seems to work really well.
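On the early stopping just described, here is a minimal Python/PyTorch sketch of the patience logic: keep training, remember the weights with the lowest validation loss, and stop after a fixed number of epochs without improvement. The model, the random train/validation split, and all values are made up for illustration.

    import copy
    import torch
    from torch import nn

    # Small model plus a random train/validation split (all of this is illustrative).
    model = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))
    X_train, y_train = torch.randn(400, 2), torch.randn(400, 1)
    X_val, y_val = torch.randn(100, 2), torch.randn(100, 1)
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    patience = 20   # the "lag": how many epochs we keep trying without improvement
    best_val, best_state, epochs_without_improvement = float("inf"), None, 0

    for epoch in range(1000):
        optimizer.zero_grad()
        train_loss = loss_fn(model(X_train), y_train)
        train_loss.backward()
        optimizer.step()

        with torch.no_grad():
            val_loss = loss_fn(model(X_val), y_val).item()

        if val_loss < best_val:                 # validation loss still going down
            best_val, epochs_without_improvement = val_loss, 0
            best_state = copy.deepcopy(model.state_dict())
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:   # probably starting to overfit
                break

    model.load_state_dict(best_state)   # keep the weights with the lowest validation loss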
This is just another playground example; somebody asked why the network doesn't converge anymore when you add too many layers, and this is exactly what I tried to simulate here. Not all of these regularization techniques are available there, but you can use some of them, the L2 regularization for example, and you see that you can still make it converge by choosing the right regularization, the right activation function and the right learning rate. In these papers they often give recommendations on which values to use; the book I recommended usually states which initialization you should combine with which activation function, and so on. But again, it is probably a good strategy to try these things out yourself and see what works best for you. You can also use an optimization framework: I personally use hyperopt, which works really well, and there are others like Hyperband or genetic algorithms. There are also services, I think Google provides one, that give you more or less the optimal network configuration for a certain problem.

There are three questions in the chat. The first: what is the definition of pre-training with the Boltzmann machine, what is learned there? This is an unsupervised method. What you try to do is, in a certain sense, keep as much of the information as possible in the next layer, so that you can reconstruct the signal of the previous layer from it. For that you don't need supervision; you just want to reconstruct the previous layer's signal from the information in the next layer, and like this you make sure the network doesn't lose information in the process. That works quite well as pre-training, and in the earlier days of neural networks, say 2005 or 2006, it was commonly used. The second question: are there augmentation methods for gene expression data, for example? That's a good question. I don't know; I'm sure there are, but one has to think about it, it is not as obvious as for images. Maybe one could use something like protein interaction networks, I have a little example of that later if there is time, to smooth out the gene expression values somehow, but in general I don't know. The third: is dropout equivalent to outlier treatment in machine learning? I wouldn't say it is equivalent, maybe somewhat related. With dropout you can create an output distribution, and there you might see that sometimes the network is too far off and you can skip that example, so you could use it as a kind of outlier detection, but it is not that good for it. Okay, thanks.

So now we come to the question somebody asked about whether neural networks also have certain disadvantages. But first, maybe, an advantage. When I saw this paper I almost fell off my chair, because I grew up with this curve here; maybe you have seen it already, the so-called bias-variance trade-off. I can't explain it in detail now, but you have the capacity of a classifier, which in our case is basically the number of neurons. If the capacity is too low, you can't model your class boundaries properly, so you don't perform well; if you have overcapacity, you overfit, so it doesn't generalize either. So there is an optimal point in between, and the worst point of overfitting is when you start interpolating your training data. That's what we learned, and we always said, okay, there is some kind of compromise here in the middle where things work well. But with deep neural networks, and also with other really complex systems like gradient boosting with huge trees, it's not only neural networks, suddenly you see that you can
actually go over that interpolation threshold, and suddenly the neural network starts improving again. A massively overfitted neural network out there can work better than what we would have expected from the classical picture. For me this was a really surprising result, because I had taken this curve as a given truth and thought it would be impossible to do better than the optimum in the middle. Explaining it is a bit difficult; I am not sure people fully understand it mathematically yet, but one potential explanation is that in the classical range you have a fairly good fit, you might miss some of the smaller features, and you are very dependent on outliers, whereas in the strongly overparameterized regime it is like the image of a stretchy membrane: the network becomes so flexible that it can jump up to an outlier and come straight back to the main regression tendency, so outliers disturb the training less than they would in the classical case. There are papers showing that this really happens for some neural networks.

But this kind of overtraining also means that the networks become very brittle, so you can disturb them very easily. This is shown in this example, where they trained a fairly deep neural network to classify these images and then specifically looked for pixels to change; this is called an adversarial attack, a rather mean thing to do to a neural network. You don't just pick a random pixel; you specifically search for a pixel such that changing just that one pixel changes the classification that comes out of the network. They showed that you can find such a pixel in about 70 percent of the cases for these rather low-resolution images. You also see biases from the training data: there are apparently lots of airplanes in the training data, so the network likes to classify things as airplanes, and it is sometimes quite astonishing, like this deer here turning into something completely different. So these neural networks are a bit brittle. But this is not only bad, because you can use it again for data augmentation: you can add these adversarial images to the training data and retrain. And if you change not just one pixel but several of them, you can basically get any picture classified as any other picture. For example, all the pictures here on the left were disturbed by this pattern; you cannot visually distinguish the disturbed pictures from the originals, and yet all the pictures on the right were classified as ostriches. This is from this paper, and it also shows that neural networks learn very well, but they are not intelligent systems: they don't really understand the pictures, they just learn pixels, which is what we trained them to do, but that has consequences.
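Here is a small Python/PyTorch sketch of a gradient-based adversarial perturbation. To be clear, this is not the single-pixel search from the paper just discussed, but the closely related fast-gradient-sign idea: the same gradients used for training, only taken with respect to the input pixels. The stand-in model is untrained, the "image" is random, and epsilon is made up, so the prediction may or may not actually flip; the point is only where the input gradient comes from.

    import torch
    from torch import nn

    # Tiny stand-in classifier and one random "image" (all sizes are illustrative).
    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
    image = torch.rand(1, 3, 32, 32, requires_grad=True)
    true_label = torch.tensor([3])
    loss_fn = nn.CrossEntropyLoss()

    # Forward and backward pass, but the gradient we want is w.r.t. the input pixels.
    loss = loss_fn(model(image), true_label)
    loss.backward()

    # Move every pixel a tiny step in the direction that *increases* the loss.
    epsilon = 0.01
    adversarial = (image + epsilon * image.grad.sign()).clamp(0, 1).detach()

    # Compare the predicted classes before and after the perturbation
    # (with a trained network and a suitable epsilon, the class often changes).
    print(model(image).argmax(dim=1), model(adversarial).argmax(dim=1))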
Just a second, there is a question: how can you go beyond the interpolation point if the loss is always zero, as in the paper before? Yes, that is the training loss, so how can you train anything further once the training loss is zero? That's a good question; I don't know the answer offhand, maybe there is still some randomness or regularization at play. I have to think a little about it, maybe over the lunch break. And there is another question: doesn't this paper then go against the advice of early stopping? Yes, that's another very good question. But in most examples early stopping gives you a very good result, so usually we don't go into that regime anyway; we are quite happy if we find something good there, and for that early stopping works fine. It is only if you want to go really, really deep, into this far part of the curve, that you have to do more: you probably need more training data and you have to train much longer. I work with neural networks from time to time, and I am usually quite happy if I find a good solution. Then there was a more general question: how to choose the best activation function for a given dataset? Just try them out. Look at the loss curve; in the practical we will see how to plot the loss curves, and you can see how the network behaves, whether it is too noisy, learning too fast or too slow, and then you change the activation function, train again and see how it behaves. Okay, I'll stop with the questions now so that Marcus can finish before the little break. Okay, thanks.

So this is another example, where they inserted so-called adversarial patches: they created images, printed them out, placed them next to the actual object and took a picture, and the patch was designed so that things get classified as toasters; this banana was often classified as a toaster just because they added this patch here. Excuse me, one second... there it is. If you are interested in this, there is a lot of interesting discussion; people like Geoffrey Hinton are very aware of all this and try to improve it and make these networks more robust. One thing that really helps, I think, is to take a step back from these super-performing neural networks and try to interpret the results. Nowadays you have the LIME or SHAP packages you can use for that, and there are also things you can build into the neural networks themselves, like attention layers, that tell you a little bit why an image was classified the way it was. For example: why is this husky classified as a wolf? The explanation tells you it is because of the snow in the background; apparently all the wolf pictures had snow in the background, and therefore this husky with snow was also classified as a wolf. So this tells you that the network didn't really learn the right features, which the raw prediction alone would never reveal (a tiny sketch of such an explanation call follows below).

Some examples I came across from my own field: this is a predictor for mass spectra, which works well, I would say. They also have a huge training data set; this is a group in Munich, they have ProteomicsDB with millions of spectra from different instruments to train on, and they built this deep learning framework with a very nice architecture: they have GRUs, something similar to the LSTMs we will hear about later, and they have this latent space, these embedding spaces. It's a very nice paper, I think, and a very nice architecture to study. Here is another example; somebody asked about data augmentation for gene expression data: here they use the STRING database network to smooth out gene expression values over different proteins, so maybe something like that could be used. They then combine the gene expression profiles with the SMILES representations of molecules and try to predict the effect a molecule has on certain diseases. Also a very nice and important paper, and they use attention layers that then tell you which parts of the molecules, or which parts of the protein network, are actually important for the prediction.
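Going back to the interpretation packages mentioned above, here is a minimal sketch of how SHAP is typically called. It uses a made-up tabular dataset and a random-forest stand-in model rather than the image examples from the lecture, and the model-agnostic KernelExplainer; all names and sizes are illustrative.

    import numpy as np
    import shap
    from sklearn.ensemble import RandomForestClassifier

    # Made-up tabular data and a stand-in model; the point is only the explanation call.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 8))
    y = (X[:, 0] - X[:, 3] > 0).astype(int)

    model = RandomForestClassifier(n_estimators=100).fit(X, y)

    # Model-agnostic SHAP explainer: estimates how much each input feature
    # pushed the prediction up or down for a given example.
    explainer = shap.KernelExplainer(model.predict_proba, X[:50])
    shap_values = explainer.shap_values(X[:5])

The same idea applied to an image classifier is what produces explanations like the husky-versus-wolf example, where the attribution ends up on the snow rather than on the animal.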
I mostly work with tabular data. Tabular data is basically data in an Excel-sheet format; typically it contains a lot of categorical features, a mixture of categorical and numerical data, and so on. It is known that gradient boosting trees, or tree methods in general, work well with this type of data, while neural nets usually don't do as well. Recently a couple of deep learning approaches were published on this subject, and there is a very recent paper where they compare them to the gradient boosting algorithms; the gradient boosting algorithms still seem to outperform the deep learning approaches. That is also the experience I have made with my own data, but it might change in the future; I think it is an active field now to apply deep learning to tabular data, but at least at the moment it is perhaps a field where deep learning doesn't really provide an advantage over other methods. This here is TabNet, a deep learning framework for tabular data; if you are interested, have a look at the paper, it is also a very nice architecture that tries to emulate, in a way, how these gradient boosting algorithms work. Okay, there are still many open problems for deep learning, but I think I can skip that slide. So that was it from my side for this theoretical lecture. I don't know, Patricia, how you want to proceed now and how we want to swap over.