So this is an introductory course to deep learning, and if you want to know more about the subject after the course, there are a lot of excellent books and free online materials available. If you are really new to the field, I also recommend the TensorFlow playground, where you can play around with neural nets, different architectures, different activation functions, and so on.

I'd like to start with a short history of deep learning. Neural networks were invented as a simple model to study how neural circuits process information. The first important result was published by McCulloch and Pitts already in 1943, where they showed that these neural nets, or multilayer perceptrons as they were called at the time, are capable of universal computation. A little bit later, Rosenblatt published the first learning algorithm, which was still quite unstable, and so introduced these neural networks to machine learning; he also provided a hardware implementation of a perceptron. Then in 1974, probably by several authors more or less at the same time, a better way to train neural networks was invented: gradient descent with backpropagation. However, it took a while until this algorithm was really applied. It was the important paper by Rumelhart, Hinton and Williams that used backpropagation to train a simple neural network that was able to predict words; we come back to that during the lecture. Before that, Fukushima produced the first convolutional neural network, which was used to read Japanese characters. But it was not yet trained with backpropagation, so it was very cumbersome to train. Then Yann LeCun produced LeNet, a convolutional neural network fully trained by backpropagation, and he showed that this network is capable of outperforming other machine learning methods such as SVMs. It was actually the first neural network that was used in practice, by the US Postal Service, to read handwritten postal codes.

During the 90s deep learning, or deep neural networks, were still at the beginning, but a lot of important inventions were already made then. For example, bidirectional recurrent neural networks were introduced by Schuster and Paliwal, and long short-term memory (LSTM) networks were introduced by Hochreiter and Schmidhuber.

If you compare the citations for deep learning to the citations for artificial neural networks, which here correspond to shallow neural networks (maybe one layer, whereas deep neural networks have, let's say, at least three layers), or to support vector machines, you see that up to maybe 2010 there were many more citations for shallow neural networks or support vector machines. But the number of citations for deep learning started to grow very quickly after 2010, and deep neural networks are now probably the most important methods in machine learning. So what happened around 2010? Much of it is due to Geoffrey Hinton, who also led the CIFAR program in Canada. He really believed that deep learning could work at a time when most people were very skeptical about it. He has an interesting career: he is actually a trained psychologist, so he knew that the brain is a highly overparameterized system which is nevertheless capable of efficient learning, and he thought that he could somehow emulate that with deep neural networks as well.
Then probably one of the most important things that happened was that deep neural networks could be trained on graphical processors, so-called GPUs, which made it possible to parallelize and greatly accelerate the training of these networks. The group of Andrew Ng in particular contributed a lot to this technique. Then deep neural networks also started to outperform other approaches in the competitions you often have in computer science. For example, for phone classification in speech processing, Hinton and his group produced a neural network that outperformed all the other approaches. It's also notable that neither Hinton nor Mohamed were specialists in speech processing, yet they were able to come up with and train a neural network within a short time and outperform systems built by groups that did nothing but speech processing.

Probably better known is the so-called AlexNet, produced by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton in 2012. It's a convolutional network that used all the modern techniques available at the time: backpropagation on GPUs, the ReLU activation, dropout, and data augmentation. I will explain what all these things are later. They won the ImageNet competition in 2012 with a clear advantage over the runner-up. Later on, Sutskever went to Google and started developing language models there, which then led to the GPT models we now have with OpenAI. Part of the deep learning success is also due to open source code: frameworks like TensorFlow or PyTorch were developed, which are very robust and make deep learning easy to use for the machine learning community. And all this led to the Turing Award, one of the main awards in computer science, being given to Geoffrey Hinton, Yoshua Bengio, and Yann LeCun in 2018.

So this is a graph that shows, for this ImageNet competition, the top-5 classification error. ImageNet is a collection of about 1,200,000 images in 1,000 categories. Sometimes these images contain several objects, and you have to classify which objects are present. In 2012 the first neural net, AlexNet, was introduced, and you see that it quite drastically reduced the error compared to the winner of the previous year. After that, all the winners of the following years were based on neural network architectures. What you also see is that AlexNet had a depth of eight layers, so it was already deep, but not very deep. The tendency was then to go deeper: the winner of 2014, GoogLeNet, had 22 layers, and the winner of 2015, ResNet, had 152 layers. So there was a tendency to increase the number of layers, to increase the depth of the neural network. However, that is not everything. For example, the ZF Net, the Zeiler-Fergus net, was essentially the same as AlexNet, just differently parameterized. It's not only the depth; it's also the details of how you use these networks that make a big difference. Very importantly, the number of parameters, or weights, in AlexNet was 62 million, and this compares to about 1.2 million training examples. And this was really the astonishing fact: that such a completely overparameterized neural net, with roughly 50 times more parameters than training examples, was able to generalize, was able to perform well on a completely independent test set.
This was kind of contradictory to the general belief in the machine learning community at the time, and it really led to a sort of paradigm shift in thinking. The tendency to increase the depth went on, and nowadays, as you all know, we have GPT-4 with on the order of a trillion parameters. These networks really became huge. They became so big that developing such a network is very, very costly, and also running the network, the computational power and the electricity you need to support it, is a huge cost. Okay, this was just a very short historical introduction.

Now we'd like to try to find out what deep learning actually does, on which principles it is founded. The basic building block of a neural net is the single-layer perceptron. It's a very simple thing. You have an input vector x1 to xn; these are all numerical values, neural nets can only deal with numerical values, not with categorical ones. Then you have n weights, w1 to wn, and an offset w0. What you do is you multiply each input x1 with w1 and xn with wn, and you sum them all up. Then you add the offset, and you pass this affine transformation to an activation or gating function, which is often, for example, the sigmoid function: it is close to 0 for very negative values, roughly linear around 0, and flattens towards 1 for very positive values. This gives you the output of your perceptron.

If you plot that output for two dimensions, so you just have an input vector of x1 and x2, then you see that with the sigmoid activation function the output is 0 in this region here, then it goes over a step, and it becomes 1 on the other side of the step. The step corresponds to this linear equation here, which is just the affine transformation we do in this neural network, and the weight vector w is orthogonal to the step. Also important to know: the larger the weights w are, the steeper the step is; the smaller the weights w are, the flatter or smoother the step is. Now, with this system we can only do linear separations between classes, so maybe not particularly interesting; the same thing we can do with logistic regression. It starts getting more interesting if you combine different single-layer perceptrons.

Here, again, we have only a two-dimensional input, and we have three perceptrons, z1, z2, and z3. These three perceptrons are then combined in the next layer to give the final output. Now we want to look in a little bit of detail at how this works. The first perceptron is this one here; I just gave some arbitrary weights to these things, the two weights here are minus 1 and minus 1, plus an offset. If you plot the result, and for simplicity we assume that we have a step-wise activation function, you see that this output z1 is 0 in this region and 1 in this region. If you take the second perceptron, z2, I also gave some weights to that one, and you can plot again what the output looks like: as a function of x1 and x2, z2 is 0 in this region and 1 in this region. And then we go to z3; again, for this set of parameters, we can show that z3 is 1 in this region and 0 in this region. Now in the next layer we add another single-layer perceptron. So we combine the three outputs of the first layer, z1, z2, and z3, with a new perceptron, and the weights are all 1 here.
So we just add these three numbers together and then we subtract an offset of 2.5. Without subtracting the offset yet, we obtain this: in the region where all three 1-regions overlap, we add them up and get 3; in the regions where two of them are 1, we get 2; and where only one of them is 1, we get 1. If we now subtract 2.5, the regions with values 1 and 2 become negative; the only one that remains positive is the region with value 3. Then we apply our step-wise activation function, meaning all the regions with negative values become 0 and this region becomes 1. So what we did here is that we separated this triangular region from all the rest. We could, for example, use this as a classifier if we want to classify all the points within that triangle against all the points outside the triangle. (A small code sketch of this construction follows below.)

Now you can imagine that if you start adding more single-layer perceptrons, maybe in the first layer or more in the second layer, and add more layers as well, we can fit more and more complicated shapes. If we do that, usually not with a step-wise activation function but, for example, with the sigmoid activation function, then we obviously get not these hard boundaries but softer boundaries, as you can see. As I said, we can extend this: we can add more neurons, more single-layer perceptrons, in the first layer, and we can add a second layer as well before we go to the final layer. As you can see from this publication here, already by doing that you can fit quite complicated shapes; you can separate these two concentric circles from these other shapes. If you have even more complicated shapes, you might want to add more layers, and we will see that you can fit basically any shape if you add enough layers to your neural network.

So these deep neural networks look like this. As we saw, they have an input layer of dimension n. Then we have hidden layers, maybe more than three, otherwise we probably wouldn't call it deep learning. Each neuron in a hidden layer is a single-layer perceptron: it has a weight vector, we multiply this weight vector with the input vector, we add an offset, and we pass it through an activation function. We do this for each neuron, and for the neurons in the next layer we use the neurons in the previous layer as input and proceed in the same way, and so on, until we come to our output neuron. You can have different activation functions: often one uses the ReLU activation function in the hidden layers and maybe the sigmoid or the softmax in the final layer, especially for classification, where you often want the probability of a class; that is usually done with the softmax activation in the last layer.

Now it can be shown mathematically, and this goes back quite a while, that shallow neural networks, neural networks with just one hidden layer, are already capable of universal computation. You can fit any function you want with them if you just add enough nodes in that single layer. So why do we need deep learning if we can do everything already with one layer? It turns out that if we have a stack of layers instead of one layer, we can do the same task much more efficiently.
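Coming back to the triangle example above, here is a minimal NumPy sketch of that construction. The hidden-layer weights below are illustrative stand-ins (the exact numbers on the slide differ); only the all-ones output weights, the offset of -2.5, and the step-wise activation come from the lecture.

```python
import numpy as np

def step(x):
    # step-wise activation: 1 where the input is positive, else 0
    return (x > 0).astype(float)

def perceptron(x, w, b):
    # single-layer perceptron: affine transformation followed by the activation
    return step(x @ w + b)

# three hidden perceptrons; these particular weights carve out the triangle
# x1 > 0, x2 > 0, x1 + x2 < 1 (illustrative values, not the ones on the slide)
W_hidden = np.array([[1.0, 0.0, -1.0],
                     [0.0, 1.0, -1.0]])   # shape: (2 inputs, 3 hidden units)
b_hidden = np.array([0.0, 0.0, 1.0])

# output perceptron: weights all 1, offset -2.5, as in the lecture
w_out = np.array([1.0, 1.0, 1.0])
b_out = -2.5

def two_layer_net(x):
    z = perceptron(x, W_hidden, b_hidden)   # z1, z2, z3
    return step(z @ w_out + b_out)          # 1 only where all three are 1

points = np.array([[0.2, 0.2],   # inside the triangle
                   [0.8, 0.8]])  # outside the triangle
print(two_layer_net(points))     # -> [1. 0.]
```

Only the region where all three hidden outputs are 1 survives the offset of -2.5, exactly the triangular region described above.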
So, back to depth versus width: with a deep network we can do the same task with fewer weights and much less training time. The reason is that you can, to a certain extent, prove this mathematically, especially if the data or the function you want to fit has some sort of hierarchical structure, which is often the case, for example for images, or for text, where you have paragraphs and sentences which are composed of words and so on. If you have this type of structure, you can show that a deep architecture should perform better than a shallow one. And the intuition, as we will see, is that in a deep architecture each layer is able to extract essential information from the previous layer and pass it on to the next layer, which again extracts essential information; you build up a hierarchical system like this, where the last layers are then able to extract the information you need, for example for classification, and perform a good classification.

What we need to know at this point is that these neural networks, the shallow and the deep ones, but the deep ones more efficiently, are universal function approximators: they can fit more or less any function of the input, and it doesn't even need to be smooth. And we have learning algorithms that are able to efficiently learn the weights that achieve this. The magic of deep neural networks is then that we can in many cases fully fit our training data, but also generalize well on an independent test set. Here is a result from a publication which I will cite frequently in this talk; it's from Bengio's group. They show this universal approximation property of deep neural networks: what they did is they took the CIFAR data set and just randomized the labels of the images, so instead of a bird you call it a dog, or they shuffled all the pixels, or used completely random pixels, Gaussian noise. And still the neural network could be trained to zero training error; it was able to perfectly fit even these completely randomized images. This shows the approximation properties of these neural networks. Of course, fitting random data is not useful, so this type of neural network will perform very badly on test data. This is just to show the point that these neural networks are universal approximators, so you can basically fit anything you want to fit, assuming the neural network is deep and wide enough to have the capacity for it.

Now the way we work with neural networks is quite different from the way traditional, let's call it traditional, machine learning works. In the more traditional approach, you as a data scientist were given some data. You looked at the data, you analyzed it, you talked to the people who are experts in the data, and then you extracted features that describe the data; for images, for example, you might extract edge detectors or circle detectors or color histograms or things like that. Then you combined all these features in a feature vector, you added the labels in case of supervised classification, and you passed the feature vectors and the labels to a classifier, for example a support vector machine, which then learned a model of how to classify these items. So you needed some expert knowledge, you needed to understand the data to a certain extent to come up with good features. Marcus, there is a question.
There are three questions in the chat. One is: in a deep learning neural network, a CNN, what is the rule of thumb for the number of hidden layers when we do the modeling? — Yeah, I'll come to that point later, I have a slide on that. — Okay, then the other two. What is the difference between a deep learning neural network and a GAN, a generative adversarial network; are you going to talk about this too? — No, I won't talk about GANs. GANs are also deep neural networks, it's just a particular way to train them. I won't have time to go into this; maybe Vandu will talk a little bit about those. — And the third question then was whether these GANs can be run on Google Colab with Python. — Yes, I think so. Well, GANs are usually very expensive to train, so I'm not quite sure how well it works on Colab, but basically I would say yes, there are GAN implementations in TensorFlow and you can run them. — Thank you.

Okay, so the way deep learning works is different. In deep learning you don't do feature extraction; you just take the raw images, so we take the pixel values of these raw images, you design your neural net, which has a certain number of layers and a certain width, and you just pass these raw images to the neural network, and you also give it the labels. So you tell the neural network that it should adapt the weights in such a way that it can attribute these images to the right labels. And by doing that, almost magically, the neural network will, in the consecutive layers, design its own feature detectors or features. Usually, and this is not a general rule, in the first layer for these images it has basic features like edge detectors; in the next layer it combines these edge detectors, maybe into corner detectors or more complicated shapes; and as you go down the layers these features become more and more complex, until they are complex enough to decide whether the image is a bike, a cow, or a toaster. So this is the very rough overview of what neural networks are doing.

You can look at real examples; this is from Zeiler and Fergus, who won the ImageNet competition in 2013. They used AlexNet, and they also won it because they studied how AlexNet actually works and what it is doing; they found a way to visualize these activations or feature maps, I can't get into that here, they developed special networks just for that. What they saw is that in the first layer of this convolutional network we have these simple edge detectors; in the second layer they become more complex already, you have corner detectors, round-shape detectors, color detectors, or pattern detectors; and as you go down the layers these features become more and more complex. The interesting thing is that you didn't design these features; the neural network, just by being forced to learn to classify these images, came up with these features by itself.
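As a rough illustration of this workflow, raw pixels plus labels in, features learned by the layers themselves, here is a minimal, hypothetical Keras sketch. The input shape, layer sizes, and dataset names are placeholders, not the networks discussed above.

```python
import tensorflow as tf
from tensorflow.keras import layers

# x_train: raw pixel values, e.g. shape (num_images, 32, 32, 3), scaled to [0, 1]
# y_train: integer class labels; no hand-crafted features anywhere
model = tf.keras.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.Conv2D(32, 3, activation="relu"),   # early layers tend to learn edge-like filters
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),   # later layers combine them into more complex patterns
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),    # class probabilities in the last layer
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=10, validation_split=0.1)
```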
Now this is the famous paper by Rumelhart, Hinton and Williams, where they had a very simple system: they just had two family trees, an Italian one and an English one, and you see here the associations, so Roberto is married to Maria and they have children, and the daughter Lucia is married again, and so on. You give this neural network, you see the architecture of the network here, a name and a relation as input, for example "Lucia is daughter of", and the answer should be Roberto and Maria. So it is really a very simple system, but they used this backpropagation algorithm and showed that by just sampling examples from this data and feeding them to the neural network for training, they could actually train that network perfectly. The interesting thing is the following: the input vector is just the one-hot encoding of the name, so if the name is Christopher this entry is 1 and all the others are 0, if the name is Andrew this entry is 0, that one is 1, and all the others are 0 again; this is a non-informative encoding of the names. But already in the first layer after this name encoding, what they saw is that the neural network learns so-called feature maps or embeddings that are much more informative. For example, the feature map for Andrew and Christine is actually the same, as they are married to each other, and the distinction through the relation only comes into the network at a later point; so at this point all these couples have the same feature map. The first dimension of the feature map tells us which family tree we are in, in this case the English one, and the second dimension tells us which generation we are in. So this non-informative input vector is projected onto an embedding which is much more informative, and then onto the next embedding, which also takes in the information of the relation, and there we have enough information to do the classification. This was actually the first network that somehow processed language; it was only simple words and relations, but in that sense it was the first little, simple language model.

Now these word embeddings are very central to neural networks, so we want to illustrate that with a simple example. We have a sentence, "the dog barks". If you do the one-hot encoding, we only have three words here, so for example "the" is the vector (1, 0, 0), "dog" is (0, 1, 0), and "barks" is (0, 0, 1). You see that all these vectors are orthogonal and equidistant, so there is no information in this encoding. Now imagine we do this for a real language with tens of thousands of words; then we have these very long vectors. — We cannot hear you anymore, sorry, I cannot hear you. — Okay, can you hear me better now? — Now it's better. — Okay, I have to go a little bit closer to the mic, sorry. — So instead we would rather have an encoding, an embedding, of these words in a vector space that is more informative: for example, one where "dog" and "barks", which belong together (only dogs bark), are close together, or where "cat" and "meows" are close together, or where "dog" and "cat", which are both pets with four legs and a tail, are close together but far apart from other objects like "moon". How can we train such an embedding? This is something neural networks are really good at; we can use a neural network, as was done in the famous word2vec publication by Mikolov and also Sutskever at Google. What they did is they tokenized the text.
They split it into pairs of words that appear together in short stretches of the text, for example "the" and "dog" appear together, "dog" and "barks" appear together, and so on, and they complemented that with negative examples of words that do not appear together in the text. They fed this into a one-layer neural network, actually a very simple network, basically just a projection matrix, and they had an objective function that forces words that appear together in the same context to be close together and words that don't appear together to be far apart. If you do that, it creates an embedding for you, and it turns out that this embedding is actually very interesting and informative. For example, all the country names seem to be clustered together; this was a high-dimensional embedding and this is just the PCA projection onto the first two dimensions, but in this projection all the country names cluster together, and names of capitals also cluster together. Interestingly, the vector between a country and its capital is more or less constant: for all the examples here we have about the same difference between country and capital. This just means there is information encoded in this embedding which was not explicitly present in the original input.

We can use this embedding for many things. We can look for words that are close to other words: for example, "my bike has a flat", we can look for words that are close to "bike" and "flat", and that would most likely be "tire". We can relate different sentences to each other: we can relate "the dog walks in the park" to "a cat creeps around the field"; both sentences contain a pet, a verb that describes a movement, and a place, so if you calculate how similar these sentences are, they will be more similar to each other than to some other sentence. If you embed several languages together, which you can do, then you can use this type of system, not word2vec itself but more complicated systems, to translate between languages; you can even embed text and images together and then use that to generate images from text or to produce text that describes images. So the concept of this embedding is very simple, it is learned, and it is used all over the place.

Okay, before the next chapter maybe we have a little break here, just to recap a little, and you can go through the slides again. The next part will be more technical, where I try to tell you how you choose the number of layers, how you choose the activation function, and so on. Before that we just have a five-minute break, and please ask if you have questions. — So everyone, five minutes break, and then you can turn on your microphones and speak up, ask questions, or add them to the chat as well, whatever works best for you. And I'm going to copy the link to the Google Doc, where we have the link to the slides as well, to everyone here in the chat, in case you didn't receive it. There's a question in the chat: could you please explain what backpropagation is? — I will come to that; hopefully it will become clear then, so hold your expectations, it's going to come. But if I can say it briefly: backpropagation is just an efficient way to calculate the gradients. You learn these networks by doing gradient descent, and that would be very time-consuming if we didn't have the backpropagation algorithm; how it works I will explain. — There's another question, again also very informative, thank you.
You mentioned that in the last layer we subtract an offset; what is that exactly, how do we find it? — Well, it's not only in the last layer that we subtract an offset: for each neuron in the neural network we subtract an offset, and that offset is also learned. We don't know the offsets, we don't know the weights, and the aim of the learning is exactly to learn the weights and the offsets. All the other things we provide: we tell the network how many layers it has, we define the architecture, we define the activation functions, normalizations, and so on; all these things are pre-defined. The weights and the offsets are exactly what the learning has to figure out.

There's another question as well: how long is the vector for encoding words, like 010, 0100, etc.; how long does the vector need to be? — Well, for the first encoding, if you just use that one-hot encoding, it's basically the number of words in your language. So if you have a language with 100,000 words, that input vector will have a length of 100,000. It will be very sparse, because for one word you will have 99,999 zeros and a single one; it's a very inefficient way to encode a word, but that's how we usually do it at the input. What the neural network then does is create this embedding, and the embedding has maybe a dimension of 1,000 or something like that, which is much more informative and groups words together according to their meaning and their use in sentences.

And there's another question as well: is the embedding always the first layer of the neural network? — Yeah, that's a good question. No, actually each layer creates an embedding. Maybe the first layer is not the most useful embedding for the problem you want to solve, but each layer creates an embedding of the previous layer; each layer in a certain sense summarizes what was there in the previous layer and organizes it in such a way that it is useful for the task you have to solve, for example for the classification or regression. This is what the neural networks do internally, and it is somehow magic how this works; it's not always easy to understand, it's not always easy to interpret what the neural networks do, but as a general principle I think that's more or less what's going on.

There's a question related to that, and then I'll come back to another one that came in: how do we know which layer or embedding is informative? — That's a good question as well; we have to figure it out. What you can do is take each layer, and you need some objective function that defines what is informative; then you can, for example, fit a simple logistic regression from that layer onto this informative measure and see which layer gives you the most information. But usually, the deeper you go in your network, the more informative the layers become, once the neural network is trained.

Another question now: you mentioned that the number of parameters in neural networks can be much higher than the number of training data points and that it can still generalize well, contrary to the common belief for traditional machine learning approaches. Is this a default characteristic of neural networks, or is it only possible with the use of special tricks during the training? — We will talk about this. The mechanism seems to be linked to the way neural networks train, without any special tricks.
It's just that the gradient descent algorithm used to train the neural network makes you usually find even the global minimum; I'll come to that. And the global minimum is a very broad region in these overparameterized networks, and you find a point in the global minimum that even generalizes well. People were not aware of that at the time; Geoffrey Hinton wasn't aware of it either, he just knew these things work, he didn't know why. But now people are figuring out more and more why these things actually work, and it seems to be due to that gradient descent learning. There's a very nice paper, the paper cited frequently here, from Bengio's group, that describes this a little bit, so I recommend reading it; it's linked in your slides as well, you will find it there.

And there's another question also: are there some rules of thumb for how large a labelled dataset needs to be for deep learning to be useful for a particular application? — Yeah, I was thinking about that as well. I would say ten thousand, but it's really just a rule of thumb; I wouldn't do deep learning with 200 training examples. Again, you can do deep learning in a sense with little data if you do transfer learning: if you have a model that was already trained on a lot of data, like ChatGPT, you can then adapt it to a particular task with little training data. But to train a deep neural network from scratch, I would say I wouldn't try with less than ten thousand examples; that's just my guess, I think you just need to find it out. If you have just a few data points, like a thousand or so, I would just use logistic regression or things like gradient boosting algorithms, which perform much better in that regime.

Two questions also: how can we determine if we have optimized the number of neurons or layers needed? — You have to experiment; I come to that as well. You start with a certain architecture, maybe you read the literature and see what other people are doing, you take an example from the literature, you might add a couple of layers, you might change the architecture a little bit, and you have to experiment and see what gives you the best result. So far we don't have a nice theory that tells us: for this type of data, use this type of network.

One last question then: what do you think of manually set features that are supposed to be learned by each layer? — Manually setting the weights, the features? — Yeah, the weights. — So, manually setting the weights: if you have 60 million weights, I don't think that's something you want to do, but that's what people did in the earlier days of neural networks; for example, the neocognitron was basically manually tuned, and that was a very cumbersome process. Now, to constrain the neural network to come up with specific features, that's maybe something you could try; I wouldn't quite know how to do that though, I haven't seen it. The power of the neural network is actually that you don't do this type of thing, that you really let the neural network figure out the best way to fit the data.

Another question also: do deep learning methods rely on what is present in the training data for them to work efficiently on test data? In general, if the animals cat and dog are in the training data, is the model restricted to those classes on the test data, or can an animal like a horse be equally well classified with deep learning without having been learned? — I guess, that's also a very good question, I would say no: what is not in the training data is usually not predictable in the test data.
Just maybe in very rare cases the networks can somehow generalize beyond that, but it's more the exception; I would say no. — Right, we don't have any questions anymore. You didn't get a break; some people had not a 5-minute break but a 10-minute break, 10 minutes for questions. — Right, alright.

Okay, now we come to a somewhat more technical part that discusses how many layers to use and so on. It's an introduction, so I have to be really brief and cannot explain all the details; if you really want to know more about this, you have to go to other resources or try it out yourself. So the first question is: how many layers do I want to have in my neural network? The answer is actually: I don't know. As I already said, you might read the literature, you might compare what other people are doing, but you have to experiment. You have to increase the number of layers and also the width of the layers until you are able to fit what you want to fit, because if the neural network is too small, you can't do that. Once you're sure that you can fit the data, you can even make the network somewhat larger, because the neural network can cope with that and it gives it more flexibility; this is sometimes called the "stretch pad" approach: you have something that is way too large, and during training it adapts itself to the actual data. So unfortunately I cannot tell you much more than that. There are, of course, relationships between the amount of training data you have and some sort of optimal number of layers; for example, this one was done for ResNet, which is also a winner of this competition, where you can change the number of layers it has, and you see that there is a clear relationship between the training data size and the number of parameters in this ResNet architecture. But this relationship will be different for different networks. Again, if you don't have a lot of experience in the field, the only way is to figure it out: to run different depths of the network, different widths of the network, and find out which network generalizes best.

So what do we do when we train a neural network, or generally a machine learning model? We have a so-called loss function, denoted as J here, and this loss function depends on the internal parameters of the neural network, which are the weights and the offsets. We would like to find those weights and offsets that give us the smallest loss. The loss measures how good our predictions are in comparison to the true labels, so how close our predictions are to the true labels: if we have perfect predictions, the loss is zero; if we have bad predictions, the loss will be high. So we want to minimize the loss, and once we have learned these optimal weights, we will use them to predict with the model on an independent test set. Typically we always use the empirical loss: we don't have a mathematical formula for how the true loss function looks, so we estimate it using our training data; we just sum the loss of each training item over all items in the training set. Typically we have these two losses here. We have the square loss, which is just the output of the neural network minus the desired or target output, squared; we don't care whether it is too high or too low, we just care how far away from the target it is. This is often used for regression, but we can also use it for classification. For classification we often use the cross-entropy loss: if we do classification, we calculate probabilities that a certain item belongs to a certain class.
We want this probability to be really high for the target class and low for the other classes, and the cross-entropy loss evaluates exactly that.

Now we come to what was already discussed a little before, this loss landscape, which is quite an interesting subject. This is a huge-dimensional space of all the weights, and the loss function is a function of these weights, so it corresponds to a function in many dimensions. We would like to know what the structure of this loss landscape is, and it turns out, according to this publication here, that for underparameterized systems, like we used to have before deep neural networks, the loss landscape often consists of multiple separate local minima. However, for these very overparameterized systems it turns out, surprisingly, that the loss landscape almost becomes simpler: instead of having multiple local minima, it has a few global minima which correspond to very broad valleys, and the dimension of these valleys is almost as large as the dimension of the whole space. So it is actually quite easy for gradient descent to find these global minima; the structure of these minima has been characterized mathematically by Cooper, and this could indicate why it is actually quite easy to find them. But even within a global minimum you can wander around, and not all points within the global minimum generalize equally well; so finding a point in the global minimum doesn't necessarily mean it also performs super well on the test data. It turns out, again in this publication by Zhang and in other publications as well, that stochastic gradient descent takes you to a point in this global minimum that generalizes quite well, especially if you combine it with early stopping: if you keep wandering around this flat valley, you might wander away from a good solution, but if you stop at the right time, you can obtain a good solution.

So how does this gradient descent work? We start at a certain point, we do not necessarily know at which point we have to start, and we just go downhill, because if we go downhill we ensure that we decrease the loss, until it doesn't go down anymore. We can do this with gradient descent: the gradient tells us how the function changes, and if we go along the negative gradient we ensure that we always make the function smaller, which is what we want. Then there is an important parameter, the step size: which step you take when going downhill. If the step is too small, it takes too much time; if it's too large, we might overshoot. And now, the person who asked about backpropagation, please pay attention: we use backpropagation to calculate this gradient efficiently, that's basically what it is, a computational trick to have an efficient computation of the gradient.

Let's try to explain that with a very, very simple network with two input nodes and just one output node. This thing here is what we call a computation graph: if you want to calculate the output, this is what the computer has to do to get from the input to the output. In the first layer we have the two inputs and the two weights, and in the next layer of the computation graph, this doesn't correspond to a layer of the neural network, it's a finer structure, we multiply the input x1 with the weight w1 and the input x2 with the weight w2, as we do in the perceptron.
Then, again as we do in the perceptron, we add all these multiplications up and we add the offset; this is done in this node of the computation graph, and importantly, we always store the results of these computations in the nodes. Now we have done the affine transformation of the perceptron, and we pass it to our gating or activation function, which is usually a function that is already implemented, so we don't need to learn it; the computer knows how to calculate the function and its derivative. That gives us the output of the neural network, and then we compare that predicted output to the target output y and calculate our loss.

So far we have just done a prediction with the neural network. Now we want to calculate the gradient, and it turns out that by using the chain rule, which you probably all know from high school or from your studies, we can go back through the graph and calculate the gradient step by step. How this works we show now. We go to the first node and calculate the derivative of J with respect to f; as you probably remember, for the square loss this is just 2 times (f minus y), and we know what f is because we stored the result, so we can calculate it. Then we go to the next node and calculate the change of J with respect to the result in that node, which is z: the change of J with respect to z is the change of J with respect to f times the change of f with respect to z, according to the chain rule. The first term is already available from the previous node, and the second term is just the derivative of the activation function. In the same node we also need the derivative of J with respect to the offset w0, and we calculate it in the same way; it gives us exactly the same result, because z is just a sum here, so the derivative of z with respect to w0 is 1. So we already have one of our gradients: we already know how J changes with the offset w0. Now we need to know how J changes with respect to the weights w1 and w2. For that we go further through the computation graph: how J changes with respect to z1, again applying the chain rule, and since this node is just a simple addition, it doesn't change anything. Then we come to the last nodes and calculate how J changes with respect to w1 and w2, again applying the chain rule, and that gives us these final results.

Okay, and this is it. Going through the computation graph once forward to calculate the prediction, and then going all the way back, gave us all the gradients. We only have three parameters in this simple example, but by going back through this computation graph we get the derivatives of J with respect to all the parameters in our network, just by passing two times through the computation graph. You have to compare that to calculating the derivatives numerically: there you have to evaluate two values that are close together, so you already have to pass through the computation graph twice for each weight in your network, and if you have millions of weights, you have to do that millions of times. That would completely slow down the learning and make it impossible, but with backpropagation we can do it with just one forward and one backward pass through the computation graph. So this is a highly efficient way to calculate the gradient, which then allows us to train these vast numbers of parameters.
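Here is a minimal NumPy sketch of exactly this forward and backward pass for the tiny two-input network, with made-up numbers for the input, target, and weights; the loss is the square loss and the activation is the sigmoid, as in the example above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# tiny network from the lecture: two inputs, one output,
# f = sigmoid(w1*x1 + w2*x2 + w0), loss J = (f - y)**2
x1, x2, y = 0.5, -1.0, 1.0           # one training example (made-up numbers)
w1, w2, w0 = 0.3, -0.2, 0.1          # current weights and offset

# forward pass: walk through the computation graph and store the intermediate results
z = w1 * x1 + w2 * x2 + w0           # affine transformation
f = sigmoid(z)                       # activation = prediction
J = (f - y) ** 2                     # loss

# backward pass: apply the chain rule node by node, reusing the stored values
dJ_df = 2 * (f - y)                  # derivative of the loss w.r.t. the prediction
df_dz = f * (1 - f)                  # derivative of the sigmoid
dJ_dz = dJ_df * df_dz
dJ_dw0 = dJ_dz * 1.0                 # z is a sum, so dz/dw0 = 1
dJ_dw1 = dJ_dz * x1                  # dz/dw1 = x1
dJ_dw2 = dJ_dz * x2                  # dz/dw2 = x2

# one gradient-descent step with learning rate 0.1
lr = 0.1
w1 -= lr * dJ_dw1
w2 -= lr * dJ_dw2
w0 -= lr * dJ_dw0
print(J, dJ_dw0, dJ_dw1, dJ_dw2)
```

The same stored intermediate values (z and f) are reused in the backward pass, which is exactly what makes backpropagation cheap compared to numerical differentiation.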
Now, the training was not easy at the beginning; people had quite a lot of problems with it, because if you look at these derivatives, you see that they consist of products, and the more parameters and layers you have, the longer these products get. You all know that if you have a product of numbers that are smaller than one, the longer the product is, the smaller it becomes, until the numbers become so small that you get numerical underflow and the computer cannot represent them anymore. If you have a product of numbers that are larger than one, and the product is very long, then your numbers become larger and larger and might explode. These two problems, vanishing and exploding gradients, were real problems people had when training neural networks, especially at the beginning and especially when the networks are very large. But there are certain things you can do about this, and I'd like to present some of these techniques now.

The first thing you can choose is the activation function. The sigmoid activation function is bounded between zero and one, and the tanh activation function is bounded between minus one and one. These are usually used in the last layer of the neural network if you want to force the output to be in a certain range, for example the sigmoid if it should be between zero and one. However, they have the disadvantage that for very large values the gradient becomes zero, and gradient descent might get stuck at that point. That's why people came up with the rectified linear unit, or ReLU, activation function, which is very simple: it is zero for all negative values and the identity for all positive values. It has a kink at zero, but that doesn't seem to really bother gradient descent, because the kink is just at one specific point. The advantage of this function is that it doesn't saturate: the gradient in the positive part is never zero, so even for very high values you always have a finite gradient. It's very fast to calculate, and it also has a nice interpretation: it actually leads to a piecewise linear spline interpolation of your training data, and spline theory is very well developed, so one can learn a lot by comparing these neural networks with spline theory; there are some papers on that if you're interested, it's quite fascinating. So the ReLU activation function is often used in the hidden layers, or in the final layer if you just want a positive output, whereas the saturating activation functions are usually only used in the last layer, to avoid vanishing gradients. There are other activation functions as well, which have different advantages and disadvantages.

What also turned out to be very important is the way you initialize your weights, in other words where you start your gradient descent. It was shown that if the weights are too large you run into problems, because in the perceptron you always form these sums, and if you have thousands of inputs the sums can become very large; again, you might end up in places of your activation function where the gradient vanishes, or the whole thing might become unstable. So you'd rather start with weights that are adapted to the size of your network, and there are several techniques available; I give the references here. Usually this initialization also depends a bit on the activation function, so there are slightly different initializations for ReLU, for SELU, or for the logistic (sigmoid) function.
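A small sketch of the idea that the initial weights are scaled to the layer size; He and Glorot initialization are the standard schemes this kind of scaling refers to, and the layer sizes below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def he_init(n_in, n_out):
    # He initialization, commonly used with ReLU: standard deviation sqrt(2 / fan_in)
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

def glorot_init(n_in, n_out):
    # Glorot/Xavier initialization, often used with sigmoid or tanh
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

# the point: the spread of the initial weights shrinks as the layer gets wider,
# so the pre-activation sums do not blow up or saturate the activation function
x = rng.normal(size=(1000, 512))      # 1000 samples entering a 512-wide layer
W = he_init(512, 512)
pre_activation = x @ W
print(pre_activation.std())           # stays of order 1 instead of growing with the width
```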
Then the gradient descent itself is also important: how do we do the gradient descent, which algorithm do we use? If you use just the standard gradient descent and the gradient becomes very small, for example at a flat part or a saddle point, you get stuck: here you make big jumps, but here the gradient becomes smaller and smaller, and you get stuck in a place where you are not yet at the global minimum. We don't want that, so what can you do? It's like in skiing: if you see a flat part ahead, you speed up, and that speed will carry you over the flat part. You can do the same with gradient descent; it's called momentum gradient descent. You give the gradient a certain push, and that push carries you over the flat part and on towards the global minimum. But if you have too much push, you might overshoot, so it might take quite some time until you converge to the global minimum. There are again improvements on that, for example the Nesterov gradient descent algorithm. Again like in skiing, where you always look a little bit ahead and see, okay, it's getting too steep now, maybe brake a little: the Nesterov algorithm calculates the gradient not at the place where it actually is, but at the projected place where it is going to be, and this helps and accelerates convergence. There are further gradient descent algorithms which I don't have time to discuss here. The important thing is that if you use frameworks like TensorFlow or PyTorch, this is just a parameter in your learning step: you just give the keyword and choose the gradient descent algorithm you want to use. It's also something you can experiment with.

Then the learning rate is equally important. If the learning rate is way too high, you overshoot, you jump over the global minimum and may not be able to get back there again, so you end up in a non-optimal solution. If the learning rate is still a little bit too high, we might find the optimal region, but we will jump around quite a bit and have difficulties converging; you see that as noisy behavior of the training loss. If you have about the right learning rate, the convergence is smoother and we will converge to the global optimum eventually. If the learning rate is much too small, we just don't converge: we get stuck, or it simply takes too long, and when we stop the training we are not yet at the optimal point. So again, the learning rate is an important parameter which you also have to experiment with.

One very powerful technique is called early stopping, which is not only used in neural networks; it's a general machine learning technique. When we train our model, we train it on the training data, and as I said, since these are universal function approximators, we can make the loss on the training data almost zero. But the important thing is how the model performs on the test data: we know that we can fit any training data, but we want to be good on the test data. So what we usually do when we train neural networks is that we give them a training data set and an independent validation data set and we look at the loss on both; TensorFlow or PyTorch export both values.
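Before continuing with early stopping, here is a small NumPy sketch of the momentum update just described, with the Nesterov look-ahead shown as a comment; the toy loss function and all numbers are made up for illustration.

```python
def grad(w):
    # gradient of a toy loss J(w) = w**4 - 2*w**2, which has two minima at w = +-1
    return 4 * w**3 - 4 * w

w, velocity = 2.0, 0.0
lr, beta = 0.01, 0.9          # learning rate and momentum coefficient

for _ in range(100):
    # plain momentum: keep some of the previous "speed" and add the new gradient step
    velocity = beta * velocity - lr * grad(w)
    w = w + velocity

    # Nesterov variant (instead of the two lines above): evaluate the gradient
    # at the look-ahead point, i.e. where the momentum is about to carry us
    # velocity = beta * velocity - lr * grad(w + beta * velocity)
    # w = w + velocity

print(w)   # ends up near one of the minima at w = +1 or w = -1
```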
Coming back to those two loss curves: you can look at them, and you see that maybe after a certain amount of training you are already somewhere in that global minimum, but then you start to overfit, you start to wander off the point in the global minimum that gives you good generalization, and there you need to stop. How do you know where this point is? It is the point where your validation loss starts increasing again while the training loss is still going down. You can tell the framework: if the validation loss doesn't go down for so many steps, just take the best parameter setting seen so far. This is a very powerful technique that avoids overfitting and gives you better generalization.

The other important technique is that we don't do gradient descent on all our training data at once, but we do what is called stochastic gradient descent. It works in the following way: we split our training data randomly into small batches, and the batch size is usually 32 or 64, so quite small. The idea behind that is twofold. First, we can easily parallelize it, we can run each batch independently. The second idea is that if one batch gets stuck somewhere during gradient descent or has some sort of difficulty, another batch will not have this difficulty; so we always have some batches that are able to overcome such local obstacles, and we get better convergence that way. This is also a powerful technique, and I guess it is the standard usage nowadays: you train on batches and not on the whole training data at once. You can also normalize each batch, so-called batch normalization: you make sure that the values in the layers of the neural network remain zero-centered for each batch, for example by calculating z-scores. There is a certain technique for how you apply that at test time, because during testing you do not have the batch information anymore, but you can also apply it to your test data. This is also often used, and it often helps to improve the generalization of your neural networks.

Another important technique is dropout. Like in stochastic gradient descent, where we randomly batch our training data, in the dropout approach we randomly delete nodes in the neural network during training. You might first think this is a crazy thing to do, why would we do that? But since the network is overparameterized, we have too many nodes somehow, and what can sometimes happen is that these nodes co-adapt: certain nodes produce a bad intermediate solution and other nodes further down actually correct that bad solution. We don't really need that, and you can avoid it to some extent if you randomly drop nodes out; you avoid this co-adaptation, and the network is forced to provide more robust solutions during training.
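As a hypothetical sketch of how several of these pieces, minibatch SGD with momentum, batch normalization, dropout, and early stopping, fit together in Keras; all layer sizes, shapes, and variable names are placeholders, not a network from the lecture.

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(100,)),
    layers.Dense(256, activation="relu"),
    layers.BatchNormalization(),   # keep the layer activations zero-centered per batch
    layers.Dropout(0.5),           # randomly drop half of the nodes during training
    layers.Dense(256, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.5),
    layers.Dense(10, activation="softmax"),
])

model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# early stopping: watch the validation loss and restore the best weights seen so far
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=10, restore_best_weights=True
)

# model.fit(x_train, y_train,
#           validation_data=(x_val, y_val),
#           batch_size=32,            # stochastic gradient descent on small batches
#           epochs=200,
#           callbacks=[early_stop])
```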
The list of regularization techniques is not finished yet; these are all very powerful. For example, skip connections, which are used in ResNet, this very deep network that also won the ImageNet competition, so it seems to be a powerful technique. What happens in very deep networks, or in recurrent networks, is that the deeper you go, the more the network starts to forget what the actual input was. That can sometimes be a problem, and what you can do is the so-called skip connection: you reintroduce the original input at certain points in your network, basically to refresh the network at that point, and it has been shown that this leads to smoother loss functions and also to better generalization results.

Then, also very powerful, you can use transfer learning. This is especially often done in language processing: you take a model that was trained on a huge amount of data, something you couldn't do yourself because you don't have the computational resources for it, and you take that model and its parameters and adapt it to the data you have in your lab. That can be a very powerful approach; essentially, it defines the starting point of your gradient descent, and it gives you a starting point that is already close to a very good solution. Also very powerful is data augmentation: for images, for example, instead of just giving the network that one image of the dog, you take the dog and shift it a bit in the frame, or rotate it a little, or add a little bit of noise. Again, you avoid that the network just learns that specific image of a dog; it has to generalize a little more, and that will lead to better performance on the test set. — Marcus, there is one question. — Ah, yeah: the question was whether node and neuron are the same thing, whether the terms are the same; yes, they are used interchangeably here. Thank you.

So, we are almost done now with the regularization. There are also the standard machine learning regularizations like L1 or L2 regularization, which just keep the weights small. At the very beginning we saw that small weights correspond to smooth transitions, so often to better generalization, and by forcing the weights to be small we can often improve the results a little bit. It usually doesn't have a huge effect; it's not crucial for neural networks. You can also set hard constraints on the weights, or on the gradients. But, and this is the big point here, we don't really need all of this for the neural network to work; we need it to push the performance of a neural network, maybe to improve it by 10% or something like that. The neural network also works without these regularization techniques, as again nicely shown in the paper by Zhang et al.; gradient descent combined with early stopping is often sufficient to find a good solution, but you can improve the solution with additional regularization techniques. Which regularization works best for your problem is again something you need to either learn from other work that has been done before or just experiment with.
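To make the skip-connection idea from above concrete, here is a minimal, hypothetical sketch of a residual block in Keras; the layer widths are illustrative and this is not the actual ResNet architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, width):
    # two dense layers, then the skip connection re-adds the block's input
    h = layers.Dense(width, activation="relu")(x)
    h = layers.Dense(width)(h)
    h = layers.Add()([h, x])          # "refresh" the signal with the earlier activations
    return layers.Activation("relu")(h)

inputs = tf.keras.Input(shape=(64,))
x = layers.Dense(64, activation="relu")(inputs)
x = residual_block(x, 64)
x = residual_block(x, 64)
outputs = layers.Dense(10, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)
model.summary()
```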
Then another important question, often discussed in papers, is whether these deep neural networks just interpolate the training data. They do that very well, better than, for example, a nearest-neighbor algorithm, which is also perfectly able to fit any data you want but doesn't generalize very well. Neural networks can fit any data you want, and yet they generalize; that is already a great achievement. The question is how far this generalization goes. Does it really reflect the data-generating process? Does the neural net really have some idea that these are images of 3D objects that are alive, or is it just a fit in pixel space? It is probably rather a fit in pixel space, and we will see some examples of that, but sometimes, especially with these language models, it is really astonishing what they can do, and you almost have the impression that there is some intelligence behind it. People who look at this closely tend to be rather skeptical, but it's probably an open question. One of the papers that looks at this generalization power of neural networks used quite a simple test. They looked at images, you see here this image of a bird, and they compared the networks to the performance humans would have in classifying these images. They had many people doing this classification, with only a very short time per image, about one second, and compared that to the performance of deep neural networks. The neural networks actually outperformed the humans if there was no noise in the data, that is, if the test data was very similar to the training data. However, if you start to add noise to your test data, the performance of all the neural networks here drops quickly, whereas the performance of the human annotators drops as well, because it becomes more difficult to classify noisy images, but much less. So the neural networks don't yet have the understanding of images that humans have. Also, what the neural networks tend to do is that if they don't know what something is, they choose a class a bit by chance, and they tend to always choose the same class: everything they don't quite know how to classify gets put in the class airplane or the class dog, or something like that. Humans don't do that; humans still try to distinguish what is in the image. So you can say the neural networks were great, actually better than humans, when the test data was close to the training data, but as soon as you start changing the characteristics of the data, the performance drops. Also quite astonishing is that what is similar for a convolutional neural network is not similar for humans. In this paper they looked at inputs, input images that create the same activation pattern at a certain layer. This is the original image here; then you look for images, these are just examples, that create the same activation pattern in layer one, and you see they are very close to the original image. But the further down you go in the neural network, the more distinct these patterns become. For example, in the final layer, this input produces exactly the same result as this image here, which to us looks like noise, but to the neural network it looks like a dog. So this is kind of astonishing.
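If you want to probe this kind of robustness yourself, a minimal sketch looks like the following: evaluate a trained classifier on test images corrupted with increasing amounts of Gaussian noise and watch the accuracy drop. `model` and `test_loader` are assumed to exist, and the noise levels are arbitrary examples.

```python
import torch

# Minimal sketch: probe robustness by adding increasing Gaussian noise to the
# test images and measuring accuracy. `model` and `test_loader` are assumed.

@torch.no_grad()
def accuracy_under_noise(model, test_loader, sigma):
    model.eval()
    correct = total = 0
    for x, y in test_loader:
        noisy = (x + sigma * torch.randn_like(x)).clamp(0.0, 1.0)  # corrupt inputs
        pred = model(noisy).argmax(dim=1)
        correct += (pred == y).sum().item()
        total += y.numel()
    return correct / total

# for sigma in [0.0, 0.05, 0.1, 0.2]:
#     print(sigma, accuracy_under_noise(model, test_loader, sigma))
```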
The other astonishing fact is that if you take the same ResNet and train it a little differently, or parameterize it a little differently, it creates a similar image, but this second network will not classify this image as a dog. So it's not transferable from network to network; it's really network-specific. And this is maybe, to a certain extent, a little bit of overfitting: the neural network really just interprets pixels and doesn't have a more global context. There are plenty of these examples, funny ones as you will see. These are the adversarial attacks, where people developed algorithms to, in this case, change a single pixel in the image in a way that changes the classification of the image. For about 70% of the images in this data set it was possible to find such a pixel. So you just change the value of one pixel and the classification flips: the horse becomes a frog; the deer, already very difficult, becomes a dog, which is maybe understandable; the ship becomes an airplane. You again see that this network somehow seems to like airplanes and dogs: a lot of the time, when you just change a pixel, it puts the image into a kind of default class, let's say. And you can push that a little further. Instead of changing one pixel, you can use gradient descent to look for patterns that change the classification and then add these patterns to your image. The image looks almost the same; you don't see a difference between this truck and this truck, and this truck is just the original truck plus that pattern here, and the same for that bird or that pyramid. The aim was to have all these objects classified as one specific class, and you will not guess what that class is, because only the neural network sees it: that class is ostrich, in all cases. So all these trucks and birds were classified as ostriches, because you forced the neural network to do so. So a neural network is, to a certain extent, a little bit foolish, and we can fool it quite easily with these adversarial attacks. Now, we can do something about these adversarial attacks. One thing you can do is note that these adversarial examples are actually quite useful: if you add them to your training data, you can force the neural network to classify them correctly and avoid a little bit of this overfitting. This has been done in several publications, and it has been shown that it is really a good idea. Also, the adversarial examples themselves are not robust: if you change them a little bit again, the classification changes again. So if we had a means to calculate the confidence of our outputs, we might be able to detect these adversarial examples, or, in general, examples where the network doesn't actually know how to do the classification but just more or less randomly assigns one of the classes. This work has been done. These are the Bayesian neural networks, which use dropout, data sampling, or other techniques: basically you sample outputs, you get a distribution of outputs, and you can evaluate it; where the variance of that distribution is small you can have more confidence, and where the variance is large you have less confidence in the classification. Another very interesting approach, which I will present briefly here, is evidential learning, which was invented in 2018 for classification and in 2020 for regression.
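To make the dropout-sampling idea concrete, here is a minimal Monte-Carlo-dropout sketch (not taken from any of the cited papers): keep the dropout layers active at prediction time, run the same input through the network several times, and use the spread of the predictions as a confidence signal. `model` is assumed to be a classifier that contains dropout layers.

```python
import torch

# Minimal sketch of Monte-Carlo dropout: sample several stochastic forward passes
# and use the spread of the predicted probabilities as an uncertainty signal.

def enable_dropout(model):
    model.eval()
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.train()                       # only the dropout layers stay stochastic

def mc_dropout_predict(model, x, n_samples=30):
    enable_dropout(model)
    with torch.no_grad():
        probs = torch.stack([
            torch.softmax(model(x), dim=-1) for _ in range(n_samples)
        ])                                  # shape: (n_samples, batch, n_classes)
    mean = probs.mean(dim=0)                # averaged prediction
    std = probs.std(dim=0)                  # large std -> low confidence
    return mean, std
```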
And this is a very interesting approach. It does basically the same thing as Bayesian neural networks, but it does it much faster. The first thing it does is that it has an "I don't know" option: the network knows how to say "I don't know" instead of just assigning an input it doesn't know to a more or less random class. This is a powerful tool, and I recommend that you read the paper. What they did here, these are handwritten digits, is take a one and rotate it, and, coming back to the question we had before, if the rotated one is not in the training data, the network doesn't really know what to do with it, so it assigns it more or less randomly to the number two in this region here and to the number five in this region. The evidential deep learning, on the other hand, basically has this "I don't know" class, and it says: okay, here I know it's a one, and here again I know it's a one, but in between I don't know what it is, so I just say I don't know, which is actually a very good idea. How do they do that technically? Instead of predicting class probabilities, they predict the parameters of a Dirichlet distribution; maybe you know these from your statistics or machine learning courses, they are prior distributions over probabilities. And here is the simple Iris data set with three classes, so we have three outputs. A perfect prediction, probability one for one class and zero for the others, sits in a corner of this triangle, and complete ignorance, where you really don't know what it is, corresponds to a uniform distribution over the triangle. So for objects where we have a lot of evidence, that is why it's called evidential learning, we get a probability distribution that is very concentrated in one of the corners of this triangle; for objects where we only know it's one of two classes, the distribution will be centered between these two classes, but again towards an edge; and for objects where we really don't know what it is, we get a distribution that is more or less flat and centered in the middle. And like this, we can detect when the distribution is flat, and in that case we can say it is unknown and put it into the unknown class. Technically, it works like this: instead of learning the output probabilities, we learn the parameters of a Dirichlet distribution, these alphas here, and we now have two terms in the loss: the basically traditional term, but also a regularization term, and that regularization term makes sure that if we don't know what something is, we do not randomly assign it to a class but assign it to this unknown state, which is characterized by a uniform Dirichlet distribution. And the interesting thing is that this can all be packed into the loss function, so we can learn in the same way with gradient descent, we just need to change the loss, and we can include that unknown class, and we also get an estimate of the precision of our classification, which can be a very useful result as well. And this is what makes evidential learning much faster than the Bayesian approach. Okay, that was it for this part. I think it's time for another break or another question-and-answer session, and then I'll go to the last part, where I will present some applications of deep learning in biology and omics. Maybe a 10-minute break, or a bit longer, until 11. This time we do a real break, and then we take the questions 10 minutes from now. Yes, so what do you propose?
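For readers who want to see what such a loss can look like, here is a simplified, hypothetical sketch in the spirit of the approach described above; it is not the authors' code or their exact loss. The network outputs one non-negative "evidence" value per class, the Dirichlet parameters are alpha = evidence + 1, and alpha = (1, ..., 1) means "I don't know". The first term is the expected cross-entropy under the predicted Dirichlet, the second pushes evidence for wrong classes towards the uniform Dirichlet.

```python
import torch
import torch.nn.functional as F

# Simplified, illustrative evidential classification loss (hypothetical sketch).

def evidential_loss(logits, targets, num_classes, kl_weight=0.1):
    evidence = F.softplus(logits)                     # non-negative evidence
    alpha = evidence + 1.0                            # Dirichlet parameters
    strength = alpha.sum(dim=-1, keepdim=True)        # total evidence S
    y = F.one_hot(targets, num_classes).float()

    # "Traditional" term: expected cross-entropy under the predicted Dirichlet.
    nll = (y * (torch.digamma(strength) - torch.digamma(alpha))).sum(dim=-1)

    # Regularization term: push the evidence for the wrong classes towards the
    # uniform Dirichlet, i.e. towards "I don't know".
    alpha_wrong = y + (1.0 - y) * alpha               # keep only misleading evidence
    s_wrong = alpha_wrong.sum(dim=-1, keepdim=True)
    kl = (torch.lgamma(s_wrong.squeeze(-1))
          - torch.lgamma(alpha_wrong).sum(dim=-1)
          - torch.lgamma(torch.tensor(float(num_classes)))
          + ((alpha_wrong - 1.0)
             * (torch.digamma(alpha_wrong) - torch.digamma(s_wrong))).sum(dim=-1))

    return (nll + kl_weight * kl).mean()
```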
Sorry, I didn't mean to interrupt. I propose that we do a real break for 10 minutes, so we turn off the camera for 10 minutes, and when people come back we do a Q&A session, and then you start from there. Okay. All right, 10 minutes break everyone. See you then. You have one more minute of break. So far we haven't had many questions. Yeah, let's see if people are coming back, and if they have any questions, let's use this moment. We also have someone online who is answering the questions in the chat; I didn't forward them all to you because he answered them. So everyone, if you're back from your break, you can put your questions in the chat or turn on your microphone and ask; that's always good. There's one question: I have a question about missing data. Do you have any recommendations if we are dealing with hard data such as clinical data that cannot be imputed? Which neural networks are the best approaches? It really depends. If it's clinical data in the sense of tabular data, where you have a mix of categorical and numerical variables, I'm not even sure that neural networks are that good; I have a slide on that later. My own experience is that neural networks for this type of data, especially if the data sets are very small, don't seem to give an advantage compared to, for example, XGBoost or logistic regression. With regard to imputation, I wouldn't know whether neural networks are really capable of imputing such data; it really depends on the data, and maybe other methods are better suited to this type of data. There's another question as well: thank you very much for the great talk, I may have missed this, but my question is, are the last-layer embeddings usually the most informative ones? Yes, I would say so. There's another one: can deep learning models handle imbalanced data well? I would say that, like any other machine learning approach, they can handle imbalanced data. It's better if the data is balanced, obviously, but you can handle imbalanced data if the imbalance is not too extreme. I think you can also provide the information to the neural network that the data is imbalanced, but I would have to look up exactly how. Neural networks can cope with imbalanced data, but there's a limit; they cannot do everything. Yeah, sure. Usually people also use data augmentation, methods that generate synthetic data from the distribution we have in the data itself, for example with something like a GAN. Yes, and what you usually do is adapt your loss function so that the minority class gets more weight in the loss; you blow up the importance of the minority class a bit and so compensate for the smaller number of training examples. I know that this is how it works for logistic regression or XGBoost, for example; for neural networks specifically I've forgotten the details, but I suspect there is something similar. There's another question as well: what is reinforcement deep learning, just a type of deep learning architecture, or is it something different entirely? Well, reinforcement learning is a method in machine learning.
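As a concrete example of the loss-weighting idea for imbalanced data, here is a minimal sketch; the class counts are purely illustrative.

```python
import torch
import torch.nn as nn

# Minimal sketch: weight the loss by the inverse class frequency so that the
# minority class contributes more per example. The counts here are illustrative.

class_counts = torch.tensor([900.0, 100.0])           # e.g. 900 negatives, 100 positives
weights = class_counts.sum() / (len(class_counts) * class_counts)
loss_fn = nn.CrossEntropyLoss(weight=weights)          # used exactly like the unweighted loss

# logits = model(x); loss = loss_fn(logits, y)
```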
It doesn't necessarily have to be linked to neural networks, but recently neural networks have been used in reinforcement learning. What you do there is predict actions that give you a reward in the future, for example in a game: which chess piece do you want to move where? There, neural networks are used very efficiently to predict which action is the best one to take at the moment. But they are really two different things. Right, I don't think we have any other questions for the moment. I think we are good. Okay, great. So let's go to the last part, which is a somewhat subjective selection of papers that use deep learning for biological or medical problems. There are several ways you can do this. As was already discussed, you can train your own network from scratch, but for that you need a certain number of data items in each class. I'm not really able to give you a number, but if I had to give one, I would say maybe something like 10,000 per class. But that is maybe not what you want to do; maybe what you want to do is more what in language modeling is called few-shot learning, also called transfer learning or fine-tuning, where you take a pre-trained model and fine-tune it to your specific problem. That means you use the weights of the pre-trained model as starting values and then refine those weights by training on the smaller data set you have at hand in your lab. Then there is zero-shot learning, which is when you, for example, take a tool like AlphaFold, a tool that is ready to use, or an embedding that is ready to use, without doing any training yourself. As always in machine learning, the data sets need to be well annotated, otherwise the training and the generalization will suffer. Deep learning usually works well for data that has a lot of internal correlation and hierarchical structure, like sequences, images or text; that is where deep learning really does well. For tabular data, my experience is, and I tested this a little, that I didn't get particularly good results with deep learning. Then, especially if you want to train a model yourself, you need the computational resources and also the time to do so. There are no general rules for how many layers you need or which activation function is best; you really have to gain some experience and test everything, or at least the most important parameters, before you do the actual thing. And of course, if the problem is simple, don't use deep learning. There are many well-established, robust statistical methods already available, and often they are actually enough to solve the problem if the problem is fairly simple. If you really want to go the extra mile, you might want to use deep learning, but then you also need the time and the data to do so. Okay, so here are some examples which I found interesting. This is a paper that is not yet peer reviewed, it's on medRxiv, and it deals with heart disease. A lot of people with heart problems go to a clinic and are tested there, but the diagnosis at admission is that they don't actually have a problem. Quite often this is a misdiagnosis, and they might suffer a heart attack later on, which you could avoid by giving them the proper treatment.
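To make the fine-tuning workflow concrete, here is a minimal sketch with a pre-trained torchvision ResNet-18; the backbone is frozen and only a new output head is trained. The number of classes and the learning rate are illustrative placeholders.

```python
import torch
import torch.nn as nn
from torchvision import models

# Minimal fine-tuning sketch: start from a pre-trained ResNet-18, freeze the
# backbone, and train only a new output layer on your own (small) data set.

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # pre-trained weights
for p in model.parameters():
    p.requires_grad = False                       # keep the pre-trained features fixed

model.fc = nn.Linear(model.fc.in_features, 5)     # new head for your own classes (5 is a placeholder)

# Only the new head is trained; the pre-trained weights act as a good starting point.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```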
So the idea here was that you take a language model, a long-document model in this case, which was pre-trained on a lot of data and also works well for long texts of up to about 4,000 words. That language model produces an embedding, and then you map all your documents into it. They do few-shot learning here: they have these discharge reports, they retrain the model on these reports, and then they classify whether the patients have a heart condition or not. And according to that paper, they actually outperformed the classification that was done by the doctors: they could better identify patients that have an actual heart problem, who would then get the proper medical treatment. What they also did well, I think, is that they used the LIME package, or you could also use the SHAP package or other packages, that help you to interpret the result of the neural net. Neural nets are often black boxes that can be difficult to interpret, and these toolboxes are general machine learning toolboxes that work for all types of machine learning models. They interpret the results by, first and foremost, giving you the words in these reports that are the most important for making the distinction between a heart problem or not, and they also highlight these words in the text, which allows you to quickly look at a text and maybe double-check the diagnosis made by the language model. So I found this a very useful and very interesting publication in the medical field. Language models can not only be used to analyze or predict words, language or sentences; they can also be used to do the same thing with sequences, protein sequences for example. There you train the language model not by predicting a missing word but by predicting the next amino acid in the sequence. And like language models, they also produce an embedding. This one is based on an LSTM, a multiplicative LSTM here; my colleague will discuss in his part of the lecture how these LSTMs work. The important thing is that the LSTM scans over the whole sequence and produces these embeddings. These embeddings are then taken together to produce an embedding of the whole protein, which is a nice thing: you now have proteins of different lengths that you can map to the same embedding space. And it turns out that, just as word embeddings reveal certain structure among words, the embedding space of protein sequences also contains a lot of information about the proteins. For example, you can, as a kind of refinement step, take this embedding and train a classifier on it to predict the secondary structure of these proteins, and you see that works really well; the secondary structure is somehow already encoded in the embedding. You can predict the stability of the proteins, you can predict diverse functions of the proteins, and so on. So these embeddings of the sequence space are a very powerful approach. In this publication here, they used the embedding produced by the ESM language model, a sequence model developed by Meta; it is an open source model that you can download. They used the embedding produced by this model to calculate probabilities for amino acid substitutions that lead to viable proteins, that is, substitutions that don't destroy the protein by making it misfold, but lead to a viable protein.
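The "refinement step" on top of embeddings can be as simple as the following sketch: given per-protein embeddings that a language model has already produced, train a simple classifier on top to predict some property. The arrays here are random placeholders standing in for real embeddings and labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Minimal sketch: a simple classifier trained on precomputed protein embeddings.
# The embeddings and labels below are random placeholders, only the shape matters.

embeddings = np.random.randn(1000, 1280)        # placeholder per-protein embeddings
labels = np.random.randint(0, 3, size=1000)     # placeholder property labels (3 classes)

X_train, X_test, y_train, y_test = train_test_split(embeddings, labels, test_size=0.2)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```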
Marcus, there is a question for you in the chat, about whether LSTMs are a time-series neural network, and whether we can use LSTMs for feature classification. Feature classification? Features in the sense of omics, or? That's a good question; maybe Simon can put a clarification in the chat. Yes, omics, yes. I would say so, yes, I would think so. I mean, they were developed to classify time series. I would just do a Google search for LSTMs; maybe there are methods that are better adapted. I'm not sure, I'm not an expert, but I would say yes. Thank you. So they use these probabilities because an important problem, if you are working in immunology, is the optimization of antibodies. You have an antibody that already works somehow, it binds to the protein you want it to bind to, but maybe not yet efficiently enough. So you want to change the structure of the antibody a little in order to improve that binding. And for that, you go amino acid by amino acid, you change one amino acid at a time, but you need to know which amino acids have the highest potential to successfully improve binding. They used this language model to predict these probabilities, and they showed that the model predicts useful amino acid changes not only in the variable region of the antibodies, but also in the stable regions, which was quite astonishing, because usually you only adapt the binding region and not the stable region. What they found, and what was predicted by this model, is that changing these amino acids can also improve binding. So again, a nice application of these language models, in a transfer-learning way, to a problem in immunology. Now, maybe you have also read about foundation models. These are usually models that are trained on huge amounts of data, and the idea is that the embeddings they produce can then be used, in a refinement process or directly, to give you information about certain problems. The models here were trained on single-cell data; the idea is to provide foundation models for single-cell data. What they do is take huge amounts of single-cell data, maybe 10 million or up to 30 million cells, and mask some of the expression values; what you measure there is RNA, the expression levels of genes. They mask some of the genes and train the model to predict the intensity of these masked genes. It is similar to a language model; it is also based on the attention and transformer architecture, and it also produces an embedding. The researchers were then wondering how useful the embeddings produced by these models actually are, and it turned out that for single-cell analysis they didn't provide a real advantage: they didn't perform better than very simple methods that already existed. There are two papers here. This one is from Microsoft Research, who are obviously interested in these foundation models, and they found that the model didn't outperform simpler standard techniques. And this is from a different paper, where they looked at how well this embedding can predict gene expression in a test data set, and they found that logistic regression actually outperformed the prediction based on this foundation model. So I think it's always a very good idea to compare to something like logistic regression when you do deep learning, just to see where you stand.
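The substitution-ranking idea can be illustrated with a small sketch: given per-position probabilities over the 20 amino acids from some language model (here just a random placeholder), rank single-residue substitutions by how much more plausible the model finds them than the current residue. Everything here, including the toy sequence, is illustrative.

```python
import numpy as np

# Minimal sketch: rank candidate single-residue substitutions by a language
# model's per-position amino-acid probabilities (random placeholders here).

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
sequence = "MKTAYIAKQR"                                        # toy antibody fragment
probs = np.random.dirichlet(np.ones(20), size=len(sequence))   # placeholder model output

candidates = []
for pos, current in enumerate(sequence):
    p_current = probs[pos, AMINO_ACIDS.index(current)]
    for j, aa in enumerate(AMINO_ACIDS):
        if aa != current:
            # log-ratio > 0 means the model finds this substitution more plausible
            candidates.append((np.log(probs[pos, j] / p_current), pos, current, aa))

for score, pos, current, aa in sorted(candidates, reverse=True)[:5]:
    print(f"position {pos}: {current} -> {aa}, log-ratio {score:.2f}")
```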
Maybe you don't want to put that in the publication later, but it shows you where you stand, more or less, with the performance of your model; it is not worth using deep learning if you can do the same thing with logistic regression. It probably also shows that single-cell data is quite complicated data: you have to deal with a lot of missing values, you have batch effects, and you probably have a lot of noise in the data as well. That is just more difficult to handle than images or text, and it might also show up in the gene embeddings here; maybe the quality of those embeddings is just not yet good enough. Now, one of the success stories of deep learning is certainly AlphaFold 2. There are these CASP competitions, where all the tools that predict the 3D structures of proteins are evaluated. For each competition, the sequences of some proteins are published whose structures have been determined but not yet released, and then all the participants predict the structures and are evaluated. In 2018 it was AlphaFold 1, a more traditional neural network, that won, but in 2020, it is a biennial competition, AlphaFold 2, a completely redesigned version of AlphaFold 1, outperformed all the others and provided a huge improvement compared to the runner-up models. This AlphaFold model is able, not for all proteins and not for all parts of the proteins, but sometimes, to predict structures with an error of less than one angstrom, which is huge: the experimental error is maybe 0.5 angstrom for very good structures, so you already come pretty close to that, which is a huge improvement. And how did they do that? For training data they used the PDB, so you have about 200,000 or so structures you can use for that. That is not such a huge data set for the complexity of the problem. They were aware of that, and they knew they had to adapt their model, this AlphaFold 2 model, to the training data and also to the problem they had to solve; they could not just come up with a general neural network that learns it all. Now, a lot of work had already been done before AlphaFold 2; they didn't invent everything themselves, and this is nicely summarized in this review here. If you want to read it, there is a whole history that led to this AlphaFold 2 model, but they took the bits and pieces that were already known and put them together in a really powerful model to predict 3D structures. The model takes a single sequence as input, no additional information; all the rest they do themselves, but they have a large database of sequences. So the first thing they do is align this input sequence against the sequence database to get a multiple sequence alignment. This multiple sequence alignment, as was known, is actually very informative, because it tells you which amino acids are conserved, which amino acids change, and which amino acids co-evolve, and if they co-evolve, that is a good indication that they are also structurally close to each other. You can also give it a sequence template, but you don't need to. And the second data structure they use is the pair distance representation: for all pairs of amino acids in the sequence, it gives you the distance between these amino acids.
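To show what such a pair-distance representation looks like, here is a minimal sketch: an L x L matrix with the distance between every pair of residues. Here it is simply computed from placeholder 3D coordinates; in the model itself this representation is learned and refined, not read off a known structure.

```python
import numpy as np

# Minimal sketch of a pairwise-distance representation: an L x L matrix with the
# distance between every pair of residues (placeholder coordinates).

L = 8                                            # toy protein length
coords = np.random.randn(L, 3)                   # placeholder C-alpha coordinates

diff = coords[:, None, :] - coords[None, :, :]   # shape (L, L, 3)
pair_dist = np.linalg.norm(diff, axis=-1)        # shape (L, L), symmetric, zero diagonal
print(pair_dist.round(2))
```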
But at the beginning you don't know those distances; the representation is, I think, randomly initialized, and the network will then learn it. So as input you have this multiple sequence alignment and this pair distance representation, and you pass them on to a so-called Evoformer block. There are 48 of these blocks, so there are a lot of neural networks in there. They also use the transformer architecture, and they use very clever tricks. For example, they use the triangle inequality, because you know that the structure, or the interactions of an amino acid, is defined mostly by other amino acids that are close by, not so much by those that are far away; so you give more importance to the amino acids that are close by, and they were able to encode that into these neural networks with an attention mechanism. Also very interestingly, the amino acids are treated just like a gas of amino acids: they don't have links between each other, they are completely free. It is only the loss function at the end that enforces the geometry: if two neighboring amino acids are too far apart, the loss will be so large that this will not be considered a good solution. So the loss function forces these amino acids to be close together. It is a mix of physical ideas and neural networks, because John Jumper, this guy here, has been working on protein folding for a long time, so he knows a lot about the subject, and he worked together with these experts on neural networks; together they came up with this AlphaFold, which is a mix of physical ideas and neural networks. Then finally you have this pair representation and the sequence alignment, and you need to turn that into a 3D structure; there is another block that does this, which gives you a 3D structure, first of the backbone, and then you have to decide the angles for the side chains, which is done by another neural network. And at the very end, when they have the structure, they refine it with a force field, a classical force-field approach, mainly to avoid that side chains overlap or that residues end up too far away from each other. So this is a fairly complex system, but it turned out to work extremely well. It works so well that it is now used by the EBI to predict structures for all the proteins they have in their databases; it is also included in the UniProt pages, where you can see these AlphaFold structures. And it really increased the coverage of protein sequences that are annotated with a structure, compared to, for example, the SWISS-MODEL repository, which did the same thing but with different methods, using protein templates. AlphaFold is able to predict structures with errors close to one angstrom, which is extremely good, but at least in this paper they showed that this is not quite good enough yet for docking, because docking of molecules to proteins really depends on the details of how the side chains are placed, and a small difference, maybe even a third of an angstrom, can have a huge effect. So for docking predictions, the experimental structures still clearly outperform the AlphaFold predictions, and with the AlphaFold structures the docking predictions were actually not much better than with classical or traditional models. But AlphaFold evolves; the paper has 29 authors, so there is a lot of manpower behind AlphaFold, and they will of course keep improving it.
There is already a new paper with results out; they don't describe the methods, but they show some of the new results, and they claim that they have now decreased the error in the structure prediction even further, getting close to a value that is also suitable for docking. It will be interesting to see the development of that field, because often you want to predict de novo sequences, for which you can't build a multiple sequence alignment, and AlphaFold relies on this multiple sequence alignment. So maybe they will find tricks and hacks to still do that, or maybe, for certain proteins at least, we will still need to combine AlphaFold with more traditional physical methods or molecular modeling methods. The ultra-high accuracy has not yet been reached by AlphaFold, but they are working on it, and I haven't really read these new results, but I guess they are now already quite close to it. Also very important: protein sequences are not just sequences. They have variants, and the proteins have PTMs, which can have a large impact on their function. This is not included in AlphaFold, also because you don't see it in a sequence alignment, so it is quite difficult to predict, but they are also working on these subjects. If you have multi-domain proteins, the prediction within each domain is often very good, but how the domains are arranged relative to each other is maybe not predicted so well anymore. There is a version of AlphaFold that is able to predict protein complexes, which is a very difficult task; they actually treat the two proteins at the same time, mixing the residues of the two proteins, and it turns out it actually works quite well, though there is still room for improvement. And sometimes you are interested in the conformational space of a protein: sometimes a protein doesn't have only one conformation but can switch from one conformation to another. You can get at that a little bit from the AlphaFold results, but it is not really implemented yet; that will be something where you might need different approaches to predict, basically, the entropy of the protein structure. So there is still a lot of work to be done, and it will be very interesting to see how much of that work can be done with deep learning and to what extent you will still require these traditional molecular modeling approaches. Now, the last example here is from my field, mass spectrometry, for those who don't know what mass spectrometry does. In this MS/MS approach you take a protein, or usually a whole cell lysate, so a lot of proteins, and you digest them into small pieces, because the mass spec can't easily analyze whole proteins. Then you separate these peptides on an LC column and measure MS/MS spectra. These MS/MS spectra can then be matched against a sequence database, so you can identify the peptide that produced each MS/MS spectrum, or they can be used for the quantitative analysis of these peptides once the identification is done. For both uses it turned out that spectrum prediction really helps: if you have a peptide sequence and you are able to predict its MS/MS spectrum, that helps the matching and it helps the quantitative analysis. And here too, in this Prosit approach, and there are other approaches as well, they were inspired by the language model architecture: they use GRUs, which are gated recurrent units.
These are similar to LSTMs, but a bit simplified, and they parse the peptide sequence in two directions; there are technical reasons for doing that. Then they have an attention layer to form the embedding. That is the encoder, and then they have a decoder that turns the embedding into an MS/MS spectrum. The quality of these predictions is sometimes really astonishingly good, and if you include these predictions in your scoring methods, you can quite drastically increase the number of identifications you get within a certain false discovery rate threshold. So this is, I would say, another success story of deep learning, also because that paper was produced by a group linked to ProteomicsDB, so they have about 150 million spectra at hand that they can use to learn these kinds of things. At the very end, I just want to mention tabular data. This is data we often have in medical or bioinformatics settings: basically data matrices where one column is numerical, another column is maybe categorical, you eventually have a lot of missing values, the columns are poorly correlated with each other, and often you don't have that much data. There have been some experiments; people came up with very sophisticated deep learning models that try to handle this kind of data, but the performance of these models in independent analyses doesn't yet seem to be comparable to the performance of gradient boosting, which, also in my experience, performs better on this type of data than these deep learning architectures. Okay, that was my last slide, and I guess time is already over. If there are any urgent questions I can answer them; otherwise, I think it's time for the next lecturer to take over.
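As a closing illustration of the encoder idea described above, here is a minimal sketch of a bidirectional GRU encoder that turns a peptide sequence into a fixed-size embedding; it is not the Prosit implementation, the sizes are illustrative, and simple mean pooling stands in for the attention layer.

```python
import torch
import torch.nn as nn

# Minimal sketch of a bidirectional GRU encoder for a peptide sequence
# (illustrative sizes, mean pooling instead of attention).

class PeptideEncoder(nn.Module):
    def __init__(self, n_amino_acids=20, emb_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(n_amino_acids, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True,
                          bidirectional=True)          # reads the sequence both ways
        self.out = nn.Linear(2 * hidden_dim, 128)      # fixed-size peptide embedding

    def forward(self, tokens):                         # tokens: (batch, seq_len) indices
        h, _ = self.gru(self.embed(tokens))            # (batch, seq_len, 2 * hidden_dim)
        pooled = h.mean(dim=1)                         # simple pooling over the sequence
        return self.out(pooled)

peptide = torch.randint(0, 20, (1, 12))                # one toy peptide of length 12
embedding = PeptideEncoder()(peptide)
print(embedding.shape)                                 # torch.Size([1, 128])
```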