Hi everyone, thank you for coming. I hope this will be an interesting talk and that it will give you some insight into these very interesting and, I dare say, world-changing models. Sebastian already introduced me, so I'm not going to repeat that, but if you don't know about ONNX, it's a format for exchanging neural networks between different frameworks such as PyTorch and TensorFlow, and it's worth checking out.

So these are the main players of today's performance. The models I will be talking about are the transformer model and the models used in the diffusion process. The first kind deals with natural language and the second group deals with images, and both can generate images or natural language. Famous models from each group are, for example, GPT, BERT, or T5 for the transformers, and for the diffusion models, Stable Diffusion, Midjourney, DALL-E, and others. What's important about these groups of models are the mathematical operations at the heart of the entire network, and this is what I will be talking about today. For transformers this is the attention mechanism, and the diffusion models use both attention and convolutions for the image part.

ChatGPT needs no introduction at this conference, I'm sure, and this is the transformer model I will have in mind when talking about transformers, but what I will say applies to transformers in general. The diffusion models are the ones generating all the nice pictures you see around the internet that have been created by artificial intelligence.

So the key operations I will be mentioning here, the ones inside the neural networks that contain the actual weights the networks are learning, are primarily three in this talk (learning to use my laser pointer): the linear layer, the convolution layer, and attention.

The linear layer is also sometimes called the fully connected layer or dense layer. Compared to the others, it's a simple transformation of inputs to outputs. You can boil it down to a simple equation, y = Ax + b, where A is the weights learned for a given layer and b is the biases. So it's a relatively simple operation (a short sketch of this follows below).

Convolution is a more complicated operation where you have an image as the input and a filter, which is the weights, and you scan the input image with the filter to produce another image. Think of an example: if you have a filter that is able to detect vertical edges, you can scan an image with that filter and get another image showing just where the edges are. You may have done this in a graphics program that finds all the edges; it may have used this type of filter-scanning approach.

And attention, which I will get back to later, you can think of roughly as a key-value store lookup. You have some database and you're looking up entries in it. That's what attention does, in very general terms.

OK. So I want to go from operations now to models. And I will start with the simplest model, which you have probably heard about and seen displayed in presentations or discussions of neural networks. It's this one: the famous fully connected, simple feed-forward network, where you start with an input layer, then have some hidden layers and an output layer.
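To make the linear layer concrete, here is a minimal sketch in plain NumPy; the shapes and variable names are illustrative, not taken from any framework:

```python
import numpy as np

def linear(x, A, b):
    """A fully connected (dense) layer: y = Ax + b."""
    return A @ x + b

# Illustrative shapes: 4 inputs mapped to 3 outputs.
A = np.random.randn(3, 4)   # learned weights
b = np.random.randn(3)      # learned biases
x = np.random.randn(4)      # input vector
y = linear(x, A, b)         # output vector of length 3
```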
And they're all connected to each other, and the connections between them are the weights: you multiply the values from the input layer by these weights to get the values of the next layer. So that's what a simple multilayer perceptron looks like in this classical graphical representation.

But I will be using a slightly simplified representation of neural networks throughout my talk. So this architecture goes from looking like this to looking just like that. The two linear layers are now represented here like this, and the third, output one, which is a little smaller, meaning that it has fewer neurons, is drawn slightly smaller. Essentially this is what the graphical representation boils down to. So a state-of-the-art neural network architecture from the 1980s is just these three bars on the graph. But this will allow us to go deeper.

So let's start with convolutional networks and the convolution operation. I already described it briefly: it's the operation that scans an image to produce an output image. The math behind it is simple, although written up like that it may look slightly intimidating. You're essentially just multiplying each value in a patch of the input image by the corresponding value in the filter and summing the results to get one output pixel value (there's a short sketch of this below). And with this operation you can build neural networks that scan and process images.

Convolutional networks are usually deep networks for deep learning, meaning that you have a whole stack of these convolution operations one after another. So this is VGG, a model that was famous in the middle of the last decade because of its success in the ImageNet competition, where it was among the best models at the time at classifying images into a thousand classes. In the ImageNet competition you had a huge array of images that networks had to assign a class to: whether it's a cat or a car or a person. This network scanned the image using a convolution, followed by another convolution. Then there was a pooling operation, which essentially downsampled the image to a smaller size: for every four pixels you got one pixel. After that pooling operation you had another set of convolutions, another downsampling max-pooling operation, then another stack of convolutions, downsampling, stack of convolutions. So you can see that we're essentially repeating the convolution operations a number of times, going deeper and deeper, as the term deep learning signifies. And each of these layers is learning something about the input image; I will get back to that in a second.

But you can already notice something happening here. There is the first part of the network, which processes the image, and then we have this multilayer perceptron again, the same network we used before. But now it's taking the data that was generated, encoded, by the first part and decoding it into one of the 1,000 classes of the ImageNet competition. So I've already sort of hinted at an encoder and a decoder here; we'll get back to that in a second.

But first I want to talk about what this stack of convolutions represents, or rather what having a stack of something in deep learning allows these networks to do. This is a famous, classic paper from 2013 (Zeiler and Fergus, "Visualizing and Understanding Convolutional Networks") about visualizing what's happening inside a convolutional neural network.
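Before we look at what those filters learn, here is a rough sketch of the convolution operation described above (no padding, stride 1, single channel; strictly speaking this is cross-correlation, which is what deep-learning libraries compute under the name convolution):

```python
import numpy as np

def conv2d(image, kernel):
    """Scan `image` with `kernel`; each output pixel is the sum of an
    elementwise product between the kernel and one image patch."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# A vertical-edge-detecting filter, like the example above.
edge_filter = np.array([[1, 0, -1],
                        [2, 0, -2],
                        [1, 0, -1]])
edges = conv2d(np.random.rand(8, 8), edge_filter)
```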
And in the first layer, the filters that the network learns, you can see them up here, are relatively simple. They are filters for detecting either edges of various orientations or transitions between different colors. And these are the types of fragments from input images that each filter is most activated by; so these are examples of what the filters are looking for. There are nine filters here and nine patches of examples here; that's how this graphic can be read. Anyway, in the first layer, at the beginning of the neural network, the network is looking for relatively simple things. It's looking for literally edges, colors, transitions between two colors, simple features of the image.

But then, as we get deeper into the network, the filters get more complex: no longer just simple transitions between colors, but already some shape or form of some kind. You can see that this one here is detecting circular patterns, another one stripy patterns, things like that. And when we go even deeper into the network, the convolution is applied to the result of the convolution of the previous layer. In the first layer, the convolution is looking at three channels, which just represent the three colors of the image: red, green, and blue. But the next layer has more than three channels, and these channels, the outputs of the previous convolutions, these feature maps, represent different things: edges, shapes, transitions of color. So these feature maps are being scanned by the next layer, and more high-level features can be detected. When you get deep into the network, into what is labeled here layer five, it is already looking for specific things. You can see that this filter seems to be looking for faces, this one for flowers, this one for dogs, this one for eyes. It's learning how to detect higher-level concepts, higher-level images in the case of images.

OK, that was a digression into why deep learning works: going deeper into the network gives it the ability to learn higher- and higher-order concepts. But let's get back to architectures.

So the first higher-level architecture I want to talk about is the encoder-decoder architecture, which I already hinted at before. You can take the encoder part of a convolutional network like VGG and use a different network as a decoder. In the case of a convolutional autoencoder, the decoder part can be a symmetric, mirror image of the encoder. The encoder takes an image and creates some sort of latent representation of it, this set of numbers in the middle that means the same thing as the image. And then you can have a deconvolutional network, which inverts the process and is able to recreate an image from those numbers. This type of convolutional autoencoder actually works, and you can have a network that takes an image and also generates an image.

However, there is an issue: some information gets lost in these deep neural networks. So the idea of skip connections was introduced, and residual networks were born; the residual term signifies networks with skip connections. What does that mean? It means you can have a network like the convolutional network we were just talking about, but for every set of convolutions, to the output of those convolutions you add back the input from a few layers earlier (a minimal sketch follows below).
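A minimal sketch of the skip-connection idea, where `f` is a hypothetical stand-in for whatever block of layers sits between the two ends of the connection:

```python
import numpy as np

def residual_block(x, f):
    """Apply some block of layers `f`, then add the original input back in,
    so information from earlier layers keeps flowing forward."""
    return x + f(x)

# Toy example: a single linear map stands in for a stack of convolutions.
W = np.random.randn(8, 8) * 0.1
y = residual_block(np.random.randn(8), lambda x: W @ x)
```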
So the information from a few layers before gets added to the output of the layers following it, which allows the network not to lose sight of what it was originally looking at. It keeps the information from earlier layers flowing forward through the network. That's what this looks like on the example of a network called ResNet.

And we can build on this. If you combine skip connections with an encoder-decoder architecture, you get what is now commonly called a U-Net. It's essentially a network that combines both of these ideas: you have an encoder part, a decoder part, and then skip connections between the corresponding layers that have the same size. In convolutional U-Nets these are no longer images but feature maps of the same size. So that's what a convolutional U-Net is.

OK. So this was all an introduction to the architectures we need for the next step. And the next step starts with the attention operation, the operation that is proving very powerful and is giving us, especially in natural language processing, the spectacular results we are seeing these days. Like I mentioned before, it represents a kind of key-value lookup. This is the only bit of Python in my presentation today: if you have a simple dictionary of keys and values, you can have a query that matches one of the keys, and this is how you look up a value. You put your query into the store and you get a value back. Simple, obvious. But you can also do this with matrices, which is where it gets interesting (see the sketch below).

"Attention Is All You Need" is the paper which introduced this idea and the transformer model; it's a very good paper, worth checking out, and this figure is from it. In order to build this attention mechanism with math, you really just need two matrix multiplications. It's very simple and very ingenious. You start by taking the matrix representing your queries and the matrix representing the keys of the key-value store and multiplying them together. What that produces is the attention weights: where attention should be paid. Where the keys and queries match, those are the indexes of the data you want to look up. Next, you multiply this attention with the actual values, and you get the output: the values you are querying for. And this simple concept, which can be expressed with this simple equation, allows us to do really powerful things.

The way this works in practice in a transformer model is a more complex arrangement, called multi-head attention because there are multiple heads; I'll get back to this in a second. But looking at this, you could ask: where are the weights? None of these operations are trainable; they don't learn, they're just matrix multiplications. The way this works in practice is that before the actual attention mechanism there are a few linear transformations. The values go through a linear transformation, a fully connected layer, the keys go through a fully connected layer, and the queries go through a fully connected layer. That's where the parameters are, the ones that get learned during the training process. Then these go into the scaled dot-product attention, and the output goes once more through a linear layer.
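Here is the dictionary lookup from the slide, followed by a sketch of its matrix generalization, scaled dot-product attention, softmax(QKᵀ/√d)·V; the shapes are illustrative:

```python
import numpy as np

# The key-value lookup, in plain Python:
store = {"cat": "meow", "dog": "woof"}
query = "cat"
value = store[query]                     # -> "meow"

# The same idea with matrices: two matrix multiplications and a softmax.
def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # where should attention be paid?
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax
    return weights @ V                   # look up the values

Q = np.random.randn(5, 64)   # 5 queries of dimension 64
K = np.random.randn(7, 64)   # 7 keys
V = np.random.randn(7, 32)   # 7 values
out = attention(Q, K, V)     # -> shape (5, 32)
```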
And why is it called multi-head attention? Because for the keys, values, and queries you can have multiple parallel sets of these linear layers, which allows the network to learn multiple key-value stores and look up multiple meanings of what it's looking for as you go deeper into the network. So that's the attention mechanism in a nutshell.

And now let's talk about the transformer architecture that's built using this mechanism. So the transformer, and GPT, our famous ChatGPT friend, is a transformer as well, looks like this. This is a typical encoder-decoder transformer, because you have one part of the model which is the encoder, and this second part, which is the decoder. Interestingly, with transformers you can have models that are encoder-decoder, or you can have just a decoder, or in some cases just the encoder is used. But for applications such as neural machine translation, you have the full encoder-decoder model.

So how does this work? Let's start with the encoder. Your inputs this time are not images but words from a natural language such as English. In order for these words to be processed by our neural network, you have to turn them into numbers. That's what this first layer, the embedding layer, does: it turns the words into vectors that represent them numerically. Then there's a step here called positional encoding, which essentially adds information to each word about whether it is the first word in the sentence, the second word, the third word, and so on. Each word gets this information about where it is in the sequence through the positional encoding, which I don't think I'm going to go deeper into, but it's a bunch of sines and cosines of different periodicity that just get added to the vectors representing the words (there's a short sketch of it below).

And then you feed that, that's your input, into your transformer. The first attention mechanism here is self-attention: the keys, values, and queries all come from the same place, and it allows the model to decide what about this information is important. Then you go through a couple of linear layers, and you repeat the process. You have a stack of these. Again, like with all deep learning models, you have a stack, here a stack of these transformer units that have an attention followed by some linear layers. And that's the whole encoder part of the transformer.

Now, the decoder part of the transformer is slightly more interesting, but it's the same principle. You start with some words, which you embed and add the positional encoding to. So you process the input, you turn words into numbers again, and then you go through an attention layer, and then another attention layer, but this time it is not self-attention. Only the query now comes from the previous layer of the decoder, and both the keys and the values, the key-value store, come from the encoder. So the encoder provides the keys and values, but the decoder provides the query; it's kind of asking for specific information from the key-value store. And once again, like in all the slides I'm showing you, there's a stack: the little dashed line here is supposed to indicate that this unit gets repeated a number of times. And in the end, this decoder produces a set of numbers representing the probability of each possible next word.
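For completeness, a sketch of the sinusoidal positional encoding just mentioned, following the formula from "Attention Is All You Need" (the sizes are illustrative, and an even embedding dimension is assumed):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sines and cosines of different periodicity, one vector per position,
    added to the word embeddings. Assumes d_model is even."""
    pos = np.arange(seq_len)[:, None]          # positions 0 .. seq_len-1
    i = np.arange(0, d_model, 2)[None, :]      # the even dimensions
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dims: sine
    pe[:, 1::2] = np.cos(angles)               # odd dims: cosine
    return pe

# e.g. embeddings = embed(words) + positional_encoding(len(words), 512)
pe = positional_encoding(10, 512)              # -> shape (10, 512)
```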
And so when you're interacting with ChatGPT or any other of these models, you may see that the output comes word by word from the model. In this case too, you get an output word, which represents the first or the next word of your sentence. You add that to the sentence you've already generated, and you feed that back in as the input to the decoder. So the decoder takes information from the encoder, but it also takes the sentence it generated itself as its input to produce the next word, then takes that, produces the next word, takes that, produces the next word, and so on. So that's how the transformer, how GPT and the other powerful neural network models, work internally, with grossly simplified math, of course.

So, I'm slowly coming to the end of my presentation, and now it's time to talk about the diffusion models. Diffusion is a process that can be described with this slide, which I hope will be clearly understandable. You start with some image, and then you add some noise to that image, noise that you generate by just calling a random number generator. Then you add a bit more noise, and a bit more, and a bit more, until finally all you have is noise. You have multiple of these time steps, and each of them just adds a little bit of noise to the image (there's a short sketch of one such noising step below).

That little bit of noise that was added to the image is something we can teach a neural network to recognize. So you actually take a noisy image from one of the steps of your sequence, you feed it into some sort of encoder-decoder that can take an image and produce an image, and it should produce an image of the noise: the part of the input that it thinks was the noise. Then you can subtract that from the original input image to get back the less noisy image. And so the backwards diffusion process also has multiple time steps, because you start with just noise, with random nothing, but you progressively get to a less and less noisy, more and more defined and clear picture. And that's how you generate the images.

So the way this is done, this part of the picture here, can be done with a diffusion model, and we're going to talk about the latent diffusion model. This is definitely the most complicated slide in my presentation. You can see we're starting here with an input image that is slightly noisy. We're encoding it with a convolutional encoder, actually with a U-Net: you have a connection here to the second half of the U-Net, the decoder, but we're still at the encoder. We're encoding that image into some sort of latent representation, because running the diffusion process on the images themselves is very computationally expensive. Working in latent space, in this representation that comes out of the encoder, allows you to work more efficiently with the data. So everything beyond this point happens in latent space. Everything in latent space also has the time step that we're at added to it; I'm not going to cover that further, but just be aware that we're working in latent space with the time-step information available.

And what happens here? Well, we have another stack. This time it's a stack of residual blocks, each with some convolutions and a skip connection, like we saw before.
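Stepping back for a moment, one forward noising step can be sketched like this (the schedule value `amount` and the image size are illustrative):

```python
import numpy as np

def add_noise(image, amount):
    """One forward diffusion step: mix the image with fresh Gaussian noise.
    `amount` in (0, 1) controls how much noise this single step adds."""
    noise = np.random.randn(*image.shape)
    return np.sqrt(1 - amount) * image + np.sqrt(amount) * noise

# Repeat over many small steps until the image is almost pure noise.
x = np.random.rand(64, 64)        # stand-in for an input image
for t in range(1000):
    x = add_noise(x, amount=0.005)
```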
And each residual block is followed by a spatial transformer. So a transformer is also added into the model, and that transformer can either have a self-attention layer that only deals with information coming from this encoder, or it can do the same thing we saw on the previous slide about the transformer, where the keys and the values come from somewhere else. And where do they come from? They come from here: this is the conditioning step. So you can actually feed this process some sort of hint of the image that you want to generate. You want to generate a logo for your Python in Prague, for example. Well, this image wasn't generated by AI as far as I know, but if it was, it would go through something like this. You take the hint, the prompt, the thing that the user types in, you encode it with a transformer, in this case a BERT encoder, so just the encoder part of a transformer, and you feed that into the spatial transformers inside this stack of the latent diffusion model.

And then you have another residual block, another spatial transformer, another residual block, another spatial transformer; it's a deep network. But you end up going from spatially bigger representations to spatially smaller representations. So it's another U-Net, a U-Net inside a U-Net, because this part starts with a latent representation of some size, goes down into a bottleneck, and then comes back out at the same size, which can then be decoded by the decoder part of the convolutional U-Net at the end here to generate the image of the noise. So this is the most complex slide. I hope you're able to see the symmetries and the beauty, and actually the relative simplicity, of these models. They're not as complicated as they may appear.

But anyway, those are all the models, and now I'm coming to the very end of my talk. I just wanted to leave you with some thoughts from me. This is unsolicited advice to you, to everybody; it's not part of the technical talk anymore. With everything happening in the world of AI these days, it's worth keeping in mind that these are very powerful tools, and some of them can be used for good or for ill, and we should be careful how we use them. An interesting observation made by Yuval Harari is that we have already had our first battle with AI. Humanity's first contact with AI was social media: recommender systems, the humble recommender system used to recommend the next music track to listen to, the next product to buy, and finally the next news story to read. They did change the world. And, at least according to Harari, humanity lost, because now we are more divided, more polarized; it's more difficult to talk to some friends or relatives because of this. So these are very powerful techniques, very powerful tools, and they will do a lot of good in the world. But we should not lose sight of the fact that they are potentially very world-changing, and we should keep an eye on them. So thank you very much; that was my talk.

Thank you very much. We are a bit over time, but the next talk is starting at 11:20, so if anyone has any questions, just come to one of the microphones.

So I have a question. Maybe it's a bit outside the scope of what you have shown, but since you explained such difficult concepts so simply, I wanted to ask you: what is this fine-tuning for GPT and all these models that is so important?
Is the fine-tuning essentially determining how the model reacts to the prompt? If you can give a comment on this.

So, essentially, training a large language model from scratch is extremely expensive, comparable to the energy consumption of a couple of houses over a year; it's very computationally expensive. So you don't usually train a large language model from scratch. You take a trained model, and then you train it on your specific task, by, for example, freezing the weights in the model and just adding a head that will do your part; that's usually how I've heard it done (there's a small sketch of this at the end). And this fine-tuned model, which is retrained on your specific task but without having to learn the language again, because it already knows the language and is only learning your specific use case, can solve your problems better than a generic large language model.

On the slide where you're talking about conditional diffusion, where there's some text prompt you're using for the model: I get that you encode the text, but isn't that in a text embedding space that the diffusion model, which works with images, doesn't really know how to translate into images? I don't understand that part.

That's the magic of neural networks. It learns how to interpret that representation because we're training it. We're showing it prompts and images, and we're teaching it that this is what we're expecting it to produce. So it's learning how to interpret this encoded information as part of its training process.

So it's trained on text and image pairs, not just images? OK.

Hey, thank you for a great talk. I have a question regarding diffusion. Do you have any production use cases that this can be used for? Is it mainly for recognizing some adversarial noise, or is there something else?

That's a very good question, but one that would probably take a long time to answer, because I don't have a specific use case that I'm using them for, but there are many possible applications. We can have a coffee afterwards.

OK, great. Thanks.

OK, and thank you very much again, Michał. And I guess for any other questions, you will be around for the conference. Thank you.
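As a footnote to the fine-tuning answer above, the freeze-the-weights-and-add-a-head approach might look something like this in PyTorch. This is a minimal sketch: the tiny base model, the hidden size of 768, and the 3-class head are all hypothetical stand-ins, not any real pretrained model's API.

```python
import torch
import torch.nn as nn

# Hypothetical toy stand-in for a large pretrained model.
base_model = nn.Sequential(
    nn.Embedding(1000, 64),      # token IDs -> vectors
    nn.Flatten(1),               # (batch, 8, 64) -> (batch, 512)
    nn.Linear(64 * 8, 768),      # -> a 768-dim representation
)

# Freeze the pretrained weights: the model keeps its knowledge of language.
for param in base_model.parameters():
    param.requires_grad = False

# Add a small new head for the specific task, e.g. 3-class classification.
head = nn.Linear(768, 3)

# Only the head's parameters get updated during fine-tuning.
optimizer = torch.optim.Adam(head.parameters(), lr=1e-4)

tokens = torch.randint(0, 1000, (2, 8))   # a toy batch of token IDs
logits = head(base_model(tokens))         # -> shape (2, 3)
```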