I hope you enjoyed the presentation from Marcus and that you got the essentials of deep learning from it. Now I'll share with you some techniques that we apply in the life sciences. I just want to mention that I don't aim to give you a complete list of techniques in deep learning, only the common ones that are widely employed for biological or health data. So I'm sorry if you don't see a technique that you expected to see. I'll give an overview of each of them, trying to briefly explain the concepts and when they can be used, but I'm not really going into the technical details, which you can see more of in the seminar this afternoon and maybe in a future, more advanced course. Okay, so the first technique I'd like to mention is the CNN. First, let me just remind you of the concept of a neural network, even though you just saw it with Marcus: neural networks are used to imitate our brain. Of course, the most obvious natural input to a neural network is an image, what we see, and the most popular application of neural networks is in computer vision. Imagine we have an input image, x_i, okay? It can be an X-ray image, but it can be something else: a sequence of RNA or DNA, or some omics profile of a patient. The important thing is that they are all read in the same way by a computer, and the input can be encoded into a vector of numbers. Each node or neuron here is the color measurement of one pixel in the image, for example. Each node in the next layer is the result of an activation function applied to a linear combination of the nodes in the current layer, which is defined by a weight matrix W, and so on, up to the output y, okay? So we have the input x and the output y. Once we predict ŷ, the predicted value, we compare ŷ with the observed value y for every object i, and then we define a loss function.
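To make this concrete, here is a minimal sketch of that forward pass and loss in plain Python. This is my own toy illustration, not from the talk: the layer sizes, the sigmoid activation, and all the weight values are arbitrary choices.

```python
import math

def sigmoid(z):
    # activation function applied to each linear combination
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, W2):
    # hidden layer: activation of a linear combination of the inputs
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    # output layer: one predicted value y_hat
    return sigmoid(sum(w * hi for w, hi in zip(W2, h)))

def mse_loss(y_hat, y):
    # squared error between prediction and observed value for one example
    return (y_hat - y) ** 2

# a tiny 3-pixel "image" encoded as a vector of numbers
x = [0.2, 0.8, 0.5]
W1 = [[0.1, -0.4, 0.3], [0.5, 0.2, -0.1]]  # weights of 2 hidden neurons
W2 = [0.7, -0.3]                           # weights of the output neuron
y_hat = forward(x, W1, W2)
print(mse_loss(y_hat, 1.0))  # compare prediction with the observed label
```

Training would then adjust W1 and W2 to make this loss smaller.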
The loss can be something like the mean squared error, or something else depending on our problem or the activation function we used. The goal is then to find the parameters W, the weight matrices, that minimize the loss function, by performing gradient descent via backpropagation, as Marcus told you before. Okay, so now we've seen the protocol for performing a deep learning task, but here come several questions. Does it really work? Can we minimize the loss function? And even when we can, is the minimum good enough, and can we obtain it in a reasonable time? For example, the encoding of an image as a vector of pixels is somewhat too generic, too simplified: we ignore the spatial dependency between the pixels in the image. That's exactly the kind of reasoning we use when we do jigsaw puzzles and put together the pieces with related patterns, right? One more thing is the dimensionality problem. Let's have a look at these two images. On the left, we have a matrix of five by five pixels, and on the right around three million pixels. Of course, it's much easier to parse the first one, with only 25 pixels, than the second one. The problem comes with full connectivity of neurons, meaning that each neuron in a layer is connected to all the neurons in the previous layer. Even connectivity that is not full but still very high is a problem, and the problem is known as the curse of dimensionality. There are two issues. The first one is the sparsity of our data. When the dimensionality increases, the volume of the space increases so fast that the available data becomes sparse. Imagine you have 10 people standing on a line, then the same 10 people in a square, then the same 10 people in a cube, and then imagine those 10 people in a one-hundred-dimensional space. That's the problem: the sparsity of data is problematic for statistical significance.
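The 10-people picture can be made concrete with a quick count, an illustration of my own: split each axis into 10 bins and see what fraction of the resulting cells 10 samples can possibly occupy as the dimension grows.

```python
# curse of dimensionality: 10 samples in a grid with 10 bins per axis
n_samples = 10
for d in [1, 2, 3, 100]:
    n_cells = 10 ** d  # number of grid cells in d dimensions
    occupied = min(n_samples, n_cells) / n_cells
    print(f"d={d}: at most {occupied:.0e} of the cells contain a sample")
```

In one dimension the 10 samples can fill every bin; in 100 dimensions they occupy at most a 1e-99 fraction of the space, which is exactly the sparsity problem.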
To maintain a similar significance, the amount of data needs to increase exponentially as we add more dimensions. The second issue is the dissimilarity of data: when we increase the dimensionality, the dissimilarity between data points increases. For example, if you compare two people on only one feature, height, they are either different or not based on height. Now add weight, add, for example, a glucose measurement, add eye color, whatever. As you add more features, as you increase the dimensionality, everything starts to look dissimilar, and it becomes problematic to sort or classify the data. So that's where we need a convolutional neural network. So what is a CNN? Okay, please don't confuse it with CNN the cable news network, the television channel. The CNN is inspired by the organization of our visual cortex. Our neurons respond to stimuli only in a restricted region of the visual field, which we call the receptive field, and a collection of those receptive fields overlaps to cover the entire visual area. A CNN is a deep learning algorithm that can take an image as input, assign importance, with weights and biases, to different aspects or objects in the image, and then differentiate one from another. Okay. In this network, each neuron receives connections only from a subset of neurons, not all of them as in a fully connected neural network. In such a way, we can reduce the number of parameters: many weights are fixed to zero compared to a normal neural network. A CNN can also capture the dependencies in space and time between pixels in an image. The spatial dependency is about the relationship between nearby pixels: if we have two nearby pixels, they should have very similar color or gray-scale values. The temporal dependency is about the relationship between different moments of the same pixels when we have a series of images.
For example, in a video. With this, the network can be trained to better understand the structure of the image. So the role of the CNN is to reduce the image into a form which is easier to process, without losing the features that are critical for obtaining a good prediction. So how does a CNN really work? Here's an image with its pixel matrix. The idea is to take each square block of pixels, for example the red one here, as a neuron, instead of each pixel alone. This step builds what we call a convolutional layer, which is central to the CNN. It performs an operation called convolution, in fact a linear operation that involves multiplying a set of weights with the input, just like in a normal neural network. The weights here are defined as a filter of the same size as the block of pixels, the sliding window, that we'd like to consider. For example, here we consider a sliding window given by a three-by-three matrix. The values of the weights in the filter represent something we want to detect; with this filter, we try to detect an X shape, okay, a small X pattern in the image. I multiply the filter with the sliding window in the input image, that is, I take the pair-wise products and sum them, and look at where we have one times one in the sliding window. Here we have three of them, so the sum is three, and we get the value three in the result, which we call a feature map, okay? Then I slide the window a bit to the right and do the same thing: there is no one-times-one, so the value is zero. And again: three here, zero, five, because we can find the two diagonals here, okay? Zero, three, zero, and three. Okay, so now we have a feature map that summarizes the presence of the small X pattern in the input. The high value of five at the center of the feature map indicates that the pattern X is likely found at the center of the image.
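This worked example can be reproduced in a few lines of plain Python. The 5×5 input below is my own encoding of an X-shaped image, but the filter and the resulting feature map match the numbers just described.

```python
# 5x5 binary image containing an X shape
image = [[1, 0, 0, 0, 1],
         [0, 1, 0, 1, 0],
         [0, 0, 1, 0, 0],
         [0, 1, 0, 1, 0],
         [1, 0, 0, 0, 1]]

# 3x3 filter: the small X pattern we want to detect
kernel = [[1, 0, 1],
          [0, 1, 0],
          [1, 0, 1]]

def convolve(image, kernel):
    k = len(kernel)
    size = len(image) - k + 1  # output size with stride 1, no padding
    out = []
    for i in range(size):
        row = []
        for j in range(size):
            # pair-wise products of filter and sliding window, then sum
            row.append(sum(kernel[a][b] * image[i + a][j + b]
                           for a in range(k) for b in range(k)))
        out.append(row)
    return out

feature_map = convolve(image, kernel)
for row in feature_map:
    print(row)
# the 5 at the center says the X pattern sits at the center of the image
```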
Features like the X here, like a backslash, like a slash, are features we could handcraft, okay? But the innovation of CNNs is to learn the features during training, just like learning the weights in a traditional neural network, in the context of some specific prediction problem. The convolutional layers are not only applied to the input data, for example raw pixel values; they can also be applied to the output of other layers. It means we can have multiple convolutional layers, and these allow for extracting low-level to high-level features. The low-level features are things like lines, dots, edges, colors, or gradient orientations in the image, and the high-level features are things like whole objects or shapes in the image. These layers also reduce the spatial size, the dimensionality, which helps decrease the computational power needed to process the data. One problem with the output feature maps is that they are sensitive to the location of features. For example, we have an X here and the value five here, so it's quite specific to the position of the X: the X will be localized at the center of the feature map. When I talk about this sensitivity, imagine you have a football with some patterns on it and we try to detect it; if we turn the football around, it's still the same football, so it should be detected as the same object. So we should not be too sensitive to the position of the features in the feature map. One approach to reduce this sensitivity is to down-sample the feature maps. How do we do that? That's the job of the pooling layer, what we call pooling. These layers are used to reduce the size of our feature maps in the CNN and compress the information down to a smaller scale. Pooling is applied to a feature map, and it helps to extract broader, more general patterns that are more robust to small changes in the input, like the patterns on the football when it turns around.
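Down-sampling a feature map can be sketched as follows. This is a minimal illustration of my own, assuming 2×2 non-overlapping windows with a stride of 2; the feature-map values are arbitrary.

```python
def pool(feature_map, op):
    # apply `op` (max or mean) over non-overlapping 2x2 windows
    out = []
    for i in range(0, len(feature_map), 2):
        row = []
        for j in range(0, len(feature_map[0]), 2):
            window = [feature_map[i][j],     feature_map[i][j + 1],
                      feature_map[i + 1][j], feature_map[i + 1][j + 1]]
            row.append(op(window))
        out.append(row)
    return out

def mean(values):
    return sum(values) / len(values)

fm = [[1, 3, 2, 0],
      [0, 4, 1, 1],
      [2, 0, 0, 4],
      [1, 1, 2, 0]]
print(pool(fm, max))   # max pooling: keep the strongest response per window
print(pool(fm, mean))  # average pooling: keep the average response
```

Either way, a 4×4 feature map shrinks to 2×2, which is what makes the result less sensitive to the exact position of the detected feature.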
These layers are applied after the convolutional layer and activation, usually once for each feature map. Typically we apply a two-by-two pixel window with a stride of two pixels. For example, with max pooling, the first window in this feature map contains the values one, three, zero, four; we take the maximum, so we get four. Here the maximum is two, and so on, so we get two, five, four, four. In the other variant, average pooling, we take the average value instead: one, three, zero, four gives two, for example. With this we reduce our feature map to something smaller, which also reduces the sensitivity to the location of the features, okay? That's what we try to obtain: invariance to local translation of features in our image. And then at the end, once we have performed the convolutional layers and the different pooling steps, we obtain a number of feature maps. We will flatten these maps, writing them down as a vector of neurons, and then keep going with a normal network up to the output, okay? So that's the CNN. CNNs were initially developed for images, that is, two-dimensional input, but they can also be adapted for one-dimensional sequences or 3D data, so they can be applied to many different types of data in the life sciences. I will not go into detail, but as one of the first successes of deep learning, there are quite a number of papers about applications of CNNs to biological data. The second technique I want to mention is recurrent neural networks, which were developed for a different type of data. Let's just go back to ordinary neural networks first. These networks are only meant for data points which are independent of each other: we have an output ŷ_i determined from each input x_i, and x_i is independent from x_j.
But if we have recurrence, meaning we have data in a sequence such that one data point depends on the previous data points, we need to modify the neural network somehow to incorporate the dependencies between these data points, because the information about the sequential order in the input data cannot be kept, for example, with a CNN. So we need some concept of memory that helps to store the state, the information about the previous inputs, to generate the next output of the sequence. Instead of ŷ_i being a function of x_i as in a normal neural network, now it should be a function of x_i and some state of the memory, h_{i-1}, the last memory state. For example, I have a list of songs to practice, and the input x is the weather, rainy or sunny. If it's sunny, I'm motivated and I will practice the next song on the list; if it's rainy, I'm not motivated and I'll play again the song that I practiced the day before. So the output y_i, the song that I will play, is a function of the weather today and of the state of the memory about the song that I played the day before. That's the memory that we need to include in the model. So an RNN, a recurrent neural network, is also a type of neural network, just somehow more special: it is adapted to work with time-series data, or any data that involves sequences. For example, DNA sequences, or text, sentences in human language; video, a sequence of images; or time series, for example the heart rate or blood pressure of a patient over several days, or stock prices. The recurrent network maintains a memory to store historical information in order to forecast future values. This can be shown in this diagram, just to summarize: we have the input x and the output y, and with the RNN we have a loop over the memory state inside. About that loop: the memory state is defined as a function of the input.
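A minimal sketch of this recurrent loop, my own toy version with scalar weights and a tanh activation (real RNNs use weight matrices): the memory state is updated from the current input and the previous state, and the output is computed from the state.

```python
import math

# toy scalar RNN cell: h_i = tanh(w_x * x_i + w_h * h_{i-1} + b)
w_x, w_h, b = 0.8, 0.5, 0.1   # weights and bias (arbitrary values)
w_y = 1.2                     # output weight

def rnn_step(x, h_prev):
    h = math.tanh(w_x * x + w_h * h_prev + b)  # new memory state
    y = w_y * h                                # output depends on the state
    return h, y

h = 0.0  # initial memory state
for x in [1.0, 0.0, 1.0, 1.0]:  # e.g. sunny = 1, rainy = 0
    h, y = rnn_step(x, h)
    print(f"input={x} state={h:.3f} output={y:.3f}")
```

Note that the same weights are reused at every step; only the state h carries information forward.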
It is a function of both the current input and the previous memory state. Then, in the same way as in a normal neural network, we define the output as an activation function applied to a linear combination of the state h_i, plus some bias. Okay, we have several types of RNN. We have one-to-one, which is like classifying an image. One-to-many is like putting a caption on an image: we have an image and we'd like to generate a sentence that represents the content of the image, okay? In many-to-one, we have many inputs from different time steps that produce a single output, for example in the case of sentiment analysis or emotion detection: we have a text, a sentence, and we have to say whether the feeling, the emotion of the text, is negative or positive. And for many-to-many there are many possibilities, for example machine translation: we try to translate English to French, we have a sentence of five words in English and we try to translate it into a sentence of six words in French, for example. For backpropagation, we return a bit to what Marcus said before. We define a loss function and we calculate its gradient with respect to all the parameters in the network. Here we have a sequence, with many parameters in a defined order, so by the chain rule our gradient becomes a product of many derivatives, something like this. That gives the problem of exploding or vanishing gradients. For example, if each factor in this product has a value just slightly higher than one, multiplied a hundred times it becomes something huge, okay? And if we have something just slightly less than one, say 0.9, after a hundred multiplications we get something very close to zero. That's the case of exploding or vanishing gradients, and in both cases we have a problem.
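That multiplication effect is easy to see numerically (my own illustration): the backpropagated gradient contains roughly one derivative factor per time step, so over 100 steps the product either explodes or vanishes.

```python
# gradient through 100 time steps ~ product of 100 derivative factors
steps = 100
exploding = 1.1 ** steps   # each factor slightly above one
vanishing = 0.9 ** steps   # each factor slightly below one
print(f"factor 1.1 over {steps} steps: {exploding:.3g}")  # huge
print(f"factor 0.9 over {steps} steps: {vanishing:.3g}")  # almost zero
```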
In one case we will not converge, and in the other the learning stops: we cannot gain information anymore. There are some tricks to get rid of that, or at least to reduce the effect, and one is about adapting the network structure, because we came up with quite a naive structure at the beginning that cannot deal efficiently with this situation in the data. For example, take this sentence in English: "I grew up in France, and I do deep learning, blah, blah, blah." After 200 words I come to "I speak fluently..." and I try to predict the next word. The information is not there anymore: we have lost the information that I grew up in France. If the next phrase came right away, predicting "I speak fluently French" would be fine, but after two pages of words, we lose the information. That's why we need to modify our network structure into an LSTM, the long short-term memory network. Here we modify the memory system. We have a memory cell to store information, and we have a forget gate to let unimportant things pass away; there are gates to ignore information from the input and the output, and each gate is a neuron, an activation function applied to a weighted sum of some neurons. With this special RNN structure, the LSTM, we manage to deal with longer sequences in the input data. Let me also come back to a question that we had before about LSTMs: can you apply them for feature extraction? Here we have two kinds of gates, one that wants to store information and one that wants to forget information, so that's the place where we can evaluate the importance of the features. In that case, yes, they can be used for extracting features, for classification as well. As for applications of RNNs, of course they are used in natural language processing, and in the life sciences they can be used, for example, for forecasting the spread of a virus, or for drug development.
In pharmacokinetics, for example, where we need to look at the performance of drugs, we need to design the dosage of a drug over several days. Another technique is the autoencoder. Autoencoders are a type of neural network that learns to efficiently compress and encode the data, and then learns to reconstruct the data back from the reduced encoded representation, to obtain a representation that is as close as possible to the original input. Okay, so from the input here, we encode to reduce the number of features, and then we decode to recover the input. Of course we lose something, and this loss is defined by the difference between input and output; then, again, we need to minimize the loss. Autoencoders reduce the dimensionality of the input data, the number of features. As autoencoders encode the input data and reconstruct the same thing, we can say that they learn the identity function in an unsupervised manner, or we can somehow call it self-supervision: they learn by themselves. And because neural networks are capable of learning nonlinear relationships, thanks to the activation functions, the autoencoder model can be seen as a more powerful, nonlinear generalization of PCA, principal component analysis. While PCA attempts to discover a lower-dimensional hyperplane that describes the original data, autoencoders are capable of learning nonlinear manifolds; a manifold is something like a continuous, non-self-intersecting surface. So, autoencoders can be seen as a kind of PCA, but more general. The autoencoder is only trained to encode and decode with as small a loss as possible, no matter how the latent space is organized. So if we are not careful about the definition of the network architecture, it is natural that during training the network takes advantage of any overfitting possibility to fulfill this job.
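A minimal numerical sketch of the encode/decode round trip, my own toy example with linear layers and hard-coded, untrained weights; a real autoencoder would learn these weights by minimizing exactly this reconstruction loss.

```python
# toy linear autoencoder: 3 features -> 1 latent value -> 3 features
w_enc = [0.5, 0.5, 0.5]          # encoder weights (untrained, illustrative)
w_dec = [0.6, 0.7, 0.7]          # decoder weights

def encode(x):
    # compress the input to a single latent feature
    return sum(w * xi for w, xi in zip(w_enc, x))

def decode(z):
    # reconstruct an approximation of the input from the latent value
    return [w * z for w in w_dec]

def reconstruction_loss(x, x_hat):
    # mean squared difference between input and reconstruction
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

x = [1.0, 2.0, 1.5]
z = encode(x)          # latent representation: dimensionality reduced 3 -> 1
x_hat = decode(z)      # imperfect reconstruction: something is lost
print(z, x_hat, reconstruction_loss(x, x_hat))
```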
So we need to find a way to regularize it; that's the notion of regularization that Marcus also mentioned before. One technique to deal with that is the variational autoencoder. That's a kind of autoencoder whose training is regularized to avoid overfitting and to ensure that the latent space has good properties that enable a generative process. The idea here is that instead of mapping the input to a fixed vector, we want to map the input to a distribution. In other words, the encoder outputs two vectors: a vector of means, the averages, and another vector of standard deviations. So instead of encoding the input as a single point, we encode the input as a distribution over the latent space. The model is then trained in four steps. First, we encode the input as a distribution. Then a point in the latent space is sampled from that distribution: we have a distribution and we just draw one sample from it. Next, the sampled point is decoded and we compute the reconstruction error. And finally, the reconstruction error is backpropagated through the network to modify the parameters, to minimize the loss function. Autoencoders can of course be used on different kinds of images, for denoising and for compressing the images. Denoising, for example, MRI images: when there is liquid inside the imaged region, with this technique we can remove the signal of the liquid from the image, so we can see the relevant information better. In the life sciences, they can be used for feature selection and for dimensionality reduction: imagine we have omics data, transcriptomics, proteomics, et cetera, with several thousands of features, and we need to reduce them, to obtain a set of features of a reasonable size, before performing further analysis, for example.
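The four VAE training steps can be sketched like this. This is a toy of my own: the encoder outputs are hard-coded instead of being produced by a network, and the "decoder" is the identity, just to focus on the sampling step (the so-called reparameterization z = mu + sigma * eps).

```python
import random

random.seed(0)

# step 1: the encoder outputs a distribution, not a point
# (hard-coded here; a real encoder network would produce these vectors)
mu = [0.2, -1.0]       # vector of means
sigma = [0.5, 0.3]     # vector of standard deviations

# step 2: sample one latent point z from that distribution
# reparameterization: z = mu + sigma * eps, with eps ~ N(0, 1)
eps = [random.gauss(0.0, 1.0) for _ in mu]
z = [m + s * e for m, s, e in zip(mu, sigma, eps)]

# step 3: decode z and compute the reconstruction error
x = [0.3, -0.9]                  # the original input
x_hat = list(z)                  # identity "decoder", for illustration only
rec_error = sum((a - b) ** 2 for a, b in zip(x, x_hat))

# step 4: rec_error (plus a regularization term on mu and sigma) would be
# backpropagated to update the network parameters
print(z, rec_error)
```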
Yeah, so next I'd like to talk about the attention mechanism and transformers, okay. Let's have a look at this problem of neural machine translation, automatically translating text from one language to another using AI. The process is that, from an input sentence in a source language, a neural network, usually an RNN or LSTM, plays the role of an encoder to encode the input into a fixed-length vector. And then another neural network plays the decoder, decoding the vector into a sentence in the target language. For example, here we have the sentence "I do deep learning", and the sentence generated from these four words in the target language. And here we see the bottleneck: as the input is encoded into one fixed-length vector, the amount of information propagated is limited. So the performance will differ between inputs of different lengths; a sentence of four words and a sentence of 40 words are not the same thing, and the translation will of course be less efficient for longer sequences. So we need somehow to deal with this bottleneck of a fixed-length vector, and here comes the attention mechanism. It was a kind of breakthrough, in 2017, by the authors... ah, one moment. Parthesha, could you have a look at the Google Doc? "Yes, I'll take a look at the Google Doc and see what happened." Thank you, Antonin. So the attention mechanism was developed to mimic our cognitive attention, to address a weakness of the RNN: given a sequence of elements, the RNN will favor the recent elements and will fail to deal with very long sequences. With attention, the goal is to model the dependency between elements in a sequence without regard to the distance between them, so we no longer have the impact of the sequence order.
Okay, attention is an interface that connects the encoder and the decoder: it provides the decoder with information from the encoder hidden states, okay? The hidden states from the encoder are passed to the decoder, and the decoder takes into account only the hidden states it attends to, computing some values from them to produce the next value in its own hidden layer, okay? With this framework, the model is able to focus on the relevant parts of the input sequence in a selective manner, and then it learns the association between them, okay? This helps the model cope with very long input sentences. There are quite a number of attention types out there. Global attention considers all hidden states; local attention considers only one or a few selected ones. There are other types, like hierarchical attention, which considers both the word level and the sentence level; imagine we have a huge text of several sentences that depend on each other, okay? So coming back to the machine translation problem: we started with one fixed-length vector and replaced it with an attention mechanism. But we still have the neural networks, a recurrent neural network or LSTM, so a problem is still there: the RNN works in a sequential manner, which takes time, and it has the problem of vanishing or exploding gradients, okay? So we have to improve, and here the idea of the transformer comes in. It is still an encoder-decoder architecture, but now it uses only attention mechanisms to obtain the global dependencies between input and output. The type of attention here is self-attention, which is a specific kind of attention. The difference between self-attention and regular attention is that, instead of relating an input sequence to an output sequence, self-attention focuses within a single input sequence to compute its representation.
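Scaled dot-product self-attention, the computation at the heart of the transformer, can be sketched with numpy. This is my own minimal version: three token vectors attend to each other, and for simplicity the queries, keys, and values are all the raw inputs, whereas real models first project them through learned weight matrices.

```python
import numpy as np

def softmax(scores):
    # normalize attention scores to weights that sum to 1 per row
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x):
    # x: one sequence of token vectors; every position attends to every
    # other position, whatever the distance between them
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)   # pairwise similarity (query = key = x)
    weights = softmax(scores)       # how much each token attends to each
    return weights @ x              # weighted combination of the values

x = np.array([[1.0, 0.0],   # token 1
              [0.0, 1.0],   # token 2
              [1.0, 1.0]])  # token 3
out = self_attention(x)
print(out)  # a new representation of the same sequence
```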
This way, a model can also learn the intra-input and intra-output dependencies as well. In a transformer there is no recurrence and no convolution; to maintain the information about element ordering, a positional encoder is used to encode the position of the words in a sentence, for example. At a more advanced level, there are large language models. These are a type of neural network that may use transformers or, of course, something else, but the most successful ones are transformer-based, I would say. They are neural networks that have the ability to generate text reflecting human language, by ingesting vast amounts of text data, as you see with ChatGPT, for example. Two popular types of LLM are BERT and GPT: BERT can be seen as an encoder-only model, and GPT as a decoder-only architecture; they are quite different in design. And I would say the transformer, the attention mechanism, and large language models have great potential for application in different fields, including the life sciences, and there are quite a lot of applications now. For example, large language models are used in predicting protein structure and in predicting the impact of protein variants. Also, quite recently, if I may mention it, they built a foundation model for human single-cell transcriptomics, and a foundation model for human genomics as well, based on the GPT architecture; the first one is called scGPT. I planned to stop there, but I saw a question about generative adversarial networks, so I should mention a bit about that. Okay, so this is also an approach to generative modeling that uses deep learning, of course, and the goal is to produce, to generate new content. It's about automatically discovering and learning patterns in the input data, so that the model can then be used to generate new examples that plausibly could have been drawn from the original dataset.
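The adversarial setup can be written down as two competing losses: a generator maps random noise to samples, and a discriminator scores how likely a sample is to be real. This is a sketch of my own, with a one-parameter generator and a logistic discriminator instead of full networks, and no training loop.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# toy generator: turns noise z into a sample (one weight, not a real network)
def G(z, w_g=2.0):
    return w_g * z

# toy discriminator: probability that a sample is real
def D(x, w_d=1.5, b_d=-1.0):
    return sigmoid(w_d * x + b_d)

x_real, z = 2.0, 0.3
fake = G(z)

# the discriminator wants D(real) -> 1 and D(fake) -> 0
d_loss = -(math.log(D(x_real)) + math.log(1.0 - D(fake)))
# the generator wants to fool the discriminator: D(fake) -> 1
g_loss = -math.log(D(fake))
print(d_loss, g_loss)
```

In a real GAN, the two models are updated alternately by gradient descent on these two losses, each trying to beat the other.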
So the idea is that we have a random input and a model for generating: it will generate some samples. Then we have another model; both models here are neural networks. The first model generates samples and the second model discriminates them: it classifies whether the sample it receives is real or fake, I mean a real one or a generated one, okay? And after several rounds of updating, it becomes able to distinguish the real ones from the fake ones. Of course, the main application of this is generating data when we don't have enough data, for example when we have an imbalance problem and don't have enough data in one class. Or, for example, some years ago, now it's okay, but we didn't have enough single-cell data, so we needed to generate some to be able to perform deep learning. Okay, so now about deep reinforcement learning. I would say reinforcement learning is one of the three basic machine learning paradigms, okay? Along with unsupervised and supervised learning. In unsupervised learning, you say that this thing is like that other thing. We don't care about what they are; we just say that they are the same, that they are very similar. The unsupervised algorithms will learn the similarities without classes, without labels, without names; they can just say things are similar. And with this, we can detect abnormal behavior; we can recognize something that is unusual or dissimilar to the normal ones. Supervised learning is when we say that this thing is, for example, a double bacon cheeseburger: we put labels, we put names on things, and the algorithms learn the correlations between the data samples and the labels. They require a labeled dataset, of course, okay?
So the labels are used to supervise, to correct the algorithm: when the algorithm is wrong, you correct it, based on the labels it knows, to improve the model. And the third paradigm is our topic here, reinforcement learning. It's something like saying, "eat that thing, because it tastes good and it will keep you alive longer." So we have actions based on short- and long-term rewards, such as the amount of calories you ingest, or the length of time you survive, okay? Reinforcement learning can be thought of as supervised learning in an environment of feedback, where we have an agent. An agent takes actions: for example, a drone makes a delivery, Super Mario navigates in a video game, in a playground; the algorithm is also an agent, and in life, the agent is you, of course. We have an environment, the world in which the agent moves, okay? And it responds to the agent, okay? The environment takes the agent's current state and the agent's action, and returns the outcome to the agent. Just a question from Andrea, about single-cell RNA-seq. The paper is not published yet, but there is one very well-known work, scGPT, which is a foundation model for single-cell data, single-cell transcriptomics data, and it includes quite a number of cells, about a hundred million or something, if I remember well. So it is a foundation model; imagine it becomes a model like ChatGPT: we have a basic one there, and then we can compare, we can analyze our data based on this foundation.
So maybe you don't need to generate data anymore, that's what I think, but I'm not sure how good a generative model can be at producing synthetic data, because we need a specific biological question to test, and the way we generate our data will depend a lot on our question. So whether it is deep learning or other things, like a Bayesian network, et cetera, or other basic methods, I'm not sure how they compare to each other. Yeah, so let's come back to reinforcement learning. Okay, we have an agent, the agent performs some action, and the environment receives the action and gives feedback to the agent, okay, through some reward. The agent takes this reward, adapts its actions, and so on. That's the way reinforcement learning works. So here you have the state and the action: you have the state of the agent, and you have a neural network, okay, which needs to output the Q-values, okay? That's the action value, the expected long-term return; it's not a short-term reward, it's something long-term. What do we get in the long term, whether we live longer or not, not just how many calories we receive when we eat some burgers, but whether we live longer or not, that's the idea. So the neural network here takes the state and, for the possible actions, computes the expected return, and we optimize this to infer a somehow optimal policy, the set of actions of the agent. Okay, so I think this one might be useful; of course, it has been applied in different areas, but it might be most useful in things like brain-machine interfaces, where we have the interaction between human and machine, and in robotics as well.
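The Q-value idea can be sketched in its simplest, tabular form. This is a toy example of my own: an agent on a 5-cell line earns a reward only at the rightmost cell, and the Q-learning update propagates that long-term return backwards through the states. Deep Q-learning replaces the table below with a neural network that outputs the Q-values.

```python
import random

random.seed(0)

n_states = 5                 # states 0..4, reward only on reaching state 4
actions = [-1, +1]           # move left or move right
alpha, gamma = 0.5, 0.9      # learning rate and discount factor
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}

for _ in range(500):                       # episodes
    s = 0
    while s != n_states - 1:
        a = random.choice(actions)         # explore randomly
        s_next = min(max(s + a, 0), n_states - 1)
        r = 1.0 if s_next == n_states - 1 else 0.0   # the reward signal
        # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
        best_next = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next

# the learned policy: in every state, moving right has the higher Q-value
print({s: max(actions, key=lambda a: Q[(s, a)]) for s in range(n_states - 1)})
```

The discount factor gamma is what makes this a long-term return: states far from the reward still get a positive, but smaller, Q-value.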