I will start the last section of this course. I hope that it has been as exciting and instructive for you so far as it has been for me, and I hope that you will also have fun in this last section, which is dedicated to deep generative networks in single-cell transcriptomics.

Before I start, I'll introduce myself. My name is Panagiotis Papasaikas. I'm a computational biologist, a research associate at the Friedrich Miescher Institute, so I'm a colleague of Michael Stadler and Charlotte Soneson. I mainly work on transcriptomic data and single-cell data, developing computational methods for the analysis of such data, and for a few years now I have had an interest in deep learning methods that are applicable to single-cell analysis.

Okay, so let's see how this section is going to go. This is the overview. I'm going to start with a brief introduction to the TensorFlow and Keras backends for deep learning and machine learning analysis. Then I'll do a short introduction to deep learning and what we mean by deep learning. At that point we'll stop for the first exercise, which is implementing a very simple model, called a multilayer perceptron, in Keras. We'll see how this model can be implemented using either the R Keras API or the Python Keras API. After we're done with the exercise we'll have the first break, depending on how we're doing on time.

After that we'll come back and continue with the theoretical presentation, talking about deep generative networks, more specifically variational autoencoders and variational inference, and briefly about generative adversarial networks. I'll mention some existing applications and tools that use deep generative networks in single-cell transcriptomics. Then we'll move to the second exercise, which is the longer one. We will be doing variational inference in Keras, specifically using the R API for Keras.
And again, depending on how we're doing on time, at some point, probably during the second exercise, we'll have the second break. Then we'll come back after that break, hopefully we'll be done around five, and then we'll have the closing comments for the course.

So we start with a short introduction to TensorFlow and Keras. What is TensorFlow? TensorFlow is an open source, general purpose numerical computing library. It was not developed specifically for deep learning, so it has more general optimization libraries, such as libraries that implement gradient descent or adaptive optimizers that are used in more general optimization problems. It was originally developed by the engineers in the Google Brain team who were conducting machine learning research. With the advent of deep learning and its popularity it took off, and it has become probably the most used backend for deep learning research. It is hardware independent, so any code that you write in TensorFlow can run on CPUs, on GPUs or, if you have access to them, on the deep-learning-specific hardware developed by Google, the TPUs. Another advantage of TensorFlow is that it supports large data sets and distributed execution. So it's a very powerful framework for machine learning and deep learning analysis.

What are the model building blocks in TensorFlow, and also in Keras? Keras is nothing more than a high level API that uses the TensorFlow backend. In terms of data objects, the building blocks are tensors, and tensors are just multidimensional arrays. In this table I have some examples of specific data types, how they correspond to specific tensors, and how they would be represented as R objects. For example, if you just have a vector of cell labels, this would be a one-dimensional tensor in TensorFlow, and as an R object it would just be a vector.
If you have a gene count matrix, that corresponds to a 2D tensor where the samples are on the rows and the genes on the columns, and in R this would be a matrix. If you have longitudinal gene expression data, we are talking about a 3D tensor where you again have samples in the rows and, in the other two dimensions, the genes and the timestamp; as an R object that could be represented as a 3D array. Microscopy images would be a 4D tensor, because there you have samples, the height and width of each picture, and also the channels, for example the RGB color channels in which you are taking the pictures. In R this would be represented as a 4D array. Finally, video data would be a 5D tensor, where samples could again be the first dimension, then you have the height, the width and the channels of each picture, and because you again have a timestamp you also have the frame. So that would be a 5D array.

Please, if you have any question, either raise your hand, or you can also interrupt me or post it on the group chat so that I can see it.

Okay, so apart from the tensors, an important building block for the models in TensorFlow and Keras are layers, and layers are basically just units of numerical computation. In TensorFlow these are implemented by TensorFlow operations, which perform these numerical computations, these transformation functions, on the tensors, and these functions are typically parametrized by weights. So, for example, an addition, a matrix multiplication, sampling or taking gradients is a TensorFlow operation or a Keras layer. Here you can see a very simple scenario where you have three different inputs that are weighted by specific weights. Then you take the summation of those, and this transformation function would be a Keras layer.
You're adding some bias term, and these are going to give you your output.

Finally, you can combine layers and tensors in order to construct computational graphs. Typically, though not necessarily, these are directed acyclic graphs. In this graph the nodes are the layers, so the computations, and the edges are the tensors. The tensors, your multidimensional arrays, flow through the computational graph, transformations happen to them, and hopefully they do something useful. This is the schematic that you see here, where the tensors flow through a graph in which you have specified the transformations that should happen to those tensors. That is actually where TensorFlow takes its name from: tensors flow through the computational graph in order to perform a useful task. If you have a fully specified graph from input to output, this is a model.

Okay, so the basic building blocks, again, are tensors, which are your data; layers, or TensorFlow operations, which do numerical computation, transformation functions on the data; and combinations of those, which specify a graph that gives you your machine learning model.

All right. So how about Keras? As I mentioned earlier, Keras is just a high level API that can use TensorFlow as its backend, but not only TensorFlow, and that basically provides convenient wrappers for commonly used layers or computation graphs. If you have a transformation function that is very commonly used in deep learning, for example a convolutional transformation or an activation function, this is implemented in Keras as a specific layer. You can also have more complex layers that combine different operations, and this is the case that I mention here, where more complex computation graphs are represented in Keras as layers. So you have a wrapper for a more complex graph of layers and layer operations in TensorFlow.
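As a rough illustration of the building blocks just described, the following pure-Python sketch treats nested lists as tensors, a weighted sum plus a bias as a layer, and a chain of layers as a model. It deliberately does not use TensorFlow or Keras; the helper names `rank`, `weighted_sum_layer` and `run_graph`, and all the numbers, are made up for this example.

```python
def rank(tensor):
    """Number of dimensions of a nested-list 'tensor' (0 for a scalar)."""
    n_dims = 0
    while isinstance(tensor, list):
        tensor = tensor[0]
        n_dims += 1
    return n_dims


def weighted_sum_layer(weights, bias):
    """Build a layer: a function mapping an input vector to sum(w_i * x_i) + bias."""
    def layer(inputs):
        return sum(w * x for w, x in zip(weights, inputs)) + bias
    return layer


def run_graph(layers, tensor):
    """Let a tensor flow through a simple chain of layers, from input to output."""
    for layer in layers:
        tensor = layer(tensor)
    return tensor


# Example data objects (illustrative values only).
cell_labels = ["T cell", "B cell", "NK cell"]   # 1D tensor: a vector of labels
count_matrix = [[0, 3, 1], [2, 0, 5]]           # 2D tensor: samples x genes

# Three inputs, three weights, one bias term, as in the scenario above.
first_layer = weighted_sum_layer([0.5, -1.0, 2.0], bias=0.1)


def double(value):
    """A second, parameter-free transformation, just to have a two-node graph."""
    return 2 * value


output = run_graph([first_layer, double], [1.0, 2.0, 3.0])
print(rank(cell_labels), rank(count_matrix), output)
```

In a real framework the layers would also carry trainable weights and the graph could branch and merge; the chain here is only the simplest possible case.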
What I show here is actually taken from the source of the first exercise example that we're going to go through, and what we're doing here is defining a model called a multilayer perceptron. In this case it's not really multilayer; it has only one hidden layer. The structure of this graph is the following. You have an input layer which is constructed to receive as input 28 by 28 grayscale images of digits, which come from the very commonly used MNIST dataset. The input is actually flattened out to a one-dimensional feature vector per sample, so you have 28 by 28, that is 784, pixels as the input. They go through a single hidden layer and then they produce an output. So it's a very simple model.

The task that we'll try to achieve with this model is digit classification. We'll try to predict whether the input that we get from a specific MNIST image corresponds to digit 0, 1, 2, 3 and so on. So the output has 10 different nodes, each node corresponding to one of the digits. And the code that I show here is not just a chunk of the code that specifies the model; this is the whole code that specifies the complete model. You can see that with three or four lines, either in R at the top here or in Python, you can specify a model that does something relatively complex, and you will see that it actually performs really well. That's why we say that Keras is a very high level API: it allows you to abstract away much of the underlying detail of constructing the graph with TensorFlow, and it allows very easy experimentation so that you can quickly build and try deep learning models.

We'll see this in more detail in the exercises, but the way that we specify the model is that we basically stack layers on top of each other. In the first layer, the input layer, we have to specify the size of the input, which is 784 again.
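To make the stacking concrete, here is a small pure-Python sketch (not Keras code) that walks through a stack of fully connected layers, passes the size of each layer on as the input size of the next one, and counts the weights involved. The function name `stack_dense` is hypothetical, and the hidden layer size of 256 is an assumption for illustration; only the 784 inputs and the 10 outputs come from the description above.

```python
def stack_dense(input_size, layer_units):
    """Return (n_inputs, n_units, n_parameters) for each fully connected layer.

    Only the first layer is told its input size; every later layer takes
    the number of units of the layer before it as its input size.
    """
    shapes = []
    n_inputs = input_size
    for n_units in layer_units:
        # One weight per input-to-unit connection, plus one bias per unit.
        n_parameters = n_inputs * n_units + n_units
        shapes.append((n_inputs, n_units, n_parameters))
        n_inputs = n_units
    return shapes


INPUT_SIZE = 28 * 28   # flattened 28 x 28 MNIST image: 784 pixels
HIDDEN_UNITS = 256     # assumed value, not taken from the exercise
N_CLASSES = 10         # one output node per digit

layers = stack_dense(INPUT_SIZE, [HIDDEN_UNITS, N_CLASSES])
total_parameters = sum(n_parameters for _, _, n_parameters in layers)

for n_inputs, n_units, n_parameters in layers:
    print(f"{n_inputs} inputs -> {n_units} units: {n_parameters} parameters")
print("total trainable parameters:", total_parameters)
```

With these assumed sizes, even this very small model has roughly two hundred thousand trainable weights, almost all of them in the first layer.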
For the next layers we don't have to do that, because the next layers can infer what their input is going to be from what we specified here. In the first layer we specify the input, but we also specify how many units this layer has. That implies that the input to the next layer is going to have the dimensionality of the units of the first layer. We also specify a transformation function. In this case it's an activation function called ReLU, which I show here at the bottom. It returns the value of the input for any value greater than 0, and returns 0 if the value of the input is 0 or less. These activations give you a new output that is combined and passed to the last layer of the model, the output layer, which has the 10 nodes corresponding to the digits. There we have a softmax activation function. What it does is divide the output by a partition function, to make certain that what you get back is a probability distribution, so in each node what you have in the end is the probability that the input you showed corresponds to a particular digit.

Okay, do you have any questions so far?

Yeah, I have a question. With ReLU activation it seems that it doesn't really work with negative values.

No, it works with negative values, but what it does is that if it gets a negative value it drops it to zero.

Yes, so somehow it gets rid of the negative values.

Exactly, it gets rid of them, but it's not that it doesn't work with negative values. It actually expects that at some point it could get negative values, and if that is the case it's going to return a zero.

Okay.

Okay. I think there's one more question, from Martin.

Yes, I just wanted to ask: in the Python code we're importing the dense layer. What are the other layers, roughly, and why are we using the dense one right now?
Okay, so like I said, a layer in Keras is basically any transformation that you are going to perform on the data, and we're going to see several of those transformations in the examples. The dense layer just specifies that you have a fully connected layer from the input to the first hidden layer. Every input is connected to every unit of the hidden layer, so you have connections from everything to everything, and this is what a dense layer is. In terms of what the possible layers in Keras are, there is, as I said, a huge variety, because any numerical transformation that you do on the tensors is implemented in Keras as a layer. So a layer is nothing more than a transformation that is done on the tensors. Actually, not any transformation: it has to be a differentiable transformation of the data, and the reason is that if you don't have a differentiable transformation you cannot fit your model, you cannot do backpropagation on the model. But this we're going to mention a little bit later.

Okay, thank you.
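The pieces discussed in this exchange (a dense layer, ReLU and softmax) can be sketched in plain Python as a single forward pass. This is only an illustration under stated assumptions: the weights are random and untrained, the "image" is random numbers standing in for a flattened MNIST digit, the hidden size of 16 is arbitrary, and the function names are made up rather than taken from Keras.

```python
import math
import random


def relu(values):
    """Keep positive values unchanged; return 0 for anything that is 0 or less."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Exponentiate and divide by the sum, so the outputs are positive and add up to 1."""
    shift = max(values)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases):
    """Fully connected layer: every input is connected to every unit."""
    return [
        sum(w * x for w, x in zip(unit_weights, inputs)) + bias
        for unit_weights, bias in zip(weights, biases)
    ]


def random_weights(n_units, n_inputs, rng):
    """Small random weights; a trained model would have fitted values instead."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_inputs)] for _ in range(n_units)]


rng = random.Random(0)
n_inputs, n_hidden, n_classes = 784, 16, 10   # hidden size chosen arbitrarily

w_hidden = random_weights(n_hidden, n_inputs, rng)
b_hidden = [0.0] * n_hidden
w_output = random_weights(n_classes, n_hidden, rng)
b_output = [0.0] * n_classes

pixels = [rng.random() for _ in range(n_inputs)]   # stand-in for one flattened image

hidden = relu(dense(pixels, w_hidden, b_hidden))
probabilities = softmax(dense(hidden, w_output, b_output))
predicted_digit = probabilities.index(max(probabilities))

print("sum of probabilities:", sum(probabilities))
print("predicted digit (untrained, so meaningless):", predicted_digit)
```

Note that ReLU accepts negative inputs without any problem and simply maps them to zero, which is the point made in the answer above.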
All right, so there is of course a life beyond TensorFlow and Keras. As I mentioned, Keras is a high level API that supports not only TensorFlow but multiple deep learning backends. I think at this point the Keras API specification also supports the Theano and CNTK deep learning backends. What this means is that you have the same type of abstraction, the exact same code, independently of whether, in the layers below, the backend is using TensorFlow, Theano or CNTK for the actual construction of the graph. But apart from Keras, TensorFlow, Theano and CNTK, there are also other machine learning frameworks that are supported by different companies. For example, CNTK, which I mentioned before, is supported by Microsoft, and you have PyTorch, Gluon, MXNet, Caffe2 and Chainer. Depending on exactly what applications people are working on, on what level they're working, whether they are developers of models or end users, and of course what their connections to different companies are, people have different preferences. But I hope that the material that we're going to go over today is relatively general, so it is not really going to matter what your choice of deep learning backend is. I hope that what you learn today is going to be applicable no matter what you end up using.

So what is deep learning? Deep learning models take an input and transform it to an output via successive layers of increasingly abstract and meaningful representations. This is a very high level description of what deep learning is, and I'm going to give an example here to try to explain what I mean by these meaningful representations. Suppose you have this type of raw data, these two-dimensional points, that you try to separate into two categories, the black and the white. This is your task, this is what you try to achieve. One useful transformation that you could do on the
data would be a rotation, so basically a coordinate change, because if you do the right rotation you can then use an x value as a threshold for separating the two categories. What we have done here is a rotational transformation of the data that resulted in a representation different from the raw data that we had to begin with, and one that was useful for our task at hand. In this case the task was to separate the black and the white points.

This is the very basic idea, the basic spirit, of deep learning. Your initial raw data go through successive layers of transformations and representations, and the goal of these successive transformations is to achieve a new representation of the data that is closer to the prediction domain, the prediction task that you have to achieve. What happens in these successive layers of representation is that extraneous information, information that you don't need, is filtered out, and useful information is extracted so that you can be successful in your task.

When I say that the internal transformations result in successive representations, I mean it literally. You actually have different representations of your original raw data, and you can see an example here, which again takes an instance of an MNIST image, in this case a handwritten digit 4. In the successive layers this digit 4 is converted into different representations that become more and more abstract. As you go through the layers it becomes harder and harder for the human eye to recognize that what you started with was a digit 4, but somehow these representations are useful to the model itself in order to perform the task that we have given it, and the task here is to classify whether an image comes from the class 0, 1, 2 and so on.

Okay, what would be extraneous information in this case? For example, if
the digit is slightly rotated, or because of noise there are some pixels that light up in the periphery of the image, this could be extraneous information. So the model should learn, in these successive layers of representation, to get rid of that extraneous, useless information and pick up the most salient features that are useful for the task.

Of course, you probably already understand that when I say meaningful representation this is a relative concept, because what is meaningful always depends on the task at hand. If we had a different task, for example if our task was not to classify the digits but, I don't know, to decide whether a digit is handwritten or not, or maybe to denoise the digit, then the features that would be useful would be completely different, and the internal representations of the model would be completely different. So always keep in mind that what is a meaningful representation is relative to the task at hand.

So where does the "deep" in deep learning come from? It comes from exactly this characteristic of those models, that it's a multi-layered representation. You go deeper and deeper into more abstract representations of your data.

Okay, so I think there's a hand up from Dania.

Yes, so maybe along the same lines: what is a meaningful transformation, how do you know? Do you just try and see what works best?

Well, you don't have to. In the old days of machine learning, let's say, the pre-deep-learning days, people actually had to engineer what representations were useful, to engineer features, for example, that were important for the model. But in this case the only thing that you have to do is to specify an architecture for your model. You don't define what representation is useful or not. These representations that you see here are not representations that somebody decided or designed
by hand. These are representations that were decided by the model, considering the task at hand. The model did its best, and we'll see how it actually does that, in order to achieve its classification task, and it decided that the most useful internal representations are the ones you see here. So you don't have to go through this process of deciding what a meaningful representation is. What you have to decide is the architecture of your model, and, very importantly, you have to decide what your objective is, what task you try to achieve, because this is going to be the arbiter of what is going to be used as an internal representation.

I think there was a question that I missed, which I guess refers to the slide before, from Sebastian, who asked whether there are some hidden parameters that cannot be accessed from different backends. No. I'm assuming that you're working on a specific backend, right? If you're working with TensorFlow, for example, all the hidden parameters are accessible. You can see what those hidden parameters are, for example at the end of the model fitting. I don't know if that is the question, or if it is about moving from one backend to another.

I just wonder where the complexity comes in, because Keras is an abstraction at the highest level, and I wonder whether there are sort of hidden parameters that you could tune but don't have access to, if this exists.

Yeah, okay. So if I understand correctly, yes, there are cases where the wrappers hide from you useful design parameters of the underlying infrastructure of the layer, and the only way to access them is through direct access to the TensorFlow backend. You can do that: working in Keras does not prevent you from accessing the underlying TensorFlow backend at the same time. Often this is actually a very useful thing to do, a necessary thing to do.

Thanks.

Okay, so how is
the model trained? How are the parameters of the model going to be fitted? As I mentioned before, a very important decision that you have to make is the loss function, because this is going to measure the success of your model for the task at hand. For example, in the case of image classification, the loss function is the disagreement between your decision, in terms of the digit that has the maximum probability, so the arg max of the final output, and the actual label, which is given, so you know it. That would be the loss function.

So how do you fit the parameters of the model? The basic idea is that the parameters of the model are slowly updated towards a direction that provides an improvement relative to the loss function. The updates are done using an algorithm that has been known since the 80s, the backpropagation algorithm, together with the chain rule of differentiation, which traverses the model from the tail to the head, so from the output towards the input. The main idea of the chain rule is that you can find the gradient of a parameter relative to the overall loss function if you already know the gradient, relative to the loss function, of the parameter visited just before it. This allows you to traverse the model, the transpose of the graph, step by step towards its head in order to update all the parameters. For example, if in this simple model here, which goes from input to output, you had to update the parameters w1 and w2, this is how you would do it: you would calculate the gradients of those parameters relative to the loss function, assuming that you already knew the gradient for the subsequent parameters.

Okay, so this is the basic idea. You try to update the weights of the model towards the direction that provides an improvement relative to the loss function that you have specified, and for that you use the backpropagation
algorithm in order to traverse the graph from the output towards its input. The direction towards which the parameters need to move is computed using stochastic gradient descent. Basically, you calculate the gradients of the parameters relative to the loss function, so you calculate the partial derivatives with respect to your parameters, and you decide towards which direction you have to move. You can take smaller or bigger steps in that direction, and this is specified by a parameter that is typically called the learning rate. This overall procedure of backpropagating and updating your weights using stochastic gradient descent is performed by constructs that are called optimizers in Keras and in TensorFlow. So the optimizers are the internal machinery that actually performs this task of backpropagating, calculating the gradients and updating the weights.

This loop is repeated many, many times using small splits of the data that are called batches. You don't see all the data at once; you see the data in smaller batches. But in every full loop of training you have to see all the data, so you see many batches until you have seen everything, and you do this many times. Every complete loop through your whole data set is called an epoch, and you do this several times until you reach convergence, that is, until your weights do not change anymore. This is how you update your weights.

You can see a schematic of what this process looks like for two parameters, m and b. The z axis here is the loss, and you can see that you move in this optimization landscape defined by your two parameters m and b, trying to reach a minimum in this landscape. And you can understand, for example, that the size of the step that you take is an important parameter, because, for example, if you have already
reached the edge of this funnel and you try to find the global minimum here, then if you take very large steps you may keep jumping around without ever reaching your global minimum.

All right, so there's a question from Ludovic: how does the model update the weights per batch within one epoch? As I said, every time you see a small batch of the data the weights are already updated. You don't wait until you have seen the complete data set; every time you see a mini-batch of your data you update your weights. In one epoch you will see your full data set, but let's say that your data set has 5000 samples and your batch size is 500: you will actually do 10 full backpropagations on your network.

Okay, yeah, maybe I can comment on that question. I sometimes use such models, and I have always had a hard time understanding the compromise in the number of batches to use per epoch, because for me it's hard to understand whether the model will perform better if you have, let's say, a huge batch per epoch compared to a small batch. I don't know if I am clear.

Yes, you are.

Okay, perfect.

So yes, there is a trade-off, basically a balance between the batch size that you're going to use and the learning rate. If you use a small batch size, you can understand that the updates that you get are going to fluctuate a lot, because you get information only from a relatively small sample of your data. If you have a very large batch size, you can actually increase the learning rate to learn faster, because in every update you get information from a lot of data. If on the other hand you use small batches, that means that you have a lot of fluctuation and a lot of jumping around because of the small sample size, so in that case you would have to decrease
your learning rate. So basically, one of the impacts of the batch size is how much jumping around you do in the optimization landscape. Is that clear?

Yeah, because for me the batch size also has a big impact. I can imagine that ideally you would like to have all the data seen at once in each epoch, so that you can really learn from everything. It's slower in the learning process, I guess, because you will have an update only once every epoch, so it's really slower, but I think it performs better because you see more data, although it takes really long. On the other hand, if you use a lot of batches it will learn faster, but maybe that's not good because you don't see enough data in each batch. So my question is more that I have always had a hard time tuning this parameter, so maybe you have some insight on that.

Typically I decide on the batch size based on my memory constraints. Depending on whether you're working, for example, with count data or with image data, and if you're working on GPUs, there's only so much data that you can fit on the GPU. So one important constraint that helps decide the batch size is basically what the GPU memory can fit. If the GPU memory can fit all your data, then I think you can also use a batch size that is basically your complete data set size, as long as your learning rate is big enough, because if your learning rate is very small and you have a huge batch size, your convergence is going to be extremely slow.

Makes sense, thank you.

All right, so just to finish this section of the introduction to TensorFlow, Keras and deep learning: why now, why all the hype, what spurred the revolution? The research on neural networks started in the early 70s, and the basic algorithms like back
propagation and gradient descent have been in place since the 80s. So what happened that suddenly created this revolution of deep learning? Mainly, there were advances on three fronts.

The first front was hardware advancement: the ability to do massively parallel computation, mainly on GPUs and of course now also on TPUs.

The second front was improvements in the algorithms that are used: robust backpropagation that is able to train models with many, many internal layers. Until relatively recently this was not possible because of what is called the vanishing gradient problem, which means that as you go from the end of the model towards its beginning, the impact of the loss function on the parameters becomes smaller and smaller, making it extremely hard to train models that have a lot of internal layers. There were also improvements in the optimizers. As we will see in the exercises, you now have adaptive optimizers that decide, for example, on a different learning rate per parameter depending on how the parameters move on the optimization landscape. You have regularization techniques that help you avoid overfitting. So many things came together in terms of improved algorithms in a relatively short time, and these allowed you to train models with a lot of internal layers.

Finally, another important thing was the availability of many high quality, sometimes labeled, data sets. This came about because of the massive adoption of the web by users, of course, but also because of advances in technology and experimentation in the hard sciences. You have a lot of data coming from high energy physics, astrophysics, and of course biology and so on, so you also have a lot of available data sets to train your models.

So these were the three main things. And this,
although it's kind of a chicken-and-egg problem, also resulted in improved architectures, because after the first big successes of deep learning were demonstrated a lot of interest turned to it. There were improved architectures, and then also the development of user-friendly platforms that lowered the threshold for somebody entering the field.

Okay, and I don't think I have to go into the successes of deep learning. If anything, the field is overhyped, so I'm sure you have heard of many of those, but pretty much everybody's life is now touched by deep learning: refined web searching, spam and fraud detection. I'm sure you've seen examples where a machine can do near-human image classification and near-human machine translation; we're probably all using either DeepL or Google Translate. Machines can play chess at a superhuman level, and also Go, using basically no information apart from the rules of the game, so they don't even need expert games to be trained on. There is autonomous driving and natural language processing, and I'm pretty sure you've also heard of many clinical applications, like prediction of protein folding, medical image processing, drug design and diagnostics.

Let me just make a declaration about things that we're not going to talk about, things that you have probably heard of, things that you may have hoped we were going to cover here. As you understand, deep learning is not a topic that I can cover, or could cover in any case, in this time frame, or in any time frame actually, speaking for myself. There are a lot of things that are not part of my expertise, that I'm not using or am not familiar with, and some things that I do have some knowledge about but that are not of interest in the present context. Here I've ordered them in my perceived hierarchy, from what is maybe most relevant to
transcriptomic analysis to less relevant. So at the top are maybe things that you would like to look into at some point: distributed multi-GPU training; regularization techniques, which I very briefly mentioned but did not go into detail about, and we don't have time to go into detail about different regularization techniques like L1 and L2 regularization, how they are combined, and so on; how you can construct custom layers by directly accessing the TensorFlow backend; batch normalization, which is another technique that prevents overfitting but most importantly allows you to reach much faster convergence rates in training; eager versus deferred execution, which are two different ways of executing code in TensorFlow, rather, not in Keras; convolutional neural networks, geometric deep learning, recurrent layers, attention models, reinforcement learning. These are things that many of you have maybe heard of, but I don't think they are of particular interest at this point in transcriptomic analysis, at least the last of those that I mentioned, unless somebody has a different opinion or knows something that I don't, which is quite possible; in that case, please share your experience with those in transcriptomics.
Okay, so before I go to variational autoencoders, I think it's much easier if I start by explaining what an autoencoder is, not a variational autoencoder, because the architecture is a bit simpler. So what are autoencoders? They are unsupervised models, so in that respect they have the benefit of easy access to large training sets, because you do not need labelled data. The objective is to obtain an output that resembles your original input as much as possible, but this happens in a particular way. What you do is that you
start with your input and you squeeze it more and more, into successively lower- and lower-dimensional representations, until you reach a bottleneck. This bottleneck is typically referred to as the latent code. Then you start expanding, blowing up the dimensions again, until you reach the dimensionality of your original input.
What is the training objective here, what is the loss function that we're trying to minimize? The training objective is again to get an output that looks as similar as possible to the input, so typically the loss that is used is the reconstruction loss, which measures the distance between your input and your output after you go through this process of squeezing the data and then expanding them back up again.
What do you gain by this process, why do you want to do that? The main thing is that, as you go through this process of squeezing the data more and more while at the same time being able to reconstruct them, you force your model to retain only the most important, most salient features of your input. So the bottleneck, the latent code, is a very concise representation of your input, a much lower-dimensional representation that is still able to reconstruct something that more or less looks like your original data, but having left out a lot of unimportant details.
The first part of the model, which does the squeezing of the data, is typically called the encoder, and the second part of the model, which blows up the dimensions again, is the decoder. This nomenclature, encoder and decoder, is not specific to autoencoders; there are actually many deep learning models that have an encoder-decoder architecture, but in the context of autoencoders you use the same nomenclature. Here
I have a symmetric architecture between the encoder and the decoder, so you can see that the numbers of nodes in the hidden layers, as you go from the input to the bottleneck and from the bottleneck to the output, are symmetric; they are the same: these two, and these two, and these two. But this is not a strict requirement. It's what people typically use, but there's no specific reason why the architecture should be symmetric. As I said, the main advantage of this model is that you learn a representation of your data, given by the latent code in the very middle, in the bottleneck, that is very concise and represents the input. That's what I'm trying to capture here. Here is an original image that has gone through this process of squeezing and re-expanding: this is the original image and this is the reconstructed image. You can see that the reconstructed image is basically a much fuzzier version of the original one. The latent code here is of much lower dimensionality: it's of dimension 32, and here you can see the intensity of each of the 32 nodes of the latent code. But this 32-dimensional vector does a decent job, not a great job but a decent job, of reconstructing the original image, which is of much higher dimensionality, something like 40 by 40, and also has multiple channels.
Okay, there are multiple variants of autoencoders: deep, stacked, sparse, variational, denoising, adversarial, disentangling and so on. Today we're going to focus on the variational autoencoder variant, but before we go into that I'd like to talk a little bit about different applications of autoencoders. Autoencoders did not start with transcriptomics; they have been used for quite some time now in different tasks: for dimensionality reduction and
visualization. Here is another example, using the MNIST data set, where what you want to get is a concise representation of the digits in the latent layer. You can pass this latent layer to t-SNE to obtain a visualization of your digits. They have been used for denoising and completion. Here you can see an example of digit denoising: on the top you have different noisy versions of the data, and you can pass them through the autoencoder in order to denoise the images. You can do image completion, in this case face completion, where you have your input, you hide different parts of the face from the trained model, and the autoencoder is able to complete the faces. They have also been used for other tasks, like feature manipulation, that are more inferential kinds of applications. For example, here you have an original subject in a picture and you want to infer how the same subject would look if you added a particular facial feature, for example glasses. You can do that by manipulating latent codes, as we will also see in our example. Basically, you have a latent vector that corresponds to the face without glasses, you add to it a vector that corresponds to the facial feature you want to add, in this case the glasses, and the sum of these two gives you, in latent space, a face with glasses. You can decode that back to get a picture of how the same face would look with that particular facial feature. You can also slowly morph from one type of object to another: you can traverse the latent space in order to morph slowly from a starting picture to a final picture; in this case you're just morphing a short chair into a tall chair.
Okay, so maybe you already see why this is a natural
thing to use in single-cell transcriptomics. Single-cell data are high-dimensional, so there's a lot of interest in visualization. They are also very noisy, corrupted data; they have dropouts and large-magnitude outliers, so denoising is also of interest. But to me probably the most interesting application of deep generative networks and variational autoencoders in single-cell data is actually the last one, which is to try to do inference on your data sets, and we'll see what I mean by that a bit later.
So can I ask just a question? Yes, of course. Just to understand the denoising and completion in the context of autoencoders: for me it's hard to see, because for me the autoencoder has only one input, you have no separate training data, and your image is just noisy, so how can the autoencoder learn to find back, let's say, the right solution? I completely understand it in the case of dimensionality reduction, but here, if you can just comment; maybe I missed something.
Yeah, okay. So basically the intuition here is that the pixels in the images are not uncorrelated with one another, and in the case of genes these are also not uncorrelated with one another. So information about one neighborhood of pixels can give you information about a nearby neighborhood of pixels. The fact that here you have a lot of pixels that light up together suggests that this pixel that is black should also light up. That is the high-level intuition. So basically what the autoencoder learns is relationships between the pixels. But I should add one more thing here: in many applications where you specifically want to learn a denoising autoencoder, what you do is that in your input data you
actually add noise, different types of noise, like salt-and-pepper noise, to your data, and what you measure in terms of reconstruction is the ability of the output to reconstruct the picture before you added the noise. So basically, during training the model learns how to filter out noise and how to complete.
So basically the autoencoder solves the equation by adding the parameters that you have defined at the beginning? I'm not sure I understand the way that you put it, but I can summarize it again. Okay, is it clear for everyone, or maybe I can explain again?
If I can, maybe a related question about the face completion. I'm really impressed by the second line, where the program guesses, I mean knows, how to build the eyes of the guy. Does it mean that it uses this correlation between the images to know how the eyes are defined?
It uses the complete set of face images that you have shown it in order to somehow infer that when you have a particular structure in terms of pixel intensity and pixel pattern, then you should expect something in particular. For example, when you have something that looks like a nose, you would expect that on top of it there should be something that looks like eyes, not something that looks like a mouth. Why the eyes are there, or why the eyebrows are the right color, for example, you can learn or infer by looking, for example, at the color of the hair. I should point out here that the model used for face completion is not similar to the very simple model that we showed earlier, where we flattened out the images. It's a model that actually uses convolutions, so it takes advantage of the 2D structure of the image, of the correlations between neighboring pixels. So it's a much richer model
in terms of its representation. Uh-huh, okay, thank you. So in the context of an autoencoder you can also use convolutions. Again, I don't have the time to go into detail, but convolutional networks give you a way to capture these types of correlations that typically exist in images between neighboring pixels.
I said that there are many different versions of autoencoders. What is the goal of these different autoencoder flavors, and actually what is the goal of any deep learning model? It is to get a good representation, a good code. What a meaningful or good representation means is very dependent on the specific task, but this is a task where we try to reconstruct the input, so how do we define the goodness of a representation? There are several criteria. One is that it has to be robust to meaningless input corruptions. This is related to the question that came up earlier about how the model learns to do face completion or denoising: you try to construct a model that is robust to such noisy inputs. Basically what you try to learn is a representation, depicted here by this curve, such that when you give it a point that is outside the curve, it collapses it back onto the curve, where the curve represents the manifold that actually generates the data, that gives rise to the data. Okay, so it has to be robust. It has to be generalizable, which means that it can transfer to multiple settings of related problems. That means that if you train the model on a specific set of faces, for example, let's say faces of Caucasians, and then you move to faces of African Americans, how are you
going to do? Are you going to do well? Are you able to generalize if the faces are slightly rotated, or a little bit out of focus, and so on and so forth? So it has to be a model that can transfer to multiple settings of related problems. It has to be a model that gives you a smooth, coherent latent representation. That means that when you give similar inputs you should obtain similar codes; you should not have big jumps in the representation when you have small jumps in the input. And finally, it should be explanatory. That means that ideally the different dimensions of your latent code should somehow correspond to real-world explanatory variables of the data. For example, if you have a latent code that has one dimension corresponding to the smile, one to the skin tone, one to the gender and one to the beard, this is excellent, right? Because it gives you a very easy way to obtain a mapping between the latent variables and features that have a clear physical interpretation.
Okay, so what are variational autoencoders? This is, like I said, an extension of the basic autoencoder design that I mentioned earlier. The main idea here is that variational autoencoders generalize autoencoders by adding stochasticity. The latent layer here does not generate point estimates, it does not generate a single value that corresponds to a specific latent feature; it actually generates distributions. Every latent dimension is going to be a distribution with a particular mean and a particular variance, which is sampled before you go to the decoder. What do you gain by doing that? Several things. First of all, you encourage a more continuous latent manifold, which means that you encourage representations that do not have these
discontinuities that we mentioned earlier. You get models that are more robust and typically have valid decodings. That means that if you try to sample from an area where you did not originally have a lot of training examples, it's much more probable that you'll get something sensible rather than something completely unnatural. And finally, and probably most importantly, because it is a generative model, it estimates the generative distribution of the data. This generative distribution is the final distribution learned by the latent code, the latent distribution. It allows interpolation and exploration, and this is probably the most interesting characteristic of variational autoencoders.
So how do you actually train a variational autoencoder, what is the loss function? What we had before, in the normal autoencoder, was just the reconstruction loss, which is shown here and which just measures the discrepancy between your input and your output. What you add in the variational autoencoder loss function is a distance to a latent prior. You assume that you have a prior distribution for your latent distribution, and this is a multivariate normal distribution with a unit covariance matrix, so with uncorrelated dimensions, and along with the reconstruction loss you measure your distance to the latent prior. So there is a balance here, a balance between how well you do the reconstruction and how far away you move from your prior. You cannot move too far away from the multivariate normal prior, because you're going to take a hit in your loss function. You also cannot do terribly in the reconstruction, because you're going to take a hit in the reconstruction part of the loss. This second part you can actually see as having a tunable parameter, a tunable
regularization parameter: you can tune how much of an impact you want it to have on the final loss function. When beta is equal to one you have the vanilla variational autoencoder, the most classical variational autoencoder, and this whole loss is referred to as the evidence lower bound (strictly speaking its negative, since the loss is minimized). When you have a beta that is lower than one, that means you give more weight to the reconstruction: it's more important to get a better reconstruction than to get something that is close to the latent prior. And when you have a beta that is greater than one you do the opposite: you give more weight to this part of the loss, so you're trying to get something that is doing okay in the reconstruction but, more importantly, is closer to the prior that you have specified. What you choose depends on the application. Why, for example, would you choose to go with a large beta, so to design what is called a disentangling autoencoder? Well, if you specify a large beta, that means that you try to be as close as possible to the multivariate normal prior that has a unit covariance matrix. Why is that important? Because that implies that the dimensions of the latent code are uncorrelated, which means that the features that the latent code is going to capture are also going to be uncorrelated, or at least mostly uncorrelated. That's why these are called disentangling autoencoders: because they try to disentangle the latent code into things that are orthogonal to each other, or mostly orthogonal to each other.
Any questions? Am I going too fast? I think there is a question. Yes, do you want to ask it yourself?
So in the latent space, if you have distributions now, can you apply something like a Gaussian graphical model to know the dependencies, or the conditional dependencies, between the
different dimensions in the latent space? I don't know if that makes sense: if you know what each latent node means and you want to know how they're related to each other.
Okay, so you would like to recover the covariance matrix of the latent distribution, right? Yeah. So the covariance matrix is not explicitly encoded in the model, and I'm not certain how to do this, but I can look it up and try to answer.
Okay, so variational autoencoders in single-cell data. Basically you have the same architecture as we had before. You have this particular design where you go from your original space, in this case the gene space, so every sample is a vector of gene counts; that's your input data. You go through successive compressions of this initial data; in the case of variational autoencoders you have this latent layer that you next need to sample in order to go through the decoder; and what you end up with is again back in the original dimension of the gene space. So what you end up with, for different samples, for different cells, in a trained variational autoencoder is a latent layer that can give you very succinct summaries of your sample, of your cell, in the latent space. Here, for example, I have different cells, and what I highlight is how much each one of the latent dimensions lights up for a particular sample. In the final trained model, what you also get is a model that is able to do denoising or imputation in the original data. You should all be familiar with the typical picture of the relationship between the counts in two different cells of a single-cell data set, where you have a lot of dropouts in one or the other cell, you have
some large-magnitude outliers, and even if the cells are extremely similar, so if you have cells that should be of the same type, you end up with this picture, which is actually very far away from what you'd expect from two related bulk RNA-seq data sets. This is how the same cells look in the imputed, denoised version of the data, so after the data have gone end to end, from input to output, through a trained model, where you can see that you have pretty much gotten rid of, for example, all the dropouts. Now, to what degree this imputation is the truth is very hard to judge, but I would say that I would trust these imputations much more than, for example, any nearest-neighbor-based imputation, which is basically a poor man's non-linear version of imputing data. You can see the large impact that this process has on our data by looking at the mean-variance profile of our data set, and this is actually something that we're going to go through in the exercise. So this is the mean-variance trend of the original data set. I'm sure most of you are familiar with the mean-variance trend: on the x-axis we have the gene expression, on the y-axis we have a measure of the variance of the genes. Typically in single-cell data what you have is a picture that shows overdispersion relative to the Poisson, to the picture that you would expect if you had Poisson-distributed data, where the only source of variance is the sampling variance. You also have a very strong relationship between the mean gene expression and the variation: the higher the gene expression, the lower the coefficient of variation, right? Because you get better and better estimates as the genes increase in level of expression.
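As a small illustration of the Poisson baseline mentioned here (an addition to the talk, not the code from the exercise): for Poisson-distributed counts the variance equals the mean, so the coefficient of variation is 1/sqrt(mean) and falls as expression rises. The sketch below simulates this with the Python standard library only. The gamma-Poisson mixture used for the overdispersed case, and all names and parameter values, are assumptions chosen for illustration.

```python
import math
import random
import statistics


def poisson_sample(lam, rng):
    """Draw one Poisson(lam) count with Knuth's method (adequate for modest lam)."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def coefficient_of_variation(counts):
    """Standard deviation divided by the mean."""
    return statistics.pstdev(counts) / statistics.mean(counts)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n_cells = 5000          # hypothetical number of cells

# Pure sampling noise: the empirical CV should track 1 / sqrt(mean).
poisson_cv = {}
for mean_expr in (0.5, 5.0, 50.0):
    counts = [poisson_sample(mean_expr, rng) for _ in range(n_cells)]
    poisson_cv[mean_expr] = coefficient_of_variation(counts)
    print(mean_expr, round(poisson_cv[mean_expr], 3), round(1 / math.sqrt(mean_expr), 3))

# Overdispersed case (assumed gamma-Poisson mixture): every cell gets its own
# rate, standing in for extra variability on top of the sampling noise.
shape, mean_expr = 2.0, 5.0
rates = [rng.gammavariate(shape, mean_expr / shape) for _ in range(n_cells)]
overdispersed_cv = coefficient_of_variation([poisson_sample(r, rng) for r in rates])
print("overdispersed", round(overdispersed_cv, 3))
```

The theoretical Poisson values for the three means are 1/sqrt(0.5), 1/sqrt(5) and 1/sqrt(50), so the CV drops with expression even though nothing but sampling noise is present, while the mixture at the same mean sits above the Poisson value.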
So this type of picture is driven partially, or most importantly, by the sampling variance, but it's also overdispersed relative to the Poisson because there are other sources of variance as well: for example there's biological variance, but there are also imperfections, artifactual measurements, that are unrelated to the sampling variance. This is the picture that you get when you denoise the data, so in the final denoised data set, where you can see that you have lost pretty much any relationship between the mean gene expression and the coefficient of variation. That means that basically any source of variance that is due to sampling is gone, right? So what the variational autoencoder tries to do is to give you back denoised measurements, measurements that are free of the variance that mainly comes from sampling but also from artifactual measurements. This is similar to the idea that you saw earlier in images, where you would try to complete or denoise pictures to which you had added artificial noise. Any questions?
Okay, there is one question. Yes? Yes, sorry, I'm a bit late, but I had a question about the training of these autoencoders. Yes. The central part: does it mean that you can still use the same typical backpropagation for the weights?
Absolutely, yes, it's not a problem at all. It's the same type of optimizers that you would use for a standard autoencoder that you also use in variational autoencoders. There's nothing different in terms of the fitting of the model. Okay, thank you. Every layer is still differentiable; it's still a type of layer that the optimizers can work with.
So in the past few years, I would say maybe two or three years, there's been an explosion of papers that
basically try to apply autoencoders and/or variational autoencoders for different applications in single-cell data. I don't have the time to go through those; I just show some examples here. You have applications of autoencoders where the goal is to denoise, so to do what I mentioned earlier, imputation of the data; visualization and clustering, because you can use, for example, the latent layer as a low-dimensional representation that serves as input to clustering techniques; batch effect removal, which you can do, as we will see in the exercise, by doing operations in the latent space; and so on. These are just a few examples, probably examples that got a lot of visibility when they were published.
There is another question. Yes? Yeah, a question: why would you rely more on those approaches compared to kNN for denoising data? CNN? No, kNN, k-nearest neighbors. Okay, yes, not the convolutional network. Yeah, I was talking about k-nearest neighbors. K-nearest neighbors is basically also a non-linear kind of approach for doing imputation, but by construction it can only take into account a very limited neighborhood of your data. It doesn't have the global picture that a well-trained variational autoencoder has, right? Of course you can end up with a variational autoencoder that is trash, if it's extremely shallow, if it has very poor representation; if you have chosen, for example, a latent layer that has only two nodes, the model is not going to be very expressive. But if you have made reasonable choices in terms of the architecture of your model, I think you'll do a better job.
Okay, because I thought that this proximity of cells would in a way be really important to take into account. It is, and it will be taken into
account by the variational autoencoder, but, like I said, because during training it has seen everything, it has seen all your data, it has a more global picture of how genes, for example, are correlated in your data set. It doesn't have this very limited picture of a neighborhood. Okay, thank you. No problem.
I will very quickly mention generative adversarial networks because, although they haven't been applied so extensively in single-cell data, they are another category of generative networks. The idea here, in terms of the training, of the loss function, is completely different. You don't have anything like a prior distribution, as in the case of the variational autoencoder. What you have is a model that has two components: a generator component, which is shown here, and a discriminator component, which is shown here. What happens is that you start with a generator that basically produces random data, but of the same dimensionality as, for example, your cells; so it spits out random values of gene expression. Then you have a discriminator that tries to decide whether this sample that was generated by the generator is actually a real sample or not. If the discriminator is doing a great job here, that means that the samples that are spit out by the generator look nothing like a real cell sample. But as the model is being trained, what you're trying to do is to trick the discriminator, to make the discriminator confuse samples that are produced by the generator with the real samples. The better you are able to trick the discriminator, the better job the generator is doing in terms of producing things that look like real cells, in
this case, if we're talking about transcriptomics. That is the idea, that is the loss function that you're using in order to train the model: you're trying to produce a generator that is able to trick your discriminator, and if you're able to do that, then that means that you have estimated a latent space that is a good approximation of the space that gives rise to your actual data. This is what I'm trying to show here. You have the initial distribution of your generator, which looks like this, and you have your real distribution, which looks like this, and during training you move your latent space distribution closer and closer to the real distribution until the two hopefully overlap, and you end up with a latent space representation that is able to give rise to data that look almost identical to the real data. Generative adversarial networks are notoriously unstable; they're hard to train. They suffer from what is known as mode collapse, which basically leads to some particular sources of the data being overrepresented and others missing. However, as many applications, particularly in image processing, have shown, they're able to generate highly realistic fake examples. You have probably seen these examples of deepfakes in videos and images that can trick you into believing that a face that never existed is real, or that try to morph one actor into another, and so on. These typically use generative adversarial networks.
Okay, so there are not so many applications in single-cell transcriptomics, but there are a few, and this is one that actually plays to the strengths of GANs, which is to generate realistic-looking data. This is an application that tries to use GANs for augmentation of single-cell RNA sequencing data using generative
adversarial networks. So that's an example I'll show here, and I'm going to close the presentation with some observations about what, again, I think is the most interesting aspect of generative networks in transcriptomics, which is that the latent representation is an estimate of the underlying distribution that gives rise to the data. In other words, the latent space can be viewed as a representation of the transcriptional landscape, of the landscape that is able to give rise to the different cells that you have in your data set. This is a picture that biologists have been familiar with for more than 60 years: it corresponds very closely to the idea of a Waddington landscape, which basically says that there is a particular topology that the cells can traverse, or go down, during differentiation. This is from the original publication by Waddington. This is a more modern, updated version of the same notion, where you have the cells going through different paths in a differentiation topology and ending up in different parts of this topology.
So this is not an idea that should be strange to biologists: that you can view transcriptome regulatory landscapes as an actual space, as an actual manifold, where you can place your cells. And what's the advantage of having such a representation? Well, if you have a good estimation of this generative process of your data, then you can do inference. What would be some examples of doing inference on such manifolds? You can, for example, infer transcriptomes upon biological perturbations. So you can ask questions like: what would the cell look like if I perturb a particular gene? How would the rest of the genes be affected by such a perturbation? You can infer the effects of a perturbation in different cell or tissue contexts. So if you have seen what a gene knockout or a gene overexpression has done in 30 different cell types, can you infer what it would do in cell type 31? Or if you have seen what a particular perturbation is doing in the muscle cells of a human, can you maybe infer what the same perturbation would do in the muscle cells of a chimpanzee? And finally, can you infer trajectories? Can you ask: what would be my most likely trajectory if I want to start from point A in this manifold, in this landscape, and end up at point B? Can I get all the intermediate states of that cell, which would be an approximation of the physical reality? If you have actually estimated the manifold correctly, then this would in principle be possible. And this is actually what these papers have tried to do. This is a paper coming from the lab of Fabian Theis. The name of the tool is scGen, and it was trying to do exactly this: to predict single-cell perturbation responses, either in different cell types, where you have seen the perturbation in a specific subset of cells and you want to
predict it in another one that is unseen, or you want to predict what the perturbation would look like across species, for example. And this is somewhat similar work, but it uses generative adversarial networks to do it. Again, what the authors were trying to do was to predict single-cell perturbations.
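As a rough illustration of the two kinds of latent-space inference described above (predicting a perturbation in an unseen cell type, and tracing a trajectory between two points), here is a minimal Python sketch. It is not scGen's code or API. The linear `encode`/`decode` pair is a toy stand-in for a trained variational autoencoder, the data are simulated, and every name in it (`simulate`, `interpolate`, `center_a`, and so on) is an assumption made for the example. The perturbation step simply shifts control cells by the mean latent difference seen in another cell type, which is one simple way to do this kind of arithmetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_latent, n_cells = 50, 5, 200

# Toy stand-in for a trained encoder/decoder (an assumption, not a real VAE):
# W has orthonormal columns, so encode(decode(z)) returns z exactly.
W, _ = np.linalg.qr(rng.normal(size=(n_genes, n_latent)))

def encode(x):
    """Map expression profiles (cells x genes) to latent coordinates."""
    return x @ W

def decode(z):
    """Map latent coordinates back to expression profiles."""
    return z @ W.T

def simulate(center, shift=0.0):
    """Simulate cells scattered around a latent center, optionally perturbed."""
    z = center + shift + 0.1 * rng.normal(size=(n_cells, n_latent))
    return decode(z)

# Two hypothetical cell types and one perturbation acting in latent space.
center_a = rng.normal(size=n_latent)
center_b = rng.normal(size=n_latent)
true_shift = rng.normal(size=n_latent)

ctrl_a = simulate(center_a)              # cell type A, control
pert_a = simulate(center_a, true_shift)  # cell type A, perturbed (observed)
ctrl_b = simulate(center_b)              # cell type B, control
pert_b = simulate(center_b, true_shift)  # cell type B, perturbed (held out)

# 1) Perturbation prediction: estimate the latent shift in cell type A,
#    apply it to the control cells of cell type B, and decode.
delta = encode(pert_a).mean(axis=0) - encode(ctrl_a).mean(axis=0)
pred_pert_b = decode(encode(ctrl_b) + delta)

# 2) Trajectory: straight-line interpolation between two latent points,
#    decoded into intermediate expression profiles.
def interpolate(z_start, z_end, n_steps=5):
    t = np.linspace(0.0, 1.0, n_steps)[:, None]
    return (1.0 - t) * z_start + t * z_end

z_path = interpolate(encode(ctrl_a).mean(axis=0), encode(ctrl_b).mean(axis=0))
path_profiles = decode(z_path)

# Compare the predicted and the held-out mean profile of perturbed type B.
error = np.abs(pred_pert_b.mean(axis=0) - pert_b.mean(axis=0)).max()
print("max abs error of predicted mean profile:", error)
```

In this toy setting the prediction works almost by construction, because the simulated perturbation is the same latent shift in both cell types and the manifold is flat. With real data, how well this works depends on how well the model has estimated the manifold, and a straight line in latent space is only the simplest possible guess for a trajectory.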