I've been a research associate at the Frigid Mystery Institute since 2016, and we work on all types of transcriptomic and epigenomic data in collaboration with the experimental labs here. In that capacity we get the opportunity not only to focus on different biological problems, but also to test the latest developments in the field of computational biology and see how well they apply to the real-world problems that are tackled here at the institute. So today I'd like to talk to you about how deep learning applies to transcriptomics, and specifically to single-cell transcriptomics. I'll first give a very brief overview of single-cell transcriptomics: the particular problems that come with the data generated by single-cell RNA sequencing, but also the opportunities that we get from these data. Hopefully this will give you an idea of why deep learning can actually be an ideal approach for tackling some of these issues, and for the different types of analysis that are coupled with these kinds of data sets. Then I'll move to autoencoders, which historically were the first models applied to single-cell transcriptomics. You heard a few things about them this morning, so I'm not going to go into too much detail, but this will also give me the opportunity to introduce representation codes, a very important concept in deep learning applications in general and in transcriptomics in particular. Then I'll move to some common architectures used in single-cell transcriptomics, which are of the deep generative network type; specifically, I'm going to talk about variational autoencoders and generative adversarial networks. Finally, I'm going to talk about different applications and existing tools in single-cell omics, which by necessity will not be a comprehensive list, and I will finish up with some perspectives on where we think the field is moving and what some of the future challenges are.

So, a very broad introduction to single-cell transcriptomics. I'm sure that pretty much all of you have either some hands-on experience with single-cell RNA sequencing or have at least heard a few things about it. The main advantage, the main opportunity, that comes with single-cell RNA sequencing as opposed to bulk RNA sequencing is the ability to probe the transcriptional output of individual cells, rather than getting an average view of the whole cell population. This gives unprecedented opportunities compared to bulk RNA sequencing technologies: the ability to probe the population structure, to look in detail at the cell heterogeneity of your population, to study the dynamics of the population, and to study the gene distribution characteristics of your population. All of these tasks were either completely impossible or at the very least not straightforward to do with bulk RNA sequencing technologies.

A typical single-cell RNA sequencing workflow starts with cell dissociation, where you basically try to pull apart the cells that you want to assay. Then you have to isolate the cells, and depending on the specific technology you use there are many ways to do that, but by far the most common technique is the isolation of cells by encapsulation in droplets, which is also known as droplet microfluidics.
So basically, what happens there is that you end up with small droplets that hopefully contain only one cell, although there are cases where you can have multiplets of cells in the same droplet. These droplets also contain all the necessary chemicals you need to perform library construction: the reagents you need to do your reverse transcription to get your cDNA and to amplify the cDNA. This procedure is shown here. During library construction you also typically incorporate into the amplified cDNA molecules different barcodes that can be extremely useful for different tasks. For example, you have sample barcodes that can be used to demultiplex your samples in cases where you assay different experiments at the same time. You have barcodes for the individual cells, which give you the ability to tell whether a particular molecule comes from a specific cell. And typically you also have barcodes called unique molecular identifiers, which give you the ability to tell whether a particular molecule you're looking at actually comes from an original, unique transcript or is the result of over-amplification of the same molecule.

After all this is done, what you typically end up with is basically a huge count matrix where, depending on how you view it, you have your features, which are essentially your genes, in the rows, and the columns correspond to the different cells. These count matrices are the starting point of all the analyses. The first tasks are different types of quality control, where you filter out cells of low quality or genes that are completely non-informative. You have to normalize your data. You often have to correct your data for technical batches, which we'll talk about later. And you have to select features, meaning genes that are particularly informative for dissecting the biological heterogeneity of your samples. After that come the downstream analyses: you can visualize your cells; you can study the heterogeneity by, for example, clustering your cells or performing composition analysis to look at the levels of gene expression in different parts of your population; you can try to annotate your clusters. In cases where you're looking at cell populations that are dynamic in nature, there are specific types of analysis targeted at studying the dynamics of the population: for example trajectory inference, dissecting the metastable states of your population, or studying gene expression dynamics. And finally you can have analyses that specifically target the genes of the populations: you can perform differential expression analysis, for example between different sub-populations of your complete population, you can perform gene set enrichment analysis, or you can try to infer gene regulatory networks based on the distribution characteristics of your genes in the population you're looking at.
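To make the first of these steps concrete, here is a minimal sketch of what quality control, normalization, and feature selection typically look like with the scanpy library, assuming a 10x-style count matrix; the thresholds are illustrative only, not the values used in any particular study.

```python
import scanpy as sc

# Load a cells x genes count matrix (10x-style output assumed here)
adata = sc.read_10x_mtx("filtered_feature_bc_matrix/")

# Quality control: drop low-quality cells and uninformative genes
sc.pp.filter_cells(adata, min_genes=200)   # cells with too few detected genes
sc.pp.filter_genes(adata, min_cells=3)     # genes detected in too few cells

# Normalize library sizes, then log-transform
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# Feature selection: keep genes that carry most of the biological heterogeneity
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
adata = adata[:, adata.var["highly_variable"]].copy()
```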
If we wanted to summarize in one sentence what the key characteristic of single-cell data is, that sentence would be that single-cell data sets contain the end result of multiple, and often confounded, sources of variance. What I mean by that is that the end result you're looking at, the particular profiles of the cells in terms of their gene counts, is the result of multiple signals. Part of the signal can be biological: for example the cell type heterogeneity of your sample, genetics, the specific cell state or microenvironment of a cell, gene expression stochasticity, cell cycle dynamics, other types of oscillatory behaviors of the cells, and so on. But at the same time you have different technical sources of variance, which can come, for example, from differences in capture efficiency, amplification biases, PCR artifacts, contamination, cell doublets, damaged cells, sampling effects, and so on. All together these effects give rise to a particular profile like the one I show here. So if you look at two cells from a typical single-cell RNA sequencing experiment, even if those two cells are of the same type, you usually get this kind of picture, which for people used to looking at bulk RNA sequencing data looks completely different: you have an overall lower correlation between your cells than between two bulk populations of the same type; you have dropouts, by which I mean genes that are measured and give you counts in one cell but not in another; you have over-dispersed genes, meaning genes that appear to have a much higher variance than you would expect just from sampling; and you also get high-magnitude outliers, which can be there for technical reasons or for biological reasons.

A very common problem with single-cell data sets is the very strong batch effects that I mentioned earlier, and I mention this because the correction of batch effects is also one of the applications of deep learning. Batch effects are technical sources of variation that are introduced into a data set during handling and preparation of your samples. They're essentially distortion signals, with different characteristics in terms of intensity and variance, that are applied to each technical batch. This distortion can have a different effect on each of the features of your data set, on each of the genes. In the case of single-cell sequencing the distortion can also have different effects on distinct cells of your population, which means that this variance is also confounded with biology, because batch populations are typically not identical in composition. And because single-cell sequencing involves more, and more complex, steps compared to bulk RNA sequencing, the batch effects that are introduced are typically much exaggerated.

OK, so now I will switch gears and talk a little bit about autoencoders, which as I mentioned were historically the first models applied to single-cell transcriptomics. I will also introduce representation codes, and then I will talk about variational autoencoders and generative adversarial networks. You heard a few things during the morning session about autoencoders, so I'll give a brief summary. Autoencoders are unsupervised models: you don't need labeled data sets to train them, which also means you have easy access to large training sets. The objective of an autoencoder is basically to obtain an output that matches your original input quite closely. But you do this in a particular way.
And the way that you do this is by squeezing your data through successive layers of decreasing dimension. So essentially, as you move from one layer to the next in the encoder, you are compressing your data, and hopefully throughout this procedure you are retaining the most salient, that is the most important, features of your original data set. The middle layer, what you end up with after you go through the encoder, is a latent code that represents, again, the most salient features of your input. So you have two components in this autoencoder: the encoder, which is the part of the machine that performs the compression, and the decoder, which performs the decompression of the latent code. And the way that you estimate the weights of this model is by using a reconstruction loss, which is basically a way to quantify the difference between your input and your output. So that's it; it's very simple. There are, of course, multiple flavors of autoencoders: deep stacked autoencoders, variational autoencoders, which we'll talk about in more detail later, denoising autoencoders, adversarial autoencoders, disentangled autoencoders, and so on. But these are small variations of the same main principle that I just described.

Historically, there have been many different applications of autoencoders. They have been used for dimensionality reduction and visualization, for denoising and image completion, for feature manipulation, interpolation, and extrapolation, and I show some examples here, mainly from the field of image processing. However, these applications in image processing actually correspond very closely to the problems we face in single-cell transcriptomics. What is the connection? As in many cases in image analysis, transcriptomic data are high-dimensional, they can be extremely noisy, as I mentioned earlier, and they can have corruptions, which means that, on one hand, we need techniques that let us perform dimensionality reduction efficiently on these data, and, on the other hand, techniques that allow us to do denoising or imputation on our transcriptomic data. They have very complex feature relationships, so the relationships between the genes are not easy to model. And finally, the sources of variance that I described earlier can have highly nonlinear effects on our data, which means, again, that it's not at all straightforward to model these effects using traditional machine learning approaches.
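To make the architecture concrete, here is a minimal PyTorch sketch of such an autoencoder. The layer sizes, the 32-dimensional latent code, and the plain mean-squared-error reconstruction loss are illustrative assumptions; published single-cell tools typically use count-based likelihoods instead.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Squeeze expression profiles through layers of decreasing dimension."""
    def __init__(self, n_genes=2000, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_genes, 512), nn.ReLU(),
            nn.Linear(512, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),          # the latent code
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, 512), nn.ReLU(),
            nn.Linear(512, n_genes),             # the reconstructed profile
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(64, 2000)                        # a toy batch of 64 "cells"
x_hat, z = model(x)
loss = nn.functional.mse_loss(x_hat, x)          # the reconstruction loss
loss.backward()
optimizer.step()
```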
I mentioned earlier that one of the most common architectures used in single-cell transcriptomics is actually a variation of autoencoders, which I would like to introduce here: the variational autoencoder. You also had a brief introduction to them in the morning, but I'll give you a reminder. Variational autoencoders generalize autoencoders by adding stochasticity to the model: the latent layer, instead of holding point estimates, now actually represents distributions. And what are the advantages of using a variational autoencoder as opposed to a traditional autoencoder? Well, first, it encourages a continuous latent manifold, meaning an embedding that has no breaking points between the different parts of the data that you're trying to represent. It gives more robust models, and it also encourages valid decodings, which is not always the case for a traditional autoencoder. And perhaps most importantly, it allows interpolation and exploration, because it sits on a very solid statistical inference framework.

What is the difference in terms of the loss function that you use to train these models? You still have a reconstruction loss, the same one you use for a traditional autoencoder, which measures the difference between your input and your output. But you have an additional loss term which quantifies the distance to a latent prior distribution, and this latent prior is a multivariate normal distribution with a unit covariance matrix, meaning that it assumes independence between the different latent nodes. You can see here that there is a beta parameter on this penalization term, the distance to the latent prior, and this is actually a tunable hyperparameter. When beta is equal to 1, the whole objective is known as the evidence lower bound, and this is the standard, vanilla variational autoencoder. However, you can use beta values that are less than 1, which gives rise to partially regularized variational autoencoders; by setting beta to lower values you encourage the model to have better reconstruction performance, because the prior term is now toned down. You can also have beta values greater than 1, which gives rise to the beta variational autoencoders, or disentangling autoencoders. These encourage models whose latent nodes are more independent of each other, and the reason is that we are pushing the posterior closer to the prior, the multivariate normal with a unit covariance matrix, which, again, assumes independence between the latent nodes.
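In code, the objective looks roughly like the following sketch; the mean-squared-error reconstruction term is again an illustrative stand-in for whatever likelihood the model actually uses, while the KL term is the standard closed form for a diagonal Gaussian posterior against the N(0, I) prior.

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_hat, mu, logvar, beta=1.0):
    """Reconstruction term plus a beta-weighted KL distance to the N(0, I) prior."""
    recon = F.mse_loss(x_hat, x, reduction="sum")
    # KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian posterior
    kl = -0.5 * torch.sum(1.0 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl

# beta = 1  -> the standard evidence lower bound (vanilla VAE)
# beta < 1  -> partially regularized, favors reconstruction quality
# beta > 1  -> beta-VAE / disentangling, favors independent latent nodes
```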
A second common architecture in single-cell transcriptomics, and again you heard a few things about this in the morning, is generative adversarial networks. These are machines that have basically two components. First, you have a generator that produces a fake sample. Then you have a discriminator whose task is to tell whether the samples coming from the generator are real samples or not. And the way you estimate the weights of this model is by trying to get a generator that minimizes the performance of the discriminator: basically, a discriminator that is not able to tell a real sample from the fake samples produced by the generator. So in the first stages of training you might have a generator distribution that is far away from the real sample distribution, but as training progresses these two distributions approach each other more and more, and hopefully, by the end of training, you end up with a generator that produces a manifold extremely similar to the real sample manifold. Generative adversarial networks have notoriously unstable training dynamics, although there are ways to overcome this, and they suffer from what is known as mode collapse, which leads to some modes of the data being overrepresented and others completely missing. However, they are able to generate highly realistic fake samples, so they are also commonly used in single-cell transcriptomics, for example in cases where you want to generate samples that very closely resemble real transcriptomic samples, for example for data augmentation.

It doesn't really matter which of these types of models you use: there is a common underlying goal whenever you use deep learning models, and that goal is to obtain a good code representation of the input data. And what does a good representation mean? It can sometimes depend on the particular goal you have in mind, but there are some commonalities in terms of the characteristics of a good representation. First, it has to be robust to meaningless input corruption, meaning it is a representation that is robust in the presence of noise. It has to be generalizable, meaning it should ideally transfer to multiple settings and to multiple related problems. It should be smooth and coherent, meaning that for similar inputs you should get similar code outputs. And ideally it should also be explanatory: if there are distinct sources of variance that give rise to your data, ideally you would want your representation code to disentangle these sources of variance, so that different nodes, for example, encapsulate the distinct types of variance that give rise to the data.

What does this mean in practice for transcriptomics? What is the interpretation of a representation code when we're talking about transcriptomics? A representation code, a latent code, in terms of transcriptomics is basically a succinct generative representation of complex transcriptomic manifolds, meaning that each location in this manifold represents a different, realizable cell state. So there is a very direct relationship between the representation code that you get from a deep learning model and the way the data are generated to give a particular transcriptomic profile as output. A useful analogy that I always keep in mind, and that I think might also be useful for drawing this connection, is the Waddington landscape, a concept that has been known to biologists for many decades. It is basically an abstraction, used typically by people working in differentiation or developmental biology, that views the cell as a ball in a high-dimensional landscape, a high-dimensional manifold. This ball can move through the landscape, and every position in it basically represents a different cell state. So, for example, when you go through differentiation or through reprogramming, this ball moves and jumps to different cell states. You can view the latent representation of a deep learning model as a realization of this ideal abstraction that biologists have had in mind for many decades.

OK, are there any questions so far before I move to the last slides, which talk about specific applications? There was a question about how the generated single-cell transcriptomics data are used, and whether I have some examples of how to do this. Some examples of how to generate them? No, I think it's more how to encode or how to use the data from single-cell transcriptomes; maybe Katarina can speak up if this is the question. If you're talking about preprocessing, I'm not sure I completely understand the question, but if you're talking about preprocessing, there is minimal need for preprocessing of the data. Typically, you don't need to do any feature selection because, as you heard earlier, deep learning models are end to end.
So there's no real need for feature selection, maybe with the exception of doing it in order to reduce the computational time for training your model. There's also little need for transformation of the data, although typically people log-transform the data before feeding it to deep learning models. The reason for this is to make the training landscape a little bit smoother, so that you don't end up with a model whose weights are very hard to estimate; you try to bring your input features onto a similar scale. But that's pretty much it, if that was the question. The microphone is not working, but you can also rephrase if that was not what you were looking for. In the meanwhile, there are two other questions. How are these models validated experimentally? That depends, again, on exactly what the goal is. First of all, as I said, autoencoders are unsupervised, so there is an internal control of how well these models perform, in terms of how well they are able to reconstruct your input data sets. You can also try to see how well they impute data, for example, with common validation techniques, such as holding back a part of your data set and looking at how well the model represents a validation set that was never used for training. But again, it really depends on exactly what you're trying to do in terms of validation, and maybe some of this will become clearer in the specific examples.

So, a very common application is dimensionality reduction, data visualization, and clustering. This is a very natural application of autoencoders or variational autoencoders because, as we mentioned earlier, the latent layer is actually a compressed representation of your data: you have already performed dimensionality reduction. You can use this latent encoding to efficiently visualize your data with your favorite 2D embedding technique, like t-SNE or UMAP, and you can also very succinctly represent every cell by its latent encoding. You can also use this compressed latent layer, this reduced representation, as the input for clustering methods. Or you can even use this latent representation, for example, in cases where you're studying the dynamics of your cells and you try to reconstruct the trajectories of your population. So instead of going through techniques of feature selection, you can use this latent encoding for these tasks.
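As a small illustration of how a latent encoding plugs into a standard workflow, the sketch below assumes you already have a matrix `latent` with one row per cell, produced by a trained encoder, together with the corresponding AnnData object `adata`; the neighbor graph, UMAP, and clustering are then computed on the latent code instead of on PCA space.

```python
import scanpy as sc

# Store the (cells x latent_dim) encoder output alongside the expression data
adata.obsm["X_latent"] = latent

# Build the neighbor graph on the latent code, then embed and cluster as usual
sc.pp.neighbors(adata, use_rep="X_latent")
sc.tl.umap(adata)
sc.tl.leiden(adata, key_added="latent_clusters")
sc.pl.umap(adata, color="latent_clusters")
```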
Another very natural application is imputation and denoising. As mentioned earlier, the output of the decoder is essentially a denoised version of your input data: what you get back in this last layer of the autoencoder is essentially a denoised version of your original data set. There are several tools, built with small variations of autoencoders and variational autoencoders, that actually perform imputation and denoising. Here I show an example from in-house generated retina data, where you can see the original relationship between two retinal cells, with this typical picture of high-magnitude outliers and dropouts between the cells, and you can see how the same data look after going through these steps of denoising. On the bottom here, what you see is the effect that denoising with a variational autoencoder has on the mean-variance relationship that you typically see in single-cell data sets. Typically you have this very particular relationship where, as you move to higher and higher levels of expression, the variance of the cells, when normalized for gene expression, becomes lower and lower. This is basically a sampling effect, so it typically follows the Poisson distribution very closely, or is a little bit over-dispersed compared to a Poisson. On the right you can see the same mean-variance relationship, but on the decoded output, and you can see that this association between gene expression and variance is now completely lost, which means basically that the autoencoder has corrected for the sampling effects, and that it's also much more straightforward to model your genes, for example in tasks of differential gene expression, or when looking for marker genes of a particular subpopulation of your data set.

A conceptually very related task is batch correction and data harmonization. I talked about batch effects earlier, and here, essentially, what you're trying to do is come up with a representation of your data in which the technical effects that give rise to the batch effects have been erased. There are different techniques by which you can do this, and here I just showcase three of them. You can take advantage of a technique called arithmetic operations on the latent space: basically, you summarize a source of variance as a latent vector, and then you subtract this latent vector from your cells in order to remove that type of technical variance (I'll show a small sketch of this idea below). So, for example, if you have two different sample populations coming from two different laboratories, you can summarize the average profile of a laboratory in terms of its latent profile and then subtract it, in order to move your two data sets, coming from the two different labs, into the same space. You can also use one-hot encoding to represent the origin of your different batches, and then, basically, by flipping bits of this one-hot encoding, move from one batch to the other. Another technique is to use conditional variational autoencoders, where the latent space representation is conditioned on different nuisance factors; one such nuisance factor can be the batch origin of your cells.
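Here is a toy numpy version of that latent-space arithmetic idea, assuming a `latent` matrix and a per-cell array of batch labels; real implementations are more careful than this, precisely because batch composition and biology are confounded.

```python
import numpy as np

def remove_batch_offsets(latent, batch_labels):
    """Subtract each batch's mean latent offset relative to the overall mean."""
    corrected = latent.copy()
    global_mean = latent.mean(axis=0)
    for b in np.unique(batch_labels):
        mask = batch_labels == b
        offset = latent[mask].mean(axis=0) - global_mean   # the batch's latent vector
        corrected[mask] -= offset
    return corrected
```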
A conceptually very related task is multimodal data integration, where again you try to harmonize data sets, but in this case the data sets are not of the same type: they are data sets where you measure different attributes of the cells, different modalities. For example, you can have a gene expression data set, an ATAC-seq data set, chromatin accessibility data, even imaging data. The cells are not necessarily paired in these experiments, and even the number of features can be very different. But the main concept you use to harmonize these data sets is basically very similar to what you use for batch correction, which is to bring your data into a shared latent representation. Again, because these models are end to end, this allows for very efficient translation between domains, and it even allows you to predict what the output of a particular sample would be in a different modality in which it has not been observed. So these are very powerful models in terms of inference.

The final task that I'm going to mention here is the automatic annotation of single-cell data. A task that you commonly face in single-cell analysis is to annotate the cells in terms of the different sub-populations. Here I show one example that again uses conditional variational autoencoders, where, apart from the nuisance factors that are encoded as a batch ID, you also encode the cell type ID. Essentially, the latent representation that you get back is conditioned on both the batch ID and the cell type ID. So this model can perform batch correction and, at the same time, automatic classification, because you can very naturally get, from the posterior of your variational autoencoder, the cell type ID for a particular cell.
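A rough sketch of how such conditioning can look in PyTorch is shown below; the network sizes and the simple concatenation of one-hot batch and cell type vectors are illustrative assumptions, not the implementation of any particular published tool, and real tools treat unknown labels probabilistically rather than requiring them as inputs.

```python
import torch
import torch.nn as nn

class ConditionalEncoder(nn.Module):
    """VAE encoder conditioned on a batch ID and a cell type ID (both one-hot)."""
    def __init__(self, n_genes=2000, n_batches=4, n_cell_types=10, latent_dim=16):
        super().__init__()
        self.hidden = nn.Sequential(
            nn.Linear(n_genes + n_batches + n_cell_types, 256), nn.ReLU(),
        )
        self.mu = nn.Linear(256, latent_dim)
        self.logvar = nn.Linear(256, latent_dim)

    def forward(self, x, batch_onehot, celltype_onehot):
        h = self.hidden(torch.cat([x, batch_onehot, celltype_onehot], dim=1))
        return self.mu(h), self.logvar(h)
```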
I have one final, very interesting application of deep learning models on single-cell data, which is out-of-distribution inference. This allows the inspection of regions of the transcriptomic landscape that have not been visited: it allows you to infer transcriptional responses to biological perturbations that have not been observed experimentally for all cells. So you can, for example, infer the effects of perturbations in tissue contexts that have not been seen experimentally, or infer trajectories of cells that have been observed only for a subset of cell types. On the bottom here, I show three different applications that do exactly this kind of out-of-distribution inference on single-cell data. There are several tasks that I do not have the time to talk about: for example, the deconvolution of spatial transcriptomics, analysis of single-cell ATAC-seq data, doublet detection, analysis of CITE-seq data, and so on. This is a rapidly expanding field, so it would be impossible to fit everything into a half-hour talk.

To finish up, I'd like to give you some perspective on where the field is and where I think it's moving. Despite the multitude of publications on deep learning in single-cell omics in the past four or five years, the underlying principles used in most of the models are actually relatively few, and the architectures are also quite limited. The existing applications do not yet represent conceptual shifts; rather, they provide alternative implementations for problems that already have existing counterparts using different algorithmic approaches. And the performance gains that you see are also not very spectacular yet, leading many people to say that using deep learning models for transcriptomics is like bringing a gun to a knife fight. However, there are developments that I think will prove many of these naysayers wrong. One example is the use, in the past few years, of geometric deep learning techniques, graph neural networks, which allow the integration of existing biological knowledge into the networks as inductive biases and give rise to models with more accurate representations. Another example of an opportunity that cannot be tackled with traditional machine learning techniques is the modeling of molecular perturbations. Large-scale perturbation atlases that are being generated these years, combined with the representation capacity of deep generative networks, hold the promise of producing a more comprehensive mapping of the regulatory states of cells, which means that it should be possible to perform perturbation response prediction, target and mechanism prediction, and prediction of combinatorial perturbation effects, which has been one of the holy grails of transcriptomics for many decades now.

I leave you with a quote from a recent paper from a lab that has done tons of work on single-cell transcriptomics, the lab of Fabian Theis. The quote comes from an evaluation of different classification methods and a comparison with deep learning techniques, and they conclude that, after performing this comparison, we are still waiting for the ImageNet moment in single-cell genomics. That means we are still waiting for the transformative moment where an application presents a conceptual shift, or shows a performance to which no traditional machine learning methodology can compare. And I think this moment is coming, given the rising complexity of the data sets that are being generated and the complexity of the questions that biologists are asking when they look at transcriptome data. OK, thank you very much.