The data set that we will use today is a well-known data set from the Seurat tutorials: a PBMC data set with 3k cells. So you will be able to see this data set and download it.

Okay, so let's start with the presentation. As I said, ASAP is an Automated Single-cell Analysis Portal; the goal is really to do the whole analysis online using the web tool. But before I start introducing ASAP, maybe I should introduce the single-cell field in general. I actually like this picture because it is a good recapitulation of what single-cell analysis is. Most of you know this already, but to summarize: up until now we have had two main ways of doing RNA-seq. You can do bulk RNA-seq analysis, in opposition to single cell. In general, bulk RNA-seq means taking a bunch of cells, lysing them all together, and getting an average expression of all the genes across those cells. That was done in the past and it is still done; it is pretty good, it works very well, and it has very good sensitivity, so you detect even very lowly expressed genes. However, you only get an average expression for all the genes, so if you want to do things like find subpopulations of cells in your data set, you are of course very limited by this technology. With single-cell data, on the other hand, you really get the transcriptome information at the single-cell level: for every cell you get RNA information. And this is for transcriptomics, but the single-cell field is big, and you can also do proteomics at the single-cell level, whole-genome sequencing, and epigenomics. You can now even get ATAC-seq and RNA-seq data from the same cell, which is called multi-omics. More recently we also have spatial transcriptomics, which I still put under single cell even though it is not really single cell, because there you get a spot, and one spot can actually contain multiple cells. But it is something that is growing a lot, and that people usually use together with single-cell methodologies, for example to map cell types onto the spots. So it is still something that is very big today.

But today I will mostly speak about single-cell transcriptomics, not the other modalities, because ASAP is mainly meant for single-cell transcriptomics data. You can also do bulk transcriptomics with ASAP — you can choose, when you submit your data, between bulk and single cell — but the main topic today is really single-cell transcriptomics, so I will focus mostly on that.

There are different types of applications of single cell: cancer, of course, but also development, microbiology, neurobiology, et cetera — you name it. It is a field that has really exploded since the first really usable pipelines were released, I think around 2015. Then it really took off, and nowadays we reach thousands of publications every year, so it is really something that is kind of trendy. Most of the applications I saw recently were atlases: people like to create these atlases, like the Human Cell Atlas — maybe you have heard of it — the Fly Cell Atlas that we participated in, and other, more specific cell atlases. So there is really a trend of creating these huge cell atlases of many, many tissues and many, many organs, and that is still ongoing. And I wanted to speak more specifically about the Fly Cell Atlas.
First, because we were part of it: ASAP was used when we created the Fly Cell Atlas to actually annotate all the cell types. It took a really long time to annotate all the cell types, because — I don't know if you know — the annotation part, where you really try to annotate your clusters, is usually the longest part of any single-cell analysis. It takes a lot of time. Here we collaborated with many, many groups, and it took something like six months, I think, to annotate all the different tissues. So that was a heavy workload, but ASAP, and SCope as well, really helped a lot to decipher the different clusters and the different marker genes, and to annotate most of the cell types. I am speaking about this one specifically because the data set that you will have next week for the hands-on actually comes from the Fly Cell Atlas. So that is what you will work on. I know it is not human, it is not mouse, it is Drosophila data, but I guess that is a good thing — maybe none of you has worked with Drosophila before, so it is something new. And you will see the power of ASAP and how you can do this analysis in a pretty straightforward way.

Okay, one little point about the different protocols that exist for single cell. You probably know about 10x, which is the most used one nowadays, I would say, but there are many other ones; the two main ones are probably 10x and Smart-seq. Usually, people run 10x if they really want a lot of cells, but the sensitivity of 10x is lower, so you detect fewer of the expressed genes. If you use Smart-seq, on the other hand, you usually get fewer cells but with much better sensitivity, so most of the genes are actually detected. That is what you see here. So it is important to know at least the two main technologies, because if you want to do single cell you will of course have to pick one of them. Currently what we use, at least in the lab here, is the 10x technology, which works pretty well.

Okay, so that was a global introduction to single cell; now I will dig a little more into the bioinformatics part. This is something we actually saw in the lab when we started doing single cell, in 2015 or 2016: finally, after one or two years of library preparation, dissections and so on, you get your single-cell data — that is cool, it works — and then suddenly you want to do the analysis yourself, and it becomes a nightmare. The field has evolved a lot since we started back in 2016; at that time it was really hard to do a proper single-cell analysis. Nowadays it is easier, because you can use dedicated pipelines such as Seurat or Scanpy. But there are still people who are not very familiar with R or with Python, and it can be difficult, when you don't have a bioinformatics background, to do the analysis by yourself. That is a typical bottleneck, and we experienced it in the lab as well: some biologists were acquiring data but did not know how to do the analysis themselves. That is the original reason why we developed ASAP internally, and afterwards we made it public, because we thought it would be a good help for people who are not bioinformaticians.

So, the traditional single-cell analysis pipeline can be divided into two main steps.
The first step is what we call the pre-processing, where you take your FASTQ files and align them to a reference genome, like the human or mouse reference genome. Then you need to demultiplex to find back your barcodes — your cells — and finally do some QC to generate your final UMI or raw count matrix. This is usually done in Bash, on Unix, so you don't need R or Python for that. It can be handled almost automatically with a tool like CellRanger if your data comes from 10x: CellRanger takes care of all the steps automatically and generates the raw count matrix. It can also be done with other tools like STARsolo; that is a bit more tedious because you need to parameterize a few things yourself, but it works well too. So basically, the input of this pre-processing step is the FASTQ files and the output is the count matrix. That is what I call the pre-processing. If you run CellRanger, for example, you get something like this: a QC report with the estimated number of cells and the average number of reads per cell, and also this kind of curve here, where you see the detected cells compared to the empty barcodes, where you really don't have enough UMIs to call them cells. There is more in there that I will not go into, because that is not the topic today, but usually you get this kind of thing.

And then, when you are done, you have your count matrix and you need to do the proper analysis, which I call the downstream analysis. That is when you need to go to Python or to R to actually analyze the data. First you load the data into the environment, and then you run the actual processing pipeline: normalization, filtering, UMAP, PCA, et cetera. This is where ASAP is actually used. With ASAP you cannot do the pre-processing part; as input it needs the count matrix, but from the count matrix you can do all the different steps. It cannot process FASTQ files. The main reason is that it is a web tool, so uploading FASTQ files would be too heavy for the web tool to handle; we decided to start from the count matrix, which is much easier. The other reason is that the pre-processing pipeline is usually very streamlined: you don't really need to tune parameters or go back and redo something. You usually just process it with the default parameters, it generates your raw count data, and that's it. There are some exceptions, of course — sometimes you need to tune the threshold that separates empty from non-empty barcodes — but usually it is pretty straightforward. So we decided to start at the end of this pipeline, which is the count matrix.

Okay. One question you may ask, which was asked before, is: can't I just use a black-box pipeline? Why do I need Seurat or Scanpy or ASAP — can't I just run a default script that does everything for me? This kind of thing is okay for the pre-processing, as I said, because it is very streamlined and very straightforward. But it is usually insufficient for the downstream analysis, because it is very rare that you go straight from the count data to your annotation.
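Just to make the start of that downstream part concrete, loading the count matrix into R with Seurat looks roughly like this — a minimal sketch, not ASAP code; the `filtered_feature_bc_matrix/` folder name is the usual CellRanger output and is an assumption here, so adjust the path to your own data:

```r
# Sketch: load a CellRanger count matrix into R and wrap it in a Seurat object
library(Seurat)

# Read the barcodes / features / matrix files produced by the pre-processing step
counts <- Read10X(data.dir = "filtered_feature_bc_matrix/")

# Create the object that the downstream pipeline will work on
pbmc <- CreateSeuratObject(counts = counts, project = "pbmc3k",
                           min.cells = 3, min.features = 200)
pbmc
```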
Coming back to that black-box question: it is very common that you circle back and need to redo the clustering with different parameters, or redo the filtering with different parameters because you identified an outlier cell population. So it is much more complicated. It is possible, for example, with CellRanger to run an automated downstream analysis and then visualize it using the Loupe browser provided by 10x. But it is limited, of course, because you cannot reprocess anything: everything is fixed and you have to work with what was generated. That is why other tools exist, and I don't think many people actually use that Loupe visualization for their analysis. Usually bioinformaticians use other solutions — I already mentioned them — like Seurat or Scanpy. Seurat is R-based and Scanpy is Python-based, and depending on the language you use you will prefer one or the other, but the analysis pipeline in both frameworks is basically the same: the steps are the same, the architecture is the same. The limitation, of course, is that they require R or Python skills, so they take time to implement thoroughly. And that is why people may prefer user-friendly automated analysis portals like ASAP, or others such as FASTGenomics — that is another one, but you need to pay for it, it is commercial, whereas ASAP is free to use. These portals are nice because you don't need to code anything; it is very fast to obtain results — you basically just click buttons and progress through the pipeline — and it is interactive, so you can select cells, visualize plots, and interact with most of the plots. It is also reproducible, which is very nice nowadays when people try to follow FAIR standards — reproducible, interoperable. In ASAP, which is nice, you see your whole pipeline, and everything is run inside a versioned Docker environment, so it is completely reproducible: you can rerun it and you will get the exact same result.

So, as I said, there are two main pipelines: Seurat in R and Scanpy in Python. Seurat is now in version 5, which is quite recent, and that is actually the version we have in ASAP. In ASAP, the pipeline we are currently using is Seurat. We had some scripts in Scanpy; for now they are marked obsolete, so you cannot use them anymore, but we will add new ones soon. Currently, though, the whole pipeline is designed around the Seurat pipeline.

Now, the downstream analysis — the typical downstream analysis when you have single-cell RNA-seq data. These are the different steps that you usually run. First you start with QC and cell filtering, where you try to identify cells that are outliers, for example because they have too many mitochondrial reads or too few UMIs, and you remove them from the data because they can pollute your PCA or your UMAP afterwards. Then you normalize your data to remove the depth bias, because of course every cell has a different UMI count, so you need to normalize for that. Then you identify the highly variable genes (HVG), which are the variable features in your data. Then you scale your data, and at that step you can also remove some covariates.
So if you have some known covariates, like a batch effect, or if you want to remove for example the mitochondrial content or the effect of depth on your signal, you can do that at the scaling step. Then you run a traditional PCA — I guess you all know what a PCA is — where you can start visualizing your data. But in single-cell analysis the PCA is usually not meant for visualization; it is mostly used as a dimensionality-reduction technique that is then used to build the UMAP. The UMAP is not built on the raw counts or on the normalized count matrix: you usually compute the UMAP from the PCA, and similarly for the t-SNE and for the clustering, so they all come from the PCA results. The UMAP and the t-SNE are the visualization methods — that is what you use to visualize your data in a better way — and you will see, even with very simple data, that the UMAP and t-SNE visualizations are much better than the PCA visualization: you see your different clusters much better, it basically separates the clusters better. Then, based on the clustering, the goal is to annotate your clusters. Usually what you try to annotate are cell types that you can find in a tissue, or cell lines that you are studying. The way to do that is to find the marker genes of each of your clusters: which genes are differentially expressed — more highly expressed — in this cluster compared to all the other ones. Once you have the marker genes, you can use this information to annotate your cell types, provided you know in advance that a given marker gene is very specific to a given cell type. There are data sets and databases available online that provide this marker-gene-to-cell-type mapping, and you can use them to annotate your clusters.

So that is the traditional pipeline; you can do it with Seurat or Scanpy, and you can also do everything with ASAP completely online. If it were always that linear, you could indeed use a standard, straightforward, black-box kind of analysis. But usually it is more complicated than that: when you start annotating your clusters, you realize that some clusters are maybe not correctly defined, so you need to increase or decrease the number of clusters. So you go back to the clustering and back to the marker genes again and again — back and forth. And that is exactly what you cannot do with black-box pipelines like the Loupe one. This is actually even the easy case, because usually it is more complex than that: you have data sets where you realize that the clustering is not good, so you need to go back to the PCA and increase or reduce the number of principal components to see if you get a better UMAP. Or sometimes you identify in the UMAP a very weird population that is probably just outliers caused by technical artifacts, like mitochondrial content and so on; you realize those cells are not good and you need to filter them out, so you go back to the cell filtering step and start again. That is why I think it is interesting to have a very modular framework where you can go back in your pipeline, change some parameters, and see what it changes.
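As a rough sketch of that marker-gene / annotation step in Seurat (assuming a clustered object called `pbmc`; the parameter values below are just the common tutorial defaults, not anything ASAP-specific):

```r
# Sketch: find marker genes for every cluster versus all remaining cells
library(Seurat)

markers <- FindAllMarkers(pbmc,
                          only.pos = TRUE,         # keep genes higher in the cluster
                          min.pct = 0.25,          # expressed in >= 25% of the cluster's cells
                          logfc.threshold = 0.25)  # minimum log fold-change

# Inspect the strongest markers per cluster, then match them to known cell-type markers
head(markers[order(markers$cluster, markers$p_val_adj), ])
```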
And of course this back-and-forth is very convenient with ASAP, because everything is online: you just press buttons, you realize that something is wrong, you go back to that step, you change the parameters, and you don't have to rerun scripts or anything by hand. So it is pretty straightforward to do with ASAP, I would say. Okay, voilà, I am done with the global introduction.

So now my goal — what time is it? — my goal now is to do a live demo using ASAP. Maybe before I continue, are there any questions? Ah, Fabrice, you already answered. Someone asked whether it is possible to integrate. Any more questions? Single-cell RNA — there is a question about integration. It is currently not possible to integrate different data sets on ASAP. That is something we plan to do, but it is not yet available; it will probably be available by the beginning of next year, we are currently implementing it. Any more questions? So now we will continue with ASAP.

Ah, there is one question. So basically, in ASAP, when you upload your data, we have a script that runs and tries to map all the genes in your data to the Ensembl database. Those genes are then annotated with Gene Ontology terms and other types of ontologies, like cell types and so on. If they are not annotated, they are still kept in the data set — you can still visualize their expression and everything — but that's it: you will not be able to use them for enrichment in Gene Ontology, other ontologies, or cell types. They are kept, so you don't lose them.

So, ASAP — now, actually, ASAP. There are two ways of using ASAP. Either you use it from scratch: you upload your raw count data, you do the whole analysis online on ASAP, and afterwards you can export the data as a Loom or as an H5AD file and continue the analysis yourself. That is one way to use it. The other way is to upload an already analyzed H5AD or Loom file containing all your analysis; it will then be displayed automatically, so you will be able to see all your UMAPs, t-SNEs, clusterings and so on directly on ASAP. In that case it can be used just for visualization — that is the second usage of ASAP, I would say. So if you already have integrated data, for example, you can upload it to ASAP, it will be displayed on the website, and you will even be able to run extra analyses from there.

Loom and H5AD are different formats. When you store single-cell data, many formats exist: the simplest one is just the count matrix in text format, but you can also store it as a Seurat object or a Scanpy object — Scanpy usually creates an H5AD file, that is its default format, and it has actually become the common format, I would say, for storing single-cell data. There are others, and the one we use internally within ASAP is Loom. So everything inside ASAP uses Loom files, but afterwards you can export your data as an H5AD or a Loom file or another format. So the input file can be basically anything — H5AD, Loom, text file, whatever — and the output file is either Loom or H5AD. And once you have this output — here, I will open ASAP — we have a tutorial that explains how to work with the Loom files that are created by ASAP.
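As a quick illustration of what that looks like on the R side, here is a minimal sketch assuming the loomR package mentioned below; the file name and the attribute names (Gene, CellID) are assumptions and may differ in the files ASAP actually writes:

```r
# Sketch: open an exported Loom file in R with loomR (names are assumptions)
library(loomR)

lfile <- connect(filename = "asap_export.loom", mode = "r")

# In loomR the main matrix is exposed with cells as rows and genes as columns
mat   <- lfile[["matrix"]][, ]
genes <- lfile[["row_attrs/Gene"]][]     # gene names, if stored under this attribute
cells <- lfile[["col_attrs/CellID"]][]   # cell barcodes, if stored under this attribute

lfile$close_all()
```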
So if you want to use the file afterwards, the tutorial explains what a Loom file is, and it also gives a bit of code — in R and in Python — showing how to import this Loom file into an R or Python environment. Does that answer your question? Yeah, I see no other questions, so I will continue. Yes — there is a Loom R package from the Satija lab. It is quite old, I think, but it still works pretty well, so that is one way to do it.

Okay, so now I will switch to ASAP for the live demo. I hope it works — it's always like this with live demos, there is always something you didn't think about and then it crashes. So here is the main portal for ASAP; the URL is asap.epfl.ch. My goal, at least for the course today, is to reproduce the analysis that we have in Seurat. Here is the PBMC tutorial — I can give you the link, but maybe you know it already. This is the main tutorial for getting started with Seurat. The data is here, so you can download it; it is a .tar.gz archive, which is basically text files, just compressed. And here you have the tutorial that explains all the steps you can do with Seurat — all the commands you need to run, all the plots you get, et cetera. The goal today is to show you that we can reproduce this with ASAP in a very convenient way. That is what I will do in a moment, but first I want to present the portal, how it works and how to use it.

If you connect to asap.epfl.ch, you get to this main page where you see all the public projects. These get a permanent public key, which people typically use when they want to put a project in a publication: it is a permanent URL, so you can give the link and readers of the publication can directly access your data. You can also log in. It is not mandatory, but it is something you can do: you can register and then log in afterwards. If you don't log in, it still works — we create what we call a sandbox project, which is basically the same as a normal project you would create when logged in, except that it is destroyed after, I think, 24 hours or something like that; after a certain period of time it is destroyed. And of course, if you disconnect from the page and come back, say, eight hours later, you will not be able to find your data again. So without registering you can still test the pipeline, as a sandbox project. If you want to keep track of your projects for later, or share them with some of your collaborators, then you need to register so that everything is kept in your session.

Okay, so as I showed before, we have many tutorials that you can access from here, and we also have two main atlases available directly within ASAP: the Fly Cell Atlas data and the Human Cell Atlas data. On the Fly Cell Atlas page you see all the different projects that exist; you can access them from here or download the raw data. You can see all the tissues and the technology that was used — in the Fly Cell Atlas we used both the 10x and the Smart-seq2 technologies. You can see all of that, and you can access a project by clicking the view button. Or you can access already existing public data — let me try with the first one, maybe.
And of course it is public data, so as you can see you can only view it in read-only mode. Since it is public, you will not be able to make modifications or run new analyses on it. If you want to do that, you just need to clone it: cloning the project creates an exact copy of the existing project in your own session, and then you can modify it, annotate it, or whatever you want to do. So let's click on the first one and see what it is — pancreas data from mouse, with 9,000 cells and 55,000 genes apparently, published recently. It was done with version 6 of ASAP; the latest one you may have seen is version 7, which is very recent — it was released a few days ago, I think.

So, that is something I didn't show, but basically, for each release of ASAP we create a separate Docker image, so it is completely versioned, and you can see all the different packages we use with their versions: the R version, the Python version, and every package with its version. Everything is really made to be fully reproducible — you have all the information about the packages used in a specific version. This project, apparently, was done with version 6.

When you open the project — here it is an already-run, public project — you see many things. On the front page you first see which analyses were done: you see the pipeline that was run — the parsing, then the particular pipeline they used — and all the differential expression (DE) analyses that were run. So here you have a tree showing everything that was run. You can also see the number of steps, and the number of tools that were run for each step. You can also add external links to your project: here, apparently, the authors who published this project linked it to a GEO data set, so you can click on it and access the GEO data if you want to reproduce everything from scratch. So you can create that kind of link.

Then, if you go to the different steps, you see how each step was done — that is why I said it is really reproducible, because you see everything that was run by the authors and the parameters they used. For the cell filtering you see all the parameters they used, you see the visualization, the filtering steps, and so on. And if you click on visualization, you see the actual plots, like the UMAP or the PCA — here apparently they ran two UMAPs, a 2D and a 3D one. You see the different cells on this UMAP, and if you go to the coloring controls you can also display the clustering that they ran. You see the clusters they chose to keep in the data set, with the different cell populations, and they actually annotated some of the clusters: cluster number 2 was annotated as beta cells, and cluster number 3 was also annotated as beta cells. If I select cluster 3, that is this cluster here — this one was annotated as a beta-cell cluster — and cluster 6, apparently, was annotated as alpha cells, cluster 11 as delta cells, et cetera.
So apparently they did not annotate all the clusters, but at least some of them were annotated by the authors, and you can find that here as well. Okay, so that is basically how it works when you have public data sets. But what we are also interested in today is how to do an analysis from scratch.

First, I log in. When you log in you have a slightly different view: you still have the public projects, but you also have your own projects. Here you see all the projects I have created. I can still clone projects if I want copies — for example to modify some parameters while keeping the original project — and you can also share projects with other users. This one, for example, was shared with two other users, which is pretty convenient. As a bioinformatician, that is usually what I do: I get data from collaborators, I run the analysis, either in Seurat and then I put it on ASAP, or directly on ASAP, and then I share it with the biologists or tissue experts who will do the annotation. They then have access to everything, they can annotate, and they can work on the same project that I created. So that is a nice feature: you can share your projects with collaborators.

Okay, but now I will create a new project. I click this new-project button at the top, and here is where you submit your count matrix coming from the pre-processing pipeline I spoke about before. There are many ways to do that: you can browse your computer, or you can even paste an existing URL. If I have some files — let me check — yes, I have a CSV, so let me just show you how it works with different file types. If I upload the CSV file, you see a pre-parsing that is run, and normally you get an extract of the matrix displayed here, but you see it is not working: the delimiter is set to tabulation, so I need to select comma instead, and now you see very nicely the column names — the cell barcodes — and the gene names. Then you have a recap of the number of cells — it is a very small data set, from 2015 — and 23,000 genes, and it is detected as a count matrix. You can also upload normalized data if you want; if you do, you will of course be limited, because there are many steps you will not be able to run, but it works as well — you can upload already normalized data or integrated data.

That was just an example, so let me submit another one. Do I have maybe a 10x file? No, I don't — but you can also upload data directly from 10x. I don't know if you know, but from 10x you can download a bunch of data sets in h5 format — it is a proprietary format — and you can upload that directly here as well; it works too. But here, let me submit the data: I can download the tar.gz data from the Seurat tutorial — this PBMC data set that I mentioned — to my computer and then upload it. I have it here, so I can show you: if I do that and upload it, it is detected as — what type of data is it? I think MatrixMarket — yes, it is detected as an MTX (MatrixMarket) file. And then you see again this summary of the first 10 columns and first 10 rows of your matrix.
But if you don't want to download it, or your connection is too slow, for example, you can also copy the link address and paste it directly here; the ASAP portal will then download the data for you directly on the server and run the parsing there. So you don't have to download it to your computer first and upload it — you can do it directly from the URL. Okay.

Then you need to select the organism. Here we have all the organisms that are in Ensembl — there are actually many of them, something like 500 different organisms you can choose from. ASAP will then try to match the gene identifiers you have here on the left against the Ensembl database: if a gene is found, it is automatically annotated and given an Ensembl ID, and if not, the original gene name from your data is kept. Here I know that the PBMC data is human, so I keep the human organism. Here you can pick the ASAP version; the latest one, as I told you, is version 7 — it is very recent; actually, we should also change version 6 so it is no longer marked as beta. And then you can choose the project type, bulk or single cell — as I told you before, you can do both analyses, but today I will focus on single cell. Then you can give a name to your project, like "test project"; you can change it afterwards if it is not what you want. Once you click the create-project button, you get to the front page that we saw before.

You can see that the parsing of your data is first pending, because it basically needs to spawn the Docker container to run it, and then it is running, meaning the parsing is ongoing. You see how much time it takes, and for some steps you can also see the expected time: we have a predictor that says this step should take this amount of time given the size of your data set. For the parsing we don't have that, but it should not take too long. The parsing basically creates all the standard metadata that you have in a typical object — for example, we compute the mitochondrial content and the depth for each cell, and we store all of this as metadata.

Now that the parsing is successful, if you click on the vignette you see the output summary: 32,000 genes were detected and 2,700 cells, and indeed it is a count matrix. 97% of the values are zero, which is typical and expected for single-cell data — it is very sparse by nature, as you know. It is also good that all 32,000 genes in the original data were found in our Ensembl database, which makes sense because, as you may have seen before, they were already Ensembl IDs, so the mapping is easy. Sometimes it can be trickier if the IDs are not Ensembl IDs but gene symbols; then some may not be recognized, or may be ambiguous. But here, in this data set from the Seurat tutorial, they are Ensembl IDs, so it is easy. There is also the Loom file that is generated by default by ASAP, which you can export if you want to run something on your own — that is the export part. And now that we have parsed the original data, we can also visualize the metadata that were created.
At the cell level we of course have the cell IDs — the actual barcodes that you have as columns in your data set. We also generate a bunch of things automatically, like the depth, which is just the sum of UMIs per cell. You also have the mitochondrial content and the ribosomal content, which can be useful, as you will see, for the QC part: we use this information to filter out cells that have, for example, too much mitochondrial content. This is computed automatically based on the Ensembl mapping we did — since we know the Ensembl genes, we know which genes come from the mitochondrial genome and which are ribosomal genes, so we can do the computation automatically. Pretty simple.

Okay, now I go back to the pre-treatment. The parsing step is done, so I go to the next step. You see all the steps of the downstream analysis pipeline here; in principle they should be run sequentially. The next step is the cell filtering. In this step we want to identify cells that are outliers or bad-quality cells. Once you click on it, you have a bunch of predefined QC thresholds, which I will disable for now, and then you have a few QC plots. Here you have the traditional plot that you also get in the CellRanger output that I showed you before — except that of course you don't have the whole curve, because the data that was loaded is already filtered, so the empty barcodes were not present in the original data and we only have the top of the curve. But you can use it to filter further: for example, if you say I want more than 1,000 UMIs, or, I don't know, more than 10,000 UMIs per cell, you can do that and you see the filtering applied automatically. Maybe that is too harsh. You do that, and you see the cells that remain, like that. Okay.

Then you have all the plots that are usually made in Seurat — so let's come back to Seurat. In Seurat, you first need to create the Seurat object from the data (the same data I have uploaded to ASAP). Then you have the QC part with the mitochondrial content, which you need to compute yourself in Seurat; here in ASAP it is done automatically, so you don't need to do that. Usually you use this kind of plot to try to spot the outlier cells. Those plots are also visible here — basically all of these plots — like the detected-genes plot that you see here: that is all your cells, and all the cells that are kept; it corresponds to this plot, detected genes. Yes, nFeature is detected genes; that is what you see here, the same distribution basically. And you can say, okay, let's enable this and keep only the cells that have more than 1,000 detected genes. If I activate this, you see that I keep only 500 cells out of my 2,700, which means roughly 2,000 are discarded, and you see the effect it has on the distribution. Of course, if I do that, there is basically a clear cut here at 1,000, and you see the two distributions, the discarded cells and the kept ones. But you can also see the impact it has on the other plots. This is the depth plot — basically the nCount that you have in the second violin plot here — and the distribution is the same.
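For reference, the corresponding QC block of the Seurat tutorial looks roughly like this — a sketch; the `^MT-` pattern assumes human gene symbols for the mitochondrial genes:

```r
# Sketch: compute the per-cell mitochondrial percentage and draw the usual QC plots
library(Seurat)

pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")

# Violin plots of detected genes, depth (UMI counts) and mitochondrial content
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)

# Scatter plots: depth vs mitochondrial content, depth vs detected genes
FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
```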
Back in ASAP: here you have the plot split again by the thresholds that I set, and you also have the percentage of mitochondrial genes, so you can see what effect filtering at 1,000 detected genes has on the mitochondrial distribution. But here I will keep things as they were. And also these two plots — sorry, I forgot to mention — we have this plot here, which is basically this one, and let me disable the filter so you see all the cells again; and the second plot, which is basically the depth versus the number of detected genes, is also the plot that you have here. So, as you can see, ASAP produces all the plots that you have in the Seurat pipeline for your QC.

What they chose in this tutorial is to keep cells with at least 200 detected genes — sorry, features — and at most 5% mitochondrial reads, right? Yes, 5%. So that is what we will use here as well, the same thresholds. There is one threshold that we don't use, this one, because we currently don't have it in ASAP; it is something we will probably add later, but it is not really a big issue, so we can continue without it. And you see in the different plots the effect that it has — the cells that are discarded and the ones that are kept. Usually it is good to see that the discarded cells are of course the ones at the bottom, really the lower-quality ones. Okay.

So that was the cell filtering step. Let's run it — sorry, let me get back — let's run it now with 5% and 200. There is a question; let me start this and then I will answer. Okay, so I ran the filtering step with the two thresholds I told you about, and you can see that it is pending now; soon it will be running on our server. That is the case here, and once it is done, the little clock you see here will turn into a date, which means you can move to the next step. Okay.

So, what are the questions? I see two. First: what would be the recommended workflow in case I have multiple samples, say two data sets? One way is to simply merge the two data sets together and then upload them to ASAP. Afterwards — I didn't show it, but in the metadata part you can also import metadata — you can import a new metadata column that contains batch information, with 0 for the first data set and 1 for the second, and then use it later to remove the batch effect. That is one way to go. The other way would be to integrate the data sets together in Seurat first and then upload the result to ASAP. Both are possible.

And then Zeba asks: what is the filter on the percentage of protein-coding genes for? Here you may want to focus on the proportion of protein-coding genes, as opposed to mitochondrial or ribosomal genes. Usually you expect a lot of your reads to map to protein-coding genes and not to, I don't know, long non-coding RNAs or miRNAs or mitochondrial RNAs and so on. So maybe you want to say: I want 95% of the reads in each cell to map to protein-coding genes, I don't want all the other weird stuff. You can set that as a threshold too, and if I do that, you see only about 500 cells are kept — and of course you can also see the effect here on the different plots. For ribosomal genes, it is the same kind of thing.
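Before going on to the ribosomal filter, note that in Seurat, applying the same two thresholds we just used in ASAP (at least 200 detected genes and at most 5% mitochondrial reads) would look roughly like this — the tutorial additionally uses an upper bound of 2,500 detected genes, which ASAP does not currently expose:

```r
# Sketch: keep cells with > 200 detected genes and < 5% mitochondrial reads
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & percent.mt < 5)

# The Seurat tutorial also adds an upper bound that ASAP does not have yet:
# pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
```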
So, the ribosomal filter: that is the percentage of reads that map to ribosomal genes, and if you want, you can also filter out cells based on that threshold. That is something people sometimes do. You will see later, when you visualize your data with a UMAP, that for each cell and each cluster you can display the percentage of mitochondrial, protein-coding, or ribosomal genes, and sometimes you see a small cluster that contains a lot of ribosomal genes. Then you go back, as I was saying before, to this step and say, okay, I need to filter this out as well. So that is one way to use it, but of course you can also disable it and not use it.

Asma asks: will you also cover expression analysis and annotating cells? Yes, that is basically the last step of the pipeline, so I will get to it. And: where exactly does the computing happen — do you need to be registered on ASAP to compute? Even if you are not registered, it works; it will compute. Everything runs on our server: we have a very big server here at EPFL with a terabyte of RAM, I think, and I don't remember how many cores, but it is huge, and it can run many things in parallel, so there is really no issue. You will see next week when we run it with around 30 people together — in principle it should be fine. Everything is queued, so if the server is overloaded you will just be in a waiting mode and you have to wait your turn before your computation is done. But everything runs, and there is really no limitation whether you are registered or not; you can run anything you want. And only you can see the analysis you are doing, so it is very private, whether you are registered or not — except, of course, that there is the possibility to make your data public. If you set the project to public, then everybody will be able to see it, but by default, as you see here, the project is private, meaning only you can see it, unless you choose to share it with other people — you can share by email, and then it is visible to the people you shared it with. Otherwise, by default, only you can see the data.

Okay: what are the recommended defaults in ASAP for the QC? We do have some — you probably saw that when you create a new cell filtering there are a lot of default thresholds that we set. But they are quite arbitrary, I would say; it depends on the species, the tissue, and the technology you use, whether it is 10x or something else. So we put some default threshold parameters, but as you saw here, they would filter out too many cells, so I would not use them for this project. That is something you need to tune yourself. One option is to keep the default parameters from Seurat, like these — in our case they work well — but it is something you can adjust. Since you can see how many cells are kept and how many are discarded, you can tune the parameters accordingly so as not to remove too many cells.

Do we have to pay for storage space to save our data? Currently, I think there is a space limit, but all the data is stored directly on the server, and we have quite a big hard disk to store everything.
But then, if the data is not accessed for — I don't remember how long, two weeks or something like that — it is automatically moved to S3 storage that we have here at EPFL. And we actually have grants that pay for this, so we don't charge anyone for storing the data. The data is stored on our server or on S3 internally at EPFL, so you don't need to pay for that; it is included. Okay, so I think I have answered everything — let's move on.

So, as you can see, the cell filtering step is now finished. You have a recap of the resulting counts, a recap of the parameters that were used, and a recap of the output: you now see the new number of cells, and the remaining data set has a bit over 2,600 cells, which matches what we saw before. Now we can move to the next step, which is the normalization. Again, you create a new normalization — we use the Seurat one here — and you need to select the data set on which to normalize, because if you want, you can skip the cell filtering and normalize the raw count data directly. The parsing output here is the raw counts you got at the parsing step, or you can prefer to do it on the cell filtering output, which makes more sense in our case — I mean, we filtered for a reason. You can also do both: if you do, they will run in parallel, you will get two data sets, and afterwards you can continue the pipeline with both independently, for example to see the impact of the filtering compared to the raw counts. But here I will just run it on the cell filtering output. You see it creates the task, and here is the progress report showing that it was pending and is now actually running on the server. There is other information here too: you see the job number — something like 400,000 jobs have already been run on this server — and the wait time, which is basically the initialization time it took to get the Docker container before the job started running, and then the run time. You keep this information at the end as well, so you can still see how long the step took — 20 seconds — and how much RAM it used on the machine, two gigabytes.

So now the normalization is finished. There is no real output to look at here, just the normalized data — the same kind of output you get with Seurat; it is exactly the same method that we have here in ASAP. The next step is the HVG computation, the highly variable features, and we can run it now. Here you see that you actually have multiple methods. You can run the default one, VST, which is the same as the one used in the tutorial, I think — VST with 2,000 features. This runs on the normalized data: you cannot run it on the cell filtering output or on the raw data set, you really need to run it on the normalized data set. So I run it on my normalized data, I can select the number of features to keep — we will keep the default 2,000, which is the default used in Seurat — and I run it. And you can see that, if I want, I can select another method and run it as well; I can even run it with all methods — let's try — and you see that they all run in parallel, basically.
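For reference, the normalization and the VST HVG selection we just ran correspond roughly to these Seurat calls (a sketch with the tutorial's default parameters):

```r
# Sketch: log-normalize the filtered counts, then pick 2,000 highly variable genes with VST
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)

# Mean-variance plot with the selected variable features highlighted
VariableFeaturePlot(pbmc)
```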
So, about running in parallel: you don't need to wait for one method to finish before the next one runs; they really all run in parallel on the server. That is also good in terms of speed, because you can run multiple things at the same time. If you want to run multiple projects in parallel, you can open a new tab in your browser, run another project in that tab, and they will run in parallel as well — that works too.

Okay, so here it is done. In this step, as you see, there is actually a plot that is generated. You can also see it if you click on the completed step, which brings you to this recap page where you see the parameters that were used. You can also see the metadata that were generated — this is the metadata name as it will appear in the H5AD or the Loom file — and you can download the output. And you also see here the plot that was generated as part of the method's run, and it is interactive. You see it is the same as this one in the tutorial — that one is maybe a bit squashed, but here you see it better — it is exactly the same plot; it generates the same thing. And of course, if you look at another of the methods, you will see that the plot is different, because it is not the same method, so you get slightly different plots. You can do that for each of the methods we ran.

Okay, now that the HVG step is done, we can go to the next step, which is the scaling of the data. The scaling method takes as input, again, a normalized data set, so I take the normalized data that I created before. And at this step, if you want, you can also regress out some covariates — let me first launch it, and then I can explain while it is running. This is also described in the Seurat pipeline: if you have a source of variation that you want to remove, like the percentage of mitochondrial RNA, or a batch effect, or whatever, that is something you can do in Seurat — you indicate which variable you want to regress out, and it will be removed from your main signal. This is also possible here: if you click on the select button, you see all the different metadata that were generated by ASAP, or that you uploaded yourself — as I told you before, I don't know if you remember, you can also import your own metadata, as a list or as a matrix, and it will be added to the ASAP object. Then, from this same step, if you have a metadata column called batch, for example, you will see it here and you will be able to remove it; you can select as many as you want and they will all be regressed out when the scaling is done. Here I selected none, because I don't want to regress out anything — that is what they do in the tutorial pipeline, and I try to match it as much as possible — but if needed we could do it here.

Okay, so the scaling is done. The next step is the PCA. By default in Seurat they compute 50 PCs, I think. You select the scaled data, because the PCA only works on scaled data, not normalized data, so you pick your scaled data set. And then you also select the variable features to use: here I have three possibilities, because I ran three HVG methods to show you that you can run several in parallel.
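In Seurat, the scaling step (optionally regressing out a covariate such as the mitochondrial content — here, like in the demo, we regress out nothing) and the PCA that follows would look roughly like this:

```r
# Sketch: scale the normalized data on the variable features, then run the PCA
pbmc <- ScaleData(pbmc)   # add vars.to.regress = "percent.mt" to regress out a covariate
pbmc <- RunPCA(pbmc, features = VariableFeatures(pbmc), npcs = 50)

# Quick look at the genes driving the first principal components
print(pbmc[["pca"]], dims = 1:5, nfeatures = 5)
```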
Back to the PCA: here I will stick to the VST features, which is what the Seurat tutorial uses. You select the number of principal components that you want the PCA to generate — here we just pick the default 50 used by the Seurat pipeline. And then, again, it goes into the queue and gets generated. As you remember, the PCA is really the entry point for all the downstream tools afterwards: if you run the t-SNE or the UMAP, you don't run them on the normalized data or on the matrix, you run them on the PCA. So the PCA is very important when you do single-cell analysis, because it is the entry point for the t-SNE, the UMAP, and also for the clustering, since the clustering is also run on the PCA.

So that is done, but you don't see the PCA directly from this step; if you want to see it, you need to go to the visualization pane. Let me explain in a bit more detail what you can do with the visualization pane. When you click on visualization, you get this view that I showed you before, where every dot is a cell; here D1 is the first principal component of the PCA and D2 is the second. On the right you have the control panel — you can hide it, but if you open it you have multiple options for coloring and for graphical settings. In the general panel you can change, for example, the dot opacity or the dot size, and you can display the cell names on hover. By default that is not activated, meaning that if you hover over the cells you don't see their names — usually that is not really needed for single-cell analysis, but it is something we activate for bulk analysis, for example, where you have sample names and it can be useful to see them when hovering over the dots.

And then of course you can color the plot: either no coloring, which is the default, or you can color by three kinds of modalities. You can use continuous metadata, like gene expression, or other metadata — if I click on numerical metadata, I have, for example, the mitochondrial content that was created automatically. If I select it, the plot is colored by mitochondrial content, with the legend here. You can also color by ribosomal content — you see it is pretty evenly spread; usually this coloring is used when you have a weird cluster and you want to check whether there is a huge ribosomal content somewhere — and you also have the protein-coding content and so on. But you can also color by gene expression: you select which matrix you want to use — whether you want to plot the raw counts, the normalized counts, or the scaled counts; by default I find it better to use the normalized counts — and then you enter your gene. Here you see the list of genes present in your data; you can pick one or use autocompletion. For example, here you see the expression of the CD14 gene in your data, and you see that its expression is somewhat specific to this cluster here. And for now, I think that's it. This PCA is not the best representation, so you will probably not spend too much time on the PCA view.
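The Seurat equivalents of this coloring panel — a PCA scatter plus coloring by a metadata column or by a gene such as CD14 — would look roughly like this (a sketch; FeaturePlot accepts both genes and metadata columns):

```r
# Sketch: visualize the PCA and color cells by metadata or by gene expression
DimPlot(pbmc, reduction = "pca")                               # plain PCA scatter
FeaturePlot(pbmc, features = "percent.mt", reduction = "pca")  # color by mitochondrial content
FeaturePlot(pbmc, features = "CD14", reduction = "pca")        # color by CD14 expression
```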
Instead you want to go back and run the t-SNE and the UMAP. The t-SNE is run on the PCA, and again you can select — it is the same as what is done in Seurat — the number of PCs you want to use. Usually — well, not usually, but at least in this example — they use 10 PCs of the PCA, so let's do the same here, 10 PCs from the PCA we computed. And I will do the same with the UMAP. For the UMAP I can run one with 10 PCs and, just for the example, one with the default that we set at 50, so you can see the difference — although in principle it should not be a huge difference.

Ah, there is a question: in your experience, is 50 PCs a good default, or how do you choose the number? Well, it is a good question; it depends on your data set. There is a way in Seurat — it is not really visible here anymore, but you have this elbow plot, or you can also run a JackStraw analysis, to try to estimate the best number of PCs to use. That is a possibility. JackStraw is not implemented in ASAP; it is something we are also considering, to help the user select the right number of PCs, but it is not currently possible. Usually, from our experience, the default values are okay — sometimes you see a few differences, but not too much. The main limitation is when you work with really big data sets. If you work with a traditional 10x data set, which contains, I don't know, 4,000 to 10,000 cells, then it is fine to work with 10, 30, or 50 PCs; it generates pretty similar results. But if you work with many more cells, for example integrated data sets, then it starts to be limiting and you may see weird clusters that seem to be bundled together; then you may want to increase the number of PCs. Basically, it scales with the number of cells and the number of cell populations in your data set: the more cell types and cell populations, the more PCs you should use, and conversely, the fewer cell types and the fewer cells, the fewer PCs you need. In this example it makes sense to use 10 PCs, mainly because there are only 3,000 cells, and you will see in the UMAP that there is a small number of cell populations. But it is something you can always play with. As I tried to show you here in the visualization, I can use either the first UMAP, with 10 PCs, or this one, the UMAP with 50 PCs. You may see that it is shifted a bit, which is normal because the UMAP layout is partly random, but the populations are the same, so it doesn't really change the results much. As I said before, it mainly matters when you work with really big data sets; with small data sets like this, 10 or 50 doesn't change that much — well, that is from my experience.

Can we check more than one gene at a time? Yes, we can, with this RGB channel mode. You can enter multiple genes — three is the maximum here: you enter a first gene that is colored on the red channel, another one that is colored on the green channel, and you see the overlap of the two colors together; and then you have a blue channel as well.
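As a side note before finishing the multi-gene coloring: the t-SNE and UMAP we just ran, and the elbow plot mentioned for choosing the number of PCs, correspond roughly to these Seurat calls (a sketch, using 10 PCs as in the tutorial):

```r
# Sketch: inspect the variance explained per PC, then embed the cells
ElbowPlot(pbmc, ndims = 50)

pbmc <- RunTSNE(pbmc, dims = 1:10)
pbmc <- RunUMAP(pbmc, dims = 1:10)

DimPlot(pbmc, reduction = "umap")
```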
I don't know which gene names to put here, but if I take one example gene like this, and then another one like this, you can see a different color in the different places where each gene is expressed. It's also possible to use more genes, but then you need to use a feature that is called the module score, which is here. You can also import your own gene set: you create a gene set and import it as metadata, as I showed before, and then you can put in as many genes as you want. It basically creates a score that takes the average expression of all the genes you entered, compared to a background, and the plot is colored according to this score. It's called the module score; it's a method implemented in Seurat that we've kept here because we think it's very useful. You can use it with custom metadata that you create, with metadata generated automatically, or with global gene sets like DrugBank or the human GeneAtlas. Let's see if there is anything interesting, like B cell bone marrow. Maybe it doesn't make sense in this example, but you see it creates a score: blue means it's not enriched, or even depleted, and red means it's very enriched. Bone marrow probably doesn't make much sense for PBMC of course, but you can pick any group of genes you want, and it creates this score to color your plot.

One last question before we take a little break of about 10 minutes: to reproduce the exact same UMAP plot, is there something similar to set.seed required? Yes, well, we don't expose that actually; basically we always use the same seed. I don't know if it would be useful to allow the user to set a seed themselves. It will not reproduce exactly what was done in the Seurat pipeline, of course, because we don't use the exact same seed, but it should be close enough. In ASAP it will generate the same result every time, because it always uses the same seed, but currently you cannot change it.

Okay, so maybe a quick recap of where we stopped. We did all the pre-treatment steps, we ran the t-SNE, and we ran two UMAPs, one with 10 dimensions and one with 50 dimensions. Now in the visualization part you can visualize all of this: the t-SNE, the PCA and the UMAP. Nowadays people mostly work with the UMAP. I don't know if you know why, but it is better at representing the inter-cluster distances. In a t-SNE, your clusters can be placed almost randomly, so if two clusters are close together it doesn't mean they come from similar cell populations. In a UMAP this is more the case: if two clusters are very close together, it usually means they come from more similar cell types. So the UMAP better preserves this kind of structure between cell types, clusters and positions.
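To make the module score and the seed question a bit more concrete: scanpy has an analogue of Seurat's AddModuleScore, and fixing `random_state` is the set.seed equivalent. The gene list below is just an illustrative example, not an ASAP default.

```python
import scanpy as sc

# Module score: average expression of a gene set versus a random control set,
# stored as one numeric value per cell. Example gene list, purely illustrative.
naive_t_genes = ["IL7R", "CCR7", "CD3D", "CD3E"]
sc.tl.score_genes(adata, gene_list=naive_t_genes, ctrl_size=50,
                  score_name="naive_t_score", random_state=0)

# The score behaves like any numeric metadata, e.g. as a UMAP coloring.
sc.pl.umap(adata, color="naive_t_score")
```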
And if you have a cluster that is very far away from the others, in principle it means that this cell type is very different from the others. Okay, so let's get back to our Seurat pipeline. We had the PCA plot, which is here, and we have it here as well, so I plot the PCA and remove the coloring. You get this; I will enlarge the dots a little bit. You see that it's the same as the one in the Seurat tutorial, except that it's shifted on the x axis. That's normal, the PCA can flip plus or minus on an axis, so it's not an issue. And if you plot the UMAP with 10 PCs, the first UMAP, you can see it's also similar to what you have in the tutorial: you have this little cluster here that is probably this green cluster, this cluster with the little tail which is this one, this very long cluster here which is this one, and then this bottle-shaped cluster here, a very scientific term, I know.

So now what we want to do is something I skipped until now, which is the clustering of the cells. In the Seurat pipeline it can be done before or after the embeddings; it's the same in ASAP, so we could have run it just after the PCA, but we preferred to run the t-SNE and the UMAP first. It sits outside of what we call the pre-treatment, because we like the pre-treatment part to be very linear, going from top to bottom and running everything together, whereas the clustering can be run whenever you want, and you can run it again and run multiple clusterings. We felt it's really a bit different from the pre-treatment part, which is why we separated it.

So you create a clustering. For the clustering you need to specify your PCA, because you can of course run multiple PCAs; here I have only one, so I select that one. Then again you select the number of PCs, so here we will use 10. Finally you need to specify a resolution parameter. I don't know if you know what the resolution parameter is, but it's a parameter used by the modularity-based clustering, and it's quite arbitrary. In Seurat they use 0.5 here, so let's try that. There is an info bubble here that says, basically, the higher the resolution value, the more clusters you will get. So it's just a parameter to tune the number of clusters. Here I can also try a resolution of one and a resolution of two, and you will see the output; again, they run in parallel. At the end you will see that the number of clusters is different: in principle we should have fewer clusters with the 0.5 parameter, which is the first one here, and as you increase the resolution you get more and more clusters.

Okay, the first one is done, the one with the same parameters as the Seurat pipeline. In the Seurat pipeline they get eight plus one, so nine clusters, and here we also get nine clusters. But you see, when we increase the resolution, with a resolution of one you get 11 clusters, and with a resolution of two you get 15 clusters. So that's the part of the pipeline where things get very arbitrary.
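As a sketch of what the resolution parameter does, here is the same experiment in scanpy: graph-based clustering on the PCA at three resolutions, counting clusters each time. ASAP/Seurat use Louvain-style modularity clustering; Leiden below is scanpy's recommended variant and reacts to the resolution in the same way.

```python
import scanpy as sc

sc.pp.neighbors(adata, n_pcs=10, random_state=0)   # graph built on 10 PCs
for res in (0.5, 1.0, 2.0):
    sc.tl.leiden(adata, resolution=res, random_state=0,
                 key_added=f"clusters_res{res}")
    n_clusters = adata.obs[f"clusters_res{res}"].nunique()
    print(f"resolution {res}: {n_clusters} clusters")
# Higher resolution -> more, smaller clusters; which one is "right" is up to you.
```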
Because then you need to go back and forth, as I showed you before, trying to find the most sensible number of clusters. We can visualize it on the UMAP: if you go back to the visualization step, this time you go to the discrete coloring, because a clustering is a categorical, or discrete, metadata. Then you see the three clusterings. The first one is the one with the 0.5 resolution; you see the nine clusters that were generated and the number of cells in each of them. If you hover with your mouse, you can see which cluster the cells belong to; here is cluster one, two, five, seven, the order is random, it's not one, two, three. If you want, you can also display the other resolutions. This one has nine clusters, numbered one to nine; if you plot the second clustering, you have 11 clusters, and you see that it split this big group here a bit more, and also this one. And if you go to the third one, you see even more clusters, 15 in total.

If you go to the clustering step, you also have the possibility to compare your clusterings. For example, you can compare the clustering at resolution 0.5 with the clustering at resolution two and look at the overlap between the clusters. The first cluster of the resolution-0.5 clustering overlaps with the first cluster of the second clustering, the second one overlaps with the fourth one of the second clustering, and then you see some, like this one, that seem to be split; these are the ones that get split as the resolution increases. So we can also display this kind of information.

Let's go back to the UMAP and to the official clustering. I say official, but it's very arbitrary, as I said. What did I do? Let me refresh. So here we have the clusters. If you look back at the clusters that were found with the Seurat pipeline: you have this area that is one cluster, which is what we have here, probably this cluster; we have this little tail that is one cluster, which is also what we have here; then you have two clusters here, which is also what we have. So far so good, we have the same clusters. This cluster nine that seems to be a bit on its own, we see that here as well. And then we have this group which is split into four different clusters, and that's also what we observe here. So it seems that we get fairly similar clusters to the Seurat pipeline.

Okay, any questions? Can you run the clustering on the UMAP dimensions? You cannot do that; we allowed it before, and I think if you go to the bulk pipeline you should still be able to, but I would say it's really wrong to do it. As a bioinformatician you should probably not do it, and that's why we removed the option: it's not the way it should be done. Someone says the association of cells between different clusters is very nice to know, and they have questions about it; okay, let's write it down and I will try to answer later. So that's the clustering part.

Now I need to wait for the evidences to finish running, so in the meantime, something I didn't show before: now that we have categorical metadata, because a clustering is a categorical metadata with different categories, not a continuous one, you can go back to the visualization and display some gene expression, like CD14 for example.
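The clustering-comparison view just mentioned boils down to a contingency table between two labelings; a one-liner reproduces the idea, assuming the cluster columns from the previous sketch.

```python
import pandas as pd

# Rows: clusters at resolution 0.5, columns: clusters at resolution 2.0.
# Large off-diagonal blocks show which low-resolution clusters get split.
overlap = pd.crosstab(adata.obs["clusters_res0.5"], adata.obs["clusters_res2.0"])
print(overlap)
```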
And then in this UMAP you see the expression of CD14, which is apparently specific to this cluster. You can also show the stats according to a specific metadata: if I show the expression of this gene according to the first clustering, you see the different clusters and you see that this gene is expressed mostly in cluster number three. And if you have some annotations, which I will show you afterwards, you also get a recap here. So in this gene expression view you can display not only the expression, but also some information relative to a clustering or to an imported metadata, and whether the gene is linked to any informative annotations.

Okay, the next step, if you go back to the pipeline, is almost the final step, which is the calculation of the marker genes for each of your clusters. In ASAP there are actually two ways of doing this. The first one is the one I already launched, but it takes a bit of time to compute. If you go to discrete and select the clustering of interest, you get this view, which we call the annotation view, where you see all the different clusters. You can select one cluster in particular, or multiple ones, and see the cells in them. Here you also see the best annotation, which is empty for now because I didn't create any. If I want, I can annotate each of my clusters based on marker genes that I know.

For example, if I click here on the first cluster, you see there is this evidences tab, which has now finished computing. You can see the up- and down-regulated genes, and you can select the genes that are the marker genes. That basically computes what we call the evidences, the marker genes that support an annotation you want to make. Here you see there are some genes expressed, like CCR7, LEF1 and LDHB. If you go back to the Seurat tutorial, they have a nice table where they say that CCR7 marks naive CD4+ T cells. So indeed, CCR7 seems to be a marker gene for this cluster. Then you can go back and create a new annotation.

For the annotation we use ontologies; there are multiple ontologies that we use for annotating the cells. We always encourage users to use an ontology for annotation, even though you could put free text, because we believe that using an ontology is better for reproducibility and also for integration across data sets. Sometimes people will annotate a specific cluster in one way, sometimes they will make a typo, and when you want to integrate multiple data sets you end up with so many different names for the same cell type that integration becomes complicated. Since we care a lot about the FAIR principles, we implemented multiple ontologies to help you annotate your data. So you can use them to annotate your data sets, and if you don't find the right term, then of course you can fall back on free text, but by default I would encourage you to use the ontology. Here you see I selected a term, which is probably not exactly what was shown before, but it's okay; it means that when I press save, my cluster will be annotated with this term. I can also record the evidences that I found.
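The evidences table amounts to a one-cluster-versus-the-rest marker test. Here is a hedged scanpy sketch of the same computation, using the Wilcoxon test as in Seurat and the resolution-0.5 clustering from before.

```python
import scanpy as sc

# Marker genes ("evidences"): each cluster tested against all remaining cells.
sc.tl.rank_genes_groups(adata, groupby="clusters_res0.5", method="wilcoxon")

# Top candidates for one cluster; in PBMC data a naive CD4+ T-cell cluster
# would be expected to surface genes like CCR7, LEF1 or LDHB near the top.
print(sc.get.rank_genes_groups_df(adata, group="0").head(10))
```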
The evidences that I found in this tab, I can put here. What was it again? It was CCR7. So we use CCR7 to annotate this cluster, and I put it here; this is my evidence that this cluster is indeed the naive CD4+ cluster. Then you save it, and your annotation is automatically added here as a new annotation. You have the automatic annotation from ASAP, which is just the cluster number, which you may not need, but your own annotation is added here: the ontology term and the marker genes that you used. On the right you have the possibility to upvote or downvote the annotations. Of course, if the project is private and you are the only one who can see it, that is not very useful; but if you start sharing the project, or if it becomes public, it can be useful, because other people can upvote or downvote a specific annotation so that it gets higher in the ranking, maybe because it's more suitable, or because a new term has appeared in the ontology and you want to update your annotation, and then it gets a higher rank.

And this is what you get here: remember, when you are in the plot, you go back, select discrete coloring and select your clustering; this column, which was empty before because there was no annotation, now shows the best annotation, the one with the most votes, with both the ontology term and the genes used to support it. And of course you can then do the same for all the other clusters.

For example, let's take the last one. What are the evidences? PPBP. That marks platelets. So I go back, create a new annotation, and say okay, this is a platelet, corresponding to this ontology term, and the evidence is PPBP, and I save it. Then it's created here, and if you go back, you see it's now correctly annotated as platelet with PPBP as the supporting evidence. Let's check whether it's actually correct, I didn't verify: if I go to continuous, gene expression, and enter PPBP, then indeed you see that PPBP is very highly expressed in this cluster, which was the last cluster, and you see it again in this view of expression by cluster. Okay.

So that's convenient for annotation, for sharing your annotations with other people, and also for recording these evidences. If you want to see the marker genes in your data, as in this graph here, you can change the thresholds if you want more genes: by default the fold change threshold was 2, but if I want to be more lenient I can change it to a fold change greater than 1.3, and then you see all the additional results here. You can also change the FDR threshold, and the table will update.

Okay, so there are two more things that you can do. If you want to run a differential expression, you can do it through the DE, or differential expression, step. Here again we use Seurat with the Wilcoxon test, which is the default one for differential expression.
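The threshold change described above (fold change from 2 down to 1.3, plus an FDR cutoff) is just a filter over the marker table. A small sketch on the output of the previous snippet, keeping in mind that scanpy reports log2 fold changes:

```python
import numpy as np
import scanpy as sc

df = sc.get.rank_genes_groups_df(adata, group="0")
fc_threshold = 1.3       # linear fold change (the stricter default shown was 2)
fdr_threshold = 0.05
markers = df[(df["logfoldchanges"] > np.log2(fc_threshold))
             & (df["pvals_adj"] < fdr_threshold)]
print(markers.head())
```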
Then you select a clustering, for example the first one, and you have multiple options, such as each cluster against its complement. Let's run this one; it will basically produce the same results as the ones you saw before in the evidences tab. It takes every cluster and compares the cells of that cluster against all the other cells. It's another way to get the same information, but in this tab you have more detail: in the evidences tab you just have the table and that's it, whereas here you can highlight genes of interest, transcription factors, surface markers and so on. So you have more options in this view than in the evidences view, which is a good reason to run it anyway.

But you can also do something else. Sometimes you have two clusters that are very similar, and you don't know if they are really different cell types or actually the same cell type. What you can do is pick a cluster of interest, say cluster four, and run a comparison against, say, cluster eight. Then it will do the comparison between these two clusters only. So it's not marker-gene detection anymore, it's a real differential expression, where you try to find what is different between those two clusters. That's useful when you have something odd, or when you want to dig deeper into some interesting clusters. Or, if you have different conditions, because you may also upload data with different conditions, different cultures, disease states, tumor states and so on, then maybe you want to compare tumor versus non-tumor: you define two groups and compare them. That's also possible.

And then you get this view with all your results. Here you see the results for group one versus everything else, and here the results for group four versus group eight. You have the column of up-regulated genes and the column of down-regulated genes. If you go back to cluster nine, for example, you see again this PPBP result that we had before.

You also have the marker view. The marker view allows you to see everything at once: the top 10 or top 20 genes for each of the clusters, with the up-regulated genes in green and the down-regulated genes in red, for each cluster, together with the reference and the comparison cluster. And then you can highlight things: for example, you can highlight which of these genes are transcription factors. You click that, it refreshes everything, and it highlights the transcription factors, both among the up- and the down-regulated genes. You can also highlight surface markers, which can be useful if you have some validation to do, for example for some cell populations: you click surface markers and it highlights, in both lists, which genes are surface markers. That can be useful as well; it's something you cannot do from the evidences, you really need to do it here from the differential expression. But it can be handy depending on what you are trying to do, for instance for validation.
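The targeted comparison of two similar clusters, as opposed to one-versus-all markers, looks like this in scanpy; the labels "4" and "8" are only examples and assume the resolution-0.5 clustering actually produced those clusters.

```python
import scanpy as sc

# Differential expression restricted to cluster "4" versus cluster "8" only.
sc.tl.rank_genes_groups(adata, groupby="clusters_res0.5",
                        groups=["4"], reference="8", method="wilcoxon")
de = sc.get.rank_genes_groups_df(adata, group="4")
up_in_4 = de[de["logfoldchanges"] > 0]     # up-regulated in cluster 4
down_in_4 = de[de["logfoldchanges"] < 0]   # down-regulated in cluster 4
```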
Something I didn't show before: in the visualization you can also do a lasso selection. Let's color by the clusters again. You see, for example, that in this cluster you have some cells, they're a bit difficult to select, that are not from cluster nine but from cluster five, and others from cluster three. This is a known issue with clustering methods, especially when you do the clustering on the PCA. You can try to mitigate it by reducing the number of PCs used for the clustering, which gives a much cleaner clustering, but usually it's not really a problem.

Although people may argue about doing this, sometimes you may want to select a group of cells yourself, because you've tried a million different clustering parameters, different resolutions, and you never get the cluster you are after, maybe a small group of cells like this one, and you really want to see what is special about those cells. So it's possible, as you can see, using either the box or the lasso selection: you select a bunch of cells and create a new metadata from them, for example "my cells". It creates a new metadata containing only those cells, basically a cell-level metadata with a one for all the cells you selected and a zero for all the others. From this you can, again, run a differential expression: you go to the differential expression step, create a new one, and in the group field you select the selection you just made; the reference group is the one, the comparison group is the zero. You run the differential expression on this, and you find what is special about the cells you selected. That can be useful when the clustering doesn't really match what you want to see, which sadly happens quite often. So it's a way to focus on specific cells that may not be readily visible in your clustering.

Okay. And the last step is the functional analysis. Again, you have the module score that I told you about, but as a separate step, in case you want to do a module score calculation and store it for later. It really creates a new metadata that you can reuse afterwards, so you don't have to recompute it each time; if you use it from the visualization, as I showed before, it's computed on the fly, so when you switch to another one the previous one is discarded, while here you can keep it for later.

And here you can also do gene set enrichment. You take a DE of interest, for example the DE I ran with all the clusters, and then you can run a gene set enrichment: you can enrich your results against GO biological processes, Reactome, or DrugBank if you want to see whether some drugs are involved. And then you run the enrichment. You can also run it against GeneAtlas if you want to see which cell types or organs would be enriched in the cells you selected, or in the genes that you found. So that's another way to help you with the annotation, because sometimes annotation can be tough, and these kinds of results can help you prioritize some biological pathways, cell types or drugs that may be relevant to your question.
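Under the hood, the lasso selection is nothing more than a 0/1 per-cell metadata column that can then serve as the grouping for a DE run. A sketch of that idea; the barcodes are placeholders for whatever cells you drew around.

```python
import pandas as pd
import scanpy as sc

selected = {"AAACATACAACCAC-1", "AAACATTGAGCTAC-1"}     # placeholder barcodes
adata.obs["my_cells"] = pd.Categorical(
    ["selected" if bc in selected else "other" for bc in adata.obs_names]
)

# DE of the hand-picked cells versus everything else.
sc.tl.rank_genes_groups(adata, groupby="my_cells",
                        groups=["selected"], reference="other",
                        method="wilcoxon")
```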
And again you have the same view as for the differential expression, where you see the pathways that are up-regulated for each of your clusters. You can open them; for example the last cluster, cluster eight, seems to be enriched for a drug whose name I don't know how to pronounce. But basically you can also use this to help you annotate your clusters.

I will try to answer the questions now, because I think I'm mostly done, unless I forgot something. Let's see your questions. Yes, that's the case: clustering is quite arbitrary, and random as well, so it may depend on the versions of the packages. That's actually why we keep track of all the package versions, because sometimes you change the version of one package and it completely changes the clustering results, or the UMAP, even if you use the same seed. It can also depend on your computer, and on whether you use R or Python; all of that has an impact. That's why it's usually difficult to fully reproduce results: you really need to have everything in a Docker, as we did, with fixed parameters and fixed versions for all the packages, to be able to reproduce results perfectly.

To compare two data sets, you need to integrate them together; that's usually what is done, and it's not something you can currently do with ASAP. It's something we are implementing, but for now it's not possible, so if you want to do that you will need to do the integration outside of ASAP and then load the result into ASAP to help you annotate the cells that fall together.

The ontologies are super nice, thank you. Surface markers, nice, thank you. TFs and surface markers, they come from the Gene Ontology; Fabrice already answered that.

There is a very nice question, sorry, I don't know how to pronounce your name: are you considering implementing some automated annotation in the future? Actually, we already did. We collaborated recently with two groups: a group called BG, which is developing another website, and a group at ETH in Zurich, from Marco Vincent. We have already developed an automated annotation pipeline, but it's still being benchmarked, so we are on the verge of implementing it in ASAP. That's something we want to add in the near future, to help people annotate their data sets, because as I said before, the annotation step is really tough: it takes a lot of time, and it usually requires experts, literature searches and so on. Here in ASAP you can be helped to some extent by the functional analysis tools, and depending on the species you work on, you can be helped more or less. If you work with mouse, human or Drosophila, it's usually not too hard to annotate the data yourself, but for other species it can become very hard. So for now we have a tool for automated annotation; we actually use the ontology for training a classifier, so we really use the hierarchical tree of the ontology to help the classifier classify the cells. But it's not yet in production, we are still working on it; it should come out in the near future.
Then, Nastasia: isn't it dangerous to manually select the clusters? Yes, it is dangerous, of course. Usually it's much better to keep the unsupervised clustering and stick with that. But sometimes the unsupervised clustering behaves really oddly: you see a cluster that should be split in two, you know it, you see it from the marker genes, and you try different resolutions and it splits everything else but not that one, which can be very frustrating. So I think this option is good if you want to verify something or compute a specific DE, but it's not something I would use for the final annotation. The annotation I would still do on the main clustering results, even if that means annotating one cluster with two different ontology terms, which is something you can do. But yes, I agree; it's not so much dangerous as it is biased, I would say.

And finally, cell cycle. We did not implement it, so we don't have a cell cycle score. That's probably something we should do, but currently we don't compute it. It's something we could actually do for human and mouse, so that's a good suggestion; maybe we can implement it at least for human and mouse, where we have a defined set of genes that belong to the different cell cycle phases. But currently we don't have it.

As you say, though, you can export the data and run it in R. As I showed at the very beginning in the tutorial part, there is a tutorial four on how to work with loom files created by ASAP. When you are in a project, you can always export it to a loom file, and then you can use these two scripts, this one or this one, to import it into an R environment or a Python environment and continue working on it. So if you want to run the cell cycle scoring, you can do it afterwards in R or in Python, for example; that is always doable with ASAP. And in principle, when you do that, everything that was computed is present in the loom file: the raw count matrix, the normalized matrix, the scaled matrix, the PCA, the UMAP, the clusterings, it really contains everything. The only thing that is not included is the annotation, because currently we cannot really store it in the loom file; but all the rest of the metadata is stored there. So you can, in principle, reproduce the whole thing, or recreate a Seurat object, for example, if you want to continue the analysis in Seurat on your own computer.

Then, Gert: it could be quite interesting to identify the mitochondrial genes also for organisms where they are not annotated. Yes. For now we optimized this for a few organisms, where we know what the mitochondrial genes are: they are the ones whose names start with "MT", in its different forms, lowercase, uppercase, with or without a dash, and so on. We did that for some organisms, but it's true that we didn't do it for all of them, so it can happen that for some organisms the mitochondrial content is all zero.
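For completeness, here is a sketch of continuing outside ASAP in Python, covering the points just discussed: importing the exported loom file, computing mitochondrial content from gene-name prefixes, and adding a cell cycle score. The file name and the truncated gene lists are placeholders; the full S and G2M lists would typically be the Tirosh et al. sets shipped with Seurat, and whether the main loom matrix is already normalized depends on the export, so treat this as an outline rather than guaranteed ASAP conventions.

```python
import scanpy as sc

adata = sc.read_loom("my_asap_project.loom")   # hypothetical export file name

# Mitochondrial content by gene-name prefix, the same trick described above
# for supported organisms (names starting with "MT-", "mt-", "Mt-", ...).
adata.var["mito"] = adata.var_names.str.upper().str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mito"], inplace=True)

# Cell cycle scoring, which ASAP does not compute; truncated example lists.
s_genes = ["MCM5", "PCNA", "TYMS"]
g2m_genes = ["HMGB2", "CDK1", "NUSAP1"]
sc.tl.score_genes_cell_cycle(adata, s_genes=s_genes, g2m_genes=g2m_genes)
print(adata.obs[["S_score", "G2M_score", "phase"]].head())
```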
This is probably because the annotation is not done yet for that organism. If this happens, don't hesitate to contact us: there is a feedback button here, and at the bottom of the website you also have the contact email. Send us an email and we can have a look; we are pretty responsive to ASAP users. We already had some requests like that and we implemented them quite fast. So if you need this information for a non-model organism, just tell us and you can have it.

Can we implement maturation trajectories as well? What do you mean by maturation trajectory? I'm not sure; you mean trajectory analysis, something like that? So, trajectory analysis is something we originally had, and it was a nightmare to work with. In our lab we actually have Wouter Saelens, who is a postdoc with us; I don't know if you know his paper, where he compared something like 50 different trajectory methods. He's the expert in the lab and we discussed this a lot with him. So it's something we could have, but it's very tough, and I also consider it quite arbitrary, because some trajectory methods work well in some cases but not in others. In ASAP we try to have a very robust pipeline and to guide the users as much as we can, and we felt that offering trajectories was really not easy to do well, so we did not keep it. At first we had the Monocle trajectories, Monocle 1, 2 and 3, and after Monocle 3 we stopped. For now we haven't had many complaints about it from our user base, but if we do, we will probably implement it again.

Can you use the evidence function for non-model plants? Yes, the annotation basically works for any organism, even non-model organisms, because we use Uberon by default. We also use some ontologies that are species-specific: for example, if you work with human data, you have plenty of ontologies for cell types, developmental stages and so on, but if you work with a non-model organism you usually only have Uberon, which is the most generic one. Of course it's not perfect, but we depend on the existing ontologies; we cannot create new ones. So if you work with cow data, and there is no ontology describing the specific cell types that exist in cow, then we will not have it; there will only be the generic cell types you can find in Uberon. For plants, I think we have some; I don't know whether Arabidopsis is covered. It depends whether it's in Uberon or not; I'm not sure for plants, actually, that's a good question, something to consider. If you know any ontology for plants, tell me.

You can also upload your own gene sets, yes; it's in the add-metadata part, you can upload gene sets there as well, so you can upload your own if you want to use them for annotation. But I would actually recommend, if you know of an ontology that is missing and you think it's important, because you are an expert in a field we don't know about, and of course we don't know everything, since in the lab we mainly work with the model organisms, plus mosquito, but mosquito is actually very similar to Drosophila.
So we can use the Drosophila ontology for mosquitoes, and it works pretty well. But for other organisms, like plants and so on, if something is missing, please don't hesitate to tell us via the feedback button and we will try to integrate it. It's actually not too hard for us to integrate a new ontology if it exists; it takes maybe half a day of work. So if you know the best ontology for a specific species and we don't have it, just tell us and we will add it.

You mentioned it's possible to compare conditions, like control versus disease. Yes, if you have that, then what you need to do, as I showed before, is create a metadata. Let's go back to the projects: in the project view, you go to metadata, then add metadata, and here you can submit a new metadata. Basically you say that these cells are in this condition, in a two-column kind of format, or you paste it as a matrix like that; you can copy-paste it directly or upload a CSV file that you have prepared. Once you upload it, it's added as a metadata within the ASAP project, and then you can use it everywhere: you can color your visualization with the metadata you just created, to see the difference between the two conditions. So in the case of disease versus control, you create this metadata annotating which cells are disease and which cells are control, and afterwards you can visualize it and run differential expression between the two conditions; anything you want to do with it will be possible.

Then a question about which ontologies are available when you work with mouse. Yes, for the model organisms we usually have both the generic and the specific ontologies. For the Fly Cell Atlas we have FBbt from FlyBase, which is a very specific ontology for Drosophila. For the Human Cell Atlas we have Uberon and the CL, the Cell Ontology, which is also quite specific. And similarly for mouse and the other model organisms we usually have several ontologies. I have the list somewhere; under info you can see the list of available ontologies. So we have the classic ones, and for human we also have these two extra ones, for the developmental stage and the one from the Human Cell Atlas, and for flies we also have the FlyBase one. Whether the Cell Ontology works for human only or also for mouse, I don't remember; I think it's a generic one. So you have all these ontologies available for the annotation. But as I said before, if you feel one is missing and you know one we could use for another non-model species, or even a model species, please tell us. We also currently work with David Osumi-Sutherland, who is one of the experts in ontologies, and we are actively looking for better ontologies; we are also submitting new terms, especially the BG group, which is submitting a lot of new terms to these ontologies to keep them as up to date as possible. So we are collaborating with these groups and trying to do our best; of course, it's a very rapidly evolving field, and sometimes we may lag a bit behind.
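And on the condition metadata from the beginning of this answer: conceptually it is a barcode-to-condition table attached as a per-cell annotation, after which the disease-versus-control DE is just another grouped comparison. A sketch with hypothetical file and column names:

```python
import pandas as pd
import scanpy as sc

# conditions.csv (hypothetical): first column cell barcode, column "condition"
# containing "disease" or "control" for every cell.
conditions = pd.read_csv("conditions.csv", index_col=0)
adata.obs["condition"] = pd.Categorical(
    conditions.loc[adata.obs_names, "condition"]
)

# Differential expression between the two conditions.
sc.tl.rank_genes_groups(adata, groupby="condition",
                        groups=["disease"], reference="control",
                        method="wilcoxon")
```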
So please tell us if you see any inconsistencies; this is something we already get from our user base. People send feedback saying that a term is wrong or missing, then we discuss it, and afterwards we can go back to the ontology maintainers and submit a request for a new term as well.