Okay, perfect. So we have now gone through dimensionality reduction, and in the exercise you saw that there might be something, a batch effect, that you want to correct for. So you might want to do some integration, as it is called in Seurat, and then come back to dimensionality reduction to check that whatever you did has been done correctly, so that you now have a beautiful art piece, as we said: a visual picture that is nice enough and that you can draw good conclusions from for your paper. This is what we will be talking about right now. Let me put that in full screen first, just because it's better. Okay. And let me just put this up. Sorry.
So why should we integrate? Here is a picture from a colleague. This is very much what we sometimes see in data sets where you have several donors or patients, for instance, and everything clusters according to the donor and not according to the condition you really want to visualize. You do not have cell type one and cell type two clustering together, but clusters made out of each donor. Within each donor's cluster you can still see sub-clusters, which probably represent cell types, but visually it is not appealing as a picture to put in a paper if you want to draw attention to some important conclusions. So this might be the situation you encounter.
Another situation you might encounter is that the two conditions you want to compare, say sick mice and healthy mice, drive the low-dimensional picture. For instance, here in the t-SNE, healthy and sick patients are clearly separated, but the cells do not group together according to cell type, and that is what you would like to represent.
In the Seurat tutorial they also talk about a different integration problem: you have sequenced the same type of cells on different platforms and you would like to group them together. Here you see the t-SNE representation of the unintegrated data, where everything groups according to sequencing technology, and what you would like to have at the end is an integrated version. This is also something I have had in my hands. I had data collected in Switzerland by a lab that was doing 10x sequencing, while their colleagues in Sweden, working on the same data set, decided to go for Smart-seq, and so they decided to group those two data sets together. What was nice is that in the 10x data set, with our colleagues from here, we were able to find rare cells, whereas in the Smart-seq data set only one cell had that rare cell type. So in the end we could still benefit from having Smart-seq together with 10x, because we had the more in-depth version with Smart-seq and the rare cell types with 10x.
This was just a parenthesis. What we want to understand is how to go from this picture to the magic that happens here, where suddenly everything is grouped together. As I said, integration can be useful for many different sources of variability. You might have technical variability: differences in library preparation, differences in sequencing technology, maybe differences in the people who handled the data. These technical batch effects are not something you want to visualize. You do not want a group of cells that were handled by your colleague and a group of cells that were handled by you; that is not what you want to put as a visual output in your paper.
You might also have what I called more biological variability, which is what I showed in the pictures before: patient differences or sample differences, which is what we often see, or even evolution. Some papers came out where they analyzed across species, so they had mouse and human and wanted to group the findings from both on one picture that summarizes it all. These are the biological batch effects that might also confound your single-cell analysis.
This is also what Herd emphasized yesterday: you would like to have a balanced cohort and a design that is not confounded, so you do not want a batch that matches exactly your variable of interest. We already discussed this: good experimental design does not remove batch effects, but it prevents them from biasing your results. This is important to emphasize.
And again, here is the picture of the single-cell RNA tools. There is a category called integration, and as you can see it contains 200 and something tools, so as always we will not discuss them all. However, with
integration, what is nice is that many tools work on the same general principle. So I will try to describe the general concept to you, and many of the different integration tools then follow it. Here is a list of some that are or were popular; most of them are still used quite a lot. There is MNN, which is widely used. Then there is the version of integration used in Seurat from version 3 on. And then there is one I want to emphasize because it works a little differently, which is STACAS; it has been developed locally by colleagues of ours, which is why I am giving it a bit of publicity. These are the ones I would like to discuss, because they are very popular.
Here is, generally speaking, how most of these integration tools work. First, they find corresponding cells across data sets. What "corresponding cells" means is where the different integration methods differ: they compute a certain distance between the cells in a certain space. As a small example, you could first project the cells into a UMAP and then take the Euclidean distance on the UMAP to measure the distance between cells across your different batches. That would be one way of starting an integration: you say which are the corresponding cells across your data sets. Second, they compute a data adjustment based on these corresponding cells, and here again the methods can differ in how they compute it. It is like a correction vector that tells you how to bring closer the cells that should be close. Third, they apply that adjustment. So this is generally speaking how it works, and how the distance is computed, which space you are in and how the adjustment works is how the different integration tools actually
differ. So let's take MNN, the mutual nearest neighbors, first and describe how it works. Exactly as I said, you first have your two batches, or more, but let's say two: these might be the cells you measured with Smart-seq and these the cells you measured with 10x. What you want to understand is which cells correspond to each other, or let's say which should be close to each other in your final picture. So you take a pairing, as I said: you find corresponding cells, then you apply a correction vector and put them together in the final space.
How it really works is like this. First you go to your dimension reduction, which in the mutual nearest neighbors picture here is a t-SNE, and you compute which are the closest neighbors in terms of Euclidean distance. In this example, and it is a dummy example, you have blue cells here and red cells here, so there is a really clear separation and a clear batch effect. What you want to do is find neighbors, that is, cells that are or should be close to each other in the dimension reduction space. As you can see here, the corresponding cells are these two.
The first thing you do is, for each cell in batch number one, compute its closest neighbors in batch number two. For this cell highlighted here you have these three neighbors. The number of neighbors is a parameter that you set in the method; here I chose three. So these are the three closest points in the blue batch for this red point. You then do the same thing for all the points in the blue batch: for this point in the blue batch, you select the three closest neighbors in the red batch, which are these three. A mutual nearest neighbor pair is then just a pair of cells
that have each called the other a neighbor. These two cells have called each other neighbors, so this will be one of the pairings that you use. These two cells here have also picked each other as neighbors, so they are another mutual nearest neighbor pair that you would consider in your data set.
The MNN method has been described like this. First you need to find all the pairs that you have. Then you calculate a pair-specific batch correction vector, which is simply the following: you take the expression of the genes in cell A and subtract the expression of the genes in cell B, and you do that for all the mutual nearest neighbor pairs. So you get several different vectors; as I said, for that point, for instance, you have this pair and this pair, which are both considered mutual nearest neighbors. For all of the pairs you calculate this pair-specific batch correction vector, and then you use this function here, which is doing a sort of average of them. This average gives you the correction vector for each cell, and this is what you then apply to go from your original gene expression to the batch-corrected gene expression. You do that for both batches, of course, and then you merge them into a final picture.
Here they show how MNN worked. This is an example from this paper, where they advertise that MNN works well. You can see in the first picture that the batches were actually data sets, so data that was collected separately, and that what is mainly clustering together is the data set itself: the green one, which is GSE81076, is clustering
together; the red one has several specific clusters but is also mainly clustering together. If you do the MNN correction, this is the picture you obtain. Now you no longer have clustering that depends on the data set, and if you color by cell type here, you can see that what clusters together does so according to cell type, not according to data set. This is what you would like to achieve: a picture of something that you can then discuss in a paper.
Now, we will be mainly using Seurat, so of course I would like to discuss how Seurat works. Again, it uses the same principle of finding corresponding cells, computing the data adjustment and then applying this adjustment, but the space in which it finds the anchors and calculates the data adjustment is not the same as in MNN. What it first computes is a canonical correlation analysis (CCA). Canonical correlation analysis, unlike PCA, groups cells together according to their correlation scores. Here you can see how it looked in PCA, and this is how it looks in the canonical correlation space. What it then does is first a normalization, the L2 normalization. Then, in this space here, it finds the anchors, so the corresponding cells, just like the mutual nearest neighbors; it just has a different name for them and calls them anchors. So this is what it calculates here: which cells in batch one, the purple batch, correspond to cells in the green batch.
In the end it does exactly the same thing: it again calculates a pair-specific batch correction. But here, unlike in MNN, instead of gene expression you take the L2-normalized canonical correlation components and compute the distance between those two. Then you do exactly the same and average over all
the pairs or anchors that you have, and this is what you then use to calculate the correction for each of the batches, and then you merge the two. In Seurat the functions are called IntegrateData, and FindIntegrationAnchors to find the anchors. This is how it works in Seurat. I don't know why I went back. Okay.
Here Seurat advertises how well it works on this data set. You can see that they manage to merge technologies: here you have these six different protocols, which also include Smart-seq and 10x. You can also see that they try to merge species, so human and mouse. In the end it merges very well, except for this cell type, the [unclear] cells, which are apparently specific to human. I am not the biologist, so I do believe what I see. And you can see that what is clustering together does so according to cell type. This is what you would like to achieve, because in the end you want to discuss whether a gene is high in a cell type and so on, and this is what you would like to visualize.
So in the end integration is used to obtain a visual representation that enables you to show the conclusions you would like to draw from your data set. It is a nice picture where what you see is organized according to cell types and not according to the variability, technical or biological, that you would like to remove. It is not something you would use for differential gene expression. There you would include those sources of variability as covariates, and you would not take the corrected values that you have calculated as numbers to feed to a differential gene expression tool. This is important to remember. Why is that? Because the correction is quite harsh: the numbers you get at the end are not really representative anymore of what was there at the beginning, and you have given more
power to the batch that you considered than to the biology that you really have. So in my opinion it is always better to go back to the original data and include, for instance, technology or species as covariates when you model the differential gene expression. That way you give the same power to all of your covariate effects, and for me this is the way to go.
Something else is that Seurat enables you to use this theory of integration to do what they call label transfer. How it works is that you have your own data set, which is all in black: you don't know which cell types you have. And you have a reference data set that you know comes from a good author who has annotated the cells well, and you would like to use it as a reference to annotate your cells. Seurat will use the technique of integration and move the cells closer to their anchors, let's say, to merge the two data sets. With that it works out which cell types you have in your data and annotates your query, meaning your data set, with the reference you provided. This is what they call label transfer, and they claim it works super nicely. It does; however, it is quite computationally intensive and takes quite long to run, so depending on the size of your reference and of your data set it might not be the best option. But this is something Tanya will discuss at some point today, right, or tomorrow? Yes. So I will let her give you the full details.
Just at the end I want to mention STACAS, because it works a little differently: it is semi-supervised. So you do not only use an unsupervised method where you just click and it gives you the result; you also provide some knowledge about the clusters, so that if you ever have anchors between two different cell types it will remove
those. So it is like a, yeah, supervised version of what Seurat has implemented, and it could be useful for you. It is called Sub-Type Anchor Correction for Alignment in Seurat to integrate single-cell RNA-seq data. Fine, but what it really does is use Seurat while removing anchors between cells that are of different cell types, in order to make it a little more precise. Here is how they show that it performs better because they have removed some links between cells of different cell types; that is where they claim it works a little better, and in this example you are definitely convinced by it. So in case you already know the cell types you have, because you already have some annotation, you would really want to go for STACAS to merge two data sets, I guess. But it is up to you; this is just for you to know that it exists and how it works.
So I guess that was my last slide on integration. And again, a question: when is integration useful? Is it useful to obtain a corrected table of expression that you can then use for differential gene expression on data coming from two or more sources? Is it useful when you want to obtain a visual representation of data coming from two or more sources? Or is it both of the above? More people need to decide. Some people are changing their minds. I think I can close it soonish. Yes.
So most of you said it is for visual representation, and some people said it is both of the above. I tried to emphasize that you should not use it for differential expression, because it actually gives more power to the batch than to the biology you are trying to assess. It is quite harsh, and it might mask some of the things that are really there in terms of differential gene expression. So I would recommend that you rather go back to the original data set and use the batch as a covariate, so that you give the same power to whatever you are comparing, sick versus healthy or
whatever, and to the batch, and do not give more power to the batch correction. But this is maybe more of a personal opinion.
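
The recommendation above, modelling the original data with the batch as a covariate instead of testing on integrated values, can be illustrated with a small toy example. This is only a sketch under stated assumptions: the data are simulated, the effect sizes and variable names are made up for illustration, and plain ordinary least squares from NumPy stands in for a dedicated differential expression tool. The design is crossed, so the batch is not confounded with the condition.

```python
import numpy as np

rng = np.random.default_rng(0)

n_cells = 400
# Hypothetical, non-confounded design: every batch contains both conditions.
condition = np.tile([0, 1], n_cells // 2)   # 0 = healthy, 1 = sick
batch = np.repeat([0, 1], n_cells // 2)     # 0 = batch A, 1 = batch B

true_condition_effect = 1.0   # assumed biological effect on this gene
true_batch_effect = 3.0       # assumed, larger, technical shift

# Simulated (log-scale) expression of one gene in the original data.
expression = (
    2.0
    + true_condition_effect * condition
    + true_batch_effect * batch
    + rng.normal(scale=0.5, size=n_cells)
)

# Design matrix with an intercept, the condition and the batch as covariates.
design = np.column_stack([np.ones(n_cells), condition, batch])
coef, *_ = np.linalg.lstsq(design, expression, rcond=None)
intercept, condition_coef, batch_coef = coef

print(f"estimated condition effect: {condition_coef:.2f}")
print(f"estimated batch effect:     {batch_coef:.2f}")
```

Because condition and batch are both in the model, the condition coefficient is estimated on the original values while the batch shift is absorbed by its own term. If the batch matched the condition exactly, the two columns of the design matrix would be identical and the two effects could not be separated, which is the confounded design discussed earlier.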