Now we can really start. Okay, so I'm going to talk about data integration. Some of the material we're using for the rest of the afternoon is reused from the "Analysis of single cell RNA-seq data" course hosted by the Wellcome Trust Sanger Institute; Tallulah was involved in making the original content, so it's mostly the lab material that we'll be reusing from there. Okay, so, data integration. By the end of this lecture, we're going to work on understanding the differences between batch correction and data integration, when and how to use data integration methods, how to figure out whether data integration worked if you're using it, and how some popular methods work, just to make sure we cover all of the concepts that come up in the lab section.

The motivation is joint analysis over many samples. If you only ever work on one sample, you don't need to listen to anything I say here; but if you have more than one sample, you're going to have to combine them somehow to analyze them together. The basic idea is that you normalize each sample and then you combine them. The simple way is merging: there's a merge vignette and a simple merge function, and you can just take the data, put it together in the same object, and then visualize it and do other things with it.

Sometimes there's a problem when you do this. Here's a bit of a disaster: I have blood that I've sequenced, I've done three experiments, and I want to combine them, so I merge them together. When I plot my nice UMAP, I have three parts of the map that are completely correlated with the experiment, not with the cells. T cells are here, T cells are here, and T cells are here; in fact every cell type is replicated once per experiment. Instead of the UMAP organizing my data by cell type, which is what I really want, it's organized by experiment. So what do I do now? The reason this happened is that the experimental differences were stronger than the cell type differences in the data, and UMAP or clustering will always work with the strongest signal it sees: if it sees major differences, it will group the cells based on those major differences, whatever they are. Usually we are not interested in this particular major difference.

These samples are from the Nature Protocols paper that was introduced earlier: the 10x platform with 5', 3', and two versions of the 3' chemistry, and there's a pretty strong batch effect between those technology choices. Fortunately, batch correction exists as a method, and we can go from cells being separated into three different technologies on the map to everything being nicely organized by cell type. The particular batch correction method we applied here is Harmony, and we were able to fix this problem quite nicely with it.

Okay, great, so: merging versus batch correction. The first question I encourage everybody to think about is whether batch correction is even needed. The simplest option, as I mentioned, is merging: just joining the data together into the same dataset; usually, as I said, we normalize first and then join.
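To make that concrete, here is a minimal sketch of the normalize-then-merge workflow in Seurat (a Seurat v4-style sketch with hypothetical object names; it assumes three blood samples are already loaded as Seurat objects):

```r
library(Seurat)

# Hypothetical example: three blood samples already loaded as Seurat objects.
samples <- list(s1 = blood1, s2 = blood2, s3 = blood3)

# Record the sample of origin in each object's metadata, then normalize each sample on its own.
for (s in names(samples)) samples[[s]]$sample <- s
samples <- lapply(samples, NormalizeData)

# Simple merge: put everything into one object, prefixing cell names by sample.
merged <- merge(samples[[1]], y = samples[-1], add.cell.ids = names(samples))

# Standard downstream steps on the merged object.
merged <- FindVariableFeatures(merged)
merged <- ScaleData(merged)
merged <- RunPCA(merged)
merged <- RunUMAP(merged, dims = 1:30)
```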
And when we do that, we need to assess whether there's a batch effect present. A batch effect, as many of you know, is when samples cluster by batch. A batch means a set of things: a set of samples run together, for instance on a particular day at the core facility or with a particular technology. It is a technical confounder that should be removed from the data, especially if it's stronger than the signal we're interested in. If it's there but fairly weak and doesn't out-compete our main signal, like cell types, we might not care about it too much. But if it's strong, it will interfere with clustering and all sorts of downstream analysis.

So now that we know the idea of a batch effect, the question is: does my data have one? If the datasets separate when you merge them, like in the previous slide, based on something you don't expect, such as technical factors like the experiment or the library preparation technology, then there's a batch effect. There can also be more subtle shifts: for instance, you have T cells represented in two samples, but the T cell clusters are not overlapping, they're shifted relative to each other. The previous example was a big shift, but the shift can be any size, and the clusters could be almost overlapping each other. As Tallulah mentioned, you can't completely interpret distances on a UMAP, but usually, if the batch effect is weaker, the cells from different experiments will be closer to each other, maybe roughly in the same cluster.

So if you see that on the UMAP when you color it by technical factors, like the sample (usually the first one you try is the sample ID), you probably have a batch effect. If you don't see it, then merging may work well, and you can do your analysis without batch correction. However, when merging, we often need to show that integration is not needed, and I'll explain why in a second.

Here's an example: our favorite liver dataset that we published. This was a joint analysis of five human liver samples, and no batch correction was needed in this paper; merging the data was sufficient, and we got a really nice UMAP. There was some sample-specific variation, and eventually we concluded that it was biological; I'll explain that in a bit. Maybe I should have shown this plot first: when we colored the map by sample, a lot of the clusters had representation from all the samples, and some clusters had representation from only one sample. So just looking at this, it's not a traditional batch effect; it's not a systematic change across all the data. Some of the cells were specific to a sample, but many of them were not, and they clustered nicely just by merging. This led to an interesting question in review: the reviewer wanted to know how batch effects were controlled in the experiment, because they saw this pattern. I'll go through the argument here because I think this type of reasoning is very useful.
This is actually available in the online peer review file for this paper. Almost every time we submit a paper, the reviewers ask about batch effects, and if we've merged, they say: well, maybe there was a batch effect. So you have to make the argument one way or another about whether you can merge or whether there is a batch effect you need to correct. In this case, we tried to show that doing batch correction didn't change anything, and therefore that batch correction was not needed. We regressed out technical factors like donor, library size, and gene detection rate. The blood-related, non-hepatocyte clusters were very robust to that correction: it didn't change what they looked like. The donor-specific differences in the hepatocyte clusters, those big clusters that were specific to individual samples, were still there. We also tried different batch correction methods and got the same answer, with or without correction: no batch correction method removed the sample-specific differences between hepatocytes. Given that the sample-specific effect was only seen in one cell type, that all the other cell types integrated nicely, and that none of those corrections made a difference, we concluded that, while we don't know why one sample's hepatocytes differ from another's, it is more likely a biological effect than a technical one. We can speculate on why: hepatocytes are very responsive to the environment, the diet of the person, the body mass index of the individual, all sorts of things that could easily differ biologically between individuals. So we were comfortable concluding that, and the reviewer said okay.

Another example is the work that Trevor mentioned yesterday morning, looking at human glioblastoma stem cell samples from 29 patients. This was an interesting case because, as we were just discussing during the break, when you're looking at cancer the samples frequently separate from each other. The reason is that there are often sample-specific genome instability effects in cancer, like a missing chromosome arm, that make a big difference to gene expression, and that becomes the strongest signal that clustering will identify. This was harder to argue, because there are sample-specific effects, and the reviewer again said: I see sample-specific effects, prove to me it's not a batch effect. We think it's biological because almost all the cancer samples we've looked at show the same thing. But the way we argued there was no batch effect here was that we tried several different batch correction methods, three in fact, and no algorithm corrected the separation, and they all disagreed with each other. So we argued that it's not a batch effect, not a systematic effect that affects all of the samples in the same way.
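As a side note on mechanics, the "regress out technical factors" step mentioned for the liver dataset is easy to sketch in Seurat using ScaleData's vars.to.regress argument (a hedged sketch with standard metadata column names; the point is to re-run PCA and clustering afterwards and check whether the clusters of interest change):

```r
# Regress out library size (nCount_RNA) and gene detection rate (nFeature_RNA)
# before PCA, then compare the resulting clusters with the uncorrected ones.
merged <- ScaleData(merged, vars.to.regress = c("nCount_RNA", "nFeature_RNA"))
merged <- RunPCA(merged)
merged <- RunUMAP(merged, dims = 1:30)
```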
So those are two real, published examples where we were able to simply merge the data, and that was appropriate for those datasets, but we had to argue that what we saw was not a systematic batch effect.

Okay, so we have these two options: we can merge the data, or, if we do see a big technical batch effect that we want to correct, we need to do batch correction. Either way, how do we know there's no remaining batch or technical effect? I showed you one way: looking at the UMAP, where you can visually see big sample-specific differences. But there are other ways of verifying whether a technical batch effect has been removed or never existed. We can ask whether the clusters we identify make sense, in that they represent samples appropriately, and whether they represent cell types or states appropriately; I'll show you examples of this. We can also check various biological signals in the data to see if they make sense, for instance whether a cell cycle effect explains a difference between cell types that should differ in their cell cycle rates, even across samples, or whether interesting biological effects explain the differences we see. If we do apply a batch correction method, we also need to make sure these effects are not ruined, for instance that we don't merge cell types that shouldn't be merged; to take an example, endothelial cells and, say, macrophages are very different and shouldn't end up on top of each other. The most rigorous version of this is to understand as many factors in the data as you can find and check that each is handled appropriately during the merge or batch correction process. This matters more for batch correction, because batch correction changes the data, so you want to make sure the data is not messed up by it.

Let's go through these checks with the merge example. Visual inspection of the UMAP: are samples separating by obvious technical effects? If not, a batch effect is less likely (see the sketch after this paragraph). Next, look at the clusters you identify against the samples. We had five samples, and we made a stacked bar plot showing the representation of the samples in each cluster; a lot of the clusters contained cells from all the samples, which is a good sign. Here's an example looking at one cell type: there are three subtypes of these endothelial cells, and it was nice to see that they're also represented fairly evenly across samples. The distributions differ somewhat, so there are definitely differences between the samples, but in this particular case we have nice, even representation. Another good check is whether the clusters represent cell types cleanly; that was the check I was doing when I showed you the blood samples that didn't merge. Ideally we have one cluster per cell type. It's similar to the stacked bar plots I showed you, except you can also check it visually on the UMAP.
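Here is a minimal sketch of those two checks in Seurat and base R, assuming the merged object from before with a "sample" column (and, later, cell type annotations) in the metadata:

```r
# Check 1: color the UMAP by technical factors; strong separation by sample
# or technology suggests a batch effect.
DimPlot(merged, group.by = "sample")

# Check 2: cluster-by-sample composition, i.e. the table behind a stacked bar plot.
merged <- FindNeighbors(merged, dims = 1:30)
merged <- FindClusters(merged)
comp <- table(Cluster = Idents(merged), Sample = merged$sample)
print(prop.table(comp, margin = 1))          # fraction of each sample per cluster
barplot(t(prop.table(comp, margin = 1)),     # simple stacked bar plot per cluster
        legend.text = TRUE, xlab = "Cluster", ylab = "Fraction of cells")
```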
None of these checks is definitive, but these are the ways people usually estimate, or guess, whether there's a batch effect. If you really want to be rigorous, as I mentioned, you show that there's a factor in the data that is explained by something like library type or sample, that you can remove it, for example with regression, and that doing so eliminates that pattern from the data.

Okay, let's go through the same checks for the batch correction example. Here we have a big batch effect, we correct it with a batch correction method, Harmony, and we get nicely integrated data afterwards. Now we can compare before and after. If we look at samples represented across clusters, before correction each cluster has cells from a single experimental protocol (we colored the cells by experiment), and after batch correction pretty much all of the clusters have representation from all of the samples. So that's fixed. Next: are cell types appropriately merged? Here we make the same stacked bar plots, but color the clusters by how many cells of each annotated type they contain. In general, each cluster should contain one cell type, and a cell type shouldn't be spread over many clusters. Before correction, for instance, the B cells (the blue cell type) are spread over three different clusters; after batch correction they're grouped into one cluster. We still see some differences, and there are a few possible reasons: it's possible the batch correction didn't fully work, and it's also possible that, for T cells for instance, there really are different subsets that are different enough from each other, and we just don't see that clearly because we used major cell type labels rather than detailed ones.

Any questions? [A participant asks about the example at the bottom of the slide: the cells at the left were described as "shifted" a little; does that mean batch correction should be used?] What I meant is this: say we have T cells we're interested in, we annotate them as T cells in each individual sample, and then we merge the samples. We'd like to see the T cells from the different samples directly overlapping if they're comparable and there's no systematic effect. A systematic effect shows up as the T cell groups being pushed away from each other on the UMAP in a sample-specific way: all the T cells from sample one are shifted relative to all the T cells from sample two, and they don't overlap. Even though we can't really measure exact distances on UMAP plots, you can generally distinguish a very big shift from a smaller one. In this particular case it's a big shift: they form totally separate clusters. Sometimes you instead get partial overlap, where the groups overlap a little but are shifted all in one direction rather than sitting on top of each other. And to be clear, I'm talking about using the cell annotation labels to do the comparison; the cluster numbers could be different, if that's what you're asking.
Yeah, whenever we do clustering, the cluster numbers are arbitrarily assigned unless you set them. [Another question: in this example, if both groups of cells are T cells, would you consider them the same?] Yeah, sorry, I didn't quite get that question earlier. If both are annotated as T cells but come from different experiments, then yes, I would want them to overlap. Sorry, I don't have an example slide showing the slightly-overlapping version.

Okay, so when we do this correction, we need to understand that the data is being changed. It's possible that these methods over-correct, which means they remove important signals, important variation of interest, from the data. For instance, the cell cycle (Tallulah mentioned this a little) could be removed, and we might not want that, but the method removes it automatically. Cells of different types or states might be merged; I talked about that. Another example is that samples of different types might be merged, like two different tissues, which would be the extreme case of everything getting pushed together. So we need to take care to check whether those things are happening. A good general practice, when you start with your data, is to learn about each sample individually: you look at one sample at a time, annotate it, and examine variables of interest. If you're interested in the cell cycle, you try to predict the cell cycle stage of the different cells and look at the patterns. You get to know how the variables of interest look in each individual sample, and then, when you merge or integrate, if those patterns you expect to be there are lost, the data has probably been over-corrected (one concrete version of this check is sketched just after this paragraph). Some integration methods are known to be harsher and more prone to over-correction; a common example people cite is Seurat's CCA, the canonical correlation analysis method, which we'll talk about later. Then there's also the problem of under-correction, where the method fails to remove all of the batch effect. In these cases people often try a different batch correction algorithm: if you find that things are over- or under-corrected, try another algorithm, because they definitely differ in how harsh they are. And those of you who analyze data more frequently than I do, please jump in if you have any other tips.
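For the cell cycle example specifically, here is a minimal, hedged sketch of that kind of check using Seurat's built-in cell cycle scoring; the idea is to score each sample before integration and confirm the same pattern is still visible after correction:

```r
# Score cell cycle phase per cell using Seurat's built-in gene lists (cc.genes).
merged <- CellCycleScoring(merged,
                           s.features   = cc.genes$s.genes,
                           g2m.features = cc.genes$g2m.genes)

# Look at where cycling cells fall before and after correction. If a proliferating
# population that was visible in the individual samples disappears after integration,
# that is a hint the correction was too aggressive.
DimPlot(merged, group.by = "Phase")
```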
Okay, so how does batch correction work? The general idea is that similar factors or signals in the data, for instance cell types, are identified across samples, and the algorithm corrects the overall data so that those similar signals align on top of each other. Take cell types: we naturally want these T cells, and these T cells, and these T cells to end up on top of each other, and these batch correction algorithms automatically identify a way to correct the data so that they do. We pick a signal in the data, in this case T cells, we align that signal because we know those cells should be on top of each other, we make a note of how we changed the data so that that worked, and then we apply that to the whole dataset. That's the general idea; I'll go through the different approaches people use, because there are a lot of slightly different ideas about how to do it.

So, the title of this section was data integration, and I've been talking about batch correction; what's the difference? People sometimes use these terms interchangeably, so I just want to explain my view clearly. A batch effect is what I described: a batch means a set of things, a set of samples run together, say on the same day at the core facility, or even individual samples, and it's a technical confounder that we usually want to remove from the data. But a batch is just one factor of many that can exist in the data. We've talked about all sorts of factors: Tallulah talked about confounding factors, and we talked about the cell cycle as a potentially interesting factor. So batch is one factor, but there are lots of others, and batch correction methods actually work with any type of factor. They don't understand anything about the meaning of these factors; they're just looking for ways to align the data, although depending on the method you can tell it which cells should be overlapping. They can find any way of making things overlap. So a more general term for batch correction methods is data integration: the idea is that you push the data together so that the signals align, whatever factors they differ by, if they're not already aligned. That's why it's sometimes dangerous: depending on the method you use, it could correct the batch problem, the technical effect you're trying to remove, but other factors may also get "corrected", and the differences between them get removed too. So it's just good to know about.

Okay, so I want to talk more about factors. What do we mean by factors? We've thrown this word around. The general idea is that it's variation in the data caused by biological or technical influences. We have this idea of confounding factors: typically those are nuisance factors that we're not interested in, as Tallulah mentioned, and sometimes they can be correlated with a factor of interest, in which case they interfere with our analysis, because we can't tell the difference between the technical thing and the biological thing when they're correlated. Conceptually, a factor in the data is usually caused by a process, a physical or biochemical process, like a pathway operating, like the cell cycle. By running the cell cycle, a bunch of genes are affected; lots of genes go up and down because the cell cycle is running, and that process causes those genes to be correlated in some way. That correlation, those genes now organized in relation to each other, is what the factor is.
And so some aspect of the variance in the data is now caused by that factor. There are lots of factor examples we've talked about: some technical, some biological, and some of the biological ones we don't even care about, while others we're very interested in. It's good to think about the different ways that signals, or processes that generate correlation because they have a common cause, could occur in your data.

Okay, so what is a factor in the mathematical sense? I think everyone knows that a factor of a number is a number that divides it evenly. So 1, 2, 3, and 6 are factors of 6, and you can think of them as types of parts of 6: a 3 is a type of part of 6, and two of them make 6; a 2 is another type of part of 6, and three of them make 6. So there are different ways of making 6. When we talk about factors in a single-cell genomics dataset, we can think of them as factors of a cell-by-gene matrix. Factors of simple numbers are easy to think about, and matrix mathematics has a similar idea, where you have types of parts of a matrix that are factors; with matrices there are lots and lots of ways of defining them, both linear and non-linear. For instance, a linear component, a vector, can be a factor (I'll show a picture of this), and a set of these components can be combined by multiplication to reconstruct the original matrix. So you can decompose the matrix into factors and recompose it; it's kind of like dividing and multiplying. The data within such a component are correlated along a linear axis, and PCA is one way to find such linear components. So if we have a bunch of data here on two axes, PCA might find variation along two new axes: PC1 is this axis of variation, the variation along this line, and PC2 is the variation along this other line. Each PC explains some of the variability in the original dataset; PC1 explains most of it and PC2 explains some of it. Any point in this 2D dataset can be defined by a position on this line and a position on that line, just like you can define it by its position on the x and y axes; any coordinate system you get by rotating this one has that property. You can think of this as a natural axis of variation in the data: if we looked at this data, we'd say there's some correlation here that was probably created by a process in biology. Processes create that correlation in our matrix, as I mentioned, and when we see that correlation there's probably some meaning associated with it. The big correlation is here, but there's also correlation of these points along this other axis, and that might be something else; maybe PC1 is cell type and PC2 is cell cycle variation, or something like that.
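To write the decomposition idea down, here is the standard way it's usually expressed (a generic sketch, not specific to any one integration method): the cell-by-gene matrix is approximated as a product of smaller "factor" matrices, and PCA is one particular choice of such a factorization.

```latex
% Generic low-rank factorization of a cell-by-gene expression matrix X:
% W holds one score per cell per factor, H holds one weight per gene per factor.
X \;\approx\; W H,
\qquad X \in \mathbb{R}^{\text{cells}\times\text{genes}},\;
       W \in \mathbb{R}^{\text{cells}\times k},\;
       H \in \mathbb{R}^{k\times\text{genes}}

% PCA is one such factorization, via the singular value decomposition of the
% centered matrix: the columns of V are the principal axes (gene loadings),
% US gives each cell's coordinates (scores), and the variance explained by
% PC_i is proportional to the square of S_ii.
X_{\mathrm{centered}} \;=\; U S V^{\top},
\qquad \text{scores} = U S,\;
       \operatorname{Var}(\mathrm{PC}_i) \propto S_{ii}^{2}
```

Each row of H (or column of V) is one "factor": a weighted set of genes that vary together, and each cell gets a score for how much of that factor it expresses. Those per-cell scores are what integration methods try to align across samples.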
Anyway, I just wanted to explain some general ideas about what these factors are, because data integration really relies on identifying these factors and aligning them. The assumption when we do data integration is that a large portion of the biological variability is shared: each factor can be considered one type of variability, this is one type of variation and that is another, and you can combine them to get the whole picture, but each one separately is a part of the picture. Data integration methods try, in various ways, to identify the factors that are similar between datasets, and those are what get aligned, what gets pushed together; all the rest of the data comes along for the ride based on that alignment. So to integrate data effectively, there needs to be some sharing of factors between the datasets, like shared cell types. Linear or non-linear factors are possible: they don't have to be nice straight lines, they could be curvy lines, basically anything, infinite possibilities really. Typically in data integration methods the factors are identified and aligned automatically, so we don't need to worry about them too much; however, the identified factors may be hidden, not easily extractable for us to think about what they mean, because sometimes the integration process just pushes things together without telling you exactly how it's doing it.

Okay, so now let's move to the next topic: integration methods. The key take-home is that there are lots of different integration methods, and there's no theory that tells us which one to use. I can't tell you, given your dataset, whether you should use one integration method or another, just like I can't tell you whether you should use one clustering method or another, or what your clustering resolution should be. And there is actually a theoretical reason we can't have such a theory: it's called the no-free-lunch theorem. You can read up on it, it's fairly theoretical, but it states that any two optimization algorithms are equivalent when their performance is averaged across all possible problems, which implies there's no single best method for optimization. Basically everything in machine learning uses optimization: clustering uses optimization, data integration uses optimization, so they're all related in that way, and we don't know a way to figure out in advance, from a method and a dataset, whether one optimization method is going to be better than another. So unfortunately we can't use theory to tell us anything useful about choosing these methods. What we can do is benchmark methods and use practical experience to figure out which methods tend to work and which don't, roughly, on the types of data we're interested in.

Okay, so let's go through some integration methods to get a sense of how they work, and then I'll talk about benchmarks and how to approach the choose-an-integration-method problem. Harmony is a popular integration method, probably one of the most popular ones right now; it is fast and generally works well in practice.
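Before getting into how it works, here is roughly what running Harmony on the merged object looks like in R (a minimal sketch, assuming the harmony package and the PCA computed earlier; the parameter choices are illustrative):

```r
library(harmony)

# Run Harmony on the existing PCA, telling it which metadata column defines the batches.
merged <- RunHarmony(merged, group.by.vars = "sample")

# Downstream steps then use the corrected "harmony" reduction instead of "pca".
merged <- RunUMAP(merged, reduction = "harmony", dims = 1:30)
merged <- FindNeighbors(merged, reduction = "harmony", dims = 1:30)
merged <- FindClusters(merged)
DimPlot(merged, group.by = "sample")   # samples should now mix within clusters
```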
Many people will say: if you're going to choose an integration method, why not try Harmony first, because it frequently works well and it's pretty fast, so it's low cost. The way Harmony works is that it tries to push clusters that should be on top of each other onto each other. If you have clusters from different datasets, with the colors representing the datasets and the shapes representing cell types, Harmony iteratively pushes the data so that the same cell types end up on top of each other, and it keeps track of how it's pushing the data as it goes; by the end, all the comparable data has been pushed together. In a little more detail: it integrates datasets represented by their top principal components, which are used to cluster cells, and each cell is iteratively adjusted using a correction vector that shifts it closer to its cluster center, until the cluster centers from the different datasets overlap and it can't improve any more. So that gives you an idea of how one method does this: iterative shifting until everything is aligned. One interesting thing about Harmony is that we don't really know what factors it is using, because it's just following a path, doing the best it can at each step.

A very early method, and these methods only really came out about five years ago, so they're pretty new, is based on the idea of mutual nearest neighbors. If you have two datasets, A and B, you take a cell in dataset A and find its closest cell in dataset B; then you take each cell in dataset B and find its closest cell in dataset A. If two cells are each other's mutual closest match, they're the best match for each other and should be matched up. Those pairs are called anchors in this way of doing things, and they're used to estimate and correct cell-type-specific batch effects across the whole dataset. The figure here, with two batches, is from the MNN-correct paper, I believe. Conceptually, the method finds these pairings of cells between the datasets and then estimates a correction, figuring out how to shift everything so the datasets line up with each other (a minimal sketch of a current implementation of this idea follows this paragraph).
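A widely used implementation of the mutual-nearest-neighbors idea is fastMNN in the Bioconductor batchelor package. A minimal, hedged sketch, assuming a hypothetical SingleCellExperiment object (a different object type from the Seurat examples above) with a "sample" column in its colData:

```r
library(batchelor)
library(scuttle)

# 'sce' is a hypothetical SingleCellExperiment containing all samples.
sce <- logNormCounts(sce)

# Find mutual nearest neighbors between batches and compute a corrected embedding.
mnn_out <- fastMNN(sce, batch = sce$sample)

# The corrected low-dimensional coordinates live in the "corrected" reducedDim.
head(reducedDim(mnn_out, "corrected"))
```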
Canonical correlation analysis, CCA, is the one I mentioned earlier; it was one of the first data integration methods published, about five years ago, and it was in Seurat, so people were excited to use it when it came out. It takes two datasets, finds factors in the data that it should align, and then uses dynamic time warping to locally stretch or compress the vectors during the alignment, so even if the datasets are not linearly related to each other it can handle some messiness in the data. The problem is that, over time, people found that it frequently over-corrects, so it came to be considered a harsh method, although it's actually good for tougher integration problems where the datasets are really different from each other.

There was a newer method that came out from the Seurat group afterwards, after the Seurat authors figured out that CCA was not always working very well, and it projects datasets into each other's PCA space, so it actually uses these factor ideas. It then finds the best matches between aspects of the data and corrects the whole dataset; it's less harsh than CCA, generally performs pretty well, and Seurat now recommends it on their integration website. I don't think they've published this RPCA approach yet, presumably because there weren't enough differences and updates to justify a publication: essentially it's projecting into PCA space, which everyone has been able to do forever, and then it uses the correction approach from MNN-correct to identify and fix the matches. The story is that they put it together after a conference where everyone was telling the Seurat folks, we really like your integration method but it tends to merge things that don't match, and telling the MNN-correct people, we really like how your method matches the cells up, but it doesn't correct very well. So they said, okay, we'll put these two together, and that became Seurat version three. It's actually hard to figure out exactly what it does, because they never fully explained the method; if you go to their integration website it roughly explains it, but not as well as Tallulah did, so it would still be nice if they published it (a minimal sketch of this anchor-based workflow follows this paragraph).
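For completeness, here is a minimal sketch of that anchor-based workflow in Seurat using the RPCA option (illustrative parameters; the sample list is the hypothetical one from earlier):

```r
# Split the merged object back into per-sample objects and prepare each one.
obj_list <- SplitObject(merged, split.by = "sample")
obj_list <- lapply(obj_list, NormalizeData)
obj_list <- lapply(obj_list, FindVariableFeatures)

# Shared features, then per-object scaling and PCA (required for reduction = "rpca").
features <- SelectIntegrationFeatures(object.list = obj_list)
obj_list <- lapply(obj_list, ScaleData, features = features)
obj_list <- lapply(obj_list, RunPCA, features = features)

# Find anchors by projecting datasets into each other's PCA space, then integrate.
anchors    <- FindIntegrationAnchors(object.list = obj_list, reduction = "rpca",
                                     anchor.features = features, dims = 1:30)
integrated <- IntegrateData(anchorset = anchors, dims = 1:30)
integrated <- ScaleData(integrated)
integrated <- RunPCA(integrated)
integrated <- RunUMAP(integrated, dims = 1:30)
```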
Okay, the last method I'll mention is one called LIGER, which is a funny name: a liger is the mythical lion-tiger mix, so you can take a lion and a tiger, integrate them, and make a liger. This one is also interesting as a bit of publication commentary: there was an existing method, previously developed for multi-omics data integration, and they applied it to single-cell data and published a paper in Cell. They basically took this other method, published a few years earlier, an integrative non-negative matrix factorization method for detecting modules in heterogeneous multi-omics or multimodal data; what that means is that you can take, say, ATAC data and single-cell transcriptomics data and integrate them. The method finds factors that are common between the datasets you're integrating and factors that are unique to individual samples or datasets, and then it uses all of that together to merge, focusing on the shared structure while keeping the parts that are unique. One of the interesting things it can do is that if cells are specific to one dataset, it can keep them, save them; it at least models that explicitly. I don't know if anyone has proven that it is better than other integration methods at handling cells that are unique to a dataset, but that was one of the things they highlighted in their paper. It can also report those factors in some way, although I don't think we've ever used that information, and I don't know how easy it is to access, but the original method does identify them.

So those are the standard types of integration methods, where the idea is that you take different datasets and integrate them. Just a note on wording, because the difference between "merge" and "integrate" is sometimes confusing, and I, for example, make mistakes interchanging them. To be clear: merging is just grouping the data together, with no batch correction, no factor correction; integration means one of these data integration methods that does factor alignment and correction. In Seurat the functions are literally called merge and integrate, which helps, but even the papers sometimes say "merging" when they mean integration, so it's really just a convention in the field that those words carry those meanings; it's not that plain English defines the difference. It would be nicer if we had better terms, but anyway, just a quick note, because it can be confusing.

Okay, so I've talked about methods that integrate datasets where each dataset being integrated is essentially an equal citizen; it's not that one is better than another. There is a newer kind of method called reference-based integration, which takes a reference, meaning the textbook example of whatever kind of data you have. Say someone has published the best lung map, and everyone agrees it's amazing because they really took care to annotate everything properly and it's high-quality data: that becomes the reference we compare everything to. Then we take the data we've created, say some lung data, and integrate it with the reference to check how similar our data is to it: do we get all the same cells they do, or do we get new ones? Overlaying data like that is the idea of reference-based integration, so in this case the datasets are unequal: one is considered the gold standard, and the others are pushed toward it. The previous integration methods allow all the datasets to be shifted around, even if that changes one of them, or both.

Sometimes, though, the integration methods like Harmony and the others I mentioned work pretty well but are hard to use in practice. The biggest case we've found is when you have a lot of cells and you can't fit them all into memory to run the method. One way of addressing that, which I'll mention in a bit, is buying a bigger computer with more memory, but eventually we're probably not going to be able to fit all the cells we have: maybe ten years from now, when we run this course, instead of thinking about experiments that measure 10,000 cells we'll be thinking about experiments that measure a billion cells at once, and we won't be able to fit those into memory; we'll definitely need different ways of thinking about the problem.
We just can't manufacture that much memory. Okay, so in the case of reference-based integration, the query sample should be similar to the reference: the same tissue, the same biological context, for example healthy samples; ideally we take the healthy lung we're looking at and match it with a lung reference. This is very useful if you want to take advantage of prior published data that the community considers a gold standard, like Human Cell Atlas or HuBMAP data. A bit of commentary: it's newer, it's less used, and I don't know of any benchmarks that have tested these methods; as of 2023 there are examples out there, but not a lot of practical comparison telling us which ones are better. Seurat has one: the integration-and-mapping link here has instructions for how to use it (a small sketch follows just after this paragraph). The Seurat group, the Satija lab in New York, has also created a nice website called Azimuth, as part of the HuBMAP consortium, where you can upload your data and map it to a library of references. The only problem is they probably only have a dozen or maybe twenty different references, so most likely your tissue is not represented; there are some nice ones, but it's not comprehensive. They'll build it up over time, but right now it's still growing. There are other methods too; Symphony is one that I understand some people in the community here are using, and it claims high performance, but my lab hasn't tried it.
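Here is roughly what Seurat's reference-mapping workflow looks like, as a hedged sketch (hypothetical object names "ref" and "query"; "cell_type" is assumed to be the annotation column in the reference):

```r
# Reference-based integration / label transfer in Seurat (sketch).
# 'ref' is the annotated reference object, 'query' is our new dataset.
anchors <- FindTransferAnchors(reference = ref, query = query, dims = 1:30)

# Transfer the reference cell type labels onto the query cells.
predictions <- TransferData(anchorset = anchors, refdata = ref$cell_type, dims = 1:30)
query <- AddMetaData(query, metadata = predictions)

# Each query cell now has a predicted.id plus prediction scores to inspect.
table(query$predicted.id)
```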
Okay, any questions so far? Okay, so the last section is data integration benchmarks. There are two big benchmarks for data integration methods. The first one was focused on what it called batch correction methods, and it concluded that Harmony, LIGER, and Seurat 3 performed well. (Is Seurat 3 the RPCA one there? No, they describe it as CCA plus MNN, so the naming is a little confusing.) They recommended these as good choices and said that, because Harmony is faster, you might as well try it first, but a lot of methods worked well; they benchmarked 14 methods, in 2020. Then in 2021 a bigger benchmark came out, which has been more influential recently; they used 20 methods and concluded that scANVI, scVI, and scGen perform well, particularly on complex integration tasks. They go into a lot more detail, but so far, while people are definitely using those methods, I haven't seen the community shift away from Harmony much. I don't know what you folks are using. Those top performers are Python tools, and the community is still mostly based on R, so people use Seurat and Harmony because they're in R. scVI is becoming more popular as more people move to Python, and the easiest Python packages to use are scVI and scANVI; there's a new initiative called scverse that's trying to tie these tools together and make them better and easier to use. So preferences may change over time. A lot of times in bioinformatics you have many tools that do roughly the same thing; some are provably better than others, but people don't switch to them, because even though they're better, they're not very much better, just a little bit better, provably but not substantially. So for practical purposes people choose the easiest one.

The best example of that is that everybody uses BLAST, because it's so easy to use online at the NCBI. There are probably hundreds of methods that have tried to be better than BLAST, but nobody uses them, because you have to install them yourself, build your own databases, and there's no easy website. That kind of thing happens a lot, and in the end it frequently doesn't matter practically, because it's not going to change your biological conclusions. If you're interested in comparing methods, I encourage you to think about it in terms of whether one method or another will actually change your biological conclusions: go forward with one method, come up with your biological conclusions, and then, if you want to know whether your method or parameter choices matter, try the alternatives afterwards and test whether they change your results. If they don't change your results, they're not relevant for that conclusion. That's a practical way of approaching the problem.

Okay, a couple more points. I mentioned at the beginning the question of how you know whether your batch correction or data integration method is working; people tend to use these stacked bar plots and look at UMAPs, with the general rules I mentioned. There are, however, a whole bunch of metrics that people are developing to quantify the amount of correction more precisely, various different statistics. The Luecken 2021 paper, the second benchmark I mentioned, had two types of metrics: batch effect removal, which covers technical factors, and conservation of biological variance, which I would call biological factors. The biological metrics come in different types: they divided them into cell-type-label-based metrics and label-free ones, such as cell cycle conservation, highly variable gene conservation, and trajectory conservation. Those are other biological signals they decided to evaluate; you could add a whole bunch more, like, why just the cell cycle, what about the other hundred pathways that exist? Each one of those is probably a factor, if it's active in your data, and could be used to evaluate how well these methods are working. The Luecken paper published a nice big table listing all sorts of results and how well the different integration methods did; here's Harmony, and a green checkmark is good.

Okay, a couple of additional practical points, touched on a little already. Integration on very large datasets needs a lot of computer memory. There are a couple of things you can do about that: you can buy more memory, or find a computer with more memory, and there are computers with very large amounts of memory, terabytes, which will probably satisfy most of your integration tasks; you usually get access to them at a supercomputing center. The other thing to be aware of is Python, which people switch to partly because it tends to be more memory efficient than R; traditionally it has not been as popular, as Tallulah mentioned, because there's a lot more single-cell functionality in R and it's typically easier to use.
If you're really getting into big datasets, then Python is one option, and reference-based integration may be useful as a way of integrating them. There's also a lot of research in this area, especially with deep learning; there are going to be really interesting new ways of thinking about this problem, and they're likely to help with the memory issue, because most of the data we work with is repetitive: it's the same kind of cell over and over again, we have lots and lots of examples of them, and we don't need to represent each cell as a separate vector. We could represent them through factors, or something like that, learned automatically, and that's probably what will happen in the future.

Okay, one last practical point: calculating differential expression after integration, after batch correction in particular. Tallulah mentioned practical ways to compute differential expression. One property of the most popular integration methods is that they don't produce a corrected cell-by-gene matrix: they don't correct the gene expression matrix, they produce updated PCs and clusters. And a lot of people, in fact Seurat itself, the last I checked, recommend that you use the updated clusters but do the differential expression testing on the original data, which is a bit inconsistent, because, as Tallulah mentioned, you can correct for batch effects within the differential expression testing itself, and Seurat, as far as I know, doesn't have a built-in way of doing that, unless they've changed it recently. There are methods that do it, Tallulah mentioned some already, but just be aware that the integration methods have not been integrated very well yet with the differential expression methods, although some are doing a better job now; it's less straightforward than it is with bulk RNA-seq. I don't know if you have any other practical notes on this. [A colleague adds:] Once Harmony corrects your data, you don't have access to that correction factor to use in a regression. Harmony doesn't adjust gene expression at all, because it doesn't even know your gene expression exists: Harmony sees that you have cells in PCA space and creates a new PCA space where your batches are pulled together; it doesn't know anything about the underlying gene expression. There's been lots of debate about whether, for the methods that do correct the gene expression matrix, you should use that post-batch-correction data for differential expression or not, and the last conclusion I heard was probably not. I think this is why most batch effect correction methods don't correct the counts: we don't want to correct them, because we believe some of those differences are biological variation, not technical variation. Whether that's true or not, we don't really know, because we can't do the technical replicates: we can't take the same cell and sequence it twice to see what is technical versus biological variability. So right now we're effectively assuming it's biological rather than technical noise, so we shouldn't remove it from the expression values, and we treat the samples as biological rather than technical replicates; that may change in the future (a minimal sketch of the usual practice follows this paragraph). That's just a practical note.
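To make the usual practice concrete, here is a minimal, hedged sketch of what it looks like in Seurat: the clusters come from the corrected (integrated or Harmony) space, but the differential expression test runs on the original, uncorrected RNA assay.

```r
# Make sure we are testing on the uncorrected expression values
# (relevant if an "integrated" assay was created by IntegrateData).
DefaultAssay(merged) <- "RNA"

# Markers for cluster 0 versus all other cells, using the original data
# but the cluster labels computed on the corrected reduction.
markers_c0 <- FindMarkers(merged, ident.1 = 0)
head(markers_c0)

# Or markers for every cluster at once.
all_markers <- FindAllMarkers(merged, only.pos = TRUE)
```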
Okay, so that's the end of this lecture. What did we learn? It's good to verify whether batch correction or data integration is even needed before applying it; don't just apply it automatically, because it can change your data and can sometimes remove important parts of it. Sometimes merging the data is sufficient. Batch effects can be detected and removed, and the results checked; over- and under-correction are both possible, and neither is desirable. Selecting an integration method sometimes requires trying more than one, and Harmony is a good place to start in 2023. By the time we run this course next year that might change, but I think that's a reasonable recommendation for now.