So, in this one, we're going to pre-process the data and do some quality assurance on it. And I'm going to be moving a tiny bit faster than I was in the earlier module. When I say pre-processing, I mean that before you get into the gating and the clustering and all of that actual analysis, you have to pre-process your data and make sure that what you're starting with makes sense. So, things such as compensation, which we expect you to already understand really well and to know the best way to do it; Ryan already addressed that. Also how to transform your data. The transformation is important. In FlowJo, when you're looking at your data, it doesn't really transform your data; it transforms it for display purposes, so that it looks nicer and you can gate it on the screen. The choice of transformation is very important, as I will show you on the next slide.

Also, remember when we were plotting the forward scatter versus side scatter dot plots of the flowFrame, and there were these dots at the end of the plots? Those were the margin events of forward scatter: the cytometer is not capable of recording values greater than a certain value, so it just assigns that maximum value to anything greater. That's a technical kink we should clean up in order to facilitate the automated gating later. So that's what I'm going to do up until the coffee break, and probably a little after. And at the end, before dinner, I will do quality assurance. Remember Ryan was going over some quality assurance examples, and there was this HTML web page with green dots and yellow dots and red dots? We're going to actually make one of those. I'm going to focus on some simple checks, just so that you get accustomed to working with the package that creates those quality assurance HTML pages, so that when you're later reading the vignette for that package, you will know what's going on and be able to do fancier stuff.

So, pre-processing steps vary from data set to data set; different data sets have different little issues you have to clean up, but these are the common ones. First of all, you have to compensate your data well, otherwise further analysis will be meaningless. I have seen many data sets where they send me the data, it's already been compensated poorly, and all the information about how it was compensated is gone: they just don't know what they did or what the software was doing, but they still give me the data hoping I can do something with it. I can't. Like Ryan said: garbage in, garbage out. If you give me bad data, there's nothing I can reverse-engineer to get it back into its original, higher-quality condition. There are also issues with poor staining, or poor experimental design, panel design, cytometer issues, where no matter what you do, the compensation is terrible. There's nothing you can do to make the data good. And again, there's nothing we can do about that; if that's your data, that's too bad.

[Audience: when you said the compensation was terrible and they didn't know what they did, were they actually manually changing the compensation?] Yes. I believe that's what happened; it's an excellent question, but they couldn't really give me a straight answer.
And I don't know why that was, but also, in FlowJo they fiddled around with something, and FlowJo did something, but they didn't know what FlowJo was doing, and they just assumed that if they hand me the files, I should be able to magically reproduce everything FlowJo ever did, like a machine. There are also cases where, biologically speaking, the physical parameters of the experimental setup mean that certain stains just don't go well together. No matter how you try to compensate, it still looks bad in the end. I'm saying I cannot fix that for you.

[Audience: what I tend to do is use beads with a single-color stain and then set up a compensation. So collect single-color beads; once we think we've got our voltages about right, you can collect single-color comps and let the software recalculate the matrix.] Yep. [If you go back and collect some more cells and the voltages still look about right, then you've probably got your comps and your voltages about right.] Yep, that sounds like a very good practice. But it doesn't always happen, so make sure it happens, because if it doesn't... So just keep doing that. The point I have to make is that if you have poor starting data, there's nothing I can do about it.

Then the next step: you construct a procedure to objectively remove debris, doublets, margin events, all of these pre-processing things. "Objectively" because we're going to be automating all of this: you're not going to be gating each and every sample separately or anything like that; you're going to think of some way to do it objectively. As came up earlier, with manual gating there's a lot of variability. You give the data to one person and then to the next, and they gate things slightly differently, and if you gate 25 populations in a row, in the end you can end up with completely different populations than you started with. [But we can do it. / That's questionable. / We have the tools to do it. We do it all the time.] You have to have some objective function in your head: what is a doublet? Usually it's based on position, so you find the cells that sit above some value, or you use forward scatter height versus area. You can use any number of criteria, whatever you think best describes what a doublet is.

All of these procedures — removing the debris, doublets, margin events — the way you construct this code, this procedure, is you think about how you do it manually in FlowJo. Well, you know, I go like this. What does "go like this" mean? Well, I try to get rid of all the really low forward scatter events, and capture most of the cells but not the ones really far away. You think logically when you're gating with your mouse on the screen, and you try to encode that logic into R. And we will do this. The R part is the easy part; putting it into logical terms is the hard part. People have an intuitive sense of what a population is that seems very difficult to write down as variables and thresholds, and that makes it harder to program.
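Just to make "encoding the logic" concrete, a toy example (the threshold, the channel name FSC-A, and the variable names are made up for illustration; it assumes f is a flowFrame you've already read in, and exprs() pulls the raw value matrix out of it):

    library(flowCore)
    # "get rid of the really low forward scatter things", written as an explicit rule
    keep <- exprs(f)[, "FSC-A"] > 35000    # made-up debris cutoff
    f.nodebris <- f[which(keep), ]         # keep only the cells that pass the rule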
A program requires a set of rules. You have to be able to explain to a computer how you're gating. You can't just tell me, oh, you gate it like this. What does "like this" mean, exactly? We'll get there.

So, after you have removed the debris and all of these useless events you don't want to be working with, you transform your data carefully. And after the transformation, you'll be able to remove the dead cells and then proceed with the regular analysis, the interesting analysis, where you do your gating or discover your diagnosis or whatever you're doing. Today we're just going to be pre-processing. It's arguably more important than the actual gating part.

So I'm going to talk a little bit about transformations. A lot of the channels you measure are on a log scale, right? When you look at them, your plot will have 10^0, 10^1, 10^2, 10^3 on the axis, because it's on a log scale. So we have to transform the values so that they make sense to us when we look at them. But log, by definition, cannot handle negative values: the log of a negative number is undefined, mathematically speaking. So in FlowJo, all these negative cells, instead of being undefined, just get set to zero, right on the margin. That's why you sometimes get a warning that you have too many cells on the margin: you're using log and you have too many negative cells. And remember how, when you compensate, a lot of the values actually get pushed negative? Because your compensation is good. So if you have too many negative values and you use log, they end up on the axis, and you're not seeing what your data actually looks like; you're just looking at dots bunched up on the margin, which is not very informative.

So what people do instead is use a bi-exponential. Bi-exponential is a class, a family, of functions. How many here have heard of arcsinh? It's one type of bi-exponential function. Another one is logicle. Has anyone heard of logicle? Actually, I don't know for sure, but I assume this is what FlowJo is doing, because FlowJo doesn't tell you exactly the formula it uses, but it's essentially using a logicle transform, which is just another type of bi-exponential transform. Arcsinh is defined for negative values, so it's better than log in that sense: if you give it a negative value, it's not going to pile it up on the axis. But it has this kink where it sometimes artificially splits positive and negative numbers so they look like two separate populations. I will show you a picture in a minute. The logicle transform addresses that issue with arcsinh. So it's better than log because it handles negative values, and it's better than arcsinh because it has more parameters that you can play around with to make the transformation look even better for the specific kind of data you have.

Someone said they were interested in the transformation math. Who was that? You? Okay. So, mathematically speaking, what is log? It's the inverse of the function y = e^x. Is everyone familiar with the number e? Usually you'd be working with 10^x, a log10 scale, but this is the natural log scale.
So the inverse of this function is x = ln(y), the natural log. When you're talking about a log transform, you're basically inverting y = e^x. Here's sinh: it has the same e^x part, but it's adjusted by e^(-x), which is a small number. The inverse of this function, arcsinh, is ln(y + sqrt(y^2 + 1)). Think about it: with log, if you put in a negative value of y, it's undefined; you can't take the log of a negative number. Here, a negative number gets squared, has 1 added, and is square-rooted, so sqrt(y^2 + 1) is always slightly larger than the size of y, and y + sqrt(y^2 + 1) is always a small positive number. It will never be undefined; it always gives a real value. That's the benefit. [It could be very, very small.] Yes, it will be very, very small, artificially close to zero — exactly so that you can take the log of it, and when you take the log of a very, very small number, you get a very negative value.

So remember, arcsinh is defined as the inverse of the sinh function, and the solution is the formula above. This is the bi-exponential in general, and this is where the logicle transform comes from — the one that, as you probably guessed, is the best one, because I mentioned it last. a is a constant parameter that you can leave at its default or actually tweak; same with b and w, and with f and d. So there are all these parameters you can tweak to make the transformation optimal. Don't worry, we're not going to ask you to choose these numbers; we're going to estimate them for you. But just to give you an idea of where the logicle transformation comes from: you start out with log, which looks so simple, and you end up with something that is the same idea as log, just fancied up a little, so that it can be tweaked and we can do the best we can with the transformation of the data.

Here's an example with the same sample we were working with earlier: forward scatter versus side scatter. Oh, we're not using RStudio yet; I'll let you know as soon as we get to the code part. Remember how it was really squished, and I had to cut the plot off at around 5,000 and it still looked squished? This is a log transform on the side scatter channel. Suddenly it looks much better, right? How many people think it looks better? Good. Here is arcsinh: it looks almost the same as log. Here's the logicle transform. These three all look roughly the same — no real benefit of one over another. Why is that? Because side scatter, by the physics of it, never has negative values. Taking the log of it, you never run into the issue of taking the log of a negative number; with arcsinh, you never run into the issue of artificially splitting negatives and positives into two populations; and logicle didn't have many issues to begin with. So in this scenario, if you're picking a transformation for the side scatter channel, it really makes no difference which one you choose. Just choose one and be consistent. Might as well choose logicle.
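For reference, the functions just described, written out (standard definitions; the general bi-exponential form follows the flowCore documentation):

    sinh(x)    = (e^x - e^(-x)) / 2
    arcsinh(y) = ln( y + sqrt(y^2 + 1) )        (defined for all real y)
    B(x)       = a*e^(bx) - c*e^(-dx) + f       (the bi-exponential family)

The logicle transform is the inverse of a bi-exponential of this form, with its parameters re-expressed as a top-of-scale value t, a number of decades m, a linear-region width w, and an extra negative range a — the four knobs you'll see in the R help page shortly.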
Here's an example of where it does matter. This is the same sample, and I'm plotting the R780 channel, the one measuring CD3. This is no transformation. Obviously you can't even see your data, because it's exponential data; you can't really make sense of this. When you open it in FlowJo, I think by default it does log, or tries to. [Audience: I think you can configure which transformation it applies by default.] Yeah. So when you actually open this in FlowJo, you will also see a really thick line along the x axis, because there's a bunch of cells sitting there. I've printed it out for you: 13.89% of the cells would be on the axis, because almost 14% of the cells have negative CD3 expression. When you take the log of those numbers, they just seem to disappear.

Instead, if I did the arcsinh transform, this is what I would get. This part of the plot is exactly the same as the entire log plot; these cells are the ones that were negative, and see how they are artificially split off from this population? These cells should really be combined with these. It should be CD3 positive and CD3 negative, not CD3 positive, CD3 negative, and CD3 really negative. That's the little kink that arcsinh has. And this is what the logicle does: because of all those extra parameters, we can tell it, you know what, there are actually way more negatives; don't shove them off at the end, just try to bunch them in with these low-valued ones.

[Audience: isn't it divided by 2?] It is, but that's not the part that varies. What you'd probably have is some kind of parameter in there, like a·e^x minus b·e^(-x). [I think the cofactor takes care of that problem.] It's not the 2; there is something else. And the logicle is exactly the generalization of that: of adding a cofactor and tweaking it.

So now we're going to be doing things in RStudio. Okay. Especially for those of us with slower computers: this first line removes all of the current variables stored in our R session. Right now I can type x, from what we did earlier in the day; it's still there. But when I execute this command, it deletes all of them. It's like starting fresh; now there's no x. This helps prevent you from getting mixed up just because we're starting a new module. The next function: if you have a bunch of plots open, that can slow you down a little, so running it will automatically close all the plots. R is then not keeping all your previous plots in memory; it just makes things faster for you. And we already loaded this package before, but let's make sure it's there. We set our working directory, just in case you shut down your computer or something; this is where our files are stored.

So what did I do here? I assigned this variable using dir. Remember, dir gives you basically a list of all the files in the directory you specify, and I pointed it at this folder. What did I have in there again? Right, all my files. So if I want to read in the first file — remember read.FCS — you have to supply the exact file name, with its path and everything.
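Reconstructed, those setup lines look roughly like this (the folder name fullFCS is the one used in this session; adjust the path to your own machine):

    rm(list = ls())          # delete every variable in the session: start fresh
    graphics.off()           # close all open plot devices to free memory
    library(flowCore)        # the core flow cytometry package, loaded earlier
    # setwd("<your workshop folder>")   # wherever the data lives on your machine
    files <- dir("fullFCS")  # character vector of all file names in the data folder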
So here's a new function we're going to learn: paste. It's basically concatenate for strings. I pasted together the string "fullFCS/", which I know is my folder name, plus the first file name in that directory, and you get this. Does this make sense? Why am I doing this? Well, say you have an Excel sheet with a list of file names in it. The file names alone don't tell you what directory they're stored in on your computer, right? Suppose you want to read in all the ones marked HIV positive. You read your Excel sheet into R, locate all the rows with an HIV positive diagnosis, and take those file names — but you can't just call read.FCS on a bare file name; you have to specify which folder it's in, where it is. This comes in handy all the time; you will see it when you're reading vignettes. This sep = "" — quotation marks, nothing, quotation marks — says how the two strings I'm concatenating should be separated. You could separate them by the digit 8 and it would insert an 8 between them; I didn't want them separated by anything. And now you can read the flowFrame like this, because first.file holds the exact file name. Before, we had to type it out exactly, in quotation marks, folder name and all. This is just one way to help automate the file handling. And that's the same flowFrame as before.

Now, the first thing we're going to do is talk about compensation. This data set is not compensated: a compensation matrix has been created for it, but the data we have is raw. It's uncompensated, so we have to apply the compensation. Remember when we had that long conversation about all of these gibberish-looking keywords, and decided some of them might be useful, some not so much? There was one called SPILL that contains the spillover matrix. It's automatically saved in the file. If you execute these lines — I've printed just a small portion here on the screen — what does this mean? Have you ever seen a compensation matrix? In FlowJo you can export one, right? This is what it is. [So when you say it's...] A compensation matrix, yeah. Does everyone know what this represents? Roughly: 7% of this stain has spilled over into that channel, and you have to adjust for it, basically. That's what it's saying. Oh, sorry, I didn't print it all out on my screen, because it would wrap onto separate lines.

So how do we compensate? You use the function compensate. Very convenient. [Audience: they used the old BD beads, and you can't compensate the 450 channel with those beads, so the 450 entries are zero.] Oh, okay. So that's bad? It is what it is; you can't compensate the 450. This is the point where I trust that the biologists know what they're doing. We trust them a lot. Exactly. So let's compensate. You can see that this function, compensate, comes from the package flowCore. It's one of those core operations you typically do with flow cytometry data, part of the standard analysis package. And you can see how it works: it takes a parameter x and a parameter spillover, where spillover is the spillover (compensation) matrix and x is an object of class flowFrame or flowSet.
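Putting those pieces together, a minimal sketch (the folder name and the SPILL keyword are the ones used in this session; other files may store the matrix under a different keyword):

    library(flowCore)
    first.file <- paste("fullFCS/", files[1], sep = "")  # folder + file name, no separator
    f <- read.FCS(first.file)     # the raw, uncompensated flowFrame
    M <- keyword(f)[["SPILL"]]    # the spillover matrix embedded in the FCS keywords
    f.comp <- compensate(f, M)    # apply it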
So all I did was call compensate with my flowFrame and my compensation matrix. And how can we tell that it's been compensated? You can run this while I'm presenting: summary(f) is another function you can run on a flowFrame object, and it gives you an overview of the distribution of your data. Within that flowFrame f, the minimum forward scatter value is 23,410, the maximum is 262,100, and the median — so that's like your MFI — is about 41,000. It has this for every channel; I've only printed out some of them so there isn't too much going on. That was f, the original flowFrame, before we compensated. The compensated one we call f.comp, so we keep track of where we are. When I run summary on that one, the forward scatter and side scatter summaries are identical: there was no compensation done on the scatter channels, right? We don't compensate those. This channel seems to have changed very, very little — negative 67.28 versus negative 67.34 — so it was already a pretty clean channel; the compensation didn't affect it much. But look at this one: it had some significant spillover into it. Suddenly it went from a minimum value of minus 67 to a minimum value of minus 26,000. So this is just to prove to you that the compensation was applied. Does this make sense?

[Audience: so the compensation information that generated the matrix M was already in the FCS file?] Yes, it was automatically embedded and saved. [So when we load a compensated FCS file into FlowJo, it already comes up as compensated, but when you load it into R, it's not compensated; we have to apply it. So FlowJo is automatically applying it.] Yeah, automatically applying it, exactly. We're basically doing what FlowJo does behind the scenes, with a little more transparency. [So I'm basically learning how FlowJo works.] Well, that's the thing: I don't know exactly how FlowJo works. It's a mystery. It's a secret.

Okay, so we're clear on compensation. One thing about compensation: if you're analyzing your own data, that's wonderful, but if you're analyzing someone else's data, there's always this huge potential miscommunication about whether the data is already compensated. Like you just said, some people think that because it looks compensated when they open it in FlowJo, it must have been compensated to begin with; they don't realize FlowJo is quietly compensating for them. So sometimes they'll give me the data and say, oh yes, it's already compensated. I'll work with it and ask, are you sure? Yes, absolutely, it's compensated. And then I spend some months working on it and realize it wasn't. It turned out to be just a miscommunication. That's something to watch out for. [Audience: there's a keyword that's always set to TRUE.] In my experience, yes — that's what I thought when I first saw it. There is a keyword something like that, and it's just misleading. It misled me.

Okay, so I'll assume we're good on compensation: it's just compensate, and you specify your matrix. Some people, instead of using the matrix stored in the file from Diva or whatever, will open the file in FlowJo, delete that matrix, and compensate it themselves in FlowJo. Then, instead of giving compensate this matrix M, you can essentially replace the entries with the ones you got from FlowJo, if you feel that's a better compensation for your data, and then call compensate(f, M) the exact same way.
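If you do want to swap in an externally exported matrix, a minimal sketch (the CSV file name and its layout are hypothetical; check what your FlowJo version actually exports):

    # hypothetical export: channel names as both row and column labels
    M2 <- as.matrix(read.csv("flowjo_matrix.csv", row.names = 1, check.names = FALSE))
    f.comp <- compensate(f, M2)   # same call as before, different matrix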
Okay, so that was compensation. The data we're working on is the HIV data Ryan mentioned briefly. It has 466 samples and 13 colors. We don't have the computational power to process that in the workshop, so what I've done is chosen 20 samples and kept only a few of the colors. Unfortunately, that means what we end up with won't be super interesting biologically, but for illustration purposes it should be good enough. So: those three files we were working with earlier actually come from a set of 466. I took only 20 of them, deleted a bunch of the colors because it would be too difficult otherwise, and took only 20,000 cells of each; they had hundreds of thousands of cells before. [Audience: is the whole data set available from a repository?] It's not ours to give, but yes, it is publicly available data.

So that's what I've done, and I've saved the result onto the virtual machine; if you look in the folder, it's right there. [Audience: before, we defined the flowSet ourselves; you've already done that and saved it, so we can just load it?] Yes, that's basically exactly what I did. We did it with one of the three files earlier; I've done it for 20, with some of the colors deleted. So now: load fs.RData. It has 20 files, like I told you. The first one, there it is — a flowFrame object just like before, with 20,000 cells. All of them have exactly 20,000 cells; I randomly chose 20,000 from each one. And I have only kept these markers. This ViViD / CD14 is, I guess, a viability/dump channel, and then we have Ki67, CD3, CD8, CD4, CD127. So we're clear on this.

So we can plot it. The first thing you ever do when you load in your data: plot it, make sure it looks right. And remember these cells here that we were talking about before, sitting right on the margin? How would you get rid of them? If you set xlim, that would only change the plot so that you don't see them. But I want to completely get rid of these cells: I don't want to analyze them any more; I want to gate them out, so to speak. xlim is just an argument of the plot function, just for display purposes. I want to remove them.

Okay. First of all, we can all agree 250,000 seems like a legitimate threshold, just looking at it. I first want to identify the indices of these cells: which cells are the margin events? We're going to use the function which. This piece I've selected right now, exprs(fs[[1]]) — what does that mean? We used it earlier. [Audience: you're taking the first file's expression values.] Yep, the matrix of values. And within that, I'm taking only the column that is forward scatter area. So if I ran just this part, I'd get all 20,000 cells' forward scatter values, every forward scatter value present in my flowFrame.
Which of these are greater than or equal to 250,000? That's essentially what this line asks. [Is that the matrix E we had before?] It's basically the expression values, all those numbers where each row is the measurement of one cell. [So you're just asking the matrix to tell you...] Exactly, which ones. So now look at what margin.cells contains: these are all the indices of the cells which are on the margin. The 15th cell had a forward scatter value greater than 250,000; the 27th cell that went through the cytometer also landed in that region; the 69th cell also landed in that region, and so on. These are all the cell numbers that I do not want to keep around.

How many of them do we have? 601. What can I do with that? I had 20,000 cells to begin with, so this is how you calculate the percentage of margin events in your data: 100 times the length of margin.cells divided by the number of cells. 3%. There, we have calculated one type of quality metric. If that number happened to be 50%, you'd probably be pretty unhappy with your data. Clearly, when you're analyzing your flowSet, you're not going to plot the samples one by one and draw a line where the margin is; you'll do this in an automated fashion, where maybe all you look at in the end is that 3%. And if you have 100 samples and you write a little function that computes this automatically, you'll see 3%, 2%, 3%, 2%, 50%, 2% — and the sample with the 50%, you'll probably want to take a closer look at, or just exclude it. That's one hint of where we're going with quality assurance later today.

So let's plot this again, and let's try to visualize these margin events. You can all see the red dots, right? So how did I do that? This is one visualization technique that is really helpful when you're gating a population or trying to remove margin events or something like that: you want to visually present the result to someone else, or just convince yourself that you did the right thing. Remember how A is a matrix here: I took the expression values of the flowFrame, but only the forward scatter and side scatter channels. This is one way you can plot flow cytometry data: just give it a matrix. Remember what pch was? Point character, yes. And ylim was just for display, so that it looks prettier for us; I wanted the y values to run 0 to 1,000, because there are maybe one or two dots higher than that. Now, this points function plots whatever is inside it on top of the current plot you already have. The first call, plot, draws all the black dots — everything. The second call, because I'm saying points rather than plot, keeps the current plot open and adds points on top. And what am I actually plotting in red this time? [Subsetting the matrix.] Exactly. I'm taking only the rows of the margin cells — remember, margin.cells holds the indices of the cells that were margin events — and only the forward scatter and side scatter columns. A started out as just forward and side scatter, so I'm taking the forward and side scatter of only the margin events and plotting those on top of the existing plot, in red. Okay, so far so good. And this cex = 2 stands for character expansion.
So I still want it to be a dot — just like here, where I said the point character should be a dot — but I want it a little bigger than these black dots, because otherwise it's really difficult to see. In fact, I want those dots to be twice as big; it just helps the margin events pop out a little. You can play around with that if you want. Here's one additional plotting feature we can add to make these plots presentable: legend. You can add a legend to the plot, and by keyword specify that I want it at the top of the plot somewhere. The actual legend text should read "Margin events, <percent>%", so this is another place where the paste function comes in handy. I paste together the words "Margin events,", whatever my actual margin percentage was — remember, earlier we calculated 3.005 — and then the string "%" so that it's clear. The color red is why the legend swatch is red: the red points are the margin events. The pch point character used in the legend, code 19, corresponds to a solid circle. If you're interested in seeing what the point characters look like, you can plot them, something like this: if you say pch = 1, or don't specify anything, you get these open circles; 19 is here. So if you're making a plot with a bunch of different things on it and you want to annotate them using point characters, there's a whole set.

So now we've identified the margin events; we've in fact plotted them. Now how do we actually take them out of our flowFrame? Just plotting them in red doesn't really do anything. Well, it just so happens the flowFrame object is made so convenient that you can literally subset it: f.clean is now going to be my original f minus the margin cells. Take them out. And when you do that, you have 19,399 cells left; we removed the 601 cells which were the margin cells. Does this make sense? Let's see if we have anything else here... I don't think so. So, so far we have compensated and removed the margin events. And I can show you that we've done that: instead of calling plot again, this is what you can do. What am I doing? Points. I'm plotting the cleaned data on top, in green, still with dots, but a little fatter than the black dots so that they pop. Notice I didn't plot any green over top of the red, right? Otherwise those would be green. Is that good?
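Here is the whole margin-event sequence in one place, reconstructed (FSC-A and SSC-A as channel names are assumptions; the 250,000 cutoff is the one we eyeballed above):

    margin.cells <- which(exprs(f)[, "FSC-A"] >= 250000)    # indices of margin events
    margin.percent <- 100 * length(margin.cells) / nrow(f)  # about 3% for this sample

    A <- exprs(f)[, c("FSC-A", "SSC-A")]
    plot(A, pch = ".", ylim = c(0, 1000))                       # all cells, black dots
    points(A[margin.cells, ], col = "red", pch = ".", cex = 2)  # margin events on top
    legend("top", legend = paste("Margin events,", round(margin.percent, 3), "%"),
           col = "red", pch = 19)

    f.clean <- f[-margin.cells, ]   # drop those rows from the flowFrame: 19,399 cells left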
Now I'm going to move on to transformation. Let me just check how long I have. Okay, we've got a lot of stuff to cover. So, we talked about transformation in theory; let's think about how it works in R before we try to apply it to our data. Imagine you have some values that you want to transform. Let's say your values, a, are 1, 10, 100, 500, 1,000 — standard exponential-looking values, right? If you take log10 of a, you get 0, 1, 2, 2.7, 3. Does that make sense? Because 1 is 10 to the 0 power, 10 is 10 to the 1st power, 100 is 10 to the 2nd power, and so on. asinh(a) is just another, less straightforward transformation; log10 is pretty clear because you know the powers of 10, while arcsinh is a little fancier. But again: we put in some values, some other values come out.

With the logicle transform, however, there's actually a function that generates the transformation. It's not built into base R; it's a fancier construction, added by a package. So you have to actually generate it. Let's look at the help: it creates a subset of the bi-exponential transforms, a hyperbolic-sine-like transformation function, and so on, and look, it has a bunch of parameters: w, t, m, a. You can read about the parameters. Remember how I said the logicle is a bi-exponential transform with a bunch of parameters you can tweak to make your data the most effectively transformed it can be? These parameters — w, t, m, and a — are the things you tweak. If you don't pass anything in this call to logicleTransform, it just uses the default values, which are already pretty good. If you wanted to, you could play around with these values yourself: tweak them, see what your data looks like, tweak again. But don't worry, the defaults work really well. [Audience: I just heard that the logicle transform code in R was actually written by Wayne Moore, the one who invented the logicle transform.] That's great. So it's not as immediate as typing log10(a) or asinh(a), because it is a fancier construction of a transformation: you can't just type logicle(a). And if you just generate the transformed values, it's not really going to print anything; R knows what those values are, but it's not printing them to the screen. You have to actually print them. And here's what they look like: 0.5, 0.55, 0.95, 1.7, 2.0 — just some transformation of our original numbers. So: a log transform would give you one set of numbers, an arcsinh transform another, and a logicle transform these numbers.

Now let's try it with some of our actual values: the CD3 values, there in the R780 channel. Here are the first four values. You can run those lines of code yourselves; I'm going to switch to this, so everybody can run these lines here. What are we going to be doing? First of all, remember mfrow, where I plotted three plots one after the other? Now I'm going to do four plots, and I want to see all of them at the same time so I can compare between them, so I'm making a two-by-two plotting region: one, two, three, four — two rows and two columns. These other two settings, don't worry about them; they just make the plot margins a little smaller so everything fits on the screen a little better. Then I'm going to plot the density of my values — remember, my values are just the raw CD3 expression values — then the density of the log10 of those values, then the density of the arcsinh of those values, then the density of the logicle transform of those values.
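Reconstructed, those lines look roughly like this (R780-A as the CD3 channel's exact name is an assumption; lgcl uses the default parameters):

    a <- c(1, 10, 100, 500, 1000)  # toy values
    log10(a)                       # 0 1 2 2.7 3
    asinh(a)                       # defined for negatives too
    lgcl <- logicleTransform()     # generate a logicle function with default w, t, m, a
    print(lgcl(a))                 # the logicle-transformed values

    v <- exprs(f)[, "R780-A"]      # the raw CD3 values
    par(mfrow = c(2, 2), mar = c(3, 3, 2, 1))             # 2 x 2 grid, tighter margins
    plot(density(v), main = "No transformation")
    plot(density(log10(v[v > 0])), main = "log10")        # log only sees positive values
    plot(density(asinh(v)), main = "arcsinh")
    plot(density(lgcl(v)), main = "logicle")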
Does everybody have those plots? So the first one is the raw, untransformed values. You can see how they're all bunched up at the very bottom, the lowest values, and you can probably see there are two populations, a CD3 negative and a CD3 positive, right? But they're really squeezed together. They don't look that great, so you have to transform them so that your method will be better able to see them. Here's the log transform. That looks pretty good — a nice peak and another peak — but this peak looks very small. Why is that? Because there's a bunch of negative values that log simply ignored, decided to throw away. The arcsinh is basically, you know, a not-very-well-parameterized version of the logicle: if you don't play around with the parameters, this is what you end up with, a split right around zero. And then here's the logicle. Looks legit, right? [Audience: could some of that be monocytes — cells that aren't lymphocytes?] Yeah, you've still got some other cells in there that are not lymphocytes. So by doing this, we've kind of convinced ourselves that the logicle transform is valid here; it doesn't look like there's anything weird going on. It's a pretty straightforward data set.

So now, this next code is going to generate this plot. Okay? Let's go through it line by line and see what it does. Again, I'm making my screen two-by-two, and I'm going to plot the untransformed data, the log10, the arcsinh, and the logicle. [Audience: a deeper question — you've already called something f, and now you're renaming what f is?] Yes: the old f is gone; I have replaced it. But the new f is based on the old f: it's the margin-cleaned version. And it's okay, because it's genuinely a separate object. The reason is that, theoretically speaking, you should actually overwrite like this all the time, because it saves space in your memory. Otherwise your computer is holding a variable called f, then a variable called f.cleanmargin, then f.cleanmargin.transformed, then another one... Overwriting is a little more efficient. [And we've been calling them different things here for clarity?] Well, yes — right now I want us to be learning what each step did. And also, in these next steps I'm going to be typing this name a lot, and I don't want to type f.cleanmargin every time; laziness factors into it, and readability too.

So this first line again generates the 2x2 plot region. For the first panel, notice how I'm using the flowViz package's plotting: this time I just give it the frame, not exprs(f), and smooth = FALSE, because otherwise it doesn't look super clear to me what's going on. The main title of that plot is "no transformation". Now, how do I actually transform the data within the flowFrame? Before, I just sort of plotted log10 of the values — that only plotted them. How do I actually make the data transformed? You take these values, the original side scatter values, and you just replace them by the log10 of those values. If you imagine the flowFrame as a bunch of rows of cells and a bunch of columns of parameters, I'm taking that column, the side scatter A column, throwing out those values, and putting the log10 of them back in, sort of.
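In code, that one step is just the following (SSC-A assumed as the channel name; note it should run once only — running it twice takes the log of the log, which comes up in a second):

    # overwrite the side scatter column with its log10, in place
    exprs(f)[, "SSC-A"] <- log10(exprs(f)[, "SSC-A"])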
And when I do that, now I can plot basically the same thing I did up here, but now, in this f, the side scatter values have been replaced with their log10. [Mine looks weird.] That's because of NaN — N-a-N means "not a number" — and it's because you're taking the log10 of... oh wait, are you doing side scatter, or did you skip ahead to the CD3? Oh, I see: you ran it twice, so now you're taking the log of the log. If you accidentally run that line twice, go back to line 81, where it's the fresh assignment of f, and rerun from there. So is that clear, what I did?

So now I'm going to do the exact same thing, starting again from the margin-cleaned data, but taking the arcsinh of the channel and plotting that. And then the exact same thing, but this time with the logicle. And that reproduces the earlier plot I showed you. Oh, actually, I didn't do that one. So this is basically: instead of looking at this density plot to assess how your transformation is doing, you could instead look at this plot. And in fact, that's the next step, which I'm going to let you run on your own: it makes this exact same comparison plot, but for the CD3 instead of side scatter. I'm going to let you do that while I get some coffee. Okay, I think that's probably good. We're good.

So, remember how I told you that the logicle transform has these parameters that you can tweak to make your transformation better? For this data set, the defaults were actually pretty decent; we didn't need to play with the parameters at all. But there is actually a function that someone really nice wrote, which does a mathematical estimation of the optimal parameters and uses those. So instead of logicleTransform, you can use estimateLogicle. It takes your sample and looks at its distribution — like the density plots we were looking at earlier. It kind of takes this in and tries to figure out the parameters that will make it look really nice and clean, like this, and it does that for you behind the scenes. If you really, really want to know how, you can look at the help: the logicle transform is explained there, as well as estimateLogicle, with some references to papers if you really want to read about it. But take my word for it that it's worth a try. It's not necessarily always going to be the best one: when you're choosing a transformation for your data, always visualize your data under one transformation, and another, and another, and pick one of those. Don't just go with something blindly and never check to make sure it looks good. There's no one transformation that works on all data sets ever — unfortunately, not yet. The logicle is pretty safe, though.

And here's how estimateLogicle actually works. First you define your transformation: because it's fancy, we use the special function estimateLogicle, and because it estimates the parameters from your data, you must supply it with your data. So you define the transformation based on your data, and you also specify which channels you want to transform. It's not just one at a time: you can give it all the channels you want transformed.
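The pattern looks like this (a sketch; the 3:9 channel indices follow the walkthrough below):

    chans <- colnames(f)[3:9]                      # the fluorescence channels on a log-type scale
    lgcl <- estimateLogicle(f, channels = chans)   # fit logicle parameters per channel, from the data
    f.trans <- transform(f, lgcl)                  # apply them all at once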
So here are the colnames of f — remember, these are the columns of the matrix holding all the data, all the cells — and taking the third through the ninth channel gives me all the ones I know are on a log-type scale, so I must transform those. So this line creates an estimated logicle transform for all those channels simultaneously, and the way you apply it to the frame is with this function transform. It's written specifically for flow data: you specify the flowFrame you have and the transformation function you have defined. So now f.trans is transformed.

Now we want to visualize that, you know, just to double-check, and we're going to use this new package that is not in Bioconductor yet, flowDensity, because it has some nice visualizations added to it. First you have to load this other package that's necessary for the visualizations, and then the flowDensity package. We're again going to do a two-by-two plot, because we're going to have four little panels. Mine look really bad because I ran something twice — if yours look like mine, that's why — but they should look something like this. Does everyone have that? If not, it's probably what happened to me: you ran one line a couple of times accidentally and re-transformed one more time, so just go back and rerun all the lines up until that one.

Okay, so what's this? Forward scatter versus side scatter — notice how we have removed the margin events; they're not there anymore — and I have plotted CD3 versus the viability/dump channel, so these here are the dead cells, and this is the CD3 positive live population. Here I have Ki67 versus CD4; I chose these pairs more or less at random, just so that we visualize all the colors. It looks like CD4 has a pretty nice positive and negative fraction; Ki67, I guess I would draw my gate around here somewhere — it doesn't really have that many positives. And then I have CD8 versus CD127, nothing particularly interesting right now. Does this make sense? It looks a little more familiar; we're trying to copy FlowJo a little bit.

What time is it? What time was the coffee break supposed to be? Three o'clock. Let me just see if we can do a couple more things. Well, what did we do so far? We probably want to do the same to the whole flowSet, not just the one flowFrame, so we're going to work with for loops. Is anybody familiar with a for loop in programming?
Very good, Ryan. So, just to illustrate how it's done in R, if you haven't written one in R: basically you say for, and then you put some kind of variable you're going to refer to within the loop — for i in the values 1 to 3, print i squared — and there it prints 1, 4, 9. Makes sense. It doesn't have to be i, and it doesn't have to be numbers; it can be a longer variable name, like chan looping over the colnames, the parameters of the flowSet. [Why not just use i?] Because your variable names should be sensible; they shouldn't be generic. Instead of i, I'm going to use chan, because then when I read print(chan), I know I'm printing channel names: it's informative.

[So it's sort of like running a script — instead of doing this for each of the samples by hand?] Yep: whatever is inside the loop body gets done for each one. We wrote our code for the single flowFrame, right? And we did a lot of extra things: we managed to remove the margin events and transform it. But I'm not going to hand-remove the margin events for the second sample and transform that one, and then the third. Instead: here's your flowFrame, remove the margin events, transform — and do this for all 20 of them. That's the idea.
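A sketch of that whole-set pass (this is not the instructor's support code; it reuses the 250,000 cutoff and the estimated lgcl from earlier, and assumes the FSC-A channel name):

    for (i in 1:length(fs)) {
      f <- fs[[i]]                                          # one flowFrame at a time
      margin.cells <- which(exprs(f)[, "FSC-A"] >= 250000)  # find its margin events
      if (length(margin.cells) > 0) f <- f[-margin.cells, ] # drop them
      fs[[i]] <- transform(f, lgcl)                         # transform and put it back
    }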
So, I run these two lines here to close all the plots, because again it's getting a little heavy on the computer's memory, and I'm reloading the data, just this one time, because we've been rerunning lines of code here and there and may have accidentally modified the original flowSet. I want to start fresh, because we want to do this for-loop thing properly.

Now, remember how we plotted our flowFrame with forward scatter and side scatter, saw those cells at the end, and decided that 250,000 was the right value to choose? Well, I don't want to plot each and every one of the 20 samples and make sure that 250,000 is indeed the right value for all of them. So what I'm going to do instead, when I have a large enough data set — here I have 20 samples — is take a few cells at random from each of the 20 and pool them. It's like a pooled sample; I guess you do that sometimes too, right, pooling controls together? So from the first frame I'm randomly selecting, let's say, a thousand cells, and a thousand from this one, and a thousand from that one, and hopefully I get a broader overview of what all of my data looks like. And this is it.

The way that I got this random sample from each of the frames, I'm not going to go into in detail. What I have done is supplied you with a function that I wrote; you can actually open it and see what's in it if you want. It's in the folder code, and then in the folder support functions. If you look through it, I have all these little functions here that I have written for you that will make life much easier, so that you don't have to do every little thing from scratch like we've been doing so far today. It's to help you out later — free code, just for you; nobody else has this. [You do this all the time, huh?] Yes. And it is possible that some of these functions will not work perfectly for you every time, but because I have given you the starting point, you should feel free to go through them and see how, for example, I'm getting the lymphocytes; I have one function called gate lymphocytes, or something like that. And remember how we talked about having to put it into logical terms — how exactly do you gate out the debris? "Oh, I do this." Well, how do you do this? I have made decisions about how I do it while coding this function, and if you read through it carefully, I have some pretty good explanations of my logic inside. If you disagree with some part of it, feel free to change it to suit your own needs.

Anyway, we're not going to go through that right now. I just want to point out that I have written this function, getGlobalFrame: you supply the flowSet, and it gives you a randomly pooled frame that represents the whole flowSet in just one frame, so you can visualize all of your data as if it were one sample. I have done this for you; you can go through it on your own time and try to figure out how I did it, but for now take it for granted. In order to make use of my functions, you must make R read my script, and the way you do that is with source. Then you can actually use getGlobalFrame. The global frame — I call it global because it's pooled over your whole set — is just a flowFrame object. It doesn't have a name, because it isn't any one real sample; it's randomly generated. And the way I've written my function, it gives you roughly twice as many cells per sample as you might expect, just to make sure it catches every little artifact that could be in the data set. And you plot it exactly the same way you plot any other flowFrame: the frame, your channel names, your ylim if it looks funny.

Now, instead of just removing the margin cells, I decided to roughly gate the lymphocytes. Very rough — correct me if I'm wrong. My opinion is that these cells should be removed; at the low end they're probably debris, and these cells are probably doublets or something like that, same with these. Really, I just want these cells. Do you agree? These values were not generated in an automatic way yet; I just eyeballed it. I looked at the plot and decided these are the static gate values I would use to roughly gate the lymphocytes. It's exactly the same thing we did before with the 250,000, except now I have a bunch of values that together outline the population I'm really interested in. Does that make sense so far?

How did I pick them? It's kind of difficult to eyeball this, right — a bit of a mess. The side scatter is fairly clear: there's a bulk of cells here, cut off there. But for forward scatter, it's a little unclear where to draw the line. So if you wanted to, you could use a different visualization: plot the density of the forward scatter channel. Yours will look slightly different from mine, because you generated a different randomly selected global frame. [What does yours look like? Very different?] And how did I select the cells? I had 20 samples, and I didn't want to take, you know, the first 10,000 from each — I just took some cells from each at random.
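If you want the idea without the support code, here's a hypothetical stand-in for what getGlobalFrame does (this is not the actual helper; it pools 1,000 random cells per sample into one matrix, and the channel names are assumptions):

    set.seed(42)   # so the pooled sample is reproducible
    pooled <- do.call(rbind, lapply(1:length(fs), function(i) {
      e <- exprs(fs[[i]])
      e[sample(nrow(e), 1000), ]   # 1,000 randomly chosen cells from sample i
    }))
    plot(pooled[, c("FSC-A", "SSC-A")], pch = ".", ylim = c(0, 1000))  # one pooled view
    plot(density(pooled[, "FSC-A"]))  # or the density view, to place the FSC cutoffs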
So, does this make sense as a very rough preliminary approach to removing debris? It's very basic. It's not like in FlowJo, where you draw a nice little circle around the cells because they should all sit in a blob; it's rougher than that: get rid of these, get rid of these. But it sort of makes sense, right? The side scatter cutoff is pretty clear.

Just this one time — and this is the last thing I'll do before we break for coffee — because we only have 20 samples, not 100, why don't we plot these supposed gates, you know, the 35,000 and 125,000 and 600, over top of every one of our samples, just to visualize what they look like and convince ourselves that they do indeed work for every one of our samples. It's not perfect: in some of them it looks like a little of the low-end debris is left over that we're not gating out, and in some we've maybe gated a little too much. But as a starting point, it's good enough. If I had a very large data set, let's say 100 samples, clearly I'm not going to plot each and every one of them; what I would do is randomly select 10 of them and plot those, just to double-check that my data is of reasonably good quality and there isn't too much variation, if I were going to use this kind of approach.

So far so good? So, what did we do? We first pre-processed one single frame, by removing the margin events and transforming it. Then we decided to move on to the whole flowSet, but we didn't want to look at each and every frame one by one, so we created this pooled sample and used it as the basis for the logic of our pre-processing steps. So far, we have designed — not yet applied — an approach that removes most of the debris in our flowSet, and we're going to apply it next, after the coffee break.
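For reference, the "overlay the static gates on every sample" check can be sketched like this (35,000 / 125,000 / 600 are the eyeballed values from the session; channel names are assumptions):

    par(mfrow = c(4, 5), mar = c(2, 2, 1, 1))   # one small panel per sample, 20 in all
    for (i in 1:length(fs)) {
      plot(exprs(fs[[i]])[, c("FSC-A", "SSC-A")], pch = ".", ylim = c(0, 1000))
      abline(v = c(35000, 125000), col = "red")  # forward scatter bounds
      abline(h = 600, col = "red")               # side scatter ceiling
    }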