Excuse me. Okay, so all that stuff you did yesterday would have sucked even more if we didn't have the opportunity to do automated data analysis, in terms of getting rid of that whole gating step. A lot of the stuff yesterday was just reading in the files, doing transformation, compensation. That was like getting to first base. Oh, do we do the Camtasia thing? All right. So here's where we go for the home run.

In this section I'm going to talk about the state of the art in terms of how to get rid of that problem of manual gating. I'll talk about the FlowCAP project, which was a lot of effort in trying to evaluate different methods. And we're going to talk about two tools that we'll be using a lot more today, RchyOptimyx and flowDensity. RchyOptimyx, I think, is a fantastic tool for doing discovery: finding all the populations and telling you which one is the most important. flowDensity is a package for doing diagnosis, for when you know what you want to find but you don't want to have to find it by hand.

So yesterday we walked through and got you up to data transformation, I think. Now we're at the next step, looking at what we can do to identify populations. And once you have those populations, there are a couple of different tools to do diagnosis or discovery.

There were a couple of people who messed around with automated gating in the early days, like Bob Murphy. Those methods kind of sucked; they were toy examples and nobody really used them. Oh, no, Bob Murphy's awesome. But he basically used a k-means approach, and it kind of worked, but nobody was using it, mostly because it doesn't really generalize. It's not robust. That all changed starting about 2008.
There was a bit of a renaissance, and since then fourteen different R packages have been released that do automated gating. There are about 20 packages in all, including some stuff that's not in R, but we're just going to cover the R ones in this workshop. The other stuff is sort of one-off approaches, in Python and Java and so on. But by and large, 14 of the 20 packages that have been published have been in R, which kind of shows you the power of R for doing what you want to do. And I think there's a lot of hope that in the future there's going to be a lot more stuff in R as a result.

That's basically because, as you've seen, it leverages that whole infrastructure. If you're developing a clustering method and you want people to use it, you don't have to write all the extra code that does everything else you've been working on up till now. That's seriously a lot of effort, all that stuff just to get an FCS file into a program and manipulate it. The people writing the clustering algorithms don't want to have to write all that again, so they can tap into the stuff that we have. And it's a statistical programming language, so it leverages all the tools that are in there. It just makes sense.

So there are two different analysis problems, and they really need different ways to solve them, to approach them mathematically.
If you're trying to do discovery, you don't really know what you're trying to find. So you design some marker panel to look at pretty much the areas that you think are important. You want to find NK cells, so you have some markers in there to find those NK cells. But there's going to be other stuff you throw in there because you're doing a fishing expedition. You're curious about what else it could be, so you throw some other stuff in; you might find something in this other stuff.

Then the goal is to find every cell population in every sample, because you don't know which one is going to be the one that's interesting to you. You've got some ideas, based on biology, of what could be important: the T cells, the B cells. But it could be this one, could be that one; we're not really sure. And then for every population you want to see: does the cell population we're looking at this time correlate with my outcome of interest? Am I getting a p-value less than 0.05? If you find something that's important, then you publish and you win; you're done. If the one you're looking at is not the one that's important, then you go look at the next one. You've got a little for-each loop going here. In biology this is the same for-each loop you'd write in code: you keep going around, looking at a population, next population, next population. Oh, I got one. And then you stop, right? You found one in ten; that's good enough for me.
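As a rough sketch of that for-each loop, here is one way it might look in Python. Everything specific in it is an assumption for illustration: the population names, the per-sample proportions, the choice of an exact permutation test, and the Bonferroni correction are not taken from the talk. Unlike the joke about stopping at the first hit, the sketch tests every population and corrects the cutoff for the number of tests.

```python
from itertools import combinations
from statistics import mean


def exact_permutation_p(group_a, group_b):
    """Two-sided exact permutation p-value for a difference in group means."""
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(mean(group_a) - mean(group_b))
    total = sum(pooled)
    extreme = 0
    splits = 0
    # Try every way of relabelling the samples into two groups of the same sizes.
    for idx in combinations(range(len(pooled)), n_a):
        a_sum = sum(pooled[i] for i in idx)
        diff = abs(a_sum / n_a - (total - a_sum) / n_b)
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
        splits += 1
    return extreme / splits


# Hypothetical data: proportion of each population per sample,
# split by outcome group (first list = group one, second = group two).
proportions = {
    "CD4+ T cells": ([0.30, 0.32, 0.29, 0.31, 0.33], [0.31, 0.30, 0.32, 0.29, 0.33]),
    "B cells": ([0.10, 0.12, 0.11, 0.09, 0.13], [0.11, 0.10, 0.12, 0.13, 0.10]),
    "Ki-67+ CD8+ T cells": ([0.020, 0.030, 0.025, 0.035, 0.028],
                            [0.080, 0.090, 0.075, 0.085, 0.095]),
}

alpha = 0.05
cutoff = alpha / len(proportions)  # Bonferroni: one test per population

p_values = {}
hits = []
for name, (group_one, group_two) in proportions.items():  # the for-each loop
    p_values[name] = exact_permutation_p(group_one, group_two)
    if p_values[name] < cutoff:
        hits.append(name)

print(hits)
```

With thousands of populations per sample the loop is the same; only the list of populations and the correction get bigger.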
You might keep looking, but generally that doesn't seem to be the case.

Diagnosis is a bit of a different approach. You know what you're looking for, so you can design a marker panel to find just that population. Then the question is: for this population that I know is important, because that's the one I'm looking for, is the MFI or the proportion over some threshold that is my criterion for making that diagnosis? If yes, then the person is sick and is going to die; if no, then that person could be healthy. Or if yes, these mice are going to do something, and if no, these mice are going to do something else.

So the way we approach these is different, and which of these tools to use has been a difficult problem. Not that my group would do this, but other groups, when they publish, only show their best data, right? Here's my algorithm; look, it found the stuff they're looking for; it's fantastic. And every paper that's published does this on a different set of data. Here's my algorithm; look, it works on my data. Here's my algorithm; look, it works on my data. What people haven't done is say: here's my algorithm.
Oh look, here's another algorithm, and mine works better than that one on this data set. That tended not to really happen. And because of that, it was difficult to assess the relative merits of one algorithm versus another. Everyone was showing results on their own data sets, or the data sets they had were toy examples, or one very complicated analysis.

So a bunch of us got together, three times now. Basically everybody in the world who was developing automated algorithms; there's not that many of us. If that building had fallen over we'd have set the field back 20 years. That's everybody, seriously, who is developing these automated tools, working in a really collaborative and open manner. It's not so much a competition, because we didn't want there to be losers, because then people aren't happy. The point was to really see where one method works better than the others, because these methods were developed to do different kinds of things. Our hope was to see: this method looks really good for this, this method works really well for that, this one, okay, kind of sucks, but maybe it complements something else.

All the results, all the code, all the methods, and how we did all the analysis are available at flowcap.flowsite.org. It was a really open approach; everyone was sharing. The paper came out earlier this year in Nature Methods. Nima is a grad student in my lab who did that.

This is a summary of the results from the first FlowCAP project. One of the main take-home messages is that these are all the different kinds of methods these algorithms are using; many of them are approaching the problem in different ways. We tried to do the best job we could to assess the relative merits. An F-measure is a statistical measure that says how closely you are matching manual gating.
I'll talk a little more about how we did that assessment in a second. Ah, heck, we'll do it now. One of the problems when we're trying to say how well an automated method is working is: what's the gold standard we're measuring against? You need some objective measure to say this one's working and this one's not. We struggle with this a lot. Right now the gold standard for gating is somebody drawing boxes around dots. That's what people do. So it's not really a gold standard. In medical practice it's what you'd call the standard of care, right? You have to be at least as good as that. We can't call it the gold standard; well, we can, but we know there are issues with it. As we talked about yesterday, if you give the same FCS file to three different people you get a few different answers. So there's some variability.

Another problem I have with it as the gold standard is that I think we can do better. That's where I'm coming from, and I'm letting you know my biases: I think computers can do a better job of finding cell populations in high-dimensional space than humans can. That's my hypothesis. I can't prove that.

But here's the bigger problem. In one or two cases you could find a particular case where you say, here a human has done something and actually there may be two populations. But when you look at hundreds of thousands of cell populations in a big data set, you can't really go through all those differences one by one.
It's just a real ordeal. If you're saying the way the human has done it is the best, then any time we find a difference from what the human has done, that makes us not up to that standard. The difficult assessment becomes: does that difference mean we're doing it right or we're doing it wrong? Trying to show that a given difference is actually because the human did it wrong is a really difficult process to go through and to convince people of, especially when you're looking at hundreds of thousands of cell populations and all these differences.

Having said that, if this were one, that would mean we had put all the same dots into the same gates as the human had. So many methods are doing quite well. This is how long it took those algorithms to run per sample; some of these methods take longer than others, and some are very quick. And then we scored them based on all the different data sets they looked at.

But, because we wanted everybody to get along, and also because it was scientifically interesting: we know that if you give two people the same data set, they're going to gate it a bit differently, and one person might find something that the other person missed, for example. The same kind of thing happens with computer algorithms. Some algorithms were really good at finding some things because of the way they're modeling the data.
They're good at finding, maybe, rare populations, and other algorithms work better at other things. So the thought we had was, well, let's put these algorithms together and see if we can do a better job. And by golly, it worked.

This figure is a bit complicated. This is the average F-measure, and this is the number of most consistent populations included in the sample. We had some humans, shown by these dashed lines, gate the same data. But as I've said four times already, different humans are going to gate those populations differently. So how do we measure which one's the right one? Which human is the right human? What we did is take all the humans, combine them together, and say: here is one population where all the humans agreed on how to draw the box around those dots. So we're pretty sure that one's the right one, and we call that population one. Everyone's pretty close, or at least most people; this one's a bit of an outlier. And they're all experts in the field. We could say this person's an idiot, but maybe they're not; maybe this is the only person who knows how to get it right. How do you evaluate that, right? So one, two, three, four, five, six humans. I'm sorry: one, two, three, four, five, six, seven, eight humans. Most of these eight humans agreed on this one population, because they all put in the same dots, so the F-measure against that consensus is one for those humans.

Then we said, okay, let's look at the next best population, where again most of the humans, except maybe one or so, are drawing those boxes with just a little less overlap. That's population two. We went through all the populations in this data set this way. It was a human hematopoietic stem cell data set, and we did this for four or five different data sets.
I'm just showing you one example here. Then we can compare each of the algorithms against that one population that all the humans agree on. For all the algorithms, that's the F-measure score on that one population. And we walk through all the populations. Now, over here the humans are disagreeing more on how that population should be gated. It's a more difficult cell population to gate, and if it's more difficult for humans, it's also more difficult for algorithms. It's tricky.

We combined all the algorithms together and did something called ensemble clustering, so that they do a weighted kind of voting to say, I agree on how to gate this population. That's this line here. So if you put all these algorithms together, even though some of these algorithms suck (they're not doing as well, and we can debate what that actually means, not performing as well versus the human), that ensemble of all of them together works better than any one algorithm, and better than the best algorithm, all the time. So it's kind of neat. And it makes intuitive sense: you're getting a general overview of how to gate this population from all the algorithms. They're finding lots of different stuff; pool them all, and where they agree more it tends to be closer to what the humans have. So it's kind of a win.

So the short answer, in terms of which algorithm should I use: you should use all of them. And through the power of computers, it's not that hard, because it's just computer time. You just throw another algorithm into the pool, you run it again, and you go off while it works. You just get a lot of breaks, right? You're busy working; that's okay, and you can go do other stuff.

So how does this actually work?
This is one example, to put in your head what these F-measures actually look like, so, the algorithms versus the manual gates. The line is the manual gate: that's how the human drew the box around these orange dots. Where the algorithms put all those events is shown in color. These three dots the algorithms put over here, but otherwise they got everything more or less okay. And here's this purple population down here: the humans drew the gate around here, but the algorithms said this dot actually belongs with that red one over here. That's why this F-measure isn't one, because that one ended up in the wrong spot. And this one is 0.98, because here are a few dots that should have been over there according to the manual gate. But for this one, the green, all the algorithms agreed with the humans: the ensemble of the algorithms agreed with the ensemble of the humans.

Here's another example, so you can get a better idea of what the F-measures look like. Again, the ensemble versus the humans is looking really good, but it didn't do so well for this one up here. For an F-measure of 0.86, this is what it looks like: there's some stuff over here that the algorithms put together with this when maybe they shouldn't have. But part of the problem is that we're trying to look at three-dimensional data in a two-dimensional space, and sometimes it can be hard for humans to get that separation right when they're not looking at it in the right way. This is one case where we think the humans may have been getting it wrong and the algorithms may have been getting it right. But that's neither here nor there.
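To make those numbers concrete, here is a minimal Python sketch of an F-measure for a single population: the harmonic mean of precision and recall between the events inside a manual gate and the events an algorithm assigned to its matching cluster. The event IDs are made up, and this one-population version is a simplification assumed for illustration; the scoring used in FlowCAP across many populations is more involved than this.

```python
def f_measure(manual_gate, cluster):
    """F-measure between the events in a manual gate and in an algorithm's cluster."""
    manual_gate, cluster = set(manual_gate), set(cluster)
    overlap = len(manual_gate & cluster)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cluster)    # fraction of the cluster inside the gate
    recall = overlap / len(manual_gate)   # fraction of the gate the cluster recovered
    return 2 * precision * recall / (precision + recall)


# Hypothetical event IDs: the human gated events 0..99.
manual = range(100)

# The algorithm found 97 of them and pulled in 3 events from elsewhere.
algorithm = list(range(97)) + [200, 201, 202]

score = f_measure(manual, algorithm)
perfect = f_measure(manual, range(100))
print(round(score, 2), perfect)
```

A handful of dots landing on the wrong side of the box is enough to pull the score from 1 down to the high 0.9s, which is the kind of difference the slides show.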
So, trying to get around that whole problem of what the gold standard is: the idea we had is that we need something other than manual gating to tell when gating is working well. We need some measure outside of drawing boxes around dots. One way we thought about doing that is if you have a diagnosis on some samples. You get some expert clinician to look at the samples and say, this patient has AML. Hopefully there's something extra in there, above and beyond this particular cell population has 27 dots and that one has 24 dots. It's a sort of meta-analysis. It gets around this question of one versus two dots inside a population.

So we had eight tubes of five colors on 360 subjects. The clinicians tell me that AML is not a difficult thing: we try to find this blast population, I can do that by hand. Great. But again, maybe you don't want to do that by hand 360 times. If we can prove to you that algorithms do as good a job as you can do by hand, that's kind of a win for us. And it turns out that many of the algorithms performed perfectly. What does that mean? It means many of the algorithms were able to get perfect sensitivity, specificity, and accuracy versus the human on the diagnosis of AML. Again, this may be a simple test. This is the inverse of what I did with Mario Roederer's data, where we weren't sure how it would work out. If we can't get this one right, we're not going to be able to get other stuff right. But it turns out that many of the algorithms can do it right.

Now, some algorithms got some things wrong, and that's shown along here on the bottom. We tested 42 algorithms in total, and how many of those algorithms got a particular patient wrong is shown along the bottom row.
Each of these is a misclassification. And again, this is a one-of-these-things-is-not-like-the-others game. There's this one patient where, wow, half the algorithms are getting this patient wrong. Why is that? Not the easiest question.

So we looked at that particular sample, and we found that it looks kind of weird. This is a typical normal individual; nothing weird going on. Here's a typical person with AML. I forget what the actual threshold on blast frequency is, but this one happens to be at 31 percent. You see this blast population down here? The point is, there are lots of cells in this blast population. In this outlier, the blast frequency isn't really above the threshold they were comfortable with in terms of making that diagnosis. And it looks a little bit different. It's not in the same place, and it's more diffuse than what's really expected.

So we showed this to a clinician, not really giving a whole lot of background about what was going on. He said, well, it doesn't really look like AML. Maybe some kind of high-grade myelodysplasia. Maybe there's something diluting the blasts. So the neat thing is, if we'd given this sample to the best-performing algorithm and made a diagnosis, it would have got it right. It would have called it the same way the human had. But by using many different algorithms, combining them together, and seeing where there's misclassification, we found a problem sample. It can point out to us that there may be something interesting going on in the biology that we might have missed. So we're sort of doing discovery.
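The bookkeeping behind that bottom row can be sketched in a few lines of Python: for each sample, count how many algorithms disagreed with the clinician's label, and flag samples that a large share of algorithms got wrong. The algorithm names, patient IDs, labels, and the 50 percent cutoff below are all hypothetical, chosen only to illustrate the idea.

```python
def misclassification_counts(truth, predictions):
    """Count, per sample, how many algorithms disagree with the reference label."""
    counts = {sample: 0 for sample in truth}
    for calls in predictions.values():
        for sample, label in calls.items():
            if label != truth[sample]:
                counts[sample] += 1
    return counts


def flag_samples(truth, predictions, min_fraction=0.5):
    """Return samples misclassified by at least min_fraction of the algorithms."""
    counts = misclassification_counts(truth, predictions)
    n_algorithms = len(predictions)
    return sorted(s for s, c in counts.items() if c / n_algorithms >= min_fraction)


# Hypothetical clinician labels and algorithm calls.
truth = {"P1": "AML", "P2": "normal", "P3": "normal", "P4": "AML", "P5": "normal"}
predictions = {
    "algo_a": {"P1": "AML", "P2": "normal", "P3": "normal", "P4": "AML", "P5": "normal"},
    "algo_b": {"P1": "AML", "P2": "normal", "P3": "normal", "P4": "normal", "P5": "normal"},
    "algo_c": {"P1": "AML", "P2": "AML", "P3": "normal", "P4": "normal", "P5": "normal"},
    "algo_d": {"P1": "AML", "P2": "normal", "P3": "normal", "P4": "AML", "P5": "normal"},
}

counts = misclassification_counts(truth, predictions)
flagged = flag_samples(truth, predictions)
print(counts, flagged)
```

In this toy data one algorithm is perfect on its own, yet the sample worth a second look only shows up when the calls from all the algorithms are laid side by side.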
It can be used for doing discovery on the way to diagnosis as well.

Here's another example from the FlowCAP-II study, where we got a data set on antigen stimulation post HIV vaccination: Gag and Env stimulation of these groups. Again, it wasn't a very complicated data set, but it's still the kind of stuff people do a lot; it isn't uncommon at all. It was on 48 subjects, and there were six different algorithms that were able to classify perfectly. It's a bit of a toy example, because it's not really the kind of thing you usually do, to work out which kind of stimulation you gave to your samples; you know which stimulation you gave. But can the algorithms uncover that based on the data? That's a valid kind of question, in terms of a classification problem anyway, and many of the algorithms are able to do it perfectly. So again, this is more evidence, on top of what I showed you yesterday, that this stuff actually works. Go to the paper to find out more about that.

So I'm going to talk you through two different tools that weren't involved in the FlowCAP study, because they came later, just in the last year or so. One can be used for discovery and the other for diagnosis: RchyOptimyx and flowDensity. You'll be using both of those today. Yes, you will.

The RchyOptimyx platform is a tool for doing discovery. It needs you to feed it the cell populations. One problem that we have when we're doing, bunny ears, automated analysis is that we're going to find lots and lots of stuff. We've run into this problem several times when doing collaborations: we find too many things. It's too much information. We're going to find 3 to the M immunophenotypes per flow cytometry assay, where M is the number of markers. So for a two-marker study,
not that I think anybody does this anymore, you're going to find nine populations. For a five-marker study you're finding 243. Ten markers is about where people are working nowadays, and 3 to the 10 is about 60,000 different cell populations that are possible in that kind of study. You don't find those by hand. You're not going to go out and gate 60,000 cell populations, because you know what you're looking for. But when you're doing a discovery kind of assay, you don't know what you're looking for, and so the computer's going to do something that you can't do, which is find everything. Then, when it finds everything, you have to mine through that one by one to see which ones are important. That's coming up in a few slides.

This is the approach we initially used to find all these cell populations, and it seemed to work really well. It's something called flowType, which we published in 2012. It basically separates populations, sort of like you do now, into negative and positive for each of the dimensions you're looking at. And we also allow for not caring whether it's negative or positive, because that marker isn't important to us for this population, so there's a don't-care state as well.

This is how it looks in a cartoon. We basically slice on negative and positive, trying to get the best slice that separates things into different groups. That seems to work okay, because that's the way you tend to design your experiments: you tend to design them so you have a negative and a positive population. Not all the time. And if you're not doing that, like in this case here, either because your compensation isn't right or because you have some populations that are more of a smear, this is not going to work for you. This is going to work for you if you have negative and positive populations. And if you have negative populations that are associated with something like a smear, we can probably find that. But if the only thing is a smeary
population, it's not going to work for you. But we think this is going to work for lots of data. The original flowType that's in Bioconductor today can't handle negative, dim, positive kinds of populations. The version that can isn't in Bioconductor yet.

You're going to walk away from here today with a whole bunch of tools, but they're going to be out of date by next week, because this is a moving field that's progressing really rapidly. So you have to keep up with the literature. You have to keep checking Bioconductor now and then to see if the packages have been updated, that there's a new version of, for example, flowType being released. If you don't check these things, you're not going to know. It's like any active field: you want to read the latest stuff that's going on. And now you have a whole new bunch of papers to put into your PubMed searches. You want to get all the flow cytometry bioinformatics tools, because it's a really active area of development and stuff is changing.

Here's one example. The flowType that's in Bioconductor today uses, no math today, what's called a brute-force approach. We now have a dynamic programming approach that works much better. And here's one example of why it sucks to be on a laptop. This is the runtime in seconds for the old version of flowType, old as in the one you get today. The one we should have in flowType in the next couple of weeks is much quicker, in terms of how long it's going to take to run for a given number of cells. Same for number of markers: if you're doing more than 10 markers, the old one gets to be really, really slow; the new one scales perfectly. The other thing we can do with the new version of flowType is handle more than positive and negative. We can slice a marker into as many bins as we think there actually are. So that's a win. But the main point of this is that you're going to have to keep up with the literature.

Now, the bad news, as I said, is
we're going to find lots of different stuff. This is my Star Wars slide, because it's just too much to fit on the page. I actually had to tilt it back to get the whole list of 101 phenotypes that we found. We're going to find everything, and everything's probably going to be a lot. And this is only showing the significant ones. See, the p-values here are still 10 to the minus seven. This is potentially a really, really important phenotype. And the problem we had is, which one? It's like doing microarrays all over again, right? We're going to find lots of stuff. Which one of these is the one you want to follow up on? There are just too many; you can't follow up all 101. You can see there are a lot here that are Ki-67. Then here's a whole bunch, I can't read it even standing right next to it, there are some that are Ki-67 positive, and some that are CD8, and some CD45 and CD28. But here's CD45 over here and CD45 over there. It was really hard to figure out. And it's not Star Wars in the sense of happy Ewoks dancing around. This is dad cutting off his son's arm. This is the unhappy Star Wars. We're not really solving anybody's problem.

This collaboration was done with Mario Roederer. We showed him this big list. Here it is; it's in here somewhere. He's like, this is not helping me. This is not solving my problem. Now I've just got something else I have to look at.

Yeah, so we account for that. We do multiple testing correction, so those are all real.

So what do you do? We solved the problem, because then we get another paper out of it, and because we make Mario happy, and that's really important. We had this whole long list, and the thing we realized, which is obvious in retrospect, I guess, is that these are all highly related.
So Here's this KSC 7 C4 negative CCR 5 C147 negative population really significant You can get to that population in many different ways get to that in the sense of gating So starting from all T cells you can either start with cake, you know Get on the case is seven or the city for the CCR 5 the C217 Once you have that population that you can add in the next marker And so you this is a gating hierarchy until trying to delve down until you get to the population of interest And we basically we can build these hierarchies reach these populations We can also color each of these nodes With the p-value So this is all everything remember yesterday I said everything almost anything is that we do is associated with statistics all these p-value all these populations are Significantly associated with their ability to distinguish between people who are going to dive HIV tomorrow people are going to be kind of okay That's that was the question that he wanted to have answered So here's this can't here's this population that you that we found That's very significantly associated associated with the ability to distinguish between the two But when we lay it out in this way for this one to get a population we see higher up in the hierarchy There's a population that requires fewer markers and because it requires three markers is also a bigger population It's a large proportion So this is this kind of interest on the thickness of these arrows is how much the key value is changing When you've gone from a population to adding it to the next one that population is a Decreased a lot when you go to here Decrease p-value increase p-value gotten better So There's different ways you can trace down to that population of interest And there's a mathematical way you can figure out on what's the best way? 
So if we're trying to get to here, the best way to go is where the p-values get significant very quickly. If we trace down this way, this is the best gating hierarchy to get to this population. That's the one best path for one immunophenotype out of the hundred and one. But we have a hundred and one different immunophenotypes. So for each of those immunophenotypes we drew the tree, and from each of those trees we get the best path. So now we have a hundred and one gating hierarchy paths. Still not helping.

Until we get to this part, where we took all those hundred and one paths and merged them together. This figure is the real money shot, because it combines everything in the data. It's sort of like the SPADE view of the data. It says: I've looked at everything in your data. I've looked at all the significant immunophenotypes that distinguish between group one and group two. Here are all the important ones; I've combined them together. Here it is. This is your data, all 466 samples. You don't have to look at dots anymore. You don't have to look at gating. This is a sort of meta-analysis, a meta-overview.

And the interesting thing is, here's some stuff that pops out. It's kind of easy for Mario to understand: I want to look for stuff that's up here. There it is. You don't have to look at any gates at this point. The computer said, here's the most important phenotype to check. Now, if you want to, you can go look at Ki-67 positive, CCR5 positive by hand and see: does it actually look right, or has the computer messed up?
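To give a feel for what picking the best path through a gating hierarchy can mean, here is a small Python sketch. It treats every subset of the target's markers as a node, gives each node a score (here, a made-up minus log10 p-value), and uses dynamic programming over subsets to find the order of adding markers whose path has the highest total score, so that significance shows up as early as possible. The scores, the three-marker target, and the sum-of-scores objective are assumptions for illustration; this is not the RchyOptimyx implementation.

```python
from itertools import combinations


def best_gating_path(target_markers, score):
    """Return (total score, marker order) for the best path to the target phenotype.

    score maps a frozenset of markers (a node in the hierarchy) to its
    significance, e.g. -log10(p). The path starts from the ungated parent
    population and adds one marker per step.
    """
    markers = tuple(target_markers)
    best = {frozenset(): (0.0, [])}
    for size in range(1, len(markers) + 1):
        for subset in combinations(markers, size):
            node = frozenset(subset)
            candidates = []
            for last in subset:  # which marker was added in the final step?
                prev_total, prev_order = best[node - {last}]
                candidates.append((prev_total + score[node], prev_order + [last]))
            best[node] = max(candidates, key=lambda c: c[0])
    return best[frozenset(markers)]


# Hypothetical -log10(p) values for every node on the way to a 3-marker phenotype.
score = {
    frozenset({"Ki-67+"}): 5.0,
    frozenset({"CD4-"}): 1.0,
    frozenset({"CCR5+"}): 2.0,
    frozenset({"Ki-67+", "CD4-"}): 6.0,
    frozenset({"Ki-67+", "CCR5+"}): 7.0,
    frozenset({"CD4-", "CCR5+"}): 2.5,
    frozenset({"Ki-67+", "CD4-", "CCR5+"}): 7.5,
}

total, order = best_gating_path(["Ki-67+", "CD4-", "CCR5+"], score)
print(total, order)
```

Running this once per significant phenotype gives one best path each; merging those paths into a single graph is the step that produces the overview figure.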
But it gives you somewhere to look. Remember, we're doing discovery, trying to find stuff that's interesting, and there are a couple of other ones here that are interesting as well. Something we've found a few times is that the significant p-values kind of peter out after about one, two, three, four, five or so markers. Adding more doesn't really help you. It doesn't help in a couple of ways. These are much rarer populations; they're using more markers, so there tend to be fewer cells in there. They're harder to gate, and the significance doesn't tend to be as high as for the ones up here.

So the corollary of this, if that's the right word, and this is a really good example of where CyTOF becomes useful, is that you do a big fishing expedition. You run as many markers as you can possibly afford, because you don't know what's going to turn out. Then you find the one or two populations that are really interesting and that use a few markers. Now you don't have to do CyTOF anymore. You've done that whole fishing expedition, and we can say this is the population that's really interesting to you. But it also works if you're doing 16-color or any kind of high-dimensional analysis. When you do discovery, you don't know what to look for, so use as many markers as you can. Just mine the whole space, and RchyOptimyx will tell you what's the most important one.

So in summary, RchyOptimyx is a really good way, I think, and I think so because we've used it on many different data sets, some that are in press and some that we're still working out, to look through a lot of different cell populations. You can use any way you want to find those cell populations. If you want to do manual gating on those 466 samples and get all those cell populations, you can do that.
That's fine. Then you put that into RchyOptimyx and it will tell you which one is the most important. You can use SPADE if you want to find all these cell populations. You just need some way to find cell populations. We use flowType; it seems to work. It doesn't really matter: once you have all those cell populations identified, you throw them into this and it'll draw this nice figure for you. A lot of people use SPADE, and one of the questions we've been asked is how this works versus SPADE. There are a couple of different things. You'll get the same answer the second time with RchyOptimyx. We'll give you p-values, which you don't really get, or at least not in the same way, that's probably the best way to say it, with SPADE. But we both give this nice overview of the whole data, which is really useful when you're doing high-dimensional or really complicated experiments. They both work on CyTOF data. How we define the distance between populations and find the relationships is, I think, a bit better. And this is one of the things with all these algorithms, like I showed you before: we're all approaching the challenge a bit differently in terms of statistics. We don't have to do the manual annotation that's associated with SPADE, where you have to pull these blobs out and draw boxes around things. Theirs works really well for gradual expression changes; ours doesn't. We can find the markers involved. But more or less we're about the same in terms of how good we are. So this paper came out last year. [Audience question.] Yes. So this is using flowType with RchyOptimyx, and this is using SPADE by itself. So this is the flowType/RchyOptimyx platform, and this is SPADE. These SPADE trees are described a lot in their paper, along with the mathematics behind them.
With their sampling, the way they draw these blobs reflects how far apart things are. They're not really cell populations; it's a bit abstracted from that. The way these blobs are drawn reflects how far apart these clusters of dots are in high-dimensional space, and it's based on a distance kind of measure. We're splitting populations instead. I didn't want to do the math on SPADE, but we can do that if you want. So what they're showing you is how far apart these populations are, which allows them to group these things together. Here, the p-value is based on the ability of the population to distinguish between the labels that you've assigned to the samples. You put these in group one and these in group two, and this population has a very strong ability to predict which group you belong to. The RchyOptimyx platform does thresholds, and it picks the best threshold. So if you have some population where the proportions are up here for one group and down here for the other, the RchyOptimyx platform says, okay, I want to draw a line here, and any sample that comes through in the future that's above that threshold is going to be in that group, and any sample that's below that threshold is going to be in the other group. That's how the p-value gets calculated, and then we test that on some other samples, and that's how we figure out how good the threshold is and what the predictive value is. If you have two populations whose proportions overlap like this, it becomes much harder to draw a line between them that's going to have a really good predictive value. So I think that works. I'm pretty much willing to guarantee that it's going to work on your data if you're doing that kind of thing. If it doesn't, let us know. That goes for everything: if
that stuff doesn't work, let people know, because that's how things improve. I'm even more excited about flowDensity, because this is what people want to do more. So, okay, we can do that. It's a package that really leverages the complete flow informatics infrastructure in R and Bioconductor. It uses everything again, so we don't have to do a lot of the heavy lifting. It's easy to use, it's flexible, and it takes just a few seconds to gate an FCS file. It's based on density estimation techniques, looking at how these densities change in different kinds of samples, and the goal of this tool is to automate the practice of 2D gating. So you have to start with a gating strategy. This is how it works, without showing the math behind it. If there are three or more populations, for example this B-cell population down here, we look for all the peaks, and then we look at where those peaks are in relation to each other. We look at the height of each peak and its distance from the peak next to it. We find the biggest peak and then we make a cut, and we say, okay, this is where we want to draw the line. The first time you do this, you have to make up these rules based on your gating hierarchy. So you have to know where the populations are, and we give you the ability to make different choices about which approach to use to draw that line. You basically tick these check boxes.
I want to do it this way, this way, this way. So you're giving some options to your function in R. If there's one peak, that's kind of a tricky one, because with only one peak there's usually very little information in the data that says this is where the line has to be drawn. In this case, because we know how the data is going to look, we know that what can happen a lot is you get this shoulder. This happens in this case and in many other cases, where the density kind of changes its direction. So we can look mathematically, and we won't go into how that actually works, for a change of slope. That's one way we do it: find that inflection point where the slope changes, and then draw the cut at that point. If that doesn't work, you can do other things, like: we want everything over the 85th percentile, because that just works for our data. So you have to know how your data is going to look. But we can make some really good guesses about intelligent choices to make, based on all the data sets that we've looked at. It doesn't take that long, usually a few days, to take a gating hierarchy and implement it in flowDensity. But once those few days are done, you don't have to adjust your gates anymore, because the algorithm does that for each new data set that comes along. Here's another example. If you don't know what your data looks like, we make some guesses; if you do, we can help train that. If all else fails, we can just pick some number of standard deviations, plus or minus, and that works sometimes for some data sets. Again, you have to walk through this to see which option is going to work best. If there are two populations, it's really easy: two peaks, just cut in the middle between the two. So how do you use that?
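As an aside before the usage steps, the three one-peak rules just described (change of slope, a percentile, and standard deviations from the mean) can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not flowDensity's implementation: the function names are invented, the "shoulder" density is synthetic, and the slope rule here just looks for the first point past the steepest descent where the slope stops flattening.

```python
import numpy as np

def normal_pdf(x, mu, sd):
    """Gaussian density, used only to build a synthetic example."""
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def slope_change_cut(grid, dens):
    """Cut at the shoulder: the first point to the right of the steepest
    descent where the slope reaches a local maximum (an inflection point)."""
    slope = np.gradient(dens, grid)
    peak = int(np.argmax(dens))
    steepest = peak + int(np.argmin(slope[peak:]))
    for i in range(steepest + 1, len(grid) - 1):
        if slope[i] > slope[i - 1] and slope[i] >= slope[i + 1]:
            return float(grid[i])
    return None  # no shoulder found; fall back to another rule

def percentile_cut(events, q=95.0):
    """Cut at a fixed percentile of the measured intensities."""
    return float(np.percentile(events, q))

def sd_cut(events, n_sd=2.0):
    """Cut at the mean plus a chosen number of standard deviations."""
    return float(np.mean(events) + n_sd * np.std(events))

# Synthetic density: one big peak at 0 with a small shoulder near 3.
grid = np.linspace(-4.0, 7.0, 1101)
dens = 0.95 * normal_pdf(grid, 0.0, 1.0) + 0.05 * normal_pdf(grid, 3.0, 0.7)
shoulder_cut = slope_change_cut(grid, dens)
print(shoulder_cut)
```

The point of having all three is the one made in the talk: you try them on your own data and keep whichever one matches the gate you would have drawn by hand.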
You need to have a strategy, because we're doing diagnosis. You know what you're looking for, so you have to have something worked up in FlowJo. You need some 2D gates, step by step, with some expression levels, plus or minus. We can find stuff that's highly expressed versus dim, but that needs a few more options to be set. This is how it works. We had FlowCAP-III last November. We had one, two, three, four, five, six different groups participate. This is showing how those groups compared against manual gating on the same data set. We had nine different centers, with expert gating centralized on that data set, so there's the variation of the humans across centers. This is taking into account center effects and as many effects as we could account for. So that was already removed, and what's left is the variation of manual gating, and we know there's going to be some variation in that. Where there is red, that means the automated method has been farther away from manual gating than we would have liked: statistically different. So you can't necessarily find everything all the time, but we seem to be doing pretty well compared with most of the other methods. This is the method OpenCyto, from our favorite, Raphael's group. We're working together to improve both our methods, because they're complementary. This is on the B-cell panel. We were not any different from what the humans have done on the B-cell panel. It seems to be working pretty well. [unclear] We didn't work as well on the T-cell panel; on the other ones, yeah, not so much. So in summary, I'm really excited about this tool because it seems to work.
We've used this on three or four different data sets and it's been really phenomenal how well it's worked. It's really great for me to be able to get up here and say stuff is finally working after all these years we spent writing what you learned about yesterday, all these tools for doing the quality checking and so on. This is where it gets more exciting, because we're getting to what people want, which is not having to do the manual gating. It's going to be within the average of expert humans, which is as good as you can get in terms of doing diagnosis. We can't do any better than that. I'm pretty much guaranteeing you'll get the same results you'd get by hand; if you don't, I'll give you your money back. It's not in Bioconductor today, but it's in your virtual machines, and it will be in there soon. What we're going to do next is get stuff into OpenCyto and GenePattern, and I think I'm talking about that in module six. So again, this is a large group effort. FlowCAP has been going on for a long time. Nima, who was my graduate student, is now in Garry Nolan's lab. [unclear] The other two PIs are Tim Mosmann and Richard Scheuermann. Everybody who provided data: we could not have done this without them. Have a look at [unclear].org. Nima Aghaeepour worked on RchyOptimyx and, along with Kieran, on flowType 2, and [names unclear] all helped out on flowDensity. The analysis with the flowDensity package was done through HIPC, with [names unclear] involved in that. And with that, it's time for more coding, and questions. [Audience question.] The purpose of this figure is to show how different we are from the manual gating. Red means we're significantly different from what they've gotten by hand. There's some variation, which is shown here, based on how much the manual gating is changing
across different samples, and we're trying to see, on top of that variation, how much variation there is due to manual versus automated. What we're trying to do is get rid of the effects that are due to things that aren't gating. If you look at the results on two samples when you have data from different centers, what can happen is you get an effect that's due to the data having come from a different center. So it's like a batch effect. What we were trying to do is account for all these different batch effects that we knew would be there. One of the obvious ones was coming from different centers. The other one was instrumentation: even if samples come from the same center, they may have been run on different instruments, and so that is an effect too. So we removed those effects that we knew could be a problem. If you analyze them, you can see there's an effect due to centers. And if you do some views of the data, you can see that here's the data from the same center, but the proportions in a specific cell population that are coming off one instrument are different from the proportions coming off another instrument. So that's another effect, and we can account for that. [Audience comment.] Yeah. And then on top of that: now we've accounted for all those effects, we've gated that data by hand, and we've gated that same data on the computer. What extra effects are there due to that kind of gating? You want to take away all those effects that you know about, and then hopefully, when you look at manual versus automated, there is no difference left over. But you have to account for all of that first, and then after you account for all those effects that you know about, we look to see what's left over, and that's how we get those red lines. [Audience question.] Ah, right. It
depends on which one. So again, one of the big questions we have is what defines a population. Right? How do you say this is one population versus two? It also depends which dimension you're looking at. Here, we're trying to find it in this dimension. It depends which way you're looking: is there one or are there three, and how do you want to make that cut? Looking down this way, there aren't three. Here, we'd say there are three, because you're seeing that curve. So you have to use both of those when you're doing flowDensity, to figure out, on one axis versus the other axis, how you want to make that cut. [Audience question.] Yeah, I don't know off the top of my head, but I'm pretty sure this one, almost positive, is an inflection point here that we're tracking. And again, this is where it takes two days for a gating hierarchy. You try things. So again, for this example where there's one population, there are three, four, five different ways to cut. This is the hard problem, when the data looks like this. And this is why automated gating in a general sense does not work when you try to do diagnosis. You can't take something like flowMeans or SamSPECTRAL or flowClust or SPADE, or any general, unsupervised algorithm, to do diagnosis, because an unsupervised algorithm uses the same approach for every single cell population, and that doesn't work when you're doing diagnosis. Essentially, rare populations are what get you, and for these rare populations you have to try, on your own data, at most five different ways to get that threshold right. Or three different ways, I guess. Yeah, three different ways.
I think so. You can try tracking the slope, you can try a percentile, you can try standard deviations from the mean, and one of those is probably going to work, because that's what you're doing in your head right now. You have to basically make these rules up. You have to take the rules in your head, I think we were talking about this yesterday, and encode them in R. And the rules that seem to work are: I'm looking for a bump, and you see that by eye, but now you have to make the computer look for a bump. Or you have in your head that maybe it's the 95th percentile. Or you have an FMO control. My God, I wish people ran FMO controls. [Audience comment.] My pièce de résistance data set looks like that. That's the pattern. [unclear] I've been doing data analysis of flow cytometry data since 2005. Do you know how many data sets we've gotten with FMO controls? So this goes back to the whole garbage in, garbage out thing: the more stuff that you can give, the more it's going to help. But people don't run FMO controls. It takes a lot of time and money. [Audience comment about surface markers, partly unclear.] So I've asked this of clinicians, and they say, well, you have internal controls in each of these samples: you have populations that should not express the marker in the healthy state. It's based on assumptions, where the expression on some cells is distinctly greater than the expression on others, which would be an internal control. [unclear] I'm not going to say that I know everything about this, but I've asked that question of a large group of doctors. I said, how do you set cut-offs without that?
They say, well, we have internal positive controls. [unclear] So, if you have some FMO data, I would love to have it today, because we don't have that in the flowDensity package right now, and before we publish, it'd be really simple for us to add one extra step that says, here's my FMO control sample. [Audience offer.] Thank you. All I need is one sample, just one example for the paper. Okay, fantastic. I gave this talk at CYTO last month, and it's always, what about FMO controls? We didn't have FMO controls for the use case that we used to develop flowDensity. The flowDensity package was written to solve the problem of the HIPC data set, which I think I talked about yesterday, where they had the panels, they knew what they were going to find, the T cells and B cells, with eight different centers and eight-color assays. And they didn't have FMO controls. It's too much money; they're not going to run them. If we have them, it makes total sense, right? You can set a threshold based on that. I just need an example so we can put it in the paper and publish this. Cool, I'm glad I came here. Any other questions? [Audience question.] That's part of this tool. It would take a couple of days to add it in, and then one of the options that we would have in the package is that you would point to your FMO control and say, for this particular population, here's my FMO control. Then the computer would look at it and set the threshold, you'd have to say what your threshold is, and anything above that is positive. Really simple to do. Everyone's happy.
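The FMO option described here could look something like the following Python sketch. This is a guess at the idea, not something from the flowDensity package (which, as said above, did not have this step at the time): the function name `fmo_gate`, the choice of a percentile of the control as "your threshold", and the numbers are all assumptions for illustration.

```python
import numpy as np

def fmo_gate(stained, fmo_control, percentile=99.0):
    """Set the cut from the FMO control and call events above it positive.

    stained:     intensities for the marker in the fully stained sample.
    fmo_control: intensities in the same channel from the FMO control.
    percentile:  the user-chosen threshold on the control (an assumption).
    """
    cut = float(np.percentile(fmo_control, percentile))
    positive = np.asarray(stained) > cut
    return cut, positive

# Made-up intensities, for illustration only.
fmo_control = np.arange(101.0)                 # control events spread over 0..100
stained = np.array([50.0, 98.0, 120.0, 300.0])
cut, positive = fmo_gate(stained, fmo_control)
fraction_positive = float(positive.mean())
print(cut, fraction_positive)
```

The user still has to choose the threshold on the control, which matches the point in the talk that you have to say what your threshold is.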
[Audience member:] I think it depends on the design of the experiment as well. A lot of times you might be looking for an FMO control for a marker, but it's that marker that you're actually interested in. Like in the case where we do phospho stuff and we're running three phospho markers at the same time, it's like an FMO-minus-three control. But you will see multimodality in the phospho data. [unclear] But it wasn't in the experimental design, so this is where we're actually mining the data. [Speaker:] Yeah, so it's differential expression and data mining. It's again diagnosis versus discovery, how you set up those panels, and the approaches that you use. Yeah, good. So hopefully your voice is back. You were fading yesterday. [Audience question.] So we look at CCR5, and then we add KI-67, and the significance, the ability to distinguish between the two states, has increased a lot. And we use these thicknesses in finding the best path. You want to look for big changes early, to get all those best paths through the data set. But this is also about bang for the buck. If you're somewhere here with, like, four colors, and I can add one more color and it's really going to help me a lot, that's shown here. That's also shown by the color itself. [Audience question.] Right, we're doing discovery. In discovery, your hypothesis is that I have a bunch of samples in group one and a bunch of samples in group two. You know, I would be a rich man if I had a dollar for every time I got that question: how many samples do I need?
It depends, and it's really hard to do a power calculation for flow. We've done it because we had to do it for a grant; people expect that. It's a lot of hand-waving. The way we did it is to say: if the population that's going to distinguish between group one and group two looks like this example population, with proportion one and proportion two, and the proportions are this different, then we would need this many samples. But that kind of skirts around the issue of how easy it is to automatically distinguish between population one and population two. That's a separate problem, right? One is how different the proportions are, and the other is how easy it is to find those differences in proportions, and you have to do both. If it's a rare population that's close to a big population, that's probably going to be hard to find in discovery, and so you may never find it. The win is that if it's something like this and it's a rare population, we kind of hope that the rare population has something closely associated with it that at least can point you to the right area. In that case we may not get the exact population correctly, but we'll get you close. Then you go to manual gating and you actually see, oh, there's another population right there that was grouped together with it, but it's still distinguishing. You go back and look at this by hand. And so we may not get you the right population.
We'll get you close in high-dimensional space. It also depends on how much that proportion has changed between group one and group two. Say all your sick people are going to die because 20% of their cells have this, and everybody else is at 21%. If you have enough patients, you will get a p-value out of that that's going to be significant. If you look at a thousand patients, and everybody in group one has less than 20 and everyone in group two has 21 to 23, you get a very strong p-value. One of the problems we've had is that even if you tell them the p-value is 10 to the minus 10, and, you know, it's 20 versus 21, nobody's going to use that, because it's not going to be enough for them, even though it's statistically significant. And then we're stuck. It's like, well, you wanted a statistical result, and this is a statistical result. They say, well, I'm not willing to follow that up, because I'm not comfortable with it. [Audience question about 30 samples.] Absolutely. Right, given all these other things. More is always better. But realistically, with 30 per group, if there's something that's really significant, it'll probably show. You're going to lose some stuff, but that'll be enough to find it if there's a big enough difference. If they want a number, I can tell you what's not going to work: three. Three will not work for flow. We've tried it in some of the examples. What you can't do is come to me with three samples and have me tell you this one's different when it's 20 versus 21%. [Audience comment.] You can't argue with me at that point, right? So that's what gets you into the Cell paper, right? That's the difference between finding something and then doing the mouse models and actually showing why that is.
That's tough. [unclear] And the other problem, and I don't think I've talked about that yet, is that we find some stuff, I'm just pointing at something randomly, we find something with, like, eight markers. Actually, I won't tell you about that now, because that's this afternoon's talk. Sorry. Are we good? All right, so now you get to learn how to use these awesome tools, at least a couple of them. But there may be an announcement, or there may be a break, or are you hopping right in? Twenty minutes. All right, so we can hop right in; we can do a bit of coding to get warmed up. [Audience question about rare cell populations.] So the question is: rare populations seem to be a problem, so how do we deal with them? Right. So how do we deal with that when we're doing diagnosis, and how do we deal with it when we're doing discovery? If you're trying to do diagnosis, we have to look at those samples. So here's a sample that maybe kind of looks like that, right? There's a population here. How do you deal with that automatically? In this case of automated gating, what we do is look at some samples, and you say, I want to find this, and we'll try three different approaches and see which is going to work. So we have some test set. You say, here's my data, here's a population I want to find, and here's how I think I would find it, and we look at that. So you want this, and we look at the density distribution here. We see that there's a bit of a shoulder: here's a big thing,
here's a little thing. And that shoulder is in the set that you gave us, and so we'll say, oh, well, we can just use the track-slope method here to find where that is. And then the idea is, you give us a training set and a test set. In the training set we find this, and in the test set we see if that actually works. If it works in the test set, we're done, and then our hope is that from now until the end of time, everything else is going to look like this. [Audience question.] So this is where you have the gating hierarchy, right? If you don't care about this and you're using flowDensity, then as far as you're concerned it's not a population. Okay. Yeah, so I alluded to that with Suzanne's question: what is a cell population? That's a very, sorry, I'm talking about ontologies now, trying to define what these things are. The way I can put it is this. Is there a bunch of dots that you want to find? When you put those dots into a different box from these other dots, if you think those dots should be counted separately from these other dots, how would you
decide that those dots go in group one and these dots go in group two? If looking at the change of slope works for you, and that change of slope is in the training set and it's in the test set, then that's good enough for me, and hopefully things don't change from then on. If you decide that you want to put those dots into a different group, then we have to find a better way to get those dots into a different group. But we have three different choices that seem to work: looking for the slope, picking some arbitrary percentile, and looking at plus or minus some standard deviations. Those are the three things that have worked on every data set that we've looked at. What if you're doing discovery? That's a much more difficult problem. None of the automated tools that I am aware of are robust in their ability to detect rare populations consistently, in a way that I would get up here and say, use this tool, it's going to work. It's a difficult problem. [Audience question.] That's a different problem. That's not related to rare cells; it's the way they do their sampling. If you're trying to find a rare population, that to me means you're doing diagnosis, and that to me means you need a supervised approach, which means you need a tool like flowDensity. If you just need to find something, that means you're doing discovery.
You don't care if it's a rare cell population or not. And what we've tended to find, if you try to do discovery, is that you'll find something, unless it's Parkinson's disease, where we didn't. But generally you don't care what it is. And if there's only one population in there, and it's a rare population, you're probably hooped. If that's the only thing that's going to be in your discovery and you don't know what it's going to be, you're probably not going to find it. You're not going to find it by hand, and you're not going to find it with an automated method, if it's one rare population in 60,000, unless you have some kind of hypothesis that that's the one. If your hypothesis is that it could be something in here, you can do a diagnostic kind of approach as well as a discovery approach. There's nothing wrong with that. You could say, well, I have this hypothesis in my head that it's sort of T- or B-cell related, and you know what most of the common immunophenotypes there are, and you can look for those common things in your discovery data set, and then, on top of that, run something completely unsupervised. We haven't done that yet, but that's a completely viable approach. Theoretically, if you added more markers to this data set, that might pull it out in some higher dimension. [Audience question.] So what we're doing here is replicating manual gates. Someone has decided this is the right way to gate this. It wasn't me; I'm taking their word for it, and I'm just trying to copy exactly what they have done as best I can. And we have made some comments to them that some of their gates seem a little bit random in where they place the gate, with no particularly specific logic to it, and they say, oh yeah, you're right. [unclear] [Audience member:] Is there an algorithm
that can look at that? [Speaker:] Yeah. [Audience member:] And maybe compare it to other distributions in the data? [Speaker:] So we do this in flowDensity. That's the first thing we do: we look for peaks. And what is a peak is the same question as what is a cell population, right? You can find little stuff out in space. So what we do is say it has to be big and it has to be far from something else. That's how we define these modes, and then we count how many we have. We have some heuristic that says if it's big enough and far enough away from something else, then it's a mode, it's a cell population. So that's how we figure out whether there are one, two or three. [Audience question.] No, it doesn't. It's per marker, not per pair of markers. If you know stuff about your data, [unclear]. But right now we do this for all of your pairwise combinations, one marker at a time.
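The "big enough and far enough away" heuristic for counting modes can be sketched as follows. This is a minimal Python illustration, not flowDensity's code: the function name, the two thresholds (`min_rel_height`, `min_sep`) and the synthetic density are assumptions chosen for the example.

```python
import numpy as np

def normal_pdf(x, mu, sd):
    """Gaussian density, used only to build a synthetic example."""
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def find_modes(grid, dens, min_rel_height=0.1, min_sep=1.0):
    """Keep local maxima that are big (relative to the tallest peak)
    and far (at least min_sep away from every taller kept peak)."""
    dens = np.asarray(dens)
    candidates = [i for i in range(1, len(dens) - 1)
                  if dens[i] > dens[i - 1] and dens[i] >= dens[i + 1]]
    candidates.sort(key=lambda i: dens[i], reverse=True)  # tallest first
    kept = []
    for i in candidates:
        big = dens[i] >= min_rel_height * dens.max()
        far = all(abs(grid[i] - grid[j]) >= min_sep for j in kept)
        if big and far:
            kept.append(i)
    return sorted(float(grid[i]) for i in kept)

# Synthetic one-marker density: two real populations plus a tiny bump
# "out in space" near 3 that should not count as a population.
grid = np.linspace(-4.0, 10.0, 1401)
dens = (0.698 * normal_pdf(grid, 0.0, 1.0)
        + 0.300 * normal_pdf(grid, 6.0, 1.0)
        + 0.002 * normal_pdf(grid, 3.0, 0.1))
modes = find_modes(grid, dens)
print(len(modes), modes)
```

With the height rule switched off (`min_rel_height=0.0`) the tiny bump is counted as a third mode, which is the "little stuff out in space" problem the heuristic is meant to avoid.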