So for the next hour or so I'm not going to show you any code. What I hope to do is give you an idea of what we can do with the tools that we and other groups have developed over the last few years, and get you excited about learning how to do it yourself. So: ideas, no code. I'm not going to teach you how to do this stuff, but I want to give you the sense that it's actually kind of fun, even when it's raw and wriggling. The hypothesis my group works from, and that will underlie the next hour, is that automated algorithms have reached a level of development today where they can at least match, and in many cases exceed, what you can do by hand. I'm going to try to demonstrate that by walking through different examples of the kinds of things we can do with automated analysis. Most of these you'll then see in the practical laboratories. And at the end of this session I'm going to hammer you again and again — eight hammers — with examples of how it has actually worked in practice: things my group has done with these tools to answer real problems on real data and get real manuscripts published.

So why should you care? As you all know, flow cytometry has gotten more complex. The first paper using flow cytometry data analysis tools in a really automated fashion was back in 1985, by Bob Murphy. He was really excited because he had one sample, three colours, and 50,000 collected events, and he used the k-means method to analyze that data. Kind of simple in the approach, and kind of simple in the data. We published a paper last year analyzing data from Mario Roederer's lab that had 466 samples, 13 colours, and around 400,000 events per sample. So the amount of data has increased enormously over the last several years. The good news is that computers have also increased in power. Murphy described the instrument he had, and I went and Googled a picture of it; as you can imagine, back in 1985 these computers were huge things, like a giant chest freezer, and not very powerful at all. The computer we used to analyze the data last year was seven million times more powerful. You now have these huge high-performance computing systems, and one of the really cool things that I'm not going to talk about much is the kinds of computers you can use. You'll all be using your laptops here, but all the tools we're using can also be run on the high-performance computing resources you probably have access to through your universities. The way those systems usually work is that you have many, many CPUs running together, and that works fantastically for flow cytometry data, because you can send each FCS file to a different node. We had 600 nodes, each with 12 CPUs. Each node gets one FCS file, which is fantastic: even if it takes, say, two minutes to analyze one FCS file, when you have thousands of nodes you can do a very large experiment in about two minutes, because — with a bit of hand-waving — the work farms out really easily. So it's really easy to do lots of analysis that way. But you can also do it on your laptop.
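To make that farming-out idea concrete, here is a minimal sketch of the same trick on a single multi-core machine, using base R's parallel package; the per-file analysis function and the directory name are made up for illustration:

```r
library(parallel)

# Hypothetical per-file analysis: read one FCS file and return something small.
# In a real pipeline this would be the gating / feature-extraction step.
analyse_one <- function(fcs_path) {
  ff <- flowCore::read.FCS(fcs_path, transformation = FALSE)
  nrow(ff)   # placeholder result: number of events in the file
}

files <- list.files("fcs_dir", pattern = "\\.fcs$", full.names = TRUE)

# Every FCS file is independent, so the work farms out trivially
results <- mclapply(files, analyse_one, mc.cores = 12)
```

The same pattern scales up to a cluster: instead of 12 cores on one box, each file goes to its own node.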
But we really are comparing apples and oranges, both in the amount of data we have today versus yesterday and in the computing resources we had then versus now. And the way people analyze all this data now is manual analysis, which can be time-consuming, especially for discovery. I like to divide the world of flow cytometry analysis into two parts, and both of them get little bunny-ear quotes: one is "diagnosis" and one is "discovery". When I say diagnosis through the rest of my talks, what I really mean is that you know what you want to find. You've studied this thing your whole life and you know that NK cells are the most important thing you have to find for the question you're studying. In that case you want the algorithm to find that population the same way you have found it, from now to the end of time. That could be right, it could be wrong, but it's what you want to do, and I'm okay with that; we have tools that will do that diagnosis thing. The other thing you might want to do with flow cytometry data is discovery, also called a fishing expedition. You don't know what you want: you've collected some data, you have some patients who are sick and some who are healthy, or some mice you've done something to and some control mice, and your hypothesis is that something will be different in the flow data. You don't know what it is, but if you find it, you get to publish a paper. In that case what you have is labels on your two sets of samples, and what you want is to find the cell population that best explains the difference. We have tools that can do that too. But the approaches, and the way they analyze the data, are completely different for those two kinds of outcome.

When you're doing diagnosis, the gating itself is not that hard, because you can follow the gating hierarchy right down to the population you care about. Intuitively it's not so bad. Where it gets to be a pain is that it can be time-consuming, especially when you have a lot of sample-to-sample variation, and you may not want to do that; you may have other things in your life to do besides gating. The idea is that the computer can find those cell populations the same way you would have found them, handling the same variation, so you can spend your time on the more interesting things — the science. We also know that even if you know what you want to find and how to find it, there can be, and usually is, significant variability in how you define those populations compared to the person sitting next to you. Papers have shown that if you hand the same FCS file to two experts working in the same lab, with the same whip-cracking PI behind them, they will quite often get different answers, and sometimes those differences are significant. Maybe we can make those differences go away using computers. There's also no real way to get p-values out of manual analysis, because there's no statistical underpinning behind it. You get p-values on your bar charts at the end, after you've done the analysis, when you compare your groups, but it's really hard to put any statistical basis under how you draw the gates themselves. We can use the math to help us do that when the analysis happens inside the computer. And none of this is a new problem with manual analysis.
When Bob Murphy wrote his paper back in 1985, he was essentially saying: OMG, three colours is really, really hard, I don't know how anybody can ever do this; two is easy, but now we're at three. He actually says that in the paper. Then you go to a paper published a couple of years ago, when mass cytometry came out, and it says: OMG, 30 colours is really hard, how can we ever do this by hand? So this is not a new problem. The good news is that tools have been developed to help us solve it.

So if we're going to use automated tools for the analysis we want to do, what do we have to deal with? Well, there's a large number of events, dimensions and samples, so the algorithms have to be quick; you don't want to wait around for a computer — that's just frustrating. We get annoyed when the latest iPhone takes 30 seconds to load a web page; we're used to immediate gratification. So the algorithms have to be efficient, from laptops out to large compute nodes. There are also lots of different data formats. It's not just the FCS files: it's all the other data that goes along with them. And I have to say, for our group, time and time again, when we analyze data sets from different studies — even studies we're intimately involved with — it's the metadata, the data about the FCS files, that has been absolutely, positively, without a doubt the most difficult thing. It's not the analysis. If all the data comes in, we understand what it is, and all the FCS files are properly annotated, the analysis goes pretty well. The biggest problem — and if you remember only one thing from this two-day session, make it this — is that if you're going to do automated analysis, the more effort you can spend making sure everything is annotated as well as it can be, the easier your whole life is going to be, because you're going to be parsing all that data in and reading those headers. What you don't want to be doing, after you've automated all the analysis, is sitting there tweaking metadata one file at a time; you're not winning at that point. And the more of it you can do in Excel, the better. We work with Excel all the time; it can be a great tool for annotating data. You have your FCS files, you read them in, and you also have an Excel spreadsheet with all the extra information. You spend some time getting those columns and rows organized and regular, you read it in, you put the two together, and that's a fantastic way to work. Getting people to annotate in a structured way has been the biggest difficulty. Right now there's no commercial software you can go out and buy that solves all of these issues — and the nice thing is Paul Robinson published a paper last year that I can now quote, because I've believed this my whole career: you can't go and buy something that solves all these problems. Sure, you can use FlowJo or some of these other tools to solve some of them, but nothing you can buy today solves all of them.
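As a small illustration of what that "FCS files plus a well-organized spreadsheet" way of working can look like with flowCore — a sketch only; the directory, the CSV file and its column names (filename, group, patient) are hypothetical:

```r
library(flowCore)

# Read every FCS file in a directory into a flowSet
fs <- read.flowSet(path = "fcs_dir", pattern = "\\.fcs$")

# Structured annotation kept in a plain spreadsheet exported as CSV:
# one row per file, consistent column names, nothing encoded as cell colour
annot <- read.csv("annotation.csv", stringsAsFactors = FALSE)

# Attach the annotation to the flowSet by matching on file name
idx <- match(sampleNames(fs), annot$filename)
pData(fs)$group   <- annot$group[idx]
pData(fs)$patient <- annot$patient[idx]
```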
We did a review back in 2009 that describes some of these tools. There's also a really good paper that just came out this year by Kieran O'Neill, someone I used to work with, reviewing all — or at least most of — the flow cytometry R packages out there. Paul Robinson's paper gives you a bigger overview of flow cytometry data analysis, as does ours. We didn't want to assign you the reading in advance because that would spoil the whole story, but as backup after you go away, I think these are three really good reviews; if you're still excited after two days, you can go read more.

So the good news is there's a solution, and that's why you're all here: the solutions are in Bioconductor, obviously. So R is... — is that a question? The question is: when I talk about structured data, what do I mean? People can use Excel for good or for evil. One thing I've seen people do is keep an Excel spreadsheet with all their data in it — fantastic, here's all my data and all the information — and then use colour to say "here are my positive samples, here are my negative samples", with different colouring to describe that. There's no practical way to parse colour out of Excel. Structured means your rows and columns are consistent in what they describe, the words you use to describe things don't change, and the labels you use don't get mixed up. File names are another one. We spend so much time parsing file names; you would not believe how such a simple thing can be so complicated. People use file names to mean something, and that's okay, as long as you do it consistently. Don't swap underscores for spaces; don't swap underscores for dashes. It sounds trivial, but it absolutely, positively will kill you when you're doing automated analysis, because all of a sudden one file is "normal", underscore, something-something, and another is "disease", dash, something, and you spend all this time untangling it when you have better things to do. So be consistent in your naming, consistent in your column names, consistent in labelling your reagents and what you put in the FCS file. Pick something and stick with it, and if you decide your scheme was stupid and you want to change it, go back and change it for all your other files too. Pick something smart, don't change it, and if you have to change it, fix everything that was done before; changing things halfway just creates a big headache. Does that answer your question? And don't use colour in Excel spreadsheets. That was a really smart person, too — I say that because we spent so much time trying to figure that mess out.

So R is free, in two senses. It's free as in beer: here it is, enjoy it, it doesn't cost you anything. It's also free in the sense of freedom: it's open source, so you can do with it whatever you want. We show you all the code. If you want to — and you're all going to be super intelligent after this — you can go in, hack that code, change it, and use it within your company or wherever. It doesn't cost you anything to change it.
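Coming back to file names for a second: if the naming scheme is consistent, pulling it apart takes a couple of lines; if underscores, dashes and spaces get mixed, this is exactly the code that breaks. A base-R sketch with made-up names of the form condition_patient_panel.fcs:

```r
files <- c("normal_patient01_panelA.fcs", "disease_patient03_panelA.fcs")

# Split on the underscore that the naming scheme promised us
parts <- do.call(rbind, strsplit(sub("\\.fcs$", "", files), "_"))
meta  <- data.frame(file      = files,
                    condition = parts[, 1],
                    patient   = parts[, 2],
                    panel     = parts[, 3],
                    stringsAsFactors = FALSE)
# Swap an underscore for a dash in one file and this simple split silently breaks
```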
The only string attached to that open-source freedom is that if you then give your changed version to somebody else, you have to do it under the same conditions we gave it to you, which means they in turn have to share it with everybody else. It works on Mac, Linux and Windows pretty well — no real problems.

We did something a little different and gave you a virtual machine today, for maybe two reasons. One is that, when I've done this kind of thing before, you can spend a lot of time just trying to get things to work. The first time you set up R, getting the whole system running and downloading the packages can be painful. There's nothing wrong with that — it's a great thing to learn — but it's a bit intimidating and daunting, and it might annoy you to have to do all that just to get ready for this workshop. That said, installing software in R is easy, and we will teach you how with at least one or two packages during the workshop. The other reason is that we're doing such cutting-edge stuff over the next few days that one package, flowDensity, is only just going into Bioconductor. It's not quite there yet, so we couldn't say "go download and install it" for this workshop, and we had to give it to you in the virtual machine. Hopefully nobody had any problems installing that; if so, we'll get to it. But all the software we have developed, and all the software I've been talking about, is in Bioconductor, and through R you basically type one line and it goes off and installs a lot of stuff automatically (I'll show you what that line actually looks like a little later). It really is quite simple to install packages in R. And a lot of people contribute to Bioconductor; it's widely used for lots of different kinds of things.

R is a statistical programming language at its root. If you're doing math and stats at a university somewhere, you're probably using R for all sorts of statistical purposes. Robert Gentleman, who did a lot of the development of R, and others have developed packages for bioinformatic analysis. If you've done microarray analysis, for example, at one time or another you — or somebody you know and love — has probably used R to analyze your data, because it's a really good environment for statistical analysis, and also for sequence analysis. And now, obviously, for flow. The way it works is a scripted approach: you're not going to have much mouse interaction when you're using something like RStudio; you're going to get your fingers dirty on the keyboard. As biologists, I get that it's not the way you're used to working with software. But the good news is that because everything is written down as a script, you can actually see the process you walked through to do the analysis, and when something goes wrong you have it described in front of you. There are no mouse menus you have to remember where to click; you can describe in code exactly what you did, so the analysis is self-documenting. That's one thing I really love about computers, and it's actually why I got out of working with mice: when something goes bad with a mouse, you don't know what it is. You don't know if it's the subway being built next door — this happened at our centre: the mice just weren't having a good time because of vibration from construction three blocks away. It happens, and you don't know what happened.
With code, it's right there in front of you. It may be hard to find, but you can debug it and work it out, and eventually it's fairly black and white; with a bit of hand-holding you can usually figure things out. And the wonderful thing is that with Google, whenever you have a problem, fifty other people have had the same problem. We won't do it today — or maybe we will — but if you hit an error, you copy the error message, paste it into Google, and you usually find the answer. That's what I do in day-to-day life with my spouse when the home computer doesn't work: "try Google". She hates it when I say that, but you usually find the answer.

The way we solve the problem is to break it into smaller pieces — I'll describe that more on the next slide. The tools we develop each solve one problem, solve it really well, and then hand off to the next step, where another tool takes over. We're not building some mammoth program where the idea is "here's my data, I want to do discovery, and this one program does everything". We break it into quality assurance, gating, transformation and so on, and one tool does each of those little steps along the way. It's all open software with open standards, which means it works really well for collaborative development: we've done a lot of the development, but a lot of other people have developed in this area as well, and it all works because it's all open source and the tools work together. It's fantastic. Bioconductor.org is where you go to find more information.

There's lots of software out there for flow data analysis in R, and we're not going to talk about all of it. The first package, probably the most important just to get going, is flowCore. It's the basis underlying everything; you couldn't do anything without it, because it's how you read in your files and process them. plateCore lets you analyze multi-well plates. For the flow data there's importing, transformation, compensation and quality control, and we have some advanced statistical methods. If you're working with large data, a laptop kind of sucks: there isn't enough RAM. For the workshop today we're using this HIV data set, but if we gave you the full HIV data set and asked you to analyze it on your laptop, you would be sad people, because you'd be waiting and churning through RAM — you don't have supercomputers, and 466 samples of 13 colours with 400,000 events each can't be done that way. But with tools like ncdfFlow you can keep all your data sitting on disk and never read the whole thing into RAM at once. The problem when you read too much into RAM — big data files — is that your computer slows to a crawl and starts swapping. If you're doing large flow analyses, you can leave the data on disk and read in little bits and snippets at a time, and that works really well. QUAliFiER, if you're working with gated data, lets you check that the gates people have used behave consistently. We have really cool visualization: because R and Bioconductor were developed for lots of things besides flow, people have had great ideas about visualizing data, and we can reuse a lot of that for flow cytometry data and make nice plots. And flowWorkspace has been a fantastic development. Nothing against FlowJo — we use it all the time; we use it to figure out what people did so we can automate that in R, and flowWorkspace will read your FlowJo workspaces and get them into R so you can then do more with them.
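Since flowWorkspace just came up, here is roughly what using it looks like — a sketch with the function names the package used around that time, and a made-up workspace file name:

```r
library(flowWorkspace)

# Parse a FlowJo workspace and import its manual gating into R
ws <- openWorkspace("manual_analysis.xml")
gs <- parseWorkspace(ws, name = 1)   # 'name' selects a sample group in the workspace

getNodes(gs[[1]])   # the gating hierarchy, population by population
getPopStats(gs)     # counts and proportions for every gate in every sample
```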
It's a really fantastic tool. (Question: how many workspaces come out of version 10? Nine or ten — actually most workspaces now; they've done a lot of work to improve that.) flowTrans lets people do data transformation. OpenCyto is a new tool that Raphael's group has just developed; it basically simplifies things so you don't have to do as much coding. You use the Excel-spreadsheet idea again to set all the parameters; it reads the spreadsheet, and you don't have to remember so much about how R works — you just tell it what you want done. It's a really fantastic development. Most of these — the ones with stars — have papers associated with them, and they all have vignettes. The peer review to get something into Bioconductor involves not only writing the code but also writing a description of how that code works on a real problem, so when you go through these packages you'll find vignettes that tell you: this is what you type in, this is what happens, and this is what it should look like. All of these have that, and since most of the authors are academics, they usually have papers associated with them too, because that's how we get paid.

Where the real bang for the buck has been, though, is in gating, because we recognized it as a pain point for you, and it's also the kind of problem statisticians get a kick out of solving. For those two reasons there has been a lot of development in R to solve automated gating — lots of different packages: flowClust, flowMerge, flowMeans, SamSPECTRAL, flowQB, flowPeaks; several of those come out of my group. There are a couple more, including one that's not on this list, flowDensity, because you can't download it from Bioconductor yet. We're going to be using it, and I think it's fantastic because it actually works; I've been doing this for a while, and it's the first thing that actually seems to be useful in a way people care about. These packages all use different approaches — they all attack the clustering problem in different ways — and we'll talk a lot more about that later. So there are lots of tools for the data analysis, but the data analysis is only the first step. It only tells you what the cell populations are; it doesn't answer any questions. It doesn't even know what a T cell is: here's a bunch of dots that came together in the same region of high-dimensional space, and we think these are two different kinds of things. That doesn't answer anybody's question. flowType and RchyOptimyx are two tools you'll be using during this workshop that actually help you answer questions about which bits of information in the data set are important.

All these tools fit together in a pipeline, and the pipeline always starts with compensated data. We kind of assume you know what you're doing up to that point; you just have to hand off compensated data. There are people working on tools for automated compensation — Wayne Moore is, I think, the only one doing a lot of work in this area, and he has some things that sort of work — but garbage in, garbage out.
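For what that hand-off looks like in practice, a minimal flowCore sketch of applying a spillover matrix that the acquisition software stored in the FCS file (the file name is made up, and it assumes the matrix sits in the usual SPILL keyword):

```r
library(flowCore)

ff <- read.FCS("sample01.fcs", transformation = FALSE)

# Spillover matrix written by the instrument at acquisition time
spill   <- keyword(ff)[["SPILL"]]
ff_comp <- compensate(ff, spill)

# Logicle transformation of the fluorescence channels, estimated from the data
trans    <- estimateLogicle(ff_comp, channels = colnames(spill))
ff_trans <- transform(ff_comp, trans)
```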
We can't help you up to that point: you've done the appropriate things, you've compensated your data, and you hand it off. I won't tell you how to do flow, but it usually works best if you do your compensation post-acquisition and store the spillover matrix, because that makes our lives much easier — we can fix things later. Don't hard-code the compensation while you're sitting at the machine, and don't store your files already compensated, because then we can never go back. (Audience comment: that's something we do as we run — we run the compensation controls, apply compensation, and then collect, but my understanding is that the FCS file itself stays raw and the compensation is stored separately. Yes: as long as you're using FCS, the data in the file should be raw. We do sometimes see data stored the old way, already compensated; hopefully nobody is doing that anymore — most machines now keep it separate — but we do a lot of long-term studies, sometimes it's old data, and people can get touchy about it. I didn't want to go there, but anyway: well-compensated data — you're all doing that.)

So, a few tools before you even start an analysis. You want to make sure your data is not crap, and it's very difficult to figure out what's crap and what's not, because there are no statistical p-values for "this is crap". What you do instead is visualize your data, and we have lots of different ways to help you do that and see when things are going wrong. We can't tell you what's going wrong, or even whether anything actually is wrong, but we can play the game of "one of these things is not like the others", and that's really important to do. You also want to normalize your data: if things are drifting because of something that changed on the machines during a long-term study, we can take care of that; there are ways to remove it. So now you've got clean data. You can transform it using any of the transformations you use today, and then it gets fun, because we have all these choices for replacing manual gating: we have more than 20 algorithms for automated gating, and at least 14 or 15 of them are in R. So now you've got your populations, and depending on what you're trying to do, matching populations across samples can get tricky. I'll say it again: the algorithms don't know what a T cell is. They just see dots in space, and it becomes tricky when one file has some dots in one place and another file has dots that have shifted over. By eye you can look at that and say "this population is the same as that one over there; it's just moved a bit, it's in relatively the same position." For a computer you have to be cleverer: is this cluster the same population as that one, or is it something different? The way we handle that in the computer is to match populations across samples — I'll talk a bit more about that — and it gets tricky when you're looking at lots of samples over time.
But we have tools to match all of those up. Then you hit the decision point: are you trying to do discovery or diagnosis? There are different kinds of tools for each, and you'll be using some of them during the class. And these pieces can be swapped out: we have different algorithms you can use, and it's easy to try one, swap it out, and put another in, because everything upstream doesn't change — you're only changing that one step. Downstream you might have to do some hand-waving and massaging to feed the next step, but usually it works pretty well. (Question about transformations: can you use any transformation? Yes, we've got them all; parameterization is harder for some than for others. I don't know how much we'll talk about logicle versus the others, but we will talk about them and about the parameterization — they're all in there.)

So this is how you work today: on the left-hand side you're using your mouse — File, Open, compensate, a bunch of clicking. On this side: no clicking, typing. Two days from now this will all make sense, but basically there's code you can write that does the clicking for you, which is great, because instead of clicking you can go have lunch with us. This is the flowCore workflow we published back in 2009; all we're doing is automating what you do by hand. We're going to use RStudio, so there is some clicking — do you like clicking? RStudio is one tool for doing the analysis; there are different ways to interact with R, and a lot of people just use the command line, but RStudio is a nice graphical user interface that you'll learn more about. It's a bit like FlowJo in that sense — you see pictures — except that FlowJo does this part in the background and you just don't see it. You can click around in RStudio, select things, and do your analysis.

This figure is from O'Neill's paper. For those of you working at home, the R Project website is where you get all this: it's statistical programming, with lots of tools for lots of different things, most of which you don't care about. Within R there's a subset of packages for biological analysis — the microarray and sequence analysis I mentioned — and down at the bottom, that's where we live, close to the flow cytometry data; you click on that. If you're doing this at home, once you have R installed, one line is basically all you have to type to get access to a lot of the software: you say "go get me the Bioconductor packages", it runs some stuff, and it installs them for you. They do a lot in the background so you don't have to, and if you get stuck, there's help. There are vignettes describing all these tools, and there are example workflows for flow cytometry. Some of that material is a bit older, and it's hard to know on the internet what's crap and what's not — never try to get a diagnosis from the internet, and sometimes don't try to get flow cytometry help there either, because some resources are better than others. But there are mailing lists and FAQs, and the people on the mailing lists are really responsive. Most package maintainers listen, because you want people to use your stuff, you get a kick out of it when somebody does, and you want to help them use it. That's why I'm here.
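To make that one-line claim from a minute ago concrete: once R itself is installed, this is roughly all it takes to get a flow package and open its vignette, using the biocLite installer Bioconductor provided at the time (flowMeans here is just an example):

```r
# Fetch the Bioconductor installer, then install a flow package
source("http://bioconductor.org/biocLite.R")
biocLite("flowMeans")

# Load it and open its vignette, which walks through a worked example
library(flowMeans)
browseVignettes("flowMeans")
```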
The people who wrote these packages tend to sit on the mailing lists, and if they see a question go by they'll usually answer it, because we're nice people. You can also take courses on Bioconductor if you want to get deeper into the programming side. This is where flow lives; there are examples in here — written about two or three years ago, but they use the flowCore package and cover a lot of the basics — so it's another way you can walk through and reinforce some of the things we're going to teach you. Nikesh — no, not Nikesh... Nishant: look for the packages, or the descriptions, that Nishant wrote; they're really, really good for how to use some of the basic flow packages, and they're available online.

I mentioned the vignettes: every package has one, they describe the basic functionality, and they're effectively interactive. One of the really cool things you can do in R, for example, is write code that actually writes your paper: the code in the document generates the PDF, and this is how all the vignettes are done. The PDFs are generated by the code written inside them, so for the plots in the PDF you actually see the code that generated them — it reads in the FCS files, does this and that, and there's the result. If you follow the examples and type them in, you should end up with exactly what the author described, because the code is what produces the document. Getting at them is easy: you just call browseVignettes and it opens in the browser, and in RStudio you can see the plain text and the R code alongside it.

Here's an example package page. We'll be doing this, I think, with RchyOptimyx, so you'll learn how to go through the process of installing the software, because RchyOptimyx isn't installed in your VM — the virtual machine you have. You basically say "go get that package", you see some things scroll by, and boom, it's ready to use. It's that simple. All the packages have basic documentation and example R scripts associated with them. This is what the vignettes look like: they tell you what to do. This is how you say "I want to use the flowMeans package" — before you do that, R knows nothing about flowMeans or Bioconductor; RStudio knows nothing about all these tools out there, because there are thousands of them and they aren't all loaded on your computer. You say "I want to use this package", and now your computer knows how to run that software. The vignette then describes what you have to do: read some data, here are the different variables you need, here's what has to happen, and here's a plot. And we all have papers, because we're all scientists, so the paper is more the advertisement — wow, look, it actually works and does something neat. It gives you more background on what the software is trying to do and shows a practical example, like a normal paper would, without showing all the code, because nobody wants to read that in a paper.

So now I'm going to walk you through each of these steps one by one — not on one set of code or one data set, but step by step, more generally. It all starts with making sure your data is not crap, and the problem is how you detect what's crap and what's not. These aren't necessarily errors you have to fix; you have to be aware that something is happening.
Because the last thing you want is what happened to us — and I have been the supervisor of somebody who did this; that's the best way I can put it. We spent all this time analyzing a cancer data set, trying to predict diffuse large B-cell lymphoma, and our analysis said: here's a cell population that stands out, these people now look different from everybody else. What it turned out to be was that somebody had swapped a laser on the machine. We didn't know that, because they hadn't annotated their data properly: this thing that happened on the machine wasn't in the FCS file. They knew it; we, analyzing their old data, didn't, because all that extra information that lived in their heads never got put alongside the FCS files. Again: metadata annotation is really important. All of a sudden these populations were moving around because of the laser change, and populations moving around is a difference — but we didn't know that the difference was driven by time, and that at a certain point in time the laser had been swapped. We saw a difference and didn't know what it was. You can visualize these things using the flowQ package and get a broad idea of how your samples look over time and across the study. You have to think about it — it really depends on what you're studying, so I can't tell you exactly what to do; it's experiment-dependent — but you want to explore your data in different ways, before you do anything else, and use your head to look for things that might be different. Plates, for example: you might want to look at your plates over time, or at how the wells look across different plates. With plateCore you can colour all the wells according to, say, population density, and maybe you'll see that the outside wells look a lot different from the inside wells, which would tell you that maybe something dried out. Or take forward and side scatter: just plot the average forward and side scatter for all the wells, and you might see that the cells on the outside are big and the cells on the inside are small — something has happened to them, they're unhappy, they're blebbing. All your downstream analysis is going to be influenced by that, and you might end up finding well-position differences rather than biological differences. There are no p-values associated with this; you do a lot of exploring just to explore the data. There are two published papers that describe some of this. And we do things like this plot, which is basically forward versus side scatter across a whole bunch of wells: you can see that these wells are different from those wells. That might be biology — these might be patients who are getting sick, and it might be related to the size of their cells because the patients aren't happy and the cells are getting bigger or smaller — or maybe the wells dried out.
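That kind of "plot forward versus side scatter for every sample and eyeball it" view takes only a line or two with flowViz — a sketch that assumes a flowSet called fs, like the one read in earlier, and channel names that will differ by instrument:

```r
library(flowViz)   # lattice-style plots for flow data

# One forward/side-scatter panel per sample, all on one page
xyplot(`SSC-A` ~ `FSC-A`, data = fs, smooth = FALSE)

# Or the density of one channel across all samples, to spot the odd one out
densityplot(~ `FSC-A`, fs)
```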
You don't know which — but at least you've identified that there's some variation going on before you do any gating, and you can do some statistics to see whether that difference is actually significant or not. There's also a really cool — I think it's cool — web-page-style report: it's basically red light, green light. We do some statistics in the background, we have some ideas of things that can go wrong — like the viable cell count, or how many cells there are — and if one sample doesn't look like everything else according to our tests, it gets a red light, and you can click on it and see the data behind it. We can also check your gates: you read in your FlowJo template and it looks for samples whose proportions are very different from the other samples. If the proportions change a lot, that might be biology, or it might be because your gate isn't in the right place — your dots are here but your gate is over there — so you get a very different count, and we flag that your counts are further off than they probably should be. This one comes out of Raphael Gottardo's lab.

(Question: where is the technology at right now — we can read in FlowJo, but can we read in other software?) So, one of the nice things FlowJo has done is expose how their software works through XML files: FlowJo writes its workspace as XML, and we can parse XML — but we have to know what it is we're parsing. A company like BD has these DIVA workspaces, and they don't tell anybody what's in them. There's no place where BD has written down "this is what goes in the workspace and this is what each piece means", so effort has to be spent reverse-engineering it. FlowJo doesn't tell anybody what they do either, and they change what they do, but it was much easier for us to reverse-engineer FlowJo than other software. Everything we do is open and based on open standards. One of the hats I wear is chairing ISAC's Data Standards Task Force: we wrote the FCS 3.1 standard, and we write it down and describe it — this is what goes in each spot and what each spot means — and then we use that same description when we write FCS files. We've now written a standard for gating as well, but right now nobody is using it because we only just developed it; as things stand, there is no standard everyone has implemented that describes gates, beyond what FlowJo does internally, and there's no open description that everybody uses of all this other stuff beyond the FCS file itself. So for us as software developers it's a big pain to go and reverse-engineer all the different workspaces the different vendors have invented, and unless we have a reason — a study we're being paid to do — we're not really interested in spending the time, because it's not very interesting work. So the short answer is no, and the long answer is no, because it's painful. (Question: how much effort did that take?)
For flowWorkspace? I don't know the FTE time exactly; it was something like a couple of months of work. Part of the problem is that FlowJo is a moving target: they come out with new versions, and every time they do, they tend to come out with a new workspace format — that's why you can't read in your old template files. It's the same problem for us, except worse, because at least they should have some idea of what they've changed; we don't know what they're doing in the first place. But it works reasonably well now with the flowWorkspace tool — just not for anything else. That is going to change, though. As I mentioned, we just came out with a standard called Gating-ML; we published it, and right now we're writing a reference implementation in R so that software developers can see how to exchange gates between tools. I think this is going to be fantastic: you'll be able to take FlowJo gates and read them into FCS Express, or read them into DIVA. BD has said yes, we're going to follow this standard in our next version of the software, so you'll be able to write out Gating-ML files and then use them in DIVA. Things are changing; it takes time and it's painful, but it's a good time to get into this space, because these problems are being solved.

This is an example of QA with QUAliFiER: you basically plot, over time and across your FCS files, the proportions in each gate, and you can see that one of these things is not like the others — it's even shown in red. We don't know why; the proportions here are just much different from the proportions in all the other samples. Maybe that's interesting for you — maybe that's a person who's going to die tomorrow — or maybe your gate is off and missing the events, or maybe there are two populations in there, or maybe there wasn't enough RBC lysis. These just don't look like every other sample, and you might want to check them. Again, it depends on the questions you're asking: think about what could go wrong and write checks for those things — this is where being smart about your science comes in. This one is looking at sensitivity over time. We do a lot of long-term experiments — nothing against R, but it's not worth all this effort for three samples, because you can do that by hand — so when we analyze data it tends to be hundreds or thousands of samples, and things go wrong over time: the machine has a bad day, there are problems in some wells, and you want to check for that. (Question: are there any expectations about what the files should look like when you read data in for this kind of QA — is that how you build confidence?)
So the question is about expectations for the files. The problem with doing quality checking is that there are no expectations at all. If you had an expectation, you could probably write a statistical test for it, because an expectation means you have some idea of a distribution, and if you have a distribution you can check your values against it — that's how you get p-values. Here you have no expectations, because if you did you'd have a hypothesis, and if you had a hypothesis you could test it with a t-test or something like it. There are essentially no t-tests here — there's a little of that going on, which is how we decide what gets flagged red and where those p-values came from — but a lot of it is just visualization, big-data exploration, trying to find things that look wrong. Once you know what can go wrong, you can generally write tests for it; that's how we got the red flags. Mostly, though, you're just looking at data over time.

Data normalization: when things change, it gets hard, and one of the things that can change is technical variation. Say you changed lots on your reagents — you ran out, that happens — and with the new reagent your populations no longer sit where they sat before. When you're doing large studies you have to account for that: you have to move things back to where they were before, or to some common reference. We wrote two different tools to help remove that technical variation, but you have to know it's technical, because if it's actually biological variation and you normalize it away, all of a sudden you've removed your sick-versus-healthy difference, and that's kind of bad. So you only do it when you know you have to. We had a change here that happened to be a laser change — the example I mentioned before: this population is the same thing as that one, but it's not in the same place, because the laser changed and the population has moved way over from where it was. We had to account for that so we could get the labelling right (I'll talk a little about labelling later). This is roughly how it looks — again, no code; you'll have to trust me that the math is really cool. This is the raw data: here's that population, here's the other one; they're actually the same thing, and after normalization they line up on top of each other. We have two different methods that essentially work the same way: before, the positions of the populations are all over the place; afterwards they're aligned, because we know there are two kinds of things going on here, and — whoop — it's all good. And the cool thing is this works really well with FlowJo. Here are the manual gates: something people like to do in FlowJo is what's sometimes called a painted gate, or static gates — here are my two populations, and I have another 30 samples I don't want to gate by hand one by one, so I take that gate and paste it across all my data. But your data is moving, and if you use QUAliFiER you can see that the proportions have dropped a lot because you're missing all the events that shifted up there. Now you normalize the data, still pasting the same gate across all the samples, and it captures the right events in the right place, because we've moved the data to where the gate expects it to be.
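One of those normalization approaches is available through flowStats; a minimal sketch of per-channel warping, again assuming the flowSet fs from before, with made-up stain channel names — and remembering that you only align variation you know is technical:

```r
library(flowStats)

# Align landmark peaks in the named channels across all samples in the flowSet
fs_norm <- warpSet(fs, stains = c("FITC-A", "PE-A"))
```

After this, a single static gate pasted across all samples lands on the population it was drawn for.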
(Question: how is data normalization more straightforward than just moving the gates — as opposed to, again, moving the gates in FlowJo?) So the question is: how is this different from moving your gates in FlowJo? In FlowJo you're doing it by hand, now or later — it's the same thing, except I'm spending hours and hours and hours, going back to an experiment I did two years ago, trying to standardize the gates now that I understand the data, and re-analyzing it. And there are really two reasons you might want to move gates. We're going to talk a lot about flowDensity later: flowDensity is a tool we developed that accounts for small, minor sample-to-sample variation — the sort of thing where in FlowJo you'd tweak a gate a little, twist it a bit, because the density shifts slightly from sample to sample. Normalization is for the other case: you have a whole bunch of data that looks like this, and then later a whole bunch of data that looks like that, and you're trying to move gross sets of data. They're different problems — small adjustment of gates versus wholesale movement of data.

We also have data transformations, and we've got them all — I want to be clear on that. Transformation is another pain point for us, and as you probably know, how you transform your data has a huge impact on everything you do afterwards. If you don't get the parameterization of your transformation right, everything else is going to fail. We spend a lot of time — and you probably spend a lot of time — getting those transformations right. There are tools that help: flowTrans is a package that tries different assumptions for how to parameterize your transformations. It can still be a very difficult problem with automated analysis, just as it is with manual analysis. Here's an example using flowTrans, before and after: the after probably looks a lot more like what you would want to look at. It does some cool math in the background to help things look right.

Automated gating: it's awesome. Lots of different packages to use — 14 in R — and it's free, since it's computer time rather than your time. We think it can be more accurate, and you can probably find more than you can by hand, because the computer can spend the time analyzing the data that you might not want to spend, especially with high-dimensional data, where there are just so many populations that you don't have time to find them all. So you can spend your time doing science. We'll talk about this in excruciating detail, with all the different tools, in module 4 — that's for tomorrow, so I'm not going to cover it today.

Labelling — trying to figure out which populations are the same across large data sets — is a problem. The way we do it right now: here's patient 1, here's patient 2; patient 1 doesn't have any cells up here, patient 2 does. Looking by eye we can see that these two blue populations are the same, these black ones are the same, and there's no red one over here — but the computer has to learn that. The way it learns it is that we take a representative of each population — the centre of that population — so we take the centres of all of them, put them together, and then we cluster on the centres.
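A toy sketch of that centre-matching idea, using nothing more than base R's hierarchical clustering on a made-up matrix of per-sample population centres:

```r
# One row per (sample, population) pair; columns are the marker medians
centres <- rbind(
  s1_popA = c(2.1, 0.3), s1_popB = c(0.2, 1.9),
  s2_popA = c(2.0, 0.4), s2_popB = c(0.3, 2.0),
  s3_popA = c(2.2, 0.2), s3_popB = c(0.1, 1.8), s3_popC = c(2.3, 2.1)
)

# Centres that cluster together get the same label across samples
hc  <- hclust(dist(centres))
lab <- cutree(hc, h = 0.5)       # the cut height is a tuning choice
split(rownames(centres), lab)
```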
We say: these three dots are all the same, these three dots are all the same, and these four dots are all the same; we map those back to the original samples, and now we can label those populations as the same. It's something you do in two seconds by eye, but the computer has to work through it one step at a time — map them back, figure out that all the red ones are the same. Even then, the algorithms still don't know what a T cell or a B cell is, and that becomes really difficult, because when you do automated analysis you're going to find thousands of cell populations — on the order of three to the power of the number of markers — and for a ten-or-more-colour study that's tens of thousands. They come out as immunophenotypes: CD3-positive, CD4-positive, CD16-positive, CD28-negative, and when you get these long strings on a sheet of paper there's nothing that says "this is an NK cell" or "this is a B cell". That is a pain point for automated analysis right now: translating what the computer spits out as population labels into biology. There's no link between what we do and what you do at that point, and this is where working with biologists who love you is really important, because it's really, really painful. We're working on it; we have some ideas and some ways we might solve this problem, but they're not solved yet, they're not going to be solved in the next couple of months, and even then they won't be solved elegantly. I'll talk more about that in module 6. Knowing what's important out of all that output is where it becomes science again, and that's the tricky part.

So we've reached the point where the cell populations have been identified, and now there are the two paths: diagnosis and discovery. I'll say more about that in module 4. What I'm going to talk about now is some examples from my own lab where we used these tools with some success — success meaning I got a paper out of it. When I started developing tools in this field, I decided — I don't know if it was the right thing to do — to get the most complicated data set I could get my hands on, figuring that if we could analyze that data set, and analyze it well, the tools would work for everything else, because we'd have solved the hardest problem and the easy stuff would be easy. I didn't think it would be useful to solve all the easy six-colour data, because that wouldn't necessarily scale, and scaling is what you want when you're developing tools — you want them to scale to the larger data sets. So I talked to Mario Roederer, because I didn't know anything about flow when I started, but I knew he was a big name, publishing all these papers and pushing the boundaries in terms of data, and he had this data set with 13 surface markers plus Ki-67. At the time — and still — that's a complicated data set. There was also a lot of data, and they had survival times measured over it. The cool thing was that they had found something: they had published a paper saying they had found a signal in all this data that was important for understanding HIV, so there was a true positive to aim for. That said, I had to burn through two graduate students before we actually got something to work. Then I had Nima — a lot of what you'll see today is from a graduate student called Nima, who is now working in Garry Nolan's lab.
What they had found was that Ki-67-positive and CD127-positive populations correlated — one positively, one negatively — with HIV progression. They had done the math, and Pradeep, who worked with Mario, tells me he spent literally months analyzing this data set by hand to find it, looking at all these different cell populations one by one. It was a total pain — doing all of them by hand, trying to tease it all out — but they found something and got their paper published, with p-values of 0.003 and 10^-4 for differential survival based on those two populations. So the question we had was: can we develop a tool that finds what they found? They found it, so we'd better be able to find it with automated analysis. And the question, for the win, was: can we find more? I wouldn't be here talking about this example if we hadn't done just that. They had found those two populations; through automated analysis — and I won't tell you here how we did it, I'm just trying to get you excited and show you that these tools actually work on real problems — we found something that essentially combined the two. They had found two things by hand; we said, actually, these two populations are related, and you can find something that combines them. So we validated that we could find what they found — and found it a bit better — and we also found two other things. The fact that we recovered their finding made us feel better that the other populations we found actually made sense, and they were very excited about that. And since it's really all about the p-values — you win when you get less than 0.05 and can publish a paper — they had found 0.003 and 10^-4; through the power of automated analysis we got 10^-13, so we were obviously better; and the two additional populations we found also had very highly significant p-values for predicting survival, using the same sort of tools.

This is an example of an RchyOptimyx plot. The previous analysis was with flowType, and one of the problems with flowType is that there's nothing pretty to look at: you get a string of immunophenotypes on a big long page, which doesn't make anybody happy — it certainly didn't make Mario Roederer happy. RchyOptimyx makes pretty pictures; it's one of the tools we're showing today, and it's a great tool for discovery. It's a bit like SPADE in that it summarizes all your data in one plot — SPADE sort of does that without the labels — and one of the nice things we give you is labels on the immunophenotypes. You can't quite read them here, but it says CD10-positive, CD38-negative, and the colour of each dot tells you how important that cell population is for distinguishing between your two groups — in this case lymphoma versus reactive lymphoid hyperplasia.
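Under the hood, the discovery statistics are nothing exotic: once flowType has turned each sample into a vector of cell-population proportions, you are back to ordinary hypothesis testing. A sketch with a hypothetical matrix props (one row per sample, one column per immunophenotype) and a group label per sample:

```r
# props : samples x immunophenotypes matrix of cell-population proportions
# group : factor with one entry per sample, e.g. "lymphoma" vs "reactive"

pvals <- apply(props, 2, function(p) t.test(p ~ group)$p.value)

# Thousands of phenotypes get tested, so correct for multiple testing
padj <- p.adjust(pvals, method = "BH")
head(sort(padj))   # the immunophenotypes that best separate the two groups
```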
The idea is that these FCS files belong to two different groups. You give it all the FCS files, and then you give it an Excel spreadsheet that says these FCS files belong to group one and those FCS files belong to group two. You pour that into flowType and RchyOptimyx, and it spits out a graph saying: here is the population in this data that is most associated with the division into these two groups. It's a fantastic tool if that's the kind of question you're trying to answer. Then you have to go back, go to FlowJo, look at those cell populations, and see whether they're actually real, because again, with an eight-colour tube we're going to find about 6,000 cell phenotypes. There are a lot of cell populations in your data that you probably don't know about and don't care about, but we're going to find them all, and the trick is that you then have to go back and check whether a hit is real. So this is what we spit out. These immunophenotypes hadn't been found by manual analysis; she goes back, does some gating, and can find the population by hand, so it's actually there, and then she looks at the difference between group one and group two, and yes, the stats actually make sense. So, ding, another paper. That's the paper, and the cool thing is that the idea of that cell population had been written about before in another paper. We have something submitted now; our paper isn't accepted yet, but it will be, and the analysis has really great specificity and sensitivity.

Another example, now in press at PLOS ONE, is LyoPlate analysis. You can take your reagents and dry them down into 96-well plates. The reason you would do this is for large studies: if you're running a large clinical study across nine different centres, you want to get rid of as much noise as you can, and one way to do that is to have everyone use the same reagents, for example by giving people lyophilized plates. The question then is whether the lyophilized reagents behave the same as the liquid reagents. So: here are all your FCS files, here are the labels in an Excel spreadsheet, here are all the samples that are lyophilized and all the samples that are liquid; is there a difference? Well, yes, there is: iO10 is different between liquid and lyo. So maybe we want to be aware of that and figure out what's actually going on; whether one is better than the other is for you to work out as biologists, we're just telling you there's a difference, and a very significant one. Then you go back and analyze it manually, and it turns out that in the LyoPlates the staining is actually brighter, so arguably it's a better reagent than the liquid one. I don't really care; all I care about is that the tool works and finds something the biologists were able to validate.
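For the liquid-versus-lyophilized question the natural statistic is a paired test, since the same samples are stained both ways. The snippet below is a toy sketch with made-up MFI values, not the analysis from the paper.

```python
# Paired comparison sketch for lyophilized vs liquid reagents (made-up numbers).
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(1)
liquid = rng.normal(1000, 150, size=30)            # hypothetical MFIs, liquid reagent
lyo = liquid * 1.15 + rng.normal(0, 50, size=30)   # same samples stained from the lyo plate

stat, p = wilcoxon(lyo, liquid)                    # paired, non-parametric signed-rank test
print(f"median shift: {np.median(lyo - liquid):.0f} MFI units, p = {p:.2g}")
# A significant shift only says the reagents differ; whether brighter staining is
# 'better' is a biology call, not a statistics call.
```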
We can also do diagnosis. The question here comes from cancer work: clinicians are trying to tell people whether they're going to die of cancer, and they do that, and it works. But the powers that be above the clinicians want to make sure the clinicians haven't gone off the rails and aren't making some kind of crazy diagnosis. We've had cases here in Canada, at least in radiology, where somebody wasn't doing his job right, it got into the papers, and they had to rescreen 900 patients. So the routine standard of care is that they're supposed to do a random review of about one percent of cases; at least, that's what I was told. People come to us with their problems and we just solve them, and Andrew Weng, a collaborator on this project, said: I have to review all these cases, that sucks, and it's also kind of a waste of time, because I'm just going to pull out some random case and reanalyze it, and I might not find the one that actually matters. So our hypothesis was that we could use computers to identify the cases that are most interesting for re-evaluation.

How do we do that? We train a computer to do what Andy was doing, and then we reanalyze all his data. If the computer is trained well, it should get the same answer; if it doesn't, that means Andy has gone insane, or it's a problem sample, or who knows, one of them is off. That will never happen, of course, but at least it pulls up the problem cases that are most informative for him to review. So we can do things like clustering, we can find populations, and we can train a classifier. I'm not going to talk about the classifier we developed; I don't think we'll talk a lot about classifiers here, and there's a lot of statistics you can do downstream that we're not covering. What we are talking about over the next few days is reading an FCS file, identifying populations, and doing population matching; after that come all the statistics, like how you do classification, how you do t-tests, how you look for differences.

So we developed a classifier to tell the difference between diffuse large B-cell lymphoma and follicular lymphoma. We found two different ways to do it, and then we combined them. Looking at predicted versus actual, one method got these patients wrong and the other method got those patients wrong, so a combination of the two looked promising. But there were four patients where both algorithms agreed with each other and disagreed with the recorded diagnosis: you said it was this, we say it's that, what's up? We went through those cases, they were re-reviewed by a pathologist, and in each case there was some underlying reason: it's not quite diffuse large B-cell lymphoma, it's not quite follicular lymphoma, or maybe something had gone wrong with the original diagnosis. The reason something can go wrong with the original diagnosis is that they don't just look at flow; they collect all this other information about these patients, so the diagnosis that's in the computer is based on metadata that we don't have, because all we have is the flow. We could have incorporated that other information, and you can do that too: once you've got the flow data, if you have microarray data, sequence data, or clinical outcome data, you can combine all these different data types. We're just talking about flow, and we didn't have access to that information, but if we had, we probably could have gotten more of these right. At least based on the flow alone, these were incorrect diagnoses.
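The re-review idea boils down to training more than one model and flagging the cases where the models agree with each other but contradict the recorded diagnosis. Here is a hedged scikit-learn sketch on stand-in data; in practice the features would be per-sample population frequencies, and these are not the classifiers from the lymphoma study.

```python
# Flagging cases for expert re-review (illustrative sketch, stand-in data).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Pretend each row is a patient and each column is a gated population frequency.
X, y = make_classification(n_samples=200, n_features=30, random_state=0)

pred_a = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=5)
pred_b = cross_val_predict(RandomForestClassifier(random_state=0), X, y, cv=5)

# Both models agree with each other but contradict the recorded label:
flagged = np.where((pred_a == pred_b) & (pred_a != y))[0]
print("cases worth sending back to a pathologist:", flagged)
```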
So the question from the audience is whether we calculate some measure of precision, how positive a call is, anything that gives you a number. Yes: this gives you an idea of what the sensitivity and specificity of your assay are, it's built into these tools, so we have an idea of how well they're working, and the RchyOptimyx results are all based on area-under-the-curve analysis. Can I say that everything we do is based on statistics? Yes, it's all based on statistics. What's my level of confidence that everything we do is based on statistics? Well, some of it is math and some of it is stats, and where you draw that line is fuzzy, but when we're making diagnoses or doing discovery, all of this is based on p-values; this is essentially a p-value plot. It's based on statistics in the sense that we had to come up with cutoffs, thresholds, that let us separate group one from group two. How we pick those thresholds can sometimes be a bit arbitrary, we're trying to find the best separator, and I haven't shown you any of the math and stats behind it because this isn't the place for that, but it's all built on statistics, I guess, is the best answer.

I'm almost done, I just have four more; I'm on example five. Example five is really short, because somebody came to us with the hypothesis that Parkinson's disease can be distinguished based on flow. I believed him, he was really excited, and it turns out that's not the case. We did all this analysis and didn't find anything, and he says he didn't find anything either. So the tools we use don't, in our hands, on all the datasets we've analyzed, spit out false positives, which is great, because it's so hard to do biology that you don't want to spend your whole life trying to understand something spat out by automated analysis when it's just noise. And because it's all based on statistics, the positive predictive value is really good, and the negative predictive value performs well too; the good news is we're not going to tell you something is useful when it actually isn't.

Example six; two more examples to go after this, three minutes, I can do this. Standardizing data across flow centres is very important: the more you can do to standardize your data, the better and happier we're all going to be. So a bunch of people got together and decided: we understand immunology, we understand that T cells are important, we understand that NK cells are important, we know everything that's going to be important for immunology from now until the end of time, at least in eight colours. So we're going to take those reagents and dry them down, because variability can be a big problem in large clinical studies and lyophilization should help solve that. There's a really great paper that explains why this all makes sense, which Holden Maecker published last year in Nature Reviews Immunology. But if we do all that, we know, or Holden knows, because he published a very famous paper on this in 2005, that if you give the same data to a whole bunch of labs and have a whole bunch of people analyze it by hand, you get real variation in how that data gets analyzed. These are not idiots; these are all smart people. Give that same dataset to one person and you get much smaller CVs. That one person may be wrong, but at least they're wrong in a consistent way, and that's sometimes a good thing. The CVs went from 55% down to 24%.
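Those CV numbers are just coefficients of variation of a gated population frequency, computed across many analysts versus across repeat gatings by a single analyst. A small sketch with made-up values (not the numbers from Maecker's study):

```python
# Coefficient of variation of a gated population frequency (made-up values).
import numpy as np

def cv_percent(x):
    """Coefficient of variation as a percentage: 100 * sd / mean."""
    x = np.asarray(x, dtype=float)
    return 100.0 * x.std(ddof=1) / x.mean()

many_gaters = [12.1, 4.5, 9.8, 20.3, 7.7, 15.0]     # same sample, % positive, six analysts
one_gater = [11.0, 12.5, 10.2, 13.1, 11.8, 12.0]    # six repeat gatings by one analyst

print(f"many analysts: CV = {cv_percent(many_gaters):.0f}%")
print(f"one analyst:   CV = {cv_percent(one_gater):.0f}%")
```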
So there is huge variation in how manual gating gets done when you use one versus many gaters, and the idea is that we automate the process of finding all these cell populations. They listed them all out, so this is a diagnosis problem: find all of these, and find them the same way we did by hand. This is where the flowDensity tool comes in, which you'll be using today. I'm not going to talk about the math today, but it basically looks like what you do in FlowJo: it puts the gates in the right spots because you told us what to do. Does it work? Of course it does, because otherwise I wouldn't be telling you about it. So here are some examples comparing manual versus automated analysis, manual on the left, automated on the right, and some of these populations are perhaps easy to find.

One thing that tends not to be easy for these algorithms is rare cells that are off in space: a couple of dots separated from the other dots, where there's no line that makes any statistical sense. That's what this example is hedging on. There's no reason why the line is here versus here versus here; there's no math that says this is the cut that has to be made, and nothing in the density distribution that says that's where it should be. You as a biologist say: this is where the line is, this is where it always has been, and this is where it always shall be; that's how you make your diagnosis, and fine. For this particular monocyte population we have no explanation for why they do what they do, but if you tell us what to do, we can do it really well. Same thing here: there's nothing in the data that says exactly why that gate should be where it is, but give us some rules and we'll follow them; computers follow rules really well, so we can follow that rule and get the same answer again and again and again, for all these different populations. Transitional cells are another really tricky one. We built tools for three different datasets where everyone was really interested in transitional cells, I don't know why, and it's painful because it's really hard to get that gate right; but once you tell us exactly how to do it, we have no problem.

I showed you some examples; now I've put them all together. This is the manual analysis on all that data, on average, across a whole bunch of datasets, and the automated analysis is in the right-hand column, for a series of six different populations. And, for the win, the automated results are within the variation you would have gotten by hand. That's exactly what they wanted: we're not finding something else. Which is why I'm so excited about this flowDensity package that we'll be talking about tomorrow.

Second-to-last example: three patient categories, healthy, some specific cancer, and some other disease. I can't tell you much about this because it's from a company that has two letters in its name. The question was whether we can use flowDensity to follow their gating hierarchy. They had a gating hierarchy, we knew they wanted to find blasts, granulocytes and so on, and they had these other disease categories. So again we use the flowDensity package, gate the samples one by one by one, gating on the distributions, and then the question is whether we can classify the samples after that.
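The density-based gating idea that flowDensity-style tools rely on can be illustrated in one dimension: estimate the marker's density and put the threshold at the valley between the two peaks. This is a toy sketch on simulated data, not the flowDensity algorithm itself.

```python
# One-dimensional density-based gate (toy illustration on simulated data).
import numpy as np
from scipy.signal import find_peaks
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)
# Hypothetical fluorescence values: a large negative population and a smaller positive one.
marker = np.concatenate([rng.normal(1.0, 0.3, 5000), rng.normal(3.0, 0.4, 2000)])

grid = np.linspace(marker.min(), marker.max(), 512)
density = gaussian_kde(marker)(grid)

peak_idx, _ = find_peaks(density)
lo, hi = sorted(peak_idx[np.argsort(density[peak_idx])[-2:]])   # two tallest peaks
gate = grid[lo + np.argmin(density[lo:hi + 1])]                  # valley between them
print(f"gate at {gate:.2f}; {np.mean(marker > gate):.1%} of events called positive")
```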
So we can find all these populations; can we then reproduce, on a test set, the diagnoses they made? We got some wrong, but what they didn't tell us is that along with a bunch of blood samples they had snuck in a few bone marrow samples as well, and funnily enough bone marrow looks different from blood, so we were partly diagnosing bone marrow versus blood. But we got most of them right.

The last example is with the international mouse phenotyping consortium, and it's there because of sheer scale: they're going to have 20,000 knockout lines, two females and one male per line, knocking out every gene in the mouse genome over the next five years, with two 10-to-12-dimensional datasets for 60,000 mice. That's 120,000 FCS files, which is a lot of data, and they really don't want to gate it by hand. So: they had some humans gate their data, flowDensity gated the same data, and we are within the variation they see by hand. Not only that, when we looked at it, what one human was doing was kind of off from what the other human was doing, based on our analysis, and when they saw how we had automated the process, one of them said: actually, I was kind of doing it wrong.

It's 10:35, not too bad, I'm okay; she took up a bit more time at the beginning with the whole "yes, I'm important, this is what I'm doing" routine. I'm just the spokesmodel for lots of different people. All of this started with Robert Gentleman, who basically wrote R, and I was so lucky to have him involved from the beginning, on my first grant, to build the infrastructure within Bioconductor in R. I didn't talk about Polka today; that's coming tomorrow, and I'll show you some of that data. The HIV work was done with Mario Roederer, the lymphoma work with Andrew Weng and colleagues, the heart disease work with a group at Harvard, and funding came from lots of different people.