Now I think the next part is going to be a lot more interesting, and it's not going to be as basic, so we'll be able to move on quickly. Okay, so we're going to talk about exploratory data analysis, EDA for short. So what is EDA? The basic idea of EDA is to use mainly graphical ways to look at data. Some of these ways are box plots, histograms, and scatter plots, but there are also ways to find better views of the data using transformations, Q-Q plots, and so forth. We'll see some applications to microarray and flow cytometry data. A lot of people think about statistics this way: you're going to do a test, you're going to get a p-value, and that's it. The p-value is less than 0.05, you're done, you got a significant result, it's great. But before you do that, you need to make sure the model is correct; you need to make sure you're looking at the right test. There are a lot of assumptions you need to check, and the way to do that is exploratory data analysis. So in my opinion, exploratory data analysis is probably one of the most important aspects of statistics. The term was coined by John Tukey, who was a very famous statistician. As I said, the techniques are mostly graphical: plotting the raw data, histograms, scatter plots, et cetera. But they can also be simple summary statistics, like computing the mean, the median, the standard deviation, or doing box plots. It's also about positioning the plot so as to maximize your natural pattern-recognition abilities, really trying to visualize the data in the best way; when we look at principal component analysis, that is trying to do exactly that. And you know the saying: a clear picture is worth a thousand words. So it's very important that you know how to display your data in the best way possible. So here are a few tips, and in fact,
this is something I always tell my students when I teach these kinds of courses. And believe me, you see all kinds of crazy plots; somehow people try to make them very fancy, with lots of colors, lots of symbols, but in the end you can't see anything on them. It's very easy to generate plots that are useless, where you can't see anything. So try not to show too much information on the same graph: avoid using too many colors, too many patterns, too many symbols, and so forth. I think it's clear that you should try to stay away from Excel. It's not a statistics package, and it's not a graphics package either. With Excel you can make very ugly plots, that's for sure. In fact, I get very annoyed when I go to talks and I see people showing these plots, and you can tell right away they were made in Excel: there's the gray background, there are the big bars, and it looks so ugly. I don't understand why people would use Excel to make graphics for data analysis. If you want to make some other picture with Excel, go for it, I don't care, but when you're doing data analysis, try to avoid it. Okay, here are a few examples of bad plots, most likely made in Excel, of course. So why is this one bad? Well, the quality is bad, of course, but that's because the plot comes from someone else. Excel is very good at making 3D plots that don't need 3D. Have you ever noticed that when you do a histogram or whatever, the default is going to be some fancy 3D? But it's just a 2D plot, so there's no reason to have that extra dimension.
Okay, it's only a histogram; the plot should be 2D. Same again here: you've got some fancy 3D thing that you don't really need. Well, here you do have a third dimension if you want, but you just have four different categories. What you could do is a simple 2D plot with four different lines or colors, which would make it a lot easier to compare the lines. Same here. And by the way, all of these plots have been published in journals. Here again is another example where you can't see anything; if you can see the difference between the four different lines here, you're very good. It would be much easier to just plot them in 2D with different line types or colors or something else to distinguish the four lines. What about this one? This is a typical plot that you see in biological journals, and I know many of you do biology or are biologists: a big bar with a small error bar on top. The first thing is that we don't know that the errors are symmetric, so if you just show me half of the error bar, how do I know it's the same thing on the other side? Then, you don't actually need the bar: all you need is a single number, right? You're just looking at the height. Maybe you like the bar because it shows you the height and makes it easier to compare, whatever, but you don't actually need it. What's even more striking here is that there are only three data points; these are triplicate samples. So they should never have computed these standard errors and error bars, because what they could have shown are the three numbers for each of these bars. You have only three numbers; it would be a lot better to show the three numbers directly. So this was definitely not a good way to summarize the data. And if you want more plots like this, these plots were compiled by Karl Broman.
He's a very nice biostatistician at the University of Wisconsin-Madison who has done a lot of work on QTL analysis and things like that, and he teaches a lot of courses like this one. He has a top-10 list of the worst graphs that have been published in papers, so you can actually go back to the original papers to see what they were about and where they were published. I really encourage you to do that; it's quite interesting, and maybe it will help you avoid making these kinds of graphs in the future. Okay. So that was just to summarize that exploratory data analysis and graphical representation are very important. You should be very careful when you graph something, because a graph really shows your data directly, and if you've got a bad graph, no one is going to understand what's going on. Okay, so let's get to the basics of probability and statistics. Of course, we're going to talk about probability distributions. You probably already know, or you've heard before, maybe in a course you took 20 years ago, that probability distributions can be either discrete or continuous. Some examples of probability distributions are the uniform, the Bernoulli, the normal, et cetera. Typically a distribution is defined by a density function. When it's discrete, you will typically use a p for the probability mass function, and when it's continuous you will use an f. To keep things simple, often I will just say density function, whether it's discrete or continuous, and use an f. So here's an example of a discrete probability distribution: the Bernoulli distribution. This is what you get if you flip a coin; a tail is a zero and a head is a one. Let's say the probability of a head is 0.1. Then this will be your probability mass function: the probability of having a head, a one, is just 0.1.
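The actual commands aren't shown in the transcript, so here is a sketch of what the Bernoulli example might look like in R; the 0.1 head probability is from the example, and `dbinom` with `size = 1` is the standard way to get the Bernoulli mass function:

```r
# Bernoulli(p = 0.1) probability mass function:
# dbinom with size = 1 reduces the binomial to a single coin flip.
p <- dbinom(0:1, size = 1, prob = 0.1)
p   # 0.9 for a tail (0), 0.1 for a head (1)

# Plot the two point masses as bars:
barplot(p, names.arg = c("0 (tail)", "1 (head)"), ylab = "probability")
```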
It's just a point mass of 0.1 at one, and here at zero it will be 0.9, one minus 0.1, because the probabilities need to add up to one. We can do that example in R. R is always great because you can generate random numbers, and all the common probability distributions are there to work with. So these commands will generate some values: we'll compute the density of the Bernoulli distribution at zero and one, and then we can plot that. Here we go. So we get a bar of 0.1 at one and 0.9 at zero: the probability of having a one, a head, is 0.1, and the probability of having a tail is 0.9. You can generate samples from a Bernoulli with probability of success equal to 0.1, so let's try to do that in R. Here I should explain something first: there's no such thing as truly random numbers. When you generate numbers on a computer, it's never random, right? How can it be random? The computer needs to use something to generate the numbers. As the saying goes, there's nothing less random than a random number. By that I mean that in fact the numbers are not random at all; it's a deterministic sequence that the computer uses, but they look very random. If you looked at them, you would have no idea that they're not random. So random numbers on a computer are not really random, and you can actually fix the way the machine will generate them. The way you do that is by what we call setting the seed. A random number generator starts with a seed: you give it the seed, and then it will generate the sequence of random numbers, and they will pass any test; they will look very random.
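To illustrate the idea of a deterministic sequence that merely looks random, here's a toy generator, sketched with the classic Lehmer "minimal standard" constants. This is only an illustration of the principle; it is not what R actually uses (R's default generator is the Mersenne Twister):

```r
# A toy linear congruential generator: each number is computed
# deterministically from the previous one, yet the output looks random.
lcg <- function(n, seed) {
  a <- 16807          # Lehmer "minimal standard" multiplier
  m <- 2^31 - 1
  x <- numeric(n)
  state <- seed
  for (i in 1:n) {
    state <- (a * state) %% m   # fully deterministic update
    x[i] <- state / m           # rescale to (0, 1)
  }
  x
}

lcg(5, seed = 42)   # same seed, same "random" numbers, every time
```

Run it twice with the same seed and you get the identical sequence; change the seed and you get a different one.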
So, as far as you're concerned, they will be just like a sequence of real random numbers, but the computer uses a deterministic sequence to generate them. In order to have the same results everywhere, so that we can compare what we get, we're going to fix that seed; that way we make sure that all of our random numbers are the same. So here we fix the seed, and we're going to generate a hundred binomial random numbers. It's like flipping a coin a hundred times, where the probability of getting a head is 0.1. Yes? No, the way that the computer generates random numbers is actually a sequence: it will generate the first number, and then based on that it will generate the second, and then the third, and so forth. So it's really just a sequence, but the computer does it using an algorithm that's very clever, so that when you look at the numbers, they look very random; you would have no idea that it was actually a sequence that generated them. What you need is a starting point: where do you start? That's why you need to specify what we call a seed. The seed will help you generate the first number, and then you can go along in your sequence. So if we all start at the same place, we have the same sequence.
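A sketch of the seed-setting just described; the seed value 123 is arbitrary (the transcript doesn't say which seed was used):

```r
# Fixing the seed makes the "random" numbers reproducible:
set.seed(123)                             # any integer works; 123 is arbitrary
x1 <- rbinom(100, size = 1, prob = 0.1)   # 100 coin flips, P(head) = 0.1

set.seed(123)                             # same seed again...
x2 <- rbinom(100, size = 1, prob = 0.1)

identical(x1, x2)   # TRUE: same seed, same sequence
mean(x1)            # proportion of heads, should be near 0.1
```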
So of course, we're going to generate the same numbers. If you use a different starting point, you will generate a new sequence of random numbers, but if we all set the same seed, we will all get the same sequence of random numbers. Yes, if you set seed 100, you generate exactly the same numbers, because it's using the exact same algorithm. So here's what we did: we generated a hundred random numbers using the Bernoulli distribution, and then we do a histogram of that. This is what it looks like. Of course, it's more likely to have a zero than a one; in fact the probability of a one is 0.1, so roughly ten out of the hundred should be ones and ninety out of the hundred should be zeros, and this is roughly what we get: here you've got about 10 and here you've got about 90, so the sum is about 100. So it's like tossing a coin a hundred times. Why doesn't the y-axis go to 100? Because by default the plotting function will take the maximum count and set that as the limit for y. Here the maximum is less than a hundred, about 90, so it stops at 90. But if you don't like that, you can specify the limits you want with ylim and say, I want to go from zero to a hundred. So you can change that, but the default is just to look at the range of your data: the x-axis will be the range of x and the y-axis the range of y, because there's no point in looking outside of the range, there's nothing there anyway. That's the main idea, but sometimes you want to change it, and you can customize it. So you generated these hundred observations, like tossing your coin a hundred times, and you count how many zeros and ones you have. You do a histogram of these zeros and ones, and the histogram that you get is kind of
like an estimate of the density, or the probability distribution. Okay, so now we're going to look at a continuous distribution. And once again, please stop me anytime if you think I'm going too fast; if you've got questions, please ask. So we can generate from a continuous distribution. Probably the most famous continuous distribution is the Gaussian distribution, or the normal distribution, or maybe you know it as the bell curve, because that's what it looks like. Here I used a normal distribution with mean zero, so it is centered at zero, and standard deviation one, which means that roughly most of the observations fall between minus two and two. You can actually do that using R. We're going to look at the range from minus four to four. By the way, this is another way to create a vector, so maybe I should point that out. We've seen c() to concatenate numbers; we've seen rep() to replicate a number n times; we've seen 1:4 or 1:12, the colon between two numbers, to generate a sequence of integers going from one to twelve. This is the same idea, but for when you want a sequence that's not necessarily a sequence of integers going from one number to another. This says: go from minus four to four with a step size of 0.1. Let's look at what this looks like. Okay, so it goes from minus four to four, every 0.1. Once again, there are many ways to create objects in R, and often it will depend on the context which one you want to use, because some will be easier than others. Here, this is really what I wanted; this is the most direct way to create that vector. I could of course do it by hand, c(-4, -3.9, ...) and so forth, but it would take me a long time, so I use this because there's a function to do it. Okay, so I've told you: even though we've looked at the R basics, you're still going to learn a lot of commands
and functions as we go along, because it's just impossible for me to name all the things you could do up front. So this is the x, and then I'm going to evaluate the normal density at each value of x, where that normal density has mean zero and standard deviation one. And this is what it looks like: the nice bell curve that you know, with mean zero and standard deviation one. So, the difference between a continuous and a discrete distribution: for a discrete distribution, you look at any value, zero or one, and the value of the function tells you directly the probability of getting zero or one. Maybe let's step back to this one. This is the probability mass function for the discrete distribution: the probability of zero is the value of the function at that point, and the probability of one is the value at that point. For a continuous distribution, it's different, because since it is continuous, you cannot really land exactly on one single value. So typically we look at the probability that the random variable is within a given interval, and to compute that probability you look at the area under the curve. Using a Gaussian distribution, the probability that your variable is between zero and two is the area below that part of the curve, and R can compute that for you very easily; there are functions that do just that. Is this kind of clear? Now we're getting into statistics, and everyone's going to get lost. So here I plot x versus f, fairly standard. I want the x label to be "x" and the y label to be "density", so x is here and density is here. I want the width of the line to be 5; the larger the number, the larger the width of the line. And I want the type to be a line, because you can also have points, or both. So let's play with that. For example, we could use type "p", which means points. What that does is, instead of drawing a line,
it will draw each point, with no line in between. You can have "b", which means both, and it will put the points and the line. And you can play with the width: you can make it thinner, or you can make it bigger. This is "b" for both. So this is what I mean when I say that R is really good software for making high-quality plots: you can customize everything exactly how you want, move things around, use all the symbols and colors, and make great PDF or JPEG graphics that you can then just copy and paste wherever. You can really make high-quality graphics. Okay, so same thing again here: we can generate a sample from a normal distribution. Here I generate a hundred observations from a normal with mean zero and standard deviation one, and this time I don't use dnorm, I use rnorm. The d is for density, which is what I used before to compute the density at a given value, but this time I use rnorm, where the r is for random generation. So: rnorm, a hundred numbers, mean zero, standard deviation one, and then I plot a histogram. Okay, the key point here is that when you've got some data, doing a histogram of your data gives you an idea of the distribution, because a histogram can be used to estimate the density. It's kind of like an estimate of the density, and you can see here that it looks a lot like a bell curve, right?
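The grid, density, and sampling steps just walked through might look like this in R; the seed is arbitrary, and the red overlay of the true density is an extra touch, not something from the slides:

```r
# The standard normal density on a grid, then a histogram of a sample.
x <- seq(-4, 4, by = 0.1)          # grid from -4 to 4 in steps of 0.1
f <- dnorm(x, mean = 0, sd = 1)    # density evaluated at each grid point
plot(x, f, type = "l", lwd = 5, xlab = "x", ylab = "density")

set.seed(123)                      # arbitrary seed, for reproducibility
y <- rnorm(100, mean = 0, sd = 1)  # 100 draws from N(0, 1)
hist(y, freq = FALSE)              # freq = FALSE puts the histogram on the density scale
lines(x, f, col = "red")           # overlay the true bell curve
```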
And of course, the more data points I get, the better the estimate is going to become. Actually, we can play with that a little bit. So here we generate a hundred random numbers from a normal distribution with mean zero and standard deviation one, and we do the histogram. Okay. Now let's do the same thing but with a thousand data points. You can see it's getting better and better. If you do it with ten thousand data points, you get a better and better bell curve, just because the more points you have, the better the estimate of the density. Okay, so this is pretty basic stuff, because I wanted to talk a little bit about probability distributions. Yes? No, no, you just assume it's normal; you want to compute the density at that point, or the probability mass function at that point, so you use dnorm to do that. This is just a toy example; it's not that interesting, and we're going to look at real data sets and what we can do there. It's not really something you would do in reality; when you get real data, you would never apply dnorm to it. It's just to show you, with a couple of toy examples where we know the true density, that the histogram and the density look alike. That's the main point of these exercises. Okay. Any other questions? Say that again? Yeah, you can do that. I do have some of these things later on; I can't remember exactly how to do it off the top of my head, but you can do it, and we'll get to such an example. Everything you can think of, you can do; it's just a matter of inputting the right command.
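Earlier I said that for a continuous distribution, probabilities are areas under the density curve, and that R can compute them. The function for the normal is `pnorm`, the cumulative probability P(X ≤ q); a sketch:

```r
# P(0 < X < 2) for X ~ N(0, 1): the area under the bell curve between 0 and 2.
pnorm(2) - pnorm(0)    # about 0.477

# Sanity checks: the median splits the curve in half,
# and most of the mass is within +/- 2 standard deviations.
pnorm(0)               # 0.5
pnorm(2) - pnorm(-2)   # about 0.954
```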
Okay, so we've seen probability distributions. The key point here is that the histogram is a good way to estimate the density function of a random variable, so it's a good way to estimate its distribution. If you're using a t-test, for example, and you want to know whether your data are normally distributed, because that's one of the assumptions of the t-test, then do a histogram: does it look like a bell curve? Yes? Then it's probably okay. If it looks very different, then you should say, well, there's a problem here, I need to do something. But we'll get to that; there are things we can do. Okay, so one thing we can look at is what we call quantiles and Q-Q plots. So what is a quantile? Here is the definition: the p-quantile is the value with the property that there is a probability p of getting a value less than or equal to it. And at this point you're like, well, this is great. So let's look at an example so you can understand it better. We're going to do that in R again; I'm going to graph it. So this is the normal distribution, and this is the 90th percentile, or the 0.9 quantile. What this means is that 90 percent of the area under the whole curve is below that value: there's a 90 percent probability that you will get something below it. So that is the quantile: the value such that the probability of getting something below it is p. For example, the 50 percent quantile is called the median, and in this case, let's look at the plot: what do you think the 50 percent quantile is? Zero, right? Because the curve is symmetric, zero divides it into two halves, and therefore it's 50-50: the chance of getting something below zero is 50 percent.
Okay, because it is symmetric. In fact, when you have a symmetric distribution centered at zero, the median is always zero, and therefore the 50 percent quantile will always be zero. Of course, it's quite difficult to compute a quantile by hand otherwise, but there's a function in R that can do it for you if you tell it the distribution. Here I use qnorm: my normal distribution has mean zero and standard deviation one, I want the 90 percent quantile, and it will give that value to me. This is what I do here: I get that number, I plot f, and then I draw a vertical line at q90. Okay, so this is what this is doing. Now, this is really nice, but as I said, these are just toy examples; in practice you have no idea what the distribution is. So how can you compute a quantile? Well, we may not be able to compute the exact quantile based on the true distribution, but we can compute what we call an empirical quantile, which is based on the data we have observed. And it's fairly easy to do. In this case, the definition of the empirical p-quantile is the value with the property that p percent of the observations are less than or equal to it. How do you do it? You're going to order your data points and then count. Let's say there are a hundred data points in your data set and you want the 80 percent quantile: you count up to 80 and take the 80th number. You know that below that you've got 80 percent of the data points, because there are a hundred. So it's fairly easy to do numerically: you just order your values and take the value that gives you 80 percent of the values below it. Good question: how do we order, or sort, a vector in R?
Yes. Okay, so that's not very difficult: it's probably something called sort. So we're going to do ?sort, and fair enough, you can sort a vector with it. There's an example at the end of the help page: they use a data set which is part of one of the built-in packages, so we can just run that. Here's some data, and you can just sort it, and you can see that it's sorting the data for you. That's very easy, but in fact you don't even have to do it, because there's a quantile function which will do the sorting and the counting of the numbers for you automatically. So here's an example again: we're going to generate something from a normal distribution, and we're going to do quantile(x). By default, quantile(x) will give you what we call the quartiles, that is, the 0 percent quantile, the 25 percent quantile, the 50 percent, the 75 percent, and the 100 percent. That's the default of the function. Remember that functions in R have sensible defaults: if you don't specify the other arguments, it will just come up with an answer for you. However, maybe I don't care about these quantiles; what I want is the 10 percent, the 20 percent, and the 90 percent. So I can just specify that: I specify the probabilities in the quantile function, and it will do that for me. Okay. Yes? So summary is a function that will give you a summary of any object in R, and the summary for a vector will be very similar to quantile. Actually, we can look at summary right away.
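Before moving on to summary, a sketch of the quantile commands just covered: `qnorm` for theoretical quantiles, `sort` and `quantile` for empirical ones (the seed is arbitrary):

```r
# Theoretical quantiles of N(0, 1):
qnorm(0.9)   # the 90% quantile, about 1.28
qnorm(0.5)   # the median of a symmetric distribution centered at 0

# Empirical quantiles: sort and count, or let quantile() do it.
set.seed(123)                             # arbitrary seed
x <- rnorm(100)
head(sort(x))                             # smallest values first
quantile(x)                               # default: the quartiles (0, 25, 50, 75, 100%)
quantile(x, probs = c(0.1, 0.2, 0.9))     # or ask for specific quantiles
summary(x)                                # the quartiles plus the mean
```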
We'll talk more about summary, but if you do summary of an object, which is x here, it will return the minimum, the first quartile (which is the 25th percentile), the median, the mean, the third quartile (which is the 75th percentile), and the maximum. So summary is like a fancy quantile function; the main difference is that it also gives you the mean. But we'll come back to summary. Is it clear what the quantiles are and what this is doing? Is that okay? So we'll get back to quantiles with the Q-Q plot. Often in statistics, you're going to be given a data set and you want to very quickly characterize it, maybe summarize it with some summary statistics like the mean and the median. There are other things you can compute, like the variance and the standard deviation. All of these are available in R; remember, R is also a statistical language, so a lot of statistical functions, like the mean, the median, the variance and so forth, are built in. So let's go and look at that. We've got x, and we can compute the mean of x. Remember, because it's a normal distribution with mean zero that we used to generate the numbers, the empirical mean should be pretty close to the true mean, which is zero, and it is here. The median should also be pretty close to zero, because it is a symmetric distribution. You can compute what's called
the interquartile range, or the IQR; we'll talk about that. The IQR is just a measure of the variability in the data, very similar to the standard deviation. You can also compute the variance and, there again, the standard deviation. So the mean and the median are what we call measures of location: they tell you roughly where the bulk of your distribution is. The IQR, the variance and the standard deviation are measures of the variability in your data. And when you want something very quickly, you want to know roughly what the mean is, the median and so forth, you do summary(x), and that will tell you. The good thing about summary, as mentioned earlier when we talked about R being an object-oriented language, is that some functions know what to do depending on what the argument is, and the summary function is like that. If x is a vector, it will give you the summary for that vector; if it is a matrix, it will give you a summary for each column; and so forth. It knows what to do depending on the type of the argument. That's what I mean here: summary can be used on almost any object, and it's always object-oriented. I would think that if it's a data frame, it will summarize every variable in your data frame, I can't remember exactly, but that's my guess: each column, which makes sense, because you want to summarize each variable, right? Okay, so what is the box plot? No, it's because it's a data frame, so R knows that the columns are the variables;
they are the ones that will have the names. Yes, well, if your variables are the rows, then you should rearrange your data frame so that your variables are the columns. Okay, so let's look at the box plot. We've seen how these summary statistics are easy to compute, but they each give you just one number, and it's nice to have a graphical representation that summarizes all of these numbers at once. The way we do that is the box plot. Here I still generate the same random numbers from the normal distribution, and then I use the boxplot function. So what is the box plot? The box plot is just a graphical representation of some summary statistics. The first thing you see, the thick line in the middle, is the median. This is the 25th percentile, this is the 75th percentile, and the distance between them is what we call the IQR, the interquartile range. Then there are what we call the whiskers, here and here, and these are calculated using 1.5 times the IQR; typically, everything that falls outside of the whiskers is considered an outlier. So it's a nice way to summarize your data: it shows you the location of the data through the median, and the variability. If the box is very long, it means the data are highly variable; if the box is very small, it tells you there's not much variability; and if the whiskers are very long, there might be a lot of outliers. The outliers will typically be shown with symbols outside the whiskers; here there are no outliers. These lines extending from the box are called the whiskers. No, they are calculated using 1.5: you start here at the edge of the box, you go 1.5 times the IQR, and you draw the line there; you do the same thing below the box.
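A sketch of the box plot just described, with its pieces computed by hand; the seed is arbitrary, and `range` is boxplot's name for the 1.5 whisker factor:

```r
# Box plot of a normal sample.
set.seed(123)                  # arbitrary seed
x <- rnorm(100)
boxplot(x)                     # whiskers use the default 1.5 * IQR rule

# The pieces it draws, computed by hand:
median(x)                      # the thick line in the middle
quantile(x, c(0.25, 0.75))     # the box edges
IQR(x)                         # the height of the box

# The `range` argument is that 1.5 factor; range = 0 extends the
# whiskers to the extremes, so nothing is flagged as an outlier.
boxplot(x, range = 0)
```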
You do the same thing below Okay, and you you don't want to use the mean and the max because the goal of these exercises to see if They are potential outliers things that are very far away from the median even taking into consideration the viability of the data That's in between the 90% you don't know depends on the data However, what's the percentage that in that is in the box? What's the percentage that's in the box here 50% right? But corresponding to the whiskers we don't know for example here. It's 100% because there's no outliers There's nothing that's too far away from from the from the media But typically it will be it will be high enough because the goal of the box But is to see if there are any potential outliers one of the course, right? And so typically they will be only a few outliers. I would say usually will be close to 95% Yeah, there might be actually There might be an option in box plots to change up But the real definition of the box plot is 1.5. Yeah, see here. I think you can change that Okay, okay, so now we're getting right to the point So many statistical methods make some assumptions about the distribution of the data a gene, and you know, they're normal So the quanta quanta plot provides a really nice way to visually verify such assumptions So the idea of the QQ plot is you're gonna assume a distribution You're gonna say well, I want my data to be normal or I'm testing if they are normal So therefore, you know the theoretical quanta is in the normal distribution, right? We can compute that you know Then you can also compute the empirical quanta just like we've seen and the idea is that you're going to compare the empirical quanta to the theoretical quanta and if the distribution you assumed is correct Then they should pretty much lined up And we're gonna see that on the example Let's do a very easy example Here so we generate 100 numbers from the normal distribution. 
We do the Q-Q plot; for the normal distribution, the function is called qqnorm. So qqnorm does a Q-Q plot against a normal distribution, and here we can see the theoretical quantiles versus the sample quantiles, the empirical quantiles; it's the same thing. qqline just adds a line to the plot: the line on which all the points would fall if the distributions were exactly the same. So you can see here that the points line up pretty well on the line. Of course, it's not perfect, because we've generated a finite number of data points and therefore there is some variability, so you're not going to get something that's perfectly aligned. But looking at this, I would say, well, it's pretty good, it's well around the line; I have no reason to believe that it's not normal. Again, it's a bit subjective in a way, because it's a graphical representation, but this is the kind of thing you want to do when you do exploratory data analysis. Yes, there are test statistics you can compute for this, but I typically don't really trust them; it's much easier to do it visually. The key point is that what we want to make sure of is that there are no big violations of the assumptions. Some of these tests are fairly robust to slight departures from the assumptions; linear regression, which you're going to see tomorrow, is okay if the data are not perfectly normal, or the variance isn't perfectly constant, or whatever. So what you want to check is that the assumptions are roughly correct, that there's no big violation, and this is why doing it graphically is usually enough. We're going to see that. So here's an example using another distribution, called the t-distribution. How many of you have heard of the t-distribution before? Raise your hand.
I can't see; two, okay, three, four. So the t-distribution is actually very similar to the normal: it's a bell curve, but it has heavier tails than the normal distribution. Do I have a plot that shows the comparison? Here I've generated a hundred data points from a t-distribution, and I'm going to do a histogram of that. In a histogram you can also specify the number of bins that you want; here I wanted more bins, so I asked for a hundred. This t-distribution has two degrees of freedom, and the smaller the degrees of freedom, the heavier the tails will be. So it's a distribution that's fairly close to the Gaussian, but there's an extra parameter called the degrees of freedom: when the degrees of freedom are very small, it has much heavier tails than the normal distribution; when the degrees of freedom are large, it's very close to the normal distribution. I used two here just because I wanted it to be clearly different from the normal distribution, to show you an example. The t-distribution is kind of handy when you've got lots of outliers in your data, for example. Anyway, the key point is to show you the difference. If you look at this histogram, you can see that you've got a few outliers, very large values that are far away from the mean; this is something you would essentially never get with a normal distribution. So let's look at a QQ plot of this. Here it's pretty extreme: even though most of the points are fairly well lined up, some of them are very far away from the line. So if you see something like that, you know there's something wrong: there are a lot of outliers, and the tails are a lot heavier than they should be for a normal distribution. Again, these are toy examples.
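The heavy-tail example can be sketched like this, assuming a t with 2 degrees of freedom as on the slide:

```r
# Heavy tails: t-distribution with 2 degrees of freedom vs. the normal.
set.seed(2)
y <- rt(100, df = 2)   # 100 draws from t(2)
hist(y, breaks = 100)  # many bins make the extreme values visible
qqnorm(y)              # the tails bend away from the line...
qqline(y)              # ...unlike for truly normal data
```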
We're going to move on to some real data sets afterwards. So far we've been comparing to the normal distribution: we've used qqnorm, which is a QQ plot against the normal. Sometimes you don't know what the distribution is, or you want to test another possible distribution, so here we're going to do the same thing but use the qqplot command, where you can compare two samples. I generate one sample from the normal distribution, another one from the normal, and I do a QQ plot of the two. Guess what: because it's normal against normal, they should line up, and you can see that they line up fairly well. We do the same thing with the t and the normal using qqplot, and again you see the same behavior: one of them is very far from the line. Here you could just use qqnorm, because we're testing against the normal, but I wanted to show you that if you've got two samples and want to see whether they come from the same distribution, without assuming anything about the true distribution, you can use this. Exactly, because you don't know the theoretical distribution. So either you have two samples that you want to compare to see if they have the same distribution, or you have one sample and you want to compare it to a distribution that's not built into R: you can generate from any distribution you want and compare the empirical quantiles of the two. Actually, I have a plot where we compare the t and the normal: you can see here the normal distribution in black, and in red the t-distribution.
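The two-sample comparison with qqplot can be sketched as follows (the sample sizes are my choice):

```r
# Compare two samples directly, without assuming a theoretical distribution.
set.seed(3)
a <- rnorm(200)             # a normal sample
b <- rnorm(200)             # another normal sample
qqplot(a, b)                # normal vs. normal: points near a straight line
qqplot(rt(200, df = 2), a)  # t vs. normal: the tails depart from the line
```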
In shape they look very similar, but the main difference is in the tails: you can see that the probability of getting something far away from the mean is greater, and the same on the other side. In fact, if you just looked at these two density curves, you would say the two distributions are not very different; but when you look at the QQ plot above, you say, well, this is very different. There are a lot of outliers here, things far away from the mean that I would not get with a normal distribution. So if your data looked like that, you'd be in trouble when you do linear regression, for example, and when you do a t-test you would have to be a bit careful. We'll talk a little about that and see it on real examples. Okay, so we've seen that you can also compare two samples: they should line up if they are similar, and you'll see something that's not really a straight line if they are different. This is interesting, because it's the main idea behind quantile normalization. I don't know if some of you have heard of quantile normalization for microarrays; this is the idea. You've got two microarray samples coming from the same experiment, exactly the same experimental conditions; you would expect the two distributions to be fairly similar, right?
But in reality there is some technical variation that makes them different, so you want to correct for that. One way to do it: say you have two microarrays that are very similar; in theory their quantiles should more or less line up like this, but they don't. What you can do is force them to line up: you make a plot like that, you see the departure because it isn't exactly a straight line, and then you apply a transformation to make it line up. This is what quantile normalization does: you match the quantiles between the two samples to make them as similar as possible, and that's a way to reduce the technical variation in your data. What time is the break, Francis? Okay. Another way to look at data is scatter plots. As we know, when you do statistics, especially in biomedicine, your data are typically multivariate; it's rare to have just one measurement per patient. For example, you might measure the blood pressure, the height, the weight: it is multivariate. So scatter plots are a good way to look at two variables at a time. Here's an example: this graph shows the graft-versus-host disease (GVHD) flow cytometry data set. Graft-versus-host disease is a disease you can get after a bone marrow transplantation, where the immune system of the donor attacks the host. What we have here is a flow cytometry experiment: we have several cell surface markers, and we measure different cell surface proteins. So it is multivariate, and we've got, not millions, but thousands of cells.
We've got several markers for each cell, and therefore it is definitely multivariate. So we can do a scatter plot of one of the markers, here CD4, against another marker, CD8 beta. A scatter plot like that can actually be used to visually check independence. When you look at a plot like that, do you think the two variables are independent? Of course you don't know much about the experiment, but just looking at it, do you see some kind of independence between the two? No. If they were independent, you would see a big round cloud, where one variable doesn't tell you anything about the other. But what happens in flow cytometry is that you run a sample through the machine and measure each cell for various markers, and you get a mixture of cell populations. So what you see here is clusters of points, each of which might be one cell subpopulation: one here, another one here, another here, and here. There are actually about five to six cell subpopulations. From that it's clear that if you know the value of CD4 is around here, then you're in this population, or this one, or that one. If the value of CD4 is about a hundred, then you're in this population or that population, but it's very unlikely that you're around 200. So you can see that they're not independent: knowing something about one variable tells you something about the other. So, scatter plots versus correlations. With two variables, one way to measure dependence, or correlation (two things that are related but not exactly the same), is to compute the correlation between them. In fact, in the example I showed you before, the correlation is 0.23.
This is pretty low, right? We know that the correlation is always between minus one and one; when it's close to one, there's probably a strong linear relationship between the two variables, and they're definitely not independent. Well, in this example we get only 0.23, which is not very big. But you have to remember that the correlation coefficient is only good for linear dependence, and in our flow cytometry example we definitely don't have a linear relationship between the two. So why don't we go and try that in R. Are you asking about normalization in general? There are tons of graphical displays and things in R; it probably depends a lot on your data and the problem, so there's a lot you could do, and I don't know which language or software you're talking about, but if it's a statistical package, you can certainly do it; in R, no question. Okay, so let's load the data. You probably didn't close R, so you're probably in the right directory anyway. Here I read the table, and then I'm going to subset the variables. Let's first do a summary of the GVHD data. Right away you can see that it's doing the summary per variable.
We've got the forward scatter and the side scatter, which measure basically the shape and the size of the cells (they are just reflections of light off the cells), and then we've got the surface markers CD4, CD8 beta, CD3, and CD8. So right away you get a quick summary of the data. Then I only keep the cells where CD3 is greater than 280, because we know about that threshold: it comes from a negative sample where there's no GVHD, and therefore we're only interested in the cells whose fifth variable, CD3, is greater than 280, i.e. the CD3-positive cells. This comes from the biology: we only want to look at these cells. Then I'm only going to look at the fluorescence markers, forgetting about the forward scatter and side scatter, so I select only columns three to six, and I create a data frame from that. Because of the subsetting I want to make sure it's a data frame, so I use as.data.frame. It typically should be one already, but just to be cautious I prefer to do that; I don't think you would really need to. You could do the selection based on the names, but here it's much quicker to write three to six than to write out all the names. You should use whatever is easier depending on the context; the goal in any language is to write minimal code that does what you want. So we've selected that, and then I created a new data frame, the GVHD CD3-positive data, and I'm only going to look at the first and second variables.
Those are CD4 and CD8 beta, the first two of the fluorescence markers. And here, once again, you can actually see about one, two, three, maybe four, five, perhaps six cell subpopulations, and if you compute the correlation between the two variables, you'll see that it's fairly low. So the scatter plot is typically telling you more about the dependence of two variables than the correlation is. Remember that the coefficient of correlation only tells you about linear dependence: if the correlation is zero, it does not necessarily mean that the two variables are independent; it just means they are not linearly dependent. Here's an example. I generated some variables this way: I've got x and y; this is really a toy example. What do you think the correlation is between x and y? One? Okay. Zero? Try to guess; say a number between minus one and one. That's two votes for one versus two for zero. No one else? How can you tell? Well, the thing is that it's completely symmetric around zero, so it has to be zero. Let's go and look at that in R. But first, notice that when you look at these two variables, you know they are not independent: if you know that x is zero, then y is minus one or one. So they're definitely not independent, yet the correlation is zero. You need to be very careful about correlation versus independence. And that's because independence is very difficult to measure. Typically people look at correlation: if you've got a high correlation, you know the variables are not independent; if you've got a low correlation, that doesn't necessarily mean they are independent, and you should look more closely using a plot. Measuring independence is hard, because it depends on the distributions you have and so forth; dependence and independence are tricky things in statistics. But typically, with exploratory data analysis, you can see a lot.
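A similar toy example (my own, using a parabola rather than the exact example on the slide) makes the point numerically: y is a deterministic function of x, yet the linear correlation is exactly zero by symmetry:

```r
# Perfect dependence, zero linear correlation.
x <- c(-2, -1, 0, 1, 2)
y <- x^2            # y is completely determined by x
cor(x, y)           # exactly 0: the symmetry around zero kills the
                    # linear correlation, even though y depends on x
```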
You can get a feel for whether things are dependent or not. So this was really a toy example: I generate x and y and then plot them. You don't need to close R. What time was the break, Michelle? Oh, okay, so that was ten minutes ago. I'm getting tired, so let's take a break now. Okay, so let's start again, because we're running a bit out of time; we still have many slides to go over. Before we start, are there any questions, things you want me to go over again or that you did not understand? No? Okay, so we'll continue with exploratory data analysis. Another thing you can do in R is trellis graphics: a nice way, when you've got multiple panels, to look at a multivariate data set all at once. I'm not going to go into too many details, but there is a way to do this directly in R. Typically when you do a scatter plot you need to specify the x and the y, right, because you want to plot x versus y. But if you plot a data frame directly, you get a panel of scatter plots, one for each pair of variables: this is the plot of CD8 beta versus CD4, this is CD8 beta versus CD3, CD4 versus CD3, and so forth. So it's like a matrix that contains all of the pairwise plots. Let's try that in R. Okay, so you should note that here
I actually use another option in the plot: pch, which sets the plotting symbol. You could use different symbols; for example, you could use a plus sign if you want. And now you see why I did not use the plus sign to begin with: there are so many data points that it takes a long time to draw all of them. So when you've got a data set with many points, it's a good idea to just use the dot, because it's lighter and won't take long to display on the graphics device; and when you generate PDFs or JPEGs for a paper, the files will be much smaller too. So it's better to do that: it's a lot quicker, and you can quickly see the distribution of the data in all of the possible scatter plots, over all possible pairs of variables. Is it clear what we're seeing here? These types of plots are very nice, because you can quickly scan all the possible combinations of variables; it's a nice way to look at the data very quickly and identify possible artifacts or things that might be interesting. There are many more possibilities, but we don't have time to go over them today. There's a very nice package called the lattice package that lets you generate tons of graphs like that, conditioning on variables and factors and so forth. We don't have time to look at it, but I really encourage you to.
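The scatter-plot matrix described above can be sketched like this; the column names and simulated values are placeholders, not the real GVHD data:

```r
# Plotting a whole data frame gives one scatter plot per pair of columns.
set.seed(6)
df <- data.frame(CD4  = rnorm(2000),  # placeholder marker columns
                 CD8b = rnorm(2000),
                 CD3  = rnorm(2000))
plot(df, pch = ".")  # pch = "." keeps large plots fast and light
```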
It's very interesting what you can do with it; it's very powerful. Okay, let's look at the GVHD data. From now on we're only going to look at the CD3-positive sample, because we know a bit more about it. The first thing we'd like to do is a box plot, just to get an idea of what the variables look like. The good thing, again, is that if you give boxplot a data frame, it knows how to deal with it: it draws a separate box plot for each variable. So here we've got the four fluorescence markers, the four surface protein markers, and this is the distribution of the first, the second, the third, and the fourth. What can you say about this, from what we've seen so far about box plots? Let's look at this one, for example. Yes, it has a lot of outliers. Do you think it's symmetric? Not really, because you can see one whisker is a lot longer, and there are more outliers on that side. So it's not exactly symmetric; it's fairly skewed. If you compare across panels, you can see some variation in the overall intensity, shown by the median, and some variation in the actual variability of the markers across the different panels, and here there are some outliers on this side. However, while the box plot is a nice way to look at data, sometimes it's not the best way for certain kinds of data, and we're going to see why. Yes, typically that is how the whiskers are defined, but I don't know exactly what the default is in R, so it could be different. Okay, that's a good explanation: the whisker stops at the last data point within the range. That makes sense.
Yes. This is how I defined the box plot earlier: I said that typically you define the whiskers as 1.5 times the IQR. But you can see that this one is slightly shorter, and what happens is that the whisker is drawn to the last data point within that range. If there are outliers, the whisker extends up to 1.5 times the IQR; if there are none, it simply stops at the last data point, so it can be slightly shorter than 1.5 times the IQR. Does that make sense? Okay, so another way to look at each variable and get a sense of the distribution is a histogram. If we do a histogram of each of the variables, CD4, CD8 beta, CD3, CD8, you can see that the data are not really symmetric, and they're not unimodal: you can see two modes here, for example, one over here and one over here; here you've got one here and one here, same here; and this one is all squished towards zero. So the histograms show a very different picture than the box plots, and the main reason is that box plots are not very good at displaying data that are not unimodal. In a box plot we summarize the location of the distribution with the median, but that's not very good if you've got two peaks: you would want the location of the first peak and the location of the second peak, if possible. So sometimes the box plot is not the right way to look at the data. If they are not unimodal, if you've got two peaks, it would be better to summarize this one and this one separately. Here it's because it's a very particular data set: we've got a mixture of cell populations, so potentially there is one population over here and one over here. What you would want is a summary of this one and a summary of this one, and a histogram is better in this sense for these kinds of data.
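The point about bimodal data can be illustrated with a simulated mixture (my own toy data, not the GVHD set):

```r
# A mixture of two populations: the box plot shows a single median and
# hides the two modes; the histogram shows both peaks clearly.
set.seed(7)
z <- c(rnorm(500, mean = 0),   # first "subpopulation"
       rnorm(500, mean = 5))   # second "subpopulation"
boxplot(z)                     # one box: the bimodality is invisible
hist(z, breaks = 50)           # two clear peaks
```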
So actually, in the interest of time, we're just going to move on; you can try that on your own, copy and paste, and you will get the exact same figure. Let's look at another data set. This is a gene expression data set from a cell line, comparing cells infected with the HIV virus versus cells that are not infected. It was a time course experiment, but we're only going to look at the time point 24 hours after infection. The good thing about this data set is that we've got 12 positive controls that are actually HIV genes, genes from the virus itself, so they should be extremely differentially expressed: expressed when the cells are infected with the virus, and not expressed in the cells that are not infected. We've got four replicates, two with a dye swap. For those who don't know much about gene expression analysis: this was done with two-color microarrays. You put two samples on the same slide, you label them with two dyes, and then you measure each gene using the color coming from the treatment (HIV-infected) and the control (non-infected); you get two colors, red and green. Sometimes people swap the colors, doing red/green and then green/red for the two samples; this is typically a good way to see if there's any dye effect. So this is one of the samples here; this is actually an image coming from the microarray. You can see all of the roughly 8,000 genes, and the red spots are the positive controls. Here we'll assume that the image analysis has already been done: for each gene
we've got a spot with an intensity for that gene in the corresponding sample. Therefore we've got a data matrix of size about 7,000 by 8: 8 columns for the four replicates times the two conditions, treatment and control. So now that we have the data set, we can read it in the same way as before, do a summary very quickly, and do a box plot; again, the box plot can be done directly on the whole data frame, so you get one box plot for each array. In this case we've got four arrays, four replicates, giving four columns from the HIV treatment and four from the control. This one is the first replicate, paired with this one; this is the second replicate paired with this one; the third paired with this one; and so forth. The control is cells not infected with the virus; that's the control for this experiment. So: we've got cells, we're measuring the expression of about 8,000 genes, we've got two conditions, one infected with the HIV virus and one not infected, and we've got four replicates here and four over here. Okay, so that's the box plot. What do you think of these box plots? (We're going to look at that; actually, you can sort of see the difference already. I could have skipped this slide and no one would have noticed.) So right away, what can you say about these box plots? There are a lot of outliers. Where's the box? We don't even see the box, right? It's squished towards zero over here, with outliers at the high values, so the data are highly skewed. The y-axis? That's the intensity: the expression of all the genes coming from that array.
The intensity is a measure of the expression. One gene, exactly: each dot is one gene. Another way to look at it is a histogram, like we've done before. What can you say about the histogram? It's highly skewed again: everything is squished towards zero, and then there are some high values that we can barely see over here. Basically, this is not a nice way to look at the data. And if I were to ask whether these data come from a normal distribution, what would you say? No, it doesn't look like it; it's highly skewed. But before we fix that, there's another interesting thing, which maybe you'll talk about tomorrow when you discuss regression. One thing people tend to look at with these kinds of data is the mean versus the standard deviation. How would you do that? For each gene we've got four replicates, so I can compute the average expression across the four replicates, and I can also compute the standard deviation of the expression across the four replicates. That's what I show you here: the mean expression versus the standard deviation of the expression. Ignoring the red line for now, what you see from the points is some kind of dependence: when you've got a high mean, you tend to have a higher standard deviation; when you've got a low mean, a lower standard deviation. For those who don't know, the lowess fit is what we call a scatter-plot smoother, or locally weighted scatter-plot smoother: something that tries to estimate the trend you see in the plot. It's like doing a linear regression, where you've got a straight line, but more flexible: it tries to tell you what the trend in the data is.
So this is a good tool for exploratory data analysis when you want to see the trend in a scatter plot. It's what we call a non-parametric estimate: it's not restricted to a straight line, it can be quite flexible, and it tries to estimate the curve we see in the data. I'm not going to go too much into the details; this is just to highlight the fact that there's a dependency between the x and y axes: genes with a higher mean tend to be more variable. Let's try to do that in R, and then I'll tell you why we're doing it. Here is a very interesting function we haven't looked at before, called the apply function. Remember that earlier we talked about for and while loops: we could compute the mean of a vector, or the square of a vector, by operating on the vector directly, or we could loop over all the elements and compute the square of each element one by one. Well, sometimes you have a data matrix and you want to compute, say, the mean of each row, or maybe of each column. One way would be to loop: for each row, compute the mean and store it in some vector. But we've seen that loops are very slow, especially here: in this example we've got about 8,000 genes, so if you loop over the 8,000 genes and each time compute the mean of the four replicates, it's going to be very slow. R is vectorized, so whenever you can do things more efficiently, you should, and there are a couple of functions you can use for that.
There's one called apply. The basic idea is that you apply a function to each row or column of a data matrix, and it's much faster than writing a loop. So here what I say is: here's my data matrix; these are the first four columns of the data, which are just the HIV samples; and I want to compute the mean for each row. If I put a two, it would mean columns. If you want to know more, you can do ?apply. So a one means you compute the function for each row of the data matrix, and a two means you compute it for each column. It's a way to apply a function to each row or column, and it's a lot more efficient than looping and computing the mean each time. There are variants of apply you can look at if you're interested. Okay, let's go into R. We probably need to load the HIV data first, and now we compute that. You can see that even though we're effectively looping over the 8,000 genes, it's still pretty fast, right? That's because R knows you want to apply the same function to every row, so it can do it internally in a more efficient way. Then we can plot the result, see the dependency between the two, and estimate the trend: for the lowess we put x as the mean, y as the SD, estimate the trend, and plot it. And here you can see it's almost linear; it's almost like there's a linear relationship between the two. Not quite linear, but almost. Yeah, exactly.
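A minimal illustration of apply on a small matrix:

```r
# apply(X, MARGIN, FUN): MARGIN = 1 works over rows, MARGIN = 2 over columns.
m <- matrix(1:8, nrow = 2)  # 2 rows x 4 columns, filled column by column
apply(m, 1, mean)           # row means: 4 5
apply(m, 2, mean)           # column means: 1.5 3.5 5.5 7.5
```

For plain row or column means there are also the dedicated built-ins rowMeans and colMeans, which are faster still.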
So lowess is nice because it's more flexible: if the relationship were very different from a line, it would show that on the plot. Although it's similar in spirit to linear regression, here we don't really care about the actual line; all we care about is the trend, and lowess is flexible and fairly robust; it will give you a good answer in many cases. If there were no dependence, you would see something almost flat; we're going to look at that. Why does this matter? If you do a statistical test, or linear regression for example (a different context, but the same idea), you often assume that the errors are independent of the mean and have constant variance. Here that's not the case: there's a dependency of the variability on the mean, so genes with higher expression tend to be more variable. The variance is not constant, and some statistical tests assume constant variance; linear regression, for example, assumes the variance is constant across the range of the data. So here are the observations we've made: the data are highly skewed when we look at the box plots; the histogram was difficult to read; and the standard deviation is not constant, since it increases with the mean. A solution for these kinds of data, when they're highly skewed with very large outlying observations, is to look for a transformation that makes the data more symmetric and the variance more constant. Typically these two things go together: if you make the data more symmetric, with fewer outliers and so forth, you also tend to make the variance more constant. With positive data, you typically look at the log transformation.
So there are a couple of transformations that are important: the log, the square root, and other kinds of power transformations. For gene expression data, the log transformation typically works very well. Let's take the log of our data and see what it looks like. What about negative values? That's a good point. There are variants of the log transformation that take negative values into account: for example, you could take the log of your data plus some constant, something that makes it positive. These variants handle the case where x is negative or very small, but the idea is basically the same. So here we take the log and do a box plot. It looks much better already, right? More symmetric, more comparable across arrays; we can really see what's going on. There are still quite a few outliers, but it looks fairly symmetric. If we look at the histogram, this is much nicer: it looks a lot more like a bell curve, like a normal distribution. When I look at data like this I like it better, because it's easier to see things; it's what we expect; it's more comparable across arrays; it makes more sense. Now I'll do the same thing again: we've taken the log of our expression values, and we apply the mean and the standard deviation to each row of the data matrix and make the same plot, mean versus standard deviation. This is what we get: this is the mean, this is the standard deviation, and this is the lowess fit. You can see that it's not exactly flat, but it's almost flat, which shows that there's almost no remaining dependence between the mean and the standard deviation. Taking the log almost removed that dependence. Does that make sense? Yes, exactly.
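The whole mean-versus-SD exercise can be sketched on simulated skewed data (log-normal draws standing in for the real intensities; everything here is my toy setup):

```r
# Skewed positive "expression" data: 1000 genes x 4 replicates.
set.seed(8)
raw <- matrix(rlnorm(4000, meanlog = 5), ncol = 4)
m1 <- apply(raw, 1, mean)                         # per-gene mean
s1 <- apply(raw, 1, sd)                           # per-gene SD
plot(m1, s1); lines(lowess(m1, s1), col = "red")  # SD grows with the mean
lg <- log2(raw)                                   # log transform
m2 <- apply(lg, 1, mean)
s2 <- apply(lg, 1, sd)
plot(m2, s2); lines(lowess(m2, s2), col = "red")  # trend is nearly flat
```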
So here it's almost the inverse, so maybe it suggests that the transformation was a bit too strong. Maybe something less strong than the log, like the square root, would have been more appropriate to properly remove the dependence. There are variants of transformations designed to completely remove the dependence between the two, but here the key point was to show that a transformation as simple as the log will often do the trick for positive data that are highly skewed. With gene expression data, almost all of the time you will just take the log.

Yes, or at least you try to remove the dependence as much as you can, and make the data more symmetric on the histogram and box plots. Typically you will do that right away: you read in your raw data, you find a transformation that makes it more symmetric, slightly better behaved, and then everything else afterwards is based on the transformed data.

So this is a quick summary. Of course you should take it with a grain of salt, because there are really lots of variants of transformations and methods you can use. But typically the log will do the trick on expression data, and in fact with most positive genomic data: if you've got expression data, or counts coming from high-throughput sequencing, the log will also do the trick. There are a lot of settings where the log is just a very good transformation.

And there are a couple more reasons. We've said it makes the data more symmetric, the large observations are not as influential, and the variance is more constant. But also, what's nice about the log is that it turns multiplication into addition. And this is really nice because it's a lot easier to work with additions than with multiplications, right?
So for example, if you look at expression data and you compare two samples, people often like to look at the fold change: what's the fold change between the two samples? Well, if you take the log, computing the fold change is just computing the difference between two logs, and that's a lot easier. In practice people like to use the log base 2, which is just log2(x) = log(x) / log(2). In terms of the statistics, it doesn't make any difference.

Okay, so let's look at scatter plots now. This is still the same data, and the data are log-transformed because we know this is the right thing to do. We do just a scatter plot: we plot the first column of the data set, which is the first replicate coming from the HIV sample, versus the first replicate coming from the control sample. So these are the two measurements coming from the same chip, the same microarray, one channel from the HIV sample and one from the control: HIV 1 versus control 1.

What can you see on this plot? What are these genes, do you think? Yeah, these are the HIV control genes, the ones that should be extremely differentially expressed; otherwise of course there's some other stuff going on on either side of the plot. Most of the genes seem to be aligned on the y = x line, which says that overall there's not much differential expression, except for a few genes. That's typical for a microarray experiment: most of the genes will be unchanged and some of them will be differentially expressed.

Now we might ask: is this scatter plot really the best way to look at the data?
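As a quick illustration of the multiplication-into-addition point, here is a toy computation (the intensity values are made up) showing that the log2 fold change is just a difference of log2 values:

```python
import numpy as np

# Hypothetical raw intensities for three genes in two samples.
treatment = np.array([800.0, 100.0, 50.0])
control   = np.array([100.0, 100.0, 400.0])

# On the raw scale the fold change is a ratio: 8, 1, 0.125 ...
fold_change = treatment / control

# ... on the log2 scale it is just a difference: +3, 0, -3.
# Note that log2(x) is simply log(x) / log(2).
log2_fc = np.log2(treatment) - np.log2(control)

print(fold_change)
print(log2_fc)
```

A log2 fold change of +1 means "twice as expressed", -1 means "half", which is why base 2 is the convention even though statistically the base doesn't matter.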
Because what we really care about is the difference between the two samples, the difference between treatment and control. So another way to look at gene expression data is to do what we call an MA plot. The MA plot is very similar to a scatter plot, except that instead of showing one sample versus the other, we show M, the difference of the logs (the log ratio), against A, the average of the logs. The average of the two samples on the log scale, A, is a measure of the overall intensity, and the difference of the two samples on the log scale, M, is just the log ratio, the log of the fold change if you want.

So these are the two quantities. M is really the quantity you're interested in, the difference between the two samples, and A is a measure of, if you want, the quality of your measurement: the greater the intensity, the more you trust the measurement. When you've got very low intensity it tends to be more noisy, because at low intensity it's hard to make sure the gene was above the detection level, for example, so it's likely more variable.

Typically people really like these kinds of plots because they show you the genes that are differentially expressed. You can draw the line y = 0; everything that's way above or way below would be the differentially expressed genes, the things that are around the axis y = 0 are the ones that are not differentially expressed, and the ones that fall towards the low-intensity end of A may be the low-quality ones you want to ignore.

Okay, so here we look at the 8,000 genes. What we have is two measurements, one coming from the HIV sample and one coming from the control. M is the difference of the logs: log of treatment minus log of control. A is the average: log of control plus log of treatment, divided by two. So A is a measure of the overall intensity, the overall expression of the gene in the two samples, and M is just
the difference between the two expression values on the log scale: how different that gene is between the two samples. So let's look at this point, for example. This is one of the HIV control genes. It tells you that the difference between the control and the treatment on the log scale is about eight, which means it's highly expressed in the HIV sample, and the overall intensity is about six, which tells you that the average of the two log expressions is about six.

Exactly, so it's high enough that you trust it. I chose that example because, if you look at this other one, you might say, oh, this looks like a differentially expressed point, but when you look at where it stands, towards the low-expression end, you know it's noisy.

Well, the truth is that these are fairly old experiments using cDNA microarrays, I think from around 2000 or so, so the quality is better now. Soon people might not even do microarrays anyway. But at least it shows you that there are better ways to look at gene expression data than just a straight scatter plot. So this is one example of the way it makes sense to try to look at the data in the best way possible, to show the sort of things you're interested in on the plot.

Okay, any questions about this? Do you guys want to do that in R, or do you want to keep going? You've got all the code, so it's up to you if you want to try it or just keep going. And if you've got questions you can always ask me later.

Okay, so here I'm doing the exact same thing for the four replicates. We've got four replicates, and I'm doing the MA plot, which is what I show here, and I'm adding a loess fit to each MA plot just to show the overall trend. So what can you say about these four plots? They're fairly consistent: we've got the differentially expressed genes over here, right? So it's nice, it's reproducible.
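The M and A quantities described above are simple to compute. Here is a sketch in Python with simulated log2 intensities (the gene count, noise level, and spiked-in effect are invented, not the actual HIV data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical log2 intensities for one treatment and one control sample.
log_t = rng.uniform(4, 12, size=1000)
log_c = log_t + rng.normal(0, 0.2, size=1000)  # mostly unchanged genes
log_t[:20] += 3.0  # spike in 20 genes that are truly up in the treatment

# MA-plot coordinates:
M = log_t - log_c            # log ratio: the difference we care about
A = (log_t + log_c) / 2.0    # average log intensity: measurement quality

# Genes far from M = 0 are the differential-expression candidates.
candidates = np.abs(M) > 1.0
print(candidates.sum())
```

Plotting M against A (instead of sample against sample) puts the quantity of interest on the vertical axis, which is exactly the point of the MA plot.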
It shows that you always pick out the genes that are differentially expressed. What about the trend, can you say something about that? Two and two: one and two go together, and then you swap the dyes for three and four, basically.

Okay, so we can see here that there is what we call a dye effect, because of the trend in the loess fit. Here the treatment was done, let's say, with the green dye (I can't remember) and the control with the red, and then you swap that: here you did treatment with red and control with green. What you can see is that one of the dyes tends to be slightly brighter than the other, a higher signal. Here it's the one at the bottom, and here it's the one above.

Swapping the two lets you see that there's some kind of dye effect. If you didn't do a dye swap, you couldn't really say that. Maybe you would see a trend like this without a dye swap, but you wouldn't know whether it's really due to the dyes or whether it's a biological effect or something else, right? By doing the dye swap you can actually observe that there's a difference between the two due to the dye effect.

Okay, so this is something that typically people would try to correct with normalization. Think of the loess fit as some kind of smoother: the pattern it traces shows you the dye effect, right?
Because if you assume that most of the genes should not be differentially expressed, then the fit should be pretty much a straight line around y = 0. But here we observe some kind of trend. So this is what we call loess normalization. Loess normalization says: this should be a straight line, so I'm going to push it, I'm going to shift all the genes with it to make it a straight line, and then I'll say I have normalized my data. A very simple idea. Loess normalization says it should be a straight line because most of the genes should not be differentially expressed and there should be no dye effect, so there's no reason to see a trend, and if there is one, I'm going to correct for it, each sample individually, each replicate individually. Exactly.

No, that's different. RMA is more for Affymetrix-type data, and RMA alone is, well, RMA can do normalization, but it's not really a normalization technique. The idea behind RMA is to summarize the probe set into an expression level, but before you do that, you should normalize your data, either using quantile normalization or loess normalization, and often it will be done using quantile normalization. So RMA contains normalization as part of it, but the main idea behind RMA is not normalization.

Not for the... what? Not really, because you're right, at the end what you want is a single number that represents everything across replicates, right? And that's what we're going to do when we do testing, when we test the hypothesis that a gene is differentially expressed. So here, you know, I knew you were going to ask this question, so I just put it here: how do we find differentially expressed genes?
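The "push the trend back to a straight line" idea can be sketched in a few lines. Here a simple running-mean smoother stands in for a real loess fit (in R you would use an actual loess, e.g. limma's normalizeWithinArrays); the data and the artificial dye-effect trend are simulated:

```python
import numpy as np

def normalize_MA(M, A, window=101):
    """Remove the intensity-dependent trend from M, loess-style.

    A running mean over genes sorted by A stands in for a loess fit:
    estimate the trend of M as a function of A, then subtract it so
    the bulk of the genes sits around M = 0.
    """
    order = np.argsort(A)
    M_sorted = M[order]
    kernel = np.ones(window) / window
    # 'same' convolution gives a crude local average (edge effects aside)
    trend_sorted = np.convolve(M_sorted, kernel, mode="same")
    trend = np.empty_like(M)
    trend[order] = trend_sorted
    return M - trend

# Toy data with an artificial dye effect: M drifts with intensity A.
rng = np.random.default_rng(2)
A = rng.uniform(4, 12, size=2000)
M = 0.3 * (A - 8) + rng.normal(0, 0.2, size=2000)  # trend + noise

M_norm = normalize_MA(M, A)
print(np.corrcoef(M, A)[0, 1], np.corrcoef(M_norm, A)[0, 1])
```

Before normalization M is strongly correlated with A (the dye effect); after subtracting the fitted trend the correlation is close to zero, which is what a flat loess line on the normalized MA plot means.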
Okay, that's going to be the next question. So we've seen a little bit of exploratory data analysis. Of course, there's a lot more you can do. I think EDA is really the subset of statistics that is easiest to understand because, yes, it is statistics, we're computing some summary statistics, we're doing some graphics, box plots, histograms, but there's no heavy theory behind it, right? It's very easy to understand.

So if you want to know more about exploratory data analysis, you can pick up some books, and you can go and play with all of it: there are a lot of tools, a lot of functions and graphics you can use. And in my opinion it's extremely important, because a lot of people sort of rush and do a statistical analysis: this is the p-value, this is what we've done, we've done this fancy statistical analysis. But they have no idea if it's valid, and I wouldn't be surprised if 80% of the papers that get published don't even check these kinds of assumptions; they just do it, they just need a p-value to get it published, and then they don't care. But it's not very difficult to do.

Yes, that's actually a good question. It's actually really annoying, but when you try to publish a paper in a statistics journal or something, it's not uncommon that people ask: are the assumptions correct, can you show us some plots that demonstrate it? We had to do that for a paper very recently, which I thought was kind of a waste of time on my end, because we had checked all of the assumptions and everything, but we had to regenerate a bunch of plots to show that it was correct and that it was valid to do these kinds of analyses on these data.
It was a paper on some ChIP-chip data, but at least we had to do it, and there are a lot of papers where they would not. There's actually a fairly well-known biostatistician in Texas called Keith Baggerly, and a big part of his work is to look at papers that people have published, try to replicate what they've done, and show that it's impossible, that it's just all bogus. He gives very interesting talks where he goes through a bunch of things people have done that turn out not to be reproducible.

About a year ago he gave a talk about, I think, finding gene signatures in cancer. The authors had found a set of genes that could be used to discriminate between cancer subtypes, and when he tried to reproduce everything, he realized it was impossible. It came down to indexing: depending on whether you start your gene numbering at one or at zero, the answer is different, because when you read the data into Excel the numbering starts at one instead of zero, or vice versa. So basically everything was off by one, because one of the indices was different in the analysis, and everything in the paper was completely wrong. He actually wrote a letter to the editor to say it was wrong, but it never went anywhere.

This is just to say that if you were to do very easy exploratory data analysis, not only to check that the assumptions are correct, but also to check that the results you got at the end make sense, you would avoid these kinds of stupid mistakes, because you would see very clearly that what you got does not match what you have in your data. That would save you a lot of time and probably a lot of embarrassment in the future.

And I think R provides a great framework for EDA. It doesn't take very long: you can load up your data, you can do a couple of plots, and in fact,
once you have an experiment or a data set that you use all the time, you can have a little script that does this quality assessment for you. This is some of the things we've done: we've worked a little bit with Robert Gentleman on the analysis of flow cytometry data, and we've done a lot of quality assessment of flow cytometry data using histograms, box plots, and empirical cumulative distribution functions. It does nothing fancy; it just generates a lot of panels, and you can look at the plots and quickly identify if something went wrong.