What we've done so far with our data was really quite informal. We compared things with each other and said: let's see the top 10, let's see the highest expressed, let's look at differences, and so on. There's a branch of statistics that is concerned with making questions like that more explicit and giving them a somewhat sounder basis, and it is called descriptive statistics. This is often the first thing you do with your data: describing ranges and percentiles of variables. To explain some of these concepts, I'll briefly go through a sample data set. This is the function rnorm, which gives me random values according to a normal distribution. It gives me 100 values with these parameters: a mean of zero and a standard deviation of one. In fact, a mean of zero and a standard deviation of one are the default parameters for the function, so I wouldn't have had to specify them. So I now have a data vector x with 100 values, and I'll simply apply some simple methods of descriptive statistics. Something we've already discussed is the mean. The mean here is a small number; that's what we'd expect. We wouldn't expect exactly zero, because these values are, after all, randomly chosen, but we would expect it to be close to zero because we've defined our normal distribution to have a mean of zero. If we defined it differently, we would get a different number here. The median is a somewhat different value. What is the difference between mean and median? Who's come across that before? That's a great way to phrase it: it's the middle number. So basically you rank all the numbers and take the middle one. Why is that useful? Because of the distribution. Depending on the distribution, the mean and the median could be very different. Exactly: depending on the distribution, mean and median can be very different.
If the distribution is not symmetrical, then the mean will be skewed into the tail, and the median will rest in the region where the most frequent numbers are. And even if the distribution is symmetric, or should be symmetric, the median is rather robust against outliers. So if I have a vector with a huge outlier in it, then the mean of my y is a large number because of that outlier. But the median pretty much ignores it: it's the middle number of this region. Note that the middle number doesn't necessarily actually exist. Here we have four values, so it is placed between the second and the third value, as 2.5. IQR is the interquartile range. Oh my God, what's the interquartile range? Let me put it this way: summary() computes a number of these descriptive statistics: the minimum, the first quartile, the median, the mean, the third quartile, and the maximum. Just as we can define a middle number for the median, we can also define the value below which the lowest 25% of all numbers fall, or above which the highest 25% fall. These are the first and third quartiles. You can see that this describes which numbers are smaller or larger than others. The minimum and the maximum define the range of our numbers. So with this little summary we can compute a number of descriptive statistics all at once. If we rerun our normal sampling, of course, these numbers will be slightly different. Okay, now what could be an interesting characterization of our data in lpsdat? What could we be interested in? We have different cell types, we have expression values. Just in terms of statistics like means, medians, interquartile ranges, summaries, what would you be interested in seeing? Your collaborator just gave you these data about the new experiment. You know semantically what it means. Now you're curious: what does the result look like? What did he get?
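A minimal sketch of the commands described above, reconstructed from the session (the exact numbers will of course differ on every run):

```r
# 100 random values from a standard normal distribution
# (mean = 0 and sd = 1 are the defaults, so rnorm(100) would do)
x <- rnorm(100, mean = 0, sd = 1)

mean(x)     # close to 0, but not exactly 0
median(x)   # the middle-ranked value

# The median is robust against a huge outlier:
y <- c(1, 2, 3, 4, 1000)
mean(y)     # 202 -- dragged up by the outlier
median(y)   # 3   -- ignores it

# With an even number of values the median is interpolated:
median(c(1, 2, 3, 4))   # 2.5

IQR(x)      # interquartile range: 75th minus 25th percentile
summary(x)  # min, 1st quartile, median, mean, 3rd quartile, max
```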
I guess we could be interested in the expression in lpsdat, the average in each cell type. Okay. Right, so we could be interested in the means for each cell type. How do we get that? Any idea? This is a vague problem. Okay, summary. I don't even know what happens if we apply summary() to lpsdat. Let's just see. Cool. Okay, I just did that. Good. All right. So we have all the means and the medians and everything else we would want to know. Incidentally, if we applied mean() to the whole data frame, it wouldn't split it up column-wise like we have it here. Even if we selected only the numerical values, we would simply get one average expression value for all of the data. Okay, so this is nice. Delia, what does it mean, though? Look at these numbers. What do they tell you? What kind of question could you ask? This is where we go from just producing numbers to doing science. So I guess it splits up into the different experimental units, the controls and the stimulated cells, I'm not sure. We could say, well, let's look at particular cells. Okay, let's look at natural killer cells. The mean of the control is minus 12.3, and the mean of the LPS-stimulated sample is also about minus 12.3. What does that tell you? And I would say, well, natural killer cells are not affected by LPS; if I look at these cells, I would say... You're absolutely right, and that's what almost everybody would say. I would say: step back a moment. It says that your experiment has not shown a difference. Oh, yes, yes. And we can probably interpret this exactly the way you said, that natural killer cells are not stimulated by LPS, but we also need to make sure that the experiment actually worked. So always be careful about that. Right, one way to check whether that is the case is to ask: is there any cell type for which we do see a difference? Is there? It's kind of hard to tell from the numbers, right?
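A sketch of the two approaches mentioned here; the data frame name lpsdat comes from the session, but its column layout is assumed:

```r
# summary() of a data frame reports the descriptive statistics
# column by column; non-numeric columns (e.g. gene names) are skipped
summary(lpsdat)

# For one mean per column, colMeans() over the numeric columns
# splits the averages up per cell type
colMeans(lpsdat[ , sapply(lpsdat, is.numeric)])
```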
Oh yeah, this one doesn't count: summary() intelligently skipped the gene-name column, because that wasn't a numerical column. So maybe it's more convenient to look at this in a somewhat different way, and in this case an excellent way to look at the data is a so-called box plot. This is, in principle, what a box plot looks like. The box plot summarizes all of these numbers. It shows you where the median of the data is. It shows you, as a box, the range from the 25th to the 75th percentile, which is called the interquartile range. It shows you whiskers extending 1.5 times the interquartile range above and below the box, basically showing how far the individual values are distributed. And in the way that R does box plots, it will also show you points that identify individual data values lying outside of that range. So, to make this a little more explicit; we'll get back to that. If we simply do a box plot of, say, the natural killer control column, we get a plot that looks like this. This is the median value. This is the interquartile range. Here is 1.5 times the interquartile range, and these are some of the outliers. So we see that this is a very skewed distribution with a long tail here, with most of the values averaging around minus 12-point-something, which means that we essentially have very little to no expression. Now, if we do a box plot of more than one column, for example the macrophage control or... well, actually, let's change this to NK, because we just looked at NK cells, and we said from the means that they're essentially the same. So this is what that looks like. The medians and the means are very similar. But do we still agree that these are the same? Not quite, right? So there's an intriguing difference there: we seem to have a few more values that are highly expressed, which could correspond to genes that are activated.
And we see overall a somewhat larger variation in the values on the same scale. Similarly, we can box plot all of the columns. Here I use this construct of sequences of indices: seq() from 2 to 14 by 2 over my data set gives me all of the control experiments, and seq() over the odd numbers from 3 to 15 by 2 gives me all of the stimulated experiments. So the first is 2, 4, and so on up to 14. In this way I do a box plot of first all the control columns and then all the stimulated columns, and I get this. Now, this is slightly confusing: which ones are the controls and which are the stimulated? At this scale, not all of the column labels are printed; I would need to plot this larger, or detach the plot. But I can see there's a little bit of variation. Not too much. Maybe the largest variation is in these cells here. If you have a box plot like this, you can also overlay lines and annotations. The only thing you need to remember is that the vertical values of course correspond to the actual expression values, while the horizontal coordinates of the boxes are just integers: this box is centered on 1, this one on 2, this one on 3. Our controls go from 1 to 7, so let's put a line between the 7th and the 8th box, simply by plotting with the command abline. abline can plot lines with a particular slope and intercept, or horizontal and vertical lines. I often use abline to draw extra lines into my plots because I want to emphasize thresholds or other aspects. abline with v means: give me a vertical line at x equals the value I pass, here 7.5; abline with h would mean: give me a horizontal line at y equals whatever I tell it. So v = 7.5, and watch the plot pane: this gives me this line here, which now separates my stimulated samples from my non-stimulated controls. So I think there's a tendency that, overall, there's a little bit of increased expression upon stimulation. Would we expect that on average?
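The column layout described above (controls in the even columns 2 to 14, stimulated samples in the odd columns 3 to 15) can be sketched like this:

```r
# Even columns hold the controls, odd columns the LPS-stimulated samples
ctrl <- seq(2, 14, by = 2)
stim <- seq(3, 15, by = 2)

# Controls first, then stimulated
boxplot(lpsdat[ , c(ctrl, stim)])

# Box positions are simply the integers 1, 2, 3, ...; a vertical
# line at 7.5 separates the 7 controls from the 7 stimulated samples
abline(v = 7.5)
```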
Well, these are immune cells, so we would expect that on average they actually do something if we stimulate an immune response. So this is consistent with our expectations. But the change is not dramatic; as we would also expect, it seems to be very specific to individual genes. Now I'm actually curious: what is this? The second box; these are macrophages. So macrophages seem to be the cell type most induced on average in this experiment. Is that what you'd expect? Yeah, for sure; macrophages respond most strongly to LPS. Yeah, that's an important kind of question. When you do an analysis, give it a sanity check, right? You see macrophages doing something; does that correspond to textbook knowledge? If not, it either means you did something wrong, or you have a chance to rewrite the textbooks. Unfortunately, it's usually the former, but you'd like to know. Okay, so these are box plots. Box plots aren't the be-all and end-all, because box plots can obscure important structure in your data. For large-scale data they're a good first view of how your data is distributed. However, if you have, for example, a bimodal distribution, and I'll construct one in a particular way, let's see what the box plot looks like. The bimodal distribution that I have here is basically a composite of two different normal distributions. The first one has a mean of minus 2 and a standard deviation of 1, and the second one has a mean of 2 and a standard deviation of 1. First I assign the first sample to my variable x, so x now has 100 numbers. Then I concatenate this x with my second sample and assign the result to the same x. Now x has 200 numbers. If I look at a box plot of x, I get the expected box plot: it tells me my median is around zero, and this kind of looks like a distribution. Of course, this is very misleading, because in fact there are two different distributions beneath it, and if I only look at the box plot, I can't figure that out.
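The bimodal construction just described, as a sketch:

```r
# Composite of two normal distributions: one centered at -2, one at +2
x <- rnorm(100, mean = -2, sd = 1)
x <- c(x, rnorm(100, mean = 2, sd = 1))   # x now has 200 values

# The box plot looks unremarkable -- median near 0, roughly
# symmetric -- and completely hides the bimodality
boxplot(x)
```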
It becomes even clearer if I look at the histogram. Histograms in R are very convenient because they usually provide an excellent visualization of the data without needing additional parameters: the built-in histogram function is quite good at guessing how many bins you want and how to cut them so that they display well. My histogram shows me that this is a bimodal distribution and that I have to be especially careful with my analysis. But there's also an alternative, and it's part of ggplot; basically, this is my only example where I'll be using ggplot. Essentially you have a box plot, but it's shaped like histograms turned on their sides, which is why it's called a violin plot. Well, you'll see it in a moment. So let me install ggplot, which I don't think I currently have installed; this is a fresh default installation. There are a lot of dependencies there. I'm always surprised that it works, but it works out of the box here. Here's the warning message that I discussed: that warning is thrown by require(), which is in the first line here, but by the time the packages get installed and loaded, that warning is outdated and can be ignored. Okay, now I'll define a new capital X, where I convert my vector x into a data frame, because ggplot wants a data frame. I define a plot object and I add the violin-plot geometry to that plot object. Here we go. So this is a violin plot produced with ggplot. Essentially it's a smoothed outline, similar to the histogram, which shows the different densities of the values. So if this is not a unimodal distribution, it becomes immediately apparent; if the shape is bimodal, it becomes immediately apparent, and so on. So you can plot these things in different ways. For most standard purposes I tend to use histograms anyway. Let's look briefly at histograms. You have this plotting-reference file here; one of its sections covers different types of standard plots, bar plots, histograms.
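A sketch of the violin plot with ggplot2; the data frame column name is an assumption:

```r
# install.packages("ggplot2")   # if not yet installed
library(ggplot2)

# ggplot wants a data frame, so wrap the vector first
X <- data.frame(value = x)

# A violin plot: like a box plot, but the sides are density
# outlines, so a bimodal shape shows up immediately
p <- ggplot(X, aes(x = "", y = value))
p + geom_violin()
```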
Here's an example. This is a histogram of 50 values from a normal distribution where I ask for 5 breaks. I could change that to 10 breaks and get a finer distribution, or even more. Here is a nice addition to a histogram: a strip chart, which shows you the actual values that lie beneath the histogram bars. The histogram is a decent summary, but especially if the number of breaks is smallish, it becomes helpful to see the actual values on top of it. Now, if you're very astute, you might have noticed that I asked for 4 breaks here, but what I get is in fact 6 columns. hist() uses whatever I pass as a suggestion, which it tries to follow, but if the suggestion really doesn't seem to make sense, it adjusts it so that it does. I can override that by explicitly specifying the actual break boundaries, giving it a vector of break values. But for most standard purposes I just leave this parameter alone. Now, hist() also has an output: it produces a value which is normally suppressed when I just draw a histogram, but I can assign the result, say to a variable which I call info. That info value gives me the information within the histogram, which I can then access and use later on. So this is a good way to actually get at the data behind your histogram. It gives me the positions of the break points: the first bin is between minus 3 and minus 2, the next between minus 2 and minus 1, and so on. It gives me the counts within these bins: one value here, 7 values in here, 21 values in here, and so on. It gives me the densities for the bins, which are essentially the counts scaled so that the total area is one, and it gives me the coordinates of the midpoints. I can use all of that to augment the display of my histogram. So, for example, we can explicitly set break points in a vector.
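A sketch of capturing the histogram's return value, as described above:

```r
x <- rnorm(50)

# hist() invisibly returns its internal data; assign it to inspect
h <- hist(x, breaks = 5)

h$breaks    # boundaries of the bins
h$counts    # number of values in each bin
h$density   # counts scaled so the total area is one
h$mids      # bin midpoints -- handy for annotating the plot

# breaks = 5 was only a suggestion; to force exact boundaries,
# pass the break points themselves (here spanning the data range)
hist(x, breaks = seq(min(x), max(x), length.out = 7))
```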
I could set them, for example, since this is a normal distribution, at half-sigma intervals in the range from minus 3 to 3, with sigma, the standard deviation, being 1, and re-plot the histogram with exactly these breaks. We can color the bars individually; this is just a color vector where I define individual color values. Incidentally, you'll see color values like these from time to time. Does anybody know what this means? Does anybody know how to read this? It's really convenient to specify colors like this once you understand how it's done. Web pages usually specify colors this way. This is a so-called hexadecimal color code. It always starts with this hash sign, followed by a triplet of two-digit numbers. Now you might say: what numbers? This is 4f; that's not a number, that's a string. Actually, it is a number, a hexadecimal number. In the hexadecimal numbering system we count 0 through 9 and then A, B, C, D, E, F, giving us 16 digits. A two-digit hexadecimal number therefore covers 256 values, from 00 up to ff, the highest. Now, this triplet of numbers corresponds to red, green, and blue. So basically this one says: give me a color that is low on red and on green and high on blue. There's a little bit of red and green, and lots of blue, so this will be a kind of pale blue, which you should expect for the first value here. In this one, the triplets are all the same: the same value for red, green, and blue. Anybody venture a guess what that looks like? Gray. And this one here is high on red and low on green and blue, so it goes into the reddish range. Simply from looking at these numbers, I would suspect that this is a spectrum that goes from bluish values via gray, or even almost white, to reddish values. And you can specify these things explicitly. So that's how to read them.
This, then, simply defines a vector of colors, and now we specify the parameters individually. I specify the input data; I specify the breaks by hand. Remember, all of that was done automatically before, but I can tweak each part individually. I specify that the colors of the individual bars should be my histogram colors. I set the main title of my histogram to be empty: I don't want "Histogram of x", which doesn't tell me anything I don't already know. I want a particular label on the x-axis, and a label on the y-axis, which is counts. Oh, sorry, the x-axis label is not the word "expression": this is an R expression(), which converts the string sigma into a Greek sigma character, because I have defined the breaks in terms of sigma, i.e. standard deviations. Now we've customized many things about this plot, and this is what it looks like: going from blue to red via gray, a Greek lowercase sigma at the bottom here, counts as the y-axis label, and no main title. So it's relatively easy to customize each and every one of these single elements. Of course, if you find yourself using a particular customization again and again and again, you can simply put it into a function. My beautiful-histogram function accepts one parameter, x, and everything else is defined inside the function for you to reuse. This is also one way to make plots consistent across the scripts of your analysis: basically developing a little library of plots that are set up exactly the way you want them. The idea of ggplot goes in a similar direction: by defining and grouping particular aesthetics, you can customize your plots to be consistent, for example using a consistent color spectrum, a consistent line width, and so on. But you don't need to buy into the philosophy of everything else that comes with it; it's easy to write these things yourself. Now, here's one more thing, and that's adding the individual counts to the plot.
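A sketch of the customized histogram; the particular spectrum colors are illustrative, and colorRampPalette() stands in for the hand-written color vector from the session:

```r
x <- rnorm(100)
x <- x[abs(x) < 3]   # keep values inside the break range

# Break points at half-sigma intervals from -3 to 3 (sigma = 1 here);
# that gives 12 bins, so interpolate 12 colors from blue via gray to red
brk   <- seq(-3, 3, by = 0.5)
hcols <- colorRampPalette(c("#2222dd", "#dddddd", "#dd2222"))(length(brk) - 1)

hist(x,
     breaks = brk,
     col    = hcols,
     main   = "",                  # suppress "Histogram of x"
     xlab   = expression(sigma),   # a Greek sigma on the x-axis
     ylab   = "Counts")
```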
So how do we do that? I would like textual labels with all the individual counts in this plot here, and for that I use the text() command. That's really useful. text() adds text to a plot, and you can use it to label points. That's really cool, because often you have outliers or points of interest. For example, if we consider the box plot that we've seen for our LPS-stimulated dataset, we see points that are really high up. Of course we'd be interested in: what are these points? The easiest way to find out is to add textual labels with the gene names, say, to all points that exceed a threshold of 2 sigma, or a threshold of larger than minus 7, or something like that. The parameters of the text() function are the x and y coordinates, the actual labels, and a parameter called adj, which states how far from the given point, in the x and y directions, the label should be offset. That's important because it's easy to pass the x and y coordinates of a particular point, but if I write text right there, I overlay my point. Usually I want my label slightly to the side, and that's why I use adj here. Now, I've shown you the data within the histogram; I think I called it info. That's where I get the individual values. The labels I want are located at the midpoints of the histogram bars on the x-axis, so the horizontal positions of the labels are found in that output: if I call the histogram output h, I use h$mids for the horizontal positions. The vertical positions correspond to the counts, placing each label high or low in the plot, so they come from h$counts. The text itself should also be h$counts: I want the string "1" for this bar, "7" for that one, "10" for this one, and so on. The adjustment is a little to the right and a little lower. Why lower? Oh, okay, I see: this puts the label inside the bar. And the colors should be the histogram colors.
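A sketch of the labeling step; the particular adj values are illustrative:

```r
# Draw a histogram and keep its data for annotation
h <- hist(rnorm(100))

# Annotate each bar with its count
text(x      = h$mids,        # horizontal: the bar midpoints
     y      = h$counts,      # vertical: the bar heights
     labels = h$counts,      # the counts themselves as text
     adj    = c(0.5, 1.2))   # nudged slightly down, inside the bar
```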
So let's see what that looks like. Now I have the individual counts as labels, in the histogram colors, on my individual bars. What you see here is that we can be very flexible with the additional information we add to our plots, and we'll encounter more examples of that. This can be quite important for allowing you to intuitively interpret a plot and extract the information from what you see. Okay, let's look at a few more ways to plot and display things, and let's get back to the normal distribution. Here's one way to plot a normal distribution. You see how I get this very, very fat line here: this is the parameter lwd, the line width, with which I can adjust the width of a line. What I do here is create a sequence of x values at very small intervals, and then compute the normal distribution's density at these x values as the function values. So dnorm is not a random-sampling function like rnorm; it gives me the function values of the normal distribution's density, with mean 0 and standard deviation 1, for each and every one of these x values. If I simply plot x against f, this is what it looks like. But there are different ways to plot things. One is type = "l", for lines. This is type = "p", for plotting points, and I can adjust the way these points are shown. And there is type = "b", which gives me both lines and points in the plot. Now, with lines, this essentially gives me this smooth curve, which in fact only appears smooth: it is segmented between the intervals, and it becomes smoother the closer I space my x values, or the more x values I have. I use different labels here: this is x and this is density, and the type is "l". There are also different line types I can specify: solid line, dashed line, dotted line, dash-dot, and so on. Sometimes this is important for qualitatively distinguishing different relationships within a plot, and there are also different line widths available.
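A sketch of plotting the density curve and the plot types just mentioned:

```r
# Function values (not random samples!) of the standard normal density
x <- seq(-3, 3, by = 0.01)
f <- dnorm(x, mean = 0, sd = 1)

# type = "l" draws a line; lwd sets its width
plot(x, f, type = "l", lwd = 5, xlab = "x", ylab = "Density")

# Other options: type = "p" (points), type = "b" (both);
# lty selects the line type: 1 solid, 2 dashed, 3 dotted, 4 dash-dot
plot(x, f, type = "l", lty = 3, lwd = 1)
```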
The default is a line width of 1, which is somewhere in between; I can get very fat lines and very, very tiny thin lines, which may be useful if I have a large number of potentially overlapping lines that shouldn't obscure each other. So we can adjust line type and line width here. This is line type 3, a dotted line, of the same thing. Now, here's a simple thing: overlaying a histogram with a line. In this case, for example, I can plot a histogram and overlay it with the curve of a distribution that I know, and thus just eyeball whether the histogram matches the shape of my distribution or not. This curve is what I would expect for a normal distribution, the histogram shows the values that I actually got, and as I can see, the observed values are very similar to the expected ones. A somewhat similar type of analysis is found in so-called Q-Q plots, quantile-quantile plots. Quantiles, which we've discussed briefly before, are basically based on the ranks of your data: you ask which data points lie in the lowest 10%, the next 10%, and so on. If I do this for two distributions, I can compare the quantile values against each other and thereby very sensitively find out whether the two distributions are similar or not. I'll get to that in a moment. One example is the comparison that qqnorm provides: a quantile-quantile comparison against a normal distribution. In this example, I take 100 randomly chosen normal values, and the Q-Q plot shows me how the two distributions differ; as I can see, there is no general trend saying that my quantiles are not the same. What I expect is a line that goes through here, and indeed the points lie very close to this expected line. For practical purposes, this plot immediately shows me that my data set does indeed seem to be normally distributed: there is no significant trend in my data that is not well explained by that normal distribution.
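A sketch of the overlay and the normal Q-Q comparison described above:

```r
x <- rnorm(100)

# Histogram on the density scale, so the theoretical curve is comparable
hist(x, freq = FALSE)
xs <- seq(-3, 3, by = 0.01)
lines(xs, dnorm(xs), lwd = 2)   # expected normal density on top

# Quantile-quantile comparison against a normal distribution:
# points close to the reference line mean "consistent with normal"
qqnorm(x)
qqline(x)
```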
Now, if we compare the normal distribution with the t-distribution, for example, the situation is somewhat different. Again, let's make a sequence of x values at small intervals. For the first data set, f1, we compute density values of the normal distribution at these x values; for the second data set, f2, we compute density values of a t-distribution with two degrees of freedom at the same x values. So, is that the same or not? This is f1, my normal distribution, and now I want an overlay that shows me my t-distribution. There are essentially two ways to do overlay plots in R. One is to set a general graphics parameter so that the next plot is drawn into the same window without additional axes; this is somewhat tricky to handle. My preferred solution is simply to use the functions lines() or points(), which plot data values into the graphics window you already have open. So: lines along the x-axis, of my t-distribution values, with a line width of 5 and a color of 2, which is defined as red, gives me the t-distribution drawn in red. So this is the normal distribution, and, adding a legend, this is normal and this is the t-distribution. Eyeballing these two, they look similar, but in practice they are quite different distributions. So what does that look like in a qqnorm plot? Here there's a significant skew. In the center, the shapes of the curves are quite similar, just as we saw in our overlay, and that's what we'd expect. But the quantile-quantile plot picks up that the two curves are different: the t-distribution has much larger tails. That's the characteristic of the t-distribution: it accommodates outliers much more readily and has heavy tails as you go into the higher quantiles. We can use a similar plot to do sample-against-sample comparisons. For that, I simply take two samples and plot one against the other.
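A sketch of the overlay of the two densities:

```r
x  <- seq(-4, 4, by = 0.01)
f1 <- dnorm(x)        # normal density
f2 <- dt(x, df = 2)   # t-distribution density, 2 degrees of freedom

plot(x, f1, type = "l", xlab = "x", ylab = "Density")
lines(x, f2, lwd = 5, col = 2)   # lines() draws into the open window;
                                 # col = 2 is red

legend("topright", legend = c("normal", "t (df = 2)"),
       col = c(1, 2), lwd = c(1, 5))
```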
Now note that I'm not using matched values here, but random values: one sample simply taken from the normal distribution and one from the t-distribution. The matching is done through the inherent ranking in this Q-Q plot, and it gives me this distribution here. So if you want to compare two sets of values and ask whether they follow the same distribution, this Q-Q plot is something you can use. So ask yourself: which columns of my data set could be compared with a Q-Q plot? What would that mean? What would that give me? And produce one example of it. Your task: make one Q-Q plot of one column of the LPS data against another column, and try to interpret the result. Okay, let's look at this result here. What do we see? I assume you probably did something similar to me. This is an example where I take macrophages and compare the controls with the macrophage LPS-stimulated samples. Another way to look at this would be macrophage control versus B-cell control, or B cells stimulated against monocytes stimulated, or whatever; you can slice these data in different ways. This particular slice, comparing control with stimulated for one cell type, shows me what happens to gene expression on average upon stimulation. So how can I interpret this? Would you venture to say what these dots now mean? Why do they look the way they look? This is a quantile-quantile plot, and it's subtly different from just plotting one set of values against the other. Let's quickly do that for comparison; let's just plot them as they are. What do we see here? For example, this point here: what does it mean? It's a scatter plot. On the x-axis, I have the values from my control experiments. On the y-axis, I have the values from my stimulated experiments. And the individual points correspond to the individual genes.
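One way to attempt the task; the column names here are hypothetical placeholders for whatever the macrophage columns are actually called in lpsdat:

```r
# qqplot() matches the two samples by rank, not by row,
# so the columns need not be paired gene by gene
qqplot(lpsdat$mac_ctrl, lpsdat$mac_LPS,
       xlab = "Macrophage control",
       ylab = "Macrophage LPS-stimulated")
```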
So this is one particular gene, which is expressed at a value of approximately minus 7 under control conditions and approximately minus 6 under LPS-stimulated conditions, and so on. The bulk of all measurements lies in here. So this is plotting one column against the other. Now, if nothing happens, where would we expect these points to lie? On the 45-degree diagonal. The 45-degree diagonal is a line with an intercept of 0 and a slope of 1, and I can use abline: intercept 0, slope 1, and for the color I'll give it a kind of sea green: no red, a little bit of green, lots of blue. It's picky because I didn't write it right. Here we go; here's my reference line. Many of these points fall approximately on that line, but you can see that there are also significant outliers. There is a trend for some genes to be more highly expressed upon stimulation, and there are also some that are repressed under stimulation. So this is the general way of comparing one set with another: not looking at the overall distributions or overall differences as we did before, but looking at every single difference in a plot like this. For exploratory analysis, this is a great tool. Now, the Q-Q plot is slightly different. In the Q-Q plot I rank the values, and then I compare ranks, not absolute values. So instead of looking at the absolute numbers, I say: these are my most weakly expressed genes, these are my most highly expressed genes. And once again I can ask: is there a general trend in how weakly expressed and highly expressed genes compare with each other? Now, to interpret this: first of all, what does the plot immediately tell me? It seems to tell me that there is more difference among the highly expressed genes than among the weakly expressed genes. The weak expression values are probably due to noise; I'm probably not really seeing signal there anyway. But among the more reliable values at high expression, there do seem to be differences.
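A sketch of the gene-by-gene scatter plot with the reference diagonal; the column names and the exact hex color are hypothetical:

```r
# Each point is one gene: control on x, stimulated on y
plot(lpsdat$mac_ctrl, lpsdat$mac_LPS,
     xlab = "Control", ylab = "LPS stimulated")

# If stimulation changed nothing, the points would lie on the
# diagonal: intercept 0, slope 1 (no red, some green, lots of blue)
abline(0, 1, col = "#004fef")
```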
Now, to make sense of that, I always need to ask myself: what do I expect? What do I expect to see if nothing happens? Just as before, I can plot an abline and look at that diagonal. If nothing at all happened, I would expect the points to lie here. What I see instead is that many points among the intermediate expression values tend towards being more highly expressed, there are a few significantly different outliers here, and a number of other points here come down from their expression profile. So the points certainly don't all lie on the same line, and there is something going on which seems to indicate that overall I get higher expression, especially at intermediate expression levels. So simple plots like this can already tell us a lot about the data, if we ask these questions in a biological context. Getting back to the biological context is really important. But one of the keys to interpreting any kind of data in exploratory data analysis is also being able to ask: what do I expect to see if nothing happens? What do I expect to see if no hypothesis is true? This is where we can often also use simulated data: we compare against values generated with parameters we already know, and then ask whether our observed parameters are the same or different. Okay, let's look at a slightly different data set: graft-versus-host disease. This is an older data set that was kindly provided, I believe, by Sohrab Shah many, many years ago, or by Raphael Gottardo; I forget where it even came from. We've been pounding on this data set and displaying it for many years in this kind of workshop. The data set is in this file. Make it a habit, before you ever read something into R, to look at it first and see what it is. Here we have a header row, and we have individual numbers. I have an expert on this one; I'm so happy. Adam, what's this? David, what does this even mean?
Flow cytometry is a neat technique, but you're depending upon your investigator knowing what they're doing. So those numbers are, for each parameter, a position on the plot; whether those numbers have any value, whether it's useful data, depends on the user. So what's one row? That would be the six values for one cell. For one cell. So each row here corresponds to a single cell, and we're measuring a number of parameters in a flow cytometer, i.e. an instrument that dilutes cells and flows them through a chamber where we measure fluorescence in different channels. And what are these different channels? Do you recognize any of that? CD4-FITC, CD8-PE, CD3-PerCP. So you have the forward scatter, which represents cell size, the side scatter, which represents granularity, and CD4, which is a T-cell molecule. Help us out. Hang on, this is a T-cell marker, CD4. That's a T-cell marker. Why does the T-cell marker give me a fluorescent signal? Well, they have linked a fluorescent marker to an antibody that can... Exactly. So that's what we see here. We have different antibodies that are labeled with fluorescent small molecules. We incubate our cells with these different antibodies, and then we have different amounts of fluorescence attached to every single cell, depending on the amount of cell-surface marker that's being expressed. So CD4 in this case, CD8, CD3, and CD8 again. We have two antibodies that actually recognize CD8 in here, which is a good idea, because as an internal control it gives us some idea how much variation comes from our experimental conditions versus what we expect from the cell-surface expression itself. Of course, every single cell has the same amount of CD8 surface expression, which I should see in both channels. But the channels, as we'll probably see later in the plots, respond slightly differently because of differences in the affinity of the antibodies and perhaps also in the fluorescence efficiency of the labelling molecules.
So I think this is phycoerythrin, allophycocyanin and fluorescein, or whatever; different small molecules that produce beautiful colors. Okay, so this is our data set. Now, given the structure of the data set, we have to read it in. This is not a comma-separated-value file; the individual values are separated by blanks, and there is one header row. So in this case, apparently, we can just use read.table() with header = TRUE to get the right result, and check what we have. Okay, now we have row names, we have column names, and we can confirm that each of these columns is, in fact, integer. So there wasn't any hidden character that annoyingly converted a column to character or to factor. Everything looks okay. That's our sanity check; that's what we do first, before we even start looking at the data. Everything looks okay and we can start analyzing it. Now, the first task here is to extract only the CD3-positive cells, just to have a little bit of structuring of the data. CD3-positive cells are cells that are high in this value here, the CD3-PerCP channel, which happens to be the fifth column. So let's look at a histogram of that. What does it mean for cells to be CD3 positive? These are the CD3 values. There's no obvious peak and cut-off, but perhaps we can say that something like 280, a fluorescence value in arbitrary units, would count as positive, and everything below that is negative. So we'll take a subset of our data, CD3 positives, by converting into a data frame all the rows with values in the fifth column greater than 280. And we'll only take columns 3 to 6, to confine it a little more. Okay, so how much did we exclude? How many were there in the original, and how many do we have now? What did I find out? Can anybody tell me? Easy way to find out. What am I actually asking here? How many observations, right? So what's the number of observations in my first data set, the original GvHD one?
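The reading and subsetting steps just described might look like this (the cut-off of 280 and column positions come from the narration; the file name and object names are my assumptions):

```r
# Whitespace-separated file with one header row
gvhd <- read.table("GvHD.txt", header = TRUE)
str(gvhd)   # sanity check: all six columns should be integer

# CD3-PerCP is the fifth column; inspect it before choosing a cut-off
hist(gvhd[, 5], breaks = 50, main = "CD3-PerCP", xlab = "fluorescence (a.u.)")

# Keep CD3-positive cells (fluorescence > 280), columns 3 to 6 only
cd3pos <- data.frame(gvhd[gvhd[, 5] > 280, 3:6])
```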
I heard a number here, which is? 9,083. Sneakily, I can just look it up here: 9,083 observations of six variables. And now the other one, how many observations do I have in that? 2,089, it's right here. Okay, so by excluding some of the values I'm looking at a bit less than a quarter of them, which makes my plot less dense and more observable. Okay, so let's plot something. This is a scatter plot. What am I plotting against what here? And what does this scatter plot tell me in the first place? Can you interpret what you see? The first question is, of course: let's look at the labels of the axes. What do the axes tell me? We have CD4-FITC and CD8-PE. I've briefly explained the semantics of the data to you, but you should be able to tell me how you'd even go about this. One way to walk yourself through it is to say: okay, I'm looking at a single point here, let's say this point. So question one: what is this point? What does it represent? A single cell. A single cell for which the fluorescence has been measured. And why is it in this place? Right, it has an intermediate expression of CD4. So in the range of possible CD4 values it's kind of in the middle, and it has a high expression of CD8. Okay, so that's what these points mean in general. Now, if I look at the data overall, it doesn't look like a completely random distribution. I have 2,089 observations; so let's plot 2,089 observations of a random distribution for comparison. What would that look like? So, plot rnorm of 2,089 against rnorm of 2,089. This is what a random distribution looks like. What we just saw is very different. But how? We can see four blotches. So there seem to be four different populations of cells here. There are cells that have low expression of both cell-surface markers. There are cells that have high expression of CD8. There are cells that have high expression of CD4.
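A sketch of the two plots being compared here (the column indices follow the subset taken above and are my assumption):

```r
# Real data: CD4 against CD8 for the CD3-positive cells
plot(cd3pos[, 1], cd3pos[, 2],
     xlab = "CD4-FITC", ylab = "CD8-PE")

# For comparison: the same number of points drawn from pure noise
plot(rnorm(2089), rnorm(2089))   # one homogeneous, structureless cloud
```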
And there are cells that have high expression of both; I would say in this sample that's the majority of cells. And intriguingly, there's really nothing in between. These are pretty much discrete populations. A cell seems to make a decision whether to express one or the other; it's not, you know, iffy in between. It's either expressed or not expressed. Okay. So is this what we expect? Immunologists to the rescue? In the thymus. In the thymus, yes. Right. So I don't know where these came from. So you would say, just from looking at this and knowing that this is graft-versus-host disease, that it looks like a thymus biopsy. Great. I didn't know that. Okay, so once again, even a simple analysis can give you a lot of information in the right biological context. Now, there's a lot of overlap here, and sometimes this overlap can obscure data relationships. For dense plots, there are some alternatives which we can explore. For example, one way to plot very dense data is the hexbin package. Hexbin is a two-dimensional version of a histogram. So if I install this, I get this kind of plot. By binning the cell counts, just like in a histogram, I can get a better sense of where the maxima of these four populations are and how they are distributed. Actually, let me compare this by doing the same thing with the original 9,000 values. So these are the 9,000 values. Now these are dominated in a different way, and it becomes hard to see the underlying structure. So this nice structure of four populations seems to be apparent only for those cells that are also CD3 positive. Yeah? You've changed which columns you've drawn: in the sub-table you drew the columns for CD4 and CD8, but here you've picked the forward and side scatter again. Good point. Okay. So which ones should I be using in the original? I had CD4 and CD8, and that's three and four. Thank you for that. Okay. Well, yeah.
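The hexbin comparison might be sketched like this (hexbin is a CRAN package; the column indices follow the correction just discussed):

```r
install.packages("hexbin")   # if not already installed
library(hexbin)

# CD3-positive subset: CD4 and CD8 are columns 1 and 2 there
plot(hexbin(cd3pos[, 1], cd3pos[, 2]))

# Original data: CD4 and CD8 are columns 3 and 4
plot(hexbin(gvhd[, 3], gvhd[, 4]))
```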
So it seems that this nice relationship is obscured for cells that are not CD3 positive. Once again, does that make sense to you? This is the same relationship, CD4 against CD8, for all cells, not only the CD3-positive ones. And we lose the nice separation into four clusters. Ideally, all the cells in those four clusters should be CD3 positive; if you check, the cells that lack CD3 should be the ones falling outside the clusters. Well, the plot is now dominated, though: there are 7,000 cells in there which have low to intermediate expression of CD3, and these seem to have intermediate expressions of CD4 and CD8. So the CD3-negative ones don't fall apart nicely into these four clusters. The CD3-negative ones, we're not seeing them at all? No, I've put them back in; this plot is for all of the cells. Anyway, I don't know what this means; this is what the data shows us. Exactly. So there seems to be a large population in here that is CD3 negative but has intermediate values of CD4 and CD8, and that dominates the plot. That would make sense. So the 2,000-odd CD3-positive cells are still in that plot, but now, since the dynamic range is scaled differently, their effect is somewhat obscured. Okay, here's a different way to look at this: colors varying by density. And this requires a package which is distributed with Bioconductor. The installation of Bioconductor packages is slightly different from the installation of standard R packages, which are downloaded from CRAN, the Comprehensive R Archive Network. To install a Bioconductor package, I don't say install.packages(); instead, I source a file which I find on the internet at bioconductor.org, called biocLite.R. Sourcing this file, and the take-away message here is that source() accepts a number of different, well, sources for data. It doesn't always have to be a file on your computer.
It can be a file that you have located somewhere on the internet. However, I would be cautious and careful with that. Don't just source arbitrary files that you find somewhere on the internet; this is about as problematic as clicking on a link in an email that you get from a prince in Nigeria. Honestly, I always wonder why we actually consider this to be safe. If bioconductor.org ever gets hacked and something nefarious is put into this installer, all hell is going to break loose. But it hasn't happened so far, so maybe they take very good care of their installation. Once biocLite.R is sourced, the biocLite() function becomes available, and then we can install Bioconductor packages. That then looks very similar to installing something from CRAN, and it gives us a function like smoothScatter() here, which basically gives you a smoothed cloud of what we just saw in the hexbin plots, plus or minus individual outlier points. You can turn those off or keep them, which helps to identify individual outliers. And you can also have colors vary by density; this is with density colors. So, a similar thing. These are essentially three different ways of displaying high-density scatter plots while still showing their underlying internal structure for further study. Now, lunchtime. We'll break for about an hour. If you want to do something in between, you can repeat an analysis that we've just done for the scatter plot: for example, comparing monocyte activation against macrophage activation or similar things, redoing this plot with density coloring as we've done before, and adding an abline to show where you'd expect the points to sit if nothing happened, or where you'd expect them to be if something happens, and so on.
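The install-and-plot sequence described above might be sketched as follows. Note this is the historical biocLite mechanism the lecture refers to; current Bioconductor releases use the BiocManager package instead, and smoothScatter() and densCols() now ship with base R. The package name geneplotter (which historically provided smoothScatter) is my assumption:

```r
# Old-style Bioconductor installation, as described in the lecture
source("http://bioconductor.org/biocLite.R")
biocLite("geneplotter")
library(geneplotter)

# Smoothed density cloud, with the most extreme points drawn individually
smoothScatter(cd3pos[, 1], cd3pos[, 2], nrpoints = 100)

# Ordinary scatter plot, but with point colors varying by local density
plot(cd3pos[, 1], cd3pos[, 2],
     col = densCols(cd3pos[, 1], cd3pos[, 2]), pch = 20)
```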