So, I've updated the code snips file and pushed that to master, and also corrected some other omissions. So it's a good time to pull from the repository to update some of the files we have. This is a small aside before this little section on subsetting reloaded. I've put in code. Danielle came up to me and said, well, you said it's easy to make these column names with paste(). And yeah, it is — if you remember how to do it. And then I spent a little while trying to remember how the hell this was easy to do. It reminds me of that joke about the mathematics professor who writes something on the blackboard and says that it's really trivial to prove that this proposition is true. And he continues his lecture, and then he turns back to his blackboard, about which he's just said it's really trivial to prove that this is true. And he looks at it and he thinks, and he looks at it and he thinks, and he looks at it more and he thinks some more, and this goes on for five minutes, and it goes on for seven minutes, and of course the students are getting very restless. And then he says, ah, yeah, of course, yeah. And turns around and says, well, as I said, it's really trivial to prove that. And then he continues his lecture. So it is really easy to build these column names with paste(). Oh, you do want to know how this is done? Well, Danielle actually showed me that the key to the result is to use the rep() function. So if we define a vector of, say, the cell types — B, NK, Mo, pDC and so on — and a vector of conditions, which is just control and LPS, and we use rep() on the cell types with times = 2, we get B, NK, Mo, pDC, B, NK, Mo, pDC. That's not what we need to paste together. But if we use the parameter each, then we get B, B, NK, NK, Mo, Mo, pDC, pDC. So thanks, Danielle, for that. Now we can simply paste this vector together with the control and the LPS, and that gives us B.ctrl, B.LPS and so on. Of course, we need to define that the separator should be a dot.
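A minimal sketch of that rep()/paste() idiom — the cell-type and condition names here are my stand-ins, not necessarily the table's actual ones:

```r
# Build column names like "B.ctrl", "B.LPS", ... with rep() and paste().
# The cell types and conditions are assumed stand-ins for the real ones.
cells      <- c("B", "NK", "Mo", "pDC")
conditions <- c("ctrl", "LPS")

rep(cells, times = 2)  # "B" "NK" "Mo" "pDC" "B" "NK" "Mo" "pDC": wrong order
rep(cells, each = 2)   # "B" "B" "NK" "NK" "Mo" "Mo" "pDC" "pDC": what we need

# paste() recycles the length-2 conditions vector along the length-8 vector
myNames <- paste(rep(cells, each = 2), conditions, sep = ".")
myNames

# finally, "genes" at the beginning and "cluster" at the end
c("genes", myNames, "cluster")
```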
And we need to add genes at the beginning and cluster at the end. And so this is a function that does something similar to what we wanted before. So this is in the updated code snips file, which appears after you've downloaded it, after you've issued pull from version control. Now, subsetting reloaded. We were doing a few simple tasks here. I think the first one was hopefully really easy: rows 1 to 10 of the first two columns, in reverse order. So the first two columns are just 1:2, and reverse order is just 10:1. Very simple. So 10, 9, 8, 7, 6 — these are the row names. The next one is a little more involved; we need to use order(). Gene names and the expression values for Mo.LPS, for the top 10 expression values. So in order to do that, I start by making a little subset to develop this with. If I want to look at orderings and sortings and figure out whether what I'm doing is right, and I do that on 1,000 rows at once, my head is going to start spinning very quickly. So develop things with synthetic data. Develop things with small subsets until you figure the syntax out. Then apply that to your real problem. That's a smart way to do it. So we have a vector x, which has these values, large and small. We want the top 10 expression values. And to get the top 10, we need to sort them in some way, and then we just pick the leftmost or the rightmost values from our sorted vector. Except we want indices, not values, so we need the order() function. So if we do order(x): 12, 19, 1. Number 12 is minus 11.2. Number 19 is minus 11.1. Number 1 is minus 10.6. So they go from smallest to largest in that order. Now, I could pick the rightmost from that result vector, but that's kind of inconvenient. What's more convenient is if I reverse the ordering and order with decreasing = TRUE. So: 2, 5, 9, 18. Number 2 is minus 6.4. Number 5 is minus 8.5. Number 9 is also minus 8.5, and so on.
So this now orders them largest first, then smaller and smaller. And in order to use that vector to get the values, I just subset x with it. And I wanted the 10 most — but since this is a small example, I'll say the 5 most. And that gives me the values: minus 6.4 is the largest, then minus 8.5, then minus 8.8, minus 8.9. So they get smaller as we go along. All right, now that we've figured out how to do this in principle, we want gene names and expression values for the top 10 expression values. So let's first make our selection. This gives me all, whatever, 1,000 genes here. And then I want — hang on — decreasing is TRUE. From the vector that this produces, I want the first 10, which correspond to the indices of the 10 highest values in that column. And then I can simply pull things out: LPSdat, subset with that ordering, gene names and expression values. So row 205, H2-Eb1, has this expression value. Row 2, Cxcl10, has this expression value, and so on. Now, if I were a researcher in mouse hematology, that would probably already tell me something. Probably. Possibly. Maybe not, because we're actually just looking at raw values. And with raw values — what we're usually looking for is change, i.e. a comparison between conditions. So we're usually not looking for the raw level of expression of, say, H2-Eb1, but for the genes that have the greatest difference. So this brings us to this question here: find all genes for which B cells are stimulated by LPS by more than two log units. The table contains log expression values. B cells stimulated by LPS — that's something like B.LPS minus B.ctrl. Each of these is a vector of numbers, of expression values. If I subtract from the stimulated value the control value, I get the amount of stimulation. So this is one vector, I subtract another vector — this is done element by element — and that gives me a third vector, which is the amount of stimulation. And for those, I want the ones that are greater than two.
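As a self-contained sketch of the order() idiom — the numbers here are made up, not the lecture's actual vector:

```r
# Develop with a small synthetic vector first, then apply to real data.
x <- c(-10.6, -6.4, -9.8, -9.1, -8.5, -10.1, -9.4, -9.9, -8.5, -9.0)

order(x)                            # indices, from smallest to largest value
ord <- order(x, decreasing = TRUE)  # largest value first
ord                                 # the first index points at -6.4

x[ord[1:5]]                         # the five largest values, in decreasing order
```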
Is two a good choice? Well, we're here to explore data. So basically the first step that I would do here is plot a histogram. And this gives me this histogram. So this is the distribution of stimulation values. And as you can see, it kind of looks like a normal distribution, but it has a long tail of stimulation values above two. And that's maybe, I don't know, 2% of the genes or something like that. So to use that as a filter, I'm asking for more than two: I want the values for which that difference is greater than two. If I apply this to the vector, I find that all of the first elements are TRUE. Why is that the case? Well, they've been ordered — the table was sorted initially by these values. At the end, they're all FALSE. OK, so that's just the selection vector. It's a vector. It has 1,341 elements. They're all logical elements. Now I can use these to subset. And I wanted all genes — so LPSdat$genes. Now, before I do that and print it, I usually, cautiously, check how many values this is going to give me. So we have a logical vector. How do we find out how many TRUE values are in that vector? Sum, yes. We've discussed this in the past two days. If we sum over a logical vector, what happens is that sum() requires numeric values, but we have logicals instead. So R tries to cast the logical values into numeric values. And that's possible — there's a definition of how this is done: a FALSE value is cast into the number 0, and a TRUE value is cast into the number 1. So if we sum over that, conveniently, every TRUE value adds 1 to the result, and thus the result is the number of TRUE values in the vector, which is 30. OK, that's fine. We can print that. So LPSdat$genes, subset with the selection, is this list here. And the final one may be less trivial than the others: expression values for all genes whose gene names appear in figure 3b. Hint: use the %in% operator.
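The filtering step just described can be sketched with synthetic data — the data frame below only mimics the B.ctrl / B.LPS layout of the real table, the values are random:

```r
# Synthetic stand-in for the real expression table.
set.seed(42)
toy <- data.frame(genes  = paste0("gene", 1:1000),
                  B.ctrl = rnorm(1000, mean = -9),
                  B.LPS  = rnorm(1000, mean = -9))

stim <- toy$B.LPS - toy$B.ctrl  # element-wise: the amount of stimulation
hist(stim)                      # explore before choosing a cutoff

sel <- stim > 2                 # logical selection vector, one element per gene

# sum() casts FALSE to 0 and TRUE to 1, so this counts the hits
sum(sel)

toy$genes[sel]                  # the genes passing the filter
```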
So those of you who were here yesterday might remember that the %in% operator compares two vectors and is TRUE for every element of the first vector that appears in the second vector, and FALSE for every element of the first vector that does not appear in the second vector. So in this case, again, I like to do this in two steps: basically define my selection first and then apply the selection to the data object. It's a lot easier to troubleshoot if you look at the selection in between — does it have as many elements as we think it should? Do the elements actually make sense? — rather than just dumping out the results. So my selection here is LPSdat$genes %in% the characteristic genes. My selection is a logical vector of 1,341 elements, of course — it has to be as long as LPSdat$genes has rows, so I can use it for subsetting. How many of these are TRUE? Same thing: sum over the selection, which is 45. So 45 of the genes for which I have expression values are also contained in these characteristic genes here. The length of that gene list is actually 46, so there's one in there that does not appear in our gene list. When I looked at what the problem was — by basically turning this around and asking which characteristic genes are in LPSdat$genes, I can figure out which one is missing — I think what I figured out is that between the time they prepared the figure and the time they prepared the table, one of the genes was renamed with an alias, which happens all the time. So it has a different name in the figure than it has in the table. Which once again tells you: don't trust data at all. Anyway, so this is the selection vector. And now if we want the expression values, or the entire table, we can just subset it by these 45 genes. There we go. So these are the expression values for these 45 genes. All right, so commit, code snips, update, close, and push. There we go. Let's do something else.
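A small sketch of that two-step %in% idiom — the gene vectors below are hypothetical stand-ins for LPSdat$genes and the figure-3b list:

```r
# Hypothetical stand-ins for the table's genes and the figure's genes.
myGenes     <- c("Cd69", "Cxcl10", "Ifi47", "Actb", "Mx1")
figureGenes <- c("Cxcl10", "Ifi47", "Mx1", "H2-Eb1")

sel <- myGenes %in% figureGenes  # TRUE for each element of myGenes that
sel                              # appears anywhere in figureGenes
sum(sel)                         # how many overlap

# turn it around: which figure genes are missing from our table?
figureGenes[! (figureGenes %in% myGenes)]
```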
I hope this was informative and illustrative. This is really the kind of day-to-day thing that we need all the time. Subsetting, filtering data — the first step of any kind of exploratory analysis is looking at your data and figuring out what's there. Someone asked: instead of the %in% operator, can you do this using grep()? Well, you would need to grep with a pattern that matches everything that is in the list. And that's possible. So just to illustrate how we would use grep() for that: grep the pattern in LPSdat$genes. And now, what's a pattern that matches Cd69 or Cxcl10 or Ifi47? Well, that's matching this, or matching that, or matching this, or matching that — and until I run out of patience, I can continue doing that. And that gives me the correct regular expression to find, in this case, the rows 1, 2, 3, 4. So it does work with grep(). But the challenge is to find a regular expression which matches all of these examples. Alternatively, you could grep them in turn in a for loop and assign the results. Or you could remember what we've talked about yesterday and the day before, and use the %in% operator. So this is how you would use grep or %in%. Oh — if you've noticed typeInfo in the script and you wanted to use it, just replace that with objectInfo. I've renamed it; I should have found and replaced it globally. It'll be fixed in the next update. So typeInfo and objectInfo are the same thing. Incidentally, you can just do the following: typeInfo <- objectInfo — without the parentheses, otherwise you'd be assigning the output of objectInfo. And if I do that, I now have two functions defined that are exactly the same, so I can call the same function by two different names, right? So this may be helpful. Sorry for that oversight. So we're already exploring data. Let's look a little bit more at exploring data. There are many tools available.
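Coming back to that grep() digression for a moment — a sketch with hypothetical gene names, showing the alternation pattern next to the %in% equivalent:

```r
genes <- c("Cd69", "Cxcl10", "Ifi47", "Mx1", "Actb")

# A regular expression with alternation matches row 1, 2, 3, or 4 ...
grep("Cd69|Cxcl10|Ifi47|Mx1", genes)

# ... but %in% selects the same rows without inventing a pattern
which(genes %in% c("Cd69", "Cxcl10", "Ifi47", "Mx1"))
```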
Don't think of exploratory data analysis as meaning using the most sophisticated algorithms. Exploratory data analysis most of all means becoming familiar with your data and describing it in different ways. So initially, the simplest thing we can do with our data is to look at summary statistics. For example, if we build a little vector of random numbers — 100 normally distributed random numbers with a mean of 0 and a standard deviation of 1 — we can ask: what's the mean of this? Which is a small number. Not exactly 0, because they're random, but kind of sort of around 0. I can ask for the median, which is slightly different, but also a very small number. I can ask for the interquartile range, the variance, the standard deviation, the summary. summary() gives you a number of values: the minimum, the maximum, and the values at the quartiles — at the first quartile, at the median (it also gives you the mean), and at the third quartile. The difference between the first and the third quartile is the interquartile range. So what are quartiles anyway? Quartiles work by ranking a set of values and then taking the first quarter of the ranked values, then the second quarter, and so on: the first 25%, second 25%, third 25%, and the last 25%. So these are the four quartiles of the sorted values. The one value at 50% — if you have 100 values, the one that appears at position 50 when we rank them — is the median. So the median is similar in idea to the mean, but it's more robust. The mean is heavily influenced by outliers; the median is less influenced by outliers. So if I have a random distribution the way I've calculated it, and then I add a value of 10,000 to it, my mean is going to be way, way, way skewed towards 10,000. But my median will barely move at all. Quantiles ask: what's the threshold that has a given fraction of values above or below it?
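The summary statistics just walked through, as a runnable sketch — the exact numbers differ from the lecture's since the data are random:

```r
# 100 normally distributed random numbers, mean 0, standard deviation 1.
set.seed(100)
v <- rnorm(100, mean = 0, sd = 1)

mean(v); median(v)   # both close to, but not exactly, 0
IQR(v); var(v); sd(v)
summary(v)           # min, 1st quartile, median, mean, 3rd quartile, max
quantile(v, 0.90)    # threshold below which 90% of the values lie

# robustness: one huge outlier drags the mean, barely moves the median
w <- c(v, 10000)
mean(w)              # skewed way off towards 10000
median(w)            # barely changed
```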
So for example, I could ask: what's the cutoff for the highest 10% of my values, or the cutoff for my highest 1%? And we do that with the quantile() function. Yeah, I could just plot that here. So if I have a normal distribution, my 90% quantile would lie here. This means 90% of the values are to the left, and 10% of the values are to the right. Mathematically speaking, this corresponds to the integral: the integral of the distribution to the left of this cutoff is 90%, and the integral to the right — the area under the curve — is 10%. The graphical view of distributions of data, of where these things lie, is often done with a box plot. So this is a box plot. You see these quite frequently. What we see here is the mean in a bold line. What's in the box? Excuse me — I think that's the median, with the third and first quartile, is that right? Yeah, yeah, okay. So the median, right. So we have the median in the middle, then we have the inner two quartiles — the cutoffs here are the first and the third quartile — and this whisker is 1.5 times the interquartile range. And anything outside of that is plotted explicitly as individual points. So this quite nicely shows you distributions. But box plots can obscure important structure in your data. For example, if we have a bimodal distribution made of two normal distributions with different means — one with a mean of minus two and one with a mean of plus two — the histogram for that would be bimodal, centered around minus two and then around plus two. But if we simply look at a box plot of this, it doesn't look very remarkable at all. So a normal distribution, or a unimodal distribution, can be quite nicely illustrated with a box plot, but if there's any structure in it, that gets obscured. Of course, if we have only a single distribution, it would always be better just to look at the histogram. But if we have a number, i.e.
like 10, that we want to look at side by side, then histograms become cumbersome. So somebody came up with the smart idea: why don't we take these histograms, turn them sideways, and put them side by side like box plots? And that's in principle what the so-called violin plot does. Violin plots are in the ggplot2 package. If you used ggplot2 in the integrated assignment two days ago, you already have it; if you haven't, it's worthwhile to download it. And this is a violin plot of the distribution we just had. So essentially, think of this as a sideways view of the histogram, and you can get many of these side by side, like box plots. Now, if we do a box plot or a violin plot of more than one column, they're placed side by side. So for example, here are box plots for the natural killer cell controls and LPS — these are the log values as box plots. Or if we want to see box plots of all of them, this is our distribution. So this is something we often do as a first step before we do quantitative analysis on the data. We just check: do our data sets have approximately the same range, and where there are outliers, is that something we can explain from the way we've done the analysis? Or is it something that may correspond to a systematic error — maybe one of our RNA-seq experiments went wrong or was contaminated or whatever? Then we would see that one of the box plots is really an outlier. Could you repeat that about the violin plot — I mean, why would you use it? Why would I use it? If I had a bimodal distribution — two distributions in there — and I just put a single box there, I don't see anything about the inner structure of my data. I don't see whether it's unimodal or bimodal or whether it has more structure. The violin plot is basically a histogram of the thing turned on its side, and it shows me what the actual structure of my data is. You can see it is the histogram itself, right?
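The bimodal example above, as a sketch — the histogram shows the two peaks, the box plot hides them, and a violin plot recovers the structure (the ggplot2 call is guarded, in case the package is not installed):

```r
set.seed(11)
bimodal <- c(rnorm(200, mean = -2), rnorm(200, mean = 2))

hist(bimodal, breaks = 30)  # two clear peaks, around -2 and +2
boxplot(bimodal)            # looks unremarkable: the structure is hidden

# a violin plot is essentially that histogram smoothed and turned on its
# side (guarded, in case ggplot2 is not installed)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  print(ggplot(data.frame(y = bimodal), aes(x = "", y = y)) + geom_violin())
}
```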
Right — so think of the box plot as a histogram with only one bar, and the violin plot as a smooth histogram with many bars. So you can see that there's more information in there. Okay, now I have a rather lengthy file in scripts which is called plotting reference. That's something you can work through and refer to at your leisure. It basically summarizes important principles, the plots that we use frequently, and their parameters. So it has sections on types of plots, and colors, and how to add lines, how to add titles and legends, what plot symbols there are, how you draw on plots, and what happens if we do scatter plots of spatial data, i.e. with x, y, z coordinates, and so on. So, just as a standard thing — a standard scatter plot — let's make two random distributions. One is normally distributed; one is an x-cubed, scaled, noise-added version of the first one. If we scatter plot this, it'll give us a cloud of points. There's something called a rug representation. It puts little bristles on your axis, like a rug, that correspond to the actual values, so you can better appreciate the density of your data distributions. So there's a lot of stuff in there that you can just use and study. I might refer back to it from time to time, but I think for now I'll just close it. If we have some downtime while people are working through tasks, do feel welcome to explore this more. So, just as an example of how to plot a histogram and then overlay it with a line plot, we could ask: if we have stimulated cells' expression, what does that look like? Does it look like a normal distribution? For example, I calculate my stimulations — B cells LPS minus B cells controls — and then plot a histogram. I define a color, a light blue. I give it a title. I give it a label on the x axis with xlab: stimulation, the change in the log expression values.
And I set freq to FALSE, so I don't get the actual counts, but counts divided by the total number. And then I can define some x values, from the smallest to the largest value in intervals of 0.1, and plot a curve along that line. So along these x values I can get the values of the normal distribution with dnorm() — the density of the normal distribution at each point. That now is a curve that traces the normal distribution itself, and with the lines() command, I can overlay it. So with this I can immediately see: a normal distribution with the same standard deviation as the points that I have here is not as steep as what I see in the histogram. Basically that says this is not normally distributed data; we have outliers that go outside of the normal distribution much further than we would expect if this were a normal distribution. And of course we're looking at a biological mechanism here, and basically one glance at this image tells me: hey, there's actually something going on. This is not just random noise. This is significantly — or strongly — different from a random distribution. Another way to look at this, and to ask how similar or different this is from a random distribution, is to use so-called quantile-quantile plots. Quantile-quantile plots match quantiles of one distribution against quantiles of another distribution. The value vectors we have for our distributions don't have to be of the same length: we don't match them element by element but quantile by quantile. So these are so-called QQ plots. qqnorm() is a plot of data — the data we've defined above, my stimulation — against the normal distribution, and this characterizes the deviation from expectation, if our expectation is a normal distribution. And we can plot a qqline().
So here in the middle, this would correspond to a normal distribution, but at both the small end and the large end there is significant deviation from the normal distribution. So again, this shows us that the distribution of our biological data is not normal. There are outliers which go much further out than we would expect if this were just randomly sampled. And as this is a plot, we can add a legend to it — as we always should: in light blue the LPS effect, and in red the normal distribution. So qqnorm() intrinsically compares your data with the normal distribution. If you want to compare data with data, you need the qqplot() function, but it works in a similar way. So my stimulation could be the LPS data of stimulated B cells against control B cells. A baseline could be monocytes against B cells, both controls. And then we compare these two against each other, and now we see that, especially on the side of the outliers, this is much shallower. So there's no real reason to believe that the stimulation comparison behaves fundamentally the same as the baseline comparison — something is going on there beyond sampling. All right, now I'd like to spend a little bit of time on scatter plots. Because after looking at our data and at the distribution of values — with means and standard deviations, with box plots — and looking at how the data are individually distributed, the very next thing we're usually interested in when we explore our data is: are there any correlations? If one thing is high, is there anything that predicts that another thing should also be high, or that it should be low? This is the way we analyze our data and start searching for functional effects. Because if we see that some measured value is high when another measured value is high, the interpretation is usually that this is due to one thing having an effect on another thing — which is what we would be interested in.
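Both idioms from this section — the density overlay and the QQ plots — can be sketched on synthetic heavy-tailed data standing in for the real stimulation values:

```r
set.seed(7)
stim <- c(rnorm(950, sd = 0.5), rnorm(50, sd = 3))  # normal core + outliers

# histogram with a matching normal density overlaid
hist(stim, freq = FALSE, col = "lightblue",
     main = "Stimulation", xlab = "change in log expression")
x <- seq(min(stim), max(stim), by = 0.1)
lines(x, dnorm(x, mean = mean(stim), sd = sd(stim)), col = "red")

# quantile-quantile comparison against the normal distribution
qqnorm(stim)
qqline(stim)

# data against data: qqplot() matches quantile by quantile,
# even if the two vectors have different lengths
qqplot(rnorm(300), stim)
```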
Maybe one causes the other, or both of them are caused by a third effect, a confounding factor. Pulling these influences out of our data is the first step to analyzing our observations in terms of some biological mechanism. And after all, this is what our data analysis is essentially all about. So let's start looking at scatter plots — basically two-dimensional plots of one thing against another thing. I have provided a data set here. This has been floating around for ages; I don't even remember where it originally came from. This is flow cytometric data of graft-versus-host disease. And these are the different channels here: FSC, SSC, CD4 FITC, CD8b PE, CD3. So this one is probably fluorescein, this is phycoerythrin, and this is some other fluorophore. So these are three different cell-surface markers, and one of them is measured with two different fluorophores. So let's extract only the CD3 positives. Column five here — one, two, three, four, five — these are the cells labeled for CD3. This is the histogram of that. So there's a peak here and then it kind of falls off. At some point we need to define: what do we consider CD3 positive? Some cutoff. For example, we could say the cutoff is greater than 280 — something like this, which would probably give us roughly the top quarter of where this is located. We can use that to subset this data frame into a graft-versus-host-disease CD3-positives object. And now this contains some of the rows, with the fluorescence values of CD4, CD8, CD3 and CD8 again. Now we can simply plot the first two columns, i.e. this would plot, for row four, the CD4 value against the CD8 value; for row six, the CD4 against the CD8 value; and so on. So, plotting two-dimensional data: plot() is a very versatile function. plot() recognizes what kind of data you're trying to plot and then plots it in the appropriate form.
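A sketch of that gating step with simulated values — the data frame and its column names are my assumptions mimicking the flow-cytometry table, not the real file:

```r
set.seed(1)
gvhd <- data.frame(FSC      = runif(500, 0, 1000),
                   SSC      = runif(500, 0, 1000),
                   CD4.FITC = runif(500, 0, 700),
                   CD8.PE   = runif(500, 0, 700),
                   CD3.APC  = runif(500, 0, 700))

hist(gvhd$CD3.APC)                         # where should the cutoff go?
gvhdCD3p <- gvhd[gvhd$CD3.APC > 280, 3:5]  # keep only CD3-positive cells

plot(gvhdCD3p[, 1:2])  # two-dimensional data: plot() draws a scatter plot
```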
So if you give it two-dimensional data, by default it will give you a scatter plot. So we get a scatter plot here. Scatter plots like this make for rich exploratory inference. So you see something like that — what do you know about your data? What does this tell the biologist about the data, where we're plotting one dimension on the x axis and another dimension on the y axis, things that we've selected? Looks like there are four cell populations. Right — so this is not a homogeneous population that's centered on something and then just has a random spread of values. We have four cell populations, which correspond fundamentally to the four possibilities of being CD4 positive and CD8 positive: CD4 and CD8 can be both negative, i.e. lingering around 100, or they can be both positive, centered on 500, and so on. Something like that is extremely, extremely useful. To explore scatter plots a little bit more, we should consider — but we'll do that when we come to actual plotting examples — that we can use different plotting symbols and characters. We can adjust the size of the symbols to tell us something about a third dimension of values. We can use colors. For example, if we plot something of our LPS data that has already been clustered into different clusters, we could color one cluster green, one cluster blue and one cluster red, and then see how these defined clusters distribute on a scatter plot, and so on. One thing I just wanted to show here is that overlap can actually obscure structure in the data, so I wanted to demonstrate some alternatives to scatter plots for relatively dense plots. One of them is plotting with the hexbin package. So if we load the hexbin package from CRAN — this is the first variant for working with very dense data.
hexbin creates hexagonal bins that tessellate the plane, then counts for each of these bins how many points fall into it, and then, like a two-dimensional histogram, determines what the density is in these bins. This is a hexbin plot of the same data. So again you see these four populations, and you see that there's a very, very high peak that was obscured in the other plot. Another variant is the smoothScatter() function; this is one of the basic R graphics functions. It smooths out the values as a point cloud and again shows things as a scatter plot. You can vary colors by density in a so-called density plot — here we're overlaying all of them, using slightly larger points and showing the individual data points. And there are also specialized packages, like the prada package in Bioconductor, which finds the peak and draws an ellipse around it, from a two-dimensional normal distribution that would explain, I think, the highest 10% of the values or something like that. So basically this one fits a normal distribution and then segments out, in two dimensions, the highest values here. And then finally, if we don't have too many columns, we can just do scatter plots of all against all very easily. Remember our CD3-positive data set has, I believe, four columns. If we send all four columns to the plot() function, by default it plots all of them against all of them. So this is plotting column one against column two, and so on for the different combinations here. So this is CD4 against CD8. There's kind of an anticorrelation here — one tends to be high when the other one is low — but again in two different populations that basically are regulated differently. CD4 against CD3, or CD8 against CD3, doesn't seem to be well correlated. So these are cell surface antigens that are probably regulated largely independently of each other, and so on.
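The dense-scatter alternatives above, as a sketch on simulated correlated data — the hexbin call is guarded, in case that CRAN package is not installed:

```r
set.seed(2)
x <- rnorm(10000)
y <- x + rnorm(10000, sd = 0.5)

# hexagonal binning: a two-dimensional histogram
# (guarded, in case the CRAN package hexbin is not installed)
if (requireNamespace("hexbin", quietly = TRUE)) {
  plot(hexbin::hexbin(x, y))
}

# smoothed density scatter, built into base R graphics
smoothScatter(x, y)

# all-against-all: a whole data frame sent to plot() gives a pairs plot
plot(data.frame(a = x, b = y, c = rnorm(10000)))
```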
So you can go through a plot like this and analyze the different two-dimensional influences that one variable has on another. We could also put histograms side by side and look at that, if we can get more than one histogram on a plot. Now, as you already see in this plot, the R plot area can be sectioned up into different panels. And this is something we set with plotting parameters — the call is par(). There's a code idiom here that's important to remember. If we set plotting parameters, the parameters are maintained for basically all future plots until you close your plotting window. That means if I split my plotting window in two with a call to par(), then all my future plots will go to the top, then the next one to the bottom, then the next one to the top, and so on — and that's possibly something I don't want. So setting parameters — margins, how large the plot is, a particular aspect ratio, how many plots I want — can be kind of annoying if you don't know how to reset them. And there's an easy way to reset them, because a call that changes any of the plotting parameters has the side effect of changing the parameters, but it also has a return value. And the return value is the original set of parameters. So the side effect is the new parameters, and the return value is the original parameters. Now, if I take the original parameters and store them in a variable — usually called opar, for old parameters — I can then, after all my plotting is done, just reload them with a call to par() with the old parameters. So in this way I store my original state of parameters, and in this way I reload them again after my plotting is done. So here I store them, and now I define that I want two rows and two columns of plotting area to plot into. That's now defined. So my first plot, a histogram, goes into the top left quadrant. The next plot goes into the one next to it.
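As a compact sketch of that store-and-restore idiom (the particular histograms are arbitrary):

```r
# Store the previous parameters (the return value of par()), split the
# device into two rows and two columns, plot, then restore.
opar <- par(mfrow = c(2, 2))  # side effect: new layout; value: old settings

hist(rnorm(100))                          # top left
hist(runif(100))                          # top right
hist(rexp(100))                           # bottom left
hist(rbinom(100, size = 10, prob = 0.5))  # bottom right

par(opar)  # restore: the next plot fills the whole device again
```

The point of the design is that par() bundles its side effect (setting) with its return value (the previous settings), so a single variable is enough to undo any combination of changes.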
The next plot goes below, and the next one below that. So I have these four histograms here now, and then I reload my parameters, so the next plot after that would again be a normal, full-size plot. So this is one way of setting things up so that you can have multiple plots in the same figure and print them out. And this is approximately where this script ends. There's a lot of stuff in there for you to explore at your leisure. As I've already mentioned, the key to profiting from this is to practice, practice, practice. If you go home and don't touch R and RStudio for two weeks, that's probably not a very good idea. Your brain is now primed to work with code and with examples, so going home and then systematically starting to work through some of these scripts would be a nice thing to do. I've also heard yesterday the suggestion that people should try to get together and work on their own projects. The suggestion was made that we might even add that on to a workshop in the future — not this one: after the workshop is done, have a day where people can just chill out together, talk about their projects, and start practicing the things we've learned in the workshop in the context of what they actually need to do at home. That's an excellent idea of how to proceed — highly, highly recommended. There's a lot of material here, though. Just to summarize some of the principles: we've started thinking about exploratory data analysis, and we kind of just slipped into it. The first step of exploratory data analysis is getting any kind of data into R. So we looked at the various forms in which we usually find data on the web — as text data or as spreadsheet data — how to save that, how to import it into an R object, and a little bit about how to make sure that the R object we have actually faithfully represents our data.
The next step was then to start exploring the data by subsetting it: picking out the 10 highest and the 10 lowest values, asking which genes are represented and which genes that we would expect to see are perhaps absent, and so on. Simply looking at what's there. The step after that was to start looking at the data in a quantitative fashion: looking at distributions, plotting histograms, plotting box plots, comparing rows against each other, and then finally getting into scatter plots, which are fundamentally different in that they actually imply a mechanism of influence. So there we go: with that in hand, you know exploratory data analysis. Everything else is just refinements on the same theme. We load in data, and instead of looking at scatter plots we might apply some kind of clustering analysis, and we'll look at how to do that. But really, doing a clustering analysis is very similar to doing a scatter plot and then quantitatively picking out the clusters we've already seen, like in the graft-versus-host-disease data. Not much different. Actually, the challenge of clustering is to do it as well as we can already do it by eye. Much of our data is going to be very high-dimensional. The graft-versus-host-disease data we looked at here is four-dimensional: four different channels were measured. We have about 20 different dimensions of values in our hematology data. And if we look at gene expression profiles that span very, very many different conditions, the number of columns in our data, i.e. the number of dimensions, can go literally into the thousands. Then it becomes very difficult for us to figure out where the interesting concentrations of values are, so dimension reduction becomes very important. There's something called the curse of dimensionality, which haunts us when we do modern data analysis, because that data really is very high-dimensional.
The curse of dimensionality basically says that the volume of a hypercube grows much faster than the volume of a hypersphere as we go into high dimensions. That means that as our data becomes very, very high-dimensional, essentially all of the data is somewhere in the fringes. Nothing is similar to anything else, and it becomes difficult to find things that actually are similar. So we need to come up with intelligent ways to reduce the number of dimensions. But fundamentally what we're doing there is just finding more intelligent ways of making scatter plots when the original data was high-dimensional, and that is dimension reduction, something we'll start talking about tomorrow.

Now, it's 20 minutes to the next coffee break. I would opt to hold the coffee break earlier, simply because we're going to go into the next unit now, where we talk about regression analysis, and rather than start that off and then interrupt it midway, I think it's probably a good idea to take a break now, come back at ten past three, and then continue until our heads actually start spinning with regression analysis. Anne, is that okay? Are all the cookies here? Good, all right. So let's break for coffee now and be back here at ten minutes past three.

I'm just intrigued about the violin plot. Could you give some more examples of when we usually use it? Because I've been seeing it often, and actually now that I understand it...

Well, so where was our violin plot? Oh yeah. Okay if I join? Sure, join. But you know what, I'm actually going to put up an example here. Violin plots are really useful for finding internal structure in the data. And yeah, actually this is a great example here.
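As a brief aside, the hypercube-versus-hypersphere point above can be made concrete in a few lines of R. This is a sketch I'm adding for illustration, not part of the workshop script: it computes the fraction of the cube [-1, 1]^d occupied by the inscribed unit hypersphere, using the standard closed-form volume of a d-ball.

```r
# Fraction of the volume of the cube [-1, 1]^d occupied by the
# inscribed unit hypersphere:
#   V_sphere / V_cube = pi^(d/2) / (gamma(d/2 + 1) * 2^d)
sphere_fraction <- function(d) {
  pi^(d / 2) / (gamma(d / 2 + 1) * 2^d)
}

sapply(c(2, 3, 5, 10, 20), sphere_fraction)
# In 2 dimensions the circle fills about 79% of the square; by
# 10 dimensions the sphere occupies only a tiny fraction of the
# cube -- almost all of the volume is out in the corners, which is
# why "everything ends up in the fringes" in high dimensions.
```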
If this is a histogram, it shows multimodal distributions, not just bimodal. So what we would normally do is compare box plot versus violin plot, and then we basically see what's going on. So let's have a quick look at that. This is our box plot: we have four boxes for these distributions here, but we don't see anything of the internal structure. We might surmise that there may be a problem, because we have outliers here and outliers there, so we already think, eh, maybe we're not seeing all the data.

And what are those bars, the whiskers? The length of the box is the inter-quartile range, because the box runs from the 25% quantile to the 75% quantile. The whiskers then extend at most 1.5 times the inter-quartile range beyond the box: 1.5 times that range below the bottom of the box, and 1.5 times above the top of the box. Anything beyond that is drawn as an individual outlier point.

Lauren, could you help me out for a moment here? I'm trying to do a violin plot of the graft-versus-host-disease data, and I'm unfamiliar with ggplot. Do you know what I need to do there? So what do you want? It's already a data frame? Yep, it is a data frame. And you want this, this, and this to be different violin plots? Exactly. Yeah, so you're going to need to melt that data frame: the reshape2 package, then melt. And let's see the head of that one, just so we know the x and the y.
No idea why it didn't take all of these as measured variables. Yeah, so that's fine. I just need to see the head of it. Okay. So then you want to say x equals variable and y equals value, inside the aesthetics, not as strings. And then plus geom violin. I think that'll give you it. Awesome.

So now we immediately see that our box plot really did obscure things that are kind of important, and this is the actual underlying story. Once again, in principle you can think of a violin plot as a histogram turned sideways and mirrored on both sides, except it's not literally a histogram: it's a modelled, smoothed density. So here there are many values at this level, fewer values around 200, and more values again around 80. And what's on this axis? On this axis it's categorical: these are our CD4, CD8, CD3, and CD8 channels, just like in the bar plot. In the bar plot we had the same categories, but now we have these shapes showing where there's a high density of values and where there's a low density. So where are the outliers now? They just disappear; we don't show the outliers individually. Oh, OK. And you cannot see the median here either, right? Right, there are probably ways to do that in ggplot. Lauren is my resident ggplot artist. When we come back together, I'm going to show this in the code snips, and maybe Lauren will want to add how we put the median and the outliers in there.
We can give it nice colors.
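Putting the exchange above together, a minimal sketch of that violin plot might look like this. The data frame gvhd and its channel names are hypothetical stand-ins for the graft-versus-host-disease data used in the workshop:

```r
library(reshape2)  # for melt()
library(ggplot2)

# Hypothetical stand-in for the graft-versus-host-disease data:
# one column per measured channel.
gvhd <- data.frame(CD4 = rnorm(200, 300, 80),
                   CD8 = rnorm(200, 150, 40),
                   CD3 = rnorm(200, 400, 90),
                   NK  = rnorm(200, 200, 60))

# melt() stacks the columns into a long (variable, value) format,
# which is what ggplot2 expects: one row per observation.
long <- melt(gvhd)

ggplot(long, aes(x = variable, y = value, fill = variable)) +
  geom_violin() +                                 # the violins, with colors
  stat_summary(fun = median, geom = "point")      # put the median back in
```

The fill aesthetic gives each violin its own color, and stat_summary is one way to overlay the median that the plain violin hides.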