We wanted to see whether the groupings we saw in our PCA results can be translated, by clustering, into clusters that correspond to the labels of the crabs we've collected. For example, if I use k-means clustering, one parameter for k-means is the number of clusters I want to get. So my classes come from running kmeans() on columns 2 and 3 of my PCA score matrix, asking for four clusters, and these are the class assignments I get back. Now, if my clustering was successful, I would expect one class to fall exclusively into rows 1 to 50, the next class exclusively into rows 51 to 100, and so on. And I can just plot that: what this plots is the value of the clustering vector, i.e. the cluster index, a number between 1 and 4, as a function of the row index.

I do actually see that my first 50 rows have been preferentially clustered into cluster number 2, my next 50 preferentially into cluster number 3, and my next 50 preferentially into cluster 4, with some assigned to cluster 1 as well; so the clustering was not perfect there. And the last 50 were also clustered into cluster number 3. How do I interpret that? Remember, these were blue males, blue females, orange males, and orange females. So my clustering is doing very well at distinguishing males from females, but it does not distinguish well between blue and orange females.

So far I've used only PCA columns 2 and 3. What does it look like if I use columns 1 and 2? That's kind of all over the place; kind of strange. What happens if I use columns 2 to 5 instead of 2 and 3, i.e. all the remaining columns? Now I'm adding information. Does that give me better clustering? What do you think, is that better? Yeah, I would say so. I'm getting worse results for the blue males, but now I'm clearly distinguishing blue from orange females. So apparently the information about the structure separating blue from orange females is represented in PCA column 4, and not in columns 2 and 3. Does it get better if I remove column 5 and use just 2 to 4? Not really: now the first cluster is mixed again, and so on. So in this case the best clustering I can get actually relies on information from all of principal components 2, 3, 4, and 5; that improves the clustering. But that's four dimensions, and we can't really visualize that well; we've only looked at two-dimensional projections and their overlaps. Yet in order to cluster, we apparently need more information.

What happens if I use all of the columns, including the first? Then everything is again dominated by the high correlation, and that gives me worse clustering. So this is clearly a case where removing the first principal component allows the data to be better separated; but beyond that, removing additional principal components does not improve the results. On the contrary, it degrades them.

Now, is the situation the same if I use the t-SNE result instead? What does that look like? Kind of. My t-SNE result seems to be better than any of the clustering results except for the one where I actually used all of the PC dimensions 2 to 5. So this is how we can start evaluating whether clustering actually works: if you have a sample with labels, you can verify that your clustering algorithm builds cluster assignments that actually correspond to your biological labels.
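To make that concrete, here is a minimal sketch of this step, assuming the crabs data from the MASS package and a PCA via prcomp(); the object names (prc, cl) and the seed are illustrative, not necessarily those in the workshop script.

```r
# k-means on PCA scores of the crabs data; a minimal, hedged sketch
library(MASS)

data(crabs)
prc <- prcomp(crabs[, 4:8], scale. = TRUE)  # PCA on the five measurements

set.seed(100)                               # arbitrary seed for reproducibility
cl <- kmeans(prc$x[, 2:3], centers = 4)     # cluster on PCA columns 2 and 3

# Cluster assignment (1 to 4) as a function of the row index; with curated
# data, rows 1-50, 51-100, 101-150, 151-200 should each get one cluster
plot(seq_along(cl$cluster), cl$cluster,
     xlab = "row index", ylab = "cluster assignment")
abline(v = c(50.5, 100.5, 150.5), col = "grey")
```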
In principle, any project where you hope to do clustering and cluster analysis really relies on you first validating your workflow with something like this. If I were to study crabs and couldn't get my hands on some curated samples with which to test my workflows, well, I would always get clusters. You always get clusters. But are they meaningful? In order to know whether they're meaningful, you need some labeled data to compare them with. So develop your workflow on known data and then apply it to unknown data. You're not likely to be successful if you apply your workflow right away to unknown data with no good way of validating whether the results are correct.

Yes? [A question from the audience: Daniel has fifty kinds of samples and doesn't know whether their RNA profiles can be distinguished. If he does clustering on that, how can he validate the result, given that there are no labeled data, and how many principal components should he use?]

Well, if you were to come to me for consulting on this, the first question I would ask is: what are you going to do with the results? How would you validate them down the road? Is it just to make an informative plot, or does something depend on it; are you going to follow up on it? If you say, well, we can follow up by doing a longitudinal study or whatever, then maybe you need to do that for a subset first: wait a few years and look at how the clusters correspond to outcomes. Another way is to find data from a similar situation and develop your algorithms with that similar data set in mind. Once you know that all your code is correct, that you understand your parameters and which variables you can use, and that you can reproduce your results, then apply it to your own data. But there is no statistical test that you can simply apply blindly.

Well, this is a segue to the next thing we're talking about anyway, and that's cluster validation. We can validate clusters. In the absence of biological knowledge, however, validation really means mathematical validation. And that means: are my clusters well structured? Are there mathematical features I can compute on my clusters that show me that members within a cluster are indeed more similar to each other than members of different clusters are? Those are cluster quality metrics. And since we have a number of clustering methods, we can ask: which one should we use? Which one is best?

What we have here is a package, clValid, which compares different methods systematically with each other and then applies quality metrics to the clusterings they produce. The vignette for clValid describes the validation measures you can apply. Internal measures reflect the compactness, connectedness, and separation of the cluster partitions. Connectivity has a value between 0 and infinity and should be minimized; that value should be as small as possible. Silhouette width measures the degree of confidence in a clustering assignment; it lies in the interval −1 to 1 and should be as large as possible. And the Dunn index is the ratio of the smallest distance between observations not in the same cluster to the largest intra-cluster distance; it has a value between 0 and infinity and should be maximized.
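In code, such a comparison looks roughly like this; "myData" stands in for whatever numeric matrix you are clustering (rows are the observations), and the set of methods is illustrative.

```r
# A hedged sketch of a clValid comparison
library(clValid)

valid <- clValid(myData, nClust = 2:6,
                 clMethods = c("hierarchical", "kmeans", "pam"),
                 validation = "internal")

summary(valid)        # connectivity, Dunn index, silhouette width per method and k
optimalScores(valid)  # the best value of each measure, with method and k
```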
So you have these three indices that characterize whether my clusters are compact and whether they are well separated. The package can also do a number of other things.

So let me run that on my data here; remember, these were the most highly differentially expressed genes, and the question is: can I cluster them? It starts fitting with a number of different clustering methods: hierarchical, k-means, and some methods we haven't discussed here. And for each of the methods it gives me connectivity, Dunn, and silhouette values, assuming two clusters, assuming three clusters, assuming four clusters, and so on. It clusters along the different possibilities and then reports the optimal scores. So we just try it out and look at what the optimal scores are.

For the connectivity, do we want to maximize that, or should it be minimized? I can't remember; let me check back here. Connectivity has a value between 0 and infinity and should be minimized. What I see here is that as the number of clusters gets higher, the whole clustering usually gets worse in terms of the cluster metrics, and the model-based algorithm is the worst of them all. Hierarchical clustering seems to perform best overall, along with agnes clustering, which gives the same values. So that tells me my data don't really cluster well: the best "clustering" I can get is all of my data lumped together. If I consider the Dunn index, which should be maximized, the largest values again come from hierarchical and agnes, which give the same result, and once again the optimal number of clusters is two. So the fewer clusters I define, the better it gets for this one, and simple hierarchical clustering again does best. And then silhouette width lies between −1 and 1, the largest value wins, and again hierarchical clustering is best; if I increase the number of clusters, my metrics get worse. So this is an example where I would be cautious about using clustering in the first place. Or rather, I would say that the data I've collected, i.e. these expression profiles, do not support clustering. Is that something I'd expect? Is it just due to the vagaries of the clustering algorithms? Either way, it doesn't work well here.

I would expect clustering to work a lot better if I apply clValid to our crab data, where the t-SNE separated the groups so nicely. Why not just run that? Should be easy enough to do; I hope I'm not going to break something again. So we'll try cluster counts between two and nine. I would expect to get the best results, with the best method, for four clusters, because that's what's in my data. The data I'm clustering are the PCA scores, columns 2 to 5, using all these methods. I've never done this; I'm actually very excited.

So what do we find here? OK. This one should be minimized. Was it? Connectivity? I can never remember, sorry. Connectivity, yes, should be minimized. Judging by connectivity, we seem to be getting a dip around four or five here, and that dip indicates that some useful clustering is actually taking place in this region. The Dunn index should be maximized, and I do indeed observe that it peaks at around four clusters; something is actually happening there, especially with the model-based clustering algorithm. And using silhouette width, something like five or six clusters gives the best clustering. Now, mind you: giving me the best clustering does not necessarily mean giving me the best biological labels.
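For reference, that run looks roughly like this, under the same assumptions as the earlier sketches (prc holding the crab PCA); the method list is again illustrative, and the "model" method additionally needs the mclust package.

```r
# clValid on the crab PCA scores, columns 2 to 5, for 2 to 9 clusters
library(clValid)

valid2 <- clValid(prc$x[, 2:5], nClust = 2:9,
                  clMethods = c("hierarchical", "kmeans", "model"),
                  validation = "internal")
summary(valid2)

# One panel per measure, plotted against the number of clusters: look for
# a dip in connectivity and peaks in Dunn / silhouette near k = 4
plot(valid2)
```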
A good cluster-quality score just means: this is a way we can partition the data. If there happens to be, say, within my orange females, a subgrouping, for example if motherly and grandmotherly orange females have different shapes, then the clustering might pick that up, but I would not have it in my labels. So calculating cluster validities is easy; interpreting them, like anything with your data, is something you need to give a lot of thought to. So much for cluster validation.

Now, I've mentioned t-SNE, and I would like to apply t-SNE briefly to our expression data, because I'd like to pick up on this question: we have many points in our plots, and often we need to identify them. I'd like to show you a few strategies for doing that. So this is t-distributed stochastic neighbor embedding applied to the cell-cycle data, and there's a bit of structure. Even though our clustering was not successful, we're usually very good at seeing cluster structure; we see clusters even when there aren't any. But I could say: well, maybe these points are similar, so let's look at their expression profiles; and maybe this group over here is similar, and I'd like to see their expression profiles too. How can we do that? I'm not even going to color them here. The question is: how do I select points from such a t-SNE result?

The simplest thing I can do is plot the row numbers as text and then stare at my data. These are now the row numbers, and I need to stare at them and collect the ones I want to figure out what these groups are. Maybe if we make the labels a bit smaller it becomes a little more legible; at least I can start distinguishing things. And then we can make a parallel-coordinates plot from data points we're interested in. For example, I could define a set 1 from what I see here: 157, and I also see rows 74, 223, and 219. Then I could say, well, some others sit around this region; let's see if they're similar to each other but different from the first set: 180, 110, 139, and 174. And finally, maybe take a few from this cluster down here as selection 3: 107, 61, 194, 209, and 208.

So these are now three sets. This is the first set, and indeed, its members are similar. The second set: again similar to each other, but very different from the first. And the third set, on a different scale: again I see behavior different from the other two. That means that, in principle, the separation I get from my t-SNE analysis really does show me points that are close together in that space, and thus similar, and I can pull them out.

Now, that's something. When you do your analysis on your own data sets, you can run a principal component analysis or a t-SNE, something that plots your data points, and then you can start looking into that and ask: do I have structure in my data? Does the structure make any sense? Is it reasonable that my sample number 1 and my sample 17 look kind of similar, that they're close together in these plots? Is there any biology suggesting that this might point to, say, a cancer subtype? But of course, if you have a large number of samples, just staring at these things is very tedious. So can't we go in there; isn't there an interactive way to identify points in plots? Yes, there is, and I'd like to show you. There are basically two functions we can use, identify() and locator(), and they give us interactive ways of using the mouse to identify points in the plotting frame.
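Since the demo is easier to follow with the code in front of you, here is a hedged sketch of both the static and the interactive version. "tsne_xy" (an n × 2 matrix of t-SNE coordinates) and "expr" (the expression matrix, rows matching tsne_xy) are assumed names, and the picked row numbers are illustrative.

```r
# Static version: label every point with its row number and read off groups
plot(tsne_xy, type = "n", xlab = "tSNE 1", ylab = "tSNE 2")
text(tsne_xy, labels = seq_len(nrow(tsne_xy)), cex = 0.5)

set1 <- c(157, 74, 223, 219)              # rows collected by eye
matplot(t(expr[set1, ]), type = "l", lty = 1,
        xlab = "column (sample)", ylab = "expression")

# Interactive version: two windows, identify() in one, profiles in the other
plot(tsne_xy)
win1 <- dev.cur()            # ID of the window holding the scatterplot
dev.new()                    # open a second window for the profiles
win2 <- dev.cur()

dev.set(win1)                # focus back on the scatterplot
sel <- identify(tsne_xy)     # click points; Esc or right mouse button to finish

dev.set(win2)                # send the next plot to the second window
matplot(t(expr[sel, ]), type = "l", lty = 1,
        xlab = "column (sample)", ylab = "expression")
```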
So these were the points in our t-SNE plot. If we call identify(), we can go and click on the points we're interested in, and when we press Escape, or a different mouse button, we get shown which points we actually chose. Now it becomes a lot easier to say: ah, these were 150, 64, 145, and 113. I could easily write that down, but I don't need to, because the result is also returned as a vector. So I can assign the result of the identify() call, and then draw the parallel-coordinates plot in a separate window.

To do that, I open a second graphics window and store its ID; call it window 2. And there's something called focus: initially, when I open a window, focus moves to that window, so if I plot something, it arrives there. If I want to return focus to my original plotting window, I call dev.set() with the ID of window 1. So with dev.set() and the two window IDs, I can address plotting in one window or the other. I hope this works. I get this window on the side here, and, for example, clicking this point here immediately gives me the parallel-coordinates plot in the other frame; clicking its little neighbor shows they're indeed similar, on slightly different scales, and so on. This is a very interactive way of picking data points out of plots, sending them to another window, and displaying them. And by implication of the t-SNE, I would expect that points over in this region here are quite different and show different behavior. In this way I can further explore my plot, see what the plot was actually showing me in the first place, and, by exploring its internal structure, hopefully learn the important features and the important information in my data.

I'm sorry that a lot of this is a bit broken; I need to update the script. I will post it back on GitHub as soon as it's updated, which should probably be today; I don't want it to languish in this broken state. You can then go and experiment and play with it, and if anything is not clear or doesn't work, feel free to email me about it. The next section, which again suffers from the same problem, shows how to draw a polygon around a cluster and then select everything inside it, which again makes it easy to pick out point sets from PCA or t-SNE plots.

Good. As far as clustering is concerned, I would like to leave it there and talk a little bit about statistics, just a little bit. In particular, the one question that obviously haunts us after we've done our exploratory data analysis: is any of this significant? We need to wrap our heads around what we mean by significant. What does "significant" mean? Significance, in statistics, is a concept that relies on p-values. We have a probability of some event having happened. If we have an idea about the distribution of observations for events that we're not interested in, because they're what always happens (we often call that the background distribution), then, when we observe an event, we can assign a probability to the sample we've just observed being an example of that background distribution, or perhaps not. The probability asks: is this sample something I could have drawn out of the background distribution? Now, the background distribution is a distribution; it can take many values.
So the question is: is the value I'm observing conceivably part of the background distribution, or is something interesting happening here? We often call that background distribution the null hypothesis. If nothing interesting is going on, the null hypothesis is true. If something interesting is going on, we reject the null hypothesis, because the probability that the null hypothesis holds for that one observation is very small.

But what do we mean by the probability of a single observation, and what do we mean by "very small"? What we mean by very small is purely a matter of convention, a cultural convention. In the biomedical field we call something very small, or highly improbable, if it has a probability of less than 5%, and we call something with a probability of less than 5% significant. We sometimes call something with a probability of less than 1% highly significant. But there's no particular reason to assume that 5% is better than 6.237% for characterizing what we're interested in, or that 1% is better than 0.83%. It's a cultural convention. However, it's a widely adopted cultural convention, and if you use a significance level other than 5%, you will have a lot of explaining to do to your manuscript editor. But that's all it actually means: 5% is just the value that the British statistician Ronald Fisher happened to propose for this purpose in 1925, almost 100 years ago. And incidentally, Fisher himself later said, well, maybe that wasn't such a great idea; maybe we should really be using different cutoffs for different purposes, and we do. What's also really important is that this refers to a background distribution. It doesn't tell me whether an alternative hypothesis is true; specifically, it doesn't tell me which alternative hypothesis might be true. It only tells me something about whether the null hypothesis might be false.

So what do we mean by the probability of an observation? Assume I pick a random number with the random number generator for the normal distribution, i.e. the Gaussian distribution. I pick one number, x; this corresponds to one observation. Now I print my number x and get the value −0.8969145466249813791748. So this is one observation. What is the probability of that observation? This is what we're really asking: you have an underlying distribution, you have a single observation, and you want the probability of that observation. It's some small number, certainly a rational number, because it's represented in binary on my computer. But the probability of drawing exactly that value, if I extend the number of digits to infinity, is actually 0. Does that mean that every single number, every single observation I make, is infinitely significant? Of course not. When I say the probability of an observation, I don't mean the probability of exactly that observation; I mean the probability of that observation or something even more extreme, as compared to my reference distribution.

That's something I'd like to illustrate. Let's first draw a million random values from our standard normal distribution. One million values, instantaneous. And let's look at what the distribution looks like in a histogram. The value we drew previously is still this one number.
And we can ask: where does this number fall in that distribution? This is where it falls. Now we can ask how many values are smaller than this one; that's 184,491. And conversely, the number of values that are larger is the rest, something like 815,000. Let's color the bars to illustrate that; and this was our red line. So we have numbers that are smaller and numbers that are larger, and that now defines what we call the probability of our observation: the count of our value and the ones more extreme than it, divided by everything. The probability here would be something like 18%, a p-value of 0.18.

The shape of our histogram confirms that rnorm() has actually returned values distributed according to a normal distribution. Normal tables tell us that 5% of our values, i.e. the cutoff for significance, should lie approximately two standard deviations away from the mean. This distribution is centered on 0 with a standard deviation of 1, so values greater than 2 or smaller than −2 should make up about 5% of the total. Let's see: how many of the values are greater than 1.96, i.e. lie about two standard deviations above the mean? That's 24,986. Wait, didn't we just say 5%? This is 2.5%, not 5%. Why? Well, what you have to consider is whether we're making the comparison in a one-sided or a two-sided way. If I'm only looking for things that are larger, larger than the rest of the values, I'm making a one-sided observation, and then my 5% cutoff lies somewhere inside the two standard deviations. But if I say, well, I don't know, the interesting values could be smaller or they could be larger, then I have to look at the two sides separately, and my cutoff is such that 2.5% of values are larger and 2.5% are smaller; that is, I consider absolute values.

I can use quantiles to count how many values in my distribution are larger or smaller than a given value. The quantile at the 0.95 boundary is the number 1.6449, something like 1.645, and if we count the values that lie above that quantile, we get 50,000, as expected. So this is how we can look into a distribution and, simply from the distribution, say what the probability, the significance, of an individual observation is.

If we know a functional form for our underlying distribution, because there's a well-understood mechanism behind it over which we can integrate, then we can express these counts as areas: the integral of the function on one side of the cutoff, divided by the integral of the entire function, gives us the probability. But many probability distributions can't be integrated analytically; the normal distribution, for one, can't be. And in particular, the tests where you look things up in tables, normal distributions or t-distributions or whatever, make assumptions about the independence of the data that are certainly often violated in biological data. So whether it's even possible to come up with a good description of the underlying distribution is a very difficult question to answer. But we need one, because if we have no idea what our underlying distribution is, we can't even say whether any value is significant.
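For reference, the whole counting experiment fits in a few lines of R. The seed is arbitrary, so the exact counts will differ from the ones quoted above.

```r
# Empirical probability of one observation against a simulated background
set.seed(112358)
x <- rnorm(1)               # the single observation

N  <- 1e6
bg <- rnorm(N)              # one million draws: the background distribution

hist(bg, breaks = 100)
abline(v = x, col = "red")  # where the observation falls

sum(bg < x) / N                    # fraction smaller: a one-sided p-value
sum(bg > 1.96) / N                 # about 0.025: one tail beyond ~2 sd
sum(abs(bg) > 1.96) / N            # about 0.05: both tails together
quantile(bg, 0.95)                 # about 1.645
sum(bg > quantile(bg, 0.95)) / N   # 0.05, i.e. 50,000 values, as expected
```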
So what comes to our rescue here is something we call empirical p-values. We can often simply run a simulation, random resampling, or a shuffling or permutation, and then count outcomes, just as we did with our rnorm() values. We have a large number of simulated outcomes, we compare one observed outcome to them, and we ask: where in the simulated distribution does it lie? This is absolutely correct, perfectly fine statistics, and mathematically extremely simple: it's just a counting statistic. Conceptually, of course, it is not simpler than applying some very refined statistical test; the conceptual burden just moves into your simulation. But that's a lifesaver, because you know a whole lot about that simulation. You can bring all your biological expertise into it. You can say: I have a bias because many of my patients were smokers, or this is a mouse strain with a known risk factor for cardiovascular events, and take that into account. Take everything you know about your data, put it into the simulation, and then ask: what does the null distribution look like? Building such simulations takes creativity and a lot of insight, and once you've built your simulation and your synthetic data, validate them well. But they will allow you to make quantitative statements about your data in situations where you simply know that a standard statistical test will fail, because standard tests have assumptions you can't guarantee for your data. Most importantly, almost all of them require independence of your observations.

So here's an example. Assume you have a protein sequence, and you speculate that positively charged residues should be close to negatively charged residues in the sequence, simply to balance charge locally. A statistic that could capture this is the mean minimum distance between each D or E residue and its closest R, K, or H residue. So let's take a protein sequence, this is a yeast transcription factor, as a string, and split the string into individual characters; we've talked about sequence strings quite a lot. I've called this v here: a character vector, 833 characters, M, S, N, Q, and so on. Now I can find the positions of my charged residues. For E and D, I have 88 aspartic or glutamic acids, at these positions here. For arginine, lysine, and histidine, I have 125, at these positions here. And now I can calculate, for each of the E's and D's, the minimum distance to the nearest R, K, or H, and collect them in a vector. The first E has its closest arginine, lysine, or histidine five residues away; the second has its closest at three; the third has a directly neighboring arginine, lysine, or histidine. Now, my speculation is that these numbers are smaller than I would expect by random chance, because of local charge balancing. So these are the minimum distances: 24 residues have an adjacent partner, the largest distance I observe is a separation of 28, and so on. The mean separation here is 4.1: on average, a negatively charged residue has its nearest positively charged residue 4.1 positions away.

Now, is this significant? I wouldn't even know how to begin to phrase a question like this properly as a standard statistical test. Lauren, what would you think; do you think you could do this by applying a hypergeometric distribution? [Lauren: I would have to think harder about it.] I think I would need to think really, really hard too, much harder than my very poor background in statistics would make reasonable.
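For concreteness, here is a hedged reconstruction of that statistic as an R function; the name chSep and the details are mine, the actual workshop code may differ.

```r
# Mean minimum charge separation: for each acidic residue (D, E), the
# distance to the nearest basic residue (R, K, H), averaged over all D/E
chSep <- function(v) {
  acidic <- which(v %in% c("D", "E"))
  basic  <- which(v %in% c("R", "K", "H"))
  dMin <- numeric(length(acidic))
  for (i in seq_along(acidic)) {
    dMin[i] <- min(abs(basic - acidic[i]))  # nearest basic residue
  }
  mean(dMin)
}

# v <- unlist(strsplit(mySeq, ""))  # split the sequence string into characters
# chSep(v)                          # about 4.1 for this sequence
```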
But we can do this by simulation. We don't actually need to think that hard; we can let the computer just try. And that's really easy to do. First, we take what we've done above and combine it into a function, so we can repeat it many times over.

[Question: when you say significant, you mean with respect to some null hypothesis, right? What is it here?] Yes, exactly: what's the null hypothesis? The null hypothesis is that the observed biological sequence is no different from a non-biological sequence on which there was no evolutionary pressure. And how do I get a non-biological sequence of this kind? Well, I can take my original sequence and shuffle it. That way I keep the composition; I get no bias from having more or fewer arginines or lysines, which is an important factor, because the more charged residues there are, the smaller the mean distance becomes. But any correlation that actually reflects evolutionary pressure on local charge separation would be lost, because I'm shuffling: there's no biological meaning in the distances anymore. That is my null hypothesis.

So here's my function; it does exactly the same thing I did before. I execute it once on my sequence vector and confirm that it gives me the same result I calculated before. Now I can produce a random permutation of my vector: from my vector v, I make a random permutation and get a vector w, which now starts with T, K, F, N. So it's different; it doesn't start with methionine anymore, but it contains exactly the same amino acids, and therefore exactly the same numbers of lysines, arginines, histidines, glutamic and aspartic acids. I apply my function to that and find 3.273. That's actually less than what I saw before. So let's do this 10,000 times: for N equal to 10,000, I make myself a charge-separation vector of 10,000 numeric slots and repeat the whole thing 10,000 times. It takes a few seconds to compute, and then I can look at the histogram.

This distribution corresponds to my null hypothesis; it is the background distribution. For a sequence of this composition, the underlying distribution of mean charge separations in a shuffled sequence looks like this. And where is our observation, the one from our actual biological Mbp1 sequence? It lies here. So now I can assess my hypothesis against this observation. Contrary to my expectations, the actually observed mean minimum charge separation is larger than what we observe in a randomly permuted sequence. I would have expected it to lie far on the other side of the mean, i.e. whenever I have an arginine or a lysine, an aspartic or glutamic acid would sit very close to it to balance the charge. That's not what the actual sequence does; it's on that side.

Now, is this significant? How do we determine whether this is significant? [Suggestion from the audience: compare the observed value against the entire sample.] Yes, and I could also do this for more proteins. If I go home on a Friday evening and set this running on the human proteome, one sequence after another at about ten seconds each, then by the time I come back, the 20,000 sequences would be done. And the statistic I would collect at that point is the significance that I'm about to calculate now, computed for every single sequence, because they're all different: they all have different lengths and different separations.
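Put together, the permutation test is just a loop around chSep(); the final line anticipates the count we work through next. Again a sketch, with v and chSep() as defined above.

```r
# The permutation test: shuffle, recompute, repeat
N   <- 10000
chs <- numeric(N)                  # 10,000 numeric slots
for (i in 1:N) {
  chs[i] <- chSep(sample(v))       # sample(v) is a random permutation of v
}

hist(chs)                          # the null (background) distribution
abline(v = chSep(v), col = "red")  # the observed value

sum(chs > chSep(v)) / N            # empirical p-value: about 0.12 here
```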
So let's calculate it for our one sequence. I have this vector, was it chs? Ten thousand samples; these are the values, and we saw their histogram. And I have my single observation... no, v is the sequence; what was my single observation? I'm slowly losing it here. Ah, right: simply the result of the charge-separation function applied to v. Let's call it 4.1.

OK, so how many of the samples are smaller or larger? What's the expression I need to write? [You could just count them.] I'd still be here tomorrow if I did that by hand; I need to tell R to count it. Yes: the sum over the vector, was it chs, greater than the observation. Oops. There: 1,188. What does that number mean? [You have to divide it by 10,000.] By 10,000, right. That I could do by hand, but I don't have to; or more generally divide by N, since that's how I defined it. So: 0.1188. Relative to our plot, where does that number appear? It's the sum of the counts in these columns of my histogram here; or, if I considered this a smooth curve, it would be the area under the curve on the right-hand side of the line. So, given this background distribution, I would expect to see a value that deviates this much from the mean, in a sequence of this composition, about 12% of the time. Is that significant? [Not at the 0.05 level.] Not at the generally agreed-upon cutoff, no. And if we then start saying "approaching significance" or "almost significant": there's a very funny post somewhere in which somebody collected, from his experience as a referee, all the linguistic contortions that people use to weasel their way around the fact that their data was not actually significant. Very funny. So: not significant.

It's an intriguing result, though. It certainly seems to contradict our initial hypothesis that charges are balanced locally. Perhaps if we did this for many different sequences we would see a trend; perhaps that trend would differ from random when averaged over many, many observations. But this result perhaps warrants further follow-up; it is not, in and of itself, significant. What you also have to realize, however: if the very same thing had turned out, from just a few tweaks in the sequence, to land on the other side of 5%, that would not necessarily mean we now have to rethink our biology. We would still need to follow it up. So don't blindly trust significance levels based on p-values. At the least, this very simple approach of permutation testing, or simulation testing, in other words calculating empirical p-values, allows us to make quantitative statements like this about biology that is otherwise very, very hard to assess.

In the second part of the script, I go through basic statistical tests, two-sample t-tests and other things, on the GSE data, and basically illustrate what we expect there: how we calculate t-values when we compare conditions with three biological replicates on each side, and what that really means; how we are certain to find absolutely, highly significant things that are otherwise meaningless, simply by repeating a test over and over and over again; and how we correct for that with something called the Bonferroni correction, or the false discovery rate. I hope the script is relatively self-explanatory. It also introduces the ideas behind non-parametric tests, if you're so inclined to study that.
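The multiple-testing point from that part of the script can be illustrated in a few lines; this is my own hedged sketch, not the script's code. With enough tests on pure noise, some raw p-values always look "significant", and p.adjust() corrects for that.

```r
# Many t-tests on pure noise: 3 vs 3 "replicates" with no real difference
set.seed(42)                       # arbitrary seed
M <- 10000                         # number of genes, i.e. number of tests
praw <- replicate(M, t.test(rnorm(3), rnorm(3))$p.value)

sum(praw < 0.05)                                   # roughly 500 hits by chance alone
sum(p.adjust(praw, method = "bonferroni") < 0.05)  # after Bonferroni: essentially none
sum(p.adjust(praw, method = "fdr") < 0.05)         # after FDR control: essentially none
```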
I think if I started going on about real statistics now, you would not have a good time; you'd just fall silently asleep. So I'll leave it at that point. You still have surveys to fill out for the workshop, but for the formal teaching part, I will conclude by saying thank you. You've learned a lot; I see it in your faces. Most of you, I think, had a lot of fun doing this. If this was successful, then I was able to break down some barriers. I don't presume that I was able to teach you R, but I hope I was able to introduce you to it well enough that you can continue learning R on your own, explore it, and simply enjoy it. I can't overstate how important it is to play, and R is fun to play with: making pretty plots and finding odd correlations in your data. I think that's quite satisfying. So thanks a lot. I'll hand it over to An for the last good words, and I hope to see you again. And as I said in the last workshop, if you do have any questions about the material, or you're stuck at some point, feel free to email me.