I have placed the sample solution that we developed into a file, task-solutions.r, which I've uploaded to the repository, so if you pull from the repository again you will have that sample solution on your machine, with a few extra comments added. One thing we still need to do, and it is also in the task solutions, is to update our listGenes() function. Remember, our listGenes() function just listed the systematic gene names, which weren't very informative. I've changed it to print the systematic IDs, the standard gene symbols, and the gene names, so the output now shows the row number, the systematic ID, the standard symbol, and the gene name. If we find some interesting gene sets, there might be clues in that information about what these genes are. I have no idea; I haven't tried it. To make this listGenes() available, you simply have to select and execute its definition.

So let's have a look. We found some genes clustered in the PCA, these two sets, and within each set the profiles are quite similar. As you know, if gene expression profiles are similar, one possible reason is that the genes are co-regulated, and genes would be co-regulated if they need to be expressed under similar conditions in the cell, often because they have some related function: they collaborate. So the idea is that co-regulated genes also participate in similar functions. Our gene sets were this one and this one. Let's use our new listGenes() function and see what we get; the call listGenes(sel1) is already in the code. We have a mitochondrial ribosomal protein, and another mitochondrial ribosomal protein; it would make sense that those two are co-regulated. We have a nitrogen permease regulator involved in GABA utilization, and two minichromosome maintenance (MCM) proteins that are regulated in the same way. So indeed, these co-regulated genes are annotated with similar functions. That is not something we put into the analysis in any way; it is something we got out of the analysis. We have discovered that among the genes whose profiles we measured, there are genes that contribute to the same complex and the same biological system, and that these genes appear to be co-regulated, i.e. they change their expression patterns in comparable ways across the cell cycle. That's nice.

Now our second set, sel2. Here I see two genes that have something to do with methionine regulation. There are also two entries that are the same gene: the same gene measured twice on different spots of the microarray. That is actually reassuring: if we discover one spot of a gene to be regulated in a particular way, all of its spots should be regulated in a similar way. And there is some more biosynthesis. Once again, in this set of co-regulated genes, genes number one and number six behave in very similar ways and are related to the same biological system. So if this were our own data, fresh off the microarray chip, we would have just made a discovery about it by applying this kind of analysis. Being able to integrate data like this is fantastic for exploratory data analysis. As I said before, we are primarily concerned here with looking into the structure of data sets.
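As a minimal sketch of what such an updated listGenes() could look like — the annotation data frame geneAnnot and its column names are hypothetical; the actual solution is in task-solutions.r:

```r
# Sketch only; "geneAnnot" and its columns are hypothetical stand-ins for
# an annotation table with one row per expression profile.
listGenes <- function(rows) {
  for (i in rows) {
    cat(sprintf("%4d  %-10s %-8s %s\n",
                i,
                geneAnnot$sysID[i],     # systematic ID, e.g. "YAL040C"
                geneAnnot$stdSym[i],    # standard gene symbol, e.g. "CLN3"
                geneAnnot$geneName[i])) # descriptive gene name
  }
}

# usage: listGenes(sel1)
```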
Probably in the future we should add a module on how to actually go about integrating data in this way, but for our purposes it was probably very useful to take you through these steps: we have an idea that there is data available somewhere on the web that we should be able to integrate via matching IDs, we set out on Google and find different ways to get that data, and then we annotate our genes. The integrated data then allows us to make new inferences about our own data.

Now, Brandon asked this morning: what about PCA — how do we interpret these components, and what do we do with them? That is an absolutely valid question. PCA will show us what is actually in the data. But we might also have ideas of our own about what could be interesting. If we have designed an experiment to look at the cell cycle, we may already expect that a cycle can be expressed as a sine wave, so cyclically expressed genes should look like sine waves in their expression profiles. Or we could apply some other stimulus and believe that genes responsive to that stimulus first rise, then plateau, then decay exponentially in their expression profiles. What we can do is encode such ideas, such hypotheses, as numerical models — for example, by defining a canonical sine wave and then using correlation to ask whether there are genes that are highly correlated with our model, or to what degree particular genes are correlated with a particular model. If we measure that, we can use it as a projection: just as we project our data onto the principal components, here we project our data onto a model by measuring the correlation of an individual expression profile with that model. PCA really is a projection of the data under a high-dimensional rotation, and a model-based analysis can be seen as a projection along a particular model. There is no guarantee that the model will actually correlate well with a high-variance direction of the data set; we simply use it as a hypothesis that might be useful, and then explore whether what we get out of the data makes sense relative to that model. So it is, again, a form of exploratory data analysis — but the big advantage of using models in this way is that you can interpret them. If I use a sine wave as a model, I can interpret the result in terms of a cyclical function over time.

Now let's define three simple models and try this out: a half wave, a single sine wave, and two sine waves over the course of the experiment. What would that look like? We can simply use trigonometric functions: define the number of values we need, which is of course the same as the number of time points, make some empty vectors, and populate them with trigonometric function values — a cosine for the half wave, sines for the full waves, scaled appropriately; you can do that in different ways. Here we go: these are our three models. One is a single sine wave, one is two sine waves, and one is half a wave, which is simply decreasing expression over time. If we want to relate each expression profile to our models, we can define three vectors that store the coefficient of correlation for each row, i.e. for each expression profile. That is very fast; we just iterate over all profiles.
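A minimal sketch of these three models and the per-profile correlations, assuming the expression matrix is called dat (a hypothetical name) with one time point per column:

```r
# Sketch; "dat" is a hypothetical name for the expression matrix.
N       <- ncol(dat)                   # number of time points (17 here)
timePts <- seq(0, 1, length.out = N)   # experiment time scaled to [0, 1]

modelHalf   <- cos(pi * timePts)       # half wave: steadily decreasing
modelSingle <- sin(2 * pi * timePts)   # one full sine wave
modelDouble <- sin(4 * pi * timePts)   # two full sine waves

# Correlation of every expression profile with each model: this is the
# "projection onto a model" described above.
corHalf   <- numeric(nrow(dat))
corSingle <- numeric(nrow(dat))
corDouble <- numeric(nrow(dat))
for (i in seq_len(nrow(dat))) {
  profile      <- as.numeric(dat[i, ])
  corHalf[i]   <- cor(profile, modelHalf)
  corSingle[i] <- cor(profile, modelSingle)
  corDouble[i] <- cor(profile, modelDouble)
}
```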
We calculate the correlation of each measured expression profile with our model profile. Again, referring back to what I said before, this is like a projection of the data onto that profile, and we can then plot it in the same way as we plotted our principal components. In this case, this is the correlation with the single sine wave plotted against the correlation with the double sine wave. Again, we can vaguely see clusters of data points that correspond to more or less correlation. These data points here are highly correlated with the double sine wave; these here are highly anti-correlated, so they go in the other direction; and the data points up here, close to zero on this axis, are only correlated, or only anti-correlated, with the single sine wave, so they look different. This is a plot of the double sine wave against the half wave. Again, we have data points, probably quite similar to each other, that are highly correlated with one profile, and other data points that are not correlated with it but are well correlated with the other profile. These data points here would be profiles that steadily increase over the course of the experiment, because they are anti-correlated with the half cosine wave we are using; and these here would be genes that steadily decrease over the course of the experiment — the red model line here decreases from the first to the last measurement, and something like gene number two appears to have that behavior.

So we can again look at the plot, squint very carefully — there are other ways to actually extract the data, and we'll mention them — and select some sets of data points for analysis. I am selecting three sets: the ones in blue, which are highly correlated with the double sine wave; the ones in red, which are highly anti-correlated with it; and the ones in green, which are not well correlated with it but are decreasing in expression. So what genes are these? The highly correlated set contains a number of known cell cycle genes: cyclin B, CDC45, a polymerase, a replication factor, RAD27, one of the checkpoint proteins, more polymerase — DNA polymerase B subunit 2. When do we need to make more DNA polymerase? When we are replicating the genome — and that is switched on and off during the cell cycle. So it makes a lot of sense that these biological functions are associated with the expression behavior that we see. In the anti-correlated set, CDC5 is one of the main regulators, and again a lot of methionine genes are highly anti-correlated; highly anti-correlated means they are still correlated with the sine wave we have here, but in an inverse sense. And sel3 contains genes that simply drop in expression over time. I see genes such as glutamate dehydrogenase that are normally considered housekeeping genes — so maybe, as this experiment goes on, the cells are not very healthy; they are somewhat stressed under the experimental conditions. If we plot the profiles, this is what it looks like: correlation with the double sine wave, anti-correlation with the double sine wave, correlation with the half cosine wave. Overall, we now have groups of genes that look similar to our models. Is this the best model we can use?
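A minimal sketch of how such threshold-based selections could be made and highlighted — the cutoffs and set names are illustrative (a 0.77 cutoff comes up again later in the session):

```r
# Sketch; cutoffs and colors are illustrative, not the exact values used.
plot(corDouble, corSingle,
     xlab = "correlation with double sine wave",
     ylab = "correlation with single sine wave")

sel1 <- which(corDouble >  0.77)  # highly correlated with the double wave
sel2 <- which(corDouble < -0.77)  # highly anti-correlated with it
sel3 <- which(corHalf   >  0.77)  # steadily decreasing profiles

points(corDouble[sel1], corSingle[sel1], col = "blue",  pch = 19)
points(corDouble[sel2], corSingle[sel2], col = "red",   pch = 19)
points(corDouble[sel3], corSingle[sel3], col = "green", pch = 19)

# inspect the annotations of a selection, e.g.:
# listGenes(sel1)
```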
Perhaps if we used different phases of the cyclical functions, we would discover more things — we could place sine waves at 17 different phases here. Perhaps adding a decay term would give us different models again. So we could try different types of models, guided by our biological interpretation. We could also do something else: use a nonlinear least-squares fit of a suitably scaled sine function, and then collect the parameters of the fit — amplitude, phase, and frequency. We could then start comparing parameters, i.e. asking which profiles have similar fitted parameters. In this way we slowly translate the problem of identifying features in the raw measured data into identifying and comparing features in a lower-dimensional space that we derive by transforming the data according to a model we have imposed. That, too, would be a very valid approach for exploratory analysis. Especially for cell cycle data, working with trigonometric functions — sine or cosine — is probably a very fertile approach for discovering genes of interest. And to address Brandon's question: these results we can interpret immediately. The blue genes we discovered here are highly correlated with a cyclically varying function.

OK. Now I'd like to introduce you to a somewhat different approach. So far we have arranged these genes in a two-dimensional plot according to either the principal components or the degree of correlation with a model that we have defined. But defining such models requires some knowledge. Is there not a completely automatic way that just pulls similar elements out of a high-dimensional space? Yes, there is: t-stochastic neighbor embedding (t-SNE), a dimension reduction algorithm developed in Geoff Hinton's lab here at U of T. In principle, t-SNE computes a statistic between individual data points and then computes an embedding from the high-dimensional space into a low-dimensional space such that this statistic is preserved. Whereas we normally cannot just take a high-dimensional — say, five-dimensional — space, project it into two dimensions, and retain the important information, t-SNE embeds the data in such a way that the dimensions are contorted and warped so that the original structure is maintained as well as possible. The result is a very general way of taking high-dimensional data and putting it into two dimensions, so that we can apply our very, very sophisticated pattern-recognition brains, look at the plot, and say: ah, there is a cluster here — and then look at what that cluster is. It says nothing about whether a cluster makes any biological sense. In that respect it is similar to principal components, but it is a heuristic algorithm, and it works quite well on a lot of data. This is implemented in the R package tsne, which we should install and load. The way it works is that it computes an embedding and iteratively tries to improve it; every now and then it stops at a point it calls an epoch and plots the current result.
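Before we run t-SNE, here is the nonlinear least-squares idea just mentioned as a minimal sketch, using R's built-in nls(). The parametrization is illustrative only: real data will typically need sensible start values, and such fits can fail to converge.

```r
# Sketch only: fit amplitude A, frequency f, phase p, and offset C of a
# scaled sine wave to one expression profile; start values are guesses.
y   <- as.numeric(dat[1, ])
fit <- nls(y ~ A * sin(2 * pi * f * timePts + p) + C,
           start = list(A = 1, f = 2, p = 0, C = 0))
coef(fit)  # fitted parameters -- comparable from gene to gene
```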
The call to tsne() requires a function that actually does the plotting, the data we want to work with, a parameter called perplexity, which basically defines the granularity of the approach, and the maximum number of iterations — usually the more, the better. It also starts from a random seed, so depending on the seed value you may get slightly different results; it is worthwhile to experiment with different seeds and different numbers of iterations. So we need to define a plotting function. The plotting function here is passed a value x and then uses the crabs plotting function that we defined this morning to take the coordinates that t-SNE has currently computed and plot the points with circles and triangles, in orange and blue, and so on. That is the plot we will use. So let's try this out, setting the seed and running this algorithm. It iterates along, improving on its first guess over the iterations while trying to preserve that statistic. Note that the algorithm knows nothing about principal components, or about the biology, or about large and small crabs; it simply tries to take the five-dimensional space and project it down to two dimensions. And this is the result it gets. For the very, very small amount of effort we needed to put in — essentially just running tsne() on the data — that is a very credible result. There is a little more overlap here, but in general we seem to preserve the structure of the orange females very well, and the orange males too; there is no overlap among the blue crabs; some groups flip into each other, and so on. It does depend on the seed value, though. So when I use t-SNE, I usually loop over a number of different seed values, watch the iteration error, and pick a seed value with a hopefully small error.

So that was the crabs. Does it work for cell cycle data? First, let's pick two selection sets, in this case with positive and negative correlations. In my first set I have about ten different genes; in my second set I have four. Let's see what they are: these are the first ones, and these are the others; the two sets are anti-correlated with each other. I can now write a little plotting function that plots all of the points of the expression data and overlays my first selection in red and my second selection in turquoise. The reason I am doing this is that once the embedding function runs — now embedding a 17-dimensional data space into two dimensions — I can see whether the clusters of points I selected are actually cohesive: do they stay together? So let's look at that. That is our plotting function, and off we go. After 1,000 iterations we arrive at this result: the negative correlations were not well preserved, but the positive correlations are very well clustered. So we can ask: what do the genes in the vicinity of the originally selected ones look like? I have no idea which ones I picked here — let me check. sel3 is this. I am just trying to color the points I selected. The data structure that the neighbor embedding produces simply has two columns, the x and the y coordinates; so to write the text labels, I just need columns 1 and 2 of the embedding result at the sel3 rows. Is that lower case or upper case?
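A minimal sketch of such a call, using the tsne package's epoch_callback mechanism; the selection vectors, the name plotEmbedding, the result name tsne2, and the seed are illustrative:

```r
# install.packages("tsne")   # if not yet installed
library(tsne)

# called at every epoch with the current two-column coordinate matrix
plotEmbedding <- function(x) {
  plot(x, pch = 19, col = "grey", xlab = "", ylab = "")
  points(x[sel1, , drop = FALSE], col = "red",       pch = 19)
  points(x[sel2, , drop = FALSE], col = "turquoise", pch = 19)
}

set.seed(112358)                      # results depend on the seed
tsne2 <- tsne(dat,
              k = 2,                  # embed into two dimensions
              perplexity = 30,        # granularity of the embedding
              max_iter = 1000,        # usually: the more, the better
              epoch_callback = plotEmbedding)

# label selected genes at their embedded coordinates
text(tsne2[sel3, 1], tsne2[sel3, 2],
     labels = sel3, col = "red", cex = 0.7)
```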
It is probably lower case — yes, lower case. So: the embedding result, column 1 and column 2; the labels should be sel3; make them red; and set cex to 0.7. These are the genes I chose; unlike the points, the labels overlay the general area, and this is where they ended up. Now, what do they look like? Remember, there are points here that I did not choose in my original selection. My original selection was: which correlations with the model are better than 0.77? So we basically have a cutoff at 0.77 — and actually, I can just add abline(v = 0.77). This is the 0.77 cutoff. And now t-SNE has proposed genes that it considers similar to these, but that do not pass that cutoff on the model correlation. So if the algorithm gives us something reasonable, we have discovered additional genes which are considered similar by the criteria of t-SNE, but which would not be found through this kind of model correlation. Let's see what they look like in our matrix plot. The dark lines behind are the genes we originally found; the red lines on top are the new genes in sel3. Simply by inspecting this visually, I would say the algorithm has done a very good job of finding genes that are similar, by inspection, to what we saw previously — genes we would not have discovered just by looking at the coefficient of correlation against the 0.77 threshold. In other words, they are still quite correlated; they just do not reach our threshold.

And I can see why their correlation coefficient against the canonical sine wave is not perfect; the reason is rather subtle. The width of these two peaks is slightly different, and there is a decay here. So the correlation coefficient, computed here against the exactly periodic sine wave, introduces a bias into this whole analysis that we should be able to remove. If we were doing an actual nonlinear least-squares fit, it would be easy to take care of that; but here we are really just correlating against the idealized sine wave. So where are these genes on the model correlation? That is exactly your question — how good are the correlations? We can easily plot that. These are the ones we originally chose, and these are the ones that the t-SNE algorithm placed in the same neighborhood. They have lower correlation, but they are not necessarily the ones that are overall closest in terms of the coefficient of correlation — and not all of those that are very close, like number 161, have been included.

OK, here is an exercise you could do at home — and it is something I would probably do myself to get a better sense of what these t-SNE results mean in the first place. Take some of the models and include the synthetic model data in the data set that we are embedding: simply append the model profiles to our cell-cycle expression data, and then mark, with suitable symbols, where the points for these models lie in the embedding. Then I can see: this region approximately corresponds to a sine wave, this region to a shifted sine wave, this region to an inverted sine wave, and so on. Again, this is asking: we have all this data in a 17-dimensional space — what does a projection actually mean? How can we interpret it?
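A minimal sketch of that exercise, reusing the hypothetical names from above; it assumes the model profiles are on a scale comparable to the expression data:

```r
# Sketch: append the three model profiles as extra rows, re-embed, and
# mark where the models land in the 2D map.
datPlus <- rbind(as.matrix(dat), modelHalf, modelSingle, modelDouble)
iModels <- (nrow(datPlus) - 2):nrow(datPlus)   # rows holding the models

set.seed(112358)
emb <- tsne(datPlus, k = 2, perplexity = 30, max_iter = 1000)

plot(emb, pch = 19, col = "grey", xlab = "", ylab = "")
points(emb[iModels, , drop = FALSE], col = "red", pch = 17, cex = 1.5)
text(emb[iModels, 1], emb[iModels, 2],
     labels = c("half", "single", "double"), pos = 3, col = "red")
```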
When we find clusters in that projection, what becomes part of those clusters? So let's recapitulate a little what we have done with dimension reduction. The fundamental problem we are trying to solve is that interesting data in our domain is very often high-dimensional, and we find it hard to see patterns, correlations, and structure in high-dimensional data because it is a problem to visualize. We can use dimension reduction techniques to increase the signal-to-noise ratio when analyzing the data, to discover things that act on our data, such as confounding factors, and to help us discover structure in the data. One of the first dimension reduction techniques that comes to mind is PCA, because it has very well-defined mathematical properties. The downside of PCA is that the projections become hard to interpret, because the principal components along which we are essentially projecting the data are composed of contributions from all of the data dimensions at once. But it works very well, it is fast to compute, and once we start plotting and exploring with it, we can get a very good sense of how our data is structured internally. An alternative to PCA is model-based correlation, like the sine waves in our expression data, with the advantage that the results can be interpreted very easily. But it is also limited, because we need assumptions, from first principles, about how these models should be constructed in the first place. In our case we know we have a cyclical expression pattern; but if you are looking at gene expression profiles in response to a set of stimuli such as heat shock or acid challenge, or at expression profiles derived from different tissue types or cell lines, or at profiles corresponding to different cancer types, there may be no model you could imagine that would somehow explain the biology of your data. As an alternative and an additional approach to PCA, we can use t-SNE, which is often able to find very intuitive structure in the data that none of the other mathematical methods would pick up in the same way. The disadvantage is that the result becomes essentially uninterpretable: it just says we are putting things next to each other that are similar in high-dimensional space, and we have no idea what that similarity actually is. But it is also perhaps the method that is applicable with the least effort and in the fastest possible way.

No, I am not aware of a rule of thumb. The only rule of thumb I would use is: use as many methods as you have the patience to use. Each method will give you a slightly different perspective on your data, and the more intimate you become with the points and structures within your data — you start recognizing which genes keep coming out lying close to each other, where particular pathways are represented, and so on — the more you learn about your data, and the more secure your interpretation becomes.

Now the question was: but is PCA then completely uninterpretable — can't we extract from PCA which variables matter? Because when we did the biplot, it showed which variables have a large influence. Yes — and I hope I did not come across as saying that PCA is completely uninterpretable.
What I said is that the individual principal components may have contributions from all of the dimensions at once, and that makes them hard to interpret. Applying additional biological knowledge, as we did before, can help: for example, we learned that we can interpret the first principal component of the crabs data as related to age, and then remove that correlation with age; and we can probably interpret the first principal component of the Cho cell cycle data as, in effect, normalizing out variations between the microarray plates — which implicitly takes care of some of the inaccuracies in measuring the data. These are the strongest components because they affect all of the data points equally. But that certainly does not mean that one component is more important than another in a biological sense. The strongest information, the strongest signal, is indeed in the first and largest components; it does not follow that the strongest signal is also the biologically most relevant information. And that is the problem we keep running into: our idea of an optimal analysis is very different from the mathematically strongest signal.

Then the follow-up question: once we see some structure, is it possible to go back and ask which genes most strongly affect the particular components that give us good clusters? Because the boss's question every time is: which genes are driving this? OK. If you are asking which genes are driving this, you are asking a slightly different kind of question — you are trying to interpret the components in terms of causality, and that is very, very difficult. In principle, if we are looking for causality, we hope to have time courses, time courses with good resolution, many of them under many different circumstances; then we can hope to find patterns of genes that are preceded by similar patterns of other genes, and that would allow us to infer some level of causality. But it is fiendishly difficult, and you really need a lot of data. You can, however, perhaps bring in biological reasoning: if you ask which genes are driving this, perhaps you have a regulatory pathway; you get a cluster of genes that respond in a certain way, and you can then separate them according to whether they sit high or low in the pathway. That would be one approach. For things like driver genes in cancer, we take a different approach: we look across many cancer samples for genes that are recurrently mutated, with either gain- or loss-of-function mutations. Finding such recurrent genes is not so much a question of exploratory data analysis; it is a question of tabulating and then asking whether the frequency of mutation of a particular gene differs from the frequency of mutation we would expect. Of course, that is not easy either, because the frequency of mutation is related to the length of a gene and to its expression status: genes in condensed chromatin are much less likely to be mutated than genes that are exposed and actively transcribed. But there are ways to normalize for that and then make such discoveries. One way to discover such correlations is to work in networks, i.e. in graphs. So far we have treated these data points as independent — because essentially they are.
But we could just as well map them onto a network in which the nodes are genes and the edges are functional relationships. Typical functional relationships for these kinds of graphs are drawn from, say, protein-protein interactions or shared GO terms; a good database of functional interactions is, for example, the STRING database. You can build such networks and then, so to speak, paint them with these expression profiles — for example, with the correlations against a sine wave — and then ask: are there genes that are functionally close together and frequently mutated, or correlated in gene expression, and so on? In this way you can discover additional related genes that you might not discover with any of the individual methods alone. That is a very powerful way of doing data integration. Now, if your boss asks you what the driver genes are and what the causality is, I think that sits almost at the very top of a hierarchy of difficulty of questions to answer; the very top one would be "why?".

[Question partly inaudible — roughly: how to find the genes most related to a gene of interest, even if they are not causal.] Well, what I have shown you here is a number of ways to find genes that are similar to each other in very high-dimensional data sets, and that is at least a starting point. If you then look at these genes and their properties, you might find something they have in common, which you can use as a general criterion to pick up other genes that are related. In a sense, what we are slowly getting to is a question related to cluster analysis: how can we take data sets like these and split them up into clusters of things that are similar to each other? And that just happens to be the topic of our next module.