I think when I gave lectures 20 or 30 years ago, I didn't have the habit of wanting to start everything I do with "so" — that's become so common. I have this urge of saying "so, welcome back", or "so, this is what we were doing last time", and I don't think I've always done this; there was something else there. But I've tried to make a conscious habit of breaking that and not starting with "so". So if you find me jumping into your coffee break rather abruptly, that's because I try to remain conscious of what I'm saying. It doesn't always mean I'm making sense, but at least I try.

We left off with this plot of the PCA of the scaled data, where clusters begin to become apparent. Now, another common way to plot PCA results is with so-called biplots. A biplot is a plot of principal components: it automatically plots the row indices of the individual data points that contributed, and it also plots the projections of the original axes onto the principal components. In this case we're plotting principal component 1 against principal component 2, and the projections already show us that most of the original measurement dimensions project onto PC1 to the greatest degree — this is what becomes apparent as the large correlations between the variables that you saw. By default this plots PC1 against PC2; with the choices parameter we can tell it to plot different things, i.e. PC2 against PC3. Now we could squint very carefully, start pulling out the row numbers, and type them into a selection vector, which we could then use to classify our data. Here, in the projection onto PC2 and PC3, you see that the different dimensions are actually nicely distributed — that's the hallmark of a principal component that captures information from all of the original dimensions in the same way.
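The biplot calls just described can be sketched like this — a minimal sketch, assuming the crabs data from MASS and the variable name pcaS used in the course script:

```r
library(MASS)   # provides the crabs data set used in the lecture

# PCA of the scaled morphometric measurements (columns 4 to 8)
pcaS <- prcomp(crabs[, 4:8], scale. = TRUE)

# biplot() shows row indices of the observations plus the projections
# of the original axes; by default it plots PC1 vs. PC2
biplot(pcaS)

# the choices parameter selects other components, e.g. PC2 vs. PC3
biplot(pcaS, choices = c(2, 3))
```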
Now, something I haven't looked at before is comparing this to the unscaled values. I calculate pcaO — O for "original" — from the original crabs data and plot that; it looks like this, and here are the biplots. I have the impression that the clustering in this unscaled analysis is actually more pronounced than in the scaled analysis, but without knowing which of these points are the oranges and which are the triangles, it's kind of hard to tell.

So now we finally get to the challenge of plotting these things with color and shape coding the different categories. If you've researched this on Google, you probably quickly noticed that what you need for triangles and circles is different plotting characters. If I remember correctly, these are pch 17 and 19 for filled shapes; you could also have used 21 and 22 to color borders and backgrounds differently. These give us triangles and circles. If you need different or more convoluted shapes, you can plot many other characters, or make up your own shapes and plot them individually as polygons — although for large plots that may be slow.

Okay: we need pch 17 and 19, and we need the colors blue and orange. The super-pedestrian solution is to exploit the fact that the rows are already ordered in our original data set — rows 1 to 50 are one category, 51 to 100 another, and so on — and just plot each block separately. It's lame, but we can do it, so I'll show you how. We plot rows 1 to 50 of column 2 of the rotated data against rows 1 to 50 of column 3, with plotting character 17, colored blue, as points; then the next block goes here, and the block after that goes there. That's how these points distribute, and indeed they can be separated pretty well along these dimensions. Comparing that to the unscaled data — yeah, it's kind of different.
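The "pedestrian" row-block plotting might look like this. Note that in MASS::crabs the blocks are actually 100 rows per species (50 per sex within each species), so this sketch plots the two species blocks; the exact row ranges quoted in the lecture are approximate:

```r
library(MASS)
pcaS <- prcomp(crabs[, 4:8], scale. = TRUE)

# rows 1:100 are the blue species, rows 101:200 the orange species
plot(pcaS$x[1:100, 2], pcaS$x[1:100, 3],
     pch = 17, col = "blue",
     xlim = range(pcaS$x[, 2]), ylim = range(pcaS$x[, 3]),
     xlab = "PC2", ylab = "PC3")
points(pcaS$x[101:200, 2], pcaS$x[101:200, 3], pch = 19, col = "orange")
```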
You notice that here orange is down and there orange is up. That's completely unpredictable; it just depends on what the variances happen to be, and it underscores the fact that we can't really interpret these principal components by relating them back to the semantics of our original data. Okay — that was just a scaling problem. Right, that's more like it, not so squished. I would say they're approximately equally well separated in principle, but I think the clusters that appear here, especially for the orange ones, are a little more pronounced.

Now, rather than doing it this way, what we'd actually want is a function that takes the factors we have as parameters and translates those factors into something else — translates factor 1 into "blue", or factor 2 into "female". Let me show how we do that in principle — and thanks to Maya for bringing to my attention that this is really key and important. It's so important that I'd stopped thinking about it as something you need to learn; that's the downside of experience — you don't remember what you didn't know. So: how do we translate values into different types of values? This comes up again and again as we write scripts. For example, say I have the values "red" and "white". I can build a named vector from that: a character vector with two elements, the first element "red", the second "white"; the first element is named "R", the second is named "W". Now I can use an index vector to map values to something. Assume my data set has factors R and W, which numerically translate to ones and twos in some way, and it's a large data set of 2,000 columns — but what I want for my plot is the actual color names, red and white. I simply take that vector — whatever data$type, whatever the column is.
In this case I define it as in1, which is simply a vector of numbers: 1, 2, 1, 1, 2, 1. I put that into a subsetting expression for my values. Now, technically you'd think of a subset as something smaller, but subsetting can be done with more elements than the vector actually contains — then some of them will obviously be repeated. So if I put in1 into square brackets attached to my values, I get red, white, red, red, white, red: the first element of my values, then the second element, then two times the first element again, the second element again, the first element, and so on. Applied to our crabs data, I would take a vector like c("blue", "orange"), take column 1 of the data set — the species column, crabs$sp — and put that into a subsetting expression like that; then I get labels of blue and orange, which color the points in my plot accordingly.

[Question:] When you did this, you were coding R and W to red and white — but when you do the index like that, does it inherently understand that R maps to red? No, it doesn't. That's an example that could possibly apply to factors, and I don't know whether, when you query a named vector with a factor, it queries by the contents — i.e. the numbers — or by the levels; I would need to try that out. But let's not talk a lot more about factors; this is just one example. The vector doesn't need to be named. It can be — and if we name it, then we can use letters, which I'll do in the second example — but we can also just use numbers: map something into numbers and then start pulling out and translating numbers into elements of a vector. It's completely general; I can use logical expressions, I can use whatever. Translating with named vectors, by using names, is where we had our example of B and O and blue and orange.
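As a concrete sketch of this index-vector translation (names and values as in the lecture; the crabs line assumes MASS):

```r
# named translation vector: "R" -> "red", "W" -> "white"
myValues <- c(R = "red", W = "white")

# an index vector may be longer than the vector it subsets;
# elements are simply picked out repeatedly by position
in1 <- c(1, 2, 1, 1, 2, 1)
unname(myValues[in1])    # "red" "white" "red" "red" "white" "red"

# applied to the crabs data: translate the species factor into colors
library(MASS)
spCols <- c("blue", "orange")[as.integer(crabs$sp)]
```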
If my input vector is the letters W, W, R, R, W, R, and I put that into the subsetting expression for my values, I get the same kind of result — actually the opposite one here: white, white, red, red, white, red. Now, if my values vector had been "blue" and "orange", named with B and O — my values are c(B = "blue", O = "orange") — I get blue and orange labels straight from the contents of the first column of my crabs data.

And now I'm actually curious. Species is a factor with two levels, B and O, where the number 1 has the level B and the number 2 has the level O — which happens to coincide with what I've defined here. So how do I find out whether the factor selects by level or by index? I just turn it around: now my oranges are index 1 and my blues are index 2. If the factor queries by level, I get the correct result, because the names still match; if it queries by its contents, I get the wrong result. Any bets on what will happen? I have no idea. Greg, what do you think? Lauren?

Okay — so what happened? Why didn't this give us the correct result rather than the inverted one? Because internally a factor is a vector of integers, and the integers are merely mapped to level labels. What the subsetting expression sees is the integers, and it works with the integers; it doesn't care about the levels. The subsetting expression is not smart enough to recognize that this is a factor and then work with the levels as element names — you'd need to do that by hand. Can we do it by hand? Of course: we just need to change our factor into a character vector of its levels with as.character(). Hang on — there's some weirdness here; what did I just do? My values... crabs$sp... right: it went wrong at first, and now, when we use as.character(), it comes out correct.
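The experiment just described can be reproduced in a few lines; the names are deliberately swapped relative to the level order, so the two behaviors give different answers:

```r
myValues <- c(O = "orange", B = "blue")   # names swapped vs. level order
f <- factor(c("B", "O"))                  # levels sort to B, O -> codes 1, 2

# subsetting with the bare factor uses its integer codes, not its levels:
unname(myValues[f])                  # "orange" "blue"   -- wrong mapping
# converting to character subsets by element name instead:
unname(myValues[as.character(f)])    # "blue" "orange"   -- correct mapping
```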
If we're using a subsetting expression, we can subset either by index or by element name; if we name the vector — myValues, as we did — we choose by element name. Now you might ask: why can't we just choose by the value itself? That would be something like this: directly choosing by content. We'd have a result vector of character elements, as many as we have rows, and then I'd ask which of my values are equal to the string "B". That gives a logical vector which I can use for subsetting, and then I assign the word "blue" to all of the matches, and conversely for orange. This goes by the contents of my translation vector, but it's less efficient, because we have to do this comparison every time. It's more efficient to use indices or — in the most general way — to use what we want to choose from as element names, and use that to map.

This comes up all the time, in many situations: mapping values to color or shape or size in a plot. Where I use it constantly is to map accession IDs to gene names: I have a little mapping vector that maps HUGO gene symbols to Ensembl gene names, for example, and then I just pull my HUGO gene symbols out of a very large table and everything gets translated.

Okay, now how do we actually do this in practice? I've written a function, crabsPlot, which should be loaded from your R folder; if it's not loaded, just run the init code and then it is. It's a function that takes parameters x, y, species, sex, and age, and plots all of that. There are three things I want to vary here. One is the plotting character, and that works exactly the way we've just discussed. Another is the colors, and in this case I want to be a little more sophisticated.
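The comparison-based alternative described above — building the result by logical subsetting — might look like this, with a hypothetical species vector:

```r
sp <- c("B", "O", "B", "O")         # contents of a species column
spCols <- character(length(sp))     # empty result vector, one slot per row

spCols[sp == "B"] <- "blue"         # assign to all matches of "B"
spCols[sp == "O"] <- "orange"       # ... and of "O"
spCols                              # "blue" "orange" "blue" "orange"
```

This works, but it does one full comparison pass per category; the named-vector lookup does the whole translation in a single subsetting expression.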
Not just one blue and one orange: I want different oranges based on whether they're male or female, and different blues likewise — so I want to consider both columns, the species and the sex, at the same time. And the third thing is that I want to take size into account: plot the large ones with large characters and the small ones with small characters. Let me walk you through what I'm doing here.

First, creating a vector of plotting characters. In this case I'm plotting filled characters — plotting characters 21 and 24, which take a background color. My pchSet is c(24, 21): 24, the triangle, is for the females, and 21, the circle, is for the males. My crab plotting character is then pchSet indexed by as.integer() of whatever I feed the function as the sex vector — so if I'm putting in a factor, I'm taking the ones and twos and doing this by index. I could also have said as.character(sex), and then I would have had to name my pchSet; that's probably the safer way to do it, because I can't actually make guarantees about the order of the levels in an unordered factor.

Okay, that's the simple part. A little more involved: I have a color set of four — two different blues and two different oranges — that I want to select from based on two different columns. Otherwise it would work in exactly the same way, but I said I wanted to be a little more sophisticated here, so let's combine the two. The idea is really very simple: assume we can treat these as binaries; then I can just combine them as binary numbers. If both are zero I get zero; the other combinations give one, two, and three; and then I just add one to the result. So I can compute myself an index — maybe I need a little time to figure out exactly how that will work.
From properties in one or more columns I compute myself an index into a vector, and then put the right colors into the right places. The computation: my color index takes as.integer(species) minus 1, which turns the ones and twos into zeros and ones, multiplies that by 2, and then adds as.integer(sex), which is 1 or 2. So if the species term is 0, I get 1 or 2; if it is 2, I get 3 or 4. That gives 1, 2, 3, 4 as the possible combinations of the two factors; I plug that in as my color index, and I just need to make sure the right colors appear in the right slots — and that defines my colors.

Then, creating a scale vector is also very straightforward arithmetic: taking one range and transforming it into another. The first range could be the means of all the measurement columns — taking the row means of my measurements gives me an average of how these crabs grow, which is basically a data-derived stand-in for their age or total body weight. Then I need to define the range I want to map into, and a useful scale for character expansion runs from about 0.5 to 4: 1 is the standard size of a plotted character, 0.5 is half of that, 4 is four times that. These are my sMin and sMax. I do this in two steps, which are a little redundant — I could mangle them together, at the cost of being explicit. First I transform whatever comes in — the "age", or whatever numbers appear in that column — into the range from 0 to 1, by subtracting the minimum value and dividing by the difference between the maximum and the minimum. That's how you scale an arbitrary range into the interval [0, 1]: subtract the smallest value, then divide by (max − min) — here, (age − min(age)) / (max(age) − min(age)).
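Both computations — the binary color index and the two-step rescaling — can be sketched with toy values. The variable names (colIdx, sMin, sMax, and so on) follow the lecture's description; the data are made up:

```r
species <- factor(c("B", "B", "O", "O"))
sex     <- factor(c("F", "M", "F", "M"))

# combine two two-level factors into one index 1..4
colIdx   <- (as.integer(species) - 1) * 2 + as.integer(sex)   # 1 2 3 4
colorSet <- c("lightblue", "blue", "gold", "darkorange")
crabCols <- colorSet[colIdx]

# rescale an arbitrary range into [0, 1], then into [sMin, sMax]
age  <- c(2, 4, 6, 10)                               # stand-in values
sMin <- 0.5
sMax <- 4
x01     <- (age - min(age)) / (max(age) - min(age))  # 0 ... 1
crabCex <- x01 * (sMax - sMin) + sMin                # 0.5 ... 4
```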
That transforms the values onto that scale. Now I have zeros to ones, and I want to get them into the range 0.5 to 4, so I multiply by (sMax − sMin) and add sMin — I take whatever values I have and scale them into a different range — and I can use that to define the cex. Then I just plot, with main and the x- and y-axis labels: my plotting character is crabPch, which we defined here; my col is black, so they all get black borders; the background color of the filled plotting characters is crabCols, which I defined by combining the two columns into four different values; and my character expansion factor is crabCex, which I scaled here.

Now, what does this work of coding art actually do? crabsPlot... oh my god, how do I need to call it? x is pcaS$x, all rows, column 2; y is pcaS$x, all rows, column 3. What else do I need? The next one is species, which is crabs$sp; the next one is sex; the next is age — and we said that as a stand-in for age we'll just take the row means of all the data columns. (There's also colMeans; let me check whether this actually works — yep, that gives me values.) And then main, xlab, and ylab. If you look at the function signature, x, y, species, sex, and age have no defaults, so if I don't supply those values the function won't run; but main, xlab, and ylab default to an empty string, so if I don't put anything there it plots with an empty label, and if I do, the empty string is overwritten and I get an actual value. So xlab is "PC2", ylab is "PC3" — and plotted, and there you go. Oh my god, PLOS Biology, here I come.

Now, what does that tell us? I could have used slightly nicer colors, but what do we see here? Something we haven't looked at before is the size of the characters — the cex.
Is it all over the place, or is there a trend? Data interpretation: you can immediately infer an interesting piece of biology from this plot. The larger — the older — they get, the more morphometrically distinct they become: they express their differences with size, especially noticeably on the axis that separates the triangles from the circles. So the sex differences are strongly dependent on age; when they're very small, they all cluster together. The species separation is less dependent on size, but the difference between the triangles and the circles gets greater and greater the larger the animals get. The species difference has an obvious biological basis, whereas the fact that the sex differences scale with size is rather less obvious. So even though I think the clustering is more distinct in the unscaled plot, the scaled plot is more informative in terms of the biological interpretation; we still see a tendency, especially for the blue ones, for the overlap to be greater for very small specimens. (This is the unscaled one; these are the originals.) All right — saved, committed, uploaded, available for your perusal.

Now, in my dimension-reduction script I would next do a PCA analysis of the LPS data set, but in fact that's more of the same — maybe even less enjoyable, because the end result is more or less that PCA doesn't give us very good separation of the LPS data set. We can find a few interesting things there, but I'd like to move on, so you can just do that script at home; I think it's fairly self-explanatory. I'd like to talk about two other ways of doing dimension reduction, one of which I haven't written into the script, because we actually did it yesterday when we were doing regressions: looking at correlations against models. If I compute correlations with a model, what I'm essentially doing is taking a high-dimensional data set and projecting it along a dimension defined by that model.
Just as every original dimension contributes to a principal component, when I correlate an expression data set with, say, a sine wave, every single time point contributes to that correlation with a different strength — and that also gives me a projection. If I take two different models — a sine wave and a decay function, or a high-frequency and a low-frequency sine wave, or two phase-shifted sine waves — compute the correlations, and plot the correlations derived from one model against those from the other as a scatter plot, then I'm doing exactly the same thing as a PCA with two dimensions: projecting into a low-dimensional space. That can be very useful when we have strong, reasonable, and valid assumptions about what the underlying models that generated our data could be. The number of dimensions you retain then just depends on the number of models you plot against, and of course you can mix approaches: plot one dimension against a sine-wave correlation and another against the second dimension of my PCA. There's no limit to this as an exploratory method — whatever way I can spread out, project, and analyze my data might be useful to discover structure in it.

Now, there's an approach called t-SNE — t-distributed stochastic neighbor embedding. You didn't do that yesterday, right? Okay. t-SNE is one of the things that our rock-star faculty member Geoff Hinton developed (with Laurens van der Maaten; published in 2008), and it's an embedding method for high-dimensional data. What t-SNE does is calculate a statistic of distance in high-dimensional space, and then it stochastically finds projections into a low-dimensional space that preserve the distances between points in the best possible way. So this is a stochastic, randomized method, and again the resulting dimensions don't have any intrinsic interpretation.
The only guarantee is that when you project onto these dimensions, the distances between points stay reasonably intact — which means that similar points end up close together on the two-dimensional plot. [To a question:] I'm not sure I can answer that. I can say that yes, that's exactly what it looks like, but I don't think there's a mathematical guarantee — I don't think there's a very predictable relationship between, say, Euclidean distance in the high-dimensional space and the actual distance between points in a two-dimensional projection.

But let's look at tsne and what it does. It's a package, and I have my normal idiom here: if (!require(tsne)) { install.packages("tsne"); library(tsne) }. If you run that, by the way, after it successfully installs you will see a warning saying the package is not available. That might lead you to think something went wrong, but that's not actually the case — I just can't turn it off without major surgery and resetting global options, which I don't want to do. What happens is that quietly = TRUE only applies when the script is sourced: when you source the script, it's actually quiet and doesn't throw a warning; but if you execute this interactively, the warning is raised at the moment require() runs, because the package doesn't exist yet. require() returns FALSE because it couldn't find the package, but it also raises the warning, and that warning then floats around: R first installs the package, then loads the library, then everything is done, then it comes out of the if block, takes a breather, and asks — do I have anything else to do? Oh, there's a warning; I have to show it. So you see the warning even though it's no longer relevant; but of course, at that point it can be confusing.
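The idiom in question, plus a small demonstration of why the late warning appears: require() returns FALSE and raises a warning when a package is missing. The package name below is deliberately nonexistent, and the install lines are commented out so this sketch doesn't touch your library:

```r
# the install-or-load idiom from the course scripts:
# if (!require(tsne, quietly = TRUE)) {
#   install.packages("tsne")
#   library(tsne)
# }

# require() returns FALSE for a missing package and raises the warning
# that later floats to the top after the install completes
ok <- suppressWarnings(require(noSuchPackage12345, quietly = TRUE))
ok    # FALSE
```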
So if you run this, it installs, and then it warns that no such package is available: don't worry, that's just a side effect of require(). If it says "error", that shouldn't happen.

You can — and should — check the contents of a newly installed package, with three little idioms that are really useful. library(help = tsne) gives me a help page with the description of the package and the functions it contains; the tsne package actually contains only one function, tsne() itself. Packages usually have so-called vignettes; this one doesn't. Vignettes are basically PDFs: well-done vignettes have nicely worked-out examples and explanations of what on earth the package does, while not-so-well-done vignettes are simply a PDF printout of the help pages, which are often less than helpful for newcomers — very technical, assuming you already know what you're doing. Some packages don't have vignettes, like this one, because they're assumed to be more or less self-explanatory. And some packages have attached data sets. tsne doesn't, but MASS, for example, does — that's where we loaded the crabs data from, and MASS has a lot of data sets you can experiment and play with. So: explore your packages — use library(), use browseVignettes(), and use data().

Now, tsne runs stochastically over many iterations, and it pauses every so often to plot what it's been doing. It has so-called epochs — by default, I believe, 100 iterations — and after each epoch it triggers a callback. The epoch_callback parameter needs a function, which we must define: a useful function that plots our data in an informative way. The plotting function here is tsnePlot, a function of x, which calls crabsPlot with x dimension 1 and x dimension 2. Note that these are not the x's from the principal component analysis — they are the coordinates that the t-SNE embedding produces.
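The three exploration idioms mentioned above, shown here with MASS as the example package (tsne itself ships neither vignettes nor data sets); assigning the results keeps pagers and browsers from opening:

```r
pkgInfo  <- library(help = MASS)      # description and function index
vigs     <- browseVignettes("MASS")   # vignettes, if the package has any
dataSets <- data(package = "MASS")    # attached data sets

# the crabs data we have been using lives in MASS
"crabs" %in% dataSets$results[, "Item"]   # TRUE
```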
Then I pass column 1 of crabs, column 2 of crabs, and the first dimension of the PCA of the crabs — actually, that wouldn't work here, because we've now called it pcaS; if you want to run this, you'll need to make that change. [Question] No — I could use anything; I just need these parameters for my plotting function, because the crabsPlot function needs them. Remember that earlier we took the mean of the data columns as a stand-in for age; here I'm just taking the first dimension of the PCA, which should be more or less the same thing. And I give it a title of "crabs tSNE", and so on. So that defines a plotting function. Then I set the seed, and run: tsne on the crabs columns 4 to 8, and whenever you've done a hundred stochastic steps, plot the result. Perplexity and max_iter are parameters — max_iter tells it when to stop, and perplexity... what about it? What is it, actually — can you define it? I can't. [Student:] It leaves you with certain levels of clustering: if your perplexity is smaller, you get more tightly grouped data — set your perplexity to 2 and your data should pair off; at a perplexity of 30 it comes out in somewhat larger chunks. I experiment with different levels. — Yeah, that's always the suggestion here: try it with default parameters, then play around and see what you get. The parameters you play with: you adjust perplexity, and you adjust seeds — this is important, because different seeds do give different results. Anyway, let's run this thing — and that's the embedding. This takes time, and for large data sets it can eat quite a lot of CPU cycles. It's not mathematically principled in any particular way, except that it turns out to usually do a good job of showing you structure in high-dimensional data sets.
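A sketch of the setup just described. crabsPlot is course material I'm paraphrasing, so the callback below is a simplified stand-in that codes sex by shape and species by color; the tsne call itself is commented out because it is slow and needs the tsne package installed:

```r
library(MASS)   # crabs

# epoch callback: invoked after every epoch (by default 100 iterations);
# x holds the current embedding coordinates, one row per observation
tsnePlot <- function(x) {
  plot(x[, 1], x[, 2],
       pch = c(24, 21)[as.integer(crabs$sex)],        # triangle F, circle M
       bg  = c("blue", "orange")[as.integer(crabs$sp)],
       xlab = "", ylab = "", main = "crabs tSNE")
}

set.seed(112358)   # an arbitrary seed; different seeds give different embeddings
# library(tsne)
# emb <- tsne(crabs[, 4:8], epoch_callback = tsnePlot,
#             perplexity = 30, max_iter = 1000)
```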
Now, just as in PCA — even though this is a completely different algorithm — it turns out that when we project along a good separating dimension in two dimensions, the greatest differences between points again appear where the overall age is larger. Let's try this again with a different seed: the first iteration doesn't produce anything meaningful, and then it starts finding things — actually, this one is quite a bit better. So that's t-SNE, t-distributed stochastic neighbor embedding.

[Question about why people use it:] I don't think there's a principled reason, but what I see recently is that many of the single-cell RNA-seq papers use t-SNE, simply because it gives a nice global separation and clustering of the data. The way to work with it — and I think we'll come back to this in the clustering unit — is this: you plot something and get a clustered structure; then you apply some kind of labeling — say, these are monocytes and these are macrophages — and you label or color your plots and see whether the labels cluster together in particular regions of the plot. If they do, the high-dimensional data has structure that supports the labels, which may come from a completely different source. This is the usually very non-trivial interpretation of single-cell RNA-seq experiments, where they actually do exactly this: they see nice clusters in the t-SNE embedding, and then they see that these cell types are here and those cell types are there — and that means the tissues are real. It really means that tissues have significantly and recognizably different gene expression profiles, which is non-trivial. [Question:] No — tissue identity could in principle be defined by something other than gene expression profiles, but it isn't; actually, the way we now think about it, the gene expression profile defines tissue identity, and that is by and large what the single-cell RNA-seq experiments show.

So when you get a cluster — for different genes or whatever it may be — what you need to do is identify those points.
We'll talk about that in the clustering unit. But note that I'm not getting clusters here — I'm getting a two-dimensional plot of my data. This says nothing about clusters; this is not clustering. We see clusters because there's structure in the data, but clustering means actually doing the mathematics and saying: that point has a label of 2, that point a label of 3, that point a label of 4. t-SNE doesn't tell me about these labels; they're something I put in, because my data were labeled to begin with. What I do get is a very quick way to see whether my high-dimensional data has internal structure that might make something like clustering worthwhile. If I see this, I immediately know: clustering is going to work, we're going to find ways to break this apart. If I only see a single point cloud, it would probably be nice to get additional data. So it's a preliminary stage — preliminary, or a way to map experimental data: if this were just a point cloud and I then got information like the red and the green and the blue, I'd be able to map that onto the plot and show that labels derived from some external information actually correspond to the mathematical differences in my high-dimensional measured values, like gene expression.

[Co-instructor:] Yeah — another reason you may want to think about t-SNE versus PCA: PCA is a linear embedding and t-SNE is non-linear. So when you're trying to recover structure in your data that isn't necessarily linear, t-SNE will show it to you, whereas PCA won't; it lets you see more complex structure than PCA does. [Question: Is it always two-dimensional?] No, you can do it in higher dimensions.
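A tiny illustration of that linear-versus-non-linear point, on made-up data: for points on a noisy ring there is no single linear direction that captures the structure, so PCA reports two nearly equal components, while the ring is obvious to the eye (and recoverable by a non-linear method like t-SNE):

```r
set.seed(1)
theta <- runif(500, 0, 2 * pi)
ring  <- cbind(cos(theta), sin(theta)) +
         matrix(rnorm(1000, sd = 0.05), ncol = 2)  # noisy circle

p <- prcomp(ring)
p$sdev                # two nearly equal standard deviations:
                      # no single linear direction is preferred
plot(ring, asp = 1)   # ... yet the structure is plainly there
```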
In fact, I'm going to show you an example of three-dimensional t-SNE, and you can go much higher-dimensional. It may actually be useful as a dimension-reduction method in place of PCA: say you have a hundred-dimensional something and you want to break it down to only five dimensions because you want to do some machine learning with it — you can try t-SNE rather than PCA and see which one captures the variation in your data better. If the variation is not well separable, or not linearly independent, t-SNE may give you something useful where PCA will just tell you that each of the PCs is about equally important.

Which really brings us to the C-word. We've spoken about clusters so much — let's do something with clustering. Our clustering unit is an introduction to clustering, and as the data to play with I'm again going to use expression data. We always seem to be using expression data, simply because it comes as large sets of numbers, and many other experiments don't produce large sets of numbers. And again, I'm sorry, it's cell-cycle data — but this time a human cell-cycle experiment: one cell cycle of HeLa cells. The first part, actually loading the expression data, is derived from GEO2R. I don't know if you've ever used GEO2R before; it's one of the very nice tools on the NCBI web page. This is the GEO data set GSE26922 — cell-cycle expression profiles in HeLa cells, published in 2012 — and there's an option to analyze such data with GEO2R. This is microarray data; unfortunately, I don't think they've yet built a tool to analyze RNA-seq data with GEO2R, so that's something you would need to do at home. If you ever need to analyze microarray data with GEO2R, you can at least get started with the code from the NCBI site; if you need to do this with RNA-seq data, the Bioconductor project has a lot of tutorials and help on how to analyze RNA-seq data — there are very well-described workflows and standard operations there.
something with GEO2R, the first thing we want to do is define groups. Here we could call the groups t0, t2, t4, t6, t8 and t12, and for each group select the three columns that belong to it, because those are biological replicates of the same condition, so they should all have more or less the same values. A way to analyze data like this is to average the biological replicates and then look at that information. So the first thing is to define my groups, and the second thing is to look at the value distributions of my groups, which is this here. They are very nicely distributed; they go from about 4 to 16. So what are these values? They're not raw counts; I think these are already log-transformed in some way, maybe log ratios to some standard or reference. That's fine, I don't need to scale them in any particular way. What you're looking for, in an experiment that can be well interpreted, is that the samples all have approximately the same mean and standard deviation: in RNA expression experiments, on average nothing should happen; it's only a small number of genes for which something actually changes. Incidentally, what we are doing here is treating the samples as independent. This is not actually a time-series analysis; there is additional statistical information in the fact that the samples are ordered in a time series, which we are not exploiting here. What I can then do is calculate a table of the top 250 differentially expressed genes. Given these groups, GEO2R gets me the top differentially expressed genes according to the p-value of the differences, or rather an adjusted p-value, which takes multiple testing into account; we may talk a little more about that this afternoon. It gives you the scores for that, the ID, which is the ID of the spot, and it gives you an actual expression value
profile here, and indeed you see this is one that is down, then goes up, and then comes down again, so it does look somewhat differentially expressed. Importantly, what you will always find among the high-scoring genes is that all of the individual values are quite similar within each group, and that's because significance in this game doesn't just mean the change is large: it means there is a small standard deviation of the individual measurements relative to the change. The small standard deviation is really important. If, by some chance of measurement, I have three absolutely identical measurements of one thing, and then three absolutely identical measurements that are slightly larger, the algorithm will tell me that's a very significant difference, because the standard deviation is so small that I should trust my measurements so much that even tiny differences become highly significant. That's a bit of a downside: the top genes are not the strongest differences, they're the ones that are most statistically significant given that the measurements have a certain standard deviation. We also get the gene symbol and gene title. If we look at the second gene here, it shows a similar profile, but it correlates in a different way. If we scroll to the bottom, this is one of the less significant ones: we still have a clear trend across our groups, but slightly higher variation in the individual measurements, and that makes the result less trustworthy to the algorithm. So this is a standard type of differential expression analysis, and the very cool thing about it is that it automatically generates an R script which you can copy, paste, use at home, and modify to your heart's content for your own purposes. It's a little bit salty in the way that it works, but what I've done is essentially to take this R script for GSE26922 and paste it in here, and we'll just go through it very briefly. So what we
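The point that significance rewards small within-group spread rather than large change can be seen with a toy t-test in base R (all numbers invented for illustration):

```r
# Two groups with a tiny difference but almost no within-group spread:
tight_a <- c(10.00, 10.01, 9.99)
tight_b <- c(10.20, 10.21, 10.19)

# Two groups with a much larger difference but noisy replicates:
noisy_a <- c(8, 12, 10)
noisy_b <- c(12, 16, 14)

p_tight <- t.test(tight_a, tight_b)$p.value
p_noisy <- t.test(noisy_a, noisy_b)$p.value

# The 0.2-unit change with tiny standard deviation scores as far more
# significant than the 4-unit change with noisy measurements:
p_tight < p_noisy
```

This is exactly why the top rows of a differential-expression table tend to have eerily consistent replicate values rather than the largest fold changes.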
need for that are Bioconductor libraries. Bioconductor libraries, or packages, are installed in a different way from CRAN packages. Whereas from CRAN you install things right away as they are, for Bioconductor you source a file called biocLite.R from the Bioconductor project site. Sourcing this file installs the function biocLite(), and that function handles the installation of packages. So then we call biocLite() for Biobase, for GEOquery, and for limma; these are the packages needed to download the data and build these differential tables. Now, when you do that and you already have an existing Bioconductor installation on your system, more often than not it will say: I'm finding old packages here, do you want me to update them? And you have a number of choices: yes, no and all, where yes means just the one, no means nothing, and all means all of them. At that point I always type a for all, because Bioconductor packages rely heavily on each other in different ways; if some of them are outdated, they might not play well with newer versions of other packages. So it's a good idea to keep all Bioconductor packages at approximately the same level: when it asks you yes, no or all, you say all. Then it may tell you that some of them need to be compiled from source, because there is no pre-compiled version at the Bioconductor repository. To that, always reply no, unless you know what you're doing. You might need a bleeding-edge development version that is actually compiled from source, but in general the newest pre-compiled versions available at Bioconductor are what would normally be called a stable release. So: if you're asked about package updates, all; if you're asked about compiling from source, no. That's only if you really need the very newest, you know that you need the very newest, you know why you need it, and you
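The installation route described above looks like this. Note that this is the historical biocLite workflow from the lecture; current Bioconductor versions use `install.packages("BiocManager")` followed by `BiocManager::install()` instead:

```r
# Historical Bioconductor installation route (requires network access):
source("https://bioconductor.org/biocLite.R")  # defines biocLite()
biocLite("Biobase")
biocLite("GEOquery")
biocLite("limma")
# When prompted "Update all/some/none?": answer a (all).
# When asked whether to install packages from source: answer n (no).
```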
know that your system actually has the correct compilers and tools to build it on your machine. Once you've done that, you have these packages installed, and then, just like with packages from CRAN, we say library(Biobase), library(GEOquery), library(limma). So let's see what I do here: Biobase... yes, it asked me update all/some/none. In this case I actually say none; I can be slow about it because I did this yesterday, so nothing much will have changed since then. Okay. If that works for you, you may be able to download the data set GSE26922 from GEO, and you may not be; that's really unpredictable. I don't know why this is so patchy, especially in workshops. I get the impression that the NCBI site notices that many people from the same domain are trying to get the same large data set at the same time, thinks there's something fishy going on, like a DDoS attack, and starts shutting it down. So it may work, it may not. The standard way of doing it: gset, which is simply the variable name I use here, is the result of the function getGEO() with the GEO data set name. If it doesn't work for you, the data set is actually in your data folder; I've compressed it, and it's large-ish, as is to be expected: GSE26922.RData. So you can either download it from NCBI, or, if that doesn't go without a problem, just select and execute the load() call for GSE26922.RData. If you do download it, you still have to pull the first element out of a container list, so execute lines 114 to 119 as well. And yes, of course you can import RNA-seq data, but the way you import RNA-seq data from GEO is usually different: it's not these annotated expression sets, it's usually just a tab-delimited file of the expression values per gene, which you download via FTP and read in with read.delim() or read.csv(). So if I do a head() of that, it tells me this is an ExpressionSet with assay data; it tells me
that there are 18 different samples, corresponding to three samples each for the six time points, plus feature data, annotation, and so on. If I look into the internals of this thing and do str(), you will see something that is a fair bit more complex than the data structures we've seen before. These are S4 objects. R actually has two different systems for real object-oriented programming: one is the S3 class system, one is the S4 class system. These are all S4, and the syntax to get stuff out of them is different, but it's possible; you can pull things out of Bioconductor objects. Normally, though, you leave these objects alone and depend on other Bioconductor functions to know how they are structured and what to reasonably do with them. So these are a bit more complex than what we've worked with so far. Right, so then I simply go over the code: it needs column names, it needs group names, it needs to be told which samples correspond to which groups, so groups zero to five here. Then it does a log2 transform. Now, I'm not sure that's actually a good thing to do, to log-transform our data again; I think it has already been log-transformed. Would we get different results if we didn't do that? I haven't experimented with it, but the script from the GEO2R website did this log transform. It does a bit of quantile normalization, it sets up the group names as factors, attaches the description to a slot called description, then it builds the so-called model matrix, where it builds a matrix of the measurements based on our group definitions, defines model names, and then it uses limma's lmFit() to fit linear models to these group variances, building a contrast matrix, using empirical Bayes to find the most significant genes, and then adjusting for multiple testing by
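To make the S3-versus-S4 distinction concrete, here is a minimal S4 class, loosely mimicking how Bioconductor's ExpressionSet bundles an expression matrix with metadata. The class and accessor names are made up for this sketch; they are not Biobase's real ones:

```r
library(methods)

# A tiny S4 class with two typed slots (names are invented):
setClass("MiniSet",
         representation(exprs = "matrix", annotation = "character"))

ms <- new("MiniSet",
          exprs = matrix(1:6, nrow = 2),
          annotation = "demo platform")

ms@annotation   # S4 slots are accessed with @, not $

# The idiomatic Bioconductor style is an accessor method, the way
# Biobase's exprs() pulls the expression matrix out of an ExpressionSet:
setGeneric("getExprs", function(object) standardGeneric("getExprs"))
setMethod("getExprs", "MiniSet", function(object) object@exprs)

getExprs(ms)    # accessor hides where the data lives inside the object
```

This is why, with real Bioconductor objects, you normally call the provided accessors rather than digging into the slots yourself.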
the false discovery rate. There are two basic ways to correct for multiple testing: one is the Bonferroni correction and one is FDR. This script uses FDR, which is more appropriate for data sets of this size. Really, this is all just code from GEO2R. The result is an object called tt; I believe we may be using it this afternoon, so try not to delete it from your workspace. All right. So far that code has returned to us, in tt, the top 250 differentially expressed genes across the groups we defined, and now we can explore the data. The top gene has the ID 8117594; I believe these are spot IDs on the microarray. So what are the original expression values? We can take this ID and pass it to a function: exprs() is a Bioconductor function. I told you the data is hidden somewhere in this object; exprs() knows where to find the actual expression values and how to pull them out. It's a method associated with the data set, and with this ID I can pull the row out, look at the expression values, or plot them as a bar plot, which looks similar to the plots we saw on the GEO2R website. Across groups zero, one, two, three, four and five, the expression values are relatively constant, then they go up and then slowly come down, and each triplet of three replicates is highly similar. The same goes for the fifth-highest gene, and so on. If we want to return the data rows for the top three differentially expressed genes, we could just use these row names, or alternatively we could index the ID column of the table, which gives the same thing. Now I want to prepare this data for cluster analysis. For cluster analysis we usually have a similar consideration as for PCA: scale might matter, the absolute values might matter, and we want to
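Base R exposes both correction methods through p.adjust(); the Benjamini-Hochberg "BH" method is the usual FDR adjustment. A quick comparison on invented p-values shows why FDR is less conservative on larger tables:

```r
# Toy p-values from several tests (invented for illustration):
p <- c(0.0001, 0.004, 0.01, 0.03, 0.25, 0.70)

p_bonf <- p.adjust(p, method = "bonferroni")  # p * n, capped at 1
p_fdr  <- p.adjust(p, method = "BH")          # Benjamini-Hochberg FDR

rbind(raw = p, bonferroni = p_bonf, fdr = p_fdr)
```

At a 0.05 cutoff, Bonferroni keeps only the two smallest p-values here, while the FDR adjustment keeps four; Bonferroni controls the chance of even one false positive, FDR controls the expected fraction of false positives among the hits.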
make sure that, unless our algorithm is robust to them, we have no NA (not-available) values, and so on. It's also useful to make a table that contains only the numbers; otherwise I always need to specify which columns I want to use. Now we're going beyond the generated script: we are preparing the data for cluster analysis. To prepare it, I define a vector of gene symbols which I can use to identify my data. Then I build a matrix by iterating through all of the rows and, for each row, taking in turn the three values of columns one to three, four to six, seven to nine, and so on, and averaging them. So I'm taking the means of the biological replicates for all of the columns in my data set. Then I give my data column names, and then I could just apply the IDs or the gene symbols as row names. But if I apply the gene symbols as row names, I have to be sure that none of them are repeated: row names have to be unique. So before assigning gene symbols as row names, I need to check whether any of them are duplicated, and for that I can apply the function duplicated() to the gene symbols. Here I indeed find that eleven of my symbols are duplicated. Some of them are the empty string, so some spots do not have a gene symbol because they correspond to technical controls, and others are duplicated presumably because the spots are not unique. Something like HSPA1B/HSPA1A probably means that the probe recognizes transcripts from both of these, presumably paralogs. That may cause problems, so I massage my data a little by throwing away everything that is duplicated, i.e. taking all the rows for which the gene symbols are not duplicated, and taking from the row names of my data set the not-duplicated row names. We'll also remove any spots for isoforms, i.e.
where we find the forward slash in the gene symbol, using the grep() function. So this is now our expression data set: we've taken averages of the individual measurements, we've made sure that all rows are unique and uniquely represented by gene symbols, and we've thrown away everything that cannot be uniquely mapped to a gene. This is the kind of preparation you usually need to do before any meaningful exploratory or clustering analysis. You don't need to execute the next part: if you want to make a local copy of the data, you can just write it out and read it back in, or make a binary object and write and read that. If you have large data sets like this that need to go through a script of repeated error checks and calculations, it's a good idea every now and then to save a version, so you don't have to recreate it every single time. And then we can analyze this for clustering. The very first thing I do is plot a heat map. So that's a heat map; what does it tell you? What do we see here, what are the rows, what are the columns, what are these thingies in the margins? The rows are the genes, right? The fuzziness here corresponds to the gene symbols: we started out with 250, but some were redundant and we removed them, so we have 223 rows, 223 gene names, which at this resolution I can't read. The columns are, exactly, the samples: time zero here, time 12, times two, four, six and eight. And what are these things here? These are called dendrograms, from the Greek word for tree, dendron. They are tree diagrams, and they tell us how this heat map has been sorted: the heat map has been sorted in a way that places similar columns and similar rows next to each other, and the algorithm that determines this similarity and decides which similar things come next to each other works by building such dendrograms
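The replicate-averaging and de-duplication steps described above can be sketched in base R on synthetic data (the probe symbols and values here are invented; the real script works on the GSE26922 expression set):

```r
set.seed(1)

# Synthetic stand-in: 5 probes x 18 samples (6 time points x 3 replicates)
ex  <- matrix(rnorm(5 * 18), nrow = 5)
sym <- c("GENE1", "GENE2", "GENE2", "", "ISO1/ISO2")  # made-up symbols

# Average each triplet of biological replicates into one column
avg <- sapply(seq(1, 17, by = 3), function(j) rowMeans(ex[, j:(j + 2)]))
colnames(avg) <- paste0("t", c(0, 2, 4, 6, 8, 12))

# Keep only rows whose symbol is not duplicated, not empty,
# and not a multi-gene (slash-containing) probe
keep <- !duplicated(sym) & sym != "" & !grepl("/", sym)
avg  <- avg[keep, ]
rownames(avg) <- sym[keep]   # now safe: row names are unique
```

After this, the matrix has one row per uniquely mapped gene and one column per time point, which is the shape the clustering functions expect.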
and then arranging the rows and columns accordingly. Does the order matter? No, they're independent: you can do the columns first and then the rows, because columns and rows are treated as independent, so the order in which they enter the calculation does not matter. Any permutation of this dendrogram would be equally valid, and that raises an interesting question that I haven't actually been able to answer. If you consider a dendrogram, say this part here: topologically, a dendrogram in which I rotate around one of these branches is exactly the same dendrogram. It doesn't look the same, but it carries exactly the same information; rotation around a branch adds no information. So you could imagine that I take this whole upper block and just flip it, and that would correspond to fundamentally the same heat map; it's just a rotation around this branch. The question then is: what does it mean that these two rows are adjacent here? It doesn't actually mean anything. Adjacency only means something if the rows are closely connected in the dendrogram; if they are distant in the dendrogram, the fact that they look similar is deceiving. What I suspect happens is that the layout algorithm has to make a choice of how to rotate these branches for display, and it tries to rotate things so that similar things end up close to each other, which in this case can be deceiving: the two rows here and here are in fact not similar, as evidenced by the dendrogram, where they are quite far apart, yet they look similar because of how the rotation was chosen. What are we looking for in a heat map like this? The dendrograms are built from the similarity of these vectors: the row dendrogram is built
from the similarity of the six-dimensional vectors that hold each gene's expression profile across the conditions, and along the other axis the heat map is sorted by the 223-dimensional vectors that hold one condition's values for all of the genes. So what are we looking for here? First of all, we can ask, just from looking at this, which of these conditions and which of these genes are similar. It would mean that all the genes up here vary in their expression values in a similar way across the cell cycle. If I compare the columns, this tells me which of the conditions are similar: conditions t0 and t12 turn out to be similar, which, being the beginning and the end of the cell cycle, maybe makes sense; conditions t2 and t4 are similar, and this is where expression starts increasing and this is where it decreases, so that's also plausible. So the grouping here would vaguely correspond to our understanding of the biology. We can simplify this a bit by doing the same heat map for only every fifth gene, randomly chosen. Since the genes are randomly chosen, it turns out slightly differently but in a similar way; in this case t0, t8 and t6 appear to be more similar than t12, t2 and t4. Okay, so this is the first view of the data. In order to cluster it, what we're looking for is block structure. Pronounced block structure means that we have a way to arrange the data that brings similar genes and similar expression values close to each other; that's what clustering will hopefully achieve, and that's what we will do in the next step, after lunch.
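A minimal sketch of such a clustered heat map can be made with base R's heatmap(), which builds the row and column dendrograms with hclust() automatically. The data here are simulated, with a shared pattern planted in half the "genes" so some block structure appears:

```r
set.seed(42)

# 20 fake "genes" x 6 "time points"; give the first 10 a shared
# up-then-down pattern so that a block becomes visible
m <- matrix(rnorm(20 * 6), nrow = 20)
m[1:10, ] <- m[1:10, ] + rep(c(-2, -1, 0, 2, 1, 0), each = 10)
colnames(m) <- paste0("t", c(0, 2, 4, 6, 8, 12))
rownames(m) <- paste0("gene", 1:20)

# heatmap() clusters rows and columns and returns, invisibly,
# the orderings it used to sort them
h <- heatmap(m, scale = "row")
h$rowInd   # the permutation of the 20 genes along the rows
```

The rowInd and colInd components record exactly the dendrogram-driven reordering discussed above; the heat map itself is just the data matrix redrawn in that order.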