Hello everybody, and welcome to the single-cell plant dataset tutorial, where we will be going through the tutorial given under Transcriptomics. If you scroll down to the single-cell section, it is the analysis of plant scRNA-seq data with Scanpy. You just click on this "Follow the tutorial" button here, and it launches the tutorial that we will be doing today. For now, what I would like you to do is just copy the title of this tutorial, click outside of the learning rectangle, and then create a new history. You can create a new history by clicking on this, and then you can rename your history by pasting in the title of the tutorial that we are currently discussing. Let's go back to the tutorial by clicking on this tab. In this tutorial we will be learning very much the same tools that we learned in the "Clustering 3K PBMCs with Scanpy" tutorial; if you are not yet familiar with that one, I recommend that you do it first. We will be using the same tools, but this time we will be looking at a plant dataset, one taken from this paper from 2019. In this paper they discovered a specific clustering and a particular differentiation pathway, where stem cells differentiate into trichoblasts as well as potentially into other cell types. We wish to recapitulate this same type of clustering, but using our own tools in Galaxy via the Scanpy tool suite. So let us first get started by obtaining our data, which we can do by clicking on the data section and seeing that we will be dealing with essentially two datasets: one which is a wild type, and one which is this shortroot (shr) mutant, where essentially the root is shortened. The idea is that it should ideally generate more stem cells in certain cases, which will show better how these cells differentiate into more mature cell types. We can do this by first of all creating a new history, as in the tutorial, if
you haven't done so already, and then we can just copy these two datasets directly from this Zenodo link. If you want, you can click on this Zenodo link and directly see all the files that are associated with this training material, but we don't need all of them; we can just use the datasets that we want. This data was originally taken from GEO, under the identifier GSE123818. However, I did have to change some of the gene names, and I did have to transpose the matrix, which is why you have this new "fixed names and transposed" extension relative to the original dataset. If you wish, you can work with the original dataset in this history, but it is recommended, just for the better, more meaningful names later on, to use this data matrix. So let's go back to Galaxy, copy the text in this file window, and then go back to our history: click on "Upload Data", then "Paste/Fetch data", paste the links that we have copied, and click "Start". We can close that, and as soon as these turn orange we can start working on the next step; and here they have gone orange. If you're doing a live session with us, it might actually be quicker to use the shared data from the data libraries, which should contain these datasets; if you're doing this at home, use the Zenodo links instead, as there would be no real benefit otherwise. Once the datasets are orange, we can begin to apply tags to them. Tags in Galaxy are just a way to assign a dataset some extra information: in particular, information about which input dataset the data that you have processed originated from. The way that we can propagate these tags, such that the output of a given tool shows the same tag as the input of that same tool, is to put a hash just before the tag name. These hashtags are very important. So what we should do now is apply this
shortroot tag and this wild type tag to each of our datasets, with respect to their initial labels. Let us take a look at these datasets: they are green, indicating that they have been imported correctly, and notice that the top one here has this "wt", meaning wild type, and the bottom one has "shr", meaning shortroot. This might not be the same order for you, so be careful when you apply these tags. To apply them, all you need to do is click on the dataset, and this should expand it to give you a more detailed view, including the total number of columns within the dataset; if we go all the way to the right, we see that there are 27,000 columns. For those of you wondering why it took so long to expand the dataset: this is actually quite a large dataset, so the preview window took some time to generate. With this we can add our tags by clicking on this "Edit dataset tags" button, typing in "#wt" for this one, and then pressing Enter or just clicking away. There we go, we have this #wt tag now, and the hashtag means it will be inherited. Then, if we go down, we can click on the second one; again, it might take some time for it to expand, just because of how large this dataset is. Fortunately, this behaviour is very particular to this dataset, and we will work on fixing it in the future to make it a bit more expandable. But it is very important that you have these inherited tags added: they make the analysis much easier to understand. So if we scroll down, we can click on "Edit dataset tags", click there, and this one is shr, so we add "#shr", click away, and it should update the dataset very shortly.
There we go. Okay, now let's go back to the tutorial by clicking on "See Galaxy training materials", and now we should have these two datasets with the shr and wt tags added. If you wish, you can once again expand the datasets to see exactly what the cell barcodes and gene names are, and how many cells and genes are in each dataset, but this is something you can do in your own free time if you're curious about the data. It will become much more apparent to us once we convert them from their current CSV datatype to the AnnData datatype. If you're curious about what exactly an AnnData datatype is, then please click on this link here, which should redirect you back to the relevant section of the previous tutorial. It shows you the AnnData type, where the main matrix is held in the X slot, observations about the cells themselves are kept in a separate obs table, and variables (the genes) are kept in the var table, adjacent to the main matrix. Let's close that for now and do our first step. If you're not yet familiar with these clickable tool buttons, you can actually just click on them and they fill in the parameters for you, which should simplify the task of searching for the appropriate tool in the side panel. For example, I can just click on this and it loads the correct tool and the correct Galaxy tool version, which is very neat. Otherwise, if that doesn't work for you, perhaps because you're using Galaxy from a different domain, you can just type "anndata" into the tool search, scroll down, and find the appropriate tool there by clicking on it.
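To picture the AnnData layout just described, with the main matrix in X, the per-cell annotations in obs, and the per-gene annotations in var, here is a tiny hypothetical stand-in built with plain numpy and pandas (the real object comes from the `anndata` package; the toy sizes here are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for an AnnData object: the expression
# matrix lives in X (cells x genes), per-cell annotations in obs, and
# per-gene annotations in var, mirroring the layout described above.
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(5, 8))          # 5 cells x 8 genes (toy numbers)
obs = pd.DataFrame(index=[f"cell_{i}" for i in range(5)])
var = pd.DataFrame(index=[f"gene_{j}" for j in range(8)])

n_obs, n_vars = X.shape                    # "observations" and "variables"
print(n_obs, n_vars)
```

In the real dataset, `n_obs` would be the 1099 or ~4,000 cells per batch and `n_vars` the ~27,000 genes.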
It should give the exact same tool. So, going back to the tutorial, we see that we need to choose our format, which is the tabular/CSV/TSV file format, and our annotated data matrix: the input files should be both the shr and the wt datasets, and I'll show you how to do these in one step right now. We click on this, we choose tabular/CSV/TSV, and then we can click on this "Multiple datasets" button to select both dataset one and dataset two; you can hold Ctrl to multi-select, or Shift to select a range. Then we hit Execute and just wait for these to become green. This should give us a much more introspectable view of what is happening in these two datasets. Now that enough time has passed, we can see that these are green. I'm going to expand this window slightly, and since I'm not really using this sidebar here, I'm just going to minimize it. We can now look at each individual dataset just by clicking on its preview window, and see that this shortroot dataset has 1099 observations, which in the single-cell case means 1099 cells, and around 27,000 variables, which is the number of genes in the raw input matrix. If we compare that with the wild type, we see that we have about 4,000 cells in the wild type and the exact same number of genes as in the shortroot. So let's go back to the tutorial. Just by looking at the observations and the variables, we get an idea of the dimensionality of the dataset: we have a thousand observations and 27,000 dimensions, or variables, which at some point we will have to reduce to fit into a nice two-dimensional graph showing us which cells cluster with which others. At the moment these are two separate files, as we can see here, and we wish to combine them into one combined AnnData file, which is one of the features of AnnData: you
can have multiple batches, in this case wild type and shortroot, within the same file. So we are going to click on this, and we are going to concatenate along the observation axis: we are going to concatenate the wild type onto the shr as the input. To do this, we click on "Manipulate AnnData", we choose the shr dataset as the input (this should be dataset 3 for me), we concatenate along the observation axis, and we select the wild type, which is dataset 4. The join method here is not actually that relevant, because these have the exact same number of variables, and I know from previous experience that these are the exact same set of variables as well, so we don't actually have to change anything here. But if you're doing a batch merge in the future, you may want to consider performing not an intersection of variables but a union of variables, to get a wider scope of variables which could be driving the analysis. While that is running, let us once again take notice of the scheduling, such that we can queue the next job even while the input job for the tool we wish to run hasn't yet completed. The reason we want to manipulate the AnnData again is because, at the moment, by performing this concatenation, all we have ensured is that we've added a new batch column to our AnnData observations, where the batch for the shr cells is given as 0 and the batch for the wild type cells is given as 1. We wish to make this a bit more concrete by changing these to "shr" and "wt" as explicit text. To do that, we click on "Manipulate AnnData", and we will be renaming the categories of an annotation; the category that we wish to rename is this batch. We copy this "shr,wt" and press Ctrl+C, then go back to the tool: rename categories of annotation, type in "batch", paste "shr,wt", and Execute. Notice that the job is already queued up despite
the fact that the previous job has not yet finished; this is fine. And once again, notice that order does matter: if you have problems later on where things don't quite look the same as in the tutorial, it might be because the order in which you concatenated your initial datasets is not the same order as written here. So please do pay attention that you first selected the shr dataset and then concatenated the wt dataset onto it. We can confirm that our datasets have combined by looking at several things in the new AnnData object: things such as the general file size, which should be at least the sum of the shr and wt individual datasets, and the number of cells, which should again be the sum of the shr and wt cells. If we take a quick look at this new dataset, we see that we do indeed have 5826 observations, and we have this new batch column, which tells us specifically which of the original batches a given cell comes from; so we know that roughly 1100 cells come from the shr batch and at least 4000 come from the wild type. What we need to do now, as in the previous tutorial, is to filter and plot the initial QC metrics of the dataset, so that we can get a better idea of exactly how we wish to filter it: what limits we can impose on it to reduce noise that may appear later in the clustering. First we click on "Inspect and manipulate" and choose to calculate quality control metrics, and we hit Execute. Following that, we then ask to plot with Scanpy, making note to copy and paste these keys for accessing variables, which will be n_genes_by_counts, i.e. the total number of features in a given cell, and total_counts, the total number of messenger RNAs in a given cell. So we Ctrl+C those, click on "Plot with Scanpy", go down, and for the keys for accessing variables we give n_genes_by_counts; then we click on plot. We don't wish to plot all variables.
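The concatenation and batch-renaming steps just described can be sketched in plain numpy and pandas; this is a hypothetical, scaled-down illustration of what the Galaxy tool does, not the tool itself (in the real run the matrices are 1099 x 27,000 and ~4,000 x 27,000, with identical gene names):

```python
import numpy as np
import pandas as pd

# Sketch of concatenating along the observation axis: the two matrices are
# stacked row-wise (cells), and a new "batch" column records which input
# each cell came from: 0 for the first input, 1 for the second.
shr = np.ones((3, 4))                       # 3 SHR cells x 4 genes (toy)
wt = np.zeros((5, 4))                       # 5 WT cells x the same 4 genes

combined = np.vstack([shr, wt])
batch = pd.Categorical([0] * len(shr) + [1] * len(wt))

# Renaming the categories makes the batch labels explicit text instead of 0/1.
batch = batch.rename_categories({0: "shr", 1: "wt"})
print(combined.shape, batch[0], batch[len(batch) - 1])
```

Note that stacking shr first and wt second is exactly the ordering the tutorial asks you to respect.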
We just want a subset, and the keys that we wish are n_genes_by_counts and total_counts. We wish to group our observations by the individual batches, just to see if there are any significant differences between batches; if there are, then this could make things interesting. Is there anything else we need to add? Add a strip plot: yes. Add jitter: yes. The size of the jitter points should be 0.4. So here we go to the violin plot attributes: yes to strip plot, yes to jitter, and the size should be 0.4. Do we display multiple keys in multiple panels? Yes we do, so we click on yes, leave everything else blank, and Execute. We should notice now that by computing the QC metrics we have gained quite a few more annotations; we have 11 or so now. For each cell we have the batch, log1p_n_genes_by_counts, log1p_total_counts, n_genes_by_counts, the percentage of counts in the top 100 genes, the percentage of counts in the top 200 genes, and so forth. So the QC metrics step generates many types of annotations that we can use later on for further analysis, but right now we are just interested in the violin plot, and once it's ready we can simply view it by clicking on the eye symbol, like so. Here we see this violin plot, and we just zoom in slightly. This is n_genes_by_counts, which is the total number of features of a given cell in a given batch. Each dot here is an individual cell; don't worry so much about the spread on the x-axis, this is just randomness added so that cells don't overlap each other too much (they still do, but it's easier to see this way). The most defining feature of a violin plot is showing exactly where most of the counts are distributed. So here we see
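The two QC metrics being plotted have a very simple definition; here is a conceptual numpy version of what the QC-metrics tool computes per cell (toy counts, for illustration only):

```python
import numpy as np

# Conceptual version of the two QC metrics plotted above: for each cell,
# n_genes_by_counts is how many genes have a nonzero count, and
# total_counts is the total number of mRNA molecules captured.
counts = np.array([[0, 3, 1, 0],
                   [2, 0, 0, 0],
                   [5, 1, 2, 7]])           # 3 cells x 4 genes (toy counts)

n_genes_by_counts = (counts > 0).sum(axis=1)    # features per cell
total_counts = counts.sum(axis=1)               # library size per cell
print(n_genes_by_counts, total_counts)
```

So the first cell here expresses 2 genes with 4 total molecules, the second 1 gene with 2, and the third 4 genes with 15.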
So here we see Most of ourselves have around 2000 genes describing them, which is good and we have a few which Describe of at least 12,000 genes characterizing the cell So you might feel like these are outliers that we should get rid of and in most cases you would be correct But I believe in the paper that we are replicating they kept them in so we are also going to keep them in If we look at the wild type we see it's a bit more denser because we notice that there are At least four times as many cells And we also notice that the average number of features is also Slightly higher. You may think it's double the initial but it is This game here is marginable. These datasets are very comparable If we look at a total number of counts Again, we have a lot more cells In the wild type than the short root Most of these are distributed towards the I would guess that's about a Maybe a thousand to two to five thousand Total messenger RNAs in the given cell And unfortunately we do have a significant number of outliers that we can't just simply cut them out and call it a fair analysis So we are likely going to keep most of these outliers in just so that we can Get a few more specific clusters and get a few more specific cell types So this should be the same Violent plot that you see in your in your data so We are now going to filter our data using Similar metrics that you can derive from the violent plot. 
So we definitely want cells that have at least a hundred genes expressed, as the minimum number of genes expressed in a given cell; that seems at least fair. And we also want genes that are expressed in at least two cells: if a gene is only expressed in one cell, we do not care for it that much. The leniency here is actually reasonably high, but this is what is required to replicate the same analysis as in the paper. If you wish, you can play with these thresholds later, after this tutorial, and increase them; I can show you how to do that in a very easy manner later on, but for now let's just stick to the limits given here. We also wish to filter out any cells which have a number of features of 12,000 or more, and any cells which have a total library size of 120,000 or more. Notice that one threshold is 12,000 and the other is 120,000, so be very careful when you do this. Okay: Filter. We are filtering cells by the minimum number of genes expressed, so we click Execute. Then, while that is running, we can go back to Filter again, and this time we are filtering genes by a minimum number of cells, two. Filter cells? No, we want to filter genes; minimum number of cells expressed, at least two; click Execute. Next we wish to manipulate the object directly by copying this key, to filter n_genes_by_counts to less than 12,000. So: filtering observations or variables; we are filtering observations; the key to filter is n_genes_by_counts; and we are filtering to "less than". Then we do the same for the upper limit on the total library size.
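All four filters are boolean masks over the counts matrix; here is a conceptual numpy sketch with the thresholds scaled down to toy values (the tutorial's real values are 100 genes per cell, 2 cells per gene, fewer than 12,000 genes per cell, and fewer than 120,000 total counts per cell):

```python
import numpy as np

# Sketch of the four filters above as boolean masks on a toy counts
# matrix (cells x genes). Thresholds are scaled-down stand-ins.
counts = np.array([[1, 0, 4, 0],
                   [2, 3, 1, 1],
                   [0, 0, 9, 0]])

min_genes, max_genes = 2, 5          # stand-ins for 100 and 12,000
max_counts = 10                      # stand-in for 120,000
min_cells = 2                        # gene must appear in >= 2 cells

n_genes = (counts > 0).sum(axis=1)
totals = counts.sum(axis=1)
keep_cells = (n_genes >= min_genes) & (n_genes < max_genes) & (totals < max_counts)
counts = counts[keep_cells]          # drop failing cells first

keep_genes = (counts > 0).sum(axis=0) >= min_cells
counts = counts[:, keep_genes]       # then drop rarely seen genes
print(counts.shape)
```

Here the third cell fails (only one gene expressed) and two genes survive in the remaining cells, so the toy matrix ends up 2 x 2.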
So we copy this total_counts key, click on "Manipulate AnnData", and once again filter observations by the key total_counts as a number "less than" (or "less than or equal to") 120,000. If we take a look at this dataset, we can see we have 5,623 observations and around 23,000 variables, and this is exactly what we expect to see. What we want to do now is preserve the original matrix by freezing it into a spare slot. This will help later on, potentially not in this analysis, but in a later analysis where you want to perform some kind of gene ranking and wish to use the original data to perform it, which can sometimes be useful. So we're just going to click on "Freeze the current state into the raw attribute" and Execute; any further changes can then be undone by restoring from this raw slot. What we wish to do next is known as confounder removal. If you consult the video at the time point given, it tells you about the types of confounders that you wish to get rid of: unwanted biological variation, unwanted amplification effects, uneven cell capture rates; several factors which we wish to regress out of the analysis through a fair normalization. We do that by simply clicking on Normalize and setting a target sum of 10,000. This will normalize all cells to a target sum of 10,000 transcripts per cell, and we also wish to perform a log transformation on the data. This is to ensure that large differences in a gene expression profile are not drastic enough to completely drive a further separation of the data which does not need to be so greatly separated. Essentially, we are looking more at gradients rather than absolute differences, and by performing a log transformation we compress this variability into a tighter space. Okay, then we wish to regress out unwanted sources of variation.
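The normalization and log-transformation steps have a compact numeric form; this conceptual numpy sketch shows why normalizing to a common target sum removes library-size differences between cells (toy counts, two cells with the same expression profile but a 10x depth difference):

```python
import numpy as np

# Sketch of the normalization step: scale every cell so its counts sum to
# the same target (10,000 in the tutorial), then log-transform to compress
# large expression differences into gradients.
target_sum = 10_000
counts = np.array([[100.0, 300.0],    # cell sequenced deeply
                   [10.0, 30.0]])     # same profile, 10x shallower

normalized = counts / counts.sum(axis=1, keepdims=True) * target_sum
logged = np.log1p(normalized)         # log(1 + x), safe at zero counts
print(normalized)
```

After normalization, both cells become identical, which is exactly the point: their difference was depth, not biology.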
The main one that we have in our dataset here is simply the library size, so we go to "Remove confounders", "Regress out", and we pick total_counts. We also wish to scale the data to unit variance. This is just to ensure that the mean expression of a gene is not a contributing factor; it is more the variability of the gene that we care about. We're now looking at the gene dispersion rather than the absolute gene expression. By doing this, we should have a much nicer dataset that we can use for dimensionality reduction. If you do not know what dimensionality reduction is, then we explain it much better in the introduction video, so I do implore you to take a peek at it, specifically at the time point given at the bottom, 13:46; if this is not automatically linked, then please just jump to that time point, and it should tell you more about dimensionality reduction. But essentially what we are trying to do is perform a principal component analysis, i.e. a rotation of the unit axes of the data, thereby finding the axes which have the most variation along them, and selecting them in order from most variation to least variation. And we don't wish to select them all.
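The scaling step just mentioned is plain standardization per gene; here is a short numpy sketch of it (toy data: two genes with very different means and spreads):

```python
import numpy as np

# Sketch of the scaling step: each gene is centred and divided by its
# standard deviation, so downstream steps see the *dispersion* of a gene
# rather than its absolute expression level.
rng = np.random.default_rng(1)
X = rng.normal(loc=[5.0, 50.0], scale=[1.0, 10.0], size=(200, 2))

scaled = (X - X.mean(axis=0)) / X.std(axis=0)
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, both genes have mean 0 and standard deviation 1, so the highly expressed gene no longer dominates simply by being big.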
We don't wish to select all 20,000 axes (not gene axes, but PCA axes), but maybe just the top 40 most relevant ones. So this single step alone, PCA, is very powerful: it gives you a dimensionality reduction from 20,000 to 40, which is a phenomenal amount of reduction without losing much variability in the data. The components themselves are linear combinations of genes, so a component never explicitly describes one gene; it's a linear combination of several genes, some of which have more impact than others, and you can get these loadings using Scanpy, but you should consult the previous tutorial for that, since we are not going to do it here. Once we perform a PCA, we can then perform a further dimensionality reduction to get an actual plottable space of two dimensions, and in single-cell RNA-seq you can do this via UMAP or t-SNE. Now, UMAP is the new golden child of dimensionality reduction in single cell, because you can project new data onto it, but t-SNE is still very good; so if you want, you can try both, just to compare, and you should hopefully get robust projections of the data in both. For now, in this tutorial, we are just going to be using the UMAP projection. First, let's do a PCA: we click on "Cluster, infer trajectories and embed", and then we perform tl.pca, and we will be using a full PCA with the ARPACK wrapper; so the type of PCA is "full PCA" and the SVD solver will be ARPACK. Execute. Then we wish to generate a neighborhood graph of cells, so that we can perform clustering on them later on. For those of you not really familiar with a neighborhood graph, I once again ask you to look at the introduction video, specifically at time point 13:40, which should give you a better idea of exactly how to generate a graph of cells and how this can be used for clustering later on.
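The PCA step can be sketched directly with numpy's SVD; this is a conceptual stand-in for what the Galaxy tool computes (toy sizes; the tutorial keeps 40 components of the real ~20,000-dimensional data):

```python
import numpy as np

# Sketch of the PCA step: centre the data, take the SVD, and keep only the
# top components. Each component is a linear combination of genes, ordered
# by how much variance it explains.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10))              # 100 cells x 10 "genes"
n_comps = 2                                 # stand-in for the tutorial's 40

Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pcs = Xc @ Vt[:n_comps].T                   # cells projected onto top PCs
explained_var = S**2 / (len(X) - 1)         # variance per component, sorted
print(pcs.shape)
```

The rows of `Vt` are the loadings: the weights each gene contributes to a component, which is the "linear combination of genes" described above.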
So for now, we're just going to compute this neighborhood graph: pp.neighbors, the size of the local neighborhood should be 10, and we will only be using 40 principal components; we computed 50, but we only need 40. Execute. We're using the UMAP method of computing connectivities. Then we will be generating the UMAP, so let's run the UMAP embedding and keep all the defaults, especially the number of components, which should be two. These steps just generate the coordinates, you could say, for these embeddings; to actually see the embeddings, we have to plot them. So, let us plot with Scanpy: first the PCA, grouped by batch, and we'll be using this rainbow colour scheme in the plot attributes, because sometimes the default colours don't quite work so well. So let us do that: plot pca; if you can't see it, you can just type in "pca"; here we go, pl.pca. The keys for annotations shall be "batch", and in the plot attributes we select the colours to use for categorical annotations, this one. It can actually be any palette, just please don't choose the default one, because the default has been known to not produce such nice colours. While that is loading, let us also plot the UMAP embedding: again the same sort of settings, batch, rainbow colour scheme. For the keys for annotations we just have batch for now; we don't wish to show edges; we don't wish to show arrows; and in the plot attributes we select the palette for categorical annotations to be rainbow.
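The neighborhood graph computed a moment ago boils down to finding each cell's k nearest neighbours in PCA space; here is a conceptual numpy sketch (k = 10 in the tutorial, 3 here; the real tool additionally weights the edges using the UMAP connectivity method):

```python
import numpy as np

# Sketch of the neighbourhood graph: in PCA space, each cell is connected
# to its k nearest neighbours. Clustering then operates on this graph
# rather than on raw distances.
rng = np.random.default_rng(3)
pcs = rng.normal(size=(20, 5))              # 20 cells in a 5-dim PCA space
k = 3                                       # stand-in for the tutorial's 10

dists = np.linalg.norm(pcs[:, None, :] - pcs[None, :, :], axis=-1)
np.fill_diagonal(dists, np.inf)             # a cell is not its own neighbour
neighbors = np.argsort(dists, axis=1)[:, :k]
print(neighbors.shape)
```

This brute-force pairwise distance works for toy sizes; real implementations use approximate nearest-neighbour search for thousands of cells.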
If we wait long enough, we should eventually see these two plot datasets go green, and from there we can have a better look at exactly how clustered our data is. If we have a look now at our PCA and the UMAP, we can actually view them side by side by clicking on this "Enable/Disable Scratchbook" button. We click on that, then click on our PCA, and it loads in this nice resizable window, which I place like so; then I click away, click on the UMAP plot, and also resize this, so that these two are nicely side by side. We can see that the PCA has some variability in both axes, which is expected for such a complex dataset, and we also see that the shortroot batch and the wild type batch are actually quite nicely mixed together, meaning that there shouldn't be any great batch effects. If we look at the UMAP for these two batches, the shortroot and the wild type appear to overlap, but in different sections, hinting that there is likely an overlap of cell types, but not necessarily of the exact same cells for the majority of the cells. For example, the wild type appears to have more of this central cluster of cells than the shortroot does. Let us have a look at the training material again; here we can see very similar types of scatterings of cells. But what we now need to do is actually find the same cell types that were given in the original paper. In the original paper there is this specific heatmap; well, it's a heatmap of sorts, but it's actually a dot plot, showing you the intensities of a given set of genes, here at the bottom, expressed in different clusters. They had approximately 15 clusters, including zero, and you can see that they found that specific cell markers were lit up for specific clusters, in such a way that these clusters were unique to those specific cell types. For example, the trichoblasts appear to be strongly expressed
in this cluster 10, but nowhere else, whereas, for example, columella, QC and NC, these three cell types, all appear to be localized within the same cluster 11. So hopefully we can recapitulate the same sort of pattern by performing our own Scanpy analysis. To do this, we click on this "Cluster, infer trajectories and embed" tool, and we will be using the Leiden clustering method. I encourage you to once again take a look at the introduction to single cell video, just to refresh your mind on exactly how Leiden clustering works; it is very similar to Louvain clustering, but Leiden has the added benefit that it deals better with disjointed parts of the graph. So we click on this, we will be using the latest dataset, 20, and we will be using Leiden. The coarseness, we said, was going to be 0.35; this resolution is something that you have to discover yourself through trial and error, i.e. which level of coarseness will give you the best sort of clustering. We Execute, and we can disable the Scratchbook if you want to. This performs the actual clustering, and we wish to see it plotted as a UMAP, so we are going to copy "leiden" and "batch" as the observation keys. Plot with Scanpy, and we're going to use the output of 23; let me expand this window slightly. The keys will be leiden and batch. I believe we have some further plot attributes: we want the legend to be on the data, and we want the legend to be at least size 14, and we will be using the rainbow colour scheme. So, in the plot attributes, the location of the legend should be on the data, the legend size 14, and the palette rainbow. This should give us these two nice side-by-side plots, or in our case the legend will actually be on the data itself, and this should give us 13 unique clusters.
This isn't the same as the 15 in the original paper, but it is enough for us to find the correct cell types, as you will see later. If we take a quick look now, with the Scratchbook enabled, at the plot: here we have the labels actually on the data, and for example the shortroot, which I think is the purple one in this case; probably not the best to have it on this particular plot setting. But here we have the individual cluster identifiers: is there a cluster zero? I don't believe there is, so we have 13 individual clusters, which we don't know anything about; all we know is that there are 13 clusters that the cells have fallen into. Now we need to identify specifically which ones are the ones that were identified in the paper. So let's go back to the tutorial: we are going to be generating a dot plot, the same dot plot as in the paper, but this time doing it within Scanpy, and seeing whether we can get the same patterns and the same groupings of cells within the same types of clusters. One thing you will need is this list of marker genes; this is taken straight from the paper itself, literally from this graphic here. I didn't take all of them, because some of them were filtered out in the initial analysis, so I'm only keeping the ones that we have detected so far. What we want to do is a dot plot on a subset of variables, so let's do that: click on "Plot with Scanpy", we will use the latest dataset, and we will type in "dot plot". We will not be using all variables; we will be using a subset of variables, and I will just paste those in there. And I don't think we need to do any grouping; oh, sorry, we do: the grouping should be the leiden key.
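Under the hood, a dot plot summarises two numbers per (cluster, gene) pair: the mean expression (dot colour) and the fraction of cells expressing the gene (dot size). Here is a conceptual pandas sketch of that aggregation (toy data; the real grouping key is the Leiden cluster column):

```python
import numpy as np
import pandas as pd

# Sketch of what a dot plot summarises: for each (cluster, gene) pair, the
# mean expression (dot colour) and the fraction of cells expressing the
# gene (dot size).
expr = pd.DataFrame(
    [[0, 5], [0, 4], [3, 0], [2, 1]],
    columns=["geneA", "geneB"],
)
cluster = pd.Series(["c0", "c0", "c1", "c1"], name="leiden")

mean_expr = expr.groupby(cluster).mean()        # dot colour
frac_expr = expr.gt(0).groupby(cluster).mean()  # dot size (fraction > 0)
print(mean_expr)
print(frac_expr)
```

A marker gene for a cluster is one whose dot is both dark (high mean) and large (most cells express it) in that cluster and small elsewhere, which is exactly the pattern we are looking for when matching clusters to the paper's cell types.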
Let's just double check that; yes, this should be the leiden key, which will be the vertical grouping. We will get to this next part later on, but it is essentially there to help shape the categories. What it does is create specific gene groups, where we say that all genes falling within this group of indices are called columella, all that fall within that group are called QC, and so forth; it's just a way of adding annotation to the plot. It is a little bit painstaking to do, but let's just brave our way through it. We have nine groups; yes, we will be using the raw data; and columella is variable indices 0 to 5. Insert group, insert another group: 6 to 9, QC; insert another group, NC; endodermis is 12 to 17; the variable indices to mark for cortex are these genes, 18 to 23; the atrichoblasts, 24 to 25; the trichoblasts, 30 to 34; xylem, 35; and a custom figure size. If we take a look at the dot plot, we see now that we have our own dot plot for these given marker genes, and we can compare it with the one from the original paper. So let's open the paper's dot plot in a separate tab, then go to our own dot plot and middle-click to view that in a separate tab too. This is ours and this is the original. Notice that columella, QC and NC all share one given cluster here, which is the same as in our reproduction, and notice that endodermis, cortex and atrichoblasts have different clusters from everything else: atrichoblasts, cortex and endodermis are each localized within their own clusters, and we have good expression for trichoblasts, xylem and VC. So, in general, this has been a pretty good reproduction of the paper, and we can further show this by annotating the clusters with new names: we can actually put the correct labels onto the desired clusters, to give a nice plot that we can use. So let's have a look: "Manipulate AnnData". What we wish to do is rename the categories of
annotation, and we will be taking these new names; oh, sorry, not that, that's the key: the key should be the leiden key, and these are the new categories. Execute that on our latest AnnData object, and then we plot it: Plot with Scanpy, UMAP, rainbow. The key for annotations is leiden; we're not really concerned with batch, so I won't add it, although you can if you want; I'm just going to put leiden for now, since I find it more interesting. We want the legend on the data; the legend size should be 14, which is a bit big, so if you want to make it smaller you can; and I'll be using rainbow for the categorical annotation groups. And we should be using input dataset 26, which we are. Now, if we have a look at this new plot, we see that we get the same clusters as before, but now relabelled, such that we have the columella, QC and NC contained within this cluster region here, and our trichoblasts, atrichoblasts, endodermis and cortex. From this we can possibly infer a kind of differentiation from this central cluster one, and if we look back at the original clustering given in the paper, we see that cluster one is quite likely to be this meristemic cell type, which then differentiates later into this trichoblast trajectory pathway. This is as far as the original paper went in terms of determining the clusters, but we can ourselves take it further and perform an actual trajectory analysis, using the same dataset, in a different tutorial. I leave this as an exercise for keen users who wish to try it using the excellent single-cell materials in the RStudio and Jupyter notebook libraries. You can perform this within Galaxy using Scanpy; however, it should be noted that you would get a slightly different trajectory analysis, so it is recommended to use the dedicated trajectory tools covered in those materials, and I greatly encourage you to follow that tutorial and try to replicate it. If you wish to play with this particular tutorial a little bit more, with slightly
greater control over what the input parameters are, there is a workflow available: just search for "scRNA plant analysis"; it's actually attached to this tutorial on the main usegalaxy workflows. You can perform your own analysis by feeding it your own different datasets, setting a different level of resolution for the Leiden clustering or different input parameters for filtering, and seeing what kind of plots you eventually get. This is completely automated within Galaxy: you just run the workflow, feed it the inputs that you would like, and you can gain new insights into this same dataset. Hopefully you should get the same sort of clustering, which will show that it's quite a robust dataset and that your analysis is correct even with different parameters, which is always great to see. But maybe you get something slightly different, so I encourage you to play, and to please leave some feedback on the actual content if you see any changes that need to be made. And thank you for completing this tutorial.