Hi, my name is Maria Doyle and I'm the application and training specialist for the Peter MacCallum Cancer Centre in Melbourne, Australia. Today I'm going to take you through the hands-on tutorial for CRISPR screen analysis. In this tutorial, we're going to go through the steps for processing CRISPR screen data and identify essential genes across experimental conditions. There are some introductory slides that you can find in the link in the material here, and also a bit of background in the tutorial website.
Okay, so what we're going to cover today: we'll upload some data, some FASTQ files of raw reads, and we'll check these raw reads for quality. We'll trim adapter sequences, then we'll count the guides we have, and then we'll test for differential abundance of guides across two conditions, in this case a treatment and a control.
The data we're going to use in this tutorial is from a CRISPR screen from this paper here. We've got three samples we're going to use. We've got a baseline sample taken at time zero, the time zero control. We've also got a sample that's been treated with drug for eight days, the APR drug, and we've got a control sample that's been treated with vehicle for eight days. That's our vehicle sample. The aim here is to identify genes whose knockout increases the cancer cells' sensitivity to the drug. We will use FASTQ files containing 1% of the reads from the original data to demonstrate the read processing steps.
Okay. So what I'm going to do is go to Galaxy. I'm going to set up one window that has Galaxy. Here I'm going to use Galaxy Australia; you could also use Galaxy Europe to do this tutorial. So I have one window with the material and one with Galaxy. Go ahead and log into Galaxy if you're not already logged in; I am. Then, in a new history, I'm going to give it a name. I'm going to call it CRISPR screen. And then we will load in our data.
So here are our three FASTQ files. I'm going to copy the links for these files and go to upload data in Galaxy. I'm going to click on Rule-based and upload these files as a collection. What I'm pasting in here is actually two columns, one with the names of the samples and one with the file links. I'm creating a collection to make it easier to keep the data organized in the history and to save on some file renaming. Okay, so I'm going to click Build, and then a window should pop up. Under Rules, I need to click Add / Modify Column Definitions. I need to specify a name for each of the files, so that's the list identifier, and I'm going to say that is in column A. These are the names for my samples. Then I need to specify the URL, so where the files are located. That's in column B. So apply that. The final thing I need to do is give this collection a name. I'm going to call it fastqs, and I'm going to upload that.
Okay, and you should see it appearing here in your history. At the moment it's grey, which means it's a job waiting to run, waiting to upload. When it starts uploading, it'll turn yellow, and when it's finished uploading, it'll be green. So now it's turned yellow, so it's uploading there, and it's a list. This is what a collection is. It's like a folder: a list containing multiple files. In the tutorial website, you can see the steps we have gone through here. That should upload shortly.
Okay, great. It's green, and we can see we've got a list, our collection with three items. If I click on the name, there are our three samples. You can click on the name of each of the files and see it's FASTQ format and get a little peek into it. These are our FASTQ reads. If we click on the eyeball, I can have a look at the file in the main pane, and I can close off this panel on the left, just to get a bit more space. So these are our reads.
So each of these four lines is one read. The first line is the read name. Then we've got the sequence for the read, in this case 75 bases, then a spacer line, the plus, and then we've got the quality score for each of the bases.
Okay. I'll go back, because what we'll do now is use a program called FastQC to have a look at the quality of these reads. So I'm typing FastQC in the tool search bar here and clicking on FastQC to open up that tool. Because we are using this collection list, we need to click on the dataset collection icon and then select our collection here, fastqs. We don't need it, but we're also going to upload a file containing some adapter sequences, including our CRISPR adapter sequences, so we can have a look at these in the raw data. So we're going to upload this file here. I'm going to copy that and go to upload data. This time I'm not making a collection; this is just a single file. So I'm going to use Paste/Fetch, paste in the link to this adapters file here and click Start, then Close, and that should appear here in green when it's uploaded.
To explain: with CRISPR screen data, we're interested in the single guides. These are the sequences in blue. Let me zoom in a bit. So the sequences in blue. The bases surrounding them in red are the adapter sequences that we don't care about and want to remove. We're going to upload this adapter list file so we can have a look at these adapter sequences in our data. I'll make that a bit smaller.
Okay, so we've selected our fastqs. Now we are going to select our adapters file. As it says here, FastQC will explicitly look for these adapters. If we peek in this file, we can see what it is. So one, two, three, four: the first four are sequences FastQC will look for by default.
So we'll just include those, and then we've added in two more that are CRISPR sequences, so these regions in red around the guide. Okay, so I'm going to run that. FastQC will output a report for each of our three samples. So we can use a tool called MultiQC, which you can search for in the tool search bar, to aggregate these multiple reports into a single one. That makes it easier to view the results if you've got many samples. We tell MultiQC which tool was used to generate the reports we want to summarize. In this case it's FastQC, it's the raw data output, and it's in a collection, so this raw data here. And we click Execute. Okay, that's waiting to run there, and it's running now. This will give us one web page that summarizes the FastQC results for our three samples. It's just finishing now. I'm actually going to add a tag to make it easier to see what this is in the history. There's a button here for creating tags. If I type a hash, I can then call it fastqc-untrimmed. That just makes it easier for me to know in the history what MultiQC is summarizing there.
If we click on the eyeball, we can have a look at our FastQC report. We can see how many sequences we have for each of our samples. We've just got a small amount here for this dataset for the tutorial. There is a Galaxy tutorial on quality control that goes through these FastQC plots in detail. Here we're just going to have a look at the base quality. We can see that along our read length, and we can see our reads are 75 bases, we've got good quality for our bases along the reads. So the sequencing has been good. The other thing we really want to have a look at is the adapter content. If we go down to our adapter content section, we can see the adapters present, and this is using the adapter list file that we supplied to FastQC. So we can see that at the beginning.
So again, this is the length of our reads. At the beginning of the reads, we can see the five prime CRISPR adapter has been detected at different positions in the reads. That's because of the stagger sequence that's present with the protocol used here, which is explained in the tutorial. And at the ends of our reads, from about 40 onwards, we're starting to see the CRISPR three prime adapter, in a slightly lower amount than the five prime, because base quality tends to be a bit lower towards the three prime ends of reads in general. Okay, but that's picking up the adapters that we expect. So we've got CRISPR adapters, and we can also see a small bit of Illumina adapter present there, which means we've got some short sequences or a small amount of primer dimer there.
Okay, so now we'll trim the adapters from our reads, and we'll use the Cutadapt tool for that. I'll type that in the tool search bar. Our data is single-end, and again we need to say it's a collection. I'm going to input our fastqs. Then we're going to tell Cutadapt we want to trim adapters from the front, the five prime end. This is the adapter sequence here, so I enter that sequence. And I will select to output the report.
To explain: as I said, our guides are these regions in blue. We are telling Cutadapt we want to trim this front part, the bit in front of the blue, by putting in the five prime adapter sequence. We could also trim off the three prime adapters, or shorten the reads to just have the guides, but MAGeCK, the tool we're going to use for the analysis, will take just the guide sequence from the front of the reads. So we really only need to trim these five prime adapters.
Okay, so Cutadapt has finished running. As before, we can summarize the Cutadapt report using MultiQC, so we'll do that. We need to say the input is Cutadapt.
And it's the Cutadapt report we're going to summarize. This is just so we can have a look at the trimming that's been done. That one's waiting to run, and while it's waiting, I can add a tag. I'm going to add a tag, cutadapt-report. That's there, and again that's just to make it easier to see what MultiQC is summarizing. That's run, so we can have a look at the MultiQC report, the trimming report from Cutadapt. Here we can see that each of our samples has had a similar amount trimmed, about 34%. That's been trimmed from the front, from the five prime end, of each read. And we can see that none of our reads have been dropped, we still have 100%, because we just told Cutadapt to trim reads from the front. Then in this trimmed sequence lengths section, we can see how many reads have had the different lengths of adapter trimmed. If you click on these buttons, you can get the plots to show, whoops, okay, say the counts, the numbers of reads that have been trimmed at each of these lengths. We can see we've got many with 20, 21, etc. bases trimmed from the front of the reads, up to 31 bases. And that's actually what we expect for this sequencing protocol because of the use of these stagger sequences. That's explained more in the link here, where you can see what those stagger sequences are. So this is what we expect to see for the trimming for this dataset. We can also see that the samples are fairly similar. There's maybe a bit more trimming performed for the T0 control sample. So that looks pretty good. If you want, you can go on and run FastQC and MultiQC again on the trimmed dataset, so on the FASTQs that Cutadapt has produced, just to see how the trimming affects the adapters. And we can see that there's now no CRISPR adapter present in the first 20 bases of the reads. It's been removed. Okay.
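The trimming idea can be pictured with a small Python sketch. This is only an illustration, not how Cutadapt works internally: it looks for an exact match of a five prime adapter, removes everything up to and including it, and keeps the first 20 bases as the guide. The adapter, stagger and guide sequences below are made-up placeholders, and the 20-base guide length is an assumption.

```python
def trim_five_prime(read, adapter):
    """Drop everything up to and including the first exact adapter match.

    Returns the read unchanged if the adapter is not found.
    """
    pos = read.find(adapter)
    if pos == -1:
        return read
    return read[pos + len(adapter):]


def extract_guide(read, adapter, guide_length=20):
    """Trim the 5' adapter, then keep the first guide_length bases."""
    return trim_five_prime(read, adapter)[:guide_length]


# Hypothetical sequences, for illustration only.
ADAPTER = "ACCGGT"
GUIDE = "GATTACAGATTACAGATTAC"  # 20 bases

# Stagger bases of different lengths in front of the adapter shift its
# position from read to read, which is why trimmed lengths vary.
reads = [
    "T" + ADAPTER + GUIDE + "GTTTTAGA",
    "TTGCA" + ADAPTER + GUIDE + "GTTTTAGA",
]

guides = [extract_guide(r, ADAPTER) for r in reads]
print(guides)
```

Both example reads give back the same guide even though the adapter sits at a different offset in each, which is the effect of the stagger sequences described above.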
So now we can go and use MAGeCK, which is a toolkit for CRISPR screen analysis. We will use MAGeCK count first. That will count how many reads we have for each guide. First we need to import our library file. MAGeCK count will count how many reads we have for each of the guides in the library file we supply. For this dataset we're using the Brunello library, which is a human genome-wide knockout library. So I'm going to upload the Brunello library file. I use Paste/Fetch here. Okay, and that should appear in the history. While that's uploading, I will search for MAGeCK count. So this is it. What we're going to input here is, again, our collection of trimmed reads, so that's the output of Cutadapt. Then we select our library file; we need to wait for it to finish uploading. This is the Brunello file. It has just over 77,000 guides: there are four guides targeting every human gene, and then there are a thousand non-targeting controls. This is what the file looks like. The first column is an ID, an identifier for each of the guide RNAs. The second column contains the sequence of each of the guides. And the third column has what gene it's targeting. As you can see, the first four rows are for guides that are targeting the A1BG gene. Okay, and we've got 77,000 of them. So I can now select Brunello, as it's uploaded. In the output options, I'm going to select to output the summary statistics and also plots. Okay, and execute that.
That's going to output three files: the counts for every guide, the summary statistics, and the plots. However, this is using 1% of the original data, so these aren't as useful for seeing what a real dataset looks like. So at this point we're going to import these three files for the full samples. That's these links here.
So you can see what that looks like, I'm going to copy those links, go to upload data, paste in the links here and start that. Okay, so these three files are the same as these three, except that these are from 1% of the data and these are from the full FASTQ files. The tutorial website goes through the different columns that are in each of these output files and what they are for, so I'll just point out some of the key things.
Okay, great. First we'll have a look at the count summary file. It's that one here. I'll click on the eyeball to view it. So we've got the name, the label, of our sample. Then this tells us how many reads we had in the files we inputted, in this case around 20 million, and then how many mapped and what percentage mapped. We are looking for at least 60%, and we get 80%, so that's very good. It also tells us how many guides we have, and how many have zero counts, that is, how many have no reads mapping to them at all. We'd hope for that to be no more than 1%. So that's okay for this dataset. Then we also get this Gini index, which is an important metric to have a look at. This is a measure of how even the distribution of guide counts is. If we were doing a positive selection screen, it would be okay to have high values here, because we'd expect only a few guides to be present. We're doing a negative selection screen, so we want to see small Gini index values here, and that's what we see, so that's good. There's a bit more on the Gini index explained here. Okay, so these metrics for these samples are pretty good.
We can also have a look at the plots that MAGeCK count produces here. If we click on the eyeball, we get some box plots that show us the distribution is pretty similar for our samples. There's a little bit more variability at the day eight time points compared to time zero. The distribution is pretty similar again in this histogram-type plot.
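To make the Gini index from the count summary a little more concrete, here is a small Python sketch of the textbook Gini coefficient applied to a list of guide counts. Applying it directly to raw counts is an assumption made for illustration; MAGeCK's own calculation may transform the counts first, so the values will not necessarily match its count summary. The example counts are made up.

```python
def gini_index(counts):
    """Textbook Gini coefficient: 0 means perfectly even counts,
    values near 1 mean a few guides hold almost all the reads."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending-sorted counts (ranks start at 1).
    weighted = sum(rank * v for rank, v in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Invented guide counts, for illustration only.
even_counts = [100, 100, 100, 100]   # every guide equally represented
skewed_counts = [0, 0, 0, 400]       # one guide dominates

print(gini_index(even_counts))
print(gini_index(skewed_counts))
```

The evenly covered library scores zero and the skewed one scores much higher, which matches the reading above: small values are what you want to see in a negative selection screen.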
In the histogram, the control is a bit lower because there are more points plotted. We also get a PCA plot. This would be useful if we had replicates, so we could see if our replicates are grouping together and if we have good separation between our different groups. We also get a little heat map showing the hierarchical clustering of our samples. This helps us see that in this dataset, again, our day eight time points are more similar to each other than to our time zero baseline sample.
Okay, so now that we've generated our counts, let's just have a look. This is the actual counts file we're going to go on and use. We've got a row for every guide, saying what gene it's targeting, and then how many counts that guide has for each of our, in this case, three samples. Now we're going to input that into MAGeCK test to compare our conditions of interest. So look for MAGeCK test. For MAGeCK test, what we want to do is input the guide counts. We will specify that our treated sample is in our first column. It's actually counting from zero, so the first column is column zero. And we want to compare that to our control, our vehicle sample, so drug to vehicle; that's column one then. For outputs, we want to output the normalized counts file, because that's useful if you want to plot the normalized count values for different genes and the guides for those genes. And we'll also choose to output the plots. Okay, so execute that.
Just to say, MAGeCK test is using an algorithm called robust ranking aggregation, so RRA, and this is a visual of what it's doing. In essence, it's helping us to identify essential genes by identifying guides that are highly ranked, either highly enriched or highly depleted, in our treatment versus our control. And a note on replicates: if you have biological or technical replicates, you can input them into MAGeCK and separate the names by commas.
There are biological replicates for the experiment used here, but for the sake of time in the tutorial we're using just the single samples. What we're going to see is that MAGeCK test outputs a gene summary file and a guide summary file. We also get the normalized counts file, and we get a PDF report with some plots. Again, the columns in the different files are explained here in the website.
We'll have a look at the gene summary file. Here we get a row for each gene, so it's 20,000 rows. You can see every row is a gene. We get the number of guides that are targeting that gene, as well as metrics: the RRA score, p-value, FDR and ranking of that gene, both for negative selection and for positive selection. This is a negative selection screen, so we're looking for genes where the guides have decreased with the treatment compared to control, because we're interested in drug sensitivity.
We also get an sgRNA summary file. We can use the peek to have a view here. In this case we have a row for every guide, so over 77,000. This gives us the counts for the guide in the treated and the control condition, if we want to have a look at the individual guides and how they're performing.
Then there's the PDF report we get output. This gives us a list of the top 10 genes. These are ranked by MAGeCK's RRA score or p-value. You can think of the RRA score as similar to a p-value, in that it's showing us the most significantly ranked genes. So here we get our top 10 genes, whether we look at the RRA score or the p-value. And then we get plots for the guides of these top 10 genes. Every gene has four guides, so we see how the abundance of those guides is changing between our drug and our vehicle.
As I said, for this particular experiment we're interested in guides that are dropping out, where the knockout is increasing the cells' sensitivity to the drug. So we would hope to see that these four guides are all higher in vehicle than they are in drug. If we look, though, we can see that this may not be the case for every gene in our top 10. If we look at this FL1 gene, this is the fourth top negatively selected gene. But if we look at the actual guides, we can see that one guide is decreased, very decreased, in drug compared to vehicle control, but one of the other guides is slightly increased and two of the others aren't changing that much. So by looking at the counts for the individual guides, we can see that maybe we wouldn't have as much confidence in that gene being one of our key essential genes compared to some of the others. So it's worth having a look at the individual guide values.
Okay. Something that's quite commonly done is to create volcano plots for results. We can do that with the Volcano Plot tool in Galaxy. We first need to reformat the gene summary file so we can create the volcano plot, because, as we saw, the gene summary file has p-values for both negative and positive selection, and log fold change values as well. We really just want four columns like this for the volcano plot: gene, one p-value, one FDR (an adjusted p-value) and a log fold change column. What we're going to do is described here. If the negative selection p-value is smaller than the positive selection p-value, that means the gene is negatively selected, so we use that p-value. If the negative selection p-value is larger than the positive selection p-value, the gene is positively selected, so we use the positive one. We are going to use awk to create these p-value and log fold change columns. So I'm going to copy that and go to the awk tool.
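The selection rule just described can also be written out in Python. This is not the awk command from the tutorial, only the same logic sketched under assumptions: the column names (id, neg|p-value, neg|fdr, neg|lfc, pos|p-value, pos|fdr, pos|lfc) are assumed to match a MAGeCK gene summary file, only a subset of its columns is shown, and the two example rows are invented.

```python
import csv
import io


def volcano_rows(gene_summary_text):
    """Reduce a tab-separated gene summary to gene, p-value, FDR and LFC.

    For each gene, the negative-selection values are used when the negative
    p-value is smaller than the positive one; otherwise (including ties,
    a choice made here) the positive-selection values are used.
    """
    reader = csv.DictReader(io.StringIO(gene_summary_text), delimiter="\t")
    rows = []
    for rec in reader:
        neg_p = float(rec["neg|p-value"])
        pos_p = float(rec["pos|p-value"])
        side = "neg" if neg_p < pos_p else "pos"
        rows.append({
            "gene": rec["id"],
            "pvalue": float(rec[f"{side}|p-value"]),
            "fdr": float(rec[f"{side}|fdr"]),
            "lfc": float(rec[f"{side}|lfc"]),
        })
    return rows


# Invented example data; assumed column names, subset of a real file.
header = ["id", "num", "neg|p-value", "neg|fdr", "neg|lfc",
          "pos|p-value", "pos|fdr", "pos|lfc"]
gene_a = ["GENE_A", "4", "0.001", "0.01", "-2.0", "0.9", "1.0", "-2.0"]
gene_b = ["GENE_B", "4", "0.8", "0.95", "1.5", "0.002", "0.02", "1.5"]
example = "\n".join("\t".join(line) for line in (header, gene_a, gene_b))

table = volcano_rows(example)
for row in table:
    print(row)
```

GENE_A keeps its negative-selection values and GENE_B its positive-selection values, giving the four columns the volcano plot needs.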
Okay, so it's the gene summary file we want to reformat. That's going to convert our gene summary file into this. If we have a look here, we can see we've got our four columns now. Then we can use the Volcano Plot tool. We just need to specify which column is our false discovery rate, our p-value, our log fold change and our gene labels. Then we can ask it to label, for example, our top 10 most significant genes, and execute that. Okay, we have a look at the results. We get a volcano plot produced. We can see significance on the y axis and log fold change on the x. We've got our positively selected genes to the right of zero and our negatively selected genes to the left. In this case, these negatively selected genes are the ones of interest.
If we want to identify essential pathways between our conditions, between our drug treatment and vehicle control, there are many different tools we could use, but we'll use one called fgsea, which stands for fast gene set enrichment analysis. What we need for that, though, is a file of pathways. For human, there are hallmark pathways available from the Broad MSigDB website. We have one of these hallmark files here, so we'll import that. Actually, I should have set the type to tabular, but if you do what I did, you can change the datatype afterwards and set it to tabular here through the pencil, because it needs to be in tabular format for the fgsea tool. I'll just wait for that to finish uploading. Okay, there we go.
Just before we use fgsea, we need to prepare our gene summary file. We need just the gene symbols and the negative score. Because we're most interested in the negatively selected genes for this experiment, we're just going to cut out the gene column and that negative score column from the gene summary file.
So I'm going to use the Cut columns from a table tool to select the first and third columns from our gene summary file. Okay, if we have a look, you can see we've just got those two columns there. Now, if we look for the fgsea tool, we are going to input this. These are the genes and their scores to rank by. We are then going to input our pathways file with our gene sets, and just to show you what that looks like: every row is a pathway. There's the name of the pathway, then a link to more information on the pathway from the MSigDB website, and the genes that are members of that pathway are in the subsequent columns. We'll say we're interested in gene sets of size 15 and above, so they're not too small. And we will output plots. We get two outputs here: a ranked list of pathways and also a PDF file containing some plots. If we look at the ranked list, we can see we've got pathways with their p-values and adjusted p-values, and we can see the top pathway is oxidative phosphorylation. We also get some plots that show us where the genes in the gene sets fall in our ranked list of genes, with genes ranked from left to right. It shows us how the genes in, for example, the oxidative phosphorylation set are ranked in our data. There's another tutorial in Galaxy, called genes to pathways, that gives a bit more information on this fgsea tool.
If you want to use multiple conditions, there's another module for MAGeCK called MAGeCK MLE, which you can choose to run. The one we've been using, MAGeCK test, only allows you to compare two conditions, for example what we were doing, our drug versus our vehicle at day eight. If you had more complex designs than that, you could use this MAGeCK MLE algorithm. It is slower to run, so it takes about 30 minutes or so on this dataset. We won't run it here; you can run that yourself if you like.
I want to note that instead of outputting scores for both positive and negative selection, it outputs a single value, a beta score. A negative beta score indicates negative selection and a positive score indicates positive selection. For MLE, you create a design matrix; there's one example there, and more information on the MAGeCK website. And again, you can visualize your results from MLE using volcano plots.
Okay, so hopefully this tutorial has helped you see how you can analyze CRISPR screen data using standard sequencing tools such as FastQC, MultiQC and adapter trimming tools like Cutadapt; how you can use the MAGeCK toolkit to count guides and to test for differences in guide abundance across different conditions; and also how you can create some downstream visualizations such as volcano plots and perform some pathway analysis with tools like fgsea. There are some links here if you want to get help with this. If you have any feedback, there's a little form here and we would love it if you would complete that. Thank you very much for listening and taking part in this tutorial today.