Hello everybody, and welcome to this training on ATAC-seq data analysis. ATAC-seq stands for assay for transposase-accessible chromatin using sequencing, and it is basically the analysis of open chromatin regions in your genome, which you can then use to predict regulatory elements like enhancers or silencers. In this training material we will cover quality control for the data, how to predict open chromatin regions, some visualizations, and what you can infer from the results. A short explanation of the data we will analyze: it comes from the first paper on the ATAC-seq method and is from human CD4+ T cells. However, the full data set is a bit too large; it comprises a few gigabytes of reads. In order to keep this training quick, we downsampled the read library by randomly selecting reads, and we added up to 200,000 reads from chromosome 22 so that we can show you a better visualization later on. So we will focus mainly on chromosome 22 in this training material. Of course this does not completely reflect real data, but it was taken from real data. We will also compare the predicted open chromatin regions to a positive control: the binding sites of the transcription factor CTCF. CTCF is a well-known transcription factor and should correlate with some of our sites, as we will see at the end. And we will of course also check whether we find open chromatin regions in correlation with transcription start sites. To follow this training, you already have to know how to operate in Galaxy. I will repeat a few steps from the Galaxy introduction, and you also have to know a bit about quality control.
I will repeat some of those things and go through all of the plots. We will again use FastQC; even if you already know it, I will go over the plots, though not in much detail. I will only go into detail for those plots that are important for ATAC-seq. If you want to know more, please go through the quality control training and the Galaxy introduction to fully understand this training material.

Okay, let's start. Go into Galaxy and log into your account. You can see I already have a history open on the right side. Let's create a new history for our ATAC-seq training: click on the plus icon, create a new history, and name it something like "ATAC-seq training". With that, we can start to import the data we need. As mentioned, from the first ATAC-seq paper we have two read files, and we will also import a BED file with the well-known CTCF binding sites. So let's copy these links: click on Upload Data on the left side, then on Paste/Fetch data, and simply paste the links in here. You could also paste each link individually and assign the data type right away, but let's do all three at once; I will explain renaming and so on after the data is uploaded. Click on Start, and then we probably have to wait a few seconds or minutes until this is done. Okay, let's take a short break and come back when this is finished. Okay, we're back. The data is now in our history, as you can see, but the names are a bit convoluted because I did not assign names when I uploaded the data. So let's fix this in the history itself: click on the pencil icon.
You can remove all this extra information from the name before you click on Save. Let's also look at the data types. It says "interval" here, but it's actually a BED file, so change this to BED and click "Change data type". Oh, I actually should have saved the name first, so let's wait until this is finished and then change the names of these files. Remove the extra text and click Save; the name has changed and is a bit shorter now. So this is now a BED file, and these two are FASTQ files. As you can already see, this ATAC-seq data was produced with paired-end sequencing: we have the first mate and the second mate, which is why we have R1 and R2, and together they form one sample. I repeat: if you don't know what I'm talking about, i.e. what a BED file is, what a FASTQ file is, and what paired-end sequencing is, then please go back to the quality control training, where FASTQ and paired-end sequencing are covered; BED is covered in other training materials as well. Apart from this data, we also need some other data. If you scroll down here, you can again quickly read up on FASTQ and BED. We will also need some annotation later on, for our open chromatin regions as well as for the CTCF binding sites we just downloaded. To get this annotation, Galaxy has a tool called UCSC Main table browser. The UCSC table browser is well known if you need annotations, and this tool lets you import data from UCSC directly into your Galaxy history. You don't have to go to UCSC and download the data locally; you just use this tool to import it into Galaxy, which is very handy.
Just to mention, in case you are wondering what I'm doing now: I can click on this button and it will immediately open the tool. Of course, if you want to search for it, you can also copy and paste the name into the "search tools" field, which will list everything closely related to your search terms, and there you can find, for example, the UCSC Main table browser. However, if you want to follow along more easily, without always switching and searching, you can click on this Galaxy Training Materials button. If you go into Epigenetics, you will find the ATAC-seq training, and if you follow the training via this button in Galaxy itself, you can immediately click on the buttons for the tools and they open directly. Just so you understand what I'm doing right now. Okay, so we click on the UCSC Main table browser button, or search for it in the tool field, and we get to a page where we simply click on Execute, which redirects us to the UCSC table browser. There we select Mammal as the clade, the human genome, and the assembly version from December 2013 (GRCh38/hg38). Under genes and gene predictions we want GENCODE V36, with the knownGene table. For the position, we simply want everything from chromosome 22, as I mentioned. As the output format, we select "all fields from selected table", and we send the output to Galaxy. If you then click on "get output", you also have to click on "Send query to Galaxy", and the data will be sent directly into your Galaxy history. I won't do this now; I've already done it, because there was a little problem with Galaxy just now.
But if you have done these steps, you will have this UCSC data in your history. If you look into it by clicking on the view data icon, you see it's a table listing the genes of chromosome 22. From this table, we want specific columns, to bring it into a format readable by a tool used later in this training. So we have to process this data: go to the next step and click on this button, or search for "Cut columns from a table". There we copy and paste this line with the columns we want to select into the "Cut columns" field. The table is tab-delimited, as you saw, and we apply the tool to the file from UCSC; then simply click on Execute. While Galaxy produces the output, let me show you how to organize the history a bit better throughout the training. You may know that you can assign tags to datasets. If not, let me show you how: click on a dataset, and there is a button to edit dataset tags. If you click on it, a text field opens where you can assign some text. These are the binding sites of CTCF, so we give the file the tag #CTCF; press Enter and the tag is assigned. Let's do this for the other files as well. For the first mate of our ATAC-seq data, click again on the tag button and enter something like #ATACseq_R1; this is my first mate. For the second mate, I use _R2 instead. If you follow the training material, you see that we should also rename the data we have produced: we rename this "Cut on data ..." result file to "chromosome 22 genes". Save this, and we can also give it a tag, just to make it a bit more visually appealing.
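To make the column-cutting step concrete, here is a minimal Python sketch of what the "Cut columns from a table" tool does. The column indices and the example row are invented for illustration; the Galaxy tool itself takes the columns in c1,c2,... notation.

```python
def cut_columns(lines, columns):
    """Keep only the given 1-based columns of each tab-delimited line,
    like the Galaxy 'Cut' tool (which takes them as c1,c2,...)."""
    out = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        out.append("\t".join(fields[c - 1] for c in columns))
    return out

# Invented example row in the shape of a gene annotation table.
table = ["chr22\t100\t200\tGENE1\t0\t+"]
print(cut_columns(table, [1, 2, 3, 4]))  # -> ['chr22\t100\t200\tGENE1']
```

The output keeps only the coordinates and the gene name, which is the kind of reduced table the later annotation step needs.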
So, #chr22genes. This will help you later to see which results come from which input data. And now let's begin the actual analysis, so to say. You know from the Galaxy introduction and also from the quality control training that in any next-generation sequencing experiment you have to do a quality control: to see whether your experiment worked, whether there was a quality breach during sequencing or in your experiment in general, and also to check whether the technical biases you anticipate from your protocol are present. To check this, we use a tool called FastQC. Either search for the tool in the search field or click on this button. We now apply FastQC only to our first mate, as a demonstration: select "single dataset", click into the field, and select the R1 file. We leave all other options at their defaults and execute the tool. You can already see that the tag R1 is assigned to my FastQC result, so I can immediately see that this report comes from the first mate. This might take a while, so we'll take a short break here. Okay, let's continue. We generated the FastQC report for our first mate, and based on the tag R1 we know immediately that this report corresponds to the first mate. You may know from the quality control training that FastQC produces two outputs: one holds the raw data, and the other is the webpage, i.e. the report with the visualizations. If we click on the view data icon, we are greeted with the basic statistics of our data: the name, the sequencing device and version, how many sequences are in the library, and how many sequences are flagged as poor quality.
Then the sequence length: all of the reads have a constant length of 50 bases. And we have the average GC content in percent. Next comes perhaps the most important plot, the per-base sequence quality, with the position in the read on the x-axis and the Phred quality score on the y-axis, showing the distribution of the quality at each position. The blue line is the average quality. For our experiment it's very good: all positions are in the green area, above 28, and nothing is in the red area, which signals very poor quality. Scrolling down, we have the per-tile sequence quality, which is shown for Illumina sequencing. A red stretch signaling poor quality would be an indication that the sequencing itself did not work well, i.e. a quality breach in the sequencing itself. This plot is a heat map: blue colors mean very good quality, red colors poor quality. On the x-axis you see the position in the read, on the y-axis the tile of the flow cell. Please go to the Galaxy quality control training if you don't know what this plot means; it is described there in more detail, and also have a look at the FastQC documentation to understand it better. Again, a big red stretch, or many red stretches, would imply a problem with the sequencer; here it's pretty good for our data. Next we have the per-sequence quality scores, i.e. the distribution of quality over the reads: on the x-axis the average Phred score, on the y-axis the number of reads with that average score. You can see most sequences are well above 28. This plot agrees with the per-base sequence quality, and it basically shows that our read library has very good quality.
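As a side note on what those Phred scores mean numerically: a score Q corresponds to a base-call error probability of 10^(-Q/10). A tiny sketch, nothing tutorial-specific:

```python
def phred_to_error_prob(q):
    """Phred score Q -> base-call error probability 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

# Q30 means 1 error in 1000 calls; Q28, the lower edge of FastQC's
# green area, is roughly 1 error in 631 calls.
print(phred_to_error_prob(30))             # -> 0.001
print(round(1 / phred_to_error_prob(28)))  # -> 631
```

So "everything above 28" really does mean that fewer than about one base call in six hundred is expected to be wrong.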
Now we have a warning here. It's not a heavy quality breach; that would be shown as a red stop sign. It's just a warning, signaling a nucleotide bias. This is the per-base sequence content: on the x-axis the position in the read, on the y-axis the fraction of reads with each nucleotide (T, C, A, G) at that position. You see a nucleotide bias at the beginning of the reads. This is anticipated for ATAC-seq data, because the transposase that is used, Tn5, can introduce such a bias; in newer protocols the bias is reduced, but some bias is still expected. So don't worry if you see a warning here; this is what I would expect from ATAC-seq data. Then we have the per-sequence GC content. We have human data, so we anticipate a GC content of around or slightly below 50%. In blue is the theoretical distribution, in red the empirical distribution, i.e. the distribution of our actual data; the y-axis shows the number of reads with that GC content. It's a bit spiky here, probably because we downsampled the read library, and because, as we will see in a moment, we potentially still have some adapters in our data. Then we have the per-base N content. You can see there are no Ns in our reads: the x-axis is the position in the read, the y-axis the fraction of reads with an N at that position. You may know from the quality control training that N is a placeholder character: if the base-calling algorithm of the sequencing device cannot assign an A, T, G, or C because the signal was not good enough, it outputs an N at that position, signaling that there was not high enough confidence for a call.
And here you can see that none of the reads have an N at any position, so again very good quality. Sequence length distribution: as mentioned, all of our reads have a length of 50. Then we have the sequence duplication levels: in red the distribution if you were to deduplicate, in blue the actual distribution; on the x-axis the duplication level, on the y-axis the fraction of reads with that duplication level. You can see a big peak: a lot of sequences in our read library are duplicated at least 10 times. PCR duplicates normally arise if you do PCR, and the more PCR cycles you do, the higher the sequence duplication level in your data. For some NGS experiments you definitely do a PCR deduplication, but for ATAC-seq data this always depends on the experiment. So do not automatically deduplicate ATAC-seq data; ask whether it is really necessary and worthwhile for your data. The last things to mention in this FastQC report are the table of overrepresented sequences, where you can see a high count for one sequence, which is basically adapter still in the data, and the adapter content plot, which shows that Nextera adapters are still present. Therefore we have to do an adapter trimming; we want to remove these adapters. Let's go back to the training material. Here is a mock-up of the Nextera read library, and you can see that, depending on the mate, we have different potential adapter sequences: the first mate, which starts at its 5' end here, can read through into the i7 adapter, while the other mate, R2, starting at its 5' end, can read through into the i5 adapter.
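To see why these read-through adapters end up in the data at all, here is a toy simulation: if the DNA insert is shorter than the read length (50 bases here), the sequencer keeps reading past the insert into the adapter. The adapter string below is the commonly used Nextera trimming sequence; the insert is invented:

```python
NEXTERA = "CTGTCTCTTATACACATCT"  # commonly used Nextera trimming sequence

def simulate_read(insert, adapter, read_len):
    """If the insert is shorter than the read length, the sequencer
    reads through the insert into the adapter sequence."""
    return (insert + adapter + "A" * read_len)[:read_len]

read = simulate_read("ACGT" * 10, NEXTERA, 50)  # 40 bp insert, 50 cycles
print(read[40:])  # the last 10 bases are adapter: CTGTCTCTTA
```

This is exactly the contamination FastQC flags in the adapter content plot, and what the adapter trimming in the next step removes.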
Therefore we have two different adapter sequences, for R1 and R2, that we have to remove. We also want to take care of bases with low quality, so we will do a quality trimming as well, which you quite often do for next-generation sequencing data. And we want to remove reads that are too short. Normally, for ATAC-seq data and also for other NGS experiments like CLIP-seq or ChIP-seq, we set a cutoff of around 20 bases and remove reads shorter than that, because of the mapping later on. Very short reads are of course easy to place somewhere in the genome, but the shorter a sequence gets, the less unique it is, so more of these reads map to multiple positions, and then you are no longer sure what the correct position of the read in the genome is. Since with ATAC-seq we investigate open chromatin regions, we want exact positions: it's crucial to have reads where we can say with high confidence that they come from a particular position, which then signals an open chromatin region. Therefore we want to remove reads that map multiple times in our genome, and one cutoff for that is the read length. So let's click on the Cutadapt button, or search for the tool in the search field. The first thing to do is answer the question single-end or paired-end: as mentioned, we have paired-end data, so we select paired-end. Then we select the R1 file as FASTQ/FASTA file #1 and the R2 file as FASTQ/FASTA file #2. Now you might ask: why do we have to select paired-end here? Couldn't I also select single-end and apply Cutadapt twice?
That is, once for R1 and once for R2. The answer is that you have to keep the files synchronized. As mentioned, we will filter with Cutadapt based on length and quality, and if one of the mates is removed, say a read in R1 because it is too short or its quality is not good enough, then you still have the other mate in your R2 file, and R1 ends up with one read fewer than R2. This is a problem for the mapping, because with paired-end data the mapper needs two files with the same number of reads. If they are not synchronized, you will end up with an error during mapping: either the mapper complains that something is wrong with your format, or you get a general error. Quite often this means the adapter trimming tool was not used correctly and the read files are out of sync. Therefore: paired-end, select both files. Now you have to tell Cutadapt which adapter sequences to look for in the read library. Under Read 1 Options, click "Insert 3' (End) Adapters", then select "Enter custom sequence". You can give the adapter a label; we use what's written in the training material, "Nextera R1". Then we enter the actual adapter sequence, which is also in the training material: copy that string and paste it in here. That was the first mate. We also want to remove the adapter in the second mate, so go down from Read 1 Options to Read 2 Options, insert a 3' adapter with a custom sequence there as well, label it "Nextera R2", and copy in the sequence for that adapter.
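The synchronization argument can be sketched in a few lines of Python: if you filter the two files jointly, as Cutadapt's paired-end mode does, a pair is dropped as a whole and the files stay in sync. The read sequences here are invented:

```python
def filter_pairs(r1_reads, r2_reads, min_len):
    """Filter mates jointly: drop a PAIR if either mate is too short.
    This keeps R1 and R2 synchronized, which is what Cutadapt's
    paired-end mode guarantees and the mapper later relies on."""
    kept1, kept2 = [], []
    for a, b in zip(r1_reads, r2_reads):
        if len(a) >= min_len and len(b) >= min_len:
            kept1.append(a)
            kept2.append(b)
    return kept1, kept2

r1 = ["ACGTACGTACGTACGTACGTA", "ACG"]  # second R1 read is too short
r2 = ["TGCATGCATGCATGCATGCAT", "TGCATGCATGCATGCATGCAT"]
k1, k2 = filter_pairs(r1, r2, min_len=20)
print(len(k1), len(k2))  # -> 1 1  (both files keep the same count)
```

Filtering each file independently would instead leave R1 with one read and R2 with two, exactly the out-of-sync situation that makes the mapper fail.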
Now we want to set the quality and length filters. Go into Filter Options and set the minimum length to 20. Then, under Read Modification Options, set the quality cutoff to 20 as well, so low-quality bases below a quality of 20 are trimmed off. We also want an additional Cutadapt report: go into Output Options and set "Report" to yes. Now we click on Execute. Have I forgotten something? Okay, no. So let's click on Execute. And you already see what I mentioned about the tags: we clearly see that this Cutadapt run was applied to the two mates of my ATAC-seq data. This might take a few seconds, so we'll take a short break again. Okay, once Cutadapt is finished, you will see three output files: the trimmed reads for the first and second mate, and the third file is the Cutadapt report. It gives you general statistics about the adapter trimming. For example, we can see that for our first and our second mate around 14.5% of the reads contained adapter sequences, which were removed. Around 0.7%, about 2,100 reads, were filtered out because they were too short, and around 283,000 reads, i.e. 99.3%, passed the filters and are still in our data. It is always important to check this, to see whether something went wrong with the adapter trimming. But to definitively check the trimming, you have to apply FastQC again. So let's search for FastQC, select "multiple datasets", and select the output files of Cutadapt; for me these are datasets 11 and 12, for you maybe different numbers, but in any case Read 1 and Read 2. Then execute.
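For intuition, here is a heavily simplified version of 3' quality trimming with a cutoff of 20. Cutadapt's real algorithm uses a running-sum scheme, so this sketch only approximates it by dropping trailing low-quality bases:

```python
def quality_trim_3prime(quals, cutoff):
    """Simplified 3' quality trimming: drop trailing bases whose Phred
    quality is below `cutoff`. Returns the number of bases to keep.
    (Cutadapt's actual algorithm uses a running-sum criterion.)"""
    keep = len(quals)
    while keep > 0 and quals[keep - 1] < cutoff:
        keep -= 1
    return keep

# A read whose last two bases dropped below Q20:
print(quality_trim_3prime([38, 37, 35, 30, 12, 8], cutoff=20))  # -> 4
```

After this trimming, any read (or pair, in paired-end mode) shorter than the minimum length of 20 is discarded entirely, which is where the 0.7% of filtered reads in the report comes from.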
In the meantime, I want to explain that after this we will apply the mapping, i.e. an aligner, and we will use Bowtie2, a mapping tool that works quite well for ATAC-seq data. We have to optimize Bowtie2 for this data type, and there are three important things to mention. First, we will set it to --very-sensitive, which is an end-to-end alignment mode, meaning the entire read is used for the alignment. (There is also a different mode that allows soft clipping, the local alignment mode.) Second, we will map the reads to the canonical version of the human genome, hg38. There exist alternate loci in the human genome, and reads mapping to these alternate loci would also map to the canonical chromosome, so they would map multiple times. To distinguish reads that map multiple times only because of alternate loci from reads that map multiple times because of repetitive regions, we map to the canonical version of the human genome only. Third, we will allow something called dovetailing. It's explained in detail in the training material, but in short: Cutadapt only removes adapter matches that are three bases or longer, so if only one or two bases of an adapter remain in a read, they are not removed. It can then happen that one mate extends slightly past the start of the other; you can see this in the picture, where mate 1 extends a bit beyond the start of mate 2. Bowtie2 would consider this a non-concordant alignment and discard it. However, you see this kind of event very often in ATAC-seq data, and in order to keep these reads, we have to allow dovetailing. Okay, that was the foreword. Let's check our results for the adapter trimming, first for the first mate.
In the report, go to the adapter content: no overrepresented sequences anymore, and the adapter content is zero. The same for our second mate. Here you can also check that your read files are still synchronized: in the basic statistics you see 283,155 reads in our second mate, and the same number in our first mate, so both files contain the same number of reads. Okay, let's come to the mapping. Either search for Bowtie2 or click on the button here. As mentioned, we have paired-end data, so select paired-end and select the result files of Cutadapt: for FASTQ file #1 the Cutadapt result for Read 1, and for FASTQ file #2 the Cutadapt result for Read 2. Then, as mentioned, we set the analysis mode to --very-sensitive. We select "use a built-in genome index" and search for hg38, the canonical version. Then we have to optimize a bit further. For the question "Do you want to set paired-end options?" we answer yes, and we set the maximum fragment length for valid paired-end alignments to 1000. We know from this particular experiment that the fragment size is up to 1000, which is why we chose this number; for your experiment it may be different, so don't just blindly select 1000 here. Then, as mentioned: allow mate dovetailing, yes. "Set read group information" stays at "do not set". "Select analysis mode": we have already set this to very sensitive. "Do you want to tweak SAM/BAM options?": no. What we still want is some statistics about our mapping, so set "Save the Bowtie2 mapping statistics to the history" to yes; always good to do.
Then we execute the tool. The mapping might take quite some time, so let's take a break here and come back in a moment. Okay, hopefully the mapping is now finished. Let's take a very brief look at the mapping statistics: click on the icon for the "mapping stats" file from Bowtie2. You can see, for example, that around 55% of our reads aligned concordantly exactly one time, and around 43% aligned concordantly more than one time. Adding this up, 97–98% of our reads aligned to the canonical version of the human genome, which is pretty good. Now, before we do the peak calling, which is basically the identification of significantly enriched regions corresponding to open chromatin regions, we still have to do some filtering and a check-up. Going back to the training material and scrolling down: first, we want to keep only read pairs with a certain mapping quality, with a cutoff around 30, and they also have to be properly paired; we definitely need the first and the second mate together. And we want to filter out reads that map to the mitochondrial chromosome. Why? Because the mitochondrial chromosome is not interesting to us: it has no nucleosomes, so the Tn5 transposase inserts a lot of adapters there, and this chromosome is therefore not of interest. So we click on the button or search for "Filter BAM datasets on a variety of attributes", and we select our alignment file from Bowtie2. Under the BAM dataset filters, we select the first filter, which is already the default, the mapping quality, and set it to greater than or equal to 30. Then we insert another filter, isProperPair, and select yes for properly paired reads. And then we insert another one: reference.
Here we enter "!chrM", the exclamation mark meaning "not": we keep reads that do not map to chrM and throw away those that do. And that's basically it, so we execute, and then we have to wait a short moment. Let's stop, and come back. This should now be finished, and you can see the effect just from the file sizes: the alignment file from Bowtie2 was 50.2 MB, and after filtering we have 28.1 MB, so we actually filtered out about half of the reads. If you want to include reads that map to repetitive regions, you would need to adjust the cutoff for the mapping quality. If you want to know more about this, there is an interesting post about Bowtie2, mapping quality, and mapping to repetitive regions; if you search for it, you can read about this in more detail. Next, we also want to remove PCR duplicates. As mentioned, we have PCR duplicates; you may remember this from the FastQC report. These PCR duplicates are a bit of a problem: if you do peak calling with PCR duplicates in the data, they introduce a technical bias and may produce a lot of false positives, i.e. regions that the peak caller reports as open chromatin but that are actually not, just because of PCR duplicates in our data. To remove them, we use the tool MarkDuplicates. Click on MarkDuplicates, or search for the tool, and select the BAM file from the previous filtering step. Then we have to set exactly this one option with the long text; I will quickly search for it, because I sometimes miss it. It's actually right at the beginning: it says "If true do not write duplicates to the output file instead of writing them with appropriate flags set".
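The three filters we just configured can be summarized in a few lines of Python. The dictionary field names are hypothetical stand-ins for the corresponding BAM fields and flags; note also that a MAPQ of 30 corresponds to an estimated mismapping probability of 10^(-3):

```python
def keep_read(read, min_mapq=30):
    """Apply the three filters from the tutorial: MAPQ >= 30,
    properly paired, and not on the mitochondrial chromosome.
    `read` is a plain dict standing in for a BAM record."""
    return (read["mapq"] >= min_mapq
            and read["proper_pair"]
            and read["chrom"] != "chrM")

reads = [
    {"chrom": "chr22", "mapq": 42, "proper_pair": True},   # kept
    {"chrom": "chr22", "mapq": 7,  "proper_pair": True},   # low MAPQ
    {"chrom": "chrM",  "mapq": 42, "proper_pair": True},   # mitochondrial
    {"chrom": "chr22", "mapq": 42, "proper_pair": False},  # not paired
]
print([keep_read(r) for r in reads])  # -> [True, False, False, False]
```

Lowering `min_mapq` is exactly the knob mentioned above for keeping multi-mapping reads from repetitive regions.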
It's a long sentence, and we have to set it to yes, which just means that we filter out the PCR duplicates: they are not written into the result file but thrown out. With that, we can execute the tool and wait again, so I'll take a short break. Once that is finished, you have two results from MarkDuplicates: one file with some statistics, and the other is our BAM file with the PCR duplicates now removed. If you look into the statistics, there is a table with information like: we examined around 135,000 to 136,000 reads, and from these we removed about 3,600 reads because they are potential PCR duplicates. This table is a bit hard to read, I can imagine; in the training material, under the MarkDuplicates section, there is a tip, "Formatting the MarkDuplicates metrics for readability", where you can apply two tools to make it more readable if you want. Let's skip that and come to one of the last points before the peak calling: we have to check the insert sizes of our processed data. This is very important for ATAC-seq data, because it has a very specific pattern you have to check. If you see this pattern, you can be certain your ATAC-seq experiment worked; if it is clearly off, that indicates very heavy noise or a very big quality breach. For this we will apply a tool called "Paired-end histogram". Search for this tool or click on the button here, choose the BAM result of MarkDuplicates, set the lower limit to zero and the upper limit to a fragment size of 1000, which we know and already used for Bowtie2, and simply click on Execute. While this runs, I already know the results, and you can also see them in the training material.
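Conceptually, duplicate removal keeps one read per mapping position, which can be sketched like this. MarkDuplicates is more sophisticated (it also considers the mate's position and base qualities, and by default it flags rather than removes), and the field names here are invented:

```python
def remove_duplicates(alignments):
    """Keep one alignment per (chrom, start, strand); later copies are
    treated as PCR duplicates. A simplification of MarkDuplicates."""
    seen, unique = set(), []
    for a in alignments:
        key = (a["chrom"], a["start"], a["strand"])
        if key not in seen:
            seen.add(key)
            unique.append(a)
    return unique

alignments = [
    {"chrom": "chr22", "start": 18_000_100, "strand": "+"},
    {"chrom": "chr22", "start": 18_000_100, "strand": "+"},  # duplicate
    {"chrom": "chr22", "start": 18_000_100, "strand": "-"},  # other strand
]
print(len(remove_duplicates(alignments)))  # -> 2
```

The intuition: two biologically independent fragments almost never share the exact same start position and strand, so identical coordinates are treated as PCR copies.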
So what you will get is this pattern here. You can imagine you have two mates, so one read pair: there is R1 and there is R2. And there are different events, or different kinds of pairs, so to say, in ATAC-seq data. There are pairs which lie entirely within one open chromatin region, so with no nucleosome in between. Then there is the event where the two mates, R1 and R2, surround exactly one nucleosome. And then there are two mates which surround exactly two nucleosomes, and so on. This you can now see in this insert size metric. You have here on the x-axis the insert size in bases, and on the y-axis the frequency, so the number of read pairs which have exactly this insert size. And you should see a very high peak at around 50 bases, which corresponds exactly to those read pairs that lie in open chromatin regions with no nucleosome in between. Then you have a smaller peak around, or a bit below, 200 bases, which corresponds to exactly one nucleosome in the middle of the read pair. And then you have another smaller peak around 400, which is two nucleosomes, and you can see the pattern: there is another peak around 600, three nucleosomes, and so on. If you don't see this pattern, then something might have gone wrong with your data. For example, if you have no nucleosomes, so a genome which generally has no nucleosomes, like the mitochondrial chromosome, then you see something like this here: there are no bigger bumps, no bigger peaks in there. If you have a very noisy ATAC-seq experiment, then you see something like this: the peaks are very blurred, and you see that the bump at 200 is very low. If you have a very bad or failed ATAC-seq experiment, then you don't see these bumps at all. Then you only have the peak around 50 and no further bumps. I mean, you can barely see it; maybe there is a very slight bump at 200, but it's barely visible. And this one totally failed.
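And stepping back briefly: the two cleanup steps that produced this clean pattern, the chrM/mapping-quality filter and the duplicate removal, boil down to two simple rules. This is only a sketch; the MAPQ cutoff of 30 and the read records are assumptions for illustration, and the real MarkDuplicates is smarter (it compares 5' coordinates and keeps the best-quality copy rather than simply the first one).

```python
# Rule 1 (filtering): drop reads on chrM and reads with low mapping quality.
def keep_read(chrom, mapq, min_mapq=30):
    return chrom != "chrM" and mapq >= min_mapq

# Rule 2 (deduplication): fragments with identical chromosome, coordinates
# and strand are treated as PCR duplicates; only the first one is kept.
def remove_duplicates(fragments):
    seen, kept = set(), []
    for frag in fragments:
        if frag not in seen:
            seen.add(frag)
            kept.append(frag)
    return kept

print(keep_read("chrM", 60), keep_read("chr22", 42))  # False True
frags = [("chr22", 100, 300, "+"),
         ("chr22", 100, 300, "+"),   # exact duplicate, removed
         ("chr22", 500, 650, "-")]
print(len(remove_duplicates(frags)))  # 2
```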
And again, this is what we see: these very distinct peaks at 50, 200, 400, 600 and so on. So this is a very good ATAC-seq experiment. And before we now go into the peak calling, let's take a bigger break. Okay, let's come back to one of the biggest, or the most crucial, tasks now: the peak calling. We left off having checked this typical insert size pattern of ATAC-seq data, to see that our ATAC-seq experiment worked. And now we have to identify significantly enriched regions, from which we can infer open chromatin regions. Two tools you can use for that are Genrich and MACS2. Both were actually designed for ChIP-seq data, but they have since been optimized for ATAC-seq data as well. We will use MACS2 here, but it is always good to check another peak caller like Genrich, to see if the results you get from one peak caller correspond to the results you anticipate, and also to check which peaks are more robust, meaning that they pop up not only in one tool but also in the results of a second tool. So I highly recommend not relying on the results of one peak caller only, but using a second peak caller to check these results as well, because each has its false positives as well as its false negatives, since they have different models, meaning different algorithms and data assumptions. Again, we will use MACS2 here, and we will tune MACS2 a bit for our data, meaning that we will recenter the reads on the 5' ends. Since we are interested in open chromatin regions, we have to recenter so that we pick up exactly those regions, and not the regions which sit exactly on a nucleosome. You might remember, and you can see this in this picture, that the paired-end data encompasses different events, meaning you have pairs which span only open chromatin, or they surround one nucleosome, or they surround two nucleosomes, and so on.
MACS2 in its standard mode, since it was optimized for ChIP-seq data, will recenter these pairs exactly on the center of the nucleosome, where usable. However, we do not want this behavior. We don't want a pile-up at the nucleosome, since that is closed chromatin; we want the pile-ups in open chromatin, and therefore we will recenter our reads, our pairs, exactly at the 5' end, where the Tn5 binds and inserts our adapters. Furthermore, we want to take all of the reads into account, since every read corresponds to an open chromatin site. MACS2, again in its default mode, takes both mates, the first mate and the second mate, and recenters them for ChIP-seq data to get a pile-up exactly where the protein binds. This we do not want. We want each individual mate, each of which corresponds to open chromatin, taken into consideration. These are the two points where we have to optimize MACS2 for ATAC-seq data: first, we want to recenter the mates on the 5' ends, and second, we want to take each mate individually into consideration for the peak calling. This we will now do. In order to do this, we first have to apply the bedtools BAM to BED converter: we will convert the BAM file we have from the MarkDuplicates tool into BED format. We kind of trick MACS2 into believing that our paired-end data is actually single-end data. To do this, we apply this tool first. Let's click here on bedtools BAM to BED converter, or search for it in the tool search field. We apply it to the BAM file from MarkDuplicates, and I think that's already it; we just apply it to the data and click on execute. We wait until this is finished. In the meantime, I also want to mention: we only take the ATAC-seq data into account for the peak calling here. Of course, MACS2 can incorporate control data.
Since we don't have control data here, we will not do it, but for your interest, MACS2 is able to incorporate control data as well. This is not finished yet, so in the meantime I'll explain this whole recentering again. You can see here a short mock-up of what we will do. This is the actual read pair, and we are interested in the 5' ends. We want to recenter the pair on the 5' end of the first mate, and likewise on the 5' end of the second mate. Therefore, we will shift the read by 100 base pairs, for the first mate to the left and for the second mate to the right, and then we extend by 200 base pairs from there. We are left with this big stretch, so to say, at the end for each of our reads. Why do we extend it by as much as 200 base pairs? The explanation is: if you would just take the 5' end itself, so just one single nucleotide, the data is too sparse to do the peak calling. That's why we extend to 200 nucleotides, so that we have a broader distribution, so to say, and therefore a better peak calling. Now the conversion is complete, and we can apply MACS2. Let's click here on MACS2 callpeak, or search for it in the tools. Now we have to optimize the settings again. Are you pooling treatment files? No. ChIP-seq treatment file: so it says ChIP-seq; again, MACS2 was designed for ChIP-seq data, but we can use it for ATAC-seq data with the right settings. We apply it to the result of MarkDuplicates; pay attention that it's the BED file. Do you have a control file? No, we don't. Format of input files: single-end BED. Effective genome size: since the data comes from Homo sapiens, so from human, we leave it at the default, human. Build model: this is the model-building mode for ChIP-seq data, so this we don't want.
We want to optimize it for ATAC-seq, and therefore we set the extension size to exactly 200, so we will extend to 200, and set the shift size to minus 100, so we shift the read, as mentioned, to the left for the first mate and to the right for the second mate by 100 nucleotides. Then we want additional outputs: we want the peaks as a tabular file, we want the peak summits, and we want the scores in a bedGraph file. And in the advanced options we have to check another setting, this option here: composite broad regions, no broad regions; and "use a more sophisticated signal processing approach to find subpeak summits", yes. If you want to know a bit more about this option, then please read the paper about MACS2; it would go a bit too much into detail now, but this basically gives us better results. Then: how many duplicate tags at the exact same location are allowed? Since we have removed duplicates, we already know that there are no PCR duplicates in our data anymore, and we want to keep every read. It can happen that MACS2 removes reads based just on them mapping to the exact same position; since we already checked for this, we don't want these reads removed, and therefore we say: okay, take all the reads we have into consideration. With that, we can execute the tool, and this might take a while, so we do a short break here.
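While the tool runs, here is the whole recentering trick in one minimal Python sketch: each paired-end fragment is first split into its two mates (the BAM-to-BED "single-end" trick), and each mate's 5' cut site is then turned into a 200 bp interval centered on it, which is exactly the arithmetic behind MACS2's shift of -100 and extension size of 200. The fragment coordinates are invented; the real tools of course work on the actual alignments.

```python
# Split a fragment into its two Tn5 cut sites (one per mate).
def fragment_to_cut_sites(chrom, frag_start, frag_end):
    # mate 1: 5' end at the fragment start (forward strand)
    # mate 2: 5' end at the fragment end (reverse strand)
    return [(chrom, frag_start, "+"), (chrom, frag_end, "-")]

# Apply shift -100 / extsize 200 to a single cut site.
def recenter(cut_site, shift=-100, extsize=200):
    chrom, pos, strand = cut_site
    if strand == "+":
        start = pos + shift            # shift 100 bp to the left...
    else:
        start = pos - shift - extsize  # ...mirrored on the reverse strand
    return (chrom, start, start + extsize)  # ...then extend to 200 bp

for cut in fragment_to_cut_sites("chr22", 1000, 1220):
    print(recenter(cut))
# ('chr22', 900, 1100)   interval centered on the 5' end at 1000
# ('chr22', 1120, 1320)  interval centered on the 5' end at 1220
```

Every resulting interval is centered on a Tn5 cut site, which is why the pile-ups end up in open chromatin rather than on the nucleosomes.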
Okay, let's come back. The peak calling should be finished now, and now we come to what for most people is the most fun part, and this is the visualization. For all NGS experiments, at the end you will do the bigger analysis, and here we want to check if our ATAC-seq peaks correspond to what we would anticipate. For that reason we downloaded, as you may remember, a file which corresponds to the binding sites of the transcription factor called CTCF. This is our positive control now: we want to see if the peak calling gave us open chromatin sites that correspond to transcription factor binding sites of CTCF, which is something we would anticipate. Furthermore, we also want to check later on if transcription start sites have a bigger read coverage in our ATAC-seq data than other regions, which we would also anticipate for ATAC-seq data. In order to do this analysis and the visualization, we have to do a few preparation steps before applying our visualization tools. First of all, we want to select only the binding sites of CTCF which come from chromosome 22, since our data mainly comes from chromosome 22. To do so, we click here on "Filter data on any column using simple expressions", or search for this tool, and we apply it to the data which we downloaded in the beginning, this ENCFF file, with the following condition: check column one, and column one should be equal to the string chr22. And here we simply click on execute. To explain a bit better what is happening: if you look into this peak file of CTCF, then you see that column one always lists the chromosome name, and we simply check for every line if this chromosome name corresponds to chromosome 22. Since we want to check if ATAC-seq peaks correspond to CTCF binding sites, we also have
to select only CTCF binding sites which lie in intergenic regions. So we have to take another file into consideration now: you may remember we used the UCSC Table Browser, and there we downloaded the genes for chromosome 22. This we now use to filter for intergenic CTCF binding sites. So we click here on this tool, "bedtools Intersect intervals: find overlapping intervals in various ways". Then we select, under "file A to intersect with file B", the result file we have just generated, so this filter on data 1, and under "file B to intersect with file A" we select this chromosome 22 genes file which we have downloaded. Then we say "write the original entry in A for each overlap": so, for what should be written to the output file, we select this -wa option. Furthermore, the required overlap we leave at the default of one base pair. And, I will search for it now, right underneath it says: "report only those alignments that do not overlap with file B". Click here on yes. Again, we only want binding sites which are intergenic, so not overlapping genes themselves, and therefore we activate this option. That's already it, and we execute the tool; that shouldn't take long. In the meantime we will generate another file that we need for the visualization tools. We got from MACS2 a bedGraph file which says "bedGraph treatment"; if you look into the history, you find it here: MACS2 callpeak, and in brackets it says bedGraph treatment. We have to convert this bedGraph into a bigWig file, which is a binary form of a bedGraph file and is needed for the tools which we will use later on. So click here on "Wig/BedGraph-to-bigWig", we apply this tool to the treatment bedGraph, and simply click on execute. And hopefully this will be done shortly, or it
actually failed. I have to check what happened; let's take a short break and I'll come back describing what I did wrong. So, what did I do wrong? I checked, and technically I did nothing wrong in applying the tool; what I did wrong was how I downloaded one of the data files, specifically the chromosome 22 genes annotation. I hope you followed the training material, and not my instructions at the beginning. In the training material it's written, when you apply the UCSC Main table browser to get the chromosome 22 genes: I said I took the database GENCODE version 36; however, you have to take the database GENCODE version 37 here, and also the table is the "basic" table. So if you also ran into a problem, this is probably the cause for you as well. Please go back, follow the training material, select all the options which are written there, then apply again this "cut columns from a table" tool, and rename your file to chr22 genes. With that it should work. And so we come back to where we left off: I applied the bedtools intersect, and I applied this conversion of bedGraph to bigWig on the treatment file, so on our coverage. In the training material it's now written that, to keep a better overview, we rename the bedtools intersect result, since these are basically our intergenic CTCF peaks from chromosome 22, and the bigWig result we simply rename as well. Maybe I was too fast: simply copy this text here for the bedtools intersect file, and copy this text here for the bigWig file, and then rename again: click here on "Edit attributes", paste the name here, and click on save.
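To recap the two selection steps we just did in code form: the chromosome filter and the intergenic intersection reduce to a few lines of Python. The intervals below are invented; "bedtools intersect" with the no-overlap option keeps exactly the A entries that overlap nothing in B.

```python
# Step 1: keep only CTCF sites on chr22 (the Galaxy column filter, c1 == 'chr22').
ctcf_sites = [
    ("chr1", 100, 400), ("chr22", 2000, 2300), ("chr22", 9000, 9200),
]
chr22_sites = [s for s in ctcf_sites if s[0] == "chr22"]

# Step 2: keep only the sites overlapping no gene (like bedtools intersect -v).
def overlaps(a, b):
    # half-open intervals (start, end) on the same chromosome
    return a[1] < b[2] and b[1] < a[2]

genes = [("chr22", 1500, 2500)]  # one invented gene body
intergenic = [s for s in chr22_sites
              if not any(overlaps(s, g) for g in genes)]
print(intergenic)  # [('chr22', 9000, 9200)]
```

The site at 2000-2300 falls inside the gene body and is dropped; the distal site survives as an intergenic CTCF peak.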
With that we should have everything; however, for the visualization we first have to compute a matrix over the transcription start sites. This computeMatrix tool creates a matrix around the transcription start sites, with which we can check what we would expect: that the coverage in our ATAC-seq data is increased around transcription start sites. So click here on computeMatrix, or search for it in the tool search field, and select under "select regions, regions to plot" the chr22 genes; sample order matters: no; the score file is the MACS2 bigWig file which we generated. computeMatrix has two main modes; instead of scale-regions, we select reference-point, and the default reference point is the beginning of the region, so this we leave. We also leave the distance upstream and distance downstream at the default of a thousand, so we pick a big window of a thousand bases left and right of the transcription start sites to look at the coverage. The only things we now also have to select are: show advanced options, yes, and convert missing values to zero, yes. With that we are good to go, and we simply execute the tool.
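While computeMatrix runs: conceptually, it builds one row per region, sampling the bigWig coverage in a window around the reference point, with missing positions counted as zero. Here is a toy version of one matrix row; the coverage values are hypothetical, and the coarse 500 bp bin is only there to keep the output short (the real tool uses much finer bins).

```python
# One row of a reference-point matrix: mean coverage per bin in a window of
# 1000 bp up- and downstream of a TSS; positions absent from the coverage
# dict count as zero (the "convert missing values to zero" option).
def matrix_row(coverage, tss, flank=1000, bin_size=500):
    row = []
    for start in range(tss - flank, tss + flank, bin_size):
        total = sum(coverage.get(p, 0) for p in range(start, start + bin_size))
        row.append(total / bin_size)
    return row

coverage = {p: 4 for p in range(1900, 2100)}  # invented peak around the TSS
print(matrix_row(coverage, tss=2000))  # [0.0, 0.8, 0.8, 0.0]
```

Stacking one such row per gene gives exactly the matrix that plotHeatmap will color in the next step, and averaging the rows column-wise gives the profile curve in the top panel.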
I hope this does not take too long; I'm now sitting in the queue, let's see. I hope it will resolve quickly; if not, we take a short break here and I come back in a moment. So, now the computeMatrix run should be completed, and we continue to plot the heatmap. We click here on plotHeatmap, we select the matrix file from the computeMatrix tool, the result which we generated, and that's it; we simply click on execute. This time I hope that it doesn't take long and that I don't sit in the queue for too long. Let's do something else in the meantime: after I explain what we see in the plotted heatmap, we will do another heatmap, the same thing, but this time looking at the intergenic CTCF peaks of chromosome 22, where we also anticipate an increased coverage at these sites. For that we apply the same tools: we click here on computeMatrix, and now instead of the chr22 genes we select the intergenic CTCF peaks chr22, we select again as the score file the MACS2 bigWig, and for the reference point we now select the center of the region instead of the beginning of the region, so the reference point for the plotting is the center of the region. Again we leave a thousand bases left and right, and we do the same thing: show advanced options, yes, and convert missing values to zero. And then we click on execute. Okay, now let's come to the results of our first heatmap, because I saw that it is finished, and we simply inspect it by clicking on the view data button. And we see: the upper plot shows you the average coverage as the blue line, with the coverage on the y-axis and the distance to the transcription start site on the x-axis, centered here in the middle. The heatmap below shows on the y-axis, so to say, all the genes which we had in our annotation file, and on the x-axis the
distance to the transcription start site again, so a thousand bases left and right, and then you have a scaling of the coverage from red to blue: blue means high coverage, red means no coverage. And you can see that around the transcription start site, as we expected, we have an increased coverage, which is actually shifted to the left. This we would also expect, because to the left are the upstream regions, and upstream regions quite often correspond to open chromatin regions; so here it's a bit left-shifted, okay. Now the other computeMatrix run is finished, and with that we can apply the plotHeatmap tool again, for our second result file. So we click here on plotHeatmap, and we select now this computeMatrix result, number 58 for me here, and we will change the design of the plot a bit. Under advanced options, say yes, and then color map: I have to search for it; exactly, here, "color map to use for each sample". We click on "insert color map to use for each sample" and then select blue. You remember we had a red-to-blue scaling; now I change it to a blue scaling, to show you that you can change the appearance of this plot. Then we also label the x-axis a bit differently: simply copy this text here and paste it into the x-axis label. You may remember from the other plot it was "distance from the transcription start site"; however, we have not inspected transcription start sites now, we inspected the intergenic CTCF peaks, and therefore we name this "distance from the peak center in bases". Also the y-axis we relabel: under the y-axis label for the top panel we say "CTCF peaks", since we inspected CTCF peaks. And the reference point label, which before said transcription start site, we set to "peak center". And for the labels for the regions plotted in the heatmap, here we also say "CTCF peaks". So I'm just checking again if I
selected all the right options, and then we only have to go to this option here: did you compute the matrix with more than one group of regions? Yes, I used multiple groups of regions. And then we click on execute. So I'll just wait a short moment; it's already computing, that's good, and then it should be finished in a few seconds. You can also skip to the point where the heatmap is finished. And it is finished now. So if you click on the view data button, then you see again that the top panel is the average coverage, and you already get the idea that, as we anticipated, the coverage is increased around the peak centers of the intergenic CTCF peaks. And it is also no longer left-shifted: since we don't have transcription start sites here but intergenic CTCF peak sites, it's really centered around the peak center. You can also see the labels which we changed for the x-axis and the y-axis, so the labels "peak center" and "CTCF peaks" here, and the coloring of the heatmap is changed now to just the blue coloring scheme. Okay, with that we do a short break, and I come back in a moment. So let's come to the last point of this tutorial: we will now inspect some of the results in a genome viewer within Galaxy. You may already know genome browsers like IGV or the UCSC browser; Galaxy also supports several visualization tools like pyGenomeTracks, to make it a bit more handy for you to inspect certain regions of your data right in Galaxy itself. And what we now want to inspect is similar to what we already did with the plotted heatmaps: first of all, we want to check whether some of the peaks correspond to transcription start sites or intergenic CTCF sites; on the other hand, we also want to see if we can find peaks which do not correspond to either of them and therefore mark other regulatory elements that might be of
interest. So this is something which you typically do in ATAC-seq, of course: to see what kind of regulatory elements you have in your data, how they correspond to certain transcription factors and transcription start sites, and which are kind of novel and maybe of interest for further analysis. So let's take this tool here, pyGenomeTracks; click on it, and we will now create our tracks for a certain window. Okay, stay with me: there are a lot of options which we have to change, but at the end it's a compact visualization. So first of all, let's copy these coordinates here, go to "region of the genome to limit the operation", and paste the coordinates into the text field. Then we start to define our tracks. We already have a default first track here, and for that we simply select, under "choose style of the track", a bigWig track, and as the track file in bigWig format we select the MACS2 bigWig file. The plot title we copy here from the training material, so into the plot title we paste the text, and the color of the track is now up to you; the training material does not specify it, so let's see, I will select yellow. Minimum value is zero and maximum value is five, and "show visualization of the data range": yes. Minimum value means, kind of, the minimum coverage, since we have selected our coverage track, so this is the minimum value this coverage track shows; and for the maximum value: from the plotHeatmap you know the average is around five, therefore we select here a maximum value of five. And "show visualization of the data range" just means, yes, we show the data range. With that we have our first track, and now we click here on "insert include tracks in your plot", so we now include a second one, and for the second one, under "choose style of the track", we select the NarrowPeak track. The plot title we
can copy from the training material; these are the peaks from MACS2, and we therefore select, from MACS2 callpeak, the narrow peaks file (on data 30), so these are the peak regions, so to say. Then the color of the track you can again select on your own; I take this color here. And "display to use": let's go down here, we want to display it as a box, so just draw a box for each region, and plot labels (name, p-value, q-value): no, we don't want to plot them. So this is the second track: the first track is a coverage track, the second track are the peaks. Now we want the third track, and for the third track we take the gene track, a BED track, and copy the name "genes". We want to see the chromosome 22 genes now, and therefore we select as the track file the chr22 genes. Again, the color is up to you; we stay here with black, the height of the plot is five, and plot labels: yes, this time. The height is basically just the height of these boxes which will be drawn, and the plot labels we want to have because we want to see the gene names, therefore yes here; also "put all labels inside the plotted region", yes, and "allow to put labels in the right margin", yes. So this takes care of the labeling. Okay, that's it for this track, and now comes the fourth one. So we have the coverage, the peaks of our ATAC-seq data, and now our genes in there; next we want another NarrowPeak track, so I include the fourth track, and we want to have the CTCF peaks, as I mentioned. So, plot title: "CTCF peaks", and we have this intergenic CTCF peaks from chromosome 22 file here. Ah no, pardon, not this file: we want all the peaks, not only the intergenic ones, so we select this ENCFF BED file instead. Then the track color is again up to you; I would select kind of a red. And under "display to use": box, draw a box, and plot labels again no; we don't really have
labels here. So this is our fourth track, and the fifth one, which is still missing, is simply an x-axis. Therefore we insert another track, and under "choose style of the track" there is an option named x-axis, and that's it; we simply execute. You can also change the image output format from a PNG to an SVG, but let's stay with a PNG and simply click on execute. I hope this doesn't take too long... okay, I think I'm sitting in the queue again, so let's take a short break here. So, I hope you have now finished the pyGenomeTracks run. I did it again, and that's why the background and the shirt changed, because I saw later on that I did something wrong and had to correct it. As you can see, this is the original picture, so this is how it would look if you followed exactly what I did. And the problem: the color here was maybe not a wise choice, and also I set the maximum value to five. I said, okay, we know that the maximum of the coverage is about five, but that's not right. If you follow the training material, then you see that instead of a maximum value of five for the coverage track, so for the first track, where I selected a maximum value of five, the training material actually says you should set no maximum value, and set the height to five. That's the change, and so it now looks like this; don't get confused if your plot looks a bit different from the training material. And what you can see here now are all the tracks which we selected: the first track is the coverage track for our ATAC-seq data, the second track are the peaks of our ATAC-seq data called with MACS2, and the third track are the genes which we see in this window. I also mentioned beforehand that pyGenomeTracks is kind of a genome browser; I have to correct myself there as well, because that's not quite true. We see here a certain window; it's not an interactive tool, so to say, so you cannot simply move around. It's just
this window, and that's why we see this certain window here which we selected; however, Galaxy has some interactive genome browsers available which you can search for. The fourth track are the CTCF peaks, and the last track, the x-axis, gives the coordinates, so to say. And we can see that we have here in this window four ATAC-seq peaks, which correspond to these big spikes here in our coverage track, and we see that we have two genes, RAC2 and SSTR3, with six transcripts: four transcripts for the first gene here and two for the second one. More or less we have three transcription start sites, so this transcription start site here and these two here, and we can see that one ATAC-seq peak more or less corresponds to the transcription start site of RAC2. Furthermore, we see that we have several CTCF peaks here; it's not clearly visible, but the last peak of our ATAC-seq data here corresponds to one of the CTCF peaks. So with that we have confirmed that some ATAC-seq peaks correspond to transcription start sites and some to known transcription factors like CTCF. And then we have two peaks in the middle here which do not correspond to either of them. What you would do now is figure out what they correspond to, what kind of regulatory element that is. So here you would look, maybe, for other transcription factor regions, or for histone marks or methylation sites, something like that, to confirm that this is a valid open chromatin region, or to refute it, so to say, to show that it's actually a false positive. And this then goes into the further peak analysis which you would do. So with that, you see it's very important to look into your data again with a genome browser, to confirm some of the sites and to answer your biological question. With that we come to an end; this was the last part of the ATAC-seq training, and I want to summarize. At the end, you should know by now what ATAC-seq is,
and how you get annotation data for the later data analysis, for example via the UCSC Table Browser. You should now know how to check the raw data quality with FastQC first; then you should be able to do quality and adapter trimming with Cutadapt, followed by mapping with Bowtie2, furthermore also checking the mapping quality there and doing some quality checkups again with FastQC. Then you should be able to do the specific filtering for ATAC-seq data, like keeping only properly paired reads and removing reads that map to the mitochondrial chromosome, which are not of interest. You should also be able to filter out PCR duplicates, which you then also check with a quality control. You should also be able to check this insert size pattern of ATAC-seq data, which is a very important checkup for it. Then you should be able to do the identification of open chromatin regions with MACS2, and you should also be able to further analyze these peaks, these identified regions, with pyGenomeTracks and deepTools; deepTools being the heatmaps for the transcription start sites and the intergenic CTCF peaks. Of course there are a lot of other things you can do; we covered here kind of the basics for ATAC-seq data, but there are still further analyses you would do. If you have further questions regarding this training, then please let us know during the training itself; maybe there's a Slack channel available, so write a message there asking your questions, or reach us via email; you can ask us questions regarding ATAC-seq data analysis there as well. Okay, thank you very much for your attention, and with that I wish you a good day and a good week. I hope I could help you to analyze your ATAC-seq data. Okay, bye-bye.