Hello, and welcome to these introduction slides on ATAC-seq data analysis. Before watching this video, it would be better to follow the introduction to Galaxy analysis tutorials, and also the sequence analysis, quality control and mapping tutorials. Today, I'm going to answer these questions: What is ATAC-seq? What are the quality parameters that you need to check for each dataset that you have? And how do you analyze your ATAC-seq data? So the goal is to understand ATAC-seq, understand the quality parameters, and also the peak calling, which is quite specific to this technique.

ATAC-seq is designed to identify, in the entire genome, the regions that are accessible. What does that mean? In the nucleus, the DNA is wrapped around nucleosomes. In some places, the nucleosomes are very compact and all the DNA is wrapped around them. This is what we call closed chromatin. In other regions, the distance between nucleosomes is larger and you have free DNA in between. This is very important in epigenetics because this is where you have, for example, the transcription start sites of the genes that are going to be transcribed, but also where transcription factors can bind and then promote the transcription of other genes, at distal enhancers, for example.

So in order to know which regions of your genome are accessible, ATAC-seq uses the Tn5 transposase. A transposase is a protein that is able to bind to DNA and insert a piece of DNA that is foreign to this genome. This Tn5 transposase has been modified: it is now more active, and it is this more active version that we use in ATAC-seq. What happens is that when the Tn5 transposase enters the nucleus, it binds and inserts the two DNA fragments that were bound to it into the open chromatin regions, so into the DNA that was not wrapped around nucleosomes. Then you purify your DNA.
So you remove the nucleosomes and the transposase, and then you amplify through PCR cycles using barcoded primers. In fact, the Tn5 has been modified to carry two Illumina adapters. We then use barcoded PCR primers to amplify, and the product is directly the library, so it is compatible with sequencing on an Illumina machine. The use of barcoded primers allows you to sequence different samples together.

Typically, if you want to do an ATAC-seq experiment, we would recommend having at least two biological replicates. For the control, it depends on the situation. If you're using a tissue from, for example, a mouse, then you don't really need a control. But if you're using a cell line and you don't know the copy number, you would need a control. If your cell line has, for example, a region that is amplified, and you don't know that it's amplified when you run your ATAC-seq, you will not be able to tell whether the pile-up that you see on this region is due to the amplification in the cell line or to the fact that it was highly accessible. If you need a control, the control would be purified DNA, so naked DNA: you remove all the nucleosomes and then you just use the Tn5 to cut and insert the adapters randomly everywhere on your genome. You should then get a coverage of the genome that corresponds to the copy number.

Usually, we use paired-end sequencing for ATAC-seq. There are two reasons for that. The first reason is that with paired-end, you know the fragment length. As you may have seen on the previous slide, you have different types of fragments. You have fragments like this that are depleted of nucleosomes; these are nucleosome-free fragments. And then you have fragments like this, which were in fact carrying a nucleosome.
But when you sequence, you can't really distinguish them, except that the DNA wrapped around a nucleosome is about 150 bases long. So if the fragment size is smaller than 150 bases, you're sure that it comes from a nucleosome-free region. However, this is something that we don't currently cover in the ATAC-seq training.

The other reason why we should use paired-end is that, as we use PCR to amplify the library, it's difficult to distinguish real duplicated events from PCR duplicates. Either you sequence the same fragment twice because it comes from two different cells, or you have identical fragments just because the same fragment has been amplified by PCR. If you have paired-end data, you can distinguish between the fragments whose two mates are exactly at the same place and the ones that have one mate at the same place but the other one at a different place. If you have only single reads, you won't be able to distinguish between them and, to be conservative, you would remove one of them. So you may decrease the pile-up at this region. That's why we would recommend using paired-end.

So how do you analyze your data? There are different quality controls that you can do. The first one, which you can even do before sequencing, is to run a fragment analyzer to see the size of your fragments. Be careful: on a fragment analyzer, you need to add the size of the adapters. Here, what I'm showing you is the insert size after sequencing, so it's slightly biased towards the short fragments, but it still gives you an idea of what you are supposed to see. The majority of your fragments should be very small, around 50 base pairs. In fact, if you have two transposases bound to DNA very close to each other, stuck one to the other, they will insert your Illumina adapters.
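To make the link between paired-end reads and fragment size concrete, here is a minimal Python sketch. It assumes you already have the leftmost mapped position of one mate and the rightmost mapped position of the other; the function names and the example coordinates are hypothetical, and the 150 base threshold is simply the nucleosome size quoted in the talk, not a recommended cutoff.

```python
def fragment_length(left_start: int, right_end: int) -> int:
    """Length of the sequenced fragment, from the leftmost position of one
    mate to the rightmost position of the other (0-based, end exclusive)."""
    if right_end <= left_start:
        raise ValueError("right_end must be greater than left_start")
    return right_end - left_start


def classify_fragment(length: int, nucleosome_bp: int = 150) -> str:
    """Rough label based only on size (assumed threshold: 150 bases)."""
    if length < nucleosome_bp:
        return "nucleosome-free"
    return "may carry a nucleosome"


# Hypothetical fragments: (leftmost start, rightmost end)
for left, right in [(1000, 1050), (1000, 1200)]:
    size = fragment_length(left, right)
    print(size, classify_fragment(size))
```

With single reads only one end of the fragment is known, so this length cannot be computed at all.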
When two transposases sit right next to each other, the distance between the two Illumina adapters they insert is around 50 base pairs, and this corresponds to the shortest fragments that you can have. You can see that you have quite a lot of them. That means that the Tn5 transposases were either stuck together or just slightly separated. Then you have another population, another band, that corresponds to 200 base pairs. This corresponds to the situation where you have one nucleosome with DNA wrapped around it, which as I said is 150 base pairs, plus two Tn5 that were stuck right next to the nucleosome. This gives 200 base pairs. You can also see 400 and 600, corresponding to two or three nucleosomes stuck one to the other.

The Tn5 transposase initially comes from bacteria, where it moves a transposon from place to place, and it recognizes a strong nucleotide sequence. With the modification of the Tn5, the nucleotide bias is reduced, but it is still high. That's why, when you run a FastQC analysis on your reads, you may see this type of profile, with a strong nucleotide bias. Still, we manage to have coverage roughly everywhere on the genome. So this is a bias, but it's not as if 100% of the reads had this sequence.

As in other next-generation sequencing analyses, we apply some filters after the mapping. First, we filter for the uniquely mapped reads. This is because if you look at a region that is repetitive, you may have a lot of reads that pile up just because it's repetitive, and you don't want artificial pile-ups. So we remove all the reads that map to multiple locations. We also remove the reads that map to mitochondrial DNA, and this is quite specific to ATAC-seq. Mitochondrial DNA is like bacterial DNA: it's supercoiled, but there are no nucleosomes. So when the Tn5 accesses this DNA, it just cuts everywhere. Depending on your tissue or your sample, you may have different amounts of mitochondrial reads.
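The two filters just described can be sketched in a few lines of Python. This is only an illustration on toy records, not what the mapping tools do internally: the `Alignment` type, its `n_locations` field and the chromosome name `chrM` are all assumptions made for the example (real alignment files usually express uniqueness through a mapping quality score, and the mitochondrial chromosome name depends on the reference genome).

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    read_name: str
    chrom: str
    n_locations: int  # hypothetical field: number of places the read maps to


def passes_filters(aln: Alignment, mito_name: str = "chrM") -> bool:
    """Keep reads that map to exactly one place and not to the mitochondrion."""
    is_unique = aln.n_locations == 1
    is_mito = aln.chrom == mito_name
    return is_unique and not is_mito


# Toy data, invented for the example
alignments = [
    Alignment("read_a", "chr1", 1),  # unique, nuclear: kept
    Alignment("read_b", "chr1", 3),  # maps to several locations: removed
    Alignment("read_c", "chrM", 1),  # mitochondrial: removed
]

kept = [aln.read_name for aln in alignments if passes_filters(aln)]
print(kept)
```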
If you don't remove the mitochondrial reads, they may influence your normalization or your peak calling, so we would recommend removing them. In addition, they are not interesting because, as I said, the whole mitochondrial genome is accessible. Finally, we remove the PCR duplicates. As I said before, if we have paired-end data, we use the paired-end information to discriminate the PCR duplicates that come from the same fragment.

In the end, what you want to know is which regions of your genome are accessible. To do this, you do statistics and you use a peak caller: a software that analyzes your data and tells you which regions are significantly enriched in coverage. What I haven't talked about yet is that when the Tn5 binds to the DNA, it inserts your Illumina adapters not exactly at the place where it is bound, but at two positions nine base pairs apart, one on each DNA strand. That means that if you really want to know where the Tn5 was bound, you need to shift the coordinates by four or five base pairs, adding them on one strand and removing them on the other, to find the actual position of the Tn5. So ideally, if you're interested in footprinting, you would like to have a peak caller that takes this into account. However, most of the time we don't have the resolution to go to this level, so we just use MACS2, which is not designed for this nine base pair duplication, but that's really not a problem because we don't have this resolution.

Finally, as I said before, you have two types of fragments. You have fragments like this that are nucleosome-free, but you also have fragments like this, where in fact the whole fragment is covered by a nucleosome. So if you compute a regular genome coverage, or use the default parameters of the peak caller, you may in fact find the coverage on the nucleosomes instead of finding the coverage in between nucleosomes.
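As an illustration of the nine base pair offset, the sketch below estimates the Tn5 position from the 5' end of a single aligned read. The function name is hypothetical, and the shift values (+4 on the forward strand, -5 on the reverse strand) are a convention commonly used for ATAC-seq rather than something stated explicitly in the talk, so treat them as assumptions. The two shifted positions land next to each other rather than on exactly the same base.

```python
def tn5_position(start: int, end: int, strand: str,
                 plus_shift: int = 4, minus_shift: int = 5) -> int:
    """Estimate where the Tn5 was bound from one aligned read.

    start, end: 0-based, end-exclusive coordinates of the aligned read.
    strand: "+" or "-".
    The shift values are assumed defaults (commonly used +4 / -5).
    """
    if strand == "+":
        five_prime = start        # 5' end is the leftmost base
        return five_prime + plus_shift
    if strand == "-":
        five_prime = end - 1      # 5' end is the rightmost base
        return five_prime - minus_shift
    raise ValueError("strand must be '+' or '-'")


# Hypothetical reads
print(tn5_position(100, 150, "+"))
print(tn5_position(300, 350, "-"))
```

Only the 5' end of each read enters the calculation; the rest of the read and the rest of the fragment are ignored.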
So what is important is to identify the regions where the Tn5 was bound. You need to focus on the 5' end of the reads: not the reads themselves, nor the whole fragment, but just the 5' end of the reads. To do that, we use MACS2, but we adapt the parameters in order to center the coverage on the 5' end of the reads.

Finally, here is another view of the workflow that you're going to follow if you go to the training material with the hands-on on ATAC-seq analysis. We start with the paired-end datasets, so R1 and R2. As I said, you can have very short fragments, like 50 base pairs. So if you sequence more than 50 base pairs, R1, for example, will read through your fragment and then into the adapter on the R2 side. You need to remove these parts to be able to map correctly, so we use Cutadapt to remove the adapters. Then, on the trimmed reads, we run Bowtie2 to map in end-to-end mode. Then we do the filtering, as I said: keep the uniquely mapped reads and remove the mitochondrial chromosome. We remove the duplicates with the Picard tool MarkDuplicates. And finally, we do the peak calling with MACS2 with specific parameters. This gives you two files. First, the coverage: at each position, what is the coverage? This allows you to see the height of each peak, et cetera. And then a list of peaks, the regions that are significantly covered, which correspond to your accessible regions.

In order to display this data, we have two approaches. One is genome-wide: with deepTools, we make a heatmap, so we pile up all the coverage on specific regions, and we choose the transcription start sites. We also have a locus-specific approach using pyGenomeTracks. That gives us a snapshot of what's happening on a specific locus: the coverage, where the peaks are, and where the transcription start sites are in this region.
We will also compare this to other annotations. With that, I would like to thank you very much for your attention. I hope you will enjoy and follow the associated Galaxy training material with the hands-on.