So this presentation will be all about preparing your alignment for variant analysis. That's actually a very substantial, very important part of variant analysis, because the choices you make during the alignment and the preparation of the alignment can have a big effect on the variant analysis itself, so that's why we spend quite a bit of time on it. Of course, the choices you make for sequencing also have a big effect: depending very much on the technology you choose, you can do certain types of analysis or not, for example whether you choose long reads or short reads. So there are a lot of ways nowadays to sequence DNA. As you might know, it all started with Sanger sequencing, a method that generates relatively long reads but at relatively low throughput. What you can see in this figure are all of the different sequencing methods that are frequently used today, or have been frequently used, because some of them do not exist anymore. On the X axis you see basically the average read length a method can produce, on a log scale, and the throughput, also on a log scale, so the number of bases it can produce per run. Both of them are of course important. Read length is important for whether you can, for example, generate haplotypes in the context of variant analysis, or assemble genomes, while throughput is usually very much related to the cost per base: how much sequence can you generate for a thousand euros, for example. So, many different sequencing technologies. Sanger sequencing is, or at least has been, a very important one, with relatively high read length, so it's on the right side of the graph, but very low throughput. The brown dots you see over here are all the Illumina sequencers, basically their evolution through time.
Nowadays you see that there are different Illumina platforms, some with relatively low throughput and some with very high throughput. Its competitor, or at least it was a competitor, ABI sequencing, depicted in green, has been discontinued, just because it was too costly and there was not a big difference in quality and throughput compared to Illumina sequencing. In blue we see Ion Torrent sequencing; Ion Torrent can produce somewhat longer reads compared to Illumina sequencing but also has its own challenges, and I won't go into too much detail regarding that technology. In orange we see 454 sequencing. I don't think it is used anymore; I think the brand is still used for some specific kind of sequencing, but that's different. And then on the right side, with longer read lengths, we see PacBio sequencing and Oxford Nanopore technology. This is an image from 2016, and of course things have changed since then, especially for the long-read sequencers. Also for the short-read sequencers: Illumina, for example, has introduced one or two machines with even higher throughput than they had in 2016. Oxford Nanopore Technologies, for example, introduced the PromethION, which has a massive throughput, so it generates a lot of bases per run, and of course with a pretty long read length. PacBio didn't sit still either: in the meantime the Sequel II has been introduced, with high throughput and high read length, and even more recently the PacBio Revio, which can now be shipped; here in Bern, for example, we will acquire a Revio in November, with a throughput very competitive with Oxford Nanopore technology. As you might know, there is also a third component that is important for a sequencing technology, and that is the quality.
So the base quality, more about that later: Illumina has a relatively high base quality, meaning that there are very few errors in the sequences it produces. For Oxford Nanopore technology it is very much improving, or has improved, in the last years, but it is not yet at the level of Illumina sequencing, although it is getting close, while PacBio sequencing nowadays has very high base quality, so it produces very few errors. Martin, you have a question? Yes, the Oxford Nanopore, that's basically the MinION there, or something? Yeah, so the MinION are the bottom yellow dots, different versions of the MinION. So you always do Nanopore sequencing with a MinION machine or something like that? No, no, the technology is the same for the different machines, but it depends on the size of the flow cell; with different sizes of flow cells you can generate more or fewer reads. It depends basically on the number of pores that are on the flow cell. And with other machines like the PromethION, those flow cells can also be run in parallel; they can even have parallel flow cells running at the same time, and that of course very much increases the throughput. Thanks. Okay, Lorena has a question. Yes. So, for example, for resequencing, which one would you recommend? PacBio? For whole genomes, yeah. So, let's say right now the cost per base is still quite a lot higher for PacBio sequencing compared to Illumina sequencing. The big advantage of PacBio sequencing compared to Illumina sequencing is that your bases are in a long read, which usually increases, for example, the mapping quality, as we will learn later on, and your base quality is higher, which is also important for variant analysis. However, for many projects, especially if you want to sequence many samples, so if you want to sequence an entire cohort, PacBio sequencing will still be too expensive.
That might actually change with the Revio, with which the cost per base is starting to approach the smaller Illumina machines. So at some point you might be in a situation where researchers actually choose long-read sequencing, like PacBio sequencing, for whole-genome variant analysis. However, if you ask your sequencing provider that question now, they will advise you to do Illumina sequencing, just because it is now affordable to sequence many genomes with Illumina sequencing, while it is probably not yet affordable to do that with PacBio sequencing. And which of these is, for Illumina, the HiSeq, for example? Which one of these is the HiSeq? You mean here in the figure? So, I am not a machine specialist, but they must be somewhere over here. This is the figure from 2016, so things have probably changed slightly, but they are probably over here, with very high throughput, and the newer machines, the NovaSeq for example, are probably already over here, something like that. They also have their newest machine with even higher throughput, and I don't know by heart how much it is. For Illumina it's important to note that you have different machines, so you can produce, in a single run, fewer or more bases, fewer or more sequences, and that makes it quite scalable. Nanopore technology is actually very extreme in terms of scalability, because you have the MinION, which only produces relatively few reads at a relatively low cost per run, and the PromethION, which generates really a lot of reads, billions of reads, and everything in between. So in that sense the technology is very scalable and, as you probably know, very portable. That is not really the case for Illumina.
So if you want to sequence, for example, only a few amplicons, you probably want to choose either Oxford Nanopore technology or maybe Sanger sequencing. If you want to sequence whole genomes, probably either Illumina or PacBio sequencing, or the big machine of Oxford Nanopore Technologies, of course. Good. Any other questions? If not... yes? Yeah, sorry, just purely from the computational perspective of the analysis: is there something you need to do differently, or keep in mind, when you do your whole-genome analysis on long reads versus on short reads? Are they basically the same steps in the analysis, or do you just maybe get better calling of the haplotypes if you have long reads? So, basically, can it be said generally that purely from the computational perspective it is always better to have long reads? I think I can answer that question with yes. It's almost always better to have longer reads, and that is just because your mapping qualities, for example in regions where it is very difficult to align short reads, are usually much better with longer reads. However, many of the methods that are around are very much optimized for Illumina sequencing. Many things we are going to do during the exercises are quite specific to Illumina sequencing, and that's just because Illumina sequencing is nowadays still the gold standard for variant analysis: most of the genomes that are resequenced are resequenced with Illumina sequencing, not with PacBio sequencing. However, there are very good programs to do variant analysis with both Oxford Nanopore and PacBio data; DeepVariant from Google, for example, can handle these long-read data very well. So computationally it definitely has an effect; you will probably use different software, because some software might be more optimized for longer reads.
And most of the time you will get better results if you use long-read sequencing. Of course there are a lot of buts, but in general I think that is almost always the case. Okay, thank you. Okay, then we continue. During this course, especially during the exercises, we will focus on Illumina sequencing. So I will also go quite deep into what the technology actually is and what kind of caveats there are for variant analysis. It's a sequencing-by-synthesis method, also referred to as second-generation sequencing, while the long-read sequencing techniques are also considered third-generation sequencing. A single machine, a single run, can have massive throughput; for example, the NovaSeq 6000 generates 500 billion bases per run, and that makes it one of the cheapest ways to generate sequencing data, just because its throughput is so massive. Especially for variant analysis it is still the most used platform today. There is definitely competition coming up, both from the long-read side and from other technologies, but it's still kind of the gold standard for variant analysis. So, what kind of sequencing data will you get from Illumina sequencing? Typically you get reads, of course, and those reads are relatively short, meaning between 50 and at most 300 base pairs long. However, there is something that is characteristic of Illumina sequencing, and that is that you can sequence paired-end. That means that if you have a certain fragment, a certain DNA molecule, you can sequence it from both ends, and that is quite interesting because although your reads are short and you are not sequencing the entire molecule (as you can see over here, these are the two reads, and you're not sequencing this part), you do know that they come from the same molecule, so they should at least be from the same chromosome.
And typically we also know that they should be at a certain distance from each other. You cannot know that exactly up front, but typically you know it's, for example, between 400 and 600 base pairs. That's already quite a bit of information you can use in your downstream analysis, for example during the alignment: if one of your reads ends up on a different chromosome than its mate, you know that something is going on, and you can use that information to improve your alignment. So this is the kind of data you get. And how do you get that kind of data? Well, first you start of course with extracting DNA. We're not going to focus too much on that, but once you have that DNA, especially if you did your extraction well, this DNA is still in relatively large molecules. And because you're using this paired-end sequencing, we actually want to make those molecules a little bit smaller. We do that by shearing. By shearing you cut your DNA randomly into smaller pieces, and then you often do a size selection. So after shearing you do a size selection and you only select sizes between, for example, 400 and 600 base pairs. That's important because then you know the expected distance between your forward and reverse read. After shearing and size selection there is an adapter ligation. These adapters are known; we know what their sequences are. And to these adapters we can ligate other oligos. For example, we can add a barcode, a barcode that is specific for a sample, and we can add P5 and P7 sites, or actually we have to add P5 and P7 sites, because they are needed for the fragment to anneal to the sequencing flow cell. So the P5 and P7 sites are important for annealing to the flow cell, while the barcodes are important to identify which fragments came from which sample. So, if you remember well:
These Illumina machines can have massive throughput, so they can really generate a lot of reads, but typically you do not want all of the reads coming from the same sample. If you would use one whole Illumina run for a single human genome, you would probably cover it something like 10,000 times, probably even more. So usually you want to sequence multiple samples on a single run, and you do that with barcodes. These barcodes are also sequenced and associated with a fragment, so you know exactly which read originated from which sample. Then there is a PCR step, typically between 8 and 16 cycles, just to get enough library to be able to actually do the sequencing, to load the flow cell properly. And after that there is basically the process of sequencing itself. This "sequencing" word on the slide contains a link; if you click on it you get a nice YouTube movie showing how Illumina sequencing works. We don't have time in this course to really go through that entire process, but you can have a look at it in your own time. Martin has a question. Yes, please. We recently actually had a library made, and the bioinformatician was asking for a PCR-free library. How does this fit in here? Well, that is basically this protocol without the PCR. I think there are specific kits for that. Typically they require a very high input of DNA; depending on your project that's possible or not. So you really need a lot of DNA input to start with, micrograms of DNA. And as far as I know, it is just a matter of doing the first steps without the PCR: you make sure you have enough library after adding the barcodes and the P5 and P7 sites, and then start sequencing. I understand the advantage is that you don't introduce any bias from the PCR reactions? Indeed. Indeed. And those biases do exist.
For example, you typically have unequal coverage depending on your genome, depending on the GC content of specific regions. That unequal coverage over your genome is partly caused by this PCR step, and PCR-free libraries run into that issue to a lesser extent. In addition, with PCR you of course also introduce molecular duplicates, meaning that you find sequences originating from exactly the same fragment. Those you typically want to remove; there are ways to do that and we'll discuss that later on. I guess my question is, do you think it's necessary, or better, to have PCR-free rather than regular, because obviously regular is cheaper? Yeah, I would say it is better. However, it definitely has an impact on your workflow. It means that you need very high quality DNA extraction, and probably the library preparation also takes a bit more time and effort, and is maybe also a bit more costly. Thanks. Yes, I just wanted to tell you, Martin, that we use the PCR-free approach all the time in genomic sequencing for human diagnostics. We routinely send our patients' DNA for sequencing like this, simply because it gives better results in the end. But of course it is, as was said, a different workflow. It has its advantages in diagnostics for diseases, which is what we do, so we are really dependent on very high quality output at the end, and that's why we use PCR-free. Perfect, thanks a lot. So I guess it all depends on what you can invest per sample. I can imagine if you have a project where you want to sequence 10,000 genomes, you make different decisions than if you want to do diagnostics, of course. Yes, we have very low throughput, actually, indeed; we don't have so many patients, luckily, so we can manage it. Perfect. Thanks for your addition.
Alright, so then I think it's important to talk about some definitions. We now know how this library preparation works; probably you have heard of it before, but it's important to know what you're talking about. So what we have is this molecule we generate in the library preparation, where we have the P5 and P7 sites, we typically have a barcode in this molecule that helps us figure out which read came from which sample, we have the adapters on both sides, and then we have the first read and the second read and something in between. And what do we call those things? Typically, if people talk about a fragment, that is, let's say, the unknown sequence plus the known sequence, so the reads including the adapters, but without the barcode and the P5 and P7 sites. If you talk about fragment length, it typically includes the adapters. However, people also call fragment only the unknown DNA in between the adapters, but that part is also referred to as the insert size. And insert size is, I think, a little bit of a nicer name, because it really depicts that it is the stuff that is inserted between all the known oligos: between the adapters, barcodes, and P5 and P7 sites. Then we also have something called the inner distance, and the inner distance is the distance between the 3' ends of the forward and reverse read. So that's actually the sequence of the fragment you are not sequencing, basically the unknown part of the fragment. Then some more definitions about sequencing in general. What do we consider a library? A library is all the fragments together of one DNA sample, or cDNA sample if you have been sequencing RNA, that share a barcode. That also means that if you have a single sample and you generate two libraries out of it, those are two separate libraries.
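As a small sketch of how these terms relate, here is some illustrative Python. The numbers are made up for illustration, and adapter lengths are ignored, so the insert size here is simply the two read lengths plus the inner distance.

```python
# Illustrative relationship between insert size, read length, and inner
# distance for paired-end reads; adapter and barcode lengths are ignored.

def inner_distance(insert_size, read_length):
    """Unsequenced bases between the 3' ends of the forward and reverse reads.

    Negative values mean the two reads overlap.
    """
    return insert_size - 2 * read_length

# A 500 bp insert sequenced with 2 x 150 bp reads leaves 200 bp unsequenced:
print(inner_distance(500, 150))  # -> 200

# A 100 bp insert with 150 bp reads: the reads overlap by 200 bp in total:
print(inner_distance(100, 150))  # -> -200
```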
And typically you also want to retain that information when you are doing variant analysis; why that is, we will learn later on. A sequencing run is a complete cycle of generating reads on a machine; I think that is quite easy to grasp. A flow cell is the physical platform on which the sequencing takes place, so that's where the fragments anneal, via the P5 and P7 sites, to start the sequencing. A flow cell is used for one sequencing run: you load a flow cell on a sequencing machine. And then you have a lane, which is a compartment within a flow cell. A flow cell can contain multiple lanes; some flow cells have only one lane, but for example an S2 or an S4 flow cell contains two or four lanes. Lanes are typically quite independent of each other, and whether a read came from one lane or the other can also be information you want to retain when you're doing variant analysis. So what does a typical workflow look like? In the lab you create independent libraries of, for example, different samples; usually it's one library per sample. Of course, you add the barcode during the creation of the individual libraries. Once you have done that, you combine those libraries: you pool them together in a single tube and you load them, usually, on different sequencing lanes. The sequencing then hopefully results in FASTQ files. What do they look like? Let's say you have library one over here. You have an identifier, which is generated by the demultiplexing software: an identifier of the barcode, if I'm not mistaken, an identifier of the lane, and whether it's the forward or reverse read. We only have forward reads over here, so if we would only generate single-end reads, for each lane, we would expect these four FASTQ files.
If we would also have reverse reads in there, we would expect eight FASTQ files. And you have that for both lanes; in this case, we have two files for each library and they are separated by lane. So these machines generate signal intensities, and the signal intensities are converted by the base caller and by the demultiplexing into individual FASTQ files. These FASTQ files of course contain the sequencing reads, but in addition to the sequencing reads themselves, so these ACGT sequences, they also contain something called the base quality. The base quality tells you how sure the base caller was that the base it is reporting was actually the base. So if it says, okay, at this position I have an A, how sure was it that it was actually an A? That is expressed as a Phred-scaled quality. Phred-scaled qualities are always minus 10 times the logarithm, with base 10, of a certain probability, and in the case of the base quality, this probability is the probability that the base call was wrong. So in a case where the base caller was 1% sure that the base was wrong, or in other words 99% sure that the base was called correctly, we have a base quality of 20. If it was 90% sure the base was correct, it has a base quality of 10, and if it was basically unsure, so 50-50, it could be wrong or right, you get a base quality of 3. This Phred scale does a very good job of specifying differences at very low error probabilities. That's what you can also see over here: you have the accuracy and the error as a function of the Phred score. At a Phred score of 10 the error is about 0.1, but at very low error probabilities we can still distinguish differences in the Phred score. There's a question. Hi. Sorry, maybe I missed it, but how does the machine know what the probability is of a base being wrong?
Yeah, so I will go into that a bit deeper a few slides later, but to answer it quickly, because I understand you have this question now: when a base is incorporated during the sequencing, you get a certain light signal, and that has a certain color. Depending on how well the machine could pick up that color, so whether it was red or green, it assigns a probability. If it's somewhere in between red and green, there is a high probability that the base is wrong; if it was clearly red, for example, there is a very low probability that the base is wrong, so very high accuracy. But does that then mean that if there's a SNP, it's anyway not so sure? Not really, because... Oh, because it's each individual read. It's each individual read. So it's more like a light intensity thing? Yeah, it's exactly that: intensity and, what's the right word, kind of how pure a certain color is. Thanks. Giancarlo. Yes, just one thing: in the case of a SNP with Sanger sequencing, your statement would be true, because SNP double peaks in Sanger have a significantly lower Phred score at that position. So if somebody has mentioned that to you, it's correct if the person was talking about Sanger sequencing. Fair, fair. Thanks for that, Giancarlo. Yes, in my previous work I've been using Sanger sequencing in polyploids, and then it's even more fun. Then we got the chance to use next-generation sequencing, which was a really big improvement for variant analysis in polyploids. So, to check with you, I have a question. Let me change the screen sharing to Firefox. There we go. Just to validate with you, because this is a very important concept in variant analysis: what does a high base quality mean? If I have a base quality of 30 or 40, what does that mean? Okay, I don't see any additional answers.
So I will stop there. Okay, well, great, that came across nicely, so I won't spend much more time on this. So indeed: a high base quality, meaning a high Phred score, means high accuracy, low error. Okay. So how are the base qualities and sequences stored in a FASTQ file? Well, it's very likely you have looked at a FASTQ file before; many of the bioinformaticians probably know this by heart. Each record in a FASTQ file consists of four lines. The first line is the title of the sequence, so basically the identifier of the sequence, and there is quite a bit to that identifier: for Illumina sequencing, there is actually information about where that read came from. I think I have an additional slide that explains what is depicted in the FASTQ title. Then of course we have the nucleotide sequence, a string of As, Cs, Gs and Ts. Then you have a third line with an optional description; it is not used very frequently nowadays anymore, I think it's basically there for historical reasons. Then the fourth line, and that's an interesting one, contains the base qualities. This line has the same length as the nucleotide sequence, and that is nice because then it's quite easy to connect the individual bases to a base quality. These base qualities are of course not stored as the actual integers; they are characters that represent an integer. How that relates you can see over here: it is basically the order the characters have in the ASCII table. So, for example, an exclamation mark means a base quality of zero, an I means a base quality of 40, and everything in between. So if we have a B over here, so this A over here has base quality B, can anybody tell me what the base quality then is? Anyone, maybe the bioinformaticians? I think 32. Oh, 33. 33, sorry. Yeah, you're right. So it's relatively easy to figure out.
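To make these two ideas concrete, here is a small Python sketch of the Phred formula and the Phred+33 character encoding; the example values match the ones discussed above.

```python
import math

# Phred base quality: Q = -10 * log10(P), where P is the probability
# that the base call is wrong.
def phred_quality(p_error):
    return -10 * math.log10(p_error)

# FASTQ stores qualities as single characters using the Phred+33 encoding:
# the character's ASCII code is the quality value plus 33.
def char_to_quality(c):
    return ord(c) - 33

def quality_to_char(q):
    return chr(q + 33)

print(round(phred_quality(0.01)))   # 1% error probability  -> Q20
print(round(phred_quality(0.10)))   # 10% error probability -> Q10
print(round(phred_quality(0.50)))   # a coin flip           -> Q3
print(char_to_quality('!'))         # -> 0
print(char_to_quality('B'))         # -> 33, the quiz example
print(char_to_quality('@'))         # -> 31
```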
However, of course, computers are way faster at making that translation than we are, because we would have to look it up in the table all the time. It's just a way of storing base qualities efficiently. Another question for you, and this one might be a bit more interesting for the bioinformaticians. There is this program called grep that you can use to find strings in a file, based on, for example, a regular expression. If you look at a FASTA file, a title always starts with a greater-than sign, and with that you can count the number of sequences in a FASTA file. We have just learned that the title in a FASTQ file starts with an at sign. So, in principle, you could just find all the lines that start with an at sign and use grep with -c, which counts the number of matching lines, to count the number of records in a FASTQ file. However, this doesn't work. Why do you think it doesn't work? If you don't know, just give your best guess. Okay, twelve, even thirteen of you have answered. Awesome. So: the at sign being a special character in regular expressions, that is not the case as far as I am aware, so that answer is not correct. The at sign can also occur elsewhere in the FASTQ file: if we go back to the slide, we see that the at sign actually depicts a base quality of 31. So if the first base of a read has a base quality of 31, you will have an at sign at the start of the quality line, and then that line will also be counted as a record. So you usually overestimate the number of records if you use grep to find lines that start with an at sign. The way you can count the number of records in a FASTQ file is just by counting the number of lines and dividing by four, because each record consists of four lines.
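The pitfall can be demonstrated with a few lines of Python on a toy FASTQ snippet; the reads below are made up for illustration.

```python
# Why `grep -c '^@'` over-counts FASTQ records: '@' encodes base quality 31
# in Phred+33, so a quality line can start with '@' too. Counting all lines
# and dividing by four is the safe approach.
fastq = """@read1
ACGTACGT
+
@IIIIIII
@read2
TTGCAATT
+
IIIIIIII
"""

lines = fastq.strip().split("\n")

naive_count = sum(1 for line in lines if line.startswith("@"))  # mimics grep -c '^@'
true_count = len(lines) // 4                                    # mimics wc -l, then / 4

print(naive_count)  # -> 3: the quality line '@IIIIIII' is miscounted as a title
print(true_count)   # -> 2
```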
So, what we want to do once we have the FASTQ files is perform the alignment. One question? Oh yes, just regarding the FASTQ file: there's no way to find out the adapter sequences from this file, is there? It depends on what you mean with finding out. I understand one has to remove the adapter sequences. So you would have to do an analysis; there's no indication, like one of the numbers given by the sequencing center, of which adapter was used? Well, not really, but there are many ways to find adapter sequences; in the introduction to NGS course we go a bit deeper into that. You will only get adapter sequences in your reads if you read through the insert. In this case, we have an unknown sequence in between. However, if your insert size is very short, shorter than the actual read length (let's say we generate reads of 150 base pairs, but our insert size is only 100 base pairs), then we are actually reading into the adapter, and there we would expect adapter sequence. It also means that we only expect adapter sequence if the forward and reverse reads fully overlap. I think I got confused because I read online, you know, sometimes people say you need to remove adapters and trim them, and others say don't worry about them, it's all automatically trimmed for you. I would say it's always a good idea to trim them, because they are sequences that do not occur in your reference, and they can still interfere with the mapping. Depending on the alignment software, it can quite well ignore adapter sequences, but typically it's a good idea to trim them. I understand sometimes software also requires adapter sequences to be fed in, and then... Yeah, no, it's true; then they have to be clear about that. That's the thing with academic software, or software in general: they have to document clearly what kind of input they require, of course. Cheers.
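The read-through condition described above can be sketched in a few lines of Python; the read and insert lengths are the illustrative values from the example.

```python
# Read-through into the adapter: when the read length exceeds the insert size,
# the 3' tail of each read is adapter sequence and should be trimmed.
def adapter_bases_per_read(read_length, insert_size):
    """Expected number of adapter bases at the 3' end of each read."""
    return max(0, read_length - insert_size)

print(adapter_bases_per_read(150, 500))  # long insert: no read-through -> 0
print(adapter_bases_per_read(150, 100))  # 100 bp insert, 150 bp reads  -> 50
```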
Okay. Adapter trimming is not really a main focus of this course, that's why I don't spend much time on it, but it's always a good idea to remove adapters if you expect them. So, we talked about the base quality, right? And we have a base quality because the base caller can make mistakes, just because Illumina sequencing isn't perfect. That's important for variant analysis. For example, what you see over here: okay, this is pretty clearly a homozygous site, and here pretty clearly a heterozygous site. But over here, for example, we also see a difference with the reference, an A over here while the reference is a C, but mostly the reads match the reference. And we have to make a decision whether this A over here is a variant, yes or no. Over here, looking at it, it's probably quite obvious, but it's not always that obvious, especially if you have, for example, lower coverage or regions that are more difficult to align. So then it's important to know whether this A, for example, has a high base quality or not. Because if it has a high base quality, then it is unlikely that the base caller made a mistake; if it has a low base quality, it is quite likely that the base caller made a mistake and this is an actual error. So the base qualities are important for variant calling. So, why do those lower base qualities occur, and why is this especially relevant for Illumina? Well, Illumina sequencing, as you have heard, is quite limited in read length: at most you can generate reads of 300 base pairs. And if you have seen a FASTQ file with reads 300 base pairs long, you have probably also seen that the base quality quite quickly declines after 150 base pairs. So the 3' ends of the reads usually have much lower base qualities compared to the 5' beginnings.
And that's also the reason why Illumina sequencing is limited in terms of read length. Why is that? Well, during cluster generation on the flow cell there is a step called bridge amplification. In this step, a single fragment anneals to the flow cell, and that single fragment is amplified at the same spot, in very close proximity to the original fragment. So it is a kind of PCR step. It is required because detecting base incorporation during sequencing requires multiple molecules: if you had only a single fragment there, the signal from one base incorporation would simply not be strong enough to be picked up at all. So with bridge amplification you have many identical fragments in the same spot, a base gets incorporated in all of them at the same time, the signal is amplified because it comes from so many molecules simultaneously, and then it can be detected. However, base incorporation is not flawless; mistakes are made. For example, a base may be skipped in one molecule during incorporation, and then from that cycle on that molecule reports the base of the previous position instead of the current one. These errors always occur, and they build up towards the end of the read: the further you get into sequencing by synthesis, the more molecules in the cluster have fallen out of phase. You can imagine the signal intensity gets blurred, because more different bases are incorporated and more different colours are emitted at the same time. Therefore, at some point the base quality becomes lower and lower: if, say, a C is incorporated, where you would expect a red signal, you instead see a mixture of blurred colours.
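The error build-up just described can be put into a toy formula: if a fraction p of the molecules in a cluster fall out of phase at each cycle, the fraction still in phase after n cycles decays geometrically. The rate p = 0.005 below is an illustrative assumption, not a measured value:

```python
# Toy model of phasing error build-up in sequencing by synthesis:
# fraction of cluster molecules still in phase after n cycles,
# assuming a constant per-cycle dephasing probability p.

def in_phase_fraction(p, n):
    return (1 - p) ** n

in_phase_fraction(0.005, 50)   # ~0.78: signal still fairly clean
in_phase_fraction(0.005, 300)  # ~0.22: signal badly blurred
```

This is why base quality is high at the 5' end and collapses towards the 3' end of long Illumina reads.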
And therefore the base caller is less sure that the called base is actually the right base, and you get lower base qualities towards the 3' end of the read, up to a point where the base quality is simply no longer high enough to be sure the called base can actually be used for analysis. So Illumina has this limitation, and it is caused by the requirement for bridge amplification. Many of you have probably seen plots like the one before: for a 300-base-pair Illumina read, the base quality truly declines towards the end of the read, and so the read length is limited by this out-of-phase signal. Long-read sequencing methods do not have that issue, because they maximize the signal from a single-molecule base readout. Only one molecule is sequenced, not multiple molecules in one spot as with bridge amplification, and they can still get a signal out of that one molecule. Therefore the read length is, in principle, pretty much unlimited. Michael has a question. Yes, one question, probably related to this bridge amplification step. With Illumina, I discussed with my colleagues the problem of so-called optical duplicates. You can have duplicate reads from PCR, but also these optical duplicates, which means that by chance you create two duplicate clusters, because this bridge amplification is never perfectly confined. I know that different sequencers are affected differently, probably based on what the flow cells look like. Can you comment a little bit on that, whether it's still an issue nowadays? It is still an issue, definitely.
And optical duplicates are actually quite easy to detect, because we have the information in the read name about where on the flow cell a read came from. I will go into that later, I think three or four slides from here. But also with newer flow cells it is definitely still an issue; it still occurs. I wouldn't say it's a big issue, though; more about that later. Okay, thank you. So, because you read from a single molecule, you do not have this out-of-phase signal, so in principle, in theory, you have an unlimited read length. Your read length is not limited anymore by phasing but basically by the length of the molecule you are sequencing, or by how long your sequencing run can actually last; that is of course also an important limitation. There are two frequently used single-molecule platforms, which I actually talked about before: PacBio single-molecule real-time sequencing and Oxford Nanopore technology. In the Introduction to NGS course we go into those a bit more deeply; in this course not really. It used to be that accuracy was really an issue for both methods, especially if you want to apply them to variant analysis; we already saw that for variant analysis base quality is important. They used to have very low, or relatively low, base qualities, and that means a lot of mistakes. This is a relatively old run of Oxford Nanopore sequencing, and what you see over here are most likely mistakes that simply got kept in the called sequences: we have a lot of insertions and deletions in there, which seems to be typical, but also, over here, probably single nucleotide errors. For variant analysis that used to be a real challenge.
I would say that for Oxford Nanopore technology it still can be a challenge, but both the sequencing chemistry and the software have improved very much in the last few years. For PacBio sequencing it's not really an issue anymore, because it actually achieves higher accuracy rates than Illumina sequencing, so very few mistakes in there. So, it used to be that long reads came with more errors; that is still somewhat the case for Oxford Nanopore technology, and it caused difficulty for variant analysis. However, with PacBio circular consensus sequencing, nowadays called HiFi reads, you actually get very high base qualities and also no bias in where you find the errors, while for Oxford Nanopore technology you still find a bit of bias in where the errors occur. In addition, long reads can have much higher mapping qualities, which means the alignment is usually much, much better: with a longer read it is easier to find the most probable location in the reference genome, because you just have bigger puzzle pieces, basically. And what you can of course also do with long reads, and that counts both for PacBio and for Nanopore technology, is haplotyping, which is very much improved because you just have longer stretches originating from the same chromosome. So if you are interested in haplotyping, long reads can really, really improve your analysis. So, now we have discussed the sequencing technology; we focused on Illumina sequencing and also heard a bit about long-read sequencing. So what can you sequence? Well, usually you think about whole genome sequencing or whole metagenome sequencing. However, that can be relatively expensive, and it generates a lot of data, so the analysis usually takes much longer with whole genome sequencing than with a reduced representation of your genome.
You can reduce the representation of the genome by, for example, whole exome sequencing. That means you use baits to capture all of the exons out of your genome, because typically we are interested in genic regions and not so much in intronic or intergenic regions. You then have a much reduced representation of your reference genome; the sequencing is usually deeper, and it reduces the amount of data a lot, which also reduces the computational requirements a lot. Other than that, you can have restriction-enzyme-based reduced representation, where you cut your genome with a frequent and a rare cutter and then sequence the resulting fragments; that used to be very popular in ecology, and maybe it still is, and also in agriculture. You can also think of amplicon sequencing if you are only interested in a small part of the genome. There you can do multiplex PCR, where you focus on, for example, a hundred relatively small regions. What people nowadays also do quite a lot: if you are interested in a single gene, you create maybe 10 or 20 amplicons of that gene and sequence only those, for example with Sanger sequencing or, even better, with Oxford Nanopore technology. Also, doing variant analysis on RNA-seq data can be quite challenging, because, first of all, you have very different coverage, since one gene is expressed more highly than another, and besides that you can also have allele-specific expression, meaning one allele is expressed more than the other allele, so you have an unbalanced division between your two alleles. So, once you have your FASTQ file, you typically do read alignment. What you need for that is your FASTQ file, obviously, and a reference genome that you want to align to.
Ideally that reference consists of pseudomolecules, so whole chromosomes, like for example the human reference genome. What an aligner does is try to find the most likely position each read originated from in the reference genome, and that information is then stored in a SAM file. And I think I will stop here for a bit, because I have already been talking for an hour. A question from the audience. Yes, regarding sequencing technologies and FASTQ files: I was just wondering, can you actually mix different technologies? I guess you can just take all kinds of FASTQ files, from PacBio and from Illumina, and do your analysis with them. Is this something people consider, or is it bad practice? No, it's not bad practice. Typically, I guess what you would do there is perform the variant analysis on the two datasets separately; then you have two VCFs, two lists of variants basically, and you try to combine or compare those. It's quite difficult to do variant calling on an alignment file that contains both, let's say, PacBio reads and Illumina reads. So there are some difficulties, I guess. Another question: obviously, if you align to your reference genome, the reference could also have some errors. You could have several different isolates, and maybe from only some of them you have a reference. Right, that's also a big topic in genomics in general. What we know now is that you can have pretty big variation, really massive differences of kilobases or even megabases between individuals. We are talking about structural variation, and structural variation even within the human population can be very big. It means that if you have a single reference genome and you are sequencing an individual with an insertion compared to the reference, you will never align the reads of that insertion to the reference genome.
And that can be considered a bit of a challenge, and that's what all these pangenome initiatives are focusing on. So I guess what we would be doing is to always sequence the reference strain again; is this something other people do? Well, what of course happens is that reference genomes improve through time. Nowadays, for human, we have the telomere-to-telomere assembly, for example, and that happens in many different organisms. So reference genomes do improve, and they also tend to become combinations of multiple genomes, so that you have as complete a representation of your species as possible. That is more and more extended to graph-based representations of the reference genome, where you basically represent the entire pangenome of a species in a single reference. It is not very common yet to use those references, but I think it will become more and more common in the future, especially for human genetics. So, just a recap: you have a FASTQ file with the actual sequencing reads and their base qualities. What an aligner then does is try to find the most probable position of your reads on a reference genome. Usually only one position is recorded, but you can also choose to report multiple positions, for example; typically, for variant analysis you want the most probable position. And of course, what the aligner also does is assign a probability for how sure it was that the read actually aligned to that specific place. So this is what read alignment can look like. This is a visualization in IGV; maybe you have heard of IGV before, and perhaps you also used it during this course to visualize some alignments. What you see over here is a paired-end alignment, with the forward and the reverse read.
Everything that is grey has exactly the same sequence as the reference genome, and everything that is coloured is a difference between the read and the reference genome. That's what you want to end up with. There is a lot of software available to do read alignment; frequently used ones for variant analysis are Bowtie2 and BWA. Nowadays there is also DRAGEN available for Illumina reads, developed by Illumina, which performs even quite a bit better than BWA and Bowtie2 and can also be used for free. If you are interested in that, let me know and I can give you the link to where to find it and how to use it. For long reads, people usually use minimap2, which is developed by the developer of, let me get this right, BWA, Heng Li. Bowtie is Ben Langmead, I think; so yes, minimap2 is by the BWA author. Minimap2 has some different assumptions compared to the short-read aligners, and it is also very fast at aligning these long reads, because you can imagine that read length puts a penalty on how quickly you can align reads. By the way, with minimap2 you can also align short reads, and it performs quite well there too. So, what do we mean by mapping quality? We have been talking about base quality before, and we also have mapping quality. What you can have, for example, is two locations on your reference genome that are relatively similar, that have very similar sequence, for example caused by a recent gene duplication; these are the blue parts. If we generate relatively short sequence reads out of those locations, some reads will fall completely within the blue region. And then, of course, it becomes very difficult for the aligner to decide whether this blue read actually comes from the position on the left side or from the duplication on the right side.
So, depending on how similar those two regions are, the aligner might just have no clue where the read came from, because the sequences there are pretty much identical. How sure the aligner was about where a read belongs in the reference genome is expressed by the mapping quality. Again, it is Phred-scaled, like the base qualities; the only difference is the type of probability. For mapping quality it is the probability that the mapping position is wrong: not the called base, but the mapping position. So you can imagine that reads falling completely in the blue region have a very high probability that the mapping position is wrong, and therefore a very low mapping quality. And reads that are partly or entirely in the green region, which is pretty much unique in the reference genome, have a very low probability that the mapping position is wrong, and therefore a very high mapping quality. These mapping qualities are of course very important for variant analysis, because if you do not know where a read actually belongs on the reference genome, it is also virtually impossible, or very difficult, to call variants. And mapping qualities are, like base qualities, taken into account by downstream analysis software; for example, GATK takes both mapping quality and base quality, among many other things, into account when it calls a variant. I have a question for you. You learned what indels are this morning, and you have also seen some of them in the alignment visualizations during the presentation. The question is: why do you think indels are more difficult to detect from alignments than SNPs, single nucleotide polymorphisms? Okay, if you answered "all of the above", great; I would agree with most of you, it is all of the above.
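The Phred scaling of mapping quality mentioned a moment ago works the same way as for base qualities; a small sketch converting a MAPQ value back into the probability that the reported position is wrong:

```python
# MAPQ is Phred-scaled: the probability that the reported mapping
# position is wrong. A MAPQ of 42 (as in the example read later on)
# corresponds to P(wrong position) = 10^(-4.2).

def mapq_to_error_prob(mapq):
    return 10 ** (-mapq / 10)

mapq_to_error_prob(42)  # ~0.000063: position is almost certainly right
mapq_to_error_prob(0)   # 1.0: the aligner has no idea where it belongs
```

Reads inside the duplicated blue region would come out with MAPQ near 0, and variant callers typically discard or down-weight them.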
So, first, because there is no base quality. It depends whether you are looking at a deletion or an insertion compared to the reference: if it is a deletion compared to the reference, you do not have a base quality for the deleted bases themselves, so it is very difficult to estimate whether that deletion was caused by a sequencing error or is really a deletion in the read. Martin has a question. Regarding the last slide: the mapping quality score, MAPQ, is that per base or per sequence? Per alignment, actually, so per read. Okay, thank you. The second issue is that it is difficult to know the correct alignment, and you will learn more about that tomorrow, but you can imagine, especially if your insertion or deletion sits in a repeat or a homopolymer, that it is very difficult to know what exactly was deleted or inserted: what the most probable configuration is, or rather the most probable mutation that has caused this insertion or deletion, and where the read should align. Aligners typically make different decisions there, and therefore the alignments around insertions and deletions are usually very, very messy. And third, because indels are often present in repeats and homopolymers, you typically also have low mapping qualities, or difficulties with mapping quality, in these regions. So, these alignments are stored in a SAM file, the Sequence Alignment/Map format. You have probably also heard of a BAM file; a BAM file is nothing more than a compressed, binary SAM format. As you have seen, it holds the information about which read aligns where, and it is produced by the aligner. What does a SAM file look like? A SAM file, like many files in bioinformatics, starts with a header, and the header contains metadata about the file, in this case the SAM file. Each line in the header starts with an @ sign.
It contains, for example, information about the version of the SAM format; as far as I know there is only one version, so typically it is 1.0. It also records how the SAM file is sorted: for example, whether it is sorted in the order of the reads in the FASTQ file, or sorted by position in the genome. So already from the SAM header you can see how it has been sorted. There is always information about the reference, meaning the chromosomes. In this case it is an E. coli chromosome, so it has a bit of a strange chromosome name and there is only one chromosome, but if you aligned to the human genome, for example, you would find chr1, chr2, chr3 and so on, together with their lengths; that is always required in a SAM file. Another pretty much obligatory header line is the @PG tag, and the @PG tag tells you which programs have been used to create the BAM file. Every time a program performs a calculation on the BAM file or changes the BAM file, it adds an @PG line telling you which program was used. So a SAM file basically contains its own history in the header, and that is quite powerful: if you receive a BAM file from a colleague, you can just check the header and see which programs, and which program calls, have been used to create it. After the header comes the actual alignment section. It is just a tab-delimited file; you could basically load it into Excel, with a whole bunch of columns. I will quickly go through them. The first column is the read name. This is an example of a sequence from SRA, but if these were raw Illumina reads, you would have the specific Illumina read name with all the information about where on the flow cell the read came from.
Then the second column is the SAM flag, which encodes all kinds of characteristics of the alignment: for example, whether it is marked as a duplicate, whether it is mismapped, or whether the read is mapped at all, and so on. Those characteristics are all stored in a single integer, and that is quite a smart way to do it, because you only have to store a single integer, yet it contains a lot of information about the features of the alignment. To learn more, just Google "SAM flag" and you will find what kind of information can be stored in there. Then comes the reference chromosome the read aligns to, in this case chromosome 20, and the start position, so this read aligns right at the beginning of chromosome 20. Then we have the mapping quality; so, Martin, as you can see, each read, or each alignment in this case, has its own mapping quality, here 42, so very high: a very low probability that this read could align somewhere else in the reference genome. Then there is the CIGAR string, which tells you how the read aligned, for example whether there are insertions or deletions in it. Then whether the mate is mapped to the same reference sequence: you get an equals sign if it is the same chromosome as the read we are focusing on, and the chromosome name if it is a different one. We get the start position of the mate, and then information about the fragment length: because we do a paired alignment, we can estimate the fragment length, which can be relevant information. There was a question. Yes, the question is about the CIGAR string. I'm just wondering how it tells you how the alignment went; I'm not sure how this "150M" gives that information. We do not have a lot of time during this course to really focus on that.
We focus on it in the Introduction to NGS course, but in short, the CIGAR string is a sequence of an integer followed by a letter, then possibly another integer and another letter, and many repetitions of that. In this case it says 150M, and 150M means 150 matches: apparently this read is 150 base pairs long, and all 150 base pairs match the reference genome. If there is an insertion in there, then you might get, say, 30M; then, if the insertion is 5 base pairs long, 5I; and then, say, 60M again. That's how the CIGAR string works. As I said, we cannot spend too much time on it, but there are really a lot of resources on the internet to check out. So that would just mean, sorry, that you would see other letters, like M or I, indicating whether they are matches or insertions, I guess. Yes. And it is relevant to realize that if you have a single base difference between the read and the reference, that is not stored in the CIGAR string, because we already store the read sequence in the SAM file and we do not want to store similar information twice. It's a bit of a technical point, but if we talk about CIGAR strings it is relevant to know: SNPs, single base differences between the reference and the reads, are not stored in the CIGAR string; they are in the sequence itself. There is a question from Martin. Yes, sorry, maybe I didn't understand correctly: the SAM file, can it contain multiple alignments of the same read? Yes. SAM stores alignments, and that means you can have multiple alignments for a read. Technically it is possible, although with default settings both Bowtie2 and BWA, as far as I know, will only report a single alignment per read. But there are cases where one read gets multiple alignment records.
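The FLAG integer and the CIGAR string just discussed can both be decoded mechanically; a sketch, with the bit meanings taken from the SAM specification:

```python
import re

# Decode the SAM FLAG integer: each bit encodes one property of the
# alignment (values from the SAM format specification).
SAM_FLAGS = {
    0x1: "paired", 0x2: "proper pair", 0x4: "unmapped",
    0x8: "mate unmapped", 0x10: "reverse strand",
    0x20: "mate reverse strand", 0x40: "first in pair",
    0x80: "second in pair", 0x100: "secondary alignment",
    0x200: "fails QC", 0x400: "duplicate", 0x800: "supplementary alignment",
}

def decode_flag(flag):
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]

# Parse a CIGAR string into (length, operation) pairs. Note that 'M'
# means "alignment match": a SNP still counts as M, because the
# mismatching base lives in the SEQ column, not in the CIGAR.
def parse_cigar(cigar):
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

decode_flag(99)          # paired, proper pair, mate reverse strand, first in pair
parse_cigar("30M5I60M")  # [(30, 'M'), (5, 'I'), (60, 'M')]
```

Decoding the flag this way is exactly what the "Google SAM flag" web calculators do behind the scenes.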
One example is what is called a chimeric alignment, where one part of the read aligns in one place and the other part aligns to another chromosome; then the aligner reports two alignments for the same read, with the two parts aligning in different places. Lorena has a question. I want to know: let's say I am not sure, between two reference genomes, which one to use. If I get this kind of result, should I focus more on the mapping quality to decide which one to use? Mm hmm. So, first of all, choosing the reference genome for your analysis is a very important step, and you should make that decision as early as possible. The reason is that it can be very cumbersome, or sometimes even impossible, to change your reference genome while you are in the middle of your variant analysis. Let's say you have your VCF against a reference genome from 2010 and you want to change the coordinates in that VCF to a reference genome from 2020; that can be a very challenging task. There are ways to do it, called a lift-over, but it is not a very easy task. So choosing the reference is very important. And of course you can have different reasons for choosing one reference or the other: maybe you want to compare to older projects, then you take an older reference, or you want to take the newest. Of course, taking a newer reference, or a reference that is more similar to the organism you are interested in, gives a better representation of the truth. That might mean, for example, that a newer reference is able to correctly represent a recent gene duplication, while a previous reference actually merged the two copies together.
So basically, with the older reference you might have gotten high mapping qualities, but it probably meant that you were aligning reads from those two duplicated genes onto a single piece of sequence that was actually supposed to be two different positions. Yeah, because in my case, for example, it is a de novo assembly: I have isolated some bacteria, assembled them, and I could use that as the reference genome. Can you repeat the last part? Maybe the bacterium is not well known, so there is no established reference; it is a de novo assembly. That's why I would like to try different alignments and check the values to see which reference is best to use. So basically, ideally, just for the sake of true positive variants, false negative variants and so on, you want a reference that is as similar as possible to the individual you are researching. However, a de novo assembly may not have a good annotation, and if you are mainly interested in variants that have an effect on, for example, the protein sequence, then you might again go for an older or a different reference genome, because that is mostly what you are interested in. So it is always a bit of a balance. Thanks. No worries. Okay, so after the CIGAR string you have information about the mate and the fragment length; that is part of the information you have in the SAM file. But also stored in the SAM file are the actual sequence of the read and the base qualities. And then there are some optional tags: for example, information about the alignment score, which tells you how good the alignment was. That is what all aligners do: they calculate some kind of alignment score and, based on that score, decide whether this is a significant alignment or not. Another tag that can be used is, for example, information about the read group; more about that later on.
So, a question for you. The question is: can you technically regenerate the FASTQ file from the SAM file? Let's say you only have the SAM file and you want to regenerate the FASTQ file; can you do that? And the second question is: can you regenerate the reference sequence to which you have aligned from the SAM file? So basically you have to use the information you just learned about the file format. Okay, most of you have answered. There we go. Interesting. I guess this is a relevant point, because the SAM file is of course central to variant analysis. So, let's start with the first one, the FASTQ file; let's go back to the presentation. Think about what is stored in a FASTQ file. The FASTQ file contains the read name, and we have the read name over here; for a raw Illumina read it looks a bit different, but anyway, we have the read name. The only thing we would need to do is paste an @ in front of it, and that should be quite doable. What else is in the FASTQ file? The actual read sequence; we have that in the SAM file as well. Then there is the third line, which is usually empty apart from the plus sign, and in the fourth line we have the base qualities, and the base qualities are also in there. So, in principle, we can reproduce the FASTQ file from the SAM file. That's one. Then the question: can we regenerate the reference genome from the SAM file? Well, we do have some information about the reference genome, and that information is in the header: the chromosome names and their lengths. However, the reference genome itself of course consists of the actual sequences of the chromosomes, and those are not stored in a SAM file.
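The FASTQ-reconstruction argument above can be sketched in a few lines: the read name, sequence and base qualities are SAM columns 1, 10 and 11. In practice you would use `samtools fastq`, and you would also need to reverse-complement reads aligned to the reverse strand; the SAM line below is a made-up example:

```python
# Rebuild a FASTQ record from a single SAM alignment line (a sketch;
# ignores the reverse-strand complication for brevity).

def sam_to_fastq(sam_line):
    fields = sam_line.rstrip("\n").split("\t")
    name, seq, qual = fields[0], fields[9], fields[10]
    return f"@{name}\n{seq}\n+\n{qual}\n"

sam = "read1\t0\tchr20\t60\t42\t4M\t=\t200\t300\tACGT\tIIII"
print(sam_to_fastq(sam))
```

No equivalent function is possible for the reference: the header only lists chromosome names and lengths, never their sequence.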
So we only have the chromosome names and their lengths, not the original sequence of the reference genome. Going back to the question, the answer is: only the FASTQ file. You can regenerate only the FASTQ file, because you have the read name, the sequence and the base qualities, but you do not have the sequence of the reference genome. Is that a bit clear? Okay, good. Again, if you have any questions or doubts, if I go too fast or too slow, just let me know. Okay. So that's the SAM format. There is really a lot more to say about the SAM format, but simply because we do not have a lot of time during this course, we cannot cover everything. Another important concept related to SAM files, or alignment files, is the concept of read groups. In principle it is a relatively simple concept. What you sometimes want is to have multiple groups of reads in one BAM file. For example, if you have two libraries of one sample and you want to keep track of which read came from which library, but you want to have them in the same SAM file, you can do that with read groups. What you do is add a header line that specifies the group, gives metadata and gives an identifier, and to each read you add a tag with that identifier, indicating whether it comes from one library or the other. And not only libraries; you can specify anything. Typically it is used to add metadata to alignments: the sample identifier, the library identifier, the lane identifier, which platform was used, for example Illumina or PacBio, and so on. So with read groups you can have both PacBio and Illumina sequence in the same BAM file, though the analysis might then be a bit challenging. We will use read groups later, so that's why I am mentioning them. So this is what it looks like, a very simplified example of a SAM file. We have two header lines added there, and those header lines start with @RG, and each has an identifier.
So we have an identifier called RG1, and an identifier RG2: read group one and read group two. In this example they both come from the same sample, but from different libraries. And for the alignments in the SAM file, we add at the very last column a tag: starting with RG, then a Z specifying what kind of value to expect — the Z stands for string — and then the identifier of the read group. So this alignment came from library one, and this alignment from library two. That's how read groups actually work, and you will do some exercises with them later on. Then, duplicates. As you have learned, there's a PCR step during library preparation, and that can result in fragments in your library that do not come from independent original fragments, but actually come from the same fragment. Usually you want to find those and get rid of them. How it works, if you do not have any UMIs (more about UMIs later), is that you simply look for alignments that align at exactly the same position. If they align at exactly the same position, that probably did not happen by chance, so we assume these are actual duplicates: either PCR duplicates or optical duplicates. Korra has a question. "Hi, sorry, yes. So this is because you're assuming that the DNA shearing at the very, very start is always going to produce completely different fragments? That's where it comes from originally?" Yes — the position of the shearing is indeed random. "Okay. Yeah, cool." And that is indeed a bit of an issue, because if you do, for example, enzymatic shearing, it is as random as possible, but not completely random. So you might misidentify duplicates because of that. There are some issues there, and this non-random shearing is indeed one of them. So there are two types of duplicates — more about that later.
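The position-only duplicate marking just described can be sketched as follows — a toy example with made-up read names and coordinates, grouping alignments purely by where they map:

```python
from collections import defaultdict

# Toy alignments: (read_name, chrom, pos, strand). Coordinates are made up.
alignments = [
    ("readA", "chr1", 1500, "+"),
    ("readB", "chr1", 1500, "+"),  # same position and strand -> presumed duplicate
    ("readC", "chr1", 2300, "+"),
]

# Group alignments by mapping position (and strand).
by_position = defaultdict(list)
for name, chrom, pos, strand in alignments:
    by_position[(chrom, pos, strand)].append(name)

# Keep one read per position, mark the rest as presumed duplicates.
marked = {name for reads in by_position.values() for name in reads[1:]}
```

Here readB gets marked because it shares readA's exact position — the assumption being that two independent shearing events rarely cut at exactly the same spot. Real tools such as Picard MarkDuplicates also consider mate positions and pick the highest-quality read to keep, which this sketch omits.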
So there are two different types of duplicates in general, and this is where the optical duplicates come in. A regular duplicate can occur anywhere on the flow cell: it originates from the same original fragment, but can end up anywhere on the flow cell. An optical duplicate is actually easier to detect, and whether something is an optical duplicate or not is information that can also be stored in the SAM file. If a duplicate is flagged as an optical duplicate, you can be more confident that it is a real duplicate, rather than something that occurred, for example, through non-random shearing or just by chance. An optical duplicate occurs when you have two spots in very close proximity to each other that have exactly the same sequence, or at least the same alignment. This can happen if one molecule ends up at a slightly different position in the spot and gets bridge-amplified there, or when a fragment ends up in a neighbouring spot and gets bridge-amplified — then exactly the same original fragment gets bridge-amplified in two neighbouring spots, and that becomes an optical duplicate. There are two different kinds of flow cells — I must admit I forgot the names — but you have one kind where the spots are laid out in a grid, which are the more modern (patterned) flow cells, and one kind where the spots are positioned randomly. On those randomly positioned flow cells, I think optical duplicates used to occur a bit more frequently, but even on the grid-based flow cells it still happens quite frequently that one fragment ends up in a different spot and gets bridge-amplified, causing these optical duplicates. So, for the person — I think it was Michael — who asked the question about optical duplicates: does this answer your question?
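Because optical duplicates sit physically close together, they can be recognised from the cluster coordinates embedded in Illumina-style read names (`instrument:run:flowcell:lane:tile:x:y`). A minimal sketch of that proximity check, with invented read names — note that the pixel threshold is an assumption here (Picard's default is in that spirit, with a larger value recommended for patterned flow cells):

```python
def cluster_coords(read_name):
    """Pull tile and pixel coordinates out of an Illumina-style read name
    (instrument:run:flowcell:lane:tile:x:y)."""
    parts = read_name.split(":")
    tile, x, y = (int(p) for p in parts[-3:])
    return tile, x, y

def is_optical_pair(name_a, name_b, max_pixel_dist=100):
    """Two duplicate reads are 'optical' candidates if they sit on the
    same tile within max_pixel_dist pixels on both axes."""
    tile_a, xa, ya = cluster_coords(name_a)
    tile_b, xb, yb = cluster_coords(name_b)
    if tile_a != tile_b:
        return False
    return abs(xa - xb) <= max_pixel_dist and abs(ya - yb) <= max_pixel_dist
```

In practice you would only run this check on reads that were already flagged as duplicates by position; the coordinate test then separates the optical ones from the regular PCR duplicates.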
"Yeah, yeah, I think it's aligned with what I discussed. Thanks for the explanation." All right. So, about marking duplicates. It's important, especially for variant calling. For, let's say, RNA-seq with a nicely complex library, it is not essential that you do it, but for variant calling it is rather important, because what you assume when you call variants is that each read is an independent observation of the genome. If it's a duplicate, it isn't an independent observation, because duplicates are dependent on each other: they come from the same original fragment. So if you use duplicates in your variant analysis, you are violating this assumption, and therefore you typically want to remove those duplicates. However, what can happen is that you mark reads as duplicates that actually have a different molecular origin, so you are removing information, which is a pity. In a high-quality library, though, removing duplicates doesn't have a big effect on variant analysis. It is mainly in lower-quality libraries — for example with low input DNA, where you get a lot of duplicates — that marking those duplicates becomes more and more important. What really helps in marking duplicates are unique molecular identifiers, or UMIs. So what are those UMIs? UMIs are random sequences that are added to the fragments before you do the PCR reaction in the library preparation. Because the sequence is random, you can have the same UMI occurring multiple times in the same library, but it becomes a very small chance that a read with the same UMI also aligns at exactly the same position in the reference genome. So with UMIs you can actually identify false-positive duplicates. Here we have the same image as before, but now with UMI information added.
So we have this blue tag, and the green, purple, yellow and orange tags all depicting different UMIs. In this case, we see that we have exactly the same alignment with the green tag as with the blue one, but because it has a different UMI, we do not mark it as a duplicate of the blue alignment. So we can better identify false positives — same-position alignments that occurred just by chance — and remove only the real duplicates. So the accuracy of your duplicate marking increases very much when you use unique molecular identifiers, or UMIs.
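The blue-versus-green situation above can be sketched by simply adding the UMI to the grouping key: alignments at the same position with different UMIs stay independent, and only same-position, same-UMI reads get marked. A toy example with invented reads and UMI sequences:

```python
from collections import defaultdict

# Toy alignments: (read_name, chrom, pos, strand, umi). All values made up.
alignments = [
    ("read1", "chr1", 1000, "+", "ACGTAC"),  # the 'blue' read in the slide
    ("read2", "chr1", 1000, "+", "GGTTAA"),  # 'green': same position, different UMI
    ("read3", "chr1", 1000, "+", "ACGTAC"),  # true PCR duplicate of read1
]

# Group by (chromosome, position, strand, UMI) instead of position alone.
groups = defaultdict(list)
for name, chrom, pos, strand, umi in alignments:
    groups[(chrom, pos, strand, umi)].append(name)

# Keep the first read of each group, mark the rest as duplicates.
duplicates = {name for reads in groups.values() for name in reads[1:]}
```

With position-only marking, read2 would wrongly be flagged; with the UMI in the key, only read3 is marked. Real UMI-aware tools also tolerate sequencing errors within the UMI by clustering near-identical UMIs, which this sketch does not attempt.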