So an important point of CBW is that all of these materials are openly accessible and openly reusable under the Creative Commons license. I know Nia just shared this slide as well, and I touched on it yesterday. That also applies to the instructors: we reuse materials from other lectures as well, and this slide deck that I'm going to present includes quite a lot of material from my close friend and collaborator John Tyson of the BC Centre for Disease Control. John was really one of the main contributors to the development of primer schemes and sequencing methodology for SARS-CoV-2, and we're going to be focused a lot on coronavirus today. So big credit to John, not only for producing some of the slides I'm going to present today, but also for his huge contribution to our ability to cheaply and rapidly sequence coronavirus. Alright, so maybe I'll start in with the module. This is module four: welcome, everybody, to viral pathogen genetic analysis. I'm Jared Simpson. I'm based at the Ontario Institute for Cancer Research, which is one of the hosts of CBW. My research group is focused on developing new bioinformatics methods for analyzing sequencing data. Originally, you know, 10 or 12 years ago when I got into bioinformatics, I was really focused on short-read genome assembly, but over time I'd say my research aims have broadened into all things sequencing and how we can best reconstruct genomes from raw sequencing data. My group mainly focuses on cancer genomics, so that's one of the main CBW courses that I teach, but there's a lot of transfer of bioinformatics methodology across different sequencing domains. So early on in the COVID pandemic,
I got involved with the Canadian national project to sequence coronavirus, which was introduced yesterday, called CanCOGeN. Some of the nanopore analysis tools that we developed for calling variants against viral genomes, which were applied to Ebola and Zika, became applicable to coronavirus as well, so I got involved in the project that way. And then, you know, when we were putting together this workshop, I offered to teach this module on how we interpret sequencing data from viruses. So this is going to be very specific to coronavirus, and particularly to the way that we sequenced coronavirus, which was using amplicon-based schemes: we take tiled PCR across the genome, amplify those segments of the genome, and then try to reconstruct what the sequence of the virus was from those tiles. I'm going to get into all of that as we go throughout this lecture, but just to say that this lecture is very focused on viruses and particularly on amplicon-based analysis. Now, one thing I want to say is please feel free to ask questions as much as possible. I love answering questions, and we're all here to learn, so please feel free to put up your hand, or even unmute yourself and just interrupt me to ask a question as we go. I probably don't have a full hour of lecture here, so we should have plenty of time before 11 a.m. Toronto time to answer questions. I know just from looking at the Slack and watching the lectures yesterday, you've all had great questions, so please do keep that going. Okay, so let's start off with learning objectives. I'm going to start with these slides from John, where we're going to talk about the different approaches to sequencing pathogens, but more specifically viral genomes. This is really going to try to address the question of how we can take a clinical sample, which may have very, very little viral material, and amplify or enrich that up to a sufficient level that we can then sequence the entire genome and reconstruct it.
And then, you know, a bit of a spoiler here: the approach we're going to be talking about is amplicon-based sequencing. Amplicon data, while it's a great way of recovering full-length viral genomes, is very challenging to analyze, more challenging than whole-genome shotgun sequencing, so I'm going to highlight some of the pitfalls and some of the challenges of analyzing amplicons. And then the hands-on part of the module later on will be interpreting the results of amplicon-based analysis pipelines: we'll be taking real SARS-CoV-2 data, running it through a genome reconstruction pipeline called SIGNAL, and then interpreting the results and performing quality control on that data. Alright, so let's talk a little bit about pathogen genetic testing and typing. There are a lot of different molecular tests you can do for pathogens, but they all follow a similar workflow. You might take your sample, which could be anything from cultured isolates all the way up to just swabs or blood samples from the infected patient. The first step is always going to be doing a nucleic acid extraction from that sample. Some pathogens, of course, are DNA; other pathogens are RNA, and if it's RNA, your next step would be converting the RNA to cDNA through reverse transcription. And then we've got a couple of different workflows that we can use to identify whether there was a specific pathogen in that sample. If you know the pathogen that you're looking for, you might go to qPCR, where you have, you know, short primers and probes for particular regions of the pathogen genome, and you're essentially just looking for the presence or absence of those probes in your sample. Or you can start to do some targeted sequencing. So if you maybe don't know what pathogen is in the sample, you could sequence the 16S.
So just do targeted sequencing of the ribosomal 16S gene, then try to match it to some database of 16S sequences. Or you could go all the way through to whole-genome shotgun sequencing. This is more widely used when you need, say, SNPs to look for clusters of cases in outbreaks, or when you're looking for mobile elements like plasmids. And there are a lot of different ways we do whole-genome sequencing, either completely unbiased, like metagenomics, or with highly targeted amplicon schemes like we've done for Ebola, Zika, and of course SARS-CoV-2. So let's talk about sequencing in a little bit more detail. The easiest way, and by far the most straightforward from the analysis point of view, is just sequencing isolates. Here we've managed to isolate the pathogen or the genome of interest, usually through doing something like culturing a bacterium, and then we just sequence and assemble the genome, or align it to some reference database to identify what was there. Slightly more work up front in the wet lab, and a little bit more on the analysis side, is targeted sequencing, where you might not want to sequence the entire genome, but just a little bit of the genome. Here we usually just copy out the stretch that we're interested in using PCR primers, again targeting a highly conserved region, then sequence it and identify it by matching your 16S reads to some pathogen database, or using any of the workflows that are available for 16S analysis. The big caveat here is that, unlike just sequencing an isolate, you need some prior knowledge of what the target is, so that you can design your PCR primers to amplify that region of interest. Or you have the most complicated case, where we're doing essentially just metagenomics, where your pathogen of interest is a component of the total amount of DNA or RNA within that sample, and you need to detect that pathogen sequence in the sea of other sequence that you may have recovered from your sample.
And again, you can do this in one of two ways. You can either do pure shotgun sequencing, where you don't try to enrich anything, you just sequence the total complement of DNA and RNA within that sample and then do your analysis on that. Or you can try to capture your pathogens or targets of interest, or even deplete background material like human RNA or DNA that you're not interested in, and preferentially sequence the material that you've enriched. So the choice of strategy, whether you're going to do shotgun, probe enrichment, or amplicons, really depends on two factors. The first factor is just your target abundance. In real clinical samples, taking coronavirus for example, you don't necessarily recover a lot of viral RNA. Sometimes you can have, you know, tiny, tiny amounts of viral RNA, where if you just do, say, metagenomics of all the RNA from your nasopharyngeal swab, you won't recover enough sequencing reads to detect whether SARS-CoV-2 is there. That would be a case where we have very, very little target abundance. Conversely, if we have very, very high target abundance, where there's tons of material there, you know, there was a very active coronavirus infection, you may be able to just do shotgun metagenomics and recover enough sequence to reconstruct that viral genome. This is really the most important axis for determining whether you want to do enrichment of your sample or not: if you have low abundance of your pathogen, you probably want to enrich; if you have high abundance, you probably don't need to. Now, when you're going to do some enrichment, you really have two routes, two different strategies. You can either do amplicons, you know, tiled PCR primers across the entire genome, or you can do probe capture, where you take short oligonucleotides that bind to the genome of interest, and you have them attached to things like biotin, which lets you pull down and enrich for that pathogen sequence.
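To make the tradeoff concrete, here's a toy sketch of the decision logic: target abundance decides whether to enrich at all, and (as we'll see) genome diversity decides between amplicons and probe capture. The thresholds and category labels are invented for illustration; in practice this is a judgment call, not a lookup table.

```python
def choose_strategy(target_abundance: str, genome_diversity: str) -> str:
    """Toy decision rule for picking a sequencing strategy.

    target_abundance: "high" or "low" (e.g. strong vs weak qPCR positive)
    genome_diversity: "high" or "low" (circulating SNP diversity)
    """
    if target_abundance == "high":
        # Plenty of material: unbiased shotgun/metagenomic sequencing works.
        return "shotgun"
    # Low abundance: we must enrich somehow.
    if genome_diversity == "low":
        # Stable primer binding sites: tiled amplicons are cheap and fast.
        return "amplicons"
    # SNPs may disrupt primer binding sites: probe capture is more tolerant.
    return "probe capture"

# Cultured bacterial isolate: effectively unlimited material.
print(choose_strategy("high", "low"))   # shotgun
# SARS-CoV-2 clinical swab: low abundance, low diversity.
print(choose_strategy("low", "low"))    # amplicons
# Influenza: low abundance, lots of circulating variation.
print(choose_strategy("low", "high"))   # probe capture
```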
And the choice between amplicons or probe capture really depends on the diversity in the genome. If there's very low diversity, not a lot of SNPs, amplicons are probably the better choice; but if there's very high diversity, where those SNPs might interrupt your PCR primer binding sites, you may want to go to probe capture. All of these different strategies, shotgun sequencing, amplicons, and probe capture, have different tradeoffs for how expensive they are and how easy the protocols are, with shotgun sequencing being the cheapest and fastest, probe capture being the most expensive and the longest, and amplicons being somewhere in the middle, where the cost is medium and the protocol doesn't take too long. Now we can take different pathogens and put them on this map of different sequencing strategies and see what the best strategy may be for a particular pathogen. So SARS-CoV-2, where again we can have very, very low abundance, you know, maybe it was a weak positive from qPCR, but it's a fairly low diversity genome, especially with all these selective sweeps as new variants of concern arise, goes down in this bottom left corner here, where amplicons are a good strategy. Other genomes like flu, where there's a lot of circulating variation, don't work very well with amplicon schemes because you get a lot of dropouts; we'll talk about dropouts a little bit later on. So there you may want to do probe capture. And up here, where you have, say, cultured bacterial isolates, you don't need to worry about doing any sort of targeted enrichment through amplicons or probe capture; you just directly sequence the cultures with shotgun sequencing. Okay, any questions there about design of sequencing strategies before I move on a little bit into the analysis? Yeah, Tess, do you want to go ahead? I saw your hand go up first, so I'll give you first shot. Yeah, for sure.
I was just wondering, in terms of RNA viruses, what priming techniques are most common for reverse transcription that you see within the public health sector? Yeah, I'm sure somebody else on the call can answer this more definitively than I can, but I'm pretty sure the dominant way, at least for coronavirus, was just random hexamer reverse transcription. Anybody have a different opinion or, you know, a better understanding? Sounds like not. Deborah, do you want to ask your question? Did you say Deborah? Hi, Deborah. Okay. So, I have two questions. The first one is about cultured isolates: is there a particular special way to culture when you want to do this type of analysis, or is the normal culture method fine? What type of culture method do you use when you want to do this type of analysis? And my second question is about extraction: is there a particular kit that you recommend for extraction of RNA or DNA samples from viruses? Yeah, so I'll answer the culture question first. Essentially, any culture method that you can use to grow your bacteria would be sufficient for sequencing. The culture obviously takes a clone and expands it up to many, many copies, so there's going to be enough DNA after you've cultured it that you can do an extraction and then sequence. So I don't think there's a particular culturing method I can recommend as appropriate for whole-genome sequencing work; as long as you can culture it, you should be able to extract DNA, and then off you go. As for your second question about extraction techniques: it is quite important for clinical work, and we saw a lot of variation in the ability to recover complete genomes from sequencing coronavirus depending on the extraction kit that was used, but I can't offhand remember or recommend a particular extraction kit.
But you know, if you're starting a new project where you're going to be sequencing a large number of pathogens, it's well worth testing different extraction kits and seeing how well they recover viral material, how well that sequences, and whether you can recover the entire genome. That's definitely important, but I can't make a definitive recommendation. Thank you. Do we have another question? Sean, do you want to ask your question now? Yes, thank you very much. I saw in this image, the slide we're on now, that hepatitis C virus appears on one side with the amplicons, and then I can also see it with the unknowns. I just wanted to know what's the difference. Yeah, that's a very good question; it's simply something I didn't highlight here. So hepatitis C has incredible diversity, both, you know, within the different circulating lineages but also within individual hosts, where there are a lot of quasispecies, which makes it perform very, very poorly if you want to use an amplicon-based sequencing method. That pushes it out to these other methods, where we need to do slightly less biased sequencing. But the reason it's also listed down here is that there are highly conserved regions in hepatitis C where there's not a lot of variation, and you can design primers to amplify those highly conserved regions and sequence them. So it's sort of a special case: certain regions of the genome are amenable to these more targeted approaches, but in general you need to use fairly unbiased approaches for hepatitis C. I hope I answered that question. Yes, you have, thank you. Do you have a question? I just wanted to give some feedback about the priming for the RT step. The original amplicon protocols, the ARTIC amplicon protocols, all used a two-step reaction, with an RT step using random hexamers and oligo-dTs mixed. I have switched to a one-step PCR amplification.
And so that RT step uses target-specific primers for the RT. Great, fantastic, thanks for that. If I could just ask a quick question about the random-primed strand-switching method, could you elaborate on that? I haven't actually looked into it. Yeah, I'd probably have to go back and ask John; this is a slide from him, but I think that's more or less just, you know, a stand-in for a metagenomics protocol where, if you're sequencing metagenomics from RNA, you random-prime to cDNA, but I'm not sure about the strategy, so I'll check and then get back to you on that. Okay, thanks very much. Next question: what type of samples would be difficult to extract DNA from at high enough quality to do good sequencing? I heard the first part of your question, what type of samples would be difficult, but then I missed the rest, sorry. So, in terms of sequencing, which types of sample can it be harder to obtain good quality DNA from, in order to get good, high-quality sequencing? It's less dependent on the type of sample than on the abundance of the sample, so if you have something that's in very, very high abundance, you typically can extract DNA and get a good quality yield. There are definitely different bacteria that are hard to extract DNA from, or that have really high AT or GC content, where the analysis and, you know, the sequencing becomes more tricky. But I would say, at least for this topic, where we're talking about sequencing viral genomes from clinical samples, it's more the abundance of the target that's important. I hope that answers your question. I'll take one more question and then we probably should move on, so let's go to Kim; sorry if I didn't get your name right. Go ahead.
Yeah, I would like to ask a quick question about the depletion step. I mean, I've seen a lot of wet lab procedures available for human or host depletion. But let's imagine that we couldn't do that depletion step, and we did the sequencing, and, let's say, 90 or 95% of the sequences are hitting the human genome. After removing those bioinformatically, can we still say that the results are reliable? If, for example, we find something like a bacterium or a virus, can we say these results are reliable, or do we need to go back and do the wet lab depletion step? That's a very good question, and it sets up nicely something I'm going to talk about later on, which is quality control of sequencing results. Just because you've done some sequencing doesn't necessarily mean, you know, the results are reliable. And this situation where you have a high amount of human or host background DNA is certainly a challenge that we faced with CanCOGeN, our national project to sequence SARS-CoV-2. In the pipeline that we're going to be using in the practical session a little bit later on, the very first step is to map reads to both the human and coronavirus reference genomes and throw out all the reads that map to the human genome, so that they don't influence the analysis of the viral genome. So this problem of high host contamination is something we handle on the informatics end. And the second part of your question, whether we have a reliable genome reconstruction, is something I'm going to cover in depth later on in this lecture, so I'll defer that question a little bit, but it's certainly very important. So let's carry on now; great questions everybody, that was a great discussion. I'm not going to dwell on this too much, because you got a very nice introduction to the different sequencing technologies yesterday.
But once you've extracted DNA, you know, amplified it or did some probe capture to enrich for whatever pathogen you're interested in, we have a lot of different choices for our sequencing technology, and the one we're going to focus on today is the Illumina short-read sequencer. I'll also put in a mention for the Oxford Nanopore long-read sequencers, as they were very, very widely used for sequencing coronavirus during the pandemic, because of the inherent portability of the MinION and how cheap it is to get started; countries have run their entire surveillance programs just based on Oxford Nanopore MinIONs. And in fact, I got into this field of pathogen sequencing by writing variant calling software for the MinION as part of the ARTIC project, and really most of my research focuses on long reads, so it's slightly ironic that this entire lecture and the practicals will be using Illumina short reads. But what I hope to get across is the commonality of how the analysis workflows take your sequencing data and turn it into a viral genome sequence. Okay, so we're going to be focusing on tiled amplicon sequencing schemes, because we're in this, you know, regime where the target, the pathogen, is at very low abundance, and there's fairly low diversity, which makes it highly amenable to amplification with just PCR primers. And we're going to be using data in the practical section from the ARTIC protocol. I was part of the ARTIC network when we were sequencing Ebola and Zika, and it was very fortunate that they had a lot of tools developed for sequencing viral genomes, both on the wet lab side and the bioinformatics side. With these tools available, you know, at the start of the pandemic, they were very, very rapidly able to adapt them to coronavirus, and the main workhorse protocol was really called ARTIC v3.
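To give a feel for what a tiled scheme looks like, here's a small sketch that computes hypothetical amplicon coordinates for a genome: fixed-length amplicons with overlapping ends, assigned alternately to two pools so that overlapping neighbours never sit in the same PCR reaction. The amplicon length and overlap here are illustrative round numbers, not the real ARTIC v3 coordinates.

```python
def tile_genome(genome_length: int, amplicon_len: int = 400, overlap: int = 50):
    """Return (start, end, pool) tuples tiling the genome.

    Adjacent amplicons overlap by `overlap` bases and alternate between
    pool 1 and pool 2, so overlapping amplicons are amplified separately.
    """
    tiles = []
    step = amplicon_len - overlap
    start = 0
    i = 0
    while start < genome_length:
        end = min(start + amplicon_len, genome_length)
        pool = 1 if i % 2 == 0 else 2
        tiles.append((start, end, pool))
        if end == genome_length:
            break
        start += step
        i += 1
    return tiles

tiles = tile_genome(29903)  # the SARS-CoV-2 genome is ~29,903 nt
print(len(tiles))           # number of amplicons needed to span it
print(tiles[:3])            # first few (start, end, pool) tiles
```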
There were various revisions and versions through the pandemic. ARTIC v3 uses 98 primer pairs in two non-overlapping PCR pools, with 400 base pair amplicons, and a really big challenge on the bench side is balancing the amplicons: you don't want all of your coverage from one primer pool to come from a single amplicon, you want these amplicons to be relatively balanced in the amount of material they generate. So John, and Josh Quick, and a lot of, you know, bench scientists spent a tremendous amount of time balancing concentrations of PCR primers to get relatively even coverage of the genome. But even then you still need to sequence quite deeply, you know, up to 1000x, and I think at one point we were sequencing to 10,000x on the NovaSeq, to make sure you get representation of every amplicon in your pool. And I see Jose mentioned in the Slack here that we're now up to ARTIC v5; that's true, that's the newest iteration of the protocol. And maybe this is a good time to mention the reason we need to keep updating versions of the protocol: the dominant circulating lineages accumulate SNPs scattered across the genome, and if those SNPs fall in a PCR primer binding site, that amplicon will drop out, or may drop out or amplify at lower efficiency. So there's this constant need to update the primer schemes as the dominant variant of coronavirus changes. When we were in, you know, the Omicron days, they needed to change the PCR primers to adapt to Omicron and have alternative primers, so that you still recover nearly complete genomes instead of having primer dropout. So thanks, Jose, for bringing that up. Okay, so let's talk about analysis pipelines. This is the general overview of a whole-genome shotgun sequencing pipeline: we take FASTQ files from our sequencer, we align the reads to a reference genome using a program like BWA, and then we call variants with respect to the reference genome.
We call variants using programs like freebayes; there are many, many different variant callers, and freebayes is one that I prefer and one that we used for coronavirus. Then we take the variant calls and generate a consensus sequence for the sample that we've sequenced, using a program like bcftools: we take the reference genome and the set of variants with respect to that reference genome, and we can just swap in the variants at each variant position of the reference to get our derived sequence for that sample. That's output as a FASTA-format sequence, and that is our genome for the sample we sequenced. Now, I've intentionally simplified this pipeline. There are a lot of quality control steps, a lot of things like trimming adapters and filtering variants, that I'm not going to talk about in detail, but these are really the three key steps I want to emphasize today. So let's talk about these steps in detail, and if this is something you're interested in, I'll put in a plug for the high-throughput sequencing CBW workshop. That's one that I lead, and we talk about all of these different steps in the analysis pipeline in considerable detail, where each one of the steps I'm going to talk about is really an individual module on its own. Here I'm just going to give a high-level overview of the different analysis steps, and of course I'm happy to answer questions if you'd like to go into any topic in more detail. Alright, so the first step is mapping and aligning reads to our reference genome. The problem here is that genomes are very big, but reads tend to be very, very small. And this is even the case in coronavirus, where the genome is only around 30,000 nucleotides in length, but our Illumina reads are only about 100 to 200 bases. So we need programs that are going to take those short reads and determine where in the reference genome each read may have come from.
This is called the mapping problem: when we take a read and determine the most similar sequence in the reference to that read, that's the mapping location of the read. Then there's a subsequent step that takes the bases of the reference in that region and the bases of the read and lines them up base by base, and that's called the alignment step. So here the alignment of this little read to the reference looks like this, where we've drawn these bars to show the matching between bases of the read and bases of the reference, and there's one mismatch here, where there's a T in the read and a C in the reference. This could be caused by a SNP or by a simple sequencing error, and we'll talk about how we can distinguish between those two cases a little bit later. You may have heard of the SAM and BAM alignment formats. SAM stands for Sequence Alignment/Map, so it stores both the mapping of the read and the alignment of the read, and BAM is just a binary version of SAM that represents the same information in less space. So the main ideas here are the mapping, which is the region in the reference that is most similar to the read, and the alignment, which is how the read lines up to the reference base by base. Now, to determine a consensus sequence, what we're going to do is take all of the reads mapped to a certain position of the reference, like this T that I've highlighted in bold here, and look at the base that's present in every one of those reads. We call this stack of bases aligned to a certain reference position a pileup. Variant callers and consensus callers examine all of the evidence in the reads at a reference position to determine what the base in the consensus sequence should be. Here, every read agrees that there is a T at this position, so we're going to put a T in our consensus genome.
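That per-column consensus logic can be sketched like this. It counts the bases in one pileup column and emits either the majority base or, when the evidence is mixed, an IUPAC ambiguity code (an extended nucleotide character such as Y for C-or-T, which comes up next). The 20% frequency cutoff is an arbitrary illustrative threshold, not what any particular caller uses.

```python
from collections import Counter

# IUPAC codes for two-base ambiguities (e.g. Y = C or T).
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("GC"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
}

def consensus_base(pileup: str, min_frac: float = 0.2) -> str:
    """Call a consensus base from one pileup column.

    Bases supported by at least `min_frac` of the reads are kept; one
    surviving base gives a plain call, two give an ambiguity code.
    """
    counts = Counter(pileup.upper())
    depth = sum(counts.values())
    kept = {b for b, n in counts.items() if n / depth >= min_frac}
    if len(kept) == 1:
        return kept.pop()
    if len(kept) == 2:
        return IUPAC[frozenset(kept)]
    return "N"  # no confident call

print(consensus_base("TTTTTT"))  # T: all reads agree with the reference
print(consensus_base("CCCCCC"))  # C: a variant relative to a reference T
print(consensus_base("TCCCTT"))  # Y: mixed C/T evidence
```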
Now we slide over to a different position of the genome, and here we see that there's a reference T, but all the reads agree that there's a C at this position. So here we would say there's a variant or mutation in the sample with respect to the reference genome, and we would put this C base in our consensus. Now, occasionally it's not clear what the consensus base should be, and this was a major problem during our coronavirus sequencing projects: sometimes the reads won't agree on what the base is. So here we've got a reference T; this read says there's a T, these three reads say there's a C, then there's another read that says there's a T, and so on. We represent these ambiguous or mixed positions using something we call IUPAC ambiguity codes. These are extended character sets for nucleotides that stand for two or more possible bases at a position; the ambiguity code Y stands for "there could be a C here, or there could be a T." Now, you know, you might think naively that when someone transmits the virus from one individual to another, all the copies of that genome should be the same, so we shouldn't have these ambiguous positions, we shouldn't see mixed evidence, based on the idea that, say, a single founding virus was transmitted to the person that got infected. So can anyone think of reasons why this could happen in reality, why we might have evidence for mixed bases? Unmute, or write in the chat or Slack, you know, some ideas on why we might not have a pure single viral lineage in our, say, coronavirus sample. Maybe sequencing error? What was the second one, sorry? Because of a variant of the virus? Definitely, so those are definitely two possibilities, sequencing errors and variants. I'll talk about each one individually once we come up with a good list of things.
So you could have sequencing errors; you could have, you know, mixed infections or variants, and we have definitely seen rare cases of co-infection; you could have intra-host diversity, where a single founding lineage was transmitted and the virus then mutated into a mixed population within the host. That's not so frequent with coronavirus, which doesn't mutate all that quickly, but it's certainly a possibility. Any other hypotheses? These are all good ones. There are a couple of great ones from Martin in the, I see, oh yeah, just looking at Slack here: mixed infections, definitely; post-infection mutation, that's a possibility. I'm trying to lead you to one that I'm thinking of without exactly saying it: what are all the things that can happen in the lab? PCR errors, for sure, PCR errors is a good one; cross contamination, good one Jose, that's the one I was thinking of. The one that I worried the most about during CanCOGeN is cross contamination between samples. Now, something I didn't talk about is that, you know, the genome here is tiny, 30,000 nucleotides, and modern sequencers generate huge amounts of data, so it would be huge overkill to sequence one coronavirus sample on one sequencing run. Typically what we did is multiplex, by adding barcodes to each sample, and you multiplex 100 or even 400 or more samples on individual sequencing runs. And the danger there is that if there was, you know, some splash-over between wells on your plate, you can get cross contamination between samples.
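As a toy sketch of the multiplexing just described, each read carries a short barcode that routes it back to its sample, and a read whose barcode matches nothing gets set aside rather than assigned. The barcode sequences and sample names here are invented; real kits use longer barcodes and tolerate mismatches.

```python
# Hypothetical 4-base sample barcodes; real kits use longer ones.
BARCODES = {"ACGT": "sample_01", "TGCA": "sample_02", "GGAA": "sample_03"}

def demultiplex(reads):
    """Assign each read to a sample by its leading barcode (exact match)."""
    bins = {sample: [] for sample in BARCODES.values()}
    bins["unclassified"] = []
    for read in reads:
        barcode, insert = read[:4], read[4:]
        sample = BARCODES.get(barcode, "unclassified")
        bins[sample].append(insert)
    return bins

reads = ["ACGTTTTTCCC", "TGCAGGGG", "AAAACCCC"]
bins = demultiplex(reads)
print(bins["sample_01"])      # ['TTTTCCC']
print(bins["unclassified"])   # ['CCCC']
```

Cross contamination shows up in exactly these terms: material from one well physically ends up carrying another sample's barcode, so no amount of demultiplexing can undo it.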
So we developed a lot of tools for detecting contamination, which gives you this pattern of variants where you can have, you know, multiple mixed bases. But all the other ideas that were mentioned, like mixed samples, sequencing errors, you know, co-infection, intra-host diversity from mutations happening after infection, those are all valid possibilities. So the fact that we see ambiguous bases can't definitively tell us whether there's cross contamination versus, say, co-infection; you typically need to go back and do some investigation, like looking at other samples on the same plate to see whether there's a valid mixture, and we're going to come back to that a little bit later on. I will make a quick mention of sequencing errors, though: variant callers typically have error models built in. The sequencers will estimate quality scores, which are numbers representing how reliable the sequencer thinks each base is, and the variant callers will use those quality scores to estimate whether the evidence at these positions is caused by sequencing error or by true variation. This is the type of thing I work on, you know, in my day job: building these probabilistic models for determining whether something is a sequencing error or whether something is a true variant. So we would hope that these ambiguity codes aren't caused by sequencing errors, but definitely, if you have particularly noisy samples or particularly low coverage, you could get ambiguity codes caused by sequencing errors. I just saw a hand go up there, Steven. Yeah, I just have more of a question from your experience in the field. So I feel like a lot of times when labs are dealing with pathogens, the people who are handling samples probably wouldn't be infected with what they're handling.
But I feel like for SARS-CoV-2, because of the fraction of the population that was asymptomatic or just not realizing they were infected yet, like lab staff, was there a higher rate of negative controls coming back positive and things like that? Was contamination a lot harder to deal with than with other pathogens you've worked with in the past? Yeah, it's a great question. I'm not aware of any documented cases where, say, a lab tech was infected and that caused contamination. The main route of contamination is the fact that we're using amplicon-based protocols. When the genome gets amplified, the amplicons go to incredibly high abundance and they can contaminate lab equipment, like your sequencing instruments; there are cases where lab coats had amplicons contaminating them, and that would end up in your sequencing run. So, because there's such a high level of amplification, that was the main route for contamination. The second main route was just splash-over between wells. There's a lot of automated robotics for processing these multiplexed plates, and there are definitely cases where you'd see a drop go from one well to another and that would cause a contamination. So I'd say those are the two main routes of contamination. But technicians being infected, I'm certain that happened, but I can't think of any documented cases. Thank you. Let's just move on. So, something I emphasized in the introduction is that we're sequencing amplicons here, and that requires us to revise our analysis pipeline to specifically handle amplicons. So I've spliced a step into our little toy pipeline here, which is saying trim primers.
And the reason for this is that every one of our PCR products is always going to start with the sequence that we primed off of the genome, and that's always going to be invariant with respect to the reference genome. That primer is an oligo that was bought from IDT or whoever synthesized the PCR primers, and it's never going to contain variants that are actually present in the sample. This is an example of an Illumina read that we're going to use later on in the hands-on, and I've highlighted the ARTIC v3 primer in red here. That region of the read is useless for variant calling; it's only going to exactly match the reference sequence. So we need to identify the PCR primers in our read and truncate the read to eliminate that red sequence, and this is incredibly important for the analysis. There were a lot of cases, especially early on in the pandemic, where data that was improperly trimmed made it into public databases, which looks like a genome that had a reference base in an inappropriate spot. Early on in the pandemic, groups were trying to make inferences about, say, introductions, where the virus was introduced into a country, based on individual SNPs, and there are cases where those SNP calls were incorrect because primers weren't trimmed. So this is something that I really like to emphasize when I talk about amplicon data: it is an absolutely mandatory step; you need to identify and remove your PCR primers. Here's what it looks like. We're going to be looking at IGV later on in the lab. Each one of these gray bars is a read aligned to the coronavirus reference genome. This is the BAM file before primer trimming, and this is the BAM file after primer trimming: it's just identified all of the bases that are from ARTIC v3 primers and clipped them off the read, to give this alignment that starts 20 bases over. And again, we'll see examples of this a little bit later on.
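As a minimal sketch of the idea (not the actual ARTIC pipeline, which uses dedicated tools such as `ivar trim` for this), clipping a primer from the 5' end of an aligned read might look like the following. The coordinates are made up for illustration, and the code assumes an ungapped alignment for simplicity:

```python
def trim_leading_primer(read_start, read_seq, primers):
    """Clip read bases that fall inside a primer interval at the read's 5' end.

    read_start: 0-based reference position where the alignment begins.
    primers:    list of (start, end) half-open reference intervals.
    Assumes an ungapped alignment (a simplification; real tools walk the CIGAR).
    Returns the new alignment start and the trimmed sequence.
    """
    for p_start, p_end in primers:
        if p_start <= read_start < p_end:
            clip = p_end - read_start  # bases overlapping the primer
            return p_end, read_seq[clip:]
    return read_start, read_seq
```

For example, a read starting at reference position 0 with a 20-base primer at (0, 20) would come back starting at position 20, much like the 20-base shift visible in the IGV screenshot.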
Okay, so that's a very high-level overview of the analysis pipeline. Now I'm going to come to this section on quality control, and I think this relates back to a question I got at the very beginning: how do we know our results are reliable, especially when we have examples with very, very limited viral material? There are typically three things that we care about when we're quality checking our results. We want to know whether the virus genome was successfully sequenced; that's a fairly basic one. We want to know whether that genome sequence is accurate, so that the consensus sequence doesn't contain a lot of sequencing errors or PCR errors. And we want to do run-level quality control to make sure that the sequencing run wasn't contaminated. Again, that's touching on the idea that the sequencing protocols are highly multiplexed, and we're running a very high number of PCR amplification cycles, so our amplicons can go to very, very high copy number, and we risk contaminating them across the plate and then having problems when we're trying to infer genome sequences. So let's talk about these one by one. First, we're going to check whether our genome was successfully sequenced. The way that we do this, and again we're going to do this hands-on in the lab, is by looking at the coverage of the genome position by position. Here we just mapped the reads to our coronavirus reference genome, and we're plotting the read depth, the number of reads covering each position of the genome, so the number of bases in each one of those pileup columns that I showed you in the consensus sequencing slide. This is some CanCOGeN data; we typically sequenced to a target of 1000x coverage, and this sample had a fairly low qPCR CT value of 16.
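Position-by-position read depth like this is a simple counting exercise. Here's a toy sketch of the calculation; a real pipeline would run something like `samtools depth` on the BAM file rather than this:

```python
def read_depth(genome_len, alignments):
    """Number of reads covering each reference position.

    alignments: list of (start, end) half-open reference intervals,
    one per aligned read.
    """
    depth = [0] * genome_len
    for start, end in alignments:
        for pos in range(start, min(end, genome_len)):
            depth[pos] += 1
    return depth
```

Plotting the returned list over positions 0 to 30,000 gives exactly the coverage profiles shown on these slides.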
That sample had pretty uniform coverage of around 1000x from base zero all the way up to base 30,000. This is the ideal coverage profile: there's not a lot of coverage variation, and it looks like the whole genome was sequenced. So we'd have a good amount of confidence that we'll be able to infer an accurate consensus sequence from this sample. Here's a sample with a higher CT value of 25. CT is just the cycle threshold for detection for the qPCR assay, so a CT of 25 means that it was called positive after 25 cycles of PCR. That's getting to be fairly high; I'll show higher examples of CTs in a little bit, but it's still something that we would think we can sequence. Here we see a lot of these downward spikes in our coverage profile, where the coverage drops to zero or to very low coverage like 10x, and a lot of the amplicons are very weak or fail entirely, like this example. But really, most of the genome was covered between 100x and 1000x, so we should be able to recover a genome sequence that's mostly complete and fairly accurate. Here's a very high CT sample, CT 34, so 34 cycles of PCR to detect positive. There's tons of dropout; the genome coverage is very, very spotty. Some amplicons worked fairly well, but for the majority of the genome, I'd say we don't have enough coverage to calculate a consensus sequence. So we would have a lot of apprehension about using this sample for downstream analysis. Now, it's fine to look at these plots, and I looked at a great many of them during CanCOGeN, but we need automated ways to assess the completeness of our genome. The way that we do this is that we calculate the proportion of bases that are actual nucleotides (A, C, G, or T), rather than the low-coverage positions, which get masked as N, an ambiguous base, in our genome. So we're going to be using a minimum coverage threshold.
If there's 10x coverage or higher, a consensus base is called at that position; if there's less than 10x coverage, that position of the genome gets annotated with an N, basically saying there was insufficient information to call a base there. So our genome completeness is the proportion of non-N bases in our sample. For example, this sample here had three N bases, three positions with low coverage, and seven ACGT bases, so its genome completeness would be 70%, seven out of ten. If there were only one ambiguous base, the genome completeness would be 90%. And this is crucially tied to the coverage plot here: the completeness is essentially the proportion of positions that are over this 10x line. The reason we care about this is that incomplete genomes are going to be much harder to analyze; for example, it's going to be harder to place a genome on the phylogenetic tree. Incomplete genomes also tend to have more sequencing errors. So this was the main QC check that we used in CanCOGeN: we only analyzed, and we only submitted to public repositories like GISAID, genomes that had completeness greater than 90%. That was the first check: if the completeness was over 90%, we would proceed with that sample for downstream analysis. Different projects had different thresholds for completeness, but 80 to 90% was around the standard that most people used. Now, I touched on qPCR CT values. Here's a look at genome completeness as a function of qPCR CT. I think this was for a large collection of samples that we sequenced at OICR, and we can see there's this relationship between CT and genome completeness: for CT less than 30 we're typically getting nearly complete or complete genomes, while for CT greater than 30 there's a lot less viral material and we see completeness drop off quite a lot. So we set a threshold that we wouldn't accept for sequencing any samples with CT greater than 30.
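The completeness calculation itself is a one-liner. Here's a minimal sketch matching the seven-out-of-ten example above; note that this toy version counts IUPAC ambiguity codes against completeness too, which a real pipeline might handle separately:

```python
def genome_completeness(consensus):
    """Fraction of consensus positions that are unambiguous A/C/G/T.

    Ns and other ambiguity codes both count against completeness here.
    """
    called = sum(1 for base in consensus.upper() if base in "ACGT")
    return called / len(consensus)
```

For instance, a toy consensus of "ACGTACGNNN" gives a completeness of 0.7, so it would fail a 90% threshold.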
That CT 30 cutoff was obviously decided after this plot was generated, but it increased the efficiency of our sequencing and avoided wasting money on samples that were unlikely to give a usable result. This is a heat map of about 100 samples showing the coverage for all the different ARTIC amplicons, ranging from amplicon 1 up to amplicon 98. The only thing I want to point out here is that the efficiency of different amplicons varies: blue is very high coverage for that amplicon, red is very low coverage. This light banding pattern here indicates that these three amplicons had lower coverage in pretty much every sample on the sequencing run, and these amplicons were very well known to be the most problematic for ARTIC. Part of the reason that we needed to sequence so deeply is that these three amplicons, and some others, underperformed the rest of them, even after primer rebalancing. Okay, so let's talk about sequencing accuracy. Now we want to assess whether our genome consensus sequences are accurate, and the main way that we do that is to look at the pattern of SNPs present across all samples on an individual sequencing run. We call this a tree-SNPs plot because we built a phylogenetic tree, on the left, from all of our consensus sequences and ordered the sequences based on their placement in that tree. On the right, we've plotted every position of the genome with a colored box if there's a SNP at that position with respect to the reference genome, which is plotted down here at the very bottom. Here we can see a lot of sharing of the variation across the samples on this little demonstration sequencing run. For example, this cluster of samples here, it looks like about 10 samples, all share a T SNP at this position, and SNPs at this one, this one, this one, and this one.
With the exception of this sample at the bottom, which didn't have a handful of SNPs at these columns. We can also look at, say, this cluster here, which is larger and is defined by this three-base substitution near the end of the genome; even within this cluster there's some variation, where this subset of samples all shares these five SNPs. So this gives you a graphical overview of the genomes that were present in your sequencing run, and we can start to see samples that may have had sequencing problems: these gray boxes are where there are Ns in the genome, and the black boxes are where there are IUPAC ambiguity codes that look like mixed bases, and we're going to come back to that in a little bit. One of the main QC checks that we performed was counting the number of mixed or ambiguous bases as a function of the qPCR CT value, and just like genome completeness, we see this relationship between CT and the number of ambiguous bases in our genome. We never really nailed down exactly what the cause is, but it's probably due to PCR off of very low template material, very little viral material, where PCR errors would then get amplified up to be a dominant or nearly dominant fraction of the bases and start to appear as mixed bases; this problem would be more prevalent in these high-CT genomes where there's just a lot less template. There are also some theories that RNA editing or RT errors contribute when you have very low template material. One of the main QC criteria was that we would fail any samples that had five or more ambiguous positions, for this reason that they probably had unreliable amplification. So here we would just count up the number of ambiguous positions in the genome, and if it was five or more, we would discard that sample. Now finally, I'm just going to touch on how we can detect contamination.
It was a mandatory requirement of CanCOGeN that every sequencing run had negative controls, and what we do is just assess the amount of coverage across our negative control. Here we're looking at the coverage plot of one of our negative controls, and we see very little coverage, which is what we want. Here's another negative control where we see these towers of coverage corresponding to different ARTIC amplicons. This is definitely not what we want; it says that this sequencing run was contaminated, and we'd recommend it be discarded and reprocessed. You can try to mask the corresponding regions of the genome with Ns, but typically we would recommend discarding those runs. Finally, we can look at the pattern of mixed bases to try to infer whether there was a mix-up between two different samples on our sequencing run. This is a little bit complicated. Basically, you look for these ambiguous positions, which are denoted by these black boxes, and see whether there's a pair of samples on the run that explains all of the positions that are ambiguous in one sample. For example, here sample C is ambiguous at this position, this position, this one, this one, and this one, and that could be explained by mixing data from sample B, which is reference at this position, and sample A, which has a mutation at this position. And all of these positions follow the same pattern, where B is reference and A is mutated, or A is reference and B is mutated. So there are various scripts for detecting these situations, where the pattern of ambiguous positions can be explained by some combination of other samples on the run.
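A toy version of that check, assuming each ambiguous call is a two-base IUPAC code, might look like the following. The requirement that every ambiguous position be explained is a simplification of what a real contamination screen (like the CanCOGeN scripts mentioned above) would do:

```python
# Two-base IUPAC ambiguity codes, e.g. R means "A or G"
IUPAC2 = {frozenset("AG"): "R", frozenset("CT"): "Y",
          frozenset("CG"): "S", frozenset("AT"): "W",
          frozenset("GT"): "K", frozenset("AC"): "M"}

def explained_by_mixture(mixed, parent_a, parent_b):
    """True if every ambiguous base in `mixed` encodes exactly the
    pair of bases that candidate parents A and B carry there."""
    for m, a, b in zip(mixed, parent_a, parent_b):
        if m in "ACGTN":
            continue  # unambiguous or masked position: nothing to explain
        if IUPAC2.get(frozenset((a, b))) != m:
            return False
    return True
```

For example, a consensus of `ARGYT` is consistent with a mixture of `AAGCT` and `AGGTT`, since R encodes A/G and Y encodes C/T; you would then go check whether those two parents sat near sample C on the plate.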