All right, so in the morning lecture I talked about there being two main styles of analysis. One is based on mapping reads to a reference genome, and the other is when you don't have a reference genome — or for some reason don't want to use the reference genome — and instead perform a genome assembly, which we call a de novo assembly, because you're trying to construct the genome from scratch without using anything but the reads themselves. So again, credit to Ben and Aaron. This is going to be maybe 45 minutes, depending on questions, about the theory of how we assemble genomes, and after that we're going to have a chance to run different assembly software, for short reads like Illumina reads and then for some long read datasets as well.

Now, you might think: why are we teaching genome assembly in a cancer genomics workshop? It's a good question. But a lot of bioinformatics analyses are starting to incorporate genome assembly in some way or another. So even if you're not constructing a reference genome from scratch, a lot of the modern indel and structural variation callers have a genome assembly step built in, which we call a local assembly: they'll take the reads that might indicate some type of structural variation and do a little assembly of just those reads in place. So it's a good idea to understand what we're talking about when we say genome assembly, and what some of the key algorithms and key ideas are for reconstructing genomes.

Okay — yes, you have a question?

Right, so Francis's question — I think the essence of it is: when would we want to use genome assembly for analyzing cancer genomes versus traditional reference-based methods where we map reads, like we did in the morning lecture? For simple somatic substitutions, where it's just one base that's different, reference-based alignment is almost always going to work better. People really got interested in the idea of using genome assembly for cancer variant calling because of the limitations of finding large insertions and deletions by aligning reads to a reference genome. So there's a fairly popular tool called Scalpel, which uses a local de novo assembly step to try to improve the accuracy and sensitivity of finding these larger types of variation. And the general principle, the way we think about this, is that the more messed up — as Francis said — the higher the level of difference between the sample you have and your reference genome, the better it's going to be to do a de novo assembly. A lot of the methods are still in development, and when we talk about long reads maybe I can touch on this as well, but the direction that the computational genomics field is moving in, which is what my group works on, is trying to take this assembly-first approach: we do a genome assembly and then analyze the assembled genomes, rather than comparing reads to the reference genome.

Yeah — you don't really know ahead of time whether this is a tumor with a lot of structural variation versus point mutations. The common bioinformatics practice is to not just run one variant caller on your sample, but run many different variant callers, compare them to each other, and see whether the results are consistent across the different variant callers. I was involved in a project called the Pan-Cancer Analysis of Whole Genomes; for that project we ran four different variant callers for each class of mutation —
somatic substitutions, insertions and deletions, and structural variation — and then we made a consensus call set out of the four different approaches, to try to balance the strengths and weaknesses of alignment-based versus assembly-based methods, and so on.

All right, so what is genome assembly? What do we mean when we talk about genome assembly? Again, I'm going to come back to my cartoon genome here, where there are three repeats shown in red. When we sequence our genome, we randomly fragment the genome into many, many pieces and we put those pieces onto our DNA sequencing instrument, which determines the sequence of each one of those individual fragments. Now conceptually, the genome assembly process is really quite simple to think about: we're trying to reverse this process, take our sequenced fragments, and reconstruct our genome from scratch. So we're just trying to use our individual sequencing reads, and how they relate to each other — like how they overlap — to infer what the sequence of our genome is. The analogy I like to use is: imagine I went down to the newsstand, bought a hundred copies of today's newspaper, put them through a paper shredder, dumped them in a big pile in the middle of this room, and then had you all look at how the different pieces of that shredded newspaper share similar sentences and try to reconstruct the newspaper. I'm not going to do that, but we're going to use software to do something very similar: take sequencing data and try to assemble it.

All right, so the lecture is going to have three parts. First I'm going to talk at a high level about how assemblers work and some of the algorithms that go into them. Then I'm going to talk about the differences between assembly algorithms for short and long reads. And finally I'll talk very briefly about some of the features of genomes that make assembly a difficult problem, which really highlights why we can't just get end-to-end reconstructions of genomes, which is what we would really like to do. Yeah?

No — for any sort of targeted sequencing you typically can't do genome assembly; you'd use your reference, exactly. You might do a little bit of local assembly within that region. Like some of the variant callers — a popular method for calling germline SNPs and indels is the GATK HaplotypeCaller, and it does a local genome assembly internally using some of the methods we're going to talk about here today. But typically when we talk about doing a genome assembly, it's what we call whole genome shotgun sequencing, where you take the entire genome, fragment it randomly, determine the identity of each individual fragment, and then try to put them back together. All right, perfect lead-in to whole genome shotgun assembly.
Here's how we depict whole genome shotgun sequencing. We have our input genome, shown in light red here. We take many copies of that genome, fragment those copies into smaller individual pieces, determine the identity of those pieces — which we call sequence reads — and then we want to reconstruct this input sequence from those fragments. Now, if we knew the ordering of these fragments from left to right — that these three fragments were all taken from the beginning of the genome, this fragment from the fifth position, this fragment from the ninth position, and so on — we could just line them up and read off the bases from the columns of this alignment, and that would give us back our genome sequence. It would be conceptually quite straightforward to assemble a genome if we knew the ordering of the reads. Unfortunately, we don't: the sequencer just outputs individual sequence fragments, and we don't know how those reads relate to each other. So we need to design computational algorithms to infer the ordering of those reads along our input source genome.

Now, a key term we use when we talk about genome assembly is coverage. We touched on this in the morning when we were looking at pileups of bases. The coverage at an individual base is how many reads are contributing to that pileup. So for example, the coverage at this C here is six, as we have six reads crossing this position. Typically for genome assembly we aim for coverage of around 30x, which means that on average each base of our input genome is present in about 30 reads. The reason coverage is so important is that with higher coverage the reads overlap more, and with longer overlaps it's easier to assemble your target genome.

Right, so the basic principle behind assembly is that we can compare the reads to each other to look for similarity between pairs of reads. The more similar two reads are, the more likely they are to have come from the same position on your input genome. So here's an example of two reads where the start of this read, which begins TATCT, matches the end of this read, which also has this TATCT stretch, with one small individual difference in here, which is fine. We call these overlapping reads, and an overlap is where the end of one read matches the beginning of another read. A question? Yes — so in your pileup, the number of bases you have in the pileup is your read depth at that position. So here the pileup is all Cs; it contains six bases, so our depth at that position is six.
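As a small aside, here's a minimal Python sketch of the two coverage ideas just described: the depth at a single position, counted from the reads that span it, and the average genome-wide coverage, estimated as total sequenced bases divided by genome size. The read intervals and numbers here are made up for illustration — this is not how any particular tool computes it.

```python
# A minimal sketch (not any particular tool's code) of depth and coverage.
def depth_at(position, read_intervals):
    """Count how many reads span a given 0-based position."""
    return sum(1 for start, end in read_intervals if start <= position < end)

def average_coverage(total_sequenced_bases, genome_size):
    """Average coverage = total bases sequenced / size of the target genome."""
    return total_sequenced_bases / genome_size

# Toy example: six reads overlapping position 10, like the pileup on the slide.
reads = [(3, 11), (5, 13), (6, 14), (8, 16), (9, 17), (10, 18)]
print(depth_at(10, reads))            # 6
print(average_coverage(100e9, 3e9))   # ~33.3x for 100 Gb of reads on a 3 Gb genome
```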
I think we have somebody else — you were saying? Yeah, that's a good question. Usually the way we calculate coverage is that we sum up the total number of bases in our sequencing data — so an Illumina sequencing run is maybe a hundred gigabases — and then we divide that by the size of our reference genome, which for the human genome is about three billion bases. So that would be coverage of about 33.3 repeating. It's an average measure across the entire genome. When we're talking about an individual pileup, like here, we would say the coverage is six. If one of these reads had a deletion in it, we would probably still say the coverage is six; there was a read crossing the position, but no base observed for that nucleotide.

Yeah — we can talk about genome coverage, where we would divide by three billion bases; if we wanted to talk about per-haplotype coverage we would divide by six billion bases. So if you sequence a human genome to 30x, you'll have 15x coming from the paternal set of chromosomes and 15x from the maternal set. Often in assembly, as I'll describe later, we try to ignore the allelic differences between chromosomes for diploid genomes, as they complicate things. But with long reads we're starting to look at things like phasing simultaneously with genome assembly, where this sort of consideration matters. All right, any more questions about coverage or read overlaps before we move on? These are all good questions — good discussion.

So what we're looking for are these overlapping reads, where the end of one read matches the beginning of another, as high levels of sequence similarity between overlapping reads might indicate that they come from the same position of our genome, and then we can just merge the sequences together to do a little assembly of that region. If we do this for all reads, we can come up with a reconstruction of our genome.

Now, we haven't talked about long read sequencing very much, so maybe I'll pause here to describe the main types of long read sequencers. Who's familiar with PacBio or Oxford Nanopore sequencing — has anybody looked at PacBio or Oxford Nanopore data before? No? Nobody's had long read sequencing. Right, so long read sequencing became available around, let's say, 2013, when the PacBio RS sequencer came out. These instruments sequence single molecules of DNA, instead of having an amplification step like Illumina, where you amplify a cluster into a colony of clones. They sequence individual fragments of DNA, and this allows them to sequence really, really long fragments — up to 10,000 bases in length. Remember the problems that repeats give us: a lot of the repeats in human genomes are between 500 and a few thousand bases, so if you have a read in excess of 10,000 bases you can cross over, or span, those repeats. One of the drawbacks of long read sequencing is that because it works off single molecules, the signal the base caller uses tends to be much weaker, so the error rate is a lot higher — around five to fifteen percent, as opposed to around 0.1 percent for Illumina sequencing. So the algorithms we use for assembling long reads are very different from the algorithms we use for short reads, because the read length is longer and the error rate is so much higher. All of the software we used this morning, like BWA-MEM, is optimized for short reads.
When we use long reads in the lab section for this module, we'll be using an entirely different set of software, which we'll describe later.

All right. So the key computational challenge for long reads is overcoming the high error rate. The key computational challenge for short reads is efficiently assembling the extremely large numbers of short reads that Illumina sequencing generates. The reason I'm describing this is that I'm going to walk through an assembly pipeline for long reads and then contrast it with an assembly pipeline for short reads, which uses very different methods. Any questions about long read sequencing technology before I go on? Everyone generally happy with that? It's just longer reads — pretty easy to get your head around.

All right, so here's a long read assembly pipeline. We have reads at the beginning; we then construct what we call an overlap graph, where we find pairs of reads that have similar starts and ends. We then process the graph in a step called the layout step, where we try to find an ordering of reads through the graph that reconstructs the genome. We then call a consensus sequence to pick the most likely nucleotide for each base of our pileup, and we output those as contigs. I'll describe each one of these steps individually.

So, the overlap step. We take each one of our reads and compare it to every other read, looking for regions where the end of our read matches the beginning of another read. We construct a graph where each vertex, or node, represents a read, and we draw an arc from one read to another if they share an overlap. So this read shares an overlap with this read, and we've drawn an arc between them, showing that we could travel from this read to this read to make a walk along our genome. Now, if we do this for every read, we end up with a graph that looks like this. This is an example overlap graph for seven base pair reads, where we require a three base pair overlap to link them up with an edge. And we can see this graph isn't so complicated: we can make a walk through the graph which reconstructs part of our genome sequence. Each edge is labeled with the length of the overlap, so the last five bases of this read, ATTAT, match the first five bases of this read, ATTAT. This is the representation your genome assembly software is going to use to try to reconstruct the genome.

All right, so that's the overlap stage. Next, the assembler is going to compute the layout, where it tries to bundle stretches of the overlap graph together into contigs. Now, I think it's easier to understand genome assembly if we move away from As, Cs, Gs and Ts for a minute and just use a fragment from a song. The song says "to everything turn, turn, turn, there is a season". The reason we selected this fragment is that the word "turn" is going to stand in for a genomic repeat: it's present in multiple copies, and it allows us to visualize what happens to an overlap graph when there are repeats in it.
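Before looking at what repeats do to that graph, here's a minimal sketch of the overlap step itself: finding reads whose suffix matches another read's prefix and recording those matches as directed edges. Real overlappers index the reads and use dynamic programming to tolerate sequencing errors; this exact-match version, with made-up toy reads and a minimum overlap of three bases, is purely illustrative.

```python
# A minimal sketch of suffix-prefix overlap detection and overlap graph construction.
def suffix_prefix_overlap(a, b, min_overlap=3):
    """Length of the longest suffix of `a` equal to a prefix of `b` (0 if too short)."""
    for length in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0

def build_overlap_graph(reads, min_overlap=3):
    """Return {(i, j): overlap_length} for every ordered pair of reads that overlap."""
    edges = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                olap = suffix_prefix_overlap(a, b, min_overlap)
                if olap:
                    edges[(i, j)] = olap
    return edges

reads = ["TTATCTG", "ATCTGGA", "CTGGATT"]   # toy 7 bp reads
print(build_overlap_graph(reads))           # {(0, 1): 5, (0, 2): 3, (1, 2): 5}
```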
So here's the overlap graph that we constructed using the same technique as before, where we just break the song up into seven-character fragments which overlap by three characters — and it's pretty complicated. Just looking at this, it's not apparent what that fragment of the song might be.

Yeah — the question is whether you could use something like a context-free grammar to process this. It would be similar to that sort of technique, in that we're representing the relationships between these sequences as a graph, but the techniques we use to construct this graph aren't really traditional parsers for context-free grammars. They're string matching algorithms: we generate an index of our data, search that index to look for reads that might share sequence, and then use dynamic programming to determine whether they really share a similarity or not. So it's a similar class of algorithms, but we don't really know the structure of the strings, so we need to be more permissive in what we allow into the graph, if that makes sense.

All right, so we've arrived at this assembly graph, and what the assembler needs to do is clean it up to make the structure of the genome more apparent. The first step is to look for what we call transitive edges. A transitive edge is an edge that bypasses a node, like this one. The genome assembler considers these edges redundant, as the path spelled by this sequence is exactly the same as the path spelled by this sequence — so the green edges can all be removed without changing the possible reconstructions of the genome from this graph. So we remove these edges that bypass one node, and we get a graph that looks like this. It's a lot cleaner, and the structure of the genome is starting to become apparent. We can go a step further and remove edges that skip two nodes, like this, and now we have this really nice graph, with a linear chain of nodes here and a linear chain of nodes here, and something complicated going on in the middle.

Now, the assembler can't resolve this genome any further, so it takes these linear stretches, assembles them by merging the reads, and outputs a contig for each of these linear, unbranched stretches. What we get is one contig which says "to everything turn", another contig, in purple here, which is "turn there is a season", and then this part in the red box, which is an unresolvable repeat. Using these little seven-character fragments, we don't know how many times the word "turn" should appear in the reconstruction. So what the assembler has done is build the graph, find this looping structure which represents a repeat, and, since it can't resolve how many times to traverse that structure, it just gives up and outputs the linear segments. And that's exactly how the genome assemblers you'll run on sequencing data after this lecture operate: they build a graph, perform these transformations to make it simpler, and then output the unambiguously assemblable sequences. Any questions about that so far? This is really the fundamental step in the assembler: we build this graph, clean it up, and output contigs.
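Here's a minimal sketch of the transitive-edge removal step just described: an edge that shortcuts past a single node is redundant, because the longer walk spells the same sequence. Real assemblers implement this more carefully and also handle longer shortcuts, like the two-node case on the slide; the tiny graph below is made up for illustration.

```python
# A minimal sketch of removing transitive edges that skip a single node.
def remove_transitive_edges(edges):
    """edges: set of (u, v) pairs. Remove shortcuts u -> v where u -> w -> v also exists."""
    out = {}
    for u, v in edges:
        out.setdefault(u, set()).add(v)
    redundant = set()
    for u, v in edges:
        # u -> v is redundant if some w exists with u -> w and w -> v
        for w in out.get(u, ()):
            if w != v and v in out.get(w, ()):
                redundant.add((u, v))
                break
    return set(edges) - redundant

g = {("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")}
print(sorted(remove_transitive_edges(g)))   # [('A', 'B'), ('B', 'C'), ('C', 'D')]
```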
All right, so the third step in the overlap layout consensus assembly pipeline for long reads is picking the final sequence of our contigs. This is really just looking at each base of a pileup. We take all the reads that are assembled into one contig, we look at the pileup at each position, and we pick the most frequently observed base. So if we look here, there's a sequencing error: four reads show a C at this position, one read shows a T, so we output a C, because it's the most frequently observed base in that pileup. And we just do that column by column, all the way along the pileup, to output our final genome assembly sequence. Because we're treating each read as if it has a vote for the final sequence, and we take the majority vote, we call this the consensus algorithm: a majority vote across all of the reads for each base of our contig, giving our final genome assembly.

Each read is going to be from the same genome, ideally. You take a large collection of cells, extracted from, let's say, a blood sample of one individual, and you extract DNA from those cells; they should all have the same genome, so each read is an independent observation of the same genome. Traditional genome assemblers, up to maybe five years ago, tried to ignore anything like heterozygosity or low-frequency subclonal somatic mutations, and would just output the majority base. Modern genome assemblers try to preserve those distinctions: instead of outputting a single majority base, they'll try to say, you know, 80% of the reads were C here and 20% were T, or whatever it may be — they try to represent as much of the difference as possible. They'll also try to phase individual haplotypes, if you have long enough reads.
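Here's a minimal sketch of that consensus step — a plain majority vote over each pileup column. Real consensus callers also use base qualities and more sophisticated models; the toy pileup here is made up.

```python
# A minimal sketch of majority-vote consensus over pileup columns.
from collections import Counter

def consensus(pileup_columns):
    """pileup_columns: list of strings, each string = the bases observed at one position."""
    return "".join(Counter(col).most_common(1)[0][0] for col in pileup_columns)

# Toy pileup: five positions; the second column has one sequencing error (a T among Cs).
columns = ["AAAAA", "CCCTC", "GGGGG", "TTTTT", "AAAA"]
print(consensus(columns))   # "ACGTA": the lone T in column 2 is outvoted by the Cs
```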
All right. So this method of assembly, which we call the overlap layout consensus paradigm, was developed for Sanger sequencing many decades ago. When Illumina sequencing first became available, people tried to apply these overlap-based methods to short read data and, long story short, they basically failed miserably. The number of reads, and the number of overlapping reads you get from an Illumina run, caused the assembly graphs to be far too large — they would have hundreds of billions, if not a trillion, nodes — and they basically exhaust the memory available for the genome assembly, and that's even if you can compute the overlaps themselves. So for short reads we had to develop entirely new classes of algorithms to deal with the volumes of data that Illumina sequencing produces, and what I worked on at the Genome Sciences Centre in Vancouver, and for my PhD at the Sanger Institute, was assembly algorithms for short read sequencing.

Right, so here's an overview of a short read assembly pipeline. We first introduce a new step, which we call error correction, where we take the reads, try to identify sequencing errors, and fix them. We then build an assembly graph. We then perform what we call a graph cleaning step, where we remove artifacts from that graph. We then build contigs, and finally we try to scaffold the contigs together using paired-end information.

So the reason we need to do error correction — or why it's a good idea — is that we want to overcome the error profile of Illumina sequencing. I mentioned when I introduced Illumina sequencing this morning that the error rate increases towards the 3' end of the read, and this is exactly the plot I wanted to show you. These are six different whole genome sequencing datasets that were part of an assembly project called the Assemblathon — I'll come back to that a little later. We had a small yeast genome, a Lake Malawi cichlid (a fish), a boa constrictor snake, a human genome, a parakeet, and — not part of the Assemblathon, but an interesting dataset — this oyster dataset. What we're plotting here is the average error rate for these six datasets as a function of where in the read the bases are. We see that at the very beginning of the reads the error rates are really quite low, below 0.005, or one error in two hundred bases. But for all of the datasets, as you get towards the 3' end of the read — particularly, say, for this fish dataset — we see the error rate spiking. That's because of the phasing problems in Illumina sequencing, where the individual molecules within a cluster get out of sync. Some datasets in this example had slightly longer short reads, up to maybe 150 bases; the snake dataset was about 120 bases and had a pretty good error profile. So you can see there's variation in error profile across datasets and read lengths, but the general trend is the same: the error rate increases towards the 3' end.

So are you asking why we see this increase? Yeah — I'm not quite sure why the curve has this shape. It looks like some sort of exponential process, where more and more molecules get out of sync and that compounds, so the error rate skyrockets towards the end. But some of them, like this snake dataset in red, are not so drastic. So there are differences between samples that cause it to rise more sharply or more gradually, and I'm not really sure exactly what drives that.

Yeah, definitely — if you read off the end of the DNA fragment, which I think is possible, that's going to cause all sorts of weird artifacts, like what we're seeing where the error rate spikes. There are adapters as well, so you can read into the adapter sequences. But yeah, if you overshoot all of that, I definitely think weird things would happen.

Did you have a question? The general trend of the error rate increasing at the 3' end is universal across basically all Illumina datasets; whether all snake genomes sequence slightly better than all bird genomes — I wouldn't expect that. Error rate does depend on things like GC content, and some organisms are harder to sequence than others. A classic one that's really difficult to sequence is Plasmodium falciparum, the malaria parasite: it's about 80% AT, which causes a lot of problems for short read sequencing because of the amplification steps. But I think the snake genome in this case was actually sequenced at Illumina, where the techs are very, very good and have tons of experience.
So that might be why it's a little bit lower here. You're saying some of this is not just genome-to-genome variation, but sequencing center to sequencing center? Right, right — I think a lot of it is driven by the sequencing center. There were multiple sequencing runs that went into each dataset, so some will be averaged over multiple runs and some might be from a single run. This data is also really quite old now — I think it was generated in probably 2013 — so a modern Illumina NovaSeq run is probably a little bit better, if not a lot better, than this. All right, all good questions. Do we have any others? I thought maybe I saw some other hands somewhere. All right, so let's move on then.

So, error correction. What we want to do is overcome this error rate and correct as much of the read as possible. The way error correction works is that we use a technique called k-mer counting, where we take a fixed-length substring of our reads — a k-mer — and count how many times that substring occurs across our entire dataset. The reason we do this is that for Illumina data we expect most of the read to be error-free, and if we take a small enough k-mer size, it's likely that the k-mer is error-free and seen many, many times across the dataset. So by counting the number of times a k-mer occurs we can classify whether it contains a sequencing error or is perfectly error-free. In this picture I've introduced a sequencing error, which is this red C here. If we count these short k-mers — this is a 21 base pair k-mer that contains no errors — we're probably going to see it about 40 times, depending on our sequencing coverage. If we count the k-mer that contains the sequencing error: sequencing errors take a true genomic k-mer and flip it to a k-mer that's probably only seen a single time across the entire dataset. So if we count how many times this k-mer occurs, it's probably only going to be seen once across the entire read set. So just by counting how many times a short substring occurs, we can identify whether a k-mer contains an error or not. Once we've identified these k-mers that are only present once across the dataset, we then search for replacements — like switching this C to perhaps a G — that would change a low-frequency k-mer into a high-frequency k-mer.

Now, this was a very popular bioinformatics algorithm to write into software; I've only listed seven here, but there are probably dozens of different error correctors based on this principle of counting k-mers. Quake was one of the first; SGA has a k-mer corrector — SGA is a program I wrote as a PhD student; SOAPdenovo, BFC, BLESS, Lighter and Musket are all k-mer based error correctors. They can take just a set of sequencing reads, without a reference genome, and fix the majority of the sequencing errors in your reads.

Now, once we've corrected errors, just like in an overlap layout consensus assembly, we need to construct an assembly graph. The assembly graphs we use for short read assembly are very different from those for long read assembly, in that we again use the idea of breaking up our reads into short segments called k-mers, this time to construct what we call a de Bruijn graph.
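Before we get to the de Bruijn graph itself, here's a minimal sketch of the k-mer counting idea behind error correction: count every k-mer in the read set and flag the ones seen only once (or below some threshold) as likely errors. The threshold, the value of k, and the toy reads are made up; real correctors like the ones listed above also search for the base change that rescues a low-frequency k-mer.

```python
# A minimal sketch of k-mer counting for error detection.
from collections import Counter

def count_kmers(reads, k):
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def likely_error_kmers(counts, min_count=2):
    """k-mers observed fewer than `min_count` times are probably created by errors."""
    return {kmer for kmer, c in counts.items() if c < min_count}

reads = ["ACGTACGT", "ACGTACGT", "ACGTACGT", "ACGTTCGT"]   # last read has one error (T)
counts = count_kmers(reads, k=4)
print(likely_error_kmers(counts))   # the k-mers overlapping the erroneous T appear once
```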
Who's heard of the term de Bruijn graph before? A few people — a few people in the back. So the de Bruijn graph approach to assembly was developed by a computer scientist named Pavel Pevzner, who introduced it in 2001 as a way of more efficiently assembling Sanger sequencing data. The technique didn't really get widely adopted until we had these very high volumes of short reads, and then computer scientists like myself and Pavel realized that you could do genome assembly of short reads much, much faster than with overlap layout consensus assembly. So basically the entire field of short read assembly adopted this idea of breaking up your reads into k-mers and constructing de Bruijn graphs, to vastly accelerate the assembly of short reads.

The way de Bruijn graph assembly works is that you take your reads and pick a k-mer size — in this case we'll use k = 4 — slide a window of that size over each read, and add every k-mer you've seen into the graph as a vertex. So the first k-mer, the first four base pair substring of this read, is CCGT; we add it to the graph as a vertex here. The next k-mer is CGTT, which we add as a vertex here, then GTTA, which we add as a vertex here. Then we go over to this read and add TTAC to the graph, and so on, until every distinct k-mer found in our read set is present as a vertex in the graph. What we then do is make another pass over our reads and look for k-mers that are adjacent in a read, and link them up with an edge. So in this read the first k-mer, CCGT, is followed by the k-mer CGTT, so we add an edge between this node and this node, and we do that for all k-mers in our reads. We end up with this graph structure.

Already this graph is a lot cleaner than what we saw for overlap layout consensus assembly. When we constructed that graph, there were hundreds of nodes and thousands of edges, even for very small examples, whereas for this de Bruijn graph we have this really nice structure where we can just read off the sequence of the genome by following the graph along this path, going back around to this node a second time, and then following this path. There's one unique traversal, which spells out the genome for this simple toy example. Now, the reason I put the CGTT node in red is that it's standing in for a genomic repeat: it's present twice, which is why we have two edges coming in and two edges going out. And the main algorithmic feature of the de Bruijn graph is this very compact, efficient representation of repeats, where we just have a few edges coming in and a few edges going out. That's what makes it so fast, and why we've chosen to use it for short read assembly. Right — any questions on the de Bruijn graph before we go on?
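Here's a minimal sketch of the construction just described: slide a window of size k over each read, add every k-mer as a node, then link k-mers that are adjacent within a read. The toy reads reproduce the CCGT / CGTT / GTTA / TTAC example; real de Bruijn graph assemblers use far more compact data structures, but the idea is the same.

```python
# A minimal sketch of de Bruijn graph construction from reads.
def de_bruijn_graph(reads, k):
    nodes = set()
    edges = set()
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        nodes.update(kmers)                  # first pass: every distinct k-mer is a node
        for a, b in zip(kmers, kmers[1:]):   # second pass idea: link adjacent k-mers
            edges.add((a, b))
    return nodes, edges

reads = ["CCGTTA", "GTTAC"]
nodes, edges = de_bruijn_graph(reads, k=4)
print(sorted(nodes))   # ['CCGT', 'CGTT', 'GTTA', 'TTAC']
print(sorted(edges))   # [('CCGT', 'CGTT'), ('CGTT', 'GTTA'), ('GTTA', 'TTAC')]
```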
Why do we have multiple edges leaving this one here? Because for each k-mer there are only four possible k-mers that can come before it and four possible k-mers that can come after it, so you can only have four incoming edges and four outgoing edges for every node in your de Bruijn graph. That constrains it to at most four edges in and out, which is why it's so compact for representing repeats. It's not that every individual read contributes its own copy of those k-mers — it's just whether this k-mer is followed by this k-mer in any read. And we have that here because here's one CGTT: once it's followed by an A, which gives us this branch of the graph, and once it's followed by a C, which gives us this branch, and it can only ever be followed by an A, C, G or T, so there can only be four possible edges coming out.

It doesn't matter — it doesn't contribute anything new to the graph. You're only recording whether this k-mer is followed by this k-mer in any read; whether that's there once or a hundred times doesn't matter. What you'll usually do is put a count of the number of times it's been observed on the edge, and then you can see, okay, there were a hundred reads with evidence for this edge, or only one read with evidence for it.

Sorry — no, it's not going to try to assemble that into a hundred copies in the genome, because we've sequenced the genome at 30x or 100x coverage. You don't want as many copies in your genome as you observed in the reads, because your reads are over-sampling the genome: for any position in the genome we have 30 reads or 100 reads, depending on how much sequencing you did. So a unique stretch of your genome should have coverage of around 30 or 100. Does that make sense? The edge count is the number of times you observed it; it's an indication of how many times that sequence is present in the genome, but it's not exact, so you can't say that if you've seen it a hundred times it must be a hundred-copy repeat. Yes, yeah.

All right, so next we want to clean up the graph. There are different artifacts that can appear in a de Bruijn graph. The first artifacts we have to clean up are called tips. If we have some residual sequencing errors, as shown by these red bases here and here, they can cause these spurs or tips off the graph — a little short branch that doesn't go anywhere and isn't connected on both sides. So this branch is created by this sequencing error, and this branch of four nodes is created by this sequencing error. What we want to do is process our graph to clean these up by removing the tips. The second type of graph artifact is caused by allelic variation. Human genomes are diploid, so if we're looking at the assembly graph of a part of the genome that has a heterozygous SNP, the graph is going to have what we call a bubble structure, where there's a divergence: one path follows one allele, let's say the C allele, and the other path follows the other parental allele, which is a G in this case. Now, traditional assemblers — five or so years ago — wanted to remove these, to give us a nice clean linear structure of the graph, which is what we want for assembling contigs.
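Here's a minimal, purely illustrative sketch of one of those cleaning steps, tip removal: find dead-end nodes, walk back a short distance, and prune the branch if it hangs off a node that has another way forward. The length cutoff and toy graph are made up; real assemblers also use coverage and more careful length rules to decide what to prune, and bubble popping works in a similar spirit (find paths that diverge and reconverge, keep one side).

```python
# A minimal sketch of tip removal in a directed assembly graph.
def remove_tips(edges, max_tip_len=2):
    """edges: set of (u, v). Returns a new edge set with short dead-end tips removed."""
    succ, pred = {}, {}
    for u, v in edges:
        succ.setdefault(u, set()).add(v)
        pred.setdefault(v, set()).add(u)
    kept = set(edges)
    dead_ends = [n for n in {v for _, v in edges} if not succ.get(n)]
    for node in dead_ends:
        path = [node]
        while len(path) <= max_tip_len and len(pred.get(path[-1], ())) == 1:
            parent = next(iter(pred[path[-1]]))
            if len(succ.get(parent, ())) > 1:          # branch point: the tip hangs off here
                for a, b in zip(path[1:] + [parent], path):
                    kept.discard((a, b))               # delete every edge along the tip
                break
            path.append(parent)
    return kept

g = {("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("B", "X")}   # B -> X is a short tip
print(sorted(remove_tips(g)))   # [('A', 'B'), ('B', 'C'), ('C', 'D'), ('D', 'E')]
```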
But more recently some of us are trying to preserve this structure — like what we talked about with the question Francis had — so we can use it as evidence that there was a heterozygous SNP, or maybe some subclonal mutation, at this position of the graph. One of my grad students, Alistair, actually works on this, using these structures in the graph to try to find variation in cancer genomes.

Right, so after we've built our assembly graph we might get something like this, and we then want to clean up these structures. First, we identify all of the ending points of the graph — nodes that only have a connection on one end. We then walk backwards until we find the point where they diverged from the graph, and we just prune those off, a process we call tip removal. We then look for the bubbles in the graph — where paths branch and then come back together — and typically we remove one half of the bubble to collapse it down to a single path, to make the structure of the genome clearer. Following that, we assemble contigs by merging together all of the unbranched segments of the graph and outputting them in a FASTA file, which is your genome assembly.

All right, finally — at this point we haven't used our paired-end information, and we talked a lot about paired-end reads in the morning. The last stage of assembly is a process called scaffolding, where we use the pairing information to try to jump over unresolved repeats. If we have one of these branching structures in the graph that the assembler can't get over, sometimes we'll have read pairs that can bypass it — using their pairing information to bridge the repeat. So for scaffolding, we align all of our reads back to our genome assembly, and then we draw an arc between two contigs when one set of reads aligns to the end of one contig and their pairs align to the start of another contig. We've colored them here either blue, red, this sort of dark purple — it's probably hard to see — or this darker green. By observing a cluster of reads that align to the end of this contig whose pairs all align to the start of this contig, we can say confidently that the two contigs are probably adjacent to each other in the final genome assembly. So we construct what we call a scaffold graph, where we take contigs and put an arc between them if they have these pairing relationships, and then we can finally join them together, filling in the gaps with these N characters, which we saw in sequencing reads before. That means we don't know the exact identity of the nucleotides in between within the scaffold; we just know that the contigs follow each other in our genome. And finally we can use a process called gap filling, which tries to do a little local assembly of these gap sequences to close them — or sometimes you can use long read sequences as well to fill in the gaps between your scaffolds.
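Here's a minimal sketch of that scaffolding idea: count read pairs whose mates land on two different contigs, and join contigs that have enough supporting pairs, writing Ns for the unknown gap. The contig names, support threshold, and fixed gap size are all made up for illustration — real scaffolders estimate gap sizes from the insert size distribution and handle contig orientation, which this sketch ignores.

```python
# A minimal sketch of pairing-based scaffolding of contigs.
from collections import Counter

def scaffold(contigs, mate_links, min_support=3, gap_size=10):
    """contigs: {name: sequence}. mate_links: list of (contig_a, contig_b) pairs, one per
    read pair whose mates suggest contig_a is followed by contig_b."""
    support = Counter(mate_links)
    scaffolds, used = [], set()
    for (a, b), n in support.most_common():
        if n >= min_support and a not in used and b not in used:
            scaffolds.append(contigs[a] + "N" * gap_size + contigs[b])
            used.update([a, b])
    scaffolds += [seq for name, seq in contigs.items() if name not in used]
    return scaffolds

contigs = {"ctg1": "ACGTACGT", "ctg2": "TTGGCCAA", "ctg3": "GATTACA"}
links = [("ctg1", "ctg2")] * 4 + [("ctg1", "ctg3")]   # 4 read pairs support ctg1 -> ctg2
print(scaffold(contigs, links))
# ['ACGTACGTNNNNNNNNNNTTGGCCAA', 'GATTACA']
```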
All right, so what can you expect from the output of your genome assembly? We're going to see this later on. If you're sequencing bacterial genomes, which are only a few megabases in size: with short reads you're probably going to get a few hundred contigs, with contig lengths of about ten thousand to a hundred thousand bases. With long reads, where the read length is maybe ten thousand bases, you'll typically assemble a bacterial genome into a few contigs that are a megabase in length, and often you can assemble the bacterial genome back into a single contig that covers your target genome from end to end. If you try to assemble a large genome like a human genome with short reads, your contigs will be around ten thousand bases in length; with long reads, maybe a megabase, even up to ten megabases now with improvements in long read genome assembly. The drawback is that traditionally long read data is more expensive: PacBio and Nanopore data are a few times more expensive than sequencing a whole genome with Illumina, which is why most people use Illumina for human genomes — you want to sequence as many genomes for your project as possible. Okay, any questions about assembly techniques before we move on to the last few minutes? Yeah.

Yeah, usually it has to be whole genome. Again, for targeted sequencing — exome sequencing is a special case of targeted sequencing — once you've gone past where your probes have pulled down the exon region, you just don't have any coverage, so your assembly will stop. So if you try to run an assembly on exome sequencing you'll get really small little exon fragments, but nothing beyond that, because you just don't have data there.

Yeah, it's a great question — picking what k to use is really the most critical part of doing a short read assembly with a de Bruijn graph. Typically it depends on both how repetitive your genome is and how much coverage you have. If you have a lot of coverage you can use a longer k; if you have less coverage you should use a shorter k. Conversely, if you have a really repetitive genome you want a longer k, so fewer of the nodes in your graph will be repetitive, whereas for a bacterial genome you can use a shorter k. There's been a lot of work on programs that automatically determine what k to use, and the program we're going to be using just after this will automatically pick the k-mer size after analyzing the coverage profile and the genome structure. It used to be that you'd run your assembler fifteen times with different k values — 30, 35, 40, all the way up to 60 or 70 — but luckily clever people figured out how to do that automatically, so you only have to run it once and it'll give you a pretty good result.

So, base quality score recalibration takes the reads aligned to the reference genome, looks for where the reads mismatch the reference, and then adjusts the quality scores to be more representative of the truth — the quality score is just an estimate of how accurate the sequence is, and base quality score recalibration re-estimates it to make it a little more accurate. With error correction, on the other hand, we're actually modifying the reads themselves: if the error corrector thinks the C at this position was a sequencing error and it finds a correction, it will modify the read, change that C to a G, and you get a new set of reads which hopefully has a lower error rate.
So usually, for a human genome, you typically want k-mer sizes around 60 or 61 — in that range, really quite large; I was showing short k-mers just for illustration. And the determining factor is how much coverage you have of the genome. What these automatic k-mer size pickers do is calculate an estimate of your genome coverage and then pick k depending on that estimate. If it thinks you have 20x data it will use a shorter k-mer, maybe 31; if it thinks you have 100x data it will use a longer k-mer, 60 or 70, something like that. When you run the programs in the tutorial part — SPAdes is the program we're going to be using — it will give you all this output saying exactly what it's doing and why it's selecting particular k-mer sizes. All right, anything else before we move on? We just have a few more minutes left. All right — gap filling, we already talked about this.

Right. So I talked about these six different datasets that were part of Assemblathon 2, and I just want to highlight one of the main findings of that project, which was that there's basically no best assembler out there. The Assemblathon was a benchmarking project where they sequenced three different species — the fish, the snake and the parakeet — and released those three species to the community of people who develop genome assembly software. I took part in this Assemblathon: they had us all run our software tools on these genomes, send the results back to them, and they scored the individual assemblies to see who was doing best. And there was basically no assembler that did great across all the different species: some were better for the snake genome, some were better for the bird genome, and the results were really quite variable.

So I just want to end on some of the features of your genome, or your data, which might make assembly more difficult. An obvious one is how repetitive your genome is. We talked about the human genome being about 50% repeats; trying to do a genome assembly of a human using short reads is really quite difficult because of that 50% repeat content. Another factor that's not recognized as much as it should be is how heterozygous the genome is. Human genomes aren't so bad, as there's a SNP around every 1,000 bases, but for one of the genomes I showed earlier, the oyster genome, there's a SNP around every 80 bases. The effect of that is that, because there's so much heterozygosity, so much allelic variation, you get these bubble structures all over the assembly graph, and it makes it very difficult for the assembler to pick out what the true path is. So when you reach really high levels of heterozygosity, it causes a lot of problems for your assembler. And of course low coverage is an obvious one.
You need your genome to be very well covered: every base of the genome has to be present in some sequencing read for it to be assembled. Another factor is if your sequencing is biased in any way. What we want is nice uniform coverage across the genome; if you have something like Plasmodium, where it's 80% AT, that introduces a lot of coverage biases, where some regions are much more difficult to sequence than others.

Right — that's a slightly different type of bias. That's like a reference bias, where you're using a reference genome that's not well matched to the sample you've sequenced — you know, if it's an African sample and you're using a European mitochondrial reference, that might introduce issues. The type of bias I'm talking about here is where some regions of your genome sequence more easily, or less easily, than others: you might have good coverage in the regions that are 50% GC but poor coverage in the regions that are 20% GC, which is what I mean by coverage that's biased by sequence composition.

Of course, if your reads have a really high error rate, the genome is going to be more difficult to assemble. If you have chimeric reads — reads from different, independent parts of the genome that have been incorrectly merged together during sequencing — that causes all sorts of problems for assemblers. The same goes for sequencing adapters in your reads, like if you've overshot your fragment size, or if your sample is contaminated in any way, or if you haven't sequenced a single individual. For a lot of small things, like maybe flies, you can't get enough DNA from a single sample, so you need to pool DNA from multiple individuals and sequence that, and that introduces a lot of heterozygosity — it becomes a population assembly problem, which is much more difficult to resolve.

There was a question at the back? Yes — so, chimeric reads: during library preparation we fragment the DNA, and we assume each fragment of DNA is from one contiguous location of the genome. What can happen, though, is that sometimes two different fragments of DNA get ligated together and then sequenced from end to end, and if that happens your read contains two distinct genomic regions — we call that a chimeric read. It looks quite a lot like a translocation to your assembler, so it causes problems for the assembly to resolve; you really don't want chimeric reads in your library.

All right, I think it's a good time to dive into the tutorial now. Just to summarize: this is an overview of how assemblers work; we have different methods for short and long read assembly; and many different factors determine whether a given assembly will be difficult or easy. If you have any questions, shoot me a message on Slack or to my work email here.