A workshop on plasmids as vehicles for antimicrobial resistance. Welcome to ICTP, even though virtually. This workshop has a history: we meant to have it just when COVID started, but now we have finally managed to have it, at least virtually. So I'll just hand over to Alice, the main driver behind this workshop. Alice.

Ah, thank you, Matteo, for everything, really, and for making this possible. I welcome everyone, and thank you for applying. I was overwhelmed by the response, really. I had wished for so many people, but I didn't expect it. Well, welcome. I have to tell you a couple of housekeeping things. All the talks will be recorded unless the speaker says otherwise, so if you would rather not be recorded, please do say so. As for the schedule, there are some modifications today with respect to the advertised schedule. In particular, Rohan Mehta cannot speak today. We will try to reschedule him later in the week, but we are still discussing when he can speak. I will let you know, and everything will be posted on the website. And yes, I think that's all. So I leave it to Zam, and I want to thank Matteo for being the local organizer and believing in this, and Zam and Fernando, the other two organizers, who believed in this and helped me a lot in figuring out how to do it. And of course we have to acknowledge the funding, which comes from the MRC GIDA Centre at Imperial and from ICTP. And that's it. Zam, over to you.

Okay. Hi, everybody. Welcome to the beginning of the week. Let me just see if I can share my screen successfully. Oh, good, it's over here. Alice, can you see that? Great, thank you. Okay. I think I should say straight away that I've been given the mission of giving you a very, very gentle introduction. This is going to be very gentle.
So any of you who are real experts in analyzing plasmid genomes or anything like that, feel free to go and get a coffee and come back in half an hour. But we're expecting quite a wide audience, so what I'm doing is just a broad introduction to what we might see in bacterial and plasmid genomes and what happens when we try to assemble them. Okay, here we go. I'm going to talk a little bit about what bacterial genomes look like and how they evolve, and a little bit about plasmid genomes. I am going to talk about how genome assembly doesn't work. I'm going to do this in a very, very high-level way. I'm not going to talk about any algorithms or that kind of stuff, because we don't have time and that's not really the mission of this talk. I'll talk about how long reads make things easier, and very briefly about what kinds of things might go wrong.

Okay, so if we start with bacterial genomes: bacterial genomes are maybe a few megabases in size. They're haploid, so they've only got one copy of the genome, from the parent. And genes cover something like 85 to 90 percent of a bacterial genome, so they're quite gene-rich compared to, say, humans. So you can actually use shared gene content to tell you both how many genes are shared and how much genome is shared, because most of the genome is covered by genes. And what does that mean? Well, roughly speaking, the average shared sequence between two bacterial genomes can be quite low, even within a single species. Two E. coli may share, and it's quite a big window, between 30 and 70 percent of their genes. That really depends on how those two samples have been selected. And that was quite a shock when it was first seen. For comparison, if I asked what proportion of the genome was shared between me and one of you, it would be over 99 percent. So genomes are structurally basically the same within humans, compared to bacteria, where even within a species huge amounts of DNA can be discarded and added.
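To make the idea of shared gene content concrete, here is a minimal sketch. It assumes each genome has already been reduced to a set of gene-family names, and it uses the Jaccard index (shared genes divided by all genes seen in either genome) as one possible definition of "proportion shared"; the talk does not say which definition lies behind the 30 to 70 percent figure. The gene lists are invented for illustration.

```python
def shared_gene_fraction(genes_a, genes_b):
    """Jaccard index: genes present in both genomes / genes present in either."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    if not union:
        return 0.0  # two empty gene sets share nothing by convention
    return len(a & b) / len(union)


# Hypothetical gene content for two isolates of the same species.
genome_1 = {"gyrA", "recA", "rpoB", "dnaA", "blaKPC", "tetA"}
genome_2 = {"gyrA", "recA", "rpoB", "dnaA", "mcr1", "fimH", "papG", "hlyA"}

fraction = shared_gene_fraction(genome_1, genome_2)
print(f"shared gene fraction: {fraction:.2f}")
```

With these made-up sets, four genes are common to both and ten are seen in total, so the two genomes share 40 percent of their genes under this definition.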
So let's look at what that means for gene sharing across genomes within a species. I hope my window showing all of you is not covering up the picture. There we go. What I'm plotting here, in a set of 10 genomes for six different species, is the frequency of the different genes within them, with that frequency on the x-axis. In all of these species, most of the genes are either rare, on the left, so they're only there in one sample, or they're very common, so they're there in all of them. And there are relatively few genes in the middle, which are there in 50 percent of your genomes. What that means is, if you want to compare a lot of genomes of a given species, there are some genes, in other words some proportion of the genome, which are shared by everybody. But there's also a huge amount of stuff over here on the left which is really rare. And those things on the left-hand side are typically things that are transiently in the population. They may be brought in by mobile elements. So they arrive in one particular genome, and they don't hang around in the population very long. They may drop out again, and there's some continual turnover. And if they do add some selective advantage, then you would expect to see them move up in frequency over time.

So how do genetic changes occur? Well, they can occur in a couple of ways. They might occur intrinsically; in other words, they might happen during the process of replication. During replication you might get a single base change, a SNP, and that would be inherited by the children. We talk about this being vertical inheritance. Or you can have genetic changes which occur through contact with unrelated individuals.
So DNA is transferred from one cell to another, unrelated cell, and then it is either incorporated into the genome, into the chromosome, or it becomes a plasmid or something else which is sort of hitchhiking along. And because of the way that these things move, they can arrive and disappear in blocks. This is a picture from a paper by Eduardo Rocha and Marie Touchon et al. Each of these rows is a chunk of genome, all taken from the same place in the E. coli genome. It's what we call an insertion hotspot. And each of these coloured blocks is a different set of genes. So you can see, and I hope you can see my mouse, shared blocks across many genomes, but they're all arranged differently. We've got a mosaic of blocks of genes, all of them occurring at about the same position in the genome.

So when we compare bacterial genomes, how we do it depends on what we're trying to achieve. The prototypical thing you might do, if you're looking within a species, is track the spread of something. What I've got here is a picture from a classic paper, from 2013 I think, by Simon Harris et al. What they wanted to do was track the spread of MRSA through a neonatal ward in Cambridge. They collected samples from babies and mothers and so on, and by comparing the genomes of these samples they managed to infer something about how the bacteria were spreading. When they did that, they used the implicit assumption that genomes that are very similar are closely related, because there hasn't been time for mutations to occur, and that things that are further apart have had more time for more mutations. So the immediate question after that is how this makes sense. I've just told you that two E. coli might share only 30% of their genome. So how do I make that make sense at the same time as drawing trees?
So the tree model makes sense when the genome mostly stays the same except for mutations, or at least it's a model that only incorporates the mutations. We know that's not really how bacterial genomes evolve, if we wanted to incorporate everything into our model. But if you restrict to the core genome, that's the bit of the genome that's shared amongst all of the samples you're studying, and just look at the SNPs there, then a tree is a perfectly good model for understanding those relationships. And even more, if we think about the cells from which these genomes came: because cells come from binary fission of a mother into two daughters, there is a real cellular tree which represents the relationships of all the cells and their ancestors, going back to a common ancestor. So there is a cellular tree, but the point is that the genomes don't precisely reflect that cellular tree, precisely because they're passing DNA backwards and forwards.

So what does that mean? What about the rest of the tree? Say I've got three bacteria here called Alice, Bob and Charlotte, and they're all quite closely related compared to the rest of their population: they all have these red mutations. But then this grey mutation on the left happened on the way to Alice, and only Alice has it. These two mutations happened on the way to Bob: only Bob has both, and Charlotte has the first one of them. So that is a representation of their relatedness. When we use tree-based approaches, we're saying that relatedness is about which SNPs you do and don't share. Alice, Bob and Charlotte all share the red SNPs. Only Alice has this one. Charlotte has only this one, and Bob has both of these. So that's fine.
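As a minimal sketch of that idea, the Alice, Bob and Charlotte example can be written down as sets of SNPs and compared pairwise. The SNP labels are hypothetical, and the number of shared red mutations (two here) is an assumption, since the talk only shows them on a slide. The distance used is simply the number of SNPs that one genome has and the other lacks.

```python
from itertools import combinations

# Hypothetical SNP labels following the example in the talk:
# "red1"/"red2" are shared by all three, "grey" arose on the way to Alice,
# "m1" and "m2" arose on the way to Bob, and Charlotte carries only "m1".
snps = {
    "Alice": {"red1", "red2", "grey"},
    "Bob": {"red1", "red2", "m1", "m2"},
    "Charlotte": {"red1", "red2", "m1"},
}


def snp_distance(name_a, name_b):
    """Number of SNPs present in exactly one of the two genomes."""
    return len(snps[name_a] ^ snps[name_b])


distances = {
    (a, b): snp_distance(a, b) for a, b in combinations(sorted(snps), 2)
}
closest_pair = min(distances, key=distances.get)

for pair, d in distances.items():
    print(pair, d)
print("closest pair:", closest_pair)
```

Under these assumptions Bob and Charlotte differ by one SNP, Alice and Charlotte by two, and Alice and Bob by three, which matches the tree described above: Bob and Charlotte are each other's closest relatives.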
The question is what happens when some ancestor of Bob, who is not an ancestor of Alice as well, somebody back here, passes some DNA to an ancestor of Charlotte. How would we represent that in the tree? Well, we couldn't, because it's a non-tree-like behaviour. They're passing stuff sort of sideways. And if that gene then developed a SNP, there's absolutely no way we can represent that. So there's a mutation here hanging off on the side, in a gene that moved from Bob's ancestor to Charlotte's ancestor, and we've got that SNP sort of outside our universe. It doesn't fit into our model.

So where do we go? Where we go really depends on what we're trying to achieve. For many questions, standard phylogenetics provides what you need. So here I've got a set of bacteria related by a tree built from the SNPs in their core genome. And on the right is what's called a Phandango representation, which shows the other, accessory genes and who has them and who doesn't. So for example, this gene here, there should be a label up here, is shared by all of these bacteria, but not this one and not this one. This blue gene is here only in this sample, as is this one. So you basically combine a vertical representation, a tree representation of the genome that everybody shares, with some kind of heat map showing, for the other genes that not everyone shares, who does and doesn't have them. And you'll find many papers use this kind of representation, for example to show the relatedness of strains and then which ones have AMR genes or which ones have plasmids.
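The heat-map half of that picture is just a presence/absence matrix with its rows sorted in the order of the tree's leaves. The following is a rough sketch of building one, not how Phandango itself works; the sample names, gene names and leaf order are all invented for illustration.

```python
# Hypothetical leaf order, as it might be read off a core-genome SNP tree.
tree_order = ["sample_1", "sample_2", "sample_3", "sample_4"]

# Hypothetical accessory gene content per sample.
accessory = {
    "sample_1": {"geneA", "geneB"},
    "sample_2": {"geneA"},
    "sample_3": {"geneA", "geneC"},
    "sample_4": set(),
}


def presence_absence(order, gene_content):
    """Return (gene names, rows of 0/1) with one row per sample in tree order."""
    genes = sorted(set().union(*gene_content.values()))
    rows = [[1 if g in gene_content[s] else 0 for g in genes] for s in order]
    return genes, rows


genes, rows = presence_absence(tree_order, accessory)

print("sample".ljust(10), " ".join(genes))
for sample, row in zip(tree_order, rows):
    print(sample.ljust(10), " ".join(str(v).center(len(g)) for v, g in zip(row, genes)))
```

Reading down a column shows who has a given gene, and comparing that with where the samples sit in the tree shows whether the gene follows the core-genome relationships or cuts across them.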
And then, although it's not really the subject of this week, if you wanted to do a genotype-phenotype analysis and draw correlations between phenotypes and genotypes, then you would probably want to be aware of all of the genetic changes, not just the SNPs, because if huge chunks of DNA are missing, that makes a difference to the phenotype. These days there are about three approaches you can use. You can use k-mers: you shred the genome into words, and you do independent associations for each word to see which genomes it is in, and to see if there are any words that are highly correlated with whatever your phenotype is. The state of the art there, I guess, is two tools called bugwas and pyseer. The trouble with breaking the genome up into tiny words is that there are a lot of words, and lots of them are totally correlated with each other, so you end up doing a lot of multiple testing. So a unitig approach is a way of collecting together blocks of words that always co-occur, and there are approaches for that. There's one called DBGWAS, which is an acronym I won't go into for now. And then there are approaches called genome graphs, which try to represent the entirety of the variation that you have inside a species, and you can use them as an infrastructure for doing these kinds of things.

So, okay, that's bacteria. What about plasmids? These are independently replicating elements in the cell. They're separate from chromosomes. A given bacterial cell may have either no plasmids or up to maybe 10. I think 11 is the most I've heard of. And each one might be at a copy number of somewhere between one and a hundred. Generally speaking, small plasmids tend to be at higher copy numbers.
So you have more copies of them in the cell, and larger plasmids, which might be hundreds of kilobases long, tend to be at a lower copy number. You can think of these as selfish or parasitic or commensal. It depends on your perspective and on the example. But they do often have the ability to move independently to other cells, carrying genetic cargo, which is the subject of most of this week, because that cargo can often be genes which confer the ability not to be killed by antibiotics. And these can move within species, they can move across species, and they can even move across phyla. They can move a long way.

Okay. So everything I've said so far is about how bacterial and plasmid genomes exist and how they're related. I haven't talked about how you infer those genomes from sequencing data. So what's genome assembly? You've got a genome which is unknown. Sequencing data means that you've taken the genome, you've broken it up into chunks, and you've redundantly read, or measured, or seen those chunks. And given these chunks, you try to reconstruct the genome from which they come. So an assembly is a hypothesis of a genome. You have a model, and you say: under this model, this is what I think the genome is. Quite often genome assembly tools, and I speak as someone who's written one, but bluntly, they often have beautiful ideas and some ugly heuristics underneath, especially if you're doing short-read assembly. So, here we go. If you're using short reads, Illumina data breaks your genome up into chunks of somewhere between 75 and 250 base pairs. And your genome is a few million base pairs long, and it often contains repeat elements.
So if you have a repeat, say some kind of mobile element that's copied and pasted itself many times in the genome, and you break your genome up into chunks that are smaller than that repeated copy, then it's fundamentally impossible to reconstruct the genome. It really is completely impossible. So what happens is you end up with tens or hundreds of contigs, blocks of genome, and depending on what you're doing, that may be good enough for what you're trying to achieve. But plasmid reconstruction is particularly hard from short reads, in particular because mobile elements can be shared between plasmids, plasmids can recombine or do crazy things stimulated by those mobile elements, and you can have things shared between the plasmids and the chromosome. So plasmid reconstruction is hard. Inference of the presence of a known plasmid, given the sequence data, is a bit easier, because you know what you're looking for.

However, things changed, I guess, five or six years ago, whenever Oxford Nanopore came on the scene, and PacBio was already on the scene by then, of course. With long reads, where you've broken a genome up into chunks of, say, a thousand bases or a hundred thousand bases, you can reconstruct the full genome as a single contig, and full plasmids. That is possible. It's not definitely going to happen; it's not guaranteed. And it still requires, I think, at some level, some luck and some careful checking, but, waving my hands, maybe it works 30% of the time, 50% of the time. It depends on your sample prep and on how long the reads are that you've managed to get. And the read lengths you can manage to get really depend on how you've prepared the DNA and how long the chunks are at the point when you put it into the sequencing machine. So complicated things can still happen. I'm going to finish with just a single example of crazy things.
There was an outbreak in a hospital in Virginia in the USA. From 2007 to 2012 they found many patients were infected by organisms which carried the KPC carbapenemase. And there's a very interesting study where they took 204 patients and 281 isolates from those patients, and all of them contained this gene, KPC, but it was one gene carried by 13 species. So the first question you ask if you're doing genomic epidemiology in a hospital is: is this a clonal outbreak, meaning is this a single strain of a bacterium that is suddenly becoming very successful and spreading? And the answer is obviously no, because we've got 13 different species here. So it's not one strain of something that's suddenly taking off. The next question was: is it a plasmid outbreak? And they spent a long time trying to reconstruct the plasmid genomes. What they actually found was that the KPC gene was being carried on a transposon, that's a small mobile element that can copy and paste itself, and that was sitting on plasmids. And in fact, I hope you can see this without this thing on the side blocking you, there are some quite complicated relationships between the different plasmids. They weren't all the same. You could see the transposon on multiple different plasmids, which were related in some cases but less so in others. And actually what had happened was that the gene was sitting inside one transposon, which was then nested inside another one. And these things were seen on multiple different plasmids in multiple different species. So they call this a Russian doll effect. Basically you had a transposon outbreak jumping across plasmids, some of which were more or less successful than others, and you had this nested mosaic structure of things going on.
Now, this kind of stuff presents a big challenge for assemblers, because you have repeats that you really care about, carried on multiple plasmids, potentially in the same cell, and in multiple samples that you're trying to compare. So I think it's still true that understanding plasmids from genomic data is often challenging. There are lots of things we might want to ask about how they've evolved and how samples are related to each other when you've got a data set. And I think you'll find a bunch of the talks this week cover the subject. I'm going to stop talking here, and I'll stop screen sharing. Okay. So I think I'm handing straight over to the next speaker.

We have a 10-minute break, if anybody wants to grab a coffee. And if anybody has any questions, comments, or things they want to know... By the way, Zam, that was a great talk, really interesting. Ah, yes, the first thing I needed to tell you: if you have any comments or questions, put them in the chat, and we will read them from the chat. Otherwise, we can have a coffee, because James is in another call and he will arrive right at 4 o'clock for his talk.