which returns us back to pangraph, actually. Liam Shaw is going to be talking to us about using pangraph to explore plasmid diversity. Liam, you've had a warm-up act.
I would not describe Marco as my warm-up act; that makes it sound like I'm the headliner. Yeah, thank you very much. So I'm going to talk about using pangraph practically. I've been visiting the group here who developed pangraph, and I'm very much a user, not a developer. So if I can use it, you can too. I'll just be talking about some of the explorations I've been doing of plasmids using this very useful tool. To start with, I thought I'd mention Ulysses, which I think has a lot of similarities with plasmid diversity in that it's vast, extraordinarily diverse, and understanding it may seem like a fool's errand, but also because the last line of Ulysses, for anyone who's read it, is "Trieste-Zurich-Paris", and this conference was meant to be in Trieste, of course. Sadly, we're online. But throughout the talk, I'll put Ulysses quotes in the top left. So if you're really bored with plasmids by this point, just look there, and there will be something relevant from Ulysses, because really Ulysses has everything in it. So I'm going to talk about why pangraph is a useful tool for exploring plasmids, and then give two examples. I'll spend most of my time on the first example, antimicrobial resistance plasmids, and then I'll talk about the idea of plasmid architectures in the second example. So why should we think that pangraph is going to be useful to explore plasmids? Now, tools exist to cluster plasmid diversity, and Fernando's talk earlier was a fantastic example of this. COPLA, the classifier that his group have been developing, is extremely useful. And you have a range of different ways of clustering plasmid diversity. There are traditional typing schemes, which are based on the presence or absence, or the sequence type, of a set of known genes. 
You have alignment-free clustering, which is based on the similarity of component k-mers, so not actually aligning things. And then you have alignment-based clustering, based on something like average nucleotide identity. And you have some rules for partitioning up the big messy network that you get into clusters. However, once you have those clusters, understanding the structure within them, what happens to plasmid evolution within the clusters, is more difficult. Those clusters are a fantastic place to start. So this isn't a criticism of those methods. It's just that we have the clusters, and now we've got to explore them. And plasmids are actually a special case of pangenomes. In James' talk, he talked about how you've got a core genome, which is the genes that all strains share, and then the accessory genome, which is unique to particular strains or not seen in all members of a species. With plasmids, you have a similar thing. The core genome might be a backbone plasmid, and that backbone plasmid is found within other plasmids that you might call compound plasmids, as this recent preprint does. Variations on a definition of backbone plasmid could be given. But if we think of it as a minimal entity that can propagate using the genes that are on that backbone, then in plasmid pangenomes we have a kind of minimum working example approach to pangenomics, because the core genome can actually exist on its own. And plasmids are interesting to think of in a pangenome approach because they're tractable. You can make the visualizations and also do the quantitative stuff. Whereas with whole genomes, looking at the big messy graphs that you get, it's just more difficult to work out what's going on. So I think starting with plasmids is a good idea when trying to build models of pangenome evolution, even though they are a bit of a special case. 
So to start with, I'll give this example of antimicrobial resistance plasmids and why pangraph is useful to look at this type of evolution. Zam mentioned this paper, which involved an outbreak in a hospital in America over five years: multiple species, multiple plasmids, genetic elements moving this gene around. It's a nightmare. And it's a really fantastic paper. If you haven't read it, I really recommend you do, because you'll learn a huge amount. I think that was a groundbreaking paper in the exploration of this genetic diversity in bacterial genomes. But for the purposes of this talk, it suffices to note that plasmid structures can be extremely dynamic. So here I'm taking out a tiny section of that data set to explore. They did long-read sequencing on a subset of isolates, I think 17, which at the time was expensive. So they only did it on a small subset, but you get the whole plasmid assembly for that reduced set of plasmids. And I'm picking out four plasmids. So here's a reference plasmid that was also sequenced as part of this collection of isolates. It's a 43 kilobase pair plasmid that carries this resistance gene. And here are four other plasmids that were isolated within a two-year period from different species. I've simplified the notes from the original publication. Compared to the reference, these two plasmids, P1 and P2, had one single nucleotide variant against the reference sequence. The others had zero changes, but then the structural changes were much larger. And again, this is a table from the original paper. P1 had no changes with respect to this reference. P2 had a 188 base pair deletion, P3 had an insertion, and P4 had a duplication of this blaKPC-containing region, and then an insertion as well. That's how they characterised it in the paper, and that's where I've taken that description from. 
So if we look at this by aligning genes, for example, and trying to visualise it with gene diagrams, which is a very common approach to take, this can be quite confusing. To make this, I've found genes in the plasmids with Prodigal and annotated them with Prokka. Prokka will actually run Prodigal as part of it. And then I produced this visualisation with clinker. There are many other methods you could have used. But this is just to point out that some of the things show up quite nicely. So for example, here's what looks like an insertion. This would be how they worked out that there was a 1,200 base pair insertion in P3 compared to this reference sequence. But the duplication that's going on between P4 and the other ones is all the way off down here. I've truncated it because it's too big. And this is not really quantitative. This is a way of qualitatively looking at the data and trying to work out what's going on. It would be nice to have a data structure that was based on alignment of blocks of homologous sequence, which is what pangraph is. So here are the commands that I ran on this set of plasmids. And this is what you get. You can visualise the results of running those commands on these plasmids in Bandage, a programme made by Ryan Wick originally for looking at assembly graphs, but it works equally well here for looking at pangenome graphs. You have these sets of blocks. And here's the KPC gene, which is on one of these blocks. So remember from Marco's talk, this is a representation of all of the genomes together, of how blocks are connected. Any particular plasmid genome will be a walk through this graph, so a closed loop that goes around the graph. So for example, we have plasmid 1, the reference. And here I'm showing a sort of linear version of the alignment blocks that pangraph has discovered in these plasmids. So if we take plasmid 1, let's start at the purple. So we're going to go purple, red, orange, yellow, green. 
And then we're back where we started. So that's the circle. If we take plasmid 4, for example, we're going to go purple, red, orange, yellow, green. Then we've got this kind of similar yellow color. We go all the way around that loop. Then we go orange, yellow, green again. And then back where we started. That's the end of the plasmid. So why is this good? Well, that data structure allows us very easily to compute structural distances. And what do I mean by that? Well, this is very much a toy example. So I'm just going to go through it. If we imagine we have the reference plasmid and plasmid 2, say, which had a deletion. If we look for the longest common subsequence of blocks between those two, you'll see that it's purple, red, orange, green. So we just go through and it takes into account both the presence and the order of the blocks. So there's a break point here is what we say, because there's a difference between these two plasmids at that position. If we take, for example, plasmid 4, the longest common subsequence here is purple, red, orange, yellow, green. They both have yellow block in this case. The break point actually comes at the end. And this whole thing is a single break point because it comes between the green and going back to the beginning, the purple. So that's the same number of break points as the previous example, even though there's a vastly different amount of sequence that's been changed in those different cases. So we could choose to weight those differently. And that's something I'd be very interested in people's thoughts on. But for the purposes of this, I'm just going to treat those as the same. So we can have a distance matrix between the plasmids now based on break points. And because this is a very simple example, but it is a real example. It turns out that between plasmid 1 and all of the other plasmids, there's one break point. And for example, if we compare plasmid 2 and plasmid 3, we see that there's two break points. 
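The breakpoint counting described above can be sketched in a few lines. This is a minimal illustration, not pangraph's own implementation: the function names `lcs_pairs` and `breakpoints` are hypothetical, the block orders are assumptions reconstructed from the colours mentioned in the talk (the position of the blue insertion in plasmid 3 and the names of the loop blocks in plasmid 4 are guesses), and every path is assumed to have been rotated to start at the same block in the same orientation.

```python
def lcs_pairs(a, b):
    """Return index pairs (i, j) of one longest common subsequence of a and b."""
    n, m = len(a), len(b)
    # dp[i][j] = length of the LCS of the suffixes a[i:] and b[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def breakpoints(a, b):
    """Count gaps between consecutive LCS matches, treating both paths as circular.

    Assumption: each run of unmatched blocks counts as one breakpoint,
    whatever its length (all breakpoints are weighted equally).
    """
    pairs = lcs_pairs(a, b)
    if not pairs:
        return 1 if (a or b) else 0
    count = 0
    for (i0, j0), (i1, j1) in zip(pairs, pairs[1:]):
        if i1 - i0 > 1 or j1 - j0 > 1:
            count += 1
    # Wrap-around gap: blocks after the last match and before the first match.
    (fi, fj), (li, lj) = pairs[0], pairs[-1]
    if fi > 0 or fj > 0 or li < len(a) - 1 or lj < len(b) - 1:
        count += 1
    return count


# Illustrative block orders (assumed, not taken from the real pangraph output).
paths = {
    "P1": ["purple", "red", "orange", "yellow", "green"],
    "P2": ["purple", "red", "orange", "green"],                  # deletion
    "P3": ["purple", "blue", "red", "orange", "yellow", "green"],  # insertion
    "P4": ["purple", "red", "orange", "yellow", "green",
           "yellow2", "loop", "orange", "yellow", "green"],       # duplication + insertion
}

names = sorted(paths)
matrix = {(x, y): breakpoints(paths[x], paths[y]) for x in names for y in names}
for x in names:
    print(x, [matrix[(x, y)] for y in names])
```

Under these assumed block orders the sketch gives one breakpoint between plasmid 1 and each of the others and two between any other pair, which is the pattern described in the talk. Weighting breakpoints by the amount of sequence involved would only need a change to what `breakpoints` adds to `count`.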
Plasmid 3 has a bit of blue here. And plasmid 3 also has a bit of yellow, which is the thing that's a deletion in plasmid 2 compared to the others. So that's two break points, and we have a two in the distance matrix here. So we've got evolution going on. We can use different distances to characterize that evolution, and those distances are going to contain different information. If we use reference-based mapping, the single nucleotide variants against the sequence of the reference plasmid, we will see this tree. Plasmid 1 and plasmid 2 have one position where they vary over that 43 kilobases. And plasmid 3 and plasmid 4 are the same as the reference. I'm just showing this tree like this, but I could put them on top of the reference here. That would be the same tree structure. If we use k-mers, so if we ask about the presence or absence of k-mers, not taking into account alignment or anything, we're going to get that P1 is the same as the reference. P2 is a little bit different because it's got this deletion. P3 is a bit more different because it's got a slightly larger insertion. And P4 is really different to everything else because it's got this massive insertion. So that's quite a different picture of the evolution. If we were using this to infer stuff about evolution, it's quite a different picture. And now here we're using blocks and the break point distance. P1 is the same as the reference. All of the other things have one break point with respect to P1 and two with respect to each other. So to go from P2 to P3 is one, two. This is not a phylogeny, right? It's a tree representation of the distance matrix. And this is very easy to get out of pangraph. We get really easy access to data on the plasmid structure. But what we need is evolutionary models to link those distances to evolution and find the most parsimonious routes. 
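To make the contrast with the k-mer picture concrete, here is a minimal sketch of a presence/absence k-mer distance (Jaccard distance on k-mer sets). Everything in it is an assumption for illustration: the sequences are random stand-ins scaled down from the real plasmids, the event sizes are invented, the sequences are treated as linear, and reverse complements are ignored.

```python
import random


def kmer_set(seq, k=15):
    """All distinct k-mers of a linear sequence (no reverse complements)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b, k=15):
    """1 - |A & B| / |A | B| on the k-mer sets of the two sequences."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    return 1 - len(ka & kb) / len(ka | kb)


rng = random.Random(0)


def random_dna(n):
    return "".join(rng.choices("ACGT", k=n))


# Toy stand-ins for the plasmids (sizes are invented for illustration).
ref = random_dna(2000)
p1 = ref                                        # identical structure
p2 = ref[:800] + ref[850:]                      # small deletion
p3 = ref[:1200] + random_dna(200) + ref[1200:]  # moderate insertion
p4 = ref + random_dna(1500)                     # large insertion

for name, seq in [("P1", p1), ("P2", p2), ("P3", p3), ("P4", p4)]:
    print(name, round(jaccard_distance(ref, seq), 3))
```

With this distance the amount of changed sequence dominates, so the toy P4 ends up far from everything else, whereas a breakpoint count treats each structural event as one change regardless of its size.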
So for example, just thinking about plasmid 4, where the paper says it's a duplication and an insertion, we might want to decouple this one break point into two different things. But we need a model of evolution to do that. So far we're just saying every break point is the same. So in the remaining time, I'm just going to talk about another application of pangraph, to different plasmid architectures. There was a recent preprint, or a recent version of a preprint, I should say, on bioRxiv from these authors, which is a really fascinating paper that I'd recommend everybody read. They searched in human gut metagenome assemblies for plasmid structures. They used read mapping patterns to decide whether a contig was a plasmid or not, and some plasmid classification. They basically got a big data set of plasmids out of that, tens of thousands of plasmids, many of which have no similarity to reference plasmids. Then they group these plasmids into plasmid systems. I won't go through all of the details, but they align the plasmids to each other, using directed edges to represent containment. So this backbone plasmid is present in all of these compound plasmids. And then they call this connected component of the network a plasmid system. And this is a better starting point for the purposes of pangraph than alignment-free clusters, where you can actually have cases where the sequence divergence is greater than that tolerated by pangraph at the moment. In those cases, when you align the plasmids and make the graph, you'll get disconnected components, which is just a current limitation of the tool. So here's plasmid system one, which is the first of the over 1,000 different plasmid systems that they report in this preprint. It's 26 plasmids, recovered from a whole bunch of different gut metagenomes, between five and 15 kilobases long. They have a RepB, a repB gene there. I don't know why I've written protein. Sorry. 
And then there are no hits in PLSDB. There's no PTU assigned by COPLA. Because this is a metagenome plasmid, you might want to treat it with some skepticism. Here is the pangraph representation of that set of 26 plasmids. I'm showing the core blocks in red, and I've given the blocks numbers so that they're a bit easier to see. You can see that colors wouldn't be informative here, so I've just used numbers so that I can refer to them. So the core plasmid, the backbone plasmid, is 1, 3, 2. Now imagine a particular plasmid that's not the backbone plasmid. Another plasmid might be 1, -7, 10, 3, 11, 2. So 1, -7, 10, 3, 11, 2. And I've messed around with this to show the walk through the graph. So we can plot the breakpoint distance as a tree of all of those different plasmids. This is the most common plasmid sequence of blocks, so that's why that point is bigger than the others. I've just collapsed it to make the tree easier to understand. You can see that here's the backbone plasmid, the sequence of blocks 1, 3, 2. And then, for example, this plasmid has one breakpoint where 5, 10 has been inserted. And you can see that there's a lot of conservation of block synteny. So 1, 3, 2, what we're putatively calling the backbone plasmid, appears in the same order in all of those different plasmids. This is quite interesting because we very quickly get a picture of how synteny is working in the plasmid, how the architecture might be going. So we could ask, why is this region disrupted? Why is there only one disruption between 3 and 2, compared to many other disruptions in the other regions of the graph, for example? So you can see there are lots of questions we can start to answer once we have this representation of the plasmids as the data structure that pangraph gives us. 
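The signed block paths mentioned above lend themselves to a simple check of backbone synteny. The following is a minimal sketch under stated assumptions: the function name `gaps_between_core` is hypothetical, a negative number means the block is traversed in reverse orientation, each path is assumed to be rotated so that it starts at the first core block, and core blocks are assumed to appear once each in forward orientation. Only the two block paths read out in the talk are used.

```python
def gaps_between_core(path, core):
    """Map each consecutive pair of core blocks to the accessory blocks between them.

    The path is treated as circular, so the last core block wraps to the first.
    Raises ValueError if a core block is missing or the core order is not conserved.
    """
    positions = [path.index(block) for block in core]  # ValueError if absent
    if positions != sorted(positions):
        raise ValueError("core blocks do not appear in backbone order")
    n = len(path)
    gaps = {}
    for k, block in enumerate(core):
        next_k = (k + 1) % len(core)
        i = (positions[k] + 1) % n
        between = []
        while i != positions[next_k]:
            between.append(path[i])
            i = (i + 1) % n
        gaps[(block, core[next_k])] = between
    return gaps


backbone = [1, 3, 2]
compound = [1, -7, 10, 3, 11, 2]

for name, path in [("backbone", backbone), ("compound", compound)]:
    gaps = gaps_between_core(path, backbone)
    disrupted = [junction for junction, blocks in gaps.items() if blocks]
    print(name, gaps, "disrupted junctions:", disrupted)
```

Tallying the disrupted junctions across all plasmids in a system would give a rough count of how often each backbone junction is interrupted, which is one way to start on the question of why some regions are disrupted more often than others.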
So there's future work to do. Obviously I've just presented really rough work in progress using pangraph, which is a pleasure to use. So pangraph is a scalable way to explore these clusters of plasmid diversity. But what are the right metrics? Pangraph contains within it not just the graph of blocks, the paths of blocks that I've been talking about here, but also the alignments of those blocks. So you could combine the SNP-level information with the structural information, and that would be very cool. And then there are two applications that I can think of. One is recent structural evolution, where antimicrobial resistance plasmids should represent an ideal test case to estimate rates for various processes of structural change. So how common are duplications compared to deletions, or something like this? And then also plasmid architectures. We've got lots of clusters of plasmids. What are the possible topologies of graph structures that exist? How can we connect those to models of evolution? I think that's a really interesting question. So it remains for me just to thank Richard very much for hosting me in his group for this visit, Nick for being the main developer of pangraph, and Marco for also developing pangraph and working with you. Thank you so much for having me. Thank you for helping me with how to use it. And I really welcome people to email me, in particular if you know of longitudinal and/or long-read plasmid data sets. It may be ones that are yet to be published, and so on. I'd be very interested in those. And also if you have thoughts about models of structural evolution that could be applied here, that would be fantastic. And I'll stop there.
That was really brilliant. You've got two questions already. I'm going to skip Natasha's for the moment, if that's okay, Natasha, because I think Liam answered it in the talk. So I've got a question from Yana. 
"Liam, very interesting and thought-provoking talk. Can you please say a bit more about the models describing synteny evolution that you're considering, or where or how one could start developing such a model?"
Yeah. Okay. Yes. Very good question. I'll say a little bit more about it, but I think it's more for discussion. One of the representations of genomes that was popular a while ago, and that some people still work on, is this idea of a signed permutation. You have integers, say one to 10, and those can have signs. They can be positive or negative. And then you assume that any genome contains no duplications and is some ordering of those things. So you can represent your genomes as signed permutations of the numbers one to 10. And then you ask, okay, if I only allow reversals of any size, so you can take any block of numbers, flip their signs and turn them all around, how can you go from one signed permutation to another? There exist algorithms for that, and there was some very interesting work on chromosome evolution, maybe 20-plus years ago. However, it gets very complicated once you allow things like transposition and duplication. It's an NP-hard problem once you start adding those things in. So that's one example of a model that has been applied in the past to this kind of approach. But I don't know that there's one that is ideal at the moment.
Okay. That's a great answer. Thank you. I don't think I've seen one that copes with that. I mean, most of those are really set up for genomes that contain fundamentally the same genes.
Yeah. Yeah. And there are some models where people allow what are called super short transpositions, or things involving only two genes or blocks, because otherwise the combinatorial space just explodes.
Yeah. Okay. Anything from anyone else? Okay. I'm sure people will want to talk to you more about this during the discussion. 
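The signed-permutation model raised in the answer above can be illustrated with a small sketch. This is not one of the efficient published algorithms for sorting by reversals; it is a brute-force breadth-first search, swapped in purely for illustration, that only works for very short permutations. The function names are hypothetical, and the permutations are treated as linear rather than circular.

```python
from collections import deque


def reversal(perm, i, j):
    """Reverse the segment perm[i:j] and flip the sign of every element in it."""
    return perm[:i] + tuple(-x for x in reversed(perm[i:j])) + perm[j:]


def reversal_distance(start, target):
    """Minimum number of signed reversals turning start into target.

    Brute-force breadth-first search over all signed permutations, so it is
    only practical for a handful of blocks.
    """
    start, target = tuple(start), tuple(target)
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        perm, dist = queue.popleft()
        if perm == target:
            return dist
        n = len(perm)
        for i in range(n):
            for j in range(i + 1, n + 1):
                nxt = reversal(perm, i, j)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
    return None  # unreachable when both are signed permutations of the same blocks


identity = (1, 2, 3, 4)
flipped = reversal(identity, 1, 3)  # blocks 2 and 3 reversed and sign-flipped
print(flipped, reversal_distance(flipped, identity))
```

The search space here grows as 2^n n! signed permutations, which hints at why adding further operations such as transpositions and duplications quickly becomes hard to handle.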
I'm going to hand over now to the next speaker.