Today I will tell you about a tool that we've been developing in the group. So as Amin said, I'm currently doing a postdoc in the lab of Richard Neher here in Basel. And yeah, I will tell you about this tool, PanGraph, that we've been developing, which can be used to represent genomes or plasmids as graphs. So to introduce it a bit, let's say that I have a set of isolates, maybe of plasmids, that contain some cassette that maybe provides antimicrobial resistance. And I'm interested in studying the evolution and the spread of such resistance. There are different ways in which I could go about it. One could be stratifying my sample by presence and absence of the resistance cassette. But then I could also take the sequence of my cassette and align it to see variation amongst the isolates, maybe to study sequence evolution. And then, since many of these cassettes are mobile, I also might be interested in knowing where these cassettes integrate in the broader context of a genome, or where they are placed inside a plasmid. And what we've been interested in is finding a representation that allows one to ask meaningful questions on all of these levels. And a very natural one is the pangenome graph. So what is a pangenome graph? Let's say that I have a set of assembled chromosomes, maybe assembled bacterial chromosomes or maybe plasmids, that I want to compare. So here in this case, I have five isolates, A, B, C, D, E. And part of their sequence is homologous. So here I color homologous sequences with the same color. I want to build a representation in which homologous sequences are grouped as edges in a graph and isolates are represented as closed paths inside this graph. So this representation has two main elements. One is what we call blocks. Blocks are sets of homologous sequences. And these have a consensus, but they also contain the variations on this consensus for each isolate.
So it's really information on the alignment of these homologous sequences. And then I have paths, which are a representation of my genomes, of my sequences, as lists of blocks. So here in this case, isolate A is composed of block one, followed by block two, block three, block four, block five. And in my graph representation, this should correspond to a closed path inside the graph. So what I will refer to as a pangenome graph is this collection of blocks and paths, where blocks are really alignments, so they contain information on each occurrence of these homologous sequences, and paths are representations of the isolates as lists of blocks. I hope I can convince you that with this representation, one can ask meaningful questions on all of these levels. So one can check whether a given stretch of sequence is present or not in an isolate, once you know in which block it's present. You also have the alignment that is present inside the blocks. And for each block, you know in which context it appears. So you can explore all of these dimensions. And at the same time, it's a meaningful representation also if one is interested in sequence evolution on the broader level. For example, you can think of a recent insertion event as a new part of the sequence appearing that is not present in any other isolate. And this appears in my graph as a deviation of a single path from the common path taken by the other isolates. And what we've been trying to develop is a tool that allows you to build these representations starting from the sequences. So this is work done in collaboration with Richard and Nick. We've come up with PanGraph. This is a Julia-based library and command-line interface for aligning whole genomes into a graph, so really for building this representation. This is currently available on GitHub. And yeah, most of the heavy lifting has been done by Nicholas Noll, who was a postdoc here at the Biozentrum and now is at KITP at the University of California.
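To make the block-and-path representation described in the talk a bit more concrete, here is a minimal Python sketch. It is not PanGraph's actual data model: the class names, fields, and methods are hypothetical, variation is reduced to single-base substitutions (real blocks would also need insertions and deletions), and only isolate A's path comes from the talk. Isolate B and all sequences are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Block:
    """A set of homologous sequences: one consensus plus per-isolate variation."""
    block_id: int
    consensus: str
    # isolate name -> substitutions relative to the consensus, as (position, base)
    substitutions: Dict[str, List[Tuple[int, str]]] = field(default_factory=dict)

    def sequence_of(self, isolate: str) -> str:
        """Rebuild one isolate's copy of the block from the consensus."""
        seq = list(self.consensus)
        for position, base in self.substitutions.get(isolate, []):
            seq[position] = base
        return "".join(seq)


@dataclass
class Path:
    """One isolate written as an ordered, circular list of block ids."""
    name: str
    block_ids: List[int]

    def contains(self, block_id: int) -> bool:
        """Presence/absence question: does this isolate carry the block?"""
        return block_id in self.block_ids

    def context(self, block_id: int) -> Tuple[int, int]:
        """Return the (previous, next) block ids, wrapping around the circle."""
        i = self.block_ids.index(block_id)
        n = len(self.block_ids)
        return self.block_ids[(i - 1) % n], self.block_ids[(i + 1) % n]


# Isolate A follows the talk's example; isolate B and the sequences are made up.
block_3 = Block(3, "ACGTACGT", {"A": [], "B": [(2, "T")]})
path_a = Path("A", [1, 2, 3, 4, 5])
path_b = Path("B", [1, 2, 3, 6, 5])

print(block_3.sequence_of("B"))  # sequence variation inside a block
print(path_a.context(1))         # genomic context of block 1 in isolate A
print(path_b.contains(4))        # presence/absence of block 4 in isolate B
```

The three calls at the end correspond to the three levels of question raised in the talk: variation within a homologous sequence, the context in which a block appears, and presence or absence.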
And the interface of this tool is very simple. Given a set of genomes, for example a set of plasmids contained in a FASTA file, once the tool is available in your path one can simply type pangraph build. So this is the command to build the pangenome graph. Then there's an optional flag here, --circular, which signifies that the plasmids we are considering are actually circular. And one can redirect the output into a pangraph.json file. And this JSON file really will contain two elements, a list of blocks and a list of paths. So this is what I've been telling you about. And yeah, there's also documentation available, with more details on how this JSON file is organized, and all of the functions are well documented. This is also available online. So I want to tell you a bit about the algorithm that we use to build this representation. Starting from our set of genomes, what PanGraph does first of all is build a guide tree. This is done using an alignment-free distance. In our case, we use Mash to obtain a quick measure of how different each of these genomes is from the others. So this allows us to build a guide tree quickly. And in this tree, genomes that are more similar are closer together. And then we place each genome on a leaf of the tree as a trivial graph, that is, a graph composed of a single block. And then what happens is that we propagate these graphs up the tree, and at each internal node of the tree we perform a merge of two graphs. And we collect the pangenome graph at the root of the tree. And notice that in this process, merges on separate branches of the tree are independent of one another. So they can be run in parallel, and PanGraph does run them in parallel. But really the interesting part is happening at this level, at the merge of two graphs along the tree.
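The guide-tree procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not PanGraph's implementation: a Jaccard distance on k-mer sets stands in for the Mash distance, greedy average-linkage clustering stands in for whatever tree-building method the tool really uses, the merge step is passed in as a placeholder function, and the merges run sequentially rather than in parallel. All function names and the toy sequences are hypothetical.

```python
from itertools import combinations


def kmer_set(seq, k=4):
    """All k-mers of a sequence, as a set."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a, b, k=4):
    """Jaccard distance between k-mer sets (a crude stand-in for a Mash distance)."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    return 1.0 - len(ka & kb) / len(ka | kb)


def guide_tree(seqs):
    """Greedy average-linkage clustering; returns the tree as nested tuples of names."""
    clusters = [(name, [name]) for name in seqs]  # (subtree, member names)

    def distance(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(kmer_distance(seqs[a], seqs[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        # Join the two closest clusters under a new internal node.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: distance(clusters[ij[0]], clusters[ij[1]]))
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return clusters[0][0]


def merge_up(tree, leaf_graph, merge):
    """Post-order traversal: trivial graphs at the leaves, one merge per internal node."""
    if isinstance(tree, str):
        return leaf_graph(tree)
    left, right = tree
    return merge(merge_up(left, leaf_graph, merge), merge_up(right, leaf_graph, merge))


# Toy input: A and B are near-identical, C is unrelated (made-up sequences).
seqs = {"A": "ACGTACGTAC", "B": "ACGTACGTAA", "C": "TTTTGGGGCC"}
tree = guide_tree(seqs)

merges = []  # record every merge so they can be counted


def toy_merge(left, right):
    """Placeholder for the real graph merge: just pools the genome names."""
    merges.append((left, right))
    return left | right


root = merge_up(tree, lambda name: frozenset([name]), toy_merge)
print(tree)
print(len(merges))  # one merge per internal node, i.e. number of genomes minus one
```

Because the two recursive calls in `merge_up` touch disjoint subtrees, they do not depend on each other, which is the property that allows the merges to be run in parallel.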
So I can explain in a bit more detail how we do this, how we merge two graphs. You can imagine that on two branches of this tree I'm collecting two graphs, the blue and the yellow one, and I want to merge them into a new green graph. Each of these two graphs, the blue and the yellow, will have blocks. These are what I was telling you about before, these alignments of our homologous regions. What we do is run a pairwise alignment of every block against every other, using the consensus sequence of each block. And in our case, we use minimap2 as the tool to perform the alignment. And every time we find a homologous region, we have to decide whether we want to merge it into a new block. And we do this using a pseudo-energy. Every time this pseudo-energy is negative, the merge will be performed. And this pseudo-energy goes as minus the length of the alignment. So the longer the homologous region I was able to match, the more likely the merge is to occur. But then I also have two other terms. One depends on N_c, the number of cuts that I will perform when merging the two blocks. So you have to imagine that if I'm able to match a subpart of this blue block, I will have to perform cuts at the edges of the part that is alignable. So this will make the graph more fragmented. And then there's a term N_m that counts the number of mutations, basically of polymorphisms, that I have in my alignment. This makes it so that if the sequences are too diverged, they won't be merged. And one can control the parameters alpha and beta, depending on whether one wants more merging at the cost of a more fragmented graph, or less merging and a graph with a smaller number of blocks. So for pairs of blocks for which the pseudo-energy is negative, the merge will be performed. And in this case, a new block is created and is connected to the rest of the graph as given by the path structure.
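The merge criterion described above can be written down as a small function. This is a sketch under assumptions: the talk only says that the pseudo-energy goes as minus the alignment length, plus terms depending on the number of cuts and the number of mutations, controlled by alpha and beta. The linear form below, the pairing of alpha with cuts and beta with mutations, and the numeric values are all illustrative choices, not confirmed details of the tool.

```python
def pseudo_energy(aligned_length, n_cuts, n_mutations, alpha, beta):
    """E = -L + alpha * N_c + beta * N_m (assumed linear form, see lead-in)."""
    return -aligned_length + alpha * n_cuts + beta * n_mutations


def should_merge(aligned_length, n_cuts, n_mutations, alpha, beta):
    """A match is merged into a new block only when its pseudo-energy is negative."""
    return pseudo_energy(aligned_length, n_cuts, n_mutations, alpha, beta) < 0


# Illustrative parameter values, chosen only to make the arithmetic easy to follow.
ALPHA, BETA = 100, 10

# Long, clean match: -1000 + 100*2 + 10*20 = -600, so it is merged.
print(should_merge(1000, n_cuts=2, n_mutations=20, alpha=ALPHA, beta=BETA))

# Short match that needs two cuts: -150 + 100*2 + 0 = 50, so it is not merged.
print(should_merge(150, n_cuts=2, n_mutations=0, alpha=ALPHA, beta=BETA))

# Long but diverged match (10% mismatches): -1000 + 0 + 10*100 = 0, so it is not merged.
print(should_merge(1000, n_cuts=0, n_mutations=100, alpha=ALPHA, beta=BETA))
```

Under this form, raising the cut penalty suppresses merges of short matches and so limits fragmentation, while raising the mutation penalty lowers the divergence at which two sequences stop being merged.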
So this is really, at the core, how PanGraph works and how PanGraph builds this pangenome graph representation. Talking about performance, I think the most important thing I can show you is the time for aligning genomes of some species. For example, here for 51 genomes of Mycobacterium tuberculosis, chromosomes downloaded from RefSeq, the time is around 90 minutes, so an hour and a half. And the good scaling we obtain is mostly because the number of merges is equal to the number of inner nodes of this binary tree. So it's roughly equal to the number of leaves, that is, the number of genomes that I am trying to merge into a pangenome graph. And this was not executed on a cluster; it is on an eight-core machine. And I can show you some examples of pangenome graphs. So here, for example, I took a set of 105 Klebsiella pneumoniae chromosomes. This is the tree of these strains; the average pairwise divergence on the core genome is around 0.5%, so around five SNPs per kilobase pair. And I picked nine of these strains on the tree; here they are marked with crosses. Using PanGraph, I can export my pangenome graph into GFA format, which can be opened and explored with Bandage. And this is a graphical representation of the graph. Here the color indicates whether a block is common to all of these strains, in which case it is colored in red, or belongs to only some of these strains or paths, in which case it is colored black or a darker shade of red. So you see there are stretches of syntenic regions, and then you have regions in which there is some diversity between the paths. And one can look at the distribution of the lengths of these blocks. So here, for example, look at the blue distribution, which is a cumulative distribution of block length. In total for these nine strains, I'm considering genomes that are around five megabase pairs long.
I have around 1,200 blocks. And half of them are longer than a kilobase pair. But many of these blocks are actually short. So if I weight this distribution by block length, giving more weight to blocks that are longer, I see that if I look at the total amount of sequence in my graph, more than half of it is in blocks that are around 20 kilobase pairs long. I can also look at the frequency of blocks. So for every block, I can check in how many strains it is present. Blocks that are present in only one strain will be private accessory blocks, while blocks that are present in all of the nine strains will be core blocks. And again, if I weight blocks by length, so the red curve, I see that most of the sequence is either in private blocks that are present in only one strain, or in conserved blocks that are common to every strain. So again, this is a cumulative distribution. And if I look at the distribution of block length versus block frequency, I see that many of these accessory blocks that are present in only one strain are very short or of intermediate length. The core blocks instead have a higher average length, around 10 kilobase pairs. And I can also ask what happens if instead of these nine strains, I take all of the 105 strains in this tree. In this case, the graph looks much more messy, because obviously it's more fragmented. But if I look at the distribution of block length and block frequency, it still shows very good properties. So again, most of the sequence is in blocks that are around 10 kilobase pairs long. So the number of strains has been multiplied basically by 10, but the average length of blocks, if weighted by length, has been basically divided by 2. And I see this very common pattern of bimodal frequency of blocks. They're either very common, core, or they're very rare, present in very few strains.
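The block statistics discussed above (block frequency across strains, and distributions weighted by block length) can be computed from the paths and block lengths alone. The following Python sketch uses hypothetical function names and made-up toy data, not the Klebsiella graph from the talk. It counts each block once, so the totals refer to the size of the graph rather than to the summed size of the genomes.

```python
def block_frequency(paths):
    """paths: strain -> list of block ids. Returns block id -> number of strains carrying it."""
    freq = {}
    for block_ids in paths.values():
        for block_id in set(block_ids):  # count a block once per strain
            freq[block_id] = freq.get(block_id, 0) + 1
    return freq


def length_weighted_median(lengths):
    """Length L such that blocks of length >= L hold at least half of the sequence."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return None  # only reached for an empty input


def sequence_share_by_frequency(paths, lengths):
    """Fraction of the graph's sequence held by blocks of each frequency."""
    freq = block_frequency(paths)
    total = sum(lengths[b] for b in freq)
    share = {}
    for block_id, f in freq.items():
        share[f] = share.get(f, 0.0) + lengths[block_id] / total
    return share


# Made-up toy graph: three strains, two long core blocks, three short private blocks.
paths = {"s1": [1, 2, 3], "s2": [1, 4, 3], "s3": [1, 3, 5]}
lengths = {1: 20000, 2: 500, 3: 10000, 4: 800, 5: 300}

freq = block_frequency(paths)
core = sorted(b for b, f in freq.items() if f == len(paths))
private = sorted(b for b, f in freq.items() if f == 1)
print(core, private)
print(length_weighted_median(list(lengths.values())))
print(sequence_share_by_frequency(paths, lengths))
```

In this toy graph most blocks are short private ones, yet almost all of the sequence sits in the two core blocks, which is the same contrast between counting blocks and weighting them by length that the talk draws.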
And if I look at the average size of a single chromosome, so of a single one of the strains that I'm trying to include in this graph, I see that it has a size that is comparable to the total size of the graph. In this case, I have 105 chromosomes, each one around five megabase pairs long, and the total pangenome graph size is around double the size of a single chromosome. So this is a good compressed representation for this set of strains. One last thing that I can tell you about PanGraph is that another command one can use is the marginalize command, which is used to project the graph onto a subset of strains. So let's say that I have a complex graph, but I'm only interested in the differences between a couple of strains, strain A and strain B. I don't want to keep the complexity of regions in which the two strains behave the same. So I can project the graph onto only these two strains, and this will highlight the differences between them. Just to give you an example, again, I can take the graph that was produced from the nine Klebsiella strains that I showed you before, and I can marginalize it over two particular strains with the marginalize command. And what I obtain is a much simpler graph with only one point of contact. In black you see local diversity between the two strains, and in red the parts that are common. And if one looks into why this pinching point is present, it's really because these two strains differ by a big inversion. So actually what is happening is that one path will kind of perform an H shape while the other path will remain outside. This is why I have this pinching point in the middle. Yeah, this is roughly everything I wanted to tell you about this tool. We have a preprint out on bioRxiv. And we are currently working on polishing and improving the tool. Version 0.5 was released three days ago, and the tool is under active development.
So we are currently working on adding another alignment kernel to PanGraph, because we have some limitation on the diversity of sequences that we can align, and we are working on overcoming this limitation. Then we're also working on making installation a bit simpler, and still catching bugs and trying to make sure that everything runs smoothly. And at the same time, we are trying to apply this tool to study the evolution of bacterial pangenomes. And yeah, if you have the occasion to try it, we're very excited to hear what you think about it. And we hope that you will find it exciting and useful. Thank you very much. So yeah, I really want to stress that most of the hard work of coding the tool has been done by Nick under the supervision of Richard. And I also want to thank Liam, who's going to talk later and who's been here visiting us in Basel and contributed a lot of useful discussion. And thanks to the organizers for having me. And yeah, I'll take questions if there's time.
That was a fantastic talk. Thank you so much. It was really interesting. And that's a subject a lot of people have been working on for a while. And it's the first time I've seen something quite interesting that's quite new, that really deals with the whole genome problem. So thanks very much.
Thanks.
I'm going to squeeze in one question here from Olivia. Let me read it out so everyone can hear it. Very nice explanation of your tool. Very nice tool to see large rearrangements in the genomes. Do you find the bimodal block frequency pattern that you found in Klebsiella in other species as well? And also, did you look at the blocks that were causing the bimodal pattern? Is it coming from certain types of genes or mobile elements?
So we did look at other species, and we do find this pattern in other species too. I only showed you one example here, but the pattern seems to be consistent. And we can also match it with the frequency of genes.
So we can check whether the frequency distribution of blocks actually matches the frequency distribution of gene clusters. And indeed, it does, up to a certain threshold of divergence. Things that are too diverged, and we are talking about around 5% divergence, will currently not be merged. And yeah, we are in the process of looking at these disordered regions to see if we can attribute this local diversity to mobile genetic elements or recombination hotspots. So we still don't have an answer on that. But yeah, we are also interested in trying to come up with one, to verify whether there really are specific causes for this local diversity.
Can I squeeze in a small question? You talked about the limitations in terms of the sequence diversity in the bits that do align, I guess driven by the mapper. Are there limitations in terms of how much of the genome sequence can align? So will it still work okay if it's heavily rearranged, or if you've only got, say, 40% alignment between the genomes?
So it will still merge the regions that are homologous. If all regions are more than 5% diverged it will basically... so maybe I can show an example.
That's fine. I was less worried about the divergence in the aligned bits than about the proportion that was not aligned.
Yeah, the proportion that is homologous will still merge, and the rest will be isolated loops that will be outside.
Okay, well, thank you very much.
Thank you.
We've got further questions. We might bring them up in the discussion later.
I'll be here.
Great. And if people could please feel free to add questions to the chat, which we can save for later to bring up in the discussion. So thank you for that, and time to move on to the next talk. So wait a minute. Alice, has Rowan Mehta stepped out? Was about to invite...
Yes. Yes, so sorry. Oh, it's in there.
So I'll move straight to Julian Paganini, who's going to be talking about an optimised short read approach to predict and reconstruct antibiotic resistance plasmids in E. coli. Julian, have we got any? Hi, guys. Yeah.