Okay. Hi everybody. It's a great pleasure today to introduce what I guess I could call the first real talk of the week. I hope I pronounce your name right: this is James McInerney, talking about new approaches to understanding gene content in prokaryotic pangenomes.

Hi folks. It is one of those things that doesn't get said in meetings that are in person, with somebody in the room, because I guess you can probably see them. But I'm so pleased to be here. Can you guys see my screen?

Yes. That looks good. It's not in the presenter view. We can see you.

I'm afraid I'm COVID positive right now, so I probably wouldn't be there if we had to be there in person; I'd be a second cancellation for this meeting. But thank you all very much for inviting me here. I'm very pleased to talk to you about this. My talk isn't specifically on plasmids; it's on pangenomes. I'm going to spend the next few minutes talking a little bit around the subject, about a perspective that we have on pangenomes in my research group and the way in which we like to think about them these days.

About five years ago now, we published this paper in Nature Micro. I think it created a small bit of agitation afterwards, because in the paper we asked the question: what's causing pangenomes, how do they arise and how are they maintained? We wrote this because there were many, many papers coming out demonstrating pangenomes in lots and lots of organisms, and they were quite descriptive. We felt that we might as well get the ball rolling in some way to try to understand the mechanisms by which pangenomes arise, and this enormous amount of gene content variation, at least in some species and in some groups of organisms. I don't think everybody agreed with what we wrote, and there were a few letters over and back since then.
But it's become a very, very interesting space in which to carry out research, and I think there have been a lot of really great papers that have come along in the last few years. We pushed on from this. Just to summarize a little bit, if you're not familiar with the paper: there was an observation that some organisms, say E. coli or whatever, have these enormous long-term effective population sizes. What that should suggest to you is that there's really, really strong selection in organisms with large long-term effective population sizes. And indeed we do see this in E. coli: in at least the core genes and the highly expressed genes, you see codon usage preferences. Quite a lot of these preferences probably have very, very tiny fitness effects, but in an organism with a large long-term effective population size they're still sufficient that they manifest; they come through. So selection can see such tiny differences: one codon in one protein, in an organism that is producing 5000 proteins. We're seeing a very tiny part of the genome, and it's exerting a pressure to prefer one neutral change over another, because remember it doesn't change the encoded amino acid; these are synonymous codon usage changes. If selection can see that, then why do we have such a huge amount of variation? We came to the conclusion that at least a lot of the presence-absence variation in pangenomes must be selected, and the only way in which that could be the case is if, at least a lot of the time, you have selection for new niches, for moving from one niche to another.
So we've got to try and see if pangenomes have real signals of that. We had some anecdotal evidence, and we have pushed on since then to think about it in that way. Where we've moved to is to think about pangenomes as being an ecosystem themselves. So I'm not talking about pangenomes in an ecosystem; I'm talking about pangenomes being an ecosystem. In other words, the genetic background of any genome is playing at least as strong a role as the fitness effect of the gene that's coming in. That'll be variable, of course: antibiotic resistance genes are probably quite strongly selected in lots of different genetic backgrounds when antibiotics are present. Okay, so the fitness effect can overcome that. But in the long term, what decides whether a gene has a positive fitness effect, a negative fitness effect or, if such a thing exists in prokaryotes, is neutral? So we started thinking about the pangenome itself as a series of interactions and as an ecosystem in itself, and the variation that you see across pangenomes, of course, creates different niches for genes that might be on mobile elements and might be coming in, going out and moving around.

This is biology 101, really. I don't mean to insult anybody by putting this up, but just to elaborate on what we're talking about: if we think about lots of macroecology research and understanding, and think about what that might mean as an analogy in the gene space, you can start thinking about mutualism and commensalism and competition and predation and so on, and the interactions between genes. And it wouldn't just be pairs of genes, but maybe lots and lots of genes.
There is mutualism, being where both genes benefit, and you probably see these kinds of things where both genes are in an operon, for instance, or something like this; commensalism, where one gene benefits and one is unaffected; and competition, where they are in competition with each other. We have seen a very nice example of that: Nadine Ziemert at the University of Tübingen has published on some halophilic bacteria. There's a little part of that story where we could see that these gene clusters were popping each other out. They were encoding an iron chelator that was functionally equivalent even though the two molecules were quite different, so a genome could have either one or the other. We never saw them have both. And there was lots of horizontal gene transfer happening. So in some way you could say that these gene clusters are in competition with each other. And then you've got predation and so on, with selfish genetic elements; we're all familiar with that. So this is just really to try to set the scene of pangenomes as a kind of ecosystem in their own right.
On the left here I've got expected patterns that we might see in pangenomes, and this is just a toy diagram really. If you look at the tree on the left, the blue branches are all genomes that are found in environment Y, and the green ones are found in environment Z. Then you've got the various different kinds of genes that you might see in a genome. Those grey ones, A, B, C, D, E and F, would be just clade-specific. G and H are environment-specific, both of them in Y; the green ones, J, K and L, are in environment Z. And then M, N, O and P show a bunch of different kinds of interactions. With M and P, I'll just draw your attention to it, they're avoiding: where you see M you don't see P, and vice versa. So these might be two genes that are in competition; there might be a gene dosage effect because they produce the same thing; there might be toxicity; and so on. So these are the kinds of backgrounds, and there's a lot more than I've just described there, but I'm going to spin along a little bit quickly to say how we've thought about looking at these patterns, trying to find out whether they actually exist or whether we've seen them in our own minds and they don't really exist.

The first effort was from Fiona Whelan, who was a Marie Curie fellow who came to work with me. She put together, along with Martin Rusilowicz, a piece of software called Coinfinder, coincidence finder. What it does is go through a bunch of genomes, using programs like Roary or Panaroo or any of those for making gene families.
Eventually, when you've got gene families, you represent each one as a node, and you connect it to another gene family if they are coincident in some way. They can be coincident by having a more similar pattern of presence and absence than you expect by chance, or by having a more opposite pattern, avoiding one another, than you expect by chance. We put in some Bonferroni corrections, and we try to account for tree structure in the data as well. We did a little analysis of 534 Streptococcus pneumoniae genomes when we were describing the software in the paper, and this is the kind of output that the program produces. I'll draw your attention to the kinds of things that we then started to see in the data. We're asking all the time about the influence of genes, or genetic background, on the presence or absence of other genes, so that they're modulating or varying the fitness effect of an incoming gene.

So this is just a collection of 51 gene families. The software gives us this result, and we asked: what does this result mean? These are V-ATPase complex genes. The ones on the extreme left here, in the two different kinds of red (I apologize for the color), are known and hypothetical V-ATPase complex gene families. You can see their distribution across these 500 genomes that we've analyzed, and you can see that pretty much wherever one of these gene families is present, you see the others there as well. There are a few exceptions, but not very many. They're really tightly linked; they form a clique. But as we go along, we find that for at least one of them, and quite often for most of them, there are a total of 51 other gene families that show a pattern of presence and absence that's more similar than you expect by chance to these V-ATPase genes.
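As a rough illustration of the pairwise idea described above, the sketch below scores every pair of gene families with a one-sided hypergeometric test and applies a Bonferroni threshold. It is not Coinfinder's code, and the choice of test is an assumption made for illustration rather than a statement of what Coinfinder uses. The function names and the toy data are hypothetical, and no correction for tree structure is attempted.

```python
from itertools import combinations
from math import comb


def overlap_p_value(k, n_a, n_b, n):
    """P(overlap >= k) if n_a and n_b genomes were chosen at random from n."""
    hits = sum(comb(n_a, i) * comb(n - n_a, n_b - i)
               for i in range(k, min(n_a, n_b) + 1))
    return hits / comb(n, n_b)


def coincident_pairs(presence, genomes, alpha=0.05):
    """presence maps a gene family name to the set of genomes carrying it."""
    genomes = set(genomes)
    n = len(genomes)
    pairs = list(combinations(sorted(presence), 2))
    # Bonferroni: two one-sided tests (co-occur, avoid) for every pair.
    threshold = alpha / (2 * len(pairs))
    edges = []
    for a, b in pairs:
        has_a, has_b = presence[a], presence[b]
        lacks_b = genomes - has_b
        # Co-occurrence: more shared genomes than chance would give.
        p_together = overlap_p_value(len(has_a & has_b), len(has_a), len(has_b), n)
        # Avoidance: a sits in the genomes that lack b more often than chance.
        p_apart = overlap_p_value(len(has_a & lacks_b), len(has_a), len(lacks_b), n)
        if p_together < threshold:
            edges.append((a, b, "co-occur"))
        elif p_apart < threshold:
            edges.append((a, b, "avoid"))
    return edges


# Toy data (hypothetical): 20 genomes, four gene families.
genomes = range(20)
presence = {
    "famA": set(range(0, 10)),
    "famB": set(range(0, 10)),     # same genomes as famA
    "famC": set(range(10, 20)),    # never found with famA or famB
    "famD": set(range(0, 20, 2)),  # unrelated to the others
}
edges = coincident_pairs(presence, genomes)
```

Each returned edge is a pair of gene families plus a label, which is the shape of network described in the talk: gene families as nodes, joined when their presence-absence patterns are more alike, or more opposite, than chance would suggest.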
They seem to be co-occurring with these others, and this is something that you might expect to see if the genetic background mattered: if the presence of these genes that we've represented in red was having an influence on the others, or vice versa. It's some sort of hint, if you like, that within pangenomes there are lots and lots of associations that we should start paying attention to.

We then moved on to some Pseudomonas strains. These are from cystic fibrosis lungs, and Fiona Whelan did this work as well. It's just again to show that when we looked at these, we saw a lot of relationships and patterns of presence and absence that you really can't explain by chance. They seem to be very strongly associated with each other. I don't know if you can see my cursor, but up here in C (it's a little bit small, but you don't really have to read what's on it) the leftmost column here is all the genes, and they're classified as being core genes, soft-core genes, shell genes or cloud genes, depending on how frequent they are in the data set. We pare this down to just look at the abundant accessory genes, so we get rid of the core, and we get rid of any singletons or very rare genes, because they're not really going to be interesting for us. So from a statistical perspective, they may be really interesting, but our method would never pick that up. Okay, so we just get rid of the rare ones and the core ones, and so we're looking at genes that show a pattern of presence-absence variability. The third column here represents the fraction that shows some kind of relationship: either they like to co-occur with some other gene, or they really avoid another gene. It's the majority. So when we're looking at these patterns, most genes seem to have a pattern of coincidence, by which I mean co-occurrence or avoidance, with at least one other gene.
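The frequency classes and the filtering step mentioned above can be sketched roughly as follows. The class boundaries follow the convention used by Roary (core at 99% of genomes or more, soft core from 95%, shell from 15%, cloud below that). The `min_fraction` cut-off for "very rare" genes and the decision to drop soft-core genes along with the core are assumptions for illustration, not the values used in the study.

```python
def frequency_class(n_present, n_genomes):
    """Label a gene family by the fraction of genomes that carry it."""
    fraction = n_present / n_genomes
    if fraction >= 0.99:
        return "core"
    if fraction >= 0.95:
        return "soft core"
    if fraction >= 0.15:
        return "shell"
    return "cloud"


def abundant_accessory(counts, n_genomes, min_fraction=0.05):
    """Keep families that are neither (soft) core nor rarer than min_fraction."""
    kept = []
    for family, n_present in counts.items():
        label = frequency_class(n_present, n_genomes)
        if label in ("shell", "cloud") and n_present / n_genomes >= min_fraction:
            kept.append(family)
    return kept


# Toy data (hypothetical): number of genomes, out of 200, carrying each family.
counts = {"fam1": 200, "fam2": 192, "fam3": 100, "fam4": 20, "fam5": 1}
classes = {family: frequency_class(n, 200) for family, n in counts.items()}
kept = abundant_accessory(counts, 200)
```

Only the families that survive this filter vary enough across genomes for a presence-absence comparison to have any statistical power, which is why singletons drop out even though they may matter biologically.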
That was quite surprising to us, but we've seen this again and again in other data sets. I'm just moving along here to look at one other part of this paper, and I'll just focus on this data up here. If we picked two genes at random from our pangenome and asked, what's your GO annotation (these are Gene Ontology annotations), then about 50% of the time they would have the same or a very similar GO annotation. Okay, half the time. If we look at the data that comes through from the Coinfinder pipeline, we see that it's much higher, around 70%: they have the same GO annotation. We think this means, to a certain extent, that genes that are functionally doing the same kind of thing either very strongly like one another or very strongly don't like one another, and that this kind of structure is coming through in the data set as well, and that they're a little bit more agnostic about genes that don't have the same GO annotation as themselves.

Moving along, then: Rebecca Hall, who was in my lab for a while, looked at E. coli accessory genes. This is a relatively small data set, about 200 E. coli genomes. We reduced it down a little bit because of runtime and the difficulty of wrangling such a big data set and making it work. This is just one analysis from this data set. These are a bunch of membrane proteins. We have this feeling that when things form complexes of two or three or four proteins, they should really always be found together, and that's actually not really the case for these. You can see, for instance, that these four complexes up here seem to co-occur quite a lot. But it's not always the case that every gene in a complex co-occurs with every gene in all of the others.
So there seems to be much more of a mixing and matching of genes in complexes like this, to a certain extent, in this particular group of E. coli. You can see avoidance here: this sorABFM complex here completely avoids this, this and this gene. If one is present the others are not present, and this is the whole complex in this particular case. So there is this basis of mix and match.

This analysis is based entirely on pairwise comparison of the presence and absence pattern of individual genes with other individual genes. When we make a network (I'll just go back to one of these networks), a gene family is a node, another gene family is another node, and they're joined by an edge. So it's a pairwise comparison, but quite often that builds up to being something much more complex. And it's a little bit slow. We were also interested in the question of whether we could get at more complex patterns, so not just a pair of genes, but maybe two, three, four, five, six, seven or eight genes forming perhaps a genotype, and whether, if we could iterate across lots of systems, we could begin to understand a more complex pattern.

So, for instance, here. This is an example of the random forest approach; I'm going to show you some preliminary data. Here we were looking at genes A, B, C and D, and we're trying to say: is there something about the presence or absence of A, B, C and D that implies to us whether gene X is present or absent? So this is the output: we're trying to explain the presence of gene X. This branch would mean that gene A is absent, and this one means that it's present, so you're dividing the data set. You're using pretty standard decision tree approaches to try to understand what's happening, and when we get down to the bottom here, we have gene X as an outcome: its presence is an outcome in three cases here, its absence in three others. By using a random forest approach we're trying to understand the influence of genes on one another.
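To make the decision tree idea on the slide concrete, here is a minimal sketch of a single tree that splits genomes on the presence or absence of genes A to D in order to predict gene X. It is not the lab's unpublished code. A random forest would fit many such trees to bootstrap resamples of the genomes, with random subsets of genes, and aggregate their votes; that part is left out here. The function names and the toy data are hypothetical.

```python
from collections import Counter
from itertools import product


def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means all the same)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels, features):
    """Return the gene whose presence/absence split lowers impurity most."""
    best, best_score = None, gini(labels)
    for f in features:
        absent = [y for r, y in zip(rows, labels) if not r[f]]
        present = [y for r, y in zip(rows, labels) if r[f]]
        if not absent or not present:
            continue
        score = (len(absent) * gini(absent) + len(present) * gini(present)) / len(labels)
        if score < best_score:
            best, best_score = f, score
    return best


def build_tree(rows, labels, features):
    """Return a leaf (0 or 1) or a (gene, {0: subtree, 1: subtree}) node."""
    split = best_split(rows, labels, features)
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    rest = [f for f in features if f != split]
    branches = {}
    for value in (0, 1):
        idx = [i for i, r in enumerate(rows) if r[split] == value]
        branches[value] = build_tree([rows[i] for i in idx],
                                     [labels[i] for i in idx], rest)
    return (split, branches)


def predict(tree, row):
    """Walk the tree for one genome and return predicted presence of X."""
    while isinstance(tree, tuple):
        gene, branches = tree
        tree = branches[row[gene]]
    return tree


# Toy data (hypothetical): X is present only when A is present and B is absent.
genes = ["A", "B", "C", "D"]
rows = [dict(zip(genes, combo)) for combo in product((0, 1), repeat=4)]
labels = [int(r["A"] == 1 and r["B"] == 0) for r in rows]
tree = build_tree(rows, labels, genes)
```

In this toy case the tree splits on A and then on B and ignores C and D, which is the kind of signal the approach is after: which genes in the background carry information about whether X is there.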
We again come out with a graph that's very much like this one, but this time it's a directed graph, where we can talk about the influence of one gene on another and how likely a gene is to be present: how likely gene B might be to be present if gene A is present, or how likely it is to be present if A is absent. That sort of thing. This is the first analysis we have. It's on 500 E. coli genomes, and it isn't published. This work was done by Alan Beavan in my lab. What you've got here is the result of the random forest approach. We've got a graph; it's got lots and lots of connected components, and it's got some rather big connected components in the middle, which we can break up using a community discovery algorithm like the Louvain algorithm or something like this. What is very nice about this is that we can ask about collections of genes that explain the presence of another gene, but we can also ask whether there are genes that explain the absence. And of course some genes are completely agnostic: their presence or absence doesn't explain the presence or absence of any other gene in the data set. But some strongly do explain presence or absence.

Now, this is fairly new data, so it's not what it's going to look like in final publication form, but I'll just take you through the two clusters that I'm highlighting here. The top one here is negatively interacting clusters, and you can see this sort of bar here. Just to say, group 36017, which is a gene family: its presence strongly seems to suggest the absence of group 36270, but not the other way around. That can quite often happen when there's a gene frequency difference, but we can put directionality onto it because of the way in which we look at the data here. These two gene families collectively imply the absence of the other one; the presence of 36017 implies the absence of 43090.
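The point about directionality and gene frequency can be illustrated with plain conditional frequencies, which is a much simpler stand-in for the random forest analysis and not the method used for the figure. The sketch compares how often a target gene is present when a predictor gene is present versus absent, and draws a directed edge only when that shift is large. The names, the `min_shift` threshold and the toy data are all hypothetical.

```python
def conditional_presence(presence, genomes, predictor, target):
    """Return P(target present | predictor present) and P(target present | predictor absent)."""
    with_p = [g for g in genomes if g in presence[predictor]]
    without_p = [g for g in genomes if g not in presence[predictor]]
    p_if_present = sum(g in presence[target] for g in with_p) / len(with_p)
    p_if_absent = sum(g in presence[target] for g in without_p) / len(without_p)
    return p_if_present, p_if_absent


def directed_edges(presence, genomes, min_shift=0.5):
    """List (predictor, target, effect) where the predictor shifts the target strongly."""
    genomes = list(genomes)
    edges = []
    for a in sorted(presence):
        # Skip genes found in every genome or in none: nothing to condition on.
        if not 0 < len(presence[a]) < len(genomes):
            continue
        for b in sorted(presence):
            if a == b:
                continue
            p_if_present, p_if_absent = conditional_presence(presence, genomes, a, b)
            shift = p_if_present - p_if_absent
            if abs(shift) >= min_shift:
                effect = "implies presence" if shift > 0 else "implies absence"
                edges.append((a, b, effect))
    return edges


# Toy data (hypothetical): a rare gene and a common gene that never co-occur.
genomes = range(100)
presence = {
    "group_A": set(range(0, 10)),     # rare: 10 of 100 genomes
    "group_B": set(range(40, 100)),   # common: 60 of 100 genomes
}
edges = directed_edges(presence, genomes)
```

Here the rare gene's presence moves the common gene from about 67% likely to 0%, so it earns an edge, while the common gene's presence moves the rare gene only from 25% to 0%, so it does not. The avoidance is mutual, but the information it carries is not, which is one way a frequency difference produces a one-directional edge.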
But down here we've got another collection, this sort of lilac group here, where the presence of lots and lots of genes strongly implies that another gene will be present as well. We're going through the data right now. I put this up because it's pretty much the newest thing that we've done, but it's just to say that in this kind of approach we're really just using fairly standard Python libraries and fairly standard approaches. But it is really beginning to tell us something about this ecological sort of situation. You can view this in a macroecology way: next time you go to your local park, you see lots of grass on the ground, and you see a big tree in the middle of the park. There's no grass growing under the tree. The tree doesn't care about the grass, but the grass is highly sensitive to a tree being present or absent. We're seeing these kinds of patterns begin to come through in the data set as well. The whole ambition is to try to tease apart, in pangenomes, the forces that influence the presence or absence of another gene in the genome.

I haven't really talked about specific genes here, just in passing, because there are just so many stories to tell. This data set, or this approach, gives you lots of stories, which can then be taken into the lab if you like, or it can imply, by sort of smoking-gun analysis, that some genes are implicated or involved in particular pathways and so on. The conclusions are that networks illuminate pangenome evolution, and that pangenomes themselves do seem to be an ecosystem, with antipathy, with positivity, negativity, pathogenicity within the pangenome itself. And we can see again and again that genes can predict the presence of other genes, and they can also predict their absence.
The work was all done by the four postdocs on top here, Maria Rosa, Fiona Whelan, Rebecca Hall and Alan, who wrote pretty much all the code and did all the data analysis. Thank you very much. I've come to the end; I hope I didn't go too fast or too slow.

Thank you very much, that was fantastic, and that's perfectly on time. Alice, do we have time for brief questions now?

If we want. If anyone does have questions that are reasonably brief, then type them in the chat and we can kick them off; otherwise there's a discussion session later. Although I think it's possible that they would have already put them in.

I have things, but I think I'm saving them for later.

Okay, that's it. Thanks very much, that was super interesting. I hope you have a full recovery, by the way.

Yeah.

Okay. So, time to move on one more time. And he's not going to be happy with this: I'm using the web page to get the titles of the talks, and the web page doesn't currently have the next speaker's title. I'm so sorry. It's your turn to talk but I don't know what your title is.

Okay. But the secretary uploaded it.

Yeah, but maybe he has to refresh the... Anyway, the title of my talk is Plasmid taxonomy. Okay, so can I start?

Absolutely, please do. Okay, if you go ahead and share I'll tell you if we can see it.