Yeah, bioinformatics meets neuroinformatics. What I want to do today is tell you about some of the work that we do in collaboration with a bunch of groups right across Europe. I'll try to remember to name them all as I go through, and apologies if I miss any of them out. We're looking at fairly large-scale raw data sets of the kind that would typically be analyzed with a bioinformatics approach, but then asking how we start to link up the levels of biological organization. What I mean by that is... great, the slides don't work. Good start. Predominantly, the information we get is at the molecular level, but what we really want to do is understand things like the brain and behavior. Traditionally there's a big jump, the so-called phenotype gap, between the genetics and molecular biology that you do in research and what you're ultimately interested in, which is the behavior of the system. Various aspects of this are going to be dealt with on the course this week. You've heard about some aspects of brain organization and how to handle brain data this afternoon, and you're also going to hear about modeling neural function and synaptic function from various people. So we're going to focus on how we start to dig down at this molecular level, which is usually kind of distinct from the rest of the process, and how you feed that up into models of neural function and synapse biology. So no one in here can see anything now. So why is this difficult? Bioinformatics has existed for a while. It can handle raw data sets pretty well, and there are really good, mature, advanced methods for doing this. What really comes into it is the complexity of the neuron itself. At the molecular level, we're interested in what's happening at the synapse: each of these little green dots on this very faint neuron that has been woefully projected on the screen in front of you.
Each of these green dots contains the molecular complexes that we're actually interested in, where information is being processed: chemical signals coming in and being translated into ion channel changes. But each one of those is a separate, distinct molecular unit, and bioinformatics doesn't really deal with that. It treats the whole thing as a bag, with everything happening in a general volume. There are exceptions, but in general it deals with things happening in cells as a whole, so the complexity of the neuron is a fundamental challenge to most bioinformatics approaches, and we're looking at different ways to address this. What we're trying to do is what we call systems neurobiology. We start with data and bioinformatics, these large-scale, high-throughput approaches that give us a lot of genetic data, proteomic data, transcriptomic data and so on, and assemble these into intermediate models that capture what we know about the molecules and how they interact and interplay with each other. That's the data integration step, producing the static models that you see in common systems biology approaches. But then we take that slightly further and capture some of the logical arrangement of what's going on in there, in terms of the rules that limit these interactions. The middle stage here gives you a map of all the possibilities, and then we can start to put constraints on that: which interactions can happen at the same time, what the competition for them is, how they're regulated and so on. We're looking at this as a way of feeding up towards what would traditionally be called neuroinformatics or computational neuroscience.
So what we work with, with our collaborators, in particular the groups of Seth Grant and Guus Smit that I'm going to talk about in various parts of the talk today, is synapse proteomics, where you basically take mashed-up animal tissue, extract the synapses from it, and then identify what proteins you find in there. I'm not going to go into the techniques in detail, but there are a number of techniques to get a global view of everything that's happening in all synapses, as well as more focused methods that allow you to pull down specific receptor complexes. So everything from 3,000 or 4,000 proteins in a single study down to two or three hundred proteins that are inherently linked to a specific neurotransmitter receptor. What I'll do to begin with is walk through one of the simpler examples. We'll take one of the smaller complexes and walk through the sorts of things that we do and what kind of information we can learn from that, and then I'll go briefly through some of the larger data sets and try to show you what we can learn from those and some of the challenges that the larger data present. So the first complex we analyzed, and this is work we did with Andrew Pocklington here and Seth Grant, is the so-called NRC or MASC complex. This is a complex of 186 proteins closely associated with the NMDA receptor and pulled down with affinity purification. We started off with a classic bioinformatics approach: identify all the molecules that are in there, a list of 186, and ask what we can say about those molecules and whether we can look for enrichment in specific things. So we looked for enrichment for key terms in what was known about these molecules. For instance, synaptic plasticity: were those molecules already known to be associated with synaptic processes?
Behavioral plasticity: if there was a mouse knockout or an animal model for those, did it affect any aspect of the animal's behavior? And then, if we translate that molecule to the human orthologue, is there any involvement in psychiatric disorders? Now, when we say some involvement in psychiatric disorders, we're not looking at very closely defined, proven genes linked to diseases. We're looking at very loose associations, for instance that the gene is overexpressed in post-mortem tissue. So this is fairly loose evidence. But what we see with this complex is that it's massively enriched for disease, behavior and electrophysiology. So just this very simple list analysis tells us that we're looking at something that is enriched for a whole pile of interesting proteins and is worth taking further. And we can do the stats to show this is not a random subset either. I'll show you how we do that later. We then want to see a bit more about the organization of it. So far that's just a list, and you can do various list analyses. You can look at enrichment for functional types of proteins and so on, and we're not going to go into that. What we really want to do is look at how it is organized. How do these molecules interact with each other? Because fundamentally that's how they work. That's how things are going to happen in a molecular complex. So what we did for this particular example was start off with the online protein-protein interaction databases. We took the proteins that we got from the biological tissue. We did our database searches for protein-protein interactions. We pulled out all the papers that were associated with those protein-protein interactions. We read them. We threw half of them back out because the genes weren't actually the ones that were said to be in the paper. We then did an extensive literature search using text mining and so on.
And then, basically, every single interaction in there is something which has been read by at least two biological experts to check that the protein-protein interaction is verified and that we can go back to the original sequence. Each one of those little interactions, a few years ago, was someone's PhD project. So we owe a big thank you to all the people who actually put in those little black lines that we normally just ignore in these diagrams, and I'll take the opportunity to say that now. But this is the first stage. This builds us an interaction network, a potential map of interactions. We're not saying they always happen this way. We also have to admit that a lot of it is going to be missing: for proteins that have never been studied before, nobody has looked for the interactions, so we don't know they exist yet. I'll come back later to some things that we're doing to address that. But this is the first step in building these static interaction network models. When we do this and spread it out, we can start to see some structure emerging. What we've done here is a fairly simple cluster analysis. We've taken the entire map of all the protein-protein interactions and split it out into clusters of molecules which tend to interact with each other more closely than they do with the other partners in the network. What we actually did for this is we tested about eight or nine clustering algorithms, printed the results out on paper, took them around a bunch of biologists and said, which of these makes sense? And this is the one that really stood out. This is an algorithm by Newman and Girvan for community architecture in networks, which effectively works by selecting a protein in your network at random and then taking a random walk through the network. Every time you go through an edge, a protein-protein interaction, you add one onto the value of that edge.
You do that a hundred thousand times, and you end up with a ranked order of the number of times you've gone through every single protein-protein interaction. You then remove the top-ranked interaction and start all over again, and effectively what happens is that the network fragments into these clusters. But when we take these clusters and look up where the molecules from the initial list analysis fall (remember, we looked up disease association, molecule type and so on), what you find is that those annotations tend to congregate in particular clusters. So, for instance, up at the top in this cluster here, we're starting to get an enrichment for ionotropic glutamate receptors, but also for molecules linked to schizophrenia, whereas we're getting metabotropic glutamate receptors over in this other cluster, closely linked to each other, but with more of an association with depressive illness rather than schizophrenic illness. So we're starting to see hints of things we can go back into the lab and test. We wouldn't take this as ground truth and say, right, fine, we understand everything about schizophrenia or depressive illness now, but at least it gives us some more target molecules to go back and look at: have some of these molecules ever been looked at in terms of their association with the diseases? So it gives us a way of informing what we're doing. The other thing that emerges from this is quite an interesting kind of basic architecture. We can see, for instance, that the two blue clusters are very obviously receptors, membrane-bound receptors. And this has been seen in several studies. This isn't something unique to what we're doing, it's been seen quite often. So you get this kind of input layer, these input clusters, with information coming into the complex. We've got a large central complex in red here that's involved in basic processing of that information.
And then a series of output pathways, kind of classic output pathways, for instance the MAP kinase pathway coming out there. We also get things that are involved in anchoring the whole complex to the cytoskeletal system and all the rest of it. So overall, the map makes a reasonable amount of sense. Now, instead of having to show you this network model all the way through, when I come back to it I'll probably just use this little cartoon, where we've got two receptors out on the cell membrane, a processing cluster and some output ones, just to give you a feel for how we do it. So once we had this in place, we said, well, where did this come from? This didn't just appear miraculously. This came from some evolutionary process, from unicellular organisms all the way through to the mammals that we actually did the work in. So we started off, and this was with Richard Emes, by looking not just at this complex. We looked at a slightly larger one as well, but I'm going to focus on the smaller complex again. We started off just by doing a bioinformatics approach. Given that our raw data came from the mouse proteomics, how many of these can we map very accurately onto 19 other species for which there was a well-annotated genome at the time we did this work? So we took a total of 651 synaptic proteins: the smaller cluster that I've just shown you, plus a larger collection of general synapse proteins. And the story that emerges is quite interesting. What you find is that across the mammals, and even just the vertebrates, you can pretty much find an orthologue for almost everything that you found in the mouse. In fact, a lot of the stuff that's missing, a lot of these slight changes in the graphs there, really just reflect the quality of the annotation of those genome databases at the time. There are exceptions, but almost everything is one-to-one orthologues.
But if you go back into the invertebrates, there's a big drop. Effectively, if you go to Drosophila, which is the one that was best annotated, it's about 47% of the proteins that we find in the mammalian brain that we can also find in the Drosophila genome by bioinformatics approaches. Interestingly, you can even trace 23% of them back to yeast, a unicellular organism with no nervous system. So 23% of it is there. In fact, for every major class of molecule that we have in the pull-down, you can trace at least one orthologue back to yeast. So all the building blocks are there in a unicellular organism. Now, of course, yeast has gone through just as much evolution as we have. We all come from a common ancestor, and yeast isn't the ancient genome. But it is interesting that everything is effectively there. So the model that emerged from that, and that's the one we discussed a few years ago, is one where we have some sort of primitive stress response that's effectively present in all unicellular organisms, which, through an increase in genome complexity, allows simple learning or cognitive processes at the molecular level. And in the mammals, it's effectively that there's a bigger repertoire of molecules you can choose from. Of course, there's one alternative explanation to all this. There are 300 million years of evolution between Drosophila and mouse, which are the two things that we'd studied most closely. Perhaps all that happened is that there's another 50% missing there, something we didn't see in the proteomics in the mouse, that would be there in the fly if you went and looked for the same complex in the fly. We'll come back to that in a second. What we also did was look at the gene expansion by class of molecule. We split the genes that we'd found into seven or eight broad classes: the scaffolds, kinases, channels, receptors and so on. And then we looked at the origin of them.
And what you find is that for the clearly synapse-associated ones, so the first ones, you see a small number going back to yeast but a big expansion in the invertebrate and a big expansion in the vertebrate lineages. So there's a small number of these you can trace to unicellular organisms, and then a big jump at these major evolutionary boundaries. If we look at the bottom ones, the ATP synthesis, heat shock and chaperone proteins and the transcription and translation machinery, what we find is that most of those we can trace back to single orthologues in yeast, and only a very small number of additional ones appear to be more recent proteins. Those also map spatially onto the cluster diagram that we made before. The recent innovations, the ones that appear to be invertebrate or vertebrate specific, are more likely to map to the input and the processing areas. The more basic processes are much more likely to be in the output ones, and that's why we see things like the MAP kinase pathway in there, which is conserved across everything we've ever looked at. So back to my alternative explanation. Is the brain of this thing here really that much simpler? Is it really 50% simpler? Okay, it's a lot smaller, it doesn't do as much. Or is it just different? Are there just 300 million years of evolution that have allowed it to bring in other things? So what we did was go back and do exactly the same experiments again. I'm afraid those slides are not reproducing very well. So we basically redid the proteomics. Slack connection, maybe? Experiment works. If only fixing them in real life was so easy. So we went back and in effect repeated the experiment, as closely as we could to what we did with the mice. Unfortunately, the four grams of brain tissue you'd use from a mammal is a little bit difficult to get from a fly brain. It equates to roughly 10,000 fly heads. So we had to collect 10,000 fly heads.
But then we did effectively the same pull-down protocols, where we pull down the NMDA receptors from those animals using a hexapeptide affinity purification technique. And the controls all work. We show what we get. Some of the initial key proteins come down, and the key scaffolding proteins that we find associated with the mammalian synapse we still get in fly. So we're getting something like the same complex. But when we actually look at what we get, first of all we're pulling down from whole heads, so it's a little bit dirtier in terms of things like some of the very basic kinases. But if we filter these for the synaptic or synapse-related proteins, what we get is that this fraction of channels, receptors, cell adhesion molecules, G proteins and signaling molecules in general is roughly 50% of the size that you find in the mammal. So that's the fly fraction there and the mouse fraction there. If we go back and see where these proteins come from, we get a very, very similar story. The cytoskeletal and cell adhesion molecules are largely conserved with yeast, with an expansion in flies and some fly-specific ones coming in, but not many. And when you go to the very basic things like transcription and translation, you find that most of that is of ancient origin. That's not particularly anything new. So it's the same story as we're seeing in the mouse, there's just less of these signaling molecules. So most classes, as I said, are already present, there's a large expansion in the invertebrates, and there's a larger expansion in the vertebrate lineage, but that expansion is targeted. It's these upstream signaling and structural molecules where more of them have cropped up in the vertebrates. So we also had a look at where those are expressed, and this is pre-Allen Brain Atlas days, so this had to be done through a variety of techniques.
This was led by Chris Anderson in Seth Grant's lab. They collated various different data sources, some of which they did in-house and some of which they got from collaborators: Western blot analysis of dissected brain regions, immunohistochemistry on animals, in situ hybridization, and microarray data, again from dissected brain regions. So a variety of quantitative and qualitative data, altogether giving information on up to 148 proteins, although obviously the immunohistochemistry covers a smaller subset of those proteins. To summarize this very briefly, what they effectively found was that if you look at the proteins conserved with yeast, so the ones from unicellular organisms, you find that they're very uniformly expressed in the mammalian brain. The ones that are metazoan, the ones shared with invertebrates, tend to be intermediate. There's a mix of ones that are very specific and ones that are very uniform across the brain. And the vertebrate innovations are the ones that are most likely to be very specific to different brain regions, indicating that this is possibly what's allowed the brain to develop its complexity, or at least that the increasing complexity in the mammalian brain is inherent in those new innovations. It's one explanation, but there are others. So again, we went back and said, okay, fine, that's an interesting story, but what about flies? Do we see a similar thing there, or is this just an artifact of the species that we looked at? So we also looked at flies by tagging neural proteins. We worked with Steve Russell's group in Cambridge, where they tagged and we screened random proteins, 400 or 500 of which were actually expressed in the brain. We used a mobile genetic element which carries mini-white as an eye color marker in flies, but it also has proteomic tags, so we can do affinity purification from these, and a GFP marker.
And this is designed to go in via the splicing mechanism, so the tag ends up within the protein. So we've essentially TAP-tagged 500 neural proteins. The insertion sites were cloned, so we know which ones we've tagged. We've done Westerns to confirm that the gene model is correct, and we know which splice variant of the protein we've got these tags in. And then we started looking at the brain expression pattern for these, effectively dissecting and doing a 3D reconstruction for each of them. These are also aligned onto a common reference, so you can compare one protein's expression pattern against another. This is just three of them overlaid onto each other and then annotated. And the take-home message from this is exactly the same as for the mouse. 77% of the scaffold proteins show regional specificity, in that they're in one brain region and not another. 81% of the arthropod innovations, these are the invertebrate-specific ones, vary. But the transcriptional genes, for instance, show very, very even expression. When I say even expression, I mean in every single neuron throughout the entire brain, so we've got uniform expression. So we've got a model emerging from this where expression variability is greatest in the upstream signaling and structural proteins, so both in this region here in the model and also in here, and it is very, very conserved down here. So the signaling complex is recycling and reusing very ancient, conserved signaling cascades. Lineage-specific innovations, in other words the ones that are closest to the speciation event, tend to be the genes that vary the most in their expression pattern. So they're much more likely to be involved in species-specific differences. Whether that translates to behavior or any cognitive processes, we don't know yet. That's not been looked at. That's just the trend that's emerging from this, and it's something to be tested.
And we've got a caveat on the data for that, but there's a common core of neural molecules expressed in homologous brain regions as well. For instance, if we look at the proteins that are expressed in gustatory control regions in the fly, we find that the same ones are expressed in gustatory control regions in the mammalian brain as well. That's not particularly significant in terms of the numbers, but the trend is certainly there in the data. I'll just show you what we can do. The model that was proposed for that was the idea that this increased availability of signaling complexes allows greater diversity of brain regions, larger brains, and also an increased range and power of cognitive processes. So that's the sort of thing we can do by bringing bioinformatics and systems biology approaches together on small networks. But obviously we want to go to the larger networks. We know already that that's just one very small receptor complex. How do we scale to the 1,000, 2,000, 3,000 proteins that you can find in modern proteomics studies? With increased sensitivity and better methods, you can start to find an awful lot more molecules in these studies. How do we scale to that? So we spent some time developing a range of bits of software to improve this. This is work that was led by Ian Simpson in the group. For instance, for getting the protein-protein interactions, you tend to do your pull-down in one species, but you want to aggregate evidence from every species where there are protein-protein interactions. I said earlier on that there's very often data missing. You need to go out and look for evidence that two proteins interact.
So this piece of software effectively works on the principle that if, for instance, you've got two human proteins and you want to know whether they interact, and the two orthologues of these in mice are very similar and interact, you can say with very high confidence that they will also interact in human. Obviously that confidence becomes less as you go further in terms of evolutionary distance, or at least in terms of protein sequence. So this allows you to say, here's my candidate list of proteins, can I find evidence from any species, and then rank that evidence by evolutionary distance. That's publicly available. These are in various papers, and we're going to make them available for anybody who wants to use any of this stuff. The other thing was the clustering algorithm that we chose, the one that gave us the best results, this Newman and Girvan one, which is computationally fairly expensive. So that's been re-engineered by Colin McLean in the group, and again there's some open source code available for anybody who wants to try it. On small networks, it's just fine. Once you get up to 1,000 molecules, it starts to slow down considerably. If you think about the complexity of the random walk on networks of that size, it just gets computationally expensive. The other thing you obviously want to be able to do is test how robust the clusters are. So, for instance, if you add noise or you remove information, how likely is this protein here to jump out of one cluster and into another? That's built into these systems, and again there's a software package designed for measuring that confidence, so you can get a measure of how confident you are in the cluster results. So we've put the tools in place to scale this up a little bit.
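
The edge-removal clustering described in the talk can be illustrated with a small sketch. This is not the group's re-engineered code, and it makes one substitution that should be stated plainly: edges are scored by shortest-path betweenness, a deterministic stand-in for the random-walk scoring the speaker describes. The function names and the toy network are hypothetical and exist only for illustration.

```python
from collections import defaultdict, deque


def edge_betweenness(adj):
    """Shortest-path edge betweenness (assumed stand-in for random-walk scoring)."""
    score = defaultdict(float)
    for source in adj:
        sigma = {v: 0 for v in adj}   # number of shortest paths from source
        sigma[source] = 1
        dist = {source: 0}
        preds = defaultdict(list)
        order = []
        queue = deque([source])
        while queue:                  # breadth-first search from source
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):     # push path counts back towards source
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1 + delta[w])
                score[frozenset((v, w))] += share
                delta[v] += share
    return score


def components(adj):
    """Connected components of the current network."""
    seen, found = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        found.append(comp)
    return found


def split_into_clusters(edges, n_clusters):
    """Remove the top-scoring interaction repeatedly until the network fragments."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    while len(components(adj)) < n_clusters:
        score = edge_betweenness(adj)
        if not score:                 # no interactions left to remove
            break
        a, b = max(score, key=score.get)
        adj[a].discard(b)
        adj[b].discard(a)
    return components(adj)


# Hypothetical toy network: two tightly linked triples joined by one interaction.
toy_edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"),
             ("D", "E"), ("D", "F"), ("E", "F")]
clusters = sorted(sorted(c) for c in split_into_clusters(toy_edges, 2))
print(clusters)
```

Rescoring every remaining edge after each removal is what makes this family of methods slow on networks of a thousand or more proteins, which fits the speaker's remark about computational cost.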
So in terms of videoing stuff, I'm going to show you some of the more recent results. What I've shown you so far is the publicly available stuff. So we're going to edit this next bit out. As I say, what happens in Vegas is going to stay in Vegas. What I want to tell you a little bit about is the next generation of these studies. This is what we've been doing with Guus Smit and a number of others in Amsterdam, where they've taken this model of what they already think is happening at the synapse. This is based on proteomic studies already. There's a good number of molecules on there. It's quite a complicated model. And what they've done is identify what they consider the important molecules. That's a very subjective term, and they'll be the first ones to admit it. So they've identified 50 important molecules, and this is at the presynaptic region rather than the postsynaptic region. And they're doing immunoprecipitation from all of those. So they now have a pipeline set up where they can do synaptosome enrichments, purify synaptosomes from biological tissue, extract the protein complexes and then do immunoprecipitation against those 50 important molecules. What they've also done is not just the 50 that are there. They've done the pull-downs from those and then said, you know what, there's a bunch of other proteins that we get when we pull down with these, so let's do the next lot as well. So the number of IPs that we're looking at isn't one or two, it's now 90, with controls. So we're now looking at one of the largest-scale proteomic studies that I think has been done in the synapse. That's just some examples of the original raw data, where the gel is cut up into slices and those slices then go into the mass spec for identification. And obviously this involves a large number of people. I wish it was done here, but it is largely from Guus Smit's and Matthijs Verhage's groups.
So they deserve all the credit for this. So 90 baits so far, that's 90 sets of antibodies. In fact, sorry, not 90 sets of antibodies, that's 90 sets of proteins that have been pulled down with multiple antibodies. We've identified 2,100 partners so far, of which a bunch were obvious contaminants, things like bovine serum albumin that's been spiked into the protocol, antibody fragments and so on, so we can remove those quite nicely. Once we've cleaned all this up, it comes to about 2,025 proteins that you can identify from the presynaptic region. We map those onto stable IDs, so this is what comes out of the mass spec. We can get unique mouse IDs for 97% of these, and we can map those onto the human orthologues for 94%. So we can get pretty good recall on this. This is still in progress. It's been slightly cleaned up in the time since I built this slide. In terms of what we already know, these names won't mean much to you. Build 2 is the antibody list, that's just a thing. Mouse PSD is just how it overlaps with proteins that we already find in postsynaptic density, not presynaptic, immunoprecipitations. You see, there's quite a substantial amount of overlap. Now, we wouldn't necessarily think all those are contaminants, we'd just think there are a lot of common proteins. And when you drill into the evidence, we can find evidence for both pre- and postsynaptic localization for a lot of these proteins. We obviously haven't looked at 600 lines of evidence for this. We've just taken some key examples out of it at present. Presynaptic is a smaller list, but obviously that was put together before we did the large-scale proteomics, so it's just our known list of presynaptic proteins. There are about 619 proteins which haven't been linked into a synapse molecular model so far. So a lot of new stuff to work with.
So again, we work with the same sort of process: how do we reconstruct this into a model that maybe makes a little bit of sense? We've got a bunch of protein-protein interaction databases, we've got the homology interolog walk that I mentioned, where we look for the evidence from other species, and human protein interaction databases as well, and we collapse all the common lines of evidence over a reasonable confidence threshold. That basically allows us to make a network of 1,308 proteins out of the original 2,025 clean proteins. So you can see there's still a lot of stuff we can't connect into these models, and there's a lot of protein-protein interaction data still missing. But it's 1,308 proteins with 8,500 interactions. And that's what it looks like. So we now understand the presynaptic signaling complex, because we can show it on the screen. But this is basically just the entire map of interactions, and it's just been clustered with the same clustering algorithm that I showed you for the other one. So it scales. The proteins in each of these wheels are effectively more commonly connected to each other than they are to the neighboring partners. That's all this is really showing. So how do these vary? One thing that we thought we could potentially do with this data set comes from the fact that we already know that not every protein is expressed at every synapse. So can we start to sample from this and get a feel for it? When we pull down with different baits, do we add and remove clusters? For instance, if we pull down with a bait on here, so an antibody for a protein in here, do we get everything that's in here and maybe one or two or three of these other ones? Or do we get a few of these things and a few of these things and a few of these things? That's the first thing we wanted to test. Does everything just fall apart? What we've artificially done here is stick everything back together in a way that never exists.
But interestingly, what we actually get is that very few of the communities hold together that way. The baits are nicely spread over the network, so it's nice and easy to do this one. But what we get is an even distribution of internal and external edges. When we pull down from one of these, we're just as likely to get things from in here as we are to get things from across the network. So what you don't get, if you pull down this, is just this and maybe one or two others. What you get is a couple from here, and you sample from elsewhere across the network. So there's a lot of diversity in there. It's hidden by the proteomics when you've mashed everything back together, and we're going to have to start looking at it and dealing with it. So the other thing we obviously wanted to look at was diseases. Is this interesting? We did that with a small complex; how does it work with a larger one? So again, 1,300-odd proteins, and I've looked for how many are then linked to disease. This is using GeneRIF. We could also use OMIM and various other things. There are limitations to these databases, but they are the ones available that scale to that size of analysis. So what we've got is, for instance, for Alzheimer's there are 44 associated with that. We then test that against a random sample, and something of the order of ten to the minus four is the likelihood of getting that at random. But of course, as I said right at the start of this little bit, the baits were so-called interesting proteins. So if you're going to do a proteomics experiment and you had an Alzheimer's target, you'd probably include it. So we tested for that by removing all the baits. And yes, for instance Huntington's: it turns out, and we discovered this post hoc, that the two Huntington's-related proteins were in there because they knew they were related to Huntington's, and they were the only ones they put in. So yes, the p-values do become a little less significant, but we're still looking at very significant enrichment over random samples.
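One common way to test this kind of enrichment is a hypergeometric test, sketched below. The talk does not say which test was actually used, so treat this as an assumption; the gene names, background size and the `enrichment` helper are invented. The `exclude` argument mimics the control of removing the baits before re-testing.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when drawing n genes from N, of which K are disease-linked."""
    upper = min(K, n)
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def enrichment(network, disease_genes, genome_size, exclude=frozenset()):
    """Count disease genes in the network and test against a random draw.

    Genes in `exclude` (for example the baits) are dropped from the
    network, the disease list and the background alike. This assumes the
    excluded genes are part of the background genome.
    """
    exclude = set(exclude)
    kept = set(network) - exclude
    hits = kept & set(disease_genes)
    background = genome_size - len(exclude)
    linked = len(set(disease_genes) - exclude)
    return len(hits), hypergeom_sf(len(hits), background, linked, len(kept))


# Toy numbers, invented for illustration only.
network = {"g1", "g2", "g3"}
disease = {"g1", "g2", "g7", "g8"}

hits_all, p_all = enrichment(network, disease, genome_size=10)
hits_no_bait, p_no_bait = enrichment(network, disease, genome_size=10, exclude={"g1"})
print(hits_all, p_all, hits_no_bait, p_no_bait)
```

In the toy case the p-value gets weaker once the bait `g1` is removed, which is the direction of the effect described above.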
Look at these. In terms of the structure, though, we get one cluster that's significant for Alzheimer's. So these are the p-values for the network overall, but you can also then map those back onto the clusters that we saw. So let's look at the density of the clusters. This is just the Alzheimer's disease genes in orange, laid over the top of the network, and you can look for the enrichments. This is the cluster where it's significantly enriched. You can see where everything is and then highlight those ones, most of which are either candidate or known drug targets already, or have at least been proposed as potential drug targets. There are various screens going on for quite a few of these. So what about the evolutionary origins? We've done this for the postsynaptic density. What about the presynapse? So we did the same thing. We mapped all the mammal genes onto fly and yeast. We didn't do the full 19 species; we just did the core ones that we were interested in. We found all the orthologues and classified each gene as mammal-specific, metazoan or potentially primitive in terms of its origin. Just to remind you, this is basically the same analysis as before. That's what we did for the postsynaptic density, where we got 45% in Drosophila, and the slide looks a bit better now, and 23% in yeast. But when we do it for the presynapse, we get almost all of them mapped onto humans, as we would expect, but with an orthologue in fly for 77%. So for the vast majority we can find a clear orthologue in fly, much higher than we expected to find if we were assuming that the same pressures applied in the presynaptic and postsynaptic regions. Orthologues in yeast were more or less exactly the same as before: the 30%, sorry, 24%, versus 23%. So there's a jump from 45% in the postsynaptic density to 77% here.
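The classification by orthologue presence can be sketched like this. The three class labels follow the categories named above, with "ancient" standing in for "potentially primitive"; the rule that a yeast orthologue implies an ancient origin, the gene table and the function names are assumptions made for the example.

```python
from collections import Counter


def classify_origin(has_fly_orthologue, has_yeast_orthologue):
    """Assign a coarse evolutionary class from orthologue presence."""
    if has_yeast_orthologue:
        return "ancient"          # shared with a single-celled eukaryote
    if has_fly_orthologue:
        return "metazoan"         # shared with fly but not yeast
    return "mammal-specific"      # no orthologue found in fly or yeast


def orthologue_summary(genes):
    """genes maps a gene name to (has_fly_orthologue, has_yeast_orthologue)."""
    classes = Counter(classify_origin(fly, yeast) for fly, yeast in genes.values())
    total = len(genes)
    fly_fraction = sum(1 for fly, _ in genes.values() if fly) / total
    yeast_fraction = sum(1 for _, yeast in genes.values() if yeast) / total
    return classes, fly_fraction, yeast_fraction


# Toy orthologue table, invented for illustration only.
genes = {
    "g1": (True, True),
    "g2": (True, False),
    "g3": (True, False),
    "g4": (False, False),
}

classes, fly_fraction, yeast_fraction = orthologue_summary(genes)
print(dict(classes), fly_fraction, yeast_fraction)
```

The fly and yeast fractions are the quantities being compared between the presynaptic and postsynaptic sets.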
Again, we can map those onto the network, so we can look and see where the ancient proteins are within this, and look for clusters that are enriched either for or against those things. So we can see, for instance, a whole pile of structural proteins clustering together, chaperonins and things like that which are of ancient origin. And we can look again at the metazoan ones, where we're getting structural scaffolds or signaling scaffold molecules and ion channels pretty much clustering together. So that's where we're getting to with this. Where we're going next: of course, we've assumed that this is all one big mush, and of course it's not. There's, for instance, the active zone, and there are various other parts of the cellular organization we've just not taken into consideration at this point. So what we're actually in the process of doing is defining little groups of these and redoing the analysis on those groups. It doesn't necessarily make a lot of sense to put all these things together. We have the entire map of possibilities now, and so that's what's coming next. So just to summarize where we are with that: the IP data supports this kind of diverse population of synaptic complexes, but we still need to divide these data sets up so they make a little bit more sense. The diversity might be an artifact of the fact that we've lumped everything together. There's strong enrichment for specific diseases. At the moment we see one cluster enriched for Alzheimer's, but again that is possibly an artifact of us mushing everything together. As we split it out, we might see enrichment in other clusters, when that makes a bit more sense. But fundamentally we are going to need more and better interaction data, and that's just something that's going to take time to come. So we're working with interactomics groups who are actually doing high-throughput yeast two-hybrid screens on these data, but that's not available yet.
That's going to come in the next year. But I think that's a common story for anybody doing protein-protein interaction networks. It's not just the data analysis; getting the interaction data is fundamentally difficult. It's expensive to generate. It's noisy. Most of the methods are hard. So we have to be careful with these. As I said, the known one was a list that we got from one of the groups who were doing this, in terms of their confidence. Where is it? So that 270 are in there? Yeah, I mean, we need to look at them in a bit more detail and see what they were and why they're not in there. Would you have expected to find them? Obviously, you know, all 50 baits are from this list as well. So we don't know what they are yet. I haven't looked at them yet. So those kinds of methods allow us to take these raw data sets and build these static representation maps. They're basically a way of doing data integration. So we can get a map of all possible interactions, or at least all known interactions. As I said before, a lot of interaction data is missing or has never been analyzed, so we have to assume that the networks we're dealing with are a sparse representation of what's really there. But they are just a static representation. And the nervous system, if we know one thing about it, is not static. There is competition for binding sites. There is more of some molecules than there is of others. And we're looking for ways to build that level of understanding into the models. And so Oxana Sorakena in the group has been leading development of the next level, where we're going with this, where we can try and look at more logical models that allow us to capture at least some of the dynamics. Now, these are not full dynamic models, but they capture some of the logical processes and relationships between the types of molecules that are there.
Now, the advantage of the static maps is that we can scale them effectively to 2,000, 3,000, 4,000 molecules. With the logical models we have to go down a level, an order of magnitude in terms of size, just because of the complexity of the model itself. So Oxana's approach is to use the Kappa modeling language, which she's been working on in collaboration with Vincent Danos's group, and which allows us to look at the types of rules involved in the interactions between different classes of molecules. And we can abstract that to a class of molecules. We don't have to model every single interaction for every single molecule. We can say, for instance, PDZ-type interactions, and we can classify or model those just at the level of the general interaction type. And we can define rules for the common ones. Where possible, we can go to the literature and get the dynamics for those, or eventually go to the lab and get them. At the moment, those are estimated where they're known from the literature. And then we start building these models up. Those then allow us to actually simulate the formation of molecular complexes, because they now have the rules in them that say that the interaction between A and B requires a phosphorylation at a specific site, or there's competition with other interactions, or there are various other constraints. So the first level of these models is shown there, and that's just basically capturing the types of molecules that we've got in the model and the various interactions between them. Each one of these black lines basically says that there is a rule within the model system that defines how that interaction occurs and what we know about it. For a lot of these, it's estimated from the literature. What we can do from this is actually assemble virtual molecules or virtual complexes. Given this, you can make the proteins available and then essentially let them compete to see what you can build together.
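To give a feel for what competition for binding sites means in practice, here is a deliberately simplified Python sketch. It is not Kappa and not the model described in the talk; the site names, partner names, abundances and the `assemble` function are all hypothetical. Each site on one scaffold can hold a single partner, and partners are chosen with probability proportional to how many copies are still available.

```python
import random


def assemble(scaffold_sites, abundance, rng):
    """Fill each binding site of one scaffold with at most one partner.

    scaffold_sites: site name -> list of partners that could bind there.
    abundance: partner name -> number of free copies (hypothetical values).
    """
    pool = dict(abundance)  # copy so the caller's counts are untouched
    bound = {}
    for site, candidates in scaffold_sites.items():
        available = [p for p in candidates if pool.get(p, 0) > 0]
        if not available:
            continue  # nothing left that can bind this site
        weights = [pool[p] for p in available]
        partner = rng.choices(available, weights=weights, k=1)[0]
        bound[site] = partner
        pool[partner] -= 1  # that copy is now used up
    return bound


# Toy scaffold: five potential interactions, but only three sites.
scaffold_sites = {
    "PDZ1": ["A", "B", "C"],
    "PDZ2": ["B", "C"],
    "SH3": ["D"],
}
abundance = {"A": 5, "B": 1, "C": 0, "D": 2}

bound = assemble(scaffold_sites, abundance, random.Random(0))
print(bound)
```

A static network would draw every listed interaction at once. Here the scaffold ends up with at most one partner per site, and setting a partner's abundance to zero acts as a crude virtual knockout.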
So this is just one of the very early simulations, which was done with a limited number of the molecules available. Highlighted are PSD-95 and one of the other scaffolding molecules, in red and blue respectively, to look at the kind of distribution you get. Now we've got PSD-95, for instance, interacting with two or three molecules within the complex, instead of the 45 potential interactions that we'd have had if it was in a static network. So there is now competition for the interaction sites within each molecule, and you're starting to get what we believe is a little bit more realistic. What we can also do with this, obviously, is say: what if we take some of these molecules away? What happens to these simulations? So for instance, we can zoom in on this, then remove PSD-95, so essentially virtually knock the thing out, and see what happens to the complex. First of all, what we notice is that the molecular complex you can support is half the size that it is with PSD-95. And that's something that seems to be borne out in animal studies as well. The PSD-95 knockouts are lethal, but if you can get some neurons through to the right stage you can actually get small complexes out of these animals. But what we notice is that we can then start to make predictions. Even with this very, very simple model we can start to say that Shank, for instance, now has a much stronger role in pulling the network together. Now, we wouldn't necessarily rush out and do a whole pile of experiments based on this. This was one of the very first models, but it shows that we can actually start to make predictions from this. We can up- and down-regulate the availability of different molecules and look and see what other things can come in to potentially compensate. So we're starting to get to that level where it's a little bit more predictive: rather than just mapping what we know, we can start to get some predictions from this.
The sort of complexes that we're getting now look a little bit more like this. This is color coded; for instance, in red we've got the known membrane-bound molecules within the complex. The molecular density of this is approximately right, in the right order of magnitude. And so, for instance, as I just said, we can highlight the membrane-bound proteins and then look at the distribution. If we linearize this, flattening all the membrane-bound proteins out, we can start to look at, say, where the kinases in blue are now distributed on various chains that project, presumably, into the cell. This isn't real spatial distribution, though. This is just us flattening things out and putting the membrane-bound things in. And we know that spatial organization is important. So one of the things we've been looking at recently is how we extend this into actually capturing some of the spatial rules as well as the interaction rules. A typical example is AMPA receptor trafficking, where there are various stores of AMPA receptors around the cell, and the regulation of their incorporation into the postsynaptic density is incredibly important to regulating its function. But the language as it stood was not capable of building that in. So Oxana's been working with Donald Stewart and Vincent Danos to extend the modeling language itself to be able to capture those rules as well. She can now bring in spatial constraints, where effectively you have available AMPA receptors in a space which can integrate with the PSD-95 molecules in a postsynaptic density, and those relationships can now be captured, so that you get the AMPA receptor molecules slowly incorporated into the PSD-95 network.
Again, you can then start to model that by, for instance, starting off with a distributed population of these things and saying: right, over a period of time, what's the integration of the AMPA receptors into the PSD complex? And you can see it reaches more or less a steady state over a period of time. But then we can say: right, okay, let's remove some of the key molecules. We know, for instance, that PSD-95 is critical in incorporating AMPA receptors into the complex. We can then reduce the availability of PSD-95, and you can start to sample how much you would then reduce the incorporation of AMPA receptors. So this is just reducing it by about two thirds in terms of its availability. If we almost knock it out, you can then start to see that the AMPA receptors don't get incorporated at all, or just at random noise levels. So this is the incorporation rate here. These are the other molecule types. And then we remove PSD-95 altogether, and it's pretty much flat. So what I've tried to do is go through what I think is one route, where we start off with these big raw data sets coming from high-throughput molecular biology studies. We can start to build these static data integration models that allow us to capture everything we know and lump it into one thing, where we can at least look for associations in large list-based studies. But then we're actually starting to get into extracting the key molecules from these into these more logical models, which allow us to make predictions in terms of receptor availability or channel availability within a complex. And that's what we see as the start of a link up the levels of organization, into some of the work that is going to be presented this week by some of the other speakers, who are actually looking at how you model compartments or physiological processes.
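The incorporation curves described here can be mimicked with a very small mass-action sketch. This is a deterministic stand-in, not the spatial Kappa simulation from the talk: the rate constants, molecule counts and the `simulate_incorporation` function are invented, and simple reversible binding between free receptors and free PSD-95 slots is an assumption.

```python
def simulate_incorporation(receptors, psd95, k_on=0.01, k_off=0.05,
                           dt=0.01, steps=20000):
    """Euler integration of reversible receptor binding to PSD-95 slots.

    d(bound)/dt = k_on * free_receptors * free_slots - k_off * bound
    All parameter values are illustrative, not measured.
    """
    bound = 0.0
    trace = []
    for _ in range(steps):
        free_receptors = receptors - bound
        free_slots = psd95 - bound
        bound += dt * (k_on * free_receptors * free_slots - k_off * bound)
        trace.append(bound)
    return trace


# Same receptor pool, three levels of PSD-95 availability.
full = simulate_incorporation(receptors=100, psd95=60)
reduced = simulate_incorporation(receptors=100, psd95=20)  # about two thirds less
knockout = simulate_incorporation(receptors=100, psd95=0)

for label, trace in (("full", full), ("reduced", reduced), ("knockout", knockout)):
    print(label, round(trace[-1], 2))
```

With these toy values the full run rises and levels off, the reduced run plateaus lower, and the knockout stays flat at zero, which is the qualitative pattern described above. A stochastic simulation would add the noise-level incorporation mentioned for the near-knockout case.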
That matters because, if we can go from these models to saying that you change the expression level of a molecule, we can start to make actual predictions about how that would affect key molecules that are involved in physiology, which allows us to link to the next level of organization. So, just to wrap up: as I tried to say, all levels of analysis give you something, but the more realistic the model gets, the more expensive it gets to actually generate, in terms of just getting its data, how to simulate it, how to build it. Everything gets more expensive; it gets harder. And those logical models, as I said, give us a means to link from the molecular towards the cellular. It's a long way off yet, but I think we've potentially got a way to do it. Kappa modelling's not the only way to do this. There are other approaches. This is just the one that we've been using locally and we quite like. Finally, I'd like to thank our funders: the Wellcome Trust, MRC, BBSRC, EPSRC and Framework 7 at the EU. And I've gone through a lot of people's data today, but CMO knows Barley did all the fly gene expression work. Bilal Malik did the fly pull-downs, and Colin McLean developed some of the clustering software. Oxana led the Kappa development that I showed you in the last third of the talk. Lissy Marcos did some of the early fly work as well, and this was all done in collaboration with Seth Grant, Vincent Danos, and Andrew Palkington and Huss Smith's groups, who provided an awful lot of the raw data, especially in the mammal studies. So thank you very much for your attention.
I'd just like to ask a little bit about the interpretation of the evolutionary conservation. We've known that some time in the era of the bony fishes, genomes got duplicated, so that in most vertebrates there are, on average, four orthologues. So where you might find a single gene in the fly, there are four of them.
And so I'm wondering how much of what these plots show is just that. Do you look at, you know... I think that is the basic mechanism that's driven it, absolutely, and if you look at our favorite important molecules, again being very subjective, that is what you see. For instance, PSD-95: Discs large in fly has four mammalian orthologues, exactly as you say. Interestingly, it looks as though some of the splice variants in the fly map to different orthologues, which is, you know, neat. The story looks as though it's holding up. But yeah, from all the evidence we have, that's the mechanism that has driven this, and what has then been selected into brain function is what's gone on since. And that's what we're seeing, but the mechanism's definitely been from, you know, evolution. There's very little evidence for real novel things in there. It's pretty limited. Yeah. So the data, the tissue that it has come from, is this whole brain or is this cortex? Or hippocampus? It depends on the study. Some of them are forebrain, some of them are dissected regions. If it's fly, it's whole head, because you can't dissect 10,000 fly brains. So are there differences between, say, that and spinal cord? There are, yes, and we do get differences when we look. Everything I've tried to present today has been as close as possible, but if we actually look at very different regions like spinal cord, then yeah, you would get a difference. So those differences might be good ways to test whether your models are really predictive? Yeah. And does that work? We haven't done it yet. That's a good way to do it. We don't have the qualitative data for, for instance, the spinal cord or for some of these other regions yet.
But we're focusing at the moment on working with groups who've got knockouts for those kinds of key proteins and are doing proteomics on those knockouts anyway, as probably the cleanest single way of doing it. But that is another approach that would be worth following up. Okay. You've been describing what you get with pooled data. Have you got a feel for how much overlap there would be, or how much it would be pruned, if you were somehow, through some clever genetics and markers, able to pull out a very, very specific set of synapses? So if you look at the data you get from TAP-tag type approaches, where you're genetically engineering tags, you typically get a much smaller number. It's typically 250, 300 proteins. And these methods are really quite sensitive; they're quite advanced these days. So I think that's what you're getting towards with an individual synapse class. It's a more realistic number. And so have you done a bit of a survey on how you can compare your three synapse classes? We don't have enough TAP-tag data for that yet. That's coming. There are groups that we're working with who are generating it, but there's not enough of that available just yet. And obviously the other thing we could do is do it at the fly level as well, because we've got these 500 ones down there. They've been systematically done in embryos, but not in brain or nervous tissue specifically yet. So it's a little bit of a mixed bag. So, following up on these same questions. Yeah. There are a number of molecules that are known to be expressed in only one class of neurons. Yeah. Most of what you showed doesn't actually fall into that category; they're expressed in a very broad way. Yeah.
But if you could focus on the IPs for those baits which do, and then remove them, the prediction is that if your clusters actually correspond to complexes that represent particular types of synapses, then they should drop out together, which is what you were describing didn't happen for most of the tags. That is, you've got a very distributed dropout. You should actually lose a whole cluster if that cluster really represents a physical complex from certain classes of synapses. Yeah. So I wonder if you have... We don't have that data yet. That would be good to do. But some of those, I don't know if you could, you could email me the list of 80 tablets on a type of an equal neuron, not expressed in every cell. Cool, yeah, no, no. Yeah, there are some interesting things there: there are protein-protein interactions that exist in these models where we know those molecules never exist in the same cell. We have got examples of that, and we can pull things like that out of it. So we know that there are things in there that don't exist. That's one of the reasons for going for the logical models: they allow us to capture some of those restrictions a little bit better. And we've even gone back and tested that those proteins never exist in the same cell, just to be 100% sure, and we can see them in very, very different cell types in the fly. So yes, the approach is kind of positive to the experimental data. Yeah, absolutely. And if you were to look to sort of an alternative type of group that's meant to validate the results, maybe it was the ASX? I guess what there is, though, and it's one of these things that's hard to go through in a talk, is that the groups who work with that are the only ones doing these experiments. If you looked five or six years ago at two proteomics experiments trying to do roughly the same thing, you'd have two completely different lists. So the confidence in any one list would be pretty low.
What we are starting to see is that the overlap is getting a lot more substantial. It's not perfect; still, every time you do it, you maybe get 20 or 30% new things. But it's down to a minority of the things you're finding being new now, and it's not just that we have the whole genome. It's still a fraction of the available genome when you compare it to other tissue types. So it's getting better, but we are obviously limited by the low data. But is it realistic to expect, you know, EM, you know, co-precipitation or something, because it goes down to, sort of, the... The interactions? There's a tiny bit of interaction there. Not sure. Just to ask, what are the key databases for the interactions? I mean, what are the databases available at the moment? That kind of changes every few months. But yeah, IntAct, for instance, is one of the key ones we use. There's the human protein-protein interaction database we use as well. There's a bunch of these that we use. But at one point you were doing, plus the field we've done, having to do that, which is... You still have to check things. Right. That's for sure. There's noisy data out there, and so you have to check what the sources are. There's a lot better annotation in the databases now of what the sources were. So are they coming from automated prediction algorithms, or from automated text mining, or from what types of basic biochemical study? You can actually filter on that a lot more easily now. It's not as hard as it was when we did some of these first networks, where we built them by hand, effectively. Yeah. OK, thanks a lot. Cheers.