So good afternoon, everybody, and welcome to BioExcel webinar number 57. Today's presenter is Charlotte Deane from the University of Oxford, UK, and she will speak about computationally designing therapeutic antibodies, combining immune repertoire data and structural information. I'm Alessandra Villa from the KTH Royal Institute of Technology, and I will host this webinar together with Arno Proeme from the University of Edinburgh. Today's presenter, Charlotte, is Professor of Structural Bioinformatics at the University of Oxford. She is also Deputy Executive Chair of the Engineering and Physical Sciences Research Council and COVID-19 Response Director at UK Research and Innovation. She has worked on the development of novel algorithms, tools and databases that are openly available to the community, and many of her tools are also used in the drug discovery pipelines of the pharma industry. And now I'm looking forward to listening to her webinar. Hello everybody, and first of all, thank you very much to the organisers for inviting me to speak today. I'm going to talk about one of the areas of work we do in my group: trying to computationally design therapeutic antibodies. I believe we're in a really important phase of being able to do this, where we have this data source, immune repertoire data, and we have these huge amounts of structural information, and we need to bring these things together to really reap the benefits of what's possible to predict and what we can do in terms of computational design for these types of therapies. So first of all, why antibodies? This is much easier for me now than it was a few years ago, as almost everybody has heard about antibodies because of what's been happening over the last 18 months. But there are really two sides to thinking about antibodies. One is that they are your main defence against disease.
So understanding them is key to understanding our health and how well we can survive. It's been estimated that a typical human can produce more than 10^12 different antibodies, and every one of those can probably bind to something distinct: a distinct epitope, a distinct site on another protein or a small molecule. What they do is recognise and bind to potentially harmful molecules and, either on their own or together with other parts of the immune system, recruit machinery to get rid of whatever is dangerous. Then if we move across to thinking about them as biotherapeutics, there are lots of different pieces to this. I am sure lots of you have seen, in terms of COVID-19, people looking at antibodies to gauge how effective a vaccine is: what is the antibody response, how much of it is remembered within your antibody repertoire. You can also use them for things like diagnosing exposure; one of the best ways to tell if somebody has had COVID is to know what antibodies they have circulating in their system. And then finally, they are effective biotherapeutics. Why are they such effective biotherapeutics? Basically because they work already. In our natural immune system they target specific sites with high affinity, and they can be raised against almost any antigen. So, if you can get it right, they are a wonderful way of being able to target things. Currently there are over 100 approved antibody drugs, and something over 700 that have made it past phase one of clinical trials. It's a huge and growing class of therapeutic. I'm afraid I'm going to do my little primer now to make sure everybody is on the same page for the rest of my discussion of antibodies. On the left-hand side of this slide is a schematic representation of what a full-size human antibody looks like.
It is made up of four chains, as you can see there, two blue and two pink; the blue chains are called the heavy chains and the pink ones the light chains. But the only bit I'm going to talk about today is the two domains at the top, called the VL and the VH. The reason for that is that they contain the CDRs, the complementarity-determining regions: a set of six loops which bind to your target of interest. That's where the antigen binds, and that's what we're interested in. Those six loops are picked out in darker colours in the cartoon representation of the protein on the right-hand side. It's that piece that I'm going to run through for the rest of the talk. So that was the shape and structure; the other part of this is immune repertoire sequencing, which has become a big part of the data we now have available. What is now possible is to take a sample, which could be from the blood or from the lymph, from a human or some other animal, and then sequence the circulating repertoire of antibodies. This gives snapshots of up to several million antibody sequences. It is an amazingly rich data source. The caveat I have to put on this is that the theoretical diversity is much, much higher than that, and even the circulating diversity is probably at least two orders of magnitude larger. So even though this sounds like a lot of data, it is still not everything by any stretch. It allows us to do incredible things: we can look at what's happened pre- and post-immunisation, and we can see what's different between being a human and being a mouse. We have these possibilities; we can start to look at these things. Now, all my work starts from databases of information, and these are our three main databases. The first of these, Observed Antibody Space (OAS), is a collection of that immune repertoire data.
It has over one and a half billion antibody sequences in it, so there's an enormous amount of data there. These go across diverse immune states and organisms; most of it is human, quite a lot of the rest is mouse, and then there are small amounts of other species like rabbits and camels. But it also covers things like different diseases and ages, so you can start to do those comparisons. The thing about OAS, and the reason we built it, is that people are dumping this data out, but it comes out as raw DNA sequence. You need to sort it, clean it, annotate it, translate it and number it so you can actually use it to think about antibodies. So what this database is, is a clean version of what is out there, which you can use. The second database is the Structural Antibody Database (SAbDab), and this is the oldest of our databases. It's a fully automated, self-updating collection of all publicly available antibody structure data, and it contains about five and a half thousand structures at the moment. The really important thing about this, which I have to emphasise at this point, is the reason we built it: any of you who deal with structural data and have gone to the Protein Data Bank and typed the word 'antibody' in will know you don't get the antibodies; you get all sorts of things, but not a collection of the antibody structures. So it's very important, once again, to have this consistent data set that contains as much information as we can. Once again we have numbered it and cleaned it, because that allows us to work with the repertoire data and the structural data together, and to use them for the things I'll talk about as we go forward. And then finally there is Thera-SAbDab, a semi-automated, self-updating database of all immunotherapeutic variable domains.
What we're doing here is, as companies start saying 'this is a serious candidate to be an immunotherapeutic', we collect that information and store it. These are sequences as they're released by the World Health Organization; there are about 700 of them, and we collect everything that is post phase one clinical trials. So now, to bring all that back to the title: this slide represents a piece of work that is one of those things where, as a group leader, you say something to a student which you think just should be done, but you don't really expect a massive result from it; it's more a case of 'we should check this because somebody will ask'. What I said was: we have the therapeutic antibodies, and at this point we had 242 to look for, because we weren't collecting them quite as carefully as we are now; and we had OAS, which at that point had about half a billion sequences in it; can you see if any of the therapeutics are in OAS? The answer in my head was no: these therapeutics are highly engineered antibodies that people have really worked on and struggled to make. The answer actually was yes, they are there. 54 of these 242 had a perfect CDR-H3 match. Now, I told you about the six CDRs that make up the binding site; CDR-H3 is considered to be the major contributor to binding. A lot of companies only bother to make mutations in that area, and a lot of people only even sequence CDR-H3, because they think everything else doesn't matter. But that means I could have found that antibody just by looking in my database; it wasn't that clever of the company to have built it. And lots of the sequences had incredibly high sequence identity across all of the main and important regions, as you can see from this.
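The exact-match search described here can be sketched very simply. This is a toy illustration, not how OAS itself is queried (the real database is accessed via bulk downloads and its own search tools); the sequences below are made up.

```python
# Sketch: check whether therapeutic CDR-H3 sequences occur verbatim in a
# repertoire. Toy data; real OAS queries work over pre-numbered sequence files.

def cdrh3_matches(therapeutic_cdrh3s, repertoire_cdrh3s):
    """Return the therapeutic CDR-H3 sequences found exactly in the repertoire."""
    repertoire_set = set(repertoire_cdrh3s)   # set gives O(1) membership tests
    return {c for c in therapeutic_cdrh3s if c in repertoire_set}

therapeutics = ["ARDYYGSSYFDY", "AKDIQYGNYYYGMDV"]          # hypothetical
repertoire = ["ARDYYGSSYFDY", "ARGGLLTGYFDL", "AKWGGDGFDI"]  # hypothetical
print(cdrh3_matches(therapeutics, repertoire))  # {'ARDYYGSSYFDY'}
```

At repertoire scale the same idea holds; the set-based lookup is what makes scanning half a billion sequences for a few hundred therapeutic CDR-H3s feasible.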
This suggests that you could use this kind of data, the data that's in OAS, to start hunting for therapeutics, at least as very, very good starting points, rather than starting with the blank canvas of the something like 10^15 sensible antibody sequences and guessing which one will bind to your antigen. Why not start within something that is a much smaller representation of that space but clearly contains important information for binding? Now, this is what we have built from our databases. I am not going to take you through all of these tools; I'm going to do a very quick whip round of this slide and then tell you a little more about a few of them on the way through. If I start from the top left, the collection of SPHINX, PEARS, ABodyBuilder, SCALOP and in fact ABlooper are all about building good models of antibodies, and there are two characteristics we aim for. One is that we want to be able to do it very fast, and that's because of that huge amount of data. If I've only got one antibody I'm interested in, it doesn't matter if it takes a few hours to build a good model. But if I've got potentially thousands of antibodies I'm interested in, or even one and a half billion, I'd like to build models of them all, and I can only do that if my method is very fast. The second thing we concentrate on is always having an estimate of error. If we give you a model, we will also tell you either 'we think this model is worth using' or 'we think this model is pretty rubbish, but it's all the information you've got, so choose what you want to do with it'. Because if you're building thousands of models, you can choose to ignore half of them and say 'I'm not using those models because they're poor; I'll concentrate on where I've got good models'. And I'll talk a little about the most recent addition to this, which is ABlooper.
Then we have ABangle, which looks at the angles between the two domains that make up the antibody; then SAAB+, which is a very, very rapid structural annotator. In reality we can't yet build models at the level of one and a half billion, but we can certainly annotate structurally at that level. And then TAP, the Therapeutic Antibody Profiler, which I will tell you more about during the talk; this is where we look at the properties other than binding to say whether something would be a good therapeutic. Across the bottom here are epitope profiling, Paratyping and Ab-Ligity. These are all methods to try to identify more accurately things that will bind to the same site, but also to identify things that couldn't be found in any other way: things that look very different in sequence space but that I'm going to say will bind to the same spot. I'll tell you a little more about SPACE on the way through. Then I come across to Hu-mAb, which is a tool for humanising antibodies. Obviously, if I put an antibody into you that is from another species, your existing antibodies will react to it and drive it out, so I need to put something in which doesn't create an immune reaction within you; that's what Hu-mAb is for. ANARCI is our numbering tool, and DLAB is our first attempt at virtual screening with antibodies. So the first of the tools I'm going to talk about is Hu-mAb. What we want to do here is humanise antibodies, and the important context is that many antibody therapeutics are still derived from non-human sources: about 50% of those currently in development. Even those which come from 'human' sources are often from animals with humanised immune systems, and these can also need humanisation afterwards, because even a humanised immune system is not fully human, and a non-human antibody can give you these potentially harmful immune responses.
So it's very important to make an antibody therapeutic as human as possible. Currently, humanisation is carried out experimentally in a largely trial-and-error process; it's not something that is done in an objective, function-based way. Well, we took what we called a very clean part of the OAS database. We took 65 million sequences from OAS, removed all redundant sequences (lots of sequences are identical within OAS), and removed anything that was missing residues that might be important. We separated them into human, also split by the different V genes of the antibodies, and non-human, and we trained a random forest. I have to say we started with a random forest because it was the simplest method; we then tested fancier methods as well, but came back to the random forest because it was the best. We set a human/non-human classification threshold to maximise Youden's J statistic, where perfect performance is one. The first thing to say, as you can see from the numbers on the blue side of this, is that pretty much all the time you get it perfectly right. I will say this is partly because this is actually not very hard, in the sense that mouse sequences are different from human sequences, as are camel sequences. But the point here was that you can do this. What we also found interesting was that there was an LSTM model published before us which didn't separate quite as well, and there has also been a model published since which also doesn't separate as well, and I'm sort of surprised the other way around here: I don't want to say this is a trivial problem, but it's not a hard problem to separate out these sequences. It's the rest of the things that we hope to do with this that I think are more important.
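The threshold-setting step described above can be sketched in a few lines. In the real pipeline the scores come from the trained random forest (e.g. scikit-learn's `RandomForestClassifier.predict_proba`); here toy scores stand in for classifier output so the example is self-contained, and the function names are illustrative.

```python
import numpy as np

def youden_threshold(scores, labels):
    """Pick the score cut-off that maximises Youden's J = TPR - FPR (perfect J = 1)."""
    best_t, best_j = None, -1.0
    for t in np.unique(scores):              # every observed score is a candidate cut-off
        pred = scores >= t
        tpr = (pred & (labels == 1)).sum() / (labels == 1).sum()
        fpr = (pred & (labels == 0)).sum() / (labels == 0).sum()
        if tpr - fpr > best_j:
            best_j, best_t = tpr - fpr, t
    return best_t, best_j

# Toy scores: human sequences (label 1) score high, non-human (label 0) low.
scores = np.array([0.95, 0.9, 0.85, 0.8, 0.3, 0.2, 0.15, 0.1])
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])
t, j = youden_threshold(scores, labels)
print(t, j)  # 0.8 1.0 -- the classes are perfectly separable, so J reaches 1
```

With perfectly separable scores, as in the talk's human-vs-mouse case, J hits 1.0; on harder data the chosen threshold trades sensitivity against specificity.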
So the next question was: okay, I've asked you an easy thing, is this a human sequence, is this a mouse sequence, is this a camel sequence, find the human ones. What about if I take some known therapeutics? This is when we tested it on a set of known therapeutics, spanning phases one to three and approved. We took sequences that were listed as human, humanised, chimeric, humanised/chimeric and mouse, so you can see which domains have been affected in the pictures on the left-hand side. And we see exactly the picture we wanted to see, which is that all the human sequences came out as human. So that was good: the human therapeutics come out as human. Quite a lot of the humanised ones do too, but not all of them, so clearly not everything is as human as people might hope; with the chimeric ones it starts dropping down quite a lot, and none of the mouse sequences come out as human. So we're still feeling okay: our tool is telling us what we wanted it to tell us. But this is really where we feel we can show it's useful. This is a set of data that was collected on ADAs, anti-drug antibodies: this is basically people having an immune reaction to having been given a therapeutic antibody, and this was a set of sequences where information about this had been published. The ADAs were split into three categories: low, medium and high immunogenicity. And we compared this with our score. If we looked at those where the score was really high, say above 0.9, almost all of those had low immunogenicity and a tiny number had medium. If we took a positive score, that is, a score less than 0.9 but greater than what we would set as our human threshold, we barely had anything which showed high immunogenicity.
And as soon as we moved to things which had a non-human score, you can see we were getting the majority of things with high and medium immunogenicity. This means that, without having any of this experimental information, we are starting to be able to tell people which sequences are much more likely to give an immunogenic reaction: we have a correlation between our scores and the immunogenicity here, though we actually expect our cut-off to be the most important thing. Given that, we thought: can we turn a sequence human? Because that's obviously what we could do with this. At the moment I could just tell you 'that's a bad sequence, don't make that one', but that's not that useful. What if you have a precursor sequence with a bad Hu-mAb score and you want to move it to a good Hu-mAb score? The way we do humanisation is a very greedy way. We look across the entire sequence and mutate every position to every other amino acid; we identify the residue change that makes the biggest improvement in score; we make that change, and repeat and repeat until we reach the threshold. The interesting thing is that sometimes you make one mutation, then a second, third, fourth, and then you reverse the first one you made, because there are a lot of interdependencies between positions. It can obviously be computationally quite heavy work if the sequence is a long way from human, but it's a procedure that will maximise your ability to do this while minimising the number of residue changes you make. We found 25 therapeutic antibodies where somebody had released the precursor non-human sequence, so we could test this properly on a real data set.
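The greedy loop described above can be sketched as follows. The `humanness` function here is a deliberately trivial stand-in for the Hu-mAb score (it just counts matches to a made-up "consensus" so the loop runs end-to-end); in the real method the score comes from the trained random forest, and the sequences and threshold below are illustrative only.

```python
# Sketch of greedy humanisation: try every single-point mutation, keep the one
# that improves the score most, repeat until the humanness threshold is reached.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CONSENSUS = "QVQLVE"  # hypothetical toy "human" target, NOT a real consensus

def humanness(seq):
    """Toy score: fraction of positions matching the toy consensus."""
    return sum(a == b for a, b in zip(seq, CONSENSUS)) / len(CONSENSUS)

def greedy_humanise(seq, threshold=0.99):
    seq = list(seq)
    while humanness("".join(seq)) < threshold:
        current = humanness("".join(seq))
        best_gain, best_change = 0.0, None
        for i in range(len(seq)):          # every position...
            for aa in AMINO_ACIDS:         # ...mutated to every amino acid
                trial = seq[:i] + [aa] + seq[i + 1:]
                gain = humanness("".join(trial)) - current
                if gain > best_gain:
                    best_gain, best_change = gain, (i, aa)
        if best_change is None:            # no single mutation helps: stop
            break
        i, aa = best_change
        seq[i] = aa                        # apply the best mutation and repeat
    return "".join(seq)

print(greedy_humanise("QKQLKE"))  # -> "QVQLVE" after two greedy steps
```

Because each step re-scores the whole sequence, the loop naturally captures the interdependencies Charlotte mentions: a later step can undo an earlier mutation if that now improves the score.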
They had carried out their own experimental humanisation to a humanised sequence, and we just ran Hu-mAb to create a humanised sequence. The first thing to say is that between 77% (heavy chain) and 85% (light chain) of the mutations we suggest are basically identical to those made experimentally: they are either exactly the same residue type or in the same chemical family of residue types. The random overlap there would be 5%. The other thing is that we tend to suggest fewer mutations: about 60% of the number of mutations made in the experimental humanisation, suggesting that we have greater efficiency in generating something that is human. We also suggest fewer mutations at the VH-VL interface, so the binding properties between the two domains are more likely to be preserved. Overall we feel this is really cool, because it's basically saying we've got a greater likelihood of preserving antibody structure and function whilst being able to accurately humanise these sequences. So the tool itself: we think it's very accurate at evaluating whether an antibody is human or not, giving a kind of humanness score; it has predictive power as to whether an antibody is immunogenic, based on experimental data; and it can be used to improve the humanness of a sequence. In that first part I've basically used the power of the sequence data in OAS. What I want to start using now is the power of structural data on top of this. Why do I want to do that? Well, this picture is really a demonstration. There's a very well known paradigm that everybody is comfortable with in terms of proteins, which is that if two things have a very similar sequence, they're going to have a similar structure and a similar function. There are some variants on how similar they need to be, but basically we understand that.
In terms of antibodies, people talk about that paradigm as clonotyping: grouping things together with similar sequences on the assumption that they'll have similar structures and therefore similar function. These pictures show that the paradigm doesn't always hold. If I look at the one on the left first, these are two heavy chains from two different PDB entries, and they've got different V genes. So they've come from a different genetic background: both their V gene and J gene are genetically different. And their CDR-H3s, the most important binding loops, share hardly any sequence identity, barely above random. And yet the RMSD between those two structures is one angstrom, so they are the same within the limits of how well the crystallography will have solved them. Actually, if I showed you the surface, the same chemistry is pointing out as well. Whereas I can go to the second case, here on the right, where I have the same V gene, the same J gene, and very, very similar sequences, but now the structures are very different. The purpose of this is to say: in order to know if two things are going to have the same function, I need to step beyond just looking at their sequence; I should ask whether they have the same structure as well, because that adds a layer of information. I should say that case B is, as far as we can tell, relatively rare, though it does exist, and case A is more common. But of course, once you start talking about billions of sequences, both of them become quite common, in that there are a lot of examples within the data you can look at. And to show that this really makes sense in terms of what we're trying to achieve: this was one of the first data sets where we started to try to say what structural shapes are occurring within a data set. The data set was an immune repertoire from somebody who had had the flu vaccine.
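For reference, the sequence-based clonotyping that this structural argument complicates can be sketched as below: group antibodies that share V gene, J gene, CDR-H3 length, and high CDR-H3 identity (80% is a commonly used cut-off; the records and field names here are illustrative, not an OAS schema).

```python
# Sketch of sequence-based clonotyping with a simple greedy grouping.
# Real pipelines use germline-gene assignment tools and proper clustering.

def identity(a, b):
    """Fraction of identical positions between two equal-length strings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def clonotype(records, min_identity=0.8):
    clusters = []
    for rec in records:
        for cluster in clusters:
            rep = cluster[0]                       # compare to cluster representative
            if (rec["v"] == rep["v"] and rec["j"] == rep["j"]
                    and len(rec["cdrh3"]) == len(rep["cdrh3"])
                    and identity(rec["cdrh3"], rep["cdrh3"]) >= min_identity):
                cluster.append(rec)
                break
        else:
            clusters.append([rec])                 # no match: start a new clonotype
    return clusters

records = [
    {"v": "IGHV3-23", "j": "IGHJ4", "cdrh3": "ARDYYGSSYFDY"},
    {"v": "IGHV3-23", "j": "IGHJ4", "cdrh3": "ARDYYGSSHFDY"},  # 11/12 identical
    {"v": "IGHV1-69", "j": "IGHJ4", "cdrh3": "ARDYYGSSYFDY"},  # different V gene
]
print(len(clonotype(records)))  # 2 clusters
```

Note how the third record, despite an identical CDR-H3, lands in its own clonotype because its V gene differs: exactly the kind of pair that case A above shows can nonetheless share a structure and a function.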
Okay, and the flu vaccine is basically made up of the influenza haemagglutinin. When we did our structural predictions at this point in time (I'll show you at the end that we have since moved on from this), we did them effectively by mapping sequences to known structures of antibodies. What happened was that the student who did this work came to me and said 'I keep getting the same structure for all of the antibodies here': not all of them in the data set, but a huge number. We had some concern that it wasn't running correctly, but we worked through it all and were fairly sure we were right. It turned out that more than 7,000 of the sequences in this data set were being mapped to a particular antibody, and that antibody was a structure of an antibody in complex with the influenza haemagglutinin. So we were basically identifying a huge number of diverse sequences which structurally shared the same shape as something we knew bound to the target. If we'd structurally mapped these, we would have clustered them together and said this is actually an important shape, and they're all doing roughly the same thing. Now I'm going to talk about some of our tools where we start to use those kinds of structural ideas, and then go on to things that play into them even more. The first of these is the Therapeutic Antibody Profiler, TAP, which I mentioned at the beginning. This is based on the idea that therapeutic antibodies have to do more than just bind to their target: they've also got to be free from developability issues such as poor stability or high levels of aggregation. Any of you who've worked with small molecules might have seen Lipinski's rule of five; it's the same idea here. We want to guide the selection of antibodies with appropriate biophysical properties.
What we're going to do is completely analogous to what Lipinski did: we're going to look at clinical-stage therapeutics and assume that these indicate the allowed values of properties. And we're going to calculate these metrics on models, because if we're working on potential therapeutics we have to compare model with model; we won't have crystal structures of all the things that are potential therapeutics. The last thing is that these metrics don't have to correlate with a particular experiment, because they're not an experimental value for something; they're a description of what an antibody must look like. One way to think about this: in Lipinski's rule of five you might count how many hydrogen bond donors the exterior of the molecule has. You can't measure that easily in an experiment, but it's very easy to calculate on a computer, and it's a very useful metric for these things. The data set here was a set of structural models. We started with 137 post-phase-one clinical-stage antibody therapeutics, or CSTs. We checked how good the models were: there were 56 with known structures at this point, and the average RMSD is less than one angstrom. I should say antibodies are pretty easy to model, so that's not much of a surprise. More importantly for what we're going to do here, less than 4% of the residues are wrongly annotated as exposed or buried. The five properties we looked at were: CDR-H3 or total CDR length, which you can think of as relating to things like aggregation, flexibility and topology; patches of surface hydrophobicity, similarly related to aggregation and viscosity; patches of positive charge and patches of negative charge, to which poor expression and polyspecificity are often related; and the structural Fv charge symmetry parameter, which is about how the charges of the two sides of the antibody relate.
As data sets, we're going to look at those 137 post-phase-one therapeutic models, and in the background 14,000 human antibody models that we built as a representative set of what's in OAS, so you can see whether therapeutics look like standard human antibodies; and then a couple of data sets where people have had failures. First of all, looking at our metrics for the therapeutics versus the human antibodies: I think the picture on the left isn't really a surprise. Therapeutics really do tend to have shorter CDR-H3s. That's clearly minimising the amount of flexibility, making them easier to work with; you can see the red graph is pushed quite a long way to the left compared to the models, which are in blue and green. You can also look at the patches of surface hydrophobicity: therapeutic antibodies are quite a lot less hydrophobic than a standard human antibody, and again I think that's not surprising, because you're talking about a very different formulation between the two. A therapeutic antibody has to be kept in a bottle in the fridge; a human antibody is floating around our system at different concentrations with lots of other molecules around it. The patches of positive charge and negative charge look very similar between therapeutic and human antibodies, and the same is true for the structural charge symmetry. So do these things work? Well, the first question is to test them against things that shouldn't flag. The way we do flagging here is very simple. I say something gets an amber flag if it's in the top 5% of the distribution; what I mean by that is 'I haven't seen many therapeutics like this before'. I give it a red flag if it's outside the distribution. Simple as that. That's the way we're building this.
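The flagging rule just described can be sketched as below. This is a simplified reading of TAP, not its exact implementation: for illustration both tails of the distribution are flagged, whereas the real guidelines treat each metric individually (some only flag high values), and the reference numbers here are invented.

```python
import numpy as np

def tap_flag(value, reference):
    """Flag a candidate's metric against a reference distribution of therapeutics.
    red   = outside anything seen among clinical-stage therapeutics
    amber = within the distribution but in its extreme 5% tails
    green = comfortably inside the distribution
    """
    lo, hi = reference.min(), reference.max()
    if value < lo or value > hi:
        return "red"
    p5, p95 = np.percentile(reference, [5, 95])
    if value < p5 or value > p95:
        return "amber"
    return "green"

reference = np.linspace(0.0, 100.0, 101)   # hypothetical metric values for CSTs
print(tap_flag(50.0, reference))   # green
print(tap_flag(99.0, reference))   # amber (in the top 5%)
print(tap_flag(150.0, reference))  # red   (outside the distribution)
```

Because the reference distribution is just a list of metric values over the current therapeutic set, re-running the same rule on an updated Thera-SAbDab snapshot is what makes the guidelines self-updating.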
We took 105 extra post-phase-one therapeutics that we could collect beyond the first 137 that we had, and we got eight flags. So we are not going to flag genuine therapeutics very often, even when the distributions contained that little data. And one of them, which flagged three times, is known for being very, very difficult as a therapeutic and has several properties that make it hard to work with. What we did next was look for things that should flag, and ask whether the metrics can help show what would make things better. These are two different antibodies. The first one, MEDI1912, was found to aggregate uncontrollably during development, and you can see on our patches-of-surface-hydrophobicity metric that it just sits way outside the distribution. I should say the development of this antibody started with MEDI578, and then, to improve binding, they generated MEDI1912, which definitely improved binding but aggregated; it was an absolute pain. With a few mutations it became MEDI1912_STT, and as you can see that moved it right back into the distribution, which resolved all of the issues. That's something you can see very clearly on this distribution as these things move. Another case example here is an antibody which had very poor expression levels, so they were unable to generate enough antibody to use later on in their experiments; with some engineering changes those issues were removed and it was brought back inside the distribution again. The other lovely thing about the TAP guidelines is that because we have Thera-SAbDab and we're collecting these things, they are auto-updating. So they just keep giving a cleaner and clearer picture of what is happening with therapeutics.
The guidelines today will look slightly different from this, but it turns out these distributions are not moving very much anymore. They moved a little when we went from the 137 to the 242, and again when we added the over 300 on this picture; we're now at nearly 700 for these distributions, and they do move a bit, but basically the end points are staying where they are. And if you're central in these distributions, you've definitely got properties that are probably very good for a standard antibody therapeutic. Next, a little about something I'm sure pretty much everybody has been thinking about, and that we did a little bit of work on. In March 2020 we decided, because of our expertise in these kinds of databases, to collect all of the antibodies known, that is, experimentally verified, to bind to SARS-CoV-2, SARS-CoV-1 and MERS. This really was just intended as a resource, so that all this information would be available, because we had the tools to start collecting it relatively easily. We then decided to use it to do something called epitope profiling, really because it's the first large-scale database of this type where we could start to examine our structure-based methods. On epitope profiling: one of the things you want to know when you have an antibody is where on the pathogen it binds. You want to know where its epitope is, what site it binds. Here's a picture of the SARS-CoV-2 spike receptor binding domain, and you can see some of the main mutations involved in the different variants that people have been talking about. A neutralising antibody that binds wild-type SARS-CoV-2 there is not likely to bind the variants as well, because the residues in that site have changed.
A neutralising antibody that bound, if you like, on the other side would be more likely to keep working. So epitope profiling allows you to understand which binding modes belong to which kinds of B cells, and you can evaluate all sorts of other things once you know this information. Now, this is to demonstrate what you could do if you had solved structures of everything. It's lovely to have solved structures of everything, but unfortunately, as I'm sure you're all aware, that's not normally where we sit. Fortunately, in the world of the COVID antibodies there was enough data that we could start to show what was possible if you did have solved structures, and then talk about what you could do with models. So here's an example. There are 22 different antibodies in this picture, and that's the SARS-CoV-2 spike RBD again in pink. They all bind to the same site, as you can see; it's pretty clear they're all binding in one place, so they're all binding to the same epitope. The CDR-H3 sequences of these are given in the middle of the slide, and they're very variable. The black lines I put in there separate them out into clonotypes. So these sequences would have been separated into multiple different clonotypes if we'd only had their sequences, and there is no way we would have been able to identify from sequence alone that these all bound to the same site. We obviously can once we have the structures. So how do we get around that, given this is the information we would like to have? We're going to use modelling, because we know that antibodies from different lineages with different sequences can engage the same epitope with near-identical binding, and if we knew what their structures were, we would have a better chance of working out that that was the case. So the input data is a large set of antibodies known to bind a single antigen, some with known epitopes.
And then we're going to model them and cluster them, and see if that correctly identifies what binds to what. Now, we showed this on the CoV-AbDab data, but you could do it on anything where you have a large dataset of antibodies known to bind a single antigen; the method is completely agnostic about that. It's just that within this dataset we were able to show all of this, and it is all publicly available. And this is really just to emphasise why you have to do the modelling step. This is basically the data running from March 2020 to March 2021, which is when we did this analysis, so we're doing the analysis at the right-hand end of these graphs. If you look at the graph on the left about antibodies, you can see that even though this is probably a disease that has been worked on more than anything else ever, in terms of trying to solve structures and collect data, we still only had about 113 structures out of over 2,000 entries. I can tell you that this has grown significantly since: the number of entries in CoV-AbDab is now, I think, about four and a half thousand, but the number of structures has grown at approximately the same rate, so it's about 200 structures. You're not closing that gap, and I don't think you're likely to, because it's much more expensive to collect structural data than sequence data. So modelling is going to be the step you need if you want to rapidly identify whether these things have similar shapes and might do the same things. So what did we do? Well, we took the 2,063 coronavirus-binding antibody sequences that were in CoV-AbDab in March 2021 and used our modelling tool ABodyBuilder; about 75% of them we thought we modelled accurately. I go back to what I said when I was talking about my tools: one of the important things is that we give an estimate of accuracy. For 25% of these we declared the models just not good enough; we didn't put them in, as they would be noise.
So we took what we could model, and then we did structural clustering based on this. We would say this is a distinct structure, and then say which things had sequences that fell within it. What we ended up with were 200 multiply occupied distinct structural clusters. There were a lot of things that were singletons, with no mates, so we couldn't say anything about them except that they were unlikely to be the same as the things we did have. And of those clusters, given the data we had in the database, 92% were entirely consistent, in the sense that their members were binding to identical sites. So we had a method that is 92% accurate at identifying whether things will bind to the same epitope, and it's a very simple method for being able to do that. When we looked at the ones that were incorrect, you could basically put them into two categories. There were 16 such cases, and they split almost evenly into the two. One category was models we suspected were actually not very good and were causing us to cluster things together wrongly: we modelled incorrectly. The other category reflected quite a lot of experimental uncertainty about some of the measurements of where things bind: one report would say it binds here, which would be the top case, but another would say it also binds here, and that second case would be the one that agreed with our clustering. So we suspect that some of the misclassifications are related to experimental uncertainty as well. This ability to predict structures, and so to predict epitopes, allows you to do two things very easily. You can functionally characterise less well-known antibodies just by seeing that they have the same shape as ones you do know really well. So here we've got another set of antibodies where we know where one of them binds, and then we built models of the others, and they share almost exactly the same shape.
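The model-then-cluster step above can be sketched very simply. This is a minimal illustration of the idea rather than the published SPACE code: the greedy scheme, the assumption that the modelled loops are already in a common frame (because each model was built on the same framework), and the RMSD cutoff are all assumptions for the sketch.

```python
import numpy as np

def loop_rmsd(a, b):
    """RMSD between two equal-length CDR loop backbone coordinate arrays
    (N x 3), assuming both models already sit in a common reference frame."""
    return float(np.sqrt(np.mean(np.sum((a - b) ** 2, axis=1))))

def greedy_structural_clusters(models, cutoff=1.5):
    """Greedy structural clustering of modelled CDR loops: a loop of the
    same length whose RMSD to a cluster representative is under the cutoff
    joins that cluster, otherwise it seeds a new one. Antibodies landing in
    a multiply occupied cluster are predicted to share an epitope.

    models: dict mapping antibody id -> (N x 3) loop coordinates.
    Returns a list of (representative_id, member_ids) pairs.
    """
    clusters = []
    for ab_id, coords in models.items():
        placed = False
        for rep_id, members in clusters:
            rep = models[rep_id]
            if rep.shape == coords.shape and loop_rmsd(rep, coords) < cutoff:
                members.append(ab_id)
                placed = True
                break
        if not placed:
            clusters.append((ab_id, [ab_id]))
    return clusters
```

Singleton clusters correspond to the "no mates" antibodies mentioned above, about which nothing can be said; the multiply occupied clusters are the ones whose members can be checked for consistent epitope annotations.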
Their sequences are different, but if we, if you like, dock them into that place, we can see that their binding sites are really similar. So you really are able to characterise antibodies that you couldn't characterise from sequence alone: if you just looked at sequences you would say "I don't know what this does", but by predicting its structure, you can. The other really cool thing is that if you get a cluster where you don't have any annotation, one experiment tells you everything about that cluster. So you can run one experiment and suddenly have an idea of where 10 to 15 things bind. It reduces your need for experiments: you can run far fewer and get similar, or much larger, amounts of information. The other exciting thing is that you can functionally link across species, because you can't do that with standard clonotyping, and it's very difficult to do from sequence similarity, because mice are different from humans. Here we've got three antibodies, two from mice and one from a human. They are very sequence-dissimilar, because they come from distinct germlines, but they have very high structural similarity. This allows us to understand which coronavirus binding sites are targeted by different genetic loci: mice and humans target the same area on the coronavirus. And it allows us to compare the immune functions of different organisms, something that has always been very, very difficult in a sequence-based world. We've called this method SPACE, because I like nice names; it makes me happy. But it has, as I say, this very high accuracy, and it functionally lets us link antibodies with distinct genetic lineages, species origins and coronavirus specificities. And it's interesting, I think, because greater convergence appears to exist in immune responses than would be suggested by sequence-based approaches.
So actually, we are mounting more similar responses than we had previously thought, and it allows us to make these kinds of high-confidence relationships. I think it will be really useful not just for early-stage drug discovery, where, if I've got a whole load of antibodies that potentially bind a site I'm interested in, I can use TAP to pick the ones most likely to be developable (they're quite different in sequence, but they all bind where I want to go), but also for things like epitope immunodominance: which epitopes is everything binding to? And that helps with things like vaccine design. So finally, I can't really stop without talking a little bit about ABlooper. This runs alongside the work that has been done by many others, I guess most famously by the AlphaFold group from DeepMind, on the idea that we can do much better at predicting structures than we previously could. The first thing I will say, which I hope you have seen in both the papers and the talks about AlphaFold, is that the AlphaFold group are very open about the fact that one of the classes of proteins they find hardest is antibodies. And it's for a very obvious reason: when they build their models, the base information they use is a multiple sequence alignment, and a multiple sequence alignment of antibodies in their CDRs is sort of meaningless in terms of the evolutionary information you can extract from it, because of the way antibodies do their somatic hypermutation, with tiny mutations making big differences to loop conformations. So what we wanted to do was see if we could build something specific to antibodies, to see if that helped, and to improve the speed and quality of the structural models. And I should say we started all of this before AlphaFold2 came out, and we finished it slightly after.
So, the pipeline: the input is the antibody with its CDRs cut off, so we always use a model of the antibody without the CDRs on it, because we haven't modelled them yet; it's the rest of the antibody minus those six loops. And then we put in the sequences. We use equivariant graph neural networks to do this, five of them in fact; each one outputs its own separate loop prediction, and then we average these predictions to create the final one. It's an end-to-end predictor, so you can just use it like that. We found that on occasion it can create things with slightly, if you like, off-standard geometry, and a small energy minimisation can then be useful. And of course, given that it's one of my tools, it gives an estimate of the accuracy of the prediction, and it also works very fast indeed. So first of all, how does it do? This gives you an idea across the standard sorts of datasets. We're just going to look at CDR-H3, because that's by far the hardest; the results are very similar on the other five loops, and I'm claiming nothing better or worse than anyone else there, so it saves time to only show you this one. These are two different benchmarks. One was the antibody benchmark, because one of the most recent other tools, DeepAb, used it as a test, so we were able to compare to them. The other, "SAbDab latest structures", was basically what was available in SAbDab after we had trained ABlooper, though we don't know what other people have used of that dataset in their own training. So the first thing is that the results are actually not that different across the methods. I would say that ABodyBuilder, which uses standard homology-modelling techniques (and in comparisons we have done previously is very similar to everyone else's homology-based techniques), is worse on these loops.
But the deep learning methods are all quite similar to one another. DeepAb outperforms us on the SAbDab latest structures set, though some of those will have been used within DeepAb's training, and ABlooper is approximately the same on the antibody benchmark. So you might say, well, DeepAb's already there, you're not doing much better, why do you care? Why are you telling us about this? The reason is twofold. The first piece is that we have a very good way of estimating the accuracy of our predictions. In this picture, on the y-axis is how good the prediction is, the RMSD to the crystal structure, and on the x-axis is the average RMSD between the five predictions made by ABlooper, and you can see a very clear correlation. It is actually very easy for us to say when we have a good model and when we have a bad model. This is obviously not perfect, but it certainly allows us to rule out all the things we would consider to be very poor models quickly and easily, and we think you can do better than this; it just gives you an idea of the kinds of numbers we can get. The other thing is that it's fast. All of the other methods take minutes, minimum, to make a prediction, some of them considerably longer, particularly if you're thinking about something like AlphaFold. ABodyBuilder takes about 30 seconds, which is considered really very quick. With AlphaFold you're talking several minutes, up to about 30 minutes, all on the same single piece of kit, and DeepAb is in the same kind of timeframe. ABlooper can predict the CDRs for 100 structures in under five seconds.
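The average-then-score idea described above is easy to sketch: average the ensemble members' loop coordinates into the final prediction, and use the disagreement between members as the confidence signal (networks that agree are more likely to be right). This is an illustration of the idea rather than ABlooper's actual code; in particular it assumes the predictions already share a common frame.

```python
import numpy as np

def combine_ensemble(loop_predictions):
    """Combine the CDR loop coordinates predicted by an ensemble of
    networks (shape: n_models x n_atoms x 3).

    Returns the averaged final prediction and a confidence proxy: the mean
    pairwise RMSD between the ensemble members. Low inter-prediction RMSD
    means the networks agree, which correlates with a good model.
    """
    preds = np.asarray(loop_predictions, dtype=float)
    final = preds.mean(axis=0)            # the averaged prediction
    n = len(preds)
    pair_rmsds = []
    for i in range(n):
        for j in range(i + 1, n):
            d = preds[i] - preds[j]
            pair_rmsds.append(np.sqrt(np.mean(np.sum(d * d, axis=1))))
    return final, float(np.mean(pair_rmsds))
```

Thresholding the returned disagreement score is how you would rule out the very poor models before pushing the rest into downstream analyses like the clustering described earlier.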
So once again, I can go straight to my large databases, predict the lot, know what I've done well, and move on with them to do any of these kinds of tests that I would like to do. So that's it; as you can probably tell, I could have gone on and on about every tool we have in the group, but I had to pick a few for today. Here's a picture of the group. I presume, as for most of you, this is our Christmas party from 2020; I'm sort of hoping the one this year will look slightly different. I have to say it was quite entertaining, but not the way I want to hold Christmas parties most of the time. There has been a huge number of people involved in the work I have described today; some of them are not in this picture any more because they're ex-members of the group, along with lots of current members. And I should always remember at the end to say thank you to all the companies listed across the bottom, and the research councils, who pay for all the research that I'm lucky enough to be able to do within my group. So finally, if you're interested in any of this, everything I've described is up and it's free. You can go to our website, download the databases, and do what you want with them. If you've got sensitive data, or you're an academic who wants to run really large batches, please don't do that on our website: we have both a Vagrant VirtualBox image and a Singularity container, so just get in touch with us and we'll be very happy to share those with you. So thank you very much, and I think it is time for questions now. Thank you very much, Charlotte, for a very interesting insight into this range of tools and databases. It's very impressive, both having built up these databases and tools and the kinds of insights you can get from them. So we have quite a few questions; we'll see how many we can tackle. I will read them out and let you respond.
The first one is: how similar are the models built by your antibody builders compared to more classical structure prediction tools like MODELLER, I-TASSER, Rosetta, etc.? So, two parts to the answer, maybe even three. First: if I take ABodyBuilder as it was before we built things like ABlooper, then specific antibody modelling tools like ABodyBuilder (and there were others; RosettaAntibody exists as well) are more accurate on antibodies than general tools like MODELLER or I-TASSER or Rosetta, but all the tools were reasonably comparable. I wouldn't have said ABodyBuilder was better than RosettaAntibody, for example; I think in our paper we say it's slightly better, but there's not a great deal in it. So antibodies are not that hard to model, and tools like MODELLER would do reasonably well too. Once we move on to the deep learning approaches, particularly things using end-to-end modelling like AlphaFold, we start to see a huge jump in accuracy for something like AlphaFold over MODELLER when you're working on something you can't homology-model easily, and for many antibody CDR-H3s that's the case. So something like ABodyBuilder with ABlooper in it would be significantly better than any of these on the difficult parts of the antibody. Okay, thank you very much. The next question is: what kinds of database systems and technologies are used in these tools, and what kinds of data structure standards do you think are important? That varies between the different databases. With OAS we have worked with what are called the AIRR standards, where people have been trying to set up standards for dealing with that type of sequencing data, to make sure that you are reproducible.
So there we had to set up the whole database in that way. We use very different back ends on the different databases because they're at very different scales: you need something very different behind something like OAS from something like SAbDab, because SAbDab has five and a half thousand entries and is probably not going to reach a million any time soon, but it has a lot of other data for each entry, and a lot of connections out from it, whereas OAS has a huge number of individual entries. You can store those in ways that are very compact, though searching is then hard and working out how you do search over that space is the challenge; but the amount of extra information per entry is very, very low, and it connects out much less, because it's not beneficial to do so. So it's a bit difficult to say which way that jumps. You can see how they're all built if you just go and have a look: the code is already available on GitHub, or you can simply browse. I would suggest you go and do that, and depending which one you like, you can find out more about it. Great, makes sense. Thank you very much. The next one is: can you recommend any bioinformatics tools that would help predict monoclonal antibodies based only on the antigenic sequence? At the moment, well, I can give you two answers to that. We have built the only tool that currently exists which has some chance of doing that, which is the DLAB tool, which I didn't talk about. That is not a very accurate way of doing this; it is incredibly difficult computationally at the moment. Almost all computational tools in this realm start either with antibodies we already know bind, where we're trying to explore where they might bind, or with an antibody that binds weakly, where we're trying to improve that binding. But consider the search space if you say: well, here's an antigenic sequence.
Well, the good bit is that now I can turn that into an antigenic structure, because AlphaFold exists. That's good. But there's still a whole series of questions. One is where you would like to bind on that structural surface, and whether that matters; if it's everywhere, then that's an enormous search-space question, and then couple that with the fact that, as I've told you, the antibody search space is 10^15. So the closest we have is something where you would effectively screen that antigen, having specified an epitope site, in the same way you would do virtual screening for a small molecule: screen it against a batch of antibodies and hope you get some large scores somewhere. But those methods, and DLAB is really the only method from anybody that does this, are not accurate, so I wouldn't be using it in that kind of extreme way yet. I think that's coming, but it doesn't exist. Great, thank you; I'm sure that's very useful to the person who asked the question. So, next question: have you run any in vitro validation of Hu-mAb? This is one of those questions I have to answer with "you'll have to take my word for this". As was probably clear from my previous slide, I work with a huge number of companies, and lots of those companies have this code in house. So the answer is: yes, other people have, but no, I can't tell you what happened. And that feels a bit pathetic, so I can tell you that they think it works and it's great, but I wouldn't find that very convincing if I were you either. So no, we haven't run any of that ourselves; we have been told by others who have run it, used it and done lots of work that they like the results. But that's as far as I can go, I'm afraid, and I feel a bit rubbish about it, but hopefully you can at least vaguely believe me. Yeah. Okay, great. Thank you. There's a question about glycans: glycosylation patterns can strongly affect the binding affinity and binding modes of antibodies.
Yet these have proven very challenging to include in models; could you comment on the challenges and progress in this area? I would say there has been hardly any progress in that area. It's a known problem for everything I've talked about, and for almost all the software in this area, and it's something that's usually only picked up once you have some other information saying you've got to deal with it in some way. The reason I think this is so challenging is, of course, that there's almost no structural data to work from, so it's very unclear what glycosylation might be happening, what shapes it might adopt, or what it might do to your binding. I think this is a really hard question at the moment: until there is more data, it's going to be one of those things where everyone says it's very important, but we can't do anything on a computer, because even if I wrote something I wouldn't be able to prove it worked, and it would be very difficult to write something without data to test it on or build a model from. So I think this will come as we start collecting more data around that kind of thing, but at the moment there is basically nothing that sensibly deals with it at all. Thank you. When it comes to SPACE, what would the experimental validation of the predictions be: epitope binning or something else? Yeah, so with SPACE there were two different parts. We had epitope binning, that was the data that was there, but we also had the crystal structures of lots of these things, because of the dataset we were working on, so we could actually see them. The other thing is, lots of people had done experiments making mutations on the RBD, and we had ways of identifying position from those, because everyone was desperate to find this information out. So you had lots of reports which said where on the RBD an antibody might bind: it binds on this side, it binds on that side, that kind of thing.
It's interesting, because I suspect that epitope binning and SPACE won't ever totally agree with one another. In epitope binning, of course, if two antibodies have similar but not identical sites, one will knock the other off, so they would be in the same epitope bin because they interfere with each other's binding, whereas in SPACE they might well be in different categorisations, because we would see those as different sites, given they're not identical. So there are interesting caveats here, and I'm quite excited, in the next stages, to work out what experiments you really need to show that SPACE is doing what I think it's doing. The results so far suggest that it is, but it will be interesting to build on that. Great, thank you. We have time for at least one more question, though I'm not sure how much time after that; we have a lot of questions, clearly people are very interested. We'll see. So, at least one more question. First there was a comment: Alexander says very nice talk. On your ABlooper: does the averaging of the predictions improve the quality? Would a single prediction potentially have better quality, and why not return an ensemble? So, we found that generating five predictions did produce better predictions; it is possible that you could do it with a single one. There are two parts to this: when we do the averaging is when we apply a loss function, and that loss function is what makes sure that we retain the antibody geometry properly. We did originally consider returning the ensemble, and if people really want it, it's there, you can have it. But I worry that people would think that that ensemble is an ensemble of five conformations that are possible for the loop, and we don't think that's the case; I think you'd have to train something a bit different if that's what you want to see, but it might be, so it's worth thinking about.
So, I guess in our hands the averaging does seem to improve the quality, but I'm not going to claim that's a rule; I think it's more about exactly how we designed the algorithm. So I think it's perfectly possible that a single prediction could have better quality. And with the ensemble, I don't think I have an objection; we're just slightly worried that people might over-interpret it, more than anything else. Thank you. Then there's a question about antigen-antibody interaction: could you comment on this, and in particular on AlphaFold2 doing well on protein-protein interaction but not on antigen-antibody interaction? Do you have any thoughts on this? Yeah, I think that relates to the same things I've spoken about before. I think its ability to model antibodies is somewhat hamstrung just by the entire way it's built. Antibodies have a very particular mechanism, with their somatic hypermutation and their changes and the structural importance of those residues at the binding site, and that means a method like AlphaFold2 struggles to handle them. There is also, and other people have published better papers on this than our brief comments, the fact that the binding sites between antibodies and their antigens are somewhat different in their residue compositions from those of general protein-protein interactions. Obviously it's not different on the antigen side, which is a general protein surface; it's the antibody side, with its greater use of things like tryptophans and serines. So it's a slightly different kind of binding-site formation. It could be that it's also related to having, if you like, learned how standard proteins behave, and antibodies are actually a special case within that.
So I think it's both that they struggle to accurately model the antibody CDRs, and that we know there are specific properties of antibody-antigen binding sites that differ in residue usage from standard protein-protein ones; I suspect it's a combination of those two. But I mean, we found the same thing when we were playing with AlphaFold2. It was a mixture: there was a sigh of relief, in the sense that it couldn't do antibodies, but also, damn, why can't it do antibodies, because then we could just have used it and worked from there. So I'm not quite sure how I feel, but it's one of those two or some combination. Okay, great, that's very insightful. So here's another question; first, thanks a lot for the very nice talk. Any comments on whether some of the databases and tools of your group might be applicable to autoantibodies? That's a really good question, and not something we've thought about a great deal. I suspect the general answer is yes, but we'd want to think carefully: in each case it would be whether how it was trained, or what we had done with it, had biased it to be not useful for that question. So I'll just go with a general yes, but I'd want to think quite hard before saying so specifically. Let's do one last question. I mean, it's great, we have even more questions, but I'm aware of the time, so let's have one more. The question is: do you only consider the CDRs, or do you think other regions, for example the tight PPI between the light and heavy chains, will also contribute to the action and binding, as it makes the antibody structure stable? So one of the things we do is consider the entire antibody structure for all of this. And I agree a lot about considering the binding between the VH and the VL as well, because it has an effect on the orientation between those two chains.
And if it's very flexible, that's going to be very different from if it's very static. The way I think about it, and I'm going to hold this up to the camera because I do this all the time: the CDRs are basically where my thumbs are. And if you do this and you're flexible, your entire binding shape is changing shape; if you're static, then it's one shape. So understanding all of that is really key. We actually build models of the entire antibody, always. We have a specific method, which I mentioned at the beginning, looking at the orientations between these chains, how things sit within the expected orientation distributions, and how much flexibility there might be. So I think you're exactly right, and there are lots of interesting things there. Great, thank you very much. Okay, so in the interest of time, I think we'll leave it there. Thank you very much for a great talk, thanks a lot to the audience, and thank you for responding to all these questions as well; I really appreciate it. Then I think we just have a quick note from Alessandra about upcoming BioExcel webinars. Yep, thank you very much. Thank you for the discussion, and thank you, Charlotte, for answering the questions. I just want to announce the next BioExcel webinar, which will be on the CHARMM force field and its export to GROMACS; it will be in around two weeks, with Justin Lemkul. Thank you all. And now we close this webinar session. Bye.