It's my pleasure to welcome Jennifer Listgarten and all of you for the final talk of our summer school. And it's a particular highlight, because Jennifer is a star in the field of machine learning and computational biology. She did her PhD in computer science at the University of Toronto, joined Microsoft Research, was there for roughly 10 years, and then became a professor at UC Berkeley in the Department of Electrical Engineering and Computer Science and the Center for Computational Biology. She is a Chan Zuckerberg investigator, and she has done very influential work in our field. To list a few examples: all her work on mixed models in genetics has been very influential, and she did great work on CRISPR off-target activity prediction. Now she's also working on protein design, so she has made seminal contributions to some of the key topics that have emerged over the last decade. We are very happy to have you here, Jennifer, and to learn more about your current work on machine learning based design of proteins and beyond.

So thanks for the nice introduction and also for the invitation. I've been at UC Berkeley almost four years now, which is hard to believe, and I've pivoted towards trying to understand how we could use machine learning to improve protein engineering in a number of different ways, so I'm going to tell you about my emerging thoughts on this area. My group started off working on methods development that we think is useful in this area, and at the same time we've been ramping up collaborations — we should have our first collaborative paper out, at least as a preprint, hopefully in the next few weeks. But most of what I'm going to tell you is just my thinking about this topic.
So it's not going to be a conventional talk with a whole bunch of results; maybe it's more about what the interesting questions are here and why this is different. Okay, but first let me show this slightly ridiculous figure as I wake myself up here. And I apologize, because I have given a similar talk before, so you may hear a lot of overlap. But this guy is searching for a needle in a haystack. So why is he doing that? I see there is quite a small group, so we can make this interactive — feel free to speak up with suggestions. It's actually an art installation, and I don't actually know what point they're trying to make, but they did issue a warning that it might take longer than expected to find the needle. And in some sense, when we're looking for proteins in some large combinatorial space, or molecules that suit our purposes, this fits the analogy quite well. At its core, that's how I think about this problem: if we've only got a few bits of hay that we know where they are, how do we use them to tell us where the needle that we want is? A lot of what I'm going to tell you turns out to also be applicable to problems of small molecule design, but my heart is probably mostly with the proteins, just because I've been working in biology a lot longer. Re-engineering proteins, for example, has been a long-time goal. The way it kind of started is to notice certain properties of naturally existing proteins. You may have heard of GFP, which got the Nobel Prize, I think in 2008, and which naturally fluoresces green. And then people said, maybe we can make it fluoresce blue, or we can make it fluoresce brighter — how can we do that?
And so in the cases on this slide — also Rubisco, which is the most abundant protein in the world and is involved in carbon fixation; people would very much like to re-engineer it to get the best of both worlds between oxygenation and carboxylation, and this is yet to be done. Of course everyone here has heard about CRISPR, and Karsten alluded to some of my work, which I will not talk about. I don't know why this cute little video is not playing, but people are re-engineering the Cas9 protein — these big Pac-Mans which come in and unwind the DNA — for various purposes of gene therapy with virus delivery, which is an active collaboration we have, a big one. Anyway, in all these cases these proteins already had some existing nice functions and we'd like to change them, and of course in principle we may also want to do de novo design, which is going to be, I guess, much harder. I'm going to tell you more in that spirit — well, I guess you can tell me at the end which of those it applies to. Okay, so why is this hard? I've kind of alluded to it, but — this is a machine learning for precision medicine school, so everyone knows that the amino acid alphabet has 20 letters in it. So if you want a string of even a very short length — a protein of length 50 is a very small protein — you're already getting to a number of possible proteins that is close to the number of atoms in the universe, and by the time you get to 100 you're already there. So even with massive amounts of compute you can't search through that space; even if you could run an in silico assay, you still can't really do that — and of course we don't even have great in silico assays, although that's going to be part of the topic of this talk. But it's a discrete space, so you can't follow gradients very easily, and things like this.
Okay, so I'm not going to say any more about chemistry, but I will say that these ideas — what I've started to call machine learning based design; if someone knows an existing term that's better, please let me know — are applicable in protein engineering and small molecule design, and also in things like chip design and materials. I think it's a general principle, but of course as you go into any domain you may have to revisit certain parts of it, such as the representations, and so forth. The thing that's maybe closest to the heart of what we're thinking about is based on directed evolution. In the very first paper we put out on this general topic, with David Brookes at ICML, it turned out — by pure coincidence, thinking from first principles as machine learning people — that we had generated a sort of in silico directed evolution, and I'll explain that in a moment. If you haven't heard of directed evolution: on the one hand it's the most simple idea, and on the other hand we know that it's kind of genius, in the sense that it's really impacted the world. People spent a lot of time getting it to work, it has changed the world, and it was awarded the Nobel Prize. What you do is take a protein that has the seed of what you would like in terms of the property. And you have to have something that has that seed — if you want a protein to fluoresce and you just start off with some random protein, this will not work. So you start off with, say, a wild-type GFP, and then you induce some diversification step in the lab. That could be error-prone PCR, which introduces mutations with some probability; it could be some sort of shuffling or recombination scheme from several parents instead of one parent, where you put the blocks together — something like this.
But I usually just think of it as some random distribution of errors — let's say mutations — along the parent, and you get a variant pool. And then, critically, you need to be able to screen for what you care about, and that is often the bottleneck. In fact, Frances Arnold's lab works on catalytic enzymes, and they can typically only measure about 300 at a time. Those kinds of numbers are dramatically different from what you see in NLP and vision and these things, and also you can't just crowdsource the labels — you need expertise in a lab, and labor, and money, and time. So in any case, you screen what you can, and of those, you take the ones that are, say, the top 10% performing ones, and you repeat this loop until you either achieve something that satisfies your needs or it's converged to some local optimum. The people I know who do this say that it tends to work very well, but I guess they've figured out how to limit the things they're doing to the ones that work well. So one question is: how could we bring machine learning to bear on this kind of setup — for example, although it's not the only one I'm interested in, to improve what they're doing?
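The directed-evolution loop just described — diversify, screen, keep the top performers, repeat — can be sketched in a few lines. This is a toy in silico version, assuming a made-up fitness function (matches to a hidden target string) standing in for the laboratory screen; the real screen is of course the expensive part:

```python
import random

random.seed(0)
AAS = "ACDEFGHIKLMNPQRSTVWY"  # the 20-letter amino acid alphabet
TARGET = "".join(random.choice(AAS) for _ in range(30))  # hidden optimum (toy)

def fitness(seq):
    # Hypothetical screen: number of positions matching the hidden target.
    return sum(a == b for a, b in zip(seq, TARGET))

def mutate(seq, rate=0.05):
    # Diversification step, like error-prone PCR: random substitutions.
    return "".join(random.choice(AAS) if random.random() < rate else a for a in seq)

pool = ["".join(random.choice(AAS) for _ in range(30))]  # a single "parent"
for generation in range(30):
    variants = [mutate(p) for p in pool for _ in range(100)]  # variant pool
    variants.sort(key=fitness, reverse=True)
    pool = variants[:10]  # screen, keep roughly the top 10%
best = pool[0]
```

Each pass through the loop is one round; part of the point of bringing machine learning in is to need fewer of these rounds.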
I mean, maybe we can do fewer rounds — and in fact just reducing by a few rounds can be extremely important, because by the time this gets into FDA approval, for drugs this has to go through monkeys and things like this, and those are a super precious resource you don't want to touch much. If you can go from four rounds to just two rounds, that is really, really tremendous. Alternatively — generally you're working your way towards some grand therapeutic goal, but through a number of different selections for different properties — if you can make any of those shorter, or if, having selected for one property and moving on to the next one you need, you can make sure you've maintained the first one better: all of these things are very helpful, so these are all of our goals. So if you look at what's going on here and you're a computer science, machine learning person, you say: okay, where can I bring my skills to bear? I think the most obvious one — which anyone who's been reading the news about AI for the past year and trying to vaguely understand it would immediately see — is: I know that machine learning gives me supervised models, I can find a place for supervised models, and that's going to be the screening. So instead of measuring in a lab with a biological assay, maybe with some training data I can build an in silico model and make it an in silico assay. And indeed, I think that should be, and is, one of the goals of this area — and then to see how we can leverage that. At first, when I used to give this talk, I said that number one is not actually the thing that's most interesting to me, because I think that building predictive models is now kind of a commodity, if you have enough data and you know how to use the right tools.
If the signal is there you will be able to get it out, and if not you won't — or it might be an entire PhD's worth of work in a very specific domain, bringing a lot of auxiliary knowledge to bear. So what I used to say is: I'm really interested in figuring out how we can search through the space really well, making calls to that model, and trying to understand when to trust it and when not to trust it to make progress. But I stand corrected, because my students and some collaborators have pulled me into the first one as well, and I'll allude to that briefly at the end. But the talk overall is going to focus on this idea of having some data, building a model, and then using it to do design, if you will. So how do we think of a normal predictive model? Later on I'll talk about number one, but in the beginning I'm going to assume someone has given me this predictive model already — maybe my students have trained it — and I'm not going to focus on that problem. I'm going to say: I know this predictive model has been trained on something; how can I now use it to get the protein that I would like? Normally we think of a model — it need not be a neural network, but it's nice to depict it with a little visual of one — and at some point in one of our papers we started calling this, in the context of our problem, a stochastic oracle. So I may keep saying that, even though I wish I didn't. Anyhow, you give it an input sequence after it's been trained, and it gives you some predictions — maybe for one scalar property, maybe for a set of properties, and so forth.
So when I think of machine learning based design, I think that I've been given this model and I'd like to kind of turn it on its head: I want to specify some criteria on these outputs — maybe I want the protein expression to be as high as possible, subject to the cell fitness being above some survival threshold — and then, given those criteria, I want to find a set of candidate sequences that I could give back to my laboratory collaborator. If the model were exactly right I would just find the one best sequence, but we know that everything is imperfect and we need to hedge our bets a bit, so we'd like to give back a suite of candidates. This already is quite different, in some sense, from what we think of as normal machine learning. The other thing we're going to want to do is make sure we pay attention to the uncertainty in the predictive model itself. There's a little schematic down here which, if you know Gaussian processes, you can imagine with a GP. The main idea I'm trying to show is that near the training data you're probably more confident in the model and the uncertainty is less, and as you get further from the training data the uncertainty gets larger. And if your data is not evenly sampled throughout the space, this may change as you get closer to and further away from the data. Clearly you want to pay attention to that in this machine learning based design setup — although almost everything I'm saying is super loaded with impossible things, but we'll get to them one at a time. Okay, so let me really concretely instantiate how we first started to think about this problem. Suppose we've been given the ability to use a predictive model for a property, over, say, DNA — it could be amino acid space, and it could also be some constrained part of the space; sometimes people only want to change the binding pocket or something like this.
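The schematic about uncertainty growing away from the training data can be reproduced with a tiny Gaussian process calculation (RBF kernel, unit prior variance, noise-free — all choices here are just for illustration):

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    # Squared-exponential kernel between two sets of 1-D inputs.
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale**2)

X_train = np.array([0.0, 1.0, 2.0])  # where we have labeled data
K_inv = np.linalg.inv(rbf(X_train, X_train) + 1e-8 * np.eye(3))

def posterior_var(xs):
    # GP predictive variance: k(x,x) - k(x,X) K^{-1} k(X,x).
    k = rbf(xs, X_train)
    return 1.0 - np.einsum('ij,jk,ik->i', k, K_inv, k)

near = posterior_var(np.array([1.1]))  # close to training data: small variance
far = posterior_var(np.array([6.0]))   # far from it: variance reverts to the prior
```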
And then we are able to predict a scalar property that, let's just say without loss of generality, we'd like to maximize. So we said to ourselves: okay, if someone gave us a model to do this, then what method — what algorithm, what procedure — could tell us which sequences to choose to give back to the collaborator? And we may add in some constraints — I won't talk about that too much in this talk, but one can layer these in pretty easily. That was the first challenge we set as we started to think about this, and we also said we'd like whatever method we developed to be able to handle the fact that the model may be a black box, in the sense that we can't peek inside it and can't necessarily compute gradients on anything in it other than by calling it. Partly that's just because we wanted to absorb other models people had used — we anticipated creating our own, but we also thought that if other people have some sort of biophysics models or whatever, we'd like to be able to use anything, plug and play. So the first task we set ourselves — and this is a preprint that got folded into a later ICML paper, but it sets the stage for how we think about things — is: we'd like to be able to handle a black-box predictive model that goes from, say, DNA to fluorescence; we'd like to account for uncertainty in its predictions; and we'd like to provide a good set of candidates, not just one. And so David developed a solution to this based on what — again, I don't know of concrete terms that are widely used — we've started to call model-based optimization, as a general class of techniques that we can use here. It may look a little counterintuitive at first, but let me take you through it; there's something about it that I really love.
In normal optimization you just have some function f that you're trying to, let's say, maximize over x: you somehow move around the space of x until you find the value that you believe maximizes f. In our case, think of x as the space of DNA sequences, and f tells you the fluorescence. What you can do instead — and this may seem very strange, but hopefully I can quickly convince you that it's quite sensible — is look for the parameters of a distribution over x: I can instead maximize the expectation of f of x with respect to this probability density (or probability mass function, in the discrete case). Okay, why can I do that? The very easy way to see it is that if I parameterize this density with a high enough capacity model, such that it can put all of its mass on the single value of x where f is maximal, it will be equivalent: I'll get the same value, and that point mass will sit at the same x that solves the original problem. So all I've done is change the formalism of optimization, from maximizing f(x) over x to maximizing the expectation of f(x) under p(x given theta), over theta. And you can ask: why are you going through all this crazy, weird work and notation — how is that going to help us? There are a number of reasons that people do this — we didn't know that at first; David just started doing it — and the reason I love it is that I grew up with probabilistic modeling, and I still believe that there are real benefits to thinking about things coherently in probabilities and mixing and matching the languages in this coherent way. So I love that I've introduced the notion of probability into my optimization algorithm. Ultimately, I haven't given you an optimization algorithm, right? I've just given you a formalism. But ultimately I'm going to tackle this optimization problem, and it has a probability distribution in it.
So what that means is that if I want to layer in other aspects of the problem — which I'm going to do very soon — I can now do that in a coherent way, with the language of probability. And I think that's super powerful for people in this kind of area. In the original formulation the probability distribution isn't there — maybe there's some way to get it in — but here it's just sitting there for me. The other thing is that, depending on what optimization algorithm I use, I can stop it before it converges to a point mass. If I think of it like a Gaussian — imagine it's a Gaussian I'm moving around, as I'll explain — when it's converged, I'm not going to let it collapse to a degenerate point mass; I'm going to let it keep a bit of variance. And then I can sample from it as many times as I want, to give candidates to my collaborator. And I can think about the variance of that distribution, and about diversity and so forth — not all of which I'll touch on today. But there are other reasons that people like these approaches. One is that you don't need the gradients of f: even if you want to solve this objective with a gradient, it turns out that because you've rewritten the objective this way, when you take the gradient, the gradient of f doesn't actually appear. You see this in reinforcement learning as well — it's called REINFORCE, if I'm not mistaken. So anyhow, that's the strategy we're going to use. And now our aim is to solve this model-based optimization objective, instantiated for our problem. So what does that mean? I'd like to think of this distribution — again, I'm not going to move through sequence space, I'm going to move through a distribution over sequence space.
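That "the gradient of f never appears" point is the score-function (REINFORCE) estimator: the gradient of the expectation of f under p(x given theta) needs only evaluations of f, times the gradient of log p. A minimal sketch with a Gaussian search distribution over a toy black-box f (both of my own invention here):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Black-box objective: we only ever evaluate it, never differentiate it.
    return -(x - 3.0) ** 2

def grad_mu(mu, sigma=1.0, n=10_000):
    # d/dmu E_{x~N(mu,sigma^2)}[f(x)] = E[f(x) * d/dmu log p(x; mu)],
    # and for a Gaussian, d/dmu log p(x; mu) = (x - mu) / sigma^2.
    x = rng.normal(mu, sigma, size=n)
    return np.mean(f(x) * (x - mu) / sigma**2)

mu = 0.0
for _ in range(200):
    mu += 0.05 * grad_mu(mu)  # plain stochastic gradient ascent on theta
# mu climbs toward the maximizer of f at x = 3, using only calls to f
```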
And as an abuse for intuition, I'm going to think of this as a Gaussian while I'm talking to you, even though, of course, for proteins x is discrete. In practice I'm going to use a VAE, an HMM, an autoregressive model — something like that — and you might ask what the right capacity model is, what you should use; that's an entirely different paper, but an important question. I call this the search model, because I'm going to develop an iterative optimization scheme that initializes this sort of Gaussian spotlight and iteratively moves it to a part of the space where, when I stop, I can sample from it and get the things I'd like to give back to my collaborators. And what is going to be in this probability — what is this f of x? I've changed it to be the probability that I've satisfied the criteria on my properties — let's say that the fluorescence is higher than anything I've seen before, or maximal, if we have some way to define that. The way we're going to get this is with, say, a neural network or something that we think is reasonably calibrated for the probabilities, and plug that in. And we're actually going to use the cumulative density function, so if there's heteroscedastic noise, it gets included when we do that integration. Okay, are there any quick clarifying questions at the moment? All very clear so far — but of course, questions are welcome. Let me check the chat-based ones. No, no questions at the moment; all excited and focused. Well, we can aspirationally think of it that way. Okay, I don't remember how much detail I go into here, so I may go into some. So we said: okay, now we have this loss function, framed in this way. How can we actually find a practical algorithm to tackle it?
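Concretely, if the oracle's predictive distribution for a sample is Gaussian, the "probability the criterion is satisfied" is just a tail probability — e.g., P(fluorescence > tau). A one-liner under that Gaussian assumption:

```python
import math

def prob_exceeds(mu, sigma, tau):
    # P(y > tau) for y ~ N(mu, sigma^2): the Gaussian survival function
    # 1 - Phi((tau - mu) / sigma), written with erfc for numerical stability.
    return 0.5 * math.erfc((tau - mu) / (sigma * math.sqrt(2.0)))
```

A prediction sitting right at the threshold gets weight 0.5; one standard deviation above it, about 0.84. This is also exactly where heteroscedastic noise enters, through sigma.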
Roughly speaking, there are two big challenges. First, I'm trying to do maximization over theta, but theta appears in the expectation, and that's a little unconventional compared to standard optimization techniques — we've seen this in other places in machine learning, where people use the reparametrization trick and so forth. I'm not going to go into the details; I'm just pointing out that there are two non-standard parts of this that make it challenging, and I won't go into how we solve it, but I'll give you the intuitive algorithm that emerges. The second thing — which I think is very important to notice, and something I didn't really have to think about before I encountered this problem — is that we have to initialize theta at some point in the beginning. And when we initialize theta, we know it's not going to be in a part of the space — this sort of spotlight — for which the properties are satisfied. If it were, we wouldn't have a design problem; we would be done. So, no pun intended, by design we're initializing in a part of the space where the property does not hold. What that means is that this expectation is going to have very high variance, because the sampled x's are going to have very low values for this probability. So this is essentially a rare-event problem, and it makes it very tricky to tackle. We drew on some literature on Monte Carlo estimation for rare events — although we could have actually drawn on more than we did, but we didn't go back and add more in. Basically, we added an annealing step where this criterion starts off quite loose and sloppy, if you will, so that we know that in the beginning it's got enough mass that we can get reasonable Monte Carlo estimates, and then we slowly clamp it down.
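Here's the rare-event problem in miniature: at initialization, essentially no samples from the search distribution satisfy the real criterion, so the Monte Carlo estimate is all noise, whereas an annealed, quantile-based criterion always keeps a workable fraction of the mass. (A standard normal and arbitrary thresholds, chosen only for illustration.)

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=10_000)     # samples from the initial search model

strict = np.mean(x > 5.0)                 # the real criterion: essentially never met
loose = np.mean(x > np.quantile(x, 0.9))  # annealed criterion: top 10% by construction
```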
In the end, what this looks like is a very intuitive, iterative algorithm. You initialize the parameters — again, I'm just going to pretend it's a mean and a variance — somewhere. Then you sample from the distribution, compute how well each of those samples adheres to the properties we're looking for, and essentially do weighted maximum likelihood to get the next theta, and you just keep moving around like that. And this actually looks a lot like — I'll get to that in a second. Okay, so here's some pseudo-code, and it maps intuitively, nicely, onto directed evolution. For this procedure you need the oracle — the predictive model — and a routine to call its cumulative density function, say a neural network with some Gaussian noise. You need a generative model — a Gaussian, or a variational autoencoder, or an HMM — that can take weighted samples and do maximum likelihood estimation on them. We actually do use a VAE, which requires variational inference, and you can see in the paper the details of some ways you can be exact in doing that. And then you may initialize with a set of parent proteins, or not. If you don't initialize with a clever set, you may converge somewhere that's not particularly useful, or it may take a very long time, or you may need lots of random restarts — we haven't suddenly fixed a non-convex, hard optimization problem. But compared to the biologists, who are in a similar situation, the hope is that we can move around the space more cleverly and overcome bad starts better, although that remains to be seen. So we initialize, either with a known set or some random initialization, and we set all the weights in the beginning to one. And then, so long as it hasn't converged, we sample from the generative model.
Well — first we train it on the initial set, and then we sample from it, and now we have a set of proteins, which is kind of like the diversification step, like error-prone PCR. But — and this is, I think, a key point — when I sample from this generative model, if I've chosen a suitable class of generative models, it's going to understand things about the space. Again, if I just think of it as a Gaussian: it understands the covariance of the space, so it understands what the redundant directions are, if you will, and the orthogonal directions. So it stands to reason that this might let us move around the space more cleverly than the more agnostic diversification steps taken in the lab. Then I'm going to score each of those samples using the oracle — I'm going to gloss over this, but basically we use the predictive model and its CDF to coherently create a weight that matches the setup of the math. And then, as I said, we just continue: I take those weights and retrain the generative model. Each time I do this, I'm moving this Gaussian — or HMM, or VAE — spotlight around the space, shifting it to the regions that better satisfy the criteria I'm looking for. Now, we didn't know this — David and I were doing this in our own little bubble at UC Berkeley — but we quickly realized that this is very related to modern-day evolutionary algorithms, which I was taught in grad school were kind of the most ridiculous, ad hoc things in the world. And in fact they are a little ridiculous, but those ad hoc, quote, evolutionary moves were actually replaced — by Baluja and Caruana in '95 — with a generative model, which is essentially what we've done, right?
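Put together, the loop reads like a cross-entropy-method-style estimation-of-distribution algorithm. Here's a minimal sketch with an isotropic Gaussian as the search model and a made-up smooth oracle; note it uses hard top-quantile "elite" weights, a simplification of the CDF-based weights described above:

```python
import numpy as np

rng = np.random.default_rng(0)

def oracle(x):
    # Hypothetical stand-in for the predictive model: a smooth scalar property,
    # maximized (unknown to the algorithm) at 4 in every coordinate.
    return -np.sum((x - 4.0) ** 2, axis=-1)

mu, sigma = np.zeros(5), 2.0                        # the Gaussian "spotlight"
for _ in range(50):
    samples = rng.normal(mu, sigma, size=(500, 5))  # diversification step
    scores = oracle(samples)                        # in silico screen
    tau = np.quantile(scores, 0.9)                  # annealed acceptance threshold
    elite = samples[scores >= tau]                  # 0/1 weights on the samples
    mu = elite.mean(axis=0)                         # weighted max-likelihood refit
    sigma = max(elite.std(), 0.05)                  # keep some variance: no point mass
candidates = rng.normal(mu, sigma, size=(10, 5))    # a suite for the collaborator
```

Swapping the Gaussian for a VAE or HMM, and the 0/1 elite weights for the CDF weights, recovers the spirit of the algorithm in the paper.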
I have this VAE or this Gaussian and I'm moving it around; I'm generating diversity with a generative model. I'm not saying, I'm going to mate these two samples together and have them cross over, making up some kind of crazy rules. I'm going to let the machinery of machine learning understand the space and how to induce variation. That modern field is called estimation of distribution algorithms, EDAs. And this idea of having to deal with rare events is also very closely related to what's called — I always mix up the acronyms with the EAs, and it's too early here — the cross-entropy method. So we've also drawn on that. A particular EDA is called CMA-ES — covariance matrix adaptation evolution strategy — which actually uses a Gaussian. And actually the most interesting thing, for me — which is why we wrote a paper on it — is that you can write the algorithm I've shown you exactly as a particular, sort of strange, form of EM. We have a very short paper explaining that connection, which as far as we know had not been made before. I also won't go into this, but I've alluded to the fact that the underlying optimization problem, and what the objective looks like, is tied to how many problems get framed in reinforcement learning — my colleague Sergey and I here have had some really fun chats trying to figure out how to work together, just because of the underlying technical challenges. But I won't talk about that here. All right — what time do I have till, 8:20, 8:30? Another 15 minutes, but I can go up to 25. I can stop at any point; I don't think I'll get through everything, so I'll just stop when the time is up, in a sensible manner. Okay, so I think this is a really cute story.
And we were very happy. Then, after doing the development, we started to run it on data sets. David was running it with GFP, which is supposed to fluoresce, and he was looking at what it was giving back, and he said: you know, I'm no protein scientist, but I'm pretty sure this is complete garbage. So in fact we had to tweak this — I didn't just waste your time telling you all this, but we had to think a little more deeply about what's going on: why isn't this working? Basically, in what I've told you, I've been fully trusting that predictive model and its uncertainty: I'm assuming that it's unbiased and has good uncertainty estimates, because I'm calling its point predictions and its CDF. And in fact this is complete nonsense. We like to look at these nice two-dimensional pictures of uncertainty — this is what the GP people always do, and maybe I'm a little too cynical — but what we know is that the further you get from the training data, all bets are off on the point estimate: it might be anywhere, and you basically cannot trust it. But correspondingly, you can't trust the uncertainty either, because that point estimate is just the average of the predictive distribution — at some point, neither one is reliable. So although, to the extent you can model the uncertainty well, you're likely to do better on this problem, at some point that's not enough: you have to know when you're far enough away that you just can't trust the model. This is a problem of extrapolation. When people now ask what the open problems in our field are — which I actually find to be kind of a funny question, but that's another story — I say: I want to use machine learning to do protein engineering.
And I want to have some supervised models in there, and I don't know how far I can trust them. I think that's actually the hardest problem, in some sense. So how can I know when I should stop trusting that model, and how far can I push, right? And of course, if you're doing active learning, you can keep making that space more and more reliable, because you keep collecting more data. But the reality is that at some point you're going to stop collecting data and you're going to want to make your best bet, right? Very few people run a laboratory in protein science where they just keep collecting more and more; they typically want one or two rounds of data, and then they say, okay, please give me something that works. And I should clarify: sometimes people think I'm talking about active learning here, but in everything I'm talking about, I'm assuming we've stopped collecting data, because at some point that will happen. Okay, so now let me slightly switch gears, and then I'm going to loop back to what we were doing and explain concretely what went wrong. You've no doubt by now heard of all these strange pathologies that people are constantly trying to combat in, say, highly over-parameterized neural networks: you have a state-of-the-art classifier that can predict a stop sign quite perfectly, but if you do a clever little gradient step, not on the parameters but on an image, you can add what looks like imperceptible noise to the image, run the new image through that same classifier, and basically get it to output any label you want. So when you hear about adversarial examples in machine learning, this is what is meant. Now, we don't have an adversary, right? No one's doing this to our protein models. So why am I talking about it?
Because this basically shows you that there are these sort of black holes in the space in which you can query these models, right? There are certain regions where, if I query the model, the output is completely crazy; it really has no idea what's going on. It just happens to give good labels in the parts of the space where I was expecting to query it. But if I query it in a part of the space I wasn't expecting to, then all bets are off. And, right, here's another example, where all these boxes, from a state-of-the-art classifier, get labeled as having greater than 50% probability of being an airplane. In other words, many high-dimensional, high-capacity models, especially neural networks, are just full of nonsense if you look in the wrong place. Okay, so let's loop this back to the problem of machine-learning-based design and think about it for a second. What I'm going to show you is something people do in computer vision; perhaps it's at this link, I don't know, I put the links here just to credit the images. People will say: I've trained this giant neural network on gobs and gobs of data, and I want to know what my neural network thinks a banana is. So I'm going to set the banana classifier node to one, and turn off the apple, the orange, the giraffe, and so on. And then, because they have real-valued images, they don't need to deal with combinatorial optimization; they do gradient descent where they back-propagate, in a sense, onto the inputs, starting from a random image, until convergence. And in this example, this is what they get. So their answer would be: this is what my model thinks a banana looks like. And on the one hand, it's sort of very reassuring, right? I have these curved edges.
It sort of convincingly seems like it's understood what a banana is. On the other hand, if I had this 3D printed and put it on, you know, grandma and grandpa's dining room table, they would think it was abstract art; they would not think it was a banana. So it's also complete nonsense, right? And really, this is what we're doing when we do machine-learning-based design: we're setting these nodes to what we want, not because we want to understand what the neural network thinks, but because we want to find those things, right? We want to recover, in the input representation, which protein to use. And we're going to start it somewhere. In this case the start is random, but actually, I don't know if anyone's done this: I wonder what happens if you start with something that actually looks like a banana and run it; would it mess it up or not? I don't know. But in any case, if this is the therapeutic you want to give to your grandparents, rather than a banana, then it's getting really dangerous, right? This is complete nonsense. And so that's essentially the problem: if we're using a high-capacity model, which hopefully we will be, in these sort of interesting spaces, then things can just go terribly wrong. And so what we did was add in one more layer. And I don't think this is the end of the problem; this is just saying: this is a problem, we need to think about it, how can we start to think about it? And I know people think everyone is presenting solutions at ICML and the like; I've started to call these thought papers. We've started to think about this, and we've provided some approaches that are better than not thinking about it at all, but I'm not claiming you should take the exact algorithm from our paper and it will solve your protein design. It's a way to think about it.
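The "what does my model think a banana is" trick, gradient steps on the input rather than the weights, can be shown in a few lines. This is a deliberately tiny stand-in (a linear scorer instead of a deep network, my own toy setup, not from the talk), but it exhibits exactly the pathology described: nothing constrains the optimized input to look like real data, and the score can be driven arbitrarily high.

```python
import numpy as np

# A toy linear "classifier" stands in for the big network: score = w . x + b.
rng = np.random.default_rng(0)
w, b = rng.normal(size=16), 0.1

def class_score(x):
    """The node we crank to its maximum -- the 'banana' output."""
    return float(w @ x + b)

x = rng.normal(size=16) * 0.01      # start from a (near-)random input
for _ in range(100):
    grad = w                         # d(score)/dx for a linear model is just w
    x = x + 0.1 * grad               # ascend the model's score surface

# After optimization, x scores enormously higher than any "natural" input,
# yet it is just a scaled copy of the weight vector -- the analogue of the
# abstract-art banana: high model score, no resemblance to real data.
```

For a deep network the gradient would come from backpropagation rather than being `w` itself, but the failure mode is the same: the optimizer happily walks into the model's black holes.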
So okay, so what might we do to add in this one more thing? Suppose, and this figure is actually from a small-molecules paper, the training data lie in this region down here and map to a property up there. If we had access to those training data, we could say: I should probably trust the model near the training data, where p(x) is high, p(x) being a density fit to the training inputs. And if I could somehow reasonably well estimate p(x), maybe I can fold that into the optimization as a sort of probabilistic trust region, and maybe that will help me. If I don't have that, because I said I wanted to be able to use any kind of model, then another thing I might think of doing is asking: are there some general properties of my scientific domain that may be helpful here? And I would argue, and in fact I'd say people aren't talking about it this way, but I think this is what people are implicitly doing, that one way to think about this for protein engineering is: if I could even just restrict the search to the space of proteins that fold, or are likely to fold, I'm probably already doing a huge service to my search, right? Because most of protein space does not fold, and most proteins we want do fold; they're not these sort of weird disordered proteins, which, while numerous and useful, are, as far as I know, not the main goal in most protein engineering tasks. And of course, I may want to mix and match these two criteria, right? Maybe I know where the training data are, plus I have this sort of biological knowledge I'd like to fold in. So basically the goal is: imagine I have some p(x) which arose from one of these kinds of arguments; how can I combine it with the method I have so far? I'm not going to go into all the details; I'll just point out that you end up with a very intuitive change to the algorithm.
So this was the algorithm I first told you about, where you just did a weighted maximum likelihood at each step. So theta is, say, the mean and covariance of the Gaussian, and this weight says how well each of my Monte Carlo samples adhered to the desired properties; then I just do weighted maximum likelihood on the new generative model, which is this guy. And here I'm just adding in one term, which looks at how much my sample adheres to these extra criteria: being near the training data, being likely to fold, and so forth. So I'm just tweaking the weights, and I still have a weighted regression. And that's essentially how to do it. So let's see, okay, I still have a few more minutes. I think the thing that is most challenging about this area, and very different from most other machine learning I've done, is: how do I actually test out this idea, right? Normally, with at least regular supervised modeling, I can hide part of the data and then ask how well it did. And maybe if I'm interested in, you know, domain shift or something, I can sort of simulate that; well, actually, no, I can't simulate that unless I have labels, but for standard predictive modeling I can just hide part of the data. But in this case, what people were doing, and what many papers do, almost every single paper, is treat this problem as an optimization contest. And basically, people are making up new optimization algorithms; they're not publishing in the optimization community, they're not talking to optimization people, and if they did, they would probably be told: you should just use this, you know, discrete optimization method from 40 years ago, or something. The real problem here is: how much do I trust that predictive model, and how well does my way of bringing that into the optimization help, combined with having a decent optimization algorithm.
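The weight tweak described a moment ago can be sketched directly: multiply each Monte Carlo sample's property weight by a trust term, here the density of a model fit to the training data, and then do the same weighted maximum-likelihood refit of the Gaussian. This is my own minimal sketch of that idea, not the exact algorithm from the paper; the toy distributions are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def weighted_gaussian_refit(samples, property_weight, trust_logpdf):
    """One update: weight = property adherence * trust-region density,
    then weighted MLE for the Gaussian's mean and covariance."""
    w = property_weight(samples) * np.exp(trust_logpdf(samples))
    w = w / w.sum()
    mean = w @ samples                               # weighted mean
    centered = samples - mean
    cov = (centered * w[:, None]).T @ centered       # weighted covariance
    return mean, cov + 1e-6 * np.eye(samples.shape[1])

rng = np.random.default_rng(1)
samples = rng.multivariate_normal([0, 0], 4.0 * np.eye(2), size=2000)
trust = multivariate_normal([0.0, 0.0], np.eye(2))   # density near "training data"
prop = lambda xs: np.exp(xs[:, 0])    # property improves with the first coordinate
mean, cov = weighted_gaussian_refit(samples, prop, trust.logpdf)
```

In this toy, the property weight alone would drag the Gaussian's mean far out along the first coordinate; the trust term holds the update close to where the (pretend) training data live, which is exactly the probabilistic-trust-region behavior being described.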
And I just want to emphasize, because people think we're trying to develop a new optimization algorithm: we're not. We're trying to have one that has the language of probability in it, precisely so we can bring in probabilistic trust regions and so forth. But the reality is that we're trying to design something, right? And so we don't have holdout data either. Maybe we have some real GFP data and I'm trying to make it brighter; but unless I go to the lab, I don't really know if I've succeeded. So it's very tricky to evaluate these things. But basically the way we do it, in this case I believe this was GFP, well, this panel is simulated, and over there is an illustrative example, but in GFP, we basically remove all the brightest GFP sequences, train only on the dimmer ones, and do things like this to test whether the method gets us to where we'd like to be. But it's tricky, and it's a much more nuanced evaluation than for other problems. So let's see, I have just a few minutes; let me just say a few other things. So, right, basically the main point of this work is: I want to do machine-learning-based design, but I need to not walk too far away into crazy parts of the space. That's the two-sentence summary. And of course, there's a bunch of empirics showing that, compared to other methods, we do better at this, where we can simulate the exact problem. So let me quickly go through a few other highlights, just so you know what else we're doing in this space. So actually, I guess this was at NeurIPS; sorry, this slide's a bit out of date. So Clara had read this paper after joining the group, and she had this kind of crazy idea. She said, you know, we are actually building our own predictive models; it's not that someone hands them to us.
And if we have access to those training data and we're training the models ourselves, and we're not collecting new data, is there any reason to think we should retrain the model? Like, if I'm not collecting new data, why would I retrain the model? It sounds like a crazy hypothesis, but she has this really beautiful paper showing that indeed you might want to do that. I'm not going to go through the details, but I think there's a very intuitive explanation, which is that we know this model is not the true causal model, right? If it were the true causal model, we would never update it, because why would you? But as we iterate, as we move that Gaussian through the space, we get further and further away from the training data, and so we almost have a domain-shift problem. We didn't start out thinking about it this way; she started from somewhere else, turned the crank of probabilistic modeling, and found a mapping to solutions in domain adaptation based on importance weighting. But the idea is that as we move that Gaussian around, we try to tailor the predictive model, that oracle, to be more accurate in that part of the space. And of course, as you do that, you essentially reweight your training data, so that the training data closer to where you're making calls are weighted more heavily and the data further away are down-weighted. But as you do that, you get a lower and lower effective sample size, because you're getting further and further from your data, which also gives you a handle on how much you can trust the whole thing. Okay, but I'm not going to go into this further; I'll just say a few more things. Right: this next one is in something like its third round of review.
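The reweighting idea just described, up-weight training points where the search distribution currently sits, and watch the effective sample size shrink as you drift away, can be sketched with standard importance weights. This is a generic illustration under my own toy distributions, not the specific estimator from Clara's paper.

```python
import numpy as np
from scipy.stats import multivariate_normal

def retraining_weights(train_x, search_dist, train_dist):
    """Importance weights for retraining the oracle: up-weight training
    points lying where the search distribution currently places its mass."""
    logw = search_dist.logpdf(train_x) - train_dist.logpdf(train_x)
    w = np.exp(logw - logw.max())        # subtract max for numerical stability
    return w / w.sum()

def effective_sample_size(w):
    """For normalized weights, ESS = 1 / sum(w^2); it collapses toward 1
    as the search distribution drifts away from the training data."""
    return 1.0 / np.sum(w ** 2)

rng = np.random.default_rng(0)
train_x = rng.multivariate_normal([0.0, 0.0], np.eye(2), size=500)
train_dist = multivariate_normal([0.0, 0.0], np.eye(2))
near = multivariate_normal([0.5, 0.0], np.eye(2))   # search still near the data
far = multivariate_normal([4.0, 0.0], np.eye(2))    # search has drifted far away

ess_near = effective_sample_size(retraining_weights(train_x, near, train_dist))
ess_far = effective_sample_size(retraining_weights(train_x, far, train_dist))
```

The ESS is the "handle on how much you can trust that thing": when the search distribution has moved far from the data, only a handful of training points carry any weight, and the retrained oracle rests on almost nothing.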
And so this is some newer work, and I'm not going to get too into it, but people have been publishing very complex NLP-based models that require, you know, a crazy amount of compute, like two million dollars of GPU spending at Google or Salesforce or wherever. And they're not doing due diligence on simpler models; they're combining a lot of things together, and it's hard to make sense of. So, one paper that really inspired Chloe; actually, we weren't intending to write this paper. It started when Chloe said she wanted to understand one of the papers from George Church's lab on low-N protein engineering, which is Biswas et al., which I think was eventually published in Nature Methods. They used an mLSTM on a huge set of basically all available proteins, then did some representation learning, then threw regression on top of that, and then did design; it's like the whole kitchen sink in there. And we started to ask, sorry, I realize I've totally switched gears, but I just want to mention this paper. Anyway, this paper is kind of an overview of what a lot of people are doing with modern-day deep learning, and it talks about some of the issues there: is this actually helpful? Is it not helpful? What's really going on? It's almost a review plus a big comparison, and it pokes a few holes, like, you didn't really need that crazy complicated model. So I'm hoping this is accepted in the next few weeks, but we'll see; you can find it on arXiv. And then David, who was my very first PhD student, has just left to go to Dyno, a startup that does protein engineering.
And he got really obsessed with a question we would always get when giving the kind of talk I'm giving today: how many samples do you need to do this? Like, how many supervised samples do you need? And what I told everybody is: we don't know, right? You can never know, because you don't know what that fitness landscape looks like. By fitness landscape I mean the mapping from DNA to fluorescence, right? So if that's a very bumpy surface, then I'm going to need a lot of samples to estimate it well; if it's just a linear surface, then I only need a handful, right? And so if you don't know how complicated that surface is, you can't possibly know the answer. And although there are plenty of people who do power studies for their NIH grants under the assumption of very simple models, we can't do that here; we'd be totally fooling ourselves. And so I said to people, look, we don't know; we'll collect data, we'll do our best, and that will be that. But David is a biophysicist, and he said, no, I want to answer this question. And I said, no, please don't, it's impossible. Anyway, at some point I said, okay, he's going to do it. Where he ended up is a very technical paper; this one is also under review. Oh, I don't even have the link up, but you can find it from my web page; it's on bioRxiv as well. And he ended up in a nice collaboration with Amirali, an information-theory postdoc here. Here's the very quick summary of the paper: there's a discrete analogue of the Fourier transform called the Walsh-Hadamard transform, which is really nice for biological sequences because it lets you think in a natural basis, namely the epistatic interaction terms of the fitness function.
And basically, using a lot of mathematical machinery and a particular sort of toy model, David could nevertheless show on real data that you can use the crystal structure, and he could only do this in a few cases, so we don't know how general it is; for the ones he had, we had to find proteins that had, oh no, one is a predicted structure and one is a crystal structure. But the problem is that in examining this topic we needed protein fitness landscapes with all possible observations, just to see what was going on, and there are very few such data sets, so we were very limited there. But we think there's something there, which is: if you take a crystal structure and consider any amino acids within four angstroms of each other to be interacting pairwise epistatically, then that actually gives you a pretty good idea of which epistatic terms go into the fitness landscape, which in turn tells you how bumpy it is, and then you can turn the crank on known methods for computing sample sizes for linear functions. And I guess the key here is that linear functions are enough if you include high enough order epistatic terms, right? Because if I go up to arbitrarily high-order epistasis, I can capture any fitness function, even if that's not how we typically learn such a mapping. So the punchline is that, using some simple rules of thumb from structure to gauge which interactions are likely to be epistatic, we can actually get some pretty good ballpark estimates of the sample size needed to estimate the fitness function. So you can take a look at that, if that's interesting to you. And then, I wish I could say more, but this is a really exciting collaboration with the Schaffer group, who are essentially the leaders in the field of AAV design; he has some startups going through FDA approval. And so very soon we'll have something out, but I can't quite talk about it yet.
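The Walsh-Hadamard picture above can be made concrete in a toy binary-alphabet setting (real proteins have a 20-letter alphabet, which the actual work generalizes over; everything below is my own illustrative construction). Writing a fitness function in the Walsh-Hadamard basis makes it linear in epistatic interaction terms, and the transform recovers exactly which additive and pairwise terms are present, which is what determines "bumpiness" and hence sample size.

```python
import numpy as np
from scipy.linalg import hadamard

L = 4                        # toy sequence length, binary alphabet {0, 1}
n = 2 ** L
# Enumerate all 2^L sequences; row j holds the bits of the integer j.
seqs = (np.arange(n)[:, None] >> np.arange(L)) & 1

# Toy fitness: two additive terms plus one pairwise epistatic interaction
# between sites 0 and 1, written in the +/-1 "spin" encoding.
s = 1 - 2 * seqs.astype(float)                 # {0,1} -> {+1,-1}
fitness = 0.7 * s[:, 0] + 0.4 * s[:, 2] + 0.9 * s[:, 0] * s[:, 1]

# The Walsh-Hadamard transform reads off the epistatic coefficients:
# row i of the Sylvester Hadamard matrix is the product of spins over
# the sites in the binary expansion of i.
H = hadamard(n)
coeffs = H @ fitness / n
# coeffs is nonzero only at i=1 (site 0), i=4 (site 2), i=3 (sites 0&1).
```

In this basis the fitness function is exactly sparse, and a structure-derived contact map (the four-angstrom rule above) is a prior on which coefficients can be nonzero, which is what lets you turn the crank on linear-regression sample-size arguments.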
But it's about AAV capsid design for therapeutics, to deliver genetic payloads. And I will stop there and just say that I came to academia just a few years ago, and it's really been a pleasure to work with all of these magnificent students. David has actually just left; there are a few more in the pipeline. And Brian, I didn't talk about this, has been just an incredible master's student working on a nanopore sequencing project. Okay, I will stop there, because it's late there and still early here, and we have a bit of time for questions. Thank you, Jennifer. That was excellent. We learned a lot about protein design, and people are sending virtual applause here; I can see that in the Zoom channel. Yes, so, time for questions, an opportunity for the doctoral students, for example. Maybe I can go first, with a high-level one, if there's no immediate question: how much will this breakthrough of AlphaFold change the field now? I mean, most of what you said was sequence-based, right? And at the end, on sample size, you mentioned structure-based statistics; but what will be the impact of AlphaFold on this? I mean, I don't know, and I think this is the interesting question. And of course, the whole world now is going to start running AlphaFold, and they're going to try to add it here and add it there, and there's going to be a whole set of papers showing that when you add these AlphaFold-based features to this type of modeling, it helps. And I guess the question is: what features will you take from AlphaFold, right? So suppose you're trying to estimate fitness and you can compute the structure. I'm looking forward to seeing creative ways, ways I didn't expect, in which AlphaFold will get used to do that, and I expect there will be some. How much will it help? I mean, I don't know. I guess structure is very useful, but we don't know how useful it is in different areas.
So the jury is still out, but I think that's what the whole world is waiting to see: how useful will it be, for which problems, and how will we make it useful? And in fact, I used to have some slides in this deck where I say we're going to build this predictive model; historically, people always built those predictive models by going through structure, right? They would first predict structure and then use the structure to predict the fitness. But as machine learning people we said, before AlphaFold, why would you do that? It doesn't seem to make a lot of sense; if you can get enough data, let's just try to bypass it. And then, we sometimes talk to Frances Arnold's group, and she said, for the proteins we care about, these catalytic enzymes, if we want the structure, we can get the structure; maybe it takes a year and we hire a crystallography person, but getting the structure is not a bottleneck for us. But that doesn't mean that, as her group develops computational methods, which they also do, having access to AlphaFold isn't going to help them featurize and so forth. But I don't know; my students are starting to think about it, although I keep telling them: make sure you do something very non-obvious, because the whole world is going to jump on this right now, and I don't want to end up in a rat race. But I'm very excited to see. So, yeah. No, it certainly is something that many, many labs are thinking about now. So are there further questions? Yes, Giovanni, one of the students from the network, has a question. Please. Thank you for your talk; it's really fascinating. I kind of wanted to piggyback a little on Carson's question. And it's a bit of a curiosity, because it's closer to some of the work that I'm interested in.
I've seen a few groups apply language models to proteins and draw some very interesting conclusions about this language of amino acids and so on. How useful do you think something like that would be for developing new proteins? Or, if we understood this grammar of amino acids, could that be useful from a design perspective? Well, I'm a cynic and a slow convert. I'm a cynic because what people were doing was: I'm going to download all of UniRef50, I'm going to spend a million on compute, I'm going to do this language modeling, and I'm going to say that I've learned the grammar of proteins, whatever nonsense. They're just throwing a bunch of data at a big thing, and then they're not doing the very simplest comparisons; people have not been doing good baseline comparisons or ablations, and it's been kind of crazy. So part of that is, as I was saying, Chloe's paper is my first answer to that question: go look at that in detail. It's a preprint, and it's not the end of the story; but first of all, that paper took a crazy long time. And there's now a newish paper by Salesforce, and, is this all recorded and public? I'm trying to remember, Carson, is this publicly available after it's been streamed? Okay, it's being streamed. Okay, well, that's fine; I'll just say what I was going to say. Some of their earlier papers I was very skeptical of, but the most recent one I find rather compelling. Though I think, as always, as you look deeper, you wonder what's really going on and whether it is as it seems. And I think that's one of the more compelling ones suggesting there might be something to taking all of UniProt, or whatever equivalent large database of proteins, building an unsupervised model, and then incorporating that. To me that's the most compelling one, but I'm still a little skeptical.
But I think it's a well-done paper that points in the right direction. Actually, I think the biggest problem at the moment, and it was extremely frustrating for Chloe when she was writing her paper, as it is for the field, is that the number of supervised data sets you can actually test things on is very limited, and they tend to be single-mutation deep mutational scanning data. So Debora Marks's group at Harvard are pioneers in this area, and they have something like 30 now-canonical data sets they use for evaluation as they push the forefront of the capacity of these models and so forth. In their case they're doing evolutionary density modeling. So they take, say, GFP, do a jackhmmer search to get a collection of sequences that look phylogenetically similar and are assumed to have a similar function, and they build a density model. So it's totally unlabeled, although it is of course curated to be enriched for things that look like GFP. And, well, I totally lost track of where I was going with that. What was I trying to say?
Ah, right: they're the ones who have defined some of the test sets people use, either for evolutionary data or, in the case of Chloe's paper, where Chloe is also asking: what if I have both evolutionary data and assay-labeled data, because that is of course what actually tends to happen; then which methods are better, and what if I remove this piece or that piece? But almost all of those data sets are just single-mutation scans, and so, even though Chloe spent a really long time investigating this, at the end of the day it's hard to say much about what's really going on, because we don't even have great data sets. So this is a really big limitation, and, as we write in our discussion, I wouldn't be surprised if all these conclusions are limited by this limitation in the data; things that perhaps don't look warranted may very well be warranted, and we don't know yet. I'm a big fan of anticipatory research, but I'm not a big fan of anticipatory results. And so I think people are not necessarily considering things as, no pun intended, deeply as possible. But I think there's something to it. NLP has its own issues, and I don't think biology is NLP; the only similarity is that there's a sequence of discrete symbols, and that is a big similarity, so it is a nice field to draw on. But I think it's worth also saying: I'm actually doing biology, and these are proteins; is there something I should think more carefully about? Thank you, Giovanni. Thank you, Jennifer, for the talk and the discussion.