Hi everybody, this is Trey Ideker back with you again from UC San Diego. I'll be one of your two co-moderators for this session one on algorithm development and machine learning approaches in genomics, together with my co-chair.

Hi, how are you doing? This is Anthony Philippakis from the Broad Institute, dialing in here from my attic in Harvard Square. This is a very exciting session coming up. We have a great lineup of speakers, and Trey and I are very excited to be here for it.

Great, thanks a lot, Anthony. And so without further ado, let's go ahead and introduce the first speaker. The first speaker today is Dr. Jian Peng. He hails from the University of Illinois Urbana-Champaign, where he is an associate professor of computer science, and the title of his talk is "Machine learning algorithms for structural and functional genomics."

Hi everyone, my name is Jian Peng. I'm currently an associate professor of computer science and medicine at the University of Illinois. Thanks so much for the invitation; I would like to share some of our research on structural biology and functional genomics, mainly from a machine learning perspective.

As we all know, a protein is composed of a sequence of amino acids, and when the protein is present in solution in the living cell, it has to fold into a particular structure called its native structure. The problem of understanding how a protein, after it is generated, folds into the right structure is called protein folding, and we have been studying this protein folding problem for decades.

Why is the protein folding problem important? Why do we need to understand the structure? We all know that a structure provides a lot of insight into particular locations and particular sites. One example, one of my favorite proteins, is the protein kinase. What it does is modify other proteins by adding a phosphate group, a procedure called protein phosphorylation. Originally we had no idea how this was done; only after the structure was solved did we learn that the structure has two lobes, and in between there is a pocket that hosts a molecule called ATP. ATP is one of the most important energy molecules and also carries the phosphate group that needs to be transferred onto the substrate, a peptide or a fragment of another protein. After this procedure, the phosphate is transferred to one of the residues on the substrate, which performs a molecular switch function, turning another protein's function on or off. So structure provides a lot of insight into function.

Therefore, protein structure prediction is still a very important task, and it remains one of the most challenging problems in computational biology. The task of protein structure prediction, which people sometimes also call a generalized version of protein folding, is this: we are given an input amino acid sequence, and through whatever fancy computational approach you want, we obtain its three-dimensional structure, shown on the right side. There are a lot of very successful prediction algorithms, including Rosetta, RaptorX, I-TASSER, and the newest version of AlphaFold, and I also show you some results here.
These plots show the progress of the field's collective prediction power over the past few years. The metric I use here is the one used in the recent CASP competitions, where the entire community comes together, benchmarks their algorithms on new, unreleased proteins, and sees whether their predicted structures reach a certain level of accuracy. I only show the plot up to 2018, because there was a huge improvement in the most recent CASP in 2020.

What I want to highlight is that in the middle of this period, for maybe ten or twenty years, we did not see much improvement. Much of the improvement we did see can be credited to the increase of data: we got more templates, more similar structures, that we could use to make predictions. But most recently you can see a huge jump that greatly increased the performance of structure prediction. Why is that? I will show you the two reasons I believe to be true. One is that machine learning algorithms have become very powerful and can leverage a lot of data to make reasonable predictions. The other is that we have had a paradigm shift in how we do structure prediction: previously, the most reliable approach was template based, but now we are in the regime where we use coevolution analysis, which relies on contact prediction.

So first I would like to show the current status of protein structure prediction and what we have been doing in this field. As just mentioned, there are currently two different approaches for protein structure prediction. The first step is always the same: given a protein sequence, we use sequence search algorithms to find all the homologous sequences and build what is called a multiple sequence alignment.

Traditionally, the most reliable approach is called template-based modeling. We build a model, hidden Markov models or some kind of sequence profile, and we search against a structure database called the PDB. We identify all the plausible templates and extract distances, pairwise distances or geometric constraints, from these templates. The naive way to think of it is that we copy the coordinates from the templates, bring them together somehow, build the backbone structure, and then add the side chains to the structure. This method has worked very well and is very reliable, but it has hit a bottleneck, a kind of upper bound: we have not been able to further improve the performance of template-based modeling for quite a few years.

Most recently, the new paradigm is what we call contact-assisted modeling, where we rely on the information within the alignment itself. We extract something like a correlation among the different sites, and we predict whether two residues, two amino acids, will be spatially close according to the correlations in the alignment. This is called coevolution analysis. With deep learning models we are also able to predict a lot of properties and constraints regarding the pairwise residue-residue interactions.
For example, whether they are close by, or their orientations. With these predicted constraints, we can optimize the backbone and obtain a really good initial structure, and further we can add the side chains and perform some type of structure refinement.

So the core concept here is coevolution, and let me show you why this is a good intuition. After we build this multiple sequence alignment, look at these two columns. Suppose we have a way to evaluate the strength of the coevolution of these two sites; for example, E_ij is a term used to quantify the choices of two different amino acids at these positions, and its absolute value gives us the strength of whether they like to co-occur or not. This is biologically meaningful. Consider the structure on the right, where we have three residues. The red one and the blue one tend to interact with each other because they are spatially close, so when we substitute one residue with another amino acid, the other one will also tend to be mutated or substituted accordingly. On the other hand, the remote one, the orange one, is far from them, so the choice of the orange one usually has nothing to do with the choice of those two. Essentially, we have the assumption that evolutionary constraints and structural constraints should be similar to each other in terms of these residue interactions.

Here is the model people usually use to actually capture this kind of interaction. The naive model is based on, for example, PWMs, position-specific scoring matrices, which assume that each location is independent. We can write this differently, as a pseudo-energy-based representation: we have the single-site potentials, which say that the choice of letter at the i-th position has nothing to do with the other sites. But the pairwise approach gives us a way to learn the couplings, the coevolution strengths: we introduce another term, a sum over all the pairwise potentials, which gives us the coevolution strengths. This particular formulation is very well studied in machine learning, in statistical physics, and in high-dimensional statistics, where it has different names: Markov random fields, Ising models, or undirected graphical models. To solve this model, that is, to learn these strengths from the data, the single-site potentials and, most critically, the pairwise potentials, we need to solve a problem related to maximizing the data likelihood, which is shown here. There is a partition function here, which makes the problem computationally very hard, so a lot of approaches have been developed for approximation.
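(Editorial note: for readers following along, here is the pairwise model in standard notation. The symbols below follow common Potts-model conventions rather than the speaker's slides.)

```latex
% Pairwise "pseudo-energy" model over an aligned sequence x = (x_1, ..., x_L):
% h_i are single-site potentials, e_ij are pairwise coupling potentials.
P(x) \;=\; \frac{1}{Z}\,\exp\!\Big( \sum_{i=1}^{L} h_i(x_i) \;+\; \sum_{1 \le i < j \le L} e_{ij}(x_i, x_j) \Big),
\qquad
Z \;=\; \sum_{x'} \exp\!\Big( \sum_{i} h_i(x'_i) + \sum_{i<j} e_{ij}(x'_i, x'_j) \Big).
% Z sums over all 20^L possible sequences, which is why exact maximum-likelihood
% training is intractable and approximations (mean field, pseudo-likelihood) are used.
% A common coevolution strength score for a site pair is a norm of the coupling block:
S_{ij} \;=\; \lVert e_{ij} \rVert_2 .
```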
There are the mean-field approximation, the Gaussian approximation, and the pseudo-likelihood approximation; they all give different versions of coevolution analysis, and people have shown that they are much better than the naive correlation analysis, which for discrete variables we call mutual information.

What is most exciting is that we now have a way to score the correlation between sites, but the signal is also very noisy. So what my group did, which I think really changed the field and initiated the introduction of deep learning into it, is that we leveraged some of the intuition from natural image recognition. A classic deep convolutional neural network is very helpful for recognizing imaging patterns: different patches are organized in a hierarchical way to produce an output. Here we likewise treat the protein coevolution data, the coevolution coupling matrix, as an image. We look at the local patterns in this coupling matrix, and we apply a hierarchical deep learning model that puts all these signals together in a hierarchical manner and then predicts the distance or contact map. This can be binary, meaning whether the residues are close by in space, or we can directly predict the distance. (An illustrative sketch of this kind of network appears below.)

So this is our model. We learn from the coevolution, and you can see it is a very noisy signal. Then we go through this hierarchical deep convolutional neural network, which integrates all kinds of features, and at the bottom you can see that the model essentially denoises the original coevolution patterns and also generalizes them: the predicted contact map looks very similar to the actual contacts from the structure. Here is another example. This is the distance matrix from the native structure; this is the input for the neural network, generated by the coevolution program CCMpred, where you can see some pattern, but it is very noisy; and this is the version after we apply DeepContact, the neural network. What we then did was use this generated contact map to help fold the structures. We tested it in CASP12, and you can see it generated much better structures than the original input did; our program also ranked among the top in CASP12 a few years ago.

So why does deep learning work for this particular problem? We actually performed some visualization. We took the first-layer filters, visualized them in 2D, and colored them according to secondary structure. You can see the parallel and antiparallel patterns of beta strands as diagonal and antidiagonal patterns, and for helical interactions we see spaced patterns, where the gap between them corresponds to the number of residues in each turn of the alpha helix.
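(Editorial aside: a minimal, hypothetical sketch of the kind of contact-prediction CNN just described, in PyTorch. Channel counts and depths are toy values; real systems such as DeepContact use many more pairwise feature channels and deeper, residual architectures.)

```python
# Sketch: treat L-by-L pairwise coevolution features as an "image" and
# predict an L-by-L contact probability map with a small CNN.
import torch
import torch.nn as nn

class ContactCNN(nn.Module):
    def __init__(self, in_channels: int = 3, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv2d(hidden, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv2d(hidden, 1, kernel_size=5, padding=2),  # contact logits
        )

    def forward(self, pairwise_feats: torch.Tensor) -> torch.Tensor:
        # pairwise_feats: (batch, channels, L, L), e.g. coevolution couplings,
        # mutual information, and pairwise potentials stacked as channels.
        logits = self.net(pairwise_feats).squeeze(1)       # (batch, L, L)
        logits = 0.5 * (logits + logits.transpose(1, 2))   # symmetrize: (i,j) == (j,i)
        return torch.sigmoid(logits)                       # contact probabilities

# Usage: train with binary cross-entropy against contacts from solved structures.
model = ContactCNN()
probs = model(torch.randn(1, 3, 128, 128))  # toy 128-residue protein
```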
So these visualizations show that the deep neural network actually learned something meaningful, and at higher layers these patterns are organized into more complex structural motifs for structure prediction.

After this, many groups developed different approaches: RaptorX, AlphaFold, and trRosetta are all based on a similar intuition, extracting patterns from the coevolution data and predicting the residue-residue interactions and other geometric properties, such as inter-residue angles, and they have also been shown to be successful. But what is most interesting is what happened last year in CASP14 with DeepMind's approach. It has some components related to what I just described, but they greatly improved the performance. The GDT score ranges between 0 and 100, where 100 means a perfect prediction, and AlphaFold was able to give scores close to 90; a score of 90 is about the level of consistency between two structures of the same protein solved by two different experimental groups. Down here I also show you two very challenging structures. These are structures from the SARS-CoV-2 virus, and I tried all the different existing approaches and algorithms; none of the existing structure prediction algorithms could make any reasonable prediction for these two, and we basically just got random predictions. But AlphaFold 2 was able to generate predictions that look almost perfect, which I think is very impressive.

In the rest of this talk I will move a little bit away from structure prediction and focus on function prediction. Beyond structure prediction, what we really want to understand is the protein's function: given a protein structure or sequence, we want to understand its function. This has a lot of applications. If we have such a model, we can use it to improve the binding affinity of antibodies, optimize fluorescent proteins, or even improve the specificity of gene editing.

What do we need for such a model from a machine learning perspective? Essentially, we want a model that takes a sequence as input and gives as output a function value. A model that is useful for function prediction has some requirements. We need the model to be sensitive enough to differentiate the function levels of very similar sequences, because sometimes we know that substituting a single important residue will completely change the function. We also need to model non-additive effects, epistasis, since changes at different residues may interact. And we need the model to generalize well to unseen sequences and mutations. (A minimal sketch of such a sequence-to-function model appears at the end of this passage.)

My group has developed quite a few different machine learning approaches for modeling protein function from sequence. For example, we have recurrent neural network versions, and we have a model that takes both the protein sequence and the substrate as input using convolutional neural networks. We have applied these approaches to different applications, including phosphorylation, protein-RNA binding, and protein-drug binding.
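(Editorial aside: a minimal sketch of a sequence-to-function regressor with the properties just listed, assuming one-hot amino-acid input. It is illustrative only, not the speaker's actual RNN or sequence-plus-substrate models.)

```python
# Sketch: one-hot protein sequence -> convolutions -> pooled embedding -> scalar
# function value (e.g. activity or binding affinity).
import torch
import torch.nn as nn

class SeqToFunction(nn.Module):
    def __init__(self, n_aa: int = 20, hidden: int = 64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_aa, hidden, kernel_size=9, padding=4),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=9, padding=4),
            nn.ReLU(),
        )
        self.head = nn.Linear(hidden, 1)

    def forward(self, seq_onehot: torch.Tensor) -> torch.Tensor:
        # seq_onehot: (batch, 20, L). Convolutional filters capture local motifs;
        # stacked nonlinear layers can model non-additive (epistatic) combinations.
        h = self.conv(seq_onehot).mean(dim=2)  # pool over positions
        return self.head(h).squeeze(1)         # predicted function value

score = SeqToFunction()(torch.randn(2, 20, 150))  # two toy 150-residue sequences
```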
All of these applications require a lot of data to train the model, but in practice we don't have a lot of data; labeled data is expensive to get. There have been a lot of ideas in the machine learning community that we can borrow here.

One idea is to use unsupervised learning to learn good representations using language models. These data are very easy to obtain: we have no labels, but they give us some kind of good representation of the protein sequences. Basically, the idea is that given the residues, the amino acids, we have seen so far, we want to predict what we will see next. There are other models, like transformer models, which are more complex but are also able to learn to make such predictions. These large models can be trained on Pfam or UniProt, with a huge number of sequences but without any labels related to function. The drawback is that these approaches are too general: maybe they just capture secondary structure, or some of the basic grammar of protein sequences, but they are not specific enough to reflect the particular function we want to study.

Another idea we take is similar to what I showed before: we can learn from evolution. We can build the sequence alignment and assume that all of these sequences have some native function, that they have all passed the selection step during evolution. So we can treat them as weakly labeled data: we know they survived through evolution, so they can somehow be treated as positive data. These data are not plentiful; there are not enough to train a very complex deep learning model, but they are good enough to extract good features.

The first thing we did was to check whether these evolutionary patterns actually correlate with function. On the left, we have experimental data from deep mutational scanning, pairwise deep mutational scanning, and on the right, we have the model I showed you before, used to compute the coevolution patterns, and we checked whether they are correlated. Different sites have different correlations, but on average, if we have a lot of homologous sequences in the alignment, we tend to have stronger correlation. This gives us some confidence that we can actually use coevolution patterns to form features.

So this is our model, which we call ECNet, integrating evolutionary context, global context, and local context for protein function prediction. On top, we use a global language model to extract the sequence representations; locally, we have a model that learns from the multiple sequence alignment, those homologous, very similar, close sequences, and we use them to learn features that are specific and sensitive to mutations. Then we use our favorite model to integrate all these features together and predict protein function. We have evaluated this on a large collection of deep mutational scanning data.
These are single-mutation datasets, and we performed rigorous cross-validation and showed that it works much better than previous unsupervised methods such as EVfold or DeepSequence, and also better than the supervised models developed by the machine learning community.

We have also tested the generalization of this model to higher-order mutations. On the left, we tested on GFP, and it seems that with single and double mutations we are able to predict very high-order mutations, which suggests that maybe the higher-order interactions are not that important; pairwise might be good enough. On the right, we tested on literature data for antibiotic resistance in beta-lactamase and showed that our model is able to differentiate antibiotic-resistant alleles from random alleles.

Finally, we applied this to actually engineer inhibitor-resistant beta-lactamases. We take the sequence and structure, use ECNet to design a collection of mutants with higher-order mutations, and check whether they are resistant to a very important antibiotic compound. We tested across different concentrations; here we have two different models, and we also have the positive controls. There are only a handful of positive controls in the literature; we examined all the literature and could collect only a very few of them, which tells you this is a very hard protein to engineer. We replicated the experiment multiple times and showed that our hit rate is really pretty good when the concentration is high.

Finally, I would like to thank the audience, my students, my collaborators, and all the funding agencies. Thank you very much.

Thank you. So the way this is going to work in these sessions is that we'll have a very brief Q&A after each talk, and then we'll have a lot of time, 30 minutes, at the end of the session for a broader discussion. So just quickly, let me ask one question from the audience: how did you decide the size of the images in the coevolution data, presumably used as input to the convolutional neural networks you talked about? Would different image sizes cause the network to learn different relationships between correlated patterns?

Right. To prepare the input, we first compute the coevolutionary couplings and write those values in for each pair of positions. Essentially, if you have a protein of length L, you get an L-by-L input matrix. Now, of course, there is a lot of engineering that needs to be done. For example, because we don't have GPUs with large RAM, we have to cut the proteins into meaningful domains. In the current version we mainly consider intra-domain interactions, not inter-domain interactions, and that makes the engineering much easier. I know there are other groups who also consider inter-domain interactions and protein-protein interactions; there, some other inputs need to be prepared, or the coevolution patterns need to be formatted in a different way.
So essentially, in practice, we mainly consider what we can actually implement with the given resources. It's not that we want to cut the proteins into pieces; it's mainly an engineering consideration.

Great. Another question: what is the success metric used for 3D structure prediction? Maybe you can say a little bit more about how it works.

Yeah, so there are quite a few different metrics people use in the field, but what they do is pretty similar. Given a predicted three-dimensional structure and a native structure, the first step is usually to superimpose the two geometric objects; there are a lot of algorithms to do that. After doing that, we check the deviation between the predicted residue positions and the native ones, usually considering the backbone C-alpha atoms: how far each predicted C-alpha atom is from the native atom. Then we set certain thresholds, for example considering one angstrom or two angstroms of deviation to be meaningful, and we summarize the fraction of correctly predicted residues. This is how, for example, the GDT score I mentioned is computed, by averaging over multiple thresholds. There are also other metrics people use, for example comparing the pairwise distances, and all of them behave in a similar manner. Usually people summarize them as a score between 0 and 100, or 0 and 1, where 1 means perfect and 0 means not meaningful.

Okay, great. So to stay on time, let's move on now to the second speaker, and again, we'll have lots of time after the session to discuss more.

Excellent. Our second speaker will be Sara Mathieson, who is an assistant professor of computer science at Haverford College.

Hello, everyone. My name is Sara Mathieson. I'm an assistant professor at Haverford College in the department of computer science, and I'm very happy to be here for the machine learning in genomics workshop today. Specifically, I'm going to talk about generative adversarial networks and their use in population genetics, but more broadly, I want to give a sense of where we're going in machine learning and population genetics, and also of the shifts that have happened over time in the field.

One of the big questions in population genetics is: how do we go from all of the sequence data and actually learn something about evolution?
So here I have this big matrix of zeros and ones representing a bunch of different samples and a bunch of different SNPs for these samples, and say I want to learn something like recombination hotspots, or something about the recombination landscape. But I don't want to do this just for this particular sample, or just for this particular problem of recombination. I want to do this more generally, in any species that I'm interested in, and for maybe any evolutionary phenomenon that I'm interested in as well, say heritable traits or diseases, admixture, or natural selection. I don't want to have to redo the entire method every time I change the application or change the species. And so if I were to think about what properties I really want for this type of method, I would definitely want it to be fast and flexible, and maybe I would want it to be a machine learning method.

I think there have been several shifts in population genetics recently. One of them was toward machine learning, around the 2010 time frame, and then a second shift started to move away from summary statistics. Then I'll talk a little bit about one of the ongoing projects in my lab about using generative adversarial networks, and adversarial training more generally, to create simulated data.

So what about this first shift toward machine learning? I think one of the papers that highlights this really nicely was from 2013; it was trying to identify regions as either neutral or under natural selection, so a binary classification problem. One of the observations, which has been known for a long time, is that selection distorts properties of genomic data, and in particular this is highlighted in the site frequency spectrum, which is a common summary statistic. In these figures, the blue line shows what the site frequency spectrum would look like for a neutral region, and these red curves show what it would look like under natural selection in a variety of different scenarios. So could machine learning, which is known to be good at pattern detection and picking out subtle signals, be used to actually detect natural selection?
And the authors showed that it could. The method they used was a more classical machine learning method, the support vector machine. Broadly, how this works is that neutral regions, shown in red here, are simulated, and regions under selection, shown in blue here, are simulated as well. The idea of this model is to try to identify some boundary between these two different types of regions. If we're able to fit that boundary very well, then when a new dataset comes along, say a real region that we're trying to classify as under selection or not, we can plot it in the statistic space and see which side of the boundary it's on. So in this case, we would probably identify this region of real data as predicted to be under selection.

I think this is a really nice example of how machine learning started to be used, because it was very good at finding signals from summary statistics. But the issue arises when we don't really know which summary statistics are good, or maybe we have a lot of summary statistics available; here are a bunch of common ones I used for a project around 2016. And we don't really know, if we change our problem of interest, whether we need to change our summary statistics or invent new ones. So one usage of machine learning is really to identify important features, or distill down combinations of features, that will be useful for the problem of interest.

What we decided to do was feed all of these summary statistics into a deep learning method, where "deep" here just means these multiple layers, and actually infer both selection and population size changes, here represented as this simple bottleneck model, jointly inferred along with natural selection. The idea was that this deep learning method could distill down the information within these summary statistics, creating groups of summary statistics, or features built out of summary statistics, that would be informative for these parameters.

So there was a lot of activity in machine learning around this time, but one of the drawbacks was that it still relied on these summary statistics. And so then there was a second shift, away from summary statistics, toward using the raw data. Part of my motivation for moving away from summary statistics was actually to not have to change them for every application, and not have to think about inventing new ones for every single new problem that I'm interested in.

For this, I think several groups drew inspiration from convolutional neural networks, or CNNs, which were developed in the image recognition and classification literature. Very broadly, one of the key ideas of CNNs is that these little filters can pick up on different aspects of images. Say, for example, a filter might learn to identify beaks or feathers, and then if those filters light up during classification, we might be more likely to conclude that the image is of a bird. So the idea was that these learned filters could be used in genetics to find out which aspects of the data are really informative for certain parameters of interest.

However, there are a few issues with this. One is that these filters are really specialized for images, not for genetic data. And the second issue concerns unstructured populations.
In an unstructured population, the order of the individuals is not informative at all, and we don't want to encode that ordering into our network architecture. For an image, if we imagine shuffling all of the pixel rows, it wouldn't mean anything anymore; but for genetic data, we want to say that's actually okay, that's actually the same dataset.

There are a few methods from this time that worked on CNNs for genetics, and I'll focus on the second one, where we were trying to create an architecture that would actually respect some of the properties of the underlying data. For this, we looked at our raw data again, a series of zeros and ones in a matrix, but the samples, the rows, are now exchangeable. We fed this not through square filters, but through filters of height one. These filters only span one haplotype, one sample, at a time, and thus they don't encode any information about the ordering of the samples in the network. We did that for several layers, to make sure we weren't assuming anything about the order, and then toward the end of the network we collapsed along all of the rows, which gives us something that is also permutation invariant. There are lots of different permutation-invariant functions we could choose here; sum, max, and average are all permutation invariant. This allowed us to distill down some of the information that had been learned in the previous layers, and then finally we can infer some evolutionary parameter that we are interested in. (A minimal sketch of this exchangeable architecture appears after this passage.)

This network was able to work quite well on a few different types of problems. One example is looking at recombination hotspots, a binary classification of a region as either inside a hotspot or not. We plotted the accuracy over the training iterations, and we found that with this blue line here, the exchangeable or permutation-invariant architecture, we were able to get higher accuracy than if we just used the more image-oriented CNNs, shown in these other curves. So explicitly encoding the nature of the data helped this network perform better.

Several different CNN methods were developed during this time, and I think it was a huge step forward for the field. But one thing I've glossed over is that we still need simulated data for these machine learning methods. It's actually very important for machine learning methods in genetics to have simulated data as the training data, since we don't really have any real examples where the evolutionary history is known. In other words, we need to use supervised learning, and therefore we need labels, and we don't really have labels for any real data. So we rely heavily on that simulated data.

So I became very interested in developing better simulated data. This was inspired by several times when I tried to create realistic simulated data, gave my best guess at the parameters, and still was not really able to accurately capture some of the features of the real data.
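(Editorial aside: a minimal sketch of the exchangeable architecture described earlier in this passage. Filter sizes and the pooling choices are illustrative, not the published ones.)

```python
# Sketch: height-1 convolutions never mix rows (samples), and a symmetric
# collapse over the sample axis makes the whole network permutation invariant.
import torch
import torch.nn as nn

class ExchangeableCNN(nn.Module):
    def __init__(self, hidden: int = 32):
        super().__init__()
        self.per_haplotype = nn.Sequential(
            # kernel height 1: each filter spans one haplotype at a time,
            # so row order carries no information through these layers
            nn.Conv2d(1, hidden, kernel_size=(1, 5), padding=(0, 2)),
            nn.ReLU(),
            nn.Conv2d(hidden, hidden, kernel_size=(1, 5), padding=(0, 2)),
            nn.ReLU(),
        )
        self.head = nn.Linear(hidden, 1)

    def forward(self, genotypes: torch.Tensor) -> torch.Tensor:
        # genotypes: (batch, 1, n_samples, n_snps) matrix of 0/1 alleles
        h = self.per_haplotype(genotypes)
        # symmetric (permutation-invariant) collapse over the sample axis;
        # sum, mean, or max would all work here
        h = h.mean(dim=2)               # (batch, hidden, n_snps)
        h = h.max(dim=2).values         # (batch, hidden)
        return self.head(h).squeeze(1)  # inferred evolutionary parameter

pred = ExchangeableCNN()(torch.rand(4, 1, 50, 200).round())  # toy batch
```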
So here are a few examples where the real and the simulated datasets look very different, and here as well, even when I tried to put in reasonable parameters for these simulation programs. As a broader point, the role of simulated data in population genetics really cannot be overstated. It is useful not only for training these machine learning methods, but for validating methods and developing intuition; it's extremely important. To that effect, many different simulation programs have been developed over the years, and two of the most popular ones right now are SLiM and msprime. These programs work very well at recapitulating evolution, but they do require many input parameters, and it can be very difficult to identify what the best ones are. And here are some examples where that didn't really work.

So I was interested in trying out generative approaches, and specifically generative adversarial networks. The idea behind these GAN algorithms is that our initial guess at whatever type of simulation we're trying to come up with is probably going to be pretty poor quality, right? If we just guess, we're not going to be able to develop a really good simulation right off the bat.

To draw an analogy from the art world: say we are a forger, and we're trying to make fake artworks in the style of famous artists. If we're just starting out and not very experienced, our fake data might look like this, and it's pretty easy for, say, an art critic to determine which one of these is fake and which is real. So we go off and say, okay, that wasn't very good; we train more, we learn more about art, and we try again, and maybe this time we're a little bit better.
So we try this example on the left, but it still doesn't really look like a real Picasso, on the right. We go off, we train more, we learn more about art, and we try again. And I'm not sure what these look like to you, but to me they look pretty realistic; it turns out that both of these are suspected to be fake Jackson Pollocks.

Okay, so being a bit more detailed: if we think about what a GAN architecture really looks like, it has two components. It has this generator, which we can think of as the forger trying to create synthetic examples, and then we have the discriminator, the art critic, that is trying to distinguish real from fake. An important point about the discriminator is that it has to make a binary classification at the end of the day: is that example real or fake? As part of this feedback loop, based on that information, the generator tries to perform better. Eventually, both of these entities become better over time, with the generator creating more and more realistic data; and because of that, the discriminator has to get better too, identifying subtle differences to tell which examples are real and which are fake. So this is the broad idea, and it has been used for images before, but for population genetics there are quite a lot of modifications we need to make to really get this to work.

So what we did is design a new GAN framework for population genetics, which we call pg-gan. The idea is that we are still feeding generated data and real data into the discriminator, and training the discriminator that way, but instead of the generator going straight to simulating all these zeros and ones, I wanted to include an evolutionary model as part of the generation process. So instead, the generator chooses the parameters of an evolutionary model, say the population sizes N1, N2, and N3 in this example, feeds those into an evolutionary simulator, and that creates the generated data.

Because of this unique generator architecture, we can no longer use backpropagation or gradient-descent-style approaches, because we don't have a gradient anymore. To overcome this, we used a simulated annealing algorithm to select the parameters that we feed into the evolutionary model. The idea is that at the beginning we can rapidly move across the space of evolutionary parameters, and then, as the temperature cools down, we make smaller refinements to the parameters, until at the end we should hopefully have parameters that generate data that is as realistic as possible and confuses the discriminator. (A minimal sketch of this annealing loop follows.)

That was our generator. For the discriminator, we chose to use a CNN, and we actually expanded on the permutation-invariant CNN I talked about earlier to include multiple populations. The idea is that we want to handle a richer set of evolutionary models, so we need to handle multiple populations: within each population the samples are permutation invariant, but between populations they are not. After several convolutional layers and filters, we collapse along the rows and concatenate the two populations together.
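(Editorial aside on the generator: a minimal sketch of the simulated-annealing search described above. `simulate` and `discriminator_confusion` are hypothetical stand-ins for an evolutionary simulator such as msprime and for a score of how well the generated data fools the trained discriminator; the proposal and cooling rules here are generic, not pg-gan's exact settings.)

```python
# Sketch: gradient-free parameter search for the GAN generator.
import math
import random

def anneal(init_params, simulate, discriminator_confusion,
           n_iters=200, t0=1.0):
    params = list(init_params)  # e.g. positive population sizes, split times
    score = discriminator_confusion(simulate(params))
    for i in range(n_iters):
        temp = t0 * (1 - i / n_iters)  # temperature cools over the run
        # proposal size shrinks with temperature: explore early, refine late
        proposal = [p + random.gauss(0, 0.1 * abs(p) * (temp + 0.05))
                    for p in params]
        new_score = discriminator_confusion(simulate(proposal))
        better = new_score > score
        # accept improvements always; accept worse moves with a probability
        # that shrinks as the temperature drops
        if better or random.random() < math.exp((new_score - score) / max(temp, 1e-6)):
            params, score = proposal, new_score
    return params
```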
The discriminator's output, instead of being an evolutionary parameter of interest, is a binary output: the probability of the example being real or fake. So we have a binary classification problem.

Before I talk about the results, I wanted to say just a little bit about what can go wrong in GAN training. GANs are notoriously difficult to train, in part because there are two optimization problems, the generator and the discriminator, and they're adversarial; they're competing against each other. One of the things that can go wrong is that the discriminator classifies all datasets as real. Across these training iterations on the x-axis, if we look at the accuracy on real and fake data, the discriminator is 100 percent accurate on the real data, classifying everything as real, but zero percent accurate on the fake data. This is also reflected in the loss: the generator loss is not very low, and the generator isn't getting any useful feedback, because everything it does is classified as real, so there's no incentive for it to improve. That is definitely a major failure mode.

In contrast, looking at examples of successful training, we see that in the beginning the discriminator is usually very accurate, and then its accuracy reduces over time to around 50 percent. At the end, it's often genuinely confused; there's some back and forth, but it's typically not able to accurately determine real versus simulated. And at the end, the losses are balanced, meaning the generator and the discriminator are both working hard to do their jobs.

One way of evaluating this type of method is to check whether some features of the real data are accurately captured by the simulations. So we looked at summary statistics as a sanity check that our method was working. (A minimal sketch of this kind of check appears below.) For one example, we looked at just a single population, here CHB, which is an East Asian population. If we just use a constant population size, we weren't really capturing the real data very accurately; these green and gray curves are not very close together. But if we used a more sophisticated model that included exponential growth and size changes at different times, we were able to fit the data much better. We also looked at a two-population model of the out-of-Africa event with YRI and CEU, an African and a European population, and we were also able to closely mirror the summary statistics there.

We were then also able to fit this model and look at the parameters as another sanity check that we were getting reasonable results, and we do see the out-of-Africa bottleneck and the re-expansion in CEU, as well as some post-split migration.
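(Editorial aside: a minimal sketch of the summary-statistic sanity check described above, comparing site frequency spectra of real and generated genotype matrices; purely illustrative.)

```python
# Sketch: compare the site frequency spectrum (SFS) of real vs. GAN-generated
# data. Matrices are (n_samples, n_snps) of 0/1 minor alleles.
import numpy as np

def site_frequency_spectrum(genotypes: np.ndarray) -> np.ndarray:
    n = genotypes.shape[0]
    counts = genotypes.sum(axis=0)  # minor-allele count per SNP
    # bins 1..n-1: how many SNPs appear in exactly k samples
    sfs = np.bincount(counts.astype(int), minlength=n + 1)[1:n]
    return sfs / sfs.sum()          # normalize for comparison

real = (np.random.rand(20, 1000) < 0.10).astype(int)  # toy "real" data
fake = (np.random.rand(20, 1000) < 0.12).astype(int)  # toy "generated" data
# small total deviation suggests the generator captures this statistic
print(np.abs(site_frequency_spectrum(real) - site_frequency_spectrum(fake)).sum())
```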
So this was encouraging: it was broadly in line with the existing literature.

Now, a little bit about where we're going with machine learning. I think there are many different opportunities for pg-gan in particular, especially for studying understudied populations, where we don't have a very good idea of what the default parameters would be, and for overcoming this fundamental imbalance, since we have unlimited simulated data but limited real data. More broadly, I think we really need to keep the data in mind and not just take off-the-shelf methods from other fields, but really think about which methods will be most applicable for our data, and what modifications and new algorithms we actually need for genetic data, such as these permutation-invariant architectures or simulation-based generators that don't rely on the gradient. I think we really need machine learning methods to be more interpretable; we don't really know what these CNNs are learning yet. And I think there are a lot of opportunities to combine machine learning and evolutionary modeling, so it's not one or the other. Finally, I would say that we need to think more about moving away from simulations, toward unsupervised learning and learning from the real data directly. And with that, I want to thank all my collaborators and funding, and thank you for your attention.

Thank you so much, Dr. Mathieson, wonderful talk. So I'll start with a few questions from our audience. First, you said that you needed simulated data for training supervised models. Are you also exploring unsupervised models?

Yes, and I think that's a great question. Part of the motivation for the GAN framework came from exactly this idea: we're not just using purely simulated data that we make up based on our best guess; we actually try to use the GAN to generate better simulated data. The real data is actually fed into the GAN, and the simulated data tries to match it. So I think that's a hybrid approach; it's not purely unsupervised, but we're trying to use the real data directly, instead of only at the very end after we've already trained on purely simulated data. There are also a few other groups in this area working on using unsupervised learning more directly, in terms of visualization and clustering and things like that, which would be purely unsupervised without any training data at all. So there's much room for improvement; I would definitely encourage people to think about this area of unsupervised learning in population genetics.

Great, thanks, Sara. I'll ask this last question, by John Yoon: for the SNP data as input to the convolutional neural network model, you assume zero and one refer to the number of minor alleles. Have you tried a zero/one/two type input in your models, and how do you optimize the stride in the CNN model?

That's a great question.
So yeah, zero refers to the major allele and one refers to the minor allele. You can also, if you know your ancestral and derived states, have zero as the ancestral state and one as the derived state; that would also work. It would be interesting to try other approaches. You can also try negative one and one, with zero for missing data, which I think could be better in cases when the data quality is less good. In terms of increasing the number of values: usually we really see biallelic SNPs, which is why you don't see higher numbers, but I think there's no reason not to include them if that's better for your model. You just have to be really careful: because we're using ReLU and these different activation functions, you need to be careful about where your zero mark is and what you're really assuming; you don't want everything to be positive if those are the values that are going to be filtered down your network. So you have to be aware of that.

What do you think, Anthony, shall we do one more or shall we move on?

Well, there's a really good question here, so maybe we'll do one more. What type of loss functions do you use for imbalanced data, and do you apply any upsampling for unbalanced data?

That's a really interesting question about data imbalance. For us, we try to simulate the exact number of real datasets we have, so we get around that a little bit by balancing the data. Generally, we use binary cross-entropy loss functions for the GANs; I can talk more about that in the main Q&A. But in general, there is this fundamental data imbalance, right? We have unlimited simulated data and limited real data, and I think that needs to be further explored; that's not something I've done yet.

Okay, great. Thank you, Sara. So let's move to the final speaker of session one. It's my pleasure to introduce Dr. Christina Leslie from Memorial Sloan Kettering Cancer Center. Her talk is titled "The 3D genome and predictive gene regulatory models."

Hi, my name is Christina Leslie. I'm at Memorial Sloan Kettering Cancer Center, and I want to talk to you about using machine learning models together with 3D genomic data to train predictive models of gene regulation.

As I'm sure you're aware, gene regulation involves a collaboration of transcription factors that bind at the promoter, as well as at enhancers that can be quite distal from the promoter, and DNA looping that brings enhancer elements into contact with the promoter and leads to upregulation of expression of the transcript. For quite some time, we've used 1D epigenomic data (information on where transcription factors are binding, chromatin accessibility, and histone marks) to map the presence of candidate enhancer elements, but until recently we haven't used the 3D connectivity of these elements.

So why do we want to try to predict gene expression? Well, gene expression is important for the function of cells and for understanding what cells do in different states.
This is data from one of my own papers, but there are many such studies, where we're looking at chromatin states, in this case of CD8 T cells: functional T cells, and T cells that progress to dysfunction in tumors. What we can see by looking at chromatin accessibility at specific loci (this is the locus that encodes PD-1 in the mouse, PD-1 being an immunotherapy target, and this is the locus of an important effector cytokine, interferon gamma) is that as T cells progress to a dysfunctional or exhausted state, there are chromatin accessibility changes at these loci, different from what you see in effector or memory functional cells. These changes somehow encode the gain of expression of PD-1 and the loss of expression of interferon gamma, and we'd like to understand and model this.

For NHGRI, it's also important to link genetic variation to gene expression. This is an example, not from my lab but from a recent paper, trying to understand the association between genetic variation and the expression of a particular gene, BLK here, as well as the association of genetic variation with chromatin accessibility, teasing apart how changes in the DNA sequence can change accessibility and chromatin state, and how that can change gene regulation. Ideally, we would like machine learning models that could automate this process.

In my own lab, we've done a lot of work in this domain, trying to use models to predict gene expression, or fold changes between different cell states, from the sequence content and the accessibility or activity of regulatory elements. And we're doing this not to predict per se, but in order to decipher gene regulation: to figure out what the regulators are, how individual genes are regulated. The missing information in our models so far has been the connectivity of promoters and enhancers. So what I want to show you today is the use of 3D interaction data to model gene regulation with graph neural networks.

The data we're going to use is chromosome conformation capture data, such as Hi-C, and here's a picture of Hi-C. The basic idea is that you cross-link proteins to DNA, you do a restriction enzyme digest so that you're cutting up the chromatin, you ligate and pull down, and you get paired-end reads where the read pair maps a contact: the pair of reads can be distal in the 1D genome, but they had to be close together in the cells of the input population. This data allows us to build a contact matrix that gives us information about the 3D proximity of genomic regions. We can look at these contact matrices at different scales, genome-wide or chromosome-wide, but we're more interested in the domain level, the loop level, and the level of individual promoter-enhancer interactions. If you zoom into these maps closely enough, you can see organization such as topologically associating domains, and within these TADs you can see loops between genomic loci.

Methods matter here, so in our lab we've developed our own statistical approach to infer significant interactions directly from count data; we call this method HiC-DC, for Hi-C direct caller. Essentially, what we're trying to do is estimate a background model from the count data using negative binomial regression. The important covariates are the genomic distance, as well as features of the interacting bins such as mappability, the number of restriction enzyme sites, and GC content. When we fit the model, we can compare the observed counts against it, decide how surprised we are to see a high count, and assign a p-value or z-score. (A minimal sketch of this kind of background regression appears below.)
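(Editorial aside: a minimal sketch of the background-regression idea behind HiC-DC, not the actual implementation; the real method's covariate transforms, dispersion estimation, and significance calls are more careful.)

```python
# Sketch: fit a negative binomial regression of Hi-C contact counts on
# distance and bin features, then score how surprising each observed count is.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5000
distance = rng.uniform(1e4, 2e6, n)             # genomic distance of bin pairs
gc = rng.uniform(0.3, 0.7, n)                   # toy GC-content covariate
mappability = rng.uniform(0.5, 1.0, n)          # toy mappability covariate
counts = rng.poisson(200 * (1e5 / distance))    # toy contact counts (decay with distance)

X = sm.add_constant(np.column_stack([np.log(distance), gc, mappability]))
fit = sm.GLM(counts, X, family=sm.families.NegativeBinomial(alpha=0.5)).fit()

mu = fit.predict(X)                             # expected count per bin pair
var = mu + 0.5 * mu**2                          # NB2 variance with alpha = 0.5
zscores = (counts - mu) / np.sqrt(var)          # large z => candidate interaction
print(zscores[:5])
```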
The latest version, HiC-DC+, is in revision and hopefully coming out soon. With this in hand, we can start to see interactions that are of interest for modeling gene regulation. This is data from Danwei Huangfu's lab, in collaboration through our 4D Nucleome project that also involves Effie Apostolou. What we're looking at is a progression of guided differentiation of human embryonic stem cells toward insulin-secreting beta cells. Looking at these different stages of pancreatic differentiation, with standard normalization at the top and HiC-DC normalization with z-scores at the bottom, we can see certain 3D interactions beginning to get set up that influence the promoter of this important diabetes gene, PDX1. We would like to understand how, at different stages of differentiation, this enhancer rewiring influences expression of the target gene.

So now I'm going to get to the machine learning part and tell you about a model we've developed called GraphReg, where we're using graph neural networks to infer gene regulatory models. The idea is that we're going to use Hi-C, or actually a variant called HiChIP, to encode 3D regulatory interactions as a graph, and we're going to propagate information along the edges of this graph via graph neural networks. We think of the linear genome as a set of bins; the bins are connected by edges that we get from HiChIP data; the input features are either 1D epigenomic data or DNA sequence; and the output is gene expression, the expression output of each bin.

We have two models, one that uses epigenome-based data and one that uses DNA sequence, and I'm going to talk about the epigenome-based GraphReg first. With this model, the inputs are chromatin accessibility and a minimal set of histone modifications; we actually use just one promoter mark and one enhancer activity mark. We go through a few CNN layers, which helps us learn local features of the chromatin, and then we pass through several graph attention network layers, which allows us to pass information between enhancers and promoters. Then we predict the promoter output as measured by CAGE-seq, a tag-based protocol that maps promoter activity, the output at specific transcription start sites. So roughly speaking, this model predicts gene expression from the activity and connectivity of regulatory elements. This model is also cell-type agnostic, in the sense that if you train in one cell type, you can go to a new cell type, and as long as you have the 1D and 3D inputs for the new cell type, you can predict expression there. (A minimal sketch of this architecture follows.)

Just a few more details on the model. We're using chromatin accessibility, H3K4me3, and H3K27ac, with a finer binning for the histone marks, but by the time we get to the graph attention network we're at a 5 kb binning.
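(Editorial aside: a minimal, hypothetical sketch of the epigenome-based GraphReg idea: per-bin CNN features, attention-weighted message passing restricted to HiChIP edges, and a per-bin CAGE prediction. All layer sizes are toy values; this is not the published architecture.)

```python
# Sketch: 1D epigenomic tracks per genomic bin -> local CNN features ->
# one graph-attention step over a HiChIP contact graph -> per-bin expression.
import torch
import torch.nn as nn

class TinyGraphReg(nn.Module):
    def __init__(self, in_tracks: int = 3, hidden: int = 32):
        super().__init__()
        self.local = nn.Conv1d(in_tracks, hidden, kernel_size=9, padding=4)
        self.q = nn.Linear(hidden, hidden)
        self.k = nn.Linear(hidden, hidden)
        self.v = nn.Linear(hidden, hidden)
        self.out = nn.Linear(hidden, 1)

    def forward(self, tracks: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # tracks: (batch, in_tracks, n_bins) 1D signals (accessibility, marks)
        # adj:    (n_bins, n_bins) 0/1 HiChIP contact graph (incl. self-loops)
        h = torch.relu(self.local(tracks)).transpose(1, 2)    # (batch, n_bins, hidden)
        scores = self.q(h) @ self.k(h).transpose(1, 2)        # pairwise attention logits
        scores = scores.masked_fill(adj == 0, float("-inf"))  # only HiChIP edges pass
        attn = torch.softmax(scores / h.shape[-1] ** 0.5, dim=-1)
        h = h + attn @ self.v(h)                              # propagate along edges
        return self.out(h).squeeze(-1)                        # per-bin CAGE prediction

n_bins = 100
adj = (torch.rand(n_bins, n_bins) < 0.05).float()
adj = ((adj + adj.T + torch.eye(n_bins)) > 0).float()         # symmetric + self-loops
pred = TinyGraphReg()(torch.randn(2, 3, n_bins), adj)
```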
Okay, so we are predicting the CAGE-seq signal at this 5 kb bin resolution, and we're making predictions on fairly large genomic regions of about two megabases.

How well can we do? We actually can do quite well. This is true signal versus predicted signal on held-out chromosomes in mouse ES cells, and you can see a nice correlation. Also, if you train the model in different cell types and then, again on held-out chromosomes, try to use these models to predict fold change, the fold change looks good here too. The darker color means that you are looking at promoters with more HiChIP edges, so they have more complicated regulation.

The other model, the sequence-based model: we start with six megabases of DNA sequence, one-hot encoded. We pass through a few CNN layers, again to learn local features, so now sequence motifs, and then we pass through dilated convolutional neural network layers in the bottom path to predict accessibility and histone marks. There's another path through the model where we take the output of this CNN, these motifs, and pass it to the graph attention network to try to predict CAGE-seq. So now we're using DNA sequence together with 3D connectivity to predict gene expression, and predicting the 1D epigenomic data as an auxiliary task. This is definitely a cell type specific model, in that you're learning sequence information that is specific to the cell type that you train in.

And again, a few details on the model. The bottom path through the model is very similar to the dilated CNN model called Basenji, developed by David Kelley, so these are sequence models that have been used in regulatory genomics. The novel part is the top path that uses the graph attention network. The sequence model is a harder task, but we can evaluate on held-out chromosomes and look at true expression versus predicted expression, or fold change versus predicted fold change, and again it's a reasonable correlation.

And finally, I'm passing through this quickly, but if you compare the prediction of either the epigenome-based or sequence-based model to the corresponding CNN model (CNNs are widely used now in regulatory genomics, but they're not incorporating 3D information), the GraphReg models do better. In particular, they give higher accuracy for prediction of gene expression when you start restricting to genes that are expressed and genes that have more complex regulation, more HiChIP edges.

However, unlike in a lot of machine learning, the prediction performance per se is not the point. We're doing this because we want to understand how gene regulation works and how individual genes are regulated; we want to interpret the model. So what we can do is use feature attribution to predict what the functional enhancers are for a specific gene. What we're doing here is look at a specific output, the expression of the gene DHPS, and for all the input features we can use a method called DeepSHAP that identifies which bins of each feature contribute most to the prediction. We can sum up these contributions and get a track like this, which tells us which positions along the genome the model thinks are important for predicting this gene. That feature attribution approach gives us a prediction of where we think the functional enhancers are. And there's data to evaluate this now.
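The feature attribution step can be sketched with the shap package's DeepExplainer (the DeepSHAP method mentioned above), here on a toy stand-in model rather than the real trained network. Shapes are hypothetical, and this assumes the model's layers are supported by DeepExplainer.

```python
import torch
import torch.nn as nn
import shap  # DeepSHAP lives in the shap package

# Toy stand-in for a trained sequence -> expression model (hypothetical).
model = nn.Sequential(
    nn.Conv1d(4, 16, kernel_size=11, padding=5), nn.ReLU(),
    nn.AvgPool1d(kernel_size=1000), nn.Flatten(), nn.Linear(16, 1),
)

seqs = torch.randn(20, 4, 1000)   # stand-in for one-hot DNA windows
background = seqs[:10]            # reference distribution for SHAP

explainer = shap.DeepExplainer(model, background)
shap_values = explainer.shap_values(seqs[10:])

# Sum contributions over the one-hot channels to get a 1D importance
# track along the genome, as described in the talk.
vals = shap_values[0] if isinstance(shap_values, list) else shap_values
track = abs(vals).sum(axis=1)     # (examples, positions)
```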
And this is thanks to CRISPRi-based enhancer screening strategies, for example CRISPRi-FlowFISH from the Engreitz lab. In this data, what they're doing is taking a particular gene of interest, and they have a set of candidate enhancers, and they do a pooled CRISPRi screen with the readout using RNA FISH for the gene of interest. Through sorting and sequencing, they can estimate the impact of perturbing individual enhancers on the expression, the fold change, of the target gene. And in the same paper they have a model for predicting functional enhancers called the activity-by-contact (ABC) model; this is a score using accessibility, acetylation, and Hi-C contacts.

So how do we compare with the ABC score, or with standard CNN models? What we can show is that the graph regulatory models from GraphReg outperform these other approaches. This is an evaluation on FlowFISH data in K562 cells, showing the performance by area under the precision-recall curve over a set of genes. In green is ABC, in orange are the different GraphReg models, and in blue are the corresponding CNN models. And what you can see is that we have higher performance for determining functional enhancers based on FlowFISH data with the GraphReg models in orange, versus the CNN models or ABC.

And just a final slide to explain why GraphReg can outperform CNNs for this task. What I'm showing is the MYC locus. MYC is an important gene, it's an oncogene, and it's known to have very distal enhancers. So the promoter is here, and I'm showing the HiChIP data. The question is whether these different models, through feature attribution, are able to detect these distal enhancers. What you can see is that both the epigenome-based and the sequence-based GraphReg models can find these distal enhancers, whereas the CNNs can't. The 1D dilated CNNs in principle have a wide receptive field, but the feature attribution actually shows they're only learning information very proximal to the promoter. They can't access this distal information, and that's the power of this approach.

Okay, so I showed you some new work using graph neural networks to predict gene expression across large genomic regions, using both 3D interaction data and 1D epigenomic data, or using DNA sequence with 1D epigenomic prediction as an auxiliary task. And I showed that the GraphReg models outperform the baseline CNN models for gene expression prediction. More importantly, we can use feature attribution to predict functional enhancers for genes, and this outperforms existing models, including the ABC score, for identifying functional enhancer elements. I think the big picture here is that there have been rapid developments in machine learning modeling, in epigenomics and 3D genomics, and also in screening approaches based on CRISPR editing, and this has all enabled advances in modeling gene regulation and deciphering regulation.

With that, I want to thank all the people in my lab who did the work. Alireza Karbalayghareh is the postdoc who developed GraphReg, and Merve Sahin developed the HiC-DC+ package with help from Wilfred Wong. In particular I want to thank my collaborators Effie Apostolou and Danwei Huangfu, as well as our funding sources. Okay, great.
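The activity-by-contact score mentioned in the talk can be written down compactly. This is a simplified rendering of the published formula (activity as the geometric mean of accessibility and H3K27ac signal, weighted by contact frequency and normalized over all candidates for a gene); the input numbers below are made up.

```python
import numpy as np

def abc_scores(dnase, h3k27ac, contact):
    """Simplified activity-by-contact (Fulco et al. 2019):
    activity_i = sqrt(DNase_i * H3K27ac_i)
    score_i    = activity_i * contact_i / sum_j(activity_j * contact_j)
    for candidate enhancers i of a single gene."""
    activity = np.sqrt(np.asarray(dnase, float) * np.asarray(h3k27ac, float))
    raw = activity * np.asarray(contact, float)
    return raw / raw.sum()

# Hypothetical read counts and Hi-C contact frequencies for 4 candidates:
print(abc_scores(dnase=[120, 30, 80, 10],
                 h3k27ac=[90, 20, 150, 5],
                 contact=[0.04, 0.10, 0.01, 0.02]))
```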
Thanks very much, Christina. So let's go to a couple of questions just for this speaker before we move to the general discussion. I'll start with the first one, by Jens Lichtenberg: how difficult would it be to infer the connectivity, when assuming the other features, such as epigenomic data and expression data, are given or well known? And Christina, maybe, just for the benefit of the audience, explain what is meant by the connectivity.

The connectivity, like the Hi-C contacts? Yeah, I'm guessing what they mean is the Hi-C connectivity, from this question. Right, so that's a great question. GraphReg uses Hi-C information, HiChIP information, in the model to predict gene regulation. Another problem would be trying to use 1D epigenomic data to predict the contact matrix. Exactly, and we and others are working on it. We are working on it in the context of using both bulk 1D epigenomic data as well as single-cell ATAC data, using different deep learning approaches. There are existing approaches that go from DNA sequence to the contact matrix, training in a specific cell type, but one doesn't expect those models to generalize to a new cell type.

Here's another question, from Antonios Lutsas, which starts off with saying: great work, very exciting. A lot of the work that you're presenting is on population models, and I'm wondering if your ML models would benefit by using single-cell data in addition.

It's where we're going. You know, I'm not quite ready to put it into the workshop slides, but we and many other people are looking at multiome data, where you have single-cell ATAC and single-cell RNA-seq in the same cells. And the CAGE-seq that we're using looks very much like the signal you get when you summarize single-cell data, especially certain kinds of assays, over cells, over clusters. So we are definitely heading in that direction, as are many other people.

Great, and maybe I'll ask a question of my own. You know, in natural language processing there was kind of a phase transition when transformer models came in, and you see the rise of BERT and GPT-2 and GPT-3. I'd like your thoughts on whether transformers would be useful in this domain, which at least to me looks a little similar.

Yes, I was actually just reading a paper yesterday from Calico and DeepMind, from David Kelley's group. It's a preprint, and our GraphReg is also a preprint, so they're both available on bioRxiv. And they're using a transformer architecture, so it's a different attention-based mechanism. They're not using 3D structure, but through an attention mechanism they can get some longer-range interactions. Not quite the two megabases that we are getting, but they can learn information out to about a hundred kb, which is interesting.
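The attention mechanism behind the transformer models discussed here can be illustrated in a few lines. This toy scaled dot-product self-attention over genomic bins is not Enformer or any published model; the dimensions and random projections are purely illustrative.

```python
import torch
import torch.nn.functional as F

def self_attention(x, d_k=32):
    """Scaled dot-product self-attention over a (num_bins, d_model) matrix:
    every bin can attend to every other bin, however distal in 1D."""
    d_model = x.shape[-1]
    Wq, Wk, Wv = (torch.randn(d_model, d_k) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = F.softmax(q @ k.T / d_k**0.5, dim=-1)  # (bins, bins) weights
    return attn @ v

bins = torch.randn(896, 64)   # e.g., 128-bp bins spanning roughly 100 kb
out = self_attention(bins)    # each bin's output mixes distal information
```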
Okay, great. So why don't we go ahead and get all the speakers of the session up here, and we'll start the general discussion period. And maybe, while we're waiting for some new questions to roll in, I have a question, again for Jian Peng. It was kind of amazing to me how much of the advance in protein structure came from methods that are based on information theory or evolution, rather than from a biophysical approach. What are your thoughts on why that is, and on whether the biophysical approaches will ever catch up to the more information-theoretic approaches?

Yeah, sure. So I think, first, the biophysics-based approaches have their limits. Our current understanding of the physics of many-body interactions, and of exactly how these potentials should be formulated, is not complete; although there has been a lot of progress in the field, we don't have a thorough understanding of these. The biophysics-based approaches also suffer from what we have seen in optimization: there are a lot of issues in optimizing those energy functions, whose landscapes are highly rugged. People have visualized that there are a lot of local minima, and those energy functions are really hard to optimize. So I think that's the main bottleneck in applying those approaches at large scale. And of course the computational costs of those biophysics-based approaches are really demanding; we need supercomputers to perform a lot of simulations to get reasonable folding trajectories.

So I think the future direction might be, as you said, information-theoretic or machine learning based plus physics-based approaches. For example, in our approaches, and also in many other groups', especially DeepMind's approach, we don't just use the deep learning approach for protein structure prediction, for protein folding. What machine learning gives us is the initial structure: we get the scaffold to be right, and we know, for example, that a particular residue should be placed in the right direction, but the details are not good. So usually what we do after we obtain the initial structure is use a force field, or some other biophysics-based approach, to optimize the atomic details, the side-chain packing and orientations, because without that, the structures look globally correct, but if we look at the details, they are not useful at all. That's why people now take these different approaches and put them together. Machine learning, I would say, is very good at identifying a good starting point; after we find this good starting point, the biophysical model is very helpful for optimizing the details and getting the protein folded into the right structure.
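The "ML scaffold first, physics refinement second" workflow described above might look roughly like this with OpenMM. A minimal sketch, not any group's pipeline: the input file name is hypothetical, the force field pair is one common choice, predicted structures often need fixing first (missing atoms, hydrogens, e.g. via PDBFixer), and real relax protocols are considerably more involved.

```python
from openmm import LangevinMiddleIntegrator
from openmm.app import PDBFile, ForceField, Simulation, NoCutoff
from openmm.unit import kelvin, picosecond, picoseconds

# 'predicted.pdb' is a hypothetical ML-predicted structure.
pdb = PDBFile("predicted.pdb")
ff = ForceField("amber99sbildn.xml", "amber99_obc.xml")  # implicit solvent
system = ff.createSystem(pdb.topology, nonbondedMethod=NoCutoff)
integrator = LangevinMiddleIntegrator(300 * kelvin, 1 / picosecond,
                                      0.002 * picoseconds)
sim = Simulation(pdb.topology, system, integrator)
sim.context.setPositions(pdb.positions)

# Relax clashes and side-chain detail while keeping the global scaffold.
sim.minimizeEnergy(maxIterations=1000)

with open("refined.pdb", "w") as f:
    state = sim.context.getState(getPositions=True)
    PDBFile.writeFile(pdb.topology, state.getPositions(), f)
```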
So a question just came in that, I have to say, I've been wondering about during a number of these talks, and I think it's a good general question for everyone: how much data do you think is needed for genomic deep learning? You're welcome to comment more generally, but start with your own research. And let's start with Christina Leslie, because Christina, during your talk I was wondering: you're running this thing, scanning it along the genome from chromosome to chromosome, and you mention you can use one chromosome as a holdout to validate the model trained on the others. But is that just one individual? Presumably you have multiple individuals, so we have at least two dimensions, number of nucleotides by number of samples. Is that how I should think about this, and how much data do I need in both of those dimensions?

So let me point out that the data I'm showing you is from training on a single cell type: it's ENCODE cell line data, or mouse ES cell lines. And a lot of work so far has been at that level, or maybe you do multi-task learning where you're predicting all of ENCODE, right? But fundamentally, it's predicting from the reference genome. And as we move forward, it's not magic: we're going to need to incorporate genetic variation into the models if we really want to learn what the function of non-coding genetic variation on gene expression is. We have some strategies, other people have strategies as well, but that's the next step.

But, for instance, do you have any estimate of how things will improve as more data are added, as more cell lines, for instance, are added to the repository? Is that going to help?

I don't know. I actually think that instead of cell lines, we need primary cells, looking at differentiation stages, like in our 4D Nucleome project. We want to model something that is relevant both to cell identity, the biology of the fully differentiated cell, and the biology of disease, right? So I think we should be training in more relevant contexts. In terms of learning: the epigenome-based model that I talked about, where from a few marks and the connectivity we're learning something fairly modest about how the contributions of distal elements and their state help us predict the level of expression output, it's not surprising in a sense that that does well and generalizes to new cell types pretty well. But it doesn't get us all the way to what a given minor allele does to the regulation of an important developmental gene.
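The chromosome-holdout evaluation mentioned in the question can be sketched in a few lines: hold out whole chromosomes so that train and test never share a genomic neighborhood. The labels and chromosome choices below are hypothetical.

```python
import numpy as np

# Hypothetical per-example chromosome labels for 5000 genomic windows.
rng = np.random.default_rng(1)
chrom = rng.choice([f"chr{i}" for i in range(1, 20)], size=5000)

test_chroms = {"chr8", "chr9"}   # held out entirely from training
val_chroms = {"chr17"}

test_mask = np.isin(chrom, list(test_chroms))
val_mask = np.isin(chrom, list(val_chroms))
train_mask = ~(test_mask | val_mask)   # everything else is trainable
```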
So, Sarah, how would you answer the question? How much data are you using, and what are your views in general on obtaining more data and how it would impact the machine learning you're doing?

Yeah, I think it's a different application than what Christina was talking about, but we do still use the entire genome; we just do that for multiple individuals, usually say 200 haplotypes, so 100 individuals. And I think it hasn't been explored enough how many we actually need to do well. Could we get away with fewer? I think there are some interesting questions there, and it depends on what you want to answer: for natural selection, you may need more individuals to really see that pattern; for things like ancient population size changes and migrations and admixtures, you may need fewer individuals.

But in terms of, you know, chopping up the genome into regions, I'm trying to look at feeding each of those regions almost as an image to the GAN framework. So how many images do I need? I would say a lower bound would be something like 10,000, just to throw a number out there. And there's a trade-off between taking more SNPs per region and having fewer regions, versus taking fewer SNPs per region and having more regions. So I think there's a lot of exploration that needs to be done there. For a GAN we're limited by the length of the genome, whereas for a supervised machine learning method where you're using simulated data to train, there's no limit to the amount of training data you have, right? The question is just whether it's actually good-quality training data.

Great, thanks. Jian, did you want to comment as well?

Yeah, I want to add one comment. For a lot of problems, we don't have a lot of cell lines, we don't have a lot of samples, we don't have too many genomes, or we don't have too many proteins to train all these models. But essentially, if you think of all these modern deep learning models, they're trying to capture local patterns from the very bottom and organize them so that they come together into more complex, organized patterns. And if you think of the local filters, the local feature extractors at the lower layers of these models, we actually do have a lot of data for them: we have three billion base pairs in the genome, and each region can be seen as an individual data point. Of course it depends on what kind of task we want to do. For proteins, for example, we don't have a lot of protein structures, although many more than we've had in the past; but essentially what we're trying to do is find all these interacting motifs, and those are local interactions. So I would say the reason these deep learning models work is that, although we don't have a sheer large number of samples, within each sample we can actually construct a lot of useful data points. That's what I wanted to comment on.
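The region-as-image idea Sarah describes might be sketched like this: cut a haplotype matrix into fixed-width SNP windows that a GAN can treat as images. The shapes and window width are illustrative, not the settings of any particular method.

```python
import numpy as np

def genotype_windows(haplotypes, snps_per_region=36):
    """Cut a (num_haplotypes x num_SNPs) 0/1 matrix into fixed-width
    regions, one "image" per region. Note the trade-off discussed above:
    wider windows mean fewer training regions, and vice versa."""
    n_hap, n_snp = haplotypes.shape
    n_regions = n_snp // snps_per_region
    trimmed = haplotypes[:, : n_regions * snps_per_region]
    # -> (n_regions, n_hap, snps_per_region)
    return trimmed.reshape(n_hap, n_regions, snps_per_region).transpose(1, 0, 2)

haps = np.random.randint(0, 2, size=(200, 10_000))  # 200 haplotypes
regions = genotype_windows(haps)                    # ~277 region "images"
```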
Excellent. So actually, Anshul Kundaje asked a good question in the chat here, about what you can say in general about the stability or instability of deep learning models, if not in fact all classes of ML models. Maybe we'll start with Christina; I would love to hear your thoughts on that.

Yeah, so we're moving into this deep learning space because the models are very exciting and flexible, and they're allowing us to do things that we couldn't do with the previous generation of machine learning models. But I think Anshul is exactly right that it can be scary how unstable they are, and you do have to use good machine learning practice: you have to use ensembles, and you have to be very careful not to fool yourself, as with all machine learning, that you're really generalizing. So yeah, I agree, it is something that we all have to be cautious about.

And just to push on it a little bit, Christina: how does the stability change as you start to add even more parameters, or more complexity? There's a certain theoretical strain of thought arising around overparameterization actually being useful rather than detrimental. So as our models get more complex, will they get more stable or less?

You know, I don't know that we can make a general statement, but yes, we've also seen these papers, and also empirically the idea that you want to overparameterize, and maybe it makes your gradients better, with more ways to approach a solution. It's very interesting. I think it's sort of a lesson for learning theory that the algorithm is part of understanding what you overfit.

Excellent. Maybe we'll go on to a few other questions, just to divide it up. Maybe, Stephanie, you could say a little bit about what the right inductive biases are to use for genomics. And again, just because it's a general audience, maybe you can start by saying a little bit about what is meant by the term inductive bias. I meant to say Sarah, sorry about that.

Okay. Yeah, I think it also maybe depends on the application a little bit. There are, I guess, biases in the data itself, and then there are also biases in terms of the methods that you choose, and that might be your own inductive bias about what you think will work better. In population genetics this often manifests as choosing models we think are realistic in terms of evolutionary models, say clean splits between populations, and we oversimplify in that case, and I think that can lead to some inductive biases. It would be a little bit different in each field. But what I would like to see is methods that allow you to move between models. Often we fix the model and we fit it, say a two-population model with a clean split and no admixture since the split. So what could we do to actually allow the method to explore models, so that we are not constraining ourselves in that way? I think that would be super interesting, and it would maybe let us explore solutions that we didn't think were feasible but actually happened over evolutionary history.

So one discussion point that keeps coming up, and I think it came up in Sarah's Q&A, but I think it's good to discuss with all of us, is this role of simulated versus real data, and in particular the role of simulated data in what all of us who are up here on the virtual stage do. On the one hand, simulated data can produce a virtually infinite number of training samples, and has lots of other benefits. On the other hand, if you're not careful, at some point you've stopped learning about human biology and you're learning about how the machine learning algorithm works, which of course is less interesting. So Sarah, why don't we leave you on the stage, or on the floor so to speak, and maybe you can comment, just in general:
How do you think about simulated data, and when are the appropriate uses versus the inappropriate uses?

Yeah, it's a great question, because I think in population genetics we've relied on simulations really heavily, perhaps more than most fields. And that's really because we can't go back into the past and find evolutionary ground truth. If we want to infer historical mutation rates, or recombination rates, or population size changes, we don't have the ground truth, so we are forced to rely on simulated data for anything that's supervised, broadly defined. However, I would say it's not as bad as it sounds, because we actually have very good evolutionary models. We know how to do the forward process; we just don't know how to infer, and that's the goal of population genetics, to infer, going back in time, what actually happened. It still hampers us a bit, because we can only simulate the things that we think are real, but I think we do have very good evolutionary models.

I guess there are two exceptions to that. One is ancient DNA: if we had really good ancient DNA, we could actually have some ground truth; however, and this is what I think right now, it wouldn't be nearly enough to train any machine learning model. And two is experimental evolution: you can do this in yeast, or in some other fast-reproducing species, but I don't think you're going to be able to do experimental evolution for humans, right? So I think we will maybe always rely a little bit on simulations, to test our models, to validate our models: do they produce reasonable things in simulations? Maybe we can triangulate with different models and say, okay, if all these models were tested in simulations with different caveats and they all produce the same result on real data, maybe we start to feel confident this is a real result. But without external validation techniques, we won't exactly know what happened thousands of years in the past. I also think there's a power to GANs, which is what I'm interested in right now, to help us make better, more realistic simulated data. And I think we should think about underlying structure in the data, clustering, visualization, things that help us learn more about the data in an unsupervised fashion. But I think we're always going to rely a little bit on simulations.

Jian or Christina, did either of you want to comment?

I mean, I think it's not clear how we do data augmentation in the genomic setting, right? Whereas in the image setting there are lots of strategies to augment your training data, in a way that doesn't bias badly.

I will say simulation data are useful to validate hypotheses. Basically, you want to develop a method, and you want the method to capture a certain inductive bias that you believe to be true for this particular biological problem. So you'd like to simulate data such that the data have the property you want them to have, and then check whether the model is able to detect that kind of thing.
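Sarah's point that the forward process is well understood is easy to illustrate: coalescent simulators like msprime generate labeled training data under known parameters. A minimal sketch, with illustrative parameter values:

```python
import msprime

# Simulate ancestry under known (here, made-up) parameters, then mutations;
# the resulting genotypes are "ground truth" training data for inference.
ts = msprime.sim_ancestry(
    samples=100,                 # 100 diploid individuals, 200 haplotypes
    population_size=10_000,
    sequence_length=1_000_000,
    recombination_rate=1e-8,
    random_seed=42,
)
mts = msprime.sim_mutations(ts, rate=1.25e-8, random_seed=42)

genotypes = mts.genotype_matrix()   # (num_sites, num_haplotypes)
```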
So essentially, I think all of these, this question and the previous question, are related, in the sense of: what do we really want to do when we use machine learning or deep learning? Do we just want to improve prediction? That's okay. For example, when we want to predict protein structure, sometimes we don't need to worry about how these structures are generated, as long as they're useful, for drug discovery, et cetera. And in many other fields, especially, I know, in systems biology and regulatory genomics, people generally want a tool that is able to identify certain knowledge, certain kinds of interactions. But sometimes these interactions are very complex, so we don't know whether the method is able to mine such relationships; so we generate datasets designed to contain these kinds of patterns, and we check whether the model can learn them.

So this also comes back to the original question: how do we actually use deep learning in the right way for different problems? I would say that for many problems in genomics, accuracy is secondary. We don't really want to push the accuracy to be very high; we want the model to be insightful, so we can use the model to discover new knowledge, new interactions, new insights that help us better understand biology. So I think all of these questions are related, and it's all about how we design the right methodology to interrogate the data, and to identify what the purpose is: whether we want to make accurate predictions for some downstream analysis or application, or we want to find new biology and better understand the biological system.

And thanks for the segue to interpretation, or explainability. One thing I was going to ask you, Jian, is this: both Sarah and Christina got questions about the interpretability of their models; you didn't so much. So I was going to ask you, how do you think about interpretability, or interpreting the models, say, that you're building for structural proteomics and protein folding?

Yes, so we actually consider that very much. For example, when we build a model to predict protein structure, or protein contacts, or interactions, we actually visualize it. There are a lot of approaches developed in the field to visualize the activations of certain layers in your neural network. For example, essentially what you want to do is take the gradient with respect to the input, right? Or, across different types of data, you visualize which neurons are activated. In this way, we can go backwards to the original sequence or structure to identify whether there are certain relationships, for example two beta strands or two alpha-helical structures interacting with each other. So we check that all the time.
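The gradient-with-respect-to-input idea Jian describes is the classic saliency map. Here is a minimal, self-contained sketch with a toy stand-in model; the real models and inputs are of course different.

```python
import torch
import torch.nn as nn

# Toy stand-in for a trained structure/contact model (hypothetical).
model = nn.Sequential(
    nn.Conv1d(4, 8, kernel_size=5, padding=2), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(8, 1),
)

# One-hot-like input sequence; require gradients on the input itself.
x = torch.randn(1, 4, 200, requires_grad=True)
score = model(x).sum()
score.backward()

# Gradient magnitude per position: which inputs move the output most.
saliency = x.grad.abs().sum(dim=1)   # (1, 200) importance track
```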
But I would say the goal of the problem we study is somewhat different from other genomics problems: we want to push the accuracy into a regime where we can actually use the predictions. For the other fields, and Trey, you have done a lot of work in this area, we all understand that we want to figure out what the model actually learns. At the end of the day, what you learn that can be transferred, that can be generalized, are very simple rules, some biological rules. But those rules are noisy, and noisy rules are not easy to learn using very simple models; that's why we need deep learning. But finally we need to distill this kind of knowledge from the model. One way is to use interpretability approaches, like post hoc analysis: after you build a model, you analyze what it has learned. The other way, like what you and others have done, is to use inductive bias: build the model in a way that already incorporates the necessary biological knowledge, so that you can visualize the model, draw conclusions, and make new discoveries. So I think those are the two different approaches people widely use in the field. I don't know which one is better; I think both have their own virtues. But I think this will be a very important research area in the future, because in the end we want to discover new biology and better understand biological systems.

Yeah, I would agree. In what I presented, it's very important to be able to predict gene expression, because you want to learn the mapping from the epigenome, or from the genome and genetics, to expression. But actually, expression is the easiest thing of all to measure, right? That's right: the thing you wanted to predict, you could just measure, exactly.

You know, I have a question for the panelists that's a little bit more related to training. One of the things all three of you have highlighted is the challenge of actually getting your models trained; there's a fair amount of knob turning that has to happen, and that's the kind of thing that really is the domain of experts who learned how to do it in the right environment. But now imagine you're a lab that hasn't yet done a lot of work in machine learning, and you want to start being able to build your own models and apply them to your questions of interest. What's the best way to get up to speed? Is it by finding a collaborator, recruiting a postdoc who knows how to do it, or just blood, sweat, and tears? Maybe, Christina, you can start.

I'm a big believer in bringing in postdocs who are well trained in the area and getting them excited about biology, and having a training environment where you can say: look, you have exciting methods, but we have the real problems, right? And we can put this methodology to use in a really impactful, meaningful, and biologically sensible way. So yeah, you have to lure them away from Google and Facebook and get them into your lab.

So you mean there are some people who'd rather work on cancer than selling ads? Yeah, that's all we have: come cure cancer. We can't offer the salaries, but there is a mission.

You know, I see we're almost at time, so maybe we'll just end on that note. I thought this was just a wonderful session, and I'd really like to thank all of the speakers. There were many really great questions that we didn't get to today, but I thought this was a great discussion. Thanks, everybody. That was fantastic.