So Aviv actually almost doesn't need an introduction, because we all know her as one of the leading computational biologists in the world. She works on a wide variety of topics, including biological networks, gene regulation, and evolution. We have known her for many years as one of the core members of the Broad Institute and, since 2014, as the chair of the faculty and a member of the executive leadership team there, as a full professor of biology at MIT, as co-chair of the Human Cell Atlas, and as a Howard Hughes Medical Institute Investigator. She is the recipient of many scientific awards; I will just highlight three of the more recent ones: the Paul Marks Prize for Cancer Research from Memorial Sloan Kettering Cancer Center, election to the National Academy of Sciences, and the Lurie Prize in Biomedical Sciences from the Foundation for the National Institutes of Health. And very recently there was a big event in her CV, if I may call it that: in August 2020 she decided to join Genentech as Executive Vice President for Research and Early Development. As such she is also a member of the enlarged Corporate Executive Committee of Roche, which was interesting for us not only on the scientific level but also on the local one, because Roche is headquartered in Basel, where my institute is also located; we are almost neighbors, with the Roche tower just down the street. We are very happy and honored to have you here, and we are looking forward to your talk. Thank you.

Thank you so much. Indeed, one day when the world normalizes again and we travel, I expect to visit Basel quite frequently, and it would be lovely to see all of you in person, but for now we do it this way, the safe and appropriate way.

First, let me start with my disclosures. Before I started at Genentech on August 1, 2020, I was involved in several companies: Celsius Therapeutics, Thermo Fisher, Neogene Therapeutics, Syros, and Immunitas, and in several of them I still have equity. In addition, as was already said, I am an employee of Genentech, which is a member of the Roche group. What I am going to describe today is work from my academic lab at the Broad Institute, but it is quite relevant to the work that I will be doing with my colleagues at Genentech.

Many problems in biology are very big. I am going to start with one that I will come back to at the very end of the talk: mapping the genetic interactions between two genes, which we define as their non-additive joint causal effect. This is one of the outstanding challenges in biology; it is why biology is difficult to predict. Genetic interactions are present at every level, from cellular phenotypes, where we can look at gene expression or cell viability, all the way to the organismal level, when we think about the joint action of multiple variants in one person. And this is really just one example of the many problems in biology where the space of possibilities is enormous, except that most of these possibilities do not actually exist. It includes the combinations of somatic mutations that we see in cancer, regulatory sequences, where the space scales as four to the power of n, gene expression programs that are co-regulated, and those genetic interactions, which scale as an n-choose-k problem.
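To make these scales concrete, here is a quick back-of-the-envelope calculation; this is an illustrative sketch in Python, not from the talk, with the gene count chosen arbitrarily:

```python
from math import comb

# Regulatory sequences of length n over {A, C, G, T}: 4^n possibilities.
print(f"4^80 = {4**80:.3e}")                # ~1.5e48 possible 80-base promoters

# Pairwise genetic interactions among n genes: n choose 2.
n_genes = 20_000
print(f"C({n_genes}, 2) = {comb(n_genes, 2):,}")    # ~2e8 pairs

# Triples already explode far past what any screen can test.
print(f"C({n_genes}, 3) = {comb(n_genes, 3):.3e}")  # ~1.3e12 triples
```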
And because m and n are big numbers here, these numbers become enormous; they quickly grow to more than the number of cells on the planet, or the number of people, or the number of atoms in the universe. That, of course, opens the question of how we can study problems like this in biology systematically, knowing upfront that we will never be able to do it exhaustively just by trying out all of the possibilities experimentally. What I am going to try and claim today is that in many of these cases we should stop looking at this as one of our problems, and see it instead as one of the best opportunities happening in biology right now.

But first let me give you a little bit of historical perspective. As biologists we always knew these were big problems, and we always had to make them work somehow. The first approach we took to the large search space was to use some sort of informed design in order to limit it. Instead of testing all possibly mutated sequences, you would probe only the ones that you know exist; instead of exploring every hypothetical expression program, you would use signatures that reflect the ones we have already seen; instead of testing all possible genetic interactions, we would look for those between genes that we think are likely to do something together; and instead of looking at arbitrary regulatory sequences, we might design assays around the regulatory sequences we saw in nature, courtesy of evolution, and so on and so forth. These are just a few examples. All of these are great approaches, but they did not fully solve our problem, because we cannot really know that we have the general answer by looking at the search space in this limited way. So the question we want to ask ourselves is: can we design the types of experiments, inspired by algorithms for inference, that would allow us to better infer the rest of the biological system, the part that we cannot measure directly?

As an intuitive picture of why I think this is changing in experimental biology, I am going to use a little example from art. Imagine that functional discovery in biology is about figuring out which painting is hiding in this frame. The painting is hidden behind these little white tiles, there are a lot of them, and every experiment is equivalent to removing one tile and seeing what is underneath. I don't know at all what the picture is, but I know something principled about what pictures tend to look like, something basic about how a picture is organized: in my world, paintings have patches of color, which is what I actually want to find, and a lot of uninteresting background. Until recently, we could only remove very few tiles at a time, so we had to focus. Imagine that we already knew that there was some color, for example, here in this pixel. Because colors come in patches, the best thing we can do is dig around it, since it is much more likely that there is color next to color, given the patchiness of paintings. And when we do this, we discover that there is red here.
And maybe I also knew from prior knowledge that there is a little bit of yellow here, so I would dig around that and get a little bit of yellow as well. Because I could only remove a small number of tiles, I did the best I could with them, which was to search where something was likely to be found. I learned about the red and I learned about the yellow, but I have no idea about the overall big picture. Now, what might happen if I took a different approach and used the exact same number of tiles randomly? If I did that, because most of my painting is background, what I get back is mostly background; I didn't learn much more, although I did uncover a little bit of blue somewhere, so it's not that I learned nothing. But what would happen if my ability to do experiments was very different, and I could remove a lot more tiles at a time, though still far fewer than the total number of tiles? If I did that randomly, then gradually, as the number of tiles increases, I would get a pretty good idea of what the painting is like. In fact, if you like modern art and have seen examples before, you can almost fill in the blanks yourself, and you learn that this is a painting by Joan Miró, and not just any Miró, but one from the collection of The Broad museum in Los Angeles.

Filling in the blanks like this is something we do all the time in machine learning; we have multiple ways of handling large search spaces without actually looking at all of the options. This is not yet typical in experimental biology, although it is done, but it is a great opportunity for us, simply because we can now remove many more tiles than we could before, and because conducting experiments that look at a lot of things at random is getting easier and easier. At the same time, realizing that this might be a good design should prompt us to design our biological experiments and methodologies differently than we might have before. That is what I am going to try to convey today.

I will start very briefly with my first case study of design for inference: massively parallel single-cell RNA-seq. I am going to say very little about past work in single-cell RNA-seq, except to remind you that it was a major, perhaps unconscious, experimental design decision, made very early on in developing methods like Drop-seq (shown here on the right), to favor sparse and noisy data from massive numbers of cells over the richer and more precise data you could get from a small number of cells; basically, preferring to remove a lot of tiles over anything else we could optimize for. We did that because we knew we could handle the sparsity per cell: expression within cells and patterns across cells are highly structured, and that is a point I will return to again and again. That is what I mean by experimental design for inference. It was a new way of doing a biological experiment, because we had a certain assumption about the structure of the data in the real biological world, and because we knew that, given this assumption, this would be a better design for getting biological answers.
I can also tell you that it was very uncomfortable for experimental biologists in the beginning, but it proved quite successful. We get nice patterns and structure in this sparse and noisy data; they can be captured by algorithms, and we can learn very meaningful biological representations of things like cell types, the programs that cells run, the states they assume, how they develop, and even where they are in space. So this is one good example of a lab technique that became very successful, with its associated analytics, by leveraging structure in sparse data.

Over the last several years, we started developing additional designs for inference, and I am going to focus on three vignettes today, all largely in the area of how RNA expression is regulated from the genome. This is the very basic biological view of that world: we have upstream molecular circuits that control downstream gene expression, and they do this through regulatory sequences that affect RNA production. One of the fundamental questions in biology is discovering these kinds of circuits, and one big edge that we have in biology is that we can do this using interventions. We can intervene on the causes, because we can perturb one or more components at a time and then measure the effect; in my examples, the effect will be on gene expression.

So I am going to start with regulatory sequences that control gene expression. This was work done by Carl de Boer, who was a computational postdoc in my lab at the time and now has his own lab at the University of British Columbia. One way of learning is by looking at many examples of regulatory sequences and the gene expression levels they drive. Typically, people use two different approaches to generate these matched data of sequences and the expression they drive. They can use native sequences, which reflect natural function but limit us to the examples present in nature. Or they can design sequences, as in massively parallel reporter assays, put them in constructs, and measure the expression they drive; here we can design any sequence we desire up to a certain length, but we are limited by synthesis technology, so the number of examples might be tens of thousands of sequences, maybe 100,000, and it is also very costly. But could we work with much bigger data, many orders of magnitude bigger, if we just didn't try to design anything and used random DNA instead?

How would this work? Because transcription factors tend to bind very short and somewhat degenerate sequences, most transcription factor binding sites should be very prevalent once we have a large enough collection of random DNA. In the general case, a motif with X bits of information content is expected to occur every 2^(X-1) bases. So in a library of 10 million 80-base random sequences, and I will get to why 80 bases specifically in a second, nearly 90% of the yeast transcription factor motifs will have at least 40,000 examples, and 50% of them will have millions of examples. This kind of calculation was actually done by Leonid Mirny more than 10 years ago, but testing it, and seeing whether one can work with it experimentally, was worth trying: because random sequences are very easy to synthesize, we hoped this would produce ample training data.
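As a rough illustration of this frequency argument, here is a small sketch; the 2^(X-1) spacing and the library size come from the talk, but the information-content values below are invented for illustration:

```python
# Back-of-the-envelope check of the motif-frequency argument (an
# illustration, not the authors' code). A motif carrying X bits of
# information is expected roughly once every 2**(X - 1) bases of random DNA.

def expected_motif_hits(info_bits: float, n_seqs: int, seq_len: int) -> float:
    """Expected number of occurrences of a motif in a random-DNA library."""
    positions = n_seqs * seq_len            # total bases scanned (approximate)
    return positions / 2 ** (info_bits - 1)

library = (10_000_000, 80)                  # 10 million sequences, 80 bases each
for bits in (8, 10, 12, 14):
    print(f"{bits:>2} bits -> ~{expected_motif_hits(bits, *library):,.0f} hits")
```

With these assumed information contents, counts range from roughly 100,000 up to millions of examples per motif, which is the scale the argument needs.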
But of course, as I said, this has to work experimentally. So we made a very simple assay to measure the noisy expression of hundreds of millions of sequence examples. We make random sequences, 80 bases long, put them into yeast, where they drive the expression of a fluorescent marker; we sort the yeast by the level of marker expression, and then we sequence the driving regulatory sequences from the different bins. We use such a large number of sequences, 10^8 sequences, and actually fewer yeast cells than sequences, that the vast majority of the time any given sequence appears at most once in the collection. So the data is enormous, extremely sparse given the 4^80 search space, and also very, very noisy, because we get a zero-or-one measurement per sequence.

This kind of idea will only work if the expression driven by these random promoters is, first of all, variable enough, and second, reproducible. To test this, Carl grew the library and sorted the yeast into 18 bins of expression, which let him see whether there is a range of expression levels that random sequences can drive. Then he recovered the live yeast from each of the bins, regrew them, and measured their expression again, which let him ask whether the expression is reproducible. He got a very broad range of expression levels, and it was reproducible: the distributions show the regrown yeast levels, colored by the strength of expression in the original bin they grew from, and you can see that they follow the same order. So now Carl could use this to measure about 100 million promoters in two core-promoter contexts, with the yeast given different sources of carbon: glucose, galactose, and glycerol.

Now we can start asking whether we can learn something interesting from this data. One way to start learning biology from data is to build yourself an interpretable model of gene regulation, which here is framed explicitly in biochemical terms. You can put in a lot of biological detail, because even though the model has far more parameters than previously attempted for this problem, we still have far more data than parameters, so we can easily learn a model with 200,000 parameters and a lot of biological refinement. The first task we challenged the model with is prediction, and it turns out it does a pretty good job. The model generalizes really well in predicting the expression of other random sequences, meaning zero overlap between the sequences in the training and test data, even when we measure the expression levels of the test data with many more cells per sequence, so that they are high-quality estimates where our original data was very noisy. It also generalizes well to native sequences, which are really far from random. So now we had basic faith that the model at least generalizes, and we could ask more biological questions. The first thing we saw is that the model can learn cis-regulatory motifs de novo, simply by initializing it with a thousand random motifs. The model actually has somewhat better predictive power with learned motifs than when given pre-known motifs, and it recovers many motifs that correspond to the ones we know from years of experimental work in yeast.
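To give a flavor of what a biochemically interpretable model can look like, here is a heavily simplified sketch; this is not the published model, just the general shape of the idea (a PWM scan, a logistic occupancy, and a weighted sum over factors), with all parameters invented:

```python
import numpy as np

# Minimal sketch of a biochemically inspired sequence-to-expression model
# (a simplification, not the published model). Each TF has a position
# weight matrix (PWM); occupancy at each position follows a logistic of the
# PWM score; predicted expression is a weighted sum of total occupancies.

BASES = "ACGT"

def occupancy(seq: str, pwm: np.ndarray, mu: float = 6.0) -> float:
    """Total expected binding of one TF along a sequence."""
    k = pwm.shape[0]
    scores = [
        sum(pwm[i, BASES.index(seq[p + i])] for i in range(k))
        for p in range(len(seq) - k + 1)
    ]
    return float(np.sum(1.0 / (1.0 + np.exp(-(np.array(scores) - mu)))))

def predict_expression(seq, pwms, activities, bias=0.0):
    """Expression = bias + sum over TFs of activity * occupancy."""
    return bias + sum(a * occupancy(seq, w) for w, a in zip(pwms, activities))

# Toy usage: one activator with a fake 4-bp motif on a random 80-base sequence.
rng = np.random.default_rng(0)
pwm = rng.normal(0, 2, size=(4, 4))
seq = "".join(rng.choice(list(BASES), size=80))
print(predict_expression(seq, [pwm], [1.5]))
```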
The actual model has about 120,000 parameters when initialized with no motifs, which is two orders of magnitude more than previous models, but you can learn it because the data is sufficiently big. Next, we looked at the potentiation scores that the model identifies. You can think of these as scores for opening chromatin: whether a transcription factor, if active, would impact the ability of another motif to be bound by another transcription factor and activated or repressed. Looking at these potentiation scores, we were able to identify transcription factors that can open chromatin in either glucose, on the x-axis, or galactose, on the y-axis. The model correctly predicted that the known general regulatory factors Abf1, Rap1, and Reb1 open chromatin in both conditions, but a factor like Gal4 does so only in galactose.

Then we looked at the model's accessibility predictions when we train it on our random sequences and use it to score the native yeast genome. These are experimental results: on this plot we see the average nucleosome occupancy across all yeast promoters based on published measurements, with nucleosome occupancy in red and DNase I hypersensitivity in brown. The beautiful thing is that all of this is really nicely predicted by the model: we get the minus-one and plus-one nucleosomes, the nucleosome-free region, and even some of the periodic structure. And this is even though we never gave the model any information about chromatin state; we just framed the problem in a way that let it infer this from the sequence itself.

Next we turned to a feature that should be well covered in random DNA but is difficult to study otherwise: the position and orientation of transcription factor binding sites, because at random we get sites in so many different places, positioned and oriented in many different ways. The model I showed you results from so far just summed up the binding strength of the sites for each transcription factor. Now we designed an extended model that fits parameters based on the location and the strand of the motif, which brings us to about 220,000 parameters when initialized with no motifs. In the learned model, we can look at these new location-specific activity parameters, shown here with motifs in the rows, position relative to the transcription start site in the columns, and the minus and plus strands separately. And now we can identify specific mechanisms just by looking at these patterns. For example, a transcription factor like Mga1 has strand-specific activity: Mga1 has to be on the minus strand in order to be active, but only in the context of one of the two promoter scaffolds. There are also cases where the strand preferences are periodic, consistent with a helical-phase bias, which we showed by correlating to the periodicity of a 10.5-base sine wave. And we can see position preferences: an activator like Skn7 is more active when bound distally to the TSS, while repressors are more repressive when bound proximally to the TSS.
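The helical-phase check mentioned above can be illustrated with a fit of a fixed-period sinusoid; this is a toy version of that test on fabricated data, not the actual analysis:

```python
import numpy as np
from scipy.optimize import curve_fit

# Illustration of the helical-phase check (a sketch on fake data): test
# whether a TF's position-dependent activity oscillates with the ~10.5 bp
# period of the DNA double helix by fitting a fixed-period sinusoid.

PERIOD = 10.5  # bases per helical turn

def helical(x, amp, phase, offset):
    return amp * np.sin(2 * np.pi * x / PERIOD + phase) + offset

# Fake activity profile standing in for a learned per-position parameter.
pos = np.arange(0, 120)
rng = np.random.default_rng(1)
activity = helical(pos, 0.8, 0.3, 0.0) + rng.normal(0, 0.2, pos.size)

(amp, phase, offset), _ = curve_fit(helical, pos, activity, p0=(1.0, 0.0, 0.0))
r = np.corrcoef(activity, helical(pos, amp, phase, offset))[0, 1]
print(f"fitted amplitude {amp:.2f}, correlation with 10.5-bp wave r = {r:.2f}")
```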
The prevalence of these position and orientation effects was really intriguing to us, because people have tried to learn position effects before using massively parallel reporter assays, especially in the lab of my very good friend Eran Segal; we are not the first to think about them at all. They did a very clever designed experiment: you take a transcription factor binding site and essentially tile it along the length of the promoter, so you put it in one position, then move it by one base, then by one base again, and so on. You should then be able to see the impact of moving the site along the promoter, and you already saw from our model that the position of the site should matter. People did these experiments and found them basically impossible to interpret, because even the tiniest move would give radically different behaviors. You could slide a site by a couple of bases and go from strong activity to no activity at all, even though it was the same site and roughly the same distance from the TSS. That was hard to understand, and we wanted to see if our model could help.

So we repeated these kinds of tiling experiments using our system: we took a motif, here for Mga1, and slid it along a background of random sequence. We got the same answer: just like in the previous experiments, the impact appeared very jagged and erratic, and even a one- or two-base shift could have a very big effect on expression. What was different for us is that our model predicted this extremely well; that is what you see in red, and it is not just Mga1, it is true for each of six motifs that we tiled across three different random contexts. And because our model is interpretable, we could start asking which features actually produce this behavior. It turns out that the positional activity of Mga1 does not correlate with it; what does correlate is the accessibility that the model learns. These changes in accessibility, or potentiation, come from multiple weak sites for other transcription factors that get created or disrupted as we slide the primary Mga1 sequence. None of these sites would ever be considered by a classical model; they are simply too weak to be counted on their own, because the data sizes such models learn from are too small to identify these effect sizes. But it turns out that to biology it matters, and it is the creation and destruction of these weak sites that really changes the picture.

With this type of result in mind, we asked whether we could understand the contribution of strong versus weak sites to each gene's expression, and for this we ran a computational experiment. We deleted each transcription factor computationally, so its binding sites can no longer activate or repress, and asked what the impact would be on the expression of each native promoter sequence; we call this the interaction strength. It turns out that only 0.1% of the possible regulatory interactions are predicted to alter expression by more than two-fold.
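Here is a toy version of that computational deletion experiment; the occupancies and activities are simulated, so this is only the shape of the analysis, not the published one:

```python
import numpy as np

# Sketch of the in-silico deletion experiment (a toy version). Treat
# predicted log-expression as a weighted sum of TF occupancies; "delete" a
# TF by zeroing its activity and record the fold-change on each promoter.

rng = np.random.default_rng(2)
n_tfs, n_promoters = 50, 1000
occ = rng.exponential(0.5, size=(n_promoters, n_tfs))  # occupancy per promoter
act = rng.normal(0, 0.3, size=n_tfs)                   # activity per TF

log_expr_full = occ @ act
strengths = np.empty((n_tfs, n_promoters))
for tf in range(n_tfs):
    act_ko = act.copy()
    act_ko[tf] = 0.0                                   # computational deletion
    strengths[tf] = log_expr_full - occ @ act_ko       # log fold-change

frac_strong = np.mean(np.abs(strengths) > np.log(2))  # > two-fold changes
print(f"fraction of TF-promoter interactions > 2-fold: {frac_strong:.4f}")
```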
These rare, strong regulatory interactions do explain a disproportionate amount of expression given their number. But at the end of the day, 94% of expression is attributed to the much more prevalent weak regulatory interactions, those with less than a two-fold impact. Because of smaller data, we always had to focus on large effect sizes, but the truth is that much of what happens in gene expression is driven by much weaker interactions. This unfortunately means that we cannot just easily design promoters by tiling sites in without disruption, so we had to come up with an alternative approach to design.

So Carl, together with Eeshit, a graduate student in the lab, said: let's train a model on this large-scale reporter data and use it to search in silico for random sequences that simply have the desired expression pattern we want; then we can synthesize those sequences and test them. Because this is primarily a prediction task, we turned to models that are focused on prediction rather than interpretation, but are still inspired by biology: a CNN-based model. We designed it in a way that gives it a chance to capture things like motif specificity, interactions between transcription factors, scanning, long-range interactions, activity after binding, and so on. By now we are at 3.3 million trained parameters, but we still have more than 100 million sequences to train the model on. This model is more accurate, as it was designed specifically to be great at prediction: it reduced the error of the previous model by 33%, and it now predicts 98% of the expression variation driven by sequences in a test set of random sequences, and 92% for native sequences.

This gives us real, distinct value. The first thing we can do is use the trained model to search for sequences with especially high or especially low expression, and then synthesize those sequences. This is measured data from those sequences, and we see that the sequences designed to have high expression indeed drive high expression, those designed to have low expression indeed have low expression, and they can even exceed the limits of the native distribution of expression levels in yeast. We can also use models like these to understand biology. I am not going to go too deeply into this, it is work in progress, but in particular we are asking: if the level of expression of a gene is a proxy for the fitness of the cell, it allows us to describe what we think of as a comprehensive fitness landscape on which we can ask specific evolutionary questions.
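For a flavor of what a sequence-to-expression CNN looks like, here is a minimal sketch; the real model is far larger (3.3 million parameters), and this toy architecture is invented, not the published one:

```python
import torch
import torch.nn as nn

# Minimal sketch of a sequence-to-expression CNN (a toy architecture, not
# the published model). Input: one-hot 80-base sequence, shape (batch, 4, 80).
# Output: a single predicted expression value per sequence.

class ExpressionCNN(nn.Module):
    def __init__(self, n_filters=128, motif_width=13):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(4, n_filters, motif_width, padding="same"),  # motif scan
            nn.ReLU(),
            nn.Conv1d(n_filters, n_filters, 13, padding="same"),   # interactions
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                                # pool positions
            nn.Flatten(),
            nn.Linear(n_filters, 1),                                # expression
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

model = ExpressionCNN()
onehot = torch.zeros(32, 4, 80).scatter_(1, torch.randint(0, 4, (32, 1, 80)), 1.0)
print(model(onehot).shape)  # torch.Size([32])
```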
So to conclude this part: because functional transcription factor binding sites occur frequently by chance, we can design an experiment, the gigantically parallel reporter assay, as we fondly call it, to measure the expression levels associated with hundreds of millions, and you could go to billions and more, of random sequences. This random DNA gives a very broad range of reproducible expression. We first learned a biochemically interpretable model that explained 92% of the variation in random sequences and 85% in native sequences, and gave us many insights into how transcription factors function: we learned motifs de novo, correctly predicted chromatin structure, and found the transcription factors that regulate chromatin state. Once we add position into the model, it allows us to discover location and strand preference effects, including helical-phase preferences, and especially to understand the behavior of tiling motifs on a random background, which is due to the creation and disruption of multiple weak interactions. When we went back and analyzed native sequences, we saw that these weak interactions are very prevalent in promoters and have a large effect on gene expression, which opens all sorts of interesting questions when you start thinking about, for example, genetic variation in regulatory sequences in humans. This also suggested that we could develop model-driven promoter design, which we learned using a deep CNN from the same dataset; we then synthesized sequences based on our in silico predictions and got expression driven very nicely to our desired levels. And we are working on additional aspects of this: in addition to the evolutionary work, Carl is working on adapting this to human and other mammalian cells, and we are working on combining interpretable and deep models.

So now let me turn to programs of co-regulated genes. Again, there are many different ways in which genes could combine into expression programs. This second part is all the work of a former graduate student, Brian Cleary, who is now an independent fellow. What Brian reasoned was that maybe we do not need to measure each gene individually and could still recover the information we need. For example, what if, instead of measuring each gene separately in each sample, we could somehow collect compressed data on gene expression and then use an algorithm to decompress it and give us the gene levels? This is of course not a new idea at all. In image processing we compress images after acquisition all the time, for more efficient storage or computation, but we can also acquire compressed images and decompress them later, and this has many applications. Intuitively, the reason this works so well for images is that images are structured; they are not random. Brian reasoned, quite sensibly, that gene expression is also very structured: you have sets of genes that are co-regulated up and down together in sets of samples, and these modules shift and change and are used in different combinations, like Lego pieces, to give us the cellular state. That is something we have known for more than 15 years; it is one of the first insights that people got from functional genomics. Could we somehow use this structure to our benefit, to change the way we do our measurements and compress expression data as well?

Now, why do that? At minimum it could be very helpful, because many measurement methods can really measure the abundance of only dozens of RNAs or proteins. But if we could use the same, say, 100 measurement channels, as in flow cytometry or imaging, and actually get the levels of 10,000 proteins from those 100 channels, that would be a fantastic thing. So how could one imagine doing that? Instead of measuring individual genes, imagine that we measure the abundance of a much smaller number of composite genes, where a composite gene is basically a linear combination of gene abundances, and you get to choose the weights for the combination.
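As the next part explains, the weights can even be random and binary. Here is a minimal sketch of what a composite measurement looks like; the sizes and the pooling density are assumptions chosen for illustration:

```python
import numpy as np

# Sketch of a composite measurement: instead of reading out all g genes,
# read out m << g "composite genes", each a random 0/1-weighted sum of the
# individual gene abundances.

rng = np.random.default_rng(3)
g, m = 10_000, 100                     # genes, measurement channels
x = rng.lognormal(0, 1, size=g)        # true (unobserved) gene abundances

phi = (rng.random((m, g)) < 0.005).astype(float)  # binary composite weights:
                                                  # each channel pools a few dozen genes
y = phi @ x                            # the m observed composite measurements
print(y.shape)                         # (100,): what the assay records
```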
In particular, you can choose these weights randomly; there is beautiful math for that, which does not come from us at all. And in particular the weights can be binary, zero or one, which makes an experimental assay a lot easier than implementing arbitrary numbers. So how would we be able to use this idea, and who could actually pull this feat off in the lab? For one, instead of individual measurements, we would like to compare perturbations by their effects; for example, we should be able to correctly cluster or group compressed profiles. We would also eventually like to know the effect of a perturbation on each gene, not just on these weird random composites, so we would need to be able to decompress the data from the composite genes. And then finally, of course, these would all be lovely slides, but we would want to show that we can do it in the lab.

To answer the first two questions, we started by simulating random composites on existing expression data of all sorts. To see whether we could correctly group compressed profiles, we looked at the GTEx dataset, which is made of hundreds of RNA profiles from different tissues, with each tissue color-coded. If we take the full expression profiles, where each of them has about 20,000 measured genes, we get 30 sample clusters; this is just standard clustering of gene expression profiles from the data. Then, through simulations with added noise, we let each measurement represent a random weighted sum of dozens of genes. We can cluster this random composite data again, and we basically get the same clustering, so we preserve the sample-to-sample identity. Honestly, this is not just an empirical result; we expect it mathematically upfront, but it is always nice to see that it behaves appropriately under a model of noise. In fact, this is as good as if we had clustered the data using all 20,000 genes under the same model of noise.

So random compositions preserve sample-to-sample similarity. Now, if we have this compressed data, can we also decompress it, so that we know the levels of each of the genes? One way to do it is to use a little bit of training. First we use 5% of our full data to find the structure, by fitting a model where the expression of each gene across the samples is explained by a dictionary of gene modules and by the activity levels of those modules in each sample. You can use different approaches for this, such as SVD or sparse NMF, and we also developed our own algorithm that we call sparse module activity factorization, or SMAF, which is sparse in both the module dictionary and the module activities, which is particularly beneficial for data decompression. Then we take the remaining 95% of the samples, and in each case we simulate compressed acquisition with random weights; we use the module dictionary learned in the first step from the 5% of the data, take the compressed measurements, and predict the expression level of all of the genes. That is what we call data decompression. In the final step, we compare the decompressed data to the original data and see how well we did, and this is what we see here: we tested this with 200-fold compression, so we simulate 25 noisy composite measurements and then predict the levels of 5,000 genes.
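A sketch of this training-based decompression, with an off-the-shelf sparse regression standing in for the published pipeline; the dictionary, design, and noise here are all simulated:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Sketch of training-based decompression. Given a module dictionary U
# learned from a small amount of full data, recover gene levels x from
# composite measurements y = phi @ x by solving a sparse regression for
# module activities w, then x_hat = U @ w.

rng = np.random.default_rng(4)
g, k, m = 5_000, 40, 25               # genes, modules, composite channels
U = np.abs(rng.normal(0, 1, (g, k))) * (rng.random((g, k)) < 0.1)  # sparse modules
w_true = np.zeros(k)
w_true[rng.choice(k, 5, replace=False)] = rng.exponential(2, 5)    # few active modules
x_true = U @ w_true

phi = (rng.random((m, g)) < 0.01).astype(float)   # binary composite design
y = phi @ x_true + rng.normal(0, 0.1, m)          # noisy composite readout

fit = Lasso(alpha=0.1, positive=True).fit(phi @ U, y)
x_hat = U @ fit.coef_                             # decompressed gene levels
print(f"correlation with truth: {np.corrcoef(x_true, x_hat)[0, 1]:.2f}")
```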
For example, in the GTEx dataset, the decompressed data is 82% correlated with the full profiles. So composite measurements can indeed be decompressed. We can then ask whether we can decompress the compressed measurements without any training: could we infer the level of each gene even if we have never measured genes individually? This is called blind compressed sensing, or BCS, in the signal processing literature, and it requires making some assumptions about the structure of the data. The assumptions made in signal processing are not actually good for biology, so instead we made the assumption that samples that group together based on compressed measurements are likely to use similar active modules. Brian designed an algorithm called BCS-SMAF that first clusters the compressed samples, then searches for small dictionaries of modules for each cluster, then concatenates them, and that becomes the starting point for an iterative optimization over the modules and their activities. Could this possibly work for biological data? It turns out that it can. This is the original data, now looking at 14,000 genes per sample, and BCS-SMAF never saw a full profile; we gave it either 20x or 50x compressed data, which means 700 or 280 measurements, and it recovered up to 70% correlation with the real abundances. So we clearly lose some of the signal, but we also clearly recover a substantial amount. We can look again at the task of clustering under blind compressed sensing, and it performs quite well there too, which, again, we expect mathematically no matter what.

Next we asked whether we could actually do something like this in the lab; everything I showed you so far was based on RNA-seq profiles. Here we turned to in situ measurements, which is where we are truly limited in measurement channels, so we thought it was a particularly good use case. We started with the design: we used single-cell RNA-seq data to find gene modules and define good compositions. You don't have to, but it definitely helps if you can. Then we perform composite measurements of these combinations, recover the module activities, and decompress the image to the level of individual genes. The first step is to measure, in each color and each cycle, multiple transcripts together, simply by using probes with a single color that target multiple genes, amplified with hybridization chain reaction. This shows nine composite images of 37 genes in a mouse cortex. If you are familiar with patterns in a mouse cortex, this looks like no pattern you would expect to see, because the signal from all of the genes is combined together in each composite. Then we decompress these images. In a more traditional approach to this kind of data, we first segment the cells and then decompress the expression per cell. But we also developed a segmentation-free decompression based on an autoencoder, where we do the decompression from 10 to 37 channels in the encoded latent space and then decode the images; this spares us segmentation, which is a pretty annoying task. And when we decompress the images, we get very precise and accurate patterns, shown here for three genes.
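As a stand-in for that autoencoder-based decompression, here is a deliberately simple pixelwise version using a non-negative linear solve; the real method is learned end to end, so treat this only as the shape of the computation:

```python
import numpy as np
from scipy.optimize import nnls

# Sketch of segmentation-free image decompression (a linear stand-in for
# the autoencoder described above): treat every pixel's m composite-channel
# intensities as a compressed vector and map it to per-gene channels using
# the module dictionary U and composition design A.

def decompress_pixelwise(img, A, U):
    """img: (H, W, m) composite image; A: (m, g) design; U: (g, k) modules."""
    H, W, m = img.shape
    AU = A @ U                               # (m, k): channels -> modules
    out = np.zeros((H, W, U.shape[0]))
    for i in range(H):
        for j in range(W):
            w, _ = nnls(AU, img[i, j])       # module activities per pixel
            out[i, j] = U @ w                # per-gene intensities
    return out
```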
We compared the decompressed patterns to the Allen Brain Atlas, and also to individual measurements of the genes in the same section at the end of the process, which is on the right: the directly measured genes are in red, what we recover from composites is in green, and where they overlap pixel to pixel it is yellow, and you can see that most things are yellow. Most recently, we increased the scale: the biggest experiment we have run is 20x images of almost 500,000 cells over 180 square millimeters, a series of 12 bisected coronal sections of the brain, with 37 genes in 11 compositions, a few genes per composition, and they span known cell type markers, immediate early genes, and 27 genes that were model-driven choices. And we can beautifully decompress the genes, which we see on the right, both for cell type markers and for immediate early genes, and also get to the level of labeling individual clusters of cells in the tissue at very high resolution.

So to summarize this part: because gene expression is so structured, we can use compressed sensing by measuring random linear combinations of genes. This does require shared regulation; in cases like cis-eQTLs, where a gene is very specifically and separately regulated, it will not work, but fortunately a lot of gene regulation is shared in these nice modular structures. Because of this, we can decompress from 200-fold compression with 5% training data, we can apply it with blind compressed sensing and no training at all, and we can take it down an experimental path, which we have now shown with an imaging method we call CISI, for composite in situ imaging.

Okay. Finally, I know I am a little short on time, but I am going to take a few minutes for the third and last part of my talk and turn back to the problem of regulatory modules: genes that control the expression of other genes, and the fact that the impact of perturbing multiple genes together might be non-additive. I should point out that, again, it is a big problem, and it is one of the ones I find most bothersome in biology at the moment, because it is a real barrier between us and prediction. I will also say upfront that this is still work in progress. It was started by Atray Dixit when he was a grad student in the lab and Oren Parnas when he was a postdoc in the lab; Oren now has his own lab at the Hebrew University, and Atray has a company. More recently it was continued by Brian Cleary and by Katie Geiger-Schuller, a postdoc in the lab.

First, a quick reminder that to assess the function of one gene at a time, we use pooled genetic screens. If we wanted to know what affects the level of gene X, we would deliver a set of barcoded perturbations to cells, today typically CRISPR perturbations; each cell receives mostly one, we grow the cells, at the end point we sort them by the expression of gene X, a little like the GPRA experiment, we sequence the barcodes, and we look for enrichments. This is a fantastic design that has delivered amazing things in biology, but it does have some challenges. The first is that we have a very simple readout from every cell, just the level of the one gene X, so we have to choose X in advance, and then all the hits look the same: they affected X.
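Here is a minimal sketch of that basic one-readout analysis, testing each guide for enrichment in the high-expression sorted bin; the counts are invented and this is a toy version, not a published pipeline:

```python
import numpy as np
from scipy.stats import fisher_exact

# Sketch of the basic pooled-screen analysis: for each guide, compare its
# barcode counts in the high- vs low-expression sorted bins of gene X and
# test for enrichment.

def guide_enrichment(high_counts, low_counts):
    """Per-guide odds ratio and p-value of enrichment in the high bin."""
    high_total, low_total = high_counts.sum(), low_counts.sum()
    results = []
    for h, l in zip(high_counts, low_counts):
        table = [[h, high_total - h], [l, low_total - l]]
        odds, p = fisher_exact(table)
        results.append((odds, p))
    return results

# Toy usage: guide 0 is enriched among cells with high gene-X expression.
high = np.array([120, 30, 25, 28])
low = np.array([20, 33, 27, 30])
for i, (odds, p) in enumerate(guide_enrichment(high, low)):
    print(f"guide {i}: odds ratio {odds:.1f}, p = {p:.2e}")
```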
The second challenge is: what happens if we want to ask about the effect of more than one perturbation at a time? It may be that these four red genes all genetically interact with respect to the level of the yellow genes, but not with respect to the level of the blue genes, or have a different style of genetic interaction with respect to the blue genes. So the readout actually restricts your answer, since you are measuring just one gene at a time.

A few years ago we started tackling this by developing Perturb-seq, which is a pooled CRISPR screen design with single-cell RNA-seq as the readout. We deliver interventions to cells as barcoded guide RNAs, and after the experiment we profile by single-cell RNA-seq, which recovers which cell had which intervention, as well as the profile of the cell. And because we are just delivering interventions, we can do it at a high multiplicity of infection and get multiple perturbations per cell if we want to study genetic interactions. One of our first applications was in dendritic cells, which is the result you see here. I don't want to say anything about the dendritic cells specifically in this result; I want you to pay attention to the structure of this matrix. This is what we call a regulation matrix: you have the perturbations, or the guides, in the columns and the affected genes in the rows, and you can see the effect of each guide on the expression of each gene. You can see that the genes group into co-regulated programs, genes affected in the same way by the different perturbations, and the perturbations group into co-functional modules, perturbations that have a similar functional effect. This is the structure we are going to use in order to understand genetic interactions.

First, let me skip these couple of slides, which just show that you can assess genetic interactions when you measure them explicitly. Actually, you know what, I'll go back; I shouldn't have skipped this. Based on single perturbations, we can make individual predictions on the effect of each perturbation on each regulatory program. For example, Rela and Nfkb1 each affect this P5 program, but from this alone we cannot predict what their joint effect would be. We can use Perturb-seq to assess this, because we can have multiple perturbations per cell: we just go to a higher multiplicity of infection and then learn a model with interaction terms from the cells that carry multiple perturbations. Then we can categorize each of the pairs by the extent of interaction in each category: are they buffering, synergistic, dominant, additive, or without any effect at all? When we do this, we find that different pairs have a different extent of interaction in each of the categories. So that just shows it is technically feasible to assess genetic interactions experimentally with these assays.

What is important for us is that it is a scalable assay. Our analysis of these first screens, which were saturated for cells and for depth, showed that for state or signature levels of these programs, it is enough to have as few as 10 to 30 cells per perturbation and a few hundred reads. It turned out to be a great method for many different kinds of screens, but it is not going to solve a combinatorially exploding problem.
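The interaction model just described can be sketched very simply: indicators for each perturbation plus a product term, with the product coefficient as the interaction estimate. A toy version on simulated cells, not the published analysis:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Sketch of an interaction model: regress a program's score on indicators
# for each perturbation plus a product term; the product coefficient is the
# genetic-interaction estimate.

rng = np.random.default_rng(5)
n_cells = 2_000
a = rng.integers(0, 2, n_cells)           # cell carries perturbation A?
b = rng.integers(0, 2, n_cells)           # cell carries perturbation B?

# Simulated program score with a buffering interaction (negative term).
program = 1.0 * a + 0.8 * b - 0.6 * (a * b) + rng.normal(0, 0.5, n_cells)

X = np.column_stack([a, b, a * b])
coef = LinearRegression().fit(X, program).coef_
print(f"effect A {coef[0]:.2f}, effect B {coef[1]:.2f}, interaction {coef[2]:.2f}")
# interaction ~ 0 -> additive; < 0 -> buffering; > 0 -> synergistic
```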
And so for that, we need to use the structure of the co-regulated programs and the co-functional modules, by moving away from measuring individual perturbations, or from measuring individual genes, or from both. Instead of measuring individual genes, we can move to composite genes; I already showed you that with compressed sensing. I will close by showing you that, instead of assessing each perturbation on its own, we can assess composite perturbations. You can imagine several ways of doing this: you can profile little groups of cells, or you can have more perturbations within one cell and try to distinguish their individual from their combined effects. Focusing on the first concept, we used droplets to overload cells, so that we profile the sum of their expression profiles. As a test experiment, we perturbed each of 600 genes in an inflammatory response, either in the traditional way, one cell at a time, or in this compressed way. Then we used a recently published compressed factorization algorithm to learn which modules of co-functional perturbations regulate which modules of co-regulated genes. The first sanity check is that the decompressed expression is 97% correlated between the traditional and the compressed experiment. Then we look at the impact of five known major positive and negative regulators of the pathway on four major known targets, and those are also consistent. Then we extend to the perturbed genes with the top 15 effects, and they are still consistent. And even as we go further down the ranking, although the signal gradually degrades, it actually works.

I am going to stop here and skip the very last few slides, which were on human genetics, and summarize this last part. I showed you Perturb-seq, a method for massively parallel pooled screens with single-cell RNA-seq as the readout. We have a nice linear model to infer the effects of individual genes, and we can add nonlinear interaction terms if we have combinations of perturbations. The responding genes fall into co-regulated modules and the perturbed genes into co-functional modules, and the critical thing is that we believe we can leverage this structure for compressed screens. I showed you a first, early proof-of-concept demonstration with composite perturbations, and we have ongoing work to use co-regulation with composite genes as the readout, to use the sparsity of the interactions in order to assess multiple perturbations per cell, and to use the structure in order to better infer the unobserved interactions.

With this, I will finish. I tried to give you some motivation to go back to the lab and try to use inference as a principle for experimental design. I showed you how we used this to do experiments with random sequences in gene regulation, to develop composite genes as a readout method for expression profiles, and the path we are on for genetic interactions. I did not talk about DNA microscopy today, but it is a beautiful example of math driving experiment, pioneered and led by Josh Weinstein. And I have highlighted all the people who did the work: in particular Carl de Boer for GPRA, Eeshit and Carl for the work on design and evolution, Brian Cleary for compression and composite transcriptomics, and Atray, Oren, Brian, and Katie in the context of Perturb-seq and composite perturbation screens. Thank you very much.

Thank you very much for this wonderful talk. We now have time for questions.
We send a round of virtual applause to you for this talk. Thank you. So we have two modes of asking questions: from inside the network and from YouTube through the Slido app. From inside the network, is there a question? I would start, Aviv, with one question. In these randomization experiments, in general, how important is context? For example, cellular context, in the first part of what you showed.

Oh yeah, it's a great question. It matters, but only to some extent. So we actually did the experiment in three different conditions: glucose, galactose, and glycerol. For some transcription factors and, excuse me, some chromatin potentiators, the activity is indeed context dependent, and you see that quite substantially. I actually showed the example that Gal4 only acts as a potentiator, a chromatin opener, in galactose but not in glucose or glycerol, which is the correct biology for it. You can use the model from one condition and make predictions on data from another condition. If you take the high-quality data, either the native yeast DNA or the 10,000 sequences we measured at high quality, say measured in glucose, and you take a model learned on galactose, you are going to do a pretty good job, but a little less good than if you used data where the sequences were just as random but the measurement was done in the same biological condition. And that is what we would expect to see. This is, though, a yeast experiment, and yeast is an organism that expresses the vast majority of its genome, or rather of its transcriptome, all the time; very few genes are completely shut off in a given condition, which is quite particular to yeast. It is more about the level of expression than about the act of being expressed at all. We expect this to be quite different, and much more context sensitive, in human or any mammalian system, where you have a very rich set of developmental programs. Context will very probably matter a lot more, because chromatin plays a much bigger role in organizing which gene programs are accessible for regulation at all. But we don't know yet, because we don't have the lab assays yet to fully implement this for human; that is work in progress.

Okay, very exciting. Thank you. There are a number of questions on Slido, a number of short questions; I have tried to cluster them somehow. Maybe the meta-question here is: in this compressed sensing, can you elaborate on why it is so much better than using more classic methods for finding structure in expression data, like biclustering? Someone also mentioned PCA.

Yeah, let me clarify: the goal is not to find the structure. You could use many different models; that's why I said we tried SVD and we tried others. The goal is to use the structure in order to do less measurement in any given experiment, and whenever we do less in any given experiment, it means we can do more experiments; it's not that we actually do less overall. So the goal is to use the fact that you know structure exists, even if you don't know what the structure is at all, in order to measure less.
In a compressed expression experiment, instead of measuring 10,000 proteins, which technically speaking we just cannot do, we cannot perform a 10,000-plex protein assay even if we had 10,000 antibodies, we could do a 100-channel experiment. If we have 100 measurement channels, how can we use them best? You use the fact that you know structure exists in the world, even if you don't know how it is instantiated, in order to do measurements like these. And what you need is a decompression algorithm; a biclustering algorithm doesn't decompress the data. However, if you have training data, you can use biclustering; biclustering is one of the ways to develop a model of the world. Why did we choose specifically the sparse module activity algorithm? Because we wanted to induce sparsity in the solution, and it is sparsity at two levels. It is in the modular structure itself, so that a module can be made of only so many genes and a gene can belong to only so many modules. And it is sparsity in the activity of the modules: rather than saying every cell can draw from all the modules at some level, we impose a constraint that every cell can only have so many modules participating, and any module participates in only so many cells. That gives you a much stronger modular structure, and that modularity then serves you really well when you try to decompress the data.

Thank you very much; I think that answers a whole bunch of the questions here on Slido. There are two more, longer ones. In the perturbation screens, how do you discriminate between direct versus indirect effects, cause or consequence, or a random correlation?

Yeah, so there are several questions actually packed together here; I'll start with the last one, which asked about causality. To me, that is why biology is so cool to work on from a computational perspective, because the vast majority of problems that we solve with inference, we solve from observational data rather than interventional data, and as a result we are always left scratching our heads at the end, saying, well, is it correlation or causation? There is all sorts of structure, and we can do all the causal inference we like, but we don't actually know. In biology, when we use genetics, we know the causal structure not because of the inference but because we did genetic perturbations: you know that you are intervening on the causal side of the problem. That's the beauty of the central dogma. Now, there are all sorts of ways in which the central dogma is violated, but it is not violated at the time scales of these kinds of experiments, under these kinds of interventions. So of course you have to worry about noise and about randomness, but you only have to worry about them to the extent of a statistical worry, rather than a foundational one. If you are sure enough, and you do it repeatedly and have enough power, then what you are detecting is not just an association. That is the beauty of genetics, going back to Mendel; we didn't come up with it, somebody else did, and honestly evolution came up with it, not us. That's the end of the question, but the beginning of the question is, I think, just as critical for the long-term understanding of biology, and that is the distinction between direct and indirect effects.
The honest truth is that in everything I showed you, we simply don't distinguish between them. But that does not mean one cannot start working on the problem of distinguishing them. The things I highlighted around multiplicity of perturbations matter here: doing multiple perturbations is one of the best ways to start teasing apart the ordering, who regulates in a more general way and who regulates in a more specific way, and working out the epistatic relationships. It just requires more experiments than we have done, and additional analysis on top of what we did. It often also requires temporal data, measuring multiple time points. And, fortunately or unfortunately, if you like difficult problems, fortunately, and if you would like them solved already, then unfortunately, one can show, and others have done very serious work here, that there are classes of solutions that are all equally likely given most of the data we measure today. So from these genetic screens you can end up with a lot of solutions that are equivalently good. How could you improve on that? Part of it is what I already said: you can use multiple perturbations for ordering, and you can use time courses for ordering, but you can also use mechanistic data. Having information about who has the physical capacity to impact whom, for example which transcription factors bind where, or which signaling molecules interact with each other, can introduce additional constraints in the modeling. So data alone is not causal; for causality you always have to have interventions, and interventions mean genetics, genetics writ large; it can be genetic and it can be chemical, but you have to intervene.

Thank you. If there are further questions, this would now be the perfect time. Then I will conclude with one question; in fact, a question I asked Barbara Treutlein earlier this morning, and I think you are also a perfect person to ask, as co-chair of the Human Cell Atlas. We have this big data in single-cell genomics, in terms of the cells and the expression levels we measure. On the other hand, we have these big biobanks, where the big N is the number of patients, and so far these are two separate worlds. I know people are interested in the intersection; there was this LifeTime initiative presented in Nature one or two weeks ago. When do you see these two fields meeting, or do you see them meeting at all? Is there a future with biobank-scale single-cell data coming up?

So it's kind of funny, because that's the one part I skipped, and I don't think you knew that. For me, moving to my new role, this is one of the reasons I moved. I actually think these two worlds are tightly and intimately connected; these are all views of just one biology. We slice and dice it in different ways, because patient medical records and how we treat clinical samples and all of those things tend to get separated into silos, but at the end of the day it is still one human, one patient, and everything is related to everything else. So I am going to use this slide, I am not sure I have a perfect one for this, to show how these two worlds meet. Imagine that you think of something like the UK Biobank and polygenic risk scores; I think that's a great example.
You can take the UK Biobank and polygenic risk scores: there are all these clinical features from the medical records, there is the human genotyping, and you do your association studies. This cartoon shows cases and controls, but it would not have to be case-control, it could be a cohort study. You end up with associations. If you did this in a study like this one, which is from ulcerative colitis, you would have 150,000 SNPs, and the number of pairs is 24 billion; and in humans it is not just pairs of mutations, of course, it is even worse than that. So you very quickly exceed the number of humans on the planet, not just the ones alive today. And that makes it very difficult to think about polygenic risk scores in any way that goes beyond the basic one, which is to aggregate all of the signal roughly linearly. I am putting this very crudely and being very unfair to the field, only in order to make the point that it saturates.

But imagine that you also have information from single-cell analysis, and it does not have to be from the same individuals. In this case, for example, imagine that you also have single-cell analysis from ulcerative colitis patients, which in fact my lab studied and profiled; there were maybe 30 individuals in a study like that, versus 500,000 people in the UK Biobank, not even close in scale. But that gives you structure. For each of these variants, if you can associate it with a gene, you can now say: these are the cell types in which it participates, these are the programs it goes with, this is how it changes in disease, in its proportion, and so on. And what we have shown before is that these genes are organized into modules within cell types, just like the programs I showed for the genetic screens in cells. So you no longer have to think about every pairwise interaction. Sorry, this slide was to make all of those points. You can think only about interactions that basically reflect the types of cells, which you learn from the variation across cell types, or about interactions you learn from the covariation within a cell type. And you can now start to think about things like genetic interactions between the SNPs by considering interactions within modules and between modules, where you aggregate signal from each module. And this is not hypothetical; the cartoon reflects work that we actually do. So that's one great example of how these two worlds meet, and they meet very intimately.

And there are many others. Another example is that you have many physiological features measured on patients in the medical records; you can look at their tissue sections and H&E stains, which would be very prevalent for cancer patients; you can have imaging like their MRIs and their CTs; and then you have this rich molecular information that we will increasingly be measuring for very large numbers of individuals. Now you can finally tie these things together. And again, it's a prediction problem: how do these different layers of biology, these different convolutions of biology, relate to each other? So I think that's what's coming. That's why it's so exciting to try and work on clinical problems these days: we can finally do human biology in humans.
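The module-aggregation idea in this answer can be sketched as follows; the cohort sizes and the SNP-to-module map are invented for illustration, and the 150,000-SNP figure is the one quoted above:

```python
import numpy as np

# Sketch of module aggregation: instead of testing all pairwise SNP x SNP
# interactions, aggregate SNP dosages into gene-module scores and test the
# far smaller set of module x module interaction terms.

# At biobank scale the pairwise SNP space is hopeless:
n_snps, n_modules = 150_000, 50
print(f"SNP pairs:    {n_snps * (n_snps - 1) // 2:.3e}")    # ~1.1e10 unordered pairs
print(f"module pairs: {n_modules * (n_modules - 1) // 2}")  # 1,225

# Toy aggregation on a small cohort:
rng = np.random.default_rng(6)
people, snps = 1_000, 10_000
genotypes = rng.integers(0, 3, (people, snps)).astype(float)  # 0/1/2 dosages
snp_module = rng.integers(0, n_modules, snps)                 # SNP -> module map
module_scores = np.stack(
    [genotypes[:, snp_module == m].mean(axis=1) for m in range(n_modules)],
    axis=1,
)  # (people, 50): the features whose pairwise interactions one would test
```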
We don't necessarily have to work on a model organism at all. Sometimes we do, but a lot of the time we don't; evolution generated plenty of perturbations for us to figure out. I think I'll close with that.

Yes, that's exciting, and a perfect ending for our summer school. Thank you so much for joining. Thank you; everyone enjoyed this very much. This was wonderful. Thank you. I see people applauding virtually. Thank you so much. Stay well and healthy, everyone. Same to you and everyone else watching on Zoom or on YouTube. Thank you.