is curation and visualization of ENCODE data with Quilt. Hi, thanks for having me. So I'm here to talk to you about crowd computing, and I believe in this paradigm so strongly that I created a local startup to pursue the idea. In order to understand why crowd computing is valuable, it's important to look at the rather dysfunctional model that we have around data and data sharing in genomics today, which I call divisive computing.

The idea here is that we have a flat file repository: ENCODE comes to mind, NCBI is another example. We have multiple users pulling from that repository, copying the data, pulling it behind the firewall, running it through ETL, extract, transform, load, which is essentially your data prep process, doing a little bit of analysis, and then doing very little sharing or reuse. If we think about how this actually works, it's a case study in how not to do things: we have the largest number of people doing the lowest-value tasks, and again, very little sharing and reuse.

So it's easy to imagine a different world. That was the divisive computing world; now picture a crowd computing model. The idea here is that we match both computational and human bandwidth to the value of the task. You do your extract, transform, load, your data prep, once, and then you use as many brains as possible to do your analysis, with high sharing and reuse. This is what science is all about, standing on the shoulders of giants. But if you can't see those shoulders, or those shoulders are caught in a two-year pre-publication review cycle, you really don't have anything to stand on.

Now, instead of pulling from a flat file repository, this model of computation involves pushing into some combination of a database and a data lake. This pooled computing resource not only allows for network effects among people, it also lets people transform the data, analyze it, track where every atom of data comes from, which we call provenance, and control permissions, which is who can see what.

This looks great on paper, but how many people have seen this movie before? There have been so many efforts to get biologists to collaborate and come into a single database, and it's not happening. So I want to talk about what I think the shortcomings of that otherwise excellent work have been, and then where we're going to go with it.

So, the barriers to crowd computing. Scientists, very frankly, I think have a schizophrenic attitude toward data sharing. Before publication, we don't want anybody to see what we're working on. After publication, we want everybody to see it, but not really, because we don't show you the data intermediates, the data that we used to derive the published results. Paywalls, we're all running into them. I mean, how useful for the public good is an Excel file behind a paywall on a PubMed server? How many people are really going to use that data? We see very high fragmentation and low findability. I don't want to name any names, NCBI, but there are lots of places where people tend to push their data with no expectation that any other human being will ever pull it back out. And of course, high technical barriers.
So really, few labs are blessed with a bioinformatician or a data scientist, but we want to make it possible for bench scientists to do their own analysis.

These shortcomings have compensatory solutions, which is what we're working on right now. We're piloting this program, which I'm going to describe in just a second, with researchers at the National Cancer Institute. The first and most important thing, which I think most projects have missed, is that there has to be a strong social layer. If people are going to contribute, they have to be able to get social credit and social proof for the work they're doing. How many people are familiar with GitHub? Right, okay, great. So your currency on GitHub is the stars and forks that you get, and that's the new model, the new world we're moving into: scientists can publish data, get stars, get forks, and have a way of measuring the impact of the data they're publishing. This data management layer lets us hash their signature data sets, so that we can unambiguously know who was the first to publish them, and it lets us perform diffing, which some of you will be familiar with. Unified storage and schemas is what we showed in the last diagram. And then we have to lower the technical barriers, which in short means making it possible for anybody to visualize and analyze without coding.

So this is the experiment that we ran. Essentially, we wanted to see if bench scientists could create their own targeted CRISPR screens, and I want to talk to you about how we used ENCODE data to develop custom guide RNA design libraries and then let people at the National Cancer Institute and beyond use them.

The first step was creating a genome-wide CRISPR library. There are libraries out there from the Moffat lab, from the Zhang lab, from the Doudna lab, but how do you know which one to use? And frankly, if you want to target epigenomic elements, you're out of luck with any of those libraries. So the first thing we did was find NGGs, the SpCas9 PAM sites, throughout all of hg19 (a small code sketch of that scan follows below). We then intersected those with DNase hypersensitive sites from ENCODE and used Bowtie 2 to filter for off-target effects. What you get out of this is a genome-wide, intergenic guide RNA library. As far as we know, it's the first publicly available library of its kind, and, the slides will be available to you, it's available on Quilt today.

So how do you use it? Again, we don't want people working with duplicative files, so we pulled all of the histone peaks, the chromatin marks, from the ENCODE project and made them available in a tag interface. You can check this out right now on quiltdata.com; people can surf tags the same way you do on Twitter or Facebook. How many people use bedtools? Great, so we put a GUI on top of that, so people can now intersect and subtract genomic intervals. So the idea here: let's say that you want to find Nanog enhancers, you want to target, or CRISPR, Nanog enhancers in embryonic stem cells. A very quick example here and a quick review: we want two activating chromatin marks, a monomethylation and an acetylation. The plus indicates genomic interval intersection and the minus is subtraction. So we're looking for regions that have the monomethylation and the acetylation, but no trimethylation.
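To make that library-construction step concrete, here's a minimal sketch of the PAM-site scan in Python, assuming a plain chromosome sequence as input; the reverse strand, the ENCODE DNase intersection, and the Bowtie 2 off-target filtering are all left out, and every name in it is illustrative rather than the actual pipeline:

    import re

    def find_spcas9_guides(chrom, sequence, protospacer_len=20):
        """Yield (chrom, start, end, protospacer) for every NGG PAM on the plus strand."""
        sequence = sequence.upper()
        # A lookahead finds overlapping NGG sites; the 20 bases immediately
        # 5' of the PAM form the candidate protospacer. The minus strand,
        # the DNase-site intersection, and the Bowtie 2 off-target filter
        # described in the talk are all omitted from this sketch.
        for match in re.finditer(r"(?=([ACGT]GG))", sequence):
            pam_start = match.start()
            start = pam_start - protospacer_len
            if start >= 0:
                yield chrom, start, pam_start, sequence[start:pam_start]

    # Toy example; a real run would stream hg19 chromosome by chromosome and
    # write BED intervals for the downstream intersection steps.
    for chrom, start, end, spacer in find_spcas9_guides("chr1", "ACGTACGGTTTTCCCCGGGAAAATGG"):
        print(chrom, start, end, spacer, sep="\t")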
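And the interval algebra just described might look like this, a minimal sketch using pybedtools, a Python wrapper around the same bedtools operations the GUI exposes; the peak file names are hypothetical placeholders, and narrowing the result to the Nanog locus is not shown:

    from pybedtools import BedTool  # assumes pybedtools and bedtools are installed

    # Hypothetical ENCODE peak files: two activating marks (a monomethylation
    # and an acetylation) and the trimethylation mark to exclude.
    me1 = BedTool("monomethylation_peaks.bed")
    ac = BedTool("acetylation_peaks.bed")
    me3 = BedTool("trimethylation_peaks.bed")

    # "+" in the talk's notation is interval intersection, "-" is subtraction:
    # keep regions carrying both activating marks, then drop any region that
    # also carries the trimethylation mark.
    enhancer_candidates = me1.intersect(ac, u=True).subtract(me3, A=True)
    enhancer_candidates.saveas("nanog_enhancer_candidates.bed")
    print(enhancer_candidates.count(), "candidate enhancer intervals")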
And again, using a very friendly user interface, you can now find targets for Nanog enhancers. Okay, so now what? Again, this is an example of the interface that our scientists are using. To actually get guide RNAs out the other end, you take these Nanog targets that you found and intersect them with the genome-wide CRISPR library that we talked about earlier (a short code sketch of this step follows at the end). What you get out of that are guide RNAs that you can use to target your enhancers of interest.

So I think what is significant about this, first of all, is that we're letting people take data that already exists out there, or create novel combinations of it, and publish that, and to visualize and analyze it without coding. The next piece is the pull request for data part: after running experiments with these guide RNAs, the scientists can round-trip the results to the community and report which guide RNAs were, in fact, most successful experimentally.

So for public projects we can offer you free storage, compute, and visualization. We're especially interested in working with people who are in CRISPR or variant interpretation. I'd love to answer your questions, so stay in touch and check out what we're doing. Thanks.
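As a closing sketch, continuing the pybedtools example above under the same assumptions, the final step, pulling guide RNAs out of the genome-wide library for the enhancer targets you found, might look like this; the file names remain hypothetical:

    from pybedtools import BedTool

    # Hypothetical files: the genome-wide, intergenic guide library built
    # earlier and the enhancer candidates saved by the previous sketch.
    library = BedTool("genome_wide_intergenic_guides_hg19.bed")
    enhancers = BedTool("nanog_enhancer_candidates.bed")

    # Keep every guide whose interval falls inside a candidate enhancer;
    # these are the guide RNAs you would take to the bench.
    guides = library.intersect(enhancers, u=True)
    guides.saveas("nanog_enhancer_guides.bed")
    print(guides.count(), "guide RNAs targeting the selected enhancers")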