way of addressing some of the concerns and opportunities that we have across the NIH data sets, and international data sets as well. And that's really this central server model — multiple central servers is probably a more appropriate way of saying it. To begin, I just wanted to talk through what we're trying to accomplish. David talked through this, so I've sort of just stolen some of David's slides, although he added a bullet point on his newest version, so this is two of three. But I think the summary in the box is really the key piece: we want to make it possible for researchers, defined very broadly — not just geneticists, but biologists, clinicians, statisticians — to answer scientific questions about the relationships between DNA variation, inherited or potentially somatic in the cancer case, and human phenotypes.

There are a variety of things we're going to need to overcome that stop us from really achieving this today. One obvious question is, why doesn't this just work already? And I think there are a couple of fundamental barriers. One key barrier is that the disease data has been largely siloed. It's been analyzed as single samples, or single sets of samples for a single disease, and there have not traditionally been any routes to access that data. There are very complicated technical confounders that we need to deal with across the data, so it's not as simple as me sharing my p-values or per-site statistics without an enormous amount of quality control going on. And if you really want a distributed model, you have a huge educational burden, or certainly an analytic challenge, of distributing methods that can do that. That has been technically limiting our ability to join datasets. There's also been, historically, a lack of rigorous analytic methods instantiated in reliable, easy-to-use, scalable software. And that comes down to two things. It's been traditionally difficult for methods developers and analysts to access data — people who develop a very sophisticated method can't just get the data, run it, and give you the results. Conversely, many of the data holders find it very difficult to get the methods and run them. So this is horribly disempowering from both perspectives. And then in general, when you do get the methods and try to run them at scale, you have enormous scaling problems that should not be minimized.

But I think one useful thing to say is that today these barriers are largely gone. To some degree that's why we're discussing this: we actually have an opportunity to take a unified approach, a systematic way of looking at all the data together, where we think those barriers are no longer limiting. And there are a variety of reasons for this. One is that next-gen sequencing is a very different data type. It's very different from traditional candidate gene studies or even genome-wide association studies. It's an intrinsically rich data type. It's not intensity data from some probe that may or may not be well described; it's an actual read that you can align to the reference genome yourself, and you can analyze it statistically because there are many reads per site.
You can learn a lot about the data from the data itself. That's a very important property that isn't transparent in genotype data as it was traditionally collected. And we also have enormous amounts of it. Because sequencing is so cheap, we have 50,000 or 100,000 samples today, which we didn't have five years ago. Merging the data five years ago would not have yielded such a huge bounty of presumed scientific results, but today we sit in a situation where, if we could just get it together and analyze it together, we all agree we should learn a lot more than we're learning individually.

Part of the data access problem — I know people complain about the data access challenges — has largely been solved by dbGaP. There's a single place that has most of the data, at least for data funded through the NIH, and there are similar places internationally. It could be worse: it could be sitting on individual hard drives of individual researchers with absolutely no ability to get at the data. At least now we have a mechanism, maybe complicated, maybe arduous, to go to a single source and download the data once we get approval. So we have this opportunity: you can imagine getting the data for all of the samples that were sequenced through the NIH.

And we have a newer generation of analytic tools, which I would say are significantly more powerful and more general than previous generations. It's clear today that we can integrate and harmonize data across multiple sequencing platforms and multiple different sites. The 1000 Genomes Project, for instance, has three years of sequencing data from many different platforms and many different centers — six or seven centers depending on how you count, three or four different sequencing technologies, each of which went through four or five revisions within the period we were sequencing — and yet that project can deliver an enormously consistent data set across thousands of samples. So we know this is possible. And finally, to deal with the fact that this is a big data problem, we have tools now that run in a relatively automated fashion. In principle, if I had the data, I could run my tools at scale on hundreds of thousands of samples.

So again, if we could just bring it all together, it would be possible to harmonize, to create a single universal data set across all the samples. And then we could presumably analyze it, if we understood the constraints on the data use of that harmonized data. Obviously, I'm trying to sell an idea to people, and I'm going to describe how it might be implemented and what value it brings. Of course, it's not the only solution to the problem; it's one aspect. But I'm trying to be as concrete as possible, because I really think that today we could build such a thing — it's possible to do what I'm describing. So what does this central analysis server really provide? It provides a single aggregation of all of the data in one place. And the data we're aggregating is genetic data and some phenotype data — not identifiable data, but data that would be useful for disease studies.
It needs to provide a state-of-the-art computational environment and analysis tools that are equivalent to, if not best in class, for every part of the pipeline, so that this is the place to go to analyze your data. The idea would be that it would be hard to do better, because it would be the best available, in an open environment — it would just do the best thing. And it could manage security, data use policies, and individual access, so that you knew for certain the data was used in a way consistent with the consent under which it was collected. We wouldn't have to have an honor system; the system could actually enforce that. And I think it is technically possible to build such a system. It's complicated, don't get me wrong, but it is technically possible.

So how would you actually do this? We have a very simple model. You have many different data sources that act as primary data repositories; they're already doing a very good job of putting the data in one place, cataloging it, storing it, archiving it. There are many such places: international ones, dbGaP, individuals who might do sequencing. And then you have this thing that associates with them, the central analysis server, which basically has three pieces. One piece is some sort of secure sample manager. It doesn't want to store the primary data; what it wants to store is pointers to the primary data — how do I get this data when I need it? It needs to manage data at multiple levels, not just the pointers to the raw data, but every derived data type from it: you have raw data, you have variants, you have summarized data, you have counts of the number of nonsense mutations across the whole cohort. And that data needs to be managed at an atomic level, with sharing information attached at every one of those stages. And finally, you have some kind of cloud-like — although I think it doesn't really matter where it runs — next-generation data processing core. This runs apps, tools that do alignment, that do variant calling, that do association statistics at the raw read level. And it needs to be an extensible system, so that people can submit apps and run them on the data. In principle, you could do this.

The product of that thing, although on the slide it looks like it's raining down, is this quality-controlled set of variants. And the reason I think this is so important is that you really want a harmonized data product. I'll talk a little more about this, but what you really want is all the true positive variants across all the samples, and then, for every sample, the genotypes and the likelihoods of those genotypes for every variant known in the human population at every one of those sites — a complete matrix of sites by samples. That would be the product, the raw data set, from which you could slice individual pieces to do whatever analysis you wanted, with some sort of extensible system like the iPhone App Store. It could do association. It could do digestion, and share that data in a very clean way with people who want to know what variants are known in a genome. You could do visualization. You could compute ancestry. You could do personal genomics. You could imagine any number of analysis apps running on top of that product.
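To make the shape of that harmonized product concrete, here is a minimal sketch in Python of what the sample manager's pointer records and the sites-by-samples genotype-likelihood matrix might look like. The class names, fields, and values are illustrative assumptions, not an actual design.

```python
from dataclasses import dataclass, field

@dataclass
class SampleRecord:
    """Pointer to primary data held elsewhere, plus locations of derived data."""
    sample_id: str
    raw_data_uri: str                             # e.g. an accession in a primary repository; never the reads themselves
    derived: dict = field(default_factory=dict)   # level -> URI (variants, summaries, cohort counts, ...)
    data_use: set = field(default_factory=set)    # consent / use codes attached to this sample

@dataclass
class SiteBySampleMatrix:
    """The harmonized product: every known site x every sample, with genotype likelihoods."""
    sites: list                  # all true-positive variant sites across the whole collection
    samples: list                # all sample IDs
    genotype_likelihoods: list   # [i][j] = (P(hom-ref), P(het), P(hom-alt)) for site i, sample j

# A toy instance: two sites, two samples, every cell filled in (a complete matrix).
matrix = SiteBySampleMatrix(
    sites=["chr1:12345:A>G", "chr2:67890:C>T"],
    samples=["NA12878", "NA12891"],
    genotype_likelihoods=[
        [(0.98, 0.02, 0.00), (0.01, 0.97, 0.02)],
        [(0.90, 0.09, 0.01), (0.99, 0.01, 0.00)],
    ],
)
print(len(matrix.sites), "sites x", len(matrix.samples), "samples")
```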
And so the server's goal isn't, in some sense, to be the single place to do everything. It's just an infrastructure to let us instantiate what we all want to collectively do, and to enforce the restrictions we have agreed to for the data.

So, things I think are important to emphasize about the opportunities for such a server. One is that the problem isn't the apps — we have lots of apps — and the problem isn't the data — we have lots of data. The problem is that we don't have a platform infrastructure to put these things together. So we need something we're calling a platform that coordinates sample management: it knows where data lives, it knows where phenotypes are, it knows which phenotypes are associated with which samples. It needs to understand and enforce data use policies for every sample and every user — because it's not just samples; individuals have the right to use samples for certain purposes, and the system needs to represent what that means in a way a computer can enforce. And it needs to be able to execute applications that do all the types of common analyses, data processing, and quality control we all do, at the scale of hundreds of thousands of samples.

One thing that I think is important and not widely appreciated when you start to think about this is that you want it to be continuously updating. It can't be a static thing where today I implement the server, press go, and now I have a great dataset that I can share as some constant, static entity. The thing needs to live and breathe, pulling in samples all the time. If I sequence a sample today, I want it uploaded into this system tomorrow, and I want it integrated with all the data so that I can understand the relationship of that sample to all my other data as soon as possible. It needs to be continuously managing that, and it's also not acceptable to have old methods, or samples that were run with different versions of the methods; you need it to be running best-in-class tools all the time, continuously.

And finally, I think the other part is that you really need applications for variation discovery in hundreds of thousands of samples. This is a technical piece you can think of as a massive evolution of the 1000 Genomes approach. What you want is a harmonized, error-corrected list of polymorphic sites, and then, given all those sites, the genotypes of every sample at every one of them. The key to doing this in a single universal place — and it's not that you couldn't in principle do it in some sort of distributed system where everyone shakes hands and agrees how to handle the sites — is that you gain so much power by looking at the statistics of sites across hundreds of thousands of samples that you can sort out very quickly which sites are common errors and which sites are real but rare. So this is a way to ensure that we have the highest quality data, by combining information across as many samples as possible. And who wants to spend their time, for instance, analyzing loss-of-function mutations that are completely dominated by errors? The only way I know to get rid of those errors is to look at the properties of sites across many samples.
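As a rough illustration of why pooling site-level statistics across many samples helps separate systematic errors from genuinely rare variation, here is a small Python sketch. The metrics (allele balance, strand bias) and the thresholds are invented for the example; they are not the actual filters of any particular pipeline.

```python
def classify_site(site_stats):
    """
    Flag a candidate site as a likely artifact or a likely real (possibly rare) variant,
    using statistics aggregated across all samples in which it was observed.
    `site_stats` holds toy aggregate metrics; the thresholds are illustrative only.
    """
    n_carriers  = site_stats["n_carriers"]            # samples carrying the non-reference allele
    mean_ab     = site_stats["mean_allele_balance"]   # fraction of reads supporting the alt allele in hets
    strand_bias = site_stats["strand_bias_p"]         # p-value of a strand-bias test, pooled over carriers

    # With hundreds of thousands of samples, even a very rare site has many carriers,
    # so these pooled metrics become informative; with a handful of samples they are not.
    if n_carriers >= 20 and (mean_ab < 0.25 or strand_bias < 1e-6):
        return "likely systematic error"
    if n_carriers >= 20:
        return "likely real variant (possibly rare)"
    return "insufficient carriers to decide"

print(classify_site({"n_carriers": 150, "mean_allele_balance": 0.18, "strand_bias_p": 1e-9}))
print(classify_site({"n_carriers": 150, "mean_allele_balance": 0.49, "strand_bias_p": 0.3}))
```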
And if you have rare sites, at 1% frequency or well below, and you want to see each one tens or maybe hundreds of times to decide whether it's an error, you need hundreds of thousands of samples to do that.

So, very interesting things have come out of our discussions of what it would take to put all the data in one place and feel confident that the data could be analyzed in a way that wasn't violating the data use agreements on the samples themselves. Just as a contrast: today, as we've described it, you basically sign a piece of paper, you download your data, and you kind of hope for the best. What we envision is a system that actually represents the data use explicitly, per sample. In the example on the right are these data use cards that you can imagine a normal system understanding; it's not totally unreasonable that you could represent what the data can be used for, per sample and per user, and that these policies could be enforced. For instance, the PI might be the only person able to look at a specific sample, but it could be easily merged with the NIMH common controls because you might have approval to do that. You can merge anything with 1000 Genomes because it's completely open. So each individual investigator sees a different subset of the sample space, based on their approvals and the data use policies of each sample. And it's possible to restrict the analysis simply by saying you have to meet one of two criteria: you have to be an approved user for the data, or your use has to be among the approved uses, which are enumerated and can be quite conservative. If a sample has a very conservative consent, you don't get to see it. If a sample is very open, you can freely use it. You can imagine some sort of button that pops up, like in Gmail now, that says "include these sample sets," and it would know who you are and what kinds of samples you're merging. If you try to merge a T2D data set with a schizophrenia data set, it says: this is incompatible, one of these data use policies would be violated, you can't do that.

And I think this is an important point that took us a little while to fully articulate: the goal of a central server isn't to be a mechanism for policy reform, whatever that would be. I think there's huge value in reforming data use and data access policies, but this is just a mechanism to enforce whatever those policies happen to be. If we evolve into a more open model, then it should enforce those rules; if we remain closed, or we find that in fact it's possible to identify people in a harmful way, you could enforce that overnight. You have a mechanism to stop people from doing the things we're uncomfortable with them doing.

And the other part, which I was alluding to in a comment earlier, is that there's sample usage, but there's also this totally orthogonal axis of the digestion of the data. We all agree that the rawest of the raw sequencing data is probably the most protected: whatever you said you would do with the sample, it has to live at that stage. So you're probably only approved to look at individual raw sequencing reads, and probably individual samples' variants with genotypes, if you're the approved investigator.
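A minimal sketch, in Python, of how those per-sample "data use cards" might be checked by software, per user and at merge time. The use codes, rules, user names, and sample IDs are made up for illustration, including the disease-specific restriction that makes the T2D/schizophrenia merge incompatible.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataUseCard:
    sample_id: str
    approved_users: frozenset   # users explicitly approved for this sample
    allowed_uses: frozenset     # e.g. {"open"} or {"SCZ-research-only"}

def can_access(card, user, intended_use):
    """A user may see a sample if they are an approved user OR the use is openly allowed."""
    return user in card.approved_users or intended_use in card.allowed_uses

def can_merge(cards, intended_use):
    """A merge is allowed only if every sample is either open or permits the intended use."""
    blocked = [c.sample_id for c in cards
               if intended_use not in c.allowed_uses and c.allowed_uses != frozenset({"open"})]
    return (len(blocked) == 0, blocked)

kg  = DataUseCard("1000G-NA12878", frozenset(), frozenset({"open"}))                      # fully open
scz = DataUseCard("SCZ-0001", frozenset({"pi_smith"}), frozenset({"SCZ-research-only"}))
t2d = DataUseCard("T2D-0042", frozenset({"pi_jones"}), frozenset({"T2D-research-only"}))

print(can_access(scz, "pi_smith", "SCZ-research-only"))   # True: the approved investigator
print(can_merge([kg, scz], "SCZ-research-only"))          # (True, []): open data merges with anything
print(can_merge([t2d, scz], "T2D-research-only"))         # (False, ['SCZ-0001']): incompatible uses
```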
But once you start to digest the data — say I just have a set of samples where the only data I have is a list of polymorphic sites with their frequencies — this might be more open, for example. Not all samples would be equally open, but some would be approved at that level. This is essentially the model of the Exome Variant Server, and the ESP data could be available at that level within this system. And finally, there are very high-level metrics that we all agree are in principle shareable, which is the summary data that lives in a publication. Just because it's inconveniently stored in a PDF doesn't change that it's clearly available to everyone — it's open access. The system could represent that, and it answers the question of what people could query: if you're a pharmaceutical company, you should have the right to look at that, just as you have the right to pay to look at the PDF that contains the table. It's just a mechanism for sharing the data. And this could all be integrated in one single process that enforced it at every stage and knew what we were doing.

So two side-by-side examples should clarify this. The goal of the server isn't to enforce individual-level access or to drive all samples to an open model; you want to maximally enable the sharing of samples consistent with their data use. A 1000 Genomes sample uploaded to the server would be freely available at any stage. You could look at the individual reads if you wanted; the server would basically allow you to do anything you wanted within the constraints of the applications that had been uploaded. And it could be freely merged with other data sets. If I want to know the frequency of a variant that I discovered in my Mendelian consortium set, I could just look up the frequencies and all the genotype likelihoods at that site in the 1000 Genomes data, for free. We all agree we can obviously do that — the samples are consented for it — it's just inconvenient, even today, to do.

But a single-use sample — for instance, a schizophrenia sample that's only approved to an individual investigator for non-commercial use — would be highly restricted. You could upload it to the server, it could come in, and you could benefit from all the analysis infrastructure. It would be jointly called with all the samples, so that the error modeling was empowered. But that sample would only be visible to the individuals approved to use it, which might be a single investigator. And they could pull in any data sets whose use was compatible: you could merge 1000 Genomes with that data set, because 1000 Genomes is freely available. But other investigators couldn't merge in your sample; they would be completely blind to it. And I think the important piece is that most samples really fall in the middle — many samples have some intermediate consent — and if we could just know which were which, we could analyze our individual, smaller sets of data in the presence of these massively empowering data sets.
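Putting the digestion tiers together with those two side-by-side examples, here is a sketch of how per-sample visibility might resolve for a fully open 1000 Genomes sample versus a restricted schizophrenia sample. The tier names, sample IDs, users, and policy values are hypothetical.

```python
from enum import IntEnum

class Tier(IntEnum):
    """Levels of digestion, from most protected to most shareable."""
    RAW_READS = 0          # individual-level sequencing reads
    GENOTYPES = 1          # per-sample variants and genotype likelihoods
    SITE_FREQUENCIES = 2   # polymorphic sites with cohort frequencies (EVS/ESP-style)
    PUBLISHED_SUMMARY = 3  # the kind of summary statistics that live in a paper

# Per-sample policy: who is individually approved, and the least-digested tier open to everyone else.
POLICY = {
    "1000G-NA12878": {"approved": set(),        "open_from": Tier.RAW_READS},          # fully open
    "SCZ-0001":      {"approved": {"pi_smith"}, "open_from": Tier.PUBLISHED_SUMMARY},  # highly restricted
}

def visible_tiers(sample_id, user):
    """Return the tiers of this sample that a given user may see."""
    policy = POLICY[sample_id]
    if user in policy["approved"]:
        return [t for t in Tier]                           # approved investigator sees every level
    return [t for t in Tier if t >= policy["open_from"]]   # others see only sufficiently digested data

print(visible_tiers("1000G-NA12878", "any_biologist"))   # everything, down to the raw reads
print(visible_tiers("SCZ-0001", "pi_smith"))             # everything: the approved investigator
print(visible_tiers("SCZ-0001", "any_biologist"))        # only publication-level summaries
```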
So, some basic advantages — why do I think this is exciting? One is that it enhances the value of large, open, broadly consented samples, like the NIMH controls or the 1000 Genomes Project. If those were trivially easy to access and merge into your data, then they would be hugely valuable things to invest in. We would see more samples with broad consent, because they would be enormously valuable for all the other samples. Another important thing is that this has a huge network effect: if we do this right, you would have access to the cutting-edge data processing and analytic apps, everyone would want to use it, and data would pour into the system as more people wanted to use it. And I think it also provides an answer to the question David was asking: how do you actually provide digested results? How do you share what we're learning with the community? It's clear that you would have a knowledge base in the system that would be easy to share. You wouldn't have to go out of your way to make a list of p-values and put them up on a website; this would all be a click away. You could say, look at this gene across all the data, and everything that was allowed to be known about it publicly would be freely and easily available, just like Google Maps is easily available. And the point is that it actually solves some of the problems. If you're a biologist, you can just look up a gene and look up the variants, like you can in the Exome Variant Server. If you're a pharmaceutical company, you can explore all these rare loss-of-function mutations; you have access to some subset of the data, and that would be available to you. And geneticists would have a very strong interest, because you could incorporate all these enormous common controls into the system and analyze your data against all of them.

So just to conclude — or really, not to conclude, but to open up the discussion — I think one thing the central analysis server provides as a mechanism is an infrastructure that's sustainable, that would let us do integrative genetic analysis going forward. It's computationally manageable, I think, and the ELSI issues I think are also manageable. We could build such a system; it could be consented; it could do what we want it to do. And I think a successful server would be immensely valuable, but it's a technical challenge to build it. So I don't think it would make sense to try to do it in just one place. We'd want multiple versions of it instantiated, so that we could see which types of analyses and which types of infrastructure work best. And you could easily imagine a diversity of these things for specific disease areas: what works very well for germline variation might work very badly for cancer. You wouldn't have to solve everything everywhere at once. If you had a generic platform, maybe we could share common infrastructure and have different apps in different places, which would let us leverage all of the shared infrastructure but have specialized things going on at each individual place.

So why would you want to build such a thing? As a biologist, or a researcher at a pharmaceutical company or in biotech, it's a comprehensive database of the variation. Geneticists get the largest data sets. Statisticians can get data for their models. As a methods developer, you can upload your app and have it run; you don't have to try to distribute it to every individual who wants to run it. If you can just get it running in this system, it will run it for you, and then everyone can use your method, they can cite you, and you have thousands of users.
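If the platform accepts submitted apps the way an app store does, one lightweight part of the vetting could be a declared manifest of resource needs and data access that the platform checks before anything runs; human review of the code would still sit on top of this. The manifest fields, limits, and the example app below are purely illustrative assumptions.

```python
PLATFORM_LIMITS = {                     # illustrative platform-wide caps, not real numbers
    "max_cpu_days": 5000,
    "max_storage_tb": 10,
    "network_access_allowed": False,    # apps run sandboxed, with no outbound network
}

def vet_app(manifest):
    """Return the reasons an app submission would be rejected (empty list = passes this gate)."""
    problems = []
    if manifest["estimated_cpu_days"] > PLATFORM_LIMITS["max_cpu_days"]:
        problems.append("requests more compute than the platform cap")
    if manifest["scratch_storage_tb"] > PLATFORM_LIMITS["max_storage_tb"]:
        problems.append("requests more scratch storage than the platform cap")
    if manifest["needs_network"] and not PLATFORM_LIMITS["network_access_allowed"]:
        problems.append("requires outbound network access, which the sandbox forbids")
    if not manifest["code_review_passed"]:
        problems.append("code has not passed human review")
    return problems

rare_variant_test = {                   # hypothetical analysis app
    "name": "burden-test",
    "estimated_cpu_days": 50,
    "scratch_storage_tb": 0.5,
    "needs_network": False,
    "code_review_passed": True,
}
print(vet_app(rare_variant_test))       # [] -> passes the automated gate
```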
And the final thing is that, for the ELSI considerations, the system actually understands the rules articulated for the samples and would enforce them. So we could say, at a very fine resolution, this sample can be used for this, cannot be used for that, by these users — and we could actually benefit from all of that at a very fine-grained level. So that's basically my conclusion: I think we should take this opportunity seriously, as something that would really transform how at least a large fraction of the community does research in this area. It could be hugely valuable for us to build such a thing. Great. I will take questions. All right, I'll start with you.

So one experience of this is that when you get right down to the nitty gritty, there's this whole business where a particular analysis, for a particular run, drops out a series of individuals — rows — and drops out a series of columns — SNPs — before they go off and do whizzy things, and then the whizzy things will often pop in a covariate or something like that into the p-value calculation. So in your model, it's quite hard to make all of those decisions generic. Do you know what I mean? They're not only unique to the data set, they're almost unique to that little round of analysis. And although there's a kind of generic core to it, the details are quite specific. So have you thought that through — is that a problem or not a problem? How does one handle this kind of scenario?

Sure. So my feeling would be that for some things we can clearly benefit from the full scale of the data — for instance, SNP calling and indel calling can look across all samples and benefit. I could see that developing specific covariates for the exact data sets you merged would potentially be better than a universal set of covariates for correction, but I think that's something the app layer would manage. You would have some PCA tool, you could run that on your data, and it could do exactly what you wanted it to do. And I think that's completely computationally tractable.

OK — so I was thinking more about covariate modeling of the phenotypes, not covariates on the calling and the genotypes and that sort of thing. But in your view, in some sense, the app infrastructure — which sounds more like a virtual machine — would have full access to whatever the researcher wanted at that particular time?

Right. So there would be some serious process of vetting those tools. You couldn't just upload anything and run it; it would really have to meet some minimum criteria, much like what I think Apple is doing with iTunes and apps. You're distributing these things to other individuals, so you have to make sure they pass minimum criteria, and those criteria are going to be scalability and safety. You have to look at the code to see what it's doing, because you don't want it to bring down your computing infrastructure.

All right. So I'm just wondering, trying to understand this in some more detail — two questions. Different investigators or consumers of this information will come in with different needs and probably very different computational burdens. Some would be quick and easy.
Some could be quite protracted, and I don't know how you see managing that, and who adjudicates it. What have you thought about that?

Yeah. So I would say one of the big heavy-lifting pieces is this thing in blue — working with the raw next-gen sequencing data and analyzing across samples. If you imagine that that's where most of the heavy lifting is, I think that's already quite tractable today. And that's why I actually drew this little separation between analysis apps and NGS apps. The NGS apps would need more serious efficiency testing to go into the system, because they could bring down the compute. But analysis apps are pretty cheap: once you have genotypes and sites, it's relatively inexpensive to do lots of different types of analyses with relatively little compute.

But in your articulation of this — I'm just trying to get this clarified in my mind — all the computation is really done on this site? Meaning, if I'm developing a new method, beyond downloading some data that is downloadable and testing it on my own system, I would still have to come to your system if all I'm doing is testing this method on the massive data?

Yeah. So the developer's toolkit question is: how do I, as an individual methods developer, test something, given that the actual infrastructure it's going to run on is far away from me and protected? Usually this is just some sort of virtual machine you can download that's a microcosm of the full system, and you develop in it. If you want to develop for the iPhone, you don't develop on an iPhone; you have a software simulation of an iPhone and you develop in that. That's what you would have to do here. Imagine pulling down something that represents some fraction of the 1000 Genomes data; it would be a tolerable size to download, and you would do your development there.

Go ahead, David. So it sounds like you see — and I wonder if you're intentional about — a lot of fluidity in the results you would get, such that you and I might get different results, or different results on different days, with different samples being added. Do you see that as actually being a major point, or would you bring in freezes and so forth?

Yeah, I think you would have to. I think it would be intolerable if you never had any freezes, so that you could never analyze a single stable data set. I think the freezes could be done at this QC'd variant level, which would cost relatively little — even enormously large data sets are under terabytes for variants and genotypes — and I think that would be totally doable.

I think it's worth thinking about our different audiences as we return to this. There's a community of people who are, if you want to call them that, apps developers. I think the challenges everybody's raising are very real: first of all, how would you vet the apps? How would you know they weren't going to violate security? How would you know they weren't going to screw up your compute and cost you a lot of money? Presumably there are people who could help with that, because there is this whole incredibly dynamic apps development ecosystem right now. So maybe we could learn from that, but that might not be the first thing you layered on, because your methods developers could still access data.
It's actually — no, no — it's because you know where the data is and we're streamlining data access. It's actually all the people who are not statistical geneticists and who are not methods developers who are now completely hamstrung. Because the thing we hear most often is, A, I can't get the data, but B, even if I got it, it breaks my computer — the people who have the questions don't have any ability to do this. So while we worry, and I know everyone's agreeing as we worry, about solutions that might limit the serendipitous creativity of the five most sophisticated statistical geneticists in the world, we should also think about how we're currently enabling people who don't know how to deal with data like this, and don't know how to do analysis, to answer questions — by saying to them, you can go to dbGaP and download it, when they don't have the ability to do computation at all. Right now the weighting is a hundred statistical geneticists, zero biologists.

So I think there is the issue of the process for getting an app in. And I like most of this, except I'm worried about how we go from one set of apps to the next set of apps, because there is going to be cost in getting those apps in. And the problem is, most of what we do right now in the analysis of these data is not cut and dried. There's a certain amount of time between someone developing a method and other people using that method before anybody agrees it's state of the art. So how you make that changeover without getting too conservative, and therefore getting stuck with old methods that you really don't want, versus changing so quickly that you've wasted time moving to a new method that turns out not to be that good after all — that's a hard part of it.

I can give you a concrete example — just to be very technical. You could run a sandboxed version of R in the system, isolated to some virtual machine, that only had access to the data. It would be impossible to get outside of it, at least to the best that we could do in any reasonable system. And you could do whatever you wanted in that thing; you could break it. There are ways to handle that, and I would imagine that's what I would want — I would want that freedom. So the better answer is that this isn't the only solution, and it isn't in opposition to the current way of operating. It's the idea that these solutions are in opposition that I think we don't need to have, right? Having a central server does not preclude traditional local analysis.

In the back, and then to Gonzalo. So, I was running with you — I was like, yeah, yeah, yeah, I think this is a great idea. Good. And then you lost me at the apps, and let me explain why. In a sense, the model starts as some type of shared computing resource, so we're basically going back to old-school shared-time computing systems, which is fine — that's where we're going anyway. But I'm really hesitant to pigeonhole people into a particular language and a particular software architecture. You mean like a programming language? Well, not even necessarily a programming language — just whatever your architecture is for the apps, that's going to limit what people can build.
And so I liked it when you started to say that there was a potential for virtualization, allowing people to have space and computing time on here. I guess the big question that comes with this is: who pays for this at that point? Because this is a non-trivial amount of computing, and if it's constantly evolving, you need a whole series of engineers to maintain it, manage it, and eventually update it.

I guess there are two questions. One is, how bad is the computational burden? And I can answer that to some degree. As a test, just as a proof of concept, we recalled 16,000 exomes last week in a single unified process, which is the large, expensive piece, and it takes on the order of a couple thousand CPU-days. So this is not so expensive, given that CPU-days in a non-efficient environment cost a couple of dollars per CPU-day. Full data processing is hundreds of thousands of CPU-days. So you're not talking about enormous amounts of money relative to the investment that we put into the data itself.

But I think the point that's being raised — and I think the answer is that we don't have an answer, okay? — is that that assumes somebody has a very efficient version. When you raise this apps idea, if someone were to bring in an app and it set off some process, without limitations, that was going to cost a billion dollars worth of computing, who's going to pay for that? That's a business model question. So that's an important thing. It is. Also with the shared storage model, because for every additional petabyte you bring on, it's going to explode on you.

I think those things are solvable. There are systems to build, and there are many, many ways you could imagine ensuring that there were resources available to run what you wanted to run. You could have a central funding mechanism, or you could have individuals fund their own work. It's not clear what the right way to go is, but I think we could solve this. Amazon has a cloud not because it is especially eager to provide a computing cloud to everyone, but because that's the compute they need to deal with Christmas; monetizing it, and everything associated with the cloud, is an attempt to recoup the cost of setting up those computers for Christmas. And it's been hugely valuable. So my point is that there's a lot of compute out there, there's a lot of opportunity to do this, and I think it's not astronomically difficult.

So, Gonzalo, and then over there, and Debbie — sorry. Okay. So, I think you started from a very compelling vision: there are problems that we're not addressing well, and it would be nice if there were a convenient way to get answers to these big questions. But my take on the architecture you've laid out is that it's actually an incredibly complex system that probably goes way beyond what you need to address those problems. Just thinking of the scale of this: how many papers are published in bioinformatics each month? Let's say half of them deal with next-generation sequencing and could be turned into an app.
If someone had to vet the code on each one of them, in its different versions, before it went into the cloud, it's kind of... Well, I wouldn't want to do that. We'd have to have mechanisms — hurdles you had to leap over, in some sense. Then you've got access to many different layers of data, from sequence on up. It just seems like an incredibly complex system, and it seems like we should focus maybe on simpler parts that get us to the question you articulated at the beginning. If someone wants to know what the effect of a variant is on heart disease risk — David's example — there should be some convenient way for them to get the best answer to that.

I totally agree. This is why I feel that if we did go down this route, we would want multiple versions of these things, because the technical challenge to build one of them is large — I wouldn't want to be the only person trying to build them; it's very complicated. I also agree that it makes sense to go incrementally, through lots of different pieces, each of which would have value as early as possible. I'm not sure which parts of the system would be most valuable most quickly, but it would make sense to focus on those. It could be that we decide to start with the subset of most valuable or most broadly consented samples, to maximize value quickly.

So my final comment is this. We've been saying that these things are not in opposition to each other, but one concern is that we might overpitch the idea you just laid out. One of the things you said is that this is a better model because we can atomize access control better. If we said, hey, actually our goal now is for access control to go through this process, that would throw a big spanner in how everything works, because we don't have anything that does this. It's not clear that it's buildable, or how many years it would take to build; you couldn't work to this model, and we don't even know if you could build it or whether it would work. So I think there are attractive features here, but it's an incredibly complex engineering task that's laid out there.

Definitely. I'm going to pick one last time. Okay. I don't want to be the one to pick. Want to, Mike? All right, so just to dovetail on this point — oh, Debbie, no, no, Debbie, Debbie. I didn't want to be the one to pick; I cede mine to Debbie. All right.

I think we're dealing with a lot of different problems with one app, if you want to know what I think about this. And I think we need to take a step back — and I think we're planning on doing this tomorrow, is that right, Mike? — to figure out how to approach these problems. I think we want easier access for everyone to this data set, no matter who they are, and we also want to make the best data sets available. We could think of the 1000 Genomes model, for example: they have a freeze and everybody uses that freeze, right? I mean, why can't we make all the data available as a freeze?
I think the idea that you could do limitless computing and test all new tools on the same system is a totally different can of worms, and people could argue about whether that's a good can to open — and I'm not saying it wouldn't be a good can to open, because maybe that will lead to the next version of the data that the community vets as being the very best ever, right? But these are different issues, and we need to consider them all, and the paths for building what it takes to get to them, in succession. But I'm agreeing with Gonzalo in the sense that the beginning was very different from the end of this presentation. Great, thank you.