All right, so I'm going to start talking about projects at the tail of the spectrum that get further and further away from ENCODE, beginning with FANTOM, a long-running consortium for the Functional Annotation of the Mammalian Genome. Its goal is in some ways quite similar to that of ENCODE, really focusing on identifying functional elements in mammals. They're looking at a variety of human and mouse tissues and cell types, so they tend to be quite broad on the sample diversity axis. Unlike ENCODE, they're very highly focused in the assay dimension: a great deal of the data they generate is CAGE data, cap analysis of gene expression, which locates the 5' ends of capped RNAs, and they couple that to a limited extent with other transcriptomic data.

All FANTOM data is available at a public repository. Another resource FANTOM provides is a cDNA clone bank: they hold clones of many of the full-length cDNAs they've identified over the course of the project. There is also a coordinated data analysis and integration effort. This focuses to a large extent on functional element annotation, using their CAGE data to identify both promoters, by looking at mRNAs, and enhancers, by mapping enhancer RNAs. They've also taken a foray into examining different cell states and fates, and they have done some work looking specifically at mechanisms of gene expression. FANTOM is funded by RIKEN, I should point out.

So really the key feature of FANTOM, as I was considering it, is this very narrow focus on a particular assay, CAGE, which in turn gives them a lot of latitude to move through the other dimensions, particularly sample diversity: a large number of cell types and tissues, as well as some nice work on different cell states and fates, biological and chemical perturbations of cells and tissues, coupled with time-resolved data collection.

I'm going to move on now to talk about GTEx, the Genotype-Tissue Expression project. This is funded by the NIH Common Fund, and the goal of GTEx is to provide an atlas of gene expression across human tissues, and really to provide a resource for exploring how that gene expression is modulated by genetic variation. GTEx is looking at a pretty broad swath of human tissues, up to 30, within a large number of donors, up to 900 by the time the project is complete. They're collecting a number of data types: transcriptomic, as well as whole-genome and whole-exome sequencing. There is another effort associated with GTEx, eGTEx or enhanced GTEx, which looks at a much smaller number of donors but applies a larger number of assays across those donors, including epigenomic and proteomic assays. Additionally, I should mention that a collaboration has very recently been established between ENCODE and GTEx, in which a handful of donors will have the ENCODE assays performed across this broad variety of tissues.

GTEx data is primarily restricted access; it can be approached through the GTEx data portal and is available in dbGaP, with the exception of the ENCODE-GTEx collaboration, all of whose data is to be open access. They also have a very nice data analysis feature, an eQTL browser, at their portal. So I think the really distinguishing feature of GTEx is the large sample size.
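For those who want to query the eQTL browser programmatically rather than through the web interface, here is a minimal sketch in Python. The endpoint, parameter names, and response fields are assumptions based on the public GTEx portal API and should be checked against the current documentation before use.

```python
import requests

# Assumed endpoint of the GTEx portal's public REST API; verify against current docs.
BASE = "https://gtexportal.org/api/v2/association/singleTissueEqtl"

params = {
    "gencodeId": "ENSG00000012048.23",    # versioned GENCODE gene id (BRCA1; illustrative)
    "tissueSiteDetailId": "Whole_Blood",  # assumed tissue identifier
    "datasetId": "gtex_v8",               # assumed release tag
}

resp = requests.get(BASE, params=params, timeout=30)
resp.raise_for_status()

# Field names ("data", "snpId", "pValue", "nes") are assumptions about the response schema.
for hit in resp.json().get("data", [])[:5]:
    print(hit.get("snpId"), hit.get("pValue"), hit.get("nes"))
```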
Looking at several hundred individuals really allows them to delve more deeply into inter-individual variation than some of the other projects being considered here, and they couple this with very high tissue diversity. Doing both necessitates focusing more tightly on the set of assays they use, but together this provides a really great set of resources, I think, for starting to understand how inter-individual variation impacts gene expression.

Next I'm going to talk about the LINCS program. This is another Common Fund program; it stands for Library of Integrated Network-Based Cellular Signatures. I have to look that up every time I say it. The goal of LINCS is to create a network-based understanding of biology by elucidating what they refer to as cellular signatures, through systematic perturbation of systems, data collection, and then a computational analysis component to extract these signatures from the data they collect. They are going fairly deep in the tissue and sample dimension, looking at a number of primary cells, cell lines, iPS cells, and differentiated cardiomyocytes and neurons, and collecting a number of different types of data: transcriptomic, phosphoproteomic, epigenomic, and imaging, so a wide range of molecular and cellular phenotypes. It should be pointed out that they're not running all of these assays on all of the samples they look at, so the LINCS data cube, as it were, is not uniformly dense: certain cell types get a very deep set of assays, others do not.

Their data is available at the LINCS data portal, and a distinguishing feature of LINCS is that there really is a concerted integration and data analysis effort aimed at deriving these cellular signatures. These are queryable and available online at the LINCS portal, and they're also generating tools so that community data sets can be used to derive additional cellular signatures. So LINCS has gone very deeply into the dimension of cell state and fate diversity. They have a uniform library of what they refer to as perturbagens, biological or chemical stimuli that can be applied to various samples, and they're also doing time-resolved data collection for some of these perturbations. We've heard in a couple of the talks already that there has not been a lot of movement toward examining cells in resting versus perturbed states, and LINCS is really pushing hard in this dimension.

Another feature of LINCS worth pointing out is their strategy of initially running a large number of assays, then analyzing the data to determine which assays are the most information-rich, and then rationally homing in on that reduced set so they can extract the most information per experiment while applying it across the largest possible space of cell types, states, and fates. That matters because of their high sample diversity: focusing in this way will allow them to approach many different cell states and fates.

Next up is a very well-known resource, The Cancer Genome Atlas. The goal of TCGA is to improve cancer care by accelerating our understanding of the molecular basis of cancer through genomic techniques. It is funded jointly by NCI and NHGRI. They have a very large sample set, 10,000 matched tumor and normal pairs, and for many tumor types they have several hundred tumor-normal pairs that can be compared.
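To make concrete why matched pairs are valuable, here is a minimal sketch, on synthetic data, of the kind of paired comparison this design supports; the numbers are purely illustrative, not TCGA results.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
n_pairs = 200

# Toy expression values for one gene: each tumor is matched to a normal
# sample from the same donor, with a systematic shift added in tumors.
normal = rng.lognormal(mean=2.0, sigma=0.5, size=n_pairs)
tumor = normal * rng.lognormal(mean=0.3, sigma=0.4, size=n_pairs)

# Paired (signed-rank) test on log-scale values: each tumor is compared to
# its own matched normal, so donor-specific baselines drop out.
stat, p = wilcoxon(np.log2(tumor), np.log2(normal))
print(f"median log2 fold change: {np.median(np.log2(tumor / normal)):.2f}, p = {p:.2e}")
```

Because each tumor is tested against its own matched normal, stable inter-individual differences in baseline expression cancel out of the comparison, which is much of the value of collecting pairs in the first place.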
TCGA is again collecting a variety of data types: primarily whole-genome and whole-exome sequencing, as well as transcriptomic data and more limited epigenomic and proteomic data. Some TCGA data is open access and some is restricted access. It can all be approached through the TCGA data portal, and the restricted-access data is available in the TCGA equivalent of dbGaP, which is known as CGHub.

TCGA is a member of the International Cancer Genome Consortium, ICGC, which has an overlapping and similar goal. ICGC is trying to obtain a comprehensive description of the changes that occur in 50 tumor types determined to be of very high relevance: genomic changes as well as transcriptomic and epigenomic changes. Like TCGA, they have matched tumor and normal samples and collect several data types similar to those being collected in TCGA. Also like TCGA, they have both open- and restricted-access data that can be obtained through the ICGC data portal.

And finally, I'm going to finish by talking about a project that is fairly distantly related to the rest of these, but it provides a rather unique resource, so we thought it was worth mentioning here. This is the Knockout Mouse Phenotyping Program, or KOMP2. This is a Common Fund initiative as well, and the goal here is to provide a broad set of standard phenotyping assays over a genome-wide collection of mouse knockouts. KOMP2 is a member of the IMPC, the International Mouse Phenotyping Consortium, and collects data types that are quite distinct from everything else we've been talking about: morphological and histopathological data, as well as a number of behavioral assays. KOMP2 data can be accessed through the mousephenotype.org data repository.

So we've briefly run through many different, very large programs, and we had to do it at a very high level. Now we're going to step back to an even higher level for just a second and consider all of these in aggregate. I've listed them all here, alongside several ways in which they can be differentiated. Obviously there are many, many ways you could draw distinctions between these projects, but we thought these might be a few particularly interesting things to look at. In terms of assay diversity, several projects have gone quite deep into a number of different assays, and many projects have gone very deep in the sample diversity space, looking at a large variety of tissues, cell types, and cell lines. I think it's notable that the dimension of looking at a large number of individual samples is, at least in this very rough view, rather poorly explored at this point. GTEx, TCGA, and, I should add, psychENCODE, which I neglected to mention, have significant numbers of samples with which they can start to address questions about inter-individual variation, but it seems there is still a lot of work to be done in that area. And finally, a smaller number of projects have looked at the dimension of cell perturbations: different combinations of cell states and fates, and how functional genomic characteristics change over time courses of differentiation and things along those lines.

So I'm going to wrap up by considering one final dimension, and that is looking at these projects in terms of the topics we're considering in this workshop: unbiased mapping or annotation, supporting basic biological studies, and supporting more disease-focused studies.
And so these categories blur together a lot, I think, as we'll hear about over the next day, and this is a very rough approximation, and I should credit Mike Pazin for putting together this slide as well. We wanted to take a look at these and see whether there were particular areas under consideration in this workshop that had a large concentration of effort, I'm not sure there can be an excess of effort, at the expense of some of the other areas. And to me, the take-home from this was really that there is a lot of work going on in each of these areas, and that there is a broad base of functional genomics research taking place at this time. I think that gives us a really good jumping-off point for the kinds of things we're going to consider throughout the rest of the workshop. So I will wrap up there, and if anyone has any questions for myself or any of the other presenters, I'm happy to take those now.

What is FunVar? You didn't mention that.

So I apologize for that naming. FunVar is a new project; it's the one I referred to as the computational prediction of the significance of non-coding variants. In-house we refer to it as FunVar, and it doesn't have a real name yet, so in some of the materials it's FunVar.

Question. With all these projects that are in production, and have been for a number of years, is there a good assessment of the cost per assay experiment? That would be really interesting, because you have a number of large-scale projects doing very similar types of experiments, right? They're doing ChIPs, epigenomics.

Yeah, I don't know that we've compared that across projects. I'm not aware of us having tried that from ENCODE, for example. In-house we know about the cost of ENCODE; the cost per experiment is a little more difficult to say, because some of our costs are the data coordinating center, analysis of the data, and so forth. Are you primarily interested in whether the costs vary across projects, or just in how much doing this kind of work costs?

Oh, per project. What I was thinking would be an interesting genomic experiment is to ask each of the projects: hey, I have an interesting cell type, I'd like to run it through your assays, how much would it cost me? It's more general. There's all kinds of great stuff that's infrastructure, and I'm not quibbling about the cost of the projects; we know what those are, we can assess them. It's more down in the mechanics, in the weeds. And I'm asking this as a taxpayer.

I see the sense of your question. I can't give you a simple answer that, say, the standard six histone marks plus whole-genome bisulfite plus RNA and DNase would cost X number of dollars. One of the vagaries is that it depends on the sample type. With a cell line, the cost is relatively fixed: you buy it from ATCC for a few hundred dollars and then grow up a small amount; that's not limiting. But getting large amounts of cells, say ES cells differentiated by a company, is hugely expensive. Getting consents for open-access data costs a lot of money, as does hooking up with clinical partners to get sample access. So in some cases that part of the process is quite expensive, and depending on what the sample is, it can hugely change the price.
So can I just follow up on that question, bringing into it what LINCS has done, where they started out generating a lot of data and then tried to funnel it down to the data sets that actually tell us the most? It seems to me like that is a clever strategy and something ENCODE could build on, given that there's a tremendous amount of information for certain cell types, but not all of it is probably deemed equally useful. Not all transcription factor antibodies work equally well, et cetera. So is there an interest in ENCODE in doing something like funneling down to the data sets that would be necessary if we move to different cell types, systems, perturbations, et cetera? Maybe a conjecture.

Can I? We'll talk about this in a little bit. We will? OK, so I'll say briefly: yes, this is of interest to the project. One of the things to keep in mind is that part of the utility of the data depends on what it is that somebody is going to do with it. For instance, as you heard in Mike Snyder's presentation, ENCODE has a small number of cell types with a very large number of assays, and certain kinds of work require that: there are people who say, I want to analyze 80 different TFs in the same cell type, so that's what matters to them. Other users of ENCODE data say, I want to know which enhancers might be important in which cell types and diseases. What matters to them is that we have profiled many different cell types with the same assay, so those people will say, I want to look at DNase or RNA data, where ENCODE covers many, many cell types. So part of it is a trade-off issue; it depends on what you want to do with the data.

Yeah, and there are two things that have been coming up in the talks that I wanted to bring out a little more. One of the things you might want to do with the data is keep it interoperable and combine data from these different groups, and I'm glad there's a growing awareness of the need for good metadata. What's hard to keep track of is this key distinction: some of these projects produce data that's open access, and some of it is restricted access. That's a critical dimension we have to keep in mind as we think about how best not only to interact with these groups, but also how much of the data could be put into a common fund of data that really could expand into a larger encyclopedia.

Thanks for that point. As we move through this workshop and think about implementation, I would like people to remember to comment at some point on the extent to which we should be spending our dollars on getting more assays versus making sure that the data are unrestricted access. We can't do everything, but this is an important point to consider: what is the relative value of unrestricted-access versus controlled-access data?

Can I comment on that for a moment? Restricted access is obviously needed when you're talking about using samples from patients, right? But I would just like to remind you, as mentioned earlier, that the use of raw data was very, very minimal in the ENCODE project. And that is the data that will be restricted; you can gain access, but it is restricted in the sense of openness. It's used very, very infrequently, very rarely, so I don't think it's a major problem.

Well, I think we have to keep track of it. I certainly understand your point; that's a good point.
And I know that Blueprint and others are making some processed data available; that's what you can get from GTEx if you go to their portal, a processed version. But we really have to keep in mind the levels of access, and which level of data each applies to.

Yeah, so I think we need to move to the next session. It's not that I dislike the idea of having more discussion; perhaps the comments or questions that we didn't get to can come up in another session. Thank you.