Good afternoon. Welcome to the NHGRI seminar series on the bold predictions for human genomics by 2030. You can find more information about this seminar series on the genome.gov website, under bold predictions. Let me see how I move forward. Here we go. So last year, we published our 2020 strategic vision. This document was completed after extensive consultation and discussion with many people in the field, including the two speakers today. Overall, the vision is for improving human health at the forefront of genomics. It is meant to encompass the entire human genetics and genomics field, while mostly focusing on what NHGRI will be doing. As you know, predicting the future is risky. As a matter of fact, most of the impressive genomic achievements in history, when viewed in retrospect, would hardly have been imagined 10 years earlier, but it is still fun to make predictions. And so in this document, we made 10 bold predictions for what human genomics will look like in 2030. Most of these probably will not be fully attained, but this is supposed to be an inspirational document, to inspire people to strive for something that is not possible today, and also to provoke discussions on what might be possible at the forefront of genomics. To unpack and expand those one-sentence bold predictions, and also to start discussions, a seminar series was designed, largely to the credit of Chris Gunter, who is here with us. It started in February, and today is the third installment; the series will run through June 10, 2022. Again, you can find all the information about these seminars on the website. The format for each seminar is two speakers, each giving a 25-minute talk, followed by a moderated discussion and then questions and answers from the audience. By the way, please feel free to submit your questions through the Q&A button. Please don't use the chat button.
These questions will be answered at the end of the talks, but you don't have to wait until the end to submit them. So again, today's talks concern the third bold prediction: the general features of the epigenetic landscape and transcriptional output will be routinely incorporated into predictive models of the effect of genotype on phenotype. We have two fantastic speakers, Tom Gingeras and Tuuli Lappalainen. Dr. Gingeras is professor, head of functional genomics, and cancer center member at Cold Spring Harbor Laboratory. He received his PhD from New York University, followed by postdoctoral research and a staff appointment at Cold Spring Harbor. He then moved to the West Coast, initially with a position at the Salk Institute, and then went to biotech companies before returning to Cold Spring Harbor in 2008; he was the vice president of biological sciences at Affymetrix. His current group studies where and how functional information is stored and regulated in genomes, and these efforts help explain the biological and clinical effects of disease-causing gene mutations in humans and other organisms. He has been a leader in the NIH ENCODE, mouse ENCODE, and modENCODE projects. Dr. Lappalainen is an associate professor at Columbia University and also a core faculty member at the New York Genome Center since 2014. She received her PhD from the University of Helsinki, Finland, followed by postdoctoral research at the University of Geneva, Switzerland, and at Stanford University. She has pioneered the integration of large-scale genome and transcriptome sequencing data to understand how genetic variation affects gene expression, providing insight into cellular mechanisms underlying genetic risk for disease. Her research focuses on functional genetic variation in human populations and its contribution to traits and diseases. Dr. Lappalainen has made important contributions to several international research consortia in human genetics, including the 1000 Genomes Project and the GTEx project.
As a matter of fact, next month Dr. Lappalainen will begin a new role as the director of SciLifeLab's National Genomics Infrastructure, as well as a full professor in genomics at the KTH Royal Institute of Technology in Sweden. So I believe Tom will be the first speaker. Tom, the podium is yours. Let me stop sharing here so you can start your slides.

Thank you, Paul. Thanks for the introduction and the opportunity to take part in this program of bold predictions. I'd like to say at the outset that a relatively high bar has been set by the previous four speakers, and I hope to be able to match that. Anyway, let's begin. Bold prediction number three, as read by Paul, basically states that the general features of the epigenetic landscape and transcriptional output will routinely be incorporated into predictive models of the effect of genotype on phenotype. As you look at the details of this prediction, it's clear that it's bold, but it's also clear that it's somewhat daunting. Specifically, it's composed of three interdependent components, each of which has several areas of challenge. It is my intention and goal in this presentation to focus on these challenges as a means to move beyond this particular bold prediction into others. The first of the areas of which this prediction is composed is the collection and analysis of personal genomes, and by that I mean the generation of phased, diploid genome sequences. It also consists of the collection of relevant transcriptional and epigenetic profiles, ideally using long-read sequencing to gather both sequence and modification data at the same time. The second interdependent component is the use of predictive modeling approaches that integrate this dataset and begin to look for relationships with known pathways, such that the outcome is a proposed phenotype.
And the third component of this bold prediction, as mentioned by a variety of our previous speakers, is that the phenotypes that are going to be detected or predicted will in fact occur at many, many biological levels, which we'll discuss in a few minutes. But what exactly do we want the outcome of this bold prediction to look like? What is the goal of this prediction in a substantive way? The goal is that, in the ideal situation, a sample from a symptomatic or asymptomatic individual is obtained, either from an anatomical source, namely one of the organs, or from an easily accessible source that will then serve as a surrogate for the affected organ or tissue. This sample will be used to gather the genome sequence of the provider and to provide information in terms of transcriptional and epigenomic profiles. These data will then be analyzed using computational algorithms to determine how the sum total of these data point to one or more genomic variants as the cause of the anomalous transcriptional or epigenetic phenotypes that may be contributing to the complex phenotype. This set of goals has several unresolved and unappreciated challenges for both precision medicine and precision genomics, and these will also lead, I think, to additional kinds of opportunities for bold predictions. The unresolved and unappreciated challenges concern the collection of transcriptional and epigenetic data by many consortia. These consortia have had a long-term interest in collecting basic data on the functional areas of the genome and how they are regulated; they include ENCODE, GTEx, Roadmap Epigenomics, and EN-TEx. These consortium efforts have also been in the business of looking for causative genetic mutations. The union of these two kinds of efforts is currently underway, and this bold prediction really serves as a way to bring together these two very important efforts.
Now, in light of these efforts and these resources that have been collected over the years, and the goal laid out by prediction number three, many challenges have emerged that need to be addressed if this prediction is to actually be realized. What I'd like to go over is a brief summary of the challenges that have emerged upon thinking about what is entailed in this bold prediction. First, the identification of genetic variants giving rise to phenotypic results, as measured by changes in the level of expression and in epigenetic modifications, is a real challenge and has been for a very long time. The second challenge is the fact that many genes have multiple functionalities, and they also have multiple isoforms, some of which are responsible for the different functionalities of that gene. The third challenge is the limited availability of normal tissue for analysis and study. This includes the brain, heart, and kidney, tissues that are not easily accessible in the normal individual, which are needed to study the baseline profiles that will in fact constitute what is normal. The fourth challenge, the sheer definition of what is normal, is also important, because that state is likely to be a range of states, and it is against that state that we're going to evaluate the data we collect, both genomic and epigenomic. The fifth challenge is the differences in the transcriptional and epigenetic profiles that exist in samples analyzed from living versus post-mortem individuals. A considerable amount of data, particularly when it entails analysis of difficult-to-access tissues, comes from post-mortem studies, and this is a challenge I'd also like to talk about.
And finally, the environmental influences on somatic epigenetic changes, and the pathways that lead to those changes as caused by the environment, are quite important, and while this has been a subject of considerable interest for a long time, the processes involved are still largely unknown. So let's walk through the challenges that I just enumerated and talk a little bit about each one in order to get some clarity on what is meant. The first of these challenges was basically the identification of the variants that give rise to phenotypic variation. The depiction on the slide here is of two genic regions where mutations have been identified, both as present in the gene and as sites where epigenetic modifications are important in the expression of that gene. This is important because, in the two genes that we have here, there are two points I'd like to highlight. First, predicting the phenotypic effect of genomic variation depends upon knowing where to look, beyond RNA expression and epigenetic modifications: at what biological level is the phenotype likely to be exhibited? If that's not known, then the variation we see leaves us dependent on making predictions alone, rather than actually having physical results to fall back on. The second point this slide is intended to convey is that complex phenotypes are often caused by multiple genotypic changes. Although only one site is cited in each of these examples, much of the challenge facing the future accomplishment of this bold prediction is being able to identify all related changes that in fact contribute to the phenotype of interest. The second challenge is basically the multifunctional role of genes, which is complicated by the fact that some of these multifunctional elements may be affected by a variant while others are not.
One feature of this challenge is the presence of different expression levels of different isoforms in different cell types. The same gene can obviously have different isoforms, but those isoforms can vary in their expression level depending on what cell type is investigated. Most novel isoforms are expressed at very low levels, somewhere between 10- and 1000-fold lower than the major isoform. But the fact of the matter is that roughly 43% of genes that have multiple expressed isoforms in fact have these lower-expressed isoforms as the major expressed isoform in other cell types. So it is somewhat arbitrary to say that there is a predominant isoform, because it is very much tissue dependent, and it may be that those isoforms lead to other, different phenotypes. I want to now address the issue of what is normal, because several of the challenges we'll discuss later will depend upon getting a sense of what is operable, what is normal, in each cell type or each organ that we're investigating. Phenotypes can occur at any of these biological levels, from the level of the protein being made up to the level of subpopulations, where environmental influences have effects on the overall expression levels and the phenotypes that are present. So if the individual phenotypes will be different at each of these biological levels, then we have a task in trying to understand which of these levels we're going to use in order to identify the effects of major mutations. Finally, you can determine the range of expression and the loci of epigenetic modifications in genes of interest as part of this baseline, as part of this normalcy. That's going to be important because in many of these instances, what matters will be alterations in the levels of expression and in the position and presence of modifications. The idea of normalcy also comes into play when asking oneself: where are you going to find normal samples?
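As an aside, the isoform-dominance point made above, that a gene's "major" isoform can switch between cell types, can be made concrete with a minimal sketch. The gene names, cell types, and expression values below are entirely hypothetical, invented only for illustration:

```python
# Sketch: quantify major-isoform switching across cell types.
# Hypothetical TPM values: gene -> cell type -> isoform -> expression.
expression = {
    "GENE_A": {
        "liver":  {"iso1": 120.0, "iso2": 3.0},
        "neuron": {"iso1": 0.5,   "iso2": 45.0},  # minor isoform dominates here
    },
    "GENE_B": {
        "liver":  {"iso1": 80.0, "iso2": 10.0},
        "neuron": {"iso1": 60.0, "iso2": 5.0},    # same major isoform everywhere
    },
}

def major_isoform(iso_levels):
    """Return the isoform with the highest expression in one cell type."""
    return max(iso_levels, key=iso_levels.get)

def switching_genes(expr):
    """Genes whose dominant isoform differs between cell types."""
    switched = []
    for gene, by_cell in expr.items():
        majors = {major_isoform(levels) for levels in by_cell.values()}
        if len(majors) > 1:
            switched.append(gene)
    return switched

print(switching_genes(expression))  # GENE_A switches, GENE_B does not
```

Applied genome-wide to real isoform quantifications, a tally like this is what underlies statements such as the 43% figure quoted in the talk.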
It is true that the NIH and many other funding agencies have made great progress in providing samples for a variety of different types of studies. But the fact of the matter is that many of these studies and sample collections deal primarily with specific disease states, and normal controls for those disease states are only a part of the collections being brought together. In addition, there are many surgical centers in the United States; most fairly large hospitals have them. These surgical centers routinely operate on individuals in which normal tissue is removed as part of the resections or as part of the normal procedure. But those normal tissues are usually just discarded, and this leaves us a resource that is really untapped. One thought is that many surgical centers could, incentivized by NIH, keep these normal tissues and make them available, either to a central resource depository or locally to those who request them. The bottom line is that these are valuable resources for providing a baseline understanding of how each gene and each regulatory element in the genome is operating. You could think of this as leaving no tissue behind. The next-to-last challenge we're talking about today is the difference in the transcriptional and epigenetic profiles obtained from living and post-mortem individuals. In a set of studies that we recently published, we looked at the expression levels of all of the genes of the entire genome in several individuals, about two dozen. Of these two dozen individuals, half were patients undergoing surgical treatment for epilepsy, in which normal tissue is unavoidably removed as part of the treatment.
That represented an opportunity to look at gene expression in those tissues and compare it to individuals who had died and had donated their tissues for analysis. All of these individuals were examined at both the RNA expression level and the epigenetic level. You can see from this slide that the expression levels of these genes look very similar in the top four panels, where the samples are quite fresh. Each of these panels is composed half of individuals who were deceased, shown in red, and half of samples that came from living donors. In the case of housekeeping genes, which are in the upper panels A to D, expression is almost identical in both living and deceased individuals. In the post-mortem samples, however, there were about 2,000 genes that were affected and differed between the two states. This is also true when you look at RNA editing at the 3' and 5' UTRs: there is an appreciable difference in the genes that are not housekeeping genes. The variations that we see are for the most part not only losses of expression, presumably due to degradation of the RNA, but also some genes that remarkably increase their expression while the cells remain quite viable. These changes can somewhat affect our understanding of the behavior of certain brain-expressed genes. So it is important, in the end, that we understand that the selected tissues should not only be normal, but should not come from a state, as post-mortem samples do, that is very challenging to a very large number of genes. Finally, the environmental influences on somatic epigenetic changes are a well-studied area, but the signals and pathways leading to genomic specificity, that is, where these modifications occur after exposure to environmental conditions, are really an area that's quite challenging.
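The living-versus-post-mortem comparison described above can be sketched, in highly simplified form, as a per-gene screen for large mean differences. The gene names and log2 expression values here are hypothetical; a real analysis would use a proper differential-expression model with replication and multiple-testing correction:

```python
import statistics

# Sketch: flag genes whose mean expression differs between living and
# post-mortem samples by more than a log2 fold-change threshold.
# Hypothetical log2 expression levels per sample.
living = {
    "ACTB": [10.1, 10.0, 9.9, 10.2],   # housekeeping gene: stable
    "FOSB": [4.0, 4.2, 3.9, 4.1],
}
postmortem = {
    "ACTB": [10.0, 9.8, 10.1, 10.0],
    "FOSB": [7.5, 7.9, 8.1, 7.6],      # increases after death in this toy data
}

def affected_genes(a, b, min_log2_diff=1.0):
    """Genes whose mean log2 expression differs by at least min_log2_diff."""
    hits = []
    for gene in a:
        diff = abs(statistics.mean(a[gene]) - statistics.mean(b[gene]))
        if diff >= min_log2_diff:
            hits.append(gene)
    return hits

print(affected_genes(living, postmortem))  # ['FOSB']
```

The housekeeping gene passes unflagged while the responsive gene is caught, mirroring the pattern described on the slide, where housekeeping genes looked identical across the two states but roughly 2,000 other genes differed.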
The mechanisms, pathways, and agents that are responsible for determining the locations and the types of modifications that go there remain understudied and a very valuable area of research. This is a challenge that will require a variety of approaches to solve. Now, having gone through these challenges, I'd like to suggest that they offer opportunities for additional progress, and I'd like to go through that for some of them. For example, among the things that could perhaps be approachable by 2030 is the identification and engineering of isoform regions of genes in a cell-type-specific manner, with the genes selected for being clinically important during development or during a disease state. This could provide a fundamental understanding of how different isoforms function and how they actually operate in a normal metabolic state or in a disease state. Another prediction is that, in light of the need for broader access to tissues that are nominally normal, one could suggest that NIH-funded tissue collections mandate or request that all participating medical centers contribute normal tissues, consented for genome, RNA, and epigenetic sequencing, and use these data to better define what normal is. The last prediction in this group is that one could identify the genes whose expression profiles, across all cell types and organs, are affected by post-mortem conditions. Uses of these data could then be corrected accordingly, and doing so would provide a better way to understand how these genes operate in very specific types of tissues. The bolder versions of the predictions come in two forms.
The first is the identification of all causative genetic variants giving rise to changes in levels of expression and epigenetic marks, by identifying outliers within the normal expression profiles, and by determining the location, that is to say the cell types and organs, of all expressed coding and non-coding genes. To do this, the prediction is that one could take advantage of the ongoing and developing work on in vivo sequencing and ChIP analysis, which could indicate where these variations are occurring and where these phenotypes can be seen. And finally, the prediction that involves the environmental influences on somatic epigenetic changes could come in the form of identification of the cellular signals and pathways leading to specificity, that is to say, which modifications are occurring at which sites. That goal could be made approachable by the development of markers, that is to say RNAs, proteins, or lipids, from easily obtainable biological samples rather than samples that would require surgical intervention, used as surrogates for markers that you would like to study in less attainable organs or tissues. Now, these predictions and these challenges, I think, offer an opportunity to think ahead and to think of ways we could move the field forward if we could in fact achieve some of these predictions. It's important to note that not all of the bold predictions at the end of this presentation require novel technology or novel inventions. Many of them require a decision and a commitment to provide resources that would be helpful in solving some of the challenges that were discussed.
So I'd like to end by acknowledging my colleagues, both at Cold Spring Harbor and at Harvard and Yale, who have contributed to the data that was generated and used in the studies I mentioned, and mostly for the ideas that have often been traded among all of us as we think about the massive sets of data that we have collected in these different consortia. So thank you for your attention, and I look forward to answering any questions that the audience might have.

Thanks a lot, Tom. We'll hold the question and answer session until later. So maybe you can stop sharing the screen, and Tuuli is the next speaker.

All right. Hi, everyone. Good afternoon, and thanks for having me here as a part of this very exciting seminar series. So, to get right into this: when I saw this prediction and was asked to talk about it, I started thinking that there are multiple interesting premises baked into this statement, which I'm not going to read because it's long, that I wanted to dissect today, discussing whether they are true, what we know about them, and how we actually make this prediction a reality. So the first thing that is implicit in this prediction is that it talks about predictive models of the impact of genotype on phenotype, but also suggests that we will need epigenetic and transcriptional data to make this work. That's basically implying that genotype data alone will not be sufficient to predict physiological disease phenotypes in humans. And that's an interesting proposal that I'll inspect a little bit later.
The second premise is that if we are saying that we need other layers of biological data, not just genotype and phenotype, then transcriptome and epigenome data would be those informative, useful data types, implying that they are at the very least correlated with genetic variants and physiological traits, and potentially even mechanistically mediating those genetic effects on disease traits. And the third aspect is that saying this would be routinely incorporated into predictive models implies that we would be able, or maybe we already are able, but at least in the future would be able, to measure these molecular phenotypes at sufficiently high scale and precision for the data to actually be useful. So I'll be discussing what current data support these premises, what some of the other key insights are that we have learned, how we make this prediction into reality, and what the other components are around this broad topic where we need to push as a community to make this happen. Tom already touched on some of these points, but I hope to expand on some of the aspects. So, the prediction of phenotype from genotype, especially in the complex trait space, has many fundamental challenges that we are now very well aware of as a field. Of course, after some 15 years of GWAS, we know that the heritability of complex traits is distributed in teeny-tiny genetic effects across the genome, and that these variants actually account for just a fraction of the phenotypic variance in complex traits. Even though this is an NHGRI seminar series and I'm a geneticist, and we love to think about genetic variants, they are not all that matters in complex traits.
We also know that most GWAS heritability is in non-coding regions of the genome with likely regulatory functions, and the interpretation of these variants has been quite complicated. In fact, if we wanted the perfect in silico prediction of the functional and phenotypic effects of these non-coding variants, that would actually require pretty much perfect knowledge of cellular molecular biology and genome function. We are very far from this, when you think about what it would mean to see a variant and say: okay, this affects the binding of this transcription factor and the enhancer activity in this way, and that leads to this fold-change effect in the expression of a nearby gene, and perturbs this pathway, which then changes the cellular function, which leads to some physiological effect. We are extremely far from this, and we're not going to get there by 2030. So a pure black-box prediction, just taking genotype and getting to phenotype, is not going to work by itself. We do need those additional data sets and insights that I'll talk about today. Then, why do we care about these predictions anyway? Of course, there is the big goal of pretty much all biomedical research: being able to provide better diagnosis and treatment to individuals who suffer from some disease. The traditional medicine paradigm is that you basically have phenotype data, and from that you make some inference of what would be the appropriate diagnosis and treatment. The precision medicine paradigm adds genotype and environmental data to this, to hopefully provide better diagnosis and treatment. And then something we could perhaps call precision molecular medicine also incorporates gene regulatory readouts, either chromatin state or RNA sequencing for gene expression, et cetera,
to have an even better insight into what is going on and what we can do about it. So what is the status quo here, and does this kind of precision molecular medicine framework actually work? In the rare disease space, we are in a situation where the glass is half full when it comes to genotype-to-phenotype predictions. Exome or genome sequencing can now lead to diagnosis in about half of rare, severe Mendelian disease cases. That's fantastic; this is extraordinary success that has absolutely changed and saved the lives of many, many people. But of course, 50% is not 100%. The glass is still half empty, for various reasons: detecting more complex structural variants, dealing with more complex genetic architectures, but also identifying variants as disrupting gene function or dosage. Not every disease-causing variant is a premature stop codon variant that we can annotate quite easily; it can be quite complex. And here, there is pretty decent data that RNA sequencing can help. The basic problem is that, to be able to really do genetic diagnosis in a rare disease situation, we need to have two things. In the situation that works quite easily nowadays, you can identify, just based on the good old genetic code, the gene-disrupting variant in the coding region, and then you can also put that in the spectrum of population variation in that gene. Tom referred many times to the fact that we need to understand the normal to be able to understand disease, and that is absolutely the basis of these rare disease studies. That would then help you say that this patient carries a variant that is an outlier in the population and likely, or at least potentially, contributes to disease.
But when it comes to variants that affect gene expression or other traits related to gene regulation, first of all, it's difficult to identify those variants, and it's also difficult to have a sophisticated framework for what the spectrum of normal variation is in terms of, let's say, gene expression. Here, splicing analysis has been one of the early cases of success, where in RNA sequencing data it can be quite clear, as illustrated here, that there is an aberrant splicing pattern in a patient that is absent in a number of controls, and this may help to identify, for example, intronic variants that would be quite obscure based on genetic data alone. We and others have also pushed this further in terms of identifying variants that may affect gene expression levels: we used healthy population RNA sequencing data from GTEx and allele-specific expression to draw, for every gene, the spectrum of how much that gene's expression varies in the normal population for genetic reasons. Then one can go to a patient, actually place the patient in the spectrum of that normal population, and identify outliers. We've shown that this can have high specificity and sensitivity in muscular dystrophy and myopathy patients, and now we're working on applications in clinical heart disease as well. This framework, together with others, has been incorporated into analyses that really try to use many different types of transcriptome readouts to better interpret rare variants.
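The outlier idea just described can be illustrated with a minimal z-score sketch: place the patient's expression value in the reference population's distribution for each gene and flag extreme deviations. The genes, values, and cutoff below are hypothetical, and the published methods (which also exploit allele-specific expression) are considerably more sophisticated than this toy version:

```python
import statistics

# Sketch: place a patient's per-gene expression in the spectrum of a healthy
# reference population and flag z-score outliers. All values hypothetical.
reference = {
    "DMD": [8.0, 8.3, 7.9, 8.1, 8.2, 8.0, 7.8, 8.1],
    "TTN": [11.0, 10.8, 11.2, 10.9, 11.1, 11.0, 10.7, 11.3],
}
patient = {"DMD": 4.1, "TTN": 11.0}  # DMD expression far below the reference

def expression_outliers(patient_expr, ref, z_cutoff=3.0):
    """Return genes where the patient is a z-score outlier vs the reference."""
    outliers = {}
    for gene, value in patient_expr.items():
        mu = statistics.mean(ref[gene])
        sd = statistics.stdev(ref[gene])
        z = (value - mu) / sd
        if abs(z) >= z_cutoff:
            outliers[gene] = round(z, 1)
    return outliers

print(expression_outliers(patient, reference))
```

In this toy data the DMD value is flagged as a strong negative outlier while TTN sits squarely in the normal range, which is the kind of signal that points a diagnostic analysis toward a regulatory or splice-disrupting variant in the flagged gene.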
We were part of an analysis using the most recent GTEx data set, a healthy population cohort with RNA sequencing and whole genome sequencing data for multiple tissues, looking at the different types of effects that rare genetic variants have on transcriptome traits. And in a very interesting preprint that came out, I think, a week ago, it was shown that transcriptome data can give you a 16% boost in diagnosis rate over whole genome sequencing alone, detecting many different types of perturbations. I think these examples and insights prove that transcriptome data is already useful in clinical genomics, and that this bold prediction is already becoming true in that space. However, in complex disease prediction, unsurprisingly, the situation is more complex. For disease or phenotype prediction in complex disease, the main method now being used and studied is polygenic scores, which have a lot of promise, but there are still a lot of open questions about their clinical use: do they work, when do they work, and when are they good enough to be actually medically meaningful? One very major problem is the various biases that these scores can have in terms of their transferability, for example across ancestries and also across other groups. There is a lot of exciting new research showing that some of these biases can potentially be overcome by overlaying genetic associations with regulatory elements, thus getting better insight into the causal variants and avoiding some of the biases caused by linkage disequilibrium. I think there is a lot of potential there.
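For background on the polygenic scores discussed above: in its simplest form, a polygenic score is just a weighted sum of effect-allele dosages over the variants in the score. The variant IDs and effect weights below are invented for illustration, not taken from any real GWAS:

```python
# Sketch: a minimal polygenic score as a weighted sum of risk-allele dosages.
# Variant IDs and per-allele effect sizes are hypothetical.
effect_weights = {"rs0001": 0.12, "rs0002": -0.08, "rs0003": 0.30}

# Genotypes coded as effect-allele dosage: 0, 1, or 2 copies.
person = {"rs0001": 2, "rs0002": 1, "rs0003": 0}

def polygenic_score(genotypes, weights):
    """Sum of dosage * effect size over the variants in the score."""
    return sum(weights[v] * genotypes.get(v, 0) for v in weights)

print(round(polygenic_score(person, effect_weights), 2))  # 0.16
```

The transferability problem mentioned in the talk arises because the weights are estimated in one population: linkage disequilibrium and allele frequencies differ elsewhere, so the tagging variants and their apparent effect sizes carry over imperfectly, which is why anchoring scores on likely causal, regulatory-element-overlapping variants can help.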
And I think that, going forward, the incorporation of tissue- and cell-type-specific functional information into polygenic scores could potentially help to partition complex disease risk into distinct components. Most complex traits, if you take, say, type 2 diabetes, can be caused by dysregulation or malfunction of many different organ systems, and being able to partition different individuals' disease risk, in terms of, say, you have a problem with your lipid metabolism, or you have a problem with your insulin metabolism, et cetera, could have a lot of potential. We're not exactly there yet, but let's see; maybe by 2030. There is also a very exciting area, at least in my opinion, in using these molecular phenotypes to incorporate genetic and environmental risk. As I mentioned a few slides ago, the heritability of complex traits is far from 100%; there are major environmental effects in complex disease. If we were actually able to incorporate those risk factors into the same framework as genetic factors, this could be very powerful. The whole idea of using genetic data to develop drug targets is based on the same paradigm: that genetic and environmental risk factors are partially mediated by the same molecular pathways. And here, transcriptional and epigenomic readouts can really help, because they should, or could, capture both types of effects, unlike genetic data alone, and this could be one of those things that actually makes the prediction we're talking about a reality. There are some interesting early studies with promise, for example showing that RNA sequencing data can inform on an upcoming flare in rheumatoid arthritis, actually driven by a specific cell type.
And also looking at case-control differential expression in GWAS genes, where a large part of the observed differential expression is much too big to be driven by the genetic variants alone — there are other factors driving it as well. So there are attempts, and interesting frameworks being developed, but much more data is needed. I also think there is major potential in leveraging the ability of genetic data to pinpoint causal disease mechanisms, and then thinking about the environmental component — the modifiable component of disease risk — to potentially develop better interventions. Tom talked about this at length, so I'm not going to go super deep into it, but it should be clear to all of us that to understand disease we really need data on what is normal. The kinds of resources that the genetics and genomics community has been building have enabled a vast amount of work that we now often take for granted: there would be no GWAS without HapMap, no whole-genome sequencing studies without 1000 Genomes, no ExAC for rare disease studies — and ENCODE, GTEx, the Human Cell Atlas, and so on, really built that foundational understanding of the regulatory genome. There is much more work to be done in this space, just to create more sophisticated data on the various types of molecular functions that vary in human populations, and thus empower specific disease communities to use these data to explore specific questions. However, there are some major issues that we really have to address as a community if we want to use these resources to their maximum ability. One of them is that the population diversity captured by these resources is very limited at the moment.
The 1000 Genomes Project only explored a couple of handfuls of global populations, and GTEx captures something like the average American diversity, but we are far from being able to really characterize population diversity in functional genomics data. As the GWAS community is now very rapidly building bigger and bigger resources of genetic variation across the globe, and of its contribution to diseases, we need to make sure that functional genomics data sets are also there to help interpret and analyze those data. A related issue is that data availability, dissemination, integration, and visualization are seriously difficult problems for functional genomics, because these are messy data sets in a way that genetic variation data is not, with major batch effects and other integration issues. Unless we are actually able to bring these data sets together and disseminate them to the community — making them available across the globe and across consortia — we are really shooting ourselves in the foot and failing to leverage the power of these resources. So this is also a major area where we must invest as a community. Another area, which Tom also referred to, is that we simply need more data: we need to scale up sample sizes for multi-omic data sets, especially in the complex disease and trait space, though I'm thinking mostly of normal populations. If we have learned anything from the history of human genetics over the past 25 years, it is that in the early days of GWAS and related efforts, discovery was a struggle, but when sample sizes became sufficient and there was actually good statistical power, amazing discoveries started to emerge.
When we think about molecular data sets at population scale — studies like GTEx and so on — while they are big, they are far from the tens of thousands or even hundreds of thousands of individuals needed for really well-powered inference. There are some attempts to fix this: TOPMed is producing a lot of RNA sequencing data, mostly from blood samples. In my lab we have been working on an interesting project, which we're now wrapping up, where we have tested four types of non-invasive samples for RNA sequencing, using a low-cost Smart-seq2 library preparation protocol that was initially developed for, and is much used in, the single-cell space. We collected hair follicles, saliva, buccal swabs, and urine from a number of donors and did RNA sequencing of these. The exciting thing is that, compared to standard top-quality RNA from a cell line, we can get almost comparable data from hair follicles and from urine, despite the very small numbers of cells and the low starting material. With these modern library preparation methods we can get excellent-quality RNA sequencing data from samples that capture cell types that the typically collected blood samples cannot. We have shown that the hair follicle data is very closely related to skin, and in urine and buccal swabs we get mucosal tissues, which makes sense; there is also some kidney signal in urine. I think there is a lot of potential in these sample types, but also obvious technical challenges: buccal swabs sometimes work great, sometimes not, and saliva is absolutely terrible — our counterexample showing that this does not always work easily. Do not try saliva RNA-seq at home.
One of the challenges, thinking even further ahead, is that eventually we may need to push these types of data sets to single-cell resolution, and actually think about how to do single-cell RNA sequencing in thousands and thousands of samples from disease-informative biospecimens. The non-invasive, swab-and-poke type of samples we have been taking thus far is not going to give us, say, brain molecular phenotypes. This is an area where we clearly need to invest as a community. And just to switch gears a little bit for the last couple of slides: I want to make the point that prediction is really not enough. Even if we had the perfect black box to predict variant to phenotype, we would still want to understand mechanisms. Tom made a relatively compelling argument that this kind of black-box prediction of variant to phenotype is simply not going to work, especially in complex disease — we will not be able to build that box. And also, we are scientists; we should be interested in mechanisms, and in understanding how and why certain genetic variants affect molecular and cellular functions in a way that contributes to disease phenotypes. If we want to actually develop interventions — drugs and other types of interventions — to do something about this, then we need to understand those mechanisms. Luckily, we have a very rich set of data to pursue different layers of mechanistic questions: what are the causal variants; how do they affect, say, transcription factor binding or enhancer activity; what are the target genes in cis; what are the target genes, pathways, and networks in trans; what are the relevant cellular types and states; and even further towards physiological phenotypes and cellular functions.
I think it's just going to be an extremely exciting time for us, using these different types of approaches — the large-scale multi-omic data sets that we have been building and will continue to build — and also incorporating experimental perturbations of the genome and its function with tools like CRISPR. I really strongly believe that no single approach is going to be a silver bullet; all of these approaches have their unique advantages and disadvantages, and it's only with integrated approaches that we can really build a good understanding of genome function. I want to mention a quick example of this type of work, from a very recent preprint from a collaboration with Neville Sanjana's lab, led by our postdoc John Morris. We basically took blood trait GWAS data, integrated it with ENCODE and other data on potential regulatory elements, together with fine-mapping, then did CRISPRi inhibition of those putative regulatory elements harboring potentially causal GWAS variants, and then RNA sequencing to see which genes are affected by silencing these elements. In terms of identifying the cis target genes at these loci, we were — or I was personally — surprised by how well this worked: for 42% of the loci or variants that we tested, we actually discovered a significant gene in cis, and the vast majority of these loci lacked an eQTL signal, showing that we're really discovering something complementary to the data we had before. A particularly exciting example for us was to see that, in addition to capturing target genes in cis — which has been a major challenge for GWAS — we can also get at the more complex question of affected pathways.
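At the analysis level, "discovering a significant gene in cis" boils down to asking whether a candidate gene's expression differs between cells carrying the CRISPRi guide and control cells. A generic sketch of such a test — a simple permutation test on invented expression values, not the statistical method used in the preprint:

```python
import random

def permutation_pvalue(perturbed, control, n_perm=10_000, seed=0):
    """Two-sided permutation test on the difference in mean expression
    between guide-carrying and control cells."""
    rng = random.Random(seed)
    observed = abs(sum(perturbed) / len(perturbed) - sum(control) / len(control))
    pooled = list(perturbed) + list(control)
    k = len(perturbed)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # break the guide/control labels
        diff = abs(sum(pooled[:k]) / k - sum(pooled[k:]) / (len(pooled) - k))
        if diff >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids p = 0

# Invented data: silencing the enhancer lowers the candidate cis gene.
guide_cells = [2.1, 1.8, 2.0, 1.7, 1.9, 2.2, 1.6, 2.0]
ctrl_cells = [5.0, 4.7, 5.3, 4.9, 5.1, 4.8, 5.2, 5.0]
p = permutation_pvalue(guide_cells, ctrl_cells)  # small p: likely a cis target
```

Real single-cell CRISPR screens use far more cells per guide, model count noise explicitly, and correct for testing many gene-by-element pairs.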
For one very interesting locus with the GFI1B transcription factor, we actually had two GWAS loci — one intronic and one in a downstream enhancer — that both affected the expression of GFI1B in cis, and both also had a major effect on gene expression across the genome, with the stronger enhancer having hundreds of significant gene targets genome-wide. Those target genes were organized in a network with three specific clusters that seem to represent different functional components: one cluster representing more of the direct targets of this transcription factor, and another — cluster C here — that seems to have something to do with heme biosynthesis, which is consistent with this transcription factor having a major role in blood traits; we were studying specifically blood trait GWAS. An exciting thing was that we saw very specific enrichment of independent GWAS hits from across the genome among the target genes of this GFI1B GWAS locus, suggesting there may be a convergence of independent GWAS effects for the same traits in specific cellular pathways — which may then be particularly interesting for the cellular biology behind the traits being studied. So, to wrap up: what does the future look like? I believe that with these kinds of approaches, and by addressing the challenges that both Tom and I have talked about, we can really incorporate molecular traits as a part of precision medicine, and improve our understanding and treatment of personalized disease risk that is driven by genetic and environmental factors.
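An enrichment of GWAS hits among a locus's trans target genes, as described here, is commonly quantified with a hypergeometric tail probability. A generic sketch with invented counts (not the statistics from the preprint):

```python
from math import comb

def hypergeom_enrichment_p(n_genome, n_gwas, n_targets, n_overlap):
    """P(overlap >= n_overlap) when n_targets genes are drawn at random
    from a genome of n_genome genes containing n_gwas GWAS-hit genes."""
    total = comb(n_genome, n_targets)
    p = 0
    for k in range(n_overlap, min(n_gwas, n_targets) + 1):
        p += comb(n_gwas, k) * comb(n_genome - n_gwas, n_targets - k) / total
    return p

# Invented numbers: 20,000 genes, 100 of them GWAS hits, 200 trans targets,
# 10 of which are GWAS hits -- far above the ~1 expected by chance.
p = hypergeom_enrichment_p(20_000, 100, 200, 10)  # strongly enriched
```

The same calculation underlies most gene-set enrichment tools; in practice one would also match target and background genes on expression level and other covariates.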
We will be developing deep insights into the molecular and cellular etiology of human traits, using both observational population studies and perturbation studies with experimental tools. All of this requires that we build a sophisticated toolkit for highly informative in silico inference, so that our priors and predictions are as accurate as possible — when you observe a variant of interest, or you have a GWAS study and your GWAS loci, you can make good predictions of what the functional mechanisms being perturbed, or the functional effects of these variants, might be — and also a sophisticated toolkit for experimental follow-up of those discoveries. With that, my small addition to the prediction would be that, in addition to using the features of the epigenetic landscape and transcriptional output to understand and predict genetic effects on phenotype, I would also want to understand genetic and environmental effects on phenotype, and thus build a holistic understanding of the diversity of human traits and the underlying genetic and molecular processes. One of the venues or organizations trying to do this is the International Common Disease Alliance, launched somewhat recently, where we — especially in the mechanisms working group — are asking the same types of questions we have been talking about today. And with that, I'd like to thank the many people in my lab, current and former members, my collaborators in various consortium projects, other collaborators and ICDA colleagues, and our sources of funding. Thanks very much.

Thank you very much, Tuuli. Both of your presentations are very impressive, and as Tom mentioned, my impression is that it is quite challenging — even daunting — to fulfill this bold prediction.
It seems to require a multidisciplinary approach, rather than just sequencing a genome — that is only one dimension — whereas here you're talking about so many different levels. So my quick question is: do we need a new technology, a new approach that is several orders of magnitude faster and more comprehensive? Because for each person you cannot just do one sequencing run; there are so many different tissues, so many cell types, and so many levels of assays to do. So I'm wondering: using current technology, can we get it done in 10 years, or even 50 years?

If I start: in terms of assays to analyze molecular phenotypes, whether the transcriptome or epigenomic features, there are obviously going to be advances — say, long-read RNA sequencing and direct RNA sequencing are going to be very important — but I don't really see those as the major bottleneck. Something that would really change the game is if we actually had good, cheap, fast, practically feasible ways to differentiate cells — basically, to take cell samples from an individual and obtain other cell types from them. Because things like brain biopsies are just never going to be a popular thing to do, and for several diseases and phenotypes we simply don't have accessible cell types to analyze.

I would add to that — which I think is quite correct — two things. One is that I think we're moving in this direction with the understanding that a human or an animal is quite a complicated machine: it is more than the sum of its parts. And therefore I think there is going to be an effort to do two things.
One is to get as much information in situ, in the organism itself, as we can, because that's where the interactions with other cell types, and the effects of environment, are seen. In addition, I think we're moving in that direction because, with the development of organoid systems — where cells are placed into an environment with other cell types and allowed to architecturally re-form the structures they normally form in vivo — we just begin to approach a situation where the complexity of an organism begins to reveal itself. So again, across these different methods — RNA sequencing, methylation determination, protein analysis — as much should be done in situ as possible, and in a situation that is as naturally interactive as possible. I think that's where our largest progress is going to be made.

Thank you. Another question I have is about the so-called normal. I'm quite impressed that both of you mentioned it as a key component as we move forward: if you cannot define normal, how do you correlate it with disease states? And here the normal is probably a very large range — I'm thinking in physiological terms, blood pressure for example, or blood cell counts — all of these will be a range. And again, thinking about different populations, there will be variation, so it will not be a single normal; maybe not a hundred, maybe not a thousand. So we have to consider all these factors, right?

I think that's absolutely correct.
Normal isn't a cliff you fall off at some point. Even when you're dealing with a spectrum of values, it isn't as if — as in the case of blood pressure — once you get above this number you're not normal anymore, because we get these differences in physiology depending on what's going on with the organism itself. So I do believe that's right. But I also think that normal behavior, or normal activity of genes and their modifications, has to go back to this idea of interactions. The context in which the sample being evaluated sits is everything — that's why you have this range, because the cell or the organ or the tissue is responding to the conditions it can sense. So again, I think the complexity is made even more daunting by the fact that we are doing these experiments on individual cells, or a collection of cells in a tissue, or an even more heterogeneous collection in an organ. That is going to be somewhat more problematic, because we're going to have to rethink or reconsider all of that data once we have systems that are more lifelike.

Yeah, I agree with all of that. I would add that when it comes to the definition of normal as it relates to different populations, different environments, and all of human diversity: it is certainly important to characterize and understand that diversity, so that we don't think some very specific population somehow represents all of humanity. However, it also happens quite easily that people focus on the differences and not on the similarities. When it comes to the molecular function of human cells, or human physiology, a lot of it is shared across all humans.
There are just a lot of shared components that we must keep in mind, so as not to over-emphasize the differences — which we should still learn about and appreciate.

In terms of accessing large databases, has that been thought about — collaborating with All of Us, for example?

Yes, I think that's very interesting. There's an interesting RFA out looking at diet, in which thousands or tens of thousands of people have volunteered to be part of a diet study, to see what a precision diet looks like for an individual. Those studies are — I wouldn't call them the wave of the future, but they are realistically trying to deal with numbers that are statistically significant, and with the variation of individuals in the population. I think that is really a step forward.

I think there are quite a number of questions, so we should give them a chance, and then if I have time at the end I can ask you additional questions. So Chris, would you please read the questions?

Yes, definitely. There's a lot of appreciation for your talks — so thank you again to Tom and Tuuli for talking to us today. You've got a number of questions about what normal is, so we're going to come back to that even though you've addressed some of it. I want to start with one of our most recent questions: how do you think clinical researchers can specifically contribute to achieving this bold prediction?

Obviously, the first thing that comes to mind for a biologist is: by giving us samples. But it is a much more nuanced and complex question than that, although the sample-access question is important, and that is something where we absolutely must work together.
I don't have an MD; I have no access to go and poke at living individuals — well, except for our non-invasive study, because it's not invasive: we were actually able to collect those samples without medical involvement, with the blessing of an IRB. But when it comes especially to moving towards the precision medicine space — thinking about implementation, and about effect sizes that are not just p-values but matters of medical importance; say, for PRS, at what point does some difference in risk become medically meaningful — I think there needs to be a very serious dialogue between basic researchers, genomicists, biologists, and medical practitioners.

I would like to add to that. I've had a really remarkably lucky set of interactions with many clinicians, and the thing that has affected me most is the amount of — not data, exactly, but observed behaviors — that they use in their clinical diagnoses: things that seemingly are not important when we talk about molecular biology, but that make sense once we understand the molecular biology of what's going on in a particular condition. For example, people who are ill with a particular disease become hard of hearing — a symptom like that may make no sense to a molecular biologist interested in cancer; why is that important? Somehow these kinds of clinical observations, usually acquired in the course of rounds, in the course of dealing with patients, and passed down from one generation of doctors to another — these pieces of information are invaluable in some ways, because they offer the opportunity to form a model of what that phenotype is, compared to the molecular processes that are going on.
I think that is one way in which the interaction between clinicians and people who work at the bench has been, and could continue to be, very valuable.

And that's an area where patient groups contribute as well, right — by helping you define and determine the symptoms and everything. Yeah, I think those are great answers. So, getting back to the question of normal: a lot of the questions you're getting are along the lines of what Tuuli said earlier, that people focus a lot on the differences. One question from the audience: what can — and what must — we do as a community to ensure that data are collected along multiple axes of diversity? How can we do better at being inclusive of individuals? Since you're both involved in big projects, maybe you could talk about that.

Yeah, again, it's one of those situations where the limitation is often based on access. If you're part of a large project, the ability to recruit basically defines what kinds of diversity you can bring into that situation. That's why I think the example of All of Us is very valuable, because it starts with the premise of needing very large populations. When you start with that premise, it offers the opportunity to say: we're not limited by the number of people — that's not the concern. The question becomes: if we need 100,000 people, how do we afford that, how do we organize it, how do we have the right controls? Those become the more relevant questions, because you start with the premise of needing as large and diverse a group as possible.
Yeah, and I could add to that in terms of a global perspective and building these big projects. One of the things ICDA is working on is at least somewhat unified consent, recruitment, biospecimen protection, and other protocols, to make it easier for investigators in different countries to collect data that is interoperable and integratable, so that it can actually be used in bigger and bigger studies. In addition to doing better outreach and incentivizing minority populations — say, in the US — to participate in medical research, we also need to lower the barriers for investigators in developing countries to engage in these kinds of studies, whether that means resources, protocols, or access to, let's say, the inner circles of where science happens.

Yeah, I agree — such important points. So, another question about normal; we got two related questions that I'm going to put together. The first is from Laura, who asks: what is the relevance of age in defining a normal phenotype? Should age be stratified to control for expression changes that might occur over the lifespan? And Mark asked, for Tom in particular: because brains at advanced ages are rarely truly free of pathology, should we really be thinking of normal not as a dichotomy but more as a quantitative trait? So can you specifically address age — and it sounds like, for some folks, also the brain.

Yes, I can say a couple of things about age, since we've been studying it quite recently in one of the TOPMed cohorts, where we actually have samples from two time points 10 years apart, and we've been looking at genetic effects interacting with age — and it's complicated.
It's also probably one of those areas where it's not going to be enough to just have molecular phenotypes from a complex tissue sample — in this case blood cells — because cell type composition varies with age, and it also varies between phenotypes, sexes, and so on, and that explains a major component of the differences. I think this is one of those areas where insight into cell type composition will be really crucial to actually understand what is going on; even the most sophisticated molecular assays reading those molecular phenotypes will easily lead you astray unless you understand the cell type context.

I think the question is a good one. In the case of age — we're all living much longer, more prolonged lives — we have to think about this as a quantitative continuum, where our comparisons are within a stratified group. And it goes back to the question: within that stratified group, what is operatively normal? It may not be normal in any other stratification, but in that group, it's normal. Maybe the functionality of that group is not the same as others — you don't remember as well, or you're not as rapid a thinker as you get older — but within that stratified group, that's normal; it's not different. And therefore, as was suggested, there is a continuum, and it very much depends on age, and also on the phenotype — the manifestation of whatever phenotype we're talking about. I think that is important to understand. And in terms of what Tuuli said earlier, it's the similarities that will also mark how much deviation is going on within a stratification. So if, in that stratification, there are X number of biochemical and molecular processes being monitored, and most of them are similar
and others are not, then we know where to look to see whether that constitutes being non-normal, because we now have a place to look within a larger population. So I would say that's probably it.

Yeah, that's great, thank you. So again, we have two questions which are related to each other; they're both about identical twins. The first: what is the level of correlation of gene expression in identical twins versus unrelated individuals? And the second: could the order of exposure to environmental factors in identical twins affect their phenotype?

I'm reminded of the papers — they must be about 15 years old now — in which Spanish groups were studying identical twins. It was remarkable, because it was very clear that identical twins had very similar DNA modification and expression profiles at a very young age, but as they got older, and particularly if they were separated, that similarity broke down quite considerably. In some ways it seemed to be related to the fact that different behavioral environments — different habits and behaviors — have their own effects on the physiology and genomics of the individuals. So in large measure, if the individuals are in the same environment and have very similar exposures to the prevailing risks, they will have very similar reactions, because the groundwork has been set up for those reactions to happen. If they see a different set of environments — and that's not only external environments but internal ones: the foods they eat, the cleanliness of their surroundings, the infections they acquire, and so forth — all of that will lead to variation, even among identical twins.
So in large measure, it's a system that has been used very much to emphasize the importance of environmental change on the overall molecular behavior of the individual.

Yeah, I don't think I have much to add to that. What was the second question, Chris?

The first one was about how much correlation there actually is in twins versus unrelated individuals, and the second — which I think Tom addressed — was whether the order of environmental exposures would affect their phenotype.

Yeah. Here again I want to emphasize the importance of cell type composition: if you take blood samples from one twin and the other, and one of them had a cold a couple of weeks ago, there are going to be differences in cell type proportions that will manifest as differences in gene expression levels. We know, for example from GTEx, that the major source of gene expression variation is differences in cell type composition. So if you had a very specific cell type extracted from both twins, I think the correlations would go even higher than what we know from most studies thus far.

So specificity would be the answer, although you'd want information about what the cell type percentages were at the time. Yeah. So maybe one more question before I turn it back over to Paul. This is from Dina: closing the genotype-phenotype gap requires integration of functional data to recapitulate real-life disease pathology. How can we feasibly achieve this, especially for complex diseases? Ending with the easy question there.
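(The cell-type-composition adjustment described in the answer above can be sketched as a simple regression: model a gene's bulk expression as a function of an estimated cell-type proportion and keep the residuals. This is a minimal, invented illustration with one covariate; real analyses deconvolve many cell types at once.)

```python
def residualize(expression, proportion):
    """Residuals of bulk expression after ordinary-least-squares regression
    on a single cell-type proportion (one covariate plus intercept)."""
    n = len(expression)
    mean_x = sum(proportion) / n
    mean_y = sum(expression) / n
    beta = (sum((x - mean_x) * (y - mean_y)
                for x, y in zip(proportion, expression))
            / sum((x - mean_x) ** 2 for x in proportion))
    alpha = mean_y - beta * mean_x
    return [y - (alpha + beta * x) for x, y in zip(proportion, expression)]

# Invented data: measured expression tracks the neutrophil fraction almost
# exactly, so after adjustment little apparent variation remains.
neutrophil_frac = [0.40, 0.55, 0.60, 0.45, 0.70]
bulk_expr = [4.0, 5.5, 6.0, 4.5, 7.0]
adjusted = residualize(bulk_expr, neutrophil_frac)  # residuals near zero
```

A twin comparison made on the adjusted values would no longer be dominated by one twin's recent cold shifting blood cell proportions.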
Well, like the table that I had, with those five-by-six combinations, I only got to the cellular level there and didn't even address intermediate functions, like insulin secretion, and physiological phenotypes. So yeah, I think we just need to do a ton of work using all kinds of approaches that will be complementary, including some of the future kinds of approaches that we've discussed today. We don't have the pipeline, the toolkit, the set of methods that will get us there. And there is a bunch of approaches, but exactly which is best in different types of settings is not entirely clear. This is also something that ICDA is working on: taking a set of flagship diseases, example diseases, trying to take those apart, and then seeing if we can use that to develop generalizable lessons on how to do this across a very diverse set of traits and diseases. And as I said, I think this will include both observational population-type studies, where you collect cells from actual individuals, and different types of model systems.

Yeah, thank you. And Paul, I'll turn it back to you, unless Tom wants to add something.

No, I'm fine. Thank you. Sure. Always.

So, maybe a final question or comment. Again, we're talking about differences versus commonality among populations, diversity and things like that. My question for this bold prediction is that there seems to be a dichotomy, or some kind of tension, between the two. On one side, I agree: with big data you look for population trends, the normal range, all of that. You want to look at what's in common, and at how you define a disease state versus a normal state.
But when you apply this to precision medicine, for example, then you really want to look at each patient as a unique person: a unique genotype, a unique set of environmental factors. How do you do this efficiently for each patient, especially with so much data, and how do you apply it to an individual from a clinician's point of view?

Paul, I need a little help on this. Is the question that the data types are large and diverse, and how does a clinician handle them?

Yeah, how do you do efficient data collection for each patient? And also, how do you apply knowledge from the population point of view to a single patient?

Yes, yes. You know, I think clinicians have a lot to teach us here, because they do start with the supposition that each individual is unique, and that what they learned in medical school could be contradicted by this individual in some very tangible way. That approach is probably a lesson molecular biology should pay attention to. That is to say, we tend to treat things in a much more communal sense, because we're looking for bottom-line answers, bottom-line explanations. And when we go back and look at the individual, the things we would look for to explain a particular clinical state would usually start with those general bottom-line summaries. But then what molecular biology can offer is the variation that is seen, in terms of the frequencies observed and where variants occur in the genome, as alternatives to be added to those bottom-line summaries.
And that's an approach which, I think, requires a process by which we as molecular biologists can provide information about the phenotype-genotype relationships we encounter, in addition to the ones that have been well characterized. I think that kind of information is probably the way to translate or synthesize information that is very complex and very numerous.

Thank you both very much for your wonderful presentations and insightful comments and discussions. I'd like to thank the audience; thank you, Chris, for designing this whole process and the seminar series; Susan for the admin support; and Gerald, William, and Aviral for IT support. Thank you very much, and have a good day.

Thank you very much.