Thank you for the opportunity to present. Well, the orange is not intentional. I didn't know that the MPC color was orange; I thought it was seasonal. So the time is ripe for us to work together, I guess. That's how we can interpret this auspicious sign.

I'm Aleksandar Milosavljevic from Baylor College of Medicine. My group and the Stanford group develop infrastructure for the ClinGen project, and I'll be talking about a core element of that infrastructure, the Allele Registry, which helps us integrate information that's important for curating variants for their pathogenicity, to inform clinical interpretation of the human genome. Another domain of ClinGen knowledge is gene validity, which is where COMP more directly contributed in the past; but as we've seen, the trend toward linking sequence-level variation is probably going to be important going forward. And the third domain of ClinGen knowledge is actionability knowledge.

So what's the problem we are solving? It looked like a non-problem initially, but it emerged as actually the core problem: how do we integrate information about a variant that comes from different corners of the world unless we use the same name? The name is a necessary condition for that. The problems are actually deeper than that, but let's start with the name. Now, isn't a genomic reference enough to solve this problem? Well, which reference? Geneticists actually use transcripts, many of them, with half a million available in the major sources. And the genome is becoming more of a graph, and there are multiple versions of it. So a reference alone does not solve the problem. But even if we agreed on a reference, we would still have problems with variants more complex than simple SNPs, say indels, which do not actually have unique HGVS expressions to agree on. The literature is full of examples of variants described using different HGVS expressions that nevertheless correspond to the same CA ID, the identifier of the Allele Registry we'll be talking about. The paper on the Allele Registry will actually appear in 10 days in the special ClinGen issue of Human Mutation, and there will be a number of talks at ASHG, including on the Allele Registry.

So the Allele Registry is not just about naming; it actually allows linking of information about genetic variants. For about 50 years now, the idea of linking information has followed the data warehousing model, where you put the data together in one place. As part of this warehousing, you actually need to address the identity problem: how do you know that two different parties are talking about the same thing? This deduplication is solved separately at every warehouse.

We are actually moving to a new world now. Starting probably about five years ago, a huge trend emerged on the web, which is called linked data. Tim Berners-Lee, the founder of the web, defined this new set of technologies, called linked data technologies, about 10 years ago. They languished for about five years, but then the big search engines adopted them, and you're starting to see changes: a Google search now gives you not just links but an info box at the top of all those links, telling you something about the topic you are interested in. The way it works is that Google now pushes websites to provide their data using these linked data technologies, and it puts the pieces together. And this is a kind of graph.
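To make this concrete, here is a minimal, hypothetical example of the kind of linked-data markup a website can publish so that a search engine can merge the page's subject into its knowledge graph. The schema.org vocabulary is real; the entity, names, and URLs below are invented for illustration.

```python
# A minimal, hypothetical JSON-LD document of the kind a website can embed
# so that a search engine can merge the page's entity into its knowledge
# graph. The schema.org vocabulary is real; the IDs and URLs are invented.
import json

doc = {
    "@context": "https://schema.org",               # shared vocabulary
    "@type": "MedicalEntity",
    "@id": "https://example.org/variants/demo-1",   # globally unique node ID
    "name": "Example variant page",
    # "sameAs" links are the edges that let independent sites contribute
    # statements about one and the same entity.
    "sameAs": ["https://another-site.example/entry/demo-1"],
}

print(json.dumps(doc, indent=2))
```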
You can imagine every site adding a link to this knowledge graph, which is then presented to the end user. So Tim Berners-Lee started this 10 years ago. He was the inventor of the web, right? And believe it or not, he was not actually happy with the impact of his project. He said, well, this is not exactly what I meant; I meant something bigger. So what is it? Instead of a web of hyperlinked documents, where people can go from one site to another, he envisioned a web of linked data, where software agents can go to different sites on one's behalf, put the pieces together, and present something like Google's info box. I think the web of linked data is now where the web of hyperlinks was in the early 90s. So during the lifetime of the COMP project, if you're planning 10 or 15 years down the road, you'll see a revolution in how data is integrated similar to the revolution of the web, even though that's hard to imagine, right? But we are working toward this vision, and we are using these technologies to integrate knowledge about genetic variants, which is such a challenge that one can't imagine a single database, a single warehouse, solving the problem, right? So these technologies come right in time for us, the same way the web came right in time for the genome project: one can imagine that the impact of the genome project would have been much smaller without the web emerging at the same time. This linked-data knowledge about genetic variation is being created concurrently with these other technological revolutions, which we would like to harness.

This was a ClinGen consortium effort, and it involved a couple of years of actually deciding what a variant is. So believe it or not, design decisions are involved in this, not just implementation. Basically, we went with the concept of a canonical allele, which is independent of its definition in a genome or transcript context, with separate concepts for protein- and nucleotide-level alleles. The registry implementation involved a lot of engineering to put half a million different transcripts in memory so that identity can be resolved quickly based on sequence-level alignment. So the registry contains many transcripts, whole-genome assemblies, and amino acid sequences, all aligned, and variants from all major sources are registered.

The user interface is actually a minor part of it; the registry is meant mainly to serve as infrastructure for other software. But it does have its own user interface, where you can enter any name of a variant and find its other names, locations, transcripts, and such. If you've found an incomplete description of a variant in the literature, say an amino acid change plus a gene name, which is typical, you can find all the matching variants and pick the one you think was being referenced. This is the big, complete picture of the Allele Registry's implementation: one can work not only with individual variants but with large numbers of variants, millions of them, in VCF files or as HGVS expressions or other identifiers. And the registry provides two services, lookup and registration. If no one has seen the variant before, you'll actually get a new identifier for it, so that somebody else seeing the same variant will use the same identifier.
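As a sketch of what that lookup looks like in practice, assuming the registry's public HTTP endpoint at reg.genome.network: the two HGVS strings below are placeholders for two different spellings of the same allele (recall that indels in particular lack unique HGVS expressions), so substitute your own. If both spellings denote the same canonical allele, both records share one "@id" ending in the same CA identifier.

```python
# A minimal sketch of looking up a variant in the ClinGen Allele Registry
# by HGVS expression. The reg.genome.network endpoint is the registry's
# public API; the two HGVS strings below are placeholders for two
# different spellings of the same allele.
import json
import urllib.parse
import urllib.request

BASE = "http://reg.genome.network/allele"

def lookup(hgvs: str) -> dict:
    """Return the registry's JSON-LD record for one HGVS expression."""
    url = BASE + "?hgvs=" + urllib.parse.quote(hgvs)
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

# If both spellings denote the same canonical allele, the "@id" fields
# will match, ending in the same CA identifier.
for hgvs in ["NM_000518.5:c.20A>T",          # transcript-level spelling
             "NC_000011.10:g.5227002T>A"]:   # genomic spelling
    print(hgvs, "->", lookup(hgvs).get("@id"))
```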
So this is not a database where getting an identifier requires a lengthy process; the registry does it at a rate of 100,000 variants per second. So we envision registering every variant seen in every sequencing project in the world, even variants that are artificial, from functional assays, and also hypothetical variants from computational predictions. So the key is actually the bandwidth.

The registry has, as I said, a UI, but it also has APIs, application programming interfaces, and it follows the linked data technologies: JSON-LD, JSON for Linked Data, which allows all the knowledge to be represented essentially as a graph, in so-called RDF format, using ontologies and so on. Currently, we provide links to all major resources, but we also support on-demand linking, which means that not only can you register your variants, you can also register that you have something to say about a variant, so that somebody, a program, can go to your site and get that part of the graph, and somebody can see the big picture of knowledge about the variant.

As I mentioned, bandwidth was the main design goal here: dbSNP can be registered in 15 minutes, MyVariant.info in about 90 minutes, and so on; ClinVar in 40 seconds, since it's only hundreds of thousands of variants, right? The registry is being adopted in the sense that major databases, CIViC and ClinVar, are adopting the identifiers and putting links back to the Allele Registry for their users.

And we now have a number of use cases of linked data; I'll just go through one of them. Because we are linking information about amino acid changes with nucleotide changes, we can ask questions of this integrated data. For example, we can ask how many nucleotide variants in ClinVar cause the same amino acid change yet have different pathogenicity assertions; 31 variants fall into this category. Here's an example of two nucleotide-level variants that have different assertions: one is pathogenic, the other a variant of uncertain significance, despite the fact that they cause the same amino acid change.

So I'll show you a quick example of how this linked data can be visualized as a graph using Neo4j, a graph database, and its query language. If you look at these conflicting pathogenicity assertions as a graph, you'll see nucleotide variants and amino acid variants, and if you zoom in, you see the relation is "contradicts", because they have different assertions and they cause the same amino acid change. You can expand the graph and see what's going on, and you'll get information from allele frequency databases, from ClinVar, and so on, all contributing different edges. Now, this is a simple case: one variant has one pathogenicity assertion, the other variant another, but the other was evaluated later. Here is the contradiction between the pathogenicity assertions, and here is the difference in the evidence. Because this one has stronger evidence and was evaluated later, you may probably trust this one. You can also see the link toward the amino acid change, showing that they cause the same amino acid change, and you can have links to their allele frequencies. So in the context of all this information, you can actually figure out which one to trust.
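For the kind of graph query just described, here is a minimal sketch using Neo4j's Python driver. The graph schema below, including the node labels, relationship types, and property names, is assumed for illustration; it is not the registry's or ClinVar's actual data model, and the connection details are placeholders for a local Neo4j instance.

```python
# A sketch of querying a variant knowledge graph for conflicting
# pathogenicity assertions, in the spirit of the Neo4j demo. The schema
# (labels, relationship types, and properties) is assumed for
# illustration; it is not the registry's actual data model.
from neo4j import GraphDatabase  # pip install neo4j

QUERY = """
MATCH (n1:NucleotideVariant)-[:CAUSES]->(p:ProteinVariant)
      <-[:CAUSES]-(n2:NucleotideVariant),
      (n1)-[:HAS_ASSERTION]->(a1:Assertion),
      (n2)-[:HAS_ASSERTION]->(a2:Assertion)
WHERE id(n1) < id(n2)                        // report each pair once
  AND a1.classification <> a2.classification // e.g. Pathogenic vs. VUS
RETURN n1.caid AS variant1, a1.classification AS call1,
       n2.caid AS variant2, a2.classification AS call2,
       p.hgvs AS protein_change
"""

# Connection details are placeholders for a local Neo4j instance.
driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))
with driver.session() as session:
    for record in session.run(QUERY):
        print(record.data())
driver.close()
```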
A more complex case: three variants have one pathogenicity assertion and one variant has the other. But while this variant and that variant each have a single piece of supporting evidence, this one has multiple pieces of supporting evidence; moreover, there are three of them, right? So we would probably trust this one. But of course, you can examine the links to allele frequencies and other supporting information before evaluating it. In terms of links to mouse variants, I think the link between homologous amino acid changes in mouse and human needs to be established, as well as a mouse-database equivalent linking nucleotide and amino acid variants, so that these types of information can be collated and used for inference.

One important bit is, again, adding new layers of data. And we actually ate our own dog food, as the computer engineers say: we used our own project to test this concept of layering information on top of existing information, using our allelic epigenome project. This was part of the data analysis and coordination for the Roadmap Epigenomics Project, which started 10 years ago. Just one month ago, actually, we published the final paper, where we linked genetic and epigenetic variation by looking at allelic imbalances due to regulatory variation in the human genome, using information from 13 donors and about 50 whole-genome bisulfite sequencing experiments involving 27 tissues and nine cell types. The key element was really whole-genome bisulfite sequencing and looking at allelic imbalances that may be caused by variants in cis. The methylation of each CpG on every read was assessed in the context of the specific allele, with special focus on reads spanning heterozygous loci, where the heterozygous loci are either markers or actually the causative variants driving the change in methylation in cis (a minimal sketch of this per-locus readout appears after this passage). It turned out that thousands of transcription factor binding loci have these kinds of allelic imbalance patterns, and the effects are in fact mediated by transcription factor binding affinities to the different alleles.

We also developed a sensitivity map of the genome: the sensitivity of regulatory elements to genetic variation, estimated from the frequencies of these imbalances at heterozygous loci. The big message is that CpG islands and CpG-rich promoters are buffered against change, so allelic imbalances there are less likely to occur than the background, whereas CpG-poor promoters, and enhancers in particular, are highly sensitive to allelic variation, and allelic imbalances are highly likely to occur there.

The most unexpected pattern was stochasticity. What we mean by that is that there is no gradual difference: even though percent methylation differs fractionally between the two alleles, there is no intermediate level of methylation on individual molecules; it's either on or off. What differs between alleles is the fraction of on and off states, which means that the alleles actually modulate the frequency of the bound state of the transcription factor, with some factors preferring methylation and some preferring non-methylation. And this is actually true for the majority of the hundreds of transcription factors we looked into.

So the message of this allelic epigenome map is that about 5% of the epigenome shows absolute allelic imbalance, meaning a difference in beta values of at least 30% between alleles. These imbalances are associated with allelic transcription, consistent with their being marks of gene regulation. We have about 200,000 such loci with these allelic imbalances, thousands of them regulatory loci. And each individual actually harbors at least 200 variants that are under purifying selection and show these allelic imbalances, which means that these variants presumably do have an impact on human health.
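As promised above, a minimal sketch of the per-locus test: CpG calls on reads spanning a heterozygous site are split by allele, each call is binary (methylated or unmethylated), and the two allelic methylation fractions (beta values) are compared. The read counts are invented; the 30% beta-value cutoff follows the talk, while the significance threshold is an assumption.

```python
# A minimal sketch of the per-locus allelic-imbalance readout: CpG calls
# on reads spanning a heterozygous site are split by allele, each call is
# binary (methylated / unmethylated), and the allelic methylation
# fractions (beta values) are compared. Counts are invented; the 30%
# beta-value cutoff follows the talk, the alpha of 0.05 is an assumption.
from scipy.stats import fisher_exact

ref_calls = (38, 2)   # (methylated, unmethylated) on the reference allele
alt_calls = (7, 33)   # (methylated, unmethylated) on the alternate allele

beta_ref = ref_calls[0] / sum(ref_calls)
beta_alt = alt_calls[0] / sum(alt_calls)
delta = abs(beta_ref - beta_alt)

_, p = fisher_exact([list(ref_calls), list(alt_calls)])

print(f"beta(ref)={beta_ref:.2f}  beta(alt)={beta_alt:.2f}  "
      f"delta={delta:.2f}  p={p:.2e}")

if delta >= 0.30 and p < 0.05:
    print("allelic methylation imbalance at this heterozygous locus")
```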
So how did we integrate this? Well, here is how these user-contributed links show up in the Allele Registry, along with the other links. And if you click on one, you get content in JSON-LD. You probably will not read it yourself, but your programmers will be happy, because they can serialize it as a graph; it follows the latest standard of knowledge representation; it can be understood by Google; and it contains information about transcription factor binding, tissue of origin, donor, and so on (a short sketch of consuming this JSON-LD programmatically follows at the very end).

And so how can we then make use of the mouse in light of this new understanding of allelic imbalances in human? Well, I think F1 crosses. My dream project would actually be deep whole-genome bisulfite sequencing of F1 crosses, because there, every polymorphic locus is heterozygous, and the readout of allelic imbalances is very direct; there is no need for additional genotyping, because we already know the parental genomes. And homology to human would then allow us to validate the effects of these variants on gene regulation.

So in summary, and I have two more slides, we can integrate mouse variation at the regulatory level and at the amino acid level. Moreover, we can be even looser: for example, we can look at orthologous, homologous hotspots of mutations in human and mouse, or at functional domains that are homologous between mouse and human, and then transfer this knowledge, link this knowledge, to inform the interpretation of human genetic variation. We need a mouse allele registry for that, and Carol Bult from Jackson Labs just got an internal pilot project to work with us to apply this technology to build a mouse allele registry, with the view that this will be a core infrastructure component for integrating mouse data and human data at the sequence level.

So in conclusion, the ClinGen Allele Registry links information about genetic variants using new technologies, looking forward 10 years to the web of linked data. It applies these technologies to nucleate a distributed data ecosystem centered on human genetic variants. So it's not a single database; it's a core element of infrastructure allowing everybody to contribute. And similar approaches to those applied to human variation can be applied to mouse and other model organisms, so that this ecosystem starts to include model organisms.

With this, I'd like to acknowledge the members of my lab who have led the Allele Registry work, and also those who've led the allelic epigenome mapping as part of the last major paper of the Roadmap Epigenomics Project; our ClinGen collaborators, who defined the standards that we've implemented in the Allele Registry; Sharon Plon, who is the PI of the Baylor component of ClinGen, from whom we have a subcontract; and also Carlos Bustamante, who is the contact PI of the Stanford-Baylor group that's developing ClinGen infrastructure. Thank you.
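As promised above, a sketch of how a program might fetch an allele's linked-data record and treat it as a graph. The CA identifier in the URL is a placeholder, and the JSON-LD parsing assumes rdflib version 6 or later (earlier versions need the rdflib-jsonld plugin).

```python
# A sketch of fetching an allele's linked-data record and loading it as
# an RDF graph. The CA identifier is a placeholder; JSON-LD parsing
# assumes rdflib >= 6 (earlier versions need the rdflib-jsonld plugin).
import urllib.request
from rdflib import Graph  # pip install rdflib

url = "http://reg.genome.network/allele/CA123456"  # placeholder CA id
with urllib.request.urlopen(url) as resp:
    payload = resp.read().decode("utf-8")

g = Graph()
g.parse(data=payload, format="json-ld")

# Each (subject, predicate, object) triple is one edge of the knowledge
# graph; user-contributed links simply add edges to external resources.
for subject, predicate, obj in g:
    print(subject, predicate, obj)
```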