In 2013, ClinGen was launched, and one of its key goals was to develop and implement standards to support clinical annotation and interpretation of genes and variants. As most of you know, there are some major curation pain points, whether in genomics or elsewhere, and one of the biggest is simply searching for applicable literature. This can become very time-consuming. In biocuration, keyword searches are not ideal, largely because of the lack of consistent nomenclature and IDs. If you look at the bottom of the screen, this is a typical search string for a variant in a specific gene. We also have the inability to search supplemental data: most biological search platforms currently cannot search it, and there can be rich data about variants in there. Overall, this can result in valuable data and evidence being missed. Why is this important? It can lead to inaccurate pathogenicity classifications, which can result in misdiagnosis of a patient, or an endless odyssey for the patient of trying to find out whether they have a genetic link to their disease.

So we thought about how to tackle identifying relevant literature for biocuration, and one way was to crowdsource it through web-based annotation. That way we could potentially populate large amounts of data important to our initiatives and evidence, and in doing so enhance both our curation and its efficiency. Applicable articles would be tagged with identifying information that could help us prioritize our data searches and evolve the current variant search string, which is very lengthy and clunky, into something very specific. This approach also has the potential to reduce the time spent searching for this variant and gene evidence.
It can also reduce the curators' and experts' time, which means we can get through pathogenicity classifications more quickly, and we can uncover evidence that may not have been easily captured by biocurators and expert panels, simply because of the time required or where the evidence was located. So ClinGen launched a community curation working group and initiative in October 2018. I'm the chair of that initiative, after working with Hypothesis for a while. Our current statistic is that we have onboarded 200 volunteers who have taken a survey and want to participate in our actual curation. We have two initiatives, one of them being baseline evidence curation, which is going to be based on Hypothesis annotations and a tiered approach. Our comprehensive tier is individuals who will actually join our working groups and talk and interact with the biocurators and experts on these panels. It is an international effort; right now these volunteers span about 19 countries, and as you can see, we consistently get about 20 to 25% of individuals who are interested in using the web-based annotation method to curate. The way we plan to do this baseline evidence curation is a tiered approach where all levels can make a contribution to evidence. So we aim to target not only individuals with backgrounds in genetics, but citizen scientists, down to high school students, as you heard from AAAS, all the way up to clinical molecular geneticists, people who do this every day for a living. But before launching this, we wanted to create a proxy, so we onboarded undergraduate students at the University of North Carolina into our biocuration core last fall and had them test out this tiered approach to annotation. What we did was choose two genes that were important for our groups.
One of them was GAA, which is responsible for Pompe disease, a metabolic disorder that can result in muscular weakness, and the other was MYH7, the gene most associated with cardiomyopathy, specifically hypertrophic cardiomyopathy. Within this biocuration core, two of our seven undergraduates annotated articles published within a two-year time period for these specific genes, using a broad-based search through PubMed. What we did was develop protocols and tags to standardize our evidence capture. Tags of importance for us were variant IDs, such as a ClinVar ID or a ClinGen Allele Registry ID, which would allow a high-throughput search for the variants of interest. We also used inheritance pattern, which is important for the experts, as well as the type of evidence we were getting and whether or not the data was found in the supplemental data. You can see an example of our annotation report here on the left, as well as all of the tags that we can use to search. The expert biocurators were then tasked to assess the time to curate 14 variants for each gene, using the standard search protocols and control platforms versus the Hypothesis pre-annotated information. For the standard search, you can see that we used about five different search platforms: PubMed; LitVar, which can use an rsID for a variant search; Mastermind, where we used the free version (it is the only one that searches supplemental data, but that feature costs $6,000 per year per user, and with 570 individuals associated with ClinGen, that is not an option for us); and Google and Google Scholar. I like numbers, I'm a scientist, so I'm going to show you numbers. We wanted to evaluate whether we could save a significant amount of time.
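To make the tagging idea concrete, here is a minimal sketch of a controlled tag vocabulary with a validator. The prefixes (`ClinVarID:`, `CAID:`, and so on) are illustrative stand-ins, not the SOP's actual tag spellings.

```python
# Hypothetical controlled tag set for evidence capture; the prefixes below are
# illustrative stand-ins for the SOP's real tags (variant IDs, inheritance
# pattern, evidence type, supplemental-data flag).
ALLOWED_TAG_PREFIXES = {
    "ClinVarID:",    # e.g. ClinVarID:12345
    "CAID:",         # ClinGen Allele Registry ID, e.g. CAID:CA123456
    "Inheritance:",  # e.g. Inheritance:AutosomalDominant
    "EvidenceType:", # e.g. EvidenceType:CaseLevel
    "Supplemental",  # marks evidence found only in supplemental data
}

def is_valid_tag(tag: str) -> bool:
    """Return True if an annotation tag matches the controlled vocabulary."""
    return any(tag == p or tag.startswith(p) for p in ALLOWED_TAG_PREFIXES)
```

Keeping tags to a fixed vocabulary like this is what makes the high-throughput searches by variant ID described below possible.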
What we found, again over the 14 variants in the same genes, is that comparing the control method to Hypothesis, we saved on average 50 minutes per variant for GAA and 28 minutes per variant for MYH7 by using the pre-annotated information in Hypothesis. How much time total? The total time saved could be up to 280 minutes. Some of the variants vary, because some are very well known and there is a lot more literature to be found on them; the highest literature return we got was about 30 articles we could annotate, versus one for others, so there is a range. We then normalized this to the number of annotations, since obviously differences in the articles could influence our time. So what you're really looking at is the average time per annotation, and we can still see that with the pre-annotated information we saved about seven minutes for every annotation we made on a variant in GAA, and 11 minutes in MYH7. Again, this is a significant amount of time when you start looking at the 500,000 variants we will be looking at in the future. An interesting result from this was that we were able to compare these biological search platforms to see how well they performed. What we found was that, overall, Mastermind by Genomenon returned the greatest percentage of articles, around 50% for both genes, for the variant searches that resulted in actual annotations. Most of this is because they use a machine-learning algorithm that tries to tag the variants of interest within articles. Through their professional version they can also search the supplemental data, but we did not use that. The other search platforms do not have good ways to search for a variant using an identifying number; the only other one that does is LitVar, which uses an rsID, and many variants do not have one.
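The two time metrics above, per-variant savings and the per-annotation normalization, reduce to simple arithmetic; a sketch, with made-up numbers in the usage rather than the study's data:

```python
def avg_saved_per_variant(control_minutes, annotated_minutes):
    """Mean per-variant difference, in minutes, between the control method
    and the pre-annotated (Hypothesis) method."""
    diffs = [c - a for c, a in zip(control_minutes, annotated_minutes)]
    return sum(diffs) / len(diffs)

def avg_time_per_annotation(total_minutes, n_annotations):
    """Normalize total curation time by the number of annotations made,
    so variants with very different literature volumes are comparable."""
    return total_minutes / n_annotations
```

For example, `avg_time_per_annotation(70, 10)` gives 7 minutes per annotation, the kind of figure reported above for GAA.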
If we wanted to look at the unique articles identified, we could not standardize this to say that the same number of articles would be looked at and annotated: going through the normal search platforms could return a set of articles unique to those searches, versus what we pre-annotated. What we found was that between about 28% and 50% of the articles, depending on the gene, were shared between the two methods, and that unique articles were found by each one. That was really encouraging for us, because it means each method was surfacing something the other missed. But out of the articles from the control specifically, if we look at the percentages, Mastermind again came out on top. What I want to point out to you, though, is that PubMed returned the lowest percentage every time, yet our pre-annotations all came from a two-year time period within PubMed; everything from Hypothesis was a PubMed article. What this really begins to show is that keyword searching of biological data within PubMed clearly falls short of letting curators and interested experts find what is there. So the conclusions are that annotation saves a significant amount of expert time. This is attributed to the standardization of the process through the development of our SOP, as well as tagging that facilitates efficient searching of the evidence. By having those variant ID tags, as well as other tags that are important to our experts, they can get through this data very quickly. It also enhances the curation experience because, as you all know, Hypothesis annotation provides transparency of the curated data: instead of just logging what the curated data in an article is, we are directly outlining it and saying, this is what we think about this evidence.
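The tag-driven searching described in these conclusions can be sketched against Hypothesis's public search API (`https://api.hypothes.is/api/search`). The variant-ID tag in the example is hypothetical, and this sketch only builds the request URL rather than performing the call.

```python
from urllib.parse import urlencode

HYPOTHESIS_SEARCH = "https://api.hypothes.is/api/search"

def search_url_for_tag(tag: str, limit: int = 50) -> str:
    """Build a Hypothesis search-API URL that filters annotations by tag,
    e.g. a variant-ID tag applied during pre-annotation."""
    return HYPOTHESIS_SEARCH + "?" + urlencode({"tag": tag, "limit": limit})
```

A real client would GET this URL and page through the `rows` array of annotations in the JSON response.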
This is really important because some of these variants will actually end up being benign, and some of the genes may have their association with disease disputed or refuted, and we can clearly show the evidence within these articles on which we dispute or refute. It will also facilitate fact-checking capabilities for curation at scale: the community itself can fact-check this data and say whether it is applicable or appropriate, which will help our experts decide in the end whether they want to use this information. And last and most importantly, there is a great unmet need for the use of variant IDs to facilitate identification of applicable articles and curation. So I would really implore anybody here today who works with publishing companies to consider using variant ID tags like ClinVar IDs or ClinGen Allele Registry IDs; it is an initiative that ClinGen is currently working on with journals, to require this on publications. This is because the loss of data could result in pathogenicity misclassifications, which again could result in misdiagnosis of individuals, which is the last thing we would want, especially given that ClinGen is now the first FDA-recognized entity for variant pathogenicity.

To go through some future endeavors: ClinGen has developed a Linked Data Hub, a linked data platform as you heard, and this is really going to help us put our annotations to use. I do want to note that the two curators, over an eight-week period, annotated almost 1,100 articles for GAA and made 600 annotations in that two-year time period, which really shows how much data can be generated by individuals who are willing to do this. This data hub is going to be a new publicly accessible, extensible, and scalable infrastructure that will allow us to collate this information, and it models the W3C standard for annotations.
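For readers unfamiliar with the W3C Web Annotation data model the hub follows, here is a minimal annotation expressed as a Python dict; the comment, tag, and target URL are invented for illustration.

```python
# Minimal W3C Web Annotation (https://www.w3.org/TR/annotation-model/),
# expressed as a Python dict. The comment, tag, and target are illustrative.
annotation = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "type": "Annotation",
    "body": [
        {"type": "TextualBody", "purpose": "describing",
         "value": "Case-level evidence for this variant"},
        {"type": "TextualBody", "purpose": "tagging",
         "value": "CAID:CA123456"},  # hypothetical variant-ID tag
    ],
    "target": {
        "source": "https://pubmed.ncbi.nlm.nih.gov/12345678/",
        "selector": {"type": "TextQuoteSelector", "exact": "c.525del"},
    },
}
```

Because every tool that models this standard shares the same `body`/`target` shape, annotations from Hypothesis and other sources can be collated in one store.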
So essentially we're already pulling in variant information from MyVariant, Ensembl, gnomAD, and ClinVar, as well as Hypothesis, and those community annotations will be a part of that. That will allow us to show these annotations within our variant curation interfaces, our gene curation interfaces, and every other interface that ClinGen currently uses. So this is a really exciting development, and we're already coming up with our proof of concept for how these annotations will be presented to our experts. What you're looking at here is a screenshot of our variant curation interface. Experts will go in and select their variant of interest through this nomenclature, but we plan to highlight within the interface that there are community curations and annotations through Hypothesis, and then give a sneak peek of the actual tags that were used, so they can begin to understand how the criteria may be met. And we provide the link, so they can go out and see the rich data and make use of it. I just want to acknowledge quite a significant number of people within the UNC biocuration core who helped carry out this experiment, especially Megan Mayors and May Flowers, the two undergraduates who did all of these annotations. The Linked Data Hub comes through our Baylor collaboration within ClinGen, along with our gene curation interface, and Jon Udell, who has helped us with Hypothesis. We have future endeavors in using a bookmarklet that we can launch for this community curation initiative to standardize approaches. And just for a plug: we do expect to launch our annotation-based community curation in the new year, and we already have our community curation working group. So if any of you are interested, I would welcome you to go to our website to take the survey, and hopefully we will contact you with the ability to curate.
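The bookmarklet idea mentioned above can lean on the fact, discussed further in the Q&A, that ontology and registry links usually end in the term ID. A sketch; the URLs below are examples of the pattern, not guaranteed formats:

```python
def tag_from_ontology_url(url: str) -> str:
    """Derive an annotation tag from an ontology or registry link by taking
    the last path segment, which typically carries the term ID."""
    return url.rstrip("/").rsplit("/", 1)[-1]
```

For example, an HPO browser link ending in `HP:0001250` yields the tag `HP:0001250` directly, so annotators never have to type (or misspell) the ID.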
We do plan to go out to patient advocacy groups and citizen scientists in the new year. So thank you.

If anyone has questions for Courtney, it would be easiest if you could step up to the mics.

I just had a question: I noted that you were using a standard input form for the Hypothesis input, and standard tags in a controlled vocabulary. I was curious about your experience with that. I designed a project where we did the same, and one of the issues we had was getting the people who were annotating to always use the correct tags in the correct way and not misspell them, so we ended up building an interface to validate that data. I'm just curious about your experience with that.

Yeah, so far we've mostly had a pretty good experience. I think we only had one error out of our group, but again, that was the tiered approach, where we started people at a certain level and then really trained them up. One of the ways we hope to get around it in the community is through this bookmarklet that we're working on with Jon; that way we can ask questions about whether they have a specific case type, you know, is it information about an individual with disease or a family with disease, and if you click these, it will go ahead and give you that tag. As for ontologies like HPO, we're already providing the links for them, and from there we should be able to derive the tag, because the URL of every link typically has the HPO number on the end of it. The same is done with the ClinGen Allele Registry and ClinVar IDs: by having those links brought in, we can create the tag that way.

So you're going to build your custom interface for guided input. Okay, thanks.

Hi there, my name is Robert. I'm a developer at Hypothesis.
So I just wanted to check something, from the perspective of someone who isn't in that biology world and isn't really familiar with the processes and workflows people go through. Just to check my understanding: you have groups of people who are not necessarily experts to the same level as the people who end up doing the searching using Google Scholar and so on. They are tasked with annotating a collection of literature, looking for certain things, and they need maybe an undergraduate level of skill, but you can more easily get a large pool of them than of the experts. And then the more highly skilled people, presumably graduates, postgraduates, and so on, are the people using the search tools to ask questions of that data. Is that right?

So what we hope for the actual curation is, yes, you will go through the community to annotate this data. It doesn't mean that all annotated data will be used; that is really up to the experts within the working group, and those experts are going to be the renowned experts on those genes and diseases. Is that what you're asking?

Yeah, so the thing I was getting at is that there is a distinction between a large pool of people who can do some pre-annotation and a small group of people who are more highly skilled and whose time is obviously limited, to save them time. A follow-on question I have from that: you mentioned there's a search engine that uses a certain amount of machine learning to, effectively, to some extent, solve a similar problem. To what extent do you think the task you're asking the community to do is automatable?

Some of it can actually be automated.
The problem, though, if we look back, let me go up, is how this becomes difficult to automate. This is what is called HGVS nomenclature for a variant. You're going to have a coding reference, which is the cDNA (c.), a protein reference (p.), and also a g. genomic reference. Papers do not have standard requirements for this: a paper could give the genomic variant, the cDNA (coding) variant, or the protein variant, and because the nomenclature has only recently been standardized, it could be written in multiple different ways. So going through to automate this process for all of the past literature is difficult, because you have to know all of the synonyms, all of the identifications of these variants. That's where having individuals go through the literature and tag things could potentially help machine learning in the future. And that's also one of ClinGen's initiatives: to help automate the search capacity. We hope that by going through these tags, tagging every variant and giving it an ID, you can take that information back to help the machine learn and understand what a variant is and be able to look for it in the future. Cool. We also hope that in the future, variant IDs could be used within the literature, as well as other ontology terms, which would make this automatable.

And a very dumb question I'll ask is: how does a variant ID relate to a gene identifier, or whatever?

Oh, sure. So a variant is a mutation within a gene, and that variant ID is going to be very specific to that amino acid sequence, codon sequence, or genomic coordinate. The gene then has multiple different variants within it overall. We all have variation within us right now that is, overall, what we would call benign, right?
It's not pathogenic, but there are a certain few, a handful of people in here, who may have a pathogenic variant within their genome. So that's really the basis: we want to see which genes are associated with disease, and ClinGen is really the only resource right now with a semi-quantitative metric to tell you the relationship of a gene and disease, and then also whether the variant in your gene is pathogenic, and for what disease. Cool. Thank you very much.
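To illustrate the HGVS ambiguity discussed in the Q&A, here is a toy matcher that recognizes only the simplest well-formed substitutions (c., g., and three-letter p. forms). Real mentions in older papers come in many more shapes, which is exactly why human tagging helps.

```python
import re

# Toy HGVS-style matcher: catches simple cDNA/genomic substitutions like
# c.76A>T or g.123456A>G, and three-letter protein substitutions like
# p.Lys26Asn. Deliberately incomplete -- deletions, duplications, legacy
# one-letter forms, and free-text descriptions all slip through.
HGVS_SIMPLE = re.compile(
    r"\b(?:"
    r"[cg]\.\d+[ACGT]>[ACGT]"             # c.76A>T, g.123456A>G
    r"|p\.[A-Z][a-z]{2}\d+[A-Z][a-z]{2}"  # p.Lys26Asn
    r")"
)

def find_hgvs_like(text: str) -> list[str]:
    """Return all simple HGVS-style variant mentions found in a text."""
    return HGVS_SIMPLE.findall(text)
```

Note that even a common deletion such as `c.525del` is missed by this pattern, a small demonstration of why exhaustive automated variant matching across past literature is hard.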