Good afternoon everyone. Welcome to the Infectious Disease Genomic Epidemiology workshop, module three. My name is Emma Griffiths and I'm a research associate at the Centre for Infectious Disease Genomics and One Health, the lab that Will Hsiao runs at Simon Fraser University in Vancouver, where part of our research focus is the development and implementation of data standards and ontologies for public health. I am very excited to be here today to talk to you about tools and processes for infectious disease genomic epidemiology data curation and sharing, because it's a topic that's very close to my heart.

So over the next hour, we hope that you will better understand the challenges of using genomics contextual data for public health analyses; know how ontologies, data standards, and tools can be used as solutions for streamlining data flow; be able to describe some real-world examples of how ontology-based specifications are used; be aware of data sharing principles and different practical, ethical, and privacy considerations; know about different public repositories and their submission requirements; and be aware of some general data curation best practices.

Okay, so let's dive into the material. When we talk about pathogen genomics data, we are of course talking about the sequence data, but we're also talking about the contextual data, something that Will and Fiona both alluded to in the earlier lectures. When I say contextual data, what I'm talking about is the sample metadata; the lab, clinical, and epidemiological data; as well as the methods information, all of which are critical for interpreting sequence data. That interpretation ultimately gets used to inform decision making for public health responses, but it is also necessary for developing biological insights in genomic epidemiology research.

Getting the right information to the right people quickly is really important in public health emergencies like outbreaks and pandemics. Data can be shared in a variety of different ways, such as between groups or departments at the same organization, between trusted partners in a network, or more broadly with public repositories. For example, data can be generated in one place, like a local public health agency, and it might need to be shared with a different jurisdiction, lab, authority, or organization, for example a provincial or federal reference lab. And all of those other labs, the provincial or federal labs, need to enter the data that comes from the local lab into their own systems, which is not always easy to do, because frequently the contextual data captured at the local source is collected and stored using different systems, different data structures, and different processes. Also, not all of the information shared between these different groups is the same, due to issues of privacy and identifiability, or due to different data needs.

Why is there so much variability? The reason is that different jurisdictions have the authority to choose whatever information management systems work best for them, and they can also set up their data collection templates, and by that I usually mean spreadsheets, in whatever way works best for them. Now this is good because it creates flexibility and enables labs to customize their data collection tools according to their needs and priorities. However, this flexibility creates challenges when you try to integrate data from different sources, because it creates heterogeneity in the data.
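To give a flavour of what that heterogeneity looks like in practice, here is a tiny illustrative sketch of two records describing the same kind of sample from two hypothetical labs. Every field name and value here is invented for illustration; it is not taken from any real lab system.

```python
# A minimal, illustrative sketch of the heterogeneity problem described above.
# Field names and values are invented; they are not from any real lab system.

lab_a_record = {
    "source": "feces",                # here "source" means the sample type
    "collection_date": "2021-03-04",
    "region": "BC",
}

lab_b_record = {
    "source": "Public Health Lab 2",  # here "source" means the submitting lab
    "sample type": "Stool",
    "date": "04/03/2021",             # different date format
    "province": "British Columbia",
}

# A naive merge by shared field names silently mixes incompatible meanings:
shared_fields = set(lab_a_record) & set(lab_b_record)
print(shared_fields)  # {'source'} -- the same word used with two different meanings

# Harmonizing these records needs an explicit, per-provider mapping to a common
# standard before the data can safely be combined.
```

The point of the sketch is simply that a computer cannot tell, from the field name alone, that the two "source" fields mean different things; that knowledge has to be made explicit somewhere.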
One of the main challenges in using other people's data, or even trying to reuse your own contextual data for different projects when it isn't standardized, is that people use field names in different ways. For example, in the slide here you can see two sets of brackets. In the upper bracket is a series of field names that I extracted from some food and water safety data a few years ago. The different data providers used a range of field names to indicate the types of samples that they collected, and you know this because if you look at the values that go into the fields, they all describe sample types. So in this case, people are using different words to mean the same thing. Now in the lower bracket, you see the field name "source", which looks the same as the field name "source" in the upper bracket. But if you were to look at the values that go into that source field in the lower bracket, you would see that that lab was using this field to indicate the lab name, or who collected the sample, not what the sample type was. So in this case, the two labs are using the same word, but they mean different things. A human being can look at the data and figure this out, but a computer has a much more difficult time, unless you tell it how to do that.

Heterogeneity of values within fields also complicates using the data. Often fields in spreadsheets and information management systems are free-text based, which means the information in those fields doesn't have any constraints, or maybe just a few. One of the biggest challenges in using free text to capture information is that the information often has errors associated with it. There are differences in information granularity. There is semantic ambiguity, which is something we've already mentioned, where people use the same words to mean different things, and that often happens when people are using discipline- or field-specific jargon. There are also different formats, and a lot of other issues, including differences in content during data collection, which can result in spotty data and missing information between data sets. There are also differences in the way data is organized and aggregated.

Now, it's one thing when there's heterogeneity in private databases and you're only dealing with a little bit of confusion, but the challenges get compounded when you're either submitting to or extracting from public repositories. For example, here are two very different SARS-CoV-2 contextual data records that I pulled from NCBI: they have different fields, one is very descriptive while the other contains discrete pieces of information, and so on. While neither of these records is wrong, it would take a lot of work to combine the information from these two records as they are right now. So how you structure information impacts how you can understand it and how you can use it. It's time and resource intensive to do a lot of data cleanup; it can take hours, days, or weeks, and you just don't have that kind of time to spend on data transformations in an emergency, or even during routine analysis.

So what tools do we have at our disposal to streamline data capture and harmonization? We have things called ontologies, and these are universal languages for humans and computers. We also have data standards and specifications, and these are prescribed sets of fields, values, and formats.
And we have data management and transformation tools, which can help implement ontologies and standards in order to put them into practice.

For those of you who are unfamiliar with ontologies, basically these are hierarchies or trees of controlled vocabulary, that is, standardized terms, where the terms are linked using logical relationships, the meanings of the terms are disambiguated using universal identifiers, and each of the terms has a specific definition. An example of how terms are related to each other in an ontology can be seen in the mock beer ontology on the right-hand side of the slide. Here we can see that a lager beer has been assigned an identifier, 1234, and you can see in the hierarchy that a lager is a type of light beer, which is a type of beer. The "is a" relationship is the most commonly used and helps to form the backbone of the ontology hierarchy. You can use other kinds of relations to build other things, like ingredients, qualities, producers, and brand names, into your knowledge base. So you can start to do things like disambiguate Corona brand lager beer from the corona of the sun, which would have an identifier probably derived from the Environment Ontology. One of the most powerful things about ontologies is that the definitions are meant to be universal and not dependent on the language used by certain organizations. However, organization-specific terms can be mapped to the universal terms and indicated to be synonyms, so that anywhere the ontology is implemented the synonyms can be used instead of the universal terms, and these work interoperably. Ontologies also enable the deprecation of terms, so that you can make it clear to users when language usage preferences change over time.

In contrast to ontologies, which circumscribe domains of knowledge, data standards, as I mentioned, are sets of prescribed fields, terms, and formats. These are made by experts and meant to fulfill a specific purpose. For example, in the slide here you can see the ISO 23418 standard, which describes best practices for using whole genome sequencing to characterize bacteria in the food chain. In that standard there is a section that prescribes a subset of ontology-based fields and terms for describing information about samples, isolates, and sequences, in order to improve contextual data consistency, standardization, and interoperability. To hit this point home a little better, I wanted to illustrate the difference between ontologies and data specifications. In this slide we can see an example of a partial pizza specification on the left, and this consists of fields and menus of terms. In contrast, we can see a partial example of a pizza ontology on the right, which indicates the relationships between pizza types and toppings. Now you can imagine, just looking at this simple pizza ontology, that ontologies can be pretty complicated things. And while ontologies have been used for inferencing and more complicated querying by the US Department of Defense and by Google, in public health, ontologies usually act basically as a source of standardized fields and terms from which we can create contextual data specifications. So that's the theory. Next we'll get into some real-world examples of how ontology-based specifications have enabled data harmonization and sharing during the COVID-19 pandemic, and also the monkeypox epidemic.
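Before we leave the theory, here is a minimal sketch of how the pieces just described, universal identifiers, "is a" relationships, and synonyms, can be represented in code. The beer terms and identifiers below follow the mock example on the slide and are purely illustrative, not a real ontology.

```python
# A minimal sketch of ontology terms, "is a" relationships and synonyms.
# The terms and identifiers are illustrative, following the mock beer ontology.

from dataclasses import dataclass, field

@dataclass
class OntologyTerm:
    term_id: str                                   # universal identifier, e.g. "BEER:1234"
    label: str                                     # preferred label with a fixed definition
    parents: list = field(default_factory=list)    # "is a" relationships
    synonyms: list = field(default_factory=list)   # organization-specific names mapped in

BEER = OntologyTerm("BEER:0001", "beer")
LIGHT_BEER = OntologyTerm("BEER:0002", "light beer", parents=["BEER:0001"])
LAGER = OntologyTerm("BEER:1234", "lager", parents=["BEER:0002"],
                     synonyms=["lager beer"])

TERMS = {t.term_id: t for t in (BEER, LIGHT_BEER, LAGER)}

def is_a(term_id: str, ancestor_id: str) -> bool:
    """Walk the 'is a' backbone to test whether one term is a kind of another."""
    if term_id == ancestor_id:
        return True
    return any(is_a(p, ancestor_id) for p in TERMS[term_id].parents)

print(is_a("BEER:1234", "BEER:0001"))  # True: a lager is a (light) beer
```

The synonym list is what lets an organization keep its own local wording while still resolving to the same universal identifier behind the scenes.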
At the start of the COVID-19 pandemic, local and provincial labs were collecting samples for diagnostic testing. These samples were also being used for sequencing and for understanding which variants were present at local and regional levels. To understand how the virus was entering Canada and spreading and evolving across the country, sequences and contextual data needed to be sent to the Public Health Agency of Canada's national reference lab for analysis. Now, all of the contextual data collected by the different health authorities across the country was in different formats, so basically it would be like comparing apples and oranges if we didn't develop some sort of data standard to harmonize information from all of these different sources. So our lab at SFU, in collaboration with the National Microbiology Laboratory, was able to develop an ontology-based contextual data specification to help harmonize all of the SARS-CoV-2 contextual data being collected and shared with the NML for national surveillance priorities, which was critical for streamlining data flow and coordinating responses.

So how were we able to do this? Well, luckily, our team had done a lot of hard work prior to the pandemic in formulating contextual data specifications for the ISO standard that I mentioned earlier on, and we had the benefit of that specification having undergone really rigorous international review. So we had a pretty good framework to start with that we could add on to and customize for SARS-CoV-2 data needs during the pandemic.

What exactly is in the SARS-CoV-2 contextual data specification? Well, it has quite a large number of standardized fields and terms that help capture information about repository accession numbers and identifiers, which is important for matching and for establishing a chain of custody. There were a lot of fields and terms to standardize the way folks talk about sample collection and processing, and fields for capturing host information, which could be something like age and gender, but could also be information about exposures, reinfections, and vaccination. Importantly, there were also fields to capture information about the methods, so sequencing instruments, bioinformatics processes, quality control metrics, and so on. There were also fields for capturing information about different kinds of lineages and variants, pathogen diagnostic testing results, and, importantly, fields to capture information about the contributions of the different folks from the different labs. We were able to source all of the standardized fields and terms from about 24 different ontologies. As I've indicated, there are over 1,000 standardized fields and terms in the specification, but you do not have to fill in 1,000 things. There is a very small number of fields that you are actually required to fill in; the others are there and you can fill them in optionally. The subset of required fields basically concentrates on capturing who sequenced the sample; when, where, and what got sampled; and how the sequencing and bioinformatics were carried out.

Now, it's one thing to create a specification, but it's another thing to put a specification or a data standard into practice. A critical part of operationalizing ontology-based data standards is implementing them in some sort of tool, like a collection template or some other kind of data management and transformation tool, so that public health practitioners can really make use of them.
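Before looking at the tool we'll discuss next, here is a minimal sketch of what encoding a small slice of such a specification in a collection template can look like: required fields, picklists of standardized terms, and a simple validation pass. The field names, picklist values, and the validate function are illustrative assumptions, not the actual SARS-CoV-2 specification or any real tool's code.

```python
# A minimal sketch of a contextual data specification encoded for a collection
# tool: required fields, picklists, and validation. All names are illustrative.

SPEC = {
    "sample collected by":    {"required": True,  "picklist": None},
    "sample collection date": {"required": True,  "picklist": None},
    "geo_loc name (country)": {"required": True,  "picklist": {"Canada", "USA"}},
    "anatomical material":    {"required": False, "picklist": {"blood", "saliva"}},
}

def validate(record: dict) -> list:
    """Return problems: missing required fields or values not on a picklist."""
    problems = []
    for field_name, rules in SPEC.items():
        value = record.get(field_name)
        if rules["required"] and not value:
            problems.append(f"missing required field: {field_name}")
        elif value and rules["picklist"] and value not in rules["picklist"]:
            problems.append(f"'{value}' is not a standardized term for {field_name}")
    return problems

print(validate({"sample collected by": "Lab A",
                "geo_loc name (country)": "canada"}))
# -> flags the missing collection date and the non-standard country spelling
```

Keeping the specification itself as data, rather than hard-coding the checks, is what makes it straightforward to reuse the same machinery for a different pathogen later, which is exactly the kind of reuse described below.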
So in parallel with developing the data standard for SARS-CoV-2, we also developed protocols and practices for data curation, and in addition we developed a tool called the DataHarmonizer. The DataHarmonizer is very simple: it's a spreadsheet-like editor that enables users to open spreadsheets of data, or to enter data directly into the app. It has required, recommended, and optional fields that are color coded, so the yellow ones that you see here are considered required. There are dropdown menus, or picklists, that offer standardized terms. There are curation features that enable you to enter data more quickly, and, importantly, there's a data validation function that identifies errors and missing information. There are also a number of different support materials, like a curation SOP and a reference guide, that provide instructions for how to use the tool, basically what you would need to get people curating and submitting data quickly and in a harmonized way.

Now, we recognize that there can be different downstream endpoints for data. One might be the National Microbiology Laboratory's national genomics database, but there are also different public repositories, like the VirusSeq Data Portal, NCBI, or GISAID, and we'll talk about these different repositories specifically in a little bit. Different repositories have different submission requirements and formats, and like I mentioned, it's very time consuming and basically a massive pain to have to reformat your data for these different endpoints. So probably one of the neatest things about the DataHarmonizer is that it automates data transformations for you. This is possible because we started with the ground truth of ontologies in the DataHarmonizer, and these standardized fields and terms could be mapped to all of the different fields and formats that are required by the different databases. We're going to play around with the DataHarmonizer later this afternoon, where you'll be able to try out some mock, real-world-style examples of data that would be curated for the pandemic response.

Having a pre-existing framework and tooling that worked for SARS-CoV-2 was very handy when we had another crisis, namely the monkeypox epidemic, which hit the world in July of last year. Because the SARS-CoV-2 standard had a pretty robust core, and by that I mean a set of fields that you need for just about every pathogen, we were able to repurpose the SARS-CoV-2 spec for harmonizing monkeypox surveillance data by updating the picklists where necessary and by adding and subtracting fields as needed.

I know I've been pretty high level so far when describing what these specifications can do, so I wanted to give you some more concrete examples. In this slide, in the first table, we see in the first column a set of unharmonized monkeypox sample descriptions. These free-text descriptions are pretty hard to compare right off the bat because they have differences in granularity, they use different terms, and some are in French. So if you wanted to study what types of samples give you the best sequencing results, it would take a lot of time and work to fit all of this information together if you didn't already have a data standard that would tell you how to organize the information.
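The monkeypox specification we'll look at next does exactly that kind of organizing, with prescribed fields and picklists. As a rough illustration only, here is a sketch of mapping free-text sample descriptions into structured field and term pairs. The mapping table is invented for illustration; in a real curation workflow the target fields and terms come from the specification's picklists and the mapping is reviewed by a curator.

```python
# A rough, illustrative sketch of turning free-text sample descriptions into
# structured, standardized fields. The mapping table is invented for illustration.

FREE_TEXT_TO_FIELDS = {
    "lesion swab":            {"anatomical material": "lesion fluid",
                               "collection device": "swab"},
    "écouvillon de gorge":    {"anatomical part": "throat",   # French free text
                               "collection device": "swab"},
    "crusted scab, left arm": {"anatomical material": "scab",
                               "anatomical part": "arm"},
}

def harmonize(description: str) -> dict:
    """Look up a free-text description and return standardized field/term pairs."""
    return FREE_TEXT_TO_FIELDS.get(description.strip().lower(),
                                   {"curation note": "needs manual review"})

for text in ["Lesion swab", "écouvillon de gorge", "nasal wash"]:
    print(text, "->", harmonize(text))
# anything not in the mapping is flagged for a human curator rather than guessed
```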
The monkeypox specification standardizes what kinds of anatomical materials got collected, what body part the materials were from, as well as the devices used to collect the samples and, if applicable, the biomaterial that was submitted for sequencing. You can see that there are a number of different fields used here to organize the information. Not all of them are used for every record; only the fields that apply to the sample type needed to be filled in. We were also able to capture information about specimen processing, for example whether specimens were pooled, as well as linking samples from the same hosts that were sequentially sampled, so comparisons could be made across time if you were trying to determine, say, whether an infection was recurring or whether the individual was suffering from reinfection. It's also possible to harmonize information about One Health samples, where One Health is a concept that takes human, animal, and environmental health into consideration when looking at the evolution and spread of a pathogen. So you can see that different types of One Health samples were all able to be harmonized using the same set of fields. Just to point out, this is mock data here, but if you're curious, you can see what the actual publicly available data looks like by hopping over to Canada's monkeypox BioProject; I popped the accession number there in the slide.

Having a consistent contextual data framework across pathogens enabled us to do a lot of different things. For example, it enabled us to develop pathogen-specific templates for the DataHarmonizer a lot faster and with less work. Also, there was less uncertainty, as a number of data expectations had already been established with the partners, who were used to seeing the core fields and terms. The reuse of fields and terms also better enables interoperability between systems and tools, for example between outputs from the DataHarmonizer and the NML's national database. It also enabled the reuse of data sharing agreements, curation skills, and training materials, basically many of the tools, processes, and protocols that had already been set up.

I've given you the short version of data harmonization during the pandemic and the monkeypox epidemic. If you'd like to learn more about how the data specification we built was used, not just in Canada but in other countries and regions around the world, like Australia, the US, South Africa, Nigeria, Argentina, and so on, you can listen to episode 26 of the Micro Binfie podcast. This is a great podcast all about public health bioinformatics; they cover all kinds of topics and usually do an episode every week. And if you'd like to learn more about genomic surveillance in Canada during the pandemic in general, you can also have a listen to episode 53.

Okay, so now we're going to shift gears from talking about tools and specifications to talking about data sharing. Why are we going to spend time on this, and why is data sharing so important? Let me start by saying that no one can tell you that you have to share data. This is a decision that your organization and perhaps your government needs to make, and I know it can be a contentious issue. But here are some of the benefits of sharing your data with the world. The first reason is to create situational awareness. When people share their data, you can better understand your data in the context of what's happening in the world.
And inversely, when you share data, other people can understand what's happening in your region. If your region is underrepresented in public repositories, you are basically in a blind spot. If you are in a blind spot, people may have to guess or make assumptions about what's happening in your region, and those assumptions can be wrong. You also don't want to be in a blind spot because, when companies are creating diagnostics and vaccines, you want the viruses infecting the people in your region to be covered by those tests and therapeutics. Now, distribution of these things worldwide has not really been equitable, but if you can get hold of tests and vaccines and they aren't designed to protect the people in your region, then you have a whole other set of challenges. This leads to the next reason, which is that data sharing creates leverage. When you have data in the pool, people notice, and it helps give you a voice in global conversations. Another interesting argument has been made by the Global Alliance for Genomics and Health. Some years ago they developed a document that helps put data sharing in a human rights framework. Basically they argued that, under their framework's interpretation of human rights, data holders have a duty to share, but they also emphasized the right of data providers to be attributed and to get credit for their work, while also reinforcing the idea of scientific freedom. I'm not going to go into this much further, but I popped the link to the GA4GH framework in here in case you're looking at data sharing from a human rights angle. And finally, why should you share data? Because it's part of being a good data citizen. If you're using other people's data, then really the best thing to do is to share data in return.

Okay, so as a data generator, there are some important good practices for data stewardship. One of the key principles of data stewardship is protecting privacy. In order for public health to work, you need public trust. If you lose public trust, people won't get tested, they won't get vaccinated, they won't comply with restrictions, and so on. But it's a bit of a double-edged sword: you don't want to share identifiable information, which can jeopardize people's privacy, but holding back too much information makes it seem like you're not being transparent and that you're trying to hide things.

So here are some basic rules of thumb. Only use de-identified data in your analyses; that means no names and addresses. Be careful of geographical granularity. Watch out for small case numbers in particular locations at the same time, and watch out for combinations of different fields. There are fields that are not particularly identifiable on their own, but if there is a small number of cases or the combination of information types is unique, then people could hypothetically use that information to identify someone. Always track identifiers: when samples and sequences switch hands between different departments or organizations, they often get renamed, and tracking the different IDs establishes, like I mentioned before, a chain of custody. But also be aware that personal health IDs, or even sample IDs, may be considered personal health information, and so you may not be able to share them.
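To make one of those rules of thumb a bit more concrete, here is a minimal sketch of a pre-sharing check for small case counts across combinations of quasi-identifying fields. The field names, records, and the threshold of five are all illustrative assumptions; your privacy officer and local policy set the real rules.

```python
# A minimal sketch of flagging combinations of quasi-identifying fields
# (location + collection week + age group) shared by only a few cases.
# Field names, records, and the threshold are illustrative only.

from collections import Counter

records = [
    {"region": "Region A", "week": "2021-W07", "age_group": "0-9"},
    {"region": "Region A", "week": "2021-W07", "age_group": "0-9"},
    {"region": "Region B", "week": "2021-W07", "age_group": "80+"},
]

MIN_CELL_SIZE = 5  # illustrative cutoff; real thresholds come from local policy
counts = Counter(tuple(sorted(r.items())) for r in records)

for combo, n in counts.items():
    if n < MIN_CELL_SIZE:
        print(f"re-identification risk: only {n} case(s) share {dict(combo)}")
```

The idea is simply to surface rare combinations before release so a human can decide whether to generalize a field (for example, coarser geography or wider age bands) or suppress it.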
Even if you have some experience with data sharing, before you share anything it's best to consult your privacy officer, who will hopefully be more familiar with jurisdictional or organizational policies, as well as national legislation; all of those things might affect what you're allowed to share. Remember, in public health settings auditability is incredibly important, and tracking the methods and provenance of your data is essential. Depending on what you have, contextual data might actually have higher security requirements than your sequence data. And if an error is detected in your contextual data or your sequence data, it's really important to correct those errors right away, because once information is shared, it can be propagated out really quickly, which will also propagate the errors.

Okay, so when it comes time to share data publicly, there is a sort of hierarchy of contextual data that gives you the most bang for your buck and that labs will probably let you share. In the slide here I've provided a sort of prioritized list of data types, starting with those that I would consider to have the least amount of controversy associated with them; those are at the top of the slide. Those would be things like the type of host, so probably human, but it could be something else; the location of sample collection, at least to the country, preferably to the level of the state or province; the sample collection date, preferably to the day, not just the year or the month; who collected the sample; who sequenced the genome; and a little bit about the methods. All of these pieces of information will probably get you pretty far in terms of understanding where a pathogen is circulating in a region.

As we move down the slide, the next set of data types are very useful, but maybe more difficult to obtain or share. These are things like sampling strategies, so why certain samples were sequenced and not others, which is very useful to know because it addresses types of biases in the sequence databases. Sample type, so whether, say, it was a nose or throat swab, or a swab of an air vent in a hospital; these are useful for correlating whether different kinds of samples indicate a stronger presence of the pathogen. Demographics are great to have for understanding who a pathogen is infecting and how it's affecting them compared to others. And different kinds of diagnostic testing information, like Ct values in the case of SARS-CoV-2, are really great for anticipating which samples might produce higher quality sequences.

Now, the data types at the bottom of the list, things like vaccination, exposures, travel history, hospitalization, and health outcomes, are usually going to be pretty difficult to get and pretty difficult to share unless you are deeply embedded in a public health organization and have very good working relationships with people in different parts of it. But if you can get them and share them, they are very informative.

Okay, so now we're going to start talking about public databases. One of the main public repositories that you'll likely need to deal with, and that is being used to share pathogen genomics data globally, is GISAID, the Global Initiative on Sharing Avian Influenza Data. Don't let the name fool you; it's also used for other pathogens as well, like SARS-CoV-2.
The database was originally created for storing sequence information used for vaccine development; that's why the name is what it is. Besides GISAID, there's also the International Nucleotide Sequence Database Collaboration, or the INSDC. The INSDC is actually a collaborative effort between three different data hubs. This includes the National Center for Biotechnology Information, known as NCBI, and that hub is in the US. There's the ENA, the European Nucleotide Archive, hosted by EMBL-EBI, the European Bioinformatics Institute, which is located in the UK. And then there is the DNA Data Bank of Japan, DDBJ, which is, as you might have guessed, located in Japan. All of these repositories have overlapping but different submission requirements. So if you decide to submit to more than one of them, which we suggest for different reasons that we'll talk about later, there will be some data transformation required to fit your contextual data to the different formats that these places require. You may also collect a lot more information in your private database than you are permitted to share publicly, so it's very important to check with the data stewards, whoever is in charge of sharing data, and potentially any privacy officers at your institution, to make sure that you're complying with any data governance requirements and privacy policies that you might be subject to. And finally, no matter which repository you choose to submit to, there are generally three stages to submission. The first is that you need to set up an account. The second is that you need to prepare your contextual data and sequence data files. And the third is that you have to navigate the different submission portals to upload your files. We're going to focus on contextual data prep in this workshop, but we do have some protocols for submitting data that I'll share with you a bit later that address all the other steps.

Okay, so there are some key differences that are important to recognize between these two types of repositories, that is, the INSDC group of data hubs and GISAID. In the first case, the three nodes of the INSDC are considered open access. That means the data submitted there can be seen and used by everyone with pretty much no restrictions. The data that's uploaded to one of these nodes, like NCBI or the ENA, all gets mirrored across all three nodes on a daily basis. You can submit data about pretty much any organism, whether it's a pathogen or not; whether it's a bacterium, a virus, or a parasite; whether it's about humans, starfish, whatever; whether it's individual gene markers or entire genomes, including raw data or assembled genomes, metagenomes, or gene expression data. You can submit whether you are an academic scientist doing research or a public health scientist doing genomic surveillance; they will take it all. And all of the information is stored in different databases that have different tools you can use to retrieve and explore the data.

Now GISAID, on the other hand, to give credit where credit is due, actually hosts the highest number of SARS-CoV-2 genomes currently, sitting at around 15 million. Access to this data is controlled: you have to sign a data use agreement before you can register and access the data. This provides some protections for data providers, but it also restricts some of the ways that you can use the data.
Specifically, when you apply for an account, you have to agree to terms of service that specify things like: you need to attribute the data generators; you need to do your best to collaborate with them; you're not allowed to publish results without acknowledging the data providers; you're not allowed to combine the data from GISAID with data from other places; and there are other clauses as well. GISAID also only accepts data for a specific set of pathogens; like I mentioned, it was originally created for influenza. It's also assembly and consensus sequence focused, although I believe you can now start submitting FASTQs, though it's not entirely clear to me and others how those sequences are linked to the other types of data. Regardless, GISAID also offers different dashboards for visualizing information.

So as I said, we're going to focus on preparing the contextual data for submissions. When you submit to GISAID, there is a spreadsheet that you need to fill out that provides basic information about each sample that was sequenced. There are required and optional fields. There are a few specific formats that you have to use; for example, there are particular virus naming conventions that we'll look at in a little bit. But usually the fields are free text, and there aren't really any picklists. You can get the template after you sign up for an account, and the template comes with a set of definitions and instructions. You can see, well, it's very small, but you can see an image of it at the bottom left of the slide. There are slightly different templates depending on the organism you are working with; for example, there's a template for SARS-CoV-2 and a template for monkeypox virus. You should also check for updates, because these templates change over time.

Okay, so it's kind of small, but basically you can see there are required fields; those are in red here. Those required fields consist of information like submitter info, such as your lab name and address; similarly, sequence generator information, name and address; organism information; date and location of sample collection; information about the host, such as age and gender; and patient status, by which they mean whether the person was sick, asymptomatic, or recovered. As we previously discussed, that information, the demographics and patient status, is probably going to be hard for you to get from health authorities and more difficult to share. If there is a required field and you can't fill it in, you can just put in a null value like "missing" or "not provided"; you just have to put in something. You will also need to list the type of sequencing instrument that was used, like an Illumina MiSeq or an Oxford Nanopore MinION. And as I was mentioning before, the template sometimes gets updated, so do check that you're using the most up-to-date version, because if you're not, that could cause your submissions to get blocked.

Okay, so at the INSDC, on the other hand, you can submit programmatically or you can submit by using a spreadsheet through their web submission wizard. The preferred data structure is to create what's known as a BioProject, and this is basically a placeholder for contextual data records, which are called BioSamples. Within the BioProject, you can link BioSamples to different kinds of data, like raw data, assembled genomes, or consensus sequences. Now, you can put whatever you want in your BioProject, and you can create as many BioProjects as you want.
For example, you might want a BioProject for a particular outbreak or a particular organism. You can even link different BioProjects together under an umbrella BioProject, so you can have all the BioProjects from your organization in one place. We did that during the pandemic: all the different provincial BioProjects were linked under a Canadian umbrella BioProject. Each BioProject can contain many BioSamples. The nice thing about BioSamples is that you can include user-defined fields, which means that, besides the required set of fields, you can add on whatever information you want that you think might be interesting or useful to people. There are metadata requirements not just about the samples but also about the data, to be able to capture the methods. For example, for SARS-CoV-2 genomes, you can include the primer scheme and the sequencing instrument used for your amplicon sequencing in the raw read metadata. You can also put the name and version of the software that you used to generate your consensus sequence in the assembled or consensus sequence genome metadata. Each of these things, the raw data, the assembled or consensus sequence, and the sample, has a different form that you have to fill in. So it can be a little time consuming if you don't have tools that can generate these sheets for you automatically, but there is a place to put the different kinds of information, which makes the data easier to explore.

Now, because the INSDC accepts such a wide range of organisms from different contexts, they offer different metadata packages that contain sets of standardized fields developed by a group called the Genomic Standards Consortium. Using these metadata packages helps make data sets interoperable, so you can more easily combine data sets for analysis. I've taken a snapshot of some of the packages that are on offer. There are only a few in the slide here, but there are actually over 200 different packages, depending on whether you're submitting marker genes, genomes, or metagenomes. It also depends on whether your organism is associated with a certain environment, like water, soil, or air, or a certain anatomical region, like the human gut, mouth, or skin. It might seem a bit daunting to pick the right package for your samples when you're looking at the list, but a good rule of thumb for genomic epidemiology is to use one of the pathogen packages, either the clinical or the environmental one, or, if you're doing SARS-CoV-2 sequencing, there is a specific SARS-CoV-2 sample metadata package.

Okay, so this slide is just to show you the differences between GISAID submission requirements and those of the ENA, one of the INSDC repositories. There is a lot of overlap in the information types that are required, like where and when the sample was collected, host information, and provenance, like the lab name. But the fields are named different things, and there are a few key differences. For example, as mentioned, the sequencing instrument goes in a separate form if you're submitting to the ENA, whereas it goes in that one-stop-shop form that you submit to GISAID. Another important thing to recognize is that there are differences in organism naming conventions; specifically, how viruses are named at GISAID is different from how they are named at the INSDC. You need to create these names for the viruses yourself based on your country,
the year the sample was collected, and the sample ID. At GISAID, they call SARS-CoV-2 "hCoV-19", because that was the name used for the virus right at the start of the pandemic, before the International Committee on Taxonomy of Viruses, the people who standardize virus names, said we're going to call it SARS-CoV-2. So the INSDC uses the official version and GISAID uses the initial name. Anyway, we're going to explore this later in the lab, so it's just something to keep in mind.

We sadly don't have time to go through a whole submission pipeline in this one-hour seminar, but if you're interested in sharing data with public repositories, and I hope that you are, there are step-by-step instructions for setting up accounts, preparing your contextual data, preparing your sequencing data, and submitting, so basically what buttons to press, available on protocols.io, and the link to those resources is in the slide here. These protocols were developed by an organization called the Public Health Alliance for Genomic Epidemiology, otherwise known as PHA4GE, and their mission is to improve the reproducibility and openness of public health genomics data, and to improve the interoperability of data sets and systems. So please check out these protocols.

Okay. So far we've concentrated on curation of contextual data for genomic epidemiology, public health, and genomic surveillance, and on using data specs and tools to standardize and harmonize that information. But now we're going to go off-roading a little bit and talk about other kinds of data curation that are important for genomic epidemiology. An important set of concepts for data curation, and for sharing, are the FAIR data principles, first published in Scientific Data in 2016. FAIR stands for Findable, Accessible, Interoperable, and Reusable. Findable is important because you want others to be able to find your data on the internet or in different repositories and databases. Accessible means it should be easy to access and retrieve information; the more barriers and controls there are, the fewer people can get at your data, which makes it less useful for the community. Interoperable means computer systems and software should be able to exchange and make use of the data. And finally, Reusable: digital information like sequencing and contextual data can have many different uses, and in order to maximize the utility of the data, there should be enough context provided about how the sequence was generated, where it came from, how to contact people for follow-up if you need more information, and so on, so that the data can be used for different kinds of studies.

So what are some of the ways that we can help make contextual data FAIR? There are different roles and different things that people can do. For example, everyone can start using ontologies and data standards to make their data sets and systems more interoperable. Standards developers can make their ontologies and standards more accessible and findable by depositing them in registries; a lot of the time, folks in the community who could use these things don't, just because they don't know that they exist, which makes it hard for practitioners to put them into practice.
People generating data can of course make it available in public repositories, as we've already discussed, and they can provide as much contextual data as they can, particularly about the limitations of the data, sampling strategies and sources of bias, contact details, and so on. And public repositories can provide tools based on data standards for exploring and discovering the data.

Another important concept to recognize is that not all ontologies are interoperable. Some ontologies are built for a particular purpose, and so they're kind of standalone resources. To really get the most bang for your ontology buck, it's best to use ontologies that have been built with a common architecture. A great source for interoperable ontologies is the OBO Foundry. This is basically a community of scientists that reached consensus about how to build ontologies in an interoperable way, and all of the ontologies included in the OBO Foundry are open source. The Foundry provides some oversight in terms of assigning identifiers, so that when different teams create different ontologies, there's no ID clash. There's also a committee that reviews ontologies before they're accepted into the Foundry, to make sure that they conform to the principles and practices of the Foundry. So the OBO Foundry is a trusted source of ontologies for curating and standardizing a whole bunch of different genomic epidemiology data types. In fact, there are over 200 different ontologies in the Foundry that cover all kinds of domains of knowledge, from taxonomy to environments, food, geography, anatomy, disease; there are loads, and I've only mentioned a small slice of the domains of knowledge that are in the OBO Foundry.

If you are looking to explore different ontologies, for example to find a term that you need, say if you don't have a specification but you're looking to standardize some data, or if you have a particular specification and it's missing a term that you need, you can use what are known as ontology lookup services, like the EBI OLS that you can see in the slide here. You just have to type a term into the search bar and it'll spit out all the possible matches from a bunch of different ontologies. Now, you will need to know how to evaluate those matches, but we're going to talk more about that in the lab this afternoon.

So let's go through an example of how we can use interoperable OBO Foundry ontologies to curate and standardize data sets useful for genomic epidemiology. Say you wanted to study an epidemic of childhood illness caused by a particular pathogen that has different variants, and you wanted to examine how well different therapeutic agents improve outcomes. What you might have is a drug administration history, sort of like the records that you can see in the slide. I should mention that this is mock data, by the way, but it represents a real data curation exercise we carried out a few years ago for the Canadian Healthy Infant Longitudinal Development, or CHILD, Study. What you can see are the names of drugs that mothers administered to their children. The histories were captured using free text, so there are spelling mistakes, there's a mixture of brand names and generic drugs and drug formulations, and there's a mix of specific and non-specific antibiotic names. So the question is, how can we curate this data set so that scientists can start to identify trends? What we ended up doing was deciding to standardize the active ingredients of the different drugs.
So if Tylenol was entered, for example, we knew that the active ingredient was acetaminophen, which we could standardize using ChEBI, an ontology for chemicals. Using ChEBI, we also knew that there was a synonym for acetaminophen, paracetamol, so we could identify chemicals in the list that were the same but had different names. Using the drug hierarchy in ChEBI, we were also able to group specific antibiotic names together with the general antibiotic term; for example, we could group neomycin and gentamicin together with the general term "antibiotic". So by using a combination of curation criteria and an ontology, we were able to make more sense of the data set so that it could be used for more complex querying and machine learning to identify different trends.

Here is another example. In this example, we consider why it's important to understand the methods behind data when curating and harmonizing the different lab, clinical, and epi data used to interpret genomics data. Say you wanted to study how the genomic variation of a particular infectious disease can contribute to the increased development of allergies in children. You have positive allergy test data from two different labs, and you want to combine them so your sample size is bigger for analysis. Sometimes labs will use the same field name to indicate the same result, but the methods are different, and there can be different implicit criteria used to interpret the data that are not included, and this can create uncertainty and challenges for harmonization. In the slide here, we have two labs that are both carrying out allergy tests on children. You can see there's a little picture here about how they do that: they basically prick the skin, put a drop of allergen on the pricks, and then wait and see if the area swells. Usually there is a threshold or cutoff on the size of the swelling. You can see that Lab A used a diameter of two millimeters, so anything bigger than that is a positive test. Lab B used a five millimeter diameter cutoff, so the swelling had to be bigger than five millimeters in order to qualify as a positive test. Imagine that both of these labs shared spreadsheets of data with you that reported positive skin prick test results, but they used different thresholds. A positive test in Lab A's spreadsheet is not the same as a positive test in Lab B's spreadsheet. We can actually make these interpretation criteria more explicit using ontologies, by using terms with different definitions and different identifiers, which makes it clearer that the data requires greater consideration before it is combined.

Okay, and here is one last example of how data curation is important for infectious disease genomic epidemiology. The Comprehensive Antibiotic Resistance Database, known as CARD, was developed and is maintained by the wonderful folks in the McArthur lab at McMaster University in Ontario. It's an incredibly important resource for understanding AMR in pathogens. This database collects and organizes reference information on antimicrobial resistance genes and proteins and their phenotypes. The database covers all types of drug classes and resistance mechanisms, and it structures and links its data based on the Antibiotic Resistance Ontology, known as ARO, which is an OBO Foundry ontology.
The facts and assertions in the database are based on evidence identified by curators who painstakingly go through the literature. And to make sure that the curation is consistent, the CARD team has created a system that involves well-defined and documented criteria for decision making, tools for automating processes as much as possible, version control of the database, and other quality control steps as well. This data curation is incredibly important, as the ARO and CARD underpin the Resistance Gene Identifier tool, and that tool is used all over the world for studying resistance in outbreaks and in infectious disease research. I'm not going to go into CARD and RGI right now; you're going to hear more about that in module six. But what I do want to emphasize are some of the data curation best practices exemplified by the CARD team, because they are good lessons to learn when setting up systems for curating your own data or for creating your own databases.

So in the slide we basically have five data curation best practices. The first is to establish a set of criteria or rules for deciding how to curate your data. Those criteria can include scoping, that is, what you're going to include or not include. For example, in the drug data that we looked at a few slides ago, we didn't curate the brand names; we decided to focus only on the active ingredients. That meant there was information we left out, and that was a scoping decision. You also want to consider whether you need to decide on any thresholds or cutoffs. For example, in the allergy data, we realized that the methods would impact the way we standardized the data, so it would be important to know how the different kinds of data were generated, because distinctions will need to be made, and that will need to be documented. Because you want to make it clear to all of the curators how to react consistently in different situations, it's good to have a set of documented examples of common curation issues. For example, if curators see situation A in the data, they know they follow protocol A; if they see situation B in the data, they know they have to follow protocol B.

The second best practice is to use trusted sources and evidence. For example, if you wanted to state the function of a particular gene product, citing the literature is good. Noting whether a particular gene product is expected to carry out a function because of its sequence similarity, or whether it's been tested in vitro, is also a good thing to do; that's where evidence comes in, and different types of evidence carry different weight. Or say I am curating a data set of samples, say swabs from agricultural equipment, and I want to describe or define the equipment so people in public health can understand what the people in agriculture are testing. I want to describe the equipment and its function based on how it's described in, say, an agricultural manual or reference, not based on its description on, say, Amazon. So we want trusted resources.

The third best practice is to establish consensus, so that everyone carrying out curation is in agreement about how things should be done. Using automated tools for curation, where you can encode the rules so they're less open to interpretation by individuals doing manual curation, is a really good way to do this. That's not always an option, but if you can use a tool, that's a really good first choice.
If you have to do manual curation, having a proficiency test, or some sort of exercise that can catch conflicts or inconsistencies between curators, is really useful. The fourth best practice is to document, document, document: document everything, your criteria, your processes, common issues, changes you make over time, and examples. This establishes a track record for auditability, in case you need to troubleshoot something later, and also helps for training purposes: if you've documented how you do things, it's easier to bring a new person on board. It's also good to make your documentation openly available, for example on GitHub. This will help create trust in your work and your products, be your products data, databases, tools, processes, whatever. And the fifth best practice is versioning: your criteria, practices, and data will change over time, so it's really important to version control your documentation, your data sets, and your databases, just like you would for software and analyses and everything else you do in public health and infectious disease genomic epidemiology.

If you follow these best practices for curation, you really set up a quality control framework that will contribute to high quality data, standardized data structures, transparent methods, and the implementation of FAIR principles, and that will result in open, interoperable, and reusable data that can be effectively turned into action to help prevent and reduce the spread and impact of infectious disease.

Okay, we are almost done. In summary, here are the things we have learned today: ontology-based data specifications increase the interoperability of data sets and systems; data management tools like the DataHarmonizer help operationalize data standards; data sharing is important for situational awareness, decision making, and innovation; data stewardship is important, and it involves considerations for data sharing that pertain to privacy, security, and public trust; and open and access-controlled public repositories have different advantages, like data provider protections, that have to be weighed against disadvantages, like data use restrictions, in the case of GISAID. On the INSDC side, the advantages are that it's flexible with no restrictions, but you have to balance that against less control over the data. Different public repositories also have different submission formats. And finally, data standardization, tools, and best practices help build knowledge bases useful for public health, genomic epidemiology, and also research.

Okay, I know that was a ton of information, but that about wraps things up for me. I want to thank all of the folks who helped make the data standards and ontologies and tools, and all of the practitioners in the public health bioinformatics community who have been putting them into practice. I'd like to thank the organizers of the workshop for having me here to talk to you about these things, and also thank all of you, the participants, for listening. I'm happy to take any questions. Thank you very much.