Okay, so welcome to the first offering of the infectious disease public health-related bioinformatics workshop. As I mentioned, I am a senior scientist at the BC Centre for Disease Control (BCCDC), and the type of work that I do there is essentially to set up genomic epidemiology as a routine workflow at BCCDC. My research interests include the sequence analysis component of genomic epidemiology, but I'm also interested in how you can integrate diverse data sets through the use of ontology, so these are the two areas that I will cover a little bit in this module. The introductory module is also meant to fill in holes in some background knowledge that you may not have, in order to make the subsequent sections more understandable. But a lot of you, I saw when I looked at your applications, have extensive experience already, so I also want to make this an interactive session: if you have comments to add, feel free to just pipe up, and if it's important enough, I'll repeat it so it's recorded. If I don't, that's just because I forgot, so please do remind me to repeat your comments for the recording.

So, a brief overview of the course. We will start with the introduction and the background, talk a little bit about why you're here and what the benefit of genomic epidemiology is, and give you some idea of what the challenges facing genomic epidemiology are. In module two we'll start to go into the technical aspects of how you can actually carry out genomic epidemiological analysis in your own group, so we'll have both a background session and a hands-on session on how you can construct phylogenetic trees and how you can perform single-nucleotide variant or single-nucleotide polymorphism analysis. Module three will cover molecular subtyping activities using whole genome sequencing, and that will take us to the end of today; tonight we'll have the keynote by Fiona Brinkman on open bioinformatics. In module four tomorrow morning, Andrew will talk about antimicrobial resistance genes and how whole genome sequencing has transformed some of the laboratory practice in terms of AMR detection and characterization, and some of the bioinformatic tools to do so will be shown. Module five will be done by Rob on phylogeography: epidemiology is about people, place and time, and how we link genomic data to location information, how we interpret that, and how we visualize that will be covered in module five. Module six will be covered by Gary, who will talk to you about metagenomics and how that can be used to detect emerging pathogens, and module seven, done by Anna, is on data visualization, more of a high-level conceptual introduction to how visualization can be used to improve your understanding of analysis.

So the general learning objectives for the whole workshop are to understand how genomic epidemiology can improve public health microbiology, and how you can process genomic sequence data using a variety of bioinformatic tools I will show you. All the tools I will show you are open source tools that you can download and use at your own institute, and, as I mentioned, you can also use them on Amazon Cloud if you decide to buy time there. We will also show you how you can interpret genomic data in an epidemiological context and perform several types of epidemiological analysis throughout this workshop.
Then module seven will help you understand the fundamentals of data visualization, which is an area of active research where tools are just being built now, and we will show you how you can actually use R and Shiny to build some simple visualization tools. Lastly, hopefully in this workshop you'll recognize some of the limitations and challenges of genomic epidemiology, which is a rapidly evolving field as we know.

For this module one, the learning objectives are to be familiar with the role of public health agencies, and to be familiar with next generation sequencing and its application in public health microbiology. I think most of you have some exposure to NGS already, so this will just be a conceptual overview. You'll also become familiar with sequence data processing; again it's just a conceptual overview of what kind of data massaging and data processing needs to be done for subsequent analysis, so that when you actually do these steps in the tutorials and in your assignments you have the conceptual background. Then you will hopefully also be able to recognize the importance of metadata in sequence and data integration. Microbial genomics as a field is fairly mature: the first genome came out in 1995, and hundreds of thousands of genomes have been sequenced to date, but the issue with most of the data is that they come with very little contextual information, and that makes the interpretation of the sequence quite difficult. So hopefully after this workshop you'll get a sense of why contextual information, and especially the organization and harmonization of contextual information, is important for the interpretation of genomic data. Lastly, depending on the time, I'll have a few quick slides to cover the different genomic epidemiology analyses in this course; I sort of already gave you a rundown in my previous slide, but I'll delve into that a little bit more.

So briefly, the role of a public health agency is to track and intervene in the spread of diseases to improve the health of the population, and through this process hopefully we learn some lessons and are able to come up with policies and strategies to prevent disease from occurring in the first place. Public health laboratories test patient and environmental samples to detect pathogens and determine the cause of diseases. Our provincial lab, which is I guess mid- to large-sized by country standards, processes about 3,000 samples a day and about a million samples a year. Public health agencies are commonly described as having two arms. The epidemiological investigation arm is interested in people, place and time: in a foodborne outbreak investigation, for example, epidemiologists or environmental microbiologists might phone the patients up and ask what type of food was consumed, where the food was consumed and when the food was consumed, and using that epidemiological information they try to infer a common exposure; if enough cases point to a common exposure, the outbreak is considered confirmed by epidemiology. The laboratory arm, on the other hand, is interested in testing the actual samples derived from the environment or from the patient, asking what pathogens might be found in the samples and, for epidemiological tracing, what subtype of pathogen. You'll learn a lot about this in the next few sessions.
And the goal is to identify the pathogen and to type the pathogen, and again, if enough cases point to a common pathogen, then this is considered an outbreak that's confirmed by laboratory analysis. So why are we interested in applying genomics to outbreak analysis? This is a picture of all the airplane flight paths around the world, and you can see we are at an age where there's a lot of traffic around the world, and with the traffic and the trade come the pathogens as well. But if you just take an epidemiological approach, how do you narrow down the common exposure when there is so much background noise?

So one study that I found quite interesting is this metagenomic analysis of toilet waste from long distance flights, "a step toward global surveillance of infectious disease and antimicrobial resistance." This is a study done out of Denmark that took waste samples from 18 different flights from three different continents. [In response to an audience question:] I don't think so, I don't know; I guess it's considered waste, so it's like going out and collecting waste somewhere. The three continents are North America, Europe and Asia. What I found kind of interesting is that 400 litres of waste is produced per flight; that's how much, I guess, including the blue water used to flush the toilet. Anyway, for the sake of not contaminating the samples, they basically shipped the full 400 litres of waste to their lab, and I'm sure they didn't process the whole thing; they extracted a sub-sample and then sequenced the DNA that they found in the human excrement. The samples were then clustered based on microbiome profile, and if you're interested in more about how to do that, there's the microbiome workshop that CBW offers; it has already run for this year, but next year you can certainly take it. They also characterized the antimicrobial resistance genes found in the samples, and this is something that you will learn more about in Andrew's lecture. So, long story short, they found that the samples indeed do cluster by geographic origin, suggesting that it could actually be a way to identify the source of contamination of flights based on characterization of the microbiome. What's more biologically interesting is that a higher proportion of antibiotic resistance genes was found in flights from South Asia, shown in red, compared to North America or other parts of Asia; I thought it was Europe, but it's actually just another part of Asia that they looked at. The study reference is at the bottom in case you're interested.

So the current state of the clinical microbiology lab involves a variety of different tests, which increases the need for triaging of samples and requires different platforms, different reagents, and so on to process these samples, and a well-run lab has stringent QC and SOPs in place to make sure all the tests run smoothly. For example, at BCCDC there are over 100 tests that we perform regularly that are available to order on our test menu, and that makes running a lab more challenging, because you have to organize the tests based on the amount of time needed, the type of reagents needed, and so on. And for some of the slow-growing organisms, if you have to culture them and then run susceptibility testing, that is, antimicrobial resistance profiling, on these pathogens, it also takes a long time.
As a result, the turnaround time for tests can range from minutes to, sometimes, months. And often specialized tests that we cannot perform locally have to be sent to the National Microbiology Lab, and the transit time also adds significantly to the testing delay. What's proposed is that whole genome sequencing, or DNA sequencing-based technology, can replace some of the existing tests and therefore simplify the workflow significantly in the laboratory setting. Also, since the sequencing time and the data processing time can be quite well characterized and optimized, the overall turnaround time is easier to control, and as desktop sequencers become commoditized in local or frontline labs, this type of workflow becomes even easier to run as a distributed network.

So, genomic epidemiology: the name itself really doesn't have a lot of creativity built into it; it's really just the combination of whole genome sequencing data from pathogens with epidemiological investigations to track the spread of infectious diseases. What's important to realize is that they go hand in hand: the epidemiology provides the contextual information for performing genomic sequence analysis, and the genomic sequence information in turn provides high-resolution typing data, or high-resolution test results, for the epidemiological investigation, to help filter out background cases from linked cases.

So, knowing some of the workflows and how genomic epidemiology can help streamline lab analysis, the question to ask is: why are you here, or what are the benefits of whole genome sequencing? There are a few front-runners in the world that have already made whole genome sequencing their routine analysis pipeline. This includes the UK's Public Health England lab network, which has committed to sequencing all Salmonella isolates submitted, and the US FDA and CDC, which have a distributed network of state labs that help to sequence the data; but then the analysis, and I guess that's why some of you are here, presents a challenge in such a distributed network. So one reason that you're here might be because whole genome sequencing is forced upon you, so you are forced to learn about it, but there are other benefits beyond that. As I mentioned already, it simplifies the workflow and improves the turnaround time in some applications. It reduces cost by reducing the number of platforms and reagents that you have to maintain. Sequencing is becoming commoditized, making it easier to deploy to regional labs. And, very importantly when it comes to data analysis, the sequence results can be more easily shared with other groups, and the data are by and large more comparable than, say, a gel picture, or some type of PCR assay where you might not use the same primers, or where for whatever reason the tests are less comparable. With whole genome sequence data, partly due to the limited number of sequencing platforms available and partly due to the nature of genomic sequences themselves, the data are easier to compare across different institutions. However, there are some challenges associated with genomic epidemiology. The results, for example, can be harder to interpret, because now you're given a large data set that you have to analyze and learn how to process. The computational resource requirement is also higher, and typically there's not a lot of local IT support.
It's also a rapidly changing technology, meaning that as we tweak the pipelines or the parameters it can affect the results, and how you balance the use of the technology with its ongoing evolution and improvement is a challenge that all bioinformaticians working in this area are facing. The per-sample cost is still relatively high, and often batching of samples is required to achieve cost efficiency, and some labs may simply not have the throughput to batch up samples to achieve that cost efficiency. So there are other benefits and challenges, and I want to hear from the crowd if you have any others to add to the list based on your own experience.

[Audience comment, partly inaudible, about a recently published paper on cross-contamination between batched samples.] Yeah, so when you batch samples, cross-contamination from one sample to another can happen, and there are instrument limitations associated with how well the signal can be interpreted, and some of the contamination happens on the instrument rather than being due to human or operator error. So that's definitely an issue when you start to batch. Any other challenges? [Audience comment: right now the states are using PFGE, and whole genome sequencing gives a lot more resolution to identify clusters.] Okay, so that's a benefit, yeah, a big benefit: higher resolution compared to existing tests such as PFGE. Anyone else? Right, so one is the challenge associated with storage of the data, especially when it comes to privacy and to transferring large data sets, and also the challenges associated with incidental findings. Yeah, so an incidental finding means that you are looking for one bug but you might find something else. For example, you might unexpectedly find out that the patient has HIV when you're doing metagenomic sequencing. So what do you do in that situation? Do you tell the patient, or, because the person came in for a different disease, do you just report that disease? There are ethical issues that need to be resolved. And there are also incidental findings in the host DNA that might be present in the sample. Anything else?

[Audience comment:] I just wanted to mention that a lot of public health labs, and the members of public health labs, rely on the exchange of cultures and the biobanking of those cultures, which leads to culture collections, and the evolution of whole genome sequencing, especially if it delivers on its promise, means possibly in the near future the ability to do culture-independent diagnosis: you just grab a biological specimen and analyze it directly using whole genome sequencing, and you no longer need to culture the specimen, which has serious implications when time is of the essence, going from possibly months of waiting down to a few hours. So, just briefly summarizing that for the recording: the issue is essentially the move to digital-only records that we're facing now; as we transition to keeping just the sequence data, and not the original biological samples or isolates, we are potentially losing an important resource that currently is maintained as part of routine laboratory work.
So, for example, at BCCDC we have a huge culture collection based on the isolates that we have obtained throughout the years, and if we just go straight to metagenomic sequencing, or if we perform whole genome sequencing and don't keep the samples, later on we will not be able to retrieve those isolates. [Audience comment, largely inaudible, about culturability: other researchers have demonstrated that with effort you can culture and retrieve organisms, including small populations in the microbiome that are not accessible by sequencing alone.] Gary, you have to speak up. So yeah, the other consideration is what percentage of the samples submitted to the lab have unknown etiology, so that you are unable to find a known pathogen associated with them, and I think as we do more metagenomic sequencing of those samples we will find out more about potential pathogens that might cause disease. But I think you have to break down the problem: for the known pathogens, the success rate for culturing is actually quite high in the lab, but there are the unknowns, which are much harder to estimate. Anyone else want to add anything before I move on? Oh yeah, that's a good point. Andrew brought up a good point that going from genotype to phenotype is not always straightforward or easy, so in order to improve the bioinformatic analysis for that, the phenotypic tests and the genotypic tests need to be performed hand-in-hand for the next little while, to help us build up the knowledge to translate from genotype to phenotype. The other issue with direct sequencing is that we don't know if the organism is viable or not: you found the DNA, but is the organism still viable? We have a study, for example, looking at environmental samples for avian influenza viruses, where we are able to find and type these RNAs from soil and sediment samples, but how long have those viruses been there and are they still viable? That's a bit of an unknown, and in collaboration with CFIA we are actually trying to culture these viruses to see if they are still viable or not.

Okay, so here's the section where I go into a bit of a background knowledge gap-filling exercise, so feel free to pipe up if you want to add anything or if anything doesn't make sense. In this course we are by and large giving you bacterial genomic samples to deal with, so I'm going to focus on that. Typically the bacterial genome is contained within a single circular chromosome, although some bacterial genomes are linear. Also very important is that it is typically a haploid genome, meaning there's only one allele per gene in the cell, and bacteria reproduce asexually, so to speak. However, they may contain plasmids, which are extrachromosomal DNAs that are potentially much more transferable across organisms. So the genome of a bacterium is the gene content harboured by the chromosome and its plasmids, and certainly when it comes to antimicrobial resistance, virulence factors and so on, a disproportionate amount of those important genes are found on plasmids rather than the chromosome. Genome sizes are not particularly big when it comes to bacterial genomes.
Bacterial genomes typically range from 0.5 megabases to about 10 megabases, and the average, especially for pathogens, is somewhere between 2 and 5 megabases; that roughly corresponds to one megabase per thousand genes, so these organisms typically contain around 2,000 to 5,000 genes. However, the genomes are constantly evolving and constantly under selection pressure, and there are some key forces highlighted here. Some human pathogens are known to have undergone genome reduction and become specialized to a particular niche: they lost certain metabolic pathways that they no longer need, but in exchange they are very well adapted to that niche. Another driving force is genome rearrangement, which can affect gene expression, allowing genes to be turned on and off fairly rapidly in certain pathogens such as Neisseria. Gene duplication allows an organism to evolve new functions: by first duplicating a gene, one copy of the gene comes under less selective pressure and is therefore allowed to mutate faster and potentially change function. This is the counterpart to the gene loss seen in genome reduction. The last one I want to mention is horizontal (lateral) gene transfer, which is the acquisition of genetic material from a non-parental source, and one of the key findings of microbial genomics is that there is a lot of horizontal gene transfer occurring between bacterial cells.

Here is the high-level workflow of whole genome shotgun sequencing. You start, of course, with the cultured isolate; for metagenomic samples you would bypass the isolation process and go straight to DNA extraction. The DNA is then fragmented, and sequencing adapters are attached to each piece, sometimes including unique barcodes that are used to identify individual samples. That is then made into a DNA sequencing library and put on a sequencer, and at the end of the process what you get is a huge text file of As, Ts, Cs and Gs that you somehow have to make sense of; after this workshop you'll be able to do some of the analysis associated with these sequence files.
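As a small illustration of that last step, here is a minimal sketch of binning reads from a FASTQ file by sample barcode. It assumes a simple inline barcode at the start of each read; the barcode sequences, barcode length and file name are hypothetical, and real runs are usually demultiplexed by the instrument software rather than by hand.

```python
# Minimal sketch: demultiplex FASTQ reads by an assumed inline barcode prefix.
# Barcodes, barcode length and file name below are hypothetical examples.
import gzip
from collections import defaultdict

BARCODES = {"ACGTAC": "sampleA", "TGCATG": "sampleB"}  # hypothetical sample barcodes
BARCODE_LEN = 6

def read_fastq(path):
    """Yield (header, sequence, plus, quality) records from a gzipped FASTQ file."""
    with gzip.open(path, "rt") as fh:
        while True:
            header = fh.readline().rstrip()
            if not header:
                break
            seq = fh.readline().rstrip()
            plus = fh.readline().rstrip()
            qual = fh.readline().rstrip()
            yield header, seq, plus, qual

def demultiplex(path):
    """Bin reads by barcode; unrecognized barcodes go into an 'undetermined' bin."""
    bins = defaultdict(list)
    for header, seq, plus, qual in read_fastq(path):
        sample = BARCODES.get(seq[:BARCODE_LEN], "undetermined")
        # trim the barcode off the read (and its quality string) before storing
        bins[sample].append((header, seq[BARCODE_LEN:], plus, qual[BARCODE_LEN:]))
    return bins

if __name__ == "__main__":
    for sample, reads in demultiplex("run1.fastq.gz").items():  # hypothetical file
        print(sample, len(reads), "reads")
```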
The cost per megabase of DNA sequencing has dropped dramatically over the years: the initial human genome cost about $100 million to sequence, and now it's about $10,000 to sequence a human genome. Correspondingly, because bacterial genomes are a tiny fraction of the size of a human genome, theoretically you should be able to sequence a bacterial genome for as little as $10; in reality we're still in the $100 range, somewhere between $100 and $500, depending again on how you batch the samples and how streamlined the process is, and this of course doesn't account for the labour cost involved in sequencing.

So, quickly going over the sequence analysis process: these are things that we're not necessarily covering in depth in the workshop, but you'll have some opportunity to do hands-on exercises with them, and if you're interested and want more experience I'll point you to some of the resources; again, there are resources and workshops online that you can do. Okay, so another shameless plug: there is a CBW workshop that deals with this topic. Once the DNA sequence is generated, the steps of data analysis typically include assembly of the genome, either through de novo assembly or through a mapping exercise that maps the reads back to a reference genome, and after that you can carry out annotation, in other words adding biological information to the sequences. This process includes predicting where the genes are found on the piece of DNA and what the functions of the genes or coding regions might be. Then, as will be covered later today, we also want to identify variants, such as single nucleotide polymorphisms or allelic differences, from the sequence data.

For genome assembly there are roughly two categories of genome assemblers. The task, of course, is to reconstitute the whole genome from the fragments of DNA that you sequence; as you may know, the reason we need to fragment the DNA is that most current sequencing platforms generate short reads rather than genome-sized reads. De novo assembly is very much like a jigsaw puzzle: you try to identify fragments that overlap each other and essentially align them to assemble the reads. The reference-based approach is a bit like the diagram shown here: you already have a reference genome, a reference picture, in place, and all you're trying to do is map the most similar reads back to the reference. There are some challenges associated with assembly. First, all of these platforms have certain sequencing errors associated with them, so when you're trying to assemble the genomes you're not necessarily looking at 100% identical overlaps; you need to allow some error margin in the overlapping regions. Second, there are repetitive regions in the genome, and these can confuse the assemblers: two distinct repetitive regions may be collapsed into a single sequence by the assembler, and the intervening region between the repeats may not be assembled properly and may drop out of the assembly, so you potentially lose information. This can be important when you're trying to type organisms based on, say, repeats in the genome or other genomic features; if the assembly is incorrect, that can result in mistyping of the strain.

[Audience question:] So I would assume that
this is obviously a problem for genome assembly and species identification, but also, and maybe you're getting to this later, for the reads that you have for a particular organism, to determine the pathogen load: are there bioinformatic tools to help with that? Just like with human genomes you've got copy number variations, so you have to map the reads and determine the number of copies; is there a way to resolve that when determining bacterial load if you've got such variations in a genome?

So yes. For example, in metagenomic analysis people use 16S, but there are multiple copies of the 16S rRNA gene in a genome, and different organisms have different numbers of 16S copies, so to do the analysis properly you actually have to adjust for the copy number of a given gene in the genome, and by and large we don't see that done, so that's a very good point. The other point is, if you're trying to use repeats as a marker, for example tandem repeats, these regions are notoriously poorly assembled, especially from short reads, so that can also affect your interpretation. Any other questions?

So this is just a list of sequencing error rates for some of the next generation and third generation platforms. The key point here is that the third generation sequencers, such as PacBio or Oxford Nanopore, still have much higher error rates than the second generation sequencers such as Illumina. To minimize the errors, the same region of the genome is typically sequenced multiple times, and typically people aim for 30X coverage; you will see later in the workshop that the depth of coverage is used as a threshold or cut-off for some of the analysis. This is basically how many times a region is sequenced in your sample, and the consensus is then typically taken as the correct sequence, so when your depth of coverage is too low and there are sequencing errors, you might not be able to properly determine the consensus sequence. After assembly there are often still gaps in the genome that cannot be closed, due to lack of sequencing coverage or, more likely, due to unresolved repeats; 16S genes, as an example, are typically not properly assembled when you just do an automated whole genome assembly. So instead of a complete genome you get a set of contiguous sequences, or contigs, that represent most of the genome but might miss certain regions. Closing the gaps manually is typically called the finishing step of genome sequencing; it's labour-intensive, and I think it still costs tens of thousands of dollars, because it involves manually designing PCR primers to try to close the gaps. This has been alleviated by combining third-generation sequencers, which have longer reads, with the shorter but higher-quality reads from second-generation sequencers to improve the overall assembly process. For some reason this slide didn't flow properly, okay, we're missing some text there, but it is showing up on the screen, right?
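To make the depth-of-coverage and consensus idea above concrete, here is a minimal sketch of calling a consensus base per position from read counts, with a 30X-style depth cut-off. The pileup counts are toy data; real pipelines derive them from read alignments.

```python
# Minimal sketch: majority-rule consensus calling with a depth-of-coverage cutoff.
# The per-position base counts below are toy data, not from a real alignment.
from collections import Counter

MIN_DEPTH = 30  # positions covered by fewer reads are left uncalled ("N")

def call_consensus(pileup):
    """pileup: list of Counters mapping base -> read count, one per genome position."""
    consensus = []
    for counts in pileup:
        depth = sum(counts.values())
        if depth < MIN_DEPTH:
            consensus.append("N")              # not enough coverage to call
        else:
            base, _ = counts.most_common(1)[0]
            consensus.append(base)             # majority base taken as the consensus
    return "".join(consensus)

# toy example: three positions, the last one under-covered
pileup = [
    Counter({"A": 40, "G": 2}),
    Counter({"T": 33, "C": 1}),
    Counter({"C": 5}),
]
print(call_consensus(pileup))  # -> "ATN"
```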
But anyway, this is just to say that for genome annotation we're typically looking for the locations and functions of the open reading frames, or genes. Locating the genes is a little bit like looking at a string of random characters and picking out words: I'm sure you've all seen the kind of puzzle where you get a matrix of letters and you're supposed to find the hidden words, and the first word you find is supposed to tell you something about your life. Our ability to do that is very similar to a computer's ability to recognize patterns in random sequences. Genes that encode proteins have certain frequency profiles that are quite different from non-coding regions, so using this difference in coding frequency, computer programs can quite reliably identify coding genes in a DNA sequence, and for microbial genomes the accuracy is 98% or better, so it's really a problem that has been solved by bioinformatics. While we're able to identify the locations of coding genes, knowing the functions of the coding genes is a much more difficult problem, and this involves functional annotation of the genes. This is just a diagram, which I think is from Gary, that shows the overall workflow of an automated annotation engine: you start with the contigs and go through structural annotation to identify non-coding regions and protein-coding genes, and for each of the coding genes you carry out functional identification, which I'll talk a little bit about, using both automatic approaches and manual inspection by human curators; for the non-coding regions there are also tools to help predict RNAs and other non-coding sequences.

For functional prediction, the most common way to do it is through sequence similarity search. It's assumed that genes that have sequence similarity are derived from the same ancestral gene and therefore share similar functions; that's the basic assumption of sequence similarity search. BLAST is the most common tool for performing this analysis: you infer the function of one gene or protein based on its similarity to a gene or protein of known function, and this is called transitive annotation. You didn't actually study the function of the gene that you sequenced, but you infer its function from another gene that has been annotated in a database, and this of course requires a database such as GenBank or Swiss-Prot and so on. So, how many of you have done a BLAST search? Okay, so that's good, I'm preaching to the choir, so I can skip that. There are different versions of BLAST that allow you to do nucleotide and protein searches, and combinations thereof. Briefly, some rules of thumb regarding the interpretation of BLAST results. As you know, at the end of a BLAST search you get a score and you get an E-value. Does anyone know what the E-value means? When you see an E-value, how do you interpret it? First of all, is a high E-value good or a low E-value good if you're trying to find a match? A low E-value, okay. And does anyone want to offer a definition? [Audience: is it the probability of finding the match if it was random?] The probability of finding a match randomly, by chance, yeah. So here I'll give you an analogy for that. Let's say you have a phone book with a lot of names in it. The chance of you finding a common last name is going to be a lot higher than finding an uncommon last name, but not all people who share a common last
name are related to each other, right? In the same way, not all sequences that are similar are related to each other. There are sequences called low-complexity regions, regions of low sequence complexity, where similarity can arise simply by chance or simply from some functional constraint, without the sequences actually having evolved from the same common ancestor. So when you do a BLAST search, the program takes into account how common a sequence is in the database and uses that information to estimate the probability of identifying a similar sequence that is actually not related to your query. Hopefully that makes sense: using the phone book analogy, it's like finding someone who has the same last name as you but is actually unrelated to you; the probability of that happening is much higher if you have a common last name than a very uncommon one, and it also depends on the size of your phone book, or your database. Does anyone want to offer a different interpretation or explanation of the E-value? Sure, that would be great. [Audience: the E-value is basically, if I take that sequence and compare it against a randomized database, how many matches of that quality I would expect to see.] Yeah, so that's a more technical description of how the statistical value is actually derived. Okay, so hopefully this gives you some intuitive sense: when you see an E-value approaching zero, it means the probability of finding an unrelated sequence that's similar to your sequence by chance is approaching zero, and the higher the number, the higher the expected number of times you would see such a match by chance (it's an expected count rather than a probability). The rule of thumb is that in a typical BLAST alignment the E-value should be at least smaller than 0.01, and often 10^-5 is used as a safe cutoff; this is referring to protein sequence alignments, and for DNA we typically use a much lower E-value cutoff. E-values and scores are related, but the E-value contains more information; as Rob described, it's based on randomized data sets and a statistical probability framework. The percent identity, again, is more intuitive: it refers to essentially how similar the two sequences are to each other when you line them up, so 100% identity means the two sequences are the same, and 99% means one amino acid out of 100 is different. This is just a list of automated annotation systems that you can try out on your own.

So once you can process one genome, you can do so for many genomes and then compare them to each other, with the goal of finding genomic variations that correlate with phenotypic characteristics; again, this is the genotype-to-phenotype mapping that we're trying to achieve through comparing genomes. For example, we might be interested to know why certain pathogen isolates are more resistant to a certain antibiotic than others, so you might want to compare the two genomes and see if you can identify resistance genes and so on. We can also use the variations to track transmission of pathogens, as you'll see in the workshop.
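Before moving on to comparative genomics, here is a minimal sketch of applying the E-value and percent-identity rule of thumb above to BLAST results. It assumes the standard 12-column tabular output (BLAST+ `-outfmt 6`), where percent identity is column 3 and the E-value is column 11; the file name and the identity threshold chosen here are just illustrative.

```python
# Minimal sketch: filter tabular BLAST hits by the E-value/identity rule of thumb.
# Assumes standard -outfmt 6 columns; file name and identity cutoff are illustrative.
EVALUE_CUTOFF = 1e-5      # "safe" E-value cutoff mentioned above (protein searches)
IDENTITY_CUTOFF = 90.0    # example percent-identity threshold

def filter_blast_hits(path):
    """Yield (query, subject, pident, evalue) for hits passing both cutoffs."""
    with open(path) as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            query, subject = fields[0], fields[1]
            pident = float(fields[2])   # column 3: percent identity
            evalue = float(fields[10])  # column 11: E-value
            if evalue <= EVALUE_CUTOFF and pident >= IDENTITY_CUTOFF:
                yield query, subject, pident, evalue

if __name__ == "__main__":
    for hit in filter_blast_hits("results.blast.tab"):  # hypothetical file name
        print(*hit, sep="\t")
```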
Roughly speaking, comparative genomics can be done at three levels. One is regional differences: if there's recombination or rearrangement, you can detect it through comparative genomics. Then there's gene-level analysis, which will be covered in a later session, where you can also do a gene profile comparison, in other words comparing the presence or absence of genes. The last one is nucleotide-level comparison, and one example is the single nucleotide variants or small indels that can be used to compare microbial genomes and therefore to type them; this will be covered in our next session.

So, the first genome, as I mentioned, was published in 1995, and soon after that the first comparative genomics paper came out; they compared the minimum number of genomes needed to call it comparative genomics, so two Helicobacter pylori genomes were compared, and they're actually about seven years apart. What was very interesting, at least at that time, was that they were able to line up the two genomes, as you can see here, and they saw that the strain-specific genes were actually quite well clustered together. Through these types of analysis the idea of genomic islands was introduced, where it's believed that essentially a large cassette of DNA was acquired by one strain while the other strain acquired a different cassette. This allows us to identify regions that were horizontally acquired, essentially by lining up two genomes and comparing the regions that differ, and that, in combination with other DNA-based signals called genomic signatures, is how horizontally acquired genes are identified. If you're interested in this area, Rob is an expert in this type of analysis, and so is Fiona Brinkman, so you can talk to them more about it later today.

Okay, so now moving beyond two genomes, you can compare a large number of genomes from the same species of organism, and this is called the species pan-genome. The term was first coined in 2005 by Hervé Tettelin and others at TIGR, The Institute for Genomic Research at the time, in which they compared sequenced genomes from six different Streptococcus agalactiae strains. What they were able to do is identify the genes that are shared by all these isolates, and they coined the term core genes for the shared genes; they also found that these are typically housekeeping genes. On the other hand there are the accessory genes, the strain-specific regions found only in some of the isolates. The pan-genome is calculated by extrapolation based on the limited number of strains that you are able to sequence, to come up with a theoretical limit on the size of the gene repertoire for the entire species; in a later lecture and tutorial you will also learn more about pan-genomes. When they did that, they found that some species have what's called an open pan-genome: in other words, no matter how many genomes they sequenced and added to the analysis, the line, where the y-axis is the number of new genes discovered per additional genome sequenced, never approaches zero. This is to be expected if the organisms are undergoing horizontal gene transfer and acquiring genes from non-parental sources. Some other pathogens, on the other hand, have what's called a closed pan-genome: as you sequence more genomes, eventually, theoretically, no new genes would be discovered. There are actually no truly closed pan-genomes, because organisms do have some capacity to acquire new genes, but this type of analysis helps microbiologists predict how a species will evolve over time and how many genomes need to be sequenced in order to characterize a species. In a later lecture we'll also talk a bit more about the population dynamics of pathogens and extend on this concept.
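To make the core/accessory distinction concrete, here is a minimal sketch using toy per-isolate gene sets; real pan-genome analyses cluster orthologous gene families across assembled genomes, but the set logic is the same. The isolate and gene names below are hypothetical.

```python
# Minimal sketch of the pan-genome idea: the core genome is the set of gene
# families shared by all isolates, and the accessory genome is everything else.
# Isolate names and gene families below are hypothetical toy data.
genomes = {
    "isolate1": {"dnaA", "gyrB", "recA", "islandX"},
    "isolate2": {"dnaA", "gyrB", "recA", "phageY"},
    "isolate3": {"dnaA", "gyrB", "recA", "islandX", "tetR"},
}

core = set.intersection(*genomes.values())  # genes shared by every isolate
pan = set.union(*genomes.values())          # all genes seen so far across isolates
accessory = pan - core                      # strain-specific (accessory) genes

print("core genes:", sorted(core))
print("accessory genes:", sorted(accessory))
print("pan-genome size:", len(pan))
```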
Alright, so I'm going to switch gear and talk about the challenges in data integration and sharing, and specifically focus on what's called ontology. Before I do that, any questions regarding sequence analysis, or any comments? Okay.

So, how many of you actually work in a government or public health sector lab, rather than an academic lab? Okay, so you know that in your day-to-day activity you typically need to interact with other groups: if you're a national lab you need to interact with the regional and local labs, and vice versa, and that's true both for epidemiology and for laboratory activities. There are many different players in this complex ecosystem, and the most common way to talk to each other is still over the phone or by fax. We, for example, still get faxes from the National Microbiology Lab with test results, and the challenge with that is that someone has to take that piece of paper and enter it into our own system; conversely, we also send them paper-based forms that they have to enter manually. On top of this labour-intensive re-coding exercise, we also have the phenomenon that information is collected at the local or frontline labs, and as it gets passed on to more specialized labs, less and less of it gets passed along, sometimes by choice, because of privacy concerns, but a lot of the time simply because it's labour-intensive to pass on information and re-code it into a new system. The dilemma is that the bioinformatics and analytical expertise is typically concentrated at the reference labs and the national public health agencies, with a lot less available at the frontline labs. So how can we improve data sharing and integration to address these dilemmas? Here's a funny GIF that I found about modern software development; because there are many different systems in public health labs and public health agencies, it's actually quite akin to what we're trying to do here.

Part of the issue, and I have to thank Damien for some of the slides I'm presenting here, is that the contextual information is often institution-specific. Examples include different acronyms or codes used for the same antibiotics, so when you convert from one system to the next, someone has to re-code that information. Another example is different terminology used to describe the same concept, with subtle differences: in the risk factor forms at BCCDC we use "alcoholism" to describe someone who has problems with alcohol consumption as a risk factor for infectious diseases, in this case tuberculosis, but Public Health England would call it "substantial alcohol use or abuse". Another example is that we would say a patient "died", whereas PHE would use "death" rather than "died". These subtle differences are okay for human interpretation, but when it comes to trying to build a computational system to allow data integration and interoperability, these minor differences become a stumbling block. There are other examples, like the use of different units to measure the same test, or, more challenging, different tests trying to address similar questions; for example, using different platforms for an antibiogram could actually result in incompatible data, and this is something that Andrew, I think, will talk about in his lecture: how to use ontology to encode antibiotic resistance genes and mechanisms. And lastly, here's an example that Damien found:
there are different severity grades used in the UK hospital system versus the American system. In America there are only five grades, from death to good, whereas in the UK it's more refined: they have "deceased", which is the same as death, but then there's "critical but stable", which you have to somehow slot into one of the grades to make the two comparable, and these are the kinds of challenges that data integration faces. So the metadata problems include spelling inconsistencies, synonyms used at different institutes, and also semantic-level issues. As an example, take diarrhea, which is defined as an abnormal increase in the frequency of loose or watery bowel movements; here's a word cloud of terms that essentially describe diarrhea, so you have loose stool, feces, blowouts, and so on. All these terms describe the same concept, and you have a related concept, feces, which is a portion of semi-solid bodily waste discharged through the anus, sorry for the explicit description there. The point is that these two terms are related to each other, yet their definitions really don't share any words; as Damien saliently points out, the only word shared between the two definitions is "of". So if you're just looking at the two terms using a computer, there's no way for the computer to know that these highly related terms are indeed semantically related to each other.

So a few years ago we embarked on an exercise of trying to use ontology to describe genomic epidemiology, and first let's define ontology. It's a mechanism to specify and express a body of knowledge. This includes the use of a standardized and well-defined hierarchy of terms, and each term also has a unique universal identifier associated with it, so you have an ID that's associated with the concept, which allows you to use different terms to describe the same thing. How many of you have worked on data dictionaries before? In a data dictionary you often define the terms directly, so when you move to a different data dictionary you have to do a mapping between the words, whereas in this case we can map using a universal identifier across different systems. The terms are also interconnected with logical relationships, so in the example of diarrhea and feces, someone will have given them a logical relationship describing how the terms are related to each other. So ontology is an inherently coherent tool that can act as a universal adapter, as I've shown here as an analogy, to facilitate data interchange, and, more importantly, it's in a format that is both human- and computer-readable. The OBO Foundry is a collection of ontologies; it started off with the Gene Ontology as the initial project, but since then it has expanded to over 100 different ontologies describing different domains of knowledge. Equally important is that it's an open source project that allows people to reuse and recycle the terms as much as possible; as a matter of fact, they encourage you to reuse the resources rather than reinvent the wheel. So, back to the example we had before: diarrhea, which is mapped to a concept in the Human Phenotype Ontology, and feces, which is mapped to an anatomy ontology called Uberon that covers not just humans, can now be connected through the use of a shared framework.
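As a small illustration of the universal-identifier idea above, here is a minimal sketch of mapping locally worded terms onto shared concept IDs so that a computer can tell when two differently worded records mean the same thing. The IDs and the mapping table are placeholders, not the actual HPO, Uberon or GenEpiO identifiers.

```python
# Minimal sketch: harmonizing local vocabularies through shared concept IDs.
# The ontology IDs and term mappings below are placeholders for illustration only.
LOCAL_TO_ONTOLOGY = {
    "alcoholism": "ONTO:0000001",               # BCCDC-style wording
    "substantial alcohol use": "ONTO:0000001",  # PHE-style wording, same concept
    "died": "ONTO:0000002",
    "death": "ONTO:0000002",
    "diarrhea": "ONTO:0000003",
    "loose stool": "ONTO:0000003",
}

def same_concept(term_a, term_b):
    """Two differently worded terms are comparable if they map to the same concept ID."""
    id_a = LOCAL_TO_ONTOLOGY.get(term_a.lower())
    id_b = LOCAL_TO_ONTOLOGY.get(term_b.lower())
    return id_a is not None and id_a == id_b

print(same_concept("died", "death"))           # True: same concept, different words
print(same_concept("diarrhea", "alcoholism"))  # False: different concepts
```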
The Genomic Epidemiology Ontology, or GenEpiO, is a project we started a while back, and it's an attempt to bring together different domains of genomic epidemiology, including lab analytics, sample metadata, epidemiological investigations, clinical data and reporting, under the same ontological framework to allow data to be harmonized. This involves mapping, sorry, some of this is hard to read, but it involves mapping all the different concepts and terms in the workflow and then describing the relationships among these terms. We have made the resource publicly available, and there are currently over 700 specific terms, and over 2,000 terms in total, capturing the variety of domain vocabularies that I briefly described. The idea is that we will hopefully have a plug-and-play future where you can extrapolate from geographic information to population health-related information, all encoded in a coherent knowledge framework. Of course we're a long way from that, but we're making inroads, and in the future we can also encode mathematical formulas and other things such as standard operating procedures (SOPs) using ontology, and thereby improve the interoperability of laboratory processes, epidemiological calculations and so on. We have a website and a consortium set up, and we definitely encourage people working in public health to join us and help build up these resources, which you can in turn use at your own institution.

Okay, so very briefly: modules 2 and 3 will be today, where we explore the variations in microbial genomes, phylogenetic reconstruction, outbreak investigation, pathogen tracing, source attribution and so on. It will start with the introduction to phylogenetic analysis that Gary will give, which is an important basis for many of the downstream analyses, so we just want to make sure you understand how to interpret a phylogenetic tree. Tomorrow, in module 4, Andrew will talk about antimicrobial resistance gene detection and analysis using whole genome sequence data; with the WHO having just released, in February, a list of high-priority pathogens for R&D of new antibiotics, this is a very hot area, certainly topical to public health. In module 5, Rob will talk about phylogeography. I'm sure most of you know the story of John Snow and the cholera outbreak in London; this is a map of it, but of course it has geographic information only, so in Rob's lecture he will show some of the visualization and analysis of how you can combine geographic information with genomic sequence and phylogenetic data. Module 6 will be about cases where you don't know what pathogen is implicated in an outbreak: how do you detect potential pathogens using metagenomic sequencing? Also, some rapidly evolving pathogens may actually change to the point that the PCR primers you designed to detect them no longer work, and in those cases, again, metagenomics may be the only solution you have; there are some pictures of the Ebola and Zika outbreaks and the stories associated with them. Module 7, by Anna, is data visualization: with increasingly large and heterogeneous data sets, how can we use data visualization techniques to help us understand and interpret the data? Here I'm just showing some of the visualization techniques used for visualizing genomes and genomic islands; GenGIS, which Rob will talk about, is for phylogeographic analysis. The goal is really that we should be able to synthesize different data types to describe a phenomenon; in this case the wind pattern in the US is synthesized and described in this single animation, and if you go to the website it will play the animation. Okay, so that's all for the background intro section. Any questions? Nope, great.