All right, so, yeah, thanks for coming to our second infectious disease genomic epidemiology workshop. As Anne mentioned, we started this last year in response to demand from both public health and parts of academia interested in learning how to apply genomics in clinical and public health microbiology. Last year we focused a lot on public health, because most of us, as you heard, came from the public health side, but this year we added more applications that are relevant to clinical microbiology as well. Last year we also focused mostly on bacteria, but with recent events and advances in technology, this year we added some viral components too. I don't think we're covering parasites quite yet, but a lot of the analyses still apply.

My job in this first module is to give a general introduction to the course material and content, and also the background, to help everyone get on the same page. So feel free to stop me anytime if you have questions; the lecture will probably take 60 to 70 minutes, and some interruption is totally fine.

It's a packed workshop, as Anne mentioned. We've divided it into seven modules plus a keynote talk, and each module is followed by a hands-on tutorial session, except this first one and the keynote, so you'll actually get to practice what we discuss in the lecture material. I didn't list it here, but right after this we'll also have a brief tutorial on how to use Amazon Web Services to analyze your data. We've set up an Amazon cloud instance for each of you, and tomorrow we'll also show you how to access another analysis platform that the instructors collectively created and have worked on over the last few years; it's a web-based platform where you can upload your data and do some analysis as well.

In terms of the actual course content: in module two you'll hear from Gary about phylogenetic analysis, how to detect single nucleotide variants, and how to use them for transmission studies and other applications. In module three you'll hear from Dylan about molecular subtyping, mainly focused on gene-based or allele-based molecular typing methods, such as MLST, using whole genome sequence data. Then this evening, Fiona, who just walked in, will give a broad-picture keynote. Fiona, do you want to quickly introduce yourself? Hi, I'm Fiona Brinkman, a professor at Simon Fraser University, very interested in bioinformatics, but with some lab training as well. And tomorrow morning you'll hear from... what am I talking about? Andrew? I only got about four hours of sleep, and coming from Vancouver, this is like five in the morning for me. Anyway, a lot of you mentioned antimicrobial resistance as your research interest, so Andrew will talk about the identification of AMR genes from whole genome data and some of the resources that are available to you to do that. Module five is on phylogenetic analysis in a geographic context. That lecture was originally developed by Rob Beiko, who developed GenGIS, a tool that allows you to combine phylogenetic trees with geographic locations. He can't make it this year, so Anna will cover the lecture and will also talk a bit about the visualization aspect.
And in module six, Gary will talk about emerging pathogen detection; the idea there is that you can search for potential pathogens in metagenomics data, and we'll introduce some viral applications in this case as well. In module seven, Anna will talk about data visualization, some of the principles and good practices behind it, and we also have a hands-on tutorial on developing some of these visualizations yourself and looking at the data from different angles and through different lenses.

So the general learning objectives for this workshop: understand how genomic epidemiology can improve clinical and public health microbiology. You'll learn to process genomic sequence data using a variety of bioinformatics tools. The focus of the workshop is not so much on how you process the raw sequence data, because a lot of those tools are readily available, so we won't necessarily go into the details of how you assemble and annotate your genomes; instead we focus more on the downstream work: once you have processed sequences, how you can use them for genomic epi investigations and other applications. We'll also talk about how you can interpret your genomic data in an epidemiological context and perform several types of genomic epi analysis, and that's really the meat of the workshop. You'll also learn the fundamentals of data visualization; as you can imagine, we're dealing with heterogeneous, highly complex data sets, so these visualization techniques can really improve your ability to interpret the data. And last but not least, this is a rapidly evolving field, so we also want you to go home with a sense of the limitations and challenges still associated with genomic epidemiology analysis. This is an active research area that people are working on, even though a lot of the work can now be operationalized, as I'll show in a bit.

For this particular module, the learning objectives are: become familiar with high-throughput sequencing and its applications in clinical and public health microbiology; be familiar with sequence data processing, which I'll go over briefly in this introduction; be able to recognize the importance of metadata in sequence analysis and data integration; and lastly, I'll give a brief overview of the other modules.

As I said during the introductions, I'll talk a bit about my own research. One of the important messages of this workshop is that sequence data cannot stand on its own: you need the metadata to facilitate the analysis. But metadata cannot stand on its own either, as you sometimes don't get the resolution you need to make interpretations. The importance of being able to bring the two together is what my lab is interested in. Half of my group works on the sequence analysis side, and I'll say more about that in a bit, and the other half works on data harmonization and data standardization approaches to improve the quality of metadata.

So, we live in an increasingly interconnected world. This shows the commercial flight paths circa 2014. Just as people and goods can travel around the globe rapidly, pathogens can be carried by these travelers, animals, and goods, and spread around the globe through international travel. A case in point is this study done a few years ago looking at actual human waste from long-haul flights.
So a Danish group collected samples from 18 different flights from three different continents, filtered through 400 liters of human waste, extracted the DNA, and sequenced it using metagenomics approaches. They then clustered the samples based on the microbiome profiles they observed, and also searched for antimicrobial resistance genes in the dataset. What they found is that the samples do cluster by geographic location. As you can see here, the North American samples form their own distinct cluster, or two clusters actually, whereas the Asian samples split into northern Asia and southern Asia, each forming distinct clusters based on the microbes found in the samples. Also, by looking at the antimicrobial resistance genes, they found that certain samples, namely the ones from South Asia, actually had higher proportions of antimicrobial resistance genes in the human waste, the human excreta. So it's a nice study showing how you can do global surveillance by monitoring flights, and as you can imagine, if pathogens are secreted in human waste, they're carried to the destinations these travelers go to as well.

Another study looked at emerging infectious disease events. EID events are defined as newly evolved strains of pathogens in humans, excluding re-emerging pathogens. So they, for example, did not count the recent re-emergence of Ebola or Zika outbreaks, but they did count the first instance of a particular pathogen, or a new variant of that pathogen, getting into the human population. This survey was done over the past six decades or so, since the 1940s, and the pattern is clear: there's a steady increase in these emerging infectious disease events over the last few decades. They attribute the increase in the 1980s to HIV and the related pandemic. I don't know if they fully understood why there's a slight drop around 2000, but in any case the increasing trend is quite consistent, and it's dominated by zoonotic pathogens hopping over from animal populations into the human population. In the study, they were also able to identify global hotspots where these EID events occur, and perhaps not too surprisingly, South Asia is one of these hotspots. So you have a combination of a higher probability of newly emerging infectious diseases coupled with increased antimicrobial resistance genes, creating a kind of perfect storm, and that's why a lot of public health agencies and groups closely monitor the region. In a subsequent paper, published just last year, 2017 I think, they also identified some of the risk factors associated with these emerging infectious disease events: namely, tropical forest regions with dense human and animal populations and rich species diversity, together with climate effects, are all significant risk factors associated with the emergence of new infectious diseases.

All of this is really to illustrate the importance of infectious disease in our modern society. Perhaps the outbreak of Ebola in West Africa a few years ago is still fresh in people's minds. It was the most deadly Ebola outbreak in history, resulting in about 28,000 reported cases, so the actual number is likely higher, and about 11,000 deaths over a span of roughly three years. It significantly impacted global travel.
I remember at BCCDC, even though there were no cases in Canada, we still had to go through... well, not me personally, because I'm a bioinformatician, we don't really go into the field, but the technicians and the biosafety officers had to go through rigorous training in sample handling in case an Ebola case was detected in Canada. So it's quite a resource-intensive exercise to prepare for potential outbreaks, and again it highlights the global nature of infectious diseases: an infectious disease in one region, if it's significant enough, can have a global impact.

So, the Ebola genomes. What's different between this particular outbreak and the previous ones is that, with the maturation of sequencing technologies, about 5% of the cases were sequenced, and the genomes of these Ebola viruses are available for analysis. This is really one of the best examples of how genomic epidemiology can facilitate the research on, and the health care response to, an outbreak. This study is one of the earlier ones; it looked at just 99 early isolates, and using sequence data alone, later corroborated by the epidemiological evidence, they were able to show that the human exposure to the natural reservoir was likely a single, one-time event, because they detected essentially one original introduction that then spread through human-to-human transmission. That spread was likely due to funeral practices and the lack of proper quarantine in these West African countries, which had never seen Ebola outbreaks before. From the phylogenetic tree, they were also able to show that the incursion of Ebola from Guinea into Sierra Leone, even though the epi evidence suggested a single event, might actually have introduced two different strains of Ebola into the country. As you can see, there's a split here into two distinct lineages, one here and one down here. This kind of detailed information really helped the health care workers trace the transmission and understand how these viruses were spreading, and the results were able to influence some of the policies and interventions. However, as I mentioned, a lot of deaths occurred, and it still took quite a few years to finally bring the outbreak under control.

Fast forward to 2018: some of you might know that there's currently another Ebola outbreak in the Democratic Republic of the Congo. What's different about this outbreak compared to the previous one is that the DRC has already experienced multiple Ebola outbreaks, so they're more aware of the symptoms to look out for and of how to manage the patients, and the death toll currently stands at around 50 or so. So it's a much better controlled outbreak. Other reasons the outbreak is under control include faster mobilization of resources: since the original Ebola outbreak, the WHO and the World Bank have set up reserve funds to deal with public health emergencies, and there are teams in place that can mobilize quickly. They were also able to use the previous outbreak as an opportunity to test the Ebola vaccine that was actually developed at the National Microbiology Lab in Canada, and to stockpile these vaccines for future outbreaks.
And in this case, they were able to vaccinate the high-risk patients, and also high-risk individuals who were not patients. Experimental drugs were also developed, and as I mentioned, health workers were able to get into position faster for this outbreak. But we really need to be constantly on alert for these potential infectious disease outbreaks, which can be devastating if not quickly managed; hence the importance of genomic-epi-based surveillance to monitor pathogens, which is what the community as a whole is working toward.

Unlike Ebola, Zika virus causes much milder symptoms, and the disease is often self-limiting; most people who have been exposed develop an immune response to it. It has been endemic in Africa and Asia, in Africa for much longer and in Asia for at least the last half century or so, before causing an outbreak in the Americas in 2015-2016. That outbreak resulted in a few thousand cases of microcephaly. The likely reason is that the general population in the Americas was naive to this virus, so when first exposed, people were more likely to have complications while developing an immune response, causing issues such as microcephaly in fetuses. It's still an open question why microcephaly wasn't observed in Asia and Africa to the same extent, but it's also possible, as some people are proposing based on the genomics data that was generated, that a recent mutation of the virus increased its virulence, and we might be dealing with a more virulent strain of Zika. I also want to highlight that because the symptoms of Zika virus are similar to those of some other flaviviruses, such as dengue virus, syndromic surveillance, which is based on observing symptoms, can be unreliable, and you can miss cases; serological, genomic, or molecular testing is needed for confirmation. And again, through the genomic analysis it was later realized that the virus had probably been circulating in the Americas, specifically in Brazil, for at least a year or so before it was recognized as an outbreak, and such a gap in surveillance likely increased the magnitude of the outbreak. So, as I mentioned, genomics-based surveillance going forward can reduce such gaps and improve our ability to identify potential outbreaks.

Another example here is a flight path of a different kind: these are the migratory bird flyways. The reason I'm showing this is that these migratory birds are actually natural reservoirs for influenza viruses, and through these migratory pathways influenza viruses, specifically avian influenza, can spread around the world. You'll also notice that some of these flyways overlap each other, which gives the virus genomes opportunities to reshuffle and create new variants. Influenza viruses, as many of you know, have caused large-scale human pandemics over the past hundred years, and 2018 is actually the centennial of the 1918 Spanish influenza pandemic. The pandemic strains are typically hypervirulent and typically arise from the mixing of human and animal flu viruses. These migratory birds, in the process of being the natural hosts of these influenza viruses, have developed some resistance, so they don't typically get ill from the viruses.
But the viruses can be passed on to domestic poultry populations, which are highly susceptible, at least to certain variants of these viruses. In relation to human health, occasionally the virus can jump over to humans. And typically, at least in North America, if avian influenza is observed, culling of the domestic poultry population is the only solution right now.

So in 2014-2015 there was what I'd consider a medium-scale outbreak of avian influenza in North America. It's believed that an Asian H5 strain mixed with a North American strain somewhere in Alaska, and in that process a highly virulent version of the H5 virus was carried down along the coast into the US through these wild bird migratory flyways. In BC, the Fraser Valley is where we have all our poultry operations, and as a result of the incursion of the H5 virus into the poultry population, 13 farms were affected, about a quarter million birds had to be destroyed, and it cost the industry about three hundred million dollars. Once it got into the US it scaled up: the estimated direct damage to the industry was about three billion dollars, which translated to roughly an 80 percent increase in egg prices in the States. So these diseases can have a wide impact, not just on human health but also on trade and the economy.

During that time, Genome BC got interested in how we could improve the surveillance of these influenza viruses, so together with the BCCDC and the Ministry of Agriculture we received some funding to come up with better ways of doing influenza surveillance. Before I jump into that: the current approach to influenza surveillance in birds is bird-based testing, both passive, in other words testing dead birds, and active, capturing and testing live birds or hunter-killed birds for influenza viruses. As I mentioned, because most wild waterfowl are not killed by influenza viruses, looking for dead waterfowl does not really improve your chance of detecting these viruses, and as a result the overall positivity rate for detecting viruses in a bird is quite minimal, approximately one percent or so. And in that particular 2015 outbreak, we actually failed to detect the highly pathogenic strains of influenza virus in waterfowl before they got into the poultry population.

So we developed an environmental genomics-based surveillance approach, and this is still ongoing work in my group. The process involves isolating and enriching influenza genomic RNA from wetland sediment samples: we go to wetlands where the wild waterfowl congregate, and since the viruses are excreted through their fecal material, the wetlands act as their outdoor toilet, essentially. This type of approach does not depend on capturing birds. However, because the environment dilutes out the virus, it's like finding a needle in a haystack, so we had to develop both enrichment methods and improved bioinformatic methods to detect these viruses in the environment.
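Our actual detection pipeline is more involved than I can show here, but just to give a flavor of the bioinformatics side, here's a minimal sketch of the kind of k-mer screening idea you could use to flag influenza-like reads in an environmental sample. The file names and the threshold are hypothetical, for illustration only.

```python
# Minimal sketch: flag reads sharing k-mers with an influenza reference set.
# "flu_segments.fasta" and "wetland_reads.fastq" are hypothetical file names.
from Bio import SeqIO  # Biopython

K = 21  # k-mer size, a common choice for read classification

def kmers(seq, k=K):
    s = str(seq).upper()
    return {s[i:i + k] for i in range(len(s) - k + 1)}

# Build the reference k-mer set from all influenza segment sequences.
ref_kmers = set()
for record in SeqIO.parse("flu_segments.fasta", "fasta"):
    ref_kmers |= kmers(record.seq)

# Screen each read: call it influenza-like if enough k-mers hit the reference.
MIN_FRACTION = 0.5  # arbitrary illustrative threshold
hits = total = 0
for read in SeqIO.parse("wetland_reads.fastq", "fastq"):
    total += 1
    rk = kmers(read.seq)
    if rk and len(rk & ref_kmers) / len(rk) >= MIN_FRACTION:
        hits += 1

print(f"{hits}/{total} reads look influenza-like")
```

A real classifier would also check reverse complements, tolerate sequencing errors, and use an indexed database rather than plain Python sets, but the containment idea is the same.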
But as you can see here, the punchline is that the overall positivity rate is much higher: in the wetland samples it's about 25 percent positive, and in the farm samples, which of course were known to have the viruses, the positivity rate is higher still, and this is just from the environment around the farm, not from direct sampling of the birds. We were also able to detect the specific outbreak strains in our samples, rather than just some different strain, so it's a proof of concept showing that a genomics-based approach can actually be used for environmental surveillance.

Andrew will talk a lot more about AMR, so I'm just giving a quick example here. Many of the AMR genes are encoded on mobile elements and can move around from host to host. So detecting the genes by PCR without the genomic background, or identifying the specific bacteria without knowing whether the AMR genes are there or not, are both insufficient to establish the transmission routes of these AMR genes. This study, for example, shows, in these columns here displaying the AMR patterns, that the AMR patterns do not really correlate with the phylogenetic tree, which in this case is based on MLST analysis I think, really highlighting the importance of being able to identify both the genes and the genomic background. Whole genome sequencing is of course an approach that can give you that information, the caveat being that with current sequencing methodology it still requires some work to pull out both the plasmids and the chromosomal background, so it's still a work in progress for this type of application. But this is an interesting study highlighting how they did that.

Now, I've mentioned quite a few studies, so you might think we're still not at prime time yet, that we're not operationalizing genomic epidemiology. But in fact, the Public Health England lab has been sequencing all of their Salmonella isolates since 2014 and has been using that to establish genomics-based clustering and to detect outbreaks. I give some references down here, and there's actually a paper summarizing their experience; I'll make that available through our course website. The US FDA has also established a centralized data repository and analysis platform called GenomeTrakr for both US and global partners, and I think Fiona will say a bit more about that in her talk. This really highlights that you can do genomics-based surveillance in real time or semi-real time to help address outbreaks, in this case foodborne diseases.

Before we get into the technologies, let me first define what I mean by genomic epidemiology. I keep it quite simple: it's the combination of whole genome sequence data from pathogens, to separate it from the idea of whole genome sequencing of the host, the human, with epidemiological investigation to track the spread of infectious diseases. The epidemiological data provide the contextual information for interpreting the genomic sequence data, and the genomic sequence data provide high-resolution diagnostic information that helps shape the direction of the epidemiological investigation.
So this is an improvement over current clinical microbiology laboratory practice, which requires the lab to maintain a large number of different tests and different media to support the identification of different pathogens, along with the different diagnostic platforms and machines you have to maintain, and so on. This slide shows some of the tests regularly performed in a public health diagnostic microbiology lab. Due to regulatory and validation requirements, developing a new assay or test is a slow process, and moreover some of these methods can take weeks or months. Sometimes you have to send samples away: we regularly send samples to the National Microbiology Lab for specialized tests, and the transportation alone, plus waiting in the queue, can sometimes take weeks if not months depending on the test.

Genomic sequencing of pathogens promises to simplify that workflow. Instead of maintaining an array of different platforms, you essentially maintain one sequencing platform, and you also have to maintain, or at least be able to carry out, the bioinformatic analysis downstream of sequence generation; but many pathogens can then be sequenced and analyzed without having to maintain a large array of other tests.

The benefits include the simplified workflow I mentioned. In most cases you can also guarantee a turnaround time if you do the sequencing and analysis in-house, because you know how long the sequencing and the analysis take; it then depends on how quickly you can get samples processed for sequencing. Over time you achieve cost savings by reducing the number of platforms and instruments you need to maintain, and sequencing itself is becoming cheaper by the day. One really value-added bonus of sequencing is that the results are comparable: it really doesn't matter where you generate a sequence, by and large the sequence data are comparable. Moreover, it's a digitized format, which makes sharing the data easier, and you can perform value-added analysis and research on the sequence data, as I showed with the Ebola and Zika examples.

Some of the challenges: the results are harder to process and interpret. Whole genome sequencing gives you a large amount of data, so you do need specialized computational and bioinformatics resources to process that information. There's also the rapid change in technology. As I mentioned, validating a test can take a long time, so when a technology evolves fast, how do you deal with accreditation and validation of the platform? A few months ago I was at a conference where the US CDC talked about going through the accreditation process under CLIA, the framework that governs clinical laboratory accreditation in the US, detailing how labor- and time-intensive it is to validate a genomic test for their foodborne disease program. The other issue is that if you have low throughput, the per-sample cost can be very high; you often need to accumulate enough samples and do batch processing to reduce the cost, but many labs don't have the volume to sustain that level of activity. So it again becomes a question of
whether you centralize the sequencing service, or do it in a distributed way but then absorb the extra cost due to lower volume.

So, I've talked for 30 minutes or so; let me pause for a bit and open up the discussion. Do you have any experience with applying whole genome sequencing, and are there any benefits and challenges you've observed in your own work? Anyone? Not enough samples? Yeah, that's a common problem, and often the sequencing itself can be done in a few days, but you might sit in the queue for a few weeks or months, which is one of the reasons we're still not doing this in real time. Any other thoughts on potential challenges or benefits? How many people are actually doing genomic sequencing of their samples but not doing the analysis at the moment? Okay, about half of you. And the other half, are you just receiving data and wanting to do analysis? Yeah, okay.

So we'll now go into the bioinformatics aspect a bit more. Actually, before I do that, I want to say a bit about the term high-throughput sequencing. It has gradually become more popular than next-generation sequencing, although you still see the latter a lot. Part of the reason, I guess, is the distinction between next-gen sequencing and so-called third-generation sequencing, which is based on single-molecule sequencing. Using high-throughput sequencing as a collective term means you don't have to worry about whether you're talking about Illumina sequencing, or MinION sequencing, or PacBio sequencing, the latter two being third-generation sequencers.

Sequencing data have many laboratory uses. They can be used for diagnostics: you can do detailed strain-level identification, and you can identify virulence genes and AMR genes from the sequence data alone. You can carry out surveillance, whether gene-by-gene surveillance or single nucleotide variant (SNV) surveillance. You can even do in silico serotyping, predicting the serotype from the genomic data, and you can use the sequence data to identify copy number variants. That information can then be used for outbreak detection and investigation: you can carry out trace-back, transmission route analysis, and so on, using the sequence data in combination with the epi investigation.

I think most people are probably familiar with Illumina-based sequencing; that's the current workhorse for generating large amounts of short reads from your samples. How many of you have heard of the MinION? All of you, great, so I don't have to introduce it, but briefly, it's a small-form-factor sequencer that gives you longer reads with much lower fidelity, and it does promise to be portable and highly versatile.

This graph shows the decrease in the cost of sequencing DNA, and you can see an inflection point here where next-generation sequencers were introduced, followed by a rapid decrease in cost. Notice this is on a log scale, so the drop amounts to many orders of magnitude since 2001. Right now it costs about one cent per megabase. And of course, show of hands: how many people are sequencing their genomes at one cent per megabase? Anyone? No.
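To put that one-cent-per-megabase figure in perspective, here's a back-of-the-envelope calculation; the genome size, coverage, and price are illustrative assumptions, not a quote from any particular facility.

```python
# Rough per-genome sequencing cost at a given per-megabase price (illustrative).
genome_size_mb = 5.0   # a typical bacterial pathogen, ~5 Mb
coverage = 50          # fold coverage commonly targeted for assembly
cost_per_mb = 0.01     # the ~1 cent/Mb figure from the cost curve

bases_sequenced_mb = genome_size_mb * coverage
raw_cost = bases_sequenced_mb * cost_per_mb
print(f"{bases_sequenced_mb:.0f} Mb sequenced -> ~${raw_cost:.2f} in raw bases")
# ~$2.50 in raw bases, which is why realistic per-sample costs of $15-200
# are dominated by library prep, labor, instruments, and batch size instead.
```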
So the caveat here is that this is a highly optimized, highly batched figure. The data come from the National Human Genome Research Institute, which sponsors some large-scale sequencing centers, and this is the per-megabase cost at those centers. The important takeaway is that there has been a significant drop in price, but if you translate it to a medium or small lab, you're still looking at somewhere between 15 and 200 dollars per sample, which is roughly per genome; if you're dealing with metagenomic samples the cost can be higher, because you may need much higher coverage depending on your application.

A quick reminder of what pathogen genomes look like. For bacteria, there's typically a single circular chromosome, with some exceptions. It's a haploid genome, meaning there's only one copy, one allele I should say, of a given gene in the genome, and there may be extrachromosomal elements, plasmids, carried along with the chromosome. Genome sizes range from about 0.5 megabases to about 10 megabases; pathogens especially are typically in the middle, three to five megabases, corresponding to about 3,000 to 5,000 genes.

Viral genomes are of course typically smaller, and they can be either DNA or RNA. The next-gen sequencing platforms typically handle only DNA sequencing; you can now do direct RNA sequencing on the MinION, but the efficiency is still much lower than DNA-based sequencing, so you're often dealing with double-stranded DNA, or with reverse transcription of RNA viruses, before you can carry out sequencing. Viral genomes range from one to two kilobases up to one to two megabases; it's quite unusual to see such large genomes for viruses, because they're usually quite compact and depend on the host's cellular machinery to replicate.

I heard some of you are working with eukaryotic parasites, which include fungi, protists, and worms. Their genomes are usually a few megabases to a few hundred megabases, usually with multiple chromosomes. There are some exceptions, such as Giardia, or some of the yeasts, at only about 12 megabases, about the size of a large bacterial genome, with essentially no introns. But typically, when you're dealing with eukaryotic parasites, the more complex gene structures make the downstream processing more complex as well, whereas for bacteria and viruses we're getting quite good at doing the gene annotations.

There are several dominant evolutionary forces changing these genomes; I'm just listing some of them here. The genome can be reduced over time, so-called specialization, where an organism becomes adapted to a specific niche and as a result becomes lean and mean, shedding the excess burden of maintaining a larger genome. There are constant rearrangements in most microbial, bacterial genomes. Gene duplication followed by diversification of gene function is another force of change. And of course, in many microbes, lateral gene transfer is a big driving force as well: the acquisition of DNA from non-parental strains, together with recombination, can quickly change bacterial genomes.
Because of all these driving forces, instead of sequencing a handful of genomes, we typically have to sequence hundreds if not thousands of genomes to really characterize a single pathogen species well.

Okay, so this slide outlines the whole genome shotgun sequencing approach. What I want to highlight is that because there's no cloning stage, the DNA fragments are typically just quickly PCR-amplified, and sometimes even that step is skipped before sequencing, you can streamline the process compared to the previous Sanger-based approach. It's also highly parallelized, so you generate a lot of data in a short period of time.

Now, moving into sequence data analysis. In the sequencing step, especially on next-gen platforms such as Illumina, the DNA is broken into smaller fragments of sequenceable size, typically a few hundred base pairs long, and you generate and sequence millions to billions of these short fragments. The typical downstream process, before you can analyze the data, is to put the fragments back together through a process called assembly, then annotate the sequences to provide functional information about the genes and the non-coding regions, and then do comparative genomics to identify the variants found in each of your genomes.

In genome assembly, the task is to reconstitute the whole genome, if possible, from the fragments of DNA, and there are two basic approaches. One is de novo assembly: essentially you find overlaps between your fragments and stitch them together based on those overlaps, kind of like solving a jigsaw puzzle by finding the matching edges. The other is called mapping, or some people call it reference assembly, although mapping is probably the better term: you map your sequences onto an existing, related reference genome, akin to using the picture on the puzzle box to place each piece where it belongs. That is of course an easier process, but the challenge is that if your reference genome has diverged from your query genome, the genome you're interested in, then the mapping-based approach is not going to give you a very accurate representation of your query genome.

In addition to those challenges, there are sequencing errors on all platforms that you have to deal with, so the assemblers have to be able to tolerate sequencing errors. One of the major challenges for short-read assembly is that repetitive regions sometimes span beyond, or are bigger than, a read; when you try to piece the fragments back together, two distinct repetitive regions can be collapsed into a single region, because there's not enough information in your reads to span the entire repeat. This is by and large addressable by using, say, the MinION from Oxford Nanopore, or PacBio, which generate longer sequences that will hopefully span beyond the repetitive regions.

This slide quickly summarizes some of the common platforms still in use today for sequencing, with the type of error and the error rate for each. I just want to highlight that Illumina has a comparatively very low error rate, and the errors are typically random substitution errors, which are much easier to correct if you generate enough depth of coverage: by taking the consensus, on average you'll get the right base rather than the wrong one.
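To make the consensus idea concrete, here's a toy example: given several reads covering the same window, take the most common base at each column. Real variant callers weight by base quality and handle insertions and deletions, but the majority-vote intuition is the same.

```python
# Toy consensus call: majority vote at each aligned position.
from collections import Counter

# Five "reads" already aligned over the same window (made-up sequences).
aligned_reads = [
    "ACGTACGT",
    "ACGTACGA",  # substitution error at the last base
    "ACGAACGT",  # substitution error at position 4
    "ACGTACGT",
    "ACGTTCGT",  # substitution error at position 5
]

consensus = "".join(
    Counter(column).most_common(1)[0][0]
    for column in zip(*aligned_reads)
)
print(consensus)  # -> "ACGTACGT": random errors get outvoted at 5x depth
```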
So the consensus approach easily corrects for substitution errors. The third-generation sequencers still have greater than a 10 percent error rate, so one in ten bases you get from the sequencer will be incorrect, and on top of that, the third-generation sequencers, or at least the MinION I should say, have errors that are harder to correct: they stem from complications in the signal-processing step, so there are systematic errors that are sometimes harder to fix. Again, by sequencing to high depth of coverage you can fix some of the problems, but a good practice is often to combine the MinION data with Illumina data and use the high-quality Illumina data to correct the sequencing errors in the MinION data; effectively, though, that doubles or triples your cost per genome if you do it that way.

Okay, so after assembly there are often still gaps in the genome; you're not getting the genome back in one piece. We call each of these contiguous sequence fragments a contig; I just wanted to introduce the term so we're all on the same page. The contigs are likely to represent, in most cases, 97-plus percent of the genome, but you're still likely to miss some of the repetitive regions. The process of closing the gaps, the so-called finishing approach, is often time-consuming and labor-intensive, because the old way of doing it was to design PCR primers and walk the gaps to close them. The new way, as I mentioned, is to use longer reads to span these repetitive regions, and for some of the well-studied pathogens it's now quite easy to get the full genome if you combine both sequencing technologies.

Right, so the next topic is genome annotation. You get a string of A's, T's, C's, and G's back from your sequencing provider, and without the annotation process there's really not much information attached to the sequence. The goal of annotation is to identify features, such as genes, or non-coding regions such as the various RNA molecules or RNA genes, in the genome. Typically you want to determine both the functions of these features and their locations in your chromosomal or genomic sequence.

The locations of coding genes, at least for bacteria and viruses, are easier to figure out, and they can often be identified through an ab initio gene prediction approach. These algorithms work on the fact that coding regions have different base frequencies than non-coding regions, and if you train the software to recognize these "words", these different coding frequencies, it can pretty reliably predict the coding regions in your contigs. This is a fairly well-solved problem: the sensitivity, at least for bacterial genomes, is such that somewhere around 97 to 98 percent of genes can be detected this way.

This slide quickly summarizes the annotation process: you have a contig, you carry out some regional annotation, and you detect both the protein-coding genes and the non-coding features, such as rRNAs, tRNAs, other potentially small RNAs, promoter regions, and so on.
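Real gene finders use trained statistical models of coding-region composition, as just described. As a much cruder toy illustration of scanning a contig for candidate genes, here's a minimal open-reading-frame finder; the length cutoff and the synthetic contig are arbitrary.

```python
# Toy ORF scan: report ATG...stop stretches above a length cutoff, both strands.
STOPS = {"TAA", "TAG", "TGA"}
COMP = str.maketrans("ACGT", "TGCA")

def orfs(seq, min_len=300):  # 300 nt is ~100 codons, an arbitrary cutoff
    seq = seq.upper()
    found = []
    for strand, s in (("+", seq), ("-", seq.translate(COMP)[::-1])):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if codon == "ATG" and start is None:
                    start = i  # first start codon of a candidate ORF
                elif codon in STOPS and start is not None:
                    if i + 3 - start >= min_len:
                        found.append((strand, start, i + 3))
                    start = None
    return found

contig = "ATG" + "GCT" * 120 + "TAA"  # synthetic 366 nt "gene"
print(orfs(contig))  # -> [('+', 0, 366)]
```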
The protein-coding genes can be detected through that ab initio gene prediction method, as I mentioned, and you can then annotate the functions of these genes by comparing them to a reference database of genes with known functions. This can be done automatically, as you'll see in the workshop, and manual annotation can optionally be done to improve the quality of your annotation.

The most common way to do functional prediction is through sequence similarity search. The assumption is that genes with sequence similarity are derived from the same ancestral gene and therefore have similar functions. The most well-known similarity search tool is called BLAST. How many people have used BLAST before? Just wanted to make sure. Everyone, okay, so you know how it works: basically you take your sequence and search it against a database of sequences that already have annotations assigned to them. What you have to keep in mind is that this process creates what's called transitive annotation: in other words, you haven't done any biochemical or functional characterization of your gene; you're simply assigning the function based on the similarity of your gene or protein to a known one. That process can create issues: your gene may actually have a different function than the database entry because of a small mutation, yet still align at 99 percent identity. This is quite common among the AMR determinants, where small mutations can have a huge effect on the function of the gene, so you have to be careful when carrying out transitive annotation.

Okay, I think I can skip this slide, except to say that BLAST is one of the most cited bioinformatics tools, and the reason it's so popular is that there's a statistical framework described around the BLAST search. That makes it easier to predict the behavior of BLAST search results and to interpret them, compared to some of the more heuristic approaches that lack such a statistical framework and whose results are therefore harder to interpret. There are different versions of BLAST for the different combinations of nucleotide and protein searches; I'm sure people are familiar with those by now.

In terms of interpreting BLAST search results: as I mentioned, it's based on a statistical framework, so there's the concept of the expected value, or E-value. The E-value is essentially the number of BLAST alignments with a given score that you would expect to see simply due to chance. Does that sentence make sense to everyone? Can anyone put it in different words? Show of hands, how many people understood that sentence? About half of you, okay. The analogy I typically give is: if you walk down the street and see someone who looks like you, what's the probability that this person is actually related to you? Maybe some of you have a face that's more common. Aside from the fact that all humans are related to a certain degree, and certainly all proteins are related to a certain degree, we want to figure out which ones are more closely related and which ones just look similar by chance. So the E-value gives a statistics-based interpretation of the likelihood of finding a match purely due to chance, in other words, of two unrelated sequences showing similarity purely by chance.
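For those who want to see the arithmetic: the standard relationship from the Karlin-Altschul statistics that BLAST reports is E = m * n * 2^(-bit score), where m and n are the effective lengths of the query and the database. Here's a quick sketch with made-up numbers:

```python
# E-value from a bit score: E = m * n * 2**(-bitscore),
# where m = effective query length, n = effective database length.
def evalue(bitscore, query_len, db_len):
    return query_len * db_len * 2.0 ** (-bitscore)

# Illustrative numbers: a 300-residue query vs a 50-million-residue database.
for bits in (30, 50, 100):
    print(f"bit score {bits:3d} -> E = {evalue(bits, 300, 50_000_000):.2e}")
# The same alignment score is less surprising in a bigger database, so the
# E-value grows with database size and shrinks as the bit score increases.
```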
The rule of thumb is that a BLAST alignment with an E-value greater than 0.01 (it's a number, not a percent) should typically not be trusted, unless your sequence is really short and it has high identity to the database entry. The gray zone is roughly 0.01 to 1, which you have to interpret accordingly, again for example when your sequence is very short. The E-value and the score are related: the score is based on factors such as the alignment length, and the E-value is adjusted based on the complexity of the sequences and the probability of finding a match in the database. So a combination of percent identity, which is reflected in the score, and the E-value should both be taken into consideration when you interpret the results.

There are quite a few automated annotation systems that use BLAST or other similarity search tools, some of which are potentially a lot faster. BLAST is still considered the gold standard in terms of sensitivity, but if you have to search thousands of genes, or say metagenomic data with hundreds of thousands of genes, you potentially need a faster method than BLAST, and I think Fiona will mention some of those in her talk. The idea of finding similarities between your query sequence and the database records is the same. In this workshop we'll actually use Prokka, which is built into the analysis platform I'll talk about in a bit, to do automated annotation of your genome, but others are available as well that you can look into.

Once you have the genomes, you can do comparative genomic analysis, and the goal of such analysis is to identify genomic variants that can be correlated to phenotypic or characteristic features of an organism. For example, you might be interested in identifying a more virulent strain of a pathogen by looking for certain toxins or other virulence factors, or by looking for antibiotic resistance genes; and we can also use the variants to track transmission of the pathogen, as I mentioned already.

At a high level, the variants occur at three different levels. One is regional: due to, for example, horizontal gene transfer acquiring a genomic island, recombination events, reshuffling of viral genomes, and so on, which can result in strain-specific regions, segments of the genome that differ from strain to strain. Then there's gene-based analysis, which typically focuses on strain-specific genes, doing gene-by-gene comparisons and looking at allelic differences; that's the topic of module three this afternoon. And there's single-nucleotide-based analysis, looking for variation at the base level, which will be covered in the next session.

Comparative genomics is not a new concept: ever since the first genome was published in 1995, comparative analyses were being carried out soon after. This slide highlights one of the consistent observations about microbial genomes that really improved our understanding of bacterial evolution: as you can see in this comparison of just two strains of H. pylori, the strain-specific genes typically cluster in so-called hypervariable regions, sometimes called genomic islands, denoted here by the blue regions.
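As a toy illustration of the gene-by-gene idea: once each genome is reduced to its set of gene or allele identifiers, strain-specific versus shared content falls out of simple set operations. The gene lists below are made up for illustration, and this also previews the pan-genome concept coming up next: core is the intersection, pan is the union.

```python
# Toy gene-content comparison between two strains (hypothetical gene sets).
strain_a = {"ureA", "ureB", "cagA", "vacA_s1", "babA"}
strain_b = {"ureA", "ureB", "vacA_s2", "sabA"}

core   = strain_a & strain_b  # genes shared by both strains
a_only = strain_a - strain_b  # strain-specific to A
b_only = strain_b - strain_a  # strain-specific to B
pan    = strain_a | strain_b  # pan-genome of this two-genome set

print("core:", sorted(core))          # ['ureA', 'ureB']
print("A-specific:", sorted(a_only))  # ['babA', 'cagA', 'vacA_s1']
print("B-specific:", sorted(b_only))  # ['sabA', 'vacA_s2']
print("pan-genome size:", len(pan))   # 7
```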
This observation, as I mentioned, recurred over and over in other genomes as well, leading to a lot of studies on genome evolution by horizontal gene transfer and so on, and I think Fiona will also talk more about that in her lecture.

Extending comparative genomics leads to the concept of the pan-genome. The large number of genomic variations means that you typically have to sequence a large number of genomes to really understand a bacterial species, and some species might never have enough genomes available to characterize them thoroughly. The term was first coined in 2005, based on a study of six Group B Streptococcus genomes, and the pan-genome is defined as the theoretical entire gene set of a species. It breaks down into two broad categories: the core genome, the genes consistently found in members of that species, and the strain-specific or accessory genes, found sporadically in the different strains of that species. The idea is that the core is more stable, while the strain-specific or lifestyle genes are more fluid over time.

The pan-genome calculation is then an extrapolation from the observed number of genes, based on your limited number of genomes, to the theoretical number of genes required to fully capture the pan-genome of the species. As I mentioned, for some organisms you'll never saturate the gene content; that's called an open pan-genome, and you can see that some of these curves will never reach zero, at least not in a reasonable time, meaning you'll continue to discover new genes as you sequence more genomes of that species. The plot shows how many new genes you discover as you add one more genome; the number goes down, and for closed pan-genomes it goes down very quickly. In the case of anthrax, the number of genomes you need to sequence to characterize the organism is quite low, partly because it's a newly emerged pathogen, relatively speaking, still only a few millennia old, so the number of genomes required to describe it is tractable. As I mentioned, this concept helps microbiologists predict how a species will evolve over time: whether it has an open pan-genome, with a lot of horizontal gene transfer events and a lot of genomic change, or a closed pan-genome, meaning a more stable genome. This information is helpful when interpreting your genomic epi results as well.

So, as I mentioned, pretty much everyone on the faculty is part of the IRIDA consortium, and a few years ago Gary, Fiona, and I wrote a grant to Genome Canada to build an analysis platform for genomic epidemiology tailored to public health agencies and public health workers who lack the tools to process genomic data. The idea is a free, open-source, standards-compliant, high-quality genomic epidemiology analysis platform to support real-time disease outbreak investigations. The core functions of the platform include managing users, projects, and strains, but also building workflows that allow you to characterize the sequences. There are also some
visualization tools built in, allowing you to visualize trees and look at the analytical results. It's still an active project in development; a lot of the core platform development happens at the NML in Gary's group, but Fiona's group also hosts a public instance at Simon Fraser University. We've actually created accounts for all of you, and you'll use some of that functionality tomorrow in the integrated assignment. As I mentioned, the goal is to build a platform that's easy enough to use for non-bioinformatics specialists. It's a partnership between provincial and federal public health agencies, but also with a lot of academic partners; the slide is slightly out of date, but if you go to our website there's a long list of the partners involved. What's unique about the IRIDA partnership is that the project team members, such as Gary and I, are actually embedded in the user organizations, so we try to build tools tailored to public health microbiology and take feedback from the users to improve the process. As I mentioned, we're opening up this platform; it has actually been open the whole time, but we're making it more readily available through this public instance so others can use it, and more importantly, if you're able to contribute to the development, we're also happy to engage with you on that. The website, as I mentioned, is irida.ca, where you can download and install your own copy, use the public instance, read the documentation, and so on.

Okay, so it's designed to be user-friendly. You'll see this interface tomorrow: basically, once you've loaded your data, you just have to click this big red button to launch an analysis. Not much configuration is needed to run fairly complex workflows, and you'll also be able to combine the metadata with your sequence analysis results.

Okay, we're coming up to the hour, so I'll just go through this last bit fairly quickly, switching gears to the metadata component. There are many players involved in infectious disease surveillance and outbreak investigation. However, because of privacy and confidentiality concerns, the information collected at the front-line labs, or by the front-line agencies' investigators, typically gets aggregated, and detail is lost, as we move from the front-line labs to the more centralized resources. The bioinformatics expertise, on the other hand, typically develops first in more resource-rich environments, such as a national lab, and is slow to propagate down to the front-line labs; that's partly why the IRIDA platform was developed, to try to equilibrate this process. But it also highlights the importance of how we share information across the different players. Different agencies also have different analysis platforms, and this is like trying to fit a plug into a wall socket when there are different standards: there's a lot of retrofitting of data into the various standards when we deal with data harmonization issues. That is to say, the contextual information is often collected in institution-specific formats; there are different codes or acronyms used in different systems; there are terminologies that
describe the same concept but have different names, and there are different units of measurement. Here's an example of the different severity grades used by the American Hospital Association versus the UK's NHS; as you can see, the UK uses much finer gradations than the AHA standard. So how do you fit all of these descriptors onto a single scale? It can be challenging, and certainly difficult for software to understand such nuance. Another issue, of course, is that metadata is often quite dirty: people type in the information, so there are lots of spelling errors, synonyms, and problems with the semantics, such as grammatically incorrect terms. For example, here is a word cloud of different terms describing, essentially, feces; the more common terms are in larger fonts, and you can see it's quite a mess.

The idea of an ontology is that you can use it as a universal converter allowing better interoperability and better harmonization of data sets through a mapping process. An ontology is a mechanism to specify a body of knowledge, an area of study. It essentially involves a standardized, well-defined hierarchy of terms, with relationships describing how the terms relate to each other, definitions for each term, and a unique universal ID for each term. You're then no longer bound by the actual words used to describe a concept, because a universal ID is assigned to each concept, which allows you to use different synonyms for the same concept. The terms are expressed in formats that are both human- and computer-readable. Through this approach you can improve the interoperability of data sets, and this is an area of research that's quite active in my group. The idea, for example, is that you can take these terms and start mapping them to existing ontologies, so that different terms can map to the same ID, or the same word, if it has two different usages, can map to two different IDs; you're essentially expressing these terms in a more systematic, more formalized way. As I mentioned, each term then has a clear definition, and the different ontologies are meant to be interoperable too, so you can reuse terms from one ontology in another ontology.

My group, in collaboration with Fiona and Andrew, has over the last few years developed an application ontology specifically useful for genomic epidemiology investigations. We went through epi, lab, genomic, and clinical data types and data fields to come up with consistent descriptors, and then used these terms to describe the activities. Because we're interested in foodborne infections, we also started a rather interesting project to come up with food descriptors covering farm-to-fork activities: food products, food processing steps, and preservation methods. This of course has applications beyond infectious disease: we have partners interested in using it to describe their nutritional studies, partners interested in using it to describe food webs for environmental studies, and so on.
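As a minimal sketch of what that term mapping looks like in practice (the ontology IDs and the synonym table below are invented for illustration, not actual ontology records):

```python
# Toy metadata harmonization: map free-text field values to ontology-style IDs.
SYNONYMS = {
    "stool":        ("EX:0000001", "feces"),
    "stool sample": ("EX:0000001", "feces"),
    "faeces":       ("EX:0000001", "feces"),
    "feces":        ("EX:0000001", "feces"),
    "blood":        ("EX:0000002", "blood"),
}

def normalize(raw_value):
    """Return (ontology_id, preferred_label), or None if unmapped."""
    return SYNONYMS.get(raw_value.strip().lower())

dirty_column = ["Faeces ", "stool sample", "FECES", "bloood"]  # note the typo
for value in dirty_column:
    print(repr(value), "->", normalize(value))
# 'bloood' stays unmapped: real tools layer tokenization and fuzzy matching
# on top of exact synonym lookup to catch misspellings like this.
```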
So again, the idea of developing a knowledge framework with wide applications, helping to bring the data from these different applications together in a harmonized way, is an interesting area of research. It requires developing an ecosystem. I'll skip this slide, but the idea is that if no one uses the ontology, it's not useful, so we need to develop tools that enable people to reuse it. One of the tools we developed is called the Genomic Epidemiology Entity Mart, and it tries to address the problem that people typically develop their data specifications in silos, resulting in slightly incompatible specifications even though they're working in the same area. The idea is a platform that allows people to publish their data specifications, but also to reuse ontology terms when they build them, so that the different data specifications become more interoperable with each other, without forcing people to adopt the same specification. Another function built in is improving data consistency by enforcing data value constraints directly as part of your data specification, so you cannot type random or erroneous text into a data field. To make it user-friendly, we created a web interface that allows people to shop for terms and develop their data specification. There's a link there; it's still a work in progress as well, but we're hoping to wrap it up later this year.

The other issue we're trying to deal with is that there are a lot of inconsistent terms in existing databases, and mapping them manually would be a labor-intensive process. So we're developing an automated method to map the short phrases found in databases to the corresponding ontology terms, and in the process it cleans up the inconsistencies in these terms. This slide outlines the steps that the system, which we call LexMapr, goes through to clean up the data.

So, to summarize: the idea is to draw your metadata descriptors from a body of ontologies, publish them through the entity mart, and clean the data with LexMapr, so you have consistent descriptors for your metadata; you then bring that together with a standardized sequence analysis workflow, and both are critical for the downstream genomic epidemiology analysis. I put IRIDA here, but it could be any analysis platform you use; the idea is the same: you want both clean, well-processed sequence data and clean metadata for your analysis.

I think I actually covered most of the modules in my introduction, so given the time I'll skip these slides; they just highlight each of the modules. Okay, sorry it took longer than expected. Any questions?