introducing myself a little bit, and obviously dating myself a little bit as well. My background has always been microbial genomics, so this workshop is really near and dear to my heart, because this is what I enjoy doing and what I've been doing for the last 10 years or so. I initially started doing microbial genomic analysis and then did some metagenomic analysis as a postdoc. After that I joined the BC Centre for Disease Control to lead a group there, and essentially worked on building genomic epidemiology capacity in our public health sector for about eight years. In 2020, right during the pandemic, I moved to SFU full time as an associate professor, focusing more on research rather than a mixture of research and service, but I'm still very much interested and involved in public health collaborations, especially working with public health to improve data sharing and collaborative data analysis, and some of that messaging will come across in our series of lectures in this workshop. So that's a bit of an introduction of myself, and I will now begin the introductory lecture of this workshop. The purpose of the introductory lecture is really to provide a bit of common background and a level playing field for everyone in the workshop. Some of you may find the material quite redundant if you've heard it already, or maybe you're even more of an expert than me in some of the material covered. If that's the case, do feel free to contribute in Slack by adding your own experience and knowledge to the material; we very much welcome sharing of knowledge. For those of you who might have less background in the area, hopefully the lecture is accessible, but if it's not, again, please feel free to post any clarification questions on Slack, and the TAs, the other instructors, and I will attempt to answer any questions you might have.
Okay, so as I mentioned already, I'm at Simon Fraser University, where I run a group called the Centre for Infectious Disease Genomics and One Health. I already talked about the course overview, but I thought I would post the course overview from 2017. This is not in your slide deck; I just added it as a contrast to the material that we're doing today. 2017, of course pre-pandemic, was when this workshop was first offered. You can see that a lot of the material is still very much in common, but you can also see that we really only focused on bacterial genomic analysis and not much on viruses. As you'll see later, that's because pre-pandemic, the most active area of genomic epidemiology was foodborne bacterial pathogen research, so a lot of our collective experience and our examples were drawn from that area. So we talked about phylogenetic analysis and molecular subtyping of bacterial genomes, and of course we still have an antimicrobial resistance section. What we changed a little bit, and actually added to, is the phylogeographic section. We've now rebranded it more as phylogenomics, with more of a viral focus and a more integrated approach that adds non-sequence, non-microbial, non-evolutionary data to the analysis. We also incorporated data visualization as an overarching theme throughout all the modules rather than as its own section. This, as Nya already presented, is the updated module list, with a lot more focus on viruses and an added environmental microbiology aspect to the workshop. We expanded the number of modules; it was a three-day workshop and is now a four-day virtual workshop. I should also note that Samira Mubareka from Sunnybrook will be our keynote speaker. I was late in asking her for a title, which is why it hasn't been decided, but she'll be talking about her experience during the pandemic.
And Samira was actually once a workshop participant as well, so it's nice to have her coming back to give this keynote. Sorry, I forgot to advance my slides. So the general learning objectives for the whole workshop are to understand how genomic epidemiology can improve clinical and public health microbiology; to process genomic sequence data using various bioinformatics tools for bacterial and viral genomes, and metagenomes as well; to interpret genomic data in an epidemiological context; and to understand the importance of data standardization and sharing. That's actually another module that, instead of embedding just in passing, we pulled out and made its own module this time around, to highlight its importance. You will also perform several types of genomic epidemiology analysis as part of your lab sessions, and hopefully throughout the workshop and the interactions you will come to recognize the limitations and challenges associated with genomic epidemiology analysis. This is very much still an evolving field, with new methods and new techniques being developed. Specifically, the learning objectives for this module are to understand why infectious disease research is important and to be familiar with some examples of genomic epidemiology studies. I think the importance of infectious disease research probably doesn't need to be highlighted after the collective experience we've been through. But it may also not be surprising that we are, as a society, quite forgetful, and often post-pandemic a lot of the good intentions get crowded out by other priorities. So something to keep in mind is that we should leverage the collective efforts that have been put into fighting this pandemic, and make sure the lessons learned are not forgotten and the efforts are not ignored. We will also become familiar with some sequence data processing; I'll give a very, very high-level overview of how we process genomic sequence data.
That way you'll know how the data you're receiving were generated, and you'll understand the challenges associated with sharing genomic epidemiology data, which will be further expanded upon in module three, Emma's section. So I want to start with some examples of genomic epidemiology studies. As we all know, we live in an increasingly interconnected world, in this case with so many of you from different time zones and different areas around the world. The picture here shows the flight paths of commercial air travel, and we know that pathogens often travel with their human, animal, or other hosts and can quickly spread throughout the world. It's also interesting to show these two figures side by side; they're just a month apart. This is the satellite picture of the airplanes actually in the air, not just the flight paths anymore, and you can see that after the pandemic was declared, the number of international and even domestic flights was reduced significantly. For example, travel between Europe and North America, or between Asia and Europe, was reduced significantly, not to mention travel to Australia. So, as a case in point that pathogens can travel with human passengers, here is a study. It's quite old, from 2015, but I still like it a lot. It's a study by a Danish group that looked at microbes found in airplane toilets. They filtered out the microbes from the human waste collected from 18 different flights that traveled across three different continents, and then looked at what the microbiome profiles looked like in the samples. They clustered the samples based on the microbiome profiles, and they also looked at the presence of antimicrobial resistance genes in the samples.
They found that the samples clustered based on geographic location, as you can see in the tree on the left, with the North American flights grouping together, the Asian ones in the middle, and so on, which shows that microbiome profiles can actually be used to differentiate the origin of the flights. Moreover, they also highlighted that there was a higher proportion of antibiotic resistance genes found in flights that originated from Southeast Asia. So the study highlights that AMR genes can spread quickly around the world through global travel, and you'll learn more about AMR (antimicrobial resistance) and ARGs (antimicrobial resistance genes), and how these can be identified and characterized, in module six. Next, I want to highlight a study that looked at emerging infectious diseases. Emerging infectious disease (EID) events were defined as the detection of newly evolved strains of a pathogen, for example a new variant of the coronavirus SARS-CoV-2, but not the reemergence of a known pathogen or known strain; that's their definition. They looked at these events over the last six decades and noticed an increasing trend, as you can see on the graph here: the number of emerging infectious disease events increased over the decades. They then looked at different subcategories of EIDs and noticed, for example, that the number of drug-resistant versus non-drug-resistant cases also increased over the decades. Similarly for vector-borne versus non-vector-borne events; the vector-borne events (in white) also increased. And about two-thirds of the emerging cases are so-called zoonotic (the non-black wedges: white, orange, or red), highlighting the importance of a One Health approach to understanding emerging infectious diseases, as these events seem to be dominated by zoonotic diseases.
Okay, next we'll look at the Ebola outbreak in West Africa in 2013-2014, which is almost a decade ago now. It highlighted the global interventions using genomic epidemiology approaches to understand the transmission and spread of this virus. This was the most deadly Ebola outbreak in history, resulting in more than 28,000 reported cases and more than 11,000 deaths. It had a significant impact on global travel, and at that time I was at BCCDC: even though there wasn't any case in Canada, all the public health agencies were on high alert and had to conduct training and preparedness exercises in case a case ever landed in Canada, and I think other countries went through similar exercises as a precaution. Ebola genomes from approximately 5% of the cases from this outbreak were eventually sequenced, and this resulted in a wealth of information. This study, linked down here, looked at the early samples that were sequenced, and the analysis revealed that the outbreak is believed to have actually started with a single human exposure to a natural reservoir, and not repeated exposures during this particular outbreak. The outbreak was then sustained by human-to-human transmission from Guinea to Sierra Leone, likely from a single event but with two distinct strains transmitted in parallel. So they were able, as you can see here, to identify separate clusters using genomic sequence data. These transmissions were likely sustained through, as I mentioned, human-to-human contact, and the widespread transmission was likely a result of the lack of proper quarantine facilities in the early stages of the outbreak.
The phylogenetic analysis also showed that the transmission from Guinea to Sierra Leone, as I mentioned, was likely from a single event. So phylogenetic analysis has the ability to really tease out, based on the genetic diversity or genetic variation, the likely transmission scenario, although epidemiological — in other words, contextual — evidence is needed to corroborate the genetic evidence. The phylogenetic analysis, combined with epidemiological investigations, helped to unravel the complex transmission dynamics and provided healthcare workers the knowledge and information to institute effective policies and interventions. But as I was saying, there was still, unfortunately, a huge delay at the beginning that resulted in unacceptable mortality in this outbreak. Fast forward to 2018: there was another outbreak in the DRC. This time around, the practitioners in the DRC were much more prepared and had experience dealing with outbreaks, and globally there was a faster mobilization of resources. Even though it was only incrementally better than the previous experience, and a lot of the mobilization efforts are still challenging, nevertheless the WHO and the World Bank have set up standing emergency funds to deal with these emergencies. There was also stockpiling of vaccines to try to stem the onward transmission of the virus. And of course, drugs that were developed during the previous Ebola outbreak were made available for subsequent outbreaks, just like what we're doing with vaccine and drug development in the COVID-19 outbreak. They were also able to engage global and outside healthcare workers faster and put health interventions in place earlier. Now, the next outbreak that I want to highlight is the Zika outbreak.
Unlike Ebola, of course, the Zika virus causes much milder symptoms for most people, and it had been endemic in Asia and Africa for many decades before causing an outbreak in a population in the Americas that was naive to the virus. Because the population in the Americas was naive to this virus, it caused a lot more severe cases in the newly exposed populations. The outbreak resulted in a few thousand microcephaly cases, and this is likely, again, due to the populations not having immunity to the virus. And because the symptoms of Zika infection overlap with those of other flaviviruses, such as dengue, surveillance efforts based on symptoms alone are not sufficient, so laboratory tests based on serology and molecular tests, such as genomics and others, were needed for confirmation. Yes, Andrew? — We're still seeing the Ebola slide. — Oh, sorry, I keep forgetting to advance the slides. Thanks. Yeah, so this slide is depicting some of the patterns that were detected using genomic information. The phylogenetic reconstruction of the cases suggests that the virus had circulated in South America for at least one year prior to being noticed in an outbreak, since, as I mentioned, the symptoms are quite mild and overlap with other known viruses. The data was used to reconstruct the transmission route, as shown in the picture on the right, and also revealed the gap in surveillance efforts: because the virus was not known to, or at least not keenly watched for by, the affected population, it evaded surveillance. Genomics-based surveillance, on the other hand, moving into the future, will reduce such gaps and allow potential new viruses to be detected as well, because genomic approaches are much more agnostic to the pathogen.
Okay, so the next case I want to highlight is related to avian influenza viruses. This is a study in collaboration with BCCDC, the BC Animal Health Centre, the CFIA (Canadian Food Inspection Agency), and a few other partners; we conducted a study to see if we could detect avian flu from environmental samples. As many of you know, avian influenza travels with its bird hosts, and these migratory birds have a wide range, with flyways that often cross each other. This creates opportunities for different strains of influenza viruses to commingle and reassort their genomes, and new variants can therefore arise from this mixing of virus populations. So in 2014 there was a North American — in fact global — outbreak of an H5N2 strain, which initially came in as H5N1, I believe, and in the Fraser Valley of BC a large number of farms were affected, resulting in a quarter of a million birds being destroyed in the process. As the birds traveled from Alaska down the West Coast and into the United States, the US was also affected by this outbreak, and it resulted in about $3 billion of damage in the United States and about $300 million of damage in Canada. So we were engaged to see if we could find a better approach to avian influenza surveillance, because the bird-based testing was actually quite ineffective. Currently, the approach to influenza surveillance is mostly passive: testing waterfowl that died from other causes for influenza virus. Active surveillance, such as capturing and testing live birds, or hunters submitting samples from killed birds to the testing labs, was quite seldom done, and again the overall positivity rate is quite low. Sorry, I guess there's a timer on this slide, so it keeps advancing itself; I apologize for that.
The overall positivity rate from these existing surveillance approaches is less than 1%; in other words, less than one out of 100 samples tested will be positive. And as a result, the highly pathogenic avian influenza strains that caused the outbreaks evaded the surveillance process. So we took a different approach: we essentially went into the wetlands where these birds were residing and collected mud from this environment, which contains the bird feces, which in turn contain these viruses. Then, through rounds of amplification, we attempted to detect the viral RNAs, which were reverse transcribed to DNA, and from there we used sequencing to identify the subtypes of viruses present. This resulted in a much higher positivity rate: roughly 30% of our samples tested positive. Of course, the farm samples had a much higher positivity rate, as expected, but even in the environmental samples we were able to achieve a fairly high positivity rate. More importantly, as the tree here points out, the samples we collected showed virus strains that cluster together with the outbreak strain of the virus, suggesting that our approach can be used to identify the specific strains of the viruses that cause outbreaks. The last example I'll highlight illustrates why genomic sequencing is important. Many AMR genes are encoded on mobile genetic elements and can therefore move independently of the host's core genomic material, through a process called horizontal gene transfer, rather than by clonal expansion. So detecting the genes by PCR, focusing on a subset of marker genes or on identification genes alone, is insufficient to understand and characterize the antimicrobial resistance profile of the organism and to understand the transmission of these ARGs.
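To make the point concrete, here is a toy sketch of why screening the whole assembly (chromosome plus plasmids) finds resistance genes that a fixed marker panel misses. All contig sequences, gene names (e.g. "blaTOY-1"), and the exact-substring "search" are invented for illustration; real ARG screening uses alignment tools against curated databases, not exact matching.

```python
# Toy sketch: whole-genome ARG screening versus a fixed PCR marker panel.
# All sequences and gene names here are made up for illustration.

def find_args(contigs, arg_db):
    """Return the names of ARG reference sequences found in any contig."""
    hits = set()
    for contig in contigs.values():
        for name, seq in arg_db.items():
            if seq in contig:  # real tools use alignment, not exact match
                hits.add(name)
    return hits

# Hypothetical assembly: a chromosome plus a small plasmid contig.
contigs = {
    "chromosome": "ATGACCGTTAAAGGCCTTAGCGGATCCTTAA",
    "plasmid_1":  "GGGAAACCCTTTAGGCATCATCATGGGTTT",
}

# Hypothetical ARG "database" (toy fragments).
arg_db = {
    "blaTOY-1": "AGGCATCATCATGGG",   # carried on the plasmid
    "toyA":     "GTTAAAGGCCTTAGC",   # chromosomal
}

# A PCR panel that only targets the chromosomal marker would miss blaTOY-1.
pcr_panel = {"toyA": arg_db["toyA"]}

print(sorted(find_args(contigs, arg_db)))     # genome-wide screen finds both
print(sorted(find_args(contigs, pcr_panel)))  # marker-only panel misses one
```

The plasmid-borne gene is only detected when the whole assembly is screened, which mirrors why mobile-element-encoded ARGs escape marker-gene PCR.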
Here's a study showing that the AMR profile, as shown in the heatmap on the right, is actually quite different from the phylogenetic tree built based on MLST patterns. You'll learn more about MLST, and core genome MLST, in Ed's lecture tomorrow, so stay tuned for that. What I want to point out is that the core phylogeny of these bacteria is actually not consistent with the antimicrobial resistance genes that they carry. You can see that some of the isolates that cluster together phylogenetically actually have different ARGs; the patterns on these heatmaps are quite different. Okay. All right. So those were some examples of genomic epidemiology in various applications; I will now move more towards the process itself. As I mentioned, whole genome sequencing of foodborne pathogens was the name of the game at the beginning of genomic epidemiology, partly because there were a lot of global efforts to use genomic sequencing to study foodborne pathogens. For example, Public Health England in the UK, which is now called something else, has been sequencing all Salmonella isolates in the UK since 2014. And the US FDA and CDC also created a distributed network of labs to utilize whole genome sequencing for foodborne pathogen tracking and identification; they call this system GenomeTrakr. This resulted in a large number of genomes becoming publicly available that could be used for analysis. One study I want to highlight was actually done by Jimmy, one of my students and also a TA in this workshop, and it really benefited from the availability of these genomes. He took all the publicly available genomes and compared the serotypes of Salmonella to the phylogenetic tree constructed using the genomic sequences. Again, you'll learn how to do that later in this workshop.
So in this figure — this is a neighbor-joining tree, but shown as a clustering pattern — the tree is labeled with serotypes, while the clustering pattern is determined by phylogenetic distance. You can see that in most cases they do correspond to each other: the serotype names, like Newport or Enteritidis, do correspond to the sequence-based phylogenetic analysis. But there are some exceptions. When you see colors mixing, that means two serotypes are in the same cluster, and when you see the same color in different clusters, that means the same serotype is subdivided into several distinct clusters. In those cases the serotype and the phylogenetic analysis are inconsistent. And what we were able to show is that indeed this holds for most of them. On the right here, the x-axis shows the threshold, that is, the cutoff that we used to cluster these genomes, and the y-axis shows the number of serovars that either fall into one cluster or into multiple clusters — monophyletic versus non-monophyletic. You can see that when you have a very stringent cutoff, so the organisms are more similar to each other, the serovar information is concordant with the phylogenetic information, but as you increase the cutoff, you see more and more serovar groups fall into the same cluster, as you would expect. And, as I mentioned, some of these correspond to cases where you have a polyphyletic structure for the serovars. Now we'll end with a bit of the COVID-related topics. Of course, COVID is near and dear to all our hearts, having significantly affected our lives for the last three years.
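The threshold-based clustering just described can be sketched in a few lines. This is a minimal single-linkage illustration with an invented pairwise SNP-distance matrix and invented serovar labels (S1-S4, "Newport", "Enteritidis"); the actual study used real genome distances and proper tree-based monophyly tests.

```python
# Minimal sketch of threshold clustering, assuming a toy pairwise
# SNP-distance matrix; samples, distances, and serovars are invented.

def cluster(dist, threshold):
    """Single-linkage clustering: join samples whose distance <= threshold."""
    samples = sorted({s for pair in dist for s in pair})
    parent = {s: s for s in samples}

    def find(s):                      # union-find root lookup
        while parent[s] != s:
            s = parent[s]
        return s

    for (a, b), d in dist.items():
        if d <= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for s in samples:
        groups.setdefault(find(s), []).append(s)
    return list(groups.values())

# Toy symmetric distances (each pair listed once).
dist = {
    ("S1", "S2"): 3, ("S1", "S3"): 40, ("S2", "S3"): 41,
    ("S3", "S4"): 5, ("S1", "S4"): 42, ("S2", "S4"): 43,
}
serovar = {"S1": "Newport", "S2": "Newport", "S3": "Enteritidis", "S4": "Newport"}

clusters = cluster(dist, threshold=10)
# Count how many clusters each serovar appears in: >1 means the serovar
# is split across clusters at this cutoff (analogous to non-monophyly).
spans = {sv: sum(any(serovar[s] == sv for s in c) for c in clusters)
         for sv in set(serovar.values())}
print(clusters, spans)
```

At this toy cutoff, "Newport" spans two clusters while "Enteritidis" falls in one, which is exactly the kind of serovar/phylogeny discordance the slide quantifies as the threshold varies.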
But genomics has really been the hero of the COVID-19 pandemic, because this was actually the first time we were able to deploy genomic epidemiology in almost real time to study an infectious disease outbreak. The sequencing of SARS-CoV-2 genomes can tell us how the virus spread regionally, provincially, nationally, and internationally; in other words, we can track transmission of the virus. We can also use it for outbreak investigation, as I mentioned previously, by looking at the clustering patterns and by comparing the phylogenetic results with the epidemiological evidence. Another advantage of having the sequences is that, as the virus evolves, detection methods such as PCR can fail. Many of you might have heard of primer dropout, where a primer that was designed based on the original sequence no longer works on the mutated sequence because it doesn't bind as well, and therefore regions of the genome will not be amplified if the primer regions contain mutations. Sequencing also allows us to reliably characterize the different variants and systematically look at how the mutations evolve over time; there are many, many studies linking these mutations to functional variation. And these studies, of course, inform effective measures in healthcare and in public health. There were many national and international efforts, at the early stage of the pandemic and all throughout it, to implement genomic epidemiology to study, and respond to, this outbreak. In Canada, this is highlighted by the Canadian COVID-19 Genomics Network (CanCOGeN), which was established in March 2020, at the early stage of the pandemic, with an initial investment of $40 million from the federal government: $20 million went into viral genomic sequencing and $20 million went into human host genome sequencing of infected individuals.
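The primer-dropout idea can be shown with a tiny exact-match sketch. The primer and genome sequences below are invented, and real assay design tolerates some mismatches depending on their position and chemistry; this only illustrates the principle that mutations in the primer-binding site break amplification.

```python
# Sketch of primer dropout: checking whether a primer still matches a
# reference versus a mutated genome. All sequences are invented.

def primer_binds(genome, primer, max_mismatches=0):
    """Slide the primer along the genome; True if any window is close enough."""
    for i in range(len(genome) - len(primer) + 1):
        window = genome[i:i + len(primer)]
        mismatches = sum(a != b for a, b in zip(window, primer))
        if mismatches <= max_mismatches:
            return True
    return False

primer    = "ACGTTGCA"
reference = "TTTACGTTGCAGGG"
mutant    = "TTTACGATGGAGGG"  # mutations inside the primer-binding site

print(primer_binds(reference, primer))                   # binds the reference
print(primer_binds(mutant, primer))                      # "drops out" on the mutant
print(primer_binds(mutant, primer, max_mismatches=2))    # tolerant matching rescues it
```

Sequencing sidesteps this failure mode entirely, since it reads the mutated region rather than depending on a fixed probe.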
This is a large consortium-driven approach, with partners from national and provincial public health laboratories, hospital laboratories, and other research institutions. There were also large-scale genomic sequencing centres involved, as well as industry partners, academic labs, and researchers from different institutions. The goal was to coordinate the SARS-CoV-2 and host genomic sequencing efforts. Our initial goal was actually just to sequence 150,000 viral genomes and 10,000 human genomes, and as you'll see later, at least for the viral genomes, we went way over that. The sequence data then needs to be integrated and coupled with the contextual information, and because a large number of groups were involved, that required harmonizing the data collected across the various agencies; that was the effort that Emma will describe more later. Also key here was working together, with heavy emphasis on facilitating data sharing nationally and internationally. Most important is, as I mentioned, not to let all these efforts go to waste, and to have capacity built in at the right places. We cannot prevent outbreaks, but hopefully we can at least prevent future pandemics from happening and do better pandemic preparedness. Okay, so this is just to show that CanCOGeN is a national effort, and to date actually more than half a million viral genomes have been sequenced in Canada, with both CanCOGeN funding and a lot of additional provincial and federal investment in generating the data. And all the data, of course, needs to go somewhere to be utilized, so the VirusSeq Data Portal was set up, led by McGill and the Ontario Institute for Cancer Research, to provide the genomic information and the associated metadata to researchers; these are publicly available resources.
Okay, now, genomic epidemiology: you'll hear this term many, many times throughout this workshop, and you'll get further refined definitions for it later. But for now, I'll just give a very high-level definition: I would characterize it as the combination of whole genome sequencing data from pathogens with epidemiological investigations to track the spread of infectious disease. The epidemiological information provides the contextual evidence for the genomic data, and the genomic data provides diagnostic information to support the epidemiological evidence. This is very different from the traditional clinical microbiology laboratory approach, where different tests are used for different organisms, with different turnaround times, different equipment needs, and different personnel needs. For example, reading a microscope to characterize the morphology of a pathogen doesn't take a ton of equipment, but it requires quite a bit of expertise; it's not something you can do with just a few days of training. So we want to replace, or at least complement, a large number of laboratory tests with a more consolidated approach using a whole genome sequencing-based workflow. At a very high level, the workflow essentially involves collecting DNA samples, sequencing those samples to generate the sequences, and then processing them using bioinformatics tools, which leads to diagnostic reports, preventive measures, or even new drugs being developed. This in turn provides interventions for a particular disease. The benefit of the approach is that it really simplifies the workflows needed, and it has a much faster turnaround time for some applications.
It also saves costs by reducing the number of platforms and instruments needed, and sequencing has become such a commodity that a lot of the know-how is available in the community; the results — in other words, the sequences — are also more comparable and shareable than other types of test results. The sequence data can also be used for value-added analyses, such as pathogen evolution analysis as I showed before, AMR prediction, and transmission dynamics modeling, which you will see more of in subsequent sections. This approach does have some challenges, though. The results are harder to process and interpret because of the volume of data involved — this is, I guess, why we're all here, to learn about the process — and it requires more computational resources to support this type of data processing and analysis, which can be quite challenging if you don't have adequate infrastructure. The technologies involved are also rapidly changing, so keeping up with the technology in itself requires a lot of R&D work, and this is why collaboration between practitioners and researchers can really benefit the process. The per-sample cost is still higher than some of the traditional tests, but this is being reduced further as we start batching large numbers of samples and streamlining some of the operations. So, briefly, about high-throughput sequencing: next-generation sequencing and third-generation sequencing are collectively called high-throughput sequencing, and sequence data have many clinical and public health laboratory uses. Here are some of them, which you will actually learn more about throughout the workshop.
The data, as I mentioned, can also be used for understanding pathogen evolution and the characteristics of these pathogens, which you cannot do if you're just looking at a marker gene, for example. There are several sequencing platforms on the market; the most popular ones, I would say, are the bottom three for now. The Illumina sequencers and the Nanopore sequencers are probably by far the dominant technologies. The PacBio and the Ion Torrent GeneStudio are still used in places; the GeneStudio especially seems to have found its way into diagnostic facilities, given the streamlined workflow used. The cost of sequencing has decreased drastically over the years, and this fairly recent study actually highlights some of the characteristics associated with each of the platforms: the cost, for example the cost per million reads, or in this case the cost per billion bases, and so on. They also highlight the runtime needed, the throughput, and so on, so this is a useful reference to have. Lastly, I want to highlight that it also mentions the read accuracies and the read lengths; the read accuracy of the different instruments is in these two columns here. So I just want to briefly mention short- versus long-read sequencing; you'll see these mentioned in subsequent lectures. Short reads, as exemplified by Illumina sequencing, are much cheaper per base, but the read lengths are only a few hundred base pairs, typically in the low hundreds, so one to three hundred base pairs. They are the higher-throughput instruments, and in the Illumina family of sequencers they're also much more accurate than long-read sequencing: the per-base error rate is less than 0.1 percent. However, the reads are consensus reads of many molecules rather than a single molecule. The long-read technologies, on the other hand, are more expensive per base, and they're lower-throughput machines.
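The "cost per billion bases" metric from the benchmarking table is just run cost divided by run output. Here is a back-of-the-envelope sketch; the platform names and dollar figures below are placeholders, not numbers from the cited study.

```python
# Back-of-the-envelope per-gigabase cost comparison across platforms.
# The figures below are invented placeholders for illustration only;
# consult current benchmarking studies for real numbers.

platforms = {
    # name: (run cost in USD, output in gigabases per run)
    "short-read A": (1200, 120),
    "short-read B": (900, 15),
    "long-read C":  (650, 10),
}

# Cost per Gb = run cost / output; lower is cheaper per base.
cost_per_gb = {name: run_cost / output_gb
               for name, (run_cost, output_gb) in platforms.items()}

for name, c in cost_per_gb.items():
    print(f"{name}: ${c:.2f} per Gb")
```

This kind of normalization is why a machine with a higher run cost can still be the cheapest per base, as with the high-output short-read instruments discussed above.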
And also, because they're typically reading from a single molecule, their accuracy is lower than short reads. But the typical read length is at least a few thousand base pairs, up to tens or hundreds of thousands of base pairs, and there's actually a competition to see who can get the longest read possible out of the MinION sequencer. I'll get to why that matters shortly, but first I want to characterize the pathogen genomes. Bacterial genomes typically contain a single circular chromosome; some are linear, but mostly it's a circular chromosome, and only a single copy, so they're haploid genomes. They may contain extrachromosomal DNAs called plasmids, and the genome size typically ranges from half a megabase to roughly 10 megabases. Some large ones are being discovered more recently, but when we talk about pathogens, the average is actually about three to five megabases, roughly corresponding to 3,000 to 5,000 genes. Viruses are much more diverse. They can be DNA or RNA, single-stranded or double-stranded, and they're classified into seven high-level groups. Their sizes range from one to two kb to one to two megabases; again, some large viruses are being discovered. They depend on host cellular machinery to replicate, and that's why their genomes are much smaller. There are also eukaryotic parasites such as fungi, protists, and worms. These are usually a few hundred megabases in genome size, and there are usually multiple chromosomes. These genomes are constantly evolving through different evolutionary forces. Genomes can undergo deletions; reducing the genome often increases fitness if the organism is in a specialized niche, for example if it can only infect a certain type of host or has become an obligate pathogen of certain hosts.
The genomes also undergo rearrangements, and this can of course affect the expression of genes when the genomes are rearranged. Same thing with gene duplications: genes can duplicate, which can lead to selective loss of one copy of the gene, but new functions can also evolve when you have gene duplications. And, very different from the predominant mode of genetic exchange in eukaryotes, which is typically sexual reproduction within a species, bacteria can, through horizontal gene transfer, acquire genes from a wide range of non-parental organisms, other species or other strains. Their ability to rapidly change their genome composition is therefore quite remarkable. So, briefly, I just want to mention a few terms, and I think Fiona will go over this a bit more as well. Homology means similarity due to shared common ancestry. It's a yes-or-no characteristic, right? We don't talk about degrees of homology; we talk about degrees of similarity. You either share the same ancestor or you do not. So just keep in mind, when you think about homology, it's yes or no, not degrees of difference. Within the homologous genes, in other words genes that have the same ancestry, you have orthologs, which arise due to speciation; paralogs, which arise due to gene duplication; and xenologs, which arise due to horizontal gene transfer. Genes that are similar but do not share a common ancestor are not homologs at all; they're analogs. Okay, so now, the process involved in high-throughput shotgun sequence analysis is as follows. If the DNA is from an isolate, you would first culture it, then extract the DNA and shear the DNA. You might select only certain targeted regions through PCR amplification for sequencing, and then you put it on the sequencer, followed by analysis.
For microbiome-type studies, you of course bypass the culture step and go straight to extraction and amplification, followed by sequencing. Okay, so the sequence data analysis essentially involves trying to piece together millions or billions of overlapping reads and assemble them, putting them together for subsequent analysis. Alternatively, you process them as reads and then use the reads for analysis, which again will be discussed later. So, as I mentioned, the steps are: assemble your reads into contigs, or ideally back into the genome; then annotate the sequences, so that functional information and the locations of the genes are determined; and then carry out variant analysis, and that should say module five rather than module three. Okay, now, the genome assembly process comes in two different flavors. One is de novo assembly, where computer algorithms essentially try to identify overlapping sequences and merge them together. There's also reference-assisted assembly; sometimes people refer to it as mapping, where you map your sequences to an existing, related genome sequence. The assembly quality is therefore affected by the reference genome used: if your reference genome is very different from your genome, the assembly quality will be lower because the mapping process will not be as accurate. There are actually dedicated lectures on this that you can look up if you're interested. So the main point I would like to make out of this discussion is why long reads are beneficial. Why do we, in microbial genome sequencing, often prefer longer reads rather than shorter reads? It's because genomes often contain repeat elements, and a short read does not span the entire repeat, as shown in blue.
You can think of each of these as a repeat sequence, and if the darker blue is the short reads, you can see that they don't span the entire repeats, and as a result of that, they don't resolve the repeats properly, which often leads to misassemblies, for example only recognizing two distinct repeats as opposed to the six up here. On the other hand, if you have long reads, as depicted by these long blue reads, your reads can be longer than the repeat sequences. I'm hoping you can see my mouse cursor. As a result, because the long reads span the whole repeat, they will have less trouble recognizing that there's this long repeat in the genome, and will therefore assemble the genome correctly. So de novo assembly of short-read sequences often has trouble spanning the repetitive regions; that's the message I want to highlight here. As I mentioned, the different sequencers have different error rates. Without going into the details, I just want to say that the third-generation sequencers, such as the MinION and the PacBio, usually have a much higher error rate, and that requires extensive error correction or consensus-based error correction. The next-gen sequencers such as the Illumina, on the other hand, have a much lower error rate. This ability of long reads to close gaps in the genome, or to span regions that short reads otherwise cannot successfully span, usually results in more complete genomes with fewer gaps in them. A contig just means a contiguous sequence, and your complete genome would ideally be a single contig spanning the entire chromosome of the genome, plus a number of extrachromosomal elements such as plasmids in their own separate contigs, but often we don't see that. Okay, so annotation is the process of assigning functions and gene locations to the sequences.
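To make the repeat problem concrete, here is a minimal sketch in Python. The toy genome and reads are entirely made up for illustration: a read that lies wholly inside a repeat fits equally well at every copy of that repeat, so an assembler cannot tell the copies apart, while a read that spans the repeat plus its unique flanks has only one valid placement.

```python
# Toy genome: two copies of the repeat "GTGTGTGT" separated by unique sequence.
genome = "AAACC" + "GTGTGTGT" + "TTTAA" + "GTGTGTGT" + "CCCGG"

def placements(read, genome):
    """Return every position where the read matches the genome exactly."""
    return [i for i in range(len(genome) - len(read) + 1)
            if genome[i:i + len(read)] == read]

short_read = "GTGTGTGT"           # lies entirely within the repeat
long_read = "TTAAGTGTGTGTCCC"     # spans the repeat plus both unique flanks

print(placements(short_read, genome))  # two candidate positions -> ambiguous
print(placements(long_read, genome))   # a single position -> repeat resolved
```

This is of course only the exact-match core of the idea; real assemblers work with inexact overlaps and sequencing errors, but the ambiguity shown here is exactly why short-read de novo assemblies break at repeats.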
And again, we won't be covering this in the workshop this time, but it's just for you to be familiar with some of this terminology. Okay, so I'll skip over this, actually, for now. I'll just highlight that there are automated systems, because this is a process that has been more or less automated; that's why you can just take the results from these automated annotations and carry on with your work. There are some caveats, but it's not something we will go into too deeply in this workshop. And, for the sake of time, I will just say that a lot of the work we're doing essentially falls under the umbrella of comparative genomics: we're trying to identify variations in the genome and use those variations to link to epidemiological evidence. The types of variation could be regional, could be gene-by-gene, or could be single-nucleotide, and you will see both the gene-by-gene method and the single-nucleotide method discussed in the workshop. Okay, skip that. Then I want to introduce the idea of a pan-genome. Comparative genomics led to the pan-genome concept being proposed in 2005. This was the realization that when you look at several strains within a single species of bacteria, you see that some of the genes are shared; these are called the core genome, and they typically correspond to housekeeping genes that are important for the survival of the organism. You also see strain-specific accessory genes that correspond to lifestyle or adaptation types of genes. The pan-genome calculation essentially gives you a sense of whether a particular species has an open or a closed pan-genome. A closed pan-genome means there's a somewhat limited number of new genes you expect to see as you sequence more of a given organism, of a given species, I should say; with an open pan-genome, such as E. coli's, on the other hand, the more strains you sequence, the more new genes you're going to discover.
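As a minimal sketch of the pan-genome idea, here is a toy calculation in Python; the strain names and gene sets are made up purely for illustration. The core genome is the intersection of the strains' gene contents, the pan-genome is their union, and the accessory genes are what's left over.

```python
# Toy gene-presence sets for three hypothetical strains of one species.
strains = {
    "strain_A": {"dnaA", "gyrB", "rpoB", "lacZ"},
    "strain_B": {"dnaA", "gyrB", "rpoB", "stx2"},
    "strain_C": {"dnaA", "gyrB", "rpoB", "lacZ", "blaTEM"},
}

core = set.intersection(*strains.values())  # genes shared by all strains
pan = set.union(*strains.values())          # all genes seen in any strain
accessory = pan - core                      # strain-specific genes

print(sorted(core))       # housekeeping-like genes present in every strain
print(sorted(accessory))  # lifestyle/adaptation genes in only some strains
```

In an open pan-genome, each newly sequenced strain keeps adding genes to the union; in a closed pan-genome, the union plateaus as more strains are added.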
Okay. And yes, the last little bit is to touch on something we will cover more in depth in module three. This is to highlight one of the ongoing efforts, called IRIDA, the Integrated Rapid Infectious Disease Analysis platform, which was developed with the National Microbiology Laboratory in Gary's group, in collaboration with Gary, Fiona, and actually Andrew and almost everyone here; this is actually how we first got together and eventually proposed the workshop. The goal is to build the tools and the software platforms needed for doing genomic epidemiology analysis by practitioners, so it has a simple user interface, and so on and so forth. The partnership is mentioned here. I'll skip all these and just jump to the end; some of these will be highlighted later. I do want to highlight one thing to end this lecture, which is that there is a challenge in data sharing in Canada, in that Canada comprises 14 distinct healthcare systems, there's no universal standard for data collection or sharing, and there's also no legally binding public health data sharing agreement in Canada. As a result of that, during the early days of the pandemic, data sharing in Canada was quite delayed. As you can see here, the analysis shows that the delay between sample collection and submission to a public repository was quite long in Canada, and the amount of metadata, or contextual information, available for these records was also quite limited. That prompted some of the efforts by us and others to try to correct course, and by and large we were quite successful in that. Some of the reasons behind the delay include the capacity within the public health laboratories to process sequence data and metadata in order to be able to release such data.
There are also multiple sign-offs needed for release, and of course there are very legitimate privacy concerns, which is why these protocols were put in place. The desire to release only high-quality data was another reason for the delay. I'll mention a few studies, led by Yann Joly's group and my group, that really tried to take a social science angle to understand the problem: one focused more on privacy concerns, and the second focused more on Canadians' opinions on data sharing. Just to get to the punchline: most Canadians are actually in favor of sharing de-identified records, that is, de-identified sequence records and related contextual information such as symptoms, vaccine status of the case, age, and so on and so forth. So the onus is now on public health and on lawmakers to reflect that willingness to share data, but also to do data sharing responsibly. The idea is that, of the data being collected, a subset can be shared, which is what we call the minimal contextual data, while a larger number of data types can still be collected in response to healthcare needs or research needs and so on. Mechanisms for more responsive data sharing will then need to be put in place to allow us to benefit from all the resources going into data collection. Okay, with that, I will thank you for your attention and end my lecture.