Good afternoon everyone. My name is Anamaria Crisan, I usually just go by Ana, and I'm going to be talking to you this afternoon about phylogeographic analysis. The creator of this module's content, who also taught it last year, is Rob Beiko. Rob could not be here with us today, so I am taking over his material. Rob is the true expert on this subject, but there's a nice overlap between some of the work that he does and the work that I do, in particular around how we represent this kind of information. You'll see me again in the data visualization component of this CBW series. I've used his slides from last year, so if I'm not sure what the answer to something is, I can defer to Rob, or find out and get back to you on it. But I have done quite a bit of prep work, so I think we'll be okay.

The idea behind phylogeographic analysis is essentially that you take time, geography, and trees and put them together in some way. As we will see, this can be done in a couple of different ways depending on how you're creating your trees and what kinds of methods you are using.

The learning objectives of this module are, first, that you're going to learn how to generate the kinds of data sets that are appropriate for phylogeographic analysis. You are also going to learn how you can take these data sets and manipulate them in various tools. In this module, we're going to be talking about Phylocanvas, Microreact, and GenGIS. We're also going to talk a little bit about R in the data visualization component because, again, we'll be working with these kinds of data there as well. (Whoops, I was hoping to scroll through my notes here. It was working previously. Okay.)
You're also going to learn how we can analyze and interpret these kinds of phylogeographic data sets. To do that, we're going to look at a number of papers that have tried to solve some of these problems, how they did it, and how they got to their results. (Sorry about that, the presenter view isn't cooperating with scrolling. I have some notes here that I haven't memorized, so it would be helpful to see them. We'll just make do.)

The relationship between disease and geography has actually been studied for quite some time. You may know by now the story of John Snow and the cholera pump: there were all these cases of cholera in London, he looked at where they were and had a hypothesis that they clustered around this pump, he removed the pump handle, and the outbreak was over. But he's not the only one who's done this. In fact, before John Snow, there was a gentleman named Valentine Seaman, who was studying yellow fever in New York. They looked at people who were falling ill with yellow fever around the docks and plotted where all the different individuals were getting sick. When they looked not only at the locations of the people but also at the time when they were falling ill, they could see that people were getting sick at specific docks, where ships coming in from different places were dropping off contaminated water and bringing with them other contaminated species that would then integrate themselves into the ecosystems on the docks. And once they added the time component, they could see yellow fever spreading across all of these different docks.
So you had this really nice visual of how the outbreak was evolving over time from place to place. Of course, they didn't have the phylogeographic component yet. They didn't know how to do it. But it was a very interesting and early example of how people were studying the spread of disease over geographic distances.

There is also this really nice example from Henry Ingersoll Bowditch. He was studying the spread of tuberculosis in New England at the time. He looked at all of these different regions, looked at the incidence rates of TB, and encoded this information as colors and symbols. So you got this really nice visual of the incidence of TB in different areas. He also collected a lot of other kinds of information so that they could try to understand the different factors involved in the spread of TB. One of the things he did was collect a bunch of environmental factors, for example the soil moisture, and he found that there was some relationship between the soil moisture and the incidence of TB. You'll always hear the old saying that correlation is not causation, but there are ways that we can actually infer causal relationships, and this was one way that he looked into factors that could impact the spread of TB.

The website where Rob got this is a really interesting one that I'm going to show you. It catalogs a whole bunch of historical disease maps that have been produced in different regions and by different people over time. You've got the cholera epidemic, John Snow's version, as well as the various ones that I've shown you here. So if you're curious about how people were visualizing this data 200 years ago, this is a really interesting resource. OK.
So phylogeography is a field of study concerned with the principles and processes governing the geographic distributions of genealogical lineages, especially those within and among closely related species. Of course, this was much harder to do in the past, when we didn't have a concrete idea of how we could actually use this kind of information. But today, things have changed quite a bit.

This is a tool called Nextstrain. It recently won the Open Science Prize and is developed by Trevor Bedford and his team over at Fred Hutch. What this visual is showing is the spread of the Ebola virus during the 2014-2016 outbreak in West Africa. Rather than just showing you the picture, I actually want to show you the full tool, as well as the way that you can interact with this information to understand what is going on in the outbreak. They've got a couple of different organisms that they study here. Unlike some of the other tools we're going to look at, they do all of the data pre-processing as a separate step and then put the information up on the website. You can download and use their data processing pipelines if you like, but theirs is more of a canned solution.

So the Ebola outbreak is right here. What you've got is a phylogenetic tree. It's a Bayesian time phylogeny, so you can actually infer when different events occurred over time. The nodes are currently colored according to, let's take a look, countries, so we've got a nice geographic view of them: whether they occurred in Guinea, Liberia, or Sierra Leone. They've got an accompanying data visualization as well, which shows the map and the spread over time. Now, you can't really take this apart because it's quite a big blob right now. One of the ways that they've solved that is to add some nice interactive components. So what you can do is play this animation.
And it'll slowly march across the tree and start to show you how the disease is spreading over time and how it's being transmitted to all of these different regions. It can be kind of tricky to follow both the tree and the map at the same time, but it gives you this really rich idea of the spatiotemporal as well as genomic dynamics of this outbreak. They've also got a little genomic view so that you can take a look at how all these different mutations or aberrations are occurring in the genome over time. So you've got all this really great information that people just couldn't have access to in the past.

As I mentioned, Nextstrain itself is actually made of separate components. The visual component that you just saw is something called auspice. This is the GitHub repository for the Nextstrain folks. All of their bioinformatics pipelines are in something called augur. They're really cool names that go back to the Romans who were responsible for interpreting the will of the gods: the auspice is the prophecy, and the augur was the guy who had to go and talk to the gods. So it's a really nice way that they got the names of these tools. The Bedford Lab will usually do the analysis on the back end for these different microorganisms and then put the information up on their website for surveillance purposes and for different specialists. If you wanted to do it yourself, this is where you would get the tools.

So you can see that there's a lot of information that has been encoded in a lot of different ways in this kind of visualization, using color, using the trees, and using geography to show the spatiotemporal trends. There are several questions that we might ask when we are doing a phylogeographic analysis and we're looking to match the space, time, and environmental components. You could start to see that in the Ebola example.
We might ask what the most likely points of origin or introduction of a pathogen into a specific region are. If we had slowed down the animation a little bit more, you might have been able to see how some things would hop from one place to another, for example, as well as what degree of transmission there is. You might want to know when an outbreak likely originated. And you also want to know whether there are specific settings that could impose a greater risk, such as certain environmental or climate conditions that can influence the rate of transmission or how the disease spreads.

So what do we need in order to do this kind of analysis? First, you need some sort of location information. Often this is GPS coordinates on a map. Sometimes it is really hard to get, especially if you are working in a really restrictive data sharing environment. I know that when I work with some of the folks in public health, I can usually get geographic information only at a really high level, maybe postal code, or even some larger city administrative region. So having fine-grained GPS data is really hard, and it's usually fudged a little bit. In fact, for the Ebola data, all they had were these different provinces or geographic administrative regions that they could use to infer where the spread was. They didn't know it at the level of the village, which obviously can impact the kind of response that you would mount.

You also need some temporal data. This is often when you actually sampled the individual and got an isolate from them. This can also be very complex because you don't know how long somebody's been sick or what stage they're at, and this will of course vary for different diseases. You can think of something like HCV, hepatitis C, which can have a lot of within-host diversity, and you're seeing it at only one point in time.

You also need contextual information about different settings.
This can be location attributes. It can be information about the patient. Sometimes it can be things like socioeconomic factors, or it can be different kinds of environmental data that you might want to collect from biological sensors, or even weather and meteorological information. Fiona presented an analysis that looked at the microbiome within different watersheds, and for that we actually collected rain data and found that there was a relationship between the amount of rain and the diversity of the microorganisms. We had a lot of different reasons why. Part of it was that you'd get more manure runoff and that would feed population growth, and stuff like that. So all of that is information you need to collect and include in your analysis.

And then, of course, you need a phylogenetic tree. We're not in a genomic epidemiology course for no reason. We're always going to have phylogeny as our champion. If you're working with a lot of epidemiologists, they'll often collect the first three types of information, and you're coming in and trying to integrate it with the phylogenetic tree. That can itself be quite complex, since they're not necessarily taught how to interpret phylogenies or are not used to seeing them as much, although it's becoming more and more common for them at least to see that kind of data too.

So to reprise all those data types in a workflow: you usually get some genomic data in the form of sequence information, on which you have to do some sort of analysis to transform it into useful phylogenetic information. You'll of course have metadata or alternative sources of data that need to match the information that you've put in your tree, and these can be about the environment, time, and your patient. You'll often feed all of that into various kinds of software tools.
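Since the metadata has to match the tips of the tree, it can help to check that match before loading anything into a tool. What follows is a minimal sketch, not part of the original slides: the tree, the sample identifiers, the dates, and the column name `id` are all made up for illustration, and the leaf-name parser assumes a simple Newick string without quoted labels.

```python
import csv
import io
import re


def newick_leaf_names(newick: str) -> set:
    """Extract leaf labels from a simple Newick string (no quoted labels)."""
    # A leaf label directly follows '(' or ',' and runs until ':', ',', ')' or ';'.
    return set(re.findall(r"[(,]\s*([^():,;\s]+)", newick))


def check_metadata(newick: str, csv_text: str, id_column: str = "id"):
    """Return (tips without metadata, metadata rows without a tip)."""
    tips = newick_leaf_names(newick)
    rows = csv.DictReader(io.StringIO(csv_text))
    ids = {row[id_column] for row in rows}
    return sorted(tips - ids), sorted(ids - tips)


# Toy inputs: every identifier, place, and date here is hypothetical.
tree = "((H1:0.001,H2:0.001):0.002,(N1:0.001,B1:0.02):0.003);"
metadata = """id,country,date
H1,Haiti,2010-10-20
H2,Haiti,2010-11-02
N1,Nepal,2010-08-15
X9,Unknown,2010-01-01
"""

missing_metadata, missing_tips = check_metadata(tree, metadata)
print("tips with no metadata:", missing_metadata)
print("metadata with no tip:", missing_tips)
```

A mismatch in either direction usually means an identifier was renamed somewhere between the sequencing pipeline and the epidemiological spreadsheet, and it is much easier to spot here than inside a visualization.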
We're going to learn about some of them today, but they're not the only ones, and these tools change over time and are usable in very different ways. These software tools can help you visualize the data, or can help you compute statistics on the data that will help you figure out transmission events and evolutionary changes, and even indicate outbreak events.

So let's start with one example. This is some work on the Haiti cholera outbreak that was done by NML researchers, I believe. What they wanted to do was study the introduction of cholera into Haiti. I'm going to bring up a larger version of this tree so it's a little bit easier to see. It's a bit small. Here you can see the geography of Haiti and the various sampling sites that were used. They've got the 2010 Haiti cholera outbreak isolates, from after the earthquake, which are all shown in red. They've got various strains of cholera from Nepal that are shown in green. The authors of this paper also went back and got several other isolates. So not only do you have the Haitian outbreak, you also have a big tree, with information from Bangladesh and from Nepal over time.

Taking a look at this tree, are there any things that immediately pop out to you? I'm asking. Does anybody want to suggest something? One of the things that I know Rob asked students last year is, first of all, is this a phylogram or is this a cladogram? Do you know? Phylogram, that's right. And as a phylogram, do you know what we could use the branch lengths to interpret? Pardon me? Distance, right, so relatedness. What you can see here is that it's practically flat, right? This means they're very, very closely related isolates. And you've also got some green Nepalese isolates over here as well. Also pretty close, pretty flat, right? If we look along the tree, we don't have any isolates from Haiti earlier on, right?
So it's possible, in fact it's pretty likely, that this Nepalese strain was introduced into Haiti and initiated the cholera outbreak there. This was controversial at the time because it meant it was introduced by the UN peacekeeping forces.

What they've also done here is apply a Bayesian phylogenetic analysis. So they've been able to assign time points to the most recent common ancestors within these clades, that is, the last point before they diverged. Here MRCA refers to the most recent common ancestor, and what they've got is a point estimate as well as an interval around it. They think that the most recent common ancestor within this clade existed within the last year, and possibly as little as half a year or as much as two years ago. Going back to point B, the most recent common ancestor was about four years ago. And so on: they've got about two decades' worth of information around this outbreak. What they get a clearer sense of is that the timing and the relatedness of the Haitian and the Nepalese strains are too close to be just a coincidence. So it's very likely that the introduction into Haiti happened through this Nepalese strain.

This slide is reprising some of the information that we just saw when we looked at the paper figure. We're going to get back to talking about Bayesian trees a little bit more at the end of this presentation. Usually what you do is just get a regular phylogenetic tree and visualize the accompanying data beside it, and there's a bit of a disconnect between how you generated the tree and the different kinds of relevant contextual information that you have.
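To make the two ideas above concrete, reading branch lengths as distance and finding a most recent common ancestor, here is a small sketch on a toy tree. It is only an illustration: the node names and branch lengths are invented, not taken from the paper, and it does no dating at all, which would need a molecular clock model such as the Bayesian analysis just described.

```python
# Toy tree stored as child -> (parent, branch length to parent).
# All names and lengths are hypothetical.
PARENT = {
    "Haiti_1": ("A", 0.001),
    "Haiti_2": ("A", 0.001),
    "A": ("B", 0.001),
    "Nepal_1": ("B", 0.002),
    "B": ("root", 0.010),
    "Bangladesh_1": ("root", 0.030),
}


def path_to_root(node):
    """List the node and all of its ancestors, ending at the root."""
    path = [node]
    while node in PARENT:
        node = PARENT[node][0]
        path.append(node)
    return path


def mrca(a, b):
    """Most recent common ancestor: first ancestor of b that is also an ancestor of a."""
    ancestors_of_a = set(path_to_root(a))
    for node in path_to_root(b):
        if node in ancestors_of_a:
            return node
    return None


def distance_to_ancestor(node, ancestor):
    """Sum the branch lengths walking up from node to ancestor."""
    total = 0.0
    while node != ancestor:
        parent, length = PARENT[node]
        total += length
        node = parent
    return total


def patristic_distance(a, b):
    """Tip-to-tip distance along the tree, through the MRCA."""
    ancestor = mrca(a, b)
    return distance_to_ancestor(a, ancestor) + distance_to_ancestor(b, ancestor)


print(mrca("Haiti_1", "Nepal_1"), patristic_distance("Haiti_1", "Nepal_1"))
print(mrca("Haiti_1", "Bangladesh_1"), patristic_distance("Haiti_1", "Bangladesh_1"))
```

In this toy tree the Haitian and Nepalese tips sit a very short distance apart compared with the Bangladeshi tip, which is the "practically flat" pattern pointed out on the figure.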
Bayesian analysis allows you to put the two together at the analysis stage, so that you can construct a tree with contextual information like time and geography and get a tree that is relevant to your metadata, as opposed to just overlaying the metadata visually. From that, you can get this most recent common ancestor information as well. There are various ways that you can infer the root here, depending on whether you have an outgroup or whether you have certain molecular clock assumptions that also allow you to calculate the origin of the outbreak. And you can also do some really interesting statistics to see how confident you are in that root and origin.

In the tutorial, we are going to be talking about several tools that allow you to do the kinds of analysis that we went over in those case studies. One of these tools is called GenGIS. It was produced by Rob Beiko. I am going to give you a demonstration of GenGIS, since it runs on OS X and Windows and the OS X version is slightly different from the Windows version, so it's far easier if one of us does it instead of all of us. But the download information is available for you. We're also going to be talking about Microreact, which is a tool that's come out more recently and which you might want to use for your integrated assignments. And I'm also going to be talking to you about R and Shiny. GenGIS and Microreact are standalone tools: you upload your data into them and they do everything for you. R and Shiny are more customized. They're scripting tools. If you're comfortable in those languages, they can be quite flexible in terms of what you can produce for phylogeographic analysis.

I'm also going to mention a more advanced scripting library. JavaScript is a web programming language, and it has a specific library called Phylocanvas, which allows you to visualize phylogenetic data in the browser and perform all of these kinds of operations on it.
Do any of you have any experience with JavaScript? Okay, one, two, kind of. Yeah, okay. How many of you have experience with scripting languages like R or Perl or Python? Okay, a couple more. Great. JavaScript is in that realm of scripting languages where you make everything on your own, but in a very different paradigm, and in a language that changes every two weeks. So we're going to talk briefly about these tools. I'm not going to tell you how to use them yet, but we're going to talk about them and take a look at the Ebola outbreak with these different tools.

Okay, so a lot of really new and interesting data visualization tools for phylogeographic analysis on the web incorporate Phylocanvas to some extent. It is a JavaScript library for interactive tree and metadata visualization. It has a bunch of plugins and public APIs as well, so that if what Phylocanvas supports out of the box is not enough for you, you can add to it and augment its functionality.

We've got an example here of what code looks like for Phylocanvas, but there's one important component that is missing, which is how you actually get the tree. The first line of code, sorry for those of you over there, is tree.leaves[0].setDisplay. It means: look at this tree leaf. You've got all these different leaves. Even though this is tree leaf A, it is accessed by using its index, zero. So that's a bit confusing, but yes. Then you want to specify some properties for how this node should look, and you have to specify every single one. So you set the display, and the first thing that you specify is colour: red. It turns out that colour red refers to this tiny line, which looks black on the big screen but is red on my screen. It's very hard to see.
Next you have shape: circle, which says that the shape of the node is a circle. Great. Size three just means bigger than the rest of them, since all of the others have a size of one. Then you have the specific leaf style. This is hexadecimal code, I think, for, oh, this one is black, and this one is specifying the green fill color. And then you're also saying that this black line around it should have a width of two. So you can get a sense of how big a slog this is, because you have to specify basically everything, and this is just for one leaf. There are automated methods that you would write on top of this to do it automatically, but you have to know the rest of JavaScript in order to do that. Otherwise, you're stuck doing this manually. Fortunately, there are a lot of tools, like Microreact, that leverage this framework, but this is the level of scripting that they're usually working at. Then you can also specify the labels, and then you draw the tree.

What this doesn't tell you is how you actually get the tree in the first place, so I'm just going to very quickly show you that from last year's tutorial. What you first need to do is create a variable that is ready to take a tree, and then you load a Newick tree file. Then you go through that styling process. So if you like JavaScript, Phylocanvas is a great library to check out. If you're not familiar with JavaScript in general, you would have to learn a fair bit more JavaScript before you could dive in and really get the full potential of Phylocanvas. But this is what you can get at the end. This is the Phylocanvas website. They've got really nice, seamless manipulations of the trees. You can also do a live upload where you just throw in some data and draw the tree, right?
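The "automated methods you would write on top" mentioned above amount to generating one style object per leaf from the metadata instead of typing each one by hand. Here is a rough sketch of that idea, written in Python rather than JavaScript so it can be run anywhere. The key names (`colour`, `shape`, `size`, `leafStyle`, `strokeStyle`, `fillStyle`, `lineWidth`) mirror the properties read out from the slide and should be checked against the Phylocanvas documentation; the leaf names, the colour palette, and the helper functions are hypothetical.

```python
import json

# Hypothetical palette: one fill colour per country, grey for anything unknown.
COUNTRY_COLOURS = {
    "Guinea": "#1b9e77",
    "Liberia": "#d95f02",
    "Sierra Leone": "#7570b3",
}
DEFAULT_COLOUR = "#999999"


def leaf_style(country, highlight=False):
    """Build one style object shaped like the per-leaf settings on the slide."""
    fill = COUNTRY_COLOURS.get(country, DEFAULT_COLOUR)
    return {
        "colour": fill,
        "shape": "circle",
        "size": 3 if highlight else 1,  # size is relative to the base node size
        "leafStyle": {
            "strokeStyle": "#000000",  # black outline
            "fillStyle": fill,
            "lineWidth": 2,
        },
    }


def styles_for(leaf_countries, highlight=()):
    """Map every leaf name to its style, enlarging any highlighted leaves."""
    return {
        leaf: leaf_style(country, leaf in highlight)
        for leaf, country in leaf_countries.items()
    }


# Made-up leaves; in practice these would come from the metadata table.
leaf_countries = {"A": "Guinea", "B": "Liberia", "C": "Sierra Leone", "D": "Mali"}
styles = styles_for(leaf_countries, highlight={"A"})
print(json.dumps(styles["A"], indent=2))
```

The resulting dictionary could be serialized to JSON and handed to whatever front-end code applies the styles, which is roughly the kind of work tools built on this framework do for you.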
So for your integrated assignment, if you want to throw a CSV and a Newick tree onto the Phylocanvas page and just see what it looks like, you can do that there without having to do any programming.

The next tool is Microreact. Microreact is a more recent tool that's come out of David Aanensen's group. It uses Phylocanvas and allows you to see the geographic data as well as the temporal spread of your data over time. We'll quickly go to the Microreact page. We will be practicing uploading some data to Microreact in the lab portion of this module. What I'm doing here is loading the Ebola data set again. You've already seen one version of the Ebola data set with Nextstrain, and in Microreact they've made slightly different design decisions, but it's roughly the same data. What you've got here is a geographic map. If you're curious, this is drawn with a JavaScript package called Leaflet. Right here, you'll see Phylocanvas in action. And down here is a timeline of all the different events and when they're occurring. We'll manipulate this a bit more, as I said, in the tutorial component.

Once again, you just have to put in a simple CSV file and a Newick tree, which is really nice and easy. You drop them in there. You can do all kinds of tree manipulations and some animations. You can also save your project, return to it, and share it with other people. The journal Microbial Genomics has an agreement with the Microreact folks so that if you have a data set with a phylogeographic component to it, you should upload your data and save it as a project on Microreact as part of your submission.

Now, Microreact and Nextstrain are great tools, because they show you all of these things, so long as, in the case of Nextstrain, it supports the bug that you're interested in. But those are tools that run on the web, which means that your data inherently has to be made public.
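As a rough idea of what the "simple CSV file" can look like, here is a sketch that writes one row per tree tip. The column names (`id`, `latitude`, `longitude`, `year`, `month`, `day`) are an assumption about what Microreact expects and should be checked against its documentation before uploading; the sample identifiers, coordinates, and dates are placeholders, not real outbreak data.

```python
import csv
import io
from datetime import date

FIELDNAMES = ["id", "latitude", "longitude", "year", "month", "day", "country"]


def build_metadata_csv(samples):
    """Write one CSV row per sample; 'id' must match the tip label in the tree."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    for sample in samples:
        collected = sample["collected"]
        writer.writerow({
            "id": sample["id"],
            "latitude": sample["latitude"],
            "longitude": sample["longitude"],
            "year": collected.year,
            "month": collected.month,
            "day": collected.day,
            "country": sample["country"],
        })
    return buffer.getvalue()


# Placeholder samples: coordinates are rough, dates are invented.
samples = [
    {"id": "S1", "latitude": 9.5, "longitude": -13.7,
     "collected": date(2014, 3, 20), "country": "Guinea"},
    {"id": "S2", "latitude": 6.3, "longitude": -10.8,
     "collected": date(2014, 6, 2), "country": "Liberia"},
]

csv_text = build_metadata_csv(samples)
print(csv_text)
```

Splitting the date into separate year, month, and day columns also makes it easy to coarsen the data, for instance by leaving out the day, when the exact sampling date cannot be shared.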
Data that you put into a web tool has to be able to live on somebody else's servers, and there are reasons that you might not be able to do that. As we talked about yesterday, there are a lot of barriers to data sharing, some for good reasons, some for not so great reasons.

GenGIS is a standalone tool, which means that you download it to your computer. It runs off your computer. Anything that you do is saved to your computer. So if you have data that's really private or low level, or you can't share it for some reason, but you still want to do this phylogeographic analysis and you don't want to script your own thing using JavaScript or something else, you can use GenGIS. GenGIS has a lot of great resources on the web. Its wiki is really nice, and it's got a lot of information about how you actually run the tool itself. Again, if you're a Mac user, there is some functionality in GenGIS that may not be available to you, because the Mac version is slightly different from the Windows version.

The way that GenGIS works is that you give it some data. Usually you provide the map for it; it doesn't figure it out for you. You have to give it some sample information, which it actually calls location information. And you have to give it some trees in Newick format. The application itself is programmed in C++, with OpenGL as well, which is all to say that it's not the easiest thing in the world to just open up and take a look at. It's not like Python code or R code, where you can get it and really easily modify it. In theory you can if you know C++, but I don't think many here would know C++. It's also got an interface to some more common scripting languages, Python and R, if you want to do your own statistical analysis.
And then finally, there are a lot of different plugins and a lot of different ways that you can output the data, so you can do some interesting customized analysis. There's an active development community, and there are various ways that you can save things, such as images or even little movies if that's what you want to show.

One of the primary tutorial examples that GenGIS has available is this mosquito one. I've got a note here that it's not actually a genomic epidemiology example, because I think they don't specifically use the mosquito DNA in this analysis. But it's a nice phylogeographic analysis nonetheless. What they've got here is a map of various Hawaiian islands, and what they're doing is overlaying this tree on top of the geographic locations. So you can make a more direct inference about the relationship between the 3D tree structure and where the leaf nodes actually touch down, and you can see if your tree actually makes sense. In a sense, you can tell whether things are hopping across an island, or whether there's some relevant outgroup or not. Here the different points are colored by geographic region; specifically, they're colored by each of the little islands that they come from. You'll notice that some of the lines are actually white. That's because GenGIS will keep a neutral color, in this case white, until everything below a branch is coming from the same place or has the same metadata, in which case the branch becomes a solid color. So over here, in this clade down there, everything is coming from the same island, and so the branch is all green. It's giving you a visual indicator of the purity of the branch.

Now, 3D is actually not the easiest way to see something. You might think it's intuitive, but it can be really challenging, because you want to look at it this way or that way, and it's really hard to think about how you might rotate that in your head.
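The branch colouring rule described above, neutral until everything below a branch shares the same label, can be written as a short recursive pass over the tree. This is a sketch of the rule as explained in the talk, not GenGIS's actual implementation; the tree shape, the tip names, and the colours are made up.

```python
NEUTRAL = "white"

# Toy tree as parent -> children; anything not listed as a parent is a leaf.
CHILDREN = {
    "root": ["n1", "n2"],
    "n1": ["oahu_1", "oahu_2"],
    "n2": ["maui_1", "n3"],
    "n3": ["kauai_1", "kauai_2"],
}

# Hypothetical colour per leaf, standing in for "island of origin".
LEAF_COLOUR = {
    "oahu_1": "green",
    "oahu_2": "green",
    "maui_1": "blue",
    "kauai_1": "red",
    "kauai_2": "red",
}


def colour_branches(node, result):
    """Record a colour for the branch above node; return the set of leaf colours below it."""
    if node not in CHILDREN:
        seen = {LEAF_COLOUR[node]}
    else:
        seen = set()
        for child in CHILDREN[node]:
            seen |= colour_branches(child, result)
    # A branch is solid only when every leaf beneath it has the same colour.
    result[node] = next(iter(seen)) if len(seen) == 1 else NEUTRAL
    return seen


branch_colour = {}
colour_branches("root", branch_colour)
for node in ("root", "n1", "n2", "n3"):
    print(node, branch_colour[node])
```

The same "all descendants agree" check is what makes a clade a sensible candidate for collapsing into a single symbol when the picture gets crowded.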
So what they've also done is create a 2D tree. One of Rob's students, Donovan Parks, implemented an algorithm that effectively rotates the tree down into two dimensions, so that the phylogenetic relationships are still respected but things are positioned in such a way that you still get a good view of the relationship between the phylogeny and the exact geography. By that I mean physically pointing to the geography with the nodes. It's then far easier to see all of these relationships compared to the 3D tree, where you're trying to look down on it.

There are also different ways that we can now use this 2D information to get a sense of the quality of our phylogeny. You might notice that there are a bunch of these little lines right here. Okay, and for you guys, those lines right there. Usually, if your phylogeny makes sense, what you want to see is that they're generally not crossing, right? That depends on where the tree actually ends up in space, but the algorithm takes that into consideration. When there's a bunch of lines crossing each other, sometimes it means that your phylogeny doesn't have a good correspondence with the geographic information. In instances where there are nice parallel or non-crossing lines, it means that there's a good correspondence between your phylogeny and the geographic distribution of something. GenGIS also has statistical methods implemented in it, so that you can actually do this calculation as opposed to just judging it by eye. Of course, in really complex outbreaks there are a lot of crossing lines and it's a bit messier. So it's a visual that works really well for simpler outbreaks, but it's harder for larger outbreaks.
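Counting those crossing lines is easy to sketch for one fixed leaf ordering. The assumption here is the simplest possible layout: leaves sit left to right along the bottom of the 2D tree, sampling sites are ordered along a single geographic axis, and each leaf is joined to its site by a straight line. The leaf names and positions are invented, and this only counts crossings for a given ordering; it does not search over tree rotations for the best ordering or run any significance test, which is what the real tool adds.

```python
from itertools import combinations


def count_crossings(leaf_order, geo_position):
    """Count pairs of leaf-to-site lines that cross.

    leaf_order: leaf names, left to right in the 2D tree layout.
    geo_position: leaf name -> position of its sampling site along one axis.
    """
    crossings = 0
    for a, b in combinations(leaf_order, 2):
        # a is drawn left of b in the tree; the lines cross when
        # a's site lies to the right of b's site.
        if geo_position[a] > geo_position[b]:
            crossings += 1
    return crossings


leaf_order = ["a", "b", "c", "d"]

# Tree order matches geographic order: parallel lines, no crossings.
aligned = {"a": 0.0, "b": 1.0, "c": 2.0, "d": 3.0}
# One neighbouring pair swapped: a single crossing.
one_swap = {"a": 1.0, "b": 0.0, "c": 2.0, "d": 3.0}
# Geography runs opposite to the tree: every pair crosses.
reversed_sites = {"a": 3.0, "b": 2.0, "c": 1.0, "d": 0.0}

print(count_crossings(leaf_order, aligned))
print(count_crossings(leaf_order, one_swap))
print(count_crossings(leaf_order, reversed_sites))
```

A low count relative to the number of leaf pairs is the numerical version of "nice parallel lines", while a count near the maximum suggests the phylogeny and the geography are telling different stories along that axis.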
We're going to be returning to the Ebola data set in the tutorial, since you've already seen it in two other systems, and you'll get a sense of how larger data can make the visual quite complex.

There are a lot of different ways that you can manipulate the GenGIS trees to encode different kinds of information. You could have just a simple tree, or you could change the colors, as you've already seen, to encode geographic information. You can change the style of the tree, so whether it's an angular tree, whether it's a cladogram or a phylogram, or whether it's this more hierarchical tree. And you can play around with various parameters and overlay different kinds of information to get an evolving sense of the factors that might actually contribute to an outbreak that you are investigating.

So the key features of GenGIS are that you can do all kinds of tree manipulations: you can reorder the leaves, you can do edge colorings, you can collapse internal nodes, and you can even split the tree into different subtrees. You can create cartograms, which we'll take a look at in a moment, and there are plugins for R and Python to do your own statistical analyses if you wish.

Here's another example, from MRSA, and this is an example of how nodes can be collapsed. You can see this one green node at the bottom. It's pretty homogeneous. So what's happened here is that they've just collapsed the node into one big triangle, and that is pointing out to the geographic information, as opposed to showing everything individually. So it's a way to clean up the visualization.

This is the Haiti cholera data that we saw from the paper. There's a lot more data here, and it can start to become really tricky to see where all the individual points are on a global world map, right? You might still want the global context.
For example, if you wanna see how far apart Haiti and Nepal are, a lot of people are not great at geography, so there might be a reason to do that. And then you've got the various different kinds of information that you've overlaid on it. So what you want is to not lose the focus of where this is going on in a global context, but you wanna zoom in a little bit more to see what is going on. And so what you can do is create something called a cartogram, and what a cartogram does is distort a map based upon regions of really high density. It might be really easy if I show you, there's a great cartogram of the world population. Let me just quickly look that up on Wikipedia. There we go. So this is a cartogram of the world according to how many hectares of organic farming there are. So it's keeping the geographic positions of the world, but it's distorting the map, right? So Australia looks massive, right? So this is what you're trying to do. You're looking at densities and changing the shapes according to those densities. So what they've implemented in GenGIS is an algorithm that does much the same thing, where the map has actually been distorted and now you're zooming in a lot more on the regions that have a lot more information. Now the rest of the world is still present, so they've done this in a limited sense. So Canada is super tiny or non-existent, since there is no Haitian cholera there. But it allows you to actually take a better look at these different points. And he's also taken away different lines here as well so that, again, you get a better sense of the relationship between the Haitian clades and the Nepalese clades. And in particular, especially down here, you'll start to see there's some stuff that comes from Nepal. There's some stuff that comes from Haiti.
They happen to be very close together, probably genetically, which suggests that, again, Nepal was the source of the introduction for the Haitian outbreak, right? And this is just another example, a more extreme example, where the regions are blown up even more, so you're controlling the amount of distortion. I remember talking to Rob about this last year, and there's a different technique called a fisheye lens zoom, which I bet you could also use for this. However, here they've chosen to implement a cartogram solution. We can also take a look at a data set that is considerably messier. So this is data that actually comes from BC. It is courtesy of Patrick Tang and it is 400 norovirus samples from Vancouver. And so what you can see is there is a giant blob. So you've got your data, you have visualized your data, and you cannot penetrate the blob. The blob is hard. You can do different kinds of manipulations in GenGIS to actually start to tease apart this blob and figure out what might be going on there. So is there any one perfect way that you could simplify this information and get the whole complete story? And the answer turns out to be no, it's kind of complicated. Sometimes when you have these really big blobs on static visuals, what you then have to do is look at the data in several different ways, manipulate it in various different ways, and kind of keep track of the story as you go along. And so in the next couple of slides, we're gonna take a look at some of the manipulations that you can do with GenGIS to try to clean up this data set and show different aspects of it. No one visual is perfect, but each is much better than the initial picture. So one thing that you can do is just get rid of the tree, although it is still pretty messy. Even once you've removed a lot of excess features, you might overlay some kind of information using color, but it's still too much for this to be effective.
And another thing that you can do is, again, instead of looking at the 3D version of it, take a look at the two-dimensional version of the tree. So we start to get some clarity. You actually can see the tree as opposed to a purple blob, but there's still quite a bit going on. The lines are all crisscrossing. It can be really, really hard to figure out exactly where everything is coming from. Maybe you don't actually need a map, which is also a decision. Maybe you just want the tree and you just want to color the lines of the tree, because you've got too much data for a map to actually be useful. And then what you might want to look at is, are there clusterings at specific locations? And maybe you don't want to show every individual. Maybe you just want to show the clusters because that's what's relevant. So again, that's a kind of decision that you make as you go along in the analysis. You could also just try to animate it and really break down the tree over time. So this is the tree construction changing as more pieces of information get added. So this is starting to clean it up. It's far simpler, but there's a lot going on and it's really, really fast. So you also have to pause at each frame to try to figure out what's going on. So it's cleaner, but the animation has its disadvantages as well. You could also take the approach of collapsing by genotype or cluster, collapsing on internal nodes. Again, you might not need to see every data point and maybe this will be helpful, but if there's not a lot of homogeneity in geographic location in those nodes, it can become complex. One thing that is evident when you're taking a look at it this way, coloring by the different genotypes in the norovirus outbreak, is that the top layers are all yellow, which means that it's coming from this one genotype two variant 12.
And you can see that there must have been some introduction or some source of this genotype that was the ancestor to all of the different genotypes that you then observed. So coloring and collapsing has allowed us to take a better look at that. And every single clade that is the same genotype is collapsed here. So you can get a sense at least of what the dominant strains are and where they're kind of coming from. And the last thing that you can do is just look at the parts of interest. So you're not looking at the full tree, you're just looking at a specific area of the tree that you're actually interested in, and then you've got a much cleaner picture as well. But again, as you focus in, you lose the larger context. So what we've mostly talked about are ways that you can visualize this data in a lot of different tools. So we showed Nextstrain, Microreact, Phylocanvas, as well as GenGIS, and sort of the pros and cons and the different approaches that they all take. But what we've seen so far are primarily ways that you get trees and then you overlay the data on top of the trees and try to make your inferences. And the trees themselves are all computed using just the genomic data and whatever molecular assumptions you have. Bayesian approaches are a bit more of a behemoth, and they allow you to actually incorporate all of these different sources of information, for example your temporal or your geographic information, into the construction of your phylogenetic tree. So you're not just relying on overlaying data on top of it, you're also seeing what the most likely tree is given all of these contextual factors as well. Fitting a Bayesian phylogeny is complicated. It's very complicated. Some of you may be familiar with the application called BEAST. It's probably between that and, I'm not sure if anybody still uses it, MrBayes. But BEAST is definitely the most common way that these Bayesian phylogenies are fitted.
And they've got a tool as well called BEAUti that can help you set up the various different components and the parameters for the BEAST tree. But it's still very, very complicated. One of the things that's also a disadvantage of the Bayesian trees is that as you introduce additional parameters, and in this case we're loosely using "parameters" to refer to different sources of data, it takes a very, very long time to compute this information. In fact, in computer science as a whole, just more compute power, bigger, stronger systems, is what has made Bayesian analysis more reasonable to do in the modern era. BEAST trees are no exception. And the more bits of information you add for it to include in its computation, the longer it takes. For very, very, very large outbreaks, and we're talking about thousands of samples, you may actually have to subsample in a very particular way, because otherwise BEAST will not finish within the lifetime of the universe. That's not a joke. That's actually sometimes how long it would take to run some of these algorithms. So then you start to get into these complex issues of needing to subsample, and how do you calculate everything around that? And so if you're thinking of running a BEAST analysis and you haven't done it before and you're thinking, I could just use BEAUti and it'll be easy, I recommend you talk to a friend who has run it before, because you will need their help. So what we're doing with the Bayesian phylogenetic approach, having all these different parameters and actually explicitly modeling them, is that we've created this very interesting probabilistic model from which we can gather additional information about our phylogeny. So the paper that Rob refers to here is a really famous one where some of the people that were important in the creation of BEAST have actually given you an example of how you would use this to infer migration rates in birds, right?
And some of this thinking was really instrumental in what you saw in the Ebola outbreak, when you could actually see it going over time to all of these different places, right? You're getting that information from the phylogeographic component as well. So what they've done here is they've fitted a tree using the hemagglutinin component of influenza virus, and they have looked at various different kinds of birds and where they originated from. So what the Bayesian phylogeny does is, through its methods, generate a whole variety of different trees, kind of do a computation about the likelihood of them given some prior information, and summarize all of this as what is called a posterior distribution, and that posterior distribution is shown as the little bar chart, okay? And what the posterior distribution is telling you there is the probability that this red guy down here, okay, was the root of the tree, okay? So it's done all these different simulations of all the different kinds of trees you could get. It's taken a look over the resulting tree space and it says, based upon all these different simulations I have, I'm pretty sure that this is the root of the tree, Guangdong, and that this is where things started, okay? In this paper, in the other half of this figure, they also have the neuraminidase version of the tree, and the neuraminidase tree actually has slightly different timing. It looks like the origin is further back in time, which confuses the situation, but they sort of reconcile that in the paper. So you have these different kinds of information that you can use to feel confident about how likely your tree is, or how reliable or reasonable your tree is, which you kind of get with maximum likelihood trees when you're looking at bootstrap values, but this is a more holistic calculation and it incorporates a lot of uncertainty more explicitly, okay?
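The little bar chart can be thought of as a tally over the sampled trees. Here is a minimal Python sketch of that idea, assuming you already have one root-location label per sampled tree. The counts below are made up for illustration and are not the paper's results, and `root_location_posterior` is a hypothetical helper, not part of BEAST.

```python
from collections import Counter


def root_location_posterior(sampled_roots):
    """Turn one root label per sampled tree into posterior proportions."""
    counts = Counter(sampled_roots)
    total = sum(counts.values())
    # most_common() orders locations from most to least supported.
    return {loc: n / total for loc, n in counts.most_common()}


if __name__ == "__main__":
    # Toy stand-in for root locations read off 100 posterior trees.
    samples = ["Guangdong"] * 70 + ["Hong Kong"] * 20 + ["Fujian"] * 10
    for loc, prob in root_location_posterior(samples).items():
        bar = "#" * round(prob * 40)  # crude text version of the bar chart
        print(f"{loc:10s} {prob:.2f} {bar}")
```

The tallest bar is the location that was the root in the largest share of sampled trees, which is what "pretty sure this is the root" means here.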
They can then also use a bunch of the information from the tree to actually take a look at these different sites and look at the transmission between each of them. So what they do in the paper is, given all these posterior distributions they have, I think they treat it like a model selection problem, and they effectively identify the most likely routes of transmission based upon the most likely model, and they can take that information and overlay it onto when they think transmission events occurred, and where and how, right? And here they're not showing every single possible city. They can actually take a look at how likely some transmission event is and only show the really, really likely ones. So if you're like, I don't get that, you're not alone. Like I said, Bayesian phylogenies are really, really complicated. They've got a lot of different moving parts. It requires a very deep understanding of statistics and probability and all that stuff to know what they're getting at and what they're using. So again, talk to a friend, right? But this is the kind of thing that you can do with a Bayesian phylogeny that is hard or impossible to do with a maximum likelihood tree, right? And you're, again, incorporating that data into the computation of the tree as opposed to just visualizing it, right? So you're deriving additional sources of data to use as well. So there are a number of limitations to the Bayesian methods beyond being very, very confusing. You need to decide upon something that is called the prior, right? Has anybody ever done a Bayesian analysis, like at all, period? Yeah, cool. So a Bayesian analysis has two components, which are your prior and your likelihood. A good way of understanding that is a joke from a comic I read called XKCD. And so let's imagine a very contrived scenario. Let's say we have a six-sided dice, right? And we say that if we roll the dice and it lands on a six, the universe explodes, right?
So if we just use that information alone, the chance of the universe exploding is startlingly high, right? Because on a fair dice, it's one in six, right? However, none of you would actually believe that scenario, because you actually have some prior information, which is that the universe does not routinely explode and the universe does not actually rely on a dice. So the dice supplies the likelihood. The prior is your prior beliefs and your prior knowledge that universes do not regularly explode. And when you combine the prior with the likelihood, you can rest assured that if anybody in the world rolls a six on a dice, the universe will remain intact, okay? And so Bayesian methods allow more explicit incorporation of priors. How you choose a prior is more of a subject of art than a subject of science, in a way. And there are a lot of papers and a lot of really interesting academic discussions about how you fit an appropriate prior. Some people throw up their hands and they're like, I give up, I don't know how to do it. They just use an uninformative prior, and then some people will be like, well, what's the point of doing a Bayesian analysis? So it's a larger problem that's not unique to phylogeography but applies to this group of methods in general, which is also why, if you're fitting a BEAST tree, talk to a friend, okay? They can also be very, very time consuming. When you parameterize a model by adding all this additional information, a lot more complex calculations have to go on in the background. What those calculations are are harder versions of what you've seen in calculus and linear algebra, converted into programming and run over many, many computers. So you also sometimes need a lot of compute power and time to run this. And when you need to speed up the calculation, you might have to do some sub-sampling or some simplification of your data in some particular way in order to get the algorithm to run in a reasonable amount of time.
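Going back to the dice example for a moment, the way a prior and a likelihood combine can be written out with Bayes' rule. This is a minimal Python sketch under made-up assumptions: we treat "a six was rolled" as evidence for "the universe is about to explode", we assume a six always shows up if it really is exploding, and the prior of one in a trillion is an arbitrary stand-in for "universes do not routinely explode".

```python
def posterior(prior, p_evidence_if_true, p_evidence_if_false):
    """Bayes' rule for a yes/no hypothesis given one piece of evidence."""
    numerator = prior * p_evidence_if_true
    evidence = numerator + (1 - prior) * p_evidence_if_false
    return numerator / evidence


if __name__ == "__main__":
    prior_explosion = 1e-12       # made-up prior belief
    p_six_if_exploding = 1.0      # assumption: a six always accompanies it
    p_six_if_fine = 1 / 6         # a fair dice shows six anyway 1 time in 6

    post = posterior(prior_explosion, p_six_if_exploding, p_six_if_fine)
    print(f"P(explosion | rolled a six) = {post:.2e}")

    # With a 50/50 prior, the same roll would look far more alarming.
    print(f"With a 50/50 prior: {posterior(0.5, 1.0, 1 / 6):.2f}")
```

The roll only nudges the tiny prior up by a small factor, so the posterior stays tiny. The second line shows why the choice of prior matters so much: the same evidence with a different prior gives a very different answer.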
And then of course, the model becomes really, really complex, and it starts to become a little bit challenging to understand and interpret everything that's going on. So it's not always easy to diagnose errors or mistakes that can propagate throughout the model. Okay? Now, the very methods that I talked about were actually used in the Ebola outbreak. This class of methods is being used more routinely. And the stuff that was going on in the Ebola outbreak was actually implemented into sort of the algorithms that underlie the Nextstrain application. So the Ebola outbreak, and the paper that's referred to here, consisted of about 1,600 genomes of the Ebola virus. It's a lot. If you want to know how messy this data is, well, we didn't talk about data cleaning or data wrangling. We assume that you have the perfect trees and we assume that you have the perfect metadata. But of course, the paper has a really extensive supplemental analysis where the authors actually talk about how they whittled down even what the relevant metadata to look at is. And they also simplified a lot of the calculations to make things go faster. So for example, since they didn't have the exact individual locations of everybody, they would say, well, the distance between these two people, which we want to include in our calculations because we want this distance component in the phylogeography, is just how close two states are. So it's very, very coarse. And they also had a number of factors where they looked at just doing a linear regression analysis to see how important each factor might be in transmission, and through this separate analysis weeded out the metadata variables that were not actually relevant, right? So you don't always want to throw everything at it, because it can be misleading and you can see these spurious correlations.
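As a toy illustration of weeding out variables before the big analysis, here is a minimal Python sketch. It is not the paper's method. It simply computes a correlation between each candidate factor and a transmission count for pairs of regions, and keeps the factors above an arbitrary cut-off. All the numbers, the factor names, and the 0.5 threshold are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def screen_predictors(predictors, response, threshold=0.5):
    """Keep only predictors whose |correlation| with the response is large."""
    kept = {}
    for name, values in predictors.items():
        r = pearson(values, response)
        if abs(r) >= threshold:
            kept[name] = r
    return kept


if __name__ == "__main__":
    # One entry per pair of regions (toy data).
    transmissions = [9, 7, 4, 2, 1, 0]
    candidates = {
        "distance": [1, 2, 4, 6, 8, 10],
        "shares_border": [1, 1, 1, 0, 0, 0],
        "rainfall": [5, 1, 4, 2, 5, 1],
    }
    for name, r in screen_predictors(candidates, transmissions).items():
        print(f"{name}: r = {r:+.2f}")
```

In this toy data, distance and border sharing survive the screen and rainfall does not, which mirrors the idea of dropping variables that are unlikely to matter before running the expensive model.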
So they've actually thought through quite a bit, ahead of even just running the phylogeny, what the most important and impactful variables are, and removed the ones that were unlikely to be very important. I think here what they found was that the most relevant factors for transmission rates were the distances between regions, not surprising, the population size, and whether an international border was shared or not. So if you didn't have a shared border with another country, then you're very unlikely to transmit outside of your own borders. So why calculate that if someone's too far away from you or doesn't share a border? Whereas border states are more likely to transmit. So that's all relevant information that could simplify the calculation. And they had a bunch of other variables as well. And you can actually see this as a nice side-by-side movie that Gytis made of the outbreak evolving over time. So they're showing the incidence rates, with regions lighting up and becoming a darker color. They're showing the spread from these different regions. And you can start to see it get more intense at a certain point. And they're also showing the phylogenetic tree as the outbreak is evolving over time. So this is sort of its maximum, its peak, and then it is now decreasing. That should be the end of it. So they're fitting a Bayesian phylogeny here. What they've done that's really interesting is they've overlaid the epidemic curves, which are these guys here at the bottom, and the phylogenetic tree. So you can get both of those contexts together. They've got the geographic context as well as the spread. Now, Nextstrain has provided the tools for you to do this kind of analysis. If you run their pipelines, they've got a lot of great instructions on their website, but if you're not familiar with how to do this, making this kind of visualization can still be really, really challenging.
So we still need to get better tools to help people do this kind of analysis more universally, right? Not everybody has a Gytis in their lab, though you probably wish you did. There are a lot of really interesting visual cues that the author also used to alert you to different kinds of metadata. So for example, in each of the regions, you could see them getting darker and lighter as the outbreak was changing and evolving in that place over time, so that's showing the incidence. The colors are, of course, coded by country, so you can get a sense of the spread as the animation plays. The curvature of the little lines over time also lets you know when things were going to different regions. So you could see something that's sort of launching from one of the countries and landing in another one. But of course, as with all animations, it can be hard to keep track of all the different moving parts. Okay, so that concludes the phylogeographic component. You'll now have a break, and then we're gonna get into talking about GenGIS and Microreact. If you've got more specific questions, I'm happy to try to answer them, but also feel free to contact Rob Beiko. As I said, he is the expert on this topic. Thanks.