Good afternoon, everyone. So far we've seen some excellent science: we've seen phylogenetics, we've seen antimicrobial resistance, all sorts of fantastic things. And now for something completely different. We're going to look at some pretty pictures, well, hopefully pretty pictures. Think about trees, think about geography, think about time, and see how we can put these things together. We don't have any sort of automatic presenter-advance thing, eh? Okay, all right.

So what are we trying to accomplish here? It's a fairly profound question, but the idea behind the module is that we want to be able to generate appropriate data sets, right? So what's an appropriate data set? What's an inappropriate data set? We want to be able to take our data sets and manipulate them in various different types of applications. The ones that you'll be seeing in the tutorial are going to be Phylocanvas, Microreact and GenGIS; I'll give you a bit more of an overview later on. And then once you've shot things into those programs, analyzing and interpreting the results of a phylogeographic analysis, right? I got my map, I got my tree, I got my pretty colors; what does it mean? So these are important questions.

And we are going to start in 1797, as one does. So how many of you are familiar with the whole John Snow cholera water pump story? Zero. Okay, some in the back. Excellent. I think it was Gary. Gary or Ed alluded to it yesterday. So the John Snow story is basically: oh, look at all the people with cholera. Oh, look, there's a cluster of people with cholera here. Oh, look, there's a water pump that's giving people cholera. That's pretty cool. But this comes over 50 years before that. This was a fellow named Valentine Seaman, who was a medical doctor, I think that's what they're called, in New York City, and who published an article in 1797 in the Medical Repository looking at something similar. The focus of his investigation was yellow fever.
So this is a dock in New York City, or a series of docks in New York City. That's Front Street, that's Allegable Street, that's Fly Market, that's Pine Street. And the idea is that in 1797 or slightly before, there were outbreaks of yellow fever. And as one might imagine, a dock in 1797 in New York was not the cleanest environment known to humanity. It was pretty foul. And you had ships coming from all over the place, with all sorts of stuff, right? Dumping bilge, ballast water, you name it. And people started getting sick from yellow fever.

And so it's cool in some sense, because what Dr. Seaman did was take a map of this immediate neighborhood, and one other spot in New York as well, and ask: is there any information that we can glean from the geographic locations of these cases? It can be a bit difficult to make out, and actually this is an enhanced version of the original, which has aged reasonably well, all things considered. And this is the map. If you can see the little dots there, those are individual cases of yellow fever that were identified. And as you can immediately see, the little bitty dots are kind of concentrated in one particular spot. You may not be able to see them, but there are other dots that are numbered, and those correspond to cases where individuals contracted and then died from yellow fever.

So here's an overlay with the actual names of the individuals who passed away. And then these sort of inferred regions of high concentration, which he referred to as miasmas, potential sources, and just considering different hypotheses. And I don't have the image, but one interesting thing that people have done with this is, if you look at the dates, it actually kind of stands out: you can draw these sort of temporal contours to see the cases expanding over time, which is pretty interesting. This website, which I discovered while I was preparing this material, is amazing.
It's like a historical cartography of epidemiology, and the examples that he has are absolutely stunning.

So we haven't talked about consumption in the last couple of hours, so let's get back to it. This is another map. This is the state of Massachusetts. And this is a study by Henry Ingersoll Bowditch in 1862, considering potential causes of consumption, tuberculosis, in New England, specifically the state of Massachusetts. What he did was take every county in Massachusetts and encode it according to the incidence of tuberculosis. The legend is completely illegible, sort of done in this 1860s cursive that no one could ever read. But the idea is that some of this is incidence in the towns, in the towns and the countryside, particularly in the towns, and so on and so forth. And that's why we have the blue, the red borders with blue centers, and all that. Those all refer to different degrees and types of incidence. And then he constructed these rather remarkable tables where you have all of the different incidences of consumption in different counties, and then he considered different environmental factors, in particular soil humidity, soil moisture, trying to build a link. In the end, the strongest link was between soil humidity and increased incidence of consumption. Okay, so this is a nice visualization. This is actually the original, which is in the collections of the University of Colorado. It's kind of humbling in a way to look at what people were capable of 150, 200, 300 years ago.

All right, so what's phylogeography? Well, the term was coined by John Avise in 1787. He had a long and prosperous life. 1987, which some of us might remember: a field of study concerned with the principles and processes governing the geographic distributions of genealogical lineages, especially those within and among closely related species. Not that bacterial species actually exist, but we can pretend they do.
So the idea is that for any particular collection of microorganisms, we can look at their phylogenetic relationships and we can look at their spatial distributions, and try to infer something from the intersection of those. There's a nice example here. This is from the nextstrain.org website, which has several different examples of temporal phylogeographic tracking of different recent outbreaks. So this is the Ebola example. And immediately, hopefully even without me explaining it, you can grab a couple of key pieces of information about the distribution and possibly the spread of the Ebola virus throughout the three countries there.

So what do we see? What are your first impressions? What's going on here? Sorry? Three points of entry. Three points of entry? Okay. To the three different countries? Yeah. Okay. So what are you basing that on? The clusters and the map.

So we have clusters in the tree, and you can see that they're distinguishable colors, and often we see these clusters of the same color, which correspond to clades: phylogenetically cohesive groups of isolates that are from the same country. So here's a big pile of green, which corresponds to Guinea. And then there's a lot of blue right here. And you can see that there are sort of implied transmission events here. So for example, this blue area: probably most of the stuff going on here was in Sierra Leone. But then as we track this through, we can see several cases where we get green dots emerging, and those might very well correspond to transmission across international boundaries.

And looking a bit more closely at this, you can't really see it from where you are, but each of these locations corresponds to an appropriate-level division. Not county necessarily; different countries have different divisions. And for the localities, the circles, the size of the circle indicates the number of cases.
And then there are actually arrows connecting circles, which indicate implied transmission events or even rates of transmission between different localities. So there's a lot of information there. And it's amazing what you can accomplish with a tree and a very simple coloring scheme.

So why phylogeography? Well, we want to know about the relationship between space, time and environmental factors. Some questions that we might ask are: what are the most likely points of origin, of introduction into a region? We already saw that with the Ebola example. When and where did an outbreak most likely originate? Can we infer that from some of these data sets? Are there specific settings that are at greater risk? I would have changed this last one if I had the opportunity to, because "settings" sounds like command-line arguments. But settings here actually refers to environmental parameters: locations with heavy rainfall, for example. So if you can do the phylogeographic visualization and the modeling, you might be able to get at some of these questions.

So what do we need? Well, obviously we want geography. We want some sort of coordinate information at some level of resolution. And those of you who've worked with these types of data will know that often the data are not presented at the most fine-grained level of resolution. A lot of the data sets that we've obtained from our collaborators are fuzzed, either by mapping them up to some sort of higher level, like postal code rather than house, or even just by having some random noise added to them. You'll see that in an example where a lot of cases end up in the Juan de Fuca Strait.

Temporal data, right? If you want to have some sort of temporal context for the outbreak, then you want that information. Contextual information. So this is settings, right? Location attributes, information about the patients. So you might want to know which isolates came from an assisted care facility.
Which ones came from a school, right? Which ones came from the taco food truck outside? Which I don't actually want to know. And then for phylogeography, obviously you need a phylogenetic tree: some sort of description of the inferred relationships amongst your isolates. So this is a fairly redundant representation, because the workflow is: take the stuff I just said, do analysis, and get results. So there we go. We don't need to dwell on that slide, I don't think.

So let's take a closer look at what I said. Okay, so this is GenGIS, right? This is the Haiti cholera data, which I believe you were working with last night. This is kind of a superset of what you were working with, I believe, right? Is that, yeah, okay. So this is obviously more than the data sets that we're looking at, because this would be kind of intractable in the context of a two-day workshop. But immediately you can see the same sort of patterns, right? We have a phylogenetic tree relating cholera isolates. And because one of the key questions here was about the origin of the Haitian outbreak, there are a lot of sampled Haitian isolates, right? That's all the red ones up there, which almost but not quite constitute a clade, because they are interrupted by, or intermingled with, the Nepal-4 set right at the top. You can see green indicates other Nepalese clades in the tree, and then we have purple for Bangladesh, and then various other country isolates, just not highlighted with any particular color.

And so what can we see from this? Well, it's very instructive that we see this clade that has only Nepalese and Haitian isolates. Immediately that suggests a pretty strong connection between the two. That's not to say it's certain; it could very well be that there are unobserved isolates from a third country, and so we do need to be careful about that. But there is a strong suggestion here that there is a very close connection, especially... so, Gary's not here. I can't tease him about phylogenetics.
Do you remember yesterday Gary talking about the difference between a cladogram and a phylogram? So what's this? And what's useful about a phylogram in this context? Yes, those branches up there are really, really short. They're, like, basically zero. So it's pretty much the same thing sampled again and again and again. And so not only is there this close phylogenetic relationship, but we can also look at the branch lengths and say, you know, there's almost no sequence change here to speak of, and that gives us further support for the idea of this strong connection between Nepal and Haiti.

So this is interesting. The tree that we're looking at here was actually built using maximum parsimony, which is perhaps a bit surprising. But then they used statistical methods, and we're going to get into those in more depth later on, to try and infer times of most recent common ancestors. So you have statistical models. And part of it is, and we often deal with this, if you're trying to build large phylogenies based on things sampled roughly at the same time, then you don't really have a time context. It's like, all of these things were isolated this year, so we don't really have an estimate of how much things change through time. In this case, we do. We actually have isolates collected over... Phil, were you on this paper? Yes. So was it 2006 for the first sequence, or something like that? Yeah, about 2006. So we have several years here, and that can serve as a useful calibration to say, ah, we have some idea of the rate at which things change. Obviously, that's not always going to be the case, but we can calibrate it and say, based on certain assumptions about the relationship between time and sequence change, we can infer times of most recent common ancestry.
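The calibration idea just described, using sampling dates spread over several years to estimate a rate of change and then extrapolating back to a common ancestor, can be sketched as a simple root-to-tip regression under a strict clock. This is only a minimal illustration and not the Bayesian method used in the paper; the function name `fit_clock` and every number below are made up for the example.

```python
# Toy root-to-tip regression under a strict molecular clock.
# All dates and distances are invented for illustration; they are NOT the cholera data.

def fit_clock(dates, distances):
    """Least-squares fit of distance = rate * (date - t_root).

    Returns (rate, t_root): the slope in substitutions per site per year,
    and the date at which the fitted line reaches zero divergence,
    which serves as a crude estimate of the root (common ancestor) date.
    """
    n = len(dates)
    mean_t = sum(dates) / n
    mean_d = sum(distances) / n
    sxx = sum((t - mean_t) ** 2 for t in dates)
    sxy = sum((t - mean_t) * (d - mean_d) for t, d in zip(dates, distances))
    rate = sxy / sxx
    intercept = mean_d - rate * mean_t
    t_root = -intercept / rate
    return rate, t_root


# Hypothetical sampling dates (decimal years) and root-to-tip distances (subs/site).
sampling_years = [2006.0, 2007.5, 2009.0, 2010.8]
root_to_tip = [0.0010, 0.0016, 0.0022, 0.0029]

rate, t_root = fit_clock(sampling_years, root_to_tip)
print(f"rate ~ {rate:.2e} subs/site/year, inferred root date ~ {t_root:.1f}")
```

If all isolates were sampled in the same year, `sxx` would be zero and no rate could be estimated, which is the "no time context" problem mentioned above.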
And so in this case, clade A, which is our very, very non-divergent Haiti-plus-Nepal clade, the common ancestor of that clade was inferred to have existed just over a year before 2010. Now, as is very common with Bayesian analysis in particular... we're all comfortable with the idea of confidence intervals as a measure of dispersion around an estimate. It's like standard deviation. So we get these mean estimates here, and you would hope, to increase your confidence in the precision of the result, and the accuracy of the result too, that you would have very small confidence intervals. It's like, you know, it's about one year, but my confidence interval says 0.9 years to 1.1 years. Well, with Bayesian analysis, often it's no such luck, and here we have between 0.46 years and 1.93 years. So we need to take these things in context.

But you can walk back through the tree and use the same sort of assumptions, molecular clock assumptions, right, that the rate of change is consistent through time; or use what's called a relaxed molecular clock, where the rate of change is sort of consistent through time, which is about as technical as I'm going to get about that. But we can continue to make these assumptions and try to map back earlier and earlier common ancestors for different clades.

What happens if there is no constant clock? Then your assumptions are not met, and your estimates are going to suck. So the... Sorry. So there's... What was that about editing the presentation, Sam?

So at one extreme end, we have a perfectly predictable molecular clock, right? Oh, it's Saturday. Time to mutate. Saturday night, right? You've got nothing better to do. I'm going to exchange this A for a G. So in that case, you get very precise estimates, right?
Often what people use is called the relaxed molecular clock, where instead of enforcing the same strict rate of substitution on every branch, you say, all right, branches can change, but there's going to be some sampling where, if the ancestral branch has a rate of X, then I'm going to let the descendant branch change, but its rate is going to be drawn from a distribution centered somewhere around X. So you can have this sort of approximate retention of the molecular clock idea without strictly enforcing it. And then you can increase the amount of dispersion around that, and at some point you throw out the clock assumption entirely, and then you can't really make any temporal inferences at all. The best you can say is, here are the sequence substitutions, but you can't really say much about time.

All right, so I've already talked about this. So this is not a Bayesian tree. It's a Bayesian inference of times; the tree was inferred differently. And the other thing I should have pointed out: Gary mentioned bootstrap support values yesterday, right? So the idea behind the bootstrap is you've got your real alignment, and you build a tree, and then you take your real alignment, you shake it 100 times, sampling columns with replacement, and you build 100 trees. And bootstrap support for a bifurcation event is high if many of those 100 trees contain that bifurcation as well. So if all of your 100 trees contain that same bifurcation, the same split, then its bootstrap support is going to be 100. And this is all to say, in a rather inefficient way, that where you see many branches emerging from the same vertical line, the phylogenetic package said something more. It said, here's the complete branching history, but a lot of the bootstrap values stunk, so we're actually going to collapse those relationships and say, we don't know. In a lot of cases we don't care, right?
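As a rough sketch of the bootstrap idea just described, the toy below resamples alignment columns with replacement and asks, for each replicate, which of the three possible four-taxon splits is best supported. The tiny alignment, the function names, and the parsimony-style scoring rule are all assumptions made for illustration; a real analysis would rebuild a full tree for every replicate with a proper phylogenetics package.

```python
import random

# Toy four-taxon alignment, invented for illustration (taxa A, B, C, D).
ALIGNMENT = {
    "A": "ACGTACGTAAGG",
    "B": "ACGTACGTAAGC",
    "C": "ACTTACCTGAGG",
    "D": "ACTTACCTGAGC",
}

# The three possible unrooted splits of four taxa, as (w, x | y, z).
SPLITS = {
    "AB|CD": ("A", "B", "C", "D"),
    "AC|BD": ("A", "C", "B", "D"),
    "AD|BC": ("A", "D", "B", "C"),
}


def best_split(columns):
    """Return the split backed by the most informative columns, or None on a tie."""
    scores = {}
    for name, (w, x, y, z) in SPLITS.items():
        scores[name] = sum(
            1 for col in columns
            if col[w] == col[x] and col[y] == col[z] and col[w] != col[y]
        )
    top = max(scores.values())
    winners = [name for name, score in scores.items() if score == top]
    return winners[0] if len(winners) == 1 else None


def bootstrap_support(alignment, split, replicates=100, seed=1):
    """Percentage of bootstrap replicates in which `split` is the unique best split."""
    rng = random.Random(seed)
    length = len(next(iter(alignment.values())))
    columns = [{taxon: seq[i] for taxon, seq in alignment.items()}
               for i in range(length)]
    hits = 0
    for _ in range(replicates):
        # "Shake" the alignment: draw the same number of columns, with replacement.
        resampled = [rng.choice(columns) for _ in range(length)]
        if best_split(resampled) == split:
            hits += 1
    return 100 * hits / replicates


for name in SPLITS:
    print(name, bootstrap_support(ALIGNMENT, name))
```

A split whose support falls below some chosen threshold is the kind of relationship that gets collapsed into a multi-way branch rather than reported as resolved.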
It doesn't matter whether ZX8586 is slightly more closely related to this one than to that one. It doesn't really affect our inference at all, and doing this is good because we're not stating more about the data than we should, or at least not as much more. The other nice thing about this is that by using a molecular clock or a relaxed molecular clock (the paper where it was first introduced is called "Relaxed Phylogenetics and Dating with Confidence", which is a pretty good title), you can infer the position of the root and the time of the root, right? The common ancestor of everything in the tree.

Okay, so, tutorial, right? Three components. I'm going to introduce you to Phylocanvas, and here's a question: how many of you have done development in JavaScript? For those of you who haven't done it, I can't recommend it, because, full disclosure, and I'm going to say this again in the tutorial, Phylocanvas is a JavaScript library. It's very cool, but in trying to wrap appropriate JavaScript to create an interface in HTML, my brain almost exploded. So Microreact takes Phylocanvas and does awesome phylogeographic things. We're going to take a couple of examples in that, and then we're going to try out the GenGIS software that's been developed in my lab, which is not... So the first two are web-based. Mine is not, but it does offer additional options like data manipulation and visualization. Honest data manipulation.

So like I said, Phylocanvas is a JavaScript library. You can draw trees right in your browser, you can drag them, you can zoom, you can make things blue, whatever you like. You invoke it from HTML, embed it on a web page, and one of the great things is that it's got an API, so you can develop your own software that plugs into the basics and extends the functionality, which is a very, very useful thing. That's how Microreact sits on top of it. And so just one example.
I will not make you code unless you really want to, but I'm going to show you some code later on and just basically try and communicate the intuition. So here's an example. Excuse me. To motivate or terrify you. The basic idea here is that we've loaded a simple tree, A, B, C, D, right? Four-taxon tree, classic. And we want to make this one look strange. And so all we do is say, okay, this is one of our leaves, this one, and we're going to set the display: color is going to be red, shape is circle, right? So there's the red branch, blah, blah, blah, just setting the attributes. If you look at the actual names there, label color black, and guess what, the letter A is black. So the control statements can be a bit weird, but the actual setting of different parameters is pretty straightforward.

All right. And then we move on to Microreact. Pretty good today. It's going to look better in the tutorial, I swear. The basic idea behind Microreact is that you have three panes, three different views of your data: space, tree, time. And you can interact with each of these three panels in various ways. What you do in one panel can have an impact on what happens in the other panels. For example, if you have your geographic view of things, there's a little button here you can click to draw a polygon around the samples you really care about. Space, polygon, polygon, polygon. That will be reflected in the tree, which gets updated, and in the timeline. So when you want to run Zika Virus: The Motion Picture, it will be focused on the points that you selected.

So Microreact is beautiful in its simplicity. What do you need to do in Microreact? You need a comma-separated file. What do you need in the comma-separated file? You need unique identifiers, which in general you need, and some georeferences, right? Latitude, longitude, that's it. Now there's not much you can do with that, but it's a start. You can add arbitrary columns after that. You can say facility, right?
Assisted care facility, school, so on and so forth. What's the color associated with that? So you can populate your data in whatever way you like, which is a very useful thing. The tree is optional, so if you just want to visualize stuff on the map, go for it. It also has a certain date convention, so if you want to actually have your time references, you just put those in the appropriate column formats in that comma-separated file. Okay, so that's it. Comma-separated file, optional tree. Your maps just come from the OpenStreetMap layer, so it's like Google Maps but without the same oppressive degree of copyright. And then Phylocanvas allows you to go and mess around with the tree a bit. You get these time animations.

One of the greatest things about this is that for the projects you create in Microreact, you get a link, you share the link, and other people can immediately go and see that project as well. I'll mention right now, and I'll mention again in the tutorial: in Microreact, you can actually create projects without logging in, so you don't need an account to create a project. The problem with that is that once you create a project without being logged in, you can't do anything with it afterwards. You're like, I hate this project. I want to delete it. No, can't do it. So I strongly recommend you create an account, and then you create your... a project within Microreact is actually just called a Microreact, so create a Microreact within your login session, and then you can manipulate it and delete it afterwards. It's pretty straightforward.

And then finally, GenGIS. This is some software that's been under development for several years in my lab, and the objectives are in some ways very similar to the objectives of Microreact, but GenGIS is a standalone application rather than a web app, and that creates some unique advantages and some unique disadvantages. So we're going to look a bit into the contrasts between the two.
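The comma-separated input described above (a unique identifier, latitude, longitude, plus arbitrary extra columns such as facility) can be sketched with Python's standard `csv` module. The column header spellings, the record values, and the helper name `to_csv` are assumptions for illustration; the exact headers a given tool expects should be checked against its own documentation.

```python
import csv
import io

# Hypothetical records; identifiers, coordinates and facility labels are invented.
records = [
    {"id": "iso_001", "latitude": 44.65, "longitude": -63.57, "facility": "school"},
    {"id": "iso_002", "latitude": 44.67, "longitude": -63.61, "facility": "assisted care"},
    {"id": "iso_003", "latitude": 44.70, "longitude": -63.55, "facility": "school"},
]

# Assumed header names; extra columns can follow the georeference columns.
FIELDS = ["id", "latitude", "longitude", "facility"]


def to_csv(rows):
    """Validate the rows and return them as comma-separated text."""
    ids = [row["id"] for row in rows]
    if len(ids) != len(set(ids)):
        raise ValueError("identifiers must be unique")
    for row in rows:
        if not -90 <= row["latitude"] <= 90:
            raise ValueError(f"bad latitude for {row['id']}")
        if not -180 <= row["longitude"] <= 180:
            raise ValueError(f"bad longitude for {row['id']}")
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(to_csv(records))
```

Checking identifier uniqueness and coordinate ranges before upload is a design choice of this sketch, not something the talk prescribes; it simply catches the most common causes of points landing in the wrong place.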
The data that you feed to both of them is very, very, very simple. Phil had created a file as input. Hi, Phil. We're all digested into mapping. Although to be fair, I think Phil was checking email. What's this? It's upside-down GenGIS. I like it. Southern Hemisphere GenGIS. So what's that? I'm sorry, Gary's not here. I have to tease somebody. So now I've forgotten what I was saying. I was talking about the input files, and the main difference between the Microreact input file and the GenGIS input file is that your unique identifier for Microreact is called ID, and the unique identifier for GenGIS is called Site ID. So we have two redundant columns in your project assignment file, and that's why.

Okay, so just to give a bit more information about GenGIS and how it's put together. The data, I've already pretty much covered this. Map data: raster and vector formats are supported. A raster is simply a bitmap image or some sort of pixel-based description of the data. A vector is simply a line-based description of contours. Either is fine. You can have one raster and overlay several vectors if you want. Then samples, right, so different locations in a comma-separated file. And one thing that's a little different between GenGIS and Microreact is that for each location, which we confusingly call a sample, you can actually have several things at that location. You'll see this with the cholera example. We have our locations; I should have changed that there. So that's the location. Samples give us locations. With the Haiti data set, we have several départements in Haiti, right, Artibonite and so on and so forth. And the problem there is that we have multiple isolates from Artibonite. So this file, the first CSV, will have georeferences for each of those sites. What we call a sequence file will have multiple isolates for each of those sites. And that lets us do more with the data. Newick-formatted tree.
You're familiar with the Newick format? It's a bunch of parentheses, lots of colons, semicolon at the end. Easy. Core application is in C plus, carriage return, plus. This was done on Windows. And then we have the open geographical environment. And then we have various ways to interact. We have a scripting interface, so if you want to throw some Python code at it, you can; we use the RPy2 libraries if you want to embed R code inside Python. And then we have the graphical user interface, which is how people normally interact with GenGIS. This is wxPython. And then having the Python interface actually makes things really easy. So like Phylocanvas, there are certain defined ways to interact with GenGIS and say: Dear GenGIS, please give me the following information. Dear GenGIS, take the following information and draw some bars on the map, or something like that. Developing stuff like that is relatively straightforward. So you can do data retrieval, data analysis, and then for output, you can have images. You can save an image if you want to do that. You can also save and restore sessions. So you make your beautiful, beautiful canvas, you're like, I have to go do this other thing now, and then you can come back and restore your session.

All right, so, first view of GenGIS. Not a genomic epidemiology example. We have the Hawaiian Islands. This is a digital elevation model, which is why we have these pointy mountain things. The color gradient didn't work out so well in the transition. Anyway, no worries. Different islands. And then we have different locations from each island, which is reflected in the dots on the map, colored by island, right? So green is the Big Island. I've never been to Hawaii, so I can't really name the other ones. And then you see the three-dimensional tree overlaid onto the map. So you can say, all right, here's some green stuff.
And one nice thing about GenGIS is that you can color the tree such that an internal edge has a color if all of the leaves under it have that color. So here's an internal edge. It's green, and that's because all of the things under it are green. Here's another internal edge, and we've set that to white because there are different colors underneath. So it's relatively easy to visualize places where you have nice cohesion in your data. In this case, it's geography. But again, it could be facility. It could be some patient attribute. So it makes it a little bit easier to interpret the trees.

Now, 3D trees can be pretty, but they can also be a bit annoying, because you're trying to map information in three dimensions, projected onto a two-dimensional field, into your three-dimensional brain, which doesn't always work out so well. So my former PhD student, Donovan Parks, came up with an algorithm to do this. Come on. There we go. To do two-dimensional trees instead. Now, normally you might think that a two-dimensional tree is easier than a three-dimensional tree, but that's not the case. For a 3D tree, all you have to do is draw branches that go to the various points and then merge them above. For a 2D tree, you want to take this a step further. You may recall Gary mentioning yesterday that in a phylogenetic tree, if this... Yeah, why not? If this is a phylogenetic tree, you can actually take any internal node and rotate it. I'm not going to demonstrate that. Rotate it. Someone call it androids. And not change the meaning of the tree. So it's kind of like, we have the tree pointing this way, A with B, C with D. Well, if we spin things around so that it's C with D and A with B, we haven't changed the meaning of anything. So what Donovan's algorithm does is it takes your tree and rotates the internal nodes until it reaches a point where the alignment of the leaves with the geographic points is optimal. One way to visualize this is to say, here's the tree, right?
So the tree leaves are mapped to this dashed line, and the geographic points are mapped, in order, to this line here. So this parallel line is the tree leaves, and this one is the geographic axis. Basically what GenGIS is trying to do is, if you connect each leaf with its corresponding site, it's trying to minimize the crossings, because a crossing indicates a place where the geography and the phylogeny cannot be reconciled.

So what does it mean for this example? It means that in general, and this is one of the reasons why Banza katydids are like the classic first-year biology textbook case, there's actually a really, really nice gradient along the axis of the Hawaiian Islands, where in general katydid samples from adjacent islands tend to be more closely related to each other. So that's what we're seeing here. Now over in this case, though, we cannot actually reconcile the geography with the phylogeny for a couple of the islands, which is why we get these crossings. So it may not be... You can use this to examine and actually test, because there's a statistical test for this, the relationship between the linear geography and the phylogeny, and that's perfectly fine, but in general, particularly in a genomic epidemiology context, your outbreak is probably not going to follow a straight line. If anybody has any counterexamples, I would love to hear about them. But at the very least, even if it's not useful from a scientific point of view, from a visualization point of view it does make the job somewhat easier.

All right, and we'll play with a few of the customizations later on, but the point of GenGIS is that it's infinitely or near-infinitely customizable. So you have your 3D trees. You can change your tree colorings. You can change your line thicknesses. You can do weird stuff. There's the cladogram, right? Everything maps to the same point on the axis, so branch lengths are not meaningful. Phylogram: branch lengths are meaningful.
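The crossing-minimization idea described above can be sketched by brute force: enumerate every leaf order reachable by rotating internal nodes, count how many leaf-to-site lines cross, and keep the order with the fewest crossings. This is an illustrative stand-in and not the algorithm implemented in GenGIS; exhaustive search only works for small trees, and the tree topologies, island ranks, and function names below are hypothetical.

```python
from itertools import product


def orderings(tree):
    """Yield every leaf order reachable by rotating internal nodes.

    A tree is either a leaf name (str) or a pair (left, right) of subtrees.
    """
    if isinstance(tree, str):
        yield [tree]
        return
    left, right = tree
    for l_order, r_order in product(list(orderings(left)), list(orderings(right))):
        yield l_order + r_order   # children as given
        yield r_order + l_order   # children rotated


def crossings(leaf_order, site_rank):
    """Count pairs of leaf-to-site lines that cross (inversions in rank order)."""
    ranks = [site_rank[leaf] for leaf in leaf_order]
    return sum(
        1
        for i in range(len(ranks))
        for j in range(i + 1, len(ranks))
        if ranks[i] > ranks[j]
    )


def best_layout(tree, site_rank):
    """Return the rotated leaf order with the fewest crossings."""
    return min(orderings(tree), key=lambda order: crossings(order, site_rank))


# Position of each site along a geographic axis (northwest to southeast).
site_rank = {"Kauai": 0, "Oahu": 1, "Maui": 2, "Hawaii": 3}

# Hypothetical topology that follows the island chain: zero crossings are possible.
concordant = (("Maui", ("Oahu", "Kauai")), "Hawaii")
# Hypothetical topology that conflicts with geography: some crossing is unavoidable.
discordant = (("Kauai", "Maui"), ("Oahu", "Hawaii"))

for tree in (concordant, discordant):
    order = best_layout(tree, site_rank)
    print(order, crossings(order, site_rank))
```

The number of crossings left after the best rotation is the quantity of interest: zero means the phylogeny and the linear geography can be fully reconciled, and anything above zero marks a conflict of the kind shown on the slide.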
So there are different parameters you can play with, and we're going to explore this in the tutorial. So, key features. I already talked about the optimal leaf ordering of your trees. Edge coloring, collapsing internal nodes, and I'll show you some neat examples of that. You can also split subtrees, right? So if you've got a big tree that is annoying the frig out of you, and there's only a small part of it you actually care about, then you can just right-click on a node and say, I want this, and the rest kind of goes and hides off in the corner until you're ready for it again. Apart from that, I'll be showing a cartogram example. If you're not familiar with the term, you're certainly familiar with the concept. You'll see that later. And then some of these plugins that we've implemented.

So here's an example. This is MRSA. The colors are showing up fantastically well. But hopefully you can see the country outlines in the back. You might recognize your favorite countries. And we have the different locations mapped, right? Each color corresponds to a different country, and then these are mapped back to a two-dimensional tree. And you can see in some cases, we have many, many samples from... I think it's Sumatran. All falling into a clade. And so all I've done here is just collapse that clade. So we don't have all the silly little branches. We just have one big triangle that maps nicely up to Sumatran. So the idea here is that you have very fine control over the visualization, to really emphasize the patterns you want to emphasize.

So, cartograms. Again, back to the Haiti cholera example. We have a tree. We have a map. We have colors. What more could you ask for? Interpretability. Yes, from this, I can tell that most of our samples are from Haiti and Nepal. Submit to Nature. No, we need to take this a step further. So we can see the phylogram, that's very nice. And here's that very flat line of stuff. And here, just to clarify, if necessary, the coloring is basically...
Nepal is divided into its divisions. There are five, I believe, represented here. And then Haiti into its various departments, Sud-Est and so on. And so that's what those colors refer to. It's still kind of a big mess. And then everything else is kind of in the background; we've kind of switched everything else off. So you get the context, but you really want the focus there. So what do we do about this? What can we do? What's that? You could zoom in, absolutely. So that's one possibility. If you want to, you can just zoom right in on Haiti and look at the phylogenetic relationships there. So that gives you focus. Now, the downside of that is that you lose the global context. But that's certainly something you can do. What my recently graduated master's student, Alex Kedi, did was implement cartograms. How many of you know what cartograms are? Okay, well, this is what cartograms are, right? So you can take regions with a high density of points, and you can actually distort the map to emphasize them. So the idea here is that we have very dense sampling in Haiti and Nepal, and by expanding those regions, we can get a more detailed view of them without losing the global context. So now the five locations in Nepal are readily distinguishable, the locations in Haiti are distinguishable, and we can select a clade here that actually has three Nepalese samples as well as Haitian samples, and so now we can start to see which parts of Haiti are most closely connected with which parts of Nepal. And it's configurable, and you can take this as far as you want to. There's a couple of extreme examples we can do with this. You can keep going until basically the map is nothing but, excuse me, Haiti and Nepal, and everything else is squished off to the corner.
Now one of the limitations here, of course, is that unlike with, you know, a Google Maps sort of zooming-based procedure, you're limited to the resolution of your base map, which is why Haiti has turned into a green fog. You can still see the coloring of the locations, but the background country, you know, you don't have the resolution to show that particularly well. So this is a relatively easy example of phylogeographic visualization. Sorry, a fisheye zoom? But you're still limited by the basic resolution of the map, right, if you're expanding an area, and you're not able to interpolate really either way, I think, right? Because, you know, it's a different procedure, but you end up doing the same thing, right? An expansion of one part of the map. And so, you know, if there's not a lot of fine-grained detail there, then you're kind of ending up in the same place, I think. Could you just use a higher-res map? Sure, yep. Is this available? So the situation for digital maps has actually gotten worse over the last few years. There's one resource that we use a lot, which is called Natural Earth, and that's referenced in the tutorial document, which has some beautiful, beautiful maps that have been very carefully designed and curated: raster maps, vector maps, and so on. But the other data source we used to use, which still, you know, exists, is the NASA 30-meter Shuttle Radar Topography Mission data, right, so remotely sensed, fairly high-resolution elevation data. And there used to be this beautiful web mapping service where you could go to the Oak Ridge National Lab site, you hit the map, and you're like, I'm going to draw a rectangle, and then you click the button that says, download my rectangle at that resolution. It doesn't exist anymore. So there are various places. Like, the situation in Canada is pretty good.
We have a lot of publicly available cartographic data, digital elevation models, at a very fine resolution. The problem there is certainly a manageable one: you get the map data in defined blocks, sort of geographic ranges, and so you need to use a library to stitch them together and sub-sample them as appropriate. So the short answer is yes, but it can sometimes be difficult to find the right, you know, the perfect map, which is why we often default to Natural Earth, because it's convenient. It's a pretty decent resolution. It's manageable. Are there any other questions? I saw a couple of hands up, although I think maybe the other hands were just pointing at Anna. Okay. So that one's kind of messy, but it's nothing compared to this. So this is a dataset that was kindly provided to me by Patrick Tang, and it is, if I recall correctly, about 400 norovirus samples from Vancouver and surrounding areas. And I'll tell you right now, there's no geographic structure here whatsoever, right? And there's just this huge wall, like a massive wall of points around Vancouver. So this is pretty hideous. I'm going to show you some of the things we can try to do, again in Genghis, to try and tease out some patterns, or at least focus in on a tractable part of the dataset. Because hopefully by looking at this awful example, you get some ideas about how to improve the visualizations of non-terrible datasets. Just as a complete spoiler, none of these actually works in a very satisfactory way. One thing about visualization: sometimes the best you can say is, it's complicated, right? And we've all seen these, right? There was that paper a few years ago about the hairball graph. We took, it doesn't matter what kind of data, right? We took data that has entities and relationships, right? So a graph, nodes and edges. And we built a giant hairball. Here it is.
You're like, I cannot get anything from this apart from there's lots of nodes and they're somewhat connected. So if that's the message you're trying to send, well, you know, mission accomplished. Who's ready for coffee now? So, one thing you can do. Just pretend the tree isn't there for a second. It is a tree. But you can remove some of the noisy background stuff, right? Maybe you don't actually need those pretty shadings and so on. Just really focus on the data. Okay, that's a 3D tree. We can go and do a 2D tree. And now you really appreciate just how many leaves there are in this pretty tree, and how there's really no clear relationship between any sort of linear gradient and the geographic ordering of the points. These, by the way, are now colored by site, or sorry, by location type. So again, it's a care facility, school, restaurant, and so on and so forth. Okay, we haven't really improved things very much. Maybe we just want to discard the map. And so this is getting into that sort of microreact frame of mind, where we have the map over here and we have the tree over here, and we're not explicitly trying to link the two, because it's hopeless. So, what can we read from this tree? It's complicated, right? But, I mean, you know, part of it is, are there clear clusterings of specific locations? No. And there are actually statistical measures you can use to try and characterize this. There's Fitch's algorithm, which basically says, I have a bunch of different types of characters in the tree. Care facility, school, whatever. If I walk through the tree, how often do I need to change? If that makes sense. So the idea is, if we have very cohesive groupings, then you're going to have, like, a bunch of yellow over here, a bunch of blue over here. Imagine yourself walking along the branches of the tree. Very rarely do you encounter a transition between blue and yellow.
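As a rough illustration of that walk-along-the-tree idea, here is a minimal sketch of the Fitch parsimony count on a rooted binary tree. The tree encoding (nested 2-tuples with string leaves), the function name `fitch_changes`, and the toy location types are assumptions made up for this example, not anything taken from the dataset shown.

```python
def fitch_changes(tree, state_of):
    """Fitch parsimony on a rooted binary tree given as nested 2-tuples.

    Returns (possible_states, changes): the candidate state set at this
    node and the minimum number of state changes needed in the subtree.
    """
    if isinstance(tree, str):               # a leaf: its state is known
        return {state_of[tree]}, 0
    left, right = tree
    left_states, left_changes = fitch_changes(left, state_of)
    right_states, right_changes = fitch_changes(right, state_of)
    shared = left_states & right_states
    if shared:                              # children agree: no new change
        return shared, left_changes + right_changes
    # children disagree: one change is required somewhere below this node
    return left_states | right_states, left_changes + right_changes + 1

# Hypothetical location types for four samples.
location_type = {"a": "school", "b": "school", "c": "care", "d": "care"}

cohesive_tree = (("a", "b"), ("c", "d"))    # like clusters with like
scattered_tree = (("a", "c"), ("b", "d"))   # types are interleaved

_, cohesive_score = fitch_changes(cohesive_tree, location_type)
_, scattered_score = fitch_changes(scattered_tree, location_type)
```

The cohesive tree needs only 1 change and the scattered tree needs 2. In practice the observed score would be compared against scores obtained after shuffling the labels among the leaves, to judge whether the clustering is stronger than chance.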
If you're walking along this tree, you're going to encounter frequent transitions, because there's no cohesion here. Does that make sense? It's like the weirdest explanation. Hopefully it's okay. And so, again, you know, is there really strong clustering? Well, certainly not readily apparent at this scale. What else can we do? Well, we can bring the time axis to bear, and we can hope that the, ooh, that looks promising. And so if we have temporal information, we can run a movie of... Let's try that again. Okay, 3, 2, 1, go. So you can see points appearing through time, and, you know, the intensity reflects the recency. So you can see certain clades expanding over time. There's not so much. Okay, so at least... But at least you can focus on specific aspects of the data at specific points in time. And this comes up again in the micro-react framework, where you can play the movie forward within certain time limits. Well, what else can we do? Okay, well, let's change the coloring. Okay, because people have classified norovirus into various genotypes, and this is kind of interesting. So we can look at the tree, and the root is down here, and you can see that most of the things near the root are this genotype 2, variant 12. So when you see that there's lots of stuff at the base of the tree, that suggests that that's the ancestral genotype. The root of the tree is probably group 2, variant 12. And then you can see different types emerging from within that. So here's a specific root, and here's a specific clade as well. So we've collapsed any clade where everything in that clade is the same genotype. Now, hopefully, if we collapse things that are of the same genotype, then we get a lot of stuff collapsed, because if the genotypes are not cohesive, what does that say about your tree? It says that one of those is wrong. So what else have we got? So that's a big tree. We can also extract subsets of interest.
Again, so we've stopped trying to link the tree leaves directly to the map. But you can still see these colored genotypes, and you can say, well, here's this orange genotype, and where they are, there and there. So this is a case where we've sub-selected part of the tree and then are just focusing on those relationships. Okay. I was worried whether this talk was going to go an hour. Now I'm worried if this talk is going to go an hour. Any questions? Deep breath. Sure. This might be because I'm not familiar with the field, but when you're calculating the time to mutation, basically, do you ever use that in a forward, predictive manner, or is it always looking back to determine the origin? Oh, so you're saying you're trying to use molecular clock information to predict the future in some sense. Yeah. Like, would you be basing vaccine development on it, if it's mutating so fast? And what's the point in trying to build a vaccine that's contained for whatever? Like, do you make any decisions on that kind of information? Um, I think the answer is kind of yes. I mean, you think about the way influenza evolves, where each season it's like, I diversify, diversify. There's a zillion versions of me, and then like one or two will survive to the next season. So in those cases, I mean, that population structure certainly influences the strategy for development of vaccines, right? And so is that the same as the rate of evolution? Well, it's kind of influenced by it, right? So it's not really directly using it, but certainly the properties. That's a big part of it. Anybody else have any other good examples? So the question was, if we have molecular clock information, is that ever used in a forward sense, you know, to try and justify certain decisions, whether it's vaccine development or other types of strategies? I'm up here and not thinking very well, so if anybody else has any ideas, it'd be great. Hearing none, your reward is to get hit with Bayesian phylogeny.
So, um, I see that I have theoretically five minutes left. What should I do? Okay, so, Bayesian approaches. What's great about them? Gary touched on them briefly yesterday, and the key to Bayesian approaches, sorry, take a step back. Maximum likelihood, maximum likelihood trees. I have a sequence alignment, and I have a model. What's the model? The model is the tree, the shape of the tree, the branch lengths. Find me the best tree. Long story short, that's maximum likelihood. Bayesian approaches do not say that. Bayesian approaches say, sample from a set of pretty good trees and build me sort of a general representation of that. I don't want a single answer. I want good answers with plausible confidence intervals around them. What's great about Bayesian approaches, in particular the ones that have been developed, is that you can parameterize them as much as you want. You can have parameters for everything, right? So the tree branch lengths are obvious, but on top of that, you know, substitution rates kind of fall into that. Um, topology, well, that's the tree shape, position of the root, population sizes, geographic diffusion. These can all be parameters in your model. One of the problems with Bayesian analysis is that it takes for freaking ever if you have a lot of parameters, but that's your problem. So you do have to choose somewhat carefully which things you want to parameterize and which things you want to say, I'm okay with that. Remember what Gary said about rooting yesterday, when you can and when you cannot. When can you actually infer the position of the root? What information do you need? Okay, time is the big one. So you need an outgroup, right? A confident outgroup, or you need temporal information, or you need to make some assumption that, like, the midpoint of the tree is the root, which is not necessarily the root.
And so with the Bayesian stuff, again, you can infer this molecular clock and then try and infer stuff about the position of the root. BEAST is the most widely used application by far for this sort of Bayesian population inference. MrBayes is another widely used package, but that's really, I think, still focused on phylogeny. BEAST offers a much richer parameterization. And so this is great, because then specific questions that you might have, like the reproduction rate, right? Or temporal information. The frequency with which an outbreak jumps from location to location. All of this stuff can be represented as parameters in your Bayesian analysis. So here's an example. This is from 2014, and this is looking at avian influenza H5N1. So I probably shouldn't dwell too much on this. But the basic idea here is that we have a tree, which is inferred from, in this case, the hemagglutinin sequence from all these different isolates. Here we have the time, and it's sort of the same package of stuff, trying to infer times of most recent common ancestry. And in this particular case, also trying to get posterior probabilities of different candidate geographic locations of the root. And so here's the legend, right? Here's the color guide. And basically you can see that most of these sites have basically no associated probability. The probability of the common ancestor being in Novosibirsk? Nothing, essentially. What are the two locations with the highest posterior mass? Well, Hong Kong and Hong Kong. So you can immediately get a sense of that from doing this type of analysis. I'm not going to go into details about the parameters of the sampling strategy and so on, but that's the idea. And so people are very interested in saying, well, where do these things come from? So this is a goose sample from Hong Kong that is seen as kind of the ancestor of a lot of this. So was it originally from Hong Kong? Well, we can put probabilities on that.
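One way to read "putting probabilities on that": if the analysis produces a sample of plausible trees, each with an inferred root location, the posterior probability of a location can be approximated by the fraction of samples in which the root sits there. The sketch below assumes exactly that and nothing more. The function name `posterior_location_probs` and the site labels are hypothetical placeholders, and the numbers are invented for illustration, not results from the paper.

```python
from collections import Counter

def posterior_location_probs(sampled_root_locations):
    """Approximate posterior probabilities as sample frequencies.

    `sampled_root_locations` holds one inferred root location per
    sampled tree from a Bayesian analysis.
    """
    counts = Counter(sampled_root_locations)
    total = len(sampled_root_locations)
    return {location: n / total for location, n in counts.items()}

# Invented placeholder sample: 10 trees, root location labelled per tree.
root_samples = ["site_A"] * 6 + ["site_B"] * 3 + ["site_C"] * 1
probs = posterior_location_probs(root_samples)
best = max(probs, key=probs.get)
```

With this toy sample, `site_A` gets 0.6 of the posterior mass, `site_B` 0.3, and `site_C` 0.1, which is the kind of summary the colored legend on the slide conveys.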
Using the same data set, the authors try to infer rates of migration. So I believe it was 19 different geographic locations they were looking at, and essentially they said, well, we want to have a model that has the probability or the frequency of migration between each pair of sites. However, trying to fit a model with all possible pairings of 19 sites is not a good idea, because, if you're familiar with this sort of overparameterization, with way more parameters than you have data, this just leads nowhere. And so what you can do is you can actually say, well, I expect most of the rates to be pretty close to zero. So we're going to assume that a lot of them are close to zero, and then only really represent the ones that are somewhat dissimilar from zero. And so this is looking at both the hemagglutinin and neuraminidase proteins, and trying to see whether those actually imply different or similar patterns and rates. So it's interesting that they show this degree of agreement in spite of the fact that they actually have somewhat different phylogenetic histories. I only showed you HA, but in the paper they show HA and NA, and there are some substantial differences there. And if you're familiar with the process of reassortment in influenza evolution, then that's one of the big reasons why. So, again, I think Gary talked about priors. So, you know, basically what you're trying to do is infer a model, and a prior is basically saying, before we even look at the data, I have the following expectations. So I'm going to put prior probabilities on that. In many cases, we feel a bit squeamish about trying to put some sort of informative prior on things. Let's use a flat prior, right? I mean, in general, it makes sense, right? You want to be agnostic, you don't want to impose something on it.
However, sometimes what you think is a flat prior is not actually a flat prior. And this is particularly true of phylogenetic trees. If you say each tree has an equal prior probability of being the true tree, that's not saying that each relationship in the tree has an equal probability of being correct. So there's a bit of messiness there. Bayesian methods can take forever, as anybody who's tried to build a MrBayes tree with 50 leaves in it would know. And then this, you know, it's a mixed blessing: deciding on model complexity. Because you have the opportunity to basically model whatever the heck you want in your data. But the more parameters you add, the longer it takes to fit that model, and the more likely you are to get very unstable estimates of your parameters. So you do have to be fairly careful with that. And one of the ways the authors here deal with that is by saying, yes, we're going to have a lot of parameters, excuse me, but we're going to set a prior that most of those parameter values are going to be zero. So it's kind of deflating a lot of them and saying, let's have lots of opportunities, but let's throw out most of the implausible ones. Okay, almost there. So this was a paper that was published actually about three or four weeks ago, maybe five or six weeks ago now. Again, looking at Ebola. And so this was a collection of 1,610 publicly available genomes from the really intense part of the outbreak, between 2014 and 2015. And actually, you know, we haven't really talked a lot about data munging. A lot of the presentations you've heard in the last couple of days are like, how to deal with your data. I haven't really talked about that. However, if you want a sense of the sort of nastiness that can come up in biological data analysis, look at the supplementary methods of this paper, because there's some sort of standard data cleaning stuff they did. And then there's some crazy stuff.
It's, like, Ebola-specific and really kind of mind-blowing. And it's a nice illustration of, you know, you want to filter your FASTQ files, but you also want to know something about the specific system you're looking at. Are there specific nucleotide biases? I didn't know this, but apparently Ebola is kind of wacky this way. Are there repeat regions? Are there other things that you really need to think about before just, you know, pushing the big red button, turning the crank? Relaxed molecular clock. I already talked about that. Markov chains. So this is actually one of the ways in which we can get posterior probabilities, basically by saying, well, how good is it if this parameter is this value, and this parameter is this value, right? So it's like one realization, right? If this is the tree and these are the transmission rates, and if this and if this and this, how good is it? Get that information, put it over here. Walk to another possibility. Okay, well, what if the tree is like this, right? And this parameter is like that. A Markov chain is where you're walking from model to model and getting a sense of which ones are good and which ones are not so good. So this is one of the most common ways of getting these Bayesian posterior probabilities. So what did they find? Well, through various methods, they actually modeled the, I can't remember the number of locations. It was about 46 or something like that. And so they wanted to know transmission rates. But as you saw before, you don't really want to fit a parameter for each pair of those 46. What they did instead was they looked at various descriptors. So for every pair of sites, how distant are they? And then instead of saying, well, what's the rate of transmission from here to here, you just say, in general, if two sites are 50 kilometers apart, what's the expected rate of transmission? Does that make sense? Still awake, which is good. Kind of sleepy after lunch. Maybe you had healthier options than I did.
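The "walk from model to model" description of Markov chains can be sketched with a bare-bones Metropolis sampler over a single parameter. This is a toy under stated assumptions, not how BEAST or the paper's analysis is implemented: the function names, the step size, and the made-up target (an unnormalised bell curve centred on a rate of 2.0) are all hypothetical. A real phylogenetic analysis proposes changes to trees and many parameters at once.

```python
import math
import random

def metropolis(log_posterior, start, n_steps, step_size=0.5, seed=0):
    """Random-walk Metropolis sampler for one continuous parameter."""
    rng = random.Random(seed)
    current = start
    current_lp = log_posterior(current)
    samples = []
    for _ in range(n_steps):
        # Walk to a nearby possibility and ask how good it is.
        proposal = current + rng.gauss(0.0, step_size)
        proposal_lp = log_posterior(proposal)
        # Always accept improvements; accept worse values sometimes,
        # with probability equal to the posterior ratio.
        if math.log(1.0 - rng.random()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples

def toy_log_posterior(rate):
    """Made-up target: unnormalised normal, mean 2.0, sd 0.5."""
    return -0.5 * ((rate - 2.0) / 0.5) ** 2

chain = metropolis(toy_log_posterior, start=0.0, n_steps=20000)
kept = chain[2000:]                      # discard early steps as burn-in
posterior_mean = sum(kept) / len(kept)
```

The chain spends most of its time near values that score well, so the retained samples approximate the posterior; here their mean should land close to 2.0.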
So what's interesting is that they considered 25 different potential factors driving the rate of transmission. And this is table S1, I think, if you want to look at all of them. Some of them are pretty interesting. I wrote them down, but my book's over there. So they looked at all these different candidates, and then they had ways to filter them and say, well, these ones don't really stand out against statistical noise, so we're going to set them aside. Out of their 25, they found five important factors. One was barely above the margin of interest; I'm not even showing you that one. One is that, not surprisingly, sites within countries tend to have higher transmission rates than sites between countries. OK, the system works. Distance between regions is important, independently of the national versus international effect. Population size at source and destination. OK, these things all make sense. And then whether they share an international border. So in general, the transmission rates between sites in different countries are relatively low, unless they share a border with each other. So this all makes sense. And it's nice to be able to consider a bunch of different model parameters and discard the ones that have no evidence. And so, just to finish off. So we live in an amazing age. And this is something that Andrew touched on a lot in his presentation, but also his tutorial, where people are now starting to realize that if bioinformatics is a scientific pursuit, then it should be repeatable. Because that's kind of one of the fundamental things about testing hypotheses and doing research. So we are seeing more movement towards things like Galaxy workflows. If you submit to Microbial Genomics, the journal, you need to show a lot of diligence in terms of people being able to repeat the analyses you did. Your data need to be made available. Your contextual data need to be made available.
The reason I mention this here is because this is an animation that is tied to the paper that I was just showing you. Running through, again, it's a temporal simulation. It's Ebola, basically, in its geographic context. And it shows, over time, incidence rates and inferred transmission events. And what's really cool about this is that all of the code that was used to do this is available through GitHub. Right, through the author's GitHub site. So, I wish I made this. I didn't make this, but I wish I did. Maybe I should just lie and claim that I did make it. You can see the tree gradually emerging there. Think of all the different visual cues that they're using. They're using size, they're using color, they're using this temporal information, curvature to distinguish from other types of information. And you can see, right, here's the end of the outbreak. The end. Go get some coffee. We'll see you in 25 minutes.