It seems like we have a good group of people here, so I will get started. Thank you so much for joining me today. Sounds like I have some pretty stiff competition in terms of another session, but I really appreciate you all being here. My name is Sherry Siew. I'm a postdoc in infectious disease epidemiology at UPenn, and I've had a longstanding interest in and love for spatial data and spatial data analysis, so I'm really excited to be here today to talk to you about this topic. Some housekeeping-type things: by default, everyone is on mute. If you'd like to ask a question, you can raise your hand and I will ask you to go off mute; you'll unmute yourself and then you can ask your question. Or, if you prefer, you can ask questions straight into the chat. You don't have to wait for any breaks or anything like that. I'm very comfortable with people stopping me midway if questions come up. And please don't be shy, because I don't really know what everyone's background is, so if anything's unclear, let's pause and really address it. I'll try to keep my eye on the chat so I don't miss any questions. OK, so here is an overview of our workshop today. We're going to start by introducing how R handles spatial data and the different spatial data types, and then we'll have a brief introduction to the analysis of spatial health data. All of that will start in these slides. And now is a good time for me to post the link to the RStudio Cloud workspace that we'll be working out of today; I'm going to post that straight into the chat. Yes, there are materials. You didn't need to read anything beforehand; it's all ready for you on the RStudio Cloud. Let me just walk you through the different files we'll be working with today. The main R script file is this R Markdown file here, spatial analysis, and hopefully it's well annotated with all the things we'll talk about and what all the different functions mean. If you want to see this R Markdown file all knitted together, you can go here, and you can go to the different sections if you want to peek ahead. The slides we'll be going through are right here, this PowerPoint file. And then there's also this data folder that contains all the different spatial data that we'll be reading in as part of the practicum we're going through today. But I'll start with just some background in these slides. Oh, right, I think I was in the middle of talking about our overview. So then we'll talk briefly about analysis of spatial health data, and we'll have two breaks today. We'll take break number one, and then when we actually get into the meat of this workshop, the actual spatial analysis part, we'll be analyzing Lyme disease data from New York State. We're going to first look at that data and make a map of that data, and then we'll do a global clustering analysis with the Moran's I statistic. That'll be in the R Markdown file. We'll take one more break, and we'll end with a local clustering analysis with the local Moran's I statistic. So why should you use R for spatial data analysis? As attendees of this workshop, I don't think I have to convince you that R has many strengths, and the same goes for spatial data analysis. I just want to start with a historical note: there are two different frameworks for handling spatial data and spatial data analysis.
Historically, it's been through the sp framework, which has formed the backbone of R GIS tools for well over a decade now. So there's so much we can do in the sp framework. Unfortunately, it's very clunky and not very intuitive. Several years ago, the sf framework was introduced, which is really great. It's much more user friendly, and it's compatible with the whole tidyverse, which is great. But in the beginning, you could really only make maps with it. Slowly, ever so gradually, packages for actual spatial statistics have been released that are compatible with sf, so I think eventually the sf framework, which is, I think, much cleaner and easier to use, is going to completely replace sp. I mention sp just because there are still things you can only do in sp, and you're going to see it all over online documentation about R spatial data analysis, and in textbooks and things like that. But we're going to work exclusively in the sf framework when we get into the practicum file. So, I had a question about the files being on GitHub. They're not on GitHub right now, but I will upload all of these files to GitHub in the next day or two, and I have the link where I will eventually upload the files as the last slide of this PowerPoint. And another question about whether you can work in the workspace asynchronously: I believe you can go through the R Markdown file on your own and run that code. You can even save it and run it at a different time. That should be fine. OK. And the advantages of performing geospatial analysis in R compared to something like ArcGIS are that R is a statistical package, so it's the gold standard for running statistical analyses, and the same goes for spatial statistics. It also has an impressive graphics engine, so in addition to spatial analysis, you can make really beautiful maps that are highly customized to whatever needs you might have. Our focus today will be not on mapping, which I saw was the topic of a different workshop, but more on the analysis portion. And of course, it's a command line interface, and you can save your script files, so it's really great for reproducible analyses that you can share with others, and they can check your analysis, things like that. So we'll get started talking about the different types of spatial data. There are two broad categories. One is vector data; that's one big all-encompassing category. And the other, which is different from vector data, is raster data. Under the umbrella of vector data, we have point data, line data, and polygon data. And luckily, these things are exactly what they sound like. Point data are just points in space, so each point is associated with an x and a y coordinate. If you're dealing with 3D coordinates, you can have x, y, and z. Lines are points where you specify the order that you want the points to be connected to each other to form lines; that's what line data is composed of. And finally, polygon data also consists of these different geo-coordinates, these different points, but when you specify the sequence to connect the points together, it's going to close back in on itself and eventually form these polygons. In contrast to vector data, we'll get to raster data in the subsequent slides. I want to first give some examples of these three different vector data types. For point data, we can look at the locations of a specific type of incident. So we could look at the locations of shootings.
So this is all the locations of shootings that occurred in Philadelphia, which is where I am, in 2017. An example of line data: all of these examples are from where I am locally, which is Philadelphia. We can look at all the bike-accessible roads in Philadelphia, and you can see all of these lines that represent the different road segments that include a bike lane here in Philly. And finally, an example of polygon data would be census tracts. You can see all of these individual census tracts; each census tract is its own polygon. And now we can talk about raster data. Unlike vector data, which represents points that can be anywhere in space, whether you're connecting the points or not, in raster data the data is stored in a regular grid of cells. A very simple schematic of raster data would be this rectangular set of cells here. This is a common way to store elevation information. Satellite imagery is another example of raster data, because you can think of each pixel as being its own cell. So here, the way that I'm showing these cells, let's say this is elevation, and the colors represent, for very dark colors, low elevations, and for light colors, high elevations. So what's being shown by this simple example raster is some kind of hill or mountain, depending on how high this point is. And some examples of raster data, again from Philadelphia: here's a satellite image of Philadelphia from Google Earth, and here's a pretty cool raster of a land use classification that was performed. I believe each cell is a 50 meter by 50 meter cell, which represents an actual square in physical space, and the value for each cell tells you what the land cover in that cell is: tree cover, grass, shrubs, bare earth, water, buildings, roads, other paved surfaces. These classifications might be useful to you as a health researcher, depending on whether you're interested in green space, or maybe you're interested in the heat trapping that might happen around buildings and roads and things like that. So finally, now that we've gone over all of these different GIS data types, we need to talk about how these types of data are handled by R. Again, just as a historical note, in the sp framework these different data types were stored in Spatial*DataFrame objects: points are SpatialPointsDataFrames, lines are SpatialLinesDataFrames, et cetera. For the sf framework, I'll start by saying that sf can't handle raster data, so when you're working with raster data, you either work in the sp framework or you just work with it as a matrix. But all of the vector data types are going to be stored as an sf object, and then you can specify the geometry type: points will be a multipoint sf object, lines will be a multilinestring object, and polygons will be a multipolygon object. And whether you're talking about an sp or an sf object, that object, and we're going to see this in the practicum, is going to have both spatial attributes and data attributes. The spatial attributes contain three important elements. One is the actual geo-coordinates; this will be all of your x-y coordinates, so for these polygons, it's going to be all the vertices of each polygon. It's also going to include boundary information: what are the minimum and maximum x boundaries and the minimum and maximum y boundaries?
And it's also going to include projection information. For the data attributes, if you have any familiarity with R, you're going to be happy to see that it's just going to be a simple data frame, so you're going to have rows and columns storing information about the different variables pertaining to each one of your rows. In this example, I'm showing you the census tracts for Philadelphia, and when you go to the spatial attributes, you'll see that there are 384 distinct polygons stored as part of the spatial attributes. Each one of these polygons is going to be associated with a row in your data frame, so in the data attributes, the values of the different variables for this first census tract will be stored in this first row. And now we can head over to the practicum, and we'll start with the first section on the spatial data overview. We're going to work exclusively in the sf framework today, so you're going to have to load the sf library. And then within our data folder, we have all the different census tracts for Philadelphia, which we just looked at on the slides. If you run the class function, you can see that this Philly tracts object is an sf object, and it's also considered a data frame. And because, as I mentioned, sf objects are very user-friendly, they can be handled like data frames. What I mean by that is, if you run the str command, which allows you to see the structure of a data object, it's going to give an output similar to what you'd expect from a data frame: you have all of these different variables, all of these different columns, but then you also have this geometry portion, where your spatial attributes are stored. And we can see the first few rows of our census tracts, and you can see that each row is associated with a specific census tract; this first row is census tract 94. You can also see the dimensions of the sf object, and it's giving you the dimensions of the data frame. And you can use familiar subsetting commands to, for example, select the first row. Again, that's going to give you back census tract 94 and all its affiliated information. We can also select columns by name. I'm including the head command when selecting this column by name just so that it doesn't blow up the knitted markdown file, and you can see the first six census tract names listed here. You can do the same using column subsetting. If you want to extract just the spatial attribute of the Philly tracts object, we can use the st_geometry command. We're going to store that output as pt_geo, and if we take a look at pt_geo, we'll see that it is, in fact, a multipolygon. If you want to get the spatial coordinates of any one of the polygons, you can index it the way you'd index a list. So if you want the perimeter coordinates for the first census tract, census tract 94, you can index it, and you can see that these are all of the x-y coordinates for that first polygon. If you want to see the data associated with this first polygon, you can go back to your Philly tracts sf object, take that first row, and get that information. And if you want to do the same for the second polygon, the second census tract, you can get the coordinates, again, by subsetting the second polygon in our pt_geo multipolygon, and we can get the data attributes from the second row of our sf object. And what's nice is you can use the base plot command on a spatial attribute; all of this is sketched in the code block below.
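Just to have it all in one place, here is a rough sketch of that exploration, assuming the tract object is called philly_tracts; the exact object and column names in the workspace may differ:

```r
library(sf)

class(philly_tracts)       # "sf" "data.frame"
str(philly_tracts)         # data-frame columns plus a geometry list-column
head(philly_tracts)        # first few census tracts
dim(philly_tracts)         # number of tracts (rows) and variables (columns)

philly_tracts[1, ]         # first row: census tract 94 and its attributes
head(philly_tracts$NAME)   # select a column by name (column name is a guess)

pt_geo <- st_geometry(philly_tracts)   # the spatial attribute (a MULTIPOLYGON)
pt_geo[[1]]                            # vertex coordinates of the first polygon
```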
So the multipolygon, pt_geo, you can plug straight into a plot command, and you don't have to specify anything in particular. It's going to give you a very simple map of the different census tracts in Philly. And just as a bonus, there are some things you can tweak in the base plot command: you can color your polygons this lemon chiffon yellow, you can change the perimeter lines, you can make the lines thicker with lwd, which stands for line width, and you can make those lines red. So you can do something silly like that. OK, so that's an example of polygon data. Let's go on to some line data. Now we're going to look at that bike network that I showed you an image of in the slides, and we're going to go ahead and read it in as a shapefile. OK, so I'll take a step backwards and talk to you about reading in shapefiles. This is a little beyond the scope of our discussion today, but just as a bonus, let's go into the data folder. This bike network shapefile I downloaded straight from OpenDataPhilly, which has a lot of publicly available datasets specific to Philly. And when you download a shapefile, it's a little bit confusing, because you end up seeing multiple files listed all together. Often you download it from whatever source, you have to unzip it, and once you unzip it, all of these other files pop out. It can be confusing at first because you're not sure which one's the file, which one do I read. Well, don't panic. What you can do if you want to read a shapefile as an sf object is just use this very simple st_read command. You go to the file path that all of these different shapefile components are kept under, and then you name specifically the .shp file. They should all have the same name, but with different extensions after the period, and you want to make sure the one you write in your code is the .shp one. That's going to read it in as an sf object. Just as a side note, the reason these files are so weird is that the shapefile format was developed originally for ArcGIS, which handles data a little bit differently than R. If we want to double check that we did, in fact, read in the shapefile as an sf object, you can use the class function again and see that, indeed, it's an sf object, and all sf objects are also data frames, so that's why you see both. If we want to, once again, get just the spatial attributes of this bike network sf object, we can run the st_geometry function. And now when we index, let's index the second line segment, we can see that it's a linestring. In this case it's just two different points, and the order, I guess, doesn't matter when it's just two points, but once you connect these two points, that's going to form your line of interest. If we want to see the data associated with that line segment, or road segment, we can go to the second row in our sf object, and we can see that it's this segment on Bartram Avenue. And finally, we can again pass the spatial attribute of our sf object directly to the plot function, and that's going to give us a simple plot of all the roads with a bike lane in Philadelphia. And for our final vector data example, we're going to look at point data. Now, this is not a shapefile; this is an sf object that I've already converted, so I stored it in our data folder as a .rds file, which is a native R file, just to save time.
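Before moving on, here is a compact sketch of the styled plot and the line-data steps just described; the file path and object names are approximations of what's in the workspace:

```r
# Styled base plot of the tract polygons
plot(pt_geo, col = "lemonchiffon", border = "red", lwd = 2)

# Reading a shapefile: point st_read() at the .shp file
bike_network <- st_read("data/bike_network.shp")
class(bike_network)      # "sf" "data.frame"

bn_geo <- st_geometry(bike_network)
bn_geo[[2]]              # coordinates of the second LINESTRING
bike_network[2, ]        # data attributes for that road segment
plot(bn_geo)             # simple map of all roads with a bike lane
```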
But I did want to give one example of reading in a shapefile, and you can see that it's very easy to do using the sf library. So OK, let's just read in this crime data. We can type crime, and it'll tell us that it's a simple feature, which is what sf stands for; it's an sf object, and the geometry type is point, so it's a multipoint object. You can confirm that it's an sf object using the class function. We've already viewed the first few rows of crime, and I'm just going to plug crime directly into the st_geometry function to get those spatial attributes, and we can plot it. You can ignore this warning. We can see that these are all of the crime incidents from, I think, September 2017. That's maybe too many points for our eyes to process, so let's take a closer look at this crime data. You can see that we have the longitude and latitude, we have the location, dispatch date and time, and then the one I want to take a closer look at is the offense type. You can see the first 10 are all recovered stolen motor vehicle, so let's see what offense types are available to us. We can run the simple table command, and now we have a one-way table of all the different offense types. When I was reviewing these, I just wanted to select some things that sounded interesting, so I picked out criminal homicide and fraud. So now we're using tools in the tidyverse. This one is specifically a dplyr function, this nice filter command, and what it's going to do is take crime. dplyr functions are meant to work on data frames, and the reason I included these dplyr functions for this crime data is that we saw this crime data has the spatial attribute, but it can also be handled by these dplyr functions, which are made for data frames, and it will behave appropriately; it'll behave like a data frame. So we can use this dplyr filter function to filter by offense type. Now we've selected only those rows, only those points, that are associated with a criminal homicide, and this line of code is going to filter just for those crime incidents that are instances of fraud. And just as a sanity check, you can see that even after you've filtered using this dplyr command, you've still kept the spatial attribute: we still have sf in the class of these objects that we filtered out from the entire crime database. Now we can plot the locations of homicides, and you can see that they're kind of distributed like that. But looking at this might not give you a lot of information, because what do these points mean? So you can plot these specific crime incidents on top of our census tracts. What I'm running here, which is sketched in the code block below, starts by plotting the Philly census tracts, then we layer the instances of fraud, which we've colored blue, on top of the census tracts. So all of these blue dots are instances of fraud in Philadelphia, and all the dots in red are criminal homicide incidents. And we can add a legend explaining that the points that are blue are fraud and the points that are red are homicide. So this is a very simple map that we just made. And finally, we'll look at a very, very simple example of a raster. This raster data is stored in the datasets library. We have a question: are there equivalent plots in ggplot? I intentionally left ggplot off today because I wanted to save us time to talk about the actual spatial analysis.
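Here is roughly what that filter-and-overlay sequence looks like, as a sketch; the offense labels and column name are approximated from the description above:

```r
library(dplyr)

table(crime$offense_type)    # one-way table of offense types (column name is a guess)

homicide <- filter(crime, offense_type == "Criminal Homicide")
fraud    <- filter(crime, offense_type == "Fraud")
class(homicide)              # still "sf": dplyr filtering keeps the spatial attribute

plot(st_geometry(philly_tracts))                        # base layer: census tracts
plot(st_geometry(fraud),    col = "blue", add = TRUE)   # fraud incidents
plot(st_geometry(homicide), col = "red",  add = TRUE)   # homicide incidents
legend("bottomright", legend = c("Fraud", "Homicide"),
       col = c("blue", "red"), pch = 1)
```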
But since there is some interest in ggplot: when I originally made this practicum for a class that I taught, I included the same maps with ggplot code, so when I do upload this to GitHub, I'll be sure to put all of that code back in, and you can see how to make even prettier maps with ggplot. All right. So here's a very simple example of a raster dataset, which is stored in the datasets library. You can see in this case that the raster is just a matrix. We can peek at this volcano, and you can see it's just a matrix where each cell has some value, and again, this is an elevation value, so it's similar to the toy example I showed in the slides. If you run the str command, you can see that it's an 87 by 61 matrix. And using the filled.contour function, where we can specify the color scheme we want, we can view this volcano. You can see it is, in fact, a volcano, because if you look at the color legend, we have our highest elevation along the rim of the volcano, and then the elevation goes down once you get to the middle. So those are all the different spatial data types that you can work with in R, and now we're going to jump back to the slides and talk about how we can analyze some of this spatial data. Broadly speaking, when we're talking about spatial analysis in general, what we're looking for is evidence of spatial patterns, and since we're talking about health data, we're really interested in spatial patterns of disease risk or incidence. The reason we care about there being a spatial pattern in disease risk or incidence is that, if one is present, it may support specific etiologies. If we have a cluster of cancer cases in a specific area, a cancer hotspot, you might be suspicious that there is an environmental hazard. And of course, if you have clustering of cases, that can also signal that the process is infectious, and that the reason cases are clustered is that people are transmitting it to each other: people who live close together are more likely to transmit the disease to each other. If we want to investigate the presence of spatial patterns in a formal way, we're going to need to rely on spatial statistics, which we'll dip our feet into today. Spatial epidemiology, which is a field that I very much want to be a part of, involves the application of spatial statistics to spatial health data. It also usually involves, as a first step, visualizing that spatial data through mapping. Usually the mapping comes first because you want to explore your data, and then you run the formal statistics. And when we talk about spatial patterns, let's begin by talking about a lack of spatial patterns. What does that mean? A lack of spatial pattern means that the occurrence of whatever outcome you're interested in is happening randomly over space. If we conceptualize space in a very simple way, just as a simple grid, the idea is that the probability of the outcome occurring in any grid cell is constant across the grid, and the incidence of the outcome in one grid cell is independent of all other grid cells. Most importantly, it's independent of neighboring grid cells, so the fact that a case occurred in this cell is not going to impact whether or not a case occurred in the surrounding cells. This is an example of a random spatial pattern, and another way of saying a random spatial pattern is that you have no spatial autocorrelation.
So this might jump out at you as being a non-random pattern: of course, you have all of your cases in one half of your grid and no cases in the other half. We call this kind of situation positive spatial autocorrelation, and you can think of positive spatial autocorrelation as if your positive cells are attracting all the surrounding cells to be positive themselves. So having a case in this cell makes it, well, in this case, perfectly probable that cases will occur in the surrounding cells. And in the case of negative spatial autocorrelation: when you first look at this grid, you see that it's very regularly arranged. You might think of this almost as a random pattern, but in fact it's not random; I just used the word regular to describe it. This is an example of negative spatial autocorrelation, and what happens here is that the presence of a positive case makes the surrounding, bordering cells much less likely to be cases themselves. So here you have an attracting force, and here you have what you can think of as a repelling force. I wanted to find examples of both positive and negative spatial autocorrelation in health data, but it turns out that in the real world, what we tend to see and what we care about is positive spatial autocorrelation. So here's an example of positive spatial autocorrelation, and I might just say spatial autocorrelation more generally and mean positive spatial autocorrelation for the rest of these slides, just because there are so few examples of negative spatial autocorrelation when we're talking about health data. Here we're looking at county-aggregated heart disease death rates between 1999 and 2003, and you can see that areas with high rates of heart disease death tend to be clustered in these counties in the South. As I said, the reason we care about spatial trends or spatial autocorrelation when it comes to health outcomes is that it suggests some sort of underlying etiology; in this case, it may be shared lifestyle factors for the counties in this region of the country. And now let's talk about a subject that you may be sick of hearing about, which is COVID, which is, of course, an infectious etiology. Like I mentioned, when we talk about infectious diseases, we often see spatial trends in infectious disease incidence. Here I'm showing seven-day average COVID positivity very early in the pandemic, early April 2020, and you can see that in this early stage we had this pocket, this hotspot, of COVID in the New York metro area and surrounding areas, in Louisiana, as well as in Georgia. This is where, I don't know if you remember this in the news, two very large funerals occurred in late February and became a super-spreader event, and that led to this part of Georgia being an early hotspot for COVID. Then a few months go by, and now you see that New York City is no longer a hotspot; now the hotspots are all of these places in the South and Southwest. And then by November, COVID has really spread across the US, but now it has hit the Midwest the hardest, which had had relatively low rates early on in the pandemic. I took these screenshots just to show you that cases of infectious disease, in particular, are often spatially correlated. And if we go back to our different GIS data types, we can contextualize them with regard to health research with these examples.
When we're talking about point data, a common example that you'll see in the health literature is when you have electronic health record data or registry data where you have the actual locations, the residential addresses, of your patients, and better yet, if you have a control population that you also have spatial information on. When you have the locations of cases and controls, you can run a spatial GAM model, which will allow you to identify disease hotspots. An example of line data that you might see in the health literature, a very common kind that health researchers care about, is linear pollution sources like freeways. In the paper that I've listed here, the authors were interested in the association between asthma exacerbations and residential proximity to major freeways, and not surprisingly, they did find that the closer you live to a highway, the more likely you are to experience adverse outcomes with your asthma. Today we'll be focusing on polygon data, and we're going to be looking at regional rate data. What I mean by regional count or rate data is that if we're talking about counties or census tracts as our polygons, then for each census tract, county, or state, we have some numerical value measured in aggregate for that area. Today we're going to be looking specifically at county-aggregated Lyme disease incidence for the state of New York, and the reason I don't include a literature example here is that we're going to be doing this ourselves in the practicum. We're going to be looking for global spatial heterogeneity in Lyme disease incidence in New York State, and we're also going to try to find local clusters where incidence rates are higher than one would expect under the null hypothesis of no spatial trends, or lower than one would expect, for cold spots of Lyme disease incidence. And finally, examples of rasters: you could have rasters of health-related exposures. NASA, with one of their satellite instruments, puts out an estimate of pollution, PM2.5 and other pollutants, that you can find as a raster for the entire globe. So you have information about some health-related exposure, like pollution, across a grid of cells, and now you can associate that exposure with, say, where people live, the locations of specific cities, and with outcomes that you have. So you can associate these exposures with different health outcomes. I list an example of this here, where the authors used a raster that they created of PM2.5 specifically associated with wildfires, and they associated that exposure with the incidence of various health outcomes for different cities that they linked to their exposure profile using that raster. And now we've reached our first break, so I'll give us about 10 minutes; we'll start again at 4:27. All right, I hope you enjoyed your break. We're going to jump right back into our slides to talk about how we assess for spatial trends in our health data. Again, as a reminder, the goal of spatial statistics is to assess for significant spatial patterns or trends, and we do that with hypothesis testing. OK, so I have a question about the bike data: there are over 5,000 linestrings in the data, but each linestring can have more than two pairs of coordinates. So yes, when there are only two points, it's a straight line.
And that's just a great question, because it's not really the mathematical definition of a line: lines can also be like rivers, where you have many different points because they're jaggedy, but they're all connected to each other. So in the world of spatial data, what we mean by a line is a series of points. It can be two or more, or even 1,000; as long as they're all connected in a single line, we call that a line object. That's a great question. There are, of course, some streets that are straight, in which case they can be represented by just two pairs of coordinates, but if you have a jaggedy street, or a fork or something like that, then you're going to need more than two points to represent it. So, back to spatial hypothesis testing. Very broadly speaking, there are two different hypothesis tests. One is a test of a global trend, where you're looking for evidence of global spatial heterogeneity. In the case of a global test, you're going to have just a single test statistic, and it's either going to be significant or it's not. So you either have global spatial heterogeneity or you don't, but you're not going to be able to make any inference about the locations of specific disease clusters; it's really just whether or not there's a global trend. If you are interested in assessing for the presence of specific clusters, if you want to point to an area on your map and say, oh, that's a hotspot, or that's a cold spot, then what you're going to need is a local test. We're going to start with the global test. When you're assessing for a global trend, you're testing for or against a null hypothesis of constant risk across your study area. What the constant risk hypothesis says, under the null condition, is that your disease risk is constant throughout your entire study area. This is very easy to think about if, well, let's say we're talking about polygon data. If we're talking about different counties and disease counts between the different counties, and you have the same population at risk, the same number of people in each county, then under the constant risk hypothesis you would expect disease counts to be the same for every one of your counties. It gets just a little bit more complicated in the more real-world situation where you have differences in your population at risk: you're not going to have the exact same number of people living in all your different counties. In that situation, an easy solution is, instead of looking at disease counts, to look at population-normalized rates: you divide your disease count, your incidence count, by the total number of people who are at risk. And we're going to look specifically at one way we can assess for global spatial heterogeneity, and that's with the Moran's I statistic. The statistic itself is a measure of global spatial autocorrelation, and there are ways to assess its significance, usually through permutation testing, that is, Monte Carlo simulations. If we find that our Moran's I statistic is much larger than any of the statistics we simulated under the null hypothesis of constant risk across our different polygons, then that's going to be evidence of spatial heterogeneity. And if our Moran's I statistic falls well within the distribution that we get from our simulations, then that p-value is not significant and we fail to reject our null hypothesis.
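For reference as we go into the next slide, the Moran's I statistic has this standard textbook form, where $x_i$ is the value (for example, the incidence rate) in polygon $i$, $\bar{x}$ is the mean over the study area, and $w_{ij}$ is the spatial weight between polygons $i$ and $j$:

$$
I \;=\; \frac{n}{\sum_i \sum_j w_{ij}} \cdot
\frac{\sum_i \sum_j w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})}{\sum_i (x_i - \bar{x})^2}
$$

The sum of squared deviations in the denominator is the variance-like normalization mentioned below, the leading fraction is the other normalization term, and the weighted cross-product in the numerator is the "meat" of the statistic.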
And this is not meant to be a math or stats workshop, but I do want us to take a quick peek at the Moran's I formula. I liked this image that I took off of Stack Exchange because it very clearly delineates the different parts of the formula, including the parts that we don't have to worry so much about. We have the inverse of the variance, which is just helping us to normalize, and we can ignore that; we have another normalization term, and we can ignore that too; the meat of this Moran's I statistic is up here. So what this formula means: i is one specific polygon, and j is a second polygon. We're comparing polygons two at a time, and for each polygon we take the difference between its value, say, its incidence rate, and the mean value across the study area, and we multiply that difference for the first polygon by the corresponding difference for the other polygon. You'll note that this term is going to be large, our test statistic is going to be large, when the incidence rates for our two polygons are high at the same time. And we want our Moran's I statistic, we want this term, to be high, the two rates to be high at the same time, when the polygons are close together, because that's essentially the definition of a cluster. So, like I'm stating down here, Moran's I is essentially a weighted covariance function, and this is our weight term; we're going to want to choose a weight that lets us assess whether polygons close together tend to take large values at the same time. The next question that I'm trying to tee up for us is: how do we pick our weights so that the weights are large when the polygons are close together? To reiterate, we are interested in seeing whether polygons close to each other tend to have similar values, so the general idea is to give weight to pairs of polygons when they're close together. So how do we define close together? There's this concept of neighbors, and there are different ways you can define neighbors when you're talking about polygons. Here's an example set of polygons; this is actually New York counties that I clipped along the western edge. The reason I chose these polygons is that you can see some of them are kind of weird and irregularly shaped, and you're going to see how that changes which polygons we select as neighbors as we go between the different neighbor definitions. I'm starting with what I think is the most intuitive definition of neighbors; it's certainly what I mean by neighbors when I'm talking colloquially. Let's call these polygons in red our index polygons; we're interested in defining neighbors for our two index polygons. We'll start with this county up here. A contiguity-based definition of neighbors is simply going to be all the counties that are contiguous with, that touch, that share a border with, this index polygon in red. And when we consider this county down here, you can see that these four counties are considered neighbors by this definition, because they too share a border with this index polygon. The next commonly used definition to specify neighbors is a distance-based definition, and I'll start by showing you what we get. In this case, we're setting the distance, which is going to be our radius. Here I've drawn a circle centered around the centroid, which is the center of mass, of each one of our index polygons, and the radius here is 50 kilometers.
So a polygon is going to be considered a neighbor of this index polygon if its centroid lies within 50 kilometers of the centroid of our index polygon. You can see that in this case, this doesn't make too much of a difference in what we're specifying as neighbors, except that this polygon here, which doesn't actually share a border, is located in such close proximity to the index polygon that its centroid lies within the 50 kilometer radius of the index polygon's centroid, so it too is included as a neighbor. And you can see something interesting happening down here, where, again, because these polygons are small and located so close to our index polygon, even though they don't share a border, their centroids lie well within the 50 kilometer radius of the centroid of this polygon. However, even though this long, irregularly shaped polygon shares quite a large border with this index county, it's so long that its centroid lies outside of the 50 kilometer radius, so under this definition it's not considered a neighbor. And then a final commonly used criterion for selecting neighbors is to select the K nearest neighbors, where you have to specify what you want K to be. You'll notice that under the previous definitions, each polygon can vary in how many neighbors it has; under this definition, all your polygons have the same number of neighbors. You specify that number, and for a hypothetical example, let's say that number is four. So each one of the polygons is going to have four neighbors, and you select the four whose centroids are located closest — I'm sorry, if you hear some whining in the background, that's my dog — the four polygons with centroids closest to the centroid of your index polygon. So up here, we're selecting these four counties, and down here, we're selecting these little counties that are located close to this southern index county. And when we talk about the contiguity-based definition, we can classify that further into a queen's case contiguity, where you can share a border or it's sufficient to even just share a vertex, a single point of contact, versus a rook's case, where you have to share at least a whole border. So the rook's case is the more stringent version of the contiguity-based definition, and the queen's case is the more lax version. Oh, okay. So we're not taking a break; we're actually going to switch over to our practicum, and we're going to do some analysis assessing for global spatial autocorrelation. But before we jump into the significance testing, we're going to start by exploring the spatial dataset that we're going to be working with. I'm just waiting for my workspace to load. For the rest of our workshop, we're going to be working with the New York State Lyme incidence data, which you can find at this link; it's publicly available at health.data.ny.gov. And just as an aside, when you go to this website, we can take a peek at it now, here is their ArcGIS-based map viewer, and you can see how they map their data. When you export this data, when you download it, you'll see that it's only available in tabular format. So this is just a CSV file.
And so there are some additional steps you would have to take to end up with the sf object that I load directly for us here. Those steps consist of loading the CSV as a data frame in R, and then finding a shapefile of New York State counties. That was just through some Googling; oftentimes it's very easy to find shapefiles of these political boundaries on the internet. This one is from gis.ny.gov, and you can download the shapefile. It's going to download, again, like I said, as a zip file; you unzip it, it has all of those weird components, but you can read it using the st_read command in the sf library. We're not going to go through that code today, because I want to be more streamlined in what we talk about, but I just want to mention that if you are interested in preparing your own spatial dataset using health data and GIS data that you can download from the internet, or maybe you have access to this kind of data through your work, you can see how I started with these raw materials and ended up with an sf object in this code here. So you can review that; hopefully you find it helpful if that's something you're interested in. We're going to start again by loading our sf package. And this is, like I said, the Lyme data that I've already prepped and converted to an sf object, so if you run the class command, you're going to see that it's an sf object. We can, oops, just in our console, take a look at this data and see what we're working with. You can see that it's a multipolygon object with all of these different variables. The first variable is just the name of the county, you have the county FIPS codes, and you have some information that came from health.data.ny.gov. And you can see that this is the main variable we'll be working with: the data the state has provided tells you that this is Lyme disease incidence per 100,000 people. Like I mentioned, our null hypothesis is going to be constant disease risk across the different counties, and our counties, of course, are going to differ in the number of people in them, so to compare these counts more fairly, we're going to work with this rate, the incidence per 100,000 people. This should be familiar by now: we can use the st_geometry command to get the spatial attribute from our Lyme data, and we can use the plot command, and that's just going to plot a nice simple map of the different counties in New York. Now, like I said before, mapping is not the focus of our workshop today, but it's still important to visualize spatial data before you run an analysis on it, just to see what you're working with. And a really awesome, quick way to make an interactive map is with this package called tmap, so we're going to load it now. I'm going to uncomment this; I commented it out to knit this R Markdown file, because otherwise, when you have an interactive map in your R Markdown file, it makes the file super, super large, and it wasn't able to fit on RStudio Cloud. That's the only reason it's commented out. If you go through the HTML, you'll see a static map, but you can uncomment this in your code as you're following along.
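The chunk we're about to walk through looks roughly like this — a minimal sketch, assuming the data object is called lyme and guessing at the name of the rate column:

```r
library(tmap)

tmap_mode("view")                       # switch from static ("plot") to interactive mode

tm_shape(lyme) +                        # the sf object supplies the spatial attribute
  tm_polygons("lyme_incidence_rate")    # choropleth of the named column (name is a guess)
```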
And this line of code is important because it's going to set your mapping mode to interactive. And now, with just two lines of code: tm_shape, where you feed in the sf object you're working with, which contains the spatial attribute this mapping function is looking for, and then tm_polygons, where you just name the column that you're interested in mapping. What you're making when you map values of polygons, just as an aside, is a choropleth map. So let's view our interactive choropleth map when we run this. And okay, now we have this nice interactive map. You can see our color legend here, which helps us see the Lyme incidence rate. And the features of an interactive map: you can zoom in, you can zoom out, you can pan around, and by default with tm_shape you also get a pop-up message, so you can select counties with your pointer and see which county you're in and the Lyme incidence rate for that county. I just think it's really remarkable that you can do this with two lines of code. If you wanted to customize this map more, I would recommend the leaflet library, which gives you a lot more control over how the legend comes up and the types of things you want to display in your pop-up message. But for that initial data exploration, I highly recommend the tmap library, because it's just so quick and easy to make these really useful interactive maps. It's really helpful, when you're trying to get a handle on your data, to be able to just hover around and see the names of the different polygons you're familiarizing yourself with. One thing that you might have already noticed is that we do have a missing value. If you look at the data frame of our Lyme sf object, and we can do that right now, actually, when it's a data frame you can click on your data and view the rows and columns, and if you look at St. Lawrence County, you'll see that we have some missing data. We have NAs, and the one we care most about is the NA for the Lyme incidence rate. The reason I point this out is that further on, when we're running our spatial statistics, there are lots of little nuances in the way different packages and functions are written, and some of them get really angry if you have missing values. So part of data cleaning sometimes is removing those NA values, and of course, it's very important to document when you do that. While we're still getting a handle on and cleaning our data, let's remove it so it doesn't throw an error message later. The way I do that is with this line of code, and I'll try to walk through what's going on in it. When I'm trying to understand a line of code that has a lot of embedded pieces, I like to go from the inside out. In the very deepest layer, nested within, we see the Lyme incidence variable; this is the Lyme incidence rate in our Lyme data frame. And then this useful function is is.na. Once you've selected this column by name, the Lyme incidence rate, it's effectively a vector, and when you pass a vector into is.na, it goes through every value of that vector and tells you, true or false, whether each element of the vector is NA.
So you can see that every value of this vector is false except for one, the one associated with St. Lawrence County. Essentially, you now have a Boolean vector of trues and falses with a single true value. And you want to invert it, which is what the exclamation point does: if we run the same code with the exclamation mark, it flips everything, and you have all trues except for one false at the position of St. Lawrence County. And finally, when you pass this Boolean vector inside the index bracket, what you're telling R is: I want to select all the rows that are true. So we're essentially selecting all the rows of our Lyme data frame that have non-missing values for the Lyme incidence rate. What this is going to do is give us the same Lyme sf and data frame object, but now, if we go back to the Lyme object, you can see that St. Lawrence is no longer there; we dropped our one row with the missing incidence rate. There are a lot of different ways to do this, but this is the one that came to mind when I was writing this up, so I figured it was worth going over with you. And now let's map our data again. This is the same three lines of code as up here, but now, because we've updated the Lyme object, it should be a little bit different; let's see what happens. It's just loading the interactive map. Here it is. You see that it looks almost exactly the same, but now when I hover over St. Lawrence County, nothing comes up, because that polygon is no longer there. So it's the same object as before, minus St. Lawrence County. You'll also note that before we had 62 rows, and now this data frame has 61 rows, which makes sense. Okay, so now that we've explored our data and cleaned it a little bit, we're ready to start our clustering analysis, though the first step of the analysis is some data checks. Something that I haven't told you yet, which I'm going to mention now, is that the Moran's I statistic is very sensitive to extreme values. So before you run a Moran's I analysis, you should look at the distribution of the variable you're interested in, and if you see extreme outliers or any strong departure from normality, transform your data. Otherwise, you can be prone to getting false positives, where you don't really have spatial autocorrelation, but one or two really, really large values artificially inflate your test statistic. So that's what we're going to do now. One way we can look at the distribution of a variable is with the summary function. This is our variable of interest, this column in the Lyme sf data frame, and if we run the summary command, we get the minimum value, the maximum value, as well as the first quartile, the median, the third quartile, and the mean. If you look at the values of your median, your mean, and your third quartile, and how much larger your maximum value is, that should already be sending up a red flag that something might be up with your data. With the hist command, a histogram pops up down here, and you can see, wow, okay, the bulk of the values are between zero and 100, the rate per 100,000 individuals, but then you have this very long tail from a handful of counties with much higher rates, so this distribution is very strongly skewed to the right. (The cleaning and distribution checks are sketched in the code block below.)
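Here is a sketch of that cleaning step and the distribution checks, with approximate object and column names:

```r
# Drop the county with a missing incidence rate (St. Lawrence)
lyme <- lyme[!is.na(lyme$lyme_incidence_rate), ]
dim(lyme)                            # 61 rows now instead of 62

# Check the distribution before running Moran's I
summary(lyme$lyme_incidence_rate)    # five-number summary plus the mean
hist(lyme$lyme_incidence_rate)       # strongly right-skewed
```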
And just for fun, we'll also do a box plot, which shows you the same thing: you have a lot of extreme outliers that are much larger than the other values. So because our data is strongly skewed, and when you have very positively skewed data there are a few different transformations you can consider, one that works well for our data is to simply take the log of it. So let's do that. We create a new column in our data frame called log Lyme incidence, and we just take the log of our variable of interest. Let's see the histogram of the log-transformed values: this looks much better, much closer to normal, and our box plot also looks much better and much more symmetrical. And just as a sanity check, we can once again map our log-transformed values. This map looks very similar to the map before, in that the areas with a very large incidence rate, the areas most suspicious of being hotspots for Lyme incidence, have not changed, which makes sense, because log transforming your data doesn't change the rank of the values; it just changes how extreme the extreme values are. So now these extreme values are closer to the other, less extreme values, which the Moran's I statistic handles better. Okay, so now we're done with our data cleaning, and the first step of the analysis itself is to define neighboring polygons. We do that with the spdep library, and the function we want is poly2nb. The "poly" stands for polygon object, which our sf object, with its spatial attributes, counts as, and "nb" stands for neighbor. Well, let's take a look. With poly2nb, we feed it our Lyme sf object, and there's an option you need to specify with this poly2nb command. By the way, this poly2nb function is for finding contiguity-based neighbors, which is that first, very intuitive case where neighbors are defined as regions that share a border, or, in the queen's case contiguity-based definition, even just a vertex, and we're going to set queen equal to true, for example. So when we run this poly2nb function, we can take a look at the class, and it's an nb object. This is a special data class from the spdep library, and it's how the spdep library stores information about neighbors. Let's just do some quick exploration of the nb object we created. Oh, this is a typo; I'm sorry, I'll fix it now in this live session. This should be the name of the nb object, lyme underscore nb. I'm going to save that now, so the next time you open this up on RStudio Cloud, you won't get this error. You can run this line of code, and you can see that it's a list of 61. So an nb object is essentially just a list with one element per polygon. Remember, if we run dim in the console, the dimensions of our Lyme data are 61 rows; that's how many counties we have. So it makes sense that we have a list of 61 vectors, and each vector lists the indices of the neighbors of one polygon. Recall that when you subset from a vector, you can just use a single bracket, but when you subset from a list, you need to use a double bracket, which is what I'm doing here.
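To recap the last couple of steps in one place, the log transform and the neighbor definition, here is a rough sketch (object and column names are approximate):

```r
library(spdep)

# Tame the right skew with a log transform
lyme$log_lyme_incidence <- log(lyme$lyme_incidence_rate)
hist(lyme$log_lyme_incidence)        # much closer to normal

# Queen-case contiguity neighbors (queen = TRUE is also the default)
lyme_nb <- poly2nb(lyme, queen = TRUE)
class(lyme_nb)                       # "nb"
lyme_nb[[1]]                         # indices of the neighbors of the first county
```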
So we're going to take the first element, the first vector, in this list, and it returns a vector of these values. What do these values mean? As I already told you, they're indices: the indices of the neighbors of this first polygon. So we can find the names of these neighbors. Well, first let's find the name of this first county: Albany. The first county is Albany, so this first vector from our nb object is all the neighbors of Albany. We can find the names of all the neighbors of Albany if we take this vector and plug it into the bracket, and that pulls the names of the neighbors of Albany: Columbia, Greene, Rensselaer, Saratoga, Schenectady — okay, I won't try to pronounce the last two. If you're familiar with New York State geography, you might already be convinced that these are the neighbors, but we can do a quick check. I believe Albany — I thought I knew where Albany was, but I guess I don't. Let's see. Albany, there's Albany, okay. And the neighbors we would expect are the name I couldn't pronounce, Montgomery, the other name I couldn't pronounce, Greene, Columbia, Rensselaer — yeah. So it is indeed these counties. That's what's going on in this nb object. Now that we've figured out who the neighbors are for each polygon, the next step is to assign weights to the neighbors. The idea is that you assign some non-zero weight to the neighbors, and the weights for non-neighbors are just zero. So you only care about the covariance between pairs of neighbors, and by giving them a weight of zero, you're essentially not including all those non-neighboring polygons in the covariance summation of the Moran's I statistic. When we assign weights using the spdep library, we can do so, if we already have an nb object, with this function, nb2listw. What you input into this function is our nb object and then the style, the way we're going to assign the weights. The simplest way is to assign equal weights to neighbors, and I'll show you what I mean by equal weights, but first let's go ahead and run this line of code. We're going to store our weights in this object that I'm naming lyme weights, and let's see what the class of this object is. Okay, it's a listw class; the w stands for weight. And we can run the str command again. I already wrote that up here, so let me first show you what happens if I just run str like that: it's quite a lot of output that's hard to wrap your mind around. One thing that makes it easier to view, if you have a complicated list-within-list situation, is to set max.level to one, and that only gives you the first layer of the structure of your object. That's what I'm doing here, and now you can see this is a little bit easier to read. So it's a list of three different things. The first element of this list is the style; that's just saying that we selected W, or equal weights. The second element is actually our nb object; it's that list of 61 vectors, with each vector containing information about the neighbors of each polygon. And then what we actually care about, the weights themselves, is the third element of this listw object, down here. And so we can access that third element of our list.
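Sketched out, the weights step looks roughly like this, continuing from the neighbor object above (object names are approximate):

```r
# Equal ("W"-style, row-standardized) weights across each county's neighbors
lyme_weights <- nb2listw(lyme_nb, style = "W")
class(lyme_weights)                  # "listw" (and "nb")
str(lyme_weights, max.level = 1)     # three elements: style, neighbours, weights

lyme_weights$weights[[1]]            # Albany: 6 neighbors, so six weights of 1/6
lyme_weights$weights[[2]]            # Allegany: 4 neighbors, so four weights of 0.25
```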
When you have a list whose elements are named, you can call a specific element the same way you'd call a column in a data frame, with a dollar sign. So we can access these weights with dollar sign weights, which outputs all of our weights together. And if we want to see the weights for our first polygon, again Albany, we use a double bracket; you can tell from the way the weights print, with each entry shown under a double bracket, that this is a list within a list, which is why I need double brackets to pull out the first set of weights, the ones for Albany. Okay, now you can see what I mean by equal weights. Albany, if you'll recall, and actually you don't have to recall, we can check with code, has six neighbors, so not surprisingly there are six weights and all the weights are equal. If you're not good at mental math, which I'm not, each one of these weights is equal to one over six. The way spdep calculates weights, it wants the weights for all the neighbors of any given polygon to sum to one, so when you have six neighbors, the weight for each neighbor is one over six. And if we consider our second county, Allegany, we can see that it has four neighbors, and the weight for each of those four neighbors is 0.25, or one over four; again, the weights sum to one. So now we have defined our neighbors, we've assigned weights to neighboring polygons, and we're ready to do the significance testing for global spatial heterogeneity. We do so with another function in the spdep library, moran.test, which runs an analytical test. You're going to get the same Moran's I statistic either way; moran.test versus moran.mc differ in how you assess significance, essentially how you calculate your p-value. Without going into detail, I mentioned before that the best way to assess significance for the Moran's I statistic is through permutations or simulations; Monte Carlo simulation is the best way to get an accurate p-value. Doing it analytically, which is another way of saying mathematically, is faster but less accurate. So we're going to do it both ways. First we run moran.test, specifying our variable of interest, the weights, and, this is the default but I wrote it explicitly anyway, the alternative hypothesis: that we expect the Moran's I statistic to be greater than under the null hypothesis, which is equivalent to positive spatial autocorrelation.
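Pulling out the weights and running the analytical global test might look like this; a sketch, assuming the log-transformed column is named log_lyme_incidence.

```r
# Named list elements can be accessed with $; each county's weights are a
# further list element, hence the double brackets
lyme_weights$weights[[1]]  # Albany: six neighbors, each weighted 1/6
lyme_weights$weights[[2]]  # Allegany: four neighbors, each weighted 0.25

# Global Moran's I with an analytical p-value, testing for positive
# spatial autocorrelation
moran.test(lyme_sf$log_lyme_incidence,
           listw = lyme_weights,
           alternative = "greater")
```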
If for some reason, which I don't think you will be given your interest in being on this call, you wanted to test for negative spatial autocorrelation, you would write "less", or you could write "two-sided" if you think there's any chance of negative spatial autocorrelation. But we're going to stick with the default, and like I mentioned, when it comes to health data you're typically dealing with positive spatial autocorrelation, so we'll keep this option as "greater". I just wanted to give some background on what it means. And we can see that our Moran's I statistic, calculated using the formula we briefly went over on the slide, equals 0.72, but of course that means nothing without looking at our p-value, and you can see the p-value is very, very significant. But like I said, the analytical test sometimes gives an over-inflated level of significance, an overly small p-value, so we can compare this result to the one we get from Monte Carlo simulations. What you input to the moran.mc function is pretty much exactly the same, except that in addition to those options you also need to specify nsim, the number of simulations to run. You want to pick a large enough number, but not something like a million, because that would take forever to run; around 999 or 1,000 is a good intermediate between those extremes. So we run moran.mc, and you'll see why in a moment, but I'm going to save the output of this test, naming it mc. Before we look at the output, I want to point out how you calculate a p-value with Monte Carlo simulations: you run 999 iterations simulated under the null hypothesis of no spatial trend, which gives you a distribution of Moran's I statistics, and you compare the Moran's I you actually got from your data to that simulated distribution. Because we set nsim to 999, the most extreme p-value you can get is essentially 0.001, meaning your observed Moran's I is greater than all 999 simulations. With that said, let's look at our results, and you can see our p-value is 0.001. I point that out because the number of simulations essentially puts a cap on how tiny your p-value can be; I haven't done this, but if you're curious you can increase nsim and see if the p-value gets any smaller. Hopefully that makes sense, but if not, it's just a side point. And the reason I saved this output as mc is that spdep outputs it in a form you can feed directly into the base plot function, and that does exactly what I just described: it made 999 iterations simulated under the null hypothesis, where incidence rates are homogeneously distributed across our study area, and drew that distribution. You can see our test statistic, and I'm calling it an outlier because it very much looks like one, is well outside the distribution. So my suspicion is that if you were to increase nsim, you would get a very small p-value. Oh, I have a question, good question: is the MC method still feasible if you have a large number of observations?
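And the Monte Carlo version, a sketch using the same assumed names as above:

```r
# 999 permutations under the null of no spatial pattern; the smallest
# attainable p-value is therefore 1 / (999 + 1) = 0.001
mc <- moran.mc(lyme_sf$log_lyme_incidence,
               listw = lyme_weights,
               nsim = 999,
               alternative = "greater")
mc

# spdep's plot method draws the simulated distribution of Moran's I with
# the observed statistic marked, here far out in the right tail
plot(mc)
```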
For example, all census block groups in a state; are there any rules of thumb? I think the MC method is absolutely feasible even with a large number of observations; it's just a matter of how long you're willing to wait. I don't think there are rules of thumb. If it's something you really care about, say it's the main result of your paper and you don't mind waiting, oh, I don't know, two days, and I've waited that long for code to run, I think it's perfectly reasonable to do so. Okay, excellent, good luck. And before we move on to local tests of spatial clustering, we're going to take another break. Let's do another 10 minutes; we'll probably get out an hour or more early because I don't think we have that much longer to go. Please take a break, get a snack, go to the bathroom, and once again feel free to leave messages in the chat; we can start with questions if there are any. Okay, let's get back to it. So I was following the chat. I've never used the doParallel library for any analysis myself, but I imagine that if you did have a problem with code taking a long time to run, because your dataset is huge and you're running so many simulations that it really takes days, it could be helpful for speeding things up. My example of code taking days was compiling county statistics; it was just rendering a very large file in a very specific way. But I'd say it's pretty rare that things truly take days, and a good strategy is to start with a very low nsim. You can start with, say, 100. It won't give you very robust results, but it's a nice check that your code is working and gives you an estimate of how long it would take if you increased the number of simulations. So yeah, great, cheers everybody. Now we can talk about our local significance tests. Assessing local trends: I just have a few slides on this. Speaking generally, the umbrella term for a local test of spatial association is a LISA, which stands for local indicator of spatial association. There are a number of LISAs, and what they have in common is that they provide a test statistic for each location you're considering. So if you're working with polygon data, you're going to have a statistic for each one of your polygons, and along with the test statistic you'll also have some measure of significance, like a pseudo p-value, at each location. Because you have these statistics and measures of significance across your study area, you can actually identify specific disease clusters. And because we already talked about the Moran's I statistic for our test of global trends, a nice extension for our LISA is the local Moran's I statistic. (Sorry, one moment, my dog is acting up; I think she's hungry.) I see a question in the chat... oh, okay, no questions, let's move on. So the local Moran's I statistic: if you recall our formula from earlier, the global Moran's I, the difference is that instead of the double sum, the i term sits outside the summation, because you calculate the statistic separately for each region, region i. Essentially, you find the neighbors again, use the same weights you used before, and weight your neighbors accordingly.
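Since the slide formulas aren't reproduced in this transcript, here is one common way of writing the two statistics, following Anselin's definition of the local version; treat this as a reference rather than the exact notation used on the slides.

```latex
% Global Moran's I: one statistic for the whole study area
I = \frac{n}{\sum_i \sum_j w_{ij}} \;
    \frac{\sum_i \sum_j w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})}
         {\sum_i (x_i - \bar{x})^2}

% Local Moran's I: one statistic per region i, with the i term pulled
% outside the sum over its neighbors j
I_i = \frac{x_i - \bar{x}}{S_i^2} \sum_j w_{ij}\,(x_j - \bar{x}),
\qquad
S_i^2 = \frac{\sum_{j \neq i} (x_j - \bar{x})^2}{n - 1}
```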
So, looking at the neighbors of polygon i: if the deviation from the mean is high for your polygon of interest, then the local Moran's I statistic will be high when the neighbors' values are high as well. It's just a nice extension of the global Moran's I test statistic we talked about. Like I already said, the local Moran's I statistic also comes with a measure of significance, and we call it a pseudo p-value because it's not this measure of significance on its own that you care about. You have a pseudo p-value for each polygon, and we judge how significantly high a local Moran's I value is based on that pseudo p-value. The statistic is high, like I said, when you have a high disease count or rate surrounded by other polygons with a high disease count or rate, and it's low when a polygon with a low disease count or rate is surrounded by other polygons with a low count or rate. So when you have a polygon with a significantly high local Moran's I, meaning a low pseudo p-value, surrounded by other polygons with significantly high local Moran's I values, we call that cluster a high-high cluster; another name for this is a disease hotspot. On the other side, if you have significantly low-value polygons surrounded by other significantly low-value polygons, we call that a low-low cluster, and it indicates a cluster where the disease incidence is lower than one would expect under the null hypothesis of no spatial trend. So let's go to our workspace again; we have some code to run that shows how to do a local Moran's I in R. We left off on line 266. To do a local Moran we're actually going to use a different library, because right now, with sf objects, you can't do the local tests with the spdep library. This package came out just last year; it's still being updated and very much in its early stages, and when I was going through some documentation that came out just a few months ago, I saw that some of the functions had already changed names. It's a nice library called rgeoda. The steps themselves are going to be similar: we'll again need weights, but we'll calculate them a little differently because we're using a different package. So let's start by loading rgeoda. Once again we're going to find queen-case contiguity-based weights, which in rgeoda we get with the queen_weights function. (Excuse me one moment, I'm just going to remove my dog from the room so she's not making a racket. I apologize for the disruption; let's continue where we were.) We can find queen-case contiguity weights using the queen_weights function in the rgeoda library, and all we need to pass to it is our sf object. Like I said, rgeoda is brand new and it has some nice conveniences, and I'll show you what I mean by that. So we can store our queen weights; to differentiate them from the weights we calculated with the spdep library, I'm going to name them a little differently: lyme_g_weights, with g for GeoDa.
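A sketch of the rgeoda weights step; queen_weights works directly on the sf object, and lyme_g_weights follows the naming convention from the talk (assumed, as before).

```r
library(rgeoda)  # GeoDa's spatial analysis routines for R

# Queen-contiguity weights built straight from the sf object
lyme_g_weights <- queen_weights(lyme_sf)
class(lyme_g_weights)    # rgeoda's "Weight" class
summary(lyme_g_weights)  # observations, min/max/mean/median neighbor counts
```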
And if we check the class, we can see it's a Weight object, which is the type of object the rgeoda library uses; that's just what this information is telling you. Once again we can run our old faithful str command, and we can see that it's class Weight, and it gives us a nice summary of the weights. It's telling us that the minimum number of neighbors across our 61 polygons is one, and there's at least one polygon with as many as eight neighbors, along with the number of observations and the mean and median number of neighbors. This is a nice summary that would take several lines of code to get with the spdep library. Once again, we can look at the neighbors of the first polygon, Albany County. The way we do that with the rgeoda library is with a function they've already written for you that gets the neighbors, which is nice and clean; you don't have to index with double brackets and then single brackets. That's what I mean by it being a little more user-friendly. What you input to the get_neighbors function is the weights, and then the index of the county you're interested in; I wrote one here. If you run this line of code, it gives us the index values of the neighbors of Albany. Just to remind us, the first county is indeed Albany. We can plug this vector of indices directly into a single bracket and extract the names of the neighbors, and once again we get these New York counties. I won't belabor the point, but you can do the same for Allegany and get the names of its neighboring counties. You can get the actual values of the weights using a similar function, get_neighbors_weights, and once again you specify which polygon you want the weights for. Here you can see a difference between rgeoda and spdep: instead of the weights for each polygon summing to one, the default behavior is to assign a weight of one to every neighbor and zero to every non-neighbor. It's just a little different, but similar. So we had six neighbors for Albany, which is why we get this vector of six ones, and because Allegany has four neighbors, we get four ones. And then you can calculate your local Moran's I statistic using your rgeoda weights. Though there is one quick data-cleaning step, which is kind of weird; I wonder if this behavior will change as they continue to update this package. The local Moran function doesn't want the variable in vector format: if you were to pass it just a vector, it would complain, because it really wants a data frame. It's not showing here, but it's essentially going to be a one-column data frame. (Okay, I'm getting word that my internet is a little choppy, so I'm going to relocate. How is my video and sound now? If anyone can give me some feedback in the chat, I'll wait before I continue. Okay, great; let me know if that changes and I can keep moving closer to my router. So let's continue with our example.)
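Exploring neighbors and weights in rgeoda, plus the small data-frame conversion that local_moran currently expects; a sketch with the same assumed names as above, and with get_neighbors_weights used as described in the talk (I'm taking that function name from the talk rather than from the package documentation).

```r
# Index values of Albany's neighbors, then their names
get_neighbors(lyme_g_weights, idx = 1)
lyme_sf$county_name[get_neighbors(lyme_g_weights, idx = 1)]

# Default rgeoda behavior: every neighbor gets a weight of 1 (not 1/n as in spdep)
get_neighbors_weights(lyme_g_weights, idx = 1)  # six ones for Albany
get_neighbors_weights(lyme_g_weights, idx = 2)  # four ones for Allegany

# local_moran() wants the variable as a one-column data frame, not a bare vector
incidence_df <- as.data.frame(lyme_sf$log_lyme_incidence)
head(incidence_df)
```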
So essentially, if we pass this vector into the as.data.frame function, you can see what the output looks like: it's just a one-column data frame. This is a persnickety thing about the local_moran function right now, that it wants our variable of interest in data frame format rather than vector format, which is why there's this extra step here. The first argument you pass to the local_moran function is our weights in the GeoDa weight format, which we stored as lyme_g_weights. So we can go ahead and run this local_moran function, and we can see that the output is a LISA object from the rgeoda package. This LISA object has some nice features. Recall that for a LISA, in our case a local Moran's I, we have a Moran's I statistic at each one of our locations, each one of our polygons, and a pseudo p-value at each polygon, each county. We can access the value of the test statistic at each location with the lisa_values accessor: we get a vector in which each element is the local Moran's I statistic associated with each county. And if we want to see the pseudo p-values for all the counties, we can do that with lisa_pvalues. The final step of our analysis of local trends is to map the results and see if we have any high-high or low-low clusters, and there are some functions in the rgeoda library that help with the plotting. We can get the colors we want to use for our map with the lisa_colors function, and I can show you what that output looks like: it gives us the color for no association, for our high-high cluster, for our low-low cluster, and for the other categories. I think two of these are high-low and low-high clusters, which we don't often see in public health data because they really show up when you have negative spatial autocorrelation, and I don't recall what the remaining colors are. We can also get the labels we want for our map; you can see this output, and like I said, these correspond to the colors. The possible labels are not significant, high-high, low-low, low-high, and high-low, and the other two, if I remember right, are undefined and isolated. And finally we can run the lisa_clusters command, and this designation tells you, for each of our polygons, whether it's not significant, which is represented by zero; polygons that are part of a high-high cluster are labeled one, and polygons that are part of a low-low cluster are labeled two. Once we put all these things together in the plotting function, we can see our results. And you can see, which is not surprising if you remember what we saw when we plotted the values themselves, that there was an area of much, much higher Lyme incidence in this part of the state, and indeed this is a high-high cluster, a hotspot for Lyme incidence. It was less clear from the color scale we were using before, but it turns out there is also a cold spot, a cluster of low incidence rates.
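Finally, a sketch of running the local Moran's I and mapping the cluster designations; the lisa_* accessors are from rgeoda, and the plotting here is plain base R on the sf geometry rather than whatever plotting helper the practicum file actually uses.

```r
# Local Moran's I for every county, using the rgeoda weights
lyme_lisa <- local_moran(lyme_g_weights, incidence_df)

lisa_values(lyme_lisa)    # one local Moran's I statistic per county
lisa_pvalues(lyme_lisa)   # one pseudo p-value per county

# Cluster designations: 0 = not significant, 1 = high-high, 2 = low-low, ...
clusters <- lisa_clusters(lyme_lisa)

# lisa_colors() and lisa_labels() return one color/label per category, so the
# 0-based cluster codes are shifted by one to index into them
plot(st_geometry(lyme_sf),
     col = lisa_colors(lyme_lisa)[clusters + 1],
     border = "grey")
legend("topleft",
       legend = lisa_labels(lyme_lisa),
       fill   = lisa_colors(lyme_lisa),
       cex    = 0.7)
```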
And it turns out this little corner, including New York City, is in a low-low cluster, a cold spot for Lyme disease, which makes sense given the lack of exposure to wooded areas that New Yorkers probably experience. So that's really all I've got for our practicum and for our files. I just want to end by thanking you for joining the workshop today. Please, if you have any questions about spatial analysis or any comments you want to share, feel free to email me. In the next day or two I will upload all the files from today, including some bonus code with ggplot mapping in it, and, why not, I'll also throw in some leaflet code, some more customized interactive maps that you can play with. And please let me know if you have any questions you want me to answer right now; I'll stay on for a few minutes. Thank you. Excellent question: any book recommendations for further reading? I don't really have a favorite; I'll tell you what I go back to, and I don't think it's because it's the absolute best textbook out there, it's just what I learned with, so I'm familiar with the content and know the layout by heart. My go-to reference, which I'll share with you, is Lance Waller's book. Oh, a question about the diseases and topics I research. I completed my dissertation not too long ago looking at the environmental and social drivers of asthma exacerbations in adult patients seen by Penn Medicine. That was really fun because I got to integrate electronic health record data with publicly available data sources on things like SES and pollution levels, and it gave me the opportunity to pick up a lot of spatial skills. In my postdoc, one of my projects is looking at how we can improve canine rabies vaccine uptake in Peru, where they're currently experiencing a canine rabies outbreak, and we're also doing some other projects looking at zoonotic diseases more locally. But yeah, that's my research area right now. Okay, if there are no more questions, I'm going to sign off now. Thank you all for your attention, look out for these files on GitHub, and feel free to email me if you have any specific questions. Bye, have a good night.