All right. So I'm Amelia McNamara. On Twitter I'm @AmeliaMN, and I just tweeted a link to a PDF of these slides with a bunch of resource links. So if you're trying to read the links, I know they're a little bit hard to see, but you can go look them up there. So I want to talk about how spatial polygons shape our world. And this is going to be connected to the last talk in that it's about spatial data. It's about the way that the world is different than the visualization of the world. But specifically I'm talking about the polygons that are being created by humans. So when I think about geographic data, I think about three main types. I think about points, which might be where people are, or trees or houses or something like that. I think about lines, maybe rivers or roads. And then the part that I'm the most interested in is polygons. Polygons are closed shapes that describe some area. Again, a house could be described as a polygon as well. But usually we're thinking about polygons that are larger and contain some amount of human population. So when we're mapping spatial polygons, we could have regular polygons or irregular polygons. You can see this example where I have hexagons. You could imagine having those on a map, or you could have polygons like the US states. I'm up in Minnesota, so here are some irregular polygons in the upper Midwest. And in data visualization, when we see spatial polygons, they're almost always colored by some value. So we make choropleth maps, where we have areas, polygons, that are being shaded according to the value of some variable. This example I pulled off of Flickr. It's from Prop 1 in Seattle, where they were voting on a light rail measure. And you can see there are different areas, and they're being shaded according to whether the people voted for or against this measure. Red is less than 50% voting for the measure and blue is more than 50%. 
And you can already see a problem with plotting this kind of data, which is that small areas look unimportant and big areas look very important. But if you're from Seattle, which I know many of you are, I believe this measure passed. And that's because the places where it's very blue, where people voted for it, are very dense places in the city of Seattle. And these enormous swaths of red are the less urban areas surrounding Seattle. So Andrew Gelman has a great paper called All Maps of Parameter Estimates Are Misleading. And it's about this exact problem. If you have some map of polygons and you color it according to some absolute number, then places that have really big populations are going to have a lot of that thing going on, and places that have small populations are going to have less of it. Often one way that we think about correcting for this problem is to divide by some other measure. Say we want to see the percentage of people who have cancer in a particular county. Rather than showing the absolute number of cancer cases in a county, we could divide by the size of the population. And the problem here, and this example comes from Ben Jones, is that you end up with really high variance in low population areas. I think it's Pulaski County in Illinois that has the highest rate of kidney cancer, and it has a very small population. Polk County, Wisconsin, also has a relatively low population, and it has a very low rate. And that's basically because if you don't have very many people in an area and a few of them get cancer, then it's a really high rate. If no one gets cancer, it's a very low rate. So it doesn't work to just plot the absolute numbers, and it doesn't work to plot the percentages. I think Gelman has some suggestions for fixing this problem. 
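To make the small-population effect concrete, here is a minimal simulation sketch. It is not from the talk: the county populations, the shared true rate of 0.2%, and the function name `simulate_county_rates` are all made up for illustration. Every simulated county has exactly the same underlying rate, so any difference between counties is sampling noise.

```python
import random

def simulate_county_rates(populations, true_rate, seed=0):
    """Simulate an observed rate for each county under one shared true rate.

    Each person independently becomes a case with probability true_rate,
    so differences between counties come from chance alone.
    """
    rng = random.Random(seed)
    rates = []
    for pop in populations:
        cases = sum(1 for _ in range(pop) if rng.random() < true_rate)
        rates.append(cases / pop)
    return rates

# Hypothetical counties: 200 small ones (500 people) and 100 large ones (20,000 people).
small = simulate_county_rates([500] * 200, true_rate=0.002, seed=1)
large = simulate_county_rates([20_000] * 100, true_rate=0.002, seed=2)

print("small counties: min rate", min(small), "max rate", max(small))
print("large counties: min rate", min(large), "max rate", max(large))
```

Under these assumptions the small counties should show up at both the very top and the very bottom of the ranking, even though nothing about them is actually different.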
One of the solutions that I've seen recently that I liked was this "Surprise! Bayesian Weighting for De-Biasing Thematic Maps" paper. You have the absolute numbers, and that's not good. You have the percentages, which have this high variance in small population areas. And then there's this technique that does a Bayesian weighting to say, okay, we know that the places with small populations are going to have a lot of variance, so is this a surprising result or not given that underlying distribution? And then the other problem, other than the absolute numbers and the percentages, is again that big areas carry a lot of visual weight. So this is the 2008 election map. I'm sure you're all familiar with it. When you look at a map like this, it looks like the red polygons are taking over the map. There's a lot more red area than there is blue area, but in this particular election, that's not the way things swung. You can see at the top the electoral count, which is heavily weighted toward the Democrats. And I didn't put in a map from this year, but the map actually doesn't look that different, because you have these large red states that carry a lot of visual weight. And the question is, how can you solve that? One solution would be to plot the number of people that voted one way or another by county, or to use some scale that includes purple. But again, it doesn't really solve the problem of aggregating to spatial polygons, because when we make our electoral decisions in the United States, we're basing them on these aggregated polygons. We're sending a certain number of electoral votes from a particular state. And so showing that there were people voting either way in a number of states is interesting, because it tells us that we're not as divided as we think we are, but it's not giving us more information about how the vote actually turned out. So people make cartograms. 
This is, I think, from the 2004 presidential election. It's taking the counties, sizing each shape based on the population, and then coloring based on whether people voted Democrat or Republican. Again, this might give you an idea that we're not as divided as we thought we were, but it's very hard to look up any results here. If you want to go look at your county and see how it voted, well, you'd better live in Los Angeles County, because otherwise it's gonna be hard to see. Another type of cartogram does sort of the opposite. Instead of distorting the shape by making areas larger or smaller, we make every area exactly the same. So you could have a cartogram where you use squares, and then every state in this case is given the same visual weight. This is an example from NPR, and they're visualizing which states have non-discrimination laws for LGBT people. I think the darker the blue, the better the laws are, and then the gray is where there weren't laws passed yet. And again, I'm from Minnesota, so when I look at this map, I can tell that there's something wrong about it. Minnesota is not next to Illinois like that. So we've gained something, in that all the states have the same visual weight, but we've lost something about the relationship between states and how they connect. So here's maybe a slightly better cartogram, which uses hexagons. Now we have Minnesota and Wisconsin sitting next to each other, but I'm sure if you have contextual knowledge about other parts of the United States, you'll find things that aren't quite right. And people have measures for how accurate a cartogram is and how much it distorts the picture. But it still has this problem where we're weighting things in a particular way and maybe we're losing the spatial relationships. Okay, so with that in mind, I wanna talk about some common spatial polygons that get used to aggregate data. 
So we've already talked about states, which I think are probably the easiest to think about. Then there's the census. This is the dropdown you get if you go download data off of the census website. These are the enumeration units that you can choose for census data. There are census blocks, there are census tracts, there are counties, there are zip codes, there are school districts, there are so many more. And one of the things that I want to emphasize is that almost none of these spatial polygons are naturally occurring. They all arose from some human political process where a person said, this is where we're going to draw the line between these voting districts or these zip codes. And so it often does not make sense to aggregate data into spatial polygons. But people do it because it's easy and it looks really pretty. I'm a data science professor, so for me one of the problems comes when you have data at two different spatial aggregation levels, but you'd like to combine it together. So you'd like to make a map that shows these two variables, but you have one at the school district level and one at the state level, and you need to combine them. If you're an R programmer like me and you had tabular data, you would use something like dplyr or another Hadley Wickham package. You'd do a kind of SQL-flavored join, and you would match up things that had matching IDs. The spatial combination problem is that you don't have matching IDs. If you have point data and you want to combine it with polygon data, the IDs for your points aren't gonna match up with the IDs for your polygons. And so you have to do something to solve that kind of data science problem. Spatial statisticians call this type of problem the change of support problem. And there are different change of support methods for moving between types of spatial data. So if you have point data and what you really need is polygon data, they call that upscaling. 
If you have polygon data and you really need point data, that's downscaling. And if you have polygon data that's aggregated at some particular level, and you need polygon data aggregated at some other level whose boundaries don't line up with the first, that's called sidescaling. So I'm gonna talk about the easiest one of these things first, which is upscaling. That's where you have point data and what you actually want is polygon data. And now we're bringing in another geographic problem, which is the modifiable areal unit problem. The modifiable areal unit problem is that if you have point data and you aggregate it into polygons, the choice of your polygon boundaries makes a huge difference to the visual distribution that you see in the polygons. So this image here is maybe the only image associated with the modifiable areal unit problem. It is on every website about this problem, but I think it's actually really nice. The yellow dots are the points, and then you can see there are three different choices for breaking the area down into polygons. And what we're doing to color the polygons is just counting how many points fall into each area, and then coloring by that value. You can see that with some choices there's a dark region, a clump in the center, and then with other ways of aggregating, I can really make that melt away. I don't think it should be incredibly surprising that the way that you aggregate data can have a big impact on the distribution that you see. This is some work in progress with my collaborator, Aran Lunzer. We've been thinking about histograms for a long time. With a histogram, you're taking point data and you're aggregating it into bins. It's just in one dimension. But if you change the way the bins are defined, either making the bins wider or skinnier, or adjusting the bin offset, or changing whether the bins are closed on the left or on the right, you can really change the visual distribution that you see in your histogram. 
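The histogram version of this is easy to try by hand. The sketch below is an illustration only: the data values and the `histogram` helper are invented here, and the bins are left-closed. The same ten values are binned with the same width but two different offsets.

```python
import math
from collections import Counter

def histogram(values, width, offset=0.0):
    """Count values in left-closed bins [offset + k*width, offset + (k+1)*width).

    Returns a dict mapping each non-empty bin's left edge to its count.
    """
    counts = Counter()
    for v in values:
        k = math.floor((v - offset) / width)
        counts[offset + k * width] += 1
    return dict(sorted(counts.items()))

# Hypothetical one-dimensional data, made up for illustration.
data = [1.9, 2.1, 2.2, 2.8, 3.1, 3.9, 4.0, 4.1, 5.9, 6.1]

print(histogram(data, width=2, offset=0))  # bins start at 0, 2, 4, 6
print(histogram(data, width=2, offset=1))  # bins start at 1, 3, 5
```

With the bins starting at 0, half of the values land in the bin from 2 to 4 and the picture has a clear peak. Shifting the bins by 1 splits that peak evenly across two neighboring bins, so the same data looks flat.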
So we have an alpha version of this essay up right now, which you can go play with, and we're soliciting feedback. So if you have suggestions, we'd love to hear them. But the problem that I'm concerned with, in terms of spatial polygons and aggregating in this two-dimensional setting, is that it's this histogram problem, which I think people already have trouble wrapping their minds around, and then we've jumped into two dimensions. And with regard to things that impact us and shape our world, the way that aggregating from point data to polygon data can have a huge impact on us is gerrymandering. So gerrymandering, as you probably know, was named after a former governor of Massachusetts, Governor Elbridge Gerry, pronounced "Gary," not "Jerry," for whatever reason. He helped redraw the state senate election districts to benefit his party in 1812. And so the political cartoonists at the time drew this salamander, which was taking over and changing the way that people were getting aggregated into districts. If you've seen that recently, it was probably on this Last Week Tonight segment that came out about two weeks ago. And when I saw this video, I thought maybe I should just play this as my OpenVis Conf talk, because John Oliver actually does a really nice job of explaining gerrymandering and pulls up a lot of the examples that I'm gonna talk about. So when I think about gerrymandering, people love this Washington Post article, "This is the best explanation of gerrymandering you'll ever see." And it's showing us not points but little squares to represent people. Each square is a voter, and we have 50 voters, 60% blue and 40% red. We can divide them up into polygons in a variety of different ways, and then we can see how the election result would shake out. So in the first scenario, you end up with three blue districts and two red districts, and blue wins. In the second one, you end up with five blue districts and zero red districts, and blue wins. 
And then if you do this kind of gerrymandering process, you could get two blue districts and three red districts, and red wins, even though there are many fewer red squares in the original data. And gerrymandering is a really hard problem to solve. When we define voting districts, we have a variety of measures that we'd like to optimize. We'd like to have compact, contiguous voting districts that respect political subdivisions or communities defined by actual shared interests. And it's very hard to draw a polygon that respects all three of those things. And sometimes people have even more things that they would like to have in their voting districts. Maybe you think that competitive voting districts are a valuable thing. Or maybe you really want safe districts. And I think our political process hasn't really decided what is legal, what's illegal, and what's even right. So when people talk about gerrymandering, they often pull up this example of North Carolina's 12th district. This is not a compact district, but it's a majority-minority district that perhaps does respect a community of shared interests. This is a largely African American population along an interstate corridor there. Another example would be California's 33rd district. I lived in LA for quite a while. And again, this isn't a compact district. But maybe you'd think that the people who are living right along the coastline, who have beachfront property, do have some shared political interest. Okay, there are lots of ways to understand this upscaling problem, where you have point data and you're trying to aggregate it into polygon data. I love this redistricting game. This came out of USC, and I recommend going and playing it. It has a ton of levels, and I'm really bad at it, as you can see from this video. It steps you through a number of challenges. Initially, you can do whatever you want, and then it wants to have equal population in each of the districts. 
It wants them to be compact and contiguous. Of course, you need representatives to live in the district that they represent. And so the challenge of the redistricting game is to aggregate this point data into a number of different polygons that maybe help your party win or make the other party lose. And there's lots more to understand about the problem of gerrymandering. One of the projects that I've been the most inspired by recently is this talismanic redistricting tool, which uses a supercomputer to generate many possible electoral maps with different districts, and then looks at the outcomes, so that you can compare a real redistricting plan with a bunch of theoretical simulated redistricting plans and see if the one that is being proposed, or that was used, was unfair in some way. But I think it's an open problem to decide the most fair way to aggregate us as people, as points, up into these polygons that decide things about our political process. The next problem is the downscaling problem. This is where you have polygon data and what you really want is point data. So this is an example from New York City. This project, I think, is called Disser. And it's the population by census block. Census blocks are very similar to city blocks, but not exactly the same. And then you have to decide how you're going to disaggregate that into points. One way that you could do this is to just say, okay, in this block there are 1,000 people, so I'm gonna randomly distribute 1,000 points. Spatial statisticians and geostatisticians don't like that, because it's not respecting things about the environment. So there's a method called dasymetric mapping, which takes auxiliary information, information about the geographic landscape. Are there mountains, or are there lakes or rivers? Then we're not gonna try to put any people there. And it uses these additional pieces of information to show the way that things are distributed in reality. 
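As a rough sketch of the difference between the two approaches just described, the code below scatters dots uniformly over a block and then repeats the exercise while keeping only dots that fall inside a mask of places where people could live. Everything here is an assumption made for illustration: the block and the footprints are axis-aligned rectangles with invented coordinates, the function names are hypothetical, and simple rejection sampling stands in for whatever a real dasymetric tool does.

```python
import random

def inside(rect, x, y):
    """True if (x, y) lies in the axis-aligned rectangle (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = rect
    return xmin <= x <= xmax and ymin <= y <= ymax

def place_dots(block, n, footprints=None, seed=0):
    """Scatter n dots uniformly in a block.

    If footprints is given, reject any candidate dot that does not fall
    inside at least one footprint (a crude dasymetric constraint).
    """
    if footprints is not None and not footprints:
        raise ValueError("footprints must not be empty")
    rng = random.Random(seed)
    xmin, ymin, xmax, ymax = block
    dots = []
    while len(dots) < n:
        x, y = rng.uniform(xmin, xmax), rng.uniform(ymin, ymax)
        if footprints is None or any(inside(f, x, y) for f in footprints):
            dots.append((x, y))
    return dots

# Hypothetical block with two footprints; the rest is open space.
block = (0, 0, 100, 100)
buildings = [(10, 10, 30, 30), (60, 50, 90, 70)]

naive_dots = place_dots(block, 1000)
masked_dots = place_dots(block, 1000, footprints=buildings)
```

Both calls place the same number of dots, so the block total is preserved either way. One consequence of this simple rejection scheme is that larger footprints receive proportionally more dots, which may or may not match how people are really distributed.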
So in this project, the Disser project, this is zoomed in on the Lower East Side. This is Stuy Town. If you have this very dark filled-in polygon, and you were just going to disaggregate it to a random selection of dots, that's the picture that you would get. But what the Disser project has done is combine this with polygon data about the footprints of the houses, the buildings that are in each of these blocks. And if you randomly place the dots within the houses, then you're probably gonna get a better visual distribution of how people actually live. So there's the comparison of those three maps. There's another very famous example of taking census data, which is usually given to you in an aggregated way, and then disaggregating it into dots. This is the racial dot map. It has one dot per person, and the dots are colored according to race. And again, I think this is a really beautiful map, and it's done a nice job of using additional information about where people don't live, so that it's pretty accurate. But it's hard to get the gestalt of what's happening with people. You have to really be willing to zoom in and look at the detail. And that's not always what people want when they're doing visualization. So the last problem is sidescaling. That's the hardest problem. That's the one where you have polygon data and you want polygon data, but the polygons are not exactly the same. So I'm gonna start by cheating. This is upscaling again. This is some more work with my collaborator, Aran Lunzer. We built this little tool, which I think is kind of like an explanation of the modifiable areal unit problem, except it's interactive. It lets you see how changing the way that you aggregate this point data, which in this example is locations of earthquakes in Southern California, changes the distribution that you can see. You can change the binning: where the bins are located, how they're oriented, how large or small they are. 
And so I think you can see that changing from one polygon level to the other is gonna make a difference. But back to our sidescaling problem. If we have nested polygons, this is easy. Some of you might have seen this example from Kevin Hayes Wilson, Redraw the States. This is taking the electoral map from 2016, and it's just moving some counties from one state to another. So you can go play with this. This is me, and I'm moving counties from Florida to Alabama. And I only have to move, I think, three counties to get Florida to swap from being a red state to being a blue state. So just by changing the shape of the spatial polygon, the state of Florida, we're able to get the distribution to be different enough that we can change the way the electoral vote turns out. So you should go play with this tool. It lets you adjust the edges of any state, not just Florida. And he has a great Medium post that goes along with it. I don't think we're gonna be able to change the outlines of the states. I think those are probably spatial polygons that are here to stay. But the districts within the states, that's something that we might be able to adjust, again, with this idea of gerrymandering. So that's a nested polygon problem, and it's pretty easy. Misaligned polygons are the big problem. And thinking about misaligned polygons, I think the most compelling recent example is the problem of lead in the water in Flint, Michigan. There were a lot of things that Flint did badly when they did their analysis of lead. They weren't using a very good sampling method. They had some strange protocols for people who were collecting water in their own homes. They did bad things with outliers. I mean, as a statistician, there are many things that I'm upset about. But one of the things was that they were collecting data at a spatial aggregation level that was inappropriate for the task at hand. They were reporting the lead levels at the zip code level rather than at the municipal level. 
The water was distributed at the municipal level, but then they were reporting things at the zip code level. So this is from a great blog post by a statistician who did some analysis here, and he's showing how the zip codes and the city boundaries are misaligned. The zip codes are the blue outlines, and then the city is that green outline. And I mean, zip codes in general, if you're plotting zip codes, you're probably doing something wrong, because zip codes are probably the most human-generated and nonsensical of the spatial polygons. They were generated to make it easy for the USPS to deliver mail. They're not respecting geographic boundaries. They're not respecting communities of interest. That's not what they're about. But in this case, reporting by zip code was essentially taking data that had high values in the city and lower values outside, and averaging them. So I have some students who are working on this problem, actually doing the analysis with data. But for the purposes of this talk, I just mocked something up in Photoshop. So maybe we had a choropleth map that looked like this, and we want to choose the value for the green region. It turns out that one method for finding the value of a region like this is just to find the exact center of your target polygon, ask what value is under that point, and say that's the value, okay? This is in the spatial statistics literature as one of the methods that you could use. I think that's a bad method, and we should do better than that. So what's a little bit better? Well, we could take the source areas that fall within the target enumeration unit that we're interested in, and weight their values by how much of the target's area each one covers. If we did that, maybe we would find a slightly different value that might be more accurate. But there are many methods that do a better job, a more statistically rigorous job. 
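The centroid method and the area-weighting method described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: all zones are axis-aligned rectangles given as (xmin, ymin, xmax, ymax), the coordinates and values are invented, the function names are hypothetical, and the quantity is treated as a level or rate to be averaged rather than a count to be split.

```python
def overlap_area(a, b):
    """Area of the intersection of two axis-aligned rectangles."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0) * max(height, 0)

def centroid_value(target, sources):
    """Return the value of whichever source zone contains the target's center."""
    cx = (target[0] + target[2]) / 2
    cy = (target[1] + target[3]) / 2
    for rect, value in sources:
        if rect[0] <= cx < rect[2] and rect[1] <= cy < rect[3]:
            return value
    return None

def area_weighted_value(target, sources):
    """Average the source values, weighted by their overlap with the target."""
    weights = [(overlap_area(target, rect), value) for rect, value in sources]
    total = sum(w for w, _ in weights)
    if total == 0:
        return None
    return sum(w * v for w, v in weights) / total

# Hypothetical source zones: a high-value zone on the left, a low-value zone on the right.
sources = [((0, 0, 10, 10), 20.0), ((10, 0, 20, 10), 4.0)]
# Hypothetical target zone straddling both, with its center just inside the right zone.
target = (6, 2, 16, 8)

print(centroid_value(target, sources))
print(area_weighted_value(target, sources))
```

In this made-up layout the centroid method reports only the low value, because the center of the target happens to sit in the right-hand zone, while the area-weighted value lands between the two source values. Neither one uses any information about how the quantity varies inside a source zone.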
Most of the more rigorous methods rely on this rule of thumb called Tobler's first law of geography, which is that everything is related to everything else, but near things are more related than distant things. And that means if you have two polygons that are next to each other, you think that maybe there's some continuous relationship between the stuff that's going on in the one over here and the one over here, because that line was just drawn by a human. And Tobler also has this idea of the pycnophylactic property, which is that if you're going to move from one type of map to another, you need to make sure that whatever your smoothing or interpolation method is, it can be reversed back to the way that it was before. So if you're going to take polygon data and turn it into a smoothed surface, it needs to re-aggregate back exactly to the values that you had before. So I have some joint work with some students from last year, which is working with the pycno package in R. And this is a very classic spatial statistics example. This is data from North Carolina on births in 1974. In order to get this map, you have to know things about the sp package and rgdal. And I think there might be some ggplot2 going on here. And you get this beautiful map of the places where there were lots of births and where there were not many births. The bright spots are lots of births. And the pycno package is going to observe the pycnophylactic property, and it's going to create a smoothed surface. So if you run the package, well, depending on your parameters, this is the smoothed surface that you might get. And if I overlay the counties again, you can see how things have smooshed out from their original aggregated distribution. When you compare these two maps, it's not clear that they're going to observe the pycnophylactic property. It doesn't seem like that bottom one is going to re-aggregate to the top one. 
But it turns out that it actually does, because pixel values are sort of colored by individuals. So if you have a smaller area, you're going to have more bright spots. So even this place over here, where things look like they're not going to match up very well, does actually aggregate back up to the original data. We tried this with some other data. This is population data in New England at the voting district level and at the county level. We've created some different smooth surfaces from each of those levels, and we're working to do the re-aggregation from one level of spatial support to the other. So this is using, again, the pycno package. I think there's some great progress on other tools to work with things like this. I just heard about CoGram in JavaScript today. There's a new package in R, sf, which I think is maybe not going to fix the pycno issue, but will fix some of the other problems with working with spatial polygon data frames. And there are other methods out there. There are interpolation methods, which are going to respect observed values and smooth things out in between. And there are smoothing methods, which might not end up matching any of the points. People talk about kriging a lot in spatial statistics, so that's a good interpolation method. So if you're interested in this stuff, I'm happy to talk more with you about it. I think my main takeaways are: if you don't have to aggregate data to polygons, don't do it. If you are going to aggregate, pay attention to the spatial polygons that you're using. Don't use zip codes or something else that's kind of meaningless. And if you happen to have auxiliary information about where housing is located, or where rivers or lakes are, use that information to help you when you're moving between levels of spatial support. 
So because spatial polygons are impacting us on a day-to-day basis, I think it's really important that we keep working on these methods and stay cognizant of the ways that they can impact the visualizations that we make. Thank you.