Hey folks, I'm Pat Schloss and this is Code Club. If you watched the last episode, and I encourage you to go back and watch it, you'll know that we're spending a few episodes away from our march through analyzing microbial ecology data to think about something that's, well, maybe a little bit more important: the lasting legacy of slavery in the United States and how it manifested itself in racial violence between 1877 and 1950 through lynchings in the southern states. In the last episode, I generated a line plot that showed, in aggregate across those southern states, the total number of lynchings, or at least the number of lynchings that could be detected by the database and the people who were curating these records. It was very stark and very sobering. And then we went another layer in and said, okay, within each of the states, how did the number of lynchings vary by year? And so we generated a heat map to look at that. And again, of course, whenever you're dealing with lynchings, whenever you're dealing with murders, it's very sobering, right? At the end of the episode, I commented that we need to remind ourselves that these aren't just data, that these are people. One of the constant refrains last summer in the fallout from the murders of George Floyd and Breonna Taylor was to say their names, right? And so what I'd like to get to is a visualization where we say the names of the victims. We're not ready for that yet. So what I want to do is spend this episode creating one more visualization that will allow us to answer a different type of question about the data, which is how widespread were these lynchings in these southern states? And to do that, I'd like to build a visualization that I rarely get to build in my research, which I think is called a choropleth map. And so it's a map of the states.
And actually, we're going to do it at the county level. I want to shade each county in each state according to the number of lynchings that were detected between 1877 and 1950. You'll recall that the data I am using come from this lynching database. It's very easy to get the data, but the data aren't publicly available. And so I'm reluctant to make those data available on behalf of the original investigators because, you know, it's theirs, right? It's not mine. I need to respect that. The code that I'm going to generate, I will post to a blog post that's linked down below in the notes. So I would strongly encourage you, if nothing else, to get the code, look at it, see if you can make sense of what's going on after the fact, and perhaps read along as I go through this episode. Again, we're here in RStudio. I'll go ahead and open up my Juneteenth by state.R script. And I'm going to go ahead and create a new R script. And I'll save this as Juneteenth map.R. I'm going to borrow some code from my Juneteenth by state script, so I'll take the first 17 lines or so from that script and plop them down here into my Juneteenth map.R script. I'm going to clean up a little bit of stuff just to make it a little bit easier to know what packages I'm working with. I'm loading the tidyverse. I've got my state lookup. I've got my lynchings per state per year. For now, I'm going to go ahead and comment out the assignment because I just want to work with the pipeline to get the data frame in the shape I want before I assign it back to a variable. I just find it's a little bit easier to work that way and assign it back to a variable at the end. So in the count, I'm going to break it out by lynch_state and lynch_county. And if I look at these first lines, I now see that I've got my lynch_state, my lynch_county, and the number of victims in each county. And I now have here an inner join with the state lookup table.
I had this column called name, which is the full name of the state. So I'm going to change that to be state because eventually I'll want the name of the victim, and having all these names will get a little confusing. So I'll change that in state_lookup. And so then I'll have lynch_state mapping to the abbreviation. And I now see that I have my lynch_state, lynch_county, the n, and the state. Because the data that I'm going to get from the maps package is going to have everything in lowercase, I'm going to go ahead and use a mutate to get rid of that lynch_county, make it county, and make my county and state names all lowercase, because that again is the format that's going to be coming to me from the maps package. And so I'll do mutate with county equals tolower of lynch_county, and then state equals tolower of state. And so again, I now have my n, my state, and my county, and I'll go ahead and do a select on state, county, and n. And so that's the data I want to insert into a choropleth map. And again, I now will assign this back to a variable, which I'll call lynchings per state. Maybe I'll say per state county. Great. So now we have lynchings_per_state_county, and we've got state, county, and the n. So the data that I need to make a county-level map, I can get from a function called map_data, and I can then give it the type of map I want. So if I give it "state", I get all the coordinates for the borders of all the states. If I give it "county", I get the coordinates for the borders of all the counties. So this data frame will be tremendously useful, because again, it'll allow me to draw all of the counties for all the states. It's coming in as a data frame, and it's quite large. And so I get, again, the longitude, the latitude, the group for drawing the borders, the order, as well as the region and subregion. So in this case, the region is the state and the subregion is the county, or I guess in Louisiana, it would be the parish.
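For anyone reading along, here is a minimal sketch of that counting pipeline. Because the real records aren't public, the lynchings data frame below is a made-up stand-in, and the column names lynch_state, lynch_county, and abb are assumptions about how the data and the lookup table are laid out, not a copy of the actual script.

```r
library(dplyr)

# Hypothetical stand-in for the lynching records (the real data are not public).
# One row per victim; the column names are assumed for illustration.
lynchings <- tibble(
  lynch_state  = c("AR", "AR", "FL", "GA", "GA"),
  lynch_county = c("White", "White", "Dade", "Clarke", "Clarke")
)

# Lookup from state abbreviation to full state name, built from base R's
# state.abb and state.name vectors.
state_lookup <- tibble(abb = state.abb, state = state.name)

lynchings_per_state_county <- lynchings %>%
  count(lynch_state, lynch_county) %>%                      # victims per state/county
  inner_join(state_lookup, by = c("lynch_state" = "abb")) %>%
  mutate(county = tolower(lynch_county),                    # map data are lowercase
         state  = tolower(state)) %>%
  select(state, county, n)

lynchings_per_state_county
```

The result has one row per state and county with the lowercase names that map_data() uses for its region and subregion columns.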
Now one of the things that I've been warned about when working with old data, so data from before say 1850 or certainly going back to 1877, is that the counties that we had, say, in 1877 are not necessarily the counties that we have today. So I want to do a check to see what counties are represented in map_data("county"), and whether those match the ones in my lynchings_per_state_county, right? I want to make sure that I have the right county names. I'm going to go ahead and create a variable that I will call state_county. For doing some tests, I'm going to simplify this a little bit, so I'll do a select on region and subregion. And so then, again, I'll have state_county with my state and my county. And I'll go ahead and do a distinct so that state_county only contains unique combinations of states and counties. Now what I'm going to do is use a special function called anti_join. An anti join looks at which rows are in one data frame but not in another. Another way of thinking about this is that an inner join joins two data frames together on a certain column in each of the data frames, right? If a key from one data frame is missing from the other, so that the joined row would be filled with NA values, then that row doesn't get brought into the inner join. The anti join will tell you which of those rows would have been discarded. So let me show you with an example. We do anti_join with lynchings_per_state_county, comma, state_county, and we'll give it a by with two columns to join on. So we'll do county equals subregion, right? We're going to use the county from lynchings_per_state_county and the subregion from state_county. And then we'll also do state, and we'll join that with the region column, right?
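The check just described can be sketched as follows. The three rows of counts are invented for illustration (the real data aren't public), and map_data() needs the maps package to be installed alongside ggplot2.

```r
library(dplyr)
library(ggplot2)  # map_data() also needs the maps package to be installed

# Hypothetical counts; "probably white" mimics an uncleaned county name.
lynchings_per_state_county <- tibble(
  state  = c("arkansas", "florida", "georgia"),
  county = c("probably white", "dade", "clarke"),
  n      = c(1L, 2L, 3L)
)

# One row per state/county combination known to the map data.
state_county <- map_data("county") %>%
  select(region, subregion) %>%
  distinct()

# Rows of the counts whose state/county pair has no match in the map data.
# These are the rows an inner join on the same keys would silently drop.
unmatched <- anti_join(
  lynchings_per_state_county, state_county,
  by = c("county" = "subregion", "state" = "region")
)

unmatched
```

When unmatched comes back with zero rows, every county name in the counts lines up with a county in the map data.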
So we're actually joining these two data frames on two columns, and we get back 28 state-county combinations that aren't shared between the two data frames. And a couple of these make sense, right? So certainly "unidentified Southwest Georgia county", right? That is not an official county name. Then "probably White". I'm pretty sure this should be White County, right? And so we could perhaps give it the benefit of the doubt and say this victim was from White County. I can also imagine that these periods in, like, St. Clair, St. Francis, and St. Johns are causing problems. And throughout here, you know, Dade, I think, is now Miami-Dade County. And there's a variety of others like this. And we might find that there are actually some typos. To clean this up, let me show you my typical process. I'll add a mutate line to the end of my lynchings_per_state_county pipeline. And I will then do county equals, and I'll use the case_when function to do this. You could do if_else or whatever, but I think it should be a little bit more compact to use case_when. So county equals equals "probably white", then tilde, and we'll set that to "white", right? And similarly, county "dade" we'll make "miami-dade", right? And then for now, I'm going to leave everything else alone. So TRUE, right? We want the last thing to be evaluated in a case_when to always be TRUE, so that everything gets a value, and we will assign that back to the original value of county. And I've got a period there instead of a comma. So let's run this and see if things get any better. And so we see now that Dade went away, and "probably white" went away as well. But I'm seeing a common mistake that I make, which is that I now have a Miami-Dade in Georgia. So what I should do perhaps is be a little bit more specific. Let's do state equals equals Arkansas and county equals equals "probably white" there, right? And then we can also do state equals equals Florida and county equals equals "dade".
So Arkansas needs to be in quotes, as does Florida. So good, we got rid of that Dade as well as the "probably white", and we didn't make a Miami-Dade, Georgia. Another example that I'll throw in here: I'll do county equals str_replace_all. Here I'm going to remove the periods from St. Clair, St. Francis, and St. Johns. And so we'll do that on the county column. And the pattern is going to be a backslash backslash period. The backslash backslash period means actually match the period character. Otherwise, the period means match any character, and we want to be very specific. And then we'll replace that with nothing. And then we'll feed that into the rest of the pipeline. So that brings us down to 14 other county names. I'm going to go ahead and advance the video forward so you don't have to watch me go do some digging. What I'll probably end up doing is googling these names, going to Wikipedia to see if they're current counties or not, or whether perhaps there's a typo, and then going through and fixing that. And I'll be right back. So it took me a few minutes of running through Google and Wikipedia to match the county and the state to what it should be. Some things were obvious, like De Soto should be two words instead of one. Other things had misspellings, like one that should have been Dinwiddie. And then there were a variety of others, mainly in Virginia, where they have these independent cities that aren't actually part of counties anymore. And so I did my best to assign them to the county or city closest to them. So when we run this, we now get from our anti_join a data frame with zero rows, which tells us that we have our counties properly aligned between the lynching database and the map_data("county") data frame that we're working with. So now I want to go ahead and join my lynching counts, my lynchings_per_state_county, to my map_data("county") data frame.
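Here is a small sketch of that cleanup step, using four invented rows rather than the real counts. It shows why the state condition matters: Georgia has its own Dade County, so only Florida's "dade" should be renamed.

```r
library(dplyr)
library(stringr)

# Hypothetical uncleaned counts (the real data are not public).
counts <- tibble(
  state  = c("arkansas", "florida", "georgia", "alabama"),
  county = c("probably white", "dade", "dade", "st. clair"),
  n      = c(1L, 2L, 3L, 4L)
)

cleaned <- counts %>%
  mutate(
    county = case_when(
      # Condition on state as well as county so other states are untouched.
      state == "arkansas" & county == "probably white" ~ "white",
      state == "florida"  & county == "dade"           ~ "miami-dade",
      TRUE ~ county                       # everything else keeps its value
    ),
    # "\\." matches a literal period; a bare "." would match any character.
    county = str_replace_all(county, "\\.", "")
  )

cleaned
```

Dropping the state condition from the second rule would rename Georgia's row too, which is the mistake caught in the episode.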
And I can then do an inner_join of my lynchings_per_state_county with map_data("county"). And we're going to do that with a by that has both the state and the county. That's important because there are some county names that are the same between different states. Okay, so we can do state equals region. And again, we'll do county equals subregion. We now see that we have each state, county, the n, and the latitude and longitude for each county. So I'm going to go ahead and save this as lynchings_map_data. And I can then use lynchings_map_data as input to ggplot. So that'll be the data, and we'll do aes with x equals long, y equals lat. And I'll go ahead and do fill equals n. And I think that should do it. And then I'm going to use geom_polygon. There also is a geom_map. I have found geom_map to just be horrible to work with, so I'm going to run with geom_polygon and see how much mileage we can get out of that. And here, I will do color equals black for the lines around my counties. And I will then save this with ggsave. I'll save this as lynching map.pdf, and I'll do height equals five, width equals four. And that's quite a mess. I realize what it's doing: this is kind of like when you use geom_line for a bunch of different factors or categories and you don't assign a grouping variable. So we'll go ahead and fix that. What I will do back here in my aes is group equals group. So now it kind of looks like a map. Let's go ahead and make it look a lot more like a map. And so let's add coord_quickmap. And so this looks like a map a little bit more now, right? We have Florida down here, Louisiana up here, Mississippi, Alabama, Georgia. But what I'm noticing is that if a county was missing from the counts, it doesn't show up here. Because again, when we did that join, it would have produced an NA, and when you do an inner join, that row then falls out. So let's come back to our inner join here.
And instead of inner_join, I think what I'm going to do is right_join, because I want to keep everything that's in the county map data along with what's in lynchings_per_state_county. I think this will work, but I'm suspecting it won't. So that solved one problem but caused another problem, right? We got all those counties that were missing, but now we've got all the states and all the counties that are in the map data set. And so I need to come back here and think a little bit more carefully about this. Instead of using map_data("county") here, I'm going to create another data frame using only those states that are represented in our data set. If I take my lynchings_per_state_county, that again has the states and counties, and let's look at distinct state. And I see that there are 12 distinct states in the data set. I'll use that to then do an inner_join with map_data("county"). So then the join by is going to be state equals region. And so now what I've got is the states and the counties. And I think that should do it. So this I'm going to call my county_map. And I will bring that back over here in place of map_data("county"). My column names are going to be state and subregion, so the by could be state, and county equals subregion. That seems to join pretty well. And so that looks a lot better, right? So again, the problem initially was that we were missing counties in our lynching counts, right? And so we needed to join in a way that got the coordinates for all the other counties. Doing that, we then got the counties for the entire United States. And so then we needed to filter to only get those states that were in our data set. We want only the counties from the states we're looking at. And we're seeing grays in there, and the gray is the NA color. But I'm going to treat the gray as a zero. And so I need to go back and change those NA values to zeros, because I'm going to assume that if a lynching wasn't reported, there wasn't one.
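Putting those pieces together, a rough sketch of the join and the first map might look like this. The counts are invented placeholders, the variable names follow the ones used in the episode, and map_data() again assumes the maps package is installed.

```r
library(dplyr)
library(ggplot2)  # map_data() also needs the maps package to be installed

# Hypothetical cleaned counts (the real data are not public).
lynchings_per_state_county <- tibble(
  state  = c("arkansas", "florida", "georgia"),
  county = c("white", "miami-dade", "clarke"),
  n      = c(2L, 1L, 2L)
)

# Keep county outlines only for the states that appear in the counts.
county_map <- lynchings_per_state_county %>%
  distinct(state) %>%
  inner_join(map_data("county"), by = c("state" = "region"))

# right_join() keeps every row of county_map, so counties with no recorded
# lynchings stay on the map with n = NA instead of dropping out.
lynchings_map_data <- lynchings_per_state_county %>%
  right_join(county_map, by = c("state", "county" = "subregion"))

# group = group tells geom_polygon() which points belong to the same outline;
# without it the polygons are drawn as one tangled path.
p <- ggplot(lynchings_map_data,
            aes(x = long, y = lat, fill = n, group = group)) +
  geom_polygon(color = "black") +
  coord_quickmap()

p
```

With an inner_join in place of the right_join, only the three counties in the toy counts would be drawn.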
Of course, we know that's not true, that there's probably a lot of underreporting. But for now, I'm going to be pretty content with that. And I can then do a mutate on n, where n is going to be an if_else. So if n is an NA, so we'll do is.na, then we're going to put in a zero. Otherwise, we're going to report the n. And so let's run these two lines and see if it works. It's complaining because false must be a double vector. I think zero is coming in as a double, whereas n is an integer. I think that's because we used count, and count creates integers, not doubles. Whatever. So we have a couple of options: we could say as.integer for the zero, or we could actually do 0L. The L will make zero an integer rather than a double. Now that runs without a problem, and we can then feed this into our ggplot. And so now we've got every county being some shade of, you know, dark blue to light blue. I'm not a fan of this color palette, so let's go change that now. I'll do that with scale_fill_gradient. And we will do low equals "#FFFFFF", so white, and then for high we'll do "#FF0000", which is red. And I'm going to go ahead and do limits of zero to NA. So I think this looks better. I'm going to put the legend to the side for now, and I'll worry about that later. What I want to focus on first are these borders. The county borders are really thick, and I don't really have any good definition of where the states are. I can see them, but maybe you can't. So what I'll start by doing is, here in geom_polygon, I'll do size equals, let's do 0.2. So that gives a much thinner look to the borders for our counties. And we might play with this again later as we start adding on a border for the states. And so again, I now want to draw on a thicker border for the states. To do that, I'm going to use another geom_polygon function call to add on the data for the states.
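As a small illustration of the NA-to-zero step and the color scale, here is a sketch on three made-up rows. The error message in the episode suggests the dplyr version in use required both branches of if_else() to have the same type; 0L works either way.

```r
library(dplyr)
library(ggplot2)

# Hypothetical joined data: county "b" had no match in the counts, so n is NA.
toy <- tibble(county = c("a", "b", "c"), n = c(3L, NA, 12L))

# count() produces integers, so the replacement is written as 0L (an integer)
# rather than 0 (a double) to keep both branches the same type.
filled <- toy %>%
  mutate(n = if_else(is.na(n), 0L, n))

# White for zero up to red for the largest count; limits = c(0, NA) pins the
# lower end at zero and lets the upper end follow the data.
fill_scale <- scale_fill_gradient(low = "#FFFFFF", high = "#FF0000",
                                  limits = c(0, NA))

filled
```

The fill_scale object can then be added to the map with a plus sign like any other ggplot2 layer.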
So to do that, we're going to do something very similar to what we did with the counties, where we're going to bring in the state data with map_data("state"). And so we're going to create a data frame that I'll call state_map. But before we get there, I'm going to run map_data("state"). So again, we get a data frame with the coordinates for each of the states, which will be very useful. And I want to join this with the actual states that I have in the data set. You'll recall that we did that back up here, and so I'll borrow this. So we'll do lynchings_per_state_county with distinct state. And just to remind ourselves what that looked like, that was a column of the 12 different states in the data set. And we can then do an inner_join with map_data("state"), with state equals region. And then we look at this and we get our coordinates for each of the different states. And then I can assign this back to state_map. And then I've got my data for state_map, right? Good. Now what we can do is add that geom here: geom_polygon with data equals state_map. The aes is going to be all the same, right? So we're going to have x equals long, y equals lat, group equals group. I don't even know if I need to state these aesthetics, but I think just to be safe, I'll include them. And I'll then go ahead and change one of the aesthetics: I'll make fill NA. You'll recall perhaps from a previous episode that if we have a space or a line or text or whatever, NA means make it actually transparent. So white is an object with color, and NA is an object with no color. So we'll do fill equals NA. And so again, our state map will sit over top of our county map to give us a thicker border. To set that, we then need to do color equals black. And I'll do size equals 0.5. Let's throw this on a following line and add that plus sign. Very good. We now have our thick lines around the states. I think that looks pretty good. Maybe they're a little bit too thick.
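The state-border overlay can be sketched like this, again on invented counts. One deliberate difference from the episode: recent ggplot2 releases prefer linewidth over size for line thickness, so the sketch uses linewidth, and it maps fill inside the county layer so the state layer never needs an n column.

```r
library(dplyr)
library(ggplot2)  # map_data() also needs the maps package to be installed

# Hypothetical cleaned counts (the real data are not public).
lynchings_per_state_county <- tibble(
  state  = c("arkansas", "florida", "georgia"),
  county = c("white", "miami-dade", "clarke"),
  n      = c(2L, 1L, 2L)
)

states_in_data <- distinct(lynchings_per_state_county, state)

# County and state outlines, restricted to the states in the counts.
county_map <- inner_join(states_in_data, map_data("county"),
                         by = c("state" = "region"))
state_map  <- inner_join(states_in_data, map_data("state"),
                         by = c("state" = "region"))

lynchings_map_data <- lynchings_per_state_county %>%
  right_join(county_map, by = c("state", "county" = "subregion")) %>%
  mutate(n = if_else(is.na(n), 0L, n))

p <- ggplot(lynchings_map_data, aes(x = long, y = lat, group = group)) +
  # Thin county borders, shaded by count.
  geom_polygon(aes(fill = n), color = "black", linewidth = 0.1) +
  # fill = NA draws no fill at all, so the counties underneath stay visible.
  geom_polygon(data = state_map, fill = NA, color = "black", linewidth = 0.3) +
  scale_fill_gradient(low = "#FFFFFF", high = "#FF0000", limits = c(0, NA)) +
  coord_quickmap()

p
```

Because the second layer is added last, the heavier state outlines are drawn on top of the county polygons.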
Maybe we could do 0.3 and 0.1. Yeah, I think that looks a lot better. It's not so bold in drawing the lines. It doesn't really take your eye off of the individual counties as much as when those lines were a lot thicker. We're now going to add some theming to this. So we'll go ahead and do theme_void. That then turns off all the theming. We don't need the latitude and longitude. I don't really care about that so much. What I really care about is where we see the cases of lynchings here in these 12 states. Good, I'm happy with that. Let's go ahead and add a title. So I can add a labs function here and say title equals "Lynchings were widespread throughout the US south between 1877 and 1950". And we saw in the last episode that we could create a variable to plug in there with the glue package. I'm not going to worry about that right now. If you want to do that, go for it. I give you permission. And so we see our title is too long. We've seen that we can fix that by doing theme, and we can then say plot.title equals element_textbox_simple. And I need to make sure that I've got library(ggtext) loaded here. So we're getting in the ballpark. Let's go ahead and use face equals bold, size equals 18. One of the challenges with using coord_quickmap with maps is that it's basically a special version of coord_fixed, which fixes the spacing of the units on the x axis relative to the y axis. I find that that then causes big problems with laying out your figures, say in a PDF or some other file format with fixed dimensions. Before I go in and muck with the title anymore, I'm going to fix that legend. And maybe what I'll do is lay it on its side and put it here in the Gulf of Mexico. So first I will give it a name. So I'll say name equals "number of lynchings\n". So I've got a legend title now. Again, what I'm going to do is lay that down on its side here in the Gulf of Mexico.
So I'll do legend.position equals c, and then you give it a two-element vector for the x and y position of where you want the legend. So I'll do 0.3 and 0.1. We're getting there. We've moved it to the Gulf of Mexico. Now we'd like it to lay on its side. And to get that I can do legend.direction equals horizontal. So that looks pretty nice. I'm going to go ahead and capitalize the L in lynchings because to me, it seems like it's a title, and it just looks like it's an i rather than an l. And again, I'm doing that up here in my name for scale_fill_gradient. And so I think that looks better. Let's now go ahead and look at the margins that are around our title. I can set that up here in element_textbox_simple, where I can say margin equals margin. And I'm going to say for the top, let's do five, right, zero, bottom, 10, left, zero. And I probably need something on the left here, right? So I'll put in five there. One last thing that I think I'll do to tweak this is the title. I think I should say lynchings of Black people were widespread throughout the US South. That's something I wish I would have included in the previous episode's titles, because yeah, there were white people that were lynched as well as Hispanics and, I'm sure, Native Americans. But the data we're looking at here are Black people. So let's make it specific that we're looking at Black people's lynching deaths. And so we'll say lynchings of Black people. I'm happy with the way this turned out. I think it does a very nice job of showing where lynchings occurred over this time period. And it is a bit jarring to look at and think about. And I know that I am distracted because I want to look at this and I want to look at the individual counties and see where those lynchings occurred. So one critique that I would have of this visual, and it's kind of a critique of all heat maps, is that we have a very skewed distribution in the number of lynchings per county.
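The title and legend settings described above can be sketched on their own. To keep the example independent of the map data, it draws a single made-up square instead of real counties; only the theme, labs, and scale settings are the point. It assumes the ggtext package is installed.

```r
library(ggplot2)
library(ggtext)  # provides element_textbox_simple()

# A single hypothetical polygon standing in for the county map.
square <- data.frame(long = c(0, 1, 1, 0), lat = c(0, 0, 1, 1),
                     group = 1, n = 5L)

plot_title <- paste("Lynchings of Black people were widespread throughout",
                    "the US South between 1877 and 1950")

p <- ggplot(square, aes(x = long, y = lat, group = group, fill = n)) +
  geom_polygon(color = "black") +
  scale_fill_gradient(name = "Number of lynchings\n",
                      low = "#FFFFFF", high = "#FF0000", limits = c(0, NA)) +
  labs(title = plot_title) +
  theme_void() +
  theme(
    # A text box wraps a long title instead of letting it run off the page.
    plot.title = element_textbox_simple(
      face = "bold", size = 18,
      margin = margin(t = 5, r = 0, b = 10, l = 5)
    ),
    # x and y as fractions of the plot area. Newer ggplot2 releases prefer
    # legend.position = "inside" together with legend.position.inside.
    legend.position = c(0.3, 0.1),
    legend.direction = "horizontal"
  )

p
```

On the real map, the same theme() call places the horizontal legend in the open water of the Gulf of Mexico.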
And so we have these three counties in northern Louisiana that have, you know, very high levels, probably around 30 or higher. And because we've got to represent those with a color, it kind of mutes the colors of all the other counties. Something we might think about doing would be, you know, can we perhaps turn this into a categorical variable? So we might have a category that's, say, greater than 20. And so those greater than 20 are all dark red. And then we scale between zero and 20 rather than between zero and 30 or zero and 35, as it were, right? And that might give some of those paler reds a little bit more color to make it easier to differentiate between the shaded counties and those counties that are white, indicating zero, right? Anyway, there's always more that you could do with the visual. But like I said, I find this very compelling. I want to spend time looking at the different counties. It's a story, right? It's an interesting story. And as we said in the last episode, I'm kind of wrestling with this data set, and I want to get to a finer and finer scale. Because, you know, this isn't just the number of lynchings. This is the number of people killed. And a number is an aggregate, and it loses track of the individuality of the person. And, you know, in many cases, we don't have a choice, right? We need to aggregate people together. But something that I would like to do for this series of videos is really get down to the individual level. So what we'll do in the next episode is take this map, this framework, and build an animation where I will light up each county for each victim. And then for each victim, we're going to display their name with the county that they were murdered in, and then use gganimate to make an animation out of that. And I think that will be very powerful. So I'm looking forward to working through that with you.
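One possible way to try the capping idea mentioned above, which is not something done in the episode, is to clamp the counts at 20 before plotting and label the top of the legend accordingly. The counts below are invented, and the cutoff of 20 is just the example value from the discussion.

```r
library(dplyr)
library(ggplot2)

# Hypothetical per-county counts with one extreme value.
toy <- tibble(county = c("a", "b", "c", "d"), n = c(0L, 4L, 20L, 35L))

# pmin() caps each count at 20, so everything at or above the cap shares
# the darkest red and the paler counties get more of the color range.
capped <- toy %>%
  mutate(n_capped = pmin(n, 20L))

# The top label reads "20+" so the legend stays honest about the cap.
capped_scale <- scale_fill_gradient(
  low = "#FFFFFF", high = "#FF0000",
  limits = c(0, 20), breaks = c(0, 10, 20), labels = c("0", "10", "20+")
)

capped
```

Mapping fill to n_capped and adding capped_scale to the plot would apply this to the map; whether 20 is the right cutoff is a judgment call that depends on the real distribution.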
But make sure that you are subscribed to the channel so that you know when that episode comes out. I'm not totally sure when that is going to be released. It might be Saturday on Juneteenth, or I might save it until Monday so that we can have more time to digest these materials. We'll see. Anyway, I hope you do request the data and work with the data on your own. There is far more data in here than what we are working with. And I really encourage you to work with the data. I think by working with the data, you also learn a part of our history, right? And you gain a greater appreciation for the barbaric practices of our country 100 years ago. And perhaps we realize that a lot of things haven't changed that much, unfortunately. So anyway, we'll see you next time for another episode of Code Club.