Hey folks, I'm in the midst of a series of episodes looking at different ways of working with and visualizing data related to climate change. In the past, we've looked at global climate change data, specifically temperature anomalies. Now we're moving on to more localized weather data. I'm looking at data from a weather station in Ann Arbor, Michigan, which is just outside of where I live in Dexter, Michigan. In the last episode, I showed you how we can access the data files from NOAA, the National Oceanic and Atmospheric Administration here in the United States, which keeps track of all the weather and climate data here. It also has data for people in other parts of the world, so if you have your latitude and longitude, you can get the data for your neck of the woods, so to speak.

We developed an R script in the last episode that I'm calling local_weather.R. If you want a copy of this script, as well as everything else I have in this project, go down below in the description; there's a link to a blog post that will get you everything you need from GitHub to get going. This is the script we created. If you want to customize it for your area, all you need to change are these numbers here for latitude and longitude. And again, in the last episode, I showed you how you can very easily get those values from Google Maps. There's also an R package for working with Google Maps that would let you put in your actual latitude and longitude directly, regardless of where you are. I'm not going to do that because it takes some steps that I'm just not willing to do, because I don't care that much, because I can use Google Maps. Anyway, you do you. But I think it'd be great if you're able to put in your own latitude and longitude, just to personalize this a little more.

As we saw in the last episode, we then used our local latitude and longitude to find the station closest to us that has data starting before 1960 and ending after 2020. Once we had that station, we pulled down its data by creating the URL with the glue package and reading it in with read_csv() to get the temperature, precipitation, and snow data. There are a few other data points that generally come with each weather station, things like TMIN and TOBS. I'm going to work with TMAX because that's probably the most extreme. So let me go ahead and run all this so we have that local weather data loaded. It takes a little bit of time to run because, again, it's downloading some pretty big files. But once we have it here on our computer, we can of course look at local_weather, and we see that we have a data frame with the date, the TMAX, the precipitation, and the snow.

What I'd like to look at in today's episode is whether there's any anomalous data in here. Whenever we're doing an analysis, it's always good to make sure the data make sense. So I want to show you a couple of strategies for diagnosing problems in the data, as well as ways, with quantitative data at least, that we can go about removing those outliers, as we might call them. I'm going to show you two approaches. The first is to plot the data as a line plot, and the second is to plot it as a histogram. So if I take local_weather and pipe that to ggplot(), on the x axis I'm going to put the date, and on the y I'll put each of these variables; let's start with TMAX. Then we'll add geom_line(). And I don't see anything too far outside of the norm.
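As a minimal sketch, assuming local_weather is the data frame built in the last episode and still has NOAA's uppercase column names, that first line plot looks something like this:

```r
library(tidyverse)

# daily high temperature over time; assumes local_weather has
# columns date, TMAX, PRCP, and SNOW as downloaded from NOAA
local_weather %>%
  ggplot(aes(x = date, y = TMAX)) +
  geom_line()
```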
However, one thing I notice is that these temperatures are really high, up to 400 degrees, so something must be off. If I come back to the directory where I got the local weather station data from, you'll recall that at the very bottom of that directory there was a readme for the by-station data. Let's open that up and see if there's any information about the precipitation, snow, or temperature data to tell me what the units are; I had just assumed it was Celsius and maybe centimeters or millimeters. I don't see anything in here. Actually, what I do see is 'See section III of the GHCN-Daily readme.txt file for an explanation of the element codes and their units.' So let's go back to that. That was in the parent directory, and down here we have this readme file. If we scroll down to section III, the format of the data files, we can see all the different fields, and here's ELEMENT. There are five core elements, as well as a number of additional elements. We're working with PRCP, SNOW, and TMAX, and we can see the units here: precipitation is actually in tenths of millimeters, not millimeters; snowfall is in millimeters; and the maximum temperature is in tenths of degrees Celsius. So where the plot might say 400 degrees, that's really 40 degrees. We'll want to go ahead and fix that.

What we can do is come back to our local_weather pipeline, and I'm going to add a mutate() where I do TMAX equals TMAX divided by 10 to get it into degrees Celsius. My precipitation is in tenths of millimeters, so I'll also do PRCP equals PRCP divided by 10 to get that into millimeters. I'm also going to rename TMAX, PRCP, and SNOW to all be lowercase. To do that, we can use rename_all(), and the function it will apply to all the columns is tolower. So if I look at local_weather again, I see I have temperatures that are much more reasonable, and all my column headings are lowercase, which means in the plotting code I need to change TMAX to tmax. Let's regenerate the line plot. We can see there's a fair amount of variation; again, these are daily temperatures, so over the course of a year you're going to have a pretty wide spread. In Michigan it basically runs from about 35 down to about minus 15 Celsius over the course of the year. Cool.

All right, so that's one way of looking at it: take your data and plot it across some variable, just to see whether anything sticks out as weird. Another way would be to take local_weather and pipe that to ggplot() to make a histogram. Here I'll do x equals tmax, since the variable we're interested in goes on the x axis, and then with geom_histogram() we can plot the number of observations in each temperature bin. This gives us a somewhat bimodal distribution, probably reflecting winter months and summer months. It also tells us it's using bins = 30 and that we should pick a better value with binwidth; we could do binwidth = 1 to get a one-degree bin width. And here you see something that's a little bit funny: you get kind of a sawtooth shape around every five degrees, which makes me wonder if there's something about the temperature sensors that doesn't like multiples of five, because there's kind of a dip at zero, 5, 10, 15, 20. That's a little bit weird.
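As a sketch, the unit conversion and renaming might look like this, assuming the uppercase columns from the download:

```r
# convert NOAA's tenths-based units and lowercase all column names
local_weather <- local_weather %>%
  mutate(TMAX = TMAX / 10,     # tenths of degrees C -> degrees C
         PRCP = PRCP / 10) %>% # tenths of mm -> mm
  rename_all(tolower)
```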
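And a sketch of the histogram, with the tidyverse still loaded and a one-degree bin width:

```r
# frequency of daily highs in one-degree bins
local_weather %>%
  ggplot(aes(x = tmax)) +
  geom_histogram(binwidth = 1)
```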
Anyway, that's why you'd perhaps want to use a larger bin width, like 2. That smooths out the plot a little bit, and if we do 3, it smooths it out a little more. Of course, you could go up to something like 5, but then it's just going to look like a blob and you won't get a lot of good definition. So let's go with 2.5; I think that looks okay.

All right, that works well for tmax, so let's repeat this for our other variables. I'll bring that down, and instead of tmax let's do prcp. Here with precipitation, we see there are two dates that are much higher than all the others. So I'm going to ask myself a few questions about these data points. Was this monsoon season? Was there some freak storm that came through Michigan that we just don't remember? Well, I could do a Google search for the actual date. I can get that date by doing something like local_weather piped to slice_max() with n = 5 on, let's say, prcp. So we're looking at the data to see if it makes sense. This was September 5th of 1953, and we supposedly had 1,286 millimeters of rain. For my American colleagues, that's 1286 divided by 10 to get to centimeters, and divide that by 2.54 to get to inches: that is a 50-inch rainfall. That is of biblical proportions. I don't think that's real. Even 500 would be on the order of about 20 inches of rain, so that's not logical either. These other values of around 115 millimeters, that'd be like 11 centimeters, or about four and a half inches of rain; that seems like a lot too, but I'm inclined to trust it as legit.

Let me show you what happens if we use the histogram instead. I'll grab the code for the histogram and, again, instead of tmax we'll do prcp. And I'll remove the binwidth because bins of millimeters will be different than bins of degrees Celsius. Here what you see is that most days have zero, and then it falls off pretty quickly. There's probably something out at the right side of the distribution, which we know, of course: that 1,286 is way out here. But there's only one such observation, so the bar is so small you really can't see it. Perhaps we could add scale_y_continuous() with limits of zero to 50. Then we see there's something around 500 and something way out here. For the most part, there's a pretty consistent set of precipitation observations on the left side of the plot, but the things at 500 and greater probably aren't real.

So let's do the same type of analysis, but now with the snow. I'll take these three code chunks that I used with precipitation and instead use snow: put snow there, there, and there. We'll run the line plot, and here we see there appears to be one outlier around 1930, where we had something like 1,500 millimeters of snow. That's just a gargantuan amount of snow, even for Michigan, so that seems weird. Let's look at the five highest days of snowfall. And yeah, March 8th, 1931: 1,500 millimeters of snow, which seems way high. As I understand it, when there's snow, the precipitation column is basically the rain equivalent of the snow: they take the snow they collect, melt it, and see how much water they get, in this case 11.4 millimeters of rain.
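Putting those checks into code, here's a sketch of the outlier hunt for precipitation and snow, using the lowercase column names from above:

```r
# five wettest and five snowiest days on record at this station
local_weather %>%
  slice_max(prcp, n = 5)

local_weather %>%
  slice_max(snow, n = 5)

# zoom in on the rare bins; note that limits = c(0, 50) in
# scale_y_continuous() drops any bar whose count exceeds 50
local_weather %>%
  ggplot(aes(x = prcp)) +
  geom_histogram() +
  scale_y_continuous(limits = c(0, 50))
```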
So that's barely over a centimeter, or less than half an inch, of rain. I don't think 1,500 millimeters of snow would melt down into that little water, so I think that snow value is anomalous; everything else here seems legit. Again, we could look at the histogram, and we'll see that there's a value way out here. This is using scale_y_continuous() with limits of zero to 50, and when you use that limit, if a bar goes beyond 50, scale_y_continuous() removes those data; it basically drops anything that doesn't fit in the plot (if you wanted to zoom without dropping data, coord_cartesian() would do that instead). Anyway, what that allows us to see is that this one value is way out there on the right and is a bit of an outlier.

Again, my strategy for diagnosing an observation as an anomaly, an outlier that I don't want to include, is to ask: do the data make sense? Is this plausible? If I was collecting this data in my lab, I might ask whether we have any evidence that the recording device was on the fritz, or whether we perhaps changed how we were doing things. With historical data like this, we could use Google to go back to the 1930s and ask whether there were any gigantic snowfalls or rainfalls back then. There weren't. And so those types of checks support my thinking that we can safely remove those observations.

To remove them, I can take two approaches. First, I could use a filter to remove anything that doesn't behave, or I can convert those values to NA and then drop the NA values. Let me show you how I would do both. I'll start with the filter-based approach. We'll do filter(), and in it prcp less than some threshold: if a precipitation value is less than the threshold, we keep it in the analysis. If we go back and look at our plots for precipitation, we can see that if we picked a value of, say, 200, we would safely remove those anomalous values. So let's go with 200. Then, if I come back down and run geom_line() on the corrected data, I now see that I no longer have those anomalous values and that the data have been safely cleaned up.

Now let's come back and see how we could do it by converting the values to NA and then removing those NA values. This time I'll use a mutate() on snow with an if_else() statement: if snow is less than some value, I want to keep snow, the actual observed value; if it's greater than that value, I'm going to return NA_real_. That NA_real_ is an NA, but it's a double, because if_else() requires that the true and false values be of the same type: snow is a double, and NA_real_ is a double. So what should this threshold be? Well, if we come back to our plot, we see that 500 is a pretty good threshold, so I'll plug 500 in here and run that. Now if I come back down and run the line plot, I see that I've gotten rid of that anomalous value.

Looking back at these two lines, you might ask why you'd choose one approach over the other. Well, with the filter-based approach, I'm removing a specific day and all of that day's data from the data set, so I'd also remove the temperature data for that anomalous day. Maybe you want that; maybe you don't. With the mutate-based approach, however, you're only changing the snow value for the days that are problematic.
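Side by side, a sketch of the two approaches, with the 200 mm and 500 mm thresholds eyeballed from the plots:

```r
# approach 1: filter() drops the entire row, including that day's tmax
local_weather %>%
  filter(prcp < 200)

# approach 2: if_else() blanks out only the offending snow values;
# NA_real_ is used so both branches of if_else() are doubles
local_weather %>%
  mutate(snow = if_else(snow < 500, snow, NA_real_))
```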
So for that reason, I would probably prefer the mutate-based approach. Let me show you the code for that. I'm going to go ahead and comment out the filter, and I'll add to my mutate(): prcp equals if_else(), prcp less than 200, then return prcp. Otherwise, well, let me show you what happens if we put in a plain NA rather than NA_real_; this gives you a chance to see the error message you'll get. The error says there was a problem while computing prcp, caused by an error in if_else(): 'false' must be a double vector, not a logical vector. The NA value is in the false spot; the first argument is the logical question, the second is the value if true, and the third is the value if false. A plain NA is a logical, while prcp is a double, and that, again, is why we need NA_real_ to return an NA that's a double rather than a logical. So if we run that, we don't get any error messages, thankfully. We can go back and look at our line plot for the precipitation and see no anomalies, then do the same thing for the snow, and it looks very good.

Now, the data I have here don't lend themselves very well to talking about categorical data that might contain an outlier, so let me show you how we'd handle that if we had an outlier that was a categorical data point. For example, if I look at local_weather, let's pick a date, say this 1891 date in October. We could add the date column to the mutate and use na_if(): if the date column equals some value (that's the y argument in na_if(x, y)), it gets replaced with NA. So let's do "1891-10-09". Looking at local_weather again, we see that we've turned that specific value into an NA. So if that date, or that label, or that category was wrong for some reason, we could use na_if() to blank it out.

We could do the same type of thing with an if_else() statement: date equals if_else(), date equals equals, let's do "1891-10-05"; if it's that date, we return NA_character_, otherwise we return the date. And I'm getting an error message: problem while computing date, argument 'true' is missing with no default. Ah, I've got a true and a false, but unfortunately I have a misplaced parenthesis in the middle, so the parenthesis is excluding the true and false arguments. Let's try that again. That gives me a different error, because date is of type Date and NA_character_ is a character, and if_else() won't mix the two. So what we need to do instead is wrap the NA in as.Date(). That's a function that converts a character into a date, and of course it will take an NA and make it an NA date; there isn't a built-in NA_date_ variable. And that works: if we look at local_weather, sure enough, we now have that NA value in there.

Okay, so these are different strategies for identifying outliers, data that we want to investigate further. Sometimes those outliers are real, but sometimes, like the cases we had here, I think they're junk; there's probably something wrong with the sensor. I'm doing this fairly empirically, looking at the data and saying, that looks weird. I suspect that when NOAA and NASA look at all these weather stations around the world over many, many years, they probably have some algorithmic way of identifying truly anomalous data.
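Before we wrap up, here's a sketch of those two date-based replacements, using the 1891 dates from above, plus the drop_na() calls that come up next; wrapping the dates in as.Date() keeps the types consistent:

```r
# na_if(x, y) replaces values of x equal to y with NA
local_weather %>%
  mutate(date = na_if(date, as.Date("1891-10-09")))

# if_else() needs both branches to be the same type, so wrap NA in as.Date()
local_weather %>%
  mutate(date = if_else(date == as.Date("1891-10-05"), as.Date(NA), date))

# drop_na() removes any row containing an NA...
local_weather %>%
  drop_na()

# ...or only rows where a specific column is NA
local_weather %>%
  drop_na(prcp)
```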
Coming back to how an agency might flag anomalies algorithmically: think of a case like a 20-inch rainfall in the middle of the desert. Perhaps what you might do is look at a standard deviation and ask whether this is a three or four standard deviation rain event, and if so, throw it out. Anyway, that probably takes a bit deeper thinking than what we're trying to do here, where, again, I'm just trying to show some principles of how to identify weird data and then how to remove those data from your data set.

One final thing to show you is that if you have NA values in your data set, like we do here, one way to get rid of the whole row, like you would with filter(), is drop_na(), as sketched above. Then if we look at local_weather, we see that those two days, October 5th and October 9th, are now gone; drop_na() removes the rows where we had an NA value. And of course, you can also give drop_na() a variable to only remove the days where we had an NA in a specific column. So if we did drop_na(prcp), that would remove rows where we had an NA for precipitation rather than for the date. Now if I look at local_weather, I still have those two rows with the NA dates, because it only dropped rows where the NA was in the prcp column.

Okay, so, a couple of different ways of doing this; you do what works well for you. I think it would probably be good to keep notes, keep comments, about why you're removing certain data points, just for your own transparency and for reminding yourself: why did I pick 200? Why did I pick 500? Anyway, I'd be interested in knowing what kinds of anomalies you see if you try this with data from wherever you live. Do you have a 500 millimeter snowfall? Do you have no precipitation, or tons of precipitation? What are the weird things you're seeing in your data? Let me know down below in the comments. Ultimately, though, do this with your own data, data that you're generating for your own research, whether that's in a laboratory or out in the field. I would love to know what kinds of challenges you run into when you're recording data on your instruments and what kinds of anomalies you find. There are other packages out there for doing this type of work, but I ultimately find that this approach of using if_else() statements and filtering is a lot easier than having to remember how to use some other package that would ultimately do this for me. You let me know what you think of that philosophy. All right, we'll see you next time for another episode of Code Club.