Hey folks, in recent episodes I've been working on a project with the end goal of creating a world map that color-codes the level of droughtiness for the last month relative to that same period of time over the past hundred or so years. To do this, I need to consolidate a lot of data that I've been pulling down from the NOAA website. This data is being reported for 120,000 different weather stations. My vision is that I want to sum up the total amount of precipitation over the window of time I'm interested in for each set of co-localized weather stations. There might be a case, like say in Ann Arbor where I live, where there are multiple weather stations going at the same time. They're really close to each other, and on a map of the world I'm not going to see a difference of a couple of miles. So I'm thinking about one of two approaches. One approach would be to calculate the distance between all pairs of weather stations, apply some type of clustering algorithm to clump them together, and then calculate the average amount of precipitation over the time window for each cluster of weather stations. I think that would take a very long time. It would certainly be more precise than the alternative, which is what I'm actually going to do, but I don't think the human eye could perceive the difference between the two approaches. So I'm going to go with the easier approach: take the degrees latitude and longitude of each weather station, round them to the nearest whole degree, or maybe out to the tenths place, and then pool all the weather stations that share the same rounded latitude and longitude. That's what we're going to do in today's episode.
Before I dig into that, though, I want to share with you a little bit about how R does rounding and the different functions that are available in R to carry out these rounding procedures. R has at least five different approaches to rounding a floating point number. A floating point number is a number with values after a decimal point; we also call those doubles in R, as opposed to an integer, which is a counting number. Most of the time in R we don't think about integers. Even when we have a number like, say, three, we typically represent it as a double without really thinking about it. One of the nice things about R is that you generally don't have to think about what type of data you have, although some people actually think that's a detriment of R. Anyway, we're going to look at these different approaches to rounding numbers in R. To illustrate them, I'm going to create a table whose first column, which I'll call x, is seq(-2, 2, by = 0.1). So you can see I've got this vector starting at negative two and going by tenths up to two. Then I'll create some additional columns that apply the different ways of rounding: a round column that is round(x), a trunc column that is trunc(x), a floor column that is floor(x), and a ceiling column that is ceiling(x). Let's look at all of the outputs so we can compare and contrast the different approaches; I'll do print(n = Inf) and maximize the window. Looking at these five columns, we see that in the first column we again have x going from negative two to positive two. If you think about the rounding we learned in, say, elementary school, you round to the nearest integer, right? But what R is doing is something different. If you look at negative 0.5, it actually goes to zero rather than to negative one. Hmm, why is that?
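A minimal sketch of that comparison table (assuming the tidyverse, or at least dplyr and tibble, is loaded) might look like:

```r
library(tibble)
library(dplyr)

# Build a table with x running from -2 to 2 by tenths, plus one
# column per built-in rounding function, then print every row
rounding <- tibble(x = seq(-2, 2, by = 0.1)) |>
  mutate(
    round   = round(x),
    trunc   = trunc(x),
    floor   = floor(x),
    ceiling = ceiling(x)
  )

print(rounding, n = Inf)
```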
Also, if you look at 0.5, that goes to zero as well, whereas 1.5 goes to two. What R does, following the IEEE standard, is round a tie like 1.5 to the nearest even number. The idea is that if you round ties to the nearest even number and you're rounding a large collection of numbers, then about half the time you'll round up and half the time you'll round down, whereas if you always rounded ties up like we learned in elementary school, you'd obviously be rounding up more often than down. Now, let's look at some of these other columns. trunc is short for truncate, and truncate basically removes everything to the right of the decimal point. So we see negative two, then the values between negative 1.9 and negative one go to negative one, and the values between negative 0.9 and zero go to zero, because if you remove the values to the right of the decimal point, a zero is what you're left with; we see the same thing for the positive values. With floor, R is rounding down to the next lowest integer: negative 1.4 goes to negative two, negative 1.1 goes to negative two, negative 0.1 goes to negative one, and on the positive side 0.8 goes to zero, 1.6 goes to one, and two stays at two. ceiling goes to the next highest integer, so negative 1.9 goes to negative one, negative 0.5 goes to zero, and so forth. Those are four of the built-in ways to do rounding. One other way I'll share with you is the as.integer() function, so we can add an integer column that is as.integer(x). What we see is that the output of as.integer() is the same as trunc() with one subtle difference: trunc() outputs a double, whereas as.integer(), of course, outputs an integer.
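You can see that type difference between trunc() and as.integer() directly with typeof():

```r
x <- seq(-2, 2, by = 0.1)

typeof(trunc(x))       # "double"
typeof(as.integer(x))  # "integer"

# The values themselves are identical; only the storage type differs
all(trunc(x) == as.integer(x))  # TRUE
```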
And again, we can see that things line up in the trunc column and the integer column, for the negative values and the positive values alike. So as.integer() and trunc() do the same thing, except as.integer() makes the result an integer whereas trunc() and all these others leave it as a double. There's one other approach to rounding that I'll go ahead and put in here, and that's signif(), which rounds to a given number of significant digits. So if we do signif(x) and give it the number of digits we want to round to, say one, we're now rounding to that number of significant digits. Remember that significant digits isn't the total number of digits in the number, it's the number of digits that are significant. So negative 1.5 rounds to negative 2, whereas negative 0.4 already has one significant digit and doesn't get rounded at all, and we see the same thing for the positive values. I said five different approaches, and I've already shown you six. What do I know? Anyway, that's what's going on with a fairly simple vector of numbers from negative 2 to 2. Let's look at another practice value: I'll call it x and set it to 100 * pi. pi is a stored constant within R, and multiplying by 100 gives us some more digits on the left side of the decimal point. If I do round(x), that gives me 314; again, it's rounding to the nearest integer by default. Whereas if I do round(x, digits = 2), that gives me two digits to the right of the decimal point. If I were to do something that might seem silly and use digits = -2, think in your head what that might do. Well, that rounds to two places to the left of the decimal point, and so we get back 300.
It's basically taken 314 and rounded it down to 300. By the same token, if I do digits = -1, that gives me 310. And if I do the same type of thing but with digits = 4, that rounds to four places to the right of the decimal point. Okay. So again, the difference between round() and signif() is that round() rounds to a number of digits relative to the decimal point, while signif() rounds to a number of significant digits. If I do signif(x, digits = 1), that gives me 300. If I do the same thing with 3, I get 314 again. And with 6, I get 314.159. So, to repeat: round() rounds to a prescribed number of digits to the right of the decimal point, and if you give it a negative number, it goes to the left; signif() rounds to the desired number of significant digits. All that's very cool. Now, what do I want to do for my application of rounding degrees latitude and longitude? I think I'd like to round to the nearest whole degree of latitude and longitude. A degree of latitude or longitude is roughly 60 to 70 miles, and I think that's enough resolution; thinking about where I live and what's within about a 60-mile diameter of where I live, that's a pretty cohesive environment. Certainly, I don't think that on a world map we'll see anything much smaller than that. So let's try whole degrees, and maybe we'll go to tenths of degrees depending on how many weather stations we get back. I'm going to create a new R script, and I'll call it merge_lat_long.R. Of course, I'll start it with library(tidyverse) and make sure that's all loaded. I'm also going to bring over the shebang line so that I have it for when I want to run this from the command line.
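Here is the round() versus signif() comparison from above in one place, using 100 * pi:

```r
x <- 100 * pi  # 314.1593...

# round(): digits counts places relative to the decimal point;
# negative digits move to the left of it
round(x)               # 314
round(x, digits = 2)   # 314.16
round(x, digits = -2)  # 300
round(x, digits = -1)  # 310
round(x, digits = 4)   # 314.1593

# signif(): digits counts significant digits
signif(x, digits = 1)  # 300
signif(x, digits = 3)  # 314
signif(x, digits = 6)  # 314.159
```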
So we need to get data into our session to be able to manipulate and round those degrees latitude and longitude. In the data directory here, I have a file called ghcnd-stations.txt. This has all of my weather stations, as well as their latitude, longitude, elevation, city, and some other information about them. I'm going to start with read_tsv() to read in the data: read_tsv("data/ghcnd-stations.txt"). That reads it in, and I notice that there's no header, but I also notice that it's only one column. We saw this before in the episode where I talked about fixed-width formatted files. If I come back to the readme file and scroll down to where it describes the format of this ghcnd-stations file, sure enough, it is a fixed-width formatted file. There do appear to be spaces between all of the columns, so that's cool: I could always use read_fwf(), and by default read_fwf() will try to figure out the column boundaries based on where the spaces are, using fwf_empty() as the default. But I want to share with you another way we could read this in. I'll start by grabbing the format table from the readme and pasting it in here as a comment, and I'll also grab the URL so that I've always got that accessible if I need to look this up again. Then I'll give read_fwf() the col_positions argument set to fwf_cols(), which takes a named set of position vectors, if that makes sense. So I can do id = c(1, 11), meaning the ID comes from positions 1 through 11. Then latitude = c(13, 20), and longitude = c(22, 30). I don't want the hyphen, I want the comma between the positions. I'll go quickly through the rest of the columns because they're not so important. So I've got that all coded in, it reads in, and we've got all the different columns we want.
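A sketch of that read_fwf() call, using the column positions read off the readme in the episode (only the first three columns are shown here; the remaining ones follow the same pattern):

```r
library(readr)

# Read the fixed-width station file, spelling out where each
# column starts and ends rather than relying on fwf_empty()
stations <- read_fwf(
  "data/ghcnd-stations.txt",
  col_positions = fwf_cols(
    id        = c(1, 11),
    latitude  = c(13, 20),
    longitude = c(22, 30)
  )
)
```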
Recall that we can also give this the col_select argument, where I list the columns that I want: id, latitude, and longitude. That outputs the three columns I'm most interested in. Before I go too far into rounding my numbers, I want to double-check the precision that I have on my latitude and longitude. The tibble output shows one value to the right of the decimal point, but it does some formatting to make things look more attractive, so for all I know there could be eight digits there. Let's double-check. I'll do a filter(id == ...) with the first weather station's ID popped in there, and of course we get basically what we had before. Then let's do a pull() on latitude, and here we see that we have four digits of precision to the right of the decimal point. So if we throw in a mutate() with latitude = round(latitude, digits = 1) and pipe that to our test code here, we see that we went from 17.1-something-something to 17.1. Let's also add longitude = round(longitude, digits = 1), and then remove this test code. Instead, I'm going to pipe this to count(longitude, latitude). This shows me that I now have close to 72,000 different combinations, and that seems like a lot more than I need; again, that's basically 72,000 pixels on a plot, and I think that's too much resolution. So let's drop digits down to zero so that we're rounding to the nearest whole longitude and latitude, and that brings us down to about 8,000 different combinations of longitude and latitude. Again, we'll have perhaps 8,000 pixels on our final visual. It still seems like a lot, but let's roll with it; I think that'll be fine. So now what we can do is group our data by the longitude and latitude.
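To make the collapsing concrete, here is a hypothetical three-station example (the ids and coordinates are made up for illustration); the two nearby stations end up on the same rounded coordinate:

```r
library(dplyr)

# Hypothetical stand-in for the station table
stations <- tibble::tibble(
  id        = c("A", "B", "C"),
  latitude  = c(42.28, 42.31, 17.12),
  longitude = c(-83.74, -83.69, -61.78)
)

# Rounding to whole degrees collapses A and B onto one point
stations |>
  mutate(
    latitude  = round(latitude, digits = 0),
    longitude = round(longitude, digits = 0)
  ) |>
  count(longitude, latitude)
```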
Of course, we see that we still have 122,000 different rows, and we have about 8,000 different groups. What I'd like to do is create another column that indicates the group number that each weather station belongs to. What I ultimately want to be able to say is that these different weather stations from all my daily data belong to the same region, and then within that region I'm going to calculate the average amount of precipitation over the previous 30 days. So I'm going to pipe this into a mutate(), because I don't want to summarize the data; instead, I'm going to create that extra column, which I'll call region, and set it with the cur_group_id() function. cur_group_id() reports the ID of the current group. And now what we see is that we've got region, which again is the cur_group_id() for those different groups. So let's count the number of weather stations we have by region and do a descending sort on that count: I can do count(region) and then arrange(-n), where n is the column that indicates the number of weather stations in that region. Interesting: the region at negative 105 longitude, 40 latitude has about 1,300 weather stations. That's wild. Okay, so let's do a filter(region == ...) with that region's ID to pull out those 1,300 or so rows. I'm curious what that location is, so I look up 40, negative 105, and basically it's Denver, Colorado that has all these weather stations. I wonder why they have so many. Regardless, that's kind of my point here: I don't need 1,300 or so different weather stations' worth of data for the same area. What we'll do in the next episode is see how we can join this data with the daily data, basically the precipitation data, so that I can get an average amount of precipitation across these 1,300 and all the other weather stations in each group, and from there an average amount of precipitation for each of the 8,000 different regions of weather stations.
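Continuing the hypothetical three-station example from above, the group-then-label step might look like this; stations that share a rounded coordinate receive the same region id from cur_group_id():

```r
library(dplyr)

# Hypothetical stand-in for the station table
stations <- tibble::tibble(
  id        = c("A", "B", "C"),
  latitude  = c(42.28, 42.31, 17.12),
  longitude = c(-83.74, -83.69, -61.78)
)

regions <- stations |>
  mutate(latitude = round(latitude), longitude = round(longitude)) |>
  group_by(longitude, latitude) |>
  mutate(region = cur_group_id()) |>  # same id for every row in a group
  ungroup()

# Descending count of stations per region
count(regions, region, sort = TRUE)
```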
So let's go ahead and clean this up a bit and write this out as a TSV: write_tsv() to data/ghcnd_regions.tsv. And now we'll update our Snakefile and add that; data/ghcnd_regions.tsv is going to be my output file. Let's get some more breathing room here, and I'll add a rule called aggregate_stations. Our input will be, well, our R script, code/merge_lat_long.R, and then our data, data/ghcnd-stations.txt. Great. And I do have my comma there; I always forget the comma, which is so annoying. Then output will be our ghcnd_regions.tsv file, and shell will be this line, which I'll grab from another rule so I don't have to worry about typing all that stuff. I need to make my R script executable, so in the terminal I'll do chmod +x code/merge_lat_long.R. Now I can do snakemake --dry-run to double-check that everything works. And it says nothing needs to be done, because I haven't saved my changes yet. I'll go ahead and save the Snakefile and try again. Wow, so now it's complaining that I copied and pasted without thinking: I put in an input named bash_script, and it's saying the input files object has no attribute 'bash_script'. What I really wanted was r_script, so I'll copy and paste that in, save it, and try the dry run again. Everything is good. I can now run it with snakemake -c 1; I only need one processor to run this. That all goes through and looks good. I'll do a git status and see I've still got my practice file in there; I don't need that, so I'll remove it with rm code/practice.R. Then I'll do git add Snakefile code/merge_lat_long.R and git commit -m "find groupings for each weather station".
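The Snakefile rule described above might look roughly like this (a sketch; the file paths match the ones named in the episode, and the shell body assumes the R script is executable and reads its own paths, since the exact shell line was copied from another rule and isn't shown here):

```python
# Snakefile fragment: run the lat/long merging script to produce
# the regions table from the raw station file
rule aggregate_stations:
    input:
        r_script = "code/merge_lat_long.R",
        data = "data/ghcnd-stations.txt"
    output:
        "data/ghcnd_regions.tsv"
    shell:
        """
        {input.r_script}
        """
```

Note that the error in the episode came from referencing `{input.bash_script}` in the shell block while the input was named `r_script`; the attribute name in the shell command has to match the name given under `input:`.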
Guess I can put a space in there. Good. And I'll do a git push, and that will be up there for you to check out if you want to see what my code looks like at the end of today's episode. I hope you found this discussion of the different ways of rounding in R useful, and then how we can apply that to the actual data we're working with. I know one of the most common questions on Stack Overflow is why round(2.5) goes to two rather than to three like we learned in elementary school. Well, now you know why: again, it has to do with trying to balance out the fraction of numbers that get rounded up versus rounded down. So now you know that the standard is to round ties like 2.5 to the nearest even number. Hope you found this useful. Tell your friends in case they're scratching their heads about why R does such weird things with rounding. And we'll see you next time for another episode of Code Club.