Hey folks! In recent episodes, I have been looking at a variety of approaches to visualizing global temperature change. We've been looking at global temperature anomalies, which are deviations from the average temperature between the years 1951 and 1980. We've been looking at this at an annual level, at a monthly level, and in today's episode, we're going to look at it at another level, which is on a two degree by two degree longitude-latitude grid. This will give us a greater sense of the variation in temperature anomalies across the earth over the past 70 or so years. To do that, I have been inspired by this animation that comes to us from the Scientific Visualization Studio at NASA. This animation I think is really cool because it starts out in 1951, and you can see how this histogram or density plot moves over the past 70 or so years. You can see that leading up to about the mid 1980s, it's right around zero, but then as we march forward towards the present, the distribution moves considerably away from the mean of zero that we saw before. Another approach that they took was a static version of that figure, where they basically made a ridgeline plot. A ridgeline plot has also been called a joy plot. I have always been looking for an application of a ridgeline plot, and here is our candidate. Each of these density plots represents a different decade from 1951 to the present. Here they have, I think, seven different density plots. What I was thinking was it'd be really cool to make this go by year, so that we could have effectively 72 or so different ridgelines and see how that changes over time. In this version that NASA generated, they have colored each density with a gradient running across it. I'm not such a fan of that. What I'd rather do is color the fill of each density by the average temperature anomaly across all of the gridded points on the globe.
So that's what I'm going to do. I'm going to make this, but make it for all years rather than by the decade. And I'm going to color it by the average temperature anomaly for that year. To get the data, I'm going to come to the NASA GISS site at data.giss.nasa.gov. If we scroll down, we will see that they've got all sorts of data made available to us. The data I'm interested in is within this gridded monthly temperature anomaly data. We're going to be looking at compressed netCDF files. These are on a two degree by two degree grid, and we're looking at surface air temperature without ocean data. When I put my cursor over that link, I see that it comes up as a gz compressed file. So I'm going to go ahead and copy the link address. Coming into RStudio, I'll set up a new R script, and I'm going to paste that link into my R script. I'll go ahead and put this in quotes and then call that url. That'll be the URL that we're downloading the gzipped data from. As always, I'll start with library(tidyverse) and get that all loaded so we have access to those great tools. I then want to download the URL. So if I do download.file, I can give it url, and I need to give it a destination file. So I'm going to go ahead and grab that gistemp file name from the end of the URL. This downloads the file. It's about 10.6 megabytes, which is much larger than what I want to mess with. So I'm going to go into my .gitignore file and make sure that I've added that file. I'm going to put in here gistemp*, which will match anything that starts with gistemp. The reason I'm doing this is because I don't want to accidentally commit and then push a large file up to GitHub. So the next time I commit, that pattern will be seen and it won't try to commit my gistemp files. I'll end up removing the file anyway, so putting it in my .gitignore is really just a safety measure. Now what I want to do is go ahead and decompress that gz file.
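As a rough sketch of the download step just described: the URL below is an assumption (the episode copies the link from the page but never reads it out), and fetch_once is a hypothetical helper name, not something from the episode. The last two lines use glob2rx() only to illustrate what a gistemp* entry in .gitignore would match.

```r
# Assumed URL: verify it against the link on the GISTEMP page before using it.
url <- "https://data.giss.nasa.gov/pub/gistemp/gistemp250_GHCNv4.nc.gz"

# Use the last part of the URL as the destination file name.
gz_file <- basename(url)

# Hypothetical helper: download only if the file is not already present.
# mode = "wb" keeps the binary archive intact on Windows.
fetch_once <- function(url, destfile = basename(url)) {
  if (!file.exists(destfile)) {
    download.file(url, destfile = destfile, mode = "wb")
  }
  invisible(destfile)
}

# The .gitignore entry gistemp* is a glob; glob2rx() turns it into a regular
# expression so we can check which file names it would cover.
ignore_rx <- glob2rx("gistemp*")
ignored <- grepl(ignore_rx, c(gz_file, "gistemp.txt", "README.md"))
```

Calling fetch_once(url) performs the actual download; it is left out of the block so the sketch runs without a network connection.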
So if you see gz, that's short for being gzipped. There is a tool called gunzip that we can use within R. It's also a Linux command line tool, but in R we can get it from the R.utils package. So library(R.utils). If you don't have that installed already, you'll have to install that package. We can then do gunzip on that file. So let me go ahead and grab the file name with the quotes. I now look over in my files, and I see that I now have a file that ends in .nc. That's 52.8 megs, so that's quite large. I certainly don't want to be pushing that up to GitHub. But thankfully, we have the stub of that file name in my .gitignore file, so I don't have to worry about that. As we saw in the browser, this is a netCDF file. I'm not sure how to work with that file, because I'm not familiar with this data format. You might be, if you're familiar with GIS data analysis, looking at geographic information systems. So if I come to Google and search for netCDF file R, I see the first link is how to open and work with netCDF files in R. I'll go ahead and open that. This opens up a demo of working with netCDF data that Alison Boyer from Oak Ridge National Laboratory developed. Of course, their data is going to be different from my data. But what I'm going to do is use their tutorial, their demo, to figure out how to work with my own. The first thing I see is that we need the ncdf4 package. I've already installed that, and I'd encourage you to do it as well. So we'll go ahead and load that library. Again, if you don't have it installed already, you will need to do that. As we come down, I'm going to look at the commands that she ran and copy them over into my R script. I'm going to have to change the names of some things, but I'm going to grab this block where she uses nc_open to open the file. She also has these other commands, I see. So maybe I'll go ahead and plop these in, and then let's look and see what actually happens.
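The decompression step can be tried out in isolation. This is a minimal sketch that builds a small stand-in .gz file in a temporary directory rather than using the real download; the file name gistemp_demo.nc.gz is made up for the example.

```r
library(R.utils)

# Stand-in for the downloaded archive: write a small gzipped file to a temp dir.
gz_path <- file.path(tempdir(), "gistemp_demo.nc.gz")
con <- gzfile(gz_path, "wb")
writeBin(charToRaw("not a real netCDF file"), con)
close(con)

# gunzip() writes the decompressed file and, by default, deletes the .gz.
# remove = FALSE keeps both files around for this demo.
nc_path <- sub("\\.gz$", "", gz_path)
gunzip(gz_path, destname = nc_path, remove = FALSE, overwrite = TRUE)

file.exists(nc_path)
```

With the default remove = TRUE, only the decompressed .nc file is left behind, which matches what shows up in the files pane in the episode.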
And so the file I want is this, but without the gz, right? So it's the gistemp250, and so forth, ending in .nc. I run that, and now if I look at nc_data, I see all this great metadata that comes to the screen. There are two variables, excluding the dimension variables. There's a time bounds variable. There's tempanomaly, with dimensions longitude, latitude, and time. It's showing the surface temperature anomaly data, and it's got all sorts of other information. There's then dimension data in here, the longitude, the latitude, and the time, telling me that the time is the number of days since January 1st of 1800. So good. Looking back at the tutorial, we can print nc_data to a text file. I'm going to go ahead and do that, substituting in a gistemp file name for the name that they use for their text file. I wasn't familiar with the sink approach, but basically you can create an output file using sink, print the data, and then call sink again to close it. They've put that in these curly braces. If we run that, we see that we get this gistemp text file that has the same output that we had previously. For the longitude and latitude, we can run those lines. If we look at lon, it's going from negative 179 to 179 by two degree increments. That's cool. And then lat is going to be the same idea, going from negative 89 to positive 89. Then we've got t. We see that these are all the days, the number of days since 1800. Cool. We've already done this head on lon, so let's go back and see what else they had here. That's the data from their NDVI variable. I think that's going to be the same as our tempanomaly variable, so I'm going to go ahead and grab that and replace NDVI with tempanomaly. If I look at my text file, that is this thing, right? That's the analog of their NDVI. So tempanomaly. All right, so we run that. And then we do dim on that.
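Here is a small, self-contained sketch of the ncdf4 calls walked through above. To avoid depending on the NASA download, it first writes a toy netCDF file; the variable and dimension names (lon, lat, time, tempanomaly) are assumed to mirror the real file, and the values are invented.

```r
library(ncdf4)

# Build a toy netCDF file: 3 longitudes x 2 latitudes x 4 time points.
lon_dim  <- ncdim_def("lon", "degrees_east", seq(-179, -175, by = 2))
lat_dim  <- ncdim_def("lat", "degrees_north", c(-89, -87))
time_dim <- ncdim_def("time", "days since 1800-01-01", c(14, 45, 73, 104))
anom_var <- ncvar_def("tempanomaly", "K", list(lon_dim, lat_dim, time_dim),
                      missval = 32767, prec = "float")

nc_path <- file.path(tempdir(), "toy_gistemp.nc")
nc_new <- nc_create(nc_path, anom_var)
ncvar_put(nc_new, anom_var, seq(-1.2, 1.1, length.out = 24))  # made-up values
nc_close(nc_new)

# Open the file and dump its metadata to a text file.
nc_data <- nc_open(nc_path)
txt_path <- file.path(tempdir(), "toy_gistemp.txt")
sink(txt_path)   # start redirecting printed output to the file
print(nc_data)
sink()           # stop redirecting

# Pull out the dimension vectors and the three-dimensional anomaly array.
lon <- ncvar_get(nc_data, "lon")
lat <- ncvar_get(nc_data, "lat")
t_days <- ncvar_get(nc_data, "time")
t_anomaly <- ncvar_get(nc_data, "tempanomaly")
dim(t_anomaly)

nc_close(nc_data)
```

The order of dim(t_anomaly) follows the order of the dimensions in the variable definition: longitude, then latitude, then time.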
And so we see there are three dimensions, right? What we've got is the longitude, latitude, and time. So there are 1709 months' worth of data. These data are being spat out every month, I believe. We'll confirm that later on. But actually, if I look at these times, I see that they vary by about 31 days. So yeah, I'm noticing about a month's worth of jumping between the different time points. Now, looking down here at the fill values, this is what they use to fill in the missing data. So again, I'll copy this, and we've got tempanomaly instead of NDVI. Then if I look at fill value, I see that it has the attribute and that the value is 32767. And we can go ahead and remove that. As we see down here, we can make sure that gets set to NA by using a little bit of base R magic, which I'll plop in here. This is not going to be NDVI. Maybe I'll call it t_anomaly instead. I'll copy that here, and there, and that should be all good. So let me just make sure this is all updated. We now have this three dimensional array that I would like to get into a tidy format. One of the ways that we can easily do that is with the data.table package. So we can come up again and add another package. I'll do library(data.table). Again, you'll have to install this if you don't have it already. What we can then do is as.data.table on the t_anomaly array. What I get out is a very long data frame with something like 8.86 million rows. One of the things about as.data.table is that it automatically removes the NA values, so it's already cleaned it up for us a bit. It's taken that three dimensional array and it's flattened it. So we've got, I think, the rows in V1, the columns in V2, and that third dimension, time for us, in V3. And then we see these values.
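The fill-value replacement and the flattening step can be sketched on a tiny array. The numbers are invented; only the fill value 32767 comes from the episode.

```r
library(data.table)

fill_value <- 32767  # value reported in the file's metadata in the episode

# Toy 2 x 2 x 2 array (lon index, lat index, time index) with two fill values.
t_anomaly <- array(c(0.1, fill_value, -0.3, 0.4, fill_value, 0.6, 0.7, -0.8),
                   dim = c(2, 2, 2))

# Base R logical indexing: every cell equal to the fill value becomes NA.
t_anomaly[t_anomaly == fill_value] <- NA

# One row per non-missing cell: V1, V2, V3 hold the array indices and
# value holds the data. NA cells are dropped by default (na.rm = TRUE).
t_long <- as.data.table(t_anomaly)
t_long
```

Because V1, V2, and V3 are positions in the array rather than degrees or dates, they still need to be mapped back through the longitude, latitude, and time vectors.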
So now what we need to do is go ahead and change V1, V2, V3 to our longitude, latitude, and time, as well as make our value column be t_diff. I'm going to start by converting this to a tibble, so we'll do as_tibble. We see we've got the tibble format. That's not that big of a change. I'll go ahead and do a select to change the names. I'll make longitude = V1, latitude = V2, time = V3, and then t_diff = value. Again, these values of longitude, latitude, and time are the index values into the lon, lat, and t vectors that I made up above here. I can get those to be the proper degrees longitude, degrees latitude, and date by using a mutate function, where we can do longitude = lon[longitude]. That lon square bracket means take the lon vector that we defined up above, take the value from the longitude column, and plug that into the vector, which will then return a value from the lon vector. We can do the same thing with latitude = lat[latitude]. And then we can do time = t[time]. Let's run that and see what we get. Great. So then we have our degrees longitude, our degrees latitude, and time. Again, that's the number of days since January 1st of 1800. So what we can do is take this time value, the number of days, and add it to the date January 1st, 1800. We can do as.Date on "1800-01-01". That as.Date will convert our string, which is in the ISO standard date notation, into a date, and we'll add the number of days, and that will return a date for that column. So now we see that the first time point that we have data from this longitude and latitude was January 15th, 1957. I can go ahead and do a tail on this to see the most recent time points, which goes to May 15. I'm recording this on June 29, so I guess they don't quite have the June 15 data inserted into here yet.
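A minimal sketch of the index-to-coordinate lookup and the date conversion, using a two-row stand-in for the flattened table. The time vector is named t_days here (rather than t, as in the episode) purely to avoid confusion with base R's t() function; the two day counts are made up.

```r
library(dplyr)

# Coordinate vectors on a two degree grid, as described in the episode.
lon <- seq(-179, 179, by = 2)   # 180 values
lat <- seq(-89, 89, by = 2)     # 90 values
t_days <- c(14, 45)             # made-up "days since 1800-01-01"

# Stand-in for the flattened array: V1, V2, V3 are index positions.
flat <- tibble(V1 = c(1, 180), V2 = c(1, 90), V3 = c(1, 2), value = c(-0.5, 1.2))

tidy <- flat %>%
  select(longitude = V1, latitude = V2, time = V3, t_diff = value) %>%
  mutate(longitude = lon[longitude],                  # index -> degrees east
         latitude = lat[latitude],                    # index -> degrees north
         time = as.Date("1800-01-01") + t_days[time]) # index -> days -> date

tidy
```

Adding a number to a Date object in R advances it by that many days, which is why the day offsets can be added directly to as.Date("1800-01-01").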
So again, we have longitude and latitude by month. We also know the year. Great. So I think we're in pretty good shape now to take this and go ahead and summarize the data by year. To do that, I am going to go ahead and create another variable that I'll call year, and that will be the year function on time. What I need to do is add another package, so I'll add library(lubridate). lubridate is installed with the tidyverse, so it should already be installed if you've got the tidyverse installed. We'll run all this. Now we see that we have the year for each longitude and latitude. So I'm going to go ahead now and group our data by the year. But we're also going to group it by our longitude and our latitude. We'll then calculate the average across the 12 months for each longitude and latitude by doing summarize. I'm going to call it t_diff again, as the mean of t_diff. And I'm going to add to this .groups = "drop", and that will remove the grouping by year, longitude, and latitude. I now have the summary table with year, longitude, latitude, and t_diff. Something I might do is go ahead and count the year to see how many grid points I have for each year. Maybe I'll then pipe this to ggplot with aes, putting year on the x axis and n on the y axis. And then we'll go ahead and add geom_line to see the frequency of sampling we have by year. Looking at this plot, we see that sampling really increased over time. But there was a step function, if you will, right around, you know, so this minor grid line is 1940, and 1950 is right about there. That's probably why they picked 1951 going forward. So I'm going to go ahead and replace those lines with a filter to get the year greater than or equal to 1950. I'm going to assign this to a variable, t_data. And so we'll have t_data that we can then use to make our ridgeline plot.
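The yearly summary can be sketched on a handful of invented monthly rows. The column names follow the episode; the values and the two grid cells are made up.

```r
library(dplyr)
library(lubridate)

# Made-up monthly anomalies for two grid cells.
monthly <- tibble(
  longitude = c(-179, -179, -179, 1, 1),
  latitude  = c(-89, -89, -89, 51, 51),
  time      = as.Date(c("1949-06-15", "1950-01-15", "1950-02-15",
                        "1950-01-15", "1950-02-15")),
  t_diff    = c(0.5, -0.2, 0.4, 1.0, 2.0)
)

t_data <- monthly %>%
  mutate(year = year(time)) %>%                        # extract the calendar year
  group_by(year, longitude, latitude) %>%
  summarize(t_diff = mean(t_diff), .groups = "drop") %>% # one row per cell per year
  filter(year >= 1950)

# Number of grid cells with data in each year.
count(t_data, year)
```

Piping the count() output into ggplot(aes(x = year, y = n)) + geom_line() gives the sampling-over-time plot described above.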
And as always, if we look at t_data, we can then see the outputted tibble. Again, we're going to be looking by year at the distribution of t_diff. So I'm going to go ahead and take t_data and pipe that to a filter. I just want to look at a few years' worth of data, so I'll do filter with year %in%, and let's do 2018, 2019, 2020, just three random years. Then we'll do ggplot with aes, and on the x axis I want to put t_diff. I'm going to want to get a fill color, but we'll hold on to that for now. Then let's do geom_density. I need to group it by year, right? So let's go ahead and do group = year. And for now, let's also do fill = year. It's hard to tell, but if I set the alpha to, say, 0.3, you can kind of see that there are three different shades of blue in there. What we want to use is a ridgeline plot, and that's going to come to us from the ggridges package. So again, we'll come back up here and do library(ggridges). If you don't have ggridges installed, you'll need to do that first. I'm going to go ahead and replace this geom_density with geom_density_ridges. It's complaining that geom_density_ridges requires the following missing aesthetic: y. The y is basically the position up or back into the screen where I want to draw each of the different distributions. If I look at this static image, the y would be these different decades. For me, that's going to be the year. So this group needs to be a y. Also, year is a continuous variable, and geom_density_ridges is going to want this to be a factor. It's going to want it to be a categorical variable. So I can do factor on year. And so now you can see that we have the three different distributions stacked one behind the other. Maybe I'll use some different years. Let's go ahead and do 1950, 1980, 2000, and then 2020.
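A minimal ridgeline sketch along the lines just described, using simulated anomalies in place of the real t_data (the means and the seed are arbitrary).

```r
library(ggplot2)
library(ggridges)

# Simulated stand-in for t_data: 200 grid-cell anomalies for each of four years.
set.seed(1)
toy <- data.frame(
  year   = rep(c(1950, 1980, 2000, 2020), each = 200),
  t_diff = rnorm(800, mean = rep(c(0, 0.2, 0.6, 1.3), each = 200))
)

# geom_density_ridges() needs a y aesthetic, and it should be categorical,
# so the numeric year is wrapped in factor().
p <- ggplot(toy, aes(x = t_diff, y = factor(year), fill = year)) +
  geom_density_ridges(alpha = 0.3)
```

Printing p draws one density per year, each offset along the y axis. Leaving out y reproduces the missing-aesthetic error mentioned in the episode.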
And so you can kind of see, from 1950 going forward, the shift in the distribution. These distributions are pretty bumpy, whereas the originals were fairly smooth. It's telling us that it's picking a bandwidth of 0.1. So if I go in here and do bandwidth equals, let's say, 0.2, we'll get a smoother distribution. Maybe if we take that up to 0.3, that looks a bit better. Again, the distributions on that original plot were fairly smooth. I want my fill color to be set by the average temperature anomaly for each year. So back up in the pipeline where I made t_data, I'm going to do another group_by on year, and then a mutate to create t_ave, which is going to be the mean of t_diff. If I look at t_data, I now get this extra column that has t_ave. And so then instead of fill equaling year, I can do t_ave. Now I see that I've got this dark blue color for the mean of zero, basically at 1950, and a lighter blue color for 2020. This is reminding me that we don't have all the data yet for 2022. So I also want to add an and here to my filter to do year less than 2022, because we don't have a full year's worth of data yet. Let's go ahead then and add in scale_fill_gradient2. We've seen this before. For our low, we'll do dark blue, our mid will be the default of white, and then our high will be dark red. Our midpoint is the default, but just to be explicit, we'll set it at zero. We can then see that where the average for the year was zero, we get a pretty white color, and then it increases in intensity as the mean of that distribution moves off to the right. Let's go ahead and remove the filter where we're only looking at those four years so we can see the full ridgeline plot. It's starting to shape up to look like we want it to. So I'm happy with that.
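The per-year average and the diverging fill scale can be sketched together. The data are simulated, and the ungroup() call is an addition for tidiness that is not mentioned in the episode.

```r
library(dplyr)
library(ggplot2)
library(ggridges)

# Simulated anomalies, including an incomplete "2022" that gets filtered out.
set.seed(2)
toy <- tibble(
  year   = rep(c(1950, 1980, 2000, 2020, 2022), each = 200),
  t_diff = rnorm(1000, mean = rep(c(0, 0.2, 0.6, 1.3, 1.4), each = 200))
)

t_data <- toy %>%
  filter(year >= 1950 & year < 2022) %>%
  group_by(year) %>%
  mutate(t_ave = mean(t_diff)) %>%  # same value repeated on every row of a year
  ungroup()

p <- ggplot(t_data, aes(x = t_diff, y = factor(year), fill = t_ave)) +
  geom_density_ridges(bandwidth = 0.3) +   # wider bandwidth, smoother curves
  scale_fill_gradient2(low = "darkblue", mid = "white", high = "darkred",
                       midpoint = 0)
```

Using mutate() rather than summarize() keeps every grid-cell row, which the density needs, while still attaching a single fill value to each year.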
I would like to flip the order. I want 1950 at the top and 2021 at the bottom. To do that, down here where I define the factor, I can do levels = seq(2021, 1950, -1). I'm going to put these on separate lines because they're kind of long and they're scrolling off the side of the screen. But that should get us to flip the order of our years. Sure enough, we now have 1950 at the top and 2021 at the bottom, and we can see the shift in that distribution over to the right. The next thing I want to turn my attention to is the x axis scale. On the original plot, it goes from basically negative five to positive five, but we have labels of negative four to four. So here I'm going to start with coord_cartesian, and we'll do xlim from negative five to five. coord_cartesian will zoom in, whereas if I set limits in scale_x_continuous, it would remove the data outside that range. But I want to include all that data so I get the full shape of the distribution. So again, that zooms in and makes it easier to see the distribution that we want to see. To get the breaks that I want on the x axis, I'll do scale_x_continuous, and then we'll do breaks of seq from negative four to four by two. That's great. I'd also like to make my year labels go every 10 years. So then we'll do scale_y, and we're actually going to use discrete, because we turned the year into a factor. So here we'll do scale_y_discrete, and we'll do breaks of seq from 1950 to 2020 by 10 year increments. And so there again, we see 1950 at the top and 2020 at the bottom. The peaks on my ridgeline plot seem a bit muted to me. An argument that I can add to geom_density_ridges is scale. If I do scale equals one, the top of each ridge touches the bottom of the next. If I up this to, say, two, we have more overlap. And then three is even more overlap, and it makes the peaks look a bit taller.
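A sketch of the ordering and axis tweaks from this part of the episode, again on simulated data. The y breaks are converted to text here because factor levels are stored as character strings; that conversion is a defensive choice in this sketch rather than something stated in the episode.

```r
library(ggplot2)
library(ggridges)

# Simulated anomalies for every year from 1950 to 2021, warming over time.
set.seed(3)
years <- 1950:2021
toy <- data.frame(year = rep(years, each = 50))
toy$t_diff <- rnorm(nrow(toy), mean = (toy$year - 1950) / 60)
toy$t_ave <- ave(toy$t_diff, toy$year)   # per-year mean, repeated on each row

year_levels <- seq(2021, 1950, by = -1)   # reversed, so 1950 ends up on top
decade_breaks <- seq(1950, 2020, by = 10)

p <- ggplot(toy, aes(x = t_diff,
                     y = factor(year, levels = year_levels),
                     fill = t_ave)) +
  geom_density_ridges(bandwidth = 0.3, scale = 3) +  # scale > 1 overlaps ridges
  coord_cartesian(xlim = c(-5, 5)) +                 # zoom without dropping data
  scale_x_continuous(breaks = seq(-4, 4, by = 2)) +
  scale_y_discrete(breaks = as.character(decade_breaks))
```

The first factor level is drawn at the bottom of a discrete y axis, so listing 2021 first is what puts 1950 at the top.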
I kind of like this appearance, because it gives you more of a sense of the peaks of the distributions. I'm pretty happy with the way that looks, so I'm going to leave that there. Now I want to turn my attention to doing more of the styling of the figure to make it look more attractive. I'd like to go with that black background to make the colors really pop out. The first thing I'm going to do is go ahead and remove the legend. So here in scale_fill_gradient2, I can do guide = "none". That gives us more real estate to work with. I'm also going to go ahead and turn off the y axis label. So I'll do labs with y = NULL. The original had temperature anomaly in degrees C, so on x, I'll do temperature anomaly. And for the Unicode, because I've typed this so many times now, \u00B0C will get us the degree sign with Celsius. And then for the title, let's go ahead and put what they've got, so land temperature anomaly distribution. It's coming together. Before I do much more tweaking of the sizes and positions of things, I want to go ahead and save this with ggsave. So I'll do ggsave with figures/temp_distribution.png. It's really tall and really narrow, so I'm going to set my height. Let's do six, with width equals three. Maybe we can make it a little bit wider, so let's go ahead and do width equals four. That looks pretty nice. I now want to flip the colors to make white black and black white. We'll go ahead and do that with the theme function. I'm going to do text = element_text with color = "white". That'll take all of the text elements and make them white. Then I'll do panel.background = element_rect with fill = "black". I'll also do plot.background = element_rect with fill = "black". And I also want to remove the grid lines from the panel, so I'll do panel.grid = element_blank(). For whatever reason, it didn't turn all of my text white like I expected it to.
But we've got the black background. So let's go ahead back to those labels on the x and y axis. I'll go ahead and do axis.text = element_text with color = "white". And then we also have the axis.ticks that were gray and not white, so we'll do element_line with color = "white" on that as well. Looking at the original, one thing they have here that I like is that they have a line for the x axis. So if we come back into our theme, we can do axis.line.x = element_line with color = "white". I'm pretty sure we don't have a line for the y axis, but just in case, we can do axis.line.y = element_blank(), and that will get rid of that as well. One other thing that I notice about this is that we have this kind of thick black line for each of the distributions. What I'd like to do is maybe turn that into a thinner white line to make it easier to see the differences between the different curves. I definitely want to make it thin. I'm not totally sure if I want it to be black or white. So if we come back up here to geom_density_ridges, we can change that with, say, size = 0.2. And then let's do color = "white". The thing I like about color = "white", besides not being super overwhelming within the plot, is that it gives the effect of a grid line all the way across for each year. That's kind of nice. I guess if we turned it black and made it really thin, then we wouldn't see that. We could also get rid of the color as well. But I kind of like having either white or black to get some definition between these different ridges. And I kind of like this look. I don't know what you think, but let me know down below in the comments. Maybe what I'll do is get rid of the tick marks on the y axis. But I like the way this looks. Okay, so I'm going to go ahead and get rid of those ticks. We'll go ahead and do axis.ticks.y = element_blank().
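Pulling the styling choices together, here is one way the labels and dark theme could look on simulated data. Two assumptions to flag: the outline thickness is set with linewidth, which recent ggplot2 releases prefer for lines, whereas the episode used size = 0.2; and the exact wording of the axis label is a guess at what was typed.

```r
library(ggplot2)
library(ggridges)

# Simulated per-year anomalies (stand-in for the real t_data).
set.seed(4)
toy <- data.frame(year = rep(1950:2021, each = 50))
toy$t_diff <- rnorm(nrow(toy), mean = (toy$year - 1950) / 60)
toy$t_ave <- ave(toy$t_diff, toy$year)

p <- ggplot(toy, aes(x = t_diff,
                     y = factor(year, levels = seq(2021, 1950, by = -1)),
                     fill = t_ave)) +
  # Thin white outlines double as a faint grid line for each year.
  geom_density_ridges(bandwidth = 0.3, scale = 3,
                      linewidth = 0.2, color = "white") +
  scale_fill_gradient2(low = "darkblue", mid = "white", high = "darkred",
                       midpoint = 0, guide = "none") +   # guide = "none" drops the legend
  coord_cartesian(xlim = c(-5, 5)) +
  scale_y_discrete(breaks = as.character(seq(1950, 2020, by = 10))) +
  labs(y = NULL,
       x = "Temperature anomaly (\u00B0C)",
       title = "Land Temperature Anomaly Distribution") +
  theme(
    text = element_text(color = "white"),
    plot.background = element_rect(fill = "black"),
    panel.background = element_rect(fill = "black"),
    panel.grid = element_blank(),
    axis.text = element_text(color = "white"),   # axis text has its own color setting
    axis.ticks = element_line(color = "white"),
    axis.ticks.y = element_blank(),
    axis.line.x = element_line(color = "white"),
    axis.line.y = element_blank()
  )
```

The default theme gives axis.text its own color, which is a likely reason setting text alone did not turn the axis labels white. To write the figure to disk in the tall, narrow shape from the episode, something like ggsave("figures/temp_distribution.png", plot = p, width = 4, height = 6) would do it, assuming a figures directory exists.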
So I'm pretty happy with the way this came out. I like having a distribution for each year. I think you see a lot of interesting things going on, and questions just kind of pop to mind, like, why do we have bimodal distributions in some years? Are these the same locations on the globe that have lower temperatures than the rest of the globe? That, I think, is the mark of a great visual: it tells a story and then forces you to think of additional questions that you can dig back into the data to answer. Great. So I'm going to go ahead and save this R script into my code directory, and I'll call it temp_distribution_ridgeline. Finally, before we quit, we want to go ahead and close our connection. So we'll do nc_close on our nc_data. And then we also want to unlink, or remove, those gistemp files that we told .gitignore to ignore. To do that, we'll do unlink on the gistemp .nc file, and then unlink on the gistemp text file. So go ahead and run all those. And then we see that those files are no longer in our files directory, and we are good to go. I'll get this committed and pushed up to GitHub so that you can download it along with all the data and see exactly what I did. I strongly encourage you to work with this code, manipulate some of the values, and see if you can get a different appearance to the plot, perhaps something that better suits your sensibilities. Let me know what you come up with. And we'll see you next time for another episode of Code Club.
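The cleanup step can be sketched with placeholder files in a temporary directory. The file names are invented stand-ins for the real gistemp files; in the real script, nc_close(nc_data) would be called before removing anything.

```r
# Create two placeholder files that mimic the leftover data files.
tmp_dir <- tempdir()
leftovers <- file.path(tmp_dir, c("gistemp_cleanup_demo.nc",
                                  "gistemp_cleanup_demo.txt"))
file.create(leftovers)

# unlink() deletes files; it accepts wildcards, so one pattern covers both.
unlink(file.path(tmp_dir, "gistemp_cleanup_demo*"))

file.exists(leftovers)
```

Passing the two file names to unlink() one at a time, as in the episode, has the same effect as the wildcard.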