Good morning everyone. Thank you for joining part two of this UK Data Service workshop, An Introduction to Time Series Analysis and Forecasting. This is the code demonstration, which will take place in R. To reintroduce myself, I'm Nadia Kenner, and we are joined today by Emma Green, who will be helping facilitate this workshop. If you have any questions not related to the content, please drop a message in the chat and Emma will do her best to help there. If you have any questions related to the content, please use the Q&A function, because that means I can view them and attempt to answer the questions throughout. I'm just going to give it a couple of minutes before we get started, as participants are still joining, and then we'll begin the code demonstration. All right, so we'll get started with the code demonstration now. I hope everyone's settled in and got a drink. So yes, as mentioned, we're going to be doing this code demonstration in RStudio. There are three R Markdown files that we'll be using: an intros and prerequisites file, a section one and a section two. In order to obtain the code that I'm looking at here, I'm going to show you how you can get it from GitHub. I'm just going to turn my camera off as well because I'm in the way. So if you head over to GitHub (Emma, if you can post that link in the chat again, that'd be great), this is the repository that contains everything you've just seen on my screen. There are multiple ways to get the code onto your own computer. You could just download the zip file, and this will be stored in your documents on your computer. However, this means that if I make any changes or updates to the code, you'll then have to redownload it rather than just being able to pull the changes. So if you want to stay up to date with any changes I make, your best bet is to clone the repository, and I will show you how to do that in just a minute. The third way to obtain this code is to scroll down: in the README file, I've attached some interactive Binder links. Basically, these allow you to run the code without having to install RStudio or any other software, so you can work from a cloud environment instead. If you were to launch the Binder link, which is here, you'll get something that looks like this. We can just ignore that. Once you open it up, you'll also see the three files. You can open up the .ipynb file if you are using Binder, but just some forewarning (that message appeared because I closed it earlier, sorry): Binder is incredibly slow. It takes approximately six to eight minutes to launch, and running some of the chunks can take some time as well. That's why I'm just using the Colab links instead. I'll open one up now just so you can see what this looks like. See, this opens up immediately. It's got all the code chunks available and you can run these. Obviously this will run a little slower as well, because we're on a cloud-based service, but if you can't download RStudio, I've made sure these links are available for you. In order to clone the repository, you can click the big green Code button and copy the HTTPS link. Once that's copied, you can head back over to RStudio.
Now, obviously you won't be looking at what I'm looking at right now because you haven't cloned it, but if you open up RStudio, you can click File, New Project. From New Project, you can click Version Control, and this is basically telling RStudio to connect to Git, which is where the code is stored. You then want to click Git again, because we want to clone from that repository, and all you have to do is paste that URL link in, give it a name of your choosing, and then choose where you want to store this on your computer. A little tip: if you are cloning the repository right now, this will take maybe 30 seconds, not even, to get the code onto your computer, and I would suggest always opening it in a new session so that you don't disrupt any projects you're working on at the moment. The reason we tend to use R projects is because an R project automatically sets your working directory. The working directory is just a file path on your computer that sets the default location for any files you read into R or save out of R. It's like a little flag somewhere on your computer which is tied to a specific project, and if you ask R to import a data set from a text file, or to save a data frame as a text file, it will assume the file is inside your working directory. I'm not going to click Create Project because obviously I've already done this. In the intros and prerequisites file, I've got some information again about how to clone the repository. I've got some information about the data sets I'm going to be using, specifically in relation to Ashby's paper that was talked about in the previous webinar on Tuesday. There's some information about setting your working directory, but as discussed, you don't need to set your working directory if you create this project, because that's done within that process. And the last step you need to take before we can run sections one and two is to install the packages needed. To install a package in R, you can use the install.packages() function, and you call on the name of the package in quotation marks. Obviously, I've already done this, and once you install a package, you won't have to do it again. But what you will have to do is load your packages. To load your packages, or to run any bit of code in R Markdown, you can click the green arrow that points to the right, which will run the whole chunk, or you can select an individual line of code and press Ctrl+Enter on Windows or Cmd+Enter on a Mac, and that will run just that line. So once your packages are installed, make sure to load them into your project, or you won't be able to work with the functions that I'm using in sections one and two. Apologies for spending so long on setting up your directory, but it's important, as we always get questions about it at the start. So let's work through our first section. In this section, we're going to be covering five main topics which were talked about in Tuesday's session: time series data representations, converting time series objects, making decomposition plots, checking for stationarity (which I think I've spelt wrong), and applying rolling averages. So again, the first step before doing anything in R is to load your necessary packages.
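As a quick reference, here's what that install-then-load pattern looks like. This is a minimal sketch: the definitive package list is in the prerequisites file, so treat these names as examples of the packages used across this demo rather than the exact list.

```r
# Install once per machine (package names go in quotation marks)
install.packages("forecast")
install.packages("tseries")
install.packages("zoo")
install.packages("xts")
install.packages("tsibble")
install.packages("ggfortify")
install.packages("dplyr")

# ...then load at the start of every session
library(forecast)
library(tseries)
library(zoo)
library(xts)
library(tsibble)
library(ggfortify)
library(dplyr)
```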
Obviously we've already done that, but it's not bad practice to just keep doing this so you don't forget. So, we will start this code demo by exploring some different types of time series data. When working with data in R, you need to decide the object class of your data at hand, and this is important because the object class you choose affects more than how the data is stored. It also dictates what functions will be available for data pre-processing, analysis and plotting, because some functions, such as those in the forecast package I believe, require you to have a time series object rather than a data frame. R has, I believe, around eight different implementations of data structures for representing time series, so it can get confusing. The main packages include ts, which is part of base R, which means you don't need to install it: once you download and install R, the base packages are automatically installed. You then have packages like zoo and xts. xts is an extension of zoo, and a very useful package for dealing with date-time variables. And then we have probably the most common package, which is the tsibble package, which is one we're going to be using quite frequently. So let's get started by exploring a data set within RStudio. We're working on a data set called kings. This data set contains the ages at death of successive kings of England, starting, I believe, with William the Conqueror, and we're going to use it to explore the basic structure of time series data. The data set can be found in this document here, and luckily we don't need to download it from elsewhere; we can read it straight into R using what is known as the scan function. The scan function reads fields of data from a file as specified by the what option, with the default being numeric. In this instance, I'm creating a new object called kings: I'm using the assignment operator and asking R to read the data set in from this HTTPS link here. Notice how the link is in quotation marks; if it were not, you'd probably get an error. I'm also using the skip argument to skip the first three lines of the file, as these just contain descriptive information about how the data was collected, not the data itself. So if we go ahead and run that, you'll see in your console that we have read 42 items in. Let's have a look at this data set. As you can see, we've got our 42 kings and the ages they died at. We can check what type of data we have by using the function class. As you can see, we are working with numeric data. However, if we wanted to make a time series plot, plotting dates on the x-axis and attributes on the y-axis, we need to tell R that this is time series data, and to convert the data to a time series object, we can use the ts function from the base package. Because our data is univariate, all we have to do is call the ts function on the vector we want to convert, in this instance kings, and I'm assigning this to the same object, so this is just going to overwrite what we already have. You could call this kings2 to stop confusion, but I'm just going to keep it as kings. Now, if we recheck the class, you can see that this is now a ts object. A time series object is simply a vector or a matrix, so univariate or multivariate.
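Here's a minimal sketch of those steps. I'm assuming the kings data is read from the Time Series Data Library copy hosted by Rob Hyndman, which is the usual source for this example; swap in the link from the R Markdown file if yours differs.

```r
# Read the raw ages, skipping the three descriptive header lines
kings <- scan("http://robjhyndman.com/tsdldata/misc/kings.dat", skip = 3)
class(kings)  # "numeric"

# Convert the univariate vector to a time series object
kings <- ts(kings)
class(kings)  # "ts"
```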
A ts object has time indices for each observation, which allows us to sample the frequency, examine the time increments between observations, analyze the cycle lengths, et cetera. However, it's common to come across time series that have been collected at regular intervals of less than one year, unlike the kings data set; this could be monthly, weekly or quarterly. In these cases, we have to specify the number of times the data was collected per year using an additional parameter. So let's have a look at what that parameter looks like. I'm going to show you this on a data set called births. This is also a data set that can be downloaded within R, so it doesn't involve any additional installation or downloads. This births data set refers to the number of births per month in New York City from 1946 to 1959. And I just want to make sure that everyone can actually see my screen, so I'm going to zoom in. Apologies, I didn't realize it was quite zoomed out. So again, we're going to read in our data set using that scan function. We don't need to skip any lines this time, because there wasn't any additional information at the top. We have 168 items. Let's have a look at what that looks like in its rawest form. As you can see, we have quite a messy data frame here; it's pretty hard to tell what's what, and this is because R doesn't know that there's a time variable within this data set. But we can use the frequency parameter to identify what type of time interval we're using. In this instance, because I know the data set is recorded monthly, I want to supply the number 12, which indicates the months within the year. You can also set a start parameter, which just indicates when the first observation was recorded. So let's go ahead and run that line and have a look at what we're dealing with now. This is a much neater data frame: we have our assigned months, thanks to the frequency parameter, and we have the years. For comparison, if you had weekly data, this number would be 52; if you had quarterly data, it would be four; and if you had yearly data, it would be one. So let's go ahead and plot our time series. We can use the plot.ts function to do this. It's really simple code: all you need is plot.ts, and then you call on the data set you're interested in. Let's have a look at plotting that kings data set first. This gives us a pretty simple but effective time series plot of the ages at death of the successive kings of England. Obviously, this is a univariate series without seasonality, so it's very simple. So let's have a look at what happens if we plot a data set that has seasonality, that has months present. This is a classic example of a time series plot. The plot.ts function is really effective for creating quick and efficient plots, but there are problems with it: there's no title defined, and it's quite plain. But it tells you what it needs to tell you. We can see a more or less upward trend from about 1947 all the way to 1960. We can also plot ts objects using ggplot2 to extend these visualizations and make them better, and this is enabled by the ggfortify package. We're going to be using a function called autoplot. ggfortify lets ggplot2 know how to interpret ts objects so they work together.
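A sketch of the births example, again assuming the Time Series Data Library link used in the demo files:

```r
# Monthly births in New York City, starting January 1946
births <- scan("http://robjhyndman.com/tsdldata/data/nybirths.dat")

# frequency = 12 marks the data as monthly; start gives the first period
births <- ts(births, frequency = 12, start = c(1946, 1))

# Quick base plot, then the ggplot2 version via ggfortify
plot.ts(births)
library(ggfortify)
autoplot(births)
```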
autoplot is almost like a low-level ggplot method, but it's much simpler and easier to use to produce fairly complicated graphs. So let's have a look at how this looks on our births data set; sorry, that's the seasonal, monthly data set, the births. This gives us a bit of a nicer plot, I would say; it's got some background grids. But we can also extend this plot by adding further arguments. So I've added an argument called ts.geom, which basically specifies the type of line or geometry you want to plot, and I've also decided to fill this with the color grey. So let's have a look at what happens here. Nothing happened. Strange. Let me try to reload the package. How strange. Let me try a different option. So let's do ts.geom equals line. That's so strange. I'm not actually too sure why that's not changing. Let's try one more: let's try a bar. This was literally working yesterday. Hopefully this works for you. What if I added ts.linetype instead? I think that's one. Let's set this to dashed. Sorry about that. I'm not too sure why that's not changing, but hopefully it does work for you in your own code. This was working two minutes before this workshop. Very strange, but this is just basic visualization anyway. So let's go on to looking at how we can build our forecasts using the auto.arima function. As mentioned, the auto.arima function basically returns the best ARIMA model according to either the AIC or the BIC, conducting a search over possible models within the order constraints provided. So let's use the auto.arima function; I've created a new object called births_arima just so that we don't override the original. You call on the original data set and specify whether there are any seasonal patterns, which is true in this case. So let's run births_arima and have a look. This has automatically provided our values for the p, d, q in the non-seasonal elements and the P, D, Q in the seasonal elements. It's also identified that we are using 12-month data. If you remember correctly, p stands for the autoregressive order, so this is telling us that there are two autoregressive lags in the non-seasonal component and one lag in the seasonal. The d, which is the middle number, represents the differencing, so it's told us that the series had to be differenced once in both the seasonal and the non-seasonal components in order to make the data stationary. And then we have our values two and one for the q, which is the moving average order, so this contains two moving average lags in the non-seasonal component. We can then go ahead and check the residuals. It's important to check the residuals because, in a time series model, these are the data left over after fitting the model, and for many, but not all, time series models, the residuals are equal to the difference between the observations and the corresponding fitted values. Using the checkresiduals function, you'll get a figure with three separate graphs. These graphs show whether the method produces forecasts that appear to account for all available information. In this instance, we can see that the mean of our residuals is close to zero, and there's no significant correlation in the residuals series.
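Here's a minimal sketch of that fit and the residual check, assuming the births ts object created earlier; auto.arima and checkresiduals are both from the forecast package.

```r
library(forecast)

# Search for the best (p,d,q)(P,D,Q)[12] model; seasonal = TRUE allows
# seasonal terms
births_arima <- auto.arima(births, seasonal = TRUE)
births_arima

# Diagnostics: residual time plot, ACF and histogram in one figure
checkresiduals(births_arima)
```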
In the checkresiduals output, the time plot at the top shows that the variation of the residuals stays much the same across the historical data, apart from maybe this outlier here in 1956, and this means that the residual variance can be treated as constant. We also have the histogram of the residuals, and the histogram suggests that the residuals might not be normal, because the right tail seems a little too long. It's not too dramatic, but this isn't a perfect bell curve, so you'd have to question it. Okay, let's move on to making the forecast. We can make the forecast using the forecast function from the forecast package. Again, I'm creating a new object, called births_forecast, just so you don't overwrite the previous objects. We call on our model, which was births_arima, and you can use the h argument to set the number of time steps, and we're using monthly data. So let's go ahead and... apologies, I'm not sure why that's happened. Give me one second while I rerun this. I will just call in the data set again; sometimes this happens. I'm confused about that, because this had just run prior to this workshop. How typical. I've got the right model in place, I'm using the right function, and we've called on it. That's a shame; I was going to show you how we can create really quick forecasts. If anyone knows what the error means, that'd be great, as I'm not too sure myself. Someone said that the code works on their laptop. So is anyone else having issues making the autoplots and running the forecast, or is it just me? Yeah, sorry. If this works for you, then great: you should be able to see a plot that shows the forecasted values at the end for another 12 months. And if you remember what we did in our first session: you model your variable of interest, you make the forecast, and then you plot it, following those three steps. When working with crime data, there's typically another step, which involves identifying the time interval, and I'll show you how to do that. Yeah, people are saying it works for them, but not me. Very strange, but I'm glad it's working for people, so we'll move on. We're going to look at decompositions now. If you remember, in the talk we mentioned decomposition as a way to analyze whether there is stationarity in your data set. Decomposition splits out those four components: the seasonality, the trend, the noise and the cyclical variation. We can use the decompose function to do this on our births data set, so let's have a look at what that looks like. As you can see, this has added a new object called births_decomp into our global environment, and we can use the head function to look at the first few rows of data. What we have here is our original data set in the first instance, our seasonal component in the second, our trend in the third, and our random component as well. So this has all been split. It also tells us what type of model we're working with: it tells us that we have an additive decomposition structure, which means the components are added together to form the observed series. So let's have a look at how this looks visually, because obviously it's really hard to understand as raw numbers. We can use the plot function from base R, and that will give you something that looks like this.
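In case the forecast call works for you (it didn't for me live), here's the shape of it, plus the decomposition step that follows; h is the forecast horizon in the forecast package.

```r
# Forecast 12 monthly steps ahead from the fitted model
births_forecast <- forecast(births_arima, h = 12)
autoplot(births_forecast)

# Split the series into trend, seasonal and random components
births_decomp <- decompose(births)  # additive by default here
head(births_decomp$seasonal)
```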
So you have your original data, your trend, your seasonality, and your noise. However, I'm not really a fan of this plot because it does look quite messy, so I tend to use the autoplot function. Will it work? Let's find out. It does work. This gives you the titles of what each of these graphs are. Someone has asked whether my libraries loaded correctly. Yeah, I didn't have any issues running this just before the workshop to make sure everything was working, but there might be an issue with the way I've set up the Binder links; possibly that might affect the code. But as long as you are able to see the plots, then no worries. So, how do you remove seasonality from your data set, and why would you want to? Many industries experience fluctuations in various metrics based on the time of the year, and we see this in crime as well. This means it's not possible to effectively assess performance by comparing data from one time of the year to data from another. Furthermore, these seasonal fluctuations can sometimes be so large that they mask important trends hiding in the data. So if the birth rate were to increase in September, for example, is this due to seasonal variation, or is it an actual increase in the birth rate? To get these answers, we need to remove the seasonality from the data, and this process is called seasonal adjustment. In order to seasonally adjust your data, you subtract the seasonal component from your original data frame. So I'm calling on a new object called adjusted births, and I am taking away the seasonal component that we got from the decomposition from our original data set. Let's have a look at what this looks like without a seasonal component. As you can see, the trend becomes a little less clear, but the seasonally adjusted time series provides a way of understanding the underlying trends by removing the noise of seasonal fluctuations, so outliers and anomalies in the data are easier to see. So how would you then check for stationarity? If you remember from the webinar, there were two ways to do this: the graphical way and the statistical way. And remember, stationarity is just when the mean is constant, the standard deviation is constant, and the autocovariance is also constant. We need a time series to be stationary in order to make forecasts. The graphical way would be to examine that decomposition plot, which we've just done. The second way would be to use a statistical method. In this example, I'm using the augmented Dickey-Fuller test, known as the ADF test, which is a common statistical test. The ADF test belongs to a category of tests called unit root tests, which is the proper method for testing the stationarity of a time series. The ADF test extends the Dickey-Fuller test equation to include higher-order autoregressive processes in the model. We can use the adf.test function to examine this, and we can also do this on the births data set. So let's first examine the kings data. Typically, the p-value obtained should be less than the significance level, which is 0.05, in order to reject the null hypothesis. So in this instance, I would say that the series isn't stationary, because our p-value is not smaller than 0.05, as seen here.
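A sketch of the seasonal adjustment and the ADF tests just described; adf.test is from the tseries package, and the object names follow the demo.

```r
library(tseries)

# Seasonal adjustment: subtract the seasonal component from the original
births_adjusted <- births - births_decomp$seasonal
autoplot(births_adjusted)

# Augmented Dickey-Fuller tests for stationarity
adf.test(kings)   # p above 0.05: can't reject non-stationarity
adf.test(births)  # p below 0.05: evidence of stationarity
```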
With the births data set, on the other hand, we have a p-value that is less than 0.05, which indicates that we can reject the null hypothesis, inferring that the series is stationary. I'm going to quickly talk about the zoo package, and then we can take a little break before we move on to look at the crime data. The zoo package is pretty important because it allows you to perform calculations on regular and irregular time series of numeric matrices and vectors, and I'm going to show you how we can compute some rolling averages in order to smooth our data. Your first step is to install the zoo package, and you then want to load it into your session; you can use the require function to do this. The first step is to load the data set. We're going to be using a different data set here, which is the average temperature by month and year in Nottingham, and this also exists within R. So if you call on the data function, this will load the nottem data set, as you can see in the values in your global environment. We can then plot this data set using the plot function. If we scroll down, we have a very basic plot of what we are looking at, from 1920 to 1940, with our values on the y-axis. We can extend this plot by again using autoplot, adding labels to our y-axis and our x-axis; we can include a title using the ggtitle function, and we can also identify what type of theme we want to use. So if I run this plot and scroll down, we've got a bit more of a visually pleasing plot, right? We've got titles and whatnot. We can also view the seasonal sub-series with the ggsubseriesplot function, which basically plots out the seasonality for each month as a separate trend, but on the same graph. I really like this one because it helps to visualize all the seasonal components in your data a bit better. But before smoothing this data, let's explore some quick descriptive statistics, and we can do this using the xtable package. So again, I'm just going to load that, and I'm assigning the result to a new data frame called nottem2. We can view the descriptive statistics now by using the head function. As you scroll down, this gives us a summary of the values in each month, for the first five rows of data, that is. We can also use the summary function to get a bit more of an extended grid to have a look at each month; this gives you the mean, the median and the quartiles. Your first step is always to identify what type of time interval you have; again, you can use the frequency function to do this. This is telling me that we are working with monthly data. Obviously, we did know this because we picked it up from the summary statistics, but yes. So let's decompose this data and plot the separate components. Again, I'm going to use that decompose function, calling on our original data set and identifying that this is an additive model. We can then plot the decomposition, and this gives us a graph like the one we've seen before. Or again, you can use autoplot, which I just prefer because it gives you those titles, and it also draws the residuals as a bar plot rather than a line plot, which helps identify which of these values have high noise. And if you want to seasonally adjust the data, again, you do it the exact same way, by removing the seasonal component from your original data set.
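Here's the nottem exploration in sketch form. nottem ships with R; the labelling extras are ggplot2 functions layered onto the ggfortify autoplot, and ggsubseriesplot is from the forecast package. The theme choice is my assumption.

```r
library(ggfortify)
library(ggplot2)
library(forecast)

data(nottem)         # average monthly temperatures, Nottingham, 1920-1939
frequency(nottem)    # 12, i.e. monthly
summary(nottem)      # mean, median and quartiles

autoplot(nottem) +
  xlab("Year") +
  ylab("Temperature (F)") +
  ggtitle("Average monthly temperatures in Nottingham") +
  theme_minimal()

ggsubseriesplot(nottem)  # one mini-series of the seasonal pattern per month

# Decompose and seasonally adjust, exactly as with the births data
nottem_decomp   <- decompose(nottem, type = "additive")
nottem_adjusted <- nottem - nottem_decomp$seasonal
```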
So I've assigned this to a new object called x, which is probably not necessary, sorry: taking away the seasonal component from our original data set and reading this into a new object. We then plot this, and we have a seasonally adjusted time series plot. And, if you are interested, you can plot the individual components of a decomposition by calling on the variables that we've seen: you can plot the seasonal component, you can plot the trend, and you can plot the random component as well. So what happens when we smooth our data? The reason we want to smooth our data is to see whether there's too much fluctuation in the seasonal patterns. I'm calling on a new object called nottem_mean and using the rollmean function. In this instance, again, we call on our original data set, which is nottem; we identify the number of time steps in our window, which is 12 months; I'm asking it to fill any missing values with NA; and then we align this to the right. We use fill = NA because there are not 12 previous months before the start of the series, and align = "right" means each value averages over the current and 11 previous observations, whereas align = "left" would use the next 11 months. So let's go ahead and run that and have a look at our summary here. This is a summary of the rolling averages from month to month. We can then plot this again using the plot function, and this gives us something like this: a rolling average of our data. You can also add additional arguments to this plot function, so you could add a ylim argument; I'll just set this to 30 and 70. This basically allows you to zoom in and out of your data, to create a bigger or smaller plot. And just briefly on the xts package: as I mentioned, this is an extension of the zoo package, and it's useful when you need your time series to be a bit more flexible. I'll just quickly run through this, as I'm noticing the time already. We're going to use the nottem data set again, and I'm going to convert it to an xts object. You might want to load the package first; see, I always make mistakes, these things happen. That allows us to use the as.xts function, which is from this package. We then need to convert this into a matrix in order to run the time series plot, and we can use the head function to look at the data here. I'm just going to skip this part here because it doesn't actually involve any of the data sets we use; it's just a made-up data frame that I use, which provides a good example of how to create a time series plot from an xts object. I'll run it briefly, why not? I'm basically creating a structure that stores the number of hours that someone has worked, along with some attributes about them, like their birth date. So I created a random object called hours with five random numbers, I've created some dates as a Date class variable, so that will be our time index, and I'm attaching some attribute information, which is the birth date. And then I use all of this to create a new xts object called work. We can look at the structure of an xts object by using the str function. When working with time series, it will sometimes be necessary to separate your time series into its core data and index attributes.
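A sketch of the rolling average and the toy xts example. rollmean is from zoo; the hours and dates values here are made up, as in the demo, and extra named arguments to xts() (like born) are stored as attributes on the object.

```r
library(zoo)
library(xts)

# 12-month rolling mean; fill = NA pads the months with no full window,
# align = "right" averages the current and previous 11 observations
nottem_mean <- rollmean(nottem, k = 12, fill = NA, align = "right")
summary(nottem_mean)
plot(nottem_mean, ylim = c(30, 70))  # ylim zooms the y-axis in or out

# A made-up xts object: hours worked, indexed by date, plus an attribute
hours <- c(7, 8, 6, 9, 5)                          # five made-up values
dates <- as.Date("2020-01-06") + 0:4               # a Date index
work  <- xts(hours, order.by = dates, born = as.Date("1990-05-02"))
str(work)

coredata(work)  # the matrix portion of the xts object
index(work)     # the time index
```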
The core data is the matrix portion of the xts object, and you can separate this from the xts object using what is known as coredata. Let's go ahead and do that. We can then view the class of the core data, and it tells us that we do in fact have a matrix array. And to get the index, you can use the index function. Oh, sorry, yeah, this should say hours, apologies. A lot of errors today, apologies. But anyway, these are just some functions that allow you to work with xts objects. That draws section one to a conclusion. I am wary of the time, so we'll take a quick five-minute break here while I have a look at any Q&As and comments. Hi everyone, I hope you had a nice little break, stretched your legs, got a drink. We're going to move on to the second part now, which is looking at some crime data and seeing how we can run those four steps that we saw in Tuesday's session. So, it goes without saying: always load your packages first; that's just good practice. We are going to be using data from a package called crimedata. We are specifically going to be looking at data from Detroit, from 2015 to 2020. The crime data comes from a database called the Crime Open Database, also known as CODE, which makes it convenient to use crime data from multiple US cities. All the data available is free to use, and it's quite flexible in terms of what you can collect: data can be downloaded by year, and data can be downloaded by city. You also get to choose what type of data you want to use, whether that be the core data frame, the extended data frame, or a sample of the data frame; all you would have to do is change the type argument to extended or to sample. For our case, I want to use the core data set, which is like the raw data set. Another really useful option here is the argument output = "sf". This basically allows you to return a simple features object, and that would be useful if you're looking at mapping crime data. If you want to know how to map crime data (selfish plug), we ran a mapping crime data workshop not long ago, and you can use this data frame and apply the concepts we talked about in that code demo to have a go at mapping crime data. But in this workshop we're really interested in the time component, so we don't need that. You can also download the full list of URLs for the data files; this does take a few seconds, so I'm actually just going to skip this part, but it will let you see the data URLs for each different city. However, as I said, we're only interested in the years 2015 to 2020, and I'm only interested in Detroit. So in order to obtain this data set, I've called on a new object called crime, and I use the get_crime_data function from the crimedata package. I've asked it to select only the data from Detroit and only the data from 2015 to 2020. So if I go ahead and run this, it will take, I'd say, about 30 seconds to a minute. Quicker than I thought, actually. If you look at my objects, I have a new object called crime here with a lot of observations, but we can again view the first rows of the data set using the function head (the first six rows, in fact). So this is the data set that we're going to be looking at.
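Here's the download step in sketch form; get_crime_data() and its cities, years and type arguments are from the crimedata package.

```r
library(crimedata)

# Core Detroit records for 2015-2020 (type can also be "extended" or "sample")
crime <- get_crime_data(
  cities = "Detroit",
  years  = 2015:2020,
  type   = "core"
)
head(crime)
```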
Before we even get going, let's have a quick look at the variables that we have. There's the city name, the offense code, the offense type, the offense group, and the offense against, so whether it's against persons, property or society. We've got the date that the offense was reported on, and we have the longitude, the latitude and the census block. These last three variables are only really relevant if you specify that you want an sf object. In our case, we're really only interested in two variables: date_single and offense_group. Let's have a quick go at exploring this data to see some of the general trends. You can use the pipe operator, which is part of the dplyr package, to chain code all in one chunk. I'm calling on a new object called offense_count, and I'm calling on the data set called crime, which is that original data set. This pipe operator can be read as "and then": we want to group by the offense group, so that's grouping all those crime types, and then we want to summarize the number of crimes that happened within each offense group. I've also got the arrange function here, which basically tells it to arrange the summarized data frame in descending order of the count column, so that the offenses with the highest counts appear first in the data frame. So let's have a look at what happens when we group the data by offense group, view the first 10 rows, and see what we're looking at. This gives us a data frame of just two variables, shown as a tibble with 10 observations and two variables, so we have our raw count of crime for each different offense group. You can also open this up; sorry, if you click on offense_count, this will open the data set in the RStudio viewer so it's a bit clearer, and as you can see, there are 32 different offense categories. You might notice that there are categories that only have a few counts. They're probably going to affect the analysis, because the variation in them is so low, right? If you are interested in studying all the different offense groups, then you would need to do some data manipulation to merge these minor crime categories into one crime category. I've gone ahead and done that, but I'll show you that a little later on. So let's first just plot the offense counts using ggplot. This gives us quite a packed bar plot, but it has all the offense groups and their counts. You can see that assault offenses are by far the most prominent crime, which is definitely not surprising. But as I mentioned, we are specifically going to be looking at burglary offenses. So I'll show you how we can select just those burglary offenses; well, I'll show you that now, apparently. In order to select only the crimes of interest, you can use the filter function from the dplyr package. I call on the variable of interest in our crime data set, which is the offense group, and you can use a double equals sign to select which type of crime you are interested in, which we'll put into a new data frame called burglary. So if I run this bit of code and view the first 10 rows, you can see that we are only looking at the offense group that covers burglary/breaking and entering. Now, you could quite easily select an offense type instead, but the reason I chose the offense group is because it groups all the different types of burglary.
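A sketch of the counting and filtering just described, using dplyr verbs. The column names match the crimedata output, but the exact burglary label is an assumption: check the values in your own data (I mention the "&" catch just below).

```r
library(dplyr)

# Count crimes per offense group, largest first
offense_count <- crime %>%
  group_by(offense_group) %>%
  summarise(count = n()) %>%
  arrange(desc(count))

head(offense_count, 10)

# Keep only burglary offenses (note the exact, case-sensitive label)
burglary <- crime %>%
  filter(offense_group == "burglary/breaking & entering")
```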
It also just makes the analysis a bit easier for us. Just to let you know, obviously R is case sensitive, which means that if you had spelt burglary with a capital B, you can see that we now have zero observations, and this is because it couldn't match the value. So make sure that when you are working with your variables, you type values in the exact same format that they are written in. I had trouble because I didn't realize that breaking and entering used an "&" symbol rather than the word "and". Little things, right? But if we rerun that, you can see it now jumps back up to 47,000. So yeah, you may have noticed that some of our crime counts are really low, as I mentioned. So if you were interested in studying more than just burglary, if you want to look at all crime types, then you might want to do some data pre-processing and categorize those minor crimes into one category. So let's first group the offense types and count the data; we already did this before, and it brings up this table, right? We've got some really low counts here, and if you were to run an analysis on these, there's just not enough data to produce a meaningful time series plot. You'd have a lot of uncontrolled variation, because of single counts of, you know, a peeping Tom or a gambling offense. You're not going to be able to create forecasts from data that barely exists, so it's better to group these into one category. The first step is to collect those small counts of crime so we can reduce the unwanted variation, and you first need to mutate, that is change, your offense group variable into a character variable. You can check the type of variable that you have by using the class function; if I run it on offense group, it tells us we've got a factor variable. However, when using case_when, you need to ensure that your variables are characters, and I'll show you that in just a minute. So this converts our factor offense group into a character offense group. I am now assigning all those crimes that have fewer than 1,000 crime counts to a new category called minor, you know, minor offenses. If we view this, we can see that these are the offenses that all had fewer than 1,000 crime counts. Your next step is then to mutate your data frame so that you group all of those minor offenses into one category in the original data, and we can do this using the case_when function. It basically updates the offense group column based on whether the value is found in the minor vector: if the offense group is found in the minor vector, then the value is replaced with "minor crimes". So let's go ahead and run that. And then we can view this new category right here, on line 15: minor crimes. We now have 3,515 crime counts, which provides a better grouping or comparison variable for understanding crime. So that's the basics of the data preparation: understanding your data and exploring your data.
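Here's a sketch of that regrouping. The 1,000 threshold and the "minor crimes" label follow the demo; pull() and case_when() are dplyr functions, and the object names are mine.

```r
# Offense groups with fewer than 1,000 records over the whole period
minor <- offense_count %>%
  filter(count < 1000) %>%
  pull(offense_group)

# case_when needs character input, so convert the factor first,
# then collapse every minor group into one "minor crimes" category
crime <- crime %>%
  mutate(
    offense_group = as.character(offense_group),
    offense_group = case_when(
      offense_group %in% minor ~ "minor crimes",
      TRUE ~ offense_group
    )
  )
```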
We're now going to move on to looking at our time and date variables. We're specifically going to be using the burglary data set, which is just the burglary counts from Detroit from 2015 to 2020. Your first step is to identify the time interval in your data set, right? But first, let's have a look at what type of data frame we have. As you can see, we just have a tibble for a data frame. However, some functions within packages such as fable require you to turn your data set into a tsibble object, as we mentioned last session. You can still create time series plots without converting your data set, and I'll show you how to do this in ggplot. But first, let's explore the time variables in our data set. I'm calling on the date_single variable from our crime data set first, and it tells us that we have a POSIXct variable. This is obviously going to be the same in our burglary data set, because that's just a filtered version. This type of variable refers to a class that stores both date and time. Your first step is to convert this into a date object so that R can recognize it, and so we can work with some of the packages that require you to identify a date. We can do this using a combination of functions. In this code here, we have three main functions: mutate, as.Date and yearweek. The mutate function is asking R to create a new variable named week, and I'm using the yearweek function to convert date_single, which is our date variable in the burglary data, to a week-year object: that is, to weekly data. The as.Date function is then used to convert this week-year object into a date object representing the first day of the corresponding week. So let's have a go at converting our date variable into a date object R recognizes. If we scroll along, we'll see... yeah, so once that's done, you can then move on to counting the number of weekly crimes. We established in our talk on Tuesday that we're interested in the weekly crime counts, because this provides the best time interval for reducing variation. So your first step is to create a new data frame with the count of the weekly crimes. I'm calling on a new data frame for weekly crimes, as this just makes it a bit clearer. Again, I'm using the mutate function, which is what I just spoke about in the code above, converting the date_single variable into a workable date object, and then asking it to count the number of crimes per week. If we run this and have a look at the first 10 rows again, you see we now have a new data frame that indicates the week that the crimes took place in and the number of burglary offenses that happened in each week. There are also a few other steps. Firstly, as you can see, we've got data from 2014 for some reason, but we're interested in data from 2015, so we want to remove that first row of data, and you can do this by simply calling on an integer: minus row one of your data. We then need to convert this to a tsibble. The index argument at the end here indicates the column in the weekly crimes data frame that should be used as the time index for the tsibble object; I'm letting R know that week is my time variable of interest. We then want to remove the gaps, so that's filling in any NAs, missing values or missing weeks, and we can do that using the fill_gaps function. And then we can remove the incomplete weeks by using filter with is.na. And let's print the first few rows of the data set. Now we have a clean, tidy data set named tsb_weekly_crimes, which stands for tsibble weekly crimes.
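A sketch of those steps end to end. yearweek, as_tsibble and fill_gaps are from the tsibble package; the object names follow the demo, and the -1 row drop assumes the stray 2014 week sits in row one, as it did on screen.

```r
library(dplyr)
library(tsibble)

# Convert each timestamp to the first day of its week, then count per week
weekly_crimes <- burglary %>%
  mutate(week = as.Date(yearweek(date_single))) %>%
  count(week)

weekly_crimes <- weekly_crimes[-1, ]  # drop the partial week from 2014

# Declare week as the time index, fill gaps, drop incomplete weeks
tsb_weekly_crimes <- weekly_crimes %>%
  as_tsibble(index = week) %>%
  fill_gaps() %>%
  filter(!is.na(n))

head(tsb_weekly_crimes)
```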
So let's go ahead and start plotting our data. As I said, you can plot a time series without having to convert it. In this example, I'm using ggplot, and I'm using the weekly crimes data set, so that's the one that hasn't been converted yet. I'm mapping the week as my x variable and the count as my y. And this is our first look at the trend in our crime data: we have data from 2015 to 2020, and as you can see, around 2017 there's a really large spike in our data set, so that's obviously something that's going to need to be examined. We can also use ggplot in a different way, by calling on your data frame first and using the pipe operator to then supply the aesthetics to ggplot. In this instance, I'm using geom_point, which is a bit different from geom_line, and this gives us something like this, which is very similar to the plot that we saw in the workshop on Tuesday. Just out of interest, for anyone interested in the coding part, you can run everything we've just done in one block of code: you can count the weekly crimes, convert the date object and plot this all in one go. If you want to run that part of the code, you just need to uncomment it, so you can use Ctrl+Shift+C, or Cmd+Shift+C if you're on a Mac. But yeah, although ggplot is useful for visualizations, it becomes difficult to understand the underlying components of a time series, in that we won't be able to run decomposition models or stationarity checks or any of that using a non-time-series object. So again, we're going to convert the weekly crime data into a time series object. Apologies, I had used the data frame that is a tibble rather than the tsibble; we should be working with tsb_weekly_crimes rather than weekly_crimes, which is not a tsibble. So this converts our weekly crime data into a time series object. We've identified the frequency, which is 52 because we're working with weekly data, and we've got our start and end periods. We can then plot our time series using autoplot, like so, or we can use plot.ts, and this gives us two very basic time series plots. Your next step, again, would be to check for stationarity; it's important because your data needs to be stationary. Again, we're going to use the adf.test function to do so, and this gives us a p-value of 0.01, which would indicate that our null hypothesis can be rejected, because it is lower than the significance level of 0.05, meaning that this data is stationary. Sorry, again, I think I supplied the wrong object there for a moment. If your data wasn't stationary, then you would need to perform what is called differencing, and you can use the diff function from base R to do so.
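In sketch form, the two ggplot versions and the ts conversion. frequency = 52 marks the data as weekly; the start value is an assumption based on the date range covered.

```r
library(ggplot2)
library(ggfortify)
library(tseries)

# Plot straight from the data frame, no ts conversion needed
ggplot(weekly_crimes, aes(x = week, y = n)) +
  geom_line()

# Or pipe the data in and use points instead of a line
weekly_crimes %>%
  ggplot(aes(x = week, y = n)) +
  geom_point()

# Convert to a ts object for the decomposition and ARIMA functions
weekly_ts <- ts(tsb_weekly_crimes$n, frequency = 52, start = c(2015, 1))
autoplot(weekly_ts)

adf.test(weekly_ts)  # p = 0.01 in the demo, so the series looks stationary
```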
So let's go ahead and run our decomposition plots. I've provided, I believe, three different ways for you to run a decomposition plot. You can use the stl function, the seasonal decomposition from the stats package; this function takes the time series object as its argument and specifies the type of seasonal window to use. You can then pass the results of a decomposition into autoplot, so let's have a look at what that might look like. Using stl, we get something that looks like this. We can also use the decompose function, also from R's base stats package, which we've already done; we can then plot this, and that gives us that very basic and kind of unreadable random-error plot, which I'm not really a fan of. You can extract the individual components if you're interested. Or you could plot through the forecast package. So let's have a look at running decompose again, but this time identifying whether we have an additive or multiplicative structure, to see the differences. It tells me straight away that a multiplicative model wouldn't work here, because we are dealing with additive data. So again, we have our original data, we've got the overall trend, we have the seasonality and we have our noise. One other thing I want to talk about before going on to the ARIMA models is accounting for holidays in your models. The timeDate package returns vectors of date objects representing bank holidays across the globe. We can use the holidayLONDON function to return the bank holidays in London. However, reading this out loud, I've just realized that we are using data from New York and from Detroit, which means we shouldn't be using holidays from London, but instead US holidays. You can use the double question mark to search the package, and I think, rather than LONDON, we want NYSE. There we go. This gives us the US bank holidays, which makes more sense since we're using data from Detroit. So yeah, this should be holidayNYSE; obviously, if you're using crime data from London, then go ahead and use holidayLONDON. I will push the changes to this code. So what I'm doing here is assigning the result of the comparison to a new column called bank_holiday in the data frame, indicating whether each week is a bank-holiday week in the US for the year 2020. We can go ahead and run this, and if we scroll down, you can see it basically gives you a binary variable: it will say TRUE or FALSE, indicating whether that date falls on a bank holiday or not. If you're running ARIMA models, it might be important to consider bank holidays, because these could affect the frequency of crime, right? If we want to establish cause and effect, to see how the frequency of burglary from 2015 to 2020 changed, we might want to include bank holidays as an exogenous factor that could be affecting the frequency of crime. So lastly, we are on to our ARIMA models, which is the last bit of code that we're looking at today. We use ARIMA models to examine whether the expected crime counts are different from the predicted in a certain year, or to forecast the crime data and then compare the expected to the predicted. We've got a few steps to go through. The first is to subset the time series object, and we can do this using the window function. In this code, at line 369, it's used to create two subsets, the train data set and the test data set. The training data set represents the data from 2015 to the end of 2019, and the test data set represents the data from 2020. So I'm simply splitting this data into two. Let's go ahead and run that.
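A sketch of the holiday flag and the train/test split. holidayNYSE() is a timeDate function returning US (NYSE) bank holidays; the bank_holiday comparison below is a loose rendering of the line in the demo file, matching on the week each holiday falls in.

```r
library(timeDate)
library(tsibble)

# US bank holidays for 2020, converted to the Monday of their week
holiday_weeks <- as.Date(yearweek(as.Date(holidayNYSE(2020))))

# TRUE where a week in our data coincides with a bank-holiday week
tsb_weekly_crimes$bank_holiday <- tsb_weekly_crimes$week %in% holiday_weeks

# Split the series: 2015-2019 for training, 2020 for testing
train <- window(weekly_ts, start = c(2015, 1), end = c(2019, 52))
test  <- window(weekly_ts, start = c(2020, 1))
```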
If you look at your values, you'll see that we've got two added objects; I'll just move this along so we can have a proper look. So we've got test and train here. You can see the test object indicates we have data from 2020, and our train object indicates we have data from 2015 up to 2020. Your next step would be to fit the ARIMA model. In this example, I'm using auto.arima from the forecast package, because this allows us to automatically select the best ARIMA model for the time series data. In this code, I've also set the seasonal argument to TRUE, indicating that the time series does have seasonal components. So let's go ahead and run our model. And then your next step would be to run the prediction. Ignore this line of code here for now, sorry. But yes, we want to predict the burglaries for 2020, and we can do this by running the forecast over the period covered by our test data set. So that's... sorry, just making sure that I got that right. Yes, that's the period from 2020. If we go ahead... it's done it again. Let me see if maybe adding a level helps; this sets the confidence intervals, so that any values that fall above or below these bounds can be marked as insignificant. I'll set these to 80 and 95. I'm having the same issue there. Well, if anyone's actually able to run that bit of code, the forecast, could you give a thumbs up or leave a word in the chat? I'd love to know what this error actually means. So yeah, I do apologize about that. Let me just see if running a forecast on the whole data set works. Yeah, I'll try the whole thing. Very, very confused about this; I do apologize, sorry guys. I had issues with the code a couple of months back due to changes in the forecast package, and I've been struggling to effectively fix these. But if you are able to run that forecast, then these autoplot functions should work for you, and they will show you the expected crime count in 2020 compared to the predicted, which was established from the ARIMA models fitted on 2015 to 2019. I'm really gutted that that isn't working. But if you're interested in doing this yourself and following the code that came from Ashby's paper in 2020, then you can head over to this link here. Emma, if you want to drop that link in the chat, that would be great, just so people have it at hand. It shows how you count your crimes, how you model your forecast and then how you plot your results. The code here was obviously adapted from that paper and that site. That draws this talk to a conclusion. I hope this has been an informative and engaging session for you, and hopefully you've been able to apply some of the topics to your own work and started to question your own research questions. That's, I believe, all the questions answered, and it's just reached 12:30, so we're going to call this webinar here. Thank you all for attending, whether you attended both sessions or just one. Please complete the survey at the end, and thanks, Emma, for facilitating. If you have any questions, please feel free to contact me via email or Twitter. Other than that, thank you all; we'll close off the webinar here.
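For anyone working through the files afterwards, here's a minimal sketch of the fit-and-forecast steps from this final section, assuming the train and test objects from the window split above; h = 52 gives a year of weekly steps, and level sets the prediction-interval widths.

```r
library(forecast)

# Fit a seasonal ARIMA on the training window (2015-2019)
fit <- auto.arima(train, seasonal = TRUE)

# Forecast 2020 with 80% and 95% prediction intervals
fc <- forecast(fit, h = 52, level = c(80, 95))

# Plot the forecast with the observed 2020 data overlaid
autoplot(fc) + autolayer(test, series = "Observed 2020")
```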