Good morning everyone. My name is Nadia Kenner. I am a research associate with the UK Data Service, based at the Cathie Marsh Institute at the University of Manchester. Thank you all for coming to this introductory talk on time series analysis and forecasting.

In this talk I aim to cover what time series data is, specifically comparing how it differs from cross-sectional and longitudinal data. We'll look at what exactly time series analysis is, the types of time series analysis available, and the components of TSA. We'll look at how we can fit some specific time series models, we'll examine one specific forecasting technique known as SARIMA or ARIMA models, and then we'll briefly discuss some of the available software. I hope that this talk gives you a guide, or some inspiration, to conduct time series analysis in your own work. The benefits of time series analysis are vast, and I could spend the whole webinar on that alone, but in short it allows us to measure and analyse change: what has changed in the past, what is changing in the present, and what we can forecast to examine what changes might look like in the future.

Before delving into the content, we'll just make sure the Mentimeter polls are working. Here's the code on the slide; you can use the QR code or head to the website. If you just head on over there and answer this first question, which is more of interest to me: what kind of software do you use most for your own analysis or research? I'll just give you a few minutes for those answers to roll through. That's great, it looks like we mostly have R users here, which is of course the correct software to use, it's my favourite. Good to hear. So we have a majority of R users. This is just for me to know; it's interesting to see where we could take future webinars and what most people are interested in learning. But yes, as you know, the live code demonstration in the second part of this webinar will be taught in R, so thank you so much for answering that, and we'll move on to some of the content.

So we're going to start by looking at what exactly time series data is. In short, it is a collection of observations obtained through repeated measurements over time. Each instance represents a different time step, and the attributes give values associated with that time. The intervals at which time series are recorded vary widely: you can have yearly data, quarterly, monthly, hourly, and also weekly, which I seem to have not included on the slide, but weekly is an option. As with any type of research, the first step in your research process is thoroughly understanding your variable types, and when looking at time series data your time interval is one of those components you need to consider.

So the question is: how is time series data different from just having a time field in the dataset, and can longitudinal datasets be considered time series? Well, this depends on how the data has been collected, as this affects how we can analyse changes in state over time. I use the word "changes" as this is a key concept in understanding time series analysis. The major difference between time series and non-time series data is that time component.
For time series data, time is the dependent component, whereas for non-time series data time is not necessarily a central theme. We can explore this concept a little better with a made-up scenario. Imagine you have been asked to maintain a web application and to analyse when a new user logs in. After some careful consideration you realise there are two ways to do this: when a user logs in, you may just update a last-login timestamp for that user in a single row, or you may decide to treat each login as a separate event. I'll give you five or ten seconds to let that sink in, and then you can answer on Mentimeter. No worries, there's no right or wrong answer; this is just to get you thinking a little more creatively about time series data. So head over to Mentimeter and pop in your answer: if you were to maintain a web application, would you choose option A or option B?

The majority of people are leaning towards option B, where you treat each login as a single event. Very interesting. We have a few people unsure, and that's absolutely okay; we're going to discuss the advantages and weaknesses of both methods in just a minute. That's great, we've got about 18 people who voted, so we'll continue on and talk about the benefits of both.

Let's say you have decided to choose option A, where we update the last-login timestamp for that user in a single row. You might have something that looks like this: we have the user, we have the company they work for, which is just some contextual information, and then we have that last-login timestamp. Although useful for examining which user works for which company, the time variable isn't of much use here; it simply provides some context. Whereas if we were to explore option B, which might look something like the sketch below, we have a new row for each time a user has logged in, where those changes are preserved and each change is recorded as a new event. This is where I draw importance to that word "change" again, because that's exactly what we're looking at: this allows us to examine how the frequency of logins changes over time, per person or per company.
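To make the two designs concrete, here is a minimal sketch in R; all the users, companies and timestamps here are invented for illustration:

    # Option A: one row per user, overwritten on every login,
    # so only the most recent timestamp survives
    option_a <- data.frame(
      user       = c("alice", "bob"),
      company    = c("Acme", "Beeco"),
      last_login = as.POSIXct(c("2021-03-01 09:15:00",
                                "2021-03-01 11:42:00"))
    )

    # Option B: one row per login event, so the full history is preserved
    # and we can study how login frequency changes over time
    option_b <- data.frame(
      user     = c("alice", "alice", "bob"),
      company  = c("Acme", "Acme", "Beeco"),
      login_at = as.POSIXct(c("2021-02-27 08:55:00",
                              "2021-03-01 09:15:00",
                              "2021-03-01 11:42:00"))
    )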
So, to summarise time series data: almost all data is recorded as a new entry, the data typically arrives in time order, and the time intervals can be regular or irregular. Just to clarify what I mean by regular or irregular: time series are classified into two types. Regular time series tend to represent some sort of cluster monitoring or aggregated data, whereas irregular time series are measurements gathered at irregular time intervals, and these might represent things like logs or traces. The issue with irregular intervals is that the events are unpredictable and cannot be modelled or forecasted, since forecasting assumes that whatever happened in the past is a good indicator of what will happen in the future, and irregular time frames don't allow us to make those predictions. So when forecasting and modelling data, your time series should be in a regular format, which is why you'll see a lot of data comes in aggregated formats. Let's move on and explore some of the reasons why you might want to use time series analysis, and look at its aims.

I would say that there are generally three different aims to time series analysis. The first is to assess the impact of a single event; this could be described as a descriptive analysis. An example would be the number of crimes in Manchester: you could obtain the number of crimes in Manchester and observe the seasonal patterns and the upward or downward trends. That's a single event, as we have a univariate analysis of just one series. Another aim could be to study causal patterns, that is, the effects of variables rather than the events themselves; this would be known as an explanatory analysis. And the last aim, which is very common, is to forecast future values of a time series, using either previous values of one series or values from another; these are known as prediction models. Forecasting uses the observed values of a time series with a model to predict future time series values. There are several forecasting techniques available for use with time series data, and one example is the family of ARIMA models. We will be looking at ARIMA models, as well as some other models, in our live code demonstration.

I'd like to talk about one specific article that draws on almost all three of these aims, conducted by Ashby in 2020. His aim was to understand crime patterns during the pandemic. He used police-recorded open crime data to understand how the frequency of certain crime types changed from the start of the pandemic. His method involved fitting SARIMA models to the frequency of crime types in 16 different US cities between 2016 and 2020; forecasts were then created from these models to compare the actual calls received to the expected calls that were calculated. I would go on to explain all the results and discussion, but there's quite a lot to unpack, as the article explores five different crime types across 16 different cities. Some general notes would be that the gap between the actual calls recorded and those forecasted differed from city to city, one example being that there were reductions in burglary in most cities. What this paper helped to do is identify a relationship between the pandemic and crime. The data and the code can be found at this link; it's also attached in the GitHub repository alongside the R script, so you can have a look in your own time. If you're looking for a really good real-life example, I suggest reading this paper: it goes into detail about how the modelling is conducted, and the code to reproduce it is available.

With that, I will be exploring a case study that is an adaptation of this paper. Using some open-source police-recorded crime data, I'm going to explore burglary rates in Detroit from 2015 to 2020, following some of the code created by Ashby. In order to understand the components and fundamentals of time series analysis, I've created two main aims that might help us contextualise these complicated statistical terms. The first aim is to explore the long-term trend and seasonality in burglary across the city of Detroit. The second aim is to examine how the frequency of burglary in Detroit changed in 2020 from the start of the pandemic; that is, again, using those ARIMA models to come up with predictions. We will be using time series graphics, time series analysis and forecasting to answer both these aims.
Now, the steps in time series analysis can be quite complicated, but I've broken this down into four main steps that will let you take on an analysis of pretty much any dataset. The first step is to explore your dataset; as with any research, this means understanding your variable types, but the difference with time series data is that you also need to understand the time intervals present in your dataset, that is, whether you have yearly, weekly, monthly, hourly or quarterly data. Step two is identifying graphing patterns; this is descriptive analysis, understanding the very basic patterns. Then we move on to modelling the data, and then to predicting: once you have indexed those points according to a time order, you can use time series algorithms to create a model, and once you have created that model you can use it to predict future values. So this is just a really broken-down version of how time series analysis works, but it has always helped me in conducting my own analysis to break it down into really simple steps.

Let's move on to look at what a time series analysis might look like. Typically, the statistical characteristics of time series data violate the assumptions of conventional statistical methods. Because of this, analysing time series data requires a unique set of tools and methods, known as TSA, or time series analysis. It's important to note that time series analysis is used for non-stationary data: things that are constantly fluctuating over time, or affected by time, but we'll get more into this in a little bit. We can use basic time series graphs that simply plot the observed value on the y-axis against an increment of time on the x-axis. Most time series graphics will look something like this: in our instance we might expect to see our y-value, which is our burglary rate, against the increment of time, which runs from 2015 to 2020. Obviously you can get more creative with your visualisations, and they won't look as basic as this, but it helps to understand the overall graphic.

So the first step would be to identify what time interval you have in your dataset, whether this be yearly, monthly, weekly or quarterly. But before examining this dataset, I want to ask you: what is the most accurate interval to use for exploring crime data? What do you think is the most accurate interval for exploring open-source police-recorded crime statistics? Head over to Mentimeter and pop in an answer; this also gives you a break from hearing my voice. Do you think it would be better to compare year-to-year data, month-to-month data, hourly, minutes? Just pop in your answers and I'll give it a minute to let them roll through.

All right, the numbers have slowly stopped changing. Okay, that's pretty interesting: we see that 42% of you think that monthly data would be the best interval, followed by weekly, followed by "other", which I'm actually very curious to know more about, and then we have yearly and hourly. I'll briefly talk about some of the benefits of these. Yearly data is accurate, but only if you're looking at the long-term trend; it tends to miss information, as it hides the variation that happens within the months.
So with yearly data we won't be able to analyse any seasonal trends or fluctuations across seasons; we would simply have an upward or downward trend from each year. We then have monthly data. Comparing month-to-month data is more accurate, because we then get to analyse that seasonality, but the issue with monthly data is that months do not hold the same number of days. Some months have a larger number of weekdays, so how can we say that a given month has a higher number of crimes, rather than it just being the result of a higher number of weekdays? There's this variation in the number of weekdays, and crime reporting is affected by it: do we have a higher number of crimes reported on weekdays or on weekends, and how does this affect the overall frequency of crime per month? These are things you have to consider when using month-to-month analysis.

We then move on to weekly data, which 29% of you suggested is a good interval, and I would like to agree with you. Weekly data minimises that variation from month-to-month and year-to-year data, and is quite a frequent interval used in recent research for understanding crime. Hourly and minute-level data are more useful if you're analysing a single event rather than a long stretch of time, because the smaller the time frame, the greater the amount of noise in your dataset, and that is something you want to avoid. We'll discuss a little further on what noise is, but it is basically uncontrolled variation. For this reason, and following the work of Ashby, who suggested that weekly trends provide the best interval for understanding police-recorded crime data, I've gone with weekly counts. That is not to say that yearly or monthly data can't be used; it just affects the outcome of your results, in that with yearly data you will be limited to understanding the trend. These are just things to consider when choosing your interval.

So I created a frequency count of weekly burglary for our case study, and a plot that looks something like this. What we have here is a really basic time series plot: we have our weekly incidence of burglary on the y-axis and an increment of time on the x-axis, from 2015 to 2020. What this plot shows quite clearly is a downward trend from 2015 to 2020: we can say that the overall frequency has definitely decreased as the years have gone on. But how much is there to say about the seasonality, and how much is there to say about other hidden trends? It's quite hard to read. You see we have these really high counts of burglary in 2016; why were they so high in that year? Is this just noise, uncontrolled variation? In time series analysis, for forecasting new values it's very important to understand the past data; there can be many reasons which cause our forecasted values to fall in the wrong direction, and factors like these might be one of them. So if you want to explore further trends, you need to look at the components of time series analysis, because the variation of these components causes the changes in the patterns we see here. Let's move on to have a look at those components of time series analysis.
There are four main components: the trend, the cyclical behaviour, the seasonality, and that noise we've mentioned before. The trend represents the overall decreasing or increasing pattern in the statistic; this is the linearity. We then have our cyclical behaviour, or cyclical variation; this variation in the time series tends to repeat itself over a span of more than two years. We then have our seasonality: these are the rhythmic forces which operate in a regular manner over a span of less than a year. So that's the main difference between the two: cyclical patterns tend to happen over more than two years, and seasonality consists of patterns that happen within a year. We then have our random or irregular movements, also known as noise. This is basically any other factor which causes variation in the variable under study. As mentioned, noise is likely to be higher when analysing shorter time periods, so if we were using month-to-month data, or even hourly data, you might expect increased noise, because there's more uncontrolled variation within the time frame.

Now, the combination of these components over time causes the formation of a time series. It's important to note that most time series consist of a trend and a noise component, but the cyclical and seasonal variations are optional, in that they might or might not exist in your data, depending on the data you have at hand. If seasonality and trend are part of the time series, there will be effects on the forecast values, as the patterns of the forecasted time series can differ from the older time series.

I'd also like to draw on the fact that the combination of these four components can lead to either an additive or a multiplicative model. I don't want to get too complicated with these terms, and I'll avoid the formal statistics, but it's really important to know how these four components can shape your time series model. So let me just explain what I mean by additive and multiplicative. An additive model is when the increasing or decreasing pattern of the time series is similar throughout the series; this is when all those components are added together, hence the name. A multiplicative model is when the time series grows or decays exponentially over time; this is when your components are multiplied together. It's not too confusing to understand, but it's really helpful to see how this looks in a visualisation, so I've provided these two examples here. In this case the additive model has an increasing pattern, but we have the same amplitude between our time points, whereas with the multiplicative model we have a change in amplitude over time. It could be increasing or decreasing, same as the additive, but the main difference is that the amplitude between the points differs quite a bit, as in the simulated sketch below.
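If you'd like to see this on your own screen, here is a rough hand-made simulation of the two forms; nothing here comes from our case-study data, the trend, seasonal and noise pieces are all invented for illustration:

    set.seed(1)
    t     <- 1:120                      # ten years of monthly time steps
    trend <- 100 + 0.5 * t              # steadily increasing trend
    seas  <- 10 * sin(2 * pi * t / 12)  # repeating within-year pattern
    noise <- rnorm(120, sd = 2)         # uncontrolled variation

    additive       <- trend + seas + noise            # constant amplitude
    multiplicative <- trend * (1 + seas / 100) + noise # amplitude grows with the trend

    par(mfrow = c(2, 1))
    plot.ts(ts(additive, frequency = 12), main = "Additive")
    plot.ts(ts(multiplicative, frequency = 12), main = "Multiplicative")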
So we can bring this back to our case study and start to think about what kind of trend we have: do we have an additive or a multiplicative plot? I'll just show you this plot again, our basic time series plot of the weekly trend. Can we say that this trend is additive, or can we say it is multiplicative? I'll give you a moment to explore and think about this; you can head over to Mentimeter again and put in your opinion: do you think that this plot is additive or multiplicative, or do you have no idea? It looks like the majority of votes so far have been leaning towards additive; we've got quite a few that say multiplicative as well.

Let me break this down a bit. The majority seems to think we have an additive model, and 21% think we have a multiplicative one. Well done to those who voted multiplicative: this is in fact a multiplicative model, because we have a decreasing trend with changes in the amplitude between the points, which in turn produces larger intervals of seasonality. An additive model would show an almost constant trend; if we had the same average number of crimes as the years go on, it would be additive and might look more like a straight line. So this is a multiplicative model.

When reviewing the line plot, it suggests there may be a decreasing crime trend throughout the historical changes, but it is hard to distinguish whether there is seasonality or noise, or even any cyclical behaviour. The only thing we can really confirm from it is the trend. But we can explore the other components of time series analysis by decomposing our model. Through decomposing the model we can clearly see the individual components and get precise information about whether the series is stationary or not. As discussed, a time series is considered to be a sum or combination of those four components, so let's have a look at what would happen if we were to decompose our model.

I've already gone ahead and done this and made this plot for you; this is what a decomposition might look like. There are four graphs in this image. The first graph shows our original data. The second graph shows our trend, that is, the decreasing trend that we see. The seasonal panel shows the seasonal patterns: as you can see, we have the same repeated pattern every year, which would indicate that there is some seasonality present in this dataset, which isn't surprising with crime data at all. We also have the remainder, or noise, component, which I'm not going to delve into too much at the moment, but if you are familiar with ACFs and lags this is what you'd expect to see; we'll address some of this in the live code demonstration as well. So we can clearly see that there is a seasonal component and that there's definitely some sort of downward trend.

Once we know these patterns and trends and the basic structure of our data, we can then go and check whether the series is stationary or not; if the series is not stationary, it is necessary to make it stationary. Let me provide some context about what stationarity exactly is and how we can test for it. Stationarity, in short, is when the statistical properties, such as the mean, the variance and the covariance, remain constant over time. The formal ways to check for this are either plotting the data, as we have seen through the decomposition plots, or using some more advanced statistical tests: the KPSS test and the Dickey-Fuller test, as well as the augmented Dickey-Fuller test, which uses a unit root test. These basically evaluate a null hypothesis one way or the other: using an alpha level of 0.05, we can decide whether or not to reject the null hypothesis. In R, those tests look something like the snippet below.
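As a rough sketch, assuming our weekly counts were held in a ts object called burglary_ts (a name I'm making up here), the two tests from the tseries package would run like this:

    library(tseries)

    # Augmented Dickey-Fuller test: the null hypothesis is that a unit
    # root is present (non-stationary); a small p-value suggests stationarity
    adf.test(burglary_ts)

    # KPSS test: the null hypothesis is reversed (the series IS stationary),
    # so here a small p-value suggests NON-stationarity
    kpss.test(burglary_ts)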
Visually, we can say that the plot showed that we had trend and that we had seasonality, and that means this series is not stationary: the mean is not constant over time, as there were changes in the trend and changes in the seasonality. Typically, stationary data is quite flat looking: you wouldn't expect to see a trend, you would have a constant variance over time, you'd have a constant autocorrelation structure over time, and you would have no periodic fluctuations from seasonality. Using a stationary dataset means that the model can make predictions based on the assumption that the mean and the variance will remain the same in future periods, and this is why we have to shift a non-stationary dataset to a stationary one. In order to do this you difference your dataset, which is known as differencing. This isn't a completely key concept here, but it is necessary to know when understanding our ARIMA models, as it is a big part of how we shift from non-stationary to stationary.

So, now that we've broken down the components of time series analysis, we can use this to create models for our forecast. There are many models to consider, such as the moving average model; this is the simplest and most basic of all time series forecasting methods. This model is used for univariate, that is, one-variable, time series, and in an MA model the output, or future value, is assumed to have a linear dependence on the current and past values; the new series is created from the average of the past values, hence the name moving average. We then have the single exponential smoothing model, which is really common with economic and financial data. This is also used for univariate series, and here the new values are calculated from weighted averages of past values, so a little bit different to the moving average. There are extended versions of this smoothing: the single version is used when there is data with no trend or seasonality, a double model is used when there is trend in the data, and the triple is used when trend and seasonality are both present in the dataset. A sketch of differencing and these smoothing models follows below.
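As a hedged sketch of those ideas, again assuming a hypothetical weekly series burglary_ts and a hypothetical monthly series monthly_ts (I'm illustrating the smoothing functions on monthly data, since very long seasonal periods are awkward for them):

    library(forecast)

    # Differencing: one way to move a series towards stationarity
    ndiffs(burglary_ts)                 # estimates how many differences are needed
    burglary_diff <- diff(burglary_ts)  # first difference of the series

    # Exponential smoothing, from single to triple
    fit_single <- ses(monthly_ts)   # no trend or seasonality
    fit_double <- holt(monthly_ts)  # adds a trend component
    fit_triple <- hw(monthly_ts)    # Holt-Winters: trend plus seasonality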
The last model I'd like to talk about is the ARIMA and SARIMA family, which I've said a bit about already; they are suitable for non-stationary data. We'll be using a variation of this model to answer our second aim, which asks how burglary trends compared to the predicted trends over the pandemic. So why ARIMA models? SARIMA stands for Seasonal AutoRegressive Integrated Moving Average, which is a very long and complicated term, but I'm going to try my best to break it down into a much simpler and clearer explanation. These models are simply used to predict future trends in a time series. It is a form of regression analysis that evaluates the strength of the dependent variable relative to other changing variables; what makes this regression so different is that our dependent variable is indexed by time, which is very different from analysing a cross-sectional dataset. It's basically a linear equation in which the predictors consist of lagged values of the dependent variable and lagged forecast errors.

We'll move on to a slide for those who are a bit more interested in the statistics behind ARIMA models, to see what these seasonal, autoregressive, integrated and moving average components might look like if you were to write them down on paper. This slide does contain a lot of information, so please don't feel overwhelmed, but basically an ARIMA model contains three values: p, d and q. ARIMA without the S means this is a non-seasonal model. The p stands for AR, the autoregressive part, and indicates the trend order; d stands for the integration, which is the differencing, that is, how we make a series stationary; and q indicates the order of the moving-average part. So an ARIMA model really combines a lot of these methods and models of time series analysis. We then have a SARIMA model, which is the same as an ARIMA model but includes a seasonal component. We have the same p, d, q, and we then capitalise them as P, D, Q to indicate the seasonal components: a seasonal autoregressive term, a seasonal integration and a seasonal moving average. We also have this extra value, m, which is the number of time steps in a single seasonal period. If you were working with quarterly data this would be a 4, yearly data would be a 1, monthly data would be a 12, weekly data would be a 52, and so on. Luckily for us, we're using computation to do this, so there's no need to figure out how to calculate all these values yourself. We'll be using some functions in R on some practice datasets so you can get an understanding of how these models can be used to predict future values.

For those who are interested in how you choose those values for p, d, q and the capital P, D, Q, this is where we come back to the autocorrelation function. The ACF tells us how correlated a time series is with its previous values; it is the correlation between observations of a time series separated by k time units. We then have the partial autocorrelation function, which helps with identifying the seasonal terms; it measures the strength of the relationship at each lag once the shorter lags have been accounted for. That's the very simple backbone of ARIMA models and how they are constructed, but as discussed we luckily have automated functions and packages to do this for us, so we don't have to deal with all the maths.

So I ran a SARIMA model on our Detroit burglary dataset to compare the actual trends to the predicted trends in 2020, and I'll show you what my plot looked like. But first, how do we build this model? Again, I've broken this down into four main steps. The first is to count the weekly crimes; that is because the increment of time we're interested in is weekly, but you could use monthly or yearly, it always depends on the data you have available. You would then model the weekly counts, using functions such as ARIMA() from the fable package, or auto.arima, which I'll be demonstrating. Then you would generate your forecast using the forecast function, and then you would plot the forecast. So, really, four simple steps, sketched below.
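As a rough sketch of those four steps using auto.arima, assuming a hypothetical data frame crimes with one row per offence and a date column (the object and column names here are invented):

    library(dplyr)
    library(lubridate)
    library(forecast)

    # Step 1: count the crimes per week
    weekly <- crimes %>%
      mutate(week = floor_date(date, unit = "week")) %>%
      count(week)

    # Step 2: model the weekly counts
    weekly_ts <- ts(weekly$n, frequency = 52)
    fit <- auto.arima(weekly_ts)

    # Step 3: generate the forecast (a year ahead here)
    fc <- forecast(fit, h = 52)

    # Step 4: plot the forecast
    autoplot(fc)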
I'll show you all how to do this in R, and then we can explore how I got this plot. What we're looking at here is the actual calls received in 2020 compared to the predicted calls. The actual calls fall along this dotted line with the circles, and the predicted calls fall along this dashed line that sits within the grey interval. What we can see is that the actual calls received are far lower than the predicted calls, and this is because, as discussed in Ashby's paper, the COVID-19 pandemic led to substantial changes in the daily activities of, well, everyone: due to stay-at-home orders, due to distancing, due to more people working from home, there was reduced opportunity for these kinds of crimes to take place. This is what ARIMA models allow us to do: they allow us to understand the relationship between crime and people's daily activities. So what we have here is our model: I counted the weekly crimes, I modelled the data from 2015 to 2020, I created the forecast, and then I plotted the data.

There are other effects that you could include in this. For example, you could include a holiday effect: you could create a binary variable that indicates whether a week had a bank holiday in it, and you could then possibly establish a relationship between the frequency of crime and bank holidays. These are ways you can extend your models, but I didn't include that in this one. So that brings us to the end of the main components, models and underlying concepts that make up a time series analysis, and I hope I've been able to provide some useful information.

Before finishing up, I would just like to draw attention to some of the software that is available. Obviously most of you have said that you use R, so I suspect you're familiar with the packages for time series analysis: fable, forecast and tseries, and there are at least another five that run time series analysis, but preference is different for everyone. One function that I will be looking at is the auto.arima function. As discussed, this basically combines unit root tests with minimisation of the AIC and the BIC, which are information criteria, to obtain an ARIMA model. But as with any automated function, you have to think about how it might over- or under-fit your data, how it might ignore some variation, and things like that. We also have Python, where the libraries for time series are pandas, statsmodels and scikit-learn. And we have this nifty little tool made by Facebook called FB Prophet: they developed an open-source forecasting tool available in R and Python, with a core written in C++, and it's basically used for additive models with non-linear trends, fitting yearly, weekly and daily seasonality as well as a holiday effect. If you're looking at creating really simple and effortless time series models, I'd have a look at FB Prophet, because it works nicely within Python and R.

Thank you all for listening to that talk; we've just gone 50 minutes, which is perfect. If anyone has any questions, I'll give you the opportunity to ask them now. You're welcome to type them in Mentimeter or the Q&A on Zoom, so feel free to ask some questions.
We'll also have a break just after this as well. Someone has asked: is it possible to share the data and code underlying the results? Yes, that's exactly what the live code demonstration will do, and all of this is available on GitHub as it is, and we'll talk through the rest there. Next question: please can you talk about the additive/multiplicative distinction and how this affects the ARIMA modelling process? Yeah, of course. As discussed, there are two forms of time series model, additive or multiplicative, and this depends on how the trend, seasonality and noise present in your dataset combine. Typically an ARIMA workflow will account for this automatically, and it will figure out whether you have additive or multiplicative data, that is, whether these components have been added together to make up your series or multiplied. I can show you how to do this in the live code demonstration and we can break it down a little further. I hope that helps, but I'll demonstrate it further, because it's easier to show with some code at hand.

Hi everyone, I hope you've had some time to stretch your legs, grab a coffee and get yourself ready for the code demonstration. As I said, it's not completely necessary to clone this repo onto your own computer, but if you would like to explore it yourself then it will be available from the UKDS GitHub link, well, permanently. Any changes made to this code today I will push, and all you have to do is pull the changes. So hopefully everyone is back and ready, and I'll take a slow walk through this code demonstration.

Here are some of the links and references used, from Ashby's paper as well as his code, which is here. There's also the crimedata R package, which we'll be using, linked here; this provides open-source crime data from the US. If you would like information about how to set your working directory, that is, how to clone the GitHub repo, then there's a line here, around line 30, about how to do that, which means that everything I have on my computer, you have on your computer. And here are some links to install the packages, as well as to load them. If anyone has any questions before we get started then please ask away, but otherwise I'm just going to take a slow walk through the section one script.

We have a Q&A question: is there any chance we could make my face smaller and the RStudio screen bigger? I'm not sure if that's just their computer setup, but... ah, no problem, I'm glad it's sorted. Thanks for asking the question anyway; I can zoom in if need be, but let me know if this view is okay for everyone.

In section one of the R Markdown file we will be looking at some time series data representations: we'll be looking at how to convert time series objects, making decomposition plots, checking for stationarity, and applying some rolling averages to some datasets. Section one specifically uses synthetic datasets and open-source datasets, and then section two will take a turn and look at how to apply some data manipulation using the dplyr package and the lubridate package to work with time series data and time intervals. So let's get started.
We'll first start by demoing some different types of time series data. Typically, when working with data in R, you need to decide the object class of your data at hand. This is important because the object class you choose affects more than how the data is stored: it will dictate which functions are available for preprocessing, analysing and wrangling your data, as well as for plotting it. Typically, data in R is stored as a vector, and a vector is, in short, the simplest data structure in R; it represents a sequence of data elements of the same basic type. I think there are six different vector types: numeric, integer, logical, date-time, factor and character. However, when working with time series data we tend to convert the object class into what is known as a time series object, or ts. R has at least eight different implementations of data structures for representing time series data; the list below identifies some of the most frequently used packages. We'll be looking mainly at ts and tsibble, but I will demonstrate some examples in zoo and xts as well.

So the first thing we'll look at is a time series object called kings. This is an example of a small time series dataset, and it records the ages at death of 42 successive kings of England. The dataset can be found at this link, from Rob Hyndman, but it is recorded in a text file, so we can use the scan function from base R to read it in. I've also applied the argument skip = 3, which skips the first three lines of the file, because they contain attribute information that isn't part of the data itself. So if we run this, we can then view the dataset by just typing kings, and you see we have 42 values, which represent our kings and the ages that go with them.

We can examine the type of object we have, that is, the data structure itself, by using the class function, and you can see we have a numeric object. However, this isn't very useful if we want to plot time series, because R doesn't know we're treating this as data indexed by time. In order to convert it to a time series object we can use the ts function. Just to clarify, a time series object is a vector (univariate) or a matrix (multivariate) with additional attributes. So let's use the ts function to convert our kings dataset: all I've done is call the dataset, apply the assignment operator, use the ts function, and run that. Now if we check the class again, you see we have a ts object instead of numeric. Let's see how this looks if we print it: that's great, we now have a time series object with a start date, an end date and a frequency. The frequency is one here because there is just one observation per time step; all we have is the age at which each king died, and that's it.
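For those following along at home, the kings steps as code look something like this; the URL is the one used in Rob Hyndman's dataset collection, but treat it as an assumption on my part:

    # read the text file, skipping the three header lines
    kings <- scan("http://robjhyndman.com/tsdldata/misc/kings.dat", skip = 3)
    class(kings)           # "numeric" at this point

    kings_ts <- ts(kings)  # convert to a time series object
    class(kings_ts)        # now "ts"
    kings_ts               # prints with start, end and frequency attributes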
But I can provide a better example by using data that includes seasonality. What happens if data has been collected at more frequent intervals? That could be monthly, weekly or quarterly, and if that is the case you have to specify the number of observations per year using the frequency parameter. Let's explore this with a different dataset named births, which refers to the number of births per month in New York City from 1946 to 1958; again, this dataset can be found at the same source, from Rob Hyndman. The code to run this looks something like this: we call the ts function, we supply the vector, we give a start date, and we also identify the frequency.

So let's see how we do this with the births dataset. The first thing, again, is to use our scan function, as it's a text file, to read the dataset in. If we examine it, we see quite a messy vector with over 160 values and no time attached to any of them. But we know this is a time series, because that's how it's described on the website where it's provided, and we can use the ts function to convert it into a time series object with our dates. So we call the vector, which is births, and then we apply the frequency: in this instance I know it's 12, I know we're looking at monthly data, because this was all on the website, and I know that the start date was January 1946, hence the one. If we run this and then print births, we have a much neater dataset with our birth rate given by month and year, which is really neat. Now with this we can go ahead and create some time series plots.

We can plot both seasonal and non-seasonal data, and we can use the plot.ts function to do so. If we run it on the kings dataset, we get something that looks like this: what we have, hopefully you can see, is the time variable along the bottom, which in this instance is just the order of the kings, which can be confusing, but stay with me, and then the age at which each king died on the y-axis. This is a really simple time series plot, as we only have a univariate series. If we run this on the births dataset, we get something much more like what we saw in the slides: we have our time variable from 1946 to 1960, and we have the births per month on the y-axis. You can definitely see that there is some sort of upward trend from 1948, as well as some evident seasonality, where we have fluctuations in the data, but again it's hard to say whether this is due to seasonality or monthly variation, and that is what we're going to explore in this section.

You can also plot ts objects using ggplot-style graphics, which is, I guess, much more common; a lot of people like to use autoplot just because there's more flexibility in the functions available. We need to install and load the ggfortify package, and let's see what happens if we use autoplot instead of plot.ts. We get something that looks much the same, apart from the background, which has changed; this is preference, and I like this background as I think it's easier on the eyes. You can also apply a colour: we run autoplot, call the births object, and we can use a ts.color argument to make this trend pop a little, so let's go with red, and you can also use ts.line type to change the style of the line; I'm going to use dashed. That is strange. autoplot, ts.color equals red, ts.line type equals dashed... I'm not too sure why that hasn't changed. That should work.
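For reference, the births code, with the same caveat about the URL; and one possible explanation for the styling not changing live is that ggfortify spells those arguments the British way, ts.colour and ts.linetype, so a ts.color argument would be silently ignored:

    births <- scan("http://robjhyndman.com/tsdldata/data/nybirths.dat")
    births_ts <- ts(births, frequency = 12, start = c(1946, 1))
    plot.ts(births_ts)  # base-graphics time series plot

    library(ggfortify)  # provides autoplot methods for ts objects
    autoplot(births_ts)
    autoplot(births_ts, ts.colour = "red", ts.linetype = "dashed")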
Anyway, we'll move on, and we'll try to see if the ts.geom argument works... no, that is very strange, because I had just done this before. Well, that was only to change the graphics, so it's not entirely important; these functions should work on your computer, apologies about that, I'm not quite sure why it hasn't worked here. If we have some time at the end, I'll come back and see if I can fix that.

Now we're going to move on to plotting with the forecast package; we're going to be using the auto.arima function to run this bit. What the auto.arima function does is return the best ARIMA model according to either the AIC or the BIC: it conducts a search over the possible models and basically provides the best one for you. We know that an ARIMA model is partly autoregressive, trying to explain future values of birth rate using past values, and we know there's the MA part, which uses a white noise error term to explain future values. Together these components give you an ARMA model, AR plus MA, which obviously doesn't include the integration. Oftentimes a series is non-stationary but needs to become stationary, and this is what the integration does: it tells you how many times the values need to be differenced in order for the series to become stationary. If you were to calculate this by hand, you would use the AIC and the BIC; however, auto.arima does all these calculations for us.

So let's explore this a bit better. This is run by the forecast package, so make sure that's loaded. What I've done here is create a new object called births_arima, used the assignment operator, and then used the auto.arima function; I call in the births vector and I tell auto.arima that seasonal is TRUE. That part of the code isn't completely necessary, auto.arima should pick it up anyway, but sometimes it's useful to put it in so that your model is clear and you know what's going on. If we run this births_arima line and then print the summary, we get output something like this. The second row indicates our ARIMA figures, and what we see is that the model contains two autoregressive lags, it's differenced once, and it includes two moving-average terms; for the seasonal component there is one autoregressive lag, one difference and one moving average. We also have the period, which is 12 in this instance, indicating that this is monthly data.

From there we can examine the coefficients and the residuals by plotting them, because it's much easier to read and understand the residuals once we plot them, and you can do this by using the checkresiduals function, to which you just pass the births_arima object. If we run that, we get an image that looks something like this, with three panels. Basically, the residuals should approximately be white noise, that is, a process with no structure; this means the mean should be a constant of zero, and the residuals should really sit around that zero value. In our instance it's not entirely near zero, but I still want to use this as a demonstration, because you can still work with auto.arima fits that don't have that mean around zero. The code for this step is below.
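Collected as code, and assuming the births_ts object from earlier:

    library(forecast)

    # let auto.arima search for the best (p,d,q)(P,D,Q)[m] model
    births_arima <- auto.arima(births_ts, seasonal = TRUE)
    summary(births_arima)  # reports the chosen orders and coefficients

    # residual diagnostics: time plot, ACF and histogram, plus a Ljung-Box test
    checkresiduals(births_arima)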
We also see the ACF plot on the left here, and in an ideal case there wouldn't be any significant lags; as in, those lines would sit within the confidence bounds. But we're still going to go ahead and use this for a forecast. It obviously won't be the perfect forecast, but these are things you need to consider when using auto.arima, because, as I said, the automated search can over- or under-fit.

So now we've run the model, we can make the forecast. What I've done is create a new object called births_forecast and assign it to the forecast function; we call births_arima, which is our ARIMA model, and we indicate h = 12. h is just how many periods ahead we want to forecast, and because we know it's monthly data I've decided to put in 12, so we forecast 12 months ahead. If we run this and print the values, we get a set of values which include our confidence intervals; you would read it as, say, January 1960 having a point forecast of about 27.7, with the intervals around it. But obviously what we'd want to do is plot these and get them on a graph. You can use that autoplot function again to plot your forecast, and once you do, you have this really nifty plot that indicates the forecasted values for the next 12 months. We could say that in 1960 there would be an increase in birth rates over the first half of the year, and then from the middle of the year a decrease.

So yes, auto.arima has the ability to decide whether or not the data used to train the model needs a seasonal difference; however, sometimes the data might not clearly express its behaviour, and auto.arima needs a nudge in the right direction, I suppose, and this is where our decompositions come in handy. We can go ahead and decompose our time series to explore some of those individual components. As discussed, a decomposition is simply an addition or multiplication of your four components. In order to decompose a time series, you can just use the decompose function, which is from the stats package. I've created a new object called births_decomp, called the decompose function, and passed in the original births series. If we run that, we can use the head function to view the first few elements, and we see that we have our x values, which are the original values, our seasonal values, and, going down a bit, our trend values and our random component. So this has simply broken the series down into its four components and split them into more readable pieces, so that we can analyse each one individually.

Again, you can just plot this: we can plot the same graph that we saw in the slides by using plot and passing births_decomp, and now we have a decomposition of our time series. It has also identified that it is an additive series; that's what plot does, it tells you whether it's additive or multiplicative, so you don't even need to figure that out yourself, which is very handy. As you can see, we definitely have some sort of increasing trend, and there is definitely some seasonality present, with this repeated pattern as each year goes on. You could improve these graphics: you might want to recreate the plot we saw in the slides, which was made using autoplot. The code for this run of steps is below.
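The forecast and decomposition steps as code, carrying on with the same objects:

    # forecast twelve months ahead from the fitted model
    births_forecast <- forecast(births_arima, h = 12)
    births_forecast            # point forecasts with 80% and 95% intervals
    autoplot(births_forecast)  # plot the series plus the forecast fan

    # decompose the original series into its components
    births_decomp <- decompose(births_ts)
    head(births_decomp$seasonal)  # peek at the seasonal component
    plot(births_decomp)           # observed, trend, seasonal and random panels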
With the autoplot function, you can basically chain these steps together; I've forgotten the name of this function, but you can use it to run the code all in one line. What we've done is call births, then decompose, then autoplot, and we get this graph, which is a little more readable and a little more presentable; you can see the lags properly put in place, much clearer indeed.

Now, we have just decomposed a seasonal time series, but how would you go about decomposing a non-seasonal one? You might be thinking, why would you want to do that if there is no seasonal component? But as I stated, almost all time series include a trend and a noise element, so if you were to decompose a non-seasonal time series, all you'd be doing is removing that noise from it and reducing some of that variation. We can explore how we could do this with the kings dataset, as this was a non-seasonal series, and one way to do it is to use a smoothing method such as the simple moving average. This can be used to smooth time series, and it is the SMA function from the TTR package, which I have downloaded and loaded already. So I'll write this out for you: to decompose a non-seasonal series, we'll use the kings dataset. I'm going to create a new object, use the assignment operator, and apply the SMA function from the TTR package, calling the original kings series, and then you need to set a moving average order. This number could really be anything, but I'm going to set an order of n = 5. If we run that and then use the plot.ts function on the result, we now have a smoothed version of the kings dataset, which has removed some of the uncontrolled variation.

Another question to answer is: how would you remove seasonality from a dataset, and why would you want to? Typically, many industries experience fluctuations in various metrics based on the time of the year; this means it's not possible to effectively assess performance by comparing data from one time of year to another, and these seasonal variations can sometimes hide important trends. So sometimes you might just be interested in that trend, and you want to remove the seasonal component, and this is how you do it. In the context of the births dataset: if the birth rates were to increase in September, could we say this was due to seasonal variation, or an actual increase in birth rates? To get these answers we could remove that seasonality. If you have an additive model, which we have in this case, you simply subtract the seasonal component; if you had a multiplicative model, you would divide by it. So I create a new object called adjusted births, call the births dataset, and subtract the seasonal component from the decomposition object, as you see here. Once we do this and plot the adjusted data, we have a seasonally adjusted dataset that removes the seasonal component and allows us to focus solely on the trend. Again, this would be up to you as a researcher: if the seasonal component is what interests you, then you wouldn't run this; you would only really run this if you're interested in the linearity between years.
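A sketch of those three tricks, using the pipe from magrittr (also loaded by dplyr) and the same objects as before:

    library(magrittr)  # provides the %>% pipe
    library(TTR)

    # chain decompose and autoplot in one line
    births_ts %>% decompose() %>% autoplot()

    # smooth the non-seasonal kings series with a simple moving average
    kings_sma <- SMA(kings_ts, n = 5)
    plot.ts(kings_sma)

    # seasonally adjust an additive series by subtracting the seasonal part
    births_adjusted <- births_ts - births_decomp$seasonal
    plot.ts(births_adjusted)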
As mentioned in the slides, there are two ways to check for stationarity. There's the graphical way, which is to examine the plots: in our instance we can say that there is no constant mean in the kings dataset, so the dataset is not stationary. But there are also those statistical tests that I mentioned, one being the ADF test. This is part of the tseries package, and we can run it on our kings dataset with the adf.test function. If we run this on the kings data, we get a Dickey-Fuller statistic of about minus two, a lag order of three, and a p-value of about 0.5. Given the time I won't go too much into this, but your main interest is that p-value: the p-value has to be less than your significance level in order to reject the null hypothesis. So let's say we had an alpha level of 0.05: our value is higher than that, so we cannot reject the null hypothesis, therefore inferring that the series is not stationary. And that's how you can test stationarity statistically.

Moving on, I'd just quickly like to talk about the zoo package and how we can calculate basic rolling values using functions within it. Install this and load it into your session. The zoo package consists of methods for totally ordered indexed observations, and it aims at performing calculations on regular and irregular time series of numeric vectors, matrices and factors. We're going to look at a new dataset called nottem, which holds the average air temperatures by month and year at Nottingham. This comes with base R, so there's nothing to install or download; you should just be able to run this when you call nottem. As you can see, we have something very similar to our births dataset, with the year and the month, but this indicates the average temperature for each month of each year.

First, let's just plot the dataset to see what we're looking at. Can we tell from this plot which years are the hottest? It's kind of hard to say, because there is so much fluctuation going on in the seasonal patterns. One way we can understand which years hold the highest values, which years are the warmest, is to apply a smoothing trend, and we can do this by using the rollmean function, indicated here. I've gone ahead and created a new object called nottem_mean, standing for nottem mean, applied the rollmean function from the zoo package, called the dataset, and then applied a k value of 12; this k represents the width of the rolling window, that is, the year that we want. Then I've applied fill = NA, and we do this to fill the first 11 months, as there are not 12 previous months before them. We then use align = "right" to average over the current and the 11 previous observations; if you wanted the 11 following observations instead, that would be align = "left", and if you wanted the window centred on each point, you would use align = "center". Once we run this we can examine the output, and as you can see those first 11 values are filled with NAs. If we didn't fill them in, the smoothed series would simply be shorter and wouldn't line up with the original dates, so it's always good to handle missing values properly. With this done, we have a much smoother dataset where we can see some clearer trends. The code for both steps is below.
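The stationarity test and the rolling mean as code, again assuming the kings_ts object from earlier (nottem ships with R):

    library(tseries)
    adf.test(kings_ts)  # p-value here is well above 0.05, so we cannot
                        # reject the unit-root null: not stationary

    library(zoo)
    # 12-month rolling mean over the current and 11 previous observations
    nottem_mean <- rollmean(nottem, k = 12, fill = NA, align = "right")
    plot.ts(nottem_mean)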
Lastly for this part, I'd just like to briefly talk about the xts package; it won't take too long. xts stands for eXtensible Time Series. It is an extension of the zoo object, and an xts object is essentially a matrix of values plus a time index. In this example I'm going to show you how to create an xts object and how to convert an existing object to xts. What I've done here is create an object called data using the rnorm function, which is part of the stats package, to draw ten random observations. Once we run that, I then want to pair each of those ten observations with a date. To do this I use the seq function together with the as.Date function to set the start of the period I want, here the 1st of January 2016, and then I apply a length of ten, going day by day, although you can also do this by weeks, months or years. If we run this, we now have two objects: data, which holds our observations, and dates, which holds our times. You can then combine the two using the xts function, which I do right here, calling on the data and ordering it by the dates, and now we have a really neat xts object that looks like this.

The preference is, I guess, up to you as a researcher, whether you prefer ts objects or xts objects. I tend to prefer ts objects because I like their output a little better. For example, with our nottem object we can convert to xts using the as.xts function. If we run this into a new object called xts2 and then view it, we see a really different structure to the original nottem, in that everything is listed in two columns, the date and then the observation, and it's just a little bit harder to read. There are not too many differences between the two, but I just wanted to briefly introduce it as another major package.
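Here's a minimal sketch of those xts steps, using the same object names as above:

```r
library(xts)  # eXtensible Time Series, built on top of zoo

# ten random observations from the stats package's rnorm()
data <- rnorm(10)

# ten daily dates starting on 1 January 2016
dates <- seq(as.Date("2016-01-01"), by = "days", length.out = 10)

# combine the values and the dates into an xts object, ordered by date
x1 <- xts(data, order.by = dates)
x1

# converting an existing ts object such as nottem works via as.xts()
xts2 <- as.xts(nottem)
head(xts2)
```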
I see we're actually a little bit short on time, but there are two questions, and then we'll take a two-minute break and move on to section two, which shouldn't take too long. Pipe, that was the word I was looking for earlier, thank you; yes, the pipe operator, the name completely slipped my mind. I'll just take a quick two-minute break while I rest my voice, and then we'll move on to exploring how we can run some data manipulation on some open source crime data.

First, I'm just going to answer two quick questions in the Q&A before moving on to section two. Someone asks: can or should additional explanatory variables be added to an ARIMA model? Can they? Yes. Should they? Absolutely, depending on your research aims. You would use what is known as a regression model with ARIMA errors, often just called an ARIMA model with regressors. I'm not 100% sure how to run that in R, and I haven't got too much information about how to do it, but it is possible and I have seen it done in papers. We also have a second question: if I want time series trends for different countries, do I need to create a different object for each one? You don't necessarily have to; you could aggregate and combine the time series, but it also depends on how many countries you have. I would probably suggest fitting and visualising a separate model for each country so that you can compare the trends, and that means creating an ARIMA model for each country and then plotting each one. All right, I'm going to spend the next ten minutes running through section two. I do realise we're short on time, but I'll leave some time for any questions at the end.

So, hopefully you've all downloaded the crimedata R package, which was listed in the prerequisites. To install it you can just use install.packages, or if you have an older version of R I'd suggest using the devtools install_github link. I'm going to be using the data from Detroit for the years 2015 to 2020; you can use the cities, years and type arguments of the get_crime_data function to obtain this from the crimedata package. Have a look at the website to see which cities are available and which years are available for those cities. I think the list_crime_data function will also tell you which years are available for each city, but I'm going to avoid running that now because it takes a few minutes to pull all the information; I'm only interested in those years of data from Detroit. When you run this you'll see a warning, but you can absolutely ignore it. And now we've obtained all the records for Detroit using the get_crime_data function.

Let's briefly explore this dataset to see what we're dealing with; we can use the head function to look at the first few rows. What we have is an ID number; the city name, which is obviously Detroit, because we filtered just for Detroit; an offense code; an offense type with an offense group; and the date. We also have the longitude and latitude, but those aren't necessary for this talk; we're simply interested in that date variable as well as the offenses. Our first objective is to group these crimes by offense type to see how many of each we have, and we can use the pipe operator to obtain this in one line of code. If we run this, it opens up a new view giving us the number of crimes per offense type. I'm not sure why there are two codes for that one there, but it does the same thing. As I said, our interest is in burglary, but we have 56 different crime types here, which can be really difficult to model, so your first step in analysis would be to figure out how to group these crime types. You may have noticed that some of the counts are really low: for example, peeping tom has only one count across five years, and operating, promoting or assisting gambling has only two. These really low counts will cause a lot of noise in your dataset, because there is not enough previous data to make predictions from, so one way to reduce the noise is to group the crime categories with fewer than some threshold number of counts.

Apologies, before we group by offense type I'm just going to change the name of the crime object to detroit, as the rest of the script references it as detroit, so just overwrite that variable; give it a second to run. There we go; now if we head back, the group by offense group all works. Yes, we have these really low counts, and one way to deal with them is to remove those small counts and group them into their own category of minor offenses. The first step is to convert your offense group variable from a factor to a plain character so that it can be edited, and then we can identify the minor categories by pulling them out of the dataset using the pull function, with the threshold set to fewer than 1,000; any crime types with fewer than 1,000 counts will be grouped into a new category called minor crimes.
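As a rough sketch of those data steps: get_crime_data and its cities, years and type arguments come from the crimedata package, while the GitHub path, the "core" type value and the exact column names are assumptions on my part:

```r
# install.packages("crimedata")
# for older versions of R, via devtools (GitHub path assumed):
# devtools::install_github("mpjashby/crimedata")
library(crimedata)
library(dplyr)

# recorded crimes for Detroit, 2015 to 2020 ("core" type assumed;
# this prints a warning that can safely be ignored)
detroit <- get_crime_data(cities = "Detroit", years = 2015:2020, type = "core")

head(detroit)  # id, city_name, offense_code, offense_type, offense_group, date_single, ...

# number of crimes per offense type, in one piped line
detroit %>% count(offense_type, sort = TRUE)

# group offense types with fewer than 1,000 records into "minor crimes"
minor <- detroit %>%
  count(offense_type) %>%
  filter(n < 1000) %>%
  pull(offense_type)

detroit <- detroit %>%
  mutate(offense_type = as.character(offense_type),
         offense_type = if_else(offense_type %in% minor, "minor crimes", offense_type))
```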
Those are just easier to deal with, and we can then pull them back into the dataset as that new minor crimes category, so let's do that. Now we can use this new category, and you'll see that we now have only 22 different offense types rather than 36, and our minor crimes are listed here: 3,440 minor crimes, which looks much better in that table.

Now, if we're following the case study from the lecture slides, we're simply interested in the burglary counts, so that grouping step is only really necessary if you're interested in exploring other crime types; I just thought I'd include it in case you wanted to explore your own dataset or try this yourself. Let's filter for burglary/breaking and entering, as this was the crime type mentioned in the case study slides. We can just filter by offense group and select that crime type, and I'm also only selecting the variables city name, offense group and date single, as the longitude, latitude and census variables weren't really of interest. If we now view this dataset, we have a much neater dataset with just our three variables.

So now we can talk a little more about object classes and how we can plot time series data from data frames that aren't ts objects. As you can see, the detroit dataset is a data frame, a table. There are some functions, for example within the fable package, which require you to turn a dataset into a time series object, but you can still create time series plots using things like ggplot. So let's skip to line 157 of the script, where I'll show you how to plot time series data using ggplot. The first step involves converting your time variable into a readable date class in R. This is because we're not converting the whole data frame into a time series object; we're just letting R know that our time variable, date single, is in fact a time object, and we want it to be represented as such. So I'm creating a new object called x, mutating the date single variable into a new variable called week using the yearweek function, as you can see here, and then counting the number of crimes per week. Once we run this, we can view the dataset, and now we have the number of burglary counts per week from 2015 to 2020. If we also check this new week variable using the class function, we'll see that it is registered as a date and not a character variable, which is exactly what you want if you're going to plot your time variable on the x axis.

Using ggplot, we can then plot something like this. It's a really simplified ggplot: we call on x, pipe it into ggplot, set the aesthetics so that week, our time variable, is the x value and the counts are the y value, and I'm using geom_point plus some other geometries to tidy up how the x axis looks. This is the plot we saw on the slides; it gives us that nice smoothed line through the points, and now we have that weekly crime count of burglary. Just a little top tip: you can do all of that, including the mutation, that is, counting the crimes per week, in just one line of code. You can use this block here; I'll just press Ctrl+Shift+C to uncomment all of that, and now you have the same plot run in one line of code, and I've also included some titles for the x and y axes. I'm a little bit aware of the time, but I guess there wasn't too much left to go through.

What I'm doing now is converting our crime object into a time series object so we can explore some of those decompositions, as previously addressed, and we're using the ts function to do so. In this instance, the data I'm using is the x object, that weekly crime count; using the dollar sign, I'm calling on the column of crime counts, I'm setting the frequency to 52 weeks, and I'm setting a start and end date. If we run this and then use the plot.ts function, we get a time series plot that looks something like this, showing our weekly trend for burglary. We can then go ahead and run some decompositions on it. You can declare your type as additive or multiplicative, or we can simply do it how we ran it on the previous slide and run all of these at once, so I'll explain them. Here we have a decomposition using the additive model, and we can definitely see some seasonal, repeated patterns from each year; and here we have the multiplicative version, with not much difference to be said, really. This is why I prefer the decompose function from the stats package, because it automatically tells you that we have an additive time series, which is really useful.

The last little thing I'd just like to discuss is the importance of accounting for things like holidays in your dataset. If you did want to include additional values, like a bank holiday effect, then this is the code you would use to do so. We use the timeDate package, and if you explore it on CRAN you'll see there is a different holiday function for different cities. This line actually should not say London; it should say New York, because we're using American data, so just make sure that's changed. It will create a binary variable for each week, indicating whether there is a bank holiday in that week, and you could then include this in your time series models.
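Before wrapping up, here's a minimal sketch pulling those weekly-count and plotting steps together. I'm assuming yearweek() comes from the tsibble package, and the exact offense group label is also an assumption:

```r
library(dplyr)
library(ggplot2)
library(tsibble)  # assumed source of the yearweek() function mentioned above

# weekly burglary counts, with the week stored as a Date for plotting
x <- detroit %>%
  filter(offense_group == "burglary/breaking & entering") %>%  # label assumed
  mutate(week = as.Date(yearweek(date_single))) %>%
  count(week)

class(x$week)  # "Date", which is exactly what we want on the x axis

ggplot(x, aes(x = week, y = n)) +
  geom_point() +
  geom_smooth() +  # the smoothed line through the points seen on the slides
  labs(x = "Week", y = "Burglary count")
```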
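And a companion sketch of the ts conversion, the two decompositions and the weekly holiday indicator; the start week here is a placeholder, and holidayNYSE() is the New York holiday calendar from the timeDate package:

```r
library(timeDate)  # holiday calendars such as holidayNYSE()

# weekly counts as a ts object (the start week is a placeholder)
crime_ts <- ts(x$n, frequency = 52, start = c(2015, 1))
plot.ts(crime_ts)

# additive and multiplicative decompositions from the stats package
plot(decompose(crime_ts, type = "additive"))
plot(decompose(crime_ts, type = "multiplicative"))

# binary variable per week: does that week contain a US public holiday?
# (uses x and yearweek() from the previous chunk)
holidays <- as.Date(holidayNYSE(2015:2020))
x$holiday_week <- as.integer(x$week %in% as.Date(yearweek(holidays)))
```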
And that is the end of the R code. I hope I've been able to demonstrate some ways in which we can explore rolling averages and decomposition methods, and some of the visualisations we can use to plot our data. Thank you all for listening. Time is closing out, so here are my contact details if you have any further questions, and the resources that were used for this webinar can be found on the slide. So thank you for listening, and thank you all for attending.