Good morning everyone. My name is Nadia Kenner. I'm a research associate with the UK Data Service, and thank you for coming to this introductory talk on time series analysis and forecasting. I'm going to give it a couple more minutes as people are still entering, and then I'll get started with the introduction slides. So here is the content for today. We're going to discuss what time series data is, and we'll look at how it differs from non-time series data. We'll discuss what time series analysis is, also known as TSA. We'll look at the different types of TSA and the components that make up a time series. We'll look at fitting time series models and how we can train on our data to create forecasts, which are predictions. And then we'll briefly discuss the software available to run this type of analysis. So what exactly is time series data? It can be described as a collection of observations obtained through repeated measurements over time. Each instance represents a different time step, and the attributes give the values associated with that time. The intervals at which time series are recorded vary widely: you can have hourly data, daily, monthly, quarterly, yearly, and so on. But the decision of which time interval to use depends not only on your research questions, but also on the data you have at hand, because this determines what types of models you are able to run. There are three main characteristics of time series data. The first is that the data almost always arrives as a new entry. Second, the data typically arrives in time order. And lastly, time is a primary axis. The time intervals can be either regular or irregular. Typically, time series are assumed to be generated at regularly spaced intervals of time, and these are known as regular time series. The data will typically include a timestamp, like we saw on the previous slide. But some data is irregular. You'll usually be working with regular time series, so you don't need to worry about how to change an irregular series into a regular one, because most data arrives in a regular fashion. An example of an irregular time series would be something like withdrawals from an ATM. But the key point to remember is that in time series analysis, data points are recorded at regular intervals over a set period of time, rather than intermittently or randomly. Now, you might be asking yourself: how is time series data different from non-time series data? If a dataset has a time field, does that automatically make it time series data? What about cross-sectional data, can that be considered time series? What about pooled data? These are the kinds of questions that help you distinguish between the two, and the answer depends on how your data has been collected, as this affects how we can analyse changes over time. I use the word changes because this is a really key concept in understanding time series data. The major difference is that for time series data, time is a central, defining component, whereas for non-time series data it typically isn't. So time series data can get confused with other data types.
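Just to make that definition concrete before we move on, here's a minimal sketch of what a regular time series looks like in R, the language we'll be using on Thursday. The numbers are invented purely for illustration.

```r
# A minimal regular time series: monthly observations starting January 2018.
# The values are made up purely for illustration.
counts <- c(112, 98, 105, 120, 131, 127, 140, 138, 125, 119, 108, 115)
x <- ts(counts, start = c(2018, 1), frequency = 12)  # frequency = 12: monthly data

print(x)  # each value is indexed by its time step
plot(x)   # observed values on the y-axis against time on the x-axis
```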
When I first started with time series data, I had issues understanding the different data structures. So I've created some questions where you can have a go at matching the data types to the correct definitions, just to broaden your understanding. We'll start off with time series data. Can you select the right definition for time series data? If you want to head back to Mentimeter, pop in your votes and then we can discuss the answers. Do you think that time series data consists of several variables recorded at the same time? Is it data that's collected sequentially from the same respondents over time? Or is it data that's recorded over regular or irregular intervals of time? Very obvious clue in that answer, but I'm glad to see that everyone is on the right track. You would be right in saying that time series data is data recorded over regular intervals of time. I think there are still votes coming through, but yes, this is the summary for time series data. We'll just let people finish off their votes, I don't want to cut anyone off. So now we know what time series data is. Do you think you could explain what cross-sectional data is? How does it differ from time series data? Can you again match the correct definition? We've got a mix of answers, that's good. I'm glad no one selected option three, which means we're paying attention. We've got a majority of votes saying that cross-sectional data consists of several variables recorded at the same time, and yes, you would be right. Sorry to the two votes for the middle definition, but cross-sectional data is the result of data collection carried out at a single point in time on a statistical unit. With cross-sectional data, you're not really interested in the change of data over time, but in the respondents' current answers to the survey question. And the ordering of the data does not matter: you can have the data ascending, descending, you could even have it randomised, and this will not affect your modelling results. But with time series data, the order of your data is key. And lastly, what about pooled data? Can you go ahead and match that definition to pooled data? I should possibly have included an option for "not sure", but if you're not sure, then take a guess. So we've got the majority of answers saying that pooled data is a combination of both time series and cross-sectional data. If you selected this answer, well done. In short, yes, pooled data occurs when we have a time series of cross-sections, but the observations in each cross-section do not necessarily refer to the same unit. Let me give you an example to make that a bit easier to understand. Let's say we take household income data on households X, Y and Z in 1995, and then we take the same income data on households A, B and C in 1998. Although we're interested in the same data, we're taking different samples, that is, different households, from different time periods, and that makes it pooled data. And just for anyone interested, the definition that got five votes, data that is collected sequentially from the same respondents over time, is actually longitudinal data, also known as panel data. Panel data is a dataset that consists of observations of multiple individuals obtained at multiple time intervals.
Time series data focuses on single individuals, while panel or longitudinal data focuses on multiple individuals. So those are the main differences between time series data and other common data formats. But I've come up with a scenario that lets you think about this in a little more detail. Imagine you've been asked to maintain a web application and to analyse when a new user logs in, so you're interested in analysing login activity. After some careful consideration, you realise there are two ways to do this. Option A: when a user logs in, you just update the last login timestamp for that user in a single row. Or option B: you treat each login as a separate event. So if this was up to you and you'd been asked to analyse login activity, how would you collect your data? Would it be A or B? Go ahead and pop your answer into Mentimeter again. I thought we were going to have a split vote there, but it's absolutely okay to not be sure, so props to you if you said "not sure". We'll discuss the differences once the rest of the polls come through, but it looks like the majority of people so far are suggesting treating each login as a separate event. I'll let you keep thinking about this for ten seconds and then I'll move on to the answer. I quite like this visualisation: 68% of people have decided to treat each login as a separate event. Let's have a look at what option A might look like first. If you had chosen option A, which I think was about 14% of you, this is typical of a cross-sectional dataset. We have the user and we have the last login timestamp. Although this information is useful for examining which user works for what company, the time variable isn't of much use here. It simply provides some context, some attribute information. There's nothing here about changes between users. If we went for option B, we might have something that looks like this instead. In this instance, we have a new row for each time a user has logged in, so the changes are preserved, and "preserved" is the key word: each change is recorded as a new event. Doing this allows us to examine the frequency of login activity over time. So I just wanted to draw attention to data collection, because this helps in understanding what time series data looks like. To summarise: almost all data is recorded as a new entry, the data typically arrives in time order, and the time intervals can be regular or irregular. Just for some context, if a dataset has irregular intervals, the events become unpredictable and are very difficult to model or forecast with standard methods, because forecasting assumes that whatever happened in the past is a good indicator of what will happen in the future. All right, so we're going to move on to looking at time series analysis now. There are many types of analysis to consider, but the three main examples include visualisation, decomposition and autocorrelation. The graph on the right is a basic time series graph, which plots the observed values on the y-axis against an increment of time on the x-axis. Most time series graphics will look something like this. Now, the statistical characteristics of time series data tend to violate the assumptions of conventional statistical methods.
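Going back to the login scenario for a moment, here's a small hypothetical sketch of the two structures in R. The users and timestamps are invented; the point is the shape of the data, not the values.

```r
# Option A: one row per user, only the last login kept (cross-sectional).
option_a <- data.frame(
  user       = c("alice", "bob", "carol"),
  last_login = as.POSIXct(c("2021-03-01 09:12:00",
                            "2021-03-02 14:30:00",
                            "2021-03-02 16:05:00"))
)

# Option B: one row per login event, so changes over time are preserved.
option_b <- data.frame(
  user     = c("alice", "alice", "bob", "carol", "alice"),
  login_at = as.POSIXct(c("2021-02-27 08:55:00", "2021-03-01 09:12:00",
                          "2021-03-02 14:30:00", "2021-03-02 16:05:00",
                          "2021-03-03 09:01:00"))
)

# Only option B supports time series questions, e.g. logins per day:
table(as.Date(option_b$login_at))
```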
And because those characteristics violate conventional assumptions, analysing time series data requires its own set of tools and methods, which are known as time series analysis. So let's explore some of the reasons why you might want to use time series analysis. You could be interested in assessing the impact of a single event. An example of this would be identifying the number of crimes in Manchester, specifically looking for an upward or downward trend: this would very much be a descriptive analysis. Maybe you're interested in studying causal patterns, looking at the effects of variables rather than the events themselves. This is a way to understand the data and the relationships within it, as well as possibly establishing cause and effect. Or you're interested in forecasting future values of a time series, which would very much be a predictive analysis. An example of a prediction would be using previous crime data to predict future crime trends. Now, it goes without saying that these aims can quite easily overlap when working on a research project: you'll often be asking questions that span several aims and models. But these three cover the aims expected in most time series analyses. And you're not limited to these: other approaches exist, such as classification, curve fitting, intervention analysis and segmentation. These are complex models, and which you can use depends again on your data: whether you have a univariate or a multivariate data frame, whether you have a regular or irregular time series, and a lot of other things come into play. One thing to note is that because time series analysis covers many categories and variations of data, the models do tend to become quite complex. But this doesn't mean we can account for all the variance, and we can't generalise a specific model to every sample. Models that are too complex, or that try to do too many things, can suffer from a lack of fit or from overfitting, meaning the model can no longer distinguish between random error and the true relationships, which can skew the analysis and lead to incorrect forecasts. Now I'm going to talk about an example from a paper that demonstrated the benefits of forecasting and prediction with crime data. This was by Ashby in 2020, who looked at the initial evidence on the relationship between the coronavirus pandemic and crime in the United States. His aim was to understand crime patterns, and he used police-recorded open crime data to understand how the frequency of certain crime types changed from the start of the pandemic. He fitted what is known as a SARIMA model to the frequency of crime types in 16 cities from 2016 to 2020. Forecasts were then created from these models to compare the actual crime calls to those that were expected, that is, those that were forecasted. Now, I won't go through all the results and discussion, because there's a lot to unpack here: they looked at five different crime types across 16 different cities. But the main finding was that the gap between the actual crimes recorded and those forecasted differed in each city. I remember the example of theft being far lower than the forecasted trend. So this paper helped to identify a relationship between the pandemic and crime.
So I've taken a bit of an adaptation of this, and we're going to explore a case study that uses police-recorded crime data, but we'll just explore burglary in Detroit from 2015 to 2020. Our aim is to use time series graphics, that's the visualisations, along with time series analysis and forecasting, to answer two aims. Aim A is to explore the long-term trend and seasonality in burglary across the city of Detroit. This would very much be the descriptive aim, highlighting the basic trends. Then we want to look at how the frequency of burglary in Detroit changed in 2020 from the start of the pandemic, and that falls between the explanatory and predictive aims. There are roughly four steps in time series analysis, very simplified steps, but these are the steps needed. Your first step is to explore your data. Your second step involves identifying and graphing the patterns, so that involves your visualisations. The next step is to model the data, applying the correct model: in our case we'll be fitting a SARIMA model, which I'll discuss a little later on. And then you can run your predictions. So the first step involves exploring your data. But what does this actually mean in terms of time series data? Well, I've taken this screenshot from the dataset we'll be using in Thursday's session, in the code demonstration, and we have our date variable here. As you can see, crime is recorded at a daily rate. But luckily, this doesn't mean we have to use daily data: we can aggregate our data frame so that we have crime counts per week, per month, per year, per six months, the list goes on. So if you wanted to visualise something that looks like this, what do you think is the most appropriate interval for exploring crime data? If you want to head over to Mentimeter, think about which interval you would use when exploring and visualising police-recorded crime data. I think I've given the option to choose up to four answers, so go ahead and vote up to four times if you're not too sure, and we'll discuss some of the benefits and limitations of these. I'll give that about 30 seconds. Very interesting. We've got about 18 votes in, with a pretty close split between monthly and weekly. We'll let this roll on, but the polls have slowed down, so I'll start to talk through some of the limitations and benefits; feel free to continue putting in your answers. We've got seven votes for yearly and quarterly. Year-to-year comparisons of crime data are very common: it's what you see in government statistics and across the literature. But in recent years it's become clear that comparing year-to-year data means we tend to miss quite a bit of information, as it hides variation that happens within the given months. If we were to compare 2015 to 2016 to 2017, how much does that really tell us? What can we do with that information in terms of crime prevention and policy? It would be hard to detect specific trends in certain crime types if we're just looking year to year, because we're missing so much information. And this goes for quarterly and six-monthly data as well. What about monthly data? We've got 15 votes for monthly data at the moment.
And I would say that monthly data is again quite a common comparison in crime data. But at the same time, it can be quite misleading, because months do not hold the same number of days. We know that crime tends to be higher over the weekend, and some months contain more weekend days than others, so how would this affect the frequency of crime? This leads us to a smaller interval, which is weekly crime data. This was mentioned in Ashby's paper, which suggested that week-to-week comparisons probably provide one of the best time intervals for crime data, because they help to reduce some of that uncontrolled variation. Comparing week-to-week data also means we can incorporate things like bank holidays and specific holidays into the weeks they fall in, as these might be affecting the frequency of certain crimes. Daily data, again, could be effective, but it might mean your data contains too much noise, and it also makes studying things like seasonality much harder. So when working with crime data, I tend to use weekly, because this helps to reduce some of the uncontrolled variation. I went ahead and used weekly data to plot a time series graph that looks something like this; a sketch of that aggregation follows below. This addresses our aim A, which was to explore the long-term trend and seasonality in burglary across the city of Detroit. As you can see, we've got our weekly incidents, and we've got our time interval along the bottom. Although this is labelled 2015, 2016, 2017 and so on, each dot on the graph represents a new week, and this allows us to explore the variation within each year much more clearly. We can see there's a general decreasing trend from 2015 to 2020, and there are some spikes and peaks in the dataset as well. At the start of each year it seems that burglary decreases, and then it starts to increase towards the end of the year. But why exactly? That's the next question you want to ask. If you want to explore these trends further, this is why you need to look at the components of a time series, because variation in these components causes the changes in the patterns of a time series. There are four components that make up a time series. We have the trend, which was already mentioned: is the data increasing or decreasing overall? We then have cyclical variation, which is variation in a time series that repeats over a span longer than a year. We then have seasonality, or seasonal variation: these are the rhythmic forces which operate in a regular or periodic manner over a span of less than a year. That's the main difference between cyclical variation and seasonality. And then we have random or irregular movements, such as noise: this is the variation that cannot be explained. It's important to note that your noise, irregular movements and random errors, these are synonymous terms by the way, are likely to be higher when analysing shorter time periods, that is, if you have a smaller data frame, because your models won't be able to establish correlations between the lags at different time steps. So yes, the combination of these components over time forms a time series.
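Here's a minimal sketch of that weekly aggregation and plot in R. The `crimes` data frame is simulated here as a stand-in for the workshop dataset, and the column name `date` is my assumption.

```r
library(dplyr)
library(lubridate)
library(ggplot2)

# Simulated stand-in for the workshop data: one row per recorded offence,
# with a `date` column (the column name is an assumption).
set.seed(1)
crimes <- data.frame(
  date = sample(seq(as.Date("2015-01-01"), as.Date("2020-12-31"), by = "day"),
                size = 20000, replace = TRUE)
)

weekly <- crimes %>%
  mutate(week = floor_date(date, unit = "week")) %>%  # snap each date to its week
  count(week, name = "incidents")                     # count offences per week

# Basic time series graph: weekly counts against time.
ggplot(weekly, aes(x = week, y = incidents)) +
  geom_line() +
  labs(x = NULL, y = "Weekly incidents", title = "Weekly burglary counts")
```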
We can look specifically at these components by decomposing our series, which is known as a decomposition. A decomposition is simply a way to split a time series into these four components, and this is what a decomposition plot made in R looks like. We have four graphs. The top graph shows our raw data, that's the raw counts. We then have our trend, and yes, we were right to say that we have an overall decreasing trend. Then we have our seasonal variation: you might be able to see that we have the same, or a very similar, seasonal pattern each year, which would indicate that there is seasonal variation in our dataset. And the bottom panel shows the remainder, the variation left over. One of the main objectives of a decomposition is to estimate the seasonal effects, which can be used to create and present seasonally adjusted values. A seasonally adjusted value simply removes the seasonal effect so that the trend can be seen more clearly. I'll give an example of this. Violent crimes tend to increase in the summer due to many factors: there are more football games, and increases in routine activities. So to see whether there is a real trend, we should adjust for the fact that violent crime is always going to be higher in the summer than in the winter, and this is what the decomposition plot allows us to do. There are also two structures to a decomposition, which I won't spend too much time on, but it's important to know about them. These are known as additive and multiplicative. An additive plot, as the name suggests, simply adds these components, and a multiplicative plot multiplies them. Most packages will determine what type of decomposition structure you have, whether additive or multiplicative, so you won't necessarily have to establish this yourself. But let me give you an example of what the two plots look like side by side. On the left, we have the additive plot. We can see the time series is increasing throughout, but the amplitude and the frequency stay roughly the same as it increases. This could also be decreasing, by the way. And then we have a multiplicative plot: if the time series has an exponential growth or decline over time, it can be considered multiplicative. That is, the amplitude or frequency changes over time. So with our example, do you think we follow more of an additive or more of a multiplicative plot? Very difficult word to say, apologies. Take a quick look: are our components being added at this point, or are they being multiplied? You can pop your answer into Mentimeter if you wish. A mix of answers here, with the majority suggesting additive, but it's also a very close call with multiplicative and "maybe both". But we'll move on. As I mentioned, when working in RStudio you'll typically be told what type of structure you have, and in this instance this was actually a multiplicative plot. I didn't supply this title here; it was given by the package that I used. The reason this is multiplicative is that we had a decreasing trend, but the amplitudes between our weekly points were different.
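The talk doesn't name the package behind that plot, so as a rough illustration, here's one standard way to produce a decomposition in base R, assuming the `weekly` counts from the earlier sketch. The frequency of 52 reflects roughly one year of weekly data.

```r
# Turn the weekly counts into a ts object with a yearly seasonal period.
y <- ts(weekly$incidents, frequency = 52)

# Seasonal-trend decomposition using loess (STL); plot() shows four panels:
# the data, the seasonal component, the trend, and the remainder.
dec <- stl(y, s.window = "periodic")
plot(dec)

# decompose() can fit an explicitly multiplicative structure instead.
dec_mult <- decompose(y, type = "multiplicative")
plot(dec_mult)
```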
There was no consistency in those weekly amplitudes, right. The last thing to know for your time series analysis is that you need to ensure your data is stationary. This means that properties such as the mean, variance and covariance remain constant over time. Typically, stationary data looks quite flat: it doesn't have a trend, the variance is constant, and the autocorrelation structure is constant. Ensuring this means the model can make predictions based on the assumption that the mean and the variance will remain the same in future periods. So yes, you have to make sure your data is stationary. There are two ways to check this: visually, via the decomposition plots, though they can be a bit difficult to read, so I tend to run statistical tests such as the KPSS test, and there are loads of others available. Before modelling your data, you have to make sure it is stationary, and you can apply what is known as differencing to make your data stationary. But again, that goes a bit beyond the scope of this talk. So we're going to move on to looking at different time series models and talk a little bit about our SARIMA models. There are three main time series models: what are known as moving averages, smoothing models and SARIMA models. These are just three very common time series models. A moving average model is simply a series created from the averages of past values. Smoothing models calculate values from the weighted averages of past values, and there are extensions to smoothing, single, double or triple, which depend on whether your data includes a trend and whether it includes seasonality. And then we have what are known as ARIMA and SARIMA models, which are suitable for non-stationary data. We'll be using a variation of this model to address our second aim, which asks how burglary trends compared to the predicted trends over the pandemic. A SARIMA model is based on the concepts of moving averages, autocorrelation and autoregression, so it provides a much more accurate model, which then allows you to make much more accurate forecasts. SARIMA stands for seasonal autoregressive integrated moving average, which is a very long name, but it's a form of regression analysis that evaluates the strength of the dependent variable relative to other changing variables. So why would you want to use a SARIMA model? In short, it's used to understand past data or to predict future data in a series, and as I said, it generalises moving averages, making the model more effective and more accurate. Any stats heads in the room? This is quite a complicated and extensive slide. Excuse me, I'm just getting a drink. A SARIMA model is simply an extension of an ARIMA model. This means that when your data has a seasonal component, you're most likely going to use a SARIMA model; if your data doesn't have a seasonal component, then you'll use an ARIMA model. Now, there are three elements that make up an ARIMA model, and these are known as p, d and q. I won't go into too much detail about this, because again it falls a bit beyond the scope of this talk, but the p stands for the autoregressive part, and the d stands for the differencing, which is how you make your data stationary.
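On that point, here's roughly what a stationarity check and a manual difference might look like in R, applied to the weekly series from the earlier sketch. The talk only names the KPSS test; using `kpss.test()` from the tseries package is my choice of implementation.

```r
library(tseries)  # provides kpss.test()

# KPSS null hypothesis: the series IS stationary. A small p-value
# therefore suggests non-stationarity, i.e. differencing is needed.
kpss.test(y)      # `y` is the weekly ts object from earlier

# One round of ordinary differencing; re-test afterwards.
y_diff <- diff(y)
kpss.test(y_diff)

# forecast::ndiffs(y) will estimate how many differences are needed,
# and ARIMA/SARIMA models can handle the differencing themselves.
```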
So an ARIMA model will do this for you, which means you don't need to do any data preprocessing beforehand. And then you have the q, which is the moving average part. When we move on to a SARIMA model, you can see that we have the same letters but capitalised, P, D and Q, and these reference the seasonal versions of those three elements. We also have the m value, which indicates the number of time steps in a single seasonal period. If you're working with weekly data, m will be 52; with monthly data, 12; with quarterly data, 4; and so on. There are packages in R that will establish these values for you, so you won't have to calculate them yourself, which is very useful. But if you're interested in how you would choose the values for p, d, q and the seasonal P, D, Q, you would use what are known as the ACF and the PACF. The ACF, in short, tells us how correlated a time series is with its previous values: it's the correlation between observations of a time series separated by k time units. The PACF, the partial autocorrelation, measures the strength of that relationship with the intermediate lags accounted for. So that's a full structural breakdown of a SARIMA model. Let's have a look at how we can then apply this SARIMA model to answer our second aim. Again, I've created a step-by-step guide on how we could go about answering it. The first step is to count the weekly crimes. This was back at the beginning of the slides, where we decided on the time interval; once that's decided, you aggregate your data to the weekly crime rate. The next step is to model the weekly counts: you can use the ARIMA function from the fable package, which we'll be doing on Thursday. Then we can generate forecasts from these models. And finally, the last step involves plotting the forecasts, making those visualisations. I've created some images that help to explain these four steps a little better, because it can be quite confusing. So if we follow these steps with some graphics: the first step, as I discussed, is counting the crimes. This is the time series plot we made at the beginning, the basic plot with the frequency of crime plotted from 2015 to 2020. Following step number two, this is where you model the crime. Your SARIMA model is fitted to your existing data, and it essentially captures the average trend, the average seasonality, the average noise, the average cyclical behaviour, to make its own... what's the word? I've lost the word, but it makes its own trend, let's call it a trend. And from there, you can then forecast from this model. This is what we would expect to happen, the prediction: where we think the trend will be going, based on the SARIMA model of the data. And from there, you then compare the recorded crime counts to the forecasted, that is, the expected, crime counts. This was very similar to Ashby's paper.
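In code, those steps might look roughly like the sketch below, using the fable ecosystem. Only `ARIMA()` from fable is named in the talk; the tsibble conversion, the column names, and the pre-2020 training cut-off are my assumptions.

```r
library(tsibble)
library(fable)
library(dplyr)

# Convert the weekly counts from earlier into a tsibble indexed by week.
weekly_ts <- weekly %>%
  mutate(week = yearweek(week)) %>%
  as_tsibble(index = week)

# Step 2: model the pre-pandemic weekly counts.
# ARIMA() searches for the p,d,q / P,D,Q terms automatically.
fit <- weekly_ts %>%
  filter(week < yearweek("2020 W01")) %>%
  model(sarima = ARIMA(incidents))

report(fit)  # prints the fitted SARIMA structure

# Step 3: forecast a year ahead.
fc <- forecast(fit, h = 52)

# Step 4: plot the forecast over the recorded data to compare the two.
autoplot(fc, weekly_ts)
```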
And yeah, this was the kind of structure that Ashby showed, and it's really informative, especially for understanding crime trends, cause and effect, correlation and so on. So I went ahead and created this plot in R, and I was able to compare our rate of burglary in 2020 to the expected crime count. Let's have a look at what that looked like. Before addressing the graph itself, you may notice the SARIMA structure at the top. Again, this was automatically created by R: it tells me how many steps of differencing were needed, it tells me my m value, and that's the breakdown. So if you ever see something like that, you now know what those values represent: the p, d, q terms, that is, the autoregression, the differencing and the moving average. Back to the graph. I'm only showing the last part of the graph, from 2020 to 2021, and as you can see, we have two lines. We have this dotted line inside the grey band, which is the predicted crime rate made from our original model. And below it we have the recorded crimes, the crimes that actually happened, compared to our prediction. So this is where we expected the trend to go in 2020, and this is where the trend actually went. As you can see, burglary rates were a little lower than expected, and this is where you start to ask why. Can we pin this all on COVID? Are there other factors at hand? In this instance, in Detroit, the COVID-19 pandemic led to substantial changes in the daily activities of millions of Americans, just like in the UK: businesses and schools were closed, public events were cancelled, states started introducing stay-at-home orders, and social distancing came in. This led to a reduction in people's routine activities, which arguably could lead to a reduction in the opportunities to commit these crimes, because if more people are staying at home, there's less opportunity for these crimes to take place. So yeah, that was our predicted model, and I was able to show you the benefits of running SARIMA models and why they're useful, especially in the world of crime. That draws us to a conclusion, but let me just briefly look at the time: we've got five minutes, that's no problem. There's loads of software available for time series. Obviously I've been working in R, and the code demo on Thursday will also be conducted in R. But there are also languages like Python, and tools like Prophet, which was developed by Facebook: a forecasting tool available in both Python and R. It's typically used to fit additive models with a non-linear trend and yearly, weekly and daily seasonality. The benefit is that the modelling is largely automated: you don't have to make many of the decisions yourself, you just plug in your data. So if you want to start there rather than with coding everything, that would be the place for you. And just out of curiosity, can I ask what software you tend to use, be it for time series analysis or for anything else you do? I'd be really interested to know. I've left this as a word cloud, so we'll see what happens here. This is very interesting: R and Python are sitting at very similar sizes. I'm not sure what one of these entries is; I've actually never heard of it.
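For anyone who wants to try the Prophet route just mentioned, a minimal sketch in R might look like this, assuming the prophet package is installed. Prophet expects columns named exactly `ds` and `y`; the weekly data here is simulated purely for illustration.

```r
library(prophet)

# Prophet wants a data frame with columns `ds` (dates) and `y` (values).
# These values are simulated; swap in your own series.
df <- data.frame(
  ds = seq(as.Date("2015-01-05"), by = "week", length.out = 260),
  y  = rpois(260, lambda = 200)
)

m      <- prophet(df)                                    # fit the additive model
future <- make_future_dataframe(m, periods = 52, freq = "week")
fc     <- predict(m, future)                             # forecast a year ahead
plot(m, fc)  # observed points plus forecast and uncertainty band
```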
But yeah, R, Python, Excel and Stata are definitely very common applications for running time series. Thank you for taking part in that; it was mostly me being curious. Material for the code demonstration on Thursday can be found at our GitHub link. Emma, Julia or Louise, whoever's available, could you pop that into the group chat again? Much appreciated. I would suggest viewing the intro and prerequisites information slide on there: it shows you how to set up or clone a repository, as well as how to use the interactive Binder materials. But I will also be discussing how to do all of that at the start of the session, so don't feel too stressed. Resources will be shared once the slide decks are shared; I believe I'll be adding the slide decks to the GitHub repo on Thursday. If you do have any further questions that I haven't had a chance to address here, please send me an email and I'll do my best to reply to you. Thank you all for attending.