What an awesome day we're having. Thank you so much to the organizers for bringing us all here together today and to all of you for coming. I'm Zan Armstrong, and I'm excited to be here to talk about why everything is seasonal. And when I say everything is seasonal, I mean not quite everything. Everything related to change over time. But that's still a lot of things. Consider so many of the dashboards you've made, seen, or been recruited to make. Consider all of the times that you want to know if a metric is going up or down or if some event had an impact. My main point today is that if your data includes any sort of change over time, you should take seasonality into account. And I want to start by telling a story. Imagine that it's summer and you just got a new job. You are so psyched. The only downside is that there's a commute. But it really doesn't seem so bad. And at first, it's not. Fast forward a few months to early November, and this is your life: a world filled with red tail lights, misery, road rage. How could traffic have gotten so much worse so quickly? Is it going to go on like this forever? Maybe. Or maybe it's just November. Consider that the number of commuters influences the amount of traffic there is. So if everybody takes a one-week summer vacation during the 10 weeks of summer, and almost nobody takes a vacation in early November right before the holidays, that's 10% fewer commuters on the road in any given summer week. So the goal of the story is to get you in the mindset of assuming that your metric has seasonality. And in particular, to consider that the seasonality of causal factors could matter. And this is going to be the first of six points I'm going to discuss today about different ways you should think about seasonality and different ways that it could trip you up. So it's not just about the traffic, but it's about the factors that contribute to the traffic, like the number of commuters.
And it's not just about the number of commuters, but the factors that contribute to that and the seasonality of those factors, like the percent of people who are on vacation at any given time. But first, I want to talk a little bit about what seasonality is. Seasonality means patterns that repeat over known, fixed periods of time. And the most obvious of these, of course, are the seasons. And of course, also the time of day. In San Francisco, where people say that we don't have seasons, looking at these two years of data, we actually do. Look at that: two years, repeating patterns. We have seasons. But perhaps the more obvious type of season that we have is our famous fog. So when you zoom in to look at our weather data at a three-month interval, and you can see those hourly data peaks and see those days standing out, you can see that we actually get a good portion of our annual temperature variation every single day. But it's not just about weather. And it's not just about seasonality in time series that aren't growing or shrinking. While we're familiar with these annual temperature seasonalities in our own hometowns, there are other natural factors that have even more predictable and regular patterns. At Mauna Loa, Hawaii, they've been measuring parts per million of CO2 in the atmosphere since the early 1970s. And here is that data. Along the x-axis, we have every day from late 1974 to the end of 2014. And on the y-axis, we have the CO2 parts per million. And there are two things that jump out in this graph. Of course, the increase over time that we've been seeing throughout this time period. And the repeating annual increase and decrease that we have year in and year out: there's a strong seasonality to the CO2 in our atmosphere. Let's talk about a third type of data, births. Something that we don't think of as having seasonality. Babies come when they come. They come in the middle of the night. They come in the afternoon. They come after two hours of labor.
And they come after 20. In fact, baby data actually illustrates three types of seasonality: day of week, week of year, and hour of day. So let's take a look at the data. Here we have the number of babies born in the United States every single day from 1969 to 1988. And it's kind of a mess. So let's zoom in and see what we can find. So we'll zoom in on just one year of data. We're looking at 1985. Some of you in this room might be represented somewhere in this data set. And then we start to see that there's some sort of repeating pattern here. So let's zoom in a little bit further to the summer. And then we start to see that there are these weekend dips, and in between it's higher. Weekend, weekend, weekend, weekend. And you might notice that one of these weeks looks a little bit different than the rest. This one right here. It's the first week of July: July 4th looks a lot more like a weekend than a weekday. So now we can zoom back out to our full data set for this period. And now we've aggregated by week, because we've recognized the day-of-week pattern. And a second pattern starts to emerge. We can zoom in on the last four years of this time period. And again, we see a repeating pattern over a fixed period. Up from the early part of the year through the end of September, and then a decline. Up, decline. And that repeats. So in addition to having a strong day-of-week seasonality, we also see a week-of-year pattern as well. Now let's look at the minutes of the day. So this is a different data set. This is babies born in 2014, based on what minute they were born. So every single point in this data set is the number of babies born at a particular minute on a Saturday. So there were 384 babies born during 2014 at 9:31 a.m. on Saturdays. And we can look at this Saturday data and we see some pattern here. There's an increase in the morning. And there's some pattern there. So let's look at another day just for fun. How about Mondays? Same, right?
Totally the same. Let's look at all seven days. So we have Monday, Tuesday, Wednesday, Thursday, and Friday in the brown. And Saturday and Sunday in the blue. And this is babies being born. Totally a natural process. With peaks at eight o'clock in the morning on weekdays, at 12:45 p.m., and at 5:30, right before the end of the workday. So I was gonna stop here, but I thought you might want to explore a little bit more. Let's look at two of the ways that we can influence when a baby is born: C-sections and inductions. Babies born by C-section: not that many in the middle of the night, a lot at 8 a.m. Babies born that were induced: not that many in the morning, a lot more in that noon to 6 p.m. period. And then babies that were neither C-section nor induced. And so we still have a bit of an hour-of-day, minute-of-day pattern here, a subtle one that doesn't vary by day of week. So these baby data sets illustrated three types of seasonality that I want to discuss today: day of week, week of year, and minute and hour of day. And of course, two of these are very much natural cycles that our human behavior is also very closely tied to. And day of week is totally a human construct that we've made up, but it drives so much of our collective behavior. You might be wondering, what about monthly? And it's super common to see data aggregated to the month. So many of our open data sets aggregate data to the month and are published at that level. And I wanna talk about why that might not be such a good idea. So let's say that I own a restaurant. It's a great little neighborhood place. People love it when I open up the outdoor seating in the summer. Friday and Saturday nights are quite popular, as is my famous weekend brunch with mimosas. And of course I'm closed on Mondays because I've got to do data visualization sometime. So let's make up some data for this hypothetical awesome restaurant that I own. $4,000 per day in revenue.
Makes for a pretty darn boring line chart, which is $4,000 every day all the way across from 2013 to now. So let's build in a little bit of growth. I've got some inflation I have to deal with, and maybe I'm getting a little bit more popular. So we'll bring in 1.6% year-over-year growth. So now I've got a little bit of a slant to that line. Bring in some real numbers here. You might be surprised that I'm comparing January 10th to January 11th. That's because I really wanna compare Fridays to Fridays. And right now that isn't so important, but it'll be a lot more important when we bring in some more patterns. And now we can look at a growth chart. So, daily year-over-year growth. Again, pretty darn boring, with 1.6% year-over-year growth every single day. Very nice and steady bar chart. So let's bring in some week-of-year seasonality. I mentioned that I open up my patio in the summer. So let's just assume that when I have more seating I make more money. When I have less seating I make less money. So a little under $4,000 a day in the winter up through the spring. A little more than $4,000 when I open up my patio. Again, the daily year-over-year growth is still pretty boring, because it's still 1.6%: we're comparing summer to summer and winter to winter, and assuming that I open up my patio on the same corresponding day each year. And let's bring in some day-of-week seasonality. And here we go, but it's kind of a mess, so let's drill in to see what's there. As I mentioned, on Mondays I'm closed, doing my data visualizations. And I've got my really big weekends: that big Friday dinner, a Saturday brunch, Saturday dinner, and a big Sunday brunch with mimosas. So here we can zoom back out to the daily revenue. The growth is still boring because I'm still comparing apples to apples. So now I have a nice time series. I've got some baseline revenue, year-over-year growth, and two types of seasonal patterns that seem quite plausible. Let's aggregate by week and see what this looks like.
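The Friday-to-Friday comparison mentioned above can be checked with a few lines of Python. This is only a minimal sketch using the standard library; the helper name `same_weekday_last_year` is made up for illustration and is not something from the talk.

```python
from datetime import date, timedelta

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def same_weekday_last_year(day):
    """Hypothetical helper: step back exactly 52 weeks (364 days).

    364 is a multiple of 7, so the weekday is preserved.
    """
    return day - timedelta(days=364)


friday = date(2014, 1, 10)
aligned = same_weekday_last_year(friday)
# The naive alternative: the same calendar date one year earlier.
same_date = date(friday.year - 1, friday.month, friday.day)

for label, day in [("this year", friday),
                   ("364 days back", aligned),
                   ("same date last year", same_date)]:
    print(f"{label:>20}: {day} ({WEEKDAYS[day.weekday()]})")
```

Stepping back 364 days from Friday, January 10th, 2014 lands on Friday, January 11th, 2013, while the same calendar date a year earlier is a Thursday.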
And it's almost the same. We get a little bit of an angle because some of those days might be before and after I open up the patio. The weekly growth is still equally boring. We're still comparing to the previous year. And we're comparing 364 days back, so week to week. If I add in a 5% bump in March and April, I can see that in the chart. It stands out really clearly. Let's aggregate now by month. Here is our nice clean data set. We can still see some of the signal. The summer is a little bit bigger and the winter's a little bit smaller, but it looks like there's so much going on here. And especially when we bring in those growth metrics, it looks like there's a lot happening. There's a lot of variation in this data set from month to month, because we have some months where I had negative 5% growth and other months where I had almost 10% positive growth. There's a lot going on. If we wanted to find a real signal, how could we even see it? Could we see a 5% change here? Can you point to any of these and say that's a weird change, that something happened with the business, that we need to be worried or we need to be excited? So what's going on? Of course we have the factor that months have different numbers of days, but they also have a different set of weekdays each year. So in 2014, April had four full weeks, 28 days, and then two bonus days to get to the full 30 days of April. And those two bonus days were a Tuesday and a Wednesday in 2014. Moving to 2015, we have 365 days in a year, one more than a multiple of seven, so we shift our month one day forward in the week. So now those two bonus days that don't fall into a nice happy week are a Wednesday and a Thursday. So effectively now we have a Wednesday and a Thursday where the previous year we had a Tuesday and a Wednesday. April 2016 is particularly interesting because now we're in a leap year, so we move forward two days.
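The "bonus days" idea can be sketched in Python with the standard library. The function names `weekday_counts` and `bonus_days` are hypothetical, chosen here only for illustration; a "bonus" weekday is taken to mean one that occurs five times in the month instead of four.

```python
import calendar
from collections import Counter
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def weekday_counts(year, month):
    """Count how many times each weekday occurs in the given month."""
    days_in_month = calendar.monthrange(year, month)[1]
    return Counter(DAY_NAMES[date(year, month, d).weekday()]
                   for d in range(1, days_in_month + 1))


def bonus_days(year, month):
    """Weekdays that occur five times, i.e. the days beyond four full weeks."""
    counts = weekday_counts(year, month)
    return [name for name in DAY_NAMES if counts[name] == 5]


for year in (2014, 2015, 2016):
    print(year, bonus_days(year, 4))
```

For April this gives Tuesday and Wednesday in 2014, Wednesday and Thursday in 2015, and Friday and Saturday in 2016.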
And it just so happens that when we move forward two days we fall into the weekend and we get those big extra revenue days. So April 2016 is gonna have more money in it just because the days of the week that are a bonus are ones where we make a lot more money. It doesn't mean the business is growing or that I did better in April 2016 than I did in April 2015. It just means that I get five weekends to count instead of four. And I think it's easy to be like, oh yeah, that's true, that's true, that's true. How much of an effect could that really have? We see here that it can change our data from one that is very simple and easy to read, where we can easily see change, to one where it looks like there's so much more happening when nothing really is. And this is a really simple time series. We have growth, a consistent day-of-week pattern, a consistent week-of-year pattern, no random variation, no variation due to weather, no holidays, no decreases due to a bad review, no short- or long-term variation due to marketing campaigns, good press, or a new chef, and no change in trend, day-of-week seasonality, or week-of-year seasonality. And yet it looked like there was so much more going on in our monthly data. If there was a true signal there that we needed to recognize and act upon and make decisions about, it would have been hidden unless it was more than a 10% change. And for a business that has one, two, three percent margins, needing a change that big before you can see the signal is huge. So monthly aggregation is inherently a bad idea unless the data is inherently monthly. You can imagine that if I was analyzing rent checks instead, maybe I was a landlord and I had a lot of rent coming in and people were paying me at the end of the month, I don't necessarily care if they pay me on the 29th, 30th, or 31st. I might care a lot if they pay me or don't pay me in that month.
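The restaurant example can be approximated with a small simulation. This is a sketch under stated assumptions, not a reconstruction of the data shown in the talk: the day-of-week multipliers, the patio multipliers, and the choice to tie the patio season to a 52-week cycle (so that it lines up with 364-day comparisons) are all made up here. Only the $4,000 baseline, the 1.6% growth, the closed Mondays, and the big weekends come from the talk.

```python
from collections import defaultdict
from datetime import date, timedelta

START = date(2012, 12, 31)   # a Monday, so 7-day blocks are Monday-Sunday weeks
END = date(2016, 12, 31)
BASE = 4000.0                # baseline daily revenue from the talk
ANNUAL_GROWTH = 0.016        # 1.6% year-over-year growth from the talk
# Assumed multipliers, Monday..Sunday: closed Mondays, big weekends.
DOW = [0.0, 0.8, 0.8, 0.9, 1.4, 1.7, 1.4]


def revenue(day):
    """Synthetic revenue: baseline x growth x patio season x day of week."""
    t = (day - START).days
    growth = (1 + ANNUAL_GROWTH) ** (t / 364)
    week_of_cycle = (t // 7) % 52            # assumed 52-week seasonal cycle
    patio = 1.1 if 18 <= week_of_cycle <= 38 else 0.95
    return BASE * growth * patio * DOW[day.weekday()]


weekly = defaultdict(float)
monthly = defaultdict(float)
for i in range((END - START).days + 1):
    day = START + timedelta(days=i)
    weekly[i // 7] += revenue(day)
    monthly[(day.year, day.month)] += revenue(day)

# Weekly growth against the week 52 weeks (364 days) earlier; full weeks only.
weekly_growth = [weekly[k] / weekly[k - 52] - 1 for k in range(52, 208)]
# Monthly growth against the same calendar month one year earlier.
monthly_growth = {(y, m): monthly[(y, m)] / monthly[(y - 1, m)] - 1
                  for y in (2014, 2015, 2016) for m in range(1, 13)}

print(f"weekly growth:  {min(weekly_growth):+.2%} to {max(weekly_growth):+.2%}")
print(f"monthly growth: {min(monthly_growth.values()):+.2%} "
      f"to {max(monthly_growth.values()):+.2%}")
```

Under these assumptions the weekly growth stays flat at 1.6%, while the monthly growth swings by several percentage points even though nothing about the business has changed. The exact size of the swing depends entirely on the made-up multipliers.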
So in that case a month would be totally appropriate, but in so many of our data sets our behavior is so much more tied to day of week than to month of year that in most cases I would look to day of week first and look at aggregating by weeks instead of months. But the main takeaway point is to aggregate to time periods that make sense for your data. We also saw that daily data can be hard to interpret and that year-over-year growth computed against 364 days earlier can help us. That's only if we actually do minus 364 days. If we do minus 365 or 366, suddenly we're purposely comparing Mondays to Sundays and different days of the week, and here that gives us weird zeros and infinities and an impossible chart to read. And this might seem obvious at this point in the talk, but major tools like Google Analytics default to comparing against the same dates in the previous calendar year, presumably because that's what people expect. The good news is that it's easy to change, and you can just change it right in the tool in this case. So in general, compare apples to apples: if you're calculating daily or weekly year-over-year growth, compare 364 days back. So we've seen how to aggregate seasonality away, but sometimes seasonality is the story that you actually want to tell. It's where the most important insight is. It's where the seed, the kernel, the thing that you're looking for is, in a way that can actually save lives. And we saw that with the baby data set: looking at the minute of the day by day of the week was actually really informative, and we learned something and we saw something new that we wouldn't have seen if we just looked at the total number of babies per day. This is also true of a 1985 study on deaths due to tractor accidents that a friend who studied nursing told me about, because it's still used today to illustrate to medical students and nursing students the potential impact of analyzing seasonality in data, and that you can actually save lives if you look at the data at this granularity.
And this study was looking at tractor accident fatalities in Georgia, and it looked at deaths by month, location, hour, and age. And by month you can see there are two main peaks, during the planting and harvesting seasons when there's more tractor usage. And in the deaths by hour there are also a few peaks that stand out: the 11 a.m. to noon peak right before lunch, higher values in the afternoon than the morning, and that 4 p.m. to 5 p.m. peak. And knowing this actually could change the recommendations that people make about when you use your tractors, and various safety recommendations, and actually save lives. The point I wanna talk about next is to account for seasonality when you're estimating the impact of an event or doing causal analysis, or trying to get as close to causal analysis as you can. There are a lot of things that can disrupt a time series: things that are expected, like holidays, sales, events; things that are unexpected but common, like weather; things that are unexpected and uncommon, like natural disasters, terrorism, the death of a CEO, mergers. And we often wanna know what the effect is, short or long term. Let's look at some data. In September 2011, gun sales increased by 28% compared to August, the month before. And I know I just told you not to use monthly data, and I'm using monthly data. Sometimes it's what you have available and it's the best you can do. Hopefully you have a chance to test and see if day of week is a big effect or not. The restaurant business is one that typically exacerbates that. But we'll do months because it's what we've got. In January 2013, the month of Obama's second inauguration and the month after Sandy Hook, gun sales dropped by 21%. In February 2011, when nothing special happened, gun sales increased by 21% compared to the previous month. So let's look at the time series and see what we see. Here are those three points plotted on the time series.
And we can see that there is an annual pattern that we can recognize. So just by looking at the previous month, it's hard to tell if that's a true impact or if that's just the normal variation based on seasonality. So if we look at something like this and we know that we should deal with seasonality, what can we do to take it into account? One thing we can do is what the New York Times did. And I should say that this data came from Gregor Aisch and Josh Keller in this amazing analysis. One, they created the data set based on federal background checks, and two, they created a fantastic analysis where they started with this view, which is what we were just looking at with a lot more annotation. And then they showed you the de-seasonalization of it. So here it is with the seasonality, and here it is after you calculate and remove the seasonality from the data set. And now it's so much more clear: you can see the overall trend over time and these big points of unusual gun sales. One of the best parts was not only did they do an amazing, fantastic article, and I encourage you to read it, the rest of the article is also great. But Gregor also has a blog, Driven by Data, where he posted his analysis, including the link to GitHub that had the R script. So I could see exactly which method he was using. He was using the X-13ARIMA-SEATS method, which is a great method. It's in the seasonal package. Yay, R, go check it out. But how do we do it? So they used that method. If you're accounting for seasonality, the poor man's version is looking at year-over-year growth. It can be a great tool for seeing whether this year is different than last year. And if you want to isolate the seasonal component as they did, there are two major methods: decomposing with STL or de-seasonalizing with the Census method. And there's also Bayesian-based impact analysis.
If you really want to go hardcore, you can try to mimic an A/B test that never existed, and there's documentation for this in my GitHub repo that I'll post at the end of the talk. So what we're gonna talk about today is actually decomposing with STL. And the reason is that it does a three-part decomposition, it's additive, and it's really nice for illustrating ways to think about time series analysis. If you actually want to do an analysis, I would look at both methods and decide what makes the most sense for you. There are also a lot of different parameter choices with these methods as well. So there are three parts: long-term trend, month-of-year seasonality, and disruptions. So here's our time series and here are the three parts: the long-term trend, the annual seasonality, and the remainder. This is an additive method, so we can actually add these numbers together. And when you add the values for each of these in a column, you get that original time series. You don't necessarily want an additive model in an actual analysis, but it's really nice for illustration and it's a really good place to start. So let's look at these. Long-term trend: we can kind of eyeball it. We see that there's some decline, there's an increase, and there's a steeper increase, and then we can do it with math and we can see that purple long-term trend overlaid on the original graph, as well as removing the trend and seeing what's left. Looking at the long-term trend lets us ask questions like: how have gun sales changed over the last 15 years, in general? Then we can look at month-of-year seasonality, and to isolate this, I've blocked out a bunch of the chart so you can really focus on these even years and see, oh sorry, and see what's going on. And I've also overlaid all the years, all 15 years, on a single chart. The lighter green is older and the darker green is newer. So there's been a little bit of seasonal drift.
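The additive idea (trend plus seasonality plus remainder adds back up to the original series) can be sketched without any libraries. To be clear, this is not STL, which fits the components with loess smoothing, and it is not the Census method either. It is a much cruder classical decomposition (a centered moving average for the trend, and per-month averages for the seasonality), run on synthetic monthly data invented here, with one artificial spike at month index 40.

```python
def centered_moving_average(values, period=12):
    """Trend estimate: centered moving average spanning exactly one period."""
    half = period // 2
    trend = [None] * len(values)          # None where the window does not fit
    for i in range(half, len(values) - half):
        window = values[i - half:i + half + 1]
        # Half weight on both end points so every month counts equally.
        total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        trend[i] = total / period
    return trend


def decompose(values, period=12):
    """Naive additive split: values = trend + seasonal + remainder."""
    trend = centered_moving_average(values, period)
    buckets = [[] for _ in range(period)]
    for i, (v, t) in enumerate(zip(values, trend)):
        if t is not None:
            buckets[i % period].append(v - t)      # detrended value per month
    means = [sum(b) / len(b) for b in buckets]
    center = sum(means) / period
    pattern = [m - center for m in means]          # seasonal effects sum to zero
    seasonal = [pattern[i % period] for i in range(len(values))]
    remainder = [None if t is None else v - t - s
                 for v, t, s in zip(values, trend, seasonal)]
    return trend, seasonal, remainder


# Synthetic data: linear trend + fixed monthly pattern + one one-off spike.
PATTERN = [5, 8, 6, 0, -4, -6, -7, -3, -2, 0, 1, 2]
values = [100 + 0.5 * i + PATTERN[i % 12] + (30 if i == 40 else 0)
          for i in range(72)]

trend, seasonal, remainder = decompose(values)
spike = max((i for i in range(len(values)) if remainder[i] is not None),
            key=lambda i: remainder[i])
print("largest remainder at month index", spike)
```

Because the split is additive, the three components sum back to the original value wherever the trend is defined, and the artificial spike shows up as the largest remainder. For a real analysis, an actual STL or X-13ARIMA-SEATS implementation would be the place to look.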
We can see this kind of higher February and March, lower May, June, July, and then increasing into the end of the year. And here it is with that isolated, in green on top. And I should say that both this and the Census method allow some drift in the seasonality over time. And we can see that that component's actually been changing. We're getting a second peak, and that might actually be a really interesting question to ask: why is the seasonality of gun sales changing? Are there different types of sales at different times of year? What's going on? And then we can see it without seasonality, which is akin to what they did in the New York Times, and which really lets you see that trend without this extra layer. So both sides of it, both the seasonality and the series without the seasonality, are interesting and let us ask questions like: during which months are the most guns sold, and the least, and is this changing? That brings us to the remainder: disruptions and irregular one-off events. So we can kind of eyeball these and see that this looks a little funny and so does that. And that looks really funny. And again, at the end of 2016, that looks pretty funny. But it helps a lot when we can isolate it mathematically and see our time series without the remainder, and see these one-off events isolated and get to see how big they were. And if you're doing forecasting, you actually want to forecast without the remainder, off the long-term trend and the seasonality, because these are unpredictable events, presumably. And so you want to be able to assume a baseline. And this lets us ask questions like: when did gun sales spike or dip unusually, and by how much? Notice, interestingly, that they almost never dip. All these one-off events are spikes in this particular data set. And we can see this overlaid on our original data set and see in fact that January 2013 is part of a two-month spike, three-month spike.
September 2011 was part of a smaller spike that we probably thought of at the time as a really big deal, and it's much smaller compared to some of the more recent disruptions. And February 2011 was just life as usual. Here are those decomposed time series one more time. So the next point I want to discuss is that the seasonality that matters most might be in a subset of your data. And this is gonna be left as a challenge for the audience. So over beers tonight, I challenge you to describe a scenario in which ignoring the seasonality of a subset of the data will lead to misinterpreting the aggregate data. But I will give you a hint or two. So consider: what if your subsets have different seasonalities? What if they have different growth rates? Hint number two, for those of you that are interested in principal component analysis: what if there's seasonality in the components that you've defined? How does that affect your analysis? Last but not least, people and places are different, and so are their seasonalities. So many of these patterns are super ingrained. It's easy for us to overlook them and take them for granted, or to assume that our seasonality patterns are the ones that are shared by everybody around us. But people are in different places, with different ages, different cultures, different interests. And all of these things might affect their behavior, or the behavior that you see in your data. Are the people using your product the same as you? Or might they be using the product in a different way? And what about the people that you're analyzing, or the nature that you're analyzing, if you're trying to understand them? You have to look at the data at that granularity. Fortunately, that's pretty wonderful. Our lives and our data are so full of these patterns, and so this is my moment where I get to show a little bit of eye candy and give a call to action to you.
I ask you to look at the data, dive deeper, don't aggregate it away, take a look, see what those days of the week look like, see what those minutes of the hour look like, what those weeks of the year look like. Are there patterns that you never expected, that make you understand and connect better with the data that you're looking at? Are there things that you can show and reveal and share? And there are some really pretty things done throughout the web. One of them is Why People Visit the Emergency Room by Nathan Yao. He took an amazing data set of the number of ER visits per year caused by common products or associated with common products, like footballs, where there are more of them during football season in the fall. And nails, screws, tacks, and bolts, which have more during the summer, presumably because people are doing more work around the house then. But these are things you might not expect. If somebody asked you, what's the seasonality of nails, screws, tacks, and bolts, would you have known? And these things don't have to be line charts. I've used them a lot because they're a good place to start and a good illustration. But there are a lot of other ways to explore and engage with and play with seasonal data. There's Flickr Flow by Martin Wattenberg and Fernanda Viégas, who opened up our conference this morning. The beautiful Flickr Flow. I have a project called Weather Circles that gets to play with animating weather at different hours of the day and different months of the year for different cities. This is looking at cloud data in San Diego, percent cloudiness during the day. Based in Boston, there's Visualizing MBTA Data; that's the Boston subway system. A wonderful project by Mike Barry and Brian Card. This is just one of many, many charts that they made looking at the various seasonalities and patterns around how people get around Boston.
A similar project, Ville Vivante by Interactive Things, was looking at a different type of data set: how people are moving through Geneva, when they move and what they do, and how active the city is. There's Traffic Accidents Data by Nadia, who spoke earlier today, looking at traffic accidents and a lot of the patterns that are there. And it doesn't even have to be a chart at all. From Giorgia and Stefanie's postcard project that many of you might have seen, these are all the doors that Giorgia opened in a week. And you can start looking for patterns: you can see the first three doors that she opens on her way into work on Monday, Tuesday, Wednesday, Thursday, and Friday. So, in summation, as you go home, my real hope is that you're gonna spend the next week looking around you and all the time be like, oh my God, the world is seasonal. So, number one, consider the seasonality of causal factors. Number two, aggregate to time periods that make sense for your data. And if you don't know what makes sense for your data, look at it. Number three, sometimes seasonality is the story. Don't always aggregate it away or de-seasonalize. Sometimes it's where the crux of the story is. It might be your most important insight or your most beautiful rendering. Number four, adjust for seasonality when estimating the impact of an event or when doing causal analysis. Number five, the seasonality that matters might be in a subset of your data. I look forward to talking more about this tonight with anybody that comes up and gives me a scenario. Number six, seasonal patterns vary by place, by culture, and by lifestyle. It's an amazing way to understand the world around us and the people around us and the nature around us. So go and explore and enjoy, and send me your visualizations. If you'd like more info, you can see my GitHub repo. I have links to everything in here, plus some things that didn't make it into the final cut. This is just the tip of the iceberg.
There's so much more. So if your data includes any sort of change over time, take seasonality into account. Thank you.