So, you have taken the risk of coming to this session on non-stationary time series. Let us see if we can have some fun. I will start with the first principle: forecasts are always wrong. Then I will make another claim. We are going to talk about not just non-stationary time series, but multivariate non-stationary time series, and my claim is that there is no other kind of time series. If you have seen others, point them out to me as we go along. Another claim, and I hope I can hold this up over time; if you dispute it, please speak up now, otherwise it will haunt you throughout the presentation. The claim is that one cannot find a real-world process which is stationary left to itself. If you have found one, then either it is violating the second law of thermodynamics, or you have checked it over a very short period of time. Does anybody dispute this? Thank you. Now, if we cannot find such a process, then we are stuck with dealing with non-stationary time series, which a lot of textbooks do not deal with, a lot of publications do not deal with; but in the real world we are stuck with them, and we will get into why. Now, what processes are stationary, if any? You say, well, I see a lot of things in textbooks and on websites about stationary time series. Those are time series from processes which are tightly controlled. For example, when I used to work at Intel Corporation, we had the fab processes. Those are very tightly controlled: copy exactly, every parameter tuned, and as soon as anything drifts out, it is brought back, and so forth. Those processes remain stationary because they are very tightly controlled. And then there is another thing I want to mention: what you will also see is that time series are usually dealt with as univariate series. Now, tell me one process in this world which depends only on itself and nothing else. Any hands? So why do we deal with things as univariate time series?
That is just a question; you may want to answer it to me, or to yourself, at some point. Now, stationary time series give the impression, the illusion, of being stationary because they are being controlled. The last point I will come to later, because it is a sort of summary of this presentation; we will come back to it. So, what will we not talk about in this presentation? We will not talk about seasonal autoregressive moving-average models, which assume homoscedasticity (I can never pronounce that right), where the variance of the error does not depend on the value of the variable, so it is basically constant. We are not going to talk about those, or we will mention them only to say where they are useful, except not quite here. Then we are not going to talk about generalized autoregressive conditional heteroscedasticity (again, I do not know how to pronounce that). Two people got a Nobel Prize for that line of work, Engle and Granger, in economics; a very good thing. These models assume a changing variance of the noise, the error term: the plain ARCH version models that variance as autoregressive, and the generalized version, GARCH, includes a moving-average term as well. Then I am not going to talk about solving the problem using CNNs, or combinations of CNNs and LSTMs and GANs and so forth. People have gotten Turing Awards for that, good. But in the real world, what happens is we have non-stationarity; we need to figure out what the sources of non-stationarity are, and there we will talk about what is possible to do and what is not possible to do in the real world. We are going to talk about Granger causality; Granger got a Nobel Prize for the idea of cointegration. As I was mentioning, nothing is stationary by itself.
But sometimes your non-stationarity may be brought about by somebody or something else. If you just model yourself, you can difference yourself out and conclude that you are stationary, but then you are making a false model, because you actually depend on somebody else who is not stationary. Then we will talk about how to model real-world processes, where your ideas pitch in, where you say this does not make sense, all of that; this is more like the wisdom part. And the last one is dynamic mathematical models which are supported by data-driven models. This is the paradigm I am proposing is going to take us forward, and I hope I can convince you by the end of this talk that it is possible. This is a packed talk; if something does not make sense, you do not have to raise your hand, just say it, because we are going to go fairly fast through the material. We are also not going to talk about non-linear dynamics: non-normality, aperiodicity, multimodality, non-linear causal relationships, etc. We are not going to talk about all of those because that is another fun topic for another day; I will mention one of them right at the beginning and then we will just go on. So, first of all, where does non-stationarity come from? Some of you from a physics or engineering background will know this: this is the Baker's map, or Baker's transformation. This is how I got hooked; my wife really knows how to make good patties. When you make patties, what do you do? You take dough, you flatten it out, then you fold it over itself, then again flatten it out, again fold it over itself, and so on. There are many real-world processes which do this. Now watch what this actually does. You can play this, and you will see that right now some of these points are close together, you see that?
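The stretch-and-fold demo can be reproduced numerically. Here is a minimal sketch, not the speaker's demo code: it assumes numpy and uses the standard two-dimensional Baker's map, tracking two initially nearby points.

```python
import numpy as np

def bakers_map(points):
    """One stretch-and-fold step of the Baker's map on the unit square."""
    x, y = points[:, 0], points[:, 1]
    left = x < 0.5
    nx = np.where(left, 2 * x, 2 * x - 1)    # stretch horizontally, cut
    ny = np.where(left, y / 2, (y + 1) / 2)  # stack the halves (the fold)
    return np.column_stack([nx, ny])

# Two points that start very close together
pts = np.array([[0.2000, 0.3000],
                [0.2001, 0.3000]])
d0 = np.linalg.norm(pts[0] - pts[1])
for _ in range(10):
    pts = bakers_map(pts)
d10 = np.linalg.norm(pts[0] - pts[1])
print(d0, d10)  # the separation grows roughly like 2**n per fold
```

The horizontal separation roughly doubles with each fold until the two points land in different halves, after which they can suddenly jump close together again, which is exactly the behavior narrated next.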
And some of them are further apart. Now you will see, as the transformation, the patty transformation, goes through, some of them get very distant and then suddenly come close together. That is a very classic kind of non-stationarity. So let us see that: you flatten it out, fold it over itself, flatten it out, fold it over itself, and eventually it almost looks like noise. But it is not noise, because the criterion for maximal information content is maximal entropy. So it is very interesting that when you see something you cannot make sense of, maybe it is coming from other variables; it is not just coming from you. And here is the hypothesis for all of us to consider: I propose that this comes from inherent processes of creation and destruction, construction and destruction. I will illustrate this with a little information-theoretic example. Consider this little binary sequence. Now think about this: in the real world we do not actually see the whole process. Consider that this is the process; in the real world we almost never see all of it, we only see the part that is in front of us. So this is what we see in front of us. Now let us shift it one bit; we shifted it one to the right (I always mix up left and right). Notice what happened: the highest-order bit died, or went out of my view, and a new lowest-order bit came into view. What does that mean? It means there is some information coming into the system while higher-order information is getting destroyed, and you can continue that.
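The shifting-window picture just described can be made concrete with a few lines of code (a sketch; the bit string below is arbitrary, not the one on the slide):

```python
# A long binary process of which we only ever observe a fixed-width window.
stream = "1011001110001011"
W = 8  # the part of the process "in front of us"

views = [stream[i:i + W] for i in range(len(stream) - W + 1)]
print(views[0])  # initial view: 10110011
print(views[1])  # after one shift: 01100111
# With each one-step shift the leftmost (highest-order) bit leaves the view
# and a new bit enters on the right: old information destroyed, new created.
```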
So in some sense we are losing macroscopic information, which means, if you think in terms of an enterprise, an old process is dying: you are losing old macroscopic information, and the new information that is coming in is not yet macroscopic, but over time it will become bigger. Similarly, there is a process during construction where you do the reverse: you shift the bits the other way, and then the big bits become bigger and new small bits come in. Notice also that one man's destruction is another man's construction. As information content is brought in or out, you have a system which appears chaotic or uncontrolled, and this is what is called non-stationary. A stationary process will look like this: there is no shifting, information is not coming in, information is not going out. I think I have already summarized this; it is just a recap of what I just said: you will always see these processes of construction and destruction occurring, new information coming in, old information being destroyed. Not exactly destroyed; it is just becoming insignificant in terms of its manifestation. This example will make it clearer: if you look at the human system, or any process in the world, it takes birth, would you agree?
So there is some information coming in; there is growth, which is those bits becoming bigger as you shift them; and these are also influenced by other processes, which in the human world we call education. Then they influence other processes, they produce products, so those other processes are getting affected, and some of those other processes get branched off. Then they are stationary for some time, perhaps a short time, perhaps longer, you do not know. Then it winds down: the wind-down is when your bits shift the other way. You see, here the macroscopic bits are becoming bigger; on the other side the macroscopic bits are becoming smaller, etcetera. Notice that the stationary period is quite short here; most of the time you are in the growth phase or the diminishing phase. Do we not need to know what is happening during that time? My claim is that if you just keep modeling stationary processes, you cannot get very far. We need to do this because we want to control the process, and in any complex process you have complex physics and signals, so you can detect the problem, you can predict something, you can anticipate and control. For stationary time series, all you need to do is analyze; you know what it is, and then you can do statistical process control, and you are all good. So, as the world changes, its processes and time series change; forecasts for non-stationary series are non-trivial and therefore very valuable. For example, this time series: climate change. For climate-change deniers this is a very interesting slide. Notice that this cooling here, due to CFC emission reduction, was predicted, and because many countries got together they were able to get it down. So here we have a match of prediction with reality; we have not done so well with CO2. This is a class of models where people have been able to do this successfully.
More examples: interactions in finance, CRM, manufacturing, supply chain, acoustic and electrical signals from machines and structures, customer demand, production rates, electricity loads, traffic. So, what is stationarity, formally? This will address your question. Strict stationarity means the distribution does not change: whatever moments you take, they do not change. Non-stationarity is when one of those moments changes. So let us look at this in detail. Strict stationarity means the joint distribution, and hence every moment of every degree, does not change. That does not happen in the real world, so we will not bother with it. First-order stationarity means the mean does not change with time; averages stay constant. That is also not so common. The third one is second-order stationarity, which is a lot more common: you have a constant mean and variance, and an autocovariance that depends only on the lag, not on time, and this is where the Box-Jenkins type of models come in. Then there is trend stationarity, where you have a linear or quadratic trend but you keep varying around the trend, so you can model that. Then you have difference stationarity, or what I call n-th order integration: you can difference it out and say, now I am stationary. And the last one is cyclostationarity, which is seasonality. Notice that I am still calling all of these stationarity; in textbooks they would be called non-stationary. And the reason is that there are standard tests for each of these, so I can test for them, subtract the corresponding terms out, and get to a place where I can model. Meaning, these are different types of stationarity that can be reduced to stationarity; and yes, that is what you are referring to, I think.
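The types of stationarity listed above can be written compactly in the usual notation (a sketch of the standard textbook definitions, not slide material):

```latex
% Strict stationarity: every joint distribution is invariant to a time shift h
(y_{t_1},\dots,y_{t_k}) \;\overset{d}{=}\; (y_{t_1+h},\dots,y_{t_k+h})
\quad \text{for all } k, h

% Second-order (weak) stationarity: constant mean, lag-dependent autocovariance
\mathbb{E}[y_t] = \mu, \qquad \operatorname{Cov}(y_t, y_{t+h}) = \gamma(h)
\ \text{(independent of } t\text{)}

% Trend stationarity and difference stationarity
y_t = f(t) + \varepsilon_t,\ \varepsilon_t \text{ stationary};
\qquad \Delta^d y_t \text{ stationary}
```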
So the first-order non-stationarities are growth, which is trends, decay, cycles, seasonality. Second order is what these guys got a Nobel Prize for: the 2003 Nobel Prize for Robert Engle, for the GARCH family of models, which capture time-varying variance, volatility, periodicity, cyclostationarity. Then higher order: skewness, kurtosis. Then piecewise stationarity, which is very, very common in the enterprise, when you make a change to the process. If you just build a data-driven model, your model simply will not understand what happened, unless you have a mechanism in place to detect that, and it is very hard to detect; we can discuss how. And intermittency is similar to regime change, which is when you change some part of the process. Then of course you can get into arbitrary variation and so on; that is another level of detail. Now, one key thing to note is that non-stationarity may not be inherent to the forecasted variable. We mentioned previously that your variable may look non-stationary but may be cointegrated with some other variable; that was another Nobel Prize, for Clive Granger, also in 2003. Now, the kinds of models people use for non-stationary time series forecasting: this is from a 2015 paper by Chang King, and the field has not moved much beyond it, because a lot of the focus has gone to images and NLP. One family is the classical autoregressive models, which most of us are likely familiar with: exponential smoothing, seasonal ARIMA, and ARIMAX, the extended version with exogenous variables. Then vector autoregression, which is the multivariate version of the ARIMA models; then vector error correction; then ARCH, which we talked about; then regime-switching, piecewise-stationary models: somebody is going to get a Nobel Prize for those soon, or maybe in a few years.
Then you have GRUs, RNN-based particle filtering, and LSTMs. Multivariate LSTMs have become very interesting because you can do attention on both axes: variable attention and temporal attention. Then there are frequency-domain methods with Fourier transforms and wavelets, going through autoencoders and LSTMs. Then there are methods that say: for short-term features I am going to go through a CNN, feed that into LSTMs for longer periods, and feed that into ARIMA models. So people have come up with really fancy things. The unfortunate part is that this still does not work for industry, and I will tell you why. Then of course you have the non-parametric methods, like Bayesian and functional decomposition methods, and so forth. People have explored all kinds of combinations of these things. The reason we are here is because a lot of these things do not really work; they work in your research paper, in a controlled environment. Yes, absolutely, in the research lab it works beautifully. So what I am going to do is go through three of these methods very fast and show what is not quite working there, and then we keep going. Some of you may have seen these autoregressive integrated moving average (ARIMA) models. So, the different pieces. The autoregressive filter (sorry, I should not stand in front of you) takes care of your long-term trend. The moving-average filter is very interesting because it takes care of your shock, or error, term; and if your process is such that these error terms keep adding up, as in they do not just get averaged out, that is an example of a non-stationary process: it permanently shifts your mean. And the integration filter is actually very interesting: the integration filter is telling you that I have an inherent polynomial behind me.
I have an inherent polynomial. Now, see, people say you can difference it out as many times as you want. Pro tip: do not do that. If you have to difference more than twice, you are going to make your model unstable. We can get into the details of why, but in short, you do one or two differences, and once you go to the third or fourth, your model just does not know what it is doing. So if you have to do more than two, that means your time series is not merely non-stationary in itself; it is depending on five other variables, which may also have their own levels of integration, so it is time to bring in other variables. This is well studied, and if you would like more details, I like this article on Analytics Vidhya. I had given a talk at DataHack, and fifteen days later they put up this beautiful article on non-stationary time series. It is a really good article; I highly recommend it if you want an overview of some of these things. Now, how do you test for non-stationarity? Each of these equations has a characteristic polynomial, and if you have solved polynomials, you know polynomials have roots. If your characteristic polynomial has a root equal to 1, then in the simplest terms what is happening is y_{n+1} = y_n + something: your error keeps accumulating over time, so you keep drifting further and further away from the original process. If you do your differencing and then find that all the remaining roots are less than 1, the process will stay around a certain mean and within a certain variance. The standard test for this is the Dickey-Fuller test, available in most packages, as you might know.
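The unit-root idea can be illustrated with a crude sketch: estimate rho in y_t = rho * y_{t-1} + e_t and see whether it sits at 1. This assumes numpy and is only illustrative; in practice you would use a proper implementation such as `adfuller` in statsmodels.

```python
import numpy as np

rng = np.random.default_rng(0)

def ar1_coeff(y):
    """Least-squares estimate of rho in y_t = rho * y_{t-1} + e_t."""
    return float(np.dot(y[1:], y[:-1]) / np.dot(y[:-1], y[:-1]))

n = 5000
e = rng.standard_normal(n)

walk = np.cumsum(e)              # unit root: rho = 1, errors accumulate
ar = np.zeros(n)                 # stationary AR(1) with phi = 0.5
for t in range(1, n):
    ar[t] = 0.5 * ar[t - 1] + e[t]

print(ar1_coeff(walk))  # very close to 1: non-stationary
print(ar1_coeff(ar))    # close to 0.5: mean-reverting
```

The Dickey-Fuller test formalizes exactly this comparison, with the right critical values for the rho = 1 boundary case.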
If the process has a root greater than 1, then you know you have trouble: it is an explosive process, and somebody needs to fix it, or maybe your video is going viral, depending on what you are modeling. There are other extensions to this, which many of you may also be familiar with. You can have models with seasonal components, which nearly everybody uses nowadays. Then models with side information: this is the beginning of using other variables. In the econometrics world they are called ARIMA with an extension, ARIMAX, where you add terms for an exogenous variable x, or a set of x's. Models with long memory are very interesting; I wish I had the time to discuss them. We talked about integration; these are fractional integration models, also very interesting. We will talk briefly about multivariate time series models: these do address the issue of being multivariate, but they do not address the issue of non-stationarity; they still expect the individual time series to be either cointegrated or stationary in themselves. Now, the GARCH process: I should skip this in the interest of time, but since I mentioned that in principle it models the change in variance, let me just show it. What are we doing? We are modeling the change in the variance of the error term as an autoregressive and moving-average process, and that gets you that second-order behavior. These slides will be available; this chart is purposely made as a simple guide to where you use which type of model. Most of them are linear, but some can capture non-linearity, the neural networks obviously.
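To see what the GARCH machinery is modeling, here is a minimal simulation of a GARCH(1,1) process (parameters are arbitrary illustrative choices; assumes numpy): the returns themselves are uncorrelated, but their squares are not, which is the volatility clustering the model captures.

```python
import numpy as np

rng = np.random.default_rng(1)

# GARCH(1,1): sigma2_t = omega + alpha * r_{t-1}**2 + beta * sigma2_{t-1}
omega, alpha, beta = 0.05, 0.10, 0.85  # alpha + beta < 1: finite variance
n = 5000
r = np.zeros(n)
sigma2 = np.full(n, omega / (1 - alpha - beta))  # start at long-run variance
for t in range(1, n):
    sigma2[t] = omega + alpha * r[t - 1] ** 2 + beta * sigma2[t - 1]
    r[t] = np.sqrt(sigma2[t]) * rng.standard_normal()

def lag1_autocorr(x):
    x = x - x.mean()
    return float(np.dot(x[1:], x[:-1]) / np.dot(x, x))

print(lag1_autocorr(r))       # near 0: returns look uncorrelated
print(lag1_autocorr(r ** 2))  # clearly positive: volatility clusters
```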
But we are going to skip this for now; please look at it at your leisure as a simple guide. So now the question, which I have motivated before: there is no variable which depends only upon itself, so why multivariate? Yes, in fact the Baker's map that I showed comes from deterministic chaos theory, exactly. This is a crazy world to be in. This used to be the domain of physicists, and those of you who are physicists would likely know it. I have asked for an example of a real process that is not affected by anything else, and we have not been able to find one thus far. Similarly, there is nothing in an enterprise that works on its own. Sales will depend on the demand, on the production, on how many people we have hired, on my suppliers, on so many things. So how can I model sales as an autoregressive process? It just does not make sense. Same for revenue, to say nothing of the stock market. There was a time when people were modeling the stock market as an ARIMA process, and you think, what? That was partly due to the availability of computational resources: there was a time when people did these things by hand, so we have to grant them that. So, this is Clive Granger. His point is that in the real world each process is an outcome of a possibly non-linear combination of other processes. We are not going to talk about non-linear combinations today, as we said before; today we talk mostly about linear combinations, because first we handle linear, then we go to non-linear, and non-linear really gets into chaos. These processes may have mutual delays and shock events. Now, this is his key insight, the one he got the Nobel Prize for: that non-stationarity in a single time series may simply be a cointegration of multiple time series. And we will show you what cointegration is.
This had been well known in statistics for many years, but it came to econometrics later, so the guy got a Nobel Prize. And this is what I was saying earlier about differencing: the problem with differencing your way to a stationary series is that after your second level of differencing you are at risk of chaotic behavior. Why chaotic? Because what you are treating as error terms are not really error terms at all; they are coming from other variables. Differencing may be computationally easy, but it is like looking for my keys here because there is light here, when I lost them under the tree outside: I have to look for the keys under the tree. So, as we said earlier, the apparent non-stationarity could be due to contributing variables. What is cointegration? Two time series y_t and x_t are cointegrated if there are values beta_1 and beta_2 such that the combination beta_1 y_t + beta_2 x_t is stationary, that is, it has no unit root. How do you find it? You regress y_t on x_t, and on the residual you run the augmented Dickey-Fuller test. Again, a great place to look for this is that Analytics Vidhya article; this material is from the research papers. Then there are the multivariate time series models, the VAR models: basically, y_t depends on y_{t-1} and so forth, and it also depends on the x's. You can transform the non-stationary series into stationary ones, but if you cannot, you do not go forward with this; it does not go anywhere. You can still apply VAR if the series are not cointegrated, but then they have to be stationary, and the problem is that if you are not stationary, you are in trouble, because the assumptions do not apply anymore. So most of the time, in reality, we just change the problem statement so that we do not have to deal with the non-stationarity. Yes. So in fact, that is coming. That is coming.
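The two-step recipe described above (regress y_t on x_t, then test the residual for a unit root, the Engle-Granger procedure) can be sketched as follows, with simulated data and a crude rho estimate standing in for the augmented Dickey-Fuller test (assumes numpy):

```python
import numpy as np

rng = np.random.default_rng(2)

n = 3000
x = np.cumsum(rng.standard_normal(n))   # x: a random walk (unit root)
y = 2.0 * x + rng.standard_normal(n)    # y: driven by x, so non-stationary too

def ar1_coeff(z):
    return float(np.dot(z[1:], z[:-1]) / np.dot(z[:-1], z[:-1]))

# Step 1: regress y_t on x_t (OLS slope; no intercept, both are zero-mean here)
beta = float(np.dot(x, y) / np.dot(x, x))
resid = y - beta * x

# Step 2: check the residual for a unit root (crude rho estimate standing in
# for the augmented Dickey-Fuller test)
print(ar1_coeff(y))      # ~1: y alone looks non-stationary
print(ar1_coeff(resid))  # far below 1: the combination is stationary
```

Here y inherits its non-stationarity entirely from x, so the combination y - beta * x is stationary even though neither series is.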
That is coming, yes. See, sometimes it is like when somebody is doing a Ph.D.: they start off with the holy grail. I also started off like that. Most of what my thesis committee did over time was help me change the problem to make it more practical, and once we had five papers, we said, staple them together and that is the Ph.D. thesis; all the holy grail went into future work. You can do that if you have the five papers. Now, people have proposed temporal causal models based on Granger causality. The problem with these models is that they are computationally intractable, because you need to keep shifting: say you have four time series, you need to keep shifting each one with respect to the others and see where they fit, and it is combinatorially explosive. As for causality: the statement of Granger causality, which will come later, is that if y is predicted better by y and x together than by y alone, then x Granger-causes y. That is Granger's point; you may not agree with him, but the guy got a Nobel Prize, and it kind of makes sense, right? If x improves the prediction of y beyond y itself, then x is likely a predictor of y. After his insight it sounds like a tautology, so I am going to skip this slide, because it just says the same thing in mathematical terms. What do you do to create this temporally dynamic network of relationships? You create a time-dependent network. This was the sales and revenue example, right? Notice that multiple things affect this, and multiple things affect this. So this is the Tigramite package, which all of you can download. Now, the fun thing about this is that Jakob Runge came up with this idea: I am not going to use any real time series.
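The Granger idea stated a moment ago ("y is predicted better by y and x together than by y alone") can be checked directly by comparing residual variances of the restricted and full regressions. A minimal sketch with simulated data (assumes numpy; real work would use a proper F-test, e.g. `grangercausalitytests` in statsmodels):

```python
import numpy as np

rng = np.random.default_rng(3)

n = 4000
x = np.zeros(n)
y = np.zeros(n)
for t in range(1, n):
    x[t] = 0.6 * x[t - 1] + rng.standard_normal()
    y[t] = 0.4 * y[t - 1] + 0.8 * x[t - 1] + rng.standard_normal()

def resid_var(target, regressors):
    """Residual variance of an OLS fit (no intercept; series are zero-mean)."""
    X = np.column_stack(regressors)
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return float(np.var(target - X @ coef))

restricted = resid_var(y[1:], [y[:-1]])       # y from its own past only
full = resid_var(y[1:], [y[:-1], x[:-1]])     # ... plus the past of x

print(restricted, full)  # adding lagged x cuts the error: x Granger-causes y
```

The combinatorial problem the speaker mentions comes from repeating this for every ordered pair of series at every candidate lag.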
Instead of any real time series, he created four differential equations, simulated them, and said: this is my time series. Those are obviously all stationary, and it is all beautiful, so it works for that; it totally works for that. So he is able to discover (this is my own diagram, but in his diagrams he also finds these kinds of relationships) the causal structure. But in the real world, guess what: the world does not generate its time series from clean differential equations. If your data is that nice, you can probably take his package and run with it. It is good stuff, because he is building on Clive Granger's work; it is a nice package, and one can probably extend it to other scenarios. The key thing to note from this figure is that one thing influences many other things, and multiple things influence one thing. So you cannot, for example, say that production in my factory is an ARIMA process; that would be incredibly naive, because it depends on your suppliers, your vendors, your downtime, your shipping, your inventory, etc. Skip this. Yes. How do we deal with non-stationarity, then? Because so far we have just said what does not work. One way, which Professor João Gama in Portugal has proposed, is to use KL divergence tests on multiple windows of the data. As the data keeps coming in, you keep windowing it, or you apply exponential decay, or whichever choice you prefer, and you keep computing KL divergences; you set a threshold for how much divergence is acceptable to you, and when the divergence crosses it, you change your model. So what you are doing, essentially, is learning from the past based on time shifting, windowing, and weighting techniques. The challenge with this is: how do you create those windows?
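The windowed KL-divergence check just described might look like this in practice (a sketch, assuming numpy; the histogram binning, window width, and threshold are exactly the knobs the speaker says someone has to choose):

```python
import numpy as np

rng = np.random.default_rng(4)

# A stream whose distribution shifts halfway through (a regime change)
stream = np.concatenate([rng.normal(0, 1, 2000), rng.normal(2, 1, 2000)])

def kl_divergence(a, b, bins=20, lo=-5.0, hi=8.0):
    """Discrete KL(a || b) over a shared histogram grid, with smoothing."""
    p, _ = np.histogram(a, bins=bins, range=(lo, hi))
    q, _ = np.histogram(b, bins=bins, range=(lo, hi))
    p = (p + 1) / (p + 1).sum()  # Laplace smoothing avoids log(0)
    q = (q + 1) / (q + 1).sum()
    return float(np.sum(p * np.log(p / q)))

W = 500                      # window width: one of the knobs to choose
reference = stream[:W]
divergences = [kl_divergence(stream[i:i + W], reference)
               for i in range(0, len(stream) - W, W)]
print(divergences)
# Small while the regime holds, then jumps once the mean shifts:
# crossing a chosen threshold is the signal to re-fit the model.
```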
Who sets those thresholds for the shifts? I am not going to sit down and do it, and neither are you. So what to do? Now, a lot of people at this conference will tell you: just go to LSTMs, that will solve all your problems with time shifting, because LSTMs do not require the stationarity assumption; they can work with anything. All right, so this is where I am going to ask: should I talk about LSTMs? Okay, we will come back to LSTMs, because I still want to give you the overall solution before they kick me out in five minutes, and the presentation you get will have the animations. This slide shows how information flows in an LSTM and how it works. People have tried to do this for aircraft engine failure; this is from Microsoft, who wanted to show how things work in Azure. The challenge is that they are doing this without an actual model of the aircraft engine; they are trying to do it with a straight LSTM, and I will tell you why that is a problem. Then you have the idea of multivariable LSTMs with both temporal attention and variable attention. If you are going to do blind time series, I have found this very useful for lab data, beautiful for lab data; with real data you will still be in trouble. Because remember, when we make forecasts from these models, we are going to make decisions based on them, which may change what is happening in a factory, in my aircraft, in my distribution system. So I cannot have them at 70% accuracy; I need them at 99%. So what do you do? We still have a problem. What to do? That is why Niels Bohr said: prediction is very difficult, especially about the future. And that is where your point comes in, which is: whenever possible, formulate the problem in a different way.
And here is the different formulation of the problem. Usually, in an industrial situation, we have some knowledge about the relationships that exist. Our knowledge may not be perfect, but we at least know that much, and for some relationships it is almost perfect: I know that sales is directly correlated with revenue; it is literally y = mx + c. So there are some relationships like that, and then there are some non-linear relationships. So the recommendation, and I wish we had a five-day workshop for this, what I have seen work in my short experience, is: let us create a hypothesis of what is happening in the real world. That is one model. GE and MathWorks call it a digital twin; you have heard of that? Yes. Now, people poo-poo that thing, especially if you are doing deep learning; you will say, I can do everything without it. There is a challenge with that, but let me get to the challenges afterwards. Take the example of insurance. What you do is create a partial delay differential equation with a network of relationships. Most people will know partial differential equations; a partial delay differential equation has explicit terms for the time delays, because you cannot always account for why there is a lag, so you include those delays explicitly. And then you create that network of relationships, like this one. In some cases I may initially just put in a linear relationship, because I do not know better. And if I do not know at all, I may use a deep learning system and, on top of that, compute Shapley values: you take your deep learning model, fit one variable against the others, one variable against the others, and from the Shapley values you essentially find out the sensitivity, the derivative, of each variable with respect to the others.
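As a toy version of the delay-differential "network of relationships" idea, here is a hypothetical two-node example (all names and parameters are invented for illustration; assumes numpy): demand drives sales after a delay tau, and the equation is integrated with simple Euler steps.

```python
import numpy as np

# Hypothetical two-node network: demand drives sales with a delay tau.
#   d(sales)/dt = k * demand(t - tau) - decay * sales(t)
dt, tau, k, decay = 0.1, 2.0, 0.5, 0.3   # invented illustrative parameters
steps = int(50 / dt)
lag = int(tau / dt)

t = np.arange(steps) * dt
demand = 1.0 + 0.5 * np.sin(0.3 * t)     # assumed exogenous driver
sales = np.zeros(steps)
for i in range(1, steps):
    delayed = demand[i - lag] if i >= lag else demand[0]
    sales[i] = sales[i - 1] + dt * (k * delayed - decay * sales[i - 1])

# sales settles around (k / decay) * demand, lagging it by roughly tau
print(sales[-1])
```

In the speaker's proposal, a full model would be a network of many such delayed relationships, with the unknown coefficients and delays corrected against data.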
That only gives you first-order information, but let us start with first order; we do not have second order anyway. Let us start somewhere. Once we have this kind of network, the deep learning system can inform it; it is almost like an adversarial setup. This is my differential-equation-based network, the kind of thing Viral Shah was talking about on the first day, right? Differentiable programming. So this is one network, and it generates something; there is another model which generates something; you see where they differ, and that is where you do your reinforcement. You say: wait a minute, what do I change here so that I match the other one? Go ahead, yeah? Afterwards, yes, we will do that afterwards. So we create these two models. We use the partial delay differential equation model to create and test hypotheses and to simulate production, and we keep correcting that model. See, this model is the one that goes into production, not the deep learning one, because the regulators will kill you. And there is one more thing you can do with this: you find the highest-information-content experiment. You can run a Monte Carlo simulation on this PDDE and ask which experiments will get me the maximum information about these relationships. Some experiments will be a waste, so I am not even going to collect that data; but now I know where to get information from, and I can keep building my model, etc. This works extremely well in enterprise situations compared to plain vanilla machine-learning models. So, in short: in the real world we can only predict for a short time, even if we know everything perfectly; we showed that with the Baker's map. Purely data-based methods, purely machine-learning methods, fail; purely knowledge-based methods fail.
So what you need to do is combine your existing knowledge, which is the point Viral Shah also made in the differentiable programming talk: you need to bring in your domain knowledge. If you throw away domain knowledge, you have to relearn it all over again, and your bot ends up walking on the ceiling and all that. If you want to avoid that, you have to encode the knowledge ahead of time, and then you can improve the model by finding the experiments with the highest information content. I wish I could show you this fun stuff, but another time. Thank you very much for your time. Questions, comments, criticisms, problems? It is fundamentally, see, fundamentally everything is Bayesian inference, because we are trying to approximate Bayesian inference. Yes. Agreed, totally. Of course, I am around; so if you have tips on how I can improve this, let me know.