Hello everyone and welcome to another episode of Code Emporium. Today we're going to see if we can predict stock market data. Spoiler alert: not very well. But I still wanted to make this video because there are a few points I wanted to get across. First, this is a good example of another use case for time series analysis. Second, we're going to use an evaluation method for time series that works better than a typical metric like mean absolute error, especially in cases like the one we'll see here, and we'll see why we need that kind of metric. And third, machine learning isn't magic: your model is only as good as your data. I wanted to hammer that point home with an example like the stock market, where good data can be hard to come by and you simply don't have enough information to reliably predict the next day's closing price. With all that in mind, let's get started. I'm going to be using data from Yahoo Finance; specifically, Tesla's stock price data. Go to Yahoo Finance, type in TSLA (Tesla's ticker), and you get this nice grid of historical prices. Set the time period to the max, which covers everything from June 28, 2010 up to the current date, hit Apply, and then hit Download to get a CSV. That CSV is what we'll be using here. Back in the main project, loading that file gives you this little data frame. What we're interested in for stock market data is predicting the closing price for the next day, given all of our data up to the current date.
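Loading that download into a data frame can be sketched like this. To keep the sketch self-contained (no file on disk), the rows below are synthetic values written in the same column layout Yahoo Finance's CSV export uses; in practice you'd point `read_csv` at the downloaded file instead.

```python
import io
import pandas as pd

# A few rows in the same format as the Yahoo Finance CSV download.
# (Synthetic values here -- in practice, read the downloaded TSLA CSV file.)
csv_text = """Date,Open,High,Low,Close,Adj Close,Volume
2010-06-29,3.80,5.00,3.51,4.78,4.78,93831500
2010-06-30,5.16,6.08,4.66,4.77,4.77,85935500
2010-07-01,5.00,5.18,4.05,4.39,4.39,41094000
"""
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"]).set_index("Date")
print(df["Adj Close"].iloc[-1])  # most recent adjusted close
```

Parsing the `Date` column and setting it as the index makes the later lagging and joining operations one-liners.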
The adjusted closing price is just a more accurate field, so we'll be using that as our label. Now, a very brief look at some top-level stats: we have data from 2010 onwards, which spans about 4,000 calendar days, but we only have about 2,800 rows because the stock market is only open on weekdays, not weekends. That's roughly five sevenths of the total 4,000. Cool. And if you plot how the Tesla stock has behaved, at least over the last 50 trading days, it looks like this: no clearly observable seasonality or anything of the sort, but we'll roll with it. This is our data. The stock market is particularly volatile around this time, so this is going to be fun to predict. Next up is some feature engineering. Think about what, from the data we have so far, would actually be useful in predicting the next day's closing price. Because of the noise I'm seeing in these charts, I'm mainly going to use variables lagged by one day: the previous day's closing price is the main predictor of the next day's closing price, just to keep things simple for now. I'm also throwing in a day-of-week predictor to capture any weekly seasonality, if it exists. And we also have volume data: the total number of buys and sells that happened on each day. We'll use the previous day's volume to account for basic supply-and-demand effects.
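The lagged-feature setup described above can be sketched as follows; the column names are my own illustrative choices, and the price and volume values are synthetic stand-ins for the real frame.

```python
import pandas as pd

# Synthetic stand-in for the Yahoo Finance frame (Date index, Adj Close, Volume).
dates = pd.bdate_range("2021-05-03", periods=6)  # business days only, like the market
df = pd.DataFrame(
    {"Adj Close": [670.0, 673.0, 659.0, 672.0, 663.0, 684.0],
     "Volume": [1_000, 1_100, 900, 1_200, 950, 1_050]},
    index=dates,
)

# One-day lagged features: yesterday's close and volume predict today's close.
df["prev_adj_close"] = df["Adj Close"].shift(1)
df["prev_volume"] = df["Volume"].shift(1)
df["day_of_week"] = df.index.dayofweek  # 0 = Monday ... 4 = Friday
df = df.dropna()  # the first row has no previous day to lag from
```

`shift(1)` pushes each value down one row, so each sample's features come strictly from the prior trading day.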
If we format our data set that way, a sample looks something like this: the adjusted close is our label, and the features are the previous day's adjusted close, the previous day's volume, and the current day of week. I'm going to throw this all into statsmodels. What we see is a really high correlation, as we'd expect, between the previous day's adjusted close and today's close. What's really surprising is that we see 99.5% explainability with just these variables alone, which suggests we might actually be doing pretty well at predicting next-day prices. Something else that's interesting is the day-of-week predictor: its 95% confidence interval lies entirely in the negative region, and since the coefficient is negative, it suggests that stock prices fall as the week goes on. I'm using ordinal encoding here: instead of one-hot encoding with a bunch of binary dummy variables, every day is encoded as 0, 1, 2, 3, 4 from Monday through Friday. So the model is saying that as the week progresses toward Friday, prices fall, whereas on Monday they're higher. But this runs against how stocks typically behave, because of something called the weekend effect. If you go to the venerable source that is Investopedia, it defines the weekend effect as a phenomenon where Monday returns are often significantly lower than Friday returns, which is the opposite of what we're seeing.
This could be for a number of reasons, honestly. It could be that a linear model just isn't capturing that kind of effect, or that Tesla specifically behaves very differently from other stocks. I don't know, but this is what the data says, so we'll go with it. Another thing we're seeing is that, for previous volume, zero falls within the confidence interval, which makes me believe previous volume just isn't that strong a predictor, at least in a linear regression model. And the previous adjusted close is so predictive of what we're looking for that it's probably why we see such huge explainability here. I'm still going to keep previous volume in the model, though, because I noticed that if you remove the other variables, previous volume on its own does have a significant impact and some explainability associated with it. So if you replace linear regression with a slightly more complicated model, you might see those effects much better. We'll keep it in. All right, next is the model training phase. I've defined a function for evaluating a model using mean absolute error: take the actual adjusted closing price for the day minus the price the model predicted for that day, take the absolute value, and average that across all predictions. That gives us our evaluation metric.
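That evaluation function can be sketched in a few lines; the function name and the sample numbers here are mine, not the video's.

```python
import numpy as np

def evaluate(actual, predicted):
    """Mean absolute error and mean absolute percentage error."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    abs_err = np.abs(actual - predicted)
    mae = np.mean(abs_err)                     # average dollar error
    mape = np.mean(abs_err / np.abs(actual))   # average relative error
    return mae, mape

mae, mape = evaluate([100.0, 200.0], [110.0, 190.0])
# mae = 10.0, mape = (0.10 + 0.05) / 2 = 0.075
```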
In addition to that, I'm also considering mean absolute percentage error, just to make sure we see things from every lens: sometimes the MAPE tells a different story than the absolute error, and vice versa, so we want to capture both views of our data. Now, we pass the data through a pipeline: first some preprocessing, which is standard scaling of the numerical variables and ordinal encoding of the categorical ones, and then we fit our linear regression model. What's interesting is that we see a mean absolute error of $16.72, or a MAPE of 3%, which looks very, very good. Almost too good to be true, right? But is it really as good as it seems? Okay, let me not exaggerate: being $16 off the ticker price is still meaningfully off, but 3% doesn't look too bad. Let's take a closer look at what these predictions actually look like. If you plot them, the blue line is the actual value of the stock and the orange is the predicted value from our model, and you can see that our model looks like it's lagging the actual price by exactly one day. That is actually not good at all. Compare, say, around May 10th, 2021: the stock price was somewhere down here, but we were way up here, and it's only the next day that the prediction "corrects" itself, by which point the stock has moved again. It's like that for almost every data point. You can probably see that effect right here in the table, too.
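The preprocessing-plus-model pipeline described above might look like this in scikit-learn. The training frame is synthetic and the column names are my own; the structure (scale numerics, ordinally encode the day, then regress) is the point.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Synthetic training frame; column names mirror the features described earlier.
rng = np.random.default_rng(1)
n = 200
X = pd.DataFrame({
    "prev_adj_close": 600 + np.cumsum(rng.normal(0, 3, n)),
    "prev_volume": rng.uniform(0.8e6, 1.2e6, n),
    "day_of_week": rng.choice(["Mon", "Tue", "Wed", "Thu", "Fri"], n),
})
y = X["prev_adj_close"] + rng.normal(0, 3, n)  # today ~= yesterday + noise

# Scale the numeric columns, ordinally encode the day of week, then regress.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["prev_adj_close", "prev_volume"]),
    ("cat", OrdinalEncoder(categories=[["Mon", "Tue", "Wed", "Thu", "Fri"]]),
     ["day_of_week"]),
])
pipe = Pipeline([("prep", preprocess), ("model", LinearRegression())])
pipe.fit(X, y)
preds = pipe.predict(X)
```

Bundling preprocessing and model in one `Pipeline` means the same transforms are applied identically at fit and predict time.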
On this first day, July 2nd, the actual value is $678, which we predicted correctly. But then it fell to $659, a drop of almost $20, while we were still predicting way up near yesterday's truth. The next day the stock is at $650, but now we predict around $660, because we're still playing catch-up with the truth. So if you look at the numbers, it's really not that great a predictor. What I think is actually better is an evaluation metric where we stratify the test samples. Let me scroll down to what I mean. Here is a graph where the x-axis is the actual day-over-day percentage change of the stock. So, for example, if the price was $100 yesterday and $110 today, the percentage change would be 10%. I label every single sample in our data set with its actual percentage change, as you see here: today's close minus the previous close, divided by that previous close, which was about 0.29% for this specific case. Then I stratify the samples by that value. The zero here means we consider only the samples for which the absolute day-over-day percentage change was at least zero, which is everything, and there the mean absolute error is around that $16 figure, or about 3% MAPE. But now let's look further up the axis: what if we only consider the days where the change was at least 4%?
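The stratified metric can be sketched like this: label each day with its absolute day-over-day percentage change, then compute the MAE restricted to days at or above each threshold. The function name and the toy numbers are mine.

```python
import numpy as np
import pandas as pd

def stratified_mae(actual, predicted, thresholds):
    """MAE restricted to days whose |day-over-day % change| >= each threshold."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    pct_change = np.abs(np.diff(actual) / actual[:-1]) * 100  # vs. previous day
    err = np.abs(actual[1:] - predicted[1:])
    rows = []
    for t in thresholds:
        mask = pct_change >= t
        rows.append({"threshold_pct": t,
                     "n_days": int(mask.sum()),
                     "mae": float(err[mask].mean()) if mask.any() else np.nan})
    return pd.DataFrame(rows)

actual = [100.0, 101.0, 96.0, 106.0]      # daily closes
predicted = [100.0, 100.0, 101.0, 96.0]   # a one-day-lagged predictor
table = stratified_mae(actual, predicted, thresholds=[0, 4])
```

A threshold of 0 reproduces the overall MAE; raising the threshold isolates exactly the big-move days where a lagging predictor does worst.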
A 4% move is a pretty big change, and it looks like we get worse at predicting those days: the mean absolute error would have been $35 if we considered only those points. And how many days is that? I have it on the secondary y-axis over here: at 4% it's maybe 75 days or so, which is a pretty large number of days we're predicting badly. This line shows that we get progressively worse and worse as the day-over-day change grows, and that is bad. When I look at this graph, I see that we're not that good. But when I only looked at that single 3% number from before, I thought, oh my goodness, 3%, that's so good. I feel this method of stratifying the test samples by percentage change is a much better representation of what our performance really is. I did it for the mean absolute error, and we also have it for the mean absolute percentage error (multiply each of these values by 100 to get the actual percentage). You can see the MAPEs increase as the change gets larger, which is not good. We'd expect some of this from any model, but this is pretty drastic; it grows well over linearly, which is bad. So all in all, this data is really not that great at predicting big changes in the stock market, at least for now. A potential improvement is adding Google Trends data. If you go to Google Trends and type in Tesla, you'll get these graphs.
One of them is "interest over time", a score on a scale from 0 to 100 which is, I think, a function of the total number of searches made for a term like Tesla. We can fetch that through an API called pytrends, and I append the trend score, ranging from 0 to 100, to every single sample by date. Now, if I look at the search trend score in isolation through statsmodels, I see that it can be pretty significant. In a linear model alongside all those other features it might get dwarfed, like the volume was, but it might be useful in a more complex model. That's about all I'm going to show here, but a cool thing I've put together is a little template for y'all to try out. See what happens if you use a more complex model: does it actually improve performance overall? And by performance I mean not only that 3% overall number, but also the stratified evaluation we were talking about. Maybe there are other things you'd like to try, too. I only used one day of previous data for every sample, but maybe you'd want to look at the last week or month of data to see if that adds predictive power. Also, I used about ten years of data just now, and I probably don't need that much; I could likely have gotten away with a much smaller time frame. Maybe try that and see if it affects anything. I have a template over here.
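Merging a daily trend score into the samples could look like the sketch below. To keep it self-contained (no network call), the trend scores are synthetic; in practice you would fetch them with pytrends, roughly `TrendReq().build_payload(["Tesla"])` followed by `interest_over_time()`, though treat those fetch details as my assumption about the API rather than a recipe from the video.

```python
import pandas as pd

# Daily price samples (synthetic). In practice this is the lagged-feature frame.
dates = pd.date_range("2021-07-01", periods=4)
samples = pd.DataFrame({"adj_close": [678.0, 659.0, 650.0, 660.0]}, index=dates)

# Synthetic 0-100 interest scores, standing in for pytrends'
# interest_over_time() output (real fetch omitted to avoid a network call).
trends = pd.DataFrame({"trend_score": [55, 61, 48, 70]}, index=dates)

# Left-join on the date index so every sample carries that day's trend score.
samples = samples.join(trends, how="left")
```

A left join keeps every price sample even on dates where the trends API returns nothing, leaving a NaN you can impute or drop.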
I'd be interested to see what you guys come up with. I also have some cool blog posts on Google Trends, and some code as well; all of this will be linked in the description down below. I hope you got the overarching point of this video: another look at time series analysis and how it can be applied, how you can evaluate time series models in a different light, and a reminder that models aren't as good as they seem under traditional evaluation methods like mean absolute error and mean absolute percentage error. Your data defines how good your model is: if your data is bad, your model will also be bad. If your data is very noisy, up and down and all around, you can't expect too much from your model; you'll need to either redefine the way you structure your problem, or try something else entirely. I hope that gives you a good sense of what's going on here. I've explained a lot about the code and the pipelines in previous videos. I like to structure my models this way, and I hope you do too; it seems very clean to me. Do let me know your thoughts in the comments down below, and we'll see you in the next video. Take care. Bye.