I don't have a live demo and I don't have a nice presentation. What I do have, though, is an archived notebook that I prepared for this particular thing. So, why is it titled Bike Share? Because... Well, let's step back. I want to give you a motivational example about statsmodels. statsmodels is a module available in Python. It used to be a part of SciKits. It's not in SciPy, and it's not merged with stats, which was the library I was using for a while. Now it's a separate module and has several goodies.
And now to the bikes. So there is that website out there called Kaggle, which I don't know what they really do, but in addition to whatever business they do, they also run competitions. And the principle of all these competitions is that you get some data to write a predictor that, given a training set, will predict values for the contest set. And that's how you get scored. So I thought I would run one of these as an example of what statsmodels can do for you. People who know R very well will chuckle, but that's fine. That's okay. Chuckles are welcome.
So there you go. We have a data set, and this is what it looks like. This is just a little bit by way of an introduction. It has days, and then some indicator of the season, and what the weather was on a given day. And at the end, which you cannot see and neither can I because the screen is small... Oh, there you go. That's how I make it scroll. There are the numbers of bikes rented on a given day. So the data comes from a bike rental system, not unlike our Bixi, except that this one is, I think, called Capital Bikeshare and is from Washington, D.C. They run, in fact, the Bixi system, and you can get their data. I don't know if you can get Bixi data. Can you? No. All right. So there you go. Otherwise we could do the same for Montreal. So we're trying to predict how many bikes will be rented on any given day, right? So first, what does the data look like?
And it looks very much like what you get if you do any sort of web business and you have logs that show usage numbers and so forth. It actually looks kind of like that, day by day, so if you do that sort of thing, it should be very recognizable. This is over two years, and you see weekly periods and weekly variation. Then you see variation with seasons, and then what everybody hopes for, steady growth over time. That's great. More and more bikes.
And now, given how disjointed I tend to be, I'll step back and give my personal motivational example of why I actually had to dig out statsmodels and why stats was not sufficient. So imagine you have data like this, right? It's not a very good example of what I want to show, but you have some trend over time. My problem was that I was trying to detect outliers, things that would just go haywire every now and then in the data we were receiving. And to account for steady growth, which was the case, I would fit some sort of line to it using linear regression. And then it turned out that the outliers themselves would greatly affect what this line is. Now, this graph doesn't show it very well for bikes, so I faked one by adding what in image processing one would call salt-and-pepper noise. So I just randomly spiked it. Here's our regular data right here, right? And every now and then there's something that goes totally wrong, absolutely wrong. We know it's totally ridiculous and it's basically a malfunction. So we would like to detect this against the trend over time, and I wanted to fit a line to that.
And in the normal library... which, by the way, I guess I'm trying to sell statsmodels as something you can actually use for normal development purposes, not just for data analysis or data science, something that you can put in your software. So stats, the module that normally everybody would use for this sort of thing, is basically a library of functions: you can generate data following a given distribution, you can fit a distribution to a data set, and so on. Now, what it lacks and what statsmodels has is advanced modeling. So my motivational example is effectively called robust regression, right? This picture is supposed to illustrate that you can make a fit, using functions available in statsmodels, which will not be fooled by the presence of this sort of perturbation, all right? So that was my motivational example. This is why statsmodels comes in useful. Basically, more advanced stuff is available there in a nice, useful package that you can actually use in your software.
Now, back to our bikes. So how do we actually model that? The contest data is actually more problematic than what I'm going to do, because they want you to predict things on an hourly basis. But I got data aggregated day by day, and I split it along the same principle as they did in the contest, which is: the first 20 days of the month are your training data, and the following 10 or 11, or 7 in the case of February, if you bike in February, that's your test set. That's how they organized it, and I followed that, except on a day-by-day basis. Furthermore, what we need is some way of evaluating whether our method is good, right? So how do we do that? In the traditional way: I wrote a simple function that basically computes the sum of squares of the differences. That's what we usually evaluate. That's what these methods do, minimize least squares.
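
The split and the evaluation function described here could look roughly like the sketch below. The function names, the `date` and `rentals` column names, and the toy data frame are hypothetical; the notebook's actual code is not shown in the talk. The score is the sum of squared differences divided by the number of test points.

```python
import numpy as np
import pandas as pd


def split_by_day_of_month(df, date_col="date", last_train_day=20):
    """Rows from days 1..last_train_day of each month are training data,
    the remaining days of each month are the test set."""
    day = df[date_col].dt.day
    return df[day <= last_train_day], df[day > last_train_day]


def score(actual, predicted):
    """Sum of squared differences divided by the number of points."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean((actual - predicted) ** 2))


# Toy daily data covering January and February 2011.
days = pd.DataFrame({"date": pd.date_range("2011-01-01", "2011-02-28", freq="D")})
days["rentals"] = np.arange(len(days))

train, test = split_by_day_of_month(days)
print(len(train), "training days,", len(test), "test days")
```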
That's why it's called that, right? It minimizes the sum of squares. So this is my function for evaluating, and I will check a few methods, some not from statsmodels and some from statsmodels, showing how they differ and how you actually use statsmodels in practice. Is that clear? All right then.
So what is the natural predictor for any sort of time data? I just spoke about weather, right? It's kind of natural to expect that bicycle usage will be affected by what's out there. And there's this funny fact that a very good predictor for what the weather is going to be tomorrow is, well, it's going to be the same as today. And you're right most of the time, in fact, because it doesn't really change that rapidly. So I wrote a simple function somewhere there that basically takes the most recent available day. I said, okay, we don't have data for this point; let's take the most recent one that we have data for, and that's our usage, right? So that's the naive forecasting that everybody would use. We are not in statsmodels yet, right? And, well, it shows some score, the sum of squares divided by the number of test points, and it's pretty large. So not great, Bob, unfortunately. But here's the first score I got, right? This is a trivial method. Let's remember that number. It will turn out that this is actually hard to beat.
So, okay, let's use regression with the functions available in stats, right? We take x, the vector of temperatures, and y, the number of rented bicycles, and let's try it out. You can see that there is some correlation, right? So we use the functions that are available in... this is still stats, right? And then you see that the p-value, well, it's close to zero, so it's unlikely to be a coincidence.
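
The two baselines described so far, the last-known-value forecast and the temperature-only regression with `scipy.stats.linregress`, can be sketched as follows. The data is synthetic and the column names (`date`, `temp`, `rentals`) are assumptions, so the printed scores say nothing about the numbers shown in the talk.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic daily data: seasonal temperature, rentals that grow with it.
dates = pd.date_range("2011-01-01", "2011-12-31", freq="D")
doy = dates.dayofyear.to_numpy()
temp = 15 + 12 * np.sin(2 * np.pi * (doy - 110) / 365) + rng.normal(0, 3, len(dates))
rentals = 2000 + 120 * temp + rng.normal(0, 400, len(dates))
df = pd.DataFrame({"date": dates, "temp": temp, "rentals": rentals})

# First 20 days of each month are training data, the rest is the test set.
is_train = df["date"].dt.day <= 20
train, test = df[is_train], df[~is_train]


def score(actual, predicted):
    """Sum of squared differences divided by the number of test points."""
    return float(np.mean((np.asarray(actual) - np.asarray(predicted)) ** 2))


# Naive forecast: hide the test values, then carry the most recent
# known (training) value forward into each test day.
last_known = df["rentals"].where(is_train).ffill()
naive_pred = last_known[~is_train]
naive_score = score(test["rentals"], naive_pred)

# Simple linear regression of rentals on temperature, fitted on training data.
fit = stats.linregress(train["temp"], train["rentals"])
reg_pred = fit.intercept + fit.slope * test["temp"]
reg_score = score(test["rentals"], reg_pred)

print(f"naive score:      {naive_score:.0f}")
print(f"regression score: {reg_score:.0f} (slope p-value {fit.pvalue:.2g})")
```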
It's very likely that there is indeed a correlation with temperature. Now, how do we do regarding that score? Well, that's our score. Believe me, it is more than what we got using the naive method, meaning, let's just take the last day we had data for. So, okay, we are using a fancier method, well, kind of, and we are still doing worse than the method that my 14-year-old daughter came up with. So that's not great, still.
Well, let's try, finally, statsmodels. Now, that's a line of code I would like to talk more about, and this is in the spirit of, well, let's make statsmodels a Python module that behaves like R, and that's what it really is. If anybody knows R, you will be familiar with the sort of syntax where you just pass a formula that relates our dependent variable to some independent variables, and from that formula there is some code in the background that will compute the model, right? So, here's what I took: temperature, humidity, and wind speed are the factors. And that's what the method will compute. It's really that simple. I feed it the data frame that has these columns, which I showed at the beginning, and the amount is supposedly a function of these three things. All we don't know is the coefficients that should be there. And conveniently, the printout of the summary will show us what these coefficients are. We can see, well, they all appear to be significant in the statistical sense, and that's great, right?
So let's test it out. The great module we're discussing made a fit for us and displayed lots of information about it. And, oh, here it is, right? Well, let's try. Oh, well, ta-da, that's the first time we're better than the naive method. Just barely so, but we are, right? So this shows us what it really looks like: green is the test data, blue is my prediction, and red is the absolute error.
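
A formula-based fit of the kind described here might look like the sketch below, using `statsmodels.formula.api.ols`. The data is synthetic, and the column names (`rentals`, `temp`, `humidity`, `windspeed`) and the coefficients used to generate it are assumptions rather than the notebook's actual values.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 365

# Synthetic daily weather and a rental count that depends on it linearly.
df = pd.DataFrame({
    "date": pd.date_range("2011-01-01", periods=n, freq="D"),
    "temp": rng.uniform(0, 30, n),
    "humidity": rng.uniform(30, 90, n),
    "windspeed": rng.uniform(0, 40, n),
})
df["rentals"] = (3000 + 100 * df["temp"] - 15 * df["humidity"]
                 - 20 * df["windspeed"] + rng.normal(0, 300, n))

# First 20 days of each month for training, the rest for testing.
is_train = df["date"].dt.day <= 20
train, test = df[is_train], df[~is_train]

# R-style formula: dependent variable on the left, factors on the right.
model = smf.ols("rentals ~ temp + humidity + windspeed", data=train).fit()

# The summary lists the fitted coefficients and their p-values.
print(model.summary())

# Predict the held-out days and score them.
pred = model.predict(test)
mse = float(np.mean((test["rentals"] - pred) ** 2))
print(f"score on test days: {mse:.0f}")
```

The formula only names columns; the intercept is added automatically, and the unknown coefficients are what the fit estimates.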
Okay, pat on the back, we are better. Can we do more? Of course we can do more. Nothing beats overfitting, right? So we get more variables. I would like to have my... Am I really that bad? Come on. I thought I was more entertaining than that. It's just like, oh, right.
So something I would point out, and it's really obvious to anyone who does any sort of statistical analysis: there are variables there like season, day of the week, and weather that are not really numerical. These are categorical variables, right? And if you take a statistics class, they'll tell you what to do with them. You introduce dummy variables and so on. So there's a little bit of work. Fortunately, you don't have to see it now, and as fortunately I do. So the thing is... what's going on? Should I, like, hold it? No, hold it, that will be the device. All right, so... No, you're not helping. I'll talk...
Okay, so this follows the formula part, right? It follows some sort of simple formalism, and one of its features is that you can define which variables are categorical. Seasons are, you know... they have them like one, two, three, four: winter, spring, summer... ah, autumn, right? It's not something that scales. You can expect bike usage to kind of look like this. So this is not a linear thing, and you don't really want to put it in the linear equation as is. It will be taken care of magically, the right way, and you will see, for instance, and this is kind of nice, that it appears to be important whether the weekday is Saturday or Sunday, right? So there's a distinction between the weekend and everything else. As for the other weekdays, Monday, Tuesday, and so on, which one of these exactly it is, is not a statistically significant variable, right? So it kind of tells you something about this data.
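
In the statsmodels formula syntax, a variable can be marked as categorical by wrapping it in `C(...)`, which generates the dummy variables automatically. The sketch below uses synthetic data with an artificial weekend effect; the column names, the season coding (1 = winter to 4 = autumn), and all effect sizes are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
dates = pd.date_range("2011-01-01", "2012-12-31", freq="D")

df = pd.DataFrame({
    "temp": rng.uniform(0, 30, len(dates)),
    "weekday": dates.dayofweek.to_numpy(),             # 0 = Monday ... 6 = Sunday
    "season": (dates.month.to_numpy() % 12) // 3 + 1,  # 1 = winter ... 4 = autumn
})

# Made-up effects: season shifts the level, weekends lower it.
weekend = df["weekday"] >= 5
season_effect = df["season"].map({1: -1000, 2: 300, 3: 800, 4: 200})
df["rentals"] = (4000 + 60 * df["temp"] + season_effect
                 - 800 * weekend + rng.normal(0, 300, len(df)))

# C(...) tells the formula machinery to treat the integer codes as labels
# and to create one dummy variable per level (minus a reference level).
model = smf.ols("rentals ~ temp + C(season) + C(weekday)", data=df).fit()

# One coefficient and one p-value per dummy variable, e.g. C(weekday)[T.5].
print(pd.DataFrame({"coef": model.params, "p_value": model.pvalues}))
```

Each dummy coefficient is measured against the reference level (here Monday and season 1), which is why a weekend effect shows up on the Saturday and Sunday terms.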
These things are a little bit difficult to read, but with some practice you can get there. And the score, now that we fed so much data into it, is actually much better, right? If you remember, it's less than half of what it was at the beginning. So I like the module very much, because it plays nice with actual production code, unlike R, for instance. It's very good for exploration as well. There are lots of things in it. It is very good to understand what you are doing, but that's not only about statistics, it's also about plumbing and anything else you do. So that goes without saying. So where was I... It plays nicely with production code. It's very good for exploration. You don't have to deal with R. That's an advantage, trust me. And what else is there to say? There are many, many other methods implemented there that come in useful if you're really into it, like time series analysis, for instance, which would not be appropriate for this problem, by the way, I think. But all right, that's that. Questions?