Hello everybody, my name is Raphaël Delhome. I work at Oslandia, a small French company specialized in GIS applications and open source tools. I will present some work about bike-sharing station analysis, done by my colleague Damien Garaud and me. To introduce the talk: you certainly know what a shared-bike service is. This picture comes from Lyon, in France. In large cities you may want to make short trips within the city, and you need a transportation mode; a shared bike is a good solution, easy to use. The big questions to answer are: are there available bikes at the stations? Or, if you already have a bike, are there free stands where you can drop it off? Those are the kinds of questions we tried to answer, using some machine learning. The first question we wanted to answer is whether bike-sharing stations can be classified in terms of the way users use them — maybe there are similar patterns between stations. We were also focused on building a complete data pipeline from open data: you can gather such data on open data portals, so the framework goes from the portals to a homemade API, through data storage and some processing steps. In this presentation I will first show you the data we used. Parts two and three are proofs of concept of the kinds of analysis we can do with such data, and to finish I will show you a small API that we developed. To begin: we are talking about geospatial data. We are interested in how bike-sharing stations work, and we can gather this data on public portals. For example, we worked with the data of Lyon and Bordeaux, two French cities.
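To give an idea of what harvesting such open data involves, here is a minimal sketch of a per-city normalization step. The field names (`number`, `last_update`, and so on) are assumptions for illustration: every portal exposes its own schema, so each city needs a small adapter like this one.

```python
import datetime

def parse_snapshot(raw_records):
    """Normalize raw open-data records into flat rows.

    The input field names are illustrative; a real adapter maps the
    portal's actual schema onto station_id / timestamp / counts.
    """
    rows = []
    for rec in raw_records:
        rows.append({
            "station_id": rec["number"],
            "timestamp": datetime.datetime.fromisoformat(rec["last_update"]),
            "available_bikes": rec["available_bikes"],
            "available_stands": rec["available_stands"],
        })
    return rows

# Fake record standing in for one element of a portal's JSON payload.
sample = [{
    "number": 1001,
    "last_update": "2017-09-02T08:05:00",
    "available_bikes": 4,
    "available_stands": 16,
}]
print(parse_snapshot(sample)[0]["available_bikes"])  # 4
```

The flat rows can then be appended to a database table, giving exactly the station-id / timestamp / availability data frame described next.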
It could seem a little rough, but such data are not complicated to handle: we just have station IDs, timestamps, and, at each timestamp and for each station, a number of available stands and a number of available bikes. The sum of these two quantities is normally equal to the number of spots at the station. We also have some stations that don't work: in this example, on the second line, there are no available stands and no available bikes, which means the station is closed. In our studies we just manipulate this kind of data frame. How do we handle this data? We worked with Python, and to build our data pipeline we used Luigi, a Python library dedicated to data pipeline building. The idea was to get the data from the portals — the data came in JSON format and XML format, and we were able to find some shapefiles too — to store it in a database, and to apply some processing, including machine learning, to it. The first analysis we did with the data was to classify bike-sharing stations in terms of their utilization. Some stations are not used much; some are heavily used in the evening, others in the morning, and so on. We wanted to know how such a system works in a city. When I talk about time series, these are bike availability time series. For example, in green, through one day, we can see a station where there are a lot of bikes available at the beginning of the day, and in the evening it's the opposite; in orange and blue, some other examples. This work was inspired by a similar study done for Dublin by James Lawlor. The result of such a clustering is this kind of figure: we get four groups of stations. These results are for the city of Lyon.
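The talk does not name the clustering algorithm; k-means, as used in the Dublin study it cites, is a natural choice, so here is a hedged sketch of clustering normalized daily availability profiles into four groups. The profiles below are synthetic stand-ins for the averaged hourly time series computed from the harvested data.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Fake hourly profiles for 30 stations: each row holds a station's mean
# bike availability for hours 0..23 (synthetic; real profiles come from
# the harvested time series).
hours = np.arange(24)
day_shape = np.sin(np.pi * hours / 24)  # bikes pile up during the day
profiles = np.vstack(
    [day_shape + 0.05 * rng.standard_normal(24) for _ in range(15)]
    + [1 - day_shape + 0.05 * rng.standard_normal(24) for _ in range(15)]
)

# Rescale each station profile to [0, 1] so clustering compares usage
# *patterns* rather than station sizes.
mins = profiles.min(axis=1, keepdims=True)
maxs = profiles.max(axis=1, keepdims=True)
profiles = (profiles - mins) / (maxs - mins)

# Four clusters, matching the number of groups chosen in the study.
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(profiles)
print(km.labels_[:5])
```

Each label can then be joined back to the station coordinates to color the map, which is what the next figure shows for Lyon.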
We found one cluster gathering all the stations where a lot of bikes are present during the day but not during the night — that's the blue cluster. There is an opposite cluster, in red, where a lot of bikes are at the station during the night but not during the day. These reflect different situations: blue is probably a place where people work, and red is where people live. People take a bike in the morning and ride it to where they work; in the evening, they take a bike and return to the place where they live. In purple we have another cluster representing stations where people go out in the evening, for example places near restaurants and bars. To make it clearer, we can plot all the stations with this clustering on a map — this is the city of Lyon; I don't know if somebody in the room knows Lyon. I'll begin with the purple circles, the places where people go out in the evening: it's the city center, with all the restaurants and bars. The blue points, where people work, are the business neighborhood around the main train station, plus the universities at the top of the map. And the red points are mainly places where people live — a lot of residential areas. That's it for the first example of application. Then we tried to predict the availability of bikes at the stations. What we know in such a situation is the date, the time of the day, and how many bikes there are at the station; the question we want to answer is how many bikes will be at the station in the next hour. So we use supervised learning: we use the available information to predict a new piece of information, and we can verify our predictions because we have the data. We used the method called XGBoost; I won't go into the details of the method.
To keep it as simple as possible: XGBoost is a gradient boosted decision tree method, combining decision trees with a gradient method. What we want to predict is the probability of finding a bike at the next hour, and the input X variables are the hour (the time of the day), the day of the week (is it a Monday, a Tuesday, and so on), and the number of available bikes at the current hour. To show you the proof of concept: on the top you have the predictions of the model, on the bottom the ground truth. Even if the figure is small, you can see that the blue points are really similar in the two graphs, and the red points as well. Blue means that in one hour there will be a lot of bikes at the station; red means it will be hard to find a bike. It's a synthetic scale between 0 and 1. What is interesting is that if we plot the error, on the left, you can see there are not many stations where the model performs badly — maybe just two stations on the left where the model is not able to predict the right number of bikes. We didn't really work on the features; we didn't try to improve the feature engineering to get a better error indicator, but that could be a way to get better results. To finish the presentation, I can show you a little demo of our API — it's on the internet, at this link. The API has been developed with Flask, so it's a REST API. We have the two cities, Bordeaux and Lyon, and we can imagine gathering data from more cities to add to the API. There is a project on GitHub — I will give you the address at the end. And there is a documentation; well, it's not really a documentation, it's more a way to get some data in JSON format. Let's try a little demo: we have one table, and we can try to gather some data, maybe by indicating the dates.
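The prediction setup described above — hour, weekday, and current bike count as features, next-hour availability as target — can be sketched as follows. The talk uses XGBoost; as a stand-in, this sketch uses scikit-learn's `GradientBoostingRegressor`, another gradient boosted trees implementation, and a fabricated training set in place of the real week of harvested snapshots.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)

# Synthetic training rows standing in for one week of station snapshots:
# features are (hour of day, day of week, bikes currently available);
# the target is the station's fill ratio one hour later, on a 0..1 scale.
n = 2000
hour = rng.integers(0, 24, n)
weekday = rng.integers(0, 7, n)
bikes_now = rng.integers(0, 21, n)
# Fabricated relation: stations empty out during the morning rush.
target = np.clip(bikes_now / 20 - 0.2 * ((7 <= hour) & (hour <= 9)), 0, 1)

X = np.column_stack([hour, weekday, bikes_now])
model = GradientBoostingRegressor(n_estimators=100).fit(X, target)

# Predicted fill ratio at 8am on a Monday with 10 bikes docked.
print(float(model.predict([[8, 0, 10]])[0]))
```

Comparing such per-station predictions against the held-out ground truth gives the two maps and the error plot discussed above.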
So we had data from September — from the 2nd of September, sorry. We can get three days of data, focusing on Lyon again, for a given bike-sharing station. Let's try — well, I don't know the ID of this station by heart, but that's the idea. You execute the request and get all the data in JSON format. Maybe I will show another example after, but that's the general idea. Besides this request tool, we have graphical parts in the API. We can display all the stations in Lyon and in Bordeaux, zoom in and get the station names — for example, this one is Tolstoï / Verlaine. We have a list of stations and all the transactions... there is nothing there. Where is my mouse? We can also click on a station to get more information about it. For example, this station, Terreaux / Terme, is located in the city center. We have some information: the daily profile — it's a station where people go in the evening, there are a lot of bikes in the evening — and also the daily profile for Monday, Tuesday, and so on. We have the time series for the last day, or the last week if we want, and the last transactions. For example, we are on the 3rd of February; we still get data every 5 or 10 minutes, so the API is able to show this recent data. What I mean by a transaction is when someone takes a bike from, or drops off a bike at, the station. This is the GitHub project: it's called jitenshea — it's Japanese; it's my colleague who chose the name. I was not so confident in my ability to demo the API live, so I prepared some screenshots. Just this chart, which I didn't show you: it's the amount of transactions for the most used stations. We can see that the two most used stations are near the main train station, Part-Dieu. That's a little indication of how the network is working. And that's it.
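A REST endpoint of the kind just demonstrated can be sketched with Flask in a few lines. This is not the actual jitenshea API: the route path and the in-memory dictionary standing in for the harvested database are illustrative assumptions.

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Stand-in for the database that the harvesting pipeline fills.
TIMESERIES = {
    ("lyon", 1001): [
        {"timestamp": "2017-09-02T08:00:00", "available_bikes": 4},
        {"timestamp": "2017-09-02T08:05:00", "available_bikes": 6},
    ],
}

@app.route("/api/<city>/timeseries/station/<int:station_id>")
def station_timeseries(city, station_id):
    """Return the availability timeseries of one station as JSON."""
    data = TIMESERIES.get((city, station_id), [])
    return jsonify({"city": city, "station_id": station_id, "data": data})
```

A GET on `/api/lyon/timeseries/station/1001` then returns the JSON payload; the graphical pages of the demo are just views built on top of such endpoints.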
In conclusion, I hope I was able to show you that with simple geospatial data we can do some interesting things with machine learning. We tried to build a complete data pipeline with Python and Luigi, and we built a little API — if you're interested, you can extend it. One of our objectives is to do online learning: in the examples I showed you, we trained an XGBoost model and a clustering algorithm with only one week of data. One goal could be to train our algorithms online, so that with new data the training keeps improving and the models stay good as fresh data arrives. Thank you for your attention. You have the address of the GitHub project there; you can also go to the Oslandia blog for more information — there are some blog articles about this project — and you can send us an email if you're interested. So, thank you; maybe there are some questions.

Q: How are you obtaining the data from the APIs? Do you have some API that is supposed to be available?
A: Are you talking about our API, or the geospatial open data portals?
Q: How are you collecting the data about the bicycle transactions? Are you providing that as a service to the city of Lyon?
A: No, no, it's an R&D project — although that could become an objective. The city of Lyon publishes the data.
Q: Oh cool, so the data is basically free and you can gather it from their API?
A: Yes, it is open geospatial data; you just have to go to these websites and gather it. The data is in JSON format.
Q: How often do they expose the data? Is it daily?
A: Every five minutes or so. We just built a cron job, which gathers the data every five minutes and stores it into our database.
Q: Okay, cool, thank you. For your pattern classes, do you consider the data of just one day, or one week, or only business days?
A: This one? It's a typical day.
A: A typical business day? Exactly — well, we just mix every type of day.
Q: And these are four classes — how did you decide that there should be four?
A: Oh, that's our modelling choice. Yes, it's just four classes.
Q: So if you pick a holiday, you probably get different patterns, right?
A: Yeah, yeah. We showed the Lyon results, which are good; the results for Bordeaux are not as interesting. It depends on how the city is organized.
Q: Have you tried Paris?
A: No. The data certainly exists, but we didn't check. It could be interesting to add it.
Q: I'm also quite interested in the project — maybe we can continue with questions afterwards.
Q: [Inaudible — about gaps in the data?] Maybe the question is about missing data?
A: Yes. We just gather the data; we have a cron job that recovers the data every five or ten minutes, and we do not have any alerting tool that says "warning, there is missing data". We handle it a posteriori, after gathering the data. I'm not sure about that — it's not me who did that part — but I guess the missing stations are simply dropped. Yeah, just dropped.
Q: [Inaudible — could this be generalized to other data, such as ATM availability?]
A: I would say this work could be generalized to anything with the same pattern of data: geospatial data with a continuous gathering process. We just didn't check the ATM case; we focused on bike-sharing stations because we did some work about transportation — how transportation is organized within a city. But without a doubt, it would be possible.
Q: [Inaudible — are you offering this to the operator?]
A: Not yet; we were just at the proof-of-concept step. One perspective could be to propose the API to the service provider. But I don't know the added value of such a thing for them — it's a way to reformulate the data, so maybe it could be interesting. The question has not really been asked yet.
A: Not quite — we did not use PostGIS for this study. Everything was done with Python: we used Luigi, the data pipeline builder; scikit-learn for the machine learning; and Folium, a library for building maps, just to represent information on maps. The only database-related part of the project was the data storage — no operation was done on the data through PostGIS or PostgreSQL. It could be possible, yeah, of course.
Q: [Inaudible — about the new station-less bike-sharing operators?]
A: That's a good question. As you said, this kind of new actor, without stations, raises different questions. We are a bit out of scope there — I don't know exactly how it works when you can drop your bike anywhere and unlock it with a password from your phone.
Q (audience comment): The issue with incentive management is that these new players give you incentives to take a bike from, or bring it back to, particular places, instead of rebalancing with trucks. In New York, for example, trucks used to come and collect all the bikes that accumulate downtown and redistribute them across the city; because of the traffic this is hard and delayed, so you end up with plenty of bikes concentrated in the downtown area.
A: Yeah, it's a different problem, because we no longer have the station availability question. You could imagine — it's not really our kind of application, but some sort of travelling-salesman-like problem, maybe, to find the nearest bike to you. It's definitely a different problem, but it could be interesting to work on. There are incentives, yeah.