OK, welcome. My name is Joris Van den Bossche. I'm going to introduce you to the GeoPandas library. Maybe, just because I'm also curious: who has already heard of GeoPandas before, and who has already used it? A few people, cool. So, very shortly about myself: I'm a bio-science engineer based in Ghent, but I'm currently working in Paris for the Paris-Saclay Center for Data Science, and I'm also a core developer of pandas. For this audience, I don't have to explain the difference between raster and vector data, but I show it to say that in this talk, I will focus on vector data. So when I talk about geospatial data analysis here, it's about vector data: the typical simple features with attributes. Again, for this audience, I also don't have to explain that there is a lot of open source geospatial software upon which full stacks of tools are built: you have GDAL, you have GEOS. And in Python, it's no different. There are a lot of Python packages, some of which have already been mentioned in previous talks, that build upon those base libraries: for example pyproj for PROJ.4, and rasterio, Fiona, and Shapely have already been mentioned. Fiona wraps the vector features of GDAL, Shapely wraps the GEOS library. So I'm going to look a bit more in detail at Shapely. It's an interface to GEOS under the hood, and it provides you the geometry objects and all the spatial operations and predicates that you would like to do on them. A very short code snippet to show you a bit how it looks: you can create a point, create a line string, do an operation, for example a buffer, which creates a polygon, and then check with a spatial predicate whether the polygon contains the point. So it's a very nice API to GEOS, but the limitation of Shapely is that it's based around those single objects, while in practice you often, of course, have many of those points or lines.
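A minimal sketch of the Shapely snippet described above (the coordinates are made up for illustration, not taken from the talk's slides):

```python
from shapely.geometry import Point, LineString

# Create single geometry objects
point = Point(2.35, 48.85)
line = LineString([(2.0, 48.0), (3.0, 49.0)])

# A spatial operation: buffering a point creates a polygon
polygon = point.buffer(0.1)

# A spatial predicate: does the polygon contain the point?
print(polygon.contains(point))  # True
```

Every operation here works on one geometry at a time, which is exactly the limitation the talk points out next.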
And there's also no easy way to work with attributes, so not with the typical data that is attached to your geometrical data. On the other hand, in Python there is a very strong ecosystem of tools to work with data. pandas is one of those packages, and it's one of the reasons for the increasing use of Python in data science, machine learning, and academic research in general for data analysis. It provides data structures specifically to work with tabular data, so spreadsheet-like data: like R's data frame, you have a pandas DataFrame, and very similar things that you could do with SQL database tables you can also do with pandas. Again, a very small snippet: you have a lot of input-output functionality, and here you see two operations. The first one is boolean filtering, which in SQL would be a WHERE operation, and then a group-by operation. So many of those typical data manipulation things are available in pandas. So now, what is GeoPandas? GeoPandas tries to combine pandas with the abilities of Shapely to work with geometry objects, and in that way it tries to make it easier to work with a bunch of geometry objects and their data attributes. It was started by Kelsey Jordahl already a few years ago. And it's not only pandas and Shapely that it builds upon: it also builds upon many of the other libraries, for example Fiona and GDAL for the input-output, pyproj for reprojections, rtree for the spatial index, et cetera. But to give you really a bit of an idea of what it is and how it works, let's try some live demo. And I will increase this in size. This is a fine size. So, importing a few things. The first thing, of course, is that you want to get data from somewhere, and GeoPandas has a lot of input-output functionality, mainly based on Fiona and thus on GDAL.
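The two pandas operations mentioned above can be sketched like this (the table is a tiny made-up stand-in, not the talk's actual data):

```python
import pandas as pd

# A small illustrative table
df = pd.DataFrame({
    "district": ["Louvre", "Louvre", "Opera"],
    "bike_stands": [20, 35, 15],
})

# Boolean filtering -- the equivalent of a SQL WHERE clause
large = df[df["bike_stands"] > 18]

# A group-by operation -- the equivalent of SQL GROUP BY
total = df.groupby("district")["bike_stands"].sum()
print(total)
```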
As example data for this notebook, I downloaded some data about the bicycle sharing system in Paris: the different stations and how many bikes there are in the stations, as a GeoJSON file. It's not fully visible here, but you can see it's a GeoJSON file. So I can read that in, and you see here how it looks: I show the first couple of rows. You have your attributes and then a column, in this case called geometry, which holds your geometrical objects. For this data, they are points. But I also downloaded the different districts of Paris, and if I read in that one, you will see here, if I scroll a bit to the side, that my geometries are polygons. So what are those things that I showed? If you look at it, it's a GeoDataFrame. So it's just a data frame, like pandas has, with all the abilities to manipulate it, but with a special column, your geometry column, which enables all the geospatial operations. This column can always be accessed as .geometry. So here are my points, and this is called a GeoSeries. And if you access a single element of that series, you get back those Shapely objects, so you still have the interface of Shapely available on the single objects. As I said, it's still a data frame, so all the typical operations still work. For example, here I do some boolean filtering by saying I only want those rows where the station is actually open, so it's actually a working station. Another of the abilities of pandas is to quickly visualize some of your attributes: for example, here is the number of bike stands in all the stations, so you see their distribution. But since it's now a GeoDataFrame, you also have access to all the typical operations and predicates that are available in GEOS. For example, I want, for each district, the area. In this case, yeah, it's in latitude/longitude, so the area is not saying that much.
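A sketch of the GeoDataFrame basics described above. For self-containedness it builds a tiny frame in code instead of reading the talk's GeoJSON file (which would be something like `geopandas.read_file("stations.geojson")`); the station names and coordinates are hypothetical:

```python
import geopandas
from shapely.geometry import Point

# A tiny GeoDataFrame built in code, standing in for the Paris stations data
stations = geopandas.GeoDataFrame({
    "name": ["A", "B"],
    "open": [True, False],
    "geometry": [Point(2.34, 48.85), Point(2.37, 48.86)],
})

# The geometry column is a GeoSeries ...
print(type(stations.geometry))
# ... and a single element of it is a plain Shapely object
print(stations.geometry.iloc[0])

# It is still a regular data frame, so boolean filtering just works
open_stations = stations[stations["open"]]
```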
Or I could calculate for each station the distance to a single point. I searched, using geopy, for the coordinates of the Notre-Dame in Paris, so this is this point. So now I want to know the distance from all my stations to that point. And again, you can very easily calculate this, and you get vectorized element-wise operations for all geometries in the object. Another one is, for example, contains: which district contains the Notre-Dame? This gives true and false values, so I can do a filtering operation, and you can indeed see that this is the quartier which is called Notre-Dame, so it's logical that the Notre-Dame is there. All the other typical ones, crosses, intersects, all those operations are available as methods on a GeoDataFrame and a GeoSeries. Another thing that GeoPandas provides, using matplotlib under the hood, is to very quickly make some basic visualizations of the data that you have. For example, I can plot all the different districts, or give them some coloring based on some other column. Yeah, all the styling options of matplotlib are available. Just to show you a few of them: the stations are a bunch of points, and because it would be easier to see where something is located if I put some streets behind it, I downloaded the streets of Paris using another library, OSMnx, and saved them as a shapefile, which I read in here. So this is my data, and you will see now I have line strings in my geometry column. So now I can quickly plot some streets, and I can also give coloring based on one of my attributes, which is also something you might want to do. So here, the coloring means the number of available bikes; over here, something else for the districts. So, some basic operations, some visualization. And of course, GeoPandas also has the ability to do some more advanced operations like spatial joins, overlays, and things like that.
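The vectorized element-wise operations mentioned above can be sketched as follows. All geometries here are made up for illustration (in the talk the Notre-Dame coordinates come from a geopy lookup):

```python
import geopandas
from shapely.geometry import Point, Polygon

# A single target point (hypothetical coordinates near Notre-Dame)
notre_dame = Point(2.3499, 48.8530)

stations = geopandas.GeoSeries([Point(2.34, 48.85), Point(2.40, 48.88)])
districts = geopandas.GeoSeries([
    Polygon([(2.3, 48.8), (2.4, 48.8), (2.4, 48.9), (2.3, 48.9)]),
    Polygon([(2.4, 48.8), (2.5, 48.8), (2.5, 48.9), (2.4, 48.9)]),
])

# Vectorized, element-wise operations against one geometry:
distances = stations.distance(notre_dame)   # one distance per station
mask = districts.contains(notre_dame)       # True/False per district
print(districts[mask])                      # the district holding the point
```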
So here is a small example of a join, where I want to join, for each station, the district it's located in. So I can join them, and now you will see that here at the end there is a new column added to my stations with the name of the district. And then I can, for example, group by that name and count the number of stations in each district. I'm not going into too much detail here, but then I can, for example, plot the number of stations of each district. OK, so that was a very short demo of what GeoPandas looks like. It gives you a lot of read/write ability, general manipulation of your data, all the geospatial predicates. I didn't show it, but you can also easily reproject your data, and then there's the visualization and the joins. But yeah, for me at least, one of the strengths of using Python with GeoPandas is that you can very interactively explore and analyze your data. There are also a lot of other libraries that I didn't mention here that work, or can work, with GeoPandas data frames, but I don't have time to go much deeper. So I hope I have shown a bit that it gives a nice, easy interface to work with your geospatial data. But is it also fast? Is it also somewhat performant to do that in Python? At the moment, GeoPandas can be rather slow if you have a lot of data. The demo that I showed was no problem at all, but of course those were very small data sets. And it's always a bit difficult to say what is slow and what is fast. You will see that for 100,000 points, simple distance operations each take more or less a second, and once you do that many times, which is typical in some analyses, it can quickly add up. But to give you some more of an idea, here is a comparison with PostGIS. I won't go into the details of the spatial join; it's an example from the Boundless tutorial. I have to say, I'm not a PostGIS expert.
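The spatial join plus group-by pattern from the demo can be sketched like this (tiny made-up stations and districts stand in for the Paris data; note that older GeoPandas versions spelled the `predicate` keyword `op`):

```python
import geopandas
from shapely.geometry import Point, Polygon

stations = geopandas.GeoDataFrame({
    "station": ["s1", "s2", "s3"],
    "geometry": [Point(0.5, 0.5), Point(1.5, 0.5), Point(0.2, 0.8)],
})
districts = geopandas.GeoDataFrame({
    "district_name": ["west", "east"],
    "geometry": [
        Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
        Polygon([(1, 0), (2, 0), (2, 1), (1, 1)]),
    ],
})

# Spatial join: for each station, find the district it falls within;
# this appends the district's columns to each matching station row
joined = geopandas.sjoin(stations, districts, predicate="within")

# Then an ordinary pandas group-by to count stations per district
counts = joined.groupby("district_name").size()
print(counts)
```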
So I don't know if it's an optimized query or if my system was optimized. But anyhow, it shows some interesting things, because you can see that PostGIS is a lot faster than GeoPandas, while it's actually based on the same library under the hood: both are using GEOS for the same things. So why is GeoPandas slower? It's because there's a lot of overhead in the way we use it. If you call a distance operation in GeoPandas, we loop over our objects, call Shapely, and Shapely then calls GEOS. So that creates some overhead. Luckily, there is a new version of GeoPandas in development which tries to reduce this overhead by only storing the pointers to the actual GEOS geometry objects and calling GEOS directly. But still, if you access a single point, you get a nice Shapely geometry object. So it will have exactly the same API, but much better performance and also memory use. Here are the results: what it was before, and now, in green, the new version. On very simple things, you get a 10 to 100 times speedup. You also see that for the bit more advanced ones, we're now more or less similar to PostGIS, which is actually also somewhat to be expected, since we're using the same library under the hood. You can read a lot more about those developments in those two blog posts. You can also very easily install it with conda; there are binaries available for the development version as well. So if you want to try it out, you're certainly welcome: it's still new, so some real-world usage to test it out is certainly welcome. So you see that the next version will also be fast. But is it also scalable? The Python ecosystem, based on NumPy and pandas, is a very strong ecosystem, also optimized and fast, as long as you're working in memory, on a single core. That's a typical limitation of the ecosystem in Python. But there have been some developments over the last years.
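To make the per-call overhead concrete, a rough micro-benchmark sketch (timings will vary by machine; this loops over Shapely objects one by one, which is essentially what the older GeoPandas did internally for every element-wise operation):

```python
import time
from shapely.geometry import Point

points = [Point(i, i) for i in range(100_000)]
target = Point(0, 0)

# Per-object Python loop: each .distance call crosses from Python into
# GEOS separately, and that boundary crossing is where the overhead lives
start = time.perf_counter()
distances = [p.distance(target) for p in points]
elapsed = time.perf_counter() - start
print(f"{elapsed:.2f} s for 100,000 distance calls")
```

Storing raw GEOS pointers and calling GEOS once over the whole array, as the development version does, removes that per-object round trip.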
And one of the more popular ones is Dask, which is a library that provides parallelism and distributed computing. It's written in pure Python, but it lets you work on larger-than-memory data sets, from just your laptop, using all its cores, to big machines with many cores, or a distributed system with thousands of cores. How does it do that? By just using the existing ecosystem: it uses pandas or NumPy, but with blocked algorithms, and it creates task graphs and has a scheduler to then execute those task graphs. So we can use Dask as well to parallelize our GeoPandas code. And there has been an experiment: there was a blog post last summer by Ravi Shekhar on the taxi data of New York, 120 million records of a person who somewhere took a taxi to another place. And what he wanted to do was, for each of those rows, a spatial join with the districts, the taxi zones, to know for each point in which taxi zone the trip started and ended. On his laptop, that took more than three hours with the version then. Matthew Rocklin, the person who collaborated with me on the new development, tried it out on his laptop, and with the new version and with parallelizing it, he was able to do it in eight minutes. This is based on dask-geopandas, which is a very experimental library to also parallelize the GeoPandas computations. So I will almost stop, so there is still some time for questions. But what I would like to show you, just to also give you an idea: this is the final plot they made, where for each taxi zone the color gives the number of trips that started in that zone. They did that for all the data of 2015, which is 120 million records. I did a small replication of it, but with only 10 million records, for one month. With the current version of GeoPandas, it took me about 20 minutes; with the new version, it decreased to two minutes. And I'm now going to run it in parallel on the four cores of my laptop.
So it created a kind of cluster. Of course, on my machine it's not a real cluster, because I only have one worker with four cores. But I want to show you: these are the different zones, there are 263 zones. The data set is 1.6 gigabytes, around 10 million rows. And what I will do is read in the data set, which is a CSV file, do a spatial join with the zones, and then do some calculations on that. Just to give you an idea: Dask will create a task graph that does everything in chunks, and this task graph can then be executed on a local laptop, but also on a distributed cluster. First, I'm going to open here the dashboard that Dask provides, so we can see what is going on. So now I will execute this, and hopefully, if I go to the status... yeah. Now you can see the progress of the calculations of the different tasks. You can see that it uses my four cores, and the blue blocks here are the spatial joins. So you can see that it's actually doing spatial joins on parts, subsets of the data, on all four cores in parallel on my laptop. And this can very easily be scaled to a big machine with many cores or even a distributed cluster. And normally it should take, in this case, around one minute. So on my laptop, with my four cores, it went from two minutes to only one minute. So it's not a huge speedup, but if you have many cores, it could certainly be more helpful. Well, that was the last demo. So thank you for your attention. A little bit of time for questions, I think. Yeah. [Audience question about the built-in plotting.] Yeah, so at the moment the built-in .plot method is based on matplotlib. But, for example, the plot in the blog post that I quickly showed was actually a Bokeh plot. So you can use Leaflet or Bokeh, but they are not built into GeoPandas itself. It could be added, and it's also certainly possible to build on them. [Audience question, partly inaudible, about whether it would be possible]
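The "kind of cluster" from the demo can be sketched with Dask's distributed client (here run in-process with `processes=False` so the sketch is self-contained; the talk uses one worker with four cores):

```python
from dask.distributed import Client

# In-process scheduler mimicking the talk's single-machine setup;
# any computation submitted now shows up live in the dashboard,
# whose URL is available as client.dashboard_link
client = Client(processes=False, n_workers=1, threads_per_worker=4)

# A trivial task, just to show work flowing through the scheduler
future = client.submit(sum, [1, 2, 3])
result = future.result()
print(result)  # 6

client.close()
```

The same `Client` can instead point at a real distributed scheduler, which is how the local demo scales out to a cluster unchanged.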
to do the partitioning spatially.] Yeah, that should be possible. One of the things that I didn't mention is that here the data was just partitioned in chunks, but in principle it could also be partitioned by zone, and certain operations can be more efficient if the data is partitioned spatially. But that's not there yet. [Audience question about raster data.] No, at the moment, to combine it with raster data, there is nothing built into GeoPandas itself. Yeah, sorry. [Audience question.] Yeah, so the question was: if you're handling 100 million points, is there a spatial index? In the current version of GeoPandas, we use rtree, a Python package which is based on libspatialindex, and which is used in the spatial join. In the new version, the Cythonized version, we use the STRtree of GEOS to improve the spatial joins.
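The spatial index mentioned in that last answer is exposed on a GeoSeries as `.sindex`; a small sketch of querying it directly (made-up geometries; in current GeoPandas releases this index is backed by an STRtree):

```python
import geopandas
from shapely.geometry import Point, Polygon

points = geopandas.GeoSeries([Point(0.5, 0.5), Point(5, 5)])

# Query the spatial index with a polygon: it returns the integer positions
# of the points whose bounding boxes intersect it, which is the candidate
# set that a spatial join then refines with exact predicates
box = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
candidates = points.sindex.query(box)
print(candidates)
```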