Hello, my name is Denis and I'm a software engineer at Wolt. I suspect no one here has tried Wolt, but Wolt is a leading European food tech startup, and for those who don't know it, it's simply a food delivery service. You open the app and think, okay, I'd like a burger from that restaurant, or a steak from another one. You tap on it, choose where to deliver it, slide to confirm, and in 20 to 40 minutes your food magically arrives at your door. Simple. On the back end, some magical unicorns start jumping over our servers, finding the best solution to your problem: picking the best courier partner, estimating how long your meal will take to prepare and how long it will take to deliver. A lot of those calculations involve maps, and yes, it's no secret: we use OpenStreetMap. You're right, it rocks.

The problem with OpenStreetMap is that it's a database of all the features of the planet, somehow connected (some features are even below the planet's surface, but still), and the planet is big. For example, it takes three hours to fly from Helsinki to Belgium, and that's still within Europe; flying to Japan takes 12 or 14 hours, I don't know. The planet is big, which means the database is big too. I checked last week: the uncompressed XML planet file is 1.2 terabytes, and you usually can't process 1.2 terabytes with ease on your laptop. You don't have enough RAM, and you probably don't even have enough disk space. So it's a perfect case for Apache Spark.

This is a geospatial audience, so you may not know what Apache Spark is. Apache Spark is an open-source, distributed, general-purpose cluster computing framework. Lots of buzzwords; in simple words, it's the magical thing (yes, I love the word "magic") that lets you process huge data sets, and it processes them in RAM, which makes it fast. Then you'll think: wait a minute, do you have a host with 1.2 terabytes of RAM? No, I don't. I'd love to have one, but I don't. How does Spark do it? Simple: you have many hosts. You don't have 1.2 terabytes in one host, but you have 100 hosts with 10 gigabytes each, giving you a terabyte in total, and Spark automatically partitions your data set across those hosts, loading a small part onto each one. So you have a lot of RAM overall, just not on any single server. And of course a server is not just RAM; it also has a CPU, usually with many cores, so you have plenty of cores attached to each partition, and you can run calculations on each partition in parallel, which makes processing faster.

Can we do this with OpenStreetMap? The first thing you can try, and I tried it, is to import your data into some database: PostGIS, MongoDB, Elasticsearch, whatever, it doesn't matter. Get it into a database, then use a Spark connector to load the data from that database, filtering it at the same time, because it's a database: you can ask it, please give me everything inside that polygon, or everything near that point, and the database will do that work for you. That's the good side of this approach.
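To make that first approach concrete, here is a rough sketch, not production code, of letting the database do the spatial filtering before Spark ever sees a row. The table and column names are hypothetical and assume the OSM extract was already imported into PostGIS with geometries stored in EPSG:4326:

```scala
import org.apache.spark.sql.SparkSession

object StopsFromPostgis {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("osm-from-postgis").getOrCreate()

    // Push the spatial predicate into PostGIS as a subquery, so only matching
    // rows are shipped to the Spark cluster. Requires the PostgreSQL JDBC
    // driver on the classpath; table and column names here are hypothetical.
    val stops = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/osm")
      .option("user", "osm")
      .option("password", "secret")
      .option("dbtable",
        """(SELECT osm_id, name, ST_AsText(geom) AS wkt
          |   FROM osm_points
          |  WHERE highway = 'bus_stop'
          |    AND ST_Within(geom, ST_MakeEnvelope(4.25, 50.80, 4.45, 50.92, 4326))
          |) AS stops""".stripMargin)
      .load()

    stops.show(10, truncate = false)
  }
}
```

Spark's JDBC source can also push simple column filters down on its own, but the spatial part has to live in the query itself, as above.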
The bad side: you have to import your data, which can take hours, and you need to allocate a lot of space just for the database. And every time the planet file is updated, you either re-import the whole thing or work with diffs and maintain your own diff update infrastructure. It's painful, especially if you have a one-shot problem.

Okay, let's try another way. There are ready-made tools for Spark, like Magellan or GeoSpark, and they can read things like shapefiles, GeoJSON, or files in well-known binary format, so you can convert your OpenStreetMap file into one of those formats and tell Magellan to load it. Will it work? Yes, it will; I tried it. Is it good? Well, it's already there, so you don't have to write a single line of code, but you do have to convert, and most conversion procedures require, guess what, a database. So it's even slower than the first approach: you're still loading a database and then exporting from it, and since you're no longer reading from the database, you lose all the filtering.

Third approach: what if Spark could load the OSM file directly? Would that solve the problem? Probably. Sorry? Yes, it's compressed, but we have a lot of cores, so let's decompress it. Can Spark load and decompress it directly? Probably yes, and there's no "but": it's just the simple way. You point Spark at a URL, like, dear Spark, please load the planet file from the OSM site, and that's all. You can even push filters down into Spark so it doesn't load the whole file, only part of it. The problem: it wasn't implemented, so I had to implement it.

The first obstacle was the OSM data model. I hope everyone here knows that OSM consists of nodes, ways, and relations, and only nodes carry geometry. Because of that, OSM files are conventionally sorted for you: nodes first, then ways, then relations, so if you read them sequentially, by the time you hit the first way you have probably already seen all the geometry you need to reconstruct it. Probably; let's hope you have. And because of that, most readers, actually all the readers I tried, are single-threaded. Now imagine you have a huge cluster, say 20 hosts with 16 cores each. I won't multiply that, but it's a lot of cores, and they are all waiting for a single core slowly crunching and decompressing your OSM file. A lot of resources are wasted.

Can we do better? Yes. Solution number one: ParallelPBF, a parallel PBF reader written in Java 8, so you can use it even in your legacy Java application. It supports all the current OSM features and options, and it's available under GPLv3, on GitHub and on Maven Central, so you can start using it right now. This is actually the whole API of the reader: you just specify what to do when you hit an entity. You see a node, call this function; you see a way, call that function. If you don't provide a function for ways, ways aren't read at all; they are skipped, which makes things faster.

For example, Belgium takes 11 seconds, which is not that much, but for the whole planet I compared the Osmosis built-in reader against ParallelPBF on the same host with a locally attached SSD, reading all the entities and checking each one for a specific tag: fixme, for no particular reason, don't ask me why, it was just an idea. 45 minutes on a single thread, less than 15 minutes with many threads. I just saved you half an hour of your life.
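Roughly, using the reader from Scala looks like the sketch below. Treat the class and method names (ParallelBinaryParser, onNode, parse) as assumptions about the API and check the parallelpbf README for the exact signatures:

```scala
import java.io.FileInputStream
import java.util.concurrent.atomic.AtomicLong
import com.wolt.osm.parallelpbf.ParallelBinaryParser

// Sketch of the callback-style API: count nodes carrying a "fixme" tag.
// Only an onNode callback is registered, so ways and relations are skipped.
object FixmeCount {
  def main(args: Array[String]): Unit = {
    val fixmes = new AtomicLong(0)
    val input  = new FileInputStream("belgium-latest.osm.pbf")

    new ParallelBinaryParser(input, 8)   // 8 worker threads (assumed constructor)
      .onNode { node =>
        if (Option(node.getTags).exists(_.containsKey("fixme"))) fixmes.incrementAndGet()
      }
      .parse()                            // assumed to block until the file is fully read

    println(s"fixme nodes: ${fixmes.get()}")
  }
}
```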
So let's take that ParallelPBF and drop it straight into Spark. Oh no, it's not working again. And you know why? If you just drop it in, you have to load all your data first. So now you have your 20 hosts, and a single host with all its cores crunches your data while all the other hosts wait. Even worse, you have to load all the data into RAM first, all 1.2 terabytes, and if you have that amount of RAM on a single host, you probably don't need Spark. But imagine you do have it, for some reason, and you load everything into RAM and then tell Spark, dear Spark, could you please create a DataFrame from that array? Guess what Spark will do: it will split it and slowly redistribute it over the other nodes, making the whole process even slower by transferring all your data over the network. Slowly. I tried it.

It would be better to have a native Spark OSM data source, and now we have it. It's built, of course, on top of ParallelPBF. It supports Scala 2.11 and 2.12, and I'll compile it for 2.13 when Spark is compiled for 2.13 one day. And yes, it's available under GPLv3, on GitHub and on Maven Central. The best thing is that it supports native partitioning. You just tell Spark, dear Spark, could you please load that file, and the file can be anywhere: a local file, an S3 file, a URL, whatever the Hadoop filesystem layer supports, because internally it's the Hadoop filesystem and Spark. At the same time you tell Spark how many partitions you'd like the file split into, depending on how many executors you have or how much RAM you have; that's up to you. Spark then fires up all the executors, and each executor, each node, starts reading its own part of your planet file at the same time. It doesn't have to be the planet file, any PBF file will do. In parallel, fast, and with no more shuffling, no transferring of data between Spark nodes.

And the best part: the same test, counting fixme tags over the whole planet, takes just two and a half minutes. We started at 45 minutes; now it's two and a half. You might think, okay, come on, pal, you just threw 720 cores at it and you're telling us it's fast. Of course it's fast, you have 720 cores, and that should be expensive. Nope, it's not that expensive. Running that kind of cluster on Amazon (I tried it on Amazon) will cost you maybe a few hundred euros per day, but for two and a half minutes it's about five euros, if I remember correctly. The coffee here costs more.
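To make that concrete, here is a sketch of loading a PBF through the data source and re-running the fixme count. The format name, the option keys, and the TAG column are assumptions on my part, so check the spark-osm-datasource README for the real ones:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("osm-fixme").getOrCreate()

// Load a PBF file straight into a DataFrame of nodes, ways and relations.
// Format name and option keys below are assumed, not the documented API.
val osm = spark.read
  .format("com.wolt.osm.spark.OsmSource")          // assumed data source name
  .option("partitions", "256")                     // how many pieces to split the file into (assumed option)
  .option("threads", "4")                          // reader threads per executor (assumed option)
  .load("s3a://my-bucket/planet-latest.osm.pbf")   // local path, S3, or anything Hadoop FS understands

// The benchmark from the talk: count entities carrying a "fixme" tag,
// assuming tags land in a map-typed column called TAG.
val fixmes = osm.where(col("TAG").getItem("fixme").isNotNull).count()
println(s"fixme entities: $fixmes")
```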
Now you have your DataFrame in RAM, but it's an OSM DataFrame: no geometry, just nodes, ways and relations, which are hard to work with. And now comes the third part: a collection of Spark snippets to help you work with OSM data. It really is work in progress, so it's heavily unstable; I'm updating it every day and it's mostly driven by my own needs, sorry. Same story: Scala 2.11 and 2.12, and it will support 2.13 when Spark does. It's available on GitHub, but not yet on Maven Central, because it's really under development right now. It has some simple procedures: limit my DataFrame to some polygon, find the geometry for some ways, resolve some multipolygons, export to a SpatiaLite database (actually I used that just for debugging, it's not that useful). I even made a simple renderer in Spark SQL. That's funny.

Well, enough talking; a short example. The example part was the hardest one: I was wondering how to show it, and I thought, okay, probably everyone here uses public transportation, at least to get here, and public transportation can be a problem: wherever you go, you don't want to walk too far to your stop. So let's calculate public transport coverage by finding all the buildings and all the platforms, finding the nearest platform for each building, and then colorizing the buildings: blue means the platform is right here, red means the nearest platform is several kilometers away.

Here is the whole code for all of that. You load the map, and then do the thing you should never do: I'm looking up a single entry of a DataFrame, and Spark is quite bad at finding a single entry; it's much better at processing all the entries than a single one, but it's good enough for a demo. Then: find the platforms, get the geometry, calculate the distances, and mark the buildings. The only function skipped here is the one colorizing buildings by distance, because it's really just a huge table mapping distance to RGB values, which is not that interesting. That was the whole example.

And what do we get? I calculated it for Brazil, and when I got the result: wow, is it really that good? It actually looks much better on my laptop than up here, but I have a better picture for you, taken near my hotel. You see, almost everywhere you only have to walk a few hundred meters. And it was calculated with Spark on this one laptop; with this laptop it takes more than a couple of minutes, maybe 10 minutes, I don't remember, but still. It works; this proves it works.

That's not all; I'm going to keep improving this stuff. You can see the first item here is struck out: flying to Brussels takes three hours, as I told you, and I actually implemented it on the way. You can now write PBF files with ParallelPBF in a multi-threaded manner. Of course they will be unsorted, so you can process them with the typical tools, Osmium for example, or load them back into Spark if you like. For the data source: Spark is good at pushing filters down, things like filtering on some tag or on some type of entity, and it can push those down to the data source, so I need to support more predicates. Probably geometry conversion on load as well; I have to think about that, because it will slow the load down. For the tools, I need better support for all types of relations, or at least for all geometry-carrying relations, probably GraphX support for relations, and polygon and multipolygon handling. And definitely, I'm going to implement geometry operations, converting to well-known binary, so that GeoSpark and Magellan will be able to operate on the data directly, and vice versa.

These are the links you can visit; everything is on GitHub under Wolt, and you can contact me on GitHub or by mail. It's time to take a photo of this slide; I'll wait. Okay, thank you for attending, and now it's time for questions.

The first question. (Audience question, partly inaudible: shouldn't you convert the data to Parquet and benefit from everything Spark is good at?) So the question was: shouldn't I convert this OSM data into Parquet format and then use that? That's exactly what I do in real life: I use my data source to read the file into RAM, filter it, write it out as Parquet, and then process that.
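As a sketch of that read, filter, and write-to-Parquet workflow, reusing the assumed source name and TAG column from before:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("osm-to-parquet").getOrCreate()

// Read once through the (assumed) OSM data source, keep only what the analysis
// needs, and persist it as Parquet so later runs skip the PBF parsing entirely.
val osm = spark.read
  .format("com.wolt.osm.spark.OsmSource")   // assumed source name, as above
  .option("partitions", "64")
  .load("belgium-latest.osm.pbf")

osm
  .where(col("TAG").getItem("public_transport") === "platform")  // TAG assumed to be map<string,string>
  .write
  .mode("overwrite")
  .parquet("s3a://my-bucket/osm/platforms.parquet")

// Later jobs then just do: spark.read.parquet("s3a://my-bucket/osm/platforms.parquet")
```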
The next question was why I chose the GPL license. The real answer is that underneath I'm using, how is it called, the crosby PBF artifact, and it's GPL, so I had no choice. That function above is my own UDF that calculates the geometry. I have plans to implement conversion to geometry, so you'll be able to use, for example, Magellan or GeoSpark, and in the meantime you can write your own geometry UDFs using JTS functions, the JTS library. Does that answer the question? Okay, go on.

So the question was, in short: what is the use case? Why am I using Spark, and why am I comparing against Osmosis and all these things? The answer: this software is used internally at Wolt for different kinds of analysis; it's part of an analysis pipeline that runs on Spark, so I have to get all the data into Spark. That was the first reason. The second reason: it's not about speed. Yes, that sounds strange, it's not about speed. Of course I'd love to do everything in just a few minutes, but if you have a recurring task that you compute every few days, the fastest way would be to use Postgres, as you mentioned. But if you need to do some one-shot analysis, like "let me check this theory", you don't want to spend several hours configuring your platform, installing Postgres, downloading things, firing up Docker, configuring your EC2 host, and so on. You just load the file into Spark. That's the main reason: mostly one-shot things. And yes, it's not that fast. A specialized tool is much faster, but it covers a narrower domain of problems, while with Spark you can solve almost any problem; as you mentioned, you can do graph analysis, for example.

Okay, I'm going to cut you off, guys, there are people waiting outside. And we are just in time, so thank you very much. Thank you.