Okay, then I have to tell you a little bit about how it works. You have an application, you choose a restaurant, you choose a meal, you click on it, you slide to confirm, and then magic happens: in 30 to 40 minutes your food magically arrives. Did you ever use Wolt for food ordering? Did you ever order food with it? Okay, then just sit and listen. So, your lunch magically arrives at your door. And what happens internally? Some unicorns start jumping over our servers and try to find the best way to deliver your food. They have to analyze a lot of data: where are the courier partners, how do they move, what are their possible routes, where are the restaurants, where are the customers, what is the list of orders? A lot of information, and we have to find the best option. I'm not going to define "the best one", it's a company secret, sorry about that.

But as I mentioned, we are thinking about routes, about movement through cities. And when you move through a city you actually use a road graph, or, to put it more simply, you use a map. And internally we use OpenStreetMap. So my talk will be half OpenStreetMap and half Spark.

For those who came just for Spark, a short introduction to OpenStreetMap. OpenStreetMap is something like Wikipedia, but for maps. It's not Wikimapia, because Wikimapia is a collection of labels: this is a table, and this is a public toilet, and this is a castle, and this is a secret nuclear war base, or something like that. You can't really use that. OpenStreetMap is completely different: it's a database of all the features on planet Earth, everything somehow connected to the Earth, so it can even be below the surface. I'm dreaming that I will live long enough for somebody to start Open-insert-your-planet-here-Map, but right now it's just OpenStreetMap. And the problem is, the Earth is big. I traveled here from Helsinki and it took maybe six or seven hours to fly to Prague and then take a train, and that's just one distance within Europe. The Earth is huge, and therefore the database of all the features on the planet is huge. I checked it last week: the OpenStreetMap database was roughly 1.2 terabytes. That sounds like a good case for Apache Spark.

For those who came from the OpenStreetMap side, now it's your time to learn what Apache Spark is. The official definition is very nice: an open source, distributed, general purpose, cluster computing framework. An amazing pile of buzzwords. But what does it do? Why do we need that stuff? The idea of Spark is simple. Say you have some big data set and you would like to do something with it, but unfortunately it's too big: you can't load it into the RAM of your laptop, you can't even lay it down on the drive of your laptop, and you can't even go to Amazon and order a single host big enough for it. Problem: you can't analyze your data set. But not with Spark. With Spark you can partition your data set and lay the parts out on several nodes, making your compute cluster virtually endless: you can attach as many nodes as you want, as much RAM and as many CPU cores as you can pay Amazon for, for example. And it's not that expensive, I think. And that means you can keep all of your data right in RAM and process it right in RAM — huge data sets, right in RAM.
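To make that idea a bit more concrete, here is a minimal sketch of the pattern in Spark. The bucket path and partition count are made up for illustration: you point Spark at a data set far bigger than any single machine, ask it to split the data into partitions, and cache those partitions in the executors' memory.

```scala
import org.apache.spark.sql.SparkSession

object PartitionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partition-sketch")
      .getOrCreate()

    // A data set far too big for one machine: Spark splits it into partitions
    // and spreads them over the executors' RAM.
    val df = spark.read
      .option("header", "true")
      .csv("s3a://some-bucket/huge-dataset/*.csv") // hypothetical path
      .repartition(200)                            // spread across the cluster
      .cache()                                     // keep it in memory

    println(df.count())                            // processed fully in RAM
    spark.stop()
  }
}
```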
You can't easily do that without Spark, but with Spark you can do it right now. And the planet is big. So what can we do? We don't have too many options.

The first option, the simplest one: you take the OSM data and import it. (Oh, I'm already on that slide — I need to insert a couple of jokes right here.) So you take your OSM data and import it into a PostGIS database using Osmosis or other existing tools, and then you can access that data right from Spark using the JDBC connector. As simple as possible. The good part: everything is already here. There is a converter for OSM data, there is a database for it, there is a JDBC connector — everything is right here. And as you are using a real database like PostGIS, you can use whatever you want: you can import it into Mongo, you can import it into Elasticsearch if you're crazy enough. But still, it's a real database. You can make a query and say: dear Spark, could you please load some geometry from the database limited by that boundary? Or could you please load the geometry within 100 km of that point? Or could you please load all the geometry with some tag? And the database will happily filter it for you. Problem solved, the talk is over, I'm going home. Oh, no, no, sorry. The problem is that you have to maintain all that stuff. You have to install your database and you have to keep it in sync with OpenStreetMap. Just loading the OSM data into PostGIS usually takes several hours — every time the planet file is updated (and it's updated every Saturday) you spend several hours — and you need several terabytes, like 2 to 2.5 terabytes, of disk space for the PostGIS database, depending on your schema and indices. So: slow, inefficient, but working.

Another option — that was the second one I tried. There are some tools for Spark, like Magellan or GeoSpark, and they are able to read predefined formats: well-known binary, well-known text, GeoJSON, shapefiles, whatever. And you obviously can convert your OSM data into some such format and load it directly into Spark. The good part: again, everything is already here, you don't need to reinvent the wheel. The problem: it's even slower, and in the worst case you have to import the data into a PostGIS database and then export it back out. So it's like option number one, but slower and harder. And there is no filtering on load, because you are just reading flat files.

And the third approach: what if you try to load your OSM data directly into Spark? Probably you will not need to convert it, so you save time there. Probably you can try to filter it on load, at least on entity type or on tags or something like that. But no one had done it — that was the main problem. No one had done it, so I had to start working on it myself. And when I started working on it, I realized there is a problem with doing that, and the problem is the OSM data itself.

When I say OSM database, OpenStreetMap database, it's not a real database. Most of the time when you have OpenStreetMap data, it's just a huge file, a plain file on your disk. And this file consists of three types of entities. The first one — let me try; yes, it works — is the node. The node has coordinates, latitude and longitude, and an ID. The ID is not important right now, but the coordinates are: the node is the only OSM entity that carries geometry by itself. The second is the way, and the way is defined as a sequence of nodes. But if you think of a sequence like in a programming language — you touch the way, it expands into nodes, and you can reconstruct the geometry from it — you can't. A way is just an array of node identifiers. So if a way is defined as 1, 2, 5, you have to go to your list of nodes, find nodes 1, 2 and 5, extract the geometry from them and propagate it back to the way. You are probably starting to understand what the problem is, but there is more: we also have the relation. A relation is a collection of ways, nodes and other relations, and it also uses identifiers to refer to other objects. Relations are used for everything else: polygons, multi-points, multi-polygons, whatever you have — if you can't make it a node or a way, make it a relation. But if you have such a relation, you have to go back to the other relations, back to the ways, back to the nodes, and only then can you finally reconstruct your geometry.
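As a rough mental model — these are not the classes of any library mentioned here, just an illustration of the referencing scheme — only nodes carry coordinates, while ways and relations carry only identifiers, so reconstructing geometry always means a lookup:

```scala
// A rough mental model of the three OSM entity types, not the library's classes:
// ways and relations don't carry coordinates, only identifiers of other entities.
final case class Node(id: Long, lat: Double, lon: Double, tags: Map[String, String])

final case class Way(id: Long, nodeIds: Seq[Long], tags: Map[String, String])

sealed trait MemberType
case object NodeMember extends MemberType
case object WayMember extends MemberType
case object RelationMember extends MemberType

final case class Member(id: Long, memberType: MemberType, role: String)

final case class Relation(id: Long, members: Seq[Member], tags: Map[String, String])

// To get a way's geometry you must look every nodeId up in the full node set:
def wayGeometry(way: Way, nodes: Map[Long, Node]): Seq[(Double, Double)] =
  way.nodeIds.flatMap(nodes.get).map(n => (n.lat, n.lon))
```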
And the file is huge: compressed it's roughly 50 gigabytes, but unpacked it's 1.2 terabytes of data, and you can't fit that in RAM. Because of that, there is a convention that nodes are usually stored in front of ways, and ways in front of relations. So if you read the file sequentially, by the time you hit the first way you have probably seen all the nodes, and you have probably cached them somehow, or processed them, or at least indexed them — at least you have a chance to reconstruct the geometry of the way. The same applies to relations, except for the relation hierarchy: a relation may refer to another relation, and that one can appear before the original one or after it — you never know. That part is not a big problem; the biggest problem is the sequential access to the file. You have to read it in order, and when I hear "you have to do something in order", it means single-threaded processing, just a single process. Now imagine you have a Spark cluster of 12 nodes with 10 cores each: 120 cores. Doing what? Waiting for a single core reading 1.2 terabytes of data. I hate that. It's a waste of resources and it takes a lot of time.

But wait a minute — with Spark we can keep the whole data set, the whole planet, in memory. That means we don't need to read it sequentially. We can start reading from here, and from here, and from here: read it randomly, put it into RAM and then process it. And solution number one is a parallel PBF reader. PBF is one of the transport formats of OpenStreetMap, and yes, PBF stands for Protocol Buffers by Google, but a PBF file is not just a Protocol Buffers file: it's actually a meta-structure over it, consisting of several blobs in Protocol Buffers format. You can google it or find it on the OSM wiki; it's out of scope for this talk. The good thing is, I have a reader, written in Java 8, so you can use it in your legacy applications and even in Spark. It's available on GitHub under GPLv3, it's available on Maven Central, and the API is quite simple. Well, actually, that's essentially all the API of the reader — you just specify callbacks. If you see a node, call this function; if you see a way, call this function; if you see a bounding box, call this function. If you don't need ways, for example, don't register a callback for them and they will be automatically skipped.
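Here is a sketch of what driving that reader looks like, anticipating the "count the fixme tags" test that comes up in a moment. The class and callback names are written from memory and may differ slightly, and I'm assuming the node objects expose their tags as a plain Java map; the project's README on GitHub is the authoritative reference.

```scala
import java.io.FileInputStream
import java.util.concurrent.atomic.AtomicLong

import com.wolt.osm.parallelpbf.ParallelBinaryParser // assumed package/class name

object CountFixmeNodes {
  def main(args: Array[String]): Unit = {
    val fixmeNodes = new AtomicLong(0)
    val input = new FileInputStream("czech-republic-latest.osm.pbf")
    val threads = Runtime.getRuntime.availableProcessors()

    val parser = new ParallelBinaryParser(input, threads)
    // Only a node callback is registered, so ways and relations are skipped.
    parser.onNode { node =>
      if (node.getTags.containsKey("fixme")) fixmeNodes.incrementAndGet()
    }
    parser.parse()

    println(s"Nodes with a fixme tag: ${fixmeNodes.get()}")
  }
}
```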
Skipping unneeded entity types makes it even faster. There are just two parameters: the input, which is an input stream — so it doesn't have to be a file, any kind of input stream, we don't care — and the number of threads. Probably it should be something like the number of your cores, or twice the number of your cores; I checked both, no real difference, it's mostly limited by the underlying storage.

So we have this wonderful thing. Is it fast? Oh, yes. I was thinking about how to verify it, what to compare against, and I decided to do a really synthetic test: count all the entities with a fixme tag, per entity type. You can easily do it sequentially, you can do it in parallel, doesn't matter. On the left side: the Osmosis library, a single-threaded reader written in Java. On the right side: the wonderful parallel Java reader. Same host, an Amazon c5.9xlarge instance, 36 cores, with a local NVMe SSD attached — so the measurements are not influenced by Amazon's infrastructure, it's a local disk. For something small like the Czech Republic: 34 seconds single-threaded, 11 seconds with multiple threads. Not that interesting, it's just a 20-second difference. But if you would like to read the whole planet, it's about 45 minutes in one thread and a little less than 15 minutes with 36 threads. More threads, faster reading; if you can buy all 72 cores, it will be even faster. That's 30 minutes of your life saved. Sounds good, it works, it's faster.

Now let's try to drop that beast into Spark. Oh no, not again — we have another problem. Yes, we can drop parallel PBF right into Spark, load the OSM file into some in-memory structure, say an array, and then convert that array into a data frame. But it means that just a single host of our Spark cluster will be processing the file. And it means we first have to load all the data into its RAM — but if the data fits into the RAM of one host, why do we need Spark at all? And even if we are crazy and stupid enough to buy a cluster of big hosts where each host can fit the whole thing in RAM, and for some reason we do that and then create the data frame, Spark will happily start distributing and shuffling the data, copying it from that single host to the other hosts, for no good reason. Sounds crazy. Why? Why are you doing that? That's not how it should be done.
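Spelled out in code, the approach being criticised here looks roughly like this — a sketch, not anyone's actual code: everything is parsed into a local collection on the driver, and only then handed to Spark, which promptly redistributes data that one machine already had to hold in full.

```scala
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.SparkSession

// The anti-pattern: parse the whole planet on the driver, then hand it to Spark.
object DriverSideLoadAntiPattern {
  final case class OsmNode(id: Long, lat: Double, lon: Double)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dont-do-this").getOrCreate()
    import spark.implicits._

    val nodes = ArrayBuffer.empty[OsmNode]
    // ... drive the parallel PBF reader here, appending every node ...
    // (details omitted; the point is that `nodes` must fit into driver RAM)

    val df = nodes.toSeq.toDF() // a data frame built from a local collection
    df.count()                  // Spark now shuffles out data one host already held
    spark.stop()
  }
}
```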
What I would actually like to have is all the hosts in my cluster, all the executors, reading the same file in parallel — each reading its own part of the file and keeping it in local RAM, not shuffling it between hosts, not waiting for anybody else, just reading, all of them, in parallel. So I had to write a new thing: a Spark OSM data source. It's a native Spark data source and it's built on top of parallel PBF. It's compiled for Scala 2.11 and 2.12; it will be compiled for 2.13 when Spark supports 2.13 — not yet, and even 2.12 support is not that great right now, so 2.11 is the safer way. It's available under GPLv3, on GitHub and Maven, you can go and try it. The difference between 0.2 and 0.3 is just the package naming — I moved it from my local repository to a worldwide one, not much more. It supports partitioning; that was the main goal. So right now you just say: dear Spark, could you please load that map file — and all the nodes start loading the file, each keeping its own part of the data locally. Regarding partitioning, I was wondering whether I should decide it automatically or not; you have to specify it yourself, and I'll show that on the next slide.

Another thing it supports, and it is really interesting: you can ask Spark to distribute your file to the executors before starting your stages. If you are going to read the file several times — and that may happen if you have too many partitions, for example — it saves time on re-fetching it from remote storage like S3 or HTTP.

And the best part. When I was preparing this slide, I was thinking I should just stop right here and close with it. The same test, exactly the same test: count all the fixme tags per entity type. How long does it take? It takes two and a half minutes for the whole planet. We started at 45 minutes; same result, two and a half minutes — 20 times faster. It works. But you will probably think: okay, it's 20 times faster, but of course it is, you have 20 nodes, it must be pretty expensive, no? Yes, each of those nodes is pretty expensive on Amazon, and if you keep that cluster running for several hours, you'd better not. I think it would cost maybe 600 euros for one hour of running that cluster. But for two minutes? If I remember correctly, I paid about seven euros for this one run. You can afford it, it's not too much.

Okay, so how do you use it? It's pretty simple; it's a normal Spark data source. You set the options and you point it at the file. The only thing I was concerned about was partitioning: you have to specify the number of partitions manually. There were several options for that, and the initial version was a little bit different. You may try to estimate the file and the size of your cluster and partition automatically — like, I have 10 nodes and one terabyte in total, so probably I need 20 partitions, or 10, I have no idea, it depends on the file. Or the second option: just hard-code some number — you know, Spark has a default of 200 partitions, so okay, let it be 200. And then I thought: why should this be my problem? Let it be the problem of the data engineer who starts the job. That person knows how much data they have, how big the cluster is, how much RAM, how many executors and cores and so on. So let the user specify how many partitions to use. You also have to specify how many threads parallel PBF should use, which is basically the number of cores per executor. And that's all: just go and start using it. It's built on Spark, so you can use any source of files — S3, HDFS, local files. And use the local file option to make it compatible with Spark's addFile feature. So just go and use it.
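As a minimal sketch of such a load — the format and option names follow what was just described (partitions, threads, a local-file mode), but they are written from memory, so check the project's README for the exact spelling:

```scala
import org.apache.spark.sql.SparkSession

object LoadOsmSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("load-osm-sketch").getOrCreate()

    // Option and format names are assumptions based on the talk, not verified.
    val osm = spark.read
      .format("osm.pbf")                 // assumed short name of the data source
      .option("partitions", "200")       // you decide how to split the file
      .option("threads", "4")            // parallel PBF threads per executor
      .option("useLocalFile", "true")    // assumed flag for the addFile mode
      .load("s3a://some-bucket/planet-latest.osm.pbf")

    println(osm.count())
    spark.stop()
  }
}
```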
But — and the "but" is the next slide — there is the data source schema. I almost forgot about this one; it's a really uninteresting slide. The schema is almost the same as the API schema of OSM. I was wondering whether I should split it into several data frames — one for nodes, one for ways, one for relations — or keep it all together, and I thought: okay, I'm working with OSM data here, not with geometry data, so let it be one OSM data frame. It contains a common part — ID, tag, info and type — and a variable part: the way definition, the relation definition, and the coordinates. ID is the OSM object ID, type is the OSM entity type, and tag and info are the OSM tags and info — who created the object, the changeset ID, and so on. For a way, only the way column is filled and the other variable columns are null; for a node, only the longitude and latitude columns are filled and the rest are null. The good thing is that if you don't need some column and you drop, say, the way column, those entities are skipped: it will not even try to load the ways, it's smart enough to understand that. But filtering is really, really basic here, and geometry is not reconstructed. You are reading just OSM data — it's not geometry, it's pure OSM data. And because it's pure OSM data, it can be hard to work with. Well, it depends on what you want to do. If you are just going to analyze some tags (you've finished the photo, right? okay), you can work with that data frame directly: you filter on tags and analyze, nothing else needed — there is a small sketch of that after this paragraph. But if you would like to do something more interesting, I realized I needed a lot of helpers.

So, a third library: just a set of helpers for processing OSM data. Different license — right now it's Apache 2.0. I would like to make everything Apache or MIT licensed, but the PBF part relies on existing GPL'd code, so that part is GPL and everything built on it stays GPL; this one, though, is a completely independent thing. Same Scala, same availability on GitHub. Unfortunately it's not on Maven Central yet, because it's a work in progress — it's not even half-baked, it's completely unbaked, probably not even a paste yet. There are some simple procedures: you can merge two data sets, and you can limit or extract by some boundary, a bounding box or a more complex polygon. You can work with the relation hierarchy: find the children of a relation, find the parents of a relation, find the children of all my parents, and so on. There is conversion from ways to geometry, if you would like to work with geometry, and conversion of multipolygon relations to geometry — but that one is really basic right now, it doesn't do the proper ordering of the polygon rings and sometimes it fails; sorry, work in progress. You can export back to an OSM file format — not a really useful thing, I mostly use it for debugging. And the funny part is a renderer driven by the Spark query language. Yes, I'm rendering maps right from Spark.
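The "filter on tags and analyze" case mentioned above, as a small sketch: it assumes the tag column loads as a map from string to string, which is what I'd expect, but check the actual schema with printSchema.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Count the entities carrying a fixme tag, per entity type, working directly
// on the data frame produced by the OSM data source (no geometry involved).
def countFixme(osm: DataFrame): DataFrame =
  osm
    .filter(col("tag").getItem("fixme").isNotNull)
    .groupBy(col("type"))
    .count()
```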
One more thing that Spark makes easy is extraction. Extracting some data is a typical problem: you have the whole planet, but you would like to work on, say, Brno, so you have to extract just Brno by polygon. Typically the tools give you just three options, and the problem is, again, the nodes. If you would like to extract data by geometry, well, the geometry lives only in the nodes. So first you limit your data set to the nodes, but then you have to find your ways and relations. The simplest approach: you filter the nodes — okay, now you know which nodes are included — and then you include the ways that reference those nodes. But those ways will be incomplete. You see, for example, that green one: it goes outside the box, and with the simple approach you only get part of it.

So you have a problem analyzing it, and a lot of tools — say, the osmium tool — provide you with complete ways and relations. When you read the file a second time you know: okay, I included those ways, I included those relations, I need to read those extra points and make the ways complete. The same for relations, just two passes. But now we have the relations in RAM, so we can build the hierarchy of relations. You have all the stuff in your RAM, so you can make the references in your relations complete: you can fill your relations with their child relations, even things completely outside the box, and your data will be reference-complete. And there is one more option you can build — I don't have a picture for the fifth option, I tried to draw it and it looked pretty bad — you can try to find the parents of included relations, relations that aren't even referenced by anything inside your area of interest, and include them too. For example, if I look for the parents of Brno, it will include the South Moravian region boundary, the Czech Republic boundary, the European boundaries, all the boundaries of the other European countries and so on. It's a pretty broad definition of extraction by area, but technically you can do it. Completely useless stuff — most of it was written just for fun.

And now the fun part. Let's try to do something useful: let's calculate public transport coverage. How do I define it? Take all the residential buildings, take all the public transport platforms, and find the distance from each building to the nearest platform. And I don't like tables, so let's colorize the buildings. Some boring part comes first: the code. First of all, load the data — I'm loading the Czech Republic extract provided by Geofabrik. The second thing is something you should never be doing: I'm extracting a single row of the whole data frame. And if you came from the OSM side: with Spark you don't have indices, so if you want to extract a single row, you have to visit all the rows. Unfortunate, but it was the simplest thing to do here, just for the example; usually you don't need it. You can define your polygon however you like, but I'm just finding the official border of Brno by a reference code provided by the Czech government. Platform locations: that's just a simple Spark query, you filter on values; nothing OSM-specific, forget your OSM, you are now in safe Spark land. Then get some geometry — the Czech government has a very nice thing called RÚIAN, and almost every building in the Czech OSM data has a RÚIAN type tag, so you can easily filter on residential buildings. Now the interesting part: you have the way definitions and you can convert ways to geometry. Once you have geometry, the analysis itself is really simple — it's not OSM-related and it's barely even Spark, it's Spark-for-kids level. It's not even the mean point of the building polygon, just the mean point of the bounding box of the building polygon, but it's good enough for us. Then find the distances, and a bit of magic happens here: a function converting distance to render parameters. I'm not showing that function on the slide, but you can find it on GitHub — most of it is just a conversion from distance to RGB space. Then the interesting bit: set the parameters, like the minimal zoom, render with a polygon symbolizer, send it to the render pipeline, and it writes the output to a local file.
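Pulling those steps together, here is a heavily simplified sketch of the same idea in plain Spark. It skips the boundary extraction and the way-to-geometry conversion (buildings are approximated by tagged nodes here, which is not what the real pipeline does), uses a brute-force cross join for the nearest platform, and assumes a map-typed tag column; the column and format names are assumptions.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions._

object CoverageSketch {
  // Haversine distance in metres between two latitude/longitude pairs.
  def haversineMeters(lat1: Column, lon1: Column, lat2: Column, lon2: Column): Column = {
    val dLat = radians(lat2 - lat1)
    val dLon = radians(lon2 - lon1)
    val a = pow(sin(dLat / 2), 2) +
      cos(radians(lat1)) * cos(radians(lat2)) * pow(sin(dLon / 2), 2)
    lit(6371000.0) * lit(2.0) * asin(sqrt(a))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pt-coverage-sketch").getOrCreate()

    val osm = spark.read
      .format("osm.pbf")                 // assumed format name, see earlier sketch
      .option("partitions", "40")
      .option("threads", "4")
      .load("czech-republic-latest.osm.pbf")
      .cache()

    // Public transport platforms mapped as nodes.
    val platforms = osm
      .filter(col("lat").isNotNull &&
        col("tag").getItem("public_transport") === "platform")
      .select(col("lat").as("pLat"), col("lon").as("pLon"))

    // Simplification: take buildings mapped as tagged nodes. The real pipeline
    // resolves way geometry and uses the bounding-box midpoint instead.
    val buildings = osm
      .filter(col("lat").isNotNull && col("tag").getItem("building").isNotNull)
      .select(col("id"), col("lat").as("bLat"), col("lon").as("bLon"))

    // Brute-force nearest platform: cross join, then minimum distance per building.
    val coverage = buildings.crossJoin(platforms)
      .withColumn("dist", haversineMeters(col("bLat"), col("bLon"), col("pLat"), col("pLon")))
      .groupBy(col("id"))
      .agg(min(col("dist")).as("metersToNearestPlatform"))

    coverage.show(20)
    spark.stop()
  }
}
```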
The result looks like this. No, no — my part is just the green and blue buildings; the underlying layer is OpenStreetMap data from the OpenStreetMap site. You can see that Brno is in quite good shape regarding public transport, everything is green here. This is a close-up of the university, which should be somewhere around here, I think — you see green and blue. Green means it's less than 100 meters to the nearest public transport stop, and then it goes from blue towards red in steps of 100 meters, if I remember correctly. So you can easily analyze it. And as I'm rendering just the polygons, you can see they overlap the roads — but yes, this map is rendered with Spark. And if I were doing a real analysis of public transport, I would be able to easily spot the problematic areas: those people have to walk 2.5 kilometers to the nearest public transport stop. And yes, everything is done with the stuff I showed you.

But the stuff is not finished yet. The first thing I plan to do is write support for parallel PBF. It will produce unordered files, because I'll be writing from parallel threads, so it may happen that ways end up in front of anything — other ways, nodes, relations, who knows. But since I can read it back into memory anyway, it's not a big deal. And it's already started: I had seven hours traveling from Helsinki to Brno, so I wrote half of it, and maybe on my way back I'll finish it and publish it. For the Spark OSM data source, I need to finally implement pushing the Spark filters down to the data source — for tags, for example — so you won't be reading the stuff you don't need. And probably geometry conversion on load, but I'm still undecided: if you don't need geometry, it would just slow down the loading and processing. I'm still thinking about it; just leave me a note if you have an opinion on that. And the Spark OSM tools need to become more useful: relations need to be handled properly, probably backed by GraphX, because everyone uses GraphX for graphs; node geometry definitely should be there. If you need to find some geometry and convert it, you need functions to convert from OSM to geometry and back, and the geometry should probably be convertible to, say, well-known binary, so GeoSpark and Magellan can interoperate with this stuff and vice versa. So, long-standing plans; probably I will finish some of them, probably not, I don't know.

Here are some beautiful links. Go to GitHub and see all the source code — all the examples and all the renderers I mentioned are on GitHub, in different repositories, but you can go and take a look at them. You can write me an email if you like, you can visit my GitHub, and obviously you can — I will wait for everyone to finish the photos, it's okay, don't worry — and you should probably go and order your lunch today on Wolt. Thank you, now it's time for your questions.

Yeah, sure, go on. I think it's doable, but I haven't tried it yet — that's a great idea. For the recording: the question was that we obviously have parent-child relations between nodes, ways and relations, Spark is good at representing graphs, so can we represent this data as a graph? My answer was: I haven't tried it yet, but it should be doable. And I had a question back: what would be the use case? Maybe if you convert the data into a graph representation you can build some library on top of it, I don't know. And the answer — in two words, sorry for cutting it short — was: you should try it and see how it goes. I think it's worth trying.
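Since the question came up: a purely illustrative sketch of what turning the way-to-node references into a GraphX graph could look like. Nobody has tried this yet; column names follow the schema discussed earlier (id, way as an array of node IDs), and note that OSM IDs are only unique per entity type, so a real implementation would have to remap them first.

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{DataFrame, SparkSession}

// Purely illustrative: turn way -> node references into a GraphX graph.
// OSM ids are only unique per entity type, so a real version would remap them.
def referenceGraph(osm: DataFrame, spark: SparkSession): Graph[Long, String] = {
  import spark.implicits._

  val edges = osm
    .filter(col("way").isNotNull)                        // keep only ways
    .select(col("id"), explode(col("way")).as("nodeId")) // one row per reference
    .as[(Long, Long)]
    .rdd
    .map { case (wayId, nodeId) => Edge(wayId, nodeId, "contains") }

  val vertices = osm.select(col("id")).as[Long].rdd.map(id => (id, id))
  Graph(vertices, edges)
}
```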
The background of this project is mostly fun. In real life, if you would like to render stuff, just go grab something like Mapnik, compile it and render. If you would like to extract some data, just go for osmium and extract it. This one is mostly for fun, and the single real use case is when you need to do some ad hoc analysis of some data: you don't want to install all that stuff, you don't want to configure your PostGIS and import everything — you can just go, load it, calculate quickly and drop it.

Okay, so the question was: I mentioned that the data set consists of a bunch of nodes, a bunch of ways and a bunch of relations, but I also mentioned that I'm reading them in parallel, in a partitioned way, on several hosts at the same time — so how do I do that? That's a good question. The thing is, parallel PBF has one more parameter. It's documented, but it's supposed to be used with caution: you can specify the number of partitions and your slice number. The binary OSM PBF format consists of several blocks, and each block contains up to roughly 16 megabytes of data; if you have more data, you start writing the next block. Each block may contain only a single type of entity, so a block holds only nodes, or ways, or relations, or changesets (changesets are not covered here, because probably nobody needs them). So what am I doing? I know which blocks are assigned to each partition. If you are partition number one out of ten partitions, then blocks number 1, 11, 21, 31 and so on are yours; for the second partition it's 2, 12, 22 and so on. And while you read the file you know the sizes of those blocks, so you read, say, block number one and then skip the next ten until you hit your next one. If you are reading your data over HTTP, of course, skipping is a problem — you still have to receive the bytes. But if you are reading from a local file you can just seek, and S3 is smart enough not to send the whole data: with the S3A driver — or S3N, I don't remember, they have two drivers and the new one knows how to skip — it will skip to the next block automatically. If you are reading from plain HTTP, yeah, it won't be that smooth. That's how it's implemented internally.

We have ten more minutes for questions, if somebody is brave. Yeah? Did I try Databricks, or something else to run Spark? The honest answer is: I googled where I can run a Spark cluster cheaply, and Amazon was the first answer. Maybe I should try Databricks. So, that's probably all. Thank you for coming.