Let's parse the world over a cup of tea. Or more specifically, let's extract, filter, and analyze about 5 billion uploaded GPS points, about 3 billion nodes, and about 362 million ways in about 16 minutes on a very expensive machine like an Amazon EC2 instance, or in about 25 minutes on a less expensive but still expensive machine like a MacBook Pro. And to make this more enticing, we will be doing this with JavaScript, or more specifically with Node.js.

The world I know very intimately, the world that I work very closely with as a developer at Mapbox, is the world of OpenStreetMap, which is the largest living map in the world. OpenStreetMap, in all likelihood, contains the house that you grew up in, the parks that you played in, the train stations you got off at, you get the drift. But more importantly, OpenStreetMap gives you the freedom to add your own unique perspective of the world to the map. And even more importantly, it makes this data freely available for anybody to use, to parse, create, and build with. In short, OpenStreetMap is yours to keep, grow, and cherish.

Before chai, though, let us have some introductions to the Earth, but as data. The Earth is a staggering data set. Let's take a moment to think about everything that we've seen throughout our lifetimes, right? The sheer diversity of the objects, both natural and man-made, that we've had the fortune to look at. You'll be surprised to know that all these objects, in all their diversity and in their entirety, can be represented by three simple geometries, which are points, line strings, and polygons. There's also another concept, relations, which I'm going to be ignoring for the duration of my talk, but I'm happy to answer any questions on it.

Some examples of points are mountain peaks and trees. Some examples of line strings are rivers and roads. And some examples of polygons are buildings, lakes, and fields. But how does this data really look to the computer?
So when the computer is seeing a point, it's seeing an image like this, or more importantly, it's seeing this JSON structure on the right. I'm going to take some time to break down this JSON structure for you, because it is important for the rest of my talk. This JSON structure is usually of type FeatureCollection, which is a collection of the Earth's features. And these features are usually an array of objects.

This is how a feature looks for a point. It's got the type Feature. It's got some properties, which are metadata. And it's got a geometry, which contains, again, a type, here Point, and some coordinates. A line string looks very similar to a point, except that here the geometry is of type LineString, and you have more coordinates than you would in a point. A polygon looks very similar to a line string, but it's got even more points, and the first point and the last point are the same.

I spoke a little bit earlier about metadata. Metadata is basically any data that you want to associate with a feature but that's not related to its geometry. On OpenStreetMap, this could typically be the name of the person who edited that feature last, or the timestamp of when that feature was edited. But more importantly, you can have visual metadata. You could associate properties like color, size, and opacity with this data.

This is a map of Toronto, where we have visualized several points of pedestrian accidents. You'll see that these circles vary in size. Some of these circles are bigger, some are smaller, some are red, some are yellow. All of this denotes the intensity of the accidents, the number of accidents that occurred, and so on and so forth.

Before I move on to the core of my talk, I want to spend some time talking about web maps.
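The three feature shapes described above can be sketched in JavaScript as plain GeoJSON objects. This is only an illustration: the coordinates and property values are made up, and `isClosedRing` is a hypothetical helper name used to show the "first point equals last point" rule for polygons.

```javascript
// Illustrative values only: coordinates and properties below are invented.
// GeoJSON positions are written as [longitude, latitude].
const point = {
  type: "Feature",
  properties: { name: "Example peak" }, // metadata, unrelated to geometry
  geometry: { type: "Point", coordinates: [77.5946, 12.9716] }
};

const lineString = {
  type: "Feature",
  properties: { highway: "residential" },
  geometry: {
    type: "LineString",
    coordinates: [[77.59, 12.97], [77.6, 12.98], [77.61, 12.98]]
  }
};

const polygon = {
  type: "Feature",
  properties: { building: "yes" },
  geometry: {
    type: "Polygon",
    // One outer ring; its first and last positions are identical.
    coordinates: [[[77.59, 12.97], [77.6, 12.97], [77.6, 12.98], [77.59, 12.97]]]
  }
};

// A FeatureCollection is just a wrapper around an array of features.
const featureCollection = {
  type: "FeatureCollection",
  features: [point, lineString, polygon]
};

// Hypothetical helper: true when a ring starts and ends at the same position.
function isClosedRing(ring) {
  const first = ring[0];
  const last = ring[ring.length - 1];
  return first[0] === last[0] && first[1] === last[1];
}

console.log(featureCollection.features.map((f) => f.geometry.type));
console.log(isClosedRing(polygon.geometry.coordinates[0]));
```

Visual metadata such as color or size would simply be more keys inside `properties`.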
Web maps are not directly related to how we're going to parse or process geospatial data, but they contain a really elegant concept about how geospatial data can be broken down into very logical chunks. A web map contains many, many perspectives of the world. It contains a perspective of the world as the moon sees it, or as a satellite sees it. It contains a perspective of the world as the birds see it. And it also contains a perspective of the world as you and I see it. The reason why this is possible is the property of zoom.

Now, as we zoom in to this map of the United States, you will notice that as you go deeper and deeper, more details come up, which means that at any given point of time, the web map holds an immense amount of data. It's got data at varying perspectives. It's got data at varying levels of detail. And this is clearly a lot of data, because as people who use web maps on a daily basis, we know that we constantly zoom in and zoom out. But in spite of that, web maps are incredibly fast. Why is that?

One reason is that you're always constraining the web map to the location that you're looking at. If you're looking at the map of Bangalore, there's no need for the browser to load the entire world. So you already reduce the amount of data that the browser has to load. The second idea is that of tiles. A web map is basically a giant single-quadrant Cartesian grid with slots for tiles. In computer speak, you can think of it as a matrix.

This is a map at zoom level 2, where you have four tiles in each row and four tiles in each column. The origin of the map is in the top left, and every tile is uniquely identified. How is this done? The tile representation involves three different properties. These are z, x, and y, which stand for the zoom, the x coordinate, and the y coordinate. I mentioned earlier that it's a Cartesian grid.
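The z/x/y addressing just introduced can be sketched in a few lines of JavaScript. The talk does not give a formula for finding the tile that contains a location, so the conversion below is an assumption: it is the commonly used Web Mercator tile formula with the origin at the top left. The function names are chosen for this example.

```javascript
// Number of tiles along one side of the grid at a given zoom: 2^zoom.
function tilesPerSide(zoom) {
  return Math.pow(2, zoom);
}

// Total tiles at a zoom: 2^zoom columns times 2^zoom rows.
function totalTiles(zoom) {
  return tilesPerSide(zoom) * tilesPerSide(zoom);
}

// Assumed convention: Web Mercator tiling, x grows eastwards, y grows
// southwards from the top-left origin. Valid for latitudes within roughly
// +/-85.05 degrees and longitudes in [-180, 180).
function lonLatToTile(lon, lat, zoom) {
  const n = tilesPerSide(zoom);
  const latRad = (lat * Math.PI) / 180;
  const x = Math.floor(((lon + 180) / 360) * n);
  const y = Math.floor(
    ((1 - Math.log(Math.tan(latRad) + 1 / Math.cos(latRad)) / Math.PI) / 2) * n
  );
  return { z: zoom, x: x, y: y };
}

console.log(tilesPerSide(2), totalTiles(2)); // 4 tiles per side, 16 in total
console.log(lonLatToTile(0, 0, 0)); // the single tile at zoom 0
console.log(lonLatToTile(-90, 45, 2));
```

At zoom 0 every location falls in tile 0/0/0, and each extra zoom level splits every tile into four.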
So you actually go from 0 to 3 in the rows and from 0 to 3 in the columns. And you also have the zoom property in every map that you're seeing. At zoom level 0, the map has just one tile, whose z property is 0, and the x and the y are obviously 0. At zoom 1, the map has four tiles. Now the zoom remains constant at 1, but the x and the y vary between 0 and 1. At zoom level 2, the zoom property stays constant at 2, but the x and the y range between 0 and 3. Basically, at zoom n, the map has 2^n tiles in every row and 2^n tiles in every column, which makes 2^(2n) tiles in total. That is why at zoom 2 you have 2^2 tiles in every row and 2^2 tiles in every column.

The reason why the map is so fast, in spite of loading so many tiles and showing so much detail, is tiles. Firstly, tiles make it really fast. Secondly, tiles enable you to progressively load things. For instance, if you're looking at, say, MG Road, and then you move on to Cubbon Park, there's no need for the browser to request the tiles for MG Road all over again. It'll just request the tiles for Cubbon Park. That brings me to my third idea, caching: every tile that you request is always cached. And fourthly, tiles are very simple to imagine, even as an abstraction, because they're simply objects in the Cartesian grid.

But how does this tile data really look to the computer? The computer sees tile data in one of two ways. One is raster data, where it's loading the map as PNG images. The other is vector data. This vector data is very similar to the data that we saw earlier in terms of the JSON structure, where you have the type and you also have a geometry. But in reality, this vector data is encoded as binary, in a format called protocol buffers. So on the left-hand side, you have the JSON data.
And on the right-hand side, you have an excerpt of this JSON data rendered in binary. This is the reason why web maps are so fast: you're constantly just loading binary data. Vector tiles are the reason why they're so fast.

Vector tiles can also be rendered client-side, as opposed to raster tiles. When you're requesting a raster tile, the browser sends a request to the server. The server retrieves the tile data, generates a picture from all that tile data, and returns this picture to the browser. Whereas when you're using vector tiles, the data itself is given to the browser and all the rendering happens client-side. I'm just going to wait for this slide to finish, because you can see that it happens really, really fast and very dynamically.

And thirdly, vector tiles are very useful because they can be layered. If you wanted to layer raster images, you would have to retrieve many, many tiles. But with vector tiles, you could have one layer that represents, say, all the trees in the world: you filter the data and extract just the trees, and that would be one layer. The second layer could have all the buildings. The third layer could maybe have oceans. You could extract these layers separately and overlay them on the map, and this is completely possible with vector tiles.

So essentially, to parse the world really, really fast, the first step is that you parse the world as vector tiles. Of course, this completely excludes things like image processing, which is not what my talk is about. What I'm actually talking about is that you're able to parse the world's data really fast because you're parsing binary data. The second way you can make this parsing extremely fast is by using MapReduce. MapReduce is the process where you divide, process, and aggregate data in a parallel, distributed manner. But this is, again, a mouthful, right?
So what do I really mean by this? Let's break down MapReduce, but in terms of vector tiles. What are the requirements for you to be able to use MapReduce on something? The first property is that the data can be broken up into very logical chunks that can then be pieced back together, which is completely possible with vector tiles. Secondly, when you operate on individual chunks and then aggregate the results of these individual operations, you're doing the same thing as operating on the large data set. For instance, if you want to find out the number of buildings in the world, you would break down the world into tiles, you would find out the number of buildings in each tile, you would put these results together, and essentially you would be doing the same thing as finding out the number of buildings in the entire world. Thirdly, the chunks can be operated on in any order. So doing something like this is very similar to doing something like this, or doing something like this.

Before we move on to tile-reduce, I want to show you a teaser of a video that we made when we tried to find out the road coverage on OpenStreetMap and benchmarked it against something that the CIA released, which is called the CIA World Factbook. This visualization was run on a MacBook Pro and it took about 20 and a half minutes.

Essentially, tile-reduce is MapReduce, but on vector tiles. It was written primarily by Morgan Herlocker, who works at Mapbox, and it's an open source library, so all of you are very welcome to use it. The reason why it's so incredibly fast and so incredibly simple is that it's 405 lines of JavaScript. I'm going to take an example of using tile-reduce, and this involves filtering out all the roundabouts in the world as seen on OpenStreetMap. Here, our data chunks are vector tiles.
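The three MapReduce properties listed above can be illustrated with a toy sketch in plain JavaScript, without tile-reduce itself. The `tiles` array is a hypothetical stand-in for real vector tiles, and the example assumes every feature lives in exactly one chunk.

```javascript
// Hypothetical stand-ins for vector tiles: each "tile" is an array of features.
const tiles = [
  [{ properties: { building: "yes" } }, { properties: { highway: "primary" } }],
  [{ properties: { building: "yes" } }, { properties: { building: "house" } }],
  [{ properties: { natural: "tree" } }]
];

// Map step: runs on one chunk at a time and knows nothing about other chunks.
function countBuildings(tile) {
  return tile.filter((feature) => feature.properties.building !== undefined).length;
}

// Reduce step: aggregates the per-chunk results.
function sum(total, count) {
  return total + count;
}

// Chunks processed in their original order.
const inOrder = tiles.map(countBuildings).reduce(sum, 0);

// Chunks processed in reverse order: the aggregate is unchanged.
const reversed = tiles.slice().reverse().map(countBuildings).reduce(sum, 0);

// The same operation applied to the whole data set at once.
const wholeDataset = countBuildings(tiles.flat());

console.log(inOrder, reversed, wholeDataset); // all three counts are equal
```

Because summing is order-independent, the chunks could just as well be handed to separate worker processes, which is where the speed comes from.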
The operation that we're going to be doing on each vector tile is finding the roundabouts in that tile, and the reduce operation will be to aggregate all these filtered roundabouts. But what does this roundabout feature look like? I'm sure all of you have seen one when you've been driving or going on your scooter. These are the circles that you see at junctions. I'm sure there are many of them in Bangalore. But as data, this feature is a line string. It's important to note that it's not a polygon, because it's not filled inside. It's a line string, and it contains many coordinates, some of which I have skipped here. Most importantly, it contains a property called junction whose value is roundabout.

Let us quickly look at some code and see how simple tile-reduce really is. All tile-reduce scripts have two parts. One is the main script, where you require tile-reduce and make calls to it. The second is the map script. This is the script that runs the map operation, the operation that you want to run on each vector tile, and it will be called for every tile in your tile set.

In our main script, we require tile-reduce, which is an npm module. We require any other npm modules that we may need for our code. We then make a call to tile-reduce. We specify the zoom at which our vector tile set is. We give the path to the map script, and then we have the sources array, which is an array of all the sources that you're operating on. In our case, we're only working with a tile set from OSM QA Tiles, and I'm going to give it the name osm.

Now I'm going to move on to the map script, which is going to be called for every tile in this tile set. This map script has a tile layers variable, and this variable refers to our source. So you have this tile layers variable, and it holds a feature collection.
So you first read the layer property from this tile layers variable: tileLayers.osm.osm is your feature collection. You iterate through all the features in that feature collection and check whether each feature has the roundabout property. If it does, and you're counting the roundabouts, you can just increment your count by one. Or if you actually want to see the roundabout features, you write them to standard output. Every time this script finishes running, it has a callback that leads to the reduce function, and there you're just incrementing the count for all the roundabouts in the world. When every tile in the tile set has been processed, you can write your result to standard output. We ran this on an EC2 instance with all the data of the world, and it took about, I'm just going to wait for it to finish, about 15 minutes and 48 seconds, and we counted 419,261 roundabouts.

This is a visualization of the roundabouts data. These are all the roundabouts in Paris, and these are all the roundabouts in Dubai.

One thing that I did not talk about when I spoke about the code earlier: I've been saying that you can parse the world's data, but you can also constrain tile-reduce to work on the areas that you're really interested in. This is specified by something called a bounding box. You can think of a bounding box as just a rectangle on the map, and it consists of the minimum x, the minimum y, the maximum x, and the maximum y.

We use tile-reduce at Mapbox for more things than just making really cool maps, and I want to take you through some of the stuff that we've done with it. This project was where we tried to analyze building sizes in various countries, and building sizes and shapes in Monaco are what you're seeing on the screen right now.
This is OSM Analytics, OpenStreetMap Analytics, which can be used to extract any sort of analytics about OpenStreetMap, like the density of buildings, the density of roads, where all the roundabouts are, all kinds of things. OSMLint is predominantly used to find errors in OpenStreetMap. OpenStreetMap is open for anybody to edit, and it's useful to run these scripts to find out the common mistakes that people are making and maybe send them a friendly note and say, hey, maybe you should fix this. osm-tag-stats is something that I wrote for someone who does not want to write code, who does not want to write JavaScript. It just involves a simple command-line command which extracts any feature from OpenStreetMap. You can use it to count features. You can use it to extract just the features. You can go check it out.

This is what I spoke about earlier, which was comparing the CIA World Factbook to OpenStreetMap, and this is how the coverage looked for India. So only 21% of the roads that are in the world are on OpenStreetMap, which is a call to all of you to go out there and map on OpenStreetMap.

Something that I built in a really short period of time, as an exercise, was a study of the gender of streets in many cities. As somebody who is quite interested in the idea of gender, this was a super fun project where I used tile-reduce to extract the names of streets and then passed them to another library called NamSor to work out, with some false positives, whether a street was named after a man or a woman.

But the question is, why tile-reduce, really? There are so many other ways to process geospatial data, so why would you be interested in using tile-reduce? I'm going to take an example where you try to count the number of buildings on OpenStreetMap. Traditionally, you would download the entire OpenStreetMap planet file in XML format, which is a very, very big file, and then you would set up Postgres.
You would import this data into Postgres using osm2pgsql. You would figure out the table where the buildings are present, and then you would run something like this. But if you had to do this with tile-reduce, you would essentially just download your vector tile set from this URL, OSM QA Tiles, which contains vector tile sets for the world and for every country, and you would just npm install tile-reduce. You could also repurpose several modules on npm which have been made using tile-reduce. Then you would write some code, and this code would look something like this. I'm not going to go very much into the details, but it looks very similar to the code earlier. You just have your main file, and you have your map file where you're looking for the building property.

So what are the wins, really? One is that tile-reduce is incredibly fast. Secondly, it's very accessible for somebody who does not want to go through the entire process of setting up Postgres: if you can write JavaScript, you can use it. Thirdly, you can combine multiple sources of data very easily. This sources variable that you see here is actually an array, so if you wanted to compare two vector tile sets, you could do that. You wouldn't have to set things up twice as you would with Postgres. You can very easily constrain the run to specific areas in the world. You can make custom maps really easily, and you can also use it to just extract data and then do machine learning with this data.

But tile-reduce does have a set of quirks. One of these is that you have features that are spread over many, many tiles, and when this is the case, every portion of this geometry is going to have coordinates with many decimal places.
Like, they're going to be maybe 74 point, I don't know, five, six, seven, eight, nine, ten something. To simplify the data when you're breaking it into tiles, some of these decimals may be cut off, and when that happens, you see breaks at tile boundaries. Secondly, when you're generating tiles, it's a little difficult to accurately generate tiles for small areas, because at the boundaries you're probably going to be including areas that you don't want. But this difference is pretty negligible when you're working with very large areas, which is predominantly the use case for tile-reduce.

At this point, I'm going to stop and just say that I really hope I've given you a good teaser into using tile-reduce. I hope that you will use it to analyze any geospatial data that you may have, and I'm happy to answer any questions that you may have. Thank you.

The scripts that you just saw...

Oh, very good question. So the difference between using OSM QA Tiles and using Overpass is that OSM QA Tiles are generated only once every day, whereas Overpass is querying live data. So you can't use tile-reduce to find out what happened this second. For that, unless you have a tile set of data from that second, Overpass is easier.

But there's a difference between the speed of...

Yes, definitely, especially for very large areas. I'm not sure if you've used Overpass for very large areas, but invariably your browser will crash.

Yeah, it crashes. Try to find out the schools in India. Right, so it almost crashed. Would that work with this?

Yes, it would be very easy with this. You would just need the tag, the property and the value for schools, and then you could use tile-reduce with that.

Okay, thank you.

Yeah, hello. My question is related to what you were explaining about the Z, which is related to the zoom level, right?
So my question is how we can associate the zoom level with the satellite height, the altitude of the satellite from which we are capturing the image. Suppose the user is interested in the view from 20,000 kilometers above, 15,000, or one kilometer above the earth's surface.

I'm not sure if I'm qualified to answer this question because I don't work a lot with satellite data. But I think satellites that are closer to the earth are probably capturing images at higher resolution, so with more detail. So the ones that are farther away probably correspond to lower zoom levels.

But my question is very specifically, what should be the altitude associated with the Z?

I'm not sure of the answer.

Okay, my next question is related to the zoom level. You were speaking about the number of tiles we are loading for a zoom level being 2 to the power 2n, right? My question is, if my zoom level is not a natural number but a decimal, like 2.5, how do we handle that?

It's probably rounded off to the next biggest number when you're loading the tiles.

Actually, I tried with the Mapbox plugin. It is not working. It is throwing me an error.

Oh, but we often use zooms that are real numbers.

That's my question. If I have the zoom level 2.5, 3.5, 2.75, how do I handle that? How many tiles will there be? Because I tried, and it's giving me an error, actually.

Okay, so the thing is, if you're doing this with tile-reduce, you need to have your tiles at the same zoom that you're interested in. If you're using a map, working with real-numbered zoom levels is very easy. But if you want to specifically work with tiles at 2.75, there's this module called tippecanoe that can be used to generate tiles at the zoom level that you're interested in. And then you can analyze it using that same zoom level.

Okay, yeah, thanks.

Hi. Okay, hi. So can this tile-reduce...
I mean, you also talked about bbox, which kind of constrains this thing. So does bbox accept a polygon as a constraint? What I'm trying to do is address the same problem you mentioned at the end, where you have a polygon as the area and you want to do a tile-reduce on it. So can I constrain it by a polygon?

So what you could do is, there's something called tile-cover, which you can use to generate tiles for a polygon. Once you have these tiles, it's just a question of working with those tiles. So the first step would be to generate the tiles for the polygon. But you will still be working only with tiles, and this will not be a square, it'll just be many tiles.

So then what does bbox really do?

bbox is for when you want to use just a rectangle. In the event that you want to use a rectangle, you can use a bbox.

It's still a rectangle, basically. That's what I'm trying to understand.

It is a rectangle, yeah. It's not a polygon. No, it will not accept a polygon value. bbox is very commonly used by people working with geospatial data, so I guess it's a useful parameter for tile-reduce to accept.

So what is OpenStreetMap's policy regarding mapping sensitive buildings and government areas? Is there any policy that you're not supposed to encroach, or things like that?

So the idea is that all of the data is licensed under the ODbL, but it's completely free for people to build even commercial products on. There are military areas that have been marked on the map. It's there on the map, yeah. It just depends on what people want to mark.

Hi, I just wanted to understand what happens when things fall on the boundary. Say, for example, a building, a tree, or whatever it is, with so many tiles, how do you count it? Do you count it as two or as one?

Can you repeat that? Do you mean...

Let's say that you are counting some particular entity.

Okay.

And now we are dividing it into tiles.
Okay.

And a couple of these entities fall on boundaries. Say, for example, they come in both the tiles. Then?

Very good question. Really good question. I'm glad you asked. So the thing is, a lot of what I spoke about was associated with OpenStreetMap, and all of these features have unique IDs. So when you're counting, you're essentially counting those unique IDs. In spite of a feature appearing in multiple tiles, those pieces will have the same ID, and you'll just be counting that ID once. So if you have your own vector tile set, it's best to have a way to uniquely identify each feature so that the count is accurate.

Yeah. All right, thank you everyone.
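The counting-by-unique-ID idea from the last answer can be sketched in plain JavaScript. The per-tile ID lists and the ID format below are hypothetical; the point is only that a feature split across a tile boundary is reported by more than one tile but still counted once.

```javascript
// Hypothetical per-tile results: each tile reports the IDs of the buildings
// it contains. "way/102" and "way/103" straddle a tile boundary, so they
// show up in two tiles each.
const idsPerTile = [
  ["way/101", "way/102"],
  ["way/102", "way/103"],
  ["way/103"]
];

// Naive reduce: adds up per-tile counts and double-counts boundary features.
const naiveCount = idsPerTile.reduce((total, ids) => total + ids.length, 0);

// ID-aware reduce: collects IDs into a Set, so duplicates collapse.
function countUnique(tileResults) {
  const seen = new Set();
  for (const ids of tileResults) {
    for (const id of ids) {
      seen.add(id);
    }
  }
  return seen.size;
}

const uniqueCount = countUnique(idsPerTile);

console.log(naiveCount, uniqueCount); // 5 3
```

For a planet-sized run the set of IDs can get large, so whether to keep it in memory or write IDs out and de-duplicate afterwards is a design choice left open here.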