in the last couple of years of building MapSense. So I'll give a little bit of a description of what our company is and then go through problems that we've encountered. And for each problem, I'll describe a, hopefully, practical solution that we came up with to solve that problem. And then for each one of those problem-solution pairs, I'll jump into a quick demo that shows how we've actually implemented that solution in our product or in some back-end tool. And then after I go through these problems and solutions, I'll jump into maybe a longer-form demonstration of a couple of data sets that we've been working on the last couple of years at MapSense. My goal is that at the end of this talk, you'll be able to leave the room with a couple of tricks or ideas that you can apply to your own research if you ever come across spatial data sets or the need to do any spatial analysis. So feel free to interrupt me or ask me questions if something's not clear. Yeah, and I'll get started.

Before I jump into those problems, I just want to frame the world of mapping and spatial analysis and what's going on today. Some of this might be familiar to some of you, but it's helpful to go over it. The biggest thing — and I think one of the geneses of starting MapSense, and probably even of the Berkeley Institute for Data Science — is the fact that there's way, way more data being stored and collected today. Internally, I call this the tsunami of geo data. In the last few years, there's been not one order of magnitude but several orders of magnitude of increase in the amount of location data that companies are generating, streaming, storing, and collecting. This is coming from smartphones that are streaming location data — even non-smartphones that telecom companies can triangulate to locations. This is coming from connected cars, from Fitbits. This is coming from smart power grids that have sensors on them, like windmills and so on. There are companies now building very cheap satellites that can circle the world and make satellite imagery that's orthorectifiable. Drone companies are making — not satellite imagery — drone imagery that's also tied to location. So there are many, many more sources of location data than there ever were before.

An example that I often give is the company Uber, which you're all familiar with. It's storing, from every one of its drivers, his or her location every two seconds. Some back-of-the-envelope math suggests that Uber is storing about a billion lat-longs every single week. You can contrast that with the size of the location data that people were playing with five or ten years ago. A big one that people would often talk about is the US Census, which, if you think about it, at its finest discretization is about 300 million rows, updated every ten years. So you can see that there really is a shift in the amount of location data that companies are collecting. And in addition to the size of this data, this data is streaming in. You're getting data sets that are coming in in real time — thousands of lat-longs per second. And so you need tools for playing with this magnitude of location data.
The second trend — and this one is a little more focused on location, on geo — is the fact that browsers are more powerful today than they ever were before, and that there's been a shift, a revolution, in making maps: they can now be delivered to a browser and rendered in the browser. So I'll zoom out and do a little bit of a history lesson for you guys. Traditional web maps, like the MapQuests and the Google Maps when they were first introduced, were raster maps. Basically, Google or MapQuest or whoever would take a bunch of location data in a database and render it into a bunch of images — literally .png or .jpeg images at every zoom level and at every lat-long — and then just send those images to your browser. They had pre-rendered the map. They chose the font size. They chose how the roads were going to look. They chose when lakes would come in. It was all very static. They would render these big image tile sets and then serve them statically to your computer.

Today, with the advent of a browser that can actually do some computation on the front end, instead of sending over pre-rendered map data, we can send over the raw map data. We can send the vector data — the actual x,y coordinates of a polygon — down to the client, and then make the client do the rendering of that map data. The advantages are many. If you delay your rendering until the data is right in front of the user, there's a lot you get. One is interactivity: you can actually touch the features. In the image, it's just pixels; you can't select anything. With vectors, the computer knows this polygon is this polygon, so when the mouse clicks on it, it can select the polygon and bring up its metadata. You can also render a feature as a function of its metadata. Let's say we're looking at buildings in San Francisco, and we want to color the map based on the price of buildings. We can send the price of the building along with that feature and have the client render the building as a function of its price. So you can do really data-driven mapping by sending the metadata of the feature along with the actual vector data and making the client do the rendering however you want it done. Then, if you send multiple pieces of metadata per feature, you can have the user change some toggle on the client side and re-render the buildings without ever hitting a server. If you wanted to do this in the traditional raster-based infrastructure, you'd have to pre-render a bunch of choropleths for every single data set you wanted. And if you wanted this to be real time, you'd really be stuck, because you can't render those tiles efficiently.

So, to make what I'm talking about very concrete: here is a traditional raster tile. You can see this is a .png tile. I can't select anything — it's an image. I can't change the fonts. The roads are predefined, and so on. And this is what vector data would look like. Looks like I broke that link, but here: this is a set of vector data that you would send down to the client and let the client render. You can see this data is in a format called TopoJSON, but you can send data in any format — GeoJSON, protobuf, whatever.
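To make the data-driven rendering idea concrete, here's a minimal sketch in Python of the kind of logic the client runs. The feature layout and the "price" field are illustrative, not our actual schema — the point is just that the style decision happens client-side, at draw time, from metadata shipped alongside the geometry.

```python
# A feature arrives as raw vector data plus metadata; the client decides
# styling at render time. Layout and the "price" field are illustrative.
feature = {
    "type": "Feature",
    "geometry": {
        "type": "Polygon",
        "coordinates": [[(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]],
    },
    "properties": {"name": "123 Main St", "price": 850_000},
}

def fill_color(props):
    """Render a building as a function of its metadata -- here, its price."""
    price = props["price"]
    if price < 500_000:
        return "#2c7fb8"      # lower-priced: blue
    if price < 1_000_000:
        return "#7fcdbb"      # mid-priced: teal
    return "#edf8b1"          # higher-priced: yellow

# Re-running this over features already on the client is all a "toggle"
# costs -- no new tiles, no server round trip.
print(fill_color(feature["properties"]))  # -> "#7fcdbb"
```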
One other thing you get with sending vector data down to the client is that — like I mentioned before — you can choose how you're going to render that vector data. Here's an internal tool that we use at MapSense. This is a vector map that's being drawn in my browser, and one of the things you can do is change the style of this map really easily, because my browser is figuring out how to render the map. This was a little app that one of our product people built to actually color and choose a style for a map. So we can get in here and, let's say, change the roads and make them purple or whatever. You would not be able to do this with a traditional raster map. So hopefully it's becoming a little clearer what kinds of things you can do with a vector map.

Throughout my talk, I'm going to focus on these two trends. One: there are big, big sources of location data. Two: we have powerful browsers, and we can start to do interesting things with them that we couldn't do before, in conjunction with massive, real-time data sets. I'd say that's a good description of the kinds of things we're focusing on at MapSense.

So I'm going to jump into our first problem. The rest of this talk will be a set of problems and solutions, and then a demo at the end. The first problem is one that you've probably all heard of before: how long is the coastline? This is a really famous paradox — the paradox being that you can't really measure the length of a coastline, even though it clearly has some length. Say we measure the length of Britain with a measuring stick a mile long: you get some measurement. Then you go down to centimeters: it's going to be longer. Then down to millimeters: longer still. And you keep going forever; you never converge on a length. This was studied by Mandelbrot, and a whole theory of fractals came out of this research.

In our world, we want to render vector polygons in a browser. One of the things you might run into is that if you send a full-resolution file — the vector file, the x,y coordinates of a very, very detailed polygon — you might be sending over a massive file. Let's say we're rendering the coastline of the world; we just want a map of the coasts. You have a full-resolution coast file on your server, and even just downloading that to your client could take an hour. And so one of the tricks of the trade is to simplify that data somehow, so that you're not sending every single point in that big coastline file — you're only sending a subset of those points. I'm going to quickly show you an example of what happens when you don't do simplification. This is an app that hit Hacker News a couple of weeks ago, and one of the problems that this author ran into was exactly this. So there are techniques to handle this problem, and I'm going to go over one, which has a very hard name: it's called Visvalingam's algorithm. Another algorithm, for those of you who know this space, is called Douglas-Peucker. But Visvalingam's has some really nice properties, and the best property is that it has a very, very intuitive, simple explanation.
So I'm going to actually just explain it to you here with this image. Suppose that top line is some really complicated coastline that you want to send down to your browser to draw. Here there are six points, but imagine there are 600 billion points. Consider an algorithm where, excluding the end points, for every point you calculate the area of the triangle formed by it and its two adjacent points. That's shown in the second image. Then, in every iteration, the algorithm removes the point whose triangle is smallest. So you can see that in step one, we compute the triangles, and in step two, we remove point 5, because its triangle is the smallest. Thinking about this intuitively, we're removing the point that makes the least perceptible difference between its adjacent points. So it's a very, very intuitive algorithm — we're just removing points that aren't that important. It also has some nice properties: you can pre-compute it. You can assign the areas to the points ahead of time, which is really nice for performing this simplification very quickly on a server and then sending simplified geometries down to the client.

So here's a quick demo. This isn't a demo made by me — it's made by Mike Bostock of D3 fame. If I move my mouse all the way to the left, here is, I guess, 65% of the points in the original coastline file of the United States, and the pixel area shown is basically the area of the smallest triangle — so this is like 0.037 square pixels. As I move my mouse over to the right, I'm going to simplify the United States. And I can keep simplifying. You can see now I've almost removed the San Francisco Bay Area; Manhattan has become a block. Move it all the way to the right, and the smallest triangle is now 100 square pixels, but I've reduced this file to 0.8% of the original file size. Why is this nice? Because this is still totally perceptible. Everyone in this room — I think everyone in the country — would look at this and see very quickly that it's the United States. Now, we obviously don't simplify this much in practice. But we can simplify maybe this much and still send down significantly reduced file sizes.

And here's another really nice demonstration: the coast of British Columbia. You can see there are a lot of nooks and crannies in this data set. This little stretch between Vancouver and Prince Rupert actually represents 10% of the Canadian coastline — if you could measure the length of a coastline. So I'm going to apply some simplification to this map, and you can see us reducing the amount of data we're sending. If I make this number very large — S equals 10 — you can see that I've simplified this coastline. I can make it really extreme: at S equals 15 it's very chunky. And then as I go down — S equals seven, S equals four — at S equals one we're back to the full-resolution coastline. So this is one of the tricks: if you have raw vector data on a server that you want to display client-side, simplification of the vector data is really important. I'm going to go back to my slides and not go into presenter mode for my link — hopefully this looks okay. I love Google Docs.
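Since the algorithm is so simple, here's a minimal sketch of it in Python — the naive O(n²) version rather than the heap-based one you'd run in production, and the sample coastline is made up:

```python
def triangle_area(a, b, c):
    """Area of the triangle formed by three (x, y) points."""
    return abs((b[0] - a[0]) * (c[1] - a[1]) -
               (c[0] - a[0]) * (b[1] - a[1])) / 2.0

def visvalingam(points, n_keep):
    """Repeatedly drop the interior point whose triangle with its two
    current neighbors has the smallest area -- the least perceptible
    point. Endpoints are never removed."""
    pts = list(points)
    while len(pts) > max(n_keep, 2):
        i = min(range(1, len(pts) - 1),
                key=lambda j: triangle_area(pts[j - 1], pts[j], pts[j + 1]))
        del pts[i]
    return pts

coast = [(0, 0), (1, 3), (2, 1), (3, 4), (4, 0.5), (5, 2)]
print(visvalingam(coast, 4))  # the four points that best preserve the shape
```

The pre-computation property I mentioned falls out naturally: you can record, for each point, the area at which it would be removed, and then any simplification threshold the client asks for is just a filter over that stored value.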
So what I just explained works really nicely for lines and polygons — roads, boundaries. But what if you have big point data sets? Like hundreds of millions of tweets, or hundreds of millions of readings from mobile phones if you're a telecom company. You can run into a similar problem: you have too many points to display on a map. There are a lot of tricks for dealing with this. You might want to turn it into a heat map, or you might want to cluster points so that one point represents many points. But it's often really nice to maintain actual point data, where you can click and every point still represents one feature. So at MapSense we've taken sort of the analog of Visvalingam's algorithm and applied it to points, doing a similar density-based point simplification. We basically look at regions, look at their density relative to their neighborhoods, and choose to thin out the data such that the geographic distribution of points is still maintained. This is my image for illustrating that, and I'm actually going to show you a real data set — here's a link that works — where we're doing that sort of simplification.

So here's a map. This is about 100 million tweets that I'm displaying. But again, I'm not sending 100 million tweets down to the client — I'm sending a representative subset. And as I zoom in, I continue to do this density-based point simplification, so that at every zoom level you get the best visual for that zoom level and that bounding box. As I zoom into the US, you can see the major population centers in the Twitter map, and I'll zoom in to Berkeley and we'll see what we can see. So, the San Francisco Bay Area — you can see Sacramento and the city really light up. At the Bay Area view, we can see downtown SF; the Giants' stadium has a cluster; let's see what Berkeley looks like. Berkeley has a lot of tweets relative to its neighborhood. And we can zoom in very far and see this even on a building-by-building level. So here's campus with a Twitter data set on top of it. Where are we? I think this is us, right? Yeah. Again, I'm sending over these points not as images — I'm sending the vector data associated with each one of these points, so I can click on one and see the text of the tweet. So someone tweeted, "I'm literally just counting the books on the shelf in the North Reading Room." The user description is "I procrastinate life."

This brings up — and this is a quick aside — being able to map these really large data sets brings up issues around privacy, because Twitter probably doesn't think about the fact that people are mapping these data sets, and it can actually be a little bit dangerous. You can see when Julie's tweeting at night and maybe figure out where she lives, and you can see what classes she goes to, just based on the fact that this data has been mapped. It's all public data, but I think it brings up some points that I won't get too much into — things we're thinking about at MapSense around user data and user privacy. I'm going to jump back into my talk, but this should hopefully have given you a demonstration of how you can apply similar ideas of simplification to big point clouds. At the end, after going through these problems, I'll do a couple more demos. But yeah — this chart is actually just showing the distribution of the points in the current bounding box over time.
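Going back to the thinning for a second, here's a rough sketch of the idea in Python. Our real implementation weighs each region's density against its neighborhood; this stripped-down version just caps how many points survive per grid cell at the current zoom, which already preserves the geographic distribution much better than a uniform random sample would:

```python
from collections import defaultdict

def thin_points(points, zoom, per_cell=5):
    """Bucket points into tile-sized grid cells at the current zoom and
    keep at most `per_cell` from each, so dense clusters thin out while
    sparse regions keep every point -- the geographic distribution
    survives. (MapSense's real thinning also weighs a cell's density
    against its neighbors; this is the simplest version of the idea.)"""
    n = 2 ** zoom                       # cells per side at this zoom
    cells = defaultdict(list)
    for lng, lat in points:
        cx = int((lng + 180.0) / 360.0 * n)
        cy = int((90.0 - lat) / 180.0 * n)
        cells[(cx, cy)].append((lng, lat))
    keep = []
    for bucket in cells.values():
        keep.extend(bucket[:per_cell])  # or a random sample of the bucket
    return keep

# 1,000 tweets clustered in Berkeley plus one in Paris: the cluster thins
# to 5 representatives, the lone Paris point is untouched.
pts = [(-122.27 + i * 1e-4, 37.87) for i in range(1000)] + [(2.35, 48.85)]
print(len(thin_points(pts, zoom=8)))  # -> 6
```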
So it looks like this data set is tweets over about four days — each of these humps is a day. And yeah, correct — I'm using some poetic license there. I'm not simplifying the data, I'm thinning it; but in some ways, thinning a line makes it simpler and thinning point data makes it simpler. The goal is basically sending less data down to the client. And yeah, correct — the sampling we perform here is not just a random sample. We actually look at the geographic distribution. If you think about it, you can come up with something; it's not very complicated.

Back to this one. So I've described one problem-solution pair: we have very, very detailed vectors on a server and we want to send them down to the client. How can we do it? We can simplify them, or we can thin them if they're points. The next problem — or really the next idea — governs a lot of what we do at MapSense and a lot of how we tackle visualization and analysis. There's a very intuitive explanation, and then I can get into a little bit of math. First, the very intuitive explanation, which I was discussing right before this talk. The idea is: since we're really focused on visualizing really large data sets, we can usually assume that we know where we're going to be looking, because what the back end is serving is a map, and the map can tell you its extent. So you can basically limit the data you're looking for to the extent of the bounding box of the screen, and the server never needs to look globally at data points that are outside that extent. Say I'm looking at San Francisco: if I have a very quick way to retrieve data that's in San Francisco, that's really nice for mapping or visualization, because I never need to pull data that's not around San Francisco. Does that make sense? Okay, so that's the intuitive explanation.

The more mathematical explanation: we index all the geometries that we want to display in a big quadtree index. A quadtree index is a spatial index. It iteratively divides up some space into four quadrants — it's a tree where every node has four children — and you put each feature in the node that most finely represents it. And if you think about how the traditional image tile sets worked, back when I was talking about the Googles and MapQuests, they were serving images in this form too, as a big set of image tiles. It wasn't really a quadtree, but each image was described by its tile coordinates. We've taken that idea and — how am I going to say this? — instead of dividing up data into image tiles, we're indexing our features into a quadtree, and we actually call the keys tiles, in a very similar way to how Google Maps would tile a bunch of images. This image is actually pretty enlightening. The way it works is: at the very top you can see one tile — one key in our index — and it represents some very simplified data covering the entire world. Every level here, which is a level in our tree, corresponds to a zoom level in the browser, and represents the entire world.
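To make the tile keys concrete, here's a minimal sketch of the standard z/x/y key math — the same Web Mercator scheme those image pyramids used — together with the mapper step I'll describe in a moment. This is an illustration of the scheme, not MapSense's actual code, and real geometries can of course span several tiles:

```python
import math

def tile_key(lng, lat, zoom):
    """Map a lat-long to a z/x/y tile coordinate -- the same Web Mercator
    scheme the old raster tile pyramids used for their image names."""
    n = 2 ** zoom
    x = int((lng + 180.0) / 360.0 * n)
    lat_r = math.radians(lat)
    y = int((1.0 - math.log(math.tan(lat_r) + 1.0 / math.cos(lat_r))
             / math.pi) / 2.0 * n)
    return zoom, x, y

def map_step(features, zoom):
    """The mapper side of the tiling job: one pass over the features,
    emitting (tile_key, feature) pairs; the reducer then just groups
    everything that shares a key into one tile."""
    for f in features:
        yield tile_key(f["lng"], f["lat"], zoom), f

print(tile_key(-122.26, 37.87, 12))  # the zoom-12 tile containing Berkeley
```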
And as you get more nuanced features, like tiny rivers and roads and so on, you only index them in the tiles they belong in. I feel like that was a little confusing, but hopefully, with some hand-waving, you guys understood it. So what does this give us? One thing that's really nice is that this is an algorithm that's really easily distributed. The mapper basically iterates once through all the features and throws each one into a tile — the key is a tile. And the reducer gets all the features that share a key and groups them into these tiles. So it's really easy; this is distributed and runs on Hadoop, so you can tile very, very quickly. So, for those of you who understood — and those of you who didn't understand — assume you can very, very quickly access location data at any zoom level and any lat-long. Given that assumption, I'm going to go through some other problems that we were able to solve with the tiles.

[Answering a question:] The key is a zoom level and a tile coordinate, like a z, x, y. Yeah, that's the key. This isn't for storing — this is just for the tiling operation. So forget different servers: in the mapper, the keys are tile coordinates, and the reducer gets all the features in a tile coordinate. That makes sense? And then for storing, we do some sharding, but most of our tile sets can be held on one machine or just replicated across machines. I'm not talking about geo-sharding here, which is something we could talk about after this talk — super interesting stuff. This is just the tiling operation.

Now think about what I talked about before — simplification. Say you want to perform Visvalingam simplification given a really big input like the entire coastline of the US. Even though I gave you a nice trick to simplify that really big coastline, it's a massive file that would be hard even to hold in memory. But you don't need to perform Visvalingam simplification on the entire coastline. You only perform it on the subset of the coastline that represents the part of the world you're looking at, and you can pull that part of the world into memory based on this index.

So what does this data structure give us? A couple of things that I'm going to describe here. One of the really nice things about how we're indexing our data is that we can very quickly filter geographic data based on its metadata. I'm going to give you a demo of that — hopefully this link is working. So, here's San Francisco. Can anyone notice anything wrong with this map? Yeah, exactly — the Golden Gate Bridge is missing. And I was able to do that by just putting a query in my URL. You can read the query here, but what it's saying is: define a group called group A by some Boolean expression — name equals Golden Gate Bridge — and where group A is true, get rid of the feature. So: no bridge. If I flip it to false, I get nothing but the bridge. I'm able to query the underlying vector data very, very quickly because I have this index, this data structure, on the back end. I'll show you another, maybe more practical, example. Suppose I wanted the vector data just for buildings in San Francisco. I could perform a query where I again create a group, with a Boolean expression that I know is true for buildings. I think it's like this.
And so this is, again: I'm creating a group defined by a Boolean expression, and everything that's not in that group, we're removing. And I'm able to do this sort of filtering very, very quickly. On a traditional Postgres server with PostGIS, if you wanted to get the vector data for buildings in San Francisco, this query might take 30 seconds to run. I can put a minimum-area filter on the end of this and get the bigger buildings in San Francisco. So we can do very, very fast operations on this geodata, sending down real vector data to the client. Again, this isn't image data — I have the names of these buildings, and the elevation, and whatever other metadata is associated with them. I can go into more of that at the end when I give a couple more demos.

Another thing we've been doing at MapSense — and this came out of a real-world problem we ran into. We had a data set of all the crimes in San Francisco, coming from the city: about 15 million crimes over the last 10 years. We got it from DataSF. The city actually publishes its own crime map, but it only publishes the last two weeks of crimes, because it's such a large data set. So we ingested the entire data set, and we had a really nice point-based visualization of it. I could search it — I could search for different drug types or whatever — and I'll show you that at the end. Well, when we showed this to some people in the city, one piece of feedback was: it's really nice that I can see clusters in this data set, but the police don't work in these undefined regions — they work in police districts, or census tracts, or zip codes.

Let me actually show you that. So here's that police data set — 15 million crimes in SF. You can see the date on the bottom; this is since 2004. Each one of these points represents a crime. So here's what happened: there was an arrest, someone was booked in the Tenderloin for category "other offenses" on a Thursday. For each one of these points, I can also slice by category over here. So here's a histogram of the crime categories, and let's say we're interested in the drunkenness category. I'm going to display just the crimes representing drunkenness. These clusters were really interesting to a lot of people. But again — going back to why I brought up points and polygons — the police don't work in these kinds of continuous areas. Okay, so it looks like there are more crimes in North Beach — but is that in your police district or in my police district? There's bureaucracy, and how much money is allotted to those areas, and so on. So one of the things we wanted to do was aggregate these crimes and be able to do the same sorts of filtering, but instead of drawing pure point clouds like this, show the police districts or census tracts colored by the count of crimes in each. So we did that, and I'll show it to you, and then I'll talk about how we did it. This is still work in progress, so it's a little bit messier, but here's the same SFPD map colored by census tract, where each census tract is colored by the count of crimes. And I can still filter on the underlying data, so I can do "and where category equals... drunkeness." I hope I spelled that right. Nope — can anyone catch my typo? Two N's.
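As an aside, here's a hedged sketch of what this kind of group filtering amounts to once features are tiled: load only the tiles under the viewport, then apply a Boolean predicate to each feature's metadata. The field names and the dict-based tile index are hypothetical stand-ins, not our actual query engine:

```python
def matches(props, category=None, min_area=None):
    """A Boolean 'group' predicate over a feature's metadata."""
    if category is not None and props.get("category") != category:
        return False
    if min_area is not None and props.get("area", 0) < min_area:
        return False
    return True

def filter_viewport(tile_index, visible_tiles, **predicate):
    """Touch only the tiles under the current bounding box, then keep the
    features whose metadata satisfies the predicate."""
    for key in visible_tiles:
        for feature in tile_index.get(key, []):
            if matches(feature["properties"], **predicate):
                yield feature

tile_index = {(12, 656, 1581): [
    {"properties": {"category": "building", "area": 900, "name": "Doe Library"}},
    {"properties": {"category": "road", "name": "Bancroft Way"}},
]}
hits = filter_viewport(tile_index, [(12, 656, 1581)],
                       category="building", min_area=500)
print([f["properties"]["name"] for f in hits])  # -> ['Doe Library']
```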
So this wasn't a pre-computed choropleth — this choropleth is computed on the fly, even on these really large data sets. So, going back to my slides: how are we able to do point-in-polygon on really large data sets? We've done this with hundreds of millions of points against probably tens of thousands of polygons. The traditional point-in-polygon algorithm is really simple. It's a ray-casting algorithm. You have a polygon and you have a point; cast a ray in any direction out of that point, and count the number of times it intersects the polygon. You can see it in the image here. If it intersects an odd number of times, the point is in the polygon; if it intersects an even number of times, the point is not in the polygon. It's really trivial to think about, and it's really intuitive. The point over there is going to pass through six times — even, so it's outside. If you cast a ray out and it never intersects, zero is even, so the point is not inside the polygon. If the point is inside, the ray will pass through an odd number of times. And it always works, even with self-intersections and so on. The proof of this is actually much more difficult than you'd think — it uses the Jordan curve theorem, the fact that a closed curve divides the plane into two distinct regions.

But suppose you have 100 million points and a bunch of polygons, and you want to perform this intersection. That would take a long time to run. Now think about the fact that the polygons are indexed in this quadtree. Can anyone imagine what your algorithm might be to make things faster? How can you speed up the point-in-polygon, given that your polygons — or your points — are indexed in a way that lets you quickly retrieve the features for a particular region, in a quadtree? [An audience member suggests using the tile index to narrow down the candidate polygons.] Yeah — I think you gave the answer I was going to give. Basically, given a point, instead of intersecting it against all the polygons in your data set, first figure out what tile it's in, because you can do that really, really quickly, and then run the ray-casting algorithm against only the features in that tile. There's some tuning on how deep you go, how big your tiles are, but that's the simple idea. And this lets us do really massive point-in-polygon runs on ingests of points that are coming in in real time. We can still perform the point-in-polygon because we have our underlying polygonal data in a very, very fast data structure.

So I'm going to give you another example of what we've done with this point-in-polygon, and this is something really new — I think you're the first people to see it; we just got this map up last night, and we haven't explored all its possibilities. What this is is a map of pings from mobile phones. This is coming from one of our customers' data sets — a customer who has a very popular app, let's say; I don't want to give away too much. This data set is about three million points, and we intersected every point with the buildings in San Francisco.
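Here's a minimal sketch of both halves in Python — the classic ray-casting test, and the index-accelerated loop that tests a point only against the polygons stored under its cell. A flat grid stands in for our quadtree, and the census-tract data is made up:

```python
def point_in_polygon(pt, poly):
    """Ray casting: shoot a ray to the right of `pt` and count edge
    crossings; an odd count means the point is inside. `poly` is a list
    of (x, y) vertices."""
    x, y = pt
    inside = False
    for i in range(len(poly)):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % len(poly)]
        if (y1 > y) != (y2 > y):  # this edge spans the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:
                inside = not inside
    return inside

def count_points(points, polygons_by_cell, cell_size=1.0):
    """Index-accelerated point-in-polygon: find the point's cell first (a
    flat-grid stand-in for the quadtree tile lookup), then ray-cast only
    against the few polygons stored under that cell."""
    counts = {}
    for x, y in points:
        key = (int(x // cell_size), int(y // cell_size))
        for name, poly in polygons_by_cell.get(key, ()):
            if point_in_polygon((x, y), poly):
                counts[name] = counts.get(name, 0) + 1
    return counts  # e.g. crimes per census tract, computed on the fly

tract = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
cells = {(cx, cy): [("Tract A", tract)] for cx in (0, 1) for cy in (0, 1)}
print(count_points([(0.5, 0.5), (3.5, 0.5)], cells))  # -> {'Tract A': 1}
```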
So what does this give us? It gives us a map of which buildings in San Francisco are popular — which buildings have people in them, for whoever the users of this app are. There are still a lot of things we obviously want to do, like normalizing based on area. But this is some of the work we're doing that's a little more experimental: we're able to figure out, kind of in real time, which part of the city is popular right now. And just like I was able to filter the SF crimes on drunkenness and have the choropleth update, this is a choropleth that works the same way. I can filter on time of day and show the residential areas lighting up, and then change the time of day and see the business areas light up. What category of user is it — male or female, what ages, and so on? You can see very interesting things. This is the entire data set, but here, for example, is the hospital, UCSF, and so on. And so this is the combination of a lot of what I described at the very beginning of my talk: there are massive, real-time ingest data sets, which is what's behind this, and by doing nice spatial algorithms and sending vectors to the client, you can make a map like this relatively easily.

I'm going to give another problem-solution pair, and I'm going to have you guys make an assumption. The assumption is — and this is a feature we have in our tile quadtrees, excuse me — suppose you have a polygon that sits in multiple tiles, multiple key values, so you've actually chopped it. You have a polygon like a circle, and it sits in two tiles, and each tile just has half of that polygon. If you're just concentrating on visualization, that's okay: you can display half here and half here, they'll line up, and life will be good. But in many cases it's helpful if the polygons you've indexed in this tile index are actually mergeable — you can grab those tiles and merge the pieces of the same polygon from the associated tiles back into one polygon. So this is the assumption I'm going to ask you guys to make: assume you have a tile index where polygons are mergeable. And this is something we've done; our polygons are mergeable. Now I'm going to throw out a problem to you. Does everyone understand this idea of mergeable? The features are mergeable across tiles: if you have a road that sits in many tiles, you can grab all those tiles, pull them into memory, and then reconstruct the entire road or the entire polygon.

Okay, so now, we ran into a problem at MapSense. This is the last problem I'll go over — wow, time. The problem was this: our base map is derived from OpenStreetMap. And OpenStreetMap does this interesting but also kind of frustrating thing where administrative boundaries, like countries and states, aren't clipped to the coast — they actually include the nautical boundary. I'll show you; I think I have that tile here — this doesn't show it well. You can see: this is the border of a country, and this is the nautical border. So if I want to make administrative maps of country boundaries, I'm going to get these weird nautical boundaries. That's really interesting for some tiny minority of people, but not for most people, and definitely not when you're trying to make a map of the world to be used by people who don't know what that means.
And so we were left with the problem that we wanted to intersect every single admin polygon — every single country and state — with the coastline file. Now, the coastline file is that massive file I kept alluding to over and over again. And we had to do many, many intersections: states and countries together are tens of thousands, even hundreds of thousands, of polygons, and we wanted to intersect each one with the entire coast. Given our mergeable tiles, we were able to do this really, really nicely. And the funny thing is, when we were tiling, we happened to spend a lot of extra time making the tiles mergeable without really realizing the benefits — we thought, okay, maybe we'll render them, maybe we'll do client-side processing and want the full feature geometries; we didn't really know what we'd do with it. And then this came out as a beautiful algorithm that we jumped on later, when we needed to solve this problem.

The idea is this. Say you want to clip the state of California to the coastline. Take its bounding box and compute the tile cover of California — the minimum number of tiles that cover the entire state. Go to the coastline file, which has been tiled, grab those corresponding tiles, merge the coasts that sit in those tiles, and then do an intersection on a much smaller set of polygons. Now you're intersecting California only with the coastlines it might actually intersect, because they share the same tiles — not with coastlines in Sri Lanka. Traditionally, if you were to just run this masking in PostGIS, you'd run it against the entire coastline file, and it would take something like 72 days to clip the entire world — which it looks like some of you might have run into. So this is just another trick, another problem-solution pair we've come to. This image is the result of some testing — we're really big believers in integration and unit tests at MapSense. This is random data, where the light orange is a random coastline and the darker colors are random states, and you can see we're only intersecting each state with the coastline that actually shares its tile cover.
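Here's a rough sketch of that clipping trick, with the shapely library standing in for our geometry engine and a flat grid standing in for the quadtree; `coastline_tiles` is a hypothetical map from a cell key to the mergeable coastline fragments stored there:

```python
from shapely.geometry import box
from shapely.ops import unary_union

def clip_to_coast(state_geom, coastline_tiles, cell_size=1.0):
    """Tile-cover clipping: compute the grid-cell cover of the state's
    bounding box, merge just the coastline fragments stored under those
    cells (this is where mergeability matters), and intersect. Coastline
    on the other side of the world is never touched."""
    minx, miny, maxx, maxy = state_geom.bounds
    cover = [(cx, cy)
             for cx in range(int(minx // cell_size), int(maxx // cell_size) + 1)
             for cy in range(int(miny // cell_size), int(maxy // cell_size) + 1)]
    fragments = [frag for key in cover
                 for frag in coastline_tiles.get(key, [])]
    local_coast = unary_union(fragments)  # fragments re-fuse into one shape
    return state_geom.intersection(local_coast)

state = box(0, 0, 2, 2)                             # pretend this is California
coastline_tiles = {(0, 0): [box(-1, -1, 1.5, 3)]}   # land stored tile by tile
print(clip_to_coast(state, coastline_tiles).bounds)  # -> (0.0, 0.0, 1.5, 2.0)
```

All the savings are in the cover step: the union and the intersection only ever see the handful of fragments near the state, which is the difference between seconds and those 72 days.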
So I wanted to spend the last five or ten minutes — now that I've shown you a bunch of problems and solutions — giving you a sort of culmination: what do we do now at MapSense, with the things we've run into and solved? I don't know how much time I'll have, but I'll go as fast as I can. I'll continue with this data set right now. This is the data set of crimes in San Francisco — those 15 million lat-longs. I showed you how I can drill down on particular categories. Prostitution is one that has a particularly strong spatial trend: there are basically two clusters in the city where arrests for prostitution occur. And you can also see from my time filter on the bottom that these are really trending downward. So this was a problem — I mean, it still is a problem today, but it's significantly less of one. One of the things that's nice about how we index data at MapSense is that we also index unstructured data — text. Associated with each one of these crimes is the actual text description that the officer wrote when making the arrest, and we can search those text descriptions.

I showed this at an earlier speaking engagement, which some of you were at, but it's interesting to search for different drug types to see how those drugs are distributed across San Francisco. If you look for all arrests where the police report contained the word "marijuana," this is the resulting distribution. You can see that the Tenderloin features a lot, but also this is the Haight, where a lot of the hippies in San Francisco live, and this is the Mission, and there's North Beach. But if you look at the distribution of cocaine, for example, it's really concentrated in the Tenderloin, along with the Mission, but not so much in the Haight. The last and funny one is hallucinogens. Can anyone guess where this is going to be distributed? In the Haight-Ashbury.

I also want to show you a data set we've started to work with recently, coming from the California Condor Project. This is several different organizations working on protecting the condors. The data they get is basically coming from tags on the birds. There are 120 condors in the wild; about 70 of them have GPS tags, and they're streaming their location data every single day. So we have the data — we've indexed the entire data set — but I'm going to zoom into a particular week of data. This is three days' worth of data, and each of these points represents a reading. I'm going to color the points by bird — each individual bird. So this is bird number 374 and his pattern. One of the most interesting things is to play with their speed: was the bird flying around, was it commuting, or was it nesting or eating or mating or whatever? So to look at birds that are actually flying, I'm going to drill down on just 10 to 20 kilometers per hour — and I'm just going to zoom into one day's worth of data here. We just added some more functionality around temporal data, so we can actually see this bird flying around. These are the patterns for this guy, number 374. And so this is used by — sure, yeah?

[Audience: This is super impressive — and to be clear, this is an observation, not a question — it looks like some of that filtering is happening client-side.] So this is a really good point. We do a combination of server-side and client-side filtering, and we try to abstract it away from the user in the UI, so the user doesn't really need to think about what's going on. But yeah — when I do this, that's completely client-side. When I do this drill-down functionality, if you notice, that's hitting a server. So we've made the UI so the user doesn't know what's going on, but one of the tricks behind our really, really fast filtering is doing some of it client-side. It's a really nice trick you can do when you send vector data to the client. Good eye.

One of the things the condor field teams also do is look at when the condors are not moving. That usually shows where they're sleeping at night. And you can see how this tool — being able to slice over months or over a day or over a minute, across what I think is probably tens of millions of lat-longs — is a really nice tool for those teams to understand: okay, this is a tree this bird likes to hang out in. Or: on this day he hung out here; maybe we'll unlock time and just look at bird number 493.
We can understand his or her behavior over very long ranges of time. So I'll show you one more data set, which is the Twitter data set I started playing with before. I think it looks better in a dark color. This is a similar data set, except these are tweets, not condors. And again, we're able to index and search the data on free text. One of the things that's really nice with the Twitter data set is to actually search the text of the tweets. First, I'm going to color the tweets by their language. One thing I haven't mentioned, but you might have noticed, is that all my charts are dynamic given my extent. Right now I have the whole world in my bounding box, so this histogram is a good approximation of the distribution of languages on Twitter. But as I zoom around — say I'm interested in European tweets — you can see the histogram updating. Now there are a lot more Spanish tweets, or Turkish tweets are featured very heavily. All my charts update given my bounding box. As an example of this, I can type the word "bonjour" here. You can see France light up. You can also see, if you look at time on the bottom, that people say bonjour at the beginning of the day. If you put in "bonsoir," you can see time shift to the end of the day. And fewer people say it.

One of the things that happened while we were collecting this data set was the Coachella music festival. So I'm going to search for the word "Coachella" and show what happened during that festival. Coachella, for those of you who don't know, is a really big, famous music festival that happens down by Palm Springs. These are all tweets that mention the word Coachella. You can see my time filter at the bottom showing very few tweets leading up to the event and then a whole cluster of tweets at the end. So I'm going to play with time a little bit — oh, it's broken; I'm going to refresh this to make the demo more interesting. Okay. So this is a couple of days before Coachella, this is the day before, and then here's the music festival. A couple of days before, the conversation around Coachella is located basically around Palm Springs and LA — where the festival happens. But as I move time to the right, you can see the conversation become global. People in Europe, people in India, people in Manila are talking about Coachella. And if you click on a few of these, you can see that most people are lamenting not being at Coachella: "Wish I was at Coachella." "Why am I not being a hippie at Coachella?" And I didn't pre-can these — "I would give a lung to be at Coachella."

So this is really nice for a global analysis — which languages are talking about Coachella, and so on. But I can also zoom into the Coachella festival itself and look at the same interaction at a very, very local level. So here are the fairgrounds where Coachella happens. Oh — I zoomed in a little too far. You can see that two days before, nobody's here. A day before, they open up the campground — this is where people can stay, outside the concert itself. And then this is the concert: this is day one, and this is day two. You might be able to see four clusters here; those represent the stages at Coachella. Maybe — I wonder if this shows that. Yeah, so this is where Coachella happens.
And we can even — again, this is the Twitter data set, which has a lot of implications — start to look for influential people who were at this music festival, which is some of what our customers do. "Influential" measured by the proxy of how many followers they have. So I'm going to size the circles by the number of followers each user has on Twitter, and drill down on just the people with lots of followers. And I can even, given this data, pull up their screen names. So: color by screen name, drill down on this person — Elon Music — this person is a famous DJ who's at Coachella. So for organizations or researchers who are interested in finding people or slicing big data sets, this is some of our work.

I know it's 2:05, and I just want to end my talk by saying that if you're a developer, we're about to release a whole set of self-serve developer tools online. This is a set of JavaScript APIs for doing maps. This is a set of vector tiles to interact with those maps. You get the full features of OSM with all the metadata associated with each feature, so I can click on a park or a road and understand exactly what data OSM has on it. This is all data you'll get. And then there are lots of different data sets we're releasing. So, yeah, maybe we'll do questions. Thanks.

[Host:] We know some of you might have to go, but we have a whole half hour for those of you who have time to stay. We are trying to record this; the audio turns out all right, but because we don't have a microphone passed around, Aaron will do his best to repeat your questions.

[Question: Are all of your APIs standardizing on TopoJSON to send to the client?] Sure. There'll be several formats. I think TopoJSON is the one we want people to use, but we're also going to release a protobuf format, which will be faster, and I hope to release GeoJSON as well, for simplicity. So I think the three we'll release will be GeoJSON, TopoJSON, and protobuf. So right now, yeah. Yeah, we should talk — especially with open data sets. In addition to these map data sets, I think we're actually going to release some of those crime data sets and open data sets for use by developers. Our company has been really focused, sort of heads-down building, and the couple of customers we have are big private corporations that don't want anyone to see their data. But our goal is to start releasing more self-serve, and we're starting that with this release.

[On the condor project:] So we talked to the condor folks and interviewed them about their workflows, and basically they had an automated system that sent around a KML every day to the field team, which they'd open up in Google Earth — it didn't have much filtering at all, and it was really laggy. So yeah, they were super excited. We're super excited. That project's awesome. Our goal is to make them happy enough that they take us out to go see the condors. Are there any other questions?

[Question:] The word that didn't come up in the mapping description is the word "spherical" — because we're working in spherical coordinates, except in very small Cartesian spaces, and even computing areas in the plane... it screws everything up. Point in polygon becomes a very tricky problem in practice, an almost impossible problem when dealing with the spherical case. Impossible to get right. You don't really have to answer this in any detail, but how much does that loom over all the things you're doing and complicate things?
I'm not the best person to answer this question — that's all I can say, I think. [Question: Is there anything other than Mercator?] We do support, and will support, other projections, but our goal is actually to serve lat-long data and let you project it client-side, which is one of the things you can do when you're playing with vector data. The best person to answer this is our CTO, who deals with a lot of the topology and simplification and projection issues; I just can't speak to it.

Yeah, so this is still something we're figuring out as we release our developer tools. We actually have lots of other data sets. I think my favorite of all our data sets is the albatross data set. These are a bunch of albatrosses in the Galapagos. Here, let me just show you this one — I love this data set. Drill down on speed, size them by speed, color by ID, and you can watch these birds fly. These are birds that commute between the Galapagos and the mainland, and you can zoom in — this is about a million lat-longs. You can zoom in pretty far. You're not going to see it while I'm drilled down on speed, but you can actually see where they're nesting in this data set. So there's a bunch that nest here — this is a particular area where some of these birds are nesting. Anyway — yeah, I understand your question. Sorry?

Yeah, I think the way the developer tools will work is that you'll be able to push data. It's almost like uploading it to Google Drive, but you'll upload it to MapSense. We'll put these spatial indices on it. We can annotate data — so if you wanted the temperature added to every lat-long, or what administrative polygon it belongs to, we can annotate the data and then expose it back to you in an API form or in a visual form, like this tool. So that's going to be our developer product. Anyone can push data; there'll be nice endpoints for pushing and then querying back the data. And we'll probably do pricing based on the amount of data that you send. I think everyone who ingests data into our infrastructure will have access to everything I showed today. So yeah, that's how we're thinking about it: you can upload your data and then use it the same way I'm using it internally.

Sure, yeah — that's a great question, thank you. I mentioned that we're releasing a bunch of data for you to use. We're also releasing a D3-based mapping front end. This is an alternative to Leaflet.js or any JavaScript mapping library, like OpenLayers or whatever. The idea is that you get a D3 object that works across tiles. So it's very similar to D3 — the de facto graphics language in the browser today, for those of you familiar with it — but you also get tiled data sets. You get a D3 selection, you choose how you're going to color something, and then as you pan around and more data comes in, that selection gets applied to the latest data. So if you're familiar with D3 and you want to do maps, hopefully this will be really useful to you. You can do really cool things, like delaying the drawing like this. We have a whole bunch of tutorials. For example, here's an elevation map of San Francisco — these are five-foot elevation lines — and I wanted to make an app where I could hover over the legend at the bottom and then select, in the browser, the lines that correspond. So it's something like this. The number of lines of code to do this is extremely short.
It's basically this — and all of this is just putting the labels on the axis. Here I'm classing things by their elevation and coloring things by their elevation. It's a really, really nice framework. So stay tuned, and if you're interested in any of that, please sign up for our developer list.