Thank you. Hi, my name is Andy Eschbacher. I'm a map scientist at CartoDB, and a lot of what we do at work is push CartoDB to its limits and see what happens when we do that. Today I'm going to talk about a specific use case that came out of a news story, how it changed our view of what we can do with CartoDB, how we needed to put the pieces together differently, and how many more questions we can answer after doing things like this. That background is foreshadowing for what I'll be talking about. So, the summary of what I want to talk about: if you're not familiar with CartoDB, I want to show you roughly what the stack looks like. After that, how a Friday afternoon hack session turned into maybe the most fun project our team has done together in the last year or two. Then how we need to go beyond PostGIS, because there's a lot of analysis that isn't baked into PostGIS, and the open source community offers so many spatial tools and data-science tools that we want to integrate into our stack as well. And then where we're going next after that. So, an overview of CartoDB. I think a good introduction to CartoDB is to try to decipher what's going on with our logo. Any guesses for what our logo is? Okay, so there's an elephant. Yeah, perfect. The elephant is for Postgres, and it's riding a hot air balloon, which is up in the clouds, you know? So it's Postgres in the clouds, and PostGIS has an elephant as well, so it's a PostGIS database in the cloud. And that pretty much summarizes what CartoDB is: a cloud-based GIS tool. You can do data science off of it too, and you don't have to use geospatial data; it's a general database, so you can run any arbitrary SQL query against it. This is roughly what the stack looks like.
I'm probably missing a lot of pieces, but you can see the shape of it. On the left-hand side is what we call our platform: a set of APIs and related services that people can leverage to build custom applications. The right-hand side has been squashed quite a bit, but that's what we call the editor, and it's what most people use. We have over 200,000 users with accounts on CartoDB, and if you go to CartoDB.com and get a new account, the editor you see there is that squashed right-hand stack. The base of all of this is PostgreSQL, and CartoDB couldn't be CartoDB without the PostGIS extension, which is just an amazing tool for doing spatial analysis. Everything I'm going to talk about today is possible because we have such a robust database in PostgreSQL. I'm going to talk about some of our APIs. You saw on the left-hand side that we have a SQL API. If you're not familiar, an API is just a way for one computer to talk to another: you send a request for certain information, and the other computer sends it back. Our SQL API is essentially Postgres fancied up: you can send any arbitrary SQL query over an HTTP request and get back the data you want. It can be a SELECT * FROM a table, and you can get the data back as CSV, shapefile, GeoJSON, anything like that. Or it can be any arbitrary SQL query against your database: if you want to do thirteen joins, you can do it with a single API call to the cloud. That's maybe my favorite API that we have, and we have a couple of others as well.
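As a hedged sketch of what a call to the SQL API might look like: the endpoint format and table name below are from memory and purely illustrative, so check the current documentation before relying on them.

```sql
-- A query like this goes in the q= parameter of an HTTP request to the
-- SQL API, something along these lines (illustrative, not authoritative):
--   https://{account}.cartodb.com/api/v2/sql?q=<url-encoded query>&format=GeoJSON
-- Swapping format= between CSV, GeoJSON, etc. changes the response payload.
SELECT name, the_geom          -- hypothetical columns
FROM subway_stops              -- hypothetical table name
WHERE line = 'L'
ORDER BY name;
```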
Our Maps API is a tiler: it prepares little tiles that are put together to make a fully zoomable map, kind of like Google Maps, but with your own data and custom basemaps. These APIs make things possible: you can build web applications and mobile applications that send data to CartoDB through these pulsating magic rainbows, and that's where your data is stored. On top of that you get dynamic mapping: if somebody updates a row in your database, the map is automatically updated, so you have real-time maps driven by how your data changes. The sky is the limit, hence the rocket going up with the APIs. We have several APIs: location-type APIs for isolines, or for getting geometries from street addresses or administrative boundaries, things like that. We have a JavaScript library for building custom applications in JavaScript as well, and a lot of other things. And we have some new APIs coming out soon that my team is specifically working on; I'll talk about those toward the end. That leads into how a Friday afternoon hack session turned into a really fun blog post that we wrote. I was talking to my boss one Thursday afternoon, and just the day before, the news had broken about the L train, which is a vital connection between Manhattan and Brooklyn. CartoDB's offices are in Bushwick, in Brooklyn, and I take the L train every day. If the train went down, it would add about an hour to my commute, which is already too long. So it affected me personally, and it affects a lot of people in our office too. Something on the order of 250,000 to 400,000 people ride the L train every day, so this would sever a vital transportation link.
So I was talking to my boss, and I had been doing a sort of analysis that tries to find areas of interest on a map. He said, what if you combine that with census data and get the likely walking corridors of people going to L train stops? Then, if Uber wanted to place ads to target people who would be affected by the L train, we could estimate the number of people walking on each stretch of sidewalk. We said, yeah, cool, we'll try it out. Then I started talking to another colleague, and he said, we can combine this tool with that tool. And then Friday morning, this is some of the news. If you've ever been on the L train, it's way too packed going into Manhattan in the morning, and the same going back in the evening; you're packed in like sardines. I think all of us were mulling it over quite a bit that Thursday night, and the next morning we just dropped everything we were doing and dedicated all of Friday to working on this. It was a really nice team-building effort, and at the core of all of it was Postgres. All of our data was stored in Postgres, in the cloud, so we could easily share data sets and run analysis off of it just through API calls. It was really good collaboration, because Postgres is so dynamic that we could share common data sets, make updates, and things like that. Here's another news article: a lot of people in real estate would freak out if the L train were cut off, because there are a lot of gentrifying areas that are made possible by the L train. So we had some stages of grief with the L train. One stage that's not commonly listed is the data analysis portion. I think my team is a little different: when we see a problem, we think, how can we solve this with data analysis?
That was our tactic. After we had this conversation, we started to realize we could put all these different tools together, and our goal became: how can we contribute to the conversation about what's happening with the L train? So we started to put the pieces together. There's a lot of really great open data out there that's really hard to deal with, specifically the census data. There's a census product called LODES, from the Longitudinal Employer-Household Dynamics program, which tells you where people work and live at a pretty precise level, so you can actually get information about how they would likely travel. We also have open source routing that's available through Mapzen and some other services; it's really powerful, and we can use it within the CartoDB stack to answer some of these questions. And the projects I've been working on are spatial statistics: if we're looking at affected populations, are there any spatially significant clusters that would be adversely affected by this? So we mixed all of these pieces together. The first piece is estimating ridership. We wanted to look at the proximity of census blocks to all of the L train stops, and we did that in PostGIS; it's a nearest-neighbor sort of query. We found all of the census blocks that are closer to L train stops than to any other train stop in Brooklyn. We also needed the number of workers likely to commute to Manhattan from Brooklyn, and we used basic joins in Postgres to get that. A little more specifically: these are all the census blocks we found that were closer to L train stops than to the J or the A or anything like that. And for finding the likely commuters going from place to place, we had a lot of information in the LODES data set, so we could do some basic Postgres work with that.
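The nearest-neighbor step could be sketched in PostGIS roughly like this; the table and column names are hypothetical, but the shape of the query, a lateral join ordered by the KNN distance operator, is the standard way to ask "which stop is closest to each block?"

```sql
-- For each census block, find its single nearest subway stop using the
-- KNN operator (<->), then keep only the blocks whose nearest stop
-- serves the L line. Table and column names here are illustrative.
SELECT b.geoid, b.the_geom, nearest.stop_name
FROM census_blocks b
CROSS JOIN LATERAL (
  SELECT s.stop_name, s.train_lines
  FROM subway_stops s
  ORDER BY s.the_geom <-> b.the_geom   -- index-assisted distance ordering
  LIMIT 1
) AS nearest
WHERE nearest.train_lines LIKE '%L%';
```

A GiST index on `subway_stops.the_geom` lets the `<->` ordering use the index instead of scanning every stop for every block.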
For getting the foot traffic, we used Mapzen's Valhalla. They have a different public-facing name for it, I think it's just Turn-by-Turn or something like that, but if you go to their GitHub repo it's called Valhalla. With it, we can go from the center of a census block to the nearest subway stop and get the likely walking routes of all of these people. Then we add all of those little line segments together, using intersections in Postgres to get the number of people likely to walk along each little segment. They all add up like little tributaries, where the mouth of the river is a subway station. What's really great about PL/pgSQL is that you can rapidly prototype these ideas: you can use the data types available in Postgres and a lot of the built-in functionality to put the pieces together and get the answers you want. If you haven't seen Valhalla, this is Mapzen's website on it. It's a really nice open source routing tool, it's pretty easy to use, and they have a public API. The spatial statistics part was kind of my specialty. Specifically, I wanted to look at poverty rates in Brooklyn and how people who don't have as many economic resources would be affected. I found spatially significant clusters of high rates of poverty, and of low rates of poverty as well. To accomplish that, I used some PL/Python, which I'll get into a little later. This goes a little beyond what PostGIS can do, pulling in Python functions with a PL/Python wrapper. We wrote some custom Python libraries that mix and match the pieces we need, and this is what just the wrapper looks like. It's deceptively simple, and we can call it straight from the database.
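The tributary-style summing step could be sketched like this; the table names, columns, and the assumption that each route already carries its block's commuter count are all mine for illustration, not the actual pipeline.

```sql
-- Sketch: each walking route is a LineString from a block centroid to its
-- nearest L stop, tagged with that block's likely-commuter count. Segments
-- shared by many routes accumulate walkers like tributaries feeding a river.
SELECT seg.segment_id,
       SUM(r.commuters) AS estimated_walkers
FROM street_segments seg          -- hypothetical table of sidewalk segments
JOIN walking_routes r             -- hypothetical table of Valhalla routes
  ON ST_Intersects(seg.the_geom, r.the_geom)
GROUP BY seg.segment_id
ORDER BY estimated_walkers DESC;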
So you can just do something like SELECT * FROM cdb_areas_of_interest(...) and get back the information you want. The magic is all happening in the crankshaft.clustering part; that's the library I wrote with my colleagues to bring in some of this functionality. Specifically, there's a Python library called PySAL, developed at Arizona State University, where they do a lot of research into spatial statistics, and we're integrating a lot of it into the CartoDB stack right now. It's all through PL/Python, which we've found really flexible to use, and it works really easily with the SQL API. So not only can you do one-off spatial clustering, you can design an application that uses the SQL API to run dynamic areas-of-interest calculations if you're interested. All right, so all of this culminated in this blog post, and it was really fun to write; I think I have links to some of these. I want to show you some of the dynamic maps. My colleague Mamata is a cartographer. Before she came on our team, we all thought we were all right at mapping; now we all think we're terrible at it, and she does all of our maps for us. This map is an imagined scenario in which everybody in Brooklyn walks to the L train. It's a proof of concept for the early routing work we were doing, and she visualized the lines by how many people are walking on them. It's a basic choropleth-style map: the more people on a segment, the darker the color. It's kind of hard to see on the projector here. All right, and then some of the outputs. The one on the left I just showed you; the one on the top is the spatial clustering; and for the one on the bottom, we also looked at what other options people would have if the L train were to shut down. We did all of that in Postgres as well; all of this analysis was possible through Postgres.
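To make the "deceptively simple wrapper" idea concrete, here is an illustrative sketch of a PL/Python function around PySAL's local Moran's I. The function name, signature, and expected input columns are mine for illustration, not the actual crankshaft API; the PySAL calls follow the classic PySAL interface.

```sql
-- Illustrative sketch, not the real crankshaft signature. The caller's
-- subquery is expected to yield columns: id, attr (the value to cluster),
-- and neighbors (an array of neighboring ids).
CREATE OR REPLACE FUNCTION cdb_areas_of_interest(subquery text)
RETURNS TABLE (id int, quad text, significance numeric)
AS $$
  import numpy as np
  import pysal

  rows = plpy.execute(subquery)
  ids = [row['id'] for row in rows]
  y = np.array([row['attr'] for row in rows])

  # Spatial weights from the neighbor lists, then Local Moran's I
  w = pysal.W({row['id']: row['neighbors'] for row in rows})
  lisa = pysal.Moran_Local(y, w)

  # Quadrants: 1=High-High, 2=Low-High, 3=Low-Low, 4=High-Low
  labels = {1: 'HH', 2: 'LH', 3: 'LL', 4: 'HL'}
  return zip(ids, [labels[q] for q in lisa.q], lisa.p_sim)
$$ LANGUAGE plpythonu;
```

Once installed, it really is just `SELECT * FROM cdb_areas_of_interest('...')` from the client's point of view, which is what makes it callable straight through the SQL API.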
We kind of hacked CartoDB to make it all work about three months ago, but our goal is to bring all of this into CartoDB so that people can do it themselves through the CartoDB editor. I talked a little about rapid prototyping with PL/pgSQL. If you're not familiar with it, it's the procedural language that comes with Postgres, and we love it: it's really fun to write these functions, and you can write them really quickly. One that my colleague Stuart wrote is a basic dot-density function. He wrote it in five minutes, and with it we can do dot density off of the US census, where one dot represents 100 people, as a dynamic map you can view on the web. I don't have much time left, so I'm going to try to finish up. On integrating with Python: I've done a lot of experiments with PL/Python. I talked about the clustering work; I'm calling it areas of interest, which I feel is maybe a clearer way to describe what's going on. I did a blog post about it, looking at census data specifically. This map here, if the internet cooperates, is Generation Z, the people who are pre-millennials right now. You can see there are some hot spots for the young kids, and some cold spots too. There are a lot of patterns you can see this way that are hard to pick out of a normal map that just colors by value, because our eyes play tricks on us. Okay, oops. All right, other things we're doing: my colleague Stuart is using PL/Python to build in basic machine learning with scikit-learn, if you're used to that, and that's going to be available through our SQL API as well. So if you want to do machine learning in Postgres with a geospatial component, it'll be as easy as doing a SELECT * FROM a table.
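A minimal dot-density function in PL/pgSQL might look something like the sketch below; this is my own version of the idea, not Stuart's actual function. It rejection-samples random points inside each polygon's bounding box, keeping one point per 100 people.

```sql
-- Sketch: scatter floor(population / per_dot) random points inside a
-- polygon by sampling its bounding box and rejecting points that fall
-- outside the polygon itself.
CREATE OR REPLACE FUNCTION dot_density(geom geometry, population int,
                                       per_dot int DEFAULT 100)
RETURNS SETOF geometry AS $$
DECLARE
  n int := floor(population::numeric / per_dot);
  pt geometry;
BEGIN
  FOR i IN 1..n LOOP
    LOOP
      -- random point in the bounding box, kept only if inside the polygon
      pt := ST_SetSRID(ST_MakePoint(
              ST_XMin(geom) + random() * (ST_XMax(geom) - ST_XMin(geom)),
              ST_YMin(geom) + random() * (ST_YMax(geom) - ST_YMin(geom))),
              ST_SRID(geom));
      EXIT WHEN ST_Contains(geom, pt);
    END LOOP;
    RETURN NEXT pt;
  END LOOP;
END;
$$ LANGUAGE plpgsql;
```

Rejection sampling is fine for compact census blocks; for long, thin polygons the rejection rate climbs, which is a known tradeoff of this quick-prototype approach.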
Other things we're experimenting with: I'm looking at doing market predictions, essentially, though it could be any variable, like an environmental variable or pollution or something like that, using Markov chains. There's a spatial variation of the method that tightens up the analysis quite a bit. The data here is from Zillow; they have a housing index that roughly gives the sale price on a month-by-month basis, and I built a model to make predictions off of it. The basic insight about geospatial data is that none of these little areas is an island unto itself; they all have neighbors that influence each other. A lot of the models used in data science don't take your spatial neighborhood into account, but zip code boundaries obviously don't stop market trends. Maybe state borders do, but depending on your geography, you need to take into account what's going on with your neighbors. Here's one of the outputs: a bivariate choropleth, showing two different variables at the same time, and you get a lot of subtleties displayed in this map. All of this was done in CartoDB, on kind of a hacked version that I have right now, but it's going to be built into one of our APIs fairly soon, so you'll be able to do these market-style predictions. So what's next? One of the big things my team is working on: early on, I think CartoDB was really successful at making maps and GIS available to a lot of people very easily, and what we're trying to do this year is pivot maps from being seen as just a data visualization tool to a data and analysis tool. We're working really hard on that. A lot of it is possible because Postgres is so flexible that we can build in these PL/Python and PL/pgSQL functions. That's our goal this year.
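As a hedged sketch of the spatial Markov idea, assuming the classic PySAL API and illustrative input columns (the function name and expected subquery shape are mine, not a shipping API):

```sql
-- Illustrative sketch: condition home-value transition probabilities on
-- the state of each area's neighbors, using classic PySAL's Spatial_Markov.
-- Expected subquery columns: id, neighbors (array of ids), values (a time
-- series of index values per area).
CREATE OR REPLACE FUNCTION cdb_spatial_markov(subquery text)
RETURNS text AS $$
  import numpy as np
  import pysal

  rows = plpy.execute(subquery)
  series = np.array([row['values'] for row in rows])   # n areas x t periods
  w = pysal.W({row['id']: row['neighbors'] for row in rows})

  # One transition matrix per neighborhood class: areas surrounded by
  # rising markets transition differently than areas surrounded by falling ones.
  sm = pysal.Spatial_Markov(series, w, fixed=True, k=5)
  return str(sm.P)
$$ LANGUAGE plpythonu;
```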
Our goal is to make a lot of the big, hard data sets really easy to deal with so people can use them in their analysis, and also to cherry-pick some really powerful analytical tools that are locked up in academic papers right now and make them available to all users. And, as I mentioned with the machine learning work, we're going to bring that into our tool as well. Cool, so thank you. Are there any questions?

Yep, yeah, you can have user-defined PL/pgSQL functions. You just write them: we have a little SQL tray, you paste the function in there, apply the query, and then that function is available in your database.

Yeah, nope. PL/Python is an untrusted language, because you can access the file system, so somebody with privileges needs to install those functions. Right now we're doing it through an extension that we're merging with our stack; people can make pull requests to that extension and we'll evaluate them for safety, but that's the only way to get code in. CartoDB is open source, though, so you can install it on a server you manage yourself and enable PL/Python there as well.

PySAL, P-Y-S-A-L, the Python Spatial Analysis Library.

You would have to do a custom job for that one, yeah. So I guess the short answer is yes, but the longer answer is that it would require a bit of work either way, depending on what you want to do.

For what? So, in terms of basemaps, we're agnostic to whatever basemaps you bring in. We have a couple of default ones that pull OpenStreetMap data, which is by far the best open data for that, but if you want to use Bing Maps or Google Maps or something like that, you can use those as well. Well, thank you.