 Hey everybody. Next up we have Christopher Morvek and he'll be talking about trends in data in time and space. So let's give a round of applause. Good morning everybody. I'm Christopher as she said. My background is in geospatial analysis and software development. So this presentation will come in two parts. It will come in some things that I think are kind of... I don't want to call them mistakes because they can be totally valid decisions to make, but things that I see people do that can lead to bias in your data when you're analyzing it from a geospatial point of view from where it is on the earth. And then the second part will be a fun little journey that I've been on for the last three weeks with an organization here in Portland called Hack Oregon and some interesting analysis that we did on some data there. So first part of this is about location. So I think everybody understands these different types of location. We've got lat long, right? So we've got two numbers, comes from a phone, GPS, something like that. There's this concept called spatial reference and datum, which I promise I will not bore you to death with definitions here. And we'll come back to that in a minute because it's important sometimes. We've got other types of coordinates. So if you've ever worked with data from a local government, then a lot of times it will be in something called a state plane. We use these different types of coordinate systems and spatial references because sometimes we want to preserve location. Sometimes we want to preserve area or distance. There are different purposes for different things. So just be aware of them. We've got addresses and geocoding. That problem is actually way harder than everybody thinks. Everybody is kind of used to, oh, I'm just going to go to Google and I'm going to type that in. 
Well, when you have 10 million address points that you need to convert from addresses to a point on a map so you can figure out where they are, you can't type those into Google anymore. And Google gets upset when you try. So you have to come up with different creative ways to figure out how to geocode that and how to get the data to power such a thing. Then there's kind of other simpler ways to locate things, zip code, cities, counties, states, countries. A lot of times, especially with addresses, it's really easy to say, oh, look, I've got zip codes, bam, I know where that is, right? And that can be safe, depending on your scale. And we'll talk about that a little bit. And so these other kind of different scales of location type can be very important. So with that, we'll talk for a few minutes about scale. So this is a silly map, right? You can't really see anything on this map, but it's world cities that are above a certain population, I don't remember which. Anyway, it's world cities and they are symbolized by population. So there's all kinds of weird things going on in here. And they're overlapping and you can't really see anything. But the point is, is that if I had information about cities, I could do analysis at a global scale and I could get something out of it, right? If I zoom into France and I have only information about 10 cities in France, can I really say that I have trends within France? And that's kind of hard to say, right? So how you locate something is really all based on scale. So cities work for high level scales. If I was going to work within the Portland city limits, cities don't work, right? Because I'm basically just going to have one, or I might have cities nearby, but I'm not going to have enough spatial trend to mean anything. So the natural next easy thing is a zip code. This is a super pet peeve of mine, so I'm sorry. But in Portland there are roughly 38 zip codes that intersect in some shape or form, the city limits. 
So you can see the dark color, that's the Portland city limits, and the rest of it is zip codes. So you can see there are ones that are wholly contained, there are ones that stick out, there's one that sticks out a little bit, one that sticks out a lot. There's all kinds of weird things going on here, right? So if I have information that's bound by the city limits and I locate it by zip code, what have I done? Because if I'm in that zip code at the bottom, that's only a very small amount of my city, right? So now I've introduced spatial bias into that data by locating by zip code. So what else could I do if I can't use zip codes? Well, I could make up a grid. It's totally arbitrary, and I could locate things and fill them in that way. Of course then I need to take addresses and get XY data out of them, so it's a much more complicated process at that point. So a little bit more about zip codes and why this is important. So I think everybody probably knows that Michigan had a lead problem. One of the reasons that they didn't find it very quickly is that the people doing the spatial analysis were grouping addresses by zip code. And when they did that you can see, see if my mouse works, you can see there's zip codes like this one up here, right? Half of it, if I stretch it, is inside the city limits in this lighter green color here, and the other half is out. So if I have data that's telling me that something is going on inside the city limits, which is a different water supply than outside the city limits, I've just averaged that problem away, right? So zip codes provide an interesting problem here. Zip codes are also kind of weird because they're not actually polygons. We always draw them as polygons, but they are not. They're actually road segments, designed by the Postal Service to decide in what order and how to get mail from one place to another, and introduced in the 60s.
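To make that averaging problem concrete, here is a minimal Python sketch. The zip codes, the inside/outside flags, and the lead readings are all invented for illustration; none of it comes from the actual Michigan data. It only shows how grouping by a zip code that straddles the city limits can blend two very different populations into one unremarkable number.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical samples: (zip_code, inside_city_limits, lead_reading).
# Every value here is made up for illustration only.
samples = [
    ("00001", True, 30.0),   # straddling zip, city water supply
    ("00001", True, 26.0),
    ("00001", False, 2.0),   # same zip, outside the city limits
    ("00001", False, 2.0),
    ("00002", False, 3.0),   # zip wholly outside the city
    ("00002", False, 1.0),
]

def mean_by(records, key):
    """Average the reading (third field) within each group chosen by `key`."""
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec[2])
    return {k: mean(v) for k, v in groups.items()}

# Grouping by zip code alone blends city and non-city readings.
by_zip = mean_by(samples, key=lambda r: r[0])

# Splitting the same zip by the city boundary keeps the two signals apart.
by_zip_and_city = mean_by(samples, key=lambda r: (r[0], r[1]))

print(by_zip)
print(by_zip_and_city)
```

In this toy data the straddling zip averages out to a middling value, while the split by city boundary shows one high group and one low group. The grouping choice, not the data, decides which story you see.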
And a friend reminded me yesterday that there's some really fun old mailings and things like that that the government created, with Mr. Zip, when they introduced the zip code in the 60s. And the whole thing was trying to get people to put it on their mail, because until then you didn't have a zip code. So you literally just put the address, and so once it got to the state, it was really hard to sort mail, especially in big states, right? So zip codes fixed that problem because they divided states up, and they were able to route mail very quickly just by looking at the zip code, and they didn't even pay attention to the address until it got to the destination post office, right? So that helps if you're a mail carrier. If you're trying to do spatial analysis, it kind of messes things up. So these are not really contiguous areas, right? They're pieces of street. They often cover, and this is a good example, both urban areas and rural areas. So I'm doing weird averages here when I combine that data together. They have very little or no socioeconomic connection, unfortunately, because they're just random. Well, they're not random. They're organized by post office. But the other problem is that everybody uses them. Oftentimes, if you are getting, like, car insurance, they ask you what your zip code is and that's how they decide how to charge you. My guess is, I mean, if I was running a car insurance company, I'd probably want to know the difference between inside and outside the city, because that probably makes a difference in the risk of that person driving a car, right? So better choices: census blocks, census tracts, things like that that are designed to more accurately represent groups of people. Neighborhoods can work, especially as a communication mechanism. Once you understand the patterns in your data, it's okay to go back and regroup things differently to try to communicate, because a lot of times people see this and they're like, what did you do? I don't get it.
Why is my city in a grid, right? So you can go back to neighborhoods. So this whole problem here is called the modifiable areal unit problem. What that basically means is that as a geospatial person, you can introduce bias into your data based on how you select your spatial groupings. So it's really important to select the right spatial grouping. Okay, I'm done with zip codes for now. Okay, so spatial references and datums. I used to have this boss, and I felt sorry every time we hired a new person, because for the first week you sat with him and you did this: spatial references and datums. He firmly believed that everyone needed to understand how to make super accurate data, and while he wasn't wrong, he probably didn't need to torture every new employee with a week of this. And so he used to say, and I love these quotes, the first lesson was: the earth is round and maps are flat. Okay, right? So if you think about that, that's kind of a tough problem to solve, and we'll talk about that in a second. The second lesson, which is my favorite: the earth is round, sort of, right? Kind of a pear shape really. That makes it actually way harder to put it on paper. So we have all kinds of distortions that get put into this. So if we just keep it simple and we talk about latitude and longitude, I'm pretty sure everybody's familiar with the idea of how lat long works on the globe, right? So remember each one of those squares represents a certain number of degrees of latitude and longitude. They're bigger at the equator, smaller at the poles. That's some serious distortion, right, when I try to lay it out flat. Okay, so that's something to be aware of, because distances vary. Calculating distances straight from lat long degrees is basically totally wrong. Okay, you can't really do it. So several years ago, a big company created, well, actually bought another company and created Google Earth out of it, and now we have web maps.
And we created a projection based on a Mercator projection called Google Web Mercator. That's the common name for it. And we tried to solve some of these problems by making it really easy and really fast to render data on the internet. And we left behind all of the problems of how we compute distances. So we have to do a lot of work to figure out how to compute distances in things like Google Web Mercator, even though Google tries to make it super simple. Okay. I promise we're almost done with the boring part of this presentation. The last thing is about datums. Datums are, and I try to explain this simply because it's messy, but basically the Earth has different surfaces, right? So different locations on the Earth have different base elevations. Zero elevation is not always the same everywhere on the Earth. So we use datums to try to differentiate those things, okay? And we can use local datums for local data. We can use global datums for global data. But to illustrate the problem, those two coordinates are the coordinates from Google Maps when you type in the Elliott Center in Portland, okay? The green dot is if I take those two numbers and I say these are in datum A, which is what's called WGS84, which is what Google Web Mercator and the web maps in most places are based on. The blue dot is when I say it's in NAD83, which is a slightly less common datum, but still very useful if I'm working with data inside the continental United States. They're four feet apart, okay? Now, if I'm gonna aggregate something by zip code, don't do that. If I'm gonna aggregate something by polygon, it doesn't matter. If I'm building a self-driving car, that's the difference between hitting the pedestrian on the sidewalk and being in the lane correctly, right? So these matter, but it's all about scale at the end of the day, okay? All right. Okay, one more thing, and then we'll be done with the boring part.
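As a rough illustration of why raw lat long degrees and Web Mercator units are both poor stand-ins for distance, here is a small Python sketch. It assumes a perfectly spherical Earth with a mean radius of 6,371 km, which is a simplification (real work would use a geodesic library and a proper ellipsoid). The function names are made up for this example. It measures one degree of longitude at a few latitudes with the haversine formula, and computes how much a Mercator-style projection stretches distances at the same latitudes.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean radius; spherical approximation (assumption)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/long points on a sphere."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def mercator_scale(lat):
    """Factor by which a (spherical) Mercator map stretches lengths at a latitude."""
    return 1.0 / math.cos(math.radians(lat))

# Equator, roughly Portland's latitude, and 60 degrees north.
for lat in (0.0, 45.5, 60.0):
    one_degree_km = haversine_m(lat, 0.0, lat, 1.0) / 1000.0
    print(f"lat {lat:5.1f}: 1 degree of longitude = {one_degree_km:7.2f} km, "
          f"Mercator stretch = {mercator_scale(lat):.2f}x")
```

The same one-degree step covers about half the ground distance at 60 degrees north as it does at the equator, while the projected map stretches it by about a factor of two there. Either way, a distance read directly off the coordinates needs correcting before it means anything.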
Analyzing data, heat maps versus hot spots. So heat maps are basically a way to take point data and look at density so I can color a bunch of points to tell how dense they are, but it's not telling me what a statistical trend is, okay? So here is an example of places to eat in Houston. And basically, this app, it was made with restaurant inspection data. We were actually looking to see if location within the city indicated at all whether or not someone would pass or fail a restaurant inspection. The answer is no. You are just as likely to fail whether you're in a very affluent part of town or a very poor part of town. It does not matter. Interesting. So this gives us a really easy way to tell where Chinese food is. It happens to line up with Chinatown in Houston. Perfect. We know the data worked. The thing that I found really interesting is that when you look at Vietnamese food, it was right on the other side of the highway. I didn't know that. Turns out there's really good Vietnamese food there. But heat maps also can hide data, and so when I switch to this local mode, these are restaurants that identify as local. Local cuisine? I don't know what local cuisine is in Houston, but, okay? So yeah, so I'm not seeing a trend, right? There's nothing here that I can really make out of other than, wow, yeah, restaurants are in Houston. Perfect. On the other hand, hot spots give us a way to look at statistical trends of data. Okay? Now these can hide problems just as much as the next thing, but I also have two modes here. I can use points, and I can look at their density like in a hot spot or like in a heat map, or I can look at an extra attribute. So like income level or number of victims in a crime or different things like that and look for trends both in spatial coincidence and in attribute coincidence, okay? So if I take one year's worth of crime data in Portland, and I put it on a map, then I get this. 
So statistically, this is telling us that there are more crimes significantly in the center of the city than there are on the edges. Well, I mean, we all kind of know that. I think most cities basically look like this when you look at that data. But for kicks, if we look at a heat map of the same thing, we see a similar story, but it's not quite as clear. There's other kind of pieces that look like, ooh, maybe they're a little bit more important, okay? So this was important because for the last three weeks I've been working with an organization called Hack Oregon, and their goal is to kind of work with the government and local data and try to make everything available to both people and back to the government in a useful way. And we had a really unique opportunity in that we were given 10 years worth of voter registration data for Multnomah County, which is where Portland sits. So we got from 2006 to 2016, so I guess really 11 years of data, it was registered voter and address. So this data is a little sensitive in that you can take a person and you can see where they've moved over the last 10 years, so it's a little sensitive. But it is also public, at least the most current year is. You can pay for it and get it anyway. There were several, between two and four files per year basically organized by election cycle. It was all in CSV, it was actually very well structured. Data types were super clean. It was actually really easy to ingest the data. What was hard was getting it into points on a map. I had to leave several behind because I couldn't get real geocoded addresses for them. So I ended up with about 3.8 million records and when you put it on a map, it looks like that. So I put it on a map and then they said, well, can you tell us where the young people are? Can you tell us where people are moving in the city? Can you look for trends in time and motion of people? Sure, let's look at a heat map. Okay, I don't know what that means either. 
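To make the heat map versus hot spot distinction a bit more concrete, here is a minimal sketch in plain Python. The talk does not say which statistic was used, so this assumes the Getis-Ord Gi* statistic, one commonly used hot spot measure, with binary weights over each cell's 3x3 neighborhood. The grid values are invented: a 3x3 cluster of moderately busy cells and one isolated, very busy cell.

```python
import math

def gi_star(grid):
    """Getis-Ord Gi* z-score per cell, using binary weights over the
    3x3 window around each cell (the cell itself included)."""
    rows, cols = len(grid), len(grid[0])
    values = [v for row in grid for v in row]
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum(v * v for v in values) / n - mean * mean)
    z = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [
                grid[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
            ]
            w = len(window)  # binary weights, so sum(w) equals sum(w^2)
            denom = s * math.sqrt((n * w - w * w) / (n - 1))
            z[r][c] = (sum(window) - mean * w) / denom
    return z

# Hypothetical 8x8 grid of point counts per cell (all values made up).
grid = [[1] * 8 for _ in range(8)]
for r in range(1, 4):
    for c in range(1, 4):
        grid[r][c] = 5      # a cluster of moderately busy cells
grid[6][6] = 8              # one isolated, very busy cell

z = gi_star(grid)
print("raw count at cluster center:", grid[2][2], " z-score:", round(z[2][2], 2))
print("raw count at isolated cell: ", grid[6][6], " z-score:", round(z[6][6], 2))
```

A density view would color the isolated cell hottest because it has the largest raw count. The z-score instead asks whether a cell and its neighbors together are unusually high compared with the whole grid, so the cluster stands out and the lone spike does not. A z-score above roughly 1.96 is the usual 95% threshold, ignoring any multiple-testing correction.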
So there are definitely hotspots. There's definitely concentration of people within different neighborhoods. If you're familiar with Portland, you can probably pick out where these hotspots are and you're kind of going, yeah, that makes some sense. So then we did a slightly more sophisticated analysis on it. So what we did is we did hotspot analysis. So we're looking for spatial trends for the number of people in a grid, in a grid cell in this case, and we did it over time. So we did it once for each year and we had to do a little bit of repair on the data to kind of fill in the bumps because it had these weird election cycles. It's not perfect, but we only had three weeks. So basically what this shows us is that there is no trend in motion other than the trend we expect. People move into the city. The inner city is becoming more dense. The areas around the city kind of have an oscillating motion. Some years it's increasing in population. Some years it's decreasing. And you get all the way out here to these darker red surrounded by white. And this is, and I forget the name of this, but it's the boundary of where the city... The what? Urban growth boundary. That's the one. Thank you. It lines up perfectly, which is really cool because that means our data is actually telling us something interesting. So with that, the, I think, interesting part of this is that when you're doing... Sorry, I can't see the top of my screen. When you're doing data analysis like this, at least I tend to get stuck thinking, oh, I need to put this on a map. Oh, I need to see where everybody is. And what we found very quickly was the wrong way to start. Because we couldn't figure out the trends spatially, so we had to find trends in another way. So we found some interesting things. So all of this was based on the premise that there's a housing crisis in Portland, and if you're from Portland, you probably agree with that. If you're not from Portland, you have no idea what I'm talking about. 
That's okay. Suffice it to say that housing is very expensive here for everyone across the board. One of the interesting things we found in this data: the average voter is getting older very quickly. Over the last 10 years, the average age has gone up about four or five years, which is kind of an interesting problem. We basically were looking for that trend in space, and we couldn't find it. So we said, well, maybe shared households will tell us something. So four or more voters at the same address, will we find anything? So what we found was that it significantly increased over 10 years, so it went from 4.5% to 9% or so. About 9% of all registered voters are now participating in a shared housing environment, so there's four or more of them living in the same place. And it was kind of like, well, that's interesting, right? So then you look at ages, everybody's still getting older. So then we looked at the distribution of age, and this is the one that kind of got us thinking that maybe we started wrong looking for patterns. So we basically saw two peaks. We see a young cohort and an older cohort, and that younger cohort we can all explain, right? Because young people live together in apartments to save money. Does the older cohort do that? Probably not. So it turns out it's actually more like intergenerational households. So it's young people living at home longer. When you look at the average age of the household, it fits right in between, and it's also getting older, which is a weird problem in Portland because basically everybody thinks young people are moving here to retire, and what's happening is we're all getting older instead. So we did end up going back to the map, and we looked at the age of households in neighborhoods. And what's interesting is this trend kind of goes from the outside, and it squeezes in. I don't know why.
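The shared-household measure described above (the share of registered voters living at an address with four or more voters) can be sketched in a few lines of Python. This is only an illustration of the counting logic, not the code used in the project. The record layout, the function name, and the sample addresses are assumptions, and real address strings would need normalizing before they could be compared reliably.

```python
from collections import Counter

def shared_household_share(voters, min_size=4):
    """Fraction of voters registered at an address that has at least
    `min_size` registered voters. `voters` is a list of
    (voter_id, address) tuples for a single year."""
    per_address = Counter(address for _, address in voters)
    in_shared = sum(1 for _, address in voters if per_address[address] >= min_size)
    return in_shared / len(voters)

# Hypothetical one-year extract: 10 voters, 4 of them at the same address.
voters_one_year = [
    ("v01", "100 MAIN ST"), ("v02", "100 MAIN ST"),
    ("v03", "100 MAIN ST"), ("v04", "100 MAIN ST"),
    ("v05", "22 OAK AVE"), ("v06", "22 OAK AVE"),
    ("v07", "7 PINE CT"), ("v08", "9 ELM RD"),
    ("v09", "31 FIR LN"), ("v10", "45 ASH WAY"),
]

share = shared_household_share(voters_one_year)
print(f"{share:.0%} of voters live in a household of 4+ voters")
```

Running the same function once per year of data would give the kind of time series the talk describes, with the caveat that an apartment building without unit numbers would be counted as one large household.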
I think that there's probably some causality in the fact that outer neighborhoods tend to be a little less expensive, so that's kind of where it started, and it moves in towards the center of the city. And the last thing we did was to try to understand where youth are in the city. We said, if you were 21 in 2006, and you started in one of these four neighborhoods here that are relatively well known in Portland, where did you go? The answer was kind of everywhere. Young people made it work in some way, and about 60% of young Portlanders remained in Portland. I don't know where the other 40% went. Some of them probably are still in Oregon. They're probably not very far away, but we don't have data for neighboring cities. Or even suburbs, really. But some of them probably also left. There's also a problem in voter registration data. When you leave a place, you don't tell anyone. You just go to the new place, and you eventually sign up. So there's some lag in here. People arrive, and we don't know exactly when they arrive. We make an assumption that they arrived when they registered to vote. Did everybody register to vote the last time they moved? Probably not. Yes, perfect. Thank you for keeping the data clean. The last thing: while we were doing this presentation on Monday night, somebody said to me, you can kind of see this, but it looks like there are some gaps in where people are traveling around the city. So I thought, well, we need a different way to look at it. So I built a heat map over time. I think this shows a little bit better how the concentration remains in the original neighborhood, which is kind of interesting, because of the 60% that remain in Portland, almost 40% of them stay in their original neighborhood. So they don't actually go too far.
But there are quite a few people that go all over, and so this is one of the reasons why we couldn't find a trend in all of the data: people moved so much that the negative motion you would see in a certain neighborhood turned into positive motion somewhere else, but somebody else was also moving in behind them. So it kind of all balanced out, which makes for kind of weird and problematic data. So we've only had it for three weeks. We've got a lot more work to go and a lot more due diligence to do on this data, but I think with that, I will take any questions if there's a few minutes left.
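The cohort question from the talk (where did people who were 21 in 2006 end up?) comes down to two ratios: how many of the starting cohort still appear in the data, and how many of those are still in their starting neighborhood. Here is a minimal Python sketch of that bookkeeping. The data structure, the function name, and the five sample voters are all invented, and the project itself may have worked quite differently.

```python
def cohort_retention(moves):
    """`moves` maps voter_id -> (start_neighborhood, end_neighborhood).
    An end_neighborhood of None means the voter no longer appears in
    the county data at the end of the period.
    Returns (share still present, share of those in the same neighborhood)."""
    stayed = [(start, end) for start, end in moves.values() if end is not None]
    if not stayed:
        return 0.0, 0.0
    same = [pair for pair in stayed if pair[0] == pair[1]]
    return len(stayed) / len(moves), len(same) / len(stayed)

# Hypothetical cohort of five voters (neighborhood labels are placeholders).
moves = {
    "v1": ("A", "A"),    # never left the starting neighborhood
    "v2": ("A", "B"),    # moved within the county
    "v3": ("B", "C"),
    "v4": ("C", None),   # dropped out of the county data
    "v5": ("D", None),
}

remained, same_neighborhood = cohort_retention(moves)
print(f"remained: {remained:.0%}, of those in same neighborhood: {same_neighborhood:.0%}")
```

Taking the talk's rough figures at face value, 0.6 × 0.4 would put a bit under a quarter of the original cohort still in their starting neighborhood. As the talk notes, a voter who drops out of the file may simply not have re-registered yet, so "left" here really means "no longer visible in this data".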