So I currently work as a researcher at the Center for Spatial Research at Columbia University. We're housed within the Graduate School of Architecture, Planning and Preservation, and my work and training are grounded in data-driven urban planning and policy, with a focus on the ethics and politics of data visualization. Much of my work, both at the Center for Spatial Research and before, explores the interplay between information landscapes and urban landscapes. Today I'm going to talk about a project I worked on a couple of years ago that I think speaks to many of the core challenges embedded in data-driven decision making in urban environments, to many of the ethical issues often hidden behind data-driven urban and public policy, and that is also a powerful illustration of, and reminder about, the kind of abstraction that goes into making any and all data. These considerations have a real urgency right now within the fields of urban planning and design, when it seems like every day there's a new proposal for how the city could be understood, governed, optimized, insert your word here, by algorithms, with no democratic processes needed. One of the things that I think is under-discussed in these debates is the inputs: the underlying data that goes into how we're supposedly making these decisions. In light of this, I'm interested today in the ways that sources of data often invisibly, and sometimes perniciously, shape public policy through the methods behind their collection. So in this project I explored how data was used and created by state and corporate entities in the aftermath of the 2008 financial crisis; I'll explain what this is in a minute. I focused especially on the role that imperfect data sets played in the disbursement of federal aid in response to the foreclosure crisis.
Through this project I surveyed the patchwork of available sources of key information in order to look at what the foreclosure crisis reveals about how we've begun to use data to depict, manage, and intervene in urban environments. The project's medium was maps: 165 of them, curated into an exhibit. Through these maps I created a time-lapse cartographic portrait of the rate of residential vacancy across the U.S., and also in four case study cities. Spanning the period from 2005 through 2013, the data set that I analyzed in order to make these maps is something called the U.S. Postal Service Vacancy Survey. I learned about this obscure data set because it was one of the inputs used in determining how federal funds would be allocated geographically under the Dodd-Frank Act of 2010, specifically under a U.S. Department of Housing and Urban Development program called the Neighborhood Stabilization Program. So in 2010, Dodd-Frank was passed by Congress and signed into law; this is some great applause from when that act was signed. Yeah, thank you, Getty Images. In addition to passing some, though I'll say not nearly enough, reforms for Wall Street to mitigate the kinds of financial and data dealings that got us into this mess, $1 billion was earmarked and made available to communities through the U.S. Department of Housing and Urban Development, designed to target the communities most severely impacted by the foreclosure crisis, those witnessing the worst effects. HUD determined how to directly target these funds to the communities of greatest need through a formula: the money was distributed across towns, cities, and counties according to this formula, which used several different data sets as inputs.
This is their documentation, just a screenshot, actually, that was released when the funding was made available. The formula relies primarily on delinquency and foreclosure filings from the Mortgage Bankers Association National Delinquency Survey, as well as data from a firm called McDash Analytics, which HUD used to train a model, built on publicly available data sets, that would predict serious delinquency rates at the census tract level across the U.S. And on top of this, because this third round of Neighborhood Stabilization Program funding was coming in 2010, when vacancy was really becoming one of the core impacts of the foreclosure crisis, it wasn't just that people were going delinquent on their loans and leaving their homes; communities were now really seeing the impact of this massive urban crisis in the form of abandonment. So because the goal of this tranche of funding was to target these neighborhood effects of vacancy, they added an additional data set: the U.S. Postal Service Vacancy Survey. The funding was distributed first at a statewide level and then more locally, to target local geographies, and they came out with dollar amounts that went to specific municipalities across the U.S. One of the things I found really striking when I was digging into the inputs that went into HUD's methodology is that most of the data sources that informed their models were held by for-profit companies, indeed many for-profit companies who were themselves profiting off of the foreclosure crisis that this policy was meant to address. You can't make this stuff up. So I was interested in investigating these underlying inputs, as I mentioned, to know more about how and what they represented of the United States. My goal with the project was not at all to dismantle HUD's funding allocation methodology.
Plus, by this point in 2010, which communities were hardest hit by the foreclosure crisis was fairly well known, and not only on the basis of when the peak number of mortgage delinquency filings were made. So, as the core public, non-proprietary data source, I set out to learn more about this U.S. Postal Service Vacancy Survey. This is a screenshot of some, though not all, of the columns in said data set. As it turns out, the U.S. Postal Service Vacancy Survey is created by postal workers as they go on their daily routes delivering mail across the U.S. At each home or business on their route, postal workers, in addition to dropping off mail, note whether the occupants of that address have been picking up their mail. The residences that are left empty for 90 days or longer are the ones represented in this data set, which you see flickering across the screen right now. The responses are recorded and then aggregated into quarterly snapshots, which are shared with the U.S. Department of Housing and Urban Development at the census tract level every three months. And I'd like to pause to convey how remarkable I think this is: every day, postal workers visit every address across the U.S., and they look at the address and write down whether there are people there. That's amazing. I was really drawn to this, to my mind, deeply poetic connection between the everyday, these visits of postal workers to addresses across the U.S., and the abstract numbers of vacant housing units per census tract for the entire U.S. Consider that the census takes ten years to orchestrate; a survey of all of the residences across the U.S. is such a massive undertaking that that much planning has to go into it. Nevertheless, postal workers are having encounters at front doors and post office boxes every single day, except for Sundays. So why does this data set exist?
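To make that aggregation step concrete, here is a minimal sketch, in Python, of how address-level observations like the ones described could roll up into the quarterly, tract-level snapshots HUD receives. Everything here is hypothetical: the field names, tract codes, and dates are invented for illustration, and the actual survey's collection interface and file layout are not public.

```python
from collections import defaultdict
from datetime import date

# Hypothetical address-level observations: (census tract, date mail was
# last picked up). In the real survey, carriers note uncollected mail on
# their routes; none of these values come from the actual data set.
records = [
    ("36061021900", date(2010, 1, 5)),   # picked up 85 days ago -> occupied
    ("36061021900", date(2009, 9, 1)),   # untouched for 90+ days -> vacant
    ("26163517200", date(2009, 8, 15)),  # vacant
    ("26163517200", date(2010, 2, 20)),  # occupied
    ("26163517200", date(2009, 6, 30)),  # vacant
]

def quarterly_snapshot(records, snapshot_date, threshold_days=90):
    """Roll address-level observations up into tract-level counts --
    the aggregated shape in which the survey reaches HUD each quarter."""
    counts = defaultdict(lambda: {"total": 0, "vacant": 0})
    for tract, last_pickup in records:
        counts[tract]["total"] += 1
        if (snapshot_date - last_pickup).days >= threshold_days:
            counts[tract]["vacant"] += 1
    return dict(counts)

snap = quarterly_snapshot(records, date(2010, 3, 31))
# e.g. snap["26163517200"] -> {"total": 3, "vacant": 2}
```

The point of the sketch is the abstraction itself: once the roll-up happens, the individual visits behind each count disappear from the released table.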
The first step, in my mind, in doing ethical work with data is understanding where it comes from. Why was it collected? What went into its collection that may or may not impact the actual information about the real world that it records? In this case, the U.S. Postal Service Vacancy Survey is collected in order to sell it to advertisers. It's the origins of junk mail. The U.S. Postal Service has a vested interest in knowing which addresses are active, because it can sell that information to advertisers who are willing to pay to send you unwanted pieces of paper. So the fact that the U.S. Department of Housing and Urban Development can use this data set as a much more fine-grained measure of household vacancies across the U.S. than you would get through the American Community Survey or the census is completely accidental, a really unintended byproduct of its original reason for being collected. But even as this accidental byproduct, it's a very powerful data set. Through using it, I was able to dynamically visualize the ways and places where the foreclosure crisis had erupted into a vacancy crisis. While foreclosure is always a crisis for individual households, vacancy is really the way that foreclosure becomes a collective crisis for communities, an agent of collective urban transformation. Behind each of these flickering census tracts are stories of loss and community transformation, powerful ones that deserve further inquiry. And through the project I made some interesting findings about the data set itself. For example, Phoenix, Arizona, one of the cities I focused on, reached its peak vacancy rate for the period I looked at in June 2010, when 6.72% of all homes in Phoenix were marked in the data set as vacant for 90 days or longer.
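The rate behind a figure like that is simple arithmetic: long-term vacant addresses divided by all residential addresses in the geography for that quarter. A sketch; note that the counts below are invented to reproduce a 6.72% figure and are not Phoenix's actual address counts.

```python
# Vacancy rate for a geography in a given quarter: addresses vacant for
# 90+ days over all residential addresses, expressed as a percentage.
def vacancy_rate(vacant_addresses, total_addresses):
    return round(100 * vacant_addresses / total_addresses, 2)

# Illustrative counts only (chosen to yield a 6.72% figure like
# Phoenix's June 2010 peak -- not the city's real address totals).
rate = vacancy_rate(40_320, 600_000)  # -> 6.72
```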
Miami-Dade County, Florida, and New York City both reached their peaks by September of 2010, whereas Detroit, Michigan shows an altogether different pattern, with vacancy rates continuing to rise through 2013. It's worth noting that HUD, in their funding allocation formula, uses the number of vacant homes as of June 2008, which is well before the number of vacant homes peaked for the U.S. as a whole or for any of the cities that I looked at. I also discovered that there are major inconsistencies in the method of collecting this data over this period, as one would expect of a data set collected by thousands and thousands of individuals for the purpose of distributing junk mail, and that these inconsistencies led to large spikes in the overall number of residential addresses. The other thing I was especially drawn to here is how the data set reveals the logic of abstraction that's at work in all data. The national vacancy rate fully masks, and is yet dependent on, the individual routines of postal workers as they visit addresses across the U.S. These postal workers' daily walks, these individual stories, are hidden behind the CSV, or actually it was a DBF table. So the project combines data visualization and critical cartography, and uses both to explore tensions between the cartographic-statistical and everyday ways of understanding the built world. This split between the cartographic and the everyday is a principal interest of mine, something I've returned to in a number of projects, both at my current job at the Center for Spatial Research and before.
And the stakes for engaging with these kinds of questions become especially clear in the context of data-driven policy programs like the Neighborhood Stabilization Program, where you have this loop: from individual encounters with postal workers in the real world, to abstraction, which feeds into policy, which then has real impact back in the real world. This kind of tension is something that the geographer Trevor Paglen writes on in the context of cartography. I won't read this whole quote, but he talks about how the God's-eye view of cartographic analysis is often not helpful for depicting the relationships he's really interested in, relationships about the everyday, because the forms of power embedded in this cartographic view often obscure notions of fragmentedness and incompleteness that on-the-ground viewpoints do a better job of embracing. And while I completely agree with his insights, and I'm a big fan of his work, I would argue, and I think this project argues, that fragmentedness and incompleteness are not at all characteristics reserved for on-the-ground, everyday viewpoints. Instead, what's particularly insidious about the God's-eye view, or the God's-eye mode of big data analysis, is its claim to see or reveal an unfragmented and complete world: the claim that the U.S. Postal Service data set is a picture of all residential vacancies in the U.S. So by highlighting the relationship between the data set and its mode of collection, I think there's an opportunity to reveal precisely how fragmented this supposedly complete picture often is. In her recent book, Cathy O'Neil speaks to the disastrous consequences of fragmentedness, but in the context of algorithmic decision making, where for her, the God's-eye view of cartography is instead the all-powerful algorithm.
And she speaks about the ways that biased, skewed, and incomplete data sets are often one of the key ways that algorithms lead to unethical modes of decision making. Data is an abstraction: that's how it's useful, but that's also how it's dangerous. And I'll wrap up here. She urges users of data and makers of models to think about the data that goes into them, and to understand all of these inputs as moral questions. So I'll end with her quote here, and say that I completely agree with her concern that we need to do a better job of knitting together connections between technical models and the everyday people whose lives they seek to describe and change. One of the things that I found powerful about the vacancy survey is the way that it really does this: it captures the routines of these people as they go to every address in the U.S., and it's a powerful reminder that all of our work to transform the world through data needs to keep this in mind, so that data-driven processes can be powerful ones, but also deeply democratic ones. Thank you. I think there's time for a few questions if people have them.

[Audience question: I might be getting these two states confused, but looking right now, it seems like Alabama is a giant black hole on the map, with a hard edge right at the state border. What's up with that?]

That's something that would be good to dig further into. My guess would be that that area of northern Alabama in particular is fairly rural. There's an issue in the data set that I didn't speak about, which is that it also has a category called "no-stat" addresses, which puts a lot of fuzziness around it as a measure of vacancy in rural zones; it's a much more accurate measure in cities. So that could be one potential cause. Or maybe there's less of a directive in Alabama for postal workers to record this information than for their neighbors in Georgia and elsewhere. All good eyes and interesting questions. Yes.
[Audience question about how the inconsistencies in the data were uncovered.]

Yeah. So basically I uncovered those through a combination of looking. The data is released in quarterly snapshots, so it's a big undertaking to combine all of those massive DBF tables. I looked at it as a time series, as opposed to just looking at individual snapshots, which is how HUD used it in their methodology. To be fair, the fact that it changed over time is not a big deal for HUD's purposes, because they weren't actually averaging over time. But it still speaks to the inconsistency of it as a solid measure. There were literally spikes in the total number of addresses that existed in the U.S., by the tens of thousands, from one three-month period to the next. Craziness. Yes.

[Audience question: Is the methodology for this sort of collection documented at all? Or is it seriously just whenever the postal worker feels like something's been vacant for 90 days? Are they literally writing down, Tuesday these three addresses didn't pick up their mail, Wednesday they did? How do you get a list of statuses like that?]

Well, one of the things that I'll say is, I don't know. But one thing that's good about the way the data set is released: you wouldn't really want to have a day-by-day picture of every single home in the U.S.; that's really terrifying, actually. So HUD gets the data and makes it available to nonprofits; I work at a university and was able to get it that way. And it's aggregated, so that covers a lot of it. In terms of the methodology of how the postal workers do this, what they're supposed to do, what the interface is for how they're entering this information, I have no idea. And the methodology that they report is: we collect this information for our own purposes and then also make it available.
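A check like the one described, stacking the quarterly totals into a time series and flagging quarter-over-quarter jumps in the overall address count, could be sketched as follows. The totals and the tolerance below are illustrative, not the real figures from the survey.

```python
# Hypothetical quarterly totals of residential addresses nationwide.
# The real values would come from stacking the quarterly DBF releases;
# these numbers and the 50,000-address tolerance are invented.
totals = [
    ("2009-Q4", 145_210_000),
    ("2010-Q1", 145_230_000),
    ("2010-Q2", 145_310_000),  # +80,000 in one quarter: suspicious
    ("2010-Q3", 145_315_000),
]

def flag_spikes(series, tolerance=50_000):
    """Flag quarter-over-quarter jumps in the total address count that
    exceed the tolerance -- changes too large to be real construction."""
    flags = []
    for (_, n_prev), (quarter, n_cur) in zip(series, series[1:]):
        if abs(n_cur - n_prev) > tolerance:
            flags.append((quarter, n_cur - n_prev))
    return flags

spikes = flag_spikes(totals)  # -> [("2010-Q2", 80000)]
```

Treating the snapshots as a series rather than in isolation is what surfaces these discontinuities; any single snapshot looks internally consistent.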
So they sort of wash their hands of responsibility that way.

[Audience question about whether the Postal Service collects or releases other data sets like this.]

That's a very good question. It feels like there's a lot that they could do there. Frankly, when I was doing research to understand why this data set existed, this is the only data set I came across that they collect, or certainly the only one they make available. Yeah. Awesome. I know I'm up against time. Well, thank you all so much.