Hey everybody. Can you all hear me? Awesome. As mentioned, my name is Hunter Owens, and I work for the City of Los Angeles. This talk is "Towards a Taxonomy of Government Data," which thankfully got scheduled pretty late in the conference. I think we've heard a lot about a lot of government data over the last 36 hours or so, so now we can hopefully all categorize it.

Just a little background about myself. I'm a data scientist for the City of LA. Prior to joining the city, I worked at the Impact Lab, which was a consultancy for social sector organizations, mostly governments and educational institutions. I also worked at the Data Science for Social Good Fellowship for a while, and I helped work on projects with a charter school district in New York. When I lived in Chicago, I was at Chi Hack Night a lot, and now that I live in LA, I'm at Hack for LA. Go to your local brigade and do cool data projects. I highly, highly endorse them, and you get to play with lots of fun and interesting government data.

So, when I tell people I work as a data scientist for the city, this is what they think I do: that I have these massive categorized tables full of everybody's everything, your speeding tickets, your potholes, when your trash got picked up, whether you paid your taxes on time, and that I can say, you did not pay your taxes on time, and you have three parking violations on record. That is not true. What I actually do is this. I send follow-up emails all the time, trying to figure out how this data was collected, whether it even exists, why it was collected, who maintains it, who has it, whether some vendor was in charge of it, whether somebody retired. At the city, we have a huge wave of retirements, so people are literally walking out the door with data.

So what I end up having to do, a lot of the time, is start to think about what assumptions I can make about the problem. Say somebody comes to me and says, Hunter, we'd like to do something about the affordability of rental housing.
I'm like, great, so what data might exist about it? What's useful to have is a taxonomy, a sort of heuristic for what data is good, what data is bad, and where it might exist, because you see this sort of universal set of problems.

Actually, a few quick questions before I get in a little deeper. How many of you in the room work for a government or are funded by a government? How many of you are journalists or librarians? Because I think there are a lot of you here. Yeah, librarians are awesome. Thank you all for coming and listening to me; you know more about this than I do. And then finally, who in this room has used a government data set? All right, that's the answer I wanted. I need the full commitment of your mind, body, and personal brand to this talk.

So, who has asked these questions before? These are the common questions we ask. How do I get it? Who maintains it? What about all these outliers? I saw a talk where every old home in D.C. was built in 1900, because that's when they started counting: oh, this is an old building, so it was just built in 1900. Those sorts of weird outliers exist. And you have to constantly figure out all these problems before you can do an analysis, make a visualization, build a report, or do anything else with data, because you're often not the one who collected it.

So, I think we all need to think like biologists. Carl Linnaeus proposed a system for mapping the tree of life. All biological things, from bacteria to human beings, are somehow related to each other in a taxonomy. These are those cladograms you had to do in seventh grade biology class. They look like this. These are actually really, really useful tools for us, because they organize our practice of investigating government data, and that's what we are all doing. We're playing detective before we can do anything else.
A taxonomy helps us figure out what possible pitfalls there are with this data. I've played with a lot of government data, and from that experience I'm going to propose four main branches of government data: descriptive data, programmatic data, evaluative data, and not data. And seriously, that last one is a real category. These are proposals. Please argue with me in the question and answer section. I'm not wedded to these in any way, but I'm suggesting they will be useful for all of us as a community in understanding how governments collect and use data.

First category: descriptive data. This is your bread-and-butter government data. It's your census. It's your parcel shapefiles. It's your voter records. It describes a feature of the government unit. It's frequently released on a website or in a formal package. It's often mandated by law. Do you know how much the government values the census? It's in the Constitution. You cannot avoid a decennial census. It's that important to how the government functions. Voter records are a key part of that too. So this descriptive data is often the easiest, lowest-hanging fruit of open data. It's been around forever.

Our next category is programmatic data. These are byproducts of government programs. In my line of work, which is local government, we have the information systems. So I can tell you there is HIMS, BIMS, HMIS, TIMS, ZIMS, and ZIMAS, which are all information systems that describe housing. Yeah. So that's the housing information management system, the building information management system, and the homelessness management information system, which tracks where you house your homeless people; it is not the actual database full of homeless people that our vendor assumed we were collecting. That was a fun conversation. And these are full of errors compared to our official record data, like who voted, which we kind of assume is not full of errors, or the census, which has very cleanly documented error rates. They tell you exactly what's wrong with it.
You go into ACS data and every estimate comes with a plus-or-minus margin of error. Programmatic data, by contrast, is used to run a government program. So, for example, we use HMIS to keep track of who's using shelters in all of LA County. Actually, every single jurisdiction in this country has an HMIS. It's mandated by the federal government as a condition of receiving Housing and Urban Development funds. So if you are interested in homelessness data and who's using shelters, that's where you can go look. Obviously, a lot of it has PII concerns, so be careful what you ask for. But the key thing about this data is that it is collected much closer to real time than our descriptive data. Descriptive data is collected on a reporting period; they say how often it's going to be collected. Programmatic data is collected in effectively real time to describe our program.

Our next category is evaluative data. Who among you comes from a public policy background? Anybody? Or has read a lot of public policy papers? You can see that they have these magically clean data sets, and I wonder how they created them. They've got something like: if you cross-reference census data with the number of new construction permits issued in three census blocks, you see a relationship between these two particular features. Or you see how many people boarded the bus against how many people were driving to work. And they have these really nice, clean data frames. So what I argue is that this type of data, which is frequently used for the evaluation of policy tools, is actually a new form of data, even though it may have existed originally as programmatic or descriptive data. It is used in the government realm to determine how we make policy. It's generally created once. Try asking a public policy person: great, it's been three years since you did your study on bus route efficiency in greater downtown Portland, can you rerun it?
It's a total mess, because they've been playing around in Excel for so long. There's actually a fairly famous case of this. It's also a problem in economics. In the Reinhart-Rogoff paper, which was fairly important in determining how we left the financial crisis, they copied and pasted a line into their main Excel model off by one: they put it down one row rather than putting it in the header row. So they actually assumed a link between a country's debt and its ability to escape a recession where there wasn't actually a correlative link. This is the Excel error that changed the world. It's a fun story if you ever want to read into it, and it's a good reason to be very, very skeptical of Excel. So make sure to double-check when you're copying and pasting.

And the final category is not data. Which is to say, government employees do not view this as data at all, even though you and I in this community would see it as data. To them it's just this weird exhaust byproduct. Often you'll need to make an inference about whether it's data or not. To give an example from my personal line of work: ATSAC. This is ATSAC. Who sees data here? Anybody? This is all the traffic signal control for the entirety of Los Angeles. You could shut down LA if this room stopped existing, because the traffic lights would not change. What we've discovered is that, in the original implementation of the ATSAC program, it does not record data on when lights are changing, which means it's hard to do any sort of optimization to make traffic flow faster. So what we've started to argue is that the not data category is where you can be brought into the program to say, this is the data we should be collecting, this is what's useful for us to have. Another example is street sweeping routes. Those are actually not digitized in the city yet. They're just mandated. So we're working on digitizing those.
Those have been passed down from sign vendor to sign vendor. So one of the things you'll often see in working with government data is that you ask a pretty basic question and it turns out the data has actually never been collected. That's your opportunity to advocate for its collection and optimization.

Now, you may encounter a question like, what does housing look like? Or, at the federal level, how is the economy doing? These are the big meta questions you may be asked as part of your job as a librarian, a researcher, or a journalist who wants to investigate a certain subject. The key to sorting the data into one of these four categories is to ask your basic questions. Who collected this data? Why was it collected? Was it legally mandated? Look for legal mandates of data collection. This is how you can often find interesting time series data. Going back to my example of HMIS, it exists because the federal government has mandated for 20 years that we keep track, in a database, of who is sleeping in shelter beds every night. Also look for your vendors. Take GTFS, the General Transit Feed Specification: a lot of real-time bus data comes from one particular vendor that puts the GPS sensor in the bus. If you can see that your metro agency has a relationship with that vendor, you can go after that data.

So let's take an example. Is anybody in this room in the education data world? Anybody? Awesome, I'm glad I got one. So you might have seen this before. Forgive me, this is a local example. This is called the NWEA MAP. Education and government data love abbreviations, so I'm going to unpack this one: the Northwest Evaluation Association Measures of Academic Progress. This is an exam that is issued to students in, last time I checked, 15 or 16 states. And it's one of the three main formative assessments that you see handed out.
So if you get MAP data, this is basically what it looks like. I'm not showing you the real data. You see certain things you'd expect: school ID, student ID, fall-to-winter observed growth, fall-to-spring observed growth, winter-to-spring. Basically, it's a test that's issued twice a year, in the fall and in the spring. And they give out these things called RIT scores, and goal ones and goal twos, which are your progress on certain measures of learning. You can look into all the nuances of the NWEA MAP exam. But how can we make some quick assumptions about the NWEA MAP, to say, this is the type of analysis we could do with this data if we had it?

You know it's mandated by reporting laws or by contracts with particular charter school organizations. So you'll have it for every year, and it will cover every student. You don't have to worry that it's missing half the schools in New Jersey, because the state of New Jersey, or the state of Oregon, requires every student in certain grades to take the MAP. So now we have an assumption about this data that is fairly useful. We know the frequency is mandated by law. And we know it's a formative assessment, so we can assume it is used as an evaluative data set, because it's only issued at set points in the year. It's not your programmatic data. If you're in a school district, you're going to have your evaluative data: how do you do your teacher evaluations, how do you meet your standards and reporting requirements? That's your test scores. And then you have your programmatic data: which students are enrolled in which class, and how many of them get free and reduced-price lunch? That's going to be stored in a separate system called an SIS, which is a student information system. They're like the secret keys to the school.
In every movie you've seen where some kid hacks his way into the school's mainframe and resets all his grades to an A (or her grades, but it's generally a he, because pop culture and Hollywood, I'm sorry), those are SIS systems. There's one big vendor called PowerSchool. They actually own 23% of the grades issued by high schools in the United States. And it's legally theirs. It's weird to run up against that license agreement.

But anyway, now we know this is an evaluative data system, and we've been able to make certain assumptions about it. That means that if you had to walk into a project in the education space with no background knowledge, you would be able to make some fairly decent assumptions about what data you're looking at, pretty quickly. I think the crucial thing in starting to taxonomize these data sets is to make a flow diagram of how the data was collected: what steps were taken, who's collecting it, where did it go. So make these diagrams. They're fairly useful.

And now I want to talk about how you start to see the evolution of these systems. I've proposed this taxonomy, but I want to say where all the systems come from. I apologize for the California-centric nature of this talk. I'm a Californian, but there are equivalent laws on the books in a lot of states. So, in California, this is SB 272, "The California Public Records Act: local agencies: inventory." Sounds like a super interesting law that everybody wants to read, right? All right. This law mandates that every government body in California, whether it's a metro, a local government, a county government, a school board, or a water district, has to catalog every single database they have and what data is collected in those databases. So we get something like this. This is the LA County Enterprise Systems Catalog, which is published on their open data portal.
Some of these agencies don't publish this on an open data portal, but you can ask for their SB 272 compliance form. So you can see we've got the department, the enterprise system name, why the data is collected, and how frequently the data is updated: all the information we need to fit it into a taxonomy of whether it's being used programmatically, evaluatively, descriptively, or not even being used as a data system. Those last ones don't really fit into this category.

So we can take a look into this. Some of the super interesting data systems that pop up in here are the ALPR systems. Does anybody know what the ALPR systems are? Those are the ones that read your license plates from police cameras. So you can start to see exactly what data is being collected. This one is the LA County Sheriff's Office, which is collecting license plate recognitions: okay, this is a stolen car, we can look it up and see when it's been caught. But then you can also see stuff that is a little more prosaic, like the absence management system. That keeps track of which employees are in the office or not at the Department of Human Resources. And this is just the first 10 lines of this data set. I'm sure some of you have ideas about what you'd want to do with this data. It's pretty interesting stuff.

The other way to determine where data is collected and why it's being collected is to follow the budget. Your budget is your secret data portal. It's got all the information about spending, which data systems are being implemented, and where priorities are being placed. I have my favorite Joe Bidenism here, which is: don't tell me what you value. Show me your budget, and I will tell you what you value. It is still the most useful quote I have heard for thinking about any government agency, ever. Look at their budgets. It will tell you what they care about. So, I wanted to leave some time for questions.
So I have a few takeaways. Taxonomies are good. They help us define our problem space and make assumptions without having to spend hours looking at individual data sets. They let us tell our superiors or our colleagues what we think is possible with this data and what isn't, without having to do a ton of work. This has been a little guide through government data, a little tour from somebody who has been on the inside for a little bit. And then finally, what I want us all to think about as a community is this: what is a more formal way of classifying government data, based on known characteristics, so we can argue for similarity or dissimilarity between data sets? I don't have an answer to that. It's a really big question, but I think it will be useful for this community, moving forward, to be able to say that there is a certain taxonomy of government data and that certain things fit into certain branches and not others. So thank you.
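
One rough way to start on that closing question is to write the four proposed branches down as an explicit heuristic. The sketch below is not from the talk: the trait names, the order of the checks, and the `classify` function are all assumptions made for illustration, and the example entries follow how the talk describes each data set.

```python
from dataclasses import dataclass
from enum import Enum


class Branch(Enum):
    """The four branches proposed in the talk."""
    DESCRIPTIVE = "descriptive"
    PROGRAMMATIC = "programmatic"
    EVALUATIVE = "evaluative"
    NOT_DATA = "not data"


@dataclass(frozen=True)
class DatasetTraits:
    """Hypothetical yes/no answers to the 'basic questions' about a data set."""
    name: str
    is_recorded: bool             # is anything actually stored or digitized?
    built_for_evaluation: bool    # issued or assembled to evaluate policy or programs
    runs_a_program: bool          # byproduct of operating a program, near real time
    describes_jurisdiction: bool  # describes the government unit on a set reporting period


def classify(traits: DatasetTraits) -> Branch:
    """Assign a branch; the order of the checks is an assumption, not a rule."""
    if not traits.is_recorded:
        return Branch.NOT_DATA
    if traits.built_for_evaluation:
        return Branch.EVALUATIVE
    if traits.runs_a_program:
        return Branch.PROGRAMMATIC
    if traits.describes_jurisdiction:
        return Branch.DESCRIPTIVE
    # Recorded, but matching none of the other traits: treated here as
    # data nobody is using as data yet (an assumption of this sketch).
    return Branch.NOT_DATA


# Example entries, classified the way the talk describes them.
EXAMPLES = [
    DatasetTraits("Decennial census", is_recorded=True, built_for_evaluation=False,
                  runs_a_program=False, describes_jurisdiction=True),
    DatasetTraits("HMIS shelter records", is_recorded=True, built_for_evaluation=False,
                  runs_a_program=True, describes_jurisdiction=False),
    DatasetTraits("NWEA MAP scores", is_recorded=True, built_for_evaluation=True,
                  runs_a_program=False, describes_jurisdiction=False),
    DatasetTraits("ATSAC signal changes", is_recorded=False, built_for_evaluation=False,
                  runs_a_program=False, describes_jurisdiction=False),
]

if __name__ == "__main__":
    for traits in EXAMPLES:
        print(f"{traits.name}: {classify(traits).value}")
```

A real classification would need more than four yes/no traits, and many data sets will sit in more than one branch over their lifetime, so this is best read as a starting point for the argument rather than an answer to it.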