I'm surprised anybody would be here with Jacob speaking at the same time. Thank you, thank you. So the idea for this talk came about through my own experience learning Django, and really any framework. My experience has been: I've worked through the basic Django tutorial, the polls app. I've got that down, I've got my feet wet, I kind of know where some things are, and then I want to know where to go next. Intermediate tutorials are often "build a blog" or "build a to-do list" or some other app that's not worthless but not really that exciting. I mean, you're not going to be the next person to put a blogging platform into production.

So my experience, learning from other journalists, has been that a good place to start is using public data to build a Django app, or any app. By public data I mean government data, anything from a government agency. It's a ready-made data set, so it lets you build a less complicated application from the start. In this case, we're going to walk through an application I built for the Express-News using restaurant inspections. It let me focus on the basic core components of the framework — the views, the templates, the models, the URL routing — to the exclusion of other parts like testing or the admin interface. Those parts are important, but in the beginning, when you're just trying to take the next step up from the basic tutorial, it's good to focus on the core components you'll be using all the time; you can pick a more complicated example after that. So here's the plan: I'm going to walk through four or five examples of applications in production that serve public data, and then we're going to go through some examples of where to get public data.
I'm also going to walk you through a little bit of how to file a freedom of information request, and we're going to dig into that restaurant inspections application I built for the Express-News. I put the talk up; there's a URL up there, publicdata.omahajo.org, which you probably can't read from where you're sitting. I'll tweet it out. At the very end of the slides there's a link to a GitHub repo with a README that I basically wrote the talk in, with all the links and a little blurb for everything I cover, so you can just sit back and relax.

All right, first I want to go back in time a little, a little history. Django was developed in 2003 and 2004 at the Lawrence Journal-World. The story that I heard — and some of you might have lived it — is that some developers there were building out components for their CMS, and they had built a couple of things to solve some problems when they stepped back and realized they had a framework. The first official release of Django as an open source project was in July of 2005, but a few months before that, in May of 2005, the first public Django application outside the Journal-World was put into production, and that was chicagocrime.org. Chicago Crime was a seminal application. You could say it was one of the first, if not the first, applications in the news-applications, data-driven, data-journalism movement. It's been copied a million times by crime apps everywhere. It was essentially a scraping of Chicago crime data that Adrian Holovaty mashed up with Google Maps before there was even an official Google Maps API. So the very first Django application ever built outside the Lawrence Journal-World was built on public data. You can't go wrong. Ten years later, Django powers Instagram, Pinterest, Disqus, Orbitz, Rdio, Mozilla, Prezi, and a whole bunch more. I just ripped this list off from Jacob Kaplan-Moss's blog.
You can walk down the halls of this conference and see that Django is in many ways a who's who of the modern internet. Not bad at all for a couple of guys in Lawrence, Kansas. And it still powers news organizations: the LA Times, the Texas Tribune, The Atlantic, the San Antonio Express-News, Humbly, the Sunlight Foundation, the Chicago Tribune, The Washington Post. Some of these use it for their CMS; I know The Atlantic does, and they're giving a talk here. The Texas Tribune folks and their alumni are all over this conference. The Chicago Tribune has their own crime app, and we'll look at another application of theirs. The New York Times' interactive news group started out on Django, I think, as far back as 2007. They might still have some Django applications in production, but they've since switched mostly to Rails.

As we look at these examples, keep in mind that your Django application serving public data will essentially do one thing — and all of these applications do this one thing if they do nothing else — and that's take government data, digest it, and regurgitate it back to the public. The government does put data online. It puts a lot of it online. It just doesn't do a very good job of putting it online in any way that's usable by a normal human. And that's the cool thing about building a Django app from public data: especially early on, it lets you build an application that people will actually engage with. If anybody in here went back to their local community, asked for five years of homicide data, and put that online, you would get feedback, you would get people who are interested in it. Even if you're only trying to build out your portfolio to get a regular development gig, it's a really cool way to do it. You're going to generate some conversation, and you're going to generate a little bit of a stir. Every application we'll look at does this.
You'll be the mama bird. The first application is the Los Angeles Times Homicide Report. This one came about with a journalist and a developer: the journalist had the idea of reporting on every homicide in LA County. I think it covers five or six years, and there are about 1,000 homicides every year in LA, so the database itself holds about 5,000 or 6,000 homicides. You could easily fit that in an Excel spreadsheet, but an Excel spreadsheet isn't very interesting to look at. Families aren't going to sit around the kitchen table at night paging through a spreadsheet, digging into homicides in LA County. What this application does is serve those homicides back and let you sort and filter in a pretty easy way, with a nice little map that helps you drill down into a neighborhood. You can see pretty quickly — I already have this up — men who were shot in 2014: there were 383. You could do the same thing in your community. You probably don't have the time or the resources to compile the data by hand, but you could certainly file an information request with the local police department for the last five or ten years of homicides in your community.

The next application is from the Sunlight Foundation. Sunlight is a non-profit, non-partisan journalism foundation in Washington. They use a lot of different tools — traditional journalism, software engineering, data science, basic research — and their idea is essentially to shed light on the way our civic processes work. One of their coolest applications is Influence Explorer, which aggregates money in politics from a number of different pools of money: campaign contributions, federal lobbying, federal grants, and federal contracts are the four big ones.
Those four account for probably $2 trillion or $3 trillion flowing through the government, and each one of those data sets is worth a lifetime of research. What they've done is build out an API that you can use, so this also qualifies as a source of data. If you find something interesting you want to look at, you can use their API or their bulk download to focus on, say, a local county or state-level election. It lets us look at, for example, real-time 2016 campaign finance. For your friends who think Jeb Bush is leading the campaign-finance race: he's not. Hillary Clinton is running away with it right now. And if you've ever looked at Federal Election Commission data — I know there's at least one person here who has — it is really, really complicated. The fact that they would do this for you and then build out an API so you can drill down to local elections is a really cool thing.

The next application is from the Chicago Tribune. They have their own crime app, but one of the ones I like best is this 2014 Illinois School Report Cards. The raw data from the Illinois State Board of Education is a pretty gnarly mix of Excel spreadsheets, PDFs, and zip files, and it's difficult to imagine a young family sitting down and digging through it to research their kids' school. Normal people don't do that. But let's say I want to look at Whittier Elementary — I know there's one in Oak Park. With just a couple of clicks, in about five seconds, you can see that at Oak Park about 80% of the students meet or exceed standards. It also breaks that down by reading and math, gives demographics and class size, and has some notes on the data. I'm sure they get a lot of traffic from this. That's pretty useful. And all they did was take the data from that school board site and put it back online in this format.
And I guarantee you that every state in the United States has some insert-state-name-here Board of Education data page.

The next one is my favorite. It's from the Texas Tribune: the Government Salaries Explorer. What they've done here is file information requests with different public entities in the state of Texas, then do the work of normalizing how those organizations compensate their employees. They all have essentially different fields: some have bonuses, some don't, some give comp time, some call it vacation or paid time off or whatever. They've normalized that, and then they let you search it. And just like with homicides, if you lived in San Antonio, you could file an information request for all of this — or you could just rip it off from here, but I wouldn't do that. My point is, you could file an information request in your local community, find out how much people in the government make, and put that online. You could file once a year and update your application. I guarantee that's going to generate traffic, it's going to generate buzz, it's going to be a nice little way to make a name for yourself, and it's not going to be that complicated. You don't have to make anything that looks this nice, either. I know this still generates a lot of traffic for them, and it's interesting. It's important to know, for example, that in San Antonio the city manager makes $400,000 a year. That's important. I don't think she's worth that much, but.

Now, an application that I built; we'll dig into this one a little more. I used the San Antonio restaurant inspections. The inspections are online, and this is the basic search interface. If you dig into the page source a little bit, you find out that it looks like this. Some reporters — web producers — came and talked to me.
They were going through this site every week, finding the 10 worst inspections, and putting a little slideshow about them online. I saw right away — and I talked to a couple of people who'd been with the paper longer than I had — that this website hadn't changed in about five or six years, and I don't think it's going to change anytime soon. So this was a great candidate: scrape the data, load it into a database, and then we don't have to settle for just the 10 worst. We can show all of them, and we can let people drill down and see what a restaurant has looked like over the past five or six years. We'll take a deeper look at that in a sec.

The website I built, the Django application, all it does is take those restaurant inspections and put them online. It goes from this to this. And because the task I wanted to accomplish was simple from the outset, the code is correspondingly pretty straightforward. We'll look at the repository in a minute, and you'll see that if you take the polls application from the basic Django tutorial and put it right next to this code, they look a lot alike. This was one of the first things I built in Django; I didn't know much more than the tutorial, so I simply wanted to take that and extrapolate it out to something live, something that does something other than polls. It does restaurant inspections.

There are four views. The first is the index view: the 500 most recent inspections sorted by date, which amounts to about two weeks' worth (I haven't updated it in the last week or two). And don't go to Rocky's Tacos, by the way. So the index view is the 500 most recent inspections, and the establishment view is all of the inspections for a particular establishment.
Then the inspection view is all of the violations, and the descriptions of those violations, associated with a single inspection. And the last view is a basic search. Four views, that's it, and the search results page is essentially the index view all over again. Four views, four templates: an index template (the first one you saw), an establishment template, an inspection template, and a search template, plus a base. They all inherit from base, and the search template is essentially the index template. I didn't even try to deal with pagination; I just used the DataTables jQuery library.

I have three models, and they aren't complicated at all. Probably the hardest part of this whole application is getting the data in its raw form shoehorned into a model. If you weren't experienced and hadn't worked with data a lot, that could be a little daunting, but it's nothing that your local Django users group or Python users group — or even, cheating, the Rails users group — wouldn't sit down and help you out with. It's just: how do you take data that looks a little weird and make it look like this? Every establishment has many inspections, and every inspection has many descriptions, and that was mostly just a function of the way they put the data online.

Probably the thing I learned the most from — I wouldn't say had the most trouble with — was the views. It was a chance to play around with the database and the ORM and learn how Django queries work. If you're familiar with SQL, it should be pretty easy to figure out; if not, you'll be grateful you don't have a complicated application, so you can spend some time learning this, because this is important stuff. Every Django application you build is going to go to the database, and it's going to have some views.
The first one, the index view, like I said, is the 500 most recent inspections. The establishment view takes an establishment ID and gets all of the inspections for that establishment, ordered by date. The inspection view gets all of the descriptions for a given inspection; I had to make up an inspection key there, which is where things got a little weird. And then the search, which is essentially just a basic wildcard search on whatever text you put into the text box: find me names like that.

Sources — where do you get this stuff? We talked about one: the Sunlight Foundation has an API you can use. Another good place, although a lot of data-journalism types grumble about it, is data.gov. It's a decent place to start. At this point it has 160,000 different data sets, with data on about every subject imaginable. I think it's also a good starting point because they have example applications you can look at and a little developer portal. Sometimes you'll see code challenges advertised, or corporations giving grant money. Once you've built a first application and tested the civic-data waters, you can keep going down this road for a long time. Maybe you won't make a living out of it, but it's a great way to get involved and build cool projects. There's a great community of folks out there. Data-journalism open source projects, especially ones run by journalists, are a great way to get involved with open source. A lot of journalists have essentially become software developers by default through their jobs, but they didn't start out as computer-science people, so they're very amenable to helping people out. There are a lot of folks who just taught themselves to code in the newsroom, and it's a great little community to get involved with, to learn and grow. This is a place to start.
Probably one of my favorite data sets of all time is the census. But the census data, if you know it from the Census Bureau or American FactFinder, is horrific at best. Some journalists got together and got a Knight Foundation grant to essentially make census data cool: censusreporter.org. It is itself a Django app, and an open source one, so you could run your own censusreporter.org if you want. I put this in here because it's a good place to learn what's involved with census data — what's actually available, what's in that data set, what you can find. It has pretty much everything: median age, income, commute, transportation, households, fertility, marriage, veteran status, whatever. So if you're looking for a way to break into census data and want to know what it entails and what it can give you, this is the place to start.

Once you drill down and decide what you'd like to look at in your community or across the United States — say you want to compare the whole country on one data point — instead of going back to the Census Bureau, I would go to census.ire.org, from Investigative Reporters and Editors. This is essentially what data journalism is: the government does a bad job of putting data online, so how can we do that job better? Some journalists were frustrated with how the Census Bureau put data online, and they wanted to make it easy for people to look at and get. So instead of navigating the many, many dropdowns of American FactFinder, let's say you want to look at counties. You want Texas; we'll do Aransas County. This lets you see the data you want on the right and toggle whether to add certain data sets in, and then you can download it. It doesn't get much easier than this. I can't tell you how bad American FactFinder is. It is not this easy.
So I would just forget about it, pretend that this is the actual government census data, and get it from here. Your own community will also have data online. San Antonio is not a leader in this in any way, but even they have data. They have a GIS data portal with shapefiles, if you want to explore GeoDjango, for example. I found the city's restaurant inspections online, and there's a random 911-calls data set, but only for certain kinds of calls. If any of you went home to your hometown, looked at the city's website, and just poked around, you'd find some data online. Cities are supposed to put a lot more online than they do, and a lot of them don't have the resources, but it's a good place to start if you want a data set they thought was important enough to publish. That means it's probably interesting. It also means it's probably in some crappy table format that you could scrape, which is a good reason to scrape it: then you can start updating your data set in real time, or weekly, let's say. Government data is often a good data set to scrape because they don't have a lot of money to change these websites all the time, so they build it once and let it sit there for 15 years. The first place I would check is your own hometown, just to see what they have online, because there will be nuggets there.

I also included this slide: 30 places to find open data on the web. You could have Googled that yourself, but there's a lot here, from city-specific government data to data aggregators to social-media data, weather, and sports. The New York Times and the Guardian both have pretty interesting APIs.

The last thing I want to talk about is freedom of information requests. All of you can file those. I think a lot of people believe they have to be a journalist to file one, but they don't.
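On that scraping point from a moment ago: here's a minimal sketch of pulling rows out of a government HTML table using only the standard library. The markup below is made up — a real city page's structure will differ — but the pattern (walk the `<tr>`/`<td>` tags, collect the cell text) is the same.

```python
from html.parser import HTMLParser

# A tiny sample of the kind of table a city site might serve.
# Hypothetical markup and columns; the real page will look different.
SAMPLE = """
<table>
  <tr><td>Rocky's Tacos</td><td>2015-06-01</td><td>71</td></tr>
  <tr><td>Oak Park Cafe</td><td>2015-06-02</td><td>98</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collect the text of each <td>, grouped into one list per <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the <tr> currently open, if any
        self._cell = None  # text chunks of the <td> currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row:
            self.rows.append(self._row)
            self._row = None


scraper = TableScraper()
scraper.feed(SAMPLE)
# scraper.rows now holds one [name, date, score] list per table row.
```

For a real page you'd fetch the HTML first (with `urllib.request`, say) and feed it to the parser the same way; third-party libraries like BeautifulSoup make this less manual, but the standard library is enough for a table this simple.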
This is probably one of the best ways to get data, because the government will just send it to you. Not a lot of city governments are going to put homicide data or their own salary data online, but you can ask for it, and by law they have to give it to you. It can seem a little daunting, but the easiest way to do it is probably to go through ifoia.org, run by the Reporters Committee for Freedom of the Press. It helps walk you through the FOIA process. You can create an account, and it will give you a template; there's no required language for a FOIA request, but there are certain agreed-upon ways of asking. You log in, they generate the letter for you, and once it's generated, if the agency you want to send the request to is already in their system, they'll send it to the right person — the PIO, the public information officer, of the organization you're looking at. If that person or that government department isn't in there, you can do a little research on your own and contribute it back.

Another cool thing is that it manages the back and forth that can come with filing one of these things. Normally it goes like this: you file a request, and by law they have to respond within a set time — in Texas it's 10 days. If they can't get the records to you in 10 days, they have to say so and ask for an extension, but they have to get back to you within those 10 days no matter what. And there can be some back and forth. They can sometimes quote you something like $4,000, and you can say that's ridiculous and bargain with them a little: I do this for a living too, and it doesn't take three weeks to dump a MySQL database. That happens. Yeah.
Another thing is, you can generate a legal case. I filed a request once in Arkansas, and they just wouldn't give me the data at all. I documented it — there was back and forth, back and forth, and then they stopped responding — and I bundled all of that up and sent it off to the Arkansas Attorney General. I said, look, I asked for this data — it was consumer-complaint data — there's no reason they shouldn't give it to me, and they didn't respond in the time they're legally supposed to. That started a little legal case, and the Attorney General got on top of the agency I was asking. You should do that too. It wasn't me going out of my way or anything; it was just bundling up all the communications I'd had with them, sending them off, and saying, hey, this is not right. And the Attorney General said, no, it's not right, we'll talk to them about it. iFOIA does all of that tracking for you. It's a really cool resource, and you should sign up and file some requests. I thought journalists filed most freedom of information requests, but it's actually hedge funds and consulting groups and the like, so don't let the hedge funds get all the good data.

Like I said, I'll tweet this link out at the end. It has a link to the README for this talk and a link to the inspections repository, which is a nice example of a production-level Django app that's not very complicated and is pretty accessible. That's it, thank you.

I don't know if any of the Texas Tribune people are here, but one of the things they did that I thought was interesting — and I'm wondering if you have any experience with this — is using Django's form or model validations to do data cleanup, or at least to flag incomplete data during an ingest of some kind.

I haven't done that, exactly. Well — I guess I'll repeat the question first.
The question was: do I use Django model validations to validate data on ingest?

So, I'm former Texas Tribune. On a lot of the data apps I worked on there, we actually had to get away from the ORM as quickly as possible on ingest, just because the ORM is slow when you're importing 10 million records.

I just used the basics. I split the data up into CSVs, and because the data from the city of San Antonio is a little weird, I built in that inspection key and make sure it's not duplicated on the way in. Their data is not terrible — I've dealt with a lot worse — but that is a good idea. And they're right: the ORM is slow.

You mentioned, when you were looking at data.gov, that a lot of data journalists don't like to use the information from that site. Is there a reason for that?

I don't think it's the information they don't like; I think they just expected a little more from it. I don't know why that is. I've just heard grumblings around the bar — "too bad it sucks" — but it's never backed up by anything, so I wanted to say that in case somebody out there has heard the same. I think it's fine. It is just a dump, though. They put up 160,000 data sets, and I guarantee they didn't put much work into them: you've got some data, put it up there. So the work comes in with what you're going to do with it. Maybe people wanted some cleaning or some organization, and maybe 18F is the government's answer to that; I don't know.

I wonder if you would also consider open sourcing or sharing your scraping Python stuff, just to be a good model.

Yeah. Yeah, I can do that.
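That ingest approach from a moment ago — skip the ORM for the bulk load, work from CSVs, and build a synthetic "inspection key" so re-running the loader doesn't duplicate rows — can be sketched like this. The column names are invented for illustration; the real scraped fields will differ.

```python
import csv
import io

# Hypothetical CSV layout standing in for the scraped inspections data.
# Note the duplicate row, as you'd get from re-running a scraper.
RAW = """establishment,date,score
Rocky's Tacos,2015-06-01,71
Rocky's Tacos,2015-06-01,71
Oak Park Cafe,2015-06-02,98
"""

seen = set()
rows = []
for row in csv.DictReader(io.StringIO(RAW)):
    # The source data has no natural primary key, so build a synthetic
    # "inspection key" from fields that identify one inspection, and
    # skip any row we've already loaded.
    key = (row["establishment"], row["date"])
    if key in seen:
        continue
    seen.add(key)
    rows.append(row)

# rows now holds one record per unique (establishment, date) pair,
# ready for a fast bulk insert that bypasses per-object ORM saves.
```

The same key check works whether the duplicates come within one file or across weekly re-scrapes, as long as you persist the set of keys (or enforce it as a unique constraint in the database).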
There's also a lot out there even better than the stuff I've written — actually, I wrote this scraper in Ruby a while back, so I sometimes feel like some sort of traitor — but there are a lot of cool examples of scraping within the journalism community. I can think of three or four scraping tutorials written by journalists, from a journalism standpoint, that are really good, and I'll put those links in the README.

Well, there's a Texas GitHub repo; we should probably get you on that. There's a bunch of scrapers there if you're interested in pre-existing Texas government data scrapers, and some open source Django projects too.

Yeah, that's good. I wrote some stuff for the Railroad Commission, which is a whole other issue. Any other questions? We have a little more time. If you have a burning question, please don't hesitate. Tell me when you file a FOIA request — that'll make me happy.

I've heard other talks, including a TED talk you may have seen, about people doing open-data work with data provided by the state of New York, and I've heard repeatedly how bad the data is: maybe they use inconsistent IDs for the same thing, or inconsistent formats, or you're trying to scrape something that's not really machine-parsable. In your experience, how bad is the data formatting?

So the question is: how badly is the data formatted? In my experience, it depends on the data set. Some of them aren't that bad. Some of them are horrible, beyond imagining. To stay with the premise of this talk: if you find one of those, I would just pick another data set.
Because yeah, some of them are bad. I've spent two months on a data set we needed for an investigation — maybe that's an exaggeration, but I've spent a lot of time on horrible data. I know Chris has, and I'm sure Travis has. It's unreal how bad the data can be, and the government doesn't care. A lot of the problem with data journalism, though, is that you're essentially taking a data set and making it do something it wasn't intended to do in the first place. I point that out to a lot of people who are upset with the way the data works: it did what the agency wanted it to do; what you want to do isn't what they built it for, so bear that in mind. But it can get pretty hairy, and I would just pick another data set then, unless you want to spend the time. Also, if it's really bad and you're interested in it, probably somebody else is too, so cleaning it up would be a great open source project in and of itself.

I'm on a human rights committee in my county, and one person on the committee asked for demographic data on the people in the county jail. They got a response that it was going to cost this much money — and we don't have any money — so it just kind of died a year or two ago. I'm interested if you could tell us a little more about what to do when you've been told it's going to cost this much, and how to do the bargaining. And if you do end up needing to pay, are there people out there who help pay for that?

Where was that? In Michigan. Okay. So the question is: how do you bargain over a FOIA request? What I would usually do — I don't know what they said, but let's imagine they come back and say it's going to cost $2,000 to dump a MySQL database.
Well, we all know it doesn't cost that. Having done it on my own, I'll grant you a day and a half, even two days of work, because sometimes it's not as easy as it sounds — people in the newsroom always underestimate how long these things take. So say these people make 50 bucks an hour; a day would be about $400. You can start with that: look, this should take you a day, maybe a day and a half — and even that's stretching it, because six hours of actual work in an eight-hour workday is a ton of work. Say $400 to $600, and that gets you down from the $2,000. Then you can ask them for an itemized account: why does it cost so much? And if you're not experienced with the technologies involved, talk to somebody who is. We ran into a situation with the state of Texas where they were giving us an Oracle dump and quoting something like $2,000. We talked to some Oracle people, and they said no, no, it shouldn't be that hard. Then you can go back and say, look, I've talked to some professionals, and I can put you in touch with them; this person works on Oracle databases every day, and she says there's no way it should take that long. That's if they start out at some ridiculous level — I don't know how much; did they tell you how much it was? For the recording: he said it might have been a couple thousand dollars. Yeah. So it's okay to ask them: what technology are you using, what database, how is this stored? That's all fine. You've got to pretend you're a reporter — be really annoying, be as obnoxious as you can.
Say: I want it itemized, I want to know exactly why it costs so much. I'm not usually like that, either — I'm more of a people pleaser than I am obnoxious — but ask for an itemized breakdown. And be really nice, too. At the very end, you can go to the attorney general, say this is unreasonable, and show them why it's unreasonable: if they say it's in MySQL and it's going to cost $2,000, that's crazy. I've never done this, but I've also heard of people offering to show them how. Maybe it's someone who doesn't know what they're doing; they say, well, it's in this MySQL thing and it's going to cost $2,000, and they could just be afraid and making a number up. So you can say, hey, look, it shouldn't cost that much, and if you don't know how to do it, I can show you — I can walk you through dumping the database or converting it, to save you time. It's arduous, but that's mostly how it goes. As far as paying, I don't know of anybody who will help you pay. Federal requests can take years, but demographic data on a county jail shouldn't take that long, and I would hit them back now. Yeah, that would be cool. You could also talk to a local university — the University of Michigan journalism department, say — and see if they want to get interested.

All right, thank you so much, Joe. That was fantastic. Thank you. Thank you.