I'm going to actually start while they try to set this up, and tell you a little bit about what I'm going to talk about. Open data, of course, is a movement, and it has a lot to do with the idea that the government has information on you. Now, we all know a lot about that, but legitimately they collect all kinds of stuff. They collect bus schedules, they collect voting records and things like that. So the reason there's an open data movement is that, in fact, there's a need for it, right? We want to know the bus times. And let's face it, governments are not always the best writers of software. I'm a professor. I spent a frantic two hours yesterday trying to deal with government grant software in Canada, because it was designed by civil servants. So often it's best to actually say, let the community go out and design it. In New York City, for example, all the subways have a sign up: "Our transit apps are whiz-kid certified." And what that means is the MTA didn't build them. They actually let other companies create applications to use their data, which they make available openly.
So one reason, of course, why governments open up their data is in the interest of giving better service to customers. Another one, of course, relates to the fact that if they have the data, they figure sooner or later somebody is going to put it out on WikiLeaks anyway, so they might as well release what they have. So the ability to actually go out there and see all the government data seems to be a natural thing. It resonates with, you know, the whole spirit of DEF CON, which is: data wants to be free, data wants to be open. And as a result, governments are opening up more and more of their data. The other thing, of course, is they're cheap. They don't want to go out there and spend money on building applications if other people will do it for free. So the reality is we have a lot of good reasons, and I am not, this is kind of the big preamble, I am not against open data.
I think open data is great. There are open data hackathons held all over the world, and they basically have the advantage of getting the best minds to take the government's data and think of creative uses. So if you have four kids and you need to take one to the soccer field and one to the baseball diamond and you want to somehow optimize that, that's really good. The problem, of course, is that a lot of the data that the government collects is related to people. So I'm still hoping that we're going to see images, because I do have images to share with you, but I'll tell you that one example relates to voting. And it's pretty obvious in a democracy that we want to know what the voting results were.
So, and this is an example that I cooked up to illustrate one of the problems with open data, we have a city in Alberta called Edmonton. It has a wildly popular mayor named Stephen Mandel. He got over half, well over half, the votes in every ward, but when they published the results, there were some minority candidates, people who only got a few votes, who got zero votes, in fact, in some areas. So let's say, for example, Mr. Dowdell, who was one of the other candidates: his wife was in the hospital, and she says, you know, "Yes, dear, I voted, and I voted for you." And he goes and pulls down the online data, and guess what? He got zero votes in the citywide hospital poll. Well, that's a type of torturing of data that the government never anticipated. They never anticipated that we would ask questions like that of the data.
So, you know, the term "torturing data" in my title really relates to doing exactly what most people in this room love to do, which is to go out there and find things that you're not supposed to be able to find. So when people ask, you know, what is DEF CON? I say it's basically not people who are doing things that they shouldn't do. It's people doing things that they shouldn't be able to do.
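The hospital-poll problem is what statisticians call small-cell disclosure: when a poll has very few ballots, a zero tells you how specific people voted. A minimal sketch of one common mitigation, folding small polls into a combined row before publishing, might look like this. The function name, the threshold of 50 ballots, and the tallies are all invented for illustration; nothing here is what Edmonton actually does.

```python
from collections import Counter

def merge_small_polls(results, min_ballots=50, merged_name="Combined small polls"):
    """Fold polls with fewer than min_ballots total votes into one combined row.

    results maps poll name -> {candidate: votes}. Citywide totals are preserved,
    but per-candidate counts for tiny polls are no longer published separately.
    """
    published = {}
    merged = Counter()
    for poll, tallies in results.items():
        if sum(tallies.values()) < min_ballots:
            merged.update(tallies)          # add this poll's counts to the bucket
        else:
            published[poll] = dict(tallies)

    # If the combined bucket is still too small, absorb the smallest published
    # poll so the merged row cannot be traced back to a single tiny poll.
    if merged and sum(merged.values()) < min_ballots and published:
        smallest = min(published, key=lambda p: sum(published[p].values()))
        merged.update(published.pop(smallest))

    if merged:
        published[merged_name] = dict(merged)
    return published


if __name__ == "__main__":
    # Invented tallies, not real election data.
    raw = {
        "Ward 1": {"Incumbent": 900, "Challenger": 40},
        "Ward 2": {"Incumbent": 800, "Challenger": 35},
        "Hospital poll": {"Incumbent": 12, "Challenger": 0},
    }
    for poll, tallies in merge_small_polls(raw).items():
        print(poll, tallies)
```

The trade-off is that the published table loses some granularity, and choosing the threshold is a policy decision rather than a technical one.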
And you can do that with data quite effectively. So again, while we're praying to the AV gods there, I'm going to start telling you about an example, and we'll go through it quickly when the video comes up. There actually was a challenge called the Open Data Challenge, held in Europe, and they got entries from most of the EU countries. And the one that won was actually from Slovenia. I don't speak Slovenian, so I can't really say the name, but it translates as "From Our Taxes." And what they did was they actually took all the government contracts that were registered online, and they started going out there and putting them together and finding out what names were in common: which people were the directors of different companies, which people were the lawyers who represented the deals, and so on.
And an interesting thing happened. A woman named Ludmilla decided that this was invading her privacy, and she actually convinced a judge in Slovenia to order the NGO that created this award-winning application to take down the data. Now, somebody asked me before the talk how technical this is going to be. This is about as technical as it gets: they didn't have that data. That data was scraped off databases that were maintained by the government. So here we have a judge, and it proves the old saying that judges are about 10 years behind the average member of society in their understanding of technology. Yes, that's right. Yes, thank you. And so the judge is ordering them to take down this information, and they can't take it down because they don't really have it. I mean, what I suppose they could do is put filters into their program that somehow trap that information so that it couldn't be revealed. But it just kind of shows a misunderstanding, a fundamental misunderstanding. Do we have any hopes back there, guys? Hope springs eternal, okay?
Well, I could just hold the screen up and give everybody a telescope or something like that, right? So let's talk about some other kinds of open data. We have a member of the Legislative Assembly of Alberta who had the misfortune to do two things wrong. On a government trip, he solicited a prostitute, and he got caught. This was about three weeks ago, his name was Mike Allen, and it was all over the news. So I said, okay, what else can I find out about Mike Allen? So I went to a database. I went to the Ramsey County, Minnesota, Sheriff's Office database, and sure enough, there is poor Mike's information on what he was doing, the fact that it was gross, they use the words "gross misconduct," and his home address. And that's where things get pretty interesting, because if you think about it, having his home address up there, it's private information in a sense, but hey, he was arrested, and some sheriff from Minneapolis took that and posted it.
Now, you know, different countries have different regard for privacy. So, anybody from Germany here? Okay, all right. So if you're from Germany, you're the best, right? The people from Facebook say they can't even operate in Germany, right? It's like, you know, everything's against the law there. And then you have other countries. Now Canada, I would say, is between Germany and the U.S. So a lot of things that you can get away with in the U.S. in terms of posting information on people would not fly in Canada. To take that specific example, I'm sure that if this guy had been arrested there, they wouldn't just willy-nilly put up the information about where he lives, because what's the need for that? But in the United States, there's certainly a tendency to do more and more of that. And one of the photos, if we ever do get to it, that I'll be able to show you is from Hernando County, Florida.
Now Hernando County is a place you never want to get arrested, because sheriffs in the U.S., for those of you from Germany who don't know, are elected officials. So sheriffs, dog catchers, people like that, they all get to be what they are by having people vote for them. So the reality is they go out there and they want to show that they're doing their job, that they're actually a good sheriff. What better way to show that you're a good sheriff than to actually arrest people? And sure enough, in Hernando County, they post the mugshot of everyone who's arrested. So if you're, you know, done for speeding or whatever down there... and again, we may have the slide one of these days, and when we do, you'll see some really seedy-looking characters who have been arrested, people you would probably cross the street to avoid. But you'll also see a photo of a 12-year-old boy. And I captured that information off the government database and used it in my presentation. Now, I obscured his name. His first name is Bobby. I obscured his surname. I put one of those black bars across his eyes, so you can't actually see what Bobby did or, you know, his personal identity. But the sheriff didn't do that.
So the first thing to realize is that he's up there, a minor child, 12 years old, and he's being sort of permanently tarred in some ways. They say, well, not really permanently, because after all it's only up there for 30 days. And the answer is, yeah, but it's been in my presentation for over a year now. When data is out there, there's no way to really call it back. There's no way to bring data back into the fold once you've let it loose. And this brings a lot of vulnerabilities. The reality is that quite a bit of the data being put out there is captured by people. They say that if you put a photo up on Facebook or something like that, I mean, at the very least it's been copied by the NSA, but it's probably been copied by a lot of other people as well.
So there's no way to actually call data back once it's been posted somewhere. One aspect... again, I'm just randomly remembering my presentation. I usually print out a little cheat sheet, but I trusted the tech here. Yeah, so anyway, another aspect of this relates to DNA information. How many of you know Ancestry.com or Ancestry.co or whatever? Yeah, all the different national versions. I won't ask who has profiles on that, because, you know, it's embarrassing to put your hand up. The business model of Ancestry, of course, is that they allow you to sign up for 14 days for free and do all kinds of exploring with their data. So they are using public information: census data, military records, prison records, all kinds of things. Those are particularly useful for Australia. Lots of prison records down there. And so they take all this data and they make it freely available to you for 14 days, and on the 15th day they charge you like $299 if you don't cancel your membership.
Now, it's an interesting model, because they're taking public data and eventually making money from it, but, you know, we assume that that's probably okay. The second thing, though, of course, is that after you leave, after you tell them, hey, I don't want to sign up for the $299, they keep your data. So they know your family tree. You have enriched their database, and guess what? You can cancel your account, you can, you know, check into the Hotel California, but you can never check out your data. It is permanently part of their database. So a few years ago they got a brain wave. They said, you know what would make this a lot better? Send us your DNA, and then we'll tell you, you know, if you're descended from Adam or Noah or whoever's way, way back in your family tree.
And some people, I think, were actually dumb enough to send their DNA, and doing that, of course, provides a tremendous amount of information. And if you think about it, with DNA information it's not just about you. It's about your siblings, your family, all kinds of people who didn't give any consent for you to give that information. So the reality is, putting DNA information out there is risky, giving it up voluntarily is dumb, and Privacy International, the NGO in the UK, actually launched a lawsuit against Ancestry.com. Now, I was down in Utah and I thought I'd visit Ancestry.com, and I tried to find out all about this, and they said, oh well, we just keep the database here; that's the genetic genealogy project, and it's with the real smart people down in California. So the genetic genealogy project is something to really worry about, okay? The possibility is out there for somebody to get tremendous amounts of information on you. Do we have any hope back there? Hope springs eternal? You're getting closer. So, everyone leave your laptop at the door, right?
So, a couple of principles on dealing with this data. Okay, New York City released, with great fanfare in 2009, the New York City Data Mine, and they had 110 databases there, and they said, you know, these are all interesting things; like, all the women's organizations in New York City are now in this database. Well, they realized a day later that they had forgotten to take out the private email addresses and the secret questions. So "What was your first pet?" is the most common secret question, and "Fluffy" is the most common answer. So that information was disclosed out there, and, you know, the next day there were only 109 public databases from New York City, because they had to go in there and take back some of that information.
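The lesson from that release is to scan a file for personal information before it goes out the door. As a rough sketch only, here is one way such a pre-release check could be written in Python. The column-name hints, the email pattern, and the sample data are my own assumptions, and a real review would need far more than a regular expression.

```python
import csv
import io
import re

# Deliberately simple email pattern; an assumption for illustration, not a full RFC 5322 matcher.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical substrings that suggest a column should never be published.
SENSITIVE_HEADER_HINTS = ("email", "secret", "password", "answer", "phone")

def find_pii(csv_text):
    """Return a list of (location, column) pairs that look like personal data."""
    reader = csv.DictReader(io.StringIO(csv_text))
    findings = []

    # Flag suspicious column names first.
    for header in reader.fieldnames or []:
        if any(hint in header.lower() for hint in SENSITIVE_HEADER_HINTS):
            findings.append(("header", header))

    # Then flag any cell that contains something shaped like an email address.
    # Data rows start at line 2 because line 1 is the header.
    for row_number, row in enumerate(reader, start=2):
        for column, value in row.items():
            if isinstance(value, str) and EMAIL_RE.search(value):
                findings.append((f"row {row_number}", column))
    return findings


if __name__ == "__main__":
    # Invented sample, not the actual New York City file.
    sample = (
        "organization,contact_email,secret_answer\n"
        "Example Group,jane@example.org,Fluffy\n"
    )
    for location, column in find_pii(sample):
        print(location, column)
```

A check like this only catches the obvious cases; it is a tripwire before publication, not a substitute for someone actually reading the data.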
I want to explain an experiment to you that I did with what's called OpenPhilly.com. So Philadelphia is one of the leading cities in making their data open, and they went out there and they actually put campaign contribution records online. Now again, because it's an international audience: it varies a lot from country to country, but in the United States there are laws about campaign contributions, particularly those over $200, that say that if you make a contribution, you have to give your name, your address, and your occupation, and those things are publicly available information. But what Philadelphia did is they took all the contributions, even the ones of $1.49, for the last seven years, retroactively, and they put them up in this wonderful database.
Now we're going to get technical one more time. It was put up there with a front end that said you can query this database, but hey, you'd better not go out there and actually download it. Like, it's not for downloading. Okay, so here's how you download it, and any of my computer science students who didn't get this would get an F. First you say, download all the people whose first name begins with A. Then you do all the people whose last name begins with A. Then you do the last names beginning with B, and C, and D. See the pattern? So within about eight minutes I had a comma-separated file of all of these contributions, and then I started to torture the data and have a little bit of fun with it.
So I said, well, what can I find out? Who in this room knows who Ronald Rivest is? Okay, he's sort of RSA: Rivest, Shamir, Adleman. Okay, there is a candidate, her name is Shelly something or other, in Philadelphia, and all of her contributions seem to be local, except one came from Massachusetts. Well, who was it? It was Ron Rivest, and he obviously endorsed this candidate enough that he sent money to her. So, you know, by plotting the data on a bit of a graph, I was actually able to find some interesting things. The most interesting thing I
found was that an awful lot of people in Philadelphia live in the same place: 1719 Spring Street. I mean, thousands of people live there, and I went, what the heck is this? So I run over to Google Maps and I bring it up, and it is the office of the International Brotherhood of Electrical Workers. So I think you can see the scene. Guido comes in, he's going to get his job as an electrical worker, and they say, wait a minute, you've got to come over here, you know, you've got to make this contribution. And he says, oh yeah, I've got to do that. When you fill out the form, you know, when people make campaign contributions, they're very scrupulous about putting some address down, because they want to get the tax receipt. But maybe he doesn't want to give his own address. Maybe he's in the witness protection program or something. So they go, you will use the address of the union. So a significant number of the contributions to certain candidates map back to that one building, which is the headquarters of the union.
Now, then I thought I'd be a real sort of objective computer scientist and run a statistical study. So I found out the most common names in the United States, names like Jones and Smith, and I did an analysis. The only reason, really, to publish the name and address of political contributors that I could see is if there are two John Smiths and you want to know which one it is. So I actually did a statistical test, and I found there were 385 J. Smiths, and the actual need to resolve duplicates came down to only about three or four. So there was almost no information added by having this, but there was a significant privacy risk. So what I'm suggesting to governments is that they need to think long and hard about how they put out the data and why they put out the data.
Now again, I don't know whether we'll ever get the images, and we're going to have like a flash dance of the images if we do get them up here, but I have a complete stalking exercise I did on a notable Albertan, and
he actually was the president of an airline. And to tell you the truth, I started with the premier of the province, and I said, how well does he protect his privacy? And my first test was, is his phone number unlisted? And it was. So then I worked down to other people, and finally I found the president of a pretty big airline whose home phone number was listed. In fact, it's still listed, if you look hard enough. So then I went from his home phone number... Hey, we have the technology! Okay, so Monkey, or whoever's back there, work with me. Work with me, darling, and let's make this happen. Okay, so go forward. Just keep going; I'll say next, next, next. Okay, so we talked about this. Stop right there.
Okay, New York City. Okay, this is one we didn't talk about: the New York City Data Mine, where the women's organizations are outed. So there's a problem, which is neglecting to read and redact data before releasing it. Next slide.
Yes, Toronto. You know, you call 311, you complain about something. So sure enough, in Toronto they build a database of that, and they're supposed to be careful to anonymize it. We have six-character postal codes, and they're only supposed to put in the first three. Well, I looked at the database, and sometimes they go "Bloor and Danforth"; they put actual intersections. So just sloppy, lazy data entry is the second problem. Next slide.
Okay, this is actually from my place in Calgary. Do any of you know SeeClickFix? Great system. You want to report a pothole, you want to report a place where criminals hang out, you go anonymously on this system and you enter the data. Well, sure enough, in Calgary there's somebody who hates the car wash in his neighborhood, so every day he puts in "excessive noise," "dangerous ice." So he's just got a hate on for that car wash. Next slide.
Okay, this is the European Open Data Challenge. Next slide. Who won? There it is in Slovenian: "From Our Taxes." Next slide. Okay, who wouldn't agree with that? Well, the judge ordered them to take down the data. Next slide. A little technical aside: they
couldn't take down the data, so judges lag behind technology. We know that. Next slide.
Election results. There's the guy who got all the votes. Next slide. Did my wife vote for me? Well, apparently not, because all these people got zero votes. Next slide.
Philadelphia. Okay, finance records. Next slide. There it is: the full home address is provided. Next slide. Okay, that's the kind of computer that we had. I had one of those Commodore PETs. Anybody have a PET? Yeah. That was in the 1970s, when they wrote the election law, so maybe they were a little out of date. Next slide. Okay, there's the kind of stuff I could get. There's Ronald Rivest's home address, if you want to write to him, in Arlington, Massachusetts. Next slide. Yes, that is Ron Rivest, and all of his family gave to Barack Obama in this case. Next slide. Okay, let's just keep going. Next slide. Okay, so the donor files were downloadable. Next slide. Most common names: Smith, Johnson, and Williams. That's the number of Smiths and Johnsons; a very small number of duplicates. Next slide. Okay, it was really useless. And when I looked at the address on Spring Street, next slide, there it is: it's the International Brotherhood of Electrical Workers. Next slide.
Okay, problems: underestimating our desire to torture the data; inconsistent data, like some people giving a home address and some giving the union's; and the "why" of requiring an address needs to be examined in the light of new tools. Next slide.
Okay, there's the guy. His name is Clive Beddoe. Nice man. Next slide. There's his home address in the phone book. I took out his phone number; it's still valid. Next slide. There's his property tax assessment. Next slide. There are the details of his property tax. Next slide. There he is plotted on a map. His house is worth 710,000; his neighbors' are worth less. This spawned an industry of people writing letters to say: Dear Mr.
Beddoe, we see that you're assessed at this much money. Did you know your neighbor is assessed at 414,000? Wouldn't you like us to appeal your taxes? So the city actually had to shut down this database. Next slide. They now have technical and legal safeguards, which are dumb. Like, you can only query 10 times from an IP address in a day. Well, anybody in this room can spoof that one. Next slide. Next one.
Okay, here's Hernando County. There's the sheriff, nice smiley face. Next slide. There are the not-so-friendly-looking people who got arrested for things like uttering a forged instrument and contempt of court. Next slide. There's poor Bobby. That's real. Note his date of birth, 1997; that made him 12 years old at the time of arrest. Shame on you, sheriff. Next slide.
Data journalism. Okay, I lived in the Bronx for a long time. The Rockland County newspaper released a map of all the people with pistol permits. Well, sure enough, one of them actually lived in the Bronx, on Crosby Avenue, a few... Middletown Road, a few blocks from where I lived. So this is public information. You get some interesting results, though, like people who have multiple pistol permits and live opposite an elementary school. Next slide.
Okay, so here's a kind of provocative question: should the rich have less privacy? You know, when we go out there and we do data journalism on rich people, like corporate directors, we could also go out there and find people who are abusing the SNAP food stamp program. But it's an interesting societal question: do we go after those people, or do we just go after the easy pickings, the high-profile ones? Next slide.
Okay, indirect risk. There are companies collecting this data, like ChoicePoint. Next slide. Ancestry.com: they have six billion records, mostly courtesy of the public. Next slide. There's your DNA test. They want your DNA, and they charge you $99 to take it from you. Next slide. Or if you're in New York City, there's mobile DNA testing. All you have to do is flag them down, there he is, and he'll tell you who your
daddy was right on the spot, or who your daddy wasn't. Next slide, please. Okay, there's a complaint against them from Privacy International. Next slide.
Tools that we use for this. An inquiring mind, that's the most important one. Having a motive, like, you know, financial motive, ID theft, and so on. Scripting languages. ScraperWiki. Again, getting technical on you here: Python, PHP scripts, all this stuff is out there. There are libraries of scripts, and if you don't want to do that, just go to a hackathon and some friendly person will show you how to do it. Next slide.
Okay, even if it's always been public data, now it's super public. Companies like Checkpoint actually go out there... ChoicePoint, rather. They actually go out there and they pay high school students to go into the basements of courthouses and hand-copy people's divorce settlements, because you're not allowed to photocopy them and you're not allowed to download them electronically, but they're public information. Once that stuff gets put into a file, as it is at ChoicePoint, it can be accessed for a fee. I knew one guy who could never get a job, and he didn't know why. He was an engineer. He'd always get interviewed and never get hired, and he finally got a friend to pull his private file from ChoicePoint, and what was it? He was a convicted murderer. Well, he wasn't actually a convicted murderer; they got a Social Security number wrong. And, you know, his friend said, well, you know why you're not getting a job? You're a murderer. And he said, I am not. Next slide.
Okay, so here are the dirty dozen. Inadvertent disclosure of data. Sloppy data entry. Malicious information. The lag between law and policy. Making inferences from small numbers, like very few people having voted. Assuming nobody like me will torture the data. My name is Tom, and I torture data. Next: inconsistent data. You know, I did a project with the RAND Corporation years ago where we were collecting data on New York City fire trucks, and we had people getting from the Battery, like Battery Park,
to Harlem, in a fire truck, in three minutes and 40 seconds, and we couldn't figure out how that could be. You know, they're not helicopters. We finally went and watched them, and what happened is the guy who sits next to the driver was recording the data. They would go out and put fires out all day, and then they would go, oh, we forgot to do that for those RAND guys, we'd better go fix it. And they made up all the data.
Data jigsawing: taking different databases, public and private, and putting them together. Facial recognition, a whole big area coming up. Your face can be quite a vulnerable thing. I know Alessandro Acquisti stood here last year and talked about the fact that he can take your photo on Facebook and your photo on Match.com, and he can find out if you're SexyBabe235 or if you're HungDude204. He can disambiguate those just using your face. Retroactive analysis, going back.
So, some recommendations. We have to scan the file for PII that might be included, and think about ways it could be revealed indirectly. This is all on the CD; I don't need to read it to you. Next slide.
And I want to tell you a little bit about this image. There's a great project out there, concepts for all, and it basically is... I want to end on a hopeful note. There was a poster program at the Université du Québec à Montréal, a poster design program, that was going to be shut down because it's so expensive to print posters and all that, and my friend who runs it discovered this thing called the internet. So now all these images get posted on the internet. They are beautiful images, and the only condition for using them for something like this is that you say this was courtesy of Elsie at least Peixar. So I've given her credit for that. So I recommend that site to you. I recommend that you talk to your government. We are the people, we are the democracy, we get to decide, supposedly, what our governments do to or for us. This is an area that you're going to be hearing much more about. I promised to stay on time. I think I've got five minutes
for questions. Is that about right? Yep. Okay, so let's have the first question. Thank you, and thanks to all the hard-working goons who made this image actually come up there. We've paved the way for all the future speakers. It's hard to get the first question, so I'll start with the second question. Okay, who's got a question? Yes, you're... turn him on, turn him on, please. Come up, guys. Oh, there we go.
I'd rather stay off the stage. I don't want to be recognized. So there's generally a movement in the United States to crack down on perceived voter registration fraud. It's mostly bullshit, but it's a way to keep black people from voting. But in general, I think governments feel like if you require people to provide an address, it keeps them honest, even if it isn't particularly useful. Is there another way you can think of to keep people more honest on forms? Is there another way, or should we just stop worrying about it? I mean, you know, actual election fraud, where somebody pretends to vote or votes twice, I mean, it's almost impossible to find an instance. But I was wondering what you think about that.
You could take their DNA. There are all kinds of other things you can do. I mean, the whole point of the address is the reason people... yeah. He asked how you could do something other than addresses to eliminate voter fraud, and I joked that you could take people's DNA. I mean, there are countries in the world where people vote and they get their finger marked with ink, right? And sometimes they don't want to vote, because having their finger marked with ink, they might get that finger cut off by somebody, or, you know, be punished for voting. There are alternative ways to do it, but I can't think of too many other ways. If your contribution is small enough, I think the reality is giving your address is optional. You can give the address of your union and never go pick up the receipt. So, campaign laws: I guess
the big idea is that the campaign law really needs to be reviewed. When it was passed in the 1970s, we didn't have the technology to do the kind of things that we do now with the data, and it's only going to get bigger, and the data now is going to stick around forever. So they'll be able to go back not seven years but 27 years.
Yes? ScraperWiki has a whole library of tools. I mean, to tell you the God's honest truth, I used Excel, because that's all I had to really use to take that. But in the presentation there are those lists, ScraperWiki and so on. There are a bunch of places where you have kind of customized scripts that allow you to go out there and scrape public databases. You know, the world is your oyster. I think just about every municipality now has some kind of data there, and I just keep finding more and more, and I haven't really found one yet that I don't have problems with. So I guess the challenge to you, you know, between now and next year, is to find a lot more holes in open government and bring them to our attention. One more question, maybe? Going once. Yeah.
Do you know if any of those terms of service for public data requests have been tested in court?
I don't, and I actually asked Marcia Hofmann that very question, and she's researching that, because I thought that's a very interesting thing. All I can tell you is, in Calgary, and I'll just summarize the story: because of people doing what I did, and also because of that secondary use of the data to actually try to commercialize it, the city shut the database down for a period of time. And when it came back, there was a very elaborate terms of use, which I violated here today, because it says this is only for checking your own tax records. Because when it first came out, people were checking their boss's house assessment, they were checking their ex-wife's, you know, everybody that they knew. And they now control it. So what they've done, and it's a good point, is they have two
different levels of access. If you want to just know... and I should tell you, you're allowed to know what your neighbor is assessed at, because that's an issue of fairness. If you're assessed unfairly, you have a right to appeal. But you're not allowed to know how many square feet, and all the other details. So what they do now is, there's a public-facing one with just the number, the assessment, and there's a deeper level that you have to sign up for and be validated. So that's how Calgary... and I want to say Calgary did, I think, an admirable job of taking what was a real mess and making it good, and hopefully other cities will follow that lead. Are we done? Thanks, Monkey. Thanks very much.
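To make the two-level idea concrete, here is a small sketch of how a public view and a validated view of the same assessment record could be separated. The record fields, the `validated` flag, and the sample values are all hypothetical; the talk does not describe how Calgary actually implements this.

```python
from dataclasses import dataclass, asdict

@dataclass
class Assessment:
    # Hypothetical fields for illustration only.
    address: str
    assessed_value: int
    square_feet: int
    year_built: int

# The public-facing level exposes just enough to judge fairness: the number.
PUBLIC_FIELDS = ("address", "assessed_value")

def view_for(record, validated):
    """Return the full record for validated users, otherwise only public fields."""
    full = asdict(record)
    if validated:
        return full
    return {name: full[name] for name in PUBLIC_FIELDS}


if __name__ == "__main__":
    # Invented record, not a real property.
    record = Assessment("123 Example St", 414000, 1800, 1975)
    print(view_for(record, validated=False))
    print(view_for(record, validated=True))
```

The design choice worth noting is that the public view is built from an explicit allow-list, so a field added to the record later stays private unless someone deliberately publishes it.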