So, welcome to the Dark Arts of OSINT. This would be Dr. Noah Schiffman, aka Security Freak. He is the academic of our team, the one who actually finished college. Yes, he is way more intelligent than I am, a snappy dresser and an absolutely wonderful guy. Dude, that's so not cool. Can I have a red shirt kick the shit out of this guy over here? Evan Davidson. Alright, there we go. Not like they didn't tell us that earlier. And I'm Skydog, of course, by the picture there. We are part of the Dead Bunny Club, which, whether you've heard of it or not, is the pseudo-philanthropic arm of everything Skydog does. So we got together, I met you a couple years ago and we found that we're fast friends and we have a lot of fun getting together and getting into major trouble. Sometimes a little more than friends. He's a great color though. So, as they announced earlier, it's my 11th year coming to DEF CON. I actually was back in the AP days. Who's been to the AP days? Everyone's a newbie, that's wonderful. I just heard about DEF CON like two weeks ago. So, I get to celebrate, ironically at my 11th year here. I've actually been in good for nine years. For my 11th year here, I get three firsts, which is kind of odd. It's not losing my virginity. I'm hoping. I'm really holding out. I understand I have to talk to a girl though and I'm not ready for that. So, the first one was really wonderful. My son got to participate in DEF CON Kids. I'm old enough now that I have offspring. He placed fourth in social engineering and second in Hacker Jeopardy. Definitely a first for me. My second would be my first mohawk ever. I got to participate in Mohawk Con this year. So, a round of applause for those guys. They did an absolutely wonderful job. I had to leave Vanderbilt to actually be able to make that one happen. And of course, my third is actually being accepted to speak at DEF CON, which is a great honor. I did find out that they do require you to submit a paper.
That's why it took me so long. I didn't read the fine print, but here we are. So, we're talking about our live demo. Yeah, so, there's this live demo thing that I maybe kind of discussed in the CFP and brochure. Well, I don't know how many people here are familiar with something called MATLAB or, I don't know, other letters of the alphabet. Yes. What's your favorite letter? No. So, I didn't have a licensed copy of MATLAB and went with Octave and got into a battle with Octave; Octave won and I lost. So, we're doing a different kind of live demo that's sort of audience participation based. So, it's going to be really fun and you're going to get to meet the people sitting next to you. It's going to be a fun ice breaker opportunity. No, it's not. It's actually just, but it's going to be a demo that we can all participate in and make a point. So, I hate Octave. I hate it. Okay. So, that's all. Yeah. Ready? No. Get loose. Here we go. So, our talk today is about the Dark Arts of OSINT. So, the path we're going to take, we're going to talk about what is OSINT. We're going to move on to... Evan, if you call me again, I'll fucking kill you. I swear. Fucking kill you. Anyway, I digress. So, we're going to speak about what is OSINT. We're going to talk about some acquisition tools and techniques. I am then going to sit down and the guy with the math background is going to speak about anonymizing data. And then, uh-huh, that's you. You don't remember? I'm going to leave the stage and Noah is going to speak about anonymizing and de-anonymizing data. So, open source intelligence. I guess I have to hit the button, don't I? Open source intelligence. Thank you for putting the pause in there. Sorry. Did you get the transitions in there? Some. I don't know. The cool one that wipes? The dissolve. You pay for the dissolve. That's good. So, what is open source intelligence?
Essentially, open source intelligence is anything out there that you can reach without having to be a LEO or something similar, or belong to a large organization that requires paperwork to get to it. It's anything you can get to online or that is readily available. Well, why do you care? Who had a picture taken of them this weekend by some jackass with a camera? Not one of our photographers, but someone with a phone or whatever. Guess what? You're now hooked up with open source. The information is out there. You appear in a picture. Now it's something I can catalog and index. So, congratulations. We weren't going to talk about it. So, how can it be optimized? We're looking at big data sets. One of the things that Noah's going to get to is taking the big data sets and crunching the numbers and actually extracting some information, some interesting information, out of what's readily available. So, OSINT comprises many things. One of them would be text, whether it is emails that you sent back in '73 where you were talking about something bizarre. Was it? Did you send anything back? Never mind. I've gone back actually and found some of the things that I've done on forums way, way, way back in the day with a different name that I was able to actually find online. Things that probably would have shown how ignorant I was at the time. But anyway, you have text that's out there that can be searched for. You also have imagery. We have Facebook. We have appearing at DEF CON. Whether you realize it or not, you probably had a picture taken of you at some point in time that appears there. Video. I think last night Evan played the little VR system where you had to move around the map and began to do the robot, which will appear on YouTube with a little bit of captioning later on. Yeah, the black cat robot, absolutely. So we also have audio. The video that we have here of this presentation will be available on DVD later.
But they also put the audio up of that so you can, if you're not into driving looking at your iPhone, you can listen to the audio. And then you have Geospatial, which would be the images that you take from a device that's GPS enabled that records your longitude and latitude and altitude and fun things like that. Other information that doesn't always get removed from imagery when it's put online. There is a certain signal to noise ratio. If you've been online and you've looked for data a lot of times the aggregators of that data may have some really bizarre things that show up. No, I never lived in Henderson, Nevada, but for some reason my name and my phone number are associated with that. So there's just a certain amount of it that's out there that doesn't really fall into place correctly. You have to go through and decrease the noise to get the true signal. So out of that, once you clean up enough data you're able to go through and put enough things together, layer them together, find where the data points in the graph appear, you will find actionable data. Anyone that's actually in the law enforcement community, which I'm not, anyone who is in that community realizes that when enough data is collected it becomes actionable and then it becomes intelligence, something that can be used to actually do something. So, sorry, I got a little cough there. The history and origins of it. You're leaving? No. No, I don't want to drink anymore. Not yet. Wait till you get on stage. So print media originally had newspaper clippings from other parts of the United States. Someone would catalog those things and actually write up a report on it. We moved into the radio age. Things were actually transcribed and then cataloged and indexed. The search time on information like that was a little long if you want to complain about Oracle or MySQL or something like that. The paper version of it really sucked. 
We moved to television, things that got compressed down to videotape and things of that nature. Like I said, I recently worked for Vanderbilt. They have the largest compendium of news broadcasts. They go back farther than anyone else. That information can also be searched by metadata. And then of course we're down to the internet age, where every jackass can get out there and dance and then put online their robot at a large security conference. That's coming back to haunt you, asshat. So the evolution began with news sources, of course, with radio and print. Then we moved to government repositories. For some reason they decided it would be a good idea to collect information and store it. Who knew? Then you went to academic publications, where they began to collect data and sort everything and put it together. Theoretically they anonymized it. And now we've moved into the age of electronic databases where we know everything about you. Those are sexy. Those will get you laid definitely. So the current forms and uses of those are definitely tool sets, websites you can go to and of course databases you can get your hands on, depending on what your flavor is. So who's ever used Maltego? Show of hands. Maltego is basically used... last time I let you do this part. Maltego is used basically or primarily to dig down on an organization. You can look at their WHOIS records and DNS and IPs and emails and things of that nature. I'm going to have someone else come up here and stop your ass too. Maltego is really good for drilling down on a company by looking at email addresses and things to compile a large amount of data. Who's used FOCA? If you haven't played with FOCA, FOCA is a lot of fun. Basically it looks at the metadata in Microsoft Office documents, PDFs. It'll do OpenOffice and actually looks at the EXIF metadata in pictures. So you can begin to compile information just from the hidden information in all the documents you can get a hold of.
Randy from accounting puts out some sort of a document and inside that it contains information about where it's stored on the local network. It actually makes it to the outside world and it gives me some information about how the interior network is built. That one's really fun to play with. SearchDiggity. Anyone use that one? Not in my backyard. Apparently SearchDiggity isn't used as much as everyone would like. But it basically is another form of being able to sift through data. It takes information from Bing and Google and other sources and sort of compiles it together and gives you a nice little interface to be able to get to it. So there are a lot of different pieces of software out there. Has anyone heard of Recorded Future? This is one of those that makes you cringe a little bit. It's a temporal analysis engine. It forecasts and does analysis to predict future events based on information from social networks and patterns that it can find. So they're able to go in and put some information in and actually determine what could possibly happen based on information that's flowing right now. Of course, there's Facebook. Who's put their music preferences on? Alright, who uses Facebook? That's alright. We're among friends. You can raise your hand. Big mistake. Yeah. Do we get a picture of that? So if you put onto Facebook, hey, I like REO Speedwagon, and for all the young guys in the crowd, that's really a rockin' band. Hey, I went to REO Speedwagon. Well, I can go back in with Graph Search now and say, hey, I want to know anyone who lives in Tennessee who likes REO Speedwagon and blah, blah, blah. And I can then mine some data out and I guess give you a jingle and say, hey, why don't we listen to records? At which point you would probably run. So there are a lot of ways, things are actually being put out there now for you to be able to just look at the data and try to grind through it. And there are other websites: Social Mention, Spokeo, Meltwater.
I have my own personal preferences on what to use. Johnny Long isn't here, but has anyone ever seen the Google Hacking Database? Okay, so a bunch of things that people have put together. If you're looking for certain types of information, they've put query structures together for you to use. This is what it's like to hang out with Noah and I at any point in time. So basically you have three different types of public data. You have cooperatively provided data, which would be: this is my name and this is what I like. That's social networking; it's what I put on Facebook: I like REO Speedwagon and the Smurfs. It's things that you willingly give. It's things that you put out there, your personal preferences and things of that nature, or posts that you've made that can actually be mined to look at. But you've willingly given it up. Did I say that right? Okay, just checking. Things that are confidentially provided. Something that has a session ID attached to it. I had to log in to give that information. I filled out a questionnaire survey. I said yes, I'm more than happy to allow you to look at this information. I put something in there enough that it's very identifiable, be it my address, my phone number, my credit card, things of that nature. So you have an actual site with a privacy policy where you say, I agree to it. So you've given that information up and you've agreed to their legal statement there. And then you have the things that are unwillingly provided. Wait, where did they get this from? So it's the DMV records, it's other information, maybe it's your medical records, or how the fuck did they get my APGAR scores. I was slow at birth and it never got better. Things that a third party generated, government and academia. Who's ever participated in something in college where you got paid 20 bucks for mass probing or something like that for research?
So they take that data and they put it into a database and they put it online, and theoretically your name's not associated with it. So who publishes these data sets? A lot of the time it's the government. There's academia. And there's a commercial market now for data that's been pieced together, and you can go in and cruise through that data; the more you pay, the more granular your data becomes and the more revealing it is. Why are these data sets published? For statistical analysis, it's coming up, trying not to laugh. For statistical analysis, we want to go back and look at the information and do some predictions. Looking for trends and patterns that are out there. And retrospective outcomes. For that, we struggled trying to find the proper example. We decided on: which is better, Viagra or Cialis? We go back and look at the information and see the satisfaction, well, I guess it's not the right terminology. I said Viagra. I heard someone say Cialis. A buddy of mine, I swear. He said you? What do you mean? It was a friend of mine too. No, it wasn't. Evan? Okay, he's hiding. That's good. And of course this information is used for decision making for future things. Maybe it is product design, or coming up with something new and whether it's actually going to be popular in any way, shape or form. One of the things that are used in here are the tools and websites. I don't do the math. That's this gentleman's side of things. Occasionally I get asked to find things. Who in the crowd, who finished high school? Show of hands. It's okay. It's okay. Alright. Who went to college? Now who finished college? Okay, okay. This is your crowd. So anyway, you want to do that? No. I did not finish college. I had a hell of a lot of fun while I was there for my GPA. But what I did not learn while I was at college is what you can and can't do. It was not taught out of me: oh, you can't do it that way. So I've never heard that before and I don't pay attention to it.
So it makes it a lot easier for me to do some things like drill data on somebody. So occasionally I'll get a phone call and I'll get a couple pieces of criteria and they say, find someone. And I've become very adept at doing so using all the open source information that's out there. So has anyone stayed at the Bellagio? This is audience participation. You're awake, right? Anyone been there? It's kind of by the refrigerator pool. Absolutely wonderful. If at any point in your lifetime you can make that happen, definitely do it. I'm in the sun. I've got the MacBook Air with me. I'm trying to get on the shitty wireless there that does not work. And there's a gentleman to my immediate right. And he notices I have a computer. Which for all of us is typically the sticking point to, yeah dude, my computer at home doesn't work. Who's ever answered that question? In a swimsuit by the pool. And a guy starts talking to me. Okay, I'll buy it, no problem. So we start discussing China, politics, the economy. Fun things like that really make you happy. Now we have a few drinks and he says, you know, so you're in Vegas. Are you here for business or pleasure? And I said, well, currently for pleasure, I would think that would be the case if I'm by the pool. And he says, you know, so you're here for pleasure, that's good. And I said, well, actually in two or three weeks I'm coming back out to the largest hacker conference in the United States, called DEF CON. And you could hear his asshole pucker in the seat. So, you know, that's one of those things where, who in the crowd hasn't had to explain what that means? Put your hand down ASAP. So I began to explain what DEF CON is. Since we didn't have the documentary, it was very interesting trying to explain it to him. Hearing impaired con. Yeah, that's it. Hearing impaired con, definitely. But I got to spend some time trying to explain to him in mundane terms what we actually do and why we get together for all this.
And then his jackass friend shows up, who has come to Vegas to go to the Pawn Stars place downtown. And he comes back: dude, I got to meet Hoss. Okay. Let's go get a steak. So he, you know, packs everything up. And he says, yeah, you know, we're going to head off and get a steak at so-and-so place and really nice meeting you, later. And I said, just a second. I said, your name is Brian. And your family owns a civil construction firm in Seattle, Washington. And the guy says yeah. And I said, I'll send you an email to your work email within the next 48 hours. Again, you can hear his asshole pucker. And I said, don't worry. I said, I'm going to show you. I have two bits of information on you. I don't have your last name. I don't have much more than that. But I'm going to send you an email and show you what's possible. So we went out and had a nice dinner. Went out to the pool the next day. And at some point I thought I could go find Brian. So I sit down on the bed and fire up the laptop. And in 45 minutes I owned this guy. I have where he lives, pictures of his house, what he paid for it, pictures of all of his relatives. I then took it upon myself to scan the exterior of his network and tell his system administrator, you probably should change this. It's not good to have this open. Brian never responded to the email, oddly enough. I didn't think it was a problem. I didn't send him an invoice. I did it gratis. But that's a good example of: I had two bits of information on the guy. Fortunately one of them was unique enough. It allowed me to find him. I was able to correlate civil construction, oddly enough, against a YouTube video which I was able to pick this guy out in, and from there just went to town on him. So I guess if you get an email from a guy that you met by the pool who says he's a hacker and he has a picture of your house from the driveway, it might be a little bit unnerving. So. Was that legal? I don't give a shit.
So anyway, hey, I don't have to have a court order. Apparently no one else does. But anyhow, the open source side of it can be a lot of fun. One of the things that Noah's going to discuss is finding outliers in the data. Brian gave me enough to be able to find him. Had he said my name is John, the problem would be a little bit more difficult. If he said, yeah, I work at Starbucks. Okay, not as much of an outlier there, but given time and effort and how much he pissed me off, I probably would have found him eventually. But based on the information it took me about 45 minutes to track it down. So if you ever get bored and you're by the pool at the studio, just wait for someone to come by. It's a lot of fun. Talking to guys at pools, don't you? Have you ever been given a wedgie on stage? I would love it. You take the microphone. Wow. Sky claimed that I'm going to talk about a lot of things, and I don't know where he got that from. You're drunk. You're really, really drunk. I know a little bit of math, some basic addition, subtraction stuff. I'm not really going to talk about anything really hard or advanced because that's for smart people. Actually, a lot of these slides. Where's Echo? I don't like that Echo. Sorry. Have fun. So these slides are semi-new to me, but I think I did make them. So let's go through them. Data science. This is a big field. Data science. The science of data. Science has been around for a long time. Data has been around for a long time. You put them together. It's emerged mostly over the past decade; being really like real data scientists, information scientists, that's been a past decade kind of thing. And it sort of came out of the whole business analytics, competitive intelligence, like everything else driven by big business, because they're just looking out for our best interest.
So all of a sudden people who are like statisticians, who are experts at data mining and all these types of advanced mathematical analyses, are very valuable to big businesses and other entities that like to analyze large data sets. Are there other entities that collect lots of data? None that I've heard of. I haven't heard of any either, but I'm sure there are organizations out there that are collecting lots of data and doing something with this, but... Purely for benevolent reasons. Yeah, exactly. But it's mostly to enhance our shopping experience. Other people who bought this also bought this. And statistics: given data, you try to come up with a model. Probability: given a model, let's try to predict the data. Simple concept. Here's a little graphic demonstrating what I just said and it's useless. Historic data, model, future. Data sources. These are some random examples of readily available public data sets. We've actually gone from having databases of information to databases that are cataloging the databases of information, and it's increasing exponentially. My favorite was Freebase. I came across it when I was searching for something else, but apparently it's a database. I also like Infochimps too. I don't know why. It's a funny name. Okay. Not just data, but big data. Buzzword. Who thinks it's a buzzword? I was thinking more buzzword. Some people, and the other people think it's really a legitimate real thing. Okay. That's cool. I don't judge. I don't know. It's hard to define what that really means. Big data. Is it big data? Is it in the cloud? What's the cutoff for being big? 8 inches? 10 inches? When does it become really big? How big is your data? My data is huge. I work with a very small data set and I'm okay with that. At this point, this is yet another presentation that we cannot put in our portfolio for public speaking. That's true.
Technically, at least what I found is that it's sort of defined as these incredibly large amounts of data that are being rapidly generated and have lots of variability. Okay. It's still big data. But the interesting thing about it from our perspective is that the creation of big data has also brought forth the development of tools to work with big data, to analyze these big data sets. Visual representation, doing number crunching on them. All these new mathematical and advanced platforms for performing all kinds of functions on big data, which is of interest to us. We're going to look at that in a few minutes. Or not. Okay. Terminology. That means the defining of words. We Googled it backstage. Depending on who you talk to or what publication you read or what book you read: anonymization, de-identification. They basically mean the same thing. De-anonymization and re-identification basically mean the same thing. There'll be some studies, some groups that will distinguish them; for the purposes of our talk, they're synonymous, and the two pairs are sort of antonyms, opposite meanings. So you reverse one of these processes, you get to the other. Pretty simple. Anyone with a fifth grade background should get that. Okay. This is real simple stuff. Data. When it's initially collected, a lot of times it contains personally identifiable information like social security number or address or something else. Your name, that would be personally identifiable. So there needs to be some kind of process that takes this data and makes it sort of anonymous. I love you too. Oh, what was that? Ten? Holy... Okay. Super food fest. Dude, you took up all the damn time. Damn. Okay. Wow. Okay. So we need to find a way to take this personally identifiable information... Why? Okay... and make it into anonymous public data. So there's a couple different ways that can be done in general. Just removing variables altogether. A variable that actually is unique enough to be identifying by itself.
Like, you know, I've had eight kids and been in porn. That's, you know, Octomom or whatever. Just remove those. Global recoding, local suppression where, again, recoding certain variables or suppressing certain values in different columns that are really identifiable. A whole bunch of different ways to... Yeah. Okay. Anonymization metrics. We have to figure out a way to look at the way we anonymize data and figure out, hey, is this working? Is this, like, actually making the data anonymous? And at the same time making it usable. So the whole utility versus actual anonymity. I mean, that's a balance right there. So two metrics. Disclosure risk, likelihood of revealing data in the public set. And then information retention. How the utility of that data. So we take away all this information. Ah, it's anonymous, but is it still usable? So that's a balance you have to strike. Yeah. It's a tough problem. You want to minimize disclosure risk? Maximize information retention. Easier said than done. But information entropy. Anyone familiar with this? Entropy? Yes, yes. And not the entropy from thermodynamics, which I spent a long semester trying to go through. So yeah, information theory. So the idea is the... I have, like, a million slides to go through. Basically the amount of information that can be the number of states that can reveal the total number of possibilities for a given state like the... I actually use an eight-sided die in an example that obviously you can roll and you get like one through eight because it's got eight sides. So yeah, information entropy is going to be three bits. And yeah, so population of the world, let's just say eight billion. That's like 33 bits. Awesome website, 33bits.org. Very good. Anyway, all right, I'm going to cruise over lots of stuff. Audience participation. Everyone just get up and participate in some way real quick because we got to do something, get up and... No, I don't... Should we do this with time for this or what? 
I think we have all the time we want. Really? You got that pull? I didn't do that. I don't know. I'm wrong. Let me get the radio and get a couple of retards in there. We were going to look at audience participation and kind of go through and sort people out based on some criteria. We can skip it if you want or if you want to stand up and raise your hand. Do you want to do that? Okay. Cool. All right, first question. Everyone here who... this is their first time attending DEF CON, please stand up. Noob. All right. Noob, noob, noob, noob. Now come up with West Coast East Coast or... Okay, tell you what. Anyone from the East Coast, stay standing. Everyone else, sit down. You guys paid the highest airfare. Thank you very much. We enjoyed that. Yours. Anyone here from New Jersey? Wait, I didn't say what to do. I just said anyone from New Jersey? Simon says... No, you can sit down. What do we got? 7, 8, 10 people? What are the states below New Jersey? I was going to say you had a hangover, but I guess it's not publicly available data unless we query everyone in the room. I would say anyone who's male, stay standing, but that's pretty much everyone. Any female? If you're female, raise your hand. That would be the data set. Never mind. There we go. Anyone, say 29 years of age or younger, remain standing. All the old fucks in the room, sit down. That's good. How we got left? 1, 2, 3, 4, yours. Anyone here from living below North Carolina border? Sit down. Do we do New Jersey enough? Yeah, so we're now between North Carolina and Jersey. Who do we have? You said New Jersey enough, still stay standing, so you're in the upper quadrant there. So I did age, we can't do male-female. Who got laid last night? That's a bad data set too. So how many people are we up to? Who is remaining standing? Count them off. I can't see for the lights. How many people do you think are in this room right now? 700, 800, 1000, something like that. I don't know. Of that, we're down to what? 
Four people? Three people who remain standing? And how many questions? Well, that was five questions. Well, it was maybe four or five questions, but the entropy for those questions, so what North, West Coast, East Coast, entropy there is one bit. We had... What was the other question? First time at DEF CON. First time at DEF CON. Information entropy there is two bits. Two bits. Anyone above what New Jersey and above, is that what you said? Yeah, pretty much. Actually I think all the questions were like three bit entropy question. So five... Yeah. So basically five bits of entropy and we were able to narrow down the population to three, four people. And it's all innocuous information, but the point is that the combination of all this innocuous information can actually be quite identifiable. Thank you for participating. A round of applause for yourselves. Thank you. So how much time left? Just keep going. Three? Okay. I have 20 slides to do in three minutes. Okay. Thank you, Scott. I appreciate that. Outliers? Values, traits, anything outside of normal distribution. Single outliers, easy to pick up if you have them in combinations or sets which are unique, a little bit trickier to detect, but mathematically possible. Graphical example of an outlier was, this is an IQ of probably here, everyone in the audience, and it was an outlier kind of, I'm special. Data set intersections, Venn diagrams, who's heard of them? Yeah. Okay. You have sets of data. You have set A, set B. What's the intersection there? A, look at that. A and B, amazing. Now you add C. Look what you have. A and C. B and C. And what's in the middle? Holy crap. Isn't that amazing? That's a math joke, isn't it? That's the math thing happening. Yeah. Well, that's a good point. Okay. A unique variable overlap. You know what? I just, yeah, if you have outliers for different types of data and they, you know what? Just move on mathematical attacks with three minutes. Yeah, that's not... Slow down. Just do it. 
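The bit counts thrown around in this part of the talk can be checked with a few lines of Python. This is only a rough sketch: the room size of about 1,000 and the four people left standing are the speakers' on-stage estimates, and the helper name `bits` is made up for illustration.

```python
import math


def bits(n_outcomes: float) -> float:
    """Bits needed to single out one of n equally likely outcomes."""
    return math.log2(n_outcomes)


# Eight-sided die from the earlier example: eight equally likely faces.
die_bits = bits(8)

# World population, using the talk's round figure of eight billion.
world_bits = bits(8e9)

# Audience demo, using the on-stage estimates (assumed, not measured):
# roughly 1,000 people in the room, about 4 still standing at the end.
room_size = 1000
left_standing = 4
demo_bits = bits(room_size / left_standing)

if __name__ == "__main__":
    print(f"eight-sided die:  {die_bits:.1f} bits")
    print(f"world population: {world_bits:.1f} bits")
    print(f"audience demo:    {demo_bits:.1f} bits removed")
```

Under those assumed numbers the handful of questions removed roughly 8 bits, a little more than the figure estimated on stage, which only strengthens the point: a few innocuous yes/no answers go a long way toward singling someone out.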
Alright, well dude, you're the good one. I got it covered. Sweet. Inferential analysis, and an example of it. Remember the Target targeted advertising, the teenage woman who was pregnant and was getting all this targeted advertising based on her purchasing behavior to her household, and then her dad was upset that she was getting targeted ads for like Enfamil and diapers, when, this is my teenage daughter. She's not pregnant. And got all pissed off at the manager. Anyway, she was pregnant, and that's how he found out, was through Target. Yeah, that's not a good way to tell your parents you're pregnant. I'm sorry, yeah. That's not how I'll tell my parents. No. So database linkage, classic examples, the whole Netflix/IMDb thing that was, yeah, I'm sure you all remember that. Okay. U.S. census data. This happens. They don't knock on your door anymore. I think there's like, do they? I don't think they... When did they stop knocking on your door? I don't answer the door. Me and my 12 roommates. They still do, they really? Man, all right. Another reason not to answer the door. Actually, a researcher, Latanya Sweeney, came up with a way, just using information from the 1990 census data, which was date of birth, gender, zip code: 87% of the population was unique. Amazing. And it's just based on principles of information entropy. Amazing. Exposed healthcare records of the governor of Massachusetts at the time, which is kind of funny. And screw you well to apply to entropy. So how she did it, I mean, zip code. There's 43,000 zip codes in the U.S., roughly. Birth dates, 365. Birth years, about 70 different, an age range of 70, and two different genders. Hermaphrodites were excluded. So 30 bits of entropy, which includes all the population in the U.S. Simple as that. PGP. Ever heard of PGP? Personal Genome Project. Okay. So this is another program going on where people voluntarily submit all this genetic information about themselves.
They want to correlate genotype and phenotype to learn about themselves. Oh, dude. Anyway, again, this is a project gone bad. No one saw that. That didn't happen. Record linkage. Is this a cool diagram? You got to see this. Take care of him, dude. He's stressing me out. Record linkage. So this is where you have a public data set and a private data set. The public one maybe has metadata that's publicly available and might have some innocuous but identifying information about an individual. The private data set, well, that's got personally identifiable information that you don't want people to know. With record linkage, it's possible to actually correlate the two and discover these anonymous, or so-called anonymous, traits about a person by combining the two data sets. And I'll get to mathematically how to do that in a second, or not, if I get kicked off stage. Flying through these slides. Vectors. This is where it gets right into the math. So either go to sleep or, anyone for math porn? Okay. Your data points now become a vector. Your record attributes, boom. Okay. We can now do vector math. Take it one step further, the whole database. It's a matrix. Boom. Records, people, attributes, database. Okay, cool. And again, we now can apply matrix math to this, matrix inversions and dot products, Gram-Schmidt orthonormalization, all kinds of wonderful things like that. And actually measuring the angular difference between two vectors or matrices can find the similarities in large data sets. Yeah. Boring, boring, boring. Math, math, math. The one cool thing that we did do is, hold on. Well, this is the actual mathematical formula for the similarity function in case any of you want to try this at home, or see me after class and we'll discuss it. Yeah. Venn diagrams, this is really cool. So to be able to visually understand and represent and identify overlapping data sets, we had two data sets. A, B.
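The similarity formula from the slide is not captured in this transcript. As a stand-in, the angular difference between two attribute vectors is commonly measured with cosine similarity, and a record-linkage step built on it might look like the sketch below. The record names, the rating-style vectors and the `best_match` helper are all hypothetical, and this is not necessarily the function the speakers used.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length attribute vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero record as matching nothing
    return dot / (norm_a * norm_b)

def best_match(target, candidates):
    """Name of the candidate whose vector points most nearly the same way as target."""
    return max(candidates, key=lambda name: cosine_similarity(target, candidates[name]))

# Hypothetical public data set: named users and their ratings of five items.
public_records = {
    "user_a": [5, 0, 3, 0, 1],
    "user_b": [0, 4, 0, 5, 0],
    "user_c": [1, 1, 1, 1, 1],
}

# Hypothetical "anonymized" record from a private data set, same five items.
anonymous_record = [4, 0, 3, 0, 0]

print(best_match(anonymous_record, public_records))
```

The more attributes two records share, the smaller the angle between their vectors, so the named record that lines up best with the anonymous one is the likeliest match.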
Multiple variables that were in common, that were the same descriptive traits. We looked at the intersections of them, noted here by these little lines across. Okay. So these data sets, independent descriptive variables, they're in common. Then we take those little sections that are in common and we [unclear], as we say. So take those and watch this. Bam, bam, bam. Bam. Right there. Look at that. And then, based on that, the subspace defined by that area is the intersection of all of these groups, and it identifies records for which all the attributes are identical, and actually identifies an actual person. Wait, we got... Okay. In summation, the rising dark side of OSINT. Okay. So, yeah. Emergence of big data, big problem, big data. It's being used for analysis and visualization. More data sets are being developed, and these mathematical attacks are going to become easier and easier. It's another weapon for social engineering toolkits, because this is information about individuals that we're going to be able to ascertain, and they're not going to be aware of it, and they're not voluntarily giving this information, but it's going to be re-identified about them from these anonymous data sets. And so, cool for us, bad for them. What can we do to defend against the dark arts? Proper sanitization methods. There are not... There are no standards for implementing anonymization metrics that meet the utility requirements but also provide true anonymity. They don't exist. So, we need access to tools, or my recommendation is to falsify everything and just make stuff up. So, that's... I would end. Conclusion. Questions and answers will be handled at the bar. You guys are buying. Ladies and gentlemen, the full presentation will be seen at SkyDogCon later this year. Absolutely. A round of applause for the speaker goons for letting us go a little long. Thank you.
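The Venn diagram step described earlier, intersecting the groups that share each descriptive trait until one record is left, can be sketched with plain sets. The records, field names and the `reidentify` helper below are invented for illustration and are not the speakers' data or code.

```python
# Hypothetical records; ids, zips, years and genders are invented.
records = [
    {"id": 1, "zip": "02138", "birth_year": 1975, "gender": "F"},
    {"id": 2, "zip": "02138", "birth_year": 1975, "gender": "M"},
    {"id": 3, "zip": "02138", "birth_year": 1980, "gender": "F"},
    {"id": 4, "zip": "90210", "birth_year": 1975, "gender": "F"},
    {"id": 5, "zip": "90210", "birth_year": 1980, "gender": "M"},
]

def matching_ids(rows, attribute, value):
    """One circle of the Venn diagram: ids of all rows sharing this trait."""
    return {row["id"] for row in rows if row[attribute] == value}

def reidentify(rows, known_traits):
    """Intersect the circles for every known trait and return the ids left over."""
    circles = [matching_ids(rows, attr, value) for attr, value in known_traits.items()]
    if not circles:
        return {row["id"] for row in rows}  # nothing known, everyone is a candidate
    return set.intersection(*circles)

known = {"zip": "02138", "birth_year": 1975, "gender": "F"}
for attr, value in known.items():
    print(attr, sorted(matching_ids(records, attr, value)))
print("intersection:", sorted(reidentify(records, known)))
```

Each trait on its own matches three of the five records, but only one record sits in all three circles, which is the point of the diagram.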
How can we take out SkyDog and his buddy? Thank you. I'm sorry. It's okay. Totally cool.