Please welcome very warmly Mr. Adam. He'll tell you about the data mining in the last election in Israel. Thanks.

Thank you. Okay, good afternoon everyone. My name is Yuval Adam, and as the title suggests, I came here from Israel to talk about the Israeli population census: what it is, what we can do with it, and eventually why it matters. If you want to follow the slides online, they're online right now at this URL. It's bit.ly 28c3-dm. I see. So that's for the introduction.

So back in 2001, things started surfacing on various file-sharing networks on the internet. What looks like a standard Microsoft Access program, even though it's right-to-left because it's in Hebrew, actually has beneath it a very interesting database, and it's one that had never been seen before. The database is the official Israeli government database that holds all the personal details of every single Israeli citizen, either alive or deceased, who has lived in Israel in the past. So all the details of every Israeli citizen are in this database and can be looked up through that horrible UI.

This data is the data that has been collected in the Israeli census ever since the declaration of independence of Israel back in 1948. So it's been collected ever since, and the leak, like I mentioned, started approximately around 2001. It has actually leaked several times, and the last leak happened in 2006. So what does this data have?
So, the data schema. Once we take a look at the database underneath, we see basically most of the data that you would expect the government to hold on its citizens. We have a unique identification number, which is given to any person the day he is born, or when he immigrates to Israel if he's an adult. So, a unique identification number, and then we have obviously the name of the person, the date that he was born, along with the country that he was born in, the gender, the status, which can be single, married, divorced or deceased, the current address and phone number. And then we have interesting foreign keys that actually point to this person's parents, to his father... Okay, thanks. So, foreign keys that point to the person's parents, to his father and to his mother, and if that person is married, also a foreign key to his spouse. There are also some other fields and metadata which I won't talk about and which are not too interesting.

The entire database holds approximately 9.2 million records. As of 2006, that's equivalent to roughly 7 million citizens in Israel, and then two and a half, or 2.2, which are deceased or whatever. So that's the data schema.

Again, when this thing came out in 2001, I was 17 years old, and I didn't really know what to do with this other than look up famous people whose phone numbers I wanted to find. That's all I knew what to do with it. So that was back then, and fast forward to today, I'm kind of interested in what we can learn from this data, looking at the big picture. Mining this data is not only a technical challenge; to me, I think it's also important to understand where this leaves us as a society, now that all this data is out there in the open.

So, the first thing... There you go. The first thing that I asked myself was: how easy is it to find someone in this database? Obviously people identify each other by their names and not by unique IDs.
We're not just numbers, we're people with names. So the first thing I wanted to find out is: given a name, how easy is it to find a specific person? And it turns out that it's pretty easy.

This is the uniqueness distribution function, basically telling us that given a single pair of a name and a surname, there is a 50% chance of that name being unique. So 50% of the names in Israel are unique to that specific person. And obviously that function goes up. So we have a 60% chance of a name being shared by two people at most, then we have a 70% chance of finding at most four people with that name, and then it goes up to 100% on the graph. But 100% we reach at the most common name in Israel, which is shared by, I think, more than 2,000 people. So that's what the uniqueness distribution looks like.

Now obviously, like I said, people share names, and we can always look up people by other fields. So if we, for example, take a person's name and surname and then look him up against his city, we have an 87% chance of finding one single record. That goes up to almost 100% once you look up a person by his name and his date of birth. And then you can always look up people by their name and filter by various criteria, against whatever you know that the person you're looking for matches. So finding someone in this database is not a tough task at all.

The second thing that I noticed is that, if you remember from the data schema, we actually have the ID, the unique identification number, for every person. And then we have foreign keys, like I said, to the person's father, to his mother, and to his spouse if he or she is married. So in this example, we have the person 1234 with his ID. His parents are 12 and 34, and very easily we can see that the father, 12, again has parents, 1 and 2, and that 34 is the mother; 12 and 34 are married to each other. Pretty easy stuff, foreign keys. So when I saw this, I was
thinking the logical thing to do in this case was to take this thing and throw it into a graph and see what happens, and see if I can try to match, not all the population, but most of the population: throw it into a graph and see if I can find connections between people.

And the easiest way to see how this works is to follow an example. So let's take, for example, a subject that was born in 1985, and let's assume for simplicity that generations are roughly 25 years apart. So for this person, we have his data and we know who his father and his mother are, and they were born in, say, 1960. And then for those people again (we're just following the route of the father; for the mother it's the same thing), we know who the father's parents are, and they were born roughly around 1935. So this is what we have so far.

The interesting thing is that once we get to the people that were born roughly around these years, there is no more data as to who their parents are, and that is for two reasons. One is that Israel has existed as a country only since 1948, and back then the population of Israel was no more than 300,000 people. So either their parents never were Israeli citizens and still lived abroad, that is one option. And the second option is that the generation above them might actually have lived in Israel; the problem is that the data isn't consistent, and unfortunately for people born roughly around those years, say up to about 1955 or 1960, we don't always have parental records, and the data isn't always consistent.
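As a rough illustration of the walk described in this example, the following Python sketch follows the father and mother foreign keys upward, one generation at a time, and stops where parental records are missing. The `Person` class, its field names, and the sample records are assumptions made up for illustration; they are not the real schema or real data.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class Person:
    # Hypothetical field names; the real column names are not given in the talk.
    id: int
    birth_year: int
    father_id: Optional[int] = None  # foreign key, may be missing
    mother_id: Optional[int] = None  # foreign key, may be missing


def ancestors_by_generation(records: Dict[int, Person], start_id: int) -> List[Set[int]]:
    """Follow parent foreign keys upward and return one set of ids per generation."""
    generations: List[Set[int]] = []
    seen = {start_id}      # guards against loops in inconsistent data
    current = {start_id}
    while current:
        parents: Set[int] = set()
        for pid in current:
            person = records.get(pid)
            if person is None:
                continue
            for parent_id in (person.father_id, person.mother_id):
                # Skip missing keys and keys that point to no record.
                if parent_id in records and parent_id not in seen:
                    parents.add(parent_id)
        if parents:
            generations.append(parents)
        seen |= parents
        current = parents
    return generations


# Synthetic records mirroring the example: 1985 -> 1960 -> 1935, then no data.
records = {
    p.id: p
    for p in [
        Person(1, 1985, father_id=2, mother_id=3),
        Person(2, 1960, father_id=4, mother_id=5),
        Person(3, 1960),
        Person(4, 1935),
        Person(5, 1935),
    ]
}

print(ancestors_by_generation(records, 1))
```

With these synthetic records the walk finds the parents, then the paternal grandparents, and then stops, because the 1935 generation carries no parent keys.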
So going above that generation is a little bit difficult. But what we can say is that for this person, if we know who his grandfather and grandmother are, that spans easily to his uncle and to his cousin. So given this person, for most people we are able to go up all the way to the father and the grandfather, and from that generation to span out to uncles and cousins, essentially giving us the opportunity to map out families spanning all the way to uncles and cousins.

Okay, so what does this graph look like? Like I said, we have approximately nine million nodes, and using this data only, only the strict foreign keys, we get approximately 420,000 connected components, which, if you divide the nodes by it, is an average of families of about 20 people. So essentially, using this data only, we can build a graph of families of up to 20 people and then span relationships from there. It's interesting to note that the graph connectivity can be much stronger if you use other metadata and other heuristics, which I won't go into now because we don't have time. But this connectivity can definitely be improved, and then you can span relationships further apart. So that's about building a graph.

Now, I mentioned it very shortly at the beginning: the data has been leaked over a period of almost ten years. The first leak that we know of dates back to 1998, that is, the version of the data dates back to 1998, and then the data has actually leaked several times up until 2006.
So usually, when leaks happen, they are recognized and plugged immediately. Not in Israel. In Israel the leaks happen again and again and again over ten years, which is kind of sad. This actually puts us in a unique situation, because we have the opportunity to analyze the data as it changes over time, over a period of ten years, which is a lot of time and a lot of data. So the question I asked is: what can we learn if we take two versions of this data and diff them one against another, an old version against a new version? What will we find out?

Diffing basically gives us three types of results. The first result would be new records. Now, these are kind of trivial cases. New records can be one of two things. Either children that were born sometime between the old version of the data and the new version, so say a child that was born in 2005 would not exist in 2001, but obviously would exist in 2006. Or we have another case, which is people that have immigrated to Israel, and we can know this from the date of birth. Say someone was born in 1960: we would expect him to be in the old version, but he's not, and he is in the new one. We can verify that by looking at the country of birth of this person and some other metadata, and we can conclude that this person has in fact immigrated to Israel at a certain point between these two years. So new records are pretty easy.

Then we have updates. Now, updates again fall into two categories. One is standard updates to the data of a person. People can change their name, people can change their addresses, their phone numbers, things like that. So for data that can change, we would expect it to be different between the old version and the new version. And then we have people that have passed away sometime between these two years. So a person that existed in 2001 and passed away sometime between those years is now marked as deceased. He is not deleted from the database.
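The diff between two versions can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions: each snapshot is assumed to be a dictionary keyed by ID number, the field names `birth_year` and `status` are hypothetical, and the sample rows are synthetic. New records are split into births and arrivals by comparing the birth year with the year of the old snapshot, as described above; ids that appear only in the old snapshot go into a third bucket.

```python
from typing import Dict, List


def diff_snapshots(old: Dict[int, dict], new: Dict[int, dict], old_year: int) -> Dict[str, List[int]]:
    """Classify ids by how they differ between an old and a new snapshot."""
    result: Dict[str, List[int]] = {
        "born": [], "arrived": [], "updated": [], "deceased": [], "missing": [],
    }

    # New records: present only in the new snapshot.
    for pid in sorted(new.keys() - old.keys()):
        if new[pid]["birth_year"] > old_year:
            result["born"].append(pid)      # born after the old snapshot
        else:
            result["arrived"].append(pid)   # old enough to have been in the old one

    # Records present in both snapshots: look for changed fields.
    for pid in sorted(old.keys() & new.keys()):
        if old[pid] == new[pid]:
            continue
        if new[pid]["status"] == "deceased" and old[pid]["status"] != "deceased":
            result["deceased"].append(pid)
        else:
            result["updated"].append(pid)

    # Records present only in the old snapshot.
    result["missing"] = sorted(old.keys() - new.keys())
    return result


# Synthetic snapshots, for illustration only.
old = {
    1: {"name": "A", "birth_year": 1970, "status": "married", "city": "Haifa"},
    2: {"name": "B", "birth_year": 1930, "status": "married", "city": "Eilat"},
    3: {"name": "C", "birth_year": 1950, "status": "single", "city": "Haifa"},
}
new = {
    1: {"name": "A", "birth_year": 1970, "status": "married", "city": "Tel Aviv"},
    2: {"name": "B", "birth_year": 1930, "status": "deceased", "city": "Eilat"},
    4: {"name": "D", "birth_year": 2005, "status": "single", "city": "Haifa"},
    5: {"name": "E", "birth_year": 1960, "status": "married", "city": "Haifa"},
}

changes = diff_snapshots(old, new, old_year=2001)
print(changes)
```

The birth-year test is only a heuristic; as the talk notes, the country of birth and other metadata would be needed to confirm that a late-appearing adult is in fact an immigrant.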
So people, if you're in the database, you're in there for life, or for death, or for whatever. So again, those are pretty easy, which brings us to the last and most interesting case: redactions. People that exist in the older version, say in 2001, and do not exist in 2006. If you're like me, you're going... Yeah? Sorry? Yeah, that falls under the category of the status change. Yeah, and please save the questions for later; I'll save some time for that.

So again, redactions. Honestly, your guess is as good as mine. I have no idea what these things are, okay? I really can't say anything more than that, because I honestly don't know. Now, this is under the assumption that the data has leaked from the same source across the entire period. Obviously there can be problems if you take data that is not consistent with the older versions, for various reasons. So assuming that the data has actually leaked from the same source and is consistent, these are some sort of redactions. That means someone, sometime, decided that this person existed back then and does not exist anymore. Again, your guess is as good as mine.

Data redactions are interesting in another context also relating to Israel, and that is that Israel has a law that requires every map vendor to pass any satellite imagery that it wants to publish through the government, essentially giving the government a chance to censor whatever the government thinks should be censored. This gives us an interesting case: Google, for example, is not required by Israeli law to do anything, while an Israeli map site is. So basically this gives us a map of the same area with one place that isn't censored and the other one very neatly photoshopped.

This is interesting because if someone wants to redact a piece of information, and he can count on that data being the single source of data, then you're fine. As long as the Photoshop guy did a good job, then you managed to hide whatever
it is that you wanted to hide. But the moment that you have another version of the data to diff against, that's where the problems start, essentially making the redaction not only useless, but even harmful to the effort of hiding whatever it is that you want to hide. Because now you can say: hey, wait, there's something here, and I don't know what it is from this zoom level, but if I zoom in I might be able to. And they're going to say: wait, so I want to zoom in and see what someone's trying to hide from me. So this is a very interesting dilemma.

So what's the problem with all this? Sensitive and private data has been leaked, and social engineering has obviously become much easier. And we know for a fact that in the past several years this data has been used for various identity theft scams and other scams, mostly related to money. But this data has actually been out for ten years, and how do we adapt to this situation? Because what's done is done. We can't take this data back, and it's not going to change. And the problem for the future is that we haven't actually learned much from this case, or from it in how we adapt to new laws.

Earlier this year, the Israeli parliament passed the biometric data law. Essentially, it's a law that allows the Israeli government to regulate the creation of smart ID cards, so ID cards that enable biometric data collection, for the purpose of making authentication much stronger than what it is today with the existing ID cards, and mitigating the problem of fake and double IDs. Israel has a problem in that a lot of its IDs are actually fake and people have double identities, not for the reason that you would think, but usually for reasons of criminal activity. So that's a problem that the Israeli government would want to take away, and therefore they passed this law. The problem with this law, which is generally a good idea, seeing that the world is
going to use much stronger identification means, biometric passports, etc. So the direction is definitely good. The problem is in the details, as always.

The system should work something like this. The government issues new smart identification cards that have the ability to save data on them, and the moment that a citizen receives a new identification card, the government takes two of his fingerprints, hashes them, and puts them on the card. So essentially the card now has hashed biometric data of this person, which is fine. And the next time the person wants to go to the bank, for example, he would show them the card. They say: okay, this is you, let's see your fingerprints. They take the fingerprints, hash them, match them against the card: the person is who he says he is. And this is fine. Again, this is in line with most of the standards of how the world is going with biometric passports.

We also know, as we've seen, that all the identities are saved in a central database, which again is something that is pretty normal. You would expect a government, a modern Western government, to have some sort of records of who the people are that live in the country. So again, this is fine, as long as this thing doesn't leak, as it has.

Now the problem starts with what this law also lets the government do. This law essentially gives the government the opportunity to put all the biometric data, unhashed and unfiltered, in this database, and they're saying that loud and clear. So the government essentially wants to start building a database that now has not only our personal records but also our biometric data, saved in a very non-secure way. And this is a problem. Now, we all know that once a database like this exists, everyone wants their hands on it.
So, starting off from a project of a single government office, now all of a sudden lots of people want their hands on this database, and everyone needs access to this database, from the police to whatever. And again, this is a problem.

The Israeli public has expressed a lot of concern over this law. Unfortunately, Parliament has not actually addressed any of the concerns raised by the public. So much so that, in fact, Professor Adi Shamir, who, if anyone knows, is the S in RSA, clearly a guy we should not listen to, also the recipient of the 2002 Turing Award, actually reviewed the law and all of its technical specifications and gave some feedback as to what can be improved, so that the law still maintains the purpose required by the government while also maintaining the privacy of the citizens much more strongly. And even that feedback has actually been denied. So that says something about how much due process went into this law.

I'm going to cut it a little bit short so I can leave some time for questions. So I'm going to end on this note. We've seen that all the data of Israeli citizens is out in the public, and we've seen that for a period of 10 years nothing has been done to stop that leak. So in that situation, I would be asking: would it be wise of us to start collecting more data, biometric data, in a database that we know is not sufficient to hold this? And I think society should much more closely monitor the data collection policies that the government maintains on them. Thank you very much.

We do have an additional five minutes for questions, so please put your hand up if you have any. There are more than before, so, yeah.

Oh, just a quick question. How many deletions did you have in the database?

It's hard for me to say. I can't really say, because I honestly don't really know. It's a little bit of a problem.
I have no more data on that, so sorry.

Okay, any more questions? Right there in the back.

I can't use a microphone. You counted the doubles of the names, so what is the most common name in Israel?

The most common name? Wow. I think David Cohen or something, or Moshe Cohen or something. Any, like, Jewish name you can think of, it's probably the most unique.

Okay, do we have any question from the IRC? Wait, I'll get you a microphone.

There's one question from Crocodile area. It's: how can we get a copy of this database?

Wow, okay. I'll say that it's not that easy, but it's definitely not impossible. Like I said, this data is out in the open. It's harder to get it today than it used to be several years ago, but if someone is inclined to, he can definitely find this data. So yes, it is possible.

Okay, we've got another question from the audience.

Yeah, the record deletion: one guess would be that it might be people who just left Israel and gave up their citizenship. Do you know about the process, what they do with their data?

That's an interesting angle, and it is possible. Again, it's kind of hard to say, because the data is not always consistent. I'll just say that, again, giving up your citizenship is not something that you would usually do, but I know of cases of people that do not live in Israel anymore and still exist in the database. I can't tell you what their exact citizenship status is. But that is definitely an option.
Yes.

How do you know or ensure the authenticity of the data, and is there any possibility or means to find out whether there was intentional poisoning of the data for various purposes?

Okay. So first of all, we know that the government acknowledged this leak, and it has acknowledged it for all the period that this data has been leaking. So the data definitely came from a government source, and we also know that from the various metadata that we see in the database. In regards to poisoning the data, that is definitely possible. Again, this data is leaking all around the internet, so it's definitely possible for someone to find it, corrupt it in some way, and throw it back out. I'd say that's not something that you would commonly see. Just for example, all the ID numbers have a check digit at the end of them, and from a simple check you can see that all of them are indeed consistent. So that's just one field, and you can always verify them against other fields. But again, it is definitely possible. I have no data confirming or...

Okay, we have two more questions from the audience.

Hi. The Israeli government has been relatively reluctant to publish numbers regarding the number of Israelis who are actually leaving the country, so moving away. Does the database actually have any information about this? How are people who actually moved away and left the country marked in the database?

Yeah. Well, again, people leaving the country aren't really recorded. I mean, that's not a field that the government would hold, as to where this person lives right now. So that's not something that is recorded in this database. There is various metadata you can try to guess from as to whether this person is, let's say, active. Is he somewhere around?
That's not something that I researched entirely, but it does exist. And again, that's as long as people haven't given up their citizenship, which, like I said, I'm not sure how exactly it is processed in the database. So there's no real data on that.

Sum it up. Okay, one last question.

Did you show a simplified database schema, or is it really not possible to track data like divorce or adoption or something like this?

Okay, so it's definitely possible, not directly from the fields that we've seen, but using various heuristics, such as looking at several children of a person and seeing that some of them have different mothers. Then you can conclude that this person has had a divorce and has children with other partners. So yeah, that's pretty much it. You can conclude it by looking at the data from a higher level, but not from the specific fields. So yeah, there's a lot you can learn just from looking at various fields and making different inferences from the data.

Okay, thanks everyone. Thank you very much.