Hello everyone. I think we might get started now. So next up we have a local Christchurch person who's returned to give us this fascinating talk on getting insight from data without seeing the data. So please welcome Brian Thorne.

Hello. Yeah, I'm from New Zealand. I studied here at the University of Canterbury and then worked at Dynamic Controls — there are a few Dynamic people here, which is cool. About 18 months ago I moved over the ditch to Sydney and started working at NICTA, National ICT Australia, which does ICT research in Australia. As of last week that's no longer true, apparently — we merged with CSIRO, the Australian government's research agency, because Tony Abbott and research. So now I think I work at this company. Anyway, with that, let us begin.

I'd like to start with a bit of a story. I got engaged last year — yay! — and even before I'd told many of my friends or family, online advertising changed so sharply. It just pivoted to displaying adverts about wedding catering, wedding photography and rings, and it was really quite scary. At that stage I'd probably mentioned it in a couple of private chat messages via Facebook or Google Hangouts or something, but in a matter of hours it was everywhere online.

Today's machine learning algorithms are making better inferences than ever before, and this is due to advances across the board. Research continues on new algorithms, of course. Hardware improvements have made things feasible that were infeasible just a few years ago, and distributed computing and GPU programming have had immense impact as well. We also have better data: more accurate data, more types of data, insights taking inputs from a wide, wide range of sources. But probably most crucially, we have a lot more data. And the technology is getting much easier to use as it becomes more and more of a commodity.
Large cloud service providers are offering not just access to heaps of machines for very little money — they're now also offering machine learning APIs. Google has a Prediction API, and so do Microsoft Azure and Amazon ML, for instance. High-end graphics cards can tackle a lot of machine learning algorithms these days, and the spot rate for an Amazon GPU instance is in the order of 10 cents an hour — it hardly takes any effort to get access to dozens of these $2,000 graphics cards. But while it's getting easier to run machine learning algorithms, our understanding of each algorithm's limitations and of how to interpret the results is still quite immature. So smarter, easier, better — yes, but certainly not perfect and not without problems.

The very real, very personal human issues associated with insightful machine learning were discussed at PyCon AU earlier this year by Carina C. Zona in her keynote, "Consequences of an Insightful Algorithm". I really recommend watching it; it's well worth it. And while the insights might be wrong or too revealing, all too often it's actually the raw data that's leaked. This is obviously bad: leaked data can be combined with other data sets to draw even more revealing conclusions. So what can we do to prevent so much personal data being stored in one place, vulnerable and liable to be leaked? I'm going to focus entirely on possible technical solutions, not on the personal, human side and how we can change practices. Most of these are fairly active research topics, but I'm not approaching them as cutting-edge research that isn't really practical yet — these are concrete solutions that can be used today. Now, private data can be really hard to get hold of — or not that hard.
I mean, ask yourself: how carefully do you review the application permissions when you install an app on your phone these days — for every app? How carefully do you read the small print or the terms and conditions on a website? How many apps do you think you have installed that track your physical whereabouts, and how many websites track your digital wanderings? Hands up if you think hospitals would protect their patients' data pretty well. No? How about a government — do you trust the government to look after your stuff? Your workplace, your school, universities, corporations, apps — all of these treat your data in completely different ways, but probably everybody agrees that as a user you care about your privacy, and many organizations do as well. For many, the privacy of their users' information is a really big priority. Often this is because they know a big breach would erode their users' trust and cause many people to leave. And sometimes your data is kept very safe because it's commercially sensitive — think of things like insurance risk and credit risk. In many situations your data is offered some protection by privacy legislation, but this varies very widely depending on where in the world you are and on the type of data. Even so, many organizations have decided that some non-trivial amount of their customers' information is worth sharing; the cost-benefit analysis quite often comes down on the side of selling or sharing customers' data. So between two or more organizations, how is that done today? When organizations decide to share some of their data, at the moment they bring it together. More often than not they use an intermediary such as Veda or Experian to act as a trusted third party. And there are some restrictions, which some organizations sometimes obey.
These could be governance or legal requirements, their own privacy concerns and policies, and at times they are genuinely constrained by the consent they've got from users. But a lot of the time they still decide to do it. So what are organizations getting out of this sharing? It could be for really good reasons, of course — something like collecting aggregate information about patient care across hospitals in a region. Unfortunately these solutions are often very inflexible, especially when it comes to medical data, where legislation prevents a lot of things; sometimes you need legislative change before you can compute aggregations and statistics that might be useful. But the reason a lot of organizations share is monetization: by simply selling the raw data or customer insights to other interested organizations, they find a new revenue stream. Or it could be that alone they don't have the necessary amount or quality of features to make the predictions, classifications and other inferences they need for their business.

To channel Raymond Hettinger from his PyCon US talks: there must be a better way. Instead of trusting a third party, for many problems it is actually possible to do data mining while preserving the privacy of the raw data. Just think about the implications of that for a second. You wouldn't have the capability to see the raw data — you couldn't bring it up in a table in Excel — but you could still run your analyses and solve the problems you're trying to solve, even using sensitive data that couldn't or shouldn't be put on cloud services. Cross-border insights can be really difficult with some types of information — as I mentioned, medical information is very heavily restricted, and different countries have very different laws with regard to data privacy. I've got a toy example to think about, just as an illustration.
The problem involves two millionaires, Alice and Bob. This isn't a problem I've struggled with, unfortunately. Alice and Bob are interested in knowing which of them is richer, but they don't want to reveal their actual wealth. There are lots of solutions to this from the cryptography field that genuinely keep both numbers private and only work out the yes or no — which is greater. It can be extended from two people to n participants to compare across a group, and in other variants, instead of solving for greater-than, a group can privately work out the mean, the median, the standard deviation and so on of an essentially distributed value, like the group's wealth or salary. This can also be applied to things like voting in a privacy-preserving way: which candidate do you prefer the most? The crux of it is that it is possible, at times, to gain the insight from data without seeing the data. This topic lies at the intersection of cryptography, privacy and machine learning, and I'm going to attempt to introduce some of the techniques. There are multiple of them. And I don't just want to do a mean, a greater-than or a standard deviation, so we're going to look at some of the more advanced learning techniques rather than just descriptive stats. The types of things that are possible include k-means clustering, PCA, and linear and logistic regression, and they usually work in one of two ways. Either you leave the data at its source, where it's measured, and bring the algorithm to the data, sharing only deltas — updates. Think Internet of Things: if you've got a whole lot of devices, instead of pooling all your resources and data onto one centralized server, you leave the data at the leaf, at the node. Or, the second way, you upload encrypted data and do your analysis on the encrypted data on a centralized server — that way you get the advantage of using your Amazons and your Azures and whatnot.
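That second approach — analysis on encrypted data — relies on homomorphic encryption, which the talk comes back to later with the Paillier cryptosystem. As a taste of what it means, here is a toy sketch of Paillier in pure Python. It is my own illustration with made-up, far-too-small parameters — completely insecure, just enough to show a server adding two numbers it only ever sees encrypted.

```python
import math
import random

# Toy Paillier cryptosystem -- illustrative only; the primes are far too
# small to be secure. Paillier is *additively* homomorphic: ciphertexts
# can be added (and multiplied by plaintext constants) without decrypting.

p, q = 293, 433                # toy primes; real keys use ~1024-bit primes
n = p * q
n2 = n * n
g = n + 1                      # standard simplification for the generator
lam = math.lcm(p - 1, q - 1)   # lambda, part of the private key
mu = pow(lam, -1, n)           # mu = lambda^-1 mod n (valid when g = n + 1)

def encrypt(m):
    r = random.randrange(1, n)              # fresh randomness per message
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return pow(g, m, n2) * pow(r, n, n2) % n2

def decrypt(c):
    L = (pow(c, lam, n2) - 1) // n          # the Paillier "L function"
    return L * mu % n

# A server holding only the ciphertexts can still compute a sum:
a, b = encrypt(20000), encrypt(15000)       # e.g. two private amounts
total = a * b % n2                          # ciphertext of 20000 + 15000
print(decrypt(total))                       # prints 35000
```

Multiplying the ciphertexts yields an encryption of the *sum* of the plaintexts, which is exactly the property the centralized-server scenario needs: the server does useful arithmetic, but only the key holder can read the result.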
But it's all encrypted data, so it doesn't matter as much if it gets leaked — it doesn't mean much to anyone. In either case you learn the insight, the model, without ever seeing the private raw data, which means you're not really at risk — or not as much at risk — and neither are your customers or users and their data. One way this can be done is business to business. It's possible to learn simple machine learning models where some of the features — think columns in Excel — are never brought together; they never reside in the same place. Think about the targeted news feed you get in Facebook, or targeted advertising, where the raw information — what you buy, where you were, when you shop — could all be stored separately. That example in particular scares me a bit; I don't think a technology solution alone is going to be the answer there — there are certainly going to be some policy aspects as well. Medical research should definitely benefit from this in the near future. For example, individual hospitals could keep their own, obviously very detailed, patient records private, but if the hospitals and patients wanted to, they could participate in research across all of the hospitals — building models for illness prediction, say — learning across all of the private data stored in the separate hospitals without it ever being pulled into one place. So if something goes wrong and something is leaked, it's not everybody's data all at once.

All right, now I'm going to get on to some of these methods — I'll try to introduce a couple. The first one is secret sharing. Secret sharing, or secret splitting, is a method for splitting a secret amongst a group. Each participant is allocated a share, and the secret can only be reconstructed when a sufficient number — a threshold — of participants decide to combine their shares. So picture a space in 3D.
This illustration should help to explain it. Our secret is just a point in this box — some position. The first participant is given a share, which is just a description of a plane; it's important, of course, that the secret point lies on that plane. Hopefully you're following so far. The second participant is also given a share — another plane — and the point that is our secret also lies on this plane. Now if these two decide to collude, there's a whole line of possible places: combining two planes yields a line of intersection, and there are quite a few points on an infinite line, so that doesn't give away too much. But when three or more people who each hold one of these planes come together, we can finally reveal the point right in the middle where the three planes intersect — which is our secret. Here there are only three planes, but you could easily use more, as long as each one passes through the secret point. And of course you could increase the number of dimensions and the same applies — the names just get a lot cooler, because you're dealing with intersections of hyperplanes. This threshold scheme is called Blakley's scheme. It's one of many; I chose it because seeing planes is much easier than just maths equations. So what is secret sharing good for? It's good for highly sensitive data, meaning you don't want any individual to have access to the information, and highly important data, meaning you really, really don't want to lose the information. I'm thinking of things like missile launch codes, or maybe the encryption keys to your secret formulas. Now normal encryption — sorry, asymmetric encryption — isn't much good here: it doesn't simultaneously give you both high confidentiality and high reliability.
Say your company has a secret formula which you wish to keep secret, and which you also don't want to lose. Normal encryption forces a decision. Either you keep one copy of your encryption key, which means poor reliability — if you lose the key you can't decrypt the data, it's gone. Or you distribute multiple copies of the key, which lowers the confidentiality: if any one of those key holders decides to decrypt the secret formula, have a look at it, copy it unencrypted to their personal computer and then makes some mistake, the company secret is revealed. The difficulty lies in creating schemes that give you both. With secret sharing we can build a scheme where some threshold of n shares is required in order to access the raw data. To put that in more concrete detail: say the president of a company should be able to access the formula any time they want, but in an emergency any three of the twelve board members should be able to collude and do the same. This could be accomplished with a secret sharing scheme: give the president three shares and each board member one, and then whenever the president wants to, or any three board members get together, they can decrypt it. All good? Cool.

The next technique is secure multi-party computation, which allows a set of parties to compute a function over their inputs while preserving input privacy and correctness. MPC has been an active area of research for about 30 years, and in the last decade it's taken off, with significant interest and significant advances in applied MPC. It can now be used as a practical solution to various real-life problems like distributed voting, private bidding and auctions, and sharing of signature or decryption functions.
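To make the bridge from secret shares to multi-party computation concrete, here is a minimal sketch of a secure sum — the building block behind things like private voting and the group-salary example from earlier. The parties, salaries and modulus are made up for illustration: each party splits its private value into random additive shares, so no single share (and no single party's view) reveals anything, yet the shares still combine to the right total.

```python
import random

M = 2 ** 61 - 1  # public modulus, larger than any possible total

def make_shares(value, n_parties):
    """Split `value` into n random additive shares that sum to it mod M."""
    shares = [random.randrange(M) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % M)
    return shares

# Three parties with private salaries they don't want to reveal.
salaries = [84_000, 121_000, 65_000]
n = len(salaries)

# Party i keeps one share of its own salary and sends one to each peer.
shares = [make_shares(s, n) for s in salaries]

# Each party sees only one random-looking share from everyone else and
# publishes the sum of the shares it holds.
partial_sums = [sum(shares[i][j] for i in range(n)) % M for j in range(n)]

# The published partials combine to reveal only the total (and the mean).
total = sum(partial_sums) % M
print(total, total / n)   # 270000 90000.0
```

Each individual share is uniformly random, so nothing leaks unless all the other parties collude — the same threshold idea as the board-member example, applied to computing a function rather than storing a secret.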
Most MPC protocols actually make use of secret sharing, which means a set number of the participants need to collude in order to retrieve any information, and they allow arithmetic operations like addition and multiplication — and more complex operations, like the k-means clustering I mentioned earlier, are possible under secure multi-party computation. A bit of a mouthful. Now I'm going to attempt to show you a demo, if I can get my mouse over here. One sec. That'll do. Okay, so we have two companies — this is a business-to-business example — combining vertically partitioned data, and we're going to carry out clustering for the purpose of identifying outliers. Sorry about the resolution here. On the left we have the two separate companies, the green one and the blue one: a finance company and a travel company. To make it as simple as possible, they each have one dimension of data — one column. The finance company has the monthly spend of its customers, and the travel company has the number of trips of its customers. Now if you could bring those together — and that looks horrible at this resolution — the two dimensions show three clusters very clearly, and you could potentially see outliers. k-means clustering converges to the cluster centres, and then you can ask how far away points are from those centres and find the outliers. But if you look at each column of data on its own, you don't see the outliers — they sit clearly within the one-dimensional clusters. So bringing the data together gives you a real advantage: you can identify fraudulent users, anomalous behaviour, etc. It's useful — but bringing it together also gives you another attack vector.
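To show why combining helps, here is a toy sketch of the arithmetic behind this demo — my own illustration, not NICTA's actual protocol, and with the cryptographic layer left out. With vertically partitioned data, squared Euclidean distance decomposes per column, so each company can compute partial distances over its own column, and only those partials (not the raw columns) need to be combined for the k-means assignment step; in a real deployment that combination would happen under MPC.

```python
# Toy k-means assignment step over vertically partitioned data.
# Same customers, different features, held by different companies.
finance_spend = [120.0, 135.0, 900.0, 880.0, 125.0]   # monthly spend
travel_trips  = [1.0,   2.0,  14.0,  15.0,  1.0]      # trips per month

# Current cluster centres, also split by dimension.
centres_spend = [127.0, 890.0]
centres_trips = [1.5, 14.5]

def partial_sq_dists(column, centre_coords):
    """One company's contribution: squared distance in its dimension only."""
    return [[(x - c) ** 2 for c in centre_coords] for x in column]

pf = partial_sq_dists(finance_spend, centres_spend)   # finance company
pt = partial_sq_dists(travel_trips, centres_trips)    # travel company

# Summing the partials gives the full 2-D squared distances, so each
# customer can be assigned to the nearest cluster without either raw
# column ever leaving its owner.
assignments = [
    min(range(2), key=lambda k: pf[i][k] + pt[i][k])
    for i in range(len(finance_spend))
]
print(assignments)   # -> [0, 0, 1, 1, 0]
```

The update step works the same way: each company recomputes the centre coordinates for its own dimension from the shared assignments, which is what lets the protocol iterate to convergence while the columns stay put.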
So we want the benefit of that last graph — seeing the outliers — without ever having the data in one place, without sharing the data, so that graph itself is never possible. What you saw was a replay of an experiment; a lot of these schemes actually take quite a long time, and this particular one ran in the order of five minutes rather than seconds. It's based on secure multi-party computation rather than public-private key cryptography. It's not something we've done ourselves, but AES encryption can also be evaluated under the same kind of scheme — a circuit in the order of 30,000 AND and XOR gates — which is quite impressive, running completely privacy-preserving. If I can go back to my slides...

Okay, differential privacy — another technique for privacy-preserving analytics. It's a very powerful approach to protecting individuals' privacy in data mining. Differential privacy aims to maximize the accuracy of queries on a statistical database while minimizing the chances of identifying its records. To give you an intuitive way to think about that: differentially private data mining protects individuals by injecting noise — a little bit of randomness here and there — which covers up the impact any single individual can have on a query. So consider a trusted party that holds a data set of sensitive information — something like medical records, voter registrations or emails — with the goal of making global statistics and aggregate information about that data publicly available while preserving the privacy of the users whose information is in the data set. Such a system we call a statistical database. In the recent past, ad hoc approaches to anonymizing public records have failed: researchers have managed to de-identify personal information, often by linking the released data with innocuous, seemingly unrelated databases. One successful such linkage attack was on the Netflix database.
In 2006 Netflix offered a million dollars to anyone who could improve their recommendation system by 10%, and they released a data set for the competitors to train their systems on. In releasing the data set they provided a disclaimer: to protect customer privacy, all personal information identifying individual customers had been removed, and customer IDs had been replaced by randomly assigned IDs. But Netflix's isn't the only movie-ratings data set on the web. There are others, like IMDb, where individuals can register for free, rate movies if they want to, and have the option of publishing their profile rather than keeping their ratings private. Researchers at the University of Texas linked the Netflix training data set with the publicly available IMDb records, and they compromised many — maybe even most — of the users' identities. I'm not going to go into the details of how differential privacy works, other than to mention that it gives you a knob of control between more accurate and more private: there's a trade-off between the accuracy of the statistics you're estimating in a privacy-preserving manner and the privacy parameter — an epsilon, essentially.

Right, I have some creepy faces as a demo. This was built in Python, although what you're seeing right now is just JavaScript. Does anyone know PCA? Yes? Some — good. It's a statistical machine-learning-ish algorithm for finding the principal component vectors, and it's really simple when you look at it in 2D — finding the principal vector through some data in two dimensions — but you can also do it for faces. At NICTA we made a prototype demo showing that we could do differentially private eigenfaces — essentially calculating eigenvectors from images, completely privately. At each update step these separate nodes are calculating a shared model of what
the face built from all the faces they've seen actually looks like — but they do it in such a way that everything they share is just a delta, a difference: given what you gave me last time and the data I have locally, here's what you need to change. And it does it in a way that guarantees anonymity, which is kind of cool.

So, homomorphic encryption — does that mean anything to anyone? One, two, three — yes, cool. For everyone else, let's do a quick recap on asymmetric encryption. Asymmetric encryption works by first creating a key pair. Public key algorithms are based on mathematical problems that currently have no efficient solution — problems inherent in integer factorization, discrete logarithms and elliptic curve relationships. It's computationally easy for a user to generate a public-private key pair and use it for encryption; the strength lies in the computational impracticality of reversing that — of determining a properly generated private key when all you see is a message and its public key. Thus the public key can be published without compromising your security at all, unlike with symmetric encryption, and the security depends only on keeping the private key private. In an asymmetric encryption scheme anyone can encrypt a message using your public key, but only the holder of the paired private key can decrypt it. Asymmetric encryption is also used for signatures, where it's just the other way around: the private key is used to sign the content, and anyone with the public key can verify that it was a valid signature. Now, one form of this is RSA. RSA was one of the first — if not the first, I'm not sure — public key cryptography systems, and it's still widely used to secure data transmission today. Its security is based on the practical difficulty of factoring the product of two large primes, appropriately called the factoring problem. So these are the only equations I've
got, promise. m is the message, which gets transformed into a number — but for our sakes let's just assume it's a number already, because we're thinking about data analytics. The public key is composed of two numbers, n and e, and encryption is c = m^e mod n, where c is the ciphertext that's transmitted, say from Alice to Bob. The private key is just one number, d — you usually know the public key as well — and decryption is m = c^d mod n. Now RSA has an interesting property, which maths geeks may be able to work out just from looking at those two equations, though you probably need to know how the public and private key numbers are generated: the product of two ciphertexts is equal to the encryption of the product of their respective plaintexts, E(m1) * E(m2) = E(m1 * m2) mod n. This is called a homomorphism. Normally, to avoid the problems this causes, practical RSA implementations embed some structured, randomized padding into the value m before encrypting it, and that prevents this — you can't multiply ciphertexts together and get the result as if you'd multiplied the plaintexts. They consider it a flaw. But cryptographers are now looking at this very, very closely — and at other systems with similar homomorphic properties — and thinking: maybe we can exploit this. Which brings me to the Paillier cryptosystem.

Actually, I'm going to go to my last demo before I get into that. I'm going to pretend I'm a doctor, and this is your personal tablet or your personal mobile phone. This one's more of a proof of concept — it's a little bit more polished now. What I'm going to assume is that you've had your genome sequenced — that's cheap to do these days, around a hundred dollars, you can get it from 23andMe — and you've got it loaded up on your personal device. So you've come into the doctor, and I've diagnosed you: sorry, it looks like you're going to need a blood
thinner — the drug I'm going to prescribe is warfarin. Now there's a classical way I can determine the dosage, and there's a newfangled way — but for that you need this app and you need to have your genome sequenced. You say: yes, of course I've got that, I'm up to date, I've got my genome with me. So I say, okay, cool, let's go through and do the warfarin dose test. It says: I need access to your genome, but I'm going to use it in a privacy-preserving way, and I'll need access to your basic bio details — your age, your weight, your sex. We decide that's all right, we'll participate. It asks a couple of other questions relevant to this particular warfarin dosage calculation — in this case whether I take some particular drugs which might conflict with it. It looks kind of terrible on this screen, but then it goes ahead and calculates something privately on the device — on your cell phone or your tablet, then and there — and gives us a result of 34 milligrams per week, apparently, for the random things I clicked. Now that in itself is kind of impressive, but what I'm not telling you is that it's keeping two things secret. The algorithm used could be a commercial secret that a pharmaceutical company did not want to share with you and have sitting on your device, and the genome is obviously something you want to keep secret and don't want to send to a pharmaceutical company. Using the Paillier cryptosystem we can keep the secrets on both sides and still work something out. In this case it's a really simple operation — a dot product between your genetic and phenotype information and the weights from the pharmaceutical company — but as I was showing with the k-means clustering and the PCA, a lot of other more complicated algorithms are possible too. How are we for time? Done? Let's stop there. I've got one
library I'd like to share, if I can put it up — not full screen. If you're interested in looking at the Paillier cryptosystem, at NICTA we open-sourced github.com/NICTA/python-paillier, a partially homomorphic encryption library for Python. There are lots of docs on Read the Docs, because I've run out of time to show examples — but it works, it's tested, you should try it. Things are possible.

So we've got time for maybe one or two questions. If anyone has a question, please raise your hand and I'll run the microphone to you.

This is going back to your Netflix example, matching two data sets. I'm working for the New Zealand government — one of the agencies that look after import, export, biosecurity, food safety and that kind of stuff — with really interesting data sets. Disclaimer: this is me speaking, not the ministry or anything. There is obviously public demand — hey, can you release data, and why not? It's good for research and many other good causes. But there's a real threat of inadvertently identifying people and organizations. The recent example I came across was forestry: the ministry is really lucky in having really good records on forestry, which is good for research — but you could, for example, identify the only person farming palm trees, match that with LINZ data, and suddenly you can tell exactly who it is. And if you're an activist who doesn't like palm trees and palm oil, maybe you burn their stuff down. So there's a real tension, and hearing what you presented, I'm wondering: is there an easy way out or not? What are your thoughts?

I don't think there's an easy way out, and you certainly have to approach it with a lot of care. So yes, you could go down the differential privacy route — anonymizing the database and then making a release once — and that could work.
That can certainly work — I'd recommend finding someone who's done some work in that area. But you can also take other approaches. You could release things at different aggregation levels: what's the zip-code piece, what are the wider local body areas, and so on — I'm not very familiar with the geography here — releasing different precisions at different levels. That's one approach; for forestry it might mean releasing, for a given area, a data set with only so much accuracy. Another approach is to ask people to send you what they want to run; you run it, and you release only the result — because the insight, whatever model they're trying to learn, probably doesn't contain anything you're too scared to release. Maybe they're classifying bad trees or something, and they want to run that against the large amounts of data you have available inside government. If you can set it up so that someone can come to you and say "we want to run this thing, it's going to do stochastic gradient descent using all this data" — and there's obviously some effort on both sides — then the model itself might be something you're willing to release, just not the raw data. That's another way you can approach it, anyway.

How fast is the field moving? Pretty quick, pretty quick. There have been a lot of announcements even in the last few months. Places like MIT have put out a new system, Enigma, and Ethereum is another cool one that got a lot of traction a few weeks ago — doing privacy-preserving stuff on a blockchain, Bitcoin's or otherwise, so you can see a clear record of what has happened; they're not using it for transactions. So yeah, it's moving pretty fast, that's for sure. I'm sorry, we probably don't have time
for any more questions. The conference close will be in a few minutes. If you have any more questions, Brian's on Twitter, and he's here for the closing so you can say hi. But yeah, please join me in thanking Brian.