Welcome everyone to this new Robustly Beneficial discussion. Today we will discuss the topic of security and privacy applied to machine learning, and specifically the points raised in a paper by Nicolas Papernot from Google Brain.

The first thing I find interesting in the paper is the link he makes with Goodhart's law. You must already have heard it: when a measure becomes a target, it ceases to be a good measure. For machine learning models we have the same problem: when a model becomes a target, it starts to be vulnerable. The reason is that even if you have run tests saying your model is correct 99% of the time, the 1% that's left is a door open to attacks. If the model you have deployed is a target, attackers will exploit that 1%, and the result is that your model will actually fail not just 1% of the time but a much higher percentage of the time.

Yeah, I think one of the big challenges of designing machine learning algorithms, or algorithms in general, that are deployed and used by billions of users is that if a system is used by billions of users, there are probably a lot of incentives to try to exploit it. And if there is a huge amount of incentive, spread among billions of users, to exploit the vulnerabilities of one algorithm, then you can definitely expect people to try to hack those vulnerabilities, and essentially your system to be completely cracked open because of them.

In the rest of the paper, the discussion I find very interesting: it describes principles that are useful for computer systems in general, not just systems that do machine learning, and then tries to make connections with systems doing machine learning. And there are a lot of differences. For example, security for systems that do machine learning can be interesting to look at at training time but also at testing time, so there are these two types of attacks. The paper also discusses privacy, which is very important for machine learning. And there is the fact that you should look at the whole pipeline of the deployed machine learning algorithm: the training, how the algorithm is queried, the input data we give to the algorithm for each query. It's interesting to discuss how these principles, which were designed for normal computer systems, can give us very interesting insights about machine learning models.

Yeah, these topics are really not specific to machine learning. There's a huge amount of work already done on trying to make systems more secure, more robust, and with better guarantees in terms of privacy, and machine learning definitely has to take a lot of this into account. For a long time, the challenge of machine learning was just to show that you could have a system that works. Now we have systems that work, but we have to make sure that our systems will always work, which is a huge difference. There's a huge difference between working on average, in a basic setting, and actually working once deployed in practice, in complex environments, with potentially a huge number of users. To bridge this gap between working in some setting and working all the time takes a lot of work and a lot of thinking, and many of the ideas developed for classical algorithms are definitely relevant for machine learning algorithms as well.
Yeah, so what you describe is sometimes called distributional shift: one assumption of machine learning is that the testing distribution is going to be similar to the training distribution, but what happens in practice is that the testing distribution is very different from the training distribution, because, as we said, actors are trying to game the system and create these differences.

One other thing the paper discusses a bit, and I would like to hear your opinion on this, is auditing ML. There are many people who think we should have audits, or audit protocols, or ways to audit machine learning, and it's not very clear from this section how such an audit could be done in a reliable way. What's your take on that?

Yeah, so some of the solutions he proposes and discusses in this paper I found quite interesting. The first one happens at training time. He mentions the concept of differential privacy, and this is something we can audit from the training side: we could have algorithms that are verified, so that we know for sure that part of the information that was in the training data set does not actually end up in the model. This would be a proof, I think, that differential privacy was correctly applied by the learning algorithm.

A second interesting thing: because it's a machine learning system, as I said, there are things happening at training time, with possible attacks at that time, and also things happening at testing time. A second possibility of audit would be simply to look at the queries being made at testing time. For example, if you see that during testing some query is very similar to some input data that was in the training data set, it's likely an attacker trying to learn the data that you have been trained on.

An example of this: YouTube gives me recommendations, and these are somehow private; I would not want the recommendations that YouTube gives me to be known by a lot of others. So if it were possible for someone to somehow fake a user very similar to me, with patterns very similar to mine, and then see what YouTube would recommend, that would be a kind of attack happening at testing time: someone trying to get my private data, the recommendations that are private to me. And this kind of auditing would happen by keeping logs of the queries being made on the recommender system, on the machine learning system, and one thing we could test is whether they diverge too much from what we expected, based on the assumption that the testing distribution would be similar to the training distribution.

Yeah, I think a lot of this is related to things we have already discussed in the past, like the black-box abstraction of machine learning. I think one difficulty of machine learning, though it's actually not specific to machine learning — you can think of any source code that's not open, and even code that is open but poorly written — is that sometimes the best way to learn about the algorithm is just to do all sorts of testing through interactions with the system: you give it inputs and see what the outputs are, and this can be very hard to analyze.
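To make that query-auditing idea a bit more concrete, here is a minimal sketch in Python of the kind of check described above: log incoming queries and flag those that land suspiciously close to a training point, as a crude signal of someone probing for the training data. The distance metric, the threshold, and all names here are illustrative assumptions, not anything prescribed by the paper.

```python
import numpy as np

def flag_suspicious_queries(train_X, queries, threshold=0.1):
    """Flag test-time queries unusually close to a training point.

    A crude membership-inference alarm: a query that almost exactly
    reproduces a training example may be an attacker probing what
    the model was trained on. The L2 metric and the threshold are
    illustrative choices.
    """
    flagged = []
    for i, q in enumerate(queries):
        # Distance from this query to its nearest training point.
        nearest = np.min(np.linalg.norm(train_X - q, axis=1))
        if nearest < threshold:
            flagged.append((i, float(nearest)))
    return flagged

rng = np.random.default_rng(0)
train_X = rng.normal(size=(1000, 16))  # stand-in training set
queries = rng.normal(size=(50, 16))    # incoming test-time queries
queries[7] = train_X[42] + 0.01        # attacker replaying a training point
print(flag_suspicious_queries(train_X, queries))  # flags query 7 only
```

In practice such a check would of course run on logged queries rather than in the serving path, and the threshold would have to be calibrated against the legitimate query distribution.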
That said, in the specific case of, for instance, neural networks, or some of these machine learning algorithms, they are not actually fully opaque. You could share, for instance, the weights, the parameters of the machine learning algorithm, and sometimes based on this you can do some analysis. So there are some approaches based on ideas from formal verification: for instance, you can try to prove that whenever you have an input with a given property, the output always has certain desirable properties, something like this. So these are ways to go as well, but it's not always clear how they scale up to larger neural networks.
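As a toy illustration of that verification idea — proving that every input in some set yields an output with a desired property — here is a sketch of interval bound propagation through a small ReLU network. This is one standard approach from the verification literature, not necessarily the one the paper has in mind, and the network and input box are made up for the example.

```python
import numpy as np

def interval_forward(layers, lo, hi):
    """Propagate an input box [lo, hi] through affine + ReLU layers.

    For y = W x + b, each output coordinate's extremes come from
    pairing each weight's sign with the matching end of the input
    interval. Returns guaranteed (if loose) output bounds.
    """
    for i, (W, b) in enumerate(layers):
        center, radius = (lo + hi) / 2, (hi - lo) / 2
        new_center = W @ center + b
        new_radius = np.abs(W) @ radius  # worst case per coordinate
        lo, hi = new_center - new_radius, new_center + new_radius
        if i < len(layers) - 1:  # ReLU (monotone) on hidden layers only
            lo, hi = np.maximum(lo, 0), np.maximum(hi, 0)
    return lo, hi

# Tiny made-up 2-2-1 network.
layers = [(np.array([[1.0, -1.0], [0.5, 2.0]]), np.zeros(2)),
          (np.array([[1.0, 1.0]]), np.array([-0.5]))]
lo, hi = interval_forward(layers, np.array([0.0, 0.0]), np.array([0.1, 0.1]))
print(lo, hi)  # output provably in [-0.5, -0.15] for every input in the box
```

The scaling issue mentioned above shows up here directly: these interval bounds get looser as networks get deeper and wider, which is why tighter (and much more expensive) relaxations are an active research topic.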
Perhaps another thing we can discuss, an idea that comes very early in the paper, is that when you're designing a system and trying to make it secure, you're trying to make it secure against adversaries, and it's not necessarily the right model to assume that the adversary is omnipotent and omniscient — omnipotent meaning they can do anything, omniscient meaning they know everything. In practice this is not the case, and it may be more realistic to assume the adversary has some budget, some amount of power, which they can use in different ways but which does not let them do everything. Then you try to be robust to this kind of adversary. Sometimes this allows you to do a lot more, because an all-powerful adversary is very hard to fight against, but if you only require resilience to a very powerful but not all-powerful adversary, then maybe you can have efficient, scalable algorithms that do what you want. Actually, this is now common in research on the robustness side of machine learning.

Yeah, definitely. What I find interesting is that this is an idea that comes up naturally for machine learning, because in machine learning you cannot rely purely on formal methods. But even if you use formal methods, at some point you're still going to have to make assumptions, because if you really think about the security of a formally proven algorithm, the security of the whole thing depends on every component. If it's actually a combination of ten algorithms, and the ten algorithms are communicating in all possible ways, so there are many combinations of all of them, then the complexity starts to explode, and it can be very hard to keep track of all the dependencies. In addition to this, you can have more sophisticated attacks, like side channels, or attacks on the parts of the system that are easier to hack, especially if it's a multi-component system. So even with formal methods you cannot assume robustness to everything, because it's too hard; instead you aim to be robust to a huge class of attacks. And this turns out to be what is done in machine learning, because indeed you cannot hope for much better there.

Okay, so maybe to talk about some research I'm familiar with, actually some of the research we did, in the context of Byzantine resilience, where you want to be resilient against an omniscient but not omnipotent adversary: you can still have subclasses of omniscience based on the computing power of the adversary. For example, an adversary that could look at all the gradients of everyone in infinitely short time, with infinite computing power, would be the full Byzantine adversary. But you can make more realistic assumptions, for example that the adversary can spy on the gradients of everyone and do some computations on these gradients to compute the best attack. To do that, the adversary may need on the order of O(d) computing time for each update epoch, where d is the dimension of the model, and since the updates can be very short, they would need to perform this O(d) computation within the time it takes for the model to be updated. That's very constraining, and with this constraint you can downgrade or upgrade the robustness requirements: if you know you update the model very fast, and you won't be waiting for late updates, then the adversary does not have enough time to, for example, run some regression on the gradients and find the best attack to poison you. So what you suggest is already there, at least for the past two years, in the field of Byzantine fault tolerance for machine learning.
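To give a flavor of what Byzantine-resilient aggregation can look like, here is a minimal sketch using the coordinate-wise median, one of the simplest robust aggregation rules from this literature — not necessarily the specific rule used in the work mentioned above. The setup, with six honest workers and one attacker sending an arbitrary gradient, is invented for the illustration.

```python
import numpy as np

def robust_aggregate(gradients):
    """Coordinate-wise median of the workers' gradients.

    Unlike averaging, a minority of Byzantine workers cannot drag
    any coordinate arbitrarily far: the median ignores extremes.
    """
    return np.median(np.stack(gradients), axis=0)

rng = np.random.default_rng(1)
true_grad = np.ones(5)
honest = [true_grad + 0.1 * rng.normal(size=5) for _ in range(6)]
byzantine = [np.full(5, 1e6)]  # one attacker sends a garbage gradient
grads = honest + byzantine

print(np.mean(np.stack(grads), axis=0))  # plain average: destroyed
print(robust_aggregate(grads))           # median: stays near (1,...,1)
```

Note that the median call itself costs on the order of O(n·d) per update, which connects to the timing argument above: the defender's per-update budget, like the attacker's, scales with the model dimension d.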
Yeah, but maybe you can push it in other directions. I think we discussed social media: on social media you want Facebook, Twitter, and YouTube to be resilient to Russian trolls, for instance — fake accounts that are just trying to manipulate public opinion, for instance to undermine another country, things like this. You want to be resilient to this, and one thing you could say is that we need to be resilient to all users, but that is probably impossible to do unless you shut down every account, which does not sound desirable. Instead, maybe you can have models where you assume the troll adversary has the power to hack a certain number of accounts, and that some accounts are easier to hack than others. If it's the Twitter account of some very famous person, it's probably going to be harder to take control of it; but if it's the accounts of people who used Twitter for a long time but have not been using it lately, then maybe it's easier to hack. Using this assumption about the amount of power of the adversary — here, the trolls — maybe you can come up with algorithms that are robust to that kind of power.

Yes, maybe that's a research direction that could be extremely relevant, especially in the current situation of COVID-19, where there's a lot of misinformation. Anything to add on that?

Yeah, so maybe something we discussed very quickly, and we could say more about it, is the openness of the code. I think it's something we didn't discuss when we talked about black boxes previously. It's a very common principle in security that if your code is open source, it's much more trustworthy that the code is actually secure. The reason to believe this is that if a lot of people have had the possibility to find failures in the code, because it is open source, and no one has found a bug or a possible hack, then that indicates the code can be trusted.

And for machine learning, actually, opening the source code is only a small part of what can be opened. Other things that could be open are the input data, the parameters of the model, and what predictions are made in which situations — and all of this relates to questions of security and privacy. In the discussion we had, it seemed quite obvious that people wanted the code of this kind of software to be more open. For me, one of the main reasons this code should stay closed is that it gives attackers a harder time actually generating attacks. One way to see this, in the case of neural networks, is generating adversarial examples: I believe it's much easier to do if you have access to the parameters of the model than if you don't. And indeed, the recommender systems that matter the most today are very closed; even small things, like receiving the raw output of the algorithms, are very difficult. I don't think it's possible, for example, for an individual to query YouTube's recommender system more than a million times per hour.

So we had this debate about whether this openness is desirable, or whether it would give more keys to attackers to manipulate the system. At some point we asked: would you want your spam filter to be open source? I don't remember how we settled on this.

Yeah, so again, on the spam filter being open source: I'd be very happy if the algorithm of the spam filter were open source, but not the training data it has been trained on. Because clearly, if the whole trained algorithm — even simply the parameters — were open source, it would mean that someone wanting to send spam could simply run the spam filter locally, learn whether their spam would be stopped or not, and keep changing their spam email until they find a failure of the spam filter.

Yeah, and as you say, if it's a neural network and you know the parameters, then you can compute the gradients, and this tells you what in the message you've written was critical in pushing the filter towards the decision of filtering it out or not, so you can much more quickly change your emails so that they get through the spam filter.
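Here is a hedged sketch of why white-box access makes evasion so much easier, using a toy linear spam score rather than any real filter: with the parameters in hand, the gradient of the score with respect to the input tells the attacker exactly which features to nudge, in the spirit of fast-gradient-sign attacks. The model, features, and step size are all invented for the illustration.

```python
import numpy as np

def spam_score(w, b, x):
    """Toy linear spam filter: score > 0 means 'flag as spam'."""
    return w @ x + b

# Made-up filter parameters the attacker has obtained, and one email.
w = np.array([2.0, -1.0, 0.5, 3.0])
b = -1.0
x = np.array([1.0, 0.0, 1.0, 0.5])  # this email's features

# For a linear model, the gradient of the score w.r.t. x is just w,
# so stepping against sign(w) is the fastest way to lower the score.
eps = 0.6
x_adv = x - eps * np.sign(w)

print(spam_score(w, b, x))      #  3.0 -> caught by the filter
print(spam_score(w, b, x_adv))  # -0.9 -> slips through
```

With a neural network the same recipe applies, except the gradient has to be computed by backpropagation — which is precisely what access to the parameters makes possible, and what a black-box attacker has to approximate with many queries.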
And so it seems that a fully open source spam filter — by fully open I mean you have access to all the data and all the parameters of the spam filter — seems like a very bad idea. So is it an impossible thing to have?

Well, you mean, would it still be effective?

Yeah, that's the question.

I'm guessing not. If you have the parameters, if you can run it locally, I think there are huge incentives to work on getting past the spam filter. So maybe you don't want your spam filter to be open source, for security reasons, which is a bit counterintuitive.

Yeah, I agree. That's why it was quite interesting to see this discussion of principles that are very obvious for normal computer systems but that become slightly strange when we talk about machine learning systems.

I guess if the spam filter were a hundred percent good, in the sense that it does exactly what the person wants and doesn't want, then it would be fine, but that doesn't sound reasonable in practice. There is this game, and there seems to be a gain in obscuring the algorithm.

Yeah, and maybe it also goes back to some discussion we had about social media manipulation: in the end, attackers could become extremely good at imitating normal humans, and if someone creating spam is extremely good at imitating a normal email, any spam filter would fail on that kind of spam. In that case, my conclusion was that because we can't detect which message is authentic and which message has the intent of manipulation, what we truly want the system to do is push forward messages that are beneficial and lower the impact of messages that are not, no matter the source of those messages.

Yeah, and the problem is really, I think, Goodhart's law again: if you have a system whose rules are exactly known, then you can better exploit the system if it's not perfectly aligned. And I think we should absolutely assume that the systems we design are not perfectly aligned, and because of this there is a vulnerability that may be best protected by not being fully open.

Yeah, it goes back to this idea — from Stuart Russell, I think — that I found quite interesting: to somehow beat Goodhart's law, do not reveal the full objective function to the artificial intelligence system, but let the system have some uncertainty about what the objective function is. This way, what the artificial intelligence will try to do is guess what you really want and work on that. And he mentioned this idea in the context of interruptibility, because not being interrupted is an instrumental goal for machine learning systems; but if the machine learning system has uncertainty about its own objective function, we can expect that if it sees an attempt to interrupt it, it might deduce: oh, it's because I'm not doing the right thing, so it's actually in my interest to let myself be interrupted.

Yep, so you're playing the same game in the opposite direction: the hacking of the human by the machine rather than of the machine by the human. They are all trying to hack one another, algorithms and humans, and you don't want algorithms to hack humans, but you also don't want humans to hack algorithms.

Yeah. We also discussed the fact that maybe some parts of the code of YouTube could, and maybe should, be opened. I think it's complicated, but I would push for more openness, or at least collaborations with people outside YouTube, like academics, for instance, or public health authorities.

Yeah, public health authorities especially: these days that would be extremely valuable, though I think in many countries it could lead to backlashes, especially in France. But this is extremely valuable information that they have, and if it can be exploited for good, that would be good. It's not easy, either, to make your code open. As we discussed, the information systems of companies today are quite often a huge part of their whole business.
Take the worst case, probably a bank: if you remove the information system of a bank, there's nearly nothing left; the information system of a bank is critical. And this also holds for the airplane industry: if you remove the algorithmic part of an airplane, then essentially the airplane does not fly — and not only the algorithmic part of the plane itself. Take an airline: I've read that, for example, if you evaluate the assets of Air France, the software that runs the booking, Amadeus — which I think is a subsidiary of Air France now — is worth more than the total cost of buying all the planes that Air France owns. If you value the assets of Air France, it has more assets in the booking software than assets in the form of planes. And even for the planes themselves, I think a large part of their value is the software, so if you just count the assets in terms of software, you get an even higher value.

So software is extremely important, but this also means that changing the software is a big deal for a company; it's worth a lot of money to have good software. And we gave an example that is actually relevant these days, about COBOL — do you want to tell it?

So yeah, recently the governor of New Jersey, in the middle of this mess created by the coronavirus, was seeking the help of COBOL developers, because the unemployment platform crashed due to the surge of people trying to register on it, and it runs legacy code written in COBOL, which was written in the 50s and 60s, so it's now about 60 years old. In the book we wrote with Lê, in a passage where we discuss the problems of having interruptible systems, we illustrate the fact that sometimes it's very costly to just interrupt something that is running, using the example of COBOL. Before COVID, there were already known cases of banks trying to get rid of legacy code written in COBOL: for example, the Commonwealth Bank of Australia spent three quarters of a billion dollars — 750 million dollars — for the sake of replacing its legacy code written in COBOL. That is an argument for not changing, for not interrupting, the system: if you interrupt it, the cost can be very high.

Yeah, and so if a big company wants to make much of its code open, even half of its code, then for a big company like Google it's probably a multi-billion-dollar project, and it's not even sure it would succeed. It's really a big deal. I think there should be efforts in this direction, but I think we have to factor in what we are asking from these companies, and we have to show that there's a gain in it, either through regulation, or through a gain of trust, or whatever. If there's not this gain, I don't think we can expect these companies, out of pure goodness, to say: we're going to make our code more open. So I'd say, if you're advocating for more openness from these companies, it's a good idea to first really think about the incentive structures and what can be done, and maybe ask for something more reasonable, something more likely to lead to something good.
Should we wrap up, or is something missing? I think we covered all the topics we discussed in the reading group.

Something else we discussed that may be important: essentially there are two ways, usually more or less combined, to guarantee greater security and privacy in systems. The first approach is more empirical: you set up a system and you try to attack it, and if you cannot attack it, you let others try to attack it, and if others cannot attack it either, then you say, well, it's likely to be secure. But it's nowhere near guaranteed; maybe there's a new trick that someone will find that can completely break the system, so this approach has its limits. The other approach is theoretical. For instance, for privacy there is this notion of differential privacy, which is a formal concept, and this means you can actually use mathematics to guarantee ahead of time that your algorithm will have a lot of desirable properties. For security there are also all sorts of such frameworks, for instance formal verification methods, or Byzantine resilience, with guarantees in terms of performance, or resilience to variations in the data, distributional shifts, and things like this. And I think it's important to combine the two. But if you want to do theoretical work on security or privacy, it's first important to have a good model, a good definition, of what you mean by security and privacy, and this can actually be extremely difficult. For privacy there are different definitions, but the leading definition is differential privacy; it's a nice definition, but it's also arguably too restrictive in many settings, so it's very hard to apply. Also, it blows up very quickly: there's this parameter called epsilon, and if you have epsilon-equals-one differential privacy, that's good, but if you have epsilon-equals-ten differential privacy, then it's essentially useless as a guarantee. So making all of these work together is very complicated and very hard, but it needs to be done.
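To make the epsilon comment concrete, here is a minimal sketch of the standard Laplace mechanism for a counting query: noise of scale 1/ε gives ε-differential privacy, and you can see directly why ε = 1 still hides an individual's contribution while ε = 10 adds almost no noise. The query and the data are made up for the example.

```python
import numpy as np

def private_count(data, predicate, epsilon, rng):
    """epsilon-DP count via the Laplace mechanism.

    A counting query has sensitivity 1 (adding or removing one
    person changes the count by at most 1), so Laplace noise of
    scale 1/epsilon suffices for epsilon-differential privacy.
    """
    true_count = sum(predicate(x) for x in data)
    return true_count + rng.laplace(scale=1.0 / epsilon)

rng = np.random.default_rng(0)
data = [("alice", 34), ("bob", 51), ("carol", 29), ("dan", 44)]
over_40 = lambda person: person[1] > 40

print(private_count(data, over_40, epsilon=1.0, rng=rng))   # count +/- ~1
print(private_count(data, over_40, epsilon=10.0, rng=rng))  # nearly exact
```

The blow-up mentioned above comes from composition: answering k such queries costs roughly k·ε of the privacy budget, which is how a system quickly drifts from a meaningful ε = 1 to a useless ε = 10.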
And there's also a huge pedagogical challenge, because a lot of people are talking about privacy, but I doubt that a large fraction of the people talking about privacy know about differential privacy, for instance — about the formal definition and the intuition — and I think even fewer of them are able to explain it to someone else in an understandable manner. I don't think I could explain differential privacy to most people unless I had two hours of their time and they were very focused. So in all of this, security and so on — and it's related to the COVID situation too, to all sorts of problems — pedagogy, I think, is underrated and is really, really important.

And with pedagogy, I think we can conclude and tease a bit the topic of next week. Next week we'll discuss the question of contact tracing and the challenges of contact tracing in terms of privacy and trust, because contact tracing has now emerged as one of the most promising solutions to get us out of confinement due to COVID. So next week we'll discuss contact tracing, some of the solutions proposed by people from EPFL, among others, the trust and privacy challenges they raise, and how to be pedagogical on these two fronts, to ensure that these solutions would work and would be adopted by a large fraction of the population. Thank you very much, and see you next week. Bye!