Hi everyone, we are going to start in another 3-4 minutes. You all know why we are here, but just to give you a brief background: this session is about part 3, or section 3, of the paper that Prashant, Malavika, Professor Subhashis Banerjee and Subodh Sharma, from IIT Delhi and Dvara Research, have written. Today's session will primarily focus on the computing principles, and Prashant will be explaining the computer science principles behind privacy by design. Tomorrow's session is essentially around the regulatory architecture and what an example system would look like. Prashant, I guess you can just start; I'll let people join in as they come in.

Okay, so thank you Srinivas, and welcome all of you to part one of this two-part series. As Srinivas mentioned, I'll give a broad overview of various privacy concepts in computer science, and we'll try to understand their strengths and limitations. Part two will focus on an operational architecture, which we have explained in detail in the paper; this session is just a primer to part two. So the agenda today is, first of all, to overview these privacy techniques in computer science and evaluate how well they align with the legal principles of privacy. This is by no means a comprehensive, detailed overview: privacy research is almost four decades old now, with dozens of journals and conferences and a huge number of papers, so what you see here is naturally a very compressed version of the picture. You will find the details and references in this link.

Before I go further, I wanted to make some clarifications regarding terminology, because I believe this might have led to some confusion between the legal community and the CS community. In terms of the legal definitions, you have informational privacy, which is the broad concept of an individual's right to be left alone; this was the basis of the Puttaswamy judgment. Traditionally, you think of data security as the various technical safeguards and operations which companies or organizations keep in place to make sure that data is secure, meaning there is no unauthorized access to it. Data protection goes above and beyond data security: it refers to a general legal framework for achieving informational privacy, which, in addition to data security, also covers preventing unlawful collection and processing by entities. With this talk, I want to convey that many modern computer science techniques also address data protection. Traditionally, cryptography and security are thought to be only about preventing unauthorized access to data, just encryption and that kind of thing, but that is not the only focus of modern cryptography, and you will see various techniques which explicitly address data protection and informational privacy, not just security.

Before we go into computer science principles, let's first look at the legal principles. These are the OECD principles; they were established in 1980, and I think they are now the de facto principles to go by in terms of privacy. The first one, collection limitation, is essentially a statement about consent: whatever data you collect, the collection should occur with the knowledge and consent of the individual whose data you are collecting.
The data quality principle says that you should only collect data that is relevant and necessary for the task you are trying to do; you should not collect irrelevant data. Purpose specification says that when you collect data, the purpose for which you are collecting it should be specified: you can't just collect without a specific purpose. Use limitation says that this purpose specification should be respected. The rest are largely self-explanatory. Individual participation means that individuals need to be in control, to update their data or control how it is shared. Security safeguards covers the data security aspects: all unauthorized accesses should be prevented. Openness says it should be open and transparent how our data is processed and what is done with it. And accountability says that if there is a breach, then data controllers should be liable, and it should be possible to pinpoint who is accountable.

If you look at privacy risks, it's actually a layered picture. At the very bottom level, the risks belong to the data security domain: leakage of sensitive data while in transit, and any unauthorized access to information, both of which should be prevented. A level up, you also start worrying about linkage of information shared across multiple databases. This is why Aadhaar is such a big issue: if you start seeding every database with Aadhaar, then all those databases can be linked together, and the purposes for which data was collected can be violated. Then, even when you do some kind of anonymization, you don't collect Aadhaar or you release only anonymized data, there is still a risk of re-identification of individuals, even though you have not explicitly identified them; we will talk about this in detail. And finally, a deep problem with privacy is that post-access purpose violations are very hard to prevent: once you have given the data to an authorized agent, what is the guarantee that it will not be sold, or used for illegal surveillance, or otherwise misused? Even without such direct misuse, you can have purpose violations through AI, such as illegal profiling or targeting. All of these broadly cover most of the privacy risks we will talk about today.

Before going further, let me clarify the format I'll follow: I'll introduce a concept and then we can have a discussion on it, rather than having questions at the end. I think that will make it more interactive.

So, first, encryption. Encryption deals with protecting data in storage and in transit. You have these two people, Alice and Bob, who want to communicate, and they want to make sure that nobody can peek at the data being sent in transit. Alice encrypts the message M for Bob and gets a ciphertext, and the idea is that given this ciphertext, it is very hard to find the plaintext message M. This is the general theme you will see in any encryption mechanism; for example, the RSA encryption system depends on the hardness of factoring large composite numbers. This has a parallel with a physical analogy: if you want to protect something in the physical world, you put it in a box and lock it.
Now the security of this mechanism depends on the hardness of breaking the lock; similarly, in the digital world, the security depends on the hardness of some computational problem.

There are really two kinds of encryption: symmetric encryption and asymmetric encryption. In symmetric encryption, the two parties need to exchange some key k over a secure channel, shown by the bold arrows here. Once they share a key via some secure channel, they can encrypt multiple messages using this key. But that first step needs to be done in some other, offline fashion, and that is a real problem: how do you actually share keys securely in an insecure environment?

Asymmetric encryption, also called public key encryption, solves this problem by having entities generate two keys, not one. One is called a public key and the other a secret key, and the public key is shared freely: Bob can share his public key over the insecure channel with no security risk involved, and the encryption is done against this public key. The guarantee you get is that only somebody who knows the secret key corresponding to the public key can decrypt the message from the ciphertext; if you don't know the secret key, you cannot decrypt. This way you avoid having to share a key via a secure channel, and that's why public key encryption became so popular. Public key encryption is a little slow, though, so what people usually do is perform the first key exchange via the public key mechanism, and then encrypt the rest of the messages using symmetric encryption against the key shared in that first step.

So this protects data in storage and transit. But when you are actually doing computation, when Bob gets the encrypted data and needs to run some algorithm, some program on it, you still need to decrypt it. And even though it's a program doing the decryption, there is still a risk that while the program is running, somebody might inject malware on Bob's machine and steal the decrypted message or the keys. All of that is outside the scope of encryption. If anybody has any questions up to this point, I can take them now. I think we are good; if there are any questions, I'll watch the chat and let you know. Oh, sure. Sure. Okay.

So then I move on to the other basic primitive, which is digital signatures. What signatures do is make messages authentic; again, this invokes the physical world in the digital world. If you want to make some message M authentic, you sign it using your secret key. A person verifying it verifies it using your public key, and this verification will not pass unless the signature S was produced using the corresponding secret key. This gives you authenticity: the message was signed by the authority A that holds the public key K. It also makes signatures non-repudiable, meaning A cannot later claim to a third party that it did not sign M, because S is evidence of it. Since this person holds the signature S, and nobody else could have produced it other than A, the signature itself is evidence that A actually signed it. So this is a useful primitive for dispute resolution.
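To make this concrete, here is a minimal sketch of the hybrid pattern just described: a one-time key exchange under a public key, fast symmetric encryption afterwards, and a signature for authenticity. It assumes the third-party Python `cryptography` package; this is an illustration, not a production protocol.

```python
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes
from cryptography.fernet import Fernet

# Bob generates a key pair; the public key may travel over an insecure channel.
bob_private = rsa.generate_private_key(public_exponent=65537, key_size=2048)
bob_public = bob_private.public_key()

# Alice wraps a fresh symmetric key under Bob's public key (the slow step, done once).
oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)
sym_key = Fernet.generate_key()
wrapped = bob_public.encrypt(sym_key, oaep)

# All subsequent messages use fast symmetric encryption under sym_key.
token = Fernet(sym_key).encrypt(b"meet at noon")

# Bob unwraps the symmetric key with his secret key and decrypts the message.
assert Fernet(bob_private.decrypt(wrapped, oaep)).decrypt(token) == b"meet at noon"

# Signing: Bob signs with his secret key; anyone can verify with his public key.
pss = padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                  salt_length=padding.PSS.MAX_LENGTH)
sig = bob_private.sign(b"I owe Alice 100 rupees", pss, hashes.SHA256())
bob_public.verify(sig, b"I owe Alice 100 rupees", pss, hashes.SHA256())  # raises if forged
```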
But you should be aware that the security of encryption and digital signatures goes only as far as the security of the keys: whoever has access to the secret keys can do the decryptions, can sign messages, and so on. Key management is a separate issue, which we won't go into much in today's talk, but you should be aware of it.

Finally, I want to talk about one-way functions. A function H is one-way if, given a message M, computing H(M) is easy, but given H(M), finding M is hard. So given the hash, you cannot find which message might have produced it. You might have seen this when you download large files: they give you an MD5 hash, and that hash is computed on the file, but looking at the hash, you cannot find out which message produced it. The hash appears to be randomly generated, and this is also the basis of how pseudorandom numbers are generated. Most cryptographic hashes are also collision resistant, which means that given H(M), it is also hard to find another message M' whose hash matches it. What this means is that if the hash matches, you can be reasonably certain it was produced by the same message that produced it the first time. A simple use case is passwords: when websites store your password, they don't store it in plaintext, they store a hash of it, and just by matching the hash they can be sure it was produced by the same plaintext password and nothing else.

Hi Prashant, can you use your pointer when you explain some of these functions? Is it the annotate button? Can you move your mouse to where you are on the slide? I'm trying to, but I'm unable to. It's okay, fine, let's go ahead. Okay. All right.

So far I have covered the basic primitives which are pervasive in most of cryptography, and in most of data security actually. Now I am transitioning into the data privacy principles and techniques used to achieve data protection. The first principle is data minimization, also called the minimum disclosure principle, and it is just the age-old adage: share only the minimum amount of data required for the purpose. If I want to prove to somebody that I am over 18 years old, so that I am allowed to drink or drive, I don't need to disclose my exact date of birth or my unique identity, unlike what you have to do with paper-based driving licences today. Similarly, if you are collecting information for statistical or analytics purposes, you don't need my personally identifiable information; anonymized information should be enough. That is the general data minimization principle, and there are various techniques and concepts which enable it. We'll go through each of them: zero-knowledge proofs, anonymity and unlinkability, and database anonymization.
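Before moving on to those techniques, here is a minimal sketch of the password use case above, using Python's standard hashlib. Real systems use a salted, deliberately slow hash (bcrypt, scrypt, argon2); plain SHA-256 is shown only to illustrate the one-way idea.

```python
import hashlib

def digest(password: str) -> str:
    # One-way: easy to compute, infeasible to invert.
    return hashlib.sha256(password.encode()).hexdigest()

stored = digest("correct horse battery staple")   # what the website keeps

# A login attempt is checked by re-hashing and comparing;
# the plaintext password itself is never stored anywhere.
assert digest("correct horse battery staple") == stored
assert digest("wrong guess") != stored
```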
So, let's go over zero-knowledge proofs. Zero-knowledge proofs are techniques that let you prove a statement without revealing anything other than the statement itself. While this may sound a little counterintuitive, it is actually quite a useful primitive for privacy. For example, you can prove that you know the secret key corresponding to a public key, and thus identify yourself as the owner of that public key, without revealing the secret key itself.

To drive this point across, I'll actually do a zero-knowledge proof right now, with this Sudoku example. I am the prover and you all are the verifiers, and I want to prove that I know the solution to this Sudoku puzzle without telling you the solution itself. You need to be convinced that I know the solution, but you cannot learn the solution. For people who don't know the rules of Sudoku: in a correct solution, each row, each column and each three-by-three box must contain all the numbers from one to nine. That is the definition of a correct solution, and I'll prove that I know one.

The first step is called a commitment. Committing to the solution means that if you later ask me questions about it, I cannot be interactively changing the solution; that would be cheating. So how do I commit? I have a little card for each cell of the Sudoku. On the face of the card, I write the number that goes in that cell, and on the back of the card, I write the cell's location. By writing the locations and placing all the cards face down on the Sudoku grid, I am committed: I cannot change whatever is under C3, for example. Under C3, I'll always have nine. Note that this commitment is binding, in the sense that I cannot later go back on it, and it is also hiding, meaning that after looking at these face-down cards, you cannot tell what the solution is. This is a very useful primitive, used in all zero-knowledge proofs. An often-cited analogy for commitment: if I want to commit to a value, I write it on a piece of paper, put it in a box, lock it, and send the box to you, but keep the key with me. When you hold the box, the value inside cannot be changed by me, so it is binding; but since you don't have the key, you cannot open the box and read the value, so it is hiding. This is quite a useful primitive for zero-knowledge proofs, and we see an example of it in this proof as well.

In the second step, the verifier, meaning you, asks me to open any random column, or row, or three-by-three box. I have to open those cards, shuffle them, and present them to you. When I present them, you see that these cards contain all the numbers from one to nine, which means the solution for this particular column satisfies the condition for a correct solution. But I may have cheated in other columns or rows or boxes; you don't know. However, since I don't know in advance which column or row you will ask me to open, if you repeat the challenge, my chances of not getting caught drop exponentially: if we repeat k times, the chance that I cheat and never get caught is of the order of one over two to the k.
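Commitments like these face-down cards have a direct cryptographic analogue. Here is a minimal sketch of a hash-based commitment, assuming SHA-256; it illustrates the hiding and binding properties but is not a production scheme.

```python
import hashlib, secrets

def commit(value: str):
    nonce = secrets.token_hex(16)                 # random nonce makes it hiding
    c = hashlib.sha256((nonce + value).encode()).hexdigest()
    return c, nonce                               # publish c now, reveal nonce later

def check_opening(c: str, nonce: str, claimed: str) -> bool:
    return hashlib.sha256((nonce + claimed).encode()).hexdigest() == c

c, nonce = commit("9")                    # commit to the digit under cell C3
assert check_opening(c, nonce, "9")       # opens correctly to the committed value
assert not check_opening(c, nonce, "7")   # collision resistance makes it binding

# A cheating prover survives one random challenge with probability about 1/2,
# so surviving k independent rounds has probability about 2**-k.
```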
And this is actually huge. Just to give you an example: if I repeat this process 60 times, my probability of not getting caught is one over two to the power 60, which is almost as rare as an event that happens once in 100 billion years. So it is practically certain that I have not cheated. Zero-knowledge proofs, by virtue of verifiers being able to ask random challenges that are unpredictable by the prover, give you this overwhelming probability, and this is the basis of all zero-knowledge proofs. Finally, at the end, I also need to reveal the locations where the original problem's clues were marked, so that you know I have solved the given problem and not some other problem. For example, the 7 and 3 in cells B5 and B6 match the original problem, so you know I have not solved some other puzzle. At the end of it, you are convinced that I know the solution, but you have learnt nothing about the solution, and in that sense it is a zero-knowledge proof.

This is the typical structure of all zero-knowledge proofs: you have a commitment step, then random challenges given by the verifier, followed by responses from the prover. The commitment is not just a physical concept; it is a cryptographic construct too, with these binding and hiding properties.

The other point is that this mechanism of proving was interactive, and that is inconvenient in many contexts. But remember those one-way hash functions? The output of a hash function appears to be randomly generated. So you can take interactive zero-knowledge proofs and convert them to non-interactive zero-knowledge proofs by using the output of a hash function to act as the random challenge. The guarantee there is that nobody knows in advance what the output of the hash function will be: the hash function is applied on the problem itself, so the prover cannot guess where the challenges will fall or what questions will be asked, and you get the same guarantees. That gives you the ability to attach a certificate which requires no interaction: you publish it and people can verify it later.

The most important thing to take away is that although this may look like a toy example, all practical statements can be proved in zero knowledge with this overwhelming probability. That is a very strong statement. For people who are a little more technically oriented, "all practical statements" means all NP statements, which can be verified in polynomial time; I won't go into those details, but you can think of it as: everything which is easily verifiable can be proved in zero knowledge. At this point, let me pause for a minute to take questions, if there are any.

Prashant, I think there is some confusion among people about the difference between hash functions and encryption. Can you help with that first? Then people might ask you more questions on zero-knowledge proofs. Sure. Encryption is for a particular party: you encrypt something for the receiving party so that nobody in transit can decrypt it, but the party for whom you are encrypting can decrypt it. Hash functions, in contrast, are unconditional.
Looking at the hash, nobody can find out what the source message M was; hashes can even be used as random number generators, if you think about it. So I think that should answer the encryption versus hash function question.

Yeah, but there is one thing most people may not appreciate. We keep finding flaws in hash functions all the time, in the sense that, in theory, you are not supposed to be able to recover the actual message from the hash, and in practice attacks appear, along with workarounds to avoid them. If you look at the trajectory of computer science in general, we keep inventing hash functions that are harder and harder to reverse-engineer, so that you can't look at the hash and figure out the message. It has gotten to a place where the older ones like MD5 are broken, but the newer ones are pretty hard to break.

If I may add there: it is always easier to define a hash function than to break it, exponentially easier to define a new hash function than to break one. So when a hash function becomes vulnerable, you can always move on to a harder hash function. This is a game that goes on in security. But right now, for privacy, what is important to understand is that the hash function is an instrument that can give you a signature of a message, an untamperable fingerprint of it. Whether the untamperability guarantee really holds may come into question sometimes, but it can always be enforced with suitable care. The base idea is always this: the one-way hash is one of the most foundational things in computer science. There is really no privacy, security, cryptography, nothing whatsoever, without one-way hash functions.

Just to add a little over there: when we say "break a hash function", people working on breaking hash functions will typically just find two inputs that produce the same hash and declare the hash function broken. That doesn't necessarily equate to being able to tamper with a message and forge its signature. So something to keep track of: finding two inputs with the same hash doesn't necessarily invalidate the use of that hash function as a way to sign things. And for hash functions designed to be collision resistant, finding two messages which produce the same hash is itself a hard problem. Exactly: producing two random blobs of data with the same hash doesn't mean you can tamper with a chosen message and practically implement a forgery attack. Yeah, and it depends how hard it is. Let's say you can find two random bits of information that produce the same hash after 2^64 iterations: we don't consider that a practical break. But if it takes 2^32, then yes, definitely broken. Exactly: the number of iterations, the effort involved in implementing the attack practically, should be the evaluation metric. In computer science, we say a polynomial-time-bounded adversary should not be able to break it. Precisely. Not even a quantum computer should be able to do it; an adversary with finite resources should not be able to break it. Such hash functions are comparatively easy to define. There are practical usability issues.
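As a rough illustration of what "effort to break" means, here is a minimal sketch of a collision search, assuming SHA-256 truncated to 16 bits. By the birthday bound, a collision on 16 bits appears within a few hundred tries, while the same search on the full 256 bits would need around 2^128 tries, which is why truncated or weak hashes fall and full-length modern ones do not.

```python
import hashlib

def h16(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()[:4]   # keep only the first 16 bits

seen = {}
i = 0
while True:
    d = h16(str(i).encode())
    if d in seen:                                 # two inputs, same truncated hash
        print(f"collision: {seen[d]} and {i} both hash to {d}")
        break
    seen[d] = i
    i += 1
```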
But again, as I said, it's easier to define a secure hash function than to break it, so you will always win that race.

I think we are going to come back to encryption further down the presentation. Does anyone have more questions on zero-knowledge proofs, or can Prashant move ahead? I'm assuming everyone gets zero-knowledge proofs... okay, Prashant, move ahead. No, no, let me be honest, okay. ZKPs are one of the most complicated bits of computer science. Given that our audience is largely non-technical, the way we have to come back to this is to take a very common use case, something people face on a daily basis, not even Sudoku, and then go back and figure out how to do a ZKP on it. I think that is basically where we have to go back. Yeah, but the problem there is that a ZKP of something we do commonly would be very difficult to present; essentially, it requires a lot of math. Here you only get an intuition, and the actual nitty-gritty of a zero-knowledge proof lies in how the commitment scheme is made secure, how you do the random challenges, how you convert from interactive to non-interactive, and all that. Yeah. At least for me, and I had spent a lot of time on ZKPs, they are not intuitive at all; they seem extremely contradictory. Every time you explain it to a lay person, imagine a bureaucrat sitting at the other end, you come back saying: I can tell you that someone knows a secret without me knowing the secret. It sounds so contradictory. Actually, it's not, if you go down the track that the sadhus have been taking for years, but I won't tell you what it is; it'll be effective. Yeah. So somewhere down the line, that is the part we have to go back to. This initial session is okay, but let's not have any illusions about how hard ZKPs are. Okay. I think the challenge in explaining cryptography to people is how you convey the proofs without all the math. I think Prashant will focus on that in a bit. Yeah, I'm trying to keep the math away for this presentation, but yes, like Anand said, in further sessions we can go back to it.

Prashant, before you go on, perhaps it might be a good idea to give a little bit of intuition about converting the interactive proof to the non-interactive one. Okay. So, how do zero-knowledge proofs derive their power? They derive their power from the prover not being able to guess what questions the verifier will ask. In this slide, the power of the zero-knowledge proof comes from the fact that you, as the prover, cannot guess which column I'll ask you to open, so you don't know where to cheat; and wherever you cheat, if I ask you enough times, you are more than likely to be caught. Now, this unpredictability can be captured by the verifier interactively generating random numbers and asking questions. Or it can come from one-way hash functions, because the outputs of one-way hash functions are unpredictable by definition. The problem is given to you once; you have not seen it before, so a hash of the problem is completely unpredictable to you, and that forms the basis of the challenges.
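Here is a minimal sketch of that conversion, the Fiat-Shamir idea: derive the challenges from a hash of the commitments instead of from a live verifier. The function names are illustrative, not from any real ZKP library.

```python
import hashlib

def fiat_shamir_challenges(commitments: bytes, rounds: int = 60) -> list:
    """Derive the challenge for each round from a hash chain over the
    commitments. The prover must commit first, so it cannot predict
    or influence which row/column/box it will have to open."""
    challenges, state = [], commitments
    for _ in range(rounds):
        state = hashlib.sha256(state).digest()
        challenges.append(state[0] % 27)   # 27 choices: 9 rows, 9 columns, 9 boxes
    return challenges

# The prover writes down the openings for exactly these challenges and
# publishes (commitments, openings); any verifier recomputes the same
# challenges from the same commitments and checks them. No interaction needed.
```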
So those hash-derived random numbers act as the challenges, and then you can't place your cheating answers intelligently, because whatever you do, you are more than likely to be caught. That's the intuition for going from interactive to non-interactive. What you do is write down your responses to all the challenges that the hash function generates: the hash function is applied on the problem, the challenges are derived from those random numbers, and you write down the responses to those random challenges. Somebody who wants to verify calculates the same hash, and the underlying assumption is that you did not know what hash this problem would generate. So the verifier is convinced that the responses you wrote down were for genuinely random challenges, not cooked up by you.

Prashant, two things. One: don't get into too much computer science. Second, I'll go into a little bit of computer science myself and point out that the Fiat-Shamir protocol may not always be practical for all ZKP problems. Not all ZKPs can be made non-interactive easily, and the non-interactive proof may become arbitrarily large, exponential in size, for some ZKPs. Some ZKPs give you small proofs; others may require impractically large proofs to write down. So there is a dichotomy there, a caveat that not all interactive ZKPs can be converted to non-interactive ZKPs in a practical sense. Theoretically, yes, all of them can. Okay. So I guess I can move ahead now. Yeah, we'll keep asking questions as we go further. Sure, sure.

Okay, so I'll go back to this topic of anonymity and unlinkability. Anonymity is simply the state of not being identifiable within a set of individuals: you are anonymous only with respect to the set of individuals under consideration, so there is a slight, subtle difference between anonymity and privacy. When you are transacting with organizations, there is this concept of unlinkable anonymity, which means that the multiple transactions you do are, first of all, not linkable to your true identity, and second, that even multiple transactions coming from the same individual cannot be identified as belonging to the same individual. In that way, they are not linkable at all. I've tried to show that in this diagram: suppose R1 and R2 are two random numbers, and the first transaction sends some f(R1) while the second sends f(R2). Then f(R1) and f(R2) are completely unlinkable; nobody knows whether they came from the same individual or different individuals. That's unlinkable anonymity. In linkable anonymity, you still get anonymity in the sense that your true identity is not revealed, since whether the random number R belongs to you or somebody else is not known; but multiple transactions that you do both depend on this same random number, so somebody can tell that these two transactions belong to the same individual. That is linkable anonymity. To get a balance between privacy and utility, you need both, depending on the use case. This leads us to the notion of virtual identities, which was pioneered by David Chaum in 1985.
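Here is a minimal illustrative sketch of per-organization virtual IDs derived from a master secret. This is a toy construction using HMAC-SHA256 to show the linkability trade-offs, not Chaum's actual scheme.

```python
import hmac, hashlib, secrets

master_secret = secrets.token_bytes(32)          # the individual's master identity

def vid(org: str, transaction: int = 0) -> str:
    """A random-looking virtual ID for one organization (and optionally
    one transaction). Without master_secret, vid('A') and vid('B') look
    like independent random strings: unlinkable across organizations."""
    return hmac.new(master_secret, f"{org}:{transaction}".encode(),
                    hashlib.sha256).hexdigest()[:16]

vid_a = vid("org-A")   # present this to organization A
vid_b = vid("org-B")   # present this to organization B; A and B cannot link the two

# Reusing vid("org-B") keeps my transactions with B linkable to each other
# (linkable anonymity); using vid("org-B", t) with a fresh t per transaction
# makes even those unlinkable (unlinkable anonymity).
```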
The notion is that individuals own a master identity, and from this master identity they are able to generate random-looking virtual identities. The individual gives virtual identity VID A to organization A and VID B to organization B. Now VID A and VID B appear to be completely random numbers, and this gives you, first of all, unlinkable anonymity for all inter-organization transactions: if a transaction involves communication with A as well as with B, those two legs cannot be linked together, because VID A and VID B are completely unlinkable. With respect to a single organization, if you keep using the same VID with B, then all the transactions you do with B are linkable; but use cases may demand that you use a different virtual identity for each transaction, in which case they are all unlinkable. I think tomorrow we'll show a COVID app: you see all these COVID apps generate random tokens every time, which all get collected somewhere. You can think of those random tokens as virtual identities, and they are all unlinkable. So there is a place for both, depending on the use case. And finally, VID A may need to be linked with VID B, again depending on the use case; you may want to link the financial status of people with their medical data. There, purpose limitation of the linkage is extremely important: you want that linkage to happen only for a given purpose, and you do not want the purpose to be extended. That is something we will talk about tomorrow.

In these inter-organization transactions involving two organizations, a very common problem, quite common in public service applications, is this: A is some organization which gives a credential to the individual, and the individual wants to present this credential to B without letting either A or B link the two VIDs. For example, A is the college and B is the employer: the person wants to convince the employer that he has a certain degree, but does not want them to link VID A and VID B, because otherwise whatever information is associated with VID A could be combined with the information associated with VID B. That is where anonymous credentials come in; they let you do exactly this. With regular credentials, simple digital signatures, which we have seen before, organization A signs a message M, producing a signature S. This signature is presented to B, and for B to verify it, B needs to know the actual message. Now, if this message mentions the VID which the individual shares with A, then the verification process naturally leaks VID A to B, and then VID A and VID B can be linked. This is the problem that anonymous credentials solve. They are often based on blind signatures, and these blind signatures are transformable: you can think of it as obtaining a signature on "VID A has degree X" and then being able to transform it into a signature on "VID B has degree X". Only the blinding factor, the VID part, can be changed. So when B obtains this signature, it sees a signature against the VID that it knows, and it cannot link; similarly, A cannot tell where the credentials it issued have been used.
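Here is a minimal sketch of a Chaum-style blind signature using textbook RSA with toy parameters (p = 61, q = 53, far too small for real use), showing how a signer can sign a message without ever seeing it.

```python
n, e, d = 3233, 17, 2753   # toy RSA: public (n, e), secret exponent d
m = 1234                   # the message to be signed, encoded as a number < n

# The user blinds the message with a random factor r coprime to n
# (fixed here for brevity; it must be freshly random in practice).
r = 7
blinded = (m * pow(r, e, n)) % n

# The signer signs the blinded message without ever seeing m.
s_blinded = pow(blinded, d, n)

# The user unblinds: (m * r^e)^d = m^d * r (mod n), so dividing out r leaves m^d.
s = (s_blinded * pow(r, -1, n)) % n

# Anyone can verify s against the signer's public key, yet the signer
# cannot connect s back to the blinded value it actually signed.
assert pow(s, e, n) == m
```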
So there is no possibility of tracking where the signatures go and how they are used. This is actually a very important data minimization principle: each organization identifies individuals only by the virtual identities they present, and nothing else is linkable. This also shows how you could handle the Aadhaar linkability problem: you don't need one unique ID seeded everywhere; virtual identities should be enough, one per purpose.

Prashant, if I may add: Chaum gives two schemes for achieving the same thing. He first shows that this can be done with blind signatures, and gives a scheme for blind signatures; it's a little mathematical to describe, but it exists and it's a very sound technique. But you could do the same thing with a zero-knowledge proof also. Chaum also gives a construct where this transformation of messages across entities, preserving privacy, is possible using a simple ZKP, and in this case the ZKP can be non-interactive as well. Yeah. If there are questions with respect to anonymous credentials, maybe I can take them.

Perhaps you might want to comment on this unlinkability holding even when A and B collude. Yeah. Even if A and B collude, they cannot link SB with SA: if A and B were to share all the signatures they gave out and all the signatures they received, SA and SB would still appear completely random to them. So even if A and B collude, you get this unlinkability guarantee. That's a scenario you usually see when departments share data across departments, with, say, the MHA, for example. Right. This is what was most surprising, because this paper from Chaum is from 1985. So many years have passed, and the paper is about transaction systems to make Big Brother obsolete; that's in its title. It came out in Communications of the ACM, it's an absolutely seminal paper on privacy, and everything in it is trivial to implement. And yet, almost 40 years later, we still see schemes that don't look at these techniques. It's interesting that it came out a year after 1984, and refers to Orwell's 1984. Some coincidences are almost supernatural.

Prashant, but there are scenarios, which Professor Banerjee always refers to, where we want linkability, right? Linkability with access control: you only want certain departments, like the health department in particular, to be in a position to actually know the identity of the individuals, say in case of a pandemic. Anonymous credentials essentially don't fit that scenario entirely, right? Or are you saying otherwise? So, you can also have anonymous credentials which provide optional revocability of anonymity, in that some trusted authority can be set up which is able to link them. But at that point, you have to ask the question: how do we purpose-limit that trusted authority so that it cannot misuse this power? That's a separate question, and we will look into it. Yes. Yeah. Okay.

So I'll move on to the third data minimization technique, which is database anonymization. You can anonymize data for non-statistical databases too, but typically it comes up when you want to do analytics: there is no reason to ask for personally identifiable information when anonymized data serves the purpose, and that's why anonymization is such a popular strategy.
You hide your name, you hide your age, you hide your location, replace them with some coarse information or add some noise, and you believe that your data is anonymized. As an example here, the name is replaced with stars and the age is replaced with a range, so this looks like a reasonably anonymized database. There are many formal notions of database anonymization. For example, k-anonymity says you anonymize in such a way that at least k individuals share the same attributes; similarly there are l-diversity, t-closeness, and multiple other notions, but I will not go into them.

What I will say, at a high level, is that this approach does not work, and the reason is very simple. As you keep adding more dimensions, more columns, to your database, you can think of each individual as a point projected into a very high-dimensional space, and this high-dimensional space is extremely sparse: it is very unlikely that someone else's 10 attributes match mine, and finding an individual whose 20 attributes match mine is rarer still. That is why this high-dimensional space is so sparse. What these anonymization techniques do is add some noise, so you are no longer identifiable as a point but as a sphere surrounding that point. But since the space is so sparse, there are not many other individuals inside that sphere, so you remain essentially identifiable even after the anonymization. A very intuitive way to see this: suppose you collect data about me and remove my name and where I work; but if the record says this is a person who is a PhD student at IIT Delhi, is aged this much, and is this tall, then all these attributes combined certainly identify me. Even though none of them is an identifier individually, combined together they necessarily become a quasi-identifier, and that is a huge problem.

It is actually quite well established that anonymization does not work, and this has been demonstrated through various kinds of attacks. People have de-anonymized social network data and location data. For example, there was this Nature paper which was able to de-anonymize people by tracking just four spatio-temporal points: they take four coordinates in a day from your mobile GPS, and since most people follow the same path to their office, they are able to identify you with very high precision from location data alone. You are also identifiable by how you write code, or by what your browser history looks like. All these things which you would never consider identifiers actually become your identifiers. This is a huge problem, and it should be immediately understood: we always talk about anonymization, but it is never enough.

I hope that picture is roughly clear, but I also want to give a little theoretical grounding. There was a paper which showed that no matter how you anonymize, if you have a database with some n rows and your adversary asks, let's say, on the order of n queries, then the adversary can almost reconstruct the entire database. And that is actually not very surprising.
Because if you think about it, even if you're only asking statistical queries, like the average over this particular segment of the population, that amounts to a bunch of linear equations over the values of the individual rows, and solving them identifies what the value of each individual row was. In that way, the entire database can always be reconstructed, and this is why anonymization does not work.
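Here is a minimal sketch of such a reconstruction attack, assuming numpy, with exact answers for simplicity (the classic Dinur-Nissim result shows reconstruction succeeds even when the answers are noisy). Each "harmless" aggregate query, the sum over a random subset of rows, is one linear equation; n independent queries pin down all n rows.

```python
import numpy as np

rng = np.random.default_rng(0)
secret = rng.integers(20, 100, size=8)        # the hidden column, one value per row

# The adversary asks 8 subset-sum queries; each mask row says which
# individuals a query covers. Keep drawing until the queries are independent.
masks = rng.integers(0, 2, size=(8, 8))
while np.linalg.matrix_rank(masks) < 8:
    masks = rng.integers(0, 2, size=(8, 8))

answers = masks @ secret                      # what the database truthfully returns

# Solving the linear system recovers every individual's value exactly.
recovered = np.linalg.solve(masks.astype(float), answers.astype(float))
assert np.allclose(recovered, secret)
```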
So if there are questions here, I can take them. I think there are a few questions in the chat. People are asking: can you give more nuance on how easy or difficult it is to de-anonymize? In theory no lock is completely safe either, but we still rely on locks to keep our houses safe. So how easy is it to de-anonymize a database? So, first of all, it is hard to put a bound on how easy de-anonymization is, because it depends on what the adversary already knows. But otherwise, these attacks are quite efficient and quite easy. Yes, for different kinds of databases you would have to design different attack techniques, but following similar principles you can extract a lot of information. Even where you cannot exactly pinpoint and de-identify who the individuals are, you still get a lot of information about them.

If I may add to that: the reason it is impossible to derive a bound on the ease of de-anonymization is that de-anonymization is not a self-contained problem; it uses auxiliary information from multiple sources. But there have been enough attacks, demonstrated by Narayanan and several others, that it would be somewhat foolhardy to depend on this technique. Anonymization is still treated, even now, despite all this work, as a primary privacy-preserving technique, but it doesn't stand up to scrutiny. Every time anonymization is used as a primary method, you have to be extremely careful; there will almost always be an inferential privacy attack possible. And the Princeton group has shown that these attacks are also very, very computationally efficient, solving linear equations, for example, and can be orchestrated with almost effortless ease.

Just to add to what Prakhar was hinting at, that no lock is secure but we still use locks: taking that thinking a little further, we also vary the kind of lock we use for different kinds of assets, and we also engage with policy to ensure that even things that are not locked, like public spaces, are not damaged. There's a whole range of thinking that derives from that lock question. This of course depends on the usage, the kind of security or privacy-preserving technique you put in. But if you're talking about a national-level database, like electronic health records and so on, public service data for a whole nation of 1.3 billion people, then the privacy standards will have to be significantly higher. Absolutely. Something that relevant, that central and that public needs a completely different level of policy thinking around its security; it's not the same as locking your house.

Just one final question before we move on. There is a question, though not a computer science question, of whether someone can file an RTI to find out what data minimization technique is being used inside government. But at the same time, I want to bring in this case of sale of anonymized data that the Ministry of Road Transport actually stopped this week. They were selling around 12 parameters for each vehicle owner, type of vehicle, colour of vehicle and so on, and they finally decided that this data can be de-anonymized. We don't know how they decided or what exactly they did, but they have essentially stopped the sale of data. But if you look at who bought the data, there are companies like TransUnion CIBIL, which have access to other parallel databases, but there are also companies like Ola, which directly have access to your location data. So what would a de-anonymization attack on these datasets look like in these different scenarios?

Prashant, can I take that? It will almost always be a linear-equations attack: form a set of linear equations and solve them, in low-order polynomial time. When you hear about something like the transport department's data being sold, the default null-hypothesis assumption has to be that it can be de-anonymized, not that it is safe. That's why this "anonymization is a myth" slide is important: the default assumption should be that the data is de-anonymizable, and if the designer thinks otherwise, that must come with a proof, some kind of guarantee that de-anonymization in this special case is hard. Most anonymization can be broken trivially. There is a US Census paper in the same area: they wrote a public paper saying they were able to solve the entire reconstruction problem with 400-plus linear equations, and on a Pentium processor, for the dataset they had published, it took about 350 seconds or so. And I think the same line of work concluded that at least 87% of Americans could be identified from roughly their date of birth, gender and ZIP code, parameters like that. Give me three more parameters and the anonymized dataset, and I will identify 87% or 90% of it. And it was actually done by the Census Bureau, not by some academic. Yeah, exactly.

And with respect to your question, Srinivas, there is also the problem of linking. I talked about high-dimensional spaces: if you are able to link multiple databases using certain parameters, then suddenly your dimensionality increases; you jump from a 10-dimensional space to a 20-dimensional space because you can now link the two together. Especially for companies like Ola, that would be a huge risk. With respect to the RTI idea, I think the problem is that loss of privacy can never be quantified: if you know something about me, how do I know you know it? And how you are going to use it is also not quantifiable. So the RTI approach may be a good defence when we have no strategy other than anonymization, but in general it is not a very strong way to protect privacy. Yeah, that is the post-facto audit approach: you audit how things happened. But one of the privacy-by-design principles says privacy should be preventive, not reactive, and the precise reason is that it is very hard to put a bound on whether a privacy loss happened or not.
Whether you read something, whether you know something or not, is very hard to judge. I think there is one short question and then we can move on: why should the default assumption be that data can be de-anonymized? Because there are so many attacks out there, and it is pretty much well established in computer science that anonymization does not work. Also, just given the amount of data collection going on, auxiliary data is already out there somewhere. And since you can't guarantee otherwise, it's almost a form of zero knowledge: since you have zero knowledge of what people are going to do with that data, you must assume certain things at a base level. That's why. Yeah. And especially if it is a public database, it's out there. If it is a controlled database where every query is mediated, then even then you can say something; but once the data is released, it's released. I think the RTI question was actually different from what you answered: the person was asking whether you can ask, in an RTI, which technique was used for anonymization. I think that is a question for Srinivas. Unfortunately, we don't know; we never get to know, especially these days. I was never able to get access to any code inside the government. Data maybe, but code no. So any technique, unfortunately, no. Prashant, I think we can move on. Okay. Yeah. Sure.

So actually, this is a continuation of what we discussed. You will lose your mouse pointer if you go full screen. Okay, I wanted to show these step by step.

I wanted to talk about the impossibility of absolute privacy, continuing from the anonymization discussion. What would your absolute privacy goal be? This is also called inferential privacy: there is a database you have designed, and you want it to be private. What does it mean to be private? You want to say that if a person A has access to the database, A should not be able to obtain any information about an individual that a person B, who does not have access to the database, cannot obtain. If A obtains some information about an individual by interacting with the database that B cannot, it means the database leaked some information. This is what you would want, but it was shown to be impossible to achieve, especially if the adversary talking to the database has an arbitrary amount of side information. I'll give you a very simple example. Suppose it is common knowledge that the salary of the director of a company is twice the average salary of all its employees. For person B, that is all the person knows. But person A, who has access to the database, can make a simple statistical query, the average salary, which does not seem to violate privacy in any way. By using this auxiliary information, A is able to find out the salary of the director, an individual's private attribute. This is a very simple example demonstrating that absolute inferential privacy is impossible to achieve. And you should note that for this attack, for someone to learn the director's salary, the director does not even need to participate in the database: even if the director is not in the database, you would find out the director's salary.
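As a worked version of that example (illustrative numbers only):

```python
# B knows only the public fact: director's salary = 2 x average salary.
# A can additionally ask the database one innocuous statistical query.
employee_salaries = [40_000, 55_000, 62_000, 48_000]     # rows in the database

avg = sum(employee_salaries) / len(employee_salaries)    # A's "harmless" query
director_salary = 2 * avg                                # auxiliary fact does the rest

print(director_salary)   # 102500.0: an individual's salary is leaked, even
                         # though the director appears nowhere in the database
```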
And so it means that the privacy loss is not limited to people who actually participate in the database; it extends to other people too. This kind of observation gave rise to the notion of differential privacy. Differential privacy changes the goalpost. The starting point is that, in an absolute sense, when you let people interact with databases, some privacy will be lost. What differential privacy guarantees is that the additional privacy risk you incur by participating in the database is minimal: whatever privacy loss you suffer will mostly be loss you would have suffered anyway, because of other people's data or because of the impossibility of inferential privacy. That is the goal of differential privacy.

Typically, what is done is this: the analyst asks some queries of the database, and you evaluate each query's sensitivity. If I answer this query accurately, how much would the answer change if just one row, just one individual's data, were changed? What we are trying to protect is that whether an individual participates in the database or not, their privacy risk should be roughly the same, because we are minimizing the additional risk of participation. So you measure how much the answer can change when a single row changes, and you add noise calibrated so that, no matter how a single row is changed, the answer's distribution barely changes. This means that from the answer, the analyst cannot even tell whether a particular individual is in the database at all. The analyst may still be able to find the director's salary, but they would have found that out even if the director were not in the database. So the technique is to interactively calibrate noise to the sensitivity of each query, to maintain each user's differential privacy.
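Here is a minimal sketch of the standard Laplace mechanism, assuming numpy. For a counting query ("how many rows satisfy some predicate?") the sensitivity is 1, since changing one person's row changes the true count by at most 1.

```python
import numpy as np

def laplace_count(true_count: int, epsilon: float) -> float:
    """An epsilon-differentially-private answer to a counting query."""
    sensitivity = 1.0                      # one row changes a count by at most 1
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# Smaller epsilon means stronger privacy and more noise (worse utility):
print(laplace_count(4213, epsilon=1.0))    # typically off by a few
print(laplace_count(4213, epsilon=0.01))   # typically off by a couple of hundred
```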
And perhaps you can already see some problems with differential privacy. First, as your query's sensitivity increases, meaning you ask a query which depends heavily on a single row's data, then to protect the privacy of that individual you have to add a lot of noise; as the sensitivity increases, the noise increases. This is also why differential privacy can never be used for non-statistical databases: things like financial transactions, or simple OLTP workloads where you are not doing analytics, are not doable with differential privacy. Second, for each type of query you have to design how the noise will be added so that you get good enough utility while maintaining differential privacy, so there is always this inherent utility-versus-privacy trade-off. And for the same reason, you cannot answer very many queries; otherwise the reconstruction attacks we saw earlier become possible even with differential privacy. Finally, even keeping aside all these problems, differential privacy talks about one individual's privacy, but what do you do about community-level profiling? Somebody may not learn your attributes specifically, but learns the profile of people belonging to a certain community, as demonstrated by the Cambridge Analytica episode. How you prevent all that is not clear. So these are the problems with differential privacy. I'll stop again at this point to see if there are any questions.

So differential privacy is a significantly weaker goal compared to inferential privacy, right? It shifts the goalpost by quite a bit, actually. You are saying: the inferential privacy loss is inevitable; what is the additional loss that I can prevent from happening? That's what differential privacy is. I just wanted to emphasize that. Yeah. And the use cases for these techniques differ by scenario: you mentioned a scenario where you can't use this, financial transactions, but differential privacy could be a really good fit for, say, credit-scoring analytics. For analytics, yes, yes. Within each sector you would have to look at the different scenarios where it fits; it's not one solution for all.

In fact, just to interrupt: differential privacy is not really a technique. As Prashant said, it's a goalpost, a notion of privacy. And it is interesting, in the context of the lawyers among you, that this notion was appealed to by Manindra Agrawal in his expert testimony in the Aadhaar case. I personally think it was a terrible argument, and Srinivas, I think, agrees with me, because, first of all, differential privacy is a symmetric notion, and the argument made there I did not quite find convincing. In any case, it's something to look at, and fairly interesting. And in a pandemic situation, or actually, forget the pandemic, in any situation where there is very clear pressure from the state to invoke eminent domain, or with diseases that carry some stigma, you have to be a little careful about whether this is an appropriate notion of privacy to be applying to those databases. Right: differential privacy is a very individualistic notion, a guarantee that my additional privacy risk will be minimal; but from a policymaker's point of view, like you said, it is not clear that it is the appropriate notion of privacy to be talking about. So Srinivas, can we go ahead? Yeah, there are just queries on whether this has been implemented in practical scenarios. Oh, yes. Differential privacy was introduced as a notion in 2006, and since then it has actually started a new field in itself, with a lot of work. But most of it targets specific problems, meaning for specific types of queries you can do differential privacy really well; so first of all, this is not a silver bullet which will answer all your questions, and second, whether the notion itself is the right one remains unclear. Yeah, I think that's about it.

Okay. So this brings us to the question: what is a necessary condition for privacy? I talked about the impossibility of absolute privacy, and we saw that inferential privacy isn't quite achievable. So what should we do?
So this impossibility of absolute privacy, if you think about it, arises because you are allowing adversaries to do arbitrary processing. And that is where I think we should have mechanisms to stop it. Essentially, what you want to ensure is that all illegal data access and processing must be prevented. If you want to process some data for a certain purpose, you must declare what purpose you are collecting the data for, and there should be a mechanism in place to ensure that only computations which fulfill that purpose, and nothing else, are allowed. That would be in line with the legal principles of purpose specification, use limitation, and so on. And what counts as a legitimate purpose depends on how people have consented, what approvals you have, the authentication of the person who is asking for the data: various dynamically changing things. So you need some external body which controls all of this and decides what is legitimate and what is not. And I must point out that some preliminary work exists on purpose-based privacy policies, but it was done in an era when controlling what computations may happen was not a notion at all. So those works used a very poor proxy for purpose. For example, many of these papers take the role of the data requester as the purpose: somebody from the accounts department is asking for data, so they will use it for the purpose of accounting. That is clearly not a sufficient notion. But now, as I will demonstrate next, we have techniques to control what computations may happen on data, and that could be leveraged to give you this definition of privacy. (A toy sketch of this gatekeeping idea appears just after this overview.) Finally, you also need data minimization, not only as a further defense in case some of these techniques break, but also because when data exits the regulated, controlled boundary (whenever you actually give it to individuals, to humans), then whatever happens to that data is unbounded. At that point you only have a defensive approach at hand, and you must follow data minimization principles and share only the bare minimum required for the purpose. So this brings us to the goal of secure remote execution, where you must secure data in such a way that the remote party you are giving the data to can only execute a given program on it. You know upfront what program they are going to execute, and nothing else gets executed; they should not be able to use the data in any other way. And there are cryptographic solutions to this, as well as hardware-based solutions. The cryptographic, software-based solutions are fully secure in the sense that they do not require trust in anything, but they are extremely slow. Two popular branches here are homomorphic encryption and secure multi-party computation. On the other hand, you have the hardware solutions. These depend on a trusted hardware, and you assume there is some kind of guard in place which ensures that nothing except a given program is executed; and these are extremely fast. By extremely fast, I mean extremely fast compared to the cryptographic solutions: there is still some overhead with respect to an unsecured execution, but they are actually reasonably practical. So there is Intel SGX, which is present in most of today's laptops and desktops, and a slightly older technology, ARM TrustZone.
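As a toy illustration of the gatekeeping idea mentioned above, here is a minimal Python sketch in which an external authority approves (purpose, program) pairs by code hash, so that only the declared computation runs. The registry, purpose name, and program text are all hypothetical, and a real system would enforce this inside a sandboxed or attested runtime rather than with a plain exec.

```python
import hashlib

# Hypothetical registry maintained by an external authority: for each declared
# purpose, the SHA-256 hash of the only program text approved to run on the data.
TAX_AUDIT_PROGRAM = "result = sum(record['income'] for record in data)"
APPROVED = {"tax-audit": hashlib.sha256(TAX_AUDIT_PROGRAM.encode()).hexdigest()}

def run_for_purpose(purpose, program_text, data):
    """Execute program_text only if it is the approved computation for purpose."""
    digest = hashlib.sha256(program_text.encode()).hexdigest()
    if APPROVED.get(purpose) != digest:
        raise PermissionError("computation not approved for this purpose")
    scope = {"data": data}   # toy stand-in for a sandboxed, attested runtime
    exec(program_text, scope)
    return scope["result"]

records = [{"income": 100}, {"income": 250}]
print(run_for_purpose("tax-audit", TAX_AUDIT_PROGRAM, records))   # 350
# Any other computation on the same data is refused:
# run_for_purpose("tax-audit", "result = data", records)  -> PermissionError
```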
ARM TrustZone is also around, for Android devices. So I will now briefly talk about homomorphic encryption. This is actually a neat idea. The problem with ordinary encryption is that you encrypt the data, but to compute on it you need to decrypt it first, and only then can you compute on it, right? Traditionally, you could not perform computations on encrypted data, but homomorphic encryption schemes let you compute on encrypted data without ever decrypting it. Since you never actually decrypt the data, it can only be computed on in a given way. Before going further, if there are any questions till this point, I can take them now. Everyone? Okay, so all right. So I'll briefly explain the concept behind homomorphic encryption. I won't explain secure multi-party computation and garbled circuits, because I think that becomes too complex and is probably not relevant here. The idea of computing without decrypting is this: first of all, all computation can be expressed as plus and multiply. Just as all computation can be expressed as a bunch of NAND gates, you can think of all computation as expressible in terms of plus and multiply. So if you are able to compute plus and multiply in an encrypted fashion, there is hope that you can compute everything in an encrypted fashion. That's the idea. An encryption scheme is called additively homomorphic if the encryption of the plaintext sum of A and B equals a special addition of the two encryptions: Enc(A + B) = Enc(A) ⊕ Enc(B). The right-hand side happens entirely in the ciphertext space: you compute the encryption of A + B without ever decrypting A or B; you only operate on the ciphertexts and apply this special plus operation to them. Similarly, a multiplicatively homomorphic scheme is one where you can obtain the encryption of A times B by performing a special kind of multiplication on the encryptions: Enc(A × B) = Enc(A) ⊗ Enc(B). You can imagine that, in this way, you express each computation as a circuit of plus and multiply and keep computing on ciphertexts: the encryption of any arbitrary expression, any arbitrary computation, can be built out of these primitive encrypted operations. At the end of it, you still get an encryption, so you don't know what value you computed; you give the result to the party to whom you want to present the answer, and that party can decrypt it. That is the rough idea of homomorphic encryption. The remote party performing the computation never gets to know the answer, and the decrypting party only learns the answer to the final computation, not any intermediate values, so there is no problem of leakage of intermediate data. But here is the catch: designing an encryption scheme which supports either additive homomorphism or multiplicative homomorphism alone is easy, but designing something that satisfies both, that is, is both additively and multiplicatively homomorphic, is actually quite challenging. Such schemes do exist; there was a breakthrough in 2009. But they are extremely slow, orders of magnitude slower, and far from practical; maybe in a few decades you will see something, but at least for now they are not a practical solution at all.
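As a concrete toy of the multiplicative case: textbook RSA happens to be multiplicatively homomorphic, because Enc(A) × Enc(B) = A^e · B^e = (A·B)^e = Enc(A × B) mod n. The sketch below uses tiny, utterly insecure parameters purely to show the property; it is not an example from the talk.

```python
# Textbook RSA with tiny, insecure parameters; illustration only (Python 3.8+).
p, q = 61, 53
n = p * q                            # 3233
e = 17
d = pow(e, -1, (p - 1) * (q - 1))    # private exponent

def enc(m): return pow(m, e, n)
def dec(c): return pow(c, d, n)

a, b = 7, 12
# The "special multiply" on ciphertexts is ordinary modular multiplication:
# Enc(a) * Enc(b) = a^e * b^e = (a*b)^e = Enc(a*b)   (mod n)
product_ct = (enc(a) * enc(b)) % n
print(dec(product_ct))   # 84, i.e. 7*12, computed without decrypting a or b
```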
And finally, there is the problem of how you convert your regular computation to a circuit, and that itself has its own inefficiencies. So I think I should add a word there: you are rubbishing fully homomorphic encryption, and the homomorphic encryption researchers will be angry. I think that in specific scenarios, homomorphic encryption can be used to great effect. For example, for electronic voting: people have been able to tally votes completely in the ciphertext space, in the encrypted space, and there are absolutely outstanding electronic voting solutions based on homomorphic encryption. So there can be specific applications, but converting a general-purpose computation to a homomorphic solution requires problem re-engineering at a scale that is not practical. Yes. With respect to voting, actually, I just wanted to point out that that is a special-purpose application which can be solved even if you only have an additively homomorphic scheme. You don't need full homomorphic encryption, where the scheme must be able to do both plus and multiply. So yeah, some applications will turn out to be such that they can be done nicely with just an additively homomorphic scheme, or with just a multiplicatively homomorphic scheme. But in general, yes, it's a very hard conversion. There is a question: in what scenarios should one use additive versus multiplicative homomorphic encryption? That varies from problem to problem. For example, voting requires you to count votes, which is naturally an addition of votes by different individuals, so an additively homomorphic scheme fits there; a toy sketch of such a tally appears below. For a different problem you may find that multiplicative homomorphism is what's required. So it depends on the problem. On the issue of computing time: you said this is very slow. I think it was up to a factor of 10^12 compared to non-encrypted analytics if one is analyzing using fully homomorphic encryption, though it has been decreasing over the years. So do you see an actual scenario where this is going to be implemented on par with the non-encrypted case? It's also an added business cost, especially for companies, to invest in this. Yeah, so I mean, it's not an additional investment per se, but actually there was a recent paper which compares the cost of doing this in software, using fully homomorphic encryption, against doing it by buying special hardware. And it turns out the hardware solution is way cheaper than doing it in software. Considering the compute costs, assuming you are charged by the hour at Amazon EC2, the costs you would incur with the software solutions are exorbitantly high. And even though, yes, techniques are emerging to make fully homomorphic schemes practical, the general consensus in the community is that these techniques are still way behind any practical deployment. Maybe a decade away, at least. So there is a friend of ours called Shreeta Agarwal; if she heard this, she would have beaten you up. She works on homomorphic encryption; that's her specialty. And any active researcher in the domain will tell you: just a few years, just hang on. So I think the jury is out. This is an extremely promising area of work. There are two issues out there. One is doing fully homomorphic encryption fast; that is one aspect. The second aspect is that not all computations are naturally expressed in homomorphic terms, so there is a translation process in there.
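To illustrate the voting point, here is a toy sketch of the Paillier cryptosystem, a classic additively homomorphic scheme: multiplying ciphertexts adds the underlying plaintexts, so encrypted 0/1 ballots can be tallied without decrypting any individual vote. The primes and ballots are made up and far too small for any real use.

```python
from math import gcd
import random

# Toy Paillier cryptosystem (additively homomorphic). The primes are tiny and
# insecure; everything here is for illustration only. Requires Python 3.8+.
p, q = 293, 433
n = p * q
n2 = n * n
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)   # lcm(p-1, q-1)
mu = pow(lam, -1, n)                           # valid because g = n + 1

def enc(m):
    """Encrypt m with fresh randomness r: c = (1+n)^m * r^n mod n^2."""
    r = random.randrange(2, n)
    while gcd(r, n) != 1:
        r = random.randrange(2, n)
    return (pow(1 + n, m, n2) * pow(r, n, n2)) % n2

def dec(c):
    """Decrypt: L(c^lam mod n^2) * mu mod n, where L(x) = (x-1)//n."""
    return (pow(c, lam, n2) - 1) // n * mu % n

# Each ballot encrypts 0 or 1; multiplying ciphertexts adds the votes inside.
ballots = [enc(v) for v in (1, 0, 1, 1, 0)]
tally_ct = 1
for b in ballots:
    tally_ct = (tally_ct * b) % n2
print(dec(tally_ct))   # 3: the tally, computed without decrypting any ballot
```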
For many problems, for example if I'm doing sorting or searching, you know, some SQL queries, there is a theorem which says I can do it in the homomorphic space, but actually converting it will be a lot of work. And how you convert legacy applications or legacy algorithms into a homomorphic form is an open question. So there are two issues out there: one is making homomorphic computation fast, and the second is making the translation practical. Both are open questions. Any other questions? I think that's it. Okay. So now I'm going to describe the Intel SGX solution, which is the hardware solution, and it depends on trusted hardware. What you do here is that you trust the hardware, and the hardware has a special security module installed in the CPU, let's say. Its job is to run your program in an isolated box called an enclave. And the job of this security module is to give you two guarantees: the confidentiality guarantee and the integrity guarantee. The confidentiality guarantee says that while the program is executing, none of its intermediate state, or the final state, or anything else, is visible to anybody outside the enclave. And the integrity guarantee says that while it is executing, nobody can tamper with the execution midway; nobody can change arbitrary memory locations and alter the execution. Together, these two guarantees give you purpose limitation: if you want to make sure that only this particular program runs, these two guarantees together achieve that. The way the security module works, and I am representing this in a very abstract way, is that all memory requests first have to go through the security module, which is why it is part of the CPU package itself. If a request is for a part of memory which does not belong to the enclave, it is allowed. But if it is an outside request trying to read enclave memory, that is not allowed. This gives you confidentiality, and there is a similar mechanism for integrity. So this is a simple solution, but it's also a very strong solution: unless you have physical access to the box and can actually break open the box, it's very hard to break these guarantees. There's a slight catch, which I'll come to later. But okay: you get confidentiality and integrity; you load the program into this enclave, and then you are guaranteed that that's how it will run. But as a remote agent (you don't have this machine; it is installed at the data controller, right?), how do you know whether they are running the given program? That comes from the third basic property it provides, called remote attestation. It is basically a signature mechanism: the trusted hardware, the security module within the CPU, signs a statement which essentially means the following, that this particular enclave contains this code X, and its execution's confidentiality and integrity will be protected by the hardware from here on. And it signs it with its own hardware key, which is fused into the hardware so that nobody can read it.
And you, as the verifier, check whether the signature comes from the right hardware key, and you check whether the code X they are talking about is the code you expect, and only then do you send any of your sensitive data. Right? This ensures that any sensitive data will only be processed as per code X, and not in any other way. So this gives you a reasonably practical solution to this problem of allowing only certain types of computations. But there is a catch you need to be aware of, which is side-channel attacks: if you observe the memory access patterns (where memory is being accessed) and associate that with the timing of accesses and whether the cache is full or not, these kinds of things may reveal some information. There are some successful attacks which break the confidentiality guarantee this way. But this is a very hard attack to mount. First of all, you need physical access to the machine; you cannot do it remotely. Second, you need to observe these access patterns very carefully, and it may not always work: for some kinds of algorithms, yes, this may be a possible attack, but not for all. But it is something you should be worried about, and there is research going on to give you Intel SGX-like guarantees together with protection against side-channel attacks. People have proposed other hardware which protects against some of these things. What I want to convey is that this is an active area of research, and these are reasonably good practical solutions today. SGX is available in most modern laptops and desktops; most modern MacBook Pros have it nowadays. And the equivalent, ARM TrustZone, which was actually a predecessor of SGX, is present in mobile phones, the Android phones at least. So that's something which gives you this purpose-limitation kind of guarantee. If there are any questions on Intel SGX, I'd like to take them now; then I'll quickly wrap up with one or two slides. Do you have any industry applications that are already happening on this? Yes; actually, these are mostly used for digital rights management. Apple uses them for ensuring that you have paid for all the Apple software and that only genuine versions are running. So SGX use has primarily been limited to that. For database applications, there is a great system called EnclaveDB from Microsoft, where they show that extremely high-performance databases can be run entirely within the SGX environment. And I can see my colleague from my department, Riju, is one of the listeners. She is an absolute expert on SGX and especially TrustZone and specializes in it. Maybe she can throw some light. Yeah, I think on both of them there are attacks still going on, especially on the side channels. And sometimes these things are not open for programming by normal people. You will see SGX is present on your laptop, but you are not able to create an enclave or run a program in one; similarly with ARM TrustZone. All our phones are TrustZone-enabled, so the chip supports it, but you cannot run your own secure program. So this is essentially for security: companies like Samsung say that we have a secure chip, but it is not open for programming.
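Here is a rough sketch of the attestation-and-verification flow just described. Real SGX attestation uses an asymmetric key fused into the CPU and Intel's quoting infrastructure; the HMAC shared secret below is only a stand-in to show the shape of the protocol, and all names here are hypothetical.

```python
import hashlib
import hmac

HARDWARE_KEY = b"stand-in-for-key-fused-into-cpu"       # hypothetical
EXPECTED_CODE = b"def process(data): return len(data)"  # the code X we expect

def hardware_quote(enclave_code):
    """What the security module attests: a measurement (hash) of the enclave code."""
    measurement = hashlib.sha256(enclave_code).digest()
    signature = hmac.new(HARDWARE_KEY, measurement, hashlib.sha256).digest()
    return measurement, signature

def verifier_accepts(measurement, signature):
    """Check the signature AND that the measured code is exactly code X."""
    expected_sig = hmac.new(HARDWARE_KEY, measurement, hashlib.sha256).digest()
    sig_ok = hmac.compare_digest(expected_sig, signature)
    code_ok = measurement == hashlib.sha256(EXPECTED_CODE).digest()
    return sig_ok and code_ok   # only on True do you send any sensitive data

m, s = hardware_quote(EXPECTED_CODE)
print(verifier_accepts(m, s))        # True: right code, right key
m2, s2 = hardware_quote(b"def process(data): leak(data)")
print(verifier_accepts(m2, s2))      # False: measurement is not code X
```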
So the use is sort of limited, because there are very few people with access to these things who can program them. Samsung makes something called Knox; Intel itself, or Microsoft, which has some collaboration with Intel, can have their EnclaveDB. So it has created a sort of ecosystem where some people have access to these devices, but they are showing more and more capability, so it will become open to more people. But as of now, a normal app developer who wants to use TrustZone will probably have to use a DevKit, which is not really a phone but an open kind of phone with all these wires coming out and so on. So yeah, there is still a platform issue at the moment. I think, Prashant, you can go ahead; there are no more questions. All right. So the last point I want to bring out is that SGX takes care of your computation, but you also need the database where you actually store the data, with which SGX, or whatever remote execution mechanism, interacts. You can think about this space in two disjoint branches. The first is that you have an encrypted database and you want to query it, and that in itself is quite a challenging problem. There has been research on specially designed searchable encryption schemes: searchable indices stored alongside encrypted data. What this means is that you have encrypted data, but for the purpose of searching, you add a certain index, and that index structure is useful for searching but does not reveal much information about the encrypted data itself. As a quick example: say you want to search for somebody by their name, exact equality testing. What you can do is add an index: you have everybody's name in encrypted form, but next to it you add an index which is just a hash of the name. Somebody who looks at the hash does not know what name it refers to, but it lets you search: you search using the hash of the name, and that fetches the correct record, which you can decrypt and use. (A toy sketch of this appears below.) The problem with this kind of technique is, first, that it is not very flexible. For example, what if you want to do ordering: give me all records whose age is greater than 18? How do you do that with a hash index? So there are order-preserving encryption schemes and so on, but the point is that as you keep adding more and more capability to the indices, the indices themselves start containing information about the encrypted data. All those reconstruction attacks we talked about earlier, many of them are possible and have been demonstrated on searchable indices too. And this is a problem, which is why I think the EnclaveDB solution, where you put the entire database within an enclave, may be a good one. But yes, there are practicality constraints with that as well: how do you put an entire database within an enclave, and how efficient is that? All those are current questions which people are actively trying to solve, but this, I believe, is a good direction to go. There is another field called private information retrieval, but the concern of that field is different: there the database itself is in plaintext; it may be some public database, and you are querying that public database.
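To make the hash-index example concrete, here is a minimal sketch of that exact-match searchable index. The toy XOR "cipher" and the key are placeholders for real encryption; and note that a plain unkeyed hash lets an attacker test guessed names against the index, which is part of the leakage problem mentioned above, so real schemes use keyed constructions.

```python
import hashlib

KEY = b"controller-secret"   # hypothetical key; toy scheme, illustration only

def toy_cipher(data, nonce):
    """Toy XOR stream cipher (encrypting and decrypting are the same op);
    stands in for a real encryption scheme such as AES."""
    stream = hashlib.sha256(KEY + nonce).digest() * (len(data) // 32 + 1)
    return bytes(d ^ s for d, s in zip(data, stream))

def index_of(name):
    # The searchable index: just a hash of the name, as in the example above.
    return hashlib.sha256(name.encode()).hexdigest()

# Build the encrypted table: {hash(name): encrypted record}
people = {"asha": b"record: asha, age 34", "ravi": b"record: ravi, age 51"}
table = {index_of(n): toy_cipher(rec, nonce=n.encode()) for n, rec in people.items()}

# Exact-match search: hash the queried name, fetch, decrypt.
query = "ravi"
print(toy_cipher(table[index_of(query)], nonce=query.encode()))
# b'record: ravi, age 51', found without the server ever seeing plaintext names
```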
But you are only concerned that whoever is responding should not know which data items you actually queried. So it hides the access patterns, but the database itself is in the clear. With respect to the privacy question, this may not be very relevant to us, but I put it there for completeness. So this is now my last slide. Essentially, I hope you are convinced that privacy is a complex problem. It requires a multi-pronged approach; there is no single silver bullet. Many techniques exist for each aspect, but you need an architecture, a set of design principles and operational principles, to give you a privacy guarantee that is well aligned with the legal principles of privacy. And that is something we will talk about in the next session. So we have hopefully established that purpose limitation is the crucial privacy requirement, especially given this impossibility of absolute privacy: you can only control what processing is allowed. And that is somewhere computer science research is preliminary at best, and I think that's where we should spend more time. The other two points will be covered in tomorrow's talk more elaborately. Basically, there is no existing technique which talks about an external regulator which decides what is legal and what is illegal processing, what is legal or illegal access, and actively controls it. We give a broad, high-level design architecture to do that. And there is also the question of consent, how much individuals themselves should manage their own privacy, and what the role of the regulator is. That is also something we will bring out in tomorrow's talk. So with that I would like to finish. If there are any questions, I'm happy to take them now. Thanks, Prashant. If anyone has any open-ended questions on any of these topics, or if you're confused about anything, please just go ahead and ask. But ideally, say on the chat that you're going to unmute your mic, so that we do it one by one. There's a larger problem, not just with the techniques but with the implementation. And that problem is why industry... So here is my industry opinion. Historically, all teams struggle just to deliver what is required for the business, and there is always the competing element of time: you only have X developers and your list of things to do is 2X developer-time. So where is the time and bandwidth and energy to do the research to figure out how to implement, say, a ZKP? I mean, forget ZKP: we have historically found our teams struggling even to do bcrypt, and putting MD5 checksums out in the open. And this has been going on for such a long time. It is imperative for us to go back and ask: what capacity building are we doing on the industry side to ensure they get these implementations right? I think that is where the problem is going to be, not in the theoretical constructs. Because these are all very nice-sounding things, and I know it doesn't take a lot of time, but the amount of research that has gone into my head is very high, so I kind of know how to do it. But how do you build that capacity in any company? That is the hard part, and it is not a technological problem; it's basically a bureaucratic slash organizational problem.
And until you do that, any amount of regulation that you write on the policy side will be met with "yeah, it is too hard" and then plain lying. That's basically how I look at it. Can I respond to that? I would agree: capacity is always a problem. But we are restricting ourselves in this paper primarily to large national state databases. If you are pulling out something like Aadhaar, for example, or a public credit registry, or a national health record system as DISHA had, then the privacy protection demands shoot up. A company doing this with limited data in a special-purpose engagement is different from a national-level mandatory application, like some of those we are seeing these days. The privacy protection demands in those applications require techniques far more sophisticated than what we have usually seen, and some of these are difficult to implement. ZKP may not be, because ZKP has standard libraries for many, many simple applications, but some of those techniques, like an EnclaveDB and so on, are very hard to do. But the capacity will have to be built if you are trying to do something at this scale, which is probably one of the reasons no country has done an electronic health record, no country has done a digital identity system of this kind. When you put them to scrutiny, almost all countries have backed off, except a small country like Estonia; nobody else really has an electronic voting system or an electronic digital identity system and so on. But if you try to build computer science into public life, then you cannot leave the rigor behind; you have to bring the rigor into public life. And the state capacity question will have to be answered somehow. So now I'll ask a different question. I mean, I completely agree with what you're saying. The only thing I'm saying is: look, let's not just stop with the theory. This is probably the first time we have it, so it's okay. But on a broader level, as we keep working on it, we should really get down to the implementation question: prototypes and how-to guides. I think that is really what we are looking for. Yes, and Prashant is not getting his PhD without that, so he will have to implement some of it, because the moment he tries to publish in computer science, the first question he'll face is the question you're asking: where is the implementation? I think there's a challenge, though. While an implementation is feasible (we can do it and we'll have to do it), I am at my wits' end as to how to test it. At what scale? Where do I test such a thing? Just putting it together is probably something we can do, but we need a lot more brainstorming around testing. Even if we do a pilot, how do we test it? How do we populate it? How do we stress-test it? Performance stress testing is one aspect, but privacy stress testing is a completely different matter. Those are open questions. And yeah, there is also the negative-side testing, which is kind of easy, because, I noticed among today's participants, a lot of people were surprised that anonymization doesn't work. It is not a big deal for me, because we've been doing this for a very long time.
But what we would probably have to do is produce a lot of proof for it, because it's fundamental. It can't just be your statement versus my statement versus some other statement saying anonymization works. You need to write out a lot of proofs with public data. Some of it is doable, because the Vahan database is available: if only you know where to look, you can just look at it and basically figure out a lot about people. It's quite possible to take a small dataset like that and prove your point, rather than getting into a debate of opinions. I think on that anonymization question, the Princeton group, especially Arvind Narayanan's group, has demonstrated it beyond all doubt. They have been carrying out privacy attacks on anonymized datasets, they have done all kinds of things, and there is their latest review paper; I have put it out there in the references. I think the Nuscalon et al. 2019 paper pretty much settles the debate. Yeah, yeah. But then, we know how our country is, so we have to do it somehow internally as well. I'm just saying, as a proof point: in my mind the debate on anonymization is settled, but not in a lot of people's minds. And we are at the forefront of the technology edge; it's very unsympathetic of us to expect other people to understand the same things we do. I guess this is that effort, and we will continue to do it. But does anyone else have any more questions, particularly about these techniques or even about the paper? Yes, Siddharth has a question. Siddharth, can you unmute yourself and speak? Yeah, am I audible? Yes. Yeah. So my question was about, I guess, tailoring the data for specific purposes. You've covered the methods already, but I'd appreciate it if you would talk about it a bit more: how you perceive that role, the role of tailoring either the access control or the data itself for specific purposes. Because I imagine this to be a function which not only requires the technical capacity to manipulate or modify the data, but also requires doing that according to how the data is going to be used, which is more on the application side. So I was wondering how you think about that role, or how that would work. So I think you don't tailor the data per se; rather, you specify how the data can be processed, right? And what you need to be concerned with is whether what you are allowing to be processed is legal or not, essentially. And that's where, and we will talk about this more tomorrow, an external regulator comes into the picture. The job of this external regulator is to ensure that whatever is allowed is legal, in the sense that the data being processed is necessary for the legitimate purpose that has been stated. For example, if the state is collecting your Bluetooth tokens for COVID contact tracing, how are they using them? Are they actually just using them for identifying intersections, or are they doing something else? That is more of a policy question and less of an application question. And this is just Malavika here.
I mean, just to quickly add, and maybe this will come up tomorrow as well: I think part of this is why the role of different actors, even within a secure remote execution environment, needs to be tied somewhere into what we all agree on, regarding some of these things that Prashant mentioned, even in terms of the privacy risk assessment, for instance, and the types of data and how we react to them. Hopefully we can talk a little more about that tomorrow. I hope that answers your question. Suhan Mukherjee has a question. Yeah, hi. I think what I wanted to look at, essentially, being a lawyer, is the legal system and how it is architected, against the techniques you're talking about and the architecture on the technology side. From a regulatory perspective, given what has now come out, you know, the Srikrishna report, then the draft bill, and the bill introduced in Parliament, it tends to be that the A.P. Shah approach of principles, architected with a regulator that intervenes in a context-driven framework with bright lines in terms of outcomes, is perhaps the better approach, because the techniques you would apply from a technology perspective would keep changing, improving, being tested, while the principles are what the regulator would be applying or updating. And it also ensures a dynamism in the public policy and regulatory space which matches the changes in technology techniques, and so on. So, any comments on that? Right. Malavika, go first. You're putting me on the spot on this one. I think, I mean, I'm mostly biased, but given the data that such a system will be dealing with, the simple answer is that the principles are something that must be updated from time to time. And the benefit of having a system like this is that those principles, and where we land on, say, biometric data or genetic data and a lot of these questions, is where we have to land by having a conversation together as a society, right? And that's hopefully the role this new regulator will play. But then there is the implementation problem: if the regulator cannot quickly update data processing by everybody in the ecosystem in line with the principles we agreed on, that is where we fall down, and that's what I hope an approach like this will help solve. If we all decide tomorrow that, say, genetic data should not be added to a national unique identification database (and it could very well be a conversation that arises in our society), then the minute the regulator says yes or no, we could quickly translate that into a situation where we actually ensure that no one is doing that outside of a particular enclave. It wouldn't be hard-coding as such, but we could operationalize it. I don't know if that helps answer the question; Prashant and Subhashis can add, and back to you, actually. I think Prashant can add on. Yeah, so I just wanted to add that yes, if you later change the principles with respect to what's legal and what's not, that requires the regulator to take an active, dynamic role in this control of data processing, and that must be part of the architecture; we will talk about this tomorrow. I'll answer it very differently. I'll say that if you look at a data protection standard like the GDPR, or the data protection bill, it is not operational at all.
It is a generic set of guidelines, and while that is fine (it gives you flexibility to bring in different techniques and so on), I think that, without getting into specific techniques, there is a need for operationalization. The need is as follows: if you don't operationalize at all, then any time you try to evaluate something like proportionality, you have to do a balancing. You have a utility argument from somewhere, and you have to balance it against the privacy loss. But without finding the limits of the privacy loss, without characterizing it in more concrete terms (not necessarily quantifying it), all proportionality arguments will run into random outcomes. For example, in the Aadhaar judgment, some judges went this way, some judges went that way. Part of the problem is that when privacy was argued, there was no standard against which to benchmark it. So you have to define some kind of standard, and I think that's where we are coming in. We are not coming in with an implementation right now, but our endeavor is to set a benchmark for privacy: these are the kinds of things you have to do for data manipulation, whether with this technique or that; this is what you have to do for privacy; this is what you have to do for access control. So the endeavor in this paper is to set those standards and define and characterize them a little more crisply than what is available in the data protection act. Yeah, I agree. And that's why I asked the question, because for me, through this conversation as well as whatever other reading I've done, I feel the bill doesn't actually support this sort of approach, which I think would be extremely useful for a country like ours, given that we've been able to benefit from the history of what's happened over the past 20 years or more in other jurisdictions. We've gone and straitjacketed the regulatory architecture using that old design, rather than taking advantage of things like what you all have discussed here and framing it in that manner. I think our space and atomic energy regulation sectors are where we could take a lesson from, because there regulation stayed dynamic with what the folks in the technology sector were doing; regulation matches or follows the tech piece. And I think we haven't done that in the privacy piece in the way it's currently started. So yeah, thank you. Actually, if I can just add a quick thought; sorry, I'll just unmute for a bit, if that's okay. Suhan, yeah, I get your question now. One thing I feel is important to note is that we think this model could actually fit within even the current PDP bill's vision, because Dvara Research had done a policy brief looking at how you would implement what the PDP bill is trying to do in terms of secondary regulation. And it seems that on many of the big-picture items we're talking about here, you know, in relation to security safeguards and so on, the bill says we should have them, but a future DPA will say what they are, kind of thing.
So hopefully, you know, this kind of dynamic approach can work: even though the bill seems to be straitjacketing, setting up a regulator that will do X, Y and Z, the detail of that hasn't been filled out in the statutory legislation, which I think is good. And I do think there's an opportunity, if we have models like this out there, to actually show that regulator what is possible. I'm hoping it will take at least a year or two before they start writing very granular rules, and in that time, I think this is the kind of work that could be done to actually help the regulator pass those codes they are supposed to be passing on many of these subjects, if that helps. I completely agree with what you're saying, in that the approach that you all have taken is fantastic. I'm just concerned that, looking at the way the regulator has been structured and how regulators have tended to work, there may end up being a disconnect with the way you're suggesting things should go. And if you ask me, I do believe you're going down the right path, but the regulator may not be able to appreciate that nuance, or the regulatory system as it is being architected may not be able to achieve that nuance. Firstly, there is the limitation of it being just for personal data and not looking at data in general across the board, because I think we need to look at data in a broader context than just legal definitions of what is personal data. Secondly, if you don't give it a commission-type structure with the powers, it will not have the ability to dynamically dictate, or put out as standards, what you are saying it should, based on what the research shows. But that's just a concern I have, just as a response. I completely agree; you know, we have been a little wild out there. And there are several problems that I see. Like what Anand said about capacity building: capacity is a big problem; the regulatory capacity you're talking about is again a big problem; whether the regulator even sees this as a fit job for a regulator is also a problem. I think our objective out here is to conceptualize privacy protection in certain terms and lay out that, if privacy is indeed on the wish list, then there are certain necessary conditions. I think Prashant tried to state those, and we will come back to it tomorrow. We will also try to argue that there are certain sufficient conditions, right? So there are some things that you will have to do, maybe not in the form we are suggesting, but in some form those things will have to be done if your privacy wish list is to be satisfied. So I think we are more in the idea space here than suggesting a specific solution: we are saying that any architecture you build will have to look something like this, will have to have these ingredients. Yeah. I think you should push, really, to have the Ministry of IT buy into the approach before they push the bill through as it is, because I think your approach will serve us better in the long run. My personal opinion. Okay. I see one last question from Sanjeeva Prasad. I don't know if he has a question or something to say, but please go ahead.
I want to comment that I was sort of interested to hear voices from both the legal world and the industry world talking about proof. I'm a formalist, and I don't think I'm on Prashant's evaluation committee, but one of the things I would like to see him do, and this gets quite interesting, is to use notions of knowledge-based proofs. The zero-knowledge work comes out of a very rich tradition of knowledge-based proofs of correctness of such protocols. So, for example, there were questions about what a hash function is, why can't I use a hash function, and how is it different from encryption, and so on. A lot of these become quite clear when you start formalizing what they achieve in some kind of logic-based formalism. And some of that also ties up very well with the legal world: there are people who have done some nice textbook work on how laws which affect people can be rendered in some kind of computational logic, and these again tie up quite nicely. So I found it interesting that there is this concern. And the question of whether the law really conforms to this, or whether a piece of technology conforms to the law, can possibly be answered going by this very computer-science, logician, nerdy way of thinking. It's just a possibility that intrigues me. So Sanjeeva, are you suggesting a language-based approach? Well, no; one thing is that Prashant hasn't actually talked about language-based security, which is where, in your computational language itself, you start having things like type systems, which say: how can this piece of data be used? How can it be accessed? How can it be passed on? Can I delegate authority? Prashant would have to get into type systems; yeah, that's a different thing. I think it's a rabbit hole which he doesn't need to go down at this stage. But there is this business of whether a piece of information is common knowledge. One of the things, for example, which I always find very difficult to convince people of is that something need not be a secret, but that doesn't mean it should be common knowledge. And this is a very bad leap that technologists make: it's not secret, so I can just publish it. So I say, well, your fingerprints are not secret, but that doesn't mean they should be in a public database for everyone, or your iris scan; it's not clear it should be available to anybody in the country who wants to see it. These are things which I think lawyers will understand quite intuitively. So things like common knowledge, and "I know that you know that he knows that you know" and so on, are quite interesting frameworks. Okay, then: Malavika, you go first tomorrow, right? I think it might be Anubhuti; she's also on at this time. Okay. Yeah, I just wanted to quickly add that I thought that was a very powerful framing. And actually, a few universities in the world have, interestingly, logic-based AI research, where they try to go back to the logic-based underpinnings of legal regulation, and they are using that too. So I entirely agree; I wish we could shift to that kind of framing, because Sanjeeva would absolutely love it.
Because, you know, it's not leagues apart when a discipline has logic as its underpinning. And that's why I think somebody like Amartya Sen does such great work, because he works back from economics down to logic, right? And so you're also able to access his work from the legal side. But I'll stop being stratospheric about it; just, thank you for saying that is what I mean. Okay. On that note, we'll end it here. I'll see you all tomorrow. I'll send you the slides in an email and also send you the link for tomorrow. It's ideally the same link you're on right now, it won't change, but I'll send one more email reminder tomorrow. And I'll see you all tomorrow. Thank you. Thank you. Thank you so much.