I can interpret it any way I want.

Hey Dan, into the mute, sorry. So that did clear everything up. Sorry about that, I needed to cycle. Resources getting contested and fun stuff like that, right?

So, since I haven't had time to bug folks: did everyone already sign up for scribe? No? So if I could get a couple of scribes. We're going to kick off our use case exploration again, and we have a special guest today. Dr. Roy has joined us, and he is here to continue some of the discussion that Mark shared with us. As soon as we get two scribes scribing, we can get started.

I can be one of them. That's Mark.

Great. Thank you, Mark. Okay, I need one more. Anyone else want to join Mark and take notes today? All right. Well, for the sake of time, I'm going to put my name down and join Mark, and allow us to get started.

All right, so before we dive into the use case with Dr. Roy, I'd like to give the opportunity for check-ins from any of the SIGs and working groups. Anything in SIG Auth or the policy working group that anyone has to share?

So Mark, maybe you could introduce Dr. Roy and connect the dots between the discussion that you shared with us and what we have today.

Sure, be glad to do that.
Thank you. The NIST Big Data Working Group has been cranking along since the summer of 2013. This is not a standards body per se. The output is three technical reports, of which we've produced one; another one is in review at NIST. It'll probably come out in the next five or six weeks, maybe before that. We haven't got around to publishing our papers, but we're working on that, aren't we, Arnab?

Arnab and I co-chair the security and privacy subgroup of that Big Data Working Group, and in that role Arnab is the primary guy who covers all the crypto aspects. So while we've worked on the models together, basically hammered out the drafts together, and helped adjudicate the content we got from other third parties, he's really the primary contributor to our backgrounding on blockchain, the crypto aspects of data at rest, and what the role of some of those things might be in some of the emerging big data technologies. So he's really the better of the two of us to present those aspects of it. He also had some experience previously in the cloud security group. Maybe you'll want to mention that, Arnab, when you kick that off. Over to you, buddy.

Thanks, Mark. Can you hear me?

Yeah.

Thank you. Okay, so I really got scared when you said "use cases," because what I'm going to talk about is not so much a use case as an overview of the document. Is that okay?

That's great. For us, we're considering the ingest that we're doing, in understanding what's going on in the ecosystems, as broadly characterized use cases. But yeah, that's fine.

Right. But before I dive in: this is my first time at this meeting, and thanks for inviting me, by the way. Can you give me a very short overview of what this working group does, so that maybe I can tailor my presentation better?

You bet. I'll give you our elevator pitch.
So the SAFE working group exists in the cloud native space. We are a proposed working group for the CNCF. There are very few of those: there's infrastructure and CI and a couple of others, serverless. In the actual CNCF, the overarching cloud native ecosystem, there are very few of these working groups. If you go down into Kubernetes, there is an extensive ecosystem of SIGs, special interest groups, and working groups operating there. What we're focused on in the SAFE working group is safety in this cloud ecosystem, where you have the operator, the administrator, the developer, and the end user. We're trying to build understanding and shared vocabulary around how we make sure that there's secure access and operational safety in place, so that all of those parties in this new cloud ecosystem have a clear sense of what's going on there.

I see. Thanks for that overview. It seems like I'll be preaching to the choir, so much of what I say might seem juvenile to you. As you talked about your group, it seems like many of the metaphors that we use in the Big Data Working Group might just carry over to your group, and we would like to hear, maybe now, maybe later, how the things that we talk about in our documents are relevant to you guys. So I would like to share a presentation on the screen. Let me try to...

Big green button that says "share screen" at the bottom here.

Share screen. Okay. Can you see this?

Good.

Okay, so what I'm going to talk about is the security and privacy of big data. It's a NIST perspective.
That is the perspective of the documents that we produced in our working group, and Mark has contributed a lot to this document.

We started off this working group with a lot of discussion on what big data is. This was back in 2013, and there were very diverse opinions on what constitutes big data. I'm sure the answer is not canonicalized even now, but this is the definition we came up with in our document. It may not be perfect, but it was a consensus. It goes like this: big data consists of extensive data sets, in the characteristics of volume, variety, velocity, and/or variability, that require a scalable architecture for efficient storage, manipulation, and analysis. This definition is in part one of our documents.

Of course security and privacy are important for big data. You all know that, so I don't have to go through the slide. Essentially it says that it's very important because a breach causes damage to company reputation, and that can be evaluated in dollars.

The Big Data Working Group started out with five subgroups: definitions and taxonomies, use cases and requirements, security and privacy, reference architecture, and standards roadmap. The subgroups have kind of spread out from 2013 to now, so we may have more deliverables than subgroups, and the subgroup definitions are a bit nebulous, but the deliverables are what you see on the right, one through seven. Number four is the big data security and privacy document, which I'm going to talk about.

We released our version one three years ago. NIST SP 1500-4 is our document, and it's available on this site that I gave a long link to in the slide. The version two draft, as Mark said, is in the NIST review phase. Public comments were received on September 21st, and the public comments version is also available on the web.
So, given that background, I'll go into some of the characteristics that we identify in the document that are seemingly different for big data compared to what came before. We spent a lot of time understanding what is emergent about the security and privacy of big data, given its principal characteristics. It seemed that there were two aspects. One is due to scaling, which you can attribute to the volume and velocity characteristics of big data, and it has to do with many things that I'll cover in the next slide. The other, more foundational aspect is mixing. This is the notion that a very important characteristic of big data is that you get a huge amount of data from diverse endpoints, and some of that data may not be completely accurate itself. So you get this mixing characteristic, which can be attributed loosely to the variety and veracity characteristics of big data, and that causes emergent problems for security and privacy.

To go into some amount of detail: on the left is what is different due to scaling, on the right what is different due to mixing. The scaling problem can be summarized as: how do you retarget your existing systems given the infrastructural shift caused by big data? The infrastructural shift is due to various things, like distributed computing platforms such as Hadoop, non-relational data stores, etc. So a paradigm shift in infrastructural thinking has required, and is still requiring, new solutions for security and privacy.

The other is the more foundational aspect, the mixing aspect, and here the problem is to control the visibility of data while enabling utility. What is this about? Here the principal questions are: how do you balance privacy and utility?
You get a lot of data, and to be useful all that data needs to be used, but then you also run into these privacy aspects, where you combine different sorts of data about different individuals and you get a bigger picture that may not be apparent from the individual data sets alone. How do you enable analytics and governance on encrypted data? And finally, how do you reconcile authentication and anonymity, which on the face of it seem to be in conflict? These aspects are all described in section two of our document.

We then go into some amount of depth regarding how you characterize the different security and privacy aspects that arise from these principal aspects of big data. There are five V words that were identified: volume, velocity, variety, veracity, and volatility. What I give in this slide and the next one are examples of security and privacy concerns that arise specifically due to each of these characteristics.

For example, the variety characteristic of big data is apparent where traditional encryption schemes render data into a random collection of bits, which hinders organization of data based on semantics.

Then the volume of big data requires that you store it in multi-tier data storage, so there is a lot of back and forth of data between different storage tiers, and all of this communication requires threat models to identify: is the communication secure or not? Is the data being handled properly or not? These are complex and evolving issues.

Then the velocity aspect is the retargeting that I talked about. Data is coming at a very fast pace. How do you retarget traditional security mechanisms to support this?

Veracity has to do with provenance, as Mark talked about last time.
This is keeping track of, and ensuring the integrity of, the ownership, source, and other metadata of individual data. How do you take care of that, given the complex movement of data between nodes, entities, and geographical boundaries?

Volatility of data is another big aspect. Indefinitely persistent data requires evolving S&P considerations, because the ownership may change through mergers and acquisitions and so on. Who takes ownership and responsibility for keeping the data safe? So these were characteristics of big data enforcing new requirements in security and privacy.

Then, in section four, we try to classify security and privacy topics. We have two kinds of classification. One is cross-domain and cross-infrastructure, and tries to look at the type of property that each S&P requirement is. Some properties are privacy properties: you want to keep data secret, safe, confidential. With provenance properties, you want to keep the data accurate, you want to identify who owns the data, and so on. System health has to do with: are there security vulnerabilities in the infrastructure itself? Can somebody exploit them? How do you keep the system healthy? And then some of these have to do with the public policy aspect. These are things like what is right and what is wrong to do with data from a policy point of view.

And then there is an operational classification of S&P topics. This has to do with the particular infrastructure that we have in place today. There are devices, there are identities, and you have to manage access to them. You have to govern the use and access of data. You have to manage infrastructure, and you also have to do risk analysis and accounting for each of these aspects.

Are there any questions so far? Sorry, I just went on.

No, this is really good.

Thanks, man.
Okay, so we covered how the characteristics of big data define new emergent S&P considerations, and we classified S&P concerns for different types of systems. A centerpiece of our working group is a reference architecture, and that becomes especially important for security and privacy. The reason is that security and privacy do not compose.

What do I mean? Let's say we have two systems, system A and system B, and we have completely analyzed them. We have seen what the endpoints of system A are and what the endpoints of system B are. They have data inflows and outflows. We have complete accountability for each of them, and let's say we have guaranteed that they satisfy some security requirements. But when we put system A and system B together, it may suddenly turn out that the security properties are no longer satisfied. That's because there may be APIs in system B which leak data from system A. Together they may have unknown data flow patterns that were not analyzed when they were in isolation. So combined systems can have unexpected data flows; they can destructively interfere.

The point of this is that it's very important to think of S&P from an architectural standpoint: think of the system as a whole rather than in modular parts. It's also important to look at each module individually, but when combining we have to ensure additional properties. So there is a need for architectural thinking, and that's where it becomes important that we refer to this big data reference architecture. Mark might have already talked about this, but this is also described in one of the documents in our working group.
I think number six. It conceptualizes big data systems as these boxes. We have data providers and data consumers. There is an application provider, which sits in the middle and provides different collection and access capabilities. The framework provider is the underlying infrastructure, which gives processing, platforms, and infrastructure, and there is a system orchestrator at the top who is orchestrating all this movement.

And you can see that there is a security and privacy fabric all around the system. What is that meant to signify? It signifies that this fabric is all around the system and you cannot think of it in isolation. We have to think of security and privacy at each of the interfaces between the boxes, as well as internally to each box. That's what we did, at least in a preliminary way, in version one of our document. In section five of the document you can find some of the security aspects that we talked about. For example, at the interface between data provider and application provider, you have to do endpoint input validation. On the other end, going from big data application provider to data consumer, there are some sorts of privacy-preserving data analytics and dissemination. In the framework provider you have the need for key management, securing data storage and transaction logs, and so on. And in section three of our document we also talk about a bunch of...

Sorry, I have a question on architecture before you move on to the next section. It's a bit of a meta question, my apologies. When you were saying "reference architecture," did the group actually go and build this out, or did you lay out some of the architectural definition of what a typical system looks like?

A combination of both. This was a lot of discussion, actually. It consumed a year and a half,
I would say. We started with a lot of existing architectures. There was an architecture from IBM, there were architectures from other places. We actually have a document in our working group that goes through each of these proprietary or public architectures. Then the group sifted through those architectures: what were the principal characteristics that we were looking for? And this is the architecture that evolved out of all that.

So it took a lot of time to evolve, right?

Yeah, it had been evolving even until last year. I don't think we have changed it in the last year, but that's the amount of evolution that it went through.

Got it. And so what did you end up doing in terms of the technical code part of this?

Yeah, what? I did not get your question, sorry.

So the code component: what purpose does that end up serving for your group?

The reference architecture itself? We try to describe everything with respect to the reference architecture, even in the security and privacy document. We try to identify how each of our concepts, each of our classifications, each of the technologies that we identify fits into the reference architecture. That's why it constitutes a linkage piece, an arbitration point, which defines how we go through the document.

Great. The reason I'm asking is that this is an area where we sort of backed away from going down this path, just because taking and coalescing all those things across the cloud ecosystem seemed daunting and possibly impractical. So that's good context: it does take an incredible amount of time to go and capture and distill that down.

Yeah, I understand your point. The Cloud Security Alliance had this huge reference architecture with like 300 boxes, right? But what we opted for... well, one of the reasons is that big data systems are so diverse. It's not as homogeneous an entity as a cloud, right?
When we describe big data systems, big data systems are everywhere. You have healthcare, you have fundamental physics, you have aviation, you have transportation. You have so many use cases, and each of those use cases can identify at least something that may not fit readily into this architecture. That is actually one of the reasons why our reference architecture is so succinct, instead of going into 300 little pieces of detail.

Right, right.

Because it has to homogenize an inherently inhomogeneous collection of use cases, right?

Great. Thank you for sharing.

Sure. Okay, so we collected all these use cases. Many of these are actually from Mark, and he might have talked about some of them to you. But overall there were like five big buckets: retail and marketing, healthcare, cybersecurity, government, and industry.

With that, I would like to dive into some of the cryptographic aspects that we talk about in the document. These are emerging cryptographic technologies, and the recommendation from this document is to be aware of these technologies and to be aware of the risk-benefit analysis of choosing some of these technologies over others.

This table is divided into various facets. I talk about specific cryptographic technologies on the left. These are emergent: some of these are in limited deployment, but most are in the research stage. All of these technologies provide different kinds of features while affording visibility to controlled entities. What do I mean by that? The first example is: how do you outsource computation securely?
An example: suppose you want to send all your sensitive data to the cloud, photos, medical records, and so on. You can send everything encrypted, but the cloud can't help you much after that. You can't find out, for example, how much you spent on movies last month if everything you sent to the cloud was encrypted with your own key. Fully homomorphic encryption is a crypto technology which enables you to do just that. You encrypt your data, and then the cloud can do an analogous computation, called homomorphic computation, which is a transformation of the actual computation. The amazing thing about this is that it only operates on ciphertext. It never has to decrypt the data. All the ciphertexts, all the processed ciphertexts, are random sequences of bits to the cloud. Then the cloud can send you your processed encrypted data, and only you can decrypt it. This is great because only the user can decrypt the processed data. There is end-to-end security: you get to pick your key, and so on.

We can also control visibility, that is, who we give access to, based on encryption technology. This is traditionally done by role-based access control, or some other type of access control, by systems like operating systems. These usually restrict access to data, but the data is still in plain text. In particular, if you hack the system you get access to the data, and when you want to send the data in transit, the security is kind of ad hoc; it varies from system to system. So the question we ask is: can we encrypt it in such a way that we do not have to go through all this, so that decryption is only possible by entities allowed by the policy? This is technologically enforced rather than system enforced. Well, of course you can hack keys, but this is a much smaller attack surface.
A key can be a few kilobytes, and you can have very special protective mechanisms to protect small keys rather than gigabytes of data. And then encrypted data can be moved around as well as kept at rest; the handling is uniform.

Many of you might already know examples of this. The starting point is public key encryption. How public key encryption works is that there is a certificate authority that signs certificates of public keys. Let's say Alice and Bob are trying to communicate. Bob can show his signed certificate of his public key, then Alice can use that public key to encrypt data, and only Bob can decrypt it. This is just plain public key encryption.

Going one level higher, there's something called identity-based encryption. Here the idea is that there is no signed certificate of a public key. You can just use the identity of some person. There is a master public key, just one master public key, and you use that master public key and the identity of the person you want to encrypt to, and that's all you need to encrypt your data. No other person, even using the same master public key, can decrypt your data. So in this scenario Alice can use the master public key and just the identity, maybe the email address, of Bob or George to encrypt the data, and only Bob or George can decrypt their respective ciphertext.

Taking this to the extreme, we have policy-based encryption. Here the policy can be a complex predicate, which is indicated as π here. This is one simple scenario: there is a hospital, and let's say somebody can see a patient's data only if he or she is a doctor or a nurse who also works in the ICU. This is a more complex policy predicate than just identification. What policy-based encryption does is enable an encryptor to encrypt to a policy rather than to some identity. You can encrypt to a policy of your choice.
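[Editor's sketch, not part of the talk.] The hospital policy just described can be written down as a plain predicate over a user's attributes. This is only a minimal illustration of what "encrypting to a policy" checks, not a cryptographic implementation: in a real policy-based scheme the predicate is enforced by the mathematics of the ciphertext and keys, not by an if-statement. The names Attributes, icu_policy, and may_decrypt are hypothetical, and the code assumes one reading of the spoken policy, namely "doctor, or (nurse who works in the ICU)".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attributes:
    """Hypothetical attributes a key authority might certify for one user."""
    role: str  # e.g. "doctor", "nurse", "clerk"
    unit: str  # e.g. "ICU", "cardiology"


def icu_policy(attrs: Attributes) -> bool:
    """Assumed reading of the talk's predicate: doctor OR (nurse AND works in ICU)."""
    is_doctor = attrs.role == "doctor"
    is_icu_nurse = attrs.role == "nurse" and attrs.unit == "ICU"
    return is_doctor or is_icu_nurse


def may_decrypt(policy, attrs: Attributes) -> bool:
    """Stand-in for 'decryption succeeds': true only if the attributes satisfy the policy."""
    return bool(policy(attrs))


if __name__ == "__main__":
    people = (
        Attributes("doctor", "cardiology"),
        Attributes("nurse", "ICU"),
        Attributes("nurse", "pediatrics"),
        Attributes("clerk", "ICU"),
    )
    for person in people:
        print(person, may_decrypt(icu_policy, person))
```

Under the other possible reading, "(doctor or nurse) who also works in the ICU", only the predicate body would change; the point of the sketch is that the policy is an arbitrary boolean function of attributes rather than a single identity.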
The policy can be complex, and then only people who satisfy that policy will be able to decrypt, and nobody else.

Finally, we also talk about blockchain. We avoid the financial aspects of blockchain in this document; we don't know how important that is. But there are many technological aspects of blockchain which can be very useful in the security and privacy space, especially things like asset and ownership management, transaction logging for audit and transparency, bidding for auctions, contract management, and so on.

The high-level recommendations are as follows. Which technology to use among all these cryptographic technologies involves a lot of risk-benefit analysis. We have to consider the sensitivity of the data, the cost of a breach, and the cost of security systems when doing this analysis. I give an example with three different cost-benefit analyses. Let's say we want to run software on encrypted data at rest. There are three possibilities.

First, let's say we just do what is traditionally done, which is decrypt the data in the cloud and run software. Your data is encrypted at rest, but you decrypt it and then just run plain software. What are the pros of that? Very, very fast execution. The problem is that if the server is hacked and the decryption keys leak, all the data is exposed.

The second, better option is to decrypt the data and run the software inside a hardware security module. There are many hardware security modules on the market today; prominent ones are Intel SGX and ARM TrustZone. This is a little slower than just doing computation on plain data, but it's still practical. But there are some problems.
These have not actually been solved yet, and they have to do with side channel attacks. In these attacks you can see the patterns of memory addresses and so on, and you can infer something about something secret.

The final, completely secure solution is to just use fully homomorphic encryption. The pro there is that it's cryptographically secure. There are no side channel attacks. It's secure against all the vulnerabilities of the last solution. It works even if the server is completely breached. But the disadvantage is that it's very slow at this point, except for limited operations.

So just to conclude, there are four things that I want you to take away. Think of security and privacy at the time of architecting the overall system, not as an afterthought, which is the way many systems are designed today, unfortunately. In security and privacy, systems do not compose, so you have to reanalyze security and privacy when you add new features or join new systems. There's a lot of cryptography that is emergent. You just have to stay tuned and be patient at this point, but it will enable many remarkable operations in the future. And finally, it's very important to read the document, and we hope that you have feedback for us that's useful. We should be able to document that. Thank you.

Great. Thank you, Dr. Roy. I'd like to open the floor to any questions. And a request, either to yourself, Dr. Roy, or to Mark: it'd be fantastic if in the meeting notes we could link not only to Dr. Roy's presentation but also to the documents. I believe, Mark, you touched on some of these in the issues in our GitHub repo. But for those that are following along, if we can point them to a way to go deeper on this, that'd be fantastic.

Yeah, I have a question about combination, and the way that changes the access control, right?
We see both: by combination you can deanonymize data, so data that was previously anonymized, and that maybe didn't need strong access control, suddenly needs stronger access control once you combine it. And also the reverse, where you have data that gets aggregated, so the access control doesn't have to be as strict, right? Was there any thought on that in your...

Right. We touch on this in the variety aspect of big data requiring new thoughts on architecting secure systems. You bring up a very good point, where architectural thinking is very, very necessary, not only at the level of a single organization but across the whole of what is going on throughout the internet. Because, as you said, you can aggregate data from various endpoints and suddenly you have a much clearer picture of sensitive data than before. It's unclear at this point how you can... So, technologies like differential privacy have a privacy budget, which means that you always leak some amount of information even if you aggregate, and if you do that too many times, the privacy budget is exhausted, which means that over time you get a clearer and clearer, more and more accurate picture of the sensitive data. So this is kind of inevitable. Other than completely restricting access to the data, it is not clear how to stop this leak of information.

Where does anonymity... In order to guarantee some degree of privacy, we tend to lean towards systems that identify the players. You mentioned anonymity at some point. How are you dealing with anonymity? Are you always trying to solve for anonymity?

For our document, we just describe what the problem is. It's not a solution document. But in the research community there are technical aspects too.
I talked about how you reconcile authentication and anonymity, right? That is a technical question that the research community has been looking at. There are primitives called group signatures, for example. What does "group signature" mean? It means that you have a group of people, and anybody can sign a message, but you won't know who signed it. You can still authenticate that person as a member of the group, but you will not know, beyond the group structure, who that person or entity is.

You could say, just give them all the same signing key, right? But that is not desirable, because later on there might be an arbitration process where you want some amount of non-repudiation. You want to hold that person responsible if a legal case comes up, for example. That's why this kind of primitive is far more sophisticated than just giving out the same signing key to everybody. The system in fact designates a trusted arbiter who has some more information, so that he can look at the signature and identify who signed it. But without going through this arbiter, nobody can find out who signed. So that is one of the technologies that addresses reconciling authentication and anonymity. And you can think of it in an IoT context as well. There are different IoT devices, and you don't want to specifically pinpoint which device something came from. Maybe that's very personal. But if there is a break-glass scenario, you want to know.

Mm-hmm. Yeah, the concept of a trusted arbiter, I think, will come in handy as we model things out.

Yeah, one of the approaches that came up... and as Arnab said, we don't really get very prescriptive. But we talk about trying to treat PII, and what PII is varies depending on the domain. It could be a floating point number, depending on the scenario, right?
But if you have a domain that you can consult to understand the meaning of the thing, you might want to tag that data throughout a system, and that includes when you federate the data. So the persistence of... some people call this metadata, but really it's just carrying other data along with it in some kind of structured framework, so that you can do traceability and provenance, so you can understand when it's been violated. That's kind of the fundamental principle in doing PCI compliance or being HIPAA compliant, which is something most of the big companies we're in have to do on a regular basis. But the problem is there for everybody, really, because if you think of PII as just an instance of really, really important data in some domain, then that's an issue we all face at some level. So from a security point of view, you want to know that you can expose where that data has been used, if you need to, and who's touched it, and authenticate the people who've done the touching. And that includes machines. That's why, I don't know if you were there, Dan, when you were trying to get booted up, I was touching on this issue of authenticating these low-cost smart home devices.

No.

It's an interesting use case, and stop me if I already mentioned it to this group. Anybody on this call, we have them at home already. I've got Alexa, and the smart switches tied to it. Yeah.

So the smart switches are mostly going to cloud services overseas, written by who knows who. In fact, the error messages that come back are in Chinese. Usually you see the Chinese stuff at the top, and then... So this is an interesting problem. We have a cloud service from Amazon doing the driving, a local IoT device, i.e.
Alexa, on your home network, probably on a single segment, collecting data for Amazon but going out to these other cloud services to direct traffic out to these devices. And if you blow this up into a neighborhood or utility scenario, it's an interesting problem, which is part of the rationale why we're glad, in retrospect, that we stayed away from the more expansive cloud-specific model, because this is more realistic, I think: multiple clouds, multiple entities, multiple developer communities even. So it's more of a case study than a use case, I guess, but it certainly is a realistic one, at least in my house.

Absolutely. And you can drop a 3G chip in there and easily backchannel to some other data source.

Good. Well, let's keep on topic. I want to give everybody a time check: we've got five minutes here to wrap up. Any other questions for Dr. Roy?

All right, so thank you, Dr. Roy, for sharing that. This has been insightful. I look forward to integrating and capturing these in our notes. And Mark, I added to my rolling agenda a check-in from the NIST big data working group. If there's nothing to report, please feel free to ignore it, but I would love to have you share at the beginning of our meetings any contacts or any information that this group would find relevant. I really appreciate the perspectives that you're bringing.

Sure. Let me do that, since you invited me, and I'll make it short in light of our time.
I introduced this in that group, and this was me dominating the last conversation we had: we were trying to understand how to do traceability for ethical requirements that are put out in organizations. It's a big data problem because often these things are authored by people outside the organization, or inside it, who the developers are not connected to. So to some extent it's a traceability challenge. It's also a problem of where the natural language artifacts belong, and what you do with them in the architecture of the systems you're building. So whether or not it's a cloud native issue, it's certainly one that we're wrestling with. And if you think about some of the uses for algorithms that are being contemplated or have already been deployed, sooner or later we're all going to be in the position of having to explain algorithms, and why they are recommending one thing or another, to users. So we're trying to figure out what the implications of all that are and whether we can make any contributions.

Nice. Yeah, that's a big area that I'm happy to hear you're trying to get out in front of, because not a lot of folks are getting out in front of that, and it's barreling forward.

I'm afraid I'm going to be in the position of that poor Volkswagen engineer who ended up getting blamed for it.

Right, exactly. That's a great perspective on how this ends up playing out, and on the individuals who take the real hit for bigger decisions like that.

So, coming up, I've got Jerry, who unfortunately couldn't join us today (she had a sick kid to take care of). She is going to be joining us for an overview of some of the security infrastructure that she's been working on at CyberArk that overlaps the Kubernetes and Cloud Foundry deployments of cloud native infrastructure.
So, looking forward to that. I think next week we'll have ADP lined up, and then on June 1st I am cancelling the meeting. I'm going to be on the road, in Berlin. So in a couple of weeks we'll give you a Friday off to enjoy Friday things.

All right. Thanks, everybody. Thanks for joining us. See you next week.

Thanks, everybody. Thank you. Bye. Thank you.