Good morning, good evening, good afternoon. I am Michelle DePama, Red Hat Principal Solutions Architect with the data services team, and this is the data services office hour. I have the pleasure of welcoming Mr. Kyle Bader today. Hello, Kyle. How are you? Hello, Michelle. I'm Kyle Bader. I work in our hybrid platforms organization, and I've been at Red Hat a while. I'm a data foundation architect, so I've covered our technologies like Ceph, OpenShift Data Foundation, and Red Hat storage, but I've been working with the Red Hat OpenShift Data Science team for a long time now, so I thought I'd come on and have fun during the office hour. That's right. So talk to me about what you want to present today and what we're going to dive into. Yeah, I want to talk a little bit about data lake security. So, you know, if you're using Red Hat OpenShift Data Science, the two platforms it's available on are managed OpenShift, by way of OpenShift Dedicated and Red Hat OpenShift Service on AWS. If we accidentally use the shorthand acronyms ROSA or OSD, don't be thrown, because we use them a lot, but I'll try to use the full names only, so there's no confusion. Hard to remember to do it, though. Just so you know, I am getting a little bit of feedback; I'm hearing a little bit of an echo right now, but you're talking nice and slowly, so I think it's manageable. I don't know if you hear it, but I hear it a little bit. Just wanted to let our viewers know. But go ahead; you were talking at a good clip. All right, so I'm going to go ahead and share here. Okay, your screen is presented. Go ahead. Okay, good. So yeah, data lake security. I wanted to go over the options that are available to you if you're setting up a data lake in Amazon S3 or some sort of S3 equivalent, and, on top of the security capabilities of Amazon S3, using something like Starburst, which is one of our initial partners for Red Hat OpenShift Data Science. They offer a massively parallel processing query engine based on upstream Trino, which originally goes back to the Presto project. The Presto project split into two flavors: one is now Trino, the other is PrestoDB. It's a really great tool to query a data lake; it's used internally by the Amazon Athena service, and there are a lot of people who use it upstream by way of EMR, you know, spinning up an EMR cluster. It's a pretty handy query engine. So it all starts with authentication. There's an almost confusing number of different ways that you can do authentication and access control in Amazon S3, which is perhaps unsurprising given that S3 is the oldest service in Amazon Web Services. I think it originally started in, what was it, 2006 or something? Wow. So that was over a decade ago. That's amazing. It's hard to believe it's been that long. So the initial thing that was available was just static credentials, and a lot of people still use static credentials for a lot of things. Static credentials come as a set, which includes an access key ID and a secret access key, and that secret key, or the combination of those two things, is used to sign each HTTP request that's sent to the object storage endpoint. That's all fine and well, and it works; there are a number of tools that support using static credentials.
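To make that concrete, here is a minimal sketch of the static-credential pattern with boto3; the key values, bucket name, and object key are all placeholders:

```python
# Minimal sketch: static credentials with boto3 (placeholder values).
# Every request the client makes is signed with this access key / secret key pair.
import boto3

s3 = boto3.client(
    "s3",
    aws_access_key_id="AKIAEXAMPLE",            # access key ID (placeholder)
    aws_secret_access_key="wJalrEXAMPLEKEY",    # secret access key (placeholder)
    # endpoint_url can point at any S3-compatible endpoint, not just AWS.
)

s3.put_object(Bucket="my-data-lake", Key="raw/events.json", Body=b"{}")
```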
So whether you're writing something in Boto, you know, just from a notebook, or if you're using something like the S3A file system client that comes by way of Hadoop Common, say you're using PySpark from your notebook, you can, of course, configure credentials. Or if you're using Trino, you can use the Hive connector to interact with data stored in S3, and you can plug in static credentials there. That's all fine and well. The thing about static credentials, though, is that it's probably not the cleanest thing to do, in that there's no built-in mechanism to rotate them. In the case of Kubernetes, neither of the options that you have for those credentials is really fantastic, right? You can squirrel them away in a Kubernetes secret, and then you can expose them to the pod by way of an environment variable or something, but you still have the management headache. Yeah, they're just static, right? So it's kind of not best practice. So the next mechanism that you get is tokens. You can do token authentication using Amazon's Security Token Service, STS. One way you can do that is to take your static credentials and use them to sign a request to the Security Token Service API and say, give me a temporary token. What you get back is an access key, a secret key, and a session token, and you can use those to sign requests. So if you have an application, you can provide those; like, you spawn a notebook and pass them in as environment variables, since in the spawner there's a way to add an arbitrary list of environment variables that get injected into your notebook. But that's not really ideal, because the secure tokens expire, and you'd still have to sign the STS request with static credentials. So it's not really ideal for notebooks, and it's perhaps even less adequate if you have something that you want to be running in the background, because, I forget what the default expiration time is, it's like 6 or 12 hours. So if the token expires, that sort of service doesn't really have a means to generate a new token on its own through the SDK. Now, if you're running your application in just EC2, you can assign a role to the instance using IAM. And then if you're running any sort of application inside of that EC2 instance and you set it to use an assumed role, what the SDK actually does is reach out to a metadata service in EC2 that's at a well-known IP address. So it'll talk to that metadata service, and the metadata service will send a response, and in that response it'll provide a key. And the SDK is actually smart: it knows, okay, my key is about to expire, I'm going to interrogate the metadata service and get a new key, and then it'll use the new key to sign requests going forward. So that's a nice way of going about it, but it kind of breaks down in Kubernetes, because of the way the networking works, your pod often cannot, unless you're using host networking or something like that, interact with that metadata server.
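A minimal sketch of that token exchange, assuming boto3 and a configured set of static credentials to sign the STS call:

```python
# Minimal sketch: trading static credentials for temporary ones via STS.
# GetSessionToken returns an access key, secret key, and session token
# that expire (43200 seconds = 12 hours here); nothing renews them for you.
import boto3

sts = boto3.client("sts")  # signed with whatever static credentials are configured
resp = sts.get_session_token(DurationSeconds=43200)
creds = resp["Credentials"]

s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],  # the third element of the set
)
```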
And so if you're running an application in a pod, even if that pod is running on a node, an EC2 instance, that has the ability to assume a role, the pod actually can't talk to that metadata server to get keys. Okay. Yeah, so there were some people experimenting with, there was a tool called kube2iam, which is kind of a proxy service that would run as a daemon set on every single host in your cluster. And if a pod tried to talk to the metadata service, it would basically get there by proxying through this kube2iam daemon. And that's kind of neat, because then you get the ability to rotate your credentials, but all applications that were running on that node would basically be able to assume the same role, which could be less desirable, right? You'd end up in a situation where you had different sets of nodes that were able to assume different roles, and you only permitted certain namespaces to run pods on certain nodes. You could get it set up so that you could have different access control for different pods running on different nodes, but it's a really obtuse way of doing it. You would have to use node taints and tolerations and, I don't know, it'd be messy. Okay, so you're saying that to get a real level of granularity, down to the pod, in that scenario is just cumbersome? Yeah, right, because the node has the ability to assume the role, not the pod, and the node is really just somewhere a pod can run. What you more ideally want is the ability to have a pod assume a role, right? You can say, and this is the next slide where we talk about it, you can say that this role is able to access this stuff. And so you really want to be able to say, this pod, because it's in its namespace, is able to access this stuff, but these other pods can't, right? You don't really want to do it at node granularity. So what the clever folks over at Amazon worked up was, well, they already had this mechanism called AssumeRoleWithWebIdentity, where instead of using your static credentials to generate a temporary token, you could use an attestation from an OpenID provider. Which is cool, because OpenShift also supports OpenID providers to authenticate to the Kubernetes API, or when you log into the console, so you can have a common OpenID provider that's handling access control for Kubernetes itself and also access control for your data lake. But it was kind of a thing where the pod would have to do that token dance itself, and in order to have the pod be able to run in the background, like a service, you need a little bit more, and that's where pod identity comes in. So with pod identity in Red Hat OpenShift Service on AWS, if you're provisioning a cluster with the ROSA CLI, you do rosa create cluster, and then it prompts you for a bunch of things, or you pass a bunch of flags; it basically walks you through the prompts, and it'll ask you whether you want to set up a cluster with STS, the Security Token Service.
What that actually does is set up a pod identity webhook, so that when a pod runs, you can basically configure service accounts in different namespaces, and applications run under those service accounts, and there's something fancy behind the scenes that interacts with your OpenID provider, gets a JSON web token, and projects that JSON web token into the pod. And that JSON web token can be used with the STS API. So it's this crazy chain of crypto stuff: it takes the web token and uses it to sign a request to the STS endpoint, the STS endpoint issues an STS token, and that token can be used to sign requests to the object store. It sounds complicated, but from your application's perspective it's super clean, because you just say, run this application under this service account, and the SDK in the application does everything for you, right? You don't need to set environment variables and secrets, and then unwrap those secrets into notebook or Python variables or anything like that. It's all transparent. So do you have a demo of that actually working? Not for today, but potentially in the future; yeah, I can certainly do that, even if I don't have it right now. But it is super slick, you'll have to take my word. On the other side of things, you need to do authentication as well, just like in JupyterHub, where you have pluggable authentication; at least in the case of Red Hat OpenShift Data Science, you have the Jupyter notebook spawner, and you authenticate with the spawner so that it can map an identity to the notebook that's being spawned. Similarly, if you deploy Starburst, you're going to want users to authenticate when they're sending queries to the coordinator, which is the piece that dispatches the query across the Trino cluster. Or if you're using a visual tool to interact with Trino, like Superset, you can set up Trino user impersonation: you basically authenticate when you log into the Superset UI to dispatch queries, and it actually does the authentication with Trino, so it passes the identity through. The coolest thing is that Trino also supports OAuth 2, so you can authenticate to it with an OpenID Connect provider. So you can have your access to your data lake, your access to the query engine, and the access control that's tied into the JupyterHub notebook spawner and the access control for Kubernetes all tie back to a common OpenID Connect provider, which is kind of ideal, right? You don't want to manage different users in different systems; if Tom, Dick, or Harry leaves the organization, or takes a different role within the organization, you don't want to have to remove their credentials from a bunch of different places; you want to do it centrally, in one OpenID Connect provider.
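To ground that flow, here is a hedged sketch of the AssumeRoleWithWebIdentity exchange the SDK performs. In a real pod identity setup the SDK does this automatically from injected environment variables (AWS_ROLE_ARN and a token file path); the role ARN and mount path below are placeholders:

```python
# Minimal sketch of the exchange performed under a pod identity webhook,
# spelled out manually for illustration. The call is unsigned; the
# projected JSON web token is the attestation.
import boto3

# The JSON web token projected into the pod by the webhook.
with open("/var/run/secrets/tokens/sa-token") as f:  # hypothetical mount path
    web_token = f.read()

sts = boto3.client("sts")
resp = sts.assume_role_with_web_identity(
    RoleArn="arn:aws:iam::123456789012:role/notebook-role",  # placeholder
    RoleSessionName="notebook-session",
    WebIdentityToken=web_token,
)
creds = resp["Credentials"]  # temporary AccessKeyId / SecretAccessKey / SessionToken
```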
And that way you have all your different aspects of data processing and data science utilities covered. That makes sense. So then we move down the line to actual access control, right? Authentication is saying that, you know, Michelle is in fact Michelle, or Kyle is Kyle. And beyond that, you need some sort of definition of what those people are allowed to do. So part of authentication is also any roles that map to that particular principal. And when they say principal, what they mean is: a service account is a principal, a person is a principal; it's kind of a general term in access control parlance. So earlier I was talking about how there's a myriad of different ways of doing authentication with S3. Similarly, because it's grown organically, it has several different means of access control too. In the earliest days, it had just basic ACLs. That was, like you said, public access: you can set up a bucket to be public, or a certain prefix to be public, or you can set it to allow anonymous access. So there was some coarse access control type stuff you could do with the earliest S3. But over the years it started to get more sophisticated, and the next step in the evolution of access control was the introduction of bucket policy, which is pretty powerful. It means that the access control is actually defined at the point where the data lives, right? You define the policy with the bucket. Which is kind of nice: when people talk about this newer idea of a data mesh, the point is that instead of trying to have some sort of centralized control of policy, you define the policy with the data, and that way the people creating the data, who usually have the best idea about the limits that need to be placed on it, like where it originates from, are the ones defining it. And there's, you know, country-of-origin type stuff for GDPR. That's right, the data protection regulation out of Europe that says if data originates in Europe, it has to stay there, and there are certain rules about deleting stuff. There are similar requirements for information about people from California, with that legislation. So there's this growing need to basically govern data based on the person the data is about, and all kinds of other things. But the main point is that the people creating the data have the best idea of what limits should be placed on it, so having the policy live with the data is pretty powerful. And bucket policy is pretty rich. You can say that these named principals, or these named roles, are able to access the entire bucket, or you can limit the API operations they're able to make against that bucket, or against prefixes in the bucket, or against objects in the bucket with particular labels.
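As a minimal sketch of a prefix-scoped bucket policy applied with boto3, with placeholder role ARN, account ID, and warehouse layout:

```python
# Minimal sketch: a bucket policy granting one role read access to a
# single Hive-style table prefix. All names are placeholders.
import json
import boto3

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "AnalystsReadCustomersTable",
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::123456789012:role/analyst"},  # placeholder
        "Action": ["s3:GetObject"],
        # Only objects under this warehouse/database/table prefix.
        "Resource": "arn:aws:s3:::my-data-lake/warehouse/sales/customers/*",
    }],
}

boto3.client("s3").put_bucket_policy(
    Bucket="my-data-lake", Policy=json.dumps(policy)
)
```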
So you can say something like: nobody can read from this bucket unless the object has a general-classification label. By default, objects written in don't have the label, and there's something running in the background that's looking at the data to make sure it doesn't have any PII or something, and once that thing verifies it doesn't have PII, it adds a label to the object, and then all the people in the organization are able to access it. You can do something like that. IAM policy is kind of the other way around, the opposite perspective, and you usually want it when you're also using other Amazon services and you need something that transcends just the data, so it adds a little bit more capability there. But in a lot of cases you can get away with just using bucket policy in combination with principals that are based on roles. So if you have the identity side set up to do assume-role type stuff, then you're able to say, these roles are able to access this data, by label or by prefix. And if you look at how most data warehouses are organized, say you wrote a bunch of data with Hive or something, prefix-based policy is pretty powerful, because the layout of the data in the object store, or in something like HDFS, would be like /warehouse/database/table, and then underneath there are pseudo-subdirectories that hold the different partitions for a table. So prefix alone lets you control who has access to a particular database, or particular tables, or even particular partitions. That's pretty rich, right? If you've used an object store for a data lake type thing before, that's what you get; it's what you'd get if you configured Hive to push access control down to HDFS. So that puts you on parity with HDFS in terms of access control. Actually, it's a little bit richer, because you can also do attribute-based access control with labels on objects and such, so you can do some fancy stuff there. What you can't do, though, is any sort of column-level security, because the granularity is the whole object, right? You can say that Michelle can perform these particular requests on the object, completely, or not at all. But if you're writing data in, you know, CSV or Parquet or JSON, you can't really say that within the object you only have access to these columns. To get that, you use something like the Ranger plug-in for Trino or Presto, where you can say, okay, Michelle doesn't have access to the whole object, but if she runs a query through Trino, it only accesses the columns that she's permitted. So if you have Parquet files representing a table stored in S3, she can't access them directly with S3, because she's coming from something that isn't going through Ranger, but she can go through Trino, and Trino checks with Ranger, and the Ranger policy says that she can access the customer name column in her query, but not the credit card number column.
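A hedged sketch of that tag-gated pattern: the background scanner tags the object once it's verified, and a policy statement keyed on the tag, which would slot into a bucket policy like the one above, opens it up. All names are placeholders:

```python
# Minimal sketch of tag-based access control. Names are placeholders.
import boto3

s3 = boto3.client("s3")

# The scanner stamps the object after verifying it is PII-free.
s3.put_object_tagging(
    Bucket="my-data-lake",
    Key="raw/events.json",
    Tagging={"TagSet": [{"Key": "classification", "Value": "general"}]},
)

# Policy statement keyed on the object's tag rather than its prefix.
statement = {
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::123456789012:role/everyone"},  # placeholder
    "Action": "s3:GetObject",
    "Resource": "arn:aws:s3:::my-data-lake/*",
    "Condition": {
        "StringEquals": {"s3:ExistingObjectTag/classification": "general"}
    },
}
```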
And so if you need column-level granularity for access control, that has to be enforced at some sort of policy enforcement point, like the query engine itself. I have a visual, an example, coming up. Sorry, hold that thought. The echo is still present; someone in the chat just let us know. Do you see a settings bar or anything, do you think there's anything else you could do on your side? I know I have a settings toggle where I can make sure echo suppression is on. Oh, let's see. There isn't actually very much to toggle, but I have, let's see, noise control. Is this any better? No? And all your kids are sleeping, right? Everybody's... it's making things a little choppy. Hopefully, unless I've gotten too excited here. All right, let's see if we can continue; just thought there might be something. Yeah, I'm sorry, I don't really know what it is. It goes in and out; like right there, that was fine. That was totally fine. No worries. Okay. All right. Sorry, you were saying; you held a thought. Oh, yeah. So I was diving into how you could have more granular access control than just object level, if you wanted access control for just particular columns, or if you wanted to mask particular columns. So if you wanted to mask a column of social security numbers or something like that, you would have to do that with some sort of policy enforcement point, which could be anything, but oftentimes it's a query engine that has some sort of plug-in, something with richer access control semantics that take columnar or tabular data into account, as opposed to objects. Okay. And the next thing is, of course, a hot topic: encryption. On the object storage side of things, the SDK is capable of doing client-side encryption and decryption; you have to manage your own encryption keys in that case. You can also have the storage system do the encryption for you, and that's server-side encryption, and there are three different varieties, based on how the keys are managed. First, you can manage the keys yourself; that's SSE-C, customer-provided keys. Whenever you do a request, you say, S3, either encrypt or decrypt with this object key, and the key is included in the request, which is why S3 enforces, makes mandatory, the use of HTTPS there, for obvious reasons. If it's a put request, writing an object, it'll encrypt the body of the request and store it. If it's a get request, a read, it'll pull the bits from the storage system, use the key to decrypt, and send the decrypted payload to the client over HTTPS; the channel is an encrypted tunnel, even though the bits at rest were encrypted with your key. The other options are, if you don't want to deal with keys yourself, you can have the keys stored in the storage system; that's called SSE-S3, S3-managed encryption keys. Or you can do SSE-KMS, where instead of the encryption keys being stored in the storage system, it coordinates the storage of keys with Amazon KMS. And Amazon KMS can put keys in hardware security modules, basically, so that's really dense security.
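A minimal sketch of the three server-side encryption flavors on a put with boto3; the bucket, object keys, and KMS key alias are placeholders:

```python
# Minimal sketch: the three server-side encryption options on a put.
import boto3

s3 = boto3.client("s3")

# SSE-S3: keys generated and stored by the storage system.
s3.put_object(Bucket="my-data-lake", Key="a.parquet", Body=b"...",
              ServerSideEncryption="AES256")

# SSE-KMS: keys coordinated through Amazon KMS (HSM-backed).
s3.put_object(Bucket="my-data-lake", Key="b.parquet", Body=b"...",
              ServerSideEncryption="aws:kms",
              SSEKMSKeyId="alias/data-lake-key")  # placeholder alias

# SSE-C: you supply the key on every request; S3 requires HTTPS for this.
s3.put_object(Bucket="my-data-lake", Key="c.parquet", Body=b"...",
              SSECustomerAlgorithm="AES256",
              SSECustomerKey=b"0" * 32)  # 256-bit key, placeholder
```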
But I mean, with those you're still letting someone else hold the keys, right? Obviously, if you have real separation requirements, it's always best if you encrypt your things and decrypt them yourself, separate from whatever service; but that's the hardest thing about cryptography, key management. So you're taking on the hardest, most burdensome part of cryptography if you do that. Now, again, just like the access control, all of the object storage encryption is whole-object granularity. That means the entire object is encrypted using the same key. So if you wanted to control decryption of just particular parts of the object, the S3 API doesn't give you anything fine-grained enough to do what you want. Which is where something like Trino comes in: Trino can either use no encryption, or use the S3 server-side encryption, or there's actually encryption available in the Parquet serializer-deserializer, Parquet modular encryption, which can use different encryption keys for different columns, which is pretty cool. So you can limit access control to particular columns, but you can also encrypt different columns with different keys. I haven't played with that too much with Trino, but it exists; the data format itself knows how to do, in many of its serializer-deserializer implementations, encryption of just particular columns. So all of the data pages that correspond with a particular column will be encrypted with a per-column key, which is pretty fancy. And then there are some other things; actually, there's a mistake here on the slide, so we'll fix it live. So there are also a few other management features that are handy in object storage, and some of them have a security element. So, I don't know, probably six or seven years ago, Netflix did, for a while, almost every year at the Strata conference, talk about their best practices, because they're all in on S3 as a data warehouse. I think about four years ago at Strata they said they were at 60 petabytes or something in Amazon S3, something insanely huge. I'm sure they're probably at, what, 150 petabytes now? I'm just guessing, right? That was four or five years ago; they've probably at least doubled it. So the versioning thing is a good protect-people-from-shooting-themselves-in-the-foot kind of feature. Object versioning is like a super snapshot, right? If you have any sort of storage background, there's this idea of taking snapshots of things: with a file system, you take a snapshot and say, every hour I want a snapshot, and if something happens, I can go back to this hour's snapshot or whatever. Well, versioning is actually way better than that, because it keeps a version of every change to every object, unless you tell it to drop old versions. So it's not time-based at all; it's based on modification.
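Stepping back to the Parquet modular encryption mentioned above, here is a hedged sketch with PyArrow (4.0+); the toy KMS client only illustrates the wrap/unwrap hooks and is not real key management, and all key names and values are placeholders:

```python
# Hedged sketch of Parquet modular encryption: different columns
# encrypted under different keys. Do not use the toy KMS outside a demo.
import base64
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.parquet.encryption as pe

class ToyKmsClient(pe.KmsClient):
    """Demo stand-in for a real KMS."""
    def __init__(self, config):
        pe.KmsClient.__init__(self)
        self.master_keys = config.custom_kms_conf  # name -> 16-char key string

    def wrap_key(self, key_bytes, master_key_identifier):
        # A real KMS would encrypt key_bytes under the master key.
        master = self.master_keys[master_key_identifier].encode("utf-8")
        return base64.b64encode(master + key_bytes).decode("utf-8")

    def unwrap_key(self, wrapped_key, master_key_identifier):
        raw = base64.b64decode(wrapped_key)
        return raw[16:]  # strip the 16-byte master key prefix

kms_config = pe.KmsConnectionConfig(
    custom_kms_conf={"footer_key": "F" * 16, "card_key": "C" * 16}
)
crypto = pe.CryptoFactory(lambda config: ToyKmsClient(config))
enc_config = pe.EncryptionConfiguration(
    footer_key="footer_key",
    column_keys={"card_key": ["credit_card"]},  # per-column key assignment
)
props = crypto.file_encryption_properties(kms_config, enc_config)

table = pa.table({"name": ["alice"], "credit_card": ["4111-0000-0000-0000"]})
with pq.ParquetWriter("enc.parquet", table.schema,
                      encryption_properties=props) as writer:
    writer.write_table(table)
```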
So if you write to key ABC in bucket foo, and then you write to that key again, there's a history of all the different versions of that object, unless you delete some of them. That's even great for, you know, a developer writing into a bucket who accidentally corrupts data, right? Stuff happens; we've certainly made mistakes, and we'll probably make mistakes in the future, and having features like this helps prevent me from shooting myself in the foot, and the same for others. Now, you can delete older versions of objects. So if you really shot yourself in the foot and deleted all the versions of an object when you didn't mean to, that would be bad, and you wouldn't have any recourse. But you are able to configure something called MFA delete, multi-factor delete, on a bucket configured for versioning. If you set up multi-factor delete, then in order to delete versions of objects, you have to use multi-factor authentication with the request, doing the whole dance with Google Authenticator or something like it as part of signing the request. If you get through all that and still delete it, well, that's on you, I guess, but you have this defense in depth: you have the versioning, and then you make it so you can't delete the versions without multi-factor authentication. That's pretty safe. Other useful things are expiration and transition. So if you have some data that needs to expire after 30 days or something, instead of having to go back and keep track of how old stuff is, you can configure a lifecycle rule and say, okay, after 30 days, or after 60 days, I want you to delete this data. It's actually really handy for things like Hive, which creates temporary tables, and the prefixes for the keys of temporary tables look a certain way. So you can say, anything that looks like a Hive temporary table is probably orphaned, and instead of having to go back and remember to clean that stuff up, you can purge it from the system after 7 days, 14 days, whatever. The same goes for any sort of data that's going to go stale and that you really want deleted after 30 days, or that you need to delete after a time period for some sort of regulatory purpose: you can configure your storage system, by way of a lifecycle policy, to automatically delete the data after a certain time. There's also transition, which will move the data to a different storage class, something colder. Object lock is being able to do WORM-type functionality, write once, read many. So that's if you have to comply with, say, FINRA regulations, because you're a bank and need to be able to write data, seal it, and then not be able to tamper with it for a specified period of time.
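A minimal sketch of versioning plus lifecycle rules with boto3, assuming placeholder bucket and prefix layout (MFA delete, discussed above, is set through the same versioning configuration but additionally requires an MFA device serial and code):

```python
# Minimal sketch: enable versioning and add two lifecycle rules, one that
# expires presumed-orphaned Hive temp-table prefixes, one that tiers old
# raw data down to a colder storage class.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_versioning(
    Bucket="my-data-lake",
    VersioningConfiguration={"Status": "Enabled"},
)

s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",
    LifecycleConfiguration={"Rules": [
        {   # purge temp-table scratch space after two weeks
            "ID": "expire-hive-scratch",
            "Filter": {"Prefix": "warehouse/tmp/"},  # placeholder layout
            "Status": "Enabled",
            "Expiration": {"Days": 14},
        },
        {   # move stale raw data somewhere colder after 90 days
            "ID": "tier-down-raw",
            "Filter": {"Prefix": "raw/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
        },
    ]},
)
```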
And oftentimes, I don't know if it's a FINRA thing specifically, but most WORM implementations also have this idea of a litigation hold. When you set the object lock, you basically say, this object can't be touched for this period of time, right? 30 days, or a year, or five years, or whatever. But in some cases you want to stop that timer, because there's some sort of lawsuit, right? There's some sort of litigation, and so there's this idea of a litigation hold, where you might want to pause the timer. So that's a thing too. And then finally, there are object tags. I referred to this earlier, but object tags are really powerful when combined with access control, because instead of doing access control based on a bucket, or based on a prefix within a bucket, you can do it by tag. So if you have things dumping data into your data lake or whatever, you can have some sort of automated process, and we've done some cool demos before where you set up bucket notifications, and those trigger some serverless stuff in Knative, and then for each object, it does some sort of inference against it. I think we did an x-ray one, right? It looked at the image and produced a pneumonia score for the x-ray image. But you could also do something like, if it was tabular data, have it run inference against the table and say, oh, this does or doesn't contain PII. And depending on the result, it could add a clean tag or something, so that a lot more people have access to it, and if it does contain PII, it could put a label on it like sensitive, so you can use that to constrain access to the data too. So I did put together a more concrete example of how you can combine coarse and finer, kind of hybrid, Trino and storage-based access control for a data lake or data warehouse. With all these layers it gets kind of tricky, so I thought a good diagram or example might help. Let's say we're going to do classification levels like the government uses, right, a tiered access control type of thing with public, confidential, secret, and top secret. And you create corresponding roles for principals that are able to access public, confidential, secret, and top secret. And then here's a whole bunch of rules to make it work. Roles can only write to buckets of an equivalent security level; so only top secret people get to write to a top secret bucket, and they certainly can't do things like change the tags or anything like that, right, because that could adjust the access control. Roles at a higher security level can read from buckets at a lower security level. Columns can have lower security levels than the bucket they're stored in, so some of the columns in a top secret bucket might just be general access. But rows and columns cannot have a higher security level than their bucket, because anyone who can read the bucket could access the whole object.
That would compromise the security. And roles can't directly access buckets or objects at higher security levels, even if those hold some lower-security rows or columns; that's the same problem. And then there are the policy enforcement points, right? I talked about this before: they're able to assume the roles, and they can basically filter the data down for the lower security roles and only give them the data they're allowed. So they're able to dip into a secret bucket, filter out what the user can't see, and hand over, say, just the general access-controlled columns to a general access-control user. So you need a policy enforcement point if you need to sift the data at a finer granularity. And these policy enforcement points should only be administered by users who can assume the role of the policy enforcement point, since the policy enforcement point has the highest privileges. So what does this look like? Say you have your Red Hat OpenShift Data Science notebook namespace that your notebook is running in, set up with a particular service account. Say you put an annotation on the namespace or something, so that all of the pods running in the namespace run under a certain service account, and you provisioned the cluster with STS, so that this service account you defined in the namespace is able to do the whole dance of using STS calls to get credentials, all tied back to the OIDC provider. Then you have a Trino namespace, and either the Trino namespace or the Trino pods within it are configured with a service account. In this case it would be, what do I have here, the confidential one. Okay. So any of the notebooks running in the notebook namespace are kind of general access control, but you need access up to your security level, right? You could have a Trino namespace at the highest level, and it would just filter out the orange or the yellow or the red for users that have an orange, yellow, or red clearance level. Or you could have different clusters, like in this diagram: we have the confidential Trino namespace, and so the service account for the Trino running in that namespace is limited to confidential. It can't see secret or top secret, but it can filter the confidential information out of the general information and then give the result to a general user. So the Trino service account needs to be at the highest privilege level it's capable of accessing. And if you look on the data side: if you have whole objects where all the columns in the object, and in the bucket, are general access, users can just access them directly. They go into their notebook, they use Boto, they can use PyArrow or something like that to get the data out of Parquet, and then they can manipulate it, interacting directly with the storage system. But if they need to access green data, and they're permitted to access green data, but that green data is mixed in with other columns that are confidential...
...and so it's sitting in a confidential bucket as well, then you need something like Trino to do the more granular access control for you. So Trino assumes a role in order to do this work, and it sifts out, masks off, the confidential data. That's where the integration with Apache Ranger comes in: that's where you define the more sophisticated policy. If you're just doing storage-based policy, then you're in the same situation with Trino as without it, right? If you want the more sophisticated access control, you need something like Ranger, where you can define your column policy, and if you do, then Trino can filter or mask out the confidential columns and provide just the general access-controlled data to the user. And so now you have this really nice layered access control, where users and different tools are able to access the data directly if they're permitted to interact with the whole object, but when they need to act more granularly on just specific columns, they're able to do so by way of a policy enforcement point. So, one question, just to, sorry, I may have missed this part, but the role of Ranger is to allow you to set this level of granularity, this level of policy, or to morph the end result altogether, or both? No, Ranger is just the policy store, right? So Trino talks to Ranger, and Ranger says which principals are able to access which columns, and whether to deny, or to do filtering or masking of particular columns. But Trino is the one that actually does the filtering or masking and so on. And when I say masking, usually what it does is take all the values for a column and hash them, so that it's just garbled instead of being the actual value; it's kind of like anonymization, almost. I say almost because hashing tends not to be proper anonymization in a lot of cases; I'm not going to get into whether Trino's is, I don't actually know how good it is, but it masks. And you'd use that when you want to return the column that has, like, the credit card number, and you want to preserve the cardinality, so you can count the number of unique credit card numbers, or the number of credit cards per customer or something, but you don't want people to be able to actually see what the credit card numbers are; that's when you'd use masking rather than filtering. Same for addresses or anything like that, right? If you still need a unique value so you can count them, you use masking. But yeah, the masking or the filtering is actually done by Trino; Ranger is just where the more granular rules are stored. Okay. Thanks. So that's kind of my crash course on data lake security, if there aren't any other questions. No, but I do think, because of the issues with audio, we may end up re-recording parts of this, because it's chock full of good information, and we can talk about whether we want to do some deep dives on particular things. But that was great.
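To round out the query-side picture, a minimal sketch of connecting to Trino from a notebook with the trino Python client; host, catalog, schema, and credentials are placeholders, and recent versions of the client also offer OAuth2-based authentication for the OpenID Connect setup described above:

```python
# Minimal sketch: querying Trino as an authenticated user, so that
# Ranger policy can be applied to the columns the query touches.
import trino
from trino.auth import BasicAuthentication

conn = trino.dbapi.connect(
    host="trino.example.com",   # placeholder coordinator address
    port=443,
    http_scheme="https",
    user="michelle",
    auth=BasicAuthentication("michelle", "example-password"),
    catalog="hive",
    schema="sales",
)

cur = conn.cursor()
# Permitted columns come back as-is; a masked column such as credit_card
# would come back hashed, per whatever Ranger policy is in force.
cur.execute("SELECT customer_name, credit_card FROM customers LIMIT 10")
for row in cur.fetchall():
    print(row)
```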
I actually wasn't sure what we were going to talk about today; I know you gave me a little blurb earlier, but I was like, well, okay, I'm not sure. And this was a really nice overview. And I thought, maybe we can do a future demo where we actually, even if not the whole thing, do a little part of it: do some stuff with notebooks, like using the static credentials, then using the STS mechanics in ROSA, and from there maybe play with some tables or something, I don't know. And there's definitely a set of things that someone who's going to go set up security for their data lake should be thinking about, right? What's the level of granularity you need? Is it going to be column-level data? Are you going to need extra policies set up? Like a guide: how do you actually approach this stuff when you're doing it for a data lake? Don't just assume that what you know from S3 with bucket policies and IAM and all that is going to be sufficient; you may have to think about some other things. So that was really cool. Awesome. Well, we're kind of at the top of the hour. If we decide to re-record so that we can get better audio, I'll definitely publish something and tweet it out saying when it's available, and maybe we can even replace this one. I'm not sure how we do that exactly, but we'll see. Anyway, I don't see any questions at the moment, but I want to thank you, Kyle, for getting up early and putting this together for us. I think we will re-record; I think that's a good idea, just to make sure the audio is a little clearer, and we'll publish that as well. So that'll be really awesome. Cool. Well, thank you, everybody. I just wanted to say, have a good day, and I'll talk to all of you. Thanks so much. Thank you, Kyle. Thanks.