All right, so let's get started. Welcome, everyone. It's already late afternoon, so I think you're all tired. Before we start, a quick poll: how many of you were here for some of the security talks yesterday and today? Cool, yeah. All right, so this talk is going to be slightly different from the previous ones. Instead of presenting you another open source tool, we're going to show how at Databricks we connect the dots and build a holistic security solution that works for our company, and hopefully will work for your company as well. This is my colleague, Eric, and my name is Wing. We're both software engineers at Databricks, on the cloud platform team. Our team's responsibility is to provide infrastructure services like deployment, monitoring, and the permission and credential management systems. Our philosophy at Databricks is to use open source projects to manage our infrastructure, so we use a lot of the tools being discussed at this conference, like Kubernetes, Prometheus, and Bazel, and a lot of the HashiCorp tools like Terraform and Vault.

Here's our plan for today. I'm going to tell you a little bit about what Databricks does and what kind of services and tools we provide. Then we'll quickly run through some of our security concerns, and then we'll dive into the fundamental areas of our security management system. So first, what is Databricks? Most of the time when I talk to people, the first impression they have is, "Oh, you're the creators of Apache Spark." Yes, we are, but because this is not Spark Summit, we're not going to talk about Spark today, and this is not a talk about how we use Kubernetes as a cluster manager for Spark. Instead, we're going to talk about our product. By the way, there was a very cool talk about using Kubernetes as a cluster manager for Spark at the last Spark Summit.
So if you're interested, go check that out. Anyway, we provide this unified analytics platform. You may not know what that means; it's actually a product we call the Databricks Notebook. The Databricks Notebook is a SaaS offering that runs on top of multiple clouds, like AWS and Azure, and it provides better performance, real-time collaboration, and enterprise-grade security around Apache Spark, making it more suitable for enterprise workloads. To give you a taste of what a notebook looks like, on the right side of this slide is a browser-based workbench that data scientists and data engineers use to collaborate: they create data pipelines by writing Scala or Python code, they run Spark queries to get intermediate results out of their datasets, and they build dashboards around machine learning models, or even deep learning models, to finish their analytics workloads.

All right, so beneath this beautiful UI is actually a very common SaaS cross-account setup. Consider that if you're one of our customers, you may already have an AWS or Azure account where you have all your data stored, like your data lake or data warehouse, and you want to use the Databricks product, the notebook, to launch a Spark cluster and run your Spark workload inside your customer environment. To do that, you actually talk to one of our notebook services across the internet, and it connects to your customer environment and manages the Spark workload. You may have two observations here. One is that all our control plane services are deployed to a Databricks environment running on Kubernetes. On the other side, as an engineer or a service principal like Jenkins, you occasionally need to access those control plane services, which means you also have access, indirectly, to that customer data.
That's why the security of the Kubernetes cluster is critical to the success of our company and our customers. We have a couple of security concerns from different people. For example, our customers always want their data to remain private; they don't want anyone else to access it. A subset of our customers also have security compliance requirements they need to conform to, like those in the health industry or working with federal agencies, who have HIPAA, SOC 2, or FedRAMP compliance.

If you're a security engineer inside Databricks, your concern is security itself. First of all, you want the security solution to have defense in depth. We don't want a single security feature that has to defend against everything; instead, we need network security as well as service security. Second, you want all production access to be limited and audited, so nobody accesses a production environment who isn't supposed to.

Finally, as an application engineer, maybe you don't care that much about security. What you really care about is productivity. That means when you build a new feature or a new service, you want security to be an integrated capability that's already built into the platform; you don't want to spend extra time working on security plumbing. And because security is an ongoing effort, over time you may need to onboard a service with new security features, so you want those to be easy to integrate with and easy to extend.

All those requirements boil down to three fundamental aspects of a security management system: access control, secret management, and audit logging. We're going to try to touch on each of those today. All right, so we'll first talk about access control.
Access control really comes in twofold, as we mentioned previously: network access control and service access control. Here we just assume network access control has already been taken care of — you already have your ingress and egress rules set up correctly for your services — and we're going to focus on service access control today. Service access control usually means: who has access to what kind of resources? The subject, the "who," can be either a human or a service principal.

If you're a Databricks engineer, access control is actually very simple. We're all developers; we love the terminal. As a developer, you just go to your terminal and run a single command called get-kube-access, with the environment of the Kubernetes cluster you want to reach. This get-kube-access script does everything magically for you, and you don't need to care how it's implemented underneath. As you can see on this screen, the command leads you to your browser with a backend service link and expects something to be returned from that link. The first thing after you get to the browser is to sign in with your Google account. After that, it leads you to this ugly page — we're all backend developers, we don't know how to create a UI, so if you know how, please come help us. Basically, you have to fill in some additional information. First, you specify the authentication type: whether you want access to a customer notebook, to impersonate that customer and do some real-time debugging, or whether you want to SSH to one of the backend services and grab some logs — which realistically still happens sometimes, much as we'd like to avoid it. You can also specify how long you want access to that resource. In production it should be very short, because we have a limit set for all production environments.
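The client side of a flow like get-kube-access is commonly built as a small local HTTP listener that the browser-based approval step hands a credential back to. The sketch below is purely hypothetical — the callback mechanism, URLs, and names are our assumptions for illustration, not the actual Databricks implementation:

```python
# Hypothetical sketch of the client side of a get-kube-access-style flow:
# open the approval page in a browser, then wait on a local HTTP callback
# for the issued credential. All names and the callback mechanism are
# assumptions, not how Databricks actually implemented it.
import threading
import webbrowser
from http.server import BaseHTTPRequestHandler, HTTPServer

received = {}

class CallbackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # The access service (hypothetically) POSTs the credential back to us.
        length = int(self.headers.get("Content-Length", 0))
        received["credential"] = self.rfile.read(length).decode()
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"Credential received; you can close this tab.")

    def log_message(self, *args):
        pass  # keep the terminal quiet

def get_kube_access(genie_url: str, timeout: float = 300.0) -> str:
    # Listen on an ephemeral localhost port for exactly one callback.
    server = HTTPServer(("127.0.0.1", 0), CallbackHandler)
    port = server.server_address[1]
    worker = threading.Thread(target=server.handle_request, daemon=True)
    worker.start()
    # Send the developer to the web form (auth type, duration, ticket ID).
    webbrowser.open(f"{genie_url}?callback=http://127.0.0.1:{port}/")
    worker.join(timeout)
    server.server_close()
    return received.get("credential", "")
```

The appeal of this pattern is that the terminal command never handles passwords or two-factor prompts itself; the browser does all the interactive authentication.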
You also have to provide a ticket ID from your ticketing system, to demonstrate that you're not just messing around — you're actually trying to solve some customer-facing issue. After you fill out all that information, you submit the form, and this Genie service returns a credential back to your laptop. With that credential you can run your favorite kubectl commands to get information out of your Kubernetes cluster. Notice the whole process is self-service; you don't have to talk to anyone, because we're all shy developers.

Here's some more explanation of what's going on under the hood. Remember, this is your laptop, and you talk to this backend service called Genie. Genie is our centralized access control service inside Databricks, and it's integrated with Google. The first thing you're asked to do is a Google authentication: that's Genie's auth proxy forwarding you to Google to finish the authentication. After you're authenticated, Genie sends another request to Google to get all your Google group information. Genie bundles that information together and sends it to Vault — the HashiCorp tool. Vault, given that information, issues an employee certificate and sends it back all the way to your laptop, and with that cert you're able to run your kubectl commands. So that's how it works.

This is an example of the employee certificate. There are a couple of things you may notice. It's signed by this employee CA — we'll talk about that later. And it's time-limited: this is a dev environment, so it's one day; it's much shorter for production environments. And it has your employee email address embedded in the common name field, with your Google group information in the organization and subject alternative name fields.
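Kubernetes' built-in x509 client-certificate authenticator derives identity from exactly those fields: the common name becomes the username, and each organization entry becomes a group. Here's a tiny, simplified sketch of that mapping (real certificate subjects are ASN.1 structures, not comma-separated strings, so this is only illustrative):

```python
# Sketch of how Kubernetes' built-in x509 client-cert authenticator derives
# identity from a certificate subject: the Common Name (CN) becomes the
# username and each Organization (O) entry becomes a group. Parsing is
# simplified here for illustration.
def identity_from_subject(subject: str) -> dict:
    user, groups = None, []
    for part in subject.split(","):
        key, _, value = part.strip().partition("=")
        if key == "CN":
            user = value
        elif key == "O":
            groups.append(value)
    return {"username": user, "groups": groups}

print(identity_from_subject("CN=wing@databricks.com,O=growth-team,O=eng"))
# → {'username': 'wing@databricks.com', 'groups': ['growth-team', 'eng']}
```

This is why embedding the email in CN and the Google groups in O is enough for the API server to authenticate the employee and know their group memberships for RBAC.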
Those fields are used for authorization purposes by Kubernetes later on. This is how our CA trust chain looks. We have a single root CA, which is stored securely somewhere even I don't know. The root CA is used to sign a signer CA, which is then used to sign two branches of certificates: one is the employee CA and the employee certificates we've already discussed, and the other is the services CA and service certificates. So basically, your laptop and all of the Databricks services need to trust both the employee CA and the services CA, so the two parties can communicate with each other over TLS connections.

Kubernetes provides a lot of authentication strategies. If you go to their website, you'll see a bunch: client TLS certificates, token-based options like OpenID Connect or webhook tokens, password-based options, and auth proxies. Different companies use different auth strategies, but what we found works best for us is TLS client certificates. The primary reason is that TLS is very widely supported. Because of this, it's very easy to apply TLS authentication to all the services we run inside Databricks. And as a developer, that means a unified experience: when you create a service and want authentication for it, you use TLS, and when you want access to some service, you also use TLS. TLS also makes it easy to control expiration, because every certificate has a validity period built in. And finally, it's easy to integrate with Kubernetes RBAC, which we'll discuss next.

Does anyone see any problems here? One of the weaknesses of TLS is that it doesn't actually let you easily revoke access. That's one of the problems, and it's also why we want all the certificates to be short-lived.
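On the client side, trusting the cluster's CA and presenting the employee certificate is just standard kubeconfig configuration. A minimal sketch with placeholder names and paths (not Databricks' actual config) might look like:

```yaml
apiVersion: v1
kind: Config
clusters:
- name: dev-cluster
  cluster:
    certificate-authority: /path/to/root-ca.pem   # trust chain up to the root CA
    server: https://dev-cluster.example.com:6443
users:
- name: employee
  user:
    client-certificate: /path/to/employee-cert.pem  # issued via the employee CA
    client-key: /path/to/employee-key.pem
contexts:
- name: dev
  context:
    cluster: dev-cluster
    user: employee
current-context: dev
```

A script like get-kube-access would only need to write the freshly issued cert and key into a file layout like this for kubectl to pick up.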
But I think someone at this conference is working on certificate revocation — it was in one of the talks yesterday, so if you're interested, check that one out. That's one of the features we're looking for as well.

All right, so we have this employee certificate signed with the Google group information inside it, and all we need to do is somehow bind it to a Kubernetes role and then use that to access Kubernetes resources. Our binding strategy is very simple. Consider a Kubernetes cluster where different colors represent different namespaces. We do namespace-specific role bindings, and we rely heavily on the edit cluster role, a user-facing role provided by default since Kubernetes 1.8, maybe earlier. This edit role is basically a namespace-scoped admin role: it gives you admin access within a specific namespace, but it doesn't give you any permission to do operations against roles or role bindings themselves. It's essentially like the PowerUser role in AWS IAM. So if you have your employee cert, it gets bound to this edit cluster role through a role binding in a specific namespace. The role binding has an association with the edit cluster role, and it also has a subject with a group name in it. That means if I'm from the growth team, I can only access the namespace assigned to my group. Eric is from a different organization, so he only has access to the namespace he's assigned to; he won't be able to access any resource in my namespace. And if you're an automated system like Jenkins, Jenkins can also be associated with a Google group, so it gets bound the same way to the edit cluster role, and it's only able to deploy to that specific namespace.

One advantage of this setup is that your company's organizational structure may constantly change.
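The namespace-scoped binding described above is plain Kubernetes RBAC. A minimal sketch of such a RoleBinding, with hypothetical team and namespace names, could look like:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: growth-team-edit
  namespace: growth          # the one namespace this grant applies to
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: edit                 # built-in user-facing "namespace admin" role
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: growth-team          # matches a group (O field) in the client cert
```

Referencing the cluster-wide `edit` ClusterRole from a namespaced RoleBinding is what scopes the admin-like permissions down to that single namespace, and binding to a group rather than individual users is what keeps membership management in Google.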
When that happens, you may want to transfer ownership of a service from one team to another. With this setup, all you need to do is make the organizational change on the Google side and then update the role binding: you just assign the namespace to a different group, and that group becomes the new owner.

We also find this permission impersonation use case very useful. Consider that you're on call for team A and you need access to a service pod that belongs to team B. You may want to do some ad hoc debugging, because root cause analysis usually crosses team boundaries and you don't care which team a service belongs to. Or you may want to do a fleet scan or some maintenance task over your backend services. When that happens, what you really want is to temporarily impersonate one of the developers on team B and use their permission set to do your task. The get-kube-access command can help you achieve that. This is the same example we showed before: you still run get-kube-access, and remember that ugly web page — there you can also specify an additional group, in this case team B's group. Genie gets that request and notifies the owners of group B that someone is trying to access a service in their group, and they can approve the request. Then Vault generates a new certificate with that additional group information inside the cert, and you can use the new cert to access resources that belong to team B. We find this a very useful workflow in our organization.

And finally we come to continuous deployment, because we use Jenkins pipelines to do end-to-end deployment. One of the challenges we found with Jenkins is that it can't just open a browser and do this two-factor authentication — it runs in headless mode.
So what Jenkins really does is use a long-lived token that was issued by Vault, and instead of sending the request to the front end of Genie, it sends the request directly to the Vault backend. Vault recognizes that token, because Vault issued it, and it issues a Jenkins cert the same way it issues an employee cert, with all the required permissions. Then Jenkins can run the same kubectl commands to finish a deployment, the same way an employee accesses the Kubernetes cluster. So we've talked about almost everything here; the only missing piece is how Vault generates certificates and completes this deployment story. With that, I'll invite Eric to tell you how we use Vault at Databricks.

Can everybody hear me all right? Cool. So as Wing mentioned, I'll be discussing the tooling we built to address secret management. At Databricks, we use an open source tool called HashiCorp Vault to manage our secrets. It provides an easy-to-use RESTful interface for configuring and managing secrets. For example, you can install a signing certificate backend to generate TLS key pairs, and if any of you have had experience working with OpenSSL, you know how annoying it can sometimes be; Vault handles all the heavy lifting for you, whether it's certificate generation, revocation, or auditing. It's so easy to use that it encourages us to use shorter-lived certs, and that's the approach we take when generating headless Jenkins tokens for Kubernetes access. Additionally, for other secrets, it can store generic secret material, which we use for storing our long-lived certs. All of our services are deployed to Kubernetes, and we use Kubernetes secrets to inject that information into our running services. As a quick refresher, here's what a typical secret config looks like.
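A typical Secret of the kind described here might look like the following (names reconstructed from the talk; the value is just base64 of a placeholder string):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: my-service-secret-keys
  namespace: development
type: Opaque
data:
  # base64 of the actual secret material -- opaque and context-free
  secret-config: c2VjcmV0LW1hdGVyaWFsLWdvZXMtaGVyZQ==
```

Note that nothing in this object records where the value came from, when it was generated, or what depends on it — which is exactly the pain point described next.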
Here we have a secret called my-service-secret-keys, stored in the development namespace, which stores a base64-encoded secret keyed by secret-config. This is very easy to read and very concise up front, but in reality you may have secrets deployed across all of your namespaces and all of your clusters, and it can be really hard to manage the context in which each secret was generated and the format in which each file is configured. Thus, we started to realize that managing our secrets at scale is hard, and there can be quite a few pain points in the process. First, secrets are hard to audit, since they're stored all over the place, base64-encoded, without any context behind them. They're difficult to rotate, as it's not always obvious which secrets have dependencies on a rotated cert. And lastly, secrets obviously contain confidential information, so you can't check them into version control; any otherwise automatable service deploy process always requires some sort of manual step, in which someone on your team needs to manually generate those secrets and apply them to get your service up and running.

These three issues made it really obvious that we needed a solution to address this problem. To do so, we introduced a custom Kubernetes object definition called SecretTemplate, which integrates with HashiCorp Vault to leverage its ease of secret management. Now, let's take a look at the SecretTemplate equivalent to the secret defined on our previous slide. Notice that the definition looks very similar: we define a SecretTemplate with the same name and namespace as before. However, since we're integrating with HashiCorp Vault to generate our secrets for us, we provide the name and URL of the Vault server we're talking to. The data we request from Vault is reachable at the given path, and the resulting payload is referenced by the name secret-keys; it might look something like the JSON blob above.
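Since SecretTemplate is Databricks-internal tooling, its exact schema isn't public; the following is only a hypothetical reconstruction of the slide from the description above — the apiVersion, field names, and templating syntax are all guesses:

```yaml
apiVersion: databricks.com/v1        # hypothetical group/version
kind: SecretTemplate
metadata:
  name: my-service-secret-keys       # same name and namespace as the Secret
  namespace: development
vault:
  name: vault-dev                    # which Vault server to talk to
  url: https://vault.example.com
  path: secret/my-service/keys       # where the payload lives in Vault
  payloadName: secret-keys           # name the fetched payload is referenced by
template:
  # key/value pairs of the resulting Kubernetes Secret; each value
  # references a field of the named payload
  secret-config: "{{ secret-keys }}"
```

The point of the shape, whatever the exact syntax, is that every field is non-confidential metadata: the secret material itself lives only in Vault.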
Finally, the actual secret format is a key-value pair, with the key secret-config and the value drawn from the associated payload. Notice that we've described everything needed to define this secret without displaying any of its confidential information. So, to recap, we added internal tooling to declaratively define the properties of our secrets, and when applied, it automatically pulls all the relevant data from HashiCorp Vault and turns it into a Kubernetes secret. This enables us to tackle each of the pain points from our previous slide. We now automatically have audit log history managed by Vault, and if we need to rotate a cert, we simply update the signing backend in Vault and reapply the secret templates. Also, the templates contain no secret information, so we can check our configuration into Git and get all the benefits of version control and revision tracking. And we've automated and streamlined the steps for our developers to easily create their secrets, with fewer chances for mistakes along the way — now it's all just code. We found that this declarative format for our secrets was really useful in improving the velocity of our developers when managing their security-related concerns, so if you're interested in a tool like this, or have faced similar problems, please let us know afterward.

Last, I'll talk a little bit about our auditing story. Genie was designed to be the de facto authorization service for all employee access to our various services, and we built auditing in from the beginning. Genie records all types of authorization grants, including failed attempts at access. Since we're the creators of Spark, we naturally use Spark in our internal systems, so we run a daily Spark job to compile this information and export it via email to our security team. Here's an example of what that email log might look like — unfortunately we had to blur almost everything out, but you can still get the general gist of things.
It has the standard properties: timestamp, user, resource type, resource ID, and reason. Generally, if there's ever a break-in, we have all the historical information persisted in a database for additional investigation. The last thing to note is that since Genie only tracks requests to gain access to our Kubernetes clusters, we don't actually have insight into what a user does inside the Kubernetes ecosystem. So we leverage kube-apiserver audit logging to get command-level logging for the users in our system.

To wrap things up, I'd like to highlight a couple of key lessons we learned along the way while building these security systems. First, security shouldn't be a tax on developer productivity: the easy way should be the secure way, and any gnarly security implementation details should be seamlessly hidden by automation. A generic solution is usually the more secure one: TLS was easy for us to adopt, and it's a common enough standard that it could be used both for Kubernetes and for other developer application use cases. For security compliance reasons, remember everything until you're allowed to forget it, either by your contract or because someone tells you that you can. And lastly, don't reinvent the wheel. All of us at this conference can agree: open source software is your friend. Take advantage of the battle-tested solutions that others have engineered and you'll be a lot happier along the way. Cool, so I hope you enjoyed our talk, and if any of these topics interest you, or you enjoy working on other projects related to Kubernetes, please come talk to us after — we're also hiring.

Thanks. So actually, we're not using custom resource definitions to apply that object; what we have is actually a wrapper around kubectl that transforms, I guess, Databricks-specific object definitions into the underlying Kubernetes objects.
So we don't actually use custom resource definitions, but I think that's the way we'd want to write it if we'd had the features in later releases of Kubernetes.

So, Genie has a long-lived token, stored in its persistence layer, that it uses to actually talk to Vault.

Yeah, so basically the data the Databricks notebook works on is stored in some distributed file system on the customer side, which is typically S3. It's all stored in the customer's account, so it depends on whether they use encrypted S3 buckets or not, and then we peer the connection so we can access the data from our control plane. So basically all the data is stored in the customer account, and they have total control over who can and cannot access it. We have a certain expectation of what permissions they grant our account to access their data, but they can always revoke those permissions.

The question is how we work with developers to take away some of their privileges while also increasing security — what does that balance look like? For things like secret management, having the secret template is a clear win from the observability perspective, because everyone can clearly audit what secrets they're actually using. So there's an obvious benefit if they adopt the system rather than manually configuring their secrets: it's a win-win. For things like auditing and security, there's also a top-down aspect: we have to comply with some set of security rules, and that trumps the rest.

The question is how long the certificates for employees live. We have different requirements for different environments, like dev, staging, and prod. For now, prod is actually 60 minutes at maximum, and for dev we allow a bit longer, like one day. How do we replace certificates?
Yeah, so as we mentioned previously, one of the drawbacks of using TLS certificates is that there's no way to easily revoke existing ones. If a certificate is compromised, we have to switch all the certificates on the server side. But there are some people in this community working on certificate revocation, and if that becomes a feature we can onboard, I think that would be great.

Yes — the question is whether we have any automation built to rotate the certificates. We do have some of those parts automated, but we don't have a fully end-to-end automated system. That's the direction we want to go.

So yeah, there's some work, I think by the HashiCorp people, to integrate Kubernetes with Vault directly using service accounts. Currently, applying secrets actually leverages the employee credentials: when our wrapped kubectl applies the secret template, it uses your credentials to talk to Vault to populate the secret, and then applies it to Kubernetes. Right now, a lot of it depends on the employee pathway and getting those credentials, and the employee credentials source from Genie; they're dynamically created at the time of application.

Yeah, I would position Genie more as a thin layer on top of all those open source tools, specifically for developer-facing interaction. Services running inside Kubernetes don't have to talk to Genie to get service accounts or anything. So yeah, if Vault has some integration with Kubernetes, we'll definitely look at that solution as well. Cool, right — thank you, guys. Right, thank you.