All right, welcome everyone to our next talk. I'd like to welcome Barton Rhodes, speaking on securing your Kubeflow clusters. Everybody, please give him a round of applause.

Thank you. So yeah, I'll talk to you a bit about Kubeflow and Kubernetes and how to set it up in a way that is not going to compromise your data.

First of all, what is Kubeflow? Kubeflow is a set of open source components that run on Kubernetes and are designed for end-to-end deployments of your machine learning models. You may have seen this quote or heard it on a recent podcast. It's a quote by Ian Coldwater, who said that containers are only as secure as their runtimes, their orchestration frameworks, their kernels, their operating systems, and everything else. So this is not a talk about Kubernetes security per se, or container security; there are plenty of excellent venues and talks at cons right now where you can go and learn about those things. Rather than go into detail on that, what I want to do is introduce you to Kubeflow, let you know that it exists, help you get started with it in a way that gives you some secure defaults and prevents data exfiltration, explain a little bit about the security model of Kubeflow, and point you at resources you can use to learn more about the underlying stack.

This may not be new to most of you. I think I originally saw this chart in the paper called "Hidden Technical Debt in Machine Learning Systems." The actual machine learning code, the training code you use to run your model, is only a small portion of what goes into a successful deployment of a machine learning project. In many cases, the hard part begins once you have demonstrated the value of the project, once you've shown that it works on your test data set and you are asked to integrate it. This is not a skill set that a lot of people coming from statistics or data science are equipped with. Obviously, people in this room are a slightly different mix from a typical data science environment, but that's effectively the goal of the project: to find open source components, existing cloud-native technologies, for every step of deploying a model, from pre-processing data in a consistent manner, to training your model, to tuning your hyperparameters, to keeping track of your experiments and the metadata associated with model training.

It's also, arguably, a design that meets a number of communities where they are. Kubernetes is not the right level of abstraction for a lot of us data science people and analysts to understand the infrastructure we consume. Typically, we just want compute. We want to be able to run our models within a certain amount of time, migrate our workloads between different deployment environments, and do it in a manner that's familiar to us, using the tools we use every day. That's the why of it. Kubeflow, like I said, reuses open source components and is designed to give you sane defaults and integration pieces for every likely aspect of what you use. Here is an example of one possible deployment scenario where, as a data scientist, you may start in a Jupyter notebook, which is a natural environment for a lot of data scientists, and then take the same workload that you run for training and prediction to the cloud,
including a managed service like AI Platform, or run it on-premise, basically anywhere Kubernetes runs. That's the goal.

It's still a very young project. My background coming into it is that I am deploying this in a healthcare setting, where you may have considerations of cost and considerations of data locality, and you need building blocks that are stackable for data scientists. One of these building blocks is a Kubeflow project called Fairing. To introduce it, I'd point you to a series of blog posts by Netflix. Fairing tries to bring the Netflix model of notebook deployments to an open source project: not just data exploration, model training, or model inference happens inside a notebook, but also data engineering tasks and other repeatable tasks. There's also Papermill, which Fairing integrates with. Fairing lets you ask the question: how much of the actual workflow can we make reproducible directly from the Jupyter notebook? This may sound crazy, but if you think about typical workloads in large projects, the actual execution of your IPython kernel is a minor part of the whole; computationally, that's not the hard constraint. Fairing is the project that allows you to do that.

To illustrate the benefits of the Fairing SDK, which is a Python SDK, consider the current approach to deployments. Your code may work in your preferred local environment, not necessarily a notebook; Fairing is equally usable from your favorite IDE. You then have to adapt it to some deployment surface, whether that's AI Platform, some other managed service, or your own API endpoint that you build. The way to deploy it in Kubeflow right now is by consuming the TFJob type of deployment. That's a lot of modification and refactoring at every step where your target deployment and your local deployment differ, and it introduces potential for bugs to creep in. With Fairing, you get a fairly simple way of migrating your workloads by abstracting away the backend and the deployment procedure for your model. That's one of the ways Kubeflow lets you benefit from code reuse: you can transition between environments pretty seamlessly by just swapping out whichever Kubeflow config you have. You can target an isolated healthcare cluster with sensitive data, you can target GKE, and pretty much any scenario where you have Kubernetes can be addressed, including local environments. If you're used to developing locally with Minikube, you don't have to change much in the templates you work with in the cloud to execute there.

Here's where it gets into security. A lot of these niceties are nice to have, but it's very important that we understand the security model and the different permissions and exposure you get by using these services. Once again, my goal is to inform you as to where we are today, give you an idea of what is affected by the permissions you have, put it in front of you, and let you examine it and understand the different approaches. As part of the Jupyter notebook service that comes with Kubeflow, you get access to a variety of resources. You can schedule pods and create deployments and services, including training jobs and inference jobs in PyTorch and TensorFlow, and other frameworks are being added all the time. You can create these directly from your notebook.
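To make that concrete, here is a minimal sketch of what submitting a training function from a notebook can look like with the Fairing SDK. The training function, registry, and base image are placeholders, and the exact module paths and constructor arguments have moved around between Fairing releases, so treat this as an illustration of the idea rather than the definitive API.

```python
from kubeflow.fairing import TrainJob
from kubeflow.fairing.backends import KubeflowGKEBackend


def train():
    # Hypothetical training function: in practice this would load data,
    # fit a model, and write artifacts somewhere durable (e.g. a GCS bucket).
    print("training...")


# Package the function into a container image and run it as a job on the
# Kubeflow cluster instead of inside the local IPython kernel.
job = TrainJob(
    train,
    base_docker_image="gcr.io/my-project/fairing-base:latest",  # hypothetical base image
    docker_registry="gcr.io/my-project",                        # hypothetical registry
    backend=KubeflowGKEBackend(),
)
job.submit()
```

Swapping the backend object is what changes the deployment target, so the same cell can go to GKE, to an on-premise cluster, or to a managed training service.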
You can, once again, have that as a repeatable step in your MLOps workflow; that's the blog post that goes into it. The end result is that, instead of making every deployment a one-off, which is how a majority of data science projects end up not realizing their investment, you have something that's automated and repeatable. That's the Netflix model that Fairing enables. You can find the Fairing SDK on GitHub, and the scenario in which you use it is consumption from a notebook, or from any environment where you can run Python. You can imagine an Airflow trigger; it can be any of a variety of scenarios where the workflow needs to be automated.

Eventually, though, you get to a point where you want a more defined workflow. Let's say you have deployed your model and exposed your endpoint; typically the end result of the Fairing workflow is some sort of endpoint that you can hand off to your UI team, for instance, to build a simple UI around, or that some other downstream service consumes. That's as far as Fairing gets you, and you can get there from a notebook or from a repeatable process. But what if you have a complex end-to-end process with many different steps that can fail, and you want to isolate each of those steps and control the security and permission model of each individual part of the DAG? That's where Kubeflow Pipelines comes in.

Pipelines is the other component I wanted to introduce before we jump into what you can do to make these deployments secure. It's a UI for managing experiments, jobs, and runs, and it's also an SDK that lets you break down your entire end-to-end workflow into a number of containerized operations. Instead of thinking in terms of one function that does many different things, from data processing to outlier removal to training and then tuning, you isolate each part of that workflow into its own component, with a Docker image that captures the requirements of that step in one place. You can then add things like monitoring to it. All you have to do is follow a number of standard interfaces defined in the container op spec. The benefit is that you can not only work on the performance of your model component by component, but also freeze a version of your model in time by looking at the exact image that was used to run any given step. If reproducibility is your goal, and I think it should be, you can get back to the exact versions of the containers that were used to run your model.

A visualization of the DAG might look like this; that's the XGBoost example run that you can find in the docs. As you can see, the operations here are not limited to ML tasks like training and data transformation: there are things like creating clusters, and, as we saw a few slides back, we have access to pods, deployments, services, and jobs. So you can do not only the modeling but everything required for it in terms of infrastructure and service requirements, and then introduce monitoring and outputs at intermediate steps.
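To give a sense of how such a DAG is expressed in code, here is a minimal two-step pipeline using the v1 Pipelines SDK DSL. The images, bucket, and argument names are hypothetical; the point is that each step is its own container op whose pinned image freezes that step in time, and that consuming an upstream output is what creates the edge in the DAG.

```python
import kfp
from kfp import dsl


@dsl.pipeline(
    name="preprocess-and-train",
    description="Each step runs in its own pinned container image.",
)
def preprocess_and_train(data_path: str = "gs://my-bucket/raw.csv"):  # hypothetical bucket
    preprocess = dsl.ContainerOp(
        name="preprocess",
        image="gcr.io/my-project/preprocess:0.1.0",  # hypothetical image
        arguments=["--input", data_path, "--output", "/out/clean.csv"],
        file_outputs={"clean_data": "/out/clean.csv"},
    )
    dsl.ContainerOp(
        name="train",
        image="gcr.io/my-project/train:0.1.0",  # hypothetical image
        # Referencing the upstream output wires the dependency into the DAG.
        arguments=["--input", preprocess.outputs["clean_data"]],
    )


if __name__ == "__main__":
    # Compile to a workflow spec that can be uploaded through the Pipelines UI.
    kfp.compiler.Compiler().compile(preprocess_and_train, "preprocess_and_train.yaml")
```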
So one of the things you can do is visualize the HTML output of a Jupyter notebook that gives you reference metrics for how well your model is doing, or visualize a confusion matrix or an ROC curve, something that gives you a way of visually debugging it. That's one more place where notebooks make an appearance: you can take the very same notebooks you use for analysis and integrate them into these intermediate steps as a form of debugging.

This is a diagram of what happens behind the scenes to make all of that possible. If you look at how Pipelines is implemented, it's Argo workflows with a very thin layer that adapts them to Kubeflow Pipelines. If you're familiar with Argo, everything Argo normally gets you, event-driven workloads and isolation of individual steps, you get here as well. And in the recent 0.6 release that just came out, you also have the metadata store: every run of a Kubeflow pipeline is versioned, and information about it is stored in a metadata database, a MySQL database with a JSON schema that lets you track artifacts, where the artifacts are pipelines or the individual container operations within a pipeline. So, once again, there is a lot of opportunity to mess up and expose some part of this architecture to the outside world, which is why it's imperative that we talk about how to isolate it; as you can see, the overall architecture is quite complex.

This slide covers the domain-specific language that Kubeflow Pipelines gives you. You have container ops, pipeline parameters, and components. There are a number of standard components that come with prebuilt containers and images, which let you run simple Python functions that don't require a lot of dependencies. There are operations for handling volumes, so you can create the storage you need for intermediate artifacts like model weights or debugging outputs. And here is an example of a component specification. Like a lot of things in the Kubernetes world, you define it in YAML: as long as you are consistent about which inputs and outputs each individual step takes, you can attach an image to it and run a train command, or any other entry point, that takes arguments and executes a specific workload in a specific manner; I'll show a sketch of what that looks like in a moment.

What does this get us? Taking a step back from the concrete details, it gets us a standard, reusable component structure that lets us take our work with us or generalize it to problems we might encounter down the line. Instead of learning a specific stack at one of the big four, we can build in a manner that is shareable and can be executed anywhere Kubernetes runs.
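For a sense of what such a specification looks like, here is a minimal sketch of a component definition loaded with the Pipelines SDK. The image, command, and parameter names are hypothetical; what matters is the consistent declaration of inputs, outputs, the pinned image, and the entry point.

```python
from kfp import components

# Hypothetical trainer component: a pinned image plus an entry point that
# takes arguments for its inputs and writes its output to a path the
# pipeline system provides.
train_component_text = """
name: Train model
inputs:
- {name: training_data, type: String}
- {name: learning_rate, type: Float, default: '0.1'}
outputs:
- {name: model_dir, type: String}
implementation:
  container:
    image: gcr.io/my-project/trainer:0.1.0
    command: [python, /app/train.py]
    args:
    - --data
    - {inputValue: training_data}
    - --lr
    - {inputValue: learning_rate}
    - --model-dir
    - {outputPath: model_dir}
"""

# load_component_from_text turns the spec into a factory usable inside a pipeline.
train_op = components.load_component_from_text(train_component_text)
```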
As a Google-initiated project, Kubeflow initially had more contributors from Google, but it recently passed the point where a majority of contributors come from the community. David Aronchick, who started as the PM of the project at Google, is now at Microsoft and bringing it to Azure. So it has broad support and it's a community-driven project, and in the future you can imagine individual pieces of your pipelines being shared with others in a reusable manner. That's it for the Kubeflow introduction. There is a lot more to it, and this just scratches the surface by looking at two components that get you from a notebook interface to something that's event-driven, triggerable, and reusable.

Now I want to talk about the benefits of this approach and how we think about security in Kubeflow. The principle of least privilege, I think, is a benefit not just of this project but of Kubernetes in general, and it's a good argument for running things in containers in the first place: instead of having one service account with access to everything in your infrastructure, you can isolate your workloads by the specific tasks they need to perform. In this case, think about a container operation that needs to read data from a Google Cloud Storage bucket. The service account and the secrets that you expose to that container don't need anything beyond that, so if there is a vulnerability in some component of that container, or the container gets compromised, you don't lose access to your entire infrastructure.

Once again, for context on where things stand: a lot of this is not GA, there is no 1.0 yet, and the latest stable version is 0.6. Security tends to take a back seat when rapid development is the goal, so right now is the time to start thinking about how we can improve this. My goal, by informing you, is to get you to the GitHub repo to play with the code, see what you can do, see how you can break it, and hopefully open some issues and pull requests to help us do better.

From that perspective, I'll introduce the three service accounts and the permissions they are currently deployed with. A lot of this maps onto on-premise deployments as well, but of necessity, Google Cloud Platform has been the more stable target for Kubeflow so far. So if you're in an on-premise environment, the IAM roles I'll be describing are specific to Google Cloud Platform and won't translate as easily to your on-premise role-based access control.

One service account is the admin account, which is used to actually deploy and configure the cluster; the kfctl script uses the admin service account to perform most of the deployment. The user account is what most data scientists will encounter day to day: it lets you use and consume GCP resources from your container operations, or from Fairing if you're using a Jupyter notebook. And there is the logging account, the VM service account, which is used to take audit logs and send them to Stackdriver.

As for concrete roles on these accounts, all of this can be adjusted, by the way; this is taken directly from a Cloud Deployment Manager template that you can find in the repository. The admin service account has the source admin role, which allows it to push the application to the cloud source repository. It has a service management role, which allows it to control the endpoints and the hostnames of those endpoints. Network admin is used to enable Identity-Aware Proxy, which lets you use your Google account to authenticate against different resources and consume them, as well as for health checks. That's the admin account. If you wanted to change these, you would modify the template, but that would be specific to your organization's security policy.
The user service account allows you to do things like run custom builds on Container Builder. That's important if you want to run your application in a context where you don't have access to GCR, the container registry; the builder allows you to package it up as a coherent end-to-end application and deploy it, potentially in an air-gapped environment, which could be a security requirement. It has a viewer role for viewing the resources of the builds. And then, and this is just an example, your services may vary: for instance, if you want to add Dataproc or something, that's not captured here, but this is a minimal set of permissions for interacting with Cloud Storage buckets, BigQuery, and Dataflow, all of which will probably help you with data. Then, of course, the monitoring side has the log writer, metric writer, and object viewer roles. These are the roles you can review and decide whether they're appropriate. One thing that wasn't mentioned on the earlier slide is the IAP account: this applies to the user account that authenticates against Identity-Aware Proxy, which has to have an object viewer role as part of the implementation. And the VM service account, like I said, has log writer and so on.

Okay, so that's based on Kubeflow's built-in roles right now. The way this relates to role-based access control in Kubernetes is that it complements it. IAM can only get you so far: it's better suited for metadata and for access to objects that are controlled through IAM roles. If you want to work with Kubernetes permissions themselves, Kubeflow can create objects like cluster roles and cluster role bindings that help you control the management of your resources using the Kubernetes model. Once again, I'll point you to resources towards the end where this is addressed in more detail, but this is a brief description of how it works. And just to clarify one thing: in 0.6 we now have per-user namespace isolation. If you start with a new project and add your own user, all of the pods, notebook services, and endpoints that you deploy will be contained to that namespace, and other data scientists or other teams won't necessarily have access to it. That buys you multi-tenancy and the like, which is important if, for instance, you're in a context where certain projects require a separate set of NDAs or involve PII that should not be accessible outside of a particular analysis.

Okay, so the gist of this talk is actually quite simple: there are two main things you can do as data scientists when deploying this. My goal is for you to go back to your place of employment and try this, and when you try it, I don't want you to end up adding cryptocurrency miners to your company's infrastructure, so a few simple precautions can help you avoid that. The main one, honestly, is VPC Service Controls. What that allows you to do is restrict access to certain API endpoints consumed by your application to only the services and pods that come from within your Kubeflow deployment. The VPC Service Controls model gives you a secure perimeter within which only authorized clients have private access to resources: they can create them, copy them, and so on. To be clear, this only applies to the content within those resources, not the metadata about those resources; if you want to manage the metadata, you have to fall back on the IAM model. So if you're copying things from Google Cloud Storage or doing things in BigQuery, and you want to make sure that, say, a team in a different geographic location will not have the same access, you use VPC Service Controls to isolate them. There are other benefits I could go into in more detail if there are any questions, but effectively it's configurable using standard OAuth credentials, and you have the ability to restrict it per service. Here's an example of how you would do that. Of course, all of these environment variables have to be set, but this is the command line approach, in which you enable the Resource Manager API and the other prerequisite steps.
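As a rough sketch of that command line flow, assuming the gcloud CLI is installed and authenticated, a small driver script like the one below covers the prerequisite APIs and the perimeter creation. The project number, access policy ID, and perimeter name are placeholders, and the exact gcloud flags have shifted between releases, so check the current VPC Service Controls documentation before relying on it.

```python
import subprocess

# Placeholder identifiers; substitute your own project number and access policy ID.
PROJECT_NUMBER = "123456789012"
POLICY_ID = "987654321"

commands = [
    # Prerequisites: enable the Access Context Manager and Resource Manager APIs.
    ["gcloud", "services", "enable",
     "accesscontextmanager.googleapis.com", "cloudresourcemanager.googleapis.com"],
    # Create a perimeter so the listed services can only be reached from inside it.
    ["gcloud", "access-context-manager", "perimeters", "create", "kubeflow-perimeter",
     "--title=kubeflow-perimeter",
     "--policy=" + POLICY_ID,
     "--resources=projects/" + PROJECT_NUMBER,
     "--restricted-services=bigquery.googleapis.com,storage.googleapis.com,"
     "containerregistry.googleapis.com"],
]

for cmd in commands:
    subprocess.run(cmd, check=True)
```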
There are similar ways of defining a secure perimeter in the UI in GCP; once again, this would be different in an actual on-premise deployment. You could say that your application should only be able to access the Pub/Sub or Storage API and should not be able to access Bigtable or other services within the perimeter, and you can restrict it to specific projects, which in GCP land come with their own set of isolations. When you're setting this up and trying to justify it to your boss, make the point that you will not have issues with some high-security service, say BigQuery holding all of your critical data. Another thing you can do, and I recommend doing this especially if you're exposing any resources that have a load balancer in front of them, is to restrict access by authorized networks, using the specific IP range you'll be hitting them from. It's something that can save you from unfortunate outcomes.

Here is another command line example of what you saw earlier in the UI: in this case you once again define a secure perimeter, but it's something you can automate for the creation of new projects, either with Cloud Deployment Manager, Terraform, or another tool like that. Don't make it guesswork every time; fall back on good practices, discuss it with your team, and understand what is within your security budget. Effectively, we have different APIs here; in this case we're covering BigQuery, Container Registry, and Storage, but your use case may vary. I also added some diagrams in case you want to see practically what this gives you. Once you have a service perimeter within which you run Kubeflow, you don't have to worry about unauthorized access from the outside world, or about other clients consuming it in ways you didn't intend. That's the cloud case. The on-premise situation is a little more complicated, especially with GKE On-Prem and Anthos; you can configure these things a bit further, but once again this talk has been mostly GCP-focused. In that case you can extend your VPC perimeter into an on-premise network by using some sort of VPN gateway, which can come with its own subnet, and by defining strict firewall rules for how the different VPCs talk to each other, you can prevent unintended access. Honestly, if you get to this point, you should probably work with your operations and SDLC teams, but as a one-off experiment I recommend trying this in your own GCP project and configuring a very basic service perimeter.

One last thing, and it may seem obvious, but one of the most common ways GCP accounts get compromised is service account leakage: somebody commits a key to GitHub, or something silly like that happens. This is especially common for newer data analysts and data scientists, because a lot of tutorials advise you to add everything in your directory to git and commit it; that's how a lot of this stuff happens, true story. Luckily, Kubernetes allows you to store credentials in a secure manner so they can be accessed from a variety of places in the Kubeflow infrastructure, and I want to walk through that really briefly. You can add a secret using the standard secret creation interface; your secret can be your service account JSON, or an environment variable that you set. I recommend using secrets for anything JSON, and even for passwords you want to consume later. You can spread the credentials to your entire infrastructure by using GOOGLE_APPLICATION_CREDENTIALS, the environment variable that points at your service account key file, so that every pod starting with a certain service account within your cluster will have access to those resources, say, something every pod has to hit. Once again, this is per namespace, so you don't have to worry about leaking to your colleagues. You can also authenticate from a pod directly by picking the credential up from the secret store and mounting a GCP secret as a volume; once again, that environment variable is how your SDK knows what to authenticate with. And you can access secrets from a pipeline: if you remember the components from earlier, it's part of the DSL to be able to include a specific secret in your pipeline. Think of this as a critical step. It may be very tempting to bake your secret into your Docker image when you're on a deadline, but if you just follow these simple procedures and add secrets using the DSL directly, you're preventing unnecessary harm from leaking your service accounts to your registry.
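To make that last point concrete, here is a minimal sketch of attaching a GCP service account secret to a pipeline step with the v1 Pipelines SDK. It assumes a Kubernetes secret named user-gcp-sa already exists in the namespace (the name the GCP deployment of Kubeflow typically creates); the image, bucket, and arguments are hypothetical.

```python
from kfp import dsl, gcp


@dsl.pipeline(name="train-with-gcs-access")
def train_with_gcs_access(data_path: str = "gs://my-bucket/clean.csv"):  # hypothetical bucket
    train = dsl.ContainerOp(
        name="train",
        image="gcr.io/my-project/trainer:0.1.0",  # hypothetical image
        arguments=["--data", data_path],
    )
    # Mounts the 'user-gcp-sa' secret into the step and points
    # GOOGLE_APPLICATION_CREDENTIALS at it, instead of baking keys into the image.
    train.apply(gcp.use_gcp_secret("user-gcp-sa"))
```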
Last, I had a little bit of material on the 0.6 release, which just happened. The big migration from 0.5 was the replacement of the Ambassador proxy with the Istio gateway. At this point it may not be as relevant to day-to-day use by a data scientist, but Istio comes with a number of very nice ways of controlling the flow of data and security between the different services in your cluster, so you get a lot of things for free, like authentication and authorization, and things like fault injection, which lets you test your model.

So, to wrap up real quick, I wanted to point you to some resources and references. kubeflow.org is great; there is a description of private clusters over there, and a lot of this material came from the Kubeflow documentation, as well as the blog post by Arrikto, which did a really good write-up on authentication. Please join our Slack and start contributing; we need security people to look at this before it hits 1.0, so that we don't have nasty surprises down the line. Thank you. You can find me on Twitter, Keybase, or GitHub, or feel free to email me directly.