Hi everyone, my name is Paul Van Eck and I'm a software developer at IBM who has lately been quite involved with the operations side of Kubeflow, most notably KFServing. And I'm Mofi, a developer advocate at IBM. Lately I've been focusing a lot on Kubeflow deployment, the Kubeflow operator, and the Kubeflow manifests. So welcome to our talk on taming the beast: managing the day-two operational complexity of Kubeflow. We hope to share some insights on how to make Kubeflow operations a bit less daunting. So let's get started.

If you are here for Kubernetes AI Day, then you are probably already familiar with, or at least know about, Kubeflow. For those who might not be: Kubeflow is an open source project that contains a curated set of tools and frameworks for ML on Kubernetes. Think of Kubeflow as a machine learning toolkit for Kubernetes, encompassing a combination of open source components all stitched together. Looking at a conceptual overview diagram, we can see Kubeflow as a platform for arranging the components of your ML system on top of Kubernetes. As you can see, quite a number of components are typically involved in a Kubeflow deployment, and this is by no means all-inclusive. In the big blue box, you have a number of applications developed inside the Kubeflow ecosystem, which leverage open source technologies from the wider ML and Kubernetes ecosystems, such as Jupyter, TensorFlow, and Istio. Typically, when people deploy and use a distribution of Kubeflow, they are using an opinionated bundle of these applications, optimized for a particular cloud platform or use case.

Now, with all these bits and pieces, you're bound to run into some trouble, right? That's the nature of the beast when you are dealing with so many moving parts. There's no question that deployment can be challenging. You might run into integration issues when trying to install the Kubeflow stack alongside existing components in your cluster. Maybe the install of your Kubeflow distribution requires additional configuration on top of what's already there. And those who have looked in the Kubeflow manifests repository in the past may have found the sea of YAML files and folders quite confusing. Furthermore, managing Kubeflow can also be somewhat tricky when doing day-two operations, such as updating or reconfiguring a component. Like most large pieces of software, Kubeflow has its own set of obstacles and complexities. How do we do ops for the platform, when the platform itself is for MLOps? So in the short time we have, let's talk about how we can contain this Kubeflow beast, at least to some extent, by giving you some strategies and insights for tackling Kubeflow operations.

First, let's talk a bit about deployment. Deploying a Kubernetes app means applying a set of YAML files to your cluster to create the necessary resources. For Kubeflow, all the application resources can be found in the Kubeflow manifests repository. Generally, when people deploy Kubeflow, they use the command-line tool kfctl. It was created to deploy something called a KfDef file, a YAML file that contains a list of references to the applications you want to deploy for Kubeflow. This kfctl and KfDef paradigm was heavily used in prior releases, with many distribution install guides built around this tool.
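To make that concrete, here is roughly what the kfctl flow looks like. This is a minimal sketch: the KfDef config URI below is a v1.2-era example, and the right one depends on your release and platform, so check the kubeflow/manifests repo for the file that matches your setup.

```sh
# Point at a KfDef configuration for your platform and release
# (illustrative v1.2-era URI; pick the one matching your release).
export KF_DIR=~/kubeflow-deployment
export CONFIG_URI="https://raw.githubusercontent.com/kubeflow/manifests/v1.2-branch/kfdef/kfctl_k8s_istio.v1.2.0.yaml"

mkdir -p "${KF_DIR}" && cd "${KF_DIR}"

# kfctl reads the KfDef, fetches the referenced application manifests,
# and applies them to the cluster in your current kubeconfig context.
kfctl apply -V -f "${CONFIG_URI}"
```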
However, recently there has been a push by the community to make Kubeflow installable using just Kustomize and kubectl, both tools commonly used in the wider Kubernetes ecosystem. So that is also an option, and it will likely be the preferred option going forward.

On the topic of deployment, it's worth mentioning some of the work the Kubeflow manifests working group has been doing to simplify deployment. First, the manifests repository structure was reorganized to make it more straightforward for the end user. In previous versions, the structure was flat, with every component jumbled together at the top level, so it wasn't really clear what was what. To rectify this, the structure was reorganized into three main folders. First, you have the apps folder, which contains Kubeflow's official components, maintained by the respective Kubeflow working groups. Then there's the common folder, which contains Kubeflow's common services, maintained by the manifests working group; these include services such as Istio, Knative, and even cert-manager. Then there is the contrib folder, which contains third-party contributed applications; these are maintained externally and are not part of a Kubeflow working group, at least not yet. There's also a distributions folder, which contains manifests for specific opinionated distributions of Kubeflow, but the plan is to phase this folder out: going forward, new distributions of Kubeflow will be developed outside of the Kubeflow GitHub org and the manifests repository.

To further clean up and consolidate manifest development, responsibility for maintaining each individual component's manifests was moved upstream to the respective repos. Previously, there were multiple versions of the manifests. You'd have one set living upstream in the component repo, used for development and testing or just to provide a standalone way to install the component. Then, in the manifests repo, you might look inside a component folder and find one set of manifests meant for kfctl and one meant for Kustomize, with some even depending on each other. Now that development has moved upstream, there is only one source of truth: the manifests repo simply copies a specific version of a component at a specific commit for each Kubeflow release. With these adjustments, among others, the manifests repo now provides an independent base Kubeflow deployment, making it easier for users to reuse much of this reference deployment and repurpose it for their platforms.

So how does this help? At the time of recording, these manifest changes are still pretty fresh, with Kubeflow 1.3 not yet released and some kinks possibly not yet worked out. But you can already see how this may improve the deployment experience. For one, there is improved accountability for maintaining component manifests, and in general component developers will have an easier time maintaining them. For example, in KFServing, all manifests are located in the config directory of the project repository. The default install is the standalone KFServing experience. However, to support integration with the rest of the Kubeflow platform, KFServing provides a Kubeflow overlay, from which a user can do a kustomize build piped to a kubectl apply to install a Kubeflow-compatible version of the component.
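As a sketch of that overlay workflow: the overlay directory name below is an assumption on my part, so look under config/ in the KFServing repo for the overlay that matches your release.

```sh
# Clone the KFServing repo; its manifests live under config/.
git clone https://github.com/kubeflow/kfserving.git
cd kfserving

# Build the Kubeflow-integration overlay and apply it to the cluster.
# The overlay path is illustrative; check config/ for the exact layout.
kustomize build config/overlays/kubeflow | kubectl apply -f -
```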
This leads into the next few points: the hope is a smoother deployment experience and increased modularity. Kubernetes users, again, are likely already familiar with Kustomize and kubectl. With these changes, one can more easily install individual components based on their needs, and distribution owners can build their opinionated distributions from a tested starting point. Even existing tools like kfctl can leverage this base of manifests. It should also now be clearer how the common manifests for services like Knative and Istio differ from what is provided upstream. For example, the manifests for the Kubeflow 1.3 release include Istio 1.9, and in the Istio 1.9 folder it's documented what changes were made to integrate Istio with the other Kubeflow components, making it easier for users to integrate their own existing deployments of these services. All of this should hopefully lead to an overall smoother deployment experience for the end user.

Now let's move on to another deployment option: operators. For those unacquainted, operators are a method of packaging, deploying, and managing a Kubernetes application. Kubernetes can handle stateless applications just fine by itself. However, when you start adding stateful components like a database or a monitoring system, Kubernetes does not know how to scale, upgrade, and reconfigure these kinds of things; that requires domain-specific, human operational knowledge. The idea behind operators is to replace what would be a human operator managing stateful applications with a software operator that has the knowledge to do all the functions a DevOps team would typically be doing. Through the use of a custom control loop and CRDs (custom resource definitions), along with the needed domain-specific knowledge, a well-made operator can implement and automate common day-one and day-two operations.

This brings us to the Kubeflow operator, which can deploy, monitor, and manage the lifecycle of Kubeflow. It was built using the Operator Framework, an open source toolkit that lets users more easily build, test, and package operators. Currently, the Kubeflow operator relies on kfctl and is built on top of a KfDef custom resource, though this may change in the future considering all the manifest changes. In any case, depending on your cloud environment, deploying Kubeflow to your cluster can be as easy as a few simple steps, since the operator is available on OperatorHub, a catalog of community-published operators. One benefit of using the Kubeflow operator is that it helps monitor all the child resources of your Kubeflow deployment, which can be a lot; if something goes down or gets deleted, the operator will bring it back up. You can learn more about using the Kubeflow operator at the kubeflow.org link listed here. So keep the operator in mind if you're planning to deploy and use Kubeflow.
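To give a feel for the operator workflow, here is a minimal, illustrative sketch of the KfDef custom resource the operator reconciles, assuming the operator itself is already installed (for example, from OperatorHub). Real KfDefs list many applications; the repo URI, application name, and path below are placeholders just to show the shape.

```sh
# The operator watches KfDef resources; creating one triggers a Kubeflow
# deployment, and the operator keeps its child resources reconciled.
kubectl create namespace kubeflow
kubectl apply -f - <<EOF
apiVersion: kfdef.apps.kubeflow.org/v1
kind: KfDef
metadata:
  name: kubeflow
  namespace: kubeflow
spec:
  repos:
    - name: manifests
      uri: https://github.com/kubeflow/manifests/archive/v1.2.0.tar.gz
  applications:
    - name: kubeflow-apps        # illustrative entry: each application
      kustomizeConfig:           # points at a kustomize package inside
        repoRef:                 # one of the repos listed above
          name: manifests
          path: stacks/kubernetes
  version: v1.2.0
EOF
```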
So now I will pass it off to Mofi, who will talk more about some of the barriers that may come with managing Kubeflow.

Thank you, Paul. In this part, we're going to talk about day-two operations. Day two can mean a lot of different things to a lot of different people. When it comes to Kubeflow, what we mean by day two is updates and upgrades of Kubeflow and its underlying components, security patches, troubleshooting when something does go wrong, and finally monitoring, to know how things are working and how resources are being utilized. One of the biggest challenges of Kubeflow, being an open source project that relies on a lot of other open source projects, is that it has a lot of moving parts. Each component has its own release cycle, and each component has its own update and upgrade path; the way you upgrade one component is not the same as the way you upgrade another. It's all built on top of open source technology that is itself changing very rapidly. And Kubeflow is deployed on Kubernetes, where Kubernetes itself comes from different vendors and platforms. Each platform has very minor differences that eventually add up to pretty big differences when you're talking about your Kubeflow deployment. What follows is by no means a complete list of things you can do to solve all your Kubeflow problems, but we can help you get started by showing some of the common issues and how we have solved them in the past. This should give you some idea of what to expect when you bump into these problems yourself.

One common case we see is that a component has released a new version that the Kubeflow release hasn't caught up to yet: the user wants the latest version of some component, but it isn't in the Kubeflow manifests yet. You might want that for several reasons, such as new features, security fixes, bug fixes, or other minor or major improvements. The concern here is that not every component can be updated in place. There can be breaking changes that either break interoperability with other components or break the assumptions Kubeflow made about that component in the first place. Also, if the component is backed by a database, it becomes even more difficult, because now you have to run a database schema migration and a data migration so you can keep using the same data. The way to go about solving this: review the release notes of the new component version to see what breaking changes are coming. Look at the upstream manifests to see if the new version's manifests are available, and try applying them with kubectl. If updating in place fails, you may have to reinstall the component by deleting its resources and re-applying the YAMLs.

Another common issue you might see is stale webhooks. Deleting a namespace does not delete the associated webhook configurations, for example those from cert-manager or Knative Serving. So if you are seeing "failed calling webhook" errors, it may be because you have some stale webhooks hanging about, and the way to resolve this is to delete them, using commands like the ones shown here.
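As a rough illustration of that cleanup, assuming cluster-admin access: the specific webhook configuration names below are just examples, so list your own first to find the stale ones.

```sh
# Webhook configurations are cluster-scoped, so they survive namespace
# deletion. List them to spot leftovers from removed components:
kubectl get mutatingwebhookconfigurations
kubectl get validatingwebhookconfigurations

# Delete the stale entries so API requests stop failing with
# "failed calling webhook" errors (names here are illustrative):
kubectl delete mutatingwebhookconfiguration inferenceservice.serving.kubeflow.org
kubectl delete validatingwebhookconfiguration config.webhook.serving.knative.dev
```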
Oftentimes, when you deploy Kubeflow, you access your Kubeflow dashboard by exposing a NodePort on the public IP of a cluster node. But that is not a secure solution. If you want your application to be used with production data, you want the added security of HTTPS. If you're able to update the service type of the Istio ingress gateway to LoadBalancer, you can use a certificate associated with a domain name that points at the external IP. Some cloud providers may already provide a TLS certificate, or you may be able to generate one for your service. If you can do that, create the certificate as a secret in the istio-system namespace and make it available to the ingress gateway. Then update the Kubeflow gateway: change the protocol to HTTPS on port 443 and add the TLS section. Once you have added the TLS section, Kubeflow and Istio will serve your traffic over TLS, and you'll be able to access your Kubeflow dashboard via HTTPS.

When you're trying to secure your Knative deployment for KFServing, you may need to take a couple of extra steps. For one, you will need a valid custom domain for Knative app routing, so you may need to update the Knative domain configuration. You may also need to add wildcard matching on the hosts of your Kubeflow gateway, because Knative adds subdomains on top of your domain name for each service it serves. And if you were using an older version of Kubeflow, you probably had to do some workarounds to make newer Istio versions (such as Istio 1.6) work with Kubeflow. I'm not going to go into the details of that process, because hopefully with the newer versions of Kubeflow coming out, this problem will no longer exist.
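Here is a minimal sketch of those steps, assuming the default gateway name and namespace (kubeflow-gateway in the kubeflow namespace) and an illustrative domain and secret name. With recent Istio versions the ingress gateway can read the secret via credentialName; older versions required mounting the secret into the ingress gateway pod instead.

```sh
# 1. Store the TLS certificate as a secret in istio-system:
kubectl create secret tls kubeflow-tls-cert \
  --cert=cert.pem --key=key.pem -n istio-system

# 2. Switch the Kubeflow gateway to HTTPS on port 443, with a wildcard
#    host so Knative's per-service subdomains are matched too:
kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: kubeflow-gateway
  namespace: kubeflow
spec:
  selector:
    istio: ingressgateway
  servers:
    - hosts:
        - "*.example.com"
      port:
        name: https
        number: 443
        protocol: HTTPS
      tls:
        mode: SIMPLE
        credentialName: kubeflow-tls-cert
EOF

# 3. Point Knative at your custom domain for app routing:
kubectl patch configmap config-domain -n knative-serving \
  --type merge -p '{"data":{"example.com":""}}'
```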
For updates and upgrades, there are a few different things to think about. Number one is Kubeflow component updates: as new versions of Kubeflow's underlying components come out, you have to be mindful of how you want to update those components. You also need to think about Kubernetes updates, which we'll talk about in a bit, and about bug fixes and security patches: what fixes come out for different components, and how you want to go about applying them. And finally, platform-specific changes. You might be managing your Kubernetes distribution yourself, or you might be using a managed Kubernetes distribution on any of the major cloud providers. If you're using managed Kubernetes, each platform can have minor differences in its storage classes and its networking, and you have to be mindful of what those differences mean; you may end up with a different Kubeflow installation per platform.

Another thing to be mindful of: if you want to persist your data, one good way to do that is using external databases. You can install a database as part of the Kubeflow installation itself, but if you're running that database inside the Kubernetes cluster, then whenever you upgrade you will have to (a) keep managing the database yourself and (b) work harder to update and upgrade it without losing data. If you can use an external database, it becomes much easier to update, or even completely delete and reinstall, the Kubeflow installation without losing your data, because the data is stored externally.

Kubeflow runs on top of Kubernetes, so you also have to adhere to the Kubernetes update policy. Kubernetes is on roughly a three-month release cycle, and the current version gets deprecated in about nine months. So if you are on Kubernetes 1.20 now, then by the end of 2021, Kubernetes 1.22 or 1.23 will be out and older versions will start getting deprecated. Deprecations in Kubernetes are announced about three or four versions in advance. So if Kubernetes is deprecating a v1alpha1 API and changing it to v1, or dropping some APIs, you will have about three to four versions, roughly a year of runway, before those versions go away. You have to be mindful of those changes as well.

So we've talked about some general practices to help mitigate problems, some common issues we have seen before, and how to start thinking about them. But if you hit a roadblock in your Kubeflow distribution, how do you go about finding information and finding help? Kubeflow has a public Slack where you can ask questions and connect with people who are facing, or have already solved, the same issues you are. You can open an issue on the GitHub repository; the Kubeflow team tries to help folks as much as possible, and filing issues also helps us improve the project, since we can fix the bugs as they come up.

But there is still the question of how to debug a failure, and that's difficult to answer because each problem can come from a number of different places. Some common steps to figure out exactly what the issue is: first, identify the source of the failure. Is it a failure in a Kubeflow component? Is it a failure in Istio? Or is it a problem in Kubernetes? One common way I like to deal with these kinds of problems is to look for similar issues. If an issue has been around for a while without resolution, this might not be the fastest way to a solution, but sometimes it can nudge you in the right direction. Look at the logs from the failing pod; sometimes the log itself will tell you exactly why it's failing. Some common log errors you'll see are "permission denied", "end of file", or a network connection timing out, and these can guide you toward where the error is coming from. In the Kubeflow Slack, you can ask around; there are always people online, and if someone else has seen the same problem, they can definitely help you out. And finally, raise an issue on the repo. This doesn't have to be your last step: if you're facing a problem and you think it's coming from the Kubeflow project or the software itself, definitely open an issue so the maintainers can take a look. This has definitely helped us improve the quality of the project over time.

Finally, how do you troubleshoot problems that are specific to individual platforms? Some issues can be platform-specific: each platform has minor differences that can lead to situations not covered by general Kubeflow docs. For those, definitely use the provider-specific docs, and in GitHub issues mention exactly which provider's Kubernetes you're using and what issues you're seeing. In the public Slack there are channels specific to individual vendors, so raise your issue there as well, and maintainers from those companies can look at it and help you out if they can.

Now, monitoring. Monitoring has a couple of aspects: number one is figuring out how much resource you are using in your cluster, and number two is figuring out whether your components are healthy. Kubeflow doesn't necessarily have any specific opinions on how you go about doing that.
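For both the log-triage steps above and basic health and resource checks, plain kubectl gets you surprisingly far. A minimal sketch, assuming metrics-server is installed in the cluster for the kubectl top commands:

```sh
# Health pass over the Kubeflow namespace: anything not Running or
# Completed deserves a closer look.
kubectl get pods -n kubeflow
kubectl describe pod <failing-pod> -n kubeflow
kubectl logs <failing-pod> -n kubeflow
kubectl logs <failing-pod> -n kubeflow --previous  # logs from the last crash

# Resource usage across the cluster (requires metrics-server):
kubectl top nodes
kubectl top pods -n kubeflow
```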
Kubeflow uses Istio underneath, and Istio has a lot of monitoring capabilities through tools like Prometheus, Grafana, and Alertmanager, so you can use all of those tools at your disposal to get information from your cluster and know exactly how much resource is being used and how your components are behaving.

The Kubeflow 1.3 release is right around the corner. With this release come many upgrades that will help resolve some of the issues we have been seeing in the past. For example, with the update to Istio 1.9, a lot of the Istio issues we were seeing before should be resolved, and since we're on a recent version of Istio, we'll be able to make full use of the upstream Istio docs as well. Finally, the manifests repository is being reorganized, as Paul mentioned, so installation of Kubeflow will be much easier, and it will be easier to understand from the manifests repository how Kubeflow works. The official installation will rely on kubectl and Kustomize. KfDef and kfctl will still be around, and it will still be possible to install Kubeflow using kfctl, but the official installation instructions will focus on Kustomize and kubectl.

Some reference links are here: the manifests repository, where you will find instructions on how to install Kubeflow, as well as community links, where you'll find our Slack and meeting links. And finally, the Operator Framework: if you want to install Kubeflow using the operator, as Paul mentioned, definitely check it out and try it.

Finally, I'd like to thank you for spending the last 20 to 30 minutes listening to us. I don't think this is enough to give you all the resources to tame the beast, but it will at least get you started, so that eventually we can all tame the beast and use Kubeflow for our AI workloads.