Hi guys, I'm Pete. I work in the AI Center of Excellence with some of these fools sitting up front, and today we're going to talk about Kubeflow, which is an interesting and fairly new project. I got involved with it about a year ago, around this time. We're going to talk about what it is, what it means, and how it can be used in conjunction with OpenShift.

Basically, we'll set the table with a little bit about data science, then delve into the challenges associated with doing DevOps for machine learning. There have been lots of machine learning talks already at the conference, but it's more than just notebooks. You've seen some of the demos and examples of notebooks — that's an important part of it, but there's more to it than that. We'll talk about the project, do an overview of the core components in Kubeflow, and finally cover some specific considerations if you're just taking Kubeflow and the information we have on the community site: there are a few little gotchas you have to understand when deploying it on OpenShift. And then a live demo — I think we're good for that.

Recently I've heard a lot of discussion about AI winters. What that refers to is this cycle over the past few decades of "okay, now we're ready for AI, now we're ready for machine learning," and then it falls apart: companies and institutions get invested, and it doesn't really live up to the promises. But in 2019, obviously, things have changed dramatically. In the modern age there are a lot of well-developed techniques, such as CNNs (convolutional neural networks) and other approaches to machine learning. And in the age of the internet, with how we're connected through various types of devices and environments, there are many more applications for machine learning than there were in the previous go-arounds and iterations.

Many of these you're very familiar with — they inform our lives today. When you go on YouTube and look around at some videos, it suggests other videos you might be interested in. How is it doing that in the background? That's really where machine learning is doing the work. There's image, voice, and video recognition, character recognition, and important applications for companies in security-related areas, like detecting fraud at banks and financial institutions, or network intrusions from hacking. And then there's the notion of ranking. Last night we were looking for a restaurant, and TripAdvisor gave us a recommendation for the number one restaurant in Brno. Almost certainly that wasn't just based on reviews — it's also informed by machine learning. So, lots of applications.

We have these new demands and applications, but the other sea change with respect to machine learning is the modern hardware we have available. At the point I wrote this slide, the fastest Intel Core CPU — I'm not sure if that's still true — did almost 2000 gigaflops on a single core. So that's the CPU space.
We can make use of that for machine learning, but we also have GPUs. GPUs have been around for a long time in the gaming space — especially at a conference like this, lots of us have built rigs for video gaming and put in GPU cards. It turns out the things those cards are good at, like rendering at high frame rates, also make them excellent for general computation such as matrix multiplication. And in conjunction with the GPUs, we have higher-level abstractions and APIs that have been provided for the GPUs, which enable machine learning applications.

Kind of like the CPU slide, at the time I wrote this the NVIDIA V100 — the mezzanine series — was state of the art in terms of pure gigaflops and the processing it does. What's different between the CPU and the GPU is that the GPU is very good at parallel processing. CPUs of course are general-purpose across various types of use cases, but GPUs are very good at particular kinds of, as they say, embarrassingly parallel computations.

Let's see if this YouTube video will load. We don't have the sound, but I threw this in because I love this video. It's from about three years ago, and it's about an enterprising young man in Japan whose parents owned a cucumber farm. He went away and studied engineering and computer science, and when he came back they had this sort of problem — not really a problem, but a lack of efficiency. There are actually many different types of cucumbers, and they're priced differently; various things go into "this is that type of cucumber, it's good for this, and it costs this much." So this enterprising young man put together a system using TensorFlow and deep learning to basically create a machine that sorts the cucumbers for them. Whoops — let's get rid of that and go back to my presentation. Sorry about that.

The reason I threw that in is that it's a fascinating example of how machine learning has become almost commoditized. The CPUs for doing machine learning, and even some of the lower-end GPUs that can do a fair amount of high-end processing, are fairly affordable. That's really enabled this explosion of interest in machine learning.

I can't take credit for this next slide — I stole it from the workshop we did on Friday with Kubeflow. It's a pretty good demonstration of a simplified machine learning workflow and pipeline. There are different types of tasks involved for data scientists, and for the people who help data scientists deploy their work and give it out to the world. There's the notion of dealing with the datasets that are fed into the machine learning algorithms — keeping track of those is difficult, and the data sometimes has to be cleaned and prepared. Then you go into an iterative cycle of model development and training. If you have to manage all that in a manual fashion, it can be cumbersome and inefficient.
And then there's the case where you've derived a model and you want to serve it, and actually have data hit it, so that probabilistic computations can be made against that model. There's the notion of reproducibility — it worked for me in my environment; here, you use it; how do we make sure it's going to work in that other environment? And also scalability. These are some of the challenges in a pipeline for machine learning.

If we take that pipeline and decompose it into the individual steps I talked about, what's important to understand is that different environments and systems are well suited for different steps. Working with a notebook might require one type of environment, but if you're actually training or serving the model, maybe what you need is a different, GPU-enabled environment. GPU time is typically more expensive, and the physical hardware for GPUs costs more than typical CPUs. So you begin to see that we need some way to organize these components while still allowing a certain degree of flexibility. It's not good enough, or sufficient, to have a monolithic pipeline that only works in one type of environment.

There are all these different pieces we need for an ML platform, and over the years it's been difficult for any one project or institution to decompose this problem and break it down. This is where Kubeflow got started. It was the idea of a couple of Google engineers, who are still involved with the project, and the idea was to start with a high-level mission statement. Whatever we build for this platform, it should be portable: if I'm doing development on my laptop or on bare metal, that's fine, but it should also work just the same way in a cloud environment like GCP or EC2, what have you. It should be scalable: I'm doing development on one machine, my laptop, and I should be able to scale that out so I get the processing power of hundreds of hosts, hundreds of nodes. And it should be composable: microservice architecture has been with us for many years now, it's not going away, and it's a good model in terms of having a well-defined separation of concerns between the different components we saw in the machine learning pipeline.

So it turns out that OpenShift, and Kubernetes upon which OpenShift is built, are actually a really good platform for taking on this kind of challenge. That was part of the idea for these Google engineers.
They'd done some of this with the infrastructure at Google, back at the time Kubernetes was being developed. What they did was take the ideas and patterns they had developed for these pipelines — patterns that were actually used in the YouTube recommendation platform — and bring them forward as an open source community project. That is what Kubeflow is today.

The mission for Kubeflow is basically to make deployments of machine learning workflows, specifically on Kubernetes, simple, portable, and scalable. Kubernetes is the important point: it's not trying to be something that's deployed in other types of environments, like purely on VMs or something like that. It's organized around Kubernetes specifically, for the reasons I previously mentioned.

Another important point about Kubeflow is that it's not trying to reinvent the other machine learning ecosystem projects. We have TensorFlow, we have PyTorch. What Kubeflow does is aggregate these projects and stitch them together to render that idealized platform where we have a comprehensive workflow or pipeline. It's not trying to build yet another Python API for machine learning; it incorporates all the most popular ones.

And finally, anywhere you're able to run Kubernetes, you should be able to run Kubeflow — at a minimum with the CPU processing power available to the nodes. Kubernetes also has enablement for NVIDIA GPUs, so pods can be scheduled specifically onto GPU nodes.
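As a rough illustration of what that GPU scheduling looks like at the pod level — this is a generic Kubernetes device-plugin example, not anything Kubeflow-specific, and the pod name is made up:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-trainer                  # hypothetical pod name
spec:
  containers:
  - name: train
    image: tensorflow/tensorflow:1.12.0-gpu
    resources:
      limits:
        nvidia.com/gpu: 1            # the NVIDIA device plugin advertises this
                                     # resource, so the scheduler places the pod
                                     # only on a node that actually has a GPU
```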
As I mentioned, that's generally true for OpenShift as well — Kubeflow works on OpenShift — but we'll get to it later: OpenShift is enterprise-class and has enhanced security features, and some of those need to be addressed when you install Kubeflow.

The community today is hosted on GitHub under an Apache 2.0 license. There are over 20 repositories — 23 to be exact — and in the core repo, which is called kubeflow itself, there are about 143 contributors and over a thousand commits. That's all happened in the span of about a year and change, about 13 months. We just released 0.4, we're doing planning for 0.5, and ideally there'll be a major 1.0 release in 2019. As a community we're trying to work through, in various ways — including a product management group we've created — what it means to be a 1.0 release, and a lot of that has to do with documentation and testing.

There have been a lot of integrations; some happened early on in the project, and new integrations are continuously happening. You see Google gets a little bit larger in that circle there. We have participation from Red Hat, but the project was started at Google, and the lion's share of community involvement is still from Google engineers. We've definitely had significant contributions from Microsoft, NVIDIA, Alibaba, Caicloud in China, Cisco, Canonical, GitHub — not just for the hosting, they've actually been making use of Kubeflow as well — Intel, and the Jupyter project.

At a high level, on this one page, we're going to delve into what I think are the important components. We have this notion of core components. We started out in the project with what was defined as a core, and the core was really organized around JupyterHub. That's still true today, but I put an asterisk beside JupyterHub: it looks like we're going to remove the JupyterHub spawner in lieu of an opinionated spawner that is much more Kubernetes-native, cloud-native, than what JupyterHub provides. There's Ambassador, another project in the Kubernetes ecosystem — I'm going to talk about that. And then there are some internal sub-projects: Katib for hyperparameter tuning, and Kubeflow Pipelines, which Google has just recently invested a lot of effort and manpower (if you can use that term) into. There's a training controller for launching TFJobs — that's one of the things we demonstrated in the workshop on Friday — and there's also TensorFlow Serving.

There are also optional components; Kubeflow has a fairly flexible deployment architecture, enabled by ksonnet, and I'll talk about that. There's Argo, another Kubernetes ecosystem project, used for workflow management. There's a PyTorch operator. There's Seldon — I'll talk about that in detail — and Pachyderm. The takeaway here is that there are lots of different sub-projects, with a bit of overlap between some of them, but the idea is that each one contributes to the comprehensive machine learning platform I was talking about earlier.

Most of you, I think, have made some use of OpenShift, and you understand the notion of resource objects and how they're represented in JSON and YAML. There's a Kubernetes ecosystem project called ksonnet that's used heavily in the Kubeflow project. It's based on the Jsonnet language, and it allows us the flexibility of having very powerful expressions around what is deployed, how it's parameterized, and how it interacts with the other components in the platform. The key thing — the reason it's called ksonnet — is that there are native definitions in ksonnet for things we recognize from OpenShift and Kubernetes: pods, services, cluster roles, role bindings, things like that. That's what we use to define the integration of the Kubeflow project, and that's what we use to generate, ultimately, the YAML that produces a running Kubeflow environment.

The important thing to understand is that there's nothing magic here: what it renders is YAML you would recognize from any other type of interaction you've had with Kubernetes. You'd see pod definitions, you'd see service definitions — it's all the same. The way it's originally expressed is in this particular Jsonnet DSL, but what you get is something you can still manipulate yourself.

Another reason we use ksonnet is portability — we talked about wanting Kubeflow to be portable, and ksonnet gives us that general ability. We can specify that this is the deployment we want for a dev environment, tweak it slightly — maybe different S3 endpoints or something like that — for our test or staging environment, and then change it again for production. So it allows us to heavily parameterize the different components for these different environments. It's also basically resource-aware: if you make a change in your ksonnet for Kubeflow — I update parameter X, whatever it is — all you have to do is re-apply that change and it'll go through and update the deployed components. And finally, you can tear down that environment with ksonnet: you can have it delete all those resources and just start again.
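From memory, the ksonnet flow for standing up Kubeflow around 0.4 looked roughly like the sketch below — treat the registry path, package names, and parameters as illustrative rather than exact:

```sh
# create a ksonnet app and pull in the kubeflow packages
ks init my-kubeflow && cd my-kubeflow
ks registry add kubeflow github.com/kubeflow/kubeflow/tree/v0.4.0/kubeflow
ks pkg install kubeflow/core@v0.4.0
ks generate kubeflow-core kubeflow-core

# one ksonnet app, several environments with their own parameter overrides
ks env add dev
ks env add prod
ks param set kubeflow-core jupyterNotebookPVCMount /home/jovyan --env=dev

ks apply dev      # deploy; re-run after any parameter change to update in place
ks delete dev     # tear the whole environment down and start again
```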
Notebooks are very important for the project. Like I said, we currently use JupyterHub and KubeSpawner for launching our notebooks, but that is about to change in the next couple of months. We're going to go to a simplified web app that will have better integration with the underlying Kubernetes RBAC security model, and do away with some of the JupyterHub machinery. The spawners in that collection were designed for a variety of different environments, and the community has come to the conclusion that we're probably better off with something highly opinionated for spawning the notebooks.

The other thing the Kubeflow project does is curate TensorFlow CPU and GPU notebook images, going back to TensorFlow 1.4.1, and we stay current — I believe we have 1.12.0; I don't know if we have the very latest version. These are based on the default builds from the TensorFlow project. Some of my colleagues in the AI Center of Excellence are doing enhanced builds of TensorFlow for the different RHEL-family platforms — Fedora, CentOS — but the ones currently in the Kubeflow community are derived, without any specific changes, from the wheel files created by the TensorFlow project. We've set up these notebooks so that they're oriented around TensorFlow, with some visualization libraries in there that are useful for data scientists, but users have the permissions to go ahead and install any supporting packages they might deem interesting or necessary for their work.

We also have a notebook image adapted for Kaggle. Kaggle, of course, is where you have these competitions for machine learning — some of them actually pay significant prizes — and we've adapted their notebook for Kubeflow. It's a very large image — it's over 20 gigs — but that's because Kaggle kind of wants the kitchen sink in there: not only do they have TensorFlow, they have PyTorch and all the other libraries as well. And just before the end of last year, we did an adaptation of a new project from NVIDIA called RAPIDS AI. RAPIDS is basically a set of Python libraries specifically designed for GPU acceleration; they include dataframe manipulation, various types of machine learning algorithms, and a library for graph processing.

In terms of deploying Kubeflow, we make use of Ambassador. That's an ingress controller based on the Envoy project, which has been very successful — Istio is also based on Envoy. It gives us a cloud-native capability for ingress into what we call the Kubeflow cluster: there's the Kubernetes or OpenShift cluster the components are deployed on, and then there are the various Kubeflow components in there. So Ambassador is our ingress controller. In OpenShift it serves something of the same role as the HAProxy router, so that's an area we'll probably explore more this year — making that deployment option more configurable. It includes a reverse proxy, and it uses annotations — Kubernetes annotations on the components — to do the URL mapping to services. That's how we get the links to JupyterHub, the TFJobs UI, and the pipelines UI. Currently it integrates with Google IAP for authentication, it can also integrate with cert-manager, and I think there's more work being done in Ambassador for integration with other types of auth providers.
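To give a flavor of that annotation-driven mapping — this is the Ambassador v0 style in use around that time; the service name, prefix, and ports here are illustrative, not the exact Kubeflow manifests:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: tf-hub-lb                      # hypothetical service fronting JupyterHub
  annotations:
    getambassador.io/config: |
      ---
      apiVersion: ambassador/v0
      kind: Mapping
      name: tf-hub-mapping
      prefix: /hub/                    # URL path Ambassador routes...
      service: tf-hub-lb.kubeflow      # ...to this in-cluster service
spec:
  selector:
    app: tf-hub
  ports:
  - port: 80
    targetPort: 8000
```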
We also make use of Argo. Again, you'll see this pattern in Kubeflow: we make use of different projects that have Kubernetes-native capabilities, and Argo is one of those. It's an open source project provided by Intuit — a CRD and an operator — and we actually use it in a variety of ways within the project. At the application level, it's used within the Kubeflow Pipelines project. There's also Kubebench, which is basically a benchmark harness for machine learning on Kubernetes. And for the community itself, it actually drives our pre- and post-submit workflows: even as a pull request is put up, it's run through a pre-submit test, and when we're ready to approve it and merge it into master it's tested again — all of that is controlled by Argo. The way it works is that you specify, in YAML, different directed acyclic graphs, and that's your workflow. It's good for organizing these jobs and pipelines where you have inputs and outputs, and this step in the workflow needs that other batch step to complete first. It's cloud-agnostic, but it's definitely designed for Kubernetes at its heart.

I mentioned Argo; the follow-on from that is the Kubeflow Pipelines project. Early on in the project there was a company and a project called Pachyderm that got involved with Kubeflow, and we provided an integration for Pachyderm. It does data pipelines and governance, basically using a Git check-in model. I suspect what's going to happen is that the Kubeflow Pipelines sub-project will overtake some of the capabilities previously provided by Pachyderm — we'll see how it plays out. Kubeflow as a project is fairly open and encompassing; we're ready to adopt various different types of ideas and components. But a year on, we're starting to see areas where there's a little bit of overlap, and this will probably be the challenge for 1.0: drawing a circle around Kubeflow and saying, what are the core components — what are the components that are truly necessary?

The Pipelines project has a UI for managing and tracking the experiments and the jobs, and it's operator-based — so again, it's doing scheduling of these machine learning workflows. And it has a Python SDK for doing an annotation-based specification of the pipelines and their components. So it's fairly powerful.
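A minimal sketch of what that SDK style looks like, based on the early kfp releases — the step names and images are made up, and the exact API surface may differ by version:

```python
import kfp.dsl as dsl
import kfp.compiler as compiler


@dsl.pipeline(name="demo", description="Toy two-step pipeline")
def demo_pipeline():
    # Each step runs as a container; these images are hypothetical.
    preprocess = dsl.ContainerOp(
        name="preprocess",
        image="example.io/preprocess:latest",
        arguments=["--out", "/tmp/features"],
    )
    train = dsl.ContainerOp(
        name="train",
        image="example.io/train:latest",
        arguments=["--data", "/tmp/features"],
    )
    # Declares the DAG edge; under the hood this becomes an Argo
    # workflow dependency.
    train.after(preprocess)


if __name__ == "__main__":
    # Compiles the decorated function into a workflow archive that
    # the Pipelines UI can upload and run.
    compiler.Compiler().compile(demo_pipeline, "demo.tar.gz")
```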
Again, it uses Argo under the hood as the workflow orchestrator.

Seldon is its own project, but it was also one of the early adaptations, or integrations, that we had in Kubeflow. Seldon is entirely about deployment and management of inference graphs, and being able to scale those out on a Kubernetes platform — so it's a perfect fit for Kubeflow. The way it's set up, it has specifications for different components, and you can organize those pipeline components in various ways: there's a transformer, a router, a combiner, an output transformer. And the way these would be used — oh my gosh, 10 minutes, okay — is for things like A/B testing and multi-armed bandits. These are known concepts in this space for testing: you have a variety of models, and you want to probe and inspect which of those models is performing the best. It supports TensorFlow and scikit-learn, with gRPC and REST interfaces. And interestingly, they use s2i — OpenShift s2i. Before Red Hat even got involved in Kubeflow or had anything to do with Seldon, they had discovered it on their own and found it an excellent utility for building these model wrappers that are then deployed as pods.
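The wrapper contract is pleasantly small. As a sketch — the class name and logic are placeholders — a Python model for Seldon's s2i builder looks something like this, which you'd then build with a command along the lines of `s2i build . seldonio/seldon-core-s2i-python3 my-model:0.1`:

```python
# MyModel.py — the class that seldon-core's Python wrapper loads.
# An .s2i/environment file tells the builder MODEL_NAME=MyModel.
class MyModel:
    def __init__(self):
        # Load weights/artifacts once at startup (placeholder).
        self.ready = True

    def predict(self, X, features_names):
        # X is the request payload (e.g. a numpy array); return
        # predictions of matching length. Identity used as a stub.
        return X
```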
So sometimes And certainly we do and then we use an internal open shift platform That's hosted for us called open shift Where we don't always have the permissions we'd like for deploying some of these neat new features like some of the operators We're interested in so that creates certain challenges friction points for us so, um The the reason i've been sort of reluctant to sort of blindly update the troublesuit troubleshooting section for the community pages is To quickly enable a deployment of kubeflow on open shift We can make use of certain cluster admin features where we say okay for this service account We'll give them any uid that's not Ultimately the right solution what we want to do is sort of work our way back To the image and have it properly developed like on the rat analytics fire project Where they take appropriate steps to have a user defined for Um the component that's running so that would that's the proper way to mitigate some of these things Having said that you know if you're interested in running this And getting up to speed quickly on it there are shortcuts and You'll see if you naively installed 0.4 on open shift 3 11 You would start going through and seeing these failed pods and if you look at the logs They'll say hey, I can't bind to port 80 And that's open shift stepping in saying no can't do that. That's a reserve port Um, and it's because of the uid that has or hasn't been in fact specified for that image I think ambassador has done some work to adjust that but I need to properly test them um, so you go through that thank you and um Then when you get to one of the components that is min i o Again, we fall on this thing where there's a problem with the uid In this case, it doesn't yet as of today have a service account So there's no trick we can do by saying add s s scc to user for that So we sort of give in and cry uncle and say okay system authenticate Again, that's not something I really want in the troubleshooting section necessarily So it's sort of a debate i'm having about what is correct information to have up there Finally, um, I mentioned catib. Catib is a sub project for hyper parameter tuning in cuplo and um Uh, it wants to write well There's actually the use of mysql pods in two different sort of sub components And they both run into the same problem where it's trying to write It's a var lib mysql storage to a place that needs to be at least in a rel environment You need to basically set the appropriate se linux permissions on that location So various things there, but that's generally it you know, it basically works It definitely helps to have cluster admin and in the workshop that marcel and I did We kind of for the benefit of all these users we'd set up for the workshop We had to make them cluster admin and that's just kind of the way things stand today So there's work to be done in 2019 for doing the adaptation Uh, I don't think I have time for a live demo I don't think so Oh, okay. Um, so let's Let's do a little demo. Um, so in the workshop we demonstrated tf jobs Here i'll just show you sort of the dashboard And this is our central dashboard. That's what you get when you, um Uh, when coup flow is deployed you have links to the website You have a link to jupyter hub and also the tf jobs ui catib dashboard And then a pipeline dashboard For the live demo we'll keep it simple. We'll just go to jupyter hub and let's see what we have I have a notebook running. That's too easy. 
I don't think I have time for a live demo... no? Oh, okay — let's do a little demo. In the workshop we demonstrated TFJobs; here I'll just show you the dashboard. This is our central dashboard — that's what you get when Kubeflow is deployed. You have links to the website, a link to JupyterHub, and also the TFJobs UI, the Katib dashboard, and a pipelines dashboard. For the live demo we'll keep it simple and just go to JupyterHub, and let's see what we have. I have a notebook running — that's too easy. We'll stop that, because I want to show you launching a brand new pod.

Okay, so this is what our version of the JupyterHub spawner looks like. This used to be done as a plain HTML form, and we have a pre-populated list — these are the curated images we host in the Google Container Registry. You can actually switch to custom and just type in any image location you want, though it's important to understand that there are certain scripts that need to be in place for the image to be launched from Kubeflow. We do a check that the jovyan PVC is in place, things like that. But let's spawn a new notebook — that image should be in place, and hopefully it won't take too long. Any questions I can answer while we're hourglassing? Clear as mud? Okay.

Let's pull up a new Python 2 notebook — we'll kick it old school. And then there's the MNIST hello world; we'll grab all of that. This is a very basic check to confirm that TensorFlow is installed in the pod and we can make use of it, so we'll just copy that in there. The MNIST dataset is basically a collection of digit images and labels, so it's a basic confirmation of whatever machine learning framework you're working with, and of the algorithms for doing a machine learning computation that predicts — makes a guess about — what each image is, based on the information it's processing from the model. And we'll run that and see what we get. Fingers crossed. Live demo... it's doing its thing, some warnings from TensorFlow. Now, we curate, as I mentioned, CPU and GPU images; this is running on a CPU-only node. And that last number at the bottom — can you see it there? — that's the accuracy of its guesses on the digit set.

And this is sort of the storefront window dressing for Kubeflow. There are a lot of other components; I simply don't have the time, and you might not have the interest, to look at all the other stuff, but that's basically it. No questions? It's good? Okay, whatever. Well, thank you for your time.
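For reference, the MNIST hello world the demo runs through is, in spirit, the classic TensorFlow 1.x softmax example — a sketch, not the exact notebook contents:

```python
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

# Load the MNIST digits and labels (downloads on first run).
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

# A single-layer softmax classifier: y = softmax(Wx + b).
x = tf.placeholder(tf.float32, [None, 784])
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
y = tf.nn.softmax(tf.matmul(x, W) + b)

# Cross-entropy loss against the true labels, trained with SGD.
y_ = tf.placeholder(tf.float32, [None, 10])
cross_entropy = tf.reduce_mean(
    -tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(1000):
        batch_xs, batch_ys = mnist.train.next_batch(100)
        sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

    # The final number printed in the demo: accuracy on the test set.
    correct = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
    print(sess.run(accuracy, feed_dict={x: mnist.test.images,
                                        y_: mnist.test.labels}))
```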