Hi everybody. I'm very happy, and I'm very glad to be able to meet you all at Kubernetes AI Day Europe 2022. The title of my talk is "A Deep Dive into Kubeflow Pipelines". You might have watched several talks about Kubeflow Pipelines before, but I presume most of those were highly data-scientist or data-engineer focused: how to write pipelines more efficiently, how to build components from scratch, or how to convert a Python function into a component and eventually build a pipeline. That is fine, but today I'm going to talk to you about Kubeflow Pipelines from, let's say, an ML engineer's point of view, an MLOps engineer's point of view, or even a DevOps point of view. I'll try to cover how Kubeflow Pipelines is composed, what components come along with it, how these components interact with each other, and eventually how they execute a pipeline that is submitted to Kubeflow Pipelines.

So let's enter into the talk. I am Senthil. I work as a principal software engineer at Ericsson, and my job there is primarily to architect cloud-native AI/ML platforms. These are platforms that are highly distributed in nature and use Kubernetes as the underlying platform for compute and other resources. Apart from work, I take time to pursue other aspirations of mine. For instance, I am an organizer of Kubernetes Community Days Chennai, which is going to happen on the 3rd and 4th of June this year. I am the maintainer of an open-source project called kube-fledged; this project is an operator that helps you cache container images directly on the worker nodes of a Kubernetes cluster. And I am an occasional speaker, I would say; I am not, you know, very active in speaking.
But whenever I talk, I love to talk about Kubernetes and cloud-native technologies, and very recently I have also picked up an interest in talking about MLOps. I am also a tech blogger; you can read my blogs on Medium, though nowadays I am not that active in tech blogging due to my preoccupation with organizing Kubernetes Community Days Chennai. I am fairly active on social media sites like Twitter and LinkedIn, so do check out my profiles on these platforms.

So let's get into the talk. The agenda for today is actually very simple. I'm going to talk about ML workflows and the various ML pipelining tools, and then I'm going to pick out Kubeflow: what are the platform components that comprise Kubeflow Pipelines. I'll talk at length about the Kubeflow Pipelines architecture, that is, the various components that make up Kubeflow Pipelines and how they interact with each other, and I'll try to dig deeper into it. I'll also talk about what an Argo workflow executor is, and about other notable features of KFP, Kubeflow Pipelines. Finally, I'll finish with a very simple Kubeflow Pipelines demo, which will help us understand the theoretical part we see during the talk.

And by the way, we know very well what an ML workflow looks like, right? There are distinct steps, and each step performs a certain portion of the overall work that needs to be done. Each step is also self-contained: it has its own distinct set of inputs and its own distinct set of outputs. The input can be a very simple parameter like a string, an integer, or a float, or it can be a huge dataset stored somewhere in a data store.
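To make that concrete, here is a tiny plain-Python sketch of two such self-contained steps; no pipeline framework is involved, and the step names and the toy dataset are made up for illustration.

```python
# Each "step" is a self-contained function with its own distinct inputs
# and outputs, mirroring the steps of an ML workflow.

def preprocess(raw_rows):
    """Step 1: input is a raw dataset, output is a cleaned dataset."""
    return [r for r in raw_rows if r is not None]

def train(clean_rows, learning_rate: float):
    """Step 2: inputs are a dataset plus a simple parameter; output is a 'model'."""
    # Stand-in for real training: the "model" is just the mean of the data.
    return {"weights": sum(clean_rows) / len(clean_rows), "lr": learning_rate}

# Executing the workflow is just wiring one step's output to the next step's input.
clean = preprocess([1.0, None, 2.0, 3.0])
model = train(clean, learning_rate=0.01)
print(model["weights"])  # → 2.0
```

A pipelining tool takes over exactly this wiring, plus scheduling, retries, and tracking of each step's inputs and outputs.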
Similarly, the output can be a very simple file, or it can be a huge dataset that is, for instance, pushed into Kafka or stored in MinIO object storage. Whatever it may be, we all know that machine learning systems are typically workflows, so you need to build a machine learning system in the form of a workflow. As execution proceeds through the workflow, distinct work is done in each step: data gets processed, a model gets built, eventually the model is deployed into production, and then monitoring happens, where drift detection and similar concerns come into play.

Now, whenever we talk about workflows, not necessarily ML workflows but any workflows in general, we make use of pipelining tools. For instance, if you are from a DevOps CI/CD background, you know we need a tool like Jenkins to run CI/CD workflows. Similarly, in the ML world, to execute ML workflows we need ML pipelining tools; that is how we can be more productive. There is a plethora of tools available for building ML workflows and running them in production, and today I am going to focus on one single tool called Kubeflow.

Kubeflow, by the way, is an open-source project that provides you not only with pipelining capabilities but also with a gamut of features and functionality you would expect from an end-to-end machine learning platform. For instance, there is KServe, which takes care of serving models in production at scale and provides features like A/B testing, multi-armed bandits, and things like that. Kubeflow also provides development capabilities, where you can use Jupyter notebooks with various machine learning frameworks to develop your model.
It provides you with capabilities for training your model, retraining your model, and things like that. But for this talk, I will focus only on Kubeflow Pipelines.

Okay, so Kubeflow, as I said earlier, is branded as the machine learning toolkit for Kubernetes. It is highly Kubernetes-native: it makes use of many features available in native Kubernetes, and that's why I call it Kubernetes-native. Kubeflow, by the way, started as an open sourcing of the way Google ran TensorFlow models internally. We know TensorFlow is a very popular, widely used machine learning framework, and once a TensorFlow model is developed, you need to run it. Google was internally using some of the features you find in Kubeflow today to run their TensorFlow models. In fact, it began as just a simpler way to run TensorFlow jobs on Kubernetes; it aimed to remove the complexities associated with running TensorFlow jobs on Kubernetes, and that is how it all started. Since then, Kubeflow has expanded into a multi-architecture, multi-cloud framework for running end-to-end machine learning workflows. By end-to-end I mean it caters to each and every step of a typical machine learning lifecycle, starting from data exploration, or even from defining your model accuracy and metrics criteria, up to deploying the model and monitoring it in production. It offers an end-to-end platform, with components, as I said earlier, for every stage of the ML lifecycle: exploration, training, deployment, monitoring, retraining, and so on. So what are the installation options available for Kubeflow?
You can install Kubeflow Pipelines as a standalone platform, or you can install the complete Kubeflow platform and then use only the Kubeflow Pipelines part of it. There is a third option: you can consume Kubeflow Pipelines as a fully managed service, which is offered by Google Cloud AI Platform Pipelines. And if you are trying out Kubeflow Pipelines just for testing purposes, you can also install it on local Kubernetes distributions like K3s.

Now, when we talk about Kubeflow Pipelines, it is predominantly built of four components. First and foremost, there is a user interface for managing and tracking your machine learning experiments, jobs, and runs. There is a core workflow engine that performs the hard work of executing the workflow; we will talk later about what this engine is made of. A third, very important part of Kubeflow Pipelines is the SDK, which you use to write your pipelines and to build reusable components that can then be used across different pipelines. There is also a REST API, so if you want to consume KFP in the form of REST APIs, that is available; if you want to use the SDK, that is also possible; or you can just use the UI to submit jobs and then inspect the artifacts and so on. KFP also provides some built-in notebooks so you can easily interact with KFP using the SDK. So let's spend more time on this slide, where you see the architecture of Kubeflow Pipelines.
Okay, so at the top you have the UI, which is served by the pipeline web server. The UI itself has several capabilities. You can submit a pipeline from the UI, and once you have run a pipeline you can see the history of the runs and various metadata; you can drill deeper into the job history and see which steps were executed, what the input and output of each step was, and even where that output is stored. You can also use it for debugging and things like that. There is also a capability to visualize runs: if you are training your machine learning model with various hyperparameters, for instance, you can visually see how the model performs with each set of hyperparameters. So the UI caters to a wide set of features, and that is one good thing about Kubeflow Pipelines.

Underneath, you have the orchestration system, the primary orchestration engine that performs all the hard work necessary for executing a Kubeflow pipeline. On top of everything you have the pipeline service. The responsibility of the pipeline service is this: whatever pipeline you submit to KFP, the pipeline service interprets it. It understands the Python DSL defined for writing pipelines, parses it, compiles the pipeline code, and prepares the pipeline YAML. At every point in time it makes sure to store the metadata in the metadata database, which by the way is a MySQL database. And once it has
determined what actions or tasks have to be performed for a particular pipeline run, it goes ahead and creates the Kubernetes resources required for executing the pipeline. In KFP, each step of the pipeline is executed as a Kubernetes pod: each step has a container image, and that container runs inside a pod. So essentially, whatever Kubernetes resources are necessary to execute the pipeline are created by the pipeline service, and it is the job of the pipeline persistence agent to persist all of this, the state of these resources and the output they create, into the metadata store or the artifact storage.

Let's move on. Underneath the orchestration system you have a bunch of orchestration controllers. Kubeflow Pipelines is built in such a way that it can support multiple orchestration controllers, and the primary controller used for task-driven workflows is Argo Workflows. Argo Workflows is a separate CNCF project for executing workflows, and you will also see instances where ML pipelines are written directly as Argo Workflows, using its generic constructs. In Kubeflow Pipelines, on the other hand, you have the pipeline service, an SDK (there is a v2 version of the SDK), a DSL, and a DSL compiler, and you get all of that on top. But for this presentation we will stick to the Argo workflow controller. And yes, once the resources are created, whatever output is produced by these resources, and by resources I mean the pods that actually execute the steps, is eventually stored in the artifact store; by default that is MinIO, and there is an option to use other artifact stores as well.

So let's keep moving: choosing an Argo workflow executor. As I said earlier, Kubeflow Pipelines runs on Argo Workflows, so Argo
Workflows is the primary workflow engine that actually executes the ML workflow. You can either use the Docker executor for Argo Workflows or the newer emissary executor, and by the way, the emissary executor is the default executor from version 1.8.0 onwards. The Docker executor has some limitations. For instance, it supports only the Docker container runtime, and we know very well that in version 1.24 of Kubernetes the dockershim has been removed (1.24 is already out), which means the Docker executor can be used only with older versions of Kubernetes. And from a security perspective, since it needs privileged access to the Docker socket on the host, such a solution is not preferable in production. The emissary executor, in contrast, supports any container runtime and is also more secure. So going forward it is the emissary executor by default, as it already is from version 1.8.0 onwards.

Now, some other notable features of KFP. I wanted to cover these because they will help you understand KFP more deeply. First, KFP provides out-of-the-box multi-user isolation for pipelines. By the way, this is available only in the full Kubeflow deployment; it is not yet available in the standalone KFP deployment. Basically, this feature allows you to separate the Kubernetes resources of multiple users. You can create multiple profiles, and each profile is mapped to a Kubernetes namespace. If you create a user profile, then whenever that user runs a Kubeflow pipeline, the resources created for that pipeline run are created only in that particular namespace. So, for instance, when you are sharing a Kubeflow instance among multiple users, this provides very good isolation, and
another good feature is step caching. We saw that a pipeline is executed in multiple steps. Let's say you create a pipeline run, and then you recreate the run, this time modifying only the hyperparameters, and let's assume that modification affects only one particular step. Step caching makes sure that the steps that ran previously do not get executed again: it elegantly reuses the already cached output of such a step and skips its execution. This speeds up the execution of the pipeline and uses its resources more efficiently. You can control when cache invalidation should happen and when caching should be disabled, and you can also enable or disable the caching feature altogether.

Another feature, recently introduced in SDK v2, is the pipeline root. This essentially represents the artifact repository where the pipeline stores its artifacts. Originally only MinIO was supported, and at that only the MinIO packaged along with Kubeflow Pipelines; that was the only way to store your artifacts. Now you have three options: you can use the packaged MinIO, you can bring your own MinIO or any S3-compatible object storage, or you can even use GCS, Google Cloud Storage.

All right, now let's get into a quick demo. For the demo, this is how the pipeline is going to look. The first step trains the initial model, and we receive a candidate model; further on, we retrain the model with more data so that we increase its accuracy. Once we get the retrained model, we run model prediction on it and then calculate metrics from the data produced by the prediction. If the metrics are within the acceptable criteria, the retraining is stopped; otherwise training is re-triggered and the retraining
happens again, and this continues in a loop until the model accuracy meets our expected criteria.

So let us go into the demo. Let me end the slideshow, and before I open the UI, let me show the list of pods that are running for a Kubeflow Pipelines installation. Here you can see MinIO, which is the artifact repository; the MySQL database, which is the metadata store; and the workflow controller, which is the Argo workflow controller (this installation has only the Argo workflow controller). This is the pipeline service, which accepts the pipeline and then creates the various Kubernetes resources, and this is the pipeline persistence agent, which persists all the Kubernetes resources, with their inputs and outputs, into the ML metadata store. The scheduled workflow component is used whenever we need scheduled workflows rather than one-time workflows; the scheduling is taken care of by this component. And there are a bunch of UI-related components: the pipeline UI, the pipeline viewer CRD controller, and the pipeline visualization server. The visualization server crunches the data from the ML metadata server and creates the visualizations necessary to evaluate the performance of the model.

Okay, now let's get into the KFP UI. Along with the installation of KFP, some default pipelines are installed, and today I'm going to use one such pipeline, which is the pipeline I explained on the slide. This is how it looks graphically, and I'm going to run this pipeline by clicking on Start. Once I do that, a new run is created; I can click on this run, and it shows me a visual graph of the progression of that particular run. As you can see, this step has completed, and you can see that this step produced two
output artifacts: a table, which was stored in the artifact repository, and the logs. We can also see the Kubernetes pod that was created for executing this step, and the events that were generated; if there are failures, we can look at the events and try to figure out what went wrong. What is happening now is that the data transformation steps have completed, the initial model training has also completed, and in this step the initial model as well as the dataset are being passed as inputs, while the outputs are the trained model along with the model config plus the logs. Again, we can see the pod that was created and the logs produced by the container that ran the step. There are some details: the step succeeded; under volumes, no volume mounts were used for this step; and no visualizations for this particular step. Let's see what's under ML Metadata: it says the corresponding ML metadata was not found. Meanwhile, the pipeline has progressed to the point where it has calculated the metrics and decided that they are not as per expectations, so it is triggering a retraining. The retraining has concluded, and it is again running prediction on the retrained model; for this prediction it uses the retrained model as well as data from the dataset. Let's see the output of this prediction: yes, the prediction output is available, and a metrics calculation is going on. Let's see its result: the calculation has determined that the expected condition has been reached, so the run has completed. You can see the green tick mark saying it executed successfully; the pipeline ran successfully. And if we come here, we can see all the pods that were created by Argo Workflows in order to execute the pipeline: for
every step in the pipeline, you will see a corresponding pod. You can also use kubectl commands to look at these pods, the logs they produced, and their events, the same information you saw in the UI. This was a very simple pipeline, of the kind you typically find during the model exploration and development phase, and we saw that Kubeflow Pipelines was able to execute it successfully.

That is pretty much what I intended to talk about. I really hope you enjoyed the talk and that its content will be useful. By the way, if you have any questions about this talk, feel free to post them as text questions in the corresponding Slack channel, and I will make sure to provide an appropriate reply. Enjoy the day, enjoy the rest of the talks, and see you soon. Thank you so much. Bye!
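As a quick sketch of those kubectl commands (they assume a live cluster with KFP installed in the `kubeflow` namespace, and `<step-pod-name>` is a placeholder for whatever pod name your run produced):

```shell
# Pods created by the Argo workflow controller for pipeline runs carry
# a workflow label, so you can list just the step pods:
kubectl -n kubeflow get pods -l workflows.argoproj.io/workflow

# Logs of one step's pod ("main" is the container that runs the step):
kubectl -n kubeflow logs <step-pod-name> -c main

# Events for the same pod, useful when a step fails:
kubectl -n kubeflow describe pod <step-pod-name>
```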