Hello everyone. My name is Samhita. I'm a software engineer and technical evangelist at Union AI, where we're working to make machine learning orchestration easy to use. I'm here today to talk about MLOps with Flyte: removing the barriers to successfully implementing machine learning for production workloads. The crux of this talk is how machine learning can be productionized using a Kubernetes-native platform, which in our case is Flyte.

Let's get started by understanding how MLOps differs from DevOps. To automate and smooth the process of software engineering, we follow a set of practices that we call DevOps. DevOps, or development operations, refers to bringing together the development, testing, and operational aspects of software development. The best practices include continuous integration and delivery, infrastructure as code, monitoring and logging, communication and collaboration, and microservices. Dev needs DevOps; likewise, ML needs MLOps. However, MLOps is a superset of DevOps. Why? Models can deteriorate over time: models and data are mutable, and they require constant experimentation and iteration. Data can be very complex: it can be huge, and it may need to be pulled from multiple data sources. There's infrastructure overhead: we have to take care of latency, higher processing power, and compute requirements. Next come hardware requirements: more often than not, ML code needs GPUs, so we have to set up GPUs, take care of memory requirements, and so on. And lastly, we have to take care of the scale of operations, meaning we have to ensure the scalability of machine learning models.

In addition to the previously mentioned challenges, we also have to think about the "ARRM" qualities: auditability, reproducibility, recoverability, and maintainability. Auditability is the ability to backtrack to the error source when a pipeline execution fails; it also includes data lineage, which means understanding the data that's being used and transformed over time. Reproducibility is the determinism of a run: are we able to reproduce the run given the same set of inputs? Recoverability is the ability to recover when there's a pipeline failure. And maintainability is the iterative development of machine learning pipelines and enabling team collaboration.

MLOps comes with technical debt. If you look at this diagram, ML code is only a tiny fraction of a real-world ML system. We have to take care of a lot of other components in MLOps: resources, serving, monitoring, and so on. Hence, we need a cohesive platform that brings together different components, teams, and models. Components, meaning the ones I mentioned earlier, such as serving and monitoring. Teams, because in ML we can have multiple teams spanning multiple domains, and those teams may want to work together and collaborate, so the platform has to support team collaboration — multi-tenancy, to be specific. And models, because in machine learning we usually don't work on a single model; we work on multiple models, so that has to be supported too. Moreover, we need an MLOps platform that is scalable, extensible, and reliable.

Meet Flyte. Flyte is a Kubernetes-native workflow automation platform for business-critical machine learning and data processes at scale.
So Flyte is built on top of Kubernetes, and it's a workflow automation platform that can be used for mission-critical machine learning and data processes. The goal of Flyte is to provide a reproducible, incremental, iterative, and extensible workflow automation platform as a service for any organization. Its focus is on user experience, reliability, and correctness.

If you look at this diagram, we're segregating machine learning and operations. On the ML side we have the user teams, and on the Ops side we have Flyte and the platform team. User teams usually consist of data scientists, ML engineers, and so on, and they interact with the machine learning frameworks. On the platform side, the Ops side, we have platform engineers and DevOps engineers who deal with the cloud, cluster management, and resource management. So Flyte provides the capability to separate ML and Ops — to separate their concerns and their responsibilities — and that's a very important requirement of MLOps.

The basic entities of Flyte are the workflow and the task. A workflow is a sequence of processes through which a piece of work passes from initiation to completion, and a task is a fully independent unit of execution and a first-class entity of Flyte. Simply speaking, a workflow is a collection of tasks arranged in a specific dependency graph. In this diagram, task two depends on task one, and task three depends on task two, which in turn depends on task one. That's how we model a workflow out of tasks.

Let's look at the challenges of ML orchestration that Flyte solves. First: develop incrementally and constantly iterate at scale. Let's walk through the steps in the diagram to understand what this means. The first step is to start with a job, be it a Spark job, a training job, or a feature engineering job. The second step is to scale the job. The third is to create a pipeline and test it locally. The fourth is to execute the pipeline on demand. The fifth is to run the pipeline on a schedule or in response to an event, and the sixth is to retrieve the results, the outputs. This is the generalized pipeline architecture that Flyte supports, which in turn helps you develop code incrementally and constantly iterate at scale.

Flyte also supports a centrally managed infrastructure that the platform teams can take care of — again, a separation of concerns: the platform teams focus on one centrally managed infrastructure, and the user teams focus on the machine learning models. Flyte also gives you access to resources like GPU, CPU, and memory through code, so it's infrastructure as code. It supports automatic scaling, knobs to control costs, a multi-tenant architecture, and so on. And for efficiency, it supports spot machines and model checkpointing, which we'll look into later.

Next: ML experiments require similar algorithms with different parameters or inputs, and hence the behavior can change based on the inputs. If you look at this diagram: in the first graph the input is M1, and this is the dependency graph we get. For input M2, the dependency graph is the same as for M1. But for input N, the dependency graph is totally different — there's an additional load step, and the dependency structure differs. This is a very common scenario in machine learning, and in Flyte it can be addressed using the @dynamic decorator.
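As a rough sketch of what that might look like in Flytekit — the task names and bodies here are illustrative placeholders, not code from the talk — a @dynamic workflow lets the shape of the graph depend on the runtime inputs:

```python
import typing

from flytekit import dynamic, task, workflow


@task
def load(source: str) -> int:
    # Stand-in for pulling data from a source.
    return len(source)


@task
def train(size: int) -> float:
    # Stand-in for a training step.
    return size * 0.5


@dynamic
def fan_out(sources: typing.List[str]) -> typing.List[float]:
    # The loop body is evaluated at execution time, so the number of
    # load -> train branches depends on the list the user passes in.
    results = []
    for s in sources:
        results.append(train(size=load(source=s)))
    return results


@workflow
def pipeline(sources: typing.List[str]) -> typing.List[float]:
    return fan_out(sources=sources)
```

Calling `pipeline(sources=["a", "bb", "ccc"])` locally runs it like plain Python; on a cluster, Flyte compiles the dynamic node's graph when it actually executes.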
Next comes memoize, recover, and reproduce. Let's look at the diagrams again. We have three nodes here; the first two are successful, and for the third one we're storing its output in a cache. When we run this pipeline again with the same input, the first two nodes run successfully as before, and for the third node Flyte fetches the output from the cache instead of re-executing it. That's the advantage of caching.

Next comes the recover feature of Flyte. Again we have a pipeline: the first two nodes succeeded and the third node failed. When we run it again with the same input — not just re-running it, but hitting the recover button that's available in the Flyte UI — the successful executions are copied over, and Flyte starts running from the failed executions without re-running the successful ones. That's the advantage of the recover feature. And there are also retries: you can set retries in the task decorator, and when you do, Flyte tries to run the task again in case it fails.

Next comes collaboration and organizational scaling. In this diagram we have two teams, Team A and Team B. Team A has two pipelines that use a critical or complex algorithm, and Team B has one pipeline that reuses the critical algorithm built by Team A, and another pipeline that reuses a whole pipeline built by Team A. This kind of reusability is implicitly provided by Flyte. For separation of concerns, there's a concept called project and domain in Flyte. Let's understand it through an example: Team A pertains to one project, and Team B pertains to a different project. Both projects have three domains each: development, staging, and production. That's the project and domain concept, and that's how we can separate the concerns of Team A and Team B. As mentioned previously, reusability is implicit: you can simply reuse workflows across projects and domains, so that shouldn't be a problem.

Next comes the extensibility of Flyte, which is provided at three levels: the user level, the platform level, and the organization level. At the user level, you can use Flytekit to add your own plugins and integrations. At the platform level, if you want to extend the platform with, say, distributed training support, Spark, or streaming, that's possible through Flyte backend plugins. And you can also extend it at the organization level — say you want to control costs or bring in new capabilities — that's possible too.

Next, let's look at the user-level concepts of Flyte. First comes the task. The task is the building block of Flyte: the smallest unit of work, which accepts inputs and produces outputs, and it has a strongly typed interface.
All the inputs and outputs have to be typed, and a task can also map to a backend execution plugin. It's declarative, versioned, and language- and framework-independent. Next come workflows. A workflow is a collection of tasks arranged in a specific dependency graph. A workflow is declarative: it models the data flow through tasks, it has a strongly typed interface, it's versioned, and it can be dynamic and recursive.

This is an example of a Flyte code snippet in Python. We have two tasks and a workflow, and the workflow encapsulates the tasks. Here the workflow calls total_spend, which in turn depends on pay_multiplier; that's the dependency we're expressing, and that's what Flyte captures. We're also setting cache to true and a cache_version for the pay_multiplier task. When we modify the cache_version, Flyte invalidates the previous cache and runs the task again — that's the point of cache_version. You can also run all of this code locally, just like calling a Python function, by passing the required arguments.

If you want to scale your code — say you want to specify resources for your tasks — you simply declare them in the task decorator: the number of CPUs, the memory, GPUs, and so on. And if you have Spark code, you can specify the Spark cluster configuration by installing the Flytekit Spark plugin. You can also specify retries: with retries set to two, if the task fails, Flyte runs it again, up to two more times.

You can also create a launch plan. A launch plan is the plan Flyte uses to launch, or trigger, a workflow. Here, calculate_spend is the workflow we want to associate this launch plan with. We're also attaching a schedule, meaning run this every 10 minutes, and saying that if the workflow fails, an email notification should be sent to a set of recipients.

Next, let's look at workflow modalities. A dynamic workflow is evaluated at runtime, whereas a regular workflow is evaluated at compile time. Here's an example of a dynamic workflow: inside it there's a for loop where we're looping over multi_train, multi_val, and multi_test, which are the parameters of this function. The number of times this loop runs depends on the length of the list the user passes in — that's the dynamism I was talking about, and it's captured by Flyte at runtime; that's when the graph structure is built. Also, a point to note: within the for loop we're calling two tasks, one to fit the model and one to predict, so this dependency is evaluated at runtime as well.

So how do you execute Flyte workflows and interact with the Flyte backend? You can do that through flytectl, Flyte's command-line interface. You can fetch and create executions, fetch the results of your executions, visualize your workflows, and do pretty much anything you want with the Flyte backend. You can also use a programmatic Python API to interact with the Flyte backend, called FlyteRemote. You can use FlyteRemote to trigger a workflow, fetch the results, and, again, do all kinds of things.
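Here's a rough reconstruction of the kind of snippet just described — a minimal sketch assuming Flytekit's Python SDK. The names pay_multiplier, total_spend, and calculate_spend come from the talk, but the task bodies, input values, and recipient address are illustrative placeholders:

```python
from datetime import timedelta

from flytekit import Email, FixedRate, LaunchPlan, Resources, task, workflow
from flytekit.models.core.execution import WorkflowExecutionPhase


# Outputs are memoized per unique input set; bumping cache_version
# invalidates previously cached results.
@task(cache=True, cache_version="1.0")
def pay_multiplier(pay: float, multiplier: int) -> float:
    return pay * multiplier


# Resource requests and retries are declared in code (infrastructure as
# code); retries=2 re-runs the task on failure, up to two more times.
@task(retries=2, requests=Resources(cpu="2", mem="1Gi"))
def total_spend(budget: float, payment: float) -> float:
    return budget - payment


@workflow
def calculate_spend(pay: float, multiplier: int, budget: float) -> float:
    # total_spend depends on pay_multiplier; Flyte captures this as the graph.
    payment = pay_multiplier(pay=pay, multiplier=multiplier)
    return total_spend(budget=budget, payment=payment)


# Launch plan: run every 10 minutes, email a set of recipients on failure.
calculate_spend_lp = LaunchPlan.get_or_create(
    workflow=calculate_spend,
    name="calculate_spend_every_10_min",
    schedule=FixedRate(duration=timedelta(minutes=10)),
    fixed_inputs={"pay": 100.0, "multiplier": 3, "budget": 1000.0},
    notifications=[
        Email(
            phases=[WorkflowExecutionPhase.FAILED],
            recipients_email=["ml-team@example.com"],
        )
    ],
)
```

And a sketch of the FlyteRemote flow — again, the config, project, domain, workflow name, and version here are assumptions for illustration:

```python
from flytekit.configuration import Config
from flytekit.remote import FlyteRemote

# Point the client at a Flyte backend; Config.auto() picks up the local
# Flyte config file or environment variables.
remote = FlyteRemote(
    config=Config.auto(),
    default_project="flytesnacks",
    default_domain="development",
)

# Fetch a registered workflow, trigger an execution, and read the outputs.
wf = remote.fetch_workflow(name="pipelines.calculate_spend", version="v1")
execution = remote.execute(
    wf,
    inputs={"pay": 100.0, "multiplier": 3, "budget": 1000.0},
    wait=True,
)
print(execution.outputs)
```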
Let's look at a couple of screenshots to understand what the Flyte UI looks like. Here's a graph capturing the dependencies between multiple nodes. And this is a workflow page — the typical UI you see whenever you open a workflow in the Flyte console. You can view all the workflow versions and all the executions, and you can launch a workflow with a single click of the launch-workflow button. When you click it, a form opens where you can provide your inputs; you can also specify a service account to associate the run with, change the workflow version you want to use, and so on. And this is how an error message looks when there's a failure. It helps with debugging: Flyte tells you whether it's a user error or a system error and shows you the traceback and the error message, so you can debug with ease.

Now, what are the features of Flyte for platform folks? If there's a platform failure, you can recover instantly using the recover button. You can manage your customers transparently. You can set platform-wide defaults and per-project, per-domain resource limits. You can use spot machines and set compute limits. All of these features are available for platform folks.

Now let's look at the Kubernetes side of Flyte, meaning the architecture. Let's walk through the components, starting from the user. The user authors workflows using the Flytekit SDK, compiles them, and registers them using flytectl or FlyteRemote. After registration, the code is available in FlyteAdmin, which is where all the workflow and task information lives. The user also builds a Docker container in which all the dependencies are present. Now say the user wants to trigger a workflow after registration — what happens then? FlytePropeller comes into the picture. FlytePropeller is a Kubernetes operator and part of the data plane; FlyteAdmin is the control plane and FlytePropeller is the data plane. FlytePropeller executes the workflow and guides it to completion, and during that process it can interact with other Kubernetes operators, other Kubernetes objects, or external services. It stores the results in cloud blob storage, and FlyteAdmin picks up the results from blob storage and shows them on the Flyte dashboard, which the user can view. That's the basic architecture of Flyte.

Next, let's look at Flyte's ML features. First comes intra-task checkpointing. Running multiple epochs or iterations over the same dataset can take a long time, so the bootstrap time may be high and creating task boundaries can be expensive. We've already looked at the retry and recover mechanisms available in Flyte. But when you recover a workflow, it recovers from the failed node, not from within the node: if there's a failure inside a task, Flyte doesn't know about it and simply re-runs the whole task. In machine learning, though, we can have tasks with many iterations or epochs inside them, which are compute- and time-expensive; we may want to capture the exact failure point and resume from there. In such cases we can use intra-task checkpointing: Flyte offers a way to checkpoint progress within a task execution. How do you do that? Inside a Flyte task, you initialize a checkpoint — it's available in the Flytekit library — and then simply write the required data to it. This opens up the opportunity to use alternate compute systems with lower guarantees, like AWS spot instances and GCP preemptible instances, which offer great performance at much lower price points compared to their on-demand or reserved alternatives. That's the advantage of intra-task checkpointing, especially when building machine learning models.
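As a minimal sketch of the pattern — assuming Flytekit's current_context checkpoint API; the epoch loop and the byte-encoded counter are illustrative:

```python
from flytekit import current_context, task


# retries matters here: on each retry (for example, after a spot-instance
# interruption), the task resumes from the last checkpoint instead of epoch 0.
@task(retries=3)
def train(epochs: int) -> int:
    cp = current_context().checkpoint
    prev = cp.read()  # previously checkpointed bytes, or None on a fresh run
    start = int(prev.decode()) if prev else 0
    for epoch in range(start, epochs):
        # ... one epoch of training would go here ...
        cp.write(str(epoch + 1).encode())  # persist progress after each epoch
    return epochs
```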
Let's also look at UnionML, which is an interface built on top of Flyte. If you look at this diagram, we have a UnionML app. If you want to deploy it, you can deploy it either as a web app or on the Flyte backend, using Flytekit and FastAPI, with train and predict endpoints respectively. A UnionML app is the easiest way to build and deploy machine learning microservices: because it's Flyte-compatible, you can deploy it easily, and you don't really have to think about MLOps.

Let's look at how UnionML code is written — the anatomy of a UnionML app. First, import the required dependencies, including the UnionML library. Then initialize the dataset and the model; here the dataset is the digits dataset and the model is logistic regression. Then define a reader function that reads the data, a trainer function that trains the model, a predictor function that generates predictions, and an evaluator function that evaluates the model. That's pretty much it: you write a handful of simple functions, and after you deploy, you have a production-ready ML application. You do have to make sure you decorate your functions with the required decorators to keep the code in sync with the Flyte backend, because internally these become tasks and workflows. This is how you deploy, and executing locally is pretty simple too. And if you want to create a prediction microservice, that's simple as well — here's an example using FastAPI: you spin up the server, serve the model, and then hit the endpoint to get predictions.

Finally, an ecosystem of integrations is what we're planning for. We've currently integrated UnionML with FastAPI, but we have plans to integrate it with Streamlit and Flask. We're also planning to integrate it with hyperparameter optimization libraries, other machine learning libraries, AWS Lambda and serverless or event-driven services, and so on.
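A rough sketch of that anatomy, modeled on the digits example just described — a minimal sketch assuming UnionML's Dataset/Model API; the hyperparameters and names are illustrative:

```python
from typing import List

import pandas as pd
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from unionml import Dataset, Model

# Initialize the dataset and the model around a plain sklearn estimator.
dataset = Dataset(name="digits_dataset", test_size=0.2, shuffle=True, targets=["target"])
model = Model(name="digits_classifier", init=LogisticRegression, dataset=dataset)


@dataset.reader
def reader() -> pd.DataFrame:
    # Read the raw data; the decorator keeps this in sync with the Flyte backend.
    return load_digits(as_frame=True).frame


@model.trainer
def trainer(estimator: LogisticRegression, features: pd.DataFrame, target: pd.DataFrame) -> LogisticRegression:
    return estimator.fit(features, target.squeeze())


@model.predictor
def predictor(estimator: LogisticRegression, features: pd.DataFrame) -> List[float]:
    return [float(x) for x in estimator.predict(features)]


@model.evaluator
def evaluator(estimator: LogisticRegression, features: pd.DataFrame, target: pd.DataFrame) -> float:
    return float(accuracy_score(target.squeeze(), estimator.predict(features)))


if __name__ == "__main__":
    # Runs locally like plain Python; on a Flyte backend these become workflows.
    trained_model, metrics = model.train(hyperparameters={"C": 1.0, "max_iter": 5000})
    print(metrics)
```

Serving with FastAPI then reduces to the pattern described in the talk: create a FastAPI app, attach the model to it with model.serve(app), spin up the server, and hit the prediction endpoint.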
Lastly, let's look at a few MLOps best practices. Retrain models on a cadence — that's very important to keep your models up to date. Version your data and your code. Have a backup model for generating predictions in case the main model fails. Have ML-related tests and CI. Sanity-check the data, meaning ensure the quality of your data. Monitor your machine learning models. Maintain a clean project structure, meaning segregate your model code and your data code. Have pre-commit hooks to ensure the quality of your code.

Finally, with Flyte, you can code, ship, and scale ML pipelines with ease. If you want to learn more about Flyte, take a look at our website, documentation, and repo for further references. Thank you very much.