So, hi everyone. I know we're coming to the close of today's sessions, but we felt that not a lot of machine learning was covered today, so we wanted to talk about MLOps. It's especially relevant these days given the rise of GenAI, and given how much work platform engineering and DevOps teams are doing to adopt machine learning workloads on top of Kubernetes. Our talk won't be aimed directly at GenAI, but we'll cover a lot of the engineering aspects you can leverage when adopting multi-tenancy in MLOps. You'll see why that's required, especially when dealing with a complex set of machine learning workloads, where you might be working with different types of models, and how all of that ties back to Kubernetes, since we are at KubeDay.

Quick introduction: I'm Shwai, I'm an ambassador at POSMED, and with me is Rohit.

Hello. I guess I'm audible. So, who am I? I'm Rohit, as Shwai already introduced. I run a small consultancy that serves the real-estate sector, and if you've ever been to KCDs, I was the organizer of one, and I'm involved with community groups and a lot of other things. So if you want to ask anything about DevOps, you can always reach out.

So, what is today's topic? Multi-tenancy with MLOps. But first, let's discuss MLOps. If you've ever worked on machine learning projects, what does a traditional machine learning project look like? Most of the time, you're working with datasets. If you've actually worked at a machine learning company, you know it isn't just about taking pre-trained models and being done. You're preparing datasets, fine-tuning, evaluating, running training jobs, tuning training parameters, and then scaling up and down according to your business needs. Then there's heavy testing, so the process loops back whenever things go wrong. If a run is successful, you deploy; otherwise you go back and iterate, and that loop can keep running for years. I saw the same thing at one of the companies I worked at previously, which was acquired by Reliance. And then there's monitoring, and the same iteration, and the pipeline keeps going. That was one of the major pains for data scientists, and that's why the term MLOps got coined, even though the practice had been around for a long time. So let's see what it is.

If you look at our data and ML infrastructure, it doesn't scale well across an organization. Different workloads need different teams. Say there's a Team A, which is the ML team, and a Team B, which is the DevOps team. Team A is responsible for the machine learning pipelines, and they've made a critical update, shown here as a pipeline update. Unintentionally, they've also broken pipeline Y and pipeline Z, which don't belong to them; those belong to the DevOps team. It wasn't deliberate, but this happens. Team B wasn't responsible for the change, yet they bear the breakage. This is unintentional damage.
And if you look closely, this happens all the time, which is why dedicated infrastructure is required when you scale across teams. Moving forward, you can see that different ML workloads require a dedicated infrastructure team. Why? Let's see.

First is provisioning CPU, GPU, and memory. For every company working on ML, this is about allocating the computational resources — CPUs, GPUs, memory — that are critical to ML workloads. Then there are frameworks: every machine learning model and pipeline is built on some framework, so the infrastructure should support multiple frameworks; that's where flexibility in development comes from. Then there is multi-tenancy, which is the ability to support multiple users or groups on a single instance of the same infrastructure without compromising privacy and security — one of the important concerns in MLOps. Then there is auto-scaling, where no individual operator needs to step in and your infrastructure can automatically scale whatever ML workloads and pipelines you have. Then there is cost control with spot machines: you use spot instances for suitable computational workloads, so you have a deliberate strategy to control and bound your costs. Then there is model monitoring: you keep track of the model's performance over time, which benefits the company post-deployment by ensuring the model keeps behaving as expected.

Now, how does MLOps benefit from multi-tenancy? We've explored a lot of points, so let's see. With multi-tenancy, you get cost efficiency, because a single instance hosts the different software stacks inside it, so you don't pay infrastructure and licensing costs for separate deployments of each piece of software — you consolidate everything onto one platform. That's why platform engineering is happening everywhere. Similarly, there's time-saving: you perform similar activities with similar software stacks, and multi-tenancy eliminates switching between different stacks, so you save time by reusing the same workflows and tooling. Then there's collaboration enhancement: you foster collaboration among MLOps practitioners so they can work on the same platform, share resources, and work alongside the DevOps teams, combining knowledge sharing with production deployments. Then there is resource optimization: with multi-tenancy, resources like compute power and storage are genuinely optimized because they are shared, which ensures the available infrastructure is used efficiently and reduces waste and time.

If you just read the term "multi-tenancy", it can be hard to grasp, but we can relate it to one picture. What is multi-tenancy for AI/ML architectures? Look at this image: it's a building with a lot of apartments.
Every apartment houses some families. The families here are your tenants, and the building is the shared environment. The families are busy doing things — cooking, living their lives — and that maps to your data and models, because ML workloads need data and models to do anything. The family's use of electricity and utilities maps to your consumption of compute. And then there's how AI models get deployed, distributed, and orchestrated so the solution stays accurate, reliable, and scalable — that maps to booking the apartment in the first place: which apartment and which utilities you get is your distribution, orchestration, and deployment, and if that's done well, you're set up for accuracy and scalability in the future.

Now, let's get to one of the important points: orchestrators. They are game-changing, for sure. An orchestrator is a tool or system that manages everything for you — it drives the logical flow of transforming your data from a raw state to the desired, processed state. How does it do that? You can see it in this diagram, which visualizes the different nodes of the progression from start to finish. You see a workflow with different sets of tasks: retrieval, then splitting, then data processing and data cleaning, then the various possible machine learning steps — you can see the training data, and the training itself might be in there too. Everything flows from one end to the other, and the use of such orchestrators is crucial in complex systems, because this ordering is exactly what the orchestrator handles efficiently.

Moving forward, we can see what an orchestrator actually helps you with. First, the units of computation your workload needs: the orchestrator helps you figure out and manage the individual tasks or steps required to complete your data processing or computation. Then, the data flows between those units: it manages how data is passed from one task to another and ensures the workflow runs smoothly, without problems. Then, the types and state of your data at any given point: you have different kinds of data — images and so on — and the orchestrator keeps track of how things move between states, raw versus processed. Then, the dependencies each unit relies on to do its computation: it helps you identify and manage the dependencies and the conditions under which a particular task runs — we're showing containers here so you can picture it. And then, the resources each unit has available to it: the orchestrator also manages resources such as memory, processing power — computational power, if you're on the ML workload side — and storage.
These are assigned to each task in a bigger workflow. This is managed by the orchestrator — or rather, not exactly managed; it helps you reason about it.

Now we'd like to introduce an open source tool for this: Flyte. What is Flyte? It's built around tasks and workflows, which you can define easily in Python. You can install it easily from the Flyte documentation; it's really easy to understand. You import `task` and `workflow`. A task is a small unit of work you define — say you want to give a robot individual instructions, like "go left", "go right". A workflow, by contrast, is asking it to get from one place to another: the whole journey. With tasks and workflows defined well, you can build your entire pipeline.

Just to add to this: Flyte is an LF AI & Data project — the Linux Foundation has a separate umbrella for AI-related projects, and Flyte comes under it. It's similar to Kubeflow; some of you might be aware of that term. Kubeflow is currently being incubated inside the Cloud Native Computing Foundation. All of these tools let you easily manage your machine learning workloads, whether you're dealing with a very simple workflow — say, working with the MNIST dataset, where you're trying to determine which digit is in an image — or with much more complex workflows that require you to run your machine learning models natively on top of Kubernetes. A very good example is OpenAI, which serves thousands of requests with its GPT models, all running on top of Kubernetes. That is why Flyte and Kubeflow are Kubernetes-native platforms: all of these workflows run on top of Kubernetes. As Rohit mentioned, you can do resource allocation and task allocation with the help of your orchestrator; orchestrators like Flyte run on top of Kubernetes, so you can very easily manage which pods get created and what resources you want to allocate to those pods. That's what we'll be showing in today's demonstration. But, yeah, over to you, Rohit.

Yeah, so, same thing. If you want to use it, you can see there's a config file and there are containers, so it integrates easily; it's an MLOps platform you can use really well, with a Kubernetes cluster for workflow execution, on top of which you define the flow. I'd also like to spell out what a task is: a task is containerized and virtualized, and it has strongly typed inputs and outputs that you declare when you define it. Similarly, a workflow is just a bunch of tasks, plus your definition of how the work and the data flow between them. Moving forward, you can see tasks and workflows here: when you write the Python code, you just annotate it — mark a function as a task wherever it's a task and as a workflow wherever it's a workflow. So what do we have here? You can see `total_spend` is just a sum.
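To make that concrete, here's a minimal, runnable sketch of the pattern being described, using flytekit. The slide only names `total_spend` (a sum) and `calculate_spend` (a multiplication), so the exact signatures, the multiplier value, and the caching settings shown here are illustrative assumptions:

```python
# Minimal sketch of the slide's example using flytekit.
# Only the names (total_spend = a sum, calculate_spend = a multiplication)
# come from the talk; signatures and values are assumptions.
from typing import List

from flytekit import task, workflow

@task(cache=True, cache_version="1.0")  # caching can be enabled per task
def total_spend(expenses: List[float]) -> float:
    # A task is the smallest unit of work -- here, just a sum.
    return sum(expenses)

@task
def calculate_spend(total: float, multiplier: float) -> float:
    # A second task that multiplies the sum to produce the final output.
    return total * multiplier

@workflow
def spend_workflow(expenses: List[float], multiplier: float = 1.18) -> float:
    # The workflow wires the tasks together; Flyte derives the DAG and the
    # data flow from these strongly typed inputs and outputs.
    t = total_spend(expenses=expenses)
    return calculate_spend(total=t, multiplier=multiplier)
```

You can execute a workflow like this locally with plain Python, or hand it to a Flyte cluster, which is what makes the development loop pleasant.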
So `total_spend` does the sum, and `calculate_spend` multiplies that sum to produce the output — and that is your entire workflow. You just need to annotate tasks and workflows properly. It works with various data types — data frames, PySpark, native Python types — and you can use whichever you need. You can execute locally, you can run it on any cloud provider you want, and, yes, you can enable caching if you want it.

And to add to this — can you go back one slide? — a very good parallel here: if you're working with Python, you use decorators, and this has exactly that syntax — the same way you'd declare decorators in Python. Another example we can quote here: take a typical machine learning lifecycle, starting with cleaning the data, then moving to data processing, then applying your machine learning model. Each of those steps in the ML lifecycle can be treated as an individual task, and your entire workflow is the sequence of those tasks, which can also run in parallel. So imagine multiple teams: each can define its own set of workflows, and each workflow can have any number of tasks — it's entirely up to each team how they define and compose them. And that's what Rohit will show on the next slide.

Yeah, so, you can see here — if you've worked with traditional machine learning methods, there's data ETL, the extraction and so on, and then there are classification models, forecasting models, different types of models. Those map to the projects you're using. Then there are domains, which are just different types of environments: development, staging, and production. So what you see here is a logical grouping of tasks and workflows with built-in multi-tenancy and isolation — that's what a project is — while domains provide a specific environment in which to execute your workflows: development, staging, production. If you're here today, you've known those terms for a long, long time. So Flyte's multi-tenancy starts with projects and domains, and there's no need to stand up separate environments — we'll see how in a moment. You'll also see how models may need to be used or worked on by several teams within an organization: a single ML pipeline can be composed of several components, each assigned to a team, and then the team members can share files and collaborate. All of that comes under Flyte's multi-tenancy umbrella. Then I'd like to hand over to Shwai, and he will present the demo. Thank you.

And one point I want to quickly mention is that Flyte's infrastructure was actually inspired by AWS. We know that AWS is a very big part of the cloud ecosystem.
What AWS lets you do is have a single infrastructure and still manage different aspects of your overall cloud-native experience — whether you're using EKS or ECS to deploy with Kubernetes or containers, or things like load balancers and spot instances to manage your workloads. And it provides IAM support, where different people can work on specific parts with scoped permissions. So AWS itself builds a lot of multi-tenancy into the platform, and that's what Flyte incorporates directly as multi-tenant features, as we'll demonstrate now. And again, you're not limited to Flyte — you can use other orchestration platforms, like Kubeflow, that also provide multi-tenancy out of the box. But, yes, on to the next slide.

So now we'll cover the main aspects of Flyte that are shaped by multi-tenancy. The first is resource isolation, or resource sharing. Imagine you have two different teams: a DevOps team and a machine learning team. The ML team does a lot of data preprocessing and runs huge machine learning models that require large compute — GPUs, for example — to train. The DevOps team might not need nearly as much compute, because they mainly consume the model artifacts generated by the machine learning or data science team and deploy that particular project. Out of the box, you get very nice support for isolating your resources. As Rohit shared, Flyte has this architecture of separate projects: think of Team A, the data science team, having its own project, and Team B — say, the MLOps team — having its own. Each project is composed of its own workflows and tasks, so Team A, the machine learning team, has a bunch of tasks only they use, and the resources and memory for those are isolated to them. And since this works at the Kubernetes level — it's a Kubernetes-native platform — whenever you run any task in Flyte, it spins up a new Kubernetes pod, and the namespace that gets generated is scoped to the particular project and domain you're running in. So you get native support for isolation at the task level as well. This is one example where Team A runs one kind of workflow with multiple tasks, and Team B has a separate workflow of their own; you could also have multiple workflows per team if your architecture is more complex. To demonstrate this, let me quickly open up VS Code.
You'll see that I have two files here — team_a.py and team_b.py, inside the workflows directory — and I'll quickly zoom in; let me know if everyone can see this. Team A, let's say, is a data science team using the wine dataset; they use binary classification to detect what kind of wine a sample is. So we've defined separate tasks: one for fetching the data, one for processing the data, and then a task for training the model — just the basic machine learning steps you'd take in a typical data science lifecycle. And then we have a workflow that trains on the dataset using logistic regression as the machine learning model. Similarly, we have team_b.py. In this example it uses much the same code, but imagine that Team B, a separate team altogether, has its own workflows — for example, steps related to deploying the model that Team A produced.

Now, the Flyte UI shows the two projects we've created. If I open Flyte, you can see I've created two separate projects. The first is demo-data-science, which maps to team_a.py — behind the scenes, I assigned that workflow to this particular project. Let me just navigate back — one second — okay, yeah. So that one is for the data science team, and the other one, for team_b.py, is demo-team-mlops.

What I can do now is run workflows directly from my local machine, and those runs stay isolated at the project level as well. For example, I'll quickly take the command for running these workflows, go back to my terminal, and run it. If I zoom in a bit, the command I'm running is `pyflyte run` with the `-p` flag: `-p` specifies the project, and I'm telling it to run only for Team A. I'm also giving it the domain, so I can choose whether this particular workflow runs in my development, staging, or production environment. As soon as I run it, it launches that workflow, and I can visualize the run right here in the UI as well. Let's just wait for it to load. Now it executes the workflow, and you can see it happens only inside the Team A project — the data science demo — running through each of the tasks in the execution. You can also visualize it as a DAG, which is what Flyte uses internally; it helps you see the different stages and tasks and how they're interconnected.
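Since the full file isn't shown on screen, here's a hedged reconstruction of roughly what a team_a.py like this could look like — only the overall shape (fetch, process, train on the wine dataset with logistic regression) comes from the demo narration; the task names, signatures, and the exact binary-classification split are assumptions:

```python
# Hypothetical reconstruction of the demoed team_a.py -- names and details
# are assumptions; only the shape (fetch -> process -> train on the wine
# dataset with logistic regression) comes from the talk.
import pandas as pd
from flytekit import task, workflow
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression

@task
def get_data() -> pd.DataFrame:
    # Fetch the wine dataset as a single DataFrame (features + target).
    return load_wine(as_frame=True).frame

@task
def process_data(df: pd.DataFrame) -> pd.DataFrame:
    # Reduce to a binary classification problem: class 0 versus the rest.
    return df.assign(target=(df["target"] == 0).astype(int))

@task
def train_model(df: pd.DataFrame) -> LogisticRegression:
    # Train a logistic regression classifier on the processed data.
    features = df.drop(columns=["target"])
    return LogisticRegression(max_iter=1000).fit(features, df["target"])

@workflow
def training_workflow() -> LogisticRegression:
    # The chain of tasks below is exactly the DAG the Flyte UI renders.
    df = get_data()
    processed = process_data(df=df)
    return train_model(df=processed)
```

A run like the one in the demo would then look something like `pyflyte run --remote -p demo-data-science -d staging workflows/team_a.py training_workflow`, with the project and domain names taken from the demo UI.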
A DAG representation like that also lets you visualize complex machine learning workflows very easily. And the great thing is that all of this runs natively inside Kubernetes. If I want more detail — say I want to look at my get_data task — I can check the Kubernetes logs, and you can see how a namespace was generated specifically for the workflow we just ran. If I quickly list all the pods I'm running, the latest pod — the one in the staging environment — is for the task I just ran, and that pod is limited to this particular task: only the team members and users who are part of this project and this workflow can access it. I can define role-based access at the task level, at the workflow level, and at the project level to get that kind of isolation, because you don't want the MLOps people interfering with the workflows being run by the data science people, right? We don't want that. So you get isolation at the pod level as well, and if you look carefully at any of the namespaces that get generated — for example, team-a-demo-production — the namespace is derived from the project plus the domain, that is, whether you're running in the production, staging, or development environment. All of that is supported natively here.

Now, going back to the resource-sharing aspect: we can also do task sharing. For example, suppose Team B relies on the model artifacts generated by Team A. You can reference tasks from one team in another team's project when you explicitly define that — by default you don't get that capability, because every team works within its own resources, thanks to the RBAC controls being generated at the Kubernetes level. But if you want it explicitly — say Team A, the data science team, generates a model, and now you want to deploy that model — you can reference the task that produces the model and use it inside Team B. One such example is right here: in our team_b.py, we reference a particular task from another project. Imagine that task generates the machine learning model artifacts you want to deploy; you can do exactly that. So that's another aspect you get with Flyte: being able to share tasks across different projects. Another aspect is restricted access via RBAC: you can define roles and role bindings so that RBAC is enforced natively across your different projects and workflows. And then you can also define GPU and CPU requirements — for instance, when one particular task is a heavy machine learning task, as sketched below.
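Here's a hedged sketch of how those last two pieces — referencing a task from another project, and attaching explicit resource requirements to a task — might look in a team_b.py. The project, domain, task name, and version strings are hypothetical; flytekit's `reference_task` requires the referenced task to already be registered under exactly those coordinates, with a matching signature:

```python
# Hypothetical team_b.py fragment -- coordinates and signatures are
# assumptions for illustration, not the demo's actual values.
import pandas as pd
from flytekit import Resources, reference_task, task, workflow
from sklearn.linear_model import LogisticRegression

# Reference Team A's registered training task by project/domain/name/version.
# The signature must match the registered task; the body stays empty.
@reference_task(
    project="demo-data-science",
    domain="staging",
    name="workflows.team_a.train_model",  # hypothetical registered name
    version="v1",                          # assumed registered version
)
def train_model(df: pd.DataFrame) -> LogisticRegression:
    ...

@task(
    requests=Resources(cpu="2", mem="4Gi", gpu="1"),  # explicit per-task compute
    limits=Resources(cpu="4", mem="8Gi", gpu="1"),
)
def deploy_model(model: LogisticRegression) -> str:
    # Placeholder deployment step; a real one might push the model to a
    # serving endpoint and return its URL.
    return "deployed"

@workflow
def deployment_workflow(df: pd.DataFrame) -> str:
    model = train_model(df=df)   # runs Team A's shared task
    return deploy_model(model=model)
```

The same idea extends further: when you need control over the whole pod rather than just CPU/GPU/memory requests, a task can also carry a full pod template, which is what comes next in the demo.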
So you can explicitly assign GPU and CPU resources to a particular task, and beyond per-task requests, you can define resources at the pod level as well. For example — and I'll be quick, because we're running out of time — while defining a task, you can also attach a pod template describing what the unique pod created for that task should look like when it executes. In this case, we define a pod spec to put limits on the resources and the type of GPU we want assigned to that task. All of this is implemented natively in Flyte, which makes it very easy for MLOps teams to work inside a single platform. As Rohit mentioned, you don't need separate platforms for the data science team working in their Jupyter notebooks — all of it can be handled by one infrastructure team on one platform. So you get resource isolation, and you can also do declarative infrastructure provisioning with Flyte. But yeah, I know we're about out of time. You can connect with us, and we're open to questions — and if we don't get time for questions, we can of course connect after the talk. Thank you. Thank you.