But yeah, I guess thank you everybody for coming to my talk. This session will be a little bit longer than you're probably accustomed to; it's going to be about an hour. I'm going to try to talk for an hour and then we've got another 10 minutes for questions, so hopefully I don't lose you guys in the second half. But yeah, thanks for coming. If you haven't guessed already, I'm going to be talking about GenOps: supporting generative AI workloads with open source tooling and Kubeflow.

My name is Farshad Kholotsian. I'm a lead consultant at Source Group, and what we do at Source Group is help large enterprises move their workloads to the cloud, everything from app modernization and app migrations to building out landing zones. I specifically specialize in building out large data platforms and ML platforms in the cloud. With the more recent explosion of generative AI, I'm also finding myself building generative AI solutions in the cloud. So I want to share some of the learnings that my team and I have picked up over the course of a year with you guys; hopefully you'll get some good content out of this. The other thing I would like to mention is that Source Group recently got acquired by Amdocs, our parent company, so that's why you'll see we are now an Amdocs company. But yeah, lots to cover, so we'll get started if that's okay with everyone.

So the first thing I want to mention, for those who weren't in Nick's talk yesterday: what is Kubeflow? Kubeflow first started out at Google. Google created Kubeflow, and it was their ML offering for quite some time before they decided to open source it in 2018, thankfully, so that we all get to play with it. Kubeflow is an MLOps platform that runs entirely on Kubernetes, and it services the whole ML lifecycle, from model development to model training, model serving, model monitoring, and so on. You can use it as a platform in its entirety, or you can use specific components of Kubeflow and swap out the ones you want to use. The other cool thing about Kubeflow, because it's open source, is that it lends itself really nicely to working with other open source solutions like Keycloak, which we're using for single sign-on, MLflow for model tracking, Spark on Kubernetes, et cetera. You'll see me demo some of this stuff in the presentation.

As well, Kubeflow as of July got accepted into the CNCF, the Cloud Native Computing Foundation, and the Kubeflow team is working hard to get it set up as an incubating project. Kubeflow is also very much in active development. I was a part of some of the community calls, and the team worked really hard to get Kubeflow 1.8 out the door as of November 1st, with a lot of cool new features. Personally, my favorite new feature in Kubeflow 1.8 is the PVC viewer, which allows you to look into PVC volumes directly from the UI and transfer files to them. PVC volumes are the underlying storage that gets attached to your notebook or your model server.

And then one thing you guys are probably wondering is what sets Kubeflow apart, and why would an organization or a person want to use Kubeflow instead of some of the cloud offerings like AWS SageMaker, GCP Vertex AI, or Azure ML? I've categorized this into five key factors, which I'll go into.
The first one, of course, is cost savings. Because Kubeflow is open source, you're not paying any licensing fees to use it, and you're not getting charges tacked on just because it's an ML instance in the cloud. You're really just paying for the underlying compute and the storage that you're using, so it's hard to compete with that on cost.

The next thing is portability. A lot of the large enterprises that we work with don't want to be pigeonholed or locked into a single cloud, or they have mandates or regulatory requirements to be in multiple clouds. Kubeflow allows you to run pretty much anywhere, on any cloud, whether it's AWS, GCP, Azure, or even on-prem. As long as you have a Kubernetes cluster, you can run Kubeflow. And what I like to say is that the same teams that support Kubeflow on one cloud can also support Kubeflow on another cloud. You solve for one cloud and you've almost solved for all the other ones.

The next benefit of Kubeflow is usability. The UI you'll see in my demo is highly customizable, and you can integrate it with other open source solutions, so your data science users, the users of your platform, don't actually have to go to another website to access another tool. They don't have to go to the cloud console and learn how to navigate it. You'll see the Kubeflow UI is very easy to understand, and everything is contained within it, so that's definitely a really big plus.

As well, upgradability. Kubeflow being open source, you have access to the code base, and what that means is that you can customize and add the features that your data science team or your organization wants. You don't have to wait for the vendor or the cloud provider's roadmap to implement a feature that you want. And then hopefully, once you add these cool new features, you can also contribute back to the open source community and to the Kubeflow code base.

And then lastly, familiarity. A lot of the organizations we work with have a mature Kubernetes platform engineering team. So if you're used to running Kubernetes, or you're already running Kubernetes workloads in your organization or in your team, you'll feel right at home running Kubeflow on Kubernetes.

So if you're asking yourself what that kind of looks like, I have a diagram. Let me just zoom in maybe so you guys can see it a little bit more, maybe not, there we go. This is a logical diagram of all the different components that we have in our internal Kubeflow cluster. At the center, you have Kubeflow. Let me see if this laser pointer actually works, there we go. One thing I really love about Kubeflow is its notebook environment. Not only can you run Jupyter notebooks, which data scientists are accustomed to using, but you can also run VS Code via code-server directly on the Kubeflow platform on Kubernetes, you can run RStudio, and for those data scientists who want to use the Spark shell or whatever, you have access to the terminal and can run your code directly on Kubeflow. Some other important pieces that come with Kubeflow: you have the model serving piece, which is KServe. KServe is great for serving models at scale. And then you have Katib, which is the AutoML and hyperparameter tuning component that comes out of the box with Kubeflow as well.
I should also mention Kubeflow Pipelines for ML orchestration. And then we've added other open source projects into our platform. You can see we have MLflow for model tracking and Label Studio for annotation; I'll talk a little bit about why we're using Label Studio for generative AI workloads, it's really cool. There's Feast for the feature store, and then of course Spark on Kubernetes. I love Spark on Kubernetes. As of the newer versions of Spark, it natively supports running on Kubernetes, and that's also important in your ML lifecycle. And then for Kubernetes GitOps, the standard tools you're used to using, like Argo CD for managing the YAML manifest files, again Keycloak for authentication, and also Terraform for deploying all the infrastructure. These are all things you can use and your team might already be familiar with.

But yeah, because this is supposed to be like a workshop, I'm going to walk you guys through how to install Kubeflow. You can do it yourself if you have a reasonably modern laptop, either now or a little bit afterwards. What you will need to run Kubeflow locally is a local Kubernetes installation. You can use anything from Docker Desktop with k3d, you can use Minikube; my preferred tool of choice is Rancher Desktop, just because Docker updated their licensing, so if you're working in a large organization you have to pay licensing fees. Rancher Desktop is still open source and completely free. So if you want, and it doesn't have to be today, you can definitely go to the Rancher Desktop website. I see most of you guys might be using Mac, but it supports Mac, Windows, and Linux. You just download the executable, install Rancher Desktop, and then you'll have a Kubernetes cluster running locally on your laptop.

And then I'm going to show you now how to actually install Kubeflow. There are a couple of ways to install it, the first one being to use a packaged distribution. Let me actually just go to the Kubeflow install page and show you guys. Yeah, so here's the install page in the Kubeflow documentation, and as you can see, you have multiple distributions of Kubeflow. All the major cloud providers, AWS, Azure, Google, have their own Kubeflow distribution. Some other notable ones are Charmed Kubeflow by Canonical, and also deployKF. deployKF is definitely a cool one to check out. For every single distribution, you go to their website and they have their own documentation for getting it up and running on a specific cloud provider, or using something like deployKF to get it installed. The other thing you'll note is that most of the distributions are already running the latest version of Kubeflow, 1.8, or they're working really hard to get their distribution updated to support 1.8. There are some legacy distributions which are no longer maintained, so they've been demoted to the bottom of the list.

But yeah, that's one way you can install Kubeflow. It's not necessarily what I recommend initially if you're running Kubeflow locally. What I actually recommend is installing Kubeflow via the manifest files. This might seem like a slightly more advanced way of installing Kubeflow, but as you'll see in the next slide, it's actually fairly easy.
The main reason I recommend you install Kubeflow via the manifest files is that you can actually go in and install every single component one by one, and get more familiar with which components and deployments make up Kubeflow, and also where everything lives: what namespace do the various components live in? So here you go. I've prerecorded this only because if I installed it live I would probably run over time, so this forces me to talk really quickly.

Once you have Rancher Desktop installed, you'll go to the Kubeflow manifests page. Actually, let me show you what that looks like. Kubeflow has a bunch of repositories; the manifests repository lives at kubeflow/manifests, and it's maintained by the Kubeflow manifests working group. Basically, the manifest files are a collection of YAML files. Under the hood, Kubeflow uses Kustomize to install all its components. If you're not familiar with Kustomize, think of it kind of like Helm; it's similar to Helm, but it takes a different approach. You don't really need to know how Kustomize works initially to be able to get this stuff installed. If you go down the repo instructions, you'll see all the various components of Kubeflow, and you'll also find the install instructions. You can see there is a one-liner command that you can just copy and paste if you're running something like Git Bash or you have Linux. What this will do is loop through all the components of Kubeflow, install them one by one, and keep retrying until it sees that all the components are up and running. I don't recommend that, though, because sometimes you run into issues with it and you don't really know what you're installing. So I'm going to walk through how to install each component.

Let me play this video, and I'll try to talk through it. So we went to the Kubeflow manifests page. The first thing you want to do is select the version of Kubeflow you want to install. You can go to the latest branch, which is version 1.8, or you can go to the releases page and download 1.8 that way. But you clone this repo, and then you'll see here... I'm just going to skip ahead because I already talked about some of this.

Yeah, so the first component of Kubeflow we're going to install is cert-manager. cert-manager is used for automating the creation of certificates, as well as the renewal of certificates. What we've done is we've actually linked cert-manager with Let's Encrypt, so we're able to generate certificates for HTTPS totally free. So I ran a command to look at all the pods; currently, we only have the Kubernetes system pods, nothing else installed. There you go, I copy and paste the cert-manager command, and you'll see all the cert-manager components get installed. Don't worry about that error that you'll see over here; the documentation says it just means cert-manager isn't ready yet. But if I check the cert-manager namespace over here, you can see you've got the three pods for cert-manager installed.

The next thing we're going to install is Istio. Istio is used as the service mesh and for routing traffic: you go to a specific URL and it sends you to the central dashboard, or sends you to the notebook page, via virtual services. And that lives in the istio-system namespace.
So you'll see there we've got the Istio daemon, istiod, and then the Istio ingress gateway, which is really important; we'll talk about it a little bit later. Then we're going to install the AuthService. This is used to connect to an OIDC provider, and it's used in conjunction with Dex, which is the next thing that gets installed. Dex is used for authentication. The AuthService lives in istio-system, but Dex lives in its own namespace called auth.

Now, for the rest of the components I'm not going to check every single pod; I'm just going to run through them and talk about them. First we're going to install Knative, Knative Serving and Knative Eventing, which are the underlying pieces that are really important for KServe. Then we're going to create the Kubeflow namespace; most of the components of Kubeflow live in the Kubeflow namespace. I've just run that. I've installed the Kubeflow RBAC roles, and then we're going to install the Istio resources, like the Kubeflow gateway, which is pretty important, as well as Kubeflow Pipelines. So this is all you're doing: going through, copying and pasting, and getting all the components installed. It's fairly easy and straightforward.

We're installing KServe now, which is used for model serving at scale. We're installing Katib, which again is the hyperparameter tuning and AutoML feature. Of course, you've got the central dashboard, which is what you'll see when you connect through the UI, and the admission webhook, which injects pod defaults into your notebook pods and workloads. You've got the notebooks controller and the Jupyter web app front-end interface, the new PVC viewer that I mentioned, the profiles and Kubeflow access management component, which I'll talk about a little bit later, and the rest of the web apps. Then, if you're familiar with TensorFlow or have used TensorFlow before, we're also installing TensorBoard, which is pretty cool. You've got the training operator, which is getting installed right now; the training operator is used for training at scale. And then the last thing is the user namespace. Every team or user gets their own namespace or workspace, which I'll show later, and this also sets up a default user. Of course, you can go and change this.

The last thing I'm doing here is just running a command to check all the namespaces and make sure all the pods are running. You can see most of the initial components that I installed are already running, and we're just waiting for some of the stuff in the Kubeflow namespace to fully initialize. One thing I should caution you on: if you're going through this install for the first time, you might see a lot of pods in the Kubeflow namespace that fail to deploy, sitting in an ErrImagePull or ImagePullBackOff state. Don't fret. What that really means is that because you're downloading so many different container images, most of them coming from somewhere like GCR, the registry is rate limiting you. It realizes you're requesting 30-odd different images and it's going to block you for a short amount of time. But if you just wait two to five minutes, the ones in a failed state will retry, and eventually they'll be able to pull the image they need and everything will be up and running.
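If you'd rather script that last check than eyeball kubectl output, here's a rough sketch using the Kubernetes Python client. Nothing here is Kubeflow-specific; it just loops until every pod in every namespace reports Running or Succeeded, which is exactly the retry-and-wait behavior I just described:

```python
# Sketch: wait for every pod in the cluster to settle after the Kubeflow install.
# Assumes your kubeconfig already points at the local Rancher Desktop cluster.
import time
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

while True:
    pods = core.list_pod_for_all_namespaces().items
    pending = [
        f"{p.metadata.namespace}/{p.metadata.name} ({p.status.phase})"
        for p in pods
        if p.status.phase not in ("Running", "Succeeded")
    ]
    if not pending:
        print("All pods are up.")
        break
    # Pods stuck in ErrImagePull / ImagePullBackOff will show up here until the
    # registry stops rate limiting you; they retry on their own.
    print(f"Still waiting on {len(pending)} pods, for example: {pending[:3]}")
    time.sleep(30)
```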
So yeah, it's fairly simple. There are a lot of components behind the scenes, but as you use Kubeflow more, you'll get more and more familiar with all of them. And here's the other reason I like Rancher Desktop, which I'll show you now. The next thing you need to do, once you've installed all the components, is expose the Istio ingress gateway, which lives in the istio-system namespace, on port 8080 on your local system so that you can access it. The Istio ingress gateway, let me just zoom in here, is your main entry point into the cluster. When you connect to a Kubeflow URL, it first hits the Istio ingress gateway, which decides, okay, you need to go to the central dashboard, I'm going to send you there; then it goes to Dex, and Dex says, okay, you need to authenticate, goes to the AuthService, comes back, and then you're in. And if you're going to a notebook URL, it knows to send you to the notebook service. So it's fairly important.

Of course, if you've worked with Kubernetes before, you can run a kubectl port-forward command to forward port 8080, and you have to keep the terminal running; there are other options too. But what I like about Rancher Desktop is that it has port forwarding built into the UI. I'm just going to play this. You can see this is Rancher Desktop once you have it up and running. If you go into port forwarding, you just need to find the Istio ingress gateway, and make sure you're port forwarding the HTTP service, because by default it doesn't run on HTTPS, it runs on HTTP. You just port forward it to 8080, go to localhost, and there you go. By default, you're using the vanilla Dex interface, logging in with user@example.com, and the password is 12341234. Of course, make sure you change that or switch to a different auth provider.

And then here you're in the Kubeflow UI. You can see you've got notebooks, you've got pipelines, you've got endpoints for model serving, and then you can even go in and start creating your Jupyter notebook, playing around with generative models and what have you. So yeah, it's a pretty cool interface. This is what you get out of the box with Kubeflow. I'm going to show you now what we've actually done with Kubeflow, or how we've customized it.

Here, I'll just show you a splash screen of all the other really cool things in Kubeflow. You've got TensorBoards, this is the AutoML experimentation, your served models, some monitoring there. One thing I do want to highlight in the middle, which I won't have time to talk about in depth, but it's a pretty nifty tool, is Elyra AI. Elyra is an open source set of extensions for JupyterLab that's centered around AI. It has cool things like the ability to save code snippets or clone from repos, but the other cool thing it has is the ability to create Kubeflow pipelines with low code or no code. Basically, you have this pipeline editor where you can take all your Python scripts or your Jupyter notebooks, drag them onto the pipeline editor, and then connect the different steps. You can see here I've created a pipeline where I'm loading data, then doing some data cleansing, and then doing a time series analysis and some other analytics. But yeah, definitely something worth checking out.
The only thing is it doesn't support Kubeflow Pipelines V2; it's still on Kubeflow Pipelines V1, which is kind of old, but Elyra is definitely worth checking out, especially for those data science teams who don't really want to learn Kubeflow Pipelines.

But yeah, let me jump back and show you our Kubeflow implementation. If I go to this URL, right off the bat you'll see something quite different. Because we're using Keycloak for authentication, it allows you to style the login page however you like, so any organization can have their logo up there. We have our Source logo with the Source branding and colors. The other cool thing is you don't just have to log in with username and password; we have set up single sign-on. So all I have to do here is click single sign-on, grant access, and I'm into Kubeflow. This is great for large organizations; it's very easy to onboard users when you can just hook into the existing SSO.

The other thing you'll notice is that there are actually a lot more options in our implementation of Kubeflow; it's not just the stuff you get out of the box. I can probably zoom in a little bit. Not only do we have the default things that come with Kubeflow, but like I mentioned, we're integrating with tools like MLflow, and as you can see here, we actually have MLflow directly in the Kubeflow UI, so data scientists don't need to know what the MLflow URL is or go to a different page. As well, Feast is used for feature stores. These two are mainly ML-related, not really for generative AI, but we have them on our platform.

And then here's Label Studio. Label Studio is actually pretty cool as well. It's another open source project. It does have a commercial offering, but the open source side of things works just fine. The reason I like Label Studio for annotating data is that not only does it cater to normal ML workloads, with templates for computer vision, natural language processing, speech and audio, but they have also created templates specific to generative AI. I'll show you later on in my demo how we're actually using Label Studio in our generative AI workflow. But yeah, definitely a cool project to check out.

The other thing I wanted to highlight: here we have a bunch of notebooks that our users are using, but there's also this thing called workspaces, which I mentioned previously. For example, right now we have two workspaces. Imagine this workspace is the data science team's workspace and this one is the finance team's workspace. Each team has their own space, and if you switch, and of course I'm the owner of both workspaces, you can clearly see they're totally different workloads. Users in one team or one workspace don't have access to the notebooks from the other team. As well, if I go to models right now, the Source sandbox namespace or workspace doesn't have any models deployed, but if I switch over to the Kubeflow Playground, you can see we have a generative AI T5 model, which I'm going to demo. We're also playing around with vLLM, so we have a vLLM server running on KServe too. And then lastly, like I mentioned, with the functionality of Kubeflow you're not only using Jupyter notebooks; just as an example here, I'm going to start up a, what is this, VS Code server.
And again, you also have RStudio if you really wanted, like I mentioned. Just give it a minute, it'll spin up. And I'm doing this for you, Nick, because you wanted to see what VS Code server looks like. It actually spins up really quickly; it's basically requesting some compute and memory resources from the Kubernetes cluster. And once it spins up, you're going to see, there you go, this is actually VS Code running. Some of the data scientists we work with find that if they're trying to run stuff in VS Code directly on their laptop, they might not have that many resources, but here you can use the full power of Kubernetes and run VS Code. So that's pretty cool too.

Yeah, so let's go back to our slides. I wish I could talk more about some of the components of Kubeflow, but I don't know if I have time. Let's go back to this. I totally forgot to set my timer. Okay, so you guys might be wondering what the underlying architecture of Kubeflow looks like. For our internal Kubeflow platform, we decided to deploy it on GCP, Google Cloud. I'll zoom in here for a bit, but this is what the architecture actually looks like. At first glance it might look a little bit complicated, but let me just zoom in; it's not really that complicated. The heart of the whole infrastructure is this Kubernetes cluster. Like I showed before, all the components of Kubeflow live inside the Kubernetes cluster.

The other thing you'll note is that it's a private cluster. One thing I caution about trying to use the existing distributions of Kubeflow is that a lot of them are set up to deploy a public cluster, and Kubeflow is something you really don't want to make publicly available. That's another reason we chose to deploy via the manifest files: this way we can deploy a private cluster and then deploy all the components directly on that private cluster, so it's only accessible by our organization. The way we get into the cluster is we have a public VPC running a VPN server, and our users use a VPN client to connect directly to the Kubeflow cluster. We're using extra security measures like client whitelists and whatnot, so only our users can connect to the cluster. In our internal implementation, we're just using WireGuard, which is a free, open source VPN solution. It's really cool. But most customers we work with already have an existing client VPN, so we just hook directly into that.

And then of course we do leverage some managed services from the cloud provider, for storage, for the container registry, and also for the database: we're using Google Cloud Storage, Google Container Registry (GCR), and Cloud SQL. And then just for the operations side of things, you can use your pipeline tool of choice. We're using Bitbucket, but of course you use whatever the client is using. And again, we're using things like Terraform to deploy all this infrastructure. So again, it's static: once you deploy and get all this stuff set up, you're not really changing any of the infrastructure. Most of what you're doing is maybe adding more node pools with different GPU or instance types and making your cluster larger and larger, but there really isn't anything else you need to set up for this to work. Okay, so that's the Kubeflow side of things done.
Now we're going to switch gears into GenOps. Some of you might be wondering, what is GenOps? And you wouldn't be wrong if you haven't really heard that term before. I did a Google search just yesterday and even this morning, and you'll see not a lot of people are using the term GenOps, but I'll explain it here. Basically, GenOps is the practice that aims to make developing and maintaining production generative AI models much more efficient and seamless, and it covers all the types of generative models, so not just language models but also image, voice, and multimodal. Similar to what I explained with MLOps, it also covers the entire spectrum of generative AI: not just model development and model training, but fine tuning, GenAI tooling like agents and chains, model serving and scaling on GPU hardware, prompt management, human feedback, and then some of the more important things like data privacy, ethics, and security guardrails.

You might be wondering, well, isn't that just LLMOps? And it's true that LLMOps is the most popular term nowadays for referring to this kind of thing. Me and a few others feel LLMOps doesn't go far enough, because with the switch to smaller generative models, the introduction of multimodal with things like Gemini, more domain-specific models, and the incorporation of helper models, you'll see that not everything is an LLM. You're going to see very quickly, in the next few months probably, a lot more models supporting image generation, voice generation, and what have you. So I definitely encourage everybody to switch to the new term GenOps and not just say LLMOps. The other thing I want to mention: I kid you not, I read an article the other week, and in the same article they mentioned GenAIOps, LLMOps, and RAGOps. Honestly, GenAIOps doesn't roll off the tongue, and please don't use RAGOps, it just sounds very weird. I don't know where they're going with all these terms.

And then of course you can't have a GenAI talk without talking about retrieval augmented generation. I'm sorry to bring this up, but the reason I'm bringing it up is that in my demo I'll show you a RAG setup, so of course we have to talk about it. Most of you know this is referred to as RAG. In older research papers it was called retrieval augmented in-context learning, but that acronym probably doesn't sound too good, so we have RAG. Basically, RAG is the AI architecture where you're retrieving facts from an external knowledge store, feeding that into the model's context, and asking it to answer the question. We find it gives you more accurate and more contextually relevant information. It also allows you to use much smaller models and get answers that are just as good as a large, general foundation model.

And of course I have a diagram here. You've probably seen many diagrams of RAG; this one I find is fairly simple and illustrates it pretty well. The user asks a question. Then you go to some kind of pipeline or agent, which takes that question and vectorizes it, goes into a vector database, retrieves the top five most relevant chunks of information, feeds those into the model, and then says: based on the context I've given you, please answer this user's question.
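To make that flow a bit more concrete, here's a minimal sketch of the query side with the LangChain APIs we were using at the time. The collection name, the Chroma path and the Flan-T5 checkpoint are placeholders standing in for our setup, and newer LangChain releases have moved some of these imports around, so treat it as a sketch rather than copy-paste:

```python
# Sketch of the query side of RAG: embed the question, pull the top chunks from
# the vector DB, stuff them into the prompt, and ask a small model.
from langchain.embeddings import HuggingFaceBgeEmbeddings
from langchain.vectorstores import Chroma
from langchain.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

embeddings = HuggingFaceBgeEmbeddings(model_name="BAAI/bge-base-en-v1.5")
vectordb = Chroma(
    collection_name="aws-faqs",        # placeholder collection
    persist_directory="/mnt/chroma",   # placeholder path
    embedding_function=embeddings,
)

llm = HuggingFacePipeline.from_model_id(
    model_id="google/flan-t5-base",     # small enough to run on CPU
    task="text2text-generation",
)

prompt = PromptTemplate.from_template(
    "Use the following context to answer the question. "
    "If you don't know the answer, just say you don't know.\n\n"
    "{context}\n\nQuestion: {question}\nAnswer:"
)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",                                       # stuff the chunks into one prompt
    retriever=vectordb.as_retriever(search_kwargs={"k": 5}),  # top five most relevant chunks
    chain_type_kwargs={"prompt": prompt},
    return_source_documents=True,
)

result = qa({"query": "What is Amazon Bedrock?"})
print(result["result"])
```

That `k: 5` on the retriever is the "top five most relevant chunks" step from the diagram, and the prompt is where the "based on the context I've given you" instruction lives.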
Actually, I'll maybe give an analogy for those who might not be quite clear on this. What I like to say is: imagine you're writing an exam on a topic you've never really seen before. You're not going to have a good time, and this is what foundation models generally have to do. For example, if I'm using a large foundation model and asking it specific questions about Kubeflow, more than likely that foundation model might not know or have access to the Kubeflow documentation, so it might not give the greatest answers. It might completely refuse to answer your question, or even worse, it might just hallucinate and give you the wrong answer. What I liken RAG to is writing that same exam as an open-book test. If you have a book with all the Kubeflow documentation and so on, you can much more easily answer the questions.

And again, I love architecture diagrams, so I'm going to give you the actual architecture diagram we have for the demo I'll be showing you. Here we go, hopefully you guys can see that. What I've done is broken this out into three different workflows or phases to help you understand it, the first one being the data loading phase. Sorry, let me zoom in here. There you go. The first thing you want to do, obviously, is take your corpus of documents, your internal knowledge store, and it could be in any format: PDFs, HTML, text, what have you. Initially you'll probably want to do some ETL and clean up the data, and you might also want to add relevant metadata before you put it into your vector DB. Once you get all your data together, you're going to use an embeddings model. At the time we created this, BGE was the top embeddings model on Hugging Face, so that's what we're using. Then we're vectorizing all the data and putting it into a vector data store. Again, this is already fairly antiquated by GenAI standards; we were using Chroma DB because that was the hip new vector DB at the time, but of course, just in this conference alone I've learned about Cassandra supporting vector search, you've got pgvector and PostgresML, and you can run anything from Pinecone to Milvus, what have you. The other thing I want to mention is that when you're vectorizing documents and putting them into the vector data store, you're not just turning the data into numbers and throwing it in. The embeddings model is very important: it's adding a whole bunch of dimensionality to the documents, or the text or tokens, and it's also adding semantic meaning. That's how it works there, but I won't go too much into it.

Then you've got the app front end. What we're using for our app front end is an open source library called Streamlit. Streamlit is pretty awesome, you can see it at streamlit.io, and it lends itself really well to building out rich data web apps using Python. Definitely something to check out. There are alternatives like Gradio, but I really like Streamlit. And then we have the model service, which you can see here is running on KServe, and it can scale out to N pods depending on how much traffic you get.
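And the data-loading phase on the left of that diagram is roughly this, sketched with the same LangChain-era APIs as before and a placeholder folder of FAQ PDFs:

```python
# Sketch of the data-loading phase: PDFs -> cleaned chunks -> BGE embeddings -> Chroma.
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceBgeEmbeddings
from langchain.vectorstores import Chroma

docs = PyPDFDirectoryLoader("./aws-faqs/").load()  # placeholder folder of FAQ PDFs

# Overlapping chunks, so a sentence split across two chunks is still recoverable.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
chunks = splitter.split_documents(docs)

embeddings = HuggingFaceBgeEmbeddings(model_name="BAAI/bge-base-en-v1.5")
Chroma.from_documents(
    chunks,
    embedding=embeddings,
    collection_name="aws-faqs",
    persist_directory="/mnt/chroma",
)
```

The chunk_size and chunk_overlap values are just starting points; the overlap is what keeps a sentence that straddles two chunks usable, which you'll see come up again in the demo.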
So now I'm going to hopefully give you guys some of the learnings we've gained running generative AI workloads on Kubeflow. Here we go. I apologize, this might get a little bit technical, but I'll try to illustrate it with examples as well.

The first learning I want to talk about is GPU time-sharing to increase GPU utilization and cost savings. If you're working with generative AI models, you're more than likely using GPUs, especially on the model serving side and also for model development. There was a talk earlier today about GPU time-sharing and time-slicing, but I'll explain it here. GPU time-sharing allows multiple pods to utilize a single GPU, and that's very beneficial when you're working with generative AI models. The thing to note is that you can't use more VRAM than the GPU actually has: if your GPU has 40 gigs of VRAM, you have to make sure that all the workloads running on that GPU don't exceed that memory, or you'll get out-of-memory errors. That's typically not great when you're running applications where you don't know the memory requirements, but specifically for LLMs, and working with LLMs inside a notebook environment, exploring and doing prompt engineering, this lends itself really well to GPU time-sharing. The reason is that you almost always know the memory requirement of the model you're using. For example, if you're using Llama 2 7B, you know that's going to take about 15 gigs of VRAM; if you're running it in eight-bit it's more like nine gigs, or even smaller if you're using four-bit quantization. So if your team plays nicely together, it lends itself really well to running multiple models on the same GPU.

The other thing I should mention about time-sharing that's pretty cool: if a single user is using that GPU, that user gets access to 100% of the compute of that GPU. If a second user requests that GPU, it divides the compute evenly, 50/50. And this generally works with the data science workflow: you're not hitting the GPU 100% of the time unless you're doing fine tuning or training, which of course I don't recommend GPU time-sharing for. But when you're working in a Jupyter notebook environment, you load a model, you ask it a question, you get a response back, you look at whether the response is any good, and then you might decide to switch models or change the prompt. So you're typically underutilizing the GPU.

And then lastly, I'll also mention MIG, which is Multi-Instance GPU. I won't really go into detail, but there are a lot of cool videos on YouTube from NVIDIA, from Google, and others specifically on GPU time-sharing and multi-instance GPUs. Multi-instance GPU is a way of physically isolating or dividing the GPU into different slices, where each user has access to a specific slice and can't use more than that slice. It's another way of dividing very large GPUs so that you save on costs. MIG is only supported on the newer NVIDIA hardware, the A30, A100 and H100, and most likely the new H200 as well. And then it's very easy to enable on most cloud providers. Here's an example of how we've done it on GCP; this is some Terraform code.
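You probably can't read the Terraform from the back of the room, so here's the workload side of it first, sketched with the Kubernetes Python client. This is a bare pod rather than how you'd normally launch a Kubeflow notebook, and the cloud.google.com/* node selector labels are my recollection of GKE's time-sharing labels, with a placeholder image and namespace, so double-check against the GKE docs before relying on it:

```python
# Sketch: a pod that lands on a time-shared GPU node pool on GKE and takes one
# "slice" of the GPU. Labels, image and namespace are illustrative placeholders.
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="llm-notebook", namespace="kubeflow-playground"),
    spec=client.V1PodSpec(
        node_selector={
            "cloud.google.com/gke-accelerator": "nvidia-tesla-t4",
            "cloud.google.com/gke-gpu-sharing-strategy": "time-sharing",
            "cloud.google.com/gke-max-shared-clients-per-gpu": "4",
        },
        containers=[
            client.V1Container(
                name="notebook",
                image="kubeflownotebookswg/jupyter-pytorch-cuda-full:v1.8.0",  # example image
                resources=client.V1ResourceRequirements(
                    # One shared slice; up to three more pods can share the same
                    # physical GPU, as long as their models fit in VRAM together.
                    limits={"nvidia.com/gpu": "1"},
                ),
            )
        ],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="kubeflow-playground", body=pod)
```

And the node pool side really is just a few lines of Terraform: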
All you really have to do here is specify that you want a GPU sharing config, that the strategy you're using is time-sharing, and that you want a maximum of four pods per GPU. Then, when you go to deploy your workload, you add a corresponding node selector on the workload to say it should use time-sharing. It's really simple to do. This does have to be done at node pool creation time or cluster creation time, but it's definitely something worth checking out if you're working with generative AI models. We're still good on time.

The next one I'm going to talk about is using ReadOnlyMany persistent volumes for rapid scaling. We all know generative AI models can be very large. In our demo we're using Flan-T5 with 248 million parameters, which is a small model by today's standards, but then you have something like Llama 7B, which is 15 gigabytes, or Llama 13B, which is 40 gigabytes, and that's not even the largest generative AI Llama model. Basically, any time you're deploying a model server on KServe, every time a pod spins up it has to download the model files, either from Hugging Face or from somewhere like S3 or GCS, before it can actually start serving requests. If you get a bunch of requests and you're scaling up to 10 pods, that can add a lot of latency and a lot of unnecessary traffic.

So what we've done on our cluster is this. The default persistent volume, which is just a disk, comes as ReadWriteOnce by default, meaning one disk can only be used by one pod, and that pod can read and write files on it. We create a new persistent volume, and then we download the model files onto that persistent volume. Pro tip: if you aren't aware, you can actually use the git lfs clone command to pull down the model files for any Hugging Face repository, because on the back end a Hugging Face model is really just a Git repository. So we download the model files onto that persistent volume. The next thing we do is create a ReadOnlyMany volume by cloning the volume we just created. The reason we do this is that if you just create a ReadOnlyMany volume from scratch, it's going to be an empty volume, and because it's read-only you can't actually add any files to it, so you always have to clone it from an existing volume. I can show you here; this is just the YAML file for creating that persistent volume.

Now that you have this ReadOnlyMany volume, the next thing you do, when you're deploying your KServe model inference service, again in a YAML file, is specify that this is the persistent volume with your model files and this is the local folder where you want to mount it. Once you do that, every time you scale up to more pods, all each pod has to do is attach that existing volume instead of downloading the model files every single time, and that leads to rapid scaling when you're serving. I'm sure a lot of the big AI companies are probably already doing this; nobody's really talked about it, but it's definitely something to look into. I'll put a rough sketch of those two manifests right after this bit.

So the next GenOps learning I want to talk about is saving model responses for human feedback. If you aren't familiar with the human feedback techniques, there is reinforcement learning from human feedback, which is RLHF, and then DPO, which is direct preference optimization.
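Here's that sketch of the two manifests, written as Kubernetes Python client calls rather than the raw YAML on the slide. The PVC names, namespace, size and model format are all made up for illustration, and the dataSource clone assumes your storage class supports CSI volume cloning:

```python
# Sketch: clone the ReadWriteOnce volume that already has the model weights on it
# into a ReadOnlyMany PVC, then point a KServe InferenceService at it with a
# pvc:// storage URI. Names, sizes and the model format are placeholders.
from kubernetes import client, config

config.load_kube_config()

read_only_pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "flan-t5-weights-ro", "namespace": "kubeflow-playground"},
    "spec": {
        "accessModes": ["ReadOnlyMany"],
        # Clone the existing ReadWriteOnce PVC we git-lfs-cloned the model onto.
        "dataSource": {"kind": "PersistentVolumeClaim", "name": "flan-t5-weights"},
        "resources": {"requests": {"storage": "10Gi"}},
    },
}
client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="kubeflow-playground", body=read_only_pvc
)

# The InferenceService mounts the read-only volume instead of downloading the
# weights, so new pods come up quickly when KServe scales out.
inference_service = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "flan-t5", "namespace": "kubeflow-playground"},
    "spec": {
        "predictor": {
            "model": {
                "modelFormat": {"name": "huggingface"},  # whatever your serving runtime expects
                "storageUri": "pvc://flan-t5-weights-ro/flan-t5-base",
            }
        }
    },
}
client.CustomObjectsApi().create_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace="kubeflow-playground",
    plural="inferenceservices",
    body=inference_service,
)
```

Okay, back to the human feedback techniques.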
I can't really go into too much detail on what these two are, but basically, with reinforcement learning you're creating a reward model that's trained on preference data, the specific kinds of answers that you want, and that reward model gets put into the training lifecycle of the model, with the model trying to optimize for the largest reward. It's kind of complicated, and it's also very complex to run, so I wouldn't recommend you use RLHF. I definitely recommend you use direct preference optimization. What DPO does, basically, is take a bunch of data where you've scored the responses the model gives as good response versus bad response, and then you optimize the LLM directly toward the way you prefer your model to answer.

So I highly recommend that when you're building out your model front end, you make sure you're saving all the model responses to an annotation tool, for example Label Studio, which I'll show. Label Studio is a great tool for saving annotations, and you can go in after the fact and score responses as good or bad. I also recommend you capture this from the end-user perspective: in your web UI, you'll see some chat interfaces have a thumbs up or thumbs down, this was a good response, that was a bad response. You let the user do that, so it saves you from having to annotate that stuff after the fact. We haven't done that in our demo, but it's definitely cool.

If you take anything away from this slide: I highly recommend you use DPO, direct preference optimization. It's a lot simpler, it's computationally much cheaper and more stable, and it's been shown to provide better, or at least similar, results to RLHF. There's a reason the newer models like Zephyr, or even the new Mixtral model, all use DPO. And if you want to learn more about DPO, definitely check out Chris Manning on YouTube; he has a talk where he discusses it, and I believe it was a team from Stanford that published the DPO paper. He goes through the whole math behind it, the Bradley-Terry model, all that cool stuff.

We're almost done here. But yeah, now I'm going to show you what that looks like: how can we take our existing architecture and add that whole human feedback loop to it? Here we go, this is what we've done, basically. If I can zoom in, maybe. There we go. As we're asking questions to the model and getting the responses, we're also saving those responses directly to Label Studio. Once you get enough responses, let's say your model's been running for three months and you've got a whole bunch of responses, you can take all those responses, rank them, and then use that in your RLHF or DPO pipeline. The other thing I should mention is that you generally need a lot of data for DPO to be effective. So the other thing you can do, the low-hanging fruit that I recommend: going back to the Kubeflow example, say you're asking your model questions about Kubeflow, and you're finding that users are asking about KServe but your model's not very good at answering questions about KServe. That doesn't necessarily mean you have to go through the whole retraining or fine tuning of the model. Nine times out of ten, it's probably because you don't actually have enough documentation about KServe in your vector DB.
So what you can do is just add more documentation, the missing information, into your vector DB, and you'll very quickly see that your model is a lot better at answering those kinds of questions. The other thing I would be remiss not to mention is that in our model service we're actually using LangChain to orchestrate everything; of course, you can use LlamaIndex instead. Basically, when a user asks a question, it goes to LangChain first in the model server. LangChain does the similarity search to find the most relevant documentation, feeds it into the model, and the model gives a response. And of course you can run the model on CPU, which is what I'm doing in this demo, or you can use a larger model like Llama 2.

But yeah, I think we're finally going to get to the demo. Thank you guys for bearing with me. Let me show you what it actually looks like. Here we go. We've got our simple interface that we've built on Streamlit. What you can see here is a set of collections. Something else I didn't mention: in vector databases you have the concept of a collection, so you can have different sets of documents as their own collection and ask the model about different information. Because I just came from re:Invent, we're going to be asking it questions specifically from the AWS FAQs. What we did is take a whole bunch of the online AWS FAQs, which are made available as PDFs, and throw them into our vector DB, so now we can ask questions specifically about AWS. But we also have other collections, like financial minutes and what have you.

The other thing I should mention is that the model we're using is fairly small. We're using Flan-T5, an instruction fine-tuned Flan-T5 with 248 million parameters, which means the model size is really 900-and-something megabytes, and it's great for running on CPU. The only reason we're using this model is to save on cost; I don't want my company to get a really big bill.

So I will now ask it a question. I'm just going to minimize this. I'm going to ask it specifically about AWS: what is Amazon Bedrock? If you weren't at re:Invent a couple of weeks ago, Amazon Bedrock is basically the generative AI service that AWS is now offering. Hopefully it shouldn't take too long. There we go, our model has answered the question: basically, Amazon Bedrock is a fully managed service that offers leading foundation models as a simple API, and so on. The cool thing about this is we're using a model that's fairly old now, Flan-T5. If you were to ask Flan-T5 this question without doing retrieval augmentation, it would probably not be a very good time.

Then let me ask it another question: can you tell me some of the names of the foundation models you can use with Amazon Bedrock? Just to show that the model is keeping the conversation, the memory and the context. Hopefully the model gives us a good answer; this isn't really about accuracy, this is just about our implementation. There you go. So Bedrock gives you access to Claude, now Claude 2, of course AI21 Labs' Jurassic models, Stability AI, you've even got Llama 2 in there. But yeah, of course this is a very simple demo.
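For anyone curious, the front end really is only a handful of lines of Streamlit. This is a trimmed-down sketch, with a placeholder endpoint URL and the same query-plus-collection JSON shape I'll show in debug mode in a second:

```python
# Sketch of the Streamlit front end: a chat box that posts the question and the
# chosen vector DB collection to the KServe endpoint and prints the answer back.
# The endpoint URL and the "answer" response key are placeholders for our setup.
import requests
import streamlit as st

ENDPOINT = "http://flan-t5.kubeflow-playground.svc.cluster.local/v1/models/flan-t5:predict"

st.title("GenOps demo")
collection = st.selectbox("Collection", ["aws-faqs", "financial-minutes"])

if question := st.chat_input("Ask a question"):
    with st.chat_message("user"):
        st.write(question)
    resp = requests.post(ENDPOINT, json={"query": question, "collection": collection})
    with st.chat_message("assistant"):
        st.write(resp.json().get("answer", resp.text))
```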
Let me actually show you guys what it's doing under the hood, and this is another tip I would recommend: when you're building out an application, especially in development and non-production, you might want to build a debug interface directly into your application. So we have this super secret debug mode in our application. Now, when I turn this on and do the same thing, let me ask it a question: what is Amazon EKS? I'm just going to minimize this. There you go. Now, when I ask the model a question, because we have debug mode enabled, you can actually see what's going on under the hood.

This is the prompt that we've set up for our model. Basically, we're saying: using the piece of context that we're providing you, answer the user's question. The other key thing you should note is that we're also telling the model: if you don't know the answer, do not try to make up an answer, just say you don't know. And we found in our testing that a simple sentence like this really helps with model hallucinations. I'll show you a bit later; I'm going to ask it a question that's totally not related to AWS and hopefully it doesn't make up an answer. All we're doing is serving the model on KServe and sending it a JSON object that says this is our query and this is the collection we want it to use.

Then, when the model returns a response, it's also returning the top four most relevant documents. Here you can see the first document it found related to my question is right on the mark: it's basically "what is Amazon Elastic Kubernetes Service," and it's coming from, I believe, the EKS FAQ PDF. You can see here as well that you've got half of a sentence here and maybe another half of a sentence there; this is why chunk overlap is important when you're building out your chunking strategy. But there you go, you can see it's returning all the documents that are relevant to EKS, and then it's giving you the final answer, which is that Amazon EKS is an open source container orchestration service, basically managed Kubernetes on Amazon.

And then, again, I was going to show you that other example. If I ask, what team won the NBA championship in, I don't know, 1982, we're hoping the model does not try to make up an answer. And there you go: because I added that specific sentence to my prompt, it's basically telling me the provided context does not give it that information, so it can't answer my question. So again, another cool thing you might want to do.

Then let me jump over quickly to Label Studio, just to show you that we are in fact saving these responses. The other cool thing I should mention is that LangChain actually has a built-in module for Label Studio, so in your LangChain pipeline you can save responses directly to Label Studio. As you can see here, I'm just going to go to the most recent one. This was our NBA question. This is the actual template we're creating: you have the question, you have the response, and then all you're doing is scoring it as a good response or a bad response. You can do this directly in your chat interface, or you can do it after the fact. I'm going to say this was actually a good response, since we wanted the model to refuse rather than make something up.
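Since I mentioned LangChain's built-in Label Studio module, wiring it up looks roughly like this. I'm going from memory on the handler name and its arguments, so treat everything here, including the URL, token and project name, as placeholders and check the current LangChain and Label Studio docs:

```python
# Sketch: push every prompt/response pair from an LLM call into a Label Studio
# project so it can be scored later as a good or bad response.
from langchain.callbacks import LabelStudioCallbackHandler
from langchain.llms import HuggingFacePipeline

labelstudio = LabelStudioCallbackHandler(
    url="http://label-studio.kubeflow.svc.cluster.local:8080",  # placeholder URL
    api_key="<label-studio-api-token>",                          # placeholder token
    project_name="genai-feedback",                               # placeholder project
)

llm = HuggingFacePipeline.from_model_id(
    model_id="google/flan-t5-base",
    task="text2text-generation",
)

# Each call shows up in Label Studio as a task with the question and the answer,
# ready for a thumbs up / thumbs down style annotation.
llm("What is Amazon EKS?", callbacks=[labelstudio])
```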
And then, once you have a whole bunch of responses saved and ranked, you can go in and say, I want to export all of these, and export them as a CSV file. That's the data set you use for the reinforcement learning, the DPO, or what have you. And again, Label Studio is completely free to use. So yeah, that's pretty cool.

And I think we're kind of at the end of the presentation. Let's go back. There were a lot of other things I wanted to talk about, a lot more learnings, but of course I've already talked for way too long. Some of the other cool things we found: decouple and parameterize your GenAI components. You kind of saw that in my diagram, every single component was its own pod or service. You want to decouple the front end from the back end so that you can swap out different models or different model services and the user doesn't really notice anything. I would even go as far as taking the LangChain piece and putting that in as its own service, so that if you make any changes to the model you're not constantly rebuilding the container every time you go to deploy.

Another couple of cool things Kubeflow and KServe offer are notebook culling and KServe's scale to zero, which are really nice for resource management and freeing up resources. With notebook culling, you can set a timeout on the notebook; if I go back to my VS Code server, if you remember I had a VS Code server running... okay, maybe it hasn't timed out yet, but we set ours to 30 minutes, so for those data scientists who don't really clean up after themselves, it'll automatically cull the notebook if it's not being used. And then scale to zero is pretty awesome: when you deploy a model, you can set the minimum replicas to zero, so that if nobody is actually hitting the endpoint after five minutes or 30 minutes, it'll scale the model server down for you and save on costs. And then another couple of things: vLLM is pretty popular now, so definitely try using that for speeding up GPU inference, and if you're working with CPU, there's also DeepSparse and SparseML, which you can use to speed up model inference on CPU. That's something I'm playing around with right now, but definitely worth checking out.

And finally, just to recap everything: the open source community has a vibrant ecosystem of tooling that you can use for MLOps and for GenOps. You can see here, specifically for GenOps, we have Kubeflow, KServe, LangChain, LlamaIndex, again Label Studio, Streamlit and Chroma DB; and then specifically for MLOps, definitely check out MLflow if you aren't using it already, Feast, Spark on Kubernetes, I love Spark on Kubernetes, I could do a whole talk just on Spark on Kubernetes, and of course Elyra.

So yeah, thank you everyone. I guess we'll leave it up for questions if anybody has any. I know this was a long talk, so thank you guys for bearing with me. As well, if you want to check out Kubeflow, you can join the Kubeflow community Slack, just scan that QR code. We're a pretty cool community, pretty chill. You can also connect with me on LinkedIn if you want to ask more questions or anything like that. Yeah, thank you everyone. Sorry, did you have a question over there? Oh yes, that was actually from the Kubeflow repository.
So when you go to the Kubeflow manifests page, which again is at kubeflow/manifests, it basically has the install instructions and then all the different components of Kubeflow. Is that the one you were talking about? I think so, yeah. And then you're right, if you scroll down a little bit further, you have all the instructions for how to install the different components one by one. So yeah, you can totally do that.

Yes, yes. Yeah, so I don't think the cloud providers specifically support using both, because as you can see if I go back to my Terraform code, when you're deploying a cluster you're basically specifying which GPU sharing strategy you're using. Here we go. And it's either/or: either you're selecting time-sharing or you're selecting MIG. There's a lot of science that goes on behind it, with temporal multiplexing as it's called. You could probably ask the guys at NVIDIA, they would probably know the answer to this, but I don't think you can do that. In that case, I would recommend you just use a smaller GPU. The only reason we haven't really played around with multi-instance GPUs is that, unfortunately, we deployed our Kubeflow cluster in the Canadian region on GCP, and then a few months later we realized the only GPUs we have access to there are NVIDIA Tesla T4 GPUs. So we're in the process of migrating our cluster over to the US region just so we can play around with larger GPUs, but I don't believe it's possible. And generally with MIG, you have to restart your instance to be able to change the MIG configuration.

Okay, yeah, that might actually work. I mean, I guess it's left up to you to experiment. Generally, when you do time slicing... sorry, multi-instance GPUs, that GPU slice is treated as its own GPU, so you could probably find out what the underlying Kubernetes command or manifest is and definitely try that out. It would be interesting to see. And yeah, it does make sense in a student kind of setting where the students are just playing around with notebooks; again, not when you're fine tuning or training and you need 100% of the GPU, but it might be worth trying that out, so you can utilize it even more.

Yes, yeah, so for our example, of course, we built everything with open source inside our cluster, but Kubeflow allows you to connect out. It actually has connectors, like when you're running a Kubeflow pipeline, to connect to the cloud service providers. And of course, now that the cloud service providers are all catching on, they're each coming out with their own vector search; every database, it seems, has vector search now. So you can definitely do that, and that's what we recommend to some of our clients, because not everybody wants to manage their own vector DB. And again, Chroma DB is great for small scale, but if you're scaling up to much larger workloads, definitely go with a managed service or a more robust database.

So yeah, because it's open source, you can swap things out and say, I only want to use this. We have clients who are maybe just starting out, and they only want to use the Kubeflow notebooks and the KServe model serving, and they don't use all the other components. And because the UI is highly customizable, you can actually go into the Kubeflow UI config and just remove the parts that you don't want to use, so you can have just notebooks and just MLflow, or something like that.
You mean this? I didn't actually demonstrate a notebook for it, so no, I don't have the actual code for that, and I don't have it on GitHub. But if you go to streamlit.io, they're really trying to cater to generative AI models, so you'll see they have a lot of templates. Let me see, yeah. They already have a lot of template code for how to build a chat app, and the other cool thing is they have example apps that you can clone directly from GitHub. So if I go to this Llama chat app, I can totally test it once it loads, and it also shows you the GitHub repository for it. Oh wait, it's telling me to install something. But yeah, all the users who've created their... maybe nothing is working, I don't know. Okay, there it is. This is again a chat interface, and you can go in and fork this app on GitHub if you'd like. And then the actual retrieval augmented generation part of it: we're just using a basic LangChain RAG setup. If you go to the LangChain documentation, we're doing the same kind of thing: instantiating the LLM, the embeddings model, the vector data store, and then just running the chain. Yeah. Thank you. Anybody else? No? Yes, go ahead.

I guess that was true of older versions of Kubeflow. We've had team members who used Kubeflow in the past, and when we started out our MLOps platform and I chose Kubeflow, they were like, I used Kubeflow before and it wasn't very good. But now, with the newer releases, it's getting a lot better. I showed you how quickly you can get Kubeflow running. I would say it does take a bit more advanced work if you want to customize it like we've done, but there are distributions for that. I should shout out deployKF. What deployKF is trying to do is solve the problem of how you customize Kubeflow. Hold on, sorry, let me just go to deployKF. If you don't have a large development or engineering team, you should definitely check it out. What deployKF basically allows you to do is deploy Kubeflow with very easy customization. Hold on, let me go to the source code, I'll show you. Basically, you have one central config file which holds all the customization. You just change that config file with your customizations, and then when you go to deploy the cluster, it automatically puts the right code into the manifests and deploys. So it's worth checking out deployKF, it's a really cool project. It helps ease the burden, and it also helps with upgradeability: when it's time to upgrade to a new release, what Matthew's promising is that you just have to make minimal config changes and it'll upgrade to the newest version for you. It's very new, yeah. The guy who's creating this is Matthew Wicks; he's a big part of the Kubeflow community, he's actually the lead of the Kubeflow notebooks working group, and he's working hard along with some others to make it a lot easier for large enterprises to adopt Kubeflow.

Cool. Anybody else? Yeah, if you want, you can always come to me afterwards with more questions, or if you want to see more of the different parts of Kubeflow, I can definitely demo all the Kubeflow stuff for you. Thank you, I guess it's almost time for lunch, so if you guys want to go and grab lunch, you can do that.