Thank you. Hello, everyone, and welcome to our presentation. Hi, everyone. Welcome to our talk titled The Hitchhiker's Guide to Kubernetes Platforms: Don't Panic, Just Launch. We will walk you through how to build an AI platform. As the title says, you will see a lot of references to the book The Hitchhiker's Guide to the Galaxy, but if you haven't read it, don't panic, because we'll make every example as clear as possible. So my name is Tessa Pham. I'm a software engineer at Bloomberg. And my name is Alexa Griffith. I'm also a software engineer at Bloomberg. We both work on the inference team under Cloud Native Compute Services.

So as Tessa mentioned, in this talk we're going to walk through how to build an AI platform that allows both the developers and the users to not panic and just launch their services. So let's open our guidebook to the first page, where you'll see six items. This is the set of principles that we want to follow when building any kind of platform, whether big or small, and they are stability, security, scalability, observability, easy onboarding and active user support. Together, they make up a towel, which, if you don't know, is the most important item an interstellar hitchhiker can carry. Once your platform is grounded on this set of principles, you should be all set.

So we're going to use our inference platform as an example for this talk. How do we power our inference platform? We power it with KServe, which is an open source project that our team lead, Dan Sun, co-founded. Basically, it's an easy way to monitor and deploy long-running inference services. KServe supports various model serving runtimes and has out-of-the-box features for things like security, observability, scaling and more. At the bottom of the slide, you can see the link to learn more about KServe and get involved.

So in our control center: I think it's important to understand, for any platform you're building, the components that make up that platform and the requirements that they have. For example, in our inference control center here, consider the increasing usage of LLMs and GPUs in inference services. Additionally, we need to support various model serving runtimes like Triton, PyTorch, Hugging Face and much more. We need to consider that there is a multi-pod deployment most of the time, and these pods need to communicate with each other. So a transformer and a predictor are usually deployed together, maybe even an explainer, and they all need to communicate to give a prediction result. Also, we need to support advanced deployment strategies like canary rollouts, experiments and ensembles, and we have a complex architecture where one pod serves multiple models, called ModelMesh. We also need to store our models in a model registry, and we should, as a whole, consider the ML development lifecycle: we experiment, train, validate, serve and then maybe repeat.

So what we've done is we've created consistent APIs across our organization to give users a consistent experience across the ML lifecycle, not only for our inference platform but also for things like training and experimenting. But today we're going to focus on the three main APIs for the inference platform that we've built to achieve this. First we'll talk about debugging, then deployment and then versioning. So before we dive into the APIs, let's address one question that you might have: why don't we use the Kubernetes API directly?
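To make that question concrete, here is roughly what a data scientist would be dealing with if we handed them Kubernetes and KServe directly. This is a minimal sketch, assuming KServe's v1beta1 InferenceService CRD; the cluster credentials, namespace and model URI are made up for illustration.

```python
# A rough sketch of what "just use the Kubernetes API" would mean in practice:
# the user needs kubeconfig access, has to know their cluster and namespace,
# and has to hand-write a KServe InferenceService spec before doing the
# equivalent of `kubectl apply`. Names and the model URI are placeholders.
from kubernetes import client, config

config.load_kube_config()  # assumes the data scientist holds cluster credentials

inference_service = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "sklearn-iris", "namespace": "team-tessa"},
    "spec": {
        "predictor": {
            "sklearn": {
                "storageUri": "s3://models/sklearn-iris/v1",
                "resources": {"limits": {"cpu": "1", "memory": "2Gi"}},
            }
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace="team-tessa",
    plural="inferenceservices",
    body=inference_service,
)
```

That is a lot of Kubernetes to understand just to get a model behind an endpoint, which brings us to the first reason.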
First of all, we don't want to expose Kubernetes to our platform users, who are mainly data scientists. Kubernetes is a complicated technology with a huge learning curve, and there's a lot of information that doesn't need to be exposed. For example, we want to abstract away cluster deployment, since users don't need to know which cluster their services are running on. Aside from Kubernetes, we also use other tools like Humio, Grafana, et cetera, and by building APIs on top of these tools we offer a centralized platform for all steps in the model development and deployment cycle, so that users can find everything in one place. It will also be easier for us to manage dependencies and keep the platform consistent. Having our own APIs allows for more control over security and visibility: once a user gives us what they want to deploy, we can manage the rest, from authentication and authorization all the way to serving and monitoring their services. And as you'll see in the next slide, we can also offer an enhanced debugging experience, with links to metrics dashboards and aggregated service statuses.

Okay, so let's talk about our debugging API and the benefits that it offers. The debugging API provides better usability for our users: they can use our API consistently both programmatically, let's say in workflows, and if they want to deploy in the UI. We also offer enhanced statuses, which we'll dig more into; these return consistent and easy-to-understand error messages to our users. We've also added customized debugging links for each resource to different dashboards and events for better observability. Overall, the debugging API adds a level of consistency to our platform and generally a better user experience.

So let's talk about some of the things we could return from our debugging API. Here are some examples of endpoints in our debugging API. Let's say we want to get pods, logs, deployments and any abstraction resource you have on top. For us, we have a resource called InferenceService that has different statuses and information about the inference service deployment as a whole. Next, let's look at the right, where we have an example response. It's very similar to the YAML output you get from kubectl get pods, if you know it, with some additions in red. We return some customized links, as I discussed, for debugging each endpoint, and a new status summary field that aggregates all the statuses into one consistent output. Overall, at our company for example, we have strict restrictions on viewing production, and this API removes some of the steps it took to do that and makes it much easier for us to securely debug production.

All right, so let's dive into the work that we did to make statuses clear and easier to understand. Firstly, we made statuses consistent across our UI and API, because as you can see at the bottom of this slide, status fields and values take logic to interpret and are not consistent between resources. Some statuses have defined types; a pod, for example, has phases. We don't want our users to have to understand or interpret each of these differences, so we should just make them consistent. Additionally, if a deployment has zero replicas, for example, it could be an indication of an error, or it could mean that Knative has scaled it to zero due to no traffic. A user doesn't need to know what a Revision resource is and go check that status, but it is important for us to be able to show that as the status message.
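As a rough illustration of what the platform checks on the user's behalf in that zero-replica case, here is a sketch that looks at the Knative Revision behind the deployment. It assumes Knative Serving's v1 Revision conditions (in particular, Active being False with reason NoTraffic when a revision has scaled to zero); the revision and namespace names are placeholders.

```python
# A sketch of deciding whether "0 replicas" means scaled-to-zero or a failure,
# by reading the Knative Revision's conditions so the user never has to.
# Group/version and condition names follow Knative Serving v1; names are placeholders.
from kubernetes import client, config

config.load_kube_config()

def zero_replica_status(revision_name: str, namespace: str) -> str:
    revision = client.CustomObjectsApi().get_namespaced_custom_object(
        group="serving.knative.dev",
        version="v1",
        namespace=namespace,
        plural="revisions",
        name=revision_name,
    )
    conditions = {c["type"]: c for c in revision.get("status", {}).get("conditions", [])}
    active = conditions.get("Active", {})
    ready = conditions.get("Ready", {})

    # Active=False with reason "NoTraffic" means the revision simply scaled to
    # zero; that is not an error and should read as "Inactive" in the UI.
    if active.get("status") == "False" and active.get("reason") == "NoTraffic":
        return "Inactive (scaled to zero due to no traffic)"
    if ready.get("status") == "False":
        return f"Unhealthy: {ready.get('message', 'revision is not ready')}"
    return "Healthy"
```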
Some other tricky situations and edge cases include when a deployment is technically progressing but the pod is failing, or a pod is running but a container is failing. There's also logic that needs to be added just to interpret each condition: if Progressing and Available on the deployment are both false but no replica failure exists, we need logic to determine that the deployment is most likely unhealthy. So in the UI, imagine that at the top level we show a status, and this status needs to indicate whether the user needs to click into the page and debug further at a lower level. The deployment needs to show an error if a container is failing. This is why we need a status summary. So let's take a look at an example of our logic to show a consistent deployment-level status by considering each condition and resource.

Okay, so first I'm just going to walk quickly through our logic for determining a deployment's status. First, we check the Progressing condition; if that's true, we'll label the status as progressing. Next, we check Available. Again, if the replicas are zero, it might have scaled down, so let's check Ready. If Ready is true and the replicas are zero, then we'll set the status to inactive. However, if it's false, it's an indication that it's unhealthy and we need to return that. Same with ContainerHealthy: if that's false, it's an indication we're unhealthy and we need to bubble that up. But let's assume that we're healthy. Next we'll go to ReplicaFailure; if that's true, that means we're unhealthy. Okay, so let's say everything's healthy here. That's not the end, right? We then need to consider the pod. The pod phase names are a bit different, but we want to be consistent and not confuse the user, so we just map those to one of our predefined statuses. But let's say that the pod is healthy. A pod can be running, or healthy, and there could still be a container issue, so we basically need to check the container conditions: if ready is true on these, then we're probably good; if it is not, it's unhealthy. So let's say one container is failing but everything else is healthy. We want to show at the top level that the deployment has something you need to look at, so that you know to click in and look at the lower level.

So not only do we want consistent statuses, but we also want to return an easy-to-understand debug message to the user, so they can understand and fix their service if they need to. Another team in our organization was also working on this, and they did some great work with flow charts for each resource. This is one of the smallest ones I could put on a slide; there's more, but I'm not going to go through it all. We'll upload the slides, so feel free to look at it afterwards. But the point there is that translating each status reason and message is important for a better user experience, because you want users to be able to take action and debug their service. So not only do we need a clear and consistent status, but we also need to show the user and tell the user how to fix it. As an example, this is part of the pod flow chart, and I just show this to say that when a pod is terminated, among other conditions, there's a variety of exit status codes that you can get. Returning just "exit code 125" to the user is not an amazing user experience, right?
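Before getting to how we translate those codes, here is roughly what the status-summary walkthrough above looks like as code. It's a condensed sketch, not our production logic: it works over plain status dicts, mixes deployment and Knative revision conditions the way the walkthrough does, and uses simplified predefined statuses.

```python
# A condensed sketch of the status-summary logic described above. The inputs
# are plain dicts shaped like Kubernetes status blocks; the status names and
# the exact ordering are simplified for illustration.

POD_PHASE_MAP = {  # map raw pod phases onto our predefined statuses
    "Pending": "Progressing",
    "Running": "Healthy",
    "Succeeded": "Healthy",
    "Failed": "Unhealthy",
    "Unknown": "Unhealthy",
}

def _condition(conditions, type_):
    return next((c for c in conditions if c.get("type") == type_), {})

def summarize(deployment: dict, pod: dict) -> str:
    status = deployment.get("status", {})
    conditions = status.get("conditions", [])
    replicas = status.get("replicas", 0)

    progressing = _condition(conditions, "Progressing")
    available = _condition(conditions, "Available")
    ready = _condition(conditions, "Ready")
    container_healthy = _condition(conditions, "ContainerHealthy")
    replica_failure = _condition(conditions, "ReplicaFailure")

    # Zero replicas is not necessarily an error: if Ready is still True,
    # we most likely scaled down to zero, so report "Inactive".
    if replicas == 0 and ready.get("status") == "True":
        return "Inactive"
    if ready.get("status") == "False" or container_healthy.get("status") == "False":
        return "Unhealthy"
    if replica_failure.get("status") == "True":
        return "Unhealthy"
    # Progressing and Available both False with no replica failure still
    # usually means something is wrong, so bubble that up too.
    if progressing.get("status") == "False" and available.get("status") == "False":
        return "Unhealthy"
    if progressing.get("status") == "True" and available.get("status") != "True":
        return "Progressing"

    # The deployment looks fine; the pod and its containers get the last word.
    phase_status = POD_PHASE_MAP.get(pod.get("status", {}).get("phase", "Unknown"), "Unhealthy")
    if phase_status != "Healthy":
        return phase_status
    for container in pod.get("status", {}).get("containerStatuses", []):
        if not container.get("ready", False):
            return "Unhealthy"  # e.g. one container stuck in CrashLoopBackOff
    return "Healthy"
```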
So for those raw reasons and exit codes, we need to understand and translate them into messages that are short and easy for our users to understand. The problem is that the reason field is not always easy to interpret, but the message field can be very long, so we need some kind of in-between, and mapping out each of these can help us provide users with the information that they need to take action.

So let's just take a step back real quick and look at how this would actually work. Let's say on the left you have a UI where the component is at the deployment level, and if you need to click in, you can, right? Let's say that deployment was technically fine, or progressing. Here we would see a pod output, and at the bottom you notice that it's running, right? But what you notice is that you need to check further, because one of the containers is actually in CrashLoopBackOff. The state is waiting, with a message and a reason, which gives us some indication. But additionally, we need to look at the last state, which has exit code 128. Using that, we can return to the user something that lets them know: hey, you actually need to look into this and fix it. So if a user saw this, they could click into components and go to the debug tab, right? On the debug tab you'll see more information, and I know in this example the container didn't start, but if the container did start, which a lot of the time it does, and there are useful error messages, we also provide those here.

So again, I talked about status enhancements, but also customized debugging links. These links take you to dashboards that detail resource utilization, throughput and latency at every level. Our team has worked very hard to get down to a small number of really good graphs for our users, and we also show the Kubernetes events for every resource and real-time pod logs, and for when the pods go away we also have a persistent log store. This helps users observe their services better outside of the UI if needed.

Next, let's move on to the deployment API. In order to deploy an inference service, you need to prepare a YAML manifest to run kubectl apply with, right? So let's see what this API looks like in space, with a YAML manifest being equivalent to a spaceship that has been built, and an inference service being a spaceship that has been launched and is now flying in space. On the left-hand side is the launch pad, where you can select which spaceship to launch. The build configuration of this ship is shown as a YAML manifest, which contains the name, metadata, labels, the model, et cetera. If we'd like to enhance the API further, the spaceships can be validated and approved beforehand to make sure that we're always launching a functional spacecraft at all times. Ideally, we'd want to build spaceships well in advance, not wait until we want to launch one to start building it. Spaceships that are not yet launched, or have been launched before but have now returned to the station, are stored in the space dock database on the right-hand side. So you can think of the space dock database as where the manifests of the inference services are stored, and the launch pad as where you pull up a manifest and deploy it into a service. Let's break this API down into smaller chronological steps. First, you want to put together an inference manifest, of which the key ingredient is the model. In space, this means that you're building a spaceship with an ML model as its engine.
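As a rough illustration of that first step, here is a hypothetical helper; the function name, fields and registry URI scheme are made up for illustration and are not our actual schema. The idea is that the user supplies little more than a name and a model reference, and the platform expands it into a full manifest.

```python
# A hypothetical sketch of "building the spaceship": the user provides a name
# and a model reference, and the platform fills in the rest of a KServe-style
# InferenceService manifest. Names, defaults and the URI scheme are illustrative.
def build_manifest(name: str, model_uri: str, framework: str = "sklearn") -> dict:
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name},
        "spec": {
            "predictor": {
                framework: {"storageUri": model_uri},
            }
        },
    }

# The model URI would come out of the model registry rather than being typed by hand.
manifest = build_manifest("space-gpt", "registry://models/space-gpt/versions/3")
```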
We'll integrate this API with a model registry that stores different models, like sklearn or SpaceGPT, and like any other model registry out there, this one also tracks versions and statuses of the models. This registry allows you to reuse, monitor and easily pull a model into the spaceship that you want to build. And if you'd like to learn more about how we built the internal model registry at Bloomberg, there will be a talk tomorrow at 11:15.

So now that you have built spaceships, let's find out how, or where, they're docked. Just like Zaphod Beeblebrox and Ford Prefect, Alexa and I are also pretty passionate about spacecraft. Each of us has our own fleet of spaceships of different makes and models, and they're all stored in the space dock database, as you can see. So how can they be managed so that we don't accidentally deploy each other's spaceships? This is the typical challenge of a cloud native platform: how to handle multi-tenancy. Let's take a step back. We are building our platform on Kubernetes, so why don't we use Kubernetes namespaces to represent multiple tenants? Well, that's a neat idea, but how do we connect a Kubernetes cluster to a database? This is where Kine comes in. Kine is a shim that allows Kubernetes to use external storage alternatives, mainly relational databases, as a replacement for etcd. Using this, we can section off the spaceships in our database based on the namespaces that belong to their owners. Utilizing Kubernetes namespaces, we can have RBAC on the cluster to control access and privileges, and because stability is one of our priorities, we can also store this set of policies externally and translate them into RBAC rules. By doing this, if the cluster goes down for any reason, we can still access the policies and recreate them. So with this design, we are now able to track service ownership, and the ownership of space vehicles.

All right, so now that you have built spaceships, know where they're stored, and have the peace of mind that no one else can use your spaceships, let's go to the launch pad to deploy them. This is a more detailed launch pad than what was shown before. After introducing Kubernetes namespaces into our API design, you'll now need to specify the namespace that you own and that is associated with the spaceship you want to launch. Before launching, you also want to make sure of another thing, which is: do you have enough fuel to launch? Similarly, you want to know how many CPUs and GPUs are available to your tenancy before deploying an inference service, because we all know resources are limited and each tenant has different resource limits. Since Kubernetes also allocates resources by cluster and tier, we want to track and display resource utilization at the cluster and tier level as well, which isn't visualized here but should be included in the deployment UI. Both of these measures are important to expose to the users because they determine whether the inference service deployment will be feasible or not. If you'd like to modularize the platform further, you can choose to implement a separate API that deals with and exposes resource insights exclusively, which Alexa mentioned before.
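As a rough illustration of the "enough fuel" check, here is a sketch that reads the tenant namespace's ResourceQuota through the standard Kubernetes API. It assumes quotas are defined per namespace; the namespace name and the GPU quota key depend on how your clusters are configured.

```python
# A sketch of checking whether a tenant still has fuel before launch:
# compare used versus hard limits on the namespace's ResourceQuota objects.
# The quota keys listed are typical but depend on cluster configuration.
from kubernetes import client, config

config.load_kube_config()

def remaining_fuel(namespace: str) -> dict:
    quotas = client.CoreV1Api().list_namespaced_resource_quota(namespace)
    report = {}
    for quota in quotas.items:
        hard = quota.status.hard or {}
        used = quota.status.used or {}
        for key in ("requests.cpu", "requests.memory", "requests.nvidia.com/gpu"):
            if key in hard:
                report[key] = {"used": used.get(key, "0"), "hard": hard[key]}
    return report

print(remaining_fuel("team-tessa"))  # e.g. {'requests.cpu': {'used': '6', 'hard': '16'}, ...}
```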
Now that we have deployed resources, let's handle cluster maintenance. If you've read the book, you probably remember Magrathea, the richest planet of all time, which is visited by the main characters in the book. It orbits two twin suns, Soulianis and Rahm, and using this analogy, let's find out how disaster recovery is done in space, with these suns representing a pair of Kubernetes clusters. Just like you would want to have your service up and running at all times in the target tier, let's assume that Beeblebrox and Slartibartfast want their spaceships, called the Heart of Gold and the Bistromath respectively, to orbit a sun at all times. We don't know why, but that's their goal. To achieve this, each of them launches two identical spaceships, one towards each sun. Well, this is smart, because if one sun gets destroyed or explodes for any reason, they will still have the same spaceship heading towards the other sun, so unknowingly they've handled disaster recovery. Amazing, right? For our Kubernetes platform, we can do the same thing: we can translate this idea into setting up a pair of identical Kubernetes clusters per tier and always deploying every inference service to both of them. If one cluster goes down, we'll still have the other cluster serving traffic with the exact same configurations.

When setting up our cloud native infrastructure, we want to treat our clusters as cattle and not pets. What does that mean? It means that they should be easily taken down and started up without effort and without affecting the users. In order to accomplish this, we keep a single source of truth externally, which stores the desired state of a cluster, outlining the resources that should be in it at a given time. Similarly, in space we have this table, which can be kept in the space dock database as well, that records which instances of spaceships are heading towards which sun and in which namespace. Coupled with the debugging UI that was introduced earlier, adding this deployment API to the platform will ensure a centralized and continuous inference pipeline, from creating YAML files, to deploying them into services, to debugging and monitoring the service statuses. Having this API also allows users to perform programmatic deployments, for example by calling the API directly in a workflow, without ever having to touch kubectl. And as you can see, throughout the design we have incorporated the three principles through managing access control, handling disaster recovery and exposing resource insights.

Our next and final API is the versioning API, which is actually represented by the space dock database. It allows you to save and store versions of your spaceships and easily launch and revert them. So remember how we access the database through a Kubernetes cluster? Well, in order for them to work together, we'll need to create CRDs for the resources that we want to store in the database, one of which will be the manifest. And this is what we'll have in real life: a bunch of manifests which we can use to deploy an inference service, and also to revert an inference service back to one of them. By creating custom resources for the manifests and using Kine to store them in an external database, our versioning API is completely Kubernetes native, and this gives us dry run and validation out of the box, along with the ability to manage multiple tenants and manage access control, as mentioned before.
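To sketch what that Kubernetes-native versioning can look like in practice, here is a hypothetical ManifestVersion custom resource being saved and listed through the standard custom-objects API. The group, plural and field names are made up for illustration; the point is that the client code is plain Kubernetes, while Kine stores the objects in the relational database, our space dock.

```python
# A sketch of saving and listing manifest versions as a hypothetical
# ManifestVersion custom resource. The API server is backed by a relational
# database via Kine, but the client code stays plain Kubernetes.
# Group, plural and field names here are illustrative only.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

GROUP, VERSION, PLURAL = "platform.example.com", "v1alpha1", "manifestversions"

def save_version(namespace: str, name: str, version: int, manifest: dict):
    body = {
        "apiVersion": f"{GROUP}/{VERSION}",
        "kind": "ManifestVersion",
        "metadata": {"name": f"{name}-v{version}", "namespace": namespace},
        "spec": {"inferenceService": manifest, "version": version},
    }
    api.create_namespaced_custom_object(GROUP, VERSION, namespace, PLURAL, body)

def list_versions(namespace: str, name: str) -> list:
    objects = api.list_namespaced_custom_object(GROUP, VERSION, namespace, PLURAL)
    return sorted(
        (o for o in objects["items"] if o["metadata"]["name"].startswith(f"{name}-v")),
        key=lambda o: o["spec"]["version"],
    )
```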
The versioning API enables users to reuse manifests and perform rollbacks and promotions of inference services easily, and the deployment and versioning APIs can work together to create and save a deploy event whenever a user deploys, reverts, or deletes an inference service, which provides an audit trail for the deployment history. So why do we want to implement an internal versioning system instead of using GitOps with Argo CD, for example? Well, first of all, at Bloomberg our source of truth is stored in a database for security reasons, but having our own internal system also allows us to do deployment overrides and provide diffs between versions; we can customize the features to cater to our users, and it will also make it easier for us to integrate with the UI and streamline our user experience.

And just as in space, where people would want space stations and spaceships to follow the same galactic rules, we want to push for cohesive APIs across teams in our org, because each team supports a part of the full model development lifecycle, and having uniform APIs across the teams will also reduce duplicate effort and streamline the end-to-end experience from training to deploying the user's models. Apart from the APIs, we also replicate the infrastructure underlying each runtime from one common stack, with the same namespaces, tiers and clusters.

In reality, our API design is mainly user-driven. We usually conduct user interviews before building out new features, and collect satisfaction scores for each aspect of the UI and UX after a feature is rolled out. According to the survey, our alien user thinks that the deployment process on the interstellar launch pad is pretty awesome. So with better APIs we can decrease the support burden for our platform teams, and at the same time we still need to provide sufficient and up-to-date documentation and onboarding guides, along with an active support chat to answer questions and debug issues live. Once or twice a year, we'll also hold workshops to introduce new features and provide in-depth training to our platform users.

Looking forward, we are aiming to reach level-three scalability as a product; you can see the timeline of our inference platform above. So basically, our next challenge revolves around scaling. We also want to provide a lot of LLM support, as LLMs are becoming increasingly popular, and we want to build a benchmarking API and roll out our OpenAI protocol support as well. So now that we've given you all the answers to build an amazing AI platform, you can go build your own. So long, and thanks for all the fish. We have a couple of minutes for Q&A if anyone has any questions. I think we gave all the answers. So, 42. Okay, thank you. Thank you.