Hello, everyone, and thank you for being here with me. My name is John, and I am a Principal Engineer at Jina AI, and I'm here to present a talk called MLOps for Python and ML Engineers. Let me share a few words about our company, Jina AI. We were founded a little over three years ago, and we are a bit more than 50 people in the team. Although we are distributed around the globe, our main offices are in Germany and China. Just check our website and you will see exactly what we do.

So let's go through the outline of this presentation. First, I want to give an overview of the ML in MLOps: what the use cases and challenges are that all these deep-learning-powered applications have, and what motivates us to build what we are building. Then we are going to have an overview of the MLOps platform that Jina provides; some of these names will make more sense as we go through the presentation. We are going to present DocArray and Jina, we are going to see how we provide observability, how we go on our journey to the cloud with Kubernetes, and then the last part of the stack, which is JCloud. And at the end, we will have the typical conclusions that we have in our presentations.

So let's see what the typical use cases are that we can build right now with these deep learning models: generating embeddings, neural search, all of generative AI, LLMs, which are in a way a subset of generative AI, and this whole umbrella term of multimodal AI that has appeared recently.

Embeddings are the foundation and driving force of many downstream tasks in AI. So what is an embedding? It is a vector representation of any object: it can be text, it can be images, audio, whatever. We try to represent the object as a vector, and the job of the model is to make sure that this vector carries semantic meaning. For neural search, that means similar objects end up close to each other in the vector space. For generative AI, it means that at the decoding step these vectors turn into nice pictures or nice generated content.

Generative AI is something that has astonished everybody in the last years: how amazing pictures can be drawn just by giving prompts, with models like Stable Diffusion and DALL-E. And I'm sure everybody has been playing around with ChatGPT, with all these GPT models, LLaMA, and so on. All of this leads to a new trend of AI that we like to call multimodal AI, where text, images, video and audio interact and exchange the roles they play together. I'm sure that at some point, or maybe already, there is a model that, given an image, will create a song for you inspired by that image.

One thing worth mentioning is that for building these applications and working with these models, Python is, for better or worse, the lingua franca. Most of the deep learning frameworks are written in Python, and Python is by far the most popular language among ML practitioners. And although Python has some performance limitations, a lot of those problems are being solved: Python is moving toward stronger typing, and most of the heavy work is actually done in lower-level languages, with Python being only the binding layer. I think there was a talk about this earlier, where a Rust library was exposed to Python via bindings. So most of these problems are actually hidden and abstracted away.
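Coming back to embeddings for a moment, here is a minimal, illustrative sketch of what computing one looks like in code. It assumes the sentence-transformers library; the model choice is arbitrary, any text encoder would do:

```python
# Minimal sketch: turning text into a semantic vector.
# Assumes the sentence-transformers library; the model choice is arbitrary.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')  # happens to produce 384-dim vectors

# Similar sentences should end up close to each other in vector space.
embeddings = model.encode(['a photo of a cat', 'a picture of a kitten'])
print(embeddings.shape)  # (2, 384)
```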
So what is the typical workflow when we try to build, serve, and deploy these projects? First we need to train a model or create a POC. Sometimes we don't even need to train the model, because models provided off the shelf can already serve our application properly; we just have to build the POC to make sure we get the results we expect, right? Then we need to serve this model, because very rarely will you use it only yourself; you want to serve it to your clients or your colleagues, so you would use a serving framework, typically an ASGI one. Then we might want observability, to see what goes well and what goes wrong. And at some point you might want to deploy to the cloud, whenever you need scaling; you just don't want to run it yourself, you have more needs, right? As you can see, I think there is a trend in MLOps: as you climb this ladder, you need fewer ML skills and more ops skills, and the other way around.

So I want to present the MLOps framework that we have at Jina, which tries to help the ML engineer go through these steps in the smoothest way we can. I have pictured the MLOps platform like this: there are two main components, DocArray and Jina, which you see at the center. And I want to emphasize that Jina and DocArray don't do everything by themselves; we don't reinvent the wheel. We integrate with, use, and make it easy to work with a lot of technologies that are already awesome to work with: we make it easier to work with PyTorch or TensorFlow, we rely heavily on Pydantic, gRPC and FastAPI, and we make sure to use OpenTelemetry and Kubernetes to ease the journey to the cloud, for instance.

Okay, so first I want to present DocArray, the component that we expect to make your life easier when you build your POCs locally. It's worth mentioning that it's a project we developed and donated to LF AI & Data, I think back in November last year, but time flies. Basically, it's a Python library for representing, sending and indexing multimodal data. It focuses on representing this multimodal data with the embedding at the center. As I said before, embeddings are like the food that every downstream application eats for breakfast, so we want to make it easier for you to work with them. And there is this new component of the ML stack that has appeared in the last years, vector databases: we make it easy to index your data into these databases and retrieve it in a seamless way, so that you can change your backend very easily, and so on.

This will make more sense later when I show some code snippets, but DocArray is organized into four modules, roughly mapped in the sketch below. There is the data representation module, where we have the base document. There are containers, which group documents efficiently and let you work with groups of them. Then there is a typing module with helper types for multimodal data, video things, audio things, so that you can work more efficiently. And then there is the document index module, which lets you take these docs and index them in vector libraries or vector databases: we have HNSW and a plain in-memory option, and we support Weaviate, Qdrant, Elasticsearch, and so on.
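As a rough map of those four modules, assuming DocArray v2 (the docarray package, version 0.30 or later), the imports look roughly like this:

```python
# Rough map of DocArray's four modules (assuming DocArray v2, i.e. docarray >= 0.30).

# 1. Data representation: the base class every document inherits from.
from docarray import BaseDoc

# 2. Containers: list-like (row-oriented) and vectorized (column-oriented) groups of docs.
from docarray import DocList, DocVec

# 3. Typing: helper types for multimodal fields, with built-in validation.
from docarray.typing import ImageUrl, TorchEmbedding

# 4. Document index: one class per vector database / vector library backend.
from docarray.index import HnswDocumentIndex, InMemoryExactNNIndex
```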
So the data representation module has the BaseDoc, the main concept inside DocArray. DocArray depends a lot on Pydantic, and one can think of BaseDoc as a Pydantic BaseModel on steroids, so to speak. It helps to represent multimodal data, and it's ready to be serialized into JSON, into protobuf, into bytes, and so on. And there are some predefined documents that we have built for you to work easily with multimodal data. The typing module is just a set of classes and helpers that you can enjoy, as we will see later. For containers, we have two kinds, to work in two modes: the DocList, which behaves like a Python list and is intended for inference workloads, it's a row-like structure; and the DocVec, I'm not a fan of this name, which is a column-oriented structure where tensors are grouped together, so it's more intended for training workloads. And the document index module, as I said, is where we have the integrations with all these vector databases.

But this makes little sense out of the box unless you see the code, so let's see it more or less in action. We are going to build a small POC that computes some embeddings: an encoder class with an encode method. Let's see what we need here. We do the imports from DocArray: from the main module we import BaseDoc and DocList, and from the typing module we import the TorchEmbedding class, which helps us work with embeddings that come from Torch. Then we define our types, very much like in Pydantic: we define our text field, and we say that our output is a TorchEmbedding of dimension 384. This is going to validate the dimension of the embeddings, which is something that is very often quite a struggle to get right. Then we just wrap our model and expose it in a method, I mean, just regular programming, and we use it and get an embedding out of it.

Then we can expand this into a neural search application. For this we need an indexer, so from the index module we import a class that helps us index and retrieve documents from memory, without any scaling or extra work. Then we get our results. As you see, the result here is nested, because you want the object that was the query together with its results as matches, and DocArray lets you express this kind of complex representation. You build the indexer, you wrap the index, and you build an index and a search method; combined with the encoder from before, we get an application that, given an input, matches whatever was indexed. It's quite a toy example, but the sketch below shows the shape of it.
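Here is a minimal sketch of what that encoder-plus-indexer POC could look like, assuming DocArray v2 and sentence-transformers as the underlying model; the model choice and class names are assumptions, not the exact code from the slides:

```python
from typing import Optional

from docarray import BaseDoc, DocList
from docarray.index import InMemoryExactNNIndex
from docarray.typing import TorchEmbedding
from sentence_transformers import SentenceTransformer


class TextDoc(BaseDoc):
    text: str
    # The type validates that every embedding is exactly 384-dimensional.
    embedding: Optional[TorchEmbedding[384]] = None


class Encoder:
    def __init__(self):
        # all-MiniLM-L6-v2 happens to emit 384-dim vectors, matching the type above.
        self.model = SentenceTransformer('all-MiniLM-L6-v2')

    def encode(self, docs: DocList[TextDoc]) -> DocList[TextDoc]:
        embeddings = self.model.encode(docs.text, convert_to_tensor=True)
        for doc, emb in zip(docs, embeddings):
            doc.embedding = emb
        return docs


encoder = Encoder()
index = InMemoryExactNNIndex[TextDoc]()

# Index a few documents.
docs = DocList[TextDoc]([TextDoc(text=t) for t in ('hello world', 'neural search', 'MLOps talk')])
index.index(encoder.encode(docs))

# Search: encode the query, then retrieve the closest matches.
query = encoder.encode(DocList[TextDoc]([TextDoc(text='searching with neural nets')]))[0]
matches, scores = index.find(query, search_field='embedding', limit=2)
print(matches.text, scores)
```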
Or something fancier, on the generative AI side: let's build something that, given a text, gives us an image from that prompt. It's the same story. Here we use an ImageDoc, because it helps us work with images; we wrap the method, we generate images, we call it, and voila.

Okay, so far DocArray has not done a lot of heavy lifting: it has helped us reason about our data and model it in a very nice way. But what we mentioned before, that it is ready to be sent as protobuf, as JSON and so on, is going to shine now, as we take the next step: we are going to use Jina to serve these applications. So first I want to say what Jina is.

Jina is an open source framework used to serve models and applications over the network via gRPC, HTTP or WebSocket. Basically, it consumes DocArray documents in and produces DocArray documents out. It does batching, it scales your services, it has a Pythonic and a YAML interface, and it makes everything ready to go to Kubernetes and to scale. It also exposes telemetry data, tracing and so on, to make your life easier.

Jina is structured in two basic layers. There is the serving layer, where the Executor object lives. This is the main layer, and the Executor is the object that will always be there: it consumes DocArrays, it lives in your Docker container or your process, it processes DocArrays, applies changes to them, and serves over the network. Then there is the orchestration layer, where we have Deployments and Flows. This is the layer where we configure the Executors and decide how they are going to be served. But it's important to mention that the objective is that this orchestration layer eventually goes away and we let Kubernetes take over. We built this simple orchestration layer to serve some use cases, but at some point, when you want to go to the cloud, we keep the Executor, the serving part that consumes DocArrays in and out, and we let Kubernetes orchestrate it for us, because we don't want to reinvent the wheel.

The Executor, which we will see in action later, is just a class: you inherit from it and it lets you expose these applications to the network. It lets you configure dynamic batching, it exposes metrics and spans, and in combination with the orchestration layer it is served and scaled efficiently. In the orchestration layer we have the Deployment, an object that configures how the Executor serves data and also makes sure it is scaled as desired. It's a simple orchestration layer in the sense that it makes sure the Executor is alive at the beginning and that resources are cleaned up at the end, but it cannot dynamically scale up or down. That's why I say that at the next stage we want Kubernetes to take over this job, and that's why it's important that there is a translation to Kubernetes YAML, which we will see later. The Flow has the same API as a Deployment, but it's intended for microservice architectures, where you need more than one model or application working together, scaling differently, and so on.

Okay, let's see it in action and serve a single model, for instance the generative AI application that we had before. This is the code that we said just works locally, and with a couple of changes we are going to be able to expose and serve it. Let's take a look at the changes: we import Executor and requests from Jina; we make our class inherit from Executor, so it becomes an Executor and is ready to be served; we add the requests decorator and say that it's going to serve on this /generate endpoint. Then we use a Deployment to serve this application, with two protocols, gRPC and HTTP, on these two ports, as sketched below.
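A minimal sketch of those changes, assuming Jina 3.x with DocArray v2 support; the actual image generation is stubbed out, and the ports are placeholders:

```python
from docarray import BaseDoc, DocList
from docarray.documents import ImageDoc
from jina import Deployment, Executor, requests


class PromptDoc(BaseDoc):
    prompt: str


class TextToImage(Executor):
    @requests(on='/generate')
    def generate(self, docs: DocList[PromptDoc], **kwargs) -> DocList[ImageDoc]:
        # The actual model call (e.g. a diffusion pipeline) is stubbed out here.
        return DocList[ImageDoc]([ImageDoc() for _ in docs])


# Serve the same class over gRPC and HTTP at once, on two ports.
with Deployment(
    uses=TextToImage,
    protocol=['grpc', 'http'],
    port=[54321, 54322],
) as dep:
    dep.block()  # keep serving until interrupted
```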
It's important to note that serving these two protocols does not double the cost: they are two servers sharing the same event loop, serving both gRPC and HTTP. You even get a Swagger UI and so on, so that you have rich documentation of your API. It's also important to say that this Deployment, the same way you can play with it in Python, you can play with it in YAML. And then you just call it: we have a client that works with gRPC or HTTP, and the same way you worked with your class before, you can now work with it remotely and get the same results.

The Flow comes in when you want to orchestrate two or more Executors together, like in the neural search case where, I don't know if you remember, we had an encoder and an indexer. For the encoder we do the same, we import Executor and requests and expose it, and the same for the indexer. With each requests decorator we make sure that each function is exposed to the network as an endpoint. And here we tell the Flow that it will listen on these ports and expose the same protocols, and we tell it that it will use an encoder, that at index time and at search time it will use the encode method, and that it will use two replicas; and then we add the indexer. The point here is that you can scale your Executors differently, the encoder and the indexer. And in the same way as the Deployment, it's a compositional pattern: a Flow is just a composition of Deployments, and it also has a YAML interface. And in the same way, we can call it and get the same result.

Many of you, I guess, might wonder what the difference is. Everything that I talked about might remind you of FastAPI and Pydantic. I put a "versus" here, but it's not really a versus, because we are very inspired by them, and we are glad to say that we are friends with FastAPI and Pydantic; they are awesome projects. The main difference, I would say, is that we are more restricted, in the sense that we are designed for machine learning use cases, whereas those frameworks are more general and intended for any software application. Also, we put the embedding at the center, and we are able to use gRPC for communication, which is important for serializing tensors and so on. And what is also different is what comes later: the orchestration layer, and how we help you move these applications, which look a lot like FastAPI plus Pydantic applications, to the cloud, to telemetry, and so on.

Okay, so let's see how we expose traces and metrics in Jina to add observability to your services. Basically, Jina emits traces and metrics automatically, if you configure it to do so, through the OpenTelemetry SDK, and exposes them so they can be consumed later by different services, like Prometheus or Grafana, or any other service of your choice that works this way. It's as easy as telling your YAML application that you want to expose metrics and tracing, and where you want them exposed, for you to consume later. And it exposes the typical metrics that you would expect from these applications: what the request time is, how many requests succeeded, how many failed, and so on. And for traces, it's especially important for a Flow, you get the typical spans for each microservice. (The sketch below combines the Flow composition with this telemetry configuration and a couple of custom metrics.)
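Here is a minimal sketch of a Flow composing the encoder and indexer, with the encoder scaled to two replicas and telemetry turned on, assuming Jina 3.x. The exporter endpoints are placeholders for whatever OpenTelemetry collector you run, and the custom-metric helpers are Jina's monitoring utilities (depending on your Jina version these hook into Prometheus or OpenTelemetry, so check the docs for your setup):

```python
from docarray import BaseDoc, DocList
from jina import Client, Executor, Flow, monitor, requests


class TextDoc(BaseDoc):
    text: str


class Encoder(Executor):
    # Custom metric, decorator style: time spent in this method is exported automatically.
    @monitor(name='preprocessing_seconds', documentation='Time spent preprocessing')
    def preprocess(self, docs: DocList[TextDoc]) -> DocList[TextDoc]:
        return docs

    @requests  # no 'on=...': bound to every endpoint, so it runs at index and search time
    def encode(self, docs: DocList[TextDoc], **kwargs) -> DocList[TextDoc]:
        docs = self.preprocess(docs)
        # Custom metric, context-manager style: measure just the inference section,
        # so you can compare preprocessing time against inference time.
        with self.monitor('inference_seconds', 'Time spent in model inference'):
            pass  # model call stubbed out
        return docs


class Indexer(Executor):
    @requests(on='/search')
    def search(self, docs: DocList[TextDoc], **kwargs) -> DocList[TextDoc]:
        return docs  # vector lookup stubbed out


# A Flow is a composition of Deployments: the encoder gets two replicas,
# the indexer keeps one. Traces and metrics are pushed via OpenTelemetry.
flow = (
    Flow(protocol='grpc', port=54321,
         tracing=True, traces_exporter_host='localhost', traces_exporter_port=4317,
         metrics=True, metrics_exporter_host='localhost', metrics_exporter_port=4317)
    .add(uses=Encoder, name='encoder', replicas=2)
    .add(uses=Indexer, name='indexer')
)

with flow:
    # The client call looks the same as using the classes locally.
    client = Client(port=54321, protocol='grpc')
    results = client.post(on='/search',
                          inputs=DocList[TextDoc]([TextDoc(text='a query')]),
                          return_type=DocList[TextDoc])
    print(results.text)
```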
But we also expose an easy Python API so you can enrich this data with your own monitoring logic. For instance, you can add custom metrics via a decorator or via a context manager, as in the Encoder sketch above, so you can get fine-grained insight into these applications: am I spending more time in the preprocessing step of my Executor, or in the actual inference? This logic is for you to build. And for tracing it's the same; it's a little bit more complex, because spans are a little bit more complex, but you can add spans to your requests as you wish. And then you can consume everything in your favorite backend, like Prometheus or Grafana.

And then, as I mentioned before, one of the objectives is that once you have built your POC, you are able to serve your model via an API, and you have made sure that you have observability in your service, at some point this is not enough: you might need auto scaling, you might have a lot of needs, or you might not want to run it locally, and you want to deploy to the cloud. So how do we deploy services to the cloud? I want to make a first recap of what the cloud is, more or less. The cloud is someone else's computer, as we often hear; it just happens that this someone else is normally a cloud provider, right? These cloud providers normally have Docker containers as their first-class citizen, so what a process is on your host maps, more or less logically, to a Docker container in the cloud. At least Jina, in our platform, but also a lot of other components, rely on Kubernetes to make this journey to the cloud in an easy way. The cloud has challenges and benefits that come with scaling, auto scaling, exposing the service, isolation, resource management, and so on. And something that is especially important for AI workloads is serverless capability, because resources are expensive, especially GPUs.

So what is Kubernetes? I don't know what the level of Kubernetes knowledge is here. Kubernetes is basically a container orchestration platform. It's important to say that it's open source and has a large community, which makes it cloud agnostic: the community has made sure that every cloud provider supports it, so you can run it on Google Cloud, on AWS, wherever. And what is an orchestration platform? Something I like to think about Kubernetes is that it's like having the perfect DevOps engineer, one who could go to every node of your cluster and check that the service is up, every 10 seconds or whatever. Kubernetes is this super awesome engineer that, every 10 seconds or however you configure it, does this task. So if you know how to configure it, it's super powerful.

Okay, and how does Kubernetes work? Kubernetes has a set of objects, and they work in a declarative way: via YAML, you tell Kubernetes, "I want this object to be in this state", and Kubernetes will try its best to make sure that the object ends up in that state. It does this with its native objects, pods, deployments and services, and with custom resources that anyone can define. (A small sketch of this declarative idea follows below.)
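Since this declarative idea is the heart of Kubernetes, here is a minimal sketch of it using the official Kubernetes Python client; it assumes a reachable cluster with a kubeconfig, and the image name is a placeholder:

```python
# Minimal sketch of Kubernetes' declarative model, using the official Python client.
# Assumes a reachable cluster and kubeconfig; the image name is a placeholder.
from kubernetes import client, config

config.load_kube_config()  # talk to the control plane, like kubectl does

# Declare the desired state: "I want 2 replicas of this container running".
deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name='encoder'),
    spec=client.V1DeploymentSpec(
        replicas=2,
        selector=client.V1LabelSelector(match_labels={'app': 'encoder'}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={'app': 'encoder'}),
            spec=client.V1PodSpec(containers=[
                client.V1Container(name='encoder', image='my-registry/encoder:latest'),
            ]),
        ),
    ),
)

# Kubernetes' controllers now work continuously to make reality match this spec.
client.AppsV1Api().create_namespaced_deployment(namespace='default', body=deployment)
```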
From the architecture point of view, Kubernetes has nodes, across which you want to scale and distribute your applications. Every node has a kubelet, which is kind of a daemon: the same way you have a Docker daemon locally that your Docker commands talk to, Kubernetes nodes have this small kubelet daemon, which makes the node available to the control plane, the main node and manager of everything, so it can communicate with it and make sure the actual workloads, the containers and pods, are scheduled in the right places. The control plane is what your CLI or your SDK talks to; there lives the scheduler, which makes sure everything is scheduled to do whatever it is intended to do, and the controller manager, where you can write your own operator to manage your own objects.

So how does Jina employ Kubernetes to take over the orchestration part? Basically, we translate these Deployments and Flows; from the outside it seems quite naive, but it is quite easy: we know how to translate these objects into Kubernetes Deployments. There is some magic behind it: we know how to override the entry point of the Docker containers, how to set up the startup probes, what our endpoint for health checks is, and so on. But this still expects you to know quite a lot about Kubernetes, and you would still have to handle autoscalers, horizontal pod autoscalers, and so on.

To help with this last mile, we have the last part of the stack, JCloud. Think of it as deployment made easy. Up until now, if you follow these steps, you are good, as long as you have provisioned your resources, you know how to manage Kubernetes, and you have enabled monitoring, gateway, certificates, security, whatever. Compare this to just running a JCloud deploy on your Flow or Deployment YAML, without caring much about all this. So what is JCloud? Beyond a CLI, an SDK and an API, JCloud is basically a Kubernetes operator and a set of custom resource definitions based on Jina Flows and Deployments, and it knows how to manage these objects efficiently. It comes with auto scaling, GPU resource allocation, it exposes your services, DNS, and so on. And it is not only offered by us as a managed service: since it's an operator, you can take it with you and self-host it, so it fits this bring-your-own-cloud paradigm. (A sketch of both paths follows below.)
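To make the two paths concrete, here is a minimal sketch, assuming Jina 3.x: `to_kubernetes_yaml` is the export method mentioned above, while the image names, the output path, the namespace and the JCloud command are placeholders based on the jcloud CLI:

```python
from jina import Flow

# The same Flow as before, but pointing at container images, since Kubernetes
# pulls images rather than running local Python classes (names are placeholders).
flow = (
    Flow(port=8080)
    .add(name='encoder', uses='docker://my-registry/encoder:latest', replicas=2)
    .add(name='indexer', uses='docker://my-registry/indexer:latest')
)

# Emits one folder of Kubernetes YAML per Executor, plus the gateway.
# Apply it yourself with: kubectl apply -R -f ./k8s_export
flow.to_kubernetes_yaml('./k8s_export', k8s_namespace='demo')

# The JCloud path skips the manual Kubernetes handling entirely,
# e.g. from the CLI (assuming the jcloud extension is installed):
#   jc deploy flow.yml
```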
So, the conclusions. I have tried to make clear how Jina provides a platform to help ML engineers build and deploy AI services. We have seen how DocArray helps you go from a POC to serving with Jina, how Jina can remove its orchestration layer and hand it over to Kubernetes, and how, at the end, JCloud can make this even more seamless and easy by providing this operator and this API. So thank you, and I'm happy to take any questions.

[Audience] I have a probably very simple question. If I have LangChain, how can this platform be integrated with that kind of library?

[John] There are two ways. You could just have LangChain inside one of your Executors and deploy the service with Jina, or the other way around: we have already been successful at adding ourselves as a LangChain dependency, because LangChain is kind of a monolithic program that works locally. For instance, we have a plugin to help you with DocArray; I believe one of the most popular vector indices in LangChain is a wrapper around DocArray. And we have our own embedding models, served with this Kubernetes approach, exposed as an API that LangChain wraps. So this is the way we integrate with them.

[Co-maintainer] Yeah, I'm also one of the co-maintainers of the Jina project, so I face this question of what the difference is between Jina and LangChain, especially this year, because LangChain has become very popular in generative AI, and people are starting to build prototypes and think about using LangChain in production. I think the philosophies behind Jina and LangChain are quite different. I would compare Jina more to Kubeflow or Airflow, which are based on this whole philosophy of cloud nativeness: you have microservices, you wrap the model into a Docker container, and then you connect all these Docker containers to do some high-level task. If you look at the Jina source code, it's all about this containerization, setting up the protocols, using gRPC, compressing the data into a binary format, and then communicating between all these microservices in order to orchestrate some high-level task. But if you look at LangChain, it's very different: LangChain is a very monolithic program where, inside, string manipulation basically happens everywhere. One could say that it's a prompt orchestration framework, not really a microservice orchestration framework; it's prompt based, so it's mostly like a prompt serving framework. In that sense I wouldn't compare LangChain versus Jina from that perspective, even though both architectures have this concept of a flow or a chain, and, you know, everybody hates flows, right? So that's why, to explain our perspective a bit more, this is the difference between them.

What I believe the future of generative AI applications, or multimodal AI applications, looks like is a combination of model-based technology and prompt-based technology. So LangChain, or some prompt serving framework, can be used, as John said, as one of the Executors, for example for preprocessing, for constructing the prompt: building the instruction prompt from a template, from a prompt database, all these kinds of things. And then you feed the prompt to a LLaMA, a kind of self-hosted service, and that self-hosted service runs as an Executor inside Jina (sketched below).

[John] Yes, LangChain is one of the Executors, and the whole orchestration is still organized by Jina or Kubernetes in this case. I think we even have Jina embeddings as a plugin in LangChain, for instance. LangChain, in the end, is really a Python program that calls different services and wraps things; we have our own wrapper for Jina embeddings that, instead of computing the embedding locally, uses our client to send it to embedding services that run, scaled, on Kubernetes.

[Co-maintainer] Yeah, but the general idea, I would say, is that it's more intuitive to fit a monolithic program into a microservice architecture than the other way around: fitting a microservice architecture such as Jina into a monolithic program such as LangChain is less intuitive, yeah.
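As a rough illustration of the "LangChain as one of the Executors" idea discussed above, here is a minimal sketch, assuming LangChain's PromptTemplate; the class names and the endpoint are made up for illustration:

```python
from docarray import BaseDoc, DocList
from jina import Executor, requests
from langchain.prompts import PromptTemplate


class QueryDoc(BaseDoc):
    question: str
    prompt: str = ''


class PromptBuilder(Executor):
    """Hypothetical Executor wrapping LangChain's prompt construction,
    so prompt building runs as one microservice inside a Jina Flow."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.template = PromptTemplate.from_template(
            'Answer the following question concisely:\n{question}'
        )

    @requests(on='/build')
    def build(self, docs: DocList[QueryDoc], **kwargs) -> DocList[QueryDoc]:
        for doc in docs:
            doc.prompt = self.template.format(question=doc.question)
        # Downstream, another Executor (e.g. a self-hosted LLaMA) would consume doc.prompt.
        return docs
```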