All right, welcome, everybody, to today's talk. We're going to be delving into a really interesting topic, one that comes with a lot of calls to action for the community: things we'll have to get together and collaborate on in order to make this a success. Today's talk is on metadata operations for end-to-end data and machine learning platforms. The objective is to give an intuition, primarily, of the challenges that come up when managing machine learning systems at scale, and specifically of how the metadata within those systems differs from metadata in more traditional data systems.

A little bit about myself. My name is Alejandro. I'm Engineering Director at Seldon Technologies, a company that focuses on deployment and monitoring of machine learning models at scale, and we're the authors of one of the most popular machine learning deployment frameworks on Kubernetes, Seldon Core. I'm also Chief Scientist at the Institute for Ethical AI, a research centre based in the UK that focuses on developing frameworks for the responsible design and operation of machine learning systems. And I'm a governing Council Member-at-Large at the ACM.

So what are we going to cover today? We're going to give some intuition on the motivations, why we should care, and on the challenges: how is this different from traditional metadata? We're going to talk about some of those differences, particularly the relationships between the entities we deal with in the MLOps world. We're going to talk about some of the ways we're looking to tackle these challenges, through what we can refer to as the Open Inference Protocol as well as the open inference schema. Finally, we're going to close with a call to action on what we can do as a community to keep driving this forward.

So let's dive straight in. Just to set the scene: picture the early 2010s, how it all started, with a handful of frameworks in the ML ecosystem that you could pick and choose from to productionize your machine learning models. How it's going today: there is an ever-growing number of MLOps tools appearing every single day, tackling very similar challenges across very analogous areas of the MLOps lifecycle and the MLOps stack. One of the things we're seeing, and this was the topic of another session earlier this week at Kubernetes AI Day, is a convergence, not towards the concept of a single canonical stack, but towards a set of canonical stacks: an abstraction of the end-to-end lifecycle blueprint components of a machine learning platform, where each organization picks and chooses the best-of-breed tools for those different areas based on its particular use cases. Even though we now have a better understanding of the shape of production machine learning, and there is growing acceptance that we're moving into a best-of-breed set of tools, there is still a standardized set of components that we expect to be filled out: we still expect to have data engineering, we still expect to have perhaps a feature store, a deployment and serving framework, and so on.
Now that we've identified these different components, what we want to understand is: what are the interfaces and expected standardized APIs that we want enforced and fulfilled in order to have a thriving MLOps operation? If you're going to pick and choose best-of-breed tools, you want to make sure those tools provide, at the very minimum, a set of interfaces, a set of principles, a set of promises. By principles I mean things like reliability, robustness and security, but at a much more pragmatic level.

Looking at this from the end-to-end MLOps lifecycle, this is the ML workflow ecosystem diagram published by the Linux Foundation. You can see a pretty standard overview of the lifecycle of a machine learning model: data cleansing, data engineering and data splitting, then training of the models, evaluation of the models, and then deployment of those models with monitoring, logging and fine-tuning, before going back to the beginning of the lifecycle. The key intuition is that, having decided we're going to pick and choose the best tool for each of these stages, we now want to make sure that as resources are handed over across every part of the stack, you have full lineage, reproducibility and auditability. You need to be able to answer: where did this come from? What do I have out there? How do I go back and find who did what? What is the responsibility and accountability structure across my digital resources as they get handed over across the stack? Those are key questions that have to be asked and considered now that we're in this situation of bringing in best-of-breed, pick-and-choose tooling.

From that perspective, it's important to choose the best tool for your application, but at the same time you need to abide by baseline standards of best practice at each of these stages, and ensure that each of these vendors, suppliers and open-source tools has the required mechanisms to enforce those principles: the ones around metadata management that we're covering here, and other principles, reproducibility, operational management and so on, that are covered in other talks you can dive into at this conference.
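To make those lineage questions a bit more concrete, here is a minimal sketch of the kind of handoff record that could travel with a resource as it moves between stages of the stack. All of the field names and identifiers are hypothetical, purely to illustrate the idea, not the schema of any particular tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class HandoffRecord:
    """Hypothetical lineage entry attached to a resource as it crosses stack boundaries."""
    resource_id: str          # e.g. a dataset, artifact or deployment identifier
    produced_by: str          # tool or pipeline stage that produced the resource
    owner: str                # team accountable for the resource
    parents: list = field(default_factory=list)   # upstream resource ids
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

# A model artifact that can be traced back to the dataset and training job it came from.
artifact = HandoffRecord(
    resource_id="model-artifact:a1",
    produced_by="training-pipeline:nightly",
    owner="nlp-team",
    parents=["dataset:a", "training-job:42"],
)
```

The point of a record like this is simply that "where did this come from?" and "who is accountable for it?" can be answered by walking the parent links, whichever tool produced each hop.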
All of this matters even more because we're moving from a centralized, homogeneous world into a heterogeneous, decentralized one. You may remember that a couple of years back the hype was all around the concept of data lakes. Organizations had to go and purchase a data lake because it was going to solve all their problems: all of your data would be mapped out, all of your big data would be centralized, you'd be able to consume it, everybody would be happy and it would just work. Fast forward to now, and we've realized that this centralized investment has become the bottleneck, both from a technological standpoint and from a domain standpoint. We have teams with very different requirements, the marketing team, the operations team, the data analytics team, the advanced research team, and they all have their own data and machine learning requirements, requirements that tend to have consistent, reusable patterns within each unit.

This is where we start moving into the concept of the data mesh. When you hear "data mesh", the first thing you think of is an infrastructural mesh, a set of decentralized tooling. But the interesting thing, when you look at the proposals for data mesh architectures, is that it's really an organizational structure for how you think about your data products. It links the domains vertically so that you can have the concept of a data product: moving from machine learning projects that just deliver answers towards reusable infrastructure, tables, databases, golden data that you can revisit.

Now, this is being actively explored in the world of DataOps, and the DataOps world has done a lot of really interesting research on how to tackle it. What we want to do in this talk is start extrapolating it into the world of MLOps and push towards bridging these two worlds: on one side the data analytics world, the Spark world, where you're trying to get interactive analytics and interactive insights, and on the other the MLOps world, where you have operational deployments of models, real-time and semi-real-time serving, a data-centric view of your ML systems, and so on.

So let's look at the challenges of metadata management at web scale from the DataOps perspective. There was a really interesting talk from the DataHub founders; DataHub is LinkedIn's internal metadata management framework that was later open sourced, and they shared some really interesting insights about the challenges of metadata at real web scale. There are challenges around complexity and extensibility: dealing with best-of-breed tooling means dealing with heterogeneity of interfaces and heterogeneity of systems; the push towards a democratized data mesh means power moves to the domains rather than a central data lake team; there's integration with multiple SaaS platforms so you can ingest all of that metadata and keep a centralized understanding of what you have out there; and there's integration with the open-source tools you're bringing into your organization. There are questions about loosely typed versus strongly typed metadata: you want flexibility for your users to define rich metadata, but you also want to enforce enough structure that you can ask meaningful questions on top of it (a small sketch of that trade-off follows below). There's schema evolution: how do you ensure backwards compatibility without breaking changes? And there's use-case extensibility, making sure domain knowledge can live within that metadata.

There are also challenges of scale, and not just scale but heterogeneity at scale. Think about the data relations: when you have graph relationships extending to millions of nodes and edges, you can't just run all of that compute on a single graph database machine. You have to think about web scale when you're actually performing those queries.
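As a small sketch of that loosely typed versus strongly typed trade-off, assuming purely hypothetical field names, you can picture the difference roughly like this:

```python
from dataclasses import dataclass

# Loosely typed: anything goes, easy for users to write, hard to query reliably.
loose_metadata = {
    "team": "marketing",
    "notes": "retrained after the Q3 campaign",
    "some_custom_field": 123,
}

# Strongly typed: a hypothetical schema that can be validated, queried and
# evolved deliberately (with backwards compatibility in mind).
@dataclass
class DatasetMetadata:
    name: str
    owner: str
    schema_version: int   # bump only with backwards-compatible evolution
    domain: str           # e.g. "marketing", "operations"

typed_metadata = DatasetMetadata(
    name="customer-events",
    owner="marketing",
    schema_version=2,
    domain="marketing",
)
```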
There's also the question of the liveness of your data: where should it live? Should it sit in something non-relational and be transferred from there, or should I pull it into my graph for some analytics? Is that going to be a centralized golden dataset that is consistently available, or something that is materialized once so the analytics can be carried out and then removed? There are also bottlenecks around database types: if you only adopt transactional databases but need different types of queries, that's another challenge. And the questions aren't only relational or data-structure questions; there's also full-text search. When you want more than string-based regex, when you even want embedding-based search, the question becomes: how do I achieve that at scale, in an operational manner, from a data mesh perspective? Then, of course, reliability. At a Kubernetes conference we're all aware of system reliability, but data reliability is a completely different beast, a completely different problem: real-time sync between systems with potential disparities as this is carried out at massive scale, zero-downtime requirements when carrying out migrations, and ensuring auditability throughout.

Thinking about the scale of this challenge, you start to realize why it's hard: the metadata itself can be as complex and as large as not just your big data, but your operational ML systems too. You're no longer only dealing with big data and its complexities; you're dealing with big metadata, if you'll forgive the phrase.

Coming back to the DataHub project and its challenges at scale, there's something really interesting here, because it shows what the architecture of their metadata management platform looks like. When you think of a metadata management platform, you might say: why don't we just add a Postgres database or a single-server MySQL database and store some basic metadata? But within their metadata ecosystem they have ways of automatically crawling metadata across different servers and different data sources; they can stream it into different applications; they have horizontally scalable jobs that process that data; they have a relational interface to the metadata store; they have support for graph-type queries; they have full-text search; and they have interactive ways for people to consume, I guess, the metadata of the metadata platform. From that perspective you might ask yourself: why is this so complex? Why would anyone build that architecture for a simple metadata service? The reason is that we're now looking at the challenge at web scale.

On the point I made earlier about bridging those worlds: there is a lot of really interesting research that has been done in the DataOps world, because big data has been around long enough that people have been asking: how do I keep track of all the tables, views and visualizations that are available? How do I perform data discovery? How do I store my data schemas?
How do I have a central object or blob store with unstructured data and automatically understand the shape of my data across each of these locations? The DataOps world has had some intersection with the ML world: we've seen people using Spark, or central data lakes that still run ML jobs in batch or MapReduce-style fashion. And we have seen initial attempts to start tracing metadata of machine learning models. There are some really great initiatives, like the model card published a while back, which defines the expected attributes of a machine learning model. There's the push towards model artifacts; artifact stores have been the primary way of dealing with metadata management, and we're going to talk a little bit about that. People have also started to explore model versioning: versioning of artifacts, versioning of experiments, the dependencies of the model, what it needs to run, the risks of the model's use case, its explainability constraints and so on, as well as ownership: who built this model, and how does it link to the relevant dataset? (A small sketch of what such a record might look like comes at the end of this section.)

Now we're going to start diving into how MLOps adds further complexities, how MLOps differs slightly from the DataOps and ML world. The reason is that you have live systems, systems you can query in an operational manner. You're deploying machine learning systems the way you would deploy microservices, and you run into challenges similar to the ones in your microservice infrastructure: service discovery questions, API and schema questions, answering "what do I have out there?" So: ML service discovery, inference data schemas, model data schemas, model artifact schemas, which pipelines are deployed, as well as system ownership, who gets called at night if the system goes down, and so on.

Let's dive into the first point: how are entity relationships in the MLOps world different? Take the anatomy of production ML, which we saw earlier, and think of it as your training data, your artifact store, your inference data, and then the metadata that we'll be delving into later. You have the experimentation stage, where machine learning engineers and data scientists perform hyperparameter tuning, training and evaluation to ultimately convert training data into trained, deployable artifacts. You then have continuous integration and continuous delivery pipelines, whether that's ETL jobs or otherwise, that consistently convert those artifacts into deployed services: you're deploying machine learning models as either real-time or batch inference services that can be consumed. On top of those machine learning services you may add advanced monitoring components, things like drift detectors, explainability detectors, and so on. And you want every single input and output of your data to be stored in what would be your inference store, for audit trails, for reproducibility, and so on.
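Here is the small sketch promised above of what a model-card-style record for a trained artifact might look like. The fields are illustrative assumptions in the spirit of the model card work, not a standardized format:

```python
# A minimal, hypothetical model-card-style record for a trained artifact.
model_card = {
    "model": "sentiment-classifier",
    "artifact_version": "a3",
    "trained_on": "dataset:a, split v2",
    "dependencies": ["scikit-learn==1.3", "python>=3.9"],
    "intended_use": "routing of support tickets",
    "known_risks": ["performance degrades on non-English text"],
    "explainability": "feature attributions available on request",
    "owner": "support-ml-team",
}
```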
Now, the question you want to ask is: if I have thousands of machine learning models deployed, after I've run my experimentation, after I've run my CI/CD, after I have dozens of data and analytics teams productionizing machine learning models, what do I actually have in my production environment? What do I have in this highly scalable, Kubernetes, cloud-native ecosystem that I could, of course, reuse? What services have I replicated? Where are the instances where this pre-processing NLP pipeline is deployed 40 times across 12 different teams? These are the questions you want to be able to ask, so that you can then ask: what value can I extract from what I already have out there? And beyond that: how can I start mapping problems in my business to capabilities that already exist in the production environment?

So let's look at the limits, specifically, of model artifact stores. Why can't we just use an artifact store as a production metadata management tool? Think first about a dataset, dataset A, with dataset instances A1, A2, A3, A4, up to AN. Then we have a set of experiments; this is still on the experimentation side, where we want to turn datasets into trained artifacts. We first run an initial experiment that gives us model artifact A1, trained with, say, the first half of our data. Then we have model artifacts A2, A3, up to AM, each trained with a different subset of the data, different hyperparameters, and so on. Similarly, we have another model trained on a different dataset, giving us artifact B1 and so on. This is what the artifact store holds: the artifacts that are available for deployment.

So what relationships do we have to deal with in the production environment? The challenge is that when we deploy models, we are creating instantiations of existing artifacts. We may have artifact AM deployed in environment CX but also in CY: the same artifact deployed in two different environments. And we may also have a pipeline that uses model AM together with model B1. So we now have very complex one-to-many relationships, and many-to-many relationships once pipelines are involved, and we have to think about them not just from the perspective of deployments but also of versioning: if I change the pipeline to use another model, that's a new version of the pipeline, even though the individual deployed models may be unchanged. I don't expect you to internalize and memorize all of this; the key takeaway from this slide is that it's hard, and it's not something we can capture with the one-to-one relationships enforced by existing model artifact stores. We have to consider many-to-many relationships in production environments; the sketch below illustrates what those relationships look like.
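This is a minimal, hypothetical data model, with invented identifiers, whose only point is that deployments and pipelines reference artifacts many-to-many rather than one-to-one:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Artifact:
    artifact_id: str          # e.g. "model:A_m" or "model:B_1"

@dataclass
class Deployment:
    deployment_id: str        # e.g. "deploy:c_x"
    environment: str          # where the artifact is instantiated
    artifact_id: str          # one artifact can back many deployments

@dataclass
class Pipeline:
    pipeline_id: str
    version: int
    steps: List[str] = field(default_factory=list)   # references to deployed models

# The same artifact instantiated in two environments...
d1 = Deployment("deploy:c_x", environment="prod-eu", artifact_id="model:A_m")
d2 = Deployment("deploy:c_y", environment="prod-us", artifact_id="model:A_m")

# ...and a pipeline that combines it with another model. Swapping a step
# creates a new pipeline version even though the artifacts themselves are unchanged.
p1 = Pipeline("pipeline:search", version=1, steps=["model:A_m", "model:B_1"])
p2 = Pipeline("pipeline:search", version=2, steps=["model:A_m", "model:B_2"])
```

Walking from a deployment or pipeline back to its artifact, and from the artifact back to its dataset, is exactly the lineage question a production metadata system has to be able to answer.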
Because of these many-to-many relationships, we also need some service discoverability. A deployed machine learning model carries metadata that extends what the artifact alone has: it has further descriptions; perhaps the schema it exposes through its APIs is different; perhaps the parameters provided through environment variables make the artifact behave differently; perhaps there are different versions to consider arising from simple changes to the service. And those services have to be able to expose that data in some way so that it can actually be discovered.

With that, we can think about how, and why, we tackle this challenge. The way we've approached it at Seldon, having had to deal with organizations and environments with thousands of machine learning models deployed, is to take on our own responsibility as an open-source tool. We're not going to build a completely new feature store and pitch it as the brand new feature store. We're not going to build a new metadata management system and publish it as the metadata management system that solves all your problems. What we're doing is asking: what are we responsible for in our system? What are the services running within our system? What is the metadata that lives within our system? And what are the ways in which external systems can consume that data for service discovery and for higher-level enrichment? So what you can see here is that we want to enable users to enrich models by adding metadata, to discover and find the models that are available, and to do lineage and audit on existing deployed models, in this case within Seldon Core, with the assumption that there will be a centralized, external metadata management system asking the higher-level questions across the end-to-end ML lifecycle. And we would expect the model artifact systems and the experimentation and training systems to also play nicely and expose their information, so that we can all work in tandem as a happy community. That is basically the call to action with regard to the relationships.

Now let's dive into another problem. We've nailed down the relationships between our MLOps deployments and our machine learning services; now let's ask the question about data. We have the DataOps world operating on the data analytics side, the experimentation side, the training data side, let's call it that for the sake of simplicity, referring back to the slide I showed earlier. And then the production inference services are creating data in your inference world: unseen data points that may have divergent distributions, that may need further consideration, that may have an impact on potential use cases. So data is being created in this world too. The question is how we bridge these two worlds. How do we make sure that our data in the data lake, with schema S1, and our data in the inference store, with schema S2, have some transformation function that allows us to convert to and from each other, so that we can consume that data in a meaningful way? A minimal sketch of such a mapping is shown below.
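A minimal sketch of that kind of transformation function, with all field names assumed rather than taken from any real schema, might look like this:

```python
from datetime import datetime, timezone

# Hypothetical record shapes: S1 is how the data lake stores training examples,
# S2 is how the inference store logs request/response pairs.
def s2_to_s1(inference_record: dict) -> dict:
    """Map an inference-store record (schema S2) back into the data-lake schema S1,
    so that production traffic can be re-ingested for retraining or drift analysis."""
    return {
        "features": inference_record["request"]["inputs"],
        "label": None,                                    # unseen data: label unknown
        "prediction": inference_record["response"]["outputs"],
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "source": "inference-store",
    }
```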
And this is not only important for transferring data from production back to our central data lake. It also matters now that we're moving from a model-centric view, productionizing individual machine learning models, into a data-centric view, productionizing machine learning systems. We need to ask this question because the systems we deploy are no longer just a three-step NLP pipeline with a pre-processing step, an inference step and a post-processing step. We're deploying complex machine learning systems. There's an interesting paper from Facebook (Meta) research that shows what their search infrastructure looks like: there is an offline document embedding and indexing job happening on the side, and then different stages of query processing, retrieval and ranking for the actual inference stage. Somewhere deep inside that system there may be an NLP pre-processing component, and you do not want to be in a situation where that component is replicated a hundred times across your hundred teams just because nobody is aware of what is currently in the production environment, because that leads to un-standardized, heterogeneous risk.

So now that we have this data-centric, machine-learning-systems paradigm, what problems are we facing? The first problem is inference data heterogeneity. A couple of years back, each of the machine learning deployment and serving frameworks had a different interface, a different signature. We at Seldon had the Seldon protocol, which looked like an array within an array. KServe had a different protocol, something that looked a bit like TF Serving but also a little different. TF Serving had its own protocol. MLflow had its own protocol. PyTorch serving just let you send binaries, so that was yet another protocol. You had different ML services running, with data flying around in ways that were completely un-standardized.

So the first step is to tackle that problem, and fortunately we already have a solution, one that is already in production and adopted by several players in the market. This is the V2 protocol, a collaboration between NVIDIA, Seldon, the Triton and KServe teams, to align on a standardized protocol we can all adopt so that everything flying around in your machine learning systems is actually standardized. We can refer to it as the Open Inference Protocol: a standardized schema that gives you REST and gRPC interaction between your machine learning services. And this is great, because not only have we been able to come up with a standardized signature, we've also been able to agree on a minimum level of features and functionality for anyone who claims to have a serving capability: standard interfaces for multi-model serving by design, multi-version support for deployed models, model management APIs, model metadata APIs, and server and model health APIs. A minimal example of what a request under this protocol looks like is shown below.
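As an example, an inference request under the V2 Open Inference Protocol looks roughly like the following. The host, port and model name are assumptions for illustration, but the request body structure (inputs carrying a name, shape, datatype and flattened data payload) follows the protocol:

```python
import requests

# Hypothetical endpoint: a model named "my-model" served locally on port 8080.
url = "http://localhost:8080/v2/models/my-model/infer"

payload = {
    "inputs": [
        {
            "name": "input-0",
            "shape": [1, 4],
            "datatype": "FP32",
            "data": [5.1, 3.5, 1.4, 0.2],   # row-major flattened tensor data
        }
    ]
}

response = requests.post(url, json=payload)
# Outputs come back in the same name/shape/datatype/data structure.
print(response.json())
```

The gRPC interface mirrors the same structure, which is what lets different serving runtimes and clients interoperate.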
So this is not only about aligning on standards; it's about aligning on what we feel are the baseline best practices, or at the very minimum how to avoid bad practice, for anyone deploying machine learning servers into production. That is the first step. Now that we have standardized APIs and standardized interfaces, we can look at the holistic, end-to-end lifecycle of machine learning pipelines and start asking: now that you have deployed machine learning services, now that you have experimentation running evaluation at scale, now that you have data labeling services that also require the data schema in order to function properly, how do we map all of these different stages so that business units are empowered to collaborate without the need for centralized thinking?

The question becomes: how do we map the deployment side to data labeling and the earlier stages of the equation? The way to think about it is that, now that we have a standardized protocol, we can start asking meaningful questions about what an ML service actually does. You can think of an ML service as a component with a set of meaningful inputs and meaningful outputs. In a way, it's like a database query: you have a machine learning model deployed, and given a specific dataset, you can run a query by transforming that dataset through that model. So you can say: I have this machine learning service, with this shape, with this expected signature. And then we extrapolate to what we refer to as pipelines, or machine learning systems. There's the concept of a simple pipeline: an input is passed into one model, then into another, and then you have the output; the interface of that pipeline is the input of model one and the output of model two. Similarly, you can have combiners, where the input goes into two models and the output is a combined aggregation of the two. And then you have full machine learning systems, real-world scenarios where you want to ask: what are my inputs, what are my outputs, and, more importantly, how is my inference being affected downstream and upstream?

Again, there is good news here: there is now a proposed and adopted extension of the Open Inference Protocol into an open inference schema, an approach for describing the shape of the model inputs and the shape of the pipeline inputs and outputs. Once you have this understanding, and you know what the shapes of your machine learning systems are, you can start thinking about the data-centric view of your systems in terms of the value you're getting from the services you have deployed and, more importantly, how to map that value to the business outcomes you require across your horizontally growing business units. In practice this can be as simple as being able to define and explain the shape of those inputs and outputs: is it a categorical? A probability? A tensor? A rough sketch of what that looks like is shown below.
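As a rough sketch: the V2 protocol's model metadata endpoint already describes tensors by name, datatype and shape, and on top of that you can picture a richer schema annotation describing what those tensors mean. The host and model name below are assumptions, and the field names in the second part are hypothetical, shown purely for illustration:

```python
import requests

# The V2 / Open Inference Protocol exposes a model metadata endpoint.
meta = requests.get("http://localhost:8080/v2/models/my-model").json()
# A response carries tensor-level metadata, along the lines of:
# {
#   "name": "my-model",
#   "platform": "sklearn",
#   "inputs":  [{"name": "input-0",  "datatype": "FP32",  "shape": [-1, 4]}],
#   "outputs": [{"name": "output-0", "datatype": "INT64", "shape": [-1]}]
# }

# An illustrative (hypothetical, not standardized) richer "inference schema"
# layered on top, describing what the tensors mean rather than just their shape:
inference_schema = {
    "inputs": [
        {"name": "input-0", "semantic_type": "tensor",
         "description": "four numeric features"},
    ],
    "outputs": [
        {"name": "output-0", "semantic_type": "categorical",
         "categories": ["class-a", "class-b", "class-c"]},
    ],
}
```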
And again, this is very important: the point is not to come out and say, hey, look, we've solved it, because we're far from solving this problem. What we want to do is close with this final call to action and really push and request practitioners, not just on the inference side but at every single stage of the MLOps and DataOps lifecycles, to collaborate, to come up with standards that we can all align on, and to contribute so that we can achieve this capability of end-to-end interoperability at scale, and keep defining and solidifying it so that, to a certain extent, it can become a proper standard in the sense of ISO or the IEEE. And the reason why is because, in several cases, and this is a quote to remember, bad standards can be worse than no standards. So we don't want bad standards. At the same time, at scale there is something to be said for having at least some standard, because with no standards you just have stuff flying around and you don't know what potential value you could be consuming. I'm not saying we should aim for bad standards; we should aim for good standards. But at the end of the day, having something that is standardized matters. And of course we need to make sure we're not just creating one standard to rule all the other standards, because otherwise we end up in a recursive loop that never ends. So with that, I'll leave you with that call to action. Thank you very much, everybody, for joining this session on metadata operations for end-to-end data and machine learning platforms. Thank you very much.

Awesome. So I think we have, I guess, two minutes. Two minutes for questions? Awesome. Yeah, we have two minutes for questions. I'll repeat the question if you shout it. So the question is: have you done the same analysis for the workflow systems, rather than just the serving part? Yeah, that's a great question. The answer is no. I can't stand here and claim to be fully knowledgeable about the other stages. One of the main reasons for this talk is precisely that call to action: to encourage the other areas of the ML lifecycle to push towards answering those tough questions. What is the metadata that I am accountable for providing, and that can be consumed by those higher-level systems? What I can say, however, is that there is a lot of really interesting and applicable work in the earlier stages of the ML lifecycle, on the experimentation side and the data analytics side. However, it falls short when it comes to integrating with that latter stage of machine learning deployment and serving. So what I would be interested in is collaborating with people working on the earlier stages of the MLOps lifecycle. Oh, yeah, another question. Okay, so the question is: there is a broad range of machine learning algorithms that is continuously growing in number; how do we deal with that complexity in the context of standardization and metadata? That's a great question.
So the hypothesis we have is that in the data science and data analytics part of the machine learning lifecycle, the convergence is towards heterogeneity: relevant tools for relevant use cases. Some teams may be using Spark, some may be using other tools, and different algorithms. But what we're seeing in the later stages of the machine learning lifecycle is convergence towards standardization of interfaces, metrics and operational considerations. It's like the microservices world: as an ops person, you don't ask someone to tell you exactly how many lines of code are in their Django app. You ask: what are your service interfaces, what metrics do you expose, and what SLOs, SLIs and SLAs do you want to establish, so that we, the ops people, can look after your services and you avoid leaky abstractions. Some of the things we're focusing on are tools like MLServer for machine learning serving and Seldon Core for orchestration in Kubernetes: standardized layers that let you abstract the data science away from the operations. So I would say we're most likely going to keep seeing heterogeneous algorithms and heterogeneous tools, but with a push towards abstracting them from the ops side.

Awesome, I think we've wrapped up, and it's time for the break. So I'm happy to take any questions informally. Thank you so much. Thank you. Thank you.