Hi everyone, I'm Alolita Sharma. I lead observability for AI/ML at Apple, and I'm very happy to be looking at this topic with all of you today, especially discussing how observability is changing with the advent of AI-enabled applications. With that, let's get moving.

A little bit about myself. I'm excited to see the change in the observability landscape being brought about by a new generation of smart gen AI applications. Gen AI really means that we are starting to add models alongside our code when we write applications and application services. Those services can do many different things: search, a news application, music, many different areas. But applications are no longer just code and infrastructure that you run in a global, distributed way; they also use models as part of how they do their analysis.

As I said, I've been involved in observability for quite a while. In the CNCF, I am a member of the OpenTelemetry Governance Committee and have been a project maintainer as well as a contributor in different areas, including metrics and the interop between Prometheus and the OpenTelemetry protocol, and I'm excited to see new signals such as profiles coming into the project. I'm also a co-chair of TAG Observability, and we have a TAG session later today at four where we'll talk through the larger observability landscape, so do join if you're available. I have also just joined as the chair of the End User Technical Advisory Board of the CNCF, which will serve as a core group for taking end-user feedback and requests back to the projects. This is a new body within the CNCF, and as an end-user member I'm excited to be participating there. As part of that responsibility, I'm also part of the governing board of the CNCF.

With that said, my slides are simple, but I do want to talk about some of the key areas that are driving the change in observability for smart applications. There are two ways of looking at gen AI in the observability world: there are smart applications that observability frameworks need to observe, and then there is the second part, actually using AI for observability. I'm going to focus on the first part: building smart applications using AI models such as LLMs. Models can be of different types; they don't all have to be neural nets and large language models, but today LLMs are in the mix and AI apps are here to stay. If you're building large distributed services or applications, you will likely use an LLM in that process today. Observability is now a key part of that ecosystem, because you are observing not only the behavior of your applications and your infrastructure, but also of the models you're using along with your applications, and observability has to look at all of those parts. What I want to talk to you about today are three aspects of this new paradigm for observability, the three pillars, if you will, at this point. And again, this is an evolving area.
So when we chat again in six months, there will be many more details to talk about. In this new paradigm of observability, three aspects have become foundational in how observability needs to support AI models.

The first, at a high level, is integrating with the model training process. You have the code for your application and your configurations, and then you have your model. For the model, you typically have a training process where data comes in from the real-world use cases you need to operate with, and training pipelines where you train the model with that data. This is a new dimension that comes with the model training space: you train your models with data in order to get the results your application needs.

The second is understanding the inference pipelines. Once you've trained your model for the applications you want to use it for, you have inference pipelines that take inputs and parameters, and whose outputs you evaluate for accuracy. Observability is applied at each of these layers.

The third is considering performance and resource consumption for the infrastructure that you run these models and applications on. At the end of the day, new hardware is coming into the mix: you're using GPUs or accelerated CPUs to run your models. Performance and resource consumption, and factoring that into the overall observability monitoring and analysis you do to run production systems, is something you need to account for. I'm going to dive deeper into each of these areas and give you an example in each.

The first area of integration in this new paradigm is that you now start to integrate observability instrumentation, and analysis, into your model training process. Your aim here is to understand how long it takes for an AI model to train, because that depends on the size of the model you're using, the complexity of the parameters, the number of layers the model has, and the known errors and failures based on the data you're training with. There are several parameters to look at in order to say: these are the factors we're tracking in our observability pipeline, and we need to collect the metrics, traces, or profiles that let us determine whether training was successful for that model. The important part here is also understanding errors: what are the known errors that can occur in model training pipelines that I need to observe and report back? Instrumenting training pipelines to view model weights, for example, or the data distribution across the model's layers, and how you determine confidence indexes for the training being done, typically means there is a whole series of functional metrics as well as training-specific metrics that you collect as part of observing these pipelines, as sketched below.
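To make the training-side instrumentation concrete, here is a minimal sketch using the OpenTelemetry Python metrics API. The metric names, attributes, and the `run_epoch` helper are illustrative assumptions for a generic training loop, not an established convention.

```python
# Minimal sketch (assumed metric names, not a standard) of instrumenting a
# training loop with OpenTelemetry metrics.
import time

from opentelemetry import metrics

meter = metrics.get_meter("training-pipeline")

# How long each training epoch takes, per model.
epoch_duration = meter.create_histogram(
    "model.training.epoch.duration", unit="s",
    description="Wall-clock time per training epoch",
)
# Known failure modes seen while training (bad batches, NaN loss, ...).
training_errors = meter.create_counter(
    "model.training.errors",
    description="Known errors encountered in the training pipeline",
)
# Loss reported per epoch so trends are visible on a dashboard.
training_loss = meter.create_histogram(
    "model.training.loss",
    description="Loss value observed at the end of each epoch",
)

def train(model_name, epochs, run_epoch):
    """run_epoch() is an assumed helper returning (loss, errors) per epoch."""
    for epoch in range(epochs):
        start = time.monotonic()
        try:
            loss, errors = run_epoch()
        except Exception:
            training_errors.add(1, {"model": model_name, "error": "epoch_failed"})
            raise
        epoch_duration.record(time.monotonic() - start,
                              {"model": model_name, "epoch": epoch})
        training_loss.record(loss, {"model": model_name, "epoch": epoch})
        training_errors.add(errors, {"model": model_name, "error": "known"})
```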
And these pipelines have to be instrumented based on the model you're training with. The other part is running continuous analysis, and this is where an observability-specific model can be used: the continuous analysis of the errors you see while training the pipeline can also be used to train your own observability model. Why does that matter? Because if you are running these pipelines continuously over time, every time you use a model you will be training again with new data, and especially if you're running in real time over a period of time, you can take that history and reuse it. At that point, having an observability-specific LLM is very helpful, because it gives you long-term information about your observability metrics and how those metrics change over time.

For the sake of example, without getting lost in the details, you could dive into each layer and say: these are the layers of the model, these are the metrics, these are the distributions, these are the weights, this is how long training takes for this specific model. You report that back, but you are also continuously learning with the observability model you're using. Take the news service app as an example: if you have a news application that gets updates from the data sources you're picking up for your news app, you can run an AI model on the application side that continuously harvests these new data sources, so you're training with all of that news data and the application can give you current news at any point in time. But you are also categorizing the incoming news items and running smart error analysis as you train continuously, and that long-term error analysis is what your observability LLM learns from, which then gives you the feedback you're looking for: is my model training process working as expected, or is it out of thresholds? This is one aspect of the new generation of AI apps where you need observability, and it's something you should expect if you're using models in your application.

The second part is really understanding the inference pipelines. Once you've trained your model and are using it in your AI application, you want to understand its parameters: is your inference pipeline accurate or not? That goes back to what accuracy means here. Is it actually providing fair use? Is it within the fair data-use guidelines your organization may have established for your applications? Is your data handling within your privacy guidelines? Any guidelines you set for fair use, or for bias, for example, if you're detecting bias in your data, each of these becomes a parameter of the inference pipeline that is evaluated for accuracy: is it within the expected thresholds, or out of bounds? From an observability standpoint, you want to know that, as in the sketch below.
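As one way to make those accuracy and guideline checks visible, here is a minimal sketch that records inference-evaluation results as OpenTelemetry metrics. The bias evaluator, metric names, and the threshold value are assumptions for illustration, not a prescribed standard.

```python
# Minimal sketch (assumptions, not a standard): recording inference-quality
# evaluation results so out-of-threshold behavior such as bias or policy
# violations shows up in dashboards and alerts.
from opentelemetry import metrics

meter = metrics.get_meter("inference-evaluation")

# Distribution of a bias score produced by whatever evaluator runs
# alongside the inference pipeline (the evaluator itself is assumed).
bias_score = meter.create_histogram(
    "model.inference.bias_score",
    description="Bias score per evaluated response (0 = none, 1 = severe)",
)
# Count of responses that fall outside organizational guidelines.
guideline_violations = meter.create_counter(
    "model.inference.guideline_violations",
    description="Responses outside fair-use, privacy, or bias thresholds",
)

BIAS_THRESHOLD = 0.2  # assumed organizational threshold

def report_evaluation(model_name, response_id, score, privacy_ok, fair_use_ok):
    attrs = {"model": model_name, "response_id": response_id}
    bias_score.record(score, attrs)
    if score > BIAS_THRESHOLD:
        guideline_violations.add(1, {**attrs, "guideline": "bias"})
    if not privacy_ok:
        guideline_violations.add(1, {**attrs, "guideline": "privacy"})
    if not fair_use_ok:
        guideline_violations.add(1, {**attrs, "guideline": "fair_use"})
```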
Because at the end of the day, that's part of what you report back through your observability metrics: the data that tells you everything is in order with this application. It also means evaluating the latency of the model in processing an input. Somebody asked for a news item, for example, and a result came back: was the result accurate? Did the application actually give back the response you wanted from that model? That's what inference is about: you are now starting to infer whether the results coming from the model are converging to an accurate output, and you look at that to understand whether they are within the confidence bands of the inference being generated.

So defining and instrumenting telemetry data for measuring latency and distribution is very important. If your telemetry does not measure the latency of the inference request/response cycle, or the use-case-specific confidence, in this case for the news service, whether the results from your inference pipeline are within your confidence thresholds for normal and edge cases, you're missing another metric you should be deriving from your observability pipeline. And then the third thing, again, is to apply an ongoing observability LLM in the back end to continuously learn from the results being picked up from observing this application. So it's complex: you're looking not only at an AI-enabled application, but you're also using an observability model in the background to continuously learn whether those results are within the normal thresholds you've established or not.

In the case of the news app I was talking about, being able, from an observability standpoint, to monitor and publish the confidence values for each encoder layer in the model is very useful to convey back to the application developers. They care about the weight on each of these layers and whether the encoding is done in time: if I want a sub-second response on each layer, is that what I'm getting back? Being able to convey those values at each layer and provide them as part of your observability report or dashboard is very useful, as in the sketch below.
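Here is a minimal sketch of that inference-side instrumentation, measuring end-to-end latency plus per-layer confidence for a hypothetical news-service model. The metric names and the `run_with_layer_stats` helper are assumptions for illustration.

```python
# Minimal sketch (assumed names): end-to-end inference latency plus per-layer
# confidence, using OpenTelemetry metrics and traces.
import time

from opentelemetry import metrics, trace

meter = metrics.get_meter("inference-pipeline")
tracer = trace.get_tracer("inference-pipeline")

inference_latency = meter.create_histogram(
    "model.inference.duration", unit="s",
    description="End-to-end latency of one inference request",
)
layer_confidence = meter.create_histogram(
    "model.inference.layer.confidence",
    description="Confidence value reported per encoder layer",
)

def observed_inference(model, request):
    """model.run_with_layer_stats() is an assumed helper returning the
    response plus (layer_name, confidence) pairs for each encoder layer."""
    with tracer.start_as_current_span("news.inference") as span:
        start = time.monotonic()
        response, layer_stats = model.run_with_layer_stats(request)
        inference_latency.record(time.monotonic() - start,
                                 {"model": model.name})
        for layer_name, confidence in layer_stats:
            layer_confidence.record(confidence,
                                    {"model": model.name, "layer": layer_name})
        span.set_attribute("model.name", model.name)
        return response
```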
Any questions so far? Or we can finish first and then you can ask questions.

The third part is considering the performance and resource consumption of models. As you run these models, they're expensive, especially LLMs, and a lot depends on whether you're running them on the edge or on the server side, which is typically where your cloud infrastructure runs. If you're running on the edge, your models will typically have fewer parameters, and the level of complexity you can run on an edge device is far smaller than what you can run on large compute backed by a GPU farm, and you need to be able to observe that entire workflow.

It's very important to understand the performance and latency numbers and the resource consumption of the hardware underneath, because that also factors into what kind of large language models you can run on the edge versus what you can run in your cloud. To evaluate the efficiency of an application's AI model, you want to continuously understand how to optimize resources for running on the edge versus running on the cloud, because there are many ways to split out the functionality of each model based on where it runs. The hardware footprint matters: if you run on the edge, you will likely do the edge-based computations and analysis with smaller models, versus running LLMs on the back end.

That means instrumenting resource utilization metrics, and resource utilization is an overloaded term. It can mean measuring CPU usage, or GPU usage for the actual set of GPUs being used to run a model, and it can also cover other accelerators, whether they run on-prem or in the cloud. Instrumenting those resource usage metrics is something the observability domain doesn't really do today, because we only have a concept of the typical generation of processors used under the hood: infrastructure CPU metrics, memory metrics. But typically you don't look at GPU usage, the percentage of time each GPU is actually being used. And because these resources are very expensive today, you really need to consider this as a third pillar of observability: you want to constantly understand your resource utilization and whether you can optimize the usage of your resources better. It's not as if you can just set up a job and run an application idle on a particular set of CPUs; CPUs are cheap compared to GPUs today. So having a continuous process as part of your observability that reports back performance and resource consumption is very important.

That also means using an AI model again to continuously learn about the usage patterns for each model, because over time that lets you optimize based on the model type you're using, the expectations for the GPU clusters you're using, or the sizing of the footprint of the edge devices you're using. It's a delicate balance: you can write any application, but the question is whether you can run your model effectively there, get the results you need with the kind of training you need, and report that back and observe it on a regular basis to complete that cycle. A sketch of the GPU-utilization piece follows.
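As one way to bring GPU usage into the same pipeline as CPU and memory metrics, here is a minimal sketch using NVIDIA's NVML bindings (the `pynvml` package) and OpenTelemetry observable gauges. The metric names are assumptions, and this only covers NVIDIA GPUs; other accelerators would need their own collectors.

```python
# Minimal sketch (assumed metric names; requires pynvml / nvidia-ml-py and
# NVIDIA hardware): exporting per-GPU utilization and memory use as
# OpenTelemetry observable gauges.
import pynvml

from opentelemetry import metrics
from opentelemetry.metrics import CallbackOptions, Observation

pynvml.nvmlInit()
meter = metrics.get_meter("gpu-resources")

def gpu_utilization(options: CallbackOptions):
    # Percent of time each GPU was busy over the last sample period.
    for index in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        yield Observation(util.gpu, {"gpu.index": index})

def gpu_memory_used(options: CallbackOptions):
    # Bytes of GPU memory currently in use, per device.
    for index in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        yield Observation(mem.used, {"gpu.index": index})

meter.create_observable_gauge(
    "gpu.utilization", callbacks=[gpu_utilization], unit="%",
    description="Percent of time each GPU was busy over the sample period",
)
meter.create_observable_gauge(
    "gpu.memory.used", callbacks=[gpu_memory_used], unit="By",
    description="GPU memory currently in use, per device",
)
```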
And that's why observability is key to enabling day one for AI-based applications: it really comes down to efficiency and optimization, and it's far more important to do this for AI apps than for regular applications at this point, just because of the sheer size of the models, especially in the gen AI and LLM generation. As we get bigger and bigger models with billions of parameters, how do you actually optimize them for each hardware footprint? So, in this new paradigm of observability, our existing generation of observability frameworks needs to be able to accommodate this, and that also means continuous analysis and evaluation of the efficiency of what works on the edge versus what works on the cloud.

The takeaways here, then, are these. First, intelligent observability requires changes in the current observability stack as well as in the instrumentation of AI models and of edge and cloud-native infrastructure. This is the key change between writing the current generation of application services, which do not use models, and adding models as part of the application you're building: the moment you add additional assets such as models, you also have to have observability built in just to understand what the behavior of the application is going to be and whether it is actually delivering the results you expect.

The second takeaway is that a new taxonomy of telemetry data for these models has to evolve. This is still evolving; it is really not standardized today, because every vendor who builds GPUs and rolls them out into the cloud or onto the edge has their own taxonomy for those metrics. Kubernetes did a lot to standardize metrics for CPUs, for example; there are certain CPU metrics that are shared and standardized today and we take them for granted. But in the world of models, this is not standardized yet. OpenTelemetry, for example, which is a large collection framework, needs to be aware of AI models and the types of models, and understand them beyond a black-box implementation. It's not good enough to say we're just going to treat the model as a black box and assume that's what we're going to get out of it. You may not, because every model is different today, and there is no standardization in the metrics or the telemetry data emitted by these models. There's also not enough standardization yet to say that one instrumentation fits all; the kind of data these models emit from an observability standpoint may not be the same. You really have to keep these aspects in mind when you're building and using models for your applications, because that filters back into the observability frameworks you're using for collecting this data, analyzing it, and operating these applications in production. The sketch below shows what model-aware telemetry attributes might look like.
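As a purely illustrative sketch of going beyond the black box, here is what attaching consistent model metadata to telemetry could look like. The attribute names, model identifiers, and the `summarize` call are my own assumptions, not an agreed OpenTelemetry semantic convention; the point is only that the same model metadata travels with the spans so backends can reason about the model, not a black box.

```python
# Illustrative only: model-aware attributes on telemetry. Attribute names
# below are assumptions for the example, not a standardized convention.
from opentelemetry import trace

tracer = trace.get_tracer("news-service")

MODEL_ATTRIBUTES = {
    "ai.model.name": "news-encoder",        # assumed model identifier
    "ai.model.version": "2024-03-01",        # training snapshot / version
    "ai.model.parameter_count": 7_000_000_000,
    "ai.model.deployment": "edge",           # "edge" or "cloud"
    "ai.model.accelerator": "gpu",           # hardware class the model runs on
}

def summarize_article(model, article_text):
    # Attach the model metadata to the span wrapping this inference call,
    # so traces can be sliced by model, version, and deployment target.
    with tracer.start_as_current_span(
        "news.summarize", attributes=MODEL_ATTRIBUTES
    ) as span:
        summary = model.summarize(article_text)  # assumed model API
        span.set_attribute("ai.response.length", len(summary))
        return summary
```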
The third thing I'll leave you with is that observability stacks need to factor in data bias and security influences, because the data really can change the way your model behaves for your application. And if there is bias in the data, today's observability frameworks have no understanding of it; they just look at the data produced by applications very mechanically. Being able to understand those influences in your data, and to report them, is a first step toward course correction for the application, so you avoid bias in the results because of the data. It is all very connected now: the moment you introduce a complex structure like a model, it's not just throwing in a model, it's also understanding what your data is doing and what your data looks like. If you're ingesting bad data, you can get bad results out, and that also has the potential of biasing your observability results, because unless your observability stack understands what normal confidence indexes look like, it can't tell. That's why this is a completely changing paradigm, and it needs to be factored into both the collection and the analysis of what you report through your observability pipelines.

So with that said, that's all I had to share with you today. I didn't want to make this discussion too complex, because I could have gone into a lot of detail about what models look like and what you could do with different metrics and different layers, but I'm happy to answer any questions in this space.