All right, awesome. Cool. Hi, everyone. Today we're going to be talking about applying observability to machine learning, so figuring out how you can use observability to debug your machine learning models that are deployed on the edge. First, as an introduction: Natalie and I are principal engineers at New Relic, and we work primarily on Pixie, which is an open source CNCF sandbox project for Kubernetes observability. You'll learn a little bit more about Pixie later on in our talk.

So why is this important? We see today that a bunch of products and software are moving their ML to the edge. For example, here you see Cruise, the self-driving car: it's collecting a bunch of information using its sensors and cameras to figure out, am I driving in the lane correctly? Am I about to hit an obstacle? It collects all that information and processes it locally, so it stores it locally and then runs inference on it to figure out what action it should take next. Similarly, you have the Amazon Echo, and that does the same thing. It collects input from the environment, whatever you're saying at the time in your home, and tries to figure out, should I do something with this information?

And what does this actually look like? In the traditional model that you see on your left, the way a lot of things were done before was that all these sensors collected information, and then in order to make an inference on it, they would send it off to some cloud. The model is running in the cloud; it receives that information and sends back, OK, here's what you should do with this. Now, as things move toward the edge, you're seeing a lot more of this: the sensor collects the information, stores it, potentially in memory, and then runs the model directly on the device itself.

There are a lot of benefits to this. You can see immediately from this picture that you use a lot less network bandwidth and egress, especially if your sensor is collecting tons of information. Now you're not sending it out to some other cloud that you potentially don't even own, and that's really nice because you get the data and can immediately start running your models on it. Second, you get some really good security and privacy benefits. Think about the Amazon Echo case: your Echo is sitting on your counter, and it feels a little weird to think, oh, it's listening to me and sending whatever I'm saying all the way to Amazon. It's nicer to think, OK, it's actually doing edge compute. Whatever I'm saying, it's storing locally, probably expiring it after some point because that's a ton of data, and all of it stays within the edge device itself. And finally, there's scalability. Before, if you had millions of connected devices all streaming their data to your cloud, you had to be able to handle all of that input from a bunch of different devices. But now all these models are running individually on the devices themselves, so each one is just responsible for its own set of inferences.
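To make those two paths a bit more concrete, here's a rough sketch in Python of the same prediction made both ways. The endpoint URL and model filename are made up for illustration, and the edge path assumes the model has already been compiled to TensorFlow Lite for the device.

```python
# Sketch only: the cloud path ships the data off-device, the edge path runs a
# compiled model locally. The URL and model file are placeholders.
import numpy as np
import requests
import tensorflow as tf

frame = np.zeros((1, 224, 224, 3), dtype=np.float32)  # stand-in for a camera frame

# Traditional path: send the raw data to a model served in the cloud.
resp = requests.post("https://example.com/v1/models/lane-detector:predict",
                     json={"instances": frame.tolist()})
cloud_prediction = resp.json()

# Edge path: run a TensorFlow Lite model directly on the device.
interpreter = tf.lite.Interpreter(model_path="lane_detector.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]
interpreter.set_tensor(inp["index"], frame)
interpreter.invoke()
edge_prediction = interpreter.get_tensor(out["index"])
```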
So what does this actually look like? How do you go about deploying your models to your edge devices? The first part is pretty much the same: you need to train your model. You have a bunch of training data, you tune all your parameters, and you make sure your accuracy is good. Say you have an image classifier and you're trying to classify what kind of dog breed this is. You train it in the cloud, and you confirm that for any input you give it, the output looks about right. Then you can deploy it to all these different edge devices. I wanted to make it very clear that these are all potentially different hardware environments, so I made the boxes for the edge devices different colors. One could be an iPhone, one could be an Android device. The point is that these models are now going to a very different set of environments.

And what can actually happen here? What could go wrong? It works perfectly in the cloud, and keep in mind that your cloud has as many resources as you want to give it: as much memory as you want, as much CPU as you want, as long as you're willing to pay. But now you're going to these edge devices, and like I mentioned before, you have phones, but you might also be deploying to some little thing running a sensor. You're going to be limited by your resource and memory requirements, so a bunch of things could go wrong. The first case, up on top, is that you've deployed to your edge device and things seem to run, but you get completely wrong results. You're noticing a huge drop in accuracy: maybe before you had 80 or 90% accuracy, and now the model that's been deployed somewhere else is giving you maybe 30% accuracy. The second case where something could go wrong is that your model seems to run fine, it's giving you the right inferences, but it's taking a really, really long time. So it's like, OK, it used to work, now it doesn't. What can I do to debug this? And I think what a lot of us turn to, and I'm sure many people here are familiar with it, is: I need to solve this problem, so I'm going to go put a bunch of print statements everywhere and figure out where exactly in this pipeline things are going wrong. Obviously that's a very tedious process, because, especially in a deep learning model, there's a lot going on in there. You're going to print a bunch of things, and then you have to make sense of all that information.

And that is where MLXray comes in. MLXray is a project that came out of a Stanford research project, so huge credit goes to all the people I've listed on the bottom who came up with the original idea for MLXray and all the great work they've done so far on it. Essentially, MLXray is an end-to-end framework for debugging your cloud-to-edge deployment. Let's dive a little more into what that actually means. First, MLXray gives you an API, and you invoke that API and it starts collecting a ton of information for you as you run your model. Up on top, we have an example of the Python API: you start the inference and you stop the inference using the MLXray API, and during each part of your model pipeline it records tons of information, from the pre-processing step, through the individual layers, down to the final output it gets in the end. So you don't have to go in and add whatever logging you want yourself.
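We won't reproduce the real API here, but to make the pattern concrete, here's a minimal, self-contained sketch of that kind of start/stop instrumentation. This is not MLXray's actual API; the class, method, and stage names are made up for illustration.

```python
# Minimal sketch of the start/stop logging pattern described above (not the
# real MLXray API). It records per-stage timings and outputs to a local file
# that can be pulled off the device later and compared against a baseline run.
import json
import time

class InferenceTrace:
    def __init__(self, model_name):
        self.model_name = model_name
        self.records = []

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def log_stage(self, stage, output, extra=None):
        # Called after each pipeline step: pre-processing, individual layers,
        # final output. `extra` holds any model-specific fields you want to add.
        self.records.append({
            "stage": stage,
            "elapsed_s": time.perf_counter() - self.start,
            "output_summary": repr(output)[:200],
            "extra": extra or {},
        })

    def __exit__(self, *exc):
        self.records.append({"stage": "end",
                             "elapsed_s": time.perf_counter() - self.start})
        # Persist locally so the log can be pulled off the device later.
        with open(f"{self.model_name}_trace.json", "w") as f:
            json.dump(self.records, f)

# Usage (wrapping existing inference code):
# with InferenceTrace("lane_detector") as trace:
#     x = preprocess(frame); trace.log_stage("preprocess", x)
#     y = model(x);          trace.log_stage("model", y, extra={"rotation": 90})
```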
So what kind of information does it collect? It really collects a bunch of things, starting with baseline debug data: what are the inputs and outputs of your model, and for each individual layer inside your model pipeline, what are the inputs and outputs of that layer? For the case where things start running very slowly, it tracks your end-to-end latency for running the entire inference, but also the latency between each layer. It also collects resource information, like what the memory usage was while my model was running. And for devices like Android, it collects other sensor information too: your Android phone knows how it's rotated and what the lighting conditions are like, and that gives more context to how the model is running on your device. You can also use the MLXray API to add whatever other fields you might want to log, so more information that's specific to your model itself.

But now you have all this data coming in. You're like, all right, cool, I've instrumented my stuff, but what do I do with it? Now I know it takes this long, I know that at this point my output looks like this, what am I supposed to do now? The idea behind MLXray is that you compare this to a reference pipeline. You have the model that you've trained in the cloud; this is your baseline model. It runs with the best accuracy you can get, at the speed that you want. So what you do is collect logs from that baseline model, and then collect logs from the one you've deployed to your edge device, and essentially you combine them to figure out: what are the differences here? That's what we call a debug report. You compare the differences and try to figure out where exactly in my model things are going wrong.

And what does this flow look like when you're debugging? The first step is accuracy validation: is the model that's deployed on your edge device doing as well as the cloud model you originally trained? If it is, that's great; they have comparable performance and you're done. All right, great, I've deployed to this Android, everything looks good, I can move on to whatever device I want to deploy to next. But a lot of the time it's going to be: no, my accuracy has dropped a lot. That's when you look into the accuracy, or the output, between each layer. What you might see here is, oh, OK, this particular layer's output is very different from the output I was getting in my baseline reference model. Then you can dig in: maybe I have some pre-processing issues, maybe there's something wrong with the weights and biases in this model from when I quantized it so that it runs better on an edge device. So that gives you an indication of which layer you should actually look into.

And then finally, MLXray has this thing called custom assertion checks. What happens there is that you can tell MLXray: as you're running through my model and collecting all these logs, also run some assertions that I know always need to hold, so that if one fails, I know there's already something wrong in my pipeline. An example of this, going back to the lane detection case: you know that when the model detects a lane in a picture, it has to be a certain width. So you can add an assertion that says, whenever you detect a lane, always make sure it's this width, and if it isn't, it triggers a failure and you know there's potentially something wrong here.
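To give a feel for what that comparison and those assertion checks boil down to, here's a small sketch, again with made-up names rather than MLXray's actual log format: flag the layers whose outputs diverge from the baseline, and assert a model-specific invariant like the lane width.

```python
# Sketch of the "debug report" idea and a custom assertion check. The log
# format here (a dict of layer name -> output array) is illustrative only.
import numpy as np

def debug_report(baseline_layers, edge_layers, tol=1e-2):
    """Compare per-layer outputs from the cloud baseline and the edge run."""
    report = []
    for name, ref in baseline_layers.items():
        got = edge_layers.get(name)
        diff = float("inf") if got is None else float(np.max(np.abs(ref - got)))
        report.append({"layer": name, "max_abs_diff": diff, "suspect": diff > tol})
    return report

def check_lane_width(lane_box, expected_width, rel_tol=0.2):
    # Custom assertion: a detected lane should be roughly the expected width.
    # A failure points at a pipeline problem (e.g. bad pre-processing or
    # quantization) rather than just a low accuracy number.
    width = lane_box[2] - lane_box[0]
    assert abs(width - expected_width) <= rel_tol * expected_width, (
        f"lane width {width:.1f} is outside the expected range around {expected_width}")

# The first "suspect" layer in the report is where to start digging:
# for row in debug_report(cloud_log, edge_log):
#     if row["suspect"]:
#         print(row["layer"], row["max_abs_diff"])
#         break
```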
So how did Natalie and I get involved in MLXray? All the work I just presented was done by the people I mentioned from the Stanford research group. We got in contact with them, and we found there are some limitations with MLXray that Pixie, which we'll talk about in a little bit, could potentially solve. The first is that you have to make code changes to enable this instrumentation. It's not a lot of code: you start your inference, you stop your inference, you invoke the MLXray logging. It's not that bad, but it can be cumbersome. You have to remember to add it in, and you have to remember to take it out, because when you're ready to finally deploy to prod, you don't want to keep logging all this information all of the time. It also has a slight performance impact sometimes, plus memory overhead, because now you're keeping all of this data in a log stored on your device so you can pull it out later to analyze it. And the third thing, which I didn't show as much, is that there aren't a lot of ways to visualize this data. MLXray gives you an API for collecting the data and parsing it, and it gives you some simple operations, so I can plot, say, two different layer accuracies against each other, but there's no way to really play around with this data and explore different correlations. And that's where Pixie comes in. So I will hand it over to Natalie to explain what that's about.

All right, thanks, Michelle. So as Michelle said, as contributors to the Pixie project, we were really inspired when we saw what MLXray could do, and what we wanted to do was take some of its concepts and see how we could use Pixie, which is an observability platform for Kubernetes, to apply those concepts in a Kubernetes-native environment. Just a little bit about Pixie before we jump into the demo. Pixie provides automatic telemetry for your Kubernetes cluster using eBPF, so it can automatically collect a lot of data from your application without any code changes. Pixie runs on the edge, which makes it a good option for edge-based deployments, because the compute and the data collection actually live on the node where the data is collected. And finally, Pixie has a very customizable interface, which allows us to build custom dashboards for specific use cases. As Michelle said before, Pixie is a CNCF sandbox project, totally open source, so anyone can use it.

The way we think about Pixie is progressive instrumentation, which means you start with a baseline of visibility into your system and then extend that visibility in the areas you're interested in. Pixie, right out of the gate, gives you things like resource utilization, flame graphs showing which functions are taking a long time in your application, and the raw requests in and out of your application. But we also have the ability, once Pixie is deployed in the cluster, to add additional user-specific instrumentation for your application.
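Before we get to that, just to give a flavor of the out-of-the-box data, a short PxL script (Pixie's Pythonic query language) can pull up, for example, the raw HTTP requests hitting the model server. This is a rough sketch from memory; the exact table and column names may differ across Pixie releases.

```python
# Rough PxL sketch: list recent HTTP requests to a service whose name contains
# "model-server". Column names are from memory and may not match exactly.
import px

df = px.DataFrame(table='http_events', start_time='-5m')
df.service = df.ctx['service']
df = df[px.contains(df.service, 'model-server')]
df = df[['time_', 'service', 'req_path', 'resp_status', 'latency']]
px.display(df)
```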
For that user-specific instrumentation, we can use uprobes, which are user-space BPF probes, to trace particular functions in your application that you're interested in, without requiring a redeploy or anything like that. You can write custom scripts and custom dashboards, and soon it's going to support things like ingesting arbitrary data sources from OpenTelemetry. In the diagram below, we have an example of how a uprobe works in your system. You say, I'm going to trace this particular function; the function runs, and the uprobe reads the values in and out of the application, or in and out of the function, if that's what you're telling it to trace. You can collect things like latency, arguments, return values, and so on using uprobes.

So next I'll show a demo. This is just the beginning of using Pixie to apply some of the MLXray concepts, but I wanted to show it on an application and demonstrate ML observability on the edge using Pixie. So, okay, well, I was warned that this might happen. It's back, okay, great. I made a quick little GitHub project that runs object detection on Kubernetes, so you can check it out if you want to test it yourself. I have that application running on my Kubernetes cluster. This is the Pixie UI. You can see that it's showing me the traffic between different applications in my cluster, things like the nodes and the namespaces. Specifically, what I'm interested in is this model server, which is running my TensorFlow application, an image classifier. Right off the bat we can see things like: what's the CPU usage? What containers are running? What are the processes? Bytes in and out, things like that.

What I want to do is trigger some load on this application, and then apply custom uprobes using a PxL script, which will let me trace model execution in terms of the number of times it's run and also the latency of the model execution. So I'm going to go to my scratch pad. I'm starting some load over here. We're going to need to re-forward this port; it's taking a little bit of time. Demo gods be with me. But basically, what I'm trying to do here is say: Pixie is already collecting a lot of information about my application, but I specifically want to trace the particular functions that are running my model. That's the goal of what I'm about to do. Still struggling, I'll keep trying. I do have a video of this if it fails to work. Do you think I should go to the video? Okay, stand by while I pull this out of my files.

So, as you can see, it's the same thing we showed earlier; I'm just going to jump ahead to the part where I deploy the probe. Basically, as I said, we're going to instrument this application using a custom script. This deploys uprobes that I have written in bpftrace, and it's going to show me the latency and the number of times this model was executed over time in my cluster, without any redeploys. By using things like this, we're able to add additional observability into our models once they're deployed and running on Kubernetes. So the trace point is being deployed, and soon we'll see a plot of latency as well as the number of invocations over time for my model. Just giving it a little time to collect some additional data. Okay, there we go. We can see model requests over time; that's the top chart. And then the model latency.
We have the P50, P90, and P99 for how long the model is taking to execute. So you might think, okay, it's taking about a second to execute; that might be a little bit long. I need more information about what's happening; I need to debug this and figure out where the bottleneck is. I mentioned before that Pixie captures flame graphs of the most CPU-intensive functions your application is running. It might be a little hard to see here, but you can basically see the entire call stack of what's running. And what I'm going to discover looking at this is that most of the time here is spent parsing and serializing JSON; it's not actually spent on the model itself. This is a really common problem: without the ability to attribute the latency, you might blame the model when it isn't actually the model's fault.

But there is a particular part in here that it's going to highlight soon, which is a layer of the model that is taking a really long time compared to what we'd expect. And that's underneath this parent function. We can see the different ops that are running: the gather op, a concat op, and then finally a non-max suppression op. Now, this is an operation that we expect to be really, really fast. It's basically just saying: I have a bunch of different candidate bounding boxes; which ones should I actually output? So it's kind of a surprise that this would show up at all, especially in comparison to the other ops. But as we've established, we have the ability to do custom tracing of whatever function we want. So what I'm going to do in the last part of the video is trace the arguments to this non-max suppression operation and try to figure out, based on the arguments it's receiving, why it might be taking so long. I'm going to run another script that traces that particular operation, and let's see if we can figure out a little bit about why it's slower than we think. Okay, it's a little hard to see here because of the small text, but what we see is that there are actually thousands of bounding boxes being passed into this function: 2,000, in fact. That is a lot higher than I would expect from my reference pipeline. So this has let me get to the heart of the issue, which is: why is this particular layer of my model slow? It's because it's receiving an outsized input, an input that we wouldn't actually expect. So we have gotten to the root of the performance issue using some of the principles that we got from MLXray.

Okay, it looks like Michelle's mic is off now, so I guess I'll cover the resources. The MLXray repo is on GitHub; you can just search for it and see what they've been able to do. It's very well targeted at mobile deployments. There's the MLXray paper, which has been accepted to MLSys, so if you're part of both worlds, you can check them out at MLSys. And finally, you can check out Pixie's GitHub repo; it's pixie-io/pixie, if you want to try it out on your edge devices. And that's it. Thank you.