All right, awesome. Hi everyone, nice to meet you all. I hope you all had a good lunch. Today I'm going to be telling you about ML x-ray, which is essentially an end-to-end debugging platform for models deployed on the edge.

A little bit about myself: I'm Michelle, and as I said before, I'm a principal engineer at New Relic working on the Pixie open source project. Pixie is a CNCF sandbox project, an observability tool for Kubernetes. Before my time at New Relic, I was Pixie Labs' first engineer, and Pixie Labs is where the Pixie project was born.

So why should we even talk about debugging machine learning deployments on the edge? Today we see a lot of products and software moving their machine learning models to the edge. For example, there's Cruise, which is self-driving, a very hot topic these days. Your car is driving around picking up a bunch of sensor information: it's using a camera to figure out whether it's staying in its lane, and it's using LiDAR for object detection, to see if there's an obstacle in the way so you don't accidentally hit somebody. Or take the Amazon Echo, which has been around for a while: you're going about your day talking normally, and it listens and picks up on cues whenever you say "Alexa." All of that is done on the edge. The device picks up sensor information, runs a model, makes an inference, and decides what action to take based on the information it has gathered. Another example is simply deploying applications to phones. These phones run machine learning models to do different things such as image classification, or, in the case of the Pixel 6, the Magic Eraser feature they recently shipped. All of these models run inside the phone itself.

To expand on that, we have the traditional model, which is on your left. Here the sensor is on a separate device, picking up a ton of input data. Take a Nest thermostat: it's figuring out the temperature in the house at a given time, and it might want to do something with that data, so it sends it to the cloud where the model is running. The model runs some inference and returns a result.

When you move your computation to the edge, these models instead run directly on the devices. For the Amazon Echo example from before, the model runs directly on the Echo itself rather than in the cloud. And what happens now is that you have a bunch of different environments: you can deploy to many different edge devices built on different hardware, with different memory and compute resources.

So what are the benefits of doing this? You can see from this picture that you are no longer egressing any data out to the cloud. Before, you had a constant stream of data coming in that you sent out to ask, "OK, what should I do with this information?"
What inference do I want to make? When you move to edge compute, all the information stays within the device itself, and a lot of the time it's just stored in memory. That helps a lot, because you're no longer sending data out and waiting on the latency of a network round trip to tell you what to do. So you improve latency and cut egress overall.

You also get security and privacy benefits, because you're not sending your data somewhere else. Going back to the Amazon Echo case: you're talking constantly in your house, and you don't really want everything you say to be sent to some remote cloud managed by someone else. You feel more comfortable knowing the device is in your home, the data is stored in memory, and at some point it's probably expired out because the device no longer needs it to make an inference. So there are a lot of security and privacy benefits to moving to the edge.

And lastly, you have scalability. Say you have millions of connected devices. In the traditional model, you send all that data to your cloud; here, each device handles its own data. So even with millions of devices, whatever you're running in your cloud isn't impacted by all the inferences being made on those individual devices.

So how do you actually start deploying these models to your edge devices? Usually you train in the cloud as you normally would in the traditional model: you tune all the parameters, you run your training datasets, and your model looks great. You can accurately detect dogs, from the example earlier today. Then you deploy these models to your edge devices. In this image I've drawn the boxes in different colors because I want to make it very clear that these may not be the same architecture; they may not have the same environment; they may have completely different hardware. You're deploying these models to heterogeneous environments.

And what could go wrong? Here's an example. At the top, you have a case where, OK, one second is kind of slow, but you're running your model on your edge device and the accuracy is just not right. When you were training in the cloud and running inferences, the model classified this dog correctly, but now that you've deployed it to your iPhone, for example, it's starting to have problems. In the other case, the bottom one, you deploy to your Android and, oh man, the model that ran really quickly when you trained it in the cloud now takes ten seconds, and you have no idea what's going on. You're running into all these issues that never showed up when you were running in your single cloud environment. So how do you actually go and debug these things? You have a bunch of different environments: an Android, an iPhone, or in some cases a sensor running on one device and another sensor running on some other device. How do you even figure this out?
And that is where ML x-ray comes in. ML x-ray is a project that came out of a Stanford research group, so tons of credit goes to all the wonderful people listed at the bottom for figuring all this out. Essentially, they've built a framework that provides visibility into what is happening in your edge deployments, and you can use it to figure out exactly what is going wrong with a model that usually works well everywhere else you've deployed it.

What ML x-ray gives you is an API that you can use to instrument your models. At the top you see an example of the Python API: all you really need to do is tell the ML x-ray library to start logging when inference starts, run your interpreter, and then stop logging when inference stops. Once you invoke that API, it starts collecting a bunch of interesting information for each layer in your model, which is then used to help you debug what is going wrong in your system.

Some of the information it collects: the original input of the model, the output of the model, whether the result is correct, the input and output of each layer, and the end-to-end latency, so you know how long the whole inference took as well as how long the individual layers took. It also collects things like memory usage; as you move to an edge device with lower memory and compute resources, that is something you might want to hone in on. In the case of the Android example, the Android API also collects peripheral sensor information, such as the orientation of the phone and the lighting detected in the room, which provides more context around the model being run. You can also add your own custom fields to the ML x-ray logs if there's something else you want to capture; the API lets you do that. And it lets you write custom assertions, which I'll talk about in a little bit.
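To make the instrumentation pattern concrete, here is a minimal sketch of what bracketing a TensorFlow Lite inference with start/stop logging calls could look like. The monitor object and its method names are illustrative assumptions, not ML x-ray's actual API:

```python
def run_instrumented_inference(monitor, interpreter, input_tensor):
    """Run one TFLite inference bracketed by start/stop logging calls.

    `interpreter` is a tf.lite.Interpreter with tensors already allocated.
    `monitor` is assumed to expose on_inference_start()/on_inference_stop(),
    standing in for the ML x-ray logging object; the real API may differ.
    """
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    # Start collecting per-layer inputs/outputs, latency, and memory usage.
    monitor.on_inference_start()

    interpreter.set_tensor(input_details[0]["index"], input_tensor)
    interpreter.invoke()
    result = interpreter.get_tensor(output_details[0]["index"])

    # Stop collecting; the per-layer records land in the configured log.
    monitor.on_inference_stop()
    return result
```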
So now you have all this data coming in: you've instrumented your model, and all this data comes out as you run it. But you don't really know what to do with this information yet. OK, cool, this layer takes this many milliseconds and that one takes that many; how do I actually use this to figure out what's going on with the model I've deployed on this edge device?

The idea behind ML x-ray is that there's a set of reference pipelines. A reference pipeline is usually the model you've deployed in the cloud: you know it's accurate, and it's the baseline for how you want your model to perform. You run ML x-ray on that reference pipeline, and it gives you the logs: how long each layer took and the approximate input and output of each layer. Then you run ML x-ray on your development pipeline, which gives you the same information, and you do a diff between the two to create a debug report. That report helps you figure out what's going on with your system and what's different in this environment versus the other one.

The basic flow of the debug report is this. First you do an accuracy validation: you look at the accuracy for your reference pipeline and for your development pipeline, and check whether they match up. If they match, that's great, because the model is performing about the way you expect: I trained this model in my cloud, and my device is accurately doing what I want it to do. But in a lot of cases, as I mentioned before, you're going to find that's not true and the accuracy goes down.

The next step is to look into each layer specifically. You look at the outputs and ask: is there a layer whose output is very different from the output in my reference pipeline? And in the cases where things are slow, you compare latency: this layer took a lot longer in my development pipeline than it did in the reference. You run through that, and it helps you hone in on which layer is having problems.

Finally, as I mentioned before, there are assertion checks. These are custom assertions you specify in your code that check that inputs and outputs are what you expect. Say you have the self-driving case from before, and you know that whenever your camera makes an inference, the width of the street should always be the same. An assertion check could verify that the width in the input to the model is always five feet, or that whatever is detected at the end has a width of five feet.
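As a rough illustration, here's a sketch of the shape such an assertion could take. The output field name is hypothetical, and this is just the general pattern, not ML x-ray's actual assertion API:

```python
EXPECTED_LANE_WIDTH_FEET = 5.0
TOLERANCE_FEET = 0.25


def assert_lane_width(model_output):
    """Fail loudly if the inferred lane width drifts from the expected value.

    `model_output` is assumed to be a dict containing a hypothetical
    'lane_width_feet' field produced by the detection model.
    """
    width = model_output["lane_width_feet"]
    if abs(width - EXPECTED_LANE_WIDTH_FEET) > TOLERANCE_FEET:
        raise AssertionError(
            f"Lane width {width:.2f} ft differs from the expected "
            f"{EXPECTED_LANE_WIDTH_FEET} ft; check the input pipeline."
        )
```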
So what kinds of issues can this pipeline actually help you debug? There are three I'm going to step through in a little more detail. The first is pre-processing errors; the next is quantization inaccuracies; and the last is kernel optimization differences among heterogeneous environments, which is the case I mentioned before where your models run on a bunch of different hardware in completely different environments.

The first is pre-processing errors. Even when you're not deploying to an edge device, you're going to run into this: you have something collecting information that you structure into the input for your model, and it's different from what the model is expecting. It happens even more on edge devices, because everything runs on different environments and different hardware, so your sensors might pick up information in different ways. Say you're taking pictures with a camera: you might need to shrink the picture so it runs well on your edge device with its lower memory budget. So there are cases like resizing, where you might need to downscale the image, or upscale it if the camera isn't capturing the right resolution or something is wrong with the sensor. The information might be rotated when you feed it into the model, which can lead to very low accuracy. Or, in some cases, a model might expect images in RGB format versus BGR format, and you don't really know which it is.

How does ML x-ray help here? This goes back to the assertions I mentioned before: whenever ML x-ray runs on your pipeline, it runs these assertions to make sure the checks pass. In this example, using the Python API, the check expects your input to be in RGB format, so if the input is accidentally coming in as BGR, it lets you know: hey, your deployment pipeline is broken, you need to add a pre-processing step to convert to RGB. Stepping through exactly what this code does: it takes the input from your development pipeline, which is called the edge output, and the input from your reference pipeline, and asks whether they look the same. If they do, great. If not, it tries converting the development pipeline's input to RGB, and if it matches now, then you had a channel mismatch. So it raises the assertion and lets you know there's an issue with your pipeline.
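Here is a minimal sketch of that channel-format check, reconstructed from the description above; the function and variable names are mine, not the library's:

```python
import numpy as np


def check_channel_format(edge_input, reference_input, tolerance=1e-5):
    """Assert the development (edge) input matches the reference input,
    detecting an RGB/BGR channel mismatch if it does not."""
    if np.allclose(edge_input, reference_input, atol=tolerance):
        return  # Inputs agree; nothing to report.

    # Reverse the channel axis: if the flipped input matches the RGB
    # reference, the edge pipeline is feeding BGR and needs a conversion.
    if np.allclose(edge_input[..., ::-1], reference_input, atol=tolerance):
        raise AssertionError(
            "Channel mismatch: the edge input looks like BGR but the "
            "reference pipeline expects RGB. Add a conversion step to "
            "the development pipeline's pre-processing."
        )
    raise AssertionError("Edge input differs from the reference input.")
```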
Another issue you might run into is quantization. This becomes especially important when you're deploying to edge devices because, as I said before, you have lower resource budgets, so you might want to quantize your model so it uses less memory or less CPU. Essentially, this means converting the weights and biases of your model to a lower precision. In this example picture, you start with 32-bit floating point numbers and quantize them down to 8-bit integers.

Some of the issues here come from the quantization process itself just being wrong. One method of quantizing your data needs to know the min and max of your input. What can happen is that there's an outlier in your training data that stretches your min and max to some extreme, whereas most of the values fall somewhere in the middle, nowhere near that outlier. In that case, when you quantize, you get the wrong values for your weights and biases, and that lowers your accuracy.

How ML x-ray helps here is that it looks at that per-layer output and compares it to the reference pipeline's, so you can see how your development pipeline is doing relative to the reference. Here we have two examples. The orange line is a model whose quantized weights and biases we know work, compared against the baseline, the model trained in the cloud. You can see the mean squared error at the bottom is pretty low; it's doing great. Then you have this other model you've trained and quantized, and comparing it to the baseline, the error is much higher. So you should go in and figure out what to do: do I need better training data to fix this, or is there some other process I can use to quantize my weights?
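To make the min/max issue concrete, here is a rough sketch of affine int8 quantization, the general technique rather than ML x-ray's code. A single outlier in the calibration data stretches the [min, max] range, so the typical values get squeezed into only a few of the 256 available buckets:

```python
import numpy as np


def quantize_int8(values):
    """Affine int8 quantization using the observed min/max range."""
    lo, hi = float(values.min()), float(values.max())
    scale = (hi - lo) / 255.0  # value range covered by one int8 step
    zero_point = -128 - round(lo / scale)  # maps `lo` to -128
    q = np.clip(np.round(values / scale) + zero_point, -128, 127)
    return q.astype(np.int8), scale, zero_point


def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale


# A single extreme outlier stretches [min, max] and wrecks precision
# for all the typical values:
rng = np.random.default_rng(0)
normal = rng.uniform(-1.0, 1.0, size=1000).astype(np.float32)
with_outlier = np.append(normal, np.float32(100.0))
q, scale, zp = quantize_int8(with_outlier)
mse = float(np.mean((dequantize(q, scale, zp)[:-1] - normal) ** 2))
print(f"MSE on the typical values when an outlier is present: {mse:.5f}")
```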
Then last, this one is very unique to edge compute, because you're now deploying to a bunch of different devices. They have different hardware, and at the core of it, their kernels optimize different operations in different ways. This can lead to a huge latency or performance difference between devices: in one case something runs very fast, and in another you don't know why the same model is really slow. We used ML x-ray to help us create this graph, comparing across different models how long it takes to run each layer, and some of the results are pretty surprising. The quantized pipeline we used before is actually pretty slow in that second convolution step. ML x-ray helps you figure out that something is wrong in this layer, that's why it's slow, and maybe you need to deploy a specialized model to this particular hardware.

Now I'm going to walk through a little bit of what using ML x-ray actually looks like. First, this is a nifty Colab we have that shows an example model using ML x-ray. The first thing you do is install the ML x-ray library. Then you create your model runner class. This one uses TensorFlow Lite, and the important thing to pick up on is the monitor: you're initializing ML x-ray to start logging information from each layer, the inputs and the outputs. Finally, you invoke the model; that part is not specific to ML x-ray at all, it's just the code to run the model itself. Then you run the model on an image; this is an image classification example. You can see we ask for the logs to go to these specific file paths, and then you run the model. If you scroll down a little more, the model has run and, in the background, ML x-ray has picked up a bunch of logs about how each layer is running, the latency of each layer, all that information.

So what does that actually look like? Here's an example of an ML x-ray log, and there's a ton of information in here. You have the start time, the overall latency of the inference, the memory usage, and, for each layer, all the outputs. I'm not going to keep scrolling; it's just a ton of information. You also get summary information, which tells you, for each layer, how long it took, how much memory it used, and its name. So it collects a lot of interesting data.

Now you have all this information; what exactly do you do with it? ML x-ray has an API you can use to parse this data and make sense of it. Here's an example script. It loads a few things from the ML x-ray library; this first function reads the logs in and parses them, pulling out the keys and the values, and at the end it can plot the results. We used this code to plot the results I showed earlier, back on the slide comparing the differences between the output layers. So you can get started with ML x-ray very quickly.
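As a rough sketch of what consuming those logs can look like, here is one way to parse per-layer latencies and plot the two pipelines against each other. The one-JSON-record-per-line format and the field names are assumptions for illustration; ML x-ray's actual log format and parsing API may differ:

```python
import json

import matplotlib.pyplot as plt


def load_layer_latencies(log_path):
    """Parse per-layer latencies from a log, assuming one JSON record per
    line with 'layer_name' and 'latency_ms' fields (format is assumed)."""
    latencies = {}
    with open(log_path) as f:
        for line in f:
            record = json.loads(line)
            latencies[record["layer_name"]] = record["latency_ms"]
    return latencies


def plot_latency_comparison(reference, development):
    """Plot per-layer latency of both pipelines side by side, assuming
    both logs cover the same set of layer names."""
    layers = list(reference)
    plt.plot(layers, [reference[l] for l in layers], label="reference")
    plt.plot(layers, [development[l] for l in layers], label="development")
    plt.xticks(rotation=45)
    plt.ylabel("latency (ms)")
    plt.legend()
    plt.tight_layout()
    plt.show()
```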
OK, jumping back to my slides, you can see that ML x-ray has some limitations. The first is that you need code changes to enable instrumentation on your debug pipeline, and that can be annoying: you might deploy and then realize, oh, I forgot to add the line that invokes ML x-ray, and you have to go back in and do that. Generally, when we're doing observability, we like low-touch instrumentation. There's also a slight performance impact when you're using ML x-ray, which is obviously more noticeable on GPU. You're writing tons of things to logs, and that also has a memory impact, because you're storing all this data somewhere. And, as we saw toward the end, once you have all this data you need the Python API to parse it. You can use that API to create a graph, but it limits how you can actually visualize the information. If you want to do more interesting things with it, the data isn't in some standard output format that you can stick into any tool you want, so it's hard to build more interesting visualizations.

This is how I got involved with ML x-ray: I work on Pixie, as I mentioned before, and there are a lot of correlations between how we do things in Pixie that I thought could help the ML x-ray project. As a brief summary again, Pixie is an open source CNCF sandbox project for observability on Kubernetes, and there are three pillars that I think help in the ML x-ray case.

The first is auto-telemetry. Pixie picks up information using tools like eBPF without you having to instrument your application; it automatically starts collecting information as soon as it's deployed. That really helps in the ML x-ray case, where right now you have to add a line saying "I want to invoke ML x-ray and start seeing information." It also helps when you don't want instrumentation running on your pipeline all the time: maybe you want it while you're debugging, but once you know things are running well you don't want it anymore, and today you'd have to go take that ML x-ray line back out of your code.

The second is that Pixie does really well at edge compute, which fits this case where we're deploying across edge devices. It follows all those principles we talked about, keeping the data on the edge, in memory.

And finally, I think the biggest thing ML x-ray would benefit from is Pixie's scriptable interfaces. Pixie has a standard data format: everything lives inside a table, and you can do whatever you want with that information to build visualizations very easily.
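For a flavor of what that scriptability looks like, here is a small PxL script (Pixie's Python-like query language) following the pattern in Pixie's documentation; the table and column names are typical examples and may vary across Pixie versions:

```python
# PxL: query the last five minutes of HTTP traffic Pixie has auto-collected,
# project a few columns, and hand the table to the UI for visualization.
import px

df = px.DataFrame(table='http_events', start_time='-5m')
df = df[['req_path', 'resp_status', 'latency']]
px.display(df)
```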
And this is a preview of how we wanted to apply Pixie to the ML x-ray use case. We're going to go into this in more detail tomorrow at Kubernetes on Edge Day, so if you'd like to come by and learn more, it would be great to see you all again.

Here are some resources for ML x-ray. First, of course, all of this is open source: ML x-ray is open source and Pixie is open source, so check out the repos, check out the code, and try running things yourself. I've also included the ML x-ray paper for those who are more interested in picking up the very technical details.

I think we have time for one or two questions. Yes, we have a question.

Q: Hi, thank you for the presentation, really great work. I have a question: why was the decision made to use logs to diff the layer outputs between the cloud and the edge model? Why not probe the actual layers, since I'm assuming you own both the edge model and the cloud model? Logs can run into issues, for example with formatting, and they get really large: your model is large, so you're going to be storing large text files. The parsing is also pretty expensive and can be error-prone. So why not probe the actual layers in the cloud and the edge models?

A: I think that's a very good point. The initial version of this does use logs, and I think that's because it's hard to get this information off an edge device you've deployed to, which you can't access as easily. With Pixie, which I mentioned later, we essentially do use probes to pick up that information, rather than recording it, storing it, and then having to grab the file and parse it later. Fortunately, the parsing only happens when you actually want to debug your pipeline, so it's done asynchronously, not inside the model while it's running.

Q: Great. So how did you end up solving that? You said it was difficult to probe the edge model because it's on the edge, so you don't have direct access to it.

A: I'll talk more about that if you come to Edge Day, but essentially Pixie uses eBPF, which runs at the kernel level, and that can pick up a ton of interesting information.

Q: OK, last question. I understand Pixie's telemetry model in general for service monitoring. I was curious about ML model performance, and whether that data could also be aggregated and looked at in the places where people usually look at ML performance comparisons, like Weights & Biases. Do you have a picture of where that data could intersect, or how you could bring them together?

A: In relation to Pixie, we use eBPF, as I said, and you can use eBPF to hook probes onto certain user-defined functions. That can collect a bunch of information: you can get the arguments of the function, you can get its outputs, and you can send all that data to Pixie to visualize it. I hope that answers your question; I'm not sure I got it exactly right, but we'll be talking more about it tomorrow, so hopefully you can come by and see our demo of how we use Pixie to probe all this information and what we can get.

Q: OK, we have another question from the Slack channel: is ML x-ray mainly for deep learning? All the examples shown seem to assume layers.

A: Yes, it is primarily for deep learning, that's correct.

OK, cool. Thank you.