Hello, everyone. Welcome to this session on infusing trusted AI using machine learning payload logging on KFServing. My name is Tommy. I'm an open source developer at IBM, mostly working on Kubeflow and machine learning infrastructure. And we also have my colleague Andrew. Hi, everyone. I'm Andrew Butler. I also work at IBM on open source, and most of my work is about integrating trusted AI into KFServing. Thanks, Andrew.

So now we want to go over some of the background on why we need machine learning model serving. Production model serving is a very difficult process. First of all, data scientists have to train the model, and before training the model they also have to prepare and filter the dataset. Once they have trained the model, they have to figure out how they want to deploy it and scale it on top of the cloud. This whole process is very complicated and very difficult to scale. That's why we want to create open source solutions that let data scientists just bring their models and scale them on top of Kubernetes. And this is where the KFServing project starts.

KFServing was founded by Google, Seldon, IBM, Bloomberg, and Microsoft. It is currently part of the Kubeflow project and focused on the 80% use case of single-model rollouts and updates. The goals for KFServing are serverless ML inferencing, canary rollouts, and model explanation. Optionally, it also helps data scientists do pre- and post-processing around the prediction.

This is the high-level description of KFServing. KFServing by default supports pre-processing, prediction, post-processing, and explanation. It has a set of default frameworks you can use, such as TensorFlow, PyTorch, scikit-learn, XGBoost, ONNX, TensorRT, et cetera. And it is built on top of Knative and Istio.

So we'll go over some of the background on Knative. Knative is one of the projects we use to serve the models, and IBM is the second-largest contributor to Knative. KFServing uses Knative to build several major functions, such as serving models serverlessly and collecting metrics to autoscale, including on GPUs and TPUs. The other project is Istio. Istio is used in conjunction with Knative to help us connect the models to an ingress, observe the models with a set of metrics, and do logging and tracing as well. In addition, it provides secure connections: if you want to secure your models using tokens or secrets, you can do that by enabling Istio. We also have Istio policies to help us shift traffic for the models to different routes.

Now we want to showcase how KFServing works using the default and canary configuration. By default, when you create a model on KFServing, you create what we call an inference service. The inference service is how we manage the lifecycle of the model. Under the hood, when the inference service is created, KFServing actually creates a Knative Configuration for the model, and that Configuration contains multiple Revisions, which are the versions of the model. It also creates a Route that points every version of the model to the generated Knative route that users hit to get predictions. Currently in KFServing we have a set of default components and storage systems, and the default model frameworks we support include TensorFlow, NVIDIA Triton, PyTorch, XGBoost, scikit-learn, ONNX, and more.
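As a rough sketch of what one of these inference services looks like, here is a minimal manifest in the v1alpha2-style API (field names can differ across KFServing releases, and the model path is just a sample location):

```yaml
apiVersion: serving.kubeflow.org/v1alpha2
kind: InferenceService
metadata:
  name: flowers-sample                 # example name
spec:
  default:
    predictor:
      tensorflow:
        # Sample model location; KFServing pulls the saved model from here.
        storageUri: gs://kfserving-samples/models/tensorflow/flowers
```

Applying this one resource is what triggers KFServing to create the Knative Configuration, Revisions, and Route described above.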
And when you create a model, there are several components you can create: a generic predictor that just serves predictions from your model; a concept called an explainer, where you can add an additional explanation when you make a prediction; and a transformer, which helps you pre-process and post-process the request and response around the predictor. For storing your models, KFServing currently supports S3, GCS, Azure Blob, persistent volume claims, and generic URIs like HTTP.

The inference service control plane, when it gets more advanced, can actually have multiple endpoints. In this case, you can see that when the user calls an endpoint, either for prediction or for explanation, it actually goes into an ingress gateway, and behind the scenes, when you have canary traffic control, it routes to different Knative services. When it routes to one of the Knative services, that service is backed by multiple pods that handle different functionality. For example, when you want to do pre-processing and post-processing, there will be a transformer pod to help process the request, and then we direct that request to either the explainer or the predictor based on the endpoint the user called. When it goes to the explainer, the explainer will call the predictor pod to get the information it needs to explain the transaction in the user's request. And if the user is calling the predictor directly, the request just goes to the predictor, and the model result is returned back to the user.

From a deployment point of view, when you deploy KFServing, the traffic routing gets quite complex, so we have distilled a little bit of how it works in conjunction with Istio and Knative. When the traffic comes in, it actually hits an Istio ingress gateway first. That triggers the Knative activator to route the traffic to the Istio sidecar, and the sidecar forwards the request to the user container, which holds the model and does the prediction. When the prediction is complete, the response is returned back to the user. We also have the storage initializer, which is used when you create a new predictor or explainer: it pulls the model down locally so you don't have to re-download the model every time you try to do a prediction.

Creating a KFServing model is very simple. You just create this basic CRD to define the apiVersion, the kind InferenceService, and a name for your model. In the spec you can have different model specs, in this case scikit-learn, TensorFlow, or PyTorch, and then you just have a storage URI that points to your model on object storage, HTTPS, or a persistent volume claim. Behind the scenes, KFServing helps you deploy the model with the endpoints already set up.

So now we have gone over how we can do model serving on Kubernetes, and we have a good story on model serving. So how can we make the served models trusted? Because of the nature of models, they always change based on data, so we always need to ask: is the model vulnerable to adversarial attacks? Can we explain why the model made a particular prediction? Is the model producing any outliers over a certain amount of time? And is there any concept drift in the current set of model predictions?
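For reference, the canary traffic control described a moment ago is declared on the same resource. A minimal sketch, again using the v1alpha2-style fields (exact names vary by release) and example storage paths:

```yaml
apiVersion: serving.kubeflow.org/v1alpha2
kind: InferenceService
metadata:
  name: sklearn-canary                  # example name
spec:
  default:
    predictor:
      sklearn:
        storageUri: gs://my-bucket/models/credit/v1   # current default model (example path)
  canary:
    predictor:
      sklearn:
        storageUri: gs://my-bucket/models/credit/v2   # candidate model (example path)
  canaryTrafficPercent: 10              # send 10% of prediction traffic to the canary
```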
To start answering these questions, we introduce the idea of explainers, and in addition we introduce the concept of payload logging to do more advanced offline detection and explanation. At a high level, when you want to do an online explanation, you can have an explainer do real-time processing and give the user back an explanation for each transaction. However, for more advanced use cases, such as outlier detection, adversarial detection, and concept drift, we need to do offline explanation, where we collect data over a certain amount of time and use that transaction data to give a more advanced explanation of how the predictions are being made and whether they are being made correctly and in a trusted way.

The way it works is that for every request that comes into KFServing, we have a logger agent that forwards the request to, in this case, a Knative broker. A trigger on the Knative broker sees it and forwards it to whatever services we have to generate metrics or do outlier detection, and if it crosses a threshold that we have defined, an alerting message is sent to the user. Payload logging works on all of the inference service pods: not only can we get the request and response from the predictor, we can also get the request and response for every transaction from the explainer and the transformer as well. Each of those payloads is sent as a CloudEvent, and you can specify the URL to send the events to whatever platform you want, not just Knative.

Specifying payload logging on the inference service is very simple. Under your predictor, explainer, or transformer, you just put a section for the logger. Under that you can say, in this case, that we want to send all the requests and responses, or you can limit it to just requests or just responses. And you set the URL to a service that can accept CloudEvents and forward the payloads to whatever offline explanation you want to create. In this case, we also showcase that you can use it with an HTTP Kafka bridge, so you are not limited to Knative for sending the CloudEvents for payload logging.

And now I will hand it over to Andrew to talk about metrics and explanations and why we need them in the first place. Sure, thanks Tommy. The basic idea here, as Tommy mentioned, is that we have this idea of trust in AI, concept drift, and explanations for models. We want to go a little bit more in depth on what those actually look like, the problems they address, how we build up trust in AI systems, and what it is that we're using that utilizes these payload logs.

Specifically, the Linux Foundation AI & Data foundation has come up with eight principles for trusted AI: reproducibility, robustness, equitability, privacy, explainability, accountability, transparency, and security. That's a lot, so we're only going to cover basically four of those here, around the four tools that IBM has been working on and has already contributed to Linux Foundation AI, and that LF AI is now working on. So let's take a look at those.

We have four, the first one being around robustness. The question is: did anyone tamper with the model? And if they did tamper with it, can we see what happened, detect it, and build defenses against it? The next one is fairness.
Can we ensure that our model behaves fairly across people of different genders, races, and nationalities, and ensure that our model doesn't become biased by particular data points, whatever the source may be? The third is explainability: can we explain why our models made certain predictions? There are different levels of this and different stakeholders we need to produce these explanations for. So what are they, and how can we appeal to a larger audience? And the fourth is lineage: are our models accountable? Can we go back and check what has happened and how the model came to its conclusion?

So these are the four major project areas. Three are already contributed to LF AI and widely used, and those are around robustness, fairness, and explainability. We'll talk a little bit about the projects that cover those in a second, and how we integrated them into KFServing and are utilizing payload logging as well.

The first one is robustness. To set up the problem a little bit: models are great when you have them in standalone situations where an expert can come and look at the model predictions afterwards. But sometimes these models have to work in real-time settings where an expert isn't readily available to check. An example of this is self-driving cars. You have a car driving around a road, and it needs to be able to recognize stop lights, stop signs, other cars, pedestrians, all types of things. And in this real-time setting, we need to be sure that there is no malicious actor messing with the model that could make the car, say, run a stop sign, like we have in the left example, where someone just put a few boxes of different colors on top of a stop sign. Although the average person would look at that and still recognize a stop sign, your model may not be able to. So how do we check for this and protect against it in production? That's the robustness problem.

Now to take a look at the explainability issue: we have a bunch of stakeholders, and in order to get them to trust the model, they need to have an understanding of where it's making its decisions and what it is, explicitly, in each example that is informing its decision. Just to name some of the stakeholders: for the first group, end users and customers, they need to know, very simply and without a complex understanding of AI and ML, why they were recommended something. So for people who submitted a loan application and got denied, what was it that caused the denial? If somebody needs to look at that and figure out what it was, we need to know the data points the model considered and those sorts of things. Similarly, a doctor who is looking at a scan that has been sent to a model recommending a certain treatment needs to know exactly what the model based its recommendation on, so the doctor can go back and verify it. The second stakeholder is government regulators. This is the question of: can you prove to me, through your explanation, that your model wasn't using protected data points that shouldn't be utilized? Can we guarantee those weren't used in the prediction? One of the ways you can do that is to take a look at the explanation and see what data points were considered. The next is developers.
If the model makes a wrong prediction, what is it that made the prediction wrong? Once we know what it is, we can go find examples that we can train our model on and make it much better. So, debuggability for developers. That's what we're looking for with explainability.

And lastly, for fairness, the idea is: can we guarantee that our models are treating everyone fairly across different races, genders, and nationalities? We want to make sure that even though our models might be built on biased datasets, and in some cases on datasets that we don't even know are biased, we can check for that and protect against it. An example of this would be loans again. If someone put in an application for a loan and got denied, we want to make sure that the dataset used for the model wasn't based on very old records of who was accepted and who was denied, because it may not be the case that those applications were approved and denied fairly; they may have been considering things that shouldn't be considered. A more concrete example is an algorithm that was used in Broward County, Florida to assign a risk score to people who had committed an offense and been arrested, estimating how likely they were to re-offend. It was not a fair model at all, and it was not equitable across races. After a report came out showing that was the case, they had to re-examine their entire model and decide what was causing it and how to prevent it in the future. So that's the problem for fairness.

Now let's look briefly at the toolkits that are available for these problems. The first one, for adversarial robustness, is ART, the Adversarial Robustness Toolbox. The idea is to provide the ability to quickly analyze attacks, quickly create defenses, and build good detection methods for when our model is being attacked by some adversary attempting to trick it into mispredicting in some way. So that's ART, and we'll see an example of all of these implemented in KFServing at the end.

For explainability, we want to look at how we can explain our predictions, and there are a few different ways to do this; we'll look at one specifically. We can look at the data: based on the data we have from the past and the predictions that were made, what ideas can we derive from the dataset? And we can also look at the models: taking a specific example and trying to get a complete explanation for it, but then also looking over a wide range of examples, getting explanations for that whole range, and comparing across it. That's AIX360.

And the last one is AIF360, all around fairness: making sure we have metrics to check for fairness, having pre-processing methods so we can remove unfair data from datasets, making sure the training itself is fair, and then, even if we end up with a biased model, post-processing it in a way that removes the bias.
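These toolkits hook into the explainer component of an inference service that Tommy described earlier. As a rough sketch of what attaching one looks like, here is a v1alpha2-style Alibi explainer with example storage paths; the AIX360 and ART integrations discussed next attach in the same way, and the exact type and field names depend on the KFServing release:

```yaml
apiVersion: serving.kubeflow.org/v1alpha2
kind: InferenceService
metadata:
  name: income-classifier                            # example name
spec:
  default:
    predictor:
      sklearn:
        storageUri: gs://my-bucket/models/income/model       # example model path
    explainer:
      alibi:
        type: AnchorTabular                          # explainer algorithm to run
        storageUri: gs://my-bucket/models/income/explainer   # example explainer artifact path
```

With this in place, the same inference service exposes both a prediction endpoint and an explanation endpoint.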
So lastly, what we've done is we've moved these trusted AI projects into KFServing and integrated them into core KFServing functionality, so that with any model you've served using KFServing, you can quickly and easily, as Tommy has shown, add an explainer that will, say, get you metrics for how biased your data is, get explanations for each individual prediction it comes up with, or give you adversarial examples that trick your model into mispredicting. That's what we've enabled, and then on top of that, payload logging is the attempt to build more complex systems and workflows, so that we can look at metrics over very long periods of time and make sure we're getting a full, comprehensive picture of what's going on.

So specifically, here are some of the examples we have. First, for AIX360, we use a tool inside the trusted AI projects, in AIX360, called Local Interpretable Model-agnostic Explanations, or LIME. The idea, for a concrete example, is that you take an MNIST image, just any handwritten digit, and feed it to the explainer, and it returns the same image but with the pixels highlighted that LIME believes are most indicative of the classification it has given. So it gave a classification of two here, and the pixels highlighted in red are the ones LIME believes point most strongly to a two over some of the other predictions it might give, like a nine or an eight.

Then for ART, we have integrated the square attack method, which puts a mask over the image to try to make your model mispredict. This is an example of a mask that might be placed on top of your MNIST image in order to trick your model into maybe mispredicting it as a nine or something similar. So that's what our tool will do in KFServing.

And lastly, for AIF360, the main reason payload logging is needed here is to take a look at bias and fairness metrics over long periods of time. Say you've let your model run in production for a very long time and you have all these examples of the data and the predictions the model has given; we can then compute metrics from those payload logs and make inferences based on those metrics. So those are the three concrete examples of what we've implemented in KFServing so far. As well as that, Seldon has their own explainers with Alibi, and they have lots of great tools that work in a very similar way to the trusted AI tools and can give great explanations, calculate drift, and similar ideas there as well.

So now I will pass it over to Tommy to give a demo. Thanks Andrew. We'll go into the demos right now. First I will go over the demo flow of what we're going to demonstrate here. In this case, we have a KFServing model built with scikit-learn. For this model, we forward all the requests and responses to a Knative event broker, and this event broker is configured with a Kafka cluster, so all the requests and responses are actually logged into Kafka. Behind the scenes, we have a Kafka connector to send those Kafka events to a relational database, in this case MySQL. As Andrew mentioned, for AIF360 to calculate the fairness measures we need a lot of data, so we have to accumulate our historical predictions in a relational database and calculate the metrics periodically.
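Concretely, the wiring just described comes down to two pieces of configuration: the logger section on the inference service pointing at the broker, and a Knative Trigger that forwards the logged CloudEvents to a sink on the Kafka path. A rough sketch, using v1alpha2-style fields; the broker URL, the kafka-sink service name, and the event-type filter string are illustrative assumptions and may differ from the actual demo setup:

```yaml
apiVersion: serving.kubeflow.org/v1alpha2
kind: InferenceService
metadata:
  name: german-credit
spec:
  default:
    predictor:
      logger:
        mode: all                                    # log both requests and responses
        url: http://default-broker.default.svc.cluster.local   # any CloudEvent-capable sink
      sklearn:
        storageUri: gs://my-bucket/models/german-credit        # example model path
---
apiVersion: eventing.knative.dev/v1
kind: Trigger
metadata:
  name: inference-logs-to-kafka
spec:
  broker: default
  filter:
    attributes:
      # CloudEvent type set by the KFServing logger for request payloads (assumed value).
      type: org.kubeflow.serving.inference.request
  subscriber:
    ref:
      apiVersion: v1
      kind: Service
      name: kafka-sink                               # hypothetical service that writes events to Kafka
```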
And in this case, the AIF360 detector calculates new metrics every hour, pulling the latest data from MySQL. Once the metrics are calculated, they are stored back into the persistent database, and we also have a metrics transformer to watch and publish the latest metrics to Prometheus. Then we have a visualizer UI; in this case, we're using Grafana to visualize the latest status of our fairness metrics. You can also see the detailed demo flow in the slides, which you can download from our website.

So now I'm going to go ahead and show you the actual demo. In this case, we have already set up the deployment for Knative Eventing to receive events. As you can see, we have the Kafka broker to sink all of our requests and responses into Kafka, and we have also set up Kafka Connect to help us sync all the requests coming into the Kafka cluster into our MySQL database.

Now let's go ahead and take a look at our model. We have a model called German Credit. This is a model that uses a set of features to decide whether or not to approve credit for a particular user. The model is built with scikit-learn, and right now it's served at this endpoint. Now we want to create some predictions and get some results back. To do this, we use a Python script to generate some dummy data and send it to the model, and we send ten different payloads, so we have ten new entries in this case. Once the predictions come back, you should be able to see the results: a one means the credit is denied and a two means it is approved in this case. So out of the requests we just made, only one of them was approved.

Now let's go ahead and take a look at the Kafka events. Inside Kafka, when a request comes in, there are actually two events: one is the payload of the request, and one is the response, which is the result of the prediction. We have this set up in streaming mode, so if you rerun the script with a new set of payloads, you should see the new requests and responses loaded into Kafka in real time, as you can see here. And from the database perspective, you should also see the predictions synced into our database. Previously we had 76 rows, and because we made a couple more sets of requests, each with a request and a response, we now have 86. To confirm that more data keeps coming in, I'll do another request, and once the prediction is back and synced by the Kafka connector, you can see we have more rows coming in, 88 now, for the new request and response.

Behind the scenes, periodically, we have the AIF360 cron job generating more metrics, as you can see over here. Every hour we generate a new set of metrics and push them to Prometheus so we can visualize them in Grafana. You can see the cron job running, and it is set to push metrics periodically for Grafana. Once we have those metrics, let's refresh the Grafana dashboard: you will see that every hour new metrics are generated. In this case, we are looking at the disparate impact metric, which is the ratio of favorable outcomes between the unprivileged and privileged groups. Ideally, in AIF360, when we calculate disparate impact, the value should be between 0.8 and 1.2, because that's the ideal ratio between the two groups.
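For reference, the two AIF360 metrics shown on this dashboard can be written roughly as follows, treating an approved credit as the favorable outcome (the base rate can also be computed per group):

```latex
\text{disparate impact} \;=\;
  \frac{\Pr(\hat{Y}=\text{approved}\mid \text{unprivileged group})}
       {\Pr(\hat{Y}=\text{approved}\mid \text{privileged group})}
\qquad\qquad
\text{base rate} \;=\; \Pr(Y=\text{approved})
```

A value of 1 means both groups are approved at the same rate, which is why the demo treats roughly 0.8 to 1.2 as the acceptable band.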
However, in this case, we can see that most of the disparate impact values over time are around 0.54, which is well below that range. So for the data scientists, you can see the distribution of the results is not very good, and they should try to fix the model and make it better. In addition to just looking at one value, you can also set up the dashboard to pull in other metrics. For example, if you want to look at just the base rate, which is the balance of how often credit is approved for a person, you can just create this base rate query and apply it. Usually it takes a few seconds to load the data, or we can create a new dashboard to display this data. So when I want to take a look at the base rate, we can create this panel and then see what the base rate is. Because our sample data actually has a high ratio of not approving credit, you can see that over time the base rate dropped, because our sample data is actually a little bit more biased now. Based on this information, data scientists have more useful information to know why they need to fix the model and how they can fix it, based on the measurements provided by the AIF360 fairness tools.

And that is the end of the demo. Thank you very much for joining. If you have any questions, we will be on Slack and will be answering questions right after this. Thank you very much.