Good afternoon everyone. Please settle down; we've started with the talk. I am Ram Prakash, and I hope I can make this post-lunch talk interesting. This is my Twitter alias, and I work for a company called Zoho. We make lots of business and productivity applications; we call ourselves the operating system of business. I work for a group called Zoho Labs, where we design solutions for our product teams and offer them as a service. We work on lots of interesting problems, like hardware acceleration using FPGAs and GPUs; we are one of the active contributors to PostgreSQL; and we have a division that works on machine learning and artificial intelligence, where I own the machine learning product stack. I hope that gives you a good idea about me. I'll get into the talk.

Now let us look at the problem description. Machine learning and artificial intelligence as a field is, frankly, a bit snake-oil-ish. I know that term is difficult to digest, but there are no properly proven results of an ML system autonomously taking decisions; there always has to be a human element that can step in and make the decisions for you. Any ML practitioner would agree with me that this field hasn't matured enough for a full-blown business production system. And coming from the B2B world, there's not much differentiation you as an ML service provider can offer, because almost all the algorithms are open. In the B2C world you have a data advantage: "I know your users better than the other company does, so my algorithms will work better for you." That is not the case with B2B companies, and not the case in highly regulated industries where data is well defined in a schema. If it is a CRM system, you're only going to get contacts and leads.
If it is a support desk system, you're only going to get support tickets. So there's not much differentiation you can offer over your competitors.

We all use cross-validation techniques to evaluate our ML models, and I don't find that very convincing for a production system. Yes, it is a good indicator, but it might not be the best indicator of the accuracy of your ML system. Problems like these can happen. One is data leakage, where an unintended variable in your training data set has an unintended effect on your result: for example, a patient ID that turns out to be heavily correlated with the chance of that patient having cancer. That is unintended data leakage, and your cross-validation might not capture such edge cases. There can also be dataset shift. The standard practice in the ML industry is to take a snapshot of data from the past, train a model over it, then validate on snapshots of data from the near future and report the results. But real-life data keeps changing, and in B2B companies you may be deploying one model and serving it to many customers, so you have to make sure all your customers get their predictions right and tune your models accordingly.

These are the kinds of problems we face, and here comes a solution: the model explainer. What if your model could explain its predictions? That is how it works in real life, right? If you just ask someone to do something, they may not do it; but if you ask and also tell them, "Hey, this is why you should do this," there is a good chance they will. This is where we took our inspiration from. This paper took the industry by storm; there was a lot of interest around it, and the interest hasn't faded yet.
So we started looking at this paper, and this is what it is all about: Local Interpretable Model-Agnostic Explanations (LIME). I'll break that down into simple terms. We have hosted our own version of it and made it production-ready. It is Apache 2.0 licensed, which is a commercially friendly license. We had to make lots of changes so that we could serve this explainer in a real-time production system; the existing implementation from the paper was a bit difficult for that. We are primarily a Java-based company and we use Spark for our machine learning implementation, so we thought it was better to write it on top of Spark.

Coming to "local" and "interpretable": the explanation you give has to be local to the given query (I have an example coming up for that), and it has to be interpretable for an end user to understand. "Model agnostic" means the paper's explanation system works independently of the underlying machine learning model: it looks at the parent data set and gives you an explanation. We, however, take certain clues from the model, so our version is not model agnostic now; that is one change we made to make it suitable for production.

The one-liner of this framework is: your model can be complex and nonlinear, so we try to fit a simple linear model around your query, which makes your explanations interpretable and local. I haven't yet given you a proper example of "local," so here it is, the ML world's famous housing price example. You are going to buy a house somewhere in Bangalore or Chennai, and you have variables like size in square feet.
Is it an apartment or a luxury villa, how far is it from the nearest landmark, and so on. A global explanation would say the result is because of the size in square feet, since that has the highest correlation with the end result. But look at the second case: you're buying a 600-square-feet luxury villa. Here you cannot explain "this is priced greater than one crore because it is a 600-square-feet house"; that reasoning doesn't hold good. For this particular query you have to use a different variable: the explanation could be that it is a luxury villa, so it is priced that way. For the first case, it is an 1,800-square-feet apartment, so it is priced this way. I've shown the same thing visually: the features are size in square feet, distance, and is-luxury, and there are three price categories: less than 50 lakhs, 50 lakhs to one crore, and greater than one crore. So now you have the context for "locally interpretable."

Importantly, local faithfulness does not imply global faithfulness. There are two different ways to explain models: one gives you a global picture, the other gives you a local picture around a given query. The problem we are trying to solve is the local picture. For the global picture there are lots of existing ways: you could get the feature weights from your decision tree and print them out, or visualize your data set. Local explanations are a different ball game, and that's what we're trying to solve.

Now I'll explain the design of the system. This is where we lost the model-agnostic feature: to keep the explanation model fully model agnostic, we would have had to look up the training data for each and every query.
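Before going further into the design, the one-liner above, fitting a simple linear model around the query, can be sketched in a few lines. This is a hypothetical single-feature illustration in plain Python, not the project's actual Spark implementation; the black-box model, kernel width, and sampling scheme are all invented for the example:

```python
import math
import random

def explain_locally(predict, x0, kernel_width=0.5, n_samples=1000, seed=0):
    """Fit a weighted linear surrogate y ~ a + b*x around the query x0.

    `predict` is the (possibly nonlinear) black-box model. The returned
    slope `b` is the local explanation: how the prediction changes per
    unit change of the feature near x0.
    """
    rng = random.Random(seed)
    xs = [x0 + rng.uniform(-1.0, 1.0) for _ in range(n_samples)]
    ys = [predict(x) for x in xs]
    # Exponential kernel: samples close to the query get more weight.
    ws = [math.exp(-((x - x0) ** 2) / (2 * kernel_width ** 2)) for x in xs]

    # Closed-form weighted least squares for the line y = a + b*x.
    W = sum(ws)
    Sx = sum(w * x for w, x in zip(ws, xs))
    Sy = sum(w * y for w, y in zip(ws, ys))
    Sxx = sum(w * x * x for w, x in zip(ws, xs))
    Sxy = sum(w * x * y for w, x, y in zip(ws, xs, ys))
    b = (W * Sxy - Sx * Sy) / (W * Sxx - Sx * Sx)
    a = (Sy - b * Sx) / W
    return a, b

# A "complex" model (here just x^2). Near x0 = 3 it behaves like a line
# of slope about 6, which is what the surrogate should recover.
_, slope = explain_locally(lambda x: x * x, 3.0)
```

The surrogate's slope is the locally faithful statement the talk is after: it is a good description of the model near the query, even though it would be a terrible description of the model globally.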
So we weakened that requirement and said, okay, we'll borrow some data from the model itself and use it. That was one major change: we lost the model-agnostic property, but we have kept the door open, so if a model-agnostic need arises tomorrow, we can always roll it back.

This is how it works. Training data comes in and you feed it to your machine learning model, and we look at the continuous and categorical columns. For now we are only looking at tabular data sets; we haven't ventured into text and images yet, though that is definitely in the pipeline. We sample the categorical columns, and we divide the continuous columns into discrete buckets, which reduces your dimensions, and we inject that into the model. That is the training process, and it happens whenever you create your machine learning model.

The next part happens at query time. You have a machine learning model and a prediction query coming in, and you predict on the model. Now you build a binary vector, and this is the place that opens up future use cases. For a categorical or discretized continuous feature, the binary vector encodes the mere presence or absence of that particular variable. In other cases, say tomorrow you're trying to explain images, it could be patches of the image, a contiguous group of pixels; or for text analysis, it could be a contiguous set of words given as the explanation. Either way, a binary vector determines the presence or absence of each component. We do some scaling on that, and based on the discretized values we get the sample weights.
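A minimal sketch of the bucketing just described, and of the proximity weighting derived from it (the talk returns to this weighting in the Q&A). The ten equal-width buckets over 0-100, the query value, and the exponential proximity kernel are all invented for illustration; the real system does this over Spark data sets:

```python
import bisect
import math

def bucketize(value, bounds):
    """Index of the bucket [bounds[i], bounds[i+1]) containing value."""
    i = bisect.bisect_right(bounds, value) - 1
    return min(max(i, 0), len(bounds) - 2)

def binary_vector(value, bounds):
    """Presence/absence (one-hot) vector over the discrete buckets."""
    vec = [0] * (len(bounds) - 1)
    vec[bucketize(value, bounds)] = 1
    return vec

def bucket_weights(value, bounds, tau=10.0):
    """Weight each bucket by its proximity to the query value:
    the closer the bucket, the more weight it gets."""
    centers = [(bounds[i] + bounds[i + 1]) / 2 for i in range(len(bounds) - 1)]
    return [math.exp(-abs(value - c) / tau) for c in centers]

bounds = list(range(0, 101, 10))   # buckets 0-10, 10-20, ..., 90-100
vec = binary_vector(25, bounds)    # 25 lands in the 20-30 bucket
ws = bucket_weights(25, bounds)    # the 20-30 bucket gets the top weight
```

The binary vector is what makes the representation interpretable (a feature is simply present or absent), and the proximity weights are what keep the surrogate local to the query.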
Now we have the weighted data, and we have a feature selector and a regressor, so that we can extract the features and give each a score, a confidence.

We started this off primarily to test our own algorithms and configurations. The thing is, in Zoho Labs we serve other product teams, so we have to show a compelling value-add, first to the product managers of the other teams and then to the users of their products. This project has no dependency on the other product teams, so we could get it started on our own, and the value-add can compound from there.

We ran our pilot in our churn prediction app. Every hour, a churn predictor looks at the access logs and builds a model of whether a particular subscriber will renew his subscription or churn. We have variables coming in from the access logs, and variables coming in from the help desk: number of tickets, sentiment of the tickets, number of tickets escalated, and so on. So many variables feed into this churn prediction system, and we hooked the explanation engine onto it.

These are the results we were getting. This one is for a user who is going to renew his subscription. It says he has done lots of bulk-insert activities (this is for our Campaigns product, and the data is a snapshot of that); based on the campaigns he has sent there is a high probability that he will renew, and likewise based on his percentage of non-HTTP-200 requests. Those are the explanations for this case. And this one is an explanation for a user who is going to leave.
This user has a higher probability of leaving because of the value of his percentage of non-HTTP-200 requests, too many support tickets, and his active sessions. We could also print out the actual values, but I'm just giving you a visualization here.

We are rolling this out to customer-facing apps in a phased manner, because in the B2B world people mostly turn off automatic recommendations; people want to do it themselves. So we have started rolling it out in the form of a notification: "this could be because of this." The idea is to instill trust in the user and get them to use more of the machine learning features we have in the pipeline. That is the whole goal of the project.

I thought I'd add a slide on the other ways you can try explaining your models. Quantile regression models the data percentile by percentile. I haven't explored all of these, but they are the other options in the market that you can go back and look at. Glyphs and correlation graphs are visual representations of your original data set. We used a lasso regressor; there is a variant with least-squares regression and one with elastic-net regression, and you try to calculate the feature importance for that particular query based on sampled data. There are tools like GAMs (generalized additive models) that let you hit a sweet spot between the accuracy of the model and its explainability, so you can get in and tune things to see how they work. And there are dimensionality reduction techniques like t-SNE, autoencoder networks, and PCA, which project your data set into a lower-dimensional space so that it is easier to visualize and understand.
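Of the options above, the lasso regressor is the one this project actually uses for selecting features: the L1 penalty shrinks uninformative coefficients all the way to zero. Below is a toy coordinate-descent sketch in plain Python; the data, penalty, and iteration count are made up for illustration, and the real system uses Spark's regressors rather than this code:

```python
def lasso(X, y, lam, n_iters=100):
    """Coordinate descent for min (1/2)||y - Xb||^2 + lam*||b||_1.

    X is a list of feature columns, each a list of floats.
    """
    def soft(rho, t):
        # Soft-thresholding: this is what drives small coefficients to 0.
        if rho > t:
            return rho - t
        if rho < -t:
            return rho + t
        return 0.0

    n_features = len(X)
    b = [0.0] * n_features
    for _ in range(n_iters):
        for j in range(n_features):
            # Residual with feature j's own contribution added back in.
            r = [y[i] - sum(b[k] * X[k][i] for k in range(n_features) if k != j)
                 for i in range(len(y))]
            rho = sum(X[j][i] * r[i] for i in range(len(y)))
            z = sum(v * v for v in X[j])
            b[j] = soft(rho, lam) / z
    return b

# Toy data: y depends only on x1; x2 is an alternating-sign distractor.
x1 = [i - 4.5 for i in range(10)]                     # centered 0..9
x2 = [0.5 if i % 2 == 0 else -0.5 for i in range(10)]
y = [3.0 * v for v in x1]
coef = lasso([x1, x2], y, lam=1.0)
```

On this toy data the distractor column's coefficient is driven exactly to zero while the informative one stays near its true value of 3, which is the same mechanism that keeps only the most contributing variables per query.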
So those are other techniques which you can go back and have a look at. That's it from me. Just to summarize: we are trying to build an explainer that explains the predictions of a machine learning system. The explanation is for a local query, that is, with respect to each individual query, and it instills trust in the end users. We are rolling it out in phases. It is hosted on GitHub with an Apache 2.0 license, so your contributions and feedback are welcome. Thank you. Do we have questions? We'll start from there.

"Hi, thanks for this. You mentioned that the model is local and interpretable, right? But in the last slide you showed the churn customers. Shouldn't that be for individual churned customers, if it is local?"

Yes, these are just individual cases. Look at the number of variables in the system; it is the same. For this particular result it has chosen three variables: bulk insert, campaigns sent, and percentage of non-HTTP-200 requests. For the other query it has chosen two different variables plus percentage of non-HTTP-200.

"Okay, so is this for a single user or for a group?"

These are two different users. One is a user who is going to renew; one is a user who is going to leave.

"Sorry, then it's not local, right? It's the global phenomenon for the churning group and the non-churning group."

No, it's not for a group. It's for that particular individual user.

"Question here. This might be a very stupid question, so sorry for my ignorance, but your binary vector and all of this looks like a rule engine with a bunch of nested if-elses. Where is the model?"
"I think what he's trying to say is that he saw a lot of if-elses there. Am I correct? The model seems to depend a lot on your ability to parameterize, and all the examples were on a sample size of one. So I think you're making a rule-based engine here, with a lot of nested if-elses thrown into it."

No. We're trying to fit a simple linear model around a very complex system. For linear models the explanation is very simple, and the same holds good for a decision tree. We build a simple linear model around that particular query; that is the whole crux. When it is fully model agnostic, the explainer is totally independent of the underlying model, meaning your model could say this user will churn while the explainer gives an explanation for why he will subscribe. That is part of why we now borrow some data from the model. But at its heart, yes, it is just fitting a simple linear model around a very complex nonlinear model. That is the whole idea.

"You said it is not model agnostic, so which models does it support? And how is it different from a decision tree, which already explains itself through its decision nodes, or a regression model, which gives you the coefficients? And thirdly, is this a free, open-source tool that anyone can use?"

Sorry, I didn't get your last question at first: yes, it is free and open source, Apache 2.0 licensed. For the first question: for now it supports decision trees and random forests; that is the version on GitHub. We will soon open it up for gradient boosting and SVMs, specific to Apache Spark, since we have written this on top of Apache Spark.
As for how it differs from other linear systems: it is basically a linear system, but you build it around your query. Next question.

"What would an explanation for, say, an image look like?"

When it comes to interpretability: here, interpretability is merely the value of a variable. But in text you cannot hand someone a single word and say "this is the reason," and in an image you cannot hand someone a single pixel. You have to give a patch of the image. Say the question is whether this picture is a cat or not: you would give a reasonable subset of pixels showing something like a cat's eye. The same goes for text. Say you're doing sentiment analysis on "this product is so good": "so good" would be a more reasonable explanation than "this" or "product." Here it is just the presence or absence of a single variable, but there it could be a group.

"Hi, I'd like to ask: how would this be different from just extracting rules from a CART model?"

That goes back to the housing price example. Rules like that give you a global explanation. You could say size in square feet heavily contributes to the result, but not for that particular query. Let me show you that again: for the query with the 600-square-feet ultra-luxury villa, square feet is not the variable driving the result; it's the ultra-luxury attribute. That is where the difference between global and local comes in. The variables selected for each individual query can differ, and that is what we use the feature selector for.

"One last question. One other confusion for me is in that big picture: you have the scaling and sample weights that feed into the lasso regressor."
"Can you explain that? You're taking one data point, so how much data does the lasso regressor have? Do you run the regression at query time for that data point?"

Basically, we put the data into discrete buckets, so we have a toned-down data set residing alongside the model, and then a query comes in. We try to place this query into those buckets, and the closer the bucket, the more weight it gets. Say you have 10 buckets, 1 to 10, 10 to 20, up to 100, and your number is 25: it is closest to the 20-to-30 bucket, so that bucket gets the most weight. Along with this weighted data you also get the class probabilities from the model, the probability of each class being the result in a classification case. Then we use the selector and the regressor to reduce the number of features, so that we keep only the most contributing variables. That is why it is twofold: a chi-square selector that selects the features, and a regressor that gives the weights for those features. Combining the two gives a compelling result, though we could skip one of them. Right now we get an explanation in about 100 milliseconds, so if there is demand for very quick responses, we'd have to drop one of the two. That was also one of the reasons we made it non-model-agnostic: if it were model agnostic, we would have to look up the entire training data set for each and every query, which is not good for a production system. Thank you.

Thanks, Ram.