The next talk is by Jennifer, and she's going to be talking about taking a peek under the hood: interpreting black box models. That's the title of the talk, and it's essentially about interpretability. In the previous talk, we saw Margot and Padmaja talk about a way to prevent churn and save customers, and there was a lot of explanation before they built up to the tool. But typically, when you see tools in the wild, you just see the results: the clustering values, the machine learning outputs. You don't really know how it works in the background. To that end there are books like the very nice Weapons of Math Destruction, which talks a lot about interpretability, and I think it's a very important concept in ML. I'll let Jennifer do the rest of the talk, of course. She is the lead data scientist at Sun Life's Canadian Analytics Centre of Excellence, helping the company build intelligent data solutions to better serve their clients. Her past experience in the field includes The Globe and Mail, Script, and Slice. She holds a master's in machine learning from University College London and a bachelor's in math from the University of Waterloo. She's a strong proponent of gender diversity in her field and partners with the University of Waterloo to support young women pursuing careers in STEM. Over to you, Jennifer.

Well, thanks for the introduction, and thank you everyone for coming to listen to this talk. I'd also like to say a big thank you to the PyCon India committee for organizing this event. I'm looking forward to listening to the other talks over the next couple of days. For today's talk, as Abhiram mentioned, I'd like to talk about how we can lift the hood on black box models to see how they work. Before we get into that, though, I'd like to share a short story of how I was inspired to give this talk. As I tell it, try to place yourself in my shoes and think about what you would have done in the situation.

I was asked to build a model to predict insurance fraud. As a data scientist, I thought: great, this is classification, I'll use XGBoost because I know it performs well for these types of machine learning tasks. So I went off, got my data, did some feature engineering, and trained my models. I got pretty good results in terms of performance, so I was excited to share them with my business stakeholders. They were pleased by the results, but they asked me: that's great, but how do the models work? What are the insights? What are the key drivers for committing fraud? At this point I was a bit stumped, because XGBoost is a bit of a black box, a nebulous kind of model, and I didn't really know how to interpret it.

So let's take a pause for a minute and talk about why it's important to interpret your models. First, if you can interpret your model, you can justify a prediction. In the case of predicting insurance fraud, you're able to justify to your business stakeholders why the model says a person is committing fraud. And especially for an insurance and investment company, auditors will be interested as well. The better everyone knows how the models work, the better it is for everybody.
And in that same vein, being able to interpret your models lets you instill transparency, not just for yourself, so that you can believe in your own models, but also for your business stakeholders, so they can have trust in your work and in the models themselves. And finally, the more you know about your model, the easier it is to improve it. If you know that certain features are not helpful, you can remove them and make your model a lot simpler. Going back to the philosophy of Occam's razor, or the law of parsimony: simpler is better.

Now, all of this is possible if we use linear regression, but that's not a very performant model. So as data scientists, we are often asked to choose between performance and interpretability. If we use a high-performing model like XGBoost, CatBoost, or a neural network, it performs great, but we lose on interpretability. On the other hand, if we use a simple model like linear regression, it's quite easy to interpret, but we lose on performance. There's that tradeoff that we always have to consider.

There are methods to help interpret your models. For example, if you're using a tree-based model like random forest or XGBoost, metrics like cover and gain can help with the interpretation. But here's the problem: for the model on this slide, cover says that marital status is the fourth most important feature, while gain puts it fourth from the bottom. So there's inconsistency in these metrics. What do you tell your business stakeholders? If we use cover, it's the fourth most important; if we use gain, it's not important at all. They're going to ask: so what do we trust?
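As a minimal sketch of how you can see this disagreement for yourself (the data here is a synthetic stand-in, not the fraud dataset from the talk):

```python
import xgboost as xgb
from sklearn.datasets import make_classification

# Synthetic stand-in data; in the talk this was an insurance fraud dataset.
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
model = xgb.XGBClassifier(n_estimators=50, max_depth=3).fit(X, y)
booster = model.get_booster()

# "cover" and "gain" are two of XGBoost's built-in importance metrics.
# The same feature can rank near the top under one metric and near the
# bottom under the other, which is the inconsistency described above.
print(booster.get_score(importance_type="cover"))
print(booster.get_score(importance_type="gain"))
```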
And so smarter data scientists than myself have come up with a solution to this. The idea is to build a simple model on top of the complex model. Suppose f(x) is the original complex model, the one we spent hours and days and maybe weeks building. What we want is a simple model g(x) that approximates the original model f(x), just to help with the explaining part. It's not doing the prediction; it's just doing the explanation. Here, g(x) is the sum of these phis, the φ values, and each φ represents the weight of a feature. Note that g(x) is linear, and that's key, because linear models are easy to interpret.

Now, there are three properties that we want of g(x). The first one is what I call missing in action: if a feature is not there, its importance should be zero. The analogy I like to use is from sports. If a team member, let's say LeBron James, is not playing during a game, he's benched, but his team wins, he shouldn't get any credit for that win. That's what this property is saying: if a feature contributes nothing to the performance of the model, it should get no credit.

The second property is what I call locally accurate, and I mentioned it briefly earlier: we want the function g(x) to approximate f(x) locally, at a certain point. For example, if you are familiar with Katy Perry and Zooey Deschanel: Katy Perry is a singer, Zooey Deschanel is an actress. They look similar in certain aspects, maybe their eyes and nose, but overall they are not the same person. Likewise, we want g(x), this explanatory model, to be locally similar to f(x) at certain points, but it doesn't have to match f(x) across the entire function.

The third property is key: consistency. It says that if a feature is just as useful in a different model, its importance shouldn't decrease. Put another way, if a feature becomes more important in another model, its importance should increase. Another example to illustrate this point: here in Toronto, where I am right now, we have a baseball team, the Toronto Blue Jays. One of the players was Josh Donaldson. Before he joined the Blue Jays, he was playing for the Athletics; his WAR statistic was 6.9, and his salary was half a million dollars. When he came to the Blue Jays, his WAR went up to 8.8, and so his salary also went up, commensurate with his performance. That's what this property is saying: if the contribution of a feature goes up, its importance should also go up.

So those are the three properties we're looking for in g(x). Now, how do we find g(x)? In other words, how do we calculate these φ values? Because that's essentially what we're looking for to measure the importance of each feature. There's a family of models for g(x) to help with that, all additive, in that they're linear. The first one is LIME, which you may have heard of; there's also DeepLIFT; and then SHapley Additive exPlanations, or SHAP for short, which is the one we'll be talking about today. The pro of SHAP is that it's the only one that meets all three properties I mentioned earlier.

OK, so before I go into how SHAP works, let's take a step back and talk about the inspiration for SHAP, where it came from. It was inspired by a concept from cooperative game theory called Shapley values. The idea behind Shapley values is to calculate the contribution of each team member as the team works towards a common goal. For example, if we have a group of financial advisors, all working to sell insurance, at the end of the day, who gets the most credit? How should they divvy up their profits? The answer to that is Shapley values. What Shapley values do is calculate the marginal contribution of each player over all the possible ways that player can enter the team; in cooperative game theory, a team is called a coalition.

So let's look at a toy example to see exactly how this works. Say we have Mary, Joe, and Rob. Mary by herself can pull in $4 million worth of sales. Joe can also bring in $4 million, and similarly, Rob can do the same. Now, Mary and Joe together can bring in $9 million, Mary and Rob can bring in 10, and Joe and Rob can bring in 11. But all three of them together can bring in 15 million. So we know how much each sub-team can bring in, and what we want to do is calculate the expected marginal contribution of each player.

Let's start with Mary. If Mary enters the team first, followed by Joe and then Rob, her marginal contribution in this ordering is just her contribution minus the contribution of the empty team, which is four minus zero. In the other ordering where Mary comes first but Joe and Rob are flipped, it's the same situation: she's first, so it's her contribution minus the contribution of the empty team, which is still four. Now, if Joe comes first and Mary second, Mary's marginal contribution is Mary-and-Joe minus just Joe, so nine minus four equals five. In the scenario where Rob came in first and Mary comes in second, it's Mary-and-Rob minus just Rob, so ten minus four equals six. If Mary comes in last, the calculation is the contribution of all three people minus the contribution of the two who came in first: 15 minus 11 equals four. And the last scenario is the same situation, Mary last but the other two flipped: still 15 minus 11.

So we enumerated all the possible ways Mary can enter the team and calculated her marginal contribution for each position she came in. The average of these, her expected marginal contribution, is 4.5. If we did the same thing for Joe, his would be five, and for Rob it would be 5.5. So in this scenario, Rob gets the most credit, followed by Joe, followed by Mary. That's the idea behind Shapley values: how do we assign credit to each member of a team?
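For anyone who wants to see the arithmetic end to end, here's a small self-contained sketch that brute-forces the same toy example over every ordering (the payoff numbers are the ones from the example):

```python
from itertools import permutations

# Coalition payoffs from the toy example, in millions of dollars of sales.
payoff = {
    frozenset(): 0,
    frozenset({"Mary"}): 4,
    frozenset({"Joe"}): 4,
    frozenset({"Rob"}): 4,
    frozenset({"Mary", "Joe"}): 9,
    frozenset({"Mary", "Rob"}): 10,
    frozenset({"Joe", "Rob"}): 11,
    frozenset({"Mary", "Joe", "Rob"}): 15,
}
players = ["Mary", "Joe", "Rob"]

# Average each player's marginal contribution over every order in which
# they can join the team, exactly the enumeration walked through above.
shapley = {p: 0.0 for p in players}
orderings = list(permutations(players))
for order in orderings:
    team = set()
    for player in order:
        before = payoff[frozenset(team)]
        team.add(player)
        shapley[player] += payoff[frozenset(team)] - before

shapley = {p: total / len(orderings) for p, total in shapley.items()}
print(shapley)  # {'Mary': 4.5, 'Joe': 5.0, 'Rob': 5.5}
```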
Now, how do we relate this back to model building? Instead of team members or advisors, we have features. Think of the features as the team members: they're contributing to the overall performance of the model. So we can calculate how much credit we should assign to each feature by computing its marginal contribution over all the possible ways that feature can enter the model. That's what the formula on this slide is saying; it's the classic Shapley value, φ_i = Σ_{S ⊆ F \ {i}} [ |S|! (|F| - |S| - 1)! / |F|! ] · [ f(S ∪ {i}) - f(S) ], where F is the set of all features and S ranges over the subsets that don't contain feature i. And that's how we can calculate, or estimate, the φ values for our explanatory model g(x). So that's the theory behind SHAP.

Next, I'll show how I used SHAP to explain a model that I built, and how the different features of the SHAP library can help you interpret your models and really take a peek under the hood. Some background on this model: I was asked to build a model to predict transactions, in particular redemptions. Can we anticipate which clients are about to sell off their funds or investments, and explain why? The features I used in the model included product characteristics (what's the risk of this product? what are the fees for this fund?), market indicators of how well the market is doing right now (the Dow, the TSX, the S&P 500, et cetera), features around the client (how long they have been with the investment company, how old they are, their net worth, et cetera), and historical transactions (what did they do in the past? how often do they sell? how often do they buy?).

After building the model, I applied the SHAP library to the results to get an idea of which features are most important and how they play into the predictions. This chart, which is an output of the SHAP library, orders the features by importance, so higher up means more important. I had to obfuscate a few of the features here, but they correspond to the families of features in the legend on the right-hand side. The top, most important feature is fund fees. Now, what is it about fund fees that affects the likelihood to redeem? Points to the right of the zero line mean the prediction is more likely to be a sale. So this is saying that funds with low fees are more likely to be sold off.
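As a minimal sketch of how a chart like this is produced (again with a synthetic stand-in model, not the redemption model itself):

```python
import shap
import xgboost as xgb
from sklearn.datasets import make_classification

# Synthetic stand-in for the redemption model; the real model used product,
# market, client, and transaction-history features.
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
model = xgb.XGBClassifier(n_estimators=50).fit(X, y)

explainer = shap.TreeExplainer(model)  # fast, exact algorithm for tree ensembles
shap_values = explainer.shap_values(X)

# Summary plot: features ordered top-to-bottom by importance, one dot per
# prediction, placed by its SHAP value and colored by the feature's value.
shap.summary_plot(shap_values, X)
```

The same per-row `shap_values` also feed the single-prediction explanations discussed next.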
And that makes sense, because if there are no penalties for selling a fund, then why not? Conversely, if the fund has high fees, people are less likely to sell it. So that's SHAP applied to the model at a global level. But what I really like about SHAP is that you can also apply it to a single prediction. Here we have an instance: a client who is 72 years old, has been investing with the company for three years, has this account type, and sold off a Canadian money market fund. The model predicted that they would redeem, and this is what SHAP gave as the reasons, the features that drove the prediction that this client would sell the fund. The top driver is fund fees: again, lower fees, more likely to sell off a fund. Something around the market also drove this prediction. And it goes down from there through the other features that helped drive the prediction, but those top two are the main ones.

Now, you can also take the SHAP values for each prediction and cluster them, to see whether certain groups of predictions share common drivers. Here we have a cluster of redemptions that were all driven by medium-net-worth clients, an increase in bond indices, and those clients' previous US equity redemptions. This cluster redeemed their funds because there was a decrease in the FTSE and an increase in bond indices. And these clusters over here were all sold off because the funds have no fees, or no load, as they call it in the business. So with SHAP, by using this function in the library to plot the entire data set, you can see the common drivers behind your predictions.

SHAP also has the ability to look at interactions between features, how pairs of features interplay with each other. In this chart, we have investor tenure on the x-axis, so how long they've been investing with the company. The y-axis shows the likelihood to redeem: anything above zero means a high likelihood to redeem. The colors show how high the fees are for that fund: red means high fees and blue means low fees. We can see two things here. First, clients who haven't been with the company for long are more likely to sell off funds that have high fees. This may be because they haven't been with the company for long, so there's not too much of an attachment yet to their investments. Conversely, clients who have been with the company longer are more likely to sell off funds with low fees: there's more attachment now, but if the fund has no fees, there's less of a penalty.

All right, so what should you know before using SHAP? First, it's only implemented in Python, which shouldn't be a problem because we're all here at PyCon India. It is model agnostic and can be used for any type of model: tree-based, kernel-based, neural networks. The one con with SHAP, compared to LIME, is that it is computationally intensive, because it's computing the marginal contribution over all the possible orderings, and that is pretty hefty. So a tip is to sample your data. You don't need to apply SHAP to all of your data; you can just do it on a sample to get an idea of the feature importances.
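A rough sketch of that sampling tip, continuing from the previous sketch (`explainer` and `X` are reused from it); `shap.dependence_plot` is the library call behind the tenure-versus-fees style of chart described above:

```python
import numpy as np
import shap

# Scoring a sample instead of the full dataset keeps the SHAP runtime manageable.
rng = np.random.default_rng(0)
idx = rng.choice(len(X), size=200, replace=False)
X_sample = X[idx]

shap_values_sample = explainer.shap_values(X_sample)

# Dependence plot: one feature on the x-axis, its SHAP value on the y-axis,
# colored by an automatically chosen interacting feature.
shap.dependence_plot(0, shap_values_sample, X_sample)

# Pairwise interaction values are costlier still, so definitely sample first.
interaction_values = explainer.shap_interaction_values(X_sample)
```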
And especially when you're calculating those interaction values, it will get a bit intensive. So I recommend taking a look at the SHAP GitHub repo; it has a bunch of notebooks for different types of models and how you can apply SHAP to your own models. I also highly recommend reading the paper by Scott Lundberg, the author of SHAP and of the Python library, to get a deeper sense of how SHAP works.

The main takeaways for you to take home: we no longer have to compromise between interpretability and performance; we can have both. And we can do that by building a simple model on top of our more complex black box model. For example, we can use LIME, or, my preference, SHAP, because it is model agnostic and it meets all three of the properties we talked about, especially the consistency property. So that's it for taking a peek under the hood of black box models. I hope you enjoyed this talk, and thank you for having me today. I'll take any questions now.

Thank you, Jennifer. You're perfectly on time as well, and there are a few questions. The first question is: can we apply SHAP to deep learning models, something like a text classifier or an image classification model, to see which words in the text contributed to a prediction, or which part of the image contributed to a prediction?

Yes, you can. In the SHAP GitHub repo, Scott Lundberg has provided a few notebooks showing how to do that for a deep learning model. I think he has examples for text and for images as well, so it's a great resource.

Sure. And the second question is: can this be applied to NLP models built using deep learning? I suppose it's somewhat relevant to what you just answered.

Yep, it can be applied to deep learning models as well.

And the next question is: can SHAP be applied to clustering algorithms where we're trying to find anomalies in data sets? Since there's no target there, how will we be able to interpret the results?

I haven't tried applying SHAP to unsupervised models, so I'm not sure.

I actually have an input: you can check the cluster composition, and that will tell you exactly what is in which cluster. So there is some amount of explainability there, but it still doesn't tell you how it came to those results; it depends on the algorithm, I suppose. If it's something as basic as k-means, it will tell you that a point is closest to the centroid by Euclidean distance. But if you go to hierarchical clustering, again, I think it depends on the data, and the interpretability has to be built in from the ground up. Do you see that as a reasonable explanation, Jennifer?

Yeah, I know SHAP can be applied to k-NN, but k-means is unsupervised. When you're applying the library, it uses the model as an input and the training data as an input as well, which requires having the label or the target. So for SHAP and unsupervised models, I'm not too sure of the answer, and I don't want to give a definitive answer and mislead everybody.

Fair enough, I think we'll look it up and we'll get back to this. Akshay asks... no, that one's just been covered. Shashita asks: how is this different from other tools like LIME? Is this recommended for any specific scenarios?
Yes, so as I mentioned, SHAP is the only one that meets all three properties: missing in action, locally accurate, and consistency. LIME doesn't meet all three, so with SHAP you get the benefit of the consistency part. The con with SHAP, as I mentioned, is that it is more computationally intensive, so LIME is a lot faster if you're in a bit of a rush. I would say if your data is big, use LIME; but I found a way to work around that with SHAP: sample your data, then apply SHAP, and you can still get a good approximation.

And I think we have one last question. Nazir asks: what are the other options apart from SHAP or LIME? I guess you had a list in your fourth or fifth slide.

Yeah, there's LIME, DeepLIFT, SHAP. I know there have been some other methods since SHAP came out; there's a library called ELI5, another Python library to help interpret models. The space around explainable AI has exploded in the past couple of years, so there are definitely other methods out there.

SHAP and LIME are the more mature ones? Yeah, I recently came across a whole Twitter storm of people talking about how models aren't transparent enough, and results are determining things that don't make sense, but people with data have a lot of power and what they say goes. So how do you know what's going on?

Exactly. Especially if you work in an industry where transparency is important, because if you're being audited and held accountable, it helps to understand how your models work.

Exactly. And I think a lot of the details get lost when the end result is just a lot of shiny dashboards; people get taken by them, but they don't look at how those results were produced. Thank you, this was a great talk, and I hope the slides will be made available, either on your website or to us.

Yeah, I will send them to you and I'll also put them on my website.

Excellent. And is it a safe assumption that you'll be available on the Zulip channel for a bit, in case people have questions?

Yes, I'll be on it for a little bit, and then I have to go back to work. It's still 9:30 a.m. here, the beginning of Friday.

It's a national holiday over here, so I can't assume the same for you. So, fun. Okay, again, thank you so much, and thanks to the audience for sticking around. It's almost the end of the session, but the level of interest hasn't wavered, and that's always great to see. And thanks to my co-organizer and co-volunteer here, Nitin Tom, who's been putting up the questions on the banners; that makes it easier for people to read them as Jennifer answers them. Apurva, yes, the slide deck will be available. What other question have I missed? Gayatri has one last question, sorry, Jennifer: would this be similar to how Grad-CAM works for the image classification case?

Similar to Grad-CAM? I'm not sure what that is either.

Yeah, I can't say I'm aware of what Grad-CAM is, so I don't want to mislead and give a false answer there. Sorry about that.

Gayatri, if you can quickly type up what it is in the next 10 or 15 seconds, or maybe we can come back to it on an offline basis.

Yeah, maybe take it offline.

Sure. All right, then. Thank you, Jennifer. I'm going to wrap up from this stage. And what's next? There's a last session on this stage, but we still have a keynote at 7 p.m., which is in two minutes, by Nayum Siddhar at the Bangalore stage. And then at 8 p.m. IST, one hour from now, we have the PyCon India closing address for today.
So please don't miss those; it's going to be a lot of fun. If you can make it, please do. I'll be available on the Zulip chat on the Chennai stage. My name is Abhiram, and you can also find me on Twitter at abhikanthdraw.com. If you have any questions, if you have missed something, if I haven't answered something, or if I can put you in touch with any of the past speakers, please let me know. Thank you. Have a nice day.

Thank you so much. Bye.