Hi everyone. My name is Mikhail. I'm a data scientist at Scrapinghub. We develop smart web crawlers, and I want to talk about how to inspect your machine learning models. So, how many of you have used scikit-learn? Could you please raise your hands? Yeah, nice. I will be asking you questions to make sure you can understand my heavy Russian accent, okay?

So, a typical machine learning project: you collect some training data, then you extract features from it, then you feed it all to your favorite classifier, and then you put it in production, right? What's missing? There are a lot of things missing, but what's missing? Please shout. Sorry? What else is missing? Yeah, evaluation, right? Cross-validation. So there are a lot of things missing, but these are the important ones.

So, let's say we added evaluation. We computed an accuracy score, we performed cross-validation, and we got 90% accuracy. But what does that mean? Is it good quality or bad quality? Maybe good quality for this model would be 99%, and we only got 90% because we have bugs in our software. Or maybe it's actually a great result, and we can be happy, deploy to production, and go party. So there are a lot of questions, real-world questions, which can't be answered just by looking at evaluation scores. We may want to know whether our model is reliable or not; maybe it fails spectacularly on some examples, but we can't see that in our evaluation.

So, how do we solve these problems? Does anyone know how to do this? I also don't know how to do this. I don't have an answer; there is no silver bullet. But if you inspect your models, you get an additional tool which can help with that.

I came to the conference a bit early, so I had time to walk along the shore. There are a lot of restaurants, and there is pizza in every restaurant. So, who has tried pizza in Rimini? Please raise your hands. Yeah, and who hasn't tried it yet? Yeah, so there are people; the next slide is for you. I visited these restaurants and I created a formula for computing the pizza price. It's a linear regression formula, so you can see that the pizza price depends on the pizza weight, on the distance to the sea, on the ingredients, and on whether there is an old man playing the guitar, which is very important for pizza.

So, the question is: what can you see from the coefficients of this linear regression? Well, you can see, for example, that distance affects the price negatively: the farther you are from the sea, the less the pizza costs, right? Or you can see that maybe the old man playing the guitar is more important than mushrooms, right?

And, well, what are the pitfalls? Yeah, this formula is awful; I'm not saying it's a perfect formula, sorry about that. Please don't use it if you want some pizza. For example, for some inputs the predicted price gets negative. One more problem is that you can't compare coefficients directly, because distance from the sea doesn't have the same scale as, say, an old man playing the guitar. So, if you just check the coefficients, you may think that weight is important but mushrooms are not important at all, and this is not the case, because the scales are different.

So, as we can see, checking model parameters is helpful. We can get some understanding of what's going on, but we must know what we are looking at.
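To make the scale pitfall concrete, here is a minimal sketch with made-up pizza data; all the names and numbers are invented for illustration, nothing here is from the actual slides:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
weight_g = rng.uniform(300, 600, size=200)      # pizza weight, grams
distance_m = rng.uniform(10, 2000, size=200)    # distance to the sea, metres
price = 5 + 0.02 * weight_g - 0.002 * distance_m + rng.normal(0, 0.5, size=200)

X = np.column_stack([weight_g, distance_m])
print(LinearRegression().fit(X, price).coef_)
# -> roughly [0.02, -0.002]: distance looks unimportant, but only because
#    metres and grams live on very different scales

X_std = (X - X.mean(axis=0)) / X.std(axis=0)    # standardize the features
print(LinearRegression().fit(X_std, price).coef_)
# -> now the coefficient magnitudes are comparable across features
```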
And there are other pitfalls. For example, let's say we live in an alternative universe where every pizza always gets meat and mushrooms together. They always come together, or almost always. In this case, we can choose any weights for meat and mushrooms, as long as they sum to the same value. So, we may have a weight of minus a million for meat and plus a million and two for mushrooms, and the formula will give the same result, because meat and mushrooms always come together. So, which coefficients you get in the formula for correlated features depends on the training method, and you need to be aware of this. In practice it's usually not a problem, because you usually use regularization. But, for example, if you use L2 regularization, then the two coefficients will be about the same, while if you use L1 regularization, then one coefficient can go to zero and the other one will be twice as large. So, if you sort coefficients by absolute value, you need to be aware of the method used to train this linear regression.

Okay, so, some people here have used scikit-learn. And in scikit-learn, like in many other libraries, there is a way to look at the coefficients of the model, right? This is how to do it in scikit-learn. But this code isn't correct. Who knows what's not correct here? Well, it's not entirely incorrect, but it won't give you the whole formula, because there is one extra coefficient, named the intercept, which we don't see here.

So, we created a library called ELI5; the name means "Explain Like I'm 5". It started from a snippet similar to the one on the previous slide. It knows where to get these coefficients from various machine learning models. It supports five popular machine learning packages, more than 70 estimators, and it has features which allow you to explain models and their predictions for arbitrary black-box classifiers. But in the simplest case, it started from this. By the way, this table doesn't make any sense, because the scales of the features in the Boston dataset are not the same. The library is open source; you can use it, you can join us, contribute to it, and raise a lot of issues if something doesn't work. It supports scikit-learn, XGBoost, LightGBM, a number of less popular packages, and it has an implementation of the LIME algorithm, which allows you to explain black-box models.

So, let's go on; now I will give you a more real-world example of how we can use it. How many of you have followed the text processing tutorial in the scikit-learn docs? Are there people who have done this? Yeah, there are people, nice. scikit-learn has great docs and great tutorials; I learned a lot from them. And there is a tutorial on text classification: there are messages from forums, and the task is to classify them based on text features. The dataset is named 20 newsgroups, but here we are using four categories. The final model in this tutorial is TF-IDF features and an SVM classifier.

So, who knows what an SVM classifier is? Yeah, there are people. With a linear kernel, it's also a linear model, similar to our pizza formula; to check whether a class is positive or negative, you just compare the score to zero. If it's greater than zero, the answer is yes; if it's less than zero, the answer is no. And who knows what TF-IDF is? Yeah, a lot of people. There is a book by Christopher Manning, the Introduction to Information Retrieval book; it has a chapter about TF-IDF, and there is a page which shows 60 different ways to compute it, 60 different formulas. And scikit-learn's formula is not one of them. Every machine learning library uses its own formula for TF-IDF, no idea why.
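Before we get to the tutorial: the earlier point about correlated features and regularization is easy to reproduce. A sketch with invented data where meat and mushrooms are perfect duplicates (the names and numbers are mine, not from the slides):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
meat = rng.integers(0, 2, size=200).astype(float)
mushrooms = meat.copy()                   # always together: perfectly correlated
X = np.column_stack([meat, mushrooms])
y = 1.0 + 2.0 * meat                      # the true joint effect is 2

print(Ridge(alpha=1.0).fit(X, y).coef_)   # roughly [1, 1]: L2 splits the weight
print(Lasso(alpha=0.01).fit(X, y).coef_)  # roughly [2, 0]: L1 puts it all on one
                                          # feature (which one depends on the solver)
```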
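And, continuing that sketch, the whole formula of a fitted scikit-learn linear model lives in two attributes; the naive snippet on the slide only shows the first:

```python
reg = Ridge(alpha=1.0).fit(X, y)
print(reg.coef_)       # one weight per feature
print(reg.intercept_)  # the extra coefficient the naive snippet misses
```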
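Here is a minimal sketch of the tutorial's setup as I understand it, with LinearSVC standing in for the tutorial's SGD-trained SVM, plus ELI5's show_weights to look at the learned coefficients; variable names are mine:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

import eli5

categories = ['alt.atheism', 'comp.graphics',
              'sci.med', 'soc.religion.christian']
train = fetch_20newsgroups(subset='train', categories=categories)

vec = TfidfVectorizer()
clf = LinearSVC().fit(vec.fit_transform(train.data), train.target)

# Top positive and negative features for each class, intercept included:
eli5.show_weights(clf, vec=vec, top=10, target_names=train.target_names)
```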
So, this approach to classifying text is the bag-of-words approach: you have a weight for each word, and then you check whether the word is in the document and multiply this weight by the word's TF-IDF score. So, if we inspect our trained model... can at least someone see what's on the slide? No? Then I'll read it. These are the top features learned by the model for the different classes. For example, for computer graphics we have words like "graphics", "image", "software", "images", "3D", "file". If one of these words appears in a document, it's more likely to be a document about computer graphics. That makes total sense. But if you look at, for example, the atheism class, you can see that while there are words related to atheism, like "atheism", "Islamic", "Islam", "atheists", "morality", there are some words which don't make any sense: two of the three top words are "Kate" and "Matthew". So, do we have any Matthew in this room? No? So, if a document mentions a guy named Matthew, then for some reason the document is about atheism; this is what the model learned. And there are similar issues with the documents about medicine: if some guy named Pete is mentioned in a document, the document is about medicine. And the most negative word for Christianity is "NNTP". So, something is going on here, right? It doesn't feel right.

So, we can check our documents and find some of them which contain the word "Matthew". And we can see that the documents are messages, and we are using them as-is: with the "From" header, with all the email addresses, and so on. And the model found an easier way to classify the messages. Instead of figuring out how to classify them by content, it just remembered some authors: their names, parts of their email addresses. What it thinks is: "Oh, I see, this is my old friend Matthew; he only writes about atheism, so this is a document about atheism. It doesn't matter what he says; it's a document about atheism."

So... it depends on the task. Maybe this is what we wanted from the model. But maybe we wanted to classify messages by their content, by what the message is about. Of course, the model learns something about the message content, but the top features, highlighted here using the ELI5 library, are these emails and headers. So, what does that mean? It means that if we don't look at the predictions and the coefficients, we may start trying different classifiers and tuning hyperparameters, and in the end our best model will just be the model which can exploit this leaked information best. But if we do look, we may realize that there is a problem in our data.

scikit-learn provides a way to remove headers, footers, and quotes from the messages of this particular dataset. So we can do this, retrain the model, and we can see that the accuracy drops a lot: in the previous model the accuracy was more than 90 percent, and now it's 0.796. So, why does this happen? Who can answer? Yeah, so previously the model was overfitting: for example, the same Matthew can appear both in the training and the test parts of the data, so the evaluation doesn't show us the problem. But also, we removed some useful information.
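As an aside, the step above, checking the documents that mention Matthew, is easy to reproduce with ELI5's prediction explanations. A sketch reusing the hypothetical train, vec and clf names from the earlier snippet:

```python
import eli5

# Find a training message that mentions Matthew and see which words
# pushed the score up or down (per-word highlighted contributions):
doc = next(d for d in train.data if 'Matthew' in d)
eli5.show_prediction(clf, doc, vec=vec, target_names=train.target_names)
```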
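And the header stripping is a documented option of fetch_20newsgroups; a sketch of the retraining step, again with the same hypothetical names:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

strip = ('headers', 'footers', 'quotes')
train_clean = fetch_20newsgroups(subset='train', categories=categories, remove=strip)
test_clean = fetch_20newsgroups(subset='test', categories=categories, remove=strip)

vec = TfidfVectorizer()
clf = LinearSVC().fit(vec.fit_transform(train_clean.data), train_clean.target)
print(clf.score(vec.transform(test_clean.data), test_clean.target))
# noticeably lower than before, now that the leaked metadata is gone
```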
Okay, so let's try to improve the quality of this model. We can see that there are words which are not related to any particular topic, like "do you have", "you are", "don't", "for this", "what"; these are background words. It's probably hard to see on the screen, but "what are" is negative and "what you" is positive, and that doesn't make any sense. Why? Most likely, these are just background words which appear in documents about all topics, and the model simply decided to use them to learn preferences for some classes. Because our dataset is not large, these words don't appear equally across topics, so some get positive weights and some get negative weights. The common way to fight this is to remove these words, to make the task easier for the classifier. They are called stop words, and scikit-learn provides a way to remove them: you can pass stop_words='english' to TfidfVectorizer. And indeed, now these words are not highlighted, and the quality improves. So, by looking at explanations, we can try to figure out how to preprocess our data better, and we can try to find new features.

But if you look carefully, you can see that the word "don't" is not removed. Why? Why is "don't" still highlighted? We are passing stop_words='english'. Yeah, exactly: this is actually a bug in scikit-learn. The default tokenizer splits on contractions, but the stop word list doesn't account for that. So, by looking at explanations, we may find bugs. Okay, so we can add extra stop words which were missing from the stop word list, and the quality improves again. The lesson is that the pipeline worked even with bugs. And that's the challenge in machine learning: models can adjust to your bugs, but if you fix them, quality may improve.

There are other ways to process text. Instead of words, we may use character n-grams. This means we take sliding windows of n characters, here three, four, and five, and use these three-, four- and five-letter sequences as features. ELI5 knows how to visualize them: you can see that the quality is worse, and you can see that words are no longer highlighted in full; some parts of a word are now more important, some less important. The quality is not good again, so we can try the same approach, right? We can try to remove stop words. What will the quality be? Will it improve, decrease, or stay the same? Who knows? What do you think? Who thinks the quality will improve? Please raise your hands. And who thinks the quality will decrease? And who doesn't know what will happen? Yeah, and who thinks the quality will stay the same? Yeah, someone thinks the quality will stay the same, and it will indeed stay the same. Why? This happens because, as documented in scikit-learn, stop words have no effect if the analyzer is not 'word'. This is very easy to miss; I made this mistake myself. So, by looking at explanations, you can find bugs in your own code.
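The vectorizer variants discussed above, sketched side by side; the extra stop tokens added below are my guess at what the broken tokenization produces, so inspect your own vocabulary to be sure:

```python
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

# Built-in English stop word list:
vec = TfidfVectorizer(stop_words='english')

# The default tokenizer turns "don't" into "don" + "t", which the built-in
# list doesn't account for, so extend the list by hand (tokens are a guess;
# check vec.get_feature_names_out() on your own data):
vec = TfidfVectorizer(stop_words=list(ENGLISH_STOP_WORDS) + ['don', 've', 'll'])

# Character n-grams of length 3-5; note that stop_words is documented
# to have no effect unless analyzer == 'word':
vec = TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 5))
```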
So, by inspecting the model, by checking what's going on, we were able to find an issue with the data and an issue with the text preprocessing, we found a bug in scikit-learn, we prevented a bug in our own source code, we got a better understanding of the processing pipeline, and we made our evaluation scores much worse. Well, I'm not saying this to show that the scikit-learn tutorial is bad. It's very good; but I've seen these problems in every single machine learning project I've worked on. So, any additional tool which can help you check your data and debug what's going on is, I think, helpful.

So, there are two main ways, as you can see, to look inside models: you can inspect the model as a whole, looking at its weights, or you can explain concrete predictions of the model. Like here, we are checking a concrete prediction, and here we are looking at the model as a whole. Both ways are useful. It seems I spent too much time talking about pizza, so I won't have time to discuss all my slides. So far I have been talking only about linear models, but of course there are methods to inspect other models as well. For decision trees and tree ensembles, you can use feature importances, and there is a way to inspect and explain the predictions of these models; here is the link. The ELI5 library has an implementation for XGBoost, for LightGBM, and for many of the scikit-learn ensembles: gradient boosting, random forests.

There are also ways to inspect more black-box models. Probably the simplest method is called mean decrease accuracy. I first read about it in a paper by Leo Breiman; it was the paper where random forests were introduced, so it's an old method, and there was no reference for it there. If someone knows the original reference, please send me a link. The idea is simple. You train a model, then you get predictions on the test part of the data, and you want to know how each feature affects the result. You could remove a feature and train the model again, but training is slow. So, there is a workaround: you remove the feature only in the test dataset. But you can't just drop the feature from the test dataset, because the model uses it. You might replace it with random values, but you can't take arbitrary random values, because they may have a different scale and distribution than in the dataset. So, the workaround is to shuffle the values of this feature, so each example gets a random value taken from some other example. Then you run your model, without retraining, on this dataset with one feature shuffled, and check how much this affects the score. You can also do this for pairs of features. This way you can check which features are important, for any model. We don't have an implementation of this in ELI5 yet, but we should, and we will add it soon.

There is also a way to debug the predictions of black-box models. The main idea is this: we have a black-box model, and we don't know how to look inside it. We take an explainable model, maybe a linear classifier or a linear regression model, and we train it so that it approximates the black-box model. We don't train it to make correct predictions; we train it to make the same predictions as our black-box model. And then, instead of inspecting the black box, which we can't do, we inspect this white-box model. Do you think this works? Well, it's better than nothing, but I think this doesn't really work, because if you have an inspectable model which can approximate the black-box model, then why don't you just use the inspectable model in the first place?

So, there is an algorithm called LIME, and the main idea is to do the same, but not globally, not on the whole dataset: you approximate the predictions only in a small neighborhood around a single example. And this method works; it has become pretty popular recently. There are some issues with this method, though. Like, what is a neighborhood? A neighborhood means we need examples similar to a given example. We could take them from our dataset, but there won't be enough of them, so we want to generate these examples. Also, what does "similar" mean? We need a distance function between examples. And we also need to define the neighborhood: it should have some size, and we must choose that size properly.

So, to generate fake examples: for text data, we can remove some parts of the text; for image data, we can paint some parts of the image gray, or with the image mean; and for arbitrary data, we can estimate its distribution and then sample from this distribution. So, here we have a trade-off: we no longer care about the black-box model, but we do care about the data. Instead of writing code for each black-box model, we write code for each data type, or maybe even for each dataset.

And there are challenges. The white-box model should be powerful enough to explain the black box, at least in a small neighborhood. We must choose the neighborhood size properly. And if the generated examples are not diverse enough, then LIME may lie to us: it may give us an incorrect or incomplete explanation, and this error is very hard to detect. If you have chosen an incorrect neighborhood size, you can check how well your white-box model approximates the black-box model, and if the score is low, you see that something is wrong. But if your examples are not good enough, your simple model may approximate the black-box model very well on those examples, and it still doesn't mean that the explanation is correct. So, I'll skip the details.

There is a popular and high-quality implementation from the LIME authors; it's a separate package, and it has support for images and for regression tasks. And we also have an implementation of LIME in ELI5, because some details are different and because it fits our library very well: we have a lot of inspectable models, and we can use any of them with LIME without re-implementing anything. We have export to JSON, we can show explanations in a Jupyter notebook, we can export them to data frames, and we don't have to rewrite all this code for each model; LIME can use all of it. So, it's just better code reuse, and we have a unified API because of it. There are some cool features which you can read about in the documentation.
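A minimal sketch of the mean-decrease-accuracy idea described earlier, hand-rolled here since at the time it wasn't in ELI5; it assumes a NumPy feature matrix and an accuracy-scored classifier:

```python
import numpy as np
from sklearn.metrics import accuracy_score

def mean_decrease_accuracy(model, X_test, y_test, seed=0):
    """Score drop per feature when that feature's column is shuffled."""
    rng = np.random.default_rng(seed)
    base = accuracy_score(y_test, model.predict(X_test))
    drops = []
    for col in range(X_test.shape[1]):
        X_shuffled = X_test.copy()
        # Destroy this one feature while keeping its scale and distribution:
        X_shuffled[:, col] = rng.permutation(X_shuffled[:, col])
        drops.append(base - accuracy_score(y_test, model.predict(X_shuffled)))
    return drops  # bigger drop = more important feature
```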
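And roughly how ELI5's LIME implementation is used for text, as a sketch; LIME needs probabilities, so a classifier with predict_proba stands in for the SVM, and train is the hypothetical dataset object from the earlier snippets:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

from eli5.lime import TextExplainer

pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipe.fit(train.data, train.target)

te = TextExplainer(random_state=42)
te.fit(train.data[0], pipe.predict_proba)   # explain one concrete document
te.show_prediction(target_names=train.target_names)
```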
What I was explaining today are very down-to-earth, simple and basic methods. There is research going on about models which incorporate explanations in themselves; for example, in deep learning you can often use an attention mechanism. There are new ways to visualize models, especially for images, and there is a DARPA program called Explainable Artificial Intelligence going on. So, expect a lot of new research on this topic.

The conclusion is that you should probably inspect your models if you can, but you should know what you are looking at, because an explanation may lie to you. And the ELI5 library may help; or it may not help, but then you can help ELI5. So, please join us. Thank you. Questions?