Okay, thank you so much. I'm going to be talking about model visualization. For context: last year I spoke about visualizing multi-dimensional data, going from one dimension to two dimensions and all the way up to ten dimensions and more. But the other part of visualization that we all use is not only the data part; it is also the model part. So this year I was trying to get my head around this topic of model visualization: what can we actually do to go about it? That was the intention behind the talk, and that is how it is structured.

I'm going to start with a quick story, because we don't see things as they are; we see them as we are. It is the story of the blind men and the elephant, which I'm sure many of you have heard; it is pretty popular in different contexts. Six blind men are brought forward to see an elephant, and each wants to make his own observation, by hand, to find out what the elephant really is, to build a model of the elephant. The first approaches the elephant from the side, feels a very sturdy, solid part of it, and says an elephant is definitely like a wall. The second catches hold of one of the tusks, finds it smooth and really sharp, and says it definitely has to be something like a spear. The third approaches at the knees, finds them round and stout and grounded, and says an elephant is definitely like a tree. The fourth reaches up to the ears, which can move around and swing in the air, and says it is definitely like a fan. The fifth catches the trunk, and feeling it as a long, slender thing that keeps moving, says this is definitely a snake. And by the time the sixth comes along, the elephant has turned a little; the tail swings and hits him, and he says this is definitely like a rope. As the poem goes, these men "disputed loud and long, each in his own opinion exceeding stiff and strong; though each was partly in the right, and all were in the wrong."

And this is really true of a lot of the work we do. The elephant, keeping with the theme of this conference, is the data, and the way I like to put it, the data is just a clue to the answer. It is not the final picture; it is one clue, and what you are trying to do in model building is to understand that data and build a better, or simpler, representation of it. The men in this story are really building models, and all models are wrong, but some are useful. The intention of model visualization is not so much to understand the data, which is what we do in data visualization; it is to build an understanding of whether this is really a model that can potentially be used. That is the intention behind doing model visualization. So when we think about visualization, once we have the data, we move through layers of abstraction as we try to understand what is really happening in it.
I think of it as three layers of abstraction: the data abstraction, the visual abstraction, and the model abstraction. As we move down these three layers, we eventually reach a place where we understand what we are trying to explain, or in the model's case, what we are trying to explain or to predict.

To translate this into ML terms for a conference that is really about ML, there are three steps that are key. Once you have collected the data, acquired it, and cleaned it up, the three primitive steps are data transformation, visual exploration, and model building. Those three come together, and the ML pipeline looks something like this: data transformation produces tidy data; from tidy data you go into visual exploration, which is what we call data-vis; and to the right is model building. But as you can see in the diagram, these are not really connected. A good data scientist, a friend of mine, does model building largely through numerical summaries; I have never actually seen him take a model and start visualizing it to see what the answer is. And that is true for a lot of practicing data scientists: a lot of model building happens purely through numerical summaries of the model. We don't build our understanding of the model around the data using a visual tool, and the reason is that we don't have a good way of doing it.

The missing architecture is how we actually do model exploration. For this we need the two components shown in green: something that helps us visualize the model, which is the model-vis part, and something that converts our model output into the equivalent of tidy data, which is a tidy model. Once we have these two components, we can start to explore a model and build a better understanding of it, much as we do in visual exploration of data. The idea behind model exploration is really: can I use visualization to turn the implicit knowledge I have, whether it sits in the data or in my head, into explicit knowledge in the model? That is what we are trying to do, and that is the intention of this exercise.

So I'm going to talk about two basic concepts in model visualization. First: can we abstract this into an approach that applies to any possible model you might build, not something specific to a particular domain? I'll call this the model-vis approach. Second: why we need something like a tidy model concept, very similar to what we have in tidy data. Those are the two things I'll cover in the rest of the talk. So let's start. I typically take a very simple example as a way of setting up the context.
I was saying in the BOF session yesterday that I like data that can literally fit on a page, or in Excel 2010, where it was limited to some 65,000 rows. I'm really a small-data guy; there are lots of big-data people here, but I work with small and medium data. For this exercise we'll take an example, and the example is buying a car. The typical colloquial question when somebody goes to buy a car, loosely translated, is: what is the mileage for the price I'm going to pay? That is the question we will frame and use as we explore the data and build these models, and this one data set will illustrate most of the concepts that are widely used in any ML pipeline.

So where is this data from? There are many car comparison websites, and I scraped the data from one of them. As you know, 80% of the time goes into refining and cleaning, so there was an element of refining and cleaning, and then an element of filtering. There were about 800 cars, and to build a very simple data set for modelling I took only the base version of each car, ignoring the trims, so we are looking at the starting price for each car model; only the petrol cars; and only the cars priced under about 1,000K. That comes to about 42 entries, a data set small enough that I can even keep it in my head when I go out to buy a car.

This is what the data looks like: I have brands, I have models, and each row has a price and a mileage. At one end there is the Tata Nano, at about two lakhs, or 200K, and 23.9 kilometers per liter; at the other end, a Volkswagen Vento at about 785K and 16.1. That is the starting data set, and we'll see how we can build from it.

The first step, obviously, is to visualize the data. Price runs up one axis, from about 200K to 900K, and mileage runs along the other, from about 14 to 24 kilometers per liter; each circle is one data point. That is the first thing we need to do, and if I were to capture it as a principle of model visualization, step zero is: visualize the data space. I call it the data space to abstract it beyond just a graph. The next step is to build a simple model, and because we want to keep things simple, we take the simplest model there is in statistics or ML: ordinary least squares linear regression. We regress price on kilometers per liter and see what the model looks like; if you plot it, it is basically a line.
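To make steps zero and one concrete, here is a minimal sketch in Python using scikit-learn and matplotlib: visualize the data space, then plot the model's predictions in that same space. The file name and column names (price_k, kmpl) are hypothetical stand-ins for whatever the scraped data set actually uses.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical file and column names; the real data set may differ.
cars = pd.read_csv("cars.csv")  # columns: brand, model, price_k, kmpl
X = cars[["kmpl"]].values
y = cars["price_k"].values

# Step 0: visualize the data space.
plt.scatter(cars["kmpl"], cars["price_k"], facecolors="none", edgecolors="steelblue")

# Step 1: fit the simplest model and plot its predictions in the data space.
ols = LinearRegression().fit(X, y)
grid = np.linspace(X.min(), X.max(), 100).reshape(-1, 1)
plt.plot(grid, ols.predict(grid), color="orange", label="OLS")

plt.xlabel("kilometers per liter")
plt.ylabel("price (thousands)")
plt.legend()
plt.show()
```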
I say "line" loosely, because when we translate this to two, three, four, five dimensions it may literally not be a line; it will be a plane or a hyperplane, and we would have to think of different ways to visualize it. But in this case we have plotted the predictions in the data space, and that is the second thing we always need to do: visualize the predictions in the data space.

Once we have done that, the next step is to change the model parameters. If you look at ordinary least squares, it is literally a one-shot algorithm: you solve Ax = b in linear algebra and you get one answer. So how do you change the model parameters? We adapt the model, and the way we adapt it is by changing the cost function. Instead of plain OLS we use something like ridge regression, which adds L2 regularization, and that lets us do the next step of the model-vis approach: visualize with different model parameters. Now we can see how the fitted lines look as alpha goes from zero all the way up to one, where the constraint becomes so strong that we essentially get a flat line.

So we have done two steps here. The step after that is that we need to select the model parameters. To make a judgment about them we need more data for the same kind of model, and the way to get more data is to bootstrap: sampling with replacement, to get a confidence level, a better understanding of the fit. So we visualize what the bootstrap looks like; we can look at the whole model space under bootstrapping to see what the variation really is. The whitish line there is our original model, and around it we have plotted about 500 bootstrapped models to see how this space looks. That bootstrapping step lets us visualize not only the data space but also the model parameters, see the variance around them, and start to pick something. So that is the next principle: visualize with different input data sets.

Once we have done that, we need to add more models, because we are starting to see some trend lines, but what we have is probably not enough. So besides OLS and ridge regression, I add two more models: a polynomial with n = 3, and LOWESS, locally weighted scatterplot smoothing. A sketch of the parameter sweep and the bootstrap follows below.
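Here is a hedged sketch of those two steps, sweeping the ridge penalty and bootstrapping the fit; it reuses X, y, and grid from the earlier sketch, and the alpha values and resample count are illustrative.

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.utils import resample

# Visualize with different model parameters: sweep the L2 penalty.
for alpha in [0.0, 0.01, 0.1, 0.5, 1.0]:
    fit = Ridge(alpha=alpha).fit(X, y)
    plt.plot(grid, fit.predict(grid), alpha=0.7, label=f"alpha={alpha}")

# Visualize with different input data sets: bootstrap the fit 500 times.
for _ in range(500):
    Xb, yb = resample(X, y)            # sample with replacement
    fit = Ridge(alpha=0.1).fit(Xb, yb)
    plt.plot(grid, fit.predict(grid), color="grey", alpha=0.02)

plt.legend()
plt.show()
```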
So I add those two models and look at the picture again, and now I can clearly see that the lines that were missing some of the pattern at the top are starting to make sense: OLS and ridge on the left, polynomial-3 and LOWESS on the right. What I am doing in this step is expanding the model space: I want to understand, if I use all the plausible models, what that space looks like. And this can be extended in the same spirit as bootstrapping, by averaging across the models, which gives us an ensemble model built out of that space. So we now have ensembles, and we are visualizing the entire model space in this simple example. That is the next principle: visualize the entire model space.

But typically you would not stop at a simple regression with two variables; you would add more features, because you are searching for a better explanation, and right now we can clearly see that the price-mileage relationship is not fully explained even with this model space. So we need to extend not only the model space but also the feature space. We add another variable, the type of the car, and in this case there are only two types, hatchback and sedan. Once we add it and plot the space again, we can see there is a roughly linear relation at the bottom, but something different going on at the top: the filled circles are sedans, and the empty ones at the bottom are hatchbacks. Seeing that relationship, we can add the feature to the model and plot the model space again, and it fits a little better, but probably not well enough.

That suggests the other move: instead of a single model, expand the model space by building a distinct model per group. If we fit a LOWESS curve to the hatchbacks and another to the sedans, with the orange and green curves as the two sets of predictions, we can see the predictions match much more closely. And the mileage question we are asking really applies mostly to the hatchbacks people buy; as we move up to the sedans, mileage matters much less. So how do we go beyond this? Up to now we talked about the model space, and now we are visualizing many models together; we have moved from a single model to many models, as sketched below. And adding features can get complex really fast.
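A minimal sketch of that one-model-per-group idea, using the LOWESS smoother from statsmodels and a hypothetical type column with hatchback/sedan values, reusing the cars data frame from before:

```python
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

# Fit a separate LOWESS curve per car type and plot each in the data space.
for car_type, group in cars.groupby("type"):
    plt.scatter(group["kmpl"], group["price_k"], label=f"{car_type} data")
    smoothed = lowess(group["price_k"], group["kmpl"])  # sorted (x, fitted) pairs
    plt.plot(smoothed[:, 0], smoothed[:, 1], label=f"{car_type} LOWESS")

plt.xlabel("kilometers per liter")
plt.ylabel("price (thousands)")
plt.legend()
plt.show()
```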
If you add just one more feature into the mix, horsepower, we now have three features. We can use the size of each circle to represent horsepower: the Nano at the bottom is the small circle, and the ones at the top are roughly equally big, with a few slightly smaller. If you build the regression model on top of this, it starts to map much more nicely. But at that point the goal shifts from mapping the data to understanding the model, and that is hard, so we move, obviously, to visualizing model errors. Visualizing errors is pretty simple: for each data point we take the difference between the prediction and the actual value, and plot that price error. Moving from features to errors also helps us escape the curse of dimensionality: plotting features along the bottom and errors on the left is probably not good enough, so we drop KMPL as an axis entirely and test the robustness of the model by plotting prediction versus error. You then add validation, say six-fold cross-validation, and plot the price predictions against the price errors; a sketch of this follows below. Our final model in this case is OLS regression with six-fold cross-validation, and it matches the data points quite well. So the last step is visualizing the errors in the model fitting.

If I were to capture the steps we have walked through, the model-vis approach is: visualize the data space; visualize the predictions in the data space; visualize with different model parameters; visualize with different input data sets; visualize the entire model space; visualize the many models you build; and visualize the errors in model fitting. This sequence, steps 0 through 6, can be applied to pretty much any problem and any data set.

If you map model-vis onto an ML approach, each step corresponds to something you already do in your pipeline. Visualizing the data space is data visualization, and you need to do it well. Mapping predictions into the data space means treating the prediction as just another variable in your data and applying the same data-visualization principles to it. Visualizing with different model parameters is model tuning. Visualizing with different input data sets is bootstrapping. Adding those four models makes an ensemble, which is the basis of all ensemble techniques: looking at the entire model space to build a better model.
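The error step could look like this with scikit-learn: six-fold cross-validated predictions plotted against their errors. The fold count matches the talk; the rest is an illustrative sketch reusing X and y from before.

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

# Out-of-fold predictions from six-fold cross-validation.
pred = cross_val_predict(LinearRegression(), X, y, cv=6)
errors = y - pred

# Prediction vs. error; a well-behaved model shows no structure here.
plt.scatter(pred, errors)
plt.axhline(0, color="grey", linestyle="--")
plt.xlabel("predicted price (thousands)")
plt.ylabel("prediction error")
plt.show()
```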
And the many-models case is really different: going from hatchback to sedan is just two models, but if you had ten different categories, say onion prices across different markets, you would build a distinct model for each one. So "N models" is about how you visualize many models together. And the last part, moving to errors, is really about using validation for exactly that purpose. These principles, visualizing the data space, the predictions in the data space, different model parameters, different input data sets, the entire model space, the many models together, and the errors in model fitting, come together as a framework for how you approach model visualization.

The same approach we just walked through for regression applies to other problems like classification. Visualizing the prediction space in classification, for example, means plotting decision boundaries: you visualize the whole mesh of data points, whether in two or three dimensions, and then reduce it, just as you would in classification generally, by applying dimensionality reduction to the prediction space you have generated. The principles of data visualization carry over once you treat your predictions as just another input in your data space.

The challenge here, even in this simple example of 42 data points, is model explosion. We have already created seven base models: OLS regression with one, two, and three features; ridge regression; a polynomial; LOWESS overall; and LOWESS by type. On top of those seven you add bootstrapping, ensembling, and cross-validation. If you think of the predictions as something you need to capture, it becomes genuinely hard to keep track. How do you keep track of predictions and errors? How do you keep track of model output parameters? That is the real challenge in model-vis, and you need a framework or an approach to think about it. The tidy model approach addresses this problem: it maps your model outputs back onto the data set, augmenting the predictions and the errors onto it, and puts the output parameters into a data frame. Then you can visualize the result very much the way you would visualize tidy data. The tidy model approach really complements what you do with tidy data.
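As a sketch of what that augment step might look like in pandas: the helper name and the .fitted/.resid columns are loosely modelled on the broom package's augment in R, and everything here is hypothetical rather than an existing API.

```python
import pandas as pd

def augment(model, data, X_cols, y_col):
    """Attach a fitted model's predictions and errors to the data as new
    columns, so model output can be explored like any tidy data frame."""
    out = data.copy()
    out[".fitted"] = model.predict(out[X_cols].values)
    out[".resid"] = out[y_col] - out[".fitted"]
    return out

# Usage with the earlier OLS fit:
#   tidy = augment(ols, cars, ["kmpl"], "price_k")
# 'tidy' is an ordinary data frame, ready for the same plotting pipeline
# that was used on the raw data.
```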
In terms of tooling: I started in Python, I created the workbook for this talk in Python, and even for a simple model like this, just keeping track of everything really starts to explode once you do all of the above. There is a great talk by David Robinson at the useR conference this year, 2016, on the broom package, which develops exactly this idea of tidying model output. A simple example of how it translates, using just dplyr and broom, is to take a cars data set, bootstrap it 100 times, and augment a spline model back onto it. Once you do that, you again have a new data frame, and you can group it and plot the output the same way you would plot any simple data frame. That move, converting your model output into another data frame via augmentation, is the basis for getting a handle on model visualization and on the model outputs you generate.

The second talk I would recommend is Hadley Wickham's on managing many models: how you can keep a data frame inside a data frame, creating list-columns, and then use the same dplyr techniques to process it as a pipeline, pull the outputs out, and plot them. Treating model visualization as a pipeline you can bring into your ML pipeline is really important.

So, to summarize: model-vis has very similar challenges to data-vis, and it is again more of an art than a science. I took a very trivial example to walk through the different steps, but across different data sets the challenges are very similar to data-vis: how do you translate it, how do you plot the errors of the different models, how do you plot the errors in different ways, how do you do dimensionality reduction? All the challenges you face in data-vis apply here too. The point is that it is essential in your ML model pipeline, whether you are building an explanatory model or a purely predictive one. We are writing a Python package, very similar to what broom does in R, to make it easier to manage models and apply these techniques, and we are extending it with tours and projections on top of that for classification. So there is scope for easier tooling and better support to make all this far easier.

The cars data set is available on GitHub if anybody wants it: N is about 800-plus cars with about 63 features, so even though this was a trivial, simplified example, you can do much more feature-based analysis on that data set. All the code for this talk is in the repo, and we will soon convert it into a repo for the package we are writing.
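One more hedged sketch before closing: the many-models idea has a pandas analogue, holding one fitted model per group inside a data frame, roughly like R's list-columns. The column names are hypothetical, and the augment helper is the one sketched earlier.

```python
from sklearn.linear_model import LinearRegression

# One fitted model per car type, held in a data frame column
# (the pandas analogue of list-columns in R).
def fit_group(group):
    return LinearRegression().fit(group[["kmpl"]], group["price_k"])

models = (cars.groupby("type")
              .apply(fit_group)
              .rename("model")
              .reset_index())

# Pull each model back out, augment it onto its group, and plot with
# the same pipeline used for the raw data.
for _, row in models.iterrows():
    group = cars[cars["type"] == row["type"]]
    tidy = augment(row["model"], group, ["kmpl"], "price_k")
```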
Thank you very much. Are there any questions?

Q: Thanks, it was a wonderful talk. Going through all these different ways of visualizing models looks very nice, but the one thing I couldn't understand is what you do with multiple dimensions. Regression is probably okay because you can just plot points, but if you are doing classification or clustering it actually deceives you. Take three-dimensional data and do a simple clustering: if you visualize it as a scatter plot in two dimensions, it doesn't give you the complete information. So how do you do model visualization when the visualization doesn't give the complete picture?

A: I could have shown a classification example of how you would approach the model-vis. The idea is that you apply the same visual dimensionality-reduction techniques you would apply to data back to the prediction space. There are two broad approaches. One: you generate the entire prediction space, map it, and work out the boundaries, either by creating a mesh over the whole space and reading the boundaries off it, or by sampling close to the classification boundary. Once you have that, you apply the same techniques you use in data-vis to this model-vis, so you again think of PCA, MDS, t-SNE, all the dimensionality-reduction techniques, to bring it down to at most two or three dimensions, and that is how you look at it. It is the same challenge you have in data-vis.

Q: I have tried that on the simple iris data set, three dimensions, doing clustering or classification, and I found that the visualization doesn't really show the complete picture of the clusters. You have to look at scatter plots across multiple dimensions to actually make sense of it. That is what I felt.

A: Right. Let me pull up something that makes the case. The whole approach boils down to visualizing the whole space: you evaluate the boundaries and create something like this, a simple logistic regression on the wine data set, where you can look at creating different meshes. The only two techniques there are, as far as I know, are dimensionality reduction, or projections and tours. With tours, whether a guided tour or projection pursuit, you find a path through which to look at the interesting parts of the space and see whether there is something there. Those are the only two techniques I know of for this kind of problem. I would really recommend looking at projections and tours as one way of doing it; the area is under-explored, and support in most tooling is poor, but it is really powerful. GGobi in R used to do it really nicely a long time back; now there is a very small package in R, and nothing in Python.
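The mesh idea from that answer can be sketched quickly: fit a classifier on two features, predict over a grid, and draw the decision regions. Here is a hedged illustration with logistic regression on scikit-learn's bundled wine data set; the choice of the first two features is purely for plotting.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression

# Two features only, so the decision boundary can be drawn directly.
wine = load_wine()
X2, y2 = wine.data[:, :2], wine.target
clf = LogisticRegression(max_iter=1000).fit(X2, y2)

# Build a mesh over the data space and classify every point on it.
xx, yy = np.meshgrid(
    np.linspace(X2[:, 0].min(), X2[:, 0].max(), 200),
    np.linspace(X2[:, 1].min(), X2[:, 1].max(), 200),
)
zz = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

# Decision regions as filled contours, with the data points on top.
plt.contourf(xx, yy, zz, alpha=0.3)
plt.scatter(X2[:, 0], X2[:, 1], c=y2, edgecolors="k")
plt.xlabel(wine.feature_names[0])
plt.ylabel(wine.feature_names[1])
plt.show()
```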
A: The other thing I would look at, if you want another example, is Christopher Olah's work: he has done some fantastic visualizations of the MNIST data using dimensionality reduction to surface the patterns. He writes extensively on using visualization, really interactive visualization, to answer these kinds of questions, and that would be another way to do it, but the techniques used there are basically PCA, LDA, and t-SNE. So at some point you fall back to those; dimensionality is something you cannot really do much about.

Q: Thanks. Every single one of those models made the assumption that kilometers per liter determines price. That is plausible but not necessarily true; with general data it is often unclear, and news reports of such determinations are often ridiculous. If you drew the line that best predicted kilometers per liter on the basis of price, you would draw a different line, even with least squares, because the errors you would be minimizing would be sideways rather than vertical.

A: That is correct, yes.

Q: And which points would be penalized most would change, though the line would still run in the same general direction. So this immediately doubles your number of possible models, unless you have a solid reason for believing that kilometers per liter determines price and not the other way around.

A: Correct. So now you are really testing correlation versus causation to some extent, or flipping the model the other way to try it.

Q: And then, for the least-squares fit with...

A: I understand. I think that is fair, and it could be another part you could add. The idea here was not so much the price-to-kilometers relationship itself; it was just used as a medium to talk about the...