Hi, my name is Julia Silge and I'm a data scientist and software engineer at RStudio. I want to thank the organizers so much for having me here today to talk about data visualization as it's used in real-world machine learning.

Data visualization is this hugely important tool for all kinds of data practitioners, not just those who are building machine learning systems. A lot of what we're going to talk about today is true because data visualization is so effective at surfacing information to human beings through this visual channel that we have. When it comes to data, these visual ways of perceiving information shape how we think about the work we're doing, how we understand the problems we're solving and the models we're building, and how we make decisions. I especially want to focus on data visualization as a support or aid for decision making, because building models or machine learning systems is a complicated, involved task. And I want to explicitly frame the plots and graphs that we make the way Tamara Munzner does here: their purpose is so that we as machine learning practitioners are more effective. When it comes to models and machine learning, "more effective" can have a couple of different meanings. It can mean our models are more accurate. It can mean they're more practically appropriate. It can mean they're more fair. We'll talk about these meanings and how we use data visualization to work toward them.

I work on an open source framework in R for modeling and machine learning called tidymodels, and a lot of the examples I'll be showing today use tidymodels code. I'm glad to get to show what I work on and build, but a lot of what we'll talk about and walk through today isn't specific to tidymodels, or even to R; really, we're focusing on data visualization practices for people building models in the real world.

So let's start by asking this question: when do data practitioners build data visualizations? Asking this question requires us to think about the process of model building, to build a mental model of model building, if you will. I'd like to put up this schematic of what might happen during a typical model building process, from when a practitioner first sees some new data to having a working model.

The process starts with exploratory data analysis, EDA. This is a systematic, iterative process of exploring data to investigate it. Next comes feature engineering, the task of transforming our original variables into more useful features for our model. Then we have model fitting and tuning. Notice that I've represented this with lots of little skinny bars, because when we're following best practices, we typically do this using resampled data sets. And we often don't know a priori what kind of model will work best, so we don't try just one model; we try multiple different kinds of models. Next comes model evaluation. This is when we measure how well a model performs using some metric that is appropriate to the specific kind of model and problem we're working on.
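As a concrete illustration, here is a minimal tidymodels sketch of this fitting-and-tuning-on-resamples pattern. This is not code from the slides; the data set `my_data` and the formula `outcome ~ .` are placeholders of my own.

```r
library(tidymodels)

set.seed(123)
# Resampled data sets: ten cross-validation folds of a placeholder data set
folds <- vfold_cv(my_data, v = 10)

# Try more than one kind of model, since we don't know a priori what works best
lasso_spec <- linear_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")

rf_spec <- rand_forest(trees = 1000) %>%
  set_engine("ranger") %>%
  set_mode("regression")

# Tune the lasso penalty and fit the random forest on the same resamples
lasso_res <- tune_grid(lasso_spec, outcome ~ ., resamples = folds)
rf_res <- fit_resamples(rf_spec, outcome ~ ., resamples = folds)

# Model evaluation: metrics appropriate to a regression problem
collect_metrics(lasso_res)
collect_metrics(rf_res)
```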
Are we done? Probably not, because this whole process is usually iterative. We take what we learned during that first pass, go back, and do a little more exploratory data analysis. We spend some more time on feature engineering. We go back to maybe the two models that looked most promising, tune and fit them again using the resampled data, and then we evaluate again. Also notice that this whole process ends with some goal or output, which might be communicating the results of the model or might be deploying the model to some kind of production system. I've represented this here with another pretty thin bar, but picture it really as transitioning to a whole other set of tasks. In this talk, we're going to keep the scope on this iterative process of developing the model itself.

Machine learning practitioners can of course make plots at any point during this process, but I want to make the argument that there are two phases of this modeling process during which practitioners make data visualizations the most. The first is exploratory data analysis. I said before that EDA is this iterative phase of investigation, question generation and refinement, and exploring. Some of you may be saying, "Wait a minute, is this really part of machine learning?" but we're going to walk through examples and an argument where we see that the investigations and the understanding gained during EDA have a huge impact on models. The second phase where people make a lot of plots is model evaluation. This is the part of the modeling process where a practitioner has trained multiple models and is assessing, measuring, and comparing how well they performed.

The reason is that these two phases of the machine learning process are the points of the most important human decisions, decisions that can be difficult to validate. You might hear a phrase like "more art than science," and that is because these decisions are full of trade-offs. Visualizations are used as an aid or support to carry out these decisions and tasks more effectively, just as we talked about at the beginning with that quote from Tamara Munzner's data visualization book.

So let's dig deeper into this and start with exploratory data analysis, and let's ask this question: why are these plots built? At a very high level, these plots are built to understand the data, the input to your models. When a practitioner starts out on a modeling project, this is one of the first substantive tasks they face. The reason we as practitioners need this understanding of the data is that there's about to be a whole bunch of decisions to make. These data visualizations let practitioners ask and then refine specific questions related to our modeling goals, like what kind of model is appropriate, or, leading into the feature engineering phase we talked about before, what kind of pre-processing do I need to do? All of these kinds of questions are ones that machine learning practitioners ask, refine, and answer by making plots during the EDA process.

Let's look at a couple of examples. Some of these plots take the form of very familiar, classic statistical graphics, like this scatter plot. This plot uses a data set called Palmer Penguins, penguins observed in Antarctica near Palmer Station. It maps each observation to a point on the plot, with aesthetics of the plot like position (x and y), size, and color representing things that have been measured about these penguins.
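Here is a sketch of ggplot2 code for the kind of scatter plot described; the specific variable mappings are one plausible choice of mine, not necessarily the exact plot from the slide.

```r
library(ggplot2)
library(palmerpenguins)

# Map position, color, and size to measurements about each penguin
ggplot(penguins, aes(x = flipper_length_mm, y = bill_length_mm,
                     color = species, size = body_mass_g)) +
  geom_point(alpha = 0.7) +
  labs(x = "Flipper length (mm)", y = "Bill length (mm)",
       color = "Species", size = "Body mass (g)")
```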
Exploratory graphics like these are important for modeling practitioners because they help us understand the type and amount of data we have available, and because they visually highlight relationships in the data that we can expect to use in feature engineering or models. Since machine learning practitioners use visualization during EDA to make so many different kinds of plots, it's important that the tools we use for visualization facilitate fluent iteration through lots of different kinds of graphs. I have personal experience with decades of various scientific plotting tools, and I think that learning about and internalizing the concept of the grammar of graphics, and adopting a system that applies that grammar, is a huge step forward in efficiency for this task of asking, refining, and answering questions to get ready to train models. ggplot2 is probably the best-known system that implements this grammar of graphics, and almost all the plots I'm going to show you in this talk are plots I've made myself, over the course of training models, using ggplot2.

Investigations during EDA inform decisions about what kinds of models to train, and they also inform decisions made in feature engineering. In tidymodels, we capture data pre-processing and feature engineering in the concept of a pre-processing recipe that has steps. You choose your variables, like your ingredients; you define steps and prepare them using training data; and then you can apply them to other data sets, like testing data or new data. The specific feature engineering steps are typically chosen because of what we learn during EDA.

Let's look at another example data set. This is a data set of houses in Ames, Iowa. We have a lot of information about them, like how big they are, where they're located, and what their last sale price was. This plot shows the sale price on the y-axis and the latitude on the x-axis. There's a big hole there where Iowa State University is, and there's some unique and complex structure in price as you move across the city. One option we have for modeling that kind of structure is to use splines. We can use splines to build new features for modeling, but we have to decide how many spline terms to use. Two doesn't look like it captures the real structure very well. Five looks better. I would believe ten; ten looks pretty good. Twenty seems okay, but I'm starting to feel a little skeptical. At fifty, I no longer believe this. And at one hundred, this is just silly. What's great about data visualization during EDA is that we as machine learning practitioners can see and absorb this and use it to make decisions. These decisions may then end up looking like this if you use tidymodels: a feature engineering recipe defined with steps that transform and prepare our data to get it ready for modeling, whether that's dealing with factor levels, creating indicator variables, or setting those ten spline terms.
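As a hedged sketch, a recipe along those lines might look like the following; the exact predictor set here is my guess, but the steps mirror what was just described, pooling rare factor levels, making indicator variables, and setting ten spline terms for latitude.

```r
library(tidymodels)

data(ames, package = "modeldata")

ames_rec <- recipe(Sale_Price ~ Latitude + Longitude + Neighborhood +
                     Gr_Liv_Area + Year_Built + Bldg_Type,
                   data = ames) %>%
  step_other(Neighborhood, threshold = 0.01) %>%  # deal with rare factor levels
  step_dummy(all_nominal_predictors()) %>%        # create indicator variables
  step_ns(Latitude, deg_free = 10)                # those ten spline terms

# Estimate the steps from training data; the prepared recipe can then be
# applied to testing data or new data
prep(ames_rec)
```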
So whether it's for engineering new features, choosing an appropriate model, or understanding data quality, data visualization during EDA has a huge impact on what modeling practitioners do later on; in some ways, it's the foundation on which the rest of the modeling process is built. It is often possible to estimate the best answer or value or option for some of these choices using empirical methods. But it's nearly impossible to remove the element of human judgment from these choices entirely, or at least it's a bad idea to try. Machine learning algorithms are astounding; they're incredibly capable of learning patterns from large amounts of data. Vigorous adoption of EDA as part of the machine learning process is one safeguard against ending up in a situation where you have to ask yourself whether it was a good idea to just deep-fry your data.

Now let's move on and talk about model evaluation, the phase where we have trained and possibly tuned models, so we have predictions to compare to true values and can assess how our models are performing. Let's again ask this question: why are these plots built? At a very high level, these are built to understand the models, the output of our process. Here we're assessing model performance. This is the point at which we have to reckon with questions about success and failure and ask, did this model work? The process of model evaluation also allows us to interrogate subgroups, and this is typically where discussions of model fairness come in. A lot of the current work on machine learning bias and fairness focuses on how models perform differently for different groups, perhaps for different protected classes that may not have been included directly in the model but where we can still see differences because certain predictors are proxies. Model evaluation is also concerned with understanding why a model is making the predictions it does. For all of these questions, practitioners heavily use data visualization, for a lot of the same reasons it's used during EDA, which are largely the same reasons anyone ever uses visualization: this visual channel is such an effective way for us to absorb a lot of information quickly.

There are some pretty unique things to note about plots used for model evaluation, so let's walk through that a little. One is that many of these plots involve graphical norms or idioms. This plot shows how two model evaluation metrics, RMSE and R-squared, change with the regularization penalty, that is, with how regularized a lasso regression model is. Someone who is familiar with these kinds of models can look at this plot and quickly recognize whether or not it worked, whether it needed a lot of regularization or a little, and, looking at the R-squared, see how well (or, in this case, how not well) this model fit this data. It's like a shorthand or an idiom that packs a lot of information for a familiar audience. This next plot is similar in that sense. This is a receiver operating characteristic curve, an ROC curve; well, it's ten of them, actually, for a model trained on that Palmer Penguins data. It's a visual representation of how well a binary classification model is doing: the more the curves are up in the left-hand corner, the better the model is at distinguishing between the two classes. The axes show the true positive rate and the false positive rate, and ideally we want the model to do a great job on both of those, at all thresholds. So this is another example of a model evaluation visual idiom that is familiar to modeling practitioners and can be used to compare models, to assess a model, and so forth. These kinds of visual idioms are a great fit for automatic plotting methods.
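For instance, here is a sketch of the shortcut plotting methods built into tidymodels for these two idioms; the object names (`lasso_res` from the earlier tuning sketch, and `penguin_res` with a `sex` outcome) are placeholders, and the ROC example assumes predictions were saved during resampling.

```r
library(tidymodels)

# Tuning results across the regularization penalty: autoplot() draws the
# standard metrics-versus-penalty idiom (for example, RMSE and R-squared)
autoplot(lasso_res)

# One ROC curve per resample, assuming `penguin_res` was fit with
# control_resamples(save_pred = TRUE) and has a probability column
# `.pred_female` for a binary `sex` outcome
collect_predictions(penguin_res) %>%
  group_by(id) %>%
  roc_curve(truth = sex, .pred_female) %>%
  autoplot()
```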
When we talked about EDA, I said you need a really flexible plotting framework, one where you learn a grammar and can then make any kind of plot you want in a flexible way. But a lot of model evaluation plots are the same every single time you make them, and you can be better served by using shortcuts built into modeling frameworks, like those built by my coworkers and me, or the equivalent in whatever modeling framework you use, instead of copying and pasting the same code for an ROC curve every single time.

So far I've focused on model evaluation plots that show us as practitioners how models are performing in terms of metrics like RMSE or accuracy, but model evaluation activities can also encompass model explanations, and visualization is a key tool for explaining models as well. Model explanations fall into two main categories. The first is global explanations, where we evaluate which features are important in a model's predictions overall, aggregated for the whole model. This plot is from that San Francisco trees example; the length of the bars corresponds to how important each feature is, in a random forest model, for predicting whether a tree is maintained by the Department of Public Works or not, globally, for the whole model. Having a private caretaker is the most important, and then we see spatial information, then the year the tree was planted, and so forth.

The other main kind of model explanation is feature importance at the individual observation level. One framework I like using for this is DALEX; there's support for tidymodels in DALEXtra, but you can use it with lots of different kinds of models. Here we create an explainer for a random forest predicting price for homes back in Ames, Iowa, and we're going to look at what explains the price of one specific duplex that I pulled out: not the model overall, just this duplex. There are multiple ways to do local feature importance; this particular plot shows an example of Shapley additive explanations for the price of this duplex. We can see it has a price lower than the mean, and the things that contribute to that the most are how old it is, how small it is, that it's a duplex and not a single-family home, and so forth. We can also build up from individual observation-level explanations to global explanations, and even see and highlight the model's behavior for subgroups, like comparing a duplex to a single-family home. For both global and observation-level model explanations, these plots are used by the creators of these machine learning systems to evaluate and understand them.
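Here is a hedged sketch of what that explainer code can look like with DALEXtra; `rf_fit` (a fitted tidymodels workflow), `ames_train`, and `duplex` (a single row of data) are placeholders standing in for the objects described in the talk.

```r
library(dplyr)
library(DALEXtra)  # also attaches DALEX

# Wrap a fitted tidymodels workflow in an explainer
explainer_rf <- explain_tidymodels(
  rf_fit,
  data = select(ames_train, -Sale_Price),
  y = ames_train$Sale_Price,
  label = "random forest"
)

# Observation-level explanation: Shapley additive explanations for one duplex
shap_duplex <- predict_parts(
  explainer = explainer_rf,
  new_observation = duplex,
  type = "shap",
  B = 20
)
plot(shap_duplex)

# Global explanation: permutation-based feature importance for the whole model
global_importance <- model_parts(explainer_rf,
                                 loss_function = loss_root_mean_square)
plot(global_importance)
```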
At this stage of the modeling process, we're again at a point of important human decisions, and model explanations, together with model performance, can be used to make choices. Throughout this whole discussion of data visualization and the machine learning process, we've focused on plots made by system designers and operators for themselves, as a tool to make better modeling choices: plots made by practitioners for practitioners. Often data visualization is centered as a communication tool, and it absolutely, one hundred percent is; a lot of the plots I've shown here today could be used to tell a stakeholder about a model or how it works. But we can also center some of data visualization's other functions, like its ability to be used by system designers to understand a system more fully and do a better job of designing it.

Whatever the reason a plot is made when we're building a machine learning system, some plots are created to help us understand the data that we start with, the input for the models we create; investing in data visualization to understand the data is like laying a strong foundation. Other plots are created to help us understand the models that we create, the output; careful visualization practice in model evaluation sets us up for success, with models that are as accurate, as appropriate, and as reliable as possible.

And with that, I'll say thank you so much. I want to be sure to thank my teammates on the tidymodels team, as well as the folks who organize the Tidy Tuesday project, a weekly social data project that most of the data sets I shared today come from.