I'm Melissa Santos. I'm a data analyst at a little company, and I love box plots. I love them a whole lot. I use them a whole lot. I realized while talking to someone else earlier in this conference that I don't like a lot of other visualizations, so it's really surprising that there's one that I really love. But this is it. This is the box plot. They make fun of me at work for it. We have meetings where the description is "to admire box plots." They joke about how maybe they can get these results as a box plot, and I'm like, yes, you can. You can. I'm going to try.
So I know not everyone is super familiar with box plots, so we're going to do an introduction and go through what they are. We're going to spend some time talking about how to draw them with computers, and then we'll just look at some examples.
It's called a box plot because it has a box in the middle. Real story. True story, friends. And the box has the median in it, so if you're showing someone a box plot and they get nothing else out of it, just be like, that line in the middle, that's the median. Now you know where the median is, and that's a huge amount of additional information. Then the upper and lower edges of the box are called the upper and lower quartiles. Sometimes they're called the fourths. Sometimes they're called the hinges. Let's just call them quartiles. They're like the medians for their halves: each one divides its half of the data in half. And the upshot of that is that half the data is in the box.
That gives you a measure of spread associated with box plots. We call the height of the box, the distance between the quartiles, the interquartile range, or IQR, and it's a perfectly good measure of spread. It's probably more understandable than a standard deviation because you can tell it's the part that contains half the data. If you are so lucky that your data is perfectly normally distributed, it is about equal to 1.35 standard deviations.
If that somehow makes it click better for you, you really can think of it that way.
We also have this notion of whiskers that go off the box. They would go to the max and the min values if all your data was well behaved, but data often isn't, so they don't always. In order to keep them a little bit wrangled, the rules of box plots allow the whiskers to go out as far as 1.5 times the interquartile range past the edge of the box. They don't necessarily go out that far entirely. They go out as far as the data does within that amount. And then anything past that is an outlier. I'm not saying that in any kind of statistical way. I'm just saying that in terms of a box plot, it's an outlier. And ideally, if your data set is well behaved, there should be so few of them that it's worth the chart ink to let each one be a mark of its own.
So why do I love box plots so much? Because I get asked all the time, well, how many people typically do this? How long does it typically take for this to happen? And "typically" kind of implies that the average is meaningful. And I don't believe in the average. I don't trust it. I want to show people the range, and I want them to actually be able to see and compare things a little bit better. Even when the data is all very close to the typical value, that's important information to know, and that's not something you get by showing just the average. So again, we get the full range of the data in a box plot. And hopefully, your data is pretty well behaved and it doesn't have a lot of outliers. Maybe you'll have a really boring box plot, and that should be wonderful and reassuring. Wouldn't that be great? But if you don't have a really well-behaved box plot, then you've learned something, and that's even better. If you find out that your data has all these wacky outliers and you have somebody who took 2,000 days to do something, you wonder what's up with them.
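Going back to the anatomy of the box plot for a moment, here is a rough sketch (not code shown in the talk) of how the median, quartiles, IQR, whiskers, and outliers could be computed with Python's standard library. The function name `box_plot_stats` and the sample data are made up for illustration, and the quartile convention used here is an assumption: plotting tools differ slightly in how they interpolate quartiles, so small samples may give slightly different boxes elsewhere.

```python
from statistics import NormalDist, quantiles


def box_plot_stats(values):
    """Return the numbers a box plot draws, using the 1.5 * IQR whisker rule.

    Hypothetical helper for illustration; needs at least two data points.
    """
    data = sorted(values)
    # Quartiles by linear interpolation between data points (an assumed
    # convention; other tools may place the quartiles slightly differently).
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Fences: the farthest the whiskers are allowed to reach.
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme data points inside the fences.
    inside = [x for x in data if low_fence <= x <= high_fence]
    return {
        "median": med,
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": [x for x in data if x < low_fence or x > high_fence],
    }


# Invented example: days taken to finish some task, with one straggler.
days = [1, 2, 2, 3, 3, 4, 5, 6, 7, 2000]
stats = box_plot_stats(days)

# For a perfectly normal distribution, the IQR in units of standard deviation.
normal_iqr = NormalDist().inv_cdf(0.75) - NormalDist().inv_cdf(0.25)
```

With this made-up data the upper whisker stops at 7, the last point inside the fence, and the 2,000-day straggler is the only value reported as an outlier. The last line checks the normal-distribution rule of thumb: `normal_iqr` comes out at roughly 1.35.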
And while we didn't see it in the example so far, one of my favorite things about box plots is that they make it really easy to compare groups. Instead of just comparing the average or the median, you actually get to compare where the bulk of the data sits and get a feel for how the distributions differ.
So let's talk a little bit about how to draw box plots with computers. I was so excited about this talk that I've been excited about it on Twitter for like two months, so I've had a lot of lovely input from people. And this is just one thing somebody sent me. Vega is an HTML canvas plotting library, and it draws these nice elements. And Altair is a Python wrapper for it. So there's just a ton of tools out there that can be used for box plots. We're not going to get into this one. I just wanted to show there was one I had never heard of that someone pointed out to me.
And before we get into the programming tools, there are what I call the push-button tools: the tools where you have a data set that you import or you get from your SQL, and then you push a button to get a graph. Tableau is the one that I see in all the job descriptions. All the job descriptions. So it must be important, and it's very good that it has box plots. I've not ever gotten to use Tableau, so I don't know how great they are. But there's box plots, so they've got to be pretty great. Chartio is the one I use at work, and it has box plots. Thank goodness, or I would have to go out to Python all the time. Someone on Twitter brought up Power BI, and I looked into it. It doesn't have box plots by default, but they can be imported. Minitab is another tool in this category that I used in school, and it has box plots as well. If any of you know of other push-button-y tools that have box plots, I would really love to have a more comprehensive list of these, because part of this is to try and make using box plots more accessible to other people.
So I have a nice example data set to show you.
I used Google Dataset Search, and I found this Pokemon data set on Kaggle. All you need to know about Pokemon (and this is just the video game data, I don't know the specifics) is that they come in various generations, and they have a variety of stats. They have speed and defense and attack and whatever. You can add those up and get a total stat that tells you how cool your Pokemon is. So let's do that and look at some of those stats with Python.
Basic Matplotlib is very happy to give you a box plot. You just pass it any kind of vector, and it says, sure, here's your box plot. The total score for all Pokemon has a median around 450, a max around 800, and a min around 200. Here you go. Here's a box plot.
Pandas has a built-in box plot method, so if you have any data frame, you can tell it to do a box plot on it. And not only can you tell it to do a box plot on any one column, you can tell it to divide the data up by any other column that it knows about. So in this case, we divided it out by generation. And we see that for generation one, there's one outlier: there's one really cool Pokemon in generation one that everybody wants to get. And then we can see a little bit that the range for generation three is wider than for any of the other generations, or that the median for generation four is maybe a little bit higher than some of the other ones.
If we want to get fancy, then we can use Seaborn, which is built on top of Matplotlib and is delightful. As well as having our x-axis with the generation values and our y-axis with our total score, now we can also have a color variable, so the hue is set to whether the Pokemon is legendary. So obviously, legendary Pokemon are going to be better, and the statistics reflect that. Our outlier in generation one is swept up in the legendary group, and you can see that the legendary ones have uniformly much higher scores.
And then we start getting into variations on the box plot.
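As a hedged sketch of the grouped pandas box plot described above (not the code from the talk), the example below draws the total score split by generation. The column names `Total` and `Generation` are assumptions about the Kaggle file, and the handful of rows are invented stand-ins so the block runs on its own.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so the sketch runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# Invented stand-in rows; the real data set has one row per Pokemon.
# The column names "Total" and "Generation" are assumed for illustration.
df = pd.DataFrame(
    {
        "Generation": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
        "Total": [300, 320, 405, 500, 680, 250, 330, 410, 420, 600],
    }
)

# One box per generation, all sharing the same y-axis.
fig, ax = plt.subplots()
df.boxplot(column="Total", by="Generation", ax=ax)
ax.set_ylabel("Total")

# The medians that the middle lines of the boxes are drawn at.
medians = df.groupby("Generation")["Total"].median()
```

Calling `fig.savefig(...)` or, with an interactive backend, `plt.show()` would display the chart. Seaborn's `boxplot` takes the same kind of data frame along with `x`, `y`, and `hue` arguments, which is how a third variable such as the legendary flag can be added as color.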
A notched box plot has this little waist that tells you where the 95% confidence interval is for the median, which I love. Like I was saying, I saw that generation four looks like it's a little high in that median, but then when you actually have that confidence interval to compare to the other confidence intervals, you're like, yeah, that's actually a little bit of a difference, that's for real. And so ideally, they have that beautiful shape that the blue ones have, where there's just a notch in the waist of the box. But in a case like the legendary ones, where the median is at the bottom of the box, sometimes the confidence interval has to extend out past the box, and you get this saw-tooth pattern that you see here.
And of course, we can do all of this in R. Base graphics is very happy to do box plots for us. And it lets me split things up immediately, with the generation and the legendary status all in one go. And then not only can I do the notched box plots in R, but I have a new option called varwidth, which lets me have the widths of the boxes scale with the size of each group. So for instance, here we see that there are many fewer legendary Pokemon than regular Pokemon. Seems pretty reasonable. I like varwidth so much I had to show you another example, because I really love it. So I just wanted to show you that, for instance, there are far fewer flying Pokemon than there are normal or water Pokemon. And at a glance, you can tell that the median power of the flying Pokemon is pretty good. And of course, you can do all of this in ggplot as well. There's a geom_boxplot, and ggplot is perfectly happy to do both notches and varwidth.
So I work at a little company that does organizational chart software. And so I count a lot of things. How many people report to one manager? How many people used the tool yesterday? How many people did anything?
So I just do a lot of counting. And then I give them a lot of box plots about the counting. So I'm going to show you some examples from work.
So we have the notion of a planning org chart. You can invite people to collaborate with you on your org chart, so by default, you have zero collaborators until you invite one. And this is something I do fairly often, where I do this histogram-esque bar plot and I also do a box plot, because they tell you different things. From the histogram-esque bar plot, we can tell right away where the data is piling up and that there's a huge spike at zero. And it still shows you the range, which is very nice. But the box plot tells us a little bit more about where the median and the quartiles sit, and it also shows the range. So being able to pull out the median separately, without having to guess at it by trying to add up these bars in your head, is something that you gain by using the box plot as well.
And then we have a game on our mobile app called Who's Who. It shows you someone's picture and you try and figure out which name goes with which picture. And we want to know how many people are playing that game and how viral it is within companies. And so someone has played this game over 600 times. And then the screenshot at the bottom is a hover over the same box plot for the maximum Who's Who count. Chartio has this convention that it says the maximum and the minimum are the ends of the whiskers. And then it just calls everything else an outlier, but it does not call them the maximum or the minimum. So partly I wanted to complain about that convention, and partly I wanted to show you that this tool lets me hover over any of these graphs and quickly see the median and the quartiles.
And sometimes we're not just counting things.
Sometimes we turn things into percentages or other kinds of metrics, and they can still make sense as a box plot if you have a whole vector of data. So in this case, we've looked at what share of people in the company play, for companies where at least two people play. We're trying to get at how viral it is. And it's not really taken off. That median is pretty low. Even at the most engaged companies, only 40% of people are playing the game. But it's a lot to ask for.
Another thing I do periodically is use a logarithmic scale. It takes away a little bit of being able to see the IQR and get what's going on with that. But it stretches out the bar plot a little bit, and the box plot a lot, so you can see where the median is if it's all squished up at the bottom. As you're kind of seeing, a lot of my data is highly right skewed. It's all squished up at zero and one, with a long tail. So being able to see the median on something like this and still see the upper quartile and everything, and then just see, oh, it does go out past a thousand. Geez, that's a lot of nodes before they first get to share what they're doing. They must be doing some kind of cool import or something.
Sometimes the logarithmic scale isn't enough to save us. Sometimes we're just dominated by outliers with this highly skewed data. This is the days between signing up and using a CSV import. And so 60% of people use one on the first day. And then it goes out and out and out. And so the first box plot is obviously useless. It does nothing for us. It's just a mess of outliers. The second one tries a little harder, but all it tells us really is that the median is there at one. Or at zero, rather. So in this case, I finally had to give in and use something that's not a box plot. Sadly enough, I used a cumulative percentile graph to try and show this data.
Taking a little bit of a turn, I'm back to the Pokemon data. This is the speed of the Pokemon by generation. I want to complain about bar plots.
I don't see the point of having a bar plot to show an average. I feel like that fully filled-in bar implies something beyond just its height. It's nice that they have the error bars for the 95% confidence interval, but it tells you so much less than the box plot is telling you. The outliers that are visible in generation three, the minimum of generation six being higher than the minimum of any of the other ones: all of these are things that are really easy to see with the box plots, but completely obscured by the bar plot. But I guess a table of averages didn't look cool enough.
And I will admit it, box plots aren't perfect. I already had an example where I had to give up, and this is some really cool research out of Autodesk where they did an entire paper on trying to fool statistical methods. The box plot was not the only thing they broke. And they broke the box plot by grouping the data in all these various ways, so you have these buildups of data in different places. So what it showed me was that box plots are really not useful at telling you when your data is multimodal. If there are multiple peaks, that's just not something a box plot is going to communicate to you. Which it didn't really imply that it would, so I'm not too upset. The violin plot is an extension of the box plot that shows the density on either side of the line instead. And it's a little nicer for that, but it's a little harder to read in other ways. Like, I don't know where the medians are on these.
Box plots, they're wonderful. I love them. They're so much more than a central tendency. And more importantly, you can slice and dice your data a lot of different ways and see those groups and see the differences really quickly and easily. Even if you're not expecting outliers, you're going to have them show up really quickly for you. And you can make them in a lot of different ways. So I love box plots.
I hope you were not too bored hearing about my enjoyment of them.