Hello and good afternoon, everyone. I'm going to go ahead and get us started as people continue trickling in, because we have a high-impact and full agenda for today's webinar and we don't want to waste any time. Welcome to "A Guide to Data Visualization Best Practices for Communicating Open Educational Data." This webinar is part of a series of resources produced through our National Science Foundation-supported STEM Education Hub. So please visit the Center for Open Science STEM Education Hub website for additional resources, like deep dives on open scholarship topics and recordings of other webinars. The last STEM Education Hub webinar was "Analyzing Educational Data with Open Science Best Practices, R, and OSF" with Joshua Rosenberg and Cynthia D'Angelo, and the response was just overwhelmingly positive. The recording can also be found on the STEM Hub page, so I wanted to make you aware. Also, if you are an educator of open scholarship and looking for supportive resources, we invite you to visit the Open Scholarship Knowledge Base, or OSKB. The Open Scholarship Knowledge Base provides a way for educators to share their teaching materials and connect with educators around the world. The approach is basically to build a knowledge base platform and a community of contributors to organize information on the what, why, and how of open scholarship in a way that's easy to find and easy to apply. That knowledge base can be found at the link below. So with those two quick announcements, I'm really anxious to turn it over to today's presenter, Dr. Daniel Anderson. Dr. Anderson is a research assistant professor in the College of Education at the University of Oregon. His research lies at the intersection of measurement and large-scale policy, with a specific focus on educational inequities. I had the pleasure of quickly reviewing the full data visualization course and syllabi Dr. 
Anderson teaches, and I was so excited to invite him to speak today when I saw the course and how powerful it was. As members of the education research community, we've all seen just how important it is to effectively communicate data to reach wider and more diverse audiences. And sadly, no matter how great a data set is, its value is diminished if we struggle to connect the impact of the research with the audience that really needs to understand it. We will be closely monitoring the Q&A feature in Zoom, and we'll take questions at the end of the webinar as time permits. So please share your questions as they arise and we will do our best to address them. With that, I am going to warmly invite Dr. Anderson to take it away and help us better understand data visualization, and I will be in the audience, diligently taking notes and learning. So thanks, Daniel.
Great. Thank you. So I want to start by apologizing in advance for what's probably going to happen, which is that I'm probably going to be distracted by my cats, because I have a 10-week-old kitten who's currently attacking my feet, and he likes to chase the cat behind me all around. So if that happens, I apologize. I am going to share my slides now. So this is me. These are my two daughters and my wife. This has sort of already been mentioned, so I'll move on from that, but what I did want to mention is that down here in the slides, you will see these different icons. These all link to different things: this one will download a PDF of the slides, this is just a link to the actual slides online, and this is a link to the GitHub repo where the slides are posted. That's all the source code to create the slides, so if you see anything in the slides and you are interested in how it was created, you can see all of that there. And you're welcome to contact me as well. On the very last slide is my contact information, if you would like to get in touch. 
I wanted to start by providing a few resources for learning more because, of course, we can't go too terribly far in a one-hour presentation, and there's a decent chance I won't get through everything that I have here. So these are two really good books. This first book, I've actually borrowed some of the code straight from it for this presentation. It's written by a guy named Claus Wilke, who is just really, really good. The book by Kieran Healy is a little bit more introductory and is much more about how to do the things, whereas Wilke is more focused on design principles and things like that. But the source code for that book is also available, too, if you're interested. Other resources: Marcy mentioned this earlier, but I teach classes on this stuff, and specifically I wanted to mention this one, which is my website for my class on data visualization. It's still up there. It's going to be up there for the next, I don't know, nine months or so. So you are welcome to go there and look at all of the slides, and there are actually video recordings of the lectures as well. So if you're interested, that's a whole bunch of additional stuff, and more in depth than we'll be able to get to today. The last thing I wanted to mention before we really start is that I also, just yesterday, came across this Visual Vocabulary resource, which is put out by the Financial Times. This is a really nice, interesting way to get at different ways of visualizing data. So if you want to visualize a correlation, you can click here and it'll show different ways that you could possibly do that. The same goes for differences, etc. So it just shows you a bunch of different types of visualizations and different ways that you might think about visualizing the same sort of thing. Okay, so I think that's in the chat now. If not, we'll get it in there. 
But in terms of creating the visuals, that's not really going to be the focus of my talk today. I'm going to mostly be focusing on design principles, and you can use whatever you currently use to visualize data. But if you don't use R, I would really recommend moving to R as soon as possible, for data visualization if nothing else. I'm a huge proponent of R just generally; I think it's amazing. These slides were made using R, and it can do all sorts of different things, but for data visualization in particular, it is super strong. The ggplot package, ggplot2, is pretty amazing. So this is an open source book to help you get started with R, and then for ggplot in particular, there are these two books; the third edition of this one is actually in progress. All of them are open source books, so you can read them online, or you can get a print copy for like 35 bucks, so pretty cheap. I mentioned this already, but the source code is here. These slides were produced in R, and the focus of this talk is really not on the code itself. I don't even think I show any code in this presentation; if I do, it's pretty minimal. It's more about what you might think about, and do's and don'ts around design.
Okay, so this is from Wilke. When we're thinking about data, there are many different ways that we can encode that data into a visualization. The most basic is to put those values on a coordinate system, like an x and a y axis. But we can also visualize discrete data with different shapes. And if we have continuous data, we could also use shapes, but then we would scale the size of the shape relative to the data rather than changing the actual shape. Similarly, we could encode data with color. I'll talk about this more later, but that could be used for discrete data or for continuous data. 
Line types, same sort of thing: we could use different line types to represent categorical, discrete data, or continuous data, so this would be an example where the line is getting thicker as there's more data. But there are also all sorts of other elements to consider. For example, text. How is the text displayed? This may seem like a minor thing, but it can often really help with communication. For example, one thing that is becoming more and more popular is to include colors in your actual title that reference colors in the figure, and that ends up making it so you don't have to have a legend at all. So you could say, you know, "the difference between treatment in blue and control in red," and then the treatment points are in blue and the control points are in red, or whatever. But also think about the purpose of the text. You could use text annotations as well, so if there's one specific point or group of points that is really important and you really want to point it out, consider including annotations that actually draw people's eyes to those points. Transparency is another one. If you have a lot of points, then it can help with showing the data density, rather than just the points themselves. Another case is if you have overlapping shapes, so maybe you have two density plots and they're overlaid; including transparency will be really helpful because then you can see the full distribution for each group.
Daniel? Yep. Would you mind expanding your slides? I think in the appearance they're looking a little small. Okay. With two screens, is that better? Oh, I've shared the wrong thing. That's why. Is that better? Oh, much, yes, thank you. Yeah, sorry. Absolutely, that's great.
Okay, so here's an example. This is from Wilke again, where we have some data with the month and the day and the location. This is just a sample of the data; the data goes a lot further, it's for an entire year. 
And then we have some temperature. So we could think about how, if we want to show changes in temperature over time by location, we would encode this data into a visual display. So why don't you just think about that for a second. If you were going to create a plot with these data, how would you create that plot so that it would show changes in temperature over time by location? Probably, you would think of something like this. We have the month going along the x axis, we have the temperature on the y axis, and then we have a different line for each location. And that's a really good visualization. An alternative is something like this. Now we have the same data, but represented in a different way. Back here, we have a continuous x axis, a continuous y axis, and color mapped to a discrete scale. In this case, we have a discrete x axis, so we've collapsed time down to month, and we have a discrete y axis, and then we're mapping color to a continuous scale rather than a discrete scale, so each of these squares gets filled in according to the average temperature for that month. I'm not trying to argue that this is better than this one. In fact, I think this one is probably better than this one. But what I am trying to point out is that you can take the same data and think about different ways of encoding that data, and you can come up with different visualizations. And sometimes one visualization might be preferred over another depending on the situation. So this is just summarizing what I just said: both represent three scales, two position scales (the x and y axes) and one color scale. But the difference is that we've got color mapped to a continuous scale in the second and to a categorical scale in the first. Okay, we can also include a whole lot more scales. So in this case we've got an x and a y, we've got color, and we've got shape. 
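To make the two encodings of the temperature data concrete, here is a minimal Python sketch using only the standard library. It is not the code behind the slides, which were produced in R; the records, the location names, and the `monthly_means` helper are all made up for illustration. The line plot would use the daily rows directly, while the tile version first collapses each location and month to a single average, and that average is what gets mapped to a continuous color scale.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (month, day, location, temperature in degrees F).
records = [
    (1, 1, "Chicago", 25.0), (1, 2, "Chicago", 26.0),
    (1, 1, "San Diego", 55.0), (1, 2, "San Diego", 56.0),
    (7, 1, "Chicago", 73.0), (7, 2, "Chicago", 74.0),
    (7, 1, "San Diego", 69.0), (7, 2, "San Diego", 71.0),
]

def monthly_means(rows):
    """Collapse daily rows to one mean temperature per (location, month)."""
    groups = defaultdict(list)
    for month, _day, location, temp in rows:
        groups[(location, month)].append(temp)
    return {key: mean(temps) for key, temps in groups.items()}

# Each entry corresponds to one filled square in the tile plot.
tiles = monthly_means(records)
for (location, month), avg in sorted(tiles.items()):
    print(f"{location:10s} month {month:2d}: {avg:.1f}")
```

The daily rows and the monthly tiles carry the same underlying data; only the choice of which variable goes to position and which goes to color has changed.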
But the problem is this. If we're on this side, where we have a high degree of structure to our data, then this works really well, because you can see we've got bands of green, orange, purple, pink. And then right here, right about 10, you can see that these change from circles to triangles, going from the bottom to the top. Over here we have the same scales represented, but the structure in the data gets lost. We're not able to delineate all of these different scales because there's not enough structure in the data. So in this case, you're actually probably just confusing your audience by trying to include too much in a single visualization. It might be better to break those out into what are called small multiples. If you're using ggplot, it would be like facet_wrap. You're going to create a bunch of little plots that show each of these relations, rather than one plot with them all coded by shape.
All right. What about color? Let's think more about color for a minute. There are three fundamental uses of color, generally. The first is to distinguish groups from each other, like when you have treatment and control. The second is to represent data values. So maybe you have some continuous measure of socioeconomic status, and you want to color the points according to that value. The last is to highlight. So imagine there are specific points, maybe they're outlier points, maybe they're points that are very important to your sample, and you want to highlight those points in the overall scheme of things. Those are the three fundamental uses of color. So if we have discrete items, where there's no intrinsic ordering of these things, then we want to have a qualitative color scale. This is like the treatment and control I was mentioning, but often you have many more categories than that. So, when we have a qualitative color scale. 
We have to be very careful about how we choose those colors, because we want to make sure that the colors that we choose are maximally different. Each color should be as distinguishable as possible from all the other colors, and that should be true for every color combination. But at the same time they have to be equivalent. What I mean by equivalent is that when we're maximizing those differences, there shouldn't be any appearance of order. These are discrete things, not a continuous sort of thing, and so we need them to appear equivalent but maximally distinct, so there's no impression of order. Here are a couple of scales that I just picked out to illustrate a qualitative color palette. The first one is actually probably my favorite; it's called the Okabe-Ito palette. It's really nice because it is maximally distinct, but it's also probably the best palette I have seen out there that deals with color blindness. It works really well regardless of the type of color blindness people have. It works so well that if you print something out in black and white, you can still distinguish the groups. This is the Dark2 color palette from ColorBrewer, and then this is the Dynamic palette. You can see that in all of these, each color is distinct from all the other colors, but there's no impression of order. I think you maybe could argue that there's some impression of order on this one, so maybe that one would not be as preferable as these ones. We are also showing a lot of categories, and the more categories you have, the harder it is to keep this property of no order, maximally different while being equivalent. So if you want to show order, then we want to use a sequential palette. This is like I mentioned: maybe you have a continuous SES indicator, and you want to show the values of that. 
Well, then you want to show order. But this is still very difficult. It's much more difficult than we might think, because it's not just a matter of varying one color, which is what this Purples scale is showing. When we're doing that, it needs to be consistent. It needs to have interval properties, just like a measurement scale. In other words, take the change from here to here. Let's say that represents 10 points on our scale. That should represent the same difference perceptually as 10 points here, as 10 points here, as 10 points here. So whatever the scale of your data is, that should be represented and mapped to the color in an equivalent way. These are three different examples. This middle one is an example of a single-hue sequential palette, so we just have a single color and we're just modifying that color as we go across. But there's also this one, which is viridis from the viridis family of palettes, and this one is ag_Sunset from the colorspace package. What's interesting about these multi-hue sequential palettes is that what we tend to do is have them mapped to colors that occur naturally in the real world. The viridis one I always think of as kind of like nighttime to daytime. And this one is more like, well, I don't know what it would be, but maybe it's like cooler to warmer, an earth tone sort of thing. So we can have sequential palettes that still have this sense of ordering even though they have multiple hues to them. Probably the simplest is a single hue, but we can have multiple hues.
The last ones that I wanted to mention are these diverging palettes. Diverging palettes have a center point. I think of this a lot in terms of, let's say, looking at differences in achievement between groups of students. We might want to center that scale at no difference. Right. 
And then as one group has higher achievement, we color it in one direction, and as the other group has higher achievement, we color it in a different direction. The center point of that scale represents no difference, zero difference, but then as we go out in either direction it gets more extreme in the corresponding color. So here again are three different examples, and you can see they all kind of have this center point around here. This one goes more blue in this direction and then brown; this one is orange and purple; and then this one is purple and green. And here's a quick example of the Earth palette. This is just a map of California, where we're looking at the percentage of people that were coded white in the 2010 census. This is a great use of a diverging palette if you're trying to look at what the majority is, because you can center this scale right at 50. In this case you can see how the state lines out geographically: as it gets more brown, we have fewer people identifying as white, and as it gets more green, we have more people identifying as white. So, in this case, with California you have a pretty strong southern-to-northern difference in the demographics, and that's really apparent by using this diverging color palette.
Okay. Common problems with color. This is probably the most common problem that I see, anyway: people try to map way too many categories to a discrete color scale. This is an example where I'm coloring every single point according to the state it represents. If we're looking at this blue point, I would guess that's probably Oregon, but maybe it's Tennessee, maybe it's California. And then what about this orange one? Maybe Vermont, maybe Massachusetts. It just gets really, really hard to tell what they actually are. So the general rule of thumb is that if you're getting much beyond five, it's probably going to be difficult to track. 
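The idea of centering a diverging scale, as with the California map centered at 50, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the brown, near-white, and green hex values, and the simple linear interpolation in RGB are all hypothetical choices, and RGB interpolation is not perceptually uniform the way carefully designed palettes aim to be. The point is just that the neutral color sits at the chosen midpoint (50 for a percentage, or 0 for a difference between groups) and each hue intensifies as values move away from it.

```python
def hex_to_rgb(hex_color):
    """Convert '#rrggbb' to an (r, g, b) tuple of ints."""
    hex_color = hex_color.lstrip("#")
    return tuple(int(hex_color[i:i + 2], 16) for i in (0, 2, 4))

def blend(start, end, t):
    """Linearly interpolate between two RGB tuples, with t in [0, 1]."""
    return tuple(round(a + (b - a) * t) for a, b in zip(start, end))

def diverging_color(value, low=0.0, mid=50.0, high=100.0,
                    low_hex="#8c510a",    # illustrative brown (assumed)
                    mid_hex="#f5f5f5",    # illustrative neutral (assumed)
                    high_hex="#01665e"):  # illustrative green (assumed)
    """Map a value to a hex color on a scale that is neutral at `mid`."""
    value = min(max(value, low), high)  # clamp to the scale limits
    if value <= mid:
        t = (value - low) / (mid - low)
        rgb = blend(hex_to_rgb(low_hex), hex_to_rgb(mid_hex), t)
    else:
        t = (value - mid) / (high - mid)
        rgb = blend(hex_to_rgb(mid_hex), hex_to_rgb(high_hex), t)
    return "#{:02x}{:02x}{:02x}".format(*rgb)

for pct in (0, 25, 50, 75, 100):
    print(pct, diverging_color(pct))
```

Moving `mid` changes what counts as "no difference" without touching the rest of the scale, which is the property that makes a diverging palette useful for questions like "where is the majority?"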
You can, in some cases, push beyond that, maybe up to eight, but it's very often going to be hard to map your legend color scale to your actual points or whatever it is you're plotting. So that's something to look out for. An alternative is that you could use labels and label the points directly. In this case this is still way too many; we have labels that are just overlapping and it's way too cluttered, so it doesn't look very good. Here we could use a subset, but it's still not great. We still have a lot of overlapping, and with Tennessee here, I'm not certain whether Tennessee goes to this point or to this point. Florida is clear, Texas is clear, but some of these others are not as clear. So best is to do something like this, and this could absolutely still be improved. We still do have some overlapping of labels and points, and things like this get a little bit messy. But here we're using color to highlight specific states, and we're using text to annotate those same points. So now we can see very clearly which points correspond to which labels, and we get a sense of the overall relation while highlighting these specific points. Why these points? I don't know, I just took a random sample, but the point is that we can bring them out a lot more easily by using color and, in this case, text annotations as well.
Okay, last few notes on palettes. I really recommend that you actually do some research when you're thinking about using a specific color palette, or colors generally. They're much more difficult than they seem on the surface. If you're just using two colors that you like and that you think look pretty, great. But if you're going to do that, I would recommend that you check for color blindness, because a good proportion of the population is color blind. If you're not checking for color blindness, then, you know, roughly 10% of people might have difficulty interpreting your plot. 
So there are many options for doing this. There are R packages that will do this for you by just passing them code. There's also a link here to one where you can upload a file and then check it against different kinds of color blindness, etc. You can do whatever you want in terms of colors to make it look nice, but just make sure you do additional research after the fact if you do that. The other thing is ColorBrewer, which is a really nice tool with a bunch of different palettes, and it has some options like this: you can specify the number of classes, and then you have multi-hue or single-hue sequential, or you can do diverging, or you can go qualitative. And you can do things like this, where you say, give me only the colorblind-safe palettes, and there are none with six classes. I can go down to four, and now there's one. So that's worth checking into as well.
All right, moving on from color. Let's talk for a second about visualizing uncertainty. The first thing is that, very regularly, all of us see a point on a plot like this, and we interpret that as the value, as if that value is the singular value. But of course we know that that's not always the case. We have all sorts of things that go into estimating that point that may or may not be correct: we have sampling variability, we have measurement error, we have all sorts of other stuff. But there are also some secondary problems in addition to just that. First, we're not great at understanding probabilities. You can train yourself to be, but we're not great at it naturally. And so with the general public, if you're trying to communicate not just a relation but the uncertainty in that relation, then saying there's a 20% probability or whatever is just not going to connect with people as you might hope. Part of that is because we are very content to round probabilities to 100% or to 0%. That's what we tend to do. 
So, I think a great example of this is the vaccines right now for COVID. We hear that they're 95% effective, and people want to go ahead and just round that up to 100. That's an example of not understanding that 95% is not 100%. Similarly, you have things like elections, where there's a 20% chance this person will get elected, and then when they're elected people are very surprised. But 20% is one out of five. So it's not zero, but we tend to round it down to zero. So we're not great at probability generally, but particularly as we get out to the tails, it gets worse. What we typically do is include error bars, and this is good. Especially for scientific audiences, including an error bar around that point is going to be an improvement for sure. But there are other ways we can do this. I'm going to go ahead and just move past this. One way is to use what are called risk theater charts, or at least that's one of the names for them; there are many names for them. What we can do is create these grids that show a single probability. In this case I think it's 10 by 10, so we have 100 of them. And it says, okay, what does a 10% chance look like? We just take 10 points and plot them in a different color than the other 90. And here's a 40% chance, and you can see it there. The reason why I said it's a risk theater is because there's a sort of story: let's say this is a theater and each of these is a seat, and you get a random ticket to the theater. How confident do you feel that you're not going to get one of these darker seats? That's one way to think about it in terms of probability. There's a lot of research around these, and they have been using them with doctors to help them figure out the uncertainty of things like diagnosis errors and survival rates and all sorts of things like that. And they tend to work better. 
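As a rough illustration of the risk theater idea, the following Python sketch draws a 10 by 10 text grid in which a given percentage of randomly chosen "seats" is marked. Everything here is an assumption made for the example: the `risk_theater` function, the `#` and `.` characters standing in for the darker and lighter colors, and the fixed random seed. It is not how the figures in the talk were produced.

```python
import random

def risk_theater(percent, rows=10, cols=10, seed=0):
    """Return a text grid where `percent` of the seats are marked with '#'."""
    total = rows * cols
    n_marked = round(total * percent / 100)
    rng = random.Random(seed)  # fixed seed so the picture is reproducible
    marked = set(rng.sample(range(total), n_marked))
    lines = []
    for r in range(rows):
        seats = ("#" if r * cols + c in marked else "." for c in range(cols))
        lines.append(" ".join(seats))
    return "\n".join(lines)

print("10% chance:")
print(risk_theater(10))
print()
print("40% chance:")
print(risk_theater(40))
```

Scattering the marked seats at random, rather than filling them in row by row, is what supports the "random ticket" story: any seat you are handed has the stated chance of being a marked one.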
So, non-discrete probabilities. Now we want to ask not just "is there a probability of X," but "how much," and then we are no longer in this world. We can't just look at a single probability. We want to look at a range of probabilities and communicate that. So this is an example, totally made up, from Wilke, of an election. Here 0% would mean the election is a tie, and you've got this blue party and this yellow party. We want to know not just the probability that the blue party will win or the yellow party will win, but what's the chance that the blue party will win by 2.5%? If this is what the distribution looks like, our best guess would be the top of the distribution, the median, but that's not the only thing that we care about. And the problem is, when we look at this, we're pretty bad at interpreting area, and that's essentially what we're trying to do here: weigh different areas. So what we can do is take that same basic approach with the squares in the risk theater chart and turn it into a plot like this, where we've discretized the continuous distribution. Now we can kind of do this counting thing, and we can see this says each ball represents 5% probability. So we can just kind of see: okay, if it's 2%, well, then we've got, you know, a 15% chance that they're going to win by two points or more. This is still pretty experimental, but there's some evidence out there that lay people in particular will interpret this better, and have a better idea of what the uncertainty is around these estimates and how much we can go in one direction or the other, versus something like this. Okay. So the other thing, getting back to what we started with, which is the point estimate. 
There are other ways that we can communicate the uncertainty around that point estimate as well, besides just error bars. Error bars are great, but we can do more than that, too. One option that is very simple and really appealing, I think, is to just include multiple error bars. So this is a case where instead of just communicating the 95% confidence interval, I'm going to communicate the 80%, 90%, and 95% intervals, and I'm going to color those according to, basically, a sequential scale. Then you have the point estimate, and you get this kind of fading-out effect for where it's more or less likely. It's most likely at the point estimate, but then it gets less likely from there, fading out in either direction. So that's one option. Another option is to take this just a little bit further and use what are called density stripes. This is basically the exact same idea. It's just coloring from the point estimate out, according to our estimated likelihood of observing that sort of value. And then the last thing we could do is show the actual densities. I kind of made a point against densities just a second ago, and now I'm showing you that you could actually use those densities. My preference is still for this, but if you like the densities, and depending on your audience, that might be a good thing to do. This is yet another alternative to typical error bars.
Okay. One last thing with uncertainty. This is again fairly new, but there are these things called HOPs, which are hypothetical outcome plots. So if we think about a regression line, and this one is nonlinear but that's fine, we would typically look at something like this, where we'd have some shading around the line. And that's great. Again, this is much better than not showing the uncertainty. The first point is to try to show the uncertainty. 
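For the multiple error bars described above, a minimal Python sketch of computing the nested 80%, 90%, and 95% intervals might look like the following. It assumes an approximately normal sampling distribution, and the `nested_intervals` function, the point estimate, and the standard error are hypothetical values chosen only for illustration; the talk itself does not show how its intervals were computed.

```python
from statistics import NormalDist

def nested_intervals(estimate, std_error, levels=(0.80, 0.90, 0.95)):
    """Return {level: (lower, upper)} assuming a normal sampling distribution."""
    standard_normal = NormalDist()
    intervals = {}
    for level in levels:
        # Two-sided critical value, e.g. about 1.96 for the 95% level.
        critical = standard_normal.inv_cdf(0.5 + level / 2)
        half_width = critical * std_error
        intervals[level] = (estimate - half_width, estimate + half_width)
    return intervals

# Hypothetical point estimate and standard error, purely for illustration.
intervals = nested_intervals(estimate=2.5, std_error=1.0)
for level, (lower, upper) in intervals.items():
    print(f"{level:.0%} interval: {lower:.2f} to {upper:.2f}")
```

Each wider interval contains the narrower ones, so drawing them on top of one another with a sequential color scale (darkest for 80%, lightest for 95%) gives the fading-out effect around the point estimate.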
But then if you're going to show it, there are, again, multiple ways to show that uncertainty. An alternative would be to do something like this, where we've used bootstrap resampling to show not just one line but a whole bunch of different possible lines. That gives you the same sort of sense of the likelihood of these things as just showing a single line, but it kind of doesn't allow you to settle on one single line as the answer. You can see that up here, the points are pretty tight, and so we can kind of infer a single line here anyway. But once we get down here, it's quite a bit more uncertain. You could take a median line in there that you're kind of assuming, but it's more uncertain. And that's true with the shading too; this is just a little bit different view on it. The actual HOP, the hypothetical outcome plot, is to take that same sort of thing but then animate it, so that your brain literally cannot settle on a single line. And depending on your outlet, if you're publishing in a journal article, more than likely you're not going to be able to do this. But you could have a follow-up blog post that goes along with your article where you show this as an alternative. Or it depends on who you're communicating with. I'm not saying necessarily this plot, and this is kind of a dumb plot, but there are times when you want to show something and you want to make sure that you communicate the uncertainty in it, because otherwise people could be making incorrect decisions. Okay, so last couple of notes on uncertainty. The first point is do try to communicate uncertainty. I think fairly regularly we're content to omit it entirely, so I would encourage you to think about uncertainty and include it in your plots. And then if you are interested in this topic. 
This talk that is linked here, by Matthew Kay, is my favorite talk I've heard on the topic to date. He talks about HOPs, he talks about many other things, and he talks about those risk theater plots as well. But he goes into a lot more depth than I will here.
All right, shifting topics again, we're going to shift to the data-to-ink ratio. This is a topic that probably many of you have heard about before. It was popularized by Edward Tufte. Specifically, it goes with this quote: "Above all else show the data." He came up with this idea of the data-to-ink ratio, which is, looking at a single plot, how much of the ink used for that plot is devoted to data versus other things. The common goal with the data-to-ink ratio is to maximize it. So of all of the ink that's used for a given figure, you want to maximize the amount of that ink that is devoted to data. For example, Tufte came up with these different kinds of plots. We have a typical box plot, and then different versions of it. They all show the same relation, but in different ways. And these last two, in particular this one, help show the distribution a little bit as well. But C, in particular, maximizes the data-to-ink ratio, so this was Tufte's favorite. And they did some empirical research on it. Your first thought might be, that's pretty cool. Unfortunately, Tufte's plot was the most difficult for viewers to interpret. So, even though it maximized the data-to-ink ratio, the cognitive load it required was much higher. Visual cues like labels and grid lines reduce the data-to-ink ratio, but they can also reduce cognitive load. So, here's another example, where we have on the left just line drawings of these bar plots, and on the right solid figures. If you're just looking at these, you could ask yourself which one you prefer. For me it's not even close. I like this one a lot better. 
But the other thing that's a bit of a problem with this one is that, because we have these grid lines in the background, it can introduce these artifacts where it looks kind of like there's another bar right here, and right here, and right here, just because of those grid lines in the background. Whereas here it's very clear that we have one distribution, and we're not being confused by that at all. Okay, so this is advice from Wilke. I won't read the whole thing, but this part that I've bolded, which was me adding that emphasis, is I think quite important: whenever possible, visualize your data with solid colored shapes. That first part, "whenever possible," is important too, because you're not always going to be able to do that, right, but do it when you can. Okay, so here's another example. I would argue these are both excellent plots. Okay, I don't think that either one is poor at all. I think they're both excellent, in part because they have these text annotations right with the lines. But I still prefer this one to those two. Okay, and this gets a little bit more into personal preference. And maybe you prefer one of these two over this one. And that's fine. But as a general rule you should consider and try to use solid shapes whenever possible. These plots are also great examples of using text annotations in place of legends. So, normally with something like this, right, we would have a legend over here on the side that is green, orange, blue and has these three labels. But if you have a plot that is simple enough, like this, you can just put labels right next to the line like that, and that will reduce the cognitive load, because now I don't have to track back and forth between a legend and the actual figure. Okay. So that's another example where you can do something that is going to reduce the cognitive load required, and hopefully should increase clarity. I think it increases beauty because you don't have to.
Well, because it's just right there, it tells you what it is. And the other thing that I think is a little bit underappreciated is that when you're able to do that, you can omit the legend altogether, which means that you're going to be able to maximize the figure size, because you don't have to have space that's dedicated to a legend, whether it's on the top or the bottom or the side or whatever. Okay. Okay, so we've covered a lot already. So I just want to kind of pause for a second and go back to practical advice that I've sort of mentioned as we've gone through here but haven't really summarized yet. Include uncertainty whenever possible. Avoid line drawings, generally. And then maximize the data-to-ink ratio within reason, but prefer reductions in cognitive load. Okay, so I think less about the data-to-ink ratio and more about cognitive load: are the things that I'm adding here actually reducing cognitive load, or am I doing that just because it's pretty or whatever? Use color to your advantage, and think very critically about the color that you're using, not just the palettes that you're using, but how color can be used to highlight points, or to reveal relations that you might not otherwise see. And prefer plot annotations over legends. Okay, so that's where I was mentioning actually just labeling a line, or labeling a point or group of points, directly, rather than using a legend. All right, quickly, we're going to look at a quick empirical example of dealing with grouped data. We've covered some of the basics now, and now we're going to get into an example of grouped data. But we won't be able to cover everything there is, you know, in terms of visualizing relations and visualizing magnitude differences and all these sorts of things, so this is just one that I chose, which is grouped data. Okay.
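As a rough illustration of the "include uncertainty" advice and the bootstrap-resampled lines described earlier, here is a minimal Python sketch using only the standard library. It is not the code behind the slides; the function names (`fit_line`, `bootstrap_lines`) and the toy data are hypothetical and exist only for this example. Each resample of the data, drawn with replacement, yields one plausible regression line.

```python
import random

def fit_line(points):
    """Ordinary least-squares fit; returns (intercept, slope)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def bootstrap_lines(points, n_lines=100, seed=0):
    """Fit one line per bootstrap resample (rows drawn with replacement)."""
    rng = random.Random(seed)
    lines = []
    while len(lines) < n_lines:
        sample = rng.choices(points, k=len(points))
        if len({x for x, _ in sample}) < 2:
            continue  # a resample with a single x value has no defined slope
        lines.append(fit_line(sample))
    return lines

# Hypothetical toy data (an assumption, not data from the talk).
toy_rng = random.Random(42)
points = [(x, 2.0 + 0.5 * x + toy_rng.gauss(0, 1)) for x in range(20)]

lines = bootstrap_lines(points, n_lines=200)
slopes = sorted(slope for _, slope in lines)
# Positions 10 and 189 of 200 sorted slopes bound roughly the middle 90%.
print(f"middle 90% of bootstrap slopes: {slopes[10]:.2f} to {slopes[189]:.2f}")
```

Drawing every `(intercept, slope)` pair as a faint line gives the many-lines view; showing the same lines one at a time as animation frames gives a hypothetical outcome plot.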
So the data look like this. It's the Titanic data, and many of you are probably familiar with this. It's just the passengers that were on the Titanic, some information about them, and whether or not they survived. Okay. So, if we were interested in whether or not they survived, they are grouped by sex in this case, okay, as they were coded in the data, which is just male and female. So we could use box plots, right, that's a very typical use of box plots, where we're trying to show the distribution for each group. Alternatively, we could just actually plot the points. And in this case we're using jittering. So if we just plotted the points it would just be a line of data for each of these, right, but we use jittering to actually move them off of that line. And therefore we can kind of see the differences between them. And one of the things that we get from this is sort of a sample size representation, because you can see, it looks to me anyway, like we have more male than female passengers. Right, a box plot isn't really going to show you that, though there are things you can do, like you can scale the width based on the n or whatever. In some cases, this is going to be really helpful. But I think actually even better than that is to use what are called sina plots. Okay, and so this is the same sort of thing, but now we're jittering in a very purposeful way, where when we have points that are overlapping we're just sort of pushing them out to the side. Okay, and so you can see, like, especially once we hit 20 years old for males we get this big burst of passengers. Right, probably because a lot of them are crew and whatever. And then for female, we have the same sort of thing, but it's more of like a constant increase in the number of passengers, and then a constant decrease from there.
Right, whereas for male it's sort of constant up until just about that point, and then we have this explosion, and then kind of a nice smooth decline all the way up to 65 or whatever. We could also use stacked histograms. I think this is really weird. I think this is hard to understand. If you're looking at just the male distribution, it makes sense. And if you want to look at the male distribution relative to the total distribution, then it makes sense. But if you're interested in the female distribution, it's pretty hard to see what's going on here, because each bar for the female distribution is starting at a different point. Right. So instead of that you could do dodged bars. I still don't think this is very good. It does allow you to kind of see the different distributions, but it's still pretty hard. And what basically makes me take a sigh of relief is, I think, just to facet wrap them out. So this is a very small example of those small multiples I mentioned before, where you're taking a distribution and now you're showing it basically separately for the two groups. Okay. I need to actually add this to the slides, but if you were interested in the total distribution, you could also show that here as a background distribution. So you could have like a dark gray distribution in the back here that shows the total distribution, and then you could have the female distribution overlaid on that, and then the male distribution overlaid on that. And that would be a good way to see each of those distributions relative to the whole, if that was part of what you were interested in, as I kind of mentioned earlier. You could also do overlapping densities. Okay, so here are density plots, and we have them overlapping, and so you can kind of see each distribution, but you can also see where they're sort of piling up together.
And then there's this other alternative, which are called ridgeline densities, which are similar to overlapping densities, but we kind of space them out on the y axis. All right, one more empirical example, and then I'll probably have to skip to the end: visualizing amounts. So we're going to look at how much college costs. Okay. So, the tuition data look like this. We just have every state, we have the year, and then we have the average tuition cost in that state for that year. Okay, so we could start out with something like this, where the x axis is representing the state, and the y axis is representing the average tuition. I'm calling this the three puke emoji version of this plot. It's really hard to see anything that's going on here. I have no idea, except for maybe this first one, which state is represented, and it's hard to see any relation at all here. So we can make that better like this. I'm still calling this a two puke emoji version, where we're making these labels very small, and we're putting them at a 45 degree angle, so now we can kind of get what it is, but it's not great. Much, much better, but still not very good, in my opinion: we just move from having states on the x axis and money on the y axis to the other way around, with states on the y axis and now horizontal bars along the x axis. And the reason I don't think this is very good is that there's no ordering. There's no reason that the y axis is ordered in the way it is. It's ordered alphabetically, but like, why, who cares. What's better, I think, is to actually just rank order the y axis according to the x axis. And this, I would argue, should be your default. Okay, because if there are reasons that the discrete axis should be ordered in a certain way, then order it that way.
But if there's no intrinsic ordering, and there's no reason you need to have it ordered in a specific way, then just sorting it like this can be highly effective. Okay, now I know, very quickly, New Hampshire is the most expensive, and Wyoming is the least expensive. Okay. So that is, I think, one of the biggest things I can tell you from the whole talk: when you have bar charts, your default should be to order them according to the other axis. Okay, order the categorical variable according to the continuous variable, unless there's a reason not to, and I'll show you that in just a second. Kind of smiley or full smiley version. I'm now highlighting a specific state that I'm interested in. So I would of course be interested in Oregon, since that's where I live. And so I've highlighted that specific state in this blue color. And now, not only can I see where Oregon is, but I can see very clearly where it is relative to the other states. In terms of the distribution, it's right about in the middle. Right. And that's very powerful. And I would argue a lot more powerful than highlighting it in something like this, because it's harder to see where it ranks in something like this. But as I mentioned, it's not always good to sort. If your categorical axis has ordering to it, or there's a reason that you really want to include a specific ordering, then do that. Okay, so here's an example where age has been binned into a categorical variable along the x axis, and it goes 45 to 54, 35 to 44, 55 to 64, 25 to 34. Like, it makes no sense. Right. Much better is to just keep it ordered according to age. Right. So now we're going from left to right, increasing age. Right. So if you have order to your categorical variable, then keep that order. Otherwise, the default, I think, should be to sort according to the continuous variable. Here's a heat map. Same thing, ordered, different color scale. And even another different color scale.
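The ordering rule above (sort the categorical axis by the continuous variable unless the categories have their own order) can be sketched in a few lines of standard-library Python. This is only an illustration: the helper names (`order_categories`, `text_bars`), the state labels, and all numbers are made up for the example and are not the tuition or age data from the talk.

```python
def order_categories(values, intrinsic_order=None):
    """Return (category, value) pairs in display order.

    Categories with a natural order keep it; otherwise bars are
    sorted by value, largest first.
    """
    if intrinsic_order is not None:
        return [(cat, values[cat]) for cat in intrinsic_order]
    return sorted(values.items(), key=lambda pair: pair[1], reverse=True)

def text_bars(pairs, highlight=None, width=30):
    """Render horizontal text bars; the highlighted category uses '#'."""
    top = max(value for _, value in pairs)
    rows = []
    for cat, value in pairs:
        mark = "#" if cat == highlight else "="
        rows.append(f"{cat:>8} {mark * round(width * value / top)} {value}")
    return rows

# Hypothetical values with no intrinsic order: sort by the value.
cost = {"State A": 9, "State B": 15, "State C": 5, "State D": 12}
for row in text_bars(order_categories(cost), highlight="State D"):
    print(row)

# Hypothetical binned ages: keep the natural order of the bins instead,
# here recovered from the lower bound of each label.
ages = {"45 to 54": 30, "35 to 44": 42, "55 to 64": 18, "25 to 34": 37}
age_order = sorted(ages, key=lambda label: int(label.split()[0]))
for row in text_bars(order_categories(ages, intrinsic_order=age_order)):
    print(row)
```

The single highlighted bar in the sorted chart plays the role of the highlighted state: its rank relative to the others is readable at a glance.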
So I know I'm skipping ahead faster, but we're almost out of time and I want to make sure we have some time for questions. Sort of a finalized version. The thing that I wanted to point out about this is that this scale goes from black as the lowest to, like, the color of the sun as the highest, right. So in this case I've used the overall theme to kind of help me out, because I've made the background the same color as the lowest value, and now the lower values kind of disappear into the background and those higher values sort of pull out a bit more. Okay. If you have geographic data, you could actually capitalize on that. Alright, last couple of things, some things to avoid. I mentioned line drawings. Much worse is 3D when it's not necessary. There are some situations where you want to use 3D because you have like a Z variable, and that's fine. I would argue a lot of those cases we could still do in two dimensions, and it's probably going to be easier to interpret, but sometimes, for things like topographical data or whatever, that's going to be needed. But in this case where we have like a pie chart, we don't need 3D, and it's going to introduce visual artifacts. So if you look at this, how big you perceive this chunk of the pie kind of changes depending on which one you're looking at. And it's the same in all of them. Much worse, and this one's used pretty regularly, is 3D bar charts. Okay, so this is an example where this bar is 80. And if you look at 80 and follow that line, it sure doesn't look like it's 80. This one is 105. If we follow 100, it does not look like it's 105. Okay, and that's because it's 105 at the start, like at the front, and this line is way in the back. So it is correct, but it looks really weird. Right, this bar does not look like it's more than 100 when here's the line for 100 right there. Okay, and that one is used pretty regularly. So I would really encourage you not to do that. Pie charts with lots of categories are generally not good.
So an alternative representation is just to use bar charts. And in this case we can compare a value across years more easily, but we can also compare the values within a year more easily. Still, there are times when pie charts are really useful. So this is a case where we have a low number of categories and the difference between them is really big. And specifically, if we want to see what the part-to-whole relationship is, then pie charts are going to be good. But we just don't want to use them when we have like 15 categories. Right. So there's more here on comparing things, which you can look back at later if you want. Dual axes are another example. They are used relatively frequently, and I would argue they're not good. The problem is that even if the two series are on two different scales, our brains want to put them on the same scale. Our brains don't want to treat them separately. So we're going to look at this and say, okay, this one goes from 10 to zero and this one goes from 10 to 10.7. Okay, so it's very hard for us to look at these things independently. And so what we should probably do is just put them in separate plots, using those small multiples. Other things I'll skip past here. These are sort of the final points. Avoid line drawings, as I mentioned. Sort bar charts. Consider dropping legends and use annotations when possible. Use color to your advantage. Consider double encoding. I didn't really talk about this too much, but if you want to highlight, maybe, a specific group or whatever, you could use color, but you could also use shapes, so maybe some are circles, squares, triangles, and the color changes with each of those. And that's going to be helpful again for people who are color blind, because even if they can't perceive the color differences, they should be able to perceive the shape differences. Make your labels bigger.
I didn't really talk about this too much, but a very common problem is people are creating these figures on a big screen. And then they scale that figure all the way down to six and a half by eight or whatever, and their labels, which looked fine, get scaled down so small that they're hard to read, so be careful about that. Things to avoid: basically never use dual axes, unless the secondary axis is a direct transformation of the first. So if you have something like Fahrenheit and Celsius, or feet and meters or something like that, then that's fine, but for any other use case I would strongly encourage thinking about alternatives. Avoid unnecessary 3D. Be wary of truncated axes, and maybe of pie charts with lots of categories. Alright, that's all I really have. I just want to mention that there are also lots of different ways you can communicate data. Flexdashboards are a really nice example. And I think they're kind of neat because you can write just this amount of code and get something that looks like this. Right, so now you have this plot, plus a bar chart, plus a table, and this table is actually interactive and everything, and you can kind of click through it. And that's a pretty small amount of code to write for something that's a pretty nice output that could be shared with a lot of people. Websites, blog posts, that's another thing. This is actually from one of my students, which I think is beautiful. Okay, so I know that last part was very fast. This is my contact information. I wanted to err on the side of too much content so you can review it more at your own pace. But if you have questions about any of this stuff that I didn't really get to, please feel free to contact me. I'm on Twitter, or GitHub, or my website, or you can just email me. Okay, and that's all I have, minus questions. That's fantastic.
I feel my cognitive load was already decreasing over the last hour, as the pieces started falling into place, and those were really great practical examples, so thank you so much, Daniel, really appreciate it. I just wanted to give everyone a chance. I know we did a really good job of going through and answering the questions and the comments that have been coming in through the chat feature and in the Q&A. But if anyone has any last remaining questions in the 30 seconds or so that we have, feel free. Dr. Anderson has also graciously offered to field any follow-up questions, so feel free to get in touch, either with me at the Center for Open Science, Marcy at COS.io, or with Dr. Anderson. And unless there's anything burning, I just want to thank you so much, Daniel, for sharing your insight with us. This is so eye-opening and revealing, and I think it's going to help so many different research projects, so we really appreciate it. And we appreciate everyone for joining us today, so thank you for taking an hour out of your day and joining the webinar, and we look forward to any follow-up that may emerge from it.