Thank you all for joining us today. It's really exciting to have you here. Today we're talking about graphing multivariate categorical data, and specifically we're going to be focusing on mosaic plots and alluvial diagrams. So this is going to be the how, what, and why of mosaic plots and alluvial diagrams. I'm Luda, or Ludmila; you can call me Luda. I am a data scientist at Amplify. Amplify is an education technology company that focuses on curriculum development, so essentially a textbook company but with an online platform, and I focus specifically on our science product. I usually live in Brooklyn, but I'm coming to you today from upstate New York. I'm visiting my uncle and the internet isn't great, so I have my phone as a hotspot. Fingers crossed, everything will be okay, but please let me know if you're having any issues. I'm also having some seasonal allergies, so excuse any sneezing. But yeah, please let me know if you're having any issues hearing me or seeing my screen or anything like that. And now I just want to introduce our co-host, Joyce.

Hi, I'm Joyce Robbins. I'm an instructor in the statistics department at Columbia University in New York. I teach data visualization to our master's students in the statistics department, as well as in the Data Science Institute. I have a course called EDAV that I teach every semester, and the rest of my schedule is generally introductory statistics classes. I met Luda through R-Ladies New York, a wonderful organization. And a fun fact is that both of our moms are in data science and code in R. So it's kind of cool. You don't meet a lot of people with that genealogy, I should say. So Luda's going to go through the agenda quickly and then we'll get started.

Yeah, and our moms actually watched this, so this has been mom approved. So our agenda today is that we're going to have our welcome right now.
Next, we're going to move into the mosaic plots with a code-along with Joyce, and then we'll move to alluvial diagrams. We'll take about a five-minute break after that. Then we're going to move into a lab for about 30 minutes, where you'll get to practice making your own graphs. And then we'll come back together and discuss the lab results and go over solutions as desired. Those are our versions of the graphs, not necessarily the right solution. I just want to point out that all of our materials are here on our GitHub. It's also pinned in our Slack channel. You might want to go ahead and pull if you haven't recently, since we did make a couple of last-minute changes, as always. You'll find all of our slides and all of the code for the code-along there. You'll also find code for the lab section. We encourage you to ask questions at any time. One of us, whoever isn't giving their section, will be answering questions on chat. We'll also try to keep an eye on the Slack, but it's a little easier for us to just get the questions right in the chat. And we encourage you to participate as much as you want. You're welcome to turn your video on. You're welcome to be involved. But it's up to you how much you want to participate. So without further ado, I'm going to turn this over to Joyce, who's going to start us off.

Thank you. I'm excited to be here and excited to discuss mosaic plots with you. I'll just point out one more thing on our repo, which is that we have these links to the specific slides. So the slides I'm going to show you first are these mosaic slides. And then when we do the code-along, you may want to open this up in RStudio, whatever works best for you. Some people like to code at the same time; others just watch what I'm doing. Either way.
And when we get to the breakout rooms, we'd encourage you to turn on your videos so that you can meet people and work together with them.

So let's start with what a mosaic plot is. I've cobbled together this definition from a number of different places. It's a little bit hard to explain, but this is the best I can do: it's a space-filling visualization in which the area of each small rectangle is proportional to the frequency count for a unique combination of levels of the categorical variables displayed. So that's a mouthful. I'm going to break it down with this example. What this shows is births in the United States in 2019. There were over 3.7 million births, which is actually going down; in some years we had over 4 million births. They're broken down into age category, pre-pregnancy weight, and then weight gain during pregnancy. This is data from the Centers for Disease Control. There are seven different levels of age. This particular mosaic plot was created by first cutting, or splitting, on age, proportional to the number in each group. So this rectangle, which hardly even looks like a rectangle since it's so small, represents the mothers under the age of 15 years. The next one is 15 to 19 years, etc. The second cut is by pre-pregnancy weight, grouped into four categories. And I've intentionally left out the labels, so you can just focus on the structure of the plot. So there are four different weight categories. Those are what we call vertical cuts. And then the last cut is horizontal, and that represents the weight gain. What we're looking for in a mosaic plot is whether the proportions are consistent or not consistent. If you see something that looks like a piece of graph paper or, as we would say in New York, the Manhattan grid, where all the streets and avenues are lined up at right angles, then that is telling you that there is little or no association between the variables.
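The definition above can be made concrete with a little arithmetic. The following is a minimal sketch in base R, using made-up counts (the `age` and `follows` names and all of the numbers are assumptions for illustration, not the birth data). The first cut sets each column's width from the marginal proportion, the second cut sets each tile's height from the conditional proportion within that column, and width times height recovers the cell's share of the total.

```r
# Hypothetical counts: rows are age groups, columns are a yes/no outcome.
counts <- matrix(
  c(30, 70,
    50, 50,
    80, 20),
  nrow = 3, byrow = TRUE,
  dimnames = list(age = c("young", "middle", "older"),
                  follows = c("yes", "no"))
)

# First cut (vertical): column widths are the marginal proportions of age.
widths <- rowSums(counts) / sum(counts)

# Second cut (horizontal): within each age column, tile heights are the
# conditional proportions of the outcome given age.
heights <- prop.table(counts, margin = 1)

# Area of each tile = width * height = that cell's share of all observations.
# (R recycles `widths` down the rows, so row i is scaled by widths[i].)
areas <- heights * widths

print(round(areas, 3))
```

If the rows of `heights` were all identical, the horizontal cuts would line up across columns and the plot would have the graph-paper look; here they differ, which is what produces a staircase.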
When you get a staircase like this, where things are off, that tells you that there is an association. And that's the purpose. We're going to compare mosaic plots to other types of plots and see when it's useful to use a mosaic plot as opposed to another type of graph for categorical data. And this just shows you that with those seven levels of age, four levels of weight, and seven levels of gain, you end up with 7 × 4 × 7 = 196 rectangles. And the area of each, once again, is proportional to the count.

I want to contrast a mosaic plot with a treemap. These are often confused because they are both space-filling visualizations. There's no white space other than perhaps some borders, but not within the plot itself. A treemap, though, shows hierarchical data. So here, the employees are divided first into a high-level category, indicated by the fill color, and then each high-level category is broken down further into subcategories and then sub-subcategories, etc. So it shows something completely different. You don't have that consistent number of rows, columns, and splits for each. You can see that a few lines go directly across, but the rest of them do not. Within any one box, you can't indicate the variables and the levels of the variables in the same way as you can with a mosaic plot.

So let's just remind ourselves what numeric data looks like, because we love numeric data so much. It's so easy. It's so convenient. It's so consistent. It lends itself so naturally to the Cartesian coordinate system with an x-axis and a y-axis. We don't have that with categorical data. Categorical data is messy. It comes in different data types. It needs to be cleaned. There's no obvious way to graph it. So that's where the challenge is. So if you find yourself struggling with categorical data and you're wondering, why is this hard? It's just a bunch of words. It's just a few categories and it seems like it should be easy.
You're not alone. It is difficult, and there are fewer options for categorical data. Our go-to for numerical data, for looking at the association between two columns in a data frame, would be a scatter plot. You can't do that with categorical data. If you try, you get something that's meaningless, because you just get a lot of overlapping points, since you're limited to a few levels for each of those categories. I guess it could work in certain situations with a huge number of categories, but that is not very common.

Now let's consider the multivariate part of graphing multivariate categorical data. Sometimes people think they're graphing multivariate data when they're not. I have this issue with students. They'll give me a lot of things that look like this, where it looks like they're considering a lot of variables together. This is the percentage of adults who've ever tried an e-cigarette in their lifetime, according to a 2014 survey that was also published by the CDC. So we have the percentages for men and women, for different age categories, and for different racial categories, but none together. In my mind, that's just a series of univariate data. It's not truly multivariate because you don't have any interactions. You don't know how two or three of the variables connect. So we're interested not in data that looks like this, but in truly multivariate data.

Now, I'm going to be talking about mosaic plots, which are good for proportional association, as I've said. Luda's going to discuss alluvial diagrams, and the big punchline there is that they're great for change of state, when you have multiple states and you want to follow a flow from one state to another. There are other types of visualizations for categorical data, and we certainly don't mean to imply that mosaic plots and alluvial diagrams are the only way to go. The most common way to show categorical data is with a bar chart.
When you use fill and faceting, you can increase the number of variables that you can show, and I'll give you a quick example. Cleveland dot plots are an underutilized form that also works really well if you're trying to show frequency. So what we're talking about here is where you really just want to know what the highest number is. In that case you're not necessarily interested in mosaic plots. One criticism of mosaic plots that I've heard is that it's hard to read numbers from them. Well, that's not the point of them. The point of them is to look at the associations among variables, so you're not as concerned with the frequencies there.

So this is the same data, with labels this time, for the 2019 births. It's faceted on the pre-pregnancy weight groups, the x-axis is used for the age categories, and the fill for the gain. It's common to use the fill color for the dependent variable. We can see here easily what the highest number is: the highest number of births were to women in the 125 to 174 weight category who are between the ages of 30 and 34, and then within that you can see that weight gains of 20 to 30 and 30 to 40 pounds are the most common. The Cleveland dot plot shows you the same information. This time I faceted on age instead of on the weight category, and again we see the highest values are for those 20 to 30 and 30 to 40 pound gain categories for women who are age 30 to 34 and in that 125 to 174 category. So if you really wanted to see what is the most common of those 196 possibilities, a Cleveland dot plot does that well. But again, that is not our concern here.

So we're going to move back to the issue of proportion or association and consider a very simple data set. It involves whether older Americans are more interested in local news than younger Americans. And this is data from the Pew Research Center.
They surveyed almost 35,000 adults and asked many questions, including whether they follow local news very closely, and 34.5% said yes. The sizes of the groups based on age varied from about 2,800 to over 11,000. We're not going to start with the original data. We're going to start with a hypothetical and think about what it would look like if there were no association. So if age didn't matter, if people followed local news at the same rate regardless of their age, what would the breakdowns look like? Well, we would just take 34.5% of each of these group sizes and say that those are the followers of local news and the rest are the non-followers. Again, not the real data.

Let's see what a mosaic plot of this data looks like. We build it by first making cuts in one dimension on the age category and then on the group category, whether they're followers or non-followers. And there are a couple of very important design decisions here, as well as information about how to read it. So the first thing we see is, as we might expect, because this is kind of rigged data to show no association, we do get that graph paper, Manhattan grid look, because the proportions are the same regardless of age. In terms of the design, this is following best practices. The guru here is Antony Unwin, who wrote Graphical Data Analysis with R and who recommends drawing mosaic plots in this way. That includes cutting on the independent variable, age, first, and making that cut vertical, and then cutting next on the dependent variable, assuming there are just two variables. We'll get to what happens when there are more later. So the second cut is horizontal, and it cuts on the dependent variable. And then within the categories of the dependent variable, we put the one that we're most interested in on the bottom, closest to the x-axis, or where we imagine the x-axis would be, and use a more prominent fill color for it.
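The no-association hypothetical and the design recommendations above can be sketched in a few lines of R. This is a minimal sketch under stated assumptions, not the presenter's code: the age-group labels, the group sizes, and the colors are invented for illustration, and only the 34.5% rate comes from the talk. The plot call uses `vcd::mosaic()` with a formula, a `Freq` column, `direction`, and `highlighting_fill`, and is skipped if vcd is not installed.

```r
# Hypothetical age-group sizes (assumed for illustration; not the Pew figures).
sizes <- c("18-29" = 5000, "30-49" = 11000, "50-64" = 10000, "65+" = 8000)
rate  <- 0.345  # overall share who follow local news very closely (from the talk)

# Under no association, every age group follows at the same overall rate.
news <- data.frame(
  age    = factor(rep(names(sizes), times = 2), levels = names(sizes)),
  # "yes" is the level of interest; it is placed last on the assumption that
  # vcd draws the last level of a horizontal split at the bottom.
  follow = factor(rep(c("no", "yes"), each = length(sizes)),
                  levels = c("no", "yes")),
  Freq   = unname(c(sizes * (1 - rate), sizes * rate))
)

# Share of followers within each age group: identical by construction.
follow_share <- with(news, tapply(Freq * (follow == "yes"), age, sum) /
                           tapply(Freq, age, sum))
print(follow_share)

# Draw it only if vcd is available.
if (requireNamespace("vcd", quietly = TRUE)) {
  vcd::mosaic(follow ~ age, data = news,
              direction = c("v", "h"),   # age cut vertical, follow cut horizontal
              highlighting_fill = c("grey85", "steelblue"))
}
```

Because every age group has the same share of followers, the horizontal cuts line up across all four columns. Changing any one group's counts breaks that alignment.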
So you really see where the followers are. Now let's see what the actual data looks like and how it compares to this. We end up with a more staggered look, like in the first mosaic plot that I showed you, where we see that in fact the proportion of young people who follow the local news closely is much smaller than in the 65-plus group. There is a gradual increase as you go through the age categories in terms of the proportion of people who follow the local news. You can see how that jumps from one to the other.

Now, if I call this the expected values and this the observed values, what does that bring to mind for anyone in a statistical direction? Any tests come to mind?

Chi-square?

Yes, yes, yes. So you can think of the mosaic plot, and you don't have to, but you can, as a visualization of a chi-square test, kind of like the way that a scatter plot is the visualization of a two-dimensional linear regression. This is really part of the design: it visualizes the observed and expected values in a chi-square test. In this particular case, the association that we saw is statistically significant if you perform a chi-square test. If this doesn't speak to you at all, that's fine. You can make use of mosaic plots and learn a lot about categorical variables without doing any tests. I think exploratory data analysis stands on its own. You can think of it either as a precursor to a more formal statistical analysis or just in its own right.
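For anyone who does want to connect the two pictures to the test, here is a minimal base R sketch. The counts are hypothetical (assumed for illustration; they are not the Pew data), chosen only so that the share of followers rises with age. `chisq.test()` returns the expected counts under no association, which correspond to the graph-paper plot, alongside the observed counts behind the staggered plot.

```r
# Hypothetical observed counts (assumed for illustration; not the Pew data).
observed <- matrix(
  c( 750, 4250,
    3300, 7700,
    4000, 6000,
    3700, 4300),
  nrow = 4, byrow = TRUE,
  dimnames = list(age = c("18-29", "30-49", "50-64", "65+"),
                  follow = c("yes", "no"))
)

test <- chisq.test(observed)

# Expected counts under no association: row total * column total / grand total.
expected_by_hand <- outer(rowSums(observed), colSums(observed)) / sum(observed)

print(round(test$expected))  # what the "graph paper" mosaic plot is drawn from
print(observed)              # what the "staggered" mosaic plot is drawn from
print(test$p.value)
```

The expected table keeps the same row and column totals as the observed one; only the way the counts are distributed inside the table changes, which is exactly the difference between the two mosaic plots.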
I'm missing it. Okay, yes. So let's break it down more. You may have noticed I have not shown you any code yet. I believe in showing you the theory of how a mosaic plot is constructed first. There are different packages. I'm going to show you how to do it with the vcd (Visualizing Categorical Data) package, but there are other packages in R, and there are plenty of packages in other languages, and I want to really focus on the principles so that you could do this anywhere. So we said the first cut is vertical, and that is the independent variable. The second cut is horizontal. And remember, when you're reading a mosaic plot, you think about what it would look like if there were no association.

We're going to add another hypothetical besides no association. You also want to keep in mind what it would look like if there were a deterministic relationship. So if your age completely determined whether you follow or don't follow local news, what do you think it's going to look like? Probably something like this, where there are no followers in the younger age groups and then everyone in the 65-plus age group is a follower of local news. So these are the extremes, and we're trying to figure out where we are in between. Like I say to my students, this is what statistics really is all about: trying to figure out where we are between no connection and complete connection. If there's no connection, you don't need these techniques. If there's a deterministic relationship, you don't need these techniques either. We're always in that murky area where there's a connection, but it's not 100%.

So we're going to go back to the birth data and just familiarize ourselves with these categories. There are seven age categories, four pre-pregnancy weight categories, and seven different weight gain categories. The data was already in groups. I cleaned it up just because there were some, and this is one of the challenges that you face
often with categorical data, in that you'll have some categories that hardly have any data in them and they're really hard to show. I just grouped everything above 225 into this one, and everything above 60, because they're small, so it's not going to distort things to put them together. But that all depends on the situation and what's going to work in that particular situation.

A couple of other things. I'm using sequential color schemes both for weight and for weight gain because it's ordinal data. There's a natural order to the categories, and you must preserve that order. If you had nominal data, which doesn't have a natural order, then you don't have to do it that way. In these two cases, I'm treating these as dependent variables, and as I said before, you want the most important category to be at the bottom and to have the most prominent fill color. And it depends again on what you think is most important. I'm assuming here that the high weights are important for medical purposes and pregnancy risk. I'm not a medical person, but I would guess that a five-pound weight gain during pregnancy is also of concern, and so if that's what your focus is and you're trying to figure out why people aren't gaining weight during pregnancy, you may want to reverse it and put that on the bottom. So there is some choice.

So this one shows pre-pregnancy weight as the dependent variable based on age. What do we think? Is there association, no association? What do you see here, Alisa?

There is some association.

Right, there is some. What do you notice in particular?

That the lines move in a, how to say, it looks scattered. The areas are different in the different rectangles. The ratios are different.

So interestingly, weight doesn't seem to make
much difference above age 20. Yeah, it doesn't seem to make a lot of difference. There seems to be a much smaller weight gain for the 15 to 19 category. Remember, this is all just pregnant women, so I would expect to see a stronger association between weight and age, as they get older, if we were dealing with all people, not just pregnant women.

Right. Good, Michelle. So if we look at weight gain against age also, there's a slight association. It's very consistent, though, that the older pregnant women tend to gain a little bit less weight than the younger ones, for whatever reason, I don't know. This is a little bit of an exception: there's a little dipping down and then up again for the 30 to 34 category.

This is the one that is interesting. It shows the strongest association, and one that we weren't necessarily expecting as we were exploring this data: the mothers with the highest pre-pregnancy weights actually tended to gain less weight. Maybe that makes sense biologically, but it wasn't necessarily what I would have expected. You can see clearly that the gain of zero to 10 pounds increases as the pre-pregnancy weight increases. You also see, though, that the highest weight gain increases as the pre-pregnancy weight increases. So there's the greatest variance for the mothers with high pre-pregnancy weights, but there's clearly a strong association there.

And now we are moving on to three variables. With three variables, again, your dependent variable will be the last cut, and horizontal. For the order of the other variables, try it both ways: try cutting on one first and then the other, and then going back. So where was the first cut here? Right, the first cut is on weight. And you can tell that because the groups are together for the different weight categories. So you see how this line goes all the
way across, all the way across, all the way across. So that was the first cut, where these borders are. You can control whether there are borders or no borders. I put borders on this one just on the first cut. The second cut, within each weight group, is on age, and then the last cut is the horizontal one for the weight gain category. You can see that within each weight group the gains are pretty similar. Within each of these, you're getting the grid, and then there's a big difference, as we've seen before, between the weight categories. So let's say you try the other way and do the age cut first. That really shows you the strong association between gain and weight that we saw before. So trying things in different ways helps you visualize and understand those relationships.

The most important things in drawing mosaic plots are the order of the cuts and the direction of the cuts. I'm going to show you some things about the labels, but I will warn you, the labels are a pain, and you can't always get what you want. I'll just put it that way. You can get close. If you see, this time I left the categories on the bottom, and this time I labeled just some cells, and even here it's not perfect. It's difficult, but I'll get more into that later. There's another example here. This is with the housing data in the MASS package, another good use of a mosaic plot.

But I'm going to move on now to the coding, so we're going to go to RStudio. First I'll just remind you about the dependent variable being split last and being split horizontally. The fill, which is called highlighting, only affects the dependent variable. That was a little confusing when I was first doing them because I wasn't sure what I was setting with the fill. It's setting just the last cut. Your best bet is to try to split the other variables vertically first, and you can experiment in a lot of
different ways. These are not hard and fast rules, but these are guidelines. The number of different ways you can do it goes up quickly, exponentially, with each new variable, since you can change the order and you can change the directions of the cuts. So there are a lot of possibilities even with three variables. Try vertical, vertical, horizontal first, and if that doesn't work, you can try something else.

Now here's the introduction to the coding. As I said, we're using the vcd package. There's also ggmosaic. I want to like ggmosaic more than vcd because I like working in the tidyverse. I use ggplot2 all the time. It's just not at the level of vcd in terms of options, in terms of stability, and everything else. Maybe it'll get there. It's not there right now, so I'm more comfortable doing these using vcd. There are also some workarounds in ggmosaic for labeling, because there are some things that just don't work within the ggplot2 framework. Students of mine have used it, though, and used it successfully. I don't want to discourage you. This is just my personal preference, so I'm going with this.

The most important things are the order and direction, once again. I'm going to say that a lot. The order is controlled by the formula in your call to the mosaic function in the vcd package. Using formula notation, the dependent variable is on the left of the tilde and the independent variables are on the right. The cuts go in order, beginning with the first independent variable and ending with the dependent variable, so the order of the cuts here is weight, age, gain. Then comes the data, and then the direction. The direction follows the order of the cuts, not the order of the formula. So this first v connects with weight, the second v connects with age, and this h for horizontal connects with gain, the last cut. So if by the end of this you just learn the order and the direction, feel
good. That's great. All the rest is just gravy: the labels, the colors, and all that. And the level of the dependent variable that we said we want to be the darkest or most noticeable, closest to the x-axis: how do you think you control which level of the variable is closest to the x-axis, or at the bottom of the mosaic plot?

The relevel function?

Exactly right. Our favorite thing: releveling the levels of the factor variables. I like to use the forcats package. There are a lot of nice functions there for recoding, relabeling, and lumping things together. I don't have time to go into a lot of that today, or really any of that today, but it is in my code, so you can look at it and ask questions about it. So we're going to go to the mosaic code-along. Here's the reference for more from Antony Unwin. How am I doing with time? I'm not sure exactly when we started. I think I have till 4:30. Is that good? Okay.

So we don't want this, we want this. This is all just here for you to reference later, all of my data cleaning. So I'm going to just point out a few important things, and this is to save you time, because some of these things just were not obvious to me or intuitive. You must have a Freq column, even though you don't use it. If you noticed, in the code I showed you there was no frequency column mentioned. I only mentioned the dependent and independent variables. I didn't mention the count column or frequency column. It has to be there, and it has to be named Freq, F-r-e-q, like that. In this data set it was listed as birth. I changed it to Freq. I'm using data in a tidy format, and I'll show you what it looks like, with a frequency column. You can use tables and it all works fine, but I'm assuming most people are more comfortable with the tidy data format. Now the other super important thing to know about vcd
is that you cannot have spaces in your variable names. So I'm creating a very short, single-word name for each of these variables, and then I'm doing things that I mentioned before, like combining and recoding things so that I have a smaller number of categories. So we're just going to do that, and now we'll see what the data looks like for starting. We now have just four columns: the age, weight, and gain, which you'll notice are all factors, and the frequencies. And we can just look at the beginning: this stands for under 15, and then there's the pre-pregnancy weight category, the gain category, and the frequency.

As you work with mosaic plots, I encourage you to start small. Don't try to make a mosaic plot with four variables right off the bat. Start with one variable and then add. And this is not just for learning purposes. When I'm exploring data, this is how I do it. I look at one variable and then I start adding to it. So we can look at the age variable, and you see how we have these seven categories now that I've cleaned it up a little bit. Make sure that they're all in the right order. If I want to change the direction, we use the direction parameter. The default is to start horizontal. I want the first cut to be vertical, though, so I'm going to change it, and then we get that vertical cut. You can play around with looking at other variables. Like, you could look at the weight variable first. Whatever you want to do there.

But now let's move on to two variables. Very simple: we can look at gain on age. Gain is our dependent variable. This looks like the graph I showed you before, but the cuts are in the wrong direction. The order of the cuts is right, age first, then gain, but the direction is wrong. So I'm going to fix the direction by setting an array here indicating that the first cut is vertical and the second cut is horizontal. And now it's looking
a lot more like the graph that I showed you in the slides. Now some cleanup. We're going to rotate the labels. If you're used to base R graphics... How many people use or have used base R graphics? Show me with the thumbs-up sign. So we know that when you're setting parameters in base R graphics, you start with the bottom for the sides. For example, you would do side equals one, side equals two, side equals three, side equals four. Here it starts with the top. I don't know what the logic of that is. That was another thing that I couldn't figure out at first: why the labels weren't being rotated the way I wanted them to be. The first one here is the top, and then it goes clockwise. So if you want to rotate labels, you can do that. The defaults are not zero, zero, zero, zero. Zero, zero, zero, zero is all horizontal. The defaults are, I guess, 0, 90, 0, 90. I don't recommend 45, by the way. As much as possible, you want things to be horizontal.

Now, a nice way to get sequential colors is with the RColorBrewer package, using brewer.pal. If you know the number of categories you need, you can just give that, or you can set it equal to the length of the levels of weight. This is doing the same thing. I know that there are four weight categories, so I pick four colors and use the Greens palette. If you look in the help for, I think it's display.brewer.all, you can see what the different palettes are. For this type of data, we want to use one of these sequential palettes. And then, oops, okay. Now we're getting a graph that looks like the one I showed you in the slides. Not a lot of code, just being very specific about where we want things to be in the plot.

If you want to move the labels down, there are a lot of things you can do with labeling_args. This tl means top-left, so this is saying top is false and left is true, so we'll end up just
moving the age ones to the bottom, if that's what you want to do. We can change the variable names. It's a little frustrating when you're used to the tidyverse and being able to use spaces in the variable names, and then you have to take them out. So you're probably going to want to do things like this to make your axis labels clearer. I have a couple of other things here, like moving the variable names a little bit or justifying the labels. These are just some minor changes to clean up, although here it looks like I introduced a little bit of a problem as well.

Now, three variables. Again, it's just adding another variable to the list in the formula. So now we're splitting on weight first, then age, then gain. I'm making the font size smaller and changing the spacing. This means the spacing goes by the cuts. For every parameter, you should think about whether this is a parameter that relates to the variables or a parameter that relates to the sides of the plot. Direction relates to the variables: the variables get split vertically or horizontally. The spacing relates to the variables: is there going to be a space between the levels of a particular variable or not? Label rotation, as we saw already, has to do with the sides. So this is an array of three. The 0.3 means the spacing for weight should be 0.3, so we're going to have a little bit of a border between the weight categories, but not between the age or the gain categories. And that looks like this. I'll try to make that a little bigger so you can see it. There is a way to not repeat the label categories, but it keeps the first ones, so that doesn't really help a lot in this particular situation. I'll show you quickly where the code is for how I did it in the slides with labeling the cells, but it gets a little bit involved. That's not where I would recommend starting. So here, if you're coding along, change
things around. Change this to age plus weight to see what happens, right? Or, you know, you could just take these out, take out the direction, and see what you get. You can see that if you just do things kind of randomly, you're going to get something that's a lot harder to read than what you get when you follow recommended practices. So if you're comfortable with tables, there's a lot you can do that you can't do with data in the tidy form. One of those things is labeling the cells. I'm not going to talk through this code; I'll just point out that it's here for those of you who are interested. I'm making the same graph, just using the data in table form instead of the more tidyverse-friendly data frame form. So here's the same graph, just with the data in a different form. And now you can create labels, including conditional labels, and label things using labeling_cells. I'm not sure why I picked those particular ones. Maybe a lot... okay, probably those instead. You get the idea. So that code is all there for you to review, and it should be helpful when we get to the lab later, and you can decide whether you're going to work on mosaic plots or alluvial diagrams or both, and we'll be here to help you with that. I am out of time, so I'm going to turn it over to Luda to talk about alluvial diagrams, and we can take questions, I think, during the lab or after the lab. I think that'll work best. So this is my first example. This is all based on fake data, as I like to indicate with the kind of ridiculous unit titles, but it's based on a real-world example. So as I said, I work at Amplify. We're a tech company, and so we have a lot of assessments that we like to look at. I work specifically on our science units, looking at that data. So my quest to understand alluvial diagrams was kind of born out of a real-world question. My manager asked me to make some
graphs that showed the movement of students from their score level in the pre-unit assessments to the end-of-unit assessments. And he just said, go ahead and make these graphs. He didn't tell me how to do them; he just kind of sent me out there. And I'll say that it took me a little while, it took me a couple of days to actually figure out how to make these graphs and make them look good. So hopefully I'm here to save you time on that. There are several different packages now to do this in R, and I've tried several, I'll say that. So today I'm going to focus on using the ggalluvial package. I found it kind of the easiest to work with at this point, though I'd encourage you to try out other packages too. I have another package referenced in the resources at the end, but that's the approach that I'm going to focus on today. So as you can see, we have these graphs. We have students, and their scores are categorized from one to four. And so we wanted to see how many students are moving from a low level into a higher level, and we split that out by the units to see if there were some real disparities in these units. So I'm going to walk you through how to make this graph, I'm going to show you one other example with the code, and then I'm going to show you some examples of alluvial diagrams that I found in the wild, and we'll kind of talk about the pros and cons of some of those. So the first thing to talk about is when to use alluvial diagrams. I think they can feel really cool, and you want to use them because they're this interesting, different data visualization style. However, you really want to make sure that your data makes sense for an alluvial diagram. Are you showing groups moving from one state of being to another? I think that's the absolute best use case for an alluvial diagram, and a
lot of other diagrams don't quite work as well when they're not actually showing flow. And I also think you want to have a reasonable number of groups and states of being. I'm not going to give a hard and fast rule on those numbers, but, you know, if you've just got a couple of groups, yeah, it's... You're breaking up a lot. ...it's not going to be very interesting. But also, at the other extreme, often where there's just... Luda? Yeah? There's a problem with the... categories, you've just got... I have an idea, Luda. ...middle ground. I see there are some toppings easily. It's okay. Yeah, that's better. As a backup, I could show the slides so that you can talk. So let's try this though, and then... Okay. Yeah, I have the slides all ready to go, but you sound okay now without the video on, so why don't we try this. Okay, yeah, so I think that this will be better. Okay, and please do stop me again if it's sounding bad again. Okay. All right, so we've got our axes. These are the different states that the graph is showing movement between. Here I have just a completely blank graph. Well, actually this is an example that we'll see later on, but here I've just stripped away most of the labeling so that it's very easy to focus on these elements of the graph. So we've got these axes. Here I've just labeled them really plainly, you know, first, second, third, fourth. So these are the different states that we're going to see movement between. We call those axes, and that's on our x-axis. On our y-axis we see the strata, the groups that we're going to see at each axis. So one little box like this is called a stratum, and the whole group of them are strata. And then we've got flows. Flows are movement from one state or axis to another; they're just from one to the next, whereas alluvia show the movement across all states or axes, all of the flows together. And then we've got those lodes. Lodes
are the intersection of one alluvium and one stratum. So all together, here's all of our terminology: we've got our axes, our stratum, our flow, our alluvium, and this little thing, the lode. All right, and I'll keep using this terminology, and you can refer back to this slide. Oh, I do want to say also, I think you all have already cloned the repo and you've got the slide code. So I'm going to be showing all the code from my slides. You can follow along in the Rmd of the slides, that's in the slides folder, and you should be able to run all of the code that's there on the repo. So you're welcome to run it along with me, or you can just watch me as I do this. Either way works. So I'm going to start making the graph that I showed you at the beginning, and I'm going to build it up level by level. So here we're starting with this simulated data with the pre and post, and here I'm just showing you the first five rows. So we've got a student ID; you know, usually there's a UUID, it's something much more complicated, but here for simplicity I'm just giving you numbers. And then you can see the unit title, you can see what type of assessment the students are taking, and you can also see what score level they achieved on that assessment. Right now it's in the long format. In order to use the ggalluvial package, the easiest thing to do is to first make sure that your data is in the wide format by your axes. So my axes, my two states, are going to be the pre-unit and the post-unit. So here I'm going to go ahead and pivot my dataset wider in order to have it wide by these two elements, right? So I went from long by assessment to wide by assessment, and these are the score levels. All right, this allows us to use the to_lodes_form function in the ggalluvial package. And this might seem a little odd to you, because now it looks like we're going back to the long format, which we are, essentially,
but we're adding these two variables, the alluvium and the stratum, and this does it for you right out of the box. You give it the key, what you want to call it, so here I'm calling it assessment. And then you tell it which axes, so which variables are your strata. So here we have one, two, three, four, so pre and post are positions three and four, so I just give those positions for the axes. And then it sets this up really nicely for you, so you're set to go in order to make the graph. So now we get to go ahead and start graphing. Right, so before we do that, I just want to talk about the plot elements. There are kind of two plot elements that you're going to have to choose between, depending on what kind of graph you want to make: either you're going to use geom_flow or geom_alluvium. So what is geom_flow? That's going to give you the flows from one axis to the next, you know, it's just one axis and then the next one, whereas geom_alluvium is going to give you the alluvia across all of the axes. But I just want to point out that for geom_flow it's nice to set your alpha to be fairly transparent, so you can see the flows and how they intersect with each other. For geom_alluvium, you have options in how to set your fill, based on maybe the starting state or the ending state, so we'll see that in another graph soon. And then your last plot element is geom_stratum; that's going to give you the strata. So let's go ahead and start graphing. So the first thing I'm going to do is set up my ggplot with my aesthetics as usual. My x-axis is going to equal my categorical axis variable, so that's going to be assessment here, right? So this is, you know, pre or post in this case. And then this is so easy, because we've used to_lodes_form, we can just set stratum equal to stratum and
alluvium equal to alluvium. Super straightforward. And then here I'm just going to add geom_stratum. So with this, you can see that we've just got our strata here. We don't have the flows between them, and we don't have any color. So next I'm going to add the flows. This is just added here; like I said, I add alpha 0.5 so it's fairly transparent. So just adding this little chunk here. So now we see the flows, but this isn't giving us a lot of information yet, right? Because it's a little hard for us to see what the actual flows are. We need to add a fill. So here I'm setting my fill equal to my stratum, and that's within the aesthetics. Oh, I'm so sorry, let me get back to that. And that gives us the actual colors. So here I want to point out I'm just using the total default ggplot colors, right? So we're going to beautify this. This isn't how it should be in the end, but just to point out, right now that's where we're at. So here we've added fill. The next thing I want to point out is, just as Joyce talked about, I actually want to switch my factor order. I want four to be at the top, because that's the top score, that's the best, and I really want to be able to track my students going up to four; that's what I'm most interested in. And so I'm going to hit it with that fct_rev from the forcats package, and now I've got four at the top instead of the bottom, so that's nice. And that's something you want to pay attention to: what is the category that you're most interested in tracking with your eye? So then the next thing I'm going to do, and this is specific to this graph, but I wanted to point out that it's easy to do: here I'm going to get six graphs for the price of one by using facet_wrap by my unit title. So this was across all of my assessments, but I really want it to be by unit
title. So now it's split out by each of these ridiculous-sounding units, and I can see the flow for each of these units. And this was actually really relevant to me in my work, because it allowed us to see that for some units we've got a lot of students going from one all the way up to four, whereas in some units we have very few students going up from level one. And this let us start inspecting those units a little bit more, looking at the assessments and maybe what the students were learning. It also was a way to verify some of our suspicions about some of these units, so it was very useful. So next I'm going to manually add some colors. As Joyce mentioned, you can use RColorBrewer, you can use various palettes, there are a lot of different ways. I actually used RColorBrewer to pick my colors, but I'm just showing you how to add them manually, or that's what I ended up doing in this case. So here are the added colors. So at this point I want to talk about use of color in your alluvial diagrams. Your strata are going to be one of two types, right? They might be like this, where there's an underlying ordinal element, the variable is ordinal. And in this case, level one we actually consider a showing of lack of knowledge of the science, whereas two, three, and four are an increasing intensity of understanding of the science concepts. So I chose to make level one gray, while I made two, three, and four increasingly dark. So you'll want to use this kind of sequential palette in this case. However, in the next example you'll see the categorical variable is discrete. There you'd want to use a qualitative palette, where you have separate colors rather than showing a sequential element. So here I'm just beautifying the graph. I find that these defaults are quite small and hard to see, and I
just like to fix things up a little bit. So I'm not going to go through everything here, but, you know, I leverage the theme in order to take away some of those grid lines and change the text size, with the legend at the bottom, because I think that gives us more space. I also like to set the labels to comma, so that the axis has commas in it. It just makes the numbers easier to read, like you can see here with the thousands. So here's my final graph, as we saw before, just with that one big chunk of code there. So there we are. All right, so I'm going to move on to another example now, but I just wanted to give you a second to look at this a little bit more. So here I was using geom_flow, because we're just moving between two states and we're just showing those flows. But now I want to show you a geom_alluvium example, where we're actually following an alluvium all the way across the graph. So this is what we made... it was made by David Noiserling. I really love this example. This was an example of his going through the process of finding a data science job. So I think this is really interesting for all of us that have gone through that kind of brutal process, and I think it's also really wonderfully done for several reasons. So I'm going to talk through this as well, kind of more quickly though. So here we see the raw data. Again, here you can see his data is already wide by these states, right? So we have contact, first stage, second stage, outcome. Let me just talk through this really quickly. So this is the first contact: was it internally at his workplace, through one of those job sites online, through LinkedIn, through his own personal network, or through a recruiter? And then what did that lead to in the first stage? Like maybe a coffee meeting, maybe it
was ghosted, right, and the people stopped communicating with him, or a phone call, or he was rejected. And then there was a second stage: he was either ghosted or there was an interview, or maybe that role disappeared, we've all had that happen, right? But, you know, he also withdrew from some of these roles. And then we've got a final outcome here, where he was either ghosted, rejected, withdrew, or we've got this final offer. And you can see this alluvium starting all the way at the job site, on some job site he went to, and it came down to the phone call, and then came over to the interview, and that gave him an offer. So we can follow that whole experience all the way through, sorry, the alluvium through all those different axes. So here also I want to point out the color choice is nice, because these are discrete categories, so we've got a qualitative color palette. I do think that this might not be colorblind friendly, but I didn't want to change his graph, because I want to present what he did himself. But it is good to pay attention to being colorblind friendly, especially when you've got these kinds of noodly spaghetti strands going everywhere. So yeah, looking at the raw data, we've already got it wide, so that makes it really easy for us to go ahead and... oh sorry, right before that I want to point out that we've got NAs in some of these stages. So he's made this final outcome column that lets him color the alluvium by the final outcome, and so we're just using this coalesce function, and it takes the first non-NA value, and this is going backwards from outcome to second stage to first stage. So we've got a final outcome column as well, and now we can use this nice to_lodes_form. The key is, he called it contact, and the axes were two to five here, right? So that's two, three, four, five, right
there. And then x is equal to that contact, that's the key, stratum to stratum, alluvium to alluvium, and label. So here he's adding a label to each stratum in the graph, so you do that with label equal to stratum. Now we're adding the geom_alluvium, and he's setting the fill to his final outcome, because he wants to show what happened in the end, whereas with my graphs I wanted to show where students were starting out, and that would allow us to see really easily, okay, so they started at level one and now they're ending up at level two or three or four, right? This color equal to dark gray lets us see these nice little lines; it adds a little border to each of the alluvia, so it lets us see them a little bit more easily. And then the na.rm is necessary in both the geom_alluvium and the geom_stratum, because he has some NAs in some of those rows, so you want to keep that in mind. So then we are adding the geom_text with the stat equal to stratum, and doing some beautification here as well: also using the minimal theme, changing the text size and the legend position, adding a caption, and also manually adding the color choice. So here we have that final outcome again. And I wanted to just contrast this with if we had chosen to use geom_flow. So if his decision had been, I want to visualize what happened between each of these different axes, that might have been his goal, right? In that case we would have used geom_flow, and that would have let us see, like, okay, all internal contacts led to a coffee meeting in the first stage, whereas job site led to either ghosted or phone call or rejected. And then we're kind of cut off; we can't follow all the way across. But then we see the coffee was kind of split up into ghosted, interview, no role, and withdrew, and phone call got kind of, like, the phone
most of the phone calls, like the largest bulk, went to interview, but then a good bulk also went to ghosted or withdrew. So we can see that flow from each of the axes, but we don't see the flow across the entire graph. So just, you know, a contrast there. So now I want to get into the examples. I'd like to make this a little bit more interactive if I have the time, but if you give me just a second... So this is our first example. I think this is a great example, I really like it. This is the very recent New York City ranked-choice voting for the Democratic mayoral election, and this is from the New York Times. I think this is a great example of an alluvial diagram in the wild, because, you know, these are the last three rounds, and it's really nice, it really lets us see, okay, so in this case Yang was eliminated at round seven, and so his votes, some of them went to Eric Adams, a very similar amount went to Kathryn Garcia, and then a very small amount went to Wiley, and a good amount also were eliminated, since they were votes for people that were no longer in the running, right? And one thing that I want to point out, a really nice thing that they did here that I haven't talked about, is that rather than just sticking with a color from the beginning or a color from the end, we actually see this color changing from the beginning to the end to demonstrate when votes went from one candidate to another, as opposed to when they stayed with the same candidate. And I think that was really nicely done, really well done. I don't really have any criticisms of this alluvial, but I think it's useful for us to look at some alluvials that, you know, maybe aren't so strong. So this next one, this alluvial diagram is from The Economist, and it's of refugees, their origin and then their destination countries, and whether
or not they were accepted or rejected, their decisions. And I think this is really nicely done, although this is a little bit different than what I've recommended for color choice, because these are, you know, discrete categories in both cases, origin and destination, but they've kind of shown them in these sequential color palettes. But I see that they were trying to group origin and destination together, so I see what they're doing there. Although I think this is a little bit confusing: we get this accepted state, and then we flow back to where the decisions were by origin. So it's a little confusing because you're kind of going backward in the information. I think they could have actually eliminated this section and colored each alluvium by whether it's accepted or rejected. That might have been a little bit more compact or easier. But I think generally this is a fairly good visualization. All right, so this next one, I want to ask what people see with this one that makes it a little different from all of the other alluvials that I have shown. Yeah, so exactly, Nick, there's no transition here from one state to another. Our x-axis is just different demographic categories, really. And I've seen some alluvials that do this, and I think you can gain some information from them, just on extent. However, it's kind of an odd use case for an alluvial. I think where alluvials really shine is showing that movement from one state to another. So here we've got class, the sex, and then the age, and, you know, these are just flows. So really we just see, from first class, how many that survived were male or female, but you can't really follow that across to child or adult. I think this would probably work better, actually, as a mosaic, where you'd see these in relationship with one another, so I would suggest that. And I use this one specifically because it's actually from the ggalluvial
CRAN repository, so you'll probably see this if you go look at the vignette. And I found it super confusing, and it actually took me a while to understand how to make alluvial diagrams, because I think a lot of the examples fall into this category rather than really showing a flow from one state to another. Obviously this is no critique of the package, but I think these examples can be a little bit confusing, and they're not the examples that most clearly show the utility of an alluvial diagram. So that's my point there. All right, so this next one is giving us cancer information. So different cancers are split out by race/ethnicity and gender. But I want to point out here that once again these are flows, and so really all we gain from gender here, or sex, is that we find out that each of the ethnicities is kind of evenly split male/female, which is unsurprising given our, you know, normal population, especially since we've included a bunch of different cancers here. I also want to point out that we've got a lot going on here. There are a lot of strata, and it starts getting really hard to see what's actually happening. So I think this is also an example of kind of overwhelming the viewer. But really my biggest criticism here is that I don't think that this chunk really adds information to our graph in this situation. Here's the last example, and there's just way too much going on here. So we've got different gas sales by state, so petroleum, natural gas, coal, and retail electrical sales, and you just can't see anything, right? Like, this is just complete spaghetti. So I would highly suggest avoiding this kind of approach. So in summary, I want you to ensure that your data really fits the alluvial specifications. Does it make sense? It can be fun to make alluvial diagrams, but I would say often your data might not be the best fit
for an alluvial diagram, unfortunately. So when you do have data that's a great fit, that's exciting, and you can reshape the data to wide by your axes, if needed, in order to use that to_lodes_form function. If it's already wide, you can just go ahead and use it. Consider whether you want to highlight flows or alluvia, right? Are you just showing the movement from one state to the next, or are you showing the movement across the whole graph? Because that can give different information, and I'm not saying one is right or the other, it just really depends on what you're trying to communicate. Also pay attention to your use of color, I would say both in terms of your variable type, so, you know, whether it has an underlying order or if it's discrete, and then also whether you're coloring from the starting state or the ending state, or even, like we saw in the New York Times example, where you might be blending, moving from one color to the other between states. So you have all of these great options for color use. And lastly, have fun. These are really fun graphs, they're really cool to make. I think, you know, they're pretty exciting, so have a great time if you make them. So here I've got some of your basic resources for how to make these. This last one is geom_parallel_sets, from the ggforce package, so there are definitely different options for how to make this
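For reference, the workflow summarized above (reshape to wide by your axes, call to_lodes_form, then choose between flows and alluvia) can be sketched in R roughly as follows. This is a minimal sketch and not the slide code from the repo: the data frame `scores_long` and its columns (`student_id`, `unit_title`, `assessment`, `score_level`) are hypothetical stand-ins for the simulated pre/post data, and the score levels are made up.

```r
library(dplyr)
library(tidyr)
library(forcats)
library(ggplot2)
library(ggalluvial)

# Hypothetical long-format data: one row per student per assessment.
# The levels are assigned by a fixed pattern; this is not real data.
scores_long <- tibble(
  student_id  = rep(1:200, times = 2),
  unit_title  = rep(rep(c("Unit A", "Unit B"), each = 100), times = 2),
  assessment  = rep(c("pre", "post"), each = 200),
  score_level = factor(c(rep(1:4, times = 50), rep(c(2, 3, 4, 4), times = 50)),
                       levels = 1:4)
)

# Long -> wide: one column per axis (pre and post).
scores_wide <- scores_long %>%
  pivot_wider(names_from = assessment, values_from = score_level)

# Wide -> lodes form: adds the alluvium and stratum columns.
# pre and post sit in column positions 3 and 4 of scores_wide.
scores_lodes <- scores_wide %>%
  to_lodes_form(key = "assessment", axes = 3:4) %>%
  mutate(stratum = fct_rev(stratum))  # put level 4 at the top

# Flows between adjacent axes, filled by stratum, one panel per unit.
# Swapping geom_flow for geom_alluvium would instead trace each
# alluvium across all axes.
p_flow <- ggplot(scores_lodes,
                 aes(x = assessment, stratum = stratum,
                     alluvium = alluvium, fill = stratum)) +
  geom_flow(alpha = 0.5) +
  geom_stratum() +
  facet_wrap(~ unit_title)

# print(p_flow) draws the plot.
```

With only two axes, as here, flows and alluvia cover the same span; the distinction matters once there are three or more axes, as in the job-search example.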