Are you a person who likes change? I think I like change, but perhaps I'm a little more conservative than others. I know a lot of people love change for change's sake. They like to be at the forefront of what's going on in the field, with all the new and up-and-coming things, whereas I'm more like, "Well, I like this method. Why would I change away from it? Can you give me a good reason?" In the field of data viz, change is a constant. One of those changes was the advent of the violin plot as an alternative to the box plot. What's a violin plot? Why might you choose a violin plot over a box plot? Why might you prefer a box plot? Stay tuned and we'll cover all that in today's episode of Code Club.

A few years ago, violin plots were all the rage on Twitter. People were really excited to show the shape of their oddly distributed data using a violin plot. What is a violin plot? Great, I've mentioned "violin" about 10 times now without saying anything about what it is. You can think of a violin plot as a histogram that, instead of being arrayed along the x-axis, is arrayed along the y-axis and then mirrored on either side of a point on the x-axis. That gives it an hourglass shape, or perhaps a football shape, or perhaps something that looks like the body of a violin, hence "violin plot". So it shows you the shape of the distribution of the data in the curvature of the violin body. In a box-and-whisker plot, by contrast, the box is a rectangle. It shows you where the middle 50% of the data are, whereas the violin plot shows you where there's more data than in other places. So if your data are highly skewed, you might get something that's flat on one end and peaky toward the top. Anyway, that's what a violin plot is.
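To make the "mirrored histogram" description concrete, here is a minimal base R sketch. It assumes simulated, right-skewed values rather than the episode's data, and it uses a kernel density estimate as the smooth stand-in for a histogram (ggplot2's geom_violin() also draws a mirrored kernel density). The object names `y`, `dens`, and `outline` are made up for illustration.

```r
# Simulated, right-skewed values (an assumption; not the data used in the episode)
set.seed(19)
y <- rlnorm(155, meanlog = 2, sdlog = 0.5)

# Kernel density estimate along the measurement axis
dens <- density(y)

# Mirror the density around x = 0 to trace the outline of one violin:
# up the right-hand side, then back down the left-hand side
outline <- data.frame(
  x = c(dens$y, -rev(dens$y)),
  y = c(dens$x, rev(dens$x))
)

# Draw the outline and mark the median with a horizontal line
plot(outline$x, outline$y, type = "n",
     xlab = "mirrored density", ylab = "simulated value")
polygon(outline$x, outline$y, col = "grey80")
abline(h = median(y), lwd = 2)
```

The widest part of the polygon sits at the mode of the simulated values, which need not line up with the median line.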
And today, we're going to go through how you'd implement the code to generate a violin plot, along with its strengths and weaknesses. Maybe we'll also do a head-to-head comparison of what we've currently got with the box plot and what we can generate with the violin plot. If you'd like to join me as I work through my code here in RStudio, down below in the description there's a link to a blog post that has the code I'm starting with, so you can work along in parallel with me. Also, along the top here is a link to a video I did earlier in this series describing how you can get RStudio, install the tidyverse, get the data I'm working with, and set up your RStudio project so that you can seamlessly work alongside me.

Here I've got my Schubert diversity.r script. This generates the strip chart with a box indicating where the interquartile range is. I'm going to change what I'm outputting this to, so I'll change the name of the output file to Schubert diversity box.tiff. I'm going to go ahead and load everything so I've got it, and I'll show you the output. This is what we finished with after the last episode: we've got the data laid out so that, within each category, the position on the x-axis is random. The outside of the box indicates the interquartile range, the space between the 25th percentile and the 75th percentile, and the thicker solid line in the middle is the median, or 50th percentile. So I've got this, and we'll come back to it later in the episode. For now I'm going to go back to my code and change "box" to "violin", and whenever we update the plot we'll update that file as well.
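The starting figure described above, jittered points with a box spanning the interquartile range and a line at the median, might look roughly like the sketch below. This is not the episode's script: the data are simulated (with the group sizes quoted later in the episode), and the column names `disease_stat` and `invsimpson`, the helper `median_iqr()`, and the width values are assumptions made for illustration.

```r
library(ggplot2)

# Simulated stand-in for the alpha diversity data (an assumption)
set.seed(19)
sim <- data.frame(
  disease_stat = rep(
    c("Healthy", "Diarrhea and C. diff negative", "Diarrhea and C. diff positive"),
    times = c(155, 89, 94)
  ),
  invsimpson = c(rlnorm(155, 2.0, 0.5), rlnorm(89, 1.5, 0.6), rlnorm(94, 1.7, 0.5))
)

# Hypothetical helper: the median plus the 25th and 75th percentiles
median_iqr <- function(x) {
  data.frame(
    y = median(x),
    ymin = unname(quantile(x, 0.25)),
    ymax = unname(quantile(x, 0.75))
  )
}

box_plot <- ggplot(sim, aes(x = disease_stat, y = invsimpson, fill = disease_stat)) +
  # Box from the 25th to the 75th percentile with a thicker line at the median
  stat_summary(fun.data = median_iqr, geom = "crossbar", width = 0.6,
               show.legend = FALSE) +
  # Points spread randomly along x within each category, unchanged in y
  geom_jitter(shape = 21, width = 0.25, height = 0, show.legend = FALSE) +
  labs(x = NULL, y = "Inverse Simpson index")
```

Printing `box_plot` draws the figure, and `ggsave()` can write it to a file.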
All right, so in our code, again, we're loading our packages, getting the metadata from an Excel spreadsheet, reading in our alpha diversity data, joining it all together, getting colors, getting counts, all that good stuff. Then we come to the code for our box plot and our jitter plot. For now I'm going to comment out the code that builds the box plot as well as the code that puts up the points, and let's start with the violin plot. We'll do geom_violin(), open and close parentheses, add that, and voilà, we have our violin plots. They look perhaps a little more like dulcimers than violins, but you get the idea: there are more points where the shape is wider and fewer points where it's narrower. I'm going to turn off the legend so we have more real estate in our figure, and also because, as I've mentioned in the past, there's only one color per column and it corresponds to the name on the x-axis.

We've got our three violin bodies, and again they show us where the data fall. One critique I have of this, which we'll come back to, is that I think this doesn't so much show you the central tendency of the data, where the median or the mean is, but more where the mode is, where it's the widest, which is what my eye is drawn to. Again, humans are drawn to area, so I look at the widest part and say, oh, this is what's really important down here. For some reason I think, well, the C. diff positive individuals have a higher inverse Simpson index than people who have diarrhea but are C. diff negative. But that's where the mode is, not necessarily where the central tendency, like the median of the data, is. That's something we'll come back to as we go through this episode.

I'd like to add the data, so let's bring back our geom_jitter(), and I'll keep everything the same. One of the first things I notice about this plot is that the distribution
of the points in my jitter plot is different from the distribution of points in the violin body. In the jitter plot the spread along x is uniform all the way up the y-axis, so we get a column, whereas the violin body has a violin, or dulcimer, shape. An alternative to the strip chart is something called a sina plot, which jitters the points but does so within the shape of the violin. That's something you can do with another package that's related to the tidyverse. I'm going to use an alternative to both the jitter plot and the sina plot that gives a similar type of effect and that I think might also be effective, and that's a dot plot.

For now I'll comment out geom_jitter() and add geom_dotplot(). What geom_dotplot() does is put my different categories across the x-axis, and then along the y-axis it puts the points next to each other. To do that, it has to bin the y-axis discretely rather than treating it as continuous like we currently have it. I'll show you what this looks like, so let's run this. It's complaining about a missing aesthetic, y, so what I need to do is put in binaxis = "y", and that will bin the y-axis. And we can see that it is binning the y-axis. Let me also add show.legend = FALSE. Now we can see what's going on: starting at the midline of our violin body, it's putting points going outward, and, as the message says, it's breaking the range of the data up into 30 bins. It said you could pick a better value with binwidth, so let's do that, and we'll also center the points in the middle of the violin. We'll do binwidth = 0.5, which we might alter later, and then stackdir = "center", which will stack the dots around the center of the column. I'll put show.legend on the next line and run that. What we see is that at 0.5, binning the y-axis in units of 0.5, we get this shape to
the distribution, with the dots arrayed across like so, and they fill out the shape of the violin body. If we use a smaller bin width, say 0.25, then the points are proportionally smaller in their actual size; in geom_dotplot() the size of the dots is proportional to the bin width. Let's go back to 0.5. I think that looked a little better. Again, we can see that the violin body follows the shape of the data. One place where it seems to go a bit off the rails is toward the bottom, where it doesn't quite know what to do with the data. It seems to be fitting some type of spline and bringing that back down, so it's not a perfect representation of the data.

Something I'd like to try is the adjust parameter in geom_violin(). If we set it to 0.5, that should cause the shape of the violin to fit the data a little tighter. You can see, at least for the healthy group, that it's a little more form-fitting, if you will. These three violins are scaled to have the same area so that our eyes aren't deceived by one having more area than another. Again, it's fitting splines to make the shape of the violin body, and I think with a smaller adjust factor it maybe hugs the data a little tighter, but it also looks a little more squiggly and less well fit for some of the others. So let's try 0.75. That looks a little better; I think the curve of the violin matches the data a little more nicely, so let's stick with that adjust value.

Now, as I was mentioning, one of the challenges with this visual is that my eyes are comparing where each violin is widest, where the most area is within each of the three. I'm seeing it here for healthy, down here for diarrhea and C. diff negative, and up here for diarrhea and C. diff positive. I can fix that
perhaps by including some quantiles. As an argument to geom_violin() I can use draw_quantiles and give it a vector of the values where I want quantiles drawn, so I can do 0.25, 0.5, and 0.75. That gives me horizontal lines indicating where the quantiles are. Those aren't very pronounced. Maybe if I set the alpha for the body to, say, 0.5? That looks okay. I'm not totally loving the 25th and 75th percentile lines. I'd like them to pop a little more, and it's just not doing it for me. Let's give it only 0.5 to draw the median. Now we have the median line, so we can see the median, although I'm still more drawn to the shape than to that line, and I'm not totally sure this violin is doing it for me. I might try making that line bolder, but I can't change the boldness of just that line within geom_violin(). If I do size = 2, it makes all of the lines bold, which is pretty hideous, so let's turn that off.

I'm going to turn off the quantiles and uncomment our old friend stat_summary(). I'll use fun = median, get rid of fun.args, and use geom = "crossbar". If I give it only fun = median along with the crossbar geom, it will draw the median line as the crossbar, and it will be quite thick. I'll leave that, and let me turn off the width for now. Let's give this a run and see what it looks like. We see that we've got those wide bars across the data along with the violin-shaped body. I'm really not feeling it. Let me turn off the adjust for now and see what this looks like. The shape of the violin body isn't so jittery. I don't know that the violin body with the data drawn as a dot plot really gets me that much more than having the dot plot on its own, or maybe even the jittered data with geom_boxplot(). So I make an episode about geom_violin(), and maybe where we end up with is
something like this. I don't know, maybe having the violin was nice. I think I'm really more of a fan of having the box plot with the jittered data. What would happen if we combined geom_dotplot() with geom_boxplot()? Let's give that a shot. It's kind of a rectangular version of the violin plot. Perhaps we'd bring the box in a bit to make it a little tighter.

So let's compare. Let's go back to the violin, and we can compare what we like and don't like about these different representations. I think this is about our best representation of the violin plot, and we can compare it with what we had for the box plot with the jittered data. What do we like or not like about these two depictions? One thing I like about the violin plot depiction, because we used geom_dotplot(), is that we see all of the data. We don't have to worry about overplotting when we have a lot of data. That is a problem with the jitter plot, and it's probably more of a problem here because we have so much data: 155 points in that first column. The other thing I like about the dot plot depiction is that it puts the points right next to each other and centers them on that column on the x-axis. With the jitter plot, by contrast, the points are randomly distributed across the x-axis within each category, but they're not evenly distributed. For a given value of y we don't see things evenly spread out; we get clusters within each grouping. The downside, though, is that to get that effect in the violin plot we had to bin the y-axis, making it more of a discrete variable than a continuous one. In the long run I don't know that that actually matters, but we are removing some signal from the data. We're removing some of the
variation that we see in order to get things to go into those bins, so that's a trade-off. Another challenge I see with the shape of the violins is that in this case I have 155 healthy individuals and 89 and 94 in my two other categories. Roughly speaking, we have twice the number of points for healthy as for each of the other two categories. So I can make the violin fit the shape of the dot plot for the healthy group, but for the other two it's not a good fit. What it's trying to do is preserve the area, because we don't want our viewer thinking that violins with different areas have different importance, or that area is some variable we should be keeping track of. But instead, when I look at this, I'm thinking, why doesn't the violin fit the shape of the data? Why does it form a bulb on the bottom? So I feel like the violin is giving information that's not really there, and it's confusing the audience.

This might also be another case, like the jitter plot, where your audience isn't familiar with the plot type, and because they're not familiar with it, it causes confusion and makes things less clear. So it might be an act of empathy not to use the violin plot, or to use it much more sparingly, when you have a more even distribution of samples across your categories. I also think that if you only had, say, five to ten points, the violin plot really isn't appropriate, because I don't know that you can fit the shape of a distribution around five points. Then again, if you're showing those five points with a box, the 75th percentile and the 25th percentile are probably overstated as well in that case. So I don't know. Let me know down below in the comments. You can probably tell from my reaction that I'm a little torn. Frankly, I prefer the box plot to the violin plot,
but part of my self-reflection is asking whether that's because it's what we've always done, and whether perhaps I need to get with the times and accept the violin plot. But I do feel like there are things going on in the violin plot that cause confusion and uncertainty that have nothing to do with comparing these three groups. It's more about me trying to figure out what's going on with the violin, whereas I don't have that situation so much with the box plot. Like I said, tell me down below in the comments what you think, and we'll go from there.

All right, hopefully this has been a fun discussion of how we can combine some new approaches to visualizing the distribution of data along with the data themselves, and of some trade-offs of both approaches. Again, I think the violin plot emphasizes the mode, or the distribution of the data, more than it does the central tendency, and we could see that in the example we had here. Whether you ditch box plots and move to violin plots is perhaps more a matter of personal preference, but I'd encourage you to be mindful of your audience and what their reaction is going to be upon seeing this. I think that's a good opportunity to remind you that I would generate both of these plots, show them to somebody who might be in your audience, and ask them what they think. What do you take away from these different depictions of the data? What is your initial thought when you see these data? If it's not about comparing the three groups, then maybe go back to the drawing board on how you're designing your visual.

Anyway, enough pontificating from me on that. Please tell your friends about these episodes. Try to engage with the material alongside them, asking them to help you figure out the best way to represent the data. Stay tuned for the next episode, be sure you subscribe, and we'll see you next
time.
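For reference, the violin and dot plot combination worked through in this episode could be sketched roughly as follows. It is an approximation under stated assumptions, not the episode's actual script: the data are simulated with the group sizes quoted above (155, 89, and 94), and the column names `disease_stat` and `invsimpson` are made up for illustration. The median bar passes the same function to `fun`, `fun.min`, and `fun.max`, a small variation on the `fun = median` call described in the episode so that the crossbar has explicit limits.

```r
library(ggplot2)

# Simulated stand-in for the alpha diversity data (an assumption)
set.seed(19)
sim <- data.frame(
  disease_stat = rep(
    c("Healthy", "Diarrhea and C. diff negative", "Diarrhea and C. diff positive"),
    times = c(155, 89, 94)
  ),
  invsimpson = c(rlnorm(155, 2.0, 0.5), rlnorm(89, 1.5, 0.6), rlnorm(94, 1.7, 0.5))
)

violin_plot <- ggplot(sim, aes(x = disease_stat, y = invsimpson, fill = disease_stat)) +
  # Semi-transparent violin body; adjust < 1 follows the data more tightly,
  # and removing adjust restores the default, smoother outline
  geom_violin(adjust = 0.75, alpha = 0.5, show.legend = FALSE) +
  # Dots binned along y (binwidth is in y-axis units) and stacked around the center
  geom_dotplot(binaxis = "y", binwidth = 0.5, stackdir = "center",
               show.legend = FALSE) +
  # Wide bar at the median of each group
  stat_summary(fun = median, fun.min = median, fun.max = median,
               geom = "crossbar", show.legend = FALSE) +
  labs(x = NULL, y = "Inverse Simpson index")
```

Printing `violin_plot` draws the figure. A binwidth of 0.5 is the value tried in the episode; with other data it would likely need tuning, since the dot size scales with the bin width.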