 I've had this weird ambition to implement violin plots in Excel natively without any plugins and I've finally figured out how to do it. Now it goes without saying that if you want to do this repetitively and easily I mean there are better options like R and GG plots that will do it and there are extensions for Excel that will do it I just didn't bother to download it so I thought okay let's figure out how to do it and learn a bit about how you would actually construct these from scratch so what is a violin plot anyway? Well let's start with some data we've got these random issue numbers between 0 and 100 our first protocol would be a box plot this tells us where the median is, it's this line and then the box represents half the data so 25% in this part of the box, 25% in that and the whiskers, all the error bar things that are pointing out here go to your maximum and minimum value or maximum and minimum value excluding some outliers as well so these plots in Excel they do exist as a default option they're not great in being super flexible in their formatting of that look but they do accept just the raw data and they calculate outliers for you so a pretty good starting point but if we look at this data that I've got well there seems to be actually two clusters somewhere around 75 and somewhere around 45 ish and you can see that from the violin plot because this line here kind of represents the local density of the data there's more data up here it's a bit wider there's not as much data down here so it's narrower and I've also plotted the data points with a bit of scatter so they're a bit more scattered where they are denser they're probably better way of doing that but that's just random for now so that's what a violin plot is it is showing you the kind of the distribution the probability distribution the density of the data on this plot and it's a way that the box plot itself wouldn't convey so how are we going to build this line how do we build this density well it's using something called KDE kernel density estimation where each data point kind of has something called a kernel attached to it it's basically a function of a particular shape I'm just going to use the normal distribution all the shapes are available normal distribution is probably fine though and what you can see is we just add all them together so where these dotted lines curve overlap for instance this bigger darker line is much higher so if we move the data around a bit maybe I'll bring 35 bring that up to about 80 you can see that line moves up to here it gets a bit thicker so this graph here is basically based on something on Wikipedia except this version is interactive and we can also change the width of each of those curves as well so it can get a bit spikier or it can get a bit broader so that's the main idea we need a normal distribution for each of these data points centred on each of these data points and we're going to add them all together to build this function here, the dark line so I'm going to do that I've got some data I think this is actually natural students survey data so it's real it has an actual distribution it's not necessarily normally distributed and we're going to see what it looks like the first thing is to make a defined range for this data so I'm going to use the oof the offset function and I want string $1 $1 for that I'm not going to offset it I'll move it around from that point so the next two things are 0, 0 and the height is count $1 A again so just the count of whatever is that column and then width so this now will get all the data and if I happen to put another number on the bottom it'll grab that number as well great fun so far so I'm going to copy that code come into my name manager define a new thing I'll just scope it to the workbook for now and I'm going to call it data paste the code in there so now if I'm referring to my data named range I've got all of these numbers I've added a new one it'll appear in this data set as well so delete that for now so what am I going to do I'm going to start building a normal distribution for these data fonts and then summing them up so I'm going to do sequence just at 100 rows now this is going up on the y axis so for every point here I'm going to calculate how wide this graph should be so I'm going to leave this on the right because it's eventually going to be a y axis but it is sort of our kind of dependent variable when we're plotting a normal distribution so I'm going to write norm.dist and our x it's the input of the distribution is going to be that sequence press hash to select all of it now the mean I want a piece of our data I'm just going to select one of the numbers for now standard deviation I'll look at this 5 doesn't really matter for now this is illustrative false because I want the probability distribution for the normal distribution not a cumulative thing so if I return that you can see the number is very low round 1 10 to the minus 16 in fact when we hit 40ish that is where that data point is centered gets much higher and then it's going to tail off to lower numbers again and we can see if we insert a graph of that I'm going to insert a scatter graph with a smooth line tidy it up a bit get rid of the chart title some of the all grid lines set the axis to just go between 0 and 100 to tidy that up to delete the x axis we can see there's a normal distribution a bump at about 40 something and if I change that to 70 it will redraw the normal distribution a bit higher so we now need one of these for every single data point and that's a bit trickier because we can't just substitute that for say data because a weird behaviour now starts happening it's kind of really jagged and if we scroll down we realise it stops so what this is doing is it's recalculating the normal distribution for each of these inputs but changing the mean each time we also can't simply just wrap it in a sum function to add them up because well let's put the data thing back in and you just get one number sometimes it doesn't even work properly because we are dealing with problems there so what are we going to do first of all I'm going to delete that and I'm going to introduce a function called by row so this is going to solve some of the summation problems that we might have because what we're going to do is calculate how big this density estimate is at every single point by summing things up so I'm going to do by row and the array that I'm doing in by is this input here I'll put it in column I so I1 and the functions are lambda functions so I need to define that as lambda first and then the parameter I'm going to say is x so this is like the lambda function becomes here so this becomes like the input here so now what is my function going to be well I'm going to do the sum of various normal distributions so the norm dist here well the input is going to be x and I want to figure out what the distribution is for that x point by summing up over all the data so my mean is in fact going to be data so for every single data point I'm going to calculate what the normal distribution would be at that point x add them all up and then it's going to do that for every row in my series so it's under deviation I'm just going to call that 4 false for now and make sure I can close enough brackets to end that and what we see is that we've got if I change my standard deviation to 2 it gets a bit spikier so I don't know what I can put over here I'll just put over here and I might define that name as smoothing I'll scope it to the worksheet I can replace 2 here with smoothing and here we are now I can just change the visual of it and broadening that out if I wanted to or that's the density I don't want to make that classic violin plot shape I'm going to make equals the negative of that actually each one I'm going to add new data so first of all I'm going to edit that series and just say call it line plus and then line minus our x values are that entire column and the y values are that entire column going up so what I'm seeing now is that kind of double sided violin plot format that to be blue make sure the blues match so there's my shape so how do you fill it in that's the question because I've filled it in it would be lovely if you want to do it the other way we do have area charts stick this in stick this in for a second if we looked at charts we might have things like where are the areas but unfortunately they only work horizontally so we're going to have to be a little bit more challenging so what I'm going to do is add two new things I'm going to do bar plus it doesn't matter what this is for now because I'm going to change it and then bar minus because what I'm going to do is hack it with a bar chart so chart design change chart type and what I want to go is combo and I need to make sure that the line is still that scatter with smooth lines and the bars need to be stacked bar charts I have to play with the axes setting to make it work because I can't put the stacked bar charts on the same axes so there we go there at least set to the right chart type so I'm now going to select my data and check what these bars are going to be or bar plus now when he needs a set of values I'm going to put in all my positives here ok that bar minus put the negative sides in and what you can see now is I'm starting to shade that in just to keep my y axis and if I format the data series drop that gap width to zero it applies to everything it's shape filled with a lighter blue ok so we're nearly there you can probably see if you zoom in the line and the bars are not quite aligned slightly so the easy hack for this one is to change your sequence and minus 0.5 off it because what we're going to do is that we'll shunt this line slightly almost imperceptively and put it on top of the bars instead if you zoom in really far you'll be able to see it so what we could also do is sequence 200 and divide it by 2 that means things become a little bit finer you could increase that to 1000 divided by 10 or something like that and it could become even finer it doesn't necessarily matter but we can leave it there for now so now there we have we want to start adding something else to it I can put let's put the average average of my data and I'm going to add a new thing here so if I put series name I'm going to call it average x value 0 the y value is that number there ok that and if I go back to my chart design change chart type I can make sure my average is now just a scatter and I've got myself a central point so let's do this movie let's do standard deviation I'll do this population of just data the difference between standard deviation population and sample will be bordering on negligible so let's add format the data series no that's not format data series plus some error bars more options for here custom the value I want y error bar specified value let's put the standard deviation in there and delete the horizontal because I don't really need it so what we've got there is our violin font natively built in a spreadsheet and if we start changing some of the data it will let's change some of these to random numbers they're quite small and we'll watch it start to build up some density at the bottom which you can then spot because of this it's a little bit more powerful that box plot let's get rid of those for a second send it back it's a nice old student survey data I'm pleased to see that not many are scoring below 20 so I can't remember what score this is but it is real data and that's how it's distributed