 So in today's tutorial we're going to talk about the humble histogram. So in my first cell here in my Jupyter notebook I'm just importing my cascading style sheet as per usual just to get away from this bland looking black to just something a bit more interesting. There we go. If I run the cell we see these nice little colors and different fonts for the markdown that is used. So let's set up the library. I'm going to import from plotly.offline my iplot so that I can plot directly in this Jupyter notebook and to the init notebook mode just to initialize the notebook. We did that with the previous tutorial on the bar chart. We're going to do exactly the same here and again we're going to make use of one of the high level charts so we're going to do plotly.graph objects as go will do that import as well. Now for this case I just want to generate some random values so I'm going to import the random module the python random module so that's for easy import random and I also want to import the numerical python library numpies and I'm going to use this namespace abbreviation so I'm going to say import numpy as np. Now I'm going to just seed this cdorandom number generator so that we get the same results every time we generate these random values and I'm just going to seed it with a value 1234 and there we go so let's create a few random values my first variable here is called age and I'm going to just use a uniform distribution I'm so I'm going to say an np for numpy.random so I'm doing the random module in this numpy library and from there I want the uniform distribution my lowest value must be 21 my highest value was 75 and the last argument size meaning I want 100 values from that distribution the next one I'm just going to use the computer variable salary again that's numpy.random and this time from a normal distribution I want the mean which has the argument loc to be 3000 I want the standard deviation which is the scale argument here to be a thousand and again the size the number of samples I want drawn being 100 now I'm just going to use a choice between just purely for the sake of simplicity use a gender binary gender here just for purely for the sake of simplicity we use and I'm going to create the the computer variable binary gender and I'm going to use from the random module I'm going to use random.choices and what you can do with the this choices function is just to pass a list of all these various text the string elements that you want to choose from so I've got female then male and I'm going to say draw at random 100 of those and it is it is drawing that at random with replacement so the first one might be female it puts female back in the box draws another one blindly out of the box that might be female again throw it back so a hundred of those choices now I want to place these inside of a pandas data frame it's a beautiful thing to use I urge you to have a look at pandas I'm going to import pandas and use the abbreviation PD and I'm doing that because I want to create a data frame I'm going to call my data frame DF and I'm going to call the data frame function here PD dot data frame and I'm going to create this data frame out of using a dictionary so I'm going to have this age column and underneath that age column think of a spreadsheet so the header in the first row I have my header values my variables and I'm going to call this first one age and down that column all the rows are going to be populated by these hundred age values the salary is going to be the next column header in my spreadsheet basically which was what a data frame can be seen as in a simplistic way pass all the 100 salary values and then gender this binary gender values of mine so let's run that and I've created a data frame and I'm going to split my data frame in two one for female and one for male and the one that we're going to do how the way we're going to do that create these two computer variables female and male the first one says take this data frame and then we're going to use this square bracket notation because it wants to run down row by row so I'm just indexing this DF dot gender column so it's going to look down the gender column and only select those so this is a Boolean double equal sign only the females and then male will only be contained contain the male so I have two data frames now female and male so let's create our first bare bones histogram now that we have some data again per usual just by convention I'm going to use this trace computer variable and it's go go because this was a graph object one of the high level graph objects and it's a histogram I'm going to pass at the x value because let's just stop there for a moment what you know just to discuss what a histogram is where's a bar chart looked at categorical variables on the x axis the histogram is going to look at numerical variables so age is definitely a numerical variable a ratio type numerical variable because there's a true zero and we have we're going to split that up into little bins and if we just pass the age here plotly we'll decide how large those bins are so what are bins well as soon as I show you the histogram I can show you what bins are again the computer variable data and pass the list of all these histogram elements I want to put on my figure and then instead of using go dot figure I'm just going to use this dictionary notation so I plot and then data please and we see here look at the bottom this is new these are numerical variables it's not categorical and what it's done is it seems to be having plot from 20 to 25 25 to 30 30 to 35 35 to 40 and that's what I mean by these little bins so all it's going to do it takes those hundred values and it and it decides how many are between 20 and 25 and it'll count them and it note that there was eight and between 25 and 30 there was 30 between 30 and 35 there was five so between 25 and 30 there was eight etc and so it goes on so that is why a histogram is not a bar chart and what you can see here also there is really no spaces between these just to indicate that we're talking about a continuous a continuity here where's a bar chart by definition has these gaps and just to create this visual impression that we are dealing with categorical variables that this is not a continuum and here we do have a continuous random variable at the bottom called age now let's just change that to a frequency distribution otherwise known as a normalized histogram because what you can see on the top one here is we count we count how many were between 20 and 25 but if we normalize that it gives us a frequency distribution and the way that we do that is just by in this histogram element that we create at the top we pass a new argument his norm and we pass to it this equals probability I'm going to introduce a layout here by format by dictionary format so if title and its frequency distribution of the age gen age variable and my x axis I'm going to bring in a title also dictionary with a key value pair the value pair being a dictionary itself consisting of a key and a value and why I plot via a dictionary as well and now we can see frequency distribution of age I have age in five year increments here at the bottom but most notably you'll know now you'll see now this this is normalized so the area under this curve which is the area of this rectangle and this rectangle and this rectangle and this rectangle is going to sum up to be one it's mutually these these bins are mutually exclusive but collectively exhaustive so they are all here and the area under this stepwise curve is going to be one so it's a frequency distribution so I'll have to if I want to get how many there are this increment is five times 0.08 to get to to get to the value that we have there that we have well let's have a look so that was eight so remember these are now units of one so 0.08 times times a hundred is going to say it's going to give you that eight so let's create two in an overlay fashion so to do that I'm going to create two traces by convention calling them trace zero and trace one so go dot histogram and I have x equals female dot age and this time I'm going to give it a name so that we can have this little sidebar that indicates what is what and the second one that I plotting it's going to layer them from the back to the front so the back one is going to be female the one in front is going to be male so I just want to lower the opacity of the front one so we can still see the bottom the bottom histogram the bottom trace through that so I'm going to set opacity equals 0.8 everything else being the same I'm bringing in my little layout there and now we can see I've got two plots here it's going to give me this little legend on the side and I've created a bit of opacity here in the orange which is the male just so that this female at the back shines through and but if we hover over them beautiful beautiful it's beautiful what plotly does and you can see that it's still going to give you the values let's stack them so I don't have to do the opacity there and it is just going to stack them but it's still going to give me these values to say in male there was four and in the female there was 12 and look at one thing a difference that it made it is now making these blocks of unit length of 10 years so I've put in that as 10 years that didn't that didn't influence that this was done I drew this first and saw by you know automatically it chose 10 and I put the 10 in there after the fact so let's do that let's control the bin size and the way that we do that is by introducing inside of our traces here we introduce the X bins argument and I'm going to pass a dictionary for that in this format not the curly braces format so start is 20 in that 80 so just the left hand side of the x-axis the right hand side of the x-axis and I want a bin size of five units I'm going to do the same for male and I'm changing my title here to five because now I'm in control of it and if we run this we'll see that now we are back with this increments of five we're back with this increments of five let's just do that again this time we're just going to stack them and just for argument's sake this time I'm looking at female dot salary and male dot salary and we have introduced a bin size of 200 years starting at 500 ending in 5500 and because we do that from a normal distribution you can see this tending towards this normal distribution the last thing I want to show you just in this little tutorial is the cumulative histogram and when you do statistics you start looking at cumulative cumulative distribution functions etc this is important but we're not dealing with that at the moment this is the histogram and the way that I do that inside of this histogram object that I've created I'm going to use the argument cumulative and pass to that a dictionary with a key value pair of enable being true and I use the dict function instead of the curly braces notation here and if we just execute that we see this beautiful step wise and we can see the largest steps here meaning we are dealing with a normal distribution you can see that from this cumulative distribution function so that is that for the humble histogram remember that is for numerical variables and we're just going to start counting them or putting them in a frequency distribution by creating these artificial little bins and we can control the bin size as I've shown you so that's the histogram I'll see you in the next lecture where we continue our look at this wonderful world of plotly and this wonderful library plotting library that plotly is please please please remember to subscribe and hit the notification button because that's going to allow you to know when new videos or new tutorials do come out I'll see you again