We get to one of the most exciting things in data analysis, and that's visualization. Just to see your data come alive as a plot or a graph is absolutely fantastic. So after you've summarized your data, the next step is to visualize it. I'm not talking about the final plots that you create for your report or publication. I'm talking about just getting to know the knowledge that's inside of that data.
I'm going to show you a plotting library called Plotly. There are so many plotting libraries for Python, but Plotly is my absolute favorite. I think as you watch the video you're going to see why I enjoy it so much, and you're going to enjoy it too. It's an absolutely phenomenal, modern library for plotting data. So I'm going to show you the Plotly website and then we're going to dive into the notebook, where I'll show you some basic plots, the ones that you're going to work with most commonly.
Plotly is an enormous library. There are so many plots you can do. You can even plot the earth and all sorts of geographic plots. There's no way that I can show you all of it. But I want to give you the basics, so that you can then go to the Plotly website, understand what they explain there, and implement that in your own plots. Let's have a look at this beautiful world of data visualization.
Good. Here we are at the Plotly website. It's plotly.com/python. Plotly is a company. They do commercial work, but they also make this open source library available to us. So if you go to plotly.com/python, this is what you're going to see: examples and tutorials for hundreds of plots. Look at everything that is available, even geographic plots, 3D charts, subplots. There's more here than I could ever show you.
The most important bits are the API reference here and the figure reference. As we start building these plots, I'm going to show you how some of this works. I'm going to get you going, show you the important plots and how Plotly constructs the figures that we are going to plot. We're going to look at the different modules inside of Plotly. And I really hope you enjoy this. Plotting with Plotly is one of the most exciting, visual things you can do in data analysis.
Here's our notebook, data visualization. Let's have a look at the libraries that we're going to use. I've already imported them. I've been playing around quite a bit. I really do enjoy plotting. We're going to use pandas, so import pandas as pd. And here are the modules we're going to import from Plotly. So import plotly.graph_objects. You can see graph_objects as a module or package or library, whatever you want to call it, and we're going to import that as go. We're also going to import plotly.io as pio, a shortcut, and we're going to use it immediately: we set pio.templates.default to plotly_white. There are quite a few of these templates, or themes. If you're using a dark background, which you can set in the Google Colab settings, you can set your plots to have a dark background as well. That works out quite well and actually looks quite good.
And then we're also going to import plotly.express as px. That is a library that works very well with data frames. It's a quick and easy way to create plots. It's not as powerful as graph_objects, but it's certainly a quick way to get some visualization while we explore our data.
And then, as always, we load the data_table extension here. That's a magic command right inside of Google Colab. There are the files.
Remember, we've got to import the drive module from Google Colab. We use the mount function, then we change directory with %cd into the directory that holds our data, and we list the contents with the %ls command. You see all the files there. We're going to import the data.csv file. This is part of a series, so please go back and watch the other videos if you're not familiar with this data set. It's exactly the same data set as we used before. With the data table extension loaded through the magic command, we get this nice representation of a table right here, and we can see it's the same data set that we've worked with before.
So let's start with plotting, and plotting really depends on the data type. Age is a continuous numerical variable, and there are certain types of plots that we use for continuous variables. Smoke here, although we see zeros and ones and twos, is a nominal categorical variable. We've just encoded it during data capture as a zero, one or two. Those zeros, ones and twos are just placeholders for an actual sample space value: non-smoker, smoker, ex-smoker. And heart rate is a continuous numerical variable. The type of the data will determine what kind of plot we can create.
We're going to start with categorical variables, and there we get bar charts and pie charts. I'm not even going to talk about pie charts. Forget them; they are up to no good. Of course you can do them with Plotly. Have a look at the website if you're interested.
So let's just remind ourselves of these smokers. We call df.smoke, which returns a pandas series object. It has a method called value_counts, and we're going to call that method without any arguments. We see that zero was the non-smokers, one was the smokers and two was the ex-smokers, and we had frequencies of 88, 85 and 27. And we're going to use this fact to create a figure.
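As a small sketch of that counting step: the series below is a stand-in for df.smoke that I rebuilt from the three frequencies quoted above, since the course file itself is not reproduced here.

```python
import pandas as pd

# Stand-in for df.smoke, rebuilt from the quoted frequencies:
# 88 non-smokers (0), 85 smokers (1) and 27 ex-smokers (2).
smoke = pd.Series([0] * 88 + [1] * 85 + [2] * 27, name="smoke")

# value_counts returns a new Series: category code -> frequency
counts = smoke.value_counts()
print(counts)
```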
So let's create our first bar plot. First of all, we are going to create a computer variable. I'm going to call mine smokers_fig. You call it what you like. Inside of graph_objects there is a Figure function, and I call it by saying go.Figure(). That creates an empty figure for me. The figure object, which is now inside of this computer variable, has some attributes and some methods, remember. The method that I'm going to call to add something to my plot is add_trace.
We're going to populate that method with an argument, and the argument that we pass is another graph_objects function, the Bar function. As you can see, go.Bar is a function, so there are parentheses. We're passing two arguments to it, x equals and y equals. I've just hit the space bar there so it lines up nicely on the screen, but you can put the x directly there. On the x-axis I'm passing a Python list object. It has three elements and each one is a string, in quotation marks in other words: non-smokers, smokers and ex-smokers. That corresponds to the zero, one and two. On the y-axis I'm passing the actual values, 88, 85 and 27. That's how many people there were, a frequency count.
Then the figure object, smokers_fig, has a show method. When we call that method, it shows the figure on the screen, and there we are, a bar plot with three bars. We see non-smokers, smokers and ex-smokers. But something very interesting: when I hover over a bar, some information pops up. It says non-smokers, 88; smokers, 85; ex-smokers, 27. This graph is interactive. It really is. I can zoom in, and look at these tools here at the top.
If I click on this button, it's going to save a PNG file, an image file much like a JPEG, onto my hard drive. I can use that inside of a Word document or some form of presentation. You can zoom in, you can pan around, you can select data, you can lasso some data, you can zoom in even further, zoom out a bit, or reset everything. This is completely interactive. It's fantastic. You can also set up your notebook in presentation format, by the way. So when you do a presentation, you needn't go with boring old PowerPoint. You can keep your interactivity, and that is absolutely fantastic to do.
So these are bar charts. They are great for categorical variables, showing the frequency of the sample space elements of a categorical variable. And we know it's a bar plot because there are these gaps in between the bars to show us that these things are not continuous. It's not like age, which is continuous. These are distinct, discrete sample space elements. They are not part of a continuum, so we leave these gaps in between. Next time you open a journal article and look at the plots, you can immediately see when the authors are telling you that they are dealing with a categorical variable.
Now I want to improve on having typed these values in by hand. Let's see if we can do better. First I want to make you aware of something. If I call the unique method on my df.smoke series, I get zero, two and one. That's not in numerical order. It's just the order in which the values were discovered. If we go back to the table, df.smoke gives us the smoke column as a pandas series, and look: zero, zero, zero, zero, zero, and then two, then zero, zero, zero, zero. If we go on to the second page, smoke is zero, zero, two, zero, two, zero, two, one.
So zero came first, then it saw two and then it saw one. It only documents the unique values, and it lists them as they were discovered while pandas goes down that column: zero, two, one. So we get them in that order. If we do a value_counts and leave all the defaults, we get 0, 1, 2, but that 0, 1, 2 has nothing to do with the order of the codes. The order is about the frequency, the count. By default the result is in descending order of the frequency count. We are just lucky that 88 belonged to the zero, 85 to the one and 27 to the two. That order need not be there. So be aware that there's a difference between the 0, 2, 1 from unique and the 0, 1, 2 from value_counts. If we use code instead of hard coding the values as we've done here, we might run into a bit of a problem.
Let's see how we can deal with it to some extent. First of all, I call df.smoke, and that gives me a pandas series. I call the value_counts method on it, which gives me back another pandas series. I call the values attribute on that, and then I call the tolist method on that. So I'm just stringing these things together, and in the end I have a Python list, by virtue of that tolist method. I've just got to be aware of the order here, so check it more than once. You know that the 88 refers to the non-smokers, the 85 to the smokers and the 27 to the ex-smokers.
So let's create the plot a little bit differently. I'm going to overwrite the smokers_fig computer variable, putting an empty figure object inside of it by calling the go.Figure function. On that figure I'm using the add_trace method. Again I'm passing the go.Bar function to that, and for my x-axis, I'm still typing out these words.
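The ordering point above is easy to check on a tiny made-up series (not the course data) in which the most frequent code is not the smallest one. The last line uses sort_index, which is my own suggestion for pinning the order to the codes and is not something shown in the video.

```python
import pandas as pd

# Made-up codes: 2 occurs three times, 0 twice, 1 once.
smoke = pd.Series([0, 2, 2, 1, 2, 0])

print(smoke.unique().tolist())                 # order of first appearance -> [0, 2, 1]
print(smoke.value_counts().index.tolist())     # descending frequency      -> [2, 0, 1]
print(smoke.value_counts().values.tolist())    # the counts in that order  -> [3, 2, 1]

# Optional (an assumption, not from the video): sort by code so the
# counts always line up with labels written in the order 0, 1, 2.
print(smoke.value_counts().sort_index().values.tolist())   # -> [2, 1, 3]
```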
I'm still creating this Python list there. But on the y-axis, instead of typing out those values, because imagine I had many more, I'm doing exactly what I did above so that I get a list object. Before, we typed the values in as a list object. Now we let code produce those values, but be aware of the order.
I'm going to show you a new argument, marker equals, and I'm going to use the shorthand notation for a dictionary. So there's my outside set of curly braces, and you can see I have two key value pairs. Remember dictionaries? Now is the time to use them. My first key is color, then a colon, and then its value. Then there's a comma, and another key value pair. For the key color, the value is a list, so I can pass a list object as a value: green, red and orange. Inside of Plotly there are these named colors, so Plotly will know exactly what those words mean. I don't have to encode the colors. Later I'll show you how to encode colors yourself.
The value of the line key is another dictionary, so a dictionary itself can be the value for a key. That inner dictionary has its own key value pairs: color is black and width is one, the integer one. We're going to see what that brings us. And then again I just call the show method on my object.
There we go. We have green, red and orange, and we have this thin black outline. That's where the line key of the marker argument comes in: the color is black and the line width is one pixel.
I want to add a title. Fortunately I already have my object, smokers_fig, and I can call another method on it, update_layout. It has a couple of arguments. One of the arguments I'm going to use is the title argument.
And I'm setting title equal to some title I came up with. You come up with your own title. Now we've updated our plot, and if we show it, I have this beautiful title on my plot.
Let's add some axis titles. Although it says non-smokers, smokers and ex-smokers, we'd like a title here on the x-axis, and we'd like a title here on the y-axis, the vertical axis. What is that all about? So again I'm going to call the update_layout method. We had title before, but now we're going to have xaxis equals and yaxis equals. I set xaxis equal to a dict. Writing it out like this, with the dict function, is a perfectly proper Python way to create a dictionary. Earlier we used the curly brace notation, which I prefer, but I'm just showing you that you can do this as well. Inside dict I still have a key value pair, but we write it differently now: title equals groups of smokers. That is more in line with how we wrote things inside the bar object, x equals, y equals. So you can use the notation that you prefer. I'm giving these two arguments, xaxis and yaxis, some values, and then we call the show method again. Now we have groups of smokers on the x-axis and counts on the y-axis. That is absolutely fantastic.
The next thing I want to show you is df.smoke.value_counts again, but now I normalize it, so that each count is a fraction of one, and I multiply by a hundred so I get percentages. So 44% of participants were non-smokers, 42.5% were current smokers, and 13.5% were ex-smokers. Let's try to express that in our graph somehow.
So again, an empty figure object, and I'm going to pass it this go.Bar. By the way, if I hover there, look at all the arguments that you can set on a graph object. It's absolutely phenomenal. There are so many changes you can make.
That's what I meant: if you go to the API reference on the Plotly website, it'll show you what each one of these can do. And that's just wonderful.
So on my x-axis I still have this list object. My y-axis is just going to be those three values, computed in this way instead of having to copy them. Now I've got new arguments, text equals and textposition equals, and you'll see exactly what they do. The text is going to be exactly those three values we had for the y-axis, so no difference. The textposition I've set to outside. There's also inside, none and auto, but I'm going to stick with outside. And then I've got another one here called hovertext. In my hovertext I've actually written some words: 44% are non-smokers. That came from my value_counts with normalize equals True up here. Then 42.5%, and so on. I'm just doing this by hand. I can write any bit of information there as long as it's a string.
Marker, we've seen that one before, and I'm going to use the dictionary notation. This time, though, I'm passing three key value pairs, with color, line and opacity as my three keys. The color has the value green, then something else, and then orange. That's to show you that there are different ways to specify a color besides these named colors. The way we do that is still inside of quotation marks, so these are still strings: rgba. You don't have to put the a there. If you just put rgb, it's red, green and blue. Some of you might know that the pixels on your screen are just little red, green and blue lights that shine brightly or dim down to nothing. If you mix those together, your eyes perceive the mix of those primary colors as some color. You just have to say how bright each of these channels must be. Zero is nothing at all and 255 is maximum brightness. Here the red is at 255, the green is at zero and the blue is at zero. So this is going to be pure red.
And then the comma one refers to this a, and that's opacity. At zero it's totally see-through, so you can't see it at all. At one it's totally opaque, so you'll see the red completely. And then orange. So with these three channels you can mix up any color you like. You needn't be stuck with the named ones.
My line is still one pixel wide and black. And then I'm setting opacity to 0.7, only 70% opacity. I'm using opacity out here, as my last key value pair. That means it's going to affect the green, the red and the orange equally. And that's why I set the a to one here. If I put 0.7 there and 0.7 here, it would be 0.7 of 0.7. If I write green like that, behind the scenes there's an rgba with the opacity set to one, and the same with the orange. So I'm keeping this one at one, and on the outside I'm giving all of them a lower opacity. They'll be a bit see-through.
Then I've got the same things going on with my update_layout. I've put the title here, then the xaxis and yaxis, in two different update_layout calls. I could have listed all the arguments in one call, but I've done it separately. And I want to show you this one: xaxis_tickangle set to negative 25. That's a new argument. Let's see what that does.
So there we go. We have our title and we have our axis titles. We see the tick labels at a negative 25 degree angle. We're lucky here, we only have three sample space elements. Sometimes we have a lot more, and if we were to write those words next to each other, they might overlap and we couldn't read them. You can put the angle all the way to negative 90, then they're going to be completely vertical, so you can stack a lot of them together. So remember that xaxis_tickangle. And then there's our text at the top. That's the textposition being outside. And the actual text up there is where we get the 88, 85 and 27. That's very nice.
Remember I told you it can also be set to inside, auto or none. We also see the bit of opacity that we've introduced. And then lastly, look at my hover text. Now I get non-smokers, 88, and also 44% are non-smokers, and 42.5% are smokers. The text that I passed shows up in the hover text. So absolutely fantastic.
What about horizontal bar plots? Just to show you how to do those, we create the exact same thing: go.Figure, then add_trace, and we add a bar to that. All I've done here is swap my x's and y's around. On the y-axis I'm now going to have the three categories, and on my x-axis I'm going to have the actual values. The text is the same, but the textposition I've now set to inside. The hovertext I've kept the same, the marker we've kept the same, but now we've got a new one: orientation equals h, as a string. And now we're going to get the horizontal bar plot. The other thing I've had to do, of course, is swap the x-axis and y-axis titles around as well. You've got to be a bit artistic about these things. And there we go, we see the numbers inside the bars. Plotly will decide whether they have to be black or white. And then we have our categorical variable here on the left-hand side.
What, though, if you wanted to group these? So far this was for all the patients together. Let's have a look at that. I'm going to use the group variable, a statistical variable in our data. So I'm calling df.group to give me the group pandas series, and then df.survey, and I'm passing those to the pandas crosstab function. Remember the crosstab function? That's going to give me a data frame back. I pass df.group first, so that goes along this axis, and it has found active and control. Some participants were taking the active drug, some were taking the placebo. They were given a survey and could choose between one and five for how much they agreed with it. So we see the values for the active group.
21 of them chose one, 18 chose two, 17 chose three, et cetera. So remember that. Let's hard code that into a bar plot that we split in two. We don't want the counts for all the participants together. We want to split them up by group. So how are we going to do that?
Well, let's have a look. I'm going to create a computer variable, surv_group_fig, and assign go.Figure() to it. I'm going to create two different bar plots, two different traces. There's an add_trace here and there's an add_trace down here, so I'm doing it twice. On the first one, I pass one, two, three, four, five on my x-axis, and on my y-axis I pass the values for the active group, 21, 18, 17, et cetera. I've written them out there. I do exactly the same for text, with the textposition on the outside. But now I'm also giving the trace a name, because we're going to get a legend down the right-hand side of our plot, and the name shows that these bars are only for the active group. And then I've got a marker there with a bit of opacity.
I do the same again when I add the other trace: the same x-axis values, but now the y values that I get from control, 17, 32, 13, and so on. And I've given it a different color. Nothing changes as far as my title is concerned, or my x-axis and y-axis titles, but then there's this new one: barmode equals group. I want them grouped together. Let's see what the effect is going to be.
There we go. They are grouped together. On my x-axis, remember, I have one, two, three, four, five, and I've split the data into two groups, the active and control group. So all the ones are grouped together: that bar is for the active group, that bar is for the control group. And you can see the numbers there. In the active group, 18 participants chose two, and in the control group, 32 participants chose two. So you can see that they are grouped, and that's what this group setting is going to do.
It is going to group the two different groups, but we're keeping the x-axis sample space elements the same, because that's the survey answer: one, two, three, four, five.
So I want you to do a little exercise. I think it's the only exercise that I'm going to ask you to do, and it's quite complex. I want you to plot the same data, but instead of grouping by one, two, three, four, five, I want it grouped by whether participants are in the active group or the control group. In other words, we have one, two, three, four, five on one side and one, two, three, four, five on the other side, because that's the other way you can do the grouping. So it's time for you to take a break and try this yourself. Google it. Go to plotly.com/python and see if you can figure it out.
Good, you're back. Let's have a look at the solution. There we go, the crosstab function. I'm just going to do that again, and we see this is a pandas data frame. What I can do is call the values attribute, or values property, of this data frame. That's exactly the same code as we have up there, other than the fact that I've turned things around: I've got survey first and then group. Notice that difference. I'm saving the result in vals. Now, instead of getting 21, 18, 17, 23, I'm getting 21 and 17, 18 and 32, 17 and 13. I'm getting these as little nested lists inside of a NumPy array, so this is a two-dimensional NumPy array. That's the difference from before. If I were to swap the two around, I would get five values, comma, another five values. You have to think about why you would want it this way.
Now I'm going to create a little list object with Active and Control as two elements, both strings, and just print that to the screen. I'm doing the same with one, two, three, four, five, the choices, as another list object. And then look what I'm going to do here.
I am going to use a for loop to do all of this. So there's my computer variable, and I'm assigning a go.Figure object to it. And now, for i in range(5). Remember, range(5) gives me zero, one, two, three, four. That's five elements, and we're going to loop over those because, look at this, we've got five elements to loop over here in vals, the first one, the second one, and so on. And I've got five elements to loop over here in the choices. We can make use of that in a for loop, so that I don't have to write the five add_trace calls separately. Of course you can do that, but this is just a shorter way.
So while I'm looping five times, I'm calling surv_group_fig.add_trace, and it's a bar, go.Bar. My x-axis remains the groups, because remember, on my x-axis I told you I just want active and control. That's all I want on the x-axis. On the y-axis, though, I want the values: the first set, the second set, the third set, the fourth set, the fifth set. That's why I've set it up like this, so that I have these five sets to loop over, and these five names here to loop over. So my y is going to be vals, indexed with i: the zeroth one, the first one, the second one, the third one, the fourth one. The text values are exactly the same, the textposition is the same, and then the name loops over these five choices to give me the legend on the side. Everything else stays the same. I just want to show you here how we've written all the layout settings as dictionary shorthand. That's a different way of calling the update_layout method.
And look now, I have active and control, exactly what I wanted, and within each they are grouped by one, two, three, four, five. I have the group names at the bottom. Plotly is absolutely fantastic. I can click on one of these in the legend and it disappears. I can click on one and on three and they disappear. Now I only see two, four and five.
I can bring them back by clicking on them again. Absolutely wonderfully interactive.
I just want to show you one more before you take another break. I'm going to loop over this all again. Nothing has changed, except that I'm changing the barmode to stack. I don't like stacked bars particularly. I just want to show you what happens if we set the barmode to stack. By the way, I've used the alternative notation here, not the curly brace notation, just to show you both are available.
And this is the stacked one. I've put the values on the bars, and that's why, if I do stack, I like to show the numbers: otherwise it's difficult to see how many chose two here. You'd have to guess what this level is and what that level is, and subtract the one from the other to know that there are 18 there. It's difficult to read when they are stacked. Sure, it gives me the whole lot, that there are 100 in this group and 100 in that group, because I can see each bar goes up to 100. But showing the values helps a bit. As I say, I don't particularly like stacked bars, but if you set barmode to stack, this is exactly what you're going to get.
I hope you've had a nice break. Let's continue with histograms. Now we get to the types of plots that show us the distribution of a continuous variable. I use the term distribution. We're going to look at what distributions are in a future video. For now we just want to see the pattern of the spread of a continuous variable. We're going to look at age, and I'm going to use the plotly express library here, so px.histogram. We don't call a Figure function first; Plotly Express has these functions that we can call directly. By calling the histogram function, I create a histogram object. And look at this. I told you it's integrated with pandas, so my first argument is df, the data frame, then x equals age.
So I can just use the column name there. Then I call the show method on my computer variable, which holds the histogram object, and there's my histogram. Nicely done.
Now you can see there are no gaps between the bars, because we're indicating that this is a continuous variable. We've created little bins out of it and we're just counting how many participants fall into each bin. If I hover over here, you can see age equals 30 to 34 and there were 13 of them. Age equals 35 to 39, 40 to 44, 45 to 49, 50 to 54. It tells us the interval and then it shows us how many fall within each. But you also get this idea of how the data is spread, the shape of the spread. You can see the bins are five wide. From 30 to 34 is a five-year interval, and that was decided by Plotly. Later I'll show you how you can decide how wide those bins must be.
Now let me just show you: Plotly Express is not as powerful as the graph objects, but it's much quicker. Here are some of the arguments that are available if I hover there. You can see there are relatively fewer arguments, but there are still a lot of them. There's still a lot you can change. If you go to the API reference, you'll see exactly what you can do, or go to plotly.com/python and click on one of the tutorials there. It'll show you how to construct these.
So it is px again, and I pass df, and on the x-axis I want age, but now I want it split up by some categorical variable. I want to see it split up by group. So I want to see, by histogram, the distribution of the ages of the patients getting the active drug and of those getting the placebo. For that we split on this argument called color. Now it doesn't set a color, although it's going to color the two different sets differently. So perhaps that is a good name for the argument, but it doesn't set the actual color. For that we have to use other arguments.
There's a title argument, an opacity argument and a marginal argument. I've set marginal to rug, because that's going to create a rug plot on the margin of this histogram. By default it's going to be on the top. Let's see what this looks like.
There we go. Now we have it grouped by active and control, and again I can just click on them in the legend and decide which one I want to see, or whether I want to see them both. By default they are stacked here. If I look on the left-hand side, there are six in the active group in the age group 30 to 34, and there are seven in the control group, so it goes up to 13. That's the total, and if you didn't have this hover text, you would have to guess how many each of these were. It's easy for the bottom one, you can see it's round about six, but then you've got to guess that the top is at about 13, and 13 minus six is seven, to know that there are seven in there. That's why I don't like stacked bar charts that much.
But now you can see the rug plot. The rug plot is just a little line indicating each of the actual values. There might be more than one patient aged 54, and those lines will just be on top of each other. On this margin of my histogram I can see the rug plot. So that's fantastic.
Now let's do all of this with the graph_objects module inside of Plotly. I use the go.Figure function, which creates a figure object, and I'm storing that in the computer variable age_hist. I'm calling the add_trace method on that, passing the histogram object to it, and the x-axis is just df.age. All the ages on the x-axis, please, and Plotly will know that it has to create these bins and do exactly as we've seen before. The hover text is a little bit less detailed, but we can still make out that this is ages 40 to 44, inclusive on that interval, with 27 of them, et cetera. So let's add a bit more fun.
So that's quite a bit of code there to write a figure, and I've got these two traces. What I'm going to do on my first trace is set this conditional. So on the x-axis I'm saying df, then df.smoke equals equals zero, so I'm selecting the rows where the smoke column is equal to zero, and for those please use the age. So this trace only holds the age values of people who are non-smokers, and the second trace holds the smokers, because there you have smoke equals equals one. For brevity we're just leaving out the ex-smokers here. But look at this new one, xbins equals dict, start at 10, end at 90 and make the size five. So you can control that bin size by doing this. And I've set marker color to orange, and that's how you set the actual color. Remember before we had color, which only split the data. So inside of here we say marker_color. I've got a bit of update_layout there, and let's plot that out to see what it looks like. So there we have a very beautiful histogram showing us the distribution of ages of those two groups. One thing I should say is look at that barmode equals overlay. So now it's an overlay. That eight there is not stacked on top of the other bar. It is at eight because there are eight participants in there, and there are four participants in the other one. So that's fine unless they're both at the same value, and then what you'd have to do of course is just click one of them away, and now you can see the trace for the one or the other, looking at those two groups separately. So a histogram is very nice if we want to determine the distribution of data. Now more commonly in healthcare literature you're going to see a box plot as an indication of the distribution of data. So let's rather have a look at a box and whisker plot, or box and whisker chart. So I'm going to use px, my express library here, dot box, and that's going to give me a box object, and I'm passing that to a computer variable, and here are some of the arguments we're going to use.
We're going to pass df, the data frame. On the x-axis we want smoke and on the y-axis we want age. So it's going to see that there are three different sample space elements inside of smoke, and those are going to be put on the x-axis. So the non-smokers, the smokers and the ex-smokers, and on the y-axis I want some indication, as a box plot, of the age. I've added a title there. Let's see what it looks like. This is really good for quick data analysis, just quickly visualizing and understanding your data after doing the summary statistics. For me, if I want to create plots and charts for my reports or publications, I'm going to use graph objects. So here I can already see there's quite a bit of difference between the smokers and the non-smokers. Now look at the bottom, we have zero, one and two. So you would have to do something, perhaps replace or map the values with a dictionary on that pandas series for smoke, changing the values zero, one and two to non-smoker, smoker and ex-smoker. But for these plots I'm just looking at the data, I'm just exploring the data, so I'm going to use Plotly Express. And look, what do you see with a box and whisker chart? If we hover over those, you see zero, because that is the group in which the participants fall, and then you see the minimum was 30 and the maximum was 75, and those are the ends of the whiskers. Now I warn you, if there are statistical outliers, those whiskers are not going to go to the minimum and maximum. They're going to go to a value beyond which we'll find outliers, and then you'll start seeing little dots either above or below, or both above and below, those whiskers. You'll have to hover over the topmost one and the bottommost one to get the min and the max, but it'll show us which values Python thinks are outliers.
The box in the middle, that's going to be the first quartile, the second quartile and the third quartile, and that's how you form it. You can see the second quartile there is the median. So that's what a box and whisker plot looks like. Now let's do the following. Let's create separate traces, and I'm going to do it the long way. So I'm going to create three list objects, Python lists. I'm calling the dot to_list method on this conditional that's only going to return the ages for me. So df, df.smoke equals equals zero, the ages, converted to a Python list, and I'm saving them into three appropriately named computer variables there. So I have these three list objects. Then I'm going to say ages_smoke_box equals go.Figure, and I'm going to create three separate traces. I'm going to take ownership over the design of each of those three. So for my first trace, on my y-axis I'm going to have the non-smoker ages, and then I'm giving it a name, non-smokers. I'm making the marker color green, and then I've got something new, boxmean equals True and boxpoints equals all. I'm going to leave that as a surprise so you can see what that does. On the second one, we have just a list of values of ages of smokers, giving it a name, making it red, and I have boxmean not set to True but to sd, and sd as a string, so inside of quotation marks, and boxpoints still all. And I'm doing the same here for the ex-smoker ages. Let's have a look. Let's not keep you in suspense here. Here are all the box points. You see some jitter left to right, so that for different participants of the same age those dots are not on top of each other. You can set that jitter and also set how far the box points are from the boxes. In this first one, where we said boxmean equals True, the dotted line is going to give me the mean as well.
This is very important when we get to actual statistical analysis, because we're going to choose between parametric and non-parametric tests. Sometimes you can't use analysis of variance to compare the ages of these three groups. You'll have to use the Kruskal-Wallis test, and that happens when we don't meet the assumptions for the use of parametric tests. I'm going to tell you everything about this, not to worry, but in this plot I can see my mean and median are quite close to each other. I also don't see a lot of outliers, and the box seems to be very central in the data. It seems as if this data is normally distributed, and that helps me to decide, just visually, which tests I am going to use here. When you set it to sd, not only do I get the central line, which is the mean, but I also get the standard deviation. You can see the mean plus minus, and this little sigma, that's the Greek letter sigma for standard deviation. So it's showing a mean of 56.16 plus minus 12.29. That's the mean and standard deviation, and this diamond goes out to that standard deviation. So it's 12.29 above the mean and 12.29 below the mean. So we get all this beautiful, rich information. When you get used to using this, a box plot like this for your numerical variables is really going to help you out. You'll get used to reading these plots and seeing exactly where your future analysis is going to go. Absolutely fantastic. The last plot that I'm going to show you here is a scatter plot, and that is where we're going to deal with just numerical variables. Sure enough, we can split them by a categorical variable. We can say show me the patients on placebo and the patients on the active drug, but we're dealing inherently with plotting numerical variables. So let's go for it. I've got go.Figure there, assigning it to age_SBP.
So I've got two numerical variables, as you can guess from the computer variable name that I've created. So I'm going to use the add_trace method there, and go.Scatter is the object that I'm passing to it. On the x-axis I want all the ages please, and on the y-axis I want the systolic blood pressure. The mode that I want is markers, because you can also do lines and markers. My mode is just little dots, markers. I've got my standard update_layout there, and let's see what this looks like. And there we go, a scatter plot. So every dot here represents two numerical variables for one participant. That participant was 33 years old and had a systolic blood pressure of 179. So pairs of values, very nicely done there. What can I do though if I want to bring a third numerical variable into it? Well, I can create a 3D plot, with age on one axis, systolic blood pressure on another axis and heart rate on the third axis. But that doesn't translate into print format, and if you send your article to publishers and they're going to publish it on their website, they're also not going to have interactive 3D plots that you can hover over with your mouse and twist and turn so you can see the 3D. You've got to have some idea of how to represent this in 2D. And one way to do it is a bubble chart, which is a form of scatter plot, and that is where we take the size of our marker and attach that to the third variable. So that's what I've done here. On my x-axis I still have age, on my y-axis SBP, but now I have this size argument and I'm setting that to heart rate. So this time the size of my marker is going to be coupled to the heart rate. The bigger the marker, the higher the heart rate. I'm going to show you another marginal here. On both my y-axis and x-axis, I've got marginal_y and marginal_x. I'm putting box and box, and I'll show you what that does.
I'm also adding a trend line. We're going to learn all about linear regression and creating these models. Such a model will say, well, if you give me the patient's age, I can tell you what the SBP is going to be. Those are linear regression models. And then I'm giving it a title, and then there's this nice labels argument. I like this labels argument in Plotly Express. It says, well, I've got SBP, and that's what it's going to print on the screen because that's the name of that column inside of my data frame, but I just want to change that and write it out in full, something like this. So let's not waste any more time. Let's call the .show method there, and lo and behold, look at the beautiful thing we've created here. Look at my two marginals. I've created two box plots, a box plot for both marginals. Later on I'll show you that you can do different ones, but they are all interactive. So I'm going to get that indication, and look at all this data. We see a tiny little dot here. A heart rate of only 24. That's definitely going to be a statistical outlier, I think, but look at the size of the dots. That gives me an indication of the third variable. I've got the two groups separate from each other, and I've got these trend lines. You can see them. That's my mathematical model. And let me try and hover over one of them. Look at that. There's my ordinary least squares trend line model, my linear regression model. It says, if you give me the age, you can see the age there, I multiply age by 0.729121. So 0.729121 times age plus 114.18. If I do that little algebraic expression there, I'm going to get a predicted systolic blood pressure. I also get an R squared value there, a coefficient of determination, and it goes from zero to one. Zero is a very poor model. One is an excellent model, where my predictions will be spot on. And you can see my predictions are not spot on. These values are above and below my trend lines.
My actual values are far away from my model. This ordinary least squares method is going to give me the best model, but we can see an R squared value of only 0.12. Beautiful. After we've done linear regression, when we plot this again, this will make so much sense. This is all in one graph like this. It's absolutely fantastic. So for our two box plots here, we're going to see age for this box, and we're going to see systolic blood pressure for this box here. And you can see here, you see the little dot there? That was beyond this box and whisker. So that's a statistical outlier as far as the systolic blood pressure is concerned. We no longer have the end of the whisker as the minimum. There is the minimum. And this whisker down here is going to give us the margin beyond which we find statistical outliers. Excellent information in just this plot. Now, let's just do this again. In this instance, I'm just going to show you two different marginals. I have histogram and rug as my two marginals, y-marginal and x-marginal. So there we see my rug as my x-marginal, and that is going to be of all the ages. And here on my y, I've got the systolic blood pressure as histograms. Fantastic, absolutely fantastic. All the things that we can do there. Now I just want to show you one more thing. Before, we used the size of these markers for our third dimension. We can also use color as our third dimension. And this is the last plot I want to show you, but it's absolutely beautiful. I love it. So px.scatter for a scatter object inside of Plotly Express. The data frame, the age, the systolic blood pressure, and I'm going to say color equals HR. So what did we do before for HR? There we go. We said size equals HR. And now we're going to say color equals HR. And we're going to say facet_col.
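The trend line equation and the R squared value read off the hover text can be reproduced by hand. The sketch below fits an ordinary least squares line with NumPy on made-up data, so its numbers will not match the ones in the plot; the variable names are assumptions.

```python
import numpy as np

# Made-up ages and systolic blood pressures with a weak linear relationship.
rng = np.random.default_rng(6)
age = rng.integers(30, 76, size=100).astype(float)
sbp = 114 + 0.73 * age + rng.normal(0, 25, size=100)

# Ordinary least squares fit of a straight line: sbp ~ slope * age + intercept.
slope, intercept = np.polyfit(age, sbp, deg=1)
predicted = slope * age + intercept

# R squared = 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((sbp - predicted) ** 2)
ss_tot = np.sum((sbp - sbp.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"predicted SBP = {slope:.3f} * age + {intercept:.2f}")
print(f"R squared = {r_squared:.3f}")
```

An R squared close to zero means the line explains little of the spread in the blood pressure values, which is what the widely scattered points around the trend line show.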
Now what facet_col is going to do is create two separate columns, one for the active group and one for the placebo group, because I want it to split on group. And you'll see what that looks like. A trend line using OLS. I've still got a title, still got labels, but now, because I want this color to have a spectrum of colors to indicate the heart rate for me, I've got to say color_continuous_scale. There are different named scales, and this is the named scale that I'm going to use. We call it with px.colors.sequential, followed by whichever one of those color scales you want to use. And if we run that, look at this beautiful plot. The facet columns are done by the group categorical variable there. There were two groups, an active group and a control group. If there were more, we would have had more facet columns. And we see them individually, and we see the age and systolic blood pressure in each. They're both sharing the same y-axis, but they've got individual x-axes. And look at the colors. The more yellow it gets here, the higher the heart rate, and we can see that from the color. And we can also see the two separate models. There's the one trend line and there's the other trend line. The R squared is 0.124 there and 0.08 there. So for the control group, our model is a bit better. There seems to be more of a correlation between age and systolic blood pressure than there is in the active group. And there's a statistical test with which we can compare whether there's a significant difference between those two models. It's all fantastic, and I'm going to teach you how to do all of that. So I love the Plotly library because I can export it, I can work on it. I mean, if we hover here, I can say let's zoom in just on these points, and now it's going to zoom in on the other facet column as well. And if we click on pan and I move around the one, it moves around the other one too. It is absolutely phenomenal.
And if you want to build this into a presentation, of course you can just load this page as a presentation and go down through it, and that's what I do in my research group. We get together and I go through the analysis, and it's a wonderful way to explain it, especially if we have personnel who are not that knowledgeable about statistics. This notebook format is a beautiful way to show them the statistical analysis, because I can put words, text, all sorts of things in here. It's absolutely fantastic. Let me just reset the axes so we're back to where we were. So please go to the Plotly website and play around. There's just so much that you can do. I really hope you enjoyed this. Click to like this, please subscribe, and tell everyone about this so that everyone can learn how to do their analysis inside of Python. It's really easy. This plotting can be overwhelming, I do know. It takes some time. You are going to make mistakes. Refer to the website so that you learn how to do this, and there's such absolute control. I mean, look at this plot. It is absolutely fantastic.