The last library that we're going to talk about, the one that really makes all of our data analytics come together, is known as pandas. There are a few other languages out there that are very popular with statisticians and data analysts, like MATLAB or R. I know here at NC State we have a MATLAB class, and you would see some of the same kinds of structures in that language. pandas is Python's answer to that. I won't lie, it's a little bit of trying to get those MATLAB and R coders onto the dark side of Python. But I digress.

The idea is that pandas allows us to do our data analytics by giving us some additional data structures. Just like we've seen with matplotlib, NumPy, and random, we can use an alias, and as you'll see almost anywhere on the internet when people are working with pandas, "pandas" is six characters too many to type, so it's almost always shortened to pd.

I underlined that first phrase, data structures, because pandas is really about letting us use these data structures. The first one is something known as a Series. This is very similar to a dictionary in Python. I have my values, for example 1, 2, 3, 4, 5, and I want to associate them with some header or, if we're thinking about this like dictionaries, with some key. In pandas' world they call that an index, but effectively it means that A is associated with 1, B with 2, C with 3, D with 4, and E with 5. So, like I said, this is very similar in approach, and you are still passing the values as a list. It just allows for a little bit of structure, and those indices are there for us to use at our disposal.
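As a minimal sketch of the Series just described (the variable name s and the uppercase index labels are assumptions for illustration, not taken from the slide), it could be built and queried like this:

```python
import pandas as pd

# Values are passed as a list; the index plays the role of dictionary keys.
s = pd.Series([1, 2, 3, 4, 5], index=["A", "B", "C", "D", "E"])

# Show the whole Series, then look up one value by its index label.
print(s)
print(s["C"])

# The same associations expressed as a plain dictionary.
as_dict = s.to_dict()
print(as_dict)
```

The lookup s["C"] reads the same way a dictionary lookup would, which is the similarity being pointed out.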
One of the best ways to really see this is something you've seen me do a few times with Jupyter: using it not just as a place to write a program, but to do data exploration. In this first cell I'm importing pandas and then building that same Series. Then you see I have a little s there, and that's just going to show me my data. It also allows me, just like a dictionary, to specify which index I want to look at, and it'll give me the value at that index. Nothing terribly crazy going on here.

This next fragment is where I will say pandas gets the bulk of its power: creating something known as a DataFrame. A DataFrame, in essence, takes the Series approach, but now, instead of each key holding one value, each key represents all of the values in a particular column, if you were to think of this like Excel. So instead of a key being something like 1, we could call it q1, for question one or variable one, then q2 and q3. What we're saying now is that the first entry across those lists is one person's entries, the second is the second person's entries, then the third person's, then the fourth person's. When we compile it all up, it produces a matrix that we can work with, and, very similar to what we saw with the Series, we can reference those keys, in this case to talk about a specific column.

If you were looking at the simulating infection rates video, when we worked with NumPy it got a little weird: we needed to do something like [:, 0] to get out a specific column. With pandas we just need to specify the particular column that we want to work with. So if I'm looking at q3, these are the values I want to be
operating off of. I get them, and I get them in a nice little list. This also allows me to do very simple descriptive statistics. As you can see, I can do something like df.mean(), and for every variable that is numeric in nature it gives me the average. So I can do data analytics very quickly. Here we can see q1 (I called it 1 originally) is 2.87, q2 is 2.75, and q3 is 3.025. And, as you can probably already guess, I can do the same on q3, and now I'm explicitly saying just give me the average for the q3 variable, and it will.

Okay, that's great, but a lot of the time you're not going to be working with data from a dictionary that you have to build like this. Sometimes you do; I won't lie, sometimes I generate my DataFrames in code. More commonly, though, the data comes from a CSV file. If we think about CSV files for a second, they have lists of entries and headers naming each of our values, just like this. So I can just do something like pd.read_csv, and as you can imagine, that will take the CSV file, do some fancy Python code to it, and convert it into a DataFrame for us.

The big thing you might be noticing is that I also have .head. df.head() and df.tail() are just saying show me the first five or the last five entries in this data set, as we can consider it now. As you can see, I'm seeing the last five entries of the 150 in the iris data set, and the formatting makes it look nicer, all that kind of fun stuff.

What's really great is that we can also use the DataFrame to evaluate expressions over the matrix as a whole, if you think of the DataFrame as a matrix. So, for example, if I did something like df['species']
== 'setosa'. All right, well, we have setosas in the set; there are 50 setosa entries inside the iris data set. So if I write an expression like df['species'] == 'setosa', what is going to be produced? It actually generates a Series for us, and that Series is just a list of True or False values: for every entry in the species column, we look at the value at that entry and ask, is it equal to setosa? You can see here that the first five are, yes, setosas, but the last five are not; they were virginicas, so we're seeing Falses there. If I were to print all 150, you'd see 50 Trues and then 100 Falses, because we also have versicolors and virginicas.

The reason this is really useful is that it allows us to do filtering. Let's say I want only the virginicas. By using that expression I just described, this list of Trues and Falses, and giving our DataFrame just that list, it will keep only the entries that are True. So in this case the subset is df[df['species'] == 'virginica'], then .head(), and you see exactly that: I now have entries with virginica and only virginica.

We can do additional things with the pandas library as well. We had descriptive statistics: I can do .mean(), or I can do something like .describe(), and it will compute all of those basic statistics for us very quickly. Count is just how many entries there are, nothing terribly crazy there. But you can see that for all of the virginica flowers we have the mean sepal length, mean sepal width, mean petal length, and mean petal width, as well as their mins, their maxes, and their standard deviations, and if we wanted to break them into
quarters, we also have their break points (the quartiles) as well, for very quick separation.

Where this becomes very useful is that we can also tie pandas in with matplotlib. matplotlib, again, is the data visualization library, and pandas is the data analytics library; merging them together allows us to produce visualizations directly off of the DataFrame. This entry here is saying take our virginica subset and take specifically the sepal length column. You can do it with the dot version if the column name is a single word, or you can use the bracket approach, which doesn't need the dot notation for the column name. Both will work. It really depends on whether you've got spaces in your keys. If you do, then yes, you need to use brackets; if they are all one word, separated by an underscore in this case, then you can use the dot. Either way, we're still using .plot. However, notice that we're not giving it x and y coordinates. Instead we're saying I want 20 bins, and kind says what kind of plot you are making; hist is the shorthand for saying make this a histogram.

And there it is. Maybe a little funky, but that's fine. As you can see, boom, we get a histogram of the sepal lengths and their frequencies. So how often does something like 6.4 appear? Quite a lot, so we can see that's roughly where the average is. It's not exactly 6.4, it's closer to 6.5, but you get my point.

To show that in a different fashion, you can also do some separation. You can grab the entire DataFrame and call .hist on it, which builds histograms, plural, because you can specify which column to use and you can do something known as a group by. The idea of group by, and we'll see more of it in a little bit, is just that we want to separate the rows out by some sort of criteria.
In this case we want to separate and build three histograms, one for each of our species, and again we have bins for how many bins to use. As you can see, we have setosa, versicolor, and virginica.

Now, one thing I will point out is this code right here, this little ditty. I hate this. I absolutely hate doing this. It's more of a personal thing for me: it works, it's fine, but I don't like working with it. So one of the things I have built on my end as a personal change involves a function that DataFrames have called mask. With my version of mask, I can specify what key to operate on, what type of comparison I want to make, and the value I want to mask or filter on. This function here, which by all means feel free to take, is my way of saying I want to mask things that are larger, smaller, or the same. So I load that.

And this code right here at the very bottom: we've always talked about how you shouldn't reuse certain keywords in Python, because you can physically break Python. This is one of those weird times where, you know, it's like comedy: there's no saying no in improv unless you know it works. It's a very similar approach: you shouldn't change keywords unless you know what you're doing. This code is effectively saying take the DataFrame data structure that pandas has built, take the mask function that the pandas library has built for DataFrames, throw it away, and use this instead. Again, this is just my way. If you didn't want to get rid of mask, you could call it something else. filter, as you can see, is already a built-in name, so I'm not going to use filter, but I could change that name to something else. Still, I use mask, and the reason is that what this does is the equivalent of df[df['species'] == 'virginica']. Again, that is weird to me.
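The helper itself is not reproduced in this transcript, so the following is only a guess at its general shape, not the instructor's actual code. The function name filter_rows, its parameters, the comparison table, and the sample data are all hypothetical. This sketch also attaches the helper under a new name rather than replacing the built-in DataFrame.mask, which is a deliberate difference from what is described above.

```python
import operator
import pandas as pd

# Hypothetical lookup from a comparison string to the function that performs it.
_COMPARISONS = {
    "==": operator.eq, "!=": operator.ne,
    ">": operator.gt, ">=": operator.ge,
    "<": operator.lt, "<=": operator.le,
}

def filter_rows(df, key, comparison, value):
    """Keep only the rows of df where df[key] <comparison> value is True."""
    compare = _COMPARISONS[comparison]
    return df[compare(df[key], value)]

# Attach under a new (assumed) name; the built-in mask method is left untouched.
pd.DataFrame.filter_rows = filter_rows

# Made-up rows standing in for the iris data.
df = pd.DataFrame({
    "species": ["setosa", "setosa", "versicolor", "virginica", "virginica"],
    "sepal_length": [5.1, 4.9, 6.4, 6.3, 7.1],
})

virginica = df.filter_rows("species", "==", "virginica")
long_sepals = df.filter_rows("sepal_length", ">", 6.0)
print(virginica)
print(long_sepals)
```

The call df.filter_rows("species", "==", "virginica") selects the same rows as df[df["species"] == "virginica"], with the DataFrame's name written only once.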
This is a little cleaner: I don't have to say df twice, I don't have to say the DataFrame's name twice, and it will still do the filtering for me. One little piece that I don't have in the code just yet: well, you can see that it's going to show me quite a lot, so I will add a .head() to shorten that down.

The one thing that I haven't shown thus far is that you can also use the groupby command on the DataFrame, and this is a way to separate out all the different values. I'll use species as the example: what if I want to split my DataFrame up by species? df.groupby is going to look at a particular key and, wherever the values differ, put the rows into their own separate entry. So species = df.groupby('species') takes our setosas, versicolors, and virginicas and splits them into three different groups. len(species) will show us that there are in fact three entries.

The reason this matters is that we can now build for loops over it. In the for loop we say, first, name, the name of this particular group, which would be setosa, virginica, and so on, and then data, which gives me the actual DataFrame for that group. I'm going to call it data; that's just my go-to name for it. So: for name, data in species. Just to see that in action, print(name) shows me setosa, versicolor, virginica. Nothing terribly crazy going on there.

But let's say I want the mean of the sepal lengths for each one of these. I could write something like data['sepal_length'].mean(). Again, it's just a DataFrame, so I should see the averages for each of those groups. And if we hop all the way back up to the virginicas, there's that mean we saw for sepal length: 6.588, or 6.58
roughly, rounded down. That is obviously where we can use our .format command to shorten it to something like this, and there we are: 6.588.

So again, this is a way for us to start doing data analytics on large data sets: process them, work with them, separate them into their individual categories, if you will, and then do some simple analytics on those individual categories as well. That's pandas. It's a great library, and I would strongly encourage you to learn everything you can about it.
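To pull the filtering and group-by steps from this walkthrough into one place, here is a small sketch. The iris CSV is not available here, so the DataFrame below uses made-up values as a stand-in; the column names sepal_length and species and the variable names are assumptions based on how they are spoken in the lecture.

```python
import pandas as pd

# Hypothetical stand-in for the iris CSV; these numbers are invented.
df = pd.DataFrame({
    "sepal_length": [5.0, 5.2, 6.0, 6.4, 6.5, 7.0],
    "species": ["setosa", "setosa", "versicolor", "versicolor",
                "virginica", "virginica"],
})

# Comparing a column to a value yields a Series of True/False values.
is_virginica = df["species"] == "virginica"

# Indexing the DataFrame with that Series keeps only the True rows.
virginica = df[is_virginica]
print(virginica.describe())

# groupby splits the rows by species; iterating yields (name, sub-frame) pairs.
species = df.groupby("species")
print(len(species))

means = {}
for name, data in species:
    means[name] = data["sepal_length"].mean()
    # .format trims the printed mean to three decimal places.
    print("{}: {:.3f}".format(name, means[name]))
```

With the real data, pd.read_csv on the iris file would take the place of the hand-built DataFrame, and the rest would stay the same.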