The last thing I'd like to show off is me taking matplotlib and pandas and doing an actual data analytic. This is a smaller slice of a study I recently did. If you've looked at my channel, I have another video on student practice sessions; if you haven't and you're curious, you can click on that. The whole idea was: I want to see how students are practicing during their time in a particular course. So we'll walk through what I did as a process. At the start I'm working off of my import statements. Again, these are just loading in what I plan to be operating with. There's the mask function that I told you I like to use in place of pandas' built-in DataFrame filtering; it's my way of cleaning that up. I don't think I'm using it in this particular example, just because I've stripped a lot of things out, but it's almost the first thing I do whenever I start a new analytic: make sure that function is right there, so I always throw it in. The next step is loading in my data set. Just to see it in action, it's a CSV with a bunch of numbers and particular values, and I'm utilizing pandas' read_csv function to convert that CSV file into a DataFrame. The next little portion deals with the created column: if you take a look at it, the issue is that it's a string. When I want to do analytics on the time deltas between two rows, literally these two, I need some way to do that, and strings aren't going to help. Pandas happens to have a function where you can take a column and convert it into a datetime. Here I'm specifying which column, and this is the format that string happens to be in: take it, process it, turn it into a datetime column.
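The loading and conversion steps above can be sketched like this. The column names and the timestamp format here are assumptions for illustration, not the study's actual schema:

```python
import io
import pandas as pd

# Hypothetical miniature version of the log file; in the video this is
# a real CSV on disk read with pd.read_csv("...").
csv_text = """user_id,event,created
u01,TE,2023-09-01 10:00:00
u01,PP,2023-09-01 10:05:00
u02,TE,2023-09-02 14:30:00
"""

df = pd.read_csv(io.StringIO(csv_text))

# 'created' comes in as a string; convert it to a proper datetime column
# so we can take time differences between rows later, then save it back
# over the original column.
df["created"] = pd.to_datetime(df["created"], format="%Y-%m-%d %H:%M:%S")
print(df.dtypes["created"])
```

The `format` argument must match however the strings actually look in your file; if it doesn't, `to_datetime` will raise.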
This last little bit saves that back as the column itself, replacing what was there. Then I have a nice little print statement where I group by the student ID. Again, groupby will look at all the unique IDs and filter each student's data into their individual bin, if you will. That lets me see how many students I'm processing. Then I can do something like .head(). That's one of the first things I typically do after loading in a DataFrame, just so I can see it; I don't have to go jump into Excel, it's just available and I can see it quite nicely. Next I've got some basic descriptive statistics, my way of getting a feel for the data, since I don't want to print out all of the entries, and you'll see why in a second. So again, I'm doing that same printing of the unique students. Then I'm looking at my DataFrame and asking how long it is; that's how many total actions there are. And then the unique activities in my analysis. In this case, I see, again, the 87 students, and a total of about 17,000 entries, which is exactly why I don't want to print that out. That's a big number. I like that from a teaching perspective, but not from an "I need to do analysis" perspective. And then I see the unique events, or activities, that students are working off of. These are little shorthand codes representing each of the different activities. For example, TE represents the typing exercises that you're doing for your attendance, but I happen to have a number of other activities to work off of as well.
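Those first-look summaries can be sketched on a toy DataFrame like this (column names are illustrative assumptions; the real data had 87 students and roughly 17,000 rows):

```python
import io
import pandas as pd

# Tiny stand-in for the real activity log.
csv_text = """user_id,event
u01,TE
u01,PP
u02,TE
u02,FB
u02,TE
"""
df = pd.read_csv(io.StringIO(csv_text))

# How many unique students am I processing?
print(df["user_id"].nunique())

# How long is the DataFrame, i.e. how many total actions?
print(len(df))

# Which distinct activities appear in the log?
print(sorted(df["event"].unique()))

# Peek at the first rows without opening the file elsewhere.
print(df.head())
```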
In this next section, what I wanted to do is look at a particular student and see how many things they did in a particular session. Just to see it here, I'm looking at the difference between two records. If I'm at this record, for example, I want to ask: what's the record before me, and specifically, what's the time difference between these two records? If it's greater than 60 minutes, then I have a new session. The student got up, walked away, ate, went to bed, something, because that's greater than 60 minutes. Hopefully you're sleeping more than just an hour, either way. So in this case, again, I'm grouping by the students to separate them out into their nice little bins. Then I'm creating a dictionary for sessions; this is my way of storing individual users, each with a list of their sessions. And that's exactly what the groupby gives me: it takes the unique values, in this case of user ID, and that becomes the key, and then the data is all of the entries associated with that user ID. So for every user, I get their user ID and the data associated with it, and I create an entry in my sessions dictionary for that user: an empty list of sessions that I'm about to start building. Then, as you can see, this is where I start building those sessions. Pandas is a little on the finicky side for what I specifically wanted to do here. In this case, data.iterrows() is a way to convert the DataFrame into something I can loop over, and as you can see, I have to unpack two values: the row index and the particular row. I don't actually want the index, but I have to unpack it; what I want is the actual row.
That gives me something dictionary-like, just like with our csv.DictReader. So in this case I can grab, say, time; I don't think I'm actually using that one, it's leftover from when I was building out the process. The first thing I do is check whether the session exists: if this is the first event I've seen from this student, just grab it and put it into the session. Otherwise, this is where the analysis happens. Look at this activity's created value, then look at the created value of the last activity the student did in this session, and take the difference between the two; that's the duration between them. So that's me extracting those out, calling the difference a delta, and then pulling out the seconds. If that delta is greater than 60 times 60, or 3,600 seconds, the 60-minute cutoff, that session is over: store it and start a new session. If it's not, they're still working, they haven't ended their session quite yet, so add that particular row to the current session. Okay, so we take those sessions and move on. Next we get into the session counts, which is just how many sessions a particular student had. This order list is mostly so my activities appear in a specific order, from least engaging to most engaging, which is the terminology I used in my paper; then I take the DataFrame and set it to that order. Now what I'm going to be adding into is, well, this actually starts as an empty DataFrame. Then for every user, for every session of that user, grab out the different activities that session had, and create a binary representation of whether or not each activity existed in this transaction, or rather this session.
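The session-splitting loop described above can be sketched like this. It's a minimal version of the idea, assuming the column names from the earlier sketches; the 60-minute (3,600-second) cutoff is the one from the video:

```python
import pandas as pd

# Toy log for a single student; a gap over 60 minutes starts a new session.
df = pd.DataFrame({
    "user_id": ["u01"] * 4,
    "event":   ["TE", "PP", "TE", "FB"],
    "created": pd.to_datetime([
        "2023-09-01 10:00:00",
        "2023-09-01 10:20:00",
        "2023-09-01 12:00:00",   # 100-minute gap -> new session starts here
        "2023-09-01 12:05:00",
    ]),
})

sessions = {}
for user_id, data in df.groupby("user_id"):
    sessions[user_id] = []          # empty list of sessions for this user
    session = []
    for _, row in data.iterrows():  # must unpack (index, row); index unused
        if not session:
            session.append(row)     # first event seen for this student
        else:
            delta = row["created"] - session[-1]["created"]
            if delta.total_seconds() > 60 * 60:   # 60-minute cutoff
                sessions[user_id].append(session)  # session over
                session = [row]                    # start a new one
            else:
                session.append(row)                # still working
    if session:                     # don't forget the final open session
        sessions[user_id].append(session)

print(len(sessions["u01"]))
```

One detail worth noting: after the loop you have to append the last, still-open session, or the final block of activity is silently dropped.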
So in this case, again, if a student did a typing exercise, that particular value would have a one. For this student, we see that they have a bunch of TEs and then PPs, Parsons puzzles, all still happening within roughly one session, so we would see a one and a one. Okay, so I run that session-building process; I never ran this one, there we are. Now you can see this is going to take a second because I'm processing 17,000 rows, but here we go. So how many total transactions do we have? Of the 17,000 rows, we've now reduced that down to 686 individual practice sessions across all students. And just to see what that looks like: this student, this particular time, did all of the activities in one session. That's what I like to see, honestly. So again, I've now made this binary representation of whether or not an activity exists; here's an example where this particular session didn't have a find-the-bug. So now we're getting into the factor analysis, and the first thing we have to do is some adequacy tests: we have to make sure that our data set is worthy of a factor analysis. First there's a quick conversion, because the factor analysis has to work on floats; it can't be integers, and I originally had these as integers, so this is just a conversion plus some debugging that carried over. The first test we're going to work off of is Bartlett's test. The whole idea is that we look at our data and check whether its correlation matrix is an identity matrix; if it is, you can't do a factor analysis. This is where I happen to use a third-party library called factor_analyzer to run Bartlett's test of sphericity, if you will, and when we run it, it gives us our chi-squared statistic and the p-value.
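The binary session-by-activity representation can be sketched like this. The activity codes, the "least to most engaging" column order, and the example sessions are all illustrative assumptions:

```python
import pandas as pd

# One column per activity code, ordered (per the video's terminology)
# from least engaging to most engaging.
order = ["TE", "FB", "PP", "OP", "SE"]

# Hypothetical practice sessions: each is just the activity codes seen.
raw_sessions = [
    ["TE", "TE", "PP"],
    ["FB", "SE"],
    ["TE", "PP", "OP", "SE"],
]

rows = []
for session in raw_sessions:
    present = set(session)
    # Floats, not ints: the factor analysis downstream expects floats.
    rows.append({act: float(act in present) for act in order})

binary_df = pd.DataFrame(rows, columns=order)
print(binary_df)
```

Note that repeats collapse: a session with two typing exercises still gets a single 1.0 in the TE column, since the matrix records presence, not counts.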
And you can see that's a very small p-value, so in that case we have a statistically significant result. Awesome, that means we're good. Now, I did add in an additional test, because more tests are good. Same kind of thing: the KMO test is just going to check whether or not our data is suitable, and as long as it's above 0.6, that's great. factor_analyzer happens to have that one too, so I run it, and as you can sort of see, we got a KMO score of 0.77. Again, that's above the 0.6 threshold, so we're good. That means we can now do our factor analysis. As the header sort of indicates, we need to specify how many factors we want to operate with. Factor analysis does dimension reduction: from our eight variables, we need to condense down, or create new variables. So how many do we create? For that, we have to calculate eigenvectors and eigenvalues, very similar to a principal component analysis, which I also have a video for. Anyway, from here I'm just following their tutorial, and they happen to have this in place for us. The first thing I'm doing is printing out the different eigenvalues; you can see I'm making it a DataFrame, but that's just for my sake. To see a different variation on this, instead of just the eigenvalues you can use a scree plot, which shows the amount of variance, how much more information you get with each additional factor. You can see there's a large drop in value from one factor to two. Okay, understandable: that means we have one definitive factor, and by taking two we still gain a lot of information. By three we don't gain as much, and four, not as much again; you can see it slowly tapers off.
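The video calls factor_analyzer's helpers for these adequacy checks; as a hand-rolled sketch of what those helpers compute, here are the standard Bartlett's sphericity and KMO formulas on synthetic data (the data, seed, and dimensions are all made up for illustration):

```python
import numpy as np
from scipy.stats import chi2

# Synthetic data: 6 observed variables driven by 2 latent factors,
# so the correlation matrix is clearly not an identity matrix.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.5 * rng.normal(size=(200, 6))

n, p = X.shape
R = np.corrcoef(X, rowvar=False)

# Bartlett's test of sphericity: is R distinguishable from identity?
chi_sq = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
dof = p * (p - 1) / 2
p_value = chi2.sf(chi_sq, dof)

# KMO: compares correlations against partial correlations;
# a score above 0.6 is usually considered adequate.
inv_R = np.linalg.inv(R)
d = np.sqrt(np.diag(inv_R))
partial = -inv_R / np.outer(d, d)
off = ~np.eye(p, dtype=bool)
kmo = (R[off] ** 2).sum() / ((R[off] ** 2).sum() + (partial[off] ** 2).sum())

print(f"chi2={chi_sq:.1f}, p={p_value:.2e}, KMO={kmo:.2f}")
```

In practice you'd just call `calculate_bartlett_sphericity` and `calculate_kmo` from factor_analyzer as the video does; this is only to show there's nothing magic inside them.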
But the last little thing that can help is seeing the cumulative variance of these factors. And don't worry too much about the warning; it's not crashing the code, I'm just not doing something quite the way it wants. The big thing is that I can see, for different numbers of factors, one factor, two factors, three factors, how much of my data I can explain. In this case, with four factors I can explain just under 60% of all variance between practice sessions. Okay, so basically this lets us decide: I'm going to use a four-factor analysis, effectively keeping everything above the cutoff, so in this case, four. With that in mind, on to performing the factor analysis. This first bit is just a little formatting function to color my individual cells; since I'm building a table, it will let the important values stick out. Then I'm specifying my factor analysis. I did multiple factor analyses, so this was just my quick way of having one variable, one cell, to change. In our case we set it, and then we do our factor analysis. Again, this is where the library I'm working off of happens to have the factor analysis for me. This is beautiful: I don't have to do the full-on math myself. If it's similar to a principal component analysis, with the correlation and covariance matrices and everything, it'll take care of all that for us. So what I'm doing here is running through the factor analysis: this is sort of building it, and this is fitting my data to it. Then I take that fitted factor analysis and convert the loadings into a DataFrame. Again, that's just my way of having something I can operate on very quickly from Jupyter. I'm then adding the event names back in as an extra column.
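The eigenvalue and cumulative-variance step can be sketched with plain numpy. The synthetic data here is an assumption standing in for the eight-variable session matrix; the eigenvalues of the correlation matrix are what the scree plot graphs:

```python
import numpy as np

# Stand-in for the 8-variable binary session data: 8 observed
# variables driven by 2 latent factors plus noise.
rng = np.random.default_rng(1)
latent = rng.normal(size=(300, 2))
X = latent @ rng.normal(size=(2, 8)) + rng.normal(size=(300, 8))

R = np.corrcoef(X, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]   # largest first

# Cumulative share of total variance explained by the first k factors;
# this is the "how much can I explain with k factors" number.
cumulative = np.cumsum(eigvals) / eigvals.sum()
for k, (ev, cum) in enumerate(zip(eigvals, cumulative), start=1):
    print(f"{k} factors: eigenvalue {ev:.2f}, cumulative variance {cum:.1%}")
```

A handy sanity check: for a correlation matrix the eigenvalues always sum to the number of variables, since each standardized variable contributes exactly one unit of variance.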
That's mostly so I don't have to scroll up to see what the third row was. Then for this last little bit, let me actually hide it and show why I did it. I take this, run it, and here's the printed factor loading; I'll come back to that in a second. So in this case, you can see the plain factor analysis output. The next little bit I added in is this .round(4), which just shortens my values a little; nothing terribly crazy. But this last piece effectively says: for every one of those numbers, decide whether or not to color it in. That's where that formatting function comes into play: if the value is greater than 0.5 or less than negative 0.5, make it green; otherwise, check if it's at least above 0.4, which may occasionally be indicative of something, and color it yellow; otherwise, just color it white. Then there's a little bit of HTML and CSS going on there, but that's just how to set the background color. What this does is turn the table from looking like this into looking like this. Oh, look at that, pretty colors. The whole idea is that now I can draw some conclusions from the first factor onward. One of the most common session types was find-the-bug, fix-the-bug, and self-explanation. The next most common practice session was fill-in-the-blank, Parsons puzzle, and output prediction. The third most common was a typing exercise on its own; this was not an attendance-style class, it was just another activity. And the last was self-explanation leading into a coding exercise.
So as you can see, what do you know, I can do a factor analysis and I have explained my data. This last little bit helps when I'm dealing with a much larger data set. Here I'm only working with eight variables and four factors, but with something much larger, the table may be more difficult to read. So this portion just goes through the loadings in my factor analysis, grabs the high loadings and the moderate loadings, and prints them out for me: a factor either has no notable loadings, or it has high or moderate loadings for these particular variables. So I take that and, what do you know, factor one has high loadings for fill-in-the-blanks and find-and-fix-the-bugs. I didn't print the yellows as well, but that's just because this is a quick explanation for my own sake. But again, as you can see, this is a full-on example of utilizing pandas, numpy, and matplotlib to do something scientific: a data analytic, in this case a factor analysis, to study how students behaved in my class.
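That text summary of the loadings table can be sketched like this. The loadings and factor names are made-up illustrative numbers; the 0.5 and 0.4 thresholds are the same ones used for the coloring:

```python
import pandas as pd

# Hypothetical loadings table.
loadings = pd.DataFrame(
    {"Factor1": [0.82, 0.61, 0.10], "Factor2": [0.05, 0.45, 0.77]},
    index=["FB", "PP", "TE"],
)

summary = {}
for factor in loadings.columns:
    col = loadings[factor].abs()
    high = list(col[col > 0.5].index)                    # strong loadings
    moderate = list(col[(col > 0.4) & (col <= 0.5)].index)  # moderate ones
    summary[factor] = (high, moderate)
    if not high and not moderate:
        print(f"{factor}: no notable loadings")
    else:
        print(f"{factor}: high loadings for {high}, moderate for {moderate}")
```

For a table with dozens of variables, this kind of printed digest is much faster to scan than hunting for colored cells.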