 What's going on everybody? Welcome back to another video. Today we are continuing our data analyst portfolio project series with our fourth project in Python. Now I am extremely excited about this project because this is the very first portfolio project that we are doing in Python and we're going to be using lots of popular libraries like Pandas, Seaborn and Matplotlib. If you have never used Python at all, this project may be a little bit difficult for you. I kind of expect you to know at least the basics of Python and I go a little bit above and beyond that in some of the areas but I explain the more difficult concepts whereas I'm not going to explain what an array or a for loop is because I'm going to expect that you know that. I tried to make this as beginner-friendly as I possibly could but I will say I think this is one of the more technical projects that we have worked on so far. With that being said, let's jump over to my screen. We are going to download the data set. We're going to install our Python IDE. We're going to complete the entire project, upload it into GitHub. So let's get started. All right so the very first thing that we need to do is we need to download our data set and this is coming straight from Kaggle as you can see up here. This is the movie industry data set. Let's go down here and just take a sneak peek really quickly. It's showing us 10 to 15 columns so we're really not able to see all of them at the moment. We could but I don't want to take the time to do that. So we have the budget. We have the company. We're going to be looking at things like genre, the gross revenue of the entire movie, the name of the movie, the released date and then there's a few other ones and I'm not going through all of them right now because we will get plenty of time to look at that in just a little bit. We're going to click download right here is going to go into our downloads folder. What we now need to do is to download our Python IDE. 
So what we are going to do is we're going to be using something called Jupyter notebooks and we can get that through something called Anaconda. So we're going to be downloading Anaconda right now. All you need to do is go right here and click download again. I'll include all these links in the description. So you just need to click download if you're using a Mac or if you're using Linux, just click on one of those and you can get that version that you need. Let's click on download. I am not going to walk through the actual installing it because it's super, super easy. I don't foresee anybody having any issues, I hope, unless you have storage issues like you don't have any room for it. But just install that and right down here, this is going to pop up. And what we are going to do is we're going to click on this Jupyter notebook right here. So we're going to launch it. Of course, I already have this pulled up over here, but I'm going to pull it up over here as well. Here are where we kind of store our projects. This project is for a later time and these are ones that I've already completed. So I've already completed this entire project. I have it on another screen right down here. So I'll be getting a lot of, I'll just be reading a lot of it because it took me a long time to create. So I don't want it to take as long as it took me to create. I want to do this quickly. So what we are going to do is I'm not going to pull up any of these. I want to click new, click on Python three. So this is what it looks like. This is where we're going to start. Excuse me. Give me a second. So what we need to do, first we need to import our libraries. So, or import the packages. And then we're going to read in the data, right? So the first thing is import libraries. And we're going to be using a lot of classic ones. I don't want to waste a bunch of time on this part specifically. So I'm just going to paste this in here. But and you guys can do the same thing. 
I will have this on the GitHub. Look at the link. Everything that you're about to see will be in there. So you don't have to write this out either. But when I initially did, it just took me a while. So I didn't want to waste that time. But what we're going to be using, we're going to be using pandas as pd, so pd, sns, plt. This is what a lot of people use. I mean, it's common practice to use that. So I advise you to do that as well. Excuse me. So we're going to be using pandas, Seaborn, Matplotlib. And I mean, those are the big ones. We should include this matplotlib inline. If you don't know what that is, just Google it. It's useful for what we're about to do. But now we can actually read in the data. So let's go right down here. I'm going to say read in the data. All we need to do is we're going to be using pandas for this. We're going to be creating a data frame using just, I think it's read CSV. So it's super easy. So we're going to do df.read underscore csv, open parentheses, and then what are these? Apostrophes? Double, yeah, just apostrophes, I think. What we now need to do is we need to locate where it is in our folder. So if we go right over here, we have this movies right here, we can right click on it. And if you want to, this is what I usually do, I go right in here, and the location says C, Users, Alex F, Downloads. That's where it's located, which it is. So we're going to go right in here, we're going to go like this, and we're just going to type in the name, and that's movies.csv. So that is as easy as it's going to get. The next few things are probably as easy as it's going to get. So let's try running this really quick and we should get an error. It's going to say Unicode escape. All you have to do to resolve this, and if you don't do this it happens all the time, is include an r in front of the path. Okay, so now it should work. What did I say? Oh, I said df, why did I say that? 
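As a rough sketch of the read-in step described above: the tiny `movies.csv` written here is a made-up stand-in for the Kaggle download, and the Windows path is only illustrative. In the notebook itself you would also run the `%matplotlib inline` magic, which is notebook syntax rather than plain Python, so it is left out of the block.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for the downloaded file: a two-row movies.csv in a temp folder.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "movies.csv")
with open(path, "w", encoding="utf-8") as f:
    f.write("name,budget,gross\nMovie A,100.0,250.0\nMovie B,200.0,900.0\n")

# In a normal string literal, the \U in a Windows path such as C:\Users\...
# starts a Unicode escape and raises the "unicodeescape" error.
# Prefixing the literal with r makes it a raw string, so backslashes stay as-is.
windows_style = r"C:\Users\someone\Downloads\movies.csv"  # illustrative only, never opened

# pandas (imported as pd) does the reading, and the result is assigned to df.
df = pd.read_csv(path)
print(df.head())
```

With the real file, the only change would be passing your own raw-string path to `pd.read_csv`.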
Why didn't somebody tell me I was doing that? We're going to do df equals, and df is just short for data frame. So pd, because we're using pandas. I don't know why I wrote that; I was getting ahead of myself, I guess. So we're going to run that, it should work. So it did work. And let's look at it. So let's say, let's look at the data. We're just going to do df.head, and then parentheses, and we're going to run this as well just to get a really quick glimpse. So this is just the top five rows. So we have our budget, our company, some of the ones that we were looking at earlier. And ones that we didn't see before, I believe, were score, star, votes, writer, year. And we'll get into all of this in a little bit. The very first thing that we're going to be doing is cleaning up the data. In just a second, I'm going to cut myself off screen; I wanted to join you for at least a little bit longer. But we're going to be cleaning the data and kind of just formatting it how we need it for what we're going to be working on. As I said earlier, we're not going to be working on what I think most people thought we would be working on. We're not going to be using pandas basically like SQL, where we're doing group bys and all these things. We're going to be going kind of in a different direction. We're going to be working on correlations. So if you don't know what a correlation is, for example, let's say as the budget increases, you also expect the revenue to increase, right? If they spend a hundred million dollars on a movie, you expect them to make 500 million. Or if they spend zero dollars, you expect them to make like, you know, $10,000. That's a high correlation, because when one is low the other is low, and when one is high the other is high. 
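To make the "low with low, high with high" idea concrete, here is a small sketch with made-up numbers (not taken from the movie data). It uses NumPy's `corrcoef`, which is my choice for the illustration and not something from the walkthrough.

```python
import numpy as np

# Made-up budgets and two made-up revenue series.
budget = np.array([1, 2, 3, 4, 5], dtype=float)
gross_tracks = 5 * budget + 10                        # rises in lockstep with budget
gross_mixed = np.array([3, 1, 4, 1, 5], dtype=float)  # only loosely follows budget

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the correlation.
r_tracks = np.corrcoef(budget, gross_tracks)[0, 1]
r_mixed = np.corrcoef(budget, gross_mixed)[0, 1]

print("lockstep:", r_tracks)  # essentially 1
print("loose:", r_mixed)      # noticeably lower
```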
And so we're going to be doing that for all of the fields that you see here and trying to find which fields are directly correlated or highly correlated with this gross revenue. Because again, I think it's interesting to know what things impact the revenue of a film. So that's something that I found interesting. So that's why we're doing it because, you know, again, I created this. I'm just kind of going with the flow. So a little bit different. I hope that is exciting. We're going to be using some really fun things coming up. But with that being said, I'm going to take myself off of here so that you guys can see my full screen and we're just going to get going. So goodbye. I will miss you, but you know, we'll see each other again, I promise, in a future video. All right. So what we're going to do is come down right here. And we want to look to see if there's any missing data. So let's see if there's any missing data. I wouldn't be putting this as my description. Go back and change that to something that makes sense. But that's what we're going to keep it as. There's lots of ways to do this. But what I'm going to do is I'm going to create a for loop. We're just going to loop through these columns. And within each one, see if, you know, there's missing data in it. So we're going to say for col in df.columns. And we're going to say, let's do np.mean. And we're going to do an open bracket, or sorry, an open parenthesis, and we're going to do data frame. And we're going to put in that column, like right up here. So when that column gets inserted into here, it'll say data frame, specified column. And then we're going to do dot isnull, and open parentheses. It's pretty straightforward, I guess. And we're going to call this percent missing. So we'll do that. Then what we want to do is we want to create our output. So we're going to be printing this. So we're going to print. 
And let's do a little bit of formatting. And this is purely just for visual purposes for you. But you don't have to do this if you don't want to. But I'm going to do like this. And then we're going to insert what our format is going to be for this. So we're going to do give me a second. Let's go back here at this percent sign. So we're going to do dot format. And then it's going to be the column name that's going to be right. That's going to insert right here. And then we're going to do our percent underscore missing. So let's run this really quick. See if it works. NP is not defined. That's probably because I didn't import it. Import numpy as NP. Why did I why did I not do that? Now let's see if it works. There we go. Happens to the best of us. So yeah, so this NP is supposed to be from numpy, which apparently I didn't include it. So we basically loop through every single thing. We look to see if there's any columns that had nulls in it. And it looks like every single row, every single value is filled in. So we don't have to worry right now at least about that. So let's keep going on to the next part. We're going to be doing some really basic data cleaning. I think I mentioned that earlier. So the first thing that I want to look at it are the data types for our columns. Super easy to do. We're just going to do df dot d types and run this. So we have floats, objects, integers, and that's about it. But one thing that I noticed right away was that, at least in the data, was that this has like, what is this, 80 million? I think that's how much that is. Let me see if that's what it is. One, two, three, one, two, three. Just 8 million. Okay, so this is 8 million, but it has this 0.0 at the end. We don't need that. And the same thing for the gross revenue, I think it's the only one that it does that on. I just want to get rid of that just for the pure sake of I think it just doesn't look great. We don't need it. 
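The missing-data loop described above might look roughly like this. The four-row frame is invented so the sketch runs on its own, and the `pct_missing_by_col` dictionary is an extra I added so the result can be inspected. Multiplying by 100 before printing is also my choice, since `np.mean` of the null mask gives a fraction.

```python
import numpy as np
import pandas as pd

# Invented data with one missing budget so the loop has something to report.
df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "budget": [1.0, np.nan, 3.0, 4.0],
    "gross": [10.0, 20.0, 30.0, 40.0],
})

pct_missing_by_col = {}  # hypothetical helper, not part of the walkthrough
for col in df.columns:
    # isnull() gives True/False per row; the mean of that is the share missing.
    pct_missing = np.mean(df[col].isnull())
    pct_missing_by_col[col] = pct_missing
    print("{} - {}%".format(col, round(pct_missing * 100)))

# Data types per column, the next thing checked in the walkthrough.
print(df.dtypes)
```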
So what we're going to do is we need to specify what column we're looking at first. So the one we're going to be doing first is this budget. So we're going to say data frame. And then again, we need to say budget. So that is specifying the column that we're going to be working on. And we're going to do as type. So this is just going to change the data type. And we're going to do open parentheses, apostrophe. I hope that's what it's actually called. If I'm calling it an apostrophe and something else, I'm going to feel like an absolute idiot. So this is going to change it to an integer, but we need to apply it. Right. So we're just going to take this. Otherwise it wouldn't apply it. So we're going to do just like that. And I'm just going to make the note change data type of columns. Make it something better than that, please. I'm just doing this quickly for our purposes. So we're just going to copy that. We're going to do the exact same thing for gross. Okay. And let's run this and take a look and see if that actually worked. So we're just going to run our data frame. And now it just looks a little better, right? It's nothing huge. That's a super small change. But it does work. The next thing that I want to look at and this is something that you, unless you're like kind of looking in the data, you may not notice, but as this year here, if we go back to right here, and I actually am now going to pull in all of these, if you look in here, it says that the year is the year of release. And then we also have this column called released date. So the year in the released and the year in the in the year should match hypothetically speaking, but they don't always. So here's 2016. Here's 2017. Let's see if there's any ones up here that show it. There are a lot of them though. There's a lot that were like 1987. This said 1986. So you can go through and see those all yourselves. I'm not going to, I'm not going to do that. You can if you would like. 
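A minimal sketch of that data type change, assuming the target type is `int64` (the exact type string is not spelled out above). The gross values are made up. One thing to keep in mind: this conversion raises an error if the column still contains missing values, so it relies on the earlier check having come back clean.

```python
import pandas as pd

# 8 million shown as 8000000.0, as in the walkthrough; gross values are invented.
df = pd.DataFrame({
    "budget": [8000000.0, 6000000.0],
    "gross": [52000000.0, 70000000.0],
})

# astype returns a new column, so it has to be assigned back to take effect.
df["budget"] = df["budget"].astype("int64")
df["gross"] = df["gross"].astype("int64")

print(df.dtypes)
print(df)
```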
But again, that just will take a lot of time to kind of dig into the data. But that's what you need to do to figure out how to clean the data. So let's do that. What we're going to do is we're going to fix it. And we're not going to fix it by changing this one; we could, but what we're going to do is create a new column. So we're going to take this released column and we're going to take just these first four characters, and that's going to become our new year column. Okay, so I'm not going to delete the year. We might drop it later. But as for right now, we're going to create this new column. So we're going to say df and then we're going to do bracket, apostrophe. I'm telling you, if I am saying this wrong this whole video, I'm not going to be happy about that. And we're going to take that from the released. So again, we're taking this released column. And right now, what data type is it? Released is an object. I want to make it into a string so that I can pull from it, or take the string from it. You'll see. So astype, and we're going to make this a string. And then what we want to do is take the first four. So str, and then we're going to do open bracket and then colon four. You can also do zero, but it's understood; if you leave that blank, it starts from the very beginning. And let's create this new column, what we're going to call this. So we're going to do like this. And we can do year underscore... actually, let's do yearcorrect. And we're going to do it just like this. And we're going to say create correct year column. So let's add this right here. Let's run it and see what we get. And so this is the original year column that had 2016. And this is our year column that has 2017. It looks correct. I don't know why the one in here is a mistake. 
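Here is a small sketch of the new year column, assuming (as in the version of the data shown) that each `released` value begins with a four-digit year. The two rows are invented to mimic the mismatch described above.

```python
import pandas as pd

# Invented rows where the year column disagrees with the release date.
df = pd.DataFrame({
    "released": ["2017-01-13", "1987-03-06"],
    "year": [2016, 1986],
})

# Cast to string, then slice the first four characters with the .str accessor.
df["yearcorrect"] = df["released"].astype(str).str[:4]

print(df)
```

Note that the new column holds strings, so if you want it treated as a number later you would convert it again with `astype`.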
They made that mistake, but it looks like that's what it is. And ours fixes that. So if we're going to be using that year at all, we'll pull from the yearcorrect. I think it was called yearcorrect. That's what we'll pull from. So we just corrected that and we should be good to go on that front. The last thing I'm going to do, I guess, really the last thing we do to the data, is I'm just going to order it. Super simple. I'm just going to order it by the gross revenue. So we're going to come down here. We're going to say df.sort underscore values. And we're going to say by, and then do an equals, open paren, or open bracket. And we're just going to specify that gross column. And we'll do inplace equals false, oops, False with a capital. And then ascending. And that's going to be equal to False because I want it to be descending. So let's look at this. So the highest grossing, and you'll notice a trend down here. Some of the highest grossing films are really big ones, right? Avatar, Titanic, Jurassic World, The Avengers. Some of the ones down here are not as well known. I don't think I Spit on Your Grave 2 was the most popular movie. One was fantastic. Two, I thought it just wasn't my cup of tea. I think I was all of their revenue, to be honest. One thing that we're doing right now, and I'm just going to add this, we can make it to where it doesn't have these, you know, we're only looking at a little bit of the data. Let's say we want to look at all of the data. Let's really quickly do that because I'm sure some of you guys are wondering how to do that. Maybe you're not, but I'm going to show you how to do it anyways. So we're going to do pd.set underscore options. And we're going to say open parentheses, apostrophe, display, oops, display.max underscore rows. And this is set to like, I think it's like 20 or something by default. And we're just going to say None. So we're going to do that right here. And we're going to run this. Oops, what did I do wrong? 
It has nothing called set underscore options. Oh, that's because it's supposed to be set underscore option. All right, so that should be good. Let's try running this again and see what happens. Okay, so it's going to take a lot longer because it's pulling in all of the data. But when you come in, oh, yeah, when you come into here now, you'll be able to scroll, right? Super useful. I prefer it this way. Most people don't have it this way as a default. So I just wanted to show you how you can change that. So it will now be like that for the rest of the project. Let's keep going. One thing that is important, when you're working with something that doesn't have any null values, is that you want to make sure you don't have any duplicates. So super quickly, we're just going to look to see if there are any duplicates, and we're just going to drop them. Super easy to do. We're going to say, well, actually, let's write, we're going to drop any duplicates. So let's do df, and we're going to do open bracket. And you can do this on any column you want. You can do this on multiple columns. You can do this across the entire thing. And you should. But how you do this is you can say, you know, company, oops, excuse me, company, and do dot drop underscore duplicates, and dot sort underscore values, oops, values. And you can say ascending equals False. Let's just run that really quick. What did I say? Dot sort values. Oh, that's because I didn't go like this. My bad. So it's going to sort the values. And it's going to tell us if it drops any. It doesn't look like we're dropping any duplicates, right? So this is the distinct list of these sorted values. So this is just showing us all of the unique values in here. If we were to get rid of this, and I'm just showing you really quick so I can actually make sense of what I'm trying to say. 
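As a sketch of the sort and the display setting just covered, here is a three-row invented frame. The sort result is assigned back to `df` in this version, because with `inplace=False` the original frame is left unchanged otherwise. As far as I know the pandas default for `display.max_rows` is 60, not 20.

```python
import pandas as pd

# Invented titles and revenues, just to have something to order.
df = pd.DataFrame({
    "name": ["A", "B", "C"],
    "gross": [50, 900, 200],
})

# The function is set_option (singular); None removes the row limit when displaying.
pd.set_option("display.max_rows", None)

# Highest gross first. Reassigning keeps the sorted order for later cells.
df = df.sort_values(by=["gross"], inplace=False, ascending=False)

print(df)
```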
If we get rid of that drop values, then we start seeing all of these Zentropa entertainments, whatever that is. We're seeing it tons of times. So all that does is it shows us what values are distinct in here. And if we want to get rid of that, we can do, you know, DF company, and we're not going to do this. You have company equals. And then we do it like this, we're not going to do it because I don't want to get rid of all the ones in there. But if we wanted to do it across the entire thing, we do DF company, or just data frame, we wouldn't, we wouldn't do any of it. So that's just to show you kind of what that is, but we do this. And if we wanted to do that, we could absolutely do that. I mean, there aren't any duplicates, but you run that, it will drop any duplicates across the entire data frame. So that's what that does. Let's keep going really quick. And something else, just with this, by the way, a reason why we're also could be looking at this is to see if there's issues in the actual quality of the data. Actually, let me go back up. There was one up here. I think it's like Warner Brothers or something. Let me see. Did I go too far? So right here, actually, this is fine. We have Walt Disney, Walt Disney, Walt Disney, right? There's a bunch of them. Something that you might need to do when your data cleaning is to actually aggregate all these or standardize all these, however you want to say it. These all, I've already looked into this and we don't need to do this, but all of these are different companies or were companies during different times, right? So let's say this one was from like 1995 to 1980, and then they changed the name to this. We don't want to then standardize it because those are two distinct timeframes and two distinct companies. But if this one said, say, for example, there was Walt Disney feature animation, then Walt Disney feature animations with an S on the end, that'd be a mistake and we would want to correct that. 
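The two uses of `drop_duplicates` discussed above can be sketched like this, on an invented frame with one fully repeated row. The variable names `distinct_companies` and `deduped` are mine.

```python
import pandas as pd

# Invented rows; the second and third are exact copies of each other.
df = pd.DataFrame({
    "name": ["A", "B", "B", "C"],
    "company": ["Zeta", "Alpha", "Alpha", "Zeta"],
})

# On a single column: the distinct company names, sorted descending.
# This is only a view of the unique values and does not change df.
distinct_companies = df["company"].drop_duplicates().sort_values(ascending=False)

# Across the whole frame: drops rows that repeat in every column.
deduped = df.drop_duplicates()

print(distinct_companies)
print(deduped)
```

Assigning the single-column result back to `df["company"]` is the step the walkthrough deliberately avoids, since it would blank out every repeated company name.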
Luckily, we don't have to do that because that's a huge process. Trust me, I've done that is tough. So we're not going to do that today. Thank you for sticking with me in my rants that I'm doing at the moment. So that's kind of an additional reason why we were wanting to look at this and how we looked at this. But you can also drop the duplicates, which helps clean it up because you shouldn't be having any duplicates in here. But with that being said, I believe we now have our data how we want it, which is fantastic. I think that's probably the easier part of what we're doing. So now that we have our data, we're going to start looking at what variables or what columns and let's pull this back up. I should have done a lot of run should have done that head, but we're going to see what things are most correlated to this gross revenue. Okay, so my hypothesis, what I'm going to be kind of checking. So because it is hard to look at all these not hard, because we're going to do it. It does take time to go and do one at a time to compare all of these. So I'm going to be doing ones that I think will have a high correlation. And then we're going to test it. And then we're going to look at all of them together. And I'll show you how to visualize all of this and write all this out. But I believe that this budget, and I'll write it down here with my predictions, I believe that the budget is going to have a high correlation. I think that the more money they spend, the more money they're going to bring in. That's my guess. I believe that the budget is going to have a high correlation. I also think that, and you know, this may not be correct. I think that the company would also have a correlation as well, somewhat high. I think that some of these bigger ones like, I mean, 20th Century Fox, Film Corporation, Walt Disney, they make movies that bring in a lot of money. So I think that the company company will have a high correlation. Let me write that out. That's kind of my guess. 
These are my educated guesses. Don't put that in your scripts. You don't need that. That's my guess. This is what I think is going to happen. But we're going to test it out, right? So one thing that we can do super quickly to compare the budget and the gross revenue is to do a scatter plot. So let's build a scatter plot and let's compare, let's do a scatter plot with budget versus gross revenue. What we are going to do is go right down here. We are going to say plt. So this is our Matplotlib, plt.scatter. And that's going to be our scatter plot that we were just talking about. And we're going to say x equals, and this is, you know, what data are we going to be looking at? So this is on the x axis. So we're going to say x equals data frame. And this is going to be our budget. So we're going to do a bracket, apostrophe, budget. Again, I keep hesitating on that apostrophe. I feel like I'm wrong. I feel like if I am wrong this whole time, I'm going to be so mad. I'm telling you. And then our y axis is going to be data frame, brackets. And then it's going to be our gross. Oops. What did I do? I'm messing stuff up. So it's going to be our gross. So super easy. Let's do plt.show. This is going to actually bring it out. So this is what it looks like. It's hard to interpret exactly what's going on here. I am going to .head. Actually, let me actually go pull. I want to pull this thing that we were looking at right up here. Actually, no, what I'm going to do is I'm just going to say data frame is equal to. So that I can just run the data frame down here. So there we go. .head. All right. So I just wanted to have these ones on top. So it's hard to tell exactly what's going on here. So I'm going to add a little bit of information just so we can all read it. Let's add a title. This will be plt.title. And this is going to be budget versus gross earnings. We'll do plt.xlabel. Oops, not clabel. xlabel. We'll do the xlabel. And that was our gross. So I'm going to say gross earnings. 
And we'll do plt.ylabel. And this is going to be open parentheses, apostrophe. We're going to do budget for film. So oops, that's not what I wanted. So now let's run this and see what we get. So this is a pretty good, really quick visualization of what we're looking at here in terms of the budget versus the gross. If you look at this one right here, this one is easy to find because it's the very first one down here. So we're looking at a budget of $245 million. And these are in the millions. So $2.45 is going to be right here. So that's right. And then the gross earnings was $936 million. So almost $1 billion. And then this is by $100 million. So $200 million, $400 million, $600 million, $800 million, almost to $1 billion right here. So just a super quick fact check just to make sure that this is in fact correct. What we want to do is determine if these are correlated. Visually, you can kind of guess, it seems to be a little bit, but it's hard to tell. So what we're going to do is do something called a reg plot or a regression plot. So we're going to come down here. We're going to be using Seaborn for this. So let's do sns. Let me actually type in right here. Bear with me. We're going to plot the budget versus gross using Seaborn. So let's do sns.regplot and open parentheses. And again, I'm just going to steal a... wait. No, no, no, no. I'm not going to steal that. I was thinking about stealing something, but that doesn't actually work. So our x is going to be our budget. And our y is going to, oops. I hate when I do that. y equals gross. I'm just really fumbling things up here. And then we're going to say our data is equal to our data frame. And let's run this really quickly. And then I'm going to add some additional things to this. But now we have this line. And this is going to show us the correlation in super simple terms. It's going up and it's showing a positive correlation. So just at a glance, really quickly, with what we've done. 
I can already tell you that the budget and the gross are correlated, but how much we don't know. But I will get that in just a second to show you exactly how much it is. But I want to add some other information to this just so it looks better. So we're going to do scatter underscore KWS. And then we're going to change some of these, just one of these colors. So it makes it a little bit easier to read. We're going to do these, oh gosh, what are these called? I'm just going to call them squiggly brackets. So let's call it that. We're going to do color. And we're going to say colon and we're going to do red. So I want to keep the dots red, right? But I want to change up that line just so it's easier to visualize. That will help us down the road, I promise. So line underscore KWS, squiggly brackets, color. And we're going to do that. Let's make it blue. Why not? And just like that. Whoops. Yep. Let's see if this actually works. I feel like I messed something up. But yeah, I did something wrong here. Give me a second. Because I have this, I just need to make sure I'm like closing this off correctly. You guys are probably seeing what I'm missing. Oh, that's it. Oh, I must have hit insert again. I'm telling you. It messes me up every time. There we go. This should work now. So yeah, I just, it was simple syntax error. But you can specify these things and make it look a little bit more appealing, easier to visualize. So much easier to see this. When it was red on red, just made it a little bit more challenging. You know, it's hard to see in here. It's tough. You can make this almost any color you want, by the way. You know, you can make this black. You can really do anything you want. I just prefer the red and the blue for this. It's just super simple to see and looks totally fine. So use any colors that you think you want. And we'll go from there. But now let's determine what the actual correlation is. 
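Putting the scatter plot and the regression plot from this part together, a rough sketch could look like the following. The five data points are invented. In this version the axis labels follow the data actually placed on each axis (budget on x, gross on y), which is my choice for the sketch.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Invented budgets and revenues with a loose upward trend.
df = pd.DataFrame({
    "budget": [10, 20, 30, 40, 50],
    "gross": [15, 45, 50, 95, 110],
})

# Plain Matplotlib scatter plot with a title and axis labels.
plt.figure()
plt.scatter(x=df["budget"], y=df["gross"])
plt.title("Budget vs Gross Earnings")
plt.xlabel("Budget for Film")
plt.ylabel("Gross Earnings")
scatter_ax = plt.gca()

# Seaborn regression plot: red dots, blue fitted line.
plt.figure()
reg_ax = sns.regplot(
    x="budget",
    y="gross",
    data=df,
    scatter_kws={"color": "red"},
    line_kws={"color": "blue"},
)
```

In the notebook, ending each cell with `plt.show()` renders the figure the way it appears in the video.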
Because we can see that there's a positive correlation, but we don't know how much. Is it more or less than other fields? We don't know. So let's start looking at correlation. And something that you can do that's so easy is df.corr. And let's run this. And these are some of the fields from our data, right? These are some of the fields. Now year is in there, but our yearcorrect isn't. I'll go back and look at that in just a second. Important to know is that this correlation is only going to be working on numerical fields. It's not going to be working on our company, our title, the things where there are strings in there. It's only working on the numerical ones, which is okay, but that does pose an issue. So we're going to have to solve that later on. Another thing to know about this correlation is there are different types of correlation. Different, what's called, methods. So there is the Pearson, which is the one we're using. The default is Pearson. There's also one called Kendall, and there's one called Spearman. And they're all going to give you slightly different results, or I think this one gives more than slightly different results, but they all have their different way of determining correlation. So it's just something to be aware of if you want to really use this. You should be aware of which one you're using by default and which one you want to be using. And I recommend doing some research into these just so you know. But let's just try the different ones real quick. So Pearson is the one that I believe we're using by default. So before we actually hit enter or run this really quick, the budget and the gross have a pretty high correlation. It's 0.712196. That's a pretty good correlation. There aren't many other ones in here that are that high. Votes and gross are close. But for gross, I mean, the budget I think is the highest one. And then next is votes. That's what we know. 
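Here is a sketch comparing the three methods named above on invented numbers, where gross always rises with budget but not in a straight line. Two assumptions on my part: the frame is narrowed to numeric columns first with `select_dtypes`, because newer pandas versions complain about string columns in `corr`, and the `ImportError` guard is there because, to my knowledge, pandas relies on SciPy for Kendall.

```python
import pandas as pd

# Invented data: gross always increases with budget, but far from linearly.
df = pd.DataFrame({
    "company": ["X", "Y", "X", "Z", "Y"],
    "budget": [1, 2, 3, 4, 5],
    "gross": [1, 4, 9, 16, 100],
})

# Keep only numeric columns so the string column is not passed to corr.
numeric = df.select_dtypes(include="number")

results = {}  # hypothetical holder so the matrices can be compared afterwards
for method in ("pearson", "kendall", "spearman"):
    try:
        results[method] = numeric.corr(method=method)
    except ImportError:
        # Assumption: Kendall needs SciPy installed; skip it if it is missing.
        continue
    print(method)
    print(results[method])
```

On this toy data the rank-based Spearman result comes out at 1 while Pearson is lower, which shows in miniature why the method you pick matters.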
So let's run this again with Pearson. So it's going to be the exact same. But now let's do Kendall and let's run this one. And now budget and gross is 0.523459. I don't know why I'm saying the entire thing, but I am. And then let's try Spearman. This one should be a lot closer than the Kendall 0.698. So again, you need to be aware of what you're using, the different types. Why? For what we're going to be doing today, we're just going to keep it default and be doing Pearson. And that should be all we need to really look at. What I want to do is it's kind of, it really is, it's hard to look in here and read each number individually. What would be super easy is if we could visualize this and we can. So really quickly, I want to say, well, I want to make note that high correlation between budget and gross, I was right. Not important. I just wanted to toss that out there. But what we're going to do now is we're going to visualize this information right here. This correlation matrix is what it's called. It's called the correlation matrix. So what we're going to do is going to take this and we're going to assign it as our correlation matrix. That's going to be equal to this. So this is now called correlation matrix right here. And what we're going to do is something with Seaborn is going to be sns.heatmap. And we're going to do an open parentheses and we're going to use this correlation matrix. And I want the annotations equal to true. And if I didn't have it on, I'll show you what that does. If I don't have that on this part in there later. But we can do plt.show. Really quickly, let's look at what this looks like. As you can see, it has our numbers. So right up here, actually, let's run this again. We had 0.71. We have 0.71, 0.29, 0.29. So now we have a visualization of this correlation matrix that we wrote. And it has this nifty little bar over here. And if it's black, it's a very, very low correlation. 
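The heatmap step can be sketched as below, again on invented numbers. The `votes` column and its values are placeholders, and restricting to numeric columns before calling `corr` is my assumption so the sketch also runs on newer pandas versions.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Invented numeric columns standing in for the movie data.
df = pd.DataFrame({
    "budget": [10, 20, 30, 40, 50],
    "gross": [15, 45, 50, 95, 110],
    "votes": [5, 3, 8, 6, 9],
})

# Pearson is the default method; it is named here only to be explicit.
correlation_matrix = df.select_dtypes(include="number").corr(method="pearson")

# annot=True writes each correlation value inside its cell.
ax = sns.heatmap(correlation_matrix, annot=True)
plt.title("Correlation Matrix")
```

Leaving out `annot=True` gives the same colored grid without the numbers, which is harder to read.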
So anything that's black is super low, and anything in brighter colors is a high correlation. We have, of course, a one-to-one correlation for everything in this matrix that's compared with itself, so year to year, budget to budget. And then 0.71, 0.71, 0.66, 0.66, the ones we were just looking at. But now it's visualized, so it's a little bit easier. This will come in handy in just a little bit when we're visualizing every single column, which will be really fun. But what we should always do is add a title and axis labels. I'm going to go steal these real quick, because I don't feel like writing them out again. Again, I would consider myself somewhat lazy when it comes to this. So we're going to say the title is Correlation Matrix for Numeric Features. Sounds good to me. And then we'll do Movie Features, since it's the same fields on both the x axis and the y axis. So let's run that. Looks a little bit better. There we go. It's nice to visualize this because it is tough to read through every single number. It's just nice to see that, okay, these are highly correlated based off the color and based off these numbers. Super easy to see. And again, you can always go up here and change it to Kendall and see what that looks like. You know, it changes things. Statement of the year: it changes things. So I'm going to keep it as Pearson, as we talked about, and we can move on from there. Next, I think we said we're going to look at company. Let's pull this up really quick. Company is not numeric, as we can see. That's not numeric at all, but we can convert it and create a numeric representation of it. So for example, this 20th Century Fox Film Corporation could be number one, Lucasfilm number two, and Marvel Studios number three. Then the column will say, you know, one, one, two, and three, so they'll all have their unique identifier.
So instead of it being a string, it's going to be numeric, so that we can include it in the correlation matrix up here. So let's look at company. Okay. What we're going to do is, I'm going to call it numerizing. That may not be a term at all, but that's what we're going to call it. For the sake of simplicity, df_numerized is equal to the data frame. Super easy. And we're going to do a for loop, and we're going to do this for all fields. We could specify just company, but let me take a step back: by doing all the fields at one time, we'll be able to look at company as well as country, director, genre, and name all at once. We could just do company, but it's better to do them all at the same time. So let's use a for loop. Somewhat similar to what we did before, we're going to say for column name in df_numerized.columns. This should seem quite familiar because we did something like this earlier. Then we say if df_numerized, with the column name in brackets, has a dtype that is equal to object, which means it's a column like company, country, director, or genre. If it is, we want to change it to a category type. So we take df_numerized with the column name and call astype on it, which changes the type of the column to category. And let's assign that back to the same column, so that on the next line we can use something called cat.codes. So we set the column to df_numerized with the column name, dot cat dot codes. This is what actually gives each value its numeric code. Again, it's called df_numerized. Just roll with me. Okay.
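The loop just described can be sketched roughly like this. The sample rows are invented, and col_name is just an illustrative variable name. Two things differ from the walkthrough and are assumptions on my part: the sketch copies the data frame so the original keeps its strings, and it checks for "not numeric" rather than dtype equal to object, because newer pandas versions may infer a dedicated string dtype instead of object.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical sample rows; the real data has many more fields.
df = pd.DataFrame({
    "company": ["Lucasfilm", "Marvel Studios", "Lucasfilm", "Twentieth Century Fox"],
    "country": ["USA", "USA", "UK", "USA"],
    "budget": [11, 220, 18, 237],
})

# Assumption: work on a copy so df itself still holds the original strings.
df_numerized = df.copy()

for col_name in df_numerized.columns:
    # Leave numeric columns alone; encode everything else.
    if not is_numeric_dtype(df_numerized[col_name]):
        # Convert to a category type, then replace each value with its integer code.
        df_numerized[col_name] = df_numerized[col_name].astype("category")
        df_numerized[col_name] = df_numerized[col_name].cat.codes

print(df_numerized)
```

One detail worth knowing: the codes are not random. When pandas infers the categories from strings it sorts them, so the codes follow alphabetical order, and the same value always gets the same code within a column.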
So now let's look at df_numerized. Let's run it and see if it works. Yeah, so it did exactly what we wanted. It left budget alone. It left the gross earnings alone. It left score alone. Anything that was already numeric, it left alone, because it already has a numeric representation. Any of the ones that had an object type that we looked at before were all numerized. Again, I don't know if that's a real term, but that's what I'm calling it. Maybe it is. So we have this, and let's compare it to our original data frame. I should have done head; I always do that, and then it ends up taking a ton of space. Oh, whoops. Let me go back way up to the top really quick. Let me just run this again; it may have ruined everything, but we'll see if it does what we need it to do. Okay. So we have it ordered. Oh, no, we don't have it ordered the same. I did screw up everything. Geez. Let me go back up here and run through. Yeah, I'm going to run this one to add that field. Sorry, this is totally my fault. Oh, what's happening? Why is it taking so long? Where's the one that ordered the data? Here it is. And then we ordered it by gross. So let's go back down again. You know, I'm going to make mistakes. I'm only human. Now let's look at the data frame. I just wanted to do a quick comparison so you feel confident that what we're doing is what we're supposed to be doing. Okay. So the company is 1428 for Lucasfilm, 2062 for 20th Century Fox, and 2062 again. Better yet, an easier one to look at is country. So 54 is USA, 53 is UK, and you can see that really easily. So it looks like it worked properly, right? It's still keeping to what it's supposed to do. So now what we're going to do is go back up here and literally just steal this correlation matrix code, and we're going to put it right down here.
But instead of the data frame, we're going to use df_numerized. Let's run this and see what we get. It should be significantly larger than what we were looking at before. So this is every single field now, right? Each one now has a numeric representation. Let's look for gross. I'm just going to skim. It looks like company had quite a small relationship to the gross revenue. But right here, let's look. It looks like budget is pretty highly correlated. This one has a negative correlation. It looks like longer runtimes earn more money, just on a super small scale. Then votes: if a movie was really successful and it got voted on hundreds of thousands of times, usually those made more money. And that looks like it. It's not hard to see, but, you know, there's a ton of stuff right here. Something we could do is go back to the original matrix we had, just the numbers, before we put them inside of this heat map. For the sake of it, we could do that and filter it down. It'll give us some of the same output, but it'll be a little bit easier to read than this huge thing. But this is good. This is still good. Let's do df_numerized.corr() and run that. So this is what we're looking at. I want to organize this so that I can quickly see the ones that have the highest correlation. What we're going to use is something called unstacking. So we're going to do it right here, and let's call this correlation_matrix, just to keep it simple.
And we're going to do correlation_matrix.unstack(), with parentheses, and assign that to corr_pairs. And whoops, I need to do this. So what this does when I unstack is say, okay, here's our budget, and here's everything that budget is compared to. If we go down to gross, which is what we've been looking at this whole time, we can see that budget has a high correlation. Obviously gross is correlated with itself, and then votes. So we can see that in a really quick way. Let's do it in an even quicker way. We can do corr_pairs.sort_values(), and let's call that sorted_pairs. Oops, it's equal to that. Okay. So now everything is paired up, right? It's kind of like the matrix, except in a linear way. I don't know if that's the right way to say it, but you see what I'm saying: it's genre versus budget, budget versus genre. It's still the same information as the correlation matrix. Now that we have that sorted, we can take sorted_pairs and, inside the brackets, say where sorted_pairs is greater than 0.5. So that's anything with a high correlation, and we'll just call that high correlation. And oops, we'll do this right here. Now we can see all the ones that had a high correlation. These ones obviously don't count, right? Because these are all fields paired with themselves. The year correct and released did really well, and I'm not surprised. But for gross, we had gross and budget, and we had votes and gross. That's the only other one that had a high correlation with the gross revenue. So it looks like my hypothesis of company being significant didn't really play out. It wasn't necessarily correct, but we did find one that I didn't think of: votes and budget have the highest correlation to gross earnings.
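Put together, the unstack, sort, and filter steps look roughly like the sketch below. The numbers are invented (company here stands in for an already-numerized column), and high_corr is an assumed variable name. The last step, dropping each field paired with itself, is an addition of mine that is not in the walkthrough; it just removes the 1.0 entries that "don't count".

```python
import pandas as pd

# Hypothetical numerized sample; "company" stands in for an encoded text column.
df_numerized = pd.DataFrame({
    "budget": [10, 20, 30, 40, 50],
    "gross": [15, 45, 40, 90, 120],
    "votes": [100, 300, 250, 900, 700],
    "company": [3, 0, 2, 1, 4],
})

correlation_matrix = df_numerized.corr()

# unstack() turns the square matrix into one long Series indexed by (field, field).
corr_pairs = correlation_matrix.unstack()

# Sort ascending so the strongest positive correlations end up at the bottom.
sorted_pairs = corr_pairs.sort_values()

# Keep only the pairs above the 0.5 threshold used in the walkthrough.
high_corr = sorted_pairs[sorted_pairs > 0.5]

# Extra step (not in the walkthrough): drop each field paired with itself.
first = high_corr.index.get_level_values(0)
second = high_corr.index.get_level_values(1)
high_corr_no_self = high_corr[first != second]

print(high_corr_no_self)
```

Every remaining pair still shows up twice, once in each direction (budget versus gross and gross versus budget), because the correlation matrix is symmetric.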
So that is our project. I mean, we are at the very end. We can say company has a low correlation, and I was wrong. I hope that you stuck with me. Well, actually, we're not at the very end at all. I was completely wrong. We have to upload it to the portfolio project. So what we need to do is save this. Let's rename it really quick: let's type in Movie Correlation Project. I'm going to call it V2; you don't need to call it that, but I'm going to. I'm going to save it. So now we've saved it, and if we go back right here, you can see it's there. One thing to note is that you can only upload it to GitHub if it's under 25 megabytes. So this is a little big, but it's really easy to fix. All you need to do is find where we're just displaying a ton of data, like this, and change it to df.head(). If we do that on one or two of these, it will resolve all of our issues. Just trust me on that. It'll make it much smaller. Let's see if there are any other ones like that. Yeah, df_numerized.head(). And then you need to actually run those before you save. Let's make sure I ran those. Yeah. Perfect. So now let's save this. I want to save you some heartache, and it literally dramatically reduced the size. So we're going to go up here. We're going to click Add file, then Upload files. And this is where we need to go find it. So I need to go into my C drive, go to Users, Alex F, and go right down here to Movie Correlation Project V2. Whoops, I didn't want to actually open it. That was my fault. What I wanted to do is drag it in here. So I'm going to go right here, drag it right over here, and drop it in there. Just say, oops, initial commit. I'm going to commit changes, and there we go.
So let's open this up and see what it looks like in here. It could potentially still be, you know, yeah, it's still loading, because it doesn't immediately show up, but it will. I'm hoping if I keep rambling for a second, it'll work, because sometimes it takes a little bit to get everything to work properly. But that is that. Now, one thing that we're going to be taking a look at very soon, in the very next video, is how to actually put all these projects together into a portfolio website. I have done this already; I've already created the website. Let me see if I can pull it up. Let's do AlexTheAnalyst.github.io. So we're going to be using GitHub Pages for this, and I'm going to show you how to create it. It's not that hard and it's completely free, so I'm looking forward to showing you how to do this. I learned this through YouTube and now I'm teaching it through YouTube, so I've come full circle. And this is a really good one. I use a similar variation of this for my own portfolio. But yeah, okay, this loaded. So this is what it looks like when you actually upload it. It literally just looks like the output of the Jupyter notebook. And so everything that we just looked at, geez, maybe trim that down before you upload it. Yeah, so that is an issue. I know when I uploaded mine before, I limited these to a certain amount so that this didn't happen. But that's just funny to me. And let's see, here we go. So you see it exactly how it is in the Jupyter notebook. Don't do what I did and have this. Definitely make sure that you've limited it in some way. You can do .head() on all of them, and that will work. So that is our project for this week. I hope it was helpful. I hope that it worked. And I hope that you can add this to your portfolio and feel good about it.
I feel like I'm going to do more of a traditional project on this dataset, because I like this dataset. We'll go look and do counts and a lot of the stuff that we do in SQL when we're doing exploratory analysis, and then visualizations. We'll do that with this. So I think I'm going to do another project with the exact same dataset, except look at it in a much different way and clean it up a little bit differently. And that year correct field that we added over here, we'll probably actually use, because we'll do some time series stuff. So with that being said, thank you for joining me. I hope that this was a good project. If you stuck around this long, you've definitely invested quite a bit of time, so thank you. I hope that the next project and the upcoming videos to finish out our data analyst portfolio project series are super helpful, and that you can get up and running and have a complete portfolio by the end of this. Thank you guys for watching. I really appreciate it. If you liked this video, be sure to like and subscribe below, and I'll see you in the next video.