Hello everybody. Today we're going to be looking at exploratory data analysis using pandas. Exploratory data analysis, or EDA for short, is basically just the first look at your data. During this process we'll identify patterns within the data, understand the relationships between the features, and look at outliers that may exist within the data set. You're looking for patterns, but you're also looking for mistakes and missing values that you'll need to clean up later during data cleaning.

Now, there are hundreds of ways to perform EDA on a data set, and we can't possibly look at every single one. So I'm just going to show you what I think are some of the most popular and most useful things you can do when you're first looking at a data set.

The first thing we're going to do is import our libraries. So we'll do import pandas as pd. We're also going to import seaborn and matplotlib. During EDA I often like to visualize things as I go, because sometimes you can't fully comprehend the data until you see it, and a plot gives you a broader view of everything. So we'll import seaborn as sns, and then import matplotlib.pyplot as plt. Let's run this. Okay, perfect.

Now we need to bring in our data set. We've worked with the world population data set before, and that's the one we're going to use now. So we'll say df = pd.read_csv, put an r in front of the string so it's read as a raw string, and paste in the path to our CSV. Your path may be different, so make sure you have the correct file path. Now we'll read it in.

This data set should look extremely familiar if you've done some of my previous pandas tutorials, but I did make some alterations to this one: I took out a little bit of data and put in a little bit of data here and there.
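Here is a minimal sketch of the loading step. The real file path isn't given in the video, so this uses a few made-up rows held in a string as a stand-in for the CSV. The column names are assumptions based on the columns mentioned later in the tutorial.

```python
import io

import pandas as pd

# Stand-in for the real file: made-up rows in the same general CSV shape.
# With the actual download you would pass its path to pd.read_csv instead,
# written as a raw string (r"...") so Windows backslashes are not treated
# as escape sequences.
csv_text = """Country,Continent,2022 Population,1970 Population
Country A,Asia,1000,400
Country B,Europe,500,450
Country C,Oceania,,20
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df)
```

The empty field in the last row is read in as a missing value (NaN), which is the kind of gap the rest of the walkthrough goes looking for.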
I changed things up because the data set exactly as I pulled it from Kaggle, the way we've looked at it in previous videos, is too simple, and we wouldn't be able to do some of the things I'd like to show you. So be sure to download this exact data set for this video, because it is a little bit different. What we're going to do now is just try to get some high-level information from it.

If yours looks a little different, like your values are in scientific notation, that's because I've applied a display setting so many times that it's still applied here. You can do the same thing, and we'll write it right down here: pd.set_option('display.float_format', lambda x: '%.2f' % x). The first argument is the option we're changing, the float format, and the lambda controls how many decimal places we see. Mine is currently at .1f, I believe from the last time I ran it, so when I run this and then display the data again, it changes to two decimal places. I usually like it at one, but let's keep it at two. Why not? That's how you change it, and I like looking at the numbers this way a lot better than in scientific notation.

Let's go down here and pull up df so we have our data. One of the first things I like to do when I get a data set is look at the info, so we'll do df.info(). This gives us some really high-level information. This is how many columns we have.
Here are the column names, and here is how many non-null values each column has. If you notice, we have 234 in each of these columns until we get to 2022 Population. Once we get there, we start losing some values. Then at World Population Percentage we have all 234 values again. The non-null count tells us how many rows actually have a value in that column. We also have the data types. These are really great to know, and we'll use them in a few different ways later in this tutorial.

Really quickly, I want to give a huge shout-out to the sponsor of this entire pandas series, and that is Udemy. Udemy has some of the best courses at the best prices, and that's no exception when it comes to pandas courses. If you want to master pandas, this is the course I would recommend. It's going to teach you just about everything you need to know about pandas. So huge shout-out to Udemy for sponsoring this pandas series, and let's get back to the video.

The next thing I really like to do is df.describe(). This gives you a high-level overview of all your numeric columns very quickly. You get the count, the mean, the standard deviation, the minimum and maximum values, and the 25th, 50th, and 75th percentiles. So at a super quick glance, there is a row somewhere in here where the country's population is 510 for 2022. In fact, if you go back to 1970 it was higher, at 752. That's just interesting. Then if we look at the max population, one country has 1.42 billion, and I believe that's China. Over here in 1970 the max is 822 million, and again I believe that's China. This gives you a really nice high-level view of all of these calculations, and we can also run each of them individually, even on specific columns.
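As a rough sketch of the display option, info, and describe steps, here they are on a tiny made-up frame. The values are invented and only chosen to be large enough that pandas might otherwise show them in scientific notation.

```python
import pandas as pd

# Toy frame with invented values and one missing entry.
df = pd.DataFrame({
    "Country": ["Country A", "Country B", "Country C", "Country D"],
    "2022 Population": [1_500_000_000.0, 500.0, None, 3_000_000.0],
    "1970 Population": [800_000_000.0, 700.0, 40_000.0, 2_000_000.0],
})

# Show floats with two decimal places instead of scientific notation.
pd.set_option("display.float_format", lambda x: "%.2f" % x)

# Column names, non-null counts, and data types.
df.info()

# Count, mean, std, min, quartiles, and max for the numeric columns.
summary = df.describe()
print(summary)
```

The count row of the summary is another place where missing values show up: it only counts the rows that actually have a value.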
But this is just a nice high-level overview.

One thing we just talked about was the null values we're seeing in here. I'd like to see how many values are actually missing, because that can be a problem. We don't want too many missing values, or they could really skew the data set. So we'll say df.isnull().sum(). When we run this, it gives us every column and how many values it's missing. We have 234 rows of data, and going down the columns we're missing 4, 1, 4, 7, 7, 5, 5, 4, 2, and 4 values. So we definitely have data missing.

What we choose to do with it is part of the data cleaning process. Maybe we want to fill those with a median value, or maybe we want to delete those countries entirely if the data is missing. I don't think you'd do that here, but these are things you need to think about when you find missing values. This is what the EDA process is all about: finding outliers, missing values, and things that are wrong with the data, and picking up insights while we're at it. So this is really important information to have when you get to data cleaning.

Now let's go down to our next cell and do df.nunique(). This shows us how many unique values are in each column. It makes the most sense for Continent, because I think there are only seven continents, right? We have six here. The ranks, countries, and capitals should all be unique, and that makes perfect sense. The same goes for the populations: they're such specific and such large numbers that I would be shocked if any of them matched.
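The two checks above, missing values per column and unique values per column, can be sketched on a small invented frame like this. The column names are assumptions and the values are made up.

```python
import pandas as pd

# Toy frame with invented values; two populations are missing.
df = pd.DataFrame({
    "Country": ["Country A", "Country B", "Country C", "Country D"],
    "Continent": ["Asia", "Asia", "Europe", "Oceania"],
    "2022 Population": [1000.0, None, 500.0, None],
})

# isnull() marks each missing cell as True; sum() counts them per column.
missing = df.isnull().sum()
print(missing)

# nunique() counts distinct values per column (missing values are skipped).
unique_counts = df.nunique()
print(unique_counts)
```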
For World Population Percentage, the number of unique values is much lower, and again that makes a lot of sense. If we pull the data up, a lot of these percentages are really low: 0.00, 0.01, or like this one, 0.2. There are a lot of really low values for the small countries, so many of them share the same value.

Now let's say we want to take a look at some of the largest countries. We could just use max to find the single largest one, but I want to be a little more strategic and look at the top range of countries, and we can do that based on 2022 Population. So we'll say df.sort_values. This is how we order our data, as opposed to filtering it. We'll pass by= and specify 2022 Population, and let's run it as is with .head() on the end, because we just want to look at the first few rows.

So we're sorting on 2022 Population, and what we get are the very smallest values, because the sort is ascending by default, from lowest to highest. Vatican City in Europe is at 510, which is the value we were looking at earlier. Now we can add ascending=False, because by default it's True, and then it gives us the very largest ones. The top five by population are China, India, United States, Indonesia, and Pakistan. We can also ask for the top 10 by passing 10 to head, which adds Nigeria, Brazil, Bangladesh, Russia, and Mexico.
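A small sketch of the sorting step, again on invented data with an assumed column name. It shows the default ascending order first and then the descending order used to get the largest rows.

```python
import pandas as pd

# Toy frame with invented values; one population is missing.
df = pd.DataFrame({
    "Country": ["Country A", "Country B", "Country C", "Country D"],
    "2022 Population": [1000.0, 10.0, 500.0, None],
})

# Default sort is ascending, so head() returns the smallest values.
smallest = df.sort_values(by="2022 Population").head(2)
print(smallest)

# ascending=False flips the order, so head() returns the largest values.
largest = df.sort_values(by="2022 Population", ascending=False).head(2)
print(largest)
```

In both directions pandas puts rows with a missing sort value at the end by default, so they don't show up in either head().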
You can do this for literally any of these columns. Whether it's continent, capital, or country, you can sort on it, and you can also look at things like growth rate or world percentage. That last one seems really interesting, so let's look at it quickly before we move on. If we sort on World Population Percentage, China alone is 17.88% of the world, and India is 17.77%. Those are very large countries with very high populations, so it makes sense that they have the highest percentages. Again, we're just getting in here and looking around. That's all we're really doing.

Now I want to look at something I've always liked doing, which is looking at correlations, usually between numeric values only. We can do that by saying df.corr() and running it. This compares every column to every other column and looks at how closely correlated they are. If we look at 2022 Population across the board, it's very highly correlated with the other population columns, almost one to one. That makes perfect sense, because for most countries the population is steadily increasing, so those columns track each other almost exactly.

But if you compare the populations with the area, they're only somewhat correlated. That's because some countries have a very high population but a small area, or vice versa, a large area and a small population. So there isn't a one-to-one correlation there. It's hard to glance at this table and understand everything in it, though. It would be a lot easier to visualize it, so let's go ahead and do that.
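Here is a minimal sketch of the correlation table on made-up numbers. One difference from the call in the video: recent pandas versions raise an error if corr() meets text columns, so this sketch passes numeric_only=True, which assumes a pandas version that supports that argument.

```python
import pandas as pd

# Toy frame with invented values: the two population columns move together,
# while area does not follow population.
df = pd.DataFrame({
    "Country": ["Country A", "Country B", "Country C", "Country D"],
    "2022 Population": [100, 200, 300, 400],
    "1970 Population": [50, 100, 150, 200],
    "Area": [4, 1, 3, 2],
})

# Pairwise Pearson correlation between the numeric columns only.
corr = df.corr(numeric_only=True)
print(corr)
```

On these toy numbers the two population columns correlate at 1.0 and population against area comes out at -0.4, which mirrors the pattern described above: populations track each other, area much less so.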
We're going to visualize this using a heat map. So we'll say sns.heatmap, and the data we're passing in is df.corr(). We also want to say annot=True, and I'll show you what that does in a little bit. Then we'll do plt.show(). This is our first look, and it looks absolutely terrible.

Let's change the figure size really quickly, because I want to make this much larger. We'll do plt.rcParams, and this takes square brackets, not parentheses: plt.rcParams['figure.figsize'], and we set that equal to the size we want. Let's try (10, 7) and see if it looks any better. No, that doesn't look good. Let's make the width 20. Okay, that looks a lot better.

This is just a quick way to read the table, because it gives you a color-coded system. Highly correlated is this tan, going all the way down to no correlation or even negative correlation, which is black. When we look at 2022 Population against the other population columns down on this axis, we can see very quickly that they're all extremely highly correlated. The rank, on the other hand, is negatively correlated with them. The populations and World Population Percentage are again quite correlated with each other, but not with area, density, or growth rate. I found it really interesting that density, growth rate, and area aren't all that correlated with the population numbers. I would have assumed that on some level they went hand in hand.
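The heat map step might look roughly like this on invented data. It assumes seaborn and matplotlib are installed, sets the figure size before drawing so it applies to the new figure, and uses numeric_only=True because recent pandas versions won't correlate text columns.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy frame with invented values.
df = pd.DataFrame({
    "Country": ["Country A", "Country B", "Country C", "Country D"],
    "2022 Population": [100, 200, 300, 400],
    "1970 Population": [50, 100, 150, 200],
    "Area": [4, 1, 3, 2],
})

# rcParams is indexed with square brackets; set the size before plotting.
plt.rcParams["figure.figsize"] = (20, 7)

# annot=True writes each correlation value inside its cell.
ax = sns.heatmap(df.corr(numeric_only=True), annot=True)
plt.show()
```

Setting rcParams changes the default for every later figure in the session, which is handy in a notebook but worth remembering if later plots come out larger than expected.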
The area does make some sense: larger area, larger population, that kind of thing. Growth rate I can understand, because that's a percentage, so it could easily be uncorrelated. But I thought density would be more correlated than it is. All that to say, this is one way to see how correlated your columns are with one another, and that can definitely help you decide what to analyze later when you're doing your actual data analysis.

Let's go right down here. Something I do almost every time I'm doing exploratory data analysis is group columns together and start looking at the data a little more closely. So let's group on Continent. Sometimes when you're doing EDA, you already know the end goal for the data set. You know what you're looking for and what you're going to visualize at the end, and that really comes in handy. But sometimes you don't, and you're just going in blind. So far we've been going in blind: throwing things at the wall, seeing some overviews, looking at correlation. That's all we've done.

Now I want to get more specific. I want a use case to aim for, without doing a full data analysis or diving into the depths. The question for us is: are there certain continents that have grown faster than others, and in which ways? So we want to focus on the continents. That's the most important column for this very fake use case. We can group on Continent and look at the population columns, because those are the only columns where we can really see growth.
There is a growth rate column, but for density per square kilometer we don't have multiple values over time. It's just a single static value. The same goes for growth rate and world population percentage. The populations, though, cover a long span, about 50 years of data. So with those we can see which continents have really grown.

Without talking about it any more, let's do df.groupby('Continent') and then .mean(). Let me just copy the column name, because I'm not good at spelling. We can run it just like this, and now we have Africa, Asia, Europe, North America, Oceania, and South America.

If I'm being completely honest, I'm no geography expert, but I knew most of these. I genuinely don't know what Oceania is. So let's search for that value and see. We'll come back up here in a second, but I want to understand what this is. We'll take df and filter on df['Continent'].str.contains('Oceania'), and that whole condition needs to go inside df[...]. Now let's run it. We're looking at the rows of our data frame where the continent is Oceania. These look like islands, I'm guessing. We have Fiji, Guam, New Zealand, Papua New Guinea. I'm not sure I'm even pronouncing Oceania right, so bear with me, I'm doing my best. This is part of the EDA process: I didn't know what that value meant, so I went and looked.
Now that I'm seeing the rows, it looks like islands, which would make sense, because as a group they have the highest average rank. I'm guessing that's because they're mostly small countries.

Let's order this really quickly. We'll add .sort_values, and I want to sort on the average population, so we'll do by= with the population column and ascending=False. Looking at the mean population, Asia has the highest on average, then South America, Africa, Europe, North America, and Oceania at the very bottom, which makes perfect sense. Again, small islands. For world population percentage, each country in Asia makes up about 1% on average. That's really interesting to know. And the density in Asia is far higher, almost double every other continent. Really interesting, now that I'm looking at it.

That's something I would actually look into. I'd be asking what Oceania is and exploring it more, because I'm trying to really understand this data set. What I want to do now is visualize this, because it's hard to take in just by looking at the table. Again, the use case we're working with is: which continent has grown the fastest? It could be percentage-wise, or it could be as a whole on average. Let's take a look. We'll copy this and bring it right down here.
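Here is a rough sketch of the grouping, sorting, and filtering steps from this part, on invented rows. It passes numeric_only=True to mean() because recent pandas versions raise an error when asked to average text columns, and the column used for sorting is an assumption, since the video doesn't spell out which population column was picked.

```python
import pandas as pd

# Toy frame with invented values.
df = pd.DataFrame({
    "Country": ["Country A", "Country B", "Country C", "Country D", "Country E"],
    "Continent": ["Asia", "Asia", "Europe", "Oceania", "Oceania"],
    "2022 Population": [1000, 600, 500, 10, 30],
})

# Average of every numeric column per continent, largest average first.
by_continent = (
    df.groupby("Continent")
    .mean(numeric_only=True)
    .sort_values(by="2022 Population", ascending=False)
)
print(by_continent)

# Boolean mask inside df[...] keeps only the rows for one continent.
oceania = df[df["Continent"].str.contains("Oceania")]
print(oceania)
```

Looking at the underlying rows for an unfamiliar group, as with the filter here, is often the quickest way to understand what a category in the data actually covers.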
Let's try to visualize this. I'll save it as df2, because I already know it's not going to look good just based on how the data is laid out. So df2 is equal to the grouped and sorted data, and then we'll do df2.plot() and run it just like this. As you can see, we get Asia, South America, Africa, Europe, North America, and Oceania along the x-axis, and we can kind of understand what's happening. But the lines being drawn are the columns, not the continents, which is what I wanted.

Switching it is actually pretty easy, and this is good to know. We can transpose the data so that the continents become the columns and the columns become the index. All you have to do is say df2.transpose(). Let's look at it first and then save it. Now all of the old columns are down the side, and the old index values are the columns. Let's say df3 is equal to that, and I'm only doing that so I don't write over df or my earlier data frames. Now let's do df3.plot(), and it should look quite a bit different.

As you can see, this does not look right at all. The reason is that we're not looking at only the columns we need. We have density in here, world population percentage, rank, and we don't need any of those. The only ones we want to keep are the population columns. We can do that by going back up to where we created df2 and specifying that we only want specific columns. You could handwrite all of them, and by all means go for it, but I'm going to go down here, say df.columns, and run it.
It gives us the list of all our columns, and you can just copy the ones you need and put them right in here. It needs to be a list, so it goes inside its own set of brackets. Let me try running this. Okay, that works properly. You can do it just like this, or there's a shortcut, which I'd hope you would use: df.columns, just like we looked at down here, except that since it's an index, we can slice it. Counting from zero, the population columns run from position 5 up to 13, so we can pass df.columns[5:13]. Let's see if this works. There we go. Using the slice saves you some visual space and gives you the exact same output.

So now we have our df2. Let's go down and transpose it. Now we just have the populations, with our continents as the columns. Then we plot it, and this looks good, although it's backward: the years run from 2022 down to 1970. So the slice is a quick way to do it, although not the best way. I'm actually going to copy all of the column names, put a bracket right here, paste them in, and literally reorder them by hand. And although I said the shortcut would save us time, it did not at all.

I might speed this part up, or I might just have you sit through it, because this is an interesting part of the process and I want you to get the full experience. You know what, now that I'm talking about it, that's what we're going to do. You can hang out with me. We end with 2010, 2015, 2020, and 2022. Now let's run it. What did I do? Oh, too many brackets. There we go. Now it's ordered appropriately, from 1970 all the way up to 2022. This is how we want it. Let's transpose it and run it.
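The column selection, transpose, and plot can be sketched like this on invented data. The toy frame has only three population columns, stored newest first as in the file described here, and instead of retyping the names by hand this sketch reverses the slice with [::-1]. That reversal is a swapped-in shortcut, not what was done in the video.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy frame with invented values; population columns are stored newest first.
df = pd.DataFrame({
    "Country": ["Country A", "Country B", "Country C", "Country D"],
    "Continent": ["Asia", "Asia", "Europe", "Oceania"],
    "2022 Population": [1000, 600, 500, 20],
    "2000 Population": [700, 400, 480, 15],
    "1970 Population": [400, 200, 450, 10],
    "Area": [4, 1, 3, 2],
})

# Slice the population columns by position, then reverse to oldest first.
year_columns = list(df.columns[2:5][::-1])

# Mean population per continent, keeping only the population columns.
df2 = df.groupby("Continent")[year_columns].mean()

# Transpose so years become the index (x-axis) and continents the columns.
df3 = df2.transpose()

# One line per continent, running from the oldest year to the newest.
ax = df3.plot()
plt.show()
```

pandas draws one line per column, which is why the transpose matters: before it, each line would be a year, and after it, each line is a continent.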
Now we basically have the mirror image of the previous plot. Just at a glance, and we haven't changed anything except which columns we're looking at, we can see that from 1970 Asia is already in the lead by quite a bit. It continues to go up drastically, especially in the 2000s. Right here it shoots straight up, and then it keeps rising but starts leveling off. Every other continent, especially Oceania, is really low and has never moved much. Look at green: it has gone from about 0.1 up to about 0.2 on this scale, so it has almost doubled in the last 50 years. Again, you get a high-level overview of each of these continents over this span of time.

So this is one way we can look at that use case. We're not going to harp on it too long. I just wanted to give you an example: sometimes you'll have something in mind that you're looking for, and sometimes you go exploring and just see what's out there.

The next thing I want to look at is a box plot. I personally love box plots, because they're really good for finding outliers. There are a lot of outliers here, and I already know this because the 25th and 50th percentiles are all very low and then there are some really big values. For your data set it may not be that way, and those outliers may be something you really need to look into. I've used box plots a lot to find outliers, then dug into the data and come across things where I thought, oh, I have to clean this up, or I have to go back to the source. It's really powerful to be able to find these. All you have to do is df.boxplot(). Let's take a look at it.
This sort of looks good as is, but maybe I'll make it a little wider. Let's pass figsize=(20, 10). Okay, that didn't help at all. I apologize, I thought it would. But let's keep going.

What this is showing us is these little boxes down here, which are usually much larger when you have a more even distribution of values. The box is where most of our values lie, and this line right here is the upper range. All of these open circles stand for outliers. So looking at 2022 Population, there are a lot of outliers.

Knowing your data set is really important here. For ours, outliers are to be expected, because most countries are small. Each of these little dots is an outlier value, and each value corresponds to a country. If this were a different data set, I would be searching on these points to see what's wrong with them, if anything, or whether they're real numbers. If this were revenue, and everyone's revenue was way down here while one company was making something like $10 trillion, that would be an outlier up here, and it would definitely be something you'd want to look into. For our data set, knowing that we're looking at population, this is more than acceptable.

That's what box plots are really good for: showing you the quartiles, upper and lower, and marking the points that fall outside the normal range for you to look into. So really, really useful.

Now let's go down here and pull up our data frame again. We've kind of zoomed through the whole EDA process, but there's one last thing I wanted to show you.
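Here is a minimal sketch of the box plot on invented numbers, followed by one way to pull out the rows behind the outlier dots. The second part isn't in the video: it assumes the plot's default rule, where points more than 1.5 times the interquartile range above the upper quartile are drawn as outliers.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy frame with invented values: five small rows and one very large one.
df = pd.DataFrame({
    "2022 Population": [10, 12, 11, 13, 12, 1000],
    "1970 Population": [8, 9, 9, 10, 11, 600],
})

# One box per numeric column; outliers are drawn as separate points.
ax = df.boxplot(figsize=(20, 10))
plt.show()

# Recreate the upper fence by hand to list the rows behind those points.
col = df["2022 Population"]
q1, q3 = col.quantile(0.25), col.quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = df[col > upper_fence]
print(outliers)
```

Filtering on the fence like this is a quick way to go from a dot on the plot to the actual row you need to inspect.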
This is the very last thing we're going to look at, and we're ending on a bit of a low point, if I'm being honest, because the last few things were much more exciting. But there is df.dtypes. Let's run it. Just like info, it gives us the data type of each column. But we're also able to select columns based on these types, whether that's object, float, or integer, which is really great. We can pass include= with something like 'number'. None of the types explicitly say number, right? But when I run it, I get an error about a Series object. Oh, that's because df.dtypes gives back a Series. What we need is df.select_dtypes(include='number'). Now let's run this.

Now it's only returning the columns whose data types count as a number. You won't see Country or any of the other text columns. If we want those, we go in here, say 'object', and run that. This is another really quick way to filter columns by type. We could even put 'float' in here, and now it's not including Rank, which is an integer.

So we can specify the data type and it'll filter the columns based on that. When you're doing work like this, it's good to know what data types you're working with and to be able to look at just those columns, because there might be some analysis you want to perform only on the numeric columns, or only on the string or integer columns in your data set.

So again, ending on a low note, I apologize. Everything else we looked at is something I typically do in some way or another when I'm looking at a data set. Exploratory data analysis is really just the first look.
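A short sketch of filtering columns by data type, on a made-up frame with assumed column names. The text column is created with an explicit object dtype so that the 'object' filter behaves the same way regardless of how a given pandas version stores strings by default.

```python
import pandas as pd

# Toy frame with invented values: one integer, one text, one float column.
df = pd.DataFrame({
    "Rank": [1, 2],
    "Country": pd.Series(["Country A", "Country B"], dtype="object"),
    "Growth Rate": [1.01, 0.99],
})

# dtypes is a Series listing the type of each column.
print(df.dtypes)

# select_dtypes is the DataFrame method that filters columns by type.
numeric = df.select_dtypes(include="number")   # integers and floats
text = df.select_dtypes(include="object")      # text columns
floats = df.select_dtypes(include="float")     # floats only, no integers
print(numeric.columns.tolist(), text.columns.tolist(), floats.columns.tolist())
```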
You're looking at the data, then you're going to clean it up in the data cleaning process, and then you're going to do your actual data analysis: finding trends and patterns and visualizing them in some way to get some kind of meaning, insight, or value from the data. And again, there are a thousand different ways you can go about this. It typically depends on the data set, but these are a lot of the ways you'll look at a lot of different data sets, and that's why I went through them in this video.

So I hope you liked it and got something out of this tutorial. If you liked this video, be sure to like and subscribe, and check out all my other videos on pandas and Python. I will see you in the next video.