 Hello everybody! Today we're going to be talking about merging, joining, and concatenating data frames in pandas. This whole video is basically around being able to combine two separate data frames together into one data frame. These are really important to understand when we're actually using the merge and the join. Right here we have what's called an inner join and the shaded part is what's going to be returned. It's only the things that are in both the left and the right data frames. Then we have an outer join or a full outer join and this will take all the data from the left data frame and the right data frame and everything that is similar. So basically it just takes everything. We also have a left join which is going to take everything from the left and then if there's anything that's similar it'll also include that and then the exact opposite of that is the right join which is going to give us everything from the right data frame and it's going to give us everything that is similar but it's not going to give us anything that is just unique to the left data frame. So this is just for reference because in a little bit when we start merging these these become very important. So I just wanted to kind of show you how that works visually. So let's get started by pulling in our files. So first we're going to say import and is as PD will run this. And then we'll say data frame one, and we'll also have a data frame two, and these are the different data frames the left and the right data frame that we'll be using to join emerge and concatenate. So we'll say data frame one is equal to PD dot CSV underscore read. And we'll do our and here is our file path. So we have this lot r dot CSV, that's our load of the rings CSV. And let's call that really quickly so we can see what's in there. And I'm having a dyslexic moment, because it's supposed to be read underscore CSV. I apologize for that. But this is our data frame, this is our data frame one, we have three columns, it's their fellowship ID one thousand one two three and four, their first name Frodo Samwise Gandalf and Pippin and their skills hide and gardening spells and fireworks. So this is our very first data frame that we're going to be working with. Let's go down a little bit. Let's pull this down here. And we're just going to say data frame two, data frame two, and this is the Lord of the Rings two. So let's pull this one in now. As you can see, it's very similar. We have fellowship ID one two six seven eight. So we have three different IDs here. We don't have six seven and eight in this upper this first data frame. We also have the first name. So Frodo and Sam or Samwise are in the very first and the second data frame. But now we have three new people, Baramir, Elrond and Legolas. And now we have this age column, which again is unique to just the second data frame really quickly. I want to give a huge shout out to the sponsor of this video and that is Zendesk. I've been using Zendesk for my company's customer analytics and has been absolutely phenomenal. They're going to be hosting a conference called Zendesk Relate on May 10th. And they're going to talk all about customer analytics, chatbots and AI in this space. You can attend in person in San Francisco or you can attend virtually, but space is limited. So be sure to apply if you want to attend. So if you are a business leader and you want to make the most out of your customer data or you want to learn customer data analytics, I believe links in the description. Again, huge shout out to Zendesk for sponsoring this video. Now the first one that I want to look at is Merge. And I want to look at Merge first because I think this one is the most important. I use this one more than any of the ones that we're going to talk about today. The Merge is just like the joins that we were just looking at, the outer, the inner, the left and the right. And there's also one called Cross. And I'll show you that one, although if I'm being honest, I don't really use that one that much, but it's worth showing just in case you come into a scenario where you do want to do that. So let's go right down here. And I want to be able to see these while we do it. So we're going to say data frame one. And when we specify data frame one as the very first data frame, we say data frame dot merge, this is automatically going to be our left data frame. Then if we do our parentheses right here, and we say data frame two, this is our right data frame. And let's see what happens when we do this. So what it's going to do in this, we didn't specify this, it's just a default, it's going to do an inner join. So it's only going to give us an output where specific values or the keys are the same. Now you can't see this, but what is happening is it's taking this fellowship ID and saying I have 1001 here, a 1002 here. This is the exact same as up here with this fellowship ID and fellowship ID of 1001 and two. But when we look at 1003 and four, those aren't in this right data frame and 678 is not in this left data frame. So the only ones that match are this 1001 and two. And that's why they get pulled in down here. But because we didn't explicitly say, here's what I want to join or merge between these two data frames, it actually is looking at the fellowship ID and the first name. So it's taking in these unique values of Frodo and same wise, which are the same in both, which is why it pulled it over. But really quickly, let's just check and make sure that we did it on the inner join. Because again, we didn't specify anything that was just the default. So we're going to say how is equal to and then we'll say inner. And if we run this, it's going to be the exact same because again, the inner is the default. But now just to show you how it's kind of joining these two data frames together, I'm going to say on is equal to, and I'm only going to put fellowship ID. So let's run this. Now the first thing that you may have noticed is this first name underscore X and this first name underscore Y. What the merge does as kind of a default is when you were only joining on a fellowship ID, we have this right data frame with fellowship ID, the left data frame with the fellowship ID. If you're just joining on these, and you're not joining on the first name and the first name, then it's going to separate those into an underscore X and an underscore Y. And even though they have the exact same values, since we are not merging on that column, it automatically separates that into two separate columns. So we can see the values within each of those columns. If we went into this on, and we make a list, and let's do like that. And we say comma, and then we write first name, oops, first name. And then we run this, it's going to look exactly like it did before. Again, it automatically pulled in both of these columns when it was merging at the first time, even though we didn't write anything. But if we actually write this, it's doing exactly what it was doing when we just had DF two, we're just now writing it out. Now there are other arguments that we can pass into this merge function, let's set shift tab. And let's scroll down here. So within this merge function, we have a lot of different arguments that you can pass into it. First, we have this right, which is the right data frame, which is data frame two, then we have the how and the on, which we've already shown how to do. There's a left on right on left index, right index, not something you'll probably use that much, but you definitely can if you want to look into that. And there's all these doc strings which show you exactly how to use all of these. So if you're interested in looking at the left and the right and the left index, it's all in here. The one that is really good is the sort and you can sort it saying either it's false or true. Then we have these suffixes. Now if you remember when we took these out, what it automatically did was it put in these underscore X and underscore Y, you can customize that. And you can put in whatever you'd like instead of the underscore X underscore Y, you can put in some custom string for that. We also have an indicator and a validates again, all the things that you can go in here and look at I'm just going to show you the stuff that I use the most. So these things right here are things that I definitely use the most. So now that we've looked at the inner join, let's copy this right down here. And let's look at the outer join. And these get a little bit more tricky. I think the inner join is probably the easiest one to understand. Let's look at the outer is spelled O U T E R. I don't know why I always want to say O U T T E R. But let's run this and see what we get. So now this looks quite different. The inner join only gave us the values that are the exact same. This one is going to give us all of the values regardless of if they are the same. So we have one, two, three, four, six, seven and eight. So let's go back up here. So we have one, two, three, four, one, two and six, seven and eight. So we don't have a 1005. And then if you notice in this data frame right here, if the value doesn't have so if we can't join on the fellowship ID or the first name, like Legolas wasn't one that we joined on, or that has a similar value in the left data frame, it just gives us an N a N, which is not a number. And it's going to do that for any value where it couldn't find that join or it couldn't match something within that either ID or first name. So in age, we also have that for the ones that weren't in the right data frame. We only had 1001 and 1002. So we'll have the age for both Frodo and Sam. But for Gandalf and Pippin, we don't have their corresponding IDs. And so it's just going to be blank for Gandalf and Pippin. And you can see that right here. So again, outer joins are kind of the opposite of inner joins, they're going to return everything from both. If there is overlapping data, it won't be duplicated. Now let's go on to the left join. And I'm going to pull this down right here. And now we're just going to say how is equal to left. And let's run this. So what this is going to do is it's going to take everything from the left table or the left data frame right here. So everything from data frame one, then if there is any overlap, it'll also pull the overlap to the, you know, whatever we're able to merge on from data frame two. So let's go back up to our data frame one and two. So it's going to pull everything from this left data frame, because we're specifying we're doing a left join. So everything from the left data frame will be in there. We're also going to try to bring in everything from the right, but only if it matches or is able to merge. So just this information right here will come over. We weren't able to join on 1006, 1007 or 1008. So really none of that information is going to come over. So let's go down and check on this. So again, we have one, two, three, four, all of the data with this first name and skills, everything is in here. But then we are trying to bring over the age, but we only have matches with 1001 and 1002. So only these two values will come in. Let's look at the right join because it's basically the exact opposite. Let's look at the right. And this is basically the exact opposite of the left in the fact that now we're only looking at the right hand. And then if there's something that matches in data frame one, then we will pull that in. So this is basically just looking like data frame two, except we're pulling in that skills column. And since only 1001 and 1002 are the same, that's why the skills values are here. Now, those are the main types of merges that I will use when I'm using a data frame or when I'm trying to merge a data frame. But there also is one called a cross or a cross join. And let's look at this one. And this one is quite a bit different. Here we go. Let's run this. So this one is different in that it takes each value from the left data frame and compares it to each value in the right data frame. So for Frodo in this left data frame, it looks at the Frodo in the right data frame, Samwise in the right data frame, Legolas, Elrond, and Baramir all in the right data frame. Then it goes to the next value, Samwise, does the exact same thing, Frodo, Samwise, Legolas, Elrond, Baramir, and it does that for every single value. So let's go right back up here. So it's taking this, this 1001, and it's comparing it to 12345. Then it's taking Samwise, and it's comparing it to 12345. Gandalf, 12345, Pippin, and then you kind of see that pattern. And that's what a cross join is. There are very few, in my opinion, reasons for a cross join. Although you'll, if you ever do like an interview where you're being interviewed on Python, you will sometimes be asked on cross joins. But there aren't a lot of instances in actual work where you really use or need a cross join. Now let's take a look at joins. And joins are pretty similar to the merge function. And it can do a lot of the same thing, except in my opinion, the join function isn't as easily understood as the merge function. It's a little bit more complicated. But let's take a look and see how we can join together these data frames using the join function. So let's go right up here. We're going to say data frame one, dot join. And then we'll do data frame two, very similar to how we did it before. And let's try running this. And it's not going to work. When we did the merge function, it had a lot of defaults for us. Let's go down and see what this error is. It says the columns overlap, but no suffix was specified. So it's telling us that it's trying to use the fellowship ID and the first name just like the join did, except it's not able to distinguish which is which. And so we need to go in there and kind of help it out a little bit. Again, a little bit more hands on than the merge. But let's see what we can do to make this work. Let's do comma. And we'll say on. And let's really quickly, let's open this up and kind of see what we have. So this one has less options than the merge does. We have other, and that's our other data frame, we can do on and we're going to specify, you know, what column do we want to join on. And then we can look at how do we want it to be a left and inner and outer, the same kind of types of joins as the merge. Then we have that left suffix, right suffix. And that's right here is kind of part of the issue that we were just facing is that those columns are the same. But if we say left suffix, it'll give us an underscore, whatever we want to specify, any string, four columns that are both in the left and the right, we can give it a unique name. So we'll no longer have that issue. And then we can also sort it like we did on the other one. But anyways, let's go back to our on, we'll say on is equal to, and then we'll say fellowship ID. Let's try running this. And we're still getting an error. It's just not as simple as the merge. So let's keep going. So now let's specify the type. So we'll say how is equal to and we'll do an outer. And if we run this, it still doesn't work, we're still getting the exact same issue as the less suffix and the right suffix. Now let's finally resolve it. I just wanted to show you how a little bit more frustrating it was. But now let's say L suffix is equal to and now it automatically when we did the merge did an underscore x, but we can do let's do underscore left. And then we can do a comma do right suffix. And we'll say is equal to and we'll do underscore right. Now when we run this, it should work properly. Let's run this. So this is our output and obviously it looks quite a bit different. Over here, we have this fellowship ID. Then we also have fellowship ID left, first name left, fellowship ID right and first name right. So it just doesn't look right. Now something I didn't specify when I first started this, because I kind of wanted to show you is that the join usually is better for when you're working with indexes. Before when we were using the merge, we were using the column names and that works really well and is pretty easy to do. But as you can see right here, when we're trying to use these column names, it's not working exceptionally well. Let's go ahead and create our index. And then I can show you how this actually works and how it works a little bit better when we're working with just the index. Although you can get it to work just the same as the merge, it's just a lot more work. So let's go right down here and let's go and say df4. So we'll create a new data frame. We'll say df1.set underscore index and we'll do an open parentheses. And we'll say we want to do this index on the fellowship ID. And then we're going to do the join. So now we're going to say join. So we're setting an index. So we're setting that index on the fellowship ID. Now we're going to join it on df2.set underscore index. And then we're also going to do that on the fellowship ID and I'll just copy this. Oh geez, I hate it when I do that. Okay, now we also want to do and specify the left and the right index. I'll just copy this because we do need to specify this. Now let's try running the data frame four. So really quickly, just to recap, we were setting the indexes. We were doing the same thing above, right? We have this join. We were joining data frame one with data frame two. Now we're joining data frame one with data frame two, except in both instances, we're setting the index as fellowship ID. So we're joining now on that index. So now let's run this. And this should look a lot more similar to the merge than the join that we did above, except now the fellowship ID right here is actually an index. So it's just a little bit different, but we can still go in here and do how is equal to outer. Oops, let's say outer. So we can still specify our different types of joins or the different way that we can merge or join these data frames together. We can still specify that. Again, it's just a little bit different. And that's why for most instances, I'm using that merge function because it's just a little bit more seamless, a little bit more intuitive. The join function can still get the job done. But as you can see, it takes a little bit more work. Now let's look at concatenate. Concatenating data frames can be really useful. And the distinction between a merge and join versus the concatenate is that the concatenate is kind of like putting one data frame on top of the other, rather than putting one data frame next to one another, which is like the merge and the join. So concatenating them is just a little bit different in how it'll operate. But let's actually write this out and see how this looks. Let's go up here and we'll say pd.concat. We'll do an open parentheses. And then we're going to concatenate data frame one, I'm a data frame two. That's all we have to write. And let's run this. And so just like I said, it literally took the first data frame 1234 and put it on top of the right data frame 12678. So that is our left data frame. This is our right data frame, and they're literally just sitting one on top of the other. But just like when we merge either with a left or a right, when you have these skills and there aren't any values that populate for them, it is going to say not a number. And since we're not actually joining, we're not joining on one and two, even though this one and this one is the same rows, it's not populating that value because again, we're not joining these together. We're just concatenating and putting one on top of the other. Now if we go into this concat, we say shift tab, there are a lot of different things that we can do, which if you remember the zero axis is the left hand index, and the axis of one is the top index, which is the columns. So you can specify that. And we can also do joins. And this is the one that I'm going to take a look at. But there are other ones that you can look into as well. Let's look at join. Let's do comma. And we'll say join is equal to and let's do an inner join. So let's see what happens with this. As you can see, it is only taking the columns that are the same. That's what this inner is doing. It's joining these columns together. And the ones that were different, they didn't take because again, we weren't able to combine them. They aren't similar between both data frames. Let's do an outer. And now it's going to take all of them. And like I said, that's doing this on these columns right here, but we can also do it on this axis as well. So let's go ahead and say, axis is equal to one. And when we run this, now it's joining us on this index right here of zero, one, two, three, four. So now these ones are being joined together. And it's putting it side by side, much like a merge would. So that's how concatenate works. And I'm going to show you one more thing. And again, it's not up here in this, you know, title because it's not one that I recommend, but it's one called append. The append function is used to append rows from one data frame to the end of another data frame. And then we can return that new data frame. And so let's do data frame one dot append to an open parentheses. And we'll say data frame two, very similar to how we've been doing other things. And let's run this. And as you can see, this is almost exactly like how the concatenate did when we first did it. But if we read kind of this warning, it's saying the frame dot append method is deprecated, and we'll be removed from pandas in the future version, use pandas dot concat instead. So it's literally warning us, you know, append is on its way out. If you want to do exactly what you're doing right here, go and try concat or concatenate, because that'll do the exact same thing. So I'm not really going to show you any other variations of append, because there's no reason it's going to be on its way out in the next version. So that is our video on merge, join and concatenate, and append as well in pandas. And I hope that that was helpful. I hope that you learned something. I mean, this stuff is really important, because oftentimes you're not just working with one CSV or one JSON or one text file, you're working with multiple of them. And you need to combine them all into one data frame. And so this is a really, really important concept and thing to understand. With that being said, be sure to like and subscribe, check out all my other videos on Python and pandas. And I will see you in the next video.