Hello, welcome to SSUnitex, and this is a continuation of the PySpark tutorial. In the last video of this series we looked at the for loop. If you haven't watched it, I would strongly recommend doing so before going forward. Today we are going to look at schema comparison. So what is schema comparison, and why is it important? Let's assume we have multiple data frames and we want to store the data from all of them in a single destination, but the schemas are different. Then it is difficult to union them and dump the data into that single destination. By comparing the schemas we can find out which columns are missing and add them, so that every data frame ends up with the same columns.
Let me quickly go to the browser and we will try to understand this in practice. Here I am creating three data frames: data frame one, data frame two and data frame three. As you can see, the schema of data frame one and data frame two is the same. Data frame three, however, does not have the gender column, and it additionally has a department name column. Let me execute this and it will create the three data frames. We can see the command executed successfully, and we now have data frame one, data frame two and data frame three. The next thing we need to check is which columns are missing. Before that, let's compare the schema of data frame one and data frame two. If the schemas are the same we print schema matched, and if they are not the same we print schema not matched.
Here we can simply use an if condition: data frame one dot schema equals equals data frame two dot schema, and in that case I print schema matched. In the else part we print schema not matched. It's very straightforward. Let me execute it. The schemas of data frame one and data frame two are the same, so we see schema matched. If we compare data frame one and data frame three, it returns schema not matched. So by using the double equals operator directly on the schemas of two data frames, we can check whether they match.
Now we need to find the missing columns, since the schema did not match for data frame one and data frame three. Which columns are missing in data frame one but available in data frame three? And, the other way around, which columns are missing in data frame three but available in data frame one? For that we first need to convert the schema with the set function. What does the set function do? It converts the list of fields into a Python set, an unordered collection of unique items, which is what lets us subtract one schema from another. Let me first print data frame one dot schema so you understand what it returns: all the columns available in this data frame. Now let me wrap it in set and execute. In the output the fields are now wrapped in curly braces. Once both schemas are converted to sets, we can subtract the schema of data frame three from the schema of data frame one. We simply use the minus operator with the set of data frame three's schema, and execute. The result contains only the columns that are missing.
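The equality check and the subtraction described above can be sketched roughly as follows. This is only a plain-Python model, not the notebook code from the video: each schema is written as a list of (name, type) tuples standing in for df.schema, and the column types and the department_name spelling are assumptions, since the video does not show them.

```python
# Plain-Python stand-ins for df1.schema, df2.schema and df3.schema.
# Each field is modelled as a (name, type) tuple; the types are assumed.
schema1 = [("id", "int"), ("name", "string"), ("gender", "string"), ("salary", "int")]
schema2 = [("id", "int"), ("name", "string"), ("gender", "string"), ("salary", "int")]
schema3 = [("id", "int"), ("name", "string"), ("department_name", "string"), ("salary", "int")]


def compare(left, right):
    """Return the same messages the tutorial prints for a schema check."""
    if left == right:
        return "schema matched"
    return "schema not matched"


result_1_vs_2 = compare(schema1, schema2)
result_1_vs_3 = compare(schema1, schema3)
print(result_1_vs_2)
print(result_1_vs_3)

# Converting to sets allows subtraction: fields in schema1 that schema3 lacks.
only_in_schema1 = set(schema1) - set(schema3)
print(only_in_schema1)
```

In a real notebook the same comparison would be written against df1.schema and df3.schema directly.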
As we can see, data frame one has gender, which is not available in data frame three. Let me copy this and subtract in the other direction, data frame three minus data frame one. This returns department name. So data frame one has the gender column as an extra, and data frame three has the department name column as an extra. But, as we can see, the result is not in array format; it has the curly braces. So we can convert it with the list function, in both places. The list function converts the set to an array format, as we can see. So now we have found out which columns are missing. Once we know the missing columns, the next step is to add them. For that we should write dynamic code that can be reused anywhere. First we check the schemas of both data frames and pick the unique columns that are available in either of them, something like a full outer join, and we store those column names in a variable. Then we loop over that variable, and inside the loop we check whether each column is available in the data frame. If it is there, that is okay; if it is not, we add it. Let's do it in practice. First we need all the columns. For that we use data frame one dot columns, plus data frame three dot columns. Let me store this in a variable, all column, and print it.
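To make the two-way subtraction and the combined column list concrete, here is a small sketch. It works on plain lists of column names as a stand-in for df1.columns and df3.columns (the video subtracts schema fields, but the idea is the same), and the variable names and the department_name spelling are assumptions.

```python
# Stand-ins for df1.columns and df3.columns (plain Python lists of names).
df1_columns = ["id", "name", "gender", "salary"]
df3_columns = ["id", "name", "department_name", "salary"]

# Subtract in both directions, then convert each set back to a list.
missing_in_df3 = list(set(df1_columns) - set(df3_columns))
missing_in_df1 = list(set(df3_columns) - set(df1_columns))
print(missing_in_df3)
print(missing_in_df1)

# The plus operator concatenates the two lists, duplicates included.
all_columns = df1_columns + df3_columns
print(all_columns)
```

Note that all_columns still contains id, name and salary twice at this point, which is why a de-duplication step is needed next.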
It has all the columns from data frame one as well as from data frame three: id, name, gender and salary from data frame one, and id, name, department name and salary from data frame three. But we need only the unique columns. How do we get them? We create one more variable, unique column, and use the set function, which keeps only the unique values from this array. Let me execute and print the newly created variable. Here we can see it has department name, salary, name and so on, each column exactly once. As you can see, it is again not in array format, so we use the list function to convert it into a list, and you can see department name and everything else is there. Why did I convert it? Because we are going to loop over it, and I want a plain list to work with. Now that we have the list of columns, what next? We loop through these columns one by one and check whether each is available in the data frame's columns. If not, we add it. That is also very simple: we use the for loop, which we saw in the last video. We write for i in, then the unique column variable. This loop will execute five times in total, once per unique column. Then a colon and Enter, and inside we write the logic. First we check whether the value i is in df1.columns. Since we care about missing columns, instead of in we use not in.
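The de-duplication and the not in check can be sketched like this, again on plain lists standing in for the data frame columns. The two missing_from lists are hypothetical helpers added only so the result of the check is visible; in the video the loop body adds the column instead.

```python
# Stand-ins for df1.columns and df3.columns.
df1_columns = ["id", "name", "gender", "salary"]
df3_columns = ["id", "name", "department_name", "salary"]

all_columns = df1_columns + df3_columns

# set() drops the duplicates; list() gives a plain list to loop over.
# A set has no guaranteed order, so the printed order may vary.
unique_columns = list(set(all_columns))
print(unique_columns)

missing_from_df1 = []  # hypothetical: records what the loop finds
missing_from_df3 = []

# The loop body runs once per unique column, five times here.
for i in unique_columns:
    if i not in df1_columns:
        missing_from_df1.append(i)
    if i not in df3_columns:
        missing_from_df3.append(i)

print(missing_from_df1)
print(missing_from_df3)
```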
If the column is not in data frame one, what do we need to do? We need to add it. So data frame one equals data frame one dot withColumn. The first parameter of withColumn is the name of the column, which is i in our case. The second is the value. We want None as the value, but since this is a literal value we have to wrap it in the lit function. The lit function has to be imported: from pyspark.sql.functions import lit. That is done. Next we do the same check for the other data frame, df3: if i is not in df3.columns, we add it to data frame three in the same way. Once the loop has run, all the missing columns will have been added. Let me execute it. As we can see, the command executed successfully. Now let me display both data frame one and data frame three. This time two additional columns should have been added in total: data frame one has department name with null values, and data frame three has gender with null values. So this is the way we can do schema comparison and then add the missing columns. Let me recap what we have seen in this video. We have three data frames. First we check whether the schemas of those data frames match, and print the result. Then we check which columns are missing: we use the set function, subtract one schema from the other, convert the result back to a list, and we can see the missing columns.
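The whole loop can be wrapped up in a reusable helper, roughly as below. This is a sketch under stated assumptions, not the exact notebook code: FakeFrame is a hypothetical stand-in that exposes only .columns and .withColumn so the example runs without Spark, and add_missing_columns and make_null are illustrative names. With real PySpark data frames you would import lit from pyspark.sql.functions and pass lambda: lit(None) as make_null.

```python
class FakeFrame:
    """Hypothetical stand-in for a DataFrame: only .columns and .withColumn."""

    def __init__(self, columns):
        self.columns = list(columns)

    def withColumn(self, name, value):
        # Like the real method, return a new frame rather than mutating.
        return FakeFrame(self.columns + [name])


def add_missing_columns(df, wanted_columns, make_null):
    """Add every column from wanted_columns that df does not have yet."""
    for i in wanted_columns:
        if i not in df.columns:
            # With PySpark, make_null() would be lit(None).
            df = df.withColumn(i, make_null())
    return df


df1 = FakeFrame(["id", "name", "gender", "salary"])
df3 = FakeFrame(["id", "name", "department_name", "salary"])

unique_columns = list(set(df1.columns + df3.columns))

df1 = add_missing_columns(df1, unique_columns, lambda: None)
df3 = add_missing_columns(df3, unique_columns, lambda: None)

print(df1.columns)
print(df3.columns)
```

After this both frames have the same set of columns, but not necessarily in the same order, so a union that matches columns by name (PySpark has unionByName for this) is the safer choice when combining them.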
Next we collect all the columns, pick only the unique ones, loop over them and check whether each column is in the data frame. If not, we add it. That is the whole operation we have seen in this video. I hope you have understood how to compare schemas and how to create the same schema for different data frames. Thank you so much for watching this video. If you liked it, please subscribe to our channel to get many more videos. See you in the next video.