Hello, welcome to SSUnitech. This is a continuation of the PySpark tutorial series, and today we are going to look at user-defined functions in PySpark. Before going forward, if you haven't watched the last two videos of this series, I strongly recommend watching them, because this video is a continuation of those: in part 36 we covered the for loop, and in part 37 we covered schema comparison. So what is a user-defined function? A user-defined function is used to extend the functionality of the framework and to reuse that logic across multiple data frames. Let's assume we have a requirement where we have two data frames and we need to load the values of both into a single destination. The first thing we need to do is check the schemas of the two; if the schemas do not match, we need to add the missing columns to each data frame. That is the requirement. For this we will create a user-defined function that takes two data frames as input parameters and returns data frames as output. Let me quickly switch to the browser and we will try to understand this practically. Here, as you can see, we have three data frames. Data frames one and two have the same schema: ID, name, gender and salary. Data frame three has ID and name, but gender is missing; additionally, it has department name along with salary. Let me execute this so the three data frames are created and we can see the values they contain. Now, to define the user-defined function we use the def keyword followed by the function name. I am going to call it schema_compare, and since it needs two data frames, I will take two input parameters, data frame one and data frame two. Then we add a colon and press Enter. That is the syntax for a user-defined function.
Now, the first thing we need is the full list of columns. For that we can create a variable all_col and assign df1.columns plus df2.columns, so all the columns from data frame one and data frame two are combined into it. Next, we need only the unique columns from this list. All of this we have already seen in the last video. To get the unique columns, I am going to create a variable unique_col and apply set to the all_col variable; set returns a set object, so to get a list back we wrap it in the list function. This will return all the unique columns across data frame one and data frame two. Now we need a loop that iterates over all the unique columns, and for each one we will check data frame one and data frame two: if the column is missing, we will add it. For the loop we can write for i in unique_col. That is the syntax: it iterates one by one, and i holds the name of the current column. Inside the loop we check if i not in df1.columns, which means that column is not present in data frame one, so we need to add it. The withColumn function on data frame one can be used to add a new column: the column name will be i, and the value will be null, so we use None. Because this is a literal value, we have to use the lit function, and to use lit we first need to import it. We can write pyspark.sql.functions, then import, then lit, and since from is missing, we add from at the start: from pyspark.sql.functions import lit. That's it.
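The column-combining steps above can be sketched with plain Python lists. The column names here mirror the schemas described in the video; the variable names follow the ones used on screen.

```python
# Columns as they might look on data frames one and three.
df1_columns = ["id", "name", "gender", "salary"]
df3_columns = ["id", "name", "salary", "department_name"]

# Concatenate both column lists, then deduplicate.
all_col = df1_columns + df3_columns
unique_col = list(set(all_col))  # set() keeps unique names; list() turns the set back into a list

# Columns missing from each side, which the loop will add as nulls.
missing_in_df1 = [c for c in unique_col if c not in df1_columns]
missing_in_df3 = [c for c in unique_col if c not in df3_columns]
```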
Next, we check if i not in the columns of the second data frame, that is, df2.columns, and if that column is not present in data frame two, we add it there with withColumn, passing i and lit(None). We assign the result back, so data frame two is replaced with the version that has the newly added column. The same thing we do for data frame one. Once everything is done, the function returns data frame one and data frame two. So your function is ready. Let me execute it. The function is created, and you can see it expects two input parameters, data frame one and data frame two. Whenever we use this function, we have to pass these two data frames. Let me call the schema_compare function, that is, our user-defined function, with data frame one and data frame two as the input parameters. We created data frames one and two at the start, as you saw. Because the schema is the same for these two, it will not add any additional columns. The function also returns two data frames, so let me capture them as data frame one, data frame two: data frame one's value comes first and data frame two's comes second. Here we can use the display command for data frame one, and similarly the display command for data frame two. Let me execute it... okay, it is saying that the module name we specified inside the user-defined function is not correct: the p in pyspark should not be capitalized, which is why it was returning an error. Let me re-execute. Now you can see it executed, and we have the same data set, which is why no additional column has been added. Now let me pass data frame three instead of data frame two and execute. This time it adds one column to each of the data frames.
So here, department name was missing, and department name has been added; and in this data frame gender was missing, so gender has been added. Let me recap what we have seen. We created a user-defined function that takes two data frames as input parameters. Inside it, we combine all the columns from both data frames and store them in one variable. Then we keep only the unique columns in the unique_col variable, and we loop over that unique_col variable, so the loop runs for every column that is available in data frame one plus data frame two. For each column we check whether it is available in data frame one: if it is, that is fine; if not, we add that column. Similarly, we check for data frame two. Finally, we return data frame one and data frame two as the output. We executed the same thing here and verified it. So I hope, guys, you have understood how we can use a user-defined function in real time. Thank you so much for watching this video. If you liked this video, please subscribe to our channel to get many more videos. See you in the next video.