 Hello, welcome to SS Unitex, so she'll decide and this is continuation of PySpark interview questions and answers. So in the today's video, we are going to see one more real time interview questions and answer. So recently I have received one of the request. So one of my subscribers has sent me this request or email. So here we can see we are having two input data frames. First data frame would be department and second data frame will be employee. We have employee salary information here and in note like manager ID is employee ID of the employee. So here we can see the manager ID column. So this manager ID will be the employee ID of the same data frame. Now what is the requirement? The requirement is in the output we just want to have the department name column, then manager name column, then employee name, then salary year, then salary month, then total monthly salary. So for getting the department name, we have to join both the data frames and we'll pick the department name from the department data frame. Then we can see manager name. So for getting the manager name, we have to join the employee data frame with employee data frame for getting the manager name by using the employee ID and manager ID. Next we can see employee ID. So employee ID, we can simply get it from here. Then we can see salary year. So for getting the salary year, first we have to convert this salary date column which is in string format because this date format is not supported in PySpark. So we have to convert this first into the date. Then on the newly created date column, we have to extract the year and then we need to extract the month and last we'll be going to do the sum of this salary for getting the total monthly salary. So first we'll be going to create this new column that will be the actual date format of the salary date. So first we have to do this. Second we have to do the join between employee and department data frame for getting the department name and employee to employee data frame for getting the manager name and rest will remain same and at last we'll be going to extract the year part from this column and month also from this column and then we'll be grouping on the department name manager name employee name year and month and then doing the sum of this salary for getting this output. So let me quickly go inside the browser and we'll try to see in practical. So here let me try to execute this query for creating both the data frames that we have seen in the slide. So one data frame that we are creating with df underscore salary second will be df underscore department. So both the data frames has been created and we can see here now the first requirement first we have to convert this salary date column to actual date and we'll be adding a new column. So for that here we have to use the df underscore salary dot here I'm going to add a new column so we can go with the column option. Here let me create new salary date column. This will be newly created column and then we have to convert the salary date. So for conversion to date we have to use the two date function and then here we have to specify like which column we want to convert. So we want to convert this salary date column and the second parameter it is asking like on which format we want to convert. So we want to convert into some of the date format like ddmm way way like that and let me put this into a data frame that is df. Let me use the display of this df and we'll execute and here we should be able to see a new column will be added and that is in date format. If you can expand this here we can see this new salary date is in date format and the old one salary date is in string format. So now we have this newly created date column. Now the next requirement we have to join this data frame with department data frame for getting the department name. So how we can do that we can simply go with df. Then we can go with join and then here we have to specify like on which data frame we want to join. So you want to join with df underscore department then on which column basis you want to join. So we want to join with department id column. But here we have a catch. If you are going to directly join let me show you that if you can go with df dot department id is equals to df underscore department dot department id. If you are going to join between these two let me create a new data frame and go with display of this df1. So here we will be having the duplicate column with the department id. One department id column is coming from this df data frame and second department id column is coming from department data frame. But we don't want to keep the duplicate columns in the output. So how we can remove it? So we can simply remove this join condition and here we can simply specify department id. So it will be joining with the department id and it will be having only a unique column in the output. So like that we can specify and let me execute and we will see the output of this. So here we can see duplicate column has been removed and we have department id once only. Now we have created this df1 data frame. The next requirement we just want to join with the same df1 data frame for getting the manager id. So how we can do the same join? We can go with df1 dot here we can specify join and under that we can specify the same df1. So here we have to also provide the alias name for both of those. So here we can specify like alias name of this as a. So this is one of the data frame and we are joining with the same data frame. Let me provide the alias of this as b. So these two data frames will be going have a and b data frame. Now the next thing here we have to specify like on which column we want to join. So we want to join on a. Here we can specify the column. So the column that will be manager id. So that is the mgr id should be equals to equals to b. Here we can specify employee id and here we have to specify type of join that should be left join. Once we have done this let me put this into another data frame and let me use the display of this df2 and we will see the output of this because we cannot use directly a here. So what we have to use we have to specify this into a call function and here we have to put this into single code. Do the same thing with this column as well and here let me try to execute this cell. So call function is not defined. So we have to import this call function first. So we can go with from pyspark.sql.funcels then we can see import and here we can import the call function. Let me execute this and we will see the output of this. So the new data frame will be having all the columns from data frame a and data frame b. So from both the data frames we will be getting all these columns and here we can see we are having all these columns from left table and then we can see all the columns from right table. Now everything is okay but we are required to select only required columns. So how we can get all those. So at last we can specify select and inside the select we have to select only required columns. So how we can select only required columns. So first we have to get the department name. So department name will be getting from this data frame which is a. So here let me try to use the call function and we can use a. here we can get the department name. So this is the first the second thing we just want to get the manager name. So manager name we will be getting from b data frame and inside the b whatever the employee name that will be the manager name. So we can simply use call function and here we can specify b. employee name. So we can simply go with EMP name and here let me provide the areas of this. So this areas is nothing but the manager name. Now the next thing let me select the employee name. So employee name we can simply get the a data frame that is the left data frame and this is the employee name. Next we can see salary column that we are required to the actual new salary whatever the new salary we are having we need to pick that column here. So that is also present in left data frame. So we can go with that only one column we are required from the right data frame that is the manager name and last we have the salary column. So we are required to get the salary column here. So a. salary once everything is done let me try to execute and we will see the output. So in the output we should be having only these selected columns. So here everything is coming from the left table except this manager name. Manager name is coming from the right table. Now once we have created this data frame next we are required to do the grouping on all these columns and this new salary date we need to extract the year and month from here and doing the sum of this salary column. So simply we can go with df2 dot here we can go with the group by and inside the group by we have to specify all the columns. So the first column is department name. The second column is manager name and the third column will be EMP name and the fourth column will be extraction of this year from this new salary date. And let me put the alias name of this. The alias name of this will be year. The same thing we have to do for the month as well. So let me copy this. This should be month and we have to extract the month from here as well. Once we have done all these things then we have to do the sum of this salary. Now let me put this into another data frame maybe data frame 3 and let me go with display of this data frame 3 and we will see the output of this. So year is not defined. So we have to import this year function first. So we can go with from pyspark dot sql dot functions then we can import first year then month. Let me execute and we will see the output of this. So salary is not defined because we have not put this into a single code. So in the output we could see we are having all these columns. So here we can see we are getting the as expected output but here the month is coming in number. If you want to convert this into a string format then we can simply use inside this month. We can use instead of getting this we can go with the date format. So we can specify the date format and inside that we have to specify triple m. So triple m will be getting this as an April. So we can see in the output. So here we are getting April instead of four and if you want a complete name instead of three part name then we can specify four time m then it will be returning complete April instead of APR that we can see here. So this is what you can do. I hope guys you have understood how we can write the query for getting this output. I provide all this script in the description of this video or in the comment side. So thank you so much for watching this video. If you like this video please subscribe our channel. See you in the next video.