Hello, welcome to SSUnitex. This video is a continuation of the PySpark interview questions and answers series, and in it we are going to look at one more real-time, scenario-based interview question.

Let's assume we have two input DataFrames: the first contains the students and the second contains the student marks. From these two input data sets, the output should contain the student information, that is, the student ID and student name from the student data set, plus a percentage and a result that we need to calculate from the student marks data set. How do we calculate the percentage? For each student ID we sum the marks and divide by the total count of subjects. Next, how do we get the result? For the result we have certain conditions: if the percentage is greater than or equal to 70, the result is distinction; if it is between 60 and 69, first class; between 50 and 59, second class; between 40 and 49, third class; and if the percentage is below 40, the student has failed.

So what approach will we follow to get this output? These are the steps. First we combine the two DataFrames using a join, so the ID and name come from the student DataFrame and the subject and marks come from the student marks DataFrame. Once the two DataFrames are joined into a single DataFrame, we calculate the percentage: we group by the ID and name columns, then take the sum of the marks divided by the total count, and use that to create another DataFrame with the percentage column. Based on that DataFrame we create one more DataFrame which also has the result column. Here we use a case-statement-style expression: in PySpark we can use when and otherwise conditions to get the result.

Now let me quickly go to the browser and do the same thing in practice. First, let me execute this query; it creates the two DataFrames, df1 and df2. df1 is the student DataFrame and df2 is the student marks DataFrame, as you can see. The first thing I mentioned is the join between these two DataFrames. We simply call join on df1, and inside it we pass the DataFrame we want to join with, which is df2, and then the condition we want to join on. The condition is that the df1.id column matches the df2.id column, because we want to combine the two DataFrames on the basis of the id column. Let me put the result into another DataFrame, say df_join, and display df_join. Let me execute and we will see the output. Here we can see it has id and name from the first DataFrame and id, subject, and marks from the second DataFrame.
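For reference, a minimal sketch of this setup and the join step might look like the following; the sample rows and the column names (id, name, subject, marks) are my own assumptions to make it runnable, not the exact data used in the video:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("student-results").getOrCreate()

# Hypothetical student DataFrame (df1) and student-marks DataFrame (df2)
df1 = spark.createDataFrame(
    [(1, "Alice"), (2, "Bob"), (3, "Carol")],
    ["id", "name"],
)
df2 = spark.createDataFrame(
    [(1, "Maths", 80), (1, "Science", 72), (2, "Maths", 55),
     (2, "Science", 62), (3, "Maths", 35), (3, "Science", 30)],
    ["id", "subject", "marks"],
)

# Join the two DataFrames on the id column
df_join = df1.join(df2, df1.id == df2.id, "inner")
df_join.show()
```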
The id is coming twice, and we are not required to have it twice, so let me remove the id that is coming from df2. To remove it we can use drop and specify df2.id, because we want to have only unique columns in the output. Now we can see we have only unique columns.

Once we have this, the next operation is to get the percentage instead of the raw marks. For this we simply use df_join. The first thing we need is a group by, and the group by should be on the id column and, second, the name column. The next thing we need is the aggregate operation. To use the aggregation functions, let me import them first: from pyspark.sql.functions import *. Inside agg we specify our aggregation. First we want the sum of the marks, so we use sum and pass the column name, marks. Once we have done this, let me put it into another DataFrame, df_percentage, and look at its output. Let me execute and we will see the output. It gives the sum of the marks, but our requirement is the percentage, so after taking the sum we need to divide it by count("*"). Let me execute; now it gives the actual percentage. As you can see it has the percentage, so let's give it an alias name. How do we set the alias? Let me wrap this whole operation inside brackets, and on that we can use alias; the alias can be percentage. Let me execute and we will see the output. Here we can see it now has the percentage column.

The last thing is to check this percentage value: if the percentage is greater than or equal to 70, the result should be distinction. How do we do that? We use the df_percentage DataFrame we created. First let me use select, because we need all the columns, and then we specify our conditions. First I use when, and the condition checks the percentage: if the percentage value is greater than or equal to 70, we specify the output as distinction. Once we have done this, we specify another when condition, and in that condition we check df.percentage less than 70 and df.percentage greater than or equal to 60; on this condition the result is first class. This should come inside the bracket only, like this, and we can close the bracket. We can do the same thing for second class, third class, and fail as well. For second class the condition is less than 60 and greater than or equal to 50, and for that it is second class. For third class the percentage should be between 40 and 50, so here we specify less than 50 and greater than or equal to 40.
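Assuming the df_join built above, the drop, group-by, and percentage aggregation described here could be sketched like this:

```python
from pyspark.sql.functions import sum, count

# Drop the duplicate id column that came from df2
df_join = df_join.drop(df2.id)

# Percentage per student = sum of marks / number of subject rows
df_percentage = (
    df_join.groupBy("id", "name")
           .agg((sum("marks") / count("*")).alias("percentage"))
)
df_percentage.show()
```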
And for the last one, fail, the condition is that the percentage should be less than 40; in that scenario the result is fail. Once we have done all this, let me put it into another DataFrame, df_final, and use display on df_final to check the output. Okay, it is reporting an invalid syntax error just because of this: it should not be a double ampersand, it should be only a single ampersand. Let me execute and we will see the output. Now it says df is not defined, because it should not be df, it should be df_percentage, so let me use df_percentage in all the places. And now in the output we can see all four columns: it has id, name, and percentage as before, and the result column we have added by using the when condition. I hope you have understood how we can write the query for getting this output. Thank you so much for watching this video. If you liked it, please subscribe to our channel to get many more videos. See you in the next video.
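Putting the when/otherwise logic together on df_percentage, the final step might look like the sketch below; display is the Databricks helper used in the video, so df_final.show() is used here instead to keep it self-contained:

```python
from pyspark.sql.functions import when, col

# Classify each student's percentage into a result bucket
df_final = df_percentage.select(
    "id",
    "name",
    "percentage",
    when(col("percentage") >= 70, "Distinction")
    .when((col("percentage") < 70) & (col("percentage") >= 60), "First Class")
    .when((col("percentage") < 60) & (col("percentage") >= 50), "Second Class")
    .when((col("percentage") < 50) & (col("percentage") >= 40), "Third Class")
    .otherwise("Fail")
    .alias("result"),
)
df_final.show()
```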