 Welcome to SSUNITEX, this side and this is continuation of PySpark interview questions and answers. So recently one of my friend has attended interview with the PepsiCo. This question was asked there. So let's assume we are having a input data frame that will be having total four columns. First is the machine ID, second is the process ID, third is the activity type and fourth is the time stamp. And in the output of this, we just want only two columns as machine ID and the average processing time. So if we are focusing in machine ID 0, then we can see activity ID. So activity ID 0 and 1, two type of activity IDs are there. So activity IDs are nothing but it is indicating like a group when the machine was started and when the machine was stopped. So here in the activity type, we can see the start and then we can see the end. And similarly, if the process ID has been changed, then again the machine is started and then the machine is end. So here total three machine IDs that we have 0, 1, 2 and then we have the process ID and the activity type. So how we can calculate the average processing time? So for getting the average processing time, first we are required to take the difference from end to start. Let's assume we are getting x value while we are going to subtract this. Similarly for another process ID for that machine, we will be getting y value. So here we have the x value and here we have the y value. Once we will be going to have these two values, then we can take the average on the basis of machine ID. Like on the group by, we will be adding the machine ID for getting the average. So that average will be average processing time. So let me quickly go inside the Excel and we'll try to see the steps how we can achieve this output. So here the first we are having this input data frame. So first thing we are required to use this machine ID. Then we have to use the process ID and then we'll be going to have two columns. First we'll be having the start time and second could be having the end time. So by using case statement, we'll be going to check if the start activity is here. Then we'll be going to use the timestamp over here. And if we have the end, then the timestamp value will go here. So let's assume we have zero and then we have zero and for the same process ID. Here we'll be going to have the start time that we can see this value. So start time will be having here and this value will be null. And in the next row, it will be have only the end time because here the activity type is end. So it will be having the value like this. Once we'll be going to have this data frame that we'll be going to create as intermediate data frame, then we'll be going to do the group by on the basis of machine ID and the process ID and we'll be taking the max on this start time and from this end time. Then go and create this data frame from this intermediate data frame. So this could be data frame two and this could be your data frame three. Once we'll be going to create this, then we can simply create another data frame and that data frame will be going to simply take the difference between your end time with your start time. So we'll be going to subtract and we'll be getting the difference from here. And at last, once we have this difference, we can simply take the average on the basis of the machine ID. So this is the process that we have to follow. Let me quickly go inside the browser and we'll try to see in practical. So here, let me try to run this cell for creating this data frame. So data frame one has been created and it is having the same data that we have seen in the slide. Now the first thing that I told you we have to use the case statement. So simply we can go with DF1. We can go with select and here first we are required to select the machine ID. So I'm going to select the machine ID. The first column, the second column will be your process ID. So we can get the process ID from data frame one dot. Here we have the process ID. Now here we have to use the case statement. So for using the case statement, we have to use the when condition. So when function is not available here. So we have to import this when function from SQL dot functions. Here I'm going to import all the functions because we have to use the aggregate functions later. Now, here we are required to check DF1 dot activity ID. So this activity ID, if this activity ID value as a start, then what we want to do here, we just want to have the DF1 dot time stamp. So this time stamp nothing but it will be have a column and we can call this column as start time. So if activity ID as a start, it will be going to have that time stamp here. If your activity ID as end will be going to create another column and will be calling that column value as end time maybe. So here let me try to use this as end time and this activity ID value will be end. Let me put this into another data frame that could be DF2. Let me use display of this DF2. Let me execute this cell and we will see the output of this. Here we can see the same output that we have seen in Excel. Now once we have this data frame created then we can go and try to do the group by on the machine ID and the process ID and we will take the max of this start time. So these null values will be gone and it will be having only a single row for each process ID under the machine ID. So under the DF2 we can go with group by and in group by we can specify DF2 dot machine ID the first thing and the second column will be DF2 dot process ID. Then we are required to use the max. So here I am going to use the aggregate and then I am going to use the max of this DF2 dot start time and let me use the alias of this as a start time. So I am not going to change the alias name. So the column name will be same and here I am going to use the end time instead of start time and here it could be your end time. So simply we can put this into another data frame and let me use the display of this DF3. So we will see the output of this DF3 data frame that we have created. Now we have all these values till now we are good next we have to do the subtract between end date to start time. So this is the end time this is the start time we can do the subtract and then we can take the arrays on the basis of machine ID. So here we have data frame 3 we can use and then let me try to select and here I am going to select only this machine ID because this column is needed only for our use and the second column that we are required the difference between end time to your start time. So we can simply go with DF3 dot end time minus DF3 dot start time and here let me put this inside the bracket for providing the alias name of this we can use the diff here and we can put this into another data frame that could be DF4 and let me try to display of this DF4. So this DF4 will be having only two columns first will be your machine ID and second what is the difference between end time to start time. Now we can simply take the arrays of this so for taking the arrays we can simply use DF4 dot here we can go with the group by and this group by will be only on this machine ID column and after that we can use the aggregate function and inside that I am going to use the arrays function. So this arrays function DF4 dot diff. So under this diff column we are going to take the arrays let me put the alias name of this column that will be going to have the arrays. So it could be arrays processing time and we can simply use the display and we can execute this cell for checking the output. So here we should be able to see the same output that we have seen in the slide. So I hope guys you have understood how we can get it. So thank you so much for watching this video. If you like this video please subscribe our channel to get many more videos. See you in the next video.