 Hello, welcome to SSUNITEX to this side and this is continuation of PiSpark interview questions and answers. So today we are going to see one more real-time interview question that was asked in PepsiCo company. So here we are having this input data frame and we want to generate this output. So in the input data frame it is having the sales information like the sales order ID, sales order date, item code and item quantity and item value. But in the output we just want to have only year month of the sale and then the total sale what is done on that particular month and year. And then we need to calculate the percentage difference for previous month. So how we can get it? So first we will be going to get the total sale and then we need to take the difference between next month to the previous month. So here as we can see the difference that is 1000 so on 1000 we will be taking the percentage from 4000 so it will be 25. And similarly here we are going to check the difference between 10000 to 4000 so that will be 6000 and if we will be going to take the percentage on 6000 of this 10000 so it will be 60. So let me quickly go inside the Excel file and we will try to understand how we can write our logic. So as here we can see this is our input data frame. So first what we will be doing? We will be going to get the month and year from this sales order date column. Then we will be going to take the sum of the item value. So once we will be doing this and then after we can simply take the difference. So how we can take the difference? So for that we have to get the lag function we can use. We will be getting the value from the previous row. So this is the first row. So for this it will be null and let's assume if it is having total 3000 and maybe next is having 4000 and then we will be having 10000. So here lag function will be doing it will be going to get the 3000 and in 10000 it will be getting the previous row as 4000. Once we will be going to generate this data frame then we can simply go and take the difference percentage. So how we can take the difference percentage? So first row as it is null so we will be going to get null here. But in the second row we can simply take the difference of this 4000 minus this 3000 and then we can divide with the 4000 and we can also multiply with 100 so it will be 25. Similarly we can do for next row. So first thing we will be going to generate this data frame and then we will be adding the lag and then we will be going to get the difference. So these three steps we have to do for achieving the output. Let me quickly go inside the browser and we will try to see in practical. So here I am going to create the data frame that is DF1. So this DF1 data frame will be having the same data that we have seen in slide and Excel file. So here we can also verify it is having the sales order ID sales order date item code and item quantity and value. If we want to verify the data type of this so for the sales order date the data type is string. So that is not correct. So first we are required to convert the data type of this. So we have to do the typecast. So for doing the typecast here I am going to replace the same column that we are having SO date. And here we are doing the cast as a date type. Let me try to execute this cell and we will see the schema of this. So data frame 1 will be going to replace with this operation. And here the print schema is indicating like SO date is having the date column. Now let me use the display of this DF1. So the first thing we are required to get the month and year from this SO date and item value because only these three columns are required for our operation. So for that we can simply write like DF1.select. And here first let me try to get the month and this month we can get from the SO date. SO date is present in DF1 data frame. So we can use the DF1 data frame here and this could be SO date. And we can also provide the alias of this. So alias name will be indicating like month. Now let me add next one. Next we will be saying as year. So we can go here and it will be adding year. Alias name will be going to have like year. Now it is having two columns. The third column that is needed as item value. DF1. We can go with item value. Let me put into another data frame that is DF2. And let me use the display of this DF2. So name month is not defined because we have not imported this from the library. So we have to import this function by using pyspark.sql. Then we can go with functions. Then we can import and put a stick here. Let me execute and we will see the output of this. So here we have like month, year, item value. So these three columns are required. Next we just want to do the sum of this item quantity and we will be doing the group by on the month and year. So how we can do that? So for that as it is having data frame 2. So we can go with data frame 2. Then we can specify group by. And inside the group by we can specify these two columns. So we can go with DF2.month. That is the first column. DF2.year. That is the second column. So we are doing the group by on this. Then we can go with aggregate. And here we are doing aggregate like sum of this DF2.item value. And after that we can also provide the alias name of this. So the alias could be total sale. Let me put this into another data frame that is DF3. And here let me see the output of this DF3. So it should be having month, year and the total sale. So everything looks good till now. Now we are required to use the lag function for getting the value from the previous row. And then we can take the difference and get the percentage. For using the lag function we have to import the window function. So for that we can use the pyspac.sql.window. We can import a stick from here. So here we can use DF3.letmeselect. And here let me select all the columns first. And then we can use the lag function. So simply we can use lag of which column. So it will be on total sale column. So we can specify total sale. We can specify dot we can specify over. And here we can specify like window. We can specify order by. And this order by should be on month and year. So we can specify DF3.month. And second DF3.year. And then we can also specify the alias name of this. So it will be like pre-sale. Let me put this into another data frame could be DF4. Let me execute this and we'll see the output of this DF4. So as we could see previous sale we are getting here. And it is getting the value from the previous row. So like this row is having 4000 and previous row is 3000. So on this we are getting 3000. Similarly in the next row it will be getting the previous row value that is 4000. So 4000 is coming here. Now the final thing we are required to take the percentage. So how we can get the percentage difference. So first we are required to do the difference and then we are required to do the percentage. So we can go with the DF4 dot here. Let me try to use the select and let me select all the columns first. And then here we are required to do the operation. So how we can do this. So we can use the DF4 dot the column that is the total sale. We can subtract DF4 dot previous sale. So once we are going to do the subtraction here. And after that we are required to do the divide on this with the DF4 dot total sale. And let me put this into a bracket. And here we are required to multiply with 100 for getting the percentage. Let me use the display of this. So we will see the output of this. So here we can see it is getting the exactly same output that we were expecting. So here we are getting 25 and here we are getting 60. So by using this approach we can achieve it. So I hope guys you have understood how we can achieve the output how we can write the query. I will provide all these scripts in the description of this video. You can copy for your practice purpose. Thank you so much for watching this video. If you like this video please subscribe our channel to get many more videos. See you in the next video.