Hello, welcome to SS Unitex. This is a continuation of PySpark interview questions and answers. Recently I received an email from one of my subscribers, and this question was asked in a TCS interview: how can we check for a data skew issue, and how can we resolve it?

First we have to understand what data skew is. Let's assume we have a dataset containing a total of 10 million rows, and the data is partitioned unevenly. For example, we can see 3 million rows in one partition, 1 million in each of two partitions, 2 million in one partition, and again 3 million in one partition. Data partitioned like this is going to cause a problem while processing.

What will the problem be? Let's assume we have a total of 5 nodes that will be processing this data: node 1, node 2, node 3, node 4, and node 5. Once these partitions are assigned to the nodes, node 1 will be processing 3 million rows, node 2 will be processing 1 million rows, node 3 another 1 million, node 4 will be processing 2 million, and the last node will be processing 3 million. As these nodes have the same memory and the same capacity, the node with 1 million rows will finish quickly and will then wait for the other nodes to complete before getting its next task. In the meantime, those nodes will be sitting idle, and this problem lasts until the largest partition is processed.

So how can we resolve it? For resolving it, we need to repartition the input data. Once we do the repartition, the output will be evenly distributed: maybe one record will be taken from this partition and added to that one, and similarly records will be moved between partitions, until all the partitions have 2 million rows each.
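The straggler effect described above can be sketched with simple arithmetic: with one task per node and roughly equal per-row cost, the stage finishes only when the largest partition finishes, so the skewed layout is bottlenecked on its 3-million-row partitions even though both layouts hold the same 10 million rows.

```python
# Back-of-the-envelope sketch of why skew hurts: stage time is driven by
# the slowest (largest) task, not by the average partition size.
skewed = [3_000_000, 1_000_000, 1_000_000, 2_000_000, 3_000_000]  # rows per node
even = [2_000_000] * 5                                            # after repartition

assert sum(skewed) == sum(even) == 10_000_000  # same total data either way

print(max(skewed))  # 3000000 -> the 1M-row nodes sit idle waiting for this task
print(max(even))    # 2000000 -> all five nodes finish at roughly the same time
```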
With evenly distributed partitions, the other nodes will not be sitting idle. First, let's understand how we can check whether we have a data skew problem in our query. Let me quickly go into the browser and we will try to see it in practice.

Here I am going to read the data from a CSV file, sales.csv. I read the data into a DataFrame, and then we display that DataFrame. Here we can see we have a total of 799 rows. First, let's check how many partitions we have in this DataFrame. For checking the partitions we can simply use df.rdd, and then we have the getNumPartitions option. Let me execute it and we will check the total partitions. As we can see, we have a total of one partition. So all 799 rows will be processed by a single core, and the other cores will be sitting idle. This is not good practice.

So first we need to repartition this DataFrame and then check the counts again. For doing the repartition we can simply use df.repartition, and here I am going to create a total of 10 partitions. Now let me use df.rdd again to check how many partitions it has, with getNumPartitions. Let me execute it and we will see the output. Earlier it had only one partition, as we saw here; now we can see we have a total of 10 partitions. So now your DataFrame has a total of 10 partitions.

Now let's understand how many rows we have on each partition. How can we check that? We can use df, then select, and here I am going to select only two columns: the first column will be the partition ID, and the second will be how many rows that partition has.
For getting the partition ID we can use spark_partition_id, which we can see here. We can simply use this spark_partition_id and give it an alias, say partition_id. Once this column is added to the DataFrame, let me group by the same column. On the groupBy side we simply specify the partition_id column, and at the end we check the count. Now let me put this into another DataFrame, df1, display it, and execute. Here we can see the first partition has a total of 79 rows, and all the remaining partitions have 80 rows each. Once the Spark engine processes this data with evenly distributed partitions, none of the executors will be waiting for another executor to complete.

So let's recap what we have seen in this video. First we read the data from the CSV file. Then, using getNumPartitions, we checked how many partitions we had in that DataFrame, and since the partitions were not evenly distributed, we went with repartition so the data would be evenly distributed. After that, using spark_partition_id, we checked the partition ID, grouped by that partition ID, and checked the total count on each partition.

I hope you have understood this question. Thank you so much for watching this video. If you liked this video, please subscribe to our channel to get many more videos. See you in the next video.