Hello, welcome to SSUniTeX. This is a continuation of the PySpark interview questions and answers series. Recently one of my friends attended an interview at an MNC, and this question was asked there: the interviewer wanted to understand partitionBy and how partitionBy works internally. So, as per today's agenda, first we'll see what partitionBy is, and then we'll see how partitionBy works internally.

What is partitionBy? Suppose we want to split the data into separate folders based on a specific column. partitionBy will create a separate folder for each distinct value of the partition column. What does that mean? Let's say we have a dataset with a country column, and the requirement is to create subfolders based on that column. partitionBy will pick the distinct values from the country column and create a separate folder for each of them. That is partitionBy.

Let's quickly go into the browser and look at the internal working of partitionBy. Here we have a CSV file, and we want to read the data from it. Let me execute this cell. It creates the df DataFrame, and first I'm going to check the total count of this DataFrame. Let me rerun this; it should have a total of 1 million rows. Next, we want to see what kind of data it holds. Let me rerun this cell. It has the columns SalesOrderID, SalesOrderDate, ItemName, Quantity, Value and ItemCode. Next, we want to check how many partitions this DataFrame has. To get the partition count, we simply use df.rdd.getNumPartitions(). Here we can see it has a total of eight partitions.
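In this walkthrough, the eight input part-files showed up as eight partitions. As a rough pure-Python analogy (this is not Spark itself; the column names follow the walkthrough, but the rows and numbers are made up for illustration), you can picture the DataFrame as rows sliced across the input files:

```python
# Pure-Python analogy (not Spark): picture the DataFrame as rows spread
# across the input part-files. The row count is scaled down from 1 million
# to keep the sketch small.
rows = [
    {"SalesOrderID": i, "ItemCode": f"H{i % 20:02d}", "Quantity": 1}
    for i in range(1, 17)
]

# In the video the sales.csv folder held 8 part-files, and
# df.rdd.getNumPartitions() reported 8: one partition per input file here.
num_part_files = 8
partitions = [rows[i::num_part_files] for i in range(num_part_files)]

print(len(rows))        # analogous to df.count() -> 16
print(len(partitions))  # analogous to df.rdd.getNumPartitions() -> 8
```

The point of the analogy: the partition count you get back from getNumPartitions() reflects how the data is physically split, which in this example matched the number of files Spark read.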
Let me quickly go into the blob storage. If we go into this particular file, the sales.csv folder, it should have a total of eight partition files, which we can verify here: partitions 0, 1, 2, 3, 4, 5, up to 7. Including 0, that is eight partitions, so there are eight files from which we are fetching the data.

Now the requirement is to create subfolders based on the distinct values of the item code. How can we write this into the partition folder under the input container? Here I am going to create another file, salessep_partition.csv. If the file is already there, we just want to overwrite it, and here we are using partitionBy, so it will partition on the item code. Let me rerun this cell; it will generate those partitions. The job executed successfully. We can go into the container, and here we should see this salessep_partition folder. If we open it, we can see all these folders: each folder corresponds to one distinct value of the item code, and a separate folder has been created for each of those.

Now I am going to read the data from this newly created file, and then we want to check how many partitions it has. Let me rerun this cell. It executed successfully, and we can see the number of partitions: at the file level, nothing has changed. If I open the first item code folder, we can see only a single file, the one that was already available earlier, holding the values for H01; only this file contains the information for this item code. And if we check another folder, say H17, we again see a single partition file there.
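What the partitionBy write did on disk can be sketched in plain Python (a toy analogy, not Spark's implementation; the rows, codes, and file names here are hypothetical):

```python
import csv, os, tempfile
from collections import defaultdict

# Toy sketch of write.partitionBy("ItemCode"): group the rows by the
# distinct values of the partition column and write one subfolder per value.
rows = [
    {"SalesOrderID": 1, "ItemCode": "H01", "Quantity": 2},
    {"SalesOrderID": 2, "ItemCode": "H01", "Quantity": 1},
    {"SalesOrderID": 3, "ItemCode": "H17", "Quantity": 5},
]

by_code = defaultdict(list)
for row in rows:
    by_code[row["ItemCode"]].append(row)

out_dir = tempfile.mkdtemp()
for code, group in by_code.items():
    # Spark names these folders ItemCode=H01, ItemCode=H17, ... and omits
    # the partition column from the files inside, since the folder name
    # already carries that value.
    folder = os.path.join(out_dir, f"ItemCode={code}")
    os.makedirs(folder)
    with open(os.path.join(folder, "part-00000.csv"), "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["SalesOrderID", "Quantity"])
        writer.writeheader()
        for r in group:
            writer.writerow({k: r[k] for k in ("SalesOrderID", "Quantity")})

print(sorted(os.listdir(out_dir)))   # ['ItemCode=H01', 'ItemCode=H17']
```

This mirrors what the walkthrough observed: one folder per distinct item code, while the number of files inside each folder is untouched.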
So, no partitioning has happened at the file level; partitionBy has only partitioned at the folder level. It created the folders, but the files remain the same. Now I am going to partition at the file level as well. If we want to create partitions at the file level, we should use repartition, which is available in PySpark. Let me quickly run this cell. It executed successfully, and let me check how many partitions it has now: it should have 100 partitions, as we can see. That means the data that was earlier spread across eight files is now spread across 100 partitions.

Now let me write this data into the salessep_repartition.csv file. Let me execute this cell. It should create the file, and that file should contain a total of 100 part files. Let me quickly go into the partition folder, and here we can go into this repartition folder. If we scroll down, load more, and reach the bottom, we should see partitions numbered up to 99; including partition 0, that is 100 part files in total.

So, if we want to partition at the file level, we should apply repartition on the DataFrame and then write that data to the file. If we use partitionBy, the partitioning happens on the folders; at the file level it will not touch anything. repartition will also help us if we have a data skew problem; what the data skew problem is, we have already seen earlier in this video series. Thank you so much for watching this video.
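The repartition step can also be sketched in plain Python (again a toy analogy, not Spark's implementation; Spark distributes rows round-robin for repartition(n) when no column is given, and the counts here are scaled down from the video's 1 million rows and 100 partitions):

```python
# Toy sketch of df.repartition(n): the rows themselves are redistributed
# into n roughly equal chunks, so a subsequent write produces n part files
# instead of the original 8.
rows = list(range(25))   # stand-ins for the sales rows
n = 10                   # scaled down from the 100 used in the video
chunks = [[] for _ in range(n)]
for i, row in enumerate(rows):
    chunks[i % n].append(row)   # round-robin assignment

print(len(chunks))              # number of "partitions" -> part files: 10

sizes = [len(c) for c in chunks]
print(max(sizes) - min(sizes))  # at most 1: chunks are evenly balanced,
                                # which is why repartition helps with skew
```

The balancing in the last line is the key difference from partitionBy: partitionBy's folder sizes follow the data (a popular item code gets a big folder), while repartition spreads rows evenly across partitions.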
I hope, guys, you have understood partitionBy and how it works internally. Thank you so much. See you in the next video.