In this video of the PySpark tutorial, we are going to look at distinct and drop duplicates: when we can use the distinct clause and when we should go with the drop duplicates function. Today's agenda: first we will see the distinct clause, next we will see drop duplicates, and then we will see the difference between the two.

The difference is that distinct is applied across all the columns. If your data set has 10 columns in total, distinct compares all 10. Drop duplicates is used for selected columns. If you have a little bit of an idea of SQL, think about how we write a select query: inside the select query we can write distinct, and if we follow it with the asterisk, that is distinct; if we specify only a few columns, like ID and name, that is closer to drop duplicates. So it works like that. Let me quickly go to the browser and we will try it in practice.

Here we have this data frame, df, which holds the sales data, and we are reading this data from the sales file. First I am going to use the distinct clause. We can simply call df.distinct(), so it's very straightforward. Let me load the result into a new data frame, df1, and read from df1. Now let me execute it. Here we can see we have a total of 799 rows, and all of them are distinct rows. With distinct we cannot specify any columns, so if we want to remove duplicates based on a particular column, we have to use drop duplicates.

So let me comment this out, add drop duplicates, and assign it to one of the data frames. Here we have to specify the column by which we want to remove the duplicates. I want to remove the duplicates from the item name, so we can
specify the item name and execute it. So here, that should be drop duplicates; let me try to execute it again. Okay, so if you specify the column like that, it will not work; we have to pass the column in the subset, like this. Let me execute it again and we'll see the output. Here we can see we have only distinct item names, and the duplicate item names are gone. It keeps one row for each item name, so for the other columns it picks an arbitrary value from the duplicates.

If we just want the item name, we can use the select clause: in the select function we specify that column, the item name, and execute, and we'll see it returns only a single column with the distinct values.

So I hope you have understood when we can use the distinct function and when we can use drop duplicates. Thank you so much for watching this video; see you in the next video.