Hello, welcome to SS Minitech. In this video we are going to see how to remove duplicates from source data. Suppose we have a file on Azure Blob Storage (or any other source) and we want to load its data into a destination. Before loading, we first need to check whether the data contains duplicates, and if it does, we should remove them.

Let me show the source first. Our source is this JSON file, which contains customer information: customer ID, first name, last name, city, and enrollment date. As we can see, ID 2 has a duplicate record: the customer ID, first name, last name, and city are the same, and only the enrollment dates differ. Per the requirement, whenever we have duplicates we should keep the row with the later enrollment date. Here the first row has 1992 and the next has 2006, so we should keep the 2006 row in the output. Similarly, if we scroll down to ID 9, everything is the same except the enrollment date: 1999 for the first row and 1998 for the second, so we want to keep the 1999 row. The same goes for ID 3, which has 2001 and 2006. So we want to keep only the distinct data, picked by enrollment date in descending order.

Let us see how to do that in practice. The file is on Azure Blob Storage: under the input folder of the container we have this duplicate.json file, holding the same information we just saw in Notepad++. Now let me go into Azure Data Factory and implement this. First, we need to add a new data flow.
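Before building the data flow, the dedup rule itself can be sketched in a few lines of plain Python. The sample rows, column names, and full dates below are made up for illustration; only the rule (keep the row with the latest enrollment date per customer ID) comes from the requirement above.

```python
# Hypothetical sample rows mirroring the duplicate.json file shown in the video.
rows = [
    {"customer_id": 2, "first_name": "John", "last_name": "Doe",
     "city": "Pune", "enrollment_date": "1992-05-01"},
    {"customer_id": 2, "first_name": "John", "last_name": "Doe",
     "city": "Pune", "enrollment_date": "2006-03-15"},
    {"customer_id": 9, "first_name": "Amy", "last_name": "Lee",
     "city": "Delhi", "enrollment_date": "1999-07-20"},
    {"customer_id": 9, "first_name": "Amy", "last_name": "Lee",
     "city": "Delhi", "enrollment_date": "1998-01-02"},
]

# Keep, for each customer_id, the row with the latest enrollment date.
# ISO-formatted date strings compare correctly as plain strings.
deduped = {}
for row in rows:
    cid = row["customer_id"]
    if cid not in deduped or row["enrollment_date"] > deduped[cid]["enrollment_date"]:
        deduped[cid] = row
```

After this loop, `deduped` holds one row per customer: the 2006 row for ID 2 and the 1999 row for ID 9. The data flow below achieves the same result at scale with a Sort plus an Aggregate.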
So let me click on these three dots and choose New data flow. We can call this data flow "remove duplicate". Now we need to add the source, so let me click on Add Source. We don't have any dataset yet, so let me create a new dataset pointing to the JSON file on Azure Blob Storage. Select JSON as the format and click OK. Let me call this dataset customer json. For the linked service we have already created SSU testing, so we can reuse it. Next we select the file path: browse into the input folder and pick the duplicate.json file, which is our customer file. Click OK, and OK again, and the dataset is created.

Now, in the source options, if we scroll down we can see the JSON settings. Since the source file contains multiple documents, we need to select "Array of documents" as the document form. Let me go to the data preview and refresh, and we can now see the data in table format: 16 rows in total.

First we need to sort on customer ID and on enrollment date in descending order; after that, wherever there is duplicate data, we will take the first row. For the sorting we use the Sort transformation, so let me add it here (and let me shrink the view a little). In the Sort transformation we sort first on customer ID in ascending order, which is fine as-is, and second on enrollment date. Note, however, that enrollment date is currently not an actual date but a string value; for now, let me just set this sort to descending.
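The mixed-direction sort (customer ID ascending, enrollment date descending) can be sketched with Python's stable sort, using two passes: sort by the secondary key first, then by the primary key. The sample rows here are hypothetical.

```python
# Hypothetical rows before sorting (dates as ISO strings, so string order = date order).
rows = [
    {"customer_id": 2, "enrollment_date": "1992-05-01"},
    {"customer_id": 2, "enrollment_date": "2006-03-15"},
    {"customer_id": 9, "enrollment_date": "1998-01-02"},
    {"customer_id": 9, "enrollment_date": "1999-07-20"},
]

# Pass 1: secondary key, enrollment_date descending.
rows_sorted = sorted(rows, key=lambda r: r["enrollment_date"], reverse=True)
# Pass 2: primary key, customer_id ascending. Python's sort is stable,
# so within each customer_id the descending date order is preserved.
rows_sorted = sorted(rows_sorted, key=lambda r: r["customer_id"])
```

After both passes, each customer's rows are grouped together with the latest enrollment date first, which is exactly the layout the next step relies on.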
Go back to the source, open the Projection tab, and override the schema: this column is really a date, so we need to change its type from string to date. Now go back to the data preview and refresh. This time, for each customer ID with duplicates, the first row is the one with the maximum enrollment date. Scrolling down to ID 9, 1999 is the maximum, so it comes first; scrolling further to the bottom, we can see the 2006 row coming first as well.

Next we need the Aggregate transformation, so let me add it. First we specify the group-by: we want to group on customer ID only, so at the customer ID level. Then, under Aggregates, we need to cover the remaining columns: first name, last name, city, and enrollment date. We can either add each of these columns explicitly and take its maximum, or use the second option, Add column pattern. Let me use the column pattern, since the first method is straightforward (just add all the columns and take the maximum of each) but repetitive. In the pattern, "Each column that matches" asks which columns should fall under this rule: we want every column except the customer ID. Here `name` stands for each incoming column name, so we exclude the customer ID by name.
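The group-by-and-take-first step can be sketched with `itertools.groupby`, which, like the Aggregate transformation here, assumes the input is already sorted by the grouping key. The rows below are hypothetical and already in the sorted order produced by the previous step.

```python
from itertools import groupby

# Hypothetical rows, already sorted by customer_id asc, enrollment_date desc.
rows_sorted = [
    {"customer_id": 2, "first_name": "John", "enrollment_date": "2006-03-15"},
    {"customer_id": 2, "first_name": "John", "enrollment_date": "1992-05-01"},
    {"customer_id": 9, "first_name": "Amy", "enrollment_date": "1999-07-20"},
    {"customer_id": 9, "first_name": "Amy", "enrollment_date": "1998-01-02"},
]

# Group consecutive rows by customer_id and keep only the first row of each
# group -- i.e. the one with the latest enrollment date.
distinct = [next(group)
            for _, group in groupby(rows_sorted, key=lambda r: r["customer_id"])]
```

This mirrors what the Aggregate transformation does: one output row per customer ID, carrying the values from the first (latest-dated) row of each group.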
So the condition says: do not include the customer ID column in this pattern; we want all the other columns. Let me click Save and finish. Now every column except the customer ID is covered. Next we need to specify the column name and the expression. We could spell out each column one by one, or we can use the `$$` symbol, which stands for every matched column. Let me open the expression builder. For the column name expression, select `$$`, so each output column keeps its input name. For the aggregate itself we can use either the `first` or the `last` function; since we have already sorted the data, `first` returns the first row of each group, which is the one we want. So the expression is `first($$)`: for each matched input column, take its value from the first row of the group. Let me save this and click Save and finish.

This gives us the distinct rows as output. Let me go to the data preview and refresh. First of all, we now get only 13 rows from the original 16, so we have the distinct rows: here the enrollment date is 2006, similarly for ID 13 it is 2006, and for ID 9 it is 1999. So we have the distinct rows as expected.

Now we need to load this into the destination, which is SQL Server. We can either create a new dataset or use an inline source; I am going to create a new dataset for Azure SQL. So let me select Azure SQL Database and click Continue.
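Putting the pieces together, the Aggregate configuration described above amounts to roughly the following (syntax as in Azure Data Factory mapping data flow column patterns; the column name `customer_id` is an assumption based on the file shown):

```
Group by:      customer_id
Matches:       name != 'customer_id'    -- every column except the group key
Column name:   $$                       -- keep each matched column's original name
Expression:    first($$)                -- value from the first (latest-sorted) row
```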
The linked service is already created. Here we need to select the table name, which is customer new. Everything looks okay, so click OK, and it completes. Now we can save all of this, add a new pipeline, and execute the data flow from that pipeline: just drag and drop the Data flow activity. Let me run it; once it finishes, the distinct rows should be inserted into the table. It executed successfully. Let me go into SQL Server. Before the run the table had no records; now, querying it, we get all the data, deduplicated. Thank you so much for watching this video. If you liked it, please subscribe to our channel for many more videos.