Hello, welcome to SSUnitex. This is a continuation of the PySpark interview questions and answers series. Today we are going to see how to load error records into an error file while reading data from the source. This question can be asked in different ways as well: while reading data from a file, the file contains corrupt records along with the correct ones, and we want to route all those corrupt records into a separate file while the correct records are processed further. So how do we deal with this? We have the columnNameOfCorruptRecord option, which works together with the permissive mode. Before going forward, if you haven't watched the previous video of this series, I would strongly recommend watching it, because this one is a continuation of that.

So what is the permissive mode? While reading data in PySpark, we have three modes: first is permissive, second is dropMalformed, and third is failFast. By default, the mode is permissive. In permissive mode, if a row contains incorrect values, those values are replaced by null. With the second mode, dropMalformed, any malformed row is ignored while the data is being read. And with the third mode, failFast, the read fails as soon as a malformed row is encountered while reading from the source.

Since we are in permissive mode by default, we can use the columnNameOfCorruptRecord option. What does it do? It allows renaming the new field holding the malformed string that permissive mode creates. What does that mean? Let me quickly go to the browser and try it out in practice.

Here I have a CSV file with the columns id, name, age, and department name. The second and fourth rows have incorrect data in the id column, because id is not a string column. So while reading the data from this file, we want one column that indicates which rows are causing the problem. In the Apache Spark documentation we can see this option, columnNameOfCorruptRecord: it lets us add a new column, and that new column will hold the corrupt records.

So first I am going to declare the schema. We have all four columns, id, name, age, and department name, in our source, and I am adding one more column, error rows, to this schema. Let me run this; the schema has been created. Now let me read the data from that file. We can use spark.read and then specify the options. The first option is header: since the file contains a header, we mark the header value as true. The second option is the mode, which we set to permissive. Because we are loading data from a CSV file, we call csv and pass the path, which is under the mount point, in the input folder, in the employee.csv file. The last option is columnNameOfCorruptRecord, and its value is the column that should receive the corrupt record values, so we specify the error rows column here.
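Putting that read step together, here is a minimal sketch. The column names (error_rows, department_name), the field types, and the mount-point path are assumptions based on what is described in the video, not exact values from it:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.getOrCreate()  # in a Databricks notebook, `spark` already exists

# Four source columns plus one extra StringType column that
# PERMISSIVE mode fills with the raw malformed line.
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("department_name", StringType(), True),
    StructField("error_rows", StringType(), True),  # receives the corrupt records
])

df = (
    spark.read
    .option("header", "true")                           # file has a header row
    .option("mode", "PERMISSIVE")                       # the default, made explicit
    .option("columnNameOfCorruptRecord", "error_rows")  # route malformed lines here
    .schema(schema)
    .csv("/mnt/input/employee.csv")                     # hypothetical mount-point path
)
```

One detail worth noting: the corrupt record column must be declared as a StringType field in the schema, because Spark stores the entire raw line of a malformed record there.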
Let me put this into a data frame and display it. First, let me execute it without the schema option. This time the error rows column is not there, and the output simply shows all the values as they are in the file. Now let me use the schema that we created, passing it in through the schema option, and execute again. This time we can see the id values are null for the bad rows, and additionally we have the error rows column. If error rows is null, we can say it is a correct record; if it has a non-null value, we can say it is a corrupt record.

So let me filter this data frame. We can simply call filter on it and check whether the value of the error rows column is null. Let me display that and execute it: it shows all the correct records, three records in total, while two records are not correct. To get the corrupt ones instead, we use isNotNull in place of isNull; let me execute that, and we see the two records that are not null. We can simply put those into another data frame, and that data frame can be loaded into an error file, so all the error records end up in the error file. And if we use the null filter, those are the correct records: we put them into a data frame and process that data frame further. A consolidated sketch of this split follows below.

I hope, guys, you have understood how we can separate correct and corrupt records, and how we can add a new column on which we can filter to differentiate between them. Thank you so much for watching this video. If you liked it, please subscribe to our channel to get many more videos. Support us, and see you in the next video.
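Continuing from the df read in the earlier sketch, a minimal version of that split-and-write step might look like this; the output path is hypothetical. One real caveat: since Spark 2.3, queries on the raw file that reference only the internal corrupt record column are disallowed, and the documented workaround is to cache the parsed result first:

```python
# df comes from the earlier read sketch.
df.cache()  # caching avoids Spark's restriction on queries that reference
            # only the corrupt record column of a raw CSV/JSON read

# Corrupt records: error_rows holds the raw malformed line.
error_df = df.filter(df.error_rows.isNotNull())

# Correct records: error_rows is null; these go on for further processing.
clean_df = df.filter(df.error_rows.isNull())

# Load all the error records into a separate error file
# (hypothetical output location).
(
    error_df
    .write.mode("overwrite")
    .option("header", "true")
    .csv("/mnt/output/employee_errors")
)
```

Dropping the helper column from the clean side before downstream processing, e.g. clean_df.drop("error_rows"), keeps it from leaking into the processed output.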