Hello, welcome to SSUnitex Social Decide, and this is a continuation of the PySpark tutorial. In this video we are going to look at the split function in PySpark.

So what is the split function? PySpark SQL provides the split function, and it is used to convert a delimiter-separated string column into an array column on a data frame. It splits a string column on a delimiter such as a pipe, a comma, a space, or anything else. So assume we have a column with values like 1,2,3. The split function turns that string into an array of three values, and from that array we can pull each value into its own column: the first column will have 1, the second will have 2, and the third will have 3. According to our need, we can then select whichever of these columns we want.

Let me quickly go to the browser and we will see this in practice. Here I am creating a data frame with four columns: first name, middle name, last name, and date of birth. Let me execute this. In the output we can see the four columns. Now the requirement is to split the date of birth into three different columns: the first will be year, then month, then day.

So how can we do that? First we have to import the function. For that we can use from pyspark.sql.functions import, and I am going to import all the functions, so we can simply use the asterisk here.

We can achieve this in two ways: first by using withColumn, and second by using select. We are going to see both. Let's start with withColumn. I am going to create a new data frame, df1, and here we can use df and then the withColumn function, followed by the brackets. This function takes two parameters.
The first parameter is the name of the new column, so that will be year. The second parameter is the value of this column, which we will get from the date of birth by using the split function. In split, the first parameter is the string column and the second parameter is the delimiter. So here we pass the date of birth column from this data frame, and in our case the delimiter is the hyphen, so we specify that.

Now you can see that the result has three elements in total: the first is year, the second is month, and the third is day. First we want only the first element. To get it from the split result we can use getItem, and in getItem we specify the index as 0, so it returns only the year part. Let me use the display command here, and inside display we pass df1. Let me execute it and we will see the output. We can see year here: the first four columns come from the original data frame, and this additional column is the one we have added.

Now, to add the month, we can use withColumn again, and this time we go with the index 1, which picks the month. Let me execute and we will see the output. Okay, here we have to add a backslash because we are continuing the statement on the next line. Now we can see month is coming. In the same way we can add one more with the getItem index as 2, which returns the day, so let me name this column day. Now we see three additional columns: year, month, and day.

So that is how we can use this function. If you think the split expression is a little lengthy to repeat in multiple places, we can write it once and assign it to a variable. I am going to assign it to a variable called split.
With the expression stored in this split variable, we can use the variable instead of writing the complete expression each time. Both will work. Let me replace all of these and execute, and we see the same output. So this is the first way to achieve it.

The second option is to use the select statement. So how can we do that? Let me assign to df1 again and use df.select, and inside select we specify only the required columns. I am going to select the first name, then the middle name, then the last name. I am not going to select the date of birth. Instead, we can use the split here. We already stored the split in a variable, as you can see, so I am simply going to use that split variable, then a dot, then getItem. Which element do we want? We want the first one, at index 0, and we have to specify an alias for it, so this alias will be year. Let me display df1. This time it adds one more column, which is year, and we can see that now.

Similarly, let me copy this split expression and add it two more times. This one is for month, and here we use 1; this one is for day, and it will be 2 for getItem. Let me execute and we will see the output. We can see year, month, and day in three different columns.

So these are the two ways to achieve this. Thank you so much for watching this video. If you liked it, please subscribe to our channel to get many more videos. See you in the next video.