The next topic is how to perform operations on columns. One of the very nice features of a pandas DataFrame, and something that will feel natural if you are used to R, is that you can very easily apply an operation to all elements of a column. In this first section we will look at arithmetic and logical operations. I have loaded my DataFrame and, for instance, if I want to increase the age of every passenger by one year, I don't have to write a loop that goes over each element and adds one. I can simply take the column and write plus one, and pandas automatically understands that I want to apply this operation to every element of the age column. So now the age is increased by one. You can apply any arithmetic operator in this way: adding, subtracting, dividing, and so on. Here I'm subtracting one again. You can also use the Python shortcuts, if you're familiar with them: minus equals one means take the value and subtract one from it. This also works with DataFrames and with columns of a DataFrame, so here it subtracts one from the age column, and we are back to the values we had at the beginning. We can also do operations on multiple columns. Here we have a different data set; let me show it. This is a data set from the Swiss census of 1880. The rows are towns in Switzerland, and the columns are different attributes of these towns: the town name, how many inhabitants live in the town, how many are Swiss, how many are men, how many are women, and so on. Here I take just a subset of this DataFrame: the total number of people in each town and the number of men.
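The column arithmetic described above (adding one to every age, then subtracting it again) can be sketched as follows. This is a minimal illustration on toy data; the column name "Age" and the values are assumptions, not the lecture's actual notebook.

```python
import pandas as pd

# Toy stand-in for the passenger table; the column name "Age" is an assumption.
df = pd.DataFrame({"Age": [22.0, 38.0, 4.0]})

# The operation is applied to every element; the result is a new Series
# and df itself is not modified.
older = df["Age"] + 1

# Augmented assignment modifies the column itself.
df["Age"] += 1   # every age increased by one
df["Age"] -= 1   # back to the original values

print(older.tolist())
print(df["Age"].tolist())
```

No explicit loop is needed in either form; pandas broadcasts the scalar to every row of the column.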
Now let's say I want to compute the fraction of men in each town. I can simply take the male column and divide it by the total column. When I do this, pandas understands that, for each row, I want to divide the value in the male column by the value in the total column. If I want to add this information as a new column, remember that I can use the square bracket syntax to refer to a column that does not exist yet and assign it the result of dividing the male column by the total. Because this column does not exist yet, pandas creates it, and I have a new column with my values. We have also already seen that taking a single column of a DataFrame returns a pandas Series, and it's the same when I apply an operation to a column: what I get back is a pandas Series. For instance, here I add two columns together, the male and female columns, and check whether the sum is equal to the total column. Let's say I have just received this data set and I want to do a very basic integrity check to see if these three columns make sense: does the sum of men and women equal the total number of people in the town, or was some error introduced in the data set at some point? So you see I can apply arithmetic operations, but I can also apply logical operations, and in fact we already did this earlier when we created masks to filter DataFrames. Here I get a pandas Series, and therefore I can apply any Series method to the output of such an operation.
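As a rough sketch of the two steps just described (creating a new column from a division, and comparing a sum of columns against a third column), assuming hypothetical column names "Total", "Male" and "Female" and invented numbers rather than the real census data:

```python
import pandas as pd

# Invented numbers, not the real 1880 census; column names are assumptions.
census = pd.DataFrame({
    "Town": ["A", "B", "C"],
    "Total": [100, 250, 80],
    "Male": [48, 130, 40],
    "Female": [52, 120, 41],   # last row is deliberately inconsistent
})

# Row-by-row division of two columns; assigning to a name that does not
# exist yet creates the new column.
census["FractionMale"] = census["Male"] / census["Total"]

# Arithmetic followed by a comparison gives a Boolean Series, one value per row.
consistent = (census["Male"] + census["Female"]) == census["Total"]

print(census)
print(consistent)
```

Both results are ordinary pandas Series, so any Series method can be chained onto them.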
For instance, this comparison returns a Series of True and False values, and if I want to see how many are True and how many are False, I can use the value_counts method. Here I see that I have two rows in my DataFrame where there is apparently a problem. If I want to see exactly which rows are causing the problem, I can use the True/False output of this computation as a mask and, as we did before, filter my DataFrame through the mask. Now I can easily find the two rows of my DataFrame where the counts of men and women don't match the total. Notice that I use a tilde in front of the mask. If you have a vector or Series of Boolean values and you put a tilde in front, it inverts the values: it changes True into False and False into True. Here I use it to take the rows for which the output is False instead of True. We didn't explicitly see this so far, but when you do a selection you can also perform an assignment on that selection, which means the value is assigned only to the rows that match the condition you selected on. Here, let's say I want to set the fare of the passengers in third class to NA for some reason. I select the passengers in third class and, for the fare column, I assign NA, and you see that the fare for passengers in third class has been set to NA. Okay, let's take a few minutes to do micro-exercise 6.
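The counting, mask inversion and conditional assignment described above might look like the sketch below. The data are invented and the column names ("Total", "Male", "Female", "Pclass", "Fare") are assumptions; np.nan is used here as the NA value.

```python
import numpy as np
import pandas as pd

# Invented census-like rows; one row is inconsistent on purpose.
census = pd.DataFrame({
    "Total": [100, 250, 80],
    "Male": [48, 130, 40],
    "Female": [52, 120, 41],
})
consistent = (census["Male"] + census["Female"]) == census["Total"]

# How many rows pass or fail the check.
counts = consistent.value_counts()

# The tilde flips True and False, so this keeps only the failing rows.
problems = census[~consistent]

# Assignment on a selection: only the rows matching the condition are changed.
titanic = pd.DataFrame({"Pclass": [1, 3, 2, 3], "Fare": [71.0, 8.0, 13.0, 7.5]})
titanic.loc[titanic["Pclass"] == 3, "Fare"] = np.nan

print(counts)
print(problems)
print(titanic)
```

Using .loc with a row mask and a column name in a single call keeps the selection and the assignment in one step, which avoids assigning to a temporary copy.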
Again with the Titanic data set, we would like to select the passengers who are younger than 10 years and apply a discount to their fare, so basically modify the value in the fare column. I want to divide the fare by two, but only for people under the age of 10. The first thing is to select these rows, and we know how to do this with the loc indexer: we select the rows for which the age column is smaller than 10. If I do this, I should now have only people whose age is below 10. Okay, that seems to work. Next, for these people, I select the fare column, so now I have only one column and not my whole DataFrame, and I want to divide this column by two. You can write it out in full, the column equals the column divided by two, and this would work, but there is a shortcut: divide equals two, which means exactly the same thing. I'll display the data so that we see a few more kids, and if I compare before and after, I should see that for kids, for instance the third row here, the fare is now divided by two, and the same for the last two rows, where I also have half of what I had before. For all other passengers the value is unchanged, because I only applied the division to this subset of rows. All right, any questions? If not, let me continue, and we will now see a couple of built-in functions that you can apply to columns of a DataFrame, or also to entire DataFrames.
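One possible solution to the exercise, sketched on invented rows with assumed column names "Age" and "Fare":

```python
import pandas as pd

# Invented passengers; column names are assumptions.
titanic = pd.DataFrame({
    "Age": [35.0, 4.0, 8.0, 27.0],
    "Fare": [53.0, 16.0, 21.0, 8.0],
})

# Boolean mask selecting the rows of passengers younger than 10.
under_10 = titanic["Age"] < 10

# Halve the fare for those rows only; equivalent to
# titanic.loc[under_10, "Fare"] = titanic.loc[under_10, "Fare"] / 2
titanic.loc[under_10, "Fare"] /= 2

print(titanic)
```

Rows where the age is missing compare as False, so their fare is left untouched.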
Earlier, when we wanted to check the fraction of people who survived in the Titanic data set, I showed you that you could use the .mean method to compute the mean of an entire column. There are many other such functions available, and here we list just a few useful ones. You have count to count the number of non-NA values in a column or Series. Then you have sum, mean, max and min, which do exactly what you expect, the operation indicated in the name; .std for the standard deviation; .var for the variance; and round to round numbers to a given number of decimals. You also have operations on Booleans, such as .all to check if all values are True, or .any to check if the Series or column contains at least one True value. By default, when you call these methods on a DataFrame they are applied to each column, and if you want to apply them to each row, you have to add the argument axis equals one. Zero is the default value of the axis argument, which means apply the operation by column; to apply it by row, you change this to axis equals one. Let's see a couple of examples. First, I want to compute the mean and standard deviation of age in the data set, so I select the age column and apply the mean method to it, and I get the answer here. If I want the standard deviation, I simply apply the .std method. You can do this on one column, but you can also do it on an entire DataFrame. Here is a DataFrame with two columns, age and fare, and if I apply one of these functions to it, you see that it gives me the value per column. Computing them per row doesn't make much sense in this case, but if I had a DataFrame that was organized differently, I could simply use axis equals one.
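A small sketch of these built-in methods and of the axis argument, using an invented two-column table (the names "Age" and "Fare" are assumptions):

```python
import pandas as pd

df = pd.DataFrame({"Age": [20.0, 30.0, 40.0], "Fare": [10.0, 20.0, 60.0]})

# On a single column the result is one number.
mean_age = df["Age"].mean()
std_age = df["Age"].std()      # sample standard deviation (ddof=1 by default)

# On a DataFrame the default axis=0 gives one value per column...
col_means = df.mean()
# ...and axis=1 gives one value per row.
row_means = df.mean(axis=1)

# Boolean reductions on a mask.
any_cheap = (df["Fare"] < 15).any()
all_adult = (df["Age"] >= 18).all()

print(mean_age, std_age)
print(col_means)
print(row_means)
print(any_cheap, all_adult)
```

Note that .std in pandas computes the sample standard deviation by default, which differs slightly from the population version for small tables.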
Sometimes you don't want the value itself but the index of the row that holds this value. The typical example is with the minimum and maximum. In this use case I want to find out which passenger paid the highest ticket price in the data set, but I'm not interested only in the actual fare, so I don't want to use just the max method. Instead I can retrieve the index of the row where the fare is maximal, so that I can display the entire row and retrieve the other attributes of the passenger: the name, the class, and so on. For this I can use idxmax or idxmin. In this case I want the passenger who paid the most, so I select the fare column and call the idxmax method, which returns the index at which the fare column has its highest value. Then I use this output to select from my DataFrame, and you see that I retrieve the row of the person who paid the highest ticket price. So we've seen how to apply some of these built-in functions, and they cover all the basics: mean, median, minimum, maximum, standard deviation, and so on. But sometimes you want to apply a function that is a little more specific and that you write yourself. In other words, you want to apply a custom function. This is of course possible, and the way it works is that you use the map and applymap methods. If you want to apply a function to a Series or a single column of a DataFrame, you use map, and if you want to apply it to an entire DataFrame, you use applymap. Here is an example of how it works. First we need some sort of custom function, and that's what I'm defining in this cell. It creates a function that we call silly function.
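Going back to the idxmax lookup described above, a minimal sketch could look like this. The passengers are invented and the column names ("Name", "Pclass", "Fare") are assumptions.

```python
import pandas as pd

# Invented passengers; column names are assumptions.
titanic = pd.DataFrame({
    "Name": ["Passenger A", "Passenger B", "Passenger C"],
    "Pclass": [3, 1, 2],
    "Fare": [7.25, 512.0, 13.0],
})

# idxmax returns the index label of the row holding the maximum,
# not the maximum value itself.
top_label = titanic["Fare"].idxmax()

# The label can then be used with .loc to retrieve the whole row.
top_row = titanic.loc[top_label]

print(top_label)
print(top_row)
```

idxmin works the same way for the row holding the smallest value.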
What it does is simply take a numeric input, in fact an integer, and check whether the value is even or odd. If it's even it returns the string even, and if it's odd it returns the string odd. Let's just test it. I apply it to the numbers from zero to four, and you see I get even when the number is even and odd when it's odd. Now let's say I would like to apply this function to all elements of the age column of the Titanic DataFrame. The way to do this is to select the column I want to apply the function to, here age, and then call its map method, where the argument you pass to map is the function you want to apply. If I try this, you see that it has taken every element of the age column and checked whether the age was even or odd. The important thing to be aware of here is that the function you pass to map must take exactly one input argument. The function is applied to each element of the column, and each element is a single value, so it's logical that the function must accept exactly one argument; if it's a function that needs two arguments, it will not work. Now, in the case where I would like to apply this function to an entire DataFrame, here I created a DataFrame with three columns, age, fare and passenger class. Then I cannot use map; I need to use applymap, but it works exactly the same way: you call applymap and pass the custom function you want to apply. You see the function has been applied to every cell in the DataFrame. So, map for a single column or pandas Series, and applymap for an entire DataFrame.
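The element-wise pattern described above can be sketched as follows. The body of silly_function is reconstructed from the spoken description, and the data and column names are assumptions. One caveat worth flagging: recent pandas releases (2.1 and later, as far as I know) rename DataFrame.applymap to DataFrame.map, so the sketch picks whichever one is available.

```python
import pandas as pd

def silly_function(x):
    """Return the string "even" or "odd" for an integer-valued input."""
    return "even" if x % 2 == 0 else "odd"

# Quick check on the numbers 0 to 4.
print([silly_function(n) for n in range(5)])

# Invented table; column names are assumptions.
df = pd.DataFrame({"Age": [22, 35, 4], "Pclass": [3, 1, 2]})

# Series.map: the function receives one element at a time.
age_parity = df["Age"].map(silly_function)

# Whole DataFrame, cell by cell. Newer pandas calls this DataFrame.map,
# older pandas calls it DataFrame.applymap.
cellwise = df.map if hasattr(pd.DataFrame, "map") else df.applymap
all_parity = cellwise(silly_function)

print(age_parity)
print(all_parity)
```

In both cases the function passed in must accept exactly one argument, because it is called once per element.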
If you want to apply a custom function to an entire column, not element-wise but with the whole column as the input of your function, then you have to use the .apply method. It works the same way as the map method, but the difference is that now you have to give it a function that accepts a sequence of values, and that sequence will be the content of the column. Here I have an example. I create a custom function called sum of squares, which takes a sequence of values as input and simply computes the sum of the squared values. Now, if I want to apply this function to each of the three columns of my DataFrame, I call apply and pass the function sum of squares. Here I'm actually passing one more argument, axis equals zero, which is the default. You see that what I get as output is the function applied to each of the columns. So that's the difference between apply and applymap: applymap applies the function to each element of the DataFrame, whereas apply applies it column-wise. So applymap takes a function that accepts a single value, whereas with apply the function must accept a sequence of values. If, instead of computing the sum of squares per column, I wanted to compute it per row, I would simply change the axis argument to axis equals one. By default things are computed on entire columns; to compute them on rows I have to explicitly add axis equals one. Now for each row I get the sum of the squared values of age, fare and passenger class.
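A minimal sketch of apply with a function that takes a whole column (or row) as input. The implementation of sum_of_squares is reconstructed from the spoken description, and the table is invented.

```python
import pandas as pd

def sum_of_squares(values):
    """Take a sequence of numbers and return the sum of their squares."""
    return sum(v ** 2 for v in values)

# Invented table; column names are assumptions.
df = pd.DataFrame({
    "Age": [1.0, 2.0],
    "Fare": [3.0, 4.0],
    "Pclass": [1.0, 3.0],
})

# axis=0 (the default): the function is called once per column.
per_column = df.apply(sum_of_squares, axis=0)

# axis=1: the function is called once per row.
per_row = df.apply(sum_of_squares, axis=1)

print(per_column)
print(per_row)
```

The result has one entry per column in the first case and one entry per row in the second.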
This small section is just to draw your attention to the fact that, to apply a certain function to all the elements of a column, or even to all elements of a DataFrame, you could of course say: I will just write a loop that goes through my items one by one and applies the function to each of them. This is possible and it will work, but the problem is that it will be very slow. If the DataFrame is small, maybe it doesn't matter, but if the DataFrame is large, it can make a big difference. So the idea is that whenever possible you want to use what is called a vectorized operation, which means applying a function to an entire column or to the entire DataFrame at once. In practice, this means that whenever possible you want to use the map, apply or applymap methods, and you don't want to write loops. You want to avoid loops as much as possible, both for performance reasons and to make your code more concise. When you work with DataFrames you really have to get into the habit of thinking in this vectorized way, applying functions and methods to all elements of a column at once with map or applymap. That's the message we want to give you here. I think if you do this type of analysis in R, you are also familiar with this problem, because R is even slower than Python and writing loops there takes forever, so you probably also use vectorized functions. In Python it's the same: whenever possible, use map, apply or applymap to do computations on your tables, and don't use loops. I think this is the last micro-exercise for this notebook.
If you remember, in the Titanic data set the last column, embarked, contains the abbreviated value of the port of embarkation, the city where people boarded the Titanic: S for Southampton, C for Cherbourg and Q for Queenstown. What we want to do is create a new column that contains the expanded value of the port of embarkation. Here I give you a function that, when you pass it the first letter, the abbreviated value of the port of embarkation, expands it to the full name, and what I want you to do is apply this function to every element of the embarked column and create a new column. Let's take five minutes to do this, and then we will correct it together. So, we've just seen that one way to apply a custom function to a column of a DataFrame is to use map. You need to call this method on a single column of the DataFrame, and the column we want to work on is embarked, so I can write df.embarked.map and give it the function that I want to apply. If I run it, you see that I get back a column, a Series, that contains the expanded values. For each element of the column, pandas applied the function and got its response, which in this case simply expands the abbreviation of the city to its full name. Now, if I want to store this in a new column of the DataFrame, we know that's very easy to do: we use the square bracket notation, create a new column, and assign it the output of map. If I look at my data set, I should now have a new column with all the expanded values. And that's it: with just this one line I can apply the function and create a new column in my data set.
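A possible solution to this exercise is sketched below. The function provided in the lecture is not shown in the transcript, so expand_port is a hypothetical stand-in written for this example, and the column names "Embarked" and "EmbarkedFull" are assumptions as well.

```python
import pandas as pd

def expand_port(code):
    """Hypothetical stand-in for the provided function: expand a port code."""
    ports = {"S": "Southampton", "C": "Cherbourg", "Q": "Queenstown"}
    # Unknown codes are returned unchanged.
    return ports.get(code, code)

# Invented rows; the real data set has many more columns.
df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# map calls the function once per element; assigning the result to a new
# name in square brackets creates the column.
df["EmbarkedFull"] = df["Embarked"].map(expand_port)

print(df)
```

Series.map also accepts a dictionary directly, which would be an alternative for a simple lookup like this one.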