Now we've seen how to access a single column of a data frame, but very often you don't want just one column: you want a combination of rows and columns. In other words, you want to take a subset of your data frame, for instance to filter it on some condition or simply to select a number of rows and columns.

If you're familiar with R, you know that you can simply pass rows and columns in square brackets. With Python and pandas you cannot do this directly; we have to use what we call an indexer. There are two of these indexers, and we will mostly work with the first one, which is .loc. It's not really a function, so you don't put regular brackets after it; you put square brackets. This syntax is specific to pandas.

The way it works is that I write the name of my data frame, then .loc, and then I give the rows I want to select (more precisely, the index values of those rows, the row names if you want) and then the names of the columns I want to select. Here I'm selecting the row with index 2 and the column Name. Let's load the data and look at it first. We should in principle select the third row of the table, because the third row has index 2: the index here is the default index, and it starts at zero. So I take the row with index 2 and the column Name, and indeed this is the name of Miss Lavish here.

So here I'm accessing a single value of the data frame by specifying the index and the column name that I want to access. And again, don't forget to put the .loc, because if you try to do it directly like this, as you could in R for instance, it will not work. You really have to use .loc to access a specific row and column.

Now, if I want to select not just a single row or
single column but several of them, I can simply pass a sequence of values as either rows or column names. There are a couple of ways to do this. I can explicitly pass a list of index values and column names, like that. Or, if I want to select a whole range of values, say all the values between index 0 and 10, I can use the slicing notation with a colon: the start index, the colon, and the end index. The same works with column names: I can take everything from the column Name until the column Age. If I want to select everything from a certain value to the end, I put the value, the colon, and nothing after. If I start with the colon and then the value, I take everything from the start up to that value.

Let's see an example. Here I want to select the first three rows and three columns. I give a slice from 0 to 2, so index 0, 1, 2. The second line achieves the same result: because I start from 0, it's the same as starting from the beginning, so I can just put a colon and 2. And there is one more way to do this. Sorry, I'm actually not selecting the first three columns here, as I said before; but for three adjacent columns I could simply say Name, then a colon, then Age. That selects a slice of columns, from Name until Age.

One thing that is really important to notice here is that when you slice with .loc, the end value is also included in the slice. If I say from 0 to 2, or up to 2, then the row with index value 2, the third row in this case, is also included in the selection. The same goes for the columns: the Age column is also included. If you are familiar with Python, you will know that this is not standard in Python
when we're slicing, because the default behavior of slicing in Python is that the end index is always excluded. So the .loc indexer is a bit unusual, at least for people working in Python, in that the end position of a slice is included. And if I use the other indexer, .iloc, this is not the case: the end position is not included. So with .loc you have to always remember that the end of the slice is included.

Here we have another example, to show that you can also use this technique of subsetting a data frame with the .loc indexer to reorder columns. If you pass the name of a column several times, the column is selected several times, so I can duplicate the Age column this way if I want. I can also reorder the columns as I see fit: here I put Age and Pclass in front of the Name column. I could do the same thing for rows; I could reorder rows with this same technique.

Here is another example, where I select the last five rows, from 886 to the end, and all the columns from the beginning until the column Age.

And this one is just to show that when we select a single column of a data frame, as I mentioned earlier, this returns a so-called pandas Series object. A Series is simply a vector of values. The difference from a regular vector in NumPy or a list in Python is that it is a named vector: each value in the Series also has a name. If I take a single row, as in this case, the names are set to the column names.

Now here I have a section to show you a couple of common problems that you might encounter when you use .loc. Actually, in all the examples that I showed so far,
what happened is that the index values were the default index values, so numeric values starting from zero. This can sometimes create a small confusion. Let's say I now have a data frame where the index is not the default numbers but some other values; in this case, passenger names. Let's try to run a selection like we did before: I want to select the second, third, fourth and fifth rows, so 1 to 4. Before, this worked, but you see that if I try it now, it's not working. Does anyone have an idea why it used to work and now doesn't?

The reason is that with the .loc indexer we are selecting on index values, not on positions. Before, the index values happened to match exactly the positions of the rows, because that's the default behavior in pandas. But with the way I loaded the data here, that's no longer the case: the index values are something completely different from the row positions. They are the names of the passengers. So when I write this, pandas tries to find the row that has an index value equal to 1, but there is no such value in the index. That's why I get this error.

Instead, what I have to do in this case is give proper values for the index, which are the names of the passengers. If I want to select these four rows, I have to say from Mr. Bejoo here until Mr. Duly. If I do this, it's okay: the row selection is working properly. So always remember that if you don't have numerical values as index, you have to be careful when you use the .loc indexer.

Of course, it would be possible, for instance,
as I show here, to query the index for the positions that I want. But if I really want to select by position, and the index does not match the position, it's easier to use the second type of indexer, which we have not seen yet. Basically, this .iloc indexer works more or less exactly the same as the .loc indexer, except that it only accepts positions instead of index values or column names. So if you really need to index on positions, you can use .iloc instead of .loc.

All right, a small quiz. Let's say I have the same data frame as before; I'm loading it and showing it again. Now I want to select the last five rows, and since I don't know exactly how many rows there are, I would like to use the shortcut where I say everything from position minus five until the end. But if I try this, it doesn't work: you see that it returns the entire data set. Yet my index here consists of actual numbers. So do you have an idea what the problem is, or why this does not work?

The problem here (again, you can click here if you want to see the answer) is that .loc really selects on index values. It treats these values as labels, a bit like strings, not as positions, even if the index is made of numeric values. So when I do this slicing with minus five, pandas, or rather the .loc indexer, looks for a row whose index is minus five. There is no row indexed minus five in my data frame; every index value is greater than minus five, so the slice starts before the first row and I get everything back. That is the reason it doesn't work. If
I were using the .iloc indexer, which works on positions, this would return the expected value, but with the .loc indexer it doesn't work.

All right, so this brings us to the .iloc indexer. In this course we choose not to discuss it too much, because in general the .loc indexer is a lot more useful. But basically it's exactly the same as the .loc indexer, except that you pass positions instead of row and column names.

All right, time for a micro exercise where you will try to do a selection with the .loc indexer. As usual, I give you five minutes or so to work on it together.

All right. First, let's just focus on selecting the correct rows. With the .loc indexer we've seen that we can select rows and columns. Of course, I could say I want all the odd rows, so 1, 3, 5, 7 and so on, and here I'm selecting them. But there are 891 rows, so I'm not going to type the whole selection by hand; that would be a crazy thing to do. Instead I need to think of a way to generate this sequence of values automatically. One way to do this (I forgot if I put it in the hint; yes, it's in the hint) is the range function, to which you can give a start value, an end value and also a step.

Let's say I want to start from zero and go until the last position in the data frame, which in this case corresponds to the number of rows. If you remember, we've seen the shape attribute, which returns the number of rows and columns as a tuple. To get the number of rows, I take the first element of the tuple, which is at position zero. Let's convert this to a list, and I will subset it so that we only show the first 20 numbers and not the whole list. All right, so now I have my list of numbers. Since I want to take only the odd rows, I will actually start from 1, and I will put a step of 2 to keep only
every second value. So now I have a proper list of the rows I want to take. I can copy this and put it in here, and now you see I get the odd rows: 1, 3, 5, 7, 9 and so on. Actually, I don't even need to convert it to a list; I think I can pass a range directly. Exactly, so that's fine. Let's just call head on this to make it more compact.

That was the first part. For the second part, I want to select the columns Name, Age and Fare, and then reorder them so that Age is first and Name is second. We can pass a list of column names here, like that, and as we've seen we can pass them in any order we wish. In this case we wanted Age first, so we simply move Age to the front. Now I have all the odd rows, I've kept only the three columns I wanted, and I've put them in the order I wanted.

All right, do you have any questions about this micro exercise, or more generally about selecting rows and columns with the .loc indexer? If that's not the case, we can move on.

The next thing we want to look at: often, when you select rows, you want to select on a given condition. For instance, in the Titanic data frame, maybe I want to select only women to carry out some specific analysis, or only kids, or people under 18, or people above six years old; this type of selection. So basically I take a subset of my data frame based on a given condition; this is often referred to as filtering a data set. This is possible with the .loc indexer, and the way it works is that you simply pass a condition as the row selection to the .loc indexer. Here we have an example, always with the same data set. I'm going to select all the passengers that have an age greater than 50, so
you see, what I do is that in the row selection I simply pass df.Age, so the Age column, is greater than 50, and now you see that I have selected all the rows where age is above 50.

One important thing to remember here is that this type of conditional selection only works with the .loc indexer. You cannot do this with .iloc: if you try to pass this type of condition as row selection to .iloc, it will not work. This is one of the big advantages of .loc compared to .iloc, and it is also why .loc is generally the more useful of the two indexers and why we focus on it in this course: it is able to do these conditional selections, which are very useful when you're analyzing data.

What actually happens behind the scenes is that when I pass this type of condition, what I'm really doing is creating a vector of boolean true or false values, actually a Series of boolean values. To demonstrate this, I create a mask variable by applying a boolean operator to the Sex column. I want to check whether the gender of the passenger is male or female: if it's male, the value will be true, and if it's not male, the value will be false. Now that I've created this mask variable, which is simply a Series of true and false values, I can use it inside the .loc indexer. You see that I have now selected all the male passengers, and I also selected three specific columns.

The point I wanted to make is that you can pass any sort of sequence of true/false values to select the rows. So when you do a complicated selection, you can create an intermediate variable where you store this vector of true and false values. If it's a very simple selection, like the one we saw in the beginning with age greater than 50, then you can directly put
the condition here; it will be converted on the fly to an array of true/false values, and then the selection will be made.

We've seen how to do a selection on a single condition, but often you want to combine multiple conditions. You can easily do this with the "and" and "or" logical operators. When you work with pandas data frames, you have to use the ampersand (&) and the vertical bar (|) for "and" and "or". As you should know, since you should be familiar with the basics of Python at this point, the boolean operators in Python are normally written as words: True and False, for instance, returns False, and the "or" operator is written or. But with pandas data frames you cannot use the words and and or; you have to use the ampersand and the vertical bar. The other important thing is that each condition has to be surrounded by brackets.

Here I want to select the passengers that are women and, amongst them, those who paid more than 200 (I don't know if it was dollars or pounds) for their ticket on the ship. I do it in two steps: I first create a variable called mask that contains a vector of true/false values, and then I pass this mask to the .loc indexer. I could very well have done it in a single line, like that, and I would get the same result. So now I should see only women, which seems to be the case, and all of them should have paid more than 200, let's say pounds, for their ticket. This is how I can combine conditions.

Here is another example where I use the "or" operator. Now I want everyone who is under 25 or who is a woman, so the only male passengers I should see are those younger than 25.

All right, so time for you to try this type of
conditional selection with a new micro exercise. Again, if there are any questions, please let us know; otherwise I give you five to ten minutes to work on this, and then we will correct it.

All right, I think we can go ahead with the correction. First, we want to select the passengers that are in first class and less than 18 years old, so basically the children in first class. I will do this in two steps: I create the mask with my selection, and then I apply the mask to the data frame.

I have two conditions, and both have to be true, so I use the "and" boolean operator. My first condition is that the passenger class, Pclass, must be equal to 1. My second condition is that the age of the passenger is smaller than 18. With this I expect to get a vector of true and false values. Two things to remember here. First, use the ampersand as the "and" operator: if I try this with the word and, it doesn't work, even though it's a valid Python keyword, so I have to use the ampersand. Second, do not forget the brackets. If I do something like this, it happens to work, but if I remove both, it no longer does; so the parentheses around each condition are necessary.

Now that I have the mask, I can apply it to my data frame. I write df.loc and pass the mask, and I wanted to select only two columns, Name and Fare. So I have now selected, in principle, the passengers that are in first class and younger than 18. If I really want to check this, it is useful to also add the Age and Pclass columns. Indeed, it's always first class and the age is always below 18, so that's perfect: I made the selection I wanted.

Now let's say I want to compute the
fraction of these passengers that survived. In the data set there is a column called Survived that contains ones and zeros indicating whether the passenger survived or not. With what we know about pandas and data frames at this point, the way to do this is a little bit cumbersome, but it's possible. I would create a new mask that I call survived, where I add the condition that df.Survived equals 1. With this second mask I'm selecting only the survivors. Then I take the selection of people who survived and divide by my entire selection. Each of these returns a data frame, and I don't want to divide one data frame by another; I want the numbers of rows. So I use the shape property and take the number of rows of the first one and of the second one, like that. This gives us 91% survived. And indeed, if we had the Survived column displayed here, we would see that everyone except one person survived, so 91% looks about right.

Of course, there is a much easier way to do this. We'll see it a little bit later, but basically we can apply a lot of standard functions to data frames and columns. Here we can simply take the Survived column, and because it's just zeros and ones, I can compute the mean on it, and I get the same value. So I take my data frame, I apply the mask to select only children in first class, I take only the Survived column, and then I get the mean value of this column with the mean method. Again, we will see this a bit later. At this point of knowledge, we can come up with a solution like I showed here, or
maybe we could actually take the sum of these two vectors; that would also work. But later we will see this easier solution. All right, do you have any questions? If that's not the case, let's continue.

We have about 10 minutes left before the break. There's a small section here that we put as supplementary material, but I will briefly explain it because it can sometimes be useful. It's the concept that when we subset a data frame, what can happen is that the value returned by pandas will be...

Sorry, I'll just answer the question: does the masking also work with .iloc? No. With .iloc it's not possible to do conditional selection, so masking will not work. With .iloc you really have to give a sequence of positions that you want to select; you cannot pass a boolean Series like our mask.

So, as I was saying, when we create a subset of a data frame, sometimes pandas returns what we call a view of the data frame, which is a kind of pointer to the original data frame, and sometimes it actually makes a copy of it. Here we have a couple of examples. Let's say I have my data frame and I change the age of all the male passengers to 99. But now let's say I create a subset, so I subset to select names, and I store it in a new variable. Now I try to set the values of the Age column to 888, and you see I get this warning from pandas. Basically, what it's telling me is that this variable does not contain an actual copy: when the subset was created, pandas didn't create a copy, it created what we call a view.

Here we have an illustration to show this. The idea of a view is that it's a pointer to the original data. The original data here is the yellow-orange box, df1, and df2, you see, is a subset, but basically it's still pointing to the
original data. In other cases, what pandas will do, or what we want it to do, is create an actual copy, so that the subset becomes independent data. The importance of this is that if I change something in the copy, it will not change the value in the original data frame, whereas with a view, if you change the values, it changes them both in the view and in the original data frame. So if you try to change values in a view, pandas will basically warn you: I don't want to do this, because it would also affect the original data frame.

In this case, what you have to do is explicitly make a copy. Here you see that I do the same selection, but at the end I add the copy method. This means: please create a copy of this. I assign it to this variable, and now the variable is pointing to a real copy of the data frame, a completely different object in memory. Now I can change the values in my subset, because it's its own copy of the data frame, and it will not affect the values in the original data frame.

What is sometimes a bit tricky is that it's not always very clear when pandas returns a view or a copy. So the rule is: if you create a subset and you really intend it to be a copy, it's best to always explicitly apply the copy method, so you're sure that pandas is giving you a copy and not a view of the data. Whenever you make a subset and you intend to modify it in some way, you should create a copy of it. Of course, if you just want to make a subset but don't intend to modify anything, then you should not make a copy, because as you can see here, creating a copy has a cost in terms of memory. So now I'm
using more memory. Okay, it's a very small data frame here, so it doesn't matter, but imagine it was much bigger: all this data has to be stored in the memory of your computer, and memory is a precious resource, so you don't want to waste it by making copies all the time. So make a copy when needed, but only when you really need it.
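
As a minimal sketch of this copy advice, assuming a small made-up data frame in place of the Titanic data (the passenger names and values below are invented for illustration), the subset is copied explicitly before it is modified:

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data frame; all values are invented.
df = pd.DataFrame({
    "Name": ["Passenger A", "Passenger B", "Passenger C", "Passenger D"],
    "Sex": ["male", "female", "male", "female"],
    "Age": [22.0, 38.0, 54.0, 4.0],
})

# Select the male passengers and explicitly ask for an independent copy.
males = df.loc[df["Sex"] == "male", ["Name", "Age"]].copy()

# Modifying the copy leaves the original data frame untouched.
males["Age"] = 888

print(males)
print(df)
```

Because the subset is its own object, the ages in df stay as they were; the price is the extra memory used by the copy.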
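
Going back to the difference between .loc and .iloc discussed earlier, here is a small sketch of label-based versus position-based slicing. It assumes an invented six-row data frame with a default index, not the course data, and the column order is only chosen to resemble it:

```python
import pandas as pd

# Hypothetical data; names and values are invented for illustration.
df = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Name": ["Passenger A", "Passenger B", "Passenger C",
             "Passenger D", "Passenger E", "Passenger F"],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, 54.0],
})

# .loc slices on labels and includes the end label:
# rows 0, 1, 2 and columns Name, Sex, Age.
by_label = df.loc[0:2, "Name":"Age"]

# .iloc slices on positions and excludes the end position:
# rows 0 and 1 only, columns at positions 1, 2, 3.
by_position = df.iloc[0:2, 1:4]

# With .loc, -5 is a label, not a position. No row has that label and
# every label is greater than -5, so the slice returns all rows.
all_rows = df.loc[-5:]

# With .iloc, -5 counts from the end, giving the last five rows.
last_five = df.iloc[-5:]

# With names as the index, .loc needs names, not numbers.
named = df.set_index("Name")
some_rows = named.loc["Passenger B":"Passenger D"]

print(by_label.shape, by_position.shape, len(all_rows), len(last_five))
```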
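
The first micro exercise from earlier (odd rows, then the columns reordered with Age first) could be sketched like this. The data frame is again a made-up stand-in with a default index, so only the technique carries over, not the values:

```python
import pandas as pd

# Hypothetical data; names and values are invented for illustration.
df = pd.DataFrame({
    "Name": ["Passenger A", "Passenger B", "Passenger C",
             "Passenger D", "Passenger E", "Passenger F"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, 54.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 51.86],
})

# df.shape is a (rows, columns) tuple, so df.shape[0] is the row count.
n_rows = df.shape[0]

# range(start, stop, step): start at 1 and keep every second index value.
odd_index = range(1, n_rows, 2)

# The range is passed as the row selection; the column list sets the order.
odd_rows = df.loc[odd_index, ["Age", "Name", "Fare"]]

print(odd_rows.head())
```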
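
For the conditional selections covered before, here is a short sketch of combining conditions with the ampersand and the vertical bar, and of the two ways of computing the surviving fraction. It assumes invented data, so the resulting fraction is not the 91% obtained on the real data set:

```python
import pandas as pd

# Hypothetical data; names and values are invented for illustration.
df = pd.DataFrame({
    "Name": ["Passenger A", "Passenger B", "Passenger C",
             "Passenger D", "Passenger E", "Passenger F"],
    "Sex": ["female", "male", "male", "female", "male", "female"],
    "Age": [4.0, 17.0, 10.0, 40.0, 30.0, 15.0],
    "Pclass": [1, 1, 3, 1, 2, 1],
    "Fare": [81.9, 110.9, 21.1, 227.5, 13.0, 211.3],
    "Survived": [1, 1, 0, 1, 0, 0],
})

# Each condition sits in its own parentheses; & means "and".
mask = (df["Pclass"] == 1) & (df["Age"] < 18)
children_first = df.loc[mask, ["Name", "Fare", "Age", "Pclass"]]

# Cumbersome way: count the rows of the two selections and divide.
survived = df["Survived"] == 1
fraction_by_count = df.loc[mask & survived].shape[0] / df.loc[mask].shape[0]

# Easier way: the mean of a column of zeros and ones is the fraction of ones.
fraction_by_mean = df.loc[mask, "Survived"].mean()

# | means "or": every woman, plus anyone younger than 25.
women_or_young = df.loc[(df["Sex"] == "female") | (df["Age"] < 25)]

print(children_first)
print(fraction_by_count, fraction_by_mean)
```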