Okay, let's get started with the pandas tutorial. This will be the third tutorial, and we'll have one more at the hack night venue, which is at 7.30. So, let's get this started. We'll be walking through pandas, and this will be fairly hands-on, in the sense that you can follow along with me and try out these examples.

There are a couple of ways in which you can do that. If you've got pandas installed on your machine, then just follow the steps as we go along. I've set up two instances of IPython notebook online as well. You can go to 175.41.189.15, port either 8000 or 8888. If you're creating a notebook, I suggest you create one with your name, so that you know which one is yours. It's 175.41.189.140. Or can I just use it locally? Yeah, please use it locally if you can. That's the best option.

Now, let's start with what pandas is. Pandas is a data analysis library, and this is going to be purely a data analysis session. There's no way we'll have time to cover both this and visualization using pandas. Visualization with pandas can be done, but you won't necessarily miss it. You've covered D3, you've covered R. Both of them are superior to pandas from a data visualization perspective. So we'll be covering this from purely an analysis perspective.

Why would you want to use pandas for data analysis? Because, at the moment, it appears to be the best package on Python, which in turn appears to be the best, or the most popular, programming language for data analysis. It is, in many respects, quite a good one as well. If you compare this with R, put it this way: there is no doubt that every single feature that pandas offers is also available in R, and there are enough features in R that are not available in pandas. So from a features perspective, R comes out ahead.
If you're a statistician, you'll love R and its documentation. It's just made for analyzing statistics. If you're a programmer, you may find it a lot tougher to read. R does have a steep learning curve, and that's been a challenge for a lot of people. Pandas, on the other hand, is simpler. Relatively speaking, that is. In absolute terms pandas is a bit tough, but between the two there's no doubt that pandas is the easier one. At least, if you have a programming background, it's a fairly easy experience.

So, first, on your local machine, if you want to start off with IPython, all you have to do is say ipython. Here you say ipython, space, notebook. IPython is an add-on package to Python that provides interactive capability. IPython notebook is a web-based console. So when I run ipython notebook, my browser starts and goes to the URL where I can create notebooks and start working with them. This is also what is running at the URL that I mentioned earlier, 175.41.189.140, on either port 8000 or 8888. I'm going to be running it on the local machine just in case the connection gets cut off. I don't want the session to be stopped, but feel free to either run it on your local machine or on the internet, whatever suits you.

Now, we'll be creating a new notebook. There's a link here, New Notebook. A notebook is a program, for all practical purposes. Out here, I can type in any Python command and it gets executed. So I can say one plus one and press not Enter but Shift-Enter, and that executes it. Enter will just add a new line, so you can type multiple lines. I can say print one plus one, print two plus two, et cetera, and it prints 2 and 4. You can type a sequence of commands. So, that's it in terms of setup. We have a couple of options in terms of what we do in this session.
We can either play around with cricket data or we can play around with elections data, and I leave the choice entirely to you. It's either the 2012 assembly elections data, or all of the Test cricket data that exists to date. Both data sets are publicly available, and I'll give you all the links. So, how many votes for elections? And cricket? Okay, cricket has the majority. I'll start with cricket.

For those of you who are on the server, there is a file that's already available. For those who are running it locally, go to files.gramlar.com/data, and there's a file called test-batting.csv.bz2. Incidentally, I'm Anand. Hello. I'm the chief data scientist of Gramlar, and that's one of our sites. If you can, unzip the bz2 file. If you can't, pandas can handle the bz2 file, so it's just fine. For the rest of you, I'll walk you through it.

I'm going to download this file, test-batting.csv.bz2. Actually, I know that I already have it, so let me not do this. Let's open this file and see what it contains. On IPython, I've got this file in the current directory. So if I say ls, it gives me a listing of all the files in the current directory. One of them is this test-batting.csv.

Now, here is the first command that you are going to try. It is import pandas as pd. The module pandas is imported with the name pd. The name pd just saves you some typing effort, and it is the convention. So when you run this, pandas is now imported. We are going to load this file, test-batting.csv, and I will pause until all of you are able to load it, and then we start playing around with it.
The way to do it is you say data is equal to (or any variable for that matter, but I am going to be using data) pd.read_csv of this file, and Shift-Enter. At this point data is loaded. If you have a bz2 file, add a comma and an option. The options are listed here in the tooltip. Somewhere you'll see compression, and you can say compression is equal to 'bz2'. If you do that, it will load the bz2 file. On the online version it is exactly the same command that I typed before, and the file is in your current folder. So if you run this, it ought to work. At least it should not report an error. If I type in a file name that does not exist, it will report an error like this.

How do we open the notebook? It is not working. You are on the local one? On the local one it is usually 127.0.0.1:8888. I am running a VM. So, still, it will be on 8888. Can you do port forwarding for that? Or just use the IP and port. Give it a shot; if it does not work, try the online version, 175.41.189.141.

Anyone has issues loading the file? Just say pd.read_csv of test-batting.csv. It has to be in the same directory to load? Yes, and if you are running on the server, I have put it in the same directory. If not, you can type in the full path. If you are on a Windows machine, use forward slashes. In my case it is on the D: drive, so the path starts with D:/.

For the encrypted file, what do you add? You mean compressed. If it is compressed, then say compression equals whatever the compression is. It supports gz and bz2. In this case it's bz2. Anyone not been able to load the file? Incidentally, it doesn't have to be on the local machine. It could be loaded from a URL directly. You can use urlopen and load it from there.
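The loading step above can be sketched as follows. This is a minimal, self-contained illustration rather than the session's actual notebook: the rows, the column names (`Player`, `Runs`, `Minutes`) and the file name `sample-batting.csv.bz2` are made up so that the example runs without the real data set.

```python
import bz2
import os
import tempfile

import pandas as pd

# Hypothetical miniature stand-in for the cricket file (made-up rows and columns).
csv_text = "Player,Runs,Minutes\nPlayer A,12,30\nPlayer B,0,-\n"

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sample-batting.csv.bz2")
with bz2.open(path, "wt") as handle:
    handle.write(csv_text)

# compression="bz2" tells read_csv to decompress the file while reading it.
data = pd.read_csv(path, compression="bz2")
print(data.shape)

# A file name that does not exist raises an error instead of returning data.
try:
    pd.read_csv(os.path.join(tmp_dir, "no-such-file.csv"))
    missing_file_failed = False
except OSError as error:
    missing_file_failed = True
    print("could not load:", error)
```

Forward slashes in a full path work on Windows as well, which is why they are the safer choice when typing a path by hand.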
In fact we do that for the next data set, the elections data set. We load it directly from the ECI website. Let's hope and pray that we don't bring the site down. The file is at files.gramlar.com/data. Maybe you can share those USB sticks and pass them around. That might be useful. Somebody who has already downloaded the file can put it there. Let's just carry on.

Let's take a quick poll. Those who have not managed to load the file yet? One, two, three, four, five, six. Hopefully you'll be able to catch up.

So now we've got this file. It's sitting in this variable, data. The question is, what can we do with it? Before we answer that, obviously you've got to know what is in that data. So let's see what happens when you print data. If I just say data, it says it's class pandas.core.frame.DataFrame.

Let's pause here for a minute. When you read a CSV file, it loads it as a class called DataFrame. A DataFrame is like a table. There are two other similar classes of importance in pandas. One is a Series, which is like a single column. The other is a Panel, which is like a three-dimensional thing. Think of it as multiple worksheets. I've never used Panels. Mostly we will stick with Series and DataFrames. When you load a CSV file, by default it puts it into a DataFrame because it has rows and columns. read_csv has a huge number of options, beginning with how many lines you want to skip at the beginning, and so on. Usually it does what you think it would want to do. It just loads the file. And that's the basic way of working with it. Basic, not very advanced.

It's got 83,369 entries, which represent each one of the innings ever played in Test cricket. That's how many innings there were until the day before yesterday.
And we have the player, the country, the minutes (how long they batted), the number of runs, balls faced, fours, sixes, strike rate, the innings number, the opposition (which country they were playing against), the ground and the start date. But this is not quite the same as seeing the data. Can you see the data?

Firstly, if you want to see what methods this supports, the DataFrame object supports a number of methods, with the dot as a namespace. So you say data, dot, press Tab in IPython, and you'll see a bunch of things. Firstly, you'll see some attributes: Country, Fours, Ground, Inns, Minutes, Opposition, etc. You'll notice that these are all the columns. Which means that I can type in any column here, data.Ground, and it will show me the values for the ground. In this case, it starts with Melbourne, stops in the middle, and then picks up again towards the end. This is data.Sixes: mostly zeroes, with a few ones and twos.

So first, you can get the column directly as an attribute. There are other ways of getting a column. You can treat this like a dictionary. You say open square bracket, close square bracket, and within that put the name of the column you want. So if I said data['Sixes'], I'll get exactly the same thing as data.Sixes. I'm going to make the font a little smaller so that you can see it. Now this obviously is helpful if your column title has spaces in it. Otherwise, there's no way of putting those in.

Apart from that, there are a number of other useful methods: align, all, append, and so on. It's a huge list. What we're going to do is go through a few of these today, and as we go along, I'll point out which ones are important and which ones you probably need to remember. The one that you need to remember, and will use a lot when you're playing around with the data, is head, which does what you might think.
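The two ways of getting a column described above can be sketched like this. The frame below is a made-up stand-in for the cricket data; the column names, including the hypothetical `Start Date` column with a space in it, are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up rows; the real data set has one row per Test innings.
data = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C", "Player D", "Player E", "Player F"],
    "Ground": ["Ground 1", "Ground 1", "Ground 2", "Ground 2", "Ground 3", "Ground 3"],
    "Sixes": [0, 1, 0, 2, 0, 0],
    "Start Date": ["1 Jan", "1 Jan", "5 Feb", "5 Feb", "9 Mar", "9 Mar"],
})

# Attribute access and dictionary-style access return the same column.
by_attribute = data.Sixes
by_key = data["Sixes"]
print(by_attribute.equals(by_key))

# A column name containing a space can only be reached with square brackets.
start_dates = data["Start Date"]

# head() returns the first five rows by default.
first_rows = data.head()
print(first_rows)
```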
head shows you the top portion of the data set. In this case, it shows something that is not too different from what you saw earlier, except that it covers just the first five entries. It's still not the same as seeing the data. The reason is that when the data gets too wide, pandas says: I'm not going to show it to you the normal way, I'm going to show it to you in this kind of column summary, so that you get an overview of what's there and don't get confused.

What we can do is say pd.set_option. Now, pd.set_option has a number of options internally. You can say display.chop_threshold is equal to something, display.colheader_justify is equal to something, et cetera. The ones that we want are line width and max columns. We want to tell pandas: look, it's okay, even if the line goes up to, say, 200 characters, show it to me. If the columns go beyond, say, 20 columns, that's still fine. Show it to me.

How do you get this tooltip? At almost any point, you can press Tab in pandas. So if I say pd, dot, and press Tab, you get to see everything that's under the pd namespace. I meant the tooltip that shows what arguments the function can take. Yeah, that comes when you go inside the function. In this case, I'm typing pd.set_option, open bracket. After the open bracket, I press Tab, and I get a tooltip. This tooltip is pretty versatile. I press Tab once, and it shows me the options at a summary level. Press Tab twice, and it expands a little bit more. Press Tab a third time, and it keeps the tooltip there for 10 seconds even after you start typing.

So the first option that we are going to set is display.line_width, and I'm going to set that to 200. I'm also going to set display.max_columns to 20. So display.line_width is 200, and max_columns is 20. That again needs the display, dot, in front. What are the default values? Sorry?
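A minimal sketch of the display settings discussed above. One assumption to flag: in recent pandas releases the width option is named `display.width`; the `display.line_width` name used in the session belongs to older versions and, as far as I know, is no longer accepted.

```python
import pandas as pd

# Allow up to 200 characters per printed line before wrapping.
pd.set_option("display.width", 200)

# Print up to 20 columns before pandas falls back to a truncated view.
pd.set_option("display.max_columns", 20)

# get_option reads a setting back, which is a quick way to check it took effect.
width = pd.get_option("display.width")
max_columns = pd.get_option("display.max_columns")
print(width, max_columns)
```

`pd.reset_option("display.width")` restores the default if the wider output turns out to be harder to read.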
What are the default values for max columns? Max columns is more like 5 or 10 or something. Line width is probably 80. And now if I say data.head(), I get a much better formatted display. And here's where we can start looking at what you want to do with this data set. We'll pause for a minute.

Sorry, if you're not getting results on the server, it might be that it's just too busy. Out here, you see the number of the command that's being executed, 14, 15, et cetera, which roughly means this is the 14th time you've pressed Shift-Enter or Ctrl-Enter. If you ever see an asterisk there, that means that it's still in the process of executing. On the server, you'll see a whole bunch of asterisks. Sorry, you may just have to wait. I thought two servers would suffice. He's shutting down that notebook and starting another one.

So now we've got the player, the country, the minutes and so on. What question would you like answered? Who spent the maximum time and scored the least? Let's note these down.

The good part about IPython notebook is you can use it like a Markdown editor. I can change the cell type to Markdown from the menu. It's easier to use the keyboard shortcut if you're familiar with it, which is Ctrl-M followed by M. To see the list of keyboard shortcuts, press Ctrl-M followed by H. And that's all the keyboard shortcuts that there are. Eventually, you will get used to them. For now, you can use the menu. And since it's Markdown, I'm just going to put that question in as a third-level heading.

Okay. That's one question. Anything else we'd like to see? Who hit the most sixes? Who hit the most sixes in Test cricket? Any guesses on who that is?
I don't know. Who hit the most sixes? Let's leave it at that for now. In which game were the most runs scored? Okay. Which game had the most runs scored? Go on. What is the highest difference between the batting average and the bowling average? We have only the batting data. Okay. Let's do a couple more. Highest percentage of runs scored by a single batsman? What do you mean by percentage? The team scored 150, and out of that, one guy scored 100. Okay. Which country has the highest percentage of runs scored by a single batsman? Okay. Last question. Which country has scored the most ducks? Which country has the best strike rate? Which country has the highest strike rate?

Okay. Let's see how many of these we can answer. Let's go through these in the same order. I may rearrange the order depending on which one is easier. Let's start with who spent the maximum time and scored the least.

That raises the question: how do you find out who spent how much time? We're looking at data['Minutes']. For each innings, it gives us the number of minutes that was spent. Do you see any problems with this? Okay. Let's do this. First, let's try and find out the maximum duration that anyone ever batted in a match. To do that, I type data['Minutes'], dot, and press Tab. That doesn't work. So let's just say minutes is equal to data['Minutes'], and then minutes, dot, Tab. We've got a whole bunch of functions. Let's see if min and max exist. Yeah, min exists. Max also exists.

If you say minutes.max(), it says 99 minutes. Do you see something weird with that? It's text. Exactly. Now, why do you think it's text? Because of the dashes. In some cases we don't have the times, and that's coded by a dash. Yes.
So you can see that it had dashes. And the reason it is probably treating the column as text is that we have these hyphens. So we've got to convert it to numbers. Let's start by looking at the type conversion function. So the second function that I'm telling you about, after head, is astype. Now, astype takes a data type as input. In this particular case, let us assume that the minutes are always integers. If we thought it could be a float, I'd put float. I'm going to say int. And it says invalid literal for long() with base 10, and it shows a hyphen. I don't know if you saw that. A hyphen is not a valid int. So we've got to remove these hyphens.

Let's do that. The next function that you've got to look at is replace. With replace, I can replace any value with another. So let us assume, for simplicity's sake, that anything that is a hyphen is going to be replaced with a 0. So now you've got a bunch of zeros here. On this I can then do an astype(int). Let's see if that works. Now note here that right at the bottom, it's saying dtype: int32. If I remove the astype(int), it instead shows object. Pandas stores integers and floats natively, and pretty much any other type is treated as an object. It has the ability to store a variety of objects.

So, .astype(int). Let me replace the minutes variable itself with that. Now let's look at the max of this. We have somebody who stayed at the crease for 970 minutes. The question is, how do we figure out who this was? There should be some kind of an index function. Absolutely. And this brings us to another fairly important feature of pandas: DataFrames, Series, et cetera, also have an index. Now, this takes a bit of time getting used to. The initial concept is quite easy.
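The conversion steps above can be sketched on a small made-up Series. The values are invented; the only thing carried over from the session is that missing durations are coded as a hyphen, which forces the whole column to be text.

```python
import pandas as pd

# Made-up durations; "-" marks an innings with no recorded time.
minutes = pd.Series(["99", "-", "337", "12"])

# On text, max() compares strings character by character,
# so "99" sorts above "337".
text_max = minutes.max()
print(text_max)

# astype(int) refuses to convert the hyphen.
try:
    minutes.astype(int)
    conversion_failed = False
except ValueError as error:
    conversion_failed = True
    print("astype failed:", error)

# Replace the hyphen first (treating missing as zero, as in the session),
# then convert the cleaned text to integers.
clean = minutes.replace("-", "0").astype(int)
numeric_max = clean.max()
print(clean.dtype, numeric_max)
```

Treating a missing duration as zero is the session's simplification; a later part of the talk argues that NaN is the better marker for values that were never measured.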
You'll find that for the first month or so, you play around swimmingly with indices, except that a few things don't work and you just ignore them. Then you'll find that it's a whole lot more subtle than you imagined. But for now, let's just start with how one goes about creating indices.

Before we do that, I'm going to do a few things. We spent some time converting minutes to integers. I want to preserve this, and I want to preserve it in the DataFrame. I want to make sure that data['Minutes'] is an integer, not the string that it was. Commit it up to that point? Exactly, sort of commit it up to that point, by saying data['Minutes'], which is where we took it from, is equal to minutes.replace of this. Now, I was using this variable, minutes is equal to data['Minutes'], just so that I could type minutes, dot, and press Tab, for simplicity's sake. If this were regular code, this is how I would write it: data['Minutes'] is equal to data['Minutes'].replace of hyphen with zero, .astype(int). And I'll chop off the max. We're not looking to store the max, we're looking to store the values. When I do that, I get no output. It has just put it into data. From this point onwards, data['Minutes'] is an integer. Prior to this point, it was a string.

Now, let's take a look at data. What would we like the index to be? Presumably, we're interested in the player's name, so we might want the index to be the player. However, for a given player, there would be multiple innings at the very least, so we don't quite have a unique index. We can use the internal pointer? We can use the internal pointer, exactly. By default, pandas provides an index which does not have a title. That's 0, 1, 2, 3, etc. So the question is, how do we retrieve this index? Let me show you a number of ways by which we can do that. One way of doing this is by filtering, and this is the next concept that I'm going to be teaching you. You can say data[...].
I put in a condition here, inside the brackets. Depending on this condition, the output will be a filtered list. For example, I can say, show me all of those rows where the player batted for more than 900 minutes. Firstly, let's do this. data['Minutes'] gives me the number of minutes. data['Minutes'] greater than 900 gives me a True or False for each row. It applies this condition across the entire Series and returns True if it is true for that row, or False if it is false for that row. Now, this True or False, I can put inside square brackets: data[data['Minutes'] > 900]. What this does is return all of those innings where the player batted for more than 900 minutes. And there's only one such innings: Hanif Mohammad in 1958 against West Indies. Which sort of already gives us our answer. That's one way of doing this. And if I wanted the index from this, that's 17710. We're not going to get into how to get that index programmatically just yet. Give it five minutes.

So filtering is extremely powerful. Let's pick something else. Who had more than six sixes in a match? data['Sixes'] is greater than 6. Okay. What do you think is going wrong here? It's effectively listing 83,369 rows. That says practically everyone has hit more than six sixes. I think not. Let's try and find out. Most likely the data type. Let's find out. First, I'm going to say data['Sixes'] greater than 6. It's True for everything. data['Sixes'] has dtype object. So what we did for minutes, let's copy and do for sixes as well. And now if you see data['Sixes'], that's int32. That's good.

Now, let's say data[data['Sixes'] > 6]. And there were a reasonable number, quite a few actually. How do we find out how many there were? How do you find out the length of an array? len. Which works just fine with the DataFrame. Now, in a DataFrame, the length could mean columns or rows.
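Filtering with a condition, as described above, can be sketched like this. The players and numbers are made up; only the pattern of building a True/False Series and passing it inside the square brackets is taken from the session.

```python
import pandas as pd

# Made-up innings with already-numeric columns.
data = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C", "Player D"],
    "Minutes": [970, 120, 45, 600],
    "Sixes": [0, 7, 1, 11],
})

# The comparison is applied to every row and yields one True/False per row.
mask = data["Minutes"] > 900
print(mask)

# Passing the mask inside square brackets keeps only the rows marked True.
long_innings = data[mask]
print(long_innings)

# len() on a DataFrame counts rows, so this is the number of matching innings.
many_sixes = len(data[data["Sixes"] > 6])
print(many_sixes)
```

The filtered frame keeps the original index labels, so the row for the long innings still carries the label it had in the full data set.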
By default, len gives you rows. If you want the number of columns, there are other ways of finding that out. So the length of this says that there were 22 innings in which there were more than six sixes. And if I wanted more than ten sixes, there were only three innings.

Which is all very fine, but what if I wanted a histogram? That is, I wanted to see how many innings had zero sixes, how many innings had one six, two sixes, and so on? There's something for that. That's value_counts. We just need data['Sixes'].value_counts(). That says there was one innings with ten sixes, one with twelve sixes, two innings with eight sixes, two innings with eleven sixes, and so on. This isn't sorted the way we want. It looks from this list as if the highest number of sixes scored in an innings, at the bottom of the list, is ten. But this is in fact sorted in descending order of the number of innings. Which is a problem, because we would ideally like it in ascending order of the number of sixes. That is possible with a function called sort_index.

At this point I'm going at a pace where you may not be able to remember the functions. That's all right. The ones that you need to remember, I've already mentioned. These you will get just by exploring the API. So now it's sorted by the index, from zero to twelve, and you can see how many innings there were for each number of sixes, up to twelve sixes.

And now you know how to figure out which innings had twelve sixes. So let's do that. Could you tell me what I need to type to figure out who hit twelve sixes? data of data['Sixes'] equal to twelve. Or greater than or equal to twelve. Let's say equal to twelve. That was Wasim Akram against Zimbabwe. Now, we still haven't quite answered the question of who spent the maximum time and scored the least, though we've learnt a few other things about Wasim Akram and others.
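The histogram step above can be sketched as follows, using a short made-up Series of sixes per innings. `value_counts` puts the most frequent value first, and `sort_index` reorders the result by the value being counted.

```python
import pandas as pd

# Made-up number of sixes in ten innings.
sixes = pd.Series([0, 0, 1, 0, 2, 1, 0, 10, 0, 2])

# Index: distinct values; values: how many innings had that many sixes.
counts = sixes.value_counts()
print(counts)

# Re-sort by the number of sixes (the index) instead of by frequency.
by_sixes = counts.sort_index()
print(by_sixes)

# The largest index label is now last, so it is easy to read off.
most_sixes = by_sixes.index[-1]
```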
Now how do we figure out who scored the least? That's just the number of runs, right? So let's see what data['Runs'] shows us. Now we have a bunch of problems. We've got a DNB. We've got a star at the end. There's no way we'll be able to easily convert these into numbers. Now, I guess you know what a star at the end means, right? Not out. Which is useful information. I don't quite want to toss that out either. I would rather preserve that in a separate column, and then in the runs column I'll strip it off. So we'll have a separate column for not out. DNB may be useful information too, but there are other ways of preserving it. So let's first start by getting rid of the DNB.

Now, there is a value called NaN. NaN effectively stands for not a number and is a common representation for missing data. Suppose I have rainfall data for the months of January, February and March. Let's say the rainfall data was 20, 30 and 40. Sorry, I'm going to take all of these Markdown cells and move them elsewhere. So if I have this series, 20, 30, 40, that's what the rainfall was for January, February and March. Suppose I never measured it for February. It doesn't make sense for me to put a zero here. That would indicate that there was no rain in that particular month. Zero is a value that is distinct from saying I did not measure it. In a CSV file, I would have just left this blank. I would just put nothing in it. Trouble is, that does not translate well into Python syntax. The equivalent of leaving something blank is numpy.nan. In fact, I think pd.NaN may also work. NaN is not defined directly in Python. If you just type NaN, you get an error. Python to some extent understands the concept of NaN, but unlike JavaScript, it does not directly have a built-in value called NaN, and numpy introduces one. Numpy, incidentally, is a library that sits below pandas and does most of the lower-level work.
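The rainfall example above can be sketched like this. The month labels and values are illustrative; the point is only that a missing measurement (NaN) and a measured zero behave differently in a calculation.

```python
import numpy as np
import pandas as pd

months = ["Jan", "Feb", "Mar"]

# February was never measured, so it is marked as missing rather than zero.
rainfall = pd.Series([20, np.nan, 40], index=months)

# February recorded as zero would claim that it did not rain at all.
with_zero = pd.Series([20, 0, 40], index=months)

# isnull() flags the missing entries.
missing = rainfall.isnull()
print(missing)

# mean() skips NaN by default: (20 + 40) / 2, versus (20 + 0 + 40) / 3.
mean_with_nan = rainfall.mean()
mean_with_zero = with_zero.mean()
print(mean_with_nan, mean_with_zero)
```

Note that a Series containing NaN is stored as floats, which is why the integers above come back as 20.0 and 40.0.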
Numpy is written in C, which makes life a little bit faster, and that is what makes pandas reasonably fast. So what I can say is 10, numpy.nan, 30, and that indicates that the middle one is a missing value. What I would now like to do, therefore, is replace all of the DNBs with NaNs rather than 0 runs. It is 0 runs in a sense, but we preserve the information by saying data['Runs'] is equal to data['Runs'].replace of DNB with 0. No, sorry, with numpy.nan. And let's see if we can convert this to an int now. The problem is there are these stars at the end. So we've got to get rid of the stars at the end.

What we can do is take that information regarding the stars and put it into a separate column. Now how do we do that? Pandas has a series of string functions. Let's say runs is equal to data['Runs']. Then runs.str, dot, gives us a series of functions. For any of these columns, if you say .str, dot, you get a set of functions that operate on string columns. It has cat (that's concatenate), center, contains, count, decode, encode, endswith. And endswith looks promising. We want to see if it ends with a star. So, endswith of a star. Now that's a True or False. Just to check: the first one is True, the second one is False. Let's print runs. The first one has a star, the second one doesn't. So that's runs.str.endswith.

Now I can put this in as a new column. To add a column in pandas, you use the same syntax as for modifying a column. You say data['NotOut'] is equal to, and assign a Series to that column. NotOut is a new column. So now data['NotOut'] has this particular value. I'm going to move it along with the rest of the stuff.

So now that we have the not out information, we can afford to discard it from the runs column. Now how do we do that? Let's do replace. We just replace asterisks. Except here, we are not replacing an entire value. We are replacing only a portion of a string's value.
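A minimal sketch of capturing the not-out marker in its own column, as described above. The scores are made up, and the column names `Runs` and `NotOut` are assumptions chosen to match the narration.

```python
import pandas as pd

# Made-up scores; a trailing "*" marks a not-out innings.
data = pd.DataFrame({"Runs": ["165*", "12", "DNB", "0*"]})

# .str exposes string methods that are applied to every value in the column.
not_out = data["Runs"].str.endswith("*")
print(not_out)

# Assigning to a name that does not exist yet creates a new column.
data["NotOut"] = not_out
print(data)
```

Doing this before stripping the asterisk matters: once the star is removed from the runs column, the not-out information can no longer be recovered from it.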
So .replace replaces an entire value with another distinct value. What we need for a string replace is .str, dot, presumably replace. I don't know what the function is. Let's find the function. So runs.str, dot, is there a replace? There is a replace. What does it take? It takes a pattern and a replacement. The good part is this works with regular expressions. The bad part is that star is a regular expression character, so we may have to escape it. No harm in doing that. So let's say backslash star, comma, empty string. So the first one, which had a star in it, that star is gone. Without the .str.replace, the first one was 165 with a star for not out. With the .str.replace, the 165 stays and the star has gone away. So I can take this and put this code snippet at the top again. So I'm replacing DNB with numpy.nan and replacing the asterisk at the end with nothing.

Now let's try and convert it to an int. Does it work? No. It cannot convert float NaN to integer. Those NaNs that we put in, it's treating them as floats. Integers do not support NaN. I'm not too fussed about it being an integer. It can be a float. Something point zero is fine. Will that work? Not quite. It says it cannot convert this string called absent.

How do I print out the distinct values? These are not numbers. What do you think? Distinct? Okay, I don't know. Let's hold off on that. value_counts will certainly work. We just saw that. Any other thoughts? Let's try. So runs, dot, dis. No, there isn't anything that starts with that. So let's ignore that. runs.unique. There does appear to be a unique. So if you say runs.unique(), that shows us a bunch of values. Now here we have stuff that ends with a star, and we have stuff like TDNB, and we have absent. Obviously, sorting this list and printing it would make more sense, but for now you get a sense. At least absent and TDNB have to be taken care of. One could also say value_counts: runs.value_counts().
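The cleanup discussed above can be sketched end to end on a made-up Series. The marker strings `DNB`, `TDNB` and `absent` are assumptions about what the data set contains, based on the values named in the session; `unique()` is the step that would reveal any others.

```python
import numpy as np
import pandas as pd

# Made-up scores with the kinds of non-numeric values described above.
runs = pd.Series(["165*", "12", "DNB", "0*", "TDNB", "absent", "7"])

# unique() lists the distinct values, which exposes anything non-numeric.
print(sorted(runs.unique()))

cleaned = (
    runs
    # Whole-value replacement: markers for "did not bat" become missing.
    .replace(["DNB", "TDNB", "absent"], np.nan)
    # Partial replacement: strip a trailing star. The star is escaped
    # because it is a special character in regular expressions.
    .str.replace(r"\*$", "", regex=True)
    # float rather than int, because integers cannot hold NaN.
    .astype(float)
)
print(cleaned)
```

On the audience suggestion of converting and handling failures instead of listing patterns: `pd.to_numeric(series, errors="coerce")` does something close to that, turning anything unparseable into NaN, at the cost of not seeing which values were dropped.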
And that gives me this list, which I can sort. I'm going to put in a function called order, and I'll explain later why order and not sort. But, sorry, sort_index would work here. Sort by index. And I see that there's TDNB and there's absent at the end. So let's account for these two. I've already replaced the star at the end. Let's replace TDNB with numpy.nan. Let's replace absent with numpy.nan. And now convert to float. That seems to work. Let's do that once again.

So now, at last, we have numeric information on the minutes and the runs. Now we can find out who spent the most time and scored the least. How do we figure that out? Let's use an equivalent metric and call it runs per minute. Runs per minute is effectively the number of runs divided by the number of minutes. That gives me a bunch of NaNs. Some guys got out in the first minute. But data['Runs'] is all consistently NaN.

Instead of trying to replace the patterns, why not typecast to integer and catch the exception in that case? Which is exactly what I'm going to come to next, and why that is not possible. Bear with me for a few seconds while I do a few things that I want to do.

Now, you remember we had a whole bunch of NaNs here, right? The reason was that I kept running this cell several times. If the column is already numeric and I start doing stuff like replace DNB with NaN, replace asterisks with a string replace, and so on, then somewhere in the middle of this it got confused and did whatever it did. I'm sure it made sense to pandas. It doesn't make sense to me right now. But if you're rerunning stuff, try and make sure that you start with the same state. It is of course working fine when I go to the top and press Shift-Enter one by one for each one of those cells. Now it prints the number of runs properly.

Which brings us to your question: just typecast it to int and, when it stops in the middle,
Catch the exception. And put that in as the answer. Let's see how long that takes. Just timing this. Let's add up a bunch of numbers. Actually, I'm not going to actually do that. I'm going to show you a bunch of benchmarks. Just how long stuff takes, if you were to do it. Let's run it. What I'm going to do is add up a million numbers. That's sum of xrange of 1 million. That's whatever number it is. Let's time it. In fact, let's make it 10 million. That took a bit of time. So how long does that take? About 789 milliseconds per loop. Let's do a numpy.sum of numpy.arange of 10 million. That took 35.9 milliseconds per loop. About 20 times faster.
Is that because numpy is in C++?
C. And this is numpy. Yes, that is sort of the simple answer to it. But it's not that Python is just pointlessly wasting time. Python allows you to, in the middle of this, try and catch. It allows you to have all kinds of polymorphism. This xrange could be replaced by any object that has an __iter__, and it would execute that function. And in the middle of this it will allow you to change that function. Numpy does none of this. It says you have got to give me objects of exactly the same type. If in the middle it fails, the whole thing fails. There's no exception catching. There are none of the nice features that you have in Python. But I will give you the features that you really need for processing numbers. Which is why it can do this so quickly.
In fact, we are about 10 minutes away from the end of the session. We haven't even answered the first question. You can try answering the questions that I mentioned. So let's get to the calculation that we wanted. We wanted to take the number of runs and divide it by the number of minutes. So now we've got a bunch of numbers. Now they could be inf, that is, if the number of minutes was 0. It could be NaN if the number of runs and the number of minutes were both 0. We don't care about those.
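The timing comparison described above can be reproduced with a rough sketch like the one below. It uses the standard `timeit` module instead of the notebook's `%timeit`, uses `range` (the Python 3 name for `xrange`), and uses 1 million items rather than 10 million to keep the run short. The function names are made up for illustration, and the timings on your machine will not match the 789 ms and 35.9 ms quoted in the session.

```python
import timeit

import numpy as np

N = 1_000_000  # assumption: smaller than the 10 million used in the session


def python_sum(n=N):
    # Pure-Python loop: every item is a full Python object pulled from an iterator.
    return sum(range(n))


def numpy_sum(n=N):
    # One pass over a fixed-type array; no per-item Python objects involved.
    return int(np.arange(n, dtype=np.int64).sum())


# Average seconds per call over a few repetitions.
py_time = timeit.timeit(python_sum, number=3) / 3
np_time = timeit.timeit(numpy_sum, number=3) / 3

print(f"sum(range(N)):       {py_time * 1000:.1f} ms per loop")
print(f"np.arange(N).sum():  {np_time * 1000:.1f} ms per loop")
```

Both functions return the same total; only the time taken differs.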
We just want to see who spent the most time and scored the least. So let's do minutes by runs. How many minutes per run? And we want to see the max of this. Max is inf. So somebody spent some time and did not score any runs. When you think about it, that will always happen. So how does one handle that? Runs greater than 0. Let's do it. So what we do is say, let's say, scored is equal to data of runs greater than 0. And then I can say, among those that scored, what is the max? So somebody spent 82 minutes per run. Most likely they spent 82 minutes scoring one run. Now who was this? To figure that out, this brings us to the second way of getting at this. Now let's see if there's an index of this. If I just say .index, does it show something? No, unfortunately not. But what I can do is order this. Order is the same as sort. It's a slight, subtle distinction. For data frames, you use sort. For series, you use order. The way I remember it is I try one, and if it doesn't work, I try the other. So now index 18804 is the last one.
Now how do I just get the last item? Data dot ix? Dot last? Minus one? Let's try them all one by one. So let's say this is some series. series.tail gives me the last five values. I can put series.tail of one and get the last value. Can I say .last? No, there isn't a .last. It's tail instead. Dot ix. Now what exactly is this .ix? I'm not going to cover .ix. Very powerful, very useful, but I'm not going to do that. There is... but let's do this. Let's press i for index and press Tab and see what all is there. There is something called idxmax and idxmin. Presumably that will be the index of the max and the index of the min. That's good. There's also iget. What that does is get you the index of a specific item. And presumably if I say iget of minus one, it might get you the index of the last one. It doesn't, but you might want to explore something like that.
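As a rough sketch of the steps so far (clean the columns, divide, keep only innings with runs, find the maximum and its index), here is a small example. The table, its column names (`Player`, `Runs`, `Mins`) and all the figures are invented for illustration; the real cricket file is not reproduced here. It uses `pd.to_numeric(..., errors="coerce")` as a shortcut for the string replacements, and `sort_values`, which is what current pandas offers in place of the `order` and `sort` methods mentioned in the session.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the innings table; every value is made up.
data = pd.DataFrame({
    "Player": ["A", "B", "C", "D", "E"],
    "Runs":   ["12", "1*", "TDNB", "0", "45"],
    "Mins":   ["30", "82", "absent", "5", "90"],
})

# Strip the not-out star, then turn anything non-numeric into NaN.
runs = pd.to_numeric(data["Runs"].str.rstrip("*"), errors="coerce")
mins = pd.to_numeric(data["Mins"], errors="coerce")

mins_per_run = mins / runs        # x / 0 gives inf, NaN / NaN stays NaN
scored = runs > 0                 # boolean mask; NaN > 0 is False
slowest = mins_per_run[scored]    # only innings where runs were scored

print(slowest.max())                   # largest minutes-per-run among scorers
print(slowest.idxmax())                # index label of that innings
print(slowest.sort_values().tail(1))   # same row, via sorting and tail
```

The mask step is what removes the `inf` values, since they only arise when runs are zero.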
Or a bunch of others. Or this .ix which, like I said, I'm not going to explore. Let's go with idxmax. And that says 1884. Now I want the 1884th row. So from data I need to get that. Let's try data of this one. Not quite. It says there's no item named 1884. Why is that? Remember, we were using data of something for the columns. So we can't use it for the rows as well. By default, data of something refers to the columns, not the rows. So how do I get the rows? Couple of ways of doing it. One is to transpose it. data.T transposes it. And then I can say data.T of this particular index. Then I get R.G. Nadkarni, who in 1959 spent 82 minutes scoring one run. And he did not get out, which is good. Another way of doing it is, instead of .T, you can say .ix. Dot ix is a very sophisticated way of getting a list of rows. You could get one row. You could get multiple rows. And if I do that, the result is exactly the same. So now we have answered the first question, which is who spent the maximum time and scored the least.
Let's go a bit faster. Who hit the most sixes in Test cricket? That's an easy one for you. How do you do that? Sum up the number of sixes. Who hit the most sixes in Test cricket? We have to sum it up by player, which brings us to the whole concept of grouping and aggregation and so on. Pandas has this function called groupby. I say data.groupby. I'm going to group by player. This assumes that there are no two players with the same name across different countries. But for now, you assume that there are not. We can, in fact, group by multiple keys. That will be the next step. So we'll start with the assumption that there is only one player per name. You get the idea. Now, this returns a groupby object, which is a pretty powerful object in itself. Let's not worry too much about what it does. From this, I can extract just one piece of information. So a groupby object doesn't actually do anything by itself.
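Going back to the row lookup a moment ago, here is a minimal sketch of the column-versus-row distinction. The frame and its index labels are invented. Note that `.ix`, which the session uses, was later removed from pandas; `.loc` (by label) and `.iloc` (by position) are the current equivalents and are what this sketch assumes.

```python
import pandas as pd

# Hypothetical frame with non-default index labels; all values are made up.
data = pd.DataFrame(
    {"Player": ["A", "B", "C"], "Runs": [12, 1, 45], "Mins": [30, 82, 90]},
    index=[10, 20, 30],
)

runs_column = data["Runs"]        # square brackets select a column by name
row_by_label = data.loc[20]       # .loc selects a row by its index label
row_by_transpose = data.T[20]     # after transposing, row labels act as column names
last_row = data.iloc[-1]          # .iloc selects by position; -1 is the last row

# data[20] fails, because 20 is a row label and not a column name.
try:
    data[20]
    lookup_failed = False
except KeyError:
    lookup_failed = True

print(row_by_label)
print(lookup_failed)
```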
The groupby object sort of creates a representation and says, I'm ready to do whatever other transformations you want. I've created an index. Actually, that's what groupby does. Think of it as the creation of a database index. Internally, it keeps the index ready. Now you can do whatever other database operations you want. For example, on this, I can say sum everything up, which, if I put a head, gets, for each player, the sum of minutes, sum of runs, and sum of sixes. It didn't get the sum of fours. Any guess why? Not an integer. It's still a string, so it didn't get that. So now with this, I can just say, sort by sixes. Get me the head, that gets me the lowest sixes. Get me the tail. And I can answer the question straight away. Adam Gilchrist, who was not on our list. Or I could do this in another way: idxmax. So that's Adam Gilchrist. We have now answered the second question.
Faster. Which game had the most runs scored? Now, how do we identify a game? Each game... By when and the ground. Ground. Date. Date. Countries. On the same date on the same ground, I guess... I mean, you'll have only two teams. So when you say which game, do we care about which team in the game? I guess not. So if we just take these two parameters, then we ought to be good. So let's do that. But a Test match is over five days. This is fortunately the start date. The start date and the ground. We want the sum of runs. Which obviously we want to sort. Sum, sort, and get the tail. Like that. Runs. So in Durban in 1939, 1894 runs were scored. Why? How many days in this match? Minutes is 4,290. How long is that? Was it a timeless Test? No, you can do it in IPython. 4,290 by 60, that's 71 hours. That's not too bad actually. Let's say 8 hours of play a day, that's over 8 days of play. There used to be timeless Tests. South Africa. The start date is after 30. I think 30. Yeah, well, let's do that. That gets into data processing.
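The two aggregations above (sixes per player, then runs per match keyed on start date and ground) might look roughly like this. The table, column names (`Player`, `Ground`, `Start Date`, `Runs`, `6s`) and values are invented for illustration. The sketch selects the numeric column explicitly before summing, and uses `sort_values`, since the `sort` and `order` methods used in the session are no longer part of pandas.

```python
import pandas as pd

# Hypothetical innings table; every value is made up.
data = pd.DataFrame({
    "Player":     ["A", "B", "A", "C", "B", "A"],
    "Ground":     ["X", "X", "Y", "Y", "Y", "X"],
    "Start Date": ["1939-03-03", "1939-03-03", "1950-01-01",
                   "1950-01-01", "1950-01-01", "1939-03-03"],
    "Runs":       [10, 50, 30, 5, 20, 40],
    "6s":         [1, 0, 2, 0, 3, 4],
})

# Group by one key, sum one column, then find the player with the largest total.
sixes = data.groupby("Player")["6s"].sum()
top_six_hitter = sixes.idxmax()

# Group by two keys to identify a match, sum the runs, sort ascending.
match_runs = data.groupby(["Start Date", "Ground"])["Runs"].sum().sort_values()

print(sixes.sort_values().tail(1))   # highest six count
print(match_runs.tail(1))            # highest-scoring match
```

Grouping by a list of columns produces a result indexed by (start date, ground) pairs, so the last entry after sorting identifies the match.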
It was 2 minutes, I don't want to do that. It was a timeless match. But anyway, so we now have this match in Durban in 1939 that scored 1,800 and whatever.
Who scored the highest percentage of runs? Is there a match duration column?
We effectively have that. This is a match duration: the total number of minutes of innings. Each batsman has added so many minutes. Which is not... yeah, which is true. But we don't have separate matches now. Is there 1,000? 1,000, even the 24 numbers. You can see what they play, not really the type of match. Okay, at least 10 minutes we'll have that. 8 days of play. Adding people... players' times together. 10 days. But totally it's normally 5 days, right? No, no. There are 2 people batting, one at either end. So it's twice the number. And also it doesn't take into account stoppages. Correct. But see, let us say team A played for 1 day. Let's say 8 hours of play. It's time for team A. 8 hours? No, it's 5 hours. Each minute is concrete-wise. Strike on an obstacle. Oh, I see. Basically, bite over it. Then it's not bad. 1,800 runs though is quite impressive.
Let's look at this match. What was the scorecard for this one? Data of start date equals 3rd March, 1939. Now, if I wanted to do an AND, it's ampersand. But it's generally good to put each condition in brackets. Data of... yeah, actually, and data of something else is equal to something else. Close bracket. This is ground. Equals... okay. So, let's see what exactly happened in this match. Sorry if you can't read at the back, but I'm quite curious to see what happened. South Africa versus England. Two minutes. And somebody scored 200 runs. 1,500... 3... Not too bad actually. It's a timeless Test. 219... 219 and 140. So, it just looks like a regular high-scoring match. And it's four innings anyway. So, each innings seems to have got what? Around 450 or so. Not too bad then. But they must have been scoring it pretty fast.
Let's just do the last bit, which is the number of ducks. Who scored the most ducks?
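The two-condition filter used above for the Durban match can be sketched as follows. The rows, the column names and the date string format are assumptions made for illustration. Each comparison is wrapped in parentheses because `&` binds more tightly than `==` in Python, so leaving them out changes what gets compared.

```python
import pandas as pd

# Hypothetical rows; the date format and all values are made up.
data = pd.DataFrame({
    "Player":     ["A", "B", "C", "D"],
    "Ground":     ["Durban", "Durban", "Lord's", "Durban"],
    "Start Date": ["3 Mar 1939", "3 Mar 1939", "3 Mar 1939", "1 Jan 1950"],
    "Runs":       [219, 140, 30, 12],
})

# AND of two boolean Series: each condition goes in its own brackets.
mask = (data["Start Date"] == "3 Mar 1939") & (data["Ground"] == "Durban")
scorecard = data[mask]

print(scorecard)
```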
Data of runs equals 0. Now, will this work? We'll have to do a groupby. That is correct. I can do that after this also. But will data of runs is equal to 0 work? If he did not bat, it will not work? If he did not bat, what did we replace it with? NaN. And 0 is not equal to NaN. So, that part is fine. Any other reason why it might not work? Float. You did not score any runs. You batted but you did not score any runs. Therefore, it has to be out. So, data of runs is equal to 0, and data of not out is equal to False. Of course, I could have put this as tilde, but let's keep it simple. So, that is... sorry, what? Which gives us 7,921 innings. 7,921 ducks. Now, let's group this by player. Sorry. And now, instead of summing it up, I can just count. And that gives me a bunch of players. Let's sort them. Let's just get the runs. Count. And then, order this. Walsh with 43 ducks, by a big margin. Followed by C.S. Martin, then Glenn McGrath. It was entirely Walsh who seemed to be getting 100 luck. Understandable. But to Courtney Walsh goes the credit.
Let me stop there. So, this is the first sense of what Pandas feels like. Hope you got that sense. Also, I want to give you a sense of how you discover stuff in Python, in IPython specifically, for Pandas, by just pressing dot and Tab and playing around there. Pandas documentation is decent. I would strongly suggest spending the time and effort to go through the Pandas documentation in full. It will take a while, but it's got the functionality. It's got great aggregation functionality. The index is extraordinarily powerful. We haven't touched upon the index. You can merge across multiple datasets. It's very, very good. And lots more. And it's got a good, strong, growing community. Plus, it's Python. So, let me stop there. Let's take a few questions.
A lot of particular products are not available on GPL. So, how does Pandas figure in that? Any ideas?
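The duck count from the walkthrough above can be sketched like this. The table and the column name `NotOut` are assumptions; the figures are invented and are not the real totals. It uses the tilde form mentioned in passing in place of comparing with `False`, and `sort_values` in place of `order`.

```python
import pandas as pd

# Hypothetical innings; NaN marks an innings where the player did not bat.
data = pd.DataFrame({
    "Player": ["A", "B", "A", "C", "A", "B"],
    "Runs":   [0.0, 0.0, 0.0, 12.0, 0.0, float("nan")],
    "NotOut": [False, False, True, False, False, False],
})

# A duck: scored zero AND was dismissed. NaN == 0 is False, so
# did-not-bat innings drop out without special handling.
is_duck = (data["Runs"] == 0) & ~data["NotOut"]

# Count the matching rows per player, smallest to largest.
ducks = data[is_duck].groupby("Player")["Runs"].count().sort_values()

print(ducks.tail(3))
```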
The Continuum distribution of Pandas may not be on GPL, but Pandas itself is whatever license it is, and it's available.
If I don't go the route of doing an Anaconda install, can I just go pip install?
You should be fine. You can't quite go pip install. On Linux, you can do it. On Windows, you cannot do it.
Is there anything very powerful about Pandas? Because a lot of these things can be done with MySQL or any relational data representation. Even Excel.
So, I just want to just talk one part. Watch. Firstly, it's programmatic. Secondly, it's a question of programming language of choice. So, if you say, if I did this in R, how easy would this be? Probably just as easy. Assuming you know how.
The sum was very slow, particularly when you did the sum, right? On 80,000 rows, it seems slow. I mean, it was a little second slow.
I'm doing it the most inefficient way possible. What I would have done is done the filtering once and then done the groupby. Or done the groupby first and then the filter. Instead I'm doing both. And yes, Pandas can still be used inefficiently. Same as any database. But in terms of raw speed, it rivals practically every platform. On average, I find that Pandas is going twice as fast compared to R. Compared to raw C code, Pandas 1.2S compared to more than JS is about 30-40% faster. Compared to most platforms short of hand-rolling your own C code, Pandas is 30,000 and there is no filtering. So speed is certainly one advantage. The other advantage is simply that it is in Python. So if you are already using Python, then there is no trouble. If you are not using Python, then I wouldn't suggest learning Python for the sake of Pandas.
As far as performance goes, running this in PyPy should give better performance generally, right?
Not quite. Pandas is C-based, so PyPy is a bit of a problem. If you took raw Python code and put it into PyPy, yes, you would get that kind of optimization.
You wouldn't get it as fast as Pandas in the first place.
Can it work with live streaming data?
As long as you get the live streaming data to come. What kind of live streaming data are you talking about?
As in Twitter. Anything that is continuous. In this example, you have to see if you can run all these queries on a stream of data that is constantly coming. Will you take the question and wait for it to come?
You can set that up. Pandas itself takes a fixed, constant chunk. So you need something in front that takes whatever stream comes in and breaks it up into chunks. That is one half of the answer. The other half of the answer is that Pandas itself can take chunks. It does support it. When you load a file, you can say, I want to read it as multiple chunks, because there is no point in loading it all. Supposing you are adding a bunch of numbers. You don't have to load the whole thing into memory. Just load every 10 megabytes or 100 megabytes and then move on. So you can tell it to read it in chunks, but it requires a little bit of front-end engineering to make sure that the stream passes through to it seamlessly. That is how you would do it if the stream passes through.
Yes, iPython can. Pandas supports iPython. Why would you bother? Meaning, between R and D3, you've got much better K20s. And IPython plotting would work in an IPython environment. If you wanted to package an application and send it through, then... I guess you could... Don't. Yeah, iPython has created JSON. So you... The native JSON parser in Python is extremely fast. It is just an object where iPython should have to generate sales. Exactly.
Is it meant for any kind of... multiple servers, later aggregated to profit?
On this... If you are, on a single server, trying multi-threading or multi-processing, then Python has a known problem, which is the global interpreter lock. If you're doing it on multiple systems, the overhead of splitting the data and splitting the processes is the same as any other...
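On the earlier point about reading in chunks, here is a minimal sketch using the `chunksize` parameter of `pd.read_csv`. The CSV content is generated in memory purely for illustration, and the column names are made up. Note that `chunksize` counts rows, not megabytes, so the "every 10 megabytes" figure from the answer would have to be translated into a row count for a real file.

```python
import io

import pandas as pd

# Hypothetical CSV held in memory; a real use would pass a file path instead.
csv_text = "Player,Runs\n" + "".join(f"P{i},{i % 7}\n" for i in range(1000))

total_runs = 0
rows_seen = 0

# With chunksize set, read_csv yields DataFrames of up to 100 rows each,
# so only one chunk is held in memory at a time.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=100):
    total_runs += int(chunk["Runs"].sum())
    rows_seen += len(chunk)

print(rows_seen, total_runs)
```

A running total like this works for sums and counts; aggregations that need all the data at once (a median, for example) would need a different approach.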
Pandas does not provide any extra support for... I think it's simple enough for a lot of... You don't want to use... Thank you folks.