So, yes, I am a researcher in economics. I don't know if there are any other researchers in economics here; I would bet not. Pandas is the thing that helped me, and some other people, show that we can do research in economics with Python, so it's really crucial for me. I'm not a programmer by profession, but I spend all my days coding with pandas, so I have learned some of the pitfalls. Now, the title might seem more philosophical than it is: it sounds like I will go in depth into some very conceptual description of one specific wrong way to use pandas. That's not the topic; there are many different ways to use pandas the wrong way, and I will show some of them. Disclaimer: as I said, I love pandas, and my daily work wouldn't be in Python otherwise. That said, pandas has bugs, quite a few of them, and you are very welcome to help us fix them. I'm an occasional contributor, and I'm more than willing to help people who'd like to contribute learn a bit about the code base, which is complex, but mainly in the sense that it's big; it could be tidier, but it's not too hard to learn the main concepts. I'm trying to organize a sprint on Saturday, and I hope people are interested; if you are, tell me so that we can organize it. This talk is not about pandas bugs; that would be too easy, as there are many, many, many. It's rather about things that are mostly design decisions (some are borderline) which couldn't have been made very differently, so it's the user's task to understand them. It's not even about wrong design decisions. Most of you probably know NumPy better than pandas, and I think NumPy is very intuitive: not trivial to use, but intuitive. Pandas is in principle an extension of NumPy, and it sometimes seems more intuitive than it actually is, simply because it's complex: it does complex operations. Okay, let's start with some examples.
This is an intermediate talk, so most of you may be bored by the first five minutes, but it's still good to start from the basic mistakes one can make in pandas. For example, let's take a series of 10,000 elements, and the same data as a plain list. Pandas is based on NumPy, and NumPy is good at managing large amounts of data (depending on what you mean by large, but some thousands of elements at least). So one would think it's good to store those 10,000 elements in a Series, the pandas object. Now let's see some timings, for instance for retrieving an element by positional index. There is clearly no competition. The lesson is trivial for most of you, but here it is: pandas is good if it allows you to avoid Python loops. Inside a Python loop, every single operation is much more expensive, because it goes through several layers of indirection, as we will see. So if you are not able to use pandas to make your work on reasonable amounts of data better, don't use it. We can see another example: again, the list is more than 100 times faster than pandas. For comparison, this problem is also present in NumPy: even in NumPy you have some overhead, which however can be compensated by the size of the data, and it's much smaller. For instance, we can compare these 7 microseconds with the 149 of pandas, or this 1 millisecond with the 109 of pandas. Pandas could probably do better than this, but not to the point of NumPy: there is more structure under the code. Okay, let's get to something slightly less trivial. Pandas, a bit in the Python spirit, allows you to do a lot of things that you really don't want to do, or want to avoid as much as possible. One of these is duplicated indexes. For some people, duplicated indexes are already a heresy: when I talk to people using R and similar structures in R, they don't expect this. But pandas allows it.
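The element-access comparison described here can be sketched roughly as follows (a reconstruction, since the notebook code is not shown; absolute `timeit` numbers will vary by machine, but the gap is always large):

```python
import timeit

import pandas as pd

# 10,000 elements stored both as a pandas Series and as a plain list.
n = 10_000
s = pd.Series(range(n))
lst = list(range(n))

# Retrieving a single element by positional index, many times over:
# the Series goes through several layers of indirection per access.
t_list = timeit.timeit(lambda: lst[5000], number=10_000)
t_series = timeit.timeit(lambda: s.iloc[5000], number=10_000)

print(f"list access:   {t_list:.5f}s")
print(f"Series access: {t_series:.5f}s")  # typically orders of magnitude slower
```

The point is not that pandas is slow, but that per-element access inside a Python loop pays the full overhead every time, while vectorized operations amortize it.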
We build a data frame which has a normal index from 0 to 99; then we repeat it, concatenating it with itself, and remove the first 50 lines, so we get 150 lines. We can take a look at its structure: it's basically this, and then it repeats itself. The index is not unique, and it's not even sorted; I didn't show it, but it's good to see that is_monotonic is False. It's an ugly index, really, and you don't want to play with it. Why not, if it's possible? Let's start with an example. If I take loc[0], I get a Series. Why? Because the label 0 appears only once. So I can assign to it, and if I repeat the lookup, I see that the assignment worked fine: I set the first line. Now, I might be tempted to do the same with loc[99], and it's going to fail. Why? Because 99 is repeated: it's actually two lines, since the index is not unique. This was a trivial case, but it becomes very messy when you have an index where some elements are repeated and some are not, and you don't know how many times they're repeated. So the lesson is: you just want to avoid duplicated indexes. For instance, if we take the same data frame and reset the index, it now has a nice unique index, which is also sorted, and this is the right way to work: loc[0] is now always a single row. We can also see that this is slightly more efficient than the previous version, and in some cases more than slightly. By the way, I should mention that I'm avoiding repeated timings at the expense of precision, because pandas has a lot of caching involved and I don't want to show you cached results; this brings the risk of unexpected numbers, but trust me, usually this is what you see: the unsorted, non-unique index is slower. Now, talking about indexes, pandas allows you to do one thing which is not at all trivial.
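The duplicated-index behaviour just described can be sketched like this (the exact construction is an assumption based on the description: concatenate a 100-row frame with itself and drop the first 50 rows):

```python
import numpy as np
import pandas as pd

# A frame with a duplicated, unsorted index: labels run 50..99, then 0..99.
df = pd.DataFrame({'x': np.arange(100)})
dup = pd.concat([df, df]).iloc[50:]

print(dup.index.is_unique)                # False
print(dup.index.is_monotonic_increasing)  # False

print(type(dup.loc[0]))   # label 0 occurs once: a Series (one row)
print(type(dup.loc[99]))  # label 99 occurs twice: a DataFrame (two rows)

# Resetting the index gives a unique, sorted index: the sane way to work.
clean = dup.reset_index(drop=True)
print(type(clean.loc[0]))  # always a single row now
```

Note how the *type* of `dup.loc[label]` depends on how many times the label occurs, which is exactly what makes code over such an index messy.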
Remember that df had elements indexed up to 99. Now, what if I assign to label 100? It's just added to the bottom. This is different from what we are used to, for instance, in NumPy: if I have a NumPy array that goes up to index four and I try to set a fifth line, it's going to say no way, there is no such position. Pandas, in label-based indexing mode, allows you to add a line without protesting. Now, this is not necessarily a good thing to do. Consider this: we are adding 1,000 elements one by one to an empty series, and it takes 400 milliseconds. What if we had given the index at the beginning? It's exactly the same operation, but since we declared from the start exactly which labels we want, it's way faster. So in general, you want to be very cautious when you add elements which are not already in the index. The reason is pretty simple: pandas is based on NumPy, which uses contiguous structures, so if you add an element which is not there, it has to change the location in memory of the whole array. It's asymptotically bad, not just here. We can also see this by running the same assignment twice: it's slow the first time, because the label wasn't known, and faster the second time. If you don't believe me, this time I'm lucky with the timing. Okay. This next one is a more standard problem, but some of its consequences are not so obvious; at least, I've been bitten by them sometimes. So let's create a stupid data frame here, nothing inside, we don't care at the moment. Clearly this is possible, right? I can select one line with an indexer and then get an element of that line. Now, I can also assign this way. But what is actually happening? Well, here I'm lucky: I'm actually setting an element in the third row, fourth column, that is, the fourth position of that row. What happens if I do something slightly more sophisticated, using slicing?
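The enlargement experiment above can be sketched as follows (a reconstruction; the original notebook used 1,000 insertions, and timings depend on the machine and pandas version):

```python
import timeit

import pandas as pd

def grow():
    # Assigning to labels that are not yet in the index enlarges the
    # underlying array on every assignment: asymptotically quadratic.
    s = pd.Series(dtype=float)
    for i in range(1000):
        s.loc[i] = i
    return s

def prealloc():
    # Declaring the labels up front: each assignment is a plain write
    # into an already-allocated array.
    s = pd.Series(index=range(1000), dtype=float)
    for i in range(1000):
        s.loc[i] = i
    return s

t_grow = timeit.timeit(grow, number=3)
t_prealloc = timeit.timeit(prealloc, number=3)
print(t_grow, t_prealloc)  # growing is typically far slower
```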
It's going to warn me: "A value is trying to be set on a copy of a slice from a DataFrame." Why is it warning me? Basically because nothing is happening. It's a warning that is telling me: you think you are setting something, and you are not. And why? Because when you use chained indexers, unless you know the code base very well, you never know whether a copy has been made or you have a view of the original data. Now, this warning is standard; you probably saw it several times if you work with pandas. So one would say: I feel safe. There are two problems with this. First, the rules for deciding when to give this warning are very complicated, so don't rely too much on them. But a more subtle problem is that I often work on another data frame which is derived from a previous one, and while I experiment on it, I might not think about the fact that I'm modifying the original one. This is nothing specific to pandas; admittedly, it's a general problem when you work on objects without copying them. And so you might modify the original data frame without noticing. So in general, what you want to do instead is something like my_tmp = df.loc[:3].copy(), and that's the right way to go if you want to work on this smaller data frame. Okay. Now, one tends to think that the main addition of pandas over NumPy is represented by indexes. This is true, clearly, but it's maybe not the most complicated addition. Arguably, the most complicated addition over NumPy is the possibility to manage different types in a single DataFrame object. It's complicated not just for developers, but sometimes for users too. So let's create a stupid data frame with a given index and no columns, and let's create a column which is basically a copy of the index, because we are saying that column A is exactly the index. And then we do the same thing element by element. So for each element in the index, you... sorry, this is wrong.
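The copy-to-be-safe advice from a moment ago, as a small sketch (the frame and slice bounds here are illustrative, not the notebook's):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((6, 2)), columns=['a', 'b'])

# Derive a smaller frame to experiment on. Without .copy() you would not
# know, in general, whether writes reach the original frame or not.
tmp = df.loc[:3].copy()   # explicit copy: guaranteed independent
tmp.loc[0, 'a'] = 99

print(df.loc[0, 'a'])   # 0.0: the original is untouched
```

Being explicit with `.copy()` removes both problems mentioned in the talk: you no longer depend on the warning's complicated heuristics, and you cannot accidentally modify the original frame later.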
It's times two, so it's the same thing. And then you do the same element by element: for each element, you take the element, multiply it by two, and put it in the column. Now, certainly the second is less efficient. We all know that, and if you didn't, you saw my first notebook. But there's something worse than this. The first result is expected, right? We are populating a column with elements of the index. I didn't show you the index, but I can do it now: it's just 0, 1, 2, 3, etc., because it's a standard index, since I passed none. But what happens here? It's slow, but it's not just slow: it's a float column. And why is that? It's not the fault of pandas, and probably not even the fault of NumPy; there is some discussion ongoing. The fact is that integers have no representation for missing values. So if you create a column which is initially empty, it's going to be filled with missing values, and those force a float. So, for instance, take this df: I could say df.loc[0, 'B'] = 1. You would expect an integer column? No, it's a float, because I didn't tell pandas how to fill all the other rows. And this can be annoying. What is the solution? Well, it's very simple: you take some value which you use only to denote missing data, you first initialize the whole column with it, and then do whatever you want. Why am I saying this? Clearly you don't want to have this loop, but you might have to use Python loops for things that call external functions, or whatever. So in those cases, if you have to work with integers and populate a data frame bit by bit, just initialize it to some unused integer value beforehand, and you solve your problem. Okay. Again about data types, let's take a look at some not totally expected casting. So let's create a data frame with two rows and 1,000 columns.
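The sentinel trick described a moment ago can be sketched as follows (using minus one as the "missing" marker is just the convention from the talk):

```python
import pandas as pd

df = pd.DataFrame(index=range(5))

# Setting one cell of a not-yet-existing column fills the other rows
# with NaN, and NaN has no integer representation, so the column is float.
df.loc[0, 'b'] = 1
print(df['b'].dtype)    # float64

# Initializing the whole column to an unused sentinel first keeps it integer.
df['c'] = -1            # -1 plays the role of "missing" by convention
df.loc[0, 'c'] = 1
print(df['c'].dtype)    # an integer dtype
```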
So this is its shape. And as I told you, I want to be sure I'm working with integers and no missing values, so I just initialize everything to minus one, which is my marker for missing values. So everything looks fine. What are the dtypes? This is just the first four columns, but they are all integers, right? Now, maybe this is trivial for most of you, but it's a good time to say something about the internals. What you see of a data frame is basically this: you have the data, you have the columns, you have the index. What happens inside, to allow you to use multiple data types, is this: each column has a dtype, and under the hood there are NumPy arrays, true NumPy arrays wrapped in some abstraction, each storing the columns of the same type. So for instance, here we have one NumPy array storing the columns which are int64, then another storing the float columns, and another NumPy array storing the object columns, for example. So data types are a property of single columns: here I'm asking for dtypes, I'm getting one dtype for each column, and they're all integer, fine. Now let's create a stupid function telling me whether a number is even or not, and let's use it to do some operation on this data. Basically, in the first version I'm saying: for each column, take the element in the other row, and set the top row to that content plus one if it's even, plus zero otherwise. And then I do the same thing, but without the addition, only setting the value of whether it's even or not. The first takes 200 milliseconds; the second takes three times more. So the operation which does more work is taking one third of the time of the simpler one. What is happening? And remember, there are no missing values here; it's all about the integers. What is happening is the following: in the first method, I'm adding values, so I'm adding a Boolean to an integer, and it's automatically cast by Python, not by pandas, to an integer.
The second stores a Boolean instead, and since it's a Boolean, what pandas does is record it as a Boolean. But I told you that dtypes are a per-column property. So if this happens, the columns now hold mixed types, which means they all end up with dtype object. And that means two bad things. The first is that we are working with objects where we had Booleans and ints, so it's less efficient. The second is that pandas had to recast all of these columns to the new dtype. So the lesson is: pandas is great at holding multiple data types in a single data frame, but this happens only across columns; do not try it, or avoid it as much as possible, across rows. And talking about columns, let's use another stupid data frame, similar to the previous one, but currently empty, and let's fill it bit by bit. Okay, just give me a second; let me close some stuff to be sure I don't exhaust my memory. Better to waste five seconds now than later with memory exhausted. So what I'm doing is: I take each column and populate it with two integers, the index and minus the index; pretty trivial. And this is the result: nothing unexpected. How much time does it take? A bit too much. Now recall, there is no type casting here: there are just integers, and I'm setting them as integers. It's true, I'm adding column by column. So, three seconds. What if I had initialized the columns immediately? Exactly the same operation, but starting with a data frame whose columns are already initialized. What do you expect, compared to the previous one? Well, now it's too easy to answer: it's not faster. And why is it not faster? Let's make another attempt: let's add the columns initialized to minus one, an integer, and fill that one. It's way faster. Another hint: let's initialize to a float instead. Way slower. So what is happening? I told you that in a data frame, columns of the same type are grouped together in single NumPy arrays.
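The Boolean upcast just described can be seen without any timing (the frame shape is an assumption; the key fact is that a column mixing bool and int can only have dtype object):

```python
import pandas as pd

# An all-integer frame: dtypes are a per-column property.
df = pd.DataFrame(-1, index=range(2), columns=range(5))
print(df.dtypes.unique())   # a single integer dtype for every column

# A column holding a bool next to an int cannot stay bool or int:
mixed = pd.Series([True, 1])
print(mixed.dtype)          # object

# So writing boolean results into one row of an integer frame forces a
# dtype conversion of every column touched, which is the slowdown above.
```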
And this makes a lot of sense, for instance because when you have a data frame holding only data of one type, you want operations to be as efficient, or almost as efficient asymptotically, as NumPy. But now let's go step by step and see what's happening. If I initialize the data frame in one shot, the data really is stored in a single block; a block is a NumPy array, or some abstraction over one, with the shape I would expect. Now what happens when I set the first column to integers? It was a float data frame, so when I set one column to integers, I get what I told you: there is now one big float block and a small block holding the integer column. What happens when I add a second integer column? This is not what I told you: pandas is actually storing the two integer columns in separate blocks, so separate NumPy arrays. And why is it doing so? Because otherwise the operations we just executed would be way slower: every time, it would not just have to create a new block for the new column, it would have to re-merge all the columns of the same type into a single NumPy array. And this is an expensive operation, because it has to re-copy everything in memory. And so on, with another int block for each insertion. And indeed, you can check: there is an internal method that shows you whether the blocks are as I first described, that is, a single block for each data type. If we run almost any operation, for instance taking a stupid max, then we see that the blocks are restructured, or technically, consolidated. And yeah, now they are consolidated. So this is important to know in some cases, because the consequences can be really unexpected. Compare these two functions: the first, in each iteration of its loop, adds a column and then takes the max; the other does the same, but in two separate loops. Technically, the two-loop version is the more expensive operation, because each of its maxes runs on the whole data frame, while in the first version the max runs only on the columns inserted so far.
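Block fragmentation can be observed directly; note that `_mgr` is an internal, version-dependent pandas attribute, used here only to peek inside, so treat this as a sketch:

```python
import numpy as np
import pandas as pd

# Inserting columns one by one: each insertion creates its own block,
# because re-merging same-typed columns on every insert would be costly.
df = pd.DataFrame(index=range(100))
for i in range(5):
    df[i] = np.arange(100)
print(len(df._mgr.blocks))   # one block per inserted column

# Building the same data in one shot stores it as a single block.
df2 = pd.DataFrame(np.tile(np.arange(100)[:, None], (1, 5)))
print(len(df2._mgr.blocks))  # 1
```

Operations like a reduction may later consolidate the fragmented blocks into one array per dtype, which is the expensive step the talk warns about.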
Let's compare them. So: the operation which is in principle cheaper is taking more time. And why is that? Because in every cycle of its loop, the data frame is being re-consolidated. So, long story short: if you have a loop in which you have to work on the columns in this way, typically adding them or changing their type, then do it all in one go and do not interleave other operations. Pandas is smart enough to re-consolidate only when it has to, but when it has to, it's an expensive operation. If we had started with integers, we wouldn't have had this problem; or actually, it's still there, but much less important. Okay. Again on data types, from another point of view: let's create an ugly data frame. Well, it's ugly, but not too much: it's got different data types, but they are all nicely ordered in different columns, so this is the right way to work. And as an economist, I like pandas because it allows me, for instance, to put the name of a country and the ID of a firm in the same data structure. So the dtypes are what you would expect. Maybe some of you would not expect object, but pandas does not have a dedicated string type, so strings are cast upward to object. What happens if I ask for the mean of this strange beast? Well, it's fairly smart. It's saying: column zero is integer, the mean is 1; column one is float, it's 1.5; column three is 17; column two is not numeric, so I'm not going to try to take a mean out of it, because it makes no sense. Great. What will happen if I pass axis equal to one? So I'm saying: okay, take the mean, but across the other axis. Again, it's pretty smart: it's doing the same identical thing, so excluding strings, but on the other axis. Good. Now, axis equal to one intuitively means: run this operation on the transposed version of this data frame. So we should expect the mean along axis one and the mean of the transposed frame to be the same object.
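This expectation can be put into a sketch (the frame here is an assumption with the same flavour as the talk's: int, float, and string columns; on recent pandas, `mean()` on object columns raises instead of silently skipping, so `numeric_only=True` is passed to mimic the behaviour described):

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 1],           # int
    'b': [1.0, 2.0],       # float
    'c': ['x', 'y'],       # strings, stored as object
})

print(df.mean(numeric_only=True))          # means of 'a' and 'b' only
print(df.mean(axis=1, numeric_only=True))  # row means, still skipping 'c'

# Transposing first is NOT equivalent: every column of df.T mixes dtypes,
# so every column becomes object and no numeric column survives.
t = df.T
print(t.dtypes.unique())                   # all object
print(t.mean(numeric_only=True))           # empty
```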
And instead, we get a ValueError. Why is that? Well, precisely because axis equal to one is smarter than that. When you transpose, all the dtypes which were nicely ordered in columns are now spread across rows. That is, each column of the transposed frame has different dtypes, and if a column has different dtypes, its only possible dtype is object. And if it's object, mean is not going to try to compute an average out of it. So you get an empty series: a series of only the columns with numeric dtype, of which there are none, because all columns are object now. Okay. Whoever used pandas for more than 10 lines of code probably did some group operations, because pandas is really good at this: it's pretty efficient compared to the alternatives. And probably you used, as I mostly do when I'm lazy, apply. Apply is a very powerful function. Now, pandas is efficient at group operations, but this does not necessarily mean that apply is efficient. And why is that? Well, take a real-life, also stupid, data frame, which contains a date, a ticker, a bid and an ask price, and ask for the mean by group with apply. It works. And let's just check what it's producing, for reference: it's just taking the mean of bid and ask, because the rest is not numeric. Now, this took 91 milliseconds. Is this the best we can do? Not at all, by far. And what is the problem here? It's that group-by operations can have very different characteristics. You can have what is called an aggregation in pandas: you have several groups, and on each group you want to condense the information into a single result. Or you can have, for instance, transform group-by operations, which means that each group does some operation involving the group structure, but then returns a value for each value passed in. They're pretty different. For instance, in a transform, I could multiply each element by two.
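The aggregate-versus-apply point can be sketched on a toy version of the quotes table (column names follow the talk; the data are random, and the date column is omitted for brevity):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    'ticker': rng.choice(['AAA', 'BBB', 'CCC'], size=n),
    'bid': rng.random(n),
    'ask': rng.random(n) + 1,
})

g = df.groupby('ticker')[['bid', 'ask']]

# apply() must call the function group by group and inspect what it
# returns to guess whether you meant an aggregation or a transform.
means_apply = g.apply(lambda grp: grp.mean())

# mean() knows it is an aggregation and takes a dedicated fast path.
means_agg = g.mean()

print(means_agg)
```

Both produce the same per-ticker means, but the dedicated aggregation is typically much faster, which is the talk's 91-millisecond point.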
It's a very stupid thing to do, but it's an example of an element-wise transform. Apply is very smart: it's trying to understand what you want to do, and it understands this by looking at what you return from the passed function. It's smart, but this smartness has a cost, and this means that you want to use it only if aggregate or transform is not doing what you want. And I say this because apply, for instance on Stack Overflow, is very often suggested and used. It's very powerful, but it's not particularly efficient, which is sad, given how efficient a group-by object is per se. Now, something about multi-indexes, which are one of the things I love most in pandas. Let me just close some stuff. Okay. Let's create a data frame which is, again, very stupid, but has something new: a multi-index on the index. Sorry, let me skip this, because I have 10 minutes and more interesting stuff. Pandas is great also because it's very coherent: you have a data frame, you access a row and you get a series; you access a column and you get a series. This said, don't be fanatical about it. Say we have a data frame with, for instance, "a b c" in each cell of its only column, and we want to access the pieces separately, which is a very stupid operation, but one I often end up having to do. We can do this: for each element in the column, apply a function which takes x and gives me back a Series with the pieces of x. Very straightforward, very elegant: since the data frame is conceptually made of series, I'm just producing series and putting them in a data frame. Very elegant, but not necessarily the best way to operate, actually. Because if we compare this, that is, the line of code I just showed you, with a version in which I'm just returning the result of x.split(), the difference is huge. And the difference is huge because we are wasting a Series to hold three elements.
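The splitting comparison can be sketched like this (assuming each cell holds a whitespace-separated string such as "a b c"; the `.str.split` variant is one idiomatic way to avoid the per-row Series):

```python
import pandas as pd

df = pd.DataFrame({'col': ['a b c'] * 1000})

# Elegant but slow: one throwaway Series is constructed for every row.
parts_slow = df['col'].apply(lambda x: pd.Series(x.split()))

# Same result without building a Series per row.
parts_fast = df['col'].str.split(expand=True)

print(parts_fast.head(1))
```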
A Series is a complicated object, which is worth using either if I have a lot of data, or if I need complicated indexing, or both; not for three elements which could be in a list. So in this case, it's really a waste of computing power to have all these Series instantiated just to fill the data frame. Finally, this is not strictly speaking about pandas, but it's about HDF, which is probably the most used, or at least I think the most accessible, format to store pandas objects on disk, because it holds anything: it keeps the index structure, the dtypes, et cetera. That said, even here you don't have to be fanatical. Okay, no problem: we are just creating 1,000 series, the first with one element, the second with two, et cetera, et cetera; a very stupid thing. And we are storing them as separate files in a folder, and then we do the same thing, storing them in an HDF store. You already see it's much slower. I wouldn't have expected it to be so much slower, but let's keep hoping. And then, what do you guess: is it going to use less space on disk? This is 12 megabytes; this is 5.8. So in general, HDF is great because it's good at storing big amounts of data and accessing, for instance, single rows. But don't think it's the best way to store small objects, because it actually has a lot of overhead that can use way more space than your data. If I have five more minutes, I'd like to show just one additional thing about multi-indexes, which can be unexpected. And it's the following. Now, this is a data frame, as I said, with a multi-index on the index. What is the problem with this? Let's stay generic. Let's change this a bit and put four, five, six in the second level, so they are all numbers. Now, what is asking for "one, four" going to return? Well, you probably see the ambiguity perfectly: I'm asking for (1, 4), and this could be interpreted in two ways.
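On a small sketch of such an index (the levels here are an assumption, chosen to echo the talk's "one, four" example), the two readings, plus the tuple-versus-list distinction discussed next, look like this:

```python
import numpy as np
import pandas as pd

# A two-level MultiIndex: first level 1 or 2, second level 4, 5 or 6.
idx = pd.MultiIndex.from_product([[1, 2], [4, 5, 6]])
s = pd.Series(np.arange(6), index=idx)

print(s.loc[(1, 4)])  # a tuple is parts of ONE key: the scalar at (1, 4)
print(s.loc[1])       # partial indexing: the three rows under first level 1
print(s.loc[[1, 2]])  # a list is SEVERAL keys: everything under 1 or 2
```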
I could say: okay, I have a multi-index, and I look up (1, 4) in the multi-index; or: I have an object with two dimensions, and I look up 1 on the first dimension and 4 on the second. Because pandas allows you to do partial indexing, this works, and it maybe works as most of you expect. But don't try too hard to guess. In general, you want to be more explicit: either you write the form which says exactly "in the multi-index, which I have on the index dimension, take (1, 4)", or the form which performs the opposite, dimension-by-dimension operation. Talking about this: here I used a tuple. Now, in Python, most of the time we are used to the fact that a tuple and a list are actually the same thing except for implementation details: a list can be modified, but it's less efficient. In pandas indexing, that's not true, and it's not true for good reasons. You can see the results are very similar, but conceptually they are different, and it's wrong to use one to do the job of the other. The difference is that a tuple means parts of the same key: it says "I want to access that one key", or better, "anything starting with 1". A list instead means "I want to access the key 1, which is partial indexing, and possibly other keys, for instance 2". And with a key that doesn't exist, one form is going to fail. So in some cases you can obtain what you expected even using the wrong type, but it's a simple association and it's good to stick to it: tuples are parts of keys, lists are lists of keys, or of parts of keys. And with this, I think I'm done.

Thanks very much for your talk, Pietro. We have time for a couple of questions.

Just a comment on the earlier stuff about mixing dtypes when you're looping over arrays: it seems to me that it all falls under the general topic of "you shouldn't loop over a data frame if you can avoid it".
So, sorry: with the earlier examples that you built by looping over data frames and inserting things in different ways, causing these problems with mixed data types and deconsolidated underlying arrays, it seems to me that a lot of that can be avoided by the general principle that you shouldn't really be looping over the data frame in this way at all. Do you agree with that?

Could you please speak into the microphone? So, the question was: for most of the examples shown, not just the first notebook, the lesson could also be "don't loop over the data". Now, the general lesson is indeed: don't loop over the data if you can avoid it. Then it all depends on what you have to do. There are functions I might need to call from external libraries. Or maybe I'm looping over the data by groups, so I'm doing the best I can to vectorize, but I need to work on each group individually. Certainly, "try to vectorize as much as possible" is always a good lesson in NumPy and pandas, and pandas is, I think, really good in the extent to which it offers, for instance, group-by operations which are ready to use and vectorized. Still, many of the examples I gave are deliberately stupid, but there are real cases. And by the way, the casting problems are not necessarily related to looping: they can come up at other times too. For instance, if I set all the values I know in a series, but some are missing and the values were integers, they will still be upcast to float even if there is no loop involved.

On plotting: I'm not the most expert on this. I'm very happy with the pandas integration with Matplotlib, not because it's perfect, but because it's very handy and allows you to refine the result using Matplotlib itself.
But I know many people are now talking about, what is it called, Bokeh for the web, which I should try; out of sheer ignorance I cannot judge it. The integration with Matplotlib is very nice, and as a researcher the web interface is less interesting to me, but again, I'm no expert.

This is more of a comment, really. You, as an economist, how do you find searching for micro-optimizations when using Python? Because the basic tutorials I have seen, and the documentation, really just provide a basic explanation of each method. You, as a non-developer, how do you get around that?

So the question was on micro-optimizations: how do you search for this kind of thing as a non-developer? I'm not sure I get the question.

Well, how did you find out about the micro-optimizations you showed us here today? Did you search through the source code? How did you get around it?

Well, my honest answer is probably that I discovered Jupyter notebooks, which allowed me to try five things and keep the one working best. I was helped a bit by a quick look at the code of pandas, which is very complex, but in some sense the main concepts are pretty clear. By the way, as an economist I still miss some things in the Python ecosystem in terms of estimation methods, but not in terms of manipulation: I'm pretty sure I have a better life than all my colleagues using other software, despite some corners we still have to smooth.

Is there a source of more advanced information on pandas somewhere? No, nothing organized. The docs are reasonably well organized, but not complete. I mean, they are not terrible either, but they would benefit from many more examples, and if you come to the sprint on Saturday, we can work on that too. And then there are many people reporting their experiences in single Jupyter notebooks; specifically in the economics field very few, but it's growing quickly. Let me just add one thing.
I talked about estimation methods, so let me add that when Python is not enough, the good interoperation between Python and R helps a lot, and rpy2 is the library of choice for this. Let's thank Pietro for his deep insights into pandas.