Okay, so after the previous great talk, where we learned how easy and fun pandas is, this time we'll hear from Marc Garcia, who's a maintainer of pandas, about what is broken with it and how we can break it even more.
Thank you. Can you hear me? Is the mic okay? Okay. Yeah, thanks a lot for attending my talk. I'm losing my voice and my health is not perfect, so my tone might be a bit low. If I don't sound very excited, it's not because I'm not excited about being here. I'm very excited; I'll just try to keep my tone low enough to have a voice until the end of the talk.
So I'll start talking about breaking pandas, and to break pandas, what better than a live demo? It should be an easy one, because pandas is expected to break, so if it breaks, that's good. That's usually the main worry about demos. Let me see if I can make it big enough. Okay.
Let me start by importing pandas. There is a new feature of pandas: you can import pandas without aliasing it as pd, and it works in the latest version. Okay, this is master. Don't try it in your version; it might not work.
Then I'll open some data. I didn't show it long enough, but I'll get some data about the presidents of the United States and I'll try to answer these questions: who was the youngest to die, who took office being the oldest person, and who was president for the shortest period.
pandas has really many cool features, including that you can provide anything that is kind of reasonable as the file you want to open. Even if it's on a website and needs to be downloaded, and even if it's compressed, pandas in general takes care of everything. Okay, I was trying to make things a bit complex for pandas, but it works well. Now I need to try to answer the first question.
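
A minimal sketch of the loading step described here, assuming a local gzip-compressed file instead of the website used in the demo. The records, the column names (`name`, `birth_year`, `took_office`) and the file name are illustrative assumptions, not the dataset from the talk.

```python
import gzip
import json
import os
import tempfile

import pandas as pd

# Illustrative records only; the dataset used in the talk is not reproduced here.
records = [
    {"name": "George Washington", "birth_year": 1732, "took_office": "1789-04-30"},
    {"name": "John Adams", "birth_year": 1735, "took_office": "1797-03-04"},
]

# Write a gzip-compressed JSON file to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "presidents.json.gz")
with gzip.open(path, "wt", encoding="utf-8") as handle:
    json.dump(records, handle)

# pandas infers the compression from the ".gz" extension.
presidents = pd.read_json(path, orient="records")

# JSON has no date type, so the date-like column arrives as plain strings.
print(presidents.dtypes)
```

The same `pd.read_json` call also accepts a URL, which is what the demo relies on; a local file is used here only so the sketch runs without network access.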
Okay, who died at the youngest age? What I will do is just take the death year and subtract the birth year. Was that the question, or did I... Yeah, who died at the youngest age. Okay. And let's answer some more questions, like who took office being the oldest. Well, actually I didn't show the president. And yeah, for the first question the answer is Kennedy. I wonder why that is.
Let's go for the second question: who took office being the oldest? For that, from took office I'll get the year, and then I'll subtract the birth year. And I have an error. pandas is not really broken here. It's just that JSON doesn't support a datetime format, so everything that we got from the JSON that looked like a date was actually a string. So let's just start converting things.
If you know anything about pandas, you might know about the dtypes. We can also see an interesting thing here, which you run into in many cases if you're a regular pandas user: the death year is converted to a float. That's because there are some presidents that are still alive, so there are missing values in this field. If I try to convert it back to an integer, that fails. It will tell me that there are missing or infinite values and they don't have a representation as integers. It's not a pandas limitation; it's more like a hardware limitation. There is no representation for missing values in integer types. But in the newer versions of pandas we actually have an integer type that... What is that? That's not how I expected it to break. Okay, I have... yeah, okay, I repeated it here.
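
The float-versus-integer behaviour described here can be sketched as follows. The values are made up for illustration; `"Int64"` with a capital I is the nullable integer dtype the demo is reaching for.

```python
import numpy as np
import pandas as pd

# A missing value forces NumPy-backed integers to become float64.
death_year = pd.Series([1799, 1826, np.nan])
print(death_year.dtype)  # float64

# Casting back to a plain NumPy integer fails because NaN has no integer form.
try:
    death_year.astype("int64")
except ValueError as exc:
    print("cast failed:", exc)

# The nullable extension dtype keeps integers and marks the gap as missing.
nullable = death_year.astype("Int64")
print(nullable)
```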
Yeah, that's actually still... Sorry, yeah, sorry about that. Let me just get the data again; that should be easier. Let's reassign the index. So, yeah, and now that's right. Now this actually got converted to an integer. If I get the dtypes, it's a different one: you see the capital I versus the lowercase i. The lowercase is for the original type, and the capital is for the one that supports missing values.
Okay, and then the original problem. I'm going a bit too slow, but I'll try to speed up a bit. Took office needs to be a datetime; that was actually our original problem. And the same for left office. It's probably looking better. It will look the same; we cannot tell from the output whether it's a string or not. And this one that was broken is now returning me the year, and now I can check who was the oldest one. It was Mr. Trump.
So now we can do something else. I shouldn't do this, as I don't have that much time, but I think it's reasonable if I want to clean up my types. Party is just a limited set of values. I don't think it's really required to store them as separate strings. I can just encode them as categories, and that's a much better representation. So now I think I don't have objects anymore. Objects are slow: they are Python objects, and they are slow in pandas.
So now what I wanted to know is who was president for the least time. There is something quite nice in pandas: we can create an in-office column that would be an interval between two dates. For that, the syntax is a bit tricky.
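
As a rough sketch of the datetime and category conversions from this part of the demo: the rows and column names below are illustrative assumptions, and only `pd.to_datetime`, the `.dt.year` accessor and `astype("category")` are taken as given.

```python
import pandas as pd

# Illustrative rows; the column names are assumptions, not the demo's dataset.
df = pd.DataFrame({
    "took_office": ["1789-04-30", "1797-03-04", "1801-03-04", "1809-03-04"],
    "birth_year": [1732, 1735, 1743, 1751],
    "party": ["Unaffiliated", "Federalist",
              "Democratic-Republican", "Democratic-Republican"],
})

# Strings that look like dates become real datetimes.
df["took_office"] = pd.to_datetime(df["took_office"])

# With a datetime column, the year is available through the .dt accessor.
age_at_inauguration = df["took_office"].dt.year - df["birth_year"]

# A small, repeated set of strings is stored more compactly as a category.
df["party"] = df["party"].astype("category")

print(age_at_inauguration.tolist())
print(df.dtypes)
```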
I'll discuss this a bit later. Then I will say that I want it from when the president took office until the president left office. And this is breaking. In this case it's because I did something... Ah, yes. It's good, this teamwork, that you tell me what's wrong. You can come to my office if you want; I need this all the time.
What it's telling me is that there is a limitation in the way that intervals are represented. They can have missing values, but only a whole missing value, not one side of the interval missing. So either the whole thing is missing or it is not. In this case we have a problem, which is that one of the presidents actually didn't leave office, so there is a missing value. Okay, that's a problem that we are going to fix, and actually there are many ways. I think I can let you choose: we can make him leave office today, we can change the value so he never took office, or we can delete him totally. That's very easy to do in pandas.
Yeah, let's say that, with df.loc, the president in question is Donald J. Trump, and I'm going to say that he left office today. That's quite cool. I hope you're happy because our interval array will work, not for any other reason. Okay. If you would prefer, we could even go to the past and say, okay, he actually never took office.
So yeah, in any case we can create our interval. We created our interval, and now we have this column that is an interval, and from it we can get in_office.array.length, which will return the length. And now if I want to use this, I need to convert it to a Series. There are many details to this, but this is really not a great syntax.
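
A minimal sketch of the interval column and the `.array.length` detour described here, assuming `pd.arrays.IntervalArray.from_arrays` is the constructor in question (the transcript does not name it clearly). The dates are illustrative.

```python
import pandas as pd

took_office = pd.to_datetime(["1789-04-30", "1797-03-04"])
left_office = pd.to_datetime(["1797-03-04", "1801-03-04"])

# Build the interval column from its left and right edges.
in_office = pd.Series(
    pd.arrays.IntervalArray.from_arrays(took_office, left_office)
)

# Series has no direct .length, so go through the underlying array and wrap
# the result in a Series again to use the .dt accessor.
days = pd.Series(in_office.array.length).dt.days
print(days.tolist())

# An interval with only one side missing is rejected.
still_in_office = pd.to_datetime(["1797-03-04", None])
try:
    pd.arrays.IntervalArray.from_arrays(took_office, still_in_office)
except ValueError as exc:
    print("cannot build interval:", exc)
```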
That's kind of the point of this talk. Now we have the exact days that they were in office, and we wanted to know the shortest. So we've got that. Yeah, so this is the guy who was in office the least. And I actually made a mistake at the beginning when I was showing who was the oldest to take office: now it is actually Ronald Reagan, because we removed him from office. But yeah.
So the point: there are a couple of things. Actually, some of the problems that I'm showing here didn't fail, because I did things in an order that I was not expected to. But the first thing is that we converted things to datetime very easily with pandas.to_datetime. We converted things to interval with extremely difficult syntax, I would say. That's something very easy to solve, but I think it says something, which will come later. We also have this column .length: where we could have just called length, we actually had to create this whole thing here, and we had to call this .array, which is the underlying representation in pandas, something kind of internal, just to get the value we want. And we didn't make this fail because I didn't cast the type before getting the first president, but actually, after converting to the integer with missing values, you cannot compute idxmax.
So why is that? pandas relies on NumPy. All the internal representations are NumPy-based. The internal types of NumPy were the origin of pandas, and the support for them is extremely good. But pandas implements many other things, and they are growing, which are what we call extension arrays: datetimes that are aware of the time zone, which is something that NumPy doesn't support, also timedeltas, also period. Interval is the one that we saw, the nullable integer is one that we also saw, and we also have categorical and sparse. So, as you saw, there are several things that in
pandas don't work quite the same, or are not as mature, when you are using extension arrays as when you are using NumPy types.
That was supposed to be a live demo, but I'm already quite late on time, so I'm going to just run it like this. Okay. This is kind of important: pandas 0.24.2, the latest version, has small differences in this syntax. Nothing important or relevant, but just so you know, if you try to run these examples, it needs to be on master.
Okay, we can define our custom types. Don't worry that much about the exact syntax, but we just give it a name, and we say that the type of an element in our Series, in our arrays, will be an object, in this case a Python object. What I'm doing is just storing things internally in pandas in a Python list instead of in NumPy. Then I create this array class that defines how you are going to get the data and how you are going to store it: computing the length, converting it to NumPy, adding two values, which dtype it is associated with. Nothing really fancy, and quite short. And with this we can do things like creating our own Series with our own types.
Okay, that's kind of straightforward if you don't call many operations. In this case I implemented addition, so I can just add one here easily. I didn't implement subtraction, so if I try to subtract something, it will fail. But I've got my custom type. It's a very small version.
And then, in many cases, if I'm implementing my custom format, what I also want is to have specific operations, like we saw with interval. We don't provide this for interval yet, but it should be easy to implement if anyone really cares about this data type. That is the accessor. So I have my Series, and on my Series I can define this accessor, with whatever name I want, which is what I register here.
I provide a class that has a property, in this case, and then from a Series I'm able to access this property that I define here. Okay, so in this case I just have a simple function that goes to the Python object and returns the type. So this is how things are represented in pandas. Like I said, NumPy is different: it's simpler somehow, closer to the core. Everything that is a custom type in pandas is actually implemented in this way, and as you see there are different levels of maturity. It's kind of recent; this was introduced in recent versions and there is still work in progress. But you see that it's actually making things much easier on our side.
So after this, the idea here about breaking pandas is that pandas is huge. The API of pandas is huge, and what we have is that it's growing even more, and our code is being affected by that. What is getting better is that we're implementing these extensions, so not everything needs to be in pandas. You can implement things separately, like the arrays and the types that I showed, of which the interval one, for example, is included in pandas, and the accessors to access them. And there are some examples of this already. cyberpandas is using that: if you want to represent IP addresses, you can represent them with a specific format and have specific operations. fletcher is an Arrow backend for representing, for example, strings in a very efficient way, which pandas is not, because it's using Python strings.
So I think the point of my talk is just to explain that we need to keep breaking pandas and making pandas more extensible, and actually even making the internal things that pandas provides work through an API. One of the first things, which is already almost implemented and should be ready for the next version, is plotting backends.
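
Going back to the accessor mechanism described above, a small sketch of how one could be registered for interval data. The accessor name `interval_tools` and the class are hypothetical, chosen for illustration; only `pd.api.extensions.register_series_accessor` is the actual pandas entry point.

```python
import pandas as pd


# "interval_tools" is a made-up accessor name for this sketch.
@pd.api.extensions.register_series_accessor("interval_tools")
class IntervalTools:
    def __init__(self, series):
        # Raising AttributeError makes the accessor unavailable on other dtypes.
        if not isinstance(series.dtype, pd.IntervalDtype):
            raise AttributeError("interval_tools requires interval data")
        self._series = series

    @property
    def length(self):
        # Hide the .array detour behind a property that returns a Series.
        return pd.Series(self._series.array.length, index=self._series.index)


spans = pd.Series(pd.arrays.IntervalArray.from_arrays([0, 2], [1, 5]))
print(spans.interval_tools.length.tolist())
```

With this in place, `spans.interval_tools.length` replaces the `pd.Series(spans.array.length)` pattern, though the accessor name still has to be typed on every call.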
I'll discuss each of them in a second. Another one, which is under discussion and we'll see what happens, is I/O plugins. And then there is having direct methods, the same as the accessors. As we saw, this thing over here is an accessor, but the idea is being able to have direct methods: if my data is interval, why not be able to call length directly, and not that?
So, plotting backends. There are already several plotting backends. They are using different APIs, and they are doing monkey patching and weird things that I think they shouldn't need to be doing. I think we should be better in pandas at letting them interact with us. So for example, if you want to plot in pandas you would plot like this. In HoloViews, if you use this extension, this plugin or module, you would use this function. So it's totally independent of pandas. pandas doesn't play nicely at letting you plot in whatever you want.
What we are adding now is a pandas option where you can specify the plotting backend, and you will have the same API for every backend. You will be able to plot in Bokeh, you will be able to plot in Altair or whatever it is, and third-party developers will be able to provide modules so that you can plot in anything. If you want to plot in ASCII, why not? Just do it. So we'll have one API to rule them all. I think that's a good thing. The packages would also be more consistent among themselves, it would be easier for developers to get started, we'll have better documentation for the API, and everything would be more uniform. I think that's a really good thing.
In terms of I/O plugins: okay, pandas supports all these input and output formats. It is able to read from CSV. I'm sure you know read_json.
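
Returning to the plotting backend option described above, here is a toy sketch of the idea, assuming a pandas version that ships the `plotting.backend` option (it was still being finished when this talk was given). The module name `ascii_backend_demo` and its `plot` function are made up; a real backend would be an installable package rather than a module built at runtime.

```python
import sys
import types

import pandas as pd


def plot(data, x=None, y=None, kind="line", **kwargs):
    """Toy backend entry point: draw each value of a Series as a row of '#'."""
    series = data[y] if y is not None else data
    return "\n".join(
        f"{label}: {'#' * int(value)}" for label, value in series.items()
    )


# Register a throwaway module so pandas can import it by name.
backend = types.ModuleType("ascii_backend_demo")
backend.plot = plot
sys.modules["ascii_backend_demo"] = backend

terms = pd.Series([2, 1, 3], index=["a", "b", "c"])

pd.set_option("plotting.backend", "ascii_backend_demo")
try:
    # The usual .plot() call is now routed to the backend's plot function.
    chart = terms.plot()
finally:
    pd.reset_option("plotting.backend")

print(chart)
```

The user-facing call stays `terms.plot()` whichever backend is selected, which is the "one API" point made in the talk.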
You just saw it in the example. It is able to read from HTML, from the clipboard, from Excel. Parquet is quite a great format to use. We also have pickle somewhere in here, and also some commercial software like Stata, and Google BigQuery. So there are lots of things in here. And then there are these other formats. This is not proper I/O; these are different in-memory representations, like Python dicts, LaTeX, or strings, or whatever.
But there are many that are missing. For example, it is able to export and import Excel, but not the open standard formats, like ODS for LibreOffice. XML: I assume Java people would be happy if we were able to export things like DataFrames in XML. Maybe generating an image, maybe a PDF; there was someone who developed a library and was asking us if we wanted to include it. Markdown: we got an issue for that. And I'm quite sure with time you will get more formats in here.
So what I personally think is a bit wrong is that at the moment we're making a decision, we're taking the responsibility of saying what is in pandas and what is not, and what is not in pandas is not supported, and it's just some library that we don't care about. What I think we should probably do is make every one of these I/O modules a plugin, even if we distribute them with pandas and for the user there is literally no difference. We should be creating an API where we define how you are able to access pandas objects for importing and exporting data in a certain way, implement all our own modules in that way, similar to what we're doing with extension arrays, and then third-party developers can develop theirs, and you are able to use them in the same way as you would use ours. I think that would
make the code much clearer. It would create a much better code structure, where we have several modules that all look the same. We'll empower other developers by saying: okay, you're welcome to create your formats. We don't know about all the formats in the world, but you do, because you need them, so you can create one and it will be really well supported by, or integrated with, pandas. The API would be consistent no matter if it's a third-party module or not. There won't be big differences between core and third party, and for the user it would be as easy as conda install your plugin and import plugin as pl, just to make sure that people in the scientific Python world are happy having a short alias for it.
And it would also probably delegate some of this work. I personally believe that Google, with gbq, Google BigQuery, is actually charging users for the data, for the bandwidth that they use with their APIs. So it literally means that every time someone calls pandas.read_gbq, Google is getting money out of it. We're maintaining this library in our free time, which sounds a bit wrong to me. So I would personally take this out of pandas and have it as a third-party module, and let Google take care of that. I'm sure they will be happy to do that.
And finally, going to the direct methods. That's a bit more controversial, because it's going a bit far, even for what I think we should be doing. We saw this before: we can have something like our column .interval.length, with the accessors we already have, or have this weird syntax. Or we can even go further and say: actually, I use intervals a lot, so why do I need to be calling interval all the time? I'm not doing that for numerical operations.
I'm not calling an accessor for them all the time. So I could just inject some methods directly into DataFrame and Series. Interesting fact: Series has at the moment 237 public attributes and methods. Really? Okay. Well, it looks like I'm out of time. Just give me one minute and I'll finish. I actually got a different timing here. Anyway, it's full of first-class things for numerical values, but there is not even one for intervals. I think we should be injecting those, because if you are using intervals or other structures that are not NumPy-based, you are basically a second-class citizen and you cannot use direct methods like that. Please forgive us for that.
And I think I'm going to conclude here. I've got a couple more slides, but it looks like I thought I had 30 minutes and I got 25. Okay. Well, anyway, thank you. Thank you very much.