So, thank you so much for coming to my talk. This time is a little bit special because of the lady who just introduced me, Chin. I would like you to give her a round of applause, and I will explain why: we have known each other for six, seven years, a long time, and without her I wouldn't be here, because she brought me into the community. So that's why I want you to give her a big round of applause.

Okay, so let's go back to the talk: Polars versus pandas. I really love the title because I love bears. So a panda, well, is it a bear? I think a panda is a bear. I do love both animals, and it's very interesting that there are two libraries named after the two different types of bears that I really love, so it's good to compare them, I think. It's difficult to choose, I love both, but I will show you what the difference is, and I will show you how you can take advantage of having these two libraries available to do similar things.

I always say this is the most important slide of the slide deck, because you get the whole slide deck from it. If you have this slide, you don't have to take any more pictures. You've also got my contact details, so if you want to connect and ask me questions afterwards, you are welcome to do so.

A little bit about me. Of course, Chin just introduced me. I contribute to a lot of open source libraries, and I also organize a lot of events, including this one, so I'm very busy. I used to serve on the EPS board, but right now more of my work has moved to the Python Software Foundation. I'm a Python Software Foundation Fellow and a new director this year. So I will start a new role next month, but I think it's a bit early to announce it publicly, so if you want to know where I am going next month, you can talk to me.

So, who knows what Polars is? Not the bear, the library?
Yes, yes, okay, okay. So I hope that this talk is not going to bore you. I think I also need to ask who is using Polars instead of pandas. So no pandas, just Polars? Did you transition from pandas to Polars? Oh, one person.

So Polars, to be honest, is not pandas 2.0. When I first heard of it, I made this wrong assumption, like a lot of people I think, that Polars is the flashy new pandas. Pandas is cool, but they are actually quite different, because first of all, Polars is actually not a Python library, it's a Rust library. Why do I make this claim? So, who was at the workshop yesterday learning Rust? Yes, some of you? No? Oh, okay. I'm also learning Rust, and I quite enjoy it. One of my motivations for learning Rust is that quite a few of the libraries we use in Python are actually written in Rust, and the reason we can use them in Python is that they have a Python API. One of those libraries is Polars. It's written in Rust, and if you go to the documentation, you will see that it can be used in Rust, of course, where it is native, and in JavaScript, so it's not just for Python. But we love it, we can use it with Python, and we can take advantage of Rust while using it from Python. I will explain why that's good.

Some people consider it a pandas alternative. I personally think there are different reasons for using each one. So first of all, some information about why Rust. Because only a few of you have been to that workshop, maybe some of you are not quite familiar with Rust, so I will try to explain a little bit of why Rust seems quite popular nowadays.

But first of all, let's check how much we know about pandas. Pandas is actually a wrapper around NumPy. We all know that, right? Double check. Yes? Okay, good. This is a picture I stole from the internet; you can go to the source there. So if you are looking at a pandas data
frame, the internals are actually a bunch of NumPy arrays of different types that get wrapped and combined to become a data frame. If you're brave enough to look at the internals of pandas, you will see that a data frame has different blocks corresponding to different data types, and underneath each one of them is a NumPy array. So it's actually using NumPy.

NumPy, again, is not written in pure Python; it uses a lot of C code. If you are very, very brave, you can look at the source code of NumPy. I've done it before. One funny joke is that I tried to contribute to NumPy five years ago, and that PR almost lived as long as my visa. It has recently been closed, which is good, because I gave up. So, yeah, C. NumPy is not that straightforward; it's not written in pure Python, there's a lot of C code.

The reason is this. Who uses NumPy? Raise your hand. Okay. Oh, so quick. I was about to say, if you are familiar with NumPy and can use it very efficiently, then keep your hands up. Yes, okay. So we all know that NumPy is fast. If you use it correctly, it's very efficient, because it uses all these ufuncs, which are written in C, compiled C. That's why it's fast. If you write it in pure Python, if you do what I did when I first used NumPy, which is using a for loop, it takes forever, right? Because the for loop in Python is slow. That's why we use ufuncs in NumPy, and that's why it's fast.

So now we are comparing Polars, which deep down is Rust, and pandas, which deep down is C. So why? You know, some people's hot take is that Rust is better than C. So, as a
learner of Rust, I think Rust is actually quite easy to learn compared to C; I struggled with C when I was in school. Rust is also considered a memory-safe language, because Rust has this very strange ownership rule. Those of you who have been to the workshop know what I'm talking about. It's like, oh, why all these ampersands, what's going on? It's because Rust is trying to enforce this kind of rule to make it safer, to make sure that developers can't mess up very easily. For example, if you have written C or C++, that kind of C-family code, you know that you can easily create a pointer pointing to nowhere, something like that. That's considered not memory-safe: you are trying to access memory that your program doesn't own, or you don't know what it is. So the rigorous compiler checks of Rust actually make it safer in terms of memory. Because it's so rigorous, if you can compile your code, it's probably safe. You can't really access something that you shouldn't, and you shouldn't have a lot of problems with garbage collection, memory leaks, whatever funky thing you can mess up, race conditions, all those things. So that's Rust and C, and that's basically one of the reasons why I think Polars has an advantage there.

But if you are already convinced and you're thinking, oh, maybe I should try Polars, what should I do? I'm trying to tell you that it's actually very easy to learn Polars, because Polars also has the data frame and the series, so it's kind of like pandas. You know, a series is data of the same type, it has an index and a name, and then you combine them and they become a data frame. It's very similar, so you don't have to change your way of thinking about how a table of data works.
Polars also has data types. It has numerical values, so we have all these different integer values and floating point values. Because Rust also has all these unsigned integers and signed integers, they appear in Polars. Polars also supports datetime objects, and objects. Do you know what that is? What is an object in your data frame? Strings, yes. So objects are actually strings, and Polars also has object data types, so it's more or less the same; you can basically translate them directly. It also supports a lot of different data transformations. You can do joins very easily, pivot, group by and aggregate. All these things that we do day in, day out with pandas, you can do with Polars.

So, how do you transition from pandas to Polars? This is the meat, right? Maybe you are here because you want to learn that. First of all, I recommend you try Polars out, but I won't say that you should just uninstall pandas from all your environments and only use Polars. No, no, no, you don't have to. I still think that pandas can be useful in some scenarios. But maybe for your next project you can start thinking about trying out Polars, because it's so easy, you don't have a huge learning curve, and you can take advantage of the memory safety of Polars and also the performance. I'll talk about performance later.

So these are the slides that say, if you do this in pandas, do this in Polars, so there's a comparison. So, import pandas. We all know that pd is pandas. If you want to trick your co-worker, you can write import NumPy as pd, and their head will explode. For Polars, if you look at the documentation, the standard is pl, so it makes sense. So, next time, you can try import pandas as pl to see what happens.
I don't know, it looks even more similar, so maybe nobody notices until something goes really wrong. So, yeah, you can see read_csv will still work. pd.read_csv, pl.read_csv, it will still work, okay? It's just using a different library. read_excel will still work too. For a lot of people this is very handy, because Excel is a very common data format. A lot of companies still use it, but it's a little bit hard to handle. In pandas, if you can't load in the whole spreadsheet all at once, you can read it in batches. If you don't need to transform your data all at once, you can do that in batches. In Polars, we have lazy loading, which you will see later in the performance part as well. It just means that it won't load everything in at once. When it gets used, it will be loaded in and used. So it gets executed when it really needs to be executed.

Constructing a data frame, again, is super similar. So just by looking at these slides, you may think, oh, maybe I can just change my import, you can even import polars as pd, and your code may still work. Cool. Again, head will still work, because you can still call the data frame df, and they are basically the same.

But this is the part where your code will give you an error if you just import polars as pd. In pandas, if you want to get a column, you use the square bracket, but in Polars, you won't use the square bracket; it has a method call, .col, for that. In pandas, if you want a subset of columns as a data frame, you use double square brackets. In Polars, you use the select method. It's a little bit more complicated than that, but you can also select multiple columns and it will become a new data frame.
In pandas, you have this to select and filter your data. We all know it, we all love using it, it's very handy. In Polars, we have the filter method, but again, getting a column is not done with the square bracket, so it's a little bit different. Some people prefer that because it's easier to see what's going on; the square bracket really drives people crazy when they are learning pandas. So good or bad, I don't know, you decide.

Another difference: pandas supports a lot of plotting. You can get a matplotlib plot easily by just calling df.plot on the pandas data frame. Polars doesn't have that yet, and I don't know whether they have plans to implement it. The next thing is df.sample. They both have it, but in pandas, you can pass in some weights with a parameter. In Polars, so far you can't; maybe they will in the future, I don't know. And again, describe is the same.

So I think the most important thing is the performance. That's why people are thinking about switching, so let's look at it. I have to confess, I wanted to do an experiment myself, but I had no time, I'm also organizing EuroPython, so I stole someone's slides. This is from PyCon DE. There's a speaker, Thomas; you can watch the talk, it's on YouTube, and the link is there. This is what he tried in his project: he switched from pandas to Polars, and these are some performance metrics he got to compare the two. The vertical axis is the time taken, so the higher the column, the slower it is. If you look at pandas with PyArrow, that one is super slow compared to Polars lazy, which is the lazy loading, lazy operation; that is super fast. I would say it's a lot faster, but still, he claims in his talk that it doesn't even meet his expectation, which was that Polars should be 10 times faster.
So this is running on an 8-core laptop, probably similar to mine; I also have 8 cores in my laptop. But the performance really shows its advantage when you're running on a 32-core machine. So this one is a 72-core machine, a 72-core cluster, and you can see that Polars is really fast: compared to standard pandas, which is pandas with NumPy, it is 10 times faster. So it's really, really fast.

So which one should I use? Let's wrap up the talk by helping you choose. Like I said, both of them have their advantages and disadvantages, and in different scenarios you may want to choose a different one. Pandas, I think, is very good if you're exploring data, because you can just plot some graphs. Polars, if you're doing some data transformation as part of your production pipeline, I think the speed can really help you. With pandas, if the data fits in memory, it's perfectly fine. Polars is now trying to add this out-of-memory capability to handle data that doesn't fit in memory, but it's on trial.

Polars is actually a very young library and changes very, very quickly; maybe it's more stable now, I don't know. Pandas, like I said, is very established and stable, so the changes are quite minor right now, and your code will probably work more or less the same in the future, maybe three versions on or something like that. Polars is very young and it changes a lot: a lot of functionality will be added, and the performance will be different, maybe even better. So if you use Polars, make sure you pin your versions. Actually, you should pin your versions for anything, but especially for Polars, because young projects change a lot.

So again: pandas, good for data exploration; Polars, good for production. If you have some kind of product and you need the performance, then use it.
Another thing that I learned is that pandas is actually quite good with scikit-learn, because a lot of the time, what you do is just grab the NumPy array and put it into the scikit-learn model, so pandas works quite well. Polars, if you have many cores, like a 72-core cluster, then of course you use that.

So last thing, I want to show you this picture. In this picture, as you are looking at it, on my right-hand side is the creator of Polars. He should be at the conference, I know he's coming, so look out for him. The person on the left-hand side is Marc Garcia, he's the... oh, he's here, oh, yay! Ask him questions, don't ask me. Okay, so on the left-hand side is Marc Garcia, the release manager of pandas. We were in the pub, and we are all friends, so you don't have to imagine it's a fight. We are all helping each other in the community. It's a beautiful thing, and that's why I love the community.

So I think you should give Polars a try. Like I said, you don't have to change a lot in your code, it's very easy to learn, very easy to use, it can give you a performance boost, and there are more features coming. If you have any wish list, talk to him, not to me.

Last thing before I am told that I'm running out of time: PyCon CZ. If you love this city and you want to come back in a few months, it's happening again here in Prague in September. If you think, oh, I don't have enough time to explore the city, there are so many museums, come back here. I think that's it for my talk, and thank you so much.

Oh, wonderful. So, thanks so much, Cheuk Ting. We have a good bit of time for questions. The mic is there if anyone wants to approach it, or I can come to you with the roaming mic.

They don't need to ask me questions, Chin.

Oh, or if you have nothing, then Cheuk Ting would like to get a coffee. No, I'm joking. Okay.
Thank you very much. I seem to remember that there is some optimized Fortran code somewhere deep down in NumPy or pandas. What has happened with this in Polars? Have they put even more wrapping around it?

Very, very good question. Of course, I don't know the very deep-down details, but because it's written in Rust... oh, I should have said this before, since some of you may not be familiar with Rust: Rust is also a compiled language. So the Rust compiler will, of course, optimize the Rust code that is written. But for that Fortran kind of optimization, maybe you have to ask Ritchie whether they are using Fortran's optimizations.

No, no, we have written all the code from scratch. Every algorithm is written in Polars itself, and we compile that to binary code, so we don't use any Fortran or C.

So Rust is already very fast, that's what you're saying.

Yeah, Rust has C, C++, Fortran-level performance.

Yes, so you get the performance of a very fast language. Any more questions? Yes.

You asked me. I'm not going to put you through the wringer. I have one correction. So, we can read object types. Very good talk, by the way.

Really? You see, great to get the endorsement.

But for strings, we read them as string types, because an object is opaque. An object is actually a Python object, and we don't know what to do with it. But strings are so common that we have a specific data type for them. That way, we can traverse memory way faster and do string manipulation without going through the Python interpreter, which would block parallelism and would also be slow, because Python objects are heap-allocated all over memory and you would have a cache miss every time you access one.

Yeah. Thank you.

Maybe it's kind of a silly question, but why not make Polars 100% compatible with pandas? Why can't you just change the import pandas to import polars?
We tried initially, but the pandas API is pretty bad for performance. It's really hard to see what the user's intent is. The user needs to use lambdas pretty often, so it's not really expressive. If you do a group by and a complicated aggregation, almost everybody drops down into a Python lambda, which means we don't know what happens, which means we can't optimize it. The pandas API was suboptimal and we could make it better. Yeah, it's a bit tedious to learn a new API, but we believe we can make a better API that has a very small surface area, so you can extrapolate your knowledge and use the pieces as composable blocks. So we saw an opportunity to make something better, and it really made our optimizations a lot better, because we now know the user's intent and we can optimize those queries. We don't need to go into Python. We don't need to go into another framework like Numba to get the performance. We can do everything on our side, and similar to how SQL goes to a query engine, the query engine can do all the optimization and make sure it's fast. That's what we do as well with the API.

Okay, thank you.

Just one question. Which operations can run on multiple cores? Does a join, a filter, or an aggregation run on multiple cores?

Almost all operations. We do parallelism in every operator, and, well, we actually have two engines. In my talk tomorrow I will explain a bit more about this, but if you write your code idiomatically as Polars expressions, or you can even compile your own Polars expressions, we will make sure that it runs in parallel.

Yeah, try running it on a 72-core cluster. That's super fast. Any more questions? You're trying to get to the mic, are you? Oh. Yeah, I think it'll be faster if you get there than if I come to you.

Great, thanks for your talk. I'm hearing a lot about performance. Is that the only motivator for going to Polars? Would you say the pandas API is easy to use and you go to Polars for speed?
Or are there any other reasons?

Personally, I think, because Rust is memory safe, there's also that foundation. Performance is a major factor that a lot of people consider. Another thing is, I would say there are some small things. Since I've been using pandas for so long, there are some small things that I don't like in pandas and that I think are cleaner in Polars, just my personal opinion. For example, that square bracket thing. Of course I'm used to it, but sometimes I wish it were more explicit, which is what Polars did by using methods to get those things. So I would say that performance is maybe the number one factor that a lot of people consider, but there are also other, smaller things that you might consider. So try it yourself. You may find something that makes you say, oh, I like Polars because of that. Yeah.

Any more questions? Maybe I'll ask one. Okay. To Cheuk Ting or anyone in the room. I don't think you covered it, because I was monitoring the Discord. Do you know any organizations that are using Polars, that maybe used pandas and have moved over?

Yeah, so in the talk I showed earlier, the speaker, Thomas, if I remember correctly, said that he works in a consultancy, and in one of the client projects he was trying out Polars, because they thought the risk was relatively low, so they could try using it. So yeah, there are people using Polars nowadays. It's not a very new, experimental project; it's already working. It's just that you have to take the first step of trying it. So yeah.

Oh, yeah, something to say?

No, not a question. Can I make another point? So the question you asked is whether performance is the only benefit. There are a few other things we focus on. One is out-of-core processing, processing datasets that don't fit into memory.
But another one we really hammer on is making readable, explicit code. You already named it; you think it's more explicit, and we do this by design. We think code is more often read than written. And especially in Python, we often don't know what kind of type we've got. It can be a dictionary or it can be something else, and when you read that code, you need to run it to see what kind of type you have. So if we optimize the API for reading and explicitness, you will have way fewer bugs when you work with coworkers. Another one is that we want to fail fast, so we are really strict on schema. If a data type doesn't match, we throw an error. And we don't do this at runtime, 20 minutes into your pipeline; we do it immediately, so you get quicker iteration and you're not frustrated that, because of some schema change, your pipeline failed one hour in.

Yeah, so please go to the talk tomorrow. And I think we are done here.

Perfect. Thanks, everyone, for coming. Another hand for Cheuk Ting.

Thank you.