Hello! Hello, hello, hello. So I'm Cheuk, hello. I hope some of you have talked to me before; if not, feel free to talk to me afterwards. I will mainly be spending my time downstairs at the registration desk. If you're a remote participant, message me on the conference chat and I will try to check it when I have time. So, I have a confession to make. As Martin described in the introduction, I was a data scientist and I used pandas a lot. But then I moved away from data science as my career, and I met different developers, and some of them would comment, "Oh, pandas is so difficult," or "pandas, I don't really know how to use it." That's why I'm going to talk to you about pandas today. I still love pandas, and I hope you will feel the same, whether you have used it a lot or hardly at all. Here are my social media handles. If you're a remote participant watching this later and you want to connect with me, you can also find me online; I'm happy to talk to you on Twitter. I don't think GitHub does messaging, but you can also find me through my website, and there is LinkedIn and other places as well.

First of all, I would like to ask the people I can see here: how familiar are you with pandas? I won't ask for years of experience, because that is sometimes a bit silly. So, how many of you are very confident, thinking, "I know how to use pandas, I'm very experienced"? You can raise your hands. Wow, there are some of you. Good for you. The participants here can remember your faces and ask you questions if they get stuck with pandas. Now the other way around: how many of you have maybe only heard about pandas, or have tried it a little bit, but think, "I'm not a power pandas user, I just use it for a few specific things, I'm not too sure"? Beginner-ish, okay. I hope this talk will be beneficial to you. And for the remote participants, no matter what your level is, I hope you enjoy this talk.

So first of all, what is your pet peeve about pandas? This panda on the slide is how I felt when I first used pandas: rolling around, thinking it is so clunky and I don't know how to do anything. I always had to go to Google to see how things get done. Sometimes I wanted to do something I thought was very simple, and I had to go through all these ".something" calls: you have a DataFrame, you call ".something" on it, you manipulate it; sometimes you replace the original one, sometimes you create a new one and then forget that you created a new one, and then suddenly your DataFrame becomes None. Those kinds of things happened a lot. So pandas is quite difficult to use, I would say, if you are not familiar with it. My advice for people who are maybe not data scientists, who are just starting to use pandas and feel that it is super freaky: a data scientist would feel the same when they start using Django, right? It is still Python, but I don't really understand it. It is familiar, but also not familiar.
So what I'm going to do is share some of my experience: what I found difficult about pandas and how to overcome it. This is me; I've already introduced myself. I love open source, and I've been involved in different open source projects in the past. I also love going to conferences and organizing conferences, and I love seeing people, so talk to me, because I love talking to you. I also streamed a bit online when I was not able to meet people in person, so you can find some Python tutorials of mine that may be useful for you as well.

So my number one pet peeve about pandas: I was a data scientist, right? It was fine when I was doing the textbook thing. They show you how to do stuff and give you a CSV, and those CSVs are usually quite small. It's fine until you work for a company, where the data is no longer textbook example size; it gets big, and I couldn't use pandas the way I did when I was learning data science. It can't handle large data sets efficiently; it takes so long and it's very slow. So how can I deal with that? There are a few tricks I learned from talking to other people and seeing what they do. A lot of the time, when you receive a CSV, say from a colleague, you will also ask them what it is about. They may have another file that tells you what the columns in the CSV are, what they are supposed to be, and what data type they are supposed to have. Then, instead of just calling read_csv directly without any options, you can use the dtype option.

So what is the dtype option? In read_csv you can do something like this: the third row here, which I hope is big enough for you to see, is where I specify the type of each column. If you don't do that, pandas will guess. If it sees some characters that are not numbers in a column, it thinks they are probably all strings, so it makes the column an object. If it's numbers without a decimal point, probably integers; but a lot of the time, if something is missing, the column suddenly becomes a float, because NaN is a float for some reason. Sometimes that's not what you want; sometimes you want to specify the type of the column yourself. That saves pandas the time of guessing what it is, and it also saves space. For example, the default would be float64, so if you load a huge CSV with a lot of numbers in it, and there are missing values as well, everything gets loaded as float64 and takes up a lot of space. But maybe you don't care about being that precise, because you're just doing some very simple calculations, say to six digits after the decimal point or so. You can change it: you can compress the data into a smaller type to store it. For example, here I know that all the movie IDs will be integers, and they will not be very large integers, so int32 is probably enough to store that number, and I make the column smaller.
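Here is a minimal sketch of the kind of read_csv call described above. The file name and the column names are made up for illustration; the point is simply passing a dtype mapping so pandas does not have to guess.

```python
import pandas as pd

# Hypothetical file and columns, purely for illustration.
# Without dtype, pandas guesses: text becomes object, whole numbers become
# int64, and any numeric column with missing values silently becomes float64.
ratings = pd.read_csv(
    "ratings.csv",
    dtype={
        "movie_id": "int32",    # IDs fit comfortably in 32 bits
        "rating": "float32",    # we do not need float64 precision here
        "title": "string",      # explicit string instead of a generic object
    },
)
print(ratings.dtypes)
```

An explicit dtype also makes read_csv complain right away if, say, movie_id turns out to contain missing values that cannot be stored as int32, which is the early-error benefit described next.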
By doing that, if the CSV involves a lot of numbers, you can sometimes compress it just enough to fit in your memory, and that's good: you don't have to find other solutions or other tricks to handle a huge data set. The bonus, as I mentioned before, is that it will also be much faster to load, so you start working faster, and other operations will be faster as well, because you're using less memory. So that's one way of doing it: if you already know the types of the columns in your CSV, in your DataFrame, I would recommend specifying them as a good practice. For example, say you know a column is supposed to be integers with nothing missing. With dtype set, you get the error right away when you read the CSV. If pandas does it for you, it doesn't know the column is supposed to be an integer with no missing values; it just silently changes everything to float, you get NaNs, and if you're not careful about checking for NaNs, you get an error later on and wonder why it is not an integer, and only then realize you were missing some data. With the dtype specified, you get the error right away when you read the CSV, and that can be very helpful for debugging, instead of your model suddenly failing before you find out the data type was wrong. So that's one thing.

The other thing is that if it's a huge CSV, beyond the trick I just mentioned of using smaller dtypes, you can process your data in chunks. Pandas is actually really good at that. You don't have to write your own for loop that reads, say, the first 500 or 1,000 lines; you can just use the chunksize option in read_csv. I set the chunk size manually here, to ten to the power of six, but you can change it, say to 1,000 if you want smaller chunks. With chunksize set, read_csv does not create a DataFrame; it returns an iterator. You can use it in a with block, almost as if you were reading a file, and then loop over it with a for loop to process each chunk bit by bit. It's really handy. And if you use it together with another tool called tqdm, you also get a progress bar: as you process the data chunk by chunk, you can see the progress in your Jupyter Notebook or in your terminal. That is very handy, because instead of staring at the screen hoping everything is fine, you see progress being made, and if something fails, you know where it failed. This is especially useful when your data does not need to be processed all in one go. A lot of the time you just have to clean up some data or make some adjustments, drop a column, add one column to another, that kind of thing, and you can process the chunks one by one and then store them somewhere else, on disk or in a database.
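A minimal sketch of the chunked-reading pattern just described, assuming a hypothetical large ratings.csv and a trivial cleaning step; tqdm wraps the iterator so each processed chunk advances a progress bar. The context-manager form of read_csv shown here works on recent pandas versions.

```python
import pandas as pd
from tqdm import tqdm

cleaned = []

# With chunksize set, read_csv returns an iterator of DataFrames
# instead of one big DataFrame.
with pd.read_csv("ratings.csv", chunksize=1_000) as reader:
    for chunk in tqdm(reader):      # tqdm shows one tick per chunk processed
        chunk = chunk.dropna()      # example adjustment: drop incomplete rows
        cleaned.append(chunk)

# Either concatenate the pieces at the end, or write each chunk straight
# to disk or a database inside the loop instead of keeping it all in memory.
result = pd.concat(cleaned, ignore_index=True)
```

If the goal is to stay within memory, writing each cleaned chunk out inside the loop (for example with to_csv in append mode, or to_sql) is usually the better choice than collecting everything in a list.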
That would be a very good way to set up a pipeline and process your data, so it's quite good. So that's about the size of data that pandas can handle. Another pet peeve I had is that pandas is really, really complicated sometimes, because there are often multiple ways of doing the same thing, or ways of doing things that are quite similar to each other. Fun fact: yesterday I gave a workshop, and some people told me, "With pandas, sometimes I just don't understand, because there are different methods that do similar things and they are just slightly different, and I don't know which one to use." So to think about that, I would put these into different categories.

For example, here is something a lot of beginners find tricky: to get a column, you can either use the dot notation, or you can treat the DataFrame as if it were a dictionary and use square brackets. Why are there two ways of doing it, and which is the better way? Maybe you should choose one and stick with it as a habit; that actually makes things easier, because if you keep switching between the two, you confuse yourself. The other type of confusion is the difference between a Python built-in and its pandas equivalent. Python has built-in functions, like sum, min and max, that can handle any iterable, and a pandas DataFrame is iterable, so that is one way of doing it; but there is an equivalent pandas method that does the same thing, df.sum, df.min, df.max and so on. So what's the difference, and which is the better way? Then there are different methods that do exactly the same thing. For example, missing values: we always deal with missing values, but there are two different sets of methods, isnull and isna. They are actually aliases, so they do the same thing, and their counterparts are notnull and notna. Which one should I use? Sometimes I see people using the first, sometimes the second, and it is confusing.

In all these cases, I prefer the latter option. First of all, practicality. In the first example, I prefer the square brackets, df["users"], over df.users. The reason is that when you work with a CSV you got from whatever source, whoever produced it did not know it would be put into a pandas DataFrame, and a column name may have a white space in it. In that case you cannot use the dot notation at all, because it would be a syntax error in Python; you have to use the square brackets. So I just stick with the square brackets so the code looks cleaner: every time I get a column I do the same thing, even when I don't strictly need to. "users" doesn't have a white space in it, but I still prefer the bracket form because I use it for every single column, no matter what its name is.
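As a small illustration of the bracket-versus-dot point (the DataFrame and its columns are made up here), note that a column name containing spaces can only be reached with square brackets:

```python
import pandas as pd

df = pd.DataFrame({
    "users": [3, 5, 8],
    "sign up date": ["2021-01-01", "2021-02-01", "2021-03-01"],  # name contains spaces
})

print(df.users)            # dot notation only works for valid Python identifiers
print(df["users"])         # square brackets work for any column name
print(df["sign up date"])  # the only way to reach a column whose name has spaces
# df.sign up date          # would be a SyntaxError
```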
For performance, in the second case, the sum, min and max, using the method on the pandas DataFrame is faster. Pandas is a library that knows you are going to deal with a lot of data, so it has optimized the performance by providing built-in methods backed by C extensions. For some operations there are optimized C functions behind the scenes doing things for you, and sum, min and max are very good examples of that. If you try it yourself and time it, you will find that df.sum() is much faster than sum(df). The reason is that sum(df) is the Python built-in: it assumes it has any old iterable, does the standard thing and loops over it, and that is way slower than calling the optimized method on the pandas DataFrame. That's why I prefer the latter one. In the end, sometimes which method to use comes down to popularity. Something that was done in the past, for example isnull, falls out of favor, and in recent years more and more people use isna instead of isnull. Names change over time. Just stick with the one you see used the most, because at the end of the day you have to work with other people, and if you use the more popular one, the chances that people understand what you are doing are higher than if you use something less popular that not many people know. So that is one of the pandas problems sorted: choosing the right method for the job.

The other thing I found quite clumsy is that when I do a group-by aggregation, or a pivot table, or things like that, it creates something called a MultiIndex for me. Maybe you have seen it before: when you print out the DataFrame, or the head of the DataFrame, the index or the columns have two levels, a high-level column and smaller sub-level columns underneath. Those are multi-indexes, and they are very, very difficult to use; I don't like using them. So what I usually do, almost as muscle memory, is that after doing one of those operations that creates a MultiIndex, I reset the index, which flattens it, to avoid using a MultiIndex, which is a very clumsy way of using pandas. So next time you see those weird multi-layer columns or multi-layer indexes, you know you can reset the index, and that will solve the problem.

Another thing, for those of you who are familiar with SQL, you may love this: on a DataFrame you can actually use a query expression, a bit like SQL, to operate on it, as if the DataFrame were a table in your database. For some operations that is actually easier: if you are familiar with SQL, you can just do this instead of figuring out the pandas way of doing it.

The last thing I want to talk about is the confusion between a Series and a DataFrame. What do I mean? We have all done this, and I have shown it before: getting a column from a DataFrame. But have you ever tried the double bracket? What is the difference between the two? I am still getting a column from a DataFrame. Well, the second one gives you a DataFrame with only one column in it.
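A small illustration of that single-versus-double-bracket difference, and, while we are at it, of the group-by and reset_index point from a moment ago. The DataFrames and column names here are made up for the sketch.

```python
import pandas as pd

df = pd.DataFrame({"users": [3, 5, 8], "age": [21, 34, 29]})

# Single versus double square brackets
print(type(df["users"]))           # <class 'pandas.core.series.Series'>
print(type(df[["users"]]))         # <class 'pandas.core.frame.DataFrame'>
print(df[["users", "age"]].shape)  # double brackets can also select several columns: (3, 2)

# Group-by with two keys produces a MultiIndex on the result;
# reset_index flattens it back into ordinary columns.
sales = pd.DataFrame({
    "country": ["NL", "NL", "DE", "DE"],
    "year": [2020, 2021, 2020, 2021],
    "users": [10, 15, 7, 12],
})
agg = sales.groupby(["country", "year"])["users"].sum()  # MultiIndex on (country, year)
flat = agg.reset_index()                                 # plain country, year, users columns
print(flat)
```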
So, coming back to the single and double brackets: if you look at the type of each result, not the dtype but the type, you know in Python we can call type() on whatever we want to check. If you check these two, you'll see that the first one gives you a pandas Series and the second one gives you a pandas DataFrame. They are two different things, two different data structures. A Series you can think of as a 1D array, a list that has labels: it has an index, by default just running numbers, and you can also change it. So a Series is a one-dimensional structure. It can hold any data type, and it carries information about the dtype of what you are storing in it, so basically it is a fancy list. A DataFrame, on the other hand, is a collection of Series: a 2D data structure consisting of columns with different names and different dtypes combined together. The trick is, as in my example here, that I am not required to have multiple columns in the DataFrame; I can have a DataFrame with only one column, and it is still a DataFrame, just a DataFrame with a single column. That is where the double square brackets come in. Of course, with the square brackets you can also have multiple columns: you can have users and another column, maybe age, and then you have a DataFrame with more than one column. But with the one on the left-hand side, the single brackets, you only get one column, so you can't add a comma and an age or anything like that. So that's the difference.

So we are almost at the end. Pandas is a very, very useful skill to have, especially when you are handling data in memory in Python, so you may want to master your pandas skills, take a deeper dive into it, and become a more powerful user of it. There are some ways of doing that. I follow James Powell; I have met him a few times at other conferences, he is definitely a powerful pandas user, and he sometimes gives tips about it on Twitter, so you can follow him. There is also finding your own way and sticking with it: for example, the square bracket thing. I decided I would always use square brackets when I want to get a column, and that actually makes things cleaner and easier to understand. But at the same time, you should also look at how other people do things; sometimes someone has found a more efficient way, and then you can compare and decide that, actually, you want to switch to the more efficient and more popular way of doing it. And in the end, you may want to understand the internals of pandas, understand these different structures, why they are different, and what makes things faster and more efficient. Then maybe the best way is to contribute to it; even contributing to the documentation counts as a contribution. If you want to contribute to pandas, the best way is to join a sprint. I am not sure whether we have pandas core developers here at this conference, I haven't met them in person here yet, but EuroSciPy will usually also have maintainers of other scientific Python libraries there. The best way is to sit with them, do a sprint, learn how to contribute, and look at the source code of the library you are using. So this is just a small advertisement for EuroSciPy, and I will be there, so I will see you there as well. That's it for my talk today. Thank you.
Thanks, Cheuk, for the talk and for the insights from your experience on how to work with pandas. We do have some time, and I bet, with so many people new to pandas, there might be some questions, and the experienced ones might also want to say something. Could you please come to the microphone to ask your question?

I prefer, to be honest, working with an IDE, and I found it doesn't work well with pandas. Do you maybe have a solution to that? May I ask which IDE you use? PyCharm. PyCharm, okay. Some of the IDEs are designed with general Python libraries in mind, but not specifically data science libraries. I know a lot of data scientists prefer the Jupyter environment, which includes JupyterLab and Jupyter Notebook. So I would say that for data science work, Jupyter is the one to go with. Okay. Thanks for the question.

Are there any more questions in the room? Yes. I don't have much experience with pandas, only a little, and I think it is a little bit inconsistent; that's my problem with it. I like to compare it with Git: I know how to check out, check in, clone, all the essential operations I do every day, but each time I want to unstage a file, for example, I have to look it up. I unstage a file once a month, and it's been years, and I still look it up, and I think that's because it is inconsistent: there is one way to do one thing, but a completely different way for another. I thought the same thing applies to pandas. Am I correct about this? Does it get any better as time goes by, or do I have to remember everything by heart? I understand your sentiment, and it is absolutely true that, because pandas is so huge and there are so many ways of doing so many things, having good documentation for pandas is so important. I think over the years they have improved the documentation a lot. If you want to improve it further, of course, you are welcome to contribute to it, and I can introduce you to some pandas developers so you can have a discussion with them. Thanks for the question.

Thanks for the question. Please come to the microphone. This is probably the second or third time so far in this conference that I have seen the .query method being used. I generally use .loc to filter out separate DataFrames from my data. Is one faster than the other? Is there an advantage to .query over .loc? I would say that I haven't tested which one is faster yet, but .loc is another complication, because there is .loc and .iloc; yes, people are confused about them, but I am not going to dive into the details now because of time. As for .query, I think it is usually useful for people who are, first of all, very familiar with SQL queries, but are at the same time doing operations that are more than just looking up one thing: they could be doing some filtering, or even some counting or aggregation sometimes. So I would say there are more uses for the .query thing, but speed-wise I haven't looked it up yet.
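For readers wondering about the two styles discussed in this question, here is a small sketch, with a made-up DataFrame, of the same filter written once with boolean indexing through .loc and once with .query; as noted in the answer, no claim is made here about which is faster.

```python
import pandas as pd

df = pd.DataFrame({"year": [2020, 2021, 2021], "users": [10, 15, 7]})

# Boolean indexing with .loc
a = df.loc[(df["year"] == 2021) & (df["users"] > 10)]

# The same filter written as a query expression
b = df.query("year == 2021 and users > 10")

print(a.equals(b))  # True: both select the same rows
```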
Okay, thanks very much for all the questions. Oh, we have a final quick question. Very quick, please. Thanks. I'm pretty much a beginner, but I also teach beginners. Do you have a recommendation for a good tutorial? Oh, actually, I would say that the official documentation now has really, really good tutorials that are step by step and explain things quite well. So yes, check out the official documentation; it is really good. Okay, so thanks, Cheuk, for inspiring everybody to go deeper into pandas. Let's have another round of applause for Cheuk. Thank you.