Hi everyone, my name is Uzi, and I'm here to talk with you about pandas. I named this talk "pandas, not just for data scientists" because I have a feeling that in the Python community, data scientists use it almost exclusively. They recognize it as a very powerful tool, and still other developers, non-data scientists, at least most of the ones I know, are hardly familiar with it. I have a feeling that if more people outside the data science community knew about it, it could be very useful for them as well.

So before I start, I have three questions for you. Please raise your hand if you've heard of pandas before, if it's something familiar to you. Keep it up if you've played with it, if it's something you're a bit familiar with. And the last one: keep your hand up if you use it in production environments. Okay, so that's what I thought.

Just to be clear: this talk is not for data scientists, and it's definitely not a tutorial. A 30-minute talk, which as you'll see also goes through other subjects, cannot even start to cover the minimum you need for pandas. I've included two references here that you can use if you feel that pandas could be useful for you. And that's my intention: by the end of the talk you'll feel that it's potentially a valuable tool for you, so you'll go to these references and learn it more deeply. I'll have a short demo to walk you through some of the basics, because I have a feeling that when you have a quick start and you understand the basics, it makes things much easier. So hopefully you are developers who are not data scientists; that's the audience I had in mind.

So, about me.
I have 30-plus years of experience with different platforms and languages, but for the last six years it's been exclusively Python. I kind of fell in love with this language, and it almost brought back the joy of programming I had in my youth. I'm very grateful for that.

I work for BlueVine. BlueVine is a fintech company; we basically give loans to small businesses in the US. One of the major components in our system is the ability to approve or reject new customers and new deals, and for that we have a very talented team of data scientists who do a great job of it in an almost impossible time frame, sometimes even minutes, using all the data sources that we have. As part of my previous role as chief architect I worked a lot with this team, and I had to learn pandas in order to work with them and help them with architectural design decisions. As I learned more and more pandas, I realized that the other projects I work on, with other developers, could actually benefit from it. That made us start using pandas outside the realm of pure data science as well. We saw a lot of benefits from that, and I'm going to show some of them here.

When it comes to a programming language, you can see it as an interface between the human developer and the machine. For me, Python is probably the best option on the human developer's side: I feel that when I develop in Python, the cognitive load is the lowest I can have when programming. But on the machine side of this reality, we all know that Python has its limits. Most of the time we don't really care about these limits, right? If you have a simple web app, we don't really care about this. But there are situations where these limits can be really problematic, and for these we tend to ask ourselves: okay, so what do we do? How do we use Python
and still work around these limitations? One of the things we all know is that if we use a specialized Python feature, we get much better performance. In this example, on the left side you see a for implementation of a loop. As we all know, for is the most open and versatile tool Python gives us for looping over things. And we know that a list comprehension, which in a way limits the things we can do, is more specialized, and if you use it, as you see here, we gain quite a meaningful advantage in performance, right? I mean, it's about 30% here. We usually use these specialized features not just because of performance, but because this idiomatic, Pythonic way of writing also makes the code clearer and easier to follow.

The other thing we can do is leverage the advantage of C. Python provides us with a way to extend this optimized, specialized approach by implementing part of the code in C. I have here an interesting quote from a talk I heard by Rhodes. Sometimes we think that C is a better option because it's a compiled language, but as he shows very nicely in his talk, bad performance in Python actually comes from the dynamic nature of the language and not from the interpretation part. He shows that interpretation may be responsible for maybe 30% overhead, but the dynamic nature of the language is something we pay for with, in his talk, around six hundred percent overhead in performance.

As you know, a lot of libraries already take advantage of this ability, and part of the Python standard library is of course also written in C. NumPy and pandas are some of these libraries. So, pandas, just to give you a really short introduction: you can think of pandas as a kind of Excel.
It's really a bad metaphor to keep going forward, but it's a nice starting point. Everything you can do with Excel, or in an environment where you have a database and SQL queries, you can do with pandas. Pandas is based on NumPy, and NumPy provides a very efficient array implementation that is fixed in size at creation, where each element has the same type. Both NumPy and pandas provide what they call ufuncs, which are vectorized versions of many useful operations that you can run over these arrays and data frames. If someone is familiar with R data frames, pandas is of course very similar to that, and actually based on that.

So how can we actually improve performance with pandas? What's the trick? It's quite simple. Again, it comes down to the dynamic nature of Python, and to trading off this part and saying: okay, for this part of my calculation I don't really need the dynamic nature, I need performance. If you look at this chart, on the right side you'll see a typical Python object, in this case a list. We have the list object, which refers to an array of objects, and each of these objects is actually a reference to a structure in memory that is scattered all over. And if you look at the left side, which is the NumPy array with very similar functionality, if you're willing to
live with the assumption in NumPy that each element has the same type, then you have a contiguous memory structure, very similar to what you usually have with C arrays. Of course this can be much more efficient, especially with CPUs taking advantage of operations that run on these contiguous blocks of memory.

Pandas is of course just a small part of a huge ecosystem, and if you start using pandas you will eventually be able to use many more, maybe hundreds, of other tools and libraries that give you great performance and a lot of functionality. This is part of the advantage of using pandas.

At this point I would like to walk you through a very simple demo. I'm going to use Jupyter; I assume that everyone here is familiar with Jupyter. Let me see how I can do it. Sorry about that. Okay.

So the first part is really just the basic imports and some basic configuration to load pandas. Then the typical thing you start with is an input file, which could be CSV or JSON or whatever. In this example I just downloaded a sample CSV file with boring sales information. Pandas provides you with a very nice way of loading this data very easily, in this case, as you can see, with just one method. The first thing you can do is look at this data frame. The data frame is the basic structure, right? It's the Excel sheet you can imagine, which holds all this data. You can easily browse through this data and see the different columns and rows.
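The loading step described above can be sketched roughly as follows. The talk's actual CSV file is not shown in the transcript, so the column names and values here are invented for illustration, and an in-memory string stands in for the downloaded file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded sales CSV. The columns and values
# are invented for illustration; they are not the file used in the talk.
CSV_TEXT = '''Product,Price,Country,City
Product1,1200,United States,Seattle
Product2,3600,Israel,Tel Aviv
Product1,"13,000",United States,Astoria
Product3,7500,France,Paris
'''

# read_csv accepts a file path or any file-like object; StringIO keeps this
# sketch self-contained.
df = pd.read_csv(io.StringIO(CSV_TEXT))

print(df.head())  # the first rows of the "sheet"
print(df.shape)   # (number of rows, number of columns)
df.info()         # column names, non-null counts and dtypes
```

With a real file, the same call would take a path instead of the `StringIO` object.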
It's quite easy to get a first impression of what the data really holds. You can get info on the different columns. When you see object here, it means that pandas holds a Python object, which is not the greatest thing, because in this case you don't really get the performance gain I talked about before. There's some kind of indirection here, because pandas is still referring to Python objects. Ideally you would like to see things like the floats here, or integers, which are native NumPy types, and that's where you gain the performance.

If you look at the price here, you see that for some reason the price is an object. Usually object means that you have strings, because there's no native string type that pandas uses; it uses the Python string, and therefore you see object. And if you run sum on it, you get this weird result, which looks more like a string concatenation than a sum of numbers. If you look closely you'll see the issue: somewhere here there's a comma. This is a common problem with CSV files. Sometimes you load them and the comma is interpreted as something that's not numerical, and then pandas infers that it's a string and not a number.

With pandas it's quite easy to explore the data, see where these issues are and how to solve them, by using indexing. This is my data frame, and I'm referring to the column df.price. First I show here that if I use a ufunc under str called contains, I get a series of Boolean results, where each one represents whether my cell contains a comma or not. And I can use this series for indexing the data frame.
So I go to df.price, right, and I index it with the series that I got, and then I get a filtered version of the data frame I started with. Pandas uses this Boolean series to determine which rows I would like to see: for every True I get the row, and for every False it's filtered out. So here I see the issue: I have a price with a comma. Once I know that, it's quite easy, again using a ufunc, to deal with this issue. I can just use replace, and while doing so I can also change the type to int, just to make sure that we're working with integers and not with strings. Now when I do info I see, right, it's int64, and if I do sum now I get the expected result, which is of course the sum of the prices.

So pandas is great for data exploration. Coming from general-purpose programming, or other fields in Python where you have some data structure and data type, you feel like it's so much work to really understand what kind of data you have. When you're used to pandas, it becomes very easy to just explore the data and understand it more and more, and it provides a lot of functionality for doing that. I won't go over too many, because it's a long list of hundreds of these functionalities, but just going through a very simple, very useful sample: describe is something that pandas gives you very easily.
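The price clean-up walked through above (find the cells with a comma, filter with the Boolean series, replace, convert to int, sum) might look something like this minimal sketch. The data is invented, and the column is forced to the object dtype so that it holds plain Python strings, as in the situation the talk describes.

```python
import pandas as pd

# Hypothetical data: the prices are strings because one value contains a
# thousands separator. dtype=object mimics a column of Python strings.
df = pd.DataFrame({
    "Product": ["Product1", "Product2", "Product1", "Product3"],
    "Price": pd.Series(["1200", "3600", "13,000", "7500"], dtype=object),
})

# Summing strings concatenates them instead of adding numbers.
concatenated = df["Price"].sum()
print(concatenated)

# Boolean Series: True where the cell contains a comma.
has_comma = df["Price"].str.contains(",")

# Indexing with the Boolean Series keeps only the rows marked True.
print(df[has_comma])

# Remove the commas, then convert the whole column to integers.
df["Price"] = df["Price"].str.replace(",", "", regex=False).astype(int)

print(df["Price"].dtype)  # an integer dtype, typically int64
print(df["Price"].sum())  # now a numeric sum
```

When the separator is known up front, `read_csv` also accepts a `thousands` argument, which may avoid the clean-up step altogether.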
For every numerical column you get the different statistical measurements, and you can also do it for a specific column. You can do a value count: if you have values that repeat themselves, and in this situation we have different products, you can see the frequency of each product. Of course you can do all kinds of other things. Here you can see an example of more advanced filtering. The principle is very similar, but as you can see, here I can do a filter that's based on a range, right, between this number and this number, and I get the rows that comply with this condition. Of course I can create new columns with calculated fields; in this case I have a discount column, which you can see on the left. Then you can do a lot more advanced things, like group by and statistics on the different groups. Here I group by product and I have the different statistics for each product, and I can do describe on each of these groups and have all the statistical measurements on them.

Of course pandas provides a lot of visualization tools. A simple one is having a histogram of all the numerical values, or I can choose one of them and just show it. I can do any custom, specific chart that I need; it's really endless.

When I finish analyzing the data, I usually want to save the data. Here too pandas provides very easy-to-use features: I'm saving to JSON, I'm saving to SQLite. You can see the files I have here, and I can read them back.
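The exploration steps mentioned above (describe, value counts, a range filter, a calculated column, and group-by statistics) can be sketched as follows. The data, the price range, and the 10% discount rule are all assumptions made up for this example; plotting and saving are left out to keep it short.

```python
import pandas as pd

# Hypothetical sales data, invented for illustration.
df = pd.DataFrame({
    "Product": ["Product1", "Product2", "Product1",
                "Product3", "Product2", "Product1"],
    "Price": [1200, 3600, 13000, 7500, 3600, 1200],
})

# Summary statistics for every numeric column.
print(df.describe())

# How often each product appears.
product_counts = df["Product"].value_counts()
print(product_counts)

# Range filter: between() is inclusive on both ends by default.
in_range = df[df["Price"].between(1000, 4000)]
print(in_range)

# Calculated column (the 10% rate is an assumption for this sketch).
df["Discount"] = df["Price"] * 0.10

# Statistics per product group.
per_product = df.groupby("Product")["Price"].agg(["count", "mean", "max"])
print(per_product)

# describe() also works per group.
print(df.groupby("Product")["Price"].describe())
```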
I get the data right back, easily. I can read it with a SQL query as well and get the data back.

The last thing I show here I won't be able to run, but if you're using Django, which we are, this is something that we use quite frequently. You usually need some kind of bridge between the Django data structures and pandas, and it's actually done very easily: you just use from_records on the pandas side, and on the Django side you just need to use values and specify the fields you're interested in. Once you do that, you are in pandas world, and then everything is available to you.

And the last thing I want to show you: pandas, or actually in this case Jupyter, to be precise, provides an option to develop plug-ins, and there are hundreds of very useful plug-ins. I want to show you one of them here. I have my data set here, and this is a pivot table, so I can examine the data in a more interactive way, right? For example, I can put the country here, and I can decide, okay, I want to do it also by city, not just by country, so a subdivision. Then I can choose the average of the price, and I can see the average price for different products in different countries and different cities, et cetera. But that's really a quick glimpse of what pandas provides, and now I'll go back to my presentation. Which is not really happening. Just a second. Oh, wait a second, sorry.

So how fast is it, actually? I show here a very simple test I ran, using on the left side a Python data structure, in this case a list, and on the right side pandas. You can see that for different operations, some filtering and multiplication, you get results that are more or less 30 times faster, which is not bad, right?
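A comparison of this kind could be set up along the following lines. The transcript does not show the actual test, so the data size, the filter condition, and the multiplication are assumptions; the measured ratio will vary with the machine, the data, and the library versions, and is not meant to reproduce the figure quoted in the talk.

```python
import timeit

import pandas as pd

# Hypothetical workload: keep the multiples of 3 and double them.
values = list(range(100_000))
series = pd.Series(values)


def with_list():
    # Pure Python: filter and multiply element by element.
    return [v * 2 for v in values if v % 3 == 0]


def with_pandas():
    # Vectorized: a Boolean mask selects rows, then one multiplication
    # is applied to the whole selection.
    return series[series % 3 == 0] * 2


# Both approaches must produce the same numbers.
assert with_list() == with_pandas().tolist()

list_time = timeit.timeit(with_list, number=20)
pandas_time = timeit.timeit(with_pandas, number=20)
print(f"list:   {list_time:.4f}s")
print(f"pandas: {pandas_time:.4f}s")
```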
I mean, if you have some kind of operation in production that takes 60 minutes, and by introducing pandas it goes down to two minutes, that can really be quite a difference. I actually want to share two examples from things we experienced in production.

The first one is that we had a sync process that gets information from a third party and has to run a lot of comparisons with data from our ORM, Django. By refactoring our solution to use pandas, we were able to get results 15 times faster than before, and we actually got cleaner code. It's not something you see immediately, but once you're familiar with pandas and you know how to read and use it, the code is much cleaner. Being able to have this better performance, 15 times, was a really significant change for us.

But the second example was actually much more amazing. We had a case where we had to use more advanced calculations, using groupby and things that involve more complicated business logic, and there we got a solution that was 1,900 times faster. It really looked like a mistake, and we looked at it again and again, but these are actually the results, and it's quite amazing to see. It's clear that we could have taken the original Python implementation and improved it without using pandas, but there's no way we could have got to such a clean and easy-to-follow solution. And that's the power of pandas: it's not just that it gives you a much more performant solution, but the solution is usually nicer and much easier to follow.

So if you decide to use pandas, you should know that there is a pandas way, and this is something that newcomers to pandas sometimes miss.
I mean, in order to really enjoy the benefits of pandas, you need to use ufuncs as much as possible. If you don't have a ufunc, you need to use apply and not iterate. This is something you can do with pandas, you can iterate over the rows, but you should use apply, which is a kind of customized ufunc. And even if you get to the point where you need to iterate over the rows of a data frame, you still see some improvement relative to Python.

There are some situations where the more intuitive solution is not the right one, and I want to show one of them here. This is something we actually saw in production. We had a situation where we had, as an example, 50,000 rows, and we needed to apply a category according to some kind of mapping, which you see on the right side. In this example, the category where you see a question mark should be C, because it's in the date range you see below. The simple, straightforward approach was to use apply, right? I mean, to use apply with a function that we wrote called get_category. It seemed simple, but for some reason we saw that it took around 61 seconds, and it was like, whoa, 61 seconds for this relatively simple thing we want to achieve here. By digging into it more, we understood that we could change our perspective: instead of using apply and sort of iterating over the rows using pandas, we iterate over the mapping, the grouping that we have, and we let pandas filtering do its magic. When we did that, we got results more than 2,000 times faster. So something that ran in production with the original pandas solution took 61 seconds, and after this change it took only 26 milliseconds, which is of course quite huge.

So, very quickly: by using pandas and Jupyter you also get other benefits. Besides the very powerful tool that allows you to explore the data, you also get
this notebook, which allows you to run it in multiple environments, which could be very handy. You can write it in development, test it on staging, and run it again in production. You can share the notebook, share the results, et cetera.

And we're out of time, so I'll just quickly summarize. From my perspective, the takeaway here is that you should learn pandas. It's not trivial, there is some kind of learning curve, but it's not that hard, and I think you can gain from it in non-data-science kinds of problems as well. When you do that, I think you'll find it useful, of course in data analysis, which is what data scientists do, but also in sync processes, as I talked about before, in imports and exports, in any process that deals with a lot of data. And when you use pandas, keep in mind that you have to be really flexible in the way you see things. It's not the simple, iterative, intuitive way of looking at data; you sometimes have to change your perspective, and when you do that you really gain a lot. Thank you.

Thank you, Uzi, for this great presentation about using pandas to avoid some simple for loops or comprehensions and being much faster. Well, we're out of time, unfortunately. If there's a very quick question, raise your hand and I'll run over. Otherwise... okay, I'm running to you.

First, thanks for your presentation. My question would be: have you encountered a case where replacing a complex SQL query with some pandas logic would be quicker?

The question was whether we experienced a situation where we had a very slow SQL query, and by replacing it with pandas we gained better performance. Well, it really depends. Usually, if you have SQL that goes to the server, runs the logic there, and gets back a very small amount of data, that's probably your best option. Right, I mean, you have a query that runs on the server.
You get only a very small portion of the data, and that's it. Sometimes, if the query is really too complicated and it really justifies the overhead of getting all the data into memory, then for sure you can do things that could be more efficient. Pandas actually also has a query language, a way to articulate these needs that is similar to the queries you're used to. But as a general case, I wouldn't go to pandas because of performance issues with SQL. It's not what I think you should do.

Okay, thank you. Thank you very much.