I'd love to welcome Ian Ozsvald in London. Hello, Ian. I'll ask you to unmute your microphone so that we can talk to you, please. There we are. Can you hear me? This works fine and we can see you. Thank you, Ian. So I love pandas. I'm really excited to see what you are going to show, because flying pandas I've not seen yet. No one's seen flying pandas. Okay, if anybody has questions about this talk, make sure to post them in the Zoom or in the talk's chat channel. I'll be monitoring both. And now I would like to give you the chance to start your talk. So please enjoy, and share your screen. Thank you very much, Martin. So I need to share my screen. I'm going to share desktop two, and then, all going well, you can see my slides: Making Pandas Fly, live from London. And my name, as Martin has said, is Ian Ozsvald. So who am I? I'm an interim chief data scientist: I work with companies on an interim basis to help them out when they lack a chief data scientist to give direction to the company and to the teams, and I've been doing this sort of work for 20 years. I've been doing lots of other things, of course, building up to this point. I've built lots of intellectual property over the years, and now I really enjoy going into companies and helping to bring a team up to the next level, which is a really interesting challenge. Lots of different things are happening across all sorts of data science areas, from machine learning and data preparation through to delivery of working systems. Along that journey I've learned a lot, and I've also found a lot of frustrations, and a lot of colleagues with frustrations, so I've turned them into courses. This talk is in fact a part of my Higher Performance Python course, which for me is really exciting, because I love sharing the knowledge that I've gained. If I step back, I started my presentation journey in the open source Python world in 2011 at EuroPython in Florence. That was the first time I gave a public talk about Python, and I gave several hours of teaching on high performance. So for me, this is a return to my old place; I've been a speaker, teacher and even keynote presenter at EuroPython in the past. I'm also one of the founders of PyData London, one of the largest data science communities on the planet. I'm super proud of that; it's got 11,000 members. I work with a lot of companies, lots of high street brands you might recognize, and I'm co-author of a book on high performance Python, so I really care about this subject, and the second edition of that book came out just recently. Being an organizer in PyData, and an organizer of the annual conference, I know just how much work goes into this, so I want to thank the organizers and all the volunteers for the work they've put into it. And I'd suggest you go and thank them as well: go into the lobby and just say thank you. It really makes the event a bit nicer when the participants say thanks for all the work that's gone into it. So, today's goal. We're going to look at pandas almost exclusively. We're going to look at saving RAM, so we can get more data into memory without doing anything else. We're going to look at calculating faster, both with pandas and inside NumPy. I'm going to give you some reflections on being highly performant as well. And if we have time, we might look at the last slide, on some of the data that I've been analyzing that gave us this talk.
And you might ask why. Why do we want to fit more data into RAM with pandas when, for example, we could use the Intel SDC that we saw in the previous talk? The simple answer is that it costs time to learn new things. We are probably more efficient overall if we can get more out of the current tools we have, rather than having to learn new ones. It's cool to learn new tools, but it's time expensive, and they always work a little bit differently. So maybe it's good to extend our knowledge and go a bit further with the current tool before we have to upgrade even further. So let's look at an example. Here I'm looking at UK Companies House registration data. This is the registration data for the four and a half million companies that exist in the UK at the moment and are still alive. It's open data, we can do some really nice analysis on it, and it's a really good mixed data type set. We have strings like the types of companies, the names of the companies, their registration numbers, the year they were formed, and some other numeric indicators. There's no financial information, but a bunch of other numerics, so we get a nice rich data set to work with. If we've got four and a half million rows of data and we've read it in with read_csv, we get object (string) columns for all of the string data types. And it turns out that strings are expensive: they take a lot of RAM, and they can make operations slow. So for example here we've got the data frame with a company category column. There aren't actually that many company categories that you can legally be registered as, but all four and a half million companies have a company category, and we can see the breakdown here with value_counts. If we ask for the memory usage at the bottom of the screen with a deep introspection, so we count the memory used by every single string entry on every single row of data, and we ignore the index because we just care about the strings, we find that it takes about 370 megabytes of RAM in this instance. That's quite a lot of memory, right? That's 370 megabytes of RAM for what is ultimately a very small set of indicators. If we switch to the pandas category data type, the categorical data type, that does some encoding behind the scenes. We get exactly the same kind of operations on the data: we can do the value_counts, we can ask for the number of characters in each string, and do all the other usual things we'd expect. But if we ask for the same memory usage, with a deep introspection and no index, we find that it takes four megabytes. That's roughly 1% of the previous memory cost. That's an amazing saving, from 100% down to 1%, with a simple change. Now, if you're doing lots of work on strings, maybe you need to use the strings in some other form. Maybe you don't have a few strings frequently repeated; maybe you have lots of unique strings, in which case category probably isn't the right type for you. Category isn't a magic string type. Category just takes low cardinality items, so items that have lots of repetition, and encodes them efficiently. Behind the scenes, if we look at the .cat accessor, we can look at the categories and the codes behind it. We see that under categories we have Community Interest Company, European Public Limited-Liability Company and so on. These are encoded in the order that they're seen in the data, they're given a numeric code, and it's that numeric code that is stored against every row of data, along the lines of the sketch below.
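A minimal sketch of that comparison, using synthetic data rather than the real Companies House file; the column name and category strings here are illustrative, and the exact sizes will differ from the slide:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Companies House data: a handful of distinct
# category strings repeated across millions of rows (low cardinality).
categories = ["Private Limited Company",
              "Community Interest Company",
              "European Public Limited-Liability Company (SE)"]
df = pd.DataFrame({"company_category": np.random.choice(categories, size=4_500_000)})

# Plain object (string) column: every row stores a full Python string.
as_strings = df["company_category"]
print(as_strings.memory_usage(deep=True, index=False) / 1e6, "MB")

# Categorical column: each distinct string is stored once; every row
# stores a small integer code pointing into that lookup table.
as_category = as_strings.astype("category")
print(as_category.memory_usage(deep=True, index=False) / 1e6, "MB")

# The usual operations still work.
print(as_category.value_counts())

# Peek behind the scenes at the lookup table and the per-row codes.
print(as_category.cat.categories)
print(as_category.cat.codes.head())
```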
And that's where our saving comes from. We store a small integer against each row of data rather than those strings, and the strings are looked up when we need them, so we can still do our string operations. If we had, for example, lots of integers encoding product categories, say zero through a thousand on a million rows of data, but only a thousand distinct product categories, you could use category to achieve exactly the same kind of speed up and the same kind of memory optimization. And then we should talk about speed ups. So what do we get? Well, if we've got the company category column and we perform a value_counts, here it costs 485 milliseconds, about half a second. If I do the same operation on the categorical variant, it's 28 milliseconds. So that's roughly 17 times faster; more than an order of magnitude. It's a huge saving in both RAM and in speed by going to the categorical data type. So there's a first tip for you: do think about using the categorical data type. We even get faster lookups. If we're using the non-category version and we're looking up a particular string, we have to do four and a half million string comparisons, and that costs 280 milliseconds. If we do the same thing on the categorical version, you might look at the number and just say, oh, 569; 569 is bigger than 280, that's no good. But look at the units. That's not milliseconds, that's microseconds. That's roughly a 500 times speed up on the lookup, and that's because we're doing integer comparisons behind the scenes: we've looked up 'Private Limited Company' as a string once, and then we do an integer comparison on the codes. So it gives us a huge speed up. If you can get away with using categories on low cardinality data, so things that are repeated frequently, do try it. You're going to get amazing savings in RAM and in speed. Now let's move on to another example. What if we're looking at, say, float64? So we're thinking about calculating some new numeric columns. We've got a datetime for now, and we've got the incorporation date for each of our companies, and we calculate a delta between the two. So companies that were registered yesterday are one day old, and companies that were registered 10 years ago are 10 years old. We store this in age_years as a floating point number. This kind of operation will typically end up as a float64 by default. So we get a float64 times four and a half million rows, and a float64 is comprised of eight bytes. Each entry costs eight bytes and gives us a huge, astronomical range and precision, possibly more precision than we need. And if you look at the dynamics on this chart, it's kind of interesting. We see that lots of companies are young, some companies are 10 years old, fewer companies are 20 years old, and there are companies that are over 50 years old. I think it's something like 190 years for the oldest extant company in Companies House. These are companies that are still alive; Companies House doesn't give you a public register of companies that have been delisted, you only get the live companies. So we're looking at a strange snapshot of all of the companies that are registered in the UK, but we can see that lots of them are quite old. So what can we do with the storage behind this, and can we get any speed up?
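Before getting to that, here's a rough sketch of how an age-in-years column like the one just described might be computed; the column name, the dates and the 365.25-day year are illustrative choices, not taken from the slides:

```python
import pandas as pd

# Three illustrative incorporation dates; in the talk this column comes
# from the Companies House CSV, parsed as datetimes.
df = pd.DataFrame({"incorporation_date": pd.to_datetime(
    ["1830-06-01", "2010-03-15", "2020-04-01"])})

now = pd.Timestamp.now()

# Subtracting datetimes gives a timedelta64 column; dividing by a year-length
# Timedelta converts it to a floating point age in years (float64 by default).
df["age_years"] = (now - df["incorporation_date"]) / pd.Timedelta(days=365.25)

print(df["age_years"])
print(df["age_years"].dtype)   # float64 -> 8 bytes per row
```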
Well, if we take our data frame df and we take the age column we just calculated, that's a float64. We can use astype to convert it to a float32, and we can use astype to convert it to a float16. A float32 is four bytes and a float16 is two bytes. So what does that mean for us? If you look at the three rows at the bottom of the screen, the float64 and the float32 report the same range of ages, roughly 0.05 up to around 190 years. So it turns out that going from a float64 to a float32 costs us nothing for practical purposes. Think about it: one day of age is roughly 1/365th of a year, so we care about approximately three decimal places. For our purposes a float64 and a float32 have the same useful precision, but a float32 has half the storage requirement. A float16 halves the storage requirement again, so a quarter of the original, but we get a rounding error, and that rounding error is approximately a month in this encoding. For some purposes that might be fine: maybe a month of approximate age is good enough. It'll be deterministic, so you won't lose any ordering in your ages; you'll just find that it's off by days or months if you try to reverse engineer the incorporation dates. Maybe that's sufficient, and maybe saving four times the RAM of that column is a good thing. If we look at the top right, you can see a timed operation on the float64 and float32 variants. Typically a 32 bit float comes out a little bit faster on numeric operations, maybe a couple of percent. I didn't show the float16; that's much slower, because float16 isn't natively supported by the hardware, so it gets emulated, which saves RAM but costs speed. So your top tip here is: think about going to float32 rather than float64 if the imprecision that is introduced is not a problem. Here there's no imprecision that we can see; it's fine for our needs. Let's look at a quick summary of this. In the top left hand data frame I've got the company category and age in years as the original strings and float64; that takes 827 megabytes. In the bottom right corner I've got the categorized version of the company strings and age in years as float32; that takes 462 megabytes, close to a 50% saving on that particular data frame. So we could double the number of rows, or add more columns as extra series, into the RAM that we've freed up.
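A minimal sketch of that downcasting check on a synthetic age column; the sizes and errors you see will differ from the slide's real data:

```python
import numpy as np
import pandas as pd

# Synthetic ages spanning the same sort of range as the talk's data
# (up to roughly 190 years); real values would come from the dataframe.
age64 = pd.Series(np.random.uniform(0.0, 190.0, size=1_000_000), dtype="float64")

age32 = age64.astype("float32")   # 4 bytes per value instead of 8
age16 = age64.astype("float16")   # 2 bytes per value

for name, col in [("float64", age64), ("float32", age32), ("float16", age16)]:
    # Worst-case rounding error introduced by the narrower type, in days.
    max_err_days = (age64 - col.astype("float64")).abs().max() * 365
    mb = col.memory_usage(deep=True, index=False) / 1e6
    print(f"{name}: {mb:.1f} MB, max error ~{max_err_days:.3f} days")
```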
I've got a tool, dtype_diet, which I wrote to help automate the process of working out the simplest change I can make to each column that doesn't introduce any errors. You can find it on my GitHub account under ianozsvald. Pay particular attention to the third column along, called 'number different': if that's zero, it means the change being proposed by dtype_diet would introduce no errors. The first proposal, on the company name column, is category. That sounds brilliant, except we end up at 552 megabytes when we were at 366, so we've "saved" minus 185 megabytes. Well, that's bad. That's because company names are unique: if you take unique items, where there's no repetition, and you try to categorically encode them, you cost more RAM, you don't save RAM. On the next example, the company category column, we're at four megabytes where we were at 369, so we've saved 365, which in the 'shrunk to' column on the right means we're at 1% of where we were. That's thanks to all that repetition. And if we look at the accounts reference day column, there were two proposals, float16 and float32, which cut that column to a quarter or a half of its original RAM without introducing any imprecision for that particular subset of data. So if you want to save RAM, I suggest you have a look at dtype_diet, because by saving RAM we often make our operations faster: we reduce the amount of data that we have to load from persistent file storage, like a pickle or a Parquet file, we decrease the saving time if we're writing data back out, and we decrease the transfer time if we're sending data to and from other processes or something like a database. Shrinking the data down from pandas' default data types makes things more efficient and faster overall. These are good things; you want these things. So what about dropping down to NumPy? Well, here's an interesting question. Behind the scenes, pandas was originally a lightweight wrapper around NumPy data types. By that I mean, if you think about the simplest implementation of pandas, it would be a dictionary where every column name maps to a NumPy array, and all the arrays have the same length. So you might have a NumPy array of int64 and a NumPy array of float32 mapped inside a dictionary. Then pandas became more complicated, and it grew, and lots more things were put on top, including non-standard data types that don't come for free inside NumPy. So we have the datetime operations, we have objects to store strings and other things if we want them, we have timedeltas, and we now have extension data types. These extension data types give us ways of extending pandas beyond anything that's baked into pure pandas, and there are some interesting extensions out there. One is for IP addresses, for example: we get formatting, visual display and some constraints around the representation of IP addresses. And we have the new nullable data types just being introduced. But this all adds complexity. In the top example, we take our age in years, our float64, and we do a sum; that takes 19 milliseconds. If I take the same column, which is a float64 behind the scenes, and I use .values to drop down to NumPy and ask for the sum, that's two milliseconds. That's a ten times speed up, just by dropping down to the NumPy storage container behind the actual pandas column, which is kind of amazing. So does this mean that pandas is slow, or that something weird is going on? Kind of yes, kind of no. First of all, pandas is doing the right thing for us. It's making our lives easier by giving us a cohesive, well-checked, robust interface to these underlying types, wrapping them up nicely so we don't have to think about them. For example, if you call sum on a pandas column that has NaNs in it, you get the sum without the NaNs, and the same for the standard deviation and other operations. If you drop down to the NumPy variant and you forget to call the NaN-aware variant of a function, you might end up with a result which is NaN. There's a nanstd, for example, which is different to std, and if you don't know to call the right one, you get the wrong result.
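Here's a tiny sketch of that pitfall; the numbers are toy values, but the NaN behaviour is the point:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0])

# pandas skips the NaN for us by default.
print(s.sum())            # 7.0
print(s.std())            # NaN-aware standard deviation

# Dropping to the underlying NumPy array is faster on big columns,
# but we take on the NaN handling ourselves.
arr = s.values            # or s.to_numpy()
print(arr.sum())          # nan  -- the NaN poisons the result
print(np.nansum(arr))     # 7.0  -- the NaN-aware variant
print(np.nanstd(arr))     # NaN-aware std (note ddof=0 by default, unlike pandas)
```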
If you stay in pandas, you get the right result, but it's a bit slower. I've got more details in the talk I gave recently at PyData Amsterdam, and you'll find that talk on my blog, ianozsvald.com, if you want to go into it. There's another reason why these things are slower: there's overhead due to handling all of these different extension types. This is a nice diagram produced by a tool that my friend James Powell put together a while ago, and a copy of that tool is on my GitHub. The tool says: if I call a function, for example series.sum on that same column we had a moment ago, record every function called from inside .sum and all of the functions called by those. It's not a profiler; I don't care about the time, I just care about the complexity: how many lines of code, how many functions, how many files are being touched. And we get this complex diagram. We can zoom into it and see the indexes, the dtypes, the core NumPy and extension array machinery, all sorts of things being called. But if we call series.values.sum, we bypass a bunch of those mechanisms and we touch 18 files and 50 functions, a bunch less, and we see a simpler complexity diagram. Things get simpler if we jump behind the scenes to NumPy. So it goes faster, which is brilliant, but you have to know that you want to go down to that level. And that's important, right? If you take away generality and become more specific, things go faster, but you add a bit of complexity that you have to manage. So is pandas necessarily slow? Well, no. One issue is that by default, pandas does not install the bottleneck library. Here I've got the bottleneck library. In the first example I turn it off, setting use_bottleneck to False, and calculating the series mean takes 54 milliseconds. I then turn bottleneck on. It's a background accelerator that you don't see; you just have to install it. I calculate the same operation: 16 milliseconds, a three times speed up, just by installing one library and doing nothing else. It's used automatically in the background; you just have to remember to install it, because it doesn't come by default. We can still go faster if we go down to the NumPy layer, but then you have to start managing whether you've got NaNs or the right data type. So my advice here is: install bottleneck. That's a really simple one. In fact, there are two optional dependencies that are not installed by default; if you look at the pandas 'enhancing performance' page, you'll see reference to them: bottleneck and numexpr. Install them and pandas will become faster. That won't change the storage that is required; you have to do that manually, and to do that you might want to look at the dtype_diet package on my GitHub page. I've also got a tool, ipython_memory_usage, which is in the Python Package Index, so you can pip install or conda install it. It tells you in Jupyter how much memory each operation you type is taking as you go through your interactive session, and that can help identify costly memory areas.
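A minimal sketch of that toggle, assuming the optional bottleneck package has been installed; the synthetic series and timings are illustrative only:

```python
import timeit

import numpy as np
import pandas as pd

s = pd.Series(np.random.uniform(size=5_000_000))
s.iloc[::100] = np.nan    # some missing values, where bottleneck helps most

# Plain pandas/NumPy code path.
pd.set_option("compute.use_bottleneck", False)
print("plain pandas :", timeit.timeit(s.mean, number=10))

# Same operation with the bottleneck accelerator switched on (the default
# whenever the library is installed).
pd.set_option("compute.use_bottleneck", True)
print("bottleneck on:", timeit.timeit(s.mean, number=10))
```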
What about making things faster in other ways? What if we've got a pure Python function? Here I've got a loop which I'm going to apply. I'm saying that the older a company is, the more likely it is, in some kind of Monte Carlo sampling scenario, to have won one big contract. It's won the jackpot, it's got a big win, hooray! So young companies are less likely, old companies are more likely. It's a silly function, and it costs CPU because it goes around a loop a lot of times. When we run this with apply over four and a half million rows, it costs 13 seconds. And as we saw in the Intel talk earlier on, if you use Numba and wrap your code with @njit, suddenly it goes, in this case, ten times faster. That's an amazing win just for importing Numba and wrapping your code with @njit. Now, not all of pandas is supported (most of pandas is not supported), and not all of pure Python is supported; Numba is quite selective. The Intel talk was interesting: maybe there are some useful extensions available now that weren't available six months ago. But you can get some huge advantages by using Numba. Another way to get an advantage is to use the Dask project. Dask is allied very strongly with pandas. Dask lets you go multi-core parallel, and multi-machine multi-core parallel. It wraps up multi-core versions of a NumPy array, so you can stretch arrays across multiple machines; or bags, which are like dictionaries split across multiple machines or multiple cores; or pandas data frames, where each data frame is chopped up into partitions and put onto different cores or different machines, and then things are computed in parallel. Here's a very simple example. I take that "won a huge contract" function I had before, and I take my distributed data frame, which is a variant of the original pandas data frame: you use dd.from_pandas to turn the data frame into ddf, the distributed data frame. I ask Dask to take the distributed data frame and apply my function in parallel, with no modifications to that function and no modifications to the underlying data. Dask just splits it up for me, and then my 15-second execution takes about five seconds. It's not a pure four-times speedup: I've got four physical cores and four virtual cores, and you can see them maxed out at the bottom of the screen, but there's some overhead from copying data around. Still, you get a nice speedup just for a couple of extra lines, along the lines of the sketch below.
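A rough sketch of both ideas on synthetic data; the function below is a toy stand-in for the "won a huge contract" function on the slides, not the original code, and the column name and partition count are illustrative:

```python
import numpy as np
import pandas as pd
import dask.dataframe as dd
from numba import njit

@njit
def chance_of_big_win(age_years):
    # Toy per-row loop standing in for the Monte Carlo style function on the
    # slides; the point is that it's plain numeric code Numba can compile.
    total = 0.0
    for _ in range(1_000):
        total += min(age_years / 190.0, 1.0)
    return total / 1_000

df = pd.DataFrame({"age_years": np.random.uniform(0.0, 190.0, size=1_000_000)})

# Single core: apply the compiled function row by row.
single_core = df["age_years"].apply(chance_of_big_win)

# Multi-core with Dask: split the frame into partitions and apply in parallel.
ddf = dd.from_pandas(df, npartitions=8)
parallel = ddf["age_years"].apply(
    chance_of_big_win, meta=("age_years", "float64")).compute()

print(single_core.head())
print(parallel.head())
```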
So here's something to pay attention to. It's important that we are highly performant: not just that we write highly performant code, but that we as developers, as data scientists, as researchers are performant. Mistakes slow us down; mistakes in our implementations make us go more slowly, and I've been reflecting on this as I get towards 20 years in this domain. Think about using the nullable data types, Int64 and boolean, that have been introduced in pandas. These give us the power to mix integers and booleans with nulls without switching to a float64 (the lowercase-f version), which can introduce rounding errors. It's kind of interesting. The use of unit tests, and any kind of testing, is incredibly important; we want lots of that in our data science code. I cover loads of this in my training, because I've found it's one of the best ways to make myself a more efficient developer, more efficient in reaching a solution in a way where I trust my code and believe it's resilient, robust and repeatable. A bunch of these tips and talks are up on my blog, so you can go and take all of those. I've also got a newsletter; there's a link to it on my blog. If you sign up, then every two weeks you get some data science jobs that I know about, tips on things I've discovered, and tools that I'm working on. So that might interest you; you'll find it all on my blog. There's one other topic I wanted to touch on very briefly. There are two interesting tools, separate to the Intel SDC: Vaex and Modin. These extend the pandas idea in different ways. Modin sits on top of pandas, and if you've got lots of cores, like 20 or 50 cores, and lots of RAM, like tens of gigabytes, and a huge data frame, it will try to make things run in parallel across all of that, which is very interesting. Vaex is a different implementation of a pandas-like system using memory mapping on disk. If you've got terabytes of data, Vaex can make your pandas-like operations run really quickly on all of that data. I've given past talks on both this year; go to my blog and you'll see those, from remote Python events earlier in the year and from Budapest BI. So, in summary: it's important to make things right, and then you can make them fast. Always think about being performant, not just writing performant code, but being performant in your development style: write defensively, clean up your code, refactor, write unit tests, communicate to your colleagues what's going on. If you're interested in classes on these kinds of topics, please do message me and have a chat, and have a look at my blog; I'm very happy to talk about that. And I have a request. I've been giving talks now for nine years; they're all volunteer, they're all fully open sourced. Right now I can't see any of you, I can't see any of your smiling faces, but hopefully I've made some of you happy and I've taught some of you nice new things. One thing I ask after all of my talks now is: if I've taught you something, send me an email, ask for my address, and send me a postcard. I stick it on the wall with all the other postcards, and it's a nice bit of physical feedback that tells me I'm doing something useful and people are getting value out of all of this volunteer time I've put into the community. And don't forget: go back into the lobby and thank all the volunteers and the other speakers for all the volunteer time they've put into the community as well. Thank you very much for listening. Oh, thank you very much for the talk. You're very welcome. We have a little bit of time for questions, so let's try this. The first question we have here is from Daniel: do the new nullable data types in pandas have a significant impact on performance over the standard non-nullable types? I think the simple answer is always going to be yes, and that's because the built-in data types are provided by the hardware. That's why float16 (the lowercase-f float16) is slow: it isn't supported natively, so it's simulated, whereas float32 and float64 are fast, and int8, int32 and int64 are fast, because they're provided by the hardware. As soon as you put a higher-level interpretation on top, so the capital-I Int64 and the upcoming capital-F Float64, which are nullable with different semantics, you have to carry a bit mask for the nullable items alongside the underlying int64, and when the mask is applied, more operations occur. Even with a fast compiler, additional things are happening outside of the silicon, so things will go slower. So yes, it's slower, but you're more likely to be correct. And getting things wrong is the easiest way to lose a minute, or an hour, or a day, or a month, or maybe even your job if you get things really badly wrong.
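A small sketch of the correctness side of that trade-off, comparing the classic behaviour with the nullable Int64 extension type:

```python
import numpy as np
import pandas as pd

# Classic behaviour: a missing value silently turns integers into float64,
# which is where subtle rounding surprises can creep in.
classic = pd.Series([1, 2, np.nan])
print(classic.dtype)        # float64

# Nullable extension type (capital-I Int64): integers stay integers and the
# missing value is pd.NA, at the cost of an extra mask being tracked.
nullable = pd.Series([1, 2, None], dtype="Int64")
print(nullable.dtype)       # Int64
print(nullable)             # 1, 2, <NA>
print(nullable.sum())       # 3 -- missing values are skipped
```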
So I'm a huge fan of getting things right and worrying about speed later on. Okay, and a final question: is there any trick, if you have pandas operations which take up a lot of processing time, to get some feedback on the running time? For example, if you do something that takes a minute or two, could you get a progress bar, or feedback that would let you cancel if you realize your code's too slow? Oh, interesting question. By default, no; I think I'm correct in saying by default, no. Often in a Jupyter Notebook you can stop the execution of the current cell if it's just taking too long. Dask has support for a progress bar; it uses the tqdm module, and tqdm is a really nice lightweight progress bar which plugs into lots of tools, but you have to add support for it. It doesn't tie into pandas by default, but Dask supports it. So if you've made an operation run across multiple partitions of your data, Dask will use that progress bar to tell you how things are doing. And you could use Dask on a single machine, even on a single core, partitioning a large data file into smaller partitions that fit into RAM. So you could process a terabyte of data with Dask on a small-RAM machine with one CPU if you wanted, and you'd get a progress bar out of it. What I've normally seen is that tools use the tqdm progress bar to report the time taken by operations. I don't know of anything else beyond that, I'm afraid.
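As a sketch of one way to get that kind of feedback on a single machine, Dask's built-in diagnostics progress bar (a separate mechanism from tqdm) reports progress for any .compute() call on the local scheduler; the data and partition count here are illustrative:

```python
import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask.diagnostics import ProgressBar

df = pd.DataFrame({"x": np.random.uniform(size=2_000_000)})
ddf = dd.from_pandas(df, npartitions=16)

# Any .compute() inside the context manager prints a progress bar as the
# partitions are processed.
with ProgressBar():
    result = (ddf["x"] * 2).mean().compute()

print(result)
```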
Okay, thank you very much for this talk, this was useful. I've got one more slide, if we want to see it, if we've got a moment. Sure. We've got a bonus slide, just in case I had time to talk about it. So I care about the effect of the pandemic on the British economy; that's why I've been looking at Companies House data, and I thought it would be interesting to graph a simple statistic out of it. Now, I didn't need to go high performance on this: it runs in a couple of seconds anyway, with only four and a half million rows of data, but it's a nice data set to work with because of all of those mixed types. If we look at the trends here, we go back to 2018 and we look at the registration frequency by week, actually, sorry, the registration frequency per day for companies. We see that typically in 2018 about one and a half thousand companies were being registered per day, with a sharp decline on January the first and around Christmas, because the offices are closed. The frequency goes up early in the year, levels off for the summer, from about July onwards, and then by Christmas, so January 2019, it drops down again. Then it spikes up to about 2,000 registrations per day, calms down for the rest of the summer, and by Christmas it drops down again. Early 2020 it jumps up to a similar level, and then with the March 23rd lockdown everything drops, with a precipitous decline in company registrations. Since then we've seen a resurgence of registrations in the UK, which begs an interesting question: is it that people stopped registering companies and have now started making new businesses again, or were the staff simply not allowed to go to Companies House and do the physical work of registering companies? And as an indicator of the impact of the pandemic on the British economy, what does this mean? Are we really seeing a lagged indicator of economic activity via new businesses? We can't see the underlying activity directly here; I'm looking at that kind of question in other work. To me, this is interesting because you can do this kind of analysis yourself on open data provided by your government, particularly in the EU. And if you haven't done that kind of thing and you're curious, I would strongly suggest you go and get some open data and look at the effect of the pandemic on the world around us. It's an interesting point in time to get this data and see trends coming through that we've never seen before, and you can do it all with Python and the tools mentioned in this talk. Right, there we go. Thanks for that. And if you have some links for open data, maybe you can post those in... Yeah, I'll post them into my talk channel for sure. Or in the chat. And that would also be the right place to ask more questions about this. Thank you very much. I learned a lot, and I think I have to find out your address to send you a postcard. I would love a postcard. I really would love postcards. Please send me a postcard.