We're talking about Koalas, which is basically an open-source project that exposes the Pandas API to execute code on Spark clusters. A lot of terms just came up there, so, on Pandas: Jim actually gave us a brief introduction to Pandas and some of the issues it can cause, namely that it is single-threaded and memory-bound, but nevertheless it's a very, very popular framework for slicing and dicing your data in Python. Data scientists often spend three-fourths of their time in Pandas and the other quarter in scikit-learn doing the actual machine learning, so a big part of their work is just slicing and dicing with that data structure.

Koalas is an open-source project initially developed by Databricks, open-sourced in April of last year, that takes the same API that Pandas provides and makes it a wrapper over the Apache Spark APIs. Spark, as you may well know, is the distributed in-memory big data engine that became so popular; it started about ten years ago. So what are the goals? The goal is really to make Pandas big-data-ready. The issue with Pandas right now is that it's single-threaded and memory-bound; it's really just meant for a single machine, a single processor. Spark, on the other hand, leverages a cluster of machines and is fully distributed in terms of both compute and memory. When you read a data set into Spark, it partitions and distributes that data across all the workers, and from then on all the computation is done in a parallelized fashion. And this comes out of the box.

As I mentioned, the Koalas package was open-sourced in April of 2019. We're now seeing, I think this slide is missing a date, around 8,000 daily downloads for this package. It's available via pip; the package is called koalas. Architecturally, it's really just a lean API layer sitting on top of Spark itself; it's just that the notation and API specifications have been adopted from Pandas, which is really good.

A few design principles. One is to be Pythonic. Spark itself, for example, uses camel case for everything; to be Pythonic is to use snake_case, so of course Koalas adopts snake_case as its function notation. Everything is NumPy-friendly, you can import and export in and out, NumPy is very much integrated within Koalas, and the documentation follows the PyData conventions. So really we're trying to have Koalas functions and Pandas functions be identical: same naming conventions, same functionality. Basically anything within the Pandas API that's found to be distributable, parallelizable, is taken over and adopted in the Koalas project.

A few notes here on guardrails, really on safety. Koalas is meant to perform at scale, so any functions, any processes which cannot be distributed, which cannot be executed at scale, are left out. You can rest assured that anything you do in Koalas is going to be parallelized, and it's going to scale to petabytes; I'll give an example of one group that's using Koalas, the Pandas API, to process around a petabyte of data. There are a few exceptions, things that aren't parallelizable and that can be dangerous, and it's really these two castings: taking your Koalas DataFrame and casting it over to a Pandas DataFrame or a NumPy array.
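Roughly, those two casts look like this (a minimal sketch, assuming the databricks.koalas package and a running Spark session, as on Databricks):

    import databricks.koalas as ks

    # A Koalas DataFrame lives distributed across the Spark workers
    kdf = ks.DataFrame({"x": range(1000)})

    pdf = kdf.to_pandas()   # pulls every partition onto the driver
    arr = kdf.to_numpy()    # likewise materializes the full array on the driver

Both calls are fine on small data, but at scale they can exhaust the driver's memory.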
These two actually wind up pulling everything from your workers onto your driver, and there's potential there to kill your driver with an out-of-memory error. So just as with the DataFrame's collect(), you Spark gurus out there: careful, don't do this in production.

Now, a few differences between Pandas and PySpark. Pandas comes with everything out of the box; its type system is based on NumPy, and of course it's very Pythonic. PySpark, which is the Python API for Spark, has some primitives with abstractions on top, its type system comes from ANSI SQL, and underneath it's actually running Scala DataFrames. A few differences between the Pandas DataFrame and the Spark DataFrame: on the Pandas side, DataFrames are somewhat mutable; on the Spark side, completely immutable. The rest is just a demonstration of performing the same task in each.

One function that's really popular is value_counts. It's basically a group-by and a count, but it's such a common task that Pandas has this dedicated value_counts function; in PySpark you have to spell it out, which is fine. The trouble is, if you have a pipeline already written in Pandas and your data volume has grown to the point that Pandas is no longer the appropriate solution, you need to move on to Spark, and prior to Koalas you would have had to do a rewrite: take each little snippet like this and expand it out, line by line. It's just a very time-consuming task.

Another example here, a very, very simple one. On the Pandas side, on the left, we read some data from a CSV file, change its column names to x, y, and z1, and create a new column called x2, which is the product of the x column by itself, x squared. On the PySpark side you have to do a spark.read, add a few options, then .csv(...); to change the names you have to do a toDF("x", "y", "z1"); to add a new column you have to do a withColumn("x2", ...).

Now we introduce Koalas. This is the same thing that we just did in PySpark; everything you see here on the right maps to the previous PySpark lines, and as you can see, basically the only difference from the Pandas version is the import statement at the very top; all three versions are sketched below. In fact, if the project continues to advance at the pace it is, you will soon be able to take your Pandas script and, at the very top, instead of saying import pandas as pd, say import koalas as pd, and you don't have to change your pd calls at all. You literally change the top line, and suddenly you've taken your workload from a single-threaded, memory-bound framework to a fully distributed one.

All right, demo time, the fun part. I have a Spark cluster running live on Azure Databricks. Here we go, here's my little cluster; it's actually very little, it has a single worker. I have Koalas installed, pulled straight from the PyPI repository. Let me jump to my notebook. This is a Databricks notebook, very similar to Jupyter for those familiar with it. I'm going to import a few things: Pandas, NumPy, and Koalas itself.
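Those side-by-side slide examples, reconstructed as a rough sketch (the file name, header and schema options, and column names are illustrative, not the exact demo code):

    # Pandas
    import pandas as pd
    df = pd.read_csv("data.csv")
    df.columns = ["x", "y", "z1"]
    df["x2"] = df.x * df.x

    # PySpark
    df = (spark.read
          .option("inferSchema", "true")
          .option("header", "true")
          .csv("data.csv"))
    df = df.toDF("x", "y", "z1")
    df = df.withColumn("x2", df.x * df.x)

    # Koalas: same as the Pandas version, only the import changes
    import databricks.koalas as ks
    df = ks.read_csv("data.csv")
    df.columns = ["x", "y", "z1"]
    df["x2"] = df.x * df.x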
Okay, good. Let's make a Koalas Series; I'll go through a bunch of types and castings of one type of DataFrame to another. So this is a Series: just as Pandas has a Series, Koalas has a Series as well. Let's get the type; it should be done in a second. And here we go, that's our Series, the basic element.

Now I'm going to do something different: I'm going to create a Koalas DataFrame, the same way you would do it in Pandas, and if you watch, once you run it, it actually kicks off a Spark job. Here's the data, great. Next, I'm going to use Pandas to create a date range, which I'll save for later, and I'm also going to use Pandas to create a Pandas DataFrame; that's pdf, the Pandas DataFrame. Now let's do a cast: from_pandas on the Pandas DataFrame that we just built, basically casting the Pandas DataFrame to a Koalas DataFrame. Done. And if I want to, I can display it, and it actually starts a Spark job. So this one ran a Spark job; the Pandas side did not. The Pandas version ran on the driver of the cluster; this one ran on the workers, fully distributed. The first one was just single-threaded on the driver. Same thing, same results, though.

I can do something else, too: I can take the Pandas DataFrame and cast it over to a Spark one. You can go back and forth between Spark, Pandas, and Koalas DataFrames, interchangeably, depending on what you're trying to do. Now I'll cast a Spark DataFrame to a Koalas one; again, this is fully distributed, and you can see a Spark job come up. There you go.

Okay, let's do some fun stuff. Take a look at the types. We have that Koalas DataFrame; I'm going to do a head on it, and then run a bunch of standard Pandas calls, like head. Those familiar with Pandas should find all these calls very, very common; you probably use them every day. I'll take a look at the index of this little dummy data set we put together, then the columns, and then cast it to a NumPy array. This, by the way, is dangerous: this is taking data from the cluster and bringing it to the driver. Careful; this is one of the few exceptions to watch out for.

Pandas has this function called describe, basically a statistical summary of your data set. Koalas has it too, except that now it runs as a Spark job. So now we have our count, mean, max, standard deviations, and so on. You can imagine running this one-liner on terabytes of data. That's summaries. Transpose, again these are all Pandas calls. A quick sort: that was sort by index, then sort by a specific column.

Now for missing data. I'm going to take my Pandas DataFrame and add a new column E to it, plus add some dates, using the index of the dates that we generated. Right now that column E is a bunch of NaNs, nulls, and for a couple of those entries we'll set the value to one, and then print this guy, pdf1. Okay, there we go: here's our new Pandas DataFrame. It has a bunch of dates, an index, five columns, and this column E now has two missing values. I'm going to cast that, of course, to a Koalas DataFrame.
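That whole sequence of casts, as a sketch (following the style of the standard Koalas intro tutorial; the "spark" session object is assumed to exist, as in a Databricks notebook):

    import pandas as pd
    import numpy as np
    import databricks.koalas as ks

    dates = pd.date_range("20200101", periods=6)
    pdf = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))

    kdf = ks.from_pandas(pdf)         # Pandas -> Koalas (distributed)
    sdf = spark.createDataFrame(pdf)  # Pandas -> Spark
    kdf2 = sdf.to_koalas()            # Spark -> Koalas (patched in by the import)

    # Missing data: add a column E full of NaNs, then set two entries to 1
    pdf1 = pdf.reindex(index=dates[0:4], columns=list(pdf.columns) + ["E"])
    pdf1.loc[dates[0]:dates[1], "E"] = 1
    kdf1 = ks.from_pandas(pdf1)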
Running the cast kicks off a Spark job, but it's still the same thing, just running on Spark now. And you still get to use all the standard Pandas tools for handling missing values, like dropna. There we go: dropna drops any row which has an NA in it, in this case the rows that had NAs. So that's that. For filling missing data, we take the DataFrame, and any time it sees a NaN it stamps it with a five. Great.

Statistical work, I think you already saw some of this earlier. Then Spark configurations: in the back end you can leverage Apache Arrow for serializing and deserializing data between Spark and Python. What I'm going to do is actually two things. First I'm going to run a very simple job here with Arrow enabled: I'll create a range of 300,000, so zero to 300,000 minus one, cast that to Pandas, and time the whole thing. Then I'm going to run the exact same thing with Arrow disabled, and we'll see the difference. Okay, so there it is; this all ran on Spark, and it took roughly one third of a second. Here we go with Arrow disabled; sit back for a second, because it's a lot slower, go get your coffee... so 1.1 seconds, maybe four times slower. Okay, let's enable Arrow again.

Groupings. This is the bread and butter of common tasks within Pandas: slicing, dicing, grouping. Again, let's create a Koalas DataFrame with a bunch of random numbers in it too. Four columns here: A is a bunch of foos and bars, B is ones, twos, and threes, as strings, and C and D are normally distributed random numbers. Group by A and get the sum; again, this is all Pandas API code. Then group by two columns and get their sum. Good. Very common tasks within Pandas.

Plotting as well. I'm going to import good old matplotlib, thank you John Hunter, and create a new data set: a Pandas DataFrame with four columns of random numbers, a thousand rows each. I'm going to cast that to a Koalas DataFrame, and we can actually print it out if you want; let's do that. Yeah, just a bunch of normally distributed numbers, a thousand rows. Oops, maybe I shouldn't have printed all of that, bad idea, never mind. Let's take the cumulative sum, and I want to plot that cumulative sum using matplotlib. There we go. Again, everything we're doing here is a Spark job in the back.

Reading and writing: I'm going to take that Koalas DataFrame we just generated, write it out to a CSV file, and then read that CSV file back again. Other data formats are supported here too: Parquet, the Delta format, and even ORC. So regardless of what format your data has been written in, you can now do all your I/O activity with Koalas; a round-trip example is sketched below.

And if you like, there's a great post on the Databricks blog, just Google it, about one Databricks customer that's processing around a petabyte of data using the Pandas API through Koalas, which is pretty wild. And of course, if you want to contribute, github.com/databricks/koalas is the repo.
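The read/write round trip, as a sketch (paths and data are illustrative, not the demo's exact files):

    import numpy as np
    import pandas as pd
    import databricks.koalas as ks

    kdf = ks.from_pandas(
        pd.DataFrame(np.random.randn(1000, 4), columns=list("abcd")))

    kdf.to_csv("/tmp/koalas_demo_csv")            # write; one file per partition
    ks.read_csv("/tmp/koalas_demo_csv").head(3)   # and read it back

    kdf.to_parquet("/tmp/koalas_demo_parquet")    # Parquet round trip
    ks.read_parquet("/tmp/koalas_demo_parquet").head(3)
    # Delta is analogous: kdf.to_delta(...) / ks.read_delta(...)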
Thank you very much. I'm Ben Sadeghi; find me on LinkedIn, Twitter, and GitHub with the bensadeghi handle, and I have all the slides and demos posted on GitHub, so again, github.com/bensadeghi, and you'll see a repo called FOSSASIA2020. Thank you very much. Does anyone have any questions?

[Audience question about API coverage.] Right now there might be a few methods that haven't yet been mapped over, but all the really popular, common ones have been done, so there's a good chance you can just change your import line and let the rest run.

[Audience] This is a bit of a question about Pandas and Koalas. It's quite recent that Pandas actually went to Pandas 1.0, and there were a lot of changes to the APIs, a lot of deprecations; some functions that used to work in 0.25 stopped working the way people thought. So how does the Koalas project actually keep up with Pandas?

We're sticking with version 1.0 and forward, so we're not actually going to support some of the older versions. And that was a great move by the Pandas community, to finally, after I don't know how many years, release 1.0, because that meant the API is now stable, right?

[Audience] And they're not going to break the API until Pandas 2, correct?

Correct, it's basically just optimizations or additions to the existing API; they're not going to break it for a long time. So that was the go-ahead for us to come in and say Koalas is going to support Pandas 1.0 and up. Very good question, thank you.

[Audience] How do you handle those dangerous functions?

Only use those in dev and test, and carefully.

[Audience] But these things will happen. From a Koalas perspective, how do you deal with that?

Yeah, we're talking about these two functions, the DataFrame's to_pandas and to_numpy. You often do these just to see what your data looks like in the Pandas world or the NumPy world. So what you can do, before making the call to Pandas or to NumPy, is down-sample the data: call sample on, say, 1% of all the data distributed across the Spark workers.

[Audience] Should that down-sampling be done by Koalas?

No, this is done by you. We're not going to force any down-sampling by default. That's why it's these two, and really just these two, that are dangerous. There's a third one, collect, and that's just as dangerous on the Spark side. But honestly, I rarely find myself using these, even during dev and test; I just demoed them here to show them.

[Audience] The challenge with some of these is that somebody who hasn't read the documentation yet will hit one of them and it'll blow up.

Yeah, it'll blow up, and they'll learn quickly.

[Audience] They shouldn't be in production, though.

They shouldn't, yeah. These should never be in production. All right, thank you. Thank you very much. Thank you, everyone online. Thank you.
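The down-sampling pattern from the Q&A, as a minimal sketch (the table path is hypothetical; sample and to_pandas are the Koalas calls discussed):

    import databricks.koalas as ks

    kdf = ks.read_parquet("/data/big_table")   # a large, distributed table

    # Down-sample on the cluster first, so only ~1% ever reaches the driver
    preview = kdf.sample(frac=0.01).to_pandas()
    print(preview.describe())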