Hi everyone, thank you. I know we're at an infrastructure conference, so it's nice to see fellow machine learning practitioners, or people interested in machine learning. Welcome to Scaling Machine Learning Workflows to Big Data with Fugue. We're going to start off with a demo presented by Han. In this demo, we'll start with a small, pandas-sized DataFrame, apply some business logic to it, and then bring that logic into Spark. In the process, we'll see how much code is needed to bring it to Spark, how different pandas and Spark are from each other, and how the open-source library Fugue can help you do this.

Hello everyone, my name is Han Wang, and today I'm going to do a quick demo of some very interesting features of Fugue. I'm going to switch between Python code, Spark code, and Fugue code. It's okay if you're not familiar with Spark and can't follow the Spark parts, because those are only there to compare against Fugue. The most important thing is to focus on how Fugue solves these problems and how intuitive your own expression of the problem can be.

Okay, now let's start. The first example involves a tricky piece of business logic, the kind you'll always run into if you work at a company. We're going to process a DataFrame with four columns, a, b, c, and d, all strings. For any given row, if the number of non-null values is less than two, we drop the row; otherwise we concatenate the non-null values in order, output the result as a new column e, and keep the original columns. This is not so difficult, and I think the most intuitive way to express this logic is as a function that takes an iterable of lists (each list being one row of values) and yields the rows that meet the criteria. This is just Python code, and it's very testable. Running it on a small example, you can see the last two rows get dropped because they have too many nulls.

Okay, so that's the logic. Now let's build some sample data to work on. You don't have to care about the helper function; just look at the resulting DataFrame: it has the four columns a, b, c, d and contains some nulls. So now the question is very simple: how can you apply the concat function to this sample data? This is a pandas DataFrame, and even if you use apply, you cannot apply this function directly; you'd have to do some sort of transformation first. I'm not including a pandas solution here because we don't have much time, so let's jump directly to how Fugue solves this problem. Remember, the semantics are: I want to apply that function to that DataFrame. That simple. Let's see how Fugue does it. In Fugue, we just import transform, apply the concat function to the pandas DataFrame, and specify that the output schema is the existing schema plus an e column. That's how Fugue solves the pandas version of the problem.

But in real production, this DataFrame may be a Spark DataFrame, because you're probably processing terabytes of data. So we want to scale to Spark, and the first thing to do is initialize a Spark session. For simplicity I just start a local Spark session here, but you'll always have your own way to get a Spark session.
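As a rough sketch of what the demo code looks like (the sample values here are made up, but the function shape and the transform call follow the pattern described above):

```python
from typing import Any, Iterable, List

import pandas as pd
from fugue import transform

# Business logic: drop rows with fewer than two non-null values; otherwise
# concatenate the non-null values in order into a new column e.
def concat(df: Iterable[List[Any]]) -> Iterable[List[Any]]:
    for row in df:
        non_null = [v for v in row if v is not None]
        if len(non_null) >= 2:
            yield row + ["".join(non_null)]

# A small sample frame standing in for the demo's pd_sample
pd_sample = pd.DataFrame(
    [["a", "b", None, "d"],
     [None, "b", "c", None],
     ["a", None, None, None],   # only one non-null: dropped
     [None, None, None, "d"]],  # only one non-null: dropped
    columns=list("abcd"),
)

# Apply the function with Fugue: output schema = input schema plus column e
result = transform(pd_sample, concat, schema="*,e:str")
print(result)
```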
So now the question is: how can you apply this function to this DataFrame using Spark? Look at how Spark does it. There are several steps. First, you have to bring the pandas sample into Spark. Second, mapPartitions has to be used on an RDD, so you have to convert the DataFrame to an RDD and map over it. And in mapPartitions, the function can only take an iterable of rows, where each row is a Spark-specific Row class or a tuple, so you have to write the conversion yourself so that concat can accept that input type. Another thing: after you've done all this conversion, you have to convert back from the RDD to a Spark DataFrame, and this is probably the simplest way to describe your schema; I'd guess a lot of Spark users are using an even more tedious way, especially when the schema is complicated. So we've solved the problem, which is good, but it took over ten lines of additional boilerplate code. And think about it: this is just one function. If you have a lot of such transformation logic, how much boilerplate code will you have to write?

Now let's see how Fugue solves this. As you can see, in Fugue it's again only the transform function. Instead of passing pd_sample we pass spark_sample; the concat function is unchanged, the schema is unchanged, and we just tell the system to use your Spark session to run this transformation. This means: I want Fugue to translate this semantic into Spark operations and run it on Spark. So one line of code solves the problem. And the next example: what if I don't even want to convert the original DataFrame to a Spark DataFrame first? You can actually say directly: I want to transform this pandas DataFrame using this native Python function, but I want everything to happen on Spark. You can do this too; the data conversion is done automatically at the Fugue level.
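To make the contrast concrete, here is a hedged sketch of both routes, reusing the concat function and pd_sample from the sketch above (the wrapper and column handling are illustrative, not the demo's exact code):

```python
from typing import Iterable

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# --- The raw Spark route: RDD conversion plus a hand-written wrapper ---
def concat_wrapper(rows: Iterable[Row]) -> Iterable[Row]:
    # Adapt Spark Rows to plain lists so concat can consume them,
    # then adapt the output lists back to Rows.
    for out in concat([list(r) for r in rows]):
        yield Row(a=out[0], b=out[1], c=out[2], d=out[3], e=out[4])

sdf = spark.createDataFrame(pd_sample)
spark_result = spark.createDataFrame(
    sdf.rdd.mapPartitions(concat_wrapper),
    schema="a string, b string, c string, d string, e string",
)

# --- The Fugue route: the same transform call, now pointed at Spark ---
# Passing the pandas DataFrame directly also works; Fugue converts it.
fugue_result = transform(pd_sample, concat, schema="*,e:str", engine=spark)
fugue_result.show()
```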
Okay, now let's think about another case: a business requirement change. Remember we had the four columns a, b, c, d. Now, for some reason, column d no longer exists; we don't want column d. So we create another sample, pd_sample_2, where the business logic is otherwise unchanged and we're just missing one column. Will that cause any problems? Actually, if you try exactly the same thing on Spark, you will get an exception. Why? Because the Spark logic hard-codes too many things; it's not flexible enough to adapt to this changing environment. Just for this small change, you have to rewrite your wrapper and rewrite the output schema. So in real production, you'll struggle with the technical decision: is this even worth it? But if we apply Fugue, look at this: there is no change at all. You can still apply exactly the same transform function to the data without the d column, and everything else stays the same. This is because in Fugue, a schema expression like "*,e:str" is already dynamic: it's determined by the input DataFrame. And because we wrote concat in the most intuitive way, it's actually column agnostic: you can have more columns or fewer columns, and as long as the logic doesn't change, your functions don't change and your Fugue code doesn't change.

Okay, now let's switch to the second example: machine learning. In this example we have a DataFrame with many categories, and for each category the features a, b, c, d, e have a different relationship with the target y, meaning that ideally you should train separate models per category, because they're really different machine learning problems. But first, let's just generate the data and take a look. The training set has around 400 rows and the test set has several thousand; we have the features, the ground truth, the category, and a train flag (true for all the training rows).

So first of all, let's train a logistic regression on the entire training set. We can also write a very simple predictor function that takes a pandas DataFrame and the logistic regression model and returns the predictions. These are just simple Python functions; they have nothing to do with Spark or anything fancy. Now think about this: if the test data is huge, how can we apply this predict function to a huge Spark DataFrame?

If you do it in Spark, you see the nasty part of Spark schemas: you have to import a lot of special classes like StructType, StructField, and IntegerType. Look at this code. First you bring the DataFrame into Spark, then you have to construct a schema. In Fugue it's just "*" plus the new column, but here you have to build it class by class. Another thing: you cannot use predict directly, because with pandas UDFs in Spark 3 your function has to take an iterator of pandas DataFrames as input and return an iterator of pandas DataFrames. So we have to write a wrapper just for this purpose. In the end, we use a select statement to compute the precision, where one term is the sum of true positives and the other is the sum of all positive predictions. As you can see it works, but this code is really tedious and has a lot of Spark dependencies. Look how many things you have to import from Spark, and how many special data types you have to learn just to use it.
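Here is a hedged sketch of that pattern. The names train_pdf and test_pdf stand in for the generated training and test sets, and the output schema listing is illustrative; the demo's actual code may differ:

```python
from typing import Iterator

import pandas as pd
from sklearn.linear_model import LogisticRegression

FEATURES = list("abcde")  # the demo's feature columns

# Train one model on the entire training set
model = LogisticRegression().fit(train_pdf[FEATURES], train_pdf["y"])

# A plain pandas predictor: no Spark dependency, easy to unit test
def predict(df: pd.DataFrame, model: LogisticRegression) -> pd.DataFrame:
    return df.assign(pred=model.predict(df[FEATURES]))

# Raw Spark needs the iterator-of-pandas wrapper form (e.g. for mapInPandas)
def predict_wrapper(dfs: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    for df in dfs:
        yield predict(df, model)

spark_test = spark.createDataFrame(test_pdf)
spark_pred = spark_test.mapInPandas(
    predict_wrapper,
    schema="a double, b double, c double, d double, e double, "
           "y int, category string, train boolean, pred int",
)
```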
But what about Fugue? In Fugue it's again just two lines. The first line is still transform, and we can apply it directly to the test dataset, no matter whether it's a pandas DataFrame, a Spark DataFrame, a Dask DataFrame, or cuDF, any DataFrame that Fugue supports. The function is just the original predict function, the schema works the same way, you can also specify the parameters to pass in (the logistic regression model), and you specify the Spark session as the engine. The output is a native Spark DataFrame, so you can do exactly the same select to compute the precision. We've reduced that whole pile of code to essentially one line, and we get the same result.

Okay, now we want to do something more interesting. As I said, each category is really a different machine learning problem. What if, for each category, we train a dedicated model and then apply that specific model to the part of the DataFrame belonging to that category? First, let's think about the Python code. Given a DataFrame that is one category of training data, what do we do? It's straightforward: we return the category name, which is the first value of the category column (because all rows in this DataFrame share the same category), and the model, which is just the pickled logistic regression trained on that category's data. So we return a string and a binary blob. Then for predict_cat: we get a model, which is the output of the previous step, so we know what's inside, and we get the DataFrame we want to predict on. We add an assertion to make sure we don't mix up categories, recover the model with pickle.loads, predict on the DataFrame, and assign the result to the pred column. Again, it's just simple Python; it has nothing to do with anything else.

Okay, now let's see how we use Fugue to orchestrate these two simple Python functions. It's no longer transform; to be honest, transform is just one tiny feature, and Fugue covers a lot more than that. Here we use some more advanced features. Fugue is all about directed acyclic graphs, so we build a FugueWorkflow: we take the training data, partition it by category, and transform it using train_cat. This first step trains many models in a distributed way. In the second step, we zip the models with the test dataset and transform again, this time with a function that takes two DataFrames and operates on them together, a co-transform in Fugue. At the end there's an expression similar to what we wrote in Spark to compute precision, but this expression is framework agnostic: it works on pandas, on Dask, on Spark, on any execution engine. And as you can see, after we train and predict per category, the overall precision is much higher than training a single model on the entire dataset, almost ten percent higher.
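A hedged sketch of the pieces just described. The two functions mirror the description above, and the transform call shows the params/engine pattern; model, predict, FEATURES, train_pdf, test_pdf, and spark are the objects from the earlier sketches:

```python
import pickle
from typing import Any, List

import pandas as pd
from fugue import transform
from sklearn.linear_model import LogisticRegression

# The Fugue "two-liner" for the single-model case: pass the model as a
# parameter, run on the Spark session, get a native Spark DataFrame back.
spark_pred = transform(
    test_pdf, predict, schema="*,pred:int",
    params=dict(model=model), engine=spark,
)

# Per-category training: each call sees one category's rows and returns
# the category name plus a pickled model as a binary blob.
def train_cat(df: pd.DataFrame) -> List[List[Any]]:
    m = LogisticRegression().fit(df[FEATURES], df["y"])
    return [[df["category"].iloc[0], pickle.dumps(m)]]

# Per-category prediction (a co-transform): one models row plus one
# category's test rows come in together.
def predict_cat(models: pd.DataFrame, df: pd.DataFrame) -> pd.DataFrame:
    assert models["category"].iloc[0] == df["category"].iloc[0]
    m = pickle.loads(models["model"].iloc[0])
    return df.assign(pred=m.predict(df[FEATURES]))
```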
And now the last thing I want to talk about is Fugue SQL. Fugue SQL is a different way to express your logic: the SQL is translated into these same operations, while you keep your whole mindset and your semantics in SQL. It's perfect for SQL-heavy pipelines that just need a tiny bit of help from Python. For example, here we construct the training dataset with a SELECT from the original data (which contains both training and test rows), then train the models, exactly what we did before, then zip the two DataFrames and transform to do the prediction, and in the end return the result. The only difference is that this time I predict on the entire dataset, containing both the training set and the test set, and group by the train flag, so we can see the model's performance on the training data as well as the test data. As you can see, very simple. For the training data we still get a better result, which is expected, and for the test data we get exactly the same result as before. Okay, that's the end of the demo; I'll let Kevin talk about some more details. Thank you.

So what we saw from that demo is that when people move from small, pandas-sized data to Spark DataFrames, even to implement the same logic, a lot of boilerplate code has to be added, or code has to be rewritten outright. But there are other problems in transitioning from small data to big data, and we'll explore them in this section.

The first is reusability of code. The top code snippet is how pandas would implement getting the median of each group, and the second one is Spark. In Spark there are added parameters around tolerance, because getting an exact median in a distributed system is a very expensive operation, so normally you'd just use an approximation instead.

Second is inconsistency. Even though pandas and Spark both operate on DataFrames, they have a lot of inconsistent edge cases. For example, pandas will join null with null, whereas Spark will not join null records together. When you sort, pandas will put nulls at the bottom of the column for both ascending and descending order, while Spark by default treats nulls as the smallest values, so they come first when ascending and last when descending. Even for these basic things, they're entirely different systems, and even if you wrap your pandas code and bring it to Spark, you may still have to write extra code to deal with these inconsistencies.

Third, a lot of pandas users are not familiar with Spark's lazy evaluation, so you often find that pipelines that have been moved to Spark suffer from inefficient computation. In the computation graph above, if you don't persist B, then it's recomputed three times, once each when C, D, and E are computed.
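As a minimal illustration of that recomputation issue, here is a toy pipeline standing in for the A→B→C/D/E graph on the slide (the expensive step is simulated):

```python
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

a = spark.range(1_000_000)                 # A
b = a.withColumn("x", F.col("id") % 7)     # B: stand-in for an expensive step
b = b.persist()                            # without this line, B's lineage is
                                           # re-executed by each action below
c = b.filter("x > 0").count()              # C
d = b.groupBy("x").count().collect()       # D
e = b.select("x").distinct().count()       # E
```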
So of course, if you use persist in Spark, then B is kept in memory and you don't have to recompute it, but pandas users are not familiar with this concept.

Next is partitioning, where you often have to group your logic: put data that belongs to a logical group on the same partition, which typically involves shuffling data across your cluster. This is something pandas doesn't have an interface for and doesn't handle well. And pandas is very reliant on the index; you often find people using the index as a kind of global way to locate data, but that doesn't hold true in a distributed setting.

And of course, testing. Because you have to add all of these extra functions to bring pandas or Python code to Spark, for one function you typically have to add a minimum of two helper functions, and for each function you introduce you have to write additional tests. It also becomes a lot harder to test the logic itself, because it's tightly coupled to the boilerplate functions.

And that's why Fugue was created. Fugue is an abstraction layer for distributed compute. The goals of Fugue are, number one, to make it easy to use Spark and Dask, and number two, to unify the inconsistencies that are present between these engines. In Fugue, as Han showed, we want people to be able to describe their logic in Python, pandas, or SQL, bring it to Fugue, and choose the execution engine. So you can define your workflow in Python or pandas and then, at runtime, choose: I want to run this on Spark, I want to run this on Dask, without significant code change. And there are interesting properties that emerge when we can decouple logic and execution. Without Fugue, your pandas code is tied to the pandas execution engine, and your PySpark code is tied to the Spark execution engine. When you decouple logic and execution through Fugue, you can define your logic once and then choose where to run it. You often find projects where you started using pandas, the data became too big, and now you need to introduce Spark, or where maybe, instead of going to Spark, you just vertically scale your infrastructure. On the other hand, maybe you're adopting Spark too early, for a dataset that doesn't need it. Now, with Fugue, you just write your code once, define your logic once, and choose the execution engine that makes sense at runtime. You can start small with pandas and then move to Spark.

In the example on this slide, we have a native Python function that uses pandas, and what you can see is that when we define the function there's no Spark dependency whatsoever. This is what Han demoed: once you use the Fugue transform function and specify the execution engine, all of this is brought into Spark for you. On this slide I used a pandas DataFrame, but for the same logic we can also use native Python with a list of dicts: we just loop through the list of dicts and apply native Python code to each row, and again, this can be brought into Spark through the transform function.
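A hedged sketch of that idea, reusing pd_sample from the first sketch; the Iterable[Dict] annotations are among the input/output types Fugue recognizes and converts to automatically:

```python
from typing import Any, Dict, Iterable

from fugue import transform

# Pure native Python: dicts in, dicts out, no pandas or Spark in the logic.
def concat_dicts(rows: Iterable[Dict[str, Any]]) -> Iterable[Dict[str, Any]]:
    for row in rows:
        non_null = [v for v in row.values() if v is not None]
        if len(non_null) >= 2:
            yield dict(row, e="".join(non_null))

# Fugue converts each partition to the annotated type before calling the
# function, so the same call runs locally on pandas or on Spark via engine=.
local_result = transform(pd_sample, concat_dicts, schema="*,e:str")
```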
So by decoupling logic and execution, you're able to accelerate your testing, because you can test on smaller, pandas-sized data and then choose the Spark execution engine when you need it. You also need to test less code in general, so you can accelerate development, and you avoid framework lock-in. Today we're using Spark and Dask, but if Ray starts to pick up and a lot of people start to use it, or even cuDF with RAPIDS where you have GPU clusters, then we can make execution engines for those frameworks, and your same Python code should map to them through Fugue. And because you use native Python code that's very testable, there's a lot less maintenance: no Spark expertise is required to maintain the code logic. As Han showed with the example of business logic that changed over time, it was very easy to keep using the same transform function with Fugue.

Fugue also has a SQL interface, so SQL lovers, BI analysts, and data analysts, these personas can also harness the power of distributed compute. Fugue SQL supports Spark, Dask, and pandas, and we also support BlazingSQL to operate on GPUs with cuDF. With the SQL interface, as Han showed in the notebook, we have syntax highlighting already implemented, and we've added keywords to bring SQL into the distributed setting, for example LOAD, SAVE, PERSIST, PARTITION, and TRANSFORM. So now SQL can be a first-class interface: where before you would have SQL code sandwiched by predominantly Python code, you can now have it the other way around, predominantly SQL code that invokes Python code occasionally. And this is the example of the notebook extension where we have syntax highlighting implemented.
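As a hedged sketch of what a small Fugue SQL pipeline can look like, reusing concat and pd_sample from the earlier sketches (the exact keywords and entry points can vary across Fugue versions):

```python
from fugue_sql import fsql

# Standard SQL plus Fugue keywords; the Python function concat is invoked
# directly from SQL, and .run("spark") picks the execution engine.
fsql(
    """
    filled = SELECT a, b, c, d FROM sample WHERE a IS NOT NULL
    result = TRANSFORM filled USING concat SCHEMA *,e:str
    PRINT result
    """,
    sample=pd_sample,
).run("spark")
```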
Before I open for questions, I just want to conclude by saying that, number one, Fugue is also a mindset. Our mindset is that we should adapt to the user and allow them to express their logic in whatever way they want, in a scale-agnostic way, and we take care of bringing it to a distributed setting when they need to scale. We value readability and maintainability of code over deep framework-specific optimizations, but Fugue normally just uses the mechanisms under the hood of the chosen execution engine, so you don't lose a lot of performance.

And with that, a quick recap. Number one, Fugue is an abstraction layer for distributed compute: it adapts to the user, lets them define their code in native Python, pandas, or SQL, and then brings it to Spark or Dask when needed. And because of the decoupling of logic and execution, we find that a lot of Fugue users can accelerate their big data projects. Fugue is just one component of the broader Fugue project: we have Fugue and Fugue SQL, but we also have Fugue Tune, which is an abstraction layer for hyperparameter optimization, and we also have validation support, where you can use Fugue to perform data validation by wrapping Pandera or Great Expectations.

I have contact info here, and with that I'll open the floor for questions. Thank you.