Hi, my name is Megan, and today I'd like to talk about simplifying the testing of Spark applications. The reason I wanted to give this talk is that I've recently learned a lot more about testing and about PySpark, and about how useful testing is. But putting the two together, testing plus Spark, leads to a couple of difficulties, which is what I want to go over.

A little bit about myself and my co-speaker. I'm currently a data scientist at Spotify, and for most of my career I've coded mostly in PySpark and Python, with a little bit of R. I've built recommender systems, deployed machine learning features, designed A/B testing experiments, and I contribute frequently to my workplace's internal data science tooling. Han is a staff machine learning engineer at Lyft, and he's also the author of Fugue, the package I'd love to introduce in this presentation.

So how did I come to know Spark? Spark lets you work with very large datasets, so it became a very natural tool at the companies I've worked for, thanks to its ability to scale feature engineering tasks and to perform model training and scoring reasonably fast. Spark is also open source, with an active mailing list and a JIRA for issue tracking, and it has a really active community; you'll find lots of contributors from Apple and IBM, and global contributors as well.

I mostly work with PySpark, which lets you write Spark applications using Python APIs. PySpark supports most of Spark's features, such as Spark SQL, DataFrames, streaming, MLlib, and Spark Core. The reason this API exists is that Spark is written in Scala, its source code is compiled to Java bytecode, and it has to run on the Java Virtual Machine. So the main takeaway is that PySpark code is not Python native; it's really just an API for running Scala code on the JVM.

That can be a bit of a problem for data science applications, because most of us like to write in Python, and most data science teams are familiar with lots of Python libraries, pandas being the most familiar. As you can see from this chart, the percentage of Stack Overflow questions tagged with pandas is relatively higher than for other Python packages. Pandas is a really great open source package, with great documentation, good functionality, and a great community of people who have already fixed many of the problems you'll run into. The same goes for a lot of the Python-based open source packages such as NumPy, statsmodels, and scikit-learn; they have become mainstream tools for a lot of data scientists.

So, running Python versus Spark applications: when you maintain both Python and Spark logic in your code base (Python meaning it needs to run on the Python interpreter, Spark meaning it needs to run on the JVM), it's good to think about what kind of data you need to test with, what kind of logic you need to run, and what kinds of tests you want to run.
Are you ingesting a large amount of data from your data lake, where it's mostly Spark DataFrames, and running an end-to-end integration test? Are you running small, modular unit tests using, say, mock pandas DataFrames? Or are you running data validations against production data before your production pipelines run? Those are all things you want to think about when figuring out what kinds of tests to write and what kind of logic to write them with, so that your data scientists can contribute openly and freely, confident that they'll know if editing a certain part of the code would break things, and so that the testing philosophy is kept in check in terms of speed. It's a lot to think about, so let's break it down into four scenarios: on the columns you have the type of data you're ingesting, and on the rows you have the type of logic you want to process that data with.

The top left is the easiest to do, and it's where you want to be: everything runs on the Python interpreter, there's no cluster spin-up, pandas DataFrames are processed by Python, and these tests run quickly, so your data scientists can iterate quickly.

The top right is also a very common scenario, given that a lot of companies have already adopted a data lake (HDFS or DBFS) and maybe have their own Spark servers. The logic you write has to be in PySpark, Scala, and so on, but you naturally have people on the team who are part of the data science community and want to write in Python. To cater to them, you have to find a way to convert the Spark DataFrames your company already has in HDFS or DBFS to pandas, so that they can run their Python logic on those DataFrames, whether at test time or at production time. Some people do the converse: they wrap their Python logic in a UDF so that those Python functions can run on the Spark JVM, which makes a lot more sense in terms of data conversions. I'm going to talk about some of the pitfalls of Python UDFs, some of the confusion they've caused in the community, and some pain points when it comes to testing.
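To make the top-right scenario concrete, here is a minimal sketch, with hypothetical function, column, and table names of my own, of keeping the logic in pandas and paying the conversion cost only at the Spark boundary:

```python
import pandas as pd

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    # Pure pandas logic: testable with a small mock DataFrame, no cluster.
    df = df.copy()
    df["clicks_per_view"] = df["clicks"] / df["views"]
    return df

# Unit test path: runs on the Python interpreter only.
mock = pd.DataFrame({"clicks": [1, 2], "views": [10, 20]})
assert add_features(mock)["clicks_per_view"].tolist() == [0.1, 0.1]

# Production path: convert the Spark DataFrame to pandas (expensive at
# scale), run the pandas logic, and convert back.
# sdf = spark.table("events")  # hypothetical table in the data lake
# result = spark.createDataFrame(add_features(sdf.toPandas()))
```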
The bottom left quadrant doesn't really make sense at first glance, but here's the idea: when Spark DataFrames are very small, it actually takes longer in compute time to have your Spark logic process them as Spark DataFrames, so what some people do is keep a pandas DataFrame as the underlying mock data and still use the Spark logic to process it. In this case you still have to spin up a cluster, because you have Spark logic that needs to run on the JVM, and you also run into the conversion of that pandas DataFrame to Spark, though it's not too bad because the DataFrame is very small.

The bottom right is also pretty easy: you have Spark data and Spark logic, so all you have to do is spin up a cluster, and it can be a local cluster rather than a remote one, and run your PySpark DataFrames through your PySpark logic.

Ideally, out of all four scenarios, you want to get as close as possible to the top left quadrant from a testing perspective, because there are no cluster spin-ups and not many DataFrame conversions. And if most of your developers are writing in Python, it makes a lot of sense to keep everything as Python native as possible.

Okay, I want to circle back to the Spark and pandas UDFs I mentioned, which a lot of people reach for when trying to process Spark DataFrames with Python logic. The Apache Spark project is highly aware of the popularity of Python and of the demand from Spark users to run their Python logic on Spark, so they came up with the UDF; the one I'm talking about here is specifically the PySpark UDF, where UDF stands for user-defined function. Prior to Spark 2.3, the PySpark UDF was one of the main ways to run Python logic on Spark and extend PySpark's functionality. That's totally reasonable: at the time, certain functionality may not have existed yet, so writing it in Python was one way to extend Spark; or it could simply be that Python was more readable than the equivalent PySpark logic, for example parsing an ID and storing it in a key-value pair, or complex data wrangling like regex matching.

The catch is that Spark is not able to translate the Python code in a UDF into JVM instructions, so the Python UDF has to be executed on a Python worker, unlike the rest of the Spark job, which executes on the JVM. With the vanilla Python UDF, this transfer is realized by converting the data to pickled bytes and sending it to the Python process, which is not very efficient and has quite a big memory footprint. Over time this process becomes expensive, especially because the operation is done one row at a time.
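Here is a minimal sketch of that vanilla, row-at-a-time Python UDF pattern; the plus-one function is a toy example of my own:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

spark = SparkSession.builder.master("local[1]").getOrCreate()

def plus_one(x):
    return x + 1

# Each value is pickled, shipped to a Python worker, processed, and
# shipped back -- one row at a time.
plus_one_udf = udf(plus_one, LongType())

df = spark.range(3)  # column "id": 0, 1, 2
df.select(plus_one_udf("id").alias("id_plus_one")).show()

# The plain Python function itself is trivially testable without Spark:
assert plus_one(1) == 2
```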
So in Spark 2.3, Spark released the pandas UDF. On the left here I'm comparing the two: across three different functions, the pandas UDF shows better performance than the row-at-a-time vanilla PySpark UDF. The reason for this improvement is that the serialization is handled by Apache Arrow, which exchanges data directly between the JVM and the Python worker processes with near-zero serialization and deserialization cost. The benchmark here is taken from the Databricks blog, and it covers three functions: a "plus one" function, a statistical cumulative distribution function using SciPy, and a pandas function that subtracts the mean from every value in a column. The greatest saving is on the CDF function, which makes sense because it's a SciPy function. What I'm trying to say is that the pandas UDF was a real improvement, available in Apache Spark from 2.3 onwards.

But using pandas UDFs can be confusing. In this simple case, an arbitrary function that adds one to a column, you could use any of these three forms to get the same outcome. These pandas UDFs expect different input and output types and work in very different ways, with distinct semantics and different performance characteristics, and that confuses users about which one to use and learn and how each works. For example, with the top two, the pandas UDF can be used with Spark's select or withColumn statements, but it fails when it comes to groupBy. If you want to use groupBy and still apply that same arbitrary function, you have to fall back to the third function definition. Even that is confusing: if you try to apply the groupBy the same way in the third case it fails, and you actually don't even need to pass in the ID column; it already knows which column to group by on. So it can be hard to know which one to use and how to use them.

The pandas UDFs did get better: they now accept pandas type hints, on Series, iterators, and DataFrames. On the left are the three older ways to write this, and on the right it's improved, because you no longer need to figure out which UDF type to cast your output as; it can just be a pandas type. A lot of people know what a pandas Series is (it's like one column of a DataFrame), people know what an iterator is, and in the last case you can even type-hint a DataFrame itself. That's good for testing, because part of testing is figuring out what types your objects are.

Even testing undecorated UDFs is easier. At the top here we have a function decorated with a pandas_udf decorator. What you can do is unwrap it, write the plain function on a separate line, and very easily write a test for that undecorated function. You can do the same with the decorated version, although you might have to read a little bit of source code to figure out how to get back to the original function.
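As a sketch of the type-hinted style and the unwrap-and-test trick just described (assuming Spark 3.x; the plus-one function is again a toy example):

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf

# Plain pandas function, written on its own line so it stays testable:
def _plus_one(s: pd.Series) -> pd.Series:
    return s + 1

# The type hints replace the old UDF-type parameter; only the output
# type string remains Spark-specific.
plus_one = pandas_udf(_plus_one, "long")

# The unit test targets the undecorated function: pure pandas, no cluster.
assert _plus_one(pd.Series([1, 2])).tolist() == [2, 3]
```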
And I would say the functional API is even better. In this case you don't even have to decorate the function. It takes an iterator of pandas DataFrames (it has to be an iterator, because it's going into mapInPandas), but the function itself remains totally Python native and you can test it very easily.

There's still one troublesome thing, though, even with this nice functional API: you need to write a test for an additional wrapper. The wrapper has to accept a Spark DataFrame, because the mapInPandas functional API is for Spark DataFrames. In this example we have a predict function that takes a model path, un-pickles the model, and scores the DataFrame to produce propensity scores; that function is a pandas, Python-native function. But the second function, run_predict, accepts a Spark DataFrame and uses mapInPandas. That means if you wanted to write unit tests for this script, you'd actually be writing two unit tests, and on top of that you'd still have to spin up a Spark cluster because of the second function. This is where I'm going to start introducing Fugue: compared with this, everything in the Fugue version is Python native, and the input DataFrame can be anything you want, a pandas DataFrame or a Spark DataFrame, depending on which engine you want to run with. I'll talk more about it in my demo.

One final note before I move on to Fugue: I also see a bunch of companies on Databricks, and I absolutely love Databricks. It was founded by the original creators of Spark, and it lets users spin up their own Spark clusters at the click of a button. They also offer a tool called Databricks Connect, a client library for the Databricks Runtime that lets you write jobs using Spark APIs and run them remotely on a Databricks cluster. That's really useful for developers like me who like to prototype code locally in an IDE but still execute it remotely; it's a really nice developer experience. However, a downside is that it hijacks your PySpark installation: one of the first things you have to do when you install Databricks Connect is uninstall PySpark. The reason is that Databricks Connect interacts with the driver of the remote cluster, receiving instructions locally and sending them to the driver node. It has to replace PySpark because you're no longer able to spin up a local cluster; it has to spin up a cluster on Databricks so developers can execute their code remotely. That is the purpose of the client tool, but it's a downside when you want to iterate and test quickly, because you can no longer spin up a local Spark cluster, which takes seconds, and instead have to spin up a remote cluster, which can take five to ten minutes and can really slow down development for developers like me.
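Going back to the predict example for a moment, here is a rough sketch of that two-function mapInPandas pattern; the names (predict, run_predict, model_path) and the feature column are hypothetical:

```python
import pickle
from typing import Iterator

import pandas as pd

def predict(df: pd.DataFrame, model) -> pd.DataFrame:
    # Pure pandas: easy to unit test with a mock model and a tiny DataFrame.
    df = df.copy()
    df["propensity"] = model.predict_proba(df[["x"]])[:, 1]
    return df

def run_predict(sdf, model_path: str):
    # Spark-specific wrapper: needs a SparkSession and a cluster to test.
    def _map(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
        with open(model_path, "rb") as f:
            model = pickle.load(f)
        for batch in batches:
            yield predict(batch, model)
    return sdf.mapInPandas(_map, schema="x double, propensity double")

# Two functions means two unit tests, and the second one needs a local
# Spark cluster just to verify a thin wrapper.
```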
So, one way we can test quickly with native Python or pandas, and execute on Spark only when needed, is through Fugue. Fugue is an abstraction layer that keeps your code and computation native to Python, yet easily portable to Spark clusters, and developers who use Fugue benefit from more rapid iteration in their data projects.

I'll now go over a demo using a Kaggle dataset, doing feature engineering for a toxic comment classification problem. The dataset contains a large number of Wikipedia comments that have been labeled by human raters for toxic behavior. Competitors were given this dataset and had to determine whether a comment's toxicity labels included toxic, obscene, or threatening types of toxicity, and each comment could have more than one label, which is why it's a multi-label problem. In the notebooks I'll show, I'll go through a tokenization task using mapInPandas, and then show how to do the same thing in Fugue. It's a very basic feature engineering task that many NLP pipelines need, since they have to extract individual words, or tokens, from long strings of text to compute TF-IDF or do embedding lookups.

Here in the notebook I've read in the Kaggle dataset using pandas, and I print an example of what a comment looks like. I also have to spin up a SparkSession and SparkContext in order to process the comments as a Spark DataFrame using mapInPandas, because mapInPandas only works with Spark DataFrames, so you need a SparkSession.

Over here is the meat of the tokenization task. First, I convert all the comments to lowercase, then I remove punctuation and split the sentences on spaces and newline characters; I also remove any non-ASCII characters, and I remove stop words, using the NLTK corpus to get the stop word list. Finally, I assign that list of words to a new column in the DataFrame, comment tokens. Now, mapInPandas requires an iterator as input and output, so I had to wrap this function into that iterator format. And in order to process my Spark DataFrame, I also have to provide mapInPandas with a schema; this is very Spark-specific, and I have to supply type hints so that Spark knows what schema to expect as the output of this function. Finally, I convert the pandas DataFrame into a Spark DataFrame, apply my Spark tokenize function to it, and show the original comment next to the new comment tokens. You can see it succeeded: here is the original comment and the tokens eventually extracted from it, and here is another example of a successful tokenization.
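Here is a simplified sketch of what that mapInPandas version looks like (the real notebook also strips punctuation and non-ASCII characters and removes NLTK stop words):

```python
from typing import Iterator

import pandas as pd

def tokenize(df: pd.DataFrame) -> pd.DataFrame:
    # Pure pandas core: lowercase and split on whitespace/newlines.
    df = df.copy()
    df["comment_tokens"] = df["comment_text"].str.lower().str.split()
    return df

def sdf_tokenize(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # mapInPandas requires this iterator-in/iterator-out wrapper...
    for batch in batches:
        yield tokenize(batch)

# ...and a Spark-specific schema string for the output:
# sdf = spark.createDataFrame(pdf)
# sdf.mapInPandas(
#     sdf_tokenize,
#     schema="comment_text string, comment_tokens array<string>",
# ).show()
```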
I'll now go through the same process, with the same tokenizer function, and show how it would be done in Fugue. I read in the same Kaggle training data using pandas, and I have the same function over here; the only real difference is the schema component. The way Fugue works is that it parses the schema hint describing what the tokenize function's output is going to look like, so that when you run this function on Spark, it knows what output schema to expect. The function itself is the same: remove punctuation, split on spaces and newline characters, remove non-ASCII characters, remove stop words using NLTK, and finally assign the result to a new column. The cool thing about Fugue is that I don't need to wrap or mangle my function to fit mapInPandas's inputs and outputs. I can run this and show that it outputs the comment tokens, and I get the same results I got in the Spark case, for this comment and for a different example as well.

Another cool thing about Fugue is that I can switch out the execution engine so the same code runs on native Python: if I run it that way and print the result, it gives me a pandas DataFrame.

I also wanted to showcase how easy it is to add new arguments. Say we want to parameterize the name of the comment text column as input_col. Our function now has another parameter, input_col, and it gets passed in as a parameter, quoted there. You can see that it's very easy to add new parameters, and it still works with Python and it still works with Spark. And one thing, in case I didn't drive this point home earlier: you don't need to spin up a Spark cluster unless you tell Fugue to do so. That makes testing really easy, because Fugue guarantees consistency across execution engines no matter where you run your code. You can test very easily using native Python with the native execution engine, and when it comes time for production you can switch to Spark to ingest data from your data lake.

For contrast, here's what it looks like to add the same parameter with mapInPandas. We do the same thing as before, but now map_tokenize also requires the extra argument, and we have to pass the parameter in multiple places. That's because mapInPandas can really only take one argument, the iterator of pandas DataFrames; it cannot take input_col directly, so you have to Frankenstein it a little by passing the argument through the inner lambda in the map_tokenize function definition, and then finally pass it in at the call site. And you can see this runs.

So the differences with Fugue are: first, you don't have to define so many functions, you can define just the one, which from a testing perspective saves you from writing tests for two additional functions. Second, you don't have to do any cluster spin-up unless you need to; with mapInPandas, even though the inner function is Python native, you still have to write a PySpark test for the wrapper, thereby incurring a local cluster spin-up. And third, if you want to add any arguments to your functions with mapInPandas, you have to do so in multiple places, and as a developer it's not a great experience to have to figure out where all the underlying functions live so you know where to change them.
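And here is a sketch of the Fugue version with the input_col parameter, assuming a recent Fugue release; the schema hint uses Fugue's own syntax:

```python
import pandas as pd
from fugue import transform

def tokenize(df: pd.DataFrame, input_col: str = "comment_text") -> pd.DataFrame:
    # Still a plain pandas function: unit-testable with no cluster.
    df = df.copy()
    df["comment_tokens"] = df[input_col].str.lower().str.split()
    return df

pdf = pd.DataFrame({"comment_text": ["Hello World", "Spark and Fugue"]})

# Native execution: pandas in, pandas out, no cluster spin-up.
local = transform(
    pdf, tokenize,
    schema="*,comment_tokens:[str]",
    params={"input_col": "comment_text"},
)

# Same code on Spark: only the engine argument changes.
# spark_out = transform(
#     pdf, tokenize,
#     schema="*,comment_tokens:[str]",
#     params={"input_col": "comment_text"},
#     engine=spark,  # an existing SparkSession
# )
```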
All in all, I think Fugue is a really great tool for a developer like myself: it reduces pain points in testing and speeds up my workflow, so I can focus on the Python code I want to write. I'd now like to hand over to Han, who will talk more about Fugue SQL and how it can further increase developer productivity by making the testing experience for Spark applications, and other types of applications, a lot easier. Thank you.

Hi everyone, my name is Han Wang. I'm going to talk about testing in Spark application development. Testing for big data problems, especially for Spark applications, is challenging. More testing can effectively reduce risk in production; however, that is probably the only positive entry in this table. What I often see is that Spark tests can easily double the total CPU hours used in development, and the larger the scale, the more cost overhead you encounter when you want to test thoroughly. Another pain point is that the tests themselves are hard to write: for example, how do you test a complex SQL query that gets data from Hive but runs on Spark? In addition, people tend to use Spark-specific objects such as schemas, DataFrames, and RDDs; they make Spark code elegant and short, but they add complexity and unnecessary dependencies when testing core logic that is not relevant to Spark. Of course, there are always solutions, but they need extra code and they add performance overhead. So testing Spark applications can further slow down iteration on big data problems, which is already slow.

Ideally, more testing on Spark applications should be a totally positive thing. It should reduce the total cost of development, if we can reduce the number of iterations while not adding much overhead to verify each iteration. It should accelerate development, if the tests are simple to add and effective at catching bugs. To be more general, here is what we want to achieve. During exploration, we want to minimize ad hoc code: the code and tests we write should be almost ready for production. For each unit test, we want to minimize execution time, which saves both time and money. We want to achieve 100% unit test coverage, to minimize risk in production; this seems impossible for Spark applications, but I will demonstrate how to achieve it using Fugue. Last but not least, we should consider test-driven development. TDD lets you define the requirements as tests and then write code to fulfill those requirements, and it can effectively reduce the number of iterations in development.

In order to achieve these goals, first of all we must have the correct mindset. The biggest challenge of big data problems is not processing big data, but how to quickly reduce the scale. Downsampling is an extremely important technique that every data practitioner should keep in mind, and it is useful for most use cases: if your logic cannot work on small data, why do you think it can work on big data? With sampling, you can iterate locally; it's not only faster, but it also means all local development techniques can be used.
For example, exceptions will be more explicit, and you can use a debugger to find logical issues. After a few iterations, you can apply the current logic to the big data to verify it. But how can we switch easily between big and small data without code changes? And how can small data avoid the tediousness and overhead of Spark? The answer is to totally decouple from Spark, code-wise and mindset-wise.

In the past, we wrote our own computing logic to run locally using simple Python and SQL, but in order to scale out we had to use a distributed computing engine such as Spark, and using a particular engine normally means adopting its API and interfaces, making our core logic convoluted and hard to maintain. So what if we create an adaptation layer that can adapt both to the user's code and to different computing engines? With this design, since the core logic is written in simple terms, it is already easy to unit test. Plus, the adaptation layer lets you use a local execution engine and mock DataFrames to run your data pipeline end-to-end without Spark. Now your entire pipeline becomes testable, and, more importantly, you can use this approach to iterate quickly on your local box. Your logic also becomes framework-agnostic and easy to migrate. This is the motivation of Fugue: to hide the complexity and inconsistency of different computing frameworks, and to provide a unified approach to distributed computing and machine learning.

Now let me start the demo. In this demo, I'm using the Kaggle dataset "US Election 2020 Tweets", a typical natural language processing problem, to show you how to iterate quickly on a big data problem in a Jupyter notebook with a test-driven approach. All data has been loaded from Kaggle, converted to Parquet, and saved to Google Cloud. Parquet files are in general smaller than CSV, so as you can see, we have about 16 files and half a gig in total.

Let's start. First of all, let's set up the notebook environment. fugue-notebook is a package that helps set up a Fugue-friendly notebook environment for Jupyter users: it sets up shortcuts to use Spark and Dask, and it also enables Fugue SQL highlighting. Fugue SQL extends standard SQL with extra syntax to make it a full-blown language; it is easy to use and friendly for data exploration. Now let's use Fugue SQL and Spark to explore the source data. The notebook magic fsql means this cell is Fugue SQL, and spark means I want this Fugue SQL to run on Spark. As you can see, in under two seconds we get some sample data from the original dataset.

Our goal in this demo is to analyze the sentiment score of each tweet, and then get the overall sentiment of groups, grouped by state code and hashtag. We're going to tokenize the tweets and use both NLTK and TextBlob to score the sentiment. We have about 1.7 million tweets to process, so it's compute-intensive, and the first thing we should do is downsample the dataset so that we can iterate locally. Again, we use Fugue to do this: as you can see, we sample 0.01% of the data from the original dataset, deterministically, and then we yield this sampled dataset to the notebook so that we can use it in the next cells. Now let's print out all the sample tweets. Downsampling is such a simple and effective way to save development time, and from here on we can forget about Spark, Fugue, and distributed systems, and just focus on the core NLP problem.
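As a sketch, the downsampling cell looks roughly like this in Python form; fsql is Fugue's programmatic entry point for Fugue SQL, the bucket path here is hypothetical, and the exact SAMPLE/YIELD syntax may vary by Fugue version:

```python
from fugue_sql import fsql

result = fsql("""
LOAD "gs://my-bucket/us-election-2020-tweets/*.parquet"
SAMPLE 0.01 PERCENT SEED 0
YIELD DATAFRAME AS sampled
""").run("spark")

# The yielded sample is small enough to pull down and inspect locally.
sampled = result["sampled"].as_pandas()
print(sampled.head())
```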
Now we need tokenization. Intuitively, we need to lowercase the text and also remove stop words. Let's write the test first; of course, it will fail. Now let's write the code to make it pass. As you can see, we use NLTK's lemmatizer and stop words to clean up the text and break it into an array of words. So is this enough? We need to find out, and this is where the sample data starts to be useful. Let's run on the entire sample data and do a visual check. We still see special characters and links, and there is also text inside brackets; we want to remove all of them. So first, let's write the tests for special characters, links, and brackets, and let them fail. Then let's add the implementation. Okay, now it passes. Let's do another round of visual checking; now everything looks okay. You can continue this type of iteration until you are satisfied, and after all these iterations, you get this tokenize function as well as its test, and they are almost ready for production.

After tokenization, we can compute the sentiment scores. One nice advantage of test-driven development is that the test is also the requirement: you can write the requirement and then work on the implementation. Let's write the requirement first. We want the output to contain a VADER score and a TextBlob score, and we want their signs to align with our expected directions; these two assertions assert exactly that. Also, for tweets that are empty after tokenization, we want the scores to be null. Now let's write the implementation. All tests pass, and the row-wise operations are all finished.

Now let's work on the aggregation part. Let's assume that compute_sentiment is a function applied to a group of data, grouped by state code and hashtag, and we want it to output a DataFrame containing only one row, holding the aggregated sentiment score. If the input VADER and TextBlob scores are evenly distributed from minus one to one, we want the sentiment to be inconclusive: zero. If positive sentiment is dominant, we want it to be one, and if negative sentiment is dominant, we expect the aggregated sentiment to be minus one. Okay, here is the implementation. We take the medians of the VADER score and the TextBlob score; if both are very positive, they agree with each other and we set the sentiment to one; if both are very negative, we set it to minus one; otherwise we set it to zero. In the end, we return a one-row DataFrame with an additional column, sentiment, containing that sentiment score.

So far we have been doing standard test-driven development with 100% test coverage, and the code has absolutely nothing to do with Spark or Fugue. Now let's write out the entire workflow using Fugue SQL and run it on Spark. We filter the sample data by state code and country, then apply compute_polarities to compute the per-tweet VADER and TextBlob scores, then group by state code and hashtag and apply the compute_sentiment function to get the aggregated sentiment score. In the end, we just select the columns that are useful to us, and here is the result.
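Pulling those test-first steps together, here is a sketch of what the pure-Python tests might look like; tokenize and compute_polarities mirror the demo's function names, but their exact signatures here are my assumptions:

```python
import pandas as pd

# from pipeline import tokenize, compute_polarities  # the demo's functions

def test_tokenize():
    # Special characters, links, and bracketed text should be stripped.
    text = "Check [this] out!! https://t.co/xyz #Vote"
    tokens = tokenize(text)
    assert "https://t.co/xyz" not in tokens
    assert "[this]" not in tokens

def test_compute_polarities_directions():
    df = pd.DataFrame({"text": ["I love this", "I hate this", ""]})
    out = compute_polarities(df)
    assert out.loc[0, "vader_score"] > 0       # positive tweet
    assert out.loc[1, "vader_score"] < 0       # negative tweet
    assert pd.isna(out.loc[2, "vader_score"])  # empty after tokenization -> null
```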
So how can we make this piece of logic unit testable? Again, let's write the requirement first. According to the filtering logic, in this test this row has an invalid state and this row has an invalid country, so after filtering we should keep only the first three rows. Then, in the aggregation, as we can see, for CA we have one positive tweet and one negative tweet, and for WA we have one positive tweet, so we should expect two rows in the output, with CA's sentiment inconclusive and Washington's sentiment positive. And as you can see, this is a simple test that has nothing to do with Spark or Fugue.

Now let's see the implementation. The first approach to implementing this row_analysis function is to use Fugue SQL directly, and this is the only place where you need a Fugue dependency: we just modify the original SQL a little, parameterize it, and run it using the given engine. Now let's run it, and you see the problem is solved. But why do we still not see Spark? This is because Fugue treats pandas and Spark as different execution engines but unifies their behaviors, so consistency is guaranteed at the Fugue level. You only need to focus on your core logic, with a small dataset and the local execution engine, and your tests and your logic can be completely independent from Spark. The row_analysis function is scale-agnostic and framework-agnostic, and can run on pandas, Spark, Dask, and all the engines Fugue supports.

But what if you still want to test this function with Spark? You can install the PySpark plugin for pytest and then pass the Spark session in as a fixture. As you can see, it's a minimal modification: we pass in the Spark session and set it as the engine, telling this function to run everything on Spark instead of on pandas. At the end, the output will be a Spark DataFrame, so we need a toPandas. And as you can see, the df is still a pandas DataFrame, because all data conversion is automatic in Fugue. If you also want to use a Spark DataFrame to test the workflow end-to-end, you can just be explicit at this step, and you never need to change anything inside row_analysis, your core computing logic. You can also load the entire dataset using Spark and then run the analysis on that native Spark DataFrame. Of course, that is a bad idea for unit testing, because you'd be running the entire dataset in a unit test; this is just to show you how you can integrate your newly created function with your existing Spark pipelines.

One last thing: I understand that not everyone is a fan of SQL, so in Fugue we also have a functional interface as another option. Here is how you would rewrite the logic; in fact, your Fugue SQL is translated into exactly the same code you see here, so the two approaches are equivalent. Either way, you get fully tested code that is framework-agnostic and scale-agnostic. Fugue is open source; you can pip install it and start using it right away. We are looking for feedback and collaborations. Thanks, everyone, for attending our talk.