Hello everyone. Good evening. Thank you so much for joining in. I know it's the evening, everyone's tired, and honestly we have 20 more people in this room than I originally thought there would be. So thank you again for joining in. What I'm going to talk about today is data-driven workflows: essentially, building data engineering workflows that move data from a source to a destination. The industry is fairly heavy on Spark and Scala-related tools for this, and what I'm here to help you understand is the new kid on the block called Dask, and how you can use it not just to speed up your workflows but also to build cleaner workflows that can easily be shared.

Now, enough about the topic; let me talk about myself for a bit. My name is Vaibhav Srivastav. Vaibhav is a difficult name, so you can just call me VB — that's Visual Basic. I'm a data scientist at Deloitte Consulting back in India, and I architect machine learning workflows for Fortune Technology 10 clients on Google Cloud. That's where most of my experience with big data workflows, PySpark, Dask and all of those things comes from. You can find me on Twitter at reach_vb, and I put all of my talks up at Vaibhav.blog — feel free to look at it later on.

Before we get into the contents: who here has ever built a big data workflow before? All right, we've got three people. Who has ever heard about or worked with Apache Spark? All right, cool. Who's ever heard about Dask? All right, we've got one person. So hopefully by the end of this talk I'd want at least a few more hands to go up when I ask who's heard about Dask.

Cool. So what is Spark? For those who don't know, Spark is a distributed data processing engine. Essentially, you write some code to process massive chunks of data — sorry, to split massive chunks of data into smaller chunks, process them, and move them from a source into a destination. This is used pretty much everywhere: big data dashboards, KPI monitoring inside or outside your firm — anywhere data is being aggregated or migrated from one place to another is typically where Spark gets used. Spark has multiple abstractions: there's Spark SQL, there's Spark Streaming, there's a machine learning library called MLlib, and there's a GraphX library, which means you can build graph workloads on it as well. It has fairly robust language support too: you can build on it in Java, Scala, Python, and R. But I'm going to talk about this from more of a PySpark view. And again, PySpark is fantastic — I love PySpark, and we've had some fantastic talks today about PySpark and what you can build on it. But as a Python developer, or as someone who codes in Python, there are a lot of issues with just using PySpark. The reason, as you may see on your screens as well, is that Spark was originally written in Scala and runs on the JVM, so everything you code up against it is eventually executed on a Java virtual machine.
Whereas when you use PySpark, you add another layer of abstraction on top of Spark: you write your code in Python, which is then translated into calls that execute on the JVM. This not only adds a lot of overhead, it also makes things very difficult for people who are new to building big data workflows and are just trying to get their heads around Spark, or even just around PySpark. A lot of that comes down to the fact that your Python code ultimately runs through the JVM. And even if we assume that's all right, every time a PySpark job fails you get mind-wrecking tracebacks, something that looks like this: you literally have Java tracebacks in your Python code, and you're just like, what's happening? And not just that, you get NullPointerExceptions in Python. If that's not weird, I don't know what is. It becomes a very baffling experience when you're trying to build quick workflows — something that just works.

That's where Dask comes in. Dask is designed to parallelize the Python ecosystem. It has familiar APIs for Python users. And for those of you who have never coded in Python before, Python is about the easiest language in the world to pick up — a Hello World program in Python is literally print("Hello World"). How easy can it be? Not just that: for those who have been working within the Python ecosystem, Dask was co-developed by contributors from the pandas, scikit-learn and Jupyter teams. So these are the best of the best teams working together to build a library that gives you a way to build distributed workflows. And it scales from a single multi-core machine up to clusters of thousands of nodes.

Now, let's take a deeper look. I've been trying to convince you about Dask, so let me give you some data points as to why Dask is the right fit for your next big data workflow. Dask is basically a scalable pandas DataFrame. If you want to read a Parquet file sitting in an Amazon S3 bucket, all you have to do is import Dask DataFrames and call dd.read_parquet on that path, because Parquet is the format sitting in that bucket. And if you want to group by and perform computations on it, you can literally just do df.groupby on the name column, take the value column, compute its mean, and get a DataFrame back. In just three lines of code you've read data out of an Amazon S3 bucket, grouped it by name, computed the mean, and returned the result — three lines of code, as in the sketch just below. If you were to do the same thing in PySpark, it would take a lot more code, and before you even got those flows working you'd hit multiple exceptions — NullPointerExceptions, Java tracebacks that more often than not you wouldn't understand. Second — and if you're a data scientist or a machine learning engineer this is very important — you can actually parallelize your scikit-learn workflows, scikit-learn being the machine learning library, using Dask.
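Before we get to the scikit-learn part, here is roughly what that three-line DataFrame flow looks like — a minimal sketch, assuming a hypothetical bucket path and column names (reading from S3 also needs the s3fs package installed):

```python
import dask.dataframe as dd

# Read Parquet data straight out of S3 (path and column names are illustrative).
df = dd.read_parquet("s3://my-bucket/2019/*.parquet")

# Group by name, take the value column, and compute its mean.
# .compute() is what actually triggers the distributed work.
result = df.groupby("name")["value"].mean().compute()
print(result)
```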
You can take joblib, which is the parallel backend inside scikit-learn, and use Dask as its task scheduler. For every job you run via Dask, it creates a task graph — where your data is and where it can go — and that lets you parallelize your ML flows, your data flows, and so on and so forth with Dask.

And now comes the most interesting bit. We've been talking about building big data workflows, but Dask does so much more than that. Take this existing code base as an example. I'm running two for loops: for every x in xs, and then for every y in ys, I check whether x is less than y; if it is, I pass the pair into a function f, which gives me a result, and if not, into another function, and I push everything into a results list. If you just look at this code, it's sequential, but it doesn't have to be. Run sequentially, it would take one x out of the big list of xs, then one y out of the big list of ys, and keep computing over and over until both lists are exhausted. But we can actually use Dask to parallelize all of this, because these computations don't depend on each other: every element of the big xs list is independent of every element of the big ys list. And we can do that by literally adding two lines: I wrap the function f in a dask.delayed object. What that does is build a lazy graph of the function calls — nothing actually runs until I call dask.compute on the results. So essentially I'm explicitly telling my interpreter: once I hit compute, take these delayed objects, parallelize them across all the cores I have on my machine, and feed the data through in parallel chunks. And this is massively fast. And you can use this on pretty much any of your existing Python code bases — that's the best part. It's literally just a few lines of code; there's a sketch of both of these patterns just below.

All right, next — sorry. Dask isn't just something that runs on a cluster. It scales up fairly easily: you can scale it to 1,000-node clusters, you can use it on supercomputers, it works at gigabyte bandwidth, and it has just a 200-millisecond task overhead, which means whatever I write with Dask hits the computation graph within 200 milliseconds. And not just that, it also scales down, which means you can just as easily run it on a single Python thread pool — a single-core machine. There is no performance penalty, and contrary to how PySpark works, where all your code has to make its way through the JVM, you don't have to do any of this. Dask is natively written in Python, hence the massive speedups. And it's lightweight: you can install Dask with one command, pip install dask, and it works right off the bat. You don't need to set up a Java virtual machine, you don't need to set a Java path or any of those things. And of course, I love Java. And there's more fantastic stuff.
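Here is a minimal sketch of the scikit-learn / joblib idea, assuming you have dask.distributed installed (creating a Client is what makes the "dask" joblib backend available); the model and dataset are just illustrative stand-ins:

```python
import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

client = Client()                      # local scheduler + workers; registers the "dask" joblib backend
X, y = make_classification(n_samples=5_000)

search = GridSearchCV(RandomForestClassifier(), {"n_estimators": [50, 100, 200]})
with joblib.parallel_backend("dask"):  # scikit-learn's internal joblib calls now fan out over Dask
    search.fit(X, y)
```

And here is a sketch of the dask.delayed pattern from the two-for-loops example — f, g, xs and ys below are tiny stand-ins for the real functions and lists:

```python
import dask

def f(x, y):          # stand-in for the expensive function called when x < y
    return x + y

def g(x, y):          # stand-in for the other branch
    return x - y

xs, ys = range(10), range(10)

results = []
for x in xs:
    for y in ys:
        if x < y:
            results.append(dask.delayed(f)(x, y))   # builds the task graph, nothing runs yet
        else:
            results.append(dask.delayed(g)(x, y))

results = dask.compute(*results)   # now everything executes in parallel across the available cores
```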
You get clean Python tracebacks every time your code breaks. As someone who builds a lot of workflows, this is very important to me. I need to know where my code is breaking: if it breaks in a particular function, I want to know where that function is in my file, not in some random Java library hidden in some random folder. I want to know where my issue is and how I can fix it. Apologies for the GIF.

Not just that — and this gets better — Dask gives you beautiful diagnostic dashboards. What you're seeing here is a grab of me running a .compute() on a four-core machine. As soon as I hit .compute(), it first figured out where the data could go across my cluster, and the green lines you see are basically the I/O between the cores as chunks get read and computed. This is from the same flow: you can see it also shows how your memory is doing, how your CPU is doing across the cores, what your network I/O looks like, and how your entire flow is progressing. There was a group-by happening, there was some time series analysis happening, so it tracks your whole flow and tells you how fast it's running. And this is it running in the wild: essentially all we're doing here is reading some data out of an AWS bucket and running some simple computations on it. This runs side by side, so while your Python code is running, you have this dashboard to keep track of where your memory is leaking, where the most time is being spent in the flow, how you can optimize it further, and so on.

Cool, now enough hyping Dask. Let's talk about some benchmarks — let's put PySpark and Dask head to head and see what works best. I have the code for all of these benchmarks up on my GitHub and I'll share the link with you. All the data for this test was put into an S3 bucket, and we wrote similar code for PySpark and for Dask. The first check was very simple: read 1,000 records from an S3 bucket and see how long it takes. PySpark takes 11 seconds and Dask takes one second. But that's just 1,000 records, so let's push it up a notch to one million records. PySpark took 2.3 minutes to read, so PySpark's catching up; Dask took 1.8 minutes. Now let's take it up another notch on the same one million records: put a filter on them and persist the output for another operation — it can be anything. When we do that, PySpark's still trying to catch up: PySpark is at 2.6 minutes and Dask is at 1.6 minutes. Now this is where things get interesting. When we try to join two DataFrames — essentially two data sources — in PySpark and in Dask, this is where Dask kind of goes down the drain. PySpark did it in 5.6 minutes; Dask did the same thing in 12.8 minutes. A quick note here: the reason this is slow is that Dask works really, really well on single data sources. If you have just one structure of data, whether in Parquet, in CSVs, or in a table somewhere, it's phenomenally fast; but as soon as you try to join two of them, the task stops being so easily parallelizable, and that's why it becomes difficult for Dask to catch up to PySpark.
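For reference, here is a minimal sketch of the kind of setup behind both the dashboard screenshots and the benchmark operations above. This is not the actual benchmark code from the GitHub repo; the bucket paths and column names are made up:

```python
import dask.dataframe as dd
from dask.distributed import Client

client = Client(n_workers=4)          # starting a Client also serves the diagnostics dashboard
print(client.dashboard_link)          # typically http://127.0.0.1:8787/status

df = dd.read_parquet("s3://my-bucket/events/*.parquet")    # hypothetical single data source

# Filter and keep the result in cluster memory for follow-up operations.
recent = df[df.year == 2019].persist()

# Joining two sources is where Dask tends to lose ground to Spark.
users = dd.read_parquet("s3://my-bucket/users/*.parquet")  # hypothetical second source
joined = dd.merge(recent, users, on="user_id")

print(joined.groupby("user_id").size().compute())
```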
So if your use case involves a lot of joining between multiple sources, you may want to look into PySpark, or you may want to do that join somewhere outside your Dask code to make sure you have the most optimized workflow.

Now, which one should you use? There's a famous quote by Thomas Sowell that I love: there are no solutions, there are only trade-offs. But just to put things in perspective: if you're setting up a new flow or a new project, I'd recommend giving Dask a shot. Definitely try it out, because it's literally one simple command, pip install dask. You can easily take it out for a spin, run some flows on it, and see how it performs against your existing flows. If you're on a legacy system — say you have a Hadoop setup or one of these Java-based systems — then I'd recommend sticking with Spark. I wouldn't ask you to move everything from there onto Dask, but I would still suggest trying a local installation somewhere, maybe on a virtual machine, and seeing how it performs. By all means, do try Dask: it has very few dependencies, it can do a lot more than just build your data workflows, it can parallelize code bases, and it gives you nice dashboards to see how your flows are doing — all things which PySpark, or Spark, kind of fails to provide out of the box. So it gives you more tools in that sense.

There's something more: if you want to try it without setting anything up, I've put two notebooks up on Google Colab. You can try PySpark on Colab, and you can also try Dask on Colab. Let me just see if I can open them. I actually did a bit of a hack here: Colab allows you to expose external URLs from within the VM that it allocates you. For those of you who don't know Colab, let me take a step back: Colab is basically a Jupyter notebook where you can execute Python code, a free offering from Google themselves. Let's see if we can connect and run this. All you have to do is sign in with any Gmail account. You can see it's given me a Python 3 Google Compute Engine backend with 12.72 gigabytes of RAM and a decent amount of disk as well. Let me see if I can run this. Let's execute this code. It says there are a bunch of warnings — we can get to those later — and that my scheduler is set up: I have one worker, two cores, and 13.66 gigabytes of memory. Now, I really want you all to see the dashboard I've been raving about for so long. You can't get at it by default, so what I do is download ngrok onto the Colab VM itself and then, from Python, expose a URL through ngrok. This is extremely hacky stuff — do not do this — but it's just for you to try things out; a rough sketch is below. I actually found this hack to be very nifty and... it's not working. Ha! Let me get back to you on this in a jiffy; when you see it next, this will all be working. I think there's some issue with the version of Python I'm running in this notebook, but as soon as you do this it gives you a URL you can hit, and it works. I have a similar flow for PySpark as well, so you can go into these two notebooks and benchmark things on your own. Last, if you're more interested in learning what Dask is and how it works, the official documentation is amazing — you can go to examples.dask.org.
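Here is a rough reconstruction of that Colab hack. The talk doesn't name the exact tool used to expose the URL from Python, so the pyngrok package below is my assumption, and 8787 is simply the Dask dashboard's default port:

```python
# Inside a Colab cell — a sketch, not the exact notebook shown in the demo.
!pip install -q dask[distributed] pyngrok   # pyngrok is an assumption; the talk only says "ngrok"

from dask.distributed import Client
from pyngrok import ngrok

client = Client(n_workers=1, threads_per_worker=2)  # roughly the 1-worker / 2-core setup from the demo
print(client.dashboard_link)                        # only reachable from inside the Colab VM

public_url = ngrok.connect(8787)                    # tunnel the dashboard port to a public URL
print(public_url)                                   # open this in your browser to see the dashboard
```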
You can also go to the GitHub repository under the Dask organization called dask-tutorial. It's built by Matthew Rocklin and Tom Augspurger, who are some of the people I really look up to within the Python community. And there's an amazing Dask tutorial from SciPy US, from two years ago, which is up on YouTube; you can look that up as well. Before I end, a special mention to Matthew Rocklin and Tom — they're the thinking brains behind the Dask project — and also to Ian Whitestone, because some of the benchmarks I used here were his work plus my renditions. So thank you so much for your time. I can take any questions if you have them.

Sure. Yeah. Right, so your question is that... yeah, sorry, thank you for the question. Perfect, yeah. So the question is: the way Spark works is you have a master node which schedules everything to the worker nodes and distributes the work, and so on — does Dask have a similar structure? Dask has a broadly similar structure: it has a task scheduler, which schedules everything across nodes and moves the data through, but you don't need a dedicated virtual machine for it. Even if you were running it on the laptop I'm presenting from, it would work cleanly, whereas with Spark you would have to dedicate at least a core towards it, which is not the case with Dask. "So Dask doesn't have a single point of failure?" Yeah, yeah. And if we take a step back, what's happening there is: you remember those dask.delayed objects, the wrapped functions? As soon as you create those, Dask creates a task graph — from the source, how things flow through to the end. You can think of it as something like MapReduce, but not exactly that. Those task graphs are then spread across multiple cores to be executed. Does that answer your question? Yeah, yeah. Perfect, cheers. I'll take any other questions.

Right, fantastic question — I love the question. In fact, recently at a client what we did was build automations around Dask, because Dask is nothing but Python code, right? So you can easily schedule it on a cron, you can schedule it on, say, a virtual machine, or let's talk about serverless: we've tried it on Google Cloud, so you can have it on Google Cloud Functions, on Cloud Run, on Google Cloud AI jobs as well, and it works just fine. All it's doing is calling a library, and wherever you read your data from — a SQL server, an S3 bucket, a GCS bucket on Google Cloud — you can run against that. Does that answer your question? Okay, cool. Yeah, we can talk later on. All right, thank you so much. You were a lovely audience. Thank you so much for your time.