Wonderful. Thanks. Yes. It even works. Okay. So yeah. Clicky thing. No clicky thing. Okay. Cool. So I'm Holden. My preferred pronouns are she or her. It's tattooed on my wrist in case you forget. And I am a software engineer at the IBM Spark Technology Center in San Francisco. Normally this is where I would try and convince you to come work for us, but San Francisco is a bit far away. But if you're looking to relocate to the States, come and talk to me, especially if you want to work on Spark. And yeah. I've been working at a whole bunch of places. I've been at Alpine. I've been at Databricks. I've been at Google, and sort of just working on a bunch of related big data projects. Mostly Spark for the past few years. And I'm a co-author of Learning Spark, and the new Spark book that I'm working on, which I'll try and get you all to buy several copies of, is called High Performance Spark. Because I get the most royalties from that one. There are fewer co-authors, so I get more money. If you don't have a corporate expense account though, don't worry. If you want to follow me on Twitter, it's just my name. You can watch a Canadian be really sad about what's happening in America. And occasionally I get distracted by computers and talk about computers. Or the slides are more relevant, probably. The slides from today's talk are up, and most of my other talks are on SlideShare. And I've got code and all that stuff. If you want to pretend to be business professionals, you can find me on LinkedIn. And I will do my best impression of a professional. And we can see how that goes. Yeah, cool. So what is the Spark Technology Center? Besides awesome people who keep paying my expense reports, which, thank you, IBM. Much love. It's a really great group at IBM that contributes to open source Apache Spark, as well as a bunch of other related tools built on top of and around Spark. So things like Apache Toree for notebooks with Spark. Things like SystemML for more advanced machine learning. And of course, we've got Apache Bahir for connectors. And we're a partner with the Scala Center, and pretty much we're valued partners with everyone who would have us as valued partners. Very, very choosy. But you can see we've got a graph that goes up and to the right, like all good graphs should. And this is a graph of contributions. And I have a second graph. No, sorry. Come on, Google Slides. And I have a second graph which strongly implies we are one of the top contributors to Spark. Yay! Admittedly, this graph was possibly chosen because it shows us as the top contributor, which is definitely not the case. But like, we're totally legitimately in the top four. And so, you know, you can maybe reorder this. But you can see we're doing a lot of sort of contributions to Spark. There's myself and then there's a whole bunch of other people in San Francisco working on Spark. So I'm hoping you're all nice people. Normally the Python community people are really friendly. So if you're not, don't be the one Python community that's like jerks, just like fake it for an hour. You got this. If you object to pictures of cats, I'm sorry. Now might be a good time to take a break. There's going to be a lot of cat pictures. So how many of you have worked with Apache Spark before at all? Wow. Okay, cool. That's pretty good. And now don't be afraid to raise your hand. How many of you have no idea what I'm talking about when I say Apache Spark? Okay, cool. That's awesome. So for those of you, I recommend my first book.
More seriously, you should consider my first book. But Paco also has an amazing introduction video, which will make much of this talk make a lot more sense. And there are the other talks that are going to happen today about Spark and Flink. And is there anyone else who was up at 4 a.m. trying to get tickets to Hamilton? One person? Or no, no one. Okay, well, moving on. So we're going to talk about a whole bunch of things besides cat pictures. I'm going to talk about my assumptions about you, which we just did. We'll talk about how to reuse data in Spark, which is like a really important thing. We'll talk about considerations for working with key value data, because this is like pretty common and it's also really easy to screw up in Spark. We'll do a lot of instances of the word count example. How many people are really excited about using word count? Huzzah! Thank you for going along. I like to joke that I have a big data license. And yeah, Boo fell down. It's very sad, but it's okay. This is my dog Boo. She's very cute. She comes to all of my talks unless I forget her on the plane. And then it's really sad. But she made it to this one. Anyway, so many instances of the word count example, and we'll try and make it actually relevant. And we'll also talk about why word count is so common. We'll talk about how to use Spark SQL, and we'll talk about how to interface with the JVM. Because sometimes, even though we love Python, we do have to write some Scala or Java code to get really good performance. So for those of you that don't know what Spark is, you can think of it as this really awesome distributed computing system that magically fixes all of your problems and you want to buy six of them. It's open source, so buying six is free. So do it. It's an Apache project. It's really awesome. This means that it's not just one company behind it. It's not just IBM. It's IBM, Cloudera, Databricks, Hortonworks, all of these people collaborating together to work on this project. And a lot of people find it because they're like, man, my MapReduce jobs are really slow. How do I make this go faster? And that's where a lot of people used to find Spark. And I think nowadays a lot of people in the Python community find it because they're like, my data doesn't fit in pandas anymore. Oh God, what do I do? What do I do? And then it's like Spark, yay! We'll fix your problems. Not a guarantee. We'll trade your problems for new problems. Yeah, and it's got two core abstractions and we'll look at both of them today. So Spark is a whole bunch of different components. It's this sort of core execution engine, but it also has all of these other really cool things that are built on top of it. It has machine learning tools that Juliet's gonna talk about. It's got streaming, if you've got some streaming data, which is very, very in fashion nowadays. Everyone wants to do streaming. So it has a streaming API. And it also has a SQL API. How many people like SQL? Huzzah, yay! I love this audience. You raise your hands. The Germans that I spoke to last week did not raise their hands; they were just like, I am German. Sorry. But anyways, yeah. So it has a SQL interface and it's really convenient. So if you want to intermix your Python code with some SQL code, you can do this. You don't have to use two separate systems. You don't have to install like Hive and Mahout and get all of these things to work together. And so for those of you that are already using Spark, how many of you are already on Spark 2.0?
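Just to make the "intermix your Python code with some SQL code" point concrete, here is a minimal sketch, not from the slides; the file name and column names are made up for illustration:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mix-sql-and-python").getOrCreate()

    # Hypothetical input: JSON records about pandas.
    df = spark.read.json("pandas.json")

    # Register the DataFrame as a temporary view so SQL and Python can share it.
    df.createOrReplaceTempView("pandas")
    fuzzy = spark.sql("SELECT name, fuzziness FROM pandas WHERE fuzziness > 3")

    # Keep going with ordinary Python / DataFrame calls on the result.
    fuzzy.show()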
Okay, yes, not that many people. For those of you who are looking for a reason to upgrade to Spark 2.0, there's a whole bunch of really cool things. The bridge between the Python interface and the Java world has gotten a lot better and faster. The SQL context and Hive context have been unified into this thing called the Spark session. Data frames have gotten faster in a lot of ways, which are really useful. And there's a bunch of other things and ignore the warnings. Cats. Okay, cool. So we are going to look at this word count example. And we're also going to talk about why everyone has a word count example. So part of it is that I can actually make it fit on a slide, right? That's a really important part of examples I want to put in talks. But the other part of it is when you see this like group by key line here, we actually have to shuffle data across the network, right? It's really easy to make a distributed system where our workers don't have to cooperate with each other, right? But word count is like pretty much the simplest thing we can do where our worker nodes have to sort of communicate with each other because they can't just compute the results locally and just be like, cool, I'm done. We actually have to like share information from them because if they each have like a part of the input file and they process it, then it won't give us a full picture. So this is why people like using word count. You might have guessed from the picture of the cat getting covered in water. This is not a good instance of word count. And there's some other problems too. So we compute the word count. We save it out to a file called counts, which is very exciting. And then we also compute the number of times the word warning shows up in our input. And this is pretty basic. For people that are new to Spark, let's go through it. Normally I would run over to this, but there's two of them and the microphone's kind of attached here. But so if we go to the second line and we see this rdd.flatmap, that's the line where we're tokenizing it. And because I'm from North America, I believe that I can tokenize languages with the space character and that works 100% of the time. Unfortunately, that probably doesn't work for you and you actually need to call like a natural language processing thing to do good parsing for the languages in different regions. But for English, this is pretty good. And then I've got this rdd, so I have this distributed collection of words. And then I construct the word and the number one. And then I tell Spark I want it to group all of the words together that are the same by a key and then I'll have all the number ones for it. And then I go ahead and I compute the sum. And then I save this result out. That's exciting. And then I go ahead and I compute the number of warnings and I compute this count. Now, can anyone tell me why the cat is about to get covered in water? Like what's wrong on this slide? What am I doing that's bad? Okay, yeah. So group by key. Group by key is bad in Spark for reasons that we'll talk about. Is there anything else that I'm doing that is like maybe not a great thing for me to do here? Okay, it's fine. I have a slide on what's bad. So we'll look at that. How's that? So the one that we didn't mention immediately is I've used the same data twice. And if you're like programming in Python, you're like, yeah, I use the same collection twice. What's wrong with that? That's great, right? I read in a bunch of records and I computed two things on it. 
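The slide itself isn't reproduced in the transcript, but the example being walked through looks roughly like this; a sketch assuming sc is an existing SparkContext, with placeholder file names:

    rdd = sc.textFile("input.txt")
    # Tokenize by spaces (the North American approach to linguistics).
    words = rdd.flatMap(lambda line: line.split(" "))
    # Pair each word with the number one, group all the ones for each word
    # together, then sum them. groupByKey is the part that will hurt later.
    counts = (words.map(lambda word: (word, 1))
                   .groupByKey()
                   .map(lambda kv: (kv[0], sum(kv[1]))))
    counts.saveAsTextFile("counts")
    # Second use of the same input: Spark goes back and re-reads the file.
    warnings = rdd.filter(lambda line: "warning" in line.lower()).count()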
The problem is in Spark, when we read in our data, Spark hasn't actually read in our data, right? At that first line where we said like, read in this text file, Spark just made a note of us wanting to read in that text file. And when we went ahead and we computed the word count, Spark said, oh cool, I'm done with that. I'm gonna throw that out. And then we asked it to do a second thing and it went, oh no, I have to go back and read this data in again. And so this is okay, right? We can tell Spark if we wanna keep data around, if we need to use it multiple times, we just need to tell Spark that we wanna keep it. And we've got sort of two options. We can tell Spark I wanna keep this in memory, or if we've got like a really big chunk of data or we might have a bunch of things that we're keeping in memory, we can tell Spark like try and keep this in memory, but if we can't keep it in memory, write it out to disk. So you can choose between these two persistence levels. If you just call cache, it goes in memory and that's often pretty okay. Does anyone work in a shared cluster environment where there are other users or jobs contending for resources? Okay, there's like two people, so we'll talk afterwards if you want. There's some considerations for people in shared cluster environments, but we can skip them for now. So the other thing that we have is we have potential key skew, right? If you think about the words in the English language, they're not evenly distributed in how often they're used, right? The letter A shows up a lot, right? And xylophone, unless you're doing musical instruments, not all that often. But we don't actually do word count in production all that often, right? But for a lot of things that we do in production, like grouping together by postal code or zip code, we also tend to get a lot of key skew, right? You have a lot of data focused in one city, especially humans, we group together, right? And if you wanna try and compute things, there's a good chance that you're gonna have a lot of things about humans that are all gonna group together in one place. If you're doing things about machines, there's a really good chance that the record null is gonna be your most popular key. And there's gonna be a really good chance that when you ask people what the distribution of the data looks like, they forgot about the record null. And they were just like, yeah, yeah, yeah. It's totally normally distributed. We're fine, we're fine. And it's like, no, no. 90% of your records are null because no one filled out that field. I might have worked on more than one customer project where this happened to me. And so our first problem is if we use group by key on this, it's gonna explode. Group by key will fail really quickly. But we can get other problems too. Even if we don't use group by key, if we try and sort our data, or we try and repartition our data after we've read it in, Spark's partitioners have this constraint that all of the data with the same key has to be sent to the same partition. And a partition can't be split between two machines. So all of your null records get sent to one machine and that one machine promptly falls over and is very sad. And it's okay, we can work around this, right? So let's work around it. Yay. Okay, well, first let's talk about how group by key is evil. It's pretty evil, just like this cat. So group by key is unfortunate because it sounds safe. I should have asked this before I told you it was evil. But before I told you it was evil, how many of you thought group by key was safe? No one?
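In code, keeping the data around is one extra call; a small sketch using PySpark's real storage levels:

    from pyspark import StorageLevel

    rdd = sc.textFile("input.txt")

    # Option one: keep it in memory.
    rdd.cache()

    # Option two (instead of cache, not on top of it): keep it in memory if it
    # fits, and spill whatever doesn't fit out to disk.
    # rdd.persist(StorageLevel.MEMORY_AND_DISK)

    total_lines = rdd.count()   # the first action materializes the cache
    warnings = rdd.filter(lambda line: "warning" in line.lower()).count()   # reuses it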
Okay, well, I think some people think it's safe. But if you don't, that's good. Oh, no, damn it. Come on. Huzzah. So the problem with group by key is that when we call it in Spark, Spark isn't able to really optimize anything with it. It's forced to create this giant record of all of the things with that same key. Even if right after calling group by key we immediately do some type of a reduction, like a map and a sum like we did in our example, Spark isn't able to see inside of our lambda expressions and figure out that it can pipeline this operation earlier. And we're stuck making these records that are too big to fit in memory. Unless we're in Spark SQL and then everything works. Yay, part two of the talk is Spark SQL. But yeah, so this looks like a cat getting wet. In this case, I'm computing where to open my artisanal mustache wax shop in San Francisco. And 94110 is the zip code of the area where I live. And there are a lot of people with handlebar mustaches there. And so there's so many people with handlebar mustaches that when I try and compute this, the San Francisco Mission District record is just too big to fit in memory. We could see this happening with more realistic data, perhaps with orders, right? Like the number of orders in San Francisco is probably too big to fit on a single machine. Whoa, cool, lights are back. That was trippy. But so it's okay, right? We can compute this word count without doing this. Yay, normal word count. So instead of doing group by key, we do reduce by key. It's pretty basic. But the switch from this group by key with a map afterwards to a reduce by key lets Spark know that there's an operation it can perform on the values that it finds for each key. And it can start reducing the data as it's going, as opposed to having to create this really big record and then reducing it afterwards. So by allowing Spark to sort of move this reduction earlier, we get a lot of really awesome performance benefits in that our code succeeds. Yay. Let's look at an example with kilobytes of data because I did this on an airplane. So we can see group by key has a shuffle read of 48 kilobytes and the equivalent reduce by key is about a fourth the size. It's 11 kilobytes. And now that's not all that exciting, right? Like kilobytes of data is gonna succeed even if I was processing this in BASIC. But if you had like a four terabyte input, that could make a really big difference. If you had to transfer four times the amount of data over the network, that makes a huge difference. So whenever you find yourself using group by key, you probably wanna use reduce by key or aggregate by key because Spark can pipeline these operations. And it's really awesome. And we get a map-side reduction for free. And also a cat with its face in a piece of toast. Anyways, so there's another sort of challenge with PySpark. And our big challenge is that when we're working in PySpark, we have to copy data from the JVM into Python. And for operations which are really simple, the time spent copying the data between the JVM and Python and back is gonna actually take longer than our computation itself. So for word count, this is pretty simple, right? Our operation is really basic. So the copying time really dominates. In other cases where we're maybe doing really complex math inside of our operations, it's not as big a deal, but it's still a pretty big deal.
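The reduce by key version of word count, roughly as on the slide (a sketch):

    # Because Spark knows how to combine two counts, it can start summing on the
    # map side and only shuffle the partial sums, instead of shipping every
    # individual (word, 1) pair across the network.
    counts = (rdd.flatMap(lambda line: line.split(" "))
                 .map(lambda word: (word, 1))
                 .reduceByKey(lambda a, b: a + b))
    counts.saveAsTextFile("counts")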
The cool thing is data frames are able to take our Python calls and, for the most part, evaluate them inside of the JVM. Now there is a catch. I like to call it beware of distributed systems bringing claims of usability. And that's mostly that data frames are not pandas data frames. So while they're called the same thing, if you're really excited about using pandas data frames and magically having them be distributed, you're gonna be really disappointed. If on the other hand you're like, wow, I have this magic distributed table, you're gonna be pleasantly surprised. So I say shoot for magic distributed table and then be pleasantly surprised when more things work. If you have semi-structured data, it's really awesome. So if you have a schema, the data frames are able to do a really cool, efficient storage representation. In regular PySpark, everything has to be represented as pickled bytes, and that's not really all that efficient. And data frames have this really cool efficient representation of data. The one catch is we have to go from this RDD filter with a lambda expression to this DF filter where we provide a SQL expression inside of it. So instead of supplying arbitrary Python code, we have to supply expressions that Spark can understand so Spark can turn them into JVM code for us. And naturally, like, it doesn't understand pandas, and probably part of why you're using Python is you like pandas or the Python libraries. So sometimes data frames are not gonna be a good fit for what you wanna do, but you can take a data frame and turn it into an RDD and do your like necessary work at that point. Yeah, yeah, here's why they're good. Really the main thing which makes data frames good is that Spark can understand what we're doing with data frames a lot more than it can when we're working with RDDs. Because it can't introspect our Python byte code and figure out what we're doing, data frames just give Spark an opportunity to do a lot more cool optimizations. And here's our graph showing how much faster it can be. Higher numbers are bad, lower numbers are good. It's execution time and we can see we all know group by key is bad, right? So the red bar is bad, but even if we took our red bar and we rewrote it to use reduce by key, it's still only twice as fast. But doing the same operation on data frames is way faster because there's no overhead of copying the data. Now admittedly I chose word count because it's a really simple example, but also because there's so little actual computation happening and it's mostly just moving data around. So data frames are gonna be a lot faster for something like word count, and in your actual production environment, like, they might not actually be this much faster. If you're doing really complex stuff, they'll still be faster, but this graph will have misled you just a little bit. Sorry. Huzzah! So let's go do word count again. This is the third time it shows up. So we load some data. It's a little more complicated than loading the data from a text file. But there's a really cool thing because data frames have this like schema information associated with them. We can ask Spark to figure out what the schema of our data is. That's this option, inferSchema equals true. And now for CSV data, that's not really all that exciting because you can just look at the top line and be like, yeah, I know what the schema is, Holden. You didn't do anything for me. But does anyone work with JSON data? Okay. Lots of nodding heads, not as many raised hands, but still more than the Germans.
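Side by side, the two filter styles and the inferSchema option look something like this; a sketch, with made-up column and file names:

    # RDD style: an arbitrary Python lambda, which has to run in Python workers.
    rdd.filter(lambda line: "panda" in line)

    # DataFrame style: read with schema inference, then filter with an expression
    # Spark can understand and evaluate inside the JVM.
    df = (spark.read
               .option("header", "true")
               .option("inferSchema", "true")
               .csv("pandas.csv"))
    happy_pandas = df.filter("happy = true")
    # equivalent column-expression form: df.filter(df["happy"] == True)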
So that's okay. But it's good. It's good. So the really cool thing is this infer schema thing also works on JSON data. And with JSON data, I can't just look at the top header line. There's no header line. I can't even just look at the top five records, right? Because there's a chance that on record 275,000, someone used some random field that I like have never seen before. And so for all of these different data formats, Spark is able to infer a schema for us. And it can do this by sampling the data and figuring out what the variance is and then going ahead and figuring out what our schema looks like. And for JSON data, that's really, really awesome because figuring out the schema for JSON data is really, really sad using traditional tools. And sometimes I use Spark on even small-size problems because I like the tool that it gives me for figuring out JSON data. Because most of the times when people give me JSON data and I ask them for a schema, they say, no, it's JSON data. And then I'm just sad. And if that happens to you, Spark will make you less sad. But for ORC and Hive, this is less exciting. But it works with these two. Okay, so we can inspect the schema and then we're gonna do word count because word count. But let's look at the schema. So here we have a sample JSON record. I am obsessed with pandas. So I like to pretend that we have pandas in San Francisco. We don't actually have pandas in San Francisco but here's a fake record. And then this is our resulting schema. Now this isn't all that exciting because you could infer the schema from our one record. But if you had billions of records, doing this would actually be a lot of work, right? Like you'd have to look at the billions of records and I don't have time for that. But now that we've got our data and we've figured out what the schema is, we can do the most important big data task of all, word count. So we can go ahead, we can select some text out of our data frame and then we can go ahead and we can use this lambda expression. And unfortunately, because we're using this lambda expression, all of the cool fancy optimizations for data frames go away for this one step. For this one little bit here, I don't have all of the cool stuff happening but I can use my lambda expression, right? Like I can use my arbitrary Python code. And then I go ahead and I put it back into a data frame and then I go and I use the data frame group by, which is actually safe. And then I tell it that I wanna compute the sum and then I write the data out to Parquet. And this is actually much faster. Even though we don't have the full, like, awesomeness of data frames happening here because we have this lambda expression, we still have a lot of the awesomeness of data frames happening here. And if you had like a Scala programmer on your team that you could trick into doing things for you, you could get them to write the code that was inside your lambda expression in Scala and just call their code instead, which surprisingly is what we're gonna look at next. Probably. I reordered the slides like 20 minutes beforehand. Ah, it is, lovely. So right, if you've got a Scala programmer that you're working with, or if you're a Scala or Java programmer and you're working to support a Python team, you can provide UDFs to your Python team so that they can keep living in the awesome happy world of data frames for even longer than they can normally. So Python UDFs and maps lose this benefit, but if we use Scala UDFs, everything works just fine.
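Pieced together, the data frame word count being described looks roughly like this; a sketch where the text column name is an assumption:

    df = spark.read.json("input.json")
    df.printSchema()   # the schema Spark inferred for us

    # Drop down to an RDD for the one step that needs arbitrary Python code.
    words = df.select("text").rdd.flatMap(lambda row: row.text.split(" "))

    # Back into a DataFrame so the grouping and the sum stay in the JVM.
    pairs = words.map(lambda word: (word, 1)).toDF(["word", "count"])
    counts = pairs.groupBy("word").sum("count")   # DataFrame groupBy is the safe one
    counts.write.parquet("counts.parquet")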
And PR number 9766 actually got merged for Spark 2.1, so you no longer have to do a bunch of crazy hacks to call these UDFs. Once someone's written them, it's really easy and we'll look at how to do this. So here's some function. It's not super important, but we can say we put some function inside of an object called functions because I'm very original. And we can call this function. Did I skip a slide? Yeah, I did skip a slide. Okay, so we can call this function. I'm gonna skip this part for now because it's not super exciting and we don't have to do it anymore in Spark 2.1, but if you're in Spark 2.0, go and check out my slides to find out how to do this in the old way. Otherwise, just wait for Spark 2.1. It's totally coming soon. And the new way is we can just call sqlContext.registerJavaFunction and give it the class name that our Scala developer gave us. And we can then start using this Java function from Python on our data frame, and I don't have to do any crazy stuff involving the Py4J bridge or accessing internal functions. I can do this all using the public API and my code will probably continue to work when I upgrade. You do have to specify what the return type is of the Scala function, which is a little weird, but it's not that bad, right? Yeah, okay, cool. And other things happened, damn it, it got cut off again. Other things that have gotten merged for Spark 2.1. Who likes pip? Huzzah, yay! So Spark is gonna become pip installable. This doesn't do anything for performance, but it makes me happy. So if you've got a bunch of people that you're working with and they're installing Spark, and it sounds like a lot of work, like I have to download a bunch of stuff and figure out where my Hadoop dependencies are: tell them to try pip installing it, and if it doesn't work, they can email me. And I will seriously try and fix your bugs unless you're on Windows. If you're on OS X or Linux or anything, like please try pip installing Spark 2.1. It should probably work. And if it doesn't, it's really easy to find me. I have rainbow hair. And my email is holden@pigscanfly.ca or holden@us.ibm.com depending on how professional you wanna be. Yeah. And there's a whole bunch of other things that are coming to make PySpark fast in the future. A lot of the pain that we work to avoid inside of PySpark is copying data from Python to the JVM and back. That's this whole data frames thing: data frames are awesome because they let us avoid doing that. But like sometimes we just wanna do stuff in Python. Like I just wanna use NumPy. And there's no real good NumPy replacement in Scala. So it's cool. There's a lot of really interesting work to integrate Apache Arrow with Spark. And it's probably not happening for Spark 2.1 but like Spark 2.2 or 2.3. And as we start to integrate these things into Spark, we'll be able to get faster interchange between Python and the JVM. And we have to care less about where our work is happening. And if the best place to do the work is in Python, then we don't have to worry too much about the copying because the copy will be much faster. And if anyone has Python UDFs that they're willing to share with me for evil benchmarking plans, I would love it if you would share your Python UDFs with me for evil benchmarking plans. Even if you have to share them with me from a hotmail address, I don't ask a lot of questions. Just please send me your Python UDFs so that we can have better benchmarks than word count because I'm really tired of using word count for benchmarking.
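For the Python side of calling a Scala UDF, a minimal sketch; the class name is a stand-in for whatever your Scala developer hands you, and sqlContext is assumed to already exist:

    from pyspark.sql.types import LongType

    # "com.example.functions.MyUDF" is hypothetical; the class implements one of
    # Spark SQL's Java UDF interfaces, and we tell Spark its SQL return type.
    sqlContext.registerJavaFunction("myUDF", "com.example.functions.MyUDF", LongType())

    # After that it's usable like any other SQL function.
    df.createOrReplaceTempView("pandas")
    sqlContext.sql("SELECT name, myUDF(age) FROM pandas").show()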
And the random things that I cook up are probably not all that good. I want real customer use cases, and not just the ones that have billions of dollars, like the ones from people that show up to my talks, because I care about you a lot more. But yeah, cool. And there's a whole bunch of other Spark resources. This was a really short thing on how to make Spark go fast, go vroom. But there's the programming guide, there's lots of books, there's office hours. I'm gonna try and do some office hours online on Hangouts. If people have Spark questions and they're not super secret, and you don't work for the NSA, come find me. If you do work for the NSA, I don't wanna know. Hotmail addresses, whatever. And books. You should buy several copies of all of these books. Relatedly, I receive royalties for not all of these books, but many of them. You should buy the most copies of this one, which is the one where I have the fewest co-authors and receive the highest royalties. More seriously, it has a chapter focused on going beyond Scala. And if you have a lot of Python code that you wanna make go really fast on Spark, it's like only one chapter about how to do that, but I think it's the best chapter. I also wrote that chapter, so I might be wrong, but you can buy an early release of it today from O'Reilly and tell me all of the things that I got wrong so I can fix them before it goes to print and my name is on it and I can't fix it anymore. So if you have a corporate expense account, please buy this. It's also a great gift for friends and family if you have a corporate expense account, and dogs love it too, as do cats. Yeah. If you don't have a corporate expense account and don't wanna buy a book that doesn't yet fully exist, you can give me your email address and I'll let you know when the book exists. And as I was saying, Paco has an amazing introduction to Spark video. He also has something for free on YouTube, but if you have a corporate expense account, he has a really nice professionally produced one from O'Reilly, it's $100. I mean, it's a lot of dollars as a person, but as a company, they probably won't notice, not a guarantee, but I encourage you to buy the professional ones so that Paco keeps making awesome content. I also have my own intro, it's free, it's on YouTube. I don't get any money, but if you wanna click the like happy button, I can like show my boss. Yeah. I have a bunch of talks. If you care about structured streaming, I'm gonna be doing a talk with my co-worker Seth on Thursday, and I'll also do office hours at Strata on Thursday afternoon. If anyone's American based for some reason, I've got a bunch of talks in the States as well. Or if you just want an excuse to visit Boston, San Jose or Chicago. Honestly, I recommend Boston or Chicago. San Jose is kind of boring. These are talks that are happening too. And so that's pretty much it for me. Okay, thanks, bye. Do we have time for like a break or questions? Yeah. Does anyone have questions? Cool. Oh, I mean, yay. Oh, what's your question? Oh, wait, a question about Spark 2. Oh yeah, so you don't have to cache everything in memory, but it still is in memory. It just doesn't stay in memory. So sorry, the question was in Spark 2, do we have to, can we do processing out of core? And the answer is like, no. Well, you don't have to cache in memory. You can cache to disk instead of memory. For evaluation, it's still happening in the CPU and you still cache, sorry, and it still gets loaded into the JVM heap.
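For the out-of-core question, the relevant knob is the storage level you pass to persist; a tiny sketch:

    from pyspark import StorageLevel

    # Caching doesn't have to mean RAM: MEMORY_AND_DISK spills what doesn't fit,
    # and DISK_ONLY keeps the materialized data on local disk entirely.
    rdd.persist(StorageLevel.DISK_ONLY)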
There is off-heap with Tungsten, but it still ends up in the JVM heap for the actual evaluation. Yeah, if Spark starts paging, you're gonna have a bad time. Don't do that, allocate less memory. So there's a bunch of things which could cause this. Generally, you shouldn't be over-allocating your containers. So you shouldn't be giving Spark an executor that can allocate more memory than is reasonably gonna be allocated on the physical machine it's running on. You can increase the number of partitions and decrease the amount of memory you're giving to each executor and you'll not see the paging as much. If you still see paging, it will work, but it won't work well. Any other questions? Yeah, yeah. So that's a wonderful question. The question was if I love SQL, then might I try converting some of the TPCDH stuff into PySpark code? And the answer is I'm kind of lazy, so I haven't done that, but I do work with some people who are less lazy than I am who are working on that, but they're already working on it, so I'm not gonna do that. But yeah, the people are trying this, and it's a great idea. Oh yeah, I want actual people's use cases rather than artificial benchmarks because we're really good at meeting artificial benchmarks, but I would like to try and meet more customer use cases. And that's just me. I don't get as much happiness with competing on TPCDH just because everyone's already doing it, and if I do something that's off in a corner, I won't have as much competition and I can do awesome things, but if I do what all of my other coworkers are doing, I'd have to be as smart as them, and I'm far too lazy. Okay, in your use case, you're reading Parquet files into data frames, so you, ah, okay, so you want to, so the question is, you have a lot of jobs on different Parquet files. Ah, okay, gotcha. So you have one job with a bunch of different Parquet file inputs, and as your outputs, some of the files are empty because there's no, right. So probably forcing a repartition would be good. At that point, I'd have to actually look at your output code. Yeah, so even if your data is all unique, you can just repartition on some junk. You're not actually expecting a reasonable partitioning out of it, you're just trying to get no zero-sized partitions, and so even if they're all unique, you can do like a sort and then save it out, and when you do a sort, Spark samples the data and it tries to create roughly even-sized partitions, and this should avoid creating zero-sized output files. But if the number of partitions that you have is such that some of them just have to be zero, like you only have like 2,000 records being output and you have like 20,000 partitions, then you need to coalesce down to a smaller number of partitions first and then do your sort, or do your sort and coalesce. We can look at the code; if that doesn't work, you can just email me. Yeah, so that's, so when you're reading from text files, the, okay, so are you still doing the repartitioning? Okay, cool. Let's take a look at your code because this is probably not super interesting to most people, but we can hang out. Sorry, yeah, cool. Come find me, I have been careful.
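A sketch of the workaround being described for the empty-output-files question; the column and path names are placeholders:

    # Sorting makes Spark sample the data and build roughly even-sized range
    # partitions, which helps avoid zero-sized output files.
    df.sort("someColumn").write.parquet("output")

    # If you have far more partitions than records (say 2,000 records and 20,000
    # partitions), coalesce down first and then sort, or sort and then coalesce.
    df.coalesce(200).sort("someColumn").write.parquet("output-smaller")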