Okay, let's continue. I'm super excited to have Holden here again this year, and she's gonna talk about accelerating big data beyond the JVM.

Awesome, thank you, thank you for having me back. I don't know if last year's talk was good or there just weren't too many people; regardless, I hope you're interested in going beyond the JVM. So I'm Holden, my preferred pronouns are she or her. It's tattooed on my wrist in case you forget. I am a developer advocate at Google. They pay me money and are therefore excellent people; if you're considering buying things from them, please do. I am on the Apache Spark PMC. This colors my views of the world, but not that much, because working on Spark also makes me hate Spark. Previously I worked at IBM with Nick. He still has not merged my pull requests. I'm very upset with him, but I am planning to track him down before his flight. And a bunch of other companies as well. I'm a co-author of two Spark books. I think you can buy them from the O'Reilly booth upstairs if you want. One Spark book costs about a quarter of a cup of coffee in San Francisco, so you can help keep me from having to do enterprise support contracts, although I already worked at IBM. You can follow me on Twitter. There's not a lot going on there; it's a lot of pictures of Boo, who, while adorable, does not do a lot of programming. She's great.

In addition to who I am professionally, I'm trans, queer, Canadian, on a work visa in America, which is just really fun right now, and part of the leather community. And this maybe doesn't seem super important, but I think it's just nice to remember that we all come from different places and we should work together. There's no secret Canadian garbage collector or anything that we're hiding from you. We all have out-of-memory exceptions. Our lives are all terrible, but if we work together, our lives can be terrible together. And I think I just saw someone holding the Spark book with the bird on it. Yay! Okay, so thank you for the quarter of a cup of coffee. I really appreciate it.

So I'm hoping you're nice people. If you don't like pictures of cats, this is not the talk for you. There are two exits, but you can also just work on your computer. It's fine. How many people are Spark users? Oh, that's a lot of people. That's awesome. Okay, oh, and Nick, merge my PR. How many people are Flink users? There is some overlap, but not as much as I was hoping. Now for the real long shot, are there any Apache Beam users in the house? Wow, we should talk afterwards. I'm really excited to see that there are some Apache Beam users here. That's great. Is there anyone who doesn't use any of those systems? Cool. What do you do for big data? Or is this more just a "whatever, there's a talk" sort of situation? Okay, it's the "whatever, there's a talk" situation. That's fine. That's all right. I will try and convert you to one of the big data systems which indirectly pays me money, which is pretty much all of them at this point. I'm very much hedging my bets. So how many people know some Python? Oh, thank God, my people. How many people are R programmers? I am really sorry. And how many people are just fed up with the JVM? There are only like six people. That is weird. Okay, this slide has clearly gone on for too long.

Okay, well, you're in the big data room, so whatever. This is for big data. For those of you that don't have a big data system, please pick one, preferably one that pays me money. Most big data systems are written in the JVM, though not all of them.
There's Dask and other exciting things like it which have decided that the path to happiness does not involve Oracle, and I respect that choice as well. But thankfully, they still have to interact with the rest of us, so they still feel our pain. So there is no escaping the wonders of the JVM. We're gonna look at how PySpark works. We're gonna look at how other similar systems work as well. We're gonna talk about why this kind of sucks, and we're gonna also talk about why things are finally changing, and how the experience for Python users is finally actually getting better. And I'll try and get you to buy my book many times over. The TLDR is: we've made some bad choices, but we're finally fixing them because investors noticed. And yeah, moving along.

Okay, so what is the state of non-JVM big data? It's a wonderful mixture of pickling, strings, JSON, and XML, often all inside of the same project. Yay! I love three different serialization libraries being involved in calling a single function. No one is excited. The other thing is, sometimes people look at that and they go, you know what, that doesn't sound slow enough, I'm gonna use Unix pipes. Or alternatively, this sounds too portable, I'm gonna use Unix pipes. And that's often used, not always. Some other people look at this and say, maybe I'll use sockets to connect Python and Java. And alternatively, other people look at this and they go, you are all just out of your minds, I'm just gonna rewrite this in Python. How hard could it be to rewrite all of Spark in Python? I'm sure there wasn't a lot of development work that went into that. I'll just read a paper over the weekend. And that's how we got Dask. And then they did a lot of work, and it got a lot better. But it's that classic problem of: well, what does your field do? I bet I could do that with a Perl script. But Perl is no longer cool, so we use Python.

So PySpark. Are there PySpark users in the house? Yay, and I'm sorry. PySpark uses the same general technique as most of the other non-JVM languages that work with Spark. It turns out most of the other big data systems looked at Spark, went "that was interesting", and took different directions to integrate languages, but fundamentally made pretty much the same trade-offs. And it has some serious performance issues that we're finally gonna fix, maybe.

Okay, and because this is a big data talk, I had to put in a word count example. It's pretty much a licensing requirement. I'm from California and we're very serious about not getting our licenses revoked out there. So this is the word count example (sketched below), and we can see we have these nice little lambda expressions. They're our friends. We can tokenize things and add numbers together. It's very exciting. Anyways, we'll move along.

So how does this monstrosity work? We take pickles. I guess we couldn't figure out an image to represent JSON, so we just wrote "JSON". We could have probably just had a programmer crying, I don't know. Okay, yeah. So how does Spark do this? Spark uses Py4J on the driver side. The driver is the thing which coordinates a lot of the work, and it's where the part of your Python program that you see runs. It's also the part which throws exceptions, and it uses JSON and magic. And then we use pipes, because that's a great idea. We also don't use the standard pickling library, because that was far too... uh-oh. Okay, it worked, like that.
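For reference, the word count from a couple of slides back looks roughly like this. It's a minimal sketch (the file paths are placeholders), and the lambdas in it are exactly the functions that have to get pickled and shipped over the pipe to the Python workers:

```python
from pyspark import SparkContext

# A minimal word count sketch; "input.txt" and "output" are placeholders.
sc = SparkContext(appName="wordcount")

lines = sc.textFile("input.txt")
counts = (lines
          .flatMap(lambda line: line.split(" "))  # tokenize
          .map(lambda word: (word, 1))            # pair each word with a 1
          .reduceByKey(lambda a, b: a + b))       # add the numbers together
counts.saveAsTextFile("output")
```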
And then we decided that we were just going to write our own pickling library. Well, we stole it from another project, copied it in, and called it cloudpickle.py. Someone copied it back out and has been fixing bugs ever since, which we have been backporting about a year after they fix them. And yeah, okay. There is a way around this: we can use Spark SQL, but then we lose our lambda functions and it makes us sad.

So this architecture diagram, in addition to showing that I do not work with designers, shows that Python talks to Java, and then Java is the one who does a lot of the coordinating. And then this, you know, sadness happens. In Flink, the same thing happens, except replace all of the things with other things, and instead of sockets, use mmap. I'm pretty sure this is how it works; I looked at the code this morning. I mean, I'm pretty sure it's an mmap file. There were a lot of bugs about an mmap file behaving poorly, and I was like, that is probably a core piece of your tech.

And so how does this impact PySpark? We have this really expensive double serialization that's happening. Python worker startup takes a little bit of extra time. And the error messages don't make any sense. Excitingly, we're not gonna solve this one. We're gonna make this one worse, but we're gonna try and solve this one. So that's okay.

So Spark DataFrames: they're nice, but they're not perfect. We're just gonna sort of skim over them. But we can see that when we take out the serialization cost, the blue is like good PySpark code, and the yellow, the hideous yellow, is the PySpark DataFrame code where we don't actually do any of our work in Python. And we can see that this is a pretty big difference. This Python overhead is huge. So if I can reduce the amount of overhead, I'm gonna see some pretty big improvements. So we took a look at this problem, and we found out we were spending most of our time copying data from Java into Python and back, and it turned out we were actually spending a lot of time serializing and deserializing the data. And when I looked inside all of the other open source projects, everyone had their own custom serializers, because apparently that is a great way to add value in open source: adding a new serializer.

And so, by "the future"... actually, in theory we're voting on the release right now, so "the future" should be somewhere between one month from now and eternity. But I'm optimistic that we'll actually get a release out in February, probably. But it's okay, I live in California; it'll be hard for you to track me down. And so this code exists. If you wanna try it, you just need to check out the branch-2.3 branch of Spark. And then we get Arrow.

How many people have heard about Apache Arrow? Five people? That's low. Okay, so Apache Arrow is what happens when your cat decides they like catnip and serialization frameworks. Right, so Apache Arrow gives us this really nice performance graph. Big numbers are bad, unless I'm selling you compute time, in which case continue to use slow serializers. And small numbers are better. But Apache Arrow is really awesome, and it gives us a common format that both the JVM and Python can understand, so I don't have to write custom serializers, and I can start to use shared memory buffers intelligently, as opposed to putting strings in shared memory buffers and being sad about my life choices. And so I have this graph which shows that Apache Arrow can talk to a shit ton of things.
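As a taste of what that common format looks like from the Python side, here's a minimal pyarrow sketch (nothing Spark-specific, and the example data is made up):

```python
import pandas as pd
import pyarrow as pa

# Build the data once as an Arrow table; the JVM and Python sides both
# understand these same column-oriented buffers, so nobody needs to
# write yet another custom serializer.
df = pd.DataFrame({"word": ["boo", "spark"], "count": [1, 2]})
table = pa.Table.from_pandas(df)  # pandas -> Arrow
roundtrip = table.to_pandas()     # Arrow -> pandas
```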
But from the big data point of view, I mostly care that it can talk to Java. It can also directly read in Parquet files, I can put Apache Arrow records into TensorFlow or plain Python, and I can even feed it into CUDA if I decide that I wanna give Nvidia all of my money. Which is fine, Nvidia makes great products too. And this is really nice, because this interchange has historically been really, really expensive, and you can see that when we're adding the number one, which is pretty much no work, we go from three seconds to 0.9 seconds, and that's really awesome. Actually, this might be hours, I'm not super sure. Oh no, it says seconds, yeah, okay. So adding the number one: pretty fast. And this is obviously not on one record; if it took us three seconds to add the number one to one record, we'd be very sad people. Yeah, okay. But so this is awesome, and it means that I can now start to use the cool tools that the Python kids have been using, because I live in San Francisco, and if I don't put TensorFlow into my product, I am not raising another round of venture capital, and then I cannot afford to go to Blue Bottle anymore. So this is pretty much a mission-critical thing for people like me.

So how do we access this? In Python, and here we're gonna add not the number one but two arbitrary numbers together, we go from writing this really ugly UDF to writing this other really ugly UDF, but we say "Pandas UDF" (both are sketched below). Now, we say Pandas UDF rather than Arrow UDF because the way we expose Arrow into Python is as a pandas data frame, because pandas data frames are nice and happy. And so yeah, that's great.

Okay, what do the other systems do? Don't worry, they all suck. Hadoop streaming is not streaming, despite its name; it uses Unix pipes and sadness. Kafka re-implements everything, so they take sort of the Dask approach to life. And that's great, unless you wanna use new features; features from two years ago work great. Dask re-implements everything, has multiple backends, and it's pretty awesome. The only problem, historically, has been that it's difficult to get data into Dask, but thankfully we can now use Arrow to get data into Dask from Parquet files. And that's pretty awesome. Beam: yeah, we don't do things beyond the JVM very well right now, but we have a bunch of JIRA issues which you are more than welcome to join me on. Yay, does anyone wanna do my job for free? Alternatively, wait three to six months and someone else will fix it for you.

There's a bunch of references. I really don't have a lot of time to go into the C++ code that is scary and powers Arrow, but you can go check it out. There are some interesting blog posts, and if you want to verify the benchmark slides that I showed you, you can go get them and run them on your own data. And that's pretty much it. The most important slide is this one: you can buy High Performance Spark. It covers none of the things I talked about today, but it is an excellent book. And if you buy it online, you can receive it in a nice box that your cat will love. And so that's pretty much it. I might do office hours at six, but honestly I think I'm just gonna go get chocolate. So let's pretend that slide didn't happen. I don't know if I have time for one question or if I used that up with the VGA adapter. I haven't? Okay, cool. Sorry, I saw the five-minute slide and I wanted to make sure there was time for people to heckle me, just in case anyone had questions about how Arrow works, or Spark. We have plenty of time for questions.
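Here are the two UDF versions referenced above, as a rough sketch. It assumes Spark 2.3 with an existing SparkSession named spark, and Arrow enabled via the spark.sql.execution.arrow.enabled config:

```python
from pyspark.sql.functions import pandas_udf, udf

# The plain Python UDF: called once per row, with every value pickled
# across the worker pipe individually.
@udf("long")
def plus(a, b):
    return a + b

# The Pandas UDF: same logic, but Spark ships whole columns over Arrow
# as pandas Series, so the addition is vectorized in Python.
@pandas_udf("long")
def pandas_plus(a, b):
    return a + b  # here a and b are pandas Series

df = (spark.range(10)
      .withColumn("slow", plus("id", "id"))
      .withColumn("fast", pandas_plus("id", "id")))
df.show()
```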
Or if anyone wants to buy several copies of my book, I can go back to the website, you know. Oh wait, I didn't put the website name on it. Well, just "high performance spark"; that's the keyword to type in. I notice very few laptops open right now. Very few laptops, that's unfortunate. Okay. Oh, there's a question. Yay, questions.

I'm curious a bit about Dask. I've heard about it from our users. I'm not super familiar with it, but you said maybe in the ecosystem it's not as up to date; is that what you would say?

Right, so I think Dask is a really interesting system, and its main challenge is that it eschews most of the existing ecosystem. So it's very Python focused, and it doesn't necessarily do a great job of integrating with other data sources. So if you've got existing big data, you might find it painful to integrate with your other tools. On the other hand, if you're starting from a small data shop which has then realized that now you have customers and you have big data, then Dask can be pretty cool. It also doesn't have a great machine learning story right now, although some people might be working on that, hypothetically speaking. And honestly, it does a really nice thing: Dask data frames actually look and behave like pandas data frames to a much larger degree than any of the other distributed systems. All of the other distributed systems' approach to data frames is really great. Yeah, really delightful and unique, right? There's a camera and I like my job. Yay, a second question. Yeah, and there's a coworker in the back row. It's a good thing I didn't say what I was gonna say about data frames.

So if I understood correctly, Arrow wants to provide the serialization format across Java, Python, R. Yeah. So why not something like Protobufs or Thrift? What's the difference?

I mean, this is a great question. So if you look at Arrow, it actually does use Flatbuffers in some places and stuff like this. Arrow is more than just a serialization format. The part of it that I care about is primarily the serialization format, but there are also operations defined on the serialized data, which is really nice and convenient (there's a sketch of this below). Lots of fancy numerical stuff. I mean, honestly, we use Protobufs a lot in Beam and there's nothing wrong with them. Arrow is just much more focused on making things that look like data frames, on serialization between these different systems, and on providing the tools you need to build nice data-frame-like interfaces on top of it. So we can have some of this shared code between R and Python. Most of the time we actually end up calling FORTRAN code that was written in the 70s, but now we can share the FORTRAN code that we've hopefully rewritten into C++, which at least dials down how much I wanna gouge my eye out with a rusty spoon. Yeah, if you were to go and make your own, you could go ahead and use Protobufs. But everyone else is using this one, so maybe we should all be friends together, because if we all use the same thing, then I can just steal your users, or whatever it is. We can collaborate.

Yeah, that's why I'm asking this.

Well, I mean, I think fundamentally a large part of it is probably social, in that the person who started the Apache Arrow project is also the same person who started the pandas project. And so it's like, oh yeah, this person knows what they're doing. And it's a lot easier to gain adoption when you have a good reputation for making good design decisions.
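To make the "operations on the serialized data" point concrete, here's a minimal sketch using pyarrow's compute kernels. (These kernels landed in pyarrow well after this talk; the example just illustrates the idea of computing on Arrow data without converting it back to Python objects first.)

```python
import pyarrow as pa
import pyarrow.compute as pc  # compute kernels in recent pyarrow releases

# Operate directly on the Arrow representation; no per-value
# deserialization into Python objects.
arr = pa.array([1.0, 2.0, 3.0, None])
print(pc.sum(arr))     # 6.0
print(pc.mean(arr))    # 2.0
print(pc.add(arr, 1))  # [2, 3, 4, null]
```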
So apart from being a serialization format, it's also focused on data frames. It's focused on data-frame-like data. So yeah, if you wanna serialize a random number generator, Arrow is probably not the format for you.

So you wouldn't use Arrow to serialize your lambda functions either, right?

Right, this is very much just for serializing your data, not your functions. And we pretend there's a difference, because otherwise I have to go and relearn Scheme. Okay, cool, I think that's it. If anyone wants to come find me or Boo, I'll be around until I find coffee.