So, thank you for staying, because you're all stuck here anyways. This is Extending Spark ML. Super happy new pipeline stage time, yay. I'm Holden. My preferred pronouns are she or her. It's tattooed on my wrist in case you forget, so don't fuck it up. And I'm a principal software engineer at the IBM Spark Technology Center. If anyone's looking to relocate to America for some reason still, come talk to me. And I've been doing this at a bunch of other companies for a while. You can follow me on Twitter, and it's mostly just tweets being really sad about America right now. So if you're too happy, follow me on Twitter and I'll fix that for you, and then we'll talk about software. It'll be worthwhile.

We're going to talk about what Spark ML pipelines look like, and then we're going to jump into extending them really quickly. The title said it was going beyond word count, but we've got 10 minutes, so it's sort of beyond word count. So yeah, Spark ML pipelines are pretty much the Spark developers looking at scikit-learn and going, ooh, shiny, steal. Unfortunately, they're big data developers, so they only got about halfway through stealing: they didn't bother stealing all of the algorithms, but they stole the APIs. So if you're familiar with scikit-learn, Spark ML pipelines are going to be really familiar to you. They're exposed in Scala and Python, and you can use them to do distributed machine learning. But the algorithm selection isn't 100%, right? You might have some really crazy problem that you're trying to solve, and you need an algorithm that isn't already in Spark. And the good news is you can implement it yourself.

The stages in these pipelines are either estimators or transformers. Transformers are the little blue boxes, and those are just things which don't need to be trained on data. And so they're really simple to implement: you can just write a normal Spark job, add a little bit of sugar, and make a transformer. But if you want to build something fancy, like a machine learning algorithm or something which needs to be trained on data, you'll have to build an estimator. And we'll look at how to do that really quickly.

Yeah, these two are swapped, whatever. We'll just cover up this part of the slides. No one sees this part. But we have to provide a transformSchema and a copy function. And these are the two sort of non-standard Spark bits that we have to add if we want to turn our Spark job into an ML pipeline component. And then we can add some configuration as well. And if you add the configuration the Spark way, you can do parameter grid searches and automatically tune your models. And because I'm really bad at statistics, I frequently use parameter grid search, because fuck, I don't know how to pick my L1 regularization parameter. The computer does that for me.

So here is our hard-coded word count stage. As a licensed big data instructor, I have to put word count everywhere. You may wonder why it's in every single talk ever, and this is why. But we've got a copy function and our constructor. It's not super important; you can pretty much just steal this and change the name from hard-coded word count to hard-coded magic algorithm. And for copy, you just call defaultCopy unless you need to do something really weird. In my sample set of two, calling defaultCopy works 100% of the time. If it doesn't work for you, please don't tell me. I'll have to change the talk, and that would just be sad.
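To make that concrete, here's a minimal sketch of what a hard-coded transformer stage like that can look like. This isn't the code from the slides: the class name, the column names happy_pandas and happy_panda_counts, and the Spark 2.x-style signatures are assumptions for illustration.

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types._

class HardCodedWordCountStage(override val uid: String) extends Transformer {
  // A no-argument constructor so the stage can be created without choosing a uid.
  def this() = this(Identifiable.randomUID("hardcodedwordcount"))

  // defaultCopy makes a new instance with the same uid and copies the params over;
  // it covers the common case.
  override def copy(extra: ParamMap): HardCodedWordCountStage = defaultCopy(extra)

  override def transformSchema(schema: StructType): StructType = {
    // Check that the hard-coded input column exists and is a string column,
    // then describe the column we are going to add.
    val field = schema.fields(schema.fieldIndex("happy_pandas"))
    require(field.dataType == StringType,
      s"Input type ${field.dataType} did not match required type StringType")
    schema.add(StructField("happy_panda_counts", IntegerType, nullable = false))
  }

  override def transform(ds: Dataset[_]): DataFrame = {
    // Append the word count as a new column; everything else passes through untouched.
    val wordcount = udf { text: String => text.split(" ").length }
    ds.select(col("*"), wordcount(col("happy_pandas")).as("happy_panda_counts"))
  }
}
```

Notice there's nothing here beyond ordinary DataFrame code plus the copy and transformSchema plumbing.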
And so now this is the part which is different from our normal Spark code. Because these machine learning pipelines can take a really long time to run, and because DataFrames aren't compile-time type checked, we want to do at least some type checking before we start our eight-hour or 24-hour machine learning job, rather than coming back halfway through and going, oh, it failed because I got the name wrong of this one random thing. And so transformSchema is the thing which gives us the ability to verify that our inputs are sort of what we're expecting, and to tell the downstream people what our outputs are going to look like. And if this code looks kind of weird to you, it's okay. It's just DataFrame code, and it's really simple. There are a lot of introductions to it, and that's where these types come from. The StringType here is not the Scala String type; it's Spark SQL's StringType, which is mildly different in ways that are exciting, and not for a ten-minute talk. And then we can just add our result here.

And then we can do our work. And so this is really simple, right? We're just counting the words that are being put in. I'm not even doing the proper word count, but it's really simple, and you could do something actually complex. I find that doing this with a UDF works best. Even though Spark has a really nice way to do integrated functional programming and relational stuff, it doesn't work really well with this part, because generally what we want to do is just add things to the data that's coming in. We don't want to get rid of anything, and we don't necessarily know all of the types that are coming in, so it's hard to use the full Dataset API. So you end up having to use this kind of ugly UDF syntax. But it's not the end of the world, right? This is much nicer than writing a Hive UDF. I promise: if you've never written a Hive UDF, you're lucky. And this is where your fun can be. If you're a researcher, you can spend months working on the code that's going to live inside of transform. If you're in industry, you can spend a day or two. Or if your boss is wondering what you're doing, just have them email me, and I'll be like, no, it's super important, trust me, and you can go and do some crazy stuff.

And then we need to configure our pipeline stage. For pipeline stages in Spark, we use this Params interface, and this just gives Spark a standard way of configuring stages and doing parameter grid searching. You probably wouldn't do a parameter grid search on your output column; it'd be a little weird. But you might do it on your input column, and you're not limited to strings here, right? So these are string parameters, but they could definitely be doubles, floats, whatever, so you can do a parameter search over your tuning parameters. And then you provide some setters, and those are for humans to use. And it's really simple. And yeah, we do it this way so that Spark's meta-algorithms can work on it. If you don't know what parameters you should be adding to your machine learning models, you can go look at sharedParams.scala inside of Spark's code base and just steal the ones that you want. You'll have to cut and paste the code, or lie about being inside of package org.apache.spark, because it's all private. If you lie about being inside of package org.apache.spark, I don't know who you are, it's not my fault, but I do it pretty often myself. And so no harm, no foul, but your code's probably gonna break when you upgrade. That's a future-you problem, though.
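Here's roughly how that configurable version can look once the input and output columns become Params with setters. Again, this is a sketch under assumptions (Spark 2.x APIs, made-up class and parameter names), not the exact slide code.

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.{Param, ParamMap}
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types._

class ConfigurableWordCountStage(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("configurablewordcount"))

  // Declaring these as Params (rather than plain constructor arguments) is what
  // lets Spark's meta-algorithms, like grid search with CrossValidator, set them.
  final val inputCol = new Param[String](this, "inputCol", "The input column")
  final val outputCol = new Param[String](this, "outputCol", "The output column")

  // Setters are sugar for humans; Spark itself goes through the Params machinery.
  def setInputCol(value: String): this.type = set(inputCol, value)
  def setOutputCol(value: String): this.type = set(outputCol, value)

  override def copy(extra: ParamMap): ConfigurableWordCountStage = defaultCopy(extra)

  override def transformSchema(schema: StructType): StructType = {
    // Fail fast, before the long job starts, if the input column is missing or
    // has the wrong type, and tell downstream stages what we will add.
    val field = schema.fields(schema.fieldIndex($(inputCol)))
    require(field.dataType == StringType,
      s"Input type ${field.dataType} did not match required type StringType")
    schema.add(StructField($(outputCol), IntegerType, nullable = false))
  }

  override def transform(ds: Dataset[_]): DataFrame = {
    // A UDF lets us append a column without caring about the rest of the schema.
    val wordcount = udf { text: String => text.split(" ").length }
    ds.select(col("*"), wordcount(col($(inputCol))).as($(outputCol)))
  }
}
```

With the columns exposed as Params, something like ParamGridBuilder can sweep over them (or over numeric params you add the same way) exactly as it does for the built-in stages.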
And if I've learned anything about scientists, it's that they don't seem to think about future use when it comes to their software, no offense.

So that's really boring. We made word count, and I promised we were gonna go beyond word count. So let's go beyond word count. We're gonna make an estimator. This is the fancy thing. No one's excited, but it's okay, it's fancy. It's gonna actually train on some input data. We do pretty much the same thing: our estimator has the same parameters to configure it, and then we write a fit function, and the fit function is gonna return our transformer. And so it's really simple. Yeah, this says November; that actually is old. But let's look at a really simple estimator. No, this is an old version of the slides, damn it. Okay, well, so I don't have the code for the estimator. But it's a fit function, which is just gonna return our transformer. Yay. So pretend that there's a fit function, and it's gonna construct a new hard-coded word count stage, or whatever it is you want (there's a rough sketch of that shape down below). And if you wanna see the code for that, it's totally in this really nice blog post, which is a lot longer than 10 minutes, unfortunately. But hopefully maybe I tricked someone into thinking that making their own Spark machine learning algorithm is a good use of their time, and you'll publish it to Maven Central, and then I can use it or tell people to use it, and we can all hang out and have fun. If I didn't trick anyone into doing that, that's okay too.

There's also this GitHub repo with a bunch of examples. They're written mostly by me, so they're a little rough, but if you wanna go ahead and look at the ones that are inside of Spark itself, you can find them here. And those ones are a little more, I wanna say, professional or production ready. They're not all great, but they put a lot more thought into them than I've put into mine. So you can definitely check those out. And they're gonna use internal APIs, but it's okay; you can just lie and say you're inside of package org.apache.spark. And there's API documentation and fun stuff.

As an author of some books, I would be remiss if I didn't try and get you to give me your money. And these aren't all my books, to be clear; I don't have that much time in my life. Don't buy this one, it's kinda out of date. Authors won't normally tell you that. But you should buy all of these other books, even the ones I didn't write. But if you have an expense account, does anyone have an expense account? No? Okay, well, then don't buy several copies of this. But if you know someone with an expense account, maybe someone that works at Bloomberg, you can get them to buy several copies of High Performance Spark. It is the holiday gift of the season for whatever holiday is coming up next. If for some reason you don't wanna buy a book which doesn't exist yet, that's okay, I understand. You can give me your email address at highperformancespark.com, and I'll spam the hell out of you as soon as the book is finished. But please give me your money today rather than tomorrow, because money today is coffee. So, thank you. Any questions? One? Okay. Oh, the doggie's name is Boo. She comes with me to all of my talks when I remember to bring her.
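For reference, here's a minimal sketch of the estimator-plus-model shape described above: fit() looks at the data, computes something from it, and returns a transformer with that value baked in. The task (flagging documents longer than the average length seen at fit time), the class names, and the hard-coded happy_pandas column are all made up for illustration and assume Spark 2.x APIs; it is not the code from the blog post.

```scala
import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{avg, col, size, split, udf}
import org.apache.spark.sql.types._

// The "trained" transformer: it carries the average word count learned in fit().
class AverageWordCountModel(override val uid: String, val average: Double)
    extends Model[AverageWordCountModel] {

  override def copy(extra: ParamMap): AverageWordCountModel = {
    // defaultCopy only handles single-String-argument constructors, so copy by hand.
    copyValues(new AverageWordCountModel(uid, average), extra).setParent(parent)
  }

  override def transformSchema(schema: StructType): StructType = {
    schema.add(StructField("above_average", BooleanType, nullable = false))
  }

  override def transform(ds: Dataset[_]): DataFrame = {
    // Flag each document as above or below the average learned during fit().
    val aboveAverage = udf { text: String => text.split(" ").length > average }
    ds.select(col("*"), aboveAverage(col("happy_pandas")).as("above_average"))
  }
}

// The estimator: fit() is where the training happens.
class AverageWordCountEstimator(override val uid: String)
    extends Estimator[AverageWordCountModel] {

  def this() = this(Identifiable.randomUID("avgwordcount"))

  override def copy(extra: ParamMap): AverageWordCountEstimator = defaultCopy(extra)

  override def transformSchema(schema: StructType): StructType = {
    schema.add(StructField("above_average", BooleanType, nullable = false))
  }

  override def fit(ds: Dataset[_]): AverageWordCountModel = {
    // Look at the data, compute the average word count, and bake it into the model.
    val mean = ds.select(avg(size(split(col("happy_pandas"), " "))))
      .first().getDouble(0)
    new AverageWordCountModel(uid, mean).setParent(this)
  }
}
```

The returned model is itself just a transformer, so it slots into a Pipeline like any other stage.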