We're gonna try and do some demos which, if everyone stays off the Wi-Fi, might work. But it involves connecting to a Kubernetes cluster in the cloud, so it might also just fail, which will be fine too. And so it's just me. I originally had a co-presenter. He was responsible for the demos. He didn't make them, and he's also not here. So I spent last night making some demos. I hope you enjoy them and I hope they don't crash.

If you're interested, the slides and the video, whenever it's posted, I'll embed in the blog post that you can find from this link. And I'll also put the links. I recorded the demos as well, just in case the demos went really poorly, and if anyone wants to watch them when they get home, you can go there. Okay, looks like people are done taking pictures of that slide.

So yeah, I'm Holden. My preferred pronouns are she or her. I'm a developer advocate at Google, I'm on the Spark PMC, and I contribute to Beam and a bunch of other projects. I've been at a bunch of other companies. They've all been very nice people and paid me money. I put my slides on SlideShare. My Twitter is mostly pictures of coffee, food, and my stuffed animals, like Boo. Okay. And I started doing code review live streams. It's a mixture of live coding and going through other people's pull requests to Spark. So if you're interested in seeing more about how large open-source projects do code reviews, it's a place to go. If you have feedback on this talk, I do have a feedback link. I collect everything. I read it. I pass the good ones on to my boss. The bad ones I read myself. But if you have things that you want to say, please feel free to tell me.

In addition to who I am professionally,
I'm trans, queer, Canadian, in America on a work visa, and part of the leather community. And this is still, since yesterday, not related to big data all that much directly. But I think it's important that we build diverse communities, especially for those of us who are building machine learning models, so that we take into account what the world is around us and we don't just recreate the same bad shit faster. So hopefully we can all work together and solve some cool problems. And if your team looks entirely like you, this is a thing you should fix. Even if this is open source, it is still a thing you should fix. If it was your five friends from college and now you're working together, you know, you may all have very similar backgrounds, and you should get some other people involved.

Okay, so I'm gonna talk really briefly about what Kubernetes is, how it's different from YARN, and some nice things about this. Then we'll talk about how to switch to using Kubernetes for Spark. I'm gonna skip the brief tour into Kubeflow. That's just not gonna happen with the joys of our setup work. But, Wi-Fi cooperating, we'll mostly do demos. They all involve word count. I know everyone loves word count, so that's gonna be real exciting. And then I'll talk about sort of why, well, you can indeed use Spark on Kubernetes today, our plans to try and make it better, and other areas to improve here.

So Kubernetes is a new open-source cluster manager, and it uses containers. Okay, I thought that was a fun joke. It runs programs in Linux containers and it has a lot of contributors, over 60,000 commits recently. Okay, so now we can see it. Actually, we've got our kernel. We have little containers. It's Docker rather than shipping containers. And we deploy our applications inside of it. And for those of you who are coming from the YARN world:
This is kind of exciting, especially for those of you working in Python. You probably are in the situation where you'll want to use libraries, and then your cluster does not have the libraries that you want to use, and now your life is sad. With Docker and Kubernetes your life won't be sad because of that. It will be sad for other reasons. So that's nice, right?

Okay. So more isolation is good, right? And this is really nice. It means that if you and your co-worker need to deploy different versions of Spark, or even the same version of Spark but you depend on different Python libraries, you don't have to worry about stepping on each other's toes. This is really, really convenient. I know I keep coming back to this, but if you've worked in the YARN world, you know how painful this can be.

We can also virtualize a bunch of networking stuff. For the most part that doesn't really give us a lot of benefits in the Spark world. It just makes my life harder because I misconfigure firewalls frequently. But that's okay. And we can also have other levels of isolation. We can specify maximum memory usage. This is useful so that we don't stomp on each other's containers when we're running in production. We can throttle CPUs. This is really useful if you want to have some burst capabilities for perhaps your, you know, batch analytics job that isn't super important but should run at a lower priority than your real-time jobs. And we can do all kinds of fun configuration stuff. And there are persistent volumes, and you can use them with Spark. Probably shouldn't. It'll go poorly, but that's okay. That's okay.

And, oh, right. There's also credential management, the default of which is base64 encoding. Really, really very high security right there. But there are alternatives to depending on the secrets of base64 encoding. Which, actually: a long time ago, when I was a systems administrator, we were doing a migration.
It was going to be very painful because, of course, we assumed that we had stored the user passwords in a very secure manner. And we were setting up a new authentication system, but then we looked at it and I was like, well, these user passwords look a little funny. Wonder what happens if I try base64 decoding them. And then I found some references to how delicious juice was, and I was like, this is probably not a one-way hash. And so yeah, that's a similar experience that you can have when you're playing around with Kubernetes secrets.

Okay, so the main thing that we get from deploying on top of Kubernetes is that we can have our dependencies nicely managed. Things like spaCy, scikit-learn, TensorFlow. These are things which are really painful to manage on our own. The solutions for people in the YARN space have tended to be a shared conda environment. And that's certainly better than nothing. It's not great. But supporting different versions of libraries is kind of hard, and it tends to also involve sort of an administrator, and in my experience, things that involve administrators get done eventually but not very quickly, right? Certainly not in the span of from last night to this morning in time for a demo, right? And so because my life is largely focused on things like that, it's not a good plan.

Okay, so what does our fine architecture look like? We'll have nodes. Nodes can have multiple containers deployed on them. And the other thing is, we can put Spark alongside our other applications, right? If you're just deploying Spark, setting up a Kubernetes cluster is probably overkill. But there's a good chance that you have other things that can also be deployed on Kubernetes, and then you can deploy these on the same cluster without having to have sort of multiple cluster management solutions, right?
And so, yay, happiness. And note that the IPs being different means that if your firewall rules were written for the nodes, yeah, you're going to add another firewall rule, which screwed me up last night. Anyways, it's cool. We get nice isolation. Everything's happy. That being said, everything can go wrong, but that's okay.

So how does big data on Kubernetes work? The non-JVM support is relatively new. So if you've been trying to use this before, it's been pretty much focused on just supporting Scala and Java, and at that point all of the cool things that we get with containers we don't really care about, because Java has a pretty easy dependency management story. Client mode is new in Spark 2.4, and this means that we can now run interactive applications. Historically, we could only run batch applications. There is Kerberos support. It's an exciting opportunity to possibly leak Kerberos tickets to anyone on your Kubernetes cluster. But it might not be broken. And if you're willing to try that out and file bug reports, please do. Yeah, cool. And lots of refactors, because there were a lot of bugs, and rather than fixing them we figured we'd just rewrite the code and close all of the bugs as fixed. So that was a good plan.

So how does Spark on Kubernetes work, as software? So we have our configuration. We request executors. We remove executors. And the Kubernetes scheduler eventually gives us our containers. It handles communication. It's very happy. The Kubernetes scheduler, you don't have to worry about it too much. If you're from just the traditional Spark world, you can think of it as YARN, but with a different set of bugs.

And so then what happened? Okay, whatever. Um, that was not intentional in any way, shape, or form. That is still not intentional. Oh dear, okay.

Okay, so we have our client. We schedule a job. We schedule a driver pod. Our driver pod is gonna go ahead and make some further resource requests. It's gonna ask for some executors.
We schedule our executor pods. They could get scheduled on the same nodes or different nodes, really depending on how many we're asking for and what kind of memory and other jobs are scheduled. And then, yeah, another person can schedule a second job. The second job will have its own driver and its own executor pods. Yay. And you can have multi-tenancy with things that are not Spark, right? That's the important part. If you're just using Spark, you probably only want YARN.

So how do we change our application to run on top of Kubernetes? In theory we could just replace the configuration where we tell Spark to run on YARN with master Kubernetes. In practice, if you do that, that's an exciting opportunity for your job to fail. So what we need to do is: we need to change our master to point to Kubernetes. We need to build a container which has all of our dependencies inside of it. We probably need to change our storage systems around. If you're on top of YARN, there's a good chance that you've been using something like HDFS. You don't have that, so you're gonna have to pick S3 or GCS. Then you're gonna change your cluster manager, and then you find out that all of your tuning stuff is now completely garbage. And so you get to start over with Spark tuning, which is a really great way to spend somewhere between a month and my entire life. And so then we do this, we've got a job, we iteratively improve it, and eventually it works. And then by the time it works, someone comes up with a new cluster manager.

Okay, so with that exciting start, we're gonna go into my demo of word count. And there is a recorded demo. I'm not gonna play it for you, because it has swearing in it. My boss told... oh, there's... okay, whatever. Water's gonna go here.
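The migration steps listed above (point the master at Kubernetes, build an image, move off HDFS) can be sketched as a small checklist helper. This is only an illustration, not anything shown in the talk: the function name and the sample paths are hypothetical, while `spark.master` and `spark.kubernetes.container.image` are the real Spark setting names.

```python
def kubernetes_migration_todo(conf):
    """List what still has to change before a YARN-style conf can target Kubernetes."""
    todo = []
    # Step 1: the master has to point at the Kubernetes API server.
    if not conf.get("spark.master", "").startswith("k8s://"):
        todo.append("set spark.master to k8s://<api-server-url>")
    # Step 2: executors run from a container image that carries the dependencies.
    if "spark.kubernetes.container.image" not in conf:
        todo.append("build an image and set spark.kubernetes.container.image")
    # Step 3: HDFS paths need a new home, for example S3 or GCS.
    for key, value in sorted(conf.items()):
        if isinstance(value, str) and value.startswith("hdfs://"):
            todo.append(f"move {key} off HDFS to an object store")
    return todo


# Hypothetical YARN-era configuration, for illustration only.
yarn_conf = {
    "spark.master": "yarn",
    "spark.eventLog.dir": "hdfs://namenode/spark-logs",
    "spark.executor.memory": "4g",
}
for item in kubernetes_migration_todo(yarn_conf):
    print("-", item)
```

It does nothing about tuning, which, as noted above, starts over from scratch.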
Okay, so let's see if I can do word count. Hopefully I can. I've only been working with this system for five years. Admittedly I've been working with Kubernetes since about a week ago. So... okay, is this text readable to people? Okay, cool, I'm gonna make it smaller so you can't read it, then.

Okay, cool. So I've downloaded Spark. This is Spark 2.4, and since I'm on Google Cloud rather than Amazon, there's some extra little steps that I had to do. But first, how we would start is, there's a bin directory and there's this Docker image tool. And this Docker image tool builds a Docker image. It doesn't put any of your special sauce inside of it, but it builds a Docker image with what Spark needs. This will give us something that we can use as a base to then put whatever special sauce we need inside of it.

So then we can write a Dockerfile to add some special sauce. And in this case my special sauce... and I'm sorry for using nano. I just don't have Emacs installed on this remote machine. I should have thought that through. I lost all of my street cred right in like 30 seconds. Anyways, okay. So we can see here, we've got "Boo's demo projects are rad". So that's my Docker registry, and I'm using the Python Spark one. Oh man, people are taking pictures. Fuck, I can't even claim I didn't do this. Okay.

Um, and then, because it turns out that while I said Java has an excellent dependency management solution, what I really mean is: oh god, oh god, the shading hurts. So we delete an old version of Guava. Don't ask any questions. That's totally normal. And then we tell Spark, or sorry, we tell Docker, that we want this random other version of Guava. That's fine.
Somehow this works. I continue to be surprised. And then we tell it that we want to use the GCS connector. And you don't have to do this for, like, all of your Java dependencies, but since we're gonna use the GCS connector to load in our Java files or our Python files, we need to make sure that it's in our initial container base. That being said, if you were on Amazon, S3 support is integrated. But I work for Google, and so, yeah, cool. And it's not just us, right? Like, Microsoft also joined the party and is not integrated into the core either. So we have to do a similar add thing if you want to run on top of Microsoft, and that gives us blob storage. The two bottom lines are commented out at the bottom. If you're doing this on Spark 2.3, you need those lines. If you're doing this on Spark 2.4, you put those lines in and it doesn't work. That's normal. Everything is normal.

Okay, so we'd go ahead, we'd build this new Dockerfile, we'd push it. So I do a docker build. I'm not gonna do the docker build, because this takes a while. But you know, we just do this docker build and we'd say this is the version I'm building. You can note that there's a dash four at the end, which tells you how many times it took me to get this right. We build it, and then we do a docker push, and we're happy, and now we can use spark-submit. I really hope this is the right spark-submit.

Okay, so we're gonna use spark-submit. We tell it we've got this local Kubernetes thing. Now, this isn't Minikube. I just set up a Kubernetes proxy. Makes it easier for me. We tell it that we're deploying in cluster mode. In cluster mode, what happens is we deploy the driver program inside of Kubernetes, and this is really useful because then if my computer crashes, my Spark job still runs. Like, that's really good. The downside of it is that I can't really use it interactively, right? I don't have, like, a shell to go and press buttons.
So if I wanted to use interactive queries, cluster mode would be a little rough. And then we tell it to point at the random image that I just built. We tell it the number of executors we want, and we give it a service account which has permissions to create those executors on the cluster. And then we give it a pointer to our file. Here I'm using Python. If you were using Java, it would just be a .jar. It's totally the same stuff. And then I give it any of the inputs to my job.

So we're gonna run this, and there's at least a 20% chance it's gonna succeed. So who's with me? Yeah! Yay. At least 20 people don't hate me. Cool. Let's find out.

Okay, cool. Well, so the first thing, it starts off by telling me that I should use HTTPS, but security's for suckers. Okay, and it gives me this pod name. So that's exciting. That means I think our odds are up to like about 50% by now. It means that we scheduled a job, and this is exciting, and I can go ahead and I can go over here and I can get the logs for that. Cool. So you can see, actually, it finished. Fuck. Well, I was hoping it would take longer. I should have given it some bigger input. But we can see it counted some words. Very exciting. It's genuine big data. We did word count. Yeah!

Yeah, you are the second audience I've tricked into clapping for word count this year. Really excited. Or not tricked, sorry. My boss says I should communicate more carefully. It's a good thing he doesn't watch these videos.

Okay, anyways, so we counted some words. This looks pretty solid. And that's cool, right? Kinda. Almost. Okay, so that's demo number one. We can do a word count application pretty much the same as we could on any other back end. That being said, there's a chance that your company has a Kubernetes cluster you can deploy on top of, and that might make your life easier. So we're gonna do a second demo, but I don't remember what it is.
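For reference, the build, push, and submit sequence from this first demo can be sketched as the following command lines, assembled in Python so the pieces are easy to see. This is a reconstruction under assumptions, not the speaker's actual script: the registry, tag, bucket paths, custom image name, and service account name are all hypothetical, and `http://127.0.0.1:8001` assumes a local `kubectl proxy` on its default port. The `spark-submit` flags and `spark.kubernetes.*` setting names are the ones documented for Spark 2.4.

```python
import shlex


def demo_commands(repo, tag, api_url, app, app_args,
                  executors=2, service_account="spark"):
    """Return the build, push, and submit commands as argument lists."""
    custom_image = f"{repo}/spark-py-custom:{tag}"  # hypothetical image name
    # 1. Build the stock Spark images that serve as the base.
    build_base = ["./bin/docker-image-tool.sh", "-r", repo, "-t", tag, "build"]
    # 2. Build the image from the custom Dockerfile in the current directory.
    build_custom = ["docker", "build", "-t", custom_image, "."]
    # 3. Push it somewhere the cluster can pull from.
    push_custom = ["docker", "push", custom_image]
    # 4. Submit in cluster mode, so the driver runs inside Kubernetes.
    submit = [
        "./bin/spark-submit",
        "--master", f"k8s://{api_url}",
        "--deploy-mode", "cluster",
        "--conf", f"spark.executor.instances={executors}",
        "--conf", f"spark.kubernetes.container.image={custom_image}",
        "--conf", "spark.kubernetes.authenticate.driver.serviceAccountName="
                  + service_account,
        app,
    ] + list(app_args)
    return [build_base, build_custom, push_custom, submit]


commands = demo_commands(
    repo="gcr.io/example-project",          # hypothetical registry
    tag="v2.4.0-4",                         # hypothetical tag
    api_url="http://127.0.0.1:8001",        # assumes a local kubectl proxy
    app="gs://example-bucket/wordcount.py", # hypothetical application file
    app_args=["gs://example-bucket/input.txt"],
)
for cmd in commands:
    print(" ".join(shlex.quote(part) for part in cmd))
```

Once the submit returns a pod name, `kubectl logs <pod-name>` is the usual way to read the driver output.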
I'm gonna print space. Okay, we're gonna do word count in client mode. Yeah, okay. Let's see if this works. Spark dash shell. Oh. Okay, cool.

So there's some extra bits that we had to add to make client mode work. And there's actually some other extra bits that I'll just talk about, because if I show you them, it's depressing. So one of the things is, in client mode we need the Kubernetes pods to come back and talk to our host, right? They have to be able to talk to where we're running our driver program, so that we can submit jobs to them and get results back, right? We need to have that network communication. So we have to tell it what our IP address is, because otherwise it defaults to the host name, and in that case, for some reason, it doesn't work. If you have DNS properly set up on your network, you probably don't have to specify this IP address. We also specify a port, and then everything else is pretty much the same, except we also specify a port for the block manager, right? Those are the two things we have to do differently.

And then we're gonna run this. It's probably gonna work, and I'm gonna do word count a second time. Really excited. I managed to fit three word counts in this talk. We might cut one of them for time. I don't know how I'm doing on time. Whatever, he's sound out. It's fine.

Cool, so we can see... oh, oh, that's bad. Huh. Okay, I don't know if this is gonna work or not, and I'll tell you why. So there's this little warning message here, right? And it says that it can't bind to the port that I specified. And that's okay if it successfully passed that information to the workers. But if it didn't pass that information to the workers, it's not gonna work. But that's okay. Let's find out together.
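The client-mode differences just described (an explicit driver address, a driver port, and a block manager port) can be sketched as arguments for `spark-shell`. This is a hedged illustration: the IP address, port numbers, and image name are hypothetical, and the helper function is made up for this example. `spark.driver.host`, `spark.driver.port`, and `spark.blockManager.port` are the standard Spark setting names.

```python
def client_mode_shell_args(api_url, image, driver_host,
                           driver_port=7078, block_manager_port=7079):
    """Build spark-shell arguments for client mode against Kubernetes."""
    for port in (driver_port, block_manager_port):
        if not 1024 <= port <= 65535:
            raise ValueError(f"port {port} is outside 1024-65535")
    if driver_port == block_manager_port:
        raise ValueError("driver and block manager need different ports")
    conf = {
        "spark.kubernetes.container.image": image,
        # Executors call back to the driver, so they need an address they
        # can actually reach; an IP avoids relying on DNS for the host name.
        "spark.driver.host": driver_host,
        # Fixed ports make it possible to write narrow firewall rules.
        "spark.driver.port": str(driver_port),
        "spark.blockManager.port": str(block_manager_port),
    }
    args = ["./bin/spark-shell", "--master", f"k8s://{api_url}",
            "--deploy-mode", "client"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


shell_args = client_mode_shell_args(
    api_url="http://127.0.0.1:8001",             # assumes a local kubectl proxy
    image="gcr.io/example-project/spark:v2.4.0", # hypothetical image
    driver_host="10.0.0.5",                      # hypothetical driver IP
)
print(" ".join(shell_args))
```

If one of the chosen ports is already taken, picking a different number is the simple fix.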
Yeah, and if it doesn't work, we'll know why, and then it would just be: change the port number. And that's totally okay. So we're gonna do the same thing. Text file. Man, I think I can spell the word diversity. I'm gonna remember which side the "i" goes on. Diversity, yeah, that sounds right. Okay, uh, Jupyter underscore new dot sh. Cool.

Okay, normally we'd go and do the rest of the word count, but I'm not entirely sure if I spelled diversity right or if this is gonna succeed. So we're just gonna do a job right away and we're gonna see if this works. Okay, this looks... yeah. As I always say, the key to success is lowered standards.

Okay, so I could go ahead and type word count a second time, and I probably should, just, you know, just in case the union's watching. Okay, and then, yeah, reduceByKey, underscore plus underscore. Cool. Okay, res1 dot collect. This probably works. If it doesn't, I'll be mildly embarrassed. Oh, yeah, cool. It works. That's awesome. So, okay, we did word count a second time. That's great.

How many people in here use Jupyter notebooks? That is a lot of people. Relatedly, I hope you test your code. Okay, so demo number three. I remember what we're gonna do, because I was doing this until about three o'clock today. It wasn't fucking working, but it works now, and my life is terrible.

Oh, right. So the other thing that I had to do to make this work is change some firewall settings, and I was just like, you know what, screw it. We'll just make everything talk to everything except for the internet, and that worked out fine. In practice, you're gonna want to, you know, those ports that we said, you're gonna specify those exact ports in your firewall rules rather than just going for everything. But the problem is, because we get a different subnet for our container IPs than the one that we already had before, the default firewall rules I had didn't work super great.
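The subnet problem described above can be checked with a few lines using Python's standard `ipaddress` module. The address ranges here are made up for illustration; the real node and pod ranges depend on how the cluster was created. The point is only that a rule written for the node range does not automatically cover the pod range.

```python
import ipaddress


def uncovered_sources(allowed_cidrs, source_cidrs):
    """Return the source ranges that no existing allow rule covers."""
    allowed = [ipaddress.ip_network(cidr) for cidr in allowed_cidrs]
    missing = []
    for cidr in source_cidrs:
        network = ipaddress.ip_network(cidr)
        # A source range is covered only if it sits entirely inside some rule.
        if not any(network.subnet_of(rule) for rule in allowed):
            missing.append(cidr)
    return missing


# Hypothetical ranges: the existing rule only knows about the nodes.
node_range = "10.128.0.0/20"
pod_range = "10.8.0.0/14"

print(uncovered_sources([node_range], [node_range, pod_range]))
```

Any range this reports needs its own rule, ideally limited to the driver and block manager ports rather than opened to everything.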
Okay. So now we're gonna try demo number three. Fuck. Jupyter notebook. Hmm. Failed reverse search is not a good sign if I'm trying to do a demo. Okay, so, Jupyter notebook. We're gonna want your notebook, my work, cool. Okay, let's see if this works. Okay, cool. It worked. Oh, and I have the sample file from earlier, because I saved it, so I don't have to type it again. I'm really smart. Oh, and I can show you a bug that I found while I was doing this. So... that's not the bug. I'm not that superstitious, but I'm not gonna say that B word again.

Okay, so no one can read this text. So that's great. It's a good start. Best demos are the ones where people can't ask you what the hell's going on. Okay, so this is pretty similar. The difference is we can't configure it with command line arguments, so we construct a SparkConf object and we set many of the same parameters. It's pretty much the same thing. The difference here is, this is a Python 3 notebook, so I set the Python version to 3. However, if we scroll down just a little bit, there's this little note here that says unfortunate smiley sad face. And it turns out that there's a bug wherein, when you specify the Python version, we promptly ignore it. It makes it a lot faster, but doesn't give you any answers. So that's great.

And if we run this, it might work. Did I exit the other one? Well, yeah, whatever, let's run this. Right, run all. Who knows, maybe it'll work. And, okay, so now we're creating our Spark context. And that looks like it worked. So let's go ahead and take a little... yeah, did it. Okay. Thank God.

So over here, I have the Spark web UI. This is the port, I think, for this one. And we can indeed see that this actually has these 10.56 ones. It's not just the driver. It is properly configured.
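The notebook setup described above, where settings go into a SparkConf object instead of command line flags, can be sketched as a list of key and value pairs. This is an assumption-heavy illustration rather than the notebook from the demo: the image name and driver address are hypothetical, and `spark.kubernetes.pyspark.pythonVersion` is my guess at the Python version setting the speaker mentions (it is the documented name in Spark 2.4). The pairs are plain Python so the block runs without PySpark installed.

```python
def notebook_conf_pairs(api_url, image, driver_host,
                        executors=2, python_version="3"):
    """Key/value pairs that mirror the command line flags from the shell demo."""
    return [
        ("spark.master", f"k8s://{api_url}"),
        ("spark.submit.deployMode", "client"),
        ("spark.kubernetes.container.image", image),
        ("spark.executor.instances", str(executors)),
        ("spark.driver.host", driver_host),
        # Assumed to be the setting behind the bug mentioned in the talk.
        ("spark.kubernetes.pyspark.pythonVersion", python_version),
    ]


pairs = notebook_conf_pairs(
    api_url="http://127.0.0.1:8001",                # assumes a local kubectl proxy
    image="gcr.io/example-project/spark-py:v2.4.0", # hypothetical image
    driver_host="10.0.0.5",                         # hypothetical driver IP
)
for key, value in pairs:
    print(f"{key}={value}")

# In a notebook with PySpark installed (not executed here):
#   from pyspark import SparkConf, SparkContext
#   conf = SparkConf().setAll(pairs)
#   sc = SparkContext(conf=conf)
```

Keeping the settings in one list makes it easier to compare them against the flags used in the shell version.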
It got two Kubernetes executors to do its work. And, I mean, they're not doing any tasks, because we finished our word count example. Turns out I probably should have picked like a gigabyte of information instead of a kilobyte, but whatever. It's fine. So that's word count again. And we can use it in a notebook, and this is kind of convenient, especially if you're using something like JupyterLab and you're deploying on top of Kubernetes anyways. And so this is kind of fun. The configuration is kind of garbage, and we should really fix that Python version issue, but I'll get to it later. Maybe once I'm finally home for 48 hours.

Okay, cool. So those are the three demos, and I've got seven minutes left. But I'm pretty sure I don't actually have seven minutes left. So what does the future look like? A whole lot of things are missing in our Kubernetes support. Dynamic scaling: kind of a little rough. One of the things which is also a little rough is our storage support. I would like us to automatically take the files that you're submitting to Spark and put them in a cloud storage thing that you've configured, so you don't have to manually copy it. I'm kind of lazy. I think that would be nice. It would be cooler if our authentication integration was less sketchy. We'll leave it at that, but if you are curious and would like to be sad, you can go look at the Kubernetes, no, Kubernetes Kerberos Spark integration. When you put those three words together, you get a very sad cluster. And better documentation. So none of the things that I showed you were particularly documented, or the parts that were, were documented incorrectly, like the Python version setting.

Dynamic scaling might sound pretty simple, right? We can just request more executors. And that's cool, but there's this, like, inconvenient problem where sometimes jobs need to scale down, right?
Like, the holidays happen and you scale up, but after that, it's not like you get Q4 indefinitely forever afterwards. And scaling down is bad, because we lose all of your files right now. And so we lose all of the intermediate work and we have to recompute it. It doesn't lead to correctness issues, but it leads to slowness. And we could do a smart scale down. We could add a shuffle service, so we'd only scale down the executors, or we could migrate the shuffle files on scale down. One of the two, doesn't really matter. And then our stuff will not be terrible in the event that you have to scale down. Um, that might happen next year, maybe, if we're lucky.

Okay, so, um, if this sounds like an adventure you want to do: ideally, you're the kind of person whose compensation is not directly tied to your performance, and you can spend a lot of time just submitting bug reports and open-source pull requests to help us fix our software. But if that is you, please come join us. We need more of you. And I have some blog posts and there's some documentation. If you thought those demos went well and want to see the hour-long struggle to get it to work at the start, because of that fucking network configuration, you can watch this really depressing live stream. And if you're interested just more generally in contributing to open source, if you don't have like several weeks to spend, that's totally fine. I do open-source code review live streams that I think are pretty cool.

But I'm down to four minutes and I would like time for a question. Oh, right, and we have the most important parts coming up next. Here's a bunch of books on Spark. I wrote some of them. Other people wrote the other ones.
They're pretty cool. This one I got a better royalty deal on. It doesn't cover any of the content in today's talk either, but that once again should not stop you from purchasing it. Yeah, the Kubernetes integration came after I wrote this book, but you can still buy it anyways. And if you have small children who you would like to introduce to the concept of distributed systems, I hope you will join me at distributed computing for kids dot com. And I want to be clear, this is not a joke. Many people laugh when I tell them this, but I actually think that we can use distributed systems like Apache Spark to teach children functional programming at a young age, and get them before they learn about mutation. And anyways, maybe not. Maybe this is a terrible idea, but hopefully you'll still buy the book anyways. That's the most important part.

Okay, if anyone wants to fly to San Francisco tomorrow, I'm giving a talk there on Saturday. If you're tired of nice weather, I'll be in London in December. And if you like barbecue, I'll be in Texas in January. I'm probably gonna be somewhere in February, but I forgot to write it down. Okay, cool. So that's pretty much it. If you want to give me feedback, there's a link here. I have a testing Spark survey that I'm always trying to get people to fill out. If you run Spark in production, please fill it out, and we can get people to make better choices with testing and validation. I'll be around. My jacket lights up. You can probably find me even if the batteries run out, I suspect. And I'll be around to answer questions for a while, and then I'm gonna go get some delicious ham. So thank you. Thank you all.

Okay, this is the moment for the questions, and I know that there are many people with many questions.

I mean, it looks like it's full. No questions?

No, there should be a question.

Oh, right, I've got that thing right after this, right? The one where people can come and ask me questions. Do you know where it is?

Yes.
Just go upside and ask the expert, in the front at the right. They can go there.

Okay, cool. So you can ask me questions now and then. Well, or just then, because it doesn't look like anyone wants to ask a question right now.

Yes, there is one question. Okay, I'm gonna give you a microphone. Microphone, please, for him.

Okay. One microphone. If you want to yell, I can repeat your question.

No, it's much better, because the one at the back isn't gonna hear it. Okay, okay, in the middle. The guy with glasses, with the hand. "The guy with glasses" is not really... If I say little lady with Christmas... okay, I'm sorry. For you it's easier, right?

Yeah, no, it's fine. I accept. So at Trans March in San Francisco, people cannot find me, because it's like "the girl with the pink hair in the mermaid dress," and they're like, shit, there's like seven of you. Anyways, sorry.

Yeah, this one is actually coming from my friend on my right, okay? Who's very shy. He's very shy. You work for Google, but you were using Azure. Why is that?

Oh, uh, I want to be clear, the demo was actually on Google Cloud, but it could also run on top of Azure. The code supports both paths, and that's just because, I don't know, why not support the alternatives? Okay, of course my employer would love it if you ran on top of Google Cloud, and I of course would also love it. I like money. Like, a lot. I have to pay rent in San Francisco. But realistically I don't really care. I just want this stuff to work and I want you to get your job done. And if incidentally you pay my employer money, that's good. There are cameras rolling, so please definitely buy Google Cloud. Buy six clouds.

Thank you so much.