Oh, cool. I guess I'm giving a talk. OK, rocking. Ooh, I can hear myself. Fancy. Cool. So I am here to talk about the Tower of Babel, or Babel. I'm American, so I can't pronounce words. Actually, I'm Canadian living in America, so I extra can't pronounce words. And this is about making Apache Spark, Apache Mahout, Kubeflow, Kubernetes, and a few extra friends all play nice together. And now every time I hear Tower of Babel, or Babel, I think of the book Snow Crash. Has anyone here read the book Snow Crash? OK, so the three of you are going to love the references that I'm going to make. For the rest of you, just pretend that I am hilarious and that pizza is somehow related to machine learning. Also, you should consider reading Snow Crash. It's an excellent book with some weird bits in it. And it makes me very hungry for pizza. So here's a picture of some pizza, so you can be hungry, too. OK, so in addition to not being very good at pronouncing words, my name is Holden. My pronouns are she/her. It's tattooed on my wrist, which is really convenient in the mornings when I'm just like, what, where am I? Who am I? And I'm on the Apache Spark PMC. What that means is I'm like a committer, but I'm really, really hard to get rid of. Unfortunately, it's not like tenure in the sense that it guarantees I get money. I can still get fired. And in fact, I checked my email right before this talk to make sure I wasn't part of that 2%. Good news. And I contribute to a lot of other projects besides just Spark. Previously, I've worked at a whole bunch of other places. I haven't yet quite caught all of the Pokémon on the traditional bingo card that you get in San Francisco, but I'm confident that before I eventually get hit by a car and die, I will succeed at this. I'm the co-author of a few books, including one that is actually related to this talk. And you should definitely buy several copies of each book. They make an excellent gift for whatever the next holiday is.
And this is Europe; you have tons of holidays. So you should definitely buy tons of these books. You should also follow me on Twitter. And if you like questionable code, you should check out my GitHub. As you can tell by the jokes that I'm making, I may not represent the views of my employer, although they definitely do pay other people to make very interesting jokes. And I'm realizing I should stop talking. We are hiring still. And if you're interested, please reach out, although it's mostly in North America. So if you like paying for health insurance with a credit card, come and talk to me. No? OK. Worth a shot. OK. So in addition to who I am professionally, I'm trans, queer, Canadian in America on a green card, and part of the broader leather community. And this is not particularly related. There is no secret Canadian out-of-memory-exception debugger ring that they give us. I think it's useful for us to all talk about where we're from, so that if we realize that we're all surrounded by folks with very similar backgrounds, we try and get some other people in the room. And it's also actually part of why I come to Europe. It's not just because I enjoy beaches. It's because I like meeting people from different backgrounds than myself. My co-author is not present. Trevor is a wonderful person. He is based in Chicago. He has a new kid. It's very exciting. Everyone send happy, warm vibes, and maybe some getting-to-sleep-sometime-this-year energy, towards Trevor. He is the PMC chair of Mahout, which means it is even more difficult to get rid of him. Once again, it does not come with any guarantees of money. Just really difficult to get rid of. And he's an ASF member. Right now, he's mostly looking after his kid. But he is also trying to import electric tricycles into America. And if anyone happens to be interested in that, which is a pretty long shot, definitely reach out to Trevor. His email is at the end of the talk.
And that's probably a great reason to go and visit Chicago, which is not as nice as here, but does have water. OK, cool. So what are we going to talk about? So we're going to start with our adventure, slash case study, but I thought adventure sounded cooler. So Act 1 is going to be getting to know the characters and the problem that we're trying to solve. Then we're going to talk about the problem in a little bit more depth. And then we're going to talk about how we solved the problem. And I should use air quotes when I say solved. Solved, except by the time we solved it, the solution wasn't useful anymore. But that's OK, because we did a bunch of cool things along the way. And so that's going to be the epilogue. And we'll talk about the cool things that we can learn from our exciting adventure. OK, who are our friends, besides on Twitter? So, OK, Kubernetes, my second favorite friend. Kubeflow, who is the main character. And for the three of you in the room who read Snow Crash, Hiro Protagonist. For the rest of you, that was a hilarious joke. I don't know why those three people aren't laughing, but maybe they're European. And Apache Spark, who is my favorite. But in kind of the way that your kid is your favorite, even if they're maybe not so good at everything, you still really hope that they're going to succeed this time. That's sort of how I feel about Spark. And Apache Mahout, which is not my kid, and therefore I care a little bit less about. Sorry, Trevor. Apache Mahout is very much Trevor's kid and much more important. OK, so I'm going to assume that this is KubeCon. We're several hours into it. You probably know what Kubernetes is. So we'll sort of skip past that. If you're new to the Kube community, that's awesome. I'm super stoked you're here. And that's really cool, but we're not going to explain Kubernetes.
In the context of what we need it for in this sort of machine learning thing: how many people have had to work, sorry, have had the opportunity to work with data scientists? OK, cool. How many of you have gotten something like an untitled_x.ipynb notebook from data scientists? That is almost the same number of hands. OK, and how many of you had that run successfully without needing to install any dependencies? That is no one. Great, OK. And so that is why we are using Kubernetes. Also, because running on one computer is slow, and also Kubernetes is cool and we like money. So Kubeflow is if someone was like, yo, what about if we put all this machine learning stuff on Kubernetes? So it's got this Kube and flow. We're going to make pipelines. It's going to be really cool. Everything is definitely going to work. And you should definitely buy my book about how it works. And we need it because putting together all of these different tools kind of sucks. No one really wants to be sitting around waiting for a job to finish, to go and kick off another job, to go and kick off another job. No one really wants to have to think about how they're going to get their data from one tool into another. We just want some magical tool to take care of it, and Kubeflow promises to do that for us. We'll find out later that we did have to trade much of our happiness for that. But that's OK. I trade my happiness for money quite frequently. The other reason why we need it is reproducible research. And grad students hate reproducible research almost as much as Grumpy Cat. And Kubeflow is this wonderful opportunity to make it so that we incidentally get reproducible research out of things. If we build them in Kubeflow, we can get these nice pipelines, and we can run them again in the future. Once the grad student graduates, once our coworker wins the lottery, once their shares finish vesting. There are some other jokes here, but we'll move on. Yeah, yeah, yeah. OK, cool.
So Spark. How many people are not familiar with Apache Spark? Four people. You should buy my books. But for you four people, I'm so stoked that you're here. Yeah, OK. So Spark is a really cool data processing tool, and it definitely works 100% of the time. It works 80% of the time. And so it allows us to do distributed data processing, so we can handle data sets that are too big to fit in an Excel spreadsheet. So if you find yourself trying to open something in Excel, and Excel is like, hey, I can't open files from HDFS, that is Apache Spark territory. We're going to make it better. OK, yeah, and we need it because it turns out that CT scan images are kind of big and don't fit very nicely in an Excel spreadsheet. And to be honest, they don't fit really nicely in a computer. And Spark is able to handle everything from "doesn't fit in an Excel spreadsheet" all the way to "doesn't fit in several computers." On the flip side, Spark does a really bad job of handling "fits on a floppy disk." So if you have data that fits on a floppy disk, Spark is probably not for you. But you should still buy my book. OK, Apache Mahout. Oh, wait, yeah, yeah, we've got the new logo. Woo. Don't worry, none of the code has been improved. But the new branding is glorious, glorious, glorious. No, I joke. Actually, much of the code has been improved. Apache Mahout was originally created for MapReduce, kicking it old school. And then Spark came along and people were like, whoa, MapReduce isn't cool anymore. Let's use Spark. Very happy about that decision, y'all. And so Mahout was like, oh, OK, we should rewrite for Spark. That took several years, because people are lazy. And I'm including myself in this. I'm lazy too. And then they got a new logo, so that you could know that it was new and fancy and ran with the new fancy thing that was about five years old. OK. So the other thing about Apache Mahout is that it is a tiny, tiny little project that refuses to die, largely because of Trevor.
Trevor is amazing. And if you have ever said to yourself, you know what, I want to get involved in an open source project, but there's just way too many people in all of these projects. It's going to be so confusing. I will not know who to talk to. You should get involved in Apache Mahout, because you can just email Trevor. He is always the guy to talk to, 100% of the time. I'm joking. There is actually a mailing list, but you can also just email Trevor. It's pretty fast. Sorry, Trevor. OK. Cool. So why do we need it? So we need it because math is hard, and the people that wrote Spark, including myself, are kind of lazy. And we made some machine learning tools, but then part of the way along, we found out that people were giving us money anyways. And so maybe someone else could make the machine learning tools. So we kind of stopped making them. And so that's why we need Mahout, because we want to do some fancy machine learning type things and some fancy math on top of Spark. OK. And S3 buckets. How many people here are new to S3 buckets? I'm so sorry. Slash congratulations. These are bad. But it's OK. They beat the alternative of doing it ourselves. So we can store data in them. And sometimes we can even read the same data that we stored from them. No guarantees; certain terms and conditions apply, not valid in us-east-1. Yes. OK. At least someone likes my cloud jokes. So yeah, they're usually not the most performant. But the alternative is standing up my own HDFS cluster. And if you've ever stood up your own HDFS cluster, you, too, will be very happy to use Amazon S3. OK. So what is the problem that we're going to solve? We're going to solve the problem which is why we're all wearing masks. To be clear, we don't solve it. We all still have to wear our masks, except for me, because I'm talking. But the big problem in the early days of COVID was we didn't really have a fast way to detect if someone had COVID. And so we wanted to do COVID screening.
And we thought, you know what? I have a problem. Let's add computers. And then we had two problems. So let's see if we can solve the problems we created for ourselves. The answer is: sort of. OK. So we needed rapid testing. So we're going to go back to March 2020, when you could not walk into a Walgreens. Oh, crap. You don't have Walgreens here. Boots. The thing with the green plus sign? Farmacia. You could not walk into the pharmacy and buy a COVID test. So back in March 2020, life was sad. Life was very sad here. Yes. Life was very sad in America. And we couldn't really figure out who had COVID, and it took way too long. So, oh yeah, and the new rapid tests that we got were about 60% accurate. Oh, yeah. Slightly better than my average in my non-major subjects. Shout out to Google for not checking my GPA. Or Netflix, too. OK, yeah. And so, really cool, a bunch of people came up with things that were more accurate than 60% and went kind of fast. Ultrasounds were pretty cool. The CT scans showed a lot of promise. Now, admittedly, the people who said that the CT scans showed a lot of promise were the people who did CT scans. So possibly some bias, in the same way that I might tell you Apache Spark is really cool and you should buy my books. So yeah, best diagnostic tool according to the person selling you the diagnostic tool. So that's great. There were some slight problems with that. One of them is cancer. It turns out that there's a downside to getting a bunch of CT scans, and that downside is radiation. To detect COVID in someone, initially you needed to do a full-body, fairly high-dose CT scan. And that's not great, especially if you might be doing it multiple times on people. I don't know how many people here have taken more than one COVID test. Certainly I've taken a lot more than one. So that was definitely like, well, OK, we should see if we can make something that isn't going to turn the population into little radioactive people.
So instead, we could use low-dose CT scans. And the plan was, more or less, that we're going to go like CSI Miami, which I really hope you have, and we're going to say, Enhance. And we're going to turn this into something that tells us what's going on. And so in comparison: much less radiation, much less chance of cancer, yay. Only problem: we needed science fiction technology. OK, the good thing is, it turns out that some people who were way less lazy than us came up with some ideas to denoise images a long time ago. It turns out, though, that it's kind of hard, and denoising a full-body CT scan that way involves about 500 gigabytes of RAM. And it turns out that my credit card has what would be described as a low limit, so that's not happening. And we should figure out some way to do this without using 500 gigabytes of RAM every time we want to denoise an image. So we figured, OK, you know what? I've got a problem. We'll apply machine learning to it. It'll give us magic technology from the future. Everything's great, and we'll run it on Kubernetes. So that'll be cool, too. OK, intermission. OK, no one likes my intermission music. OK, so we, yeah, wait, oh no. OK, there we go, yes. So we really needed some way to detect if people had COVID. The best idea that we had was admittedly from someone selling CT scans, but it seemed like a really good idea. And in fact, there were a whole bunch of people who did collect data on what the scans of people with COVID looked like. Now, to be clear, we were not like, hey, we'll make a model that's going to tell you if you have COVID or not. There are a bunch of people who tried to do that, and it turns out they did a really good job of detecting if someone was lying down when they were having a CT scan, because people were much more likely to be lying down in a certain position if they were really sick. And so, yeah, correlation and causation: not the same. But OK, cool, cool.
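The family of denoising techniques the talk is gesturing at can be sketched in a few lines of NumPy. This is a toy, single-machine, truncated-SVD version; the low-rank-plus-noise test image and all names here are mine, not from the talk, and the real pipeline runs a distributed SVD via Mahout on Spark precisely because full-body scans blow past single-machine RAM:

```python
import numpy as np

rng = np.random.default_rng(42)

# A toy stand-in for a CT slice: low-rank structure plus additive noise.
rank, n = 5, 64
clean = rng.standard_normal((n, rank)) @ rng.standard_normal((rank, n))
noisy = clean + 0.1 * rng.standard_normal((n, n))

def denoise_svd(img, k):
    """Truncated-SVD denoising: keep only the top-k singular components."""
    u, s, vt = np.linalg.svd(img, full_matrices=False)
    return (u[:, :k] * s[:k]) @ vt[:k, :]

denoised = denoise_svd(noisy, rank)
err_noisy = np.linalg.norm(noisy - clean)
err_denoised = np.linalg.norm(denoised - clean)
```

Keeping only the dominant singular components discards most of the noise energy while retaining the image structure, which is why the reconstruction lands closer to the clean image than the noisy input does.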
We've got a data set. We've narrowed our scope of problems, so we're just going to use machine learning to make the images look better, and then we're going to give it to a human, so it's not our fault if a bunch of people don't do so well. One of my goals in life is to not be directly responsible for someone's death. Right, OK, cool. So why did we use free and open source software? For one thing, I'm cheap. For another thing, I like using open source. And for another thing, at the time, things were not looking great for people who had low-limit credit cards, and especially for countries that had low-limit credit cards. And the nice thing about free and open source software is, like, yeah, it's free. Admittedly, it's only free if you value your time at about $0, but conveniently, that's what I value my time at, as does Trevor. Unless my employer is listening, in which case, please continue to pay me money. My time is worth whatever you pay me, plus 10%. OK, cool. So we had an idea. We were going to make a pipeline. It was totally going to work. Everything was great. And because we were also admittedly working on a book at the same time, we were like, you know what would be really cool? If we did this with Kubeflow, because Kubeflow could totally solve this problem for us. So we'd take our CT scans, we'd turn them into something that we could do our fancy science on, we'd load them into Spark, and then we'd do our distributed SVD. And then everything would be great. And then we could denoise our image. And then, in theory, a human could look at it and say, like, yeah, this person looks cool. They just have a cold. Or like, eesh, this person should not go outside right now. And they get to go into the special room. Cool. So first step, we loaded our data. And the wonderful thing about being lazy is that you search on PyPI before you write code. And so the images were all in a format that we couldn't just directly load into Kubeflow. Great news.
Someone had already written a library for it, because it was in Python. We could just really easily whip up a Python script and make that the first step of our pipeline. It took the images and dumped them to a PVC. Now, this did have some slight implications for scaling, because we had ReadWriteOnce PVCs. So we were only able to run one instance at a time. But the good news is that the data set at this stage had not yet reached the "holding this makes me very sad" size. So it was OK that this was not happening in parallel and was only happening on one node. The next stage is the one where, if we weren't doing it in parallel, life would be sad. So we read them in from disk into an RDD. It is kind of annoying to do this in Spark, something that we should be doing better, because it's reading from a local disk into a distributed system. One of the things that I really wish was easier was converting ReadWriteOnce PVCs into ReadOnlyMany PVCs in a happier way, but that's a long story and not particularly related. And then the DRM thing is not the thing where they didn't want you to listen to music in the '90s or early 2000s. It's a Mahout thing, a distributed row matrix; the M stands for Matrix. This was the sciency part. So Trevor did the math part. And then it came time to do our fancy math. OK, so SVD is, yeah, that's hard. And let's see how many minutes I have left. Yeah, not that many, because I don't have four months. So we're not going to go into the details of how we perform an SVD, but suffice it to say, Mahout on Spark can do happy SVD. And it's very nice. It does not need 500 gigabytes of RAM on one computer. Everything is great. No rusty spoons required. And if you're interested in actually learning what's going on here, there is a link at the bottom. And these slides are actually also on the schedule link.
So you can go to the schedule link, and then you can click on these links, so you don't have to write it down or take pictures. Of course, feel free to take pictures, because Boo is fabulous. Oh, yeah, Boo is my dog. OK, so one of the things that happens sometimes when we do things is we're like, you know what we should do? We should see if it worked. And that's always a mistake. That's the first step to sadness. And yeah, so yeah, it was sad. Or more specifically, one of the things that we realized is, well, if we just ran this iteratively a whole lot, we would eventually go from kind-of-OK image to slightly better image all the way back to kind-of-crappy image. And so humans need to do things, and we can't just give the magic computer box the magic button and everything gets better, in part because we don't have a good enough fitness definition of what a good image is. That part still takes humans. OK, cool. So we did have this pipeline. It cleaned up a bunch of images. What happened? Or what came out of doing all of this stuff? So before these results were published, the world changed. And for the better, to be clear, right? Like, if we were all getting CT scans before getting on an airplane, that would kind of suck. Tests got cheap, a lot cheaper than the cost of doing CT scans, and they got more accurate than 60%. So pretty solid. So just like a real software project, by the time we finished, it was no longer useful. But there were a bunch of really neat things that came out along the way. One of them is there was this idea in the early 2000s that we could do this futurey, sciencey thing, and we could clean up these images. There was a published paper, but not enough code to actually just go and run it. We were actually able to recreate it, and that's kind of cool. And reproducible science is like, yay, it's happy.
Grumpy Cat and grad students might disagree, but that's OK, because Grumpy Cat is not the principal investigator. Right, OK. We also, along the way, discovered that running Spark on Kubernetes is a lot of fun, which is part of how I have a job. So one of the things that we discovered is that we didn't have a shuffle service on Kube. And so we worked on some alternatives, and I changed jobs in between. And the first alternative that we came up with was this kind of janky decommissioning thing where we copied files around. Then one of my co-workers came up with another, even smarter thing where we would copy files around, and then if things went to hell, we'd copy them into an object store, and we could still scale to zero. Or not quite zero; we could scale to one, because there was certain information that we couldn't manage to get out of the JVM. Oh yeah, we used Java. I'm sorry. Other exciting things that were admittedly unrelated, but were pain points that we experienced that have since been improved: pod allocation for our scale-down and scale-up is kind of flaky. So we're scaling down and back up and down and back up. And this happens a lot when we're doing things like switching from ETL to the machine learning phase. Really excited that Volcano and YuniKorn support is now in Spark. I think it's being voted on right now, although it did just get a minus one during the previous talk. So no hard commitment on when it's going to be available. But if you like building from source, oh yeah, you can play with it. And then the NVIDIA folks did some wonderful things with making it so that the training and ETL stages can use different types of resources. Instead of just having a different number of machines, Spark can actually take advantage of the fact that Kubernetes can allocate pods on different machines with different kinds of resources. The other one is that we got an example for our book, which, once again, you should all buy.
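The decommissioning-plus-fallback-object-store behaviour described above is exposed through Spark configuration (these flags exist as of Spark 3.1; the API server address, bucket path, and script name below are placeholders, not from the talk):

```shell
# Sketch: enable executor decommissioning with shuffle/RDD block migration
# and an S3 fallback store, so the cluster can scale down without losing data.
spark-submit \
  --master k8s://https://<k8s-api-server>:6443 \
  --conf spark.decommission.enabled=true \
  --conf spark.storage.decommission.enabled=true \
  --conf spark.storage.decommission.shuffleBlocks.enabled=true \
  --conf spark.storage.decommission.rddBlocks.enabled=true \
  --conf spark.storage.decommission.fallbackStorage.path=s3a://my-bucket/spark-fallback/ \
  my_pipeline.py
```

With the fallback storage path set, blocks that cannot be migrated to a surviving executor get copied to the object store, which is what makes the near-scale-to-zero behaviour possible.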
And one of my co-authors is actually here in the room, if you want to. Yeah, so you should buy it, so that we can each afford more coffee. And you should buy several, several copies. But yeah, there's the example on GitHub. You can check it out. And it's a full pipeline. It's not just an iPython notebook and a note that says, good luck, have fun. Great, you should buy these books. They are great for the holidays and make an excellent gift. There is also the article that came out of it. I get no money from the article. Do not care if you read it. The book, though, yeah, that's where the money is. OK, cool. So what could we have done better? If I had a time machine, or if I was solving this problem again today, let's hope we don't have another pandemic, what could we do better? So Kubeflow makes using all of these different tools possible. And that's kind of cool. We have all of the different tools stitched together. And it's really cool. The only problem is it's also kind of slow. We end up writing data out to disk, or S3, respectively, between each of these tools. And it turns out that the only thing slower than disk is distributed disk. That's not actually true; tape is slower than distributed disk. But it's not a fun time, right? It's also not able to help us turn our non-distributed tools into distributed ones, right? So if we have something that isn't already parallelizable, Kubeflow is not going to be able to make it parallel for us. It is going to be able to schedule it on a really big, chunky machine for us. That's kind of cool. But it's not going to magically make pydicom run on three machines at the same time. So how could we make this suck less? There's a whole bunch of different things that we could do. The first one is we can avoid writing to disk. And technically, we could do this with Kubeflow Pipelines, because we can return anything. But this is the same sort of "technically" wherein technically everything's a Turing machine. So yeah, you can.
But no one's going to. We could move our parallelization stuff up the stack. So instead of just using Spark to parallelize things, we could see if whichever workflow tool we were using gave us some ability to parallelize more things. And there's probably a bunch of other things that we could do. Oh, by the way, also this is my dog. His name is Professor Timbit. For some reason, Trevor left him off the paper. But he is included in the book. And if you buy a copy, you can see another picture of him. Right. OK, so probably we could do more things. But as mentioned before, I am very lazy. And right, so the other thing is we have three standards. We've got these three different tools. The solution is to make another tool. Does anyone remember that XKCD? For copyright reasons, it's not included. But if you don't remember it, you should definitely go look it up, because that's what we did. Or another book, right? The software engineering thing does pay a lot more than books. But right, we could make some new tools. So we can use Ray to make using multiple tools less expensive. And we can do this because Ray is able to represent the data in an internal Apache Arrow in-memory format using its object store, called Plasma, which has a nice shared memory interface. And if any of you have used shared memory before, you know that when I say nice, I mean dumpster fire, but a contained dumpster fire. So it's really good, really fast. So we could make using these multiple tools less expensive. So we could use Ray to stitch our things together instead of just depending on Kubeflow Pipelines. We could also just try and have fewer tools. And we could do that by doing things in Dask. For example, Dask has distributed SVD, as well as having the ability to distribute the pydicom stuff. So we could use something like Dask, so we have fewer tools to stitch together.
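Ray's Plasma object store isn't something to show in two lines, but the core idea, handing an array between steps through shared memory instead of serializing it to (distributed) disk, can be illustrated with Python's stdlib multiprocessing.shared_memory. This is an analogy only, not Ray's actual API:

```python
import numpy as np
from multiprocessing import shared_memory

# "Producer" side: place an array into a named shared-memory segment
# (roughly what Plasma does for you, with Arrow-formatted buffers).
src = np.arange(12, dtype=np.float64).reshape(3, 4)
shm = shared_memory.SharedMemory(create=True, size=src.nbytes)
np.ndarray(src.shape, dtype=src.dtype, buffer=shm.buf)[:] = src

# "Consumer" side: attach by name and get a zero-copy view; no disk,
# no serialization round trip between the two pipeline steps.
shm2 = shared_memory.SharedMemory(name=shm.name)
view = np.ndarray(src.shape, dtype=src.dtype, buffer=shm2.buf)
total = float(view.sum())  # 0 + 1 + ... + 11 = 66.0

shm2.close()
shm.close()
shm.unlink()
```

The consumer never copies the data; it just maps the same memory, which is why stitching steps this way beats bouncing intermediate results through S3 between every pipeline stage.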
And even though every tool that we stitch together is still going to be kind of expensive, there are fewer of them going on. But regardless, the answer is buying more of my books. OK, so you might be saying to yourself, hold on. That sounds great. I will, of course, buy those books that you mentioned on the previous slide. But should I still buy your Kubeflow book? And the answer is yes. And this is because Ray and Dask don't solve everything. Specifically, when it comes to the level of isolation that we're able to get: Ray and Dask make it really easy for us to use different versions of Python libraries. But if we ever have to deal with different versions of CUDA or different native libraries, our life is going to be really, really sad. And we'll want to use something like Kubeflow to do the coordination instead of trying to use Ray or Dask, which have a much tighter integration. The other thing is serving. The Ray people, if you're watching this talk, I'm sorry. So you can do serving with Ray. You probably shouldn't. Or if you do, you're going to have an exciting learning opportunity, which is going to give you an exciting opportunity to give a talk about all of the things that went wrong with building a serving system in Ray. So you can do it. And you can fix those bugs, because it's open source. So please do. On the other hand, if you are in any way judged on your performance instead of hours worked, consider not doing that. And you can use something like Seldon for serving, which is much happier. And I say this because I think one of the Seldon people is in this room, and I'm not sure if the Ray people are in this room. OK, also hyperparameter tuning. So, OK, yeah, there's a whole bunch of really lovely tools that we get with Kubeflow that we don't really get with Ray or Dask. They're sort of more designed to make these things parallelized and less focused on the fancy machine learning bits. So that's the trade-off there. Cool.
Also, if you've ever said to yourself, Holden, this sounds terrible. How can I get my children involved? Have I got the answer for you. You too can teach kids about the magic of Kubeflow with Kubeflow Place. Pallas? No, Place. They should have called it Pallas, whatever. It is available now on the internet. And unlike all of the other books, it's free, which is clearly a mistake. But before they realize this mistake, you should go and download that PDF and show it to your children to save them from the life of software engineering. Alternatively, if you know kids who do not have your home phone number, you could teach them the magic of Spark with Distributed Computing for Kids, which will be available for values of soon which are similar to when I will do my assignments, which is soon. And you can teach them the wonders of Spark. And then later on, if you make the mistake of giving them your phone number, you can teach them the wonders of out-of-memory exceptions. Yay. Not me. Oh, I should take my phone number off the internet. Oh, dear. OK. Anyways, so that's it for the talk. I would really love any live questions. Alternatively, if you're shy and you don't like asking questions of the person wearing the fabulous dinosaur dress in person, you can always email me. Thank you, one person who clapped. It means the world. You can also email Trevor. Trevor is delightful. And he'll answer questions about the math things, electric bike imports into Chicago, and whether or not he has slept. Oh, and also, this is only tangentially related: he proposed to his wife in the preface of the book that we wrote together. And she said yes, and now they have a kid. And so I feel, in a very small way, partially responsible for this. And you too can feel partially responsible by buying several copies. OK, does anyone have any questions? That looks like a no. Oh, did I go over time? OK, well, I'll be around, because I am cheap. And I think there's free drinks. I hope.
I hope. Someone told me there were. If they were lying to me, I would be very sad. I'm going to go try and find free things to drink and eat. And you can come and find me. If there is another person wearing a dinosaur dress, please introduce me. Otherwise, you can just look for me by my dinosaur dress.