Holden, welcome!

Hey, thanks — thanks for having me. I'm Holden, and this is Professor Timbit, helping me this morning to make sure I don't fall asleep.

Hello, professor! Can I call him professor?

Yeah, he goes by professor. His research area — well, he's not ready to talk about it just yet; he doesn't want to get scooped. But he has some very exciting papers coming out, I'm sure.

OK, we're looking forward to listening to you, Holden. One thing: just a reminder to our audience that you can start sending in questions for Holden from now, because otherwise we won't have time. Apologies for starting the video with a delay. So, any questions for Holden, please send them in. All yours, Holden.

Thank you. Thanks. So, yeah, I'm going to talk about some of the similarities and differences between Spark, Dask, and Ray. They're all distributed systems, so in doing that we're going to talk a little bit about some of the principles of distributed systems.

My pronouns are she/her. That intro was actually already amazing; the only thing I want to add is that I also do code review live streams, some live programming, and live-streamed writing of tech books — so if you're interested in those things, definitely check out my YouTube; there are a bunch of streams there. I'm also a trans, queer Canadian. I live in America, and I got my green card this year, which is very exciting — I mean, it's harder for them to get rid of me. I'm also part of the leather community. That's not directly related to these things, but I think it's important for those of us who are building data or ML tools — especially those of us working in open source or at large companies, which can have a really large impact on the world — to look around at our teams, and if everyone is from the same background as us, it's time to try to expand the pool of people we're working with. Part of that is talking about where we're all from and our backgrounds. So I'm hoping you're all nice people.

You're probably interested in distributed systems if you're here, and if not, that's OK too — I'll try to have some pictures of Timbit to distract you if this isn't your cup of tea.

So: I'm going to talk about distributed systems — data parallel distributed systems — and we're going to look at these three systems. We're going to talk about how they're different, and then we're also going to talk about some common parts, some of the parts where they're very similar. And we're going to talk about some of the mistakes that we've made in building these systems over time.

For those of you who aren't familiar with distributed systems, your life is probably much happier. There's this wonderful quote from Leslie Lamport: "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." As someone who works on Spark, we do a lot of work to try to make sure that's not the case, but at the end of the day there are still times when the failure — maybe not of a single computer anymore, but at the very least of a rack — could very easily cause our systems to become unusable.

So why do people use distributed systems? Scale. Generally speaking, if your data fits in memory on one computer, that's a lot better.
It's much less work, and you can solve problems by throwing money at them. Relatedly, the follow-up is that while you can buy huge, huge amounts of memory in a single computer, those computers just get really, really expensive. And the last reason is a bit of a joke, but not completely: distributed systems also make traditionally simple problems really challenging to solve. I know for me that's part of the appeal. From a business point of view this is terrible, but from an engineering point of view it means all of these problems that had become kind of boring become interesting again, and I think that's neat.

So what are the core building blocks that all of these different distributed systems are built on top of? There are distributed locks, distributed clocks, distributed counters — pretty much, for every fundamental computer science thing, a distributed version of X, for all X. The locks, clocks, and counters are some of the key building blocks they all depend on.

While I'm just talking about data parallel systems today — Dask, Ray, and Spark — it's important to know that there are other kinds of distributed system problems. File storage systems: if you use HDFS or S3, those tend to be distributed systems (MinIO can be, and can be not). If you've run Folding@home or the distributed.net RC5 challenge, those tend to be embarrassingly parallel, with minimal or no coordination between the nodes. Those problems are really fun, and they're really nice because they don't involve a lot of communication between computers — and communication between computers, just like with humans, is where things break. Databases: not all databases are distributed, but Cassandra is a good example of a distributed database. API servers are also frequently distributed systems these days; we tend to have multiple API endpoints and put them behind a load balancer.

For the most part, data parallel systems — which is where I work — let us ignore a whole bunch of problems that we'd have to deal with in all of these other ones. File storage systems are super, super painful to write; data parallel systems, not as bad.

To a degree, in data parallel systems we get to ignore time and the ordering of events, and this is pretty awesome. The notable exception is when people insist on making streaming systems, which is unfortunately increasingly popular. To a degree we also get to avoid network partitions — not because they don't happen, but because we just declare that the winning partition is whichever one happens to contain the head node, and that's very, very easy. Multiple clients: generally speaking, we don't allow multiple clients, and it's a lot easier when you have a single client. Leader elections: we generally statically assign leaders, so there's no election. And we tend to ignore the failure-of-the-head-node thing. This lets us get away with all kinds of things, because we can essentially take a distributed systems problem and say: you know what, we're just going to solve it on one computer, and we're going to make that computer responsible for it. And that's really cool — the downside, of course, is that if that node fails, everything breaks. But it's not too bad: we get to skip a whole bunch of problems (a tiny sketch of what that buys us follows below). There are some downsides to this, though; we'll get back to those.
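To make concrete what skipping all that coordination buys us, here's a minimal sketch of the data parallel happy path using dask.bag — the numbers and the squaring function are just placeholders. Each partition is processed independently, no worker cares about ordering or about any other worker, and my single client process is the one coordinator:

```python
import dask.bag as db

# Split the input into independent partitions; each one can run
# on any worker, in any order, with no coordination between them.
numbers = db.from_sequence(range(10_000), npartitions=8)

# A pure per-record transformation followed by an associative
# reduction -- nothing here cares about time, ordering, or which
# worker handled which partition.
total = numbers.map(lambda x: x * x).sum().compute()
print(total)
```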
So what's left when we skip all of those problems? Dividing and coordinating the work; reliability on machine failure (besides the head node); and the times we allow state. While we get to ignore state to a large degree, training machine learning models tends to involve building up a bunch of state — you're building up this collection of parameters that represents your model. Transactions sort of matter too, and this comes up even without streaming, because we tend to need to do things like speculative execution. And the last one is the really important one: bottlenecking on the reliable node. Once we've designated this head node, the problem is that engineers are lazy and we tend to put a lot of things on that one node — and then it turns out that this starts to get really slow, and all kinds of sad.

So how hard can dividing work be? If you're an IC, you might think: that doesn't seem like that much work. But if you have a manager or a PM, you can go ask them how hard it is to divide the work of your team, and they might have some opinions. Even in computer world, things are really difficult. Key skew falls into this problem of dividing work, because frequently we try to partition by keys, and key skew gets us non-uniform processing times — so, stragglers, which, if you've been using Spark, you're probably well aware of. (I'll show a tiny sketch of the classic mitigation, salting, in a moment.) Pretty much any variant of trying to coordinate and split up work is actually really hard — it sounds really easy until you try to do it, and then life just gets all kinds of painful.

Then there's fault tolerance: how are we going to handle losing a node? Different people have different approaches. Hadoop MapReduce solves the reliability problem by saying: none of my workers are reliable, so I'm just going to save the data out to the Hadoop file system, which replicates it across a bunch of computers — then it doesn't matter if my computer fails, because there's a replica of the data somewhere else, and I can just go read it from there. Recompute-on-failure is the approach taken by Spark and Dask, and to a limited degree Ray (we'll talk about that more later), but it requires that we keep track of how to recompute the data. It also really breaks down when we're updating state, because when we go to recompute the data we might apply the update more than once. And it gets really annoying if our failures become correlated — historically, recompute-on-failure was really good because failures of computers were semi-independent, I would say, but with more and more people moving to the cloud, failures have become a lot more correlated, as people run on things like spot instances or preemptible instances. The other approach — math and extra computers — is the Paxos approach. This is really hard; it tends to be the most reliable approach, and we tend not to use it very much because it's also really slow. And the last one is: ignore it. You would be surprised how often that's the approach people implicitly end up choosing.
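As promised, here's a toy PySpark sketch of salting for key skew — the words, counts, and salt factor are all invented for the example. The idea is to spread a hot key across several synthetic sub-keys so the aggregation runs as many smaller tasks instead of one straggler, then strip the salt and combine the partial results:

```python
import random
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Heavily skewed toy data: almost every record shares one key, so a
# plain reduceByKey would pile all of that work onto one straggler.
pairs = sc.parallelize([("the", 1)] * 100_000 + [("timbit", 1)] * 10)

SALT = 16  # how many sub-keys to spread each hot key across

counts = (
    pairs
    # Step 1: turn each key into (key, random sub-key).
    .map(lambda kv: ((kv[0], random.randrange(SALT)), kv[1]))
    # Step 2: aggregate per sub-key -- the hot key's work is now
    # spread over up to SALT parallel tasks.
    .reduceByKey(lambda a, b: a + b)
    # Step 3: drop the salt and combine the partial sums.
    .map(lambda kv: (kv[0][0], kv[1]))
    .reduceByKey(lambda a, b: a + b)
)
print(counts.collect())
```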
We'll talk about that a little more — actually, I'll talk about it now. Ray, in its early versions, took the ignore-it approach to failure for anything involving state, and that's not great. It means that if your actor — which is how Ray represents state — was scheduled on a node that failed, it would just fail, and your application would just stop working; you were responsible for managing that and recovering from it. In the newer versions of Ray, they've added a framework for this. Ray doesn't have a strong opinion on which of these techniques you use to recover from failure — it's up to you to pick — but the framework lets you implement the recovery logic a little bit less painfully.

So why do we have to care about state? Even if we're doing stateless transformations, there is some state: how far along we've come, which records we have processed. And generally, at the end of the day — as much as people love functional programming — we want to do something with our data; we want to write it out, and that's kind of state. We can think of this as "has Timbit had a bath this month?" I think he has, but keeping track of that is state: the state of Timbit — has he had a bath?

Once we add state, things start to go to hell. Using specialized systems is often how we deal with it. In Spark, we mostly deal with it by shoving all of our state onto that one reliable node, but at the cost of being really, really slow. There are some options to handle the failure of that head node, and generally speaking, I don't see this done very successfully most of the time. Normally what people do is just restart the job on failure of the head node. Normally people use something like ZooKeeper to keep track of everything — that's the Spark high availability mode — but restarting the jobs is non-trivial, so this is the magic hand-wave. It's about as easy as convincing Professor Timbit to shake your paw without any treats — sorry, to shake his paw with your hand.

So what about bottlenecks? Spark and Dask both fall into the situation of having a central scheduler. That's really great, in that it lets us make all kinds of smart decisions, because all of the scheduling logic is happening in one place, and we can do all kinds of things like caching inside of our scheduler. The downside is that if we have thousands and thousands of nodes and we're trying to schedule that many tasks, the scheduler can get overwhelmed. The other issue is that in Spark we put all of this state on that one node, so that one node is just very, very busy — and that's not great, right? In a distributed system you really don't want one node to be busier than the rest; that's the sign that you aren't doing a good job of splitting up your work.

And then the transactions one. This one's important for speculative execution, even if we're just considering traditional, non-streaming data parallel systems. It matters because we generally have multiple workers writing output, but we have one committer who is responsible for deciding: hey, am I done processing this? Mark it as done, so the next job can know this data is finished and ready. Although it turns out that the approach we take — which came from HDFS and involves renaming files to indicate that everything is ready — has a catch: not all file systems support atomic renames, and it's really important that these operations be atomic.
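To make the atomicity point concrete, here's a toy sketch of the rename-as-commit idea on a plain local filesystem — the paths are made up, and it leans on os.replace being atomic on POSIX. A reader either sees no output file or a complete one, never a half-written view:

```python
import os

def commit_output(records, final_path):
    # Write to a temporary "attempt" file first; a reader that
    # shows up now sees no output at all rather than a partial file.
    tmp_path = final_path + ".inprogress"
    with open(tmp_path, "w") as f:
        for record in records:
            f.write(record + "\n")
    # The commit itself: os.replace is atomic on POSIX filesystems,
    # so the output appears all at once or not at all. On a store
    # without atomic renames, this is exactly where the transaction
    # quietly falls apart.
    os.replace(tmp_path, final_path)

commit_output(["line 1", "line 2"], "/tmp/part-00000.txt")
```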
Otherwise, you don't really have transactions: if the rename isn't atomic, you can get these partial views, and that's really bad. The solution is to put another system on top of the non-atomic system that then gives you an atomic view over top of it, which is the approach taken with S3A. It's kind of weird, but it's OK.

So we've talked a lot about the core building blocks of these systems, and a little bit about their differences — but what are some more of the differences? A really important one is the APIs exposed. Spark just exposes high-level APIs; it really doesn't let you schedule raw tasks, and it's very much focused on data parallel workloads only. Another one is the unit of scheduling work, and the task overhead. Essentially, we can think of this as your manager talking to you: if it takes them five minutes to tell you about a task, versus ten seconds to tell you about a task, they're probably going to be comfortable delegating different things to you. That task scheduling overhead applies to computers as well. There's also the approach to node loss — we could think of it as how we handle it when a coworker quits; similarly, in Spark and Dask and Ray, it's how we handle it when one of my computers dies. And another one that's really important, and that I think we often overlook because we're technologists, is: what is the community around these tools like?

More concretely: Ray probably has the best approach to distributed state of these three. Strangely enough, it doesn't support the standard example we're all used to — word count — because it doesn't have shuffle. Of course, there's an asterisk there: you can make word count work, but it's just really, really painful, and you normally end up running Dask on top of Ray at that point. It's built in C++, and it has Python and Java APIs. By default, Ray is less tolerant — less fault tolerant, that is — and that's OK: you can change the configuration to make Ray behave more like Dask or Spark with regard to fault tolerance. Handling state is more than just a configuration change, though, so you'll have to write some code to handle your actor recovery.

Dask is notable for having really kick-ass pandas APIs — it probably has some of the best Python integrations out there — and it also has these really wonderful low-level Python APIs. Ray also has low-level APIs, but they tend to be implemented in C++, and while that's great for performance, it's not as great for getting people to use them, because it can be a little more complicated for people to figure out what's going on.

Spark is sort of the one we're all used to — or at least the one I'm most used to. It has really only high-level APIs, and that's not a bad thing: those high-level APIs mean that Spark is able to take a much more aggressive approach to fault tolerance, and to do a lot of really cool things with optimization, but it does mean you can't schedule raw tasks in the same way. It's built in Java, and it has APIs for, I would say, probably the most languages: Python and R are built in, and then there's a whole bunch of APIs for other languages that come from the broader community, like C#. It does have a new pandas-like API, though it's not as feature-complete as Dask's, and it does have more overhead.
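To give a feel for what those pandas-style APIs look like side by side, here's a sketch — events.csv and the column names are invented for the example:

```python
# The same pandas-style aggregation in both systems.

# Dask: probably the most complete distributed pandas API today.
import dask.dataframe as dd

ddf = dd.read_csv("events.csv")
print(ddf.groupby("user")["duration"].mean().compute())

# Spark's pandas API (pyspark.pandas, Spark 3.2+): the same shape
# of code, but fewer supported pandas features and more per-task
# overhead.
import pyspark.pandas as ps

psdf = ps.read_csv("events.csv")
print(psdf.groupby("user")["duration"].mean())
```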
Ray probably has the lowest per-task overhead, Dask is in the middle, and Spark is at the high end. What that means is that for Spark to make sense, we need to be able to split our work up into moderately sized chunks; in Ray we can use much smaller chunks, and Dask is somewhere in between.

The last one, of course, is the Hadoop ecosystem. If you're working at a place that has a big data stack, Spark does a much better job of integrating with the rest of the tools — Impala and things like that in the Hadoop ecosystem, your catalog, all of those things. Ray and Dask can talk to Hive, of course, but they don't understand the Hive catalog in the same way that Spark does.

OK — and we are running a little over time, so I'm very sorry about that. But one of the things I want to talk about, because it's come up recently, is that there's a lot of conversation around benchmarks. That's because one of the vendors is — let's just say they're trying to illustrate that they're still relevant — by way of benchmarks. I think benchmarks definitely have their place; it's very, very reasonable to do benchmarking. On the other hand, I think Dask and Ray and Spark all perform pretty well at the medium-sized scale of data, and if you're at the petabyte-plus scale of data, it's really important not to just take one of the industry benchmarks like TPC-DS. You should probably make your own benchmarks related to your use case, because TPC-DS is a lovely synthetic benchmark, but it may not represent very well what it is that you're trying to do.

I think, really, for most of us the thing to do is to pick the system that's best suited to our domain and our team. If you've got a mixture of Java and Scala and Python programmers, Spark looks pretty appealing, because they can all work together. On the other hand, if you've got some amazing, kick-ass data scientists who came here to chew bubblegum and use pandas — and they're all out of bubblegum — Dask probably has the best distributed pandas API of any of your options. Of course, if you want to look at benchmarks, that's cool, but I'm not going to go into them, because pretty much you can always make a benchmark say what you want it to say, and to me it's just not worth it. So we're going to skip those slides.
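In that make-your-own-benchmark spirit, here's roughly the smallest useful microbenchmark I can think of: timing no-op tasks to get a rough feel for per-task scheduling overhead in Ray on your own hardware. The task counts are arbitrary, and this assumes a local ray.init():

```python
import time
import ray

ray.init()

@ray.remote
def noop():
    return None

# Warm-up round so we aren't timing worker startup.
ray.get([noop.remote() for _ in range(100)])

n = 1_000
start = time.perf_counter()
ray.get([noop.remote() for _ in range(n)])
elapsed = time.perf_counter() - start
print(f"roughly {elapsed / n * 1000:.2f} ms of overhead per task")
```

The same harness, pointed at equally tiny Dask or Spark tasks, is a quick way to see whether your jobs are chunked appropriately for the system you picked.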
So, on that note — I am three minutes over, and I'm very sorry about that — I am working on some new books, namely Scaling Python with Ray, Scaling Python with Dask, and Distributed Computing for Kids (that one's in Spark). If you're interested in being an early reader, or your kids are interested in being early readers, for any of those books, please DM me on Twitter — it's just my name, Holden Karau — or email me (it's just holdenkarau at gmail.com) and let me know that you're interested in seeing early drafts of this stuff. I would love to share it with you and get your feedback.

Another thing: all of these projects are open source. The communities are, of course, a little bit different — I've contributed to all of these projects — and if you're interested in getting involved with any of them, please feel free to reach out, or just try getting involved; I think they're great projects. One of the ways we can make sure our voices are heard is by contributing to the open source tools we're using. And I'll be doing more open source live streams, if anyone wants to come and watch and get an idea of what it's like to contribute to these projects in the open source space.

I'm hoping we might have enough time for a question — I know we're five minutes over, though — so feel free to shoot me an email with your questions, and I will do my best to answer them.

OK, thank you. Thank you so much, Holden — that was fantastic. First of all, what does Mr. Professor Timbit think about all this? Where is he?

So, Professor Timbit — he's got a mixed view, I would say. He does really like the Dask people the most, I think, mostly because they talk to him when we're talking on video together; the other people don't talk to Professor Timbit as much. He's very engaged in the research, of course.

OK, well, there's a question for him — just one, so we don't have time for more — but perhaps you could answer on his behalf, since we don't see him. Oh, he went back to sleep? He's a very busy guest. So, transmitting it to him: we'd say this is clearly not a case of one-size-fits-all. In order to choose the right framework — and you obviously explained a few of the differences — somebody asks: is there the option of having a data science infrastructure flexible enough to allow for a mix-and-match approach?

Yeah, I think what Professor Timbit would say here is that that's definitely an option. With Kubernetes, it's quite possible to have a mix-and-match approach, and I think it's very solid. The downside is that it's a little bit more painful to maintain from a systems point of view. If you can convince people to pick two of the three, your life will probably be a bit easier than trying to support all three of them.

OK, all right. In case of doubt, people can email you — you said the best way to contact you is a Twitter DM, or, obviously, to watch your YouTube streams. But to ask any questions about which framework to use — or which two to use, in case we can only choose two — we DM you on Twitter; and if we have any questions about your new books coming out, we email you. Well, congratulations on that green card, by the way!

Excellent, thank you.

And congratulations to Mr. Professor Timbit as well — does he go in the package with you?

Yes, he's included.

That's great — he's a lucky man. Or a girl, or a boy, or a professor — he's a lucky professor. Holden, thank you so much for your talk. I'm sure people will contact you directly. We hope to see you again — you're very much loved at the Big Things conference. So I really appreciate it, and thank you.
Thank you — and thank you for the well wishes the year that I got in the motorcycle crash; it was very, very kind.

You're very much loved, Holden, so we hope to see you and Professor Timbit next year, if not before. In the meantime, we'll DM you for whatever we may need. Lots of love, and see you very soon. Thank you so much!