All right, today we're going to talk about Spark. Spark is essentially a successor to MapReduce; you can think of it as an evolutionary step beyond MapReduce. One reason we're looking at it is that it's widely used today for data-center computations; it's turned out to be very popular and very useful. One interesting thing it does, which I'll pay attention to, is that it generalizes the two stages of MapReduce, the Map and the Reduce, into a complete notion of multi-step data-flow graphs. This is helpful for the programmer's flexibility, because it's more expressive, and it also gives the Spark system a lot more to chew on when it comes to optimization and dealing with failures. Also, from the programmer's point of view, it supports iterative applications, applications that loop over the data, much better than MapReduce does. You can cobble together a lot of that with multiple MapReduce applications running one after another, but it's all a lot more convenient in Spark. Okay, so I'm going to start right off with the example application. This is the code for PageRank, and I copied this code, with a few changes, from some sample source code in the Spark distribution. I guess it's a little hard to read; give me a second while I try to make it bigger. If it's too hard to read, there's a copy in the notes, and it's an expansion of the code in Section 3.2.2 of the paper. PageRank is a pretty famous algorithm that Google uses for calculating how important different web search results are. PageRank is actually widely used as an example of something that doesn't work all that well in MapReduce, and the reason is that PageRank involves a bunch of distinct steps and, worse, it involves iteration: there's a loop in it that has to be run many times, and MapReduce just has nothing to say about iteration. The input to this version of PageRank is a giant collection of lines, one per link in the web, and each line has two URLs: the URL of the page containing a link, and the URL of the page that link points to. The intent is that you get this file by crawling the web and collecting together all of its links, so the input is absolutely enormous. As a silly little example for when I actually run this code, I've prepared some sample input in the same format: just lines, each with two URLs, where for convenience I'm using u1 as the URL of a page and u3, for example, as the URL of a page it links to. The web graph this input file represents has only three pages in it. Reading off the links: there's a link from page one to page three, a link from one back to itself, a link from two to three, a link from two back to itself, and a link from three to one. So it's a very simple graph structure.
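To make that concrete, here's the sample input file written out from the graph I just described, one link per line, from-URL first and then to-URL:

    u1 u3
    u1 u1
    u2 u3
    u2 u2
    u3 u1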
So what is PageRank doing? It's estimating the importance of each page, and what that really means is estimating importance based on whether other important pages have links to a given page. What's really going on is that it's modeling the estimated probability that a user who randomly clicks on links will end up on each given page. The user model is that the user has an 85% chance of following a randomly selected link from the current page to wherever that link leads, and a 15% chance of simply switching to some other page even though there's no link to it, as you would if you typed a URL directly into the browser. The PageRank algorithm runs this model repeatedly: it simulates a user looking at a page and following a link, adds the from-page's importance to the target page's importance, and then runs the step again. In a system like Spark it runs this simulation for all pages in parallel, iteratively. The algorithm keeps track of the rank of every single URL, updates the ranks as it simulates random user clicks, and eventually those ranks converge on the true final values. Now, because it's iterative, although you can code this up in raw MapReduce, it's a pain: it can't be a single MapReduce program. It has to be multiple calls to a MapReduce application, where each call simulates one step of the iteration. It's a pain, and it's also kind of slow, because MapReduce only thinks about one map and one reduce, and it always reads its input from disk, from the GFS file system, and always writes its output, which here would be the updated per-page ranks, back to files in GFS. So there's a lot of file I/O if you run this as a sequence of MapReduce applications. All right, so we have this version of the PageRank code that came with Spark, and I'm actually going to run the whole thing for you, the code shown here on the input I've shown, just to see what the final output is; then we'll go step by step and see how it executes. All right, so you should see a screen share now of a terminal window, showing the input file. Here's how I run the PageRank program: I've downloaded a pre-compiled copy of Spark to my laptop, which turns out to be pretty easy; it just runs in the Java virtual machine, so downloading Spark and running simple stuff is pretty straightforward. I'll run the code I showed on the input I showed. We'll see a lot of junk error messages go by, but in the end Spark runs the program and prints the final result, and we get these three ranks for the three pages. Apparently page one has the highest rank; I'm not completely sure why, but that's what the algorithm ends up computing. Of course, we're not really that interested in the algorithm itself so much as in how Spark executes it. So, to understand what the programming model is in Spark, because it's perhaps not quite what it looks like, I'm going to hand the program line by line to the Spark interpreter. You can fire up the Spark shell and type code to it directly, so I've prepared a version of the program that I can run a line at a time.
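Before we walk through the code, here's the update each loop iteration computes, written as an equation; this is just the 85%/15% user model from above, where out(q) is the set of pages that page q links to:

    r'(p) = 0.15 + 0.85 \sum_{q \to p} r(q) / |out(q)|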
So the first line is the line where I'm asking Spark to read the input file, the file I showed with the three pages in it. One thing to notice here is that, in real deployments, when Spark reads a file it's reading from a GFS-like distributed file system; it happens to be HDFS, the Hadoop file system, which is very much like GFS. So if you have a huge file, as you would if you had a file with all the links in the web in it, HDFS will have split that file up into chunks, sharded over lots and lots of servers. What reading the file really means is that Spark will arrange to run a computation on each of many, many machines, each reading one chunk, one partition, of the input file. In fact, HDFS typically splits big files into many more partitions than there are worker machines, so every worker machine ends up being responsible for multiple partitions of the input file. This is all a lot like the way map works in MapReduce. Okay, so this is the first line in the program, and you may wonder what the variable lines actually holds. So I printed what lines points to, and it turns out that even though it looks like we've typed a line of code asking the system to read a file, in fact it hasn't read the file, and won't read the file for a while. What this code is really doing is not causing the input to be processed. Instead, it builds a lineage graph: a recipe for the computation we want, the kind of lineage graph you see in Figure 3 of the paper. The computation only actually starts to happen once we execute what the paper calls an action, a function like collect, for example, that finally tells Spark: look, I actually want the output now, please go execute the lineage graph and tell me the result. So what lines holds is a piece of the lineage graph, not a result. Now, to understand what the computation will do when we finally run it, we can ask the interpreter at this point to actually execute the lineage graph so far and tell us the results. You do that by calling an action; I'll call collect, which prints out all the results of executing the lineage graph so far. All we've asked it to do so far is read a file, so we expect the output to be just the contents of the file. And indeed, that's what we get: this one-transformation lineage graph yields the set of lines, a set of strings, each containing one line of the input. All right, that's the first line of the program. Question: so collect is essentially just-in-time compilation of the symbolic execution chain? Yeah, that's what's going on.
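Here's roughly what that first line and the premature collect look like typed into the Spark shell (sc is the SparkContext the shell provides; I'm naming a local file here, where a real deployment would name an HDFS path):

    val lines = sc.textFile("input.txt") // builds a lineage-graph node; nothing is read yet
    lines.collect()                      // an action: executes the graph, returns the lines as an Array[String]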
A huge amount of stuff happens when you call collect. It tells Spark to take the lineage graph and produce Java bytecodes that describe the various transformations, which in this case isn't very much, since we're just reading a file. When you call collect, Spark figures out where the data you want is by asking HDFS, picks a set of workers to process the different partitions of the input data, compiles each transformation in the lineage graph into Java bytecodes, and sends the bytecodes out to all the worker machines it chose. Those worker machines execute the bytecodes, which tell each worker to read its partition of the input, and finally collect goes out and fetches all the resulting data back from the workers. Again, none of this happens until you invoke an action, and we've run collect prematurely here; you wouldn't ordinarily do that, I just want to see what the output is so we understand what the transformations are doing. Okay, so in the code I'm showing, the second line is this map call. lines refers to the output of the first transformation, the set of strings corresponding to lines of the input. We've asked the system to call map on that, and what map does is run a function over each element of the input, that is, over each line. That little function is the "s arrow whatever" syntax, which describes a function that calls split on each line; split takes a string and returns an array of strings, broken at the places where there are spaces. The final part of the line, which refers to parts zero and one, says that for each line of input we want the output of this transformation to be the first string on the line paired with the second string on the line. So we're just doing a little transformation to turn these strings into something that's a bit easier to process. And again, out of curiosity, I'll call collect on links1 just to verify that we understand what it does. You can see that where lines held raw strings, links1 now holds pairs of strings, a from-URL and a to-URL, one pair per link. When this map executes, it can execute totally independently on each worker, on that worker's own partitions of the input, because it considers each line independently; there's no interaction between different lines or different function invocations. This map is a purely local operation on each input record, so it can run totally in parallel on all the workers, on all their partitions.
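Here's roughly what that second line and its collect look like in the shell:

    val links1 = lines.map { s =>
      val parts = s.split("\\s+")   // split the line at whitespace
      (parts(0), parts(1))          // (from-URL, to-URL)
    }
    links1.collect()  // e.g. Array((u1,u3), (u1,u1), (u2,u3), (u2,u2), (u3,u1))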
Okay. The next line in the program is this call to distinct. What's going on here is that we only want to count each link once: if a given page has multiple links to the same other page, we want to consider only one of them for the purposes of PageRank. So distinct looks for duplicates. Now, if you think about what it actually takes to look for duplicates in a multi-terabyte collection of data items, it's no joke, because the data items are in some random order in the input. Since distinct replaces each set of duplicated items with a single item, it needs to somehow bring together all the items that are identical, and that's going to require communication, because remember, all this data is spread out over all the workers. We have to shuffle the data around so that any two items that are identical end up on the same worker, so that that worker can notice, wait a minute, there are three of these, and replace the three with a single one. So distinct, when it finally executes, requires communication; it's a shuffle. The shuffle can be driven by hashing the items to pick the worker that will process each item and then sending each item across the network, or it could be implemented with a sort, where the system sorts all the input and then splits the sorted input over the workers. I actually don't know which Spark does, but either way it requires a lot of work. In this case, however, almost nothing happens, because there were no duplicates in our little input. So I'll run collect, and links2, the output of distinct, is basically identical to links1, the input to that transformation, except for order; the order changed because of course it had to hash or sort or something. The next transformation is groupByKey. What we're heading towards, for the computation we'll see in a moment, is collecting together all the links from a given page into one place. groupByKey is going to group all these from/to URL pairs by the from-URL; that is, it brings together all the links that come from the same page, and collapses the whole collection of links from each page down into that page's URL plus a list of the link targets starting at that page. Again, this is in principle going to require communication, although I suspect Spark is clever enough to optimize it: because distinct already put all records with the same from-URL on the same worker, groupByKey may well not have to communicate at all, because it can observe that the data is already partitioned by the from-URL key. All right, let's print links3: we run collect to actually drive the computation and see the result. Indeed, what we're looking at is an array of tuples, where the first part of each tuple is the from-page's URL, and the second is the list of links starting at that page. You can see that u2 has links to u3 and u2, u3 has a link to just u1, and u1 has links to u1 and u3. Okay, so that's links3.
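In shell form, roughly:

    val links2 = links1.distinct()    // wide: shuffles so identical records land on one worker
    val links3 = links2.groupByKey()  // (from-URL, list of to-URLs); may reuse distinct's partitioning
    links3.collect()  // e.g. Array((u2,Seq(u3, u2)), (u3,Seq(u1)), (u1,Seq(u1, u3)))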
Now, the iteration is going to start a couple of lines from here, and each iteration of the loop is going to use this links3 information over and over again in order to propagate probabilities, to simulate users clicking from every page to the pages it links to. Since this links data will be used again and again, we want to save it. It turns out that each time I've called collect so far, Spark has re-executed the computation from scratch: every call to collect has involved Spark re-reading the input file, re-running that first map, and re-running the distinct. Calling collect again would re-run this groupByKey too. We don't want to do all that over again on multiple terabytes of links for each loop iteration; we've computed it once, the list of links is going to stay the same, and we just want to save it and reuse it. In order to tell Spark we want to reuse this data, the programmer is required to explicitly do what the paper calls persisting the data. In modern Spark, the function you call if you want to save it in memory is actually called cache. So links4 is identical to links3, except with the annotation that we'd like Spark to keep links4 in memory, because we're going to use it over and over. Okay, the last thing we need to do before the loop starts is to set up a rank for every page, indexed by source URL, and initialize every page's rank; they're not really ranks here so much as probabilities, and we initialize them all to 1.0. We're going to execute code that looks like it's changing ranks, but in fact, when we execute the loop in this code, it really produces a new version of ranks for each loop iteration, updated to reflect the fact that the algorithm has pushed rank from each page to the pages it links to. Let's print ranks to see what's inside: it's just a mapping from source URL to the current rank value for every page. Okay, before I start executing the loop, there's a question: does Spark allow the user to request more fine-grained scheduling primitives than cache, that is, to control where data is stored or how the computations are performed? Well, yes. cache is a special case of a more general persist call, which can tell Spark to save the data in memory, or to save it in HDFS so that it's replicated and will survive a crash; so you get some flexibility there. And in general, we didn't have to say anything about partitioning in this code, so Spark just chooses something: at first the partitioning is driven by the partitioning of the original input files, and when we run transformations that have to shuffle, as distinct and groupByKey do, Spark will internally pick some scheme, like hashing the keys over the available workers. But you can tell it, look, for this particular data, use a different hash function, or partition by ranges instead of hashing; you can control the partitioning if you like.
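So, recapping where we are just before the loop, roughly:

    val links4 = links3.cache()            // keep this RDD in memory; it's reused every iteration
    var ranks = links4.mapValues(v => 1.0) // initial rank (really a probability) of 1.0 per page

ranks is a var because each loop iteration binds it to a new RDD rather than modifying the old one.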
Okay, so now I'm going to start executing the loop. The first thing the loop does, and I hope you can see the code on line 12, is run this join; it's the first statement of the first iteration. What we're doing is joining the links with the ranks. That pulls together the corresponding entries: the links say, for every URL, what pages it links to, and the ranks say, for every URL, what its current rank is. So now we have together, in a single item for every page, both its current rank and the links it points to, because we're going to push every page's current rank out to the pages it points to. This join is what the paper calls a wide transformation, because it's not local: it may need to shuffle the data by the URL key in order to bring corresponding elements of links and ranks together. Now, in fact, I believe Spark is clever enough to notice that links and ranks are already partitioned by key in the same way: when we created ranks, it presumably created ranks using the same hash scheme it used when it created links. If it was that clever, then it will notice that links and ranks are partitioned the same way, that is, the corresponding partitions with the same keys are already on the same workers, and hopefully Spark will notice that and not have to move any data around. If something goes wrong, though, and links and ranks are partitioned in different ways, data will have to move at this point in order to join up corresponding keys in the two RDDs. All right, so jj now contains both every page's rank and every page's list of links. You can see we have an even more complex data structure: an array with an element per page, containing the page's URL, a list of its links, and the 1.0 there is the page's current rank. All this information for each page is together in a single record, where we need it. All right, the next step is that every page is going to push a fraction of its current rank to each of the pages it links to; it divides its current rank up among all the pages it links to. That's what this contribs line does. Basically it's another call to map: for each page, we map over the URLs that page points to, and for each one we calculate this number, which is the from-page's current rank divided by the total number of pages it points to. This creates a mapping from link target to one of the many contributions to that target page's new rank. We can sneak a peek at it, and it's a much simpler thing: just a list of URLs and contributions to those URLs' ranks. There's more than one record per URL here, because for any given page there's going to be a record for every single link that points to it, indicating the contribution from wherever that link came from to this page's new, updated rank.
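Roughly what those two steps look like:

    val jj = links4.join(ranks)  // (url, (list of link targets, current rank)), matched up by key
    val contribs = jj.values.flatMap { case (urls, rank) =>
      val size = urls.size
      urls.map(url => (url, rank / size))  // each link target gets an equal share of rank
    }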
Now we need to sum up, for every page, the rank contributions to that page that are sitting in contribs. So again we're going to need a shuffle here; it's a wide transformation, because we need to bring together all the elements of contribs for each page onto the same worker, into the same partition, so they can be summed. The way PageRank does that is with this reduceByKey call. What reduceByKey does is, first of all, bring together all the records with the same key, and then sum up the second element of each of those records for a given key. It produces as output the key, which is a URL, and the sum of the numbers, which is the updated rank. There are actually two transformations here: the first is reduceByKey, and the second is this mapValues, which is the part that implements the 15% probability of going to a random page and the 85% chance of following a link. Let's look at ranks. By the way, even though we've assigned to ranks here, what this ends up doing is creating an entirely new transformation; when it comes time to execute, it won't change any values already computed, it just creates a new transformation with new output. And we can see what happens: remember, ranks originally was just a bunch of URL/rank pairs, all 1.0. Now again we have pairs of URL and rank, but different: we've updated them by one step. I don't know if you remember the final rank values we saw at the beginning, but these are closer to that final output than the original values of all ones. Okay, so that was one iteration of the algorithm. When the loop goes back up to the top, it does the same join, flatMap, and reduceByKey each time. Again, what the loop is actually doing is producing the lineage graph: it's not updating the variables mentioned in the loop so much as appending new transformation nodes to the lineage graph it's building. I've only run the loop once. After the loop, the real code actually calls collect, and in the real PageRank implementation, only at that point does the computation even start: because of the call to collect, Spark goes off and reads the input, runs it through all these transformations, does the shuffles for the wide dependencies, and finally collects the output together on the computer that's running this program. By the way, the computer that runs the program is what the paper calls the driver; the driver is the machine that actually runs the Scala program that's driving the Spark computation. Then the program takes the collect output and runs each of the records through a nicely formatted print. Okay, so that's the style of programming people use for Spark. One thing to note, relative to MapReduce, is that while this program looks a bit complex, it's doing an amount of work that would require many separate MapReduce programs to implement. It's 21 lines, and maybe you're used to MapReduce programs that are simpler than that, but this is doing a lot of work for 21 lines, and it's a real algorithm, too. So it's a pretty concise and easy-to-program way to express vast big-data computations, and people like it; it's been pretty successful. Okay.
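Putting the whole thing together, the shape of the program we just walked through is roughly:

    val iters = 10   // however many iterations you want; 10 is just an example
    for (i <- 1 to iters) {
      val contribs = links4.join(ranks).values.flatMap { case (urls, rank) =>
        val size = urls.size
        urls.map(url => (url, rank / size))
      }
      ranks = contribs.reduceByKey(_ + _)         // wide: sum each page's incoming contributions
                      .mapValues(0.15 + 0.85 * _) // the 15% random-jump / 85% follow-link mix
    }
    val output = ranks.collect() // the action that finally drives the whole computation
    output.foreach { case (url, rank) => println(s"$url has rank: $rank") }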
So, again, I just want to repeat that until the final collect, what this code is doing is generating a lineage graph, not processing data. The lineage graph it produces, and I just copied this figure from the paper, looks like this; this graph is all that the program is producing until the final collect. You can see that it's a sequence of processing stages: we read the file to produce links, then completely separately we produce the initial ranks, and then there are repeated join-and-reduceByKey pairs; each of those pairs is one loop iteration. You can see, again, that the loop appended more and more nodes to the graph. What it is not doing, in particular, is producing a cyclic graph: the loop notwithstanding, all of these graphs are acyclic. Another thing to notice, which you wouldn't have seen in MapReduce, is that the links data, the data we cached, we persisted, is used over and over again in every iteration; Spark is going to keep it in memory and consult it multiple times. So what actually happens during execution? What does the execution look like? Again, the assumption is that the input data starts out pre-partitioned over HDFS: our one input file is already split up into lots of 64-megabyte, or whatever it happens to be, pieces in HDFS. When you actually call collect to start the computation, Spark knows the input data is already partitioned in HDFS, and it tries to split up the work over the workers in a corresponding way. I actually don't know the details, but it might try to run the computation on the same machines that store the HDFS data, or it may just set up a bunch of workers to read the various HDFS partitions; and again, there are likely to be more partitions than workers, more than one partition per worker. So the very first thing is that each worker reads its part of the input file; this is the file read. If you remember, the next step is a map, where each worker is supposed to run that little function that splits each line of input into a from/to link pair. This is a purely local operation, so it can go on in the same worker. So imagine that we read the data, and then in the very same worker Spark does that initial map. Even though I'm drawing an arrow here, it's really an arrow from each worker to itself; there's no network communication involved, we just run the read, and the output can be fed directly to that little map function. And, in fact, Spark almost certainly streams the data record by record through these transformations: instead of reading the entire input partition and then running the map on the entire partition, Spark reads the first record, or maybe the first couple of records, and runs the map on each record; in fact, it runs each record through as many transformations as it can before going on and reading the next little bit from the file. That's so it doesn't have to store the whole thing: these files could be very large, and it doesn't want to have to hold an entire input partition in memory; it's much more efficient to process it record by record.
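You can get a feel for that record-by-record streaming with plain Scala iterators, which are lazy in the same way; this is just an analogy, not Spark's actual implementation:

    // Each line flows through both maps before the next line is read from the file.
    val pairs = scala.io.Source.fromFile("input.txt").getLines() // lazy iterator of lines
      .map(_.split("\\s+"))                 // still lazy: nothing has been read yet
      .map(parts => (parts(0), parts(1)))
    pairs.foreach(println)                  // only now does reading and mapping happen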
Okay, so there's a question: is the first node in each chain the worker holding the HDFS chunk, and are the remaining nodes in the chain the rest of the lineage? Yeah, I'm afraid I've been a little confusing here. The way to think of this is that, so far, all of this is happening on individual workers. This is worker one; maybe this is another worker. Each worker is proceeding independently, and I'm imagining they're all running on the same machines that store the different partitions of the HDFS file, though there could be network communication here, from HDFS to the responsible worker. After that, it's all very fast, local operations. All right, and so that's what happens for what the paper calls the narrow dependencies, that is, transformations that consider each record of data independently, without ever having to worry about its relationship to other records. And by the way, this is already potentially more efficient than MapReduce, because if we have what amount to multiple map phases here, they're just strung together in memory; whereas in MapReduce, if you're not super clever, if you run multiple MapReduce applications, even degenerate map-only ones, each stage would read its input from GFS, compute, and write its output back to GFS, and then the next stage would read, compute, write. Here we've eliminated that reading and writing; it's not a very deep advantage, but it helps enormously with efficiency. Okay, however, not all the transformations are narrow; not all of them just read their input record by record with every record independent of the others. What I'm worried about is the distinct call, which needed to see all records that have a particular key; similarly, groupByKey needs to see all records with a given key. Join, too: it takes two inputs and has to move things around to join together all the records from both inputs that have the same key. So there are a bunch of these non-local transformations, which the paper calls wide transformations, because they potentially have to look at all partitions of the input. That's a lot like reduce in MapReduce. So, for example, suppose we're talking about the distinct stage. distinct is going to run on multiple workers too, and since distinct works on each key independently, we can partition the computation by key; but the data currently isn't partitioned by key at all, it's just partitioned by however HDFS happened to store it. So we're going to run distinct on all the workers, partitioned by key, but any one worker needs to see all of the input records with a given key, and those may be spread out over all of the workers of the preceding transformation. Now, in fact, the workers are typically the same physical machines: it's the same set of machines running the map as running the distinct; but the data needs to be moved between the two transformations to bring all the keys together. So what Spark is actually going to do is take the output of this map, hash each record by its key, and use that, modulo the number of workers, to select which worker to send each record to.
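That routing rule, in essence, is just a non-negative hash mod; this sketch is the idea rather than Spark's code, though Spark's built-in HashPartitioner does the same kind of thing:

    def workerFor(key: Any, nWorkers: Int): Int = {
      val mod = key.hashCode % nWorkers
      if (mod < 0) mod + nWorkers else mod  // hashCode can be negative in Java/Scala
    }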
And in fact, the implementation is a lot like your implementation of MapReduce. The very last thing that happens in the last of the narrow stages is that the output gets chopped up into buckets corresponding to the different workers of the next transformation, and left waiting for them to fetch. So the scoop is that each of the workers runs as many stages as it can, all the narrow stages, through to completion, and stores the output split up into buckets. When all of those are finished, then we can start running the distinct transformation, whose first step is to go fetch, from every other worker, the relevant bucket of the output of the last narrow stage. Then we can run the distinct, because all the records with a given key are now on the same worker, and the distinct workers can start producing output themselves. All right, now, of course, these wide transformations are quite expensive. The narrow transformations are super efficient, because we're just taking each record and running a bunch of functions on it, totally locally. The wide transformations require pushing a lot of data; in fact, essentially all of the data, for PageRank, because at this stage it's still all the links in the web, terabytes of input data. So now we're pushing terabytes and terabytes of data over the network to implement the shuffle from the output of the map functions to the input of the next transformation. These wide transformations are pretty heavyweight: a lot of communication, and they're also a kind of computation barrier, because we have to wait for all of the narrow processing to finish before we can go on to the wide transformation. All right. That said, there are some optimizations that are possible, because Spark creates the entire lineage graph before it starts any of the data processing, so it can inspect the graph and look for opportunities. Certainly, if there's a sequence of narrow stages, running them all on the same machine, as basically sequential function calls on each input record, is an optimization you can only make if you can see the entire lineage graph all at once. Another optimization is that Spark notices when data that has already been partitioned by a wide shuffle is already partitioned in the way the next wide transformation needs it. In our original program, we have two wide transformations in a row: distinct requires a shuffle, but groupByKey also brings together all the records with a given key, replacing them with the list of links starting at that URL. These are both wide operators; they both group by key. So we had to do a shuffle for the distinct, but Spark can cleverly recognize, aha, the data is already shuffled in a way that's appropriate for the groupByKey, so we don't have to do another shuffle. Even though groupByKey is in principle a wide transformation, I suspect Spark implements it here without communication, because the data is already partitioned by key. So the groupByKey can maybe be done, in this particular case, without shuffling data, without that expense.
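If you want to set up that kind of co-partitioning deliberately, the pair-RDD API lets you; partitionBy and HashPartitioner are real Spark APIs, though the partition count of 100 here is just an example:

    import org.apache.spark.HashPartitioner
    // Shuffle once, and remember the partitioner, so that later keyed operations
    // (groupByKey, or a join against a co-partitioned RDD) can skip their own shuffle.
    val byKey = links1.partitionBy(new HashPartitioner(100)).persist()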
And again, Spark can only do these optimizations because it produces the entire lineage graph first, and only then runs the computation, so it gets a chance to examine, optimize, and maybe transform the graph. All right. So, the next topic; actually, any questions about the lineage graphs or how things are executed? Feel free to interrupt. The next thing I want to talk about is fault tolerance. For these kinds of computations, the fault tolerance we're looking for is not the absolute fault tolerance you'd want in a database, where you really cannot ever afford to lose anything. Here the fault tolerance we're looking for is more like: well, it's expensive if we have to repeat the computation; we can totally repeat it if we have to, but it would take a couple of hours, which is irritating but not the end of the world. So we're looking to tolerate common failures, but we certainly don't have to have bulletproof tolerance of every possible failure. For example, Spark doesn't replicate the driver machine. If the driver, which controls the computation and knows about the lineage graph, crashes, I think you have to rerun the whole thing. But any one machine only crashes maybe every few months, so that's no big deal. One thing to note is that HDFS is a separate thing: Spark just assumes the input is replicated in a fault-tolerant way in HDFS, and indeed, just like GFS, HDFS keeps multiple copies of the data on multiple servers, so if one of them crashes it can soldier on with another copy. So the input data is assumed to be relatively fault tolerant. What that means, at the highest level, is that Spark's strategy, if one of the workers fails, is just to recompute whatever that worker was responsible for: to repeat, on some other machine, the computations that were lost with the worker. That's basically what's going on. And it might take a while if you have a long lineage, as you actually would with PageRank, since PageRank with many iterations produces a very long lineage graph. One thing that makes it not so bad is that each worker is actually responsible for multiple partitions of the input, so Spark can give each remaining worker just one of the failed worker's partitions and thereby parallelize the recomputation, running each of the lost partitions on a different worker in parallel. So if a worker fails, Spark just goes back to the beginning, to the original input, and recomputes everything that was running on that machine. For narrow dependencies, that's pretty much the end of the story. However, there actually is a problem with wide dependencies that makes that story not as attractive as you might hope. So the topic here is failure, one failed node, one failed worker, in a lineage graph that has wide dependencies. A reasonable sample graph you might have is one that starts with some narrow dependencies, but then after a while has a wide dependency.
So you've got transformations that depend on all the preceding transformations, and then some more narrow ones. All right, and the game is that a single worker has failed, maybe before we've gotten to the final action and produced the output, and we need to reconstruct, recompute, what was on the failed worker. The damaging thing here is that, ordinarily, as Spark executes along, it executes each of the transformations and gives the output to the next transformation, but it doesn't hold on to that output unless you happen to tell it to, like the links data, which is persisted with that cache call. In general the data is not held on to, because with a PageRank lineage graph maybe dozens or hundreds of steps long, holding on to all that data would be way, way too much to fit in memory. So as Spark moves through these transformations, it discards the data associated with earlier transformations. That means that when we get here, if this worker fails, we need to restart its computation on a different worker. We can re-read the input and redo the original narrow transformations, since they depend only on the input; but when we get to this wide transformation, we have the problem that it requires input not just from the same partition on the same worker, but from every other partition. And the other workers, though still alive, have in this example proceeded past this transformation and therefore discarded its output a while ago, so the input our recomputation needs from all the other partitions doesn't exist anymore. If we're not careful, that means that in order to rebuild the computation on this failed worker, we may in fact have to re-execute this part of every other worker as well, as well as the entire lineage graph on the failed worker. And this could be very damaging: if I've been running this giant Spark job for a day, and then one of a thousand machines fails, and we do nothing more clever than this, we have to go back to the very beginning on every one of the workers and recompute the whole thing from scratch. It's going to be the same amount of work; it's going to take another day to recompute a day's computation. That would be unacceptable. We would really like it to be the case that if one worker out of a thousand crashes, we have to do relatively little work to recover. So Spark allows you to make periodic checkpoints of specific transformations. In this graph, what we would do in the Scala program is call, I think it's the persist call, with a special argument that says: after you compute the output of this transformation, please save the output to HDFS. Then, if something fails, Spark will know that the output of the preceding transformation was saved in HDFS, so we just have to read it from HDFS instead of recomputing all partitions back to the beginning of time. And because HDFS is a separate storage system, itself replicated and fault tolerant, it will still be available even if a worker fails. So, for our example, PageRank, I think what would be traditional would be to tell Spark to checkpoint ranks. And you can tell it to checkpoint only periodically: it takes a fair amount of time to save the entire ranks RDD to HDFS, since again we're talking about terabytes of data in total, so if you're going to run this thing for 100 iterations, maybe we tell Spark to checkpoint ranks to HDFS only every tenth iteration or something, to limit the expense; though it's a trade-off between the expense of repeatedly saving stuff to disk and how much it would cost, if a worker failed, to go back and redo things.
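In modern Spark, the explicit way to do this is the checkpoint call, together with setCheckpointDir; here's a sketch of that every-tenth-iteration policy, with a hypothetical HDFS path:

    sc.setCheckpointDir("hdfs://nn/sparkckpt")  // hypothetical directory for checkpoint files
    for (i <- 1 to iters) {
      val contribs = links4.join(ranks).values.flatMap { case (urls, rank) =>
        val size = urls.size
        urls.map(url => (url, rank / size))
      }
      ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
      if (i % 10 == 0) ranks.checkpoint()  // save every tenth ranks RDD to HDFS
    }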
All right, so there's a question: we called cache; does that act as a checkpoint? Okay, so this is a very good question, which I don't know the answer to. The observation is that we could call cache here, and we do call cache, and the usual use of cache is just to save data in memory with the intent to reuse it; we called it because we're reusing links4. But in my example it would also have the effect of making the output of this stage available in memory, although not in HDFS, in the memory of these workers; and the paper never talks about this possibility. I'm not really sure what's going on. Maybe that would work; or maybe the fact that cache requests are merely advisory, and may be evicted if the workers run out of space, means that calling cache isn't a reliable directive to make sure the data is available. It'll probably be available on most nodes, but not all nodes, and remember, if even a single node loses its data, we're going to have to do a bunch of recomputation. So I'm guessing that persist with replication is a firm directive that guarantees the data will be available even if there's a failure, but I don't really know. That's a good question. All right. Okay, so that's the programming model, the execution model, and the failure strategy. By the way, just to beat on the failure strategy a little more: the way these systems do failure recovery is not a minor thing. As people build bigger and bigger clusters, with thousands and thousands of machines, the probability that a job will be interrupted by at least one worker failure really does start to approach one. And so recent designs intended to run on big clusters have, to a great extent, been dominated by the failure-recovery strategy. That's, for example, a lot of the motivation for why Spark insists that the transformations be deterministic, and why its RDDs are immutable: that's what allows it to recover from a failure by simply recomputing one partition, instead of having to restart the entire computation from scratch. In the past there have been plenty of proposed cluster big-data execution models in which there really was mutable data and in which computations could be nondeterministic; if you look at distributed shared memory systems, those all support mutable data and nondeterministic execution, and because of that they tend not to have a good failure strategy. Thirty years ago, when a big cluster was four computers, none of this mattered, because the failure probability was very low, and so many different kinds of computation models seemed reasonable. But as clusters have grown to hundreds and thousands of workers,
really the only models that have survived are ones for which you can devise a very efficient failure-recovery strategy, one that doesn't require backing all the way up to the beginning and restarting. The paper talks about this a little when it criticizes distributed shared memory, and it's a very valid criticism; but it's a big design constraint. Okay, so Spark is not perfect for all kinds of processing. It's really geared up for batch processing of giant amounts of data, bulk data processing: if you have terabytes of data and you want to chew away on it for a couple of hours, great. If you're running a bank and you need to process bank transfers or people's balance queries, Spark is just not relevant to that kind of processing. Nor is it relevant to typical websites: if I go to Amazon and want to order some paper towels and put them in my shopping cart, Spark is not going to help you maintain the shopping cart. Spark may be useful for analyzing your customers' buying habits offline, but not for that sort of online processing. The other situation, a little closer to home, where the Spark of the paper is not so great is stream processing. Spark definitely assumes that all the input is already available, but in many situations what people really have is a stream of input: for example, they're logging all the user clicks on their website, and they want to analyze the clicks to understand user behavior. That's not a fixed amount of data; it's really a stream of input data. Spark, as described in the paper, doesn't really have anything to say about processing streams of data. This turned out to be quite close to home for people who like to use Spark, and now there's a variant called Spark Streaming that is more geared up to processing data as it arrives: it breaks the stream up into smaller batches and runs Spark on one batch at a time. So Spark is good for a lot of batch stuff, but that's certainly not everything. So, to wrap up: you should view Spark as an evolution after MapReduce that fixes some expressivity and performance problems MapReduce has. A lot of what Spark is doing is making the data-flow graph explicit: you think of computations in the style of Figure 3, of entire lineage graphs, stages of computation and the data moving between those stages. It does optimizations on this graph, and failure recovery is very much framed in terms of the lineage graph as well. So it's really part of a larger move in big-data processing toward thinking explicitly about data-flow graphs as a way to describe computations. A lot of the specific wins in Spark have to do with performance; these are straightforward but nevertheless important. Some of the performance comes from leaving data in memory between transformations, rather than writing it to GFS and then reading it back at the beginning of the next transformation, which is essentially what you have to do with MapReduce. The other is the ability to define these datasets, these RDDs, and tell Spark to leave an RDD in memory because you're going to reuse it in subsequent stages, where it's cheaper to reuse it than to recompute it. That's the kind of thing that's easy in Spark and hard to get at in MapReduce. And the result is a system that's extremely successful, extremely widely used, and viewed as a real success.
Okay, that's all I have to say, and I'm happy to take questions if anyone has them.