Welcome to another edition of RCE. Again, this is Brock Palen. You can find us online at rce-cast.com, and we're coming up very, very close on episode 100. Once again I have Jeff Squyres from Cisco Systems, one of the authors of Open MPI. Jeff, thanks a lot for your time.

Hey, Brock. Yeah, today's show is actually fairly timely, because the HPC community was just rocked into a little turmoil by a certain blog post from Mr. Jonathan Dursi about how HPC is dying and MPI is killing it. One of the big things he cited in there as a comparison point was that the big data community seems to be growing much faster than the HPC community. And it's completely coincidental that we're interviewing somebody from that community today.

Yeah, so one of the things he referred to was Spark, and Spark is something I've been screwing around with and having a lot of fun with. One thing Jonathan pointed out was that he was able to write a diffusion example using Spark in much less space than it takes with MPI, and he had job-posting comparisons and that sort of thing. That was really interesting, and I thought a lot of it was because people just don't know these things exist, which is part of the reason you and I do this show. I wrote about that at Failure as a Service, which is my blog. Basically, I think people need to learn about these things. So I gave it away: we're talking about Spark today. Let's go right in. Matei, take a moment to introduce yourself.

Hi, yeah, thanks for having me on the show. I'm Matei Zaharia. I'm an assistant professor at MIT working in distributed systems. I started the Spark project while I was a grad student before that, and I've also spent the past two years commercializing it through this company called Databricks.

So I wonder if you could give us the quick rundown: what is Spark? Give us the quick answer, and then we'll dive into more specifics.

Yeah, sure. Spark is a parallel computing framework for data processing on clusters. It lets you use some concise APIs in Python, Java, or Scala to process data at scale.

So people are probably most familiar with the MapReduce paper, which is actually quite old now, and the big data analytics community really got going when MapReduce was brought to everyone's attention. How is this different from MapReduce?

Yeah, great question. So Spark is actually inspired by MapReduce, and it tries to generalize the MapReduce programming model so that you can capture more types of algorithms and computations, but still keep the nice properties of MapReduce, such as fault tolerance and automatic distribution and placement of computation near the data. Our research group was actually doing research on MapReduce, and as we saw people trying to use it for new applications, we designed this new model.
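As a rough illustration of those concise APIs, here is a minimal word-count sketch in Scala. The application name and input path are placeholders, and the cluster master is assumed to be supplied by spark-submit.

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        // The cluster master is left to spark-submit; the app name is arbitrary.
        val conf = new SparkConf().setAppName("WordCount")
        val sc = new SparkContext(conf)

        val counts = sc.textFile("input.txt")       // distributed collection of lines (path is a placeholder)
          .flatMap(line => line.split("\\s+"))      // split each line into words
          .map(word => (word, 1))                   // pair each word with a count of one
          .reduceByKey(_ + _)                       // sum the counts per word

        counts.take(10).foreach(println)            // print a small sample of the results
        sc.stop()
      }
    }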
So Hadoop is known as kind of the front runner in terms of MapReduce frameworks that are out there right now, but Hadoop offers a lot of things. Does Spark replace everything about Hadoop?

Yeah, good question. So Spark does not try to replace everything in Hadoop. The part it replaces, or the part you can use it for, is the distributed processing part, which is what MapReduce would do. But the Hadoop stack also contains a lot of other projects. It contains storage systems such as the Hadoop file system, and Spark can use those, but it's agnostic to them; it doesn't try to provide its own file system or anything like that. Likewise, all the key-value stores and things like that in the Hadoop stack don't get replaced. So Spark is used most often in Hadoop environments, alongside the other components of the Hadoop stack.

So what, then, is the value of Spark? If I'm using it inside Hadoop — and at least to me, a novice in the big data world — why wouldn't I just use pure Hadoop and all the things that come with Hadoop that let me do MapReduce and the like?

Yeah, good question. The main part is that the Spark execution engine lets you do more types of computation, often faster or easier to program than you would using MapReduce. There are two pieces to that. One is that the engine is designed to be highly general, so it can do not just batch computations like MapReduce, but also streaming computations or interactive queries, where you're sitting at a console typing in queries and getting answers — so it covers this wide range of things. And then the APIs for it are quite a bit easier to program with than the MapReduce API. They're high-level, functional-programming-based APIs in Python, Java, and Scala, and many of the things you'd have to write custom MapReduce code for in Hadoop already have a library function in Spark, so it's just faster to put programs together.

So does Spark itself utilize any of the MapReduce algorithms, or is it pretty much taking what you saw a lot of people doing with MapReduce and working from that?

It utilizes some of the ideas that were in the MapReduce architecture or design — for example, sending computations to the nodes that contain the data, and tracking enough about the computation to figure out how to recover from failure. But it doesn't use the code in Hadoop MapReduce. The only part of Hadoop it interacts with is the storage systems, and it can use the same interfaces that everything else in Hadoop uses to read from any Hadoop-supported storage system.
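To make the point about Spark sitting alongside Hadoop storage concrete, here is a small sketch that reads a file through the HDFS interface and counts matching lines; the HDFS URI and the search string are made up, and an existing SparkContext sc is assumed.

    // Spark reads through Hadoop's storage interfaces; it does not replace HDFS itself.
    val logs = sc.textFile("hdfs://namenode:8020/logs/*.log")   // placeholder path

    val errors = logs.filter(line => line.contains("ERROR"))
    println(s"error lines: ${errors.count()}")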
So now, when you say that Spark is faster, how does it get that speedup?

Yeah, good question. There are two main things that contribute. One of them is that Spark lets you control and use distributed memory in your program. If you have a bunch of machines, you can choose to load datasets into memory across them and keep them there across queries, and if you have any kind of iterative algorithm, or any kind of interactive workload where you're asking multiple queries on the same dataset, that part is quite a bit faster. In traditional MapReduce you don't have any control over memory; you can't have, say, a variable that you load into memory and then share across MapReduce steps. So that's the first thing. The second thing is that the engine also supports more general computation graphs. If you look at MapReduce, it's just these two phases, map and reduce, and if you have an algorithm that needs more phases of communication, you have to run it as separate MapReduce jobs. There's quite a bit of overhead to starting each job and feeding data from one of them into the next, because it has to go through a file. With Spark you can have a more general graph with multiple stages, and this also helps improve performance even if you're not using the in-memory features at all.

So that's an interesting point you raise right there. One of the big performance criticisms of typical MapReduce slash Hadoop and all of its clones is that it has to read and write files all the time, and there are a lot of reasons why it was architected that way. But am I reading between the lines of your answer that you don't do that, or don't have to do that, in Spark?

Yeah, exactly. We can avoid that in a lot of cases. In particular, in MapReduce the part that's really slow, really time-consuming, is that the output of each MapReduce job has to go into a file in a distributed file system, so it actually has to be replicated across nodes for fault tolerance. It's very expensive to do that write, and then immediately afterwards you just read it back into another Hadoop job. In Spark, if you merge those jobs together into one execution graph, you can avoid all of that.

So Spark tries to do a lot of this in memory, but do I have to do anything special if, say, the dataset I'm working with is larger than the amount of memory? Do I have to manually manage this in-memory, out-of-memory split?

The way it's designed, you don't have to do that. It will gracefully spill out to disk if it can't fit things into memory. Of course, if you want, you can try to manage it — you can tell it that you loaded a dataset and now want to remove it, and so on — but even if the datasets don't fit, the idea is that you shouldn't have to change your program based on the size of the cluster you're running on and how many resources there are. This is also helpful if machines fail in the middle of your program, or if your job needs to be resized down because you want to launch another job, things like that.
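A minimal sketch of the in-memory reuse just described: load a dataset once, keep it around, and run several operations against it. The path is a placeholder, and MEMORY_AND_DISK is one way to get the spill-to-disk behaviour mentioned above.

    import org.apache.spark.storage.StorageLevel

    // Assumes an existing SparkContext sc.
    val events = sc.textFile("hdfs://namenode:8020/data/events")
      .persist(StorageLevel.MEMORY_AND_DISK)   // keep in memory, spill to disk if it does not fit

    // Both of these reuse the loaded data instead of re-reading the input files.
    val total  = events.count()
    val logins = events.filter(_.contains("login")).count()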
Okay, that's an interesting point right there, too. So what level of abstraction do you provide? You mentioned at the very beginning that you have library calls for common operations, above just the raw underlying message passing and loading of data and things like that, and presumably those are targeted at specific types of tasks and use cases. Is the fault tolerance baked into that, so you just say, hey, solve for X, and halfway through half of the nodes die — what level of fault tolerance do you have? Will it restart the computation, or how does that go?

Yeah, good question. Let me explain that in a bit of detail. The way the programming model works is that you work with these distributed collections, these distributed datasets. In some ways you can compare it to some of the array programming languages for HPC, where you had distributed arrays. You build a dataset; it's spread out across multiple machines, and it contains objects of any type — for example, Java objects if you're doing things in Java. Then you have operations on top of these datasets: you can do a map that gives you back a new dataset, or a group-by, or a filter, or something of that nature. And Spark makes sure that if you use these operations, it can always recover the result of any sequence of them. Each dataset can always be recomputed, because Spark tracks the operations you used to build it. And if you lose only a fraction of a dataset — say you started with something, did a map, and then lost one of the results — it can run the computation on only the part that was missing and rebuild just that part, without rolling everything back to a checkpoint or redoing the whole computation. So it tries to provide fine-grained fault recovery, where you only recover the pieces that are missing. And all the abstractions are set up so that basically everything has this kind of fault tolerance out of the box. You'd have to go and do something special if you didn't want it, but all the operations provide it already.
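A sketch of that programming model: each transformation returns a new dataset, and Spark records how it was built so that lost pieces can be recomputed. The data here is invented; toDebugString simply prints the recorded chain of operations.

    // Assumes an existing SparkContext sc.
    val numbers = sc.parallelize(1 to 1000000)               // a small example dataset
    val squares = numbers.map(n => (n % 10, n.toLong * n))   // key-value pairs
    val sums    = squares.reduceByKey(_ + _)                 // group and sum by key

    // The graph of operations Spark tracks for fault recovery can be inspected:
    println(sums.toDebugString)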
You kind of touched on it there, but these are the RDDs. Can you tell us a little more — what RDD stands for, because everything is based around that — and also about some of their characteristics, the constraints you put in place that give you the parallel performance?

Right. These distributed datasets you work with are called RDDs, or resilient distributed datasets, and there are a few interesting things about them. They can contain arbitrary types of objects, as I said — Java objects, for example, if you're programming in Java — and each RDD is spread out across the nodes of your cluster, so it can be partitioned in any way. RDDs are immutable: once you create one, you can't modify it; you just create new ones from it. This is what allows us to track exactly how each one was built. Basically, you start by creating an RDD, for instance by referring to a file in your file system — you can say, okay, view that as a collection of strings — and then you do transformations on it, such as a map. Spark tracks the graph of operations that you used to build a specific RDD, which is called its lineage, and it can use that to recompute any partition of the RDD at any time. So you work by defining and chaining together these transformations to create new datasets, and once you've built them, Spark figures out a way to execute them on the cluster. It's very similar to working with collections in a functional language. It actually came from looking at the Scala programming language, at its collection library, and it's also similar to the various functional APIs in Python, like itertools and so on.

Okay, so you said they're immutable — you're basically just transforming them and making new RDDs, and you can build them up. Is there any type of optimizer that will look at all your transformations before it actually starts moving data, and try to limit data movement?

Yeah, exactly. There are several things done here, and this is actually one of the areas where we're trying to extend the API to enable even more optimizations. A couple of things are done today. One is data partitioning: you can ask Spark to partition a dataset in a given way. For example, if you have key-value pairs — say the key is a URL for a web page and the value is the content of the page — you can partition them by domain name, so pages from the same domain end up on the same machine and you can do things with them faster that way. Then many of Spark's operations are aware of the partitioning and try to take advantage of it. If you have a dataset that's already grouped, for instance by domain name, and you want to do a reduce operation on it, Spark knows those pages are already on the same machine, so it can avoid some communication. Or if you have two datasets that you want to join — which means bringing together elements with the same key from the two datasets — it also looks at how they're partitioned and tries to pick the plan that minimizes communication. The other helpful optimization is pipelining. For example, if you do a map function and then another map function on the result, Spark fuses them into a single function and applies both at once, so you don't have to read the data, do the first thing, save it somewhere, and then start reading it again — you can do both of them in one scan.
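A sketch of the partitioning idea from that example: hash-partition key-value pairs (here keyed by domain) so that later per-key operations can reuse the layout. The pages RDD of (domain, content) pairs and the partition count are assumptions for illustration.

    import org.apache.spark.HashPartitioner

    // pages: an existing RDD of (domainName, pageContent) pairs.
    val byDomain = pages
      .partitionBy(new HashPartitioner(64))   // co-locate pairs with the same domain
      .persist()                              // keep the partitioned copy around

    // Spark knows how byDomain is partitioned, so this per-domain count avoids an extra shuffle.
    val pagesPerDomain = byDomain.mapValues(_ => 1).reduceByKey(_ + _)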
These things help because, when you write a program, you can structure it in terms of these small, simple operations like map, and at the end of the day you still get an execution plan that's pretty good, even though you built it out of this really modular program.

Now let me ask a little more about these optimizations, because you mentioned some very interesting ones there about trying to minimize communication. Obviously one of your limiting steps is shuttling all the data around, and as the name implies, big data just takes time to send across networks. It also consumes resources and all the kinds of things that could be used for computing. So do you have other types of optimizations? Are you looking at things like network topology awareness — outside the box, things like staying on a leaf switch rather than going through a core switch — and other types of optimization inside the box, like NUMA awareness and such?

Yeah, definitely. There's a bunch of stuff in there already, and a bunch of new stuff going on as well. In terms of network topology, both for input data and for data that you save into memory, Spark is aware of the network topology, and you can do things like specify which machines are in the same rack, so it's aware of the rack topology as well. It understands that there's this hierarchical tree topology. So when it's reading input data from something like the Hadoop file system, it knows which machines have each block of data, and the same thing when you cache data in memory: it tries to place computations as close as possible to the machines that have the data. We've also tried to optimize some specific communication operations. One that we spent a bunch of time on is broadcast. We tend to run on these Ethernet commodity networks, which don't have any special built-in primitives for this, so for broadcast we have an implementation of BitTorrent, more or less, that the nodes can use to send data around. That's pretty important in some applications that tend to broadcast a large parameter vector to everyone.
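A minimal sketch of that broadcast pattern: a largish parameter vector is shipped to each node once rather than with every task. The vector and the features RDD of Array[Double] are placeholders.

    // Assumes an existing SparkContext sc and an RDD `features` of Array[Double].
    val weights   = Array.fill(1000000)(0.01)     // stand-in parameter vector
    val bcWeights = sc.broadcast(weights)         // distributed to the nodes once

    val scores = features.map { x =>
      // Dot product against the broadcast weights, read locally on each node.
      x.zip(bcWeights.value).map { case (a, b) => a * b }.sum
    }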
Apart from this networking stuff, the other area where there's active work is relational optimizations in Spark. These are the same kinds of optimizations you do in databases, and there are things you can do when you have a more structured model for the data. For example, you have a table of records, and you know that each record has, say, an integer field and a float field. The database community has done a lot of work on storing and querying these efficiently, including things like column-oriented databases and doing processing directly on compressed data. So in Spark we have a project called Spark SQL, which lets you store data with a known schema and then query it using SQL, and we also have an API called DataFrames, which is similar to the data frames in the R programming language and in pandas in Python, if you've ever looked at those Python data processing libraries. DataFrames let you have a distributed table with a known schema and do these kinds of relational operations on it from inside a normal programming language like Python, and we then optimize those operations using these techniques. You can actually get quite a bit better performance even than with the plain Spark APIs.

Okay, so that's yet another interesting point — I feel like I've been saying that the whole interview. You're talking about making optimizations because you can know the structure of things, which kind of flies in the face of one of the perceived values of these kinds of systems, which is unstructured data. Can you talk a little bit about comparing and contrasting structured versus unstructured data, and the kinds of optimizations you can do?

Yeah, good question. A lot of the big data systems are used to process what starts out as unstructured data — for example, text files representing logs, or documents from the web, or anything like that. But as you process it, you generally tend to add some kind of structure: you extract some fields from each record, or even with something like JSON data, which technically can contain all kinds of weird elements, you select some fields that you care about and do things with those. So as data flows through a pipeline — in most places, and in Hadoop deployments as well — it becomes more and more structured. That's where these APIs come in. Now, when you know something about the structure of the data, there are several types of things you can do.
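A small sketch of the Spark SQL and DataFrames APIs described above, written against the 1.x-era SQLContext; the case class and the tiny dataset are invented for illustration.

    import org.apache.spark.sql.SQLContext

    // Assumes an existing SparkContext sc.
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    case class Page(domain: String, url: String, views: Int)
    val pagesDF = Seq(Page("example.org", "/a", 10), Page("example.org", "/b", 3)).toDF()

    // DataFrame API: relational operations from inside a normal program.
    val perDomain = pagesDF.groupBy("domain").sum("views")

    // Or register the table and query it with SQL.
    pagesDF.registerTempTable("pages")
    val top = sqlContext.sql(
      "SELECT domain, SUM(views) AS total FROM pages GROUP BY domain ORDER BY total DESC")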
The first thing is in terms of storing it. There's been a lot of work on very high-performance analytical databases, and one of the main ideas that has come up is column-oriented storage and processing. This is basically similar to the struct-of-arrays as opposed to array-of-structs layout that you sometimes have in scientific programming. It means that if your queries tend to access only a few fields in each record, it's better to store the same field together in one region of memory, instead of having to scan through a whole lot of memory and only extract a small percentage of the data. And once you store data by column, you can also compress each column, and there are algorithms that can operate directly on the compressed data to do simple things like sums, where you never have to uncompress it, which is also kind of neat. Then, apart from these storage optimizations, the other really big class is algebraic optimization. When you have a query — you start out with a table and say, okay, I want to pull out field one plus two times field two, then group by something, then filter down to, let's say, just the top ten values — you can actually rearrange the expressions the user gives you and move them around to minimize the amount of work. For example, if you know they're filtering something out, you can do the filtering before you compute the other values they want to select. If you know they're grouping and then discarding some groups, you can skip the rows that aren't in the right groups at the beginning. These are all things you'd have to do by hand if you were writing MapReduce, and if you can instead write a structured query, or write an expression using DataFrames, there's basically a compiler that comes up with the best query plan for you.
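To see that kind of rearrangement, one can write a query with the DataFrame API and ask Spark for the plan it chose; this reuses the hypothetical pagesDF and the sqlContext implicits from the previous sketch, and the expressions and thresholds are arbitrary.

    // The filter is written after the projection, but the optimizer is free to rewrite and
    // reorder the expressions (for example, pushing the filter toward the data) on its own.
    val popular = pagesDF
      .select($"domain", ($"views" * 2).as("weighted"))
      .filter($"weighted" > 10)

    popular.explain(true)   // prints the query plans Spark settled on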
So you've been talking about a lot of examples that really come from the data world. Now, I've been seeing some examples around of actually using Spark for more traditional heavy computation, where you iterate on the data many, many times. Do you see Spark being used for that much, and what's your opinion of Spark for that sort of work?

Yeah, we've seen a few neat use cases there, and I'm definitely interested in finding more of them and in trying to apply this kind of technology to HPC as well. One set is just heavy computation, but on these kinds of commodity clusters. As an example, there's a group in neuroscience at the Janelia Farm research campus that's using Spark to process data from brain imaging. Basically, they have a way to image the brains of live animals and see which neurons are active as the animals are doing certain things. For example, they have this fish, and they show it different patterns and can see which neurons are firing, and they get a lot of image data. Then they want to do some analysis on it, often in close to real time, because they're actually running an experiment. So they want to do things like basic filtering and processing of the images, or figure out which neurons are correlated with each other, which would involve some kind of clustering algorithm or some kind of PCA to reduce the dimensionality of the data and try to understand it. This is an application that's using Spark even though it's very CPU-intensive; it's still useful to be able to do things interactively or even in a streaming fashion. Another area where it's being used is genomic data. That is closer to big data processing as opposed to big compute, but there's a team at UC Berkeley using it to build highly parallel processing pipelines for that. And then we've also seen it used in some computationally intensive machine learning — for example, image processing pipelines, or things like expectation-maximization with more complicated models that require a lot of compute. So it's not going to be a fit for everything that's compute-heavy, but for things that don't have super fine-grained communication patterns it can actually be pretty good, and pretty convenient to use.

Now, one of the traditional complaints about some of these newer frameworks, at least from the perspective of my community, the HPC community, is that they're really looking to extract every cycle possible, so using higher-level languages tends to be relatively frowned upon. That's somewhat changing, because the world is moving toward rapid prototyping: if my computation takes a little bit longer, it's still better for me to be able to code it more easily with a higher-level framework and language. Where is your community falling on this? Because the languages you have chosen to expose Spark in are Java, Scala, and Python, and those are all very high-level things. So what's the trade-off, and how is that valued in the big data and Spark communities in particular?

Yeah, there is definitely a trade-off, although people are working to close the gap as well, to let you use native code at least in some specific cases in Spark and Hadoop applications. I think the reason the big data community started with these fairly high-level languages, namely the ones based on Java, is that there was a lot of experimentation, and a lot of time pressure to build a new algorithm, for example, that you might try out and then throw away. All these projects started at web companies, and it's not like the web companies have one algorithm that they've known about for ten years and just want to run at the highest speed possible, where it will run for months so they want to make it very fast. These companies are experimenting with new algorithms all the time, and they also have a lot of ad hoc queries — someone comes in and wants to explore a dataset and maybe find out something new. So they valued speed of development over making any specific algorithm very efficient. And that's the same thing we've seen with Spark: we have a lot of people, especially the ones using it for interactive queries, who really need the high-level language.
Now, in terms of getting closer to bare-metal performance, there are two types of things people have been doing. One is trying to use kernels or libraries on each machine that are written as optimized code. For example, the machine learning library in Spark uses BLAS on each machine, and you can use your own favorite implementation of that to do the linear algebra. It's not as good as writing all of your algorithms in C from the beginning, but at least for a lot of them the inner-loop part can be pretty fast. A similar thing holds for the relational queries, the SQL and DataFrames: a lot of frameworks now generate code at runtime based on your SQL query, so you get code that's optimized to run that way. So I think we'll see more efforts to let you get this kind of performance, but there is fundamentally a trade-off between people trying to build something in a few hours, or type a query in a few minutes and see what happens, versus people trying to build something they know will run for many weeks or months into the future.

We mentioned that Scala, Python, and Java are available now, and you've mentioned that some other things are being worked in. What is the Spark framework itself written in? Is it written in one of those, or in its own thing?

Yeah, it's actually written in Scala. Scala is basically a statically typed language on the JVM; you can think of it as a higher-level Java. So a lot of this stuff has similar performance to what you'd get using Java.

Now, just out of curiosity, what's the largest job that you've heard of, or the largest RDD?

The largest job we've seen — we've seen some jobs that processed over a petabyte of data in the same job and ran for, I think, close to a week. This was at one of the Chinese web companies; actually, let me look it up so I can give you the right answer. So the largest job we're aware of is over one petabyte of data, and it ran for something like a week. This was at Alibaba, the Chinese web company, to do image processing: they had one petabyte of images they wanted to process. There are some other web companies that are also processing around one petabyte of data per day using Spark, although not necessarily in the same job. The largest cluster we're aware of is about 8,000 nodes. In terms of in-memory RDDs, I'm not sure what the biggest one is; I don't think too many people have, say, petabyte-scale memory in a cluster, but we've definitely seen jobs with, say, 10 terabytes of memory being used within one Spark job.

So I want to touch on a little bit of history, too. Can you tell us a little bit about what the AMPLab is?

Sure. The AMPLab is a research lab at UC Berkeley that I was part of when we started Spark. It's a kind of multidisciplinary lab in big data.
It brings together people from distributed systems, databases, and machine learning, and tries to look at the problem together from those perspectives. It also has a few people doing applications — for example, the genomics application I talked about, and other applications at UC Berkeley. So it was a lab of about 50 people where we worked on various systems that tried to do these things, and part of why Spark started is that we saw, for instance, machine learning researchers trying to run algorithms that didn't work well on MapReduce, and we saw that basically as a problem.

Okay. Now, as for the future, tell us a little bit more about Databricks.

Yeah, Databricks is a startup company commercializing Spark that began out of the AMPLab about one and a half years ago. I was part of the founding team, along with quite a few other members of the research group. We're now basically the largest organization contributing to Spark, and we continue to build Spark as an open source project; the way we commercialize it is by offering a hosted service on Amazon — basically Spark as a service, with tools on top to make it very easy to run and manage.

Now, you mentioned open source there; that's kind of key. What license is Spark available under?

It's the Apache license, and it's hosted at the Apache Software Foundation, so anyone can join and contribute using the well-defined process they have there.

Now, something I ask all projects that have developers in them — a pure curiosity question, because there tends to be a lot of passion on every side of this — is: what version control system do you use, and why?

Yeah, we use Git. I think we started with Git from the beginning. I know there are many pros and cons to these, even within Apache; many people have different opinions. But for me a major reason was actually GitHub: the online interface for reviewing patches was pretty nice and convenient to use, and we also found that a lot of developers were familiar with GitHub and found it easier to contribute to Spark that way. So I think that's probably the main reason we did it.

Okay, so we've talked about everything Spark is. What are you guys looking at adding in the future?

Yeah, there are several things in progress now; there's quite a bit of activity, and the whole community has grown really fast, so there are many different people working on new libraries and new features for Spark. The biggest direction we've been going in over the past two years is building up a really rich standard library on top of the core Spark engine. When Spark started out — and similarly with MapReduce and so on — it was just about the programming model, and we said, oh, isn't it exciting that you can write, say, k-means clustering in 30 lines of code? Like, wow, that's amazing; before, you would have had to build a whole distributed system. But in practice nobody wants to write k-means clustering in 30 lines of code — they want to call it from a library in one line of code.
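As an example of that one-line-from-a-library point, a hedged sketch of calling k-means from Spark's machine learning library; the input file of comma-separated numbers is a placeholder.

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Assumes an existing SparkContext sc; each input line is a comma-separated feature vector.
    val points = sc.textFile("hdfs://namenode:8020/data/points.csv")
      .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
      .cache()

    // The actual clustering is a single library call: k = 10 clusters, up to 20 iterations.
    val model = KMeans.train(points, 10, 20)
    model.clusterCenters.foreach(println)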
So we're building a bunch of standard libraries on Spark that all operate on RDDs — all of them can share data with each other — and that have a lot of the most common algorithms for data processing built in and optimized for a distributed setting. We have a machine learning library, a graph processing library, the Spark SQL system for structured data, and also a stream processing library to do things like sliding windows and aggregating data across time. Most of the activity in Spark is in these libraries now, adding new algorithms and new ways to run the existing ones. Apart from this, I'd say the other big thing that I think will happen in the next few months is the R interface to Spark. In the HPC world, maybe R isn't as popular as, say, Python for data analysis, but in the industrial data analysis space it's one of the most popular languages, and you'll be able to call Spark from R and call the machine learning algorithms and so on, on distributed datasets that you build using R. I think this will again increase the community of people who can use it. So overall, our goal is to look at the tools and libraries people are using for data processing on a single machine, and try to provide equally high-level, equally convenient ones that can scale out on a cluster. That's what we've been trying to do.

Okay, another question we like to ask people with software projects: what is the weirdest or most unexpected use of Spark that you have heard of?

Yeah, I'll have to think about that. When we started the project, one pretty unexpected use — but actually pretty neat in hindsight — was people building web applications, user-facing web applications, that run Spark queries on the back end. For example, there are a number of applications for, say, advertisers — basically these kinds of data products — where an advertiser can go in and look at how well their campaign is doing and then slice and dice the data. They can say, oh, let me filter it down to one demographic, or one region of the world, or one time frame, and as they change the filters, the application actually calls Spark on the back end and computes a result. With the in-memory datasets you can set up, you can actually get these results back in half a second, or a few hundred milliseconds, so you can have an interactive web application that's running these queries over a cluster on the back end. When I saw those initially, I definitely had not expected people to use it that way, and I was very excited about it, and we've tried to do things to make it easier for people to build these.

Oh, thank you very much for your time. Where can people find more information about Spark?

Yeah, you can find out more and download the project at spark.apache.org. And I should also say, if you want to try it, you don't actually need a cluster — you can run it on your laptop. There's a local mode that works very well and that most people use for development; you just need to have Java on your laptop, and also Python if you want to use the Python API.
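For trying Spark out on a laptop as described, a minimal local-mode sketch; no cluster is needed, just a Java runtime (and Python if you want the Python API).

    import org.apache.spark.{SparkConf, SparkContext}

    // "local[*]" runs Spark inside this single JVM, with one worker thread per core.
    val conf = new SparkConf().setAppName("try-spark-locally").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    println(sc.parallelize(1 to 100).sum())   // a tiny sanity-check job
    sc.stop()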
Thank you very much.

All right, thank you. This was great.

Yeah, thanks for having me.