Actually, this talk and the next talk by Helena will both be about Apache Spark. Mine is more of an introduction: starting from the Scala collections API that you all know and love, and talking about how Scala really is a scalable language, where you can start with collection transformations on a few items and go all the way to distributed big data. The next talk is more about streaming and building a big data pipeline. They'll both be pretty interesting.

All right, well, sorry, you'll have to bear with the color a little bit. Let me see if we can make that a little better. Let's keep going, and hopefully the text will be more readable in subsequent slides.

Anyway, again, my name is Evan Chan. I'm a principal engineer at Socrata, and I've been working with Hadoop, Spark, Cassandra, Kafka and various big data systems for a number of years; that's what I love to do. A little bit about Socrata: I'm very blessed to represent Socrata today. We build software to make open data, public data, more accessible to you, the citizens. What does that mean? How many people here are from Seattle? Awesome, great city. How many people are from the San Francisco Bay Area? All right. Seattle, San Francisco, and various counties in California are all our customers. We help you figure out how your city, your county, your state, and the U.S. federal government are spending your money, and whether they're meeting certain goals. Governments nowadays are really into opening up their data. They want citizens to engage, because that results in better informed voters and so forth. We also expose really interesting data sets, such as how much medical procedures are being charged around the country, whether your doctor is being bribed by pharmaceutical companies, and how many free lunches they're getting; that kind of thing. A lot of really awesome stuff. We also sponsor hackathons, and our entire backend is in Scala.

Now, a really interesting story, and this is true: last year I was sitting in your chair, coming to this conference for the first time. I had not heard of Socrata, and I watched my now-colleague Clint come up and give a talk. And I thought, this is really interesting: a company that's using Scala in the backend but is also trying to change the world. And now I'm here working for them. So I want to encourage you: if you feel like you want to do something really interesting, not only technically but also in terms of helping to change the way our government works and enabling a better country, come talk to me, or come talk to the rest of us over at that booth.

But anyway, enough about me and Socrata. We're here to talk about Scala. Now, why do you love Scala so much? One or two things, anybody? Almost Haskell? Almost Haskell, all right, and you can use as many symbols as you want; I guess that's one reason. My favorite reasons are the concurrency, the Java interop, and functional programming. But I think one really key thing is the collections API. Being able to work with lists and maps and do all kinds of functional transformations on them is pretty powerful. And you don't realize how good it is until you compare it to a lot of other languages: how complete the Scala collections are, and the fact that you can work with both mutable and immutable data structures pretty easily. They're great because they make working with data a joy.
You can easily go from sequential to parallel to distributed. When we say Scala is a scalable language, this is one key aspect: with the same API, as you'll see, you can go from small data sets, to doing parallel operations on them, to huge data volumes.

So let's look at a really simple example, just doing a map on a list. I think everyone is pretty familiar with this: we have a list of numbers, we call the map function, the function takes each item and multiplies it by two, and we get back the answer right away. Pretty simple. How does this really work? The source code here is from the Scala standard library, from the trait TraversableLike, and you can see there's something called CanBuildFrom. Basically the first thing it does is call the builder and build a new collection. Then it uses a for loop to go through all the items in the original list, apply the function f(x), and append each result to the new collection, and then it gives it back to you. You can see that this is sequential, right? So anyway, that's the Scala source code.

Does it have to be sequential? Actually, no. If we think about it, mapping is an inherently parallelizable operation; it's pretty easy to split up the data and apply the function in parallel. Starting in Scala 2.9, there's a feature called parallel collections. How many of you have used or tried out parallel collections? Cool, maybe a fourth of you. It's very convenient: I can just add four characters, .par, and that turns my list into a ParSeq, actually a ParVector, and what it does is divide and conquer. It splits the list into some number of chunks and applies the map to each chunk in parallel. Speaking of divide and conquer, this is supposed to be an image of ants; you can't really see it, but nature has been really good at divide and conquer for a long time.

Anyway, what else can be easily parallelized? Things like filter and foreach, all pretty easy. What about groupBy? GroupBy is a little bit harder, right? You need to actually transfer things back and forth; there's some shuffle going on.
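A minimal sketch of the idea above, using nothing beyond the standard library: the same map, first sequential and then parallel via .par.

```scala
// Sequential: TraversableLike.map builds a new collection one element at a time.
val myList = List(1, 2, 3, 4)
val doubled = myList.map(_ * 2)            // List(2, 4, 6, 8)

// Parallel: ".par" converts to a parallel collection (a ParVector here),
// and the same map is split across threads, divide-and-conquer style.
val doubledPar = myList.par.map(_ * 2)     // ParVector(2, 4, 6, 8)

// filter and foreach parallelize just as easily; groupBy needs more coordination.
val evens = myList.par.filter(_ % 2 == 0)  // ParVector(2, 4)
```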
So we've gone from single-threaded to parallel. What about distributed? This is where I introduce Apache Spark. How many folks have not heard of Spark? Wow, okay, all right, I'll go through this really quickly; the audience is probably fairly familiar. But how many people are actually using Spark in their jobs? That's actually pretty good too. So Apache Spark is a horizontally scalable, in-memory computation engine. It's like Hadoop, but much better. How is it better? It gives you a significantly better functional Scala API. It has a REPL, so you can try things out. In terms of productivity, it is completely different. There's a slide I didn't bring, but basically a word count in Hadoop is something like 60 lines, and in Spark you could easily do it in one line; if you want nice-looking Scala rather than one of those long one-liners we were talking about earlier, it's still only about three lines. And it has a huge amount of momentum right now. It's probably by far the most active big data project, and one of the most active Apache projects overall. Just about everyone is trying to plug their data store or framework into Spark. And it has a lot of components being built on top.

So it's not only MapReduce-style batch jobs: there's a streaming component, there's an SQL layer, there are graph algorithms and machine learning. A lot of really neat stuff is being built on top.

So let's go back to our example. We have a list of items and we want to map some function over it. This is what a distributed map looks like with Spark. It's pretty much the same code. We'll go into what this RDD thing means, but we have a map function, again the same function, underscore times two. I've just added a take to it, which, as you'll see, is necessary to get the results back. But now you can map and filter on petabytes of data with the exact same syntax.

So what is really going on under the hood? The basic abstraction in Spark is called an RDD, which stands for resilient distributed dataset. This is basically a distributed, immutable collection of items. An RDD is split into partitions, and the rule is that one partition has to fit entirely on one node. One node can contain multiple partitions, but a partition can never be split among nodes. And an RDD is typed, just like a standard Scala collection: you can have an RDD of Int, just like in Scala you have a Seq of Int. Functions that read line input from a whole bunch of files will give you, for example, an RDD of String. And if you read from something else, for example Cassandra, you might get back an RDD of some case class, depending on what you're trying to do.

So under the hood, this is a little bit hard to see, I apologize, but basically you have an RDD of numbers; the text in the first boxes reads 1, 2, and then n. What we do is produce a new RDD with each item mapped by the mapping function, and this is all done in parallel: you can imagine all the different nodes running the mapping in parallel. So it's pretty easy to understand.

Just some notes on terminology. Sorry, go ahead. Is the partitioning something you can decide up front, and can it change at runtime? That's a good question. So the question was, is the partitioning decided up front? It depends. A lot of functions will give you a default number of partitions. For example, for the simple function that parallelizes a sequential list, the default is the number of cores you have, which usually isn't what you want. If you're reading from, say, HDFS or Cassandra, Hadoop has a notion of partitioning based on how many nodes you have and how many blocks are in HDFS, so that gets decided for you. However, after you have an RDD, you can decide to repartition it into a different number of partitions if you want. And for subsequent operations, anything that shuffles over the network will usually let you configure the number of resulting partitions.

So, a Spark worker node: there are actually two processes, a worker and an executor, but that's where the computations are run and the resulting data is cached. The Spark driver is your application. There isn't really an equivalent in Hadoop, but basically it's your main app, the one that actually calls things like rdd.map. It controls the program flow and executes the steps. And in this case, when we run the take, that takes the items that have been computed and returns them from the worker nodes to the driver, so that's where the network transfer happens.
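Here's a rough sketch of that distributed map, assuming a local SparkContext; in the Spark shell, sc is already created for you, and the master and app name below are just illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative local setup; in spark-shell this is done for you and exposed as `sc`.
val conf = new SparkConf().setMaster("local[4]").setAppName("map-example")
val sc   = new SparkContext(conf)

// An RDD[Int]: a typed, partitioned, distributed collection.
val rdd = sc.parallelize(1 to 1000000)

// The same mapping function as before, but now applied across partitions.
val doubled = rdd.map(_ * 2)

// take(5) pulls the first results from the workers back to the driver.
doubled.take(5)   // Array(2, 4, 6, 8, 10)
```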
And Spark will actually take your functions in Scala and serialize them over the network. So one requirement is that mapping functions need to be network serializable.

Here's a really quick overview of the Spark API. There are operations that work within a data set, like map, filter, groupBy, and sample. There are join operators that work across RDDs. There are actions, and we'll go a bit more into those, which actually give you results back. And then there are others for optimization: someone was asking about repartitioning, and for example there's a repartition function that lets you change the number of partitions.

So this is supposed to be an image of Homer lying on the sofa with donuts. I love donuts, by the way. How many people have gone to Voodoo Donuts in Portland? So yeah, all right, cool. But for every person that likes them, there's always some hometown donut shop they prefer. Anyway, we're talking about laziness in Scala collections and how that applies to Spark. When we did myList.map, we got a new collection back right away, right? What about streams and iterators? What happens if we take a list, convert it to a stream, and apply a map to it? We'll see that when we apply a map to a stream, it doesn't give us the values right away. It gives us a new stream where you can see that the first value has changed, it got mapped, but the subsequent values have not; there's still a question mark. So basically, a stream is lazy: the computation is not done until you ask for the results. In this case, in order to get the result, I'm actually converting it to a list. Even when I do a take on a stream, you still get back a new stream, and you don't get the result yet.

So how does this work under the hood? This is the implementation of Stream's map method from the standard library. You can see it has this call to cons. What it does is map the first element, but then return a stream whose tail is all the other elements of the stream with the mapping function applied. So what happens with a lazy collection is that the tail is not evaluated; instead it remembers the step, the transformation that you applied. What you have is a new object that contains the composition of the function f with the previous stream. And you can imagine that if you apply more and more transformations, you build up a history of the steps you took. But nothing actually happens until you evaluate the tail.

By the way, does anyone know the difference between streams and iterators? I just threw this in. Both streams and iterators are lazy. Streams memoize their results, so you have to be careful about memory usage. But the key thing is that iterators are mutable, have state, and can only be used once: once you've iterated through one time, that's it, they're done. A stream can be reused. There's also something called an Iterable, which can give you an iterator again and again. So if you need to reuse an iterator, you might want to return an Iterable instead.
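A small sketch of that laziness in the REPL, using only the standard library; the printed forms shown in comments are what you would typically see.

```scala
// A Stream only evaluates its head eagerly; the mapped tail is deferred.
val s = List(1, 2, 3).toStream.map(_ * 2)
// s: Stream(2, ?)   <- only the first element has been computed

s.take(2)     // still a Stream; take is lazy too
s.toList      // forces evaluation: List(2, 4, 6)

// Iterators are also lazy, but stateful and single-use.
val it = Iterator(1, 2, 3).map(_ * 2)
it.toList     // List(2, 4, 6), but `it` is now exhausted

// An Iterable can hand out a fresh iterator each time you need one.
val iterable: Iterable[Int] = List(1, 2, 3)
iterable.iterator.toList   // List(1, 2, 3), repeatable
```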
Anyway, slight detour. So what about laziness as it applies to Spark? Well, let's say that I do a map operation. And by the way, let me preface this bit of code with sc.parallelize: if you don't know Spark, this is the way to take a sequence that you have and convert it into an RDD, so you'll see it a lot. It could be anything: test data, a list of text files you want to read from, whatever. So after sc.parallelize, you get an RDD, and the map is being applied to the RDD. You can see that the result of that mapping is that I get back a MappedRDD; I don't get back the mapped results. What this tells you is that Spark is also lazy. Just like the way streams work, Spark remembers that you did this transformation, a mapping of a function, and it records that in what it calls a lineage. So Spark will remember every step that you took.

So I have a MappedRDD. What am I going to do with it? This is where the actions come in. An action is, for example, count. Let's say that rdd6 represents a bunch of transformations. When I do count, that actually forces the work to happen, and Spark will actually go through it. So let's say you're reading from Hadoop files: until you do a count, no I/O actually occurs at all. It's only when you do a count or a collect that the I/O and the transformations all happen. Count is a pretty popular way to force a computation in Spark, by the way.

So why is laziness important for Spark? There are a lot of reasons. The first is perhaps that when you have a huge amount of data, you want to minimize the amount of work that has to be done. If I do a take, for example, I don't need to go through every I/O source and apply my expensive transformation to every bit of data; I just need to do the minimum necessary to return the first n results. You also end up using a lot less memory for intermediate results, because instead of creating a new RDD at every point, Spark can apply the composition of a whole bunch of functions at once. What's more, Spark can and does optimize the execution plan: if you have a whole bunch of map- and filter-like stages, where the work is all local to a partition, it will combine them and execute them together. Spark will try to optimize things so that it can run as many steps together as possible. And lastly, a really important reason is error recovery. The reason Spark remembers all of your steps is that if something goes wrong while you're computing, say you lose a node or you lose part of the data, the way Spark recovers from the error is by retracing those steps. One assumption Spark makes is that you're starting from a source that is redundant and can be read from again, for example a distributed file system, so it can go back, read that source data, and recompute the steps to get your computation back.

Now, obviously, if you wanted to repeat some calculation and you had to read everything from the source again and again, that would be very expensive. The key to Spark's speed is actually being able to cache data in memory. So there's a cache function; it will save the results of the last bit of data so that you can reuse it again and again, and if we have time, I'll demo this at the very end. It's like memoization. It gives you one or two orders of magnitude speedup, because you can just read from memory. And it's used a lot for iterative algorithms especially, such as linear regression, that kind of thing. This is where you see a really huge boost in performance over Hadoop.
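A compact sketch of the lazy pipeline and the actions described above, assuming the shell's sc; the data and the rdd6 name are just illustrative.

```scala
// Transformations are recorded in the lineage but not executed yet.
val rdd6 = sc.parallelize(1 to 100)
  .map(_ * 2)            // lazy: returns a new RDD, nothing runs
  .filter(_ % 4 == 0)    // still lazy

rdd6.take(3)    // action: does only the minimum work needed for 3 results
rdd6.count()    // action: forces the whole lineage to run

// cache() marks the RDD for in-memory reuse, like memoization for RDDs.
val cached = rdd6.cache()
cached.count()  // first action computes and caches the data
cached.count()  // later actions read from memory, typically much faster
```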
There are many ways of caching data in Spark. The default is in memory, but you can also serialize to memory and disk. You can also use an experimental off-heap mode, which is pretty cool if you want redundancy and to survive worker node failures. And finally, there's Spark SQL. Some of you might have heard of Shark, which was the old Spark SQL layer; Spark SQL is now replacing Shark, and they're not doing new development on Shark anymore. Spark SQL has a very different kind of caching. The normal RDD caching keeps the values of the RDD in memory, using serialization such as Java serialization. Spark SQL instead has a compressed, columnar, in-memory kind of caching that takes the values and compresses them using techniques like dictionary compression. So it uses a lot less memory than something like Java serialization would, and it's a lot faster; it's built for fast SQL queries.

Is there a specific mode that you use? I think it depends on the use case. If you're using SQL, then obviously the SQL caching makes sense. For the other ones, it depends: for example, replication is a pretty good strategy if you don't want to rebuild your data from scratch. Or Tachyon, for that matter. Especially as more and more people are thinking about multi-tenancy, they want to run multiple applications against the same set of data, as well as to survive, say, a worker node running out of memory. Then something like Tachyon, which is an off-heap cache for your data, starts to become more and more appealing. So I think there's a lot of work around that area now.

Yeah, quick question. That's a good question: how long does cached data live? Spark has a built-in feature called a TTL. I think by default it's off; I'm not sure what the default is now, but you can configure it. So you can say that I want my data to live for 12 hours, and after the 12 hours are over, Spark will start cleaning up old references and things it thinks are not being used. So that's how you control how long it lives.

So, grouping and sorting: this is where it gets pretty interesting. Let's look at how we do a top-K word count in Scala. I have a bunch of words from somewhere, and I want to find the top five words. What we do is a groupBy on the word itself. This gives me a map from each unique word to all the instances of that word. I'm going to convert that into a count of each word, then sort it and just take the top five items. I'm sorry? Yeah, yeah, there was a lot of math. I'm not a math person, you know; I just like to hack it.

So this is how you do it in Spark. It's not that different, but I'll highlight the differences. First, you know the first part, the parallelize, which is loading the data. Then I'm taking the words and mapping each to (word, 1): I'm basically turning each word into a tuple where the first part is the word and the second part is a count of one. Then I'm going to do what is called a groupByKey, which is the most common way to do grouping in Spark: it takes a tuple, the first part of which is considered the key, and moves all of the instances of that word onto the same node. Then I convert this tuple of the word and all its instances into a count, like before, and I sort it and do a take. It's roughly similar, just slightly different with the groupByKey.
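A sketch of both versions of that top-K word count, with a made-up word list; the K of 5 and the word data are just for illustration, and sc is the shell's SparkContext.

```scala
import org.apache.spark.SparkContext._   // pair-RDD operations like groupByKey on older Spark

val words = Seq("apple", "pear", "apple", "fig", "apple", "pear")

// Local Scala collections: group by the word, count, sort, take the top 5.
val topLocal = words.groupBy(identity)
  .map { case (word, instances) => (word, instances.size) }
  .toSeq
  .sortBy(-_._2)
  .take(5)

// Spark: the same shape, with a groupByKey shuffle and a distributed sort.
val topSpark = sc.parallelize(words)
  .map((_, 1))                                       // (word, 1) pairs
  .groupByKey()                                      // all pairs for a word end up on one node
  .map { case (word, ones) => (word, ones.size) }
  .sortBy(-_._2)                                     // another shuffle, for the global sort
  .take(5)
```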
This is a little bit hard to see because of the color flip, but what's actually happening is that the first stage is a map from the word to the tuple, where you see (apple, 1). Then we do a distributed shuffle over the network with the groupBy, where all the instances of apple, for example, end up on the same node. Then we do a sort, because we want to know what the top items are; we're sorting by the second element, which is another network shuffle. Then we finally use the take to get the first items.

Can we do better, though? And the answer is yes. Oh, actually, before that, just to highlight: you can't see the colors, but all the boxes on the left are happening on the worker nodes, and only when you get the results using take does anything transfer to the driver. So, can we do better than the first method? The answer is yes. What we do this time is use the method called top, which Spark has specifically for this kind of thing. Top avoids the global sorting of items by taking the top K items from each node and then combining them at the driver. And the reason you can do a distributed top-K this way is that the items are unique on each node. If they weren't unique, you couldn't use this technique, because with duplicates you couldn't guarantee that you'd get the true top K items. And there's a function called reduceByKey, which is basically a groupByKey plus a local reduce, which is pretty convenient.

Now let's move on to another application of Spark, which is ETL. Let's say that I want to read a whole bunch of lines from one of our data sets, the Chicago crimes data; my co-workers are probably rolling their eyes. Let's say we want to extract a few fields, filter by the date, and take a couple of items. This is pretty similar to the examples we've seen before: we're splitting the input line, mapping it into a tuple, et cetera. But what if you have a huge file? Well, we can try parallelizing it. Let's say you break the file up into multiple chunks; then you can do (1 to 4).par, which gives me a parallel collection of numbers, and I'm going to flatMap that into io.Source. Let me try scrolling this a little bit. So this is just taking io.Source.fromFile(...).getLines and converting that into one big stream of lines, but now it's done in parallel. So this is cool, this helps it go a little faster.

But let's say my data is really big, in the gigabytes. So let's do this in Spark. Let's say I load my data into S3 or HDFS. I can use the textFile method of SparkContext and pass it a URL. This one is a local path, but if it were HDFS, and you had HDFS set up, this would look in HDFS at that path; or you can prefix it with s3 for S3, and so on. And the rest of the code is exactly the same as my local Scala example: all four lines of map, filter and the take are exactly the same, which is pretty cool. Now I have distributed ETL, and the only change is how I load my data.
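A rough sketch of that kind of ETL job, assuming a CSV-like crimes file; the path, the field positions, and the date filter here are made up for illustration.

```scala
// Could be a local path, an hdfs:// URL, or an s3 URL, depending on where the data lives.
val lines = sc.textFile("/data/chicago_crimes.csv")

// Same shape as the local Scala version: split, pick fields, filter, take.
val sample = lines
  .map(_.split(","))
  .filter(_.length > 5)                                 // guard against short or garbled rows
  .map(fields => (fields(0), fields(2), fields(5)))     // e.g. id, date, crime type
  .filter { case (_, date, _) => date.startsWith("2014") }
  .take(10)                                             // action: only now does any I/O happen
```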
One note I'll make: let's say you want to read from a data source that is not Hadoop or S3, say you want to read from Mongo or whatever. Then I can use this technique of parallelizing, say, a list of Mongo keys, and then going and reading those documents out and processing them in parallel. So it's really similar, just that you don't have a built-in function like textFile.

What about side effects? So far we've been talking a lot about transformations. Let's say that I want to write my transformed data out to a database. That's where foreach comes in: I can do rdd.foreach, and that takes each piece of data and writes it to the database. There's a similar method that lets you iterate over all the items in a partition, if you want to be more efficient, say writing all the items in one partition in one go, so you can do that too.

So, pretty cool: so far we've taken a Scala collection, you know how to do things with it locally, and with very little effort we've seen we can apply that to batch computations in Spark. Now I'm going to talk about Spark Streaming a little bit, because you'll find that you can use a very similar API and apply it to streaming data, which is pretty cool. Spark Streaming is Spark, but running on streams of data using micro-batches. It splits the input stream into very small bits of data, say one second's worth, and you can control what that interval is. It runs a little Spark execution on each of those little batches, and you can write the results out to different data sources. Just to give you a flavor of what that looks like (the next speaker, Helena, will go into this in a lot more depth): I have an ssc, which is a Spark StreamingContext, and I'm going to read lines from a socket; this is again the word count example that we know and love. I take the lines, which come in as what's called a DStream, a distributed stream, and I apply a flatMap over it to split each line into words. So now I have a stream of words, and then I apply the map function again, and there's that thing you're really familiar with by now: converting each word into a tuple of the word and a count. And then I do the reduceByKey that you've also seen. So this is pretty much the same code as the batch example, but now I'm applying it to streams of data.

One thing to think about: the reduceByKey is really a grouping, right? So what does a groupBy mean on a stream? If I just had a Scala iterator over lines of text and I applied groupBy, that wouldn't really work in a streaming fashion, or at least it's really unclear what it would mean, because typically, in order to do a groupBy, you need to know the whole set of data. But in this case it's pretty easy to understand: it's doing the grouping on each chunk of input data and writing that out.
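To make that concrete, here's roughly what the streaming word count looks like; the host, the port, and the one-second batch interval are illustrative.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._   // pair-DStream ops like reduceByKey

// Micro-batches: each batch is one second's worth of data from the socket.
val ssc = new StreamingContext(sc, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)     // a DStream[String]

val counts = lines
  .flatMap(_.split(" "))        // stream of words
  .map((_, 1))                  // (word, 1) pairs, same as the batch example
  .reduceByKey(_ + _)           // grouping happens per micro-batch

counts.print()                  // write each batch's counts to stdout
ssc.start()
ssc.awaitTermination()
```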
So, a couple of tips on Spark. Definitely play around with the REPL; that's a really great way to get started. Just like you might have played with the REPL in Scala to get started, you can do the same thing in Spark, and it makes it very easy to explore. Remember that transformations are lazy. This gets everyone who's starting with Spark: they'll write a bunch of code and wonder why nothing ever happens. You want to limit the number of expensive grouping and sorting operations. And definitely explore the many sub-components.

Yeah? I'm wondering, I assume the REPL is not the standard REPL, it's one particular to Spark? It's a modified standard Scala REPL. So it's not just adding the library to the classpath, basically, it's doing other stuff; what other stuff does it do? As far as I know, and I've actually looked in the source code for this before, though the details are a little fuzzy, it tries to make sure that the code you write is properly serialized to the remote destination, so I know there's some class loader stuff in there. I don't remember what else. To be honest, what I do sometimes (I have a separate project called the Spark Job Server, which provides a REST interface) is that instead of building Spark, which takes a long time, and launching spark-shell locally, I will launch sbt console from one of my projects that pulls in Spark as a dependency, and you can do the same stuff from sbt console. So it's actually pretty easy. The only thing is that you need to create the SparkContext yourself, whereas spark-shell creates it for you. That's not a big deal; it's just a few lines of code. Or you can even put it into your sbt config; I don't remember exactly, but there's a setting where you can tell it to execute things at the beginning of a console session.

So, definitely explore the many sub-components: Spark Streaming, MLlib, GraphX, Spark SQL. One of the coolest things about Spark, and I think what makes it unique, is that in the old days, if you had Hadoop, you'd run a MapReduce job, then you'd have to run another tool, say Mahout, to do machine learning. So that's another set of tools, you might do some conversion, and it's more things to manage. In Spark, I can seamlessly load data in, run some Scala API, run Spark SQL, take the results of that, and apply logistic regression to it. It's very seamless, which is, I think, very powerful. Definitely take time to understand caching, because it has a huge impact on your performance as well as on your design for resiliency and that kind of thing. And definitely check out the many integrations that make your lives much easier. There's the Spark Cassandra connector for connecting to Cassandra, there's an Elasticsearch connector, and I think people are writing all sorts of connectors for whatever their thing is.

So, in conclusion: if you can flatMap it, then you can Spark it, so go try it. Actually, hold on one second, do I have a minute for a demo? A couple of minutes, cool. So let's see if this color thing will work or not. I don't really know if this will work, but... Oh, okay, hold on for one minute. I'll have to blow this up a lot to make this work. Oh, no, that's the wrong one. Can't really see my cursor. All right, cool. Is that kind of readable? Oh, it's too low. Well, okay, all right. Here, how about that? All right, we'll move this, okay, there we go.

So everyone can see this: basically what I've done here is the example I had on the screen before. I've taken a text file that I have and done a bunch of maps and filters and ETL operations. And you see that what I get is a MappedRDD, which again means it's lazy; this doesn't do anything yet, right? So now what I do is take res0 and, let's say I want to count the number of distinct crime types, I do a distinct, and I have to do a collect; without the collect, it won't actually execute. So now it reads from disk and does the processing, and it's doing this in local mode, which is a bunch of different threads. And I see the depressing list of crime types. This takes about 6.5 seconds, which is cool.
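A rough reconstruction of what those shell commands look like; the file path and field position are illustrative, not the actual demo data.

```scala
// In spark-shell: build the lazy ETL pipeline over the crimes file.
val crimes = sc.textFile("/data/chicago_crimes.csv")
  .map(_.split(","))
  .filter(_.length > 5)
  .map(fields => fields(5))        // suppose field 5 is the crime type
// the shell shows something like res0: MappedRDD[...]  <- lazy, nothing has run yet

// Force the computation: distinct crime types, pulled back to the driver.
crimes.distinct().collect()        // reads from disk, runs in local-mode threads
```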
But let's see what effect caching has. If you run this again, it's going to read from disk again, so it will take about the same amount of time; well, this run was a little faster, but still a while. But let's say that I cache the result. So now I have a cached RDD, res3, and I do a distinct.collect. The first time it will still take a while, because it has to do the computation from disk, but it caches the result of the last step in memory as it's doing so. So again, this takes about three seconds. But if I run it again, you see that it's almost instant: 0.1 seconds, so about 30 times faster. And if I keep doing this, it stays instant. So you can see that it's now reading data from the cache instead of from disk.

Can you compose those cached things and add another calculation to the end of something that was cached? Yeah, so I can take res3 and, let's say, map it; I don't know, let's say I add a bracket to each item or something weird, I can't think of a better example right now, and then I do a collect. So I can do that, and it will be equally fast. Basically you can think of the cache as a point from which I can add branches of a tree, and all those branches will execute pretty fast.

Anything else you want to see? Yeah, so you definitely have to watch your memory usage; you can blow up the heap pretty easily. Spark does give you an API for assessing your memory usage, though, so you can use it to see how much free space you have and that kind of thing. Yeah, and you can unpersist; I can unpersist my data. I think there's an unpersist... well, okay, I don't quite remember exactly what it's called, but there's some method you can use to free the memory, maybe it's uncache. Yeah, I don't quite remember. Well, if you guys have any other questions, I'd be happy to take questions in the back.
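As a closing aside, a small sketch of the caching pattern from the demo; res3 mirrors the shell naming above, crimes is the RDD from the earlier sketch, and the bracket map is just a toy follow-on computation. (For reference, the method to free a cached RDD is unpersist.)

```scala
// Cache the result of the expensive pipeline; the first action still computes it.
val res3 = crimes.cache()
res3.distinct().collect()          // slow once: reads from disk, then caches
res3.distinct().collect()          // fast: served from memory

// Branch off the cached point; this also runs against the in-memory data.
res3.map(t => "[" + t + "]").collect()

// Free the cached blocks when you're done.
res3.unpersist()
```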