So hello and welcome, everybody. This is our first official Big Data SIG meeting. We co-opted a Commons briefing a month or so ago and had Will, who is the speaker today, give an introduction to Big Data and all things Big Data in relationship to OpenShift and new cloud architectures, which was absolutely wonderful, and I'll drop that URL into the chat room along with the recording. It's also available on the OpenShift YouTube channel. But today what we wanted to do was follow it up and kick off the OpenShift Commons Big Data SIG with part two, talking about Big Data and Apache Spark in relationship to running it on OpenShift, and Will has graciously said he would be here even though August is the month of everybody's vacations. So without further ado, I'm gonna let Will give his presentation and talk through all this. The format will be him talking for 20 to 25 minutes and then we'll open it up to Q&A. During his talk, please enter any questions into the chat room. After his talk, what I'll do is basically unmute everybody and we'll lightly raise our hands and chat and talk to each other about our questions and other future topics for Big Data SIG meetings. So again, without further ado, here's Will Benton from Red Hat.

Thanks, Diane. So as Diane mentioned, last time I gave one of these talks I talked about architectures for data-driven applications. Today I'm gonna give a quick overview of Apache Spark and show how you can put it into applications. Now, because this is a short talk, I'm gonna sort of get things started, give you an introduction, show you where you can go to learn more, and give you a place to explore some of these ideas. I'm also gonna show a simple application that's a data-driven application using Apache Spark, but nothing I do in that application would be something you'd want to do in production. In a future talk, my colleague, Mike McEwen, will present on how we can actually put an application with a Spark component into production in a sensible way on OpenShift. So before we get started: there will be an audience participation part of the talk later on, and I'd like to ask that you pull down the Docker image that I have linked on this slide and run it with the command line I have listed here. I'll give everyone a few seconds to do that. If you pull it down and run it now, it will be available for you when we actually get to the part of the talk where we'll do some audience participation.

Well, I'm gonna pause you for a minute. Can you cut and paste that link into the chat so everybody doesn't have to try and figure out how to type it? Yes, but I have to stop presenting to— That's okay. Yeah, if you wanna do something like that; making people type it would just be a problem. There we go. Okay, and then I'm also gonna paste a URL, which is what you can visit after you've run that command line to verify that everything is working; it will look like a notebook when you get it running.

Okay, so a little bit of background. Some of this will be a review for those of you who were at the last talk I gave on application architectures, but in the middle of the last decade, MapReduce was a popular idea. It was a system that Google published a paper about in 2004. It took some very old ideas from the world of functional programming and applied them to data processing. And the idea was that you'd have a lot of relatively modest commodity machines with slow networks and slow disks.
But if you could formulate your problem in such a way that you could distribute the storage and distribute the compute to go to the storage, you could scale out relatively easily. Now, as we know, developing distributed applications is hard, but with MapReduce the idea was that you'd define three components. You'd define a mapper, which transformed an individual record of data. You'd define a reducer, which is a way to combine records with a commutative and associative operation. And then you'd also define a shuffle strategy for how to move data around the network so that related data could be local to an individual machine and you could operate on it efficiently. And then the framework would handle the rest for you. So this was a lot easier than writing distributed applications in the old way, and we could solve a lot of problems that scale out naturally. So if we wanted to, say, count all of the words in a very large file that was spread across multiple computers, we'd map each file to a sequence of word occurrences. So instead of a word, we'd have a word appearing once: the key is the word, the value is a count of one. Then we'd shuffle these pairs around the network so that all the pairs with the same key — for the same word — end up on the same machine. And then finally, we'd reduce these by adding the values together for every pair with the same key, going from a sequence of word occurrences to a shorter sequence of word counts.

A year after the MapReduce paper was published, the Hadoop project produced open source implementations of these ideas. We had a distributed storage system and a programming interface that let you implement the word count example from the previous slide with this terse and elegant code. Now, this programming model was so beautiful that it set a course for a huge segment of the industry for the last decade. I'm kidding, of course; no one really wants to write code that looks like that. But the power of HDFS as a data federation solution and the promise of eventually realizing massive scale-out were extremely attractive. A lot of smart people accepted those trade-offs and began storing their data in HDFS and developing code and libraries atop that abstraction.

Now, more recently, we have Apache Spark, which provides, as I talked about in my last talk, an architecture that works a lot better for running things in the cloud and in containerized microservices, but it also has a much nicer programming model. What Spark provides is a distributed collection abstraction that's pleasant to program against — so you can write a distributed program that looks a lot like a local program — a scheduler to run operations on those distributed collections across multiple machines, and a bunch of interesting libraries on top of those collections. And you can deploy it in a variety of different ways. You can deploy it as a standalone self-managed cluster, which we can run under OpenShift. You can deploy it on Apache Mesos, or you can deploy it as part of a Hadoop cluster on YARN. So I wanna talk about this fundamental abstraction in Spark. I think the key difference from a programmer's perspective between Spark and its predecessor Hadoop is that with Hadoop, you're thinking about an execution model: you're thinking about a way that's easy to run parallel programs.
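Before moving on to Spark's model, here is a tiny single-process simulation, in plain Python, of the mapper/shuffle/reducer word-count flow described above. It's a hedged teaching sketch — not Hadoop code, and the helper names are made up — but it mirrors the three components a MapReduce job defines.

```python
from collections import defaultdict

def mapper(line):
    # map: take one input record, emit a sequence of (key, value) pairs
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # shuffle: bring all the values for the same key together
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reducer(key, values):
    # reduce: combine values with a commutative, associative operation (+)
    return (key, sum(values))

lines = ["to be or not to be", "to thine own self be true"]
mapped = (pair for line in lines for pair in mapper(line))
counts = dict(reducer(k, vs) for k, vs in shuffle(mapped))
print(counts)  # {'to': 3, 'be': 3, 'or': 1, 'not': 1, 'thine': 1, 'own': 1, 'self': 1, 'true': 1}
```

In a real MapReduce framework the mapper and reducer run on many machines, and the shuffle moves data over the network; only the programming contract is the same as in this toy version.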
As a programmer, you have to put the effort into translating your problem into that execution model. With Spark, you start with a programming model that looks like something you would naturally use as a programmer, and Spark figures out how to execute it across multiple machines in a distributed environment. So the fundamental abstraction in Spark is called a resilient distributed dataset, and this is a partitioned, immutable, and lazy collection. We can unpack those. It's partitioned because it can be spread across multiple machines. It's immutable in that nothing you do to one of these changes it — you just create a new one. And it's lazy in that computations only happen when they have to. And because computations only happen when they have to, and because we don't actually change the underlying collection, we have two kinds of operations on these resilient distributed datasets. We have transformations, which create new resilient distributed datasets that encode a graph of dependency information back to a parent dataset. And we have actions, which are where the actual work gets done — an action schedules a cluster job to compute values in the RDD so that you can get an ultimate value out of it.

So we'll see what this looks like with some examples here. A dataset can just be like an array that you'd think of in a conventional programming language, but we can spread it across machines, either by ranges of contiguous elements or by applying a hash function to each element in the array. These are resilient because when we have a distributed application, we're inevitably gonna have failures, and failures mean that individual partitions of these datasets can disappear. But because of immutability and laziness, we always know how to reconstruct partitions when they go away, once we get a new machine to recreate them and rerun that job.

So I'm gonna give you just a quick overview of some of the operations we can do on these things. The first step is creating an RDD. You can create an RDD backed by a local programming-language collection, from a file that's either on a local file system or in an object store, or from a distributed file that's possibly partitioned across many machines. Now, there are other ways to create an RDD, but these are just the first ones we'll look at. Before talking about the rest of this programming interface, I wanna take a quick sidebar and talk about some type notation. We'll talk about the types of these operations because I think it really makes it clearer what they're actually doing. All you need to know is that if I have a capital letter somewhere, that's a type variable, meaning it can be anything. If I have two type variables in parentheses, that's a tuple of two values with potentially different types. If I have one of these ASCII-art arrows between two type variables, we're talking about a function taking a value from a domain and returning a value from a range. And I use square brackets to indicate that a type is parameterized on another type — in the example at the bottom of this slide, we're talking about a resilient distributed dataset containing elements of type T. So one of the transformations we can do on a typed RDD is map: we apply a function that translates from the elements in the RDD to a different kind of element, and that gives us an RDD with elements of a new type.
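Here's a minimal PySpark sketch of those first steps — creating RDDs and applying a map. It assumes a local Spark installation, and the paths and names are placeholders rather than anything from the talk.

```python
from pyspark import SparkContext

# Create a Spark context against a local "cluster" (placeholder app name).
sc = SparkContext("local[*]", "rdd-basics")

# Create an RDD backed by a local collection...
numbers = sc.parallelize(range(10000))

# ...or from a file on a local file system, an object store, or a distributed file system:
# lines = sc.textFile("hdfs:///data/some-big-file.txt")

# map: RDD[T] plus a function (T => U) gives us RDD[U]
squares = numbers.map(lambda x: x * x)

# Nothing has been computed yet -- transformations are lazy.
# take() is an action, so this line actually runs a job:
print(squares.take(5))  # [0, 1, 4, 9, 16]
```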
Another useful operation is the flatMap operation, where instead of mapping each element to another element — a one-to-one mapping — we map each element to a sequence of zero or more elements and concatenate those together. You can do a lot of really powerful and interesting things with that. We can reject all of the elements that don't satisfy a given predicate by applying the filter transformation to an RDD, which just takes a function from an element of type T to true or false. We can take an RDD and reject all the elements that are duplicates by using the distinct transformation. And we can also turn an RDD of elements into an RDD of key-value pairs by applying a keying function.

Once we have an RDD of key-value pairs, there are some other transformations we can do. We can sort our RDD by the keys. We can group all values for a given key into a single record, so that we go from an RDD of keys and values to an RDD with only one record for each key but a sequence of all the values we had in the parent RDD. We can combine values together, as we did in that word count example, by taking a commutative and associative binary operation on the value type and applying it so that we wind up with one value for each key — the result of combining all of the values for that key in the parent RDD. And we can also do a database-style join. Now, again, these things all operate lazily, so we haven't done anything yet if we put any of these in our program. We'll need to actually perform some action that returns a value to the main program in order to have anything happen. Just some example actions: we can collect the elements in an RDD and turn them into a local collection in our programming language. We can get a count of values in the RDD. We can reduce, in a similar way that we would reduce by key, by combining all the elements in an RDD with a commutative and associative binary operation. And we can save to a file, either locally or as a partitioned file on a distributed file system.

So that word count example that we saw in Hadoop earlier, the one that took a whole bunch of lines of code, is actually pretty simple in Spark, and here we're doing it in Python. There's a typo on this slide — I apologize for that; a corrected sketch appears below. We create an RDD from a text file on the first line, and I'll be building up a dependency graph along the side of this slide as we go through. Then we flat-map each element into a sequence of words, concatenating all of those together, so we go from an RDD of lines to an RDD of words. Then we map each word into a word occurrence: instead of saying we have a word, we're saying we have a word that we've seen once. And then finally, we use that reduceByKey function, which moves data around the network to get all of the records for each given word together and adds together all of those counts.

Okay, so just to see if people are paying attention: what have we computed at this point in the program? I'm not looking at the chat, so, Diane, if someone gives an answer, can you relay it to me? I will indeed. So let's see if anyone in the chat has an answer for this. Mike, if you're paying attention? I've seen him ask this question before, though. Oh, that's not fair. Well, why don't you give an answer so that we can move along? I think that's a good question. All right, so we actually have not computed anything at this point — the only things we've done to this distributed dataset are transformations.
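For reference, here's a minimal PySpark reconstruction of that word-count pipeline. The file paths are placeholders, and it assumes the `sc` context from the earlier sketch.

```python
# Transformations only -- nothing runs until the action at the bottom.
lines  = sc.textFile("hdfs:///data/big-file.txt")     # RDD of lines
words  = lines.flatMap(lambda line: line.split())     # RDD of words
pairs  = words.map(lambda word: (word, 1))            # RDD of (word, 1) occurrences
counts = pairs.reduceByKey(lambda a, b: a + b)        # one (word, count) per word

# saveAsTextFile is an action: this is the line that schedules the cluster job.
counts.saveAsTextFile("hdfs:///data/word-counts")
```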
When we get to the next line, where we have an action, that's when Spark will actually schedule these jobs on our cluster to run this computation. And that will look something like this: it will coalesce together all of the things it can do at once on a given node in order to minimize communication time, and run those. And then finally we'll get a file with all of our results in it.

So this RDD is a pretty convenient way to write distributed applications, but it's fairly low level, and it's not the most popular way to use Spark. I introduce it because it's really fundamental to how Spark works, and understanding it will make understanding the other things Spark does easier. But we can go beyond RDDs. A lot of people want to do interesting things with their data-driven applications, and one example of something you might want to do is train predictive models with machine learning. Spark has a lot of implementations of classical machine learning algorithms and models that you can evaluate, and it has some primitives for efficiently implementing your own machine learning models. My team has had success both using the ones built into Spark and implementing our own machine learning algorithms on top of Spark.

Unlike a lot of systems that have different APIs for processing batch data that you have all at once and streaming data that arrives a record at a time, Spark says: well, what if we looked at each window of a stream as its own tiny batch? This is called micro-batch processing, and if you formulate your problem this way, you can use essentially the same API for handling both batch data and streaming data. The way this works in Spark is you have a stream of data, Spark has a streaming engine that converts it into a sequence of small RDDs, and then you can process those in Spark as you normally would.

Now, programming against RDDs is great, but a lot of people are more comfortable with something like SQL. And the nice thing about something like SQL is both that it's a high-level, declarative way to specify your problem and that people are used to the way relational databases work — which is that if you give a query to a relational database, it's not necessarily gonna evaluate it exactly the way you wrote it; it might wind up evaluating a more efficient program that produces the same result. With Spark, if you write arbitrary programming-language code in your RDD operations, Spark can't know anything about that code and rearrange it to improve performance. But if you use a high-level, data-specific language for queries, Spark can generate optimized execution plans for your code. Just as an example, let's say we have an RDD program that's gonna take a list of everyone in America ordered by net worth and a list of everyone in America with a count of felony convictions, and join them to find people who are both on the Forbes list and have felony convictions — we would expect to have some celebrities on that list. Now, of course, we're talking about a very small subset of people who have a very high net worth and felony convictions. So if we just do this naively, we're looking at the cross product of two very large sets, and we're gonna reject almost everything. If instead we rearrange this program so that we first filtered out and only took those people with extremely high net worth, and then joined that with only the people from the criminals table who had felony convictions, it would be a much more efficient program.
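As a rough sketch of what that kind of declarative query looks like with Spark's DataFrame API — the table and column names here are hypothetical, not the ones from the talk — note that even if you write the join first and filter afterwards, the optimizer is free to push the filters below the join:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("celebrity-felons").getOrCreate()

# Hypothetical inputs: (name, net_worth) and (name, felonies)
net_worth   = spark.read.parquet("net_worth.parquet")
convictions = spark.read.parquet("convictions.parquet")

# Written naively: join first, filter afterwards...
result = (net_worth.join(convictions, on="name")
                   .where(F.col("net_worth") > 1e9)
                   .where(F.col("felonies") > 0))

# ...but because the query is declarative, Spark's optimizer can rewrite it to
# filter each table before joining. explain() shows the plan it actually chose.
result.explain()
```

That rewrite is exactly the filter-before-join rearrangement described above, and it happens without the programmer having to restructure the code.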
And Spark offers a few different ways to do these structured queries. You can actually write SQL statements against Spark, and that will work — if you have a syntax error, your program will crash, but that's pretty much the case with any database. There's also a couple of domain-specific languages for doing these sorts of queries in a safer way, and we'll look at those now.

Those of you who pulled down that Docker image, let's bring that up, and I will pull it up on my screen here. Can everyone see this? Yep, looks good. Great. So I'm not gonna go through all of this; I'm just gonna point you to how you can explore it. This is really a self-contained notebook. You can see there's a lot of stuff in there, but I'm just gonna go through and show some of the things we talked about earlier. What we're doing at the top here is creating a Spark context, which is how we talk to the Spark scheduler, create these distributed collections, and schedule computations on them. Then we have an example of creating an RDD in this cell, where we just create an RDD with 10,000 numbers in it. And then we have some examples of transformations on that RDD: rejecting things based on whether they satisfy a predicate, mapping every element in that RDD to a new value based on the result of a function, and then taking the intersection of two RDDs — one filtered on one predicate and one filtered on another — and sorting those. We wanna get values out of our program, so we can run some of these actions on the RDD as well, and as you can see, we get the results we would expect from those.

Next — I'm going through this sort of quickly because I expect people will wanna revisit it on their own time, but I just wanna provide some context so it makes sense — if we wanna use this structured query support, we can load a structured data file. In this case, I have a file in the Parquet format, which is an efficient format for storing relational-style data, containing some log messages from the Fedora project's infrastructure. And as you see here, we can do a basic database-style aggregation with pretty straightforward code, and we get results for all of these messages that went out on the Fedora infrastructure, grouped by which subsystem in the Fedora project produced them. This is a small subset of a real data set, but it's what we could fit in a Docker image. As you go on in this example, there's an overview of how you could do some data cleaning, which is an important part of most real-world data-driven applications, and also, at the bottom, building and training a machine learning model on this data. So that's great for experimenting with Spark, and I hope that people will be able to try it out and have some fun with it. But I wanna show what an actual application with a data-driven component in it looks like, so we'll go on to that now.

The model we're gonna use in our data-driven application is called Word2Vec. Word2Vec is a really cool technique in natural language processing that encodes words as vectors, hence the name. The cool thing about these vectors, though, is that they encode some meaningful semantic information. It learns which words are synonyms because those words appear in similar contexts — so if we have words that are synonymous, they will have vectors with similar angles. And the other cool thing about these vectors is that we can use basic linear transformations on them to find analogies.
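As a minimal sketch of what fitting one of these models looks like with Spark's built-in Word2Vec — the file path, parameters, and naive tokenization here are assumptions, and the talk's own training functions are described next:

```python
from pyspark.mllib.feature import Word2Vec

# An RDD of "sentences", each a list of lower-cased words (deliberately naive cleaning).
sentences = sc.textFile("austen.txt") \
              .map(lambda line: line.lower().split()) \
              .filter(lambda words: len(words) > 0)

# Fit Spark's built-in Word2Vec model to those word sequences.
model = Word2Vec().setVectorSize(100).fit(sentences)

# Words whose vectors point in a similar direction to "darcy" (cosine similarity).
for word, similarity in model.findSynonyms("darcy", 5):
    print(word, similarity)
```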
To come back to analogies: for example, if we take the vector for Madrid and subtract the vector for Spain but add the vector for France, we get a vector that looks a lot like the vector for Paris. Now, there's an implementation of this Word2Vec model training in Spark, and we can use it in our applications. It's actually pretty straightforward to use Word2Vec in a data-driven application in Spark: these three Python functions are basically all it takes to train a Word2Vec model from a large text file of natural language that we get from a URL. In the first one, I'm simply stripping out punctuation, converting everything to lowercase, and pre-processing our text so that we can turn it into meaningful words for the input to the Word2Vec algorithm. In a real application, you'd probably wanna do some more sophisticated natural language processing, but that's a good place to start. In the second function, I'm opening up a URL, turning it into an RDD of paragraphs, and then using that cleaning function on each paragraph. And finally, the last function simply uses Spark's built-in Word2Vec algorithm to fit a model to the sequences of words we found in each paragraph of the original file.

Now, we can incorporate this into an application. In this case, I have a very simple Flask application — just a basic web application — with some routes, and I basically just interact with the Spark context and this model-training code while I'm processing requests to each of those routes. Again, I don't mean to give you a wall of source code on these slides; it's really just to give you a flavor of how simple it is to integrate these capabilities into your application. And so, as a demo of this, if my notebook doesn't crash, I'm gonna create a model, and I'm gonna call the model Austen, and the training corpus is going to be the Project Gutenberg text of all of the Jane Austen novels that are on Project Gutenberg — which is hosted on my own website, because Project Gutenberg rate-limited me earlier while I was developing this application. So I'm gonna click the "train a model" button, and it's gonna fire off a Spark job in the background. This is gonna go for probably about 15 or 20 seconds, and in a real application, we would wanna have this queued up for another service to process in the background.

Okay, so now we've gone on to a page where we can view the results of a model and find synonymous terms. Here we're querying against the model Austen, and it's gonna give us some randomly selected words and say: do you wanna find synonyms for some of these? I'll take a word that I know is in Pride and Prejudice and find synonyms for that. And if you find synonyms for Darcy, who is a character in Pride and Prejudice, you get a list of other characters in Pride and Prejudice. So it's a cool application of taking a bunch of text, training a model on it, and learning something interesting about that text that makes sense if you're thinking about the domain of the text.

Very cool, I have to say, very, very cool. All right, so that's what I have for today. Yeah, so now you've got me with visions — thank you so much, Will — of training it on my favorite data sources, the public government open source, or open gov, sites for financial reporting and corporate reporting, so that we can do that vector of criminals versus Forbes-list people on corporate reporting filings. So lots of stuff to think about.
Big data is one of those areas where the tools have just exponentially grown and made life much more interesting. So maybe we'll catch the next Martha Stewart and be back on some corporate reporting from the FDIC — or not the FDIC, the SEC. I'm totally psyched about this; this is gonna be fun. So there were a lot of questions. I was incredibly happy to see you using the Jupyter notebook and the Docker image. It looks like no one had any real trouble downloading the actual image, so that was absolutely awesome. So what I'm gonna do is — there's a bunch of people, well, participants here now: Leon and Michael and Matt and Thomas, and I see Diane Feddema, who I think is supposed to be on vacation — I'm gonna unmute everybody and see if anyone has any questions. You are all self-muted now, so if you have a question, you're welcome to raise your hand. Diane Feddema, I think if you're back from Colorado and you're in the Victoria area, we need to talk, because there's a couple of people in the Vancouver area that I'd love to get you to talk to about some of the work that you're doing around big data as well. Okay, I'll be back next week. Yes. There's a meetup of the BC Gov folks on September 12th in Victoria that I'm gonna attend, so I'm hoping that I can connect with you then — because I know, listen, this is the Canadians talking to each other now, everybody.

Thank you, and this is Thomas Reber with T-Systems. Thanks, Will, for presenting this today. Will, have you done some performance checks on a cloud with these kinds of applications? So that is a good question. We don't have an exhaustive performance evaluation yet; that's actually something we're working towards. We are using Spark on OpenShift internally, and I guess I would say we're not noticing obviously worse performance than we saw running under Apache Mesos or running standalone. Okay, cool, thanks.

Thomas, at T-Systems, are you using Spark at all yet, or POCing it in any way? Yeah, I mean, many of our customers are using it, and, you know, I'm heading the PaaS and big data practice here at T-Systems. We're combining the architectures that the big data folks put together, and we have used OpenShift for quite a while on our traditional cloud systems. And what we're facing is basically that with very large data sets, the cloud comes to its limits pretty fast. So what we're basically trying to do is put OpenShift on bare metal, and then it should not be an issue to run Spark or any other big data applications. Once you hit a certain threshold, you don't need to change your platform underneath. So that's the beauty that we see there, and probably in about six weeks' time we should be ready to announce that architecture. Awesome. So what I would love to do is have you talk about that as a reference architecture in one of the upcoming SIG meetings. I'm sure people would like to hear what you're doing there, and, you know, we can touch base if you're ready to do that in six weeks or if you want help on that. Sure, yeah. Just shoot me a quick email in like a six-week timeframe and I'm happy to pull things together, bring maybe one or two other folks onto this OpenShift session, and we'll present it to you. That would be wonderful.
So are there other topics — Thomas and Maté and Leon, you've still got yourselves on self-mute — that you would like to hear about? I have a number of other things that I'm gonna coax different people I know into talking about at the Big Data SIG over the coming weeks, just to get the background all set up, but if there are specific things or use cases that you wanna hear about, or topics like running on bare metal that you'd like to hear about, let me know so I can try and coerce people — or coerce you — into talking about them. It's easy to get ahold of me, too: I'm @pythondj on Twitter, or Diane at redhat.com, but the best thing to do is to hit the OpenShift Commons mailing list and we can have a thread on future topics for this.

All right, I'm not hearing any other questions, which is quite okay because that was a lot to chew on. What I'll probably do is drop this video in the laps of the web folks and post it as a blog post with all the links that Will gave to the image, and that should be up by probably Friday or Monday at the latest, and we'll schedule the next Big Data SIG probably in two to three weeks' time, depending on holidays as well. So with that, Will, is there anything else you'd like to say or add? Well, it is sort of hard to really give an overview of a library like Spark in a short time. So I would say, you know, if you try this stuff out and have any questions, feel free to let me know or reach out to the community and we can keep the discussion going. Thanks for the opportunity to give the talk, everyone, and thanks for attending. Well, stay tuned for more Big Data SIG topics, and email me with anything that you'd like to hear about; you can always find us at commons.openshift.org, and everything scheduled — generic talks in briefings and specific ones in SIGs — will be listed in the calendar there. And we'll see you — oh, one last thing: thank you very much for attending, and we all do love that Word2Vec app. I'm gonna go try it out now. Take care.