Hello and welcome to this webinar about Spark. This is one of a number of introductory webinars about big data being run by the UK Data Service. I'm Sarah King-Heal and I work at the University of Manchester in training and user support. Peter Smyth is going to do most of the presenting today, however; he's also based at the University of Manchester and also works in training and support for the UK Data Service.

Thank you, Sarah, and we'll get straight into the webinar. Today, as Sarah just said, we're talking about Spark. We're going to look at what makes Spark different from other Hadoop products, and we'll have a brief look at the Spark framework and APIs, just to give an idea of how it all fits together. An API is an application programming interface; it's simply the way Spark is made available from other programming languages. Then, hopefully, the majority of the webinar will be spent on some examples of using SparkR, both in a Windows environment and in a Hadoop Sandbox. The Hadoop Sandbox is a single-machine Hadoop environment which allows you to pretend you have a large cluster when in reality all you really have is a single PC.

So, moving on, a brief history of Spark. Big data processing really only goes back to about 2002 with the Nutch project, which was aimed at finding ways of processing very large data sets in a reasonable amount of time using parallel processing. That's the notion of a cluster of many, many machines, potentially thousands, all working together to process a very large data set by breaking it down into small chunks and processing them in parallel. From that work, and drawing on the papers Google published describing a process called MapReduce, which is this idea of breaking the processing down across many machines and making them all work together, Doug Cutting went on to create what became Apache Hadoop in 2008, and MapReduce was the main way of processing any kind of data. The downside of MapReduce is that it is very batch oriented: you set up a job, you run the job, you set up another job, you run that job, and so on and so forth. It wasn't particularly interactive.

In 2009 Matei Zaharia at UC Berkeley wrote a paper, as part of his PhD work, on Spark. The idea was that it would combine all the different types of processing available in Hadoop into a single system, and that it would be far more interactive and a lot quicker than the MapReduce systems that were current at the time. It proved very successful, and in 2013 he co-founded a company called Databricks, whose sole purpose is to promote and commercialise the use of Spark. Spark has gone through a number of iterations since, and around that same time it was handed over to Apache to distribute under an open source licence. So when we use Spark now we actually download it from the Apache site, but a lot of the development work is still undertaken by the people at Databricks, and I believe all of the original founders of Spark are still there.

Because Spark is very new and very fast moving, it can be quite difficult to keep up to date with the documentation. There are plenty of books on Spark, but they tend to be almost out of date before they are printed because of the fast-moving environment in which it is developed.
So if you want to know about Spark, one of the best places to go is the source, if you like, which is currently at spark.apache.org, and this is the front screen you'll see there. Rather than just show you a slide, let's go to the website itself. This initial page tells you various things about Spark: its speed; how easy it is to use, which refers to the different APIs that allow you to work in different languages; and its generality, meaning you can use it for SQL, streaming and complex analytics. SQL, Structured Query Language, is almost the de facto way of processing structured data, that is, data held in tables in relational database systems. Streaming refers to real-time processing, data coming in constantly: you can think of Twitter as a source of streaming data, because new tweets are being generated every second of the day.

Another advantage of Spark is that it runs everywhere. It runs on Hadoop, as that's where it started, but it will also run in other environments, and we're going to see it running in standalone mode on a Windows machine in a few minutes. It can also access diverse data sources: simple files on your PC, which we'll see; files on other machines in a cluster, so it will interface with HDFS, the Hadoop Distributed File System used by Hadoop systems; and various other types of data source. So that's the slightly advertising-blurb side of the product.

If you come down to the bottom of the page, it also gives you information on mailing lists where you can get additional information; you can download Spark from here; and there are quick-start guides, training videos and various other things which will be useful to anyone starting off with Spark. And that's just the front page. If you want specific documentation for your release of Spark, you can go to the other resources. You can see all of the earlier releases listed here; version 1.1 is only about two years old, so all of this has come out in the last two years. The one we're going to be using is Spark 1.6.0, and we'll look at that in a minute. Again, there are lists of videos you can watch, various training materials and exercises, and a list of books. As I say, a lot of these books may already be out of date in the sense that there will be further releases of Spark, but the core elements of Spark won't change, so all of these books could potentially be of use to you. There are also examples: when you install Spark it comes with a whole series of examples which you can run yourself just to get an idea, and all of the source code is included, so you can look at how these programs have been put together and use them as the basis of your own. There's a wiki, and there are the various research papers.

If you then go into the actual documentation for the release we're using, Spark 1.6, under there we've got various programming guides covering the different parts, the different APIs, within the Spark framework, and you can look at the API documentation for the specific language you're interested in. We'll be using R later on, so if we click on that you can see a list of all of the functions available to the SparkR user. There are quite a few there, but SparkR is by far the least function-rich of the various language
dialects at the moment. There's still quite a lot you can do, though, and if you need to dig in you can look at the individual function descriptions that are provided there. If you look at the programming guides, the Spark Programming Guide is the general one; it gives you an overview of how Spark works, and all of the guides have examples in the different languages. At the moment there isn't much in there for SparkR, because SparkR is relatively new, but if you go to the SparkR API documentation you will be able to find examples.

OK, so just to come back to the slides briefly. So what is really different about Spark? I could almost have just copied that front page of the Spark site, because what Spark is aiming to be is a complete processing framework for big data. In that sense it replaces several individual Hadoop components, which I'll show you in a minute. It has APIs available for several languages, the ones we've just seen, and that makes introducing Spark into those languages much the same as adding a package to R or Python. In that sense it represents a very short learning curve for people who want to start using Spark.

It is scalable to very large data sets. That, of course, is the whole point of big data: whatever you do, you've got to be able to do it on a very, very large data set, and data sets like that can only realistically be processed on clusters. The key point about Spark, as with the other big data solutions, is that when you are writing code against your data set (in our examples here we're going to use a very simple Windows PC or a Hadoop Sandbox), the code you write is exactly the same code you would need to write if your data set were hundreds of gigabytes spread across thousands of machines in a cluster. To you it all looks the same; in fact, what it looks like is just a big table.

We saw a little about the speed: 10 to 100 times faster than MapReduce. What Spark tries to do is perform all of its processing in memory; if you can eliminate reading from and writing to disk, you are clearly going to get a massive increase in processing speed, and that is what it tries to do. Claims have been made for real-time processing and real-time responses, but we're not going to see that today on our little sandbox and Windows PC. You can also run Spark on virtually anything: Hadoop clusters, obviously, which is what it was originally designed for, but you can run it on a Windows PC, as we'll see, and if you really want to you can run it on a Raspberry Pi as well. You will have a few memory problems, but it can run.

So let's compare how Spark tries to encompass the various elements you might find in a more traditional Hadoop environment. If we start with Hadoop: Hadoop itself is little more than the MapReduce processing engine and the distributed file system, HDFS. On top of Hadoop you will typically find things like Hive, which allows you to treat your data as SQL-like tables; Pig, which is a data manipulation language; Mahout, which is used for statistical analysis and machine learning, so there's a whole set of dedicated algorithms within Mahout; Flume, which is one of several products available for processing streaming data coming into your
system; and something called Apache Giraph, which is used for dealing with graph databases. A graph database in this sense is a database structure; it's nothing to do with visualisation. Graph databases are used for things like mapping social networks and so on. So that's five products we've listed there, and the idea of Spark is that you can use Spark to do all of the types of work that those five products do for you.

The Spark framework starts with some core elements of Spark, which handle the general processing, and on top of that we've effectively got four pillars of functionality. We've got Spark SQL, which allows you to do the SQL-type processing tasks; Spark Streaming, for streaming data; MLlib, which is machine learning and statistical analysis, so that's your clustering algorithms and things like that; and then GraphX, which handles graph processing. Across all of those we have data frames, and data frames are the part of Spark which allows you to treat everything, from the user's point of view, as a table. Internally, a data frame in Spark is not quite the same thing, but it equates to what Spark refers to as an RDD, a resilient distributed dataset. The resilient and distributed parts are what allow Spark to do the distributed processing you need on a cluster, and they also provide fault tolerance in case of data loss or machines going down, which is very similar to the sort of thing Hadoop provides in HDFS. But from the user's point of view, the point about data frames is that your entire dataset just looks like a table, and to a large extent you can treat it that way. Everything else is hidden from you, because you don't really want to have to change your code depending on how big the cluster is or anything like that; you just want to process your data in a nice simple format, and tables are pretty good for that. If I just quickly put the little graphics on the slide, you can see how each of those maps onto the original components you might typically use with a standard Hadoop system; Pig doesn't map neatly onto a single Spark component, but I had to put it somewhere, and Spark does cover the basic data transformations you would find in Pig.
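To make that point about data frames concrete, here is a minimal sketch, in SparkR 1.6, of how a distributed Spark DataFrame is created from, and behaves like, an ordinary table. It assumes a Spark session has already been started, with the sc and sqlContext objects we create in the demonstrations later, and the toy data is purely illustrative:

```r
library(SparkR)   # the SparkR package that ships with the Spark 1.6 download

# An ordinary local R data frame (toy data, purely illustrative)
local_df <- data.frame(country = c("UK", "FR", "DE", "UK"),
                       income  = c(1800, 2100, 2300, 1650))

# Turn it into a distributed Spark DataFrame; sqlContext comes from sparkRSQL.init()
sdf <- createDataFrame(sqlContext, local_df)

printSchema(sdf)   # Spark has worked out the column names and types
head(sdf)          # the first rows come back as a normal R data frame
```

Exactly the same calls work unchanged whether sdf holds four rows on a laptop or billions of rows spread across a cluster, which is the point being made about data frames above.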
Now, in the documentation we saw the various languages you can use with Spark. Spark was originally written, and still is written, in a language called Scala, and Scala is very similar to Java. In Scala you can write anything: any Spark code, any piece of Spark functionality, is available from Scala, and it's much the same with Java. We can also use Python, and now we can use R, and the reason we can do that is that for each of those languages there is a specific Spark API which has been produced to integrate Spark with that language. Scala was the first language, so it has complete functionality, and when I say functionality I mean all of the facilities of the Spark libraries that you can use from your code; because Java is so similar to Scala, it also tends to have full functionality. Python, or PySpark as it's called, came a bit later and doesn't quite have everything, although the main components are there; frankly, you would have to be doing something pretty complex and unusual to find something you can't do in PySpark and would have to do in Scala, so it's not really much of a restriction using Python. And of course it's a big bonus if you're already used to Python, because of the way the Spark libraries for Python are integrated: it just looks as if you've installed another package, another library, into your Python environment.

It's similar with SparkR. SparkR is the newest addition to the suite of APIs, if you like, and again it's written in such a way as to integrate completely with your existing R environment, as we'll see in the examples I'm going to give. Unfortunately it's not yet as mature as the others: at the moment there's no graphics integration with ggplot or the other things you might want to use in an R environment, and there's a rather limited choice of statistical analysis algorithms, but that will undoubtedly change as Spark progresses.

So, the demonstrations. I'm going to start by showing you Spark running in a Windows environment, just using what's called the shell; there is a Scala shell, a SparkR shell and various others, but we'll use the PySpark shell for that. We're then going to look at running SparkR in RStudio in the Hadoop Sandbox, and we'll do various little examples using that. Finally we'll run RStudio on Windows with SparkR, and we'll look at a couple of different types of file you can load, a JSON file, and we'll load something from HDFS.

Just before we get into the demonstrations, a bit of information about the main data set we'll be using. This is a data set available from the UK Data Service; if you want to search for it, it is SN7348, the European Quality of Life Survey, 2012, and it's a very simple procedure to download it. It has approximately 105,000 records with 484 variables. It's a survey that has been carried out every four years since 2003, collecting data on a range of topics: employment, family life, work-life balance and so on. You can download it from the UK Data Service website. Of those 484 variables we're actually only going to use five, and one of the examples we're going to give is how we extract our five variables from the 484.

OK, so we'll leave the slides for the time being and start on the demonstrations. The first demonstration uses the shell, which is pretty basic stuff, as you might expect. What we need for that is a command prompt. I've got a command prompt here on my PC; if you don't have that set up, you can just type "command" into the search panel down here and click on Command Prompt. This is a standard Windows command prompt and, it being Windows, hardly anyone actually uses the command prompt, so it's quite possible you haven't seen this before. From the command prompt, because I've already installed Spark, I can just type in pyspark. Over the next few weeks we will actually be releasing a video and a guide on how to install Spark on your Windows machine, if you want to try that for yourself and follow along. When it all loads up you just get a little splash screen, I suppose, saying what version of Spark you've got; we've got the default version, 1.6.0.
You then just get a little prompt and nothing else, although it does tell you some of the things it has done: it has a Spark context available as sc and a Hive context available as sqlContext. For this, all I want to do is demonstrate that it actually works. What I've got in this Notepad window is some PySpark code which is going to do a word count. The word count program in big data is the equivalent of a "hello world" program in almost any other language: whenever you learn a new language, the first thing you do is write a little program that prints "hello world" back to you, and in big data what you do instead is a word count over some document or set of documents. In this particular case I've got a copy of The Origin of Species on my PC. Of these few lines, these two here do all of the work in the word count; the next line converts the output, the RDD construct, back into a native Python representation; and the last two are just standard Python to print it out onto the screen. So I'm just going to select those, go back to our command prompt and run them. The first line reads the file, the next two actually do the transformations, starting by getting rid of the commas and the full stops and so on, and then it just prints the results out, which takes quite a while; obviously if we were doing this for real we would send the output to a file. You can see, even from this, that I've got individual words, each with a count at the end. I'll just give it a few more seconds: it will get to z, because I know there are some z-words in The Origin of Species, zoological and so on. So that's your basic word count, and it gives us confidence that our system is working. That's really all we need the command prompt for, so we'll close that down.

The next example uses the sandbox. When you load a sandbox it looks something like this, but you never really bother with this screen; all you're interested in is this IP address, which I have used up here (you can see the same address there), and the port, which tells it I want to use RStudio. If you've used R before you'll almost certainly have used RStudio, and this looks exactly the same, except that I'm using it from a web browser because the machine is effectively remote. It's actually hosted on this same PC, but the address up here could just as well be the address of a real Hadoop cluster somewhere in the cloud, or a very large on-premises one; it makes no difference. It just happens to be a sandbox in this particular case.

Now, unlike the command line, where the commands pyspark or sparkR have a bit of code behind them which does a certain amount of setting up for you, because we're using standard RStudio that setting up isn't being done for us. So the first few lines of our script are effectively just setting those things up. When I run that, I get a few messages down here, and eventually, when it finishes, what I end up with in the Environment pane are our sc, our Hive context and our SQL context, very similar to what we were getting before when we were running at the command line.
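For anyone following along, the set-up lines at the top of the script look roughly like this. This is only a sketch: the SPARK_HOME path, the master setting and the spark-csv package coordinates are assumptions that depend on your own installation (on Windows, SPARK_HOME would point at the unzipped Spark 1.6.0 folder instead):

```r
# Tell R where Spark lives and where the bundled SparkR package is
Sys.setenv(SPARK_HOME = "/usr/hdp/current/spark-client")   # assumed sandbox path
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)

# These are the objects that appear in the RStudio Environment pane.
# "local[*]" runs Spark inside this machine; on a real cluster you might use yarn-client.
sc <- sparkR.init(master = "local[*]", appName = "EQLS-webinar",
                  sparkPackages = "com.databricks:spark-csv_2.10:1.4.0")
sqlContext  <- sparkRSQL.init(sc)
hiveContext <- sparkRHive.init(sc)   # only where Hive is available, as on the sandbox
```

The spark-csv package is only needed because the EQLS file we read next is tab-delimited text rather than JSON or Parquet.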
It's just telling us that those are there. The next three lines are really just set-up for the graphs we're going to draw later on; that's just standard R, not Spark at all.

The first thing we're really going to do is read our data set. This is the EQLS file I mentioned before. It's actually a tab-delimited data set, which is why I need to specify a tab delimiter there; it's got headers in it, so I say it has headers; and I want this read.df function to work out what the schema is. The schema is just the layout of the variables, the names of the variables and so on. So all of this, which is really a single statement, is just going to read the file in for me. As soon as I've read the file in, I'm going to select, from that file of 484 variables, just the ones I'm interested in, which I neglected to show you in the slides, so let me just go back to the slides for a minute and show you that missing screen with the five we're going to use. We've got Y11 country, which is 35 different European countries numbered 1 to 35; normal working hours for your main job; how much you use the internet; whether your household is able to make ends meet, which is listed as an ordinal value between one and six, where one means we can manage very well and six means we struggle to manage, so it is ordinal, but we're going to treat it as a continuous variable for the purposes of this demo; and Q63euro, which is your monthly income converted into euros, because obviously not all 35 countries use the euro. So we make that selection of those five, and when we've done that, for this data frame we're going to say show me the first few records, six if you like. Notice that head is exactly the same command as you would use in native R, but in this case head is going to work on df2, which is a SparkR data frame. I'll just run that now. When it runs you'll again get various messages coming out telling you the progress; this is running on the sandbox; and when it's finished running you can see the first six records of that data frame.

If we look up here, df and df2 are listed as "formal class DataFrame". Unlike normal R in RStudio, if these were ordinary R data frames you'd have the little icon up here and you'd be able to click on it and see the table. You can't do that with these, because as far as R is concerned they aren't really tables at the moment, just an internal representation, but you can still run head on them, and you can do other things as well.

So let's consider an aggregation. From our 35 countries, suppose we want to know how many records there are for each country. This is the code we're going to run; I'll just start it off and describe it. Basically it's just a normal summary: by country, a count of how many records there are. Notice how fast that ran. The reason it seemed to come back instantly, despite the calculations it is in theory performing, is that it hasn't actually done anything yet. All SparkR has done is keep track of what needs to be done, and it's only when it is actually asked to return something to the user that it will get around to doing the work.
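The reading, selecting and aggregating steps just described might look roughly like this. The file name and the five variable names here are illustrative stand-ins; the real names come from the EQLS codebook:

```r
# Read the tab-delimited EQLS extract, letting Spark infer the schema
df <- read.df(sqlContext, "eqls_2011_2012.tab",
              source = "com.databricks.spark.csv",
              delimiter = "\t", header = "true", inferSchema = "true")

# Keep just the five variables of interest
df2 <- select(df, "Y11_Country", "Y11_WorkingHours", "Y11_InternetUse",
                  "Y11_EndsMeet", "Q63euro")

head(df2)   # same call as base R's head(), but run against the SparkR DataFrame

# How many records per country?  This comes back instantly because Spark only
# records what needs doing; the work happens when a result is actually requested.
country_counts <- summarize(groupBy(df2, df2$Y11_Country),
                            count = n(df2$Y11_Country))
```

The collect call in the next step is what finally forces that work to run and brings the result back as an ordinary R data frame.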
This collect command here is the way, within SparkR, of turning the formal class DataFrame back into a standard R data frame, which is what we're going to do here, and as soon as we've done that we're going to plot the results using ggplot. The ggplot part is just standard R code; we're doing the collect to turn the result into a standard R data frame. When that runs we can see down here that we have a very simple little graph, and you can also see that the collected country-counts object is now just a standard R data frame, with the little view icon, so we could look at it if we wanted to, although there's not much in it.

Now, other things we can do. I'm just going to collect the df2 data frame, our original data frame, which still has all of the records in it, and do a quick little plot of monthly income, that's the Q63euro variable, against how well people feel they can manage on that income. When that finishes running we get a very, very simple graph, and really the only point of showing you this graph is that I've clearly got some very odd outliers up here which are going to make a mess of any kind of analysis I do. So the next thing we might want to do is filter, to try to get rid of them. These statements down here are filter statements; I'm just filtering back into the same data frame, and they are very similar to what you would do in normal R: I say what the data frame is, I say which variable I'm interested in, and I put some kind of condition on that variable. Again, this is probably a better example of the same point, in that I'm running four different filters and yet when I click Run it seems to come back instantly; the reason is that it hasn't actually done anything yet, and it's only when I run the collect statement that we get the pause and it actually does some work. I'm also going to add in another plot, exactly the same plot, at the end, so we can see what effect all this has had. Again it's not a brilliant graph, I know, but I've clearly got rid of the outliers and spread things out a bit.

On that basis we're going to proceed and do a little bit of modelling, and what we're going to do is fit a GLM. Again, glm is exactly the same name as you would use in standard R, but this is actually the SparkR version of glm, and therefore you give it a SparkR data frame, df2 here, rather than a normal R data frame. Other than that, the format of the command is exactly the same, and the result is a SparkR object rather than a normal R one, but I can still run a summary on it, so if I run that it is almost as if you had ordinary data in R and were running an ordinary glm-type function call on it.

Now, the data set we're using, or started with, is only about 112 megabytes in size, and I mentioned earlier that the point about the integration with the likes of R and Python is that the code you write is exactly the same. Our df2 is relatively small, 6,419 records and five variables, but if it were a thousand times bigger and you needed to process it on a large Hadoop cluster in order to get anything out of it, the code that you would write would still be exactly the same: you wouldn't have to worry about how it actually did the work, or how it made sure all the bits came back together, and so on.
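Sketched in code, the collecting, plotting, filtering and modelling steps described above might look like this. The column names, the filter thresholds and the Gaussian family are illustrative choices rather than the exact ones used in the webinar:

```r
library(ggplot2)

# collect() forces the work to run and returns an ordinary R data frame
r_country_counts <- collect(country_counts)
ggplot(r_country_counts, aes(x = Y11_Country, y = count)) +
  geom_bar(stat = "identity")

# Filter out the implausible outliers; same idea as subsetting in base R
df2 <- filter(df2, df2$Q63euro < 20000)          # monthly income in euros
df2 <- filter(df2, df2$Y11_WorkingHours < 100)   # weekly working hours

# Quick scatter of income against ability to make ends meet, after filtering
plot_df <- collect(select(df2, "Q63euro", "Y11_EndsMeet"))
ggplot(plot_df, aes(x = Q63euro, y = Y11_EndsMeet)) + geom_point()

# SparkR's glm: same formula interface as base R, but the data is a SparkR DataFrame
model <- glm(Y11_EndsMeet ~ Q63euro + Y11_WorkingHours + Y11_InternetUse,
             data = df2, family = "gaussian")
summary(model)

# Predictions come back as another SparkR DataFrame with a 'prediction' column
preds <- predict(model, df2)
head(select(preds, "Y11_EndsMeet", "prediction"))
```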
At the end you just get your normal statistical-type results back. You can also get some predictions out of this and look at those results as well; that works in a very similar, almost identical, way to how glm and predict would work in standard R, except that we're using a SparkR data frame, and it wouldn't matter if df2 were actually a thousand times bigger than it is, provided you were running on a cluster.

OK, so finally, our last demonstration is again going to use SparkR, but this time on the Windows machine, which I seem to have shut down, so I'm just going to start RStudio up again. This is RStudio running on my Windows PC, but again it's a standard Spark environment, so it all looks the same, and these first few lines that I'm going to run are almost identical, except that I've got a different location for my SPARK_HOME variable, because now this is on my PC. So I just run that, and over here we get exactly the same thing; there's no Hive context this time, because Hive hasn't been integrated into this setup, but other than that it's exactly what we were getting before.

The first thing I'm going to do is read a JSON file. This is a small JSON file sitting on my PC, and it contains some tweets from or about the UK Data Service which I collected a while back. You can see from here that it has been read in as a formal class DataFrame, just as before. One of the nice things about JSON files in SparkR is that it will actually work out what the structure of the data is, and you can get it to tell you. This here is the structure of a tweet; I've got one in this file here, just a cut and paste of one I did earlier, and you can see how long and complex a structure a tweet has. The problem with this is that, if you're used to using tables, it can be quite difficult to extract information from it, because each level in here would, in relational database terms, represent a new table, and you would have to have links to those tables and so on. This is probably why tweets are generally considered unstructured data and why you don't tend to use relational database systems to store and process them. But we've just read it into a SparkR data frame and we can look at the schema. Just to show you how this would look, what I'm doing here is showing the structure of the "entities" field, and you can see from the output that entities itself is full of sub-levels and so on, so extracting things from it by hand could be a bit painful and tricky.

Fortunately Spark, and SparkR, can deal with that. Despite knowing how complex this structure is, we simply tell SparkR that this is a table, in this line here, and having done that we can use the SQL context in SparkR to treat it as a table and use SQL-like language to select variables from it. So here we're going to select created_at, which is at the top level; id, which is at the top level; text, which is also at the top level; but we're also going to select something called entities.user_mentions.name, which is three levels in and uses dot notation to indicate the different levels. You just specify it as if it were a variable with a rather long name, and having done that we're going to put the result into a table and look at it.
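The JSON and SQL steps, and the HDFS read used at the end of this demonstration, might be sketched like this. The tweet field names follow the standard Twitter JSON layout, the geography file is assumed to be delimited text like the EQLS one, and the file names, HDFS hostname, port and path are placeholders for whatever your own capture and sandbox use:

```r
# Read a file of tweets stored as JSON; Spark works out the nested schema itself
tweets <- read.df(sqlContext, "ukds_tweets.json", source = "json")
printSchema(tweets)                # shows the full nested structure of a tweet

# Register the DataFrame as a table so it can be queried with SQL,
# using dot notation to reach into the nested levels
registerTempTable(tweets, "tweets")
mentions <- sql(sqlContext,
                "SELECT created_at, id, text, entities.user_mentions.name
                   FROM tweets")
head(mentions)   # 'name' comes back as a list column, one entry per mentioned user

# The same read.df call can point at a file held in HDFS on the sandbox,
# as in the final part of the demonstration (hostname, port and path are placeholders)
geog <- read.df(sqlContext,
                "hdfs://sandbox.hortonworks.com:8020/user/demo/geography.csv",
                source = "com.databricks.spark.csv",
                header = "true", inferSchema = "true")
head(geog)
```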
When we run that, what we get back is again essentially just a table with our four variables. The name column, which came from the embedded field, is shown as a single entity; in fact it is a list, because a tweet can mention multiple names, but again you can drill down to whatever level you need, and the list can be processed as a normal R list.

The last thing we're going to do, just to show that we can do it: this is SparkR running on the PC, but that doesn't mean the files you process have to be on the PC. In this case we're going to read another file, called geography, and this file is actually located in HDFS on the sandbox we were using before. So it's going to go to the sandbox (this is the address of the sandbox), to that location within HDFS, read the data in exactly the same way, and I'm just going to bring it back and have a look at the beginning of the file. The point of showing this is that, almost regardless of how or where your data is stored, there is a way for SparkR, or any of the other Spark dialects, PySpark, Java Spark or Scala Spark, to read that data and process it in a quite uniform manner. And this is just the first few lines of that data.

So, back to the slides now, I think, just to summarise. I mentioned that Spark is still very much in development, and SparkR even more so; you can certainly expect one or two releases a year, each adding new functionality, and SparkR certainly has some catching up to do. At the moment there is something called SparkR ggplot2, which you can download from GitHub if you're familiar with GitHub, and one would imagine that eventually it will find its way into the mainstream SparkR libraries. I also pointed out that the glm I showed working a minute ago is effectively the only statistical analysis you can do in SparkR at the moment. You can do lots of other things in PySpark if you want to, but one would have to assume that, as time goes by, all of the algorithms currently available in PySpark and Scala Spark will come across into SparkR, at which point they will be structured in such a way that using them from R, or SparkR, is exactly the same as using their equivalents in the R libraries. So it represents an almost minimal learning curve if you're already familiar with R.

So, in summary: Spark is relatively new and still in continued development. The idea is that, no matter what your needs are in big data processing, Spark will be able to meet them using one or other of the provided API libraries. We've used it in two different environments here, and of course in a commercial environment it will be used on large Hadoop clusters. It integrates very easily with programming languages such as Python and R, and I think from the demos we've done that should be quite apparent. As for more information, we looked at a lot of it: we looked at the documentation and the programming guides, and I told you that when those guides give examples they give them in all of the languages.
So regardless of what you like to program in, you should hopefully be able to find an example of how to do a particular task in the language of your choice. I mentioned the Spark books, and there was a little list of them in that documentation section, but you need to be aware that, because Spark is so fast moving, some of them can be quite out of date, or at least lacking the latest information; the core material will be the same, but, for example, it's very hard to find a book on SparkR, so you may have to go to the website and look things up from there. And certainly, if you are going to search for information, it's always a good idea to specify the language you want to find examples in.

I think we'll end the webinar there. Thank you, Peter, that was very good, and thank you everybody for coming. Thank you. Bye.