So, myself is Ravi, I am working at Cisco as a tech lead. We have had these sessions for two days, how are you feeling? It is almost the end, correct? I am feeling tired, how about you guys? Cool. So for the next one hour this is going to be a refreshment for you; I am going to give you a good topic. I am going to talk about Apache Spark. I think most of you are aware of this buzzword, Spark. My intention for this session is just to give you awareness of what Spark is, inside and out, and why people are using it. So please concentrate and be interactive.

The agenda is very simple. For the first few slides, a few minutes, I am going to talk about big data, what it is, and then we go into the Spark overview: why we are using Spark, then we get into the depth of Spark, a bit of the architecture, and we will see some of the libraries. The agenda looks very simple, but it has 30 slides, so you have to be with me, okay.

Next: why is this not working? What is big data? I get only this small picture because I cannot see anything bigger for it. So, what is big data, what is on top of your mind? We have had these sessions for two days, correct, on big data. Anybody? Probably the Sumo Logic guys can tell. Come on guys, I am not running a quiz here. Yeah, what is big data, what is in your mind? Okay, cool, good. Exactly what he said. Yeah, cool, all are good. I will come to that slide next.

Okay, so why are we going for big data? If you look at the digital world, the collection of digital data is increasing tremendously due to various reasons. Data is being collected through devices like IoT sensors and mobiles, so many sources are coming from different places, and different types of data are coming into the picture. So organizations have started looking into how to use this big data, how they can make use of it. They are trying to see the challenges, what they can do with it. They are trying to take it and do analysis on top of it, to give recommendations or build solutions for end users, or to optimize their process operations, something like that. So many use cases are being built on top of this big data, and that is the challenge. If you see how data grows from 2010 to 2020, the predictions go up to around 40 ZB, and the amount of data generated daily is also huge; maybe the figure on the slide looks small, but I feel it will be even more.

So, as was said, big data deals with the categories of the 3 Vs, or actually 4 Vs. The first one is variety: the data could be very structured, like in Oracle, where there is a fixed column format, or it could be unstructured like JSON or XML, or it could be a file or a picture. We need to be able to process any kind of data to build use cases on it. Then there is velocity: it could be batch operations, where we have a huge amount of data from 10 to 20 years and we need to do some operation in batch, or we may need to handle streaming data, live data,
and that is very, very important in the current trend. People are looking at this real streaming data and how to do analysis on it, so your system should support both: it should handle batch and it should handle stream. The same way with volume: it has to handle everything from terabytes up to zettabytes, with petabytes in between. So you need a tremendous processing system to handle all these kinds of characteristics.

We already have a lot of existing systems, like Hadoop with MapReduce. Okay, so what are the challenges there? When people used these MapReduce operations, just to click one visualization report they had to wait 24 hours, because at the back end the MapReduce computation over the huge amount of data took a lot of time, and it had to work through complex operations like aggregations and joins. And the systems developed before Spark were specialized systems, one for each thing: for streaming, Storm came up; for batch processing there was a specialized system; for machine learning there was another system. For each module a specialized system was built, and there was no integration between them, so people had to write code here and there and then integrate everything in one place. It became complex for the developer, and for maintainability also. And expressing the business logic: if you want to write just a small word count in MapReduce, it needs some 20 to 40 lines of code, because you have to write the map and the reduce, that much complexity is there. In Spark you can write it in two lines (there is a small sketch at the end of this part).

Okay, now, why Spark? How many of you have this question in mind? Of course I am not comparing you to this gentleman on the slide; it is a man, correct? Why Spark, so much expectation, okay. So what is Spark, basically, before we go to why Spark? It is basically a parallel data processing engine, that is all, a very simple thing. It is not replacing Hadoop; the main thing is that it just replaces the MapReduce operations in your system. Why? Because it is very fast. It came up first basically for its in-memory capabilities: it is able to cache your information between operations, and whatever data you have already pulled from the data source, it is able to cache that too. It has a good fault tolerance algorithm; we will see more details in the coming slides. And as I told you, it is able to handle different types of operations, like streaming, batch, and machine learning, and you can write everything in a single code base: you have some streaming, on top of the stream you do some batch, you build some machine learning model on top of that, and you can do it all in a single file. You do not need to maintain different code bases. And of course it supports all the deployment architectures, like Hadoop YARN, Mesos, and standalone cluster mode. So people have really started using it because of these characteristics.
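To make that word count comparison concrete, here is a minimal sketch, assuming the spark-shell's built-in SparkContext `sc` and a placeholder input path:

```scala
// Word count in a couple of lines; "input.txt" is a hypothetical path.
val counts = sc.textFile("input.txt").flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.collect().foreach(println)
```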
The data sources are another important point, and the support keeps on growing: Spark already handles HDFS, Cassandra, HBase, and S3, and the plugins just keep on coming.

One more point: as I told you, in-memory processing is one of the good characteristics Spark came with. So what do we mean by in-memory? Of course many systems support it, but let me give a small brief on it. When you look at MapReduce, if you have multiple MapReduce stages, for example an iterative algorithm like PageRank where there are a lot of iterations, then in each iteration, when the results come out, they are written back to HDFS, and the next operation reads them back from HDFS. So many I/O and CPU operations are performed in MapReduce; it takes a lot of resources and the time consumption is high. Spark simplified this. We have RAM, and RAM cost is also getting cheaper, so why can't we use RAM instead of always going back to the disk? So Spark brought in this in-memory capability and experimented with it: all the intermediate results, and whatever you read from the data source, can be put in the cache, distributed across the nodes, and reused. You can see the results for a small logistic regression comparing Hadoop and Spark over 30 iterations: Hadoop takes more and more time, while Spark stays almost flat because of caching (there is a small caching sketch below). That is where the claim comes from: roughly 100x faster in memory and around 10x faster on disk, because of the different optimizations.

Okay, I will stay on this slide a bit because it gives a good high-level overview. I am sure the folks at the back cannot see it easily, so let me explain it. This slide gives a very brief picture of what exactly the Spark characteristics are. Spark supports both memory and disk: you can configure whether the data consumed from the data source is kept in memory, on disk, or in both, which is cool. And you can write Spark code in whichever language you are familiar with: Scala, Java, Python, or R. The machine learning folks will write in R or Python, the streaming folks maybe in Scala; anybody can use Spark. And there are different libraries supported. As I told you, it has a unified architecture where you can write the code for all the different modules in a single place. There is streaming, where data comes in from Kafka or Flume; it is a good stream processing system. There is the traditional SQL side, like Hive: Spark supports SQL, so you can integrate it with your BI and visualizer tools, pull the information, and work with it. It supports GraphX for the very familiar graph use cases, like PageRank and understanding connections; those kinds of graph modules are supported. And there is MLlib for machine learning, which supports different types of machine learning algorithms, which we are going to see in a few slides. These two are kind of experimental: BlinkDB is basically for approximate queries on large data, based on an error rate you are willing to accept, and Tachyon is very interesting, it lets you share in-memory data across jobs and nodes in a single place. These are the experimental ones.
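To picture the caching behind those numbers, here is a small sketch of an iterative job reusing a cached RDD, assuming the spark-shell's `sc` and a placeholder path:

```scala
// The parsed values are materialized in memory once; every later pass
// re-scans the cached partitions instead of re-reading HDFS.
val nums = sc.textFile("hdfs:///data/values.txt").map(_.toDouble).cache()
var threshold = 0.0
for (i <- 1 to 30) {
  // illustrative iterative refinement over the same cached dataset
  threshold = nums.filter(_ > threshold).mean()
}
println(threshold)
```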
The next circle deals with the cluster modes. If you are new, you can just download Spark and try it on your local system, which is what I tried. Or you can run it in standalone mode, where the master works independently without any resource manager. It will work on YARN, and it will work with Mesos; Mesos is again one of the resource managers, from Berkeley, and it is a very good resource manager. And a lot of distributions picked Spark up once it was released: MapR, Databricks, Cloudera, all of them are supporting it in their stacks, which is good. And it supports integration with different types of databases like Cassandra, MongoDB, and relational databases; recently the JDBC data source has become very good, you can talk to any type of database through the driver, you just need to supply the driver jar. So it is very cool. After seeing all this, why would I want to leave Spark? So many features are there.

This picture I am jumping to directly, just to give you a view; I am sure nobody can see it, so let me explain it. In this code I integrated streaming and SQL; it is just Scala code. From here to here is the streaming part: I open the stream with a two-second batch interval, and whatever content comes in on the stream, I register the words as a temporary table and then read them back with SQL. This is just to give you an example of how you unify the SQL and streaming parts (there is a rough sketch of this kind of code below). Similarly, you can combine SQL with machine learning: you build some k-means model, something like that, and you connect it with the SQL part, or with the streaming. You can integrate them very easily, which is not feasible across the existing specialized frameworks.

Now this is very cool: when Spark was released, they came out with benchmarking against Hadoop MapReduce. See the differences in the sort benchmark: for the same 100 TB of data, Spark took only 23 minutes where MapReduce took 72 minutes, and with far fewer nodes, roughly 200 compared to more than 2,000. You can go through it; it is wonderful. Because of this, the real world has started adopting Spark. These are the latest facts I got from the Spark Summit: the largest cluster at one of the users is around 8,000 nodes, the largest single job is one petabyte, and the top streaming intake is one TB per hour. So look at the users also; it is good. Okay, are we good so far? Thank you.
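Here is a rough, hedged sketch of the kind of unified streaming-plus-SQL code that slide shows, assuming the spark-shell's `sc`; the host, port, and table name are placeholders:

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(2))   // two-second micro-batches
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))

words.foreachRDD { rdd =>
  // register this batch's words as a temporary table and query it with SQL
  rdd.map(Tuple1.apply).toDF("word").registerTempTable("words")
  sqlContext.sql("SELECT word, COUNT(*) AS total FROM words GROUP BY word").show()
}

ssc.start()
ssc.awaitTermination()
```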
Now we will see what is inside Spark, why everybody is talking about Spark, what is inside it, what the architecture is. The backbone behind Spark is nothing but the RDD, the Resilient Distributed Dataset. Whatever data you are collecting, Spark will try to put it into this component called the RDD. What are the characteristics of an RDD? It is immutable, distributed, cacheable, and lazily evaluated; we will see each one in the next few slides.

Distributed is the very simple part: whatever data you are collecting from the source, Spark will try to partition it and place the partitions on different nodes, that is all. And your functions will work on all the partitions; that is why the operations are called coarse grained, because each operation works on all the partitions at a time. Each partition can be cached, or you can put it on disk as well. Immutable means you cannot change an RDD; you create a new RDD from an existing RDD, so every operation works on immutable data. And lazily evaluated means certain operations are not executed until an action happens, which we will see. All of these are parallel operations happening across the nodes.

Sorry, I missed one thing, let me go back. If you look at the components of an RDD, it maintains different pieces of information: the partitions, the compute function you want to apply on the RDD, and the dependencies. This is all very important information associated with an RDD, for various reasons. And as I told you, the RDD is basically an abstraction. You can create an RDD and use it directly, but based on the module you use, a concrete RDD type will be derived from it. For example, if you use GraphX you work with a vertex RDD or an edge RDD; if you are on the SQL side you work with a SchemaRDD or a DataFrame. So different RDD types come into play based on your operations, but you always work on top of RDDs, that is for sure.

Operation-wise there are two types, well, basically three. One is transformations and the other is actions. A transformation is very simple: it changes the state of your RDD, one form of the information into another, based on the function you apply. Let me give a simple example. This figure shows how an RDD is transformed step by step: a number of transformations happen, then an action, and the final value comes out. If you take the simple flatMap example, I read some content from a file and transform that content with a function that splits it into lines, and from there I derive the words. Each one of these is a transformation, and each transformation creates a new RDD on top of which the next operation works. Then there are actions: reduceByKey is a transformation, but reduce is an action. Only when an action like collect is executed do all the accumulated transformation operations actually run. As I told you, this is lazy evaluation: until the action, everything is just kept as history, and only at the time of the action is it actually executed, pulling from the data source. That is the difference between transformations and actions. The third kind of operation is caching, or persist, which just tries to cache all this information.
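As a quick illustration of that transformation-versus-action split and the lazy evaluation, a small sketch assuming the spark-shell's `sc`:

```scala
val numbers = sc.parallelize(1 to 1000000)
val evens   = numbers.filter(_ % 2 == 0)    // transformation: only records lineage
val squares = evens.map(n => n.toLong * n)  // transformation: still nothing has executed
println(squares.count())                    // action: the whole pipeline runs now
```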
Okay, so Spark's main in-memory characteristic is that there are different ways you can store the information: you can store it in memory, you can store it on disk, you can store it in both memory and disk, or you can store it off-heap, which is where Tachyon comes into the picture. And it can be serialized or deserialized: Spark uses normal Java serialization by default, or you can configure a more optimized serializer such as Kryo for the different operations. There are also the storage levels with a suffix of two, where you can configure replication, so the data is replicated to two nodes if you want (a small persist sketch follows at the end of this part). This is just a small snapshot: here I do a cache on the RDD, and if you want to see how it is happening you can use toDebugString, which shows the history of how the RDD was built and the cache information as well. I will run through an example and show it.

Dependency types: there are two, narrow dependencies and wide dependencies. This is very important; you need to understand it to make sense of the scheduling and the shuffling phase. What is a narrow dependency? If a partition of the parent RDD feeds into at most one partition of the child RDD, like a map does, that kind of dependency is a narrow dependency, and it is easy. If there is an operation like groupByKey, which requires a shuffle among all the nodes, where it takes all the information and shuffles it around, that kind of dependency is a wide dependency. This matters a lot for the decisions the scheduler takes.

This is the very high-level cluster view. What are the different components? Your code runs here, in the driver program on the client side, which holds the SparkContext. You have a cluster manager, which could be anything: standalone, YARN, or Mesos. And, importantly, all the nodes in the cluster work as worker nodes; there is no distinction like a name node or data node. Each worker node has an executor, which is where your work will actually be run. Your job is basically divided into stages: from the SparkContext the code is divided into different stages, within the stages it creates tasks, and those tasks are given to the executors, with the scheduler taking care of the assignment. That is the very high level of how Spark works on a cluster.
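Going back to the storage levels for a moment, here is the small persist sketch I mentioned, assuming the spark-shell's `sc` and a placeholder path:

```scala
import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs:///logs/2015/*")
logs.persist(StorageLevel.MEMORY_AND_DISK_SER)  // keep in memory, spill to disk, serialized
// logs.persist(StorageLevel.MEMORY_ONLY_2)     // alternative: memory only, replicated to two nodes
println(logs.toDebugString)                     // prints the lineage of this RDD
```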
Here is what I was telling you: if you take a simple job, Spark will build a DAG of the operations. It creates a DAG of the different operations and stages, it is like pipelining of the stages, and it splits the stages into tasks. Those tasks are given to the cluster manager, which distributes them to the workers, and the workers have executors which take the tasks and execute them. It is very simple. This DAG scheduler, as I told you, is very important, and it is also where Spark differs from plain MapReduce. If you look at these operations, there are three RDDs: one operation is a map over RDD1, another is a map over RDD2, and I am joining RDD2 and RDD1 here. The DAG scheduler looks at this and sees what kind of parallelism it can get: it can do these operations separately, do those operations separately, and join them later. All of this DAG information is worked out up front and pushed down to the workers. So what are the characteristics of this DAG scheduling? It pipelines functions within a stage; it is cache-aware, so based on which nodes already hold cached data the tasks are placed accordingly; and it is partition- and locality-aware. Based on all of these characteristics it builds the DAG to maximize the parallelism of the stages.

Okay, cool. Fault recovery and checkpointing, this is very, very important. If you want fault recovery in an existing system, what are the options? Any guess? In general, in any system, I have some data in one system and it may crash at any time, so what do you do? Good, replication. Any others? Snapshots, okay. Here, there is no replication. What Spark does is maintain a kind of lineage graph: from the beginning RDD through every operation to the final RDD, it keeps track of the whole history, which operations happened and in what sequence; that list of operations is maintained. If anything crashes, Spark goes to another node, because it knows the source RDD, and tells it: here is the history of operations that happened, you rebuild it. That is how Spark avoids replication. Okay, that is good, but what if there is a very big chain of transformations, n number of them, say two hours of computation? For that you have the option to checkpoint: anywhere in between, if you want to store that information to disk, you just call the checkpoint operation and it is stored. Checkpointing is used heavily on the streaming side. And this is the snapshot I mentioned: for the top-word-count example, if you print the debug string it gives the history, from which RDD each RDD has been derived; you can see it easily.

Let me go through a quick demo. A small demo: I have the string 'the fifth elephant is the premier Indian developer conference', data which I got from this Fifth Elephant itself. What I am trying to do is a word count, so the different operations are there, splitting and counting, I am caching it, and finally I am going to do a collect. Are you guys able to see it? Okay, I just executed this, but it is not actually executed yet; if you see, this is only a map operation. Now I will do the cache also; still nothing. Now, if I run collect, which is an action, oh, now it is executed. I told you, because of lazy evaluation, until this collect operation nothing is executed; these are all only transformations.
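For reference, here is a rough reconstruction of that demo, hedged, not the exact code shown, assuming the spark-shell's `sc`:

```scala
val text   = sc.parallelize(Seq("the fifth elephant is the premier Indian developer conference"))
val counts = text.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).cache()
println(counts.toDebugString)        // the lineage: which RDD was derived from which
counts.collect().foreach(println)    // only this action makes the whole chain execute
```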
In the console output you can see the execution happening the way I described: first the DAG scheduler kicks in, the memory store reports the cached data, and finally it creates the tasks and gives them to the executors. You can easily see it, along with the partitions and all of that. Does the cache happen when you do the collect? Yes, the cache is populated then; I will take the rest of the questions at the end.

This is the Spark UI, a very nice way to see all the job statuses and what happened. I had this one job, and you can see what stages were in it; inside a stage you can easily see which tasks were handled and all their characteristics, and you get good visualization as well. There are two stages here, and you can see how the transformations happened; there is a green marker which tells you that an RDD has been cached, and these are the parallel operations that happened. It is a very nice DAG visualization, which they added in a recent release; you can see it easily. Okay, back to the slides.

The Spark stack in detail, cool, so we will see each library in a bit more detail. The first one is Spark SQL. As I told you, you can use SQL within your Spark programming, together with your existing RDDs, and you can load and query data from a variety of sources like JSON, Cassandra, Hive, and Parquet; internally it supports many things, and there are external Spark packages supporting other types of data sources as well. And this is a snapshot: once you get the SQLContext from the SparkContext, if you just run something like a SELECT from a table through it, it gives you back a DataFrame. The schema is also determined automatically here; you do not need to specify it, it is inferred from the data source. It supports JDBC and Hive compatibility, which has been around for a long time. And you can write it in Java, Python, or Scala, wherever you like. Okay, cool.

The next one is the DataFrame. What is a DataFrame? It is a very good feature Spark introduced, equivalent to the Python and R data frames. How many of you are from Python or R? Oh, cool. So what is a data frame? It is like a table, columns of information, that is all. Based on your data source, whether it is a file, any database, or another RDD, you can easily derive a DataFrame. One advantage of the DataFrame is that you can pass it to any of the modules: you can use it with SQL, you can pass it to MLlib, you can pass it to GraphX; there are conversions for that. So through this it solves a lot of integration issues.
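A minimal sketch of that Spark SQL and DataFrame flow, assuming the spark-shell's `sc` and a placeholder "people.json" with one JSON record per line:

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val people = sqlContext.read.json("people.json")   // the schema is inferred automatically
people.printSchema()
people.registerTempTable("people")
sqlContext.sql("SELECT name, age FROM people WHERE age >= 18").show()
```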
And behind the DataFrame there is the Catalyst optimizer, so let me give a brief on that. In this example there are two DataFrames, people and department, and I am doing a number of aggregation operations on them: a join, a group by, an average, things like that. Because of lazy evaluation, nothing is retrieved until this is executed, and Spark automatically determines which fields from people actually need to be read. Only after going through your conditions and your select does it work out how to fetch the information from the actual data source. For example, people may have a hundred columns, and somewhere you are selecting only one column, say name; it does not need the rest, so the optimizer figures out that only name is required and fetches only name from the actual data source. To do this, the Catalyst optimizer has different stages inside, logical planning, physical planning, and code generation, where it applies rules on top of your query plan. So that is another big advantage, the optimization. And a further advantage: these two DataFrames do not need to come from the same place; one could come from streaming, one from a database or from some file, and you can still combine them. The conditions you write are a DSL, a domain-specific language, I would say. Okay, cool.

Another recent addition to Spark is SparkR. Some R folks are here, correct? Okay. For the existing R users, Spark has started supporting an R library on top of DataFrames, so if you look at the methods, you can easily do all the DataFrame-related operations. Here, for example, you read the content of some JSON file through the SQL context and register it as a temp table called people, but in SparkR syntax, and on top of that you can run some SQL and it gives you the result. It is that simple. I will just quickly show it, it will not take much time. This is the SparkR shell, and this is the code I showed; I execute it and I get the result. You can see the same kind of thing happening, the same jobs and stages. And one more thing I want to show from the previous example: under the Storage tab you can see the cached data, because I cached it in the previous example, so you can see the ShuffledRDD stored there, and where the replicas sit, everything is there. Okay, back to the slides.

Streaming. I think most of you have a doubt about Spark Streaming, how it can really be real-time streaming. Basically, it streams for a certain number of milliseconds or seconds, and whatever stream data came in during that interval, it runs a batch operation on it; that is how Spark Streaming works. It supports getting the stream from Kafka, Flume, HDFS, Kinesis, Twitter, so many sources. Once Spark Streaming receives the data it creates a DStream of the batches, and based on your operations it will write to a database or to files, whatever you want. And as I told you, it integrates with MLlib, SQL, and DataFrames, anything. Fault tolerance is supported at different levels, from the receiver down to the batch operations. And there are window operations as well: the stream is split into small intervals of a few seconds each, but if you want an aggregation over one hour, or over ten hours, you can specify a window operation, so you can aggregate the data over that whole period. So Spark Streaming does quite a lot of useful things.
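A hedged sketch of such a windowed aggregation, assuming the spark-shell's `sc`; the host and port are placeholders, and the window here is one minute sliding every ten seconds:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
val windowedCounts = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(60), Seconds(10))
windowedCounts.print()
ssc.start()
ssc.awaitTermination()
```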
Next is MLlib, so the lovers of machine learning can use MLlib. Machine learning workloads involve a lot of iterative computation, and when they compared MLlib against the earlier specialized systems for the same kind of computation, they saw the running time come down a lot. Okay, I am almost done; wait, two more slides. MLlib supports many different algorithms, you can see the list easily. And ML Pipelines is a very interesting recent addition to Spark, basically aimed at data science, where you do a lot of feature extraction, normalization, dimensionality reduction, and model training. The interesting thing is that you can pipeline the different stages: say these two stages are for feature extraction, and then you have the modeling stage with, for example, logistic regression. You put all of these into a pipeline, run it on top of the DataFrame you have, and get the model out. The reason this helps is that you can change a lot of things in between: if the model does not fit, or after seeing the cross-validation results you are not happy, you can change the model, or change some parameters or the feature extraction, and re-model. It is basically there for the data science workflow.

The last one is GraphX, basically for the graph lovers. Spark is fundamentally about parallel distribution of data, correct? When it comes to graphs, the structure is different, the connections make it quite a complicated structure. So for that kind of multigraph, GraphX has a different kind of distribution approach, vertex-cut based partitioning, and it optimizes how the graph is stored and runs the operations on top of that; different algorithms and operations are there. They compared GraphX with similar systems like Giraph and GraphLab, and they saw that the time consumption of GraphX is much lower. And this is the framework architecture, where you can see how a graph is actually stored: in different RDDs, a vertex RDD and an edge RDD, with a triplet view that sits between the vertices and the edges. The vertex RDD holds each node with its information, the edge RDD holds the connection information between the vertices, and through the triplet view you can easily see the connections between them as well. GraphX supports very interesting algorithms: PageRank, shortest paths, SVD++, triangle count, connected components, so many algorithms are there that you can use directly. You just need to build a graph and then you can use them directly in GraphX.
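To make that concrete, a tiny hedged sketch of building a graph and running the built-in PageRank, assuming the spark-shell's `sc`; the vertices and edges are made-up sample data:

```scala
import org.apache.spark.graphx.{Edge, Graph}

val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")))
val graph    = Graph(vertices, edges)
graph.pageRank(0.001).vertices.collect().foreach(println)
```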
On top of what is inside Spark, there are external Spark packages as well: many third parties have started contributing connectors and libraries around Spark, and you can see those applications on the Spark Packages site and use them directly along with normal Spark. Yeah, almost done. This list of users and distributors is only a small one, but there are a lot more, as you can see. Okay, I am done. If you want to thank Spark, start using it first, and then contribute to the community; it is a very good community, good contributions are happening, and you can also start contributing, or spread the word about Spark like me. That is good. Okay, thanks guys. Any questions?

[Organizer] Just before you leave, if you have not submitted the feedback forms, kindly drop them at the registration desk.

[Audience] I just wanted to know if you could compare Spark Streaming against Storm.

[Speaker] I have one slide on that.

[Audience] You don't need to use Storm anymore. It has effectively been replaced by Heron, which is API compatible with Storm, but Storm itself is deprecated and no longer being actively developed or maintained. If you are picking between them, you probably want to go towards Samza, which is most likely better than Storm these days. Everyone has been comparing Samza with Spark Streaming, and Spark Streaming is not yet mature for a production system, but Samza is.

[Speaker] Apart from that, on this slide you can get a more detailed comparison: basically the reliability mode, Spark Streaming is exactly-once while Storm is at-least-once, and the latency, throughput, and fault tolerance support differ between Storm and Spark Streaming. Okay.