Hi everyone, I am Chetan Khathri from Bangalore, India. I am speaking on scaling terabytes of data with Apache Spark and the Scala DSL in production. So, how many of you know Apache Spark? Scala?

So, who am I? I am lead of Data Science, Big Data and Technology Evangelist at Axion Labs, India, and a contributor to Apache Spark, HBase and the Elixir language. I co-authored the curriculum at Kutch University for IoT, Big Data, ML and AI, and previously I worked with Xcela and Nazara Games.

So, today we will speak about Apache Spark and Scala. We will talk about RDDs, DataFrames, the Dataset APIs and Spark operations, data platform components, and re-engineering a data processing platform with a case study. We will rethink fast data architecture, what the components are and how you can tune your Spark jobs, and we will talk about parallelism and concurrency with Spark.

For those who don't know what Spark is: Spark is a fast, general-purpose cluster computing system and a unified engine for processing data. It provides high-level APIs for Scala, Java and Python and it supports general execution graphs. It has different components and frameworks for different kinds of data processing: Spark SQL for structured data, MLlib for machine learning, GraphX for graph processing, and for streaming there are Spark Streaming and Structured Streaming.

For those who don't know what Scala is: Scala is a functional programming language. It supports the functional paradigm plus the object-oriented paradigm, it is strongly typed with type inference, it supports higher-order functions, and it has the power of lazy computation.

Data structures in Apache Spark: you have RDDs, plus DataFrames and Datasets from Spark 2.x. The RDD is the basic abstraction in Apache Spark. When you have a dataset and you distribute it across the Hadoop cluster, it gets partitioned, distributed and shuffled across the different executors and cores. When I talk about executors, those are the worker nodes, and when I talk about the driver, that is where the Spark context starts, the master node, with the workers as the executor nodes. So the RDD gets partitioned and distributed across the different nodes, and underneath you have a distributed file system like HDFS or an S3 bucket.

The RDD's characteristics are that it is immutable and resilient. When I talk about resilient: Spark supports two kinds of operations, transformations and actions. You have one RDD, you apply a transformation and get another RDD, you apply another transformation, and so on. Assume something goes wrong and a transformation fails. You have the ability to recreate the RDD you had earlier, because the original RDD is immutable; it never changes when you apply a transformation.

RDDs are also compile-time type safe, with strong type inference. So in an RDD you have type inference: when one function takes an integer argument and you pass a string, the compiler catches it on the spot. That is the difference between runtime and compile-time errors. If you have terabytes of data, you create a Spark job and execute it as a distributed workload, and as you know, debugging in distributed computing is cumbersome. You run your workload for 10 hours, and only after 10 hours do you get to know there is a type issue and that is why your Spark job failed.
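As a minimal sketch of that compile-time safety (the values and variable names here are just for illustration, not from the talk): if an RDD is typed as integers and you try to treat its elements as strings, the Scala compiler rejects it before the job ever reaches the cluster.

    // assuming a spark-shell style SparkContext called sc
    val ages = sc.parallelize(Seq(25, 31, 42))   // RDD[Int]

    val next = ages.map(a => a + 1)              // fine: Int => Int

    // would not compile: the elements are Int, not String
    // val broken = ages.map((s: String) => s.toUpperCase)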
So, the RDD provides type safety and strong type inference.

Lazy evaluation. Here lazy doesn't mean lazy the way a human being is lazy, and it doesn't mean slow either. When you create an RDD and apply transformations like map, filter or flatMap, Spark builds a lineage, I mean a directed acyclic graph, in the system, and until you apply an action like reduce, collect or take, it will not execute that graph. Every time you apply a transformation it just updates the directed acyclic graph, and only when you apply an action does it execute the whole graph.

So, Spark has two types of operations, transformations and actions. For transformations it supports general ones like map, filter, flatMap and mapPartitions, mathematical and statistical ones like sample and randomSplit, set-theory ones like union, intersection, subtract, distinct, cartesian and zip, and data-structure ones like repartition and coalesce; we'll talk about those in detail. For actions it supports count, reduce, collect, saveAsTextFile and so on.

So, when should you use an RDD? When you prefer writing lots of lambda functions rather than a DSL; that doesn't mean you don't care which lambda you apply to which dataset, it means you want that low-level control and flexibility over the data. Or when you don't care about the schema or structure of the data. Or when you don't care about the optimization and performance you give up, for example when you drive RDDs from a slower, non-JVM language like Python or R. What happens there is that Spark is written in Scala and exposes APIs to Python; when you perform an action or transformation from Python, the data has to be pickled and un-pickled and shipped back and forth to JVM collections before the operation runs. So use the RDD when you can live with that slower performance, or when you don't care about inadvertent inefficiencies, meaning the kind you introduce unknowingly through the order of your transformations.

For example, look at this basic pseudo-code. On the third line you can see reduceByKey and then the filter. Ideally something is wrong here, because you should apply the filter first: as soon as the data reaches Spark, reduceByKey will do a massive shuffle across the entire cluster, which gives you inefficiency rather than optimization. So really it should be line number four and then line number three: you apply the filter first and then the aggregation and the action. What happens then is that you are moving much less data on your cluster before you apply the action.

Then we come to the structured APIs in Spark. What structured APIs does Spark provide? Spark gives you the DataFrame and Dataset APIs. Why do we use Datasets? Because a Dataset is strongly typed, with the ability to use powerful lambda functions. When I talk about powerful lambda functions, I mean anonymous functions that can take functions as arguments and return functions. And Spark SQL optimizes them with its execution engines, Catalyst and Tungsten. Spark SQL and the DataFrame API are equal in terms of performance; the speed is the same.
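To make the lazy-evaluation and filter-first points from a moment ago concrete, here is a minimal sketch (the file path and field layout are invented for illustration): the transformations only build the DAG, nothing runs until the action, and filtering before reduceByKey keeps the shuffle small.

    // assuming a spark-shell style SparkContext called sc
    // the transformations below are lazy: they only build the DAG
    val pairs = sc.textFile("hdfs:///data/projects.csv")       // hypothetical path
      .map(line => (line.split(",")(0), 1))                     // (project, 1)

    // wrong order: shuffle everything with reduceByKey, then discard most of it
    // val counts = pairs.reduceByKey(_ + _).filter { case (project, _) => project == "finance" }

    // better: filter first, so the shuffle only moves the rows you keep
    val counts = pairs
      .filter { case (project, _) => project == "finance" }
      .reduceByKey(_ + _)

    // only this action triggers execution of the whole graph
    counts.collect()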
Coming back to the SQL side: when you provide SQL as a string to spark.sql, it will execute. But if you have a typo, say you misspell SELECT, that string still compiles and only gives you the error at runtime; you cannot catch it at compile time. That is very painful when you run a workload of, say, 10 GB: you kick it off, you go home, and when you come back you see it failed because of that typo.

So the Dataset is strongly typed; it shows you the error at compile time. A Dataset can be constructed from JVM objects and it uses functional transformations like map, filter and flatMap. A DataFrame is a Dataset organized into named columns; in fact DataFrame is an alias for Dataset[Row], where Row is a JVM object holding the columns.

So let's compare Spark SQL, DataFrames and Datasets and how they stack up against each other. With Spark SQL, syntax errors surface at runtime, and analysis errors also surface at runtime. With DataFrames, syntax errors are caught at compile time, but analysis errors are still at runtime. When I talk about syntax versus analysis errors: an analysis error is when you query a table that doesn't exist, or a view that isn't registered in the Spark context, or you read a column or apply a transformation on a column that isn't there. With Datasets, both syntax errors and analysis errors are caught at compile time. So if you are writing a Spark job and it gives you the errors at compile time, you know what is happening; you don't have to wait and execute the job. Analysis errors are reported before the job runs on the cluster, and that saves you a massive amount of time.

When I talk about DataFrame and Dataset, from 2016, in Spark 2.0, you have two things: untyped and typed APIs. The DataFrame is an alias for Dataset[Row], which is untyped; the typed API is Dataset[T] with a generic type.

So look at this example of DataFrame API code. It is the same data we used with the RDD. The RDD is unstructured, and if you want to make it structured you convert the RDD to a DataFrame: you say parsedRDD.toDF with columns project, sprint and numStories. Then you can apply any transformation, like a filter where project equals finance, then a groupBy with an aggregation, a sum called count, then limit 100, and so on. What you get here is compile-time checking of the syntax, plus you can do the same thing you would do with Spark SQL: the SQL version of the same transformation first creates a view, which is a virtual table inside Spark, and then runs the query. And if you execute that code with a typo in the SELECT, it gives you a runtime error. So the DSL saves you a lot of time. Those who work with Spark know you sometimes have to work nights because of this: your workload is stuck, you have to babysit the Spark jobs, go to YARN and check all the logs, look at the plan of execution, look at all the tasks that executed, and sometimes it says out of memory. I mean, you understand the pain.
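A small sketch of that difference, reusing the kind of query from the slide (parsedDF, the column names and the view name are my stand-ins, not the real schema): the same aggregation written with the DataFrame DSL and as a SQL string; a typo inside the string only blows up at runtime, while a misspelled method or a wrongly typed comparison in the DSL fails at compile time.

    import org.apache.spark.sql.functions._
    import spark.implicits._                     // assuming a SparkSession called spark

    // DataFrame DSL version of the slide's query
    val top = parsedDF
      .filter($"project" === "finance")
      .groupBy($"sprint")
      .agg(sum($"numStories").as("count"))
      .limit(100)

    // SQL-string version: the Scala compiler is happy with any string,
    // so the typo in SELECT only surfaces at runtime, as a parse error from spark.sql
    parsedDF.createOrReplaceTempView("projects")
    val broken = spark.sql("SELET project, count(*) FROM projects GROUP BY project")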
So, why do we need the structured APIs, DataFrame, Dataset and SQL, on top of RDD? Whatever you use, Dataset, DataFrame or Spark SQL, Spark builds an abstract syntax tree and goes through the same pipeline. First it generates an unresolved logical plan. Then it creates the logical plan; when I talk about the logical plan, this is where it checks whether the column names and table names actually exist. Then it creates an optimized logical plan. Then it creates multiple physical plans with a cost model; the cost model gives you a statistical estimate of how long each plan would take if executed, and Spark chooses the cheapest one, selects that physical plan and generates RDD code. So you come back to RDDs in the end, but this is not the same RDD code you would write by hand; it is highly efficient generated code. RDDs are not gone from Spark, they are still there underneath, but this generated RDD code is optimized, runs on the JVM as efficient bytecode, and is strongly typed. The difference between the RDD we talked about earlier and this one is that the first is the low-level API and this is produced from the high-level API.

So, the Dataset API in Apache Spark 2.x. Look at this basic example. We read a JSON file with spark.read.json, and then we convert the data to domain objects: we map the JSON structure onto a case class. A case class, for those who don't know Scala, is like a POJO in Java; it maps the structure of the JSON to the structure of the class, so you can do marshalling and unmarshalling between the case class and JSON in both directions. Here the case class is an employee with name as a String and age as an Int. Then we say: the employee Dataset, a Dataset[Employee], equals the employees DataFrame as Employee, which maps the case class onto the structure of the JSON. Now look at the line with the filter on the Dataset: you filter with a lambda on p.age, and you can read p.age like a Java object field. Right now we compare age against 3, an integer; if you compare it against a string instead, it is caught at compile time. You don't have to wait for runtime and execute the job; in IntelliJ or a similar IDE it throws the error at compile time and tells you this is wrong. So the Dataset API saves your time.

Some Spark developers think: why can't I just write the SQL string and execute the same thing with spark.sql? You can, but it can cost you a lot of time; with the strongly typed Dataset API, which has the Catalyst optimization under the hood, you save that time, and you know what is happening. When you use the RDD you have to say how to do it; here you say what to do. You have the freedom, you have control over the code, you have the flexibility to change the data, and you can apply any higher-order function on top of it.
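Here is a minimal sketch of that employee Dataset example (the JSON path is an assumption; the Employee fields follow what the slide describes): read the JSON, map it onto the case class, and filter with a typed lambda, where a wrongly typed comparison on age would not even compile.

    import spark.implicits._                     // assuming a SparkSession called spark

    case class Employee(name: String, age: Int)

    val employeesDF = spark.read.json("hdfs:///data/employees.json")   // hypothetical path
    val employeesDS = employeesDF.as[Employee]   // convert rows to domain objects

    // typed lambda: p.age is an Int, so comparing it with a String would be a compile error
    val filtered = employeesDS.filter(p => p.age > 3)
    filtered.show()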
Here is one more example. You have an employees table, which is an RDBMS table, and an events file, which is a Parquet file; Parquet is optimized for Spark under the hood because of its columnar structure. The first thing we do is join them on id, and then we filter the events on some date. So you read the employees table from the RDBMS, you read the events, you join, and then you apply the filter.

But think about it: if you filter first, scanning the table and filtering it, and scanning the events and filtering them, and only then join, you save time. The physical plan does predicate pushdown and column pruning, but it is still always better to filter first and then join. Sometimes when Spark jobs fail and you check the code, the framework is optimized but the code is not; you need to order your transformations in a way that helps the physical plan optimize. When you filter an RDBMS table like employees first, you really bring less data onto the Spark cluster before applying the rest of the transformations, and that can save you a lot of time.

DataFrames are faster than RDDs; you can see it in one of the charts. RDDs from Scala and especially from Python are expensive, because as I said, the data has to be pickled and un-pickled before it reaches the JVM objects, whereas DataFrame performance is the same whether you use Spark SQL or the DataFrame API, as I said earlier.

One more benefit of the Dataset API is that it is optimized for caching. The engine called Tungsten is what generates the optimized code I showed earlier, and Tungsten also helps cache data: a Dataset takes less memory to cache more data. So it is good for you if you have less memory and want to cache more data and work on it. Datasets are also fast because of the way encoding works; think of how the JIT engine for Java works, where the abstraction takes care of the order of execution. Datasets take care of encoding and decoding through encoders instead of you using the Kryo serializer or the Java serializer, and the encoders are the faster way to use the Dataset APIs.

Before we proceed further to the case study and the questions: I worked with 17 TB of data, around 6.5 billion transactions per day and 17 billion transactions of historical data. I will talk about the problems we faced and how we fixed some of the issues. Before that, if there are any questions, I will be happy to answer.

Yeah, hi. We use Spark in production; it's the main part of our program. For a recent project I read the book by Holden. Holden? The PMC member of Spark? Yeah. And you know, Scala and the Dataset API are compile-time checked. You understand, when you run a Spark job and it fails after some time, it hurts. But the problem is, and it's not exactly a problem, but when we started coding we realized that once you go to slightly more complex operations, the whole compile-time checking goes away, even in your own code. When you do a join operation you have to pass column names as strings. Yeah, we see that. So even though we use case classes, we have to go back to strings for those operations, and the compile-time safety is gone even though we use Datasets. That is one of the things we have not worked out how to solve. We have our data model, we have case classes with their functions and their methods, but when we want to call those methods we almost always go back to map on Datasets instead of UDFs. The result is that most of the time we cannot optimize the code and we cannot get the full benefit of Datasets.
Yeah, I mean, on the strings: are you using Spark 2.3 in production or Spark 2.0? 2.2. Spark 2.3 gives you higher-order functions as ready-made, high-level APIs. The benefit is that instead of creating a UDF and applying that function on the DataFrame, you can use a higher-order function: you write a Scala function, pass it to map, and get the result back, without going through a UDF. I think that is still not in 2.2, but writing a Scala function and passing it to map is supported, and that is what we did. Yeah.

Maybe the other question I can answer is: does writing our own encoders make things faster than serialization, and should we go that way? No, what I am saying is that the Dataset provides the encoders in 2.3, so you don't have to. Sometimes people set the Kryo serializer to true in spark-submit when passing the configuration parameters; it is there by default, so don't set that parameter just because you are very familiar with Spark 1.6 and are coming from that background. When you work with Spark you know it is much more art than science: there are so many configuration parameters that sometimes I have to look at the source code to see what the default value of a parameter is and then check whether it is the right setting for my case or not. Cool.

So, before the case study, about the architecture of a big data or fast data platform. I talk about fast data because the sooner you get your processed data, the sooner it makes value for your customer. You have three components mainly: a data lake, a data warehouse and a streaming message bus. The data lake you use for low cost and massive scale; say HBase, where ingestion is very fast, with fast writes, or an S3 bucket, where ingestion is essentially free and you can scale massively. The data warehouse, like Hive, gives you faster queries and transactional reliability, so you can scale your KPI queries on Hive, which scales horizontally across the nodes of the cluster. The streaming message bus is a low-latency pipe like Kafka or Kinesis, and if you want to optimize further you can use Akka, which is very lightweight and low-latency; Akka works on actors, and the actor model is a very lightweight concurrency model compared to going through threads, so we used Akka there.

Where it becomes complex is this: you have your application, you ingest data through the streaming buses like Kinesis and Kafka, the ingested data goes to the data lake, and you can move any component's data anywhere; sometimes people move data from the data lake to the data warehouse, and then you need to ingest and process data into your application on top of that. So it is complex, slow and far from ideal unless you apply some optimization principles. For example, you create a Kafka topic and ingest into HBase, and then you have product managers and departments who want to build on that data; but joining one HBase table with one Hive table is not performant. HBase is good for storing data in the form of columns; you can query it with SQL through tools like Drill, but Drill is not that scalable and not that optimized for this. Even when we tried to join one table through Drill with one Hive table, it took a lot of time, and it is not only taking a lot of time, you are also killing the infrastructure, because other Spark jobs end up on hold across the entire cluster.
So you have to think about that, and about the allocation of resources, when you design your jobs.

Now I will talk about a retail business, one of the case studies; it was kind of changing the game by re-engineering the data platform. What does the business want in retail? They want to know who is retailing, what is being retailed, and when, where, why and how it is happening. That is what the business wants; they don't care whether you use Spark or whatever else inside.

The challenges were: weekly and daily data refreshes, but Spark job executions failing on an under-utilized cluster; the scale of the data, 4.6 million events per week, plus processing historical data, a workload of around 30 TB and 17 billion transactions; a linear, sequential job execution mode with a broken data pipeline; and joining 17 billion transactional records with skewed data. Skewed data means the records are not evenly distributed, so the partitions don't fit nicely, and on top of that you have to do data deduplication of outlets and items, which was one of the KPIs.

The solution we proposed was a 5x performance improvement by re-engineering the whole pipeline: two analytical engine pipelines were proposed, a highly concurrent, elastic, non-locking, asynchronous architecture, saving the customer 22 hours of runtime, from 30 hours down to 8 hours, for 4.6 million events, and maximum performance for the historical load using the under-the-hood optimizations we spoke about earlier; the same things I am talking about here. And that helps with MDM, master data management, where you see duplicate items, duplicate records and duplicate outlets.

So you are curious how it happened, right? You can see here how it was. Most customers are on a legacy data platform: you produce the data into Kafka, you have a processed data warehouse, and this is roughly what was happening earlier with Spark and Spark SQL, producing the integrated KPIs here. You might wonder why we have Postgres here. Hive is not made for serving: you cannot scale microservices on it, you cannot expose microservices from it, because every Hive query is a hit on the cluster and takes time. Postgres also supports JSON, so you can marshal data into it and expose APIs that talk to D3 or Angular, and QlikView is the reporting tool for the dashboards. This is the core idea of fast data architecture: use the data lake when you want scale, the data warehouse when you want query performance and delivery, and the streaming bus for low latency; if you mix these three components in the right combination, with performance optimization, it helps you build a fast data architecture.

Let me give you an example of what was happening earlier. Our workloads ran in sequential mode: you execute one Spark context with spark-submit and wait until it completes, then the second, then the next, and so on. If you understand the data model well, according to the KPIs, you know what you need first and what you need to join afterwards, like outlets, items, manufacturers and transactions; obviously the sequential way takes more time. So we created this kind of dependency graph; we used BMC Control-M, but you can use Airflow or other open source tools. You know that items, outlets, organization and files are master tables, so you can start executing them in parallel.
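We parallelized separate Spark jobs through Control-M, but just to make the idea concrete inside a single application (the table names and paths are made up, and this is only one possible way, not what we actually ran): you can kick off the independent master-table loads concurrently on one SparkSession and wait for all of them before the joins that depend on them.

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._

    // assuming a SparkSession called spark; Spark's scheduler accepts jobs from many threads
    val masterTables = Seq("items", "outlets", "organization", "files")

    val loads = masterTables.map { name =>
      Future {
        // each independent load becomes its own concurrently running Spark job
        spark.table(name).write.mode("overwrite").parquet(s"/staging/$name")
      }
    }

    // only the downstream joins wait for all the master-table loads to finish
    Await.result(Future.sequence(loads), 8.hours)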
Executing them in parallel utilizes your cluster: if you have good memory, cores and executors, you can use the whole cluster. At the same time, developers make the mistake of asking for more executors and cores and assuming the workload will automatically be faster. As I said earlier, it is more art than science. The number of parallel tasks equals the number of executors times the number of cores per executor, so if you have 8 executors with 8 cores each, you can execute 64 tasks in parallel.

One more thing on configuration tuning: enable the external shuffle service on YARN, if that is your resource manager, and that lets you change the number of executors at runtime. What I mean is, the cluster is not only for you. In your business, if the product managers run their Hive queries, those also run on the Hadoop cluster. When you enable the external shuffle service and dynamic allocation, if five Hive jobs are running and you submit your Spark job, and those jobs are holding more cores and executors, then if your Spark job has priority, executors are taken back and given to it. Otherwise, if you run your job on, say, 5 TB of data with around 20 executors, 20 cores and a lot of executor memory and driver memory (the driver being the master node that holds the Spark context and coordinates execution), your job grabs a big allocation of the cluster and other jobs cannot start, because you are holding the memory and the infrastructure.

So for small jobs, if you give them a large allocation of memory, you put other jobs on hold and block the entire cluster. What we did: those Spark jobs were taking a lot of resources but were not very disk-intensive, so the executors, cores and memory on executor and driver were reduced, to allow pending jobs to execute and not block the entire cluster. Your job finishes a little more slowly, but the cluster is utilized in a way that does not hold up other Spark jobs.

This is one of the examples you can see here: the master is yarn, the deploy mode is cluster, you set the driver memory, you set spark.shuffle.service.enabled to true and spark.dynamicAllocation.enabled to true, and that helps you allocate resources dynamically on the cluster, with, for example, 38 GB of executor memory and 10 executor cores. The number of tasks that run in parallel is the number of executors times the cores. You can take two approaches: one is to reduce memory and cores and increase the number of executors, which can let you better utilize all resources without locking others out; the other is to reduce the executors and keep the same memory, which makes this particular job run slower but lets others use the cluster in parallel. The problem in practice is that five people in a team are working, and if I run my job, the others complain that there is no space left on the cluster, that this guy has taken all the memory. So first of all, you cannot use the same configuration parameters for all jobs. You have to understand the number of records, the data size, and whether the job is disk-intensive or memory-intensive (memory-intensive if you are using persist or cache, the caching mechanism, in the Spark job) and based on that you set the parameters. This is one example: we understood the workload and changed only what we needed to change.
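Roughly, the settings from that spark-submit example look like this when you build the session (the values are the ones from this particular workload, not a recommendation; the same settings can equally be passed as --conf options or --executor-memory/--executor-cores on spark-submit):

    import org.apache.spark.sql.SparkSession

    // a sketch of the configuration described above; tune per workload, don't copy-paste
    val spark = SparkSession.builder()
      .appName("historical-load")
      .config("spark.shuffle.service.enabled", "true")    // external shuffle service on YARN
      .config("spark.dynamicAllocation.enabled", "true")   // executors can grow and shrink at runtime
      .config("spark.executor.memory", "38g")
      .config("spark.executor.cores", "10")
      .getOrCreate()
    // driver memory and --master yarn --deploy-mode cluster are normally passed on
    // spark-submit, since they must be set before the driver JVM starts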
One more thing that Spark, being open source, provides is control and flexibility: you can split the physical data into partitions and decide how many files you want in each partition. Spark can end up generating lots of small files on a large dataset, which increases disk IO, memory IO, file IO and network bandwidth; reading many small files takes time, the downstream Hive queries and Spark jobs take a performance hit, and sometimes they fail with errors like container lost exceptions.

What helps is that you already know your dimensions and facts, and you know how your downstream KPIs need to slice and dice, so you create the partitions accordingly and insert partition by partition. But when you write the data it can still create many small files, which hurts performance, so after partitioning you can apply repartitioning. Repartition redistributes the data equally across all partitions and reduces the number of files. You can also use coalesce; coalesce does not shuffle, so it is faster than repartition. One more thing: if you use repartition, you need to bump up your driver memory, because the redistribution shuffles data, a bit like the way broadcast works; coalesce will not shuffle but it will still reduce the number of files you store in a partition.

Another point worth understanding: don't use streaming everywhere. It is fine to use a frequent batch mode; if your business is OK with something like a 2-minute SLA, a frequent batch is fine, because streaming can have problems with partial hardware failures, GC pauses and traffic spikes, since it still lives on the JVM with its GC and everything. Datasets also have the nice property I mentioned earlier: through the encoders they use off-heap memory, so the pressure on your heap is reduced. And kicking off a batch job means it immediately scales to the size it needs, does its work and goes away; you don't permanently hold cluster resources.

Here is an example of the historical data processing. These are all tables in Hive, with partitions. Your data comes in as POS files; you parse those files with Kafka and then dump them into the tables: outlet_by_file, item_by_file, file_errors and transaction_by_file, each partitioned by file ID. Downstream, anything keyed by file ID reads only the relevant partition, not the entire table.

Now, when we tried to join it all together, those who work with Spark will recognize this: executor lost failures, containers killed by YARN for exceeding memory limits. We wanted to dump all the data to Redshift, only the relevant partitions here, and that approach failed. The alternate approach we found after some R&D was to create a temporary view and insert all the data into a Hive internal table.
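A rough sketch of that alternate approach (all table names here are hypothetical stand-ins for the per-file staging tables): join the staging tables, reduce the number of output files with coalesce so you don't recreate the small-files problem, and write the result as a managed, non-partitioned Parquet table in Hive, which can then be bulk loaded onward.

    // assuming a SparkSession called spark with Hive support enabled
    val joined = spark.table("transaction_by_file")
      .join(spark.table("item_by_file"), Seq("file_id"))
      .join(spark.table("outlet_by_file"), Seq("file_id"))

    joined
      .coalesce(200)                     // fewer, larger output files; no extra shuffle
      .write
      .mode("overwrite")
      .format("parquet")
      .saveAsTable("staging_joined")     // a managed (internal) Hive table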
Hive has two kinds of tables, external and internal. When you drop an internal table, it deletes the table structure, the metastore entry and the data; when you drop an external table, it deletes only the table structure, not the data. So we dumped everything into an internal, managed table here, in Parquet format and non-partitioned: you read everything, join it, and write it down to Hive. Then, when we wanted to move it to Redshift, we enabled the YARN external shuffle service, enabled dynamic resource allocation, tuned the configuration parameters for the cluster, and applied the business transformations here. So whenever you apply business transformations, apply them at this stage: get the data into one table and then bulk load it to Redshift or wherever else.

You can see here about one TB of output data being shuffled, and all executors utilized on the cluster. Sometimes when you open the Spark UI you see some red or purple here, an executor lost, but Spark, as I told you earlier, has the DAG, so that part of the graph gets recomputed. How was all this possible? Because Spark and Scala are open source, the optimization and everything is in the open. If you use other tools like Informatica PowerCenter, the ETL tools, you don't have that flexibility and control: you don't know how the optimization happens under the hood, how the data is physically split or where, or what you need to apply. Here you understand your business logic and transformations, you design your transformations and partitioning accordingly, and you can control the number of files. So, that's it. Any questions?

You spoke about on premise, or was it EMR? You submit your Spark job to EMR and it takes care of it, but mostly the case is that when you use EMR, because you need to move data in and out and do some R&D to see whether things work or not, it costs you at the end on the bill for the time you hold the resources. On premise, the flexibility is that you don't have to pay again when, as I said earlier, your job runs for 8 hours, fails, and you need to tune it and run it again; that is the distributed nature of things.

My question is: I have data in S3 that is already partitioned, but when I run an EMR cluster, should I transfer the data to the EMR cluster, and will it keep the same partitioning, or what happens? You can control that from the Spark job. EMR just takes the Spark job: you build a bundled jar, pass it to EMR, and it is taken care of under the hood. I assume EMR also uses open source Spark, so it has the same control. Whatever I said applies to EMR; there is nothing you can't do with EMR. Google also has Dataproc on its cloud, which is a similar kind of service to Amazon's. OK, thank you. You can tweet me if you have any questions later or anytime.