Today's topic is data processing frameworks and how we use Spark. Let's look at the agenda. We will cover three main topics: first, the market demand for data work; second, an introduction to Spark from several different angles; and last, some hands-on practice if we still have time. If I get anything wrong, anybody who has another opinion or other ideas, please correct me. Thank you.

The first topic: I know a lot of people are transforming themselves into data processing engineers, because of the high market demand, the salary, and the value you bring to your company. That's why I, originally an application developer, moved into data engineering, even though the transition has been a struggle. I looked at industry job surveys and saw something interesting purely from the salary perspective: data scientist, data engineer, and machine learning engineer salaries are roughly 30% higher than other roles. Maybe it's because the work is more complex, but for whatever reason there is huge demand for these jobs, in Singapore and around the world. Even in China, at the same level, a data engineer can earn more than 50% more than a Java developer. And as more companies need smart applications and smart systems, they will need more people working in this area. So that's why we are trying to become data engineers, or, whether we work in marketing, sales, or engineering, trying to learn something about data.

This slide introduces some business scenarios that we can solve with data. E-commerce is the most traditional and most popular one: when you buy a book, the site recommends a list of closely related books at the bottom of the page. I remember last time I ended up buying all four of the recommended books, so it is quite powerful now. The items in bold orange are the ones I have worked on myself. The first is customer analysis. A lot of people see targeted advertisements when they use Grab or other apps. I built a demo that does this kind of analysis on top of an online e-commerce system: based on all the customers' behaviour in the system, we collected the data and analysed it, for example which type of car is the most popular, and which cars people spend a lot of time looking at but never buy because the price is too high. The second is storage forecasting. We helped Shell forecast the stock they needed to distribute across different warehouses; they did not know how many products each individual warehouse should keep. We forecast it from the transactions at different locations, analysing the product history and transaction data, and then told them, for example, that in this area you need to keep this many products and in that area only half as many.
So this is based on the transaction data: we helped them do the analysis. All the others are very typical scenarios as well, for example insurance pricing, anti-fraud, micro-credit, and smart chatbots. I use some banking software, and when I have a question I have to call a service person, but many companies are now building chatbots to answer your questions, and they are getting smarter and smarter. So these are some use cases that data processing technologies can solve. Whatever area you are focused on, you can combine data technologies and analysis with your real business scenarios and see what you can improve with them. These are some typical use cases.

Before we go deeper into the technical material: how many people have used Spark before? Could you please raise your hand? How many people are using MapReduce or another processing framework? Wow, okay. Spark is popular because it is very easy to start with and very easy to use. I heard somebody here also uses Flink; Flink is newer than Spark, but Spark is more stable and many companies have already built on it, so that's why we cover Spark today: just an introduction, some fun with programming, and a conceptual overview of what Spark is and what it can give us.

So let's get into the technical part. In this subtopic we will cover some Spark fundamentals, for example the deployment model, the programming model, and some core concepts. These are some features of Spark. First question: what is Spark? When we are about to learn or use something, we ask what it does and whether it really matches our requirements. Spark is an in-memory data processing framework, and it supports several languages: you can use Python, Java, Scala, or R. Spark also supports different types of processing, so some people say Spark is an ecosystem; I would say it is a processing ecosystem, because it supports many kinds of processing: batch processing, micro-batch processing, streaming processing, graph processing, and structured processing. All of these kinds of processing combine together into Spark. Imagine a factory: batch processing is like taking one big box and processing the whole box at once; micro-batch is like splitting the big box into small boxes and processing each small box very quickly, one after another; streaming is like processing each individual product, each record, as it arrives. The granularity of the processing gets smaller and smaller. Coming back to the features of Spark, it also has a good programming model — I think it is the nicest programming model I have ever touched, because it is functional programming and it inherits a lot from Scala.
Before, with Java, I had to write a lot of classes and wire them all together just to implement something like word count. With Spark, you write the code the way you think: you follow your line of thought and the task is done. It is all pipeline-style — map, reduce, process — and it's great. Spark also supports different execution models, and this is one of its most powerful features, because many companies already have a cluster of some kind, for example a YARN cluster that is already running MapReduce jobs. If you now need Spark, do you have to set up another cluster, or bring in a new team to do the same thing? The power of Spark is that it can run under different cluster managers: it can run on a YARN cluster, so you do not need to set up a dedicated Spark cluster and can reuse your existing one; it can also run standalone, using Spark's own cluster manager; and it supports Mesos, Kubernetes, and EC2. The fanciest one, I think, is Kubernetes, a new feature published in February of this year. It utilizes your resources better and is more powerful than before, but because it is new it still needs to stabilize, so we may want to wait a while before relying on it.

The fifth feature is that Spark integrates with different upstream and downstream storage and ingestion tools. I think the vision of Spark is very clear: what it wants to do is the processing, different types of processing, and make integration easy. You can integrate with other data stores, for example MySQL databases, NoSQL databases, and distributed storage; you just add a few dependencies and then write your code against APIs that look the same as the rest of your program. The last feature is typical of Spark: it encapsulates your data as an RDD. It is as if I put your product inside a box and provide operations on the box; then, when you build a pipeline, you just call the operations on each box, which makes the programming easy. Dataset and DataFrame are further encapsulations on top of the RDD, and they are even more powerful because they support different kinds of processing: you program against one layer of the model, and underneath it negotiates with the different processing engines.

So those are some typical features of Spark. This slide is a sample of Spark code; I just want to give an overview of how we can write a secondary sort with one small piece of code. The structure is very clear: we initialize the SparkContext; then we get the data, meaning we integrate with a data ingestion tool and load it; then we process the data by calling a series of APIs; and finally we save the result to our data storage. This is very clear in Spark. Looking at the red rectangle, the first part of the code initializes the context.
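The slide's example is a secondary sort; purely as a simpler illustration (this is not the slide's code, and the paths are made up), here is a minimal Scala word count with the same four-part shape — initialize the context, load the data, process it, save the result:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // 1. Initialize the context (it negotiates resources with the cluster manager).
    val sc = new SparkContext(new SparkConf().setAppName("word-count"))

    // 2. Load the data (example input path).
    val lines = sc.textFile("hdfs:///input/books.txt")

    // 3. Process the data with a pipeline of operations.
    val counts = lines.flatMap(_.split("\\s+"))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)

    // 4. Save the result to storage.
    counts.saveAsTextFile("hdfs:///output/word-counts")
    sc.stop()
  }
}
```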
So what is the SparkContext? The SparkContext is what I use to negotiate with the cluster: I initialize a context, and the context communicates with the cluster to get the resources needed to run my job. So the flow is: I initialize the context, I load the data from outside, I process the data, and I save the data. The key lines of this code look similar from program to program; the only differences are where you get the data, how you process it, and where you save it. The other part is how you configure the processing, do performance tuning, and match your requirements more closely. That is the code sample. Any questions so far?

Let's continue with the programming model. In this subtopic we introduce the programming model — and what can we expect to learn from a programming model? An overview of the different APIs, what we can use them for, how we integrate with our upstream and downstream storage and tools, and, finally, how to use all of this to write a Spark program. Spark has two types of operations: transformations and actions. I think this is lazy evaluation inherited from Scala: transformations do not execute when you call the method, while actions trigger the execution when you call them. There are many kinds of transformations and actions, and I will not go through the details, just give an overview. One coding guideline to keep in mind: avoid shuffles as much as possible, and combine before the shuffle as much as possible, for example by using reduceByKey or combineByKey rather than grouping everything first. A shuffle costs a lot of time, and it may give you performance issues later. Everything else is fairly obvious; it is very similar to Java streams, and anyone familiar with Java streams will recognize the API right away.

This slide shows how we get data and how we save data. One thing I want to mention is that this is Spark Core; we have not touched Spark SQL, machine learning, or SparkSession yet, so to get data I just read a text file. For saving, we use the Spark Core API to write the data out: you will notice the APIs differ — some save to the HDFS file system and some save in other file formats — so you call the API that matches your requirement. As we mentioned, Spark can communicate with many types of storage; there are a great many databases, file systems, and storage systems supported. I found this list in the Databricks documentation, so anybody interested can check it there; Databricks is the company that offers Spark as a managed cloud service. Any questions so far — is this too high-level, or too detailed?
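To illustrate the transformation/action distinction and the shuffle guideline above, here is a minimal sketch (my own example, not from the slides; `sc` is the SparkContext, e.g. the one spark-shell gives you). reduceByKey combines values on each partition before the shuffle, whereas groupByKey moves every record across the network first:

```scala
// Transformations only build the lineage; nothing runs yet.
val pairs   = sc.textFile("hdfs:///input/events")          // example path
                .map(line => (line.split(",")(0), 1))

// Preferred: reduceByKey pre-aggregates on each partition before shuffling.
val counts  = pairs.reduceByKey(_ + _)

// Works, but shuffles every record and only then sums — usually slower.
val counts2 = pairs.groupByKey().mapValues(_.sum)

// Only when an action is called does Spark actually execute the DAG.
counts.saveAsTextFile("hdfs:///output/counts")
```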
So to review what we have covered so far: what Spark is, what a Spark program looks like, the two types of Spark operations — transformations, which transform your data, and actions, which execute your DAG — and the different types of upstream and downstream databases and storage. If you are considering Spark, it may already support your storage system; check the documentation in more depth to confirm it can meet your requirement. That is the first part on Spark.

In the next topic we will cover the running, execution, and deployment models that Spark supports. Before that, some Spark terminology. People who have used Spark before may already have met the Spark driver, the SparkContext, the application master, executors, and tasks, but it may not be very clear how a Spark program actually runs in YARN mode and how it distributes the tasks, so let me first explain what these terms mean. An application is your Spark application: it is made up of many jobs combined together, just like an ordinary software application. The Spark driver is the starting point: the driver initializes your SparkContext and constructs your DAG and all the preparation stages. In other words, the driver prepares your tasks and initializes your context. The SparkContext, as described before, is the entry point; it negotiates and communicates with the cluster manager. Think of it like a project: the Spark driver is like your product owner, and the SparkContext is the channel it sets up to talk to everyone. The application master is like your project manager: every time you run a program on the cluster, you get an application master. The executors are the ones who really do the work: tasks are assigned to an executor, and the executor runs each task inside a container. In other contexts it is called a container, but in Spark terminology we call it an executor. That is the relationship between them; I hope that makes it clearer. Based on this, any questions? Okay.

The next part is how we debug and deploy Spark: how we write our Spark program and deploy it to a remote server. For debugging there are two ways. First, when you are just exploring, you can use spark-shell locally and work interactively: every time you type a line you get the result back immediately, so the shell helps you discover things. Second, when you are really solving a problem and writing a Spark program, you can run it in debug mode: import the dependencies with Maven or Gradle and run it locally. For deployment, as we discussed, there are different ways — Mesos, YARN, and Kubernetes — so I will skip the details unless anybody has questions; I think it is clear. For running locally you import spark-core, but note that if you are going to run against or communicate with YARN, you may also need the HDFS or YARN dependencies. There is also a difference between Spark with Python and Spark with Scala: when you want an interactive shell in Scala you run spark-shell, and in Python you run pyspark — two different shells for two different languages. Spark Core sits at the bottom, and on top of it the different language bindings are layered to provide the APIs.
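As a minimal sketch of the local debug setup described above (the dependency coordinates and values are illustrative, not from the slides), you pull in spark-core from Maven, Gradle, or sbt and point the master at local threads:

```scala
// build.sbt (or the equivalent Maven/Gradle coordinates):
//   libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.0"
import org.apache.spark.{SparkConf, SparkContext}

object LocalDebug {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("local-debug")
      .setMaster("local[*]")          // run inside this JVM, one thread per core
    val sc = new SparkContext(conf)

    // Small sample data is enough to step through the logic in your IDE.
    val sample = sc.parallelize(Seq("a b", "b c", "c a"))
    println(sample.flatMap(_.split(" ")).countByValue())

    sc.stop()
  }
}
```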
After that, you can also use spark-submit to send your job to a remote cluster or to your standalone cluster. We are not going to go through what all of these parameters mean; anybody interested can check the Spark documentation. Let me drink a little water.

This picture is from the Spark documentation, so let me explain it. The driver program initializes the SparkContext, and inside the SparkContext it sets up the Spark configuration and the DAG. The context then communicates with the cluster manager to get resources assigned, and the cluster manager asks the worker nodes whether they are available to run this job. If yes, the worker launches an executor and its address is returned to the SparkContext, and from then on the SparkContext dispatches the map and reduce tasks directly to the worker nodes. One thing to keep in mind, from a problem we ran into before: we ran a Spark cluster in the cloud, and the cluster was installed using private IP addresses, meaning everything inside the cloud communicates over internal addresses. If your driver is on your local laptop, you can reach the cluster manager, but when you then try to talk to the worker nodes it can fail: you ask the cluster manager for worker nodes to run your tasks, it hands back their private addresses, and connecting to those private addresses from outside the cluster fails. So just be careful with that; there are ways to avoid this scenario, which we will cover later. That is the deployment model.

Sorry, to clarify: the executor is a container, and a task is a concrete piece of work — for example mapping one value to another — which is sent to the worker node and executed inside a container. Can you have multiple tasks per executor? Yes. Can you have multiple executors? Yes, but one thing to consider: if you want to run multiple executors on one physical node, you need a YARN cluster; with standalone and Mesos you can only run one executor per node. So the worker node is a physical node and the executor is a container? Yes. And you can have multiple containers on one node? Yes, with YARN. And in each container you can have multiple tasks, multiple map tasks? Yes. Just know that an executor is a process, a JVM instance, and inside it there can be multiple threads, each thread processing one task. Any other questions on this? Okay.
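To make that concrete, here is a minimal sketch (illustrative values, not from the talk) of the configuration knobs that control how many executor JVMs you get and how many task threads each one runs:

```scala
import org.apache.spark.SparkConf

// With YARN, several of these executor JVMs can be packed onto one physical node.
val conf = new SparkConf()
  .setAppName("executor-sizing-demo")
  .set("spark.executor.instances", "6")  // how many executor processes to request
  .set("spark.executor.cores", "4")      // task threads running inside each executor
  .set("spark.executor.memory", "4g")    // heap given to each executor JVM
```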
So let's move on to the YARN cluster: why do we usually use YARN to execute our Spark jobs and programs? I highlight three advantages. The first is resource utilization: YARN is very good at utilizing and optimizing resource usage. The second, which we explained on the last slide, is that it can run multiple executors on every node, which means it can run many JVM instances on a single physical node. The third is authentication: when different departments share one cluster, you publish your program into the cluster and it is processing your data — but what about other programs that want to request your resources? To run inside, or to request those resources, they need permission, and only YARN supports this kind of authentication. Those are the three typical reasons we choose YARN, and there are many other features too, for example resource isolation: the resources my program has allocated will not be shared with others, whereas with Mesos or standalone mode we don't really know — maybe they will be shared, maybe not. There are other features we won't cover here.

For YARN there are still two different modes for running your Spark program inside the cluster: yarn-client and yarn-cluster. Client mode means your driver runs on the client; cluster mode means your driver runs inside the cluster. Some people get confused about this, but the key difference is simply where your driver program runs: if the driver runs alongside the application master, inside the cluster, it is cluster mode; if the driver runs on your own machine, for example because you want to work interactively, it is yarn-client mode. The chart on the right is very similar to the earlier one; that one was abstract, while this one is concrete for YARN. There is still a cluster manager, but now it is the YARN resource manager, and the slaves are the workers — YARN is a master-slave architecture, so a slave corresponds to a worker with its executors. It is very similar, so I will skip the details.

Another big option for running Spark programs is a Kubernetes cluster. What is Kubernetes? Kubernetes is something like a Docker cluster; it helps you utilize your resources even better, and Spark's support for it is a new feature. How does Spark run inside Kubernetes? As the chart shows, the Spark app is submitted to the API server with spark-submit, then the Kubernetes scheduler schedules some pods for you — that is, it schedules some containers — and at the same time it launches your driver. Once the driver is running, it talks to the API server to assign the tasks to the different pods, the executors. It is very similar to YARN cluster mode, with some differences. Any questions so far?

A question from the audience: what should the size of the Spark driver's memory be, since the reduce job runs on the Spark driver? Actually, the reduce job does not run on the Spark driver. What the driver does is this: your whole Spark program is translated into an RDD DAG, and that work happens on the driver; your real tasks — for example going from one RDD through a reduce to another RDD — run inside your executors. The driver's memory only matters for the preparation, analysis, and optimization. There is some optimization too: for example, if you run a Spark SQL job, it will optimize and reorganize the sequence of RDD operations for you, and that happens on the driver. But the real map and reduce tasks are run by the executors. Does that answer your question?
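As a small illustration of that answer (my own example, not the slide's code; `sc` is the SparkContext): the driver only builds the lineage and plans the stages, the heavy per-record work happens in the executors, and only the small final result is pulled back into driver memory:

```scala
// Built on the driver: this just records the lineage/DAG, nothing runs yet.
val counts = sc.textFile("hdfs:///data/events")        // example path
               .flatMap(_.split("\\s+"))
               .map(word => (word, 1))
               .reduceByKey(_ + _)                      // executed by the executors

// Only these ten rows ever travel back to the driver, so the driver
// does not need memory proportional to the full data set.
val top10 = counts.sortBy(-_._2).take(10)
top10.foreach(println)
```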
So how do you size the executor — does it auto-size? You can set it, but there is a default. Let me share the executor memory layout; it is at the bottom of this deck and I was not going to cover it, but since there is interest: the executor's memory is made up of different parts — the left portion, about 30 percent, is overhead, and the right portion is the executor memory itself. There is also a floor on that overhead: when you set your executor memory, roughly 384 MB gets added on top, and the total is what the container actually needs. So that is the memory of an executor.

Another concern from the audience: what if the reduce work is very big and it overflows the memory of the executor it runs on — does it crash? Sometimes that happens, but Spark will reschedule the task in another container for you. This is also something to think about when you are doing map and reduce: the key is very important, because the partitioning is very important. Sometimes, when you partition by some key, some nodes get a huge amount of data while other nodes only get a little, and the node with too much data will crash. So it comes down to how you design your key, how you use the API to repartition your data, and how you use a hash algorithm or a key-prefix trick to make the distribution more reasonable.
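Here is a minimal sketch of that key-prefix (salting) idea — my own illustration, not the slide's code: hot keys are spread over several salted keys, aggregated, and then the salt is stripped and the partial results are combined.

```scala
import scala.util.Random

// Example skewed input: one key carries most of the records.
val pairs = sc.parallelize(Seq.fill(100000)(("hot-key", 1L)) ++ Seq(("rare-key", 1L)))

// Spread each key over 10 buckets so no single partition is overloaded.
val salted = pairs.map { case (key, value) =>
  (s"${Random.nextInt(10)}#$key", value)
}

// First aggregation runs on the salted keys.
val partial = salted.reduceByKey(_ + _)

// Strip the salt and combine the partial results into the final answer.
val totals = partial
  .map { case (saltedKey, sum) => (saltedKey.split("#", 2)(1), sum) }
  .reduceByKey(_ + _)
```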
So, back to this slide — any questions about Kubernetes so far? Someone asked: so essentially Kubernetes doesn't support interactive mode? Interactive mode — yes, as far as I know that is currently the case; it is a new feature. You submit your job to Kubernetes once your code is done, and then you can stop it.

Let's move on. A quick review of what we touched in the previous subtopic: the terminology — the Spark driver, the application master, the SparkContext, executors and tasks; the different deployment models Spark supports — the Kubernetes model, the YARN model, and the Mesos model, which we did not cover because I am not very familiar with it, so anybody who knows it well is welcome to share; the execution model of Spark on YARN, how it runs and how we can make it perform better; and finally how Spark runs with Kubernetes. That is what we briefly covered in the previous subtopic.

The next subtopic is something else. For most people who have written Spark programs before, this may be the part you care about most, because these are the questions we keep asking and the problems we keep running into. The first is how to automatically scale the Spark cluster in and out; I remember somebody asked me about this before, so I wrote it down here. If you are running in YARN mode, scaling in and out is really a property of the YARN cluster, so what you can do on the Spark side is optimize your program so that it fits the cluster better. In other words, it splits into a cluster concern and the question of how to optimize your Spark program. The second is how to optimize the Spark program itself. A lot of people care about this, because you write a program and find it runs far slower than you expected: maybe you created too many partitions, maybe you introduced extra shuffle stages, maybe you grouped everything instead of using reduceByKey or combineByKey. There are many such considerations when optimizing a Spark program. The third is how to integrate with machine learning algorithms.

This one is something the data scientists care about. As far as I can see, most of our work is on the data engineering side; for the machine learning part, the data scientist writes the algorithm and explores and analyses the data, and we prepare everything else: we build the pipeline, write the code and make it scalable, and build the data infrastructure and the data platform. We do a large part of the job, but the key question is how we integrate with the machine learning algorithm. It is very similar to processing — think of it as a subset of processing: you keep processing, you learn from the processing results, you optimize the algorithm and change its variables, and you keep learning and evaluating, so it fits inside this picture. Spark has a library called Spark MLlib to support machine learning, but MLlib has some limitations: some algorithms do not support data partitioning, so you have to run them on the full dataset, and some algorithms are simply not available in MLlib. That is why a lot of people still use Python for the machine learning part.

The next question is how to integrate with different storage systems and streaming platforms. This matters because integrating Spark with other systems is not always as easy as it looks. One experience I had: I used Spark Streaming to write files into HDFS. Spark Streaming is micro-batch streaming processing, and every batch it writes creates a new file in HDFS, so you end up with an enormous number of small files. What do you do then? In the end we wrote a separate job just to aggregate them together. The next question is how to choose the most suitable serialization and deserialization framework. Out of the box, Spark uses the Java serialization framework; when you are processing and your results get written out, for example into HDFS, that needs serialization and deserialization. The built-in framework is Java serialization, which you can use, but there are others.

Then: how to write unit tests for a Spark program. This is the question people ask me most often — how do you write unit tests in the big data area, how do you test MapReduce, how do you test Spark? The lucky thing is that for Python we have pytest, and for the JVM side somebody has already written a Spark testing framework for us, spark-testing-base, which we can use; anybody interested can search for it, and I will share the author's name later. For Python, if you want to write unit tests, use pytest.

Someone asked: is there any guideline on serializing and deserializing? Yes — Spark also supports the Kryo serializer; I think that is the name, I can't remember exactly. It is a higher-performance serialization framework, but you need to configure it manually; you can also configure other frameworks, but then you need to import the dependency. The built-in one is Java, and the other one is Kryo. Someone asked about Avro: Avro is a schema format — you can use Avro with Spark SQL — but here we are talking about object serialization and deserialization. What about Protocol Buffers? Protocol Buffers is similar to Avro: when you save your data, you can use it to give the data a structure.
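For reference, a minimal sketch (illustrative, not from the slides) of switching Spark over to the Kryo serializer mentioned above:

```scala
import org.apache.spark.SparkConf

// MyEvent is a stand-in for one of your own record classes.
case class MyEvent(id: Long, name: String)

val conf = new SparkConf()
  .setAppName("kryo-demo")
  // Replace the default Java serializer with Kryo for shuffle and cached data.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registering classes up front keeps the serialized form compact.
  .registerKryoClasses(Array(classOf[MyEvent]))
```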
So there are a lot of deep questions here; let's go through a few of them. This picture is a map of Spark processing. At the bottom there are many different storage and file systems, and on top of them sits Spark Core, where we write against RDDs, because Spark encapsulates data as RDDs. On top of that there are three different components. The first tries to optimize your job and make it perform better; for example, if you have a map followed by a filter, it can reorder the sequence so the filter runs first and then the map, that kind of thing. Spark Streaming is another framework supported by Spark, but it is micro-batch; the difference between micro-batch and true streaming processing is, I think, the granularity of the processing. GraphX is used for graph processing: imagine your Facebook network, your social network — how you store the vertices and the edges in your graph database, and then use Spark's graph processing to read the data and construct the graph in memory. On top of that sit Spark SQL and the Spark DataFrame and Dataset; as we mentioned before, they encapsulate all the types of processing and provide a single entry point into Spark.

Another pair of concepts that a lot of people confuse is SparkSession and SparkContext. The SparkSession encapsulates the SQLContext, the HiveContext, and the SparkContext; it provides a single entry point for those different contexts, and you can think of it as the second version of the SparkContext. On top of that sit Structured Streaming, GraphFrames, and ML Pipelines, which are further features Spark supports. Structured Streaming is the latest, a very new feature that tries to get closer to true stream processing; it has a lot of features in common with Flink, for example the unbounded table and processing each record as it arrives. ML Pipelines is the machine learning pipelines API, provided to help Spark support machine learning processing. All of this is built on the fact that Spark is written in Scala, so it wraps Scala to provide the Java, Python, and other APIs. Any questions about this so far?

Someone asked: based on your experience, have you encountered any types of data sets that Spark cannot handle — for example hierarchical data? In my previous experience the most common type was structured data: database dumps, CSV files, Excel files, a lot of it coming from traditional databases. Many companies are essentially trying to use Spark to process the data sitting in their MySQL databases. Those are most of the scenarios I have experienced. Any others?
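As a small sketch of the single-entry-point idea mentioned above (illustrative, not from the slides), the SparkSession is built once and gives you the SQL, DataFrame, and underlying SparkContext APIs:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("session-demo")
  .enableHiveSupport()          // optional: also covers what the HiveContext used to do
  .getOrCreate()

// DataFrame / SQL entry point (example path):
val df = spark.read.json("hdfs:///datalake/events")
df.createOrReplaceTempView("events")
spark.sql("SELECT count(*) FROM events").show()

// The old-style SparkContext is still there underneath when you need RDDs.
val sc = spark.sparkContext
```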
Okay, so this slide covers some other things you may be curious about when you put a Spark program into production. The first is job scheduling. A lot of people ask me: when I submit a Spark job it runs once — how do I make it keep running on a schedule? In earlier projects we used scheduler systems: in the first data-related project I touched we simply used a cron trigger in a Jenkins pipeline to run the script, and in the last project we used Airflow, and also other schedulers such as Oozie, though not very deeply. Airflow can probably meet your requirement if you have this need. The next one is the staging server. Why do I mention this? As before, when you run Spark on a YARN cluster you submit your job from a client, and we often use a middle server to run the submit script rather than running it from your local laptop. The pipeline looks like this: I pull the code onto the staging server, which is in the same network as the YARN cluster, package it there, and run the script to submit it. The next thing is scripts to maintain streaming jobs. Imagine you have a streaming job that keeps receiving requests — how do you handle the next release, and how do you keep it running? You need to write some scripts: block the requests, shut the job down, replace it, and resume processing. That is for streaming jobs. Another part is authentication: as mentioned before, only YARN supports Kerberos as the protocol for authenticating jobs running inside your cluster, and there are other frameworks, for example Knox, which handle authentication from the outside; this one is authentication on the inside. Based on this, any other comments or questions?

Let's continue — let me drink a little water. This is a real use case we worked on; when I left the project the pipeline looked like this, and it is more mature now. The main flow is that we use a message proxy to forward the data coming from the applications — on the left side, the black icons represent your real applications, for example your online sales system — and we send that data into Kafka. Why is there a middle piece here; does anybody know why the applications cannot send directly? It is because Kafka only speaks its own TCP protocol; it does not accept plain HTTP, so you cannot just make an HTTP call to send a request. So we use the proxy to take the requests and write the data into the different Kafka topics; we partition the data by business domain. Then we wrote a streaming job — run as a Jenkins pipeline and kept running — that extracts the data from the Kafka topics. Does anybody know Kafka? I think of it as a message queue, a bit like ActiveMQ, but you can see it as a distributed version of RabbitMQ and the other MQs. Why Kafka? I think in this case it was because our client said to use Kafka; it also means you can bring in Flume or similar tools to ingest data into Kafka — it really depends on how you choose. Someone asked: how many JSON files do you create — one per topic, or one per day? One per topic per day: the folder is the topic and the sub-folder is the date, so one file is created each day.
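The job in this pipeline was a classic Spark Streaming job; purely as an illustration (assumed broker address, topics, and paths — not the project's code, and written against the newer Structured Streaming Kafka source rather than the DStream API we actually used), the same ingestion idea looks like this:

```scala
// Requires the spark-sql-kafka-0-10 package on the classpath.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_date}

val spark = SparkSession.builder().appName("kafka-to-datalake").getOrCreate()

// Read every record from the business-domain topics.
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "sales,orders")
  .load()
  .select(col("topic"),
          to_date(col("timestamp")).as("date"),
          col("value").cast("string").as("payload"))

// Write JSON into the data lake, laid out per topic and per date.
events.writeStream
  .format("json")
  .option("path", "hdfs:///datalake/events")
  .option("checkpointLocation", "hdfs:///checkpoints/events")
  .partitionBy("topic", "date")
  .start()
  .awaitTermination()
```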
After we create all these files in the data lake, we have a batch job that runs on a regular schedule. That is based on the requirement: if you needed results every second, this architecture would not hold and you would need a new one, but here it only has to run regularly. So we kept it simple: we wrote a batch processing job and a cron job to regularly extract the data from the file system, process it, and save it into a Postgres database — you could choose any other database, but this one was decided by our client. After that you can connect with Tableau; does anybody know Tableau? It is a data visualization tool, very popular but very expensive.

These two were the challenges that left the biggest impression. The first is the small files: we thought about many ways to handle it — we considered saving into HBase and had a long discussion about it — and finally we just wrote a job to aggregate all the files, to keep it simple and fast and get it running, because our client said we needed to make this happen within the month. When everything has to move that fast, you do not always have the chance to find the most suitable architecture or make the most suitable decisions. The second was setting up the environment, which was very tricky, because we had to set it up in the company's internal infrastructure — not an internal cloud, but a cluster of internal physical machines — and they separate the machines into different zones. Everybody who has worked on a big-company project knows that for security reasons they have DMZ zones, private zones, and public zones. The tricky thing was that they gave us machines in different zones — one private, one public, and so on, with different policies per zone — and we had to use those machines to install the cluster. That means you have to get the ports opened: the HTTP port and the other ports the nodes need to communicate over. It takes a long time to identify all the ports, send the list to the infrastructure manager, and have them opened one by one, and only then can you set it up and get it running.

Someone asked: why JSON? Yes, it is JSON, simply because it is easy to manipulate and operate; no other special reason. We did think about Parquet or other more efficient formats, but it comes down to the team's skills: if you save as Parquet or ORC files you need to handle them differently, and I think you spend more time, and the developers on this part were not very familiar with those file types, so we made it easy first, with the option to change it later. So this is a project I worked on quite a long time ago. Any questions about this one so far?
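Coming back to the small-files challenge: here is a minimal sketch of the kind of aggregation job described above (assumed paths and file counts — an illustration, not the project's actual code):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("compact-small-files").getOrCreate()

// Read the thousands of tiny JSON files one day of streaming produced...
val day = spark.read.json("hdfs:///datalake/events/topic=sales/date=2018-06-01")

// ...and rewrite them as a handful of larger files for the downstream batch job.
day.coalesce(8)
   .write
   .mode("overwrite")
   .json("hdfs:///datalake/compacted/sales/2018-06-01")
```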
So we have now touched on the Spark processing ecosystem and its different processing frameworks and tools, some tips and questions we may want to think about in the future, and, as the last part, a real use case shared with you. The final item on the agenda is the practice. What's the time now? Actually, I wrote a recommendation algorithm and installed a cluster on AWS, so I can show you; it is just a very small cluster. Every time you want to scale out, you configure the DNS name or the IP address of the new node inside your cluster and then run the refresh-nodes command; the cluster will automatically detect the new node. So whether you do it manually or automatically, you can write a script so that whenever you want to plug in a node, you run the script, the cluster picks up the node, and it then shuffles the data to rebalance the whole cluster. Is the screen too small? This machine is my master; I installed the NameNode and the ResourceManager on it, and also the secondary NameNode — this gets quite deep into HDFS. You can actually separate these out and deploy your YARN cluster and your HDFS file system on different machines, but once you separate them you need to care about how they communicate, because when you process data in HDFS your YARN nodes and your HDFS nodes are now apart; there is a trade-off in how you lay out your cluster.

Let's just run one; I usually use the staging server to run this. You will see a lot of output because I turned on debug mode; if you want to see the debug information, you can change the log4j configuration inside Spark. This demo may have some problems, because I used micro instances to build my cluster to keep the cost down, so sometimes there is memory pressure: the machines are very, very small, but to run the application the framework itself also needs memory. We can see the YARN client trying to negotiate; we can check the logs to see what is happening — it is trying to communicate with the application master. There is a lot of output, and I am not sure whether it has succeeded yet; it is still running. What I am trying to show is that you can use a script to run this on your staging server: sometimes you run it from your CI/CD tools, and sometimes you run it manually if it is small, but we don't usually do that.

The last item is the practice section. I think we can skip it, because not everybody brought a laptop, right? If anybody is interested, I can share how to write tests for Spark and how to organize your code. Let me find the right slide — I haven't presented this part before — this is the one, right? Okay. If you want to write tests for your Spark program, you need to make it testable; that means you should not put all your code into one function — you should extract parts of it into separate functions, and then build your tests on that. Then you use the library written by this person, spark-testing-base; with SharedSparkContext it initializes a context once, at the beginning of the test run, and all the tests reuse that context. So this is just a unit test, and then we call your function under test. Yes — it runs Spark locally. So I think that's all for today; we'll skip the rest. That's all for today, thank you guys. Any questions or feedback? Someone asked: how fixed would the data be before you would consider…
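For reference, going back to the testing point above, here is a minimal sketch of the kind of test described — assuming the spark-testing-base library (com.holdenkarau) and ScalaTest; the function under test, values, and dependency coordinates are illustrative, not from the talk:

```scala
// build.sbt (illustrative coordinates):
//   libraryDependencies += "com.holdenkarau" %% "spark-testing-base" % "2.3.0_0.9.0" % Test
import com.holdenkarau.spark.testing.SharedSparkContext
import org.scalatest.FunSuite

// The logic is extracted into its own function so it can be tested in isolation.
object WordCountLogic {
  def count(lines: org.apache.spark.rdd.RDD[String]) =
    lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)
}

class WordCountSpec extends FunSuite with SharedSparkContext {
  test("counts words across lines") {
    // SharedSparkContext provides `sc`, created once and reused for the whole suite.
    val input  = sc.parallelize(Seq("a b", "b"))
    val result = WordCountLogic.count(input).collect().toMap
    assert(result === Map("a" -> 1, "b" -> 2))
  }
}
```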