Hello, welcome to SSUnitech. Today we are going to start a new playlist on PySpark. Before you watch the PySpark videos, if you haven't watched the Databricks video series, I would strongly recommend watching those first. How can you find them? Go to YouTube, search for SSUnitech, open the channel, go to the playlists, and inside the playlists you will find the Azure Databricks Tutorial. You should watch those 15 videos before starting this PySpark Tutorial playlist, because we are going to use PySpark inside Databricks only. Now let's get started.

So what is PySpark? Apache Spark was introduced in 2009, and about a year later, in 2010, it was published as open source. PySpark is the interface for Apache Spark in Python: a Python library we use to work with Apache Spark clusters. Apache Spark clusters are capable of processing your data, and that data could be structured data, unstructured data, or live streaming data; a Spark cluster is enough to process all of it. PySpark allows you to write Spark applications using Python APIs. What does that mean? We will use the Python libraries to create DataFrames and write those DataFrames out to files. But PySpark is not only for that: it also provides the PySpark shell for interactively analyzing your data in a distributed environment. That means you can also play with the data inside Databricks notebooks using PySpark, and that is where we will be writing our Python code.

What is the prerequisite for understanding PySpark? You should have a basic understanding of the Python language.
If you have no idea about Python, don't worry: I will teach you a few pieces of Python code, and those will be enough to work inside PySpark.

Next, what are the key features and benefits of PySpark? Why should we use it?

The first feature is in-memory computation. What does that mean? PySpark processes the data in memory, which means your data is processed in RAM. You can imagine that if a request is processed in RAM, it will be very fast. But one question may come to your mind: if we want to process 8 GB of data and the RAM is only 4 GB, what happens in that scenario? In that scenario, 4 GB is processed in RAM and the remaining 4 GB spills to disk. So that is in-memory computation.

Next, PySpark is designed to cover a wide range of workloads. It can process your data whether the job is batch processing, ML, interactive queries, or live streaming. This whole range of workloads can be handled very easily inside PySpark, and no additional tool is required for any of it.

Next, it is very easy to integrate with other big data platforms. In this playlist I am going to teach you how to use PySpark inside Azure, but you can also use PySpark on AWS and on GCP, the Google Cloud Platform.

It is also easy and inexpensive. What does that mean? It is easy to use and not very expensive. Why not? Because once your request is raised, your cluster starts up and runs; and some time after your request completes, the cluster stops. So you are charged only while you are processing the data, and nothing after that. That's why it is inexpensive compared to other tools.
That could be Hadoop or other systems.

Next is data processing. Here we can process data in batch applications, SQL, ML, and graph processing. As I told you, we can process unstructured data, structured data, live data, graph workloads, batch workloads: everything can be processed using PySpark. These are the major benefits for which we should use PySpark.

Now, this is the architecture of PySpark, so let's see how PySpark works. For example, suppose we receive a request. Once we receive the request, it goes to the driver program, and the driver program creates the different tasks. Then the cluster manager connects with the worker nodes and starts the executors. The cluster manager takes care of starting the executors and assigning the tasks received from the driver program to the worker nodes. Once a worker node gets its tasks from the cluster manager, those tasks are processed, and once that is done the results go directly back to the driver program. So you can imagine that all the actual processing happens on the worker nodes, and we can also do parallel processing while we are processing the data.

So this is the basic understanding of PySpark. In the next videos we will show practically how we can use PySpark inside Azure Databricks. Thank you so much for watching this video. See you in the next video.