Hey guys, welcome to SSUnitex. Today we are going to start a new playlist on Azure Databricks, and this is the introduction video. If you don't have any prior knowledge of Azure Databricks, don't worry: in this playlist you will come to understand each and every concept. So let's get started.

Before starting with Databricks, we should first understand Apache Spark. Why is Apache Spark so important? Because behind the scenes of Databricks, it is Apache Spark that is doing the work. That's why Apache Spark matters so much when trying to understand Databricks.

So what is Apache Spark? Apache Spark is a lightning-fast unified analytics engine for big data processing and machine learning. It can process big data whether that data is structured or unstructured; it has the ability to handle both. And on top of that, we can also implement machine learning with it.

Next, as I told you, it is 100% open source under the Apache license. It was created in 2009 and was released as open source in 2010.

Next, it has a very simple and easy-to-use API, which we will see when we get to the practical work on Apache Spark.

Next, it is an in-memory processing engine, meaning it processes data in memory. Think of it like this: if I ask for your own phone number or your family members' numbers, you can answer directly from memory. But if I ask for a number that is not in your head, you have to take out your phone, unlock it, search for the contact, and only then tell me the number.
That is the difference between in-memory processing and disk processing. Anything available in memory can be accessed very quickly, while anything that has to be read and processed from disk takes a bit more time. Spark is very fast precisely because it processes data in memory.

Now one question may come to mind: if it processes in memory, and your machine has 4 GB of RAM but your data is 20 GB, how will it be processed? In that scenario, 4 GB will be processed in memory and the remaining 16 GB will be handled on disk. That is how it works.

Next, it is a distributed computing platform. What does that mean? Distributed means that when we send a request to access data or run a task, Spark splits that work into many smaller tasks. For example, suppose we request 100,000 rows for a certain date range. That request will be distributed across different nodes, and the cluster takes care of all of this. I have not discussed clusters and nodes yet, so don't worry for now; simply understand that your request is split into small tasks that run in parallel, which is why we get the data quickly. That is the distributed computing platform.

Next, it is a unified engine that also supports SQL, streaming, ML, and graph processing, which we will see in upcoming videos. Next, it integrates closely with the other big data tools.

Now we have to understand Azure Databricks. Databricks was originally its own product; then the developers from Databricks sat together with the Azure developers and created Azure Databricks.
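The splitting idea described above can be sketched in plain Python. This is not Spark itself, just an illustration of how a 100,000-row request might be divided into partitions that are then processed in parallel, the way a cluster distributes tasks across nodes:

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_partitions(total_rows, num_partitions):
    """Divide a row range into roughly equal chunks,
    like Spark splits one job into many small tasks."""
    size = total_rows // num_partitions
    partitions = []
    for i in range(num_partitions):
        start = i * size
        # The last partition absorbs any leftover rows.
        end = total_rows if i == num_partitions - 1 else start + size
        partitions.append(range(start, end))
    return partitions

def process_partition(rows):
    # Stand-in for the real work one executor would do on one node.
    return sum(rows)

partitions = split_into_partitions(100_000, 4)

# The "cluster": each partition is handed to a free worker in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_results = list(pool.map(process_partition, partitions))

# The partial results from all workers are combined into one answer.
total = sum(partial_results)
```

In real Spark the same pattern applies, except the workers are executor processes on separate machines and the scheduling is handled by the cluster manager rather than a thread pool.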
Azure offers this as a first-party service, but Databricks can be accessed on other platforms as well, like AWS and GCP. On those platforms you can also use it; Azure Databricks, however, is accessed inside the Azure portal. We will see all that.

Inside it we have different components. The first component is the cluster. The cluster is the backbone of Apache Spark. For a cluster we can decide how many nodes there will be and how many executors there will be; you will see this in the practical when we create a cluster in an upcoming video. The cluster takes care of everything: suppose we request some data or submit a task for processing. The cluster receives it, decides which nodes are free, and distributes the work across those nodes accordingly. So the cluster is the backbone.

The next component is the workspace, with its notebooks. Inside Azure Databricks we work in a workspace, and under that we have notebooks. These are like the Jupyter notebooks you may already be aware of, and here we can write code in Python, Scala, R, and SQL; all of those languages are supported.

Next, it has administrator controls. Next, optimized speed: as I told you, the first thing is that it processes in memory, and the second thing is that it distributes work into small tasks for parallel processing. That is why the speed is very good.

Next, we can also create databases and tables in Apache Spark under Azure Databricks; we will see those in upcoming videos. We can also implement Delta Lake. What Delta Lake is we will see in detail in upcoming videos, but for now just understand that if we want ACID properties or versioning, that is the purpose for which we use Delta Lake.
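The versioning idea behind Delta Lake can be previewed with a toy Python class. This is purely illustrative and assumed for this sketch; real Delta Lake stores Parquet data files plus a transaction log, and time travel is done through the Spark API, which we will see in the Delta Lake video:

```python
class ToyVersionedTable:
    """Toy illustration of Delta-style versioning: every write commits
    a new table version, and older versions stay readable (time travel)."""

    def __init__(self):
        # Version 0 is the empty table.
        self._versions = [[]]

    def append(self, rows):
        # Each write is an atomic commit producing a new immutable snapshot;
        # readers never see a half-finished write (the ACID idea, in miniature).
        new_snapshot = self._versions[-1] + list(rows)
        self._versions.append(new_snapshot)

    def read(self, version=None):
        # Read the latest version by default, or any historical one.
        if version is None:
            version = len(self._versions) - 1
        return list(self._versions[version])

table = ToyVersionedTable()
table.append([{"id": 1}])   # commits version 1
table.append([{"id": 2}])   # commits version 2

latest = table.read()             # sees both rows
as_of_v1 = table.read(version=1)  # time travel: only the first write
```

The point to carry forward is just the mental model: writes create versions, and any past version can still be queried.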
Next, it also supports SQL analytics, like what you find inside Synapse, so we can use those features inside Databricks. Next, we can use MLflow under Apache Spark. All of these are combined inside Azure Databricks, so we can say Databricks is a collection of all these components; all of them can be accessed and utilized inside Azure Databricks. And Apache Spark has its own libraries, so whatever we want to achieve, there is usually no need to install another tool or import something external; we can utilize Spark's built-in libraries.

It also has unified billing. What does that mean? As I told you, the cluster is the backbone, so if the cluster is running, you have to pay. If you are using a notebook, the cluster runs while you work and stops after some idle time, so you do not pay when you are not using it. That is the unified billing.

Next, we can use messaging services with Azure Databricks, such as Azure IoT Hub and Azure Event Hub. Next, we can integrate with Power BI; it has tight integration with Power BI as well. We can also integrate with Azure ML, with Azure Data Factory, and with Azure DevOps and Azure Active Directory. And last, we can access data services such as Azure Data Lake, Azure Blob Storage, Azure Cosmos DB, Azure SQL Database, and Azure Synapse. All of these can be integrated with Azure Databricks, and all of this is available inside the Azure portal under Azure Databricks. We will see all these features in detail in upcoming videos. Let's go to the next slide.
And here let me tell you about the architecture of Spark as used in Azure Databricks. At the bottom you can see three options: YARN, Mesos, and the standalone scheduler. What are these three? They are the cluster managers. One of them sits at the bottom of the stack because everything is managed on the cluster.

In the middle you see Spark Core. What is Spark Core? Spark Core contains the basic functionality of Spark, such as task scheduling, memory management, fault tolerance, and interacting with storage systems. As I told you, it is what talks to storage.

Next we see Spark SQL. What is Spark SQL? Spark SQL is the package for working with structured data, much like a traditional DBMS processes structured data. It supports different source formats, such as Hive tables, Parquet files, and JSON files; we will see in upcoming practical videos how to use a Parquet file, a Hive table, or a JSON file inside Spark SQL. It also allows developers to intermix SQL with programmatic data manipulation, supported by RDDs, in Python, Scala, or Java. Don't worry if this is not very clear now; you will see all of it in detail in upcoming videos.

Next we see Spark Streaming. What does that mean? Spark Streaming is live data processing: for example, receiving data in real time and processing it as it arrives. Those options are available inside Databricks, and it has a very good ability to process live data in a streaming manner.

Next we see MLlib, the machine learning library. It provides multiple types of machine learning algorithms that we can utilize. Last is GraphX.
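The phrase "intermix SQL with programmatic data manipulation" is worth a small taste even before we touch a Spark cluster. The sketch below uses Python's built-in sqlite3 purely as a stand-in, only because it runs anywhere; Spark SQL's own API (spark.sql() over DataFrames) is different and comes in the practical videos. The pattern is the same: load JSON-style records, filter them declaratively with SQL, then post-process the result in ordinary code:

```python
import json
import sqlite3

# JSON-style records, like rows Spark SQL might read from a JSON file.
records = json.loads('[{"name": "a", "value": 10}, {"name": "b", "value": 30}]')

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (name TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(r["name"], r["value"]) for r in records],
)

# SQL step: declarative filtering, exactly what you would write in a query.
rows = conn.execute("SELECT name, value FROM events WHERE value > 20").fetchall()

# Programmatic step: ordinary Python over the SQL result.
labels = [f"{name}={value}" for name, value in rows]
conn.close()
```

In Spark SQL the same two steps happen at cluster scale: a SQL query produces a DataFrame, and you keep manipulating that DataFrame in Python, Scala, or Java code.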
So we can also utilize the GraphX option here. That is the architecture flow for now: at the bottom we have the cluster manager; in between we have Spark Core, which does the task scheduling, receives your request, and accesses your storage; and on top we have Spark SQL, live streaming, and the machine learning library.

Next, here we have a comparison between Azure Data Factory and Azure Databricks. What is the purpose of using Databricks over Azure Data Factory, and what are the strong points of Databricks? We will see that here.

The first classification is purpose. Let me first tell you about Data Factory: ADF is primarily a data integration service used to perform ETL processing. If we want to run ETL operations, we can use ADF. It is a very good tool, but it has certain limitations, and those gaps are covered by Databricks. Databricks provides a collaborative platform for data engineers and data scientists to perform ETL as well as build machine learning models. Those machine learning models are not available in Data Factory. That is the first difference, and one reason to choose Azure Databricks over Azure Data Factory.

The next classification is ease of use. ADF provides a drag-and-drop feature to create and maintain data pipelines visually. If you have worked with Azure Data Factory, or have even a little idea of it, you know it is a drag-and-drop tool: you only set up a few things and your pipeline is ready. So in terms of ease of use, ADF's drag-and-drop approach is the easy one. Azure Databricks, by contrast, uses Python, Scala, R, Java, or SQL.
All of these programming languages are supported, and we write our code inside notebooks. So if you are good at any one of them, you can go ahead, learn it, and use it.

The next classification is flexibility in coding. This one is very important, because ADF is less flexible. Why is it less flexible? Because we cannot modify the back-end code: whatever is written in the back end, we can only use as given. But inside Azure Databricks we can write our own code and implement it, which is very good; it is very flexible and allows fine-tuning. We can write our own code and achieve whatever we want to achieve.

Next is data processing. As I told you, ADF does not support live streaming; that feature is not available inside Azure Data Factory, but we can use live streaming inside Databricks. So that is another very good feature.

Thank you so much for watching this video. If you liked it, please subscribe to our channel to get many more videos. See you in the next video.