Hello, welcome to SSNU Tech, Susil this side, and this is a continuation of the Azure Databricks tutorial. This is a very important video: here we are going to discuss clusters. What is a cluster, how can we create one, and how many types of clusters are available in Azure Databricks? We will cover all of this in this video. So let's get started.

First, a cluster is a collection of virtual machines. We can also say it is a set of computation resources and configurations on which you run notebooks and jobs. That is the cluster. We have the driver node at the top, and below it we can see the worker nodes. The virtual machines are arranged like this: each node is a single virtual machine, so one machine acts as the driver and the other machines act as workers.

Now let's go to the next slide and try to understand the cluster types. Mainly, inside Databricks we have two types of clusters: first the all-purpose cluster, and second the job cluster. These are the two types of clusters available, so let's understand the differences between them.

First, an all-purpose cluster is created manually. Whenever we want to execute a notebook, a cluster is needed behind the scenes to execute it. In the case of the all-purpose cluster, we can create it manually and attach it to a particular notebook, so whenever we execute that notebook manually, we can utilize that cluster. On the other hand, a job cluster is created by a job. Inside Databricks we can also create jobs, and whenever a job starts executing, its cluster is created for it.

Second is persistence. What does that mean? Persistent means the cluster can be stopped at any point of time, and we can restart it as well.
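To get a feel for how driver and worker nodes add up, here is a small sketch. The 4-core / 14 GB figures are only illustrative node sizes for this example, not a recommendation for any specific Azure VM SKU:

```python
# Rough sketch: total capacity of a cluster = 1 driver node + N worker nodes.
# The 4-core / 14 GB per-node figures are illustrative, not a specific SKU.

def cluster_capacity(worker_count, cores_per_node=4, mem_gb_per_node=14):
    """Return (total_cores, total_memory_gb) for one driver plus N workers."""
    nodes = 1 + worker_count  # one driver VM plus the worker VMs
    return nodes * cores_per_node, nodes * mem_gb_per_node

cores, mem = cluster_capacity(worker_count=2)
print(cores, mem)  # 12 44? -> 12 cores and 42 GB across 3 VMs
```

So a driver plus two workers of this size gives you three VMs, twelve cores, and 42 GB of memory in total.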
So an all-purpose cluster can be stopped and restarted at any point of time. A job cluster, in contrast, terminates automatically at the end of the job. Next, an all-purpose cluster is suitable for interactive workloads, where we work with the cluster directly, while a job cluster is suitable for automated workloads. Next is sharing among many users: an all-purpose cluster can be shared with many users, but a job cluster is created when the job starts executing, so it is isolated to just that job. Finally, an all-purpose cluster is more expensive when we compare it with a job cluster. So these are the two types of clusters and the differences between them.

Now let's go to the browser and see this in practice. I have already logged in to Databricks so that you can follow along. We are on the starting page, and here we can see "Create a cluster". We can either create one from here, or, in the left menu, we have Compute. Click on Compute and we will see the same option. Let me close this. As you can see at the top, there are four tabs: first is All-purpose clusters, second is Job compute, third is Pools, and fourth is Policies. We will cover pools and policies in our upcoming videos, so don't worry about those for now.

Go to the All-purpose clusters tab and we will try to create one. Here we have the "Create compute" button, but if we go to Job compute, that button is not available, because a job cluster is created automatically when a job executes. That's why it's not there. Let me go back and try to create a cluster here. At the top we can set the cluster name; let me call it SSU. That will be your cluster name.
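The same distinction shows up if you create clusters programmatically instead of through the UI. As a sketch (field names follow the Databricks Clusters and Jobs REST APIs, but treat the exact values, paths, and names as illustrative):

```python
# Sketch: an all-purpose (interactive) cluster is created explicitly and has
# its own lifecycle, while a job cluster is described inline in the job
# definition ("new_cluster") and is created and torn down by the job run.
# Field names follow the Databricks Clusters/Jobs APIs; values illustrative.

all_purpose_cluster = {
    "cluster_name": "interactive-cluster",  # you name, share, and manage it
    "spark_version": "12.1.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 2,
    "autotermination_minutes": 20,          # stops when idle, can be restarted
}

job_with_job_cluster = {
    "name": "nightly-etl",                  # hypothetical job name
    "tasks": [{
        "task_key": "etl",
        "notebook_task": {"notebook_path": "/Repos/etl/main"},  # hypothetical path
        "new_cluster": {                    # job cluster: defined inline
            "spark_version": "12.1.x-scala2.12",
            "node_type_id": "Standard_DS3_v2",
            "num_workers": 2,
        },
    }],
}

# The job cluster spec carries no autotermination setting: it is always
# terminated automatically when the job run finishes.
print("autotermination_minutes" in job_with_job_cluster["tasks"][0]["new_cluster"])  # False
```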
Now here you can see the policy. Policies are only available for the premium workspace; if you are using the standard workspace, this option will not be there. We can read what a policy is: a cluster policy defines limits on the attributes available during cluster creation. As we can see, the options here are Personal Compute, Power User Compute, Shared Compute, and Unrestricted. I'm going to use the default option.

Below that we can see two options: Multi node and Single node. Multi node is for when you are working in a real-time scenario and dealing with a large amount of data. But since I am creating this for training purposes, I am going to go with Single node. Once I have selected Single node, we can see that several options are no longer available. Let me go back to Multi node again so we can see the worker type; when we move to Single node, the worker type is not there.

So what is a worker type? Under worker type we have multiple options inside General purpose, as well as other categories. As per your need, you can choose what type of worker to utilize. After that, you can also select the minimum and maximum number of workers. As I told you, I'm creating this for a single user, so these options are not available on Single node.

Here we can see spot instances. What is that? You use spot instances to save cost. How does it save cost? It is mentioned right there: if spot instances are evicted due to unavailability, on-demand instances will be deployed to replace the evicted instances.

Going down, here we can see the driver type. Inside the driver type we have multiple options. By default I'm going to use the one with 14 GB of memory and 4 cores.
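These form fields map onto a handful of fields in the cluster spec. Here is a sketch of the worker-related part (field names follow the Databricks Clusters API; the VM type, ranges, and spot settings are illustrative):

```python
# Sketch of worker-related cluster-spec fields. With "autoscale" you give a
# min/max range instead of a fixed "num_workers"; the "azure_attributes"
# block controls spot-instance behaviour on Azure. Values are illustrative.

multi_node_spec = {
    "node_type_id": "Standard_DS3_v2",        # worker VM size (4 cores, 14 GB)
    "driver_node_type_id": "Standard_DS3_v2", # driver type can differ from workers
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "azure_attributes": {
        # Use spot VMs to save cost; if a spot VM is evicted due to
        # unavailability, an on-demand VM is deployed to replace it.
        "availability": "SPOT_WITH_FALLBACK_AZURE",
        "first_on_demand": 1,                 # keep at least the driver on-demand
    },
}

# A single-node cluster has no separate workers: the driver VM runs both the
# Spark driver and the executor, so the worker count is zero.
single_node_spec = {
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 0,
    "spark_conf": {"spark.master": "local[*]"},     # Spark runs locally on the driver
    "custom_tags": {"ResourceClass": "SingleNode"}, # marks the single-node mode
}
```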
But as per your need, you can choose from the multiple options here. After that it shows autoscaling. What does that mean? Autoscaling means that when you submit work to the cluster, it will automatically scale the number of workers. As we can see, the minimum is 2 workers and the maximum is 8, so it will scale up automatically depending on demand.

After that we can see "Terminate after". This is very important: while creating the cluster, you should make sure this checkbox is checked, otherwise you will have to terminate the cluster manually. It defaults to 120 minutes, but I'm not going to use 120; I'll use something like 20 or 25 minutes. The minimum we can set is 10, as we can see. You can set it as per your need; I'm going to set it to 10, which is fine for me.

Going down, here we can see the tags. We can add our own tags to this cluster, and Databricks already provides a few tags by default, like Vendor = Databricks. All of those are already here.

Now let me go into the advanced options. Inside the advanced options we can see Spark: we can set up the Spark configuration here, and we can also set environment variables. Similarly, under Logging, we can set a logging path and our logs will be delivered there. After that you can see init scripts. If we want a few Python libraries to be installed automatically whenever this cluster is used by a notebook, we can specify the path to a script here and those libraries will be installed automatically on this cluster.

Now let me go back to Single node and confirm; everything looks okay. And here let me tell you about the Databricks runtime version. As of now, the Databricks runtime version is 12.1. For testing purposes, I would recommend you use the latest version.
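The remaining knobs from this part of the form can also be expressed as cluster-spec fields. A sketch, again with field names from the Databricks Clusters API and purely illustrative values and paths:

```python
# Sketch of auto-termination, tags, Spark config, env vars, log delivery,
# and init scripts as Clusters API fields. All values/paths are illustrative.

cluster_extras = {
    "autotermination_minutes": 10,       # auto-stop after 10 idle minutes
    "custom_tags": {"team": "data-eng"}, # your tags; merged with the default
                                         # ones such as Vendor=Databricks
    "spark_conf": {                      # Spark configuration key/value pairs
        "spark.sql.shuffle.partitions": "64",
    },
    "spark_env_vars": {"MY_ENV": "dev"}, # environment variables on the nodes
    "cluster_log_conf": {                # where cluster logs are delivered
        "dbfs": {"destination": "dbfs:/cluster-logs"},
    },
    "init_scripts": [                    # scripts run on each node at startup,
        {"dbfs": {"destination": "dbfs:/init/install-libs.sh"}},
    ],                                   # e.g. to pip-install required libraries
}
```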
But for a real-time production environment, you should use the latest LTS version. I am going to use 12.1. Here we can also see ML: ML runtimes are also available, so if you want to use ML flows you can use those. But I am going to use this cluster for data engineering, so I'll stick with 12.1.

After that we can see Photon acceleration. This is basically a feature for Apache Spark workloads that reduces the cost per workload. You can choose it if you want to, but I am not going to, because I don't have particularly big data sets; that's why I'm skipping it.

Let me go to the node type. For the node type I am going to use the smallest one, because it will not be as costly. But as per your requirement and need, you can choose a bigger size as well. Now let me reduce the terminate-after value back to 10, because earlier we changed it while on the Multi node option. Everything looks okay. Now let me click on Create cluster. It will create the cluster, which takes around 5 minutes, so I am going to pause this video and we'll be back once it is created.

Now we can see the cluster has been created successfully. Inside Configuration we can see all the options that we selected. Inside Notebooks, we haven't attached this cluster to any notebooks, so it shows 0; whenever we use this cluster inside notebooks, all those notebooks will be listed here. After that, Libraries: as I told you, we have not installed or imported any library on this cluster, so this is empty; whenever we install a library here, it will be listed. After that, the Event log; remember, inside the advanced options we saw the event logs, and here the event log is available with the description and everything.
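The runtime choices above land in the cluster spec as a `spark_version` string, with Photon enabled via a separate field. A sketch: the version strings below follow the usual Databricks naming pattern, but treat the exact versions as illustrative:

```python
# Sketch of how the runtime choice appears in a cluster spec. The
# spark_version strings follow the usual Databricks naming pattern
# (<runtime>.x-<variant>-scala<version>); exact versions are illustrative.

standard_runtime = "12.1.x-scala2.12"         # latest: fine for testing things out
lts_runtime      = "12.2.x-scala2.12"         # an LTS release: prefer in production
ml_runtime       = "12.1.x-cpu-ml-scala2.12"  # ML runtime, e.g. for MLflow work

photon_cluster = {
    "spark_version": standard_runtime,
    "runtime_engine": "PHOTON",  # opt-in Photon acceleration (default: STANDARD)
}
```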
After that, the Spark UI. Inside the Spark UI we can see things like the user and the total uptime, and under that we have Jobs, Stages, and Storage. We can see everything inside the Spark UI. After that, the Metrics tab; this is very important. Let me quickly go into the live metrics. This will show how many nodes are being utilized. It is not appearing here just yet, but it will display information like how many CPUs are in use, how many nodes are in use, and what the memory usage is. We can see everything over here.

Let me go back, and into the clusters list again. Here we can see the cluster we have created, along with its policy, runtime, and all the other options. On the right side we can stop this cluster, where we see Terminate, or we can click on the three dots, where we can restart the cluster, clone the cluster, delete the cluster, or edit the permissions on this cluster.

So thank you so much for watching this video. I hope you now have a basic understanding of how to create a cluster. In our upcoming videos we will go more in depth into clusters. Thank you so much. See you in the next video.
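For orientation, the actions in that three-dots menu correspond to endpoints in the Clusters REST API (all of them POST requests taking a `cluster_id`). A sketch of the mapping:

```python
# Sketch: UI cluster actions and their Clusters API 2.0 endpoints.
# All endpoints are POSTed with a JSON body like {"cluster_id": "..."}.

cluster_actions = {
    "start / restart": ["/api/2.0/clusters/start", "/api/2.0/clusters/restart"],
    "terminate (stop, keep config)": "/api/2.0/clusters/delete",
    "delete (remove from workspace)": "/api/2.0/clusters/permanent-delete",
    "edit configuration": "/api/2.0/clusters/edit",
}

# Note the naming: the API's "delete" only terminates the cluster (it can be
# restarted later); "permanent-delete" actually removes it from the workspace.
```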