Yeah, so as Diane said, my name's Jordan Knight, and with me I have Az. We're both software engineers working for Microsoft out of Australia. And we are here today because we built an operator. When I say we, I mean Az. It turns out that when you build and post an operator to GitHub, that's like sending up a bat signal for Diane, because I think within minutes of us actually making that project public, Diane was in contact with Az. Diane's been super supportive in helping us get that operator to the stage where we're nearly ready to pop it up onto OperatorHub. So we're here today to run you through some of the reasons why we built an operator, and it's for reasons far more than operators being shiny and everyone busting to try and build one. We actually had to find a real business reason to do it.

So what I'm going to do is run through a bit of background. I'm the designer on a customer project that had a need for an operator, and Az is the lead developer on that operator. We're in the same team: I've got the background on how we go about using the operator, and Az has the background on how you use the operator itself.

Just a little bit of background about our team. We're called Commercial Software Engineering. I think Diane's still on the mic out there somewhere. In the past, Microsoft engineers have been locked behind closed doors up in Redmond and other cities around the world, working mainly on Microsoft products. So what Microsoft decided to do was build a software development team that's free and available to go and work on projects other than Microsoft software. For example, we've got a customer that had a need for an operator, so we came in and made an open source project to build that operator.

This particular customer came along with a use case: build a highly cohesive, loosely coupled, multidirectional, complex-configuration, multi-component, multi-technology, multi-platform, high-scale, high-availability, low-latency big data system. Or, in other words, a stream processing pipeline. In this case it's actually a flexible stream processing platform that can be reused for many different scenarios. The scenario for this particular customer was collecting a lot of water quality data from a river system, things like nitrate levels. Basically, we want to be able to measure the quality of water coming down this river to see if the farming operations upriver are having an impact on ecologically sensitive downstream areas such as the Great Barrier Reef.

So the problem we had: that big blob of text about the pipeline was a bit of a joke, but the reality is that pipelines are fairly complex systems. They contain many components, like streams, and our particular pipeline has a bunch of storage and Databricks transforms. We've got a range of environments to worry about: Dev, Test, and Prod.
We've got to think about security, as well as external dependencies. A lot of the time, the dependencies you think about in Kubernetes are within Kubernetes, but when you start working with PaaS services or other cloud-native services in Azure or AWS, you start thinking about how you can manage those nicely without having a whole second concept, a whole second paradigm, in your flows to deliver and bring those systems up.

If you break pipelines down into their semantic parts, or at least conceptualize them a little, the reality is they're a highly cohesive system, but loosely coupled: the components don't really know about each other, but they know the interface between them, what to expect coming in. The other thing with pipelines is that relationship is king. If we don't bring relationships up into a first-class consideration as part of our software delivery, we can have a really hard time as these pipelines get more and more complex down the track. That's especially true when none of the relationships, or even the components being asked for, are available at design time of the system. The particular platform we've been building is reusable: we don't actually know what the pipeline is going to look like when we build that code and deliver it through DevOps into the cluster. It's up to the customers to then use this pipeline platform we've built to define those pipelines and deliver them.

So we got asked by the customer to build a reusable set of components that they can pull off the shelf and string together in various orders and relationships to put together a pipeline, using a UI or a simple configuration language that we came up with after the fact. So we're long gone from this customer, and they can come along and decide to implement scenario A, B, or C using this pipeline system. It's not just one pipeline we get to build, which would still be difficult, but at least then we could hard-code a lot of that stuff. So we had to build a pipeline system with component configurations and relationships that are unknown until later on. Highly flexible and reusable, with one-to-N custom transforms: there might be a whole bunch of Spark jobs or Python scripts or anything else running as part of this pipeline, and there could be forks in the data stream. Some data could go off to a modern data warehouse, some could go to storage, some could continue on a hot path through the system. We don't know at design time.

So the idea we had was to take all this complexity, chuck it away, and not worry about it at the start, which sounds like an easy thing to do. What we did was compartmentalize the whole problem. We came up with a single point of configuration for these pipelines: you can design the pipeline and have all your relationships and components laid out in a single file, which, when you think about it, is just a directed acyclic graph, a DAG. So we created this configuration system, and the rest of the DevOps componentry goes through and makes it real; it turns that config into the real pipeline. So we start with this DAG, a directed acyclic graph, and you can visualize it.
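To make the single-file idea concrete, here's a purely hypothetical sketch of what such a pipeline config could look like. The talk never shows the real configuration language, so every name and field below is invented for illustration:

```yaml
# Hypothetical pipeline DAG config, invented for illustration only;
# the project's actual configuration language is not shown in the talk.
pipeline: water-quality
components:
  - name: sensor-ingest        # ingest readings from field sensors
    type: eventhub-ingest
  - name: quality-transform    # Spark notebook doing the transform
    type: databricks-notebook
    notebook: /pipelines/water-quality/transform
  - name: cold-store           # fork of the stream out to blob storage
    type: blob-sink
edges:                         # relationships are first class: this IS the DAG
  - from: sensor-ingest
    to: quality-transform
  - from: quality-transform
    to: cold-store
```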
We can actually visualize it automatically when it gets PR'd into the production branches, and then the DevOps makes it real, as I said. In the background, the DAG is essentially translated into a series of Helm charts that get deployed into the cluster; we could equally be using Ansible, Terraform, or even manually built scripts. It doesn't really matter. The point is that the config up front has no idea about what's happening behind it. That way we've got this first-class view of what the pipeline will look like.

So we had these configurable first-class objects, these concepts, that we can take and put into these pipelines. But we had a problem: when you have these first-class objects, you need a way to make them real. We thought, oh, we could have all these scripts to do it, we could have all these other things. But it turns out we're using Kubernetes, and Kubernetes is so much more powerful than just a system to hold pods that do work. In fact, we've now got a lot of scenarios where we could use Kubernetes with no pods in it other than operators. It's just such a powerful system for managing configuration, for desired state configuration. You create, read, update, and delete resources: you update a manifest and it replaces, updates, or deletes what's there. And where we have concerns or objects that don't yet exist as a concept in Kubernetes, we can extend the Kubernetes API by creating a custom resource definition, a CRD. That basically means we can then say, hey, we want a Spark notebook job, it's got these parameters, and it will go and fire that up in Databricks as if it were firing up a pod in Kubernetes. And as a designer of one of those configurations, I can see my whole catalog of things I can do. So we don't just have to use operators to create things in Kubernetes; we can use them to create things outside of Kubernetes, and that's what we're going to show you today.

So, yeah, when it comes to custom components like Databricks or Event Hubs, which is the event streaming service in Microsoft Azure, it's operators to the rescue. We were unsure before we started using operators; it's a fairly new pattern. But I can tell you right now that having used the operator pattern for the last six months, we wouldn't do it any differently, because it's just such a nice way to package up and compartmentalize a piece of complexity that you can reuse over and over again in many different ways. So we can create these reusable modules, and once they're well tested, we can publish them, and folks can pull them off the shelf and use them in any other project. They're completely black-boxed. They're also a first-class object, a concept you can get around: hey, do you know how to use the Databricks operator? Where's the documentation for the Databricks operator? Thinking in those terms, it actually becomes a thing; it's not just some bash script sitting in a DevOps pipeline. It really creates even an internal community around it. You can very easily represent them as a line in a directed acyclic graph or in a config file or anything like that. They're also easy to represent in Helm or other well-known Kubernetes delivery packages, whether it's Ansible or any of those other styles of project. You can easily represent an operator.
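As a rough illustration of what "extending the Kubernetes API" means here, a minimal CRD manifest looks something like the sketch below. The group and kind names mirror the style of the Databricks operator, but they aren't copied from its repo, and the schema is left open for brevity:

```yaml
# Minimal sketch of a CustomResourceDefinition: it teaches the API
# server about a new "Run" resource type, so `kubectl apply` and
# `kubectl get runs` work for it just like for built-in resources.
# Names are illustrative, not the operator's exact published schema.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: runs.databricks.microsoft.com
spec:
  group: databricks.microsoft.com
  names:
    kind: Run
    plural: runs
    singular: run
  scope: Namespaced
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          # accept arbitrary spec fields; a real CRD would validate them
          x-kubernetes-preserve-unknown-fields: true
```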
And they're obviously extremely easy to deploy, update, and remove, because the operators themselves just work using Kubernetes manifests, and people know how to do that. But I think one of the main reasons we became very accepting of operators, and are even promoting them internally at Microsoft, is that they're becoming well known, so skills are being built in the industry around operators. You can ask people coming on board your project whether they know how to work with the operator pattern, and even if they don't know the pattern directly, they'll likely know a lot of the concepts that go towards using it, because essentially it's mainly just Kubernetes, used in a certain way. So that's the background from my side, from the customer's perspective, on why we're using operators and what they've asked us to do. I'll hand over to Az to take it away with some more in-depth explanations.

Thanks Jordan for explaining why we needed operators and why operators are awesome. So now let's look into what exactly the Azure Databricks operator is and what it does. For those who are not familiar with Azure Databricks: Azure Databricks is a Spark-based analytics platform created by the original creators of Spark, and as the name suggests, it's optimized for Azure. You can create a Spark cluster in Azure in a few minutes. It's designed for large-scale data processing, and it's ideal for ETL, stream processing, and machine learning. Spark, by nature, performs all of its operations on in-memory objects; that's why it's really fast. And Spark on Databricks decouples the query engine and compute from the data storage, which gives us a huge advantage: you can provision a Spark cluster and connect to the data where the data lives. You don't need to copy your data onto the cluster, and after your script finishes running, you can shut down the cluster without worrying about losing your data. Databricks is secure; it's integrated with Azure Active Directory, so you get granular permissioning. It also provides an interactive workspace where data scientists and data engineers can write their Spark code in Python, Scala, R, and SQL. It also supports Java, and machine learning frameworks like PyTorch, TensorFlow, and scikit-learn.

So this is a very basic hello world Databricks notebook. As you can see, it just shows hello plus the name parameter, the name of the user. To run this at the moment, you can go to the portal, the Databricks dashboard, and create or submit a job and then run it. But from the ops perspective, if you ask your SRE engineers or your ops team to just go to Databricks and use the UI, they will be really angry. So there are a few different approaches you can currently use: you can use the portal, you can call the Databricks API, or you can use the Databricks command line. But we saw that there is a space for an operator to extend Kubernetes functionality here. What if, similar to submitting a YAML file for a Deployment, you could submit a YAML file to run a Spark notebook, a Databricks notebook, from Kubernetes? So you submit your YAML file, and the Kubernetes API server creates a record and then notifies the operator.
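A minimal sketch of what such a manifest might look like follows. The field names here are assumed from the Databricks Jobs API's snake_case conventions rather than quoted from the operator's repo, so check the repo for the exact schema:

```yaml
# Sketch of a Run custom resource: a one-off notebook run on a fresh
# three-node cluster. Field names follow the Databricks Jobs API;
# the operator's actual schema may differ in detail.
apiVersion: databricks.microsoft.com/v1alpha1
kind: Run
metadata:
  name: hello-world-run
spec:
  new_cluster:                       # cluster provisioned just for this run
    spark_version: 5.3.x-scala2.11
    node_type_id: Standard_D3_v2
    num_workers: 3
  notebook_task:
    notebook_path: "/samples/hello-world"
    base_parameters:                 # passed into the notebook
      name: "KubeCon"
```

From there it's ordinary kubectl: `kubectl apply -f run.yaml`, then `kubectl get runs` (or whatever plural the CRD registers) to watch the status move from pending to running to finished.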
That record carries the spec, and then it's the responsibility of the operator to call the Databricks API to provision Spark or run the script, and then update the status back. I normally like to do a live demo, but this session is really short, so I recorded the demos, I narrate over them, and I changed the speed to 2x so it's a little bit faster.

So for example, this is a Databricks Run, and as you can see here in the spec, you can create a Spark cluster with three nodes, you specify the location of your Databricks notebook, and then you pass the parameters. All you need to do is submit your manifest. As you can see, if I get the Databricks runs, there are no runs running. But if I apply my manifest, it starts running it. The first time, it tries to provision the cluster. After provisioning, you can see the provisioned cluster is there. And then after the cluster provisioning finishes, it runs the script, and you can see the updated status. To recap: I just applied my manifest; as you can see, the first time the status is pending; after the cluster finishes provisioning, it runs the script, and then it shuts down the cluster. So it's very fast and it does its job.

But you might ask, what if I want to run a job on an interval? Do I need to provision a cluster, shut it down, and re-provision it again? Or what if I want to have multiple workloads on the same cluster? And the answer is yes, that's possible with the operator. Databricks has the functionality of an interactive cluster: you create a cluster, the cluster keeps running, and then you can attach your Databricks notebook to it. For that, as you can see, I have a manifest for a Databricks cluster. You can set the autoscaling minimum and maximum workers, and you can specify the environment variables for your Spark cluster. After that, you need to create a Databricks job. As you can see in this sample, I run my hello world script every one minute, I pass the parameters, and I specify the location of my notebook path (a rough sketch of both manifests follows below). So again, if I check the clusters, there is no cluster, and then I apply my manifest. After applying it, I have my cluster: you can see that it starts provisioning an interactive cluster, and Databricks gives me the ID. After getting the cluster ID, I can update my Databricks job manifest: I copy over the cluster ID, update my Databricks job, and then apply it. Yep. You can see there is no new cluster; it uses the current cluster. And then I can use kubectl get and describe to see the status of my job, and I can see the run page URL; with that, I can actually see the output of my job. The good thing about using the operator is that ops teams don't need to learn new stuff; they can use all the tools and monitoring they were already familiar with. To recap: I applied and created a Databricks job, and then, as you can see, I can go to the portal and see the output of every single job that runs every minute.

Now let's look into something similar and closer to the real world. Imagine that you have a pipeline to analyze tweets. For example, the first step is ingesting tweets: you want to ingest tweets based on a hashtag or a certain keyword, and what you need to do, even for this first step, is connect to Twitter to get the tweets and then put them into a stream. It can be Event Hubs, it can be Kafka, it doesn't matter.
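Before moving on, here's the rough sketch promised above of those two manifests: an interactive (long-running) cluster, plus a scheduled job attached to it. Again, field names are assumed from the Databricks API rather than quoted from the operator's repo:

```yaml
# Sketch of an interactive cluster plus a scheduled job attached to it.
# Field names follow the Databricks API; treat this as illustrative,
# not the operator's exact schema.
apiVersion: databricks.microsoft.com/v1alpha1
kind: Dcluster
metadata:
  name: interactive-cluster
spec:
  spark_version: 5.3.x-scala2.11
  node_type_id: Standard_D3_v2
  autoscale:
    min_workers: 2
    max_workers: 5
  spark_env_vars:
    PYSPARK_PYTHON: /databricks/python3/bin/python3
---
apiVersion: databricks.microsoft.com/v1alpha1
kind: Djob
metadata:
  name: hello-world-every-minute
spec:
  # placeholder: copied from the Dcluster's status once provisioned
  existing_cluster_id: "<cluster-id>"
  schedule:
    quartz_cron_expression: "0 * * * * ?"   # fire every minute
    timezone_id: "UTC"
  notebook_task:
    notebook_path: "/samples/hello-world"
    base_parameters:
      name: "KubeCon"
```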
But what's important here is that you need to manage secrets, because you need to connect to third-party services. There's a concept in Databricks called a secret scope, where you can provide key-value pairs for your passwords. But what we did here: because ops teams want to manage all of their secrets and passwords as Kubernetes secrets, what if, in a secret scope, you could create your key-value pairs but read the values from Kubernetes secrets? So that's what we did. Here we have a secret scope for connecting to Event Hubs and Twitter. After that, you need to run a Databricks Run; again, we're attaching to the existing cluster, or you could provision a new one. And then, potentially, this script needs some third-party libraries; they can be Maven packages or Python libraries. What you see here says: install these libraries on the cluster before running the script. Databricks automatically pulls and installs those libraries for you, and after that it runs the script (a sketch of both manifests follows below).

Again, a short video. As you can see, I have my Kubernetes secrets. If I apply my secret scope, it creates a secret scope in Databricks and in Kubernetes for you. So I can get my secret scopes; as you can see, the objects are there. Now I'm going to get my running cluster: I get my Databricks cluster to get the ID of the running cluster, and now I can apply and run my Databricks notebook. The first time I run it, it says it's running. Then if I use kubectl describe and provide the name of my run, similar to how you work with other Kubernetes objects, you can see it goes to my run and I can see the output of my run: you can see that it passed the parameters, it installed the dependencies, and it shows the tweets it extracted. I have another notebook to test my Twitter ingestion, called Event Hub ingest. I can attach it to the current cluster created by the operator, read the secret scopes created by the operator, and then run it; it's just for monitoring, to see that it actually reads the tweets. So I can read all of the messages that I have in my Event Hub, and if you're patient, you can see the exact same tweet that we ingested. To recap: I have my secret scope, I have my run, and after that I ran and ingested tweets; and with Event Hub ingest, my Databricks notebook for checking the stream, I can see the values in my stream.

Yeah, so I'd like to share how we built this operator, and some of the lessons we learned along the way. For building the operator, we used Kubebuilder. There are so many tools and frameworks you could use; Kubebuilder, behind the scenes, uses the Kubernetes API machinery and kustomize, and it makes creating custom resource definitions really easy. For those who are not familiar with custom resources and custom controllers: a custom resource is an endpoint in Kubernetes that allows you to store structured data. But that alone is not enough: if you want to build an operator, you need a custom controller. You need logic that says, this is my desired state, and then checks the current state and reconciles the current state with the desired state in a loop. That's the whole functionality of the operator.
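To tie the secret-scope part of that demo back to concrete YAML, here's the sketch mentioned above: a secret scope whose values come from an ordinary Kubernetes Secret, and a Run that installs libraries before the notebook executes. Field names follow the Databricks Secrets and Jobs APIs and are illustrative; the operator's real schema lives in the repo:

```yaml
# Sketch only: a SecretScope sourcing values from a Kubernetes Secret,
# and a Run that installs Maven/PyPI libraries before it executes.
apiVersion: databricks.microsoft.com/v1alpha1
kind: SecretScope
metadata:
  name: twitter-scope
spec:
  initial_manage_principal: users
  secrets:
    - key: twitter-api-key
      value_from:
        secret_key_ref:
          name: twitter-credentials   # an ordinary Kubernetes Secret
          key: apiKey
---
apiVersion: databricks.microsoft.com/v1alpha1
kind: Run
metadata:
  name: tweet-ingest-run
spec:
  existing_cluster_id: "<cluster-id>"  # attach to the interactive cluster
  libraries:                           # installed before the notebook runs
    - maven:
        coordinates: "com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.10"
    - pypi:
        package: "tweepy"
  notebook_task:
    notebook_path: "/pipelines/tweets/ingest"
```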
So with an operator, you need a custom resource and then you need a custom controller. CRDs have been used for third-party extensions, and that's how we're using them, but more recently, even Kubernetes itself is adopting CRDs for its built-in functionality. And Tim Hockin, one of the co-founders of Kubernetes, shared his vision recently that everything is going to be a CRD soon, and there shouldn't be anything built in that you couldn't do with a CRD.

Another tool I'd like to share with you, which helped us a lot, is kind. Kind allows you to create a local cluster, and what I really like about kind is that it uses Docker and works on different operating systems. Our team was distributed and using different operating systems, so we used kind for creating our local clusters. The benefit of kind is that you can create a local cluster and tear it down in two or three minutes, and if you build an image of your operator on your local machine, you can load that image onto the nodes of your local cluster. That saves you a lot of time and compute, because you don't need to push the image to Docker Hub or another image repository in the cloud and then pull it down when you're testing, especially when you're writing webhooks. Everything stays on your local computer, and it's really good in your test pipeline too, so you can actually test your operator. Oops.

Another tool we found very useful, in terms of onboarding, is dev containers. A dev container runs the source code inside a container. Before using dev containers, our onboarding process took about half a day; after, it was reduced to about three minutes. You just clone the repository, and with Visual Studio Code and the dev container, it runs everything inside the container: all the setup for the code, all the setup for Kubernetes and kind, and it also gives you the ability to debug and run. It's very powerful and really easy. If you have a distributed team, or a team with different setups, I highly recommend you check it out; we got a lot out of using dev containers.

This is our GitHub repo. Please check it out, and if you have any use case or anything you'd like to chat with us about, Jordan and I will be here today and you have our contact details. We'd love to have a chat and see how you're using our operator. Thank you. We're really looking forward to seeing this on operatorhub.io.