Hello, everyone. It's my privilege to be with all of you, and a warm welcome to DBaaS Dev Day. My name is Kamal Gupta, and I'm the founder and CEO of Omnistrate. At Omnistrate we help data companies build their enterprise-grade SaaS offerings in no time. In the next 25 minutes I'd like to talk about what DBaaS is and why it matters, some of the challenges in building a DBaaS, and how you can use the CNCF ecosystem to build one.

So what is DBaaS? DBaaS is a cloud-based approach to building an offering around database management that enables your users to access and use a database without worrying about the underlying details. They don't have to worry about infrastructure provisioning, installation, managing the infrastructure, and so on.

There are three key characteristics to look for when you think about DBaaS. One: your users should be able to access the database on demand. When we started back at AWS, where I was one of the founding engineers, it used to take 15 minutes to provision; those days are gone, and now the expectation is to get it up and running in under a minute. Two: it has to be zero-admin. Your customers don't want to worry about it; they want to focus on their application.
They don't want to worry about how the underlying details are managed, or how infrastructure is provisioned, scaled, or upgraded; they just don't want to think about it. And three: the cloud-native experience. This is a very important point. Users are looking for an experience where they can think about something like a table as a service, as opposed to being asked how many CPUs they need, what network, storage, or memory configuration they want, and having to pick and choose everything themselves. Otherwise, if they have a big event coming tomorrow, they have to worry about every small thing and scale it themselves, which is basically pushing the problem onto your customers. As a service provider, you should encapsulate those things and make them seamless so your customers don't have to worry about them.

So why do we care? As I just mentioned, your customers don't need to worry about provisioning, they don't need highly skilled operators to manage the databases, and they don't need to make the trade-offs between performance, availability, durability, and cost themselves. By doing so, DBaaS is transforming, really reimagining, the whole database industry and how databases are managed in the cloud.

So here is roughly what a hello-world architecture looks like: a request comes in from the user, you run a terraform apply to do some infrastructure provisioning, you install the respective database software, you configure your database, and you return. Is that it? Can we all go home? Does anybody see anything wrong with this?
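That naive hello-world flow can be sketched as a straight-line pipeline. Everything here is a hypothetical stand-in (the function names, the fixed IP, the Postgres-style port); a real implementation would shell out to terraform apply, an installer, and the engine's own configuration tooling.

```python
# A minimal sketch of the naive "hello world" DBaaS flow: provision
# infrastructure, install the database software, configure it, return an
# endpoint. All steps are stand-ins for real tooling (e.g. terraform apply).

def provision_infrastructure(request):
    # stand-in for `terraform apply` against the user's requested spec
    return {"instance_id": f"i-{request['user']}", "ip": "10.0.0.1"}

def install_database(infra, engine):
    # stand-in for installing the database software on the new instance
    return {**infra, "engine": engine, "installed": True}

def configure_database(node, params):
    # stand-in for writing engine configuration and starting the service
    return {**node, "config": params, "status": "running"}

def handle_create_request(request):
    infra = provision_infrastructure(request)
    node = install_database(infra, request["engine"])
    node = configure_database(node, request.get("params", {}))
    # port 5432 is an illustrative assumption (a Postgres-style engine)
    return {"endpoint": f"{node['ip']}:5432", "status": node["status"]}

result = handle_create_request({"user": "alice", "engine": "postgres"})
print(result["status"])
```

On the happy path this "just works," which is exactly the trap the talk is pointing at: every step can fail, drift, or need to scale.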
Of course, it's not that simple, as you all know. I'll cover six broad challenges today, but there are more. I chose these six because they are commonly applicable across the DBaaS landscape. I'll cover the first three together, because they are closely correlated with each other, and then follow with the rest.

When we think about provisioning, the first thing that comes to mind is infrastructure provisioning. There are great tools like Terraform that essentially give you a global state, and whenever you want to make a change you apply an incremental change on top. That works great in static environments, but as you know there are several challenges with that model. One is: how do you keep it modular? In a dynamic environment with a large number of users, how do you contain the reconciliation time? How do you deploy things atomically and actually guarantee that atomicity? Then there are other challenges, like drift detection and collaboration, that come in as you scale with Terraform. Those are all things to keep in mind when building or choosing the right tool for infrastructure management.

The other thing to worry about is customization. As we build the DBaaS, there are users with specific configurations; they have specific requirements.
So you have to think about those requirements carefully, and you have to think about the constraints. For example, as a service provider you may want to allow a given user a maximum of, say, 10 clusters, or impose other product constraints. You have to worry about the limits the underlying cloud providers impose, and about the technical limitations of your own underlying database itself. All of those things have to be thought through.

The other thing that is very important is orchestration. Typically, as you scale to a large number of users, you will not have one Kubernetes cluster; you will have many. So how do you orchestrate and bin-pack across them? You need flexibility across your deployment models, because as your business grows you may have different infrastructure, say different networking types. You may have a public offering and a private offering; over time you may deploy into customers' accounts rather than just a hosted mode; you may offer in different cloud providers and different regions, and you may add different services on top. As your business scales, you have to think about the implications of all of those, and keep them in mind when you design these systems.

Finally, I haven't touched on reliability yet, but very quickly: it's important because, at the end of the day, while the database is doing its job, the core metadata, the user information and where each user's state lives in which Kubernetes cluster, has to be stored durably. If you lose that, it's a big problem for your customers to continue to have that seamless experience.
So you have to maintain that durability.

Then we have the scaling challenge, if that wasn't enough. As you think about scaling, you go from simple manual scaling, to things like start and stop, to schedule-based scaling, to being able to autoscale, all the way to scaling down to zero. How do you go through this spectrum, where do you want to fall in it, and what kind of offering do you want to give your customers? Then think about the cost implications: if you don't have scale-down implemented, it will be costly, either to you or as a cost that gets passed on to your customers.

Then there's the important aspect of state. If you have a stateless system, it's pretty easy: you get a machine, you get a provisioned IP, you adjust the load balancer configuration, you put those things together, the health check comes up, and all is good. But when you have a stateful system, all of a sudden this becomes much more complicated. For example, when you're adding a new machine, do you have to do some rebalancing? What happens during scaling, and how does it interact with other operations happening in the fleet? Say you're running an upgrade at the same time: do you allow that, and how do the two interfere? And which metric will you use to decide when to scale up or scale down? CPU, memory, some combination of the two? How do you figure that out? All of those things have to be carefully considered.

Finally, I'll touch quickly on some of the patching challenges. One is speed, of course: there are security and compliance requirements, and you want to get things out very quickly.
There are also your customers' expectations that features will be out quickly; in this era you can't wait six months for changes to ship. You need something that can go out in days and roll out to a large fleet. So how do you achieve that? Scale matters because you have your software images, you have your infrastructure, and the cloud provider itself is making infrastructure changes underneath; how do you make all those changes across your whole array of customers and roll them out safely?

Coming back to that safety point: reliability. You need several prevention and mitigation mechanisms, because as we all know, in the software industry things do go wrong. So we need proper testing mechanisms, proper canaries in production to catch issues early, and things to prevent issues from happening in the first place. If an issue does happen in production, you want some sort of mitigation with a pause-and-resume mechanism. And you need a proper rollout philosophy: start slow and then accelerate, or follow an S-curve strategy, where you start slow, accelerate, and then slow down again for large customers or large regions. Some strategy that works for you, tried and tested in your environment, has to be thought through.

So all those challenges, from provisioning to scaling to patching, have to come together, and that's a lot. So how can we address some of them with the CNCF ecosystem?
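Before moving to solutions, the staged-rollout philosophy just described can be sketched in a few lines: a canary wave first, then progressively larger waves, with a health check acting as the pause gate. The wave sizes and the health signal here are illustrative assumptions, not recommendations.

```python
# Sketch of a "start slow, then accelerate" fleet rollout with a pause gate.
# Wave sizes (1 canary, then 5, then 25) are made up for the example.

def rollout_waves(targets, wave_sizes=(1, 5, 25)):
    """Yield successive waves of targets: canary first, then larger waves."""
    i = 0
    for size in wave_sizes:
        if i >= len(targets):
            return
        yield targets[i:i + size]
        i += size
    while i < len(targets):              # steady state: repeat the last size
        yield targets[i:i + wave_sizes[-1]]
        i += wave_sizes[-1]

def run_rollout(targets, apply_patch, healthy):
    """Patch wave by wave; pause the rollout if health signals degrade."""
    done = []
    for wave in rollout_waves(targets):
        for t in wave:
            apply_patch(t)
            done.append(t)
        if not healthy():                # mitigation: pause on bad signals
            return done, "paused"
    return done, "complete"

fleet = [f"cluster-{n}" for n in range(40)]
patched, state = run_rollout(fleet, apply_patch=lambda t: None,
                             healthy=lambda: True)
assert state == "complete" and len(patched) == 40
```

An S-curve strategy is the same loop with a wave schedule that shrinks again near the end for the largest customers or regions.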
So the first thing is Kubernetes, and Kubernetes is a great starting point. Right out of the gate it offers a lot of functionality, so you don't have to reinvent the wheel; you can just leverage it and build on top. The cherry on the cake is Kubernetes operators, which provide an excellent framework to build your own custom resources and your own control plane logic, essentially letting you define your control plane logic right there. The way it typically works is that you have the current state in Kubernetes and a desired state you want to reach; the operator looks at the diff, acts on that diff, and makes those changes happen. That's the core framework you can use as a building block to implement solutions to some of the challenges I mentioned earlier.

But there are challenges even with Kubernetes operators. They are limited to one Kubernetes cluster, and as you scale you will definitely be spanning across clusters, so how do you deal with that? You will have challenges with coupling, so you have to be careful in your design not to mix your control plane and data plane logic too much, and to have proper testing in place to avoid cascading failures down the road. Think about service evolution: if a change requires redeploying the operator itself, that can slow down changes in production. And think about how you integrate operators with infrastructure management.
Some effort also has to be put into maintenance and into extending the Kubernetes operator to get internal visibility and controls. You can use annotations, for example, to build in some controls, and to implement the pause-and-stop capability I mentioned earlier in case things go wrong.

So here is a suggestion on top of Kubernetes. You can use a Kubernetes operator together with things like Crossplane or Config Connector to do the infrastructure provisioning; you can talk to the scheduler and rely on the autoscaler for some of the scaling work; and you can use some sort of workflow system on top to orchestrate across these Kubernetes clusters and offer all of that as a service to your customers.

All right, so the next challenge is monitoring. When we think about high availability, there are several failure modes to consider. Especially in a stateful system, it's not sufficient to consider only process failures and machine failures; that's not going to give you the desired SLA, especially if you're shooting for three or four nines. You have to think far beyond that. What happens when there's a network partition, where your customers cannot reach you, yet everything behind the scenes may actually be okay? How do you handle that? Maybe you need some sort of external mechanism that constantly pings the database to make sure it's reachable. You need to handle gray storage nodes, nodes that are fluctuating or going into read-only mode: how do you detect those?
You need to detect failed infrastructure and replace it in a timely manner. You have to worry about hung processes: databases can deadlock and get stuck, so even though the machine is up and the process appears to be running, in the sense that you can see it's alive, it may not be making any progress. How do you detect that? Then there are correlated failures: if you have a horizontally scaled-out system with some quorum requirement, two out of three, four out of six, whatever you have, you have to think about the implications of multiple machine failures and how you place those machines across different zones.

Then you have to think about whole data center failures. At Confluent, where I was running Kafka engineering, we prematurely thought that wasn't very common. Apparently, at scale it happens almost every month; one of the cloud providers or something else goes wrong. And every time it happens, if you don't handle it properly, it can cause a huge outage, so we did a lot of work on this at Confluent, and it's something you have to think through for your own technology.

So here are some suggestions using CNCF tech on top. Pixie and Inspektor Gadget are more Kubernetes-native tools that use eBPF to gather observability data, but you can also use more cross-platform tech like Netdata, Hubble, and Sentry. The goal is to collect these metrics through Vector and eventually send them to a sink: it could be Prometheus, Datadog, or whatever observability tool you choose, and then define alerts on top of it.
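The hung-process case above is worth making concrete: liveness alone is not enough, so an external prober can combine a reachability ping with a progress counter (say, committed transactions) sampled between probes. The probe inputs here are hypothetical stand-ins for whatever your database exposes.

```python
# Sketch of classifying database health from an *external* prober, covering
# two failure modes the talk calls out: network partitions (process fine,
# endpoint unreachable) and hung processes (alive but making no progress).

def classify(probe_reachable, prev_progress, curr_progress):
    """Classify one probe cycle from reachability plus a progress counter."""
    if not probe_reachable:
        return "unreachable"   # possible network partition: escalate/failover
    if curr_progress <= prev_progress:
        return "hung"          # process is alive but not committing work
    return "healthy"

assert classify(True, prev_progress=100, curr_progress=150) == "healthy"
assert classify(True, prev_progress=100, curr_progress=100) == "hung"
assert classify(False, prev_progress=100, curr_progress=150) == "unreachable"
```

In a real fleet these classifications would be emitted as events into the aggregation pipeline (e.g. Vector) so alerts can fire on them.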
So some sort of infrastructure is needed there. As I mentioned earlier, you have to think about handling failures in-process, but you also have to worry about network partitions. That's why you want both kinds of monitoring: collecting all those events, sending them through some aggregation mechanism like Vector, and defining alerts on top. That could be one of the mechanisms you consider.

Moving on, here's another important thing. Okay, great: we've got this amazing DBaaS that we can provision, scale, patch, and monitor, but we also have to bill. We have to charge our customers, and how do you do that? What is the infrastructure for that? Some of the problems you have to solve: first of all, gathering the usage, whatever infrastructure usage or metric you are using to calculate your internal cost and set your price for customers. Then you need to collect those data points and aggregate them, because usually these are samples over time, so you have to aggregate over time, and then you have to do the invoicing. Many times, bigger customers will also have custom pricing, so you have to implement that too. You can handhold it manually at first, but your business will grow.
It's not something that can keep being managed manually; it has to be automated. Then there are the channels. One is the direct channel, where customers come in, pay as you go, and swipe a credit card. But it's also important to consider marketplace integration. This is a huge revenue driver for DBaaS companies: integrating with the AWS, GCP, and Azure marketplaces so you can co-sell with them and drive that marketing channel. It requires a proper integration; it's a technology problem where you have to integrate with each of them, and by the way there are many providers, like Tackle and Clazar and others, who address this problem. And finally you have to worry about compliance, because of the financial controls you need to enable on top of all this.

So here is a reference architecture you can use. Just like the metrics we collected on the monitoring side, you can collect the usage metrics and send them to a billing aggregation job, which can be some sort of Lambda function, store the aggregated usage back to S3, and then use that final output to generate the invoices and send them back to your users.

And finally: this is all great, but building a DBaaS is not over yet. You have to worry about the experience, as we talked about earlier. Let's talk a little more about that.
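To make the billing pipeline from a moment ago concrete first, here is a tiny sketch of the aggregate-then-invoice step that job would perform. The vCPU-hour metric, the rates, and the per-customer custom-pricing override are all illustrative assumptions.

```python
# Sketch of billing aggregation: sum sampled usage data points per customer
# over the billing period, then price them, honoring custom pricing for
# specific customers. All names and rates are made up for the example.
from collections import defaultdict

RATE_PER_VCPU_HOUR = 0.05          # assumed list price
CUSTOM_RATES = {"bigco": 0.03}     # assumed negotiated custom pricing

def aggregate(samples):
    """samples: (customer, vcpu_hours) data points -> totals per customer."""
    totals = defaultdict(float)
    for customer, vcpu_hours in samples:
        totals[customer] += vcpu_hours
    return dict(totals)

def invoice(totals):
    """Price aggregated usage, applying custom rates where they exist."""
    return {c: round(v * CUSTOM_RATES.get(c, RATE_PER_VCPU_HOUR), 2)
            for c, v in totals.items()}

samples = [("alice", 4.0), ("alice", 4.0), ("bigco", 100.0), ("bigco", 100.0)]
bills = invoice(aggregate(samples))
assert bills == {"alice": 0.4, "bigco": 6.0}
```

In the reference architecture described above, the samples would come from the usage-metric stream, the aggregation would run as the Lambda-style job, and the totals would land in S3 before invoicing.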
So you have to worry about the UI, CLI, and API; those are the basic experiences. But some customers prefer to integrate with Terraform, so you may also have to provide a Terraform integration: they might have existing Terraform code provisioning their infrastructure, and they want to call your Terraform APIs to provision your stack the same way. You have to worry about user management: how they manage their keys, their access control, their organization, and the groups they define within the organization. You have to worry about observability: metrics, logging, and events, and how you expose that kind of information to your customers in whatever way you feel is appropriate. Then finally compliance: as a DBaaS, some sort of compliance is required. SOC 2 is there, and then in Europe you go to ISO 27001 as well. Those things are required.

So here are some tools you can use: Keycloak for identity and access management, and Open Policy Agent for RBAC. You can integrate them with the user database I mentioned earlier, and provide an API gateway with Gloo Edge and identity and RBAC with Keycloak and Open Policy Agent.

Here is a quick demo, by the way, of how we at Omnistrate try to help streamline a lot of this. I was going to quickly play it, but I'll just wrap up and share the demo link instead. All right, so key takeaways. Well, first, Kubernetes is the foundation.
I think that should be pretty obvious. Second, we talked about a bunch of CNCF technologies, from Prometheus to Grafana to Docker and Vector, and you can use all of those to solve some of the common problems we touched on today. Third, one of the things I want you to take away is design: it's important not to think too short-term, because as your business scales, if you're not careful you will end up with spaghetti code; nobody will want to touch it, and it will become really hard to manage. So it's very important to consider how your business will evolve in the future, to prefer two-way-door decisions over one-way-door decisions, and to have that flexibility built in from day one.

And finally, I'll say: don't underestimate the day-two operations side of the house. When we think about DBaaS we often emphasize the day-one experience, how we give our customers a beautiful experience, but not things like upgrades and how the service will evolve over time. Automating those things at scale is paramount, and it will define the experience for your customers as well. Think about it: with upgrades, customers will want the latest and greatest very quickly, so if you haven't automated it properly and it takes months or quarters, it becomes really, really hard.

Well, thank you so much, and I have a link here if anybody is interested in watching that demo. Thank you.