Okay, hey everybody, welcome to our panel. We're excited to have you here. My name is Melissa Logan. I'm the director of the Data on Kubernetes Community. If you are not familiar with our community, it's about 4,000 end users and practitioners who are sharing best practices about running stateful workloads on Kubernetes: all kinds of stateful workloads, databases, streaming, AI/ML, analytics, et cetera. We host monthly virtual and in-person meetups, where folks from different end users share case studies about how they successfully run data workloads on Kubernetes. We've had folks like Comcast, Payitgov, the Dutch government, and others join us, which has been really good learning for the community. We are very soon going to launch a new website with a resource library that will catalog the talks we've had, and you'll be able to search by the type of workload you're looking for. So we're really just trying to share resources for you all as you go on your DoK journey. We also host an operator special interest group. The purpose of this group is to identify gaps in knowledge and information and see what we can do about it as a community. The members of our operator SIG are here today, and we're going to talk with you about operator maturity and share some practical advice on what to look out for as you're evaluating operators. So let's just start with some introductions. Tell us who you are and what your background is with operators. Michelle?

Hi, I'm Michelle. I'm a software engineer at Google. I work on GKE and open source Kubernetes, where I'm a SIG Storage TL. My main exposure to operators is that we have lots of users who come in wanting to run stateful workloads, all sorts of different types of workloads, and they're basically asking for best practices.
One of my goals is to pick out all the challenges that users are hitting, find common patterns across the different types of workloads, and then make Kubernetes better able to support those workloads in the future.

Hi, my name is Robert Hodges. I run a data warehouse startup; we are service providers for ClickHouse, a pretty popular database. My involvement with operators on Kubernetes is that about five years ago we made a decision to push all of our workloads onto Kubernetes, because we saw it as the platform of the future. I say we, but personally I thought it was kind of a dubious idea to run databases on Kubernetes. Since then, though, we developed what turned out to be the first operator for a data warehouse. That came out about four and a half years ago, and it's turned out to be a really great experience. We've learned a huge amount along the way. I'm actually doing a talk on storage for data warehouses in a couple of hours.

Thanks, Robert. I'm Gabriele Bartolini, VP of Cloud Native at EDB. EDB is a company that develops open source Postgres, and I've been using Postgres for 23 years now. As Robert was saying about databases, I've seen the whole bare metal and VM evolution, and four and a half years ago I started my Kubernetes initiative. The idea was to fail fast: to see if this could actually be an area to explore, and whether Postgres on Kubernetes could run like it does on bare metal machines. So we did a benchmark on a bare metal Kubernetes cluster, and we saw that with local persistent volumes it was pretty much the same as running on bare metal. That's where we started to develop an operator for Postgres, which is CloudNativePG. It's open source, and yeah, here we are.

Excellent. I just want to start with some context. How many people here are familiar with the Operator Framework?
Okay, so almost everybody, it seems. The Operator Framework, if you're not familiar, defines capability levels for what an operator enables. Level one is just basic install, and it goes up to level five. We ran a survey a year ago in the Data on Kubernetes Community, and it showed that people want their operators to function at a very high level: levels three, four, and five. That's full lifecycle management, deep insights, and autopilot. So can you all give us some more context here? What do those levels enable? What are you seeing with regard to how we categorize operators?

Sure, I can go first. I think the level five autopilot is a very aspirational goal that we'd all like to get to, where basically everything is self-healing and self-managing and you don't have to touch it once you start it. I don't know if it's realistically achievable, but it's a great North Star. I do see operators today that are able to do some parts of level five, so we're getting there. And there are still a number of places where we can improve Kubernetes itself to make it easier for operators to get to level five.

Yeah, so, kind of like Gabriele, we started our operator development long before any of these frameworks existed, so we didn't really think in those terms. But looking back and reverse engineering onto what we have, levels three and four are extremely important. Probably one of the biggest things we do is provide built-in observability. Almost from the start, the operator had a Prometheus exporter, so as you spun up your database clusters they would just start magically exporting their metrics into Prometheus. We supplied dashboards on top of that.
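As a rough illustration of how that kind of built-in metrics export is usually consumed: if the cluster runs the Prometheus Operator, a PodMonitor can pick up the exporter's endpoint automatically. This is a minimal sketch; the label and port names are hypothetical and depend on what the database operator actually sets.

```yaml
# Sketch: scrape the database pods' metrics endpoint with Prometheus Operator.
# Assumes the Prometheus Operator CRDs are installed; "app: my-database" and
# the named port "metrics" are hypothetical values set by the database operator.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: db-metrics
spec:
  selector:
    matchLabels:
      app: my-database
  podMetricsEndpoints:
    - port: metrics      # the exporter's named container port
      interval: 30s
```

Some operators (CloudNativePG among them) can create a resource like this for you; otherwise it is a few lines of configuration on top of the exporter the panelists describe.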
I think that's one of the biggest value adds, particularly for large systems in Kubernetes where you can't just go and lay hands on things; it's a really critical feature. Obviously DR, backup, and things like that are critical as well.

Yeah, I'm going to use our operator as an example to reply to this question. We use the Operator Framework capability levels to classify features, because customers and users are familiar with this model. If you go to the CloudNativePG documentation, you'll see a page that classifies all the capabilities according to these levels. I really liked this morning's keynote from Google, where it was said that Kubernetes will never be finished. As Michelle was saying, level five is an aspirational goal; we'll never stop. But for me level three, the day-two operations, is important, especially backup and recovery, and at level five, self-healing and automated failover for a database that has a primary and standbys. Observability is very important too. Even our Postgres pods have a default Prometheus exporter, and we export logs in JSON to integrate directly with whatever solution you use for monitoring, logging, tracing, and so forth.

So when you are evaluating an operator, what are the key things that you typically look for? How do you understand whether it's going to fit in your stack?

I think for me, when I talk to a lot of users, the biggest challenge they have with stateful workloads is handling disruptions.
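For the planned side of the disruptions Michelle is describing, Kubernetes has a standard guardrail: a PodDisruptionBudget, which limits how many replicas a voluntary disruption (node drain, upgrade) may take down at once. A minimal sketch, with a hypothetical label; a good operator will usually create one of these for you:

```yaml
# Sketch: allow at most one database replica to be evicted at a time
# during voluntary disruptions. "app: my-database" is a hypothetical label.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: db-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: my-database
```

Checking whether an operator manages budgets like this, and how it behaves when one blocks an eviction, is a concrete way to run the evaluation the panelists recommend.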
And so when I'm trying to recommend operators to customers, that's one of the first things I tell them: test and evaluate whether the operator is able to recover gracefully from both planned and unplanned failures and disruptions, because that's pretty critical to evaluate, and it's probably going to be one of the biggest sources of production issues that people run into.

Yeah, with operators, like all software, I tend to look for things that have production use behind them. To take our operator, for example, we use it for our own cloud, so we've been running it for three years. These are clusters where people are running servers with up to 50 terabytes of block storage attached and really heavy load. There are other clouds, actually: IBM runs a big internal service, and one of the Chinese ClickHouse services runs on our operator. So you want something that's been tested, because particularly when you're running at load there's a lot of stuff that can go wrong. That's a really key thing, and I think it's more important than particular features, although those are obviously important as well. I don't know if we're going to go into stacks, but the other thing I'm starting to look at is how the operator fits into the entire stack. Normally you don't just stand up a data warehouse and call it good. You're actually building a service that's designed to do analytics and is in some sense a replacement for something like Snowflake, but for a particular purpose. So the question is how the operator and the software it manages fit into the entire stack. For example, does it integrate with Argo CD? Stuff like that. So that's the second thing I would look at.

Yep, and I agree with Robert, and I would also say the pipelines.
For example, in our operator we have put a lot of focus since day one on automated pipelines that also run automated tests. Our operator is developed with continuous delivery, and continuous delivery is a state of mind for a team that develops software: the latest commit is always the most stable version of your software. Having managed 24/7 support for years and produced hotfixes, it was always good to know that you could commit your hotfix on top of the latest commit and ship it. To get to continuous delivery you need automated testing and you need strong pipelines. So this is also an important thing to check: look, for example, at how many end-to-end tests are run on the software you are deploying, and make your own judgment. Then of course there are the Operator Framework capability levels, to see if the operator is production ready, and then again the day-two operations. For a database it's important to look at performance, but also RPO and RTO, so high availability and disaster recovery, backup and recovery, and so on. These are the things I would look at. Security too, but I think we can...

Security is very important as well; we'll talk about that in a moment. So on OperatorHub there are about 240 to 250 operators now. For Postgres alone I know there are maybe five or six different operators. In the case where there are multiple operators available, how do you suggest people evaluate and determine which one to use?

I think a big one for me would be the support of the operator and the community health. How big is the community contributing to this operator, and what support contracts are available? Because one of the misconceptions about operators is that you don't need any expertise, that you can just run an operator and it'll take care of everything.
But that's where some people run into trouble, because when things go wrong you still need to be able to debug, and you need expertise in that workload to be able to go in. So either you need to develop that expertise or you need to get support from someone who has it.

Okay, yeah, my advice is to actually look at all of them, possibly try to install all of them, and then make your own judgment. There are several operators for Postgres, so it's not easy for me to speak without being subjective, but I would say just try them all, look at the features, and make your own judgment.

I want to riff on that slightly. An alternative question is: what if there's only one? Is it safe, can you use it? I think one of the things for any operator, as you're looking at it, is to think about whether you can wrap your head around running this thing. For example, is it documented? We talked about tests; those are a great way to figure out how it works, provided the tests are comprehensible. Can you find blog articles? Can you find things that are going to help you integrate it into your development process, your CI/CD, and the final services that you deploy? Those are things I would look at really hard, and if you don't feel convinced, don't use it. There are other ways to run software.

Yeah, and on top of that, every organization is unique. For example, our operator has taken a completely different direction than the others, so I think it's good to look at how much Kubernetes knowledge you have in your team and how much Postgres knowledge you have in your team, especially given the direction we have taken to leverage the Kubernetes API.
If you come from a Postgres background, for example, our operator might be a bit intimidating, because we're trying to go in a direction where we don't use any Postgres tools for backups, and we don't use any Postgres tools for automated failover either. So if you have been using, say, Patroni for years, or other backup tools, you feel more in your comfort zone with those. But we are trying to reuse what Kubernetes offers. Two days ago, with Michelle, I did a presentation about very large database disaster recovery. I have been working in the backup and recovery area of Postgres for many years, and Kubernetes for the first time has given us the possibility to use a standard interface for snapshots. That's not possible outside Kubernetes; every vendor has its own way to take snapshots. And we can restore 4.5 terabytes in two minutes, thanks to Kubernetes, thanks to delegating the task to Kubernetes, which is what it was designed for. So it depends on the skills you have and how comfortable you feel with one operator or the other, based on the main capabilities.

Yeah, and we've heard in our community, too, that people write their own custom operators. It's relatively easy to start, but then maintaining it and keeping it going, for all the reasons we're talking about here, is very challenging.

Don't write a custom operator, especially for Postgres; use an existing operator. My advice is not to run Postgres without an operator, because you miss all the integration that operators enable you to benefit from.

Yeah, absolutely. So what are the challenges that you typically see or run into as you're beginning to evaluate or use an operator, and how do we solve those challenges? You mentioned, for example, understanding Kubernetes skills, those types of things. What else do you see as challenges of running and maintaining an operator?
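The standard snapshot interface Gabriele refers to is the Kubernetes VolumeSnapshot API, served by CSI drivers. As a hedged sketch of taking a snapshot and restoring it into a fresh volume (resource names are hypothetical, and the snapshot class depends on your CSI driver):

```yaml
# Sketch: snapshot a database PVC, then clone a new PVC from the snapshot.
# Assumes a CSI driver with snapshot support and the external-snapshotter
# CRDs installed; all names and sizes here are hypothetical.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: db-snapshot
spec:
  volumeSnapshotClassName: csi-snapclass
  source:
    persistentVolumeClaimName: db-data-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-data-restored
spec:
  dataSource:
    name: db-snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 100Gi
```

Because the restore is a storage-level clone rather than a logical copy, it runs in roughly constant time regardless of database size, which is what makes the multi-terabyte restore times mentioned above possible.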
I mean, when it comes to databases, there's still the perception that we can't run them. That's why we're here. It's not only possible; in my opinion, as I said the other day, we'll never go back to bare metal or VMs to run Postgres. That's a strong statement, but you need skills, Kubernetes skills. And they don't necessarily need to be in one person. I love the concept of teams with T-shaped profiles. So maybe have a team where Postgres experts work together with Kubernetes experts. Also, the DBA profile, the DBA skills, need to be reshaped. As I always say to DBA people: you are elevated thanks to Kubernetes, because you are no longer doing the boring stuff you were forced to do and thought was your job. No, your job was not that; you had to do it because there was no better way. Now you can go back to working more closely with developers, working on indexes, monitoring, observability, alerting. And, most importantly in my opinion, live a better life, because you will be paged less.

That's the ultimate goal. Ultimate goal: live a better life.

Yeah, exactly. You will be paged less if you do things properly with Kubernetes and a good operator.

Yeah, I'm definitely with Gabriele. Having started out thinking it was not a good idea to run on Kubernetes, I would absolutely never go back. We do a lot of performance testing, and we have completely shifted all of it onto Kubernetes. We almost never do performance testing anywhere else, because we can set things up, change configurations, scale storage up, change VM sizes. It's really wonderful for that. In terms of the challenges, for a DBA or somebody accustomed to running databases, it definitely is a mind shift. We already mentioned that you have to know Kubernetes. There are just things you need to know, like what that CSI driver is doing, in the same sense that you need to understand storage.
But there's also another type of mind shift. The traditional model for fixing something on a database like Oracle is: just SSH in, run strace, or use your standard lsof, and find out what the heck is going on on that host. You can't do that in Kubernetes. So there's a huge premium on observability; one of the big shifts is that you switch from the notion of going in and laying hands on things to observing them remotely. If you think you're exceeding the load average and that's why you're getting paged, you want Grafana dashboards that show you this stuff, so you can go straight there and find out what the problem is. It forces you to be better, but if you're used to doing it the traditional way, for a while you're going to try to kubectl exec in to fix things. And then you realize that just doesn't scale. You really need to think in terms of things that allow you to diagnose problems remotely, without having to actually touch the host where things are running.

Can I say one thing about this? I think it's about feedback loops with the people that use your operator. It's unexplored territory, in my opinion; in the next few years, I think this will become clearer. For example, we had to change our operator to get closer to DBAs and introduce fencing. Fencing is a way to take Postgres down while the DBA can still access the data files and check for corruption. This is an exercise that was done thanks to the cooperation between these two different ways of thinking about a database.

Yeah, I would agree with everything that's been said so far. I think troubleshooting and debugging is one of the bigger mind shifts, and trying to troubleshoot a Kubernetes pod is a different set of skills.
And one of the interesting challenges is that when there's some problem and you have to go in and actually fix the system, you might be fighting against the operator, because the operator might be trying to do self-healing at the same time that you're trying to run other commands. So that can definitely be a challenge, and it's an area where the operator space could improve to better support those kinds of routines.

Let's talk about security then. In our community survey from last year, we asked about the criteria people use to evaluate operators, and security was the number one criterion. So I'm curious: what do operators provide related to security? What's built in in different operators, or what would you expect to find?

I think where operators can really shine is having secure-by-default settings and making it really easy to automate all the configuration for that. At the same time, there's an interesting challenge in that a lot of operators have been designed to spin up new instances of databases and other workloads, but operators themselves need a lot of permissions to be able to do things like create pods in any namespace. So I think there's a shift coming where the permissions that operators themselves need will be more scrutinized, and people will be asking for features like limiting the namespaces that operators have access to.

Yeah, I totally agree with the notion of enforcing secure by default. To give a concrete example: there's a default user for ClickHouse, as there is for just about every database that ever existed.
What we do is fence that default user, so you can only use that account when you're on localhost, meaning actually on the pod, or when you're calling from one of the other hosts in the cluster; those network filters are just set up by default. So you want concrete things like that, and you want them enabled automatically. Another thing that's really critical is for operators to integrate well with the existing security mechanisms in Kubernetes, which has a very rich security model: there are notions of roles, there are things like secrets, and you want to leverage those features. What do I mean by that? For example, if you need to pass credentials in for S3, a common way to do it is just to put them in the environment. That's actually pretty easy to apply, as long as the operator allows you to get in and manipulate the pod specification. What I'm saying is, you don't want an operator that hides the fact that underneath there are one or more pod specifications. It needs to give you an escape hatch to go in and apply the security settings. Same thing if you're getting secrets out of HashiCorp Vault. Operators need these openings so they can mesh easily and conveniently with the security mechanisms you may be using across your entire cluster. And on top of that, they need documentation. There are many other things, but those are pretty high up my list.

Yeah, and when I try to define security posture, I use the 4C security model of Kubernetes: cloud, cluster, container, and code. It actually starts from the code. Code needs to be scanned, and the images that are built need to be scanned when we get to the container layer. For example, our operator uses immutable application containers, so they only run one command at a time.
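Robert's point about passing S3 credentials through the pod specification can be sketched like this, using a Kubernetes Secret projected into the environment. All names here are hypothetical, and the second fragment assumes the operator exposes the pod template for editing:

```yaml
# Sketch: keep object storage credentials in a Secret rather than in
# plain manifests. Secret name and keys are hypothetical.
apiVersion: v1
kind: Secret
metadata:
  name: s3-credentials
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: "<key id>"
  AWS_SECRET_ACCESS_KEY: "<secret key>"
---
# Fragment of a pod template the operator would let you customize
# (the "escape hatch"): inject the Secret's keys as environment variables.
# containers:
#   - name: database
#     envFrom:
#       - secretRef:
#           name: s3-credentials
```

The design point is the escape hatch itself: whether the credentials come from a Secret, Vault, or a workload identity mechanism, the operator has to let you reach the pod spec to wire them in.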
The other day somebody who was troubleshooting told me: I tried to run ps, and your operator doesn't allow me to, because ps doesn't exist in the image. We try to build very minimal images. As for the database that runs in the container, by default we disable the Postgres superuser. Then we use secrets, and we use mutual TLS to let the replicas talk to each other, so we don't even use passwords. Then, when we go to the cluster level, there's RBAC. What Michelle said is very interesting, because if you look at the documentation of our operator, we had to specify all the permissions our operator requires and even point to the source code, because that's the level of scrutiny our customers require. They need to know, especially with an operand like ours that needs to talk to the operator to coordinate. So we need to say: we need these permissions because of this, this, and that. And documentation is where you put that. Then there's the cloud level, but I'll leave that aside because it's the infrastructure level.

Well, beyond the basic built-in stuff, what other security best practices do you want to think about for operators? I'm asking because we as a community have been working on an operator security hardening guide, so I'd love to hear what other things we should think about.

I want to jump in on this. One interesting area for the security of databases on Kubernetes is thinking about exported data. I mentioned observability; that includes logs. Database logs often have quite sensitive information in them, and you're happily shipping them off to, I don't know, Sentry or something like that. So there are definitely things we have to think about as we begin to rely increasingly on these convenient and accessible mechanisms that push data out. Another example is object storage.
In analytics systems, object storage is increasingly used as backing for tables, or all your data just lives out there. So you need to understand how to protect that properly and, most importantly, make sure that your database itself does not become an attack vector on object storage. I think that's an issue that's kind of amorphous right now. In our group we're still grappling with what the issues are, but they're clearly important to think about.

Yeah, another thing is to try to leverage Kubernetes as much as we can for security. For example, certificates. Every time we see a password around, we have to question whether we really need it, and if we can replace it with certificates, that's much better, especially if they're managed by cert-manager, which doesn't exist outside Kubernetes. Outside Kubernetes, nobody in Postgres is using certificates and renewing them regularly. So we've got all these capabilities we can use, and my advice is to leverage Kubernetes as much as we can.

I'd like to just add to that: another shifting paradigm in security and authentication is to use things like temporary tokens. That's another big one that will require changes in a lot of the workloads themselves to support, but it's a good trend to be moving toward.

Yes. We want to open it up; if anybody has questions for us, please go to the mic up here. If you're in the front row, I'll bring this mic to you. There, that should be loud enough.

So I have a question. For those of us who are currently running inside cloud environments like AWS, GCP, et cetera, a common pattern that at least I see is not leveraging a Kubernetes operator for a data plane when a cloud provider offers some managed service, and instead using their managed service and exposing it inside Kubernetes.
I'm curious to get your thoughts on that, especially if you as an engineer want to push for something like operators on Kubernetes. How do you propose that?

Yeah, that's a very interesting question. Obviously cloud providers want you to use their managed services, but I definitely see a lot of users with multi-cloud environments choosing to run operators because they can have a consistent stack across all their environments, no matter where they run. I think that's one of the main innovations Kubernetes has brought: the ability to have that consistent stack. But of course you have to make trade-offs between how much you need to manage yourself versus how much the provider can manage for you.

I have a pretty strong opinion on this, which is that it's actually good that vendors are offering these services, but the key thing for you as a user is to think about what happens if that vendor service ceases to exist, or they raise their prices, or they cut you off because they don't like you. Stepping outside of operators for a moment: open source databases are the way to go, and operators are a key part of that, because you can think of a spectrum of ways to run data services, ranging from something like Amazon RDS, which does a very creditable job of managing relational databases (we use it ourselves in our service), all the way to running it yourself with an operator. People should have that choice, and I think that's really important. Operators are what enable people to say: I'm not going to use the vendor solution, because there's something I can run economically, efficiently, and safely myself. So operators are a super important part of this picture, but we should also recognize that people will move back and forth, and that's actually an interesting issue.
How do you enable people to move from the operator-driven version of some database to a vendor service without wrecking their entire infrastructure as code? These are things we have to think about.

And for me, there's multi-cloud, as Michelle was saying, and vendor lock-in is another one; especially if you are an ISV, you probably need to do multi-cloud, in my opinion. The other one, and I come from Europe, so for me data protection is important: it's really up to you and your organization to decide, but with Kubernetes, a good operator, and Postgres, you see where your data is. You have full control of your data, and you can encrypt it the way you want. It's yours, but that comes with responsibility, and that goes back again to the skills you have. If you have skills in Kubernetes, you can even manage Kubernetes by yourself. If you have skills in Postgres, you can manage Postgres by yourself. If you don't have them, it's best to leave it to those who do. That's my view.

Hi, I'm Tyler from the University of Illinois. One takeaway I hear from this talk is that when users come to open source operators, they need to be very cautious about the capabilities of the operator and how it can mess up their production environment. I'm wondering, as a community, what kind of tooling support do you wish we had? What kind of tooling should we work on to improve the reliability of operators, so that users can just take operators off the shelf and confidently run them in production?

I like that you used the word mess. They also do amazing things. But the good thing about an open source operator is that there's a community, and you can join that community and be an active part of it.
But I agree with Michelle: data is the most important asset an organization has, and depending on the size of your organization, you need professional support for it, in my opinion. The community can be good, but you need professional support to sleep at night, let's say.

I think this is a wonderful question, and my two-word answer would be CI/CD support. Think about a typical example: does anybody here use Trivy to do container scanning? Good on you. Oh, there's another one; you guys get a free t-shirt. We use it too. It's great, just totally awesome. It downloads in about 60 seconds and lets you do scanning. We run multiple scanners because they all tend to give somewhat different results. Tools like that are the things that have enabled us to build operators that are safe. Some of the other problems, like reliability, you have to solve within your own code; I don't think there are tools that can do it, but fuzzing tools are another example of something you might apply.

Just to add to that: all software has bugs. There are always going to be issues. I don't know if we can really create something that is truly bug-free, that you never have to troubleshoot or debug. That's just one of the realities people need to plan for when they want to use an operator, especially an open source one. Operators can definitely automate a lot and make your life a lot easier, but they don't completely take away the need for expertise.

Okay, yeah, thank you, nice talk. I actually have a question regarding the deployment of databases on top of Kubernetes. How do you see it, or what do you suggest your clients do when it comes to deployment? Do we have to have a separate cluster dedicated only to databases, or do we deploy the pods beside other workloads?

I think the answer is yes.
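Coming back to the Trivy scanning Robert described: wiring it into a CI pipeline takes only a few lines. A hedged sketch as a GitHub Actions job, where the image reference is hypothetical:

```yaml
# Sketch: scan the operator image in CI and fail the build on serious findings.
# Assumes GitHub Actions; the image reference is a hypothetical placeholder.
jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Scan operator image with Trivy
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ghcr.io/example/my-operator:latest
          severity: CRITICAL,HIGH
          exit-code: "1"   # non-zero exit fails the pipeline on findings
```

Running this on every commit is one concrete form of the "strong pipelines" Gabriele argued for earlier.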
I think both of the models you described. We've thought about that, because one of the management models we have is where the control plane sits in the cloud and we have the ability to form a management connection to a remote Kubernetes cluster that's owned by the user. Then the question is: should they just run databases there, or can they put their apps in there too? What we find is that users make different choices, and they usually have good reasons for doing it that way. Our tendency, increasingly, is to see the Kubernetes cluster as representing some larger application consisting of a set of services. So the operator ought to be happy running in there, maybe only managing things within a single namespace, with multiple other things running alongside it. You want to support that model very well.

Yeah, I agree with Robert. I've seen two patterns so far with our operators. One: you've got a large organization siloed between infrastructure and development, where infrastructure provides database as a service. In that case, you normally have a separate cluster for databases. What I advocate personally is the microservice approach: you put applications next to the database, so they are together. But again, it really depends on your organization. And it all starts from day zero. Day zero is important: you choose the architecture and the storage, and if you mess up at that level, you can't recover easily.

Yeah, I would also add that it's very dependent on your whole platform, what you plan to run on it, and how well behaved the other pods running on the same nodes might be. The unfortunate state of Kubernetes today is that it's pretty easy to hose a node with a runaway application that's unbounded.
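Constraining pods and reserving nodes for the database, as discussed here, can be sketched as a few fields in the database pod template. Labels, taint keys, and resource figures below are hypothetical:

```yaml
# Sketch of pod-template fields for a database pod: cap its resources,
# pin it to dedicated nodes, and keep replicas on separate hosts.
# All labels, taint keys, and sizes are hypothetical.
spec:
  containers:
    - name: database
      resources:
        requests: { cpu: "4", memory: 16Gi }
        limits:   { cpu: "4", memory: 16Gi }   # bound the pod so it can't hose the node
  nodeSelector:
    workload-type: database                    # only schedule onto dedicated DB nodes
  tolerations:
    - key: dedicated
      value: database
      effect: NoSchedule                       # matching taint keeps other pods off
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: my-database
          topologyKey: kubernetes.io/hostname  # one replica per worker node
```

The anti-affinity plus the node taint is one way to get Robert's "one database node per host" mapping while still sharing the cluster with application workloads.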
So it's going to depend on how well you constrain all of the pods that run there, to help avoid noisy neighbor issues.

Michelle makes a really great point. Databases are kind of special; they have been for 50 years. One of the things databases often still assume, because they were written long before Kubernetes, is that they have full control of the host. For example, Postgres kind of assumes it has the buffer cache to play with, and if it doesn't, Postgres performance will suffer. The same is true of ClickHouse. So what we typically do is map the database nodes to single VMs, or single hosts if you will. That's something you definitely want the operator to be able to support, and it's a practice that allows the database to run safely and performantly, without getting hosed alongside a bunch of other workloads in the same Kubernetes cluster, because they're not sharing the same worker nodes.

Yeah, and Kubernetes has affinity; that's amazing.

Okay, I have another question that came in... but I think we're out of time. I'm really sorry; I think there might be another talk happening. We'll be outside if you all have more questions. Thank you so much. We really appreciate it. Thank you.