Thanks for joining our panel. Come on in and find a place to stand or sit if you can. We really appreciate you joining. We are members of the Data on Kubernetes community and the CNCF Storage TAG. The Data on Kubernetes community, if you're not familiar, is an independent community adjacent to Kubernetes. We exist to help end users connect with each other and share best practices for running stateful workloads on Kubernetes. We're here today to talk about one of our favorite topics, one we come back to consistently in the community: Kubernetes operators. They are essential for day-two operations on Kubernetes for data workloads, but they do come with a number of challenges. People cite things like a lack of interoperability, a lack of integration with existing tools, and varying degrees of operator quality. In the 2022 Data on Kubernetes report, however, the majority of people said that they're using at least 20 operators. That's quite a few. These are often custom-built; some are operators that exist on OperatorHub, for example. Most organizations want the operators to function at a very high level. If you think about the Operator Framework on OperatorHub, there are levels one through five, with five including things like autopilot. Most organizations want them to operate at level four or above. The top criteria that we hear people evaluate operators on are security and ease of use. I'm going to start with some intros and then we'll get into some questions. In spite of all of the challenges, I will say, people are adopting data workloads on Kubernetes. They have a high level of satisfaction doing this, and we do hear that they are increasing the number of data workloads they will be running. My name is Melissa Logan. I'm the director of the Data on Kubernetes community, and we have intros from folks starting with Xing.

Hello, everyone. I'm Xing Yang. I work at VMware in the cloud native storage team.
I'm also co-chair of CNCF TAG Storage and Kubernetes SIG Storage. We're happy to be here today.

Hello, everyone. I'm Sergei, and I work at Percona. We build Kubernetes operators, and I'm also quite involved in the DOK community and hope that it grows.

Hello, also, everybody. My name is Álvaro. You can call me Alvaro. I'm the founder of a company called OnGres, which means "on Postgres". So you can guess what we do. We joined DOK from the early beginnings, and we have tried to help it grow to where it is right now. We have built a Postgres operator for Kubernetes called StackGres, and I'm happy to share all our thoughts about operators on Kubernetes today.

Excellent. All right. We're going to start with a question about what these folks are hearing. They work with a number of different end users who are running data workloads. We hear a lot of the same things in the Data on Kubernetes community. Alvaro, what do you hear from people about the challenges that they have with operators today?

So there are some challenges when you run operators with stateful workloads. The most obvious one is sometimes a matter of trust: should I run a stateful workload, a database, on Kubernetes? This is a topic that's been addressed already in many places. I actually wrote a blog post about it. And the takeaway is that yes, it's a very good idea to run a database on Kubernetes. Actually, I claim that it should be your default choice. The myths about performance and stability are things that operators already handle. There are also challenges around concerns like security. But the reality is that those challenges can be addressed by the operators. Actually, Kelsey Hightower, whom you probably all know, mentioned that to run a database on Kubernetes you need to fight Kubernetes. And that's certainly true, but again, that's something that's already done by operators.
And as of today, operators can automate things further than traditional environments have, thanks to day-two operations automation. They can take classical database maintenance operations like vacuuming, restarting, and repacking the database and fully automate them, something that barely exists outside of the Kubernetes world today. So there are challenges ahead. There are things that operators need to do, but they can be resolved mainly through automation.

The question was what challenges customers face? Okay. Yeah, what we hear mostly is that there is a growing trend. There is a desire to run databases in Kubernetes. There are various reasons to do that. Some just adopt Kubernetes as their de facto standard platform. Some users just want to get closer to the cloud native environment, and so on. But overall, among the challenges here, the maturity of some operators is a question, right? Because when you start using an operator, you expect, to a certain degree, that it can provide you the same topology that you previously had for your database, for example. Some operators can't do that because they have certain limitations. There are definitely security concerns. And when we talk about security, first there is the Kubernetes side of things: do I trust this operator? Because when you deploy an operator, it has elevated privileges. It has to manage a lot of resources for you. And the second issue with security is that, as long as it is databases, users always think about data-at-rest encryption. Is my data safe? What happens if my node goes down? What happens if I lose the storage, and so on? So data continuity is, I believe, the most important thing when we talk about data on Kubernetes as a whole.

So I think the operator model is very extensible and very flexible. That is great. But on the other hand, there are so many operators, so it has its challenges as well.
We don't really have a place that governs how operators should work or what a good operator is. There is no common set of tests that we could ask an operator to pass in order to say that it has a given capability. So it is very challenging for a user to choose which operator they should use. And there are also people here who are developing operators. They find challenges too, because if you want to support just one type of database, every database has its own native way to do backups. But if you want to support more than one type of database, it's more challenging to find a generic way. It is possible, but it's just more challenging. So we have seen other challenges like this.

Yeah, definitely. And to echo what Xing said, there are no standards for operators today, which I think is one of the reasons you see people cite varying degrees of operator quality as a challenge. Because there are no standards, there are no tests, nothing to test against to see how well these work. So it is a challenge. I'm curious: who here is running data workloads on Kubernetes today? Databases, AI/ML, streaming? All right, quite a lot of people. How many are using operators to do so? Basically, everybody. Okay, is anybody not using operators to run data workloads? Oh, really? Okay, interesting. Very cool. We want to chat with you. I'd love to learn more. Find us in the back. So as I mentioned before, people are expecting operators to perform at a high level. But does the framework actually capture what we need to consider for DOK? Does it capture what's needed for data workloads really well?

Yeah, so there's this capability model, right, for operators that classifies them into five levels. This was designed some years ago. And sorry if I'm very strongly opinionated, but I'm not 100% sure that this model, as of today in 2023, matches what running data workloads on Kubernetes should be.
Essentially, if you look at this capability model, and this is not a criticism, it's just an open door for improvement, there's no way to really test that you match any given capability level. So it's a self-assessment. And well, when you give vendors the option to self-assess, guess what, right? So that's one thing that can be improved upon: building some kind of test compatibility, some more objective measure of those capability levels. But even going deeper into the capability levels themselves, look at level five, which is the top one. Maybe for stateless workloads it's slightly different, but for stateful workloads, I doubt anyone can reach level five as of today. It's quite ambitious. And I love that, right? Essentially, it means that the operator is capable of self-tuning and self-healing, and it even uses an example where the database operator creates indexes on tables when it detects performance problems. I'm not even sure that, as a database expert, you want to do that. So it's very ambitious. And I love the vision that went into it. But I'm not sure the levels really match the reality of where data operators should reach today, especially level five. So if you go to OperatorHub, look for database operators, and check those that claim level five, well, you know, I haven't seen any database technology creating indexes automatically, other than some claims by Oracle Autonomous Cloud and all the capabilities that are there. So I think it's a good occasion now, in 2023, three or four years after this model was developed, if I'm not mistaken, to maybe sit down and give it some thought.

Yeah, to second that thought, we hear a lot from our users: okay, we want automation, we want to automatically scale the database when we have a peak coming up, or we want to auto-tune some of it, like some parameters, or create an index.
And when we started experimenting with automated upgrades, with this index creation and so on, it turned out that even though users want that, they are not going to use it. Because as a database expert, you want stability. Kubernetes is quite a volatile environment, and operators can provide a certain degree of automation for it. But at the same time, do you really want, as Alvaro said, to change an index in your prod database on the fly at night? Well, probably not so much. Scaling? Well, probably it would work. Why not? But DBAs are quite careful with upgrades, with scaling, with touching the database; they want stability most.

And so I'm not sure if the current framework captures how to assess the security aspect, because that's also one of the big concerns. So I want to mention that CNCF TAG Security has a very comprehensive white paper on what cloud native security is. It talks about the four phases of a cloud native workload: development, distribution, deployment, and runtime. Every phase needs to be secured. It talks about zero trust, providing immutability, providing availability of the service, doing auditing, preventing unauthorized access to your resources, and all of those things. So I think those are the aspects. Maybe we should figure out how to better assess operators. In CNCF, when a project goes to graduation, it goes through a security audit. That's a very comprehensive audit performed by a third party. Maybe an operator should also go through this type of security audit; then users will be more sure of the security of that operator.

That's interesting. It sounds like we need to do some more collaboration with the security group, too. Speaking of security, we mentioned it as the number one criterion people use to evaluate operators. How are they addressing security today?
What things could they do that could be better for operators?

Well, as I mentioned before, for operators, let's say there are two layers. The first layer is the Kubernetes layer, where as an operations engineer I want to be sure that the operator does not get a lot of privileges and does not mess with my other tenants in the Kubernetes cluster. That's the first layer of security. And it's sometimes really hard to give a Kubernetes operator the least privilege, because it has to create RBAC roles and service accounts. Sometimes it needs to create PVCs and other Kubernetes resources. The expectation is that the operator is going to automate a lot of stuff, and you can't just cut down a lot of permissions because, well, then it's not going to automate anything for you. And the second layer of security here is database security. It is data-at-rest encryption and so on. Operators for databases were greenfield like two years ago, or maybe even a year ago. But now we see that users are looking for more sophisticated ways to integrate with their existing security tools, key-value storage and so on, so that they can be sure that their data is safe. That's the most important thing.

So I see that there are obviously several layers and stages at which you can address security. There are the obvious things that you can do, like container security scanning, digital signatures, and software bills of materials. They are starting to be table stakes for most projects, and database or data operators should be no different. But I also see that there's an important path, which is to kind of deconstruct the security mechanisms present in more traditional databases and adapt them to a more cloud native scenario. And this is not for the sake of redoing things or doing things in a different way, but more because this is what users are expecting us to do.
And it is what you should do according to the policies that you're going to set up for running production workloads on Kubernetes. To give a more concrete example: if we think about Postgres, which I know better, there is a mechanism for controlling access to the database, part of its security mechanisms, called HBA, which stands for host-based authentication. And this is based on IPs. Well, you know, controlling access based on IPs on Kubernetes doesn't sound very cloud native, right? We need to talk more about things like labels and services. And this is not baked into the database. So there is some re-engineering that I think is going to be welcome, not in the operators in this case, but in the database itself. Or operators can also help. Something that, for example, we've done in our project, StackGres, is to offload SSL. The database provides SSL access as an option, but we have offloaded it to Envoy, which runs as a sidecar proxy, because we also helped implement, with the Envoy community, a plugin for the Postgres protocol in the Envoy proxy. This helps manage the SSL certificates with third-party tools, like cert-manager, and also with Envoy APIs, in a more cloud native way. So you can apply the same mechanisms, the same security policies and knowledge that you already have in your org, or that you're using for other projects, in the same way. And you don't need to do things differently because this is a Postgres database or something like that.

I think you two have already covered a lot. So I just want to say that we also want to provide ransomware protection. When you back up your databases, you also want to have one copy that is immutable, so that you have ransomware protection and are able to recover from that.

Awesome. Well, we've talked a lot about challenges of Kubernetes operators for data workloads. Let's talk about some solutions.
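
To make the earlier point about host-based authentication concrete, here is a minimal Python sketch of how label-based intent could be translated into the IP-based rules that `pg_hba.conf` understands. It is an illustration only, not how any particular operator works. The `Pod` class, the `hba_rules` function, the labels, the IPs, and the database and user names are all hypothetical; only the rule layout (connection type, database, user, address, auth method) follows the `pg_hba.conf` format.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Pod:
    """Hypothetical, simplified view of a pod: name, labels and current IP."""
    name: str
    labels: Dict[str, str]
    ip: str


def matches(pod: Pod, selector: Dict[str, str]) -> bool:
    """True when the pod carries every key/value pair in the selector."""
    return all(pod.labels.get(key) == value for key, value in selector.items())


def hba_rules(pods: List[Pod], selector: Dict[str, str],
              database: str, user: str) -> List[str]:
    """Render one pg_hba.conf line per pod that matches the label selector."""
    rules = []
    for pod in sorted(pods, key=lambda p: p.name):
        if matches(pod, selector):
            # hostssl: TCP connections using SSL; /32 pins a single IPv4 address.
            rules.append(f"hostssl {database} {user} {pod.ip}/32 scram-sha-256")
    return rules


# Hypothetical cluster state: two API pods and one unrelated batch pod.
pods = [
    Pod("api-0", {"app": "api", "tier": "backend"}, "10.0.1.12"),
    Pod("batch-0", {"app": "batch"}, "10.0.2.7"),
    Pod("api-1", {"app": "api", "tier": "backend"}, "10.0.1.31"),
]

rules = hba_rules(pods, {"app": "api"}, database="orders", user="api_user")
for line in rules:
    print(line)
```

Because pod IPs change whenever pods are rescheduled, something would have to regenerate these rules and reload the database configuration each time, which is part of why IP-based access control is an awkward fit for Kubernetes.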
So the Data on Kubernetes community started an Operator SIG last fall, and we've been meeting to discuss how we, as a group and as an industry, can collaborate to come up with solutions for some of the things that we've been talking about, working in DOKC in partnership with folks like TAG Storage, maybe SIG Storage, maybe SIG Security, and other places. One of the first projects that we started working on is a white paper with the CNCF Storage TAG. Xing, do you want to talk a little bit about the white paper, what it is, when it will be available, and whether anybody can participate?

Yeah, sure. So we see that there are more and more data workloads running on Kubernetes. So it is definitely very important to know what works well in that environment and to share knowledge. There are a lot of practitioners in the DOK community working on this, so I thought it would be great to leverage that expertise and work on this project together. And we now actually have a draft ready for review. So everyone is welcome to take a look and provide feedback. In that white paper, we talk about storage system attributes and how they affect running data in Kubernetes, and we compare running data inside versus outside of Kubernetes. We look at some of the common patterns and features used when running data in Kubernetes. We look at operators: what the best practices are, and what the criteria are for writing a good operator. And we also look at why observability is so important in the cloud native environment. When you have microservices running in a distributed fashion, you want to be able to detect problems early and prevent failures from happening. And we also talk about security, and about data operations like upgrade, backup and restore, data migration, and so on. So yeah, please take a look at that white paper and provide feedback.
Yeah, and for those of you who haven't started running data workloads on Kubernetes, this paper will be out, and hopefully it will be helpful to you. One of the other things that we've talked about is the fact that operators have varying degrees of quality. So we were trying to figure out as a group how we can help end users find operators based on different criteria and choose what they need. We discussed an operator feature matrix project within the community. It's a project we're working on to help people compare different operators. Obviously, there are a lot of operators out there, so we're starting with database operators. Alvaro, do you want to talk a bit about that project?

Sure. So as you were saying, Melissa, this comes from end-user requests, right? One of the challenges that I've heard a lot is questions like: this Postgres operator, does it support synchronous replication? Can I do PITR? Can I do disaster recovery? Which architectures does it run on? These kinds of questions. There are some blog posts on the web comparing different operators. But even if you read those, they're typically incomplete. They're stale by the time they're published, and they may no longer be updated. And the way those features are named varies from author to author. So it's very confusing for users to really understand whether an operator provides a given capability or not. And later on, you may also want to compare across operators to see which one provides the set of capabilities that you require to run your workloads. So based on those thoughts, a project was started within DOK that aims to create a feature matrix. A feature matrix is just a set of features that are categorized, labeled, well defined, and described, and that provide very objective and succinct information about a given capability of a given operator.
So we pulled together and started working on a proof of concept for this. It's now at the beta stage, open for feedback. We already have a feature matrix for Postgres, and later on it can be extended to other databases and other data technologies. We have also seen that some of these features are very specific to the technology. That's, after all, what operators do: they encode expert domain knowledge on a given subject. But there are also some common features that may apply to multiple operators. One of the goals with this project is to be fully open source, and it is up on GitHub. You can go and check it out; it's hosted by DOK. And it's programmatic too, meaning that all features are encoded in YAML files. There are scripts to derive the YAML files for vendors to make submissions. So every vendor that is implementing a Postgres operator right now will provide a submission stating which features a particular version of the operator complies with, and a justification for those. This will be very good information for users. And work is starting on a website, so you can see this in a more user-friendly manner. But a good thing is that there's a programmatic way to access all the features and the vendor submissions. There's also a rendering in Markdown, so you can see more clearly what these features are. And it's quite extensive: there are around 100 features right now for Postgres operators. So any feedback is more than welcome. Any contribution is more than welcome. Please join DOK on Slack or on GitHub, DOKC.

Yeah. And to add to what Alvaro is saying, we have a bi-weekly SIG meeting. Everyone is welcome to join if you want to see the work we're doing. We're not done yet. We're looking to agree on a common set of features we can come up with to compare across different operators. And it will be available on the DOK site.
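
As a rough illustration of the feature matrix idea described above (a catalogue of defined features, per-version vendor submissions with justifications, and a Markdown rendering), here is a small Python sketch. It does not reproduce the DOK project's actual schema or scripts. Every name in it (`Feature`, `Submission`, `render_markdown`, `operators_supporting`, the feature ids, and the operator names) is hypothetical, and plain Python objects stand in for the YAML files.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Feature:
    """One well-defined, categorized capability in the catalogue."""
    id: str
    category: str
    description: str


@dataclass(frozen=True)
class Submission:
    """A vendor's claim for one operator version: feature id -> justification."""
    operator: str
    version: str
    supported: Dict[str, str]


# Hypothetical catalogue and submissions, for illustration only.
FEATURES = [
    Feature("sync-replication", "High availability", "Synchronous replication"),
    Feature("pitr", "Backup", "Point-in-time recovery"),
    Feature("disaster-recovery", "Backup", "Disaster recovery"),
]
SUBMISSIONS = [
    Submission("operator-a", "1.0", {"sync-replication": "placeholder justification",
                                     "pitr": "placeholder justification"}),
    Submission("operator-b", "2.3", {"pitr": "placeholder justification",
                                     "disaster-recovery": "placeholder justification"}),
]


def render_markdown(features: List[Feature], submissions: List[Submission]) -> str:
    """Render the matrix as a Markdown table: one row per feature, one column per operator."""
    header = "| Feature | " + " | ".join(f"{s.operator} {s.version}" for s in submissions) + " |"
    separator = "|" + " --- |" * (len(submissions) + 1)
    rows = [header, separator]
    for feature in features:
        cells = ["yes" if feature.id in s.supported else "no" for s in submissions]
        rows.append(f"| {feature.description} | " + " | ".join(cells) + " |")
    return "\n".join(rows)


def operators_supporting(required: List[str], submissions: List[Submission]) -> List[str]:
    """Names of operators whose submission covers every required feature id."""
    return [s.operator for s in submissions
            if all(feature_id in s.supported for feature_id in required)]


table = render_markdown(FEATURES, SUBMISSIONS)
print(table)
print(operators_supporting(["pitr", "disaster-recovery"], SUBMISSIONS))
```

Keeping the feature definitions separate from the submissions is what lets the same question, for example "which operators offer both PITR and disaster recovery?", be answered the same way for every vendor.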
And I'll mention too, it's complementary to what's on OperatorHub, which covers installation, etc. This is more of a feature comparison matrix. So it's very complementary to what is on OperatorHub today. We are also working on a couple of other projects that are just getting started. Again, please join us if you'd like to have a voice in this. We really appreciate different perspectives here. One is standards. We've talked about a lack of standards. So can we, as a group and as an industry, actually create some that all of us can use to build our operators moving forward? And the other project that's just getting off the ground now is an operator security hardening guide. A member of our community had created one for their specific ClickHouse operator. We are trying to make it more generic for any Kubernetes operator, in particular database operators, and we'll be starting that work after KubeCon. So if you'd like to join that effort, please join us on the DOK Slack. I'll just ask one final question here too, and we might have time for a couple of audience questions if there are any. We've talked a lot about what the challenges are and what we're doing right now. What do you think the future holds? What is the conversation we should be having a year from now if all things go the way that they should with Kubernetes operators?

Okay. There's a topic I'm particularly interested in, and I would like to see more conversations about it. You know, the operator pattern is in reality a combination of two things, right? CRDs, which are custom resource definitions, and a controller, which is software that reacts to the creation, deletion, or editing of resources of these CRDs, right? And there's typically a one-to-one mapping between a CRD and a controller. But in my opinion, CRDs are kind of high-level APIs. It's a user interface, technically, to an operator.
And there are potentially commonalities across many operators for the same things. So I would like to see more conversations about decoupling CRDs from controllers, so that CRDs could become more like specifications that can be defined across many actors in the industry. Again, operators and CRDs are by definition where subject domain expertise lives, where you encode how you want to interact with things. And maybe there could be a repository of shared CRDs that operators may then implement differently if they so want, but that reduce the burden for users who switch from operator to operator or use them differently. To give a very particular example, if you talk about Postgres, there is a Postgres config file, postgresql.conf, and that can be exposed as a CRD, and I doubt different operators want to do different CRDs for this configuration file, right? We can sit down, all agree on the shape of a CRD for the Postgres config file, and everybody can use the same CRD. Then you can implement it however you want, right? Or, a lot of database operators use object storage for storing backups and reference an object storage handle, like an S3 backup handle. It's probably the same for almost everyone, right? So instead of every operator reinventing its own CRD, I would like to see more collaboration on this front and see CRDs more decoupled from controllers. I call this open CRDs, but that's just my name for it.

Yeah, so if I think about what's going to happen one year from now, there are a couple of expectations. I think that there will be some sort of consolidation of operators. There are many operators for different technologies right now, and it's really hard to track which one is good and which one is bad. So consolidation will happen, I expect.
The second one is that I really hope operators will reach such a maturity level that they will be more widely adopted by the community, and that would also lead to them not only, I don't want to say perform better, but do the job that no one is expecting them to do today. Because right now what we mostly see is a lift-and-shift approach. You take your databases on VMs and you want to run them in the same way in Kubernetes. But in reality, Kubernetes is becoming a de facto platform, a de facto standard for cloud native technologies. And running your databases on Kubernetes while leveraging its capabilities of multi-cloud, hybrid cloud, and no vendor lock-in is the future that operators are taking us to, because it gives us endless opportunities to avoid vendor lock-in and cloud lock-in for users, so that you can run your databases anywhere in the same way without any changes in your code or applications.

So I think that trying to standardize operators will be very challenging because there are just so many different types of databases. But given that there are so many operators and it's getting fragmented, I think it's worth making an effort to try to make that happen. So I was thinking: in Kubernetes storage we have CSI. There are many storage vendors who write plugins that implement those common interfaces. So maybe we could do something similar for operators and have an operator plug-in framework. We define common APIs and then have each operator implement those common interfaces, and then users will still have a common or similar experience. So maybe that can happen in the future.

Awesome. Excellent. That was us looking into our crystal ball. So we have time for maybe one question from the audience. Does anybody have a question for the panelists? There's a microphone right up here if you want to go up there.

Hi there.
I work for Neo4j, building our database operator. One of the big challenges we faced is how you quarantine a single replica of a database to do offline operations, maybe repair indexes. I'm just wondering if you have any insights on these kinds of challenges.

Can you repeat it again?

So basically one of the challenges we faced is how you quarantine a single node of your database cluster if you want to do offline operations, because it feels at the moment like using StatefulSets can be quite challenging for certain types of database operations.

Yeah. So with the operators we use the primitives in Kubernetes as much as possible to simplify our lives, but at the same time we understand that there are some challenges. As you mentioned, in a StatefulSet you can't cordon off one node, do something with it, and get it back into the pack, right? You need to kill it completely, and then Kubernetes will do its magic and recreate it from scratch. It's definitely a challenge, and different operators solve it in various ways. One thing I saw is creating a StatefulSet per single replica, and then you can meddle with it in some way.

So that's basically the only way we found: you just have a set of multiple replicas. We ended up with multiple StatefulSets, and I'd love to get away from that.

Yeah, I don't think there is an easy solution to this if you have just a single StatefulSet. You can do some magic with different images and with some sidecar containers, but that would be super ugly. Unfortunately, I don't have a solution in mind right now.

Yeah, thank you. That's helpful.

Awesome. Thank you. Well, thank you all for joining. We really appreciate it. If you have questions about running data workloads, please join the DOK community. We're at dok.community. Very easy to remember. If you have questions for us, we'll be at the back. Thank you.
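
The "one StatefulSet per replica" workaround mentioned in the audience question can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not any operator's actual implementation: the manifests are plain dictionaries rather than calls to a Kubernetes client, they omit volumes and most other fields, and the function names, cluster name, and image are hypothetical. The point is only that with one StatefulSet per database replica, a single replica can be scaled to zero without touching its peers.

```python
import copy
from typing import Any, Dict, List

Manifest = Dict[str, Any]


def statefulset_for_replica(cluster: str, index: int, image: str) -> Manifest:
    """Build a minimal StatefulSet manifest that owns exactly one database pod."""
    name = f"{cluster}-{index}"
    labels = {"cluster": cluster, "replica": str(index)}
    return {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "serviceName": cluster,
            "replicas": 1,  # one pod per StatefulSet
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                # Volumes are intentionally left out of this sketch.
                "spec": {"containers": [{"name": "db", "image": image}]},
            },
        },
    }


def quarantine(manifests: List[Manifest], index: int) -> List[Manifest]:
    """Return copies of the manifests with only the chosen replica scaled to zero."""
    updated = copy.deepcopy(manifests)
    updated[index]["spec"]["replicas"] = 0
    return updated


# Hypothetical three-member cluster; the image name is a placeholder.
manifests = [statefulset_for_replica("graph", i, "example.invalid/db:1.0") for i in range(3)]
after = quarantine(manifests, 1)
for manifest in after:
    print(manifest["metadata"]["name"], manifest["spec"]["replicas"])
```

The cost, as noted in the exchange above, is that the operator has to manage several StatefulSets per cluster and coordinate ordering itself instead of relying on a single StatefulSet.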