My name is Ided Pugya. Thank you all for coming. I am an ambassador of the Data on Kubernetes community. Our community consists of about 4,000 users and practitioners who share best practices for running stateful workloads on Kubernetes. Today, we are here to talk about Kubernetes as a data platform. The main reason is that we have seen so much interest in people running stateful workloads on Kubernetes, especially those that drive AI. I want to dig right into this topic because we only have 35 minutes and it's a big topic. I will let my panelists introduce themselves. Please, Peter.

Hello, everybody. It's great to see such a big crowd at our DOK. My name is Peter Piotrstopaniak. I'm a senior product manager at Percona, so I'm working closely with databases, and this whole topic is really close to my heart.

Hi, everybody. Welcome. My name is Robert Hodges and I'm CEO of Altinity. We are an enterprise provider for ClickHouse, which is a very popular real-time analytic database. I've been working on databases since 1983, so it's been a while. I've been working on Kubernetes for about five years. We currently run something like 200 Kubernetes clusters containing data warehouses, and then advise customers on how to run many more.

Hey, everyone. I'm Clayton Coleman. I work at Google on GKE, and before that I worked on OpenShift, but we've been in this community a long time. I helped design StatefulSets, so I'm sorry; I apologize for any issues. From a design perspective, I'm also responsible for pod safety guarantees. So if your pod ever stops running and you have to force delete it, then shame on you. You should never force delete pods. But that was also my fault. We're going to tell you all the dirty laundry about Kubernetes and data and secrets, and hopefully you walk out of this smarter and wiser than we are, for sure.

Thank you so much. Excellent. So let's start with databases on Kubernetes.
People have been running databases on Kubernetes for some time now, and we are starting to figure out how to do this well. What is working and what isn't?

Great question. The first observation I wanted to share: two years ago at KubeCon, most of the people who approached our booth said, databases on Kubernetes, really? Are you sure? And that completely changed. This year it was amazing; nobody asked that question. Everybody has at least tried it and knows that it's working, and they were actually asking about solid features and how to solve specific cases. I think what really changed is StatefulSets, as well as persistent volumes, definitely. The robustness and maturity of all the operators that are available for databases on Kubernetes also play a huge role in how this topic is growing. As well as what Kubernetes provides us, like high availability and disaster recovery with multi-region and multi-AZ deployments, or smaller things like decoupling the CSI drivers from Kubernetes, allowing them to grow and evolve separately. And last but definitely not least, the community that is around, the knowledge sharing. It's just amazing how much this community gives to each other.

Thanks. I had a sort of similar evolution to Peter's. About five and a half years ago, when I was in the interview process for my current job, the folks on the board said, hey, we've got this great new database that we're working with, ClickHouse, and we're going to run it on Kubernetes. And I thought that was a really, really bad idea, and we got into this huge argument. This was before I was employed. I think what settled it was that I agreed to take the job and work for free, and that was a kind of hard offer to beat.
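Peter's point about StatefulSets plus persistent volumes is the core pattern most database operators build on: stable pod identity with one volume per replica. A minimal sketch (the name, image, and sizes here are illustrative, not from any particular operator):

```yaml
# Minimal StatefulSet sketch: stable pod identity plus one PVC per replica.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: demo-db
spec:
  serviceName: demo-db          # headless Service providing stable DNS names
  replicas: 3
  selector:
    matchLabels:
      app: demo-db
  template:
    metadata:
      labels:
        app: demo-db
    spec:
      containers:
        - name: db
          image: example/db:latest     # hypothetical database image
          volumeMounts:
            - name: data
              mountPath: /var/lib/db
  volumeClaimTemplates:          # each replica gets its own PersistentVolume
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi
```

Each pod (demo-db-0, demo-db-1, and so on) keeps its volume across restarts and rescheduling, which is what makes running databases this way viable at all.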
Five years on, when I talk to customers, and we have people that run clusters with hundreds of nodes in them, and they tell me, hey, I'm up in the cloud, I'm running ClickHouse or I'm planning to run ClickHouse, and they're not on Kubernetes, I get kind of a sinking feeling. The reason is that databases, particularly analytic databases of the sort that I work with, need to scale and need to change: you need the ability to add more shards, you need the ability to add more memory. These are things that you kind of get for free with Kubernetes. For example, if you want to add more CPU power to a set of nodes, and you have a node selector on your pods, you make a one-line change, and then, assuming that Kubernetes is correctly configured, all your pods restart and you have the new machine type. This is an enormous simplification of the management of databases, particularly at scale. On top of that, as Peter mentioned, you have operators, which now incorporate increasingly sophisticated knowledge of fundamental operations, like upgrading a database: being able to do not just a rolling upgrade but a smart rolling upgrade, for example, that may preheat nodes if necessary before restarts, may upgrade one node out of 50, ensure that one completes, and then at that point begin to upgrade things in parallel because you're relatively assured they will complete. So this is a huge step forward, and I think Kubernetes has reached a point where, as I said at the beginning, if you're running databases, particularly in the cloud, particularly at scale, and you're not on Kubernetes, you may be in for a lot of hard work.

You know, it's interesting too, thinking about what you're saying. The state of the art in 2013 for running databases was you ran a single master and worker replicas.
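Robert's "one-line change" to get a new machine type might look something like the fragment below: edit the node selector in the pod template and re-apply. The instance-type values are examples; `node.kubernetes.io/instance-type` is the well-known label most cloud providers populate, though some setups use their own node-pool labels instead:

```yaml
# Pod template fragment: steer database pods onto a bigger node pool.
spec:
  template:
    spec:
      nodeSelector:
        node.kubernetes.io/instance-type: m6i.4xlarge  # was: m5.2xlarge
```

Changing that one value triggers a rolling restart onto matching nodes, assuming nodes of the new type exist or the cluster autoscaler can provision them.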
There were a few distributed databases: Oracle, IBM DB2, and you could kind of get SQL Server there. I don't know, I haven't used SQL Server in a long time. A lot of the early thought in Kubernetes was shaped by the fact that the state of the art for failover was running DRBD to sync data between two hard drives, and, oh gosh, I'm going to blank on the name. What was the high availability thing in Linux? Yeah, Linux HA, but I can't remember the agent. It was basically two processes watching each other, and if one of them didn't respond, each of them fought to kill the other. And in virtualization, there's this really complicated chain of, oh, we'll go shut this stuff down, and then if the hardware doesn't respond... And I've got to say, Kubernetes said, those are all really important problems, but if we solve them first, we'll still be here in 30 years and nobody will be able to run anything. So the first thing we did was make stateless workloads work. Then we were doing the StatefulSet design, which was actually called PetSets back then, and I still think that's a better name. StatefulSets is very boring and stodgy, but PetSets are cool, they're friendly. We went and looked at all the databases out there, and to be honest, not all of the databases at the time were very good at high availability. In the last 10 years, the state of the art has changed. Databases have evolved; MySQL and Postgres have both evolved to have very mature strategies, and there are still blind spots. People have built new versions of databases around them. Analytics databases have gone huge. Cassandra was an early Kubernetes workload; people asked us how to run Cassandra early on, and Cassandra cares about scaling. Cassandra doesn't really care about individual instances failing.
And so it's always tough when someone says Kubernetes is bad for stateful. I think you kind of had to be this tall to ride the stateful ride, maybe, back in 2014, and over the years that bar has kind of shrunk down as databases get better and people get more familiarity. And to be fair, it's not easy. It was never easy; it was just that nobody did it. So we went through this big period where suddenly everybody could do it. Yeah, we maybe got a little bit out in front of our skis there; it got a little dangerous for a while. But I think where we are today is the right patterns, the right practices. It can still bite you, and that means you have to practice it, right? If you don't do something all the time, if you don't practice for failure, you're not going to survive failure. It's like a backup system that you never test restores from: you do not have a restore system unless you've tested restore. So I think that's stuff we can think about for the next 10 years: we can do more to improve how things fail, injecting failure and testing failure. I know databases get a little grumpy when you shoot them, and I think we can do better about that. I think we can make it easier, as a platform admin team or a data admin team, to actually go through the worst-case scenarios in a more general way. That would be a great way to improve the state of the art.

Thank you for sharing.

Yeah, one thing I'd like to say, just to follow up on what Clayton said, is that there's been a trend which isn't entirely visible, in that databases and Kubernetes have kind of met in the middle in some sense. Let's just take storage. Kubernetes storage capabilities have improved enormously over the last, what is it now, 10 years. There was a time when it was really very difficult to manage storage on Kubernetes. That has completely changed.
We now have implementations of storage where, for example, we can dial the number of IOPS up or down, change bandwidth, extend storage, encrypt it. These are all really great capabilities. What's happened in the meantime is that databases themselves have become better at managing storage in a distributed environment. A simple example: virtually every database now has baked-in replication, so you do not need replication at the storage level. As a result, the storage can be dumber, just good at managing that particular patch of storage. In that sense, this convergence means that databases now, as long as they have access to the basic capabilities that Kubernetes provides, run very well. And I think we'll continue to see that evolution develop over time, as databases become more sophisticated at managing things like HA. That's always a topic.

Thank you. Let's talk about batch workloads on Kubernetes. The same question: what is working and what isn't?

All right, let me start. It's a very similar story: the whole topic has just matured a lot. We're getting much better support for GitOps workflows right now. Cluster autoscalers: like my colleagues mentioned, Kubernetes excels at scaling, and it's ahead of every other architecture there. Workflow orchestration with tools like Argo, Flux, and Tekton is just everywhere right now and makes our lives much easier. Also the flexibility and features available in CronJobs and Jobs. This is all allowing us to do things better, faster, and the way we want to.

Yeah, I think workflow and event streams have been a big story related to data, and Kubernetes plays into this in a couple of ways.
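The "CronJobs and Jobs" flexibility Peter mentions comes from the built-in batch APIs. A minimal nightly batch sketch (schedule and image are illustrative):

```yaml
# Nightly batch run using the built-in CronJob API.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-etl
spec:
  schedule: "0 2 * * *"            # every night at 02:00
  concurrencyPolicy: Forbid        # don't start a run while one is still going
  jobTemplate:
    spec:
      backoffLimit: 3              # retry a failed pod up to three times
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: etl
              image: example/etl:latest   # hypothetical batch image
```

Fields like `concurrencyPolicy`, `backoffLimit`, and `startingDeadlineSeconds` are the kind of knobs that used to require a bespoke framework on top.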
One thing that's really important about Kubernetes is that it enables you not just to run a database, but actually to build an entire application stack that's effectively vertically oriented and focused on solving a single problem: for example, again from our field, web analytics, security management, transaction processing. What we see increasingly is that as people run databases like ClickHouse and other analytic databases on Kubernetes, they're also bundling in things like Airflow, things like RudderStack, things like Dagster. Those are running in Kubernetes alongside the database. That's actually a very good model, because you can now stand up the entire stack in a development environment, in a staging environment, in a production environment, as opposed to having the database be this huge monolith with pieces of the architecture spread out across different operating environments. So this is a big theme, and as Peter said, things like Argo CD are playing a big part in enabling that kind of stack integration that includes data, includes workflow, includes higher-level things like visualization. The other thing we're seeing is that event streams, Kafka for example, are playing an increasingly important role in moving data around. One of the most popular services to run on Kubernetes, at least as a data service, is Kafka. We see it widely used: people run it themselves, or they use managed Kafka, which is often based on Kubernetes as well. And I think Kubernetes enables both of these: Kafka for the same reasons we described before, that it's a good platform for managing data, and the stack integration that allows you to pull together your batch processes and workflows. I think that's a really important movement, and actually one of the places that's getting increasing focus in the Kubernetes ecosystem.
It's interesting, and I'll take a side digression into batch history. At the time Kubernetes launched, the preeminent large-scale scheduler was Mesos, and there was a pretty big ecosystem of very high-scale batch processing that had been built around running Mesos. In some of the early compute clusters in 2010, '11, '12, '13, Mesos was really the hot topic. And with Kubernetes, a little bit like stateful, we had to pick a lane. We had to do something well to be relevant. Microservices and web services and monolithic giant web services, which have nothing to do with microservices, all have a lot of similarities: they need replication and they benefit from scale. With stateful, you want to be more conservative, and that's gotten better over the years. Batch was, I don't want to say an afterthought, because a key lesson from inside Google, which I got to hear when we talked about this in the community and the ecosystem, was that you really need to mix serving and batch to get high utilization, because you need different classes of workload that fit together. But not everybody cares about high utilization. For quite a lot of us, time to market matters, and simplicity of platform administration matters. If you're a platform engineer, you may not be focused on time to market, or you might have somebody breathing down your neck. So batch was always kind of like that. We knew it would be important. We built the fundamentals into pods: pods have a restart policy of Never for a reason, and that's one of the first fields that went on pods. We added Jobs, and CronJobs took some time to evolve. Are they scheduled jobs or cron jobs now? I can never remember; I have an alias that makes them appear under the same name because I always get confused about them. But we had CronJobs. So for the first eight years or so, I think batch was something that sophisticated people did with their own frameworks on top.
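The "restart never" primitive Clayton mentions is what the Job controller builds on: failed pods are replaced rather than restarted in place, and the Job tracks completions. A sketch of a parallel batch Job (values illustrative):

```yaml
# Parallel batch Job built on the restartPolicy: Never pod primitive.
apiVersion: batch/v1
kind: Job
metadata:
  name: parallel-work
spec:
  completions: 10      # ten successful pods finish the job
  parallelism: 3       # at most three pods run at once
  template:
    spec:
      restartPolicy: Never   # a failed pod fails; the Job creates a replacement
      containers:
        - name: worker
          image: example/worker:latest   # hypothetical worker image
```

The `completions`/`parallelism` pair is the basic fan-out model; indexed Jobs and JobSet layer more structure on top of it.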
And in the last three or four years, we've actually seen a big shift, which really coincides with data engineering and the early parts of the big ML boom that started in the 2010s and is continuing here; we'll talk about that in a minute. There is a need to bring a lot of data in, connect it to your web applications, and do some batch. The batch side of Kubernetes has really improved in the last few years. I'm very fond of the Kueue project, because it tries to do what Kubernetes does really well, which is to find something in the ecosystem that works for the most people for 90% of the use cases and supports the frameworks on top of it. Kueue allows you to start jobs and have them be delayed. Why would you want to do that? Because all of your machines are busy running something else, or maybe you have multiple teams and you want to ensure every team gets a fair share. Again, it's not a problem everybody has, but as more and more people do more and more data processing and have more and more state on Kubernetes, batch is actually a great thing to have around. We're in this transition phase where JobSet, a fairly new API running under the auspices of the batch working group, solves many of the classic big, complex job needs, though not all of them. We've actually seen a strong uptake in HPC-style applications, people looking to move off their more complex bespoke HPC systems for things that let them run serving at the same time as batch, or perhaps stand up new data pipelines that complement their old data pipelines. That mixing is, I think, Kubernetes' strength: Kubernetes became relevant because it helped people build platforms for web apps, then added state. You built a little bit of data gravity on your platform and got more and more comfortable with it.
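The queueing behavior described above is, roughly, how Kueue works: a job is created suspended and pointed at a queue, and Kueue unsuspends it when the team's quota allows. A sketch (the queue name is hypothetical; the label key is Kueue's documented convention, but check your installed version):

```yaml
# Job queued through Kueue: suspended until quota is available.
apiVersion: batch/v1
kind: Job
metadata:
  name: training-run
  labels:
    kueue.x-k8s.io/queue-name: team-a-queue  # LocalQueue this job waits in
spec:
  suspend: true          # Kueue flips this to false when the job is admitted
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: train
          image: example/trainer:latest   # hypothetical training image
```

Fair sharing between teams is then expressed through ClusterQueue quotas rather than in each job.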
And the dominance of data processing led to this demand for Kubernetes to improve. It's not perfect. In fact, I would say in many cases there are specialized solutions that serve the 5% of the biggest or most complicated use cases better than Kubernetes can. Hopefully that remains true, because if Kubernetes had to solve all of it, Kubernetes would be a giant complicated mess; if you think it's bad now, it would be much, much worse. So I'm actually really excited about batch for a number of reasons, but I think we're going to be talking about this in a second: the combination of serving and batch is about to become incredibly important in the next few years.

Excellent. Let's move now into machine learning and AI workloads. What is accelerating the move to run AI workloads on Kubernetes, and what needs to happen to speed up the process effectively?

So it's already happening. It's that marriage of batch and data teams. And there's really big AI, and then there's everybody AI. The everybody AI has been going on for a long time: folks building ML platforms on top of Kubernetes. Those ML platforms need data. They're fundamentally dependent on data, but it doesn't all have to live in Kubernetes. They're dependent on some state, but that state can often live outside of Kubernetes. What they really, really need is the ability to share and perform compute, training, and research, and to get access to accelerators. So this is my personal hobby horse right now: the generative AI movement is really just the inflection curve up on what we've been doing for quite a while, which is better and better support in the ecosystem for training models. And Kubernetes doesn't have to be visible to data scientists or to end applications to be useful. Being a platform for the data that you have to feed models on means, A, you're constantly changing those models.
So they're always moving in and out of production. You need to run multiple of them at the same time, because you never know whether a model is going to work until it's really rolled out. The demands on that at serving time are very similar to the demands of web services and web applications. They're not the same, and it would be a mistake to say, oh, we're just going to keep running with Kubernetes as it is. But what I've noticed, talking to people practicing, whether at the big scale, or brand-new startups that are only focused on gen AI, or people who have mature ML platforms around Kubeflow, is that every single one of them is using a huge chunk of Kubernetes features. They're using a huge chunk of the ecosystem, but everybody's using a slightly different set. And I think that's actually the real opportunity for Kubernetes as a data platform and as an AI platform: most people actually have really similar needs here. The space that hasn't really been explored is what we can do as a community to find those common patterns, and, based on what we've learned over the last 10 years of Kubernetes, whether we can build those in. It should be easy to get models that are stored in object storage into applications really quickly at startup, because startup time matters. It should be easy to share scarce GPUs between serving workloads, training workloads, the batch workloads that also need GPUs, and fine-tuning workloads. And all the things we don't even know yet that we're going to need five, ten years out. So those kinds of opportunities, those capabilities: if we're all going to be running this, we can look at what people are doing today and say, those are capabilities we can build better support for, just like we did for StatefulSets, just like we did for storage, just like we did for web applications. So that's my hobby horse.
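Sharing scarce GPUs starts from the extended-resource request model. A serving pod asking for one GPU might look like the sketch below; `nvidia.com/gpu` is the NVIDIA device plugin's resource name, and the image is hypothetical:

```yaml
# Inference pod requesting a single GPU via the device-plugin resource model.
apiVersion: v1
kind: Pod
metadata:
  name: model-server
spec:
  containers:
    - name: server
      image: example/inference:latest   # hypothetical serving image
      resources:
        limits:
          nvidia.com/gpu: 1   # whole-GPU granularity by default
```

Out of the box this is whole-GPU granularity; the finer sharing Clayton alludes to (time-slicing, MIG partitions, dynamic resource allocation) is exactly the area still being built out.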
Yeah, so AI is a really interesting topic right now, and not just because it's hot and sort of faddish at the moment. The question is what we can really do to enable people to use AI effectively. We get that question a lot at my company: hey, what are you guys doing for AI? And my real answer is, we're fixing bugs in reading Parquet. That seems kind of stupid, like I'm a dinosaur or something, which I kind of am, but what I really mean is that AI draws on data, and a lot of what's going on in data is actually happening outside of Kubernetes, and outside of databases for that matter. It's happening in data lakes residing on object storage. So the best way an analytic database can work with AI is actually to be able to read this data out on S3 and share it with AI processes. The kinds of architectures we see evolving often involve large-scale processes, be they batch, be they real-time, which are pulling in data and dropping it into a data lake, where it can be independently read and processed by data warehouses, by other types of databases, and by machine learning and AI, which is then training on it, building inference, things like that. So if we want to enable AI, I actually don't see quite the need that Clayton is describing. There are definitely features that are needed in Kubernetes, but as an analytic database vendor, we don't actually need them. What we need to do, first of all, is fix our databases so that they can just read this data better. That's part of the convergence of databases coming to the problem from above.
But the second thing, and this is where I think Kubernetes is going to have a really important role for us: as it becomes more important to extract information out of these data lakes, be it using AI or using data warehouses, remember that Kubernetes is a cluster manager. If you've got a bunch of data sitting out on object storage, what you need is cluster management to split up the work as you try to read information out of these data lakes in parallel as quickly as possible. So one of the really exciting opportunities for us is to adapt our database so that it takes full advantage of the cluster management and scheduling that's available in Kubernetes. We're fixing bugs in Parquet right now, but long term, this is what we're doing with Kubernetes to help people implement AI better.

Yeah, I think both of my colleagues nailed the topic. I can only add that what has already accelerated the growth of AI and machine learning on Kubernetes a lot is, again, the community and the tools being created, like Kubeflow, which is designed with AI in mind, designed to deploy machine learning faster. What we need to accelerate that growth even further is to keep working on more efficient ways to work with large datasets; that's definitely one of the things. Keep improving our persistent storage solutions and our caching strategies. And keep improving support for specialized hardware, like GPUs, in Kubernetes, allowing those models to be worked even faster.

That was awesome. Thank you so much. And if anybody has any questions for our panelists, we have an open mic.

Yep. Okay. Hi. Very good talk, by the way. Sorry to sit here. We were just talking about Kubernetes as a data platform, right? The trending topic right now is AI and everything, but I want to ask you, how do you see Kubernetes in 10 more years?
Could we use Kubernetes as a platform for all the cloud architectures here? Could we actually implement that? Because we're seeing Kafka, we're seeing operators for Elastic, we're seeing operators for everything. It's just a question that comes to my mind: how do you see Kubernetes in 10 more years?

It's like the last 10 years: we thought we'd get a lot more done after the first year. It took a year to get to 1.0, and we were like, oh great, we're going to go work on all these other features. Ten years later, we're like, should we get to some of those features soon? We really should. Actually, at this KubeCon, in the contributor summit, we were talking about how accelerators are a big challenge. So we're like, okay, Kube should be a cluster operating system; accelerators should be something you can really easily deal with. That's kind of a low-level detail. To your question, I want to see Kubernetes get flatter. We shouldn't be changing things in ways that make new problems show up. We should be predictable, and we should find the remaining sources of unpredictability. And then I think there will be more layers on top of Kubernetes. But a healthy ecosystem is a give and a take. Do you need 75 layers between you and the hardware? Probably not. I'd love for Kubernetes to work better with Linux, to work better with hardware, to work better with these crazy new complex compute devices that are basically computers in and of themselves, and then with batch frameworks and serving frameworks. If it's something everybody runs, I think Kubernetes should really try to make it work pretty well out of the box. But Kubernetes shouldn't try to be everything to everyone. And I do think the specialized systems remain: Kubernetes is not a database, and Kubernetes is not going to be a giant HPC cluster scheduling system, but it should be a building block for a really clear set of layers on top of it.

Yeah, by the way, thanks for this question.
It's a really wonderful one. Looking at Kubernetes, I have some simple things that I'd love to see evolve and get fixed, particularly for databases, and then longer-term things. I'll give an example of each. Short term, I would love it if Kubernetes had a less cavalier attitude about restarting things. If you have a database which might have 150,000 file descriptors open, you really don't want to restart it; you want to control that very tightly. Ideally, we'd like to be able to keep these databases up and running for years. The reason, and if you run databases you know this, is that restarts are expensive. You're slow; it may take an hour before you can actually process connections. That's an example of a short-term thing that Kubernetes, I think, could get better at. Long term, one of the things that fascinates me is the differential in prices between compute in clouds and compute in a place like Hetzner, where you can buy servers and just have them racked and run for you. If you compare the price of compute that, say, Snowflake would charge you with Hetzner, it's about 60 to one. Wouldn't it be nice, to Clayton's point, to have a flatter form of Kubernetes that would enable us to manage compute-intensive applications in those environments as well? I think there's a real opportunity there, and it's one that we're kind of fascinated with. If Kubernetes could help us get there, that'd be really cool.

Yeah. And I think that we will keep accelerating in AI and machine learning. I believe edge computing is also something very interesting for the future of Kubernetes. But to be honest, due to the open source nature of Kubernetes and its amazing community, it is really hard to predict what sort of crazy things we will figure out in the next 10 years.

Thank you so much. Thank you. We have time for practically one question. If possible, yes. Please, one minute.
I came to this session because in the description it says that there is a trend in data management toward separation of storage and compute. And I was intrigued, because to me, cloud-native means that you scale out both compute and storage, and now it says separation. Maybe you can comment on that?

Yeah, maybe I'll start with that one. That's a great question. Particularly in analytic databases, one of the big trends we see, which started with Snowflake actually, is to use object storage as the backing store for your data, and then to allow people to construct virtual data warehouses that apply as much or as little compute to that data as necessary, and moreover allow multiple groups of people to have their own virtual data warehouses. This is a really important capability, because it means you can basically drop the compute when you don't need it and scale it up when you do, and many things that we do with data are extremely computationally intensive. As it stands, I actually think, and maybe it would be interesting to hear what the other panelists think, that Kubernetes has storage management about right for separation of storage and compute. Why is that? Well, we can manage local storage, like NVMe SSDs. So we have the ability to get very fast storage that we can use for caches. It doesn't need to be replicated; it's just going to be populated by the application pulling blocks out of wherever the data is coming from. That's great. At the other end, we have object storage. To be frank, I don't know that Kubernetes needs to do a lot more to support that. Object storage is something that applications use. To the extent that we can take full advantage of managing the network, so that access to object storage doesn't conflict with things like reading off Kafka, that's good. Maybe Kubernetes could help a little bit with that, but I don't see it as a huge problem.
And then in the middle, you have block storage. And I think for ClickHouse there's another kind of separation of storage and compute, which is to detach the VMs from the underlying block storage. Products like EBS, Elastic Block Store on Amazon, do a wonderful job of that, and Kubernetes makes it very easy to use and, for example, to spin VMs up and down in capacity. So I think all the parts are there, and this is, in fact, one of the really great things about Kubernetes as it stands today.

Thank you, Robert. We are on time now. Thank you so much for joining. If you have more questions, we are going to be outside. Thank you so much for sharing your insights. A big applause, please, for our panelists. Thank you.