All right, good afternoon everyone. Thank you for coming to our talk. Today Sebastian and I will be talking about a few of the lessons we have learned along the way in running a database, in our case MongoDB, on Kubernetes. We hope you can apply some of the insights we had to learn the hard way to the solutions you are building.

A quick introduction of ourselves. I'm Rajdeep. I'm an engineering leader of the team responsible for running and managing MongoDB on Kubernetes. And I'm here with my colleague Sebastian. I'll let him introduce himself. Hi, my name is Sebastian Waskaricz. I'm based in Poland. I'm a staff engineer working for MongoDB, focusing on the hosted and Atlas operators, as well as on-prem deployments.

All right, for folks who are not aware of what MongoDB is, MongoDB is basically a document database that offers you the flexibility and the scalability to store and query data however you need. The same database supports different use cases, like streams, vector search, and so on. You can take the server binary and run it anywhere you want, from a local machine to a fleet of machines hosted somewhere. Internally, we broadly classify the deployment models into the following categories to meet the needs of the different personas of our customers.

The first one is fully managed, where MongoDB offers you a managed database-as-a-service experience: both the control plane and the data plane are managed by us on public cloud infrastructure, so our users don't need to run anything themselves. Atlas is MongoDB's completely managed offering. The second model is something we call semi-managed, where the data plane is in our customer's data center, but the control plane, which we call Cloud Manager, is managed and run by MongoDB. The third one, self-hosted, is what we will deep dive into today, where the customers run both the control plane and the data plane in their own data centers or private cloud.
We basically provide the container images for our self-hosted control plane, which customers can run in their data centers to manage the MongoDB deployments. To zoom in further on self-hosted, Kubernetes serves as the backbone for both the control and the data plane.

To give some historical context, prior to our Kube operators, our users would manually install the control plane first, wire it up, and then configure and deploy the data plane. As you can imagine, this was a series of manual, potentially error-prone steps to manage a database at scale. This was an ideal use case for automating the deployment and management of the database using Kubernetes operators. For those who are not familiar with what operators are, an operator is basically an active reconciliation loop running in your cluster that takes some desired state of the world and converges the current state of the world towards that desired state.

We divide the deployment into two parts. The first one is the control plane, which comprises the MongoDB Kubernetes operator and a software component called Ops Manager. This is only needed for the enterprise deployments. At a high level, the operator handles the Kubernetes side of things, like scheduling the database pods, doing the upgrades, and binding to volumes. And Ops Manager handles the MongoDB side of things, like performing monitoring, backups, and restores.

The data plane, no surprise here, has the MongoDB server and a component we call the agent. The agent is basically a sidecar process that runs alongside the MongoDB binary; it gets its instructions from the control plane and ensures that the MongoDB server is running with the same configuration that we have set in the control plane. I will now hand it over to Sebastian to zoom in further into the design. Going forward with the MongoDB enterprise operator architecture, imagine a running Kubernetes cluster with the MongoDB operator deployed in one of the namespaces.
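The reconciliation loop described above can be sketched in a few lines of Python. This is a minimal illustration of the pattern, not the actual operator code; all names here are hypothetical.

```python
def reconcile(desired: dict, current: dict) -> dict:
    """One reconciliation pass: compute the actions needed to move the
    current state of the world toward the desired state."""
    actions = {}
    for resource, spec in desired.items():
        if current.get(resource) != spec:
            actions[resource] = spec  # create or update the resource
    for resource in current:
        if resource not in desired:
            actions[resource] = None  # delete resources no longer desired
    return actions

def apply(current: dict, actions: dict) -> dict:
    """Apply the computed actions to the (simulated) cluster state."""
    for resource, spec in actions.items():
        if spec is None:
            current.pop(resource, None)
        else:
            current[resource] = spec
    return current

# One pass: a database pod is missing and a stale service must go.
desired = {"mongodb-0": {"image": "mongodb:7.0"}}
current = {"old-svc": {"port": 27017}}
current = apply(current, reconcile(desired, current))
```

After the pass the current state matches the desired state, and a second call to `reconcile` returns no actions; a real operator runs this loop continuously, triggered by watch events.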
Let's take the initial step by deploying an Ops Manager custom resource. Once the Ops Manager CR is discovered by the operator, it creates an AppDB, which is the backing database for the Ops Manager application. Looking at the AppDB pod, it is composed of two containers, a database and an automation agent. What makes this setup interesting is the role of the agent. It is responsible for applying the automation config, which is static and versioned. Each time we would like to change anything in the database, for example create a new role or a new user, the operator creates a new version of the automation config, which is then served to the agents. The agents deploy all the changes into the database. Think of the agent as the maestro of the MongoDB orchestra, with full control over the database, from managing its lifecycle to orchestrating backup and restore processes.

As we move forward, the operator deploys Ops Manager. It is a Java-based application focused on the operational aspects of the MongoDB system. It is useful to think of Ops Manager as an administrative panel for MongoDB databases that has additional responsibilities for backups, including all the housekeeping work, monitoring, alerting, and supervising the running MongoDB clusters. Typically, a single instance of Ops Manager manages multiple MongoDB clusters.

Once Ops Manager is deployed, users typically start deploying MongoDB workloads with MongoDB custom resources. The agent processes download all the necessary data from Ops Manager and kick mongod off. Once everything is running, we can expose the cluster for outside consumption. In this example, we are using NodePort services, but using load balancers is another popular option. Raj previously talked quite a bit about the control and data plane separation. In our example, we can draw a horizontal line: the yellow part of the slide represents the control plane, whereas the gray one represents the data plane.
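The "static and versioned" automation config idea above can be sketched as follows. This is a hedged, purely illustrative model: the real automation config schema and agent protocol are more involved, and all names here are made up.

```python
import copy

class AutomationConfig:
    """Illustrative sketch of a static, versioned automation config.
    Every change bumps the version; an agent applies a config only if
    its version is newer than the one the agent last deployed."""
    def __init__(self):
        self.version = 1
        self.state = {"roles": [], "users": []}

    def add_role(self, role: str):
        self.state["roles"].append(role)
        self.version += 1  # a new version is served to the agents

    def snapshot(self):
        """Return an immutable view the agents can safely consume."""
        return self.version, copy.deepcopy(self.state)

cfg = AutomationConfig()
cfg.add_role("clusterMonitor")
version, state = cfg.snapshot()

# An agent deploys the change only if the version moved forward.
agent_last_seen = 1
needs_apply = version > agent_last_seen
```

The design choice worth noting is that the agents pull a complete versioned snapshot rather than receiving imperative commands, which makes the rollout idempotent and safe to retry.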
So when is thinking about the control and data plane separation useful? Well, it turns out we have very creative customers who started using the operator in ways we hadn't thought about. Some customers built their own internal data platforms. They often expose a way for their internal development teams to create MongoDB clusters on demand. This lets them get the best of both worlds: a centralized control plane with centralized reporting, plus creating clusters on demand.

This example made us realize a few things. Ops Manager and MongoDB CRs cannot directly reference each other; the reference between them needs to be supplied in a standard form, such as a ConfigMap or a Secret. The MongoDB CRs quite often need to be pre-populated to point to the centralized control plane. This is where opinionated templates come into play, and Helm is just one tool that enables that. The MongoDB clusters may reside in different physical locations; as long as there's connectivity between the data plane and the control plane, everything should work out of the box. Finally, the control plane doesn't need to be deployed on top of Kube. Some of the customers choose VM deployments for it.

Now let's have a look at some of the pitfalls we've fallen into when designing the operator. Over to you, Raj. All right, when running a stateful application like a database on Kubernetes, storage and networking are two of the areas where we had to be more thoughtful while evaluating the trade-offs and design decisions. On the storage side, we have tried to be as agnostic as possible of the underlying storage that our users provision for the database. We do provide some defaults, but we let users configure most of the parameters related to storage: users can leverage CSI topology to co-locate the database pods alongside the storage, and they can override persistence settings for MongoDB journals and oplogs. All this is possible via the custom resource, where they can provide the override settings.
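Going back to the ConfigMap indirection between MongoDB CRs and the control plane mentioned above, the lookup can be sketched like this. Field names here are illustrative assumptions, not necessarily the operator's real schema.

```python
def resolve_control_plane(cr: dict, configmaps: dict) -> str:
    """Hypothetical sketch: the MongoDB CR does not embed the Ops Manager
    address directly; it names a ConfigMap, and the operator resolves the
    address through it. This keeps the two CRs decoupled."""
    ref = cr["spec"]["opsManager"]["configMapRef"]["name"]
    cm = configmaps[ref]
    return cm["data"]["baseUrl"]

# The platform team pre-populates this ConfigMap (e.g. via a Helm template)
# to point every team's MongoDB CR at the centralized control plane.
configmaps = {
    "om-connection": {"data": {"baseUrl": "https://ops-manager.central.svc:8443"}}
}
cr = {"spec": {"opsManager": {"configMapRef": {"name": "om-connection"}}}}
url = resolve_control_plane(cr, configmaps)
```

Because the reference is just a named ConfigMap, the same MongoDB CR template works in any member cluster, and the control-plane address can live outside Kube entirely.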
Lastly, we would like to call out a particular upstream issue around supporting volume resize via the StatefulSet. Currently, you would need to patch the volumes of the StatefulSet pods and recreate the StatefulSet to resize the storage. We are in the process of automating this so that our users don't need to perform these manual steps, but having support for this upstream would be the ideal fix.

On the networking side, there are basically two kinds of communication happening. One is between the database nodes, to perform operations like quorum voting or replication, and the second is from the database clients, or the drivers, talking to the DB. If both of these happen within the cluster, things are pretty straightforward, and we leverage the headless service that we create for the MongoDB StatefulSet to enable it.

Things get slightly interesting when we start looking at DB drivers living outside the Kubernetes cluster. We leverage something called the split horizon feature in the server. I won't deep dive too much into it, but at a high level, think of it like this: the server has a notion of multiple address spaces, which means that within the cluster it can talk over the service FQDN, while clients outside the cluster can leverage the load balancer hostname. It does this by maintaining a lookup table where the key is a horizon, or view, and the values are the server addresses corresponding to that view. The advantage of this approach is that the inter-node communication still happens without leaving the cluster, so there's low network latency, while the drivers are still talking over a load balancer. The second way is to override the address of each server with the load balancer hostname that we provision for each of the MongoDB nodes.
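The split horizon lookup table described above can be sketched as a simple mapping. The hostnames below are made up for illustration; the real server-side mechanism lives in the replica set configuration.

```python
# Sketch of the split horizon idea: each replica set member carries a
# lookup table keyed by horizon (the "view"), mapping to the address
# that is valid in that view.
horizons = {
    "mongodb-0": {
        "internal": "mongodb-0.mongodb-svc.ns.svc.cluster.local:27017",
        "external": "mongodb-0.example.com:27017",
    },
    "mongodb-1": {
        "internal": "mongodb-1.mongodb-svc.ns.svc.cluster.local:27017",
        "external": "mongodb-1.example.com:27017",
    },
}

def addresses_for(view: str) -> list[str]:
    """Return the server addresses a client in the given view should use."""
    return [views[view] for views in horizons.values()]

internal = addresses_for("internal")  # in-cluster drivers, low latency
external = addresses_for("external")  # outside drivers, via load balancers
```

A driver connecting from inside the cluster is handed the `internal` addresses, so replication traffic never leaves the cluster, while the same replica set advertises `external` addresses to clients coming through the load balancers.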
The reason someone would do this is that the cluster certificate authority might not be able to provision a certificate for the local FQDN, which ends in svc.cluster.local, so they need one central hostname to serve traffic. As you will have guessed already, the downside of this approach is that the internal DB communication also happens over a load balancer, which might add some latency. Lastly, we are looking into leveraging service mesh options, which I will talk about a bit more on the next slide.

All right, multi-cluster is one of the topics we're excited to talk about, where we let users run MongoDB deployments across multiple isolated Kubernetes clusters, potentially spread across different regions. Currently, we only span the data plane across multiple Kubernetes clusters. We have this notion of central and member clusters. The central cluster is where the control plane runs, and the member clusters are for the data plane. Having said that, if users want to optimize for resources, they can use the same cluster for both the control plane and the data plane. As I briefly touched upon in the previous slide, we strongly recommend that our users leverage a service mesh solution to solve the problem of MongoDB server discoverability across clusters. However, users can use their own networking solution if need be; we have tried to make the design fairly agnostic of the networking solution they choose.

One of the capabilities I would like to briefly touch on is handling DR scenarios. Users can leverage the operator to handle DR scenarios automatically: the operator performs a health check for each of the member clusters and verifies whether it's down. If a cluster is failing its health check, the operator tries to shuffle nodes over to healthy clusters, based on an algorithm that distributes the workloads efficiently and evenly.
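The rebalancing step in that DR flow can be sketched as follows. This is a hedged illustration of the principle only; the operator's actual algorithm is more involved, and the cluster and member names are invented.

```python
from collections import Counter

def redistribute(members: dict, healthy: set) -> dict:
    """Move replica set members off clusters that fail their health check
    onto healthy clusters, keeping the distribution as even as possible."""
    # Members already on healthy clusters stay where they are.
    placement = {m: c for m, c in members.items() if c in healthy}
    displaced = [m for m, c in members.items() if c not in healthy]
    for member in displaced:
        load = Counter(placement.values())
        # Pick the healthy cluster currently hosting the fewest members.
        target = min(healthy, key=lambda c: load.get(c, 0))
        placement[member] = target
    return placement

members = {"rs-0": "cluster-a", "rs-1": "cluster-b", "rs-2": "cluster-c"}
# cluster-c fails its health check, so rs-2 must be rescheduled.
placement = redistribute(members, healthy={"cluster-a", "cluster-b"})
```

After the pass, every member sits on a healthy cluster and no healthy cluster is overloaded relative to the others, which mirrors the "efficiently and evenly" goal described above.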
For customers who don't want to use the automatic DR feature, we also provide a CLI experience which users can leverage to perform the recovery manually by interacting with the custom resources. I will now hand over to Sebastian to talk about a few of the future challenges.

At the moment, we are working on a few very exciting projects that I wanted to share with you. As Raj mentioned, we've used the hub cluster pattern for multi-cluster deployments. Now we'd like to take this one step further and think about how to make the control plane more highly available. We are exploring different possibilities, including operator failover, or designing it in such a way that it can be moved from cluster to cluster very quickly.

Another very interesting topic that recently popped up is allocating different resources to specific pods in a StatefulSet. To unlock the full potential of MongoDB, the primary needs to have faster disks, because it performs the writes, and more memory, because of query aggregation. The existing StatefulSet implementation doesn't provide such options, but we are investigating the LeaderWorkerSet, which could help us here.

Next is external connectivity. MongoDB uses an intelligent driver that picks specific nodes in a replica set. For that reason, we need to deploy a load balancer in front of every replica set member. We are experimenting with sacrificing a bit of performance to make this scenario a little more user-friendly. There are a number of options on the table, including our own mongos processes, which aggregate data from multiple nodes and can act as routers, but we are also considering using Envoy proxy and teaching it certain aspects of the MongoDB wire protocol so it can route the traffic effectively. Finally, the Gateway API will help us provide a streamlined experience. Last but not least is telemetry.
Anyone who has tried to collect telemetry data knows that customers quite often disable it in production, and gathering data from dev and staging clusters doesn't really make sense. So we are experimenting with different approaches, but as always, we'd like to act as good citizens: collect only the data we are allowed to, anonymize sensitive information, and ensure we have enough insight for our customers so that we can drive our business effectively. Having said that, thank you all for listening, and we are ready to take questions.

Thank you for the presentation. You mentioned multi-cluster, and I saw in the diagram you were showing that you need a dedicated centralized cluster, right? Can you explain exactly why? Yeah, thank you for the question. The reason we have a separation of the central and the member clusters is primarily separation of concerns: we want to dedicate a particular cluster to only the control plane, and a particular set of clusters to the data plane. This is pretty similar to how Kubernetes works in general: you have the master nodes for the control plane, and the worker nodes for your actual workloads. Having said that, as I mentioned, if you want to optimize for resources, you can choose the same cluster for both the control plane and the data plane. We don't make it a hard requirement. I think we even have some customers that run two clusters only, and they squash everything into those two clusters. So there are many different options for how you can deploy things. We don't set any hard requirements, so it's really up to the customer how to deploy it.

Hello, I wanted to ask you about the StatefulSet, because you briefly touched several topics that the StatefulSet probably forbids for now. I just wanted to ask about your thoughts on migrating to something else, maybe managing the pods on your own, or you mentioned some other solutions.
We do have a great example with CloudNativePG, which switched away; the other Postgres operators typically use StatefulSets to some extent, but CloudNativePG does it differently. So I was just wondering what your options are here, and whether you see any benefits in switching away from StatefulSets. So, we don't have any straightforward answer. We started with StatefulSets, and later, as we were developing the operator, especially when we were designing the multi-cluster capabilities, we were experimenting with writing custom controllers to do that. We even experimented with some controllers that were available back then, but that was a few years back, and the ecosystem wasn't as rich as it is today. For now, we have noticed a very interesting proposal from Google, the LeaderWorkerSet. It's designed for machine learning; however, it enables us to set different amounts of resources and different expectations for specific nodes, like the primary and the secondaries. So that's one of the hopes that our development could explore later on. But we are open to any suggestions, so if you have any good ideas we could explore, we would be more than happy to take that feedback.

Thanks for the presentation. Regarding the challenges that you shared on the last slide. So, oh, sorry. So, for the external connectivity, I guess MongoDB is using TCP. What do you think about using L4 LBs instead of going through an ingress and adding one more hop between the clients and the database? Could you please repeat the question? So, instead of using an ingress, which is one more hop between the clients and the database, what are your thoughts about using L4 LBs? Of course. Maybe let's explore what the scenario is here: the customer is outside of the cluster, we have the cluster inside of Kube, and we are trying to connect to it.
Yes, so basically we're talking about a database here, so it's a huge amount of data. When we add more hops in between, it might add some latency. To avoid that, do you have any ideas to reduce the number of hops in between? For example, Cilium provides an L4-level load balancer. Okay, so I think this problem can be split into two pieces. One of them is the scenario where the client application is outside the Kubernetes cluster, and we need L4 load balancers to connect to the MongoDB cluster, because the MongoDB wire protocol is built on top of TCP; it's not an HTTP protocol. So that's why we need an L4 load balancer for it.

Now, for simplification, we also explored the scenario of having replica set members connect through load balancers internally, so the inter-node replication, primary to secondary. We came up with an optimization: we patched kube-dns to avoid the hairpinning, that is, we tricked the DNS into returning the proper addresses inside the Kube cluster. It is definitely not a solution that is production-ready, and it probably satisfies only a few customers because of the complexity of the whole setup. But again, it depends whether we optimize for simplicity, and then it's probably fine, or we optimize for performance, and then we definitely need to do something about this.

There's something to add to that, since we mentioned in the slide that we try to keep our deployment as agnostic as possible of the underlying infrastructure. And since you mentioned Cilium, we do have a few users who have gotten both single-cluster and multi-cluster running using Cilium as the CNI and also as the service mesh. So there are a lot of solutions we might not be aware of, but our customers already have them running in their production usage. Okay, thank you. Okay, thank you again. One more? I'll approach that next time. One more? I'm sorry. Hello, thank you.
When spreading the data plane across multiple clusters in a cross-region replication model, are there certain latency thresholds after which things fall apart? Or is that not known definitively? So, if I get the question correctly: has cross-region replication for multi-cluster hit some bottlenecks, have we reached a limit there? Yes. Yeah, so that's a great question. And so far, we haven't. We have recently gated this feature, and we have some network latency requirements which our customers have to adhere to in order to get multi-cluster running. But we are very much aware that at some point the network latency would probably impose a limitation on the way the replication works. There could be some data drift, but we don't have a good answer for that yet. We are still investigating how to either lower that latency or basically hack on the MongoDB wire protocol. Okay, thank you again. Thank you.