Hello, my name is Reinhard Tartler, this is Yao Lin, and we welcome you to our presentation, Bloomberg's journey to a multi-cluster workflow orchestration platform. Let me start by giving you a little background about what we are working on. We are part of the Cloud Native Compute Services group at Bloomberg, where we offer a platform called workflow orchestration. This is not an externally visible product, which means that all of our customers are internal engineers who generally use Bloomberg infrastructure and data. The team and the system are rather new but have seen quite rapid growth and popularity. We believe that our requirements, challenges, and conclusions may resonate with at least some of you in the audience, and we are thrilled to share with you what makes our journey unique and to get into a conversation with you afterwards.

Before we dive into the technical details, let me walk you through what we mean by a workflow orchestration platform, to give you a better idea of what we're actually offering. In a nutshell, our workflow orchestration platform provides general utility compute for run-to-completion batch jobs. So what does that mean specifically? It means that our customers come to us with two things: code in containers on our company's internal container registry, and an execution plan that defines what to execute, in which pods, and in what order. That plan may also define the fan-out of branches of execution, or define synchronization points where the workflow requires several nodes to terminate before proceeding. Luckily, we are able to use Argo Workflows, a CNCF project that reached the graduated maturity level in December 2022, for the core functionality that our clients primarily use. Argo Workflows comes with a controller; custom resource definitions (CRDs) for DAGs and workflow steps; and, most prominently, a user interface that allows users to visually observe the progression of their workflows. It is also critical for debugging when containers don't execute as expected.

At Bloomberg, we are proud to offer workflow orchestration as a service for a number of different use cases, such as machine learning orchestration, which covers training and tuning of our company's AI models. As an aside, AI is not something new to us at Bloomberg. We've been using AI for more than 15 years, since 2009, to manage the large volume of financial data, news, and analytics our customers access every single day. Now back to our system. We have users implementing custom CI/CD solutions. In many cases, we see users trigger maintenance tasks on both physical and virtual infrastructure. We have seen users orchestrate financial analysis tasks to build all kinds of processing pipelines. As you can see, this is quite a variety of use cases, and that is what calls for a platform that provides general utility compute for batch processing applications, because we run all of them on the same physical hardware.

While Argo Workflows is a great platform, we do have a number of additional requirements at Bloomberg, both functional and non-functional, that steered our journey towards a multi-cluster solution. Let me give you a quick walkthrough here. Bloomberg is first and foremost a data company. The data is proprietary, substantial in size, and has very specific access requirements. Our users generally benefit from our UIs and APIs to keep track of workflows, pods, inputs, and workflow logs; our users rely on our platform to develop new workflows and to troubleshoot when something misbehaves.
Additionally, we provide our users means to observe their resource budgets and allocations, so that they can make informed decisions on how to size their pod resource limits and on when they need additional resource allocations. Thinking more about functional requirements, Bloomberg expects its engineers to be mindful of where their applications run. The most obvious instance is choosing the correct network zone, think prod or dev network, for data and compute. Additionally, Bloomberg is very conscious about production stability. In general, running workflows in prod requires approval from someone in your management chain. Our platform is tightly integrated into Bloomberg's standard approval processes, which makes it easy for our clients to comply with the relevant company policies. Most users rely on being able to run workflows on a schedule, either by specifying a cron string or by coming with their own custom scheduler that is connected to external events or internal triggers.

One of the most challenging requirements our team had to overcome is cross-data center resiliency. This basically defines how application teams must develop their applications for resiliency across multiple data centers. From our perspective, that means that we essentially need to provide several installations of Argo Workflows, at least one per data center. This, however, comes with additional cognitive load and failure modes that we want to discuss in this very session.

With this introduction and overview, we are now ready to embark on our journey; more specifically, how we, the workflow orchestration team, started out, what we had at the beginning, and where we ended up today. This presentation will walk you through our analysis, the tools in our toolkit that we had at hand, how we adjusted our architecture, and what new tools we adopted. Finally, we take a look at what we found and give a critical analysis, before concluding our talk with hopefully enough time to have a conversation with you.

Let's start our journey with an application team that runs some financial analysis report on a weekly basis. For this, they may create a workflow that starts by gathering data from various data sources, does some calculation, then persists the results to wherever they need to go, and at the end sends a Slack message to notify that it's done. In this case, we know that our users deal with two kinds of resources that have very different characteristics. On the one hand, we have the workflow, which starts and ends at some point, and after that is no longer needed on the cluster and can be deleted to free up etcd resources. On the other hand, that workflow requires a number of other resources on the cluster to work properly, such as config maps with connection strings for the database, or secrets with authentication credentials. Those kinds of resources generally stay on the cluster.

When we started our journey, we considered the following setup. Given two data centers, let's call them Oscar and Romeo, we need to provide means for our users to submit their workflows and to deploy static resources such as config maps and secrets. Such static resources get automatically copied to both data centers. For submission, in this scenario, we would need to ask the user to choose in which data center the workflow should run. This has a number of challenges, but most obviously, it's way too easy to pick a data center that is overloaded or possibly not even available.
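As a concrete illustration, here is a minimal sketch of that weekly report as an Argo Workflows DAG, written against Argo's Go types (github.com/argoproj/argo-workflows/v3). The image name is hypothetical, and for brevity all four tasks share one container template, where a real workflow would use a different image or command per step:

    package main

    import (
        "fmt"

        wfv1 "github.com/argoproj/argo-workflows/v3/pkg/apis/workflow/v1alpha1"
        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "sigs.k8s.io/yaml"
    )

    func main() {
        // The weekly report as a DAG: fetch -> calculate -> persist -> notify.
        wf := wfv1.Workflow{
            TypeMeta:   metav1.TypeMeta{APIVersion: "argoproj.io/v1alpha1", Kind: "Workflow"},
            ObjectMeta: metav1.ObjectMeta{GenerateName: "weekly-report-"},
            Spec: wfv1.WorkflowSpec{
                Entrypoint: "main",
                Templates: []wfv1.Template{
                    {
                        Name: "main",
                        DAG: &wfv1.DAGTemplate{Tasks: []wfv1.DAGTask{
                            {Name: "fetch", Template: "step"},
                            {Name: "calculate", Template: "step", Dependencies: []string{"fetch"}},
                            {Name: "persist", Template: "step", Dependencies: []string{"calculate"}},
                            {Name: "notify", Template: "step", Dependencies: []string{"persist"}},
                        }},
                    },
                    {
                        // Hypothetical image on the internal registry.
                        Name:      "step",
                        Container: &corev1.Container{Image: "registry.example.com/reports/step:latest"},
                    },
                },
            },
        }
        out, _ := yaml.Marshal(wf)
        fmt.Println(string(out))
    }

In the two-data-center setup just described, the user would additionally have to pin this workflow to Oscar or Romeo at submission time, and that manual choice is exactly the weak spot.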
Clearly, this approach doesn't work, and to be clear, this is not a theoretical but a very practical issue for us. In order to gain confidence in failover scenarios, we conduct mandated data center resiliency tests on a regular basis, where the service is brought down in one data center for an extended period of time. On the one hand, this is a major inconvenience for everyone involved. But on the other hand, this is the only way to build up trust and confidence in the failover mechanisms that we implemented.

To overcome these issues, we have chosen to introduce a management service in front of the workload clusters. This service offers an API for workload submission and for deployment of static resources. The management service provides reliable placement decisions to address cases where a data center is currently unavailable or possibly overloaded. It also provides a user interface that allows onboarding of new users, submission of workflows, setting up and editing workflow schedules, management of workflow templates, and much more. This is also the best place to record workflow submissions for bookkeeping, so that our users can easily retrieve their previous workflow submissions from months ago. By the way, we also use this service to implement sharding of our tenants' workflows over multiple clusters. We have observed that our clusters are generally comparatively underutilized, but the Argo workflow controller tends to get quite busy due to the large number of workflows that it has to process. Therefore, we tend to prefer smaller clusters, which helps with scaling, but also helps to minimize noisy neighbor issues and reduces the blast radius of disruptions.

Now that we have covered how we handle runs in workflow orchestration, let me point out the most obvious difference to static resources. In general, workflows run exactly once on one cluster, whereas static resources, so think config maps and secrets, generally need to get consistently deployed and kept up to date on all participating clusters. This imposes quite a different set of challenges. At this point, I'd like to hand it over to Yao to walk you through the details and particularities of how our platform handles static resources.

Thanks, Reinhard. So we'll move on to static resources. As we can see, the management API here needs to be globally available, even if one data center is currently unavailable. So in order to keep the management service reachable at all times, it is deployed in each data center, Oscar and Romeo. From the user's perspective, requests are routed to the nearest location with DNS-based service discovery. But we'll just pause that topic here, because it could easily extend into a talk of its own. Let's come back to the static resources. Similar to handling workflow runs, the API accepts user requests first, makes placement decisions, and pushes the resources to where they need to be. However, there are immediate concerns, because these are static resources: they must be consistent and persistent on the clusters. There are a few what-if scenarios. What if we temporarily need to make some fundamental upgrade to a cluster, rebuild the cluster, and bring back everything? Can we lose those resources and still recover them? Additionally, think of the functionality of this API: it needs to do three basic things, accept a request, make a decision, and perform the deployment actions. Is that too much for it?
And for every write request, the user must wait until everything has finished. That raised the immediate concern that we need some place to store these resources, at the very least some place that survives a disaster. The most straightforward approach would obviously be to store them in a database. Then this API simply accepts a request, makes a decision, and writes it into the database. And on the other side, each workload cluster has a sync agent that pulls the things that should land there.

Now, it looks promising. It looks good. It satisfies our users' requests at a basic level. But we have observed that it became increasingly difficult to actually operate. In particular, we started seeing the management service eat up our error budget, show increased latencies, and exhibit issues when creating resources, most likely consistency issues. Also, it sometimes just gave us timeouts. That is particularly risky for scheduled jobs, because it breaks the promise of automation. So we needed to do something to improve it. Let's come back to the design again and think about the edge cases where the system might not function properly. What if the write request succeeded in one data center, but not in the other? And what if I now need to add a third cluster? I need to remake all those placement decisions so that the third cluster gets what it should have. That summarizes our issues into these two: cross-data center consistency, and handling clusters joining and leaving.

Let's take a quick, short pause here. We have explained how workflow runs and static resources have different characteristics and come with unique challenges. Going forward, this presentation will focus mostly on static resources, although we have considered the workflow runs case as well.

We start by reexamining what we have at hand. We're attending KubeCon, right? So I would assume most of us here understand, at a high level, how Kubernetes works. It has a storage backend, etcd, which serves as the source of truth, and it comes with the controller mechanisms, leader election, and many more features that give the system eventual consistency. Eventual consistency is the key term here. Now, why can't we just rely on Kubernetes to reconcile the placement decisions and to automate workload cluster changes? It seems pretty easy from that point. We're big fans of Kubernetes and benefit a lot from the access control and the resource allocation that are all built on top of an API server that is easy to operate. However, this is true for a single-cluster scenario, and we're thinking about cross-data center. Stretching etcd across data centers is not that cost efficient. It's actually quite expensive: it requires significant expertise and probably money. So in this session, we want another tool to help us with the other aspect. In summary, Kubernetes helps us with many of the problems here, but not all of them.

Think about what's left: cross-data center consistency. Why not just use a database? Modern databases are generally believed to be reliable and easy to operate. They're the perfect tool to store static resources and to allow subsequent deployments to be executed. With that in mind, you also have the chance to build a bookkeeping system for user input as well. It can serve auditing and tracking purposes. It also helps serve user read requests without hammering your Kubernetes cluster.
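To make the database-plus-sync-agent design concrete, here is a rough sketch of such a pull-based agent. The static_resources table, its columns, and the connection string are all invented for illustration, and a production agent would use client-go and proper error handling rather than shelling out to kubectl:

    package main

    import (
        "database/sql"
        "log"
        "os/exec"
        "strings"
        "time"

        _ "github.com/lib/pq" // Postgres driver
    )

    // clusterName identifies which workload cluster this agent runs in.
    const clusterName = "oscar" // hypothetical

    func main() {
        db, err := sql.Open("postgres", "postgres://agent@mgmt-db/resources?sslmode=verify-full")
        if err != nil {
            log.Fatal(err)
        }
        for {
            syncOnce(db)
            time.Sleep(30 * time.Second) // polling; a real agent might use notifications
        }
    }

    func syncOnce(db *sql.DB) {
        // Hypothetical schema: one row per static resource, with the
        // placement decision recorded as the target cluster name.
        rows, err := db.Query(
            `SELECT manifest FROM static_resources WHERE target_cluster = $1`, clusterName)
        if err != nil {
            log.Print(err)
            return
        }
        defer rows.Close()
        for rows.Next() {
            var manifest string
            if err := rows.Scan(&manifest); err != nil {
                log.Print(err)
                continue
            }
            // Shell out to kubectl for brevity; apply the manifest locally.
            cmd := exec.Command("kubectl", "apply", "-f", "-")
            cmd.Stdin = strings.NewReader(manifest)
            if out, err := cmd.CombinedOutput(); err != nil {
                log.Printf("apply failed: %s: %v", out, err)
            }
        }
    }

Simple as this looks, it is exactly this layer where the consistency and cluster-membership problems described a moment ago show up.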
But there's another aspect that we specifically want to highlight here. Modern relational databases offer a number of different solutions for high availability across data centers. In our context, we have found the streaming replication offered by Postgres to be the most cost-efficient option. Basically, a database leader is selected to accept all the write requests; the transaction log gets streamed to the other replicas, and eventually all the replicas are consistent, and all of them are able to serve read requests. In case of a failure of the leader replica, a leader election process kicks in and selects a new leader. That is a critical process; in our setup it happens roughly on a weekly basis, but it usually finishes in just a few seconds, so that is acceptable. Let's come back to the cost-efficiency part. We found this to be a much more cost-efficient and viable solution for us compared to stretching a real, typical etcd across data centers. But we are also thinking of taking advantage of the API server. So how can we benefit from both of them, not just one?

Another pause here. We have now examined our two kinds of tools, and it's time to combine them. Let me introduce how we combine them. You may have heard of K3s, a lightweight Kubernetes distribution that is capable of using an external database as the storage backend. In most cases, K3s is discussed in contexts like constrained hardware, edge nodes, or experimental environments such as your home lab; as a whole, it definitely doesn't fit large-scale clusters like what our system offers. But think of the uniqueness here: it comes with an interface that connects the database and the Kubernetes world together. How can we isolate out that particular part? That interface is called Kine, which stands for "Kine is not etcd". It's simply a shim layer that offers connectors to expose things like SQLite or Postgres as an etcd endpoint. With that, anything that works with a configurable API server may just function without any typical etcd underneath. For installation, you will need a real database endpoint; it's external. Then you install a Kine instance that exposes itself as etcd, and you configure an API server to talk to that etcd endpoint. Here is an example of what gets stored in the Postgres database, to convince you that it actually stores what you want.

Now it's time to assemble things together and think about how this can work in practice. One thing to highlight here is that the API server that uses Kine is itself a real API server instance, but it does not come with kubelets. There is no full control plane around it, so it is unable to schedule pods. The actual workload clusters themselves have an actual stretched etcd, but within one data center. So the Kine-backed API server here is more like an application that you install in a cluster, and in fact they are installed in multiple clusters. However, they all talk to the same database backend, so they are consistent, just like replicas in the same cluster.

Let's look at how it operates. What it exposes is still just an API server. It doesn't itself know how to talk to a workload cluster, so you still need some kind of application, a deployer role, to do all that work. You can probably select something from the CNCF community; we have heard of Karmada and OCM. You may also rather build something in-house. But you will definitely need some specific interpreter to translate the resources anyway. So, yep.
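Coming back for a moment to what actually lands in Postgres: as far as we can tell, Kine keeps everything in a single table named kine, with the etcd key in the name column and the serialized Kubernetes object in the value column; treat that schema as an internal detail that may change between versions. A quick way to peek at it, with a hypothetical connection string:

    package main

    import (
        "database/sql"
        "fmt"
        "log"

        _ "github.com/lib/pq" // Postgres driver
    )

    func main() {
        // The same (hypothetical) Postgres endpoint that Kine was started
        // against, e.g. kine --endpoint postgres://kine@db.example.com/kine,
        // with the API server pointed at Kine via --etcd-servers.
        db, err := sql.Open("postgres", "postgres://kine@db.example.com/kine?sslmode=verify-full")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        // Each row is one revision of one key, e.g.
        // /registry/configmaps/<namespace>/<name>.
        rows, err := db.Query(`SELECT name, octet_length(value) FROM kine LIMIT 5`)
        if err != nil {
            log.Fatal(err)
        }
        defer rows.Close()
        for rows.Next() {
            var name string
            var size int
            if err := rows.Scan(&name, &size); err != nil {
                log.Fatal(err)
            }
            fmt.Printf("%s (%d bytes)\n", name, size)
        }
    }

And with several deployer replicas working against this one shared backend, coordination comes essentially for free.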
By using the same database here, one thing to highlight is that you don't need to worry about conflicts between these deployer actions: because they're talking to the same etcd, you can just use the standard leader election mechanism, problem solved. That summarizes what we designed, and it's a very fundamental change to how etcd and the API server are set up.

Now let's address some frequently asked questions. The most obvious issue is that this Kine-backed API server doesn't come with kubelets: there is no full control plane and there are no worker pods. Does that limit the functionality you can use against it? Think about the use case here: we're using it for managing resources only, so it's not intended to operate a real cluster anyway. Once you clarify the use case, there is only a limited set of things that you need to install to actually interact with that Kine-backed API server. For many of the system components, it is more reasonable to just talk to a typical API server that uses a typical etcd as its backend.

Then we move on to the quantitative world. The storage backend has seen quite a fundamental change; so what about performance? Well, a performance comparison only makes sense if you are looking at the exact same read-write pattern. Think about the use case here: we're using it just to manage resources, so in most cases the resources are static. There will be writes, of course, but not from the workload side. Most likely, the deployer application simply reads from this Kine-backed API server and lands things on the actual workload cluster. So the usage pattern is quite different, and there are a lot of ways you can actually optimize that kind of call pattern for database characteristics.

With those concerns addressed for new platform setups, there may be concerns for existing platforms like ours. We have existing users, this is such a huge change, and we don't want to interrupt their daily jobs. So what can we do about it? Let's take our system, for example. Originally, the placement decisions were stored in the database as static decisions, and later a pull model actually landed things on the clusters. The placement decision had been made up front. Afterwards, we're actually storing just the plain user input and expect reconciliation to happen later, most likely with a push model. How can those changes be hidden away? Well, just build another API to sit in front of it. You can either build something new that talks to your old API and the new one, or you can just have them in the same binary; that's your call. This API will help us do the handover work without breaking anything. Also, it's not throwaway work even once you finish, because it's designed to be user-friendly and to serve long-term goals. Right now, I'll hand it over to Reinhard to do a quick recap and wrap up today's session.

So thanks, Yao, for sharing your insights on the technical details of the solution that we came up with. Let me summarize and conclude. Multi-cluster federation is a very relevant and broad topic for the entire community, especially for companies and teams that offer large platforms, where reliability and resiliency are key concerns. We observe that modern relational databases handle cross-data center persistence using streaming replication very well and are generally well understood. However, they alone don't offer quite the same conveniences and properties that we are used to in the cloud-native world.
For the requirements that we have identified at Bloomberg, we have shown how we use Kine to achieve multi-data center resiliency with cloud-native technologies; more specifically, how we persist Kubernetes objects through the Kubernetes API server in a relational database with transaction log streaming. Thank you so much for your attention. I hope that this talk resonates with you, and we would like to hear your thoughts, because we are curious whether you have similar requirements, or how you achieve multi-cluster resilience. Let's compare and contrast, and please use the microphones for questions.

So, I have a question. Awesome. Before, you were of course receiving API calls, but with this new Kine system, you can receive actual Kubernetes YAMLs or the like. How are the users adapting to this? Because you probably still want an API in front of it to convert what they actually want into Argo YAML, or an Argo workflow definition.

Yeah, that may not have been as clear as it could have been in the presentation. Remember that we have this management service that sits in front of it. So technically, we don't expose the Argo API server directly; there is this management service with an API that we are designing and defining in front of it. That management service will then package up the Argo-native manifests, which could be an Argo workflow template, an Argo workflow, or a number of other things, and persist them in this Kine-backed server in a way that allows us to identify, okay, this is supposed to go to that namespace and that cluster. All of that is hidden away from the users, so they don't have to deal with the Argo version details at all.

If possible, then I have one follow-up question. How do you then handle inconsistency between your different clusters, with, for example, the Argo version? Because before, you were getting the abstract concepts and you were changing that into an Argo resource in the management layer.

I'm not sure if I'm understanding the question. Is the concern that we may be operating different versions of Argo Workflows, did I understand that right? If you have a federated cluster with ten clusters, and five of them are running a different version than the other five. So the answer is: very carefully. In general, the Argo CRDs themselves don't change very much. I think we are at 3.5. We haven't actually run into that yet, and so far we didn't experience major problems with that particular concern.

Thank you for your talk. I have a question, because you are already using Argo: did you look at ApplicationSets to manage the static resources? We have looked at that briefly. The thing is, the GitOps model that Argo CD comes with really isn't a good fit for us, based on the specific data access requirements and the way we set up the platform. To give you a little bit more insight: one of the things that we really need to handle is that, if you are an AI team that has access to one namespace, then there are controls in place so that you can only see your specific things, only in your namespace and nothing else. Integrating that with external systems like the Argo CD controller poses challenges that we didn't have the capacity to think through completely, and we came up with this solution, which we felt is significantly simpler, and we thought we'd share that with you at this point.
Did I understand the concept correctly, in that you are basically using the Kine API service to stretch, I would call it, a virtual control plane over multiple data centers, and you are using Postgres basically as the database for that? That is correct, yes. So you built, in that case, a virtual control plane that spans over many more clusters. Yeah, that is one way to look at it. I tend to avoid the term virtual control plane. You have the deployer there. It is only provisioning data in a Kubernetes way and then deploying it onto the workload clusters. Exactly. When you say management of this virtual control plane, people generally assume that you are going to quickly end up in a solution like vCluster, which shares a couple of similarities. But again, we are using this to store and deploy static resources like config maps, workflow templates, things like that. That does not require a kubelet, because it is only storing data, and we thought this is a neat trick. Thank you very much.

Next question. Did you consider using Karmada to manage the federated clusters? Excellent question. Karmada is something that we are actually looking at for workflows, for the dynamic resources. This presentation was mostly about static resources. I understand that Karmada can also be used for static resources, but whether we will be using Karmada for that, maybe not. From the cost-benefit trade-off, we found this is the cheaper way to go for now. Makes sense. Thank you.

Thank you so much for your questions. Happy to talk to you afterwards. Bye-bye.