Hello, everybody. Thank you for coming to our presentation. My name is Harris Khan, and I'm a software engineer at Bloomberg. I'm here with my co-worker Yao Lin, a senior software engineer, and we are part of the workflow orchestration team. Today we're going to be talking about some of the growing pains associated with scaling to 10,000 workflows per week.

At Bloomberg, our team's mission is to maintain and provide a fully managed, general-utility workflow orchestration platform, so that users can orchestrate their tasks in a cloud-native environment. Ideally, we want it to be secure, reliable, and accessible for all of our internal users, regardless of their skill level in Argo or Kubernetes. As I mentioned, we offer general-utility compute, meaning users can bring their own containers to orchestrate their tasks. Our users run a diverse array of jobs, including AI model training, machine maintenance, and even financial analysis, and that's just the tip of the iceberg.

Our platform, as you might guess, sits right on top of Argo Workflows. Because of our users' needs, and because we're a fully managed platform, it's important that we provide certain functional requirements that Argo Workflows might not fulfill by itself. For example: network segmentation, so that we can isolate traffic and guard the production environment; approval guards; cross-owner liability; and the ability for users to schedule their workflows and trigger them based on events. And because we love our users, we also want to give them the right self-service tools so they can observe their workflows and go under the hood if they need to debug their containers.

Now, as our user base has grown significantly over the past couple of months, we've noticed that many of our users have different SLA requirements. One user might be running a maintenance job that should run at most once, whereas another user might have an analysis job that needs to run at least once and may be persistent. These diverse SLA requirements bring us to our first growing pain, tenant cluster isolation: it might be a good idea to put tenants on different workload clusters so that we can adhere to these SLA requirements.

Moving on: because our user base has continued to grow and many of our users are running important jobs, high availability is extremely important for our platform. Many of our users also worry that cluster maintenance might impact their jobs, and that brings us to our second growing pain, data center resiliency.

Now, imagine you're a user on our platform and you notice that your pods are running slowly, or that your container is hitting errors. You might ask yourself: is there something wrong with the whole system? Could it be that I'm under-allocating resources? This quick example highlights the need for our third growing pain, observability and troubleshooting. It's important that we empower users, whatever their skill level, to look at their workflows and have the right tools to debug anything that comes their way. That helps us as platform maintainers as well.
Now, an example we recently experienced that actually combines all three of these growing pains: we had a workflow cluster that was hitting a lot of timeouts during peak hours. So we rolled up our sleeves and researched it. We thought it could be CPU throttling. We looked into whether it was a memory allocation issue. We even checked whether it was a disk I/O issue. As it turns out, it was none of these; the real issue was that we were overloading our Kubernetes API server. Why? Because all of our tenants are on one workload cluster, we simply had too many job runs and too many updates from our tenants and their workflows going straight at the API server. This brings back the tenant cluster isolation point from before. Something like cluster sharding could potentially remediate this: if workflows run across several clusters instead of all on one workload cluster, that relieves some of the pressure on the Kubernetes API server. This example also brings back observability and troubleshooting, because it shows how important it is to be able to understand what's happening with tenant cluster activity.

So these are the three main growing pains we've been dealing with recently as our user base has grown considerably. First, observability and troubleshooting is extremely important, not just for us as platform maintainers but also for our users, so that they can manage and analyze their workflows in real time. Second, because of our high availability needs, data center resiliency is extremely important to us, and it's only going to get more important. And lastly, tenant workload isolation is also going to keep getting more important, not only for addressing our users' SLA requirements but also in situations like the one you just saw, where something like cluster sharding can alleviate pressure on the API server. And now I'm going to pass it to Yao Lin, who is going to dive deeper into these three topics and talk about some solutions we have in mind.

Thanks, Harris. Harris has set up the context here and helped us identify the three major problem areas. Now let's take a journey through what we can do to make these solutions come true. Let's start with something simple. Imagine you have a workflow cluster installed in a standard way, running some typical tenant containers. For starters, you will have the tenants' logs and metrics. Inside Bloomberg, we have our own dedicated platform to persist logs and metrics, and it comes with integrated query and visualization functionality, so it's our go-to place for logs and metrics. It then becomes our responsibility to forward those tenant indicators there. Users might also want to see some system indicators so they can dig into things in more detail, so it's also our responsibility to build dashboards as templates for tenants to look at. That includes indicators from kube metrics, container metrics, workflow controller logs, and more. However, these standard pieces are just not sufficient to answer all of our questions. Let's look at a couple of examples. By default, the workflow controller does not emit metrics per namespace.
To get that today, you have to put extra configuration into each individual workflow spec. So how can we avoid that? Also, sometimes we specifically want to capture a certain state of a workflow or pod that is not considered standard by the general community, but we do need it.

Let's summarize our needs. First, we want custom metrics. On the one hand, we want metrics published with tags we define, without injecting anything into the workflow spec. On the other hand, we also want to guard the metrics coming from our own system, in case some metrics get published with a crazy number of time series. Second, we need specific log lines to help us identify certain states of a workflow or pod; oftentimes those states are problematic. Take a real example: if a pod is configured with a typo in its mount path, that pod is likely to stay Pending forever. In our context, we want to kill that pod once it reaches that state, say after it has been stuck for an hour. Then we can take it a bit further: can we avoid waking up our tenants in the middle of the night just to handle that issue? We can leverage our system to automatically reconcile that kind of state.

We first considered whether we could achieve this just by configuring our installation of the workflow controller properly. It turns out that is not so simple, and sometimes it would require enhancements to the controller itself. And as we all know, the workflow controller is already complicated and busy with its own task, which is the orchestration itself. That housekeeping work is better implemented in a separate component. So we did a little research into how difficult it is to implement a controller of your own, and actually, it's not that difficult. The typical implementation pattern comes from the Kubebuilder website. First, you implement event filters, which Kubebuilder calls predicates; these can effectively double as event handlers that publish metrics. Then, once things reach the reconciler, that's where you capture a certain state and take action on it if needed (there's a small sketch of this pattern below). That works surprisingly well, and it was easier to implement than we initially thought.

Okay, one problem solved. Now let's look at data center resiliency, which sits at a higher scope. We often refer to it with the term multi-cluster federation. It has been a long-standing topic, and it's difficult; there's no single good solution. There are many projects and many angles from which you can try to solve it, but it depends on your use case and your needs. There's just no straightforward answer. Projects include Karmada, OCM, or something you develop in-house. But we're not here to compare and contrast each solution and tell you which is better. We want to highlight one important aspect of this approach: you can't avoid building a unified API to expose to your users. As we mentioned earlier, what we offer is a general-utility compute platform, and many of our users are not familiar with Kubernetes. That's something we often overlook in our daily work, because we ourselves are so familiar with Kubernetes. So that API needs to handle things in a way that matches the user's logic.
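To make the companion-controller pattern described above more concrete, here is a minimal sketch, not the team's production code, built on controller-runtime (the library underneath Kubebuilder). It assumes we watch Pods, publish a hypothetical per-namespace counter from the event filter (the predicate), and let the reconciler delete Pods stuck in Pending for longer than an hour; the metric name, label, and timeout are illustrative assumptions, not details from the talk.

```go
// Sketch of a small housekeeping controller alongside the workflow controller.
package main

import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/metrics"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// Hypothetical per-namespace metric, registered with the controller-runtime
// metrics registry so it is exposed on the manager's /metrics endpoint.
var podsSeen = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "tenant_workflow_pods_total",
		Help: "Workflow pods observed per tenant namespace (hypothetical metric).",
	},
	[]string{"namespace"},
)

const pendingTimeout = time.Hour // reconcile away Pods stuck in Pending this long

type stuckPodReconciler struct {
	client.Client
}

func (r *stuckPodReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var pod corev1.Pod
	if err := r.Get(ctx, req.NamespacedName, &pod); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	if pod.Status.Phase == corev1.PodPending {
		age := time.Since(pod.CreationTimestamp.Time)
		if age >= pendingTimeout {
			// e.g. a typo in a volume mount path: clean it up automatically
			// instead of paging a tenant at midnight.
			return ctrl.Result{}, r.Delete(ctx, &pod)
		}
		// Not stuck long enough yet; check again when the threshold is reached.
		return ctrl.Result{RequeueAfter: pendingTimeout - age}, nil
	}
	return ctrl.Result{}, nil
}

func main() {
	metrics.Registry.MustRegister(podsSeen)

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		panic(err)
	}

	// The "predicate" doubles as an event handler that publishes metrics tagged
	// by namespace, without injecting anything into individual workflow specs.
	countByNamespace := predicate.Funcs{
		CreateFunc: func(e event.CreateEvent) bool {
			podsSeen.WithLabelValues(e.Object.GetNamespace()).Inc()
			return true
		},
	}

	err = ctrl.NewControllerManagedBy(mgr).
		For(&corev1.Pod{}).
		WithEventFilter(countByNamespace).
		Complete(&stuckPodReconciler{Client: mgr.GetClient()})
	if err != nil {
		panic(err)
	}

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```

The same two hooks, predicates for publishing indicators and the reconciler for acting on problematic states, would presumably extend to Workflow custom resources and to emitting the specific log lines mentioned above, all without touching the workflow controller itself.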
Coming back to that unified API: by building one, you not only bring benefits to your tenants, but you also give yourself room to re-architect if you pick up a new federation solution or want to switch federations later.

Okay, and we arrive at our third problem, tenant workload isolation. The bottleneck behind the timeouts we identified earlier is the API server. Now let's think about how we can solve that. Can we solve it by adding more powerful hardware? Not really, because the capacity of the API server on a single cluster is quite limited. Or can we change the way we deploy our workflow controller? The workflow controller, in case you're not familiar with it, is basically the backbone of the orchestration feature. It can be installed in two ways: cluster-scoped, where one controller orchestrates all the namespaces on the cluster, or namespace-scoped, where a controller is only responsible for the workloads of a single namespace. However, we researched this and realized it doesn't really solve the problem, because the workflow controller itself is not the bottleneck; the API server is. That leaves us no choice: we need to build more, smaller workload clusters. And that requires us to build more automation tooling for those clusters; otherwise the operational burden of linearly increasing the number of clusters as our tenant namespaces grow would just kill us.

So let's look at how we are going to shard the workload. Imagine, in a hypothetical scenario, we have four namespaces, and some of them are pretty heavy. Imagine those are pro users who know everything about Kubernetes and run hundreds of workflows every day. And there are also lighter users, who probably run a workflow once a day or once an hour. These tenant namespaces should be evenly distributed across the clusters. Now, you may ask two questions. First, how can I easily manage that many workload clusters? And second, how can I identify which namespaces are lighter and which are heavier, so that we can balance them in a reasonable way, and keep them balanced going forward if a tenant namespace's usage pattern changes and grows from a lighter user to a pro user? It's not as trivial as it seems (there's a small sketch of this balancing idea below). Actually, that ties back to our previous solutions. Observability and troubleshooting doesn't just solve the problem of the moment; it also helps us going forward. It helps us better understand our tenants' profiles, so we can make informed decisions about where to place them and whether we need a rebalancing, say, a year from now. And second, imagine we have already made our federation choice and we have a solid API to protect ourselves, so we can do this kind of re-architecting or rebalancing without sacrificing the user experience; users don't want to be interrupted for anything that isn't obvious to them.

Okay, so with these helpers we can finally achieve tenant workload isolation in a convenient and comfortable way, where "comfortable" is mostly about our own confidence. As you might have noticed, we deliberately skipped the topic of how to actually achieve multi-cluster federation. It's a very broad and difficult topic; we hope you understand and forgive us. But we do have another talk tomorrow, around noon I think, on the platform engineering track, where we'll talk a bit more about how we got started with multi-cluster federation, and I hope it will be a helpful experience-sharing session for everyone. So, yeah. Thanks, everyone, for listening.
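As an aside, here is a minimal sketch of the balancing idea described above; it is not Bloomberg's tooling, just a greedy placement under the assumption that each tenant namespace can be given a weight (for example, observed workflows per day, taken from the kind of metrics discussed earlier). The namespace names, weights, and cluster names are hypothetical.

```go
// Greedy sharding of tenant namespaces across workload clusters by load.
package main

import (
	"fmt"
	"sort"
)

type namespaceLoad struct {
	Name            string
	WorkflowsPerDay int // weight derived from observability data (assumed)
}

// assign distributes namespaces across clusters so that heavy (pro) tenants
// and light tenants end up roughly evenly spread.
func assign(namespaces []namespaceLoad, clusters []string) map[string][]string {
	// Place the heaviest namespaces first: classic greedy bin packing.
	sort.Slice(namespaces, func(i, j int) bool {
		return namespaces[i].WorkflowsPerDay > namespaces[j].WorkflowsPerDay
	})

	load := make(map[string]int, len(clusters))
	placement := make(map[string][]string, len(clusters))
	for _, ns := range namespaces {
		// Pick the cluster carrying the least load so far.
		best := clusters[0]
		for _, c := range clusters[1:] {
			if load[c] < load[best] {
				best = c
			}
		}
		load[best] += ns.WorkflowsPerDay
		placement[best] = append(placement[best], ns.Name)
	}
	return placement
}

func main() {
	namespaces := []namespaceLoad{
		{"tenant-pro-1", 400}, {"tenant-pro-2", 300},
		{"tenant-light-1", 1}, {"tenant-light-2", 24},
	}
	fmt.Println(assign(namespaces, []string{"cluster-a", "cluster-b"}))
}
```

Rebalancing later is then just re-running the same placement with fresh load numbers, which is exactly where the observability work and a unified API in front of the clusters would pay off.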
Now we can open it up for questions. Thanks. So again, there's a mic over there. There we go.

Hello. Oops, sorry. I'm going to play devil's advocate here a little bit. You mentioned that you can't scale the control plane, like the API server. But why not? You can add more memory to it, you can allow more in-flight requests. It really depends on what scale you're at, but it didn't seem like 10,000 workflows per week is enough to justify many clusters. Obviously I don't know enough about your workloads, so I'm just curious to hear whether you explored this topic more.

Okay. Yes, we really considered that. We could probably add more instances or more resource allocation to the API server, but the thing is, we already have so many tenant namespaces, and the workflow scale just keeps growing. The timeouts are triggered at peak times, not all the time, but we do need to think about how the solution would behave going forward. We can definitely add more resources for now, but it doesn't seem like sustainable growth for the future. We're also unsure how fast we'll grow, so combined with the different SLA requirements, we figured it was a really good chance to just isolate the tenants into different clusters. I hope that answers your question. Sure.

I have a question. You said you shard your clusters and then you have namespaces for each tenant. My question is: if one of your tenants grows too big, then you need to shard it across multiple clusters. I don't know if you have faced this issue, but it would be very interesting to know how you schedule this or how you manage it.

Sorry, could you increase the volume a little bit? Sorry about that. Is it better?

It's better now. Okay. So you shard your clusters and then you have tenants in different namespaces. Do you face the issue where one tenant grows very big, and then you need to shard it across multiple clusters? Is the question clear?

Sorry, I'll let my colleague take over that question. So, you're right, we were talking about sharding of namespaces. Luckily we are not at the point where we need more than one cluster for a single tenant. What this talk was mostly about is how we can ensure the service level for everyone participating on this shared platform, and how we can make sure individual tenants don't overload a particular namespace or a particular controller, by sharding them across multiple clusters. What you're saying is absolutely right, but we're not at that scale yet. Great. Thank you very much.

I have a question regarding observability. We are struggling with monitoring and observability for our workflows. For example, a job is stuck or it failed. The UI is pretty limited; you can't sort, and the developers are complaining about it. In Prometheus, what we have is monitoring without the specific job: we can see how many jobs failed, but we don't see which specific job. So how do you do this, observability with so many workflows running? Do you have some dashboards to expose or share with us? It's really interesting.

That's really case by case; it depends on what aspect you're interested in. For example, if I just want to know the creation rate or the real-time count of workflows or pods per cluster, then metrics are good enough, without crazy cardinality of time series. However, if you want to look at it from more of a troubleshooting angle, then you need more in-depth detail on individual workflows or pods.
In general, we think log lines or even traces would help you more there; metrics alone won't help as much. Our controller is designed in a general way, so it lets you customize metrics and logs together.

So you're using tracing to actually see where it fails?

It can be captured from the events coming in, or identified at the reconciliation phase, depending on which kind of state you want to track. Thank you.

All right, awesome. Thank you.