Hi everybody. I hope everyone's doing well today. My name is Kimberly Lee, but please call me Kim; Kimberly is what my mother calls me when I'm in trouble. This is Joe, my co-worker. We're both lead software engineers at Salesforce on the Central Performance Foundation team. Today we're going to talk about how we built an Argo Workflows based service from the ground up, primarily for the specific use case of performance testing, and how we managed to migrate a lot of the existing performance workloads over to it.

Since the inception of our service, we have around 550 users on our platform, supporting over 200 teams at Salesforce. From the early prototyping days with just a few users up to now, over 190,000 performance workloads have run on our service. Some of those workloads can spin up to 1,500 pods, depending on the kind of load testing being done. Combined, around 25 million performance artifacts and over 70 terabytes of data are hosted on our service.

Hello everyone, this is Xiao Joe. For today's agenda we're going to focus on our experience of building an Argo Workflows based service, divided into three discussion topics. The first is how we designed a seamless experience for the service while still keeping its flexibility. The second is the strategy we used to engage our power users to extend the service, and the commitments we established with our users. The third is the approach we take to sustain and maintain the service over the long term.

Now let's get into the first topic: designing a cohesive experience. To build our service, the first thing we had to do was understand our end users. We approached this in a naive way: we just went to different teams, talked to them, and tried to understand their performance testing methodology. That was both a good idea and a bad idea, but overall we learned a few things from those conversations. The first is the variety of load testing tools in use: some people use K6, others use JMeter, and some use their own scripts. They had also been struggling to scale their performance tests locally because of hardware limitations. Some teams had already built automation services to feed their performance data into different tools such as Superset, Tableau, Splunk, and so on, but they were facing a lot of difficulty locating their data and metrics, especially while debugging and searching through that data. We also noticed the lack of a centralized scheduling feature, which made it very challenging for people to test against a shared environment; they were using a Google Calendar and had to reserve a slot if they wanted a specific environment.

From what we learned talking to these users, we saw two critical insights. The first is that they were not actually looking for a performance testing tool; they were really asking for a workflow engine that could run their different performance workflows. The second is that even though different teams
have a deep understanding of their own use cases, most of that knowledge stays within their own bubbles.

So we changed our strategy and decided to find the power users: the people who had been working with multiple teams and who might already own a lot of the existing automation services. They were not that difficult to locate. In Zoom meetings they look frustrated; they complain about the tedious work they go through every day just to execute a test. After onboarding them, we realized some of their workflows look like this: multiple dependencies with a lot of concurrent tasks that need to be executed.

With such a big pile of feature requests ahead of us, we categorized them into a two-part framework, the platform and the ecosystem, and we also used this framework to communicate with our end users. Given the huge variety of tools in use among different engineers, integrating all of them into our service was not practical. But if we could identify just a couple of the most commonly used ones, we would already anchor most of the engineers. And honestly, we don't know better than they do what they really want to use. So we put the load testing tools, such as K6 and JMeter, into the ecosystem, together with some post-processing tools.

By the way, regarding the platform and the ecosystem: for us, the platform holds the processes, core components, and services that can be standardized; anything that cannot be standardized goes into the ecosystem. Coming back to the platform, we also reached a consensus that we needed to provide centralized data storage so that all the performance data could be consolidated, and we built a debugging standard for our users. Meanwhile, with all the post-processing tools sitting in the ecosystem, we realized we needed to integrate scheduling, request management, and a search tier into the platform so that our end users could locate a test and retrieve its data. Overall, the objective of our service is to deliver the full spectrum of a performance test, from scheduling and running to debugging and post-processing, helping our users execute their distinct performance workflows while keeping the flexibility.

With the scheduling requirement in mind, we set up the service to be hosted on Kubernetes. The first thing we did was make all the workflow templates and Docker images available directly to Argo Workflows. Besides that, we built an API in front of Argo. It sits behind SSO and transforms a user's request into an Argo Workflow. Many of you might already have noticed that our service is deployed in a shared cluster: Argo Workflows runs in our own namespace, and the controller's permissions are scoped to that namespace. Our end users don't have their own namespaces, so every workflow that gets created lands in our namespace.
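As a rough illustration of that front-end API, here is a minimal sketch of the translation step, assuming the user request carries a tool name, parameters, and tags, and that each curated tool has a matching WorkflowTemplate. The workflow is submitted to the Argo Server REST API in the platform-owned namespace. The URL, namespace, template names, and label keys are hypothetical; the real service also handles SSO validation, request management, and scheduling.

```python
import uuid
import requests

ARGO_SERVER = "https://argo-server.example.internal"  # hypothetical Argo Server endpoint
NAMESPACE = "perf-platform"                           # the platform-owned namespace

def submit_performance_test(tool: str, parameters: dict, tags: list[str], sso_token: str) -> str:
    """Translate a user request into an Argo Workflow and submit it.

    Returns the UUID used later to locate the test and its artifacts.
    """
    test_id = str(uuid.uuid4())  # every run gets a UUID up front
    workflow = {
        "metadata": {
            "generateName": f"{tool}-",
            "labels": {"perf/test-id": test_id},
        },
        "spec": {
            # Reference a curated WorkflowTemplate (e.g. "k6", "jmeter") instead of
            # accepting arbitrary workflow specs from callers.
            "workflowTemplateRef": {"name": tool},
            "arguments": {
                "parameters": [{"name": k, "value": str(v)} for k, v in parameters.items()]
                + [{"name": "tags", "value": ",".join(tags)}]
            },
        },
    }
    resp = requests.post(
        f"{ARGO_SERVER}/api/v1/workflows/{NAMESPACE}",
        json={"workflow": workflow},
        headers={"Authorization": f"Bearer {sso_token}"},
        timeout=30,
    )
    resp.raise_for_status()
    return test_id
```

Generating the UUID before submission and attaching it as a label is what later lets the artifact server tie logs, metrics, and artifacts back to a single test run.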
Having this service in front of Argo also helps with access control, because you can foresee that accidents will happen, people accidentally deleting workflows and so on. We also added scheduling features to our API by leveraging Argo CronWorkflows. Then, for every workflow we create, the pod spins up our syslog sidecar, which is responsible for pushing the metrics and data to our metrics backend and Splunk, and at the end the artifacts are uploaded to object storage. With the system set up like this, our end users can use their own tools, either locally or as part of the workflow, to do whatever data analysis they want on the logs, the metrics, and the artifacts.

So our data workflow is pretty straightforward, but we also made a couple of decisions to help our end users. Considering we are running such a chaotic system, with data, metrics, and configurations flowing around without coordination or schemas, UUIDs and an inverted index are our friends. We decided to move everyone to UUIDs for their workflows, which helps users easily locate their performance tests and artifacts. We also provide an artifact server, which includes a search tier on top of our data storage. Users can apply tags to categorize their performance tests, and the tags and artifacts are indexed together, keyed by the UUID, so users can easily locate their artifacts and confirm whether certain artifacts were actually created. Now let me hand over to my colleague Kim.

So this is great, we have a platform, but what ended up happening was that we pretty much had a city with functioning roads and a sewage system, but absolutely no housing. The next step was to think about which tools made the most sense to start building an ecosystem from, given that our engineers were running all sorts of performance workloads with their own diverse use cases. The best people to talk to were our power users, so we went back to them to see how we could work together on this.

How did we work with them? To be honest, we had to be incredibly careful here, because the truth was that a lot of them already had automation built, even if it ran on Jenkins and was constrained by hardware resources. We had to be very careful, because it could easily have felt like we were directly competing with them, and that was not true. One thing they did really like about us, even before they decided to migrate their tools over, was the fact that we were on Kubernetes. So the first thing we did was ask them which tools were the most widely used among the performance engineers and which were the most business-critical. A lot of these were tools like JMeter and K6, plus some internal load testing tools that many engineers were using, as well as post-analysis tools, for example pulling down AWR reports. These were the crucial ones everyone needed. The other tools we decided to onboard were the most painful ones to set up.
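Circling back to the tag and UUID indexing Xiao described a moment ago, here is a minimal in-memory sketch of the idea: runs are keyed by UUID, tags map to the set of runs that carry them, and artifact keys are recorded per run, so a user can both search by tag and confirm that an expected artifact was produced. The real search tier sits on top of object storage and a proper index, not a Python dict; the tag names used below are just examples.

```python
from collections import defaultdict

class ArtifactIndex:
    """Toy inverted index: tag -> set of workflow UUIDs, plus UUID -> artifact keys."""

    def __init__(self) -> None:
        self._by_tag = defaultdict(set)     # e.g. "release-248" -> {uuid, ...}
        self._artifacts = defaultdict(set)  # uuid -> {"results.jtl", "summary.json", ...}

    def register_run(self, test_id: str, tags: list[str]) -> None:
        for tag in tags:
            self._by_tag[tag].add(test_id)

    def register_artifact(self, test_id: str, object_key: str) -> None:
        # object_key is the path of the uploaded artifact in object storage
        self._artifacts[test_id].add(object_key)

    def find_runs(self, *tags: str) -> set[str]:
        """Runs matching all given tags (intersection of the posting sets)."""
        sets = [self._by_tag.get(t, set()) for t in tags]
        return set.intersection(*sets) if sets else set()

    def has_artifact(self, test_id: str, object_key: str) -> bool:
        """Confirm whether a given artifact was actually produced for a run."""
        return object_key in self._artifacts[test_id]
```

For example, `find_runs("jmeter", "release-248")` would return the UUIDs of every JMeter run tagged for that (hypothetical) release.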
The most painful tools to set up were the ones with the extremely long READMEs, where just getting the tool running on a local machine could easily take an engineer six hours. So for the tools most engineers would use despite how different everyone's use case was, we made an agreement with our power users: we will centralize these and set up workflow templates for them. We were standardizing these tools and helping to promote them.

For the tools we wanted to standardize we had an additional issue. Take JMeter: a lot of teams were using it, but there were probably seven different versions floating around, and depending on how the org structure worked within Salesforce, some teams only checked their scripts into one place while others checked theirs into a different source. That was not going to work well for us. So we decided to take advantage of our team name and set up a Git organization specifically to start making a lot of these tools official. That also promoted a more collaborative environment and helped with discoverability, because many engineers were working in their own silos; funnily enough, some of them were building internal tools, data loading tools, ETL tools, without realizing other teams were building more or less the same thing, when they could simply have worked with each other.

So these were the tools we onboarded first: JMeter and K6, of course, the ones most users would use. We said we would take care of onboarding these, as well as the analysis tools and data loading tools. But this is a shared hub, and we recognized that a lot of users would have objections like, why are you not supporting my tool? So we made it easy for them to onboard straight into our Git organization. As Joe mentioned earlier, a lot of the engineers did not have access to the Kubernetes clusters without going through a series of security processes to get verified. Furthermore, even when they already had Docker images built for their repos, that didn't mean they could push them to the image registry our Argo Workflows pulls from. So, to make onboarding take only a few minutes of waiting, we included a Git hook that detects whether a repo contains a Dockerfile or a YAML file of some kind. If it's a Helm chart, we go ahead and deploy and create those resources; if it's a Dockerfile, we bundle it up and push the image to a registry our clusters can pull from.

This was pretty successful, but then support got out of control pretty quickly, because we gained a lot of users, which meant a lot of workflows to support. The other thing we noticed with performance testing was that many of these workflows were testing against shared environments.
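Going back to that Git hook, here is a simplified sketch of what the post-merge automation might look like, assuming the repo layout, namespace, and registry URL shown here; the real pipeline runs inside the internal CI system with security scanning and ownership checks layered on top.

```python
import pathlib
import subprocess

# Hypothetical registry the platform's clusters are allowed to pull from.
REGISTRY = "registry.example.internal/perf-ecosystem"

def on_merge(repo_dir: str, tool_name: str, version: str) -> None:
    """Simplified post-merge hook: deploy Helm charts, or build and push Docker images."""
    repo = pathlib.Path(repo_dir)

    if (repo / "Chart.yaml").exists():
        # Helm chart found: create/upgrade the tool's resources on the platform cluster.
        subprocess.run(
            ["helm", "upgrade", "--install", tool_name, str(repo),
             "--namespace", "perf-platform"],
            check=True,
        )
        return

    if (repo / "Dockerfile").exists():
        # Dockerfile found: build and push so Argo Workflows can pull the image.
        image = f"{REGISTRY}/{tool_name}:{version}"
        subprocess.run(["docker", "build", "-t", image, str(repo)], check=True)
        subprocess.run(["docker", "push", image], check=True)
```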
With so many workflows testing against shared environments, you can probably imagine how much chaos that caused. To understand what was going on, because there were too many use cases and we didn't know what was happening, we decided to taxonomize our workflows, and we coined the result the pancake tower of pain. Hermetic is a term I'm sure folks in DevOps have heard a lot, and people who have used Bazel hear it everywhere. We used it to categorize the workflows that are self-contained: they need no external communication other than parameters coming in and artifacts going out. These workflows were the least troublesome to deal with. Sadly, a lot of them were not like this. The next best thing we could hope for whenever we were debugging a workflow was: please let it be stateless. Even if those hit a shared environment, hopefully they only do read-only access to immutable data in the outside world. But our unfortunate reality was that most of our performance testing workflows were stateful. Even though there was nothing wrong with running the workflow itself, a lot of issues came from the shared environments going down and coming back up, and we got in trouble with lots of different teams over that.

So we decided that for hermetic workflows we need to focus on the developer experience: as long as users can go to our API or UI, select a couple of parameters, and run the type of performance workflow they want without needing our help at all, we have succeeded. For stateless workflows the focus was the debugging experience: when they read something from somewhere, or run 2,400 users sending millions of requests per minute, they should be able to easily debug what happened in their workflow if anything goes wrong. The stateful workflows were a lot tougher. Here we really had to focus on figuring out what happened and who can help fix the problem quickly, because these stateful workflows hit so many different environments, and not just environments: they could be testing against many different products at a time, and that was a major issue.

So one thing we decided to do was look at how to minimize the blast radius: for example, how could we turn a stateful workflow, one that modifies the outside world and can cause chaos, into something more stateless? One thing we did was work with the infrastructure teams building the test environments on the public cloud: how can we help you scale, for example by configuring your HPAs, so that if the environment does get bombarded, it doesn't go down and affect everybody else?
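To make that concrete, here is a minimal sketch of the kind of HorizontalPodAutoscaler we would help an environment-owning team put in place, written as a plain manifest and applied with the Kubernetes Python client. The namespace, deployment name, and thresholds are placeholders; real values depend on each service.

```python
from kubernetes import client, config

def ensure_hpa(namespace: str = "shared-test-env", deployment: str = "checkout-service") -> None:
    """Create an HPA so the shared environment scales out instead of falling over."""
    config.load_kube_config()
    hpa = {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": f"{deployment}-hpa", "namespace": namespace},
        "spec": {
            "scaleTargetRef": {"apiVersion": "apps/v1", "kind": "Deployment", "name": deployment},
            "minReplicas": 2,
            "maxReplicas": 20,  # headroom for when a load test bombards the environment
            "metrics": [{
                "type": "Resource",
                "resource": {"name": "cpu", "target": {"type": "Utilization", "averageUtilization": 70}},
            }],
        },
    }
    client.AutoscalingV2Api().create_namespaced_horizontal_pod_autoscaler(
        namespace=namespace, body=hpa
    )
```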
Some of the worst cases we had were tests that failed in the middle of configuring a shared environment. We had workflows that would change configurations at the application level, then try to change something at the hardware level, and the whole environment would go down or end up in a very bad state. Nobody notices until the next person runs a workflow against it and discovers it no longer works, because that's when they hit the issues. So these were the things we tried to turn into something more stateless. One thing we did, which is not the best solution but was the more foolproof one, was to use Argo Workflows exit handlers to ensure that whenever a workflow completes, we hit one of the CI/CD pipelines owned by the team that owns that shared environment, to rebuild the environment back to its original state. That way the next user who scheduled a workflow on the same test environment gets exactly what they're expecting. We also advised: for your stateful workflows, please put health checks in place so that you fail early.

The stateless workflows didn't cause as many issues, but we did think about whether we could make them more self-contained, because testing against a shared environment can be extremely volatile and unpredictable. One of the solutions we worked out with our users was to spin up an ephemeral lab instead: a lab where you can run all the experiments you want, making the workflow as self-contained as possible without affecting anything in the outside world that isn't part of the workflow. This was actually pretty successful. There is a downside for the folks who switched to ephemeral labs: they have to wait for the lab to come up, and then warm up the cold caches and the database, which adds steps and time to their workflow. But they could at least start parallelizing their workflows, which in the end saved them time, because they no longer had to run tests one at a time on the shared environment and worry about noisy neighbors. With that solution we also re-evaluated the stateful workflows: if a test is so chaotic that the changes it makes are hard to recover from, or recovery means another five-hour delay, can it use an ephemeral lab instead?

The main goal of our service for performance testing was flexibility, but flexibility really wasn't the end goal. We wanted to turn something flexible into something very usable: to create not just a platform but also an ecosystem that fosters community growth around it.
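Going back to the exit-handler pattern Kim described for shared environments, here is a minimal sketch of a workflow that runs a load test and, whenever it finishes (success, failure, or error), calls a reset endpoint on the CI/CD pipeline that owns the environment. The pipeline URL, images, script path, and namespace are all hypothetical; the workflow is expressed as a plain manifest and submitted with the Kubernetes Python client.

```python
from kubernetes import client, config

RESET_HOOK = "https://ci.example.internal/api/reset-env?env=shared-perf-1"  # hypothetical pipeline trigger

workflow = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Workflow",
    "metadata": {"generateName": "jmeter-shared-env-"},
    "spec": {
        "entrypoint": "run-load-test",
        "onExit": "restore-environment",  # runs on success, failure, or error
        "templates": [
            {
                "name": "run-load-test",
                "container": {
                    "image": "registry.example.internal/perf-ecosystem/jmeter:5.6",
                    "command": ["jmeter", "-n", "-t", "/scripts/test-plan.jmx"],
                },
            },
            {
                # Exit handler: ask the environment owner's CI/CD pipeline to rebuild
                # the shared environment back to its original state.
                "name": "restore-environment",
                "container": {
                    "image": "curlimages/curl:8.5.0",
                    "command": ["curl", "-fsS", "-X", "POST", RESET_HOOK],
                },
            },
        ],
    },
}

config.load_kube_config()
client.CustomObjectsApi().create_namespaced_custom_object(
    group="argoproj.io", version="v1alpha1", namespace="perf-platform",
    plural="workflows", body=workflow,
)
```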
So we worked with our power users to capture the standards, and the power users in turn helped us capture all the use cases of the end customers. The last takeaway is that even once you have all of your support in place, also take a look at the workflows themselves, see how you can make them either hermetic or stateless, and work with your users on a good system to mitigate these complexities. Okay, thank you everyone. Any questions?

Hello, I have many questions, but I'll just ask one for the moment. Sure. And maybe catch you later if that's possible. Yeah, of course. You said that you offer a service where you build the Dockerfiles and Helm charts for users, which is awesome. Yeah. But even for internal users, how did you make sure that was done safely?

So, as part of the Git hook, and this is something we can definitely improve on, once the PR is created, the security scans kick in, and a build process also kicks in to make sure the Docker image actually builds, because it would be bad to find out after the merge that the whole build is broken. And luckily we had the power users: for every tool they owned in that Git org, we made them an owner there as well. And of course, we were also there to make sure everything goes as it should. Those were some of the things we did. There are things we could do better; for example, after an upgrade, or when vulnerabilities are discovered later, they don't always get caught immediately. Those are some things we are improving on right now. Yeah. Thank you. No problem. Okay, thank you. We'll be around if you need us. Thank you.