Hello, everyone. My name is Ken. I work at Astra, which is a launch services provider. Today I'm going to talk about one of our use cases for Argo and how we use it to orchestrate some of our launch tests and simulations. Quick show of hands for folks that are using Argo today. OK, so that's pretty much everybody. Folks that are using Hera today? Nobody? OK, wow, cool.

So before we talk about Argo, first we need to talk about how you even test rocket software, which is something that I've been learning over the past year. Testing software that runs on a rocket has quite a few challenges. You've got multiple embedded systems. We're not using Raspberry Pis or off-the-shelf reference boards; it's typically custom hardware. We've got a lot of I/O interfaces and control loops that are meant to talk to physical things and then receive data from physical things like sensors, plus real-time systems. And it's a high burden to actually operate the real thing. You want to be able to ensure that your software is functioning correctly before you plug it into the physical hardware, because you could break that hardware if it's not.

That leads us into hardware-in-the-loop, commonly referred to as HIL. This is a test methodology for embedded control systems where you take the control system under test and, instead of hooking it up to the real physical things it's going to control, you hook it up to simulated models. It sends control signals and gets feedback from the models, and as far as it knows, it's talking to the real thing, but it's not.

So what does that look like? At a high level, if you look at the picture here, up in the green box is what we call the device under test, the control system that we're testing.
Typically the inputs and outputs are physical wires going into a simulator or simulators. In a nutshell, the simulator is usually some specific hardware platform, sometimes referred to as DAQ, or data acquisition, hardware, that takes all of the wires coming out of the device under test and converts those electrical signals, through some software, into data that the models can consume. The models then do a similar thing on the way out: they talk to the data acquisition hardware, which turns their outputs back into electrical signals that go into the device under test's inputs.

That's basically how you do it in a nutshell, but it gets a little more complicated in our case because we have some other things to control and orchestrate. We have our ground system software and the computers that we use before the rocket takes off. We have our cloud data pipeline, where we ingest all the telemetry data for analysis. And in addition to the DAQ hardware, we also have a GPS simulator that needs to be orchestrated as well. The GPS simulator is hooked up to the models, and it sends spoofed GPS signals to the vehicle to basically convince it that it's receiving real GPS data.

All of this needs to be orchestrated so that we can automate deployments to this system. Some of the challenges with that: we've got multiple hardware platforms with varying levels of APIs. Some of it's hardware we build; some of it's hardware we buy from other people. Hardware state needs to be queried, verified, and reset through those APIs, or sometimes even CLI tools. We've got lots of software to deploy, not only into the test system but also onto the launch system and the vehicle itself.
And we've got a lot of log and data sources that we need to look at when we're monitoring or debugging the system. It's time-intensive, and it's an expensive system to build, so if it's not functioning properly, or if we have an issue running tests, that costs us time and money. Those are some of the challenges with operating this. I think the big one, as it applies to how we use Argo, is the manual process we had before Argo, which was typically a human going through a well-documented procedure. That takes a long time, it's error-prone, and you don't have the ability to run it unattended. And when you have a human running manual steps, you don't automatically get logging sent to centralized systems; that person would have to cut and paste logs out, or use some tool that would do that for them. Having computers do this gives us lots of benefits.

So we decided to orchestrate the system using Argo Workflows in particular. Why did we choose Argo Workflows? Well, our entire toolchain and build system was already containerized. All of the software that we build in-house and deploy into the launch system, all of the CLI tools, all of the API clients — everything is already built and shipped in containers. The rocket doesn't run Docker or Kubernetes, but the software lives in a Docker image, which then goes into a system that's capable of updating the vehicle, and that was all done through Docker. So it wasn't really a big leap for us to go from people running containerized tools to a workflow system like Argo, where every step is effectively just a container. We use K8s quite a bit across the company already, so again, not a big jump for us. And we already had some experience with Argo.
Overall, when looking at different options — there was a talk before us comparing Argo Workflows versus Airflow, and there are tons of options out there for building workflows, for automating things, for task management and scheduling — this was by far, out of everything I looked at, the least amount of code to actually build the system, mostly because of the things I already mentioned around our adoption of container images and Kubernetes. One other aspect, which relates to our decision to use Hera that I'll get into later, is that for non-embedded work we use Python quite a bit. So when we decided to leverage Hera, having a Python tool that allowed us to codify and create those workflows was very beneficial as well.

A couple of key considerations we were thinking about before we started building this. Versioning the Argo workflow, or at least capturing the state of an Argo workflow at a point in time, is important to us because the environment evolves, and so its configuration will change over time. Sometimes we need to go back in time and run the same test or simulation in the exact same environment we ran it in before, because we need to go back to an older piece of hardware or an older piece of software. So unlike writing a web app or something, it's not always just living off of main or trunk and moving forward. Sometimes you have to move back, and you want to know which workflow you ran with a particular test or simulation so that you can run it again later. Or, later down the road, you may discover that there was a bug and you want to know which tests or simulations were impacted. The bug could actually be in the workflow itself, which could have an impact on the outcome, so you need to have all that metadata.
On triggering workflows: we wanted to be able to trigger the hardware-in-the-loop simulations through pipelines, GitLab or GitHub type pipelines. If somebody's working on a particular piece of code and they make a change, then in addition to their unit tests and local integration tests, they want to be able to run it on this platform. In the manual world that was hard, because now they've got to coordinate getting time on the system, figure out how to use it, how to get their code deployed, and how to run it. So we really wanted it to be as simple as possible from an automation perspective.

It also needs to be triggerable by humans. Not all of our use cases are pipelines or automated processes kicking off jobs. We have humans that want to run one-off tests or series of tests — maybe test out certain software configurations or certain hardware changes — and then run that test through something like a GUI.

Also, because this is an expensive resource, we can't just go to Amazon and say, give me a thousand simulation environments — and if we could, it would probably be prohibitively expensive. We have few of these resources, and we generally have more demand for them than we have supply. So we need to be able to queue all of the requests coming into the system, which Argo doesn't natively do for us, and we also need to be able to lock those resources to ensure that we don't have multiple people or multiple systems running workflows on a simulation environment at the same time.

At a high level, the architecture overview, or really the technology choices: no surprise there — we're at ArgoCon, I'm talking about Argo. We chose Argo Workflows; I already talked about why that was a fairly easy choice for us.
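The queue-and-lock requirement described a moment ago — requests stacked up FIFO, with at most one job running per simulation environment — can be sketched in a few lines. This is an illustrative in-memory stand-in, not our actual hill-queue code (which sits behind FastAPI and Postgres); all names here are hypothetical:

```python
from __future__ import annotations

import threading
from collections import deque
from dataclasses import dataclass


@dataclass
class SimRequest:
    """One queued simulation run: the three inputs described in the talk."""
    software_ref: str   # Git ref of the launch-system software to test
    workflow_ref: str   # Git ref of the Hera code base (workflow version)
    target_env: str     # which simulation environment to run on


class HilQueue:
    """Toy queue: FIFO per target environment, locked while a job runs."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._pending: dict[str, deque] = {}
        self._running: dict[str, SimRequest] = {}

    def submit(self, req: SimRequest) -> None:
        with self._lock:
            self._pending.setdefault(req.target_env, deque()).append(req)

    def try_start_next(self, env: str):
        """Pop the next request for `env`, unless a job already holds it."""
        with self._lock:
            if env in self._running or not self._pending.get(env):
                return None
            req = self._pending[env].popleft()
            self._running[env] = req
            return req

    def finish(self, env: str) -> None:
        """Release the environment so the next queued request can start."""
        with self._lock:
            self._running.pop(env, None)
```

The real thing persists requests in a database and kicks off a Kubernetes Job per dequeued request, but the invariant is the same: `try_start_next` returns nothing while an environment is held, so two workflows can never land on the same simulator at once.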
For actually authoring and codifying the workflows, we chose Hera, a Python SDK that has full APIs into all of the Argo Workflows primitives — the stuff you would see in the CRDs or in the YAML when you're building a workflow — and lets you author that in Python. You can codify your entire workflow in Python and then use Hera to produce the YAML that you would submit to Argo Workflows. You could just use it to create the YAML file and keep track of that yourself if you want; in our case, we run the Python code to generate the YAML and then submit it every time we run a job. I'll get into that a little more.

In terms of the Argo community, I honestly don't know much about the complexity of workflows and what other people are doing. We have around 2,500 lines of Hera code, and I don't know if that's a lot or a little, but we do have a lot of helper classes and such to make it easy to write new workflows and to add functionality to workflows.

And then, as I mentioned earlier, we need to be able to queue these jobs coming into Argo. So we built a purpose-specific queue for this around FastAPI, which is a Python web framework, and Postgres, plus some of our existing internal web platform and an Angular UI in front of it, which I'll show you some screenshots of in a little bit.

Next we'll look at a high-level overview of the architecture. On the top left, I've tried to show that we have multiple inputs coming into the system: we have pipelines, and we have end users requesting simulations through either the UI or the API. Those get queued up into a system that we've coined "hill queue" — short, and it gets the point across.
That basically stacks up requests for Argo jobs. What happens is a user comes to the system with three main inputs. First, what is the software version of the launch system that I want to test — a Git ref or something that links to every single piece of software that goes into the system, so you can think of it as a deployment manifest. Second, what is the version of the Argo workflow that I want to run, which is the Git ref for our Hera code base. And third, what is the target environment? You provide those three inputs, and the request gets stacked up into this task queue.

When your target environment is available, the next step kicks in. We package our Hera code in container images: every time you want to update a workflow or create a new version of it, you make an update to our Hera code base, and that produces a net-new container image, tagged so that you can correlate it to the Git ref. So when you tell the system, hey, go run this workflow for me on Argo, it knows it has to go get a particular version of our Hera Docker image. It runs that in a Kubernetes Job, which runs the Hera code, submits the YAML to Argo, and kicks off the workflow.

Then the hill-queue system effectively babysits the workflows. It knows at all times which workflows are running on which systems. We do that through the Argo web API — not the Kubernetes API and the CRDs for the actual Argo workflows, but the REST API behind the Argo UI. We actually just query that directly to get workflow status.
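That status-polling step — hitting the REST API behind the Argo UI rather than the Kubernetes CRDs — looks roughly like this. The Argo Server exposes workflows at `GET /api/v1/workflows/{namespace}/{name}`; the server URL and token here are placeholders for your own deployment:

```python
import json
import urllib.request


def workflow_phase(wf_json: dict) -> str:
    """Pull the phase (Pending/Running/Succeeded/Failed/Error) out of a
    workflow object as returned by the Argo Server REST API."""
    return wf_json.get("status", {}).get("phase", "Unknown")


def fetch_workflow(server: str, namespace: str, name: str, token: str) -> dict:
    """Fetch one workflow from the Argo Server REST API (the same API the
    Argo UI talks to). `server` and `token` depend on your deployment."""
    req = urllib.request.Request(
        f"{server}/api/v1/workflows/{namespace}/{name}",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

A babysitter loop then just calls `fetch_workflow` periodically and feeds `workflow_phase` back into the queue database, releasing the environment when the phase is terminal.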
Whenever we put something into Argo, we attach a bunch of metadata so we know how to track it; that's in our queue database, and we can come back and check up on it later. And then Argo goes ahead and orchestrates that job against our simulation platform.

Let's see here, we've got about nine minutes left. I'll go through a couple of things around user experience and what it looks like. As I mentioned, we have the queue that stacks up all the jobs before they go into Argo, and we have an API and UI in front of that. This is what the end user would see. If I had a pipeline that kicked off a simulation job and I want to check on its status, I can go to this UI and see, okay, this is the job that's running, and it's got a link to the Argo UI so I can go see the status of that workflow. I can see which jobs are stacked up waiting for Argo to execute them on our simulation platforms, I can see which job is coming up next, and there's also a bit of visibility into the environments that are available.

What does the workflow look like? At a high level, it looks like this — and here's another version of it too. Again, I don't really know how complicated these workflows are by Argo standards, in terms of logic and nodes and everything like that. In my experience, at least, doing this with something other than Hera would have been really tough. It was fairly easy to do in Hera and to keep it all in my head and know what's going on, whereas trying to build our own tool to manually construct the YAML, or something else, would have been a bit more challenging.
One other screenshot here: when we're actually running the Argo workflow, as it goes through each step of the simulation process — configuring the hardware, deploying the software — the workflow itself also shoots messages back to a chat channel. That gives people with running jobs links so they can drill down into the workflow if a step fails, or just go see what the status is.

In terms of experiences with Argo: as I mentioned earlier, I feel like we were able to build this out much faster because we were using Argo and because we already had everything containerized and were very Kubernetes-centric. The other things Argo got us out of the box that were super helpful were a system that automatically integrates with our SSO, and a log-archiving system built in. One of the things we struggled with before, as I mentioned, is that when people run these steps or simulations by hand, it's up to a human being to capture all of the output and get it into some system where we can see it later. Argo gives us all that out of the box: in our workflow we can just say every step should be archived, and it automatically captures all the log output and keeps it in the object store. Then if I want to look at a job that ran two months ago, I can go look at that job and see all of the logs for every step.

I guess the only other thing: I don't know what the current status is, but I've heard people talking about having a more Argo- or K8s-native queue. That would definitely be a welcome addition.
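On that log-archiving point, the "archive every step" behavior mentioned above comes down to a single spec-level flag in the rendered workflow YAML — a sketch, assuming you've separately configured an artifact repository (e.g. an S3-compatible object store) for the cluster:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hil-sim-
spec:
  entrypoint: main
  archiveLogs: true   # archive every step's logs to the artifact repository
```

With that set, Argo captures each step's stdout/stderr to object storage automatically, which is what lets us pull up the full logs of a job months later.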
I feel like what we built — going back to the system for queuing up Argo jobs — is not an uncommon use case, so I was kind of surprised that there hasn't really been any traction around that lately. But maybe if folks know of something, they can tell me about it in the Q&A.

Let's see here, what else — oh, Hera, yeah. Considering how few people raised their hands around Hera at the beginning of the talk, I'd say if there's one thing you walk away from this with: if you're using Argo, definitely check out Hera. I felt it made implementing this project way easier. I actually feel like I learned the Argo Workflows API through the Hera code base: when I was trying to figure out how to do stuff, I would just go read through it. Hera basically maps all of the Argo resources into Python code, so I could jump into their code, look at it, and see how those things worked. I felt that was actually better than reading documentation. I feel like it's a lot easier to reason about complex workflows in Python versus YAML, and then we got a lot of benefit out of creating reusable code.
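To make the reuse point concrete, here's a rough sketch of the pattern — shared defaults set up in one place and applied across a DAG. These names and the dict shape are illustrative only, not our actual Hera helpers:

```python
from dataclasses import dataclass


@dataclass
class TaskDefaults:
    """Boilerplate shared by tasks that use the same container image
    (all values hypothetical)."""
    image: str = "registry.example.com/tools:latest"
    cpu_limit: str = "500m"
    memory_limit: str = "512Mi"


def task_spec(name: str, command: list, defaults: TaskDefaults) -> dict:
    """Build one task's spec from shared defaults -- the same idea as our
    baseline Hera helper classes, minus the Hera types."""
    return {
        "name": name,
        "image": defaults.image,
        "command": command,
        "resources": {
            "limits": {"cpu": defaults.cpu_limit,
                       "memory": defaults.memory_limit},
        },
    }


def with_failure_notification(tasks: list, channel: str) -> list:
    """Attach a notify-on-failure hook to every task in a DAG, like our
    'send a Mattermost message if anything in this DAG fails' helper."""
    return [{**t, "on_failure": {"notify": channel}} for t in tasks]
```

The payoff is that a new workflow only declares what's unique to it; resource limits, images, and notifications come along for free.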
For our workflows, we have some baseline classes and functions that set up boilerplate for common tasks — for example, CPU and memory resource limits. You can set all that up in one place for tasks that share the same container image. For our Mattermost (kind of Slack-type) notifications, we have a little helper class, and then I can literally take an entire DAG and say, effectively, for everything in this DAG, send a Mattermost message if one of them fails. That type of stuff. So I found it super useful, and I'd definitely recommend checking it out if you have not. The response from the Hera developers was great as well: we opened at least one bug and a couple of feature requests, and always got super prompt feedback. Overall, a great experience.

As a parting screenshot here on Hera, this is basically how our workflow initialization works at the end of our Hera code base. We've got a couple of different DAGs; we set them up and then set things like, if this DAG succeeds, run this thing; if this DAG fails, run that thing; and then on exit, run another DAG, which is just a list of tasks we run at the end of every workflow. Obviously there's a lot of code behind this to set these things up, but in terms of being able to express what you're trying to do, I found it super useful.

All right, we've got two minutes for questions. Anybody? Okay, thanks.