Hello everyone, and thank you so much for joining us today as we talk about Ford's journey from ClickOps to GitOps, so basically from zero GitOps to a whole lot of GitOps. I'm Jaylen. I work at Red Hat on the OpenShift product marketing team, and I'm super excited to talk to you all about Ford. And I have Arthur with me, who can introduce himself.

I'm Arthur. I work at Ford on the platform engineering team that handles the configuration and deployment of OpenShift and Kubernetes. At Ford we obviously build vehicles, but those vehicles are now more connected than ever. The telemetry from those vehicles, the systems that build the vehicles, and our other sub-organizations like Ford Credit, our banking division, all require services, and they're moving from legacy VMs and applications to more modern container-based solutions. So the rise of Kubernetes at Ford is prominent.

And Red Hat, if you didn't have a chance to stop by our booth outside, is an enterprise open-source company. OpenShift is our hybrid cloud application platform, and we're excited to talk about how Ford has been using OpenShift and GitOps. I'll hand it back to Arthur now.

Thank you. Ford started its journey into Kubernetes back in 2017 with CoreOS Tectonic. When CoreOS was acquired, we switched over to OpenShift, and around May 2020 we started to become more production-ready with it. Learning the hard way, we figured out that we needed to switch to a least-privilege permission model. Previously the entire team had cluster admin, access to all the secrets, et cetera, et cetera, and that led to some disastrous effects. So we switched to a least-privilege model where only a subset of the team has that access, and increased access is granted on an as-needed basis.

Later that same year, we switched to Kustomize to manage our YAML manifests. Previously we were using shell scripts to manage the configuration of the different versions of our YAMLs, and that ended up becoming extremely complicated. At that time we were also managing only two monolithic clusters, one non-production and one production, which contained all of our workloads with widely differing performance characteristics.

So in 2021 we started to break those two monolithic clusters down into purpose-built clusters. We split our CI/CD off onto its own clusters, for example one for Jenkins and one for Tekton, and we had shared clusters for general-purpose workloads as well as dedicated clusters for GPU-based workloads or any other custom workload we decided would be better separated from everyone else.

Then, starting at KubeCon last year, we began adopting Argo CD, because as we split off those clusters we suddenly had a whole bunch of permutations of YAMLs to keep track of and deploy. We refactored our Kustomize overlays to be more multi-cluster-oriented and integrated them with Argo CD, which manages the reconciliation of those YAMLs onto the clusters. Where we sit now, basically everything is managed through Argo CD. There may be a stray YAML or two that's missing, but essentially everything is done through Argo CD. Next year we plan to launch Argo CD as a service for our application teams, so they can use it to manage their application lifecycle on OpenShift as well.
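In concrete terms, managing a cluster through Argo CD means an Application resource points at that cluster's Kustomize overlay in Git and reconciles it continuously. Below is a minimal sketch of that pattern, not Ford's actual configuration; the repository URL, paths, and names are all illustrative.

```yaml
# Minimal sketch: one Argo CD Application reconciling a per-cluster
# Kustomize overlay. All names and URLs are illustrative.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cluster-config-prod-east
  namespace: openshift-gitops
spec:
  project: cluster-config
  source:
    repoURL: https://github.example.com/platform/cluster-config.git
    targetRevision: main
    path: overlays/prod-east        # this cluster's Kustomize overlay
  destination:
    server: https://kubernetes.default.svc
  syncPolicy:
    automated:
      prune: false                  # reconcile, but never delete on its own
      selfHeal: true                # revert manual drift back to Git
```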
Our current fleet size is about 50 clusters, with 2,000 application teams split across about 8,500 namespaces: prod, non-prod, CI/CD namespaces, and so on. And with OpenShift 4.14 around the corner, we essentially have to rebuild every cluster, so we'll be immediately doubling our footprint as we migrate people from the old clusters to the new ones. With the decommissioning of other legacy platforms that our applications are hosted on, we'll be increasing the footprint because of that as well. So we should see roughly 2-3x growth in footprint size over the next one to two years.

Some of the challenges we had before we introduced Argo CD: configuration drift, since we have a whole bunch of clusters to keep track of. We already kept our code, our YAML, in GitHub, but even then there were still humans between the code and the cluster. Sometimes you miss something, sometimes you're not paying attention, or you're on the wrong cluster to begin with, and that's disastrous. So as we learned to be more careful, Argo CD just ended up being inevitable.

And why did we choose Argo CD over another product? Its plugin mechanism is very extensible and very helpful. Out of the box, Argo CD takes your YAML and applies it to the cluster, but there are cases where you want more checks in place between the YAML and the cluster. Using CMP plugins, we're able to extend Argo CD to do the custom checks we want. The metrics that Argo CD exposes are also useful for tracking purposes; we aggregate them across all our clusters into one place so we can see what Argo is doing across the fleet. A good example of this: a couple of weeks ago the Red Hat container registries went down while we had an upgrade going on at the same time, and essentially everything was failing everywhere at once because nothing could pull images. We were able to catch that. At least we were able to see it, which was nice. We couldn't do anything about it, but still.

The differential YAML viewer is also nice. We have a lot of stuff auto-syncing, but not everything, so with the diff viewer we're able to see the deltas. It was particularly helpful when we were enrolling old clusters into Argo CD, because we had made configuration changes over the years, but you didn't really know what the configuration of a cluster was until you installed Argo and compared all the deltas.

Some of the plugins we run today: secrets management, where we use the argocd-vault-plugin to pull secrets from our secret manager, so that our YAMLs in GitHub contain only references to secrets and never the secrets themselves. Then we have two other plugins for pulling information directly from the cluster: the infrastructure ID and the unique identifier of the cluster. Those aren't known ahead of a cluster build, particularly with IPI installs; they're generated dynamically, so our YAMLs in GitHub can't contain them. The last plugin is a simple framework for testing the YAMLs, and we'll get into that a little later.

So, like I said, basically everything is managed through Argo CD, and at present about 60% of our configuration simply auto-syncs from GitHub to the cluster.
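The vault-plugin pattern just described keeps only placeholders in Git. Here is a minimal sketch, assuming the upstream argocd-vault-plugin conventions; the path and key names are illustrative:

```yaml
# The manifest committed to Git: the plugin resolves the annotation path in
# the secret manager and substitutes the <placeholders> at render time.
apiVersion: v1
kind: Secret
metadata:
  name: registry-credentials
  annotations:
    avp.kubernetes.io/path: "secret/data/platform/registry"
type: Opaque
stringData:
  username: <username>   # placeholder only; Git never holds the value
  password: <password>
```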
We're also looking at doing more with sync windows, the time-window feature of Argo CD. We want to time-box certain clusters to certain windows, so changes can at least be merged in GitHub sooner and then applied during the window instead of instantaneously. As for the last 40%, as we get more comfortable with how we process our Git PRs and how Argo does its syncing, we'll continue to increase the auto-synced share. One of the major things we still hold out on is upgrades: we don't auto-sync those from GitHub, just in case something were to trigger a downgrade. We don't want that to happen, particularly to a cluster. But we'll continue to increase auto-sync over time.

Since Argo applies the YAMLs to the cluster, we want to make sure the YAMLs in GitHub are as clean as possible. Each Kustomize overlay has a test script in it, and the logic in those scripts varies all the way from just running a kustomize build, to pulling in the secrets, to checking the contents of the YAMLs themselves. If an overlay targets cluster A and there are references to cluster A in the YAMLs, I want to make sure the folder that cluster lives in matches all the content inside it.

One particularly important overlay is the one that manages the cluster certificates. We don't have cert-manager on our clusters due to some internal business requirements, so until that changes, we store our certificates in a secrets tool and pull them in dynamically through Argo CD. But we don't want Argo CD to apply the wrong cert to a cluster, or an expired cert, or to break because someone deleted the cert from the secret manager. So the test that runs on the PR, which is the same test file that runs through the plugin on the cluster, checks: is this cert for the right cluster? Is it expired? Does it have all the intermediate chains? Does the key even match the cert itself? We run all those checks beforehand. We're also using kubeconform to validate the YAMLs against their specs as our linting mechanism.

For items like OpenShift upgrades, we have to include both the OpenShift version and the release image SHA, and based on the Cincinnati graph that Red Hat publishes, we're able to confirm that there are no typos: the SHA for this image matches the tag being used to upgrade the cluster. One feature we want to add there is making sure in our PRs that the upgrade path itself is supported. If Git says you're at this version and you want to go to that version, we'd pull in the upgrade path information as well, so we're not upgrading along an unsupported path, because when you apply the YAML directly, it skips the cluster's checks by default. Whatever version you put in, the cluster will just happily accept.
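For context, the upgrade manifest being checked is OpenShift's ClusterVersion resource, which pins both a version and a release image digest. A sketch with illustrative values; the PR check described above is what confirms the digest actually corresponds to that version in the published Cincinnati graph, since, as noted, a directly applied YAML skips the cluster's own validation.

```yaml
# Sketch of an upgrade manifest (values illustrative). The PR check verifies
# that this digest matches the tagged 4.14.1 release in the Cincinnati graph
# before Argo CD is allowed to apply it.
apiVersion: config.openshift.io/v1
kind: ClusterVersion
metadata:
  name: version
spec:
  channel: stable-4.14
  desiredUpdate:
    version: 4.14.1
    image: quay.io/openshift-release-dev/ocp-release@sha256:<digest>
```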
Since we're using Kustomize to manage our overlays, we wanted to make sure that when we refactored them, we structured them in such a way that we're not duplicating a whole bunch of code, but we also don't have shared code where a single change causes an unexpected change across the entire fleet. So we went with a structure where base contains things that are fairly static and consistent across the fleet. Your namespace YAML, for example, is essentially the same for every cluster and probably won't change, and the same goes for certain RBAC values and a custom job we have for approving operators on the cluster. Then, in components, we have the optional pieces that vary per cluster; in the context of Argo CD, the applications applied to each cluster differ, so those went under components. Under components we also have a versions subfolder, which contains all the different versions of Argo CD and the associated YAMLs that change between versions, generally CRD changes. Overlays then point to the combination of all those previous subfolders plus any custom patches that need to be done on a per-environment basis, and each overlay carries the test file we talked about that checks its specific values.

This is some of the open-source tooling we use to manage everything. Argo CD, obviously. We use Helm very sparingly; the main benefit I've found in Helm is when you need to do nested arrays, because trying to do nested arrays with Kustomize is a mess. So for managing machine sets and storage classes in particular we use Helm, and for everything else we use Kustomize. We use Kyverno in our pipelines to do policy checking, and we're starting to roll it out onto the clusters as well, so we can check in the PR and then check again on the cluster. The argocd-vault-plugin does our secret replacement. Kustomize is our YAML management tool of choice. Kubebuilder is what we've been using to build custom operators, and we're also starting to look at the Operator SDK. kubeconform is what we use for spec checking; we pull the schemas from the cluster, throw them in a repo, and reference them from there. Pipelines as Code with Tekton is our CI/CD and handles all of our PR checks, and Open Cluster Management gives us our observability. The clusters live in GCP, so all of our GCP resources are configured through Terraform.

Not everything on OpenShift is very GitOps-friendly at present, but with these plugins and our custom operators we can make it more GitOps-friendly. Take the monitoring stack as an example. OpenShift configures it through a large YAML-in-YAML, and when you're trying to manage many permutations of it, that becomes very complicated. So we essentially wrote an operator with a CRD that represents this configuration, and all the operator does is convert the CRD into the config map. It's really straightforward, but it lets us use Kustomize to patch the different pieces of the CRD instead of trying to mangle a YAML that's inside another YAML.
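To make the monitoring example concrete: the ConfigMap below is the real OpenShift interface, a YAML document embedded as a string inside another YAML document, which Kustomize cannot patch field by field. The custom resource above it is a hypothetical sketch of the kind of CRD such an operator might consume; the API group and field names are invented.

```yaml
# Hypothetical custom resource (names invented): Kustomize can patch these
# fields directly, and the operator renders them into the ConfigMap below.
apiVersion: platform.example.com/v1alpha1
kind: MonitoringConfig
metadata:
  name: cluster
spec:
  prometheusK8s:
    retention: 15d
    storage: 100Gi
---
# What OpenShift actually consumes: the config.yaml value is itself a YAML
# document, i.e. YAML in YAML.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      retention: 15d
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 100Gi
```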
Similarly, the cluster UID gets injected on a replacement line by one of the CMP plugins, because it's passed to Open Cluster Management to identify the cluster. Another tidbit, if you didn't know: the UID of a cluster is not immutable, so you can lose it. We take a backup beforehand, just in case.

The next step forward for us is managing our tenant namespaces for application teams through Argo CD. Right now we have custom APIs that manage our namespaces and interact directly with the cluster. We want to migrate those to Argo CD as well, which means we first need to get everything into GitHub, so we need to retroactively dump it all, and then reconfigure all of our onboarding APIs to point to Git first. The same pipelines we have for configuring the clusters will also be used for configuring namespace tenants, but with additional policy checks for different business rules, because we need certain annotations on the namespaces, we need certain permission structures, and we want to make sure people aren't throwing in extra YAMLs they may or may not need. As long as the business rules pass, people can even write their own operators and their own APIs for interacting with our clusters; as long as it goes through Git, we're good to go. One check we don't have in our pipelines at the moment, and that will be challenging, is how to account for resource quota increases, because people like to increase their quotas without necessarily understanding how much they actually need to use.
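A business-rule check of the kind described could be expressed as a Kyverno policy, which fits the pipeline-plus-cluster setup mentioned earlier: the same policy file can be evaluated against a PR with the Kyverno CLI and enforced on the cluster by the admission controller. A minimal sketch; the annotation keys are invented stand-ins for Ford's actual rules:

```yaml
# Sketch: reject tenant namespaces that don't carry the required annotations.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-tenant-annotations
spec:
  validationFailureAction: Enforce
  rules:
  - name: check-tenant-annotations
    match:
      any:
      - resources:
          kinds:
          - Namespace
    validate:
      message: "Tenant namespaces must declare an owning team and cost center."
      pattern:
        metadata:
          annotations:
            example.com/team: "?*"          # any non-empty value
            example.com/cost-center: "?*"
```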
I'm going to pass it back over to Jaylen to talk about the next couple of slides.

Thanks, Arthur, for walking us through Ford's journey. No journey that spans so many applications, so many teams, and so many clusters happens without challenges, so let's talk a little about some of the challenges they've faced and what they're planning to do next.

Some of the bigger challenges Ford has come across arose because they've started to scale so much. Namely, the experience of using Argo across multiple clusters and multiple instances is not very pleasant, and the two issues are intertwined. The problem is that when you have more than one instance of Argo CD and you try to look at the status of your applications, you have to go into each individual Argo instance to see the status of each app. There is no centralized place to see the status of all of your applications. You could use one Argo instance to manage other Argo instances, but that also only does so much: you can see information on the configuration of the instances, but not the status of the apps themselves. So we have a bit of an observability problem on our hands, especially at scale. This also makes isolation harder, since the multi-tenancy support is insufficient, and it's just not good security practice. At Red Hat we've seen this problem come up with a lot of our customers, not just Ford; anyone expanding their Argo usage to more than one instance or more than one cluster is going to run into it. So as the Argo community, we have a really good opportunity on our hands to provide solutions as Argo usage grows in larger organizations and teams start to use it at bigger scale.

The next few problems Ford has faced are a little more platform-specific, meaning OpenShift-specific. OLM, the Operator Lifecycle Manager, cannot roll back or roll forward: a main reason to manage operators through Argo is to be able to roll versions forward and back from Git, and OLM doesn't allow that. Some OpenShift components, as Arthur mentioned, use YAML-in-YAML, and we all know how much fun that can be; it's just not easy to manage, it's a bit of a pain, and we don't want to do it. And finally, when Ford upgrades clusters, things don't always go smoothly. It sometimes comes down to improperly configured PDBs (PodDisruptionBudgets) from the app teams: nodes get stuck, applications don't terminate properly. OpenShift has a couple of operators that might help solve this, so Ford will be looking into those in the future. These are just some of the boundaries Ford has run into as they use Argo at scale.

As larger organizations start using Argo, we're learning more about these challenges; we're uncovering what happens when Argo is used at such scale. We see this with other customers as well, and we're working on fixing it. But these are problems that, if we come together as a community to solve them, would be really helpful, and we can figure out how to get Argo past these boundaries.

Finally, let's talk a little about what's next for Ford, what the future enhancements are, and how we're going to help them do more with Argo. Ford wants to do progressive delivery using Argo Rollouts. Arthur's team is looking at Rollouts for the developer teams, so that the devs won't have to worry about Argo instances as much. On the Red Hat side, a lot of our customers are also looking at Rollouts, and some are already starting to use it, so we've begun contributing to the Rollouts community and look forward to getting more involved as we work with our customers. Ford is also interested in Argo CD Image Updater, specifically for deployments; this is something they've been waiting on from the upstream community, and they're hoping to start exploring it soon. With multi-cluster, they're interested in how Argo CD can communicate with other OpenShift or Kubernetes clusters without using tokens. And as Arthur mentioned, they're interested in the Secrets Store CSI driver for secrets management; they want the lowest-touch secrets management system possible, and we're helping them as they explore it.

The plugin system in Argo wasn't created with multi-tenancy in mind, so if you have multiple teams on a single instance who shouldn't be sharing information, like Ford or a lot of other big organizations do, the certificates or secrets should not be stored on the cluster itself, and you'd probably be better off using something like the External Secrets Operator or the Secrets Store CSI driver. The CSI driver is essentially a secrets management tool that acts as a go-between: it fetches secret data from an external secrets store and makes it available to the container. It has just been made Tech Preview in OpenShift 4.14, and it already works with Argo CD, so nothing special needs to be added.
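For reference, the Secrets Store CSI driver pattern looks roughly like the sketch below: a SecretProviderClass names the external secrets to fetch, and the pod mounts them through the CSI driver, so the values live neither in Git nor in etcd as plain Secrets. The GCP provider is shown since Ford's clusters live in GCP; project, secret, and image names are illustrative.

```yaml
# Sketch: fetch a secret from GCP Secret Manager and mount it into a pod.
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: app-secrets
  namespace: my-app
spec:
  provider: gcp                    # providers also exist for Vault, AWS, Azure
  parameters:
    secrets: |
      - resourceName: "projects/my-project/secrets/db-password/versions/latest"
        path: "db-password"
---
apiVersion: v1
kind: Pod
metadata:
  name: demo
  namespace: my-app
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest
    volumeMounts:
    - name: secrets
      mountPath: /mnt/secrets      # secret appears as /mnt/secrets/db-password
      readOnly: true
  volumes:
  - name: secrets
    csi:
      driver: secrets-store.csi.k8s.io
      readOnly: true
      volumeAttributes:
        secretProviderClass: app-secrets
```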
Next, scalability. This working group has been very collaborative and exciting, and it's been a really great opportunity. The Red Hat GitOps engineering team has been heavily involved, and we're very excited to keep working on testing and finding root causes for the scalability issues that customers and the community run into. Ford tends to push boundaries with their use of open-source projects, which is really cool, and I hope they're among the first to test the work that comes out of the scalability SIG.

Finally, Argo as a service. Arthur's team would like to explore running Argo CD as a service for their app dev teams. Arthur mostly spoke about what they've done managing their configuration and infrastructure with Argo CD, so helping their dev teams in this way would be a new area, and they want to give their teams as much flexibility as they need. If a team wants to manage their own instance and wants more control, they can have it; if they want to be more hands-off, Arthur's team can help with that as well. Red Hat and Ford are working together to make this successful. We see a lot of other large organizations use this model, so we're learning from the ones who've been successful and trying to help Ford do the same.

And I think that's it. It's time for questions and feedback. Please let us know if you've run into similar challenges and how you've solved them, and if you have any questions, we'd be happy to answer them. Thank you.

Hi. An approach we've taken for controlling configuration drift and approaching upgrades is repaving: forcing instances to be deleted and then re-instantiated from the freshest spec. You spoke to the challenges of keeping some components of the platform up to date. How do you manage that, upgrading Kubernetes nodes and the software that isn't so easy to auto-sync with Argo CD?

So, if I understand, you rebuild the clusters to do upgrades?

That's the approach we've taken, because it's hard otherwise.

Our developer base would be very hesitant about that model. They are not quick to migrate, so we have to make sure we're able to do everything in place. And overall, from the cluster perspective, we've had pretty much no issues doing upgrades in place. When we run into issues, it's generally with app teams, and even that hasn't really been a problem. We're able to do in-place upgrades pretty frequently; it's just a matter of having to force-terminate things sometimes. Stuff gets stuck and we just have to force-terminate it. We've very rarely hit a case where we had to re-instantiate an entire cluster due to some bug or something.

Thank you.

Okay, I have a question. Part of the list of software you're using, with the help of Red Hat, is OCM, correct?

Yeah.

Okay, so what would be the main difference between OCM and Cluster API?

I'm not too familiar with Cluster API.

Cluster API is another open-source project, I think it came from VMware and Tanzu, to control and maintain your set of clusters. I believe it's part of the Kubernetes projects on CNCF, and it does the same kind of thing: it has a central hub and brings up and maintains clusters. So we have these two open-source projects, and I was really curious how you would have answered that.

I'll have to take a peek at it and see what it offers.

All right. Thank you.