Welcome to today's presentation about Ansible Automation Platform as a service, based on OpenShift. Today, my colleague Silvan and I will show you how we at SIX use AAP together with our customizations to provide a service for our internal developers, so that they can really bootstrap AAP on demand. First of all, let me show you what you can expect from today's session. We will start with a quick introduction, followed by an explanation of the architecture of AAP 2, as well as the basics of Kubernetes operators, since when we speak about OpenShift it is quite important that you also know the basic concept of operators. We will follow with the bootstrap and configuration, that is, how we enhanced the basic operator so that it fits in our zero trust environment, as well as the execution environments, which go hand in hand with that. Then we will also show you how we migrated from the old approach, Ansible Tower, to this new AAP on OpenShift. And to conclude the presentation, we will show you the challenges we faced and our takeaways. But first, let me pass the mic over to my colleague Silvan to introduce himself. Thank you, Philippe. Hello everybody. I'm Silvan Chen. I'm a principal consultant at Red Hat; I joined in 2017, and I'm very happy to be here. I've been working with containers for quite some years, as well as with Ansible, and I'm glad to present today the work we have done with my two favorite products, Ansible Automation Platform and OpenShift, together. I would like to hand back to Philippe, who will introduce himself. Thank you, Silvan. My name is Philippe Hutter. I work as a Kubernetes engineer at SIX. For those who do not know SIX: SIX operates and develops infrastructure as well as software for financial services in Switzerland and Spain. Therefore, zero trust is a must in our environments.
As for my background, I've been working with configuration management technology for quite a while now, for eight years. I started my automation journey with Puppet and am now moving, bit by bit, to Ansible. But as you may know, it's not that easy to let go of one's first love, right? That's enough of the non-technical things, so let's move on to why we are here. When we go back some years, to when we introduced Ansible in our company, we thought about deploying the Ansible Automation Platform — or, as it was named back then, Ansible Tower — on OpenShift. It was supported to deploy Ansible Tower on OpenShift, but there was never an operator officially supported by Red Hat. So we decided to use the upstream AWX operator, already back in the days of OpenShift 3, to deliver Tower as a service to internal employees, with all the benefits: a self-service approach, the ability to use infrastructure as code to deploy their instances, and, because it runs on OpenShift, consistent lifecycle automation. Now, with the move to OpenShift 4, Red Hat officially provides the Ansible Automation Platform operator, which is quite cool. So with the move to OpenShift 4 we decided to adopt this official operator. But as you may know, it's not easy to just take something, put it in a zero trust environment and run it — you also need some customization, and that's what we did. But first things first: I will pass back to Silvan so he can explain what benefits AAP 2 really brings. Thank you, Philippe. I will talk more about AAP 2, but first: what does it bring compared to Ansible Tower? Basically, we have the decoupling of the Ansible Automation Platform into two parts. One is the control plane, also called the automation controller, and the second part is the execution plane, where the user playbooks actually run.
This is really interesting on OpenShift, because you can basically have it running in a microservice way. Regarding dynamic cluster capacity, you can really rely on OpenShift to spawn all the different playbooks and job templates as pods. Before, this was not possible: everything was running in the same pod, which was really the monolithic approach. Now AAP 2 breaks things down into this microservice approach. On the bottom, you can see the automation mesh. This is what actually bridges external VMs to OpenShift. Unfortunately, when you deploy AAP 2 on OpenShift, at least in version 2.3, this is not yet GA — soon, I hope — but at the moment it is a technology preview. So this is the difference when you run it on OpenShift today compared to the traditional way. Obviously, people can do central management with an automation controller, different teams can use it independently from each other, and you really have this as a service. We will talk more right now about how users can provision their own AAP on OpenShift using operators. But first things first, we need to explain what a Kubernetes operator is, and Philippe will do that. Yeah, you mentioned it: the OpenShift operator. If you come from the Ansible world, maybe operators are not something you work with on a daily basis, so let me explain. On this slide, we see a normal OpenShift cluster, and on top you see the API and etcd. The API and etcd — especially etcd — are where OpenShift stores all the state of the applications running in it. So what is an operator? An operator is actually a piece of software running in a container, which you use to automate tasks. In the middle of this picture you see, just as an example, the Ansible Automation Platform operator. This operator has so-called reconciliation loops.
It constantly watches the state in etcd, and when something changes there, it applies the changes to the cluster. On the bottom line, you see the customer namespace. But how does it work for the customer? How can a customer, an internal developer, actually interact with this operator? There are so-called custom resources: users can create custom resources to interact with the operator. As soon as a customer creates a custom resource, the operator notices it and applies the change. In this case, it notices that a customer or an employee wants an automation controller, so it automatically detects that and bootstraps the automation controller in the customer namespace. But is this Ansible Automation Platform operator only about bootstrapping automation controllers, or is there more, Silvan? Can you explain that to the audience? Thank you, Philippe. We will go over the bootstrap and configuration, but that will be explained on the next slide. Yes. Basically, the Ansible Automation Platform operator is responsible for bootstrapping the automation controller — we are really talking about the control plane here; we will cover the execution plane later. It can also do the LDAP configuration, bootstrapped at the start of the automation controller, as well as backup and restore of the database, and upgrades. Every use case around that is performed by the Ansible Automation Platform operator. However, in an enterprise environment you want even more features, so we pushed the automation further with the SIX AAP operator. This is an internally built Ansible operator that does the following: it injects the subscription needed to actually run the automation controller, because you don't want every user to have to inject that. Then it customizes the UI to make it more corporate, more SIX.
It also injects default settings like external logging information, so that the audit logs are forwarded to an external logger, and it configures things such as the container group defaults, with resource management and so on — we will go further into that. Essentially, users create some custom resources and then have everything ready in minutes. As I said, we really focused on an Ansible operator to build this; it was created using the Operator SDK. We could have done it in Golang, but here it makes more sense to use Ansible. Fun fact: the Ansible Automation Platform operator was also developed in Ansible, so it makes sense to use the same technology. Now let me go further into the development and how we avoid conflicts between the two operators. With our operator, we need to make sure it comes in at a later stage, because it needs the bootstrap of the automation controller to have happened first. So we check whether the status shows it is ready to be used, and whether the API is up and running. This is really important, because we inject some configuration, and we communicate with Kubernetes, or OpenShift, using the kubernetes.core collection. There we have different modules — k8s, k8s_info — we can copy files, we can execute commands, and this helps us inject our configuration. But what about secret management? In an operator you cannot use Ansible Vault, so for that we use the HashiCorp Vault lookup plugin to fetch all the secrets in a secure manner. Having explained all the logic of the operators, let's now have an example of how people can actually bootstrap this within their own namespace in OpenShift. In the following picture, you can see the deployment of the automation controller together with a standalone PostgreSQL database.
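Before we look at the picture: the ordering logic just described — wait for the AAP operator to finish the bootstrap, check the API, then inject configuration — could be sketched with tasks along these lines. This is an illustrative sketch, not SIX's actual operator code; the module names come from the kubernetes.core and community.hashi_vault collections, while the condition fields, URLs and secret paths are assumptions.

```yaml
# Sketch: gate the customization operator on the controller bootstrap.
- name: Wait until the AutomationController reports a successful deployment
  kubernetes.core.k8s_info:
    api_version: automationcontroller.ansible.com/v1beta1
    kind: AutomationController
    name: "{{ controller_name }}"
    namespace: "{{ ansible_operator_meta.namespace }}"
  register: controller
  until: >-
    controller.resources | length > 0 and
    (controller.resources[0].status.conditions | default([])
     | selectattr('type', 'equalto', 'Successful')
     | selectattr('status', 'equalto', 'True') | list | length > 0)
  retries: 60
  delay: 10

- name: Check that the controller API answers before injecting configuration
  ansible.builtin.uri:
    url: "https://{{ controller_host }}/api/v2/ping/"  # illustrative host variable
    validate_certs: true
  register: ping
  until: ping.status == 200
  retries: 30
  delay: 10

- name: Fetch credentials from HashiCorp Vault instead of Ansible Vault
  ansible.builtin.set_fact:
    admin_password: "{{ lookup('community.hashi_vault.hashi_vault',
                        'secret=secret/data/aap/admin:password') }}"
```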
You have a standalone PostgreSQL pod, simply one container, and then the automation controller pod, which contains four containers: redis, task, web and EE. The task container is responsible for scheduling all the playbooks; it is very important in terms of resources to have it properly configured. The web container serves the web interface you know from the Ansible Automation Platform, and the EE container is for the receptor. So what kind of changes can you make to the automation controller? You specify a custom resource called AutomationController, and there you can set how many replicas you want, and for each of these containers how much memory it may request and its limits, because it is important to size it properly. You may run this on a shared OpenShift cluster, so you will not have infinite resources. You can find more details in my blog post at the bottom, which is also in the references, where I share all the knowledge regarding the LDAP configuration at bootstrap and how you can integrate the CA bundles to reach the external services within your company. Then there is something regarding scheduling that I would like to share. For example, if you want to schedule your automation controller on specific nodes, you can do it using labels. You can also spread the pods across nodes so that they don't all sit on one specific node; that's quite important. Last but not least, you can use taints and tolerations to schedule them on dedicated nodes, using this concept of node tainting and pod tolerations. Once again, you can find more details, with a lot of different customization use cases, in the reference architecture. It was published during Q1 this year, so it is quite fresh. But every user will definitely use this AutomationController resource, which is basically a YAML file. As we said, it will install the automation controller within your namespace.
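Putting the sizing and scheduling options together, the AutomationController YAML could look roughly like this. The field names follow the upstream AWX/AAP operator conventions, where node_selector and tolerations are passed as YAML strings; all values and labels here are illustrative, so verify them against the operator version you run.

```yaml
apiVersion: automationcontroller.ansible.com/v1beta1
kind: AutomationController
metadata:
  name: my-controller            # illustrative name
  namespace: my-team-namespace   # the customer namespace
spec:
  replicas: 1
  # Per-container sizing: task does the scheduling, so size it generously.
  task_resource_requirements:
    requests: {cpu: 500m, memory: 2Gi}
    limits: {cpu: "2", memory: 4Gi}
  web_resource_requirements:
    requests: {cpu: 250m, memory: 1Gi}
    limits: {cpu: "1", memory: 2Gi}
  # Pin the controller to labeled nodes (passed as a YAML string).
  node_selector: |
    node-role.kubernetes.io/automation: ""
  # Tolerate a taint on dedicated automation nodes.
  tolerations: |
    - key: "dedicated"
      operator: "Equal"
      value: "automation"
      effect: "NoSchedule"
```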
Then we have the part where we developed our own Ansible operator. This is basically the second YAML file, which you create, and it will automatically inject the license. And how do we do the mapping? The mapping works as follows: we take the name of the automation controller so that the operator knows which one it should configure, and then it injects everything, within roughly ten minutes. All right, let's talk more about day-two operations. It's good to have it provisioned, but then you want to tune it accordingly: you want to do upgrades, you want to do backup and restore, you want to monitor the resources. How do we do that? First of all, the upgrade is very simple: whenever Red Hat releases a new version of the Ansible Automation Platform — for example, in the coming months, AAP 2.4 — the operator is responsible for upgrading all the automation controllers you have in your cluster. The second part is monitoring. Why is it so important? Because it gives you insights. In OpenShift, if an application doesn't have this kind of monitoring, I will just get the pod information and that's all — I will not know what it does, how many jobs it runs, and so on. For this, we also implemented, within our custom Ansible operator, the creation of a monitoring workflow so that we can scrape the Prometheus metrics in real time. What do we do? We create an auditor user — a read-only user — in the automation controller. Then we create a Kubernetes secret with the required information, username and password for example, and Prometheus scrapes using these credentials. How do we do this in OpenShift? First we need to enable user workload monitoring, which lets you monitor your own services, and then we use a ServiceMonitor.
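Those two pieces — enabling user workload monitoring once per cluster and a ServiceMonitor per controller that scrapes the metrics endpoint with the auditor credentials — could be sketched like this. The cluster-monitoring ConfigMap and the ServiceMonitor basicAuth fields are standard OpenShift/Prometheus Operator mechanisms; the label selector, secret and namespace names are illustrative and should be matched to your controller's service.

```yaml
# One-time, cluster-wide: enable monitoring for user workloads.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
---
# Per controller: scrape /api/v2/metrics/ with the read-only auditor user,
# whose credentials live in a Kubernetes secret.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-controller-metrics
  namespace: my-team-namespace
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: my-controller  # check the labels on your service
  endpoints:
    - port: http
      scheme: https
      path: /api/v2/metrics/
      basicAuth:
        username: {name: controller-metrics-auth, key: username}
        password: {name: controller-metrics-auth, key: password}
```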
With that, we can say, for each namespace: I want to monitor this endpoint using the Kubernetes secret. This is done automatically for our users at SIX, and then we can display the information in Grafana. All right, let me show you an example. We basically have two panels. In the first one you can see, for each container, the resources it is using. Here we are talking about memory, because we had a bottleneck there. You can use the container_memory_working_set_bytes metric: for each container, you will know how much it consumes. This is very important, especially in the case of the automation controller, because it contains four containers; by default, the OpenShift UI only shows the pod memory usage, so it's very hard to know which container needs more memory. On the bottom, you can see the automation controller metrics — the information we scraped through the monitoring workflow described before. As you can see, this is highly correlated with the first panel and the number of jobs. I'm displaying here the running jobs in total, so at time t I know how many are running. The same goes for the pending jobs, because there is a queue — you cannot process 10,000 jobs at the same time, they queue up, so this is very important. If you want to run more jobs in parallel, you can simply increase the memory. All right. Having said that, Philippe will now take over and talk a bit about backup and restore within OpenShift, especially in the case of the automation controller. Yeah, thank you very much, Silvan. Monitoring is quite important, but what's even more important if something goes wrong is a backup. The official AAP operator from Red Hat actually offers you the possibility to create a backup. It's not very well documented in the official documentation, but you can always go upstream and check the AWX documentation, where you can find all the possible configuration settings.
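Concretely, a backup is requested with its own custom resource, and restored with a matching one. A rough sketch follows; the spec field names follow the upstream AWX operator (including the newer clean-on-delete and pg_dump options), and the PVC, table and resource names here are illustrative, so check them against your operator version.

```yaml
apiVersion: automationcontroller.ansible.com/v1beta1
kind: AutomationControllerBackup
metadata:
  name: backup-my-controller-1
spec:
  deployment_name: my-controller
  backup_pvc: my-backup-pvc        # backups currently go to a PVC only
  clean_backup_on_delete: true     # delete the stored backup with the CR
  # Pass extra args to pg_dump, e.g. skip bulky job event data:
  pg_dump_suffix: "--exclude-table-data=main_jobevent"
---
apiVersion: automationcontroller.ansible.com/v1beta1
kind: AutomationControllerRestore
metadata:
  name: restore-my-controller-1
spec:
  deployment_name: my-controller   # target deployment name
  backup_name: backup-my-controller-1  # reference to the backup CR above
```

The restore progress can then be followed in the status section of the AutomationControllerRestore resource.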
With the latest releases of AAP, they even introduced two cool new things I want to highlight quickly. First and foremost, the cleanup of backups on delete: whenever you delete an AutomationController backup resource, it also deletes the backup itself. Second, you can modify the pg_dump, the PostgreSQL dump. For example, if there are events you don't need to keep, you can exclude them and save some space in the backup. Here we have the name — the name is obviously important if you want to restore it — and the backup gets stored on a PVC. That's currently the only way to store a backup; you can't use an S3 bucket or something like that. Once you have this backup, you can restore it with a similar custom resource called AutomationControllerRestore. There you just reference the backup you want to restore and add the name of the deployment, and obviously, if you use the same name, you need to do some additional steps — it's linked in the comments there. Depending on the size of the backup it could take longer, but you can always see the progress of the restore in the status section. If that's that, back to Silvan for the execution environments. Thank you, Philippe. We talked about the control plane, the automation controller. But then, what about the users? They have it, but they want to run their playbooks. How do they do that? For this, we have automation execution environments. What is that? Let's have a recap. It's basically a container image, based on the Universal Base Image from Red Hat, with lots of RPMs and so on, which you can fetch to have a common base. And then you add everything needed to run your playbooks: all the dependencies — collections, libraries like RPMs or Python modules — as well as the ansible-core version.
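Those dependency types map one-to-one onto an ansible-builder definition file, which could look roughly like this. This is a sketch using the version 3 schema; the internal registry and the exact collection and package lists are placeholders, since in the SIX setup the base image already wires in the CA bundle and the internal mirrors.

```yaml
# execution-environment.yml for ansible-builder (v3 schema)
version: 3
images:
  base_image:
    # Internal, pre-configured base image (placeholder name):
    name: registry.example.internal/ansible/ee-base:latest
dependencies:
  galaxy:
    collections:
      - name: kubernetes.core   # fetched from the private automation hub
      - name: community.general
  python:
    - requests                  # resolved via the internal Python mirror
  system:
    - git [platform:rpm]        # RPMs from the UBI mirror
options:
  package_manager_path: /usr/bin/microdnf
```

A user would then run something like `ansible-builder build -t registry.example.internal/team/my-ee:1.0 .`, push the image to the internal registry, and reference it as an execution environment in the automation controller.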
We pack everything together, and this is the image that is going to be used to run your playbooks — and then you can definitely scale out with it. But how does it work in an enterprise, disconnected environment? Basically, we use Ansible Builder for this, but with a different approach, because we are not connected to the internet, so we need some customization. We already created the Ansible base images within SIX. They contain additional settings: we have the SIX CA bundle to trust our internal systems; we have the private automation hub running to fetch all the Ansible collections; for Python, we have Artifactory for everything related to the Python modules we want to fetch; and a UBI mirror where everything is mirrored. So the users don't need to do anything special — they just use our SIX base images, and all the dependencies are gathered from within the company environment. Then, what does it look like on OpenShift? We have the control plane, as displayed here, where we can see the resource management, the monitoring, how much memory it uses and so on. And then we have the execution plane. As I said at the beginning of the presentation, this is separated: it is not running in the same pod, it really spins up new pods using these container images, the execution environments. Here we really leverage them together with container groups. What are container groups? Basically, they are the pod specification: you may want to mount additional volumes, or allocate more memory to your container, and this is how you do that with container groups. So first you use execution environments, and later you can even customize them with the pod specification using container groups. All right.
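A container group's custom pod spec is itself just YAML, along the lines of the sketch below. The structure with the `worker` container and the ansible-runner arguments follows the default pod spec the controller generates; the image, resources, PVC and namespace are illustrative additions of the kind a container group lets you make.

```yaml
# Custom pod spec for a container group: extra memory plus a mounted PVC.
apiVersion: v1
kind: Pod
metadata:
  namespace: my-team-namespace
spec:
  serviceAccountName: default
  automountServiceAccountToken: false
  containers:
    - name: worker
      image: registry.example.internal/team/my-ee:1.0  # the execution environment
      args: [ansible-runner, worker, --private-data-dir=/runner]
      resources:
        requests: {cpu: 250m, memory: 1Gi}
        limits: {memory: 2Gi}                          # more headroom for big jobs
      volumeMounts:
        - name: extra-data
          mountPath: /data                             # extra volume for the playbooks
  volumes:
    - name: extra-data
      persistentVolumeClaim:
        claimName: my-data-pvc
```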
Now I would like to pass back to Philippe, who will talk more about the migration from Ansible Tower to AAP 2. All right. We have a solution now, but you also somehow need to convince the customers, our internal employees, to use this new solution, right? So what does it look like, and how fast is it actually? The customers of our new solution create two custom resources: one for the automation controller, plus one for the customization. They can do that in about ten minutes if they don't have it already — we already have templates for that. After applying them, they of course also need to migrate their old Python virtual environments to execution environments. Once that is done, they already have an environment where they can run their first job. But, as Silvan said before, we introduced this monitoring stack to give the customers insight into what's going on, because if you really have something like 500 concurrent jobs, you may hit limits, memory- or CPU-wise. So it really is an iteration where you need to fine-tune your resources. But that's already it — it's quite simple to onboard new customers onto this approach. As I said, it looks nice and it's easy to use, but in the background we had some challenges when introducing this new stack, and we want to show you the recent ones — not all of them, just the recent and ongoing challenges. First and foremost, there was the Galaxy collection install, which failed with version 2.14.5. It's already fixed in the latest version, which we're quite happy about. Another bug — and this was actually the trigger for introducing the whole monitoring stack — was that tasks, Ansible tasks, were marked as running but were actually not present in the job queue.
The reason for that was that the automation controller ran out of memory, but this was quite hard to detect without proper monitoring in place. That said, those task issues are solved, since we adjusted the resources and the bug is fixed in the newer version. But there are other issues we are currently hitting. One of them is the usage of the underlying OpenShift node storage. It doesn't look that obvious, but the AAP operator, or rather the automation controller — especially the task container — uses an emptyDir for caching its jobs. An emptyDir on a local disk can't really be limited, so it can happen that your task container fills up your node with its temp directory, and on a shared cluster that can be quite problematic. It's an open bug, and hopefully it will get fixed at some point. Another one: the rsyslog configuration is not loaded when the web container is restarted. We have a workaround that triggers a reload with an API call; however, it's not fixed upstream, it's just a workaround we implemented. But even with these challenges — and challenges are quite normal — we also have some takeaways. With this new solution, we have a self-service for customers, for internal employees, to bootstrap their own environment in under ten minutes if they already have the YAMLs available. They can use the template YAMLs, so they don't need to go through the documentation to bootstrap their own environment. It's fully functional in a disconnected mode, since we have the customization operator, which does all this work for the customers. And since it's really based on OpenShift, you have the benefits of scalable and reusable containers. If you're interested in more references, we put all the references we used while building this solution on this slide.
Especially the blog post right at the bottom: it's not there just so we have a fuller list of references that looks better — it was really written by Silvan. So if you're interested in how we did it, and want to get some more code snippets out of it, it's all part of this blog post. Good. With that, we come to the point where we want to open the stage to you for any questions you may have. All right. Last chance for any questions. We will also be outside, so if you have any questions, feel free to ask them afterwards. Thank you very much for attending this session. I know it's quite hard after the event yesterday, but luckily we got some attendees. Thank you very much.