So, good morning! Sunday morning, 10:15, sunny in Brno. What better place to be than in D105? I'm glad everybody survived the party yesterday. Nobody got wet from the outside, maybe just from the inside. The people that got wetter from the inside are maybe not here. So I'm really happy that you showed up, thank you.

So, we'll talk about how much open source is in cloud services today. My name is Marcel Hild. I'm a Managed OpenShift Black Belt. It's a mouthful of a title, but essentially I'm selling managed OpenShift to customers in a technical way. And my colleague Roberto Carratalá, from Madrid, is also with me. He's been in this business way longer than me.

So, we'll talk about what ROSA, ARO and OSD are. These are some fancy acronyms. Then we go through the various days of setting up clusters at scale: starting from day zero, where we fully automate the deployments, then day one, which is not really a day, where we do the initial configuration of these clusters, and then the ongoing maintenance and operations of these clusters, so monitoring, logs and so on. And then finally, how do we fix things if something goes wrong? And hopefully we come to a good conclusion on what is in there for you, because not everybody is supposed to run thousands of clusters, maybe just a couple of them, but you can draw some inspiration.

So, what is ROSA, ARO, OSD? Is it something that you shoot with your crossbow? Or just a color, or just a three-letter acronym that nobody really cares about? Let's take a step back and look at what we've been doing at Red Hat for the last, how many years, 25 years? 25 years of productizing open source projects. So we picked these fancy projects there, and in the middle you see Kubernetes. We take that, put it into OKD, which is the upstream of OpenShift, and then we productize it, and it's not just one product, there's a multitude of products. So essentially OpenShift is a Kubernetes distribution which has some other projects also involved.

Most of you are probably familiar with deploying OpenShift yourself, where you run the installer. We had many, many ways of installing this. Sometimes it was Ansible, these days it's a Golang binary, and then it provisions some infrastructure, yada, yada, yada. That's on the bottom there. So you can deploy it everywhere you like. That's the whole value proposition of OpenShift. And you can also deploy it into these cloud services. So we have it running on AWS, which is ROSA, Red Hat OpenShift Service on AWS. We have it running on Azure, ARO, Azure Red Hat OpenShift. So now these acronyms make a bit more sense. We have it also running on IBM Cloud, which is called ROKS. I don't know how you pronounce it, ROKS, okay. So we're getting close to actually making some sense of these acronyms. And there's OpenShift Dedicated. This is how it all started. So you can also deploy it into Google Cloud. And this business of running OpenShift clusters for customers got so popular that we moved out of OpenShift Dedicated into these other clouds as well.

So, this is typically what you would have to do as somebody who operates OpenShift. You take care of this bottom layer where you set up your infrastructure. Then, as you move up the stack, you have to configure your network, you have to configure the control plane, the master nodes that take care of your cluster, et cetera, and yada, yada, yada. But essentially, you probably just want to run workloads.
Like, when you installed Red Hat Enterprise Linux, you didn't want to care about who's packaging these things and who's doing the upgrades, you just wanted to install an Oracle database or whatever. And in OpenShift, it's not different. So, what if we could shrink this picture down to just this, where you only care about your workloads, you care about setting up your namespaces and operating your stuff, and let Red Hat and our SREs do all the work for you?

So, SRE has come a long way, right? Like 13 years ago, we just deployed stuff and threw it over the wall, and then the ops people took care of it, fingers crossed that we didn't do it on a weekend before heading to a popular swimming pool in Brno, and the people who wore the pager took care of it. Then we tried to merge people from development and from operations, using the tools from both sides, the best of both worlds. This is how DevOps came about. And I like to see SRE as an implementation of DevOps. So, if you're doing SRE, you are doing DevOps in a certain way. That doesn't mean that if you are doing DevOps, you are doing SRE.

So, SREs are people that take care of and run your operations. They also work closely with your engineering teams, but they are not the engineering team. The SRE people at Red Hat manage all these clusters at scale, so really think about hundreds and thousands of clusters that they manage with a small team. They are also developers: they are building all these services to manage and monitor OpenShift environments and to deploy these clusters. So there's a lot of engineering involved. And to do this reliably, you need to automate a lot of stuff: automating adding storage, capacity, autoscaling, all that stuff, and making it repeatable so that you can essentially scale, because you don't want to hire more and more people the more clusters you deploy.

And then, obviously, you're not done once you install the cluster, as in pre-sales, where you can tear it down again once you've shown that everything works. You actually want to use that cluster. And as it's software, there are also bugs. So we need to observe this environment, make it reliable and act upon any incidents. That's observability and reliability, day-two operations.

So, we'll take you through this journey. On the left, somebody installs a cluster with a click of a mouse button or via the CLI; then something is kicked off in the background to install that cluster, do the initial configuration and then monitor it from day two to day n. Although we call them days, the whole cycle to get to day two only takes 30 or 40 minutes. So if we would install a cluster right now at the beginning of this session, at the end we would have a cluster up and running, which is pretty awesome, I think. So, Roberto will take care of the next part.

Hello. Right, we have only one mic. So, now we will start with the day-zero operations: we need to deploy fully automated and scalable clusters. We have an initial problem. We are a DevOps team, we need to deploy our clusters in a scalable and maintainable way, and also easily, and we want to deploy our OpenShift and Kubernetes clusters across multiple hyperscalers: we have AWS, we have Azure, Google Cloud, and IBM. So we have a plan, and we need to deploy all of these projects that we have in the CNCF, and we have several meetings, we try to have several meetings with the business.
And the business guy wants to put everything in: wants to put storage, wants to put AI, wants to put every single piece of software out there. But we need to do it scalably. We start working, working, working, and we end up like this. So, to avoid this DevOps guy ending up in this mess, we need an easy and scalable solution to deploy and maintain our different clusters across different clouds, and also have this lifecycle and supportability.

And we want to introduce Hive to save the day. Hive is an operator that runs as a service on top of Kubernetes or OpenShift, and it is used to provision clusters and also to perform the initial configuration and day-two operations. For provisioning OpenShift, it uses the OpenShift installer under the hood. And it supports different cloud providers: AWS, Azure, Google Cloud, and IBM Cloud.

We have this architecture, more or less. We won't do a deep dive into this architecture, but at the top we have the Hive namespace, the brain, and for every single managed cluster we have one cluster namespace. That namespace stores the different secrets and the different pieces and components for that cluster. And to deploy one cluster, we just need to create a ClusterDeployment, which is a CRD, an extension of the Kubernetes API, where we define which platform we are deploying to and the different components around that.

Hive is part of an open source project called Open Cluster Management, a community-driven project focused on multi-cluster and multi-cloud scenarios: deploying clusters and workloads at scale and maintaining the different Kubernetes and OpenShift clusters. We have a downstream product called ACM, or Advanced Cluster Management for Kubernetes, to deploy these OpenShift and Kubernetes clusters at scale, on different platforms and across different regions as well.

So now that we have our lovely cluster deployed in, I don't know, different hyperscalers, we need to perform the initial configuration, because it's just blank. We need TLS encryption, we need TLS encryption everywhere, as these two guys on the slide want. And for that reason, we have the cert-manager operator to manage the certificates at scale. We need to be able to manage the different certificates at scale: we need to provision the certificates, re-issue the certificates once they hit, for example, their expiration date, and also revoke the certificates once we retire or decommission these clusters. There is also this deep dive that explains how this works in more detail.

And brilliant, we have our certificates in our cluster. Now, how about shipping and maintaining the day-two configuration at scale? We deployed our clusters, we deployed our certificates, but what about the day-two configuration itself? We can do it manually, right? Or using GitOps, right, it's a very hot topic, or using some resource management tooling. Well, it's magic. But it's not magic. We are using some sort of GitOps, but we are using a piece of software that is included in Hive, called SyncSet, to facilitate the resource management. We are shipping the different objects and the day-two configuration with this, in a more or less GitOps-like approach.
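To make the two Hive objects just mentioned a bit more concrete, here is a minimal, hedged sketch of a ClusterDeployment together with a SyncSet. The cluster name, domain, region and referenced secrets are invented for illustration, and the field layout follows the hive.openshift.io/v1 API as documented in the Hive repository, so double-check against the upstream docs before relying on it.

```yaml
# Minimal ClusterDeployment: tells Hive to provision an OpenShift
# cluster on AWS using the referenced install-config and credentials.
apiVersion: hive.openshift.io/v1
kind: ClusterDeployment
metadata:
  name: demo-cluster
  namespace: demo-cluster          # one namespace per managed cluster
spec:
  clusterName: demo-cluster
  baseDomain: example.com          # illustrative domain
  platform:
    aws:
      region: eu-central-1
      credentialsSecretRef:
        name: aws-creds            # cloud credentials live in the cluster namespace
  provisioning:
    imageSetRef:
      name: openshift-v4-release   # ClusterImageSet pointing at the release image
    installConfigSecretRef:
      name: demo-cluster-install-config
  pullSecretRef:
    name: pull-secret
---
# Minimal SyncSet: ships a ConfigMap to the cluster above and keeps
# re-applying it, instead of a user pushing it by hand.
apiVersion: hive.openshift.io/v1
kind: SyncSet
metadata:
  name: demo-cluster-day2-config
  namespace: demo-cluster
spec:
  clusterDeploymentRefs:
    - name: demo-cluster
  resourceApplyMode: Sync
  resources:
    - apiVersion: v1
      kind: ConfigMap
      metadata:
        name: motd
        namespace: openshift-config
      data:
        message: "managed by SREs, please do not edit"
```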
But in contrast to, for example, Argo CD, which syncs immediately, we are using SyncSets in order not to stress the API of the different clusters, and also to re-apply the resources once the cluster is installed and ensure that they are maintained across the whole lifecycle. So we also keep the contents maintained once they are updated.

So once we have this in place, we need to ship the configuration, and we are using managed-cluster-config. This managed-cluster-config uses the Hive SelectorSyncSet syntax. It's not really an operator, it's a bunch of different YAMLs across a repository, and it is able to bundle these into a template and ship them across the different clusters as day-two configuration. We have these tools and methods in use in OSD and ROSA, and also in ARO. This is a brief example of how you can put it in one repository and ship it; this is a real example of how the SREs ship this. So if you click it, this is the day-one and day-two configuration that is shipped across the different managed clusters.

And imagine that, yeah, we shipped our cluster, we have the cluster ready and we are handing it over to the users. And these two DevOps guys that are super excited get a call from the business that says, yeah, no, you need to turn your ship around, and instead of going that way, you need to turn around and go backwards. But it's not possible. Yeah, hold my beer. And after that, we change one thing and another thing and another thing, and they will be stuck. Running Kubernetes in production, they said. It will be easy, they said.

In order to prevent users from breaking clusters, we need to put some guardrails in place, some secure boundaries. And for that reason, we are using validating admission webhooks. These webhooks prevent certain customer operations that would break the cluster. So we put these admission webhooks in place to prevent anyone from removing or changing the protected namespaces, also Prometheus rules and a bunch of different things, so that users can't mess around with the cluster; they put some guardrails around the different clusters.

And now we are heading to day two. So, day two, or minute 20 of cluster operation. We made sure that the cluster is up and running, you can't mess around with it, and if you mess around with it, it gets reverted. But we also need to look at what's going on there. Platform monitoring is essentially not different from the everyday platform monitoring that you would do in your own OpenShift cluster. You have a bunch of alerting rules, the out-of-the-box OpenShift alerting rules and then some SRE-added alerting rules that are more specific to these managed environments. Then you have a Prometheus instance running on this local cluster, on the customer side or in the cloud, which is collecting metrics, which are then exposed to an Alertmanager, which is also part of the Prometheus suite. And then it sends these alerts to PagerDuty, which is a paid service where you can manage your incidents, and it sends them to GoAlert, which is an open source alternative. So they're using this mix of alerting services. I actually don't know which alerts go to which service, but it's always good to have a choice. And there's Dead Man's Snitch. If nobody has heard of Dead Man's Snitch, it's definitely something to look into. I have it running in my home setup to alert me when my home Prometheus goes down. For one rule, it's free. It's basically the poor guy that has to hold down a button, and if he falls over dead, he releases the button, and then we know this cluster isn't reporting back home.
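Going back to the guardrails mentioned a moment ago, here is a hedged sketch of what a validating admission webhook registration can look like in plain Kubernetes terms. This is not the actual webhook the SREs ship; the webhook name, service and namespace are invented for illustration.

```yaml
# Illustrative ValidatingWebhookConfiguration: send DELETE/UPDATE requests
# on namespaces to a webhook service for review before they are admitted,
# so protected namespaces cannot simply be removed by cluster users.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: namespace-guardrails
webhooks:
  - name: namespace.guardrails.example.com    # invented name
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Fail                        # reject if the webhook is unreachable
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["DELETE", "UPDATE"]
        resources: ["namespaces"]
        scope: Cluster
    clientConfig:
      service:
        name: guardrails-webhook               # invented service
        namespace: openshift-guardrails        # invented namespace
        path: /validate-namespaces
        port: 443
```

And to make the monitoring wiring a bit more concrete, a minimal Alertmanager configuration in that spirit might look like this. The routing key and snitch URL are placeholders, and in the managed service this configuration is put together by operators, as described next, rather than written by hand.

```yaml
# Illustrative Alertmanager config (values invented): page on critical
# alerts, and keep sending the always-firing Watchdog alert to a
# Dead Man's Snitch URL so that silence means "cluster stopped reporting".
route:
  receiver: pagerduty
  routes:
    - matchers:
        - alertname="Watchdog"
      receiver: deadmanssnitch
      repeat_interval: 5m
    - matchers:
        - severity="critical"
      receiver: pagerduty
receivers:
  - name: pagerduty
    pagerduty_configs:
      - routing_key: <pagerduty-integration-key>    # placeholder
  - name: deadmanssnitch
    webhook_configs:
      - url: https://nosnch.in/example-snitch-token # placeholder snitch URL
```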
So that's essentially the setup that we have to monitor these clusters. And obviously we don't want to have all the alerts coming in, because you know it: if there are too many alerts, nobody cares about the alerts anymore and then they don't matter so much. So we have some inhibition rules in place.

I don't have too many memes in my slides, I'm more the emoji guy, so these are stock emojis from my operating system vendor. He's the meme guy. So, how do we configure these things? Because, as you heard, we are not using GitOps, because otherwise you would have to have a pull request or some entry in your Git repository for every customer cluster that's being set up. So how do you manage a lot of Alertmanagers in clusters that get set up at random? For that we have the configure-alertmanager-operator, which is installed in the cluster, and essentially it watches for some secrets and config maps to appear or not to appear. It does some health checks, whether the cluster is fully set up already, and these health checks are then reported back to Prometheus. So we hook into that normal OpenShift monitoring pipeline, and the secrets themselves are deployed via SyncSets. And once these secrets are in place, we have some other operators, the configure-goalert-operator, the PagerDuty operator and the Dead Man's Snitch operator; these operators at the bottom are actually responsible for deploying the secrets. They again use the SyncSets from Hive to ship the configuration out to these clusters, if it is needed or when it changes. And then the configure-alertmanager-operator takes care of configuring Alertmanager. So now we have a programmable pipeline for configuring Alertmanager to your needs, to the environment that the cluster is running in, without any GitOps involved, because we don't want to do things manually. These are all open source, so if you happen to configure a lot of Alertmanagers, think twice, maybe you want to use these operators. They are not really OpenShift-specific or OpenShift Dedicated-specific, but they solve a very small problem. Voila.

So, what is actually being monitored? I don't know how many of you knew that the alerts that ship with OpenShift also come with runbooks, and the runbooks are actually open source. They are there on GitHub, openshift/runbooks. And they tell you what to do when an alert fires. And it's actually very good practice to put the link to the runbook into the alert. So an alert (that's an example here, that's how you would configure the alert in your alerting rules) also has this runbook_url there at the bottom, which can be clicked directly when you see that alert, and then you go to the runbook. So it makes it easy for your SREs, even if woken up at 3 a.m., to just go through the runbook and do their stuff. It will tell you the meaning of that alert, what the impact is, how it can be diagnosed, and then, the most important part, how to mitigate it. So you can copy and paste these commands into your console, do stuff and then hopefully mitigate it. And if you're running OpenShift yourself, use these runbooks.
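As a hedged sketch of that pattern, an alerting rule carrying a runbook link could look roughly like this. The alert name, threshold and runbook path are invented for illustration; the openshift/runbooks repository defines the real ones.

```yaml
# Illustrative PrometheusRule: the runbook_url annotation travels with the
# alert, so whoever gets paged can jump straight to the mitigation steps.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-sre-alerts
  namespace: openshift-monitoring
spec:
  groups:
    - name: example-sre.rules
      rules:
        - alert: ExampleClusterOperatorDegraded   # invented alert name
          expr: cluster_operator_conditions{condition="Degraded"} == 1
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "A cluster operator has been degraded for 30 minutes."
            runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/example.md
```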
If you're setting up alerts for your own workloads, also take inspiration from these runbooks, because it's a very neat and organized way of making your monitoring more reliable, essentially. I said there are also some SRE alerts configured which are not very OpenShift-specific, and these are also open source. They are in this managed-cluster-config repository, where you have the SRE Prometheus alerts, and at the bottom here you see the configure-alertmanager-operator Prometheus rule. So you will see a lot of friends from the previous slides also showing up here, where you can just see how we are using this infrastructure and this setup to monitor all these clusters at scale. And as you saw in the keynote, if you're wondering why your cluster is down and nobody got alerted, and it's a pressing problem for you, instead of waiting for support you can also do your own research and see how this alert would be triggered or not.

Then, what about logs? These clusters are using Splunk for storing infrastructure logs. For that, there's a Splunk Forwarder operator, which uses a small binary running as a DaemonSet on all the infra nodes, collecting the logs from the node and from the pods and just forwarding them to Splunk Enterprise. So no magic in here, but if you are using Splunk in your organization to collect logs, or you just want to collect a subset, the Splunk Forwarder operator is the way to go, no need to reinvent the wheel. And you see in that custom resource here that it's pretty straightforward to put in a path name, and then it will forward that stuff to your Splunk Enterprise.

So, how do we fix things? That's Hubertus's meme. He's such a nice guy that he also put memes into my slides, that's very, very nice of him. So obviously you know this, right? You're sitting in the fire, you get alerts in the morning, and then your colleague says, oh, alerts on a Monday morning, this is fine, we always get these on Monday mornings, just ignore them, they will go away, just have your coffee. And yes, you could log into these clusters and fix them manually with these runbooks. That's maybe your first intuition on how to fix things, and that's how we fix things in our lab environments. In managed services, we made it a little bit harder to actually access these clusters, because they are customer clusters. So as an SRE, you have to jump through some hoops in order to get there: you come in through the public internet, connect to a bastion host, then set up a private link to this cluster environment. Then you can actually execute commands, but they are all logged to AWS CloudTrail or some other infrastructure, and you have to get management approval. So you actually want to avoid logging into your cluster and doing things manually, but sometimes you have to.

Now, we want to do it in a more sustainable way. This is from the OpenShift documentation: Red Hat SREs are managing the infrastructure as code. We see words like GitOps workflows, CI/CD pipelines, and then it talks about the review process: a change only gets merged once some other SRE approves it. So this is essentially the TL;DR of best practices for managing your code environment. And everybody's hopefully following them. So never self-merge your PRs. Nobody's doing that. And with these Google search words, I did some reverse engineering: how are we actually managing these CI/CD pipelines? And it turns out that we're using TestGrid.
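One more hedged sketch before moving on to TestGrid: the Splunk forwarding custom resource mentioned above could look roughly like the following. The apiVersion, namespace and field names here are assumptions for illustration only, not a verified schema, so treat it as hypothetical and check the openshift/splunk-forwarder-operator repository for the real CRD.

```yaml
# Hypothetical SplunkForwarder custom resource (illustrative only): the idea
# is simply "give the operator a list of paths to tail and an index to send
# them to", and the operator runs the forwarder DaemonSet on the nodes.
apiVersion: splunkforwarder.managed.openshift.io/v1alpha1   # assumed group/version
kind: SplunkForwarder
metadata:
  name: example-splunkforwarder
  namespace: openshift-splunk-forwarder-operator            # assumed namespace
spec:
  splunkLicenseAccepted: true          # assumed field
  splunkInputs:                        # assumed field: what to collect
    - path: /var/log/audit/audit.log   # example path on the infra node
      index: openshift_infra           # example Splunk index
      sourceType: _json
```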
So TestGrid is a test infrastructure reporting platform from Google Cloud. It's still being worked on to be fully open sourced and taken out of this Google Cloud repo, but that has been going on for 10 years now. And if you click that testgrid.k8s.io link, you're presented with a really old-school UI, but it works, and there's this Red Hat button among some other open source projects and some other vendors, and there's a myriad of Red Hat test suites running here. You click there, then you see the test suites running. So it's really open how we do this CI/CD process for managing all these clusters. And then you click on a test, on a failed test or on a successful test, and you get into Prow, which is another tool from the Kubernetes world, which stores logs and runs jobs. So it's a very good practice to just use the same tooling as the upstream community is using for your product, and that's what we're essentially doing here.

And then there's a clue in here: it's called openshift/osde2e. That looks like OpenShift Dedicated end-to-end tests, so maybe there's some more information in here. It turns out that we have another repository in the GitHub OpenShift org, which is the test framework. It's a portable end-to-end test framework which supports deploying clusters into multiple environments and performs cluster health checks and upgrades. That's the workflow here, it captures the logs, nothing special in here. So you might be thinking, how can I use that for my own purposes, because I'm using Jenkins and stuff? Well, there's some documentation in there on how you would write tests for Kubernetes clusters, because it's a little bit different from writing tests for, let's say, your Python application or Golang binary, because you're setting up a cluster and you want to make some checks.

So there's an example for checking image streams. This test framework is called Ginkgo, named after a plant. So I think it's all coming from some vegetables, Cucumber, Ginkgo, who knows, but it reads pretty much like English. The suite is called "informing: image streams". Then we need to get the image streams from the cluster; it's just a one-liner: you list all the namespaces, you get all the image streams into a list, and then you count all the image streams in there. We see this magic number of 50. Wherever this number of 50 came from, supposedly there should be 50 image streams in your cluster for that test to pass. And instead of maybe writing your own test suites or even reinventing some tests, you get the enlightenment and you get stars in your eyes: there are a lot of test suites already in this OpenShift end-to-end test suite that you can take inspiration from, or you might wonder, why is this error always happening in my cluster? Do they have a test? Let's go and explore. It's a fun exercise for a Sunday morning in the sun. And maybe you want to do that right after this talk, after you got inspired by this myriad of open source repositories which make the managed OpenShift services run. It's not fully open source, there's still some secret sauce to it, but we're getting there and opening it up for everybody to inspect.

So, we still have five minutes for questions. Go ahead, yeah. So, you counted six additional operators on top of OpenShift. Actually there are more, I would say it's probably in the ballpark of 30 additional operators, and the question is how much additional overhead that is in terms of cost.
I don't think I ever ran Kepler on it to actually count the CPU cycles wasted. It adds some cost, I would guess, but they are really lightweight. Essentially they're just checking a configuration and then setting up another configuration. So there is some overhead, but then you don't want to care about these operators as a customer, because we take care of the infrastructure nodes and the control plane. It's the SRE way of doing things and monitoring this stuff. And I think it's more lightweight than having something external querying the API. So to answer your question precisely: I don't know, but it's probably less than we might think. Yeah, another question?

So the question is whether these admission controllers and webhooks are reusable or just OpenShift-specific. They are both. Some of them are for the Kubernetes cluster itself, so pure upstream, there are some that are specific to OpenShift, and others that are specific to the SREs, because the SREs need to rely on these webhooks to protect the cluster and to prevent anyone from removing them. For that reason they are different, but the most important part is that you can take these different admission controllers as inspiration, to get ideas and try to reproduce them instead of reinventing the wheel again and again. You can bring these ideas into your own clusters and reuse part of the code, because they are all open source.

So the question is whether the runbooks are updated on every OpenShift release. Last time I checked there was some activity on this repository, so I would guess they are updated, and I would hope they are updated, but as you know, engineers are engineers, so probably they also sometimes forget to update stuff. But this is a live and kicking repository, so yes, they are updated, but can I guarantee it? No. And I think that's something that people often forget: there are alerts shipping with your cluster, and you can actually know what to do with these alerts. They are not just random gibberish of concatenated nerd talk that you ignore; you can actually look into a proper runbook and have it explained to you. Okay, thank you.