Okay, guys, can you hear me? It seems the audio is working, so very good. Let's start. My name is Jindrich Novy and I'm from Site Reliability Engineering at Red Hat. These days you have had a lot of presentations about how to implement operators and how to try the new OpenShift or Kubernetes. What we do at the moment is keep our current installations working at scale, and that is what I'm going to talk about today. I'm not going to go too much into technical details; I will focus on the way we maintain stuff and, if you happen to have multiple clusters, which kinds of tools you can use and how you can actually maintain them at scale. In the first part I will describe our current offering, what we actually do and what kind of products we offer. In the second part of the presentation I will describe how the OpenShift SREs actually work within Red Hat and what they do. And in the last part I will give a live demonstration of a tool which we currently run as a proof of concept for audited work on production clusters.

So let me start with a brief description of what we offer. We have multiple tiers, or products, I would say. There is OpenShift Online, where you can try it yourself, let's say to run one project. It's not much, because it's free and not supported, but at least you can give it a try, so there is a way to actually run it. If you get more serious, you can go to OpenShift Online Pro, which is a small fee of $50 per month; you can run approximately ten projects in this offering and you also receive some basic support, which is very limited but still sufficient for maintenance of the cluster, I would say. I actually see many of my colleagues here, so feel free to correct me if I'm wrong.

Then there is the product Red Hat makes the most money from: the OpenShift Dedicated cluster. It runs on the public cloud; most of the exposure is currently in AWS. It provides the hosting of your private cloud as well as maintenance of the whole cluster. That includes upgrading the cluster, patching security vulnerabilities and handling breakages, but that's basically it. The pods and the applications you run in the cluster are completely on you, right? So it more or less works in the way that you just opt in to some cluster, we maintain it for you, and you just run your stuff and don't care about the maintenance.

The last offering is OpenShift Container Platform, which is convenient for setups like banks or something like this, because they can run their own cluster on premises, so they don't really need to go to AWS, or they can even have their own cloud environment. So this is our current offering.

As I said, we take care of the cluster: the master, infrastructure and compute nodes are managed by Red Hat. There is also the support, so you can file a ticket and we will be very happy to sort it out for you. Mostly. Sometimes. I also want to show you what's in the package, because it's easy to say, hey guys, here you go with the support, but what kind of support is that? This first item is kind of questionable, because it's kind of a feature of Kubernetes itself that it is highly available. So for pretty much any workload we can offer you quite high uptime, because if you have, let's say, stateless pods, Kubernetes handles the high availability itself most of the time.
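To illustrate what I mean by Kubernetes handling the high availability of stateless pods by itself, here is a minimal sketch; the deployment name and label are hypothetical, just for illustration:

    # Assume a stateless app already deployed as "myapp" (hypothetical name).
    # Ask for three replicas; the scheduler spreads them across the nodes.
    oc scale deployment/myapp --replicas=3

    # Kill one pod to simulate a failure...
    oc delete $(oc get pods -l app=myapp -o name | head -n1)

    # ...and watch Kubernetes start a replacement on its own.
    oc get pods -l app=myapp -w

As long as the workload keeps no local state, losing a pod or even a node just means a new pod gets scheduled somewhere else.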
If you have multiple clouds, we are able to set up peering connections in the virtual private cloud, so you can have a very heterogeneous environment and only one dedicated cluster in which you give it a try. Then the premium support grants a 99.5% uptime SLA, which works out to roughly 44 hours of downtime per year, which I think might be dedicated to, let's say, major version upgrades and stuff like that. You can use various identity providers, LDAP, GitHub, you name it. You also get the internal cluster container registry, so you can build your own images and deploy your workloads on top of them. And this one is very important: logging, metrics and monitoring, because then you also have some overview of what you're actually running and how it's doing.

Now I need to tell you a story. I was working in a hedge fund and I was given a task: to build the operating systems for the platform. One day I went to the office and started to build a completely new operating system, which was RHEL 7 at the time. That particular operating system build was very quirky, it failed most of the time, and it needs a while to build. So what I did: I triggered the build, set up the boot server and went for lunch. When I returned from lunch it was really interesting, because I saw my host was not running RHEL 7 at all; it was running RHEL 6. So I asked my colleague, hey, what's wrong? How come there is no RHEL 7 on my host? And he was like, don't worry, magic fairies fixed it for you, right? And I realized that we had some engineers who get triggered, who get a page alert, in case something goes wrong. What went wrong in this particular case was that the PXE boot with the kickstart I had configured didn't go well; the host was basically looping in the PXE boot and it was kind of crashing the boot server. So the operations side of the team received the alert. It was Bob, so he was a little bit like, oh, not this guy again, and then he basically switched the PXE boot back to a functional state. And I'm bringing up the fairies because nobody really thinks about who is actually handling this stuff most of the time, on the cluster as well.

So I want to briefly describe how OpenShift SRE works. We follow the follow-the-sun approach: we have three regions at the moment, NA, EMEA and APAC. The reason is very simple: we cover 24 hours and one work shift is eight hours. What the SRE is doing is break-fix: if it's a critical alert, we need to fix it. We do maintenance and handle customer inquiries and issues. We have different roles, which rotate every week. These roles are: the shift lead, who is, you know, the front-line guy receiving most of the alerts on his cell phone and being very busy resolving critical issues. The shift secondary is the other SRE at the time, who is handling, let's say, the incident queue and also interacting with the second-level support to resolve issues. The reason why we have two of these people is that if the shift lead, I don't know, is in the restroom or something and cannot take the alerts, the secondary can take over, right? Then we also have an on-call person: if anything fails with these two, there is the on-call.
And there is a region lead, who is supposed to either coordinate the effort if it's non-trivial, or do the root cause analysis of issues, so that an issue doesn't happen again with the same root cause. We don't currently cover weekends. And we are using a tool called PagerDuty, which is a very nice tool because you can set up escalation strategies, as you can see here. For instance, this particular strategy works in the way that the primary gets paged; if he doesn't react within 15 minutes, it gets escalated to the secondary; if he's not reacting, it goes back to the primary; and if there's still no reaction, after 45 minutes it pages everybody, right? So we have a certain assurance that it actually gets looked at. Within these eight hours, within the shift, we have multiple engineers. Given that you get interrupted quite often if there are a lot of alerts, there is only a certain time within this eight-hour shift when somebody is primary, let's say four hours or something like this, because you cannot stay focused for more than four hours; you need to go for lunch or something. So we have different locations covering the primary period, and we go from APAC to EMEA to North America, so we cover the whole day.

So what does an alert actually look like? If you get paged because something critical is going on in the cluster, the notification looks like this. This one is the EBS alert: some AWS volume got stuck, so we need to detach it manually because it doesn't work anymore. The good thing is we have automation saying, hey, there is this alert over here. But you don't always know off the top of your head how to fix that, right? Therefore, for every single monitoring alert, we have a link to the standard operating procedure over here. So even if you are not feeling well and can't remember, you have a fully documented way to handle such an issue. Standard operating procedures look like this, and they have quite a nice format, because at the moment we just use AsciiDoc in GitHub, so we can share them across the team.

What I'm trying to say now is what we do if we don't get any escalations, because you see there are only one or two engineers who are actually primary or secondary, and the other guys are doing something completely different. They are watching the queue, so if there are questions from the second-level support (CEE), we handle them. We have agile development meetings to handle longer-term issues, which are more or less related to reduction of toil. And we develop our own tools to reduce the number of pages and alerts, so that in the meantime we can look at memes on the internet while everything is automated.

Okay. In order to see that something is wrong with a particular cluster, we need to do some monitoring. At the moment we are using a very simple setup: we have a monitoring container running on the master, and it basically looks like this: there is a crontab, a very long crontab like this, right? Why we don't use systemd timers is very simple: in one single crontab you see the probes and also the schedule of each probe, so one single file gives you quite an easy idea of how it's all going to be executed.
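Just to give you an idea, a minimal sketch of what such a probe crontab might look like; the probe names, paths and periods here are made up for illustration, not our real ones:

    # One probe per line: the schedule tells you how often it runs,
    # and each script reports its result to the monitoring system.
    */5  * * * *  /probes/check_etcd_health.sh       # every 5 minutes
    */10 * * * *  /probes/check_node_disk_space.sh   # every 10 minutes
    0    * * * *  /probes/check_docker_registry.sh   # once an hour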
When one of these probes is triggered, for instance because there is not enough disk space or something like that, it goes to our monitoring system, which is our state-of-the-art Zabbix monitoring system. We are trying to move to Prometheus; it has been the future for us, but unfortunately it seems to stay the future every time, because we have some templating in Zabbix which handles the mapping of a particular alert to the standard operating procedure, so it's not a trivial effort to move somewhere else.

Okay, so now I'm going to talk about a tool which allows us to simplify this workflow. The tool is used for running scripts or Ansible playbooks directly out of the document, out of the standard operating procedure. If you have a script like this, for instance, you otherwise need to do a lot of cutting and pasting, right? That is very prone to human error, and it's not really good to go line by line, cut and paste and so on, because there might be typesetting issues. So we implemented a very simple tool that takes the script, runs it for you, audits the execution, and logs, outside of the cluster, how it was executed there, for later reference. I was thinking about how to name the tool. I made a list of the long utility names I have on my system, which something from ABRT actually won; it's, I don't know, like 50 letters. And I was thinking it might make sense to name it something with only a couple of letters, because you will execute it quite often, right? And because we are OpenShift SRE, I called it O3. So it's a really minimalistic name.

I'm now going to do a live demonstration of the tool, of how some stuff might be executed on a particular cluster. What this tool does is look through the AsciiDoc documentation. You give it a topic, it does a full-text search, it points you to a particular part of the operating procedure, and it also gives you a tag containing the git hash of the procedure, so even if the procedure changes in the future, you will still know which version of the file you were referring to at that particular time.

So, live demo, let's try it. Imagine a situation where you are patching a cluster. At the moment I have a KVM cluster which is OpenShift 3.11, with one master and one node here. And I'm going to, I hope, not break it. Imagine my task is now to perform, let's say, the upgrade of the cluster, right? So I'm going to do this: I search for "upgrade". It gives me some full-text search results on top and then exact matches in the chapters of the standard operating procedure. As you can see, we have a lot of stuff in there, so it's very convenient to have an automated tool. If I want to perform the upgrade from 3.9 to 3.11, it lets me browse through the sub-chapters of the standard operating procedure, and the asterisk means the chapter contains some script within it, right? So I pick this one, and once I have gone through the overview and prerequisites, it lets me do the upgrade steps, which look like this. The first of these is, let's say, cleaning up /var. If I go to that chapter, you see there is a script over there, and it generates that script out of that sub-section of the standard operating procedure, in here. If you have a quick look into the script, the documentation is turned into comments, so you still have the context of what you are doing. And given that we are not going to run this on AWS, but directly on the cluster I have in KVM, I will do something like this. Let's get rid of these spaces.
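To give you an idea of what such a generated script looks like, here is a rough, made-up sketch; the actual SOP content is different, this just shows how the surrounding documentation ends up as comments above the commands:

    #!/bin/bash
    # SOP: Upgrade 3.9 -> 3.11 / Cleaning up /var
    # (the prose of the SOP section is kept as comments, so you still
    #  have the context of what you are doing and why)

    # Show what is eating the space before removing anything.
    du -sh /var/log/* | sort -h | tail

    # Drop old rotated journal entries to free up room.
    journalctl --vacuum-time=1week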
So it's a kind of very trivial cleanup, if you just want to be sure that etcd won't fail. The O3 tool was the tool for just getting the script directly out of your standard operating procedure; now we also have the O3 run tool, where you specify on which cluster you want to run and which script you want to run there. I think I might want to rename this one to something friendlier, so let's say clean_var. And let's run it: o3r master clean_var. It basically executes the script on the master and gives me its output. Okay, there is "no such file or directory"; that's perfectly fine, because now I want to show what the run wrapper actually did for that clean /var run: it created a git repository, or rather added some stuff into the O3 audit repository. If you have a look here, what is in there? We have a bunch of nodes, let's say the master and the node, and within the master we have the history of execution, which is simple: here is the script and here is the log. If you do git log, you see the history, and if I do git show, I can see that I executed this particular script with this output, and it also gives me the time when the script was executed. So for instance, if an error gets triggered after the script, I have evidence of what I've done, and the engineer after me, after the handover to North America, has clear evidence of what I've done, so he knows how to roll it back later on.

We have a lot of scripts, and we also have a lot of Ansible playbooks, not only the bash scripts, which are more or less for ad hoc execution of stuff. For instance, we had a recent issue where sosreport was an old version on the clusters, so second-level support could not create the sosreports, because the old version was very demanding on the volume size; in the newer version that got fixed, so we just need to upgrade it. We have a very simple operating procedure here which says exactly this, and if you don't want to read all the stuff, you can just run the playbook. If you have a simple playbook like this, it again generates the file, like this. So I go to /tmp and move that generated playbook there, for instance. Then I can run another tool, which is the O3 run-Ansible variant, and I will run it on master and node, so on the whole cluster in my setup, and I will point it at the script, or rather the Ansible playbook. And maybe let's do this verbose. So now Ansible runs; it does the upgrade of the sosreport utility on the cluster, just in case it needs to be upgraded. Unfortunately we have yum, which is very slow, and we don't have DNF in RHEL 7 yet, so it will take a while. But the good thing is that both the playbook that was run at the time and the output, the log from it, are audited; they are stored in the git log. I will just put "upgrade sosreport" there as the message. And now, if we have a look again into the O3 audit directory and show what's stored there in my last change, we see that there is this playbook and there is the output from Ansible, so it got upgraded on both nodes.
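Just to make this concrete, here is a rough sketch of how such an audit repository might be laid out and inspected; the command names, node directories and commit messages are reconstructed from the demo and hypothetical, the real tool may differ in the details:

    # Run a script or a playbook through the wrapper (syntax approximate):
    o3r master clean_var                      # ad hoc bash script on the master
    o3r-ansible master,node upgrade_sos.yml   # playbook across the whole cluster

    # The audit repository keeps one directory per node, each holding
    # the executed scripts/playbooks and the captured output:
    o3-audit/
        master/
            clean_var.sh
            clean_var.log
            upgrade_sos.yml
            upgrade_sos.log
        node/
            upgrade_sos.yml
            upgrade_sos.log

    # The git history then tells you who ran what, where and when:
    cd o3-audit
    git log --oneline    # e.g. "upgrade sosreport", "clean var on master"
    git show HEAD        # the script/playbook plus its output and timestamp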
And why is there actually a distinction between the nodes of the cluster in the audit repository? Because if I'm getting paged that a node is broken for some reason, or the master is broken, I can go directly to the directory of the master and check what has been done there, at what time, and what might be triggering the problem, so I can roll it back later. So that was just a quick demonstration of what we have done.

There are a couple of other things we might want to improve in the tool, and that is understanding the cluster ID and the AWS account, for instance. We are also moving to Azure and Google Cloud Platform; at the moment we have a very big exposure to AWS, but I believe that will not last very long. Also giving extra information and allowing it to run from the jump host, for instance. And one thing which might be interesting: when you write down your operating procedures, sometimes you make an error, right? A human error, a typo in the script or something. Then whenever you run it you have a problem, because it's broken and you need to fix it again because of a very stupid error. So it would be quite nice to have a Jenkins job which periodically tests those SOPs on a test cluster, so they don't break. That would assure that you get consistent standard operating procedures through automatic testing. That was basically all I wanted to say, so thank you very much for your attention. And just in case you have any question, feel free to ask.

[Audience question]

Yes, this is a really good question, because whenever we have a thing that is undocumented, we need to document it; basically, if there is an emergency, there shouldn't be anything that is left undocumented. Okay, I will start from the monitoring, right? We have only a set of monitoring probes that we monitor with, but if we realize there is an additional thing we need to monitor, we need to add it to our monitoring system and write the standard operating procedure, right? Then whenever the probe gets triggered, you get the page on your cell phone that refers to that standard operating procedure. So at the time we publish a monitoring probe, we need to have the standard operating procedure as well. Did I answer your question? If not, feel free to ask again for clarification.

[Audience question]

Okay, okay. That is, by the way, a very nice question, because we have only a limited set of monitoring probes, so we need to decide what we actually monitor. After an upgrade it might happen that there is something we don't monitor which is super relevant, right? In such a case there will likely not be a page on your phone or something, but there will be a request from second-level support saying, hey, we need to monitor this stuff because the customer is unhappy, he cannot run his pods, right? And after that we need to develop the monitoring probe and the standard operating procedure.

[Audience question]

Also a very good question. I was thinking about one thing: Red Hat develops stuff in order for it to be open source, right? So at some point it would be very nice if we open sourced the standard operating procedures, so you can basically use them yourself for maintaining your own clusters, right? Because later on you can actually contribute fixes back to us, yeah? So, I'm sorry, we are already out of time. Thank you.