All right, good morning everybody. Thank you so much for being here for the first talk of our last day. I'm Evelyn Gomez, from Red Hat, a support engineer dedicated to OpenShift, and I've been doing this for the last five or six years. As a support engineer on OpenShift, I basically eat issues for breakfast every day, and over time you start to see trends, common issues in particular situations. So why Sherlock in the title of this talk? Because that's the place I need to be when troubleshooting issues, and it's also what I need to help our customers develop in themselves when they are the SME for their OpenShift cluster, or really any Kubernetes cluster. This talk is quite agnostic: whatever I cover here, similar situations happen all over the Kubernetes world and our community. I see it a lot in Telegram chats and on Reddit, wherever there's a Kubernetes cluster running.

The specific situation we're going to cover is troubleshooting issues, including performance issues, and this kind of issue tends to happen at a particular moment. A Kubernetes cluster in its first moment of life is quite simple: you can see everything, the pods are simple enough, the microservices are not yet, let's say, messy; it's easy to visualize everything, right? The problem starts when it becomes this: a lot of workloads, network policies, automation, which is great and is actually the natural evolution of a Kubernetes cluster. You want to get to that point. But problems usually happen in the middle of this transition. So what I hope to bring in this talk is how to troubleshoot, or even prevent, issues as you go from a little cluster to a very busy production cluster, where you're doing nice things with GitOps and Tekton and pipelines, whatever you want to work with in the long term.

There are a few trendy issues in this transition. The first is how to troubleshoot issues or outages on the fly, because when a big cluster in this transition has an issue, it can be a bit more difficult, a bit more complex, to pinpoint or trace. The second is what to keep an eye on: there are a lot of great monitoring tools, but there are a few specific things I'd like to highlight that, if you administer a Kubernetes cluster, you want to understand how they behave over time. And we'll also talk a little about tools that facilitate problem solving in the long term, because with a Kubernetes or OpenShift cluster we want a long-lived cluster, up and running, with maintenance from time to time of course, and for that we need a few specific tools. So, our agenda: a bit of background, then the detective mindset, which is specific to my field, because when issues arise you need to handle them on the fly.
And this is not only for me: anybody working as an SME has to have that detective mindset to help resolve the issue as soon as possible. Then we'll talk about our crime scene, which is of course a use case, a real story. We'll talk about a few tools that can be helpful and about relevant information, first from the application side and then from a cluster perspective. Throughout the talk there will be a lot of links for later reference; I plan to make the slides available on the conference website for everybody. I won't cover every little aspect, but I made sure there are links you can refer to later.

So, the detective mindset: what is that? It's nothing but human nature, when we have an application or cluster outage, to react, go troubleshooting, and find the culprit. But I'm here to say: don't panic. When you're in the middle of an issue, you need to take a step back and take notes, and there are three specific things I'd like to highlight as important to note down.

First: timestamp and time zone. This matters especially because OpenShift always runs in UTC, right? But when we're dealing with different teams, or collaborating across teams, those teams are in different regions, and it's very easy to get confused. So whenever you get a report of an issue, make sure you get the time zone along with the timestamp. This really helps in the long term when you need to produce a post-mortem or an RCA. It's very easily overlooked, but it helps so much when you need to cross-reference a lot of data. There was this one time, troubleshooting an issue with autoscaling, when I was able to find a bug in the code just by cross-referencing timestamps in the logs. The problem was that the timestamp first given to me was in a different time zone, not UTC, so I spent a lot of time looking at the wrong set of data and metrics. Keeping this in the back of your mind when you collaborate with your team is especially important and avoids a lot of confusion.

Second: node name and pod name. This is also great information to have at hand, if you can get it on the fly while resolving an issue. Again, I'm focusing on the post-mortem approach: if I have that information at hand, I can quickly check, okay, this happened here, and correlate where the issue possibly started.

The third and last part of the detective mindset, which I consider very important, and not only me, you can see the same approach in the Google SRE book, is: stop the bleeding first. It's easy to get into the mindset of, we have the issue, we need to find the RCA, we need to resolve it, but what is the culprit? The thing is, if we don't resolve the issue first, even if that just means restarting the service, our customers are really losing money. That doesn't make the RCA or post-mortem less important, of course not, it's ideal to have, but this approach matters in the long term.
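To make the note-taking concrete: here is a minimal sketch of commands for capturing the timestamp, pod name, and node name during an incident. The namespace is a hypothetical placeholder.

```bash
# Anchor all notes in one time zone: the current time in UTC
date -u

# Pod names plus the node each pod runs on (namespace is hypothetical)
kubectl get pods -n my-app -o wide

# Recent events sorted by time, useful for the post-mortem timeline later
kubectl get events -n my-app --sort-by=.lastTimestamp
```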
Actually, stopping the bleeding is much easier to achieve if you have the proper tools in place beforehand, and we'll talk more about that with our examples. I talk a lot with our customers, sometimes even with managers, and it can be a tricky conversation, because we have to answer to a lot of stakeholders: our customers, the customers of our customers, everybody. But as long as we have the observability tools in place, it's so much easier to handle.

That said, with these three things in mind, let's look at two real scenarios. In the first one, at the application layer, we did stop the bleeding first, but we never got the RCA. So let's look at our evidence. When we were tackling this issue, we saw three specific things. One, the application was not serving requests, and it was unpredictable; we couldn't really tell how or why. Two, from time to time a pod would crash-loop with OOMKilled, so it was being killed for memory. And three, the pod had three replicas, so the problem was widespread. Those were the only pieces of evidence we had.

So how did we solve it? By increasing the pods from three to six and by implementing autoscaling. But let me go back to this slide, because this was a big outage: the application was not serving the requests our customer expected, and we didn't know how or why. We resolved it quickly by scaling up, because we saw, well, there's some memory issue, it's being OOMKilled, but that doesn't actually answer why it was happening. The bad part is that, because there were no metrics in the application, we could not see what was really causing it. It's a good example of being able to resolve the issue while lacking the proper tools to find the RCA.

So let me talk about long-term solutions for when there's a big focus on RCA, especially if your business has that kind of criticality policy. One of them is an aggregated logging system. In OpenShift this comes pretty much out of the box, but there are also fully open-source stacks you can attach to your Kubernetes cluster. The nice part is that you can extend them: you can collect all the pod logs, the infra logs, the audit logs if you need them. And you want that, because if a pod fails, just like before, you can trace back, see what happened to this pod in the past, and correlate events, timestamps, and so on.

Another thing that's super important is application metrics, and this is really nice with OpenTelemetry. There will be a talk later today on this that I highly recommend. OpenTelemetry is a big open-source project and it helps so much with application metrics. Once you have data from your application, you can actually prevent or predict certain issues, especially if they are load-related. A third long-term solution is load testing, which is commonly overlooked.
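Going back to the fix in that first scenario, scaling from three to six replicas plus autoscaling: expressed as configuration, it could look roughly like this minimal HorizontalPodAutoscaler sketch. The deployment name and thresholds are hypothetical.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app                 # hypothetical deployment name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 3
  maxReplicas: 6
  metrics:
  - type: Resource
    resource:
      name: memory             # scale on memory, since the pods were OOMKilled
      target:
        type: Utilization
        averageUtilization: 80 # hypothetical threshold
```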
And especially now, I feel, more than ever, because we're in such a fast-paced world where we need to deliver and update applications all the time, we need load testing. It leads to better performance in a Kubernetes world, because you can predict load and set requests and limits appropriately, which in turn prevents outages and unavailability issues in the application.

The last one I'd like to highlight here is KubeLinter. It's a super cool open-source tool: a static analysis tool that you install and run against your deployment YAML as a developer. With it, you get cloud-native recommendations: do you have requests and limits set? Do you have any tolerations or taints you need to take care of? How many replicas do you have? If you have one, KubeLinter will say, okay, you should probably have three. And the best part of this particular tool is that it's highly customizable, so you can apply your own rules and have all your developers following the same template, we could say. It's a very powerful tool, and it helps prevent a lot of issues. Say you have a node outage and there were pods running on that node; those pods will be evicted, right? But if you only have one replica, and that's a common scenario, you lose the application, when you should have at least two or three, ideally five. Those are all issues that can be prevented with this set of tools.

This is a demonstration from Kibana. If you've ever dealt with Kibana, which I personally like a lot, it's easy enough to navigate; it shows the host, the full pod name. Of course, every logging system will show the same things, but I particularly like the Kibana interface. And this is a picture from a Quarkus application, because Quarkus exposes metrics natively. What I did here, on OpenShift, was create a ServiceMonitor so that Prometheus could scrape the Quarkus metrics directly into the OpenShift monitoring system. It's a good example of being able to see what was really happening at the application level if requests went crazy, because sometimes all we have is resource usage, memory and CPU, and that doesn't tell us the whole story. Having things like this really helps in the long run.

Now, things to keep an eye on in your deployments. First: pod capabilities. I would call this a hidden issue, because a pod can carry Linux capabilities, SYS_ADMIN probably being the most dangerous one, and these are configurable. One thing I would say you need to make sure of: if your pod has them, know about it. A daemonset or any pod with this sort of capability will likely cause an issue in the cluster someday, so just make sure you know which pods have those capabilities. It's a frequent source of problems, especially because these capabilities operate at the kernel level, and they're known in Kubernetes to cause issues depending on how the pod behaves.
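As a quick way to follow that advice and find out which pods add extra capabilities, here is a small sketch using kubectl and jq. The jq filter is one possible variant (it only inspects regular containers, not init containers), not the only way to do this.

```bash
# List pods in any namespace that explicitly add Linux capabilities
kubectl get pods -A -o json | jq -r '
  .items[]
  | select([.spec.containers[].securityContext.capabilities.add // empty] | length > 0)
  | "\(.metadata.namespace)/\(.metadata.name)"'
```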
The second thing I'd say is important to keep an eye on is the quality of service, QoS. What is QoS? Oh, sorry, I forgot to put a picture on this slide. But quality of service is a field set on the pod, or on a container as a matter of fact, and it tells you whether the pod is Guaranteed, Burstable, or BestEffort. In summary, it tells you what the priority of the pod will be if the node gets into an overcommitted situation. This is also very often overlooked, but it's something you want to know about your pods, especially your most critical applications: what is their quality of service? If the node hits an issue, will my application be the one guaranteed to stay on the node, avoiding an outage, or at least avoiding the loss of one replica for a while? Those are things I recommend keeping an eye on in your cluster.

The second scenario happened at the cluster level, and we had a few pieces of evidence for this issue. The first, and I think the most noticeable, was that the cluster was getting slow: kubectl commands got slow responses, and the console wasn't working quite well. Second, garbage collection wasn't happening. When spinning up a pod with a deployment, the pod deployed fine, but when I deleted the deployment, the pod would not go away. That's how I discovered we were having garbage collection issues, which are handled by the kube-controller-manager; that was the second piece of evidence. Third, the etcd pods were flipping up and down. I could see the availability of etcd changing, and that's never a good hint; you don't want a Kubernetes cluster with etcd unavailability. And fourth, when checking the etcd logs, I was seeing "request took too long to execute" messages. When we're having issues and etcd logs this, it usually means etcd is performing badly; you can go many ways from there, but the main hint is that something is wrong and behaving poorly. It could be disk, a slow disk; that's a very common situation in Kubernetes, because depending on the type of storage backing etcd, it can perform poorly. But going back to our evidence, those were the ones we had.

One more piece of evidence: the kube-apiserver was going up to 40 GB of memory usage. That's a lot. Talking to the customer and looking at the context of the cluster, it was not expected; it was just too much, taking a huge amount of memory on the control plane node. So what was happening? Running a simple grep for errors, we found over 37,000 error messages about one specific operator, and many of those messages, not all of them, were about a certificate issue: "unable to parse bytes as PEM block". And because Kubernetes is so great and we have a great community, a very similar issue had already been reported for a different operator. That's why I love Kubernetes; it's a very vocal community. What happened in that specific situation is that the CRD of the operator installed there contained a dummy value, and when the kube-apiserver tried to process it, it wouldn't accept it, and the resulting spam overloaded the kube-apiserver, driving it to that high memory usage.
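To illustrate the QoS classes mentioned above: a pod whose containers set resource requests equal to limits gets the Guaranteed class, requests below limits gives Burstable, and no requests or limits at all gives BestEffort. A minimal sketch, with hypothetical names and sizes:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: critical-app                          # hypothetical
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest    # hypothetical image
    resources:
      requests:
        cpu: "500m"
        memory: "512Mi"
      limits:
        cpu: "500m"                           # requests == limits
        memory: "512Mi"                       # -> qosClass: Guaranteed
```

You can check the assigned class with kubectl get pod critical-app -o jsonpath='{.status.qosClass}'.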
The way we resolved it was basically to remove that dummy value, and that almost instantaneously made everything work again. The kube-apiserver memory went down. There was a piece of evidence I missed putting on this slide: there were over 400 projects stuck terminating, just hanging there, and once we resolved the kube-apiserver issue, they all went away. The cluster started to behave better, nothing was hanging, and etcd normalized again; it stopped going up and down from one hour to the next. So this is something I like to highlight: the kube-apiserver, the core of Kubernetes, is something you really want to make sure is behaving properly, and not only that, but something whose routine behavior you really understand.

The long-term solution for this kind of issue, performance issues specifically, is cluster metrics. You want to make sure you have them, and that you understand what normality looks like on the cluster. Kube-apiserver performance metrics are great metrics to have, ideally with Grafana in front of Prometheus, which for any Kubernetes administrator is your best friend, right? Etcd metrics are super useful too, because etcd actually exposes its metrics; in OpenShift you have an API performance dashboard and etcd metrics by default.

And I want to add a quick word about Velero. How many of you know Velero? All right. Velero is a tool for backing up Kubernetes resources, especially persistent volumes. It's not really related to performance issues, but I wanted to present it here because it's so powerful and it helps so much in case the worst happens. In a disaster recovery scenario, you want to make sure your applications are properly backed up, and Velero is super easy, super simple, and very powerful. In OpenShift we have OADP, another operator, which runs Velero behind the scenes.

So how does it look? These dashboards for the kube-apiserver and for etcd are good things to monitor, to understand how the metrics trend and whether there are spikes. Cluster performance issues, as clusters grow, are mostly related to those components, the kube-apiserver and etcd; maybe the control plane is too small, maybe the nodes are too small and need resizing, and to predict that, you can refer to exactly these metrics. There's also a very cool command I use in case you don't have etcd metrics: it shows you the number of objects stored in the etcd database. This is quite cool, especially if you're trying to build a baseline; with cluster performance issues you can sometimes spot it easily in this output. You'll see, I don't know, maybe 30,000 events, 2,000 config maps, and any odd count like that is probably worth investigating.
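For reference, that object-count command can look like the following sketch. This is one common variant, run inside an etcd pod (the pod name here is hypothetical and OpenShift-style), and it assumes the etcdctl v3 environment is already configured in that container, as it is in OpenShift's etcd pods.

```bash
# Count stored objects per resource type by walking the etcd key space
kubectl exec -n openshift-etcd etcd-master-0 -c etcd -- \
  etcdctl get / --prefix --keys-only \
  | awk -F/ 'NF > 2 { print $3 }' | sort | uniq -c | sort -rn | head
```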
Thank you, that's what I had to share. Any questions? Okay, so the question was whether one of the possible solutions was to report it to the operator, to submit a patch, a PR. Yeah, possibly. I would say we needed at least a bit more investigation, or at least to try to reproduce it, but yes, as a long-term solution, of course we'd want to report that issue against the operator. Can you say that again? All right, okay. Yeah, so the question was whether I know what mechanism in etcd caused that, right? In that scenario, the issue actually lived in the kube-apiserver, and because the kube-apiserver is the only component that talks to etcd, it was simply too overloaded to communicate with etcd properly. It's an overload on top of an overload, and that's what made etcd start to become unstable. But the main culprit was the kube-apiserver: it talks to every component in the cluster, yet it is the only one that talks to etcd, so it's the bottleneck there. Did I answer your question? All right, I think we can wrap. Thank you so much for coming. See you.

Perfect on time. So good morning, Sunday morning, 10:15, sunny in Brno. What better place to be than in D105? I'm glad everybody survived the party yesterday. Nobody got wet from the outside, maybe just from the inside, and the people who got wetter from the inside are maybe not here, so I'm really happy you showed up. Thank you. We'll talk about how much open source is in cloud services today. My name is Marcel Hils. I'm a Managed OpenShift Black Belt, which is a mouthful of a title, but essentially I'm selling managed OpenShift to customers in a technical way. My colleague Roberto, from Madrid, is also with me; he's been in this business much longer than me.

So we'll talk about what ROSA, ARO and OSD are; these are some fancy acronyms. Then we'll go through the various days of setting up clusters at scale: starting from day zero, where we fully automate the deployments; then day one, which is not really a day, where we do the initial configuration of these clusters; then the ongoing maintenance and operations of these clusters, so monitoring, logs and such; and finally, how we fix things when something goes wrong. Hopefully we'll come to a good conclusion on what's in it for you, because not everybody is supposed to run thousands of clusters, maybe just a couple, but you can draw some inspiration.

So what is ROSA, ARO, OSD? Something you shoot with your crossbow, or just a color, or a three-letter acronym nobody really cares about? Let's take a step back and look at what we've been doing at Red Hat for the last 25 years: productizing open source projects. We pick these fancy projects, and in the middle there you see Kubernetes. We take that, put it into OKD, which is the upstream of OpenShift, and then we productize it, and it's not just one product but a multitude of products. Essentially, OpenShift is a Kubernetes distribution with a number of other projects also involved. Most of you are probably familiar with deploying OpenShift yourself, where you run the installer; we've had many, many installers over the years, sometimes Ansible, these days a Golang binary, and then it provisions some infrastructure, yada yada yada, that's at the bottom there. You can deploy it everywhere you like; that's the whole value proposition of OpenShift. And you can also get it as these cloud services: we have it running on AWS, which is ROSA, Red Hat OpenShift Service on AWS; we have it running on Azure, ARO, Azure Red Hat OpenShift, so now these acronyms make a bit more sense; and we have it running on IBM Cloud, which is called ROKS.
I don't know how you pronounce the K there: ROKS, okay. So we're getting close to making some actual sense of these acronyms. And there's OpenShift Dedicated, which is how it all started; with that you can also deploy into Google Cloud. This business of running OpenShift clusters for customers got so popular that we moved out of OpenShift Dedicated into these other clouds as well.

This is typically what you would have to do as somebody who operates OpenShift: you take care of the bottom layer, where you set up your infrastructure; then, as you move up the stack, you configure your network, you configure the control plane, the master nodes that take care of your cluster, et cetera, yada yada yada. But essentially, you probably just want to run workloads. If you installed Red Hat Enterprise Linux, you didn't want to care about who's packaging these things and who's doing the upgrades; you just wanted to install an Oracle database or whatever. In OpenShift it's no different. So what if we could shrink this picture to just this, where you only care about your workloads, setting up your namespaces and operating your stuff, and let Red Hat and our SREs do all the work for you?

SRE has come a long way, right? Like 13 years ago, we just deployed stuff by throwing it over the wall, and then the ops people took care of it, fingers crossed we didn't do it on a weekend before they headed to a popular swimming pool in Brno, and the people who carried the pager took care of it. Then we tried to merge people from development and from operations, using the tools from both sides, the best of both worlds; this is how DevOps came into place. And I like to see SRE as an implementation of DevOps: if you're doing SRE, you are doing DevOps in a certain way. That doesn't mean that if you are doing DevOps, you are doing SRE. SREs are people who run your operations, and they work closely with your engineering teams, but they are not the engineering team. The SRE people at Red Hat who manage all these clusters at scale, and really, think about hundreds and thousands of clusters managed by a small team, are also developers: they're building all the services to deploy, manage and monitor these OpenShift environments. So there's a lot of engineering involved, and to do this reliably you need to automate a lot of stuff: adding storage capacity, autoscaling, all of that, and making it repeatable so you can scale, because you don't want to hire more and more people the more clusters you deploy.

And obviously you're not done, as in pre-sales, once you install the cluster, show that everything works, and tear it down; you actually want to use that cluster, and as it's software, there are also bugs. So we need to observe this environment, make it reliable, and act upon any incidents: that's observability, reliability, and day-two operations. We'll take you through this journey: on the left, somebody installs a cluster with the click of a mouse button or via the CLI; then what gets kicked off in the background to install that cluster and do the initial configuration; and then how we monitor it from day two to day n. And although we call them days, the whole cycle to get to day two takes just 30 or 40 minutes.
So if we started installing a cluster right now, at the beginning of this session, by the end we would have a cluster up and running, which is pretty awesome, I think. Roberto will take care of the next part.

Hello. Right, we have only one mic. So now we start with the day-zero operations: deploying fully automated, scalable clusters. We have an initial problem: we are a DevOps team, we need to deploy our clusters in a scalable, maintainable, and also easy way, and we want to deploy our OpenShift and Kubernetes clusters across multiple hyperscalers: AWS, Azure, Google Cloud, and IBM. So we have a plan; we need to deploy all of these projects we have in the CNCF landscape, and we have several meetings with the business, and the business guy wants to put everything in: storage, AI, every single piece of software out there. But we need to do it scalably. So we start working, working, working, and we end up like this. To save this DevOps guy from that mess, we need an easy, scalable solution to deploy and maintain our different clusters across different clouds, with a proper life cycle and supportability. And we want to introduce Hive to save the day.

Hive is an operator that runs as a service on top of Kubernetes or OpenShift, and it's used to provision clusters and also to perform the initial configuration and day-two operations. For provisioning OpenShift, it runs the OpenShift installer under the hood, and it supports different cloud providers: AWS, Azure, Google Cloud, and IBM Cloud. We also have this architecture; we won't do a deep dive on it, but at the top we have the Hive namespace, the brain, and for every managed cluster we have one cluster namespace that stores the different secrets and the different pieces and components for that cluster. And to deploy a cluster, we just create a ClusterDeployment, which is a CRD, an extension of the Kubernetes API, where we define the platform we're deploying to and the different components around it.
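For a flavor of what that ClusterDeployment looks like, here is a minimal sketch for AWS; the names, domain, and secret references are hypothetical placeholders.

```yaml
apiVersion: hive.openshift.io/v1
kind: ClusterDeployment
metadata:
  name: my-cluster
  namespace: my-cluster              # the per-cluster namespace Hive works in
spec:
  baseDomain: example.com            # hypothetical base domain
  clusterName: my-cluster
  platform:
    aws:
      region: eu-central-1
      credentialsSecretRef:
        name: aws-creds              # hypothetical secret with cloud credentials
  provisioning:
    imageSetRef:
      name: openshift-v4-imageset    # hypothetical ClusterImageSet
    installConfigSecretRef:
      name: my-cluster-install-config
  pullSecretRef:
    name: pull-secret
```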
Hive is part of a project called Open Cluster Management, a community-driven project focused on multi-cluster and multi-cloud scenarios: deploying clusters and workloads at scale and maintaining the different Kubernetes and OpenShift clusters. There's a downstream product called ACM, Advanced Cluster Management for Kubernetes, for deploying these OpenShift and Kubernetes clusters at scale on different platforms and across different regions as well.

So now that we have our lovely cluster deployed on, I don't know, different hyperscalers, we need to perform the initial configuration, because the cluster is just blank. We need TLS encryption, TLS encryption everywhere, as these two buddies on the slide want. For that reason we have the cert-manager operator to manage certificates at scale: we need to provision certificates, re-issue them when they approach their expiration date, and revoke them once we retire or decommission these clusters. There's also a deep-dive link here that explains how this works in more detail.

And brilliant, we have our certificates in the cluster. Now, how about shipping and maintaining day-two configuration at scale? We deployed our clusters and our certificates, but what about the day-two configuration itself? We could do it manually, right? Or use GitOps, a very hot topic, or some resource management. Well, it's sort of GitOps, but not exactly: we use a piece of software included in Hive, called SyncSet, to facilitate the resource management, shipping the different objects and day-two configuration in a more or less GitOps-like approach. But in contrast to, for example, ArgoCD, which syncs immediately, we use SyncSet so as not to stress the API of the different clusters, to reapply configuration once the cluster is installed, and to ensure it's maintained across the life cycle, including keeping the content up to date when it changes.

Once we have that in place, we need to ship the configuration, and for that we use managed-cluster-config. It uses the Hive selector syntax; it's not really an operator, it's a bunch of YAML across a repository, bundled into a template and shipped as day-two configuration to the different clusters. We use these two mechanisms in OSD and ROSA, and also in ARO. This is a brief example of how you can put it all in one repository; it's a real example of how the SREs ship this, and if you click through, this is the day-two configuration being shipped across the different managed clusters.
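To make the SyncSet mechanism described above concrete, here is a minimal sketch; the resource content and cluster name are hypothetical.

```yaml
apiVersion: hive.openshift.io/v1
kind: SyncSet
metadata:
  name: day2-config
  namespace: my-cluster              # same namespace as the ClusterDeployment
spec:
  clusterDeploymentRefs:
  - name: my-cluster                 # which managed cluster(s) to target
  resourceApplyMode: Sync            # keep in sync, prune removed resources
  resources:
  - apiVersion: v1
    kind: ConfigMap
    metadata:
      name: motd
      namespace: openshift-config    # hypothetical target namespace
    data:
      message: "Managed by SRE"
```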
And imagine: we ship the cluster, it's ready, we hand it over to the users, and these two DevOps guys who are super excited get a call from the business saying, yeah, no, you need to turn your ship around; instead of going that way, you need to go backwards. Not possible? Hold my beer. And after that they change one thing, and another thing, and another thing, and they get stuck. Running Kubernetes in production: it will be easy, they said. To prevent users from breaking their clusters, we need to put up some guardrails, some secure boundaries, and for that we use validating admission webhooks. These webhooks prevent certain customer operations that would break the cluster: they prevent anyone from removing or changing the managed namespaces, the Prometheus rules, and a bunch of other things. So we put these admission webhooks in place as guardrails on the different clusters.

And now we're heading to day two; day two, or minute 20, of cluster operation. We've made sure the cluster is up and running, you can't mess around with it, and if you do, it gets reverted. But we also need to look at what's going on in there. Platform monitoring is essentially no different from the everyday platform monitoring you would do on your own OpenShift cluster: you have a bunch of alerting rules, the out-of-the-box OpenShift alerting rules, plus some SRE-added alerting rules that are more specific to these managed environments.

Then you have a Prometheus instance running on the local cluster, on the customer side or in the cloud, collecting metrics, which are exposed to an Alertmanager, also part of the Prometheus suite, and it sends these alerts to PagerDuty, which is a paid service where you can manage your incidents, and to GoAlert, which is an open-source service, so they're using a mix. I actually don't know which alerts go to which service, but it's always good to have the choice. And there's Dead Man's Snitch. If nobody has heard of Dead Man's Snitch, it's definitely something to look into; I have it running in my home setup to alert me when my home Prometheus goes down, and for one snitch it's free. It's basically the poor guy who has to keep pushing a button, and if he falls over dead, he releases the button and then we know this cluster isn't reporting back home. So that's essentially the setup we have to monitor these clusters. And obviously we don't want all the alerts coming in, because you know how it is: if there are too many alerts, nobody cares about them anymore and the alerts stop mattering. So we have some inhibition rules in place. I don't have too many memes in my slides; I'm more the emoji guy. These are stock emojis from my operating system supplier; he's the meme guy.

So how do we configure these things? Because, as you heard, we are not using GitOps here; otherwise you'd need a pull request or an entry in a Git repository for every customer cluster being set up. So how do you manage a lot of Alertmanagers in arbitrary clusters? For that we have the configure-alertmanager-operator, which is installed in the cluster. Essentially it watches for certain secrets and config maps to appear or not appear, and it does some health checks on whether the cluster is fully set up; those health checks are reported back to Prometheus, so we hook into the normal operator monitoring pipeline. The secrets themselves are deployed via SyncSets. Once these secrets are in place, we have some other operators, the configure-goalert-operator, the pagerduty-operator, and the deadmanssnitch-operator, and these operators at the bottom are the ones responsible for deploying the secrets: they use the SyncSets from Hive to ship the configuration out to the clusters when it's needed or when it changes, and then the configure-alertmanager-operator takes care of configuring Alertmanager. So now we have a programmable pipeline for configuring Alertmanager to your needs, to the environment the cluster is running in, without any GitOps involved and without doing things manually. These are all open source, so if you happen to configure a lot of Alertmanagers, think twice; maybe you want to use these operators. They're not really OpenShift-specific or OpenShift-Dedicated-specific, but they solve a small, well-defined problem. Voilà.

So, what else is nice? I don't know how many of you knew that the alerts shipping with OpenShift also come with runbooks, and the runbooks are actually open source: they're on GitHub, at openshift/runbooks, and they tell you what to do when an alert fires. It's actually very good practice to put the link to the runbook into the alert itself.
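To show what that pattern looks like, here is a minimal alerting-rule sketch with a runbook link attached. The alert is modeled on a common node-readiness rule and is only for illustration; the exact runbook path is a placeholder.

```yaml
groups:
- name: example
  rules:
  - alert: KubeNodeNotReady
    expr: kube_node_status_condition{condition="Ready",status="true"} == 0
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: A node has been NotReady for 15 minutes.
      # Point this at the matching page in the runbooks repository
      runbook_url: https://github.com/openshift/runbooks/
```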
That's an example of exactly that; this is how you would configure the alert, and you see the runbook_url at the bottom, which can be clicked directly when you see the alert firing, and then you land on the runbook. It makes it easy for your SREs, even when woken up at 3 AM, to just go through the runbook and do their job. It tells you the meaning of the alert, what the impact is, how it can be diagnosed, and then, the most important part, how to mitigate it: you can copy and paste the commands into your console, do your stuff, and hopefully mitigate the problem. If you're running OpenShift yourself, use these runbooks. And if you're setting up alerts for your own workloads, take inspiration from them, because it's a very neat and organized way of making your monitoring more reliable, essentially.

I said there are also some SRE alerts configured which are not very OpenShift-specific, and these are also open source. They live in that managed-cluster-config repository, where you have the SRE Prometheus alerts, and at the bottom here you see the configure-alertmanager-operator PrometheusRule. You'll see a lot of friends from the previous slides showing up here, where you can see how we use this infrastructure and this setup to monitor these clusters at scale. And as you saw in the keynote: if you're wondering why your cluster is down and nobody got alerted, and it's a pressing problem for you, instead of waiting for support you can do your own research and see how a given alert would or would not be triggered.

Then, what about logs? These clusters use Splunk for storing infrastructure logs. For that reason there's a splunk-forwarder-operator, which runs a small binary as a daemonset on the infra nodes, collecting the logs from the node and from the pods and simply forwarding them to Splunk Enterprise. So no magic in here, but if you're using Splunk in your organization to collect logs, or you just want to collect a subset, the splunk-forwarder-operator is the way to go; no need to reinvent the wheel. And you can see in the custom resource here that it's pretty straightforward: you put in a path name and it forwards that stuff to your Splunk Enterprise.

So how do we fix things? That's Hubertus's meme; he's such a nice guy that he even put memes into my slides, which is very nice of him. You know this one, right? You're sitting in fire, you get alerts in the morning, and your colleague says, oh, alerts on a Monday morning, this is fine, we always get these on Monday mornings, just ignore them, they'll go away, have your coffee. And yes, you could log into these clusters and fix them manually with these runbooks; that's maybe your first intuition about how to fix things, and that's how we fix things in our lab environments. In managed services, we made it a little bit harder to actually access these clusters, because they are customer clusters. As an SRE, you have to jump through some hoops to get there: you come through the public internet, connect to a bastion host, then set up a private link into the cluster environment. Then you can actually execute commands, but they are all logged, to AWS CloudTrail or other infrastructure, and you need management approval. So you actually want to avoid logging into your cluster and doing things manually. But sometimes you have to.
No, we want to do it in a more sustainable way. This is from the OpenShift documentation: Red Hat SREs manage the infrastructure as code. We see words like GitOps workflows and CI/CD pipelines, and then it talks about the review process, so a change only merges when another SRE approves it. This is essentially the TL;DR of best practices for managing a code-driven environment, and hopefully everybody is following them. Never self-merge your PRs; nobody's doing that, right?

And with these Google search words, I did some reverse engineering: how are we actually managing these CI/CD pipelines? It turns out we're using TestGrid. TestGrid is a test infrastructure reporting platform from Google; it's still being worked on to be fully open-sourced and taken out of the Google Cloud repo, and that's been going on for ten years now. If you click that testgrid.k8s.io link, you're presented with a really old-school UI, but it works. There's a Red Hat tab among some other open-source projects and vendors, and there's a myriad of Red Hat test suites running there; you click through and you see the test suites running. So it's really open how we do this CI/CD process for managing all these clusters. Then you click on a failed test, or on a successful one, and you get into Prow, which is another tool from the Kubernetes world, for running jobs and storing logs. It's very good practice to just use the same tooling for your product that the upstream community is using, and that's essentially what we're doing here.

And then there's a clue in here: it's called openshift/osde2e. That looks like OpenShift Dedicated end-to-end tests, so maybe there's some more information in there. It turns out there's another repository in the GitHub openshift org, which is the test framework: a portable end-to-end test framework that supports deploying clusters into multiple environments, performs cluster health checks and upgrades, that's the workflow here, and captures the logs. Nothing special in there. You might be thinking, how can I use that for my own purposes, because I'm using Jenkins and whatnot? Well, there's documentation in there on how to write tests for Kubernetes clusters, because it's a little different from writing tests for, let's say, a Python application or a Golang binary: you're setting up a whole cluster and you want to make checks against it.

There's an example for checking image streams. The test framework is called Ginkgo, named after a plant; I think it's all coming from vegetables, Cucumber, Ginkgo, who knows, but it pretty much reads like English. The suite is called "informing: image streams". We need to get the image streams from the cluster, and it's just a one-liner: you list across all the namespaces, you get all the image streams into a list, and then you count them. And we see this magic number of 50; wherever they came up with that number, supposedly there should be 50 image streams in your cluster for the test to pass. So instead of writing your own test suites, or even reinventing some tests, you get the enlightenment, stars in your eyes: there are a lot of test suites already in this OpenShift end-to-end suite you can take inspiration from. Or you might wonder, why is this error always happening in my cluster? Do they have a test for it? Let's go and explore.
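A rough sketch of what such a Ginkgo check could look like, assuming the OpenShift image client from github.com/openshift/client-go and a kubeconfig on disk; this illustrates the pattern described above, not the actual osde2e code.

```go
package e2e

import (
	"context"

	"github.com/onsi/ginkgo/v2"
	"github.com/onsi/gomega"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"

	imageclient "github.com/openshift/client-go/image/clientset/versioned"
)

var _ = ginkgo.Describe("informing: image streams", func() {
	ginkgo.It("finds the expected number of image streams", func(ctx context.Context) {
		// Build a client from the default kubeconfig; the real framework
		// wires this up for you.
		cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
		gomega.Expect(err).NotTo(gomega.HaveOccurred())

		client, err := imageclient.NewForConfig(cfg)
		gomega.Expect(err).NotTo(gomega.HaveOccurred())

		// List image streams across all namespaces and count them.
		list, err := client.ImageV1().ImageStreams(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
		gomega.Expect(err).NotTo(gomega.HaveOccurred())
		gomega.Expect(len(list.Items)).To(gomega.BeNumerically(">=", 50)) // the "magic number" from the talk
	})
})
```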
So it's a fun exercise for a Sunday morning in the sun, and maybe you want to do that right after this talk, after getting inspired by this myriad of open-source repositories that make the managed OpenShift services run. It's not fully open source; there's still some secret sauce to it, but we're getting there, opening it up for everybody to inspect. So we still have five minutes for questions.

So, you counted six additional operators on top of OpenShift; actually there are more, I would say probably in the ballpark of 20 or 30 additional operators. And the question is how much additional overhead that is in terms of cost. I never really ran Kepler on it to actually count the CPU cycles spent. It adds some cost, I would guess, but they are really lightweight: essentially they're just checking one configuration and then setting up another configuration. So there is some overhead, but as a customer you don't want to care about these operators, because we care about the infrastructure nodes and the control plane; it's the SRE way of doing and monitoring this stuff. And I think it's more lightweight than having some external system querying the API. So to answer your question precisely: I don't know, but it's probably less than we might think.

Another question? Whether the admission controllers and the webhooks are reusable or OpenShift-specific. They are both: some of them are for the Kubernetes cluster itself, so pure upstream; some are specific to OpenShift; and others are specific to the SREs, which need rules protecting the webhook rules themselves from removal, in order to protect the cluster. For that reason they differ, but the most important part is that you can take these admission controllers as inspiration, to get ideas and reproduce them instead of reinventing the wheel again and again and again. You can bring these ideas into your own clusters and reuse parts of the code, because it's all open source. Yep.

So, the question is whether the runbooks are updated on every OpenShift release. Last time I checked there was activity on that repository, so I would guess they are updated, and I would hope they are updated, but as you know, engineers are engineers, so they probably sometimes forget to update stuff. It is a live and kicking repository, so yes, they are updated, but can I guarantee it? No. And that's something people often forget: there are alerts shipping with your cluster, and you can actually know what to do with them. They're not just random gibberish of concatenated nerd talk to ignore; you can look into a proper runbook and have the alert explained to you. Okay, thank you.

I can stop telling jokes and stupid stuff. Thanks for being here for this talk. Who attended my talk yesterday about chopping them up? Okay, you'll probably be a bit disappointed: as I mentioned yesterday, today is serious. I won't complain, but still, thank you for being here. I'm Nicolas Frankel, blah, blah, blah.
So, as I mentioned yesterday, I have a bit of experience, and when I started working we already had monitoring. At the time, monitoring meant rows of people looking at a huge screen with dashboards, and when something happened on the dashboard, they were alerted and then tried to fix the issue. Everything was manual. I actually worked on an account where they were super proud to tell me they had the biggest dashboard screen in all of France. It was in southern France, and they were responsible for paying out money to unemployed people, so you can imagine, in France, it was a lot of money.

Then systems became more and more distributed, and now we say we don't have monitoring anymore, we have observability. It might be a semantic thing, it might be a real shift; I don't know, I'm not an ops guy. I've been a developer and architect all my life, but I've always had a keen interest in helping my ops colleagues operate my solutions. So, just as a reminder, even though the talk will be about distributed tracing, because I think that's the hardest and the newest part, let's go over the three pillars of observability. I assumed everybody knows about the three pillars. Who here is a developer? Okay, mostly everybody. Who here knows about the three pillars? Not that many, okay. So, always a good reminder, and just so we're on the same page: this talk is for developers who want to help their ops colleagues.

Metrics: that's what I described before, dashboards with metrics. At the time these were hardware, very low-level metrics: CPU, memory consumption, storage, whatever. Now I believe we're going in the direction of higher-level metrics, because if you tell the business, hey, we're utilizing 85% CPU, they look at you and say, yeah, thanks, interesting information, what should I do with it? But if you tell them, 99% of people put items in their cart and then leave the system before buying anything, that's important information they can use. That's something that can also be observed, that gives real insight into the business, and that you can act upon.

Logging has many, many different facets, and if you're not used to logging, it looks simple. Indeed, it looks simple, but there are many, many questions you need to ask yourself when you want to implement logging correctly. The first is: what to log? When I was a young and eager engineer, I thought, wow, I know about Java agents; I'll add a Java agent that logs every entry into a method and every exit, and I'll log the input parameters and the return value. That was very smart, yes, but what do you do with it? And what about the password, if one of the parameters is a password? So it's easy to automate, but the value is very low, whereas if you do it manually, it's a lot of effort; you need to think about it. We have been doing, especially in Java, sorry, I have been doing in Java, regular Log4j, SLF4J, whatever, for ages, and it's very easy to output whatever you want. But nowadays, most logs are not read by humans; they're read by machines and sent to an aggregated log store. So basically, it's better to output directly in JSON, so you don't need an additional transformation stage with, I don't know, Logstash, which is time-consuming and more fragile.
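As a tiny illustration of "output JSON directly", here is a minimal sketch in Python (one of the languages used in the demo later), with no third-party logging library assumed:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON line, ready for log aggregation."""
    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)   # containers log to stdout
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger("shop").info("item added to cart")
# prints something like:
# {"ts": "2024-06-01 10:00:00,000", "level": "INFO", "logger": "shop", "msg": "item added to cart"}
```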
Again, from my experience, when I started working, I was told: you don't log to console, logging to console is bad. Now everybody is using containers; where do you log? To the console. And then, as I mentioned, you always, always need to aggregate your logs. A single log on one system is completely useless; as soon as you start doing distributed systems, you want log aggregation. So this is the general idea: you get the log, and then another question is, who ships it? Is the application sending the log somewhere, or is a component scraping the log? That's a question you need to answer. Then, again, you need to parse the log if it's not in JSON, so better to write it in JSON directly; then you store it, and then, finally, you can search it to get more insight into the system. I've been using Elasticsearch for some time, so it's the system I'm most familiar with, but there are many, many options available. I won't comment on which is best, because, as I mentioned, Elastic is simply the one I'm most familiar with; use the one you like the most.

And then the third pillar is tracing. I generally love Wikipedia definitions; in this case the definition is not that great, so I came up with my own, probably inspired by many others, and it describes what tracing is: you want to trace a request across your distributed system's components. You want to know which places it went through, and if something bad happens, you of course want to know where it stopped. There have been a couple of pioneers in the distributed tracing area, Zipkin and Jaeger being generally the most familiar, and they still work. I'll be using Jaeger in one of my demos for no other reason than, hey, they have a good Docker image that works out of the box. There was also OpenTracing, which doesn't exist anymore, for one simple reason that I'll get back to later.

What we want, when we have a distributed system with distributed components, is a specification, because Zipkin and Jaeger each had their own format and their own protocol, and you had to come up with a solution compatible with your tracing provider. With a specification, everybody adheres to the spec, and you don't need to think about which implementation to choose for compatibility. So there is this W3C Trace Context specification, and I don't think it's like the XKCD joke about standards; I think it's going to be a real standard that people adhere to. The idea is very simple: you have a trace, which is basically just the single request, and then you have spans, and a span is the execution of that trace in one component. If that sounds very abstract, a diagram is always helpful. So here, perhaps I can impress you again. Yes, are you impressed? Okay, here is a single trace, right? This is a single request. In component X there is this span, in Y this span, and in Z this span. And every span but the first, the entry point, has a parent ID, so you can trace every span back to its parent span.
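Concretely, the W3C Trace Context specification propagates this as an HTTP header; the example value below is the one used in the spec's own documentation.

```
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
```

The four dash-separated fields are the version (00), the trace-id shared by the whole request, the parent-id (the span-id of the calling span), and the trace flags (01 means sampled).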
Okay, the specification is good, but we need tools, and for that there is something called OpenTelemetry, which implements Trace Context but gives you a lot more: it gives you APIs, it gives you SDKs, it gives you the format in which to send the data. I told you about OpenTracing: this is one of the few mergers in the open-source industry that went well. There was something called OpenTracing and something called OpenCensus, and they merged to create OpenTelemetry, and I believe this is really, really good. Nowadays it's a very, very big project, a CNCF-backed project with a huge, massive following, and, more importantly, nearly all the tools I know now follow this OpenTelemetry specification. What we need to understand is that the OpenTelemetry specification, as I mentioned, is just the format and the channel. For your comfort they provide something called the OTel Collector, but what happens afterwards, the green stuff on the slide, the spec doesn't care about, and even the yellow stuff is just for your comfort; you can implement your own OTel collector. So you have lots of sources that know how to communicate with an OTel collector: they send the data in this format, on this channel, and it's done. Zipkin and Jaeger provide their own OTel-compatible endpoints, which means that if you're using Zipkin or Jaeger nowadays, there's no reason to use their proprietary formats; it's much better to switch to the OTel format, and then you can, if needed, swap Zipkin or Jaeger for the next good thing.

So now, as I mentioned, this is for developers: now that you know how it works, how can a developer implement OpenTelemetry? The first thing to think about is whether you want auto-instrumentation or manual instrumentation. If you're on a managed runtime, that's a real question; if you're not, say in Rust or Go, you need manual instrumentation, because there can be no auto-instrumentation. With auto-instrumentation, for example on the JVM, you can have a Java agent, which means your code is completely oblivious to OpenTelemetry; the developers don't need to know anything about OpenTelemetry at all. Only the people building the artifact add the agent, and the agent, through the magic of the framework, in my demo it will be Spring, but there are probably a lot of Quarkus people here who can do the same, sends the data to the OpenTelemetry collector. If you go the manual instrumentation route, you'll have additional, or different additional, dependencies in your code, and then your developers will need to call them explicitly. I believe that as a first effort you should grab the low-hanging fruit, which is auto-instrumentation if you can: if you're on the JVM, use the agent; if you're on the Python platform, add a couple of dependencies in the Dockerfile and you're done.
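For the JVM route, attaching the agent is a change to how the application is started, not to the code. A sketch, where the service name and collector endpoint are hypothetical values and the OTEL_* variables are the standard OpenTelemetry configuration knobs:

```bash
# Fetch the agent once, at build time (latest released jar)
curl -sLO https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar

# Run the unmodified application with the agent attached
OTEL_SERVICE_NAME=catalog \
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317 \
java -javaagent:./opentelemetry-javaagent.jar -jar catalog.jar
```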
Okay — since yesterday I didn't present Apache APISIX, let's present it today. Basically, it's an API gateway. It's built on NGINX open source; then there is the OpenResty layer, a lower layer that allows us to script the configuration, so it's dynamic — you don't need to switch it off and on again to change the configuration. Then we have the APISIX core, and APISIX is based on plugins, so you can add or remove plugins. And now I will do my demo, because I've talked a lot already. So here is my Docker Compose file; I will start it and describe it: docker compose up. I use Jaeger because they provide everything in one single image — the web app, the OTel collector, and whatever other components they provide. I don't need to care how they provide it; they have one Docker image, and I use it because I'm lazy. Who isn't lazy here? Okay, throw the first stone. Then I have APISIX, of course. Yesterday I showed you that I was using a key-value store, etcd; here I'm using YAML files, static files (sketched below). So I'm more following the GitOps way: if I want to change the configuration, I listen for changes in my Git repository, the change is reflected in my static file, and Apache APISIX reloads its configuration. That's another way to use Apache APISIX. Then I have the catalog, the pricing, the stock. I have multiple components, and every one of them uses a different technology. So, who here is a JVM developer? Wow. Are we at a Red Hat-organized or -sponsored conference? There are only three people using the JVM? Interesting. Python? Interesting. Rust? Yeah, everybody who develops in Rust says they're not that great at Rust — me as well, so that's not an issue. So let's start with the Python stuff. Here I have my Python application. It's a Flask application, nothing really interesting. I'm using SQLAlchemy to query a SQL database: I get the price, I jitter it a bit just to have more random values for fun, and you can ask for the price of one single product. Nothing mind-blowing, just a regular Python Flask application. I'm not a Python developer, so if you see bad stuff in this code, please let me know afterwards — but not publicly. Then in the Dockerfile I actually add the OpenTelemetry dependencies. The requirements are what I told you — Flask and SQLAlchemy, nothing more — but when I build the image, I add the additional OpenTelemetry dependencies, and when I run the image, I run it through the OpenTelemetry instrumentation wrapper. So, as I mentioned before, the developers are completely oblivious to this OpenTelemetry stuff; they don't need to know about it. And then on the Java side — we'll forget the Rust side — it's a Spring Boot project. I'm using Kotlin because I love Kotlin; everybody should do Kotlin anyway. It's reactive, even though that's not necessary here, and again, nothing related to OpenTelemetry in the code. However, when I create the image, it's a multi-stage Docker build: first I build the stuff — I need a JDK — then I run the stuff — I only need a JRE. I build it normally, then I add the OpenTelemetry Java instrumentation in the latest version and I start Java with this Java agent. And the Rust stuff — there are not that many people doing Rust, so it's not that interesting; besides, again, I'm not super great at Rust. So now I can curl localhost.
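As an aside, for readers who haven't seen APISIX's standalone mode: the static file driving the gateway can look roughly like this. The upstream names and ports are my invention for this e-commerce layout, and the tracing rule shown is the one the demo adds in a moment; option names follow the APISIX `opentelemetry` plugin as I understand it, so verify against the docs rather than treating this as drop-in config (standalone files end with a `#END` marker):

```yaml
# conf/apisix.yaml — hypothetical standalone config
routes:
  - uri: /products*
    upstream:
      type: roundrobin
      nodes:
        "catalog:8080": 1        # assumed catalog service name/port
  - uri: /prices*
    upstream:
      type: roundrobin
      nodes:
        "pricing:5000": 1        # assumed pricing (Flask) service
global_rules:
  - id: 1
    plugins:
      opentelemetry:
        sampler:
          name: always_on        # demo only; sample in production
        additional_attributes:
          - route_id             # built-in variables
          - request_method
          - http_x-otel-key      # forwards the x-otel-key header value
#END
```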
So: I go through the API gateway, I ask for the products, and I get the result. And now I can go here and check Jaeger. This is the Jaeger UI; I can find the traces — I did a single request, so I will find that trace here — and we can see the path of the request through all the components. Of course it starts with the API gateway, then it forwards to the catalog, and inside you can see the different calls, and that is through the magic of Spring: I didn't do anything, but Spring was able to do all this for me. Then something interesting: even if you are not an ops person, you will see that the catalog's call for the price goes through APISIX — we don't call the pricing component directly, we go to the gateway, which forwards to pricing. That can be one architecture. On the other side, the catalog calls the stock service directly. So even if you are not interested in distributed tracing as such, you can check how a request flows through all the components, because it can reveal a misconfiguration on your part. And then you get additional data — on every component you can have additional data. Here, for example, on the Apache APISIX side, I decided to add — as I mentioned, everything is a plugin — a global rule, like the `global_rules` entry in the earlier sketch, that says: every time a request goes through APISIX, apply the OpenTelemetry plugin. Normally you should sample — you don't want every request to be instrumented — but for demo purposes I trace everything. And then I have additional attributes, such as the route ID, the request method, and one HTTP header. So here I have nothing, but if I redo the same request with an additional HTTP header called `x-otel-key`, with the value "hello, devconf.cz", and I get back here and search again, then I have a new trace, and on the Apache APISIX side — hey, I have this additional data. So you can attach additional data if you want to have it in your tracing. On the Java side, you get additional data that I didn't configure; it's there by default. Here you can see the Reactor stuff, because I'm using WebFlux — the threads are the Reactor threads. Even if you are not using WebFlux, there is Project Reactor underneath, and it has its own threading model, so everything is handled by it. So that's the beginning, and I believe it already gives you many, many insights into your architecture.

But now, suppose we want to add manual instrumentation on top of it. No worries, let's do it. So I'll do docker compose down, I will just switch with git — because, as I mentioned, I'm lazy — and I will bring it up again. Now I've made a couple of changes. Here, I'm telling my developers: now you should add OpenTelemetry to your list of dependencies. Okay, let's do it. And you can now use some interesting stuff: inside the application, inside a single function, I can add additional spans. Spring already adds internal spans, but now I'm adding them inside my Python application with the explicit API. So here I've decided I will trace the query, and I will add as an attribute the product ID that I'm querying. On the Java side, I've added the annotation stuff. Spring allows me to do it through annotations or through the explicit API; honestly, the explicit API doesn't work that great at the moment, so I'm using annotations, and in my opinion, especially for a demo, that's good enough. So here I've added additional stuff through annotations.
So now every function will be instrumented, and here I also want to add the product ID for one single product. And here is the part that is not that great: I don't actually need this ID in my code here, because I pass the whole product — I don't need to pass the product ID separately, I could get it from the product. But because I'm using annotations, I need to change the signature of my method to auto-instrument it. So there are pros and cons to this approach. Now, if I curl localhost for the products again and get back to Jaeger and find the traces: there were 20 spans before, now I have 27, because I've added some of them through the explicit API and some through annotations. And here you can see the stuff that I added manually: `SELECT * FROM ... WHERE id = ...` — it's not good to do `SELECT *`, but again, laziness — and here I have the ID 1 as an attribute. We can also see that my API is actually very bad, because I'm calling every single component once per product, whereas I should pass all the IDs I want and get them in one call — but it's also interesting to see that. And on the JVM side, we now have the spans that we didn't get before.

And folks, that's all I have for you today. Just as some last words: this was for developers, and as I showed you, auto-instrumentation is not that hard. If you are using a runtime such as the JVM or the Python runtime, you can just build auto-instrumentation into your Dockerfile, and you will already get lots of insights that can help your ops colleagues — or you yourself, if you are debugging — to understand the flow of a request throughout your system. If you want additional details, then you need to do that yourself, either through annotations or through an explicit API, but then you couple your code to OpenTelemetry. Thanks for your attention. You can follow me on Twitter, you can follow me on Mastodon. Everything is on GitHub as always, so just use this bit.ly link — which I use to check how many people are actually looking at the code. And if I got you interested in Apache APISIX, please have a look; this is the reason I can come here and do talks, because somebody is actually using the product. Any questions? Yes. You need to shout, because I'm old and I don't hear that well. So the question is whether I have any experience with very fast requests, such as DNS requests, correct? And the answer is no. I've done my share of consulting, and I was considered either a very good consultant or a very bad one, because when I don't know, I say "I don't know". My employer didn't like it; my customers liked it. Other questions that I can answer? If you don't ask the question, you will never know whether I can answer. Yes. Well, the question is whether it's possible to aggregate the traces, which — I don't really understand the question, because traces you cannot aggregate. Let me change the direction a bit: I've seen a demo of the Grafana stack with Loki, Tempo, Grafana, and Mimir — thanks — yeah, I always remember it as "looks good to me", LGTM. And there you can actually go from the traces to the logs to the metrics and around in a circle. That's probably what you want, because a request is a request; aggregating across multiple requests doesn't make a lot of sense in my opinion. So here is just the Jaeger UI; other components will probably give you different insights, but they look more or less like this. I know that Elasticsearch does the same.
And because I'm too lazy, I asked one of my friends who works for Elastic to show it to me that way, so you can have the same experience: go from the metrics to the logs to the traces, all in one place. Other questions? Yes. Okay, good. So — the question was why the OpenTelemetry dependencies in the Python Dockerfile are in beta, when that will be fixed, and why it is so. Python is not my area of expertise, so honestly, I don't know. However, I know that some people who are very, very deep into OpenTelemetry on the JVM side tell me that there might be some, let's say, slight issues. It's not as easy as it looks: it works very well for a demo, but if you want to use it in the real world, it will work, but you might sometimes have surprises. I don't think it's a real issue, because if you sample and 99% of the traces are correct and one of them is not, in my opinion that's not that big an issue. So either you get used to using the beta version until it's fixed — or it may never be "fixed", and you just use it. In my opinion, the ratio of effort to the benefits it gives you is incomparable. If you are an engineer, yes, it sucks to use beta versions, but if you just want to get things done, just use it. Other questions? Yes. So the question is whether it's possible to combine auto-instrumentation with manual instrumentation, and actually, if you look at the Dockerfile, I think that's the case with the JVM stuff. Let me get back... I think it's the case, yes: here I'm using the Java agent plus the annotations. The Python one, I don't remember... yes, it is the case as well. So yes — because somehow you need something to send the data to the OpenTelemetry collector, so for sure you will need the agent part. It's just that there is an additional step where your code is coupled, which is not that great. You had a question, yes. So the question is what stack I would recommend. I don't recommend anything, because I'm not an ops person, but you should check the Grafana stack, and — I didn't know it, but as I mentioned, Elastic does the same as well. So at least you have two stacks which you can compare. But I would recommend using the stack that your ops people are familiar with, and not trying to force a stack upon them. Sorry, again? So the question is how hard it would be to instrument a third-party library, and the answer is: it depends. First it depends on the stack — Python versus JVM, for example; Rust is out of the question, you need to have a platform. And it depends on the library itself. If the library is, for example, recognized by Spring and has the correct entry points and so on, it will work out of the box. If it's not designed for that, tough luck — because the code needs to call the OTel collector to send the data, so if it's not designed for that, you cannot do it at all. So either it's out of the box or not at all. Yes. So the question is whether there is an integration between OpenTelemetry and Istio, and the answer is: I don't know. Yes. So the question is whether I've seen similarities between OpenTelemetry and Sentry, right? And the answer is: I don't know, because I never used Sentry.
I don't even know what it's made of. And I am out of time, but I will be around for a couple of minutes if you have questions. Thanks a lot, enjoy the rest of the day.

Test, test. Hi, good afternoon. Thank you for coming to this presentation. My name is Leonardo Milleri and I work for Red Hat in the virtualization team. Today we are talking about building containerized workloads with VirtIO and vDPA. I'm going to provide some background about the technologies, VirtIO and vDPA, and then present the work that has been done by the community and myself for integrating vDPA into container orchestrators like Kubernetes and OpenShift. I'm going to show a demo, which is recorded, then the current status of the development and the future steps, and finally a Q&A session at the end of this presentation.

Okay, background — what is the problem statement? It is to accelerate high-performance container networking without vendor-specific solutions. One of the most popular solutions is SR-IOV, which stands for single-root input/output virtualization. That solution is dependent on the physical NIC, and this is where vDPA comes into play: it provides containers and VMs with, let's say, decoupling from the physical NIC. Accelerating means forwarding packets as fast as we can from the container or VM to the physical NIC. A quick mention of VirtIO: VirtIO is a specification of interfaces — different types of interfaces for virtual machines, especially network and storage. It also defines the layout of the device and the interaction of the device with its drivers. Some of the things relating to that interaction are the feature negotiation between device and driver — in order to establish, for instance, the virtqueues that let you exchange data between the host and the guest — and transport parameters, like PCI.

Now, what is vDPA? vDPA stands for VirtIO Data Path Acceleration. The basic idea is to take the virtio-net interface and push it directly down to the physical NIC. There are two main aspects: the data plane, which follows the VirtIO specification, so it is standard, and the control plane, which is vendor-specific. For this reason the control plane is translated by a shim layer, called the vDPA framework, which converts it to a generic control plane. A quick mention of the vDPA framework — I'm not an expert on this, but just for completeness. From top to bottom there are, for instance, containers and VMs. A container is provided a virtio-net interface that goes directly into the kernel network subsystem, while a VM is instead provided a character device that goes down to the vhost subsystem. Then, in the middle of the picture, we have the main components of the vDPA framework: the virtio-vdpa and vhost-vdpa bus drivers, and the vDPA device driver, which is the abstraction for the physical device. In the bottom part of the picture we have the hardware blocks — this is a compliant vDPA card. As said, we have a standard VirtIO data path and a vendor-specific control path.

Now I'm going to talk about the work that has been done for the integration into Kubernetes and OpenShift. This is to introduce how a worker node is deployed, in terms of the main components and OVS hardware offload. From top to bottom we have the OVN controller, which complements the capabilities of OVS by providing virtual network abstractions such as layer-2 and layer-3 overlays.
Below that we have OVS, which is a multilayer virtual switch for forwarding packets between the pods and from the pods to the physical network. And then, in the hardware blocks, the physical NIC is typically configured in switchdev mode — for the Mellanox/NVIDIA cards, for instance. This is a software abstraction that provides open, standard Linux interfaces that can be used by applications on top of it. There are the so-called port representors — P0, P1, P2 in the picture — and they are connected to the OVS bridge. Also, with technologies like SR-IOV, the physical NIC is partitioned into different virtual functions, called VFs. And in the end, when we set up the virtual data path, we have a one-to-one connection between the container or VM and each VF.

So how does it work in terms of packet processing? When the first packet is received, it is handled by OVS in software — this is called the slow path. But any subsequent packets match a flow that is installed directly in the NIC by OVS and TC flower. This is the fast path, which improves OVS performance a lot and also makes sure we don't have a high CPU load on the host. Okay, that was just an introduction to the hardware offload.

Now I'm going to talk about some Kubernetes internals — the most important components. We have the kubelet, which is the node agent, and the SR-IOV network operator, which is a sort of software extension to Kubernetes making use of custom resources, defined by CRDs; those are used for managing application components in Kubernetes. The important operations the SR-IOV operator does are mainly to configure the NIC in switchdev mode and then to create the VFs. With SR-IOV, vDPA devices will also be created on top of each VF, so there is a one-to-one relationship between the vDPA device and the VF. The operator also configures the required drivers in kernel space, like the vDPA drivers and vendor-specific drivers such as the Mellanox mlx5 driver. And finally, on the right-hand side of the picture, there is a manifest generated by the operator that is used by the device plugin, which we'll see in a moment.

Okay, this is the second slide of the workflow. Here we have the device plugin, which is responsible for discovering the vDPA devices and advertising them to Kubernetes. You can define resource pools: for instance, you can say VFs zero to three are pool one and VFs four to seven are pool two, so you can arrange resources into resource pools (there's a sketch of such a pool config below). Then, when you create your first pod, the pod has to reference this resource pool so that the device plugin can allocate the vDPA device at pod creation. Okay, the final picture. Let me quickly introduce CNI, which stands for Container Network Interface. It's a specification and a set of libraries for writing network plugins, which are then responsible for configuring the network interfaces of the pods. One of them is Multus, which is sort of a meta-plugin able to invoke different other CNI plugins underneath. Its main purpose is to attach multiple interfaces to the same pod, because normally Kubernetes doesn't allow you to have more than one network interface apart from the loopback.
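Back to the resource pools for a moment: to give a feel for the idea, here is a rough sketch of a device-plugin pool definition. The pool name and selector values are invented, the schema belongs to the SR-IOV network device plugin (normally written as JSON in a ConfigMap; rendered as YAML here for readability), so double-check field names against that project:

```yaml
# Hypothetical pool definition for the SR-IOV network device plugin.
resourceList:
  - resourceName: mlxnics        # pool advertised to Kubernetes, e.g. openshift.io/mlxnics
    selectors:
      vendors: ["15b3"]          # Mellanox/NVIDIA PCI vendor ID
      vdpaType: virtio           # only VFs with a virtio vDPA device on top
```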
Okay, OVN-Kubernetes is the other CNI here, the one delegated to from Multus CNI. So, coming back to the workflow: now we have this NetworkAttachmentDefinition object, which is for defining your network, and there we have a mapping between the resource pool and the designated CNI plugin — in this specific case, OVN-Kubernetes. So when we create the pod, we have to specify which network we are going to use, defined by the network attachment. Multus CNI will then delegate the job to OVN-Kubernetes, which will take the virtio vDPA device and move it inside the pod namespace. The other thing it will do is take the port representor and add it to the OVS bridge. So, by the end of this complex interaction between the Kubernetes components, we have a standard virtio interface created in the pod — that is the eth0 in the picture.

Okay, we are going to have a demo. This is just to introduce the setup: we have two bare-metal servers. On one, we are running three control-plane nodes in virtual machines — so it is actually a hybrid cluster, because the worker node is running on bare metal directly. The two machines are connected back-to-back with an NVIDIA dual-port NIC. The first port is used by the default cluster network, and the second port is used for the vDPA demo. Okay, here is a link to the demo; let's see if I manage... okay, it's going. So I'm creating two pods on the same machine, on the same worker node, and demonstrating that I can ping between the two pods using vDPA. Of course we could have tried it with multiple workers, but just for simplicity I had only this setup. First of all we check the state of the cluster and we create a MachineConfigPool, and then we make sure the worker node joins this pool. This is done basically with labels: we add this MCP label to the worker node, together with another label, the SR-IOV-capable-node label, which is used by the operator to select the proper worker nodes. Then we create the SR-IOV network pool configuration, and this command will actually reboot the node and configure the OVS hardware offloading on the worker node. Something I haven't mentioned: this is the SR-IOV policy — the cluster administrator instructs the SR-IOV operator with this policy (there's a sketch of the policy, the network attachment, and a pod below). So we are giving the name of the policy; the node selector is used for selecting the worker nodes we want to configure; we specify the resource pool, mlxnics, and the number of virtual functions we want to create for this purpose — in this case, two. The NIC selector is used for filtering the NIC devices, depending on parameters like the device ID or the vendor — there are multiple ways of filtering. And finally we set switchdev mode on the NIC card, and we select virtio vDPA as the type of interface to be created. Okay, just a check that we have no VFs before applying the policy — we can see there are no VFs created yet. As soon as we create the policy, the node will reboot again. Okay, we check the state of the cluster after the reboot — all the nodes are in the Ready state — and now we should have two VFs created, VF0 and VF1, and also two vDPA devices created on top of them. The NIC card is in switchdev mode, and we have hardware offload enabled.
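The three key manifests in this flow — the SR-IOV policy just applied, plus the network attachment and pod used next — look roughly like this. The policy fields follow the SR-IOV network operator's SriovNetworkNodePolicy CRD as I understand it; names and selector values are placeholders, and the exact OVN-Kubernetes NAD config is an assumption to verify:

```yaml
# Hypothetical SR-IOV policy: switchdev + virtio vDPA on selected nodes.
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: mlx-vdpa-policy
  namespace: openshift-sriov-network-operator
spec:
  resourceName: mlxnics            # the pool the device plugin advertises
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 2                        # create VF0 and VF1
  nicSelector:
    vendor: "15b3"                 # filter by vendor / device ID / PF name
  eSwitchMode: switchdev           # required for OVS hardware offload
  vdpaType: virtio                 # create a virtio vDPA device per VF
---
# Hypothetical NetworkAttachmentDefinition binding the pool to OVN-Kubernetes.
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: vdpa-net
  namespace: vdpa
  annotations:
    k8s.v1.cni.cncf.io/resourceName: openshift.io/mlxnics
spec:
  config: '{ "cniVersion": "0.3.1", "name": "vdpa-net", "type": "ovn-k8s-cni-overlay" }'
---
# Pod that requests a vDPA-backed interface from that network.
apiVersion: v1
kind: Pod
metadata:
  name: pod1
  namespace: vdpa
  annotations:
    k8s.v1.cni.cncf.io/networks: vdpa-net
spec:
  containers:
    - name: test
      image: registry.access.redhat.com/ubi8/ubi   # any image with ping would do
      command: ["sleep", "infinity"]
```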
Okay, if we look at the interfaces that have been created on the worker node, we have basically the two port representors — for VF0 and VF1 — connected to the OVS bridge, and their driver is the Mellanox driver. Then we have the two virtio vDPA interfaces, and as you can see, the driver is virtio_net — a standard driver. Okay, we create the NetworkAttachmentDefinition. Here, as you can see, there is the binding between the resource pool and the OVN-Kubernetes CNI, and of course the name of the network. Okay, it is time to create the two pods, pod1 and pod2; we are creating them in the vdpa namespace. It is enough to put the network we want to use in the pod manifest, and automatically the pods will get the vDPA interface created. Okay, here are a few checks of the IP addresses, and as you can see, the driver is the standard virtio network driver. And now we are ready to test the connectivity between the two pods: we ping from pod2 to pod1, and it is working as expected. So this is the end of the demo — we have successfully demonstrated how to ping between two containers using vDPA. And we can go back to the slides... to the slides, I hope. Maybe this one. Yes.

Okay, a quick reference to the current status of the work. We have implemented this on the primary interface of the pod using OVN-Kubernetes. If you're interested in the source code, in the community we have a bunch of repositories: the network operator, the device plugin, and go-vdpa, which is a Go library. The next steps in the development are to support a secondary interface on the pod. This is for enabling other use cases, like DPDK applications for user-space packet processing, container-native virtualization for running VM workloads alongside the container workloads, and providing accelerated standard interfaces to confidential computing. Okay, this is the end of my presentation; I think we have time to take some questions, if you have any.

Okay, yeah. The question is: the vendors are implementing the data plane in VirtIO, but they are not doing the same for the control plane — what is the reason? So, we have some vendors that by now have implemented the vDPA VirtIO data plane — NVIDIA, I think Intel, Pensando, a bunch of vendors — but it seems that the full VirtIO offloading, basically implementing the control plane with VirtIO as well, is more difficult for the vendors. So vDPA is sort of helping the vendors out by giving them more time: it is, let's say, covering the need and simplifying their life for this transition. Any other questions? Yes. Okay, so the question was: with the NetworkAttachmentDefinition, are we overriding the default network? Yeah — you might expect to have two interfaces on the pod because we are using a NetworkAttachmentDefinition, but as I said, this is the first step of the implementation, so we are actually using the default, primary interface. It is, of course, suboptimal; the next step, as I mentioned, will be to create a secondary interface instead, and then use vhost-vdpa instead.
That can also be beneficial for other use cases, like KubeVirt, I think, and other things, yeah. Okay, the question is whether we are already looking at OVN-Kubernetes, since support for the secondary interface has already landed in OpenShift — and yes, the answer is yes: we are now taking the first steps, investigating which bits are missing for the solution, and that is the plan. Yes? With Multus, yeah. Okay, so the question is: we are configuring the primary interface, but we are making use of Multus — why? Well, I think we probably could have avoided using Multus, but it was convenient for this implementation, because Multus is brought in by default by OpenShift, and it is a convenient way of doing this. Of course, it will be more useful in the future with the secondary interface; in that case we can't avoid having Multus for this purpose. Since we've reached the end — thank you.

Okay, hello, welcome to the MAC collision alert talk. My name is Eddie, I'm working at Red Hat on OpenShift Virtualization — upstream it's called KubeVirt. And I wanted to talk today about MAC collisions, but in essence I'm more or less focusing on trying to think about things in a different way. This talk is a bit special, because the alternative solutions we are going to talk about don't really exist yet — it's still open. I'd like to build them, but maybe it's not worth it, maybe there are better ideas, so you're all welcome to comment on it. The first thing I wanted to say is that in a lot of the solutions we build — I don't know, it's like an engineering practice — we complicate things a bit more than we must, and that causes a lot of complications afterwards. So it is very nice, after we have something running and working that we built very fast, to look back after a few years and see if we could maybe simplify it a little — especially since over a few years many things get added and it all gets more complicated. So what I would like to talk about today: first of all, I will give some context. Who here knows about Kubernetes? Everyone. Who knows about KubeVirt? Most of you. So I'll go over it very quickly — I already gave a talk with Andrea, who is here, where we did the same thing, so I will be very, very quick about it. Then we're going to talk about what this MAC address management thing is, how we manage it today, and what an alternative to it might be. So, the ecosystem, some background. In the beginning we had a virtual machine — a simple thing. Then we had many of them, on many nodes, so we had to manage them; that's the regular thing we do. Then came the containers, then we also had many of them, and we managed them too, and the result of that was Kubernetes, which manages pods — and behind the pods we actually have containers. And KubeVirt came along and merged them together, so we have one unified ecosystem management system. And now, a few words about MAC address management: why do we need it? Can anyone tell me why we need to manage MAC addresses? It's not obvious that we do, because usually — if you know what a MAC address is — the manufacturer of a card just puts it on the physical NIC, and that's it; you don't need to manage anything. Yes. Two?
Yes, uniqueness. So this is usually not a problem when the manufacturer does it — although I didn't know this in the past, manufacturers do create duplicate MAC addresses; they just ship them to the other side of the world, things like that. So that's one reason. And can you think about why we need it for VMs specifically? Yes, and... come on. So yes, that's the uniqueness one, right? But with virtual machines we also want to manage it because, when we create a virtual machine, shut it down, and start it again, it may come up with a different MAC address, and we want to avoid that. It's especially a problem when we run VMs in Kubernetes, because we run virtual machines in pods, and when you create a new pod it gets a new interface, which may have a different MAC address — unless we somehow manage it, or tell it to use a specific address. So that's the persistence part: we want that once the virtual machine is created with one or more virtual NICs, the MAC address stays there forever and doesn't change — also for migration, but that's a small detail we won't get into. And the second part, as was said, is uniqueness: we want to avoid MAC address duplication, and we'll talk about that in more detail a bit later.

So what is MAC address duplication? Does anyone here know why duplicate MAC addresses are a problem? What's the problem with two machines having the same MAC address? I will try to make it fast. You want to answer? Yes — so I will show you here why it's a problem, with a very simple use case. Let's say you have two PCs, MAC A and MAC B, connected to a switch. The switch has a MAC table: ports on one side, and the MAC addresses it has learned on the other. How does a switch usually learn a MAC address? I'm talking only about switches — hubs are like two generations behind us, so I'm not talking about them. A switch learns MAC addresses when traffic comes into the switch: it looks at the source MAC, learns it, and puts it in its table. In this case we have A and B, learned on two different ports. Now, if the two PCs have the same MAC address: let's assume the switch learned it on port one, then traffic came from the second machine, so it relearned it on port two — and it flip-flops like this forever (pictured below). It's not able to keep the same entry on both ports; classic switches will not do that. Moreover, the typical switches we have today will simply block one of the ports: they detect that one of the ports is problematic and just block it, so you won't have a problem in your network — because otherwise the switch might assume there is a loop, like a looped cable, which would cause you a lot of trouble. So what can we do so that we don't get into this situation? We need to manage it somehow — and obviously we will manage it; we always have a solution to manage stuff. The solution we have today is prevention, and I will go over it so you know it. It needs to solve what we discussed before: the uniqueness problem and the persistence problem. So: prevention.
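Before the solution, here is the flip-flopping MAC table pictured as data — a pure illustration, with an invented address:

```yaml
# Illustration only: a learning switch with two hosts claiming the same MAC.
mac_table_step_1:
  - { mac: "aa:aa:aa:aa:aa:aa", port: 1 }   # learned from host 1's traffic
mac_table_step_2:
  - { mac: "aa:aa:aa:aa:aa:aa", port: 2 }   # host 2 sends with the same source
                                            # MAC; the entry moves, and it keeps
                                            # flapping (or one port gets blocked)
```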
So this is a very simplified view of the tool we have today — not a tool, it's a product — called KubeMacPool. It sits in a Kubernetes cluster and has a controller. Does everyone know what a Kubernetes controller is, by the way? No one answers, so I'll answer. A controller is something that watches the system in a loop and checks whether the system is as desired. In Kubernetes the concept is that you specify what you want, and there is what actually exists, and the controller tries to move the system from the desired state to the actual state — to make them match. So that's what controllers do: they look at the configuration and at the actual things that exist. And the webhook — does anyone know what a webhook is in Kubernetes? No? A webhook is an endpoint — a URI endpoint, I would say — that you can configure in Kubernetes so that when you change the manifest of an object, the request is redirected to that webhook. The webhook can do two things, mainly. One, it can validate the configuration: if the validation fails, the change is just dropped, and the manifest is not persisted in the Kubernetes database. And the other, it can mutate the configuration: there is another step where you can change the configuration, and then it is saved. So, given that we have those, and we have a pool of MAC addresses: we create a manifest — a VM manifest that describes how the VM should look — and the controller of course watches that configuration. The webhook triggers as soon as we create the manifest, so the request goes there and goes through the validation step: it checks in the pool whether this MAC address already exists, and if not, it registers it. If that passes, it does two things: it mutates the manifest to put the MAC address in it — which solves the persistence problem (there's a sketch of such a mutated manifest below) — and it causes the pod with the virtual machine to be created with that specific MAC. Now, the rejection case: if there is a duplication problem. Someone can write the manifest with a specific MAC address. If they do, the request gets to the webhook, the webhook checks with the database, with the pool, and sees that the address is already there, so it rejects it — the creation of that manifest simply fails, so the pod will not be created and the VM will not be created.
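For readers who haven't seen it, this is roughly what the relevant part of a KubeVirt VM manifest looks like once a MAC is in place — the surrounding spec is abbreviated and the address is invented; only the `macAddress` field is the point:

```yaml
# Abbreviated KubeVirt VirtualMachine; the mutating webhook fills in macAddress.
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: my-vm
spec:
  template:
    spec:
      domain:
        devices:
          interfaces:
            - name: default
              masquerade: {}
              macAddress: "02:00:b5:9c:01:01"  # allocated from the pool;
                                               # sticks across pod restarts
      networks:
        - name: default
          pod: {}
```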
Now, this is what we have today, and it works pretty well; the problem is that when something unusual happens, it doesn't work so well. One of the things I like about the existing solution is that it's not intrusive to the VM system — it's external, it just monitors what happens, and if there is a problem it stops it, and that's it. You don't need a special parameter inside your VM manifest; the VM control plane does not need to know about this KubeMacPool thing at all. But there are still problems. The first problem is that this mechanism just blocks creation of virtual machines. Say you have one virtual machine created with a specific MAC, and then you create another virtual machine with the same MAC — I don't know why. If that happens, creation immediately stops, and the VM will never be created. But that is a bit anti-Kubernetes, in the sense that you describe what you want — and maybe in 10 seconds the other VM will go down — but you could not declare your desired state, because creation just fails immediately. It doesn't allow you to say "I want this, and eventually it will happen". I cannot do that with this tool at the moment. Another limitation is that it works only for virtual machines; if you need it also for plain pods, or something else that works with pods, it doesn't work — there was an attempt to add that, but eventually it didn't happen. It also has only a single pool. What does that mean? It means you can have two different networks, and each network can have its own local addresses — its own broadcast domain — and in that case, even if there is duplication across networks, things should still work; but this tool treats everything as a single LAN. And just managing the whole thing is a bit complicated, because if we look here: once you pass through the webhook, you still need a controller, because your webhook may have worked and everything was fine, but another webhook may fail your machine — it will fail the creation of the virtual machine, and then you want to give the MAC back to the pool, and for that you need the controller. So it gets a bit complicated. And this is the main thing — actually the only reason I'm giving this talk is this point: the probability of a MAC duplication is really, really slim. We are doing all of these fancy things, we spend a lot of energy and a lot of code to make it happen, but what is the probability of really having a MAC address duplication in a cloud system, where most of the interfaces are provided through some SDN or some other means? The probability, I guess, is very, very slim. So then, what can we do differently? That's the question. One of the things we could do differently is to think about this problem from a Kubernetes perspective. We want to tolerate the collision: if there is a collision, we can let it happen, try to detect it, and then do something about it. There is a very, very low chance that it will happen, so we will not have the blocking problem that we always have. I didn't mention it before, but one thing we did experience is that when the service that provides this webhook is down, VMs cannot be created at all. That's also a nasty thing.
This is related to a sentence that I really like — Pythonistas use it a lot, and I programmed in Python in the past: "it's easier to ask for forgiveness than permission". Do you know this sentence? It usually solves a lot of races. For example, you ask the question "is the file open?" and then you open it, right? But between asking the question and actually trying to open the file, or doing something with it, someone may have done something else in the middle. It's a regular race condition. So, one of the options to solve the previous problem is to separate the two concerns: persisting the MAC address, and detecting collisions. One suggestion is to do sticky MAC addresses with what we have today. We could take the existing virtual machine control plane of KubeVirt, where we have a VM controller that watches VM creations. Assuming a VM manifest is created, the controller will look at it — let's say there is no MAC address specified — so it will create the virtual machine with some random MAC address; in this case A is assigned, and that's it, we're done, the VM is created. Maybe this MAC address is a duplicate; we don't know, but we don't care at this stage. After the VM is created with A, the controller sees that this is the MAC address, and it can go and update the manifest: "okay, you created the VM, the VM has MAC A, I'm writing MAC A into the manifest, and from now on it will always be there." Even if you shut down and restart the VM, you will keep it. So we solved the stickiness — the MAC persistence problem. And regarding the duplication problem: if we look only at VMs, what we can do is have a simple controller, without the webhook, that just looks at the manifest, sees that there is a MAC address in it, and writes it into its pool. If it manages to register it, everything is fine. If it doesn't, it reacts to that: "okay, I cannot register this because it's already in use, so I'm going to tell the virtual machine to stop, or shut down the link, or whatever makes sense to resolve the problem." So this is a reactive mode: things have already happened — traffic may have been disturbed — but eventually we react to it, and the manifest is reflected in the actual virtual machine. Another thing we could do, if we want to make it more generic — and by the way, we could do the same thing for pods, unrelated to VMs, I just didn't show it here — is something smarter. For example, real switches today — SDN switches like OVN and OVS, and all kinds of vendor switches like Cisco, Juniper, and so on — already detect duplicate MAC addresses, and you can query them, talk to them, and ask them what the status is. So we could talk to someone else, or we could monitor our own system for what's going on: we could run tcpdump or some monitoring tool, watch all the ARP traffic, and see if two machines are advertising the same MAC address — also possible. It doesn't matter what the exact solution is; that monitor watches the network, and once it sees a problem, it goes to the controller — like in this case — and the controller goes and updates the manifest.
Then, the same as before: we react on the VM itself — either stop it, stop the link, whatever makes sense. That's it; I think I'm finished. Any questions? ... Currently, I think it keeps it in memory. The way it works is that when that controller comes up, it looks all over the system — that's one of its disadvantages — at all the pods and VMs that exist, collects them all, registers them all, and from there on it just reacts to what happens. Oh, sorry — the question was: what database is used to save the MAC addresses? Yes. No — sorry, the next question was: is the MAC address somehow registered in the manifest or not? Specified in the manifest, in the configuration? Yeah — so if the user does not specify a specific MAC address, then KubeMacPool, that project, will detect that and just edit the manifest itself, like here. Ah — so the manifests, if the cluster is killed? No. If any of the components of the system are down — this is a Kubernetes thing, because on the backend all the manifests are stored in etcd, which is distributed — so even if nodes are getting shut down and things like that, they will still be available. That's a Kubernetes given, yeah. So, one of the disadvantages, the question was, is what we are not managing, right? We are not managing MAC addresses outside of the cluster — and actually, even inside the cluster, for pods that are not VMs, we are not managing them, so they can cause collisions. And surely if we use, for example, secondary networks, and there are devices on that network in the same LAN, then we will not even detect it — there's nothing we can do. That's only what I showed at the end: if we want to handle something like that, we could use some external tooling to make it better. But I think this is the point I tried to make: the chances of this happening are so low that you should probably not bother. I mean, I would be interested to see someone actually having this problem, and then we can talk about it. No more questions? Yes. So the question was: have I encountered duplication of MAC addresses in reality? I worked in the past with a lot of networking not related to this system, and it can happen. For example, with virtual machines: if someone clones a virtual machine and the cloning mechanism takes an existing machine and duplicates its image, then yes, the MAC address will be duplicated, because the MAC address is often written in the configuration of the operating system. So we'll have two virtual machines coming up with the same MAC address, which is a problem. Even in the real world it can happen: today there are tools that know about this, so they will not do it, but in the past you could just clone the disk, put it on a hundred machines, and distribute it. If they were sitting on the same LAN, they would just start to collide. But as I said, in the cloud, the real switches just block your traffic, so you'll see in the logs that a port was blocked.
So yes, it can happen unintentionally. But there is also the option that someone running inside your guest changes the MAC address. That can happen too, but then there are protections like MAC spoof filtering, which say: if I created a virtual machine that is supposed to have MAC A, it can only send with MAC A; it cannot send with MAC B, for example. But that's a security measure. Anything else? Thank you.

Okay, hey everybody. So, this session is about self-service OpenShift cluster creation. My name is Rastislav Wagner, I'm a software engineer at Red Hat. About a year ago — oh, I turned it off, okay — about a year ago we started working on a new operator, called the cluster-as-a-service operator. It's already available on OperatorHub — just a first version; these days we are trying to release a new one. There is this one-liner description which says that you can easily install fully configured clusters. So, how easy did we actually make it? Like most Kubernetes operators, it brings in some CRDs. These are the APIs you use to talk to the operator: you tell the operator what to do, and then you get some feedback. One of the CRDs our operator brings in is called ClusterTemplateInstance. What you need to do is just fill in the name and the namespace, and in the spec part you reference some cluster template. In this case the cluster template is called "cluster with Kafka", and the name gives you a hint that it will install an OpenShift cluster where you will already have a Kafka instance running, configured, and ready to be used. You submit this cluster template instance and some magic happens, and after some time, in the status, you get all the info you need to log into your cluster: the API URL, the admin password, and the kubeconfig (we'll see a full sketch of such an instance later, during the demo). You can log into the cluster and do whatever you want — you may even break the cluster, it doesn't really matter — and once you are done with it, you just delete the cluster template instance and the cluster is gone. So that's it, really; that's how you can self-service your own clusters. Do you have any questions? Okay, not yet, right? So you are probably interested in the magic part — what happens after you submit that cluster template instance. The cluster-as-a-service operator is not reinventing the wheel: we are using operators, controllers, and technologies that are already available — they have a lot of users in the Kubernetes world, they are popular — and they provide all the pieces you need for self-service, but those pieces are not really tied together. So we take them and tie them together in some order to deliver the self-service experience. The technologies and operators that we use are Argo CD, Hive, HyperShift, and Helm. So, Argo CD — I guess everybody already knows it, but just to repeat: Argo CD follows the GitOps pattern of using Git repositories as the source of truth for defining the desired application state. But it does not have to be just Git repositories; it can also be a Helm chart, in which case you would be using a Helm repository. So you have that source of truth, you tell Argo CD "this is my repository, please deploy it on this cluster", and Argo CD deploys it.
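To make that concrete, here is a minimal sketch of an Argo CD Application pointing a set of manifests at a target cluster — the repo URL, path, and destination are invented placeholders, while the resource itself is the standard Argo CD API:

```yaml
# Hypothetical Argo CD Application: deploy manifests from Git to a cluster.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-cluster-day2
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/day2-manifests.git  # source of truth
    targetRevision: main
    path: kafka
  destination:
    server: https://api.my-cluster.example.com:6443          # target cluster API
    namespace: kafka
  syncPolicy:
    automated: {}                                            # keep it in sync
```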
Argo CD will then start monitoring it and making sure the deployment matches the source of truth. And Argo CD is really the center of cluster-as-a-service, because we use it to deploy day-one manifests on the hub cluster and day-two manifests on the spoke cluster. The hub cluster is the main cluster, where cluster-as-a-service and all the other operators are running, and the spoke cluster is the cluster that you are trying to deploy. And what are these manifests? Day-one manifests are custom resources of Hive or HyperShift — operators, or controllers, that enable you to install OpenShift. The difference is that Hive installs standalone clusters: clusters where you have three control-plane machines, plus some machines for workers to run your workload. HyperShift also installs OpenShift clusters, but you don't need those three machines to run the control plane; instead, all the control-plane services run as pods on the hub cluster. And these projects have different APIs. Day-two manifests can be anything — anything that you want to do after the installation. Because when Hive or HyperShift install the cluster, the cluster is empty; there is nothing really going on in it. With day two you make that cluster actually useful: maybe you want to install some database, configure an IDP, install some operators, run some instances, and so on — really make it ready for, say, a developer to be able to do what the developer needs to do. Okay, so how does the flow look? Let's try to visualize it. We have those day-one and day-two manifests; they live either in some Git repository or in some Helm chart repository. Then we have a hub cluster, and on that hub cluster the cluster-as-a-service operator is running, of course, and there is also Argo CD running, and Hive or HyperShift or both. What we do is tell Argo CD: here are my day-one manifests, please take them and deploy them on the hub cluster. Argo CD does that; Hive or HyperShift notices those new CRs that got deployed, and it creates a new cluster — the spoke cluster, the new one. After the cluster is installed, we again tell Argo CD: here are my day-two manifests, please deploy them too, but on the new cluster — because Argo CD can manage multiple clusters, not just the cluster the Argo CD instance is running on. And suddenly you have a new cluster which is running something and is useful.

All right, so: demo. First, let's create these manifests. I will close my slides and share my screen. For day one, let's create the custom resource which will deploy a HyperShift cluster. I've already prepared a skeleton, let's say. Here I have the DevConf template, which is a typical Helm chart. We have a Chart.yaml, which is metadata about the Helm chart — a name and a version; I will bump the version right now, otherwise I would forget, since we will be making some changes. We have values, which provide the default values for the Helm chart parameters, and of course we have a schema. And in the resources that we want to deploy, we have the HostedCluster CR, where almost everything is hard-coded — roughly like the sketch below.
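Here is an abbreviated view of what such a HostedCluster/NodePool pair can look like before we parameterize the version in the next step — the HyperShift API version and the release-image convention are my best understanding and worth verifying, and the names are placeholders:

```yaml
# Abbreviated day-one CRs from the chart; version still hard-coded.
apiVersion: hypershift.openshift.io/v1alpha1
kind: HostedCluster
metadata:
  name: devconf-cluster
spec:
  release:
    image: quay.io/openshift-release-dev/ocp-release:4.10.19-x86_64  # templated next
  platform:
    type: Agent            # boot spare VMs/bare metal from the discovery ISO
---
apiVersion: hypershift.openshift.io/v1alpha1
kind: NodePool
metadata:
  name: devconf-cluster-workers
spec:
  clusterName: devconf-cluster
  replicas: 1              # one worker is enough for the demo
  release:
    image: quay.io/openshift-release-dev/ocp-release:4.10.19-x86_64
```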
But let's say we would like to let the user who is self-servicing the cluster choose their own OCP version, instead of always deploying 4.10. So let's reference a value — `ocpVersion`, say; this is a new Helm chart parameter. And let's do the same for the NodePool, where the version is defined as well. The NodePool specifies how many workers we want in the new cluster, and we will have just one worker, which is enough in our case. The platform I'll be using is the Agent platform — it doesn't really matter which platform you use, we don't have any restrictions; we support basically anything that HyperShift supports. In this case I'm using Agent, which is a platform where you take some VMs or bare-metal machines, boot them from the discovery ISO, and use those machines as workers. All right, so I made some changes. In the values, let's set `ocpVersion` — I think before I had something like 4.10.19 — so this will be the default if the user doesn't provide a custom OCP version, and we will still deploy 4.10. We should also add it to the schema, and we'll say that `ocpVersion` is a string — even though we could use something more restrictive, like an enum, this is fine for us. Okay, I'll make sure all my files are saved, and let's package this Helm chart: I run helm package, then I update the repository index and commit all my changes to my Git repository. And my repository is set up so that on every push it runs an action which publishes a new Helm chart repository; it's already running, it will take a minute or so. In the meantime, we can take a look at the day-two manifests. For day two, let's say we would like to install the Strimzi operator and then deploy a Kafka instance. I just found an example on the web with some default values. These are just plain old YAMLs — this is not a Helm chart — and they're already in my Git repository. All right, so now we have day-one and day-two manifests; let's take them and put them into a cluster template. I will switch to the UI now, because I also want to show that we actually have a UI for the cluster-as-a-service operator. I'm not sure if you've ever seen the OpenShift console, but the cool thing is that OpenShift supports dynamic plugins: your operator can bundle the UI, run an HTTP server as a pod, expose it as a service, and tell the console "here's my service" — and the console serves the UI plugin, dynamically loading all those JavaScript assets and rendering the UI. In this case, we are adding the "cluster templates" navigation item, and when you go there, you will see the cluster templates. There are a few of them — three defaults — but let's create a new one. Let's name our template, say, "cluster", and we can add a description for the users who will self-service — just type something here; this is like a markdown template — "new cluster". So, the installation settings: these are the day-one manifests. Let's add our Helm chart repository; I will pick the DevConf template, and the new version, 0.0.10, is already in the repository. And we can say into which namespace we want to deploy all the files from the Helm chart.
So, let's pick `clusters`, and this is because — I didn't show this before, but if you look at this HostedCluster, there are a couple of secrets here, right? There's a reference to the pull secret and an SSH key, and I don't want to expose that info in my publicly available Helm chart. So I already have these secrets on my hub cluster in the clusters namespace, and when the Helm chart is deployed, it can find them in the correct place. Now, for the post-installation part, we can choose either a Helm chart or a Git repository; in this case, I will choose a Git repository. So again, let's add a new repository, which is this GitHub repo, and we need to say from which commit, branch, or tag we want to deploy. Let's choose the main branch, and the directory path will be `kafka`. The destination namespace doesn't need to be filled in, because that's already given by the resources themselves: the subscription will be deployed into the openshift-operators namespace, and Kafka into the kafka namespace. All right, so the template is created. When I show you the YAML, in the spec part we have the cluster definition and cluster setup, and these are actually ApplicationSets — this probably sounds familiar; they're Argo custom resources — and the spec contains all the info that we put into the wizard. We have the Helm chart repository, which Helm chart should be deployed, and the version of the chart. And we have another ApplicationSet for day two, which points at the Git repository. It all looks okay to me. So the template is ready, and now let's try to use it as a user. We don't really have a UI for the user yet, so let's do this from the command line. I'm already logged in as this user called Ravagner, so I'm not a cluster admin anymore, and I can't really do anything on this cluster — it's a very restricted environment. If I try `oc get projects`, I have just two namespaces available to me; let's switch to the Ravagner namespace, and if I try `oc get pods`, I'm forbidden — I can't really do anything. But what I can do is self-service the cluster which was defined. I can get cluster templates, and I'll probably want to explore, say, the DevConf cluster template. What's important is that in the status I can see the schema and the values of the Helm chart — and here you can see the ocpVersion that we added to the schema. So, let's say I want to use this one. I need to create the ClusterTemplateInstance, as I showed in the beginning, and I should already have the YAML almost prepared. Let's edit it a little: the name, and the cluster template that we want to deploy is called devconf-cluster, if I remember correctly. All right, the CTI is ready. We can submit it, making sure we're deploying it into the correct namespace. So the cluster template instance was submitted; we can explore its content and watch the phases of the installation. Let's look at the YAML: in the status there is actually a lot more than I showed in the example at the beginning — there are conditions, and you can actually watch the progress of the cluster creation. These are basically the phases, which follow in the exact same order as they're written in the array.
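A minimal sketch of the ClusterTemplateInstance the user submits — the API group and field names are assumptions based on the cluster templates operator, and the `parameters` array is the one mentioned again in the Q&A at the end of this talk:

```yaml
apiVersion: clustertemplate.openshift.io/v1alpha1
kind: ClusterTemplateInstance
metadata:
  name: my-cluster
  namespace: ravagner              # the user's own namespace
spec:
  clusterTemplateRef: devconf-cluster
  parameters:                      # optional; overrides the chart defaults
    - name: ocpVersion
      value: "4.12.21"
```

Submitting it is then just an `oc apply -f cti.yaml` from the user's namespace.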
So, the first condition is that the Argo CD application was created. The second condition is that the cluster is installing, and we can see it's still being installed. This will take some time — for HyperShift, maybe 10 minutes to deploy the control plane and then another 15 minutes for the agents. Then we try to create the ManagedCluster, because we integrate with MCE. MCE is an operator — it stands for multicluster engine — and it allows you to manage your clusters in bulk. Then there is "klusterlet created" and "klusterlet add-on created", which are also related to MCE. After that's all done, we add the new cluster to Argo CD, so Argo can manage the new cluster. And then we run the day-two manifests — in this case, that deploys the Strimzi operator and creates a Kafka instance — and we wait for it to succeed. Once all of that has succeeded, we get the API URL and all the credentials that we need — not before, right? The cluster needs to be completely ready, and only then do you get the credentials. We don't want to wait here for that, but I already created another cluster from a different template. It's basically the same — actually, exactly the same — and it already deployed a cluster where Kafka is running. What I can do is just take the credentials and log in, do `oc get kafka`, and the Kafka is running there. I can do anything I want on the cluster, and once I'm done, I delete it. Okay, but that's not all. There is one important part missing: when users get access to these ClusterTemplateInstance CRs, they can create an infinite number of clusters, right? That's going to cost a lot of money, and you still want to somehow restrict them. So we have another custom resource, called ClusterTemplateQuota. It's very similar to the built-in Kubernetes quotas; this one just focuses on the cluster templates use case. In this quota resource you can restrict basically two things: which templates the user can reference in a ClusterTemplateInstance, and how many instances the user can create. And the "how many" can be done in two ways. You either use a plain old number: say I have template A and template B, and there can be two instances of template A and three instances of template B. But that's not always very flexible. So we have this other, more abstract concept, where every template can have an associated cost. You can think of it as, say, how much it costs to run such a cluster for a week: template A would be $100 and template B would be $500. Then in the quota you can set a budget — say the budget is 500. In that case, the user can create five instances of template A, or one instance of template B, or some combination like that. So, maybe I can also show you what such a cluster template quota looks like. In the UI, again, you can go to the quotas — I didn't set any up yet, so I will create a new one, for example.
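A hedged sketch of what such a ClusterTemplateQuota could look like — the API group and field names are assumptions; the counts, costs, and budget are the numbers from the talk:

```yaml
apiVersion: clustertemplate.openshift.io/v1alpha1
kind: ClusterTemplateQuota
metadata:
  name: quota
  namespace: ravagner        # quotas are scoped to the user's namespace
spec:
  budget: 500                # optional overall budget across all templates
  allowedTemplates:
    - name: template-a       # cost 100 -> up to 5 instances fit in the budget
      cost: 100
    - name: template-b       # cost 500 -> only 1 instance fits
      cost: 500
      count: 1               # a plain per-template cap also works
```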
And I will say that this quota is for the Ravagner namespace, I will allow creating only the DevConf cluster template, and I will say that only one cluster from this template can exist. So we have a quota, and if the user now tries to create another one — let's do that... ah, it says I'm logged into the spoke cluster, so let's log back in, check the project, and create the CTI again — and I'm denied, because there is a webhook which checks whether I'm still within the limits that were set for me by the admin. Okay, so that's all I have. There is a repository you can go to — read the documentation, try it out. Okay, thank you. Do you have any questions? Yes? How does this relate to ACM? Well, this may actually, at some point, become part of ACM. ACM doesn't really solve the self-service use case, because ACM is really targeting admin users who have a fleet of clusters to manage. You don't give ACM access to regular users, because they would get access to a lot of things you don't want them to touch. This is for admins to provide the self-service experience to users. And yes, the users can be developers, sure — a team, a single person, or some kind of customer that just needs to deploy a cluster. Yes? Oh, sure — you're asking whether we can make a parameter of not only the version but the whole release image URL. Yes, of course, you can do that. It's just a Helm chart, there's nothing special about it. In the DevConf template, instead of just the OCP version here, you would make the whole image a parameter. That's really up to you; it depends on how much flexibility you want to give users to modify the template. Any other questions? And maybe one more thing that I forgot to show: earlier we made the OCP version a parameter, and in order to actually use that parameter, in the CTI spec you would add `parameters` — it's an array — so you would do name: ocpVersion, value: something like 4.12.21, and then it would deploy the version that you want. Okay, so that's all. Okay.

Yeah, that's all right. Everybody in the back, can you hear me? No? Can't hear me? All right. First of all, I want to say this is my favorite conference of the year — I come each year, I really enjoy it, and I learn lots of stuff while I'm here. And last time I gave a talk, it was so good that they had to shut down the world for three years. So we're finally back — hopefully that doesn't happen after this talk. So basically, I'm going to give a talk on containers on wheels. I like to think of myself as the Ursula of containers: I started out by getting containers into Fedora, then into RHEL, then into OpenStack, OpenShift, Ansible, and now into a thing called RHIVOS. So, containers on wheels — we call ourselves the COW team. It's part of automotive. Nine months ago, I moved out of container leadership — the Podman team and all the low-level container stuff — and I moved over to Auto, mainly because I thought the container team was ready to go on their own and didn't need me in the way anymore. And now I'm working on Auto, but really just continuing to work on containers and all related things. So we call it the Red Hat In-Vehicle Operating System; everybody calls it RHIVOS.
We're still not sure if we're supposed to use that name outside of Red Hat, but anyway — we're not supposed to use "RHEL" outside of Red Hat either, and that hasn't worked well. Last year at the Red Hat Summit, there was an announcement of a big agreement between Red Hat and General Motors, basically to look at getting a Red Hat-based operating system into all of their cars going forward. We have a lot of interest from a lot of car manufacturers and OEMs; they're all looking at what we're doing and they're very excited about it, but we picked General Motors as our customer number one, mainly to control the amount of requirements gathering. What we plan to do over the next year and a half is basically design the operating system with General Motors, and then we'd open it up to other car companies, other OEMs, and potentially other moving-vehicle-type things. This is all part of the greater Red Hat Edge operating system effort. We're building an operating system here — basically taking RHEL as the operating system — and then General Motors is building all sorts of stuff on top. When I talk about the software they're going to be running in these vehicles, we're talking things like self-driving, all the sensors, all of the infotainment systems, all different types of software. But the bottom line is we're doing things with systemd and Podman and adding new features like composefs, which I'll cover, along with a lot of other functionality in this talk. So basically RHIVOS is a binary distribution based on Red Hat Enterprise Linux. We're not building a brand-new operating system; we're taking the basis of RHEL and moving it into automobiles, and really what we're trying to do is move toward justifying that the operating system that runs in your vehicle could be RHEL. One key difference — or at least the default — is that we're going to use the real-time kernel, and as we talk about what it means to run an operating system in a car, you'll find out why we have to use the real-time kernel. We're also planning on using OS images built with OSBuild — so think, we'll keep this quiet, but it's CoreOS in a car. It's going to be an image-based system based on OSTree using atomic updates. We want to have an immutable operating system — we'll talk about that a little later — and we basically want the main operating system to be read-only from the perspective of the processes running inside of it. And we're really stressing that it has to be container-friendly. We really want to run a lot of containers: all the applications that run inside the vehicle, or most of them, are going to be containerized. So the biggest hurdle to getting RHEL into a car is a little thing called functional safety. If you go to Wikipedia and look up what functional safety is: it's the process of reducing risk in both simple and complex systems so they function safely in hardware. It's similar to security, but in some ways fundamentally different. What we're trying to do here is build an operating system into a moving vehicle that is as safe as possible. We don't want the machine to cause injury to someone, so we want to make sure there is no — or as little as possible — chance of something going wrong in the software system that could hurt a person. That's really what we're talking about with functional safety.
And traditional functional safety — this is one of the reasons the car companies have had a big hurdle getting new software into vehicles — says you have to write design documents for the entire CPU and the entire operating system, then write code to match those requirements, and then test the code to make sure it works as designed. It's the old-fashioned waterfall model, and everything has to be built from scratch. So the car companies have constantly been rebuilding an operating system, redoing it, and it just takes forever. So they wanted to move to a new system. From a Linux point of view: is there any design document for Linux? Anybody got a design document for the kernel? Linus put out a document 25 years ago, I guess. It really wasn't designed; it just evolved. So Linux is already written without any real design document, and it doesn't fit into the traditional functional safety model. What we're doing instead is documenting the functional-safety APIs — basically the APIs that we tell General Motors to use when they're running the vehicle, so they can run it in a safe mode. And guess how we're documenting the APIs? We have a little thing called man pages. So we're basically going through all the man pages, making sure they're accurate for the function calls in things like glibc; then we look at the code, and then we make sure there are test suites to verify the code. So that's part of our argument for functional safety. We're also making arguments like: this stuff has been used for many, many years; this is open source; the kernel is probably the most examined piece of code on the planet. So we look at the way open source develops and argue that it's basically a functionally safe environment. We have to make all these arguments and document them, and then we have to get other companies to come in and say: yes, you've proven that Linux is designed as a functionally safe operating system. Other than functional safety, we also have a need for speed. When you turn on your car, within two seconds you hear a beep that tells you to put your seatbelt on. Well, that's coming from an operating system — which means the hardware has to start, the kernel has to load, and at some point after that we have to emit a sound from the speaker to tell you to put your seatbelt on. If you put the car into reverse, within two seconds the backup camera has to be on. So we have to be able to boot an operating system, or bring it out of hibernation, within two seconds. That's a fundamental thing. Now, if we're running containers on top of that, we also have to look at how quickly a container starts up. So a lot of our focus has been around speed. On the Podman team, we wanted to run things in containers, and we did some testing on the lowest-possible-standard system — a Raspberry Pi with very little memory — and Podman took two seconds to start a container. You type podman run, hit return, and it takes two seconds. That's way too slow. So we basically went into the code, looked at every piece of it, used all sorts of tools to analyze it, and found all sorts of little speedups — but we're talking microseconds, right?
But we found hundreds of them, and together they added up to a six-times speedup. We went down to about 0.3 seconds. So with upstream Podman right now, you can start a container within 0.3 seconds on a very low-powered system. Now, for most human beings it doesn't matter — if Podman takes a second to start a container, you're not even thinking about it — but when you're talking about the overhead of starting containers in a car, you have to look at speed all the time. Now, these cars are not going to have one computer. Right now they have hundreds of computers in them, and one of the things the car companies want to do is consolidate down to a few computers and then have those computers analyze data from sensors all over the place. So these cars are going to have multiple nodes — so how do you manage those services? Oops, what just happened? Is it working now? So, what do you think about Kubernetes in a car? Right? A lot of the car companies came to us and said: what we really want is sort of Kubernetes in a car, cloud-native computing; we want to be able to put in all these cool new things and have the car constantly updating. And we came back and looked at Kubernetes in a car, and then we looked at functional safety. Kubernetes has the concept of eventual consistency: the system will eventually be in the correct state. So the braking system will eventually work. That's probably not what we want. The other problem is that you're taking a huge Go program that's constantly monitoring, constantly working, and trying to justify that this multi-threaded behemoth is functionally safe — that's pretty much not going to happen, not in a time frame where we want to get this product out the door. So, no Kubernetes. Alex Larsson and I actually wrote an article back in October, because lots of these communities were forming around Kubernetes in a car, and we wrote that it just ain't gonna happen — we don't believe that's the correct route. But we have this really cool orchestrator that orchestrates lots of services, that starts and stops services all the time, called systemd. So we're really looking at: can I have an application profile? One profile can run one or more applications; one application can have one or more systemd services defined for it; and then I have the capability to switch between different profiles, or different targets. Think of what happens when you boot your system: systemd goes into boot-up mode and starts a bunch of services; then you go from there to network mode, which brings up the network and might shut down some other services; from network mode it goes to multi-user mode, turning some services on and some off; and finally it goes to graphical mode — same thing, services on, services off. So that's the way systemd works — but how about systemd in a car? All the features we just talked about — multiple applications running as system services — but now I start the car: it's going to start certain services, maybe kick on the sensors, turn on the cameras, things like that. Then I put the car into reverse: it's going to start certain services, shut down certain services, turn on the backup camera, turn on the backup sensors.
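None of these unit names exist in RHIVOS as far as this talk shows; this is a hypothetical sketch of the pattern being described — a drive-state target that pulls in the backup camera and conflicts with whatever should stop:

```ini
# /etc/systemd/system/reverse.target — hypothetical "car is in reverse" state
[Unit]
Description=Vehicle is in reverse
Requires=backup-camera.service backup-sensors.service
Conflicts=drive.target
AllowIsolate=yes

# /etc/systemd/system/backup-camera.service — hypothetical service
[Unit]
Description=Backup camera feed
PartOf=reverse.target

[Service]
ExecStart=/usr/bin/backup-camera
```

Switching states is then one call — `systemctl isolate reverse.target` — which starts what the target requires and stops what conflicts with it, exactly the run-level behavior the talk describes.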
Then I put the car into drive, and again it turns off the backup camera. So all these targeted run levels can be handled just using standard systemd starting and stopping of services. So systemd for a single node is what we're telling General Motors to use: build services, and define the relationships between the services that run the different pieces of software. But systemd runs on a single node, and RHIVOS is going to be multi-node. So how do we get multi-node capabilities into systemd, or into RHIVOS? We need to extend the systemd concepts across multiple nodes. So we built a brand-new project called Hirte — however the Germans want to pronounce it, because it's a German word — the German word for shepherd or herder. And there are really two major components. The first one is the Hirte agent, which runs on each one of the nodes — eventually you might even run more than one per node; we'll talk about that in a few minutes. The agent just talks to systemd locally and talks back to Hirte on the server. So you have a main processor running Hirte and then Hirte agents running everywhere — basically a hub-and-spoke design where we have bi-directional communication from the main node out to the agents. And all the agent does is basically relay messages to systemd. The way you talk to systemd is via D-Bus, so Hirte is constantly talking back and forth with these agents, relaying systemd messages between them. We have a CLI tool, hirtectl, which is based on systemctl — we're basically taking what systemctl does and expanding it to reach all the different nodes. To give you an idea of what the architecture looks like: this state manager is the part General Motors provides. It's the thing that's waiting for the human being to say stop the car, put the car in reverse. It talks, via D-Bus, to the main Hirte server; that Hirte server talks to a Hirte agent; the agent talks to systemd; and systemd stops or starts the services. Hirte will also extend D-Bus over TCP to talk to the agents on each of the other nodes, and similarly it relays those D-Bus messages around the environment. So this Hirte agent tells this systemd to go into reverse mode, tells that one to go into reverse mode. If a service crashes for any reason, systemd notices and tells the Hirte agent a service crashed; that gets relayed back to Hirte, and that goes up to the state manager to say some service crashed, right? So if you're driving along at 60 miles an hour and all of a sudden your self-driving sensors go down — a service crashes for whatever reason — the General Motors application has to be notified, because the car is no longer safe and has to go into some reduced mode. So think of being in self-driving mode: this is when it tells the human being to take over — I can't do self-driving anymore, something bad happened. So we had to build this entire system, and it's real: if you go to github.com/containers/hirte, it's available there.
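The command names below are assumptions pieced together from the demo later in this talk, so treat this as a sketch of the shape of the tool rather than its exact CLI:

```sh
# List units across all connected nodes, the way systemctl would for one node
hirtectl list-units

# Restart a service on a specific node (the demo uses a QM node named "qm.fedora")
hirtectl restart qm.fedora myapp.service
```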
It's fairly simple, fairly elegant. It's written in C, again because of functional safety — we have to build code in non-multithreaded environments. So when we're talking to General Motors, we're also going to tell them how we think they should build their applications. We're talking about what structured language you use for defining the applications you run in the environment, and the answer is Kubernetes — we want to use the Kubernetes structured language to run containers in the car. So we're really talking about Kubernetes YAML. Podman has full support for Kubernetes YAML; it understands how to set up containers and pods from it. And the nice thing is that if we build on Kubernetes YAML, then General Motors or any car company can use OpenShift to run all their CI/CD systems: they can take the same definition of the application that's going to run in the vehicle, run it in the cloud, run all sorts of tests on it, maybe even have OpenShift running the actual native operating system and run some of the tests in that environment. So you're using Kubernetes as sort of a scheduler for your whole testing environment, but you have the same language all the way up and down the stack. Podman has supported Kubernetes YAML for many years now: Podman has podman kube generate, and more importantly podman kube play, which can take the same Kubernetes YAML that Kubernetes understands and run it with Podman. Now, since we want to run Podman underneath systemd, we decided to build a better way of running Podman inside of systemd, and that's called Quadlet. To give you an idea where the name comes from: if you play with Kubernetes at all, what do you call it when you take a Kubelet and squash it down? A Quadlet. Okay, that's where the name comes from — real clever engineers. So this is an example of a Quadlet, and this is in Podman now, fully supported; you don't have to get RHIVOS for this. Anybody that's played with Podman in the past knows Podman had podman generate systemd, which would take running containers and running pods and generate a systemd unit file embodying our best knowledge, at the time we wrote it, of how to run containers underneath systemd. The problem is that the result then becomes a static document, right? A static service sitting there. So when some of the RHIVOS engineers looked at podman generate systemd, they said: there's a better concept inside systemd called a generator. With a generator, you define something that looks like a standard systemd unit file, you put an executable in place, and when systemd does a systemctl daemon-reload, it runs these generators, which take something that looks like this quadlet and actually generate a systemd service file from it. So now we get to have a fairly simple definition of what a container is. Here I'm just defining a container using ubi9-minimal and doing an exec of sleep inside of it. So that's really simple, and what it actually generates is this — the actual service file that it produces.
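The quadlet on the slide isn't reproduced in the transcript, but from the description (ubi9-minimal, exec of sleep) it would look roughly like this:

```ini
# /etc/containers/systemd/sleep.container — quadlet source file
[Unit]
Description=Minimal example container

[Container]
Image=registry.access.redhat.com/ubi9-minimal
Exec=sleep 1000

[Install]
WantedBy=multi-user.target
```

On `systemctl daemon-reload`, the quadlet generator turns this into a full `sleep.service` whose ExecStart is the equivalent podman command.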
So you can see the original settings coming through, but it generates a really fancy Podman command here and applies all sorts of systemd know-how to make this work — basically all the knowledge the Podman team has built up working with the systemd team on how to run Podman underneath a systemd environment. I'm already down to 10 minutes — gotta move. Okay, so quadlets also support running Kubernetes YAML in the environment — we can run podman kube play — and again, this is built totally into the system. So we're using quadlets all over the place for running containers. Now, the last concept that we had to work on in the vehicle is called freedom from interference. What freedom from interference means is that we have two types of software that are going to run in a vehicle. You have the functionally safe code — that's governed by the automotive safety integrity levels, otherwise known as ASIL; you'll hear ASIL A, ASIL B, ASIL C, and ASIL D. These are the standard levels for running functionally safe code. We're only documenting RHEL up to ASIL B — that basically covers a lot of the software used for things like driver-assistance features. My earlier example of the brake eventually applying is actually bogus, because that would be ASIL D; we're not putting that in our scope. The second type of code — the applications that run in your vehicle — is going to be called quality managed, which basically means it's quality code, but it might not be functionally safe. Think of quality-managed (QM) code as your infotainment software. In RHIVOS we're describing that you might want to run, say, your infotainment software — your Android operating system and that type of code — inside a VM running inside the QM environment. Other things that might be QM are the seat-heater application, or the windows going up and down — any type of software that's not really involved in making the car safe. And there are other applications: eventually the car companies want to make this a money-maker for them, selling you software in the vehicle, and that software is probably going to come in both ASIL and QM forms. But basically, we have to take the QM software and isolate it from the rest of the car. So we're designing an operating system with two different instances running inside of it: the ASIL software, which is going to run lots and lots of containers, and the QM section, which is also going to run lots and lots of containers. So we had to design a sort of sub-environment for the QM. Now, we could use virtualization for this, but a lot of the ASIL applications want to control the QM applications, so there has to be heavy communication between the two environments. So we've decided, at this point, to use containerization to isolate the QM environment. Say you're driving along in your car and someone steps off a curb: the functionally safe environment launches an application to recognize that the human being is there. Simultaneously, you press the button to turn the heated seat up, which might launch a container and run it. So how do I make sure that systemd, which is probably doing both operations, is isolated?
How do I make sure that the Podman that's starting the container is isolated? We really need to isolate the entire stack. So we're describing running a separate systemd instance and a separate, full Podman instance, and this is how we're doing it. This next section is all about how we're setting up the QM, and we're using quadlets for it. So this is the QM container — and if you go to github.com/containers/qm, you can actually install QM right now on your Fedora 38 systems. QM is basically a systemd unit file, again a quadlet, that looks like this. The top part is all standard systemd directives for setting up things like cgroups to isolate the environment, and the bottom part is all the fields we're using to set up Podman. The first thing we do in the slice settings is identify the entire cgroup — the entire environment that's going to run QM in your car — so we name a qm.slice. Then you can do special things with cgroups. The first one here says which CPUs the QM runs on: my laptop has 12 virtual CPUs, and I'm saying the QM only runs on the bottom six. The rest of the ASIL environment can use zero through 11 — all the CPUs — but the QM can only run on those six. This is easily changeable by General Motors; if they only want to give it two CPUs, they can do that. Similar with CPU weight. The default CPU weight in cgroups is 100, so if you set the QM's cgroup CPU weight to 50, all the processes inside the QM together get one slice of CPU for every two slices the rest of the system gets. I can do IO weight, very similar — and again, these numbers can be changed; you could set it to 10 and get a tenth as much. The next thing I want to quickly mention is the OOM killer. With cgroups, if you start to run out of memory on the system, the kernel can't take memory away from a process — all it can do is shoot it in the head. So what we want to do in the QM environment is say: I am the Katniss, right? Pick me, pick me to kill. The OOM score adjustment goes from -1000 to +1000, and all processes normally run at zero; what we're saying here is that anything in the QM gets priority to be killed over the rest of the system. The last thing I want to show in the systemd part is where we define the software. We're not going to use an OCI image for this environment — we're actually installing the software directly on disk. The software goes into /usr/lib/qm/rootfs, and then in the container section we refer back to that rootfs. So that's how the connection is made inside the quadlet. Now we're into the Podman part — these are the directives Podman interprets on the system. The first thing we do is name the entire container, and it's named qm. We want to run systemd as the primary process inside this containerized environment. We also, in this case, are probably going to share the host network, because separate networking just adds complexity. And we can adjust the set of capabilities available inside the container.
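A loose sketch of the kind of quadlet being described — the real file lives in github.com/containers/qm and differs in detail; the values below are just the ones mentioned in the talk:

```ini
# qm.container (sketch; see github.com/containers/qm for the actual file)
[Unit]
Description=QM sandbox for quality-managed workloads

[Service]
Slice=qm.slice             # everything QM lives under one cgroup
CPUWeight=50               # QM gets 1 CPU slice for every 2 the rest gets
IOWeight=50                # same idea for disk I/O
AllowedCPUs=0-5            # pin QM to the bottom 6 of 12 CPUs
OOMScoreAdjust=500         # "pick me" — QM dies before ASIL processes

[Container]
ContainerName=qm
Rootfs=/usr/lib/qm/rootfs  # installed on disk, not pulled as an OCI image
Exec=/sbin/init            # systemd is PID 1 inside the QM
Network=host               # share the host network to avoid complexity
```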
We probably want to run some fairly privileged processes in here, so we're going to leak in all the capabilities so that we're able to run containers inside. We can add special devices, if you want special devices passed in. We want the entire environment to be read-only, except that we have to have a read-writable /etc and /var — these are the directives for setting up a read-only partition and then making /etc and /var writable. And finally, we want SELinux running inside the QM: we want to isolate containers in the QM from each other and from the host operating system. So all of that generates a huge command line, which basically shows how it all gets converted. The last thing in the QM package is a big setup script that sets up this entire QM environment, and I'm going to run that now, as I'm running out of time. So everybody get off the network so this works fast. All right: QM is a standard package in Fedora 38, and I'm running the script now. The script goes out and installs all the software that I'm about to demonstrate. It installs the root filesystem — I actually destroyed that entire directory, so it's reinstalling it right now. These are the only packages we're putting in the QM: we have the SELinux policy, because we want to run SELinux in it; we have Podman and systemd; and we have a Hirte agent, because we want the Hirte on the host to manage, via that agent, the systemd in there. So we'll have two systemds running, one inside each of these environments. This is the software being installed on the system right now — a DNF update installing those packages — and the script can be run multiple times to update the software after the fact. We're also installing a containers.conf, which is the way we reconfigure Podman inside the QM, and there are a couple of key fields in here. This tells Podman to, again, set up the memory cgroups — that's also the Katniss. So this does two things in the environment. If you recall, the QM itself was at 500; now we set up all the containers as being 750, which, again, means each container should be killed before the QM itself is killed. The memory OOM-group setting also tells the kernel that instead of killing individual processes, it should kill entire containers — whole cgroups — running in the environment. Lastly, in the setup script, we also take advantage of user namespaces: we want to make sure that UIDs in the QM environment are different from the ASIL environment. So we're picking out 1.5 billion UIDs for the ASIL containers and a different 1.5 billion for the QM. If you look up here, I've allocated 1.5 billion UIDs starting at 1 billion for the host containers, and then another 1.5 billion starting at 2.5 billion — to give you an idea, there are about 4 billion UIDs available on a Linux system. The last thing we do is set up Hirte, and all we're doing is saying that the Hirte agent inside the QM environment has the same node name as the one outside, with "qm." prepended. Okay, so that finished the install of the system. I'm now showing the QM service up and running — this is a quadlet that generated a service, and the service is now up and running on the system.
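Two fragments of what that setup plausibly drops in place. The containers.conf key name is my recollection of the option, so verify against containers.conf(5); the UID ranges are the ones quoted in the talk, and the /etc/subuid user names are illustrative:

```toml
# containers.conf inside the QM (sketch)
[containers]
oom_score_adj = 750   # containers die before the QM itself, which sits at 500
```

```
# /etc/subuid — illustrative split of the ~4 billion available UIDs
containers:1000000000:1500000000     # ASIL-side containers: 1.5B UIDs from 1B
qmcontainers:2500000000:1500000000   # QM-side containers: 1.5B UIDs from 2.5B
```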
If I look at the CPU weight — remember we talked about setting CPU priority — I set it to 50. Well, the nice thing is that from the ASIL environment, if it wants more priority while the car is running — something's happening, I need to squash down the entire QM environment — you can do that with cgroups. So I'm changing the CPU weight from 50 to 10, and now my whole QM environment has dropped to 10: for every 10 slices of CPU, the QM only gets one. What's interesting here is that you see a service running under the QM still has 50, but that's actually relative to the QM's 10 — it only gets 50% of that 10%. So everything is isolated inside the environment. Here I'm showing what the QM looks like: I did a podman exec to show you the processes running in it. It's running with a separate SELinux label, qm_t. And now — if you see the podman exec in front, that's basically saying run podman inside the QM — I just ran a container inside the QM environment. I ran another container, and then I ran a container outside the QM, in the ASIL environment. And to show you that there are two different Podmans running with two different databases, you can see the difference in the images in each environment. This shows the user namespacing: I can run lots of containers inside the QM, and notice that they all start with one-billion-something — each one has a separate UID range. And now if I run on the host — my system isn't quite what the documentation says, but you can see they're running at 500-million-something — the containers on the ASIL side run with different UID ranges than the ones inside the QM. I'm out of time, so I'm not going to show everything. With Hirte set up on the system, here's Hirte running, listing all the running services. My laptop's node is called fedora, and down at the bottom you'll see all the services running inside qm.fedora. So now I'm going to demonstrate pulling something down inside the QM environment: I'm pulling down a UBI 8 Apache image, just pulling the image into the Podman database — and good, you were all off the network, that's good. Okay, now I'm setting up a quadlet: a simple quadlet to run that Apache service that I just downloaded, with two fields in it — the name of the image, and Network=host, so it runs on the host network. Now I'm using podman cp to copy the file that I created in the ASIL environment into the QM, and a podman exec to do a systemctl daemon-reload inside the QM — that triggers the quadlet to become a service. Now I start the service via hirtectl: hirtectl says, restart that quadlet service I just generated, and it can do that in the QM environment. Then I list the units and show that the service is now running inside the QM, and I can curl it — not a great demo, but it basically shows that Apache is running inside the QM environment. I can stop it with hirtectl, list units, and show that they're all done. And that is the end of the presentation, except for the shameless plug to buy my book. So I think I'm out of time.
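The demo flow, reconstructed as commands — the quadlet file name, unit name, and paths are assumptions, and the hirtectl sub-commands follow the pattern shown on screen:

```sh
# Inside the QM ("podman exec qm podman ..." means: run podman inside the QM)
podman exec qm podman pull registry.access.redhat.com/ubi8/httpd-24

# myapp.container — the two-field quadlet from the demo:
#   [Container]
#   Image=registry.access.redhat.com/ubi8/httpd-24
#   Network=host

# Copy the quadlet from the ASIL side into the QM, then regenerate units there
podman cp myapp.container qm:/etc/containers/systemd/
podman exec qm systemctl daemon-reload

# Start it through Hirte on the QM node, check it, and hit it with curl
hirtectl restart qm.fedora myapp.service
hirtectl list-units
curl http://localhost
```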
I'm sure they've been flashing that up, but it comes out in a while, so I don't really care. Any questions? Yes. What types of hardware? So right now with General Motors we're working on Qualcomm, and Qualcomm is developing brand-new hardware for the operating system. We've talked to lots of car companies, and the three vendors they seem to want to work with are Qualcomm, NVIDIA, and Texas Instruments. But this is RHEL, right? We want to be able to run on any hardware. We're not building the operating system to run on a specific piece of hardware; we want it to be general purpose. Yeah. Any other questions? Where did the idea come from? That wasn't my idea — that was Alex Larsson. Oh, to get rid of the... to get smaller? Yeah, to get smaller. I'm not sure. Go ahead. Would anything change upstream? We're investigating — there are basically about 100 people working on RHIVOS at this point. Anything that we change or find to help, say, speed up boot will go back into the regular kernel — basically into the upstream kernel. So everything we're working on is going back into RHEL. Right now, I'm not a kernel engineer, so I'm not sure what we've had to fix there. We've had to fix a lot of things in Podman, and we're working with other parts of the operating system — even just going through the FuSa process, we're updating hundreds of man pages, because we're finding problems in them as we actually read them. Yes — I'm supposed to be re-asking the questions, sorry. Are there any worries about disk size? So one of the things I did cut — this slideshow goes on quite a bit longer; it usually takes well over an hour to go through everything — is that we talk about potentially having separate disks for the QM environment, so that the QM environment can't accidentally use up all the disk space that the ASIL environment needs. And traditionally the car companies have lots and lots of partitions — way more partitions than we would normally recommend — basically for isolation like that. The I/O cgroup settings also take away the ability to pound the disk and cause another application not to be able to run. All right, last question. Yes. Do we have any type of monitoring going on? Yeah — you sound like General Motors. What General Motors wants is to know that the car is running out of memory before it runs out of memory, or to know when we get to 80% CPU, things like that. So we are looking at open source projects; they come to us and say, we need this to work, and we don't want to write brand-new code like we had to with Hirte — we're trying to use existing open source projects. So we're looking at different things. Right now it's PCP, Performance Co-Pilot, which I think is the way RHEL does things like that today. But General Motors wants us to be able to monitor things like special devices — we have to make sure they can build code to watch GPUs and things like that. So PCP is what we're thinking right now, but if anybody has suggestions, we're all ears. Anyway, thank you for having me, and Radek, you can take over now.
Yeah, I'll see you soon. Yeah, super. You're going to start teasing me, aren't you? Ah, here it goes. What on earth is that? Ha ha ha. Wait a minute, I'm going to sit down. All right, we're waiting till everybody gets seated. Get your seats — we're locking the doors. So, hey everyone, good afternoon. I'm so glad so many people survived to the very last session here. That's amazing. Even some kids, right? Ciao. How did you enjoy the conference? Good? Yeah? Should we do it again in summer, or is winter better? Summer. Ah, I knew it. Let's do it. Summer, summer, okay. Okay, we're not doing winter. We don't need to. All right, yeah, that was an easy question. We're going to have some harder questions very soon, but before that, we need to thank a couple of people. First, we had an amazing 250-plus speakers, workshop leaders, and people helping here. So first, a round of applause for all the speakers, please. Thank you. Thanks a lot. You'll see quite a few of them in these pictures. What is the color of the t-shirt? I don't know. Indigo Hush. I'm here to provide precise information. Again, thanks a lot — but this conference wouldn't happen without all the volunteers. And I'm thanking Dorka first, because she's the main person here. Yeah. Thank you. Thank you. It's a lot of kid-herding, right? A lot of activities that need to be coordinated and put together. For you this is the third day of the conference; for her it's more like the sixth month of the conference. But I'm so happy it's over, and I hope you enjoyed it as much as I did. But you wouldn't do this without the volunteers, right? That's true. Without you, Radek, as well — so thank you very much for steering the conference, for staying present. And I like this part the most: we always say thank you to all our volunteers, but they are never in the room, because they're still working, tidying the venue. So I hope they will watch this, and make sure to thank them on the way out, because without them, we would not be able to do this. It's more than 100 volunteers, so quite a large group of people. Yeah. It's funny that you said everyone is outside working, with the exception of Paolo, who's dealing with our streams, right? We're on YouTube and you're still playing with that. So thanks a lot for keeping that thing working. If you're watching us online on YouTube, sorry, we can't hear your clapping, but hopefully it works — you definitely are clapping, for sure. But I see some other faces here, so I'm going to name you now. So thank you, Martina, for coordinating all the volunteers. Thank you, Lenny, for all the speaker support. Thank you, Andrea, for all the venue support. I'm so happy that we have such a great team. We have so many people who return to us and volunteer every year — which brings me to: you can volunteer next year as well. We will be looking for people to help us out. It's fun. Yeah, it's a lot of fun, yes. Someone asked me this weekend whether everything works, whether everything is great. And I'm like: no, there were so many issues behind the scenes that you didn't notice, but it was so great, because we all got it sorted and had some fun here and there. Yes, that's the fun I'm talking about. But I have to say one thing: it's really the volunteers and speakers that make this event, it's slightly different every year, and I hope you enjoyed it. Yeah, one last number before we go on: we had over 1,000 people attend the conference. Did you count it? More than 1,100. 1,100? More than that.
That's a great number. At one point, we were worried that after COVID, conferences wouldn't be a thing anymore — travel budget restrictions for everyone, right? Yeah, sorry. So it's great that we still see quite a few people here. You managed to get here, managed to travel, you spent a nice weekend here, so thanks a lot again. Yeah, and I know we have a lot of new people attending DevConf for the first time, so thank you very much for joining us. Hope we'll see each other next year. Perfect. So now we go to the interesting part, the part you're all waiting for, right? Every year, we do this quiz. Oh, I was supposed to put up this slide while we were talking. I did. Sorry. You can tell that this is the last day, right? We always do this quiz thing. If you didn't know, I'm usually asking trivia questions about the conference — how many people were here and things like this — but I already told you those numbers. I usually also ask how many bananas were eaten, and I have no idea. Do you know? So many bananas were eaten — we were so surprised. We had all the fruit out on day one; we thought it would last the whole conference. But you're healthy now and you're well. You're healthy, right? Yeah, it was all for your health. So we're doing something different now: a competition where you will be able to win some of the prizes here. We need to pick the lucky ones who will have a chance to come down here and pick whatever they want. But I'll be honest with you: I got really excited about the fact that we are all meeting here in person. I've seen you all chatting outside; I've seen the shadow track under the tree, which was very popular, right? It was perfect. And I kind of figured that I want you to remember the faces of the people that you've met here. So we'll have a really tough competition now: you will have to figure out who the speakers in these pictures are. The way we're going to play this game is easy. Well, the first part will be a little hard, because you will all have to stand up between these weird chairs, so good luck. Let's start with that: please stand up. We're going to use a very high-tech approach — your left hand and your right hand. For each picture of a speaker, I've put two names. You'll be voting with your left hand for the name on the left, and your right hand for the name on the right. Just please keep an eye on your neighbors so they're not cheating and switching hands, right? Whoever gets it wrong sits down, and we'll play this until we get roughly 10 people, who will get to pick something — and we'll see how fast we are; we might have to do two or three rounds. It all depends on how good I was at picking the pictures, how good some of the speakers were at sending very interesting pictures of themselves from early childhood and things like that, and how good you actually are at guessing the names. So shall we get started? I'll stay here for the first one. This is sort of a practice round. So, in the picture that you see there: is that Dorka or Lenka? Okay, how many people? You have a lot of left hands, good. So this is how it's going to work: whoever used the left hand sits down. Is there someone like that?
Oh, really? Good job, good job. Anyway, there's going to be another round, so don't worry about it. Everyone knows how this works, right? Perfect, perfect. Okay, so let's start with some more fun. So who's the speaker? Is this Martin Stefanko or Stefan Bunciak? All right, this is going to be fun, because I see some left hands and some right hands. So this guy — Martin, Quarkus superhero. I'm going to mention some of the talks. Well, this is going to be super fast, right? We might even do more rounds; I'm so glad for this. So is Martin here? If the speakers are here and you recognize yourself, please stand up. Martin was doing a bunch of Quarkus talks here, and a JBoss user group session here as well. Great sessions — hopefully some of you attended. All right, who's this guy? Is that Michael Hofmann or Phil Taylor? Left hand, right hand. Good. Oh, that's Michael. Everyone got that, right? So Michael was doing a talk today, but we've probably seen him in some other sessions as well. He's working on the Continuous Kernel Integration project, and he had at least two sessions that I know of — I believe he was on a third one as well. Good. Anyone recognize this guy here? Yeah. Most of you, right? So is that Josef Malik or Josef Malik? Yeah, that's Pepa. Oh, yeah. Pepa is a long-term hacker. He used to work at Red Hat, but he's a fan of these odd Linux mobile devices and he was talking about one of them — whether it's suitable for automotive or not. Anyone remember the answer from his session? No. All right, a couple of you actually sent me interesting pictures with your cats, so you're going to see a few more cats here. Florian or Nicolas? And this guy was doing a presentation today — is he still around? Maybe not. Nicolas was doing the Chopping the Monolith session, I believe on Friday, but he also did one on OpenTelemetry. We still have quite a few people, so let's speed up a bit. All right, this skateboarder here: Leticia Bufoni or Katya Gordiva? Good, okay, it's like half and half. That's Katya, good. It was supposed to be a trick question — Leticia Bufoni is a famous skateboarder, like a pro skateboarder. How many people are left? Still too many, okay. Anyway, Katya was doing great talks on MLOps — and was she part of the...? No, that's wrong; she was doing another session, okay. Yeah, she was doing the "what did we learn from machine learning AI and how we implemented it with if-else in three lines of code" one. So, sorry about that. Iker sent me this one, and he basically told me I should show it because he's throwing away a kite. So the question is: which Iker is it, and should we actually reveal the name now? All right, so this is Iker Pedrosa — but Iker Reese is somewhere around here as well, so that might have been confusing for some of you. No? Oh, it's the other way around here, right? Hold on, hold on. Sorry. I said the name right, but I have it wrong on the slide, right? Yeah, let's do another one. Sorry, guys, I confused this one. All right, I have a few more weird pictures, so let's try this one, okay? Is that Lukasz Baran or Robert Piescianski? Good. Robert, our scrum master and agile coach — he had the failure in change management session. Should we do one more, or is that enough? I think we can do prizes now and then one extra round of the game. One more extra round? No, let's do prizes now and then... All right, I think we have enough people for the prizes.
So, okay, whoever is standing right now, come down and pick up your prize. We have enough to pick from. And if anyone has some swag they still want to get rid of, this is a perfect place to give it away. All right, guys, hurry up, because we want to do another round. You're done. All right, so we're doing a second round. Hopefully those of you who guessed wrong immediately in the first round will pay more attention now — I'm giving you some hints. Whoever wants to play again, stand up. Oh, I don't know, I don't really care — if you want to play again, play again, right? All right, so apparently some people are missing the cold winters here, because I received about four pictures of people in sweaters, in blankets, and things like this. Who's this person? Is that Ryan Bludon or Mackey Titch? Good, good. This is Mackey — go big with supercharged agile. Mackey is a software engineer at Red Hat, perfect. Cool, a lot of people guessed right. So, let's do another one. A Batman picture — a really cool one. Is that Kavita or Cine? I guess a lot of people saw her talk yesterday, right? Some of you at least. She was doing the OpenShift OS customization with bootable containers talk yesterday — a really interesting one. All right, let's go for the next one. Dan Chermark or Dan Mach? This is good, okay. So, it was an interesting question. Both actually work at SUSE. Dan Chermark was running the booth here — I think I saw him a couple of minutes ago. He had interesting talks on testing container images with Python and pytest; I think he works on OBS, and there was another interesting session on what he didn't learn at university. Good, let's go on. This is an old picture of a person who first came here and presented at DevConf in, I believe, 2008, maybe 2009 — a long time ago. Is that Paolo Bonzini or Simo Sorce? Yeah, that's Simo. Ha ha. Yes, I double-checked — yes, it is him. Hey, I need to make it a little more difficult, right? So, again, an interesting picture of a person with goats, I guess. Alan Bishop or Matjesz Kowalski? Good, so this is Matt — principal engineer at Red Hat, "can on-prem also be a cloud?" — he's working on the networking stack. There he is. Ciao. All right, another interesting picture — I guess this is from the party late last night. Is that Anton or Vitaly? Good, good. So this is Vitaly — virtualization team; he's been a long-time Red Hatter as well, with many talks here at the conference too. Another winter picture — so another person who's missing the winter, right? Is that Jen Padriaga or Alison Kelly? Everyone knows. Jen, thanks for being here. Jen is on our events team — what team are you on right now? It's the events team, right? I like what you say in your profile: you make fantastic things happen and you connect people. That's perfect. All right, how many people do we have? Still plenty, okay. Another winter picture. Seriously, guys. Is that Remi or Renaud? This is Renaud — the strace session, I believe on Saturday. We have Renaud here now. Perfect. All right. Another person who's missing winter: Carol Chen or Ling Cheng? I'm joking, because Carol is actually from — well, she's from Finland, right? Yeah? You wouldn't guess that, right? Carol is a principal community architect. She had a talk on Ansible community building, and as long as I've known her, she's been involved in Ansible. This guy, I don't know who that is. Is that Paul Dumas or David Duncan? Can somebody guess, please — and what's going on with his brain right now?
Okay, I think this is gonna be the last round for this one. David, how many years have you been at DevConf in a row? It's like eight, six, seven, something like that. Now people will remember you as the guy with a squid on his head. Anyway, you guys come down, because you're the winners of this round. Let's give them some applause. Then let me ask this question: are you guys interested in one last round? Because I still have some pictures we haven't seen. Does it matter? We always have some prizes. There's always something under the table here. I don't know how she does it, but she can solve any problem. All right, you know the drill. One more time. This is gonna be the last time, really. And I know we're tired, so we'll do this really fast. Okay, so Alberto Filosi or Paolo Maldini? Some people are waving at me. Sorry, you have to choose one hand. This is Alberto. Paolo Maldini, I think that's a football player in Italy, right? I just picked some random names, so that was great. Anyway, how come I have the wrong description here again? Sorry, Alberto. I was still working on this last night after the party, so sorry about some of the descriptions. That's a tricky one. Now I have the description right. At least I have the description right here, right? So yeah, you can tell that my slides are not perfect. Anyway, let's try this one. Okay, this is correct, right? Is that Pavla or Ivana? Perfect. You might have seen her here today, I believe, because she was doing the DNF5 talk. Good. I really like this picture. It's very visionary, especially for the DNF team. They are rewriting DNF again, right? Happens. All right, another one. Is that Phil or Eric? Good, good. This is Eric, scaling your organization with GitOps. Cool, again, I'll speed up. We'll see how many bugs we have there. Some people know. Good, we're like half and half. Petra, can you tell us the story behind the picture? It's a game. I asked all the speakers to send me the best picture they have, and this is what I ended up with, Petra. Anyway, Petra was doing a presentation today, right? On safe upgrades in OpenShift. So thanks for doing it, Petra. Cool. Yeah, let's speed up. This is a cool one. Is that Felix or Andrew? And that's another question, right? I don't really know, to be honest. Yeah, this is Felix. Is Felix still here? There you are. What's the name of the dog? I don't know. I don't know. Okay, so yeah, Felix, KubeVirt, right? Thanks, yeah. All right, another Fedora person. Fedora T-shirt, right? Lukáš Vrabec or Lukáš Růžička? This was easy for the people here in Brno, right? Lukáš is the openQA guy for Fedora. Still a lot of people. So let's do another one. Ivan Nečas or Lukáš Doktor? Good, good. I guess... how can you tell? This is Ivan. And yeah, he used to be a guitar player and he used to look different. Anyway, Ivan is an architect on the Insights team, and he had the "correlation, is it not causation?" session. He was talking about data correlations. And here's another one, a great family picture. Maja Massarini or Maya Konstantino? And yeah, it's the person in the red house, right? Maja was doing a talk on Packit, right? RPMs, and she works on that team. Can somebody guess the name? Yeah, you're sitting right here. But yeah, this is Adam. When was the first time you came here, to DevConf? That was a long time ago as well, right? Must be like 10 years, so please. Yeah. All right, so I think we have, what, how many people? Six people here?
So should we wrap it up? Yeah, I think we're done. I have another one, but I think we're done. Let's just sit back and talk about it. All right guys, so thanks a lot again. This was a great thing. Some famous last words from Durka: please fill out the feedback form. You can give feedback to each speaker in Sched, and take some fruit on the way out. Thank you for joining us.