So, according to my clock, it's 11:50, and we now have the pleasure of keeping you away from lunch as long as possible. We'll do our very best to make this a very long session so that the lunch line is empty when you leave the room. Matthias and I are going to talk about observability in OpenStack and present the learnings and building blocks from the SIG Monitoring of the SCS project. The SIG Monitoring is a special interest group of people from various CSPs that meets once per week, where we exchange ideas on how monitoring and observability should be done.

So who are we? To my left is Matthias. Matthias Fechner works at OSISM, which is one of the companies that are part of the SCS project and which develops a good part of the reference implementation of the Sovereign Cloud Stack. Matthias has a fairly long history in infrastructure; he previously worked at Noa's cloud, and he's also one of the people who built up the pluscloud open that is based on SCS. Yeah. Felix has been building infrastructure since the late 90s. He has been part of various open source projects such as OpenBSD and OpenDarwin. In the last years before joining the SCS project, he was responsible for the technical part of GridSquare. He has been a long-time member of the extended board of the OSB Alliance. Thanks, Matthias.

So before we dive into the learnings and building blocks from the SIG Monitoring, I'm quickly going to explain what the Sovereign Cloud Stack actually is and what the project is about. The vision is that the Sovereign Cloud Stack is a federated cloud technology built entirely with open source software, putting users and providers in control. We think that only open source guarantees digital sovereignty through interoperability, transparency, and independence from unlawful claims of certain parties, and thus from any unauthorized interference. Those two are actual quotes from our website, and I like to include them every time I present the project because they describe very well what we want to do. We have five goals with SCS: standardization, certification, transparency, sustainability, as well as federation. So when I say SCS, the Sovereign Cloud Stack is both a reference implementation and a standard that we are building.

Who is behind the project? It is a project of the Open Source Business Alliance and is supported by the Federal Ministry for Economic Affairs and Climate Action. The Open Source Business Alliance is a non-profit organization that strengthens open source in Germany. We have roughly 190 companies as members, and it was founded as a result of a merger of two previous organizations that also worked on a political level for Linux and open source.

The SCS stack is comprised of compute, storage and network, the container layer, and further services. On this slide (I'm going to move my head out of the way) you will see a lot of familiar things in the reference implementation, all coming from OpenStack, and the container layer here is what then becomes the standard that we build. The infrastructure layer is what we call an optional standard, so basically one could also use the container layer on a different foundation than OpenStack. And you see here a lot of the projects that we use and build on. In the description for this talk, we actually wrote that the Sovereign Cloud Stack is built on the shoulders of giants.
So basically, we take a lot of these projects and build on them, but SCS is not a fork of OpenStack. We work as much upstream as possible. We make sure that the companies that work on Sovereign Cloud Stack don't contribute by forking code and keeping it in SCS; instead, people work upstream, so all the good stuff ends up upstream and then just trickles back down. By that we want to strengthen the open source ecosystem and make sure that the projects everyone benefits from stay healthy, instead of being forked off somewhere else. So that was a quick dive through SCS, and with that, I think Matthias wants to talk a bit about observability.

Sure. What about the definition of the term observability? Observability is just modern monitoring. Wikipedia says this about it: observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. Felix, what does this picture mean? Well, what you see here is what observing and monitoring a simple web server could look like. You have one very simple service that you observe and that you can monitor very simply. If you talk about observability in cloud infrastructure, it looks more like this, because it's a whole different beast that you have to tackle: it features all sorts of things and not just a simple service. Matthias, why don't you explain why that is the case?

That is the case because OpenStack is hard to monitor: it is like an iceberg from a monitoring perspective, with most of the infrastructure below the waterline. So let us deal with the SCS stack itself and take a little journey from the past, to the present, and into the future of observability inside the SCS stack. In the beginning, the question is how to build the SCS stack on a green field, and the next question is how to monitor it. The answer in most companies is the existing Nagios, Icinga, or similar solutions. They are well proven, in most cases flexible and customized, and in most cases far from being standardized. And there are limits: you will be informed, but observing the complete environment and foreseeing events is difficult.

So what is the way to design this? In the SIG, we discussed the best practices we have collected over the years. As an operator, we want observation and visualization of our infrastructure and services. As a customer, we want behavior-based observability. The proof of the pudding is in the eating: OpenMetrics-based systems like Prometheus make this much easier. With the available exporters for every needed task, you can collect almost all the information about your environment. You are able to combine metrics with Prometheus queries to create alert rules, and with metrics you are able to get a whole picture of the environment. If you need an exporter that does not exist, it is of course possible to write one on your own. So the decision in our SIG was very quickly clear: use OpenMetrics as the base system for the SCS stack, with Prometheus as the first iteration. As an operator, we are focused on the infrastructure, and as a customer, we are focused on behavior-based observability.

Let's focus on the infrastructure part. We have requirements from the operator side, from the integrator side, and also from the vendor side.
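Before we look at the concrete components, here is a minimal sketch of what combining metrics and queries into an alert rule could look like, in the usual Prometheus rule format. The metric names and thresholds are purely illustrative (node_exporter-style metrics), not rules taken from the SCS or OSISM repositories.

```yaml
groups:
  - name: scs-node-example
    rules:
      # Illustrative only: fire when the root filesystem of a node has had
      # less than 10% free space for more than 5 minutes.
      - alert: NodeRootFilesystemAlmostFull
        expr: |
          node_filesystem_avail_bytes{mountpoint="/"}
            / node_filesystem_size_bytes{mountpoint="/"} * 100 < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Root filesystem almost full on {{ $labels.instance }}"
```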
And if we switch to the software components, we have Grafana and we have Prometheus, which cover all our needs. In the community we discovered that most companies are in the same situation. StackHPC and plusserver had a good base of alert rules and Grafana dashboards, based on the awesome Prometheus alert rules, so we began to integrate them into our system.

One of the things that Kurt and I already mentioned quickly yesterday in our talk on open operations is that we do behavior-based monitoring of SCS clouds. We do that through the public APIs that are available: we create workload and measure the time it takes until that workload is fully available. By that we not only monitor availability, but also get a very clear view of the performance delivered to the end customer. When I joined the SCS project, the OpenStack Health Monitor had already existed for quite a few years, since Kurt wrote it back when he was at Open Telekom Cloud, and it is (and that's not a joke) a four-thousand-line shell script. In one of the first sessions I was in, we discussed how we could extend it, and quickly came to the conclusion that while the OpenStack Health Monitor as it currently is works really well and is awesome, hardly any of us can actually maintain it in a good manner while keeping our sanity. The OpenStack Health Monitor embodies what we call "you need to measure what you manage", and with it we can graph all the details of the SCS clouds.

On the other side, we have the Grafana dashboards for OpenStack, and on the next page we have the Ceph dashboard. Here we see the state of OpenStack, like the cores, the RAM, and how many instances are running at the moment, and we also have the Neutron state and a few other things. On the next dashboard we see the health of the Ceph cluster, which is okay for its size, I think.

Now, in our journey to the present: we now continuously contribute to the awesome alert rules, and the rules are shared on GitHub, present in the OSISM organization. Some of the rules we grabbed from the awesome Prometheus alerts collection, and other things come from the experience of plusserver and StackHPC. So we come now to the future in our journey. Part of how this works is that the stuff being developed now in this Kolla operations repository is then supposed to end up in the official Kolla repositories. So bring it upstream, right. And from my understanding, it was also discussed at some PTG, correct.

So, the future: which tasks do we want to tackle? Observability is more than metrics. We have capacity management, and we also have aggregated accounting, and those bring some challenges. Capacity management is a business case for every company: they want to know how their clusters and environment are growing. And that is something we want to build.

I think one of the topics we actually discussed in the last months, Matthias, was how we observe OVN. That's correct. There exists an OVN exporter from Paul Greenberg that serves our needs. But the problem is that in an OVN (Open Virtual Network) SDN there are several components we want to observe, and every one of these components needs its own OVN exporter, which is designed to deliver a huge amount of metrics. So there is some duplicated data, and that is a challenge to solve in the future.
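To make the behavior-based approach a bit more concrete, here is a rough Python sketch of the idea: create a small workload through the public API and measure how long it takes until it is available. It is only an illustration built on the openstacksdk, with placeholder cloud, image, flavor, and network names, and it is not the actual OpenStack Health Monitor.

```python
# Rough sketch of behavior-based monitoring: boot a probe instance via the
# public API and time how long it takes to become ACTIVE. Placeholder names
# throughout; this is not the actual OpenStack Health Monitor.
import time
import openstack

conn = openstack.connect(cloud="my-scs-cloud")  # entry in clouds.yaml (placeholder)

start = time.monotonic()
server = conn.compute.create_server(
    name="healthmon-probe",
    image_id=conn.compute.find_image("Ubuntu 22.04").id,
    flavor_id=conn.compute.find_flavor("SCS-1V-4").id,
    networks=[{"uuid": conn.network.find_network("probe-net").id}],
)
# Block until Nova reports the instance as ACTIVE, or fail after 5 minutes.
server = conn.compute.wait_for_server(server, wait=300)
print(f"instance became ACTIVE after {time.monotonic() - start:.1f}s")

# Remove the probe again so it does not linger around.
conn.compute.delete_server(server, ignore_missing=True)
```

A real probe would of course also exercise volumes, networks, and floating IPs, and export the timings as metrics instead of printing them.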
To move the behavior-based monitoring beyond the shell script, we ventured into looking at existing frameworks such as Rally or Tempest; I'm sure both of those are known to you. With Andre from plusserver, we dove fairly deep into Rally to see how we could make it work the way we would actually like it to work, and then found out that by doing so we would be abusing Rally for something it's not meant to do. So we looked a bit further, looked elsewhere, and then found out that the people at OTC have, over the last years, developed what used to be the shell script into something that is really, really cool and already pretty awesome. As far as I know, it was presented at the last virtual OpenInfra Summit; the talk was called "Yet Another OpenStack Monitoring Solution", and there's also the YouTube link to exactly that talk. What we will now do is see how we can take that and make it work in a way that is useful for more than just OTC, but also for other OpenStack environments, such as clouds based on SCS, and make sure that we collaborate there in a good fashion.

And of course, if you talk about observability, there's much more than just metrics coming from systems; there are also logs. In the SIG Monitoring we looked at what we want to do in regard to log analysis. If you talk about logs, several use cases come to mind. We went through them and initially started with: OK, it's cool to use them for debugging, but also maybe to proactively detect certain error states, so that you can do something about them before things break, i.e. predictive maintenance. The moment we started discussing the use of logs and how we do log shipping, legal aspects came to mind, because all sorts of other things come up then. And then we discussed: OK, if you were starting on a green field with an SCS deployment, you would want a log analysis that comes with it. But in a lot of cases you don't start on a green field; you are already in a CSP environment where some sort of big log analysis is already running. So it also needs to be possible to ship those logs to an external place. And that's a good example of how we collaborate in the Sovereign Cloud Stack project, because CSPs come together and talk about how they would benefit from such a solution and bring in their requirements, so that we can identify the requirements they have in common and make sure to address them.

And of course, you have a cloud, you have customers, and you want to write invoices for those. So we started talking about the topic of metering, because if you have metrics, you can also do metering. Out of the special interest group we then had a metering meetup that ran for a few weeks, where a few colleagues from Cloud&Heat came together and discussed how they would tackle the subject of metering. As far as I know, we are actually going to have a tender running for it, because the way SCS works is that with the funding we have from the federal ministry, we run tender projects that companies can then apply for, develop the work, and get paid for exactly that. So that's something we're going to do with regard to metering.
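To illustrate the "if you have metrics, you can also do metering" idea, here is a small, purely hypothetical Python sketch that asks the Prometheus HTTP API for per-project vCPU-hours. The metric name openstack_nova_vcpus_used and its project_id label are assumptions made up for this example; substitute whatever your exporters actually expose.

```python
# Hypothetical sketch: derive rough per-project vCPU-hours for the last 30 days
# from Prometheus. Metric and label names are assumptions, not real SCS metrics.
import requests

PROMETHEUS = "http://prometheus.example.org:9090"  # placeholder endpoint

# Average vCPU usage per project over 30 days, multiplied by 720 hours.
query = "sum by (project_id) (avg_over_time(openstack_nova_vcpus_used[30d])) * 720"

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": query}, timeout=30)
resp.raise_for_status()

for sample in resp.json()["data"]["result"]:
    project = sample["metric"].get("project_id", "unknown")
    vcpu_hours = float(sample["value"][1])
    print(f"{project}: {vcpu_hours:.0f} vCPU-hours")
```

A real metering solution obviously needs more than a query like this, which is exactly why it is being tackled as its own project.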
And one of the core principles is that we want to make sure that best practices for cloud stacks are shared and documented, so that the difficulty of providing high-quality cloud services is lowered. So now we have covered the technical aspects. What are the learnings from the SIG? Felix?

Well, collaboration over competition is the paradigm of the hour. We want to make sure that all the CSPs we have in our group actually collaborate instead of seeing each other as mere competition. Even though there are customers that move from CSP A to CSP B, I would actually like to see a whole lot of customers coming from the big hyperscalers over to all the small CSPs, because that's actually the cake that I see. So I think it's time to collaborate instead of compete.

If this has somehow raised your interest, I would really like to invite you to check out these GitHub repos. The top one is the SovereignCloudStack organization and the bottom one is the GitHub organization of OSISM. And of course, there's going to be a recording, so you can always grab it later. We have the QR codes up there, a little bit covered if I'm standing in front of them, exactly. Thanks. I would also like to invite you to not only look at the organizations on GitHub, but to actually join the effort. There is an active Matrix group that you can join, and every Friday at 12:05 we have our SIG Monitoring meeting. So show up and discuss with us, bring in ideas, help us, and get some insights that you can take back to your company or organization.

Yeah. And with that, are there any questions? Don't tell me you all just want to go for lunch and have no questions. I'm devastated. No questions? I'm a bit baffled. OK, no questions. There is, of course, the obvious end slide with our contact information. Thank you to the audience. Thanks for listening.