My name is Luca Bruno, and the topic today is, unsurprisingly, whatever is on the schedule, which is Fedora CoreOS, and specifically the auto-updating part. Our focus today is more or less giving an overview of what doing auto-updates means from the point of view of Fedora CoreOS, what it means for an operator to manage this kind of auto-updating operating system, and how we integrate all of this into the larger topic of observability: how do we monitor, how do we alert when something goes wrong, and things like that. That's more or less just to set the scope for today.

About me: nothing special. I'm an open source and operating systems engineer, mostly a developer; I write a bunch of stuff in several languages, but I prefer and usually work in Rust and Go. I've been doing free software for many, many years; I strongly believe in it and I support it. That's also why I'm working at Red Hat; sorry, I almost said CoreOS again. Before that I used to work at CoreOS, CoreOS the company; then Red Hat, as most of you know, acquired CoreOS, and I'm still here. I used to work from Berlin for CoreOS, in a very small office. We still have an office in Berlin; it's just that it's called the Red Hat office nowadays. Before this I was in the security field, specifically in academic research, more as an engineer than a researcher, doing security work, mostly reverse engineering and modem baseband firmware. That's more or less my background.

More interestingly, today we're going to cover a lot of stuff. There is a lot of content here; I don't pretend to explain every single detail, but I'll give more or less an overview and then zoom in on some specific areas. The first part, for those of you that don't know what Fedora CoreOS is, is an overview of it, especially on the auto-update side, which is more or less a novelty, or at least something done in a bit of a different way compared to the rest of the Fedora spins. Then we're going to cover the components that are part of this auto-updating stack: the server side and the protocol, Cincinnati; the client side, Zincati; and some other helpers, Airlock being the first and local_exporter the second, which we're going to describe later. And finally there is a demo, where I'm cheating a bit, because I'm not going to do it live here this morning; it's recorded and it's on YouTube already, so we're just going to go through that, zoom in on the action, and then see it all together at the end.

So first: how many of you know CoreOS as a company? A good bunch. How many of you know CoreOS as the old operating system based on Gentoo and ChromiumOS? A few. How many of you know Red Hat CoreOS? A few more. And how many of you know Fedora CoreOS? That's a good bunch.
How many of you think that naming things in computing is hard? I would have expected more. So here we're going to try to clear up a bit of confusion. When I say CoreOS I usually put something in front of it, and here I'm trying to always say Fedora CoreOS; if I omit that, it's still Fedora CoreOS.

Our goal with Fedora CoreOS basically was to bring some of the ideas and some of the techniques from the previous operating system, called Container Linux by CoreOS (and before that just CoreOS), into the Fedora world. One of the main features, one of the main ideas, which is basically what I was working on, is having an operating system which is modeled after a continuous auto-update flow. That means that you as an administrator don't have to care about going into every single machine and scheduling updates, doing yum install or apt install or whatever; the operating system takes care of that for you. One step further from that is that we do this kind of continuous updating using a model where packages are no longer the main focus of what we do; instead, the main focus is that we provide an OS image which can be atomically updated and rolled back. In Container Linux we were doing it at the filesystem level, so we were shipping a full OS image. In Fedora CoreOS we are doing it at the OSTree level, which means that every operating system release we make is a new OSTree commit, and we use rpm-ostree to update and switch between the different commits that can be available on a single node.

That's the basics of the operating system from the point of view of how we, as distribution engineers and release engineers, push out updates. There is something more that we want to provide, which is a phased-rollout concept on top of multiple streams. That means we provide you several streams that move at different paces: some are released more often, some with a bit lower frequency, and you decide which one is best suited for you based on questions like "how much stability do I need, how new do I need packages to be", and so on and so forth.
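To make the atomic update-and-rollback model from a moment ago a bit more concrete, here is a small conceptual sketch (my own illustration, not how rpm-ostree is actually implemented): the node keeps whole OS images as immutable commits and flips a single boot pointer between them, so an update either fully lands or the previous commit is still there to roll back to.

```go
package main

import "fmt"

// Deployment is one complete, immutable OS image, identified by a commit.
// (Illustrative model only; rpm-ostree's real data structures differ.)
type Deployment struct {
	Commit  string
	Version string
}

// Node holds the currently booted deployment plus a rollback target.
type Node struct {
	Booted   Deployment
	Rollback *Deployment
}

// Update stages a new commit: the old image is kept around, so this is an
// atomic switch of the boot pointer, not an in-place package mutation.
func (n *Node) Update(next Deployment) {
	old := n.Booted
	n.Rollback = &old
	n.Booted = next // takes effect on the next reboot
}

// RollbackNow swaps the pointers back.
func (n *Node) RollbackNow() {
	if n.Rollback != nil {
		n.Booted, *n.Rollback = *n.Rollback, n.Booted
	}
}

func main() {
	n := Node{Booted: Deployment{Commit: "aaaa", Version: "31.20200118.3.0"}}
	n.Update(Deployment{Commit: "bbbb", Version: "31.20200127.3.0"})
	fmt.Println("booted:", n.Booted.Version, "rollback:", n.Rollback.Version)
}
```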
From our side (I'm going to describe phased rollouts in more detail later), let's just say for now that it's a way of not pushing updates to everybody at the same time, but doing it in a controlled way.

From the point of view of a cluster of machines that are all trying to auto-update, we want to bring a bit of order, let's say, into this mess. In particular, we try not to reboot the whole cluster at the same time, so that a service running on top of it doesn't get disrupted; we do it with some kind of locking mechanism, so that updates go through slowly, one after the other.

And then, at the end of all of this, I want this process to be as reliable as possible; but we are human and we make mistakes, so there is always a phase where we check what is going on, we monitor, and we get alerted if something is bad, so that we can react to it.

On top of this, the components we're going to describe are mostly new software, based on ideas that we already had and had already tried to implement before. Something that we want to push for is making the whole Linux ecosystem, the whole CoreOS ecosystem, a bit safer, mostly from the point of view of memory safety, so we basically opt in to languages that are newer than the old incumbents like C and C++. Most of the software that I'm going to show you is written either in Go or in Rust.

I thought it was a good idea to describe a bit more what phased rollouts are, because they are not a new concept. In many fields, like networking, you have them for auto-updating network devices; in the embedded world you have the same for Android devices, and so on and so forth. In the Linux world that's not so common, so distributions usually don't work this way. I'm going to describe them here a bit quickly, but hopefully clearly.

The main idea behind this is that there is one point where we as developers declare: OK, we just tagged a new Fedora CoreOS release. That happens at a specific point in time, and you cannot split it over multiple events or whatever; it is an atomic point.
Once we do that, we publish some artifacts to the public web. They could be AWS images, OSTree repositories, QEMU qcow2 images, or whatever. Those are automatically available if you know where they are and want to grab them. But the point is, we don't want every single machine that is already installed somewhere in the world to see that there is a new release available and then download it all at the same time, which would basically result in DoSing all the servers on the network providing that image, plus all the machines trying to update at the same time. That is mildly annoying, but the real problem is: if there is actually a bug in the release that we just made, then all the machines are going to be hit by the same bug at the same time.

So what we do with phased rollouts instead is: we publish these artifacts, but we don't announce them on the auto-update channel right away. We do it gradually: we define a time t0 where the update starts to be available on the update channel, and then we define the whole rollout window for it, and over this time we gradually go from zero percent to one hundred percent of nodes receiving the update. This is done on our side, from the publisher's point of view. On your side, if you want to be a bit more aggressive, you can say "I want to be closer to the zero-percent end of the nodes"; or, if you want to be more cautious, you can say "I want to wait until most of the machines around the world have already taken this update, so I can see that there are no bugs, no red flags raised so far, and at that point I also want my machine to update and reboot".

That serves us very well: we reach all the nodes with updates, but we do it in a gradual way, and we can also pause the process. The first machines that we auto-update are ours, so if we see that something is going wrong, if there are bugs, we can actually stop the process and say: OK, this one is broken, we don't push it further; we do a bugfix or whatever, we push the next release, and then we roll that one out. And that's how we do phased rollouts, pretty much.
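As a rough sketch of those mechanics (an illustration of the concept, not Zincati's or Cincinnati's actual code): the publisher defines a start time and a rollout window, each node derives a stable position in [0, 1) from its own identity, and a node only sees the update once the elapsed fraction of the window has passed its position.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"time"
)

// rolloutFraction returns how much of the rollout window has elapsed,
// clamped to [0, 1]: 0 means "not started", 1 means "fully rolled out".
func rolloutFraction(now, start time.Time, window time.Duration) float64 {
	if now.Before(start) {
		return 0
	}
	f := float64(now.Sub(start)) / float64(window)
	if f > 1 {
		return 1
	}
	return f
}

// nodePosition hashes a stable node identity into [0, 1), so the same
// node always lands at the same point of the rollout.
func nodePosition(nodeID string) float64 {
	h := fnv.New64a()
	h.Write([]byte(nodeID))
	return float64(h.Sum64()%10000) / 10000
}

// updateVisible is true once the rollout has progressed past this node.
func updateVisible(nodeID string, now, start time.Time, window time.Duration) bool {
	return nodePosition(nodeID) < rolloutFraction(now, start, window)
}

func main() {
	start := time.Now().Add(-12 * time.Hour)
	window := 48 * time.Hour // 25% of nodes should see the update by now
	for _, id := range []string{"node-a", "node-b", "node-c", "node-d"} {
		fmt.Println(id, updateVisible(id, time.Now(), start, window))
	}
}
```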
Another thing that I want to introduce, because again it's not a new concept in general but it's not part of the usual Linux distribution flow, is making sure administrators are able to observe what's going on across the cluster from a single point of view. Every single word in what I just said is important, it's a key point. We want to provide a good flow for people to observe what's going on, so they don't have to guess: my machine is rebooting now; is it rebooting because there is a kernel bug, is it rebooting because there was a power outage, is it rebooting because of auto-updates? You want to know about this stuff. When the machine is rebooting, is it because a new update has been applied or not? Has it really been applied, or did we just reboot into the old OSTree commit?

You can get all these kinds of answers by, I don't know, SSHing into a machine, or collecting logs into a central Kibana or whatever you use for going through logs, but that's a manual process. It requires a lot of digging through stuff and searching for the information that you find interesting for answering your question. Another approach is, instead of using interactive methods and logs and grepping through stuff, to opt in to another way of doing monitoring and observability, which is exposing metrics at multiple levels, from multiple contexts, and then aggregating them somewhere.

For me personally, and for most of Red Hat at this time, that means this stack of technologies: Prometheus, which gathers all these metrics in a central point; then Thanos, which is a bit of a newer component, for history tracking over a long time; the Alertmanager, which sits on top of Prometheus, for reacting to certain conditions (if those machines are all updating, they are rebooting but they are not applying updates, or they are not coming back after the reboot, I want to get an alert, so I don't have to proactively check for stuff; I just get paged whenever something is wrong); and then finally, what I'm showing here, some way of doing nice and simple visualization, to see at a glance what's going on: that's Grafana, sitting on top of Prometheus. That's more or less my goal, let's say: making all the components here easy to monitor this way.

So let's focus on the technology. What we're going to talk about today are basically these green boxes here. The whole stack, the infrastructure, is a bit larger; there are a few more components that are very cool and also very complex, and they could be covered in separate talks. The parts I care about are the top one, which is the server side, the backend and the protocol, which is called Cincinnati; and then the lower part, which is what is usually running on your own Fedora CoreOS machines. In particular, we're going to zoom in on the agent called Zincati running on every single machine, and on the cluster service providing update management for the cluster, which is Airlock, plus the monitoring part.

So let's start. One thing here: does everybody know what a DAG, a directed acyclic graph, is?
No? OK, a few notions, then. What we're doing here, the main idea behind this protocol and this service, is that instead of just making releases and then letting people guess from which release you can go to which other release, and when you can update, and so on and so forth, we encode this information explicitly into a graph. A DAG is a specific kind of graph, a directed acyclic graph. The two adjectives mean that we start from a graph, that is, a few nodes and a few edges connecting them; every edge is an arrow which points in a direction, so it is a directed graph; and it is acyclic in the sense that there are no cycles in this graph. You can start from some point and walk the graph, and at some point you will stop walking, having explored the whole graph; you can never come back to a point that you already visited and keep going around. That means that if you start from any point in this graph of updates, which is where you currently are, you will reach a final end, which is what you want to upgrade to, and that's what you're going to do. That's the whole idea of the protocol.

The protocol itself is something that the OpenShift architects invented, or came up with the idea of, and we are just reusing it for Fedora CoreOS as well. The server is a simple containerized web service running on Fedora infrastructure. It provides this JSON-based protocol, which describes a graph of updates, and it is used for hinting about updates: clients can query this service to ask "hey, is there any update that you are recommending to me?", and then they are free to either apply it right now, or take this information and keep it for some other action, or for any kind of data tracking or visualization or whatever.

What the server does on its own is scrape the Fedora CoreOS metadata, which is what we publish when we do releases, and then build a graph out of all this metadata. One graph per stream: as I said before, there are multiple streams going at different paces, with different frequencies, and they result in different update graphs. Then this graph is served to the clients requesting it. And there are some specific mutations that we can apply to this graph in order to make our life as distribution developers a bit easier and better, specifically when we make mistakes or when we are not sure about some details. There are three concepts that I'm going to explain in the follow-up slides, which are update barriers, dead ends, and the phased rollouts that I described before, this time going a bit more into the details as applied to this case.
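To give a feel for the data involved, here is a minimal sketch of what such an update graph could look like, with a helper that lists where a node may go next. The shape is modeled on the Cincinnati JSON format (a list of nodes plus index-pair edges), but treat the field details as an illustration rather than the authoritative wire format.

```go
package main

import "fmt"

// Node is one release in the update graph.
type Node struct {
	Version  string            `json:"version"`
	Payload  string            `json:"payload"` // e.g. an OSTree commit checksum
	Metadata map[string]string `json:"metadata"`
}

// Graph is a DAG: edges are [from, to] pairs of indexes into Nodes.
type Graph struct {
	Nodes []Node   `json:"nodes"`
	Edges [][2]int `json:"edges"`
}

// NextVersions returns every release directly reachable from `current`.
func (g *Graph) NextVersions(current string) []string {
	var out []string
	for _, e := range g.Edges {
		if g.Nodes[e[0]].Version == current {
			out = append(out, g.Nodes[e[1]].Version)
		}
	}
	return out
}

func main() {
	g := Graph{
		Nodes: []Node{
			{Version: "31.20200108.3.0", Payload: "commit-aaa"},
			{Version: "31.20200118.3.0", Payload: "commit-bbb"},
			{Version: "31.20200127.3.0", Payload: "commit-ccc"},
		},
		// v0 -> v1, v0 -> v2, v1 -> v2
		Edges: [][2]int{{0, 1}, {0, 2}, {1, 2}},
	}
	fmt.Println(g.NextVersions("31.20200108.3.0"))
}
```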
The first one is update barriers. In the general case we have this graph without cycles, with arrows directed in some direction, and we can go from any point to any other point; there can be multiple paths for updating from version A to version C: going through version B, not going through version B, going through some other version and then arriving at version C, and so on and so forth. All these cases are not complex per se, but there are many of them that can be picked, and there are actually machines going through this update graph at different times. From time to time what we need is: OK, from this point on we want to rely on some feature that has been introduced in the operating system, let's say cgroups v2, or some other kind of partitioning, or systemd support for something, or things like that. In the general case it's very hard to say "we are coming from a release that didn't have this feature and now we are upgrading into another release that relies on this feature", so we can easily break stuff if we don't take care of all this complexity, which for humans is a lot of context to keep in our brains.

So what we can introduce at the graph level is some choke points, which means: if we are before this release and we want to update, we are forced to first reach this intermediate release; and if we are after this release, it means that all the machines have already gone through this update, which introduced some feature, so we can safely use it. As an example, we have v0, v1, v2, v3, up to v5, and in general we can go directly from, let's say, v0 to v5. If we actually need something that has been introduced in v3, we can say: OK, v3 is now a barrier, which means that if you are before it, all the edges you see are only able to reach v3; and if you are after v3, it means that you have gone through v3. There is no way around it: either you installed directly at v4 or later, or you went through v3. So we can say: I know that when I am at v4, it means I've gone through this edge; and if I am at v5, I've gone through these two edges or this other edge; but in any case I've already been through v3, or I started from a later point. That's what we use for enforcing that all the nodes have some specific feature that we need.

Then there are the dead ends. Dead ends are things that we would like not to have, but again, as humans we make mistakes, so from time to time we end up in a situation where: OK, this machine is broken, because we pushed something that is broken, and it cannot update further out of this state, because, I don't know, rpm-ostree is broken, or the kernel is broken, or something else is broken. A machine that ends up at such a v4 will try forever and ever to auto-update, and it will never receive an update, because there is no way to get out of it. That is bad already, but what is especially bad is that the machine itself has no way of knowing that it cannot update further, because in the usual case, let's say the machine is at v5 already, at some point in the future there could be a v6, so it is going to receive the update to v6. So we need some way to signal to every machine that is at that v4: hey, your operating system is in some state that is working for you, but it's not able to auto-update further, and you need to take some manual intervention. That's what we do with dead ends: the protocol itself, the graph, signals to the machine "you cannot progress further from here", and the machine itself will raise some kind of flag, through observability methods, for the administrator to actually take some action.

And the last one is phased rollouts, in a bit more technical and context-specific way. An interesting property is that we start from this graph, we build it from the metadata, and then we apply mutations. Mutations can be generic, like barriers or dead ends, but they can also be specific to specific nodes, and they can also vary over time, specifically in the phased-rollout case. At t0, some nodes may be able to see that there is an edge, for example because they are more aggressive in their update strategy, while some other nodes will not see this edge, because they decided not to be so aggressive. However, over time, in the same exact situation, so the same nodes with the same configuration, both of them will start seeing the edge, because the phased rollout has progressed. So this graph of updates that we push to clients is not set in stone: it changes over time, and it is also a different graph per client. I mean, the set of information, the basis for the graph, is the same for everybody, but the resulting graph depends on each specific node's identity and settings.
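A minimal sketch of those two generic mutations (my own illustration of the idea, not the actual Cincinnati implementation): a barrier redirects every edge that would jump over it, so older releases can only reach the barrier itself, and a dead end drops all outgoing edges from a release.

```go
package main

import "fmt"

// A tiny update graph: versions ordered oldest to newest, with edges
// as [from, to] index pairs (illustrative only).
type Graph struct {
	Versions []string
	Edges    [][2]int
}

// ApplyBarrier forces every update path to pass through Versions[barrier]:
// any edge from an older release that jumps past the barrier is redirected
// to the barrier itself.
func (g *Graph) ApplyBarrier(barrier int) {
	seen := map[[2]int]bool{}
	var out [][2]int
	for _, e := range g.Edges {
		if e[0] < barrier && e[1] > barrier {
			e[1] = barrier // redirect the jump to the barrier release
		}
		if !seen[e] { // deduplicate edges created by the redirect
			seen[e] = true
			out = append(out, e)
		}
	}
	g.Edges = out
}

// ApplyDeadEnd drops all outgoing edges from Versions[v], so a machine
// sitting there is told it cannot progress (and should raise a flag).
func (g *Graph) ApplyDeadEnd(v int) {
	var out [][2]int
	for _, e := range g.Edges {
		if e[0] != v {
			out = append(out, e)
		}
	}
	g.Edges = out
}

func main() {
	g := Graph{
		Versions: []string{"v0", "v1", "v2", "v3", "v4", "v5"},
		Edges:    [][2]int{{0, 5}, {0, 3}, {1, 4}, {3, 4}, {4, 5}},
	}
	g.ApplyBarrier(3) // v3 introduces a feature everyone must pick up
	fmt.Println(g.Edges)
}
```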
OK, and the last part about Cincinnati and the release process is this. We had been doing more or less the same thing with Container Linux. One of the main differences is that Container Linux used to have a proprietary part in this backend, because it was actually a service that we were selling to customers, and all of it was based on the traditional web-administration flow, let's say: you have a database, you have a web application, you have people going through the web panel and setting "OK, now we are publishing this release, we are publishing it over a phased rollout, we take two days", and things like that. Which means that all this process was very opaque, and it was completely contained within a database that was private, on our infrastructure. So it was hard to audit in the first place, and it was a bit hard to observe from the outside what was going on, because if you're not actually the admin who is clicking through the buttons, then you don't know what the state is; you need to communicate with the rest of your team.

So we switched to something which is a bit less eye-catching, it doesn't have fancy buttons and stuff, but it's a bit more DevOps-friendly. Now we do the same things, that is, deciding when to make a release and how to push it to all the nodes, based on a GitHub flow: we open tickets, we open pull requests, we do reviews, we go through a checklist, and we basically do exactly what I just described, pausing, resuming, update barriers, and so on and so forth, but in a way which is easier to track, via Git, via reviews, and so on. That makes it fully public: if you want to follow how we are doing Fedora CoreOS releases, you just watch the repository for tickets and pull requests. It's also easy to audit: you just go back in the Git history and you see exactly who did what, at what time, and why. And there is no sprawling private database: before, if you wanted to set up your own infrastructure, you basically had to ask us as a company to provide you access to our database, so that you could sync it into your internal infrastructure; that is actually what was happening, we were selling it as a service to customers. Now everything is open source, it's free, and you can just follow our process and set up your own infrastructure using our data.

The result is basically this: on GitHub we have a list of releases that we plan to do and a list of releases that we did in the past, and every single ticket, which is then a pull request, is where we go through all the steps in order and push a release. That's basically how we do release engineering nowadays. Again, it's not very eye-catching, there is no fancy panel to show, but it's very easy to follow what's going on at every single step, and it also allows our team, which is distributed over multiple time zones on multiple continents (I'm in Europe, most of my colleagues are in the US), to do this from all over the world, because there is a single coordination point, which is explicit, which is this Git repository.
OK. So far I've covered more or less all of the distribution side: how we publish releases, how we run our services, how they work, what the theory behind all of this is. Now we're going to zoom in a bit more on what your Fedora CoreOS machines are doing. The first component that I'll show is the client-side logic, which is basically querying our service, getting this graph, and then doing something with it. This component is called Zincati. It's an update engine, an update agent. We used to have more or less the same thing for Container Linux; we just had to rewrite it in order to work with rpm-ostree and the Fedora CoreOS processes, but the idea is exactly the same. It is a long-running service, running on every single host, and it's written in Rust. Its architecture is based on a few actors: there is one actor that is continuously updating its view of the graph; there are some actors that take decisions, like "when do I want to update, when am I allowed to reboot", and so on and so forth; and then there are some other actors that actually talk to rpm-ostree, to see what the status is on this specific node, to ask rpm-ostree to fetch new updates, and then to ask rpm-ostree to finalize and reboot. That's exactly what it says here.

The design of this component, again, is nothing new; it's just a bit more influenced by newer technology. Specifically, I zoomed in on a few things here. Configuration is done in a way which is very similar to systemd: it takes multiple drop-ins from multiple directories and then merges all of them together. The configuration format is nothing that we invented, it's just TOML files, which are easy both for humans to write and for other tools to produce, transpiling from some other source. And then this component itself is long-running, and it has some internal state, which is exposed as Prometheus metrics.

As I said, it's an evolution of two components that we used to have before in Container Linux: one is called update_engine and the other one is called locksmith. Traditionally, the first one was written in C++ (it was a huge, complex beast of C++) and the second one was a Golang binary running on the host. What is interesting from my point of view about this piece of software is that before, we didn't have any way to observe what was going on inside the components that were taking care of auto-updates; it was more or less a black box, and we had to guess how the node was reacting to an update. Now we expose metrics, and we expose them in a local way: on every single node you can check what is going on by querying a Unix domain socket. And given that this software is architected as a few actors plus a state machine driving all of this, we actually expose the progress of the state machine via metrics, so you can query the service and ask: are you actually progressing through your state machine? Are you checking for updates on a continuous basis? Are you getting errors from our backend? Are you getting errors from your local rpm-ostree? All without having to check the logs of this service. The service itself is very quiet, let's say: it's not chatty in its logs, it's not putting noise there; most of this information you get from the metrics, and only actual hard failures or errors end up in the logs.

Another interesting thing, which I'm going to show you later, is that given the choice of language, which is static and strongly typed, we can encapsulate all the errors in an exhaustive way. That means that if some error is happening internally, I also know specifically which kind of error I'm seeing, say an HTTP error, or a local error, some kind of parsing failure, or whatever, and we can expose that in the metrics. So the metrics are a bit more useful than just "this service is seeing errors": the metrics actually tell me "this service is seeing errors of this kind, and it has seen this many errors over this amount of time". I know I'm hand-waving so far; I'm going to show you in the demo what it actually means.
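The pattern of error counters broken down by kind looks roughly like this sketch using the Go Prometheus client (Zincati itself is written in Rust, and its real metric names differ; this just illustrates a counter labeled by an exhaustive error kind):

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// One counter family, labeled by the kind of failure. Because the agent's
// internal errors form an exhaustive enum, every failure maps to exactly
// one well-known label value instead of an opaque "something went wrong".
var checkErrors = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "update_agent_check_errors_total",
		Help: "Update-check failures, by error kind.",
	},
	[]string{"kind"},
)

func main() {
	prometheus.MustRegister(checkErrors)

	// Wherever the agent handles a failure, it bumps the right kind:
	checkErrors.WithLabelValues("http_5xx").Inc()
	checkErrors.WithLabelValues("graph_parse").Inc()

	// Expose everything for scraping.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe("127.0.0.1:9099", nil))
}
```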
The next component is another piece that, how can I describe it, is some logic that used to exist inside locksmith, and then we decided to split it off into another component, in order to decouple a bit what the single node takes care of versus what the whole cluster is responsible for managing. The main problem is: again, let's assume that you have a cluster of three nodes, but it could be a cluster of 100 nodes or whatever, well, at least more than one or two machines, and you have some service which is running on this cluster, and the service needs to be highly available, let's say 99.9 percent at least. Then you have a problem: you want both the service to be available and the machines to auto-update, in order to cover security issues, get new kernels, new stuff. So you need to manage, somehow, updating one node at a time: because you have a cluster, any node can be down while the rest of the cluster is still up and the service is running, but it has to happen in a coordinated way, so that the service which is on top of the nodes does not go down, because at all times enough nodes are still up.

This is exactly what Airlock provides. It's a fleet-wide reboot manager, which means that it knows about all the nodes that are in your cluster, it keeps track of reboot status via some kind of locking mechanism, and it coordinates which node can reboot at what time. Specifically, the design, the idea behind this, is that you have a semaphore. Does everybody know what a semaphore is?
Yeah? Not really. So, a semaphore is a primitive in computer science. It means you have some state somewhere, and you have multiple agents that are interested in observing the state or mutating the state, so you enforce some kind of precedence and order between all these consumers, so that at most one, or possibly a few more, can access and mutate the state at a time. This is exactly what we need for the reboots: we want some machines to be able to reboot, but not all of them at the same time, and we want to enforce the fact that one machine is rebooting, which means it takes a lock, it goes down, it applies updates, it comes back, and then it reports back "I did my reboot", so it can unlock its slot. That is exactly what this very cryptic and small sentence says: it's a counting semaphore, which means it's a semaphore that keeps track of how many agents are locking and unlocking, with a maximum number of slots available; and it supports recursive locking, which means that if a machine is trying to reboot, and then something goes wrong, and then it tries to reboot again, it already holds a lock and doesn't need to take another one. And that's exactly what it does.

This service, again, used to be part of the locksmith logic itself, which means that every single node was doing this with some help from a remote database, and we split it off. So now we have a service which is containerized; it's a Go service, a very simple one, and it just acts as a middleman between a database, in our case etcd3, and every single node. In our case we implemented this as a Go service talking to etcd, but you can definitely write something else, like a Ruby service running in a container talking to PostgreSQL, because that's what you know and what you already have. This component as well exposes metrics in the Prometheus format. The idea, again, is decoupling this from the OS: we provide an implementation, but anybody can rewrite it.

From a visual point of view, it works this way, exactly: you have a remote database somewhere, you have this intermediate reboot manager, and then you can define multiple groups of machines, like "I have a group of master nodes, I have a group of worker nodes", and they can have different configurations. For example, in this case the master nodes have one slot, so one node at a time can reboot, while the workers have two slots, so two nodes can be down at any time. And if you want to see it as a metrics visualization in Grafana, this is exactly what you get: you have two groups, the masters and the workers; you have a maximum number of slots, for example one for the masters and three for the workers in this example; and then at any time you have some slots that are locked, which means some nodes are currently rebooting, and other nodes that are waiting, asking for a lock, until a slot is available again.
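A minimal sketch of the counting semaphore with recursive locking just described (illustrative only: the real Airlock keeps this state in etcd, so that it is shared across the cluster and survives restarts):

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// RebootLock is a counting semaphore over named holders. Recursive
// locking: a node that already holds a slot may "lock" again at no cost,
// which covers retried reboots after a failed finalization.
type RebootLock struct {
	mu      sync.Mutex
	slots   int
	holders map[string]bool
}

func NewRebootLock(slots int) *RebootLock {
	return &RebootLock{slots: slots, holders: map[string]bool{}}
}

func (l *RebootLock) Lock(nodeID string) error {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.holders[nodeID] {
		return nil // recursive lock: already held, nothing to do
	}
	if len(l.holders) >= l.slots {
		return errors.New("no reboot slot available, try again later")
	}
	l.holders[nodeID] = true
	return nil
}

func (l *RebootLock) Unlock(nodeID string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	delete(l.holders, nodeID)
}

func main() {
	workers := NewRebootLock(2)          // at most two workers down at a time
	fmt.Println(workers.Lock("node-a"))  // <nil>
	fmt.Println(workers.Lock("node-b"))  // <nil>
	fmt.Println(workers.Lock("node-c"))  // error: no slot
	fmt.Println(workers.Lock("node-a"))  // <nil>: recursive, already held
	workers.Unlock("node-a")             // reboot done, slot freed
	fmt.Println(workers.Lock("node-c"))  // <nil>
}
```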
OK, the last component in all this discussion is a bit more focused on the Prometheus world, on the metrics world. I will start from this diagram, because it makes it a bit easier to introduce. The situation is: Prometheus is something that reaches out to services, usually web services, querying them via HTTP GET requests, and getting back some kind of metrics; that's all it does, and then it records them, shows them, it has a query language, and you can get whatever data you want out of it. The problem in our case is that we have some web services, for example Airlock, which is a container, it is a web service, everything is fine; and then we have some other components, for example Zincati, which is not a web service: it's an agent running locally on your node. We still want to reach this service and get its metrics, but there is no HTTP service running in this agent, and we don't want one, because there is no need for it.

That's the situation we're in, so we need a small component in the middle, which is something that I wrote, and I'm happy if other people find it useful: the local_exporter. The local_exporter runs as a container on every single node. It simply bridges between the Prometheus web requests and the local Unix domain socket that Zincati is exposing, and that's all it does.

There are a few more cases that I think could be interesting for other people, which is why I wrote it as a dedicated component. We end up in this situation with many, many other services in Linux, and right now we don't have a good story: some services expose their statistics over D-Bus, some others have their own file-based protocol where they just write the current statistics to a file, and some others have some UDP-based or otherwise homemade protocol. My dream, or my goal, would be unifying all of these, so that we can observe all these components from a single point of view, from Prometheus, using some kind of bridging between the different protocols and the different ways of exposing metrics. And that's the local_exporter. Again, nothing very interesting here: it's just another Go container, it just bridges to HTTP, you can configure it with a TOML file, it can fan out to multiple services running on the same machine, and it allows you to define multiple selectable endpoints, each with its own configuration, and so on and so forth. You can go as crazy as you want in this area; I did just the bare minimum, so that I can actually observe what's going on with Zincati.
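The core of such a bridge is tiny. Here is a minimal sketch of the idea in Go, assuming an agent that simply dumps its metrics in the Prometheus text format when you connect to its socket; the socket path and port are made-up examples, and the real local_exporter adds configuration and fan-out around this:

```go
package main

import (
	"io"
	"log"
	"net"
	"net/http"
)

// A minimal bridge: each Prometheus scrape of /zincati opens the local
// Unix domain socket and streams whatever the agent writes back out as
// the HTTP response body.
func main() {
	sock := "/run/zincati/public/metrics.promsock" // assumed path

	http.HandleFunc("/zincati", func(w http.ResponseWriter, r *http.Request) {
		conn, err := net.Dial("unix", sock)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		defer conn.Close()
		w.Header().Set("Content-Type", "text/plain; version=0.0.4")
		io.Copy(w, conn) // relay the metrics dump to the scraper
	})
	log.Fatal(http.ListenAndServe(":9998", nil))
}
```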
This is just like I did Just a bare minimum so that I can actually observe what's going on with Zincati And this is I don't know so much about metrics I'm not an observability guy I had some discussion with the Prometheus guys and they have something very similar Which is called the not exporter which is exporting mostly Linux kernel statistic But also some other stuff like system D or NTP or some other some other things But I think that like the main things that I was missing and the main difference is that the not exporter Expect like a single metrics Blob let's say with all the stuff inside while I don't care about like observing the whole state of my Linux machine I just care about some specific services here and there And that's the that's the main Key point the main difference keeping these metrics separates with different points and each one with its own configuration Okay, I think that they cover more or less everything that they wanted I'm gonna go into the demo and into the example and then we have questions The example that I'm gonna show basically is just all these components put together and then observe from Prometheus and Grafana Specifically, there is one piece that we need to configure which is running this local exporter and telling the local exporter Hey, this is how you reach the zincati matrix and point Given that we run everything in containers, whatever possible, and I hope that you do as well In our case we have a bind mount from the host into this container And then we tell the container aid is where we provided you the Unix domain socket as a bind mount from the host And that's it. And then we set it up as we call this zincati and Prometheus knows how to reach it over some HTTP endpoint And and that's it and that's exactly what we are gonna do more right now So from the point of view of I'm an administration administrator, and I just want to see what's going on You can do it in a few ways Let's start from the beginning and one way is okay. I am locally Locked in on a node and I want to manually see what's going on You can totally do it you just Connected to the to the Unix socket where zincati is posing this matrix and then you get the information about what's going on This example is a bit stale, but you get you get some info You get an idea as about like what was the last time stamp for refreshing the cash What was the last time stamp for trying to update the machine and so on and so far? This is what you could do in theory in practice. You don't want to log into every single machine. You want to do something like a Let me start exposing this information over the web. This is one of my canary fedora chorus machines Running with local exporter running on it and bridging to HTTP So this is exactly the same Unix socket information, but over HTTP You can see that here there are all the information that you need for monitoring What's going on for example? What is the current OS version? How many time did we check for updates? This machine doesn't have any error, which is nice But like if there is like a gateway timeout of some kind of like transient network error, you will see it here With the specific kind of it is actually an HTTP 503 error or a genetic one or more specific things And this way you can track like do I have all my machine with updates enabled or did I disable them or some machine? 
This is what you could do in theory; in practice you don't want to log into every single machine, you want to do something like exposing this information over the web. This is one of my canary Fedora CoreOS machines, running with local_exporter on it and bridging to HTTP. So this is exactly the same information from the Unix socket, but over HTTP. You can see that here there is all the information that you need for monitoring what's going on: for example, what the current OS version is, or how many times we checked for updates. This machine doesn't have any errors, which is nice, but if there is a gateway timeout or some kind of transient network error, you will see it here, with its specific kind: is it actually an HTTP 503 error, a generic one, or something more specific? This way you can track things like: do all my machines have updates enabled, or did I disable them on some machines? When was the last time that we tried to refresh the state? And so on and so forth. And you can do this for every single machine.

Now, the next step, and this is the last demo: assuming that you have a cluster with multiple machines, you don't want to check it manually; you want to see it as graphs. This is where Prometheus and Grafana are quite nice, because you can set up a Prometheus scraping all your machines, in this case a cluster of five nodes. Prometheus aggregates all this information internally, and then you can query this database, this real-time time-series tracking, to see what was going on. For example, the demo here shows a cluster of five machines going through two different version updates at different times. You can see that there were five machines starting from one version, and then, one by one, they reboot, and when they come back they are at the new version; the update proceeds and rolls through the cluster without any problem in this case. And then, after a bit, there is a new update available, and it does the same. You can track it at any point: the specific configuration on this cluster allows one machine to be down, and you can see in the graph that there is indeed at most one node down at any time.

In the same place you can monitor the metrics for the reboot manager: is it actually configured with enough reboot slots, are they being locked and unlocked, at what time a lock is taken, at what time a lock is released, and so on and so forth. From the client point of view, you can see every client changing state, getting different metrics with different timestamps, and I think later on there are also error metrics and other stuff.

And that's more or less all of it, going from how we do updates, how we configure stuff, how we manage a cluster of multiple machines that cannot all go down at the same time, and how we actually monitor this stuff in a sane way. What I'm not showing here is that you can actually get paged, get alerts, based on this information, just by defining something like "I want to get paged if more than two machines are down, because my cluster is not configured to allow that; if that happens, it means that something is wrong", and so on and so forth. This is more or less how, in the current world, let's say, we try to organize, monitor, and get alerts on an auto-updating fleet of machines that are behaving like a cluster.
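That alerting condition is, in the end, just a query over the aggregated metrics. Here is a hedged sketch in Go that evaluates such a condition against the Prometheus HTTP API (the Prometheus hostname and the `zincati` job label are made-up examples; `up` is the synthetic per-target metric that Prometheus records on every scrape):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	// Count scrape targets that are currently down. If the result is
	// above the number of reboot slots, something is wrong.
	q := `count(up{job="zincati"} == 0)`

	u := "http://prometheus.example.net:9090/api/v1/query?query=" + url.QueryEscape(q)
	resp, err := http.Get(u)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body)) // JSON result; alert if the value exceeds the allowed slots
}
```

In practice you would express the same condition as a Prometheus alerting rule and let the Alertmanager do the paging; the point is just that the condition is an ordinary query over these metrics.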
That's all, I guess. We have less than five minutes for questions, so I will just go through a couple of them. We start over there.

The question is: do we use Airlock in OpenShift? The specific answer is no, and the general answer is no as well. What I've shown so far is what we are doing for Fedora CoreOS. OpenShift is using the same protocol, Cincinnati, but it's a completely different model. In OpenShift we don't have a cluster of machines that we need to orchestrate; in OpenShift we have a cluster which is orchestrating every single component, including the operating system, which means that the cluster operators take care of updating both the OpenShift Kubernetes components and the OS itself, with some strategies. So they don't use Zincati, they don't use Airlock, but they are based on similar concepts.

Yes, next one. OK, so the question is: what is the relationship between the local_exporter and Zincati? The relationship is: on the Zincati side, I have internal state and metrics, but I don't have an HTTP service. I don't need an HTTP service, I don't want to bind to a TCP port, I don't want to care about firewalling; these are all problems that I don't want to deal with. So on the Zincati side, I can query it locally if I SSH into a machine, but I don't want to SSH into a cluster of 100 machines just to get information. On the other hand, Prometheus is able to query HTTP endpoints exposing this kind of information. So what I need is something, which could even be a bash script or whatever, bridging between an HTTP port getting requests from Prometheus and Zincati exposing this information on a local Unix domain socket. That's all the local_exporter does. It could do a bit more in theory, because I think there are other cases that could be covered this way, but for this specific example, that's all of it. You could replace it, if you want, with socat and bash and something else, and it would totally work; that's fine.

Yes? Yes, correct. So this machine that you see here is just "https://something"; this web service is the local_exporter, and this is what it is replying here. The information inside this page comes from Zincati: the local_exporter is just proxying a request, which is an HTTP request, translating it into the local Unix domain request, getting this information back, and providing it as an HTTP response. That's all of it.

Yes, and that's exactly why these are different components, not part of the operating system. The operating system, as it says in the slides, is just this one; this is the minimum running on every single host. All the other components are, let's say, how I put together this vision of auto-updates and monitoring them, but they are containers. You can see they are all containers, and if they are containers, it means you can schedule them if you need them, and you can replace them with something else if you have a better solution; that's totally fine. This is just my opinion, my approach to monitoring this beast, let's say.

Next one. So the question is: if you want to plug something else, let's say NetworkManager, as a provider of information into this flow, how do we do it, or how should we do it? My take is that NetworkManager would be on this side, just providing metrics in the Prometheus format somehow, over D-Bus, a Unix socket, or whatever it prefers; nothing mandatory here. And then you, as an administrator, will have a configuration that says: out of all these services that are running on my machine, I actually care about NetworkManager, but I don't care about Zincati; so I'm going to provide to the local_exporter container a configuration that says "I want to expose Zincati, and I know how to reach Zincati", and that's all of it; or "I want to know about NetworkManager, and I know how to reach NetworkManager", that's all of it. But I don't think it's worth it for NetworkManager itself to provide this configuration, because anyway, the idea is that you run this in a container, so you need to get the configuration into the container; it's not something that the host itself should take care of.
You could totally have examples, or some kind of snippets, showing how you do it, to make it a bit easier. I'm not saying no, but we're still very far from that point, and I don't want to force things on people. That's it. Done. OK, thank you very much for coming.