We're full. Are we good? All right, they've closed the door, so you guys are all locked in here now, sorry. I'm Chad Furman. I'm a product manager on the Ansible side, but for the context of this conversation, probably more interesting to you is that I was a customer for a very long time. I used Ansible. I installed Red Hat Linux for the first time in 1997, so I've been doing this a very, very long time. But yeah, I came from an oil and gas company, so I know entirely too much about how things work in an industrial setting, so I think this will be a fun conversation. For those who don't know, the microphone doesn't actually project in here. They told us it wasn't projecting, but it's for the online stream. That's why we're taking turns on the microphone, even though you can't hear it.

Okay, my name's Adam Miller. I am one of the software engineers in the Ansible organization, and lately I have been Chad's partner in all things Edge, and specifically Industrial Edge more recently.

Okay, agenda. We're gonna talk about all the things, talk about challenges, and get into what we've actually been working on in the Ansible space. It's all very community-driven, of course, because that's where we start from on the open source side, and then we'll get into really interesting things we've done with common industrial protocols and actual real-world applications, because I think that's what really matters: how do people actually use this stuff?

So edge computing is a huge, huge pain if anybody's ever thought about it. A lot of people have been in the data center for a very long time, but think about how you scale and manage and have interoperability and consistency across all of these different things, and tie them together. How do you make 1,000 devices work the same, or how do you push an application to tens of thousands of devices and manage them and know what their IP addresses are and what they're supposed to do, or what they think they're doing, and handle the telemetry that all comes back from that? I know there are some people here that I've worked with for many years on this, and I don't think anybody's actually solved it yet, but we're getting there.

So, who here knows, I like to make this interactive. So, what does industrial edge mean to you? What's that? Excellent. Anyone else? Yep, absolutely. Anyone else? Fun. All right, cool. Literally anything that makes things. So, agriculture, factories, cars, electronics, anything that could have robots or machinery and all of these things. If you've been in the IT space for very long, we always think about VMs and containers, but when you work with an industrial company, a container is something that ships things in. It is not something that they use for applications. So it's a different conversation, and I have that conversation often. Like, yeah, containers. They're like, yeah, we're running Windows and VMware, and we don't know what a container is, sorry. So, it's fun. So, what we've been working on, Adam and myself, is working with different companies on how do you bring some of the best practices of IT into the OT space? So, OT being operational technology, and that's what a lot of them refer to it as.
So, things like SCADA systems that control robots and SCADA systems that control roller coasters, because yes, those actually do the same things and come from the same companies, which is really interesting. At first, when we started going down this path, me coming from a manufacturing space, I was like, oh wait, yeah, you probably use the same motors to control a roller coaster that you use to control a robot in a car manufacturing plant. Kind of cool stuff. But they don't know anything about IT in this space. Like, to them, IT is the guy that they call when their stuff breaks, and it is a ticket, and it is a waiting period. Now that they're starting to talk to the IT people, instead of them being the enemy, they're becoming friends, and starting to think about things like, oh, so automation isn't just about factory plant automation, automation is actually about IT things too. Because if you say automation to them in the OT space, it means process control automation, not IT automation.

I think I kind of talked about this a bit, so here are all the different things that you see a lot in this space: anything from devices and servers, to sensors that are detecting things in the air. That was one of the big ones where I came from. They're always looking at particulates that went into the air, because there are laws about what you put in the air when you're a gas company, sometimes. It actually goes all the way to substations for electric utilities. And of course there are central servers that are generally run by IT, but sometimes they're actually run by the OT people, and to them, it's just the servers. Like, it's not something they really know much about, it's just something they have to have to run all of the other things that they need.

So unfortunately, we couldn't bring this with us because it would have been really expensive, but at Summit, we had some really, really cool booths with different companies we had been working with. We actually had the Schneider Electric water demo, which is how they do wastewater management in a lot of places. We had it set up so you could turn the valve knobs to change the water displacement in the tanks, and then you could actually rip out a server, and it would continue to work, and then you could plug the server back in, and it would reprogram the server and continue to work. So it was kind of cool to actually show real-world interactions in this space, because a lot of times we're like, hey, here's a computer, and we put a VM and a container on it, and it runs an app. It's actually cool to see stuff doing things.

Am I doing this whole day? Oh, yeah. Just tell me where to jump in, I don't care. You can take this one. All right. Can I have a clicker? Yes. My turn to talk. All right. So one of the things that we've been doing is market research, talking to different customers and finding out where their problem spaces are, and we've identified a handful of business challenges. This being kind of a summary of the business challenges that we're seeing out there from a plant manager actually dealing in this space, this is their day-to-day.
So the IT/OT convergence is still happening, and this is almost like a culture shift that we're seeing in a lot of the industrial, manufacturing, industrial edge, transit, and logistics markets, where it's an adaptation of technology that has existed in the IT space for a while, that we have to modify and evolve and develop new capabilities for to be able to address the OT space. Largely because the devices there are different. The protocols are different. The types of networks they talk on aren't IP-based. There's a lot of legacy technology there that we have to interface with. You find different challenges for things like data gathering. When your networking is either intermittent, bad, high latency, or low throughput, you find yourself having to design or architect the way that you deal with what you would traditionally do in a data center to adapt to these environments.

So, the Purdue model, which anybody in manufacturing would be intimately aware of: effectively, I don't know if this is a requirement or just something everybody does, but industrial manufacturing follows this model, and it is a security model for networking where every zone may only talk to the zones it borders. Nothing can traverse more than one hop, so from the plant level, the near edge, to the operations level, nothing can actually go more than one hop. And this becomes problematic. You can only go from the bottom up. Oh, okay, yeah, sorry. You can only go from the bottom up, so you can't reach back. So this becomes problematic because you have to deal with your ingress versus egress, and you have to figure out how to traverse these things to accomplish automation in a way that you might not have been able to otherwise, or in a way that you would have been able to otherwise. It just kind of depends. So this is a space that, from a traditional IT perspective, is very, very different, or the variables in play are different, more limiting. So we've had to find different methods to accomplish this. I can rant about the Purdue model for like two hours if anybody wants to rant.

Yeah, so this is kind of a marketing slide, because edge is hot right now as a topic. Gartner loves to talk about it. Forrester loves to talk about it. It makes its way around like it's new. It's not new though. We've had retail stores for as long as any of us have been alive, and those of them that have compute resources on site were doing edge computing. This is just something that has become, I guess, more relevant in our space specifically, from a technology perspective, from a software perspective, from an open source perspective, because we are bridging IT to OT, and IT is typically where open source software and the things that we develop are able to address those problem spaces. Now we're extending it into other problem spaces.

You do this. Yeah, you do this. So who all here uses Ansible or knows what Ansible is? Okay, some people don't know what Ansible is, and I asked that question so I could tell you what it is if you don't know. So Ansible, actually, no, I'll give you that slide because that's the next one. So we built a platform around automation and what we've had to do with this, so actually I'll let you do your slide and then I'll go back to that slide, yeah. We did it backwards. Bad lineup on the slides. Ansible's an automation tool. It is also an automation language.
The idea is for it to be simple, composable, human-readable, and, if at all possible, idempotent. Idempotency allows you to only make state change when state change is required, such that you can repeat and rerun that automation over and over and over again and only inflict change upon any system or systems when required. The idea, from the perspective of what we're building and what we have built in the open as an open source project, is for it to be the automation language, the lingua franca of automation. The goal is for it to be universal. You should be able to learn this skill and then take it into network automation, security operations automation, industrial OT automation, IT automation, et cetera. We are doing everything we can to adapt the technology underneath and design it in such a way that it is pluggable, composable, and reusable, such that this automation skill set, and the tools and technology, are adaptable and can be used throughout.

Well, I haven't gotten there. So one of the things that we did, in light of bringing the automation technology of Ansible into the industrial edge, is we implemented a connection plugin for Ansible to be able to talk to programmable logic controllers over the Common Industrial Protocol. These devices are not connected to an IP network. They're not devices you can traditionally reach out and get to. However, they are connected to a backplane. That backplane does interface with a device that has an IP address and does speak IP. So we wrote software that has the ability to reach out to that controller, then translate to the backplane, and then we can traverse those CIP networks and automate programmable logic controllers. And for those who are not familiar, a programmable logic controller, or PLC, is the thing that controls the robot arms, or controls conveyor belts, or controls different elements of the actual factory floor; the thing that produces widgets and boxes and vehicle parts and the bits inside of our laptops and our cell phones. So that can be interfaced using the same language that you can use to automate your Linux system: the Ansible automation language. So now I'll let Chad rant poetically about the automation platform.

So I'll add the PLC thing there. Where I came from, the way PLCs were managed before, and think about how many motors it takes to run a conveyor belt, the only way to go and program them was someone walking around a manufacturing site with a serial cable, plugging into each device, and verifying that it was configured correctly. Now you can just use Ansible to do that, which is kind of fantastic.

So, really short and sweet: we've created a platform around Ansible. Ansible is a command-line tool and language, and there's also a platform that we built around it with a lot of the things you need. As much as I love doing things on the command line, if I'm gonna have to do it against 10,000 devices, I don't wanna do that on the command line; it just is not a good way to do things. So we've created a platform to do all the things that you hate doing, like credential management, integrations, all of the self-service things, and you just go push the rocket to run against those 10,000 devices. Oh, the most important part about this, though, is the execution plane at the bottom. Remember that Purdue model? You use the execution plane with a network mesh to be able to get into those different parts of the environment.
Now you don't have to worry about 5,000 firewall rules to get to every device: one firewall rule, one execution node, and then that can go against all the things in that particular network segment. Give me, give me, give me. Pictures.

Okay, so the one thing that I wanna touch on that Chad just said is the automation mesh. Automation mesh has either a bi-directional or a unidirectional communication method. So if you need to function in an environment in an industrial space that does follow the Purdue model, you can set it up so that certain zones only allow ingress traffic. By doing that, your automation mesh node in that space effectively does a pull versus a push. So to the automation platform everything still looks the same, but everything will do store and forward intelligently based on the routing algorithm on down, and it's just Dijkstra's; we don't have to be fancy about it. This is a tried-and-true algorithm. It does what it does, and this allows us to deal with that space.

So, as Chad alluded to before, all configuration used to be done manually with these PLCs. You had to just kind of trust that it was fine, because there was no validation. We couldn't really do CI on it. It was just a thing. So over here on the, let's see, right-hand side is a Pelican case, and inside this Pelican case is a Rockwell Automation Allen-Bradley PLC, a control node, an emergency stop button, a drive that will actually spin that little puck with the Allen-Bradley logo on it, and then that little blue thing down here is actually an LED. So it could be any kind of status light. And again, we couldn't travel with this, because we just unfortunately couldn't, but the automation to accomplish all of that was written with Ansible, as an Ansible playbook. For those who don't know Ansible, the automation recipe, if you will, is called a playbook. All of this is automated from scratch. The second you plug it in, everything is completely programmed, and it will send that drive spinning, it will stop the drive, it will light up the LED and turn it off. And that was something that we did using this open source Ansible collection that we wrote, called community.cip, CIP standing for Common Industrial Protocol. We have the ability to read and write tags. In PLC language, tags are effectively how you set or unset particular Boolean flags to inflict change upon a running system in a programmable logic controller. And then we can do audit and validation of information, so this allows us to keep an audit trail and do testing.

So this is effectively what an Ansible playbook to command and control those things looks like, and it's just YAML; for those who don't know Ansible, it's just YAML. So in everything that you have set up, you have a hosts keyword, and after the hosts keyword is your host pattern. The pattern selects the set of devices you talk to, and there's a thing called an inventory; the inventory devices are individually itemized, either statically or dynamically, and matched by that pattern. We wanna define whether or not we're going to do privilege escalation; become is to become a more privileged user as part of privilege escalation. We have a set of tasks, each of them has a name (I have an indentation error, please ignore that), and we're gonna call the namespaced ensure_tags module, which is the thing that takes care of the task operation, something like the sketch below.
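(A minimal sketch of the kind of playbook being shown on the slide, using the community.cip collection named in the talk. The connection plugin name, tag names, and exact parameter shapes here are illustrative assumptions; check the collection docs before relying on them.)

```yaml
---
- name: Program and validate a PLC over the Common Industrial Protocol
  hosts: plcs                        # host pattern matched against the inventory
  connection: community.cip.logix    # connection plugin name assumed from the collection docs
  gather_facts: false
  become: false                      # no privilege escalation on a PLC

  tasks:
    - name: Ensure the drive-control tag is set
      community.cip.ensure_tags:     # module named in the talk; parameter shape illustrative
        tags:
          - name: start_drive        # illustrative tag name
            value: true
```

Because of the idempotency discussed above, re-running this only writes the tag when its current value differs from the desired one.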
So we have a name of the tag and a value, and again, that is an idempotent operation. It will only make change when required. We wanna gather fact data, we wanna inspect the current state of the running system so that we can validate it later if we would like, and then also we're gonna verify our firmware version. And, okay yeah, the tag value for the sine wave, that was what that one was. So that is a sine wave generator, which was part of a demo that we had that would actually articulate a sine wave on a human-machine interface. Okay, and that's kind of the goal: for it to be human-readable and easily composable. And this down here, this is a templated variable; this is part of some of the string interpolation magic of the Ansible automation language, and there's plenty of wonderful documentation out there if anyone wants to check that out.

Okay, so beyond that, we also have things for doing device edge. If you have compute nodes in these environments, or in these cabinets co-hosted with your programmable logic controllers, Red Hat has an operating system that is based on a technology called rpm-ostree. It is immutable, it is composable, it is easily distributable, updates are applied as deltas, and it is atomic in operation. And we created some collections to be able to automate the lifecycle management of that entity. So osbuild is how you actually build your operating systems; they are bespoke, they are artisanally crafted by you and your team. You effectively pass in the blueprint, the desired outcome of your image, and from there we can do a full deployment, lifecycle management, we can, I'm sorry, I keep saying lifecycle management like that means anything. We'll set up an OSTree repository, and we will actually automate the distributed update of systems. So, to Chad's comment before about tens of thousands of hosts: we can do this at a scale of tens of thousands of hosts, we can do the full rollout, we can do the update system, and because of the nature of OSTree, we can do it in these fractured environments with bad networking and low fault tolerance, things like that. Because I like to pose this question: for those who run RPM-based distros, or any traditional package manager, dpkg, I don't know, pick one, if you're updating your kernel and it's in the process of rebuilding your initrd and you kick the power out from underneath the server, will it boot? It's Schrödinger's server; we don't know. We don't know until we try. Whereas with this type of system, that cannot happen, because it's either all or nothing, and if there is a failed boot attempt, it will roll back to a previously known good state, and this automation that we've created allows you to do that.

And then the next phase of that is MicroShift. For those who are familiar with Kubernetes, who want to do Kubernetes at the edge, there is a minuscule version of OpenShift, which is a Kubernetes distribution, called MicroShift, and then we have, again, the ability to run application payloads out on those devices, build the images based on the rpm-ostree-based technology, and do full lifecycle management of that as well. Lots of stuff in one slide, I apologize.

Yes, what? So you can do Fedora and CentOS and RHEL with these? Oh yeah, yeah, with these. We can do Fedora, CentOS, and RHEL with these. So we can do CentOS Stream 8 and 9, any current non-end-of-life Fedora, and Red Hat Enterprise Linux 8 and 9.
Yes, and we fully CI test all three of those, across each of their releases. So if something breaks, please let us know. We'll add a regression test and make sure we don't break it again. Yeah, Chad. Oh, no, wait, ha, ha. I already talked through this without the diagrams up. I'm very sorry. Very quickly, this is basically the flow diagram of what these, we have 10 minutes. That's less likely. We can do this, I believe in us. This is the flow diagram of what the automation content that we created does. It allows you to have your image builder host; your RHEL for Edge image gets built there. From there, we have our device blueprint, which defines the curated content that you would like to put into the image that gets generated. We will extract the contents to then serve up for your update system. And then we can roll that out to your device edge images. We also have a boot ISO that can be downloaded, and you can flash devices with it, that kind of thing. And we actually also have a tool set to do credential injection and custom creation of the ISO, so that, depending on what your chain of trust or your processes are, those kinds of things. And then for the MicroShift one, the only thing that you add there is the day two operations of doing the application deployment and lifecycle management. That's the MicroShift one.

Ooh, he added fancy words. Did I? I haven't slept. I arrived in this beautiful town at half past midnight last night, so bear with me. I lost my train of thought. It's all good. Chad, it's your turn. No, we're done. No, that's a sales slide, we don't want that. Questions, because I'd much rather listen to you guys talk than me and Adam talk. We talked a lot. Is he being serious? No, God, no. Chad, Adam, Adam, Chad. Adam Williamson is a fantastic human and a grade-A heckler. Yeah, well, no, I'm waiting for Peter. He hasn't heckled yet at all. No, no, nothing, okay. That's good. No beer for you. Hey. Yes? Yes. Oh. Peter, he needs FIPS mode. Please put an issue in for that, and then I'll send it to Peter. Yeah, so FIPS mode. Excellent. Peter's working on this, it sounds like. At the end of the day, we just use the osbuild bits behind the scenes. Yeah, so the thing that we built is the automation, and what we try to ensure is that it's easily consumable and repeatable. If you need core operating system technology added, we will gladly refer you to our friend and colleague, Peter, in the back there. Tosha? Yeah, so that is sort of a question that goes straight to the osbuild team. We have a bi-weekly, believe me, we talk. Okay, I'll repeat the questions, yes, sure. Yeah. Tosha? Yeah. Do we have a place to plug in CI of the image that is built? That's a him question. Yes. Yeah, I mean, literally, for the build that comes out, the way that we do it is we just provide you an example playbook of how to do it, and you just put in your inputs. At any point in that phased process, you can just pause, extract the artifact, and then send it to a pipeline. And you can actually do that in your Ansible playbook, so you can have the CI validation occur and then allow the playbook to continue and perform your CD for you if you want, or you can pipe that over to another system for that, if you wanna use Argo or something else or whatever is hot this weekend. All of our CI is in the GitHub, right? Yeah, I mean, all of our CI is in GitHub. I don't know if it's particularly the best way to do CI.
The way that we built CI is integrated into the open platform that the Ansible open source project uses, so all the data's there. Everything is open and available. The only thing that's not is our cloud keys, because we don't want people mining Bitcoin on our AWS bill, but yeah, it's all out there. We could definitely, absolutely plug into any system there, and that's kind of the goal: to offer an opinionated method of doing it, but be flexible when needed. Any more questions?

So the question is, do we have one kickstart file per image, and are we generating the kickstart file, or can you add one after? Okay, so if you want to change a config, do you need to rerun the whole pipeline if the config can be addressed in a kickstart? You can do either. We support either. The options are there in the automation to add kickstart changes, or, post-build, you can just supply your own on the boot line like you otherwise would, yes.

So the way we do that is with versions. Versions are either provided or auto-incremented, and you have to actually tell our automation that you want it to auto-increment and create you a build every time you run; otherwise, it will come back and say that that version has been built and is already cataloged and indexed. And my laptop's going to sleep. It's fine, yes.

Role-based access control is your friend. So this was that conversation I had around doing things on the CLI for 10,000 devices not being very good. Within the platform, there's definitely role-based access control, and you would assign different roles to different groups. And coming from a company that was really 12 different companies, that is a very tedious job, but it's very important that you think about who has access to what roles and the separation of duties of who can do what. So definitely you would need the platform, and you'd do that within the platform. Architecting identity is a whole different thing that nobody ever thinks about until they're like, oh wait, we shouldn't let Steve do all the same things that Adam does. There's also centralized logging and audit within the automation platform. The automation platform is composed of basically 14 or 15 different open source technologies integrated together, and among the things in there are the SSO, the role-based access control, policy management, that kind of stuff. If you granted one employee enough access in the role-based access control to destroy the factory, yes, you could find them. It may be too late, but yes.

So the question was, how do we ensure that, or basically, do you have a dev/test type of environment for this? Yes, you should absolutely have development environments. The question is, how much is it gonna cost to have a development robot environment in manufacturing? And I can tell you, because I've set them up: it's really expensive, but it's worth it if you can automate an entire factory floor that becomes hundreds of factory floors. So there is a cost-benefit there, but yeah, do not test this in prod. I mean, I test my code, but I only test it in prod, except at a factory. I think we have one minute if there are any other questions. Out of time. Out of time.

So we actually just announced an AI platform to do that. We're working on it, it's in the upstream in a beta right now, but it's called Ansible Lightspeed. The question was, is there a way to get help writing these? There's also a thing called Ansible Dev Tools that I highly recommend you check out.
It's integrated into VS Code, and the language server, we're out of time, we're out of time, the language server can also be used from other editors like Emacs and Vim if you're, we gotta go. Also, our friends over at Steampunk wrote some cool stuff too. Check it out.

Yeah, one, two, one, two, one, two, three. Is it in orange or red? Red, orange. It's in the green? Good morning, good afternoon, everyone. I'm Alexander. Thank you so much for being here for this session, both live and on the stream and wherever you watch the recording. I'm so excited to be here, and I would really love to thank all of you for the trust that you gave me today, and we will see why in a minute. Many of you have probably seen my face here and there, on LinkedIn or GitHub or whatever. I'm working as a senior specialist solution architect at Red Hat. Basically, I've always worked in the open source community, both as a contributor on my own projects and of course also with cloud-native technologies. I started to look at Kubernetes, OpenShift and whatever else early on. And one of the important things today, and this is the reason for the trust, is that it's the first time for me being here at DevConf and the first time ever presenting at a conference, so don't be so tough.

Well, today's session will be mostly oriented towards the edge and how the edge fits together with events and event-driven automation with Ansible. I prepared a quick agenda to guide you through the steps that we are going to take today. First, we are going to start with an introduction to the edge. What is edge computing? We probably all know something about this, but it's always good to have a quick reflection; and how we can automate some of the most common use cases at the edge, and what the benefits of automating the edge are. Then we are going to focus a bit more on events: discussing a bit what events are and what it actually means to automate with events. What is the actual idea behind this different kind of automation? And I prepared a very small demo that will cover some of the use cases that we pass through during the presentation. And of course, at the end of this session there will be a Q&A, so feel free to also ask questions during the session if you want. It's all fine, perfectly fine.

So let's start from the beginning. We have heard a lot about edge computing, and it's a different way of naming something that we were used to dealing with in the past: when we talked about IoT devices and disconnected environments, we were already talking about the edge, but now it's starting to become a very hot topic. Also because we now have the chance to completely simplify our computing, and our different use cases can be really self-consistent, self-healing, and basically don't require too much presence or too much compute power, for instance, or connectivity, or everything else that is, you know, designed for a data center; we just split it into very little pieces. And one of the biggest benefits of this is that we can really start covering our different workloads, having different sections of our workloads, applications, computing power and everything, in sites that were never covered before or were difficult to cover, because we had restricted protocols, or we had limited access to these kinds of places, because maybe it's a disconnected site or a very disconnected environment.
And of course, one of the biggest things when talking about the edge and automating at the edge is the reaction time. Sometimes we need faster reaction times compared to other potential use cases, like doing normal automation: patching a system, configuring a web server, or configuring our infrastructure. We potentially need to react faster to anything that is happening at our site, and we don't have full control of that, because maybe we cannot access it at that point in time if it's completely disconnected, or we have some limitations in planning.

So why is it a good idea to start automating at the edge? Of course, one of the biggest and most common use cases that we get is configuration management. We want to streamline the configuration of our devices, our sensors, our small computing units that we maybe have there for collecting data or for having some interaction with local components that may not be automatable from a data center, for instance. Or we just want to have a consistent configuration across multiple kinds of devices; each sensor may require a different configuration, and we want to be able to control everything from one side to the other. And we also want security, because maybe at our sites we want to enforce some security policies, or anything that can make our devices and our environment compliant with our own regulations or with government regulations, if we are talking about a government agency or government entity.

And of course, one of the use cases covered in the demo is monitoring. Why can automating at the edge be good for monitoring and remediation? Because when we take control of our devices, having our configuration in place, we are able to produce metrics based on any potential activity. If it's a sensor, it can be the data from the sensor; if it's a computing unit, it can be, for instance, how the model that we are running on that particular device is performing. If we have any potential memory leak, power leak, power outage, or anything like that, we need to take care of it in a very fast way. And we can also implement this in a sort of self-healing fashion, because, and we will see this later in the demo, we can also host the automation at the same site without even caring about it. Because it's there, it will be running, it will be checking that everything is working, and we can specifically configure and extend this kind of automation.

And another use case is the idea of handling deployments. Whether it is a containerized application, whether it is a package that needs to be released, whether it is a firmware upgrade or a software upgrade that we need to roll out, we can do that at scale, and we will talk about scaling in a second. We can do that at scale and potentially with very limited downtime, because we can integrate the automation into our release pipeline. We can, for instance, take decisions based on how the deployment is going, so we can decide to adjust how our deployment is happening. And of course, it has to scale, because when we talk about edge devices, we are not talking about a small bunch of NUCs or a small bunch of sensors. We are sometimes talking about very big numbers.
We have some customers that are starting to get more into edge computing, and we are talking about tens of thousands, hundreds of thousands of nodes or sensors or devices. So we need to be able to scale in a very fast and streamlined way. And the last thing, we already mentioned it, but we need to be mindful of reaction times, because we cannot afford having a component start to fail, or a deployment failing, or losing one of the data collection units that we need. We need to put remediation in place: if a device is failing, we need to find a way to take that device offline, or try to evaluate what's happening, and then be able to make use of it again.

And of course, all these things are cool together, but we potentially need something different to do them. Because normal automation, standard automation, may not be the best fit for these kinds of use cases. We are talking about very static, predefined automation: we have some commands that we issue, and we write our playbooks, if we are talking about Ansible, because we will talk about Ansible, but we just have a configuration in place. This configuration will be deployed, but that doesn't help if we need something different. So what if we need to adapt our automation to the scenario, to the actual problem scenario? And this is where events come into the game.

We have all heard about events. Events are potentially everywhere. When we are talking about a message broker, for instance, when you put a message on a broker, it generates an event. You can create a message, put a message on a queue: this is potentially another event. When a sensor starts to ingest data into our data center or our devices or whatever, it is actually generating an event. It can be analyzed, it can be processed, and we can do something based on the analysis. And one of the other use cases that we often see with our customers is ticketing. So if, in our ticketing system, we are creating a ticket for deploying a node in a remote location, far removed from our data center for any reason, we can put in place some kind of automation based on events, trying to make this automation reactive to what's happening. So when we are creating a request or a ticket or whatever, we can process that and have the automation react to that message, and also potentially close that ticket based on the automation, respond to it, escalate it.

And another important aspect, coming back to monitoring: if we have our platform, think of OpenShift or any other potential platform that we have in place in our environment, we can monitor what's happening in the platform. If the platform emits some alert, we need to be able to identify what the alert is about, find a remediation for it, and apply it. So we need a bit of a context switch here, because we are used to traditional automation where we issue commands, we have a set of defined machines, or potentially dynamic machines, and it's synchronous. When working with events and with dynamic data, we need to be able to react to each and every single event that is happening. So it's not just about, okay, I need to configure this, I have my playbook, this is what happens.
I can also wait for the playbook to end, because I have this task today, and the playbook does this. When we are talking about events, it's quite different, because maybe we have 10, 100, thousands of events and alerts happening in our platform, and we need to be able to manage all of those asynchronously and, of course, just in time. So we need a faster reaction time even in this case. And this is actually how it changes: it's not about just issuing a command; it's the event driving the automation. You have potentially different sources that are emitting different kinds of events all day long, all night long, forever, and you need to be able to react each and every single time.

So basically, if we think about events, we can define three big questions that we need to answer. The first one is: where does this event come from? What is the source of our event? Is it Kafka, is it a webhook, is it an MQTT listener; where does the event come from? Then we need to process it, because, okay, we received the event, we are good to go, it's probably a JSON message or whatever. But we need to do something with it. So we also need to define how we are going to treat our event. And what is the outcome of this processing? It's the work: what do we need to do based on this event? So we need a source, we need something that processes the information, and then we have something that actually runs the automation based on that information.

And this is why we are also talking about Ansible today. Because Ansible is one of the best fits for these kinds of use cases when it comes to event-driven automation. First of all, because you don't even need an agent; we all know that Ansible is agentless. The playbooks are very easy, human-readable content that we can reuse, implement, and distribute. And we have another advantage: Ansible is community-driven and partner-driven, so the community, the partners, and the company all contribute to the code base. And of course, it covers different devices, different protocols, different connection methods, and this is crucial, especially when we are talking about a very heterogeneous set of devices, like it can happen at the edge. And the most important thing is the last one: we have dedicated bits for event-driven automation. We have a collection, which will also be implemented in the Ansible Automation Platform controller, that is actually responsible for managing the events we are talking about. It implements, and we will see this now, new resources that are called rulebooks, where we are going to define exactly what I said before. We are going to define the where, the how, and, of course, the what, which will still be an action, a command, a playbook, or whatever.

So let's see how this is translated. In the event-driven Ansible bits, we have the sources, which can be, as I was mentioning, a Kafka topic. I recently created one for our demo: it is the MQTT connector. We have the community working on all kinds of connectors, and one was created recently for interacting with the cluster. And this is why I was saying it's also easy to implement.
It's not rocket science, because, actually, I'm not a programmer, and I was able to create the Python plugin for the MQTT source from scratch, based on what was already available. And this is quite crucial, because having this possibility of extending the use cases in a potentially infinite way, basically only our imagination is the limit; as soon as we are able to create something to interact with our source of events, we are good to go. This is exactly what we need.

And of course we have the rules. Once we have identified the sources and got the event from that source, we need to do something with it. We will see a very small example from the demo later, but it's basically a sort of event processing queue where you define conditions based on the event. Whether it's a single condition or a combination of conditions, you will be able to relabel or manipulate the event itself to make it understandable to your automation. So we will be able to further process our events even before using them in a condition. And of course, the last part is the actions. The actions are basically a command, a playbook that can be run, relabeling our input, saving the input somewhere. And all of these combined together give us a real way of managing events and driving our automation with them.

So, in a few seconds I will be showing you a very small demo that I created. The use case is kind of funny, because the next slide is a bit weird, but this is exactly the idea, and this happened to me last year. We talk a lot about industrial use cases and medical use cases, which are of course a big thing. But what about the little things, like a fridge in our house? We all have a fridge. And with the fridge, you know, when it starts to get hotter and hotter outside, you can feel that the fridge is not actually working in very good condition. The groceries are not feeling so good; you start needing cold and getting lukewarm water and stuff like this. And this is even worse when you are out of your house and you cannot control it.

So what is the idea? The idea was, okay, let's think about a conventional use case where you have different sensors: one in the house, one, of course, outside, a couple of sensors on the floor and the ceiling of the room, and then one sensor in the fridge. Of course, the fridge should be, as we say, a sort of intelligent fridge, meaning it just has to be able to interact with an external device, whether it's via wireless, whether it's Bluetooth or whatever. You just need something that is able to interact with that device. And what is the actual use case? The actual use case is: why can't we adapt the temperature of our fridge, the working temperature of our fridge, based on what the temperature is at the outside and inside sensors? So what I was thinking was, okay, I cannot implement it for real, because I cannot do that here. So I decided to create a version of this use case in a sort of containerized way. I created a pod with the applications for all the sensors and the fridge itself, and we will see this in a second.

So the digital version is something like this. Everything is based on Red Hat Device Edge. We have a potential device running in our house that is running Red Hat Device Edge. Of course, coming back to the previous session, where they were talking about images and how to create RHEL and MicroShift ISOs and stuff like this.
This is exactly the same scenario. So I created my Red Hat Device Edge installation. I created the Quarkus applications for all the sensors, and I have a small configuration here that is attached to the Quarkus application. The Quarkus application has a configuration that is, let me say, the target of our automation, because we are going to change that configuration with the automation and see how it adapts to what's happening. I deployed a very small installation of Mosquitto, a containerized Mosquitto. And on the other side, we have two other components: an aggregator and an actuator. The aggregator just takes care of getting all the information from the sensors, aggregating it, and sending it to the processor. And the actuator, we will see the connection in a second, will be the actual generator of our event. It will take care of monitoring the data, so if something changes, it will send an event that will then be processed by Ansible, and Ansible will take care of it. And the cool thing is that the actuator right now is just a few basic functions of aggregation; as I said, I'm not a developer, so it's nothing really complex. But the idea is that this component could, in production, be an application running an AI model or a machine learning model, to react in a better way or, of course, to adapt to everything that changes.

So the cool thing about this is that it's not just a very simple use case; I mean, it is simple, it's not so advanced, but it really covers most of the topics that we touched on in the last minutes. So basically, the event-driven Ansible is containerized as well. We have an MQTT connector that is taking the information from a dedicated topic on Mosquitto, and it will actually take some decisions based on what the event is.

But of course, now we've seen the picture; let's just jump for a second to the implementation. I will not go into details about the actual plugin implementation; the repository will be shared at the end of the slides, and there will be a QR code if you are curious and you want to go through it. I tried to make it as modular as possible, so I hope you will be able to just recreate it as you need. But I just wanted to show you, so here we are, our rulebook. Getting back to the discussion that we had before, the rulebook is something that looks very familiar, because it looks like a playbook, but it's structured in a different way, and, from the implementation perspective, it also has different keys, which we can see here. We have the sources and the rules. The sources are the actual implementation of the plugin that is connecting to a remote source, or a local source or whatever, and just taking the information. In my case, it's a very simple one; it's just taking a couple of parameters: what is the host of the MQTT broker, what is the port, and what is the topic? Of course, this can also be extended at will. What is important here are the rules. So with the rules, okay, now we actually have the event. Every single event has this prefix, event, and based on the implementation it can have additional keys that are needed to walk through the tree of the event itself, along the lines of the sketch below.
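(A rough reconstruction of the shape of the rulebook being described. The custom MQTT source plugin name, host, topic, and event payload keys are illustrative stand-ins, not the exact names from the demo repository; the `event.` prefix and the `run_playbook` action follow the standard Event-Driven Ansible rulebook format.)

```yaml
---
- name: Watch fridge telemetry over MQTT
  hosts: all
  sources:
    # Custom source plugin like the one written for the demo;
    # the collection and plugin names here are hypothetical.
    - my_namespace.my_collection.mqtt:
        host: mosquitto.local
        port: 1883
        topic: sensors/fridge
  rules:
    - name: React to a temperature anomaly
      # Conditions walk the event tree under the "event." prefix;
      # the payload keys are illustrative.
      condition: event.payload.anomaly == true and event.payload.temp_delta > 2
      action:
        run_playbook:
          name: playbooks/adapt_fridge_temperature.yml
```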
Usually, if we are talking about JSON messages or something that is browsable, it will be just about accessing it as if we were walking through a tree. And the important part, the what, is the actual action that we are going to perform. As you can see here, it's just about running a playbook: a very simple playbook that takes care of analyzing the event, and, based on the outcome of that analysis, it will adapt our temperature.

So, to see this in action, here we have our sensor data. As I was mentioning, we have three sensors, in the room and outside, and one sensor in the fridge, and I added some humidity, temperature and stuff like this, just to make it a bit more real. And from the fridge's perspective, we also have the power consumption, the eco mode, whether the eco mode is on or off, and, of course, the speed of the fan that is actually working in the fridge. And we have a fixed temperature here: this is the working, the operating temperature of the fridge, and this is what the sensor is getting from the fridge itself.

Now, let's see what we can do here. It's very hot outside, so we are going to just raise the temperature a bit. It's a very simple script that is just changing the configuration of the different sensors to simulate a sort of change in temperature. And let's see what's going on here. So if we have a change, sorry, so let's wait; we will have a slight change. So basically the temperature just changed; it jumped like four or five degrees. It's a very typical summer situation where we are just jumping from one temperature to another. And not much is happening here at this point in time, because the data is starting to be transferred to the aggregator. The aggregator is writing the data here, and we have our actuator that is gathering this data and taking care of trying to find a sort of anomaly. Let me say, it's a very rudimentary anomaly detection, based on some thresholds and on previous iterations. And on this side we have what's going on in the Event-Driven Ansible container.

So what do we have here? Something happened. The actuator noticed that there was something that changed in the temperature; we had a very slight temperature change. Coming back here, looking at the log, we just received this event that has this structure. It's basically a JSON message saying, okay, there is something that is not working. It goes through the processing part, it finds that there is an anomaly, and it brings some information about what the delta of the temperature is. Again, it's a very rudimentary detection, so it's nothing precise, but it's still working, and it takes some information from the event, runs the action playbook, and changes the configuration. It basically changed the configuration of the fridge, and the fridge was our target. And this is basically what's going on here. And as you can see, it's really nothing complex, and it takes a very small amount of time to have good information in place. Of course, there is a bit of preparation behind all of this, but, again, even looking at the code, it doesn't require so much effort to start working with events. Of course, it can be as complicated as we want.
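(The remediation playbook itself isn't shown in the transcript; as a minimal sketch under assumptions: a hypothetical HTTP config endpoint on the fridge application, illustrative event keys, and the standard `ansible_eda.event` variable that Event-Driven Ansible hands to a playbook run via `run_playbook`.)

```yaml
---
- name: Adapt the fridge operating temperature
  hosts: localhost
  gather_facts: false
  vars:
    # Event data handed over by the rulebook; key names are illustrative.
    temp_delta: "{{ ansible_eda.event.payload.temp_delta | default(0) }}"
  tasks:
    - name: Push a new target temperature to the fridge application
      ansible.builtin.uri:
        url: "http://fridge.local:8080/config/target-temp"   # hypothetical endpoint
        method: PUT
        body_format: json
        body:
          target_temp: "{{ 4.0 - (temp_delta | float) }}"    # naive compensation, for illustration only
        status_code: 200
```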
But as you can see here, again, I'm not a programmer, so I kept it very basic, and I really hope it was a good use case, to be honest. And, yes, that was the last part of the demo, actually. So I really hope you enjoyed it. If you have any question or discussion, or you want to raise something, feel free to do that. I'll be more than happy to go a bit deeper on it. Yes, please go.

So, you mentioned that you provide rules and the events. Now let's assume we have a larger environment where we want to automate a lot of things, and let's say thousands of rules to manage it. At the end of the day, my codebase would become messy. If I want to change something, if I want to change one of my rules, managing all of the rules would be a little tricky. How do you manage all these rules together in one place and make it easier for the developers in that situation?

Okay, so that's okay. Just a couple of minutes. So, if we have multiple rules, how can we handle the granularity of managing all these different kinds of rules, especially when multiple developers are working on the same kind of automation? Well, this is actually how you would treat any complex problem. You are not going to work on one big bunch of rules; you can, for instance, split the different rules based on what is needed by what. So, in the case of having particular kinds of events, like, I don't know, in the case of the temperature: say you want to manage the temperature, the humidity, and a lot of additional data. You can try to split those into smaller sub-problems, like having a dedicated rulebook that just takes care of that particular use case. So you can have a rulebook for the temperature and a different rulebook for the humidity. So if you're working on the temperature and I'm working on the humidity, we don't interfere with each other, and we can still work in a, let me say, sustainable way. This is a very good question, because beyond the number of rules, think also of the sources. Here we have worked with a single source, but we can have multiple of them, and each one of them can produce different events; even the same source can produce many types of events. So what I would suggest is to try to split it into better manageable problems. I think that would be a good idea to start with. Please.

So, okay, the question, if I got it correctly, is how I actually gather the events: do we have a listener, or is it just pushed from a deployed client? Okay, so basically the sources are listeners, or can be listeners. They are always up and running, waiting for someone to push. Think of a webhook: you have a sort of web server that is running continuously, and you have someone calling it, so you have a client. In the case of a Kafka event, for instance, you have a client also on the event-driven side, because you have someone connected to the Kafka topic, just waiting for something to happen, and you also have another client on the other side that is actually producing the event. Sorry, okay. Yeah, okay, thank you. And so, when you're talking about asynchronous messaging, you still have two different clients, but it's not, you know, just waiting for something to happen.
So, and of course, you can also extend this kind of thing, because, as I mentioned, it's just a Python plugin that you can create and reuse for everything else. Thank you. Do you have any other questions? Good. Thank you. Thank you so much. If you are interested, I have just one ask for you: it's the first time for me here, so please make good use of the feedback form. If there is something that you didn't like, especially if you didn't like it, please let me know. Thank you so much. Thank you.

This doesn't work. Okay. Like this. Okay, okay, okay. Close the door. All righty. Welcome, everybody. It's so nice to be here after three years and finally meeting everybody in person. I missed it so much. I'm an engineer; I've been working for the past eight years on containers, OpenShift, and now the edge. Right? So, today we're going to talk about resiliency at the edge. Who we are: there I wanted to put a team picture, but I basically met my team just now. You know, we've been working together for three years, no picture together in real life. So, we are the team building RHEL for Edge. We also maintain Fedora IoT along with the upstream community, and we develop all the technologies for the edge, like FDO and Greenboot, which is the topic of this presentation, and all the parts in Image Builder that make it possible to actually build RHEL for Edge. I lied, this is my team. You can see a couple of funny faces, like Micah's there. You gave permission to share this. So, this is my amazing team.

So, as I said, we build the operating system for the edge, or, you know, that's what we call it, and RHEL for Edge is built using Image Builder. It can be Fedora and RHEL; the team focuses on both in the day-to-day, and we work upstream first and in the upstream in general. All of the RHEL for Edge system is based on OSTree and rpm-ostree, and it runs on those tiny devices, like the Fitlet2 or the Intel NUC, and it's made of a curated package set: only the things that we need at the edge, not the things that we don't need at the edge. You don't want a huge attack surface. You want the system to run with a minimal amount of RAM and CPU. So, that's how we've designed it.

So, before talking about resiliency: the way we build RHEL for Edge is we create an OSTree commit with Image Builder, and then we serve it as a remote repository back to Image Builder. With that simple blueprint, it creates the OSTree commit, and then, when we serve that, we can build an installer like Anaconda or the simplified installer, you can build a raw image, you can build pretty much whatever you want; kudos to the osbuild folks. So, after you have this artifact, which can be a raw image or an installer, Anaconda itself, what we're going to do is basically flash the OS to a device. That's the provisioning part, which I have another talk after this one to cover. As part of that flashing, the provisioning itself, there is some sort of tagging that you want to do. And with these edge use cases, you provision the device with an operating system. The analogy here is: when you go and buy a Windows laptop, that thing ships from the manufacturer with Windows pre-installed, but it's not activated. So, this is pretty much the same. You have a bunch of devices, all provisioned with RHEL for Edge, for edge use cases. So, it can be a sensor at the top of the Everest.
Anything that is tiny, or not really tiny, that we don't have easy access to. Exactly. Yeah, it can go to space too; that's what Peter is saying. And then once it's at the Everest, somebody just powers it on and it does onboarding. Some of my team members are going to present the onboarding system, so make sure you sign up for that talk. At that point, this device is at the top of Mount Everest, up and running, all good, sending sensor data back to, I don't know, the main data center, or actually crunching data right there on the mountain. So whoever shipped the device to the Everest, actually walking up there, just powers it on and goes back to the office. That is easy, right? Now, this is day-two operations instead: the device needs an upgrade. So that is the blueprint that creates an upgrade, and it's still an OSTree commit that we build using Image Builder; in that example, I just added nano and a file. Then we serve this commit, build the upgraded OSTree commit, and then what we do is basically just run rpm-ostree upgrade and reboot. Once we run rpm-ostree upgrade, we get this kind of output, and it says: just run systemctl reboot and everything is going to be fine; you're going to boot into the next version of the operating system. This is cool, right? So, what can possibly go wrong with this scenario? Well, upgrades don't always go as planned. Probably the worst nightmare is something like this: you have a device at the top of Mount Everest and it's like, oh my God, I can't connect to it anymore. There may be an issue, right? So there are a couple of solutions for this. The very silly and simple one is to send somebody to the Everest. That's probably expensive and tiring. For those who know, that's Noemi, my partner; and no, that's not even the Everest, of course. So, somebody has to go physically there and debug the device, which, you can imagine, is probably not going to happen; it's really unlikely. You have these tiny devices everywhere in the world, or in space, and you want them to run smoothly, to upgrade, to continue working, to run the applications that have been deployed on them. Everything has to be smooth, so this is not going to work. And that's when greenboot came into the picture. Greenboot is basically a tool that makes it possible, on an rpm-ostree and OSTree system, to go back to the previous deployment if there is any issue in the upgraded deployment. I'm assuming some familiarity with rpm-ostree: basically, you have two deployments at all times, and when you upgrade, after you reboot, you go into the new one. With greenboot, if you add some checks and the new deployment doesn't work or doesn't behave the way you want, it goes back to the working deployment. So for any upgrade that goes wrong, you have a chance to revert it, and then maybe debug it further or ship another upgrade, things like that. Greenboot was created as part of Google Summer of Code in 2018. The whole idea was created and led by Peter Robinson, here; Christian Glombek was the intern working on greenboot at the time, with Jonathan Lebon as a mentor too. And again, this was born out of Fedora IoT and the need to have unattended rollbacks on these devices, because otherwise you need to send somebody back to the Everest. Unlikely.
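For reference, the day-two flow just described looks roughly like this. First the upgrade blueprint, as a hedged sketch; the blueprint name and version are made up, and the extra file from the demo would be added through a files customization, which is omitted here:

  name = "edge-upgrade-demo"   # hypothetical
  version = "0.0.2"

  [[packages]]
  name = "nano"
  version = "*"

And on the device, the upgrade itself is just:

  rpm-ostree upgrade    # stages the new OSTree deployment
  systemctl reboot      # boots into it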
The whole project has been designed around rpm-ostree, because of course on a traditional system it's not as easy to ship an update and roll it back without any disruption. It has been made to work with GRUB: right now we just support GRUB, but any UEFI boot system or whatever could be made to work; if you have ideas, if you want to contribute, feel free. So, with these two pieces in mind, rpm-ostree and GRUB, Christian created the very first implementation in 2018, and how greenboot works is pretty easy. Somebody runs rpm-ostree upgrade and systemctl reboot; when rpm-ostree upgrade is called, greenboot sets some variables in GRUB, which I'm going to explain in a little bit, and then we reboot. At the next boot, greenboot starts a couple of health checks, and based on those it runs some boot-status scripts, meaning: okay, the deployment is okay, now I'm going to send an email to an administrator, maybe, saying everything is okay; or, worst case, the upgrade doesn't work and we need to roll back. And on the rollback: there can be transient failures, so you may want to retry the new deployment a couple of times, or many times; maybe it's a network hiccup or something. In GRUB, the way Christian made it work, and the way it is today (there's Christian, he's outside), is basically by leveraging a couple of GRUB functionalities, like setting variables. There are two main variables that drive the whole greenboot process. The first one is the boot counter: boot_counter is set when the upgrade is done but before the reboot happens. Once boot_counter is set and we reboot, every time the deployment fails we decrease it until it goes negative; that's the signal that we need to roll back. Hopefully that's super easy. The other variable is boot_success: before the upgrade we set that to zero, then, if the new deployment works, we set it to one. Those two variables basically drive all the greenboot operations. We ship some templates in grub.d just to make this logic work, as I've explained; hopefully it's clear. There are two main ones: one does the fallback counting, so it basically just decreases boot_counter, and the other handles boot_success, which is zero before rebooting and is set back once the upgrade goes well. That's the GRUB side. Greenboot itself is a couple of scripts and, at this point, mostly systemd services. It has a state-machine-like operational model: every time, it checks the variables and where it is at that moment, and acts based on that. The first service we have is greenboot-grub2-set-counter, which is directly wired to rpm-ostree, so that when rpm-ostree runs an upgrade, it fires that service; greenboot intercepts it and says, okay, it's time to set boot_success to zero and boot_counter to however many retries you want for that specific boot. Then there is greenboot-healthcheck, which is the main service; it's ordered before boot-complete.target, and what it does is run the custom health checks that anybody can add to greenboot in order to say: this deployment is good, or this deployment is not good. If that's not clear, I'm going to show an example later in the demo, so that it becomes really clear. So, if the health check service passes, boot-complete.target is reached, and then there is another service setting the success variable in GRUB and resetting the boot counter variable.
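To make the GRUB side concrete, the shipped grub.d templates amount to logic roughly like the following. This is a paraphrased sketch, not the verbatim templates; the exact snippet names and wording in greenboot's grub.d files may differ:

  # fallback counting: runs at every boot, before a menu entry is chosen
  if [ -n "${boot_counter}" -a "${boot_success}" = "0" ]; then
    if [ "${boot_counter}" = "0" -o "${boot_counter}" = "-1" ]; then
      # out of retries: fall back to the previous (second) boot entry
      set default=1
      set boot_counter=-1
    else
      decrement boot_counter
    fi
    save_env boot_counter
  fi

  # boot success reset: zeroed before each boot; a greenboot systemd
  # service sets it back to 1 only after the health checks pass
  set boot_success=0
  save_env boot_success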
If everything goes well with the new deployment, there is a further option to run additional scripts, in the green.d directory; again, it can be an email to Peter saying, okay, the deployment is good, all good. If it fails instead, there is the whole rollback mechanism, run by the redboot-auto-reboot service, and it does, as the slide says, a series of checks to determine whether manual intervention is required, because that can happen too; otherwise it just reboots the system and tries again, if greenboot has been configured to retry. And when the health checks keep failing, the red.d scripts run, and at that point you can still send an email to Peter, saying: okay, no, this deployment doesn't really work, you don't want to boot there, maybe let's create another one. So, this is the very basic directory structure for greenboot itself. As I said, the required.d scripts, under /etc/greenboot/check, are the ones that must pass: if you boot into a new deployment and any of these scripts fails, the whole boot is marked as not good, and what greenboot does is either retry or just rpm-ostree rollback to the previous deployment. The wanted.d directory holds scripts that may fail; those are failures that are okay to happen in a new deployment. Something like network, for instance, is probably better in required.d, and then in wanted.d you put other things, assuming the network has already been verified. And as I explained, green.d and red.d are just boot-status scripts: if it's green, run these; if it's red, run the others. Yeah, this is the configuration; I know it's pretty slim, but these are the configuration variables we implemented in greenboot. GREENBOOT_MAX_BOOT_ATTEMPTS is the one that takes care of rebooting up to that number of times; the other two are part of a watchdog script, but I'm not going to get into that, because the watchdog is probably a whole topic on its own. Those are the ones we support today. The first one is probably the most important, and in my demo it's configured to three, so you will see the virtual machine rebooting three times. All the health checks are scripts, bash scripts; I know, we're working on that, though it's not necessarily bad. This is what you actually write in order to have something run at the next boot and either mark the boot as good and keep the deployment, or mark it as something that has to be rolled back. This is bash, pretty easy: if that file is there, we fail and we don't boot into that deployment; if it's not, everything is good and we stay in the new deployment.
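A hedged sketch of what that looks like on disk; only GREENBOOT_MAX_BOOT_ATTEMPTS is taken from the talk, and the marker file name is made up for illustration:

  # /etc/greenboot/greenboot.conf
  GREENBOOT_MAX_BOOT_ATTEMPTS=3

  # /etc/greenboot/check/required.d/01-demo-check.sh
  #!/bin/bash
  # Fail the health check (and therefore the boot) if the marker file
  # shipped in the deliberately broken upgrade is present.
  if [ -f /etc/demo-failure-marker ]; then
      echo "marker file found, marking this boot as failed"
      exit 1
  fi
  exit 0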
Now, the demo. Can you see? Awesome. This demo basically demonstrates what I've been saying: we're going to run the upgrade and reboot together. I didn't trust the network to do all of that live, so I built the upgrade in advance. You can see here that right now I'm at this version, 5.7; this is normal rpm-ostree stuff, so we are booted into this deployment. I've built an upgrade instead, days ago, with this blueprint, where you can see that I've added that file, which is the one that is going to fail. It's really silly as a demo, but it explains how this works: the new upgrade contains the file that is going to make the new deployment fail. I built that using Image Builder, and if we follow the rpm-ostree commands, I can ask for the upgrade, and you can see we're going to upgrade to 6.7. There is a new package added; this is OSTree at work, you remember the blueprint. So now what's left to do is just reboot the system. Does it work? It works. So right now it's upgrading, and you can see we added the nano package. This could be a normal upgrade during normal operation: somebody needs something, a new customer application, a new program to send back sensor data, whatever; it can be anything. At this point, what we do is just systemctl reboot, and what this does is basically try to boot into the new OSTree deployment. Hopefully it's visible, but it's not really important: two boot entries should be shown... okay, there is some issue there. Okay, so I'll narrate from here, since this is not going to work anymore; I prepared for this, because this can happen. After I rebooted, what was going to happen, and I'm sorry for this, is that two boot entries would be shown: the old deployment and the new deployment. We try to boot the new deployment, and as soon as we are in it, the health check I showed you before runs, checking whether that file is there; it is, so redboot runs and reboots the system. At the next boot, and for three times, we still have the two boot entries, and we retry the new deployment; after that is done and we are at minus one with the boot_counter variable, it rolls back. It's super silly, and the live demo didn't work, but that was basically what I wanted to show; I'm sorry for that. All right, so, what's next for greenboot. We have a couple of points we want to enhance in the future. Not everybody likes bash scripts, so what we're doing right now is rewriting the code in Rust, and then perhaps enhancing the API so that anybody can drop in any executable, based maybe on exit codes, zero and one, things like that; anybody can write their own health checks the way they want, in the programming language they want. What we're going to do next, too, is better rpm-ostree integration: right now we're basically hijacking the way rpm-ostree handles the different boots, but we want to do better there, starting in Fedora IoT. We've already started talking with the CoreOS folks, the folks maintaining rpm-ostree, to actually do that. There are some hiccups with greenboot itself and rpm-ostree. The biggest one, to me, is that /etc is writable. That means that if somebody messes something up under /etc while a system is running, and then runs an upgrade, and that mess makes the greenboot health checks fail, then greenboot is going to be fooled by it, saying: okay, this new deployment is actually not booting, it's not green, so I'm just going to roll back. And this opens the door to all sorts of attacks too: maybe you're upgrading precisely because there is a CVE, but again, greenboot can be fooled, and that's not good for security either. Lastly, we have some actual live users of greenboot. The most notable is MicroShift: they are implementing health checks to make sure they can go back and forth with MicroShift itself, which is, you know, OpenShift at the edge, things like that. So this is the future of greenboot. If you have any ideas, I have a couple of links here, starting from the actual repo on GitHub, then the whole explanation by Christian Glombek, who created all of this, and then I dropped a link to osbuild-composer, since that's the way we actually build all the RHEL for Edge artifacts, and rpm-ostree of course.
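For completeness, the rollback the demo walks through is, in effect, what you could also trigger by hand on an rpm-ostree system; a small sketch:

  rpm-ostree status     # shows the booted deployment and the previous one
  rpm-ostree rollback   # make the previous deployment the default again
  systemctl reboot      # boot back into it

Greenboot automates exactly this decision, using the boot_counter and boot_success variables described above.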
Okay, questions? Yeah? [Question: if it rolls back, you could end up back on a version with known CVEs; are we handling that?] We're not handling that yet. So, to repeat the question; Peter asks the questions where he already knows the answers, and the ones we haven't fixed. The question is: how do we prevent rolling back to something that isn't patched? Maybe we are upgrading to something that fixes a CVE, but with a mechanism like this we could be forced to go back and reintroduce the vulnerability. We're working on this; there have been a couple of ideas, and the rpm-ostree integration is probably one of the first things we have to take care of. It's complex; it's a complex topic. [Question about delivering software other than RPMs: I've been more than once in a situation where a new release of a container broke an edge machine.] Sorry? [What about container storage?] So, to repeat the question: why do we only deal with the base operating system? There are scenarios, and the questioner actually had one, where a new version of a container can affect the hardware or affect the machine in other ways. Yes; in that case we would roll back the container, and we would deal with that separately; that would be a different system, and we can deal with it within container storage. [Question: is there a project where, before rebooting for the upgrade, you check whether the rollback will even work? If a major configuration change doesn't match the current state, you're in a sort of critical state.] The same thing we used to do when I worked on OpenShift with the MCO: any change under /etc, and of course we can do that in OpenShift because the hardware is more or less free, basically every change under /etc triggers a reboot, so that at the next reboot you make sure the change itself works. Exactly. [Peter] And we're looking at how we can work the same way, because, the example I always use is: if you need to change, say, the network configuration at the edge, and that is a three-step process where you may need to change a network interface, DNS settings, and another service to be able to connect back to the other end, then you can't upgrade again to fix the problem, because you can't connect to the other end, because you don't have a network. So how do we deal with that if we don't do it very well? That's complicated. Besides rpm-ostree and Fedora, there have been a couple of other approaches, like using btrfs snapshots, which, as Peter said, is different from rpm-ostree, or OSTree config layers; the solution probably lies somewhere there. ComposeFS. Yeah, what about ComposeFS?
[Comment from the audience: ComposeFS may not cover the case of moving back to an older image, but it does change how the verification works.] Yes; so the comment was around ComposeFS, I don't know if everybody heard it: it can be a potential solution to this. If you don't know what ComposeFS is, look it up. [Question: could greenboot keep rebooting forever?] Well, except in that one case; unfortunately, I think the machine I was using runs a RHEL version that had a bug, so it was basically just retrying forever, because that's how the Ignition service is made to work on the first boot. That's a bug; otherwise, yes, if there is anything that hangs and at some point reboots, greenboot will still work. [But the Ignition retry routine runs forever, so that never ends.] I mean, that's a good point; perhaps something to consider for new greenboot features, maybe. [It's very bound to how rpm-ostree works, right?] That's it, I know. [So the only integration point between greenboot and rpm-ostree is when the GRUB variables are set?] Yes, right now. I mean, in the what's-next there is better rpm-ostree integration, and that means we're going to make that link more robust and not just rely on the OSTree finalize-staged step that runs after somebody runs rpm-ostree upgrade, because that can be flaky too; sometimes, rarely, you can miss it. Perhaps, yes. Anything else? Thank you.

Again, folks... speak louder... welcome again. My name is Antonio, and in this next talk I'm going to talk about device provisioning at the edge: specifically, what the team did over the past two to three years to make it possible to provision a device for the edge. I'm an engineer at Red Hat, where I've been working for eight years; I'm an engineer on RHEL for Edge. This is the slide somebody has already seen: we build RHEL for Edge, we take care of Fedora IoT too, along with the upstream community led by Peter, and we develop and maintain all the technologies that make up RHEL for Edge and Fedora IoT; for instance FDO, secure device onboarding, and the simplified provisioner, which is the topic of this talk. Not pictured, unfortunately. So, the agenda for this talk: how we provision a device at the edge; what the requirements are for this process, or at least what we've been gathering in the field, based on experience and on talking to many hardware partners and vendors too; the tools that we developed in order to meet these needs; what we've learned so far; and what's next. At the beginning I really struggled with all of this; Peter Robinson has been preaching this division of the flow. On the left side you have the build phase: you build RHEL for Edge or Fedora IoT, and that part, for us, is done through Image Builder. You basically specify the package set, some configuration options, maybe some files to inject, maybe some kernel arguments to tweak, and what you end up with is an artifact. This artifact can be a raw image, a compressed raw image, an AMI; anything that boots can be made in this phase. In this phase you're not playing with the hardware, with the device; you just build the artifact that you want to put on the device, or that you want to install with.
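As a point of reference, driving that build phase from the command line looks roughly like this, assuming the composer-cli tool and the edge-commit image type; the blueprint name is made up:

  composer-cli blueprints push edge-device.toml                # upload the blueprint
  composer-cli compose start-ostree edge-device edge-commit    # build the OSTree commit
  composer-cli compose image <compose-uuid>                    # download the artifact when done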
Then this talk is about the middle layer, the middle balloon there: the provisioning part. The provisioning part is where you have 10,000 devices and you want to effectively and efficiently flash an OS onto each of them and make sure it is ready to boot. And "ready to boot" here means many things: the provisioner, or whatever provisions the system, can just flash the image onto the disk, or it can run and integrate with things like secure device onboarding; it can talk to the hardware for security-sensitive things, like talking to the TPM and storing keys; all the things that have to be done at provisioning time. The analogy here, at least the way I understood it, is the Microsoft part of the world. I know this is a Linux conference, but still: Microsoft can ship pre-built laptops that already have Windows, so they provision them very early, and then they ship them to you; you power them on and they onboard. Usually you have a Windows key or something that connects back to the Microsoft data center and says: okay, this is actually a system that has been registered with us, and it works. So the provisioning part is usually done at the manufacturer, where you have, again, thousands of devices; you want to install the very same image of the operating system onto them, and then simply ship them or sell them. And then there is the onboarding part; again, a reference to the next talk in this room, because it's about onboarding. For us, that is done securely by two mechanisms, mainly: we use FDO and Ignition, and we're going to talk about that later too. For this talk, we're focusing on provisioning. At the beginning, we tried to gather the requirements for what this process should be. The very first one was that it has to be unattended: nobody should have to interactively type on a keyboard or watch a screen; zero touch, yes, we want exactly that. We basically want somebody, for instance, to grab a USB key, or run the provisioner remotely, and just install on a fleet of devices. And the number here can go pretty high; again, we're talking about thousands of devices that you want to flash in the very same way, with the very same base operating system. So the first requirement for this provisioning part was: unattended, and hopefully fast, which is somewhat different from a normal install, where you're actually there clicking the mouse, selecting the disk, maybe doing something else. You want the provisioner to go as fast as it can, and to be able to scale to thousands of devices, installing a system that's ready to power on. That's why the second requirement is that it should support remote installs. At the beginning we were talking about PXE and other remote-install mechanisms; what we settled on was UEFI HTTP boot, and that has been a requirement from the beginning for us: you basically have a web server that serves the initrd, the kernel, and a couple of kernel command-line arguments to drive the actual installation, and then it installs, or provisions. The other requirements are along the same lines: we want to be able to tweak something in the installation, or in the resulting system. An example is the console kernel argument: maybe you have a thousand devices where the console is on some ttyS-something, and you want that to be the same across the same hardware, so the provisioner should support something like this. Partitioning is also one of the requirements that we got; partitioning, and injecting files and directories. We're not doing that at provisioning time right now, we're doing it mainly at build time, but we may find a solution later to also do it at provisioning, because we've been asked, especially at the edge. I'm listing these here because they have been discussion points that we've taken from customers, partners, and users. So we have these requirements.
As I said, these two specifically are done at build time. Interacting with the hardware, like the TPM: if you have an installer booting, that's doable, so we've been able to do that just fine. And it has to be secure; there are many definitions of what that means, and for us, even just interacting with the TPM was enough. So, the idea we came up with was having, you know, Peter called it a minimal, tiny initrd installer at the time. What that meant was: create an initrd, pack some things in there, like a kernel, an initrd, some tools to actually write an image onto disk, and any other tool we need, and then make it also bootable over HTTP boot. And of course, flashing an image onto disk: that's the very first requirement. You have an image, you run the provisioner, and you want to provision the device with that image. The other thing we came up with was the ability to inject kernel arguments to drive the provisioner. The provisioner we built is a set of Dracut modules, and that means it's pluggable: we have a couple of other Dracut modules, like the FDO one, which is part of the onboarding, that we can easily plug in, so the whole system is made pluggable by using Dracut. As for supporting encryption: right now, and this may change in the future, we could do live re-encryption at provisioning time, but we're doing that at onboarding instead, and we're building the image already encrypted. And lastly, this is one of the things we get asked every time: the installer should extend the raw image to the full disk capacity. We have a raw image, say 14 gigs compressed; we dd the image, we basically flash it onto a disk which is, for instance, 100 gigs, and at that point what we've been asked is: but I need to grow the root filesystem to take up the full disk space on my device, because of course you want to leverage the whole hardware that you have. So this provisioner has to support something like this. We initially called it the simplified installer, but that was causing a lot of confusion (Peter is already looking at me like, you shouldn't say that word), and that's when we renamed it the simplified provisioner, just to distinguish between the phases. So this is the simplified provisioner that we came up with. We're still calling it simplified installer somewhere in the code, especially in osbuild, but we're working towards removing that word. As I said, it's a tiny ISO: we've been able to pack everything we need into an ISO that contains just the kernel, the initramfs, and the raw image. There is no real system running there: we boot the initramfs and the kernel, we have the image, and we work with that. All of the simplified provisioner, as of today, is driven by Dracut modules and systemd services. This allows us to plug in as many pieces as we want to control the flow of the provisioner itself: for instance, we flash the image, and then maybe we want to run FDO to do other things; somebody wants to mount the root filesystem and inject some certificates, things like that. Using Dracut modules, for us, was the way to make this possible. And the reason for having a normal ISO 9660 image is that you can actually unpack the ISO and have its components served over HTTP with HTTP boot, and then the thing can just boot, run the initramfs and the kernel, and install the system.
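A hedged sketch of what "unpack the ISO and serve it" can look like; the file names are illustrative, and pointing the firmware at the server (the DHCP/UEFI HTTP boot configuration) is out of scope here:

  # unpack the provisioner ISO and serve its contents over HTTP
  mkdir -p /mnt/iso /srv/httpboot
  mount -o loop simplified-provisioner.iso /mnt/iso
  cp -a /mnt/iso/. /srv/httpboot/
  umount /mnt/iso
  # any web server works; python3's built-in one is enough for a quick test
  cd /srv/httpboot && python3 -m http.server 8080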
So, what we need to do before all of this, at the point we build the simplified provisioner, is of course to have a raw image to work with, because that's going to be the golden image that we provision onto, say, 10,000 devices. So we need to build this raw image. We're using Image Builder again, and what we do, as I showed in the other talk, is first build an OSTree commit that contains the base system; with that, we create a raw image, and this is when partitioning and encryption are taken into account. So at this point Image Builder creates the raw image out of an OSTree commit for us. What do we do with this image? We embed it into the ISO, so that the simplified provisioner code can use it, uncompress it if needed, and flash it onto the devices. The way we produce the provisioner (the slide is wrong) is through Image Builder again. We have created, over time, a couple of stages; that's how Image Builder works, by having stages that each do something against a root filesystem. For the simplified provisioner, the team created a bunch of stages that we put together to produce the resulting ISO you can see in the image. So Image Builder basically creates the raw image, compressed so it takes up less space, then makes it bootable via ISOLINUX, EFI, and everything else, and then embeds the initrd and the kernel. All of this, for us, is done with a single Image Builder command. So now, on to the requirement: how do you write the image to the disk? You have this nice ISO, there is a raw image inside, and now we need to actually write it to the disk when we run the simplified provisioner. For this we have chosen to use coreos-installer. For those not familiar with it, it's a project from the CoreOS team that is widely used in OpenShift, and it's basically a way to flash an image onto a disk: in its simplest form, that's what it does. But it can also do more. It supports embedding Ignition, as I said. It supports compressed raw images, so you can embed a raw image that is compressed, and when it writes it down to the disk, it gets uncompressed on the fly; that's a huge win for us, of course, because we can keep the ISO itself more compact. It has encryption support. And coreos-installer can be run from a Dracut module: the team created the coreos-installer-dracut project, which is a way to run coreos-installer in a Dracut module. Being Dracut, and being just command-line tools, we were able to integrate all of this using systemd; that's the framework we're using, and it allows us to order units and their execution the way we want, so we're using systemd there too. Now, coreos-installer is really neat, and the way we made it work is that the whole execution is driven through kernel arguments. This matters for HTTP boot: if you want to boot this provisioner remotely, then you want to drive the full provisioning through something like kernel arguments, because every device comes up with the same kernel command line and executes the same steps. By using a systemd generator, we parse the kernel command line: okay, we want to install the image onto /dev/vda; this is a virtual machine, but maybe your hardware has /dev/sda or something. So if you have a thousand devices that should all install onto /dev/sda, you can configure this via kernel arguments and then boot via HTTP boot, so that every device is provisioned in the very same way.
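Concretely, the kernel command line the generator parses looks something like this; the coreos.inst.* argument names follow coreos-installer's Dracut integration, though the exact paths here are taken from the demo and are illustrative:

  coreos.inst.install_dev=/dev/vda coreos.inst.image_file=/run/media/iso/image.raw.xz
  # or fetch the image over the network instead of from the embedded ISO:
  coreos.inst.image_url=http://example.com/image.raw.xz

The generated service then ends up executing, roughly, coreos-installer install /dev/vda with the matching image option.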
There are other things that can be tweaked, like where the image file is: it can actually be remote. For HTTP boot, you want the image-file kernel argument to be an image URL, so you can optionally fetch the image from a remote location every time; that may be slower, but it works, it can be made to work. So this is how we achieve conditional execution, in a nutshell. Apart from parsing the command line to drive the execution and configure the installation device, we can also tweak how we behave post-install: somebody can provision the machine, and then maybe you want to reboot, maybe you don't want to reboot; maybe after provisioning you just want to take the hardware, lock it somewhere, and then ship it to the Everest. So this flow can be tweaked by post-install kernel arguments that we provide. Another thing the coreos-installer Dracut module does is mount the ISO, so that if the image is local to the ISO, the image is available inside the initrd. And at the very end, this was the systemd generator for coreos-installer: we just run the coreos-installer service, which is just a command line saying coreos-installer install, with the target, the post-install behavior, where the image is, all of that. We've been able to achieve this with a systemd generator, and as I said, the coreos-installer service, this is just a snippet, at the end just echoes a command, and then it gets executed by systemd itself, and it runs. I hope this time the demo works; it's really tiny, so you can see the actual command line, coreos-installer install and everything else. All of these kernel command-line options can be configured via Image Builder, so I'll show a couple of the blueprints that we use daily to build the image. You can configure the install device; you can configure FDO and Ignition. There are various options for FDO: you can configure certificates, the manufacturing server URL, and other things that are explained later in the FDO and Ignition talks. For Ignition, you can attach a kernel argument that says where to fetch the actual Ignition configuration. Again, you can configure the target disk via a blueprint; all of this comes together in Image Builder. This also integrates very well with FDO, which is secure device onboarding; I'm repeating myself, but FDO is also a Dracut module, so we can order it after we write the image to disk: FDO runs and does whatever it needs to do on an installed system, like registering keys with the TPM, and the other things we do at provisioning time with FDO. For this demo, I'm using a normal virtual machine. What I've done is create a simplified installer based on RHEL 9.2, in this case; all of this is coming to Fedora really soon, because we got the latest bits of everything we need, so expect it in the next month or so. All of this can be tested with Fedora too; it's pretty exciting, as you can see from Peter laughing. So I've created a simplified installer using this blueprint, and as I was explaining earlier, you can see I can customize the installation device: this simplified provisioner is meant to install on the vda disk, which is usually QEMU. And then what I also wanted to do was add a first-boot kernel argument for Ignition, so that the first boot of the system also runs Ignition.
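A hedged sketch of what such a blueprint can look like, based on the Image Builder customizations the talk walks through (installation device, FDO, Ignition first boot); the name, URLs, and values are placeholders:

  name = "simplified-provisioner-demo"   # hypothetical
  version = "0.0.1"

  [customizations]
  installation_device = "/dev/vda"

  [customizations.fdo]
  manufacturing_server_url = "http://fdo.example.com:8080"
  diun_pub_key_insecure = "true"   # demo only, as in the talk

  [customizations.ignition.firstboot]
  url = "http://example.com/config.ign"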
This is really easy to digest and grasp; it just follows the Image Builder blueprint options. Another thing I can show is the FDO snippet in the blueprint: you can see we can configure a bunch of options for FDO itself. In this very simple case we configure the manufacturing server URL. Maybe in a factory you have 10,000 machines and one FDO server; you create a simplified installer for that server, so you can have the very same image that installs on 10,000 devices, all at once, just by building this image. And then there are a bunch of security-sensitive options too; this example is running insecure for various reasons, but we don't build it that way for real, it's just to show that there are pieces of the blueprint that are made specifically for the simplified provisioner. This comes from the integration we had in the past with the Image Builder folks. Really? All right. So what I'm going to do here is run a simplified provisioner to install on a virtual machine. You can see here I have this image, unpacked; you've already seen it: just an EFI directory, an images directory with the kernel and initramfs, the standard ISOLINUX folder, and then the raw image. If we go back here, this is the simplified provisioner booting, and you can see the very same kind of kernel configuration options I showed on the slide: we're going to install to /dev/vda, and the image file is under /run/media/iso, which coreos-installer mounts for us, so coreos-installer can install it. So if we boot this, in a moment it's going to show; I know this is tiny, hopefully everybody can see it, but right on the first line there is just a normal command line: coreos-installer install, /dev/vda as we've configured, and the file is there, the compressed image file; there is the insecure flag too, as I explained, just for demo purposes. All of this went unattended. I edited the kernel command line just to show you, but otherwise there's something like a 60-second delay and then this thing boots on its own; it runs unattended. At this point it's still writing the image to the disk, and this is a virtual machine, so it's pretty fast; we've been testing this on fitlets and other devices, I think Intel NUCs and ARM devices too, so it can be slow depending on the actual hardware that you have, but in general it's a fast process. And yeah, now it's going to reboot into the actual RHEL for Edge system. This part is not really important; what I wanted to show is that the whole process can run unattended. It's pretty fast too, which was one of the requirements we got at the beginning, and it's really tiny: just an ISO, you can put it on a USB stick to install, and finally you can boot it from the network to streamline the provisioning. So, what have we learned so far? Bear in mind we're still learning; requirements come in probably daily, depending on who we talk to and what their actual needs are. What we've learned so far is that many of our users and customers don't want to use raw images, and this is legitimate to some extent, although raw images buy us things like security, like IMA and so on. So we're also trying to be opinionated in this whole process; we're saying, this is probably the best way, which is arrogant, but trust us. Depending on the various conversations we've had, that's an education process that we need to go through.
And so, they just want to use what they're used to: an installer. They go there, they script it with kickstarts, if you're familiar with that (Chad is definitely familiar). So you script it: you want just an installer, you use the installer on a fleet of devices that can also be heterogeneous, so different devices too, and you want the logic that figures out which system it is to live in the kickstart itself. That is cool, that works, but again, having something like a single raw image, which can be integrity-verified with all the security mechanisms that brings, is, we're saying, better for security reasons too. As Peter mentioned, it's an education process that we need to go through with customers, users, and the community, to help them understand that this is definitely a better way, because you are not re-creating every single device by hand; you're guaranteeing an identical, up-to-date device. It's an education process, and this is what we're fostering right now as probably a good way to provision your edge devices. And as I said, if you don't have a raw image that is already packed as a whole, things like IMA and other security-sensitive processes don't really work; it's very difficult to make IMA work with Anaconda. There have been talks, but we're not there yet, and probably never will be for Anaconda. So this makes it easier, or at least that's what we've been told. The other thing we've learned is that many people, users, and customers want partitioning at provisioning time. Again, they're used to something like Anaconda, where you can script the kickstart and say: I want /var here, I want /etc there; use your imagination. With the simplified installer, what we do is basically pre-partition the raw image, and so we tell you: this is the layout that mostly works; if you want to tweak it, use the Image Builder blueprint, and then you can have /var on a separate partition, and, again, use your imagination; you can do whatever you want there. For partitioning specifically, there are a couple of areas we've thought about. One of them is Ignition. Ignition does partitioning live on first boot, but that's often pretty expensive. Right now OpenShift uses this, but of course OpenShift doesn't have the same resource limits: nodes can have 32 gigs of RAM and tons of CPU power. That doesn't work for the edge; maybe on some hardware it works, but there are a lot of factors at the edge: shared resources, various other bits and pieces. Doing things at the edge without a human to interact with, with zero touch, is very complicated, and the more you can take out of that and do at build time, the fewer vital decisions you have at provisioning time, which means a minimal number of things that can go wrong and a better chance of a successful provisioning; you can then customize more on top of something more reliable. So, at the end of the day, leveraging Ignition in OpenShift is okay; as Peter explained, for us at the edge it's probably a no-go to begin with. Ignition can do root reprovisioning, so it can completely change the layout of the filesystem on the disk; in OpenShift that's fine, at the edge, not likely.
So what we came up with was: maybe we need to support partitioning for the raw image at build time. That makes sense, and it's on the roadmap already: maybe you can have a blueprint option, just for the simplified installer, to partition the raw image the way you want; again, maybe you want /var on a different partition. As I said, we can also look into on-the-fly reprovisioning of the filesystem, although that's tricky, or we can do first-boot reprovisioning; maybe we can optimize Ignition itself to avoid, I don't know, reprovisioning a 100-gig root filesystem that would be impossible to fit into memory anyway. Those are just scenarios that we're going to run through and think about. The last one is something we've talked about in the past but didn't get any traction on; we're at the conference, so if you like the idea and want to contribute, it makes sense. The biggest thing we've been told is: how do I test it? I don't want to build the image; I want something that boots, that I can just navigate into and understand whether it fits my needs. Right now we don't provide a live image, but if there is any interest and traction from the community, we can explore that. What I mean is that Fedora CoreOS systems do provide a live image, so you can actually jump into the system before installing it, maybe create a template of the installation you want to produce with coreos-installer, and then just boot that. That is something that is not officially on the roadmap, but that I think we'd be keen to get as input from the community. Out of time; a couple of links here, from coreos-installer to some Red Hat documentation on how we do simplified provisioning with FDO and Ignition. That's it. Any questions? There you go.

Welcome to another presentation; I hope you are enjoying this conference so far. I would like to welcome our next speaker, Eric Curtin, software engineer, who will talk about quick booting of a software-defined rear-view camera. There will be a few minutes at the end of the session dedicated to questions, and without any further ado, I'll give you the floor. Hi, everyone, is the mic working? Yeah, hi everyone, I already got introduced. It's my first time at DevConf; the venue is very nice, I must say, and thank you for attending the talk. I have, I think, about 28 slides, which is probably a lot, and I'm going to go through them rather quickly. (I'll speak louder; the microphone is for the streamers, for YouTube. Thanks for letting me know.) So yeah, 28 slides, and I want to leave time for questions at the end, so if anything I say is unclear, feel free to ask. I was already introduced: I'm Eric Curtin, a software engineer at Red Hat, in automotive; all kinds of different things, I guess. So, the scope of this talk: the scope of optimization can be quite large, but here we're going to focus on optimization techniques from the Linux kernel and initrd onwards. For example, something that isn't covered is anything before that, because firmware and so on is generally owned by the automotive board vendor, so the optimization of that area is out of the hands of an OS vendor such as us. Obviously the firmware optimization is just as important; it's just not my responsibility. I won't go through optimization of the camera stack at the kernel level either.
That's because we use a USB camera for this, rather than the actual ISP and SoC from a real automotive board, although the same optimizations would apply to an actual automotive board. Another thing I want to cover is that there is a balance between keeping an OS generic and making it optimized to run some specific things early: you could trim down a kernel and userspace to only play camera frames and do nothing else, but it wouldn't be a very useful automotive OS then. So I've split the talk in two: the first half is generically starting an early-boot systemd service, and in the second half I'll describe how we applied that to a camera application. The techniques in this talk can be used to start other services early; in automotive in particular, there's an expectation to start all sorts of services early, and a rear-view camera is just one of them. Here I have a little ASCII diagram of how a Linux OS boots, from a certain perspective. Generally the firmware loads some sort of bootloader, and normally that's where the decompression of the kernel and init ramdisk happens, even though that's not always the case. Then the first filesystem mounted is the initramfs; that loads systemd and does a bunch of things, and at some point we switch the root filesystem from the initramfs to the normal root filesystem. So you have two options at this point. If your systemd service doesn't need to start super fast, if it can start after three or four or five seconds, maybe starting from the root fs is just fine. But if you want to start something super early, like in the case of a camera, you have to start it within the initramfs, sometimes called the initrd; they mean pretty much the same thing. The reason there are two options is that there's actually a trade-off here: when you start a service in the initramfs, if that requires you to add more binaries, libraries, and so on into the initramfs, it actually has a negative effect on the overall boot time. I'll come back to that, but we're going to use this technique for this talk, because in the case of a camera you have to start super fast. These are some generic tips for booting quickly; they're probably obvious, but sometimes they aren't. Keep your kernel as small and as modularized as possible, so anything that can be a kernel module probably should be, and keep the initramfs small. The kernel and initramfs must be small because, as I said earlier, you pay a penalty for every single byte: every byte has to be decompressed before you can use it. So those are the generic tips. If you want to start a service really early on a systemd-based operating system, like all the Red Hat operating systems, this is what it looks like. The one line that's probably important is DefaultDependencies=no, because systemd services have a default set of dependencies, and the default for that setting is yes; if you leave it at yes, your service won't start for a couple of seconds, because all the default dependencies must start first, and in this case we don't want that. Then this is just a file used to pack things into the init ramdisk; it just says: put this file, and the symlink, in the init ramdisk. If you do that, and run a service like this, it'll actually start within about half a second. And this is just how you rebuild an initramfs.
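A hedged sketch of the three pieces just described: the unit file, the Dracut configuration that packs it into the initramfs, and the rebuild command. The unit and binary names are made up; DefaultDependencies=no and IgnoreOnIsolate=true are the settings the talk calls out.

  # /etc/systemd/system/early-demo.service
  [Unit]
  Description=Example early-boot service started from the initramfs
  DefaultDependencies=no
  # keep running when systemd isolates to another target during boot
  IgnoreOnIsolate=true

  [Service]
  ExecStart=/usr/bin/early-demo

  [Install]
  WantedBy=initrd.target

  # /etc/dracut.conf.d/early-demo.conf -- pack the binary, the unit,
  # and the enablement symlink into the initramfs
  install_items+=" /usr/bin/early-demo /etc/systemd/system/early-demo.service /etc/systemd/system/initrd.target.wants/early-demo.service "

  # rebuild the initramfs
  dracut --force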
It's useful to know, because with systemd and that technique you can start things within half a second; so if you need to start something within that time frame, that's a good technique to use. Now we're going to apply some of this to a camera application. This has already been done in automotive in different ways, actually, but the existing approaches use firmware-based cameras and displays to launch and display a rear-view camera. That works perfectly fine, but it's difficult to maintain, it's hardware-specific, and it's not software-defined, so it can be difficult to develop for; in fact, it's not even Linux, it's just custom code. So this was the first technique I tried to start a camera quickly: something that exists already in RHEL for Edge called kiosk mode, which is a mode to auto-start a graphical application. I won't play that video, but basically, with that approach, the camera would start after 14 seconds, which is clearly too slow. So we had to do something a bit more custom, and I moved on to an optimized approach. I was discussing it with the team, and the first thing that came to mind was: what's the first graphical thing a Linux system displays? Of course, it's the boot splash, and Plymouth is the tool that launches the boot splash. So the technique I used takes a lot of influence from Plymouth, except that in this case you're not only displaying graphical data, you're also processing graphical input from a camera; that's the key difference between this and Plymouth. There are two main components to this: one is libdrm, which is part of Mesa, and the other is libcamera, which powers the camera. So, this is libcamera: the folks involved in the kernel camera stack are pushing this library hard. It runs on ChromeOS and the various Linux distributions; I think it even supports Android, but don't quote me on that. It's definitely running on ChromeOS and the other distributions, and it aims to be the Mesa of the camera stack; if you're familiar with Mesa, Mesa is basically the Linux userspace graphics stack. libcamera supports various different types of cameras across multiple SoCs, USB, and so on, and it has decent support for things like DRM/KMS, PipeWire, Qt, SDL; it's basically the compatibility layer for all these things. I wrote a little reference application tying all these things together; I called it twincam. As I said, it's very similar to a boot splash: it doesn't use a compositor or a window manager; instead we use the DRM/KMS subsystem to write to the framebuffers directly. And you definitely should not do this most of the time: you should run your graphical application on top of Wayland and all of that. It's just that, for this use case, Wayland has so many dependencies that you could never start a graphical application under Wayland, at the moment at least, in around a second or two; it's just really hard. So in this case there's a lot of custom code against DRM/KMS to do it quickly. Plymouth is another example of a tool that does things this way, but again, it's not recommended at all unless you need the performance and need to start something quickly. This is how you'd normally install a camera application on a Linux system: the initramfs loads all of these things, and the camera application is typically stored somewhere in the root fs; and that's what you should do if you don't have the requirement to start things really quickly.
So, to start the camera quickly, we basically moved everything the camera requires into the initramfs, and then we wrote a little systemd unit file that looks like this; it's very similar to the last example I showed you. I won't go through every single line, but another important line is IgnoreOnIsolate=true, because other systemd units can run in this isolate mode, which basically means they can stop other systemd services, and we don't want that to happen. Not everything in this file is strictly required, but those are the important ones anyway: DefaultDependencies=no and IgnoreOnIsolate=true. Here is the Dracut module-setup.sh file that adds things to the initramfs. The first function says: make sure you have twincam in the initramfs. The depends call tells the initramfs that we have a requirement on the graphics stack, so make sure that's there; then there's the media subsystem in the Linux kernel, which is where the camera kernel support lives, and if you add that, you have kernel camera support in your initramfs. And the final install is basically the twincam binary, the libcamera libraries, the libevent library, and the C++ library. I'll skip this slide. So when you do all this, you can basically start the camera within 2 seconds, rather than the 14 I started from. I'll play a little demo I have recorded, just to prove that I'm not talking nonsense. So, this is the GRUB bootloader; I normally pause at GRUB because this is an old, dusty Acer laptop I have at my house, and I'm not interested in optimizing Acer laptop firmware, because this is for an automotive use case, not for a laptop. So I press Enter here on the keyboard, and you'll see the camera displayed really, really quickly, and there's me, cycling a bike. So, that's pretty much it. If you take the timings, the camera starts at about 1.8 seconds from kernel boot, and actually what's more important, because in the automotive use case the node processing camera frames is sometimes different from the display: if you check the timings, we start receiving camera frames at about 1.4 seconds, which is pretty decent. That concludes the talk; just some shameless self-promotion here: if you're interested in automotive, or ARM enablement, or any of this space, I'm doing another talk tomorrow on Fedora Remix, which is essentially Fedora on Apple Silicon. The reason I'm involved in that is that ARM enablement is important for automotive, because almost all automotive boards are ARM-based, and it provides me, and many others in the community, affordable, well-upstreamed, and powerful hardware to develop on. So that's about it, thanks very much; I'll take any questions if there are any. This is a good question: do I just keep playing the frames indefinitely, or at some point do we transition to Wayland? This is something we have to work out; it's a good question. I've talked to our internal team that works on the digital cockpit and that kind of thing; at some point you should transition, though that's out of the scope of this talk. But yeah, 100%, you don't want to be writing frames directly to the framebuffer like that in the long term; it's just to achieve the goal of seeing something really quickly, because that's a requirement.
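To pin down the module-setup.sh structure described a moment ago, here is a hedged sketch along those lines, using standard Dracut module hooks; the exact dependency module names and paths are assumptions, not twincam's actual file:

  #!/bin/bash
  # Dracut module-setup.sh sketch for packing a camera app into the initramfs

  check() {
      # only include this module if the binary exists on the build host
      require_binaries twincam || return 1
      return 0
  }

  depends() {
      # pull in the graphics support module shipped with dracut
      echo drm
      return 0
  }

  installkernel() {
      # media subsystem = kernel camera support
      instmods "=drivers/media"
  }

  install() {
      inst_multiple twincam
      # shared libraries (libcamera, libevent, libstdc++) are resolved
      # automatically by dracut for binaries added with inst_multiple
  }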
Yeah, go for it. So the question is: did I only measure the times on an Intel laptop, or did I also run this on ARM hardware more similar to what you'd see in automotive? The answer is yes. At the start, all I had available was a Raspberry Pi, and the Raspberry Pi was much slower, as you'd expect, but there's a huge difference between a Raspberry Pi and an automotive board, because the Raspberry Pi costs around 100 and an automotive board is in the thousands. Going back to my last slide, on the Apple Silicon thing: I've also run the same thing on my Mac mini and achieved roughly the same time. There isn't Apple Silicon running in any vehicles at the moment, at least, but an M1 is roughly similar in performance to what you would see in an actual automotive board. I don't want to dismiss that point too much, though, because the hardware is important: is the hardware fast, is the storage fast, is the camera fast, is the firmware fast; everything's important, of course. Next question: I referenced the initrd, which is compressed; have I tried other techniques, such as not compressing the initrd, or using a different compression algorithm? I did play around with that stuff, and I also tried not using an init ramdisk at all, which is another option. There's not much of a difference, because I found the blocker, for that laptop in particular and the other machines I tried, was generally the kernel spinning up its camera framework; that was normally the slowest part. When I tried those different techniques, I got pretty much the same time, so I didn't see the need to change things for the sake of it; like I said earlier in the talk, you can actually start a systemd service in as little as half a second, so it's not actually a huge issue. Another question in the back: have I also tried this with encryption turned on, and things like dm-verity? I haven't, but we'll have to look at that at some point. I will say, on dm-verity in particular: I work on Red Hat's automotive operating system, and I'm not actually sure we're going to use dm-verity for our verification. There's something called ComposeFS that Alexander Larsson on our team is working on; I think, and don't quote me on this because I could be wrong, he's using fs-verity instead. But yeah, we definitely have to rerun things with encryption and all these chain-of-trust things turned on, because they will have an impact. [Comment: in the automotive industry, the moment I open the door, the car could silently start and then switch on the display, which probably gives you 5 or 10 seconds rather than fractions of a second.] Yeah. So the question, for everyone online, is: this is a really cool technique, but maybe we could cheat a little and start booting when someone opens the door, and then you might not need this at all. That's a very good point. Why is this still important? Automotive companies definitely look into those things, like optimistically starting a car if you think somebody's about to start driving it; so first, to answer the question: yes, they look into that.
Yeah, I mean, consider the automotive industry: the moment I open the door, you could silently start booting and then switch on the display when I'm in the car, which probably gives you 5 or 10 seconds, so you wouldn't have to chase fractions of a second. Yeah, yeah. So to repeat the question for everyone online: this is a really cool technique, but maybe we could cheat a little and start to boot when someone opens the door, and then you might not need this at all. That's a very good point. Why this is still important: automotive companies definitely look into those things, like optimistically starting a car if you think somebody's going to start driving it soon, so to answer the first part of the question, they do look into that. The second part is that it's actually a certification requirement to start really quickly from a cold boot, so cheating only gets you so far; you might not pass all the certification requirements. And the third reason: if you just have a little application like this out of the box, using Wayland and all this stuff, it's around 14-15 seconds, and 14-15 seconds is just way too slow. You can do all the cheating you want in the world, but 15 seconds is way too slow.

A follow-on to that point: what some automotive companies have looked into is coming back from a suspend or hibernate state, but that has a couple of issues. If it's suspend and you leave it in a low-power state, you're draining the battery; but if it's more like hibernate and you're loading from disk, there are still issues, because would you feel safer in a car that clean-booted from scratch, or a car that woke up from a suspended state? A clean boot is arguably safer; you could debate those things all day. Yeah, I know. I'm not actually against that technique, but some people don't like it because it's not a clean boot, and they would ask whether that system is safe after suspending all that hardware and reactivating it. I don't know, I'm not very opinionated on that stuff.

Question here. That was a long question, but to repeat it on the microphone: in every case, my shorter answer is you kind of do, because it's actually a requirement for some certifications and such, and if you fail the test, you fail the test. They will literally send a tester who will step into the car, turn on the ignition, and start a timer on his watch; that seems to be sufficient, pass or fail. Any other questions? I'm just going to check Matrix really quickly, and if there are no questions there... cool, thanks very much everyone, thanks for attending.

That's it. I'm not going to be introduced, right? No, it's fine. Okay, is there anybody who doesn't know me? Okay. So hello everybody, I'm going to be your last speaker today in this room, about a trust-based adaptive access control architecture. My name is David Halász, and I work as a principal software engineer... seriously, guys... so I work as a principal software engineer for Red Hat; I'm in my ninth year there. I'm also a PhD student and a researcher at Masaryk University, in the Lab of Software Architectures and Information Systems, and I'm also an alumnus of this school. You've probably seen those ducks outside in the corridor; the one with "David" written on it from 2016 is mine. My PhD research... come on... so my research at the university is about trust-based adaptive safety in autonomous ecosystems, which I'm going to talk about a little bit more.

So what are software ecosystems? If we combine multiple systems into one big unit, we can call them systems of systems, or ecosystems. It's a kind of evolution, as we are creating more and more complex systems, and of course such ecosystems can provide much more than a single system. If you think about an analogy from buildings: the architecture of a regular house could be a software system, and then the architecture of an ecosystem could be like a city, with the plans of every individual building inside. This is what I'm researching, this is the main focus. Of course, if we bring autonomy into the topic, we don't talk about software ecosystems but autonomous ecosystems, autonomous software ecosystems, or cyber-physical ecosystems, and we can get a much higher degree of autonomy, as we have multiple actors in the world working together, collaborating, or competing.
Member systems can join or leave at any time, so the context changes produce a lot of unpredictable situations and a lot of uncertainty, which brings us to the main topic of my research: the safe and secure behavior of such systems. We're going to look into a really small focus area of this research: ecosystem coordination. By coordination I mean, for example, if you have a set of cars on a highway, you can coordinate them to move in a single lane, reducing aerodynamic resistance: basically vehicle platooning. This can be called an autonomous ecosystem of self-driving cars moving in one lane. We can do ecosystem coordination in two different ways. One is sending messages: we can ask these autonomous vehicles to talk to each other, with some pre-implemented logic that can react to the messages they send. This is relatively safe, because you have pre-trained behavior that you already know in advance, but it doesn't really give you flexibility in uncertain situations. What we are looking into more is sharing software: sending some kind of code, something executable, between autonomous systems, which the receivers can execute. Obviously it's not as safe as sending messages, because you're executing stuff, but it brings you the opportunity to have new features on the fly. For example, with these vehicles here, it's enough if one has the smart agent, the software module for platooning; it can share that module with the other vehicles, which don't even support this feature, and once everything is working fine they can move into one platoon, even though they didn't have this feature before.

Of course, running a third-party software module on your autonomous vehicle in privileged mode is kind of a bad idea; I guess everybody here can agree that we shouldn't do that. But if it's a bad idea, it can still be fun. Our view of how to make it safe is trust, and by trust I don't mean trust as you know it in software engineering in general, with trusted computing, trusted execution, key pairs, encryption, and trust in HTTPS certificates. I mean trust as in human psychology or philosophy, where you have a relationship in which the trustor accepts some kind of vulnerability by trusting the trustee. We are trying to model something in this area. Somebody smarter than me said that reputation-based trust can be effective for securing communication, and our belief is that we can also use it for interactions, not just communication. So our idea is to use trust as a decision factor in real-time evaluation. We already know that this trust will not be binary. I'm not sure about the representation yet, but with a simple binary true/false, trust/don't-trust solution you can have a lot of very dangerous false-positive situations, so a binary solution is definitely not something we want. We are also looking into reputation, which we interpret as trust assessed by other actors: basically gossiping. So one autonomous vehicle has an interaction with another one, it has a bad experience, and it shares that with others, basically the same way people gossip about other people, which is not nice, but in autonomous ecosystems it might be useful for us. To calculate trust (if I say it's a number, it might be a number, so let's go with that), we would use this external reputation, and on these smart agents, the software modules, we can use static analysis to find some kind of vulnerabilities there.
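To make the gossiping idea concrete, here is a minimal sketch of one way such reputation could be aggregated. The weighting scheme and every name in it are illustrative assumptions, not the speaker's actual model:

```python
from dataclasses import dataclass

@dataclass
class Report:
    reporter_id: str
    score: float  # the reporter's experience with the module, 0.0 to 1.0

def aggregate_reputation(reports, reporter_trust):
    """Weighted mean of gossiped scores: a report from a peer we barely
    trust moves the result far less than one from a trusted peer."""
    weighted = [(reporter_trust.get(r.reporter_id, 0.0), r.score)
                for r in reports]
    total = sum(w for w, _ in weighted)
    if total == 0:
        return None  # no usable gossip; fall back to local evaluation only
    return sum(w * s for w, s in weighted) / total

# Two trusted peers report good experiences; one distrusted peer spams 0.0.
reports = [Report("car-a", 0.9), Report("car-b", 0.8), Report("car-x", 0.0)]
trust = {"car-a": 0.7, "car-b": 0.6, "car-x": 0.05}
print(round(aggregate_reputation(reports, trust), 2))  # 0.82
```

Weighting each report by how much we already trust the reporter is also one simple way to blunt the spamming concern that comes up in the Q&A later.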
The important part, which is not my research but which I'm going to use as an input, is predictive simulation using digital twins, and based on the results of these digital twins we can do live compliance checking, which I will explain later. Is anybody here unfamiliar with digital twins? Okay. A digital twin in our case can model anything from the physical world; basically it's a digital representation of a physical object. In our case the physical object can be the software module. It doesn't have to be, but I will explain why that was our solution. As you see in the picture, what we do with predictive simulation is that in a simulated world we run ahead of time, running simulations on the digital twin, while the smart agent is running in software; in the case of software, it's the exact same stuff as in the real world. You can compare these two things, and based on that you can assess trustworthiness: how trustworthy the model itself is, based on the predictive simulations, and also how the two differ from each other, which is the live compliance checking I was talking about.

I really don't know what the trust value would look like. My best guess, and in our papers we talk about a percentage, but more and more research is pointing towards a vector of different aspects, so maybe five or six percentages based on the type of metrics. But I'm pretty sure we would like to go with this smart agent plus digital twin bundle with a digital signature, because it gives us some extra safety. So if we talk about digital twins, this is our generic idea: we verify the digital signature, and if it fails, we don't care about the bundle. If it passes, we do static analysis; again, if it fails, we don't care about the bundle. Then we verify the digital twin on some preset simulations; if it fails, we not only reject the bundle but also propagate the trust score to the world, telling others that this might be an untrustworthy software module or smart agent. Then we do some predictive simulations, execute it, and do the live compliance check based on those results.

That was a simplified version, and I would like to show you a little architecture, which looks like this. It's kind of complex, but we'll go into what it actually does a little bit later. It does predictive simulation of some scenarios using the digital twin: you have some reality, you can have multiple future scenarios based on that reality, and you can compare the simulations done with the digital twin against the results; that's the live compliance check. Based on that, we can calculate the trust score, which in this case can be a number, let's say 45%, and based on this score we can set up a decision tree that would allow us to expose certain features to this smart agent, or on the other hand conceal those features: basically some kind of access control based on how much we can trust this module, based on reputation. So again, looking at this: we have some external reputation coming from other vehicles; we have the smart agent and digital twin coming to some kind of gatekeeper; we verify the digital twin on some preset simulations; then we load the digital twin into the simulator and the smart agent into the sandbox, and these two entities work in tandem. We calculate some kind of comparison that goes to the trust aggregator, which creates a trust score, and based on the trust score the sandbox does or does not have access to certain features on the vehicle platform.
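As a sketch of that gatekeeper flow, here is one possible shape in Python. Every function and field here is a hypothetical stand-in for a real component (the signature check, the analyser, the simulator, the sandbox), so treat it as an illustration of the decision order described above, not the actual architecture:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Bundle:
    agent: object        # the smart agent (the executable module)
    twin: object         # its digital twin
    signature_ok: bool   # stand-in for a real digital signature check

def gatekeeper(bundle: Bundle,
               static_analysis: Callable[[object], bool],
               preset_sims: Callable[[object], bool],
               predict: Callable[[object], list],
               run_sandboxed: Callable[[object], list],
               propagate_distrust: Callable[[Bundle], None]) -> Optional[float]:
    """Returns a trust score, or None when the bundle is rejected outright."""
    if not bundle.signature_ok:
        return None                    # bad signature: ignore the bundle
    if not static_analysis(bundle.agent):
        return None                    # analysis found vulnerabilities: ignore
    if not preset_sims(bundle.twin):
        propagate_distrust(bundle)     # reject AND gossip the bad result
        return None
    predicted = predict(bundle.twin)          # simulator runs ahead of time
    observed = run_sandboxed(bundle.agent)    # sandbox runs the real agent
    # Live compliance check: fraction of predicted behaviours that matched.
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / max(len(predicted), 1)

# Toy run: all gates pass, and the twin predicts 3 of 4 behaviours correctly.
bundle = Bundle(agent="agent", twin="twin", signature_ok=True)
score = gatekeeper(bundle,
                   static_analysis=lambda a: True,
                   preset_sims=lambda t: True,
                   predict=lambda t: [1, 2, 3, 4],
                   run_sandboxed=lambda a: [1, 2, 3, 0],
                   propagate_distrust=lambda b: None)
print(score)  # 0.75
```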
What might be interesting for you software engineers not really working with automotive, but with, let's say, different technologies in the cloud: if you squint at this a little, what does this architecture resemble? You have some rules, you have some role, and based on the rules and the roles you have some access. What if I say Kubernetes RBAC is something similar? And what if we could implement something trust-based in Kubernetes, so instead of RBAC we could do something like tBAC, which would basically implement a similar architecture in the cloud? I mean, it would probably be totally useless in software, but I see it as a viable way of doing a proof of concept, and I heard some rumors that there is a team at Red Hat that tries to run containers on vehicles. Okay, somebody is shaking his head, so it might not be true. It would probably be useless in a real cluster where you're running some services in the cloud, because you might not be able to run these kinds of simulations, but in some edge computing cases it might be useful. So that's our vision of how and why we would like to run it in Kubernetes at some point.
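To show what the "tBAC" analogy could mean in code, here is a minimal sketch: the classic role-to-rules mapping you would recognize from Kubernetes RBAC, except each rule also carries a minimum trust score. The rule shape, names, and thresholds are all invented for illustration; this is not a real Kubernetes API:

```python
# Hypothetical "tBAC" rules: role -> allowed verbs on resources, as in RBAC,
# but each rule also requires a minimum trust score, so what a module may do
# degrades gracefully as its trust drops.
RULES = {
    "platooning-agent": [
        {"verb": "read",  "resource": "speed",    "min_trust": 0.30},
        {"verb": "write", "resource": "steering", "min_trust": 0.90},
    ],
}

def allowed(role: str, verb: str, resource: str, trust: float) -> bool:
    return any(rule["verb"] == verb and rule["resource"] == resource
               and trust >= rule["min_trust"]
               for rule in RULES.get(role, []))

# A module scored at 45% may read telemetry but not actuate the steering.
print(allowed("platooning-agent", "read",  "speed",    0.45))  # True
print(allowed("platooning-agent", "write", "steering", 0.45))  # False
```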
To wrap it up, there is some future work we are looking into. Based on this, we are trying to swap the smart agents for a pair of sensors, so we could calculate trust on sensors: we can trust or not trust the sensors, and based on that we can believe or not believe what the sensors say. Another approach is to use digital twins of the autonomous vehicles themselves and extend this architecture to a whole ecosystem, so we could simulate other cars, calculate their trust scores, and behave accordingly; it wouldn't be just inside one vehicle, but the architecture would be extended across multiple ones. And the last one: we just submitted a paper, not yet accepted, about executing in a trustworthy way when we can't trust the execution environment. So let's flip the concept: the smart agent wouldn't trust the execution environment where it's running. We had some initial discussions with people who understand blockchain, and this might be an unsolvable problem, because the vehicle manufacturer might have some god-like power over your smart agent, but there are still ways to do this, so let's see what happens; stay tuned for that, it might have some interesting results. Last but not least, I would like to mention that my supervisor will have a talk tomorrow at 5 o'clock, or actually I will do it, because she twisted her ankle today. So if she recovers, check out her talk; otherwise, see me tomorrow as well. Thank you.

(Inaudible question about validation.) Which one? This one? Yeah, sorry, I cannot hear you. Because that would fail on any other vehicle; this is an architecture that has some kind of static analysis, so it doesn't really make sense, because each of them would... yeah. Well, the point is that everybody would implement this framework, and they would have the same kind of static analysis, so they could catch the same issue anyway. The question was why the static analysis is feeding into the aggregator. So the question is about how the aggregator gets protected from spamming, from false positives and false negatives. The results are actually not coming from the smart agent but from the simulator that runs the smart agent, so the smart agent cannot really spam it. Yeah, the aggregator gets a message from an existing component that's also under our control. No, no, no. Yeah, so the question was about track record: regarding the static analysis, we might get some interesting reputation about a certain vendor if its static analysis keeps failing. We weren't really looking into specific vendors and track records, but this is an interesting aspect, thank you. Yeah, go ahead, just please be louder, because I cannot hear you over the AC. So the question was about GPG and trust chains, whether this resembles what we have with GPG and trust chains. No, we were actually trying to get around this topic of trust chains and static kinds of trust; our approach is trying to be dynamic, and we tried to avoid and throw away everything that is interpreted as trust in software engineering and computing. Any more questions, please? So the question was about how to do countermeasures against such attacks. The trust aggregation and the calculation aren't really my research; my focus is more on this part on the left. Thank you for your attention.