Hello, everybody. Thank you for coming. This is the last slot of the day, so thank you for coming. OK, so I'm Guillaume Aquini, a software architect working for Thales. And I am Alexi Bona, and I'm also a software architect at Thales. We both work with our buddies here on this kind of product. This is the big picture of the product. As you can see at the top here, you have this data center, which is also on this table, and which you can also see at the Thales booth, KE34, right? Thanks to this onboard data center, which we call, surprisingly, ODC, we can run different kinds of workloads on board. Depending on the use case, it can be connected to different equipment, notably in the cabin: it could be a seat back, it could be access points serving personal electronic devices. It can be connected to the connectivity system, and that's pretty much it. This data center itself, as the edge part, is connected to the ground platform: either over SATCOM, or anywhere on ground where we have 5G or 4G connectivity. And this platform provides a portal for all the stakeholders, so airlines and third parties; as Thales, we also operate via this portal. When we started to work on this project, it was 2020, and we had these top four tasks to tackle. Obviously, the hardware is a big part. I'm glad I don't do hardware, because they had a great challenge to face: they have to comply with DO-160, which covers environmental constraints. Just to give you an idea, this whole box only consumes 300 watts, so technically it's no more than a hairdryer. And power is one thing, but cooling is another, and that is one of the biggest challenges as well: you have to be able to cool this thing, and sometimes temperatures grow very high, especially when you're in the Middle East on the tarmac. So it can be a big deal. Once you have the hardware, you bring the software on top. With my team, we were working especially on the lowest part of the software: starting from the hardware, bringing the OS on top of it, and being able to deploy the OS over the air. There are various topics; provisioning is part of it. And then we wanted to assemble the Kubernetes cluster on that. We obviously had a team working on the ground infrastructure; that's a big part, but one we won't cover today. And we have, of course, the services that we want to bring to this system, which can be fully on board, on ground, or end-to-end services. Alexi will show you how they managed to deploy those. So, first part: I'm talking about the operating system. We have something quite common: two types of operating system. One is called factory, because it's installed during the manufacturing stage, and it won't change over the product life. This one has a single role: provisioning the blade and being able to upgrade to the production OS. The production OS is the one that will assemble and bring Kubernetes up on board. In terms of design, we made this operating system layered, pretty much like a container, because it comes with great properties for us, especially when you work in avionics, where you have to be very strict on security. So we build the OS as layers, like you see on the right; this is an example. And you tie the size of the layers to their lifecycle: the ones that change often, you want to keep as thin as possible, so that you can deploy faster.
Like I said, in terms of security this is great, because you can easily verify the SHA256 of each layer every time, since each layer is ultimately like a snapshot: you can describe it as a single file. In addition, we have secure boot, so we can also verify authenticity: each time the system boots, we can verify every step of the OS down to the root file system. In terms of deployment, of course, it brings good properties, because you download only the layers that actually changed. And it's really flexible, because you can change this layout over time. Let's say there is a zero-day vulnerability and I have to deploy a fix to /bin/bash very, very fast. I don't have to rebuild my whole system; I just build a very thin layer that brings the fix to the bash binary, and I'm able to push it very quickly to every aircraft. This design also implements immutability. Take the case of the factory OS. You know every OS needs to modify files; it cannot be completely read-only, so there is always a read-write layer on top of the assembly. In the case of the factory OS, this layer is pure RAM, so it disappears each time you reboot: the factory OS is 100% immutable. For what we call the production OS, we have to commit files; notably, when we bootstrap Kubernetes, it comes with files that have to persist across reboots. So these files are committed to the flash memory and persist across reboots. That said, this comes with good properties as well, because when you audit all the files inside this read-write layer, you can really understand the way your operating system modifies its environment, which is very interesting. And it's also very easy to reset: you just wipe everything in this layer and start from scratch. This is a real use case: when you work in labs with a lot of colleagues, sometimes it's great to just wipe everything and start over; you don't have to spend time wondering whether you are the one who did something bad or somebody else. And finally, if you are able to build these kinds of layers, it's easy to describe them all the same way. In hindsight, we should probably have tried to stick to the OCI format, but in our case, this is the way we describe an operating system. You can see there is a kernel and an initramfs, which is the part that actually does the assembly, and then the three layers, in this case base FS, alarms, and modules. Then you have your pipelines that create all these layers; like I said, each one is a single file, in this case SquashFS files, so read-only. And you also have the CI/CD pipelines that create the manifest I've just shown you. All of this is pushed to an artifact repository, Artifactory in our case. And then how do we deploy? It's pretty easy. The operations team connects to the portal; they just have to select the OS version plus the aircraft they want to deploy to. It's a simplified view, but we have a service running on ground that simply takes this order, gets the manifest, and pushes it to the GitOps repository. On board the aircraft, we have a way to synchronize: each time connectivity is available, we synchronize with this GitOps repository, and a systemd service, in this case, gets the new manifest.
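To make that manifest idea concrete, here is a minimal sketch of what such an OS manifest could look like. The field names, file names, layer names, and hashes are illustrative assumptions, not the actual format used on the product:

  # Hypothetical OS manifest: one kernel, one initramfs that performs the
  # assembly, and an ordered list of read-only SquashFS layers, each pinned
  # by its SHA256 so the stack can be verified at every boot.
  os:
    version: "4.2.1"
    kernel:
      file: vmlinuz
      sha256: "3a7bd3e2360a3d29eea436fcfb7e44c735d117c42d1c1835420b6b9942dd4f1b"
    initramfs:
      file: initramfs.img
      sha256: "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08"
    layers:                      # assembled bottom-up; thin layers change often
      - name: basefs
        file: basefs.squashfs
        sha256: "2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae"
      - name: alarms
        file: alarms.squashfs
        sha256: "fcde2b2edba56bf408601fb721fe9b5c338d10ee429ea04fae5511b68fbf8fb9"
      - name: modules
        file: modules.squashfs
        sha256: "a665a45920422f9d417e4867efdc4fb8a04a1f3fff1fa07e998e86f7f7a27ae3"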
If needed, it will grab all the layers that are missing, drop everything into a dedicated partition, and just reboot. We've been working with this OS for around four years now, and so far we've always been able to install everything with just one reboot; an additional phase may come someday, but for now we've managed without it. Still about this OS, just a few remarks. When you do GitOps, if you want to roll back, you just create another commit that reverts the previous one, which means you ask to install the previous version. So we always keep the previous version available, and we play with symbolic links to point to either the new one or the previous one. We also have an onboard cache. Maybe we don't care about that here, but it's a way to reduce connectivity usage: we have one single node fetching everything and sharing it with the others. Okay, so now we have the OS up on each blade, and we will talk about how we bootstrap Kubernetes on that. A bit of requirements first. We have only five to six nodes available in the ODC, as you can see. There are more blades, but (you can come and see afterwards if you want) some are used for switching, some are power blades, and it turns out you end up with only six nodes available. So you cannot spread the roles the same way you would on ground, where you are able to split the control plane, the etcd cluster, and the workers. Another big, big constraint for us: we want this bootstrapping to be as independent and as automatic as possible. We cannot rely on the fact that one specific blade will be available and that we will be able to drive the process from that blade. We want it to be really bulletproof, because when operators come and replace some faulty blade, they won't cordon the cluster and do a graceful maintenance operation; they will just remove it, plug in another one, and you're done. We have to be able to support that use case. Our mandate is obviously to bootstrap Kubernetes, but we then have to deploy a couple of services: a CSI storage driver, because this box provides 96 terabytes of storage, so we need a way to share it across the cluster; a Docker registry (Alexi will tell you more about that later); a way, since we do GitOps, to encrypt secrets at rest when they're inside a Git repo; and Flux, the continuous deployment solution, which Alexi will also describe a bit more. So, about bootstrapping: we are back in 2019, we barely know anything about cloud technologies, and we have to start with that. We started to play with kubeadm, advised by people who were more experienced; at that moment it was Kubernetes 1.13. We played with Kubespray, which uses Ansible. We used Rancher, K3s, MicroK8s; I don't remember everything we tested. We obviously also got in touch with the cloud providers, looking for a real turnkey solution. All the solutions, including the cloud providers', were great, but in the case of the cloud providers, they come with specific hardware that would never fit our use case. So we ended up not being able to find the right solution. Notably, with the Ansible-based ones and a few others, most of the time there is a real sequence where you first create one node and then make the others join; that is not as independent as we wanted. Moreover, some of them use a pattern where one node will really connect to the others.
We also had a cloud provider solution where the ground was really performing the operations on the devices, and it didn't fit our use case either. So it turned out that the only solution able to do that off the shelf, even though I know this is not the most trendy technology right now, was Puppet. For those who don't know, Puppet is a declarative configuration management solution, which is a good thing, I believe. Its principle is pretty simple: each node runs a Puppet agent. Each agent collects local facts; if you dump the facts, there are quite a lot, and you can add custom facts if you need. Each agent publishes its facts to the server, and while doing that, it also requests a catalog that is compiled and tuned for that specific agent. The agents receive the catalog and then apply it. Going back to Kubernetes: Puppet comes with a Puppet module for Kubernetes. The way we implement this kind of independent bootstrap is that there is a first step, done beforehand, where you use a very simple setup file for this Puppet Kubernetes module, which is then processed by kubetool. This tool generates all the data you need to pre-share with all the nodes, so that they get this independence. I'm talking about, you know, Kubernetes works with a certificate authority: each node needs a secret key and a certificate to be able to communicate; this is mutual TLS authentication. So this is generated by kubetool, plus a manifest that describes the roles. If we look on the right: in our case, like I said, we have up to six nodes, but only five will be part of the control plane. You can see that nodes one to five will be controllers, or control plane, and node six will be a worker. You drop all that into the GitOps repository, and then Puppet does the job; I'll show a sketch of what that data can look like in a moment. One note: we do not use kubetool anymore today. We generate the same files, on the same principle, our own way, because we also have a lot of other secrets to manage, and so we automated that task differently. So, is this all magic? Obviously no. This is a timeline. On top we have node one. Let's say node one is starting: the kernel starts, then systemd starts, and after that systemd starts the Puppet agent. The Puppet agent itself, since we want to bootstrap Kubernetes, runs a kubeadm init command. But this node by itself cannot bootstrap Kubernetes, because, as I explained, there is this static phase where you create all these files: you have to specify your control plane. So let's say I have a five-node control plane: a single node cannot bootstrap it alone. You need the quorum, which is five divided by two, plus one, so you need at least three control-plane nodes up to make that work. Let's say the second node is coming. There is no particular reason for the delay; it's just to highlight the fact that we cannot rely on the boot order or the sequence or anything like that. Node two arrives, and finally node three is coming. And if you look, it is only at this precise moment that all the kubeadm init commands are running at the same time; it is only at that moment that the cluster gets assembled. And what if a fourth node and a fifth node come into the game? No problem: in this case, the control plane is already assembled, so it will be very fast for these nodes to join the cluster.
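To illustrate the pre-shared files mentioned above, here is a rough sketch in the spirit of the hiera data that kubetool generates for the puppetlabs/kubernetes module. The exact keys and values below are assumptions for illustration, not our production files:

  # Hypothetical hiera data in the spirit of the puppetlabs/kubernetes module.
  # kubetool pre-generates the shared CA material so that any node can join
  # on its own, with no node depending on another one being up first.

  # common.yaml, shared by all nodes (cluster identity and mutual TLS material)
  kubernetes::kubernetes_version: '1.13.0'
  kubernetes::cluster_name: 'odc'
  kubernetes::kubernetes_ca_crt: '<base64-encoded CA certificate>'
  kubernetes::kubernetes_ca_key: '<base64-encoded CA private key>'

  # nodes/node1.yaml through nodes/node5.yaml: control-plane role
  kubernetes::controller: true

  # nodes/node6.yaml: worker role
  kubernetes::worker: true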
Okay, so it's important to note that the Puppet agent keeps trying to reconcile periodically. But this is something we changed, because we want to save as much resource as possible, and having this reconciliation on a short period was not good for us. So we wanted to at least handle the return code of the Puppet agent and be able to know whether there was no change to apply at all, meaning everything was already set. In that case, we back off, from 30 seconds to five minutes, ten minutes, and up to 30 minutes. Of course, if at that moment a change arrives via GitOps, it executes immediately, to make sure it reconciles as soon as possible. There are a few other things we implemented that I won't cover, because I would run late. The changes we've made are not published yet, but publishing them is something we have in mind; we just have to proceed with that. And now Alexi will explain how they deploy services, now that I've done my job. Yeah, so now that we have a Kubernetes cluster up and running inside the aircraft, we need to deploy some workloads. To understand the challenge we had to go through, we actually need to understand the life cycle of an aircraft. This is what an aircraft does over a day, multiple times a day. It goes to an airport, CDG for example, maybe the airport you came through to get to KubeCon. It goes to the gate, and here it has access to some connectivity: 4G, 5G; some airports are also equipped with Wi-Fi. Then it leaves the gate, goes to the tarmac, and takes off. Here it is too high to have access to cellular connectivity, and too low to have access to satellite connectivity. Then it goes to cruise, above 30,000 feet; here it has access to satellite connectivity, for example. Then it lands at another airport. But unfortunately, at this airport there is no internet connection, because of geofencing, or roaming costs that are too expensive, for example. So what we can see here is that we cannot predict when the aircraft will have access to the internet, nor how it will have access to the internet. The aircraft doesn't have a static IP; we cannot just push software and configuration to the aircraft. It is up to the aircraft to pull its configuration and pull its software. So we're back in 2020, and what we realized pretty soon is that what we wanted to implement was the GitOps pattern, and we decided to install Flux inside the cluster to implement it. But when you install Flux in the cluster, Flux pulls the Git repository and creates the workloads: it creates the deployments and the pods, and the container runtime then pulls the containers from their registry. But here, we don't always have access to that registry, because the aircraft doesn't always have access to the internet. So what we realized pretty soon is that most of our pods were going into ImagePullBackOff, because the container runtime couldn't reach the registry. We needed to find a solution for that. So this is the step-by-step process we went through to make sure that when we install the workloads inside the cluster, the pods don't go into ImagePullBackOff, and all the images are available.
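For reference, the pull-based Flux setup described here boils down to two objects. This is a minimal sketch using Flux's standard APIs, with a made-up repository name and URL:

  # Minimal Flux sketch (hypothetical names and URL): the aircraft polls the
  # Git repository whenever connectivity allows and applies what it finds,
  # so no inbound access or static IP is ever required.
  apiVersion: source.toolkit.fluxcd.io/v1
  kind: GitRepository
  metadata:
    name: aircraft-config
    namespace: flux-system
  spec:
    interval: 5m                      # pull-based: retried whenever online
    url: https://git.example.com/fleet/aircraft-config.git
    ref:
      branch: main
  ---
  apiVersion: kustomize.toolkit.fluxcd.io/v1
  kind: Kustomization
  metadata:
    name: workloads
    namespace: flux-system
  spec:
    interval: 10m
    sourceRef:
      kind: GitRepository
      name: aircraft-config
    path: ./workloads
    prune: true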
The first option was pretty simple: just install the workloads and let the container runtime pull the containers. There was no extra implementation, but it was definitely not resilient to connectivity loss, nor to rescheduling: if one of the images was not available on one of the nodes, and we lost the node that had it, then the pod would be rescheduled to a node where the image was not available, and it would go into ImagePullBackOff. The second option was to pre-pull every image on every node. That way, every node has the images available, and it is resilient to connectivity loss, because every image is available everywhere. But it was very costly in terms of connectivity, because we had to pull every image as many times as we had nodes, and we were not really able to control the garbage collection: for example, if one of the images was not used for some time, it could be automatically garbage collected. So we needed to find a better way. We came up with a component that we called the image puller, whose goal was to pull images from our ground registry to the onboard registry that we had deployed. It was based on a manifest that we built, listing the images to pull, and it was stored on the same Git repository as the Flux resources. So it was very efficient: we were only pulling each image once, so it was cost efficient; every image was cached on board in that registry; and we could control the garbage collection, because this specific component could manage and delete the images that were no longer used. But we were facing some race conditions, with Flux installing the workloads at the same time as we were actually pulling the containers, because the manifest describing all the images to pull was on the same Git repository as the Flux resources. One possible fix was to disable Flux's automatic reconciliation, so that we could tell Flux: I have all the images, now you can start the reconciliation. But what we like about Flux is that it is always trying to make sure that the current state of the cluster matches the expected state of the cluster. So we still needed a better solution. Now, you could ask why it is so important to avoid ImagePullBackOff, because with the last solution I presented, the ImagePullBackOff was just a transitional state: while the component was pulling the image, the pod would wait, and then the image would be available, so it was just a short period of time. Well, there are three main reasons. The first one is that the avionics ARINC 667-2 guidelines actually recommend segregating the software download phase and the software installation phase, which was not the case here, because our component was pulling the images at the same time as Flux was installing the workloads. The second one is not to disturb the passenger experience: even with that transitional state, we could have some disruption of the passenger experience. The third one is the most important: we couldn't afford to install only a portion of the microservices and then lose connectivity for an hour, a day, or even more. So we really needed to make sure that all the images were cached and available on board before starting to install the workloads. So we came up with a custom operator called the image puller operator. The image puller operator is a custom Kubernetes operator whose goal is to synchronize container images from a source registry to a destination registry, and it's based on two custom resources: OCIRegistry and Image.
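Since the operator is not open source yet, what follows is only a guess at what the two custom resources could look like; the API group, field names, and values are all hypothetical:

  # Hypothetical shapes for the two custom resources. Everything below
  # (API group, fields, names) is illustrative, not the actual schema.
  apiVersion: imagepuller.example.com/v1alpha1
  kind: OCIRegistry
  metadata:
    name: platform-registry              # ground-side source registry
  spec:
    url: registry.ground.example.com
    credentialsSecretRef:
      name: platform-registry-creds
  ---
  apiVersion: imagepuller.example.com/v1alpha1
  kind: OCIRegistry
  metadata:
    name: onboard-registry               # destination registry inside the ODC
  spec:
    url: registry.onboard.local:5000
    credentialsSecretRef:
      name: onboard-registry-creds
  ---
  apiVersion: imagepuller.example.com/v1alpha1
  kind: Image
  metadata:
    name: example-service
  spec:
    image: example-service:1.4.2
    sourceRegistryRef:
      name: platform-registry
    destinationRegistryRef:
      name: onboard-registry
  status:
    conditions:                          # kstatus-style Ready condition that
      - type: Ready                      # Flux health checks can evaluate
        status: "True"
        reason: ImageSynchronized
        message: image pulled from source and pushed to onboard registry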
An OCIRegistry, as you can see here, describes a registry, with the URL and some credentials to access it. An Image describes an image to pull from a source registry to a destination registry. So, how does that work in real life? Let's say on the left we have our platform registry on the ground, with a container image to pull. And on the right we have our onboard Kubernetes cluster, with the onboard registry that we deployed using the operating system, and our image puller operator. We've defined our custom resources on the cluster: the platform registry with its associated credentials, which refers to our ground platform registry, and the onboard registry with its associated credentials. And then we've created our Image object, which references the platform registry as the source and the onboard registry as the destination. Then the Image reconciliation loop starts. First, the operator starts to reconcile the resource: it gets all the information related to this image, the source registry and its credentials, and the destination registry and its credentials. Then it starts pulling the container image, pushes it to the onboard registry, and updates the status. This last part, updating the status, is the most important one. If we take a look at the status, we're actually using the kstatus standard, which means that Flux will be able to understand whether this resource is healthy or not based on the Ready condition. So now that we have the operator, how do we actually make sure that all the images are pulled on board before letting Flux install the workloads? This is a simplified view of our onboard GitOps repository. We have one Kustomization that describes images and one Kustomization that describes releases. If we take a look at the images Kustomization, it installs all the resources defined in its folder, and as you can imagine, in this folder we define Image resources. And you can see that we're using the wait option to make sure that before this Kustomization is considered healthy, all its resources, meaning all the Images, need to be healthy, which means every image needs to have been pulled on board. Then we have our release Kustomization, which is a bunch of HelmReleases. And you can see here that this layer actually depends on the image layer, which means the release layer will not be reconciled unless the image layer is healthy; a sketch of these two Kustomizations follows below. So if we take a look back at our cluster: we still have our platform registry with some container images, our onboard registry, the image puller operator, and Flux. Flux will then create the Kustomizations, our image layer and our release layer. You can see that the release layer does not start being reconciled, because it depends on the image layer. The image layer then creates our Image objects, and our image puller operator starts to work: it reconciles, pulls, pushes, and updates the status. Nice, it's healthy. It does the same for the next Image: reconcile, pull, push the image, and update the status. Then every Image is healthy, so the image layer is healthy, which unlocks the release layer, and Flux then starts actually deploying the resources defined in this release layer.
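The sequencing itself relies on standard Flux Kustomization features, "wait" and "dependsOn". Here is a minimal sketch, with paths and names assumed:

  # Sketch of the two Flux Kustomizations (paths and names assumed).
  # "wait: true" gates health on every applied resource via kstatus, so the
  # images layer only becomes Ready once every Image CR reports Ready.
  apiVersion: kustomize.toolkit.fluxcd.io/v1
  kind: Kustomization
  metadata:
    name: images
    namespace: flux-system
  spec:
    interval: 10m
    sourceRef:
      kind: GitRepository
      name: aircraft-config
    path: ./images
    prune: true
    wait: true                   # healthy only when all Image CRs are Ready
  ---
  apiVersion: kustomize.toolkit.fluxcd.io/v1
  kind: Kustomization
  metadata:
    name: releases
    namespace: flux-system
  spec:
    interval: 10m
    sourceRef:
      kind: GitRepository
      name: aircraft-config
    path: ./releases
    prune: true
    dependsOn:
      - name: images             # HelmReleases wait until all images are cached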
The release layer will then create the HelmReleases, which will create the pods, and the pods will be able to pull their images from the onboard registry, and then everything is healthy. So thanks to that, we've been able to sequence the installation of the resources, making sure that every container image has been pulled on board before actually letting Flux install the workloads. I didn't dare to do a live demo, so I'm just going to show a recording. On the top left we have our Kustomizations. On the top right we have our custom Image resources. On the bottom left we have our HelmReleases, and on the bottom right we have our pods. Here we're creating our Kustomizations: we have our image layer and our release layer. You can see that the image layer starts first, which then creates our Images. Our image puller operator is now reconciling the resources; you can see that they're not ready yet, but it starts to reconcile, and it was very fast. Now all the images have been pulled on board: our image puller operator has pulled the containers from our ground registry, pushed them on board, and updated the status of the Images. And then you can see that the release layer will start to be reconciled. We need to wait a little; it's a video, so I can't speed it up. Yeah, here. The release layer starts to be reconciled at the same time, but you can see that its dependency on the image layer is not ready yet, so we don't have any HelmReleases currently installed, because the dependency is not fulfilled. If we wait a little bit, we'll see that the image layer becomes healthy, because every Image has been reconciled, and then the release layer starts, as its dependency on the image layer has been fulfilled. It creates the HelmReleases, the Helm controller installs them and creates the pods. And as you can see here, the HelmReleases are being reconciled, and the pods start without any ImagePullBackOff, because we've made sure that all the images were cached in the onboard registry. Thank you. It was a recording, so. A couple of takeaways from this presentation. What we've realized working on this project is that the CNCF has a huge landscape of off-the-shelf technologies you can use, but when it comes to edge computing, it's not always that easy to use them. Especially when you're working in an environment with partial connectivity and no remote or physical access, you have to deal with those different technologies and sometimes tweak them a little to make them work. The second takeaway is that, because we don't have remote or physical access to the cluster (it's flying inside the aircraft), every single action we take needs to be automated. We talked about bootstrapping the cluster and installing the software, but it goes way beyond that: every action, like database administration or self-healing, needs to be automated somehow, to make sure that everything we want on the cluster will be available. The operator we presented is not open source yet, but we plan on making it open source; if we do, it will be available on the Thales Group GitHub account. And if you want to see the product up close, it's at the Thales booth, KE34. Thank you.