Bonjour, bienvenue à Paris. Je ne suis pas d'ici, mais s'il y a quelqu'un qui préfère poser des questions en français, ça va très bien. Good morning, welcome to Paris. As I said, I'm not from here, but I can take questions in French or English — Swedish too, if anybody prefers that. My name is Erik Nordmark. I'm the CTO and co-founder at ZEDEDA, a company based out of San Jose, California, with offices in Berlin and Bangalore, among other places. I also sit on the LF Edge technical advisory committee, and I've been working on things in the edge computing space for about six years now.

So without further ado — how many people have been to one of these Edge Days before? Can you please raise your hand if you're familiar with Edge Day? Okay, not very many.

Different people mean different things when they say edge computing. What is the edge? A couple of years back, LF Edge put together a white paper that tries to show a linear progression across different attributes: who owns it, where is it deployed, what type of hardware is it? Is it more embedded computing, like what you find in your washing machine, or is it the cloud at the other end, centralized data centers? What I'm here to talk about is the area in between. I like to refer to it as the distributed edge. It's things that are more spread out physically. These are devices that are capable — they typically run Linux, by and large — and they have enough spare capacity that you can deploy different applications, as opposed to the constrained edge, which might be a thermostat that's built to be a thermostat and nothing else, or a pump controller in a factory: more specialized, single-function devices. So that's the context.

But how is this edge different from what you're used to? Well, people scale Kubernetes, but they scale it to 10 clusters with 10,000 devices. Here you have to flip that around: it's 10,000 clusters, and each one of them might have one device, or three, or five. You can think of them as sitting out on — these are some trucks sitting out at oil wells doing analytics. Each one of those trucks can be a Kubernetes cluster.

The other thing that's key is that they're remote. And what does remote mean? Well, this is a solar farm. It's not one of the ones where we run stuff, but it's similar in that it's very remote. You have very limited or intermittent connectivity. You have no IT staff. Even out on those trucks, you don't have IT staff — you have staff, people who know how to operate an oil well in that case, but they don't know anything about how to reconfigure a firewall or anything else in that domain. And in some of these cases there's nobody there at all; it can take hours or days to get to some of these sites. So that's the edge.

And people, by and large, see that — okay, we're deploying containers and Kubernetes in the data center, how do we bring that to the edge? To this edge. Because they want that fluidity, being able to deploy things rapidly to add new functionality, et cetera. So that's the challenge we're trying to help them with in LF Edge in general.

So what does it mean when you have this scale of 10,000 clusters? Well, it means that even if the network were perfect, there would always be some of them that are powered off.
A common model is that people configure things and install all the software in some factory or staging area, and then they put the devices on a plane or a boat to somewhere in the world where they will get deployed. During that time they're powered off. In other places, the machine this thing sits in on a factory floor is powered off at the end of the shift, so the industrial PC sitting inside it is powered off as well. So you can't actually assume that things are accessible.

Plus, in any big network — my background has been around the Internet and TCP/IP, and you can blame me for parts of IPv6 if you want — the Internet is always broken. You don't think it is, but there are always some parts you can't reach from somewhere. At this scale, if you're sitting in some central location trying to push out updates, there is always some fraction of these devices that might be powered on but that you can't actually reach right now. Try a minute later, or a day later, and you can. This means that you need to treat the clusters as cattle, not pets — not just the individual workers or nodes in those clusters, but the clusters themselves.

What does the remoteness bring to the picture? Well, it means that you will now actually see physical attacks. People can physically attack these things if they're high value: approach them, try to plug things in, steal them, steal the disk, copy the disk. Yes, in many places people have fences, cameras, et cetera, but there's no staff there in many places, so if the police get sent out, they will show up hours later and the server will already be gone.

And the network — we were surprised when people said, oh, we have satellite backup when we're remote, but we don't want to use it because it's expensive. So basically you want to orchestrate this stuff just like you do in the cloud, where you say, I want to push updates; but those updates are not going to get there until the job at that site is done, three days or a week later. You still want the same flexibility of pushing the update, but you won't get any feedback and it won't happen right away.

There are other things: even in the factory there are various security architectures that mean you have multiple layers of proxies and static IP addresses, because they're not used to dynamic things like the DHCP that everybody else runs. They're used to nailing everything down. So you need a different way of configuring things as well, in terms of being able to get out through those layers of proxies with the proxy certificates, et cetera. So it's different. It's all the same technology that we're used to, but the environment is actually quite different.
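To make that proxy point a bit more concrete, here is a minimal sketch of what reaching a central controller from behind a locked-down factory network can involve: a static address, an HTTPS proxy to traverse, and the proxy's CA certificate that has to be trusted before the TLS connection can even be established. The configuration format and field names are purely illustrative — this is not EVE's actual schema — and the `requests` usage is standard Python.

```python
import requests

# Illustrative per-device network seed configuration (not EVE's real schema):
# with no DHCP, the static address, the proxy to traverse, and the proxy's
# CA certificate all have to be known before first contact with the controller.
DEVICE_NET_CONFIG = {
    "interface": "eth0",
    "static_ip": "10.20.30.40/24",
    "gateway": "10.20.30.1",
    "https_proxy": "http://proxy.factory.example:3128",
    "proxy_ca_bundle": "/config/factory-proxy-ca.pem",   # shipped with the seed config
    "controller_url": "https://controller.example.com/api/v2/ping",
}

def check_controller_reachability(cfg: dict) -> bool:
    """Try to reach the controller through the mandated proxy, trusting its CA."""
    try:
        resp = requests.get(
            cfg["controller_url"],
            proxies={"https": cfg["https_proxy"]},
            verify=cfg["proxy_ca_bundle"],   # TLS-intercepting proxies re-sign with this CA
            timeout=30,
        )
        return resp.status_code == 200
    except requests.RequestException:
        # No connectivity right now -- at this scale that is normal; try again later.
        return False

if __name__ == "__main__":
    print("controller reachable:", check_controller_reachability(DEVICE_NET_CONFIG))
```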
Plus, there's the usual set of things: it's still the application that matters. By and large we're here to talk about infrastructure — the Kubernetes infrastructure, all of the building blocks around it. I have a background in operating systems and hypervisors and all of that. It's all infrastructure, but it's actually the application that delivers the value. What's different here is that many of these established, hundred-year-old companies have been building software for thirty years. They have virtual machine images that might be the bread and butter of their business. They want to continue running those as well as being able to run containers and Kubernetes and move forward. So it's actually key to be able to support a range of ages of applications.

In the cloud, auto-scaling — being able to always go find some more capacity somewhere — is an important thing. At the edge, that might be less of an issue, because you might only have three servers sitting out there on that truck, so you can't just get more hardware to help you. It also means you need to do updates slightly differently to make sure the service stays up.

The other thing we have observed working with people in this domain is that, yes, you want the benefits of the CI/CD pipeline — you can, in theory, push things out — but you actually want to review things. You want a gate where you say: I can put this in my canary system, but I'm not going to give it to all of my customers right now, because I need to review the actual changes. You don't want to affect the system that's running out there in the factory, making things turn every day. (See the sketch after this paragraph.)

And last but not least: when we started this, coming from an IT background, you think about security. When you start talking to users in this space, the notion of safety is just as important, if not more important, depending on who you talk to. You shouldn't injure somebody or cause explosions — the actual physical plant around electricity, valves, steam, explosives. This is an important thing, and a different mindset.

I can tell stories about each one of those cases, but the particular one to touch on a bit more today is an automotive use case, which at some level is simple and at other levels quite complex. Basically, everybody has more firmware and software in cars, and with EVs it's increasing even more. When you bring a car in for service, one of the key things that happens is that it talks to a computer — it's done that for years — but now it's actually updating large amounts of firmware and software. How do you actually do this? There are new cybersecurity requirements around it, because they don't want viruses getting into the car, particularly cars that are more autonomous, as well as a goal of reducing the time to service. In many cases it's so slow — it runs over the CAN bus in the car — that it can take hours to update software just because of the speed of that bus, and that reduces the capacity of the dealer to do service. And they want to control this centrally, just like everybody here is familiar with, by describing the policies and what you actually want to deploy, plus other things that help with debugging, et cetera. So that's the challenge.

So what did they actually end up doing? They ended up using a centralized orchestration system running in the cloud that can talk to all of these things. All of the servers sitting out at the dealers connect back to this infrastructure. They can do this by drop-shipping the hardware from one location where the initial software gets installed — which I'll talk about in a minute — and then the actual applications, the containers, get deployed after that, along with the whole lifecycle management of updating the software, et cetera.
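The review gate in front of a fleet-wide rollout can be expressed as a very small piece of orchestration logic. This is a rough sketch of that pattern under stated assumptions — the deploy and health-check helpers are hypothetical placeholders, not any particular product's API:

```python
import time

# Hypothetical fleet, split into a small canary ring and the rest.
CANARY_SITES = ["dealer-001", "dealer-002"]
PRODUCTION_SITES = [f"dealer-{i:03d}" for i in range(3, 101)]

def deploy(site: str, version: str) -> None:
    """Placeholder for pushing a desired application version to one site."""
    print(f"requesting {version} on {site}")

def healthy(site: str) -> bool:
    """Placeholder health check; in practice this would query reported state."""
    return True

def human_approved(version: str) -> bool:
    """The manual gate the talk describes: someone reviews the actual change."""
    return input(f"promote {version} to all sites? [y/N] ").strip().lower() == "y"

def gated_rollout(version: str) -> None:
    # 1. Canary ring first.
    for site in CANARY_SITES:
        deploy(site, version)
    time.sleep(60)  # soak time; real deployments would wait much longer
    if not all(healthy(s) for s in CANARY_SITES):
        print("canary unhealthy, stopping rollout")
        return
    # 2. Explicit human review gate before touching the production fleet.
    if not human_approved(version):
        print("rollout held for review")
        return
    # 3. Fleet-wide; unreachable sites simply pick the change up later.
    for site in PRODUCTION_SITES:
        deploy(site, version)

if __name__ == "__main__":
    gated_rollout("analytics:1.4.2")
```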
In this case they're dealing with the fact that they have tens of thousands of clusters — each one of these dealers is running a single-node Kubernetes cluster today. They deal with that by leveraging Terraform, so they can describe how each and every dealer is going to look; in every country it's slightly different, et cetera. And they're running this with K3s. We have the pieces where security is key, but also the ability to provide connectivity out to these sites.

So what does this actually mean when we drill down into a bit more detail? This picture might look a bit strange, because people are used to thinking about one unit of hardware going with some software. These teal-colored ones are devices running out at the edge. I've seen this in the server software industry as well as in the networking industry: you start building something — an Ethernet switch, an IP router from Cisco — and then later you say, oh, but I need some centralized control, I'm going to add a controller. Same thing in the server and data center business: you add that afterwards. Here we said no — to be able to operate at this scale, the controller is required. It's not an optional add-on afterwards, so let's put it in from the beginning. So all of these 10,000 devices call home to the controller, with a strong sense of trust between them, so that everything — not just the containers and pods you deploy, but also the device configuration and the operating system — is centrally controlled. You have standard immutable images, immutable operating systems, so you avoid configuration drift by design. That's the starting point where we said this needs to be different. And because of the intermittent connectivity, the communication over that API has to be quite robust — quite different from what you can get away with in the cloud or in a well-managed data center.

So how do you make some headway? We did this by introducing Project EVE in LF Edge at the Linux Foundation, which is roughly built like this. It's actually a somewhat old slide because it still has k3OS on it, but fundamentally there is a set of components in EVE that runs a set of microservices to do this remote management and orchestration over an API, and then it runs a hypervisor — one can run different ones — and then it can run different workloads. It can run Windows virtual machines, which is some of the legacy that people need to run; it can run Docker containers; it can run Kubernetes runtimes, et cetera. One of the little things that's always on this picture is the Trusted Platform Module, for security; that's a key piece in the hardware. And for people familiar with server virtualization, the different hardware here — TPUs and whatever — those are all resources that get assigned to the applications. So it gives you the kind of flexibility we saw in data centers a decade or so ago with server virtualization, where you can decouple the workloads from the underlying hardware. But there is a lot more direct access to hardware in these cases, because there are also dedicated sensors, different networks, et cetera, that show up out here.
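Since everything from the OS image to the application instances is driven declaratively from the controller, it may help to picture what a single device's desired state could contain. This is a purely illustrative Python sketch — EVE's real device/controller API is a protobuf schema, and these names are assumptions — meant only to show the categories of information that flow over that channel:

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative only: mirrors the categories of state described in the talk
# (base OS image, network config, application instances, hardware passthrough).

@dataclass
class NetworkConfig:
    interface: str
    dhcp: bool = True
    static_ip: str = ""            # used when dhcp is False
    https_proxy: str = ""          # e.g. a factory proxy to traverse

@dataclass
class AppInstance:
    name: str
    kind: str                      # "vm", "container", "k8s-runtime"
    image: str
    passthrough: List[str] = field(default_factory=list)  # e.g. ["usb0", "gpu0"]

@dataclass
class DeviceDesiredState:
    device_id: str
    base_os_version: str           # immutable image, updated as a whole (A/B)
    networks: List[NetworkConfig]
    apps: List[AppInstance]

desired = DeviceDesiredState(
    device_id="dealer-042-node-1",
    base_os_version="eve-os-example-9.4.0",
    networks=[NetworkConfig("eth0", dhcp=False, static_ip="10.20.30.40/24",
                            https_proxy="http://proxy.example:3128")],
    apps=[
        AppInstance("legacy-diagnostics", "vm", "windows-diag:2021.3"),
        AppInstance("k3s", "k8s-runtime", "k3s:v1.28", passthrough=["usb0"]),
    ],
)
print(desired)
```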
So, connectivity in a bit more detail. When you think about the network in the cloud or in the data center, it's the first two bullets here: you connect to one Ethernet — one virtual Ethernet, whatever the cloud provider gives you — and implicitly that's one security group. As you get out into the field, you'll find that there might be a separate shop-floor network, the one that connects to the machines, potentially hierarchical; that's a separate network. And then you might have static IP addresses, HTTPS proxies; you might want fallback, intermittent connectivity, an air gap in some cases. Okay, that's fine — I can configure that, I can deploy it. But what happens when I want to change this configuration? The fundamental premise is that these devices must always be remotely manageable. So how do you prevent yourself from cutting off the branch you're sitting on when you make these configuration changes? What we did in Project EVE is A/B network configuration — basically the same way that when you deploy immutable operating systems and update them, you have A/B partitions or A/B images, so you can test one before you commit to it. We do the same thing with the networking configuration.

In addition, because of static IPs, proxies, et cetera, you might need to seed this configuration. The simple case is you plug the device in, like I'd plug in my laptop here if there were anything on the cable; it gets an IP address and can reach out to the internet over HTTPS. In these environments that might not be possible, so you might need to seed it with some initial configuration, like a static IP address. Well, these devices don't have keyboards and screens, and even if they did, there wouldn't be a user trained on how to configure that. So the simplest approaches we came up with are that you either build that into the image you deploy — you get a unique image per device that contains its static IP address — or you have a separate USB stick that carries that signed seed configuration from day one.

The third piece that's key on the network side, I think, is that since you're going to have a large volume of data potentially coming from these devices, simply because of the number of devices, it's key that the interaction across this API uses eventual consistency. It's all declarative, which we know, but you can also compress state in the other direction. If you want to send notifications, you need to think about that not as sending messages but as eventual state replication. The state of this application is: it's downloading the image and the containers, it's now booting, it's now running, et cetera. If you can't send those first messages, do you care? No — you actually care about the eventual state. So if the network is out, you don't bother queuing up all of those messages. There are patterns here that I think one needs, that optimize for things running at the edge.
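A minimal sketch of that "replicate the latest state, don't queue messages" idea, assuming a hypothetical `report_to_controller` transport standing in for the real API:

```python
import threading
import time

def report_to_controller(device_id: str, state: dict) -> bool:
    """Hypothetical transport; returns False when the controller is unreachable."""
    return False  # stub -- pretend the network is down right now

class StateReporter:
    """Keep only the latest state per application and push it when we can.

    There is deliberately no outbound message queue: if connectivity is down
    for an hour, the controller only needs the state as it is *now*, not every
    intermediate transition we failed to send in between.
    """

    def __init__(self, device_id: str, interval_s: float = 30.0):
        self.device_id = device_id
        self.interval_s = interval_s
        self._latest: dict = {}      # app name -> latest known state
        self._version = 0            # bumped on every local change
        self._sent_version = 0       # last version the controller acknowledged
        self._lock = threading.Lock()

    def update(self, app: str, state: str) -> None:
        """Record a state transition; unsent older transitions are overwritten."""
        with self._lock:
            self._latest[app] = state
            self._version += 1

    def run_forever(self) -> None:
        while True:
            with self._lock:
                snapshot, version = dict(self._latest), self._version
            if version != self._sent_version:
                if report_to_controller(self.device_id, snapshot):
                    self._sent_version = version
            time.sleep(self.interval_s)
```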
Security. When you think about security, you have to think about the threat analysis, and you also have to think about where the value is that relates to that threat. Here we're deploying some infrastructure software, and I assert that it has zero value — in this case it's open source, but even if it weren't, it doesn't have much value. It's either the application that these users deploy, which contains their secret sauce, the things they want to keep secret, or it's the data that they collect and operate on. That data can be very valuable. It depends what the value is of knowing the state of an oil or gas well — how much remains, in aggregate, over a country. There are elements of national security in that, as far as I understand: how much oil does this country still have? So you clearly want to think about that as the thing you want to protect; you don't want to have to protect the OS.

So what type of attacks can you see? The usual classical ones. Someone has physical access: they can plug in a USB device of some form, they can steal the device, they can update the BIOS with something that tries to help them exfiltrate data out of the system. You can have poor passwords, so now people can break in. And there's a question about who you're going to trust. There is a common pattern in enterprise networks where you have enterprise proxies that provide additional filtering in the name of security, so that proprietary information, credit card numbers, in the US social security numbers, don't leak outside. They do this by breaking end-to-end security; it's known as a TLS man-in-the-middle proxy. Are you going to trust the administrator who has access to that system or not? Because they can now see as well as modify data. If you've ever worked for a company where you had to onboard your smartphone and add something to the keychain, that was definitely one of these cases — one of these enterprise proxies. It's not that common, but it's something you need to be concerned about.

So how do people solve this? The classical answer — as with the networking stuff in general — is that this has all been solved, right? Yes, but it hasn't been solved under these constraints. The usual answer is: you do secure boot, which has been around for decades, and you do full disk encryption; it's running on this laptop, so problem solved. Well, secure boot is based on things being signed. How do you invalidate old versions at scale? This was signed by whoever developed the software — was it me, was it Microsoft, doesn't matter — and they said it was good a year ago, a month ago. Now it has a CVE and needs to be replaced. How do I tell these 10,000 devices not to trust that thing anymore? In theory it's possible; in practice, I don't know. And for full disk encryption, where do you store the key? On this laptop I type a password when I power it on, or use a fingerprint reader or something. You can't do that with these devices, which are unattended and need to come up on their own — and, again, run the applications. That's the key: when they boot, they need to run the application, because that application is running the analytics for this solar farm or whatever. So that's a challenge.

So what did we end up doing? Something different, using a slightly different set of tools. First of all, block all of the connectors — through software — saying the USB ports are there if the applications want to use them, but they don't give you any access to the underlying EVE-OS.
As for poor passwords — well, there aren't any passwords, because there aren't any users. The users actually sit up there talking to the controller. Did I show that in the picture? The users are up there at the left; they're not down at the device. That's a key thing, so that you can have full control and manage your roles, et cetera, in one place in the network.

And then the rest is leveraging standard technology that isn't very widely used, called measured boot and remote attestation, with some additional pieces. This is a bit technical, but there is support for sealing the keys for the application volumes under the measurements of the whole boot chain that booted this device — from the hardware to the BIOS to the OS — so that as long as the boot is exactly the same bits, those bits can access the application volumes. If I change those bits, I need to go ask somebody: is this okay? That's the remote attestation step. And with this you now have flexibility. You can go out in the field and update the BIOS through whatever manual mechanism you have; you get different measurements; you just have to have someone sitting centrally who signs off and says that's okay, because that's the new version of the BIOS we're rolling out. When you do that, you need network connectivity. When you just lose power and everything comes back up, you do not. That's the key contribution from this work that I think makes sense in most environments.

So what does this look like in terms of deploying Kubernetes here? This is the same picture with slightly different graphics. What people do is integrate from this controller on the left to something like Rancher or OpenShift or other Kubernetes controllers, which then communicate down to the Kubernetes cluster running on one or more devices out at the edge. This means you have one piece that manages the hardware, the network configuration, and the operating system, as well as the set of Kubernetes runtimes and updating those, and you have another layer on top that manages the workloads that get deployed.

But there's another piece when you deploy this at scale: how do you make sure these pieces actually come together? One of the things we built as a pattern is a way of bootstrapping this securely, because you want to do it out at the edge. There's a way of seeding it using the kubeconfig file. Because we have this trust in place, we can have a little script inside that VM that pushes the kubeconfig file to the same place where you pick up your cloud-init file — it's basically a local HTTPS POST. We can then securely carry that across, so that as part of powering on a device, it automatically shows up in Rancher or OpenShift. You just need to designate whom this device belongs to, which is the onboarding step for the actual hardware.
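A sketch of what that little script inside the cluster VM might look like. The kubeconfig path is where K3s writes it by default; the local metadata endpoint shown here is an assumption for illustration, not EVE's documented URL:

```python
import time
import requests

# K3s writes its admin kubeconfig here by default.
KUBECONFIG_PATH = "/etc/rancher/k3s/k3s.yaml"

# Assumed local metadata endpoint, reachable only from inside the app/VM --
# the same channel that cloud-init data is served over. Illustrative URL only.
METADATA_POST_URL = "http://169.254.169.254/eve/v1/kubeconfig"

def push_kubeconfig() -> bool:
    """POST the freshly generated kubeconfig to the local platform endpoint,
    which can relay it over the already-trusted device/controller channel so
    the cluster gets registered in Rancher/OpenShift with no manual steps."""
    try:
        with open(KUBECONFIG_PATH, "rb") as f:
            resp = requests.post(METADATA_POST_URL, data=f.read(), timeout=10)
        return resp.ok
    except (OSError, requests.RequestException):
        return False

if __name__ == "__main__":
    # Retry until K3s has come up and the platform has accepted the file.
    while not push_kubeconfig():
        time.sleep(30)
```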
In many cases these pictures are actually more complex. The companies that care about security might say, well, maybe there's a little firewall there, but we actually want one that our IT department approved — so we're going to get a commercially approved, certified firewall virtual machine to run here. Well, who controls that one? In some cases it's a separate security service running in the cloud somewhere. And it might be SD-WAN for connectivity, et cetera. How do you put this together? It's the same thing again: you need to be able to bootstrap this at scale, where you can tie things together securely. So there's another pattern used in some of these cases, where a piece of software running on the device can use the same cloud-init communication path to say: can you please prove that I'm running on this device? And the underlying infrastructure here — EVE-OS — says, okay, I will sign that using the hardware TPM that I have. Now you can pass that to your controller, and that controller can go back and check with the controller for the hardware, for the device itself, and ask: is this actually true? Is this device actually running at this site? And it can do other tests as well — do I have licenses and whatever, if this is commercial software? These patterns for securely bootstrapping at scale are actually key.

This might be a bit — yeah, we're actually out of time; I can cover this separately. But the same type of pattern shows up even if you're deploying AI using Kubernetes, or deploying something else. You typically have some controller that worries about, in this case, the models — not necessarily the workloads per se — and you need to be able to glue this together. It's exactly the same pattern; the components are going to be slightly different. That's the key thing we see here.

So in summary, if you're going to deploy at this edge, I think it's useful to think about what underlying software infrastructure you need. Having something that's immutable, with a strong foundation of security, is key in some places; there might be others where it isn't, because you're running in a locked room in the back of every retail location. Then you need to figure out which Kubernetes distro makes sense: how much space do you have, what set of features do you need? We've chosen K3s in the deployments we've done with our users. And then think through what it means to roll things out when this actually involves rolling out hardware, so that you can do it with zero touch — bringing up and onboarding the hardware as well as onboarding or creating the clusters. That was the example I gave.

But we're not done yet. If there are people in this community who want to work on this: is Terraform the best way of doing this, or OpenTofu, or whatever? Are there other approaches that make it as easy to bring up and configure clusters in a uniform way? Are they going to be identical? That's a good question — in some cases they're not, because they might differ across domains, countries, whatever. But that might be an interesting thing to work on. And the other piece: if you actually want to interact with these Kubernetes controllers running in 10,000 places using kubectl, well, as I hinted, that's going to fail a bunch of times, because some of those devices are powered off and you don't know which. Is there interest in looking at doing that through some sort of queuing proxy, a twin, where it says: this thing accepted these requests, but they might not have been sent to the actual place or taken effect yet? Similar to the device twin concept, applied here.
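A rough sketch of what such a queuing proxy or cluster twin might do: accept declarative manifests on behalf of a cluster that may be offline, acknowledge them immediately, and flush them when the device next checks in. Everything here (names, the apply hook) is hypothetical, just to illustrate the idea:

```python
from collections import defaultdict

class ClusterTwin:
    """Accepts desired manifests per cluster even while the cluster is offline.

    kubectl-style requests get an immediate 'accepted' answer; the proxy holds
    the desired state and applies it when the edge cluster reconnects.
    """

    def __init__(self):
        # cluster id -> {manifest name -> manifest body}; last write wins,
        # so a long outage does not build up an unbounded queue.
        self.desired = defaultdict(dict)
        self.online = set()

    def submit(self, cluster_id: str, name: str, manifest: str) -> str:
        self.desired[cluster_id][name] = manifest
        if cluster_id in self.online:
            self._apply(cluster_id)
            return "applied"
        return "accepted (cluster offline, will apply on next check-in)"

    def check_in(self, cluster_id: str) -> None:
        """Called when an edge cluster (re)connects."""
        self.online.add(cluster_id)
        self._apply(cluster_id)

    def _apply(self, cluster_id: str) -> None:
        for name, manifest in self.desired[cluster_id].items():
            # Hypothetical hook: a real system would push this through the
            # device's management channel to the cluster's API server.
            print(f"[{cluster_id}] applying {name} ({len(manifest)} bytes)")

twin = ClusterTwin()
print(twin.submit("dealer-042", "analytics.yaml", "apiVersion: apps/v1\n..."))
twin.check_in("dealer-042")
```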
So those were the words I had to share — sorry for running a little bit over. Any questions, briefly? There's someone with a microphone running around out there. There's a question — I can't see anything from up here.

You mentioned mTLS for trust, and at the same time the corporate man-in-the-middle proxies that break open the TLS connection. I'm a bit lost on how exactly you deal with those.

Yeah, the "m" in mTLS should be in quotes. Conceptually it's mTLS — and we actually started with mTLS, then we ran into this issue and said, oh, we can't do that. So instead it's TLS where, rather than being strictly mutual, we trust any of the hundred or so root CAs you have — or 120, whatever it is today. But on top of that we have object signing and object encryption of the protobuf payloads that go over this connection — selective object encryption — and that is tied to a root certificate that's unique for this deployment, so you know it was actually issued by, say, that car dealer's root CA. So you can't feed anything in: at the TLS level the proxy can get access, but it can't see things like the cloud-init file contents or any other credentials that float around — those are object-encrypted.

Okay, thank you.

One more question. You mentioned local threats, because people have access to the hardware running at the local site. Did you consider something like confidential computing, for example, to shrink the trust boundary, or intrusion detection systems that check for anomalous behavior in the systems? Measured boot is quite a good start for sure, but I think it can be improved.

Definitely. Confidential computing is something we've been looking at — figuring out when that hardware will show up in the devices people run out at the edge, and what it takes to enable it. So the work on confidential containers, et cetera, is very interesting. We haven't done that work yet, because we haven't seen the hardware show up in these places yet. In terms of other physical attacks: yes, there are plenty of things you can find about how hardware can be subverted if you have access to the supply chains in various places — it's a bit scary what you can do to a motherboard that you can't detect with X-rays and visual inspection. But if you don't want to go that far, there's the notion of intrusion detection switches and other things. How do you build this? What do you assume about the attacker — how much time and resources do they have? There are the simple switches in a chassis in a server, right?
I don't know how hard those are to subvert. There are military-grade versions based on air pressure, where the chassis is pressurized inside, and if the air pressure changes it sounds an alarm. But yes, this is something some users actually care about, because they think these threats are real — that someone can steal one of these boxes, have time to tinker with it, destroy it, and then figure out whether they can do something with a second one. But we're doing software, so we're trying to figure out what we can help enable here, and being able to integrate those kinds of mechanisms into the system, into the measurements — that's the avenue we're on.

Hello, thanks for your talk. Could you share a little bit about how you build the measured boot and attestation? Are you using a unified kernel image, or how does this work?

No, we basically have a standard Linux image with some tweaks. On Arm we needed some tweaks to make sure everything actually passes through in terms of the measurements, as well as the TPM event log, so that we can get that out. But there's nothing else that's special, other than that you can then get these measurements out of Linux and ask the TPM to sign a quote based on them, or get the event log out of Linux and then have it sign the quote. It follows the standard pattern from the Trusted Computing Group — they don't actually specify protocols for doing this, so the protocol we have takes that concept and casts it in protobuf encoding. That's the part that's unique here.

Thank you. Can you share a bit about the tax you're paying when you put a cluster on the edge device? In the end you have the scheduler and those types of things — what is the compute tax?

The way people are deploying this today, the control plane is running in a virtual machine, so there is some overhead from that. I don't think we've measured it in detail, but if the workload isn't storage-intensive, I think it's less than 5% that you'd pay for it. Over time it might make sense to run the Kubernetes control plane inside EVE-OS itself, in the host, and then have the workers be either on the same device, in VMs, or separate. That's something we're mapping out right now — what makes sense in terms of having it be predictable as well as flexible.

Another question: in the case where you want to deploy more than one application on the edge device, can you share how to create the network scheme that allows cross-application communication?

Sure, I can do that, but I'm getting the signal that we're way over time, so I should pass it on to the next speaker. Let's talk offline. Thank you.