Hi everybody, my name is Stefano and I would like to talk to you about containers. So containers are cool, right? Everybody loves containers. I, for one, do like containers, and I use them almost every day in my work. So let's see how we can make them work in embedded.

First, containers are a solution, a technology. What's the problem? Let's talk about the problem we're trying to solve, and I think it has two sides.

One side is packaging. We need a way to package an application for our target, a way to share an application across teams, for example. This packaging has some requirements: it needs to contain the binary of the application, but also its dependencies, and it needs to be in a format that makes the application easy to update. You want to be able to ship an application, potentially even dynamically, to your target embedded board or IoT device and then update it easily. And if we have more than one application, more than one container, running on our IoT device, we might want to handle their updates independently. That's one part of the problem.

The second part of the problem is actually running the application on the target. Now that we have found a way to install the apps on our target and manage their updates, we need a way to run them securely and in isolation, so that the apps don't interfere with one another.

These are actually two quite different problems. One idea is to solve them both with Docker, because Docker comes with a set of technologies that let us do both. Let's dig a bit more into what that actually means.

First of all, even within the container world, packaging and runtime are two different technologies covered by two different specifications: the OCI (Open Container Initiative) Image Specification and the OCI Runtime Specification. They are related, but independent. One covers the format in which you package the container. In fact, "container" is an overloaded word: sometimes you use it to mean the packaging, the app package as you download it from Docker Hub, and sometimes you say "I'm running this container", and then you're talking about the runtime environment. Those are two completely different things, two completely different sets of technologies.

In an example workflow, you use Docker and fetch a container from the Docker registry. At this point we are talking about the image format specification: you are fetching a tarball with one or more binaries inside, plus a manifest file, and it gets installed on your local machine. Again, this is all about packaging: the format in which you download it, untar it, and lay it out on the file system of your host. Once it's installed, you enter the problem of running it; usually it's run using Linux namespaces. In this talk I'm going to try to refer to the packaging separately from the runtime, and maybe avoid the confusing word "container", but that's very difficult, because "container" is such a nice word; it has a nice ring to it.
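By the way, you can look at the image-format half of that workflow for yourself. A minimal sketch, assuming Docker is installed and using nginx:latest purely as an example image:

```sh
# Pull an image, then inspect the tarball that "docker save" produces:
# the manifest plus layer tarballs are what the image specification covers.
docker pull nginx:latest
docker save nginx:latest -o nginx.tar
tar -tf nginx.tar                  # lists manifest.json, a config, and layer tarballs
tar -xOf nginx.tar manifest.json   # print the manifest itself to stdout
```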
Speaking of which, the Docker binary itself has been disaggregated into a number of smaller binaries, and the architecture of the project now matches the specifications. You have containerd, the daemon that deals with the packaging: downloading the packages from the hub and installing them on your local file system. Then there is a separate binary, in fact a separate project on GitHub, called runc, which is the default way of running your application. As a matter of fact, containerd forks and execs runc when it is time to run your application. runc uses Linux namespaces, but it is only one of the possible implementations; there are alternatives.

So what's good and what's bad about containers, that is, about the packaging and the runtime? The packaging, I think, is fantastic, and Docker comes with a very powerful tool to create these packages, as well as the hub to distribute them. There are very few downsides to this packaging format. Even in embedded, it's a great way to share the products of your teams between them, or just to deploy them on your target.

The problem is the runtime environment. Linux namespaces are okay for some use cases, but not for all. Why? Because Linux namespaces are based on a single kernel instance, on top of which all the applications run in separate namespaces. However, the syscall interface the kernel exposes to applications is quite large and difficult to secure. This calculation is a bit old, and it's genuinely hard to pin down the precise number of privilege escalation vulnerabilities, but the last time somebody counted, they counted about three per Linux release. This is not because there is a problem with the process or with the people; actually, I think that as a Linux community we do an excellent job with releases compared to other projects. It's just that the interface is so large that it's extremely difficult to make it bug-free. More lines of code means more bugs, and some of those bugs are going to be privilege escalations. So if, for any reason — say the application was provided by a third party — you deploy a malicious application on your target board, it's relatively easy for that application to find one of these vulnerabilities, break into the kernel, and from there basically take over the whole thing, which is definitely not what we want.

Now, this is a very well-known problem in the cloud as well. A lot of talks have been given on this topic, a lot of recommendations made, white papers written. I particularly like one white paper: it's a couple of years old, but very detailed and, I think, very well written. It's "Understanding and Hardening Linux Containers" by NCC Group. It's a long read, so take your time. They make many, many recommendations, including: running containers unprivileged, using SELinux, building custom kernel binaries as small as possible, applying disk and storage limits, using cgroups, dropping capabilities, using grsecurity patches, using seccomp, still doing a lot of monitoring, and using virtualization when possible.

The bottom line is that it is actually possible to secure Linux namespaces; it's just very, very difficult, and it requires the use of technologies such as SELinux and seccomp, which are not easy to use and often require detailed knowledge of what is running inside your container, your Linux namespace.
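To give a flavor of several of these recommendations in practice, here is a hedged sketch using plain docker run flags. The seccomp profile path is hypothetical, and alpine running sleep stands in for a real workload:

```sh
# A hardened "docker run": read-only root, no capabilities, no privilege
# escalation, a custom seccomp profile, and hard resource limits.
docker run --rm \
  --read-only --tmpfs /tmp \
  --cap-drop=ALL \
  --security-opt no-new-privileges \
  --security-opt seccomp=/etc/docker/my-profile.json \
  --memory=128m --cpus=0.5 --pids-limit=64 \
  alpine:latest sleep 60
```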
Well, monitoring is always a good thing, and still, at the end of the day, hardware virtualization is recommended when possible. Here's a simple test I ran just last week: if you start typing "Fedora how to disable", three of the top four auto-completions are about SELinux. You can try it for fun.

So, what's the solution in the cloud? We're talking about embedded here, but containers come from the cloud, and they've been dealing with this for a while now. What people do in the cloud is basically instantiate a different virtual machine for each owner: each team gets its own, or each person, each engineer. This way the virtual machine is still the thing that enforces security. VMs are still used for security and isolation, and Linux namespaces are only used as a convenient way to run your app; they are not really in charge of multi-tenancy, separating responsibilities, or isolating faults and malicious applications.

So what does this mean for our embedded use case? The first conclusion is that multi-tenancy is really not recommended. People in the cloud don't rely on namespaces for it, and we shouldn't on our small devices either. This is not such a terrible limitation, because in embedded it's not so common to get third-party applications to run on your board. It can happen, but it's not as common.

What's really bad, though, is the next point: mixed-criticality workloads are not supported. If you cannot trust the isolation enforcement, then you cannot really use Linux namespaces to run applications with different criticality levels. Let me explain what I mean. Very often in embedded, IoT, automotive, or the transportation industry, you have a very critical application and a far, far less critical application running on the same board. I can give a lot of examples.

One example is the infotainment system in a car. There is going to be something with access to the CAN bus, and then the UI. The CAN bus is where the brakes, the ignition, everything is connected; you really don't want to mess with it. And the infotainment system is usually something based on Android: pretty large, prone to crashing. What if it takes down the whole system, including the part that has access to the CAN bus? What if it breaks into whatever other application has access to the CAN bus? This is what happened in 2015 with the famous Jeep disaster, where two hackers, Charlie Miller and Chris Valasek, managed to break into a Jeep while it was driving on a highway, over the internet connection to Uconnect, the infotainment system. From there they broke into a nearby firmware, from which they could issue commands on the CAN bus and take over the whole thing.

There are plenty of other examples. Staying with infotainment, let's set aside the CAN bus, which is really not a good idea to expose to the infotainment system anyway, and talk about something less critical. When you're backing up a vehicle nowadays, the rear-view camera is nicely displayed on your screen. What if it crashes? This actually happened on a Volvo vehicle, I think about a year and a half ago: the rear-view camera crashed right as the driver was backing up, and he ran into a pole. It could have been a person. That's not just critical, it's life-critical, and it's just a rear-view camera.
So you don't want the complex Android system that is running your apps, maybe ones uploaded from your phone, to be able to crash the rear-view camera while you're backing up.

Leaving car infotainment aside, another example is a flying drone. On a flying drone you usually have two pieces nowadays. One component is the autopilot, which usually runs on a real-time operating system: really small, tightly written, few bugs. Alongside it, there is another component that compresses and streams the camera input back to the user, so the user can see on his device what the drone is seeing. Now, if the camera software crashes, the user gets a still image, but he can still land the drone safely. If the autopilot software gets compromised somehow, or simply doesn't get scheduled for a couple of seconds so that real-time requirements are not met, the drone can hit something, or somebody, and so on and so forth.

So mixed criticality is extremely common in embedded: something critical, sometimes life-critical, runs alongside something non-critical, far larger, and prone to crashing. What you really want is to keep the two as separate as possible, and the best way to do it, short of having two different boards, is to run them on a single board with virtualization.

Linux namespaces really don't support this, not even just the resource-limitation part: it's hard, with cgroups and related technologies, to enforce memory usage or CPU time with truly hard limits. The same goes for real-time support: it's difficult to offer real-time guarantees to applications that use Linux namespaces. There was actually a good talk about real-time Linux kernels, but those tend to be different releases, with different configurations, from the usually very large kernel you would use for Linux-namespace applications. And then there are certifications: sometimes you need to certify your software for safety, as in a car, where safety certifications are required, and the smaller the codebase, the better. A very large Linux kernel is very difficult to certify for that use case.

Here is an example I've heard, which I'll repeat because I think it explains the situation very well. If the purpose of your device is handling pictures of cats — if it's a pictures-of-cats handling system — then maybe Linux namespaces are more than enough. But if it's a system that handles money, money transactions, or has to do with liability, or again is life-critical, then you have to think very hard about what to use as your runtime environment.

So what is the consequence? The consequence — and this is a real-life example — is that people run containers inside VMs, even in embedded.
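A cleaner version of this idea, which we'll get to in a second with runV, is to swap the container runtime itself for one that starts VMs. Mechanically that's not exotic: Docker supports registering an alternative OCI runtime. A minimal sketch, with a hypothetical binary path:

```sh
# Register a VM-based OCI runtime with the Docker daemon, then select
# it per container. The runtime binary path is made up for illustration.
cat > /etc/docker/daemon.json <<'EOF'
{
  "runtimes": {
    "runv": { "path": "/usr/local/bin/runv" }
  }
}
EOF
systemctl restart docker
docker run --runtime=runv nginx   # same image as before, but executed inside a VM
```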
Here's a real-life example: this EPAM system is an infotainment solution for cars. It's based on Xen — these are the VMs running on the hypervisor. They run dom0, the first VM, kept as small as possible because it's usually more privileged than the other VMs. Then you have a driver domain: a domain with the graphics card, the sound card, and the input devices assigned to it. It's also in charge of rendering, and it shares these devices with the other VMs. Specifically, the one on the right is a classic infotainment Android VM, and it connects over a paravirtualized driver to the driver domain VM to do the rendering, audio, and things like that. The one in the middle is the interesting one for this talk: it's a "fusion" VM whose purpose is to download little telemetry plugins at runtime. I'm calling them plugins, but really they are containers: they are downloaded from the cloud and run as containers inside this VM.

As you can see, the model is very similar to the one used in the cloud: you use VMs for isolation and security, and containers as the way to run the applications. So I asked myself: is this really the best way to do it? Why can't we just use containers for everything? Maybe we can — at least the packaging format. What I'm suggesting is: if we go back to the architecture of Docker and the specifications — remember, the image format specification and the runtime specification — we can use the packaging for everything, since there is no downside to the packaging, and just change the runtime. Specifically, there is a project called runV — not the one I'm going to talk about in this talk, but one of the other efforts in this space — which is compatible with the runtime specification. So you could swap out runc and use runV instead, which uses hypervisors, including KVM/QEMU, to run your application.

So let's talk a little about virtualization as a container runtime. Virtualization offers a stronger form of security. It can do hardware partitioning: you can assign devices to different VMs and give them direct access to hardware. It supports multi-tenancy and mixed-criticality workloads. It supports componentization. You know when you have a whiteboard and you draw nice rectangles with what each team is supposed to be working on, and then at deployment time everything gets mashed together into a single system? That's not good, right? Componentization is the idea that you can actually deploy separate units that are strongly isolated, and you can use VMs for that. Resilience is the other point: if a VM crashes, or is compromised for some reason, or there is malware in it, only that one VM is compromised. Only one VM goes down, not the whole system. And real-time support is often easier.

Now, I'm talking about embedded hypervisors here, not cloud hypervisors, and that's a pretty big difference: in embedded we have very different requirements. One requirement is real-time support: many of our use cases have at least soft real-time requirements, if not hard ones. I don't know of a single hypervisor deployment in the cloud with any kind of real-time support, because it's just not very useful there. A small codebase is another: if you want to certify your system, the software at the highest privileged execution level — where the hypervisor runs — is what you really need to certify, and if it's large, that becomes a maybe-not-impossible but close-to-impossible task. So you want the codebase to be small. Short boot times are another: you sit in a car, you turn on the infotainment system, and it has to be immediate; you don't want to wait a couple of minutes. In the cloud, you usually boot the server once and then it's supposed to be 99.9999-and-more-nines available, so who cares if it takes five minutes to boot? Another example of very different requirements.
Device assignment and driver domains are also requirements. Going back to the earlier slide, you not only need to assign devices to a VM — in that case the graphics device was assigned to the driver domain — you also need to share them from there with the other VMs. The reason you want this is that if for any reason the graphics VM crashes, yes, it's bad for an infotainment system to lose control of the display, but at least you can still reboot it, still upgrade, still get crash information out of the system; you could theoretically reboot the graphics VM without rebooting the whole system. And coprocessor virtualization: we are used to thinking in terms of devices and virtualizing devices, but as you've probably noticed, in 2018 devices and processors are kind of merging. You have more and more complex devices, like GPUs, that really are coprocessors, and we need to be able to virtualize those as well.

So now I'm going to talk a bit about Xen on ARM, because it does meet all these requirements — Xen on x86 does too, but ARM, I guess, is more relevant for embedded. Xen is a hypervisor with a microkernel design. What does that mean? Xen runs at the highest privileged execution level, and then a lot of functionality can be deprivileged and run inside VMs. It has an extensive feature set: real-time support out of the box, device assignment on x86, ARM32, and ARM64, a wide range of supported hardware, and a lot of PV drivers to share devices such as network, block, console, graphics, and so on. Xen supports Kconfig, so it works the same way as the Linux kernel: you can cut the codebase down quite significantly, and for Xen on ARM it can easily go below 60,000 lines of code.

What else? It comes with real-time schedulers as well as pinning. If you really care about latency, what you want to do is fully dedicate a physical CPU core to one vCPU, so it gets 100% of the time. But you can also, if you want, run a scheduler with real-time properties, and Xen comes with two real-time schedulers to choose from.
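As a concrete sketch of what that looks like from the toolstack: the domain name below is hypothetical, and the second part assumes Xen was booted with the RTDS scheduler selected (e.g. sched=rtds on the Xen command line):

```sh
# Pin vCPU 0 of the domain "rt-domain" to physical CPU 3:
xl vcpu-pin rt-domain 0 3

# With RTDS active, give the domain a 10000us period and a 2500us
# budget, i.e. a guaranteed 25% of a CPU with real-time accounting:
xl sched-rtds -d rt-domain -p 10000 -b 2500
xl sched-rtds                  # show the current scheduling parameters
```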
Now, Xen on ARM, I think, is actually better than Xen on x86, because it was a clean design from scratch: there is no emulation, there is no QEMU — so a smaller attack surface — and there is only one type of guest. I've been saying this for years, but x86 is finally coming along: on the x86 side there is a new type of guest, called PVH, which is very similar to the type of guest on ARM. It does not require QEMU; it uses paravirtualized interfaces for I/O, which are fast, while at the same time using hardware virtualization. Something we don't have yet, but that would be really good to have, is a Kconfig option to build Xen with only PVH support on x86 — and I know it's coming.

Then there is the security process. This is really important, especially if you are the vendor of a system, a system integrator. When you choose an open source project, really look at the security process and the guarantees it gives you in terms of security and CVE handling. Some open source projects out there make no guarantees at all, and then it's up to you, as the downstream user and service provider, to provide security support on your own. Xen does have a security process — you can check it out online. There is a document clearly stating what happens when there is a security vulnerability, and it supports pre-disclosure: you can sign up to receive notifications of security vulnerabilities before they are public, so you can ready a fix.

Right. We got a bit sidetracked here — but does it run containers? That's the whole point. Yes, it can run containers. What I'm talking about here is not starting an Ubuntu VM, installing Docker in the VM, and then running containers there; that's not what I mean. What we want is to do "docker run" on the host of our small IoT device and have the application run as a VM, transparently. The way it's done is basically by adding a little Linux kernel, as small as possible, inside the VM, just to run this one application, so the overhead is as small as you can get. And then you can mix and match containerized applications with traditional ones; you can run QNX as one of the VMs alongside.

How do we do it? It sounds good on a slide, but how does it actually work? It can work in more than one way. First of all, remember that the packaging is actually pretty simple: once you fetch a container — docker pull, basically — you get it installed on your local machine, and what it really is, is just a file system bundle, just a rootfs. Converting a rootfs into a VM is really trivial. So if you know beforehand how many applications you want to run — maybe you're dealing with a small IoT device and you only have three containerized applications — you might as well convert them by hand; see the sketch below. Otherwise you can do it dynamically: there are a number of projects that let you do this transparently at runtime.

The one I'm going to talk about is rkt with a Xen stage1. I'll also mention another project, called Singularity, just to show that the same ideas are going around in very different spaces. Singularity is a project for HPC, high-performance computing. They have a similar issue: containers are great for packaging, but they have a multimillion-dollar server under their feet, and they really don't want to use Linux namespaces to run the applications, because far better, more powerful, and more efficient APIs are available to them. So they use Singularity, which handles a container as a package and then runs that package as a native, direct execution inside an HPC node: the best performance and the best packaging format.

Okay, rkt. You might have heard of rkt ("Rocket"): it's a project by CoreOS, an alternative to Docker. It's nice because it was designed from the start around the packaging-versus-runtime separation. The packaging is handled by a component called stage0 — I think the naming is meant to mimic real rockets — while the execution environment is created and managed by another component, called stage1. Roughly, the two map to containerd and runc in the Docker world. So what I did is write a stage1 based on Xen: when you do "rkt run", it instantiates a new VM instead of a set of Linux namespaces, but everything else is exactly the same.
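And here is the convert-by-hand option I mentioned, as a minimal sketch. It assumes you have a small guest kernel available; the paths, sizes, and kernel command line are made up for illustration:

```sh
# Flatten a container image into a rootfs directory:
mkdir rootfs
docker export "$(docker create nginx)" | tar -C rootfs -xf -

# Pack the rootfs into an ext4 disk image (-d populates it from a directory):
truncate -s 512M nginx-disk.img
mkfs.ext4 -d rootfs nginx-disk.img

# Minimal xl guest configuration, then boot it:
cat > nginx-vm.cfg <<'EOF'
name   = "nginx-vm"
kernel = "/boot/vmlinuz-guest"
extra  = "root=/dev/xvda rw"
memory = 256
vcpus  = 1
disk   = [ 'format=raw, vdev=xvda, access=rw, target=/root/nginx-disk.img' ]
EOF
xl create -c nginx-vm.cfg
```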
So how is this stage1 written? It's actually pretty simple. It's based on Bash, and it invokes the xl command to start VMs. It uses 9pfs to share the application's file system into the VM, and there is a bit of Go code to handle networking.

Speaking of networking: we are used to thinking of VMs as pretty large things made to run multiple copies of Windows XP on a PC, but the technology really doesn't have to be that way. If we know we are running containerized applications, we can take many, many shortcuts. For example, instead of virtualizing a network card as a network card, we can virtualize directly at the POSIX API level. All these applications are Linux apps, right? So to talk to the Internet they are going to create sockets: socket, accept, listen, bind, connect, recvmsg, sendmsg — and that's pretty much all the calls they are going to make. So why not virtualize at that level? It is possible, and that's what PV Calls is. It could work with any syscall, but it's made specifically for the networking syscalls; that's the initial set that has been implemented. This way, all syscalls go to the guest kernel as usual, but this small set gets forwarded to the host and runs on the host. That's how it's implemented: there is a frontend in Linux, and these few syscalls are sent to the host and executed there. As a consequence, when the VM opens a port — suppose you are running nginx as a container and it opens port 80 — with PV Calls, nginx actually opens port 80 on your host, because the request is sent to the host and the host executes it in the end. It's a great match for containers: rkt has a networking mode called "host", where a port opened inside the container is supposed to be opened on the host, and that's exactly what happens, so you can use PV Calls for that.

Right, I have only a minute left, so I'll go over this very quickly. Meltdown is the worst of the vulnerabilities disclosed at the beginning of the year. With Meltdown, an application can exploit the vulnerability to attack Linux; likewise, a kernel or an application inside a VM can use Meltdown to attack the hypervisor. However, kernels and hypervisors have been affected differently. Linux was affected on both x86 and ARM, and that has been fixed — great. But most hypervisors have actually not been affected. Why? Because Meltdown basically requires a shared address space: not quite entirely shared, but the kernel needs to be part of the application's address space, even if not accessible, because it's privileged. Most hypervisors nowadays use nested paging, so Meltdown simply doesn't apply. Specifically, Xen on ARM has not been affected; on x86, Xen's PVH and HVM modes have not been affected, and PV is the only mode that was affected.

So I ran some benchmarks, and the result is... I'm not telling you this to claim that running a benchmark inside a VM is now faster than native — let's be clear. I'm telling you this to say that running a benchmark, in this case compilebench, inside the VM is so close to native that it can sometimes even be faster now, especially with the Meltdown mitigations applied on the native side.
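As a side note (not from the talk itself): on recent Linux kernels you can check the mitigation status of the machine you're on directly from sysfs:

```sh
# One line per known CPU vulnerability, e.g.
# /sys/devices/system/cpu/vulnerabilities/meltdown:Mitigation: PTI
grep . /sys/devices/system/cpu/vulnerabilities/*
```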
So, the conclusion: containers are a great packaging format, and I would recommend using them across the board. Be careful about the runtime environment, though. Linux namespaces are suitable for some use cases — certainly pictures of cats, maybe others — but if you're dealing with a life-critical system, you certainly want to look at something else. Building such a system is still not trivial, so keep an eye on Linux Foundation and Xen Project announcements: there will probably be an announcement in the next two or three months about a new project to help you build safety-critical systems more easily, based on virtualization technologies.

Now, before I go to questions, I want to show you a small, quick demo. I'm going to SSH into... ah, the network works, so we have a chance of making this happen. This is an ARM64 system. Okay, I'll be honest: it's really not an embedded system, it's a Cavium ThunderX server. If we look at how many cores it has... I think it has 96. Yeah, so it's not an embedded system. But let's pretend it's a small ARM64 board with four processors or something.

What I want to show you is how transparent the whole process is. You can use rkt fetch to fetch the latest nginx container application from Docker Hub; the --insecure-options=image flag is required because there is still no signing on this application packaging yet. Then we can use rkt to run the application, transparently. Now, interestingly, it's a bit slow. It's slow because I'm using the out-of-the-box Go toolchain shipped with the latest Ubuntu LTS, which is now almost two years old and wasn't very optimized at that point; nowadays Go on ARM64 is far better. So this step takes a few seconds — Xen hasn't even been invoked yet, this is all Go. Still Go... and now it's Xen. Right: nginx is running and listening on port 80. I'm going to connect to the host. Oops, wrong command. You can see there is a VM running; you can also see, with rkt, that there is a container running. Both abstractions, completely seamless, and they work together. As a matter of fact, you could run some containers using Linux namespaces alongside, and it would be fine; you could also run some plain VMs alongside, and that would be okay too.

Now, here's what I think is cool — I hope you'll find it cool too. If I do this, what do you think will happen? If you are familiar with hypervisor technologies, this should not work: usually there is a bridge and the VM has its own IP address, so if you do "wget localhost" you are pretty certain you're not going to connect to the VM. But in this case the answer came back pretty quickly: "Welcome to nginx". Why did it work? Because I'm using PV Calls, so port 80 inside the VM has actually been opened on the host. I can show you: netstat, with some options, shows that port 80 is open right here. So again: think of VMs as a secure isolation technology, not as the old way to run multiple copies of Windows XP on your PC. I guess that's the one takeaway.
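For reference, the demo boils down to roughly these commands, reconstructed from the talk; the stage1 image path is hypothetical:

```sh
rkt fetch --insecure-options=image docker://nginx
rkt run --insecure-options=image \
        --stage1-path=/usr/lib/rkt/stage1-xen.aci docker://nginx &

xl list                      # the container shows up as a Xen domain
rkt list                     # ...and as a running rkt pod
wget -qO- http://localhost   # works: PV Calls opened port 80 on the host
netstat -tlnp | grep ':80'   # confirms who owns the listening socket
```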
Right — questions.

[Audience question, inaudible.] So, what I want to do next is port the Xen stage1 to the OCI runtime interface, which should actually be pretty simple, because the two are not identical, but they are similar. That's what I intend to do. Other questions?

[Audience question, inaudible.] A minimum in terms of RAM utilization, or number of physical CPUs? Minimum in what? So, there is this potential new project I was telling you about, to run containers with virtualization, which might be announced in the near future, and we are considering supporting the Pine64. The Pine64 is Raspberry Pi class. The problem with the Raspberry Pi is that, unfortunately, it's not standard-compliant. If you gave me an example of another ARM32 board exactly like the Raspberry Pi, same price, but with a regular, standard ARM interrupt controller, then I would reply: yes, I think you could probably use it for this. Unfortunately, the Raspberry Pi went its own way and used a non-standard interrupt controller that doesn't support virtualization, so you cannot run Xen on it, or any other hypervisor I am aware of. There are a number of ARM32 boards you could use, and more than the processor, I would look at the RAM constraints: I think you really need at least one gigabyte, while the CPU power is more than enough on Cortex-A15s, for example. Or contact me — my email is up there — so if you have any questions or want some pointers, send me an email.

[Audience question about the slides.] They are up already, actually. I was late, I admit, but I uploaded them right before entering this room, and I'm going to post them on Twitter shortly as well.

More questions? No? In that case, thank you for listening, and have a good rest of the day. Thank you.